Tuesday, December 8, 2009

Data Cleaning with Stata

During the course of my research, I come across a lot of messy data.

This includes everything from human errors like misspelled words, data entry mistakes, or inconsistent variable/value labels to machine/software issues like weird characters in the data, strange data shapes from external programs, or data embedded in something like HTML.

Stata has a lot of powerful tools to help automate cleaning data, one of my favorites is -filefilter-, but when all else fails, sometimes you've just got to sit down and clean data by hand (or get a research assistant to do it for you).
For example, when you are trying to join data from multiple data sources that label their records with similar, but unsystematically different labeling conventions, it can be tricky to get the data to link up.  A recent statalist posting asked about this issue in particular (note: I'll try to tackle some of these other 'data cleaning' areas in future posts).

The question asked how to merge datasets with similar, but irreconcilable record identifiers, like:

Dataset2                                         dataset2
68th precinct youth council inc                  68th precinct youth council, inc.
action center for education and community        action center for education and
amistad child day care and family center inc     amistad early childhood 


In my reply, I suggested using -reclink- to help reconcile the two lists.  In fact, since then I've started using it more frequently for a lot of my datasets.  The important thing to realize is that even when the identifiers in the two lists are pretty unique, there will be errors.  That is, -reclink- will be quite accurate in finding matches in the example above because the identifiers are close (usually only a few characters off) and unique (they don't closely resemble other identifiers).  However, if you had to reconcile the lists below, you would have less success:

dataset1               dataset2
sal                      sally
salamander               salaman
salad s                  salads
salute                   salu
sally johnson            sally johns
sally johnston           sally j. johnston


Something to note in the lists above is that "sal" in dataset1 would match all the identifiers in dataset2, and "sally johns" in dataset2 would match both of the last names in dataset1.
So, -reclink- is best used only as a timesaver in helping you to flag the records that might be a good match.

No comments:

Post a Comment