During the course of my research, I come across a lot of messy data.
This includes everything from human errors like misspelled words, data entry mistakes, or inconsistent variable/value labels to machine/software issues like weird characters in the data, strange data shapes from external programs, or data embedded in something like HTML.
Stata has a lot of powerful tools to help automate cleaning data, one of my favorites is -filefilter-, but when all else fails, sometimes you've just got to sit down and clean data by hand (or get a research assistant to do it for you).
For example, when you are trying to join data from multiple data sources that label their records with similar, but unsystematically different labeling conventions, it can be tricky to get the data to link up. A recent statalist posting asked about this issue in particular (note: I'll try to tackle some of these other 'data cleaning' areas in future posts).
The question asked how to merge datasets with similar, but irreconcilable record identifiers, like:
68th precinct youth council inc 68th precinct youth council, inc.
action center for education and community action center for education and
amistad child day care and family center inc amistad early childhood
In my reply, I suggested using -reclink- to help reconcile the two lists. In fact, since then I've started using it more frequently for a lot of my datasets. The important thing to realize is that even when the identifiers in the two lists are pretty unique, there will be errors. That is, -reclink- will be quite accurate in finding matches in the example above because the identifiers are close (usually only a few characters off) and unique (they don't closely resemble other identifiers). However, if you had to reconcile the lists below, you would have less success:
salad s salads
sally johnson sally johns
sally johnston sally j. johnston
Something to note in the lists above is that "sal" in dataset1 would match all the identifiers in dataset2, and "sally johns" in dataset2 would match both of the last names in dataset1.
So, -reclink- is best used only as a timesaver in helping you to flag the records that might be a good match.