Monday, April 25, 2011

-obsdiff- available from SSC

-obsdiff- is a Stata module that finds differences in a variable across records/observations.  It's ideal for pinpointing how rows that are near-duplicates differ from one another.  Near-duplicates are usually the result of data that have been merged or joined in a way that created extra records.  The solution may be to remove the extra record or to reshape so that the differing value moves to a new column (as is the case with var10 below).
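Since the module is hosted on SSC, it should be installable from within Stata using the standard -ssc- command.  A minimal sketch, assuming a working internet connection:

```stata
* install the module from SSC (requires an internet connection)
ssc install obsdiff
* open the help file to check the syntax and options
help obsdiff
```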

A quick example:

*******************watch for wrapping:
clear
inp    var1 str9 var2 var3 var4 str9(var5 var6) var7 str9 var8 var9 str9(var10 var11)
1 "a" 1 2 "c" "s" 3 "d" 5 "AA" "z"
1 "a" 1 2 "c" "s" 3 "d" 5 "BB" "z"
1 "a" 1 2 "c" "s" 3 "d" 5 "CC" "z"
2 "a" 1 2 "c" "s" 3 "d" 5 "CC" "z"
end
obsdiff var1 var2 , r(1/2)
obsdiff , all
obsdiff, r(1/4)
**var10 is different across records
*-- we'll reshape to stack it wide across columns
bys var1: g j = _n
reshape wide var10, i(var1) j(j)

The output is just the -list- output for the values and rows that differ within each variable.  Since I haven't figured out a way to put this all into one nice table yet, the output can get a bit unwieldy when you're examining many rows and many variables.  One solution is to use the "using" option to send the log to an external file for examination.
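The exact syntax of the "using" option is in the -obsdiff- help file and isn't reproduced here.  As an alternative, the output can be captured with Stata's built-in -log- command.  This is only a sketch: the file name obsdiff_output.txt, the log name obsdiff_log, and the two-row dataset are made up for illustration.

```stata
* Sketch only: capture -obsdiff- output with Stata's built-in -log- command.
* The file name, log name, and tiny dataset below are hypothetical.
clear
input var1 str9 var10
1 "AA"
1 "BB"
end
* a named log avoids clashing with any log that is already open
log using "obsdiff_output.txt", text replace name(obsdiff_log)
obsdiff , all
log close obsdiff_log
```

The resulting text file can then be opened in any editor, which is easier to scan than the Results window when many variables differ.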
