-obsdiff- is a Stata module that finds differences in a variable across records (observations). It's ideal for finding the differences between rows that are near-duplicates, which usually arise when data have been merged or joined in a way that created duplicate records. The solution may be to remove the extra record, or to reshape the data so that the extra observation moves to a new column (as is done with var10 below).
A quick example:
*******************watch for wrapping:
clear
inp var1 str9 var2 var3 var4 str9(var5 var6) var7 str9 var8 var9 str9(var10 var11)
1 "a" 1 2 "c" "s" 3 "d" 5 "AA" "z"
1 "a" 1 2 "c" "s" 3 "d" 5 "BB" "z"
1 "a" 1 2 "c" "s" 3 "d" 5 "CC" "z"
2 "a" 1 2 "c" "s" 3 "d" 5 "CC" "z"
end
obsdiff var1 var2 , r(1/2)
obsdiff , all
obsdiff, r(1/4)
** var10 differs across the records that share var1 == 1
*-- so we'll reshape it wide, spreading its values across new columns (var101, var102, var103)
bys var1: g j = _n
reshape wide var10, i(var1) j(j)
*******************
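The other fix mentioned above, removing the extra records, might look something like the sketch below. It uses only built-in Stata commands and is meant to be run in place of the two reshape lines (-bys var1: g j = _n- and -reshape wide-), not after them. The rule for which record to keep (the lowest var10 value within each var1) is purely an assumption for illustration; in real data you'd pick whatever rule fits.

```stata
* Alternative to reshaping: keep a single record per var1.
* ASSUMPTION: the record with the lowest var10 within each var1 is the one to keep.
* Sorting on var10 inside the by-group makes the choice reproducible.
bys var1 (var10): keep if _n == 1

* var1 should now uniquely identify the records; -isid- errors out if it does not.
isid var1
list var1 var10
```

Note that this throws away the other var10 values ("BB" and "CC" for var1 == 1), so it only makes sense when those extra values really are unwanted.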
The output is simply the -list- output for the values and rows that differ within each variable. I haven't yet figured out a way to put all of this into one nice table, so the output can get a bit unwieldy when you're examining many rows and many variables. One solution is to use the "using" option to send the log to an external file for examination.
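If you'd rather not rely on the "using" option (its exact syntax isn't shown in this post), a similar result can be had with Stata's built-in -log- command wrapped around the call. This is only a sketch: the file name obsdiff_check.log and the log name obsdiffcheck are made up for illustration, and it should be run on the example data before the reshape step.

```stata
* Send the obsdiff output to a plain-text log file using built-in Stata logging.
* "obsdiff_check.log" and the log name "obsdiffcheck" are hypothetical names.
log using "obsdiff_check.log", text replace name(obsdiffcheck)

* Same call as in the example above.
obsdiff, all

* Close only the named log, leaving any other open logs untouched.
log close obsdiffcheck
```

The resulting text file can then be opened in any editor, which is easier to scan than the Results window when there are many rows and variables.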