Saturday, April 30, 2011

A new SSC package to convert numbers to text (-num2words-)

-num2words- has been posted to the SSC Archives.  It is a Stata module to convert numbers to text.  It can convert integers, fractional numbers, and ordinal numbers (e.g., 8 to 8th).  The idea for this program originated from a LaTeX report I was creating that had some code that wrote the text version of numbers into sentences, including writing the proper case text for a number if it started a sentence.  So, the LaTeX file (written via -texdoc- from SSC) had some code like:
****texdoc example
sum x, meanonly
loc totalN "`=_N'"
loc pct1  "`=myvar[1]'"
loc totalN "`r(N)'"
if `totalN'>`lastN' loc change1 "increase"
****texdoc text written:
tex  `totalN' respondents took the survey this month.
tex  There was a `pct1' percent `change1' in respondents who reported using incentive payment dollars....and so on

where the macros are defined as:
`totalN' - the total number of relevant respondents (so, loc totalN "`=_N'")
`pct1' - the calculated percent respondents from the dataset (so, loc pct1  "`=myvar[1]'")
`change1' - substitutes the word "increase" in if `pct1' increased from the last survey wave, "decrease" for a decrease and "equal to" if it was the same.
I created -num2words- to ease the process of converting many variables--like those underlying the macros `totalN' and `pct1'--to words/text, so it can change "212 respondents" to "Two-hundred and twelve respondents" or  "25.5 percent" to "twenty-five and 5 tenths percent" (or it can be truncated to just "twenty-five percent" with the "round" option in -num2words-) in the narrative.
You can also automatically change numbers to words for insertion in table or figure titles, notes, etc.
******************fig 1 example
set obs 10
g x = round(runiform()*100, .05)
num2words x, g(x_rounded) round
egen mx = mean(x)
num2words mx, round
gr bar x  , over(x2_ordinal, sort(1)) ///
note({bf: X for Obs 2 is `=x_rounded[2]'}) ///
text(60 20 `"Mean = `=mx2'"',  box )
**********************fig 1 example
Fig. 1

You can get -num2words- from the SSC Archives [1][2]

Visualization of my iPhone tracking data

Last week, the internets were flooded with panic about the iPhone storing location data in a sqlite DB.  The DB (called consolidated.db) contains longitude, latitude, altitude, accuracy, and timestamp information for near by wifi hotspots and GPS locations (when maps apps are used).  You can take a look at the data stored on you by your iPhone by using the iPhonetracker app (for Mac OSX) or if you've got a Windows machine you can find the conslidated.db file and load it into a database software.
I used the program to visualize the data on my recently purchased iPhone 4 (I had a 3G until March, so I've only got 1 month of data to visualize--which in a way makes it easier to see how much tracking is really going on since I can easily recall where I've traveled in the past month).  Here is the graph that iPhonetracker shows:

The advantage of this tracker app is that it visualizes the movement over time.  However, it's difficult to see the locations on the map and since it's using a heat map to display the frequency that each location was in my DB -- that is, it's hard to see some of the smaller dots on the maps.  As you can see, I spent most of my time since March in College Station and Waco, TX.  The dark purple spot in CS is near my building on campus.  
There are a couple of reasons that, after looking at this data, I am not too concerned about the tracking capability.  
First, when examining the timestamp data and my location, it is off by a few hours and even a full day in several cases -- such as the days I traveled to Waco. 
Also, notice that the tracking did not pickup any locations during the drive to Waco (I guess that's because I didn't use it for GPS directions and there probably aren't many wifi locations between College Station and Waco).  
Finally, the tracking it not very precise. There are lots of small dots on the map that show locations I've never been, including the dots far outside of Waco near Gatesville and Hico.  Also, when I zoom in on the map and examine the records in the DB, it puts in lots of places in the College Station area that I haven't traveled to in the last month. My only guess is that the blips are due to wifi hotspots or cell towers that my cell phone connected to but have either 1) a really long range or 2) somehow have inaccurate longitude/latitude entries.
  If you don't want to use or you don't have a Mac, you can extract the data yourself by first finding the "consolidated.db" file in your iPhone backups maintained by iTunes (assuming they are unencrypted or you have unencrypted them). An easy way to find this file is to run this python script with the command:
"~//" | grep "consolidated"
this will tell you the name of the file containing your tracking DB.  Open this file in your favorite sqlite software.  In Mac OSX you can open/import this from terminal with the commands:
>sqlite3 "yourfilenamehere"
>CREATE TABLE CellLocation (MCC INTEGER, MNC INTEGER, LAC INTEGER, CI INTEGER, Timestamp FLOAT, Latitude FLOAT, Longitude FLOAT, HorizontalAccuracy FLOAT, Altitude FLOAT, VerticalAccuracy FLOAT, Speed FLOAT, Course FLOAT, Confidence INTEGER, PRIMARY KEY (MCC, MNC, LAC, CI));
then export the table to csv or whatever format you'd like.  Next, I uploaded this data to google maps and created the map below:

This google map shows the erroneous entries points more clearly.  Thinking that this might have to do with the "horizontal accuracy" column in the DB, I examined the raw data for these erroneous locations, but "horizontal accuracy" was about the same as all the entries, so it wasn't an accuracy problem.  Regardless, I'm not too worried about Apple or someone else using this data for no-good.

Update is looking for donated iPhone tracking data so that they can make visualizations like this one.  They'd better hurry and gather the data they need before the next iOS update that makes this tracking file more difficult to locate/access.

Tuesday, April 26, 2011

-writeinput- available from SSC

-writeinput- was recently posted to the SSC Archive[1][2].  I've written a bit about it before.  My announcement of this program focuses on using this program for creating a self-contained dataset example or snippet that can help other Statalist posters understand your question or response, but I've found the same is true for transmitting do-file examples to coworkers and students. It's very much in the spirit of Statalist FAQ.
Many times a simple data example would prevent a lot of confusion, but creating one isn't always convenient.  Sometimes users can post relevant data examples using one of Stata's canned datasets or by building fake data through a series of -generate- commands.  Some Statalist posters will simply try to explain their data, which sometimes causes confusion and varying interpretations of the problem and the data structure. Other users might copy and paste a snippet of the data, but wrapping can be a problem. Plus, others have to add double quotes to string variable values with embedded spaces in order to get the data into Stata to respond to the thread.
If you really want to get your point across, writing an -input- example is a good option.  However, this can be time consuming.  You'll want to put quotes around string variable values, and formats for variables you define.  -writeinput- helps automate this process by letting you create a -input- statement for the entire dataset or a selected subset of your data.
Finally, since the do-file editor in Stata 11 was improved to accommodate very large files, you can now save example datasets (or entire datasets -- though I wouldn't suggest it) along with do-file code.  -writeinput- makes all of this easier.
Here's an example:
//install writeinput from SSC//
cap which writeinput
if _rc ssc install writeinput, replace
sysuse auto, clear
writeinput make mpg price for in 1/5 using "", repl
writeinput make mpg price for if for==0 in 20/60 using "", ///
    replace n(Here's some notes)
writeinput make if for==1 & pri>200 in 1/50 using "", ///
    replace n(write some notes here)
type ""

More on label wrapping and -statplot-: Adding N's to your figures

While using -statplot- in the real world, we came across a situation where we needed to place the N's for sub-groups in certain value or variable labels.
For these figures, the N's change over sub-datasets and when the survey data is updated with each wave, so actually writing something like "(N = 100)" into each value or variable label or graph title is repetitive.  These figures are a heavy on the information side (they'd surely be an easy target for junkcharts for many reasons), but the real version of these figures use less N's than the examples below and they are made to mirror the output produced by Ian Watson's -tabout- (from SSC).  
Here's a strategy to add some N's to graphs automatically & wrap these labels with N's.  This example follows from the examples presented in earlier posts about -statplot- here and here.
***********************************!Create Data Example
sysuse nlsw88, clear
lab var grade "Really Long Variable Label for the Variable GRADE that will cutoff at 80 chars"
lab var tenure "Another Really Long Var Label for the Variable TENURE that will cutoff at 80 chars"
lab var wage "Long Variable Label, this time for the Variable WAGE that will cutoff at 80 chars"
 d grade tenure wage
 replace wage = . if race==3
***********************************!Create Data Example

Figure 14 shows how to add automatic wrapping and Ns to variable labels (watch for wrapping -- download entire do-file with link at the bottom of this page):
loc vars grade tenure wage
loc ll 25 //sets the length of the labels
loc j = 1
foreach u of local vars {
**calc N**
qui count if !mi(`u')  
loc nn `r(N)'
di "`j'"
loc len = length("`:var l `u''")
if `len' > `ll' {
loc pieces "`=round(`len'/`ll')'"
forval p = 1/`pieces' {
loc p`p' : piece `p' `ll' of "`:var l `u''", nobreak
loc relabeling`j' `" `relabeling`j''  `"`p`p'' "'   "'
loc relabeling `" `relabeling'  `j'`"`relabeling`j'' "(N=`nn')" "' "'
loc `j++'
di "`relabeling'"
loc totalN = _N
statplot `vars',  ///
    name(g2, replace)  over(race) graphregion(margin(vlarge)) ///
    varopts( relabel( `relabeling' )) ///
    title("Wrapping Long Variable Labels - Automatically") ///
     note(Total N = `totalN')
graph export "fig14.png", as(png) replace

Monday, April 25, 2011

-obsdiff- available from SSC

-obsdiff- is a Stata module to find differences in a variable across records/observations.  It's ideal for finding the differences between rows that are near-duplicates.  This is usually the result of data that have been merged or joined in a way that created duplicates.  The solution may be to remove the extra record or reshape to move the extra observation to a new column (as is the case with Var10 below).

A quick example:

*******************watch for wrapping:
inp    var1 str9 var2 var3 var4 str9(var5 var6) var7 str9 var8 var9 str9(var10 var11)
1 "a" 1 2 "c" "s" 3 "d" 5 "AA" "z"
1 "a" 1 2 "c" "s" 3 "d" 5 "BB" "z"
1 "a" 1 2 "c" "s" 3 "d" 5 "CC" "z"
2 "a" 1 2 "c" "s" 3 "d" 5 "CC" "z"
obsdiff var1 var2 , r(1/2)
obsdiff , all
obsdiff, r(1/4)
**var10 is different across records
*-- we'll reshape to stack it wide across columns
bys var1: g j = _n
reshape wide var10, i(var1) j(j)

The output is just the -list- output for the values and rows that are different within each variable.  Since I haven't figured a way to put this all into one nice table yet, the output can get a bit unwieldy when you're examining many rows and many variables.  One solution is to use the "using" option to send the log to an external file for examination.

Sunday, April 17, 2011

Data cleaning with Google Refine

There's a lot to be said about the data and text cleaning abilities of programs like R [1] [2] and Stata [3] [4] [5].  But when it comes to cleaning up data with lots of spelling errors, different forms of the same string, abbreviations, acronyms, etc - or - if you've got to task a student worker who's skill set barely includes M$ Excel, then Google Refine (it used to be called Freebase gridworks) is a great tool for cleaning data.
Here's the  Google Code page and below is a video on it's data cleaning tools.  Google Refine can also transform data and access external data (like JSON data) from other websites, but I've found it most useful for data cleaning.  

Monday, April 11, 2011

LaTeX Short Course Material Available

For those interested in LaTeX, I've posted the presentation, handouts, and related materials from a recent short course (taught by Emily Naiser and myself) on getting started with LaTeX for both Windows and Mac OSX.  We're teaching it again this summer.