Showing posts from 2011

Unabbreviate Macro Lists in Stata

This Statalist thread from a few months ago, started by Nick Mosely, asked about working with hundreds of macros and eventually got onto the topic of expanding or unabbreviating macro lists (see -help unab- for the varlist version of this idea).  Based on my posts in that thread, I recently posted -mac_unab- to the SSC Archives to help with this problem. -mac_unab- is still a bit of a kludge, but I haven't figured out a better approach (nor did anyone suggest one).  The biggest issues with -mac_unab-, which I hope to find better solutions for, include:
1.  When you run -mac_unab-, it prints all the output of the -macro list- command in the Results window.  This might be desirable for some, but I'd like to be able to toggle it on/off.  Currently, I gather the macros via a log, so there's no way to avoid printing the -macro list- output each time -mac_unab- is run.
2.  Currently, the program will only match macros with the patt

For Computers, understanding natural language is sometimes hard ...

This paper by Chloé Kiddon and Yuriy Brun at U of Washington describes a Bayes classifier that can be used to find accidental double entendres or "potential innuendos" (called "That's what she said" or TWSS jokes) in sentences.  Here's the Ruby script to run this classifier to identify so-called "low brow comedy" (their words, not mine) in natural, human language. Hopefully, this foreshadows the great things we can expect from our computers' auto-complete functionality in the near future. This article from Wired on detecting humor with computer software is also relevant.  Andrew Gelman, a Bayesian scholar and co-author of the great zombie survey paper ^, linked to this article in his blog after I recently mentioned it to him.
___
^ This paper contains a Technical Note, describing the authors' rationale for using LaTeX, that is one of my all-time favorite quotes:  "We originally wrote this article in Word, but then we

A new SSC package to convert numbers to text (-num2words-)

-num2words- has been posted to the SSC Archives.  It is a Stata module to convert numbers to text.  It can convert integers, fractional numbers, and ordinal numbers (e.g., 8 to 8th).  The idea for this program originated from a LaTeX report I was creating that had some code that wrote the text version of numbers into sentences, including writing the proper case text for a number if it started a sentence.  So, the LaTeX file (written via -texdoc- from SSC) had some code like:
****texdoc example
sum x, meanonly
loc totalN "`=_N'"
loc pct1  "`=myvar[1]'"
loc totalN "`r(N)'"
if `totalN'>`lastN' loc change1 "increase"
****texdoc text written:
tex  `totalN' respondents took the survey this month.
tex  There was a `pct1' percent `change1' in respondents who reported using incentive payment dollars.
...and so on
****
where the macros are defined as: `totalN' - the total number of relevant re

Visualization of my iPhone tracking data

Last week, the internets were flooded with panic about the iPhone storing location data in a SQLite DB.  The DB (called consolidated.db) contains longitude, latitude, altitude, accuracy, and timestamp information for nearby wifi hotspots and GPS locations (when maps apps are used).  You can take a look at the data your iPhone has stored on you by using the iPhonetracker app (for Mac OS X) or, if you've got a Windows machine, you can find the consolidated.db file and load it into database software. I used the program to visualize the data on my recently purchased iPhone 4 (I had a 3G until March, so I've only got 1 month of data to visualize--which in a way makes it easier to see how much tracking is really going on since I can easily recall where I've traveled in the past month).  Here is the graph that iPhonetracker shows: The advantage of this tracker app is that it visualizes the movement over time.  However, it's difficult to see the locations o

-writeinput- available from SSC

-writeinput- was recently posted to the SSC Archive [ 1 ][ 2 ].  I've written a bit about it before.  My announcement of this program focuses on using it to create a self-contained dataset example or snippet that can help other Statalist posters understand your question or response, but I've found it just as useful for transmitting do-file examples to coworkers and students. It's very much in the spirit of the Statalist FAQ. Many times a simple data example would prevent a lot of confusion, but creating one isn't always convenient.  Sometimes users can post relevant data examples using one of Stata's canned datasets or by building fake data through a series of -generate- commands.  Some Statalist posters will simply try to explain their data, which sometimes causes confusion and varying interpretations of the problem and the data structure. Other users might copy and paste a snippet of the data, but wrapping can be a problem. Plus, others have to add doub

More on label wrapping and -statplot-: Adding N's to your figures

While using -statplot- in the real world, we came across a situation where we needed to place the N's for sub-groups in certain value or variable labels. For these figures, the N's change across sub-datasets and whenever the survey data is updated with each wave, so actually writing something like "(N = 100)" into each value or variable label or graph title is repetitive.  These figures are heavy on the information side (they'd surely be an easy target for junkcharts for many reasons), but the real versions of these figures use fewer N's than the examples below, and they are made to mirror the output produced by Ian Watson's -tabout- (from SSC).   Here's a strategy to add some N's to graphs automatically and wrap the labels that contain them.  This example follows from the examples presented in earlier posts about -statplot- here and here.
***********************************!Create Data Example
sysuse nlsw88, clear
**varlabels**
lab var grade "

-obsdiff- available from SSC

-obsdiff- is a Stata module to find differences in a variable across records/observations.  It's ideal for finding the differences between rows that are near-duplicates.  This is usually the result of data that have been merged or joined in a way that created duplicates.  The solution may be to remove the extra record or reshape to move the extra observation to a new column (as is the case with var10 below). A quick example:
*******************watch for wrapping:
clear
inp    var1 str9 var2 var3 var4 str9(var5 var6) var7 str9 var8 var9 str9(var10 var11)
1 "a" 1 2 "c" "s" 3 "d" 5 "AA" "z"
1 "a" 1 2 "c" "s" 3 "d" 5 "BB" "z"
1 "a" 1 2 "c" "s" 3 "d" 5 "CC" "z"
2 "a" 1 2 "c" "s" 3 "d" 5 "CC"

Data cleaning with Google Refine

There's a lot to be said about the data and text cleaning abilities of programs like R [ 1 ] [ 2 ] and Stata [ 3 ] [ 4 ] [ 5 ].  But when it comes to cleaning up data with lots of spelling errors, different forms of the same string, abbreviations, acronyms, etc., or if you've got to task a student worker whose skill set barely includes M$ Excel, then Google Refine (it used to be called Freebase Gridworks) is a great tool for cleaning data. Here's the Google Code page, and below is a video on its data cleaning tools.  Google Refine can also transform data and access external data (like JSON data) from other websites, but I've found it most useful for data cleaning.

Some -statplot- examples, Part 2 (wrapping long labels)

...continued from Part 1 ... Part 1 of this post covered some advanced examples of -statplot-, focusing on the use of combinations of over() and by() options. In Part 2, I examine some strategies to use -statplot- with really long variable and/or value labels.  Recently, I was using -statplot- to create some tables in a paper where some of the labels in the tables needed to be the (longish) question and answer choice text, and I discovered how long labels can really be a pain for graphs.  This is a problem for any graph in Stata, regardless of whether your labels are in the legend or at the axis; however, my preference is to put long labels (up to a limit) at the axis, where they look better. So, the examples below show how to use -statplot- options to create wrapped labels.  I hope to create an option to help make this a part of -statplot- at some point in the future, but for now, the code below is a good template for helping you to automate wrapping labels.  This can be extended to other plott

Some -statplot- examples

-statplot- (co-authored by Nick Cox and myself) was released earlier this month.  You can get it at the SSC [ 1 ] [ 2 ]. In this posting, I show you some more advanced examples of -statplot- using the Stata nlsw88 dataset (-sysuse nlsw88.dta-).  [ Note : Click on any of the graphs below to see a larger example in a new tab/window.] First, a basic example of -statplot- might look like:
***********************!begin
sysuse nlsw88.dta, clear
statplot grade tenure wage, blabel(bar) subtit({it:-statplot-} example)
graph export "fig1.png", as(png) replace
***********************!end
Fig. 1
The main advantage of -statplot- is creating plots of summary stats with the labels moved from the legend (the usual placement when using -gr bar|hbar|dot-) to the axis.  So, I could create a graph of the same data above with something like:
graph hbar (mean) grade tenure wage
however, it would look like the graph on the left in Fig. 2 below, where we still have a legend and an a

TextWrangler and Stata

With the introduction of syntax highlighting and the ability to handle larger do-files, I've started to use the built-in Stata do-file editor more and more in lieu of Bare Bones Software's TextWrangler. However, several times a week I still find myself firing up TW for more complex tasks.  Most often it's when I need to show differences between 2 or more versions/revisions of code, do a complex find/replace, character substitution, duplicate line deletion, or regular expression search.  I also use it for FTP uploads and for inspecting/opening text versions of unknown filetypes on my Mac.  And occasionally, I'll open it if I'm working simultaneously on several do-files since the Mac version of Stata 11 still doesn't support tabbed do-files. I use the TW for Stata scripts found here, and I use the customizable shortcut functions in OS X 10.6 to send my do-files or a section of my do-files to Stata from TW. The only issue I have with the script above is that it's outdated.  The DataNi