Saturday, December 12, 2009

A Note on Posting Code in Blogger

As is the case with a lot of free blogging sites, posting code in Blogger can be a pain.  Google Sites and WordPress have fixed the issue, but Blogger (also Google-owned) has not.  Many users have posted solutions that involve uploading code to your own server and downloading a "syntax highlighter" file, but you have to paste in the CSS tags before every code posting, and I'm not willing to mess with that solution (I'm hoping that Blogger will add this functionality soon).

So, for the purposes of this site, I am posting all code (Stata or otherwise) in a grey, fixed-width font between a starting and ending set of asterisks.  Also, I'll use line continuation flags (/* and */) to indicate text wrapping.  I copied/pasted the code from my previous posting into the Stata do-file editor, and it ran without issues.
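
As a small illustration of that convention, here is a made-up snippet (the plot itself is arbitrary).  The /* at the end of one line and the */ at the start of the next join the two physical lines into a single command when run from a do-file:

*-----------------------BEGIN CODE
sysuse auto, clear
// one logical command, wrapped across two lines
twoway scatter mpg weight, /*
*/ title("MPG vs. Weight")
*-----------------------END CODE

Note that this kind of wrapping works in a do-file but not when typed line by line into the command window.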

Friday, December 11, 2009

Automatically Generating Reports with Stata

At PPRI, we get a lot of data in waves, so we end up recreating various versions of the same report with updated data.  This can become tiresome when having to recreate, re-copy/paste, and reformat lots of tables and figures.  Thankfully, Roger Newson, from Imperial College London, has created a set of ado-files, called -rtfutil-, that help automate the insertion of these elements into a .rtf document.
Here's a snippet showing how I've adapted his example code to help automatically generate some reports with new data:

*-----------------------BEGIN CODE
// save all output in the current working directory (note the trailing slash)
local sf  "`c(pwd)'/"
sysuse auto, clear
twoway scatter mpg price, mlabel(make) || lfitci mpg price
graph export "`sf'myplot1.eps", replace
twoway scatter price mpg, mlabel(make) by(for)
graph export "`sf'myplot2.eps", replace
tempname handle1
rtfopen `handle1' using "`sf'mydoc1.rtf", replace
file write `handle1' _n _tab "{\pard\b SAMPLE DOCUMENT \par}" _tab _n
file write `handle1' _n "{\line}"
// Figure1
file write `handle1' "{\pard\b FIGURE 1: Plot of Price\par}" _n
rtflink `handle1' using "`sf'myplot1.eps"
// Figure2
file write `handle1' _n "{\page}" _n /*
*/ "{\pard Here is the plot and a paragraph about it. Here is the plot and a paragraph about it. Here is the plot and a paragraph about it. Here is the plot and a paragraph about it.....blah blah blah blah blah \line}" _n
file write `handle1' _n "{\line}"
file write `handle1' "{\pard\b FIGURE 2: Plots of Price vs. MPG\par}" _n
rtflink `handle1' using "`sf'myplot2.eps"

// Table Title
file write `handle1' _n "{\page}" _n
file write `handle1' _n "{\par\pard}" _n /*
*/ "{\par\pard HERE IS A TABLE WITH THE CARS: \par\pard}" _n
file write `handle1' _n "{\par\pard}" _n

// Summary Table
rtfrstyle make mpg weight, cwidths(2000 1440 1440) local(b d e)
listtex make foreign mpg if mpg<15, /*
*/ handle(`handle1') begin("`b'") delim("`d'") end("`e'") /*
*/ head("`b'\ql{\i Make}`d'\qr{\i Foreign}`d'\qr{\i MPG }`e'")
file write `handle1' _n "{\line}"
file write `handle1' _n _tab(2) /*
*/ "{\pard\b Sources: Census Data, etc... \par}" _n _n
rtfclose `handle1'
*-----------------------END CODE

I posted this as part of a solution to a question at StackOverflow.  You can get -rtfutil- here or by typing "ssc install rtfutil" in Stata.

Tuesday, December 8, 2009

Data Cleaning with Stata

During the course of my research, I come across a lot of messy data.

This includes everything from human errors like misspelled words, data entry mistakes, or inconsistent variable/value labels to machine/software issues like weird characters in the data, strange data shapes from external programs, or data embedded in something like HTML.

Stata has a lot of powerful tools to help automate data cleaning (one of my favorites is -filefilter-), but when all else fails, sometimes you've just got to sit down and clean data by hand (or get a research assistant to do it for you).
For example, when you are trying to join data from multiple sources that label their records with similar, but unsystematically different, conventions, it can be tricky to get the data to link up.  A recent Statalist posting asked about this issue in particular (note: I'll try to tackle some of these other 'data cleaning' areas in future posts).

The question asked how to merge datasets with similar, but irreconcilable record identifiers, like:

dataset1                                         dataset2
68th precinct youth council inc                  68th precinct youth council, inc.
action center for education and community        action center for education and
amistad child day care and family center inc     amistad early childhood

In my reply, I suggested using -reclink- to help reconcile the two lists.  In fact, since then I've started using it more frequently for a lot of my datasets.  The important thing to realize is that even when the identifiers in the two lists are fairly distinctive, there will be errors.  -reclink- will be quite accurate in finding matches in the example above because the identifiers are close (usually only a few characters off) and unique (they don't closely resemble other identifiers).  However, if you had to reconcile the lists below, you would have less success:
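
For the first pair of lists above, a minimal sketch of a -reclink- call might look like the following.  It assumes -reclink- is installed ("ssc install reclink"); the variable names (name, id1, id2, matchscore) are my own placeholders rather than anything from the original question, and by default the matching variable from the using dataset should come back with a "U" prefix (Uname):

*-----------------------BEGIN CODE
// build the second list and save it as the "using" dataset
clear
input str60 name
"68th precinct youth council, inc."
"action center for education and"
"amistad early childhood"
end
generate id2 = _n
tempfile two
save `two'

// build the first list as the master dataset
clear
input str60 name
"68th precinct youth council inc"
"action center for education and community"
"amistad child day care and family center inc"
end
generate id1 = _n

// fuzzy-match on name; matchscore holds the match score for each pair
reclink name using `two', idmaster(id1) idusing(id2) gen(matchscore)
list id1 id2 name Uname matchscore, noobs
*-----------------------END CODE

Records that score below the minscore() cutoff are left unmatched, so it is worth looking at both the matched and unmatched rows.
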

dataset1                 dataset2
sal                      sally
salamander               salaman
salad s                  salads
salute                   salu
sally johnson            sally johns
sally johnston           sally j. johnston

Something to note in the lists above is that "sal" in dataset1 would match all the identifiers in dataset2, and "sally johns" in dataset2 would match both "sally johnson" and "sally johnston" in dataset1.
So, -reclink- is best used only as a timesaver in helping you to flag the records that might be a good match.
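
One way to use -reclink- as a flagging tool on the second pair of lists is sketched below.  Again, this assumes -reclink- is installed from SSC; the variable names (name, id1, id2, score, review) and the 0.95 cutoff are arbitrary choices for illustration, not recommended values:

*-----------------------BEGIN CODE
clear
input str30 name
"sally"
"salaman"
"salads"
"salu"
"sally johns"
"sally j. johnston"
end
generate id2 = _n
tempfile two
save `two'

clear
input str30 name
"sal"
"salamander"
"salad s"
"salute"
"sally johnson"
"sally johnston"
end
generate id1 = _n

reclink name using `two', idmaster(id1) idusing(id2) gen(score)

// flag unmatched records and anything short of a near-perfect score
generate byte review = missing(score) | score < 0.95
gsort -review score
list name Uname score review, noobs
*-----------------------END CODE

Everything with review equal to 1 still gets checked by hand; the match score only decides what to look at first.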

Tuesday, December 1, 2009

my "research_notes"

Currently, I am in the process of migrating my various webpages to a new domain, and I have included this blog as a place to collect, comment on, respond to, or describe efforts with my current academic research at TAMU PPRI, some attempts at Stata & OSX programming, etc.   /enjoy!