Friday, January 22, 2010

AWK scripts & Large Datasets

Working with large datasets in statistical program, like Stata, SPSS, or R, can be a pain if the variable formats and shapes aren't what you need because stats programs can be notoriously slow at cleaning up the data when they have to load a dataset into memory that is larger than the physical memory (SAS is an exception--it uses the hard drive exclusively).  Of course there are plenty of other issues with large datasets [1]  [2] .  

A faster way to get the data cleaned up is to read the data in line-by-line and then manipulate it using something like AWK.  Using AWK scripts can save you some time with the clean up and Brenden at published a table comparing the speed of various languages in getting the job done:

The Impact of Corporate Spending on Elections

The Citizens United decision is in many ways a surprising decision.  The decision overrules standing precedent (namely Austin v. Michigan Chambers of Commerce), some provisions in McCain-Feingold,  and was decided on issues that weren't explicitly stated in the petition to the court.  More over, the Roberts court didn't follow their modus operandi of deciding controversial cases narrowly.

Tuesday, January 12, 2010

Referencing Cells/Observations in Stata using [brackets]

Earlier today there was a posting on Statalist that asked about labeling a numeric variable with the words contained in a string variable.  This reminded me of my first Statalist posting last year where I asked a similar question.  
There are several solutions to this question, but more importantly, it got me thinking more about referencing cells using brackets (called "explicit subscripting" or using "_variables" in the Stata manual).
Before I go into  subscripting, here's a recap of how to label numeric variables from a list contained in a string variable based on the Statalist exchange:

*-------------------------BEGIN CODE
The posting asked how to label col1 with values from col2 
in this example dataset:
inp byte col1 str5 col2
1 "Table"
2 "Chair" 
3 "Stool"
4 "Chair"
5 "Pan"
/*  Martin Weiss offered the solution:  */
loc h ""
forv i=1/`=_N'{
 loc h `h' `=col1[`i']' "`=col2[`i']'"
la def col1 `h'
la val col1 col1
la li col1
Which works great, but this could also take the shape
of a solution he gave me a year ago:
 forvalues n = 1/`=_N' {
  la def lab `n' "`=col2[`n']'", modify
label val col1 lab
la li lab
/*  Finally, you can install -labutil- from SSC and */
labmask col1, values(col2) lbl(label)
*-------------------------END CODE

So based on this, I started doing some more exploration of subscripting, especially in the context of grouped observations.  I run into a lot of situations where I want to grab the largest/smallest/average/etc. value in each group, and this can easily be accomplished via some -egen- command in combination with -bys-; however, extracting specific values from a group is made easier by explicit subscripting. A second advantage of subscripting is that it can reduce your typing of complicated "if" statements when generating new variables.

Monday, January 11, 2010

Visualization of Netflix Data

NYT posted some visualizations of the top Netflix rentals for 12 cities, by zip code.  Unfortunately, the raw data is not available.  Netflix has a history of releasing their rental data/algorithms for contests where participants have attempted to improve these algorithms, so hopefully they'll be releasing more data on consumer preferences for general use.
The city with data available that is closest to where I live is Dallas, Tx.  Before taking a look at the Netflix rentals, let's take a brief look at some demographic characteristics, by zip code, of this same region for reference (this is from the ERsys 2000 Census data, which is outdated).

Here's household income in Dallas:

(darker zip codes = higher mean household incomes)

Here's level of education in Dallas:

(Darker blue = higher mean education level)