Skip to main content

Posts

Showing posts from January, 2010

AWK scripts & Large Datasets

Working with large datasets in statistical program, like Stata, SPSS, or R, can be a pain if the variable formats and shapes aren't what you need because stats programs can be notoriously slow at cleaning up the data when they have to load a dataset into memory that is larger than the physical memory (SAS is an exception--it uses the hard drive exclusively).  Of course there are plenty of other issues with large datasets [1]  [2] .  


A faster way to get the data cleaned up is to read the data in line-by-line and then manipulate it using something like AWK.  Using AWK scripts can save you some time with the clean up and Brenden at anyall.org published a table comparing the speed of various languages in getting the job done:



The Impact of Corporate Spending on Elections

The Citizens United decision is in many ways a surprising decision.  The decision overrules standing precedent (namely Austin v. Michigan Chambers of Commerce), some provisions in McCain-Feingold,  and was decided on issues that weren't explicitly stated in the petition to the court.  More over, the Roberts court didn't follow their modus operandi of deciding controversial cases narrowly.

Referencing Cells/Observations in Stata using [brackets]

Earlier today there was a posting on Statalist that asked about labeling a numeric variable with the words contained in a string variable.  This reminded me of my first Statalist posting last year where I asked a similar question.  
There are several solutions to this question, but more importantly, it got me thinking more about referencing cells using brackets (called "explicit subscripting" or using "_variables" in the Stata manual).
Before I go into  subscripting, here's a recap of how to label numeric variables from a list contained in a string variable based on the Statalist exchange:

*-------------------------BEGIN CODE /* The posting asked how to label col1 with values from col2 in this example dataset: */ clear* inp byte col1 str5 col2 1 "Table" 2 "Chair" 3 "Stool" 4 "Chair" 5 "Pan" end /* Martin Weiss offered the solution: */ loc h "" forv i=1/`=_N'{ loc h `h' `=col1[`i']' &q…

Visualization of Netflix Data

NYT posted some visualizations of the top Netflix rentals for 12 cities, by zip code.  Unfortunately, the raw data is not available.  Netflix has a history of releasing their rental data/algorithms for contests where participants have attempted to improve these algorithms, so hopefully they'll be releasing more data on consumer preferences for general use.
The city with data available that is closest to where I live is Dallas, Tx.  Before taking a look at the Netflix rentals, let's take a brief look at some demographic characteristics, by zip code, of this same region for reference (this is from the ERsys 2000 Census data, which is outdated).

Here's household income in Dallas:



















(darker zip codes = higher mean household incomes)

Here's level of education in Dallas:

(Darker blue = higher mean education level)


Decision Mapping: What Should I Drink? Edition

Linked here or after the jump....


Update:  They've also got one for fast food.