Working with large datasets in statistical programs like Stata, SPSS, or R can be a pain if the variable formats and shapes aren't what you need. Stats programs can be notoriously slow at cleaning up data when the dataset they have to load is larger than physical memory (SAS is an exception: it works from the hard drive rather than loading everything into memory). Of course, there are plenty of other issues with large datasets [1] [2]. A faster way to get the data cleaned up is to read it in line by line and manipulate it with something like AWK. AWK scripts can save you some time on the cleanup, and Brendan at anyall.org published a table comparing the speed of various languages at getting the job done.
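To make the line-by-line idea concrete, here is a minimal sketch written in Python rather than AWK. It is only an illustration of the streaming approach, not the script from the comparison mentioned above. The function name `clean_lines`, the comma delimiter, the sample rows, and the rule of recoding empty fields to `NA` are all assumptions made up for the example.

```python
import io


def clean_lines(lines, delimiter=",", missing="NA"):
    """Yield cleaned rows one at a time from any iterable of text lines.

    Only one line is held in memory at once, so the size of the input
    does not matter. The cleaning rules here are hypothetical.
    """
    for line in lines:
        # Split the row and trim stray whitespace around every field.
        fields = [field.strip() for field in line.rstrip("\n").split(delimiter)]
        # Drop rows that are completely blank.
        if not any(fields):
            continue
        # Recode empty fields to an explicit missing-value marker.
        fields = [field if field else missing for field in fields]
        yield delimiter.join(fields)


# Small in-memory stand-in for a large file (made-up sample data).
raw = io.StringIO("id , income\n1 , 52000 \n\n2 ,  \n")
for row in clean_lines(raw):
    print(row)
# prints:
# id,income
# 1,52000
# 2,NA
```

For a real file you would pass an open file handle instead of the `StringIO` object, since iterating over a file in Python also reads it one line at a time, and write each cleaned row straight to an output file before loading the result into your stats program.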
Eric A. Booth :: Notes on social science research, Stata, and OSX programming ::