Friday, January 22, 2010

AWK scripts & Large Datasets

Working with large datasets in statistical program, like Stata, SPSS, or R, can be a pain if the variable formats and shapes aren't what you need because stats programs can be notoriously slow at cleaning up the data when they have to load a dataset into memory that is larger than the physical memory (SAS is an exception--it uses the hard drive exclusively).  Of course there are plenty of other issues with large datasets [1]  [2] .  

A faster way to get the data cleaned up is to read the data in line-by-line and then manipulate it using something like AWK.  Using AWK scripts can save you some time with the clean up and Brenden at published a table comparing the speed of various languages in getting the job done:

 Language Time (min:sec)Speed (vs. gawk)Lines of codeNotesType
mawk1:067.8x3Mike Brennan’s Awk, system default on Ubuntu/Debian Linux.VM

java1:206.4x32version 1.6 (-server didn’t matter)VM+JIT

c-ish c++1:355.4x42g++ 4.0.1 with -O3, using stdio.hNative

python2:153.8x20version 2.5, system default on OSX 10.5VM

perl3:002.9x17version 5.8.8, system default on OSX 10.5VM

nawk6:101.4x3Brian Kernighan’s “One True Awk”, system default on OSX, *BSD?

c++6:501.3x48g++ 4.0.1 with -O3, using fstream, stringstreamNative

ruby7:301.1x22version 1.8.4, system default on OSX 10.5; also tried 1.9, but was slowerInterpreted

gawk8:351x3GNU Awk, system default on RedHat/Fedora LinuxInterpreted

You can get the practice data and scripts from here and try these for yourself (the data is about 1G in size).  I was interested in trying this in Stata to see if it competes with the times mentioned in their table. describes the task at hand:
And then rename items and features into sequential numbers as a sparse matrix: (i, j, value) triples. Items should count up from inside each file; but features should be shared across files, so they need a shared counter. Finally, we need to write a mapping of feature IDs back to their names for later inspection; this can just be a list.

The data look like this:

000794107-10-K-19960401 limited 1
000794107-10-K-19960401 colleges 1
000794107-10-K-19960401 code 2
000794107-10-K-19960401 customary 1
000794107-10-K-19960401 breadth 1


And the filenames in the practice data look like this:

So, in Stata we can do this with:

local source "/$sf/sec_10k_pruned_triples//"
global all: dir "`source'" files "*.p", respectcase
di "`"$all"'"
set obs 1
g z = .
save "`source'/master.dta", replace
foreach j of global all {
insheet using "`source'/`j'", nonames
split v1, p(" ")
drop v1
rename v11 item
rename v12 feature
rename v13 value
save "`source'/`j'.dta", replace
use "`source'/master.dta", clear
append using "`source'/`j'.dta"
save "`source'/master.dta", replace
drop in 1
drop z

foreach i in item feature {
bys `i': g `i'_id = _n
keep item_id item feature_id feature
save "`source'/index_file.dta" , replace
//<--creates the file for linking id & item/feature
keep item_id feature value 
save using "`source'/cleaned_data.dta", replace

Using a Macbook Pro with 2-core 2.8 GHZ, 4 GB, & Stata 11 MP, this script ran in 13 min, 9 sec. Then I tried it again with my 8-core 3.33 GHZ, 20 GB, & Stata 11 MP and it reduced the time to 6 min, 32 sec, which puts it slightly faster than the c++ language results in the table above (though the c++ script was run on a slower machine).   I ran the perl and python scripts from the repository on my 8-core machine and got slightly faster time than they reported (2:01 and 2:45, respectively).

However, this really isn't a test of running a script to clean up data on a "large" dataset, Stata can handle this somewhat efficiently while the dataset is smaller than the physical memory.  When the data become bigger than the RAM, then the use of virtual memory slows it down significantly. I used the -expand- command to multiply the size of the dataset by 40 and ran the script on my Mac Pro again with the 40 G of data and the time increased to 592 min, 10 sec (yikes!).

A final note:  If you run this kind of script in Stata's Mata, rather than a standard do-file, it would run faster, but I'm a newbie to Mata, so I couldn't get the string parsing bit to work...if anyone tries this in Mata, let me know.  

Update:  William Gould has a article on processing difficult files in Mata in the latest issue of Stata Journal.


  1. Thanks for sharing these niche piece of coding to our knowledge. Here, I had a solution for my inconclusive problems & it’s really helps me a lot keep updates…
    sas training chennai

  2. PHP provides the best option to build the website where we can design our website in a very interactive manner that provides better functioning in data management.
    PHP Training in Chennai | PHP course in Chennai