A faster way to get the data cleaned up is to read it in line by line and manipulate it with something like AWK. AWK scripts can save you some clean-up time, and Brendan at anyall.org published a table comparing how quickly various languages get the job done:
| Language | Time (min:sec) | Speed (vs. gawk) | Lines of code | Notes | Type |
|----------|----------------|------------------|---------------|-------|------|
| mawk | 1:06 | 7.8x | 3 | Mike Brennan's Awk, system default on Ubuntu/Debian Linux | VM |
| java | 1:20 | 6.4x | 32 | version 1.6 (-server didn't matter) | VM+JIT |
| c-ish c++ | 1:35 | 5.4x | 42 | g++ 4.0.1 with -O3, using stdio.h | Native |
| python | 2:15 | 3.8x | 20 | version 2.5, system default on OSX 10.5 | VM |
| perl | 3:00 | 2.9x | 17 | version 5.8.8, system default on OSX 10.5 | VM |
| nawk | 6:10 | 1.4x | 3 | Brian Kernighan's "One True Awk", system default on OSX, *BSD | ? |
| c++ | 6:50 | 1.3x | 48 | g++ 4.0.1 with -O3, using fstream, stringstream | Native |
| ruby | 7:30 | 1.1x | 22 | version 1.8.4, system default on OSX 10.5; also tried 1.9, but it was slower | Interpreted |
| gawk | 8:35 | 1x | 3 | GNU Awk, system default on RedHat/Fedora Linux | Interpreted |
You can get the practice data and scripts from here and try these for yourself (the data are about 1 GB in size). I was interested in trying this in Stata to see whether it competes with the times in the table above.
Anyall.org describes the task at hand:
And then rename items and features into sequential numbers as a sparse matrix: (i, j, value) triples. Items should count up from inside each file; but features should be shared across files, so they need a shared counter. Finally, we need to write a mapping of feature IDs back to their names for later inspection; this can just be a list.
The data look like this:
000794107-10-K-19960401 limited 1
000794107-10-K-19960401 colleges 1
000794107-10-K-19960401 code 2
000794107-10-K-19960401 customary 1
000794107-10-K-19960401 breadth 1
And the filenames in the practice data look like this:
So, in Stata we can do this with:
On a MacBook Pro (2 cores, 2.8 GHz, 4 GB RAM, Stata 11 MP), this script ran in 13 min, 9 sec. I then tried it on my 8-core, 3.33 GHz, 20 GB Mac Pro with Stata 11 MP, which cut the time to 6 min, 32 sec, putting it slightly ahead of the c++ results in the table above (though the c++ script was run on a slower machine). I also ran the python and perl scripts from the anyall.org repository on my 8-core machine and got slightly faster times than they reported (2:01 and 2:45, respectively).
However, this really isn't a test of running a clean-up script on a "large" dataset: Stata can handle the job fairly efficiently as long as the dataset fits in physical memory. Once the data grow bigger than RAM, paging to virtual memory slows things down significantly. I used the -expand- command to multiply the dataset by 40 and ran the script on my Mac Pro again; with roughly 40 GB of data, the time ballooned to 592 min, 10 sec (yikes!).
A final note: if you ran this kind of script in Stata's Mata rather than a standard do-file, it would likely run faster, but I'm a newbie to Mata and couldn't get the string-parsing bit to work. If anyone tries this in Mata, let me know.
Update: William Gould has an article on processing difficult files in Mata in the latest issue of the Stata Journal.