Wednesday, August 25, 2010

More on Working with Field-Tagged Data in Stata

I’ve been working a lot with field-tagged data recently (some of it sits in HTML tables, which I am still struggling with).

Below is the code for a (still messy and buggy) program called removetags.ado, a working ado-file that pulls information out of field-tagged data from within Stata.

The biggest challenges for me have been (1) how to adapt the code when some fields span multiple rows without field tags to identify them and (2) how to deal with long entries in Stata (since string variables are limited to 244 characters).

I think I’ve solved (1) for most cases (see lines 71-90 in the ado-file linked below), but it’s clunky and I’m sure I’ll need to adapt it further when I run across things that make it choke. However, I am currently working on solving (2).
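
The idea behind (1) can be sketched on a toy example. This is only a minimal illustration, not the code from removetags.ado: it assumes WOS-style rows where each tag is two characters followed by a space and continuation rows begin with a blank, and the variable names (line, cont, tag, value) are made up for the example.

```stata
clear
input str40 line
"PT J"
"AU Smith, J"
"   Doe, A"
"TI A title that runs"
"   onto a second row"
end

* assumption: a row that starts with a blank continues the previous field
generate byte cont = substr(line, 1, 1) == " "

* take the two-character tag from tagged rows only
generate str2 tag = substr(line, 1, 2) if !cont

* carry the last tag forward; replace works through the rows in order,
* so the fill cascades over several continuation rows in a row
replace tag = tag[_n-1] if cont

* in this layout the field text starts at the fourth character
generate value = trim(substr(line, 4, .))
list tag value, noobs
```

Real exports are less tidy than this (tags of different lengths, blank rows between records), which is where an approach like this is likely to need extra handling.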

[[So far, my solution involves identifying the rows that are longer than 244 characters using -file read- and then reading those rows into Stata as characteristics associated with each record. The advantage of this approach is that you can still manipulate the stored text as macros using extended macro functions, and the long strings stay associated with the field-tagged records when you merge or export the data. However, I'm still having issues using the file read/write commands to manipulate the long strings. I've read these posts on grepmerge [1] and Gabi Huiber's posts on doing something similar in Mata [2], but I'm not much of a programmer and I don't know much about Mata, so this will be my attempt at a workaround.]]
Ignoring the problem that long strings are truncated in Stata, the code below gets me most of the way there.
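
A rough sketch of the -file read- plus characteristics idea, under stated assumptions: the test file is generated on the fly, and the characteristic naming scheme (_dta[long#], keyed by row number) is invented for this example and is not taken from removetags.ado.

```stata
clear
tempfile src
tempname fh

* build a small test file whose second row is longer than 244 characters
local filler : display _dup(300) "x"
file open `fh' using "`src'", write text replace
file write `fh' "PT J" _n
file write `fh' "AB `filler'" _n
file write `fh' "ER" _n
file close `fh'

* read the file row by row; a macro can hold far more than 244 characters
local i = 0
file open `fh' using "`src'", read text
file read `fh' line
while r(eof) == 0 {
    local ++i
    local len : length local line
    if `len' > 244 {
        * keep the full row as a dataset characteristic named by row number
        char define _dta[long`i'] `"`macval(line)'"'
    }
    file read `fh' line
}
file close `fh'

* pull the stored text back into a macro and check its length
local back : char _dta[long2]
local n : length local back
display "row 2 holds `n' characters"
```

Characteristics are saved with the dataset, which is what keeps the long text attached to the data; tying each one to a specific record would need a naming scheme that encodes the record number.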

1. To get the ado file, visit this link from my dropbox.

2. To test this ado-file, I’ve put two sample .txt files in my dropbox (referenced in the code below)
The first one is just the WOS record you provided above (but I’ve -expanded- it to 3 copies).
The second example is from EndNote. When you export your citations to a text file, the output is field-tagged data, so this is another good test example. (If any of these links are dead, please email me for a copy.)

3. Once you’ve downloaded and copied the .ado file into the proper location, try these examples (there’s no help file, so some instructions are included below):

***************!
  /*
Instructions:
Data must be field-tagged records where each record spans multiple rows

Options:
recordstart() must contain the field tag that begins 
each record
fieldtagdelimiter() must contain the char that delimits 
the field tag (default is a space char)
keep() is optional list of fieldtags that will be 
kept in reshaped file
removechars() is an optional list of problematic 
non-alphanumeric chars that need to be removed from 
your fieldtags
((note: use the -charlist- or -ascii- programs (from SSC) 
to help identify these, if they exist))
  */


**examples**
clear
which removetags

//example 1//
removetags using "http://dl.dropbox.com/u/428249/websci.txt", ///
rec("PT") fieldtagd(" ") keep(AU UT JI CR*)
li

//example 2//
removetags using "http://dl.dropbox.com/u/428249/endnote.txt", ///
rec("Reference Type") fieldtagd(":")
li record Year Author Title in 1/10
***************!


So, in example 1 (WOS records) each record starts with the field tag “PT”, each field tag is delimited by the first space character, and I want to keep the vars AU, UT, JI, and the CR fields.
In example 2, “Reference Type” denotes the start of each record entry, the field tags are delimited by a colon, and I keep all the vars.
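
For anyone curious what a reshape like this involves, here is a minimal sketch on made-up data. It is not the code in removetags.ado, and the variable names (line, tag, value, record) are assumptions; it only mirrors the roles of recordstart("PT") and a space delimiter from example 1.

```stata
clear
input str40 line
"PT J"
"AU Smith, J"
"UT 000111"
"PT J"
"AU Doe, A"
"UT 000222"
end

* tag = text before the first space; value = everything after it
generate tag   = substr(line, 1, strpos(line, " ") - 1)
generate value = substr(line, strpos(line, " ") + 1, .)

* a new record begins at every row whose tag is the record-start tag
generate long record = sum(tag == "PT")

* one row per record, one variable per field tag
drop line
reshape wide value, i(record) j(tag) string
list, noobs
```

The result has one row per record with the variables valueAU, valuePT, and valueUT. This simple version breaks if a tag repeats within a record or contains characters that are not legal in variable names, which is roughly the problem removechars() is meant to address.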
Maybe this is useful for someone, but more importantly, I’d like to hear if anyone tries this out and has any feedback, comments, or help on how to improve this.