Tuesday, January 12, 2010

Referencing Cells/Observations in Stata using [brackets]

Earlier today there was a posting on Statalist that asked about labeling a numeric variable with the words contained in a string variable.  This reminded me of my first Statalist posting last year where I asked a similar question.  
There are several solutions to this question, but more importantly, it got me thinking more about referencing cells using brackets (called "explicit subscripting" or using "_variables" in the Stata manual).
Before I go into  subscripting, here's a recap of how to label numeric variables from a list contained in a string variable based on the Statalist exchange:

*-------------------------BEGIN CODE
The posting asked how to label col1 with values from col2 
in this example dataset:
inp byte col1 str5 col2
1 "Table"
2 "Chair" 
3 "Stool"
4 "Chair"
5 "Pan"
/*  Martin Weiss offered the solution:  */
loc h ""
forv i=1/`=_N'{
 loc h `h' `=col1[`i']' "`=col2[`i']'"
la def col1 `h'
la val col1 col1
la li col1
Which works great, but this could also take the shape
of a solution he gave me a year ago:
 forvalues n = 1/`=_N' {
  la def lab `n' "`=col2[`n']'", modify
label val col1 lab
la li lab
/*  Finally, you can install -labutil- from SSC and */
labmask col1, values(col2) lbl(label)
*-------------------------END CODE

So based on this, I started doing some more exploration of subscripting, especially in the context of grouped observations.  I run into a lot of situations where I want to grab the largest/smallest/average/etc. value in each group, and this can easily be accomplished via some -egen- command in combination with -bys-; however, extracting specific values from a group is made easier by explicit subscripting. A second advantage of subscripting is that it can reduce your typing of complicated "if" statements when generating new variables.

Here are some examples:

*-------------------------BEGIN CODE
//fake data//
inp dummycase x y str3 name
1 100 200 "aaa"
1 100 300 "bbb"
1 100 400 "ccc"
1 200 300 "zzz"
2 100 200 "ggg"
2 200 500 "hhh"
3 999 999 "jij"
4 120 120 "r2r"
4 100 100 "2r2"
4 700 700 "rrr"
/*Here are groups are defined by "dummycase" */
/*Generate obs=1 for first observation in a group*/
  bys dummycase: g obs = [1==_n]
this gets rid of the "if" & subsidary replace statement
that would look like this:
  bys dummycase:  g obs2 = 1 if _n==1
  recode obs2 (.=0)
Next, we can generate a value of "x" for first observation in 
group if that value is equal to y in the same observation
  bys dummycase: g firstobservation_value = (x[1]==y) 
//again, no "if" statement, plus this calls the observation # in a group//
/* Now, we can return first & last value in group via */
  bys dummycase: g y_ends = y if inlist(_n, 1, _N)
/* You can list the last 10 observations in a dataset with:  */
li in `=_N-10'/l
/* Finally, label dummycase with corresponding string value (name) */
forvalues num = 1/`=_N' {
 label define l `num'   "`=name[`num']'", modify
     label val dummycase l
*-------------------------END CODE

One last thing that I found interesting...you can use subscripting to help fill in missing observations when a spreadsheet uses "super rows".  This is particularly frustrating when working with timeseries datasets from sources like WDI or state health data that I use frequently.  
Often, these spreadsheets have a super row identifier every couple of rows, and the rows in between are meant to be a part of that same group.  These subrows might be years of data for a country or counties in a state, etc.   Of course, Stata sees this as missing data.  
Here's an example of how to work with data in that shape:

*-------------------------BEGIN CODE
inp str10 country year obs1 obs2
"Afghanistan" 1990 100 200
"" 1991 100 300
"" 1992 200 400
"" 1999 400 100
"Burma" 1990 300 500
"" 1991 999 888
"" 1993 999 880
"Congo" 1997 999 999
//use subscripting to fill in blank country names
replace country = country[_n-1] if mi(country[_n])
//use -fillin- command to equalize panels with all missing years
fillin country year
//create lagged var
bys country: gen obs2_lag = obs2[_n-1]
*-------------------------END CODE

This is pretty simple ((and probably obvious to most Stata programmers)), but I didn't realize it until long after I had spent many grueling hours during grad school cleaning up these types of datasets by dragging down the columns in Excel or writing clunky Excel 'if' statements.  ugh.  

1 comment:

  1. Hi there
    I am trying to create a new variable which represents only the first year's value of a variable in an unbalanced panel dataset, with some missing values in between the 15 years of observation period. However, sine the first year's value for each Company ID is different, when I run the following codes, only some is correctly generated for a few companies.

    xtset CoID Year
    sort CoID Year
    bysort CoID: generate ROA3=ROA[1] if ROA !=.

    I have tried to apply your codes above using subscripting, but it still could not work.

    I would be grateful for your help.