Skip to main content

Working around issues with long IDs in Stata (and also more about -markstat-)









Working with long IDs in Stata (and also more about -markstat-)




In this example, I discuss some potential issues with using long IDs in Stata and how to avoid them. The problems presented below are separate from precision issues with large numeric values (that is, large in terms of number of places, including precise decimal representations). I previously discussed precision issues that are alleviated by using double or string format at this link.

Long IDs and -levelsof-

First, let’s create some fake data for this session:

. clear

. input campus

         campus
  1.  11990  
  2.  119902041  
  3.  243905112  
  4.  243905129  
  5.  243905131  
  6.  244903001  
  7.  00244903041 
  8.  end

.  g x = cond(_n<4, 1, 0, .)

. desc

Contains data
  obs:             7                          
 vars:             2                          
 size:           112                          
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
              storage   display    value
variable name   type    format     label      variable label
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
campus          double  %10.0g                
x               double  %10.0g                
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Sorted by: 
     Note: Dataset has changed since last saved.

. g mark = .
(7 missing values generated)

** -levelsof- doesn’t handle long IDs**

In this example, some of the IDs are longer than the standard 9.0g format because they have 10 or 11 digits. -levelsof- formats the values in scientific notation and there isnt a format() option for this program. So we when we run -levelsof- :

. levelsof campus if x , loc(jj)
11990 1.20e+08 2.44e+08

notice that the local macro jj contains the values: 11990 1.20e+08 2.44e+08 !

Yeah, this is a problem

Therefore, when we include these values in any later conditional statement (e.g., replacing or recoding values if something is equal to that campus id of interest) they won’t match but we won’t get an error indicating that it’s because of the formatting difference, especially if we’ve tostringed campus as is (without formatting it properly). Below, we get a false positive/match with the first ID since it’s less than 9 digits and if we saw that some proportion of our observations were changed in our command we might wrongly assume that our command worked as intended even though it’s missed all the longer IDs.

** so,**

. tostring campus, g(campus2)
campus2 generated as str9

. foreach j in `jj' {
  2.     tr: replace mark = 1 if campus2== `"`j'"'
  3.     }
          - `macval(cmdline)'
          =  replace mark = 1 if campus2== `"11990"'
(1 real change made)
          - `macval(cmdline)'
          =  replace mark = 1 if campus2== `"1.20e+08"'
(0 real changes made)
          - `macval(cmdline)'
          =  replace mark = 1 if campus2== `"2.44e+08"'
(0 real changes made)

. l , noobs t div // noh    

  ┌───────────┬───────────┬───┬──────┐
  │    campus │   campus2 │ x │ mark │
  ├───────────┼───────────┼───┼──────┤
  │     11990 │     11990 │ 1 │    1 │
  │ 1.199e+08 │ 119902041 │ 1 │    . │
  │ 2.439e+08 │ 243905112 │ 1 │    . │
  │ 2.439e+08 │ 243905129 │ 0 │    . │
  │ 2.439e+08 │ 243905131 │ 0 │    . │
  ├───────────┼───────────┼───┼──────┤
  │ 2.449e+08 │ 244903001 │ 0 │    . │
  │ 2.449e+08 │ 244903041 │ 0 │    . │
  └───────────┴───────────┴───┴──────┘

Solution: -tostring- long IDs

The solution is to always -tostring- your long IDs and to use the format() option. As Robert Picard has noted on Statalist “With respect to identifiers, they should remain string; there’s no expectation of doing math with their value.” Doing this also helps protect against the issue with leading zeroes.

. levelsof campus2  
`"11990"' `"119902041"' `"243905112"' `"243905129"' `"243905131"' `"244903001"' `"244903041"'

. tostring campus, g(campus3) format(%011.0f)
campus3 generated as str11

. levelsof campus3  
`"00000011990"' `"00119902041"' `"00243905112"' `"00243905129"' `"00243905131"' `"00244903001"' `"00244903041"'

.     cap drop campus2  //dont need this

Double-down on your safeguards

Another set of advice has to do with how to store these variables when efficiency/size is a problem. First off, I prefer to use data type double for large (and really all) integers to protect against potential problems with precision (some people would point out that this wastes space though). You can do this with the command:

 set type double, permanently

Using encode to save space

The size/efficiency problem can be solved if you encode these IDs. For example, let’s expand our dataset from above so that we can better see the impact of encoding (and compressing) variables with many digits.

. preserve

. expand 1000
(6,993 observations created)

. g double verylong = int(runiform()*100000)

. qui d, sh

.     loc old `"`c(width)'"'

. encode campus3, g(shortcampus)

. tostring verylong, g(vl2)
vl2 generated as str5

. encode vl2, g(shorter)

. drop vl2 campus verylong

. compress
  variable shortcampus was long now byte
  variable shorter was long now int
  variable x was double now byte
  variable mark was double now byte
  (133,000 bytes saved)

. qui d, sh

.     loc new `"`c(width)'"'

. restore

Dataset width changed by c.60% with the same # of variables; that is, it changed from 43 to 16 (-27).

Downsides for this reduction method

  1. it takes a while to -tostring-, -encode- and -compress- variables and

  2. importantly, if you ever need to merge/match/append this data to another datset with campus IDs your encoded values will likely not match between datasets
  • you’ll want to merge/match/append on the original (or better string versions) of the IDs.
  1. -encode- has an upper limit of values
  • I dont actually know what this is nor could I find it in the help files, but it might be limited by the value label length limit (32k)… but you can manually encode by assigning a number by `panel’ (that is by campus group) and then if you have too many values to label them, create a look up table for later use.

Alternatively: save to external file

So, using a bigger version of the expanded data example above:

. expand 1000000
(6,999,993 observations created)

. g double verylong = int(runiform()*100000)

. qui d, sh

.     loc old `"`c(width)'"'

. egen lookup1 = group(campus)

. egen lookup2 = group(verylong)

.     preserve

.     keep campus   lookup1

.     qui duplicates drop campus lookup1, force // dont need all copies of index

.     qui sa lookuptable.dta, replace

.     restore

. drop campus* verylong

. qui compress

. qui d, sh

.     loc new `"`c(width)'"'

Dataset width changed by about 80% with the same # of variables; that is, it changed from 43 to 7 (-36).


Notes

I created this example using -markstat- from SSC. The code for producing this file at the link below:

Stata command file is here (.stmd)

The most powerful part of -markstat- is that this same .stmd file is used to produce a HTML, PDF (LaTeX), DOCX, S5 Web Slide Show, and BEAMER presentation (LaTeX again).

Here are the links for those files:

HTML

PDF/LaTeX

LaTeX

DOCX

BEAMER/LaTeX

S5 Web slides

ZIP OF ALL FILES

Run the stmd file:

The command file is a .stmd file and is run with the code:

. markstat using levelsofex,   strict  
. 
. markstat using levelsofex, docx  nodo strict
. markstat using levelsofex, pdf   nodo strict
. **because beamer below will overwrite!
.     copy levelsofex.pdf levelsofexpdf2.pdf, replace public 
. markstat using levelsofex,   beamer(madrid) nodo strict
. markstat using levelsofex, slides(santiago)  nodo strict
. 

it is not a standard do-file because this is the type of file that is required by -markstat-.


Comments