Monday, December 20, 2010

Creating example datasets for collaboration with other Stata users

I'm lucky to be in a research environment where most of my colleagues and students use Stata. Also, I regularly participate on Statalist. Both of these have helped push me to periodically refine my habits when it comes to communicating about Stata.

When it comes to asking questions on Statalist, I've tried to stick closely to the Statalist FAQ and other tips mentioned by William Gould on the Stata NEC Blog. However, for answering questions on Statalist, I find Maarten Buis's page on his Statalist postings especially helpful.

I've learned a lot from Maarten's FAQ about
(1) the types of questions that are not obvious to others on Statalist (and this tends to translate over to my students & colleagues as well) and
(2) ways to minimize this confusion with steps as simple as creating clearly marked, self-contained working examples of code, or using comments to give the code in an example a roadmap and to avoid problems with code wrapping.

When it comes to creating clearly marked, self-contained examples for others, there are a few standard tools:
  • Using a canned Stata dataset for the example (as Maarten mentions).
  • Creating a fake dataset using a variety of -generate-, -replace-, or random data functions. See my previous post about adding a random, fake string function (-ralpha-) to this set of tools.
  • Finally, if you cannot easily get the structure you need for an example from a canned or easily -generate-d dataset, you can always create a data example using -input-.
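As a minimal sketch of the first two tools, the block below loads a canned dataset and then builds a small fake one. The variable names (id, x, group), the seed, and the number of observations are made up for illustration and are not from any real data.

```stata
* Tool 1: a canned dataset that ships with Stata
sysuse auto, clear
list make mpg price in 1/5

* Tool 2: a small fake dataset (all names and values are illustrative)
clear
set seed 12345              // any seed works; it only makes the example reproducible
set obs 10
generate id = _n
generate x = runiform()     // random uniform draws on [0,1)
generate group = cond(x > 0.5, 1, 0)
replace x = . in 3          // plant a missing value to reproduce a problem
list, clean
```

Setting the seed means anyone who runs the example sees the same fake values you did.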
The idea behind -input- is that I can insert a working example into a do-file or Statalist posting that is self-contained.  Running the code below will -input- this data example into Stata's memory:

***************!
clear
inp   str14(state) pop str2(state2) divorce region marriage pop65p
"Alabama" 3893888 "AL" 26745 3 49018 440015
"Alaska" 401851 "AK" 3517 4 5361 11547
"Arizona" 2718215 "AZ" 19908 4 30223 307362
"Arkansas" 2286435 "AR" 15882 3 26513 312477
"California" 23667902 "CA" 133541 4 210864 2414250
"Georgia" 5463105 "GA" 34743 3 70638 516731
"Hawaii" 964691 "HI" 4438 4 11856 76150
"Idaho" 943935 "ID" 6596 4 13428 93680
"Illinois" 11426518 "IL" 50997 2 109823 1261885
"Indiana" 5490224 "IN" 40006 2 57853 585384
"Iowa" 2913808 "IA" 11854 2 27474 387584
"Kansas" 2363679 "KS" 13410 2 24847 306263
"Kentucky" 3660777 "KY" 16731 3 32727 409828
"Louisiana" 4205900 "LA" 18108 3 43460 404279
"Maine" 1124660 "ME" 6205 1 12040 140918
"Maryland" 4216975 "MD" 17494 3 46278 395609
"Massachusetts" 5737037 "MA" 17873 1 46273 726531
"Michigan" 9262078 "MI" 45047 2 86898 912258
end
***************!



This example is from the canned "census.dta" Stata dataset. Running the code above will create a dataset with 7 variables (2 string, 5 numeric) and 18 observations, one for each state listed.

If this were a subset of my data and I wanted to post a question or answer to Statalist using this subset of data, I could copy and paste a few lines into a do-file editor (or other text editor) and then do the following to make it a self-contained example dataset:
  1. manually add the -clear-, -input-, and -end- commands
  2. manually write the variable names and storage types (really you only need the types, such as str14, for string variables)
  3. manually (or via Ctrl+F) add double quotes around all the values of the string variables (state and state2 in the example above)
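The three manual steps can be sketched on two rows copied from the canned auto dataset. The choice of variables and rows here is only an illustration, and the -list- at the end simply confirms what was read in.

```stata
* Step 1: wrap the pasted rows in -clear-, -input-, and -end-
* Step 2: name the variables; only the string variable (make) needs a type
* Step 3: put double quotes around the string values
clear
input str18(make) price mpg
"AMC Concord" 4099 22
"AMC Pacer" 4749 17
end
list
```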

I do this for many of my Statalist questions and answers--it may not sound like much to get a dataset ready for sharing with others, but I find it tedious at times. Many times I've created these and then realized that I've missed a couple of example observations that have what I need to make the point, so I have to redo part or all of the -input- statement.

So, I finally got around to writing a program that would do all the work of writing this -input- statement from the data in memory. It's called -writeinput.ado-.

The idea behind -writeinput.ado- is that you can indicate which variables, observations, and conditions to write to a do-file, then you can include this do-file code in other do-files or Statalist postings.  The syntax is:


writeinput varlist [if] [in] using "newdofile.do"  [, Replace noCLEAR ]


where the options are defined as:

Replace -   replace an existing do-file
noCLEAR -   by default, a -clear- command is inserted at the top of the new do-file. Type noclear to leave it out when you want the -input- command to add data to existing data.
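To make the idea concrete, here is a rough sketch of how a program along these lines could be written with -file write-. It is not the code in -writeinput.ado-: the program name -sketchinput- is hypothetical, it supports only a replace option, and it assumes string values contain no embedded double quotes and that numeric values display adequately with a %12.0g format.

```stata
* Hypothetical sketch only; -sketchinput- is a made-up name, not -writeinput.ado-
capture program drop sketchinput
program define sketchinput
    version 11
    syntax varlist [if] [in] using/ [, Replace]
    marksample touse, novarlist      // honor if/in without dropping missing values
    tempname fh
    file open `fh' using `"`using'"', write text `replace'
    * Header: -clear-, then -input- with a storage type for every variable
    file write `fh' "clear" _n "input"
    foreach v of local varlist {
        local t : type `v'
        file write `fh' " `t'(`v')"
    }
    file write `fh' _n
    * Body: one line per selected observation, strings in double quotes
    forvalues i = 1/`=_N' {
        if `touse'[`i'] {
            foreach v of local varlist {
                local t : type `v'
                if substr("`t'", 1, 3) == "str" {
                    file write `fh' _char(34) (`v'[`i']) _char(34) " "
                }
                else {
                    file write `fh' %12.0g (`v'[`i']) " "
                }
            }
            file write `fh' _n
        }
    }
    file write `fh' "end" _n
    file close `fh'
end

* Example use; the file name sketch.do is arbitrary
sysuse auto, clear
sketchinput make mpg price in 1/5 using "sketch.do", replace
type "sketch.do"
```

Writing a storage type for every variable, not only the strings, keeps long and double variables from being read back as floats, at the cost of a longer header line.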






So, to obtain the example dataset I included above, you would do the following:

**************!
sysuse census.dta, clear
writeinput state pop state2 divorce region marriage pop65p in 1/22 using "test.do", r
**************!
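One way to check that the generated do-file reproduces the data is a round trip: save the selected subset, run the do-file, and compare the two with -cf-. This is only a sketch. It assumes -writeinput- is installed and that "test.do" was written from the census variables and observations requested above, and it should be run as one do-file so the temporary file persists.

```stata
* Save the subset that was passed to -writeinput-
sysuse census.dta, clear
keep state pop state2 divorce region marriage pop65p
keep in 1/22
tempfile original
save `original'

* Rebuild the data from the generated do-file and compare values
do "test.do"
describe
cf _all using `original'
```

-cf- compares values variable by variable, so a difference in value labels or storage types between the two versions would not by itself be reported as a mismatch.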


Here are some other examples:

**********************!
which writeinput
sysuse auto, clear
**
g make2 = make
**
writeinput make mpg price for in 1/5 ///
using "test1.do", r n noclear

writeinput make mpg price for ///
if for==0 using "test2.do", r n

writeinput make price for make2 price mpg ///
if for==1 in 1/50 using "test3.do", r
**
type "test3.do" //<-- type the saved file
**********************!

You can get -writeinput.ado- from here or here.

2 comments:

  1. I suppose you mean Maarten Buis instead of Maartin Buis. You are not the first (and probably not the last):

  2. Yes, sorry about that Maarten. I've corrected the misspelling in the post. Thanks.
