Thursday, December 30, 2010

Finding your way around Stata

One of the first things my students get stuck on is how to find things (e.g., files, directories, variables with particular labels or notes) in Stata.
There are many commands for finding things like files/datasets, directories, command help documentation, user commands/ado-files, variables, values, notes/chars, and so on. Some commands find only one of these things, some find several, and most of these things can be found by more than one command. It can be a bit overwhelming and confusing, and I've found that students who fall behind early in a class using Stata often get stuck at exactly this point -- particularly with directories and command ado/help files.

Of course, good use of a search engine is a key resource, but the table below gives an overview of the commands I use to find these things in Stata (this table can also be found in my Module 1 Lecture for PHPM 672). Undoubtedly, there are other commands that will do these tasks, but these are the ones that stuck with me after I started using them.
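To give a flavor of the table (this is just a partial sketch of commands in each category, not the full table; the path in the -cd- line is only an example):

```stata
* finding directories & files
pwd                       // where am I?
dir                       // what files are here?
cd "C:/mydata"            // change directory (path is just an example)

* finding commands, help files, & user-written ado-files
findit panel data         // search help files, FAQs, and SJ/SSC packages
which summarize           // locate a command's ado-file (or report it's built-in)
adopath                   // list the directories Stata searches for ado-files

* finding variables, labels, & notes in the data in memory
sysuse auto, clear
lookfor price             // search variable names AND variable labels
notes list                // list notes attached to the dataset/variables
```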

Sunday, December 26, 2010

Fun with Stata: Games for Stata Edition

Over at Mitch's "Stata Daily" blog, he describes a "hangman" game sent to him by Marek Hlavac.  I'm a sucker for non-standard uses of Stata (e.g., [1] [2] [3]), so I played with it for a while.  This also convinced me to make public one of my earliest attempts at writing a Stata ado-file/program:  -blackjack-.

The game is played by typing -blackjack- into the command window; the game then prompts the user for the amount she wants to bet (the default bankroll is $500, which replenishes after you lose it all or exit Stata) and whether to hit or stay.  It doesn't accurately represent all the rules and scenarios of a real game of blackjack (e.g., no doubling down), so don't use it to prep for your run at taking down a Vegas casino.

Fair warning that -blackjack- is visually quite ugly (the cards tend to misalign; I could have come up with a better card design for face cards than a "{Stata}" center; and, because I was learning about Stata chars, I used some ASCII symbols for suits instead of something simple like K, Q, J, A), and I've run into the occasional bug that I haven't taken the time to investigate & fix.
One thing I like about Hlavac's -hangman- is how he uses subprograms to define and display the stages of building the hangman.  I wish I had thought of this for displaying my cards -- it probably would have saved a lot of copying/pasting of -if- blocks displaying the various card configurations.

Writing/tinkering with the ado-file for this game probably provided more amusement for me than actually playing it. It's a great mindless activity to do if you're doing some Stata coding and need a break.    Check out -blackjack- here.

At the Stata Daily blog, Nick J. Cox comments about some other Stata games/simulations available from SSC: -chaos- and -irrepro-. Also, I mentioned the similar programs -dice-, -cards- (which I cannot get to work in Stata 11), and -heads- from UCLA's Stata page; see:
****
net install dice, ///
from(http://www.ats.ucla.edu/stat/stata/ado/teach) ///
replace all
****
All these are fun (and possibly instructive) programs for Stata.

Monday, December 20, 2010

Creating example datasets for collaboration with other Stata users

I'm lucky to be in a research environment where most of my colleagues and students use Stata.  Also, I regularly participate on Statalist.  Both of these have helped push me to periodically refine my habits when it comes to communicating about Stata.

When it comes to asking questions on Statalist, I've tried to stick closely to the Statalist FAQ and other tips mentioned by William Gould on the Stata NEC blog.  However, for answering questions on Statalist, I find Maarten Buis's page on his Statalist postings especially helpful.

I've learned a lot from Maarten's FAQ about
(1) the types of questions that are not obvious to others on Statalist (and this tends to translate over to my students & colleagues as well) and
(2) ways to minimize this confusion by doing things as simple as creating clearly marked, self-contained working examples of code, or using comments to create a roadmap for the code in an example and to avoid problems with line-wrapped code.

When it comes to creating clearly marked, self-contained examples for others, there are a couple of standard tools:
  • Using a canned Stata dataset for the example (as Maarten mentions)
  • Creating a fake dataset using a variety of -generate-, -replace-, or random data functions.  See my previous post about adding a random, fake string function (-ralpha-) to this set of tools.
  • Finally, if you cannot easily get the structure you need for an example from a canned or easily -generate-d dataset, you can always create a data example using -input-.
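For the -generate-/random-functions route, a minimal sketch (the variable names and seed here are arbitrary, just for illustration) might be:

```stata
clear
set obs 20
set seed 12345                        // makes the fake data reproducible
gen id = _n                           // simple identifier
gen x  = runiform()                   // uniform on [0,1)
gen y  = rnormal(0, 2)                // normal, mean 0 & sd 2
gen byte group = ceil(3*runiform())   // random integer from 1 to 3
```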
The idea behind -input- is that I can insert a working example into a do-file or Statalist posting that is self-contained.  Running the code below will -input- this data example into Stata's memory:

***************!
clear
inp   str14(state) pop str2(state2) divorce region marriage pop65p
"Alabama" 3893888 "AL" 26745 3 49018 440015
"Alaska" 401851 "AK" 3517 4 5361 11547
"Arizona" 2718215 "AZ" 19908 4 30223 307362
"Arkansas" 2286435 "AR" 15882 3 26513 312477
"California" 23667902 "CA" 133541 4 210864 2414250
"Georgia" 5463105 "GA" 34743 3 70638 516731
"Hawaii" 964691 "HI" 4438 4 11856 76150
"Idaho" 943935 "ID" 6596 4 13428 93680
"Illinois" 11426518 "IL" 50997 2 109823 1261885
"Indiana" 5490224 "IN" 40006 2 57853 585384
"Iowa" 2913808 "IA" 11854 2 27474 387584
"Kansas" 2363679 "KS" 13410 2 24847 306263
"Kentucky" 3660777 "KY" 16731 3 32727 409828
"Louisiana" 4205900 "LA" 18108 3 43460 404279
"Maine" 1124660 "ME" 6205 1 12040 140918
"Maryland" 4216975 "MD" 17494 3 46278 395609
"Massachusetts" 5737037 "MA" 17873 1 46273 726531
"Michigan" 9262078 "MI" 45047 2 86898 912258
end
***************!

Sunday, December 19, 2010

Statistics Software Showdown: Google Ngram

Using Google's Ngram Viewer, here's the breakdown of Stata vs. SAS vs. SPSS.  







Stata didn't do as well as I hoped, but in taking a closer look there are at least a couple of reasons to be optimistic about Stata's prospects.
(1)  SAS is benefiting from lots of books written about the British Special Air Service (SAS).  
(2) As of yet, there doesn't appear to be a way to refine these searches with Boolean search parameters.  If there were, we could have searched for "SAS -British" or "SPSS | PASW", etc.
(3) I couldn't find a way to search for the software 'R' using Ngram.
(4)  Stata seems to have as much, if not more, web presence / resources as the other software packages.  
Using a regular google search:

"Stata" + statistical software  22.3 million pages
"SPSS" + statistical software  28.2 million pages
"SAS" + statistical software  17.6 million pages
"PASW" + statistical software  53K pages
"R" + statistical software:  7.7 million pages

GoogleBattle.com has Stata ahead of SPSS by a count of 35 million to 13 million pages (besides leaving out the "statistical software" part, I'm not sure what else explains the difference).

(5)  Finally, while more books are being published in modern times (especially with the increasing output of the Stata Press), this graph at least shows an uptick in Stata presence in Google Books since 1990:




________________
Other fun Google NGram Viewer comparisons include:

Probit v. Logit (is that a CDF?)





Sunday, December 12, 2010

Fun with Stata: Running Stata from your iPhone

There are literally tens of people out there in the world that have at some point or another thought "I really wish I could run something in Stata right now on my iPhone." Well, I recently killed some time making that a possibility.

In order to get results from Stata on your iPhone anywhere/anytime, this process requires 5 components:
(1) Stata (10 or later) installed on a Mac OSX (10.5 or later) that is always connected to the internet
(2) A Dropbox account linked to your Mac that has Stata installed
(3) iStata.scpt Applescript file to manage files put in Dropbox
(4) iStata.ado to run the file, log the output, and put it back in Dropbox
(5) the free application Plaintext for iPhone (or some equivalent) to write .do files and view the results from your iPhone

Setup:
You'll need to save the iStata.scpt and iStata.do files into the folders referenced in these scripts within your Dropbox folder on your Mac. Really, you can place these files/folders anywhere in your Dropbox that you'd like, but then you need to change the paths in these scripts to point to the proper locations.


The basic idea of the workflow is illustrated below. Basically, the process is to create the file "torun.txt" on your iPhone in the app Plaintext and then save it.  The scripts do all the work and send back a new file called "Results.txt" containing the logfile of what you ran in Stata.

Thursday, November 25, 2010

Analyzing Stimulus Funds Data

A report from Brito & De Rugy at the Mercatus Center at GMU from earlier this year reports that (emphasis added):
"An OLS regression analysis controlling for the district representative’s political party, tenure in office, leadership position, membership on the appropriations committee, as well as for the district’s unemployment, mean income (i.e., the average income of a given wage earner in the district), and the percentage of employed persons working in the construction sector in 2008 finds that having a Republican representative decreases a district’s stimulus award by 24 percent.   This effect is statistically significant at the p < .001 level."
It's an interesting and useful paper.  They've put a lot of effort into compiling data on the federal stimulus outlays and some other political variables, even correcting & error-checking the data they got from other sources. I like that they have provided the actual Stata dataset they compiled (and they've even copied/pasted the Stata output tables into the document), but I thought it would be interesting to run the models again with some interaction terms and clustering, as well as adding a couple of other political variables commonly used in the literature on Congressional pork/spending.  Looking past the chatter about the alleged political leanings of the Mercatus Center, it seemed to me that a regression model concluding that the stimulus funds were doled out along partisan lines should probably be examined a bit more closely.


First of all, I wanted to re-run their model with an interaction term for being a Republican in a leadership position, in a marginal seat, and on the appropriations committee (republican*marginally*leadership*accountability).  My guess is that the effect of being Republican might differ for those in leadership, marginal, and/or appropriations committee positions, and if Democrats do have an advantage, it's worth asking which Democrats got the biggest share (leadership? safe seats? appropriations committee members?) -- that is, there's a bit more to the political calculus of stimulus spending than simple party lines.
Second, I wanted to add a couple of other political variables, such as: 
  • the number and size ( in dollars) of earmarks secured by each member (No. of Earmarks & log_ttlearmarks)
  • the number of solo-authored legislation by each member (solo)
  • the number of votes and % of Democratic vote in each member's district in the prior election (log_ttlvotes & democrat_pctvote)
  • whether the governor, upper, and lower house of the member's state delegation are Republican controlled or not (governor_republican2, lower/upper_legisl_republican2)



Friday, November 19, 2010

Generating Random/Fake String Data in Stata

When posting to Statalist I usually try to provide an example of my question or answer using the built-in "auto.dta" dataset, the -input- command to manually create a dataset, or by generating fake, random data using Stata functions.  To create fake, random numeric data, you can use any of the random-number generators detailed in -help random_numbers- (such as runiform()), but there is no random generator for alphabetic characters (A-Z or a-z).

Sometimes it's useful to illustrate to Statalist or students in class how to manipulate the dataset if it includes some kind of string variable that you want to use to identify panels or illustrate how to -encode- variables, etc.  (or maybe you just want a random string generator because you lost your dice for playing Scattergories)

-ralpha- generates random string characters for Stata.    In many cases, you could generate the numeric variable and -tostring- it, but if you need string (alpha) characters, this package presents an easy way to obtain them.


Syntax:

ralpha [newvarname] [, Loweronly Upperonly Range(string)]

Options:

upper - random alpha from uppercase letters only
lower - default; random alpha from lowercase letters only
range() -   examples include: A/Z,  a/z, A/z(uppercase is first), a/c, A/G
            - numerical range stored in `r(num_range)'
If [newvarname] is left blank, the variable "ralpha" is created (if it doesn't already exist). 

Examples:

**********************!
which ralpha                       //<--  see instructions

//Example 1 //
clear 
set obs 20
ralpha                           //nothing specified-new var named "ralpha" by default
ralpha lowerdefault,             //no options specified - default is lowercase
ralpha upper, upperonly
ralpha lower, low
li


//Example 2: Using the range() option //
**Note: the full range runs from A to z (uppercase first; e.g., range(A/z))
clear
set obs 20
ralpha somerange, range(A/z)
ralpha, range(B/g)
    di in white "`r(num_range)'"      //Here's numerical range equiv. of "B/g"


//Example 3: create random words/strings in a loop //
clear
set obs 50
g newword = ""
loc lnewword 5                  //how many letters in new word?
forval n = 1/`lnewword' {
ralpha new, upp
replace newword = newword + new
drop new
}

**make newword proper**
replace newword = proper(newword)

**********************!






Tuesday, November 2, 2010

An Update to Dual Finder.app

Previously I described an application/script that enables you to open two or more Finder windows, stacked side by side, on Mac OSX (10.4 or later).  This update changes the proportions of the windows and the display of the files to something closer to Path Finder 5; that is, the Finder window stacked on the left (Window 1 below) is set to "list view"; Finder Window 2 (right) is set to "column view."  Adjust the size of the Finder windows using the "hei" (height) and "wid" (width) properties.  The script will also "hide" all other open windows.

You can download the .app version of this file (below) and then drag it to the dock next to your Finder.app icon for a convenient way to open and position multiple Finder windows.

Here's the .script:

property wid : 10
property hei : 40
set the startup_disk to (path to startup disk)
tell application "Finder"
close every Finder window
activate
set visible of (every process whose visible is true and frontmost is false) to false
         ---get the target of the 1st Finder window
                 --Window 1--
set win to make new Finder window
set the target of win to the folder "Desktop" of the home
set the bounds of win to {hei, wid, (hei + 650), (wid + 750)}
set the current view of win to list view
                 --Window 2--
set win to make new Finder window
set the target of win to the folder "Documents" of the home
set the bounds of win to {(hei + 660), (wid), (hei + 1400), (wid + 750)}
set the current view of win to column view
select the last Finder window
        ----close the first Finder window
end tell



Tuesday, October 19, 2010

Fun with Stata: Google for Stata

I use the -findit- command from the Stata command line/window dozens of times a day.  However, I've always wished there was a way to have -findit- search for more than just Stata commands or materials on the web, so that I wouldn't have to leave the Stata window just to Google something.


Also, I've always wanted -findit- to search Statalist archives for me as part of the process of looking for an answer to some burning Stata-related question.  Now, you can stay tethered to Stata and Google with impunity without ever leaving the safety of your Stata command line.

The ado-file described below allows you to Google something from the Stata command line and have the results either returned in the Stata "Results" window (the default) or exported to your system's browser.  The titles of the result webpages are returned, and the links in blue are clickable to navigate to that page in your OS's default browser.



The syntax is:

 google [ , Filetype(string) Site(string) STATAlist  Browser  Chrome]

where [options] are defined as:

Filetype  -  Specify the filetype to search for.  E.g., pdf, txt, doc, gif, etc.
Site  -    Specify the site/domain to search within.  E.g., www.google.com, http://www.nyt.com
Statalist -    Search within Statalist archives only (via http://www.stata.com/statalist/archive/)
Browser    -    Issues Google search to system's default browser rather than the Results Window (default)
Chrome  -  (For Mac OSX only) Will redirect browser command to Chrome Browser, if installed

-google- requires that the -intext- package (downloadable from SSC using -ssc install intext-) is installed.  -google- will auto-detect and install -intext- from web-aware Stata during its initial run.

Examples:



**1.  Most useful if issued from the Stata Command Window**
google my test search
google "my test search"
google my+test+search

/*
Then you can click any of the returned results to open the webpage in your browser.
It only returns the top 10 results right now, but I'm trying to improve the speed of it before I allow it to render more search results.
*/


**2.  It also understands how to search for certain file types**
google test search, filetype(pdf)
google test search, file(xls)

**3.   and you can send it to open in your default browser instead**
google test+search, browser
google test search, browser filetype(txt)

**4.  search Statalist Archives**
google "ttest error", stata


**5.  search any website/domain**
google "top headlines", site(www.nyt.com)

**6.  or for a MacOSX with Chrome installed**
google test search, chrome




See my software page for more on this code & other software I've written for Stata and Mac OSX.



Wednesday, August 25, 2010

More on Working with Field-Tagged Data in Stata

I’ve been doing a lot with field-tagged data recently (some of it is in HTML tables, which I am still struggling with).

Below is the code for a (still messy and buggy) program called removetags.ado, a working ado-file used to pull information from field-tagged data from within Stata.

The biggest challenges for me have been (1) how to adapt the code when some fields span multiple rows without field tags to identify them, and (2) how to deal with long entries in Stata (since string variables are limited to 244 characters).

I think I’ve solved (1) for most cases (see lines 71-90 in the ado-file linked below), but it’s clunky and I’m sure I’ll need to adapt it further when I run across things that make it choke. I am currently working on solving (2).

[[So far, my solution involves identifying the rows that are longer than 244 chars using -file read-, and then reading those rows into Stata as characteristics associated with each record. The advantage of this approach is that you can still manipulate the macros using extended_fcns and you can keep the association of the long strings with the field-tagged records when you merge or export the data...but I'm still having issues with the -file read-/-file write- commands for manipulating the long strings. I've read these posts on grepmerge [1] and Gabi Huiber's posts on doing something similar in Mata [2], but I'm not much of a programmer and I don't know much about Mata, so this will be my attempt at a workaround.]]
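A bare-bones sketch of that -file read- idea (the filename and char names here are hypothetical, just to illustrate the approach):

```stata
* flag rows longer than 244 chars & stash them in dataset characteristics
file open fh using "records.txt", read text
file read fh line
local i = 0
while r(eof) == 0 {
    local ++i
    if strlen(`"`macval(line)'"') > 244 {
        char _dta[longrow`i'] `"`macval(line)'"'
    }
    file read fh line
}
file close fh
char list _dta[]                     // see what was stashed
```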
Ignoring the problem that long strings will be truncated in Stata, below is some code that gets me most of the way there.

1. To get the ado file, visit this link from my dropbox.

2. To test this ado-file, I’ve put two sample .txt files in my Dropbox (referenced in the code below).
The first one is just the WOS record mentioned above (but I’ve -expand-ed it to 3 copies).
The second example is from EndNote: when you export your citations to a text file, it becomes field-tagged data, so this is another good test example. (If any of these links are dead, please email me for a copy.)

3. Once you’ve downloaded and copied the .ado file into the proper location, try these examples (there’s no help file, so some instructions are included below):

***************!
Instructions:
  /*
Data must be field-tagged records where each record spans multiple rows

Options:
recordstart() must contain the field tag that begins 
each record
fieldtagdelimiter() must contain the char that delimits 
the field tag (default is a space char)
keep() is optional list of fieldtags that will be 
kept in reshaped file
removechars() is an optional list of problematic 
non-alphanumeric chars that need to be removed from 
your fieldtags
((note: use the -charlist- or -ascii- programs (from SSC) 
to help identify these, if they exist))
  */


**examples**
clear
which removetags

//example 1//
removetags using "http://dl.dropbox.com/u/428249/websci.txt", ///
rec("PT") fieldtagd(" ") keep(AU UT JI CR*)
li

//example 2//
removetags using "http://dl.dropbox.com/u/428249/endnote.txt", ///
rec("Reference Type") fieldtagd(":")
li record Year Author Title in 1/10
***************!


So, in example 1 (WOS records) each record starts with the field tag “PT”, the field tags are delimited by the first space character, and I want to keep the vars AU, UT, JI, and the CR’s.
In example 2, “Reference Type” denotes the start of each record entry, the field tags are delimited by a colon, and I keep all the vars.
Maybe this is useful for someone, but more importantly, I’d like to hear if anyone tries this out and has any feedback, comments, or help on how to improve this.

Sunday, June 20, 2010

Dual_Finder.app: An OSX Script to Open Multiple Finder Windows with a Single Click

Want to be able to open 2 Finder windows side-by-side by clicking an icon on the dock?

To do so:

1. save this applescript to your “~/applications/applescript” folder;
2. open this script with "Applescript Editor.app";
3. modify the “hei”ght and “wid”th to match your monitor;
4. “save as” an application.
5. drag the application (.app) file to your dock next to Finder and give it a new icon.

Here's the script:

property wid : 1200
property hei : 850
set the startup_disk to (path to startup disk)
tell application "Finder"
 activate
 set visible of (every process whose visible is true and frontmost is false) ¬
  to false
 -- Window 1--
 set win to make new Finder window
 set the target of win to the startup_disk
 set the bounds of win to {0, (hei * 0.056) div 1, ¬
  wid, (hei * 0.55) div 1}
 set the current view of win to column view
 -- Window 2--
 set win to make new Finder window
 set the target of win to the startup_disk
 set the bounds of win to {0, (hei * 0.58) div 1, ¬
  wid, hei}
 set the current view of win to column view
 select the last Finder window
end tell

Wednesday, May 19, 2010

Automatically Building Summary Tables based on Variable Type

When working with large datasets, creating & exporting summary/descriptive tables can be a pain, even if you take advantage of looping over the variables in combination with table-exporting programs like -tabout-.  The reason is that you want different kinds of summary tables for different kinds of variables.  For categorical string variables you probably want a cross-tab.  Creating tables for numeric variables is less straightforward.  For continuous numeric variables with many categories, you probably want a table with the mean, sd, min, max, and so on.  For numeric variables that are categorical or otherwise limited, you may want a cross-tab.

So, while you could use -ds- to detect the "type" (e.g., string or numeric) of variable you have, this could be misleading if a numeric categorical variable has many values (say, 50+).  Another approach is to decide which type of table to create & export based on the number of unique values of a numeric variable.  Here we can use the -egenmore- (from SSC) function nvals() after detecting the variable type using -ds-.

Here's an example where the variables with more than 50 unique categories are treated as continuous variables:






*-------------------------------------------------BEGIN EXAMPLE
global files "yourfilehere"

** from web-aware Stata:
ssc install fre
ssc install tabout
ssc install egenmore
**

webuse auto, clear

ds, has(type numeric)
local all `r(varlist)'
di "`all'"
foreach v in `all' {
 egen zz = nvals(`v')
  if zz[1] > 50 & !mi(zz[1])  {    // nvals() is constant across obs, so test the first
di "----------------------------------"
di " Descriptives for variable:   `v'  "
table for, c(n `v' mean `v' sd `v' min `v' max `v' ) col 
di "totals-->"
sum `v'
qui {
tabout for using "$files\descriptives_`v'.txt", /*
*/ format(3) sum replace c(mean `v' sd `v' min `v' max `v') /*
*/ npos(lab) nlab((n=#)) h1(title here)
      }  
  }                            
   

 else {
di "----------------------------------"
di " Descriptives for c. variable:   `v'  "
fre `v'  , nowrap des t(10)
fre `v' using "$files\onewaydescriptives_`v'.txt" , /*
*/ all nowrap des replace
tab2 `v' for, col chi2 miss
qui {
tabout `v' for using "$files\descriptives_`v'.txt", /*
*/ replace c( col) f(2p)  npos(lab) nlab( (n = #) ) /* 
*/ ptotal(all)  h1(title here)
  }
di "~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~"
    }
  cap drop zz
}

ds, has(type string)
local all `r(varlist)'
di "`all'"

foreach v in `all' {
di "----------------------------------"
di " Descriptives for c. variable:   `v'  "
fre `v'  , nowrap des t(10)
fre `v' using "$files\onewaydescriptives_`v'.txt" , /*
*/ all nowrap des replace
tab2 `v' for, col chi2 miss
 qui {
tabout `v' for using "$files\descriptives_`v'.txt", /*
*/ replace c( col) f(2p)  npos(lab) nlab( (n = #) ) /*
*/   ptotal(all)  h1(title here)
        }
   di "~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~"
    } 
*-------------------------------------------------END EXAMPLE



Sunday, April 4, 2010

Mad Men: A Foucaultian take

from potlatch

Report indicates that "Toyota Accelerator Data Skewed Toward Elderly"....maybe not

 "An anonymous reader passes along this discussion on the data for the Toyota accelerator problem, from a few weeks back. (Here's a Google spreadsheet of the data.) 'Several things are striking. First, the age distribution really is extremely skewed. The overwhelming majority are over 55. Here's what else you notice: a slight majority of the incidents involved someone either parking, pulling out of a parking space, in stop and go traffic, at a light or stop sign... in other words, probably starting up from a complete stop.'  Read more of this story at Slashdot."


I don't think this is really that striking... this is only out of an N of 30 (plus 5 incidents with individuals of 'unknown' age, which would be enough to shift the distribution if they were all younger), and there's nothing here that accounts for any other factors that could have been involved in the accidents.  Importantly, what about the age distribution of individuals who are most likely to own a late-model Toyota--are these accidents simply a reflection of that distribution?

Sunday, February 21, 2010

Update: Encoding Variables and Labeling Values Consistently (and Efficiently) in Stata

On 2010-02-09,  I posted this example on strategies for labeling values for lots of variables efficiently in Stata.

Today I discovered a function in NJC's -egenmore- extension of -egen- that I think is easier to use and, in many cases, faster to implement than the combination of -labutil- and -multencode- that I had suggested in my Feb 9 posting; so, to extend my previous example, here's how we could label those variables using -egen- and the function ston() (scroll to the bottom to see the UPDATED code):


*-------------------------------------------------BEGIN CODE
clear
**this first bloc will create a fake dataset, run it all together**
input str12 region regioncode str20 quest1 str20 quest2 str20 quest3
"Southwest" 1 "Strongly Agree" "Strongly Disagree" "Disagree"
"West" 2 "Agree" "Neutral" "Agree"
"North" 3 "Disagree" "Disagree" "Strongly Disagree"
"Northwest" 5 "Disagree" "Agree" "Strongly Agree"
"East" 4 "Strongly Disagree" "Strongly Agree" "Agree"
"South" 9 "Neutral" "Agree" "Agreee"
end
//1. Create labeled REGION variable
/*
If we -encode- region it would not line up with regioncode
because encode operates in alphabetical order, for example:
*/
encode region, gen(region2) label(region2)
fre region2   //<-- these values don't match regioncode
drop region2


/* 
INSTEAD, we use -labmask- to quickly assign the values in 
region to the regioncodes
*/
ssc install labutil
labmask regioncode, values(region)
fre regioncode


//2. Creating comparable survey question scales
/*
We want all the survey questions to be on the same scale 
so that we can compare them in a model or table
-encode- can help us here with quest1 and quest2 because they
have the same categories, but quest3 has different categories
(it's missing "Neutral" and "Agree" is spelled differently), so we can
either (1) use -replace- to define the numeric categories for these
survey questions and then relabel them with -label define- and
-label values-, or (2) use -multencode- after fixing the misspelled
"Agree" value in quest3
*/
replace quest3 = "Agree" if quest3=="Agreee"
**
ssc install multencode
multencode quest1-quest3, gen(e_quest1-e_quest3)
label li
fre e_*
/* 
The categories are labeled properly, but the scale isn't in
order--we want it to increase in satisfaction as it moves from
1 to 5
*/
 //-labvalch- is also from -labutil-
labvalch quest1, f(1 2 3 4 5) t(4 2 3 5 1)
label li
fre e_*
***UPDATE***
**using egenmore & the "ston()" function:
ssc install egenmore
forval n = 1/3 {
 egen ee_quest`n' = ston(quest`n'), to(1/5) /*
 */ from("Strongly Disagree" Disagree Neutral Agree "Strongly Agree")
 label val ee_quest`n' quest1
 **note: val label "quest1" already defined, if not, 
 **you'll need to define the value labels
 }
li quest1 ee_quest1 
*-------------------------------------------------END CODE

Thursday, February 18, 2010

Merging international / cross-national data using Stata

Giulia Catini, Ugo Panizza, & Carol Saade have published "Macro Data 4 Stata" with the codes to link up many popular international and cross-national economic datasets.   
This is similar to a Stata adofile on SSC called -kountry- which cross-links other international political science or economic datasets.   Taken together, this provides a great resource for working with international datasets.

Monday, February 15, 2010

"Aero Snap" with OSX

A colleague's new favorite critique of the Mac is that it doesn't have the "Aero Snap" feature that MS has been touting in their recent ads (where you can drag a window to the left or right of the screen and it will become a half window on that side of the screen).  I admit this feature does look handy (in lieu of simply dragging the corner to resize the window)...here's a freeware app that will do what Aero Snap does on the Mac (and it will do it for the top half of the screen and the bottom half, in addition to the snap-left and snap-right features in Win7): TwoUp

The only downside (kind of) is that it operates on keyboard shortcuts.  There is a paid version of this software, called Cinch, that will let you simply drag the window for the snap feature (at $7 it's probably a bargain, but I like to stick to freeware apps).

I like using shortcuts, and my hands are at the keyboard more than the mouse, so this works for me, but it might not work for others. (To get mouse gestures like Aero, you could use this with BTT.)  Speaking of BTT, another great app that I'm using a lot is called SecondBar, which puts a second menu bar on an extra monitor so that you don't have to go back to the main screen to click the menus.

Tuesday, February 9, 2010

Encoding Variables and Labeling Values Consistently (and Efficiently) in Stata

In PHPM 672 this week, we are working with -label define-, -label variable-, and -label values- to label data.  You can use these (in combination with commands like -recode- and -replace-) to get a consistently labeled dataset, but there are three utilities that can get you there more efficiently:  -encode-, -multencode- (from SSC), and a suite of utilities in -labutil- (from SSC).
The goal is to get to a common value assignment across all similar variables (example:  many variables that use the same categorical scale of responses).  You can simply use -encode- if you are sure that the string variables that you are using to build categorical variables with value labels all have the same categories that are all spelled the exact same way.  If not, you will need to use -multencode- to deal with missing categories and -replace- or the subinstr() function to correct any spelling deviations.  Finally, once you've labeled the categorical variables consistently, you want to put them in an order that makes sense (e.g., lowest to highest) using -labvalch-.

Here's an example of this process with a sample dataset.  Copy and paste this snippet into a do-file & run it to see how this process works:


*-------------------------------------------------BEGIN CODE
clear
**this first block will create a fake dataset, run it all together**
input str12 region regioncode str20 quest1 str20 quest2 str20 quest3
"Southwest" 1 "Strongly Agree" "Strongly Disagree" "Disagree"
"West" 2 "Agree" "Neutral" "Agree"
"North" 3 "Disagree" "Disagree" "Strongly Disagree"
"Northwest" 5 "Disagree" "Agree" "Strongly Agree"
"East" 4 "Strongly Disagree" "Strongly Agree" "Agree"
"South" 9 "Neutral" "Agree" "Agreee"
end
//1. Create labeled REGION variable
/*
If we -encode- region it would not line up with regioncode
because encode operates in alphabetical order, for example:
*/
encode region, gen(region2) label(region2)
ssc install fre   //<-- -fre- is a user command from SSC
fre region2   //<-- these values don't match regioncode
drop region2


/* 
INSTEAD, we use -labmask- to quickly assign the values in 
region to the regioncodes
*/
ssc install labutil
labmask regioncode, values(region)
fre regioncode


//2. Creating comparable survey question scales
/*
We want all the survey questions to be on the same scale 
so that we can compare them in a model or table
-encode- can help us here with quest 1 and 2 because they
have the same categories, but quest3 has different categories 
(it's missing "neutral" and "agree" is spelled differently, so we could
either (1) use replace to define the numeric categoreis for these 
survey questions and then relabel them with -label define- and
-label values-, or (2) use -multencode- after fixing the misspelled 
"agree" value in quest3 
*/
replace quest3 = "Agree" if quest3=="Agreee"
**
ssc install multencode
multencode quest1-quest3, gen(e_quest1-e_quest3)
label li
fre e_*
/* 
The categories are labeled properly, but the scale isn't in
order--we want it to increase in satisfaction as it moves from
1 to 5
*/
 //-labvalch- is also from -labutil-
labvalch quest1, f(1 2 3 4 5) t(4 2 3 5 1)
label li
fre e_*
*-------------------------------------------------END CODE

Sunday, February 7, 2010

Quick Links

1. This study demonstrates how reducing information/effort obstacles for the college application process increases the college attendance rate.
At PPRI we're seeing some of this same pattern in Texas high schools where we've helped evaluate various high school reform grants.  Campus counselors and other staff are sitting down with HS seniors to help them fill out the FAFSA and other application materials, and many of them think that this is helping to improve their college attendance rates.  The hard part, however, is knowing what really happens when those seniors exit high school and, if they do start college, how likely they are to actually complete a degree.

2. Yet another reason to learn Bayesian Modeling / Analysis 

3. PhD Comics' takedown of news media poll reporting

4. Is the spurious regression problem spurious? [1] [2]