Showing posts from 2010

Finding your way around Stata

One of the things my students first get stuck on is how to find things (e.g., files, directories, variables with particular labels or notes) in Stata. There are a lot of commands for finding files/datasets, directories, command help documentation, user commands/ado-files, variables, values, notes/chars, etc. Some commands find only one of these things, some can find several, and most of these things can be found by more than one command. It can be overwhelming and confusing, and I've found that students who fall behind early in a class using Stata often get stuck at the point of finding these things -- particularly directories and command ado/help files. Of course, good use of a search engine is a key resource, but the table below gives an overview of the commands I use to find these things in Stata (this table can also be found in my Module 1 Lecture for PHPM 672). Undoubtedly, there are other commands that wi

Fun with Stata: Games for Stata Edition

Over at Mitch's "Stata Daily" blog, he describes a "hangman" game sent to him by Marek Hlavac. I'm a sucker for non-standard uses of Stata (e.g., [ 1 ] [ 2 ] [ 3 ]), so I played with it for a while. This also convinced me to make public one of my earliest attempts at writing a Stata ado-file/program: -blackjack-. The game is played by typing -blackjack- into the command window; the game then prompts the user for the amount she wants to bet (default is $500, which replenishes after you lose it all or you exit Stata) and whether to hit or stay. It doesn't accurately represent all the rules and scenarios of a real game of blackjack (e.g., no doubling down), so don't use it to prep for your run at taking down a Vegas casino. Fair warning that -blackjack- is visually quite ugly (the cards tend to misalign; I could have come up with a better card design for face cards than a "{ Stata }" center; and (because I was learning about Stat

Statistics Software Showdown: Google Ngram

Using Google's Ngram Viewer, here's the breakdown of Stata vs. SAS vs. SPSS. Stata didn't do as well as I hoped, but in taking a closer look there are at least a couple of reasons to be optimistic about Stata's prospects.
(1) SAS is benefiting from lots of books written about the British Special Air Service (SAS).
(2) As of yet, there doesn't appear to be a way to refine these searches with boolean search parameters. If there were, we could have searched for "SAS -British" or "SPSS | PASW", etc.
(3) I couldn't find a way to search for the software 'R' using Ngram.
(4) Stata seems to have as much, if not more, web presence / resources as the other software packages. Using a regular Google search:
"Stata" + statistical software  22.3 million pages
"SPSS" + statistical software  28.2 million pages
"SAS" + statistical software  17.6 million pages
"PASW" + statistical software  53K pa

Fun with Stata: Running Stata from your iPhone

There are literally tens of people out there in the world that have at some point or another thought "I really wish I could run something in Stata right now on my iPhone." Well, I recently killed some time making that a possibility. In order to get results from Stata on your iPhone anywhere/anytime, this process requires 5 components:
(1) Stata (10 or later) installed on a Mac OSX (10.5 or later) that is always connected to the internet
(2) A Dropbox account linked to your Mac that has Stata installed
(3) iStata.scpt Applescript file to manage files put in Dropbox
(4) iStata.ado to run the file, log the output, and put it back in Dropbox
(5) the free application Plaintext for iPhone (or some equivalent) to write and view .do files written and run from your iPhone
Setup: You'll need to save the iStata.scpt and files into the folders referenced in these scripts in your Dropbox folders on your Mac OSX. Really, you can place these files/folders an

Analyzing Stimulus Funds Data

A report from Brito & De Rugy at the Mercatus Center at GMU from earlier this year reports that (emphasis added): "An OLS regression analysis controlling for the district representative’s political party, tenure in office, leadership position, membership on the appropriations committee, as well as for the district’s unemployment, mean income (i.e., the average income of a given wage earner in the district), and the percentage of employed persons working in the construction sector in 2008 finds that  having a Republican representative decreases a district’s stimulus award by 24 percent .   This effect is statistically significant at the p < .001 level." It's an interesting and useful paper.  They've put a lot of effort into compiling some data on the federal stimulus outlays and some other political variables as well as even correcting & error-checking the data they got from other sources. I like that they have provided the actual Stata dataset that they c

Generating Random/Fake String Data in Stata

When posting to Statalist  I usually try to provide an example of my question or answer using the in-built "auto.dta" dataset, the -input- command to manually create a dataset,  or by generating fake, random data using Stata functions.  To create fake, random numeric data, you can use any of the random number generators detailed in -help random_numbers- (such as runiform), but there is no random generator for alphabet characters (A-Z or a-z).   Sometimes it's useful to illustrate to Statalist or students in class how to manipulate the dataset if it includes some kind of string variable that you want to use to identify panels or illustrate how to -encode- variables, etc.  (or maybe you just want a random string generator because you lost your dice for playing Scattergories ) -ralpha- generates random string characters for Stata.    In many cases, you could generate the numeric variable and -tostring- it, but if you need string (alpha) characters, this package presents an e

An Update to Dual

Previously I described an application/script that enabled you to open two or more Finder windows, stacked side-by-side, on Mac OSX (10.4 or later). This update changes the proportion of the windows and the display of the files to something closer to Path Finder 5; that is, the Finder window stacked on the left (Window 1 below) is set to "list view"; Finder Window 2 (right) is set to "column view." Adjust the size of the Finder windows using the "hei" (height) and "wid" (width) properties. The script will also "hide" all other open windows. You can download the .app version of this file (below) and then drag it to the dock next to your Finder icon for a convenient way to open and position multiple Finder windows. Here's the .script:
property wid : 10
property hei : 40
set the startup_disk to (path to startup disk)
tell application "Finder"
    close every Finder window
    activate
    set visi

StataCorp has a blog

Stata now has an official blog -- Not Elsewhere Classified  -- check it out.

Fun with Stata: Google for Stata

I use the -findit- command from the Stata command line/window dozens of times a day. However, I've always wished that there was a way to have -findit- search for more than just Stata commands or materials on the web so that I wouldn't have to leave the Stata window just to Google something. Also, I've always wanted -findit- to search Statalist archives for me as part of the process of looking for an answer to some burning Stata-related question. Now, you can stay tethered to Stata and Google with impunity without ever leaving the safety of your Stata command line. The ado-file described below allows you to Google something from the Stata command line and have the results either returned to you in the Stata "Results" window (default) or exported to your system's browser. The titles of the result webpages are returned, and the links in blue are clickable to navigate to that page in your OS default browser. The syntax is:  google [ , Filetype(string) Site(stri

More on Working with Field-Tagged Data in Stata

I’ve been doing a lot with field-tagged data recently (some of it is in html tables, which I am still struggling with). Below is the code for a (still messy and buggy) program called removetags.ado, which is a working .ado file used to pull information from field-tagged data from within Stata. The biggest challenges for me have been (1) how to adapt the code when some fields span multiple rows without field tags to identify them and (2) how to deal with long entries in Stata (since there is a 244-character string limit). I think I’ve solved (1) for most cases (see lines 71-90 in the ado-file linked below), but it’s clunky and I’m sure I’ll need to adapt it further when I run across things that make it choke. However, I am currently working on solving (2). So far, my solution involves identifying the rows that are longer than 244 chars using -file read-, and then reading those rows into Stata as characteristics associated with each record. The advantage of this approach is that you can st

An OSX Script to Open Multiple Finder Windows with a Single Click

Want to be able to open 2 Finder windows side-by-side by clicking an icon on the dock? To do so:
1. save this applescript to your “ ~/applications/applescript ” folder;
2. open this script with "Applescript";
3. modify the “hei”ght and “wid”th to match your monitor;
4. “save as” an application;
5. drag the application (.app) file to your dock next to Finder and give it a new icon.
Here's the script:
property wid : 1200
property hei : 850
set the startup_disk to (path to startup disk)
tell application "Finder"
    activate
    set visible of (every process whose visible is true and frontmost is false) ¬
        to false
    -- Window 1--
    set win to make new Finder window
    set the target of win to the startup_disk
    set the bounds of win to {0, (hei * 0.056) div 1, ¬
        wid, (hei * 0.55) div 1}
    set the current view of win to column view
    -- Window 2--
    set win to make new Finder window
    set the target of win to the startup_disk
    set the bounds o

Automatically Building Summary Tables based on Variable Type

When working with large datasets, creating & exporting summary/descriptive tables can be a pain, even if you take advantage of looping over the variables in combination with table exporting programs like -tabout-. The reason is that you want different kinds of summary tables for different kinds of variables. For categorical string variables you probably want a cross-tab. Creating tables for numeric variables is less straightforward. For continuous numeric variables with many distinct values, you probably want a table with the mean, sd, min, max, and so on. For numeric variables that are categorical or otherwise limited, you may want a cross-tab. So, while you could use -ds- to detect the "type" (e.g. string or numeric) of variable you have, this could be misleading if a numeric categorical variable has many values (say 50+). Another approach could be to decide the type of table you create & export based on the number of unique values of a categorical numeric variab

Report indicates that "Toyota Accelerator Data Skewed Toward Elderly"....maybe not

 "An anonymous reader passes along this discussion on the data for the Toyota accelerator problem, from a few weeks back. (Here's a Google spreadsheet of the data .) 'Several things are striking. First, the age distribution really is extremely skewed. The overwhelming majority are over 55. Here's what else you notice: a slight majority of the incidents involved someone either parking, pulling out of a parking space, in stop and go traffic, at a light or stop sign... in other words, probably starting up from a complete stop.'   Read more of this story at Slashdot. " I don't think this is that striking really... this is only out of a N of 30 (plus 5 incidents with individuals of an 'unknown' age which would be enough to shift the distribution if they were all younger) and there's nothing here that accounts for any other factors that could have been involved with the accident.  Importantly, what about the age distribution of individuals who ar

Update: Encoding Variables and Labeling Values Consistently (and Efficiently) in Stata

On 2010-02-09, I posted this example on strategies for labeling values for lots of variables efficiently in Stata. Today I discovered a function in NJC's -egenmore- extension of -egen- that I think is easier to use and, in many cases, faster to implement than the combination of -labutil- and -multencode- that I had suggested in my Feb 9 posting; so, to extend my previous example, here is how we could label those variables using -egen- and the function "ston()" (scroll to the bottom to see the UPDATED code):
*-------------------------------------------------BEGIN CODE
clear
**this first block will create a fake dataset, run it all together**
input str12 region regioncode str20 quest1 str20 quest2 str20 quest3
"Southwest" 1 "Strongly Agree" "Strongly Disagree" "Disagree"
"West" 2 "Agree" "Neutral" "Agree"
"North" 3 "Disagree" "Disagree" "Strongly Disagr

Merging international / cross-national data using Stata

Giulia Catini, Ugo Panizza, & Carol Saade have published "Macro Data 4 Stata" with the code to link up many popular international and cross-national economic datasets. This is similar to a Stata ado-file on SSC called -kountry-, which cross-links other international political science or economic datasets. Taken together, these provide a great resource for working with international datasets.
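
To make the idea of "linking up" cross-national datasets concrete, here is a minimal sketch that uses only Stata's built-in -merge- (Stata 11 or later syntax). It is not the code from "Macro Data 4 Stata" or from -kountry-; the variable names (iso3, year, gdp, polity) and all of the values are made up for illustration, and it assumes both datasets already share a common country code.

```stata
* Hypothetical toy data: two country-year datasets keyed on an ISO3 code.
* All values below are invented purely for illustration.
clear
input str3 iso3 int year double gdp
"USA" 2008 100
"MEX" 2008 200
"CAN" 2008 300
end
tempfile econ
save `econ'

clear
input str3 iso3 int year byte polity
"USA" 2008 1
"MEX" 2008 2
"FRA" 2008 3
end

* Join on the shared identifiers; _merge records where each row came from
merge 1:1 iso3 year using `econ'

* Inspect the matches: 1 = master only, 2 = using only, 3 = matched
tabulate _merge
list iso3 year gdp polity _merge, sepby(_merge)
```

In practice the hard part is that different sources use different country codes or spellings, which is the gap that tools like the ones linked above are meant to fill before a -merge- like this one can work.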

"Aero Snap" with OSX

A colleague's new favorite critique of the Mac is that it doesn't have the "aero snap" feature that MS has been touting in their recent ads (where you can drag a window to the left or right of the screen and it will become a half window on that side of the screen). I admit this feature does look handy (in lieu of simply dragging the corner to resize the screen). Here's a freeware app that will do what aero snap does on the Mac (and it will do it for the top half of the screen and the bottom half, in addition to the snap left and snap right features in Win7): TwoUp. The only downside (kind of) is that it operates on keyboard shortcuts. There is a paid version of this software, called Cinch, that will let you simply drag the window for the snap feature (at $7 it's probably a bargain, but I like to stick to freeware apps). I like using shortcuts, and my hands are at the keyboard more than the mouse, so this works for me, but it might not work for others. (to get mous

Encoding Variables and Labeling Values Consistently (and Efficiently) in Stata

In PHPM 672 this week, we are working with -label define-, -label variable-, and -label values- to label data. You can use these (in combination with commands like -recode- and -replace-) to get a consistently labeled dataset, but there are three utilities that can get you there more efficiently: -encode-, -multencode- (from SSC), and a suite of utilities in -labutil- (from SSC). The goal is to get to a common value assignment across all similar variables (example: many variables that use the same categorical scale of responses). You can simply use -encode- if you are sure that the string variables that you are using to build categorical variables with value labels all have the same categories that are all spelled the exact same way. If not, you will need to use -multencode- to deal with missing categories and -replace- or -subinstr()- to correct any spelling deviations. Finally, once you've labeled the categorical variables consistently, you want to put them in an ord

Quick Links

1. This study demonstrates how reducing information/effort obstacles for the college application process increases the college attendance rate. At PPRI we're seeing some of this same pattern in Texas high schools where we've helped evaluate one high school reform grant or another. Campus counselors or other staff are sitting down with HS seniors to help them fill out FAFSA and other application materials, and many of them think that this is helping to improve their college attendance rates. However, the problem here is knowing what really happens when that senior exits high school & if they do start college, what their likelihood is of actually completing a degree.
2. Yet another reason to learn Bayesian Modeling / Analysis
3. PhD Comics take down of news media poll reporting
4. Is the spurious regression problem spurious? [ 1 ] [ 2 ]

3D Mac Desktop via "bumptop"

I like the way  bumptop organizes file stacks (similar to "stacks" in the dock, but more flexible)--though the jury is still out about whether it will improve my productivity or not. Below is my 3D desktop using bumptop. The stacks of photos are there for illustration, but the video at the bumptop webpage shows how you can sort and scrub through stacks of documents/photos.  Notice how you can post links, pending documents, and sticky notes on the 4 walls (the back/bottom wall is not visible unless you navigate to it).  You can also switch to 2D mode on the fly.

Quick Links

BJPS comes out from behind its pay wall
Problems in the Census Data ... the referenced Alexander, Davern, and Stevenson paper ... WSJ Bialik write up
Adjustments for Modeling Medicare Service Use
Statalist changes its subscription rules [ 1 ] [ 2 ] ... and no, mental health screening is not an official subscription criterion
Wall Street Fight Clubs ... I am Jack's complete lack of surprise

AWK scripts & Large Datasets

Working with large datasets in a statistical program like Stata, SPSS, or R can be a pain if the variable formats and shapes aren't what you need, because stats programs can be notoriously slow at cleaning up the data when they have to load a dataset into memory that is larger than the physical memory (SAS is an exception -- it uses the hard drive exclusively). Of course, there are plenty of other issues with large datasets [ 1 ] [ 2 ]. A faster way to get the data cleaned up is to read the data in line-by-line and then manipulate it using something like AWK. Using AWK scripts can save you some time with the cleanup, and Brenden at published a table comparing the speed of various languages in getting the job done:

The Impact of Corporate Spending on Elections

The Citizens United decision is in many ways a surprising decision. The decision overrules standing precedent (namely Austin v. Michigan Chamber of Commerce) and some provisions in McCain-Feingold, and it was decided on issues that weren't explicitly stated in the petition to the court. Moreover, the Roberts court didn't follow its modus operandi of deciding controversial cases narrowly.

Referencing Cells/Observations in Stata using [brackets]

Earlier today there was a posting on Statalist that asked about labeling a numeric variable with the words contained in a string variable. This reminded me of my first Statalist posting last year where I asked a similar question. There are several solutions to this question, but more importantly, it got me thinking more about referencing cells using brackets (called "explicit subscripting" or using "_variables" in the Stata manual). Before I go into subscripting, here's a recap of how to label numeric variables from a list contained in a string variable based on the Statalist exchange:
*-------------------------BEGIN CODE
/* The posting asked how to label col1 with values from col2 in this example dataset: */
clear*
inp byte col1 str5 col2
1 "Table"
2 "Chair"
3 "Stool"
4 "Chair"
5 "Pan"
end
/* Martin Weiss offered the solution: */
loc h ""
forv i=1/`=_N'{
loc h `h' `=col1[`i']

Visualization of Netflix Data

NYT posted some visualizations of the top Netflix rentals for 12 cities, by zip code.  Unfortunately, the raw data is not available.  Netflix has a history of releasing their rental data/algorithms for contests where participants have attempted to improve these algorithms, so hopefully they'll be releasing more data on consumer preferences for general use. The city with data available that is closest to where I live is Dallas, Tx.  Before taking a look at the Netflix rentals, let's take a brief look at some demographic characteristics, by zip code, of this same region for reference (this is from the ERsys  2000 Census data , which is outdated). Here's household income in Dallas: (darker zip codes = higher mean household incomes) Here's level of education in Dallas: (Darker blue = higher mean education level)

Decision Mapping: What Should I Drink? Edition

Linked here or after the jump.... Update :  They've also got one for  fast food.