Showing posts from November, 2017

Unexpected results with conditions and functions in Stata

Just dropped in to see what condition my condition was in... Using Stata functions like those found in -help functions- and -help egen- can be tricky in combination with conditions. By 'conditions' I mean anything that changes the range over which a function operates, so this includes [if] and [in] qualifiers as well as conditions within a function, such as: gen count = sum(gender == 1 & age <= 18). The examples in this post serve as a cautionary tale about what to watch for when combining functions and conditions.

The advice in this post falls into three categories: (1) avoid subscripting with -egen- functions; (2) be wary of conditions inside functions (as well as nested functions); and (3) there are better ways to use conditions with functions to avoid problems.
Note: a subset of this discussion appeared on Statalist in this thread:…
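
To make the kind of surprise described above concrete, here is a minimal sketch on made-up data. It is not taken from the full post; the variables gender and age and the new variable names are assumptions chosen to match the gen count = sum(...) example. It contrasts a condition placed inside a function with the same condition used as an [if] qualifier.

```stata
* Hypothetical toy data; all variable names are illustrative
clear
input gender age
1 15
1 30
2 17
1 18
2 40
end

* sum() in -generate- is a running sum, so each row holds the count so far,
* not the overall count
generate running = sum(gender == 1 & age <= 18)

* -egen- total() with the condition inside the function:
* every row receives the overall count
egen n_inside = total(gender == 1 & age <= 18)

* The same count with an [if] qualifier:
* rows that fail the condition are left missing
egen n_if = total(1) if gender == 1 & age <= 18

list, noobs
```

The three new variables answer different questions even though the condition is identical, which is why it is worth listing the results before using them downstream.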

Creating example datasets for collaboration with other Stata users

Robert Picard and Nick Cox developed a (better) program called -dataex- that was uploaded to SSC and, as of Stata 15.1, is officially included with Stata to help users share data examples. The major difference is that -writeinput- writes a .do file with an -input- statement, while -dataex- is Statalist-centric (even producing the enclosing [CODE] tags unless the 'elsewhere' option is specified) and displays the data example in the Results window. (Updated Nov 2017)
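
As a rough illustration of the -dataex- workflow mentioned above, the sketch below uses the auto dataset that ships with Stata. The choice of dataset, variables, and observation range is an assumption made only for this example.

```stata
* On Stata versions before 15.1, -dataex- must first be installed from SSC:
* ssc install dataex

sysuse auto, clear

* Print a copy-and-paste data example (with [CODE] tags) in the Results window
dataex make price mpg in 1/5

* The same example without the Statalist [CODE] tags
dataex make price mpg in 1/5, elsewhere
```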


I'm lucky to be in a research environment where most of my colleagues and students use Stata. Also, I regularly participate on Statalist. Both of these have helped push me to periodically refine my habits when it comes to communicating about Stata.

When it comes to asking questions on Statalist, I've tried to stick closely to the Statalist FAQ and other tips mentioned by William Gould on the Stata NEC Blog.  However, for answering questions on Statalist, I find Maarten Buis's page on his Statalist postings espec…

Tukey, not TuRkey (that's for tomorrow), on outliers and multiple comparisons

John Tukey contributed many things to statistics and data visualization (box plots!). I also cannot help thinking "TuRkey" every time I read or verbalize his name. In my (biased, fallible human) mind, the frequency with which I think to use Tukey methods in my own work, and the potential for an embarrassing Tukey-TuRkey switch during regular communication, increase across the year to look something like the figure to the right.

Why am I reading this?
In this post, I discuss a few ways to use some of Tukey's contributions to help with analyses I've recently been running in Stata. I use (a simulated version of) data from student survey responses, discipline involvement, and exam performance to do Tukey-related things like assess outliers and test for differences across multiple groups. This post includes some basic code run on fake data, but in addition to what's presented below, you could also consider using Stata programs (some from SSC) such as:  -extreme…
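
For readers who want a quick feel for the two tasks named above, here is a minimal sketch. It is not the post's own code: it assumes the built-in auto dataset in place of the simulated student data, uses Tukey's 1.5 x IQR fences as the outlier rule, and uses -pwcompare- with Tukey's adjustment for the group comparisons. The scalar and variable names are hypothetical.

```stata
sysuse auto, clear

* Tukey fences: flag values more than 1.5 * IQR beyond the quartiles
quietly summarize price, detail
scalar q1  = r(p25)
scalar q3  = r(p75)
scalar iqr = q3 - q1
generate byte outlier = (price < q1 - 1.5*iqr | price > q3 + 1.5*iqr) if !missing(price)
tabulate outlier

* Pairwise group comparisons with Tukey's adjustment for multiple comparisons
regress price i.rep78
pwcompare rep78, mcompare(tukey) effects
```

The fence multiplier of 1.5 is the conventional box-plot choice; a stricter rule such as 3 x IQR would flag only far-out values.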

I think this technically violates Asimov's zeroth law...

First, AI was tasked with dealing with the pesky Reviewer #2 problem of the scientific peer review process (ok, the Evise feature is just a search & match function, not really AI).  Now, AI is here to handle the messy business of actually writing your scientific manuscript for you.

SciNote has their new magic AI plug-in (sarcasm intended) that will purportedly take the results of your analyses and links to relevant literature and "magically" turn it into a scientific manuscript.

From the product page:
This is where the magic happens
Once your data is nicely organized in sciNote, Manuscript Writer can do its job!
Based on the data you have in sciNote, it will generate a draft of your manuscript.
oof.    Insert lateral plaintiff face type emoji here.  

This only perpetuates the issues with paper mills/publishers (that thankfully get exposed (using a fake manuscript generator no less)). At least they didn't launch this new product at 2:14 a.m. Eastern time,  on August…

Precision in Stata

In this post, I explore how to deal with precision issues in Stata.
First, create data for the example

    . clear
    . set obs 1000
    obs was 0, now 1000
    . g x = 1.1
    . list in 1/5, noobs

      +-----+
      |   x |
      |-----|
      | 1.1 |
      | 1.1 |
      | 1.1 |
      | 1.1 |
      | 1.1 |
      +-----+

    . count if x == 1.1 // zero matches!!
        0

Precision of Stata storage formats
Stata isn't wrong; the variable x was stored as a float, which has too little precision to match the double-precision literal 1.1 (some decimal numbers have no exact finite-digit binary representation). Wrapping the comparison value in float(), or storing the variable as a double, fixes the issue. Note below how x is represented in hexadecimal and binary IEEE format vs. Stata general (%16.0g) and fixed (%f) formats.

    . count if x == float(1.1)
     1000
    . **formats
    . di %21x x //hex
    +1.19999a0000000X+000
    . di %16L x //IEEE precision
    000000a09999f13f
    . di %16.0g round(x, .1)
    1.1
    . di %4.2f round(x, .1)
    1.10
    . di %23.18f round(x, .1)
    1.100000000000000089

Double formats
Storing the var…
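
The two fixes described above, comparing against float(1.1) or storing the variable as a double, can be put side by side in a short sketch. It assumes Stata's default storage type is still float (that is, -set type double- has not been issued); the variable y is a name introduced only for this example.

```stata
clear
set obs 1000

generate x = 1.1            // stored as float by default
generate double y = 1.1     // stored as double

count if x == 1.1           // 0: the float value differs from the double literal 1.1
count if x == float(1.1)    // 1000: both sides are rounded to float precision
count if y == 1.1           // 1000: both sides are double precision
```

Using float() keeps the dataset small, while storing as double costs twice the memory for that variable but lets comparisons against literals work without extra care.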

I did a thing....

In 2009, New Mexico adopted more rigorous high school graduation requirements. I (finally) completed the last of my remaining REL studies that examined the changes in New Mexico’s high school students’ advanced course completion rates under these new requirements. We're providing a webinar and you can join in and listen to the results of the study if this is a topic you're interested in.  See the webinar announcement (at the newly minted Gibson blog) for registration details.

The study that will be presented:
Booth, E., Shields, J., & Carle, J. (2017). Advanced course completion rates among New Mexico high school students following changes in graduation requirements (REL 2018–278). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Southwest.
Accessible at