

Showing posts from 2017

Happy holidays - I made you this Stata tree...

Two Stata-related problems I've been working on involve building weaved, literate documents via -markstat- (from SSC) and post-processing Stata graphs via gr_edit commands. gr_edit commands are undocumented but are helpful for adding elements across many graphs that you'd otherwise have to add manually (or by creating a graph recording). The biggest downside to gr_edit commands (or applying graph recordings) is that the process is slow when you have hundreds or thousands of graphs to edit. In the code example below, I made you a holiday tree with 100k lights. I know, this is very thoughtful of me. You are welcome. Update (Dec 25): First, here's a new version of a holiday tree using -tw scatter-/-tw bar-. I like this version much better than the approach I originally included when I first put up this post. The previous attempts using -scatteri- and gr_edit are still included below: **************! clear set obs 2000 g z = 10 g obs =
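The -tw scatter- approach can be sketched roughly like this (a minimal, hypothetical sketch, not the post's actual code: the variable names and the triangle math are my own assumptions):

```stata
* Hedged sketch: random "lights" confined to a triangle, drawn with -tw scatter-
clear
set seed 1224
set obs 2000
gen y = runiform()*10                  // height of each light
gen x = (runiform()-0.5)*(10 - y)/5   // width shrinks as y rises -> triangle
twoway (scatter y x, msymbol(point) mcolor(green)), ///
    legend(off) xscale(off) yscale(off) graphregion(color(white))
```

Scaling this to 100k lights is just a matter of raising -set obs-; the point of the post is that drawing the lights as data (one -tw scatter- call) is far faster than adding them one at a time via gr_edit.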

Slack: Alchemizing FOMO into neurosis

At my research firm, we've finally caught up to c. 2013 and adopted Slack for project communications and management. My initial (<2 weeks) impression is that I like it -- the command line, programmable bots, and API parts, in particular, are appealing (and I'm guessing the utility of those will increase over time). There is a lot of conflicting advice out there about how best to use Slack, so in this post I add to the noise with some guiding principles we're following (at least initially). The tl;dr version of this post:
- more channels with fewer conversations > fewer channels with more conversations (noise)
- mute your notifications! Slack will let you know when you are needed (as long as others follow standard @ mentioning (tagging) conventions)
- threading = good (but use it deliberately); reply with purpose, otherwise just react
- don't be a luddite: learn and use the Slack /slash commands
Below are more details/notes/tips on how we are using Slac

Unexpected results with conditions and functions in Stata

Just dropped in to see what condition my condition was in... Speaking of Kenny Rogers... Using Stata functions like those found in -help functions- and -help egen- can be tricky when combined with conditions (by 'conditions' here I mean everything that changes the range over which functions operate, including [if] and [in] conditions and conditions within a function, like: gen count = sum(gender == 1 & age <= 18)). The examples in this post serve as a cautionary tale about what to watch for when combining functions and conditions. The advice falls into three categories: (1) avoid subscripting with -egen- functions; (2) be wary of conditions inside functions (as well as nested functions); and (3) there are better ways to use conditions with functions that avoid these problems. Note: a subset of this discussion appeared on Statalist in this thread:
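As a minimal illustration of the second point, note that -sum()- in -generate- is a running sum, not a single count, so its result varies down the rows (the variable names here are my own, hypothetical):

```stata
* sum() inside -generate- returns a running sum, not a group count
clear
input byte gender byte age
1 12
0 20
1 15
end
gen running = sum(gender == 1 & age <= 18)   // 1, 1, 2 down the rows
* a constant, whole-dataset count is safer via -egen, total()-
egen nkids = total(gender == 1 & age <= 18)  // 2 in every row
list, noobs
```

The difference between a running sum and a constant total is exactly the kind of surprise the post's examples warn about.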

Creating example datasets for collaboration with other Stata users

Robert Picard and Nick Cox developed a (better) program called -dataex- that was uploaded to SSC and, as of Stata 15.1, is officially included to help users share example data. The major difference is that -writeinput- writes a .do file with an -input- statement, while -dataex- is Statalist-centric (even producing the enclosing [CODE] tags unless the 'elsewhere' option is specified) and produces the data example in the Results window (Updated Nov 2017)... I'm lucky to be in a research environment where most of my colleagues and students use Stata. Also, I regularly participate on Statalist. Both of these have helped push me to periodically refine my habits when it comes to communicating about Stata. When it comes to asking questions on Statalist, I've tried to stick closely to the Statalist FAQ and other tips mentioned by William Gould on the Stata NEC Blog. However, for answering questions on Statalist, I find Maarten Buis's page on his Statalist posti
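For readers who haven't used it, a typical -dataex- call looks something like this (a minimal sketch using the auto dataset shipped with Stata):

```stata
* produce a paste-ready data example for Statalist
sysuse auto, clear
* ssc install dataex   // only needed before Stata 15.1
dataex make price mpg foreign in 1/5
* add the 'elsewhere' option to omit the [CODE] tags for non-Statalist use
```

The output in the Results window can be copied directly into a Statalist post, giving readers an exact, reproducible slice of your data.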

Tukey, not TuRkey (that's for tomorrow), on outliers and multiple comparisons

John Tukey contributed many things to statistics and data visualization (box plots!). I also cannot help thinking "TuRkey" every time I read or say his name. In my (biased, fallible human) mind, the frequency with which I think to use Tukey methods in my own work, and the potential for an embarrassing Tukey-TuRkey switch during regular communication, increase across the year to look something like the figure to the right. Why am I reading this? See -- it's not just me who mixes these up. In this post, I discuss a few ways to use some of Tukey's contributions to help with analyses I've recently been running in Stata. I use (a simulated version of) data from student survey responses, discipline involvement, and exam performance to do Tukey-related things like assess outliers and test for differences across multiple groups. This post includes some basic code run on fake data, but in addition to what's presented below you could also cons
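A minimal sketch of the two Tukey-related tasks in Stata, using the shipped auto data in place of the (simulated) survey data (the variable choices are mine, not the post's):

```stata
sysuse auto, clear
* Tukey fences: flag values beyond 1.5 * IQR outside the quartiles
summarize price, detail
scalar p25 = r(p25)
scalar p75 = r(p75)
scalar iqr = p75 - p25
count if price > p75 + 1.5*iqr | price < p25 - 1.5*iqr
* Tukey's HSD for pairwise mean differences across groups
pwmean price, over(rep78) mcompare(tukey) effects
```

The -mcompare(tukey)- option adjusts the pairwise comparisons for the multiple-testing problem the post alludes to.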

I think this technically violates Asimov's zeroth law...

First, AI was tasked with dealing with the pesky Reviewer #2 problem of the scientific peer review process (ok, the Evise feature is just a search & match function, not really AI). Now, AI is here to handle the messy business of actually writing your scientific manuscript for you. SciNote has a new "magic" AI plug-in (sarcasm intended) that will purportedly take the results of your analyses and links to relevant literature and "magically" turn them into a scientific manuscript. From the product page: "This is where the magic happens. Once your data is nicely organized in sciNote, Manuscript Writer can do its job! Based on the data you have in sciNote, it will generate a draft of your manuscript." Oof. Insert plaintive-face-type emoji here. This only perpetuates the issues with paper mills/publishers (which thankfully get exposed (using a fake manuscript generator, no less)). At least they didn't launch this new product at 2:1

Precision in Stata

In this post, I explore how to deal with precision issues in Stata. First, create data for the example:

. clear
. set obs 1000
obs was 0, now 1000
. g x = 1.1
. list in 1/5, noobs

  +-----+
  |   x |
  |-----|
  | 1.1 |
  | 1.1 |
  | 1.1 |
  | 1.1 |
  | 1.1 |
  +-----+

. count if x == 1.1   // zero matches!!
  0

Precision of Stata storage formats. Stata isn't wrong; it's just that you stored the variable x with too little precision (some decimal numbers have no exact finite-digit binary representation in computing). If we change the comparison to float precision, or store the variable in double format, the issue is fixed. Note below how x is represented in hexadecimal and binary IEEE format vs. Stata general (16g) and fixed (f) format.

. count if x == float(1.1)
  1000
. **formats
. di %21x x   // hex
+1.19999a0000000X+000
. di %16L x   // IEEE precision
000000a09999f13f
. di %16.0g round(x, .1)
1.1
. di %4.2f round(x, .1)
1.10
. di %23.18f round(x, .1)
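A condensed sketch of the two fixes (the float/double behavior shown is standard Stata; the variable names are mine):

```stata
* two ways to make the comparison succeed
clear
set obs 5
gen x = 1.1                  // stored as float by default
count if x == 1.1            // 0: the literal 1.1 is evaluated in double
count if x == float(1.1)     // 5: round the literal to float first
gen double y = 1.1
count if y == 1.1            // 5: double storage matches the double literal
```

The rule of thumb: compare a variable to a literal at the precision the variable is actually stored in, or store it as double to begin with.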

I did a thing....

In 2009, New Mexico adopted more rigorous high school graduation requirements. I (finally) completed the last of my remaining REL studies, which examined changes in New Mexico high school students' advanced course completion rates under these new requirements. We're providing a webinar, and you can join in and listen to the results of the study if this topic interests you. See the webinar announcement (at the newly minted Gibson blog) for registration details. The study that will be presented: Booth, E., Shields, J., & Carle, J. (2017). Advanced course completion rates among New Mexico high school students following changes in graduation requirements (REL 2018–278). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Southwest. Accessible at .

Whitewashing your standard errors

Great quote from Gary King warning about the dangers of the all-too-common ", robust" (or I guess it's ", vce(robust)" now) solution for whitewashing the SEs in your model: "[...] if robust and classical standard errors diverge—which means the author acknowledges that one part of his or her model is wrong—then why should readers believe that all the other parts of the model that have not been examined are correctly specified? We normally prefer theories that come with measures of many validated observable implications; when one is shown to be inconsistent with the evidence, the validity of the whole theory is normally given more scrutiny, if not rejected (King, Keohane, and Verba 1994). Statistical modeling works the same way: each of the standard diagnostic tests evaluates an observable implication of the statistical model. The more these observable implications are evaluated, the better, since each one makes the theory vulnerable to being proven wrong. T
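King's diagnostic can be run as a quick side-by-side comparison in Stata (a sketch using the shipped auto data; a large divergence between the two SE columns is the warning sign he describes):

```stata
* compare classical vs. robust standard errors side by side
sysuse auto, clear
regress price mpg weight
estimates store classical
regress price mpg weight, vce(robust)
estimates store robust
estimates table classical robust, se b(%9.2f)
```

If the robust SEs differ substantially from the classical ones, the point is not to report the robust column and move on; it's to treat the divergence as evidence of misspecification worth investigating.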