Encoding Variables and Labeling Values Consistently (and Efficiently) in Stata

In PHPM 672 this week, we are working with -label define-, -label variable-, and -label values- to label data.  You can use these (in combination with commands like -recode- and -replace-) to get a consistently labeled dataset, but there are three utilities that can get you there more efficiently:  -encode-, -multencode- (from SSC), and a suite of utilities in -labutil- (from SSC).
The goal is to get to a common value assignment across all similar variables (example:  many variables that use the same categorical scale of responses).  You can simply use -encode- if you are sure that the string variables you are turning into labeled categorical variables all have the same categories, all spelled exactly the same way.  If not, you will need to use -multencode- to deal with missing categories and -replace- (optionally with -subinstr()-) to correct any spelling deviations.  Finally, once you've labeled the categorical variables consistently, you want to put the values in an order that makes sense (e.g., lowest to highest) using -labvalch- from -labutil-.  Because -labvalch- changes only the value label, pair it with a matching -recode- of the data.

Here's an example of this process with a sample dataset.  Copy and paste this snippet into a do-file and run it to see how this process works:

*-------------------------------------------------BEGIN CODE
**this first block creates a fake dataset; run it all together**
clear
input str12 region regioncode str20 quest1 str20 quest2 str20 quest3
"Southwest" 1 "Strongly Agree" "Strongly Disagree" "Disagree"
"West" 2 "Agree" "Neutral" "Agree"
"North" 3 "Disagree" "Disagree" "Strongly Disagree"
"Northwest" 5 "Disagree" "Agree" "Strongly Agree"
"East" 4 "Strongly Disagree" "Strongly Agree" "Agree"
"South" 9 "Neutral" "Agree" "Agreee"
end

//1. Create labeled REGION variable
// If we -encode- region, it would not line up with regioncode
// because -encode- assigns codes in alphabetical order, for example:
// (-fre- is a user-written command from SSC)
ssc install fre
encode region, gen(region2) label(region2)
fre region2   //<-- these values don't match regioncode
drop region2

// INSTEAD, we use -labmask- to quickly attach the strings in
// region as value labels for the matching regioncode values
ssc install labutil
labmask regioncode, values(region)
fre regioncode

//2. Creating comparable survey question scales
// We want all the survey questions to be on the same scale
// so that we can compare them in a model or table.
// -encode- can help us here with quest1 and quest2 because they
// have the same categories, but quest3 has different categories
// (it's missing "Neutral" and one "Agree" is misspelled), so we could
// either (1) use -replace- to define the numeric categories for these
// survey questions and then label them with -label define- and
// -label values-, or (2) use -multencode- after fixing the misspelled
// "Agree" value in quest3.
replace quest3 = "Agree" if quest3=="Agreee"
ssc install multencode
multencode quest1-quest3, gen(e_quest1-e_quest3)
label li
fre e_*
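// The categories are labeled properly, but the scale isn't in
// order: we want agreement to increase as the scale moves from
// 1 (Strongly Disagree) to 5 (Strongly Agree).
// -labvalch- (also from -labutil-) takes a value label name, not a
// variable name, and it changes only the label, so the data are
// recoded with the same mapping. -multencode- should have named the
// label after the first new variable (e_quest1); confirm the name in
// the -label li- output above.
recode e_quest1-e_quest3 (1=4) (4=5) (5=1)
labvalch e_quest1, f(1 2 3 4 5) t(4 2 3 5 1)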
label li
fre e_*
*-------------------------------------------------END CODE
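
For comparison, here is a minimal sketch of option (1) from the comments above: building the numeric variable by hand with -replace- and then labeling it with -label define- and -label values-.  It makes its own small copy of quest3 so it can be run separately; the names m_quest3 and agree5 are made up for this illustration and are not part of the example above.

*-------------------------------------------------BEGIN CODE
**a manual alternative for one question; run it all together**
clear
input str20 quest3
"Disagree"
"Agree"
"Strongly Disagree"
"Strongly Agree"
"Agree"
"Agreee"
end

// fix the misspelling first, as in the main example
replace quest3 = "Agree" if quest3=="Agreee"

// assign the numeric codes by hand, already in the order we want
// (m_quest3 is a hypothetical variable name)
gen byte m_quest3 = .
replace m_quest3 = 1 if quest3=="Strongly Disagree"
replace m_quest3 = 2 if quest3=="Disagree"
replace m_quest3 = 3 if quest3=="Neutral"
replace m_quest3 = 4 if quest3=="Agree"
replace m_quest3 = 5 if quest3=="Strongly Agree"

// define the scale once (agree5 is a hypothetical label name)
// and attach it to the new variable
label define agree5 1 "Strongly Disagree" 2 "Disagree" 3 "Neutral" 4 "Agree" 5 "Strongly Agree"
label values m_quest3 agree5

// check the result against the original strings
label li agree5
tabulate quest3 m_quest3, missing
*-------------------------------------------------END CODE

This route needs no user-written commands and no reordering step, but every category has to be typed out for every variable, which is where -multencode- saves time when there are many questions on the same scale.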