Skip to main content

Unexpected results with conditions and functions in Stata

Just dropped in to see what condition my condition was in...

Speaking of Kenny Rogers...
Using Stata functions like those found in -help functions- and -help egen- can be tricky when used in combination with conditions (by 'conditions' here I mean everything that changes the range over which functions operate, so this includes [if] and [in] conditions, conditions within a function like: gen count= sum(gender == 1 & age <= 18), etc) . The examples in this post serve as a cautionary tale for some things to watch for when combining the use of functions and conditions.


The advice in this post falls into three categories/topics: (1) Avoid subscripting with -egen- functions; (2) be wary of conditions inside functions (as well as nested functions); and (3) there are better ways to use conditions with functions to avoid problems.
Note : a subset of this discussion appeared on Statalist in this thread: https://www.statalist.org/forums/forum/general-stata-discussion/general/1419537-sum-of-values-of-a-column-until-a-certain-point

Example data

First, let's create some example data to work with.
. 
. clear

. input ID    Var_1   Y  

ID      Var_1          Y
 1    1     2
 2    4     2
 3    0     .
 4    5     4
 5    4     4 
end

.   sort ID, stable

. list

     +----------------+
     | ID   Var_1   Y |
     |----------------|
  1. |  1       1   2 |
  2. |  2       4   2 |
  3. |  3       0   . |
  4. |  4       5   4 |
  5. |  5       4   4 |
     +----------------+

. 
Before we start, recall how total() and sum() are different:

. //First, note the difference in using total instead of sum with an if condition:
. g y = sum(Var_1) if ID<4
(2 missing values generated)

. egen y2 = total(Var_1) if ID<4
(2 missing values generated)

. list

     +-------------------------+
     | ID   Var_1   Y   y   y2 |
     |-------------------------|
  1. |  1       1   2   1    5 |
  2. |  2       4   2   5    5 |
  3. |  3       0   .   5    5 |
  4. |  4       5   4   .    . |
  5. |  5       4   4   .    . |
     +-------------------------+

. **both have missings where the if condition fails
. **sum is the running sum, total is the overall summed total.
.         drop y y2

Careful with subscripting & -egen-. 

From the Stata User manual: "Depending on fcn(), arguments, if present, refers to an expression, varlist, or a numlist, and the options are similarly fcn dependent. Explicit subscripting (using N and n), which is commonly used with generate, should not be used with egen." The reason is egen will internally change the sort order, so subscript locations may change. See this entry in the Stata user manual
To be fair, there are cases where using -egen- with subscripting works as expected (and where the Stata manual even uses them in combination in examples) but, in this example, let's focus on a few cases where doing so causes problems.
. **Consider these issues:
. 
.         egen v = sum(Y) if _n==_N
(4 missing values generated)

.         egen vv = sum(Y) in l
(4 missing values generated)

.         egen vvvv = total(Y) if Var_1!=0
(1 missing value generated)

.         egen vvv = total(Y) if Var_1[3]!=0 //explicit
(5 missing values generated)

.         assert vvvv==vvv  , rc0
4 contradictions in 5 observations
assertion is false

. list 

     +--------------------------------------+
     | ID   Var_1   Y   v   vv   vvvv   vvv |
     |--------------------------------------|
  1. |  1       1   2   .    .     12     . |
  2. |  2       4   2   .    .     12     . |
  3. |  3       0   .   .    .      .     . |
  4. |  4       5   4   .    .     12     . |
  5. |  5       4   4   4    4     12     . |
     +--------------------------------------+

.         
.         egen z = total(Y[1])  

.         sort v

.         egen zz = total(Y[1])   

.         assert z==zz  , rc0
5 contradictions in 5 observations
assertion is false

. list

     +------------------------------------------------+
     | ID   Var_1   Y   v   vv   vvvv   vvv    z   zz |
     |------------------------------------------------|
  1. |  5       4   4   4    4     12     .   10   20 |
  2. |  4       5   4   .    .     12     .   10   20 |
  3. |  3       0   .   .    .      .     .   10   20 |
  4. |  2       4   2   .    .     12     .   10   20 |
  5. |  1       1   2   .    .     12     .   10   20 |
     +------------------------------------------------+

.         cap drop v*

.         cap drop z*

.         
. **using if and bysort with egen vs. conditions inside the function:
. bys Y (ID): egen j  = mean(Var_1) if Y<3
(3 missing values generated)

. bys Y (ID): egen jj  = mean(Var_1*(Y<3))  

. list

     +----------------------------+
     | ID   Var_1   Y     j    jj |
     |----------------------------|
  1. |  1       1   2   2.5   2.5 |
  2. |  2       4   2   2.5   2.5 |
  3. |  4       5   4     .     0 |
  4. |  5       4   4     .     0 |
  5. |  3       0   .     .     0 |
     +----------------------------+

.         assert j==jj  , rc0
3 contradictions in 5 observations
assertion is false

.                 cap drop j*

.                 
. ** in the same way that these are not equivalent:
. bysort Y: egen k = total(Var_1) if ID >3
(3 missing values generated)

. bysort Y: egen kk = total(cond(ID >3 , Var_1, .))

. list

     +-------------------------+
     | ID   Var_1   Y   k   kk |
     |-------------------------|
  1. |  1       1   2   .    0 |
  2. |  2       4   2   .    0 |
  3. |  4       5   4   9    9 |
  4. |  5       4   4   9    9 |
  5. |  3       0   .   .    0 |
     +-------------------------+

.         * because of missing values
.         cap drop k*

Understanding what conditions inside functions are doing: 

In the statalist post I linked above, I mentioned how you can define conditions inside the parenthesis of a function but it is evaluating the condition as true/false logic. Consider the differences in how the sum function is either summing/adding the values contained in Var_1 versus summing the "1" value that the expression produces when the condition is true. For more on how Stata treats conditions as True/False or 1/0 conditions see Nick Cox's article "Speaking Stata: How to move step by: step"
. g z1 = sum(Var_1<5)  //evaluates to 1 if true and sums that "1" across observations , so this is essentially a counter for number of times this is true 

. g z2 = Var_1[4] //value of Var1 in the 4th observation 

. g z3 = Var_1[_n<4] //evaluates to 1 if condition is true
(2 missing values generated)

. g z4 = sum(Var_1[ID<4]) //evaluates to 1 if condition is true and then sums across panel/dataset

. g z5 = sum(Var_1==4)  

. list ID V* z*

+-------------------------------------+
     | ID   Var_1   z1   z2   z3   z4   z5 |
     |-------------------------------------|
  1. |  1       1    1    4    1    1    0 |
  2. |  2       4    2    4    1    2    1 |
  3. |  4       5    2    4    1    2    1 |
  4. |  5       4    3    4    .    2    2 |
  5. |  3       0    4    4    .    3    2 |
     +-------------------------------------+

.         cap drop z*

Conditions inside and outside of functions in the same command: 

When do functions like sum() and total() fail (or fail to produce things I might expect) when using conditions within the function, after the function in an [if], and using subscripting? The person who posted on SL asked why this syntax/statement did not work:
. **
. egen X = sum(cond(ID<4,sum(Var_1),0))

. **
. list

+---------------------+
     | ID   Var_1   Y    X |
     |---------------------|
  1. |  1       1   2   20 |
  2. |  2       4   2   20 |
  3. |  4       5   4   20 |
  4. |  5       4   4   20 |
  5. |  3       0   .   20 |
     +---------------------+

. **first consider: 
. g cond1 = sum(cond(ID<4,sum(Var_1),0))

. 
. **take the cond out of the sum() to see how it first sum's the entire column of Var_1 (inside the cond() function) and then it sum's across these values if the ID<4 condition is true. 
. **The zero never shows up in the values where ID is not less than 4 because you've wrapped the cond() in a sum() that would just keep adding up the column. So:
.  
. g cond2a = cond(ID<4,sum(Var_1),0)

. list ID V* cond*

     +-----------------------------+
     | ID   Var_1   cond1   cond2a |
     |-----------------------------|
  1. |  1       1       1        1 |
  2. |  2       4       6        5 |
  3. |  4       5       6        0 |
  4. |  5       4       6        0 |
  5. |  3       0      20       14 |
     +-----------------------------+

. 
. 
. *so if the ID<4 condition will evaluate to 1 and you only want to add across Var_1 if that condition  is true, why not write the sum() function like 'cond3' below?  
.  * based on the SL quetsion, the original posting actually wanted the total, not the running sum, so 
> cond4 below is a good way to get the right answer:
. g cond3 = sum(Var_1*(ID<4))

. egen cond4 = total(Var_1*(ID<4))

. list ID V* cond*

     +---------------------------------------------+
     | ID   Var_1   cond1   cond2a   cond3   cond4 |
     |---------------------------------------------|
  1. |  1       1       1        1       1       5 |
  2. |  2       4       6        5       5       5 |
  3. |  4       5       6        0       5       5 |
  4. |  5       4       6        0       5       5 |
  5. |  3       0      20       14       5       5 |
     +---------------------------------------------+

. 
Another set of examples of how to use the sum function with if and in conditions to get results outside of a -gen- or -egen- command (based on a SAS vs. Stata syntax comparison by the UNC Carolina Population Center http://www.cpc.unc.edu/research/tools/data_analysis/sas_to_stata/sas_to_stata.html):
. sysuse auto, clear
(1978 Automobile Data)

. list make in 1/10 if mpg < 20 

+---------------+
     | make          |
     |---------------|
  2. | AMC Pacer     |
  5. | Buick Electra |
  6. | Buick LeSabre |
  9. | Buick Riviera |
 10. | Buick Skylark |
     +---------------+

.  
. // the order of if and in does not matter: 
.  list make if mpg < 20 in 1/10 

     +---------------+
     | make          |
     |---------------|
  2. | AMC Pacer     |
  5. | Buick Electra |
  6. | Buick LeSabre |
  9. | Buick Riviera |
 10. | Buick Skylark |
     +---------------+

. 
.  
. // Both will first subset the data to the first 10 observations and then attempt to subset the data b
> ased on the condition "if mpg < 20". 
. 
. // So, another approach is to use the sum() function in the -list- command. Since sum() creates a run
> ning sum, you have to repeat the condition outside the sum() to subset the data to that condition to 
> list the first 10 observations. 
. 
. // Importantly , the sum() function adds up the true conditions because true conditions evaluate to 1
>  (one) and false evaluate to 0(zero).
. 
. **so if we now want to list the first ten times that mpg is less than 20 (which I think is the most u
> seful part of this application) :
. g x = sum((mpg < 20))

. list make x mpg if inrange(sum((mpg < 20)), 1, 10)  & mpg<20

     +---------------------------+
     | make              x   mpg |
     |---------------------------|
  2. | AMC Pacer         1    17 |
  5. | Buick Electra     2    15 |
  6. | Buick LeSabre     3    18 |
  9. | Buick Riviera     4    16 |
 10. | Buick Skylark     5    19 |
     |---------------------------|
 11. | Cad. Deville      6    14 |
 12. | Cad. Eldorado     7    14 |
 15. | Chev. Impala      8    16 |
 19. | Chev. Nova        9    19 |
 21. | Dodge Diplomat   10    18 |
     +---------------------------+

. **or**
. list make x mpg if sum((mpg < 20)) <= 10 & mpg<20

     +---------------------------+
     | make              x   mpg |
     |---------------------------|
  2. | AMC Pacer         1    17 |
  5. | Buick Electra     2    15 |
  6. | Buick LeSabre     3    18 |
  9. | Buick Riviera     4    16 |
 10. | Buick Skylark     5    19 |
     |---------------------------|
 11. | Cad. Deville      6    14 |
 12. | Cad. Eldorado     7    14 |
 15. | Chev. Impala      8    16 |
 19. | Chev. Nova        9    19 |
 21. | Dodge Diplomat   10    18 |
     +---------------------------+
There have been other examples I've run across where -egen- with an -if- condition (and or explicit subscripting and/or a -bysort- prefix) have caused unexpected results, but couldnt readily find them. I'll update this post if I do.
However, I think the upshot is that the combination of functions and conditions in Stata make it powerful, however, provide for some potential for mistakes that should be checked by 'building' the logic of commands to make sure you are getting what you expect.