Just dropped in to see what condition my condition was in...
![]() |
Speaking of Kenny Rogers... |
The advice in this post falls into three categories/topics: (1) Avoid subscripting with -egen- functions; (2) be wary of conditions inside functions (as well as nested functions); and (3) there are better ways to use conditions with functions to avoid problems.
Note : a subset of this discussion appeared on Statalist in this thread: https://www.statalist.org/forums/forum/general-stata-discussion/general/1419537-sum-of-values-of-a-column-until-a-certain-point
Example data
First, let's create some example data to work with.. . clear . input ID Var_1 Y ID Var_1 Y 1 1 2 2 4 2 3 0 . 4 5 4 5 4 4
end . sort ID, stable . list +----------------+ | ID Var_1 Y | |----------------| 1. | 1 1 2 | 2. | 2 4 2 | 3. | 3 0 . | 4. | 4 5 4 | 5. | 5 4 4 | +----------------+ .Before we start, recall how total() and sum() are different:
. //First, note the difference in using total instead of sum with an if condition: . g y = sum(Var_1) if ID<4 (2 missing values generated) . egen y2 = total(Var_1) if ID<4 (2 missing values generated) . list +-------------------------+ | ID Var_1 Y y y2 | |-------------------------| 1. | 1 1 2 1 5 | 2. | 2 4 2 5 5 | 3. | 3 0 . 5 5 | 4. | 4 5 4 . . | 5. | 5 4 4 . . | +-------------------------+ . **both have missings where the if condition fails . **sum is the running sum, total is the overall summed total. . drop y y2
Careful with subscripting & -egen-.
From the Stata User manual: "Depending on fcn(), arguments, if present, refers to an expression, varlist, or a numlist, and the options are similarly fcn dependent. Explicit subscripting (using N and n), which is commonly used with generate, should not be used with egen." The reason is egen will internally change the sort order, so subscript locations may change. See this entry in the Stata user manualTo be fair, there are cases where using -egen- with subscripting works as expected (and where the Stata manual even uses them in combination in examples) but, in this example, let's focus on a few cases where doing so causes problems.
. **Consider these issues: . . egen v = sum(Y) if _n==_N (4 missing values generated) . egen vv = sum(Y) in l (4 missing values generated) . egen vvvv = total(Y) if Var_1!=0 (1 missing value generated) . egen vvv = total(Y) if Var_1[3]!=0 //explicit (5 missing values generated) . assert vvvv==vvv , rc0 4 contradictions in 5 observations assertion is false . list +--------------------------------------+ | ID Var_1 Y v vv vvvv vvv | |--------------------------------------| 1. | 1 1 2 . . 12 . | 2. | 2 4 2 . . 12 . | 3. | 3 0 . . . . . | 4. | 4 5 4 . . 12 . | 5. | 5 4 4 4 4 12 . | +--------------------------------------+ . . egen z = total(Y[1]) . sort v . egen zz = total(Y[1]) . assert z==zz , rc0 5 contradictions in 5 observations assertion is false . list +------------------------------------------------+ | ID Var_1 Y v vv vvvv vvv z zz | |------------------------------------------------| 1. | 5 4 4 4 4 12 . 10 20 | 2. | 4 5 4 . . 12 . 10 20 | 3. | 3 0 . . . . . 10 20 | 4. | 2 4 2 . . 12 . 10 20 | 5. | 1 1 2 . . 12 . 10 20 | +------------------------------------------------+ . cap drop v* . cap drop z* . . **using if and bysort with egen vs. conditions inside the function: . bys Y (ID): egen j = mean(Var_1) if Y<3 (3 missing values generated) . bys Y (ID): egen jj = mean(Var_1*(Y<3)) . list +----------------------------+ | ID Var_1 Y j jj | |----------------------------| 1. | 1 1 2 2.5 2.5 | 2. | 2 4 2 2.5 2.5 | 3. | 4 5 4 . 0 | 4. | 5 4 4 . 0 | 5. | 3 0 . . 0 | +----------------------------+ . assert j==jj , rc0 3 contradictions in 5 observations assertion is false . cap drop j* . . ** in the same way that these are not equivalent: . bysort Y: egen k = total(Var_1) if ID >3 (3 missing values generated) . bysort Y: egen kk = total(cond(ID >3 , Var_1, .)) . list +-------------------------+ | ID Var_1 Y k kk | |-------------------------| 1. | 1 1 2 . 0 | 2. | 2 4 2 . 0 | 3. | 4 5 4 9 9 | 4. | 5 4 4 9 9 | 5. | 3 0 . . 0 | +-------------------------+ . * because of missing values . cap drop k*
Understanding what conditions inside functions are doing:
In the statalist post I linked above, I mentioned how you can define conditions inside the parenthesis of a function but it is evaluating the condition as true/false logic. Consider the differences in how the sum function is either summing/adding the values contained in Var_1 versus summing the "1" value that the expression produces when the condition is true. For more on how Stata treats conditions as True/False or 1/0 conditions see Nick Cox's article "Speaking Stata: How to move step by: step". g z1 = sum(Var_1<5) //evaluates to 1 if true and sums that "1" across observations , so this is essentially a counter for number of times this is true . g z2 = Var_1[4] //value of Var1 in the 4th observation . g z3 = Var_1[_n<4] //evaluates to 1 if condition is true (2 missing values generated) . g z4 = sum(Var_1[ID<4]) //evaluates to 1 if condition is true and then sums across panel/dataset . g z5 = sum(Var_1==4) . list ID V* z* +-------------------------------------+ | ID Var_1 z1 z2 z3 z4 z5 | |-------------------------------------| 1. | 1 1 1 4 1 1 0 | 2. | 2 4 2 4 1 2 1 | 3. | 4 5 2 4 1 2 1 | 4. | 5 4 3 4 . 2 2 | 5. | 3 0 4 4 . 3 2 | +-------------------------------------+ . cap drop z*
Conditions inside and outside of functions in the same command:
When do functions like sum() and total() fail (or fail to produce things I might expect) when using conditions within the function, after the function in an [if], and using subscripting? The person who posted on SL asked why this syntax/statement did not work:. ** . egen X = sum(cond(ID<4,sum(Var_1),0)) . ** . list +---------------------+ | ID Var_1 Y X | |---------------------| 1. | 1 1 2 20 | 2. | 2 4 2 20 | 3. | 4 5 4 20 | 4. | 5 4 4 20 | 5. | 3 0 . 20 | +---------------------+ . **first consider: . g cond1 = sum(cond(ID<4,sum(Var_1),0)) . . **take the cond out of the sum() to see how it first sum's the entire column of Var_1 (inside the cond() function) and then it sum's across these values if the ID<4 condition is true. . **The zero never shows up in the values where ID is not less than 4 because you've wrapped the cond() in a sum() that would just keep adding up the column. So: . . g cond2a = cond(ID<4,sum(Var_1),0) . list ID V* cond* +-----------------------------+ | ID Var_1 cond1 cond2a | |-----------------------------| 1. | 1 1 1 1 | 2. | 2 4 6 5 | 3. | 4 5 6 0 | 4. | 5 4 6 0 | 5. | 3 0 20 14 | +-----------------------------+ . . . *so if the ID<4 condition will evaluate to 1 and you only want to add across Var_1 if that condition is true, why not write the sum() function like 'cond3' below? . * based on the SL quetsion, the original posting actually wanted the total, not the running sum, so > cond4 below is a good way to get the right answer: . g cond3 = sum(Var_1*(ID<4)) . egen cond4 = total(Var_1*(ID<4)) . list ID V* cond* +---------------------------------------------+ | ID Var_1 cond1 cond2a cond3 cond4 | |---------------------------------------------| 1. | 1 1 1 1 1 5 | 2. | 2 4 6 5 5 5 | 3. | 4 5 6 0 5 5 | 4. | 5 4 6 0 5 5 | 5. | 3 0 20 14 5 5 | +---------------------------------------------+ .Another set of examples of how to use the sum function with if and in conditions to get results outside of a -gen- or -egen- command (based on a SAS vs. Stata syntax comparison by the UNC Carolina Population Center http://www.cpc.unc.edu/research/tools/data_analysis/sas_to_stata/sas_to_stata.html):
. sysuse auto, clear (1978 Automobile Data) . list make in 1/10 if mpg < 20 +---------------+ | make | |---------------| 2. | AMC Pacer | 5. | Buick Electra | 6. | Buick LeSabre | 9. | Buick Riviera | 10. | Buick Skylark | +---------------+ . . // the order of if and in does not matter: . list make if mpg < 20 in 1/10 +---------------+ | make | |---------------| 2. | AMC Pacer | 5. | Buick Electra | 6. | Buick LeSabre | 9. | Buick Riviera | 10. | Buick Skylark | +---------------+ . . . // Both will first subset the data to the first 10 observations and then attempt to subset the data b > ased on the condition "if mpg < 20". . . // So, another approach is to use the sum() function in the -list- command. Since sum() creates a run > ning sum, you have to repeat the condition outside the sum() to subset the data to that condition to > list the first 10 observations. . . // Importantly , the sum() function adds up the true conditions because true conditions evaluate to 1 > (one) and false evaluate to 0(zero). . . **so if we now want to list the first ten times that mpg is less than 20 (which I think is the most u > seful part of this application) : . g x = sum((mpg < 20)) . list make x mpg if inrange(sum((mpg < 20)), 1, 10) & mpg<20 +---------------------------+ | make x mpg | |---------------------------| 2. | AMC Pacer 1 17 | 5. | Buick Electra 2 15 | 6. | Buick LeSabre 3 18 | 9. | Buick Riviera 4 16 | 10. | Buick Skylark 5 19 | |---------------------------| 11. | Cad. Deville 6 14 | 12. | Cad. Eldorado 7 14 | 15. | Chev. Impala 8 16 | 19. | Chev. Nova 9 19 | 21. | Dodge Diplomat 10 18 | +---------------------------+ . **or** . list make x mpg if sum((mpg < 20)) <= 10 & mpg<20 +---------------------------+ | make x mpg | |---------------------------| 2. | AMC Pacer 1 17 | 5. | Buick Electra 2 15 | 6. | Buick LeSabre 3 18 | 9. | Buick Riviera 4 16 | 10. | Buick Skylark 5 19 | |---------------------------| 11. | Cad. Deville 6 14 | 12. | Cad. Eldorado 7 14 | 15. | Chev. Impala 8 16 | 19. | Chev. Nova 9 19 | 21. | Dodge Diplomat 10 18 | +---------------------------+There have been other examples I've run across where -egen- with an -if- condition (and or explicit subscripting and/or a -bysort- prefix) have caused unexpected results, but couldnt readily find them. I'll update this post if I do.
However, I think the upshot is that the combination of functions and conditions in Stata make it powerful, however, provide for some potential for mistakes that should be checked by 'building' the logic of commands to make sure you are getting what you expect.