Bug with Regression Labels on Categorical Variables - asdocx

Tagged: encoded variables, labels, nested regression

Viewing 5 posts - 1 through 5 (of 5 total)

Author

Posts

Ross

Participant

February 12, 2025 at 12:59 pm

Post count: 3

#18571

I have some encoded categorical variables that I create like this:


gen Fund_Strategy = ""  // Initialize the new variable
replace Fund_Strategy = "[01] Micro Cap" if trim(lower(Fund_Strategy_EV_Size)) == "micro cap"
replace Fund_Strategy = "[02] Small Cap" if trim(lower(Fund_Strategy_EV_Size)) == "small cap"
replace Fund_Strategy = "[03] Lower Mid Cap" if trim(lower(Fund_Strategy_EV_Size)) == "lower mid cap"
replace Fund_Strategy = "[04] Mid Cap" if trim(lower(Fund_Strategy_EV_Size)) == "mid cap"
replace Fund_Strategy = "[05] Large Cap" if trim(lower(Fund_Strategy_EV_Size)) == "large cap"

replace Fund_Strategy = "[99] Unknown" if missing(Fund_Strategy_EV_Size)

* Step 4: Encode the Fund_Strategy variable into a numeric variable
encode Fund_Strategy, generate(cat_Fund_Strategy) label(cat_Fund_Strategy)
label variable cat_Fund_Strategy "Fund Strategy Categories, From Raw Data"

I label the variable and its categories. The problem appears when I set the base to anything other than the default of 1. That is, I set another category as the base and run my regression against the remaining categories, like this:


global x_fact_controls ///
    ib(2).cat_Fund_Strategy  ///  // Base: [02] Small Cap
    ib(3).cat_Fund_Status         // Base: [03] Divestment

The asdocx Word output for nested regressions seems to misinterpret that. To be clear, this is only a labeling problem: the regressions themselves correctly drop the base category from the output values.

Also, for binary variables coded 0 and 1, where 0 is automatically the base, the labels seem to be indented quite far to the right.
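For the base-level labels, one way to confirm that the base is set correctly in Stata itself, independent of the exported table, is to list the codes that encode assigned and ask regress to print the base row. The sketch below runs on made-up data: the outcome y and the equal-sized groups are assumptions, not the real dataset. Note that ib2. selects the category whose value is 2, while ib(#2). selects the second category in sorted order. The two coincide for an encoded variable numbered 1, 2, 3, and so on, but not for a variable coded 0 to 3.

```stata
* Minimal sketch on made-up data: y and the group sizes are assumptions.
clear
set seed 1234
set obs 600

* Six equally sized categories, using the same bracketed prefixes as above.
gen Fund_Strategy = "[01] Micro Cap"
replace Fund_Strategy = "[02] Small Cap"     if mod(_n, 6) == 1
replace Fund_Strategy = "[03] Lower Mid Cap" if mod(_n, 6) == 2
replace Fund_Strategy = "[04] Mid Cap"       if mod(_n, 6) == 3
replace Fund_Strategy = "[05] Large Cap"     if mod(_n, 6) == 4
replace Fund_Strategy = "[99] Unknown"       if mod(_n, 6) == 5

encode Fund_Strategy, generate(cat_Fund_Strategy) label(cat_Fund_Strategy)
gen y = runiform()                      // hypothetical outcome

* encode numbers the categories alphabetically, starting at 1,
* so "[02] Small Cap" should be listed with code 2.
label list cat_Fund_Strategy

* baselevels adds the base category to the coefficient table, marked (base).
regress y ib2.cat_Fund_Strategy, baselevels
```

If the row marked (base) in the Stata results window is the intended category but the exported document attaches the reference text to a different row, the problem is confined to the export step.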

Attaullah Shah

Moderator

February 12, 2025 at 2:05 pm

Post count: 80

#18572

Dear Ross,
To replicate the issue you’re facing, I created a dummy dataset and ran a nested regression. In the code and output below, I noticed that the base category row includes extra wording (“Fund Strategy Categories, From Raw Data”), which I will fix. If you are experiencing a different issue, please either use my dataset or share a sample of your own so I can better understand the problem.


*-----------------------------------------
* Step 1: Create dummy data
*-----------------------------------------
clear
set seed 1234
set obs 10

* Create a string variable Fund_Strategy_EV_Size
gen Fund_Strategy_EV_Size = ""

* Populate dummy data with various cases and missing values:
replace Fund_Strategy_EV_Size = "micro cap"       in 1
replace Fund_Strategy_EV_Size = "small cap"       in 2
replace Fund_Strategy_EV_Size = "lower mid cap"   in 3
replace Fund_Strategy_EV_Size = "mid cap"         in 4
replace Fund_Strategy_EV_Size = "large cap"       in 5
replace Fund_Strategy_EV_Size = "MICRO CAP"       in 6  // test case: uppercase letters
replace Fund_Strategy_EV_Size = "sMaLL CaP"       in 7  // test case: mixed case
* Observations 8 and 10 remain empty (i.e., missing)
replace Fund_Strategy_EV_Size = "mid cap"         in 9


*-----------------------------------------
* Step 2: Create and assign Fund_Strategy based on EV_Size
*-----------------------------------------
gen Fund_Strategy = ""  // Initialize the new variable

replace Fund_Strategy = "[01] Micro Cap"  if trim(lower(Fund_Strategy_EV_Size)) == "micro cap"
replace Fund_Strategy = "[02] Small Cap"  if trim(lower(Fund_Strategy_EV_Size)) == "small cap"
replace Fund_Strategy = "[03] Lower Mid Cap"  if trim(lower(Fund_Strategy_EV_Size)) == "lower mid cap"
replace Fund_Strategy = "[04] Mid Cap"  if trim(lower(Fund_Strategy_EV_Size)) == "mid cap"
replace Fund_Strategy = "[05] Large Cap"  if trim(lower(Fund_Strategy_EV_Size)) == "large cap"

* For observations with a missing Fund_Strategy_EV_Size, mark as Unknown
replace Fund_Strategy = "[99] Unknown" if missing(Fund_Strategy_EV_Size)

*-----------------------------------------
* Step 3: Encode Fund_Strategy into a numeric variable
*-----------------------------------------
encode Fund_Strategy, generate(cat_Fund_Strategy) label(cat_Fund_Strategy)
label variable cat_Fund_Strategy "Fund Strategy Categories, From Raw Data"
expand 100


gen returns = uniform()
gen size = uniform()
gen expense = returns + uniform()/10
gen cat_Fund_Status = mod(_n,4)
label define fundstatus 1 "Active" 2 "Passive" 3 "Divestment" 0 "Unknown", modify
label values cat_Fund_Status fundstatus

global x_fact_controls ///
    ib(2).cat_Fund_Strategy  ///  // Base: [02] Small Cap
    ib(3).cat_Fund_Status         // Base: [03] Divestment

asdocx reg returns size expense $x_fact_controls, replace label tzok fs(9) abb(.) nest

Table: Regression results

	(1)
Variables	returns
size	-0.001
	(0.003)
expense	0.990***
	(0.003)
[01] Micro Cap	0.002
	(0.003)
Fund Strategy Categories, From Raw Data : base [02] Small Cap

[03] Lower Mid Cap	0.001
	(0.003)
[04] Mid Cap	0.000
	(0.003)
[05] Large Cap	-0.003
	(0.003)
[99] Unknown	0.003
	(0.003)
Unknown	-0.001
	(0.003)
Active	-0.002
	(0.003)
Passive	0.001
	(0.003)
: base Divestment

Intercept	-0.044***
	(0.003)
Observations	1000.000
R²	0.991
Notes: Standard errors are in parentheses. *** p<.01, ** p<.05, * p<.1

Ross

Participant

February 13, 2025 at 10:14 am

Post count: 3

#18576

Thank you, Dr. Shah, but the issue is a bit different. I actually have two issues.
1) Basically, the generated base (aka reference) text is being attached to the wrong label. The generated text “Reference: )” appears on the 1st category and not on the one I assign as the base for the regression.

Using the above example, I get “Fund Strategy Categories, From Raw Data (Reference: Micro Cap)” on Micro Cap, which is not the base I set with this syntax; it should be assigned to Small Cap:


global x_fact_controls ///
    ib(2).cat_Fund_Strategy       // Base: [02] Small Cap

Said differently, the text “(Reference: )” should move from the 1st label to the 2nd label, “Small Cap.” I actually rather like having the variable label “Fund Strategy Categories, From Raw Data” before the categories; if it can stay, that would be great!

2) The spacing of individual variables/labels in the docx output is inconsistent: sometimes a label is indented and sometimes not. It seems to happen only on my dummy variables, whose values are either 0 or 1.

COVID-19 Recession (2020)

New Normal, Post-Pandemic Economy (2021-2024)

Dummy indicator if fund is the first-time fund of the PE firm (Reference: Not first-time fund)

Yes a first-time fund

Dummy Indicator If Fund Is Completed (Reference: Unrealized Fund)

Also, the first part of the dummy variables' labels was always cut off, but I worked around it by adding some characters at the start of each label.


Syntax:
gen d_first_time_fund = 0
replace d_first_time_fund = 1 if Fund_Generation_Sequence == 1
label variable d_first_time_fund "Dummy indicator if fund is the first-time fund of the PE firm"
label define first_time_fund_lbl 0 "0 Not first-time fund" 1 "1 Yes a first-time fund"
label values d_first_time_fund first_time_fund_lbl
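Before attributing the truncated labels to the export step, it may help to confirm that the value labels are stored in Stata exactly as intended. The following is a minimal sketch, assuming made-up values for Fund_Generation_Sequence (the six observations and the missing value are illustrative only). It builds the dummy in one step and prints the stored labels:

```stata
* Minimal sketch on made-up data: the six observations are assumptions.
clear
set obs 6
gen Fund_Generation_Sequence = _n
replace Fund_Generation_Sequence = . in 6     // one missing value for illustration

* One-step equivalent of gen ... = 0 followed by replace ... = 1:
* the expression is 1 when the sequence equals 1 and 0 otherwise,
* including when the sequence is missing.
gen byte d_first_time_fund = (Fund_Generation_Sequence == 1)
label variable d_first_time_fund "Dummy indicator if fund is the first-time fund of the PE firm"
label define first_time_fund_lbl 0 "0 Not first-time fund" 1 "1 Yes a first-time fund"
label values d_first_time_fund first_time_fund_lbl

* Show the label text exactly as Stata stores it.
label list first_time_fund_lbl
tabulate d_first_time_fund, missing
```

If label list shows the full text, any truncation or extra indentation is introduced when the table is written, not when the labels are defined.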

Ross

Participant

February 13, 2025 at 10:22 am

Post count: 3

#18577

Actually, I just noticed that your output has “base” and not “(Reference:” before the category label and after the variable label. I wonder why that is?

My regression syntax is like this.


asdocx regress V1 V2 ///
	$x_cont_controls $x_fact_controls,  ///
	replace nest ///
	rep(p) save($export_folder\temp_reg01.doc) title(`reg1') dec(2) abb(.) fs(8) label 
	
asdocx regress V3 V2 ///
	$x_cont_controls $x_fact_controls,  ///
	nest ///
	rep(p) save($export_folder\temp_reg01) title(`reg1') dec(2) stat(r2_a) abb(.) fs(8) label

Attaullah Shah

Moderator

February 13, 2025 at 1:44 pm

Post count: 80

#18578

Dear Ross,
Could you please email some sample data that replicates the issue you are having? I think the specific issue might be an artifact of your dataset.
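One possible way to prepare such a sample is to save a small random extract that contains only the regression variables. The sketch below is self-contained and uses made-up data as a stand-in for the real dataset; the file name asdocx_sample.dta, the sample size of 50, and the generated values are all assumptions. Value labels are stored inside the .dta file, so the labels involved in the issue travel with the extract.

```stata
* Hypothetical stand-in for the real data; replace with your own dataset.
clear
set seed 2025
set obs 500
gen V1 = runiform()
gen V2 = runiform()
gen cat_Fund_Strategy = ceil(6 * runiform())
gen cat_Fund_Status   = mod(_n, 4)
label define fundstatus 1 "Active" 2 "Passive" 3 "Divestment" 0 "Unknown"
label values cat_Fund_Status fundstatus

* Keep only the regression variables and a random subset of rows,
* then restore the full dataset in memory.
preserve
keep V1 V2 cat_Fund_Strategy cat_Fund_Status
sample 50, count                        // 50 randomly chosen observations
save "asdocx_sample.dta", replace       // hypothetical file name
restore
```

Running the same asdocx command on the saved extract before sending it confirms that the sample still reproduces the labeling problem.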
