Multiple Regression with purely dummy (qualitative) variables

[ARCHIVED THREAD] - Multiple Regression with purely dummy (qualitative) variables

Posted: 4/25/2017 9:50:59 PM EDT

Is there something special that needs to be done if doing a regression with variables that are always either 0 or 1?

I'm writing software to do regressions by solving for B in Y=BX. So far it's worked well with quantitative variables, but when the variables are purely qualitative I get weird values: every coefficient comes out as the same huge number. I remove all columns that don't contain any 1s prior to doing the regression.

Posted: 4/26/2017 12:21:46 AM EDT

[#1]

Read up on categorical variables, dummy and dichotomous variables in multiple regression as search terms. If all variables are categorical, you are really doing the same thing as ANOVA.

You need to leave out one variable as the reference variable; not sure from your description if you are doing that. I.e., you can't have two columns that are exact opposites. Let's say you sold tables and chairs and you entered a column where 1 = chair, 0 = not chair: you would leave out the column where 1 = table, 0 = not table. Some software will drop one automatically, some will just return an error, but it sounds like you are doing custom work.

If you are using a dichotomous 0, 1 variable as a dependent variable, use binary logistic regression and interpret using the odds-ratios.

http://groups.chass.utoronto.ca/pol242/Labs/LM-9B/LM-9B_content.htm

ETA: by reference variable, let's take race as an example with four options: you would enter three 0/1 dichotomized (dummy coded) columns. The one you leave out is the reference, and you interpret each coefficient as +/- relative to the one you left out. Leave them all in alongside the intercept column and you get perfect multicollinearity. It doesn't matter which one you leave out; that only changes your interpretation.
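To make the n-1 coding concrete, here is a minimal Python sketch (the software discussed in this thread is C#; Python is used only for brevity). The function name `dummy_code` and the sample categories are made up for illustration, not taken from any library.

```python
def dummy_code(values, reference):
    """Return (column_names, rows): one 0/1 column per category except `reference`."""
    categories = sorted(set(values))
    if reference not in categories:
        raise ValueError("reference must be one of the observed categories")
    # Keep every category except the reference, so n categories give n-1 columns.
    kept = [c for c in categories if c != reference]
    rows = [[1 if v == c else 0 for c in kept] for v in values]
    return kept, rows


# Hypothetical data: three categories, "table" used as the reference.
names, rows = dummy_code(["chair", "table", "chair", "desk"], reference="table")
print(names)  # ['chair', 'desk']
print(rows)   # [[1, 0], [0, 0], [1, 0], [0, 1]]
```

A row of all zeros identifies the reference category, which is why no separate column is needed for it.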

Posted: 4/26/2017 12:55:23 AM EDT

[#2]

I was doing it with about 5 dummy variables. Each observation has one and exactly one '1' (not counting the first column, see below), so the Matrix looked something like this:

- A B C D E
1 1 0 0 0 0
1 1 0 0 0 0
1 0 0 0 1 0
1 0 1 0 0 0
1 0 0 1 0 0
1 0 0 0 0 1

I'm doing this in C# using a Linear Algebra library. It doesn't have a statistics package so I have to extract all that out from the math itself. The first column of 1s is added to solve for the intercept value, I do this with all regressions.

Edit: Dependent variable is not categorical.
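A quick way to see the problem with this matrix: columns A through E add up to the first column of 1s, so the six columns are linearly dependent and X^T X cannot be inverted. The Python sketch below checks the rank of the matrix above with exact fractions; the helper name `rank` is an illustrative assumption, not part of any library mentioned in the thread.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix via Gauss-Jordan elimination with exact fractions."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    r = 0  # number of pivots found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if m[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(n_rows):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == n_rows:
            break
    return r


# The design matrix from the post: intercept column, then A..E.
X = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1],
]
X_drop_e = [row[:-1] for row in X]  # same matrix without column E

print(rank(X), len(X[0]))                # 5 6  -> rank-deficient
print(rank(X_drop_e), len(X_drop_e[0]))  # 5 5  -> full column rank
```

With floating-point arithmetic a solver may not fail outright on the singular case and can instead return enormous, meaningless coefficients.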

Posted: 4/26/2017 1:15:44 AM EDT

[#3]

Right, so in this case you would leave out the E column, for instance, and then interpret the coefficients for A-D as +/- vs. E.
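As a rough illustration of that interpretation, the Python sketch below fits the example matrix with column E dropped by solving the normal equations (X^T X) b = X^T y. The y values are made up purely for the example, and `solve` and `ols` are hypothetical helper names, not the poster's actual code.

```python
from fractions import Fraction


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination (a must be square and non-singular)."""
    n = len(a)
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(i for i in range(col, n) if m[i][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        pivot_value = m[col][col]
        m[col] = [v / pivot_value for v in m[col]]
        for i in range(n):
            if i != col:
                factor = m[i][col]
                m[i] = [v - factor * p for v, p in zip(m[i], m[col])]
    return [row[-1] for row in m]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X^T X) b = X^T y."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Columns: intercept, A, B, C, D (E is the reference and has no column).
X = [
    [1, 1, 0, 0, 0],  # A
    [1, 1, 0, 0, 0],  # A
    [1, 0, 0, 0, 1],  # D
    [1, 0, 1, 0, 0],  # B
    [1, 0, 0, 1, 0],  # C
    [1, 0, 0, 0, 0],  # E (all dummies zero)
]
y = [10, 12, 20, 30, 40, 50]  # made-up dependent variable

beta = ols(X, y)
print([float(b) for b in beta])  # [50.0, -39.0, -20.0, -10.0, -30.0]
```

Here the intercept (50) is the mean of y for category E, and each remaining coefficient is that category's mean minus the E mean; for example A averages 11, which is 39 below E.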

How you program it to drop the last category may pose some difficulty if the number of categories changes on the fly (e.g. a drop-down menu to select variables to enter in the model: gender would only add one column, your example would only add four columns, my example would only add three, etc.).

Run it that way through your process, then do the same in Excel using columns A-D and see what you get.

I have no idea how the first column comes into play doing what you are doing though, but shouldn't be an issue as long as it is working w/ continuous variables.

Posted: 4/26/2017 1:23:07 AM EDT

[#4]

It's for the design matrix.

At least when doing normal multiple regressions, the parameter vector (usually β) is one element longer than the number of independent variables, so the first value is the intercept value of the line.

I was hoping to use the exact same math regardless of whether all values are indicators, since the idea of the software is that users decide which independent variables to use based on their data and run them through a regression to get a model. It sounds like that won't be the case, though, and I'll need a special case for all values being dummies.

Posted: 4/26/2017 1:31:01 AM EDT

[#5]

Not a special case for all dummy variables; a special case for any dummy variables. The only reason the results look reasonable when other variables are included is that they cloud the issue, which is that you are entering all five dummy columns. Any time the user selects a categorical variable, only n-1 columns (n being the number of categories) should be added to the equation, giving you a reference category.

You can download a demo/freebie version of SPSS/SAS/Stata/R/Alteryx/Minitab (I forget which have demos or personal licenses; for sure Alteryx and R) or use the Excel Data Analysis plugin to check your results.

ETA: actually, if you set it up so that each column has to be selected individually, and add logic so that only 4 of 5, or 1 of 2, etc. columns of a categorical variable can be chosen (e.g. you can select male or female, but not both), or so that only one can be selected (A vs. all others), you could get around it that way.

I suppose you could have a pop-up that explains what they need to do.

Posted: 4/26/2017 1:46:23 AM EDT

[#6]

Thanks, I've been using Excel to check the results we get so far. Maybe when I'm back in I can get an actual dataset to show as an example.

Posted: 4/26/2017 5:44:05 PM EDT

[#7]

I suppose you could also leave out the first column of 1s and keep all the dummy columns; each coefficient is then the mean of the dependent variable for that category (not relative to anything).
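A small Python sketch of that no-intercept variant, using the A-E layout from earlier in the thread and made-up y values. Because each row has exactly one 1, the dummy columns are orthogonal and X^T X is diagonal, so the normal equations reduce to "sum of y in the category divided by the count".

```python
# Columns A..E only, with no leading column of 1s.
X = [
    [1, 0, 0, 0, 0],  # A
    [1, 0, 0, 0, 0],  # A
    [0, 0, 0, 1, 0],  # D
    [0, 1, 0, 0, 0],  # B
    [0, 0, 1, 0, 0],  # C
    [0, 0, 0, 0, 1],  # E
]
y = [10.0, 12.0, 20.0, 30.0, 40.0, 50.0]  # made-up dependent variable

k = len(X[0])
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]

# One 1 per row means the off-diagonal entries of X^T X are all zero.
assert all(xtx[i][j] == 0 for i in range(k) for j in range(k) if i != j)

# Diagonal system: each coefficient is (sum of y in category) / (count in category).
beta = [xty[i] / xtx[i][i] for i in range(k)]
print(beta)  # [11.0, 30.0, 40.0, 20.0, 50.0]
```

The coefficients are the per-category means of y for A, B, C, D and E in that order.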

Posted: 4/26/2017 9:27:20 PM EDT

[#8]

Ok I made sense of the advice given this morning and got it working. Taking the indicator value with the most '1' values and removing it to use it as reference is the current plan.
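One way that plan could look, sketched in Python rather than C#: count the 1s in each indicator column, drop the column with the most, and prepend the intercept column. The function names `pick_reference` and `drop_column` are hypothetical, and ties are broken by taking the first column, which is an assumption rather than anything stated in the thread.

```python
def pick_reference(names, rows):
    """Index and name of the indicator column with the most 1s (first one wins ties)."""
    counts = [sum(row[j] for row in rows) for j in range(len(names))]
    ref = counts.index(max(counts))
    return ref, names[ref]


def drop_column(rows, j):
    """Copy of rows with column j removed."""
    return [row[:j] + row[j + 1:] for row in rows]


names = ["A", "B", "C", "D", "E"]
rows = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]

ref, ref_name = pick_reference(names, rows)
# Design matrix: intercept column of 1s followed by the remaining indicators.
design = [[1] + r for r in drop_column(rows, ref)]

print(ref_name)   # A
print(design[0])  # [1, 0, 0, 0, 0]
```

With A as the reference, the rows that belonged to A become all zeros after the intercept, and the remaining coefficients read as differences from A.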

Posted: 4/29/2017 11:05:04 PM EDT

[#9]

That's a pretty sound way to go; then you are always comparing against the majority/plurality. E.g. for race the reference would (typically) be white, and the interpretation is black +/- vs. white, Hispanic +/- vs. white, Asian vs. ..., and then you can compare the coefficients of those three in the model to each other.

Also, if you aren't planning on it already, the standardized coefficients are very helpful: you can sum the absolute values of the standardized coefficients, divide each absolute value by the sum, and come up with a % driver weight for each of the variables in the model. Lots and lots of presentations are driven by pie charts or stacked bar charts with those kinds of %s.
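The driver-weight arithmetic can be sketched in a few lines of Python. This assumes the standardized coefficients have already been computed (for a single predictor, the raw coefficient times the standard deviation of x divided by the standard deviation of y); the variable names and values below are invented for illustration.

```python
def driver_weights(std_coefs):
    """Share of the summed absolute standardized coefficients for each variable."""
    total = sum(abs(b) for b in std_coefs.values())
    return {name: abs(b) / total for name, b in std_coefs.items()}


# Hypothetical standardized coefficients from some fitted model.
weights = driver_weights({"price": -0.50, "quality": 0.25, "service": 0.25})
print(weights)  # {'price': 0.5, 'quality': 0.25, 'service': 0.25}
```

The sign is discarded, so the weights only say how much each variable matters relative to the others, not the direction of its effect.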

Posted: 5/21/2017 5:48:54 PM EDT

[#10]

"At least when doing normal multiple regressions, the parameter vector (usually β) is one element longer than the number of independent variables,"

Makes it a defined system.

You cannot solve a system that has no 'right hand sides.'
