WEEK 11:
BIVARIATE AND MULTIPLE REGRESSION
Regression is actually a pretty
cool technique. It requires that your variables all be measured at the interval
level, because it provides a mathematical prediction of the exact value of your
dependent variable.
Bivariate Regression
In the bivariate (two variables) case, without a
computer you can get a sheet of graph paper, put the independent variable at
the bottom of the page (the X axis), and put the dependent variable on the left
side (Y axis). Make sure each variable's scale has equal distances
between the graph paper lines (each line might be 5, 10, 15; not 5, 15, 30,
100). Then, just plot each point's position on the two-dimensional graph
paper.
This technique finds
the best fitting straight line through
a set of points. Best fitting is defined by minimizing the sum of squared
distances between the points and the regression line.
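In case you're curious how a computer does it, here is a minimal Python sketch of that criterion (the data points are made up for illustration). The least-squares line produces a smaller sum of squared distances than any rival line you might draw:

    import numpy as np

    x = np.array([1.0, 2, 3, 4, 5])          # made-up independent variable
    y = np.array([2.0, 4, 5, 4, 6])          # made-up dependent variable

    b, a = np.polyfit(x, y, 1)               # least-squares slope and Y intercept

    def sum_of_squares(intercept, slope):
        # sum of squared vertical distances between the points and a line
        return np.sum((y - (intercept + slope * x)) ** 2)

    # The fitted line beats an arbitrary rival line (any rival, in fact).
    print(sum_of_squares(a, b) <= sum_of_squares(2.0, 0.5))   # True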
For example, you’d like to think that all of
your years of higher education are worth something, dollar wise. Using the
2010-2014 pooled Mississippi Poll, my hypothesis was that education affected
family income, so education was the independent variable at the bottom, and
income was the dependent variable on the left side. My graph paper has vertical
and horizontal lines across the page every few millimeters, so I had each
vertical line reflect one year of education completed (high school grad is 12
years, etc.), and each horizontal line represent $2,000 in family income. When
I plotted all of the points, I got a nice pretty straight line that went upward
and to the right. All of the points (people) fell closely around that line. If
you remember from Algebra, any straight line can be represented by a formula:
Y = a + (b * X), where Y is the dependent
variable, X is the independent variable, a is the Y intercept, and b is the slope of the line, or (change
in Y)/(change in X).
The Y intercept is where the line crosses the Y axis. What is that Y value? It is the Y value when X is 0. In this case, it is Family Income when one has zero years of education.
b = unstandardized regression coefficient = slope
= (change in Y) / (change in X)
The equation for this set of points ended up
being:
Family Income = -18,643 + (4,620 * Education years)
The b value is the slope, $4,620: the change in Y (income) for every
1-unit change in X (education). That means that for every 1 additional year of
education, your family income on average rises $4,620. So, on average, getting
a BA degree (four more years) results in a family income $18,480 higher than
that of someone with just a high school degree.
The a value, or Y intercept, is where the line
crosses the Y axis, so it is the Y value when X has a value of 0. Here, someone
with zero years of education would have a negative family income of $18,643.
That sounds impossible, but maybe they are unemployed and on welfare: they have
no earned income, and are entirely dependent on government benefits. Regression
is a little weird for such extreme cases, as it most accurately plots the
points for values of the variables that actual people have. Therefore, your
Y intercept may be an "impossible" value.
Another thing that must
be considered with regression is that it makes these nice predictions, but the
prediction only holds "on average." For example, those with 2 years of college
might have a MEAN income of $46,037, computed as Income = -18,643 +
(4,620 * 14). Some community college grads will be above that, and some will be
below. Hence, we say that there is variation around the mean.
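If you want to verify these "on average" predictions yourself, here is a minimal Python sketch that just plugs education values into the equation above (the function name is my own):

    def predicted_income(years_of_education):
        # mean family income predicted by the regression line above
        return -18643 + 4620 * years_of_education

    for years in (0, 12, 14, 16):
        print(years, "years ->", predicted_income(years))
    # 0 years  -> -18,643 (the Y intercept)
    # 12 years -> 36,797  (high school graduate)
    # 14 years -> 46,037  (two years of college)
    # 16 years -> 55,277  (BA degree; $18,480 above the high school graduate)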
That raises the
question of how good our predictive equation really is, which gets us into
explained variance.
R² is explained variance: the variance in Y
explained by the independent variable's regression line.
R² = (Total Variation - Unexplained Variation) / Total Variation
Total Variation = sum of squared distances
between the mean of Y and each case's Y value
Unexplained Variation (Residual) = sum of
squared distances between each case's Y value and each case's predicted Y value
(from the regression equation)
Explained Variation = sum of squared distances
between each case's predicted Y value and the mean of Y.
R² is the proportion of variation explained.
It is the predictive ability of your independent variable.
Adjusted R² shrinks the value
of R² by penalizing for each additional independent variable,
and is statistically preferable to R². It accounts for the fact
that if you had as many predictors as you have cases (people), you would have
100% explained variance.
The F statistic tests the statistical
significance of the regression equation as a whole; its significance level must be below .05.
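Here is a minimal Python sketch of those variance formulas, using made-up data (SPSS computes all of this for you):

    import numpy as np

    x = np.array([10.0, 12, 12, 14, 16, 16, 18, 20])   # hypothetical education (years)
    y = np.array([25.0, 32, 40, 45, 55, 48, 70, 76])   # hypothetical income ($1,000s)

    b, a = np.polyfit(x, y, 1)       # slope and Y intercept of the best-fitting line
    y_pred = a + b * x               # each case's predicted Y value

    total = np.sum((y - y.mean()) ** 2)           # Total Variation
    unexplained = np.sum((y - y_pred) ** 2)       # Unexplained (Residual) Variation
    explained = np.sum((y_pred - y.mean()) ** 2)  # Explained Variation

    r2 = (total - unexplained) / total            # R-squared
    n, k = len(y), 1                              # number of cases, predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1) # penalizes extra predictors
    f_stat = (explained / k) / (unexplained / (n - k - 1))  # F for the equation
    print(round(r2, 3), round(adj_r2, 3), round(f_stat, 1))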
All of that will make
more sense if you take a statistics class, or go to graduate school. The SPSS computer
program does the statistical calculations for you. How good a job you are doing
explaining your dependent variable with your independent variable is the
Adjusted R squared value. It ranges in value from a low of 0% (no value) to a
high of 100% (perfect prediction). In my education affects income example
above, the adjusted R squared is 16%. So while you get a nice straight line,
there are a lot of points that are above the line, or below the line.
Therefore, some people make a lot more money than you would expect given their
education level (a high school dropout who starts her own company and becomes
a millionaire). And there are some people who make a lot less money than you
would expect from their education level (a PhD in 18th century
Serbian literature who is now driving an Uber).
One problem with
regression is that it is affected by deviant cases: a few scattered cases that
are way off the line. Their squared distances from the line are large,
so they pull the line towards them. So, for example, most English professors
don't make a lot of money. However, one of them at MSU used to be the Academic
Vice President. Rejoining the faculty in his department, he made a lot more
than the other professors. If a researcher plots all of their salaries by
seniority, he would be way above the regression line. That would pull the regression
line towards his point, making it steeper, so the b value would
be larger. Hence, the researcher would think that all of the professors in that
department were getting nice pay raises every year. (This actually happened
at MSU one year, and the professors were threatened with a low pay raise.) In
cases of deviant points, we sometimes drop those points from the analysis and
plot the line without them.
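A quick Python sketch shows the pull of a deviant case (all numbers are hypothetical):

    import numpy as np

    seniority = np.array([2.0, 5, 8, 12, 15, 20, 25])   # years of service
    salary = np.array([52.0, 55, 58, 62, 65, 70, 75])   # salary in $1,000s

    slope_without, _ = np.polyfit(seniority, salary, 1)

    # Add one former administrator paid far above his colleagues:
    slope_with, _ = np.polyfit(np.append(seniority, 18.0),
                               np.append(salary, 160.0), 1)

    print(round(slope_without, 2), round(slope_with, 2))  # the b value gets larger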
Example of calculating a bivariate regression
problem.
You are asked to examine the relationship
between years of service since receiving a PhD degree, and nine-month salaries
of ten history professors. You need to plot the following points on graph
paper, and then calculate the b value (unstandardized regression coefficient
value or slope) and the y-intercept. Also, calculate what salary would have to
be given to a senior professor with 30 years of service since their PhD (if
they were hired from another university). Also, what would the starting salary
be (for someone with zero years of service who just got their PhD):
In this example, you can see that the b value or
slope would be $1,000. Two points on the line might be 10 and 20 years of service,
so change in Y would be $75,000-$65,000= $10,000. Change in X would be (20-10)
= 10. So, $10,000/10 = $1,000.
The Y intercept would be $55,000. You can check
the graph paper, or just do the calculation: each year of service is worth $1,000,
so the starting salary (zero years of service) would be the $57,000 paid at 2 years
of service, minus $2,000, which equals $55,000. The equation is Y = $55,000 + ($1,000) * (X).
Someone with 30 years of service, X = 30, just
put 30 into the equation where X is. So, Y = $55,000 + ($1,000) * (30) =
$85,000
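If you want to double-check that arithmetic, a few lines of Python will do it:

    x1, y1 = 10, 65000    # 10 years of service -> $65,000 (a point on the line)
    x2, y2 = 20, 75000    # 20 years of service -> $75,000

    b = (y2 - y1) / (x2 - x1)    # slope: 1,000 dollars per year of service
    a = y1 - b * x1              # Y intercept: 55,000
    print(b, a, a + b * 30)      # 30 years of service -> 85,000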
MULTIPLE REGRESSION
Multiple regression is a
really cool technique that lets you increase your predictive ability by
adding more independent variables. For example, we only predicted 16% of the
variance in family income from one's education. What other variables can we add
to the equation to increase our predictive ability? Well, how about the number of
wage earners in the family? A two-income couple is likely to have a higher
family income than a one-income household. Also, how about what someone majored
in? Someone in engineering will make more than someone in philosophy (because,
who wants to work in math or engineering their whole life, no offense…).
Another factor is the quality of the education institution. I think MSU has a
high reputation among employers; indeed, many of them are our alumni. Look at
the year of the Covid crisis. MSU stayed open with many faculty offering in-person classes, doing the best we could to make the best
of a difficult situation. Many other universities were closed for the entire year, offering only remote classes. I
heard that at one university in Texas, one dean self-isolated at
home for two weeks (his faculty laughed and said he never came to the office
anyway), and when he tried to dump more of his work on his department heads, at
least one department head promptly resigned. You can pretty much figure out
what reputation such universities have.
Multiple Regression is linear regression applied to more than one independent
variable. With two independent variables, the predicted values comprise a plane
(instead of a line in the one independent variable case).
A Multiple Regression Equation
with four predictors:
Y = a + (b1 * X1) + (b2 * X2) + (b3 * X3) + (b4 * X4)
The b value is the
unstandardized regression coefficient, controlling for the effects of all other
predictors. It is used to predict the value of the dependent variable from the
known values of the independent variables.
The b value is also used in making comparisons
across subsamples. For example, is an independent variable more important in
affecting the dependent variable among men or among women?
Beta is the standardized regression coefficient, controlling for the
effects of all other predictors. It tells the relative importance of the
independent variables in influencing the dependent variable. Its absolute value
ranges from 0 to 1, with 1 being most important and 0 being least important.
Negative signs reflect the direction of the variables' coding.
Beta = standardized regression coefficient = b *
(sdX / sdY), where sd means standard deviation. It adjusts
for the differing ranges and scales of the variables.
Pearson R is the correlation coefficient. It
equals the Beta in the bivariate case only.
Repeating, Beta ranges from -1 to +1, with 0
being no relationship between the independent and dependent variables. The sign
depends on the direction of the coding of your variables. A +1 or -1 is a
perfect relationship. b values have a greater range, which is not confined to
-1 to +1.
In other words, Beta is kind of like Gamma. Take
the absolute value of the Beta, and that shows which variable is the most important,
which is second most important, and which is least important.
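Here is a minimal Python sketch of the same idea, with made-up data and variable names of my own choosing. It fits a two-predictor regression and converts each b into a Beta using the formula above, so their absolute values can be compared:

    import numpy as np

    rng = np.random.default_rng(0)
    educ = rng.uniform(8, 20, 200)                    # years of education
    earners = rng.integers(1, 4, 200).astype(float)   # wage earners in the family
    income = -15 + 4.5 * educ + 12 * earners + rng.normal(0, 10, 200)  # $1,000s

    X = np.column_stack([np.ones(200), educ, earners])  # 1s column = intercept
    coefs, *_ = np.linalg.lstsq(X, income, rcond=None)  # least-squares fit
    a, b_educ, b_earners = coefs                        # intercept and the b values

    # Beta = b * (sd of X / sd of Y), as in the formula above
    beta_educ = b_educ * educ.std() / income.std()
    beta_earners = b_earners * earners.std() / income.std()
    print(round(beta_educ, 2), round(beta_earners, 2))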
You’ll notice in your computer output that I
ran a multiple regression equation for the three variables in your model. That
regression analysis largely confirmed your bivariate table results. I then
came up with multivariate tables unique to each of your papers that I thought
gave interesting results to comment on.
Multiple R is the correlation
between the actual Y value and the predicted Y value from the multiple
regression equation.
R² is the variance in
the dependent variable explained by all of the independent variables.
Once again, the Adjusted R² is the
most accurate value to use: as explained above, it shrinks the value of R²
by penalizing for each additional independent variable.
In the education-income example, what do you
think would happen to this Adjusted R squared value when I add other predictors
that are indeed relevant and important? Obviously, you’d expect your predictive
ability to go up. So, this is a really cool technique. But it is a little
advanced for undergrads, so it is not required in your research paper.
It is covered more in graduate school or other statistics classes.
The advantage of multiple regression is that it
can simultaneously look at more than two predictors, and estimate how important
each predictor is after discounting the other predictors. Your multivariate tables
approach actually looks at only two predictors at the same time. Multiple
regression can look at 3, 4, however many you need. It is a very powerful
technique. But there are all sorts of statistical assumptions that have to be
met to use multiple regression. Those are typically covered in a graduate program.
Example of a multiple regression equation
problem (taken from the 2006-2010 Mississippi Polls).
Predicting who believes they have been racially
profiled. This dependent variable is coded 1 for reported being profiled, and 2
for reported not being profiled. The independent variables and their coding
follow:
The Betas or standardized regression
coefficients for these predictors follow:
The significance levels for each of these
regression coefficients follow:
The adjusted R-squared for this regression
equation is 12%.
Using the above information, answer the
following questions:
Test example for multiple regression. Try to
answer it yourself:
(20 points) A multiple regression analysis of the 2010-2014
Mississippi Poll data examines the causes of party identification using five
independent variables. The variables are coded as follows:
Party identification ranges from 1 for Strong Democrat to 7 for Strong Republican.
Ideology ranges from 1 for Very Liberal to 5 for Very Conservative.
Race is coded as 1 for White, and 2 for African American.
Family Income is coded as 1 for Under $10,000 to 8 for Over $70,000.
Sex is coded as 1 for Male and 2 for Female.
Age ranges from 18 to 93, and is age in years.
The adjusted R squared for this multiple regression equation is .52.
The Betas or standardized regression coefficients and their signs are:
Ideology = +.276
Race = -.531
Income = +.157
Sex = -.035
Age = -.052
The significance levels of these five predictors are:
Ideology = .001
Race = .001
Income = .001
Sex = .129
Age = .023
Answer the following questions:
A) How good a job are these five predictors doing in explaining party
identification? That is, what percentage of the variance in party identification
is being explained by these five independent variables?
B) What predictor is MOST important in affecting party identification? What
predictor is Second in importance? What predictor is Third in importance? What
predictor is Fourth in importance? What predictor is Least important?
C) List each predictor that is statistically significant at the .05 level or at
an even more significant level.
D) What category of each independent variable is most Republican? Circle the
correct category in each of the following pairs:
Very Liberal or Very Conservative
White or African American
Under $10,000 or Over $70,000
Males or Females
The younger in age, or the older in age
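After you have worked the questions out on paper, a few lines of Python can check the ranking and significance parts (the numbers are copied from the problem above):

    betas = {"Ideology": .276, "Race": -.531, "Income": .157, "Sex": -.035, "Age": -.052}
    sigs = {"Ideology": .001, "Race": .001, "Income": .001, "Sex": .129, "Age": .023}

    # Rank predictors by the absolute value of Beta, flagging the .05 level:
    for name in sorted(betas, key=lambda v: abs(betas[v]), reverse=True):
        flag = "significant" if sigs[name] < .05 else "not significant"
        print(name, betas[name], flag)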
CAUSAL MODELING
We’ve run out of time.
This topic is typically covered in a PhD course. I’ll give you the basics now,
just so that you can think about it. I actually drew up a nice causal model for
my Multivariate Explanation of Voter Turnout article in the American Journal of
Political Science, a top journal; the article got a lot of citations (see page 87).
Multiple regression
provides only the direct effects that independent variables exert on
dependent variables. Yet outside variables may also affect the dependent
variable by affecting an intervening variable in the model. Hence, an outside
variable may exert an indirect effect on the dependent variable.
Total effects of an independent
variable are equal to the sum of the direct effect of that variable and all of
its indirect effects. Each indirect effect is the product of the effect that
the outside variable has on an intervening variable and the effect that the
intervening variable has on the dependent variable. (A small sketch illustrating
this arithmetic follows the numbered steps below.)
Causal Modeling
procedures.
1) Devise a model that shows temporal-causal ordering of the variables
2) Use the SPSS multiple regression program to regress each dependent variable in
the model on all of the independent variables that are "earlier" than
it is
3) Draw arrows for all statistically significant linkages. Put Betas just above
each line.
4) Indirect effects involve multiplying the relevant Betas together
5) Total effect = direct effect + indirect effects
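As a small illustration of steps 4 and 5, here is a Python sketch with a hypothetical three-variable model (all of the path Betas are invented, loosely in the spirit of a turnout model):

    # Suppose Education affects Turnout directly, and also indirectly
    # through an intervening variable, Political Interest.
    beta_educ_interest = 0.40      # Education -> Interest (hypothetical)
    beta_interest_turnout = 0.50   # Interest  -> Turnout  (hypothetical)
    beta_educ_turnout = 0.20       # Education -> Turnout (direct effect)

    indirect = beta_educ_interest * beta_interest_turnout  # product of path Betas
    total = beta_educ_turnout + indirect                   # direct + indirect
    print(indirect, total)   # 0.2 and 0.4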