WEEK 11:
BIVARIATE AND MULTIPLE REGRESSION
Regression is actually a pretty
cool technique. It requires that your variables all be measured at the interval
level, because it provides a mathematical prediction of the exact value of your
dependent variable.
Bivariate Regression
In the bivariate (two variables) case, without a
computer you can get a sheet of graph paper, put the independent variable at
the bottom of the page (the X axis), and put the dependent variable on the left
side (Y axis). Make sure each variable's scale has equal distances
between the graph paper lines (each line might be 5, 10, 15; not 5, 15, 30,
100). Then, just plot each point's position on the two-dimensional graph
paper.
This technique finds
the best fitting straight line through
a set of points. Best fitting is defined by minimizing the sum of squared
distances between the points and the regression line.
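In case you're curious how a computer does it, here is a minimal Python sketch of that criterion (the data points are made up for illustration). The least-squares line produces a smaller sum of squared distances than any rival line you might draw:

    import numpy as np

    x = np.array([1.0, 2, 3, 4, 5])          # made-up independent variable
    y = np.array([2.0, 4, 5, 4, 6])          # made-up dependent variable

    b, a = np.polyfit(x, y, 1)               # least-squares slope and Y intercept

    def sum_of_squares(intercept, slope):
        # sum of squared vertical distances between the points and a line
        return np.sum((y - (intercept + slope * x)) ** 2)

    # The fitted line beats an arbitrary rival line (any rival, in fact).
    print(sum_of_squares(a, b) <= sum_of_squares(2.0, 0.5))   # True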
For example, you’d like to think that all of
your years of higher education are worth something, dollar wise. Using the
2010-2014 pooled Mississippi Poll, my hypothesis was that education affected
family income, so education was the independent variable at the bottom, and
income was the dependent variable on the left side. My graph paper has vertical
and horizontal lines across the page every few millimeters, so I had each
vertical line reflect one year of education completed (high school grad is 12
years, etc.), and each horizontal line represent $2,000 in family income. When
I plotted all of the points, I got a nice pretty straight line that went upward
and to the right. All of the points (people) fell closely around that line. If
you remember from Algebra, any straight line can be represented by a formula:
Y = a + (b * X), where Y is the dependent
variable, X is the independent variable, a is the Y intercept, and b is the slope of the line, or (change
in Y)/(change in X).
The Y intercept is where the line crosses the Y axis. What is that Y value? It is the Y value when X is 0. In this case, it is Family Income when one has zero years of education.
b = unstandardized regression coefficient = slope
= (change in Y) / (change in X)
The equation for this set of points ended up
being:
Family Income = -18,643 + (4,620 * Education years)
The b value is the slope, $4,620: the change in Y (income) for every
1-unit change in X (education). That means that for every 1 additional year of
education, your family income on average rises $4,620. So, on average, getting
a BA degree (four more years) results in a family income $18,480 higher than
that of someone with just a high school degree.
The a value, or Y intercept, is where the line
crosses the Y axis, so it is the Y value when X has a value of 0. Here, someone
with zero years of education would have a negative family income of $18,643.
That sounds impossible, but maybe they are unemployed and on welfare: they have
no earned income, and are entirely dependent on government benefits. Regression
is a little weird for such extreme cases, as it most accurately plots the
points for values of the variables that actual people have. Therefore, your
Y intercept may be an "impossible" value.
Another thing that must
be considered with regression is that it makes these nice predictions, but the
prediction only holds "on average." For example, those with 2 years of college
might have a MEAN income of $46,037, computed as Income = -18,643 +
(4,620 * 14). Some community college grads will be above that, and some will be
below. Hence, we say that there is variation around the mean.
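If you want to verify these "on average" predictions yourself, here is a minimal Python sketch that just plugs education values into the equation above (the function name is my own):

    def predicted_income(years_of_education):
        # mean family income predicted by the regression line above
        return -18643 + 4620 * years_of_education

    for years in (0, 12, 14, 16):
        print(years, "years ->", predicted_income(years))
    # 0 years  -> -18,643 (the Y intercept)
    # 12 years -> 36,797  (high school graduate)
    # 14 years -> 46,037  (two years of college)
    # 16 years -> 55,277  (BA degree; $18,480 above the high school graduate)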
That raises the
question of how good our predictive equation really is, which gets us into
explained variance.
R² is explained variance: the variance in Y
explained by the independent variable's regression line.
R² = (Total Variation - Unexplained Variation) / Total Variation
Total Variation = sum of squared distances
between the mean of Y and each case's Y value
Unexplained Variation (Residual) = sum of
squared distances between each case's Y value and each case's predicted Y value
(from the regression equation)
Explained Variation = sum of squared distances
between each case's predicted Y value and the mean of Y.
R² is the proportion of variation explained.
It is the predictive ability of your independent variable.
Adjusted R² shrinks the value
of R² by penalizing for each additional independent variable,
and is statistically preferable to R². It accounts for the fact
that if you had as many predictors as you have cases (people), you would have
100% explained variance.
The F statistic tests the statistical
significance of the regression equation as a whole; its significance level must be below .05.
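Here is a minimal Python sketch of those variance formulas, using made-up data (SPSS computes all of this for you):

    import numpy as np

    x = np.array([10.0, 12, 12, 14, 16, 16, 18, 20])   # hypothetical education (years)
    y = np.array([25.0, 32, 40, 45, 55, 48, 70, 76])   # hypothetical income ($1,000s)

    b, a = np.polyfit(x, y, 1)       # slope and Y intercept of the best-fitting line
    y_pred = a + b * x               # each case's predicted Y value

    total = np.sum((y - y.mean()) ** 2)           # Total Variation
    unexplained = np.sum((y - y_pred) ** 2)       # Unexplained (Residual) Variation
    explained = np.sum((y_pred - y.mean()) ** 2)  # Explained Variation

    r2 = (total - unexplained) / total            # R-squared
    n, k = len(y), 1                              # number of cases, predictors
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1) # penalizes extra predictors
    f_stat = (explained / k) / (unexplained / (n - k - 1))  # F for the equation
    print(round(r2, 3), round(adj_r2, 3), round(f_stat, 1))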
All of that will make
more sense if you take a statistics class, or go to graduate school. The SPSS computer
program does the statistical calculations for you. How good a job you are doing
explaining your dependent variable with your independent variable is the
Adjusted R squared value. It ranges in value from a low of 0% (no value) to a
high of 100% (perfect prediction). In my education affects income example
above, the adjusted R squared is 16%. So while you get a nice straight line,
there are a lot of points that are above the line, or below the line.
Therefore, some people make a lot more money than you would expect given their
education level (a high school dropout who starts her own company and becomes
a millionaire). And there are some people who make a lot less money than you
would expect from their education level (a PhD in 18th century
Serbian literature who is now driving an Uber).
One problem with
regression is that it is affected by deviant cases: a few scattered cases that
are way off the line. Their squared distances from the line are large,
so they pull the line towards them. So, for example, most English professors
don't make a lot of money. However, one of them at MSU used to be the Academic
Vice President. Rejoining the faculty in his department, he made a lot more
than the other professors. If a researcher plots all of their salaries by
seniority, he would be way above the regression line. That would pull the regression
line towards his point, making it steeper, so the b value would
be larger. Hence, the researcher would think that all of the professors in that
department were getting nice pay raises every year. (This actually happened
at MSU one year, and the professors were threatened with a low pay raise.) In
cases of deviant points, we sometimes drop those points from the analysis and
plot the line without them.
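A quick Python sketch shows the pull of a deviant case (all numbers are hypothetical):

    import numpy as np

    seniority = np.array([2.0, 5, 8, 12, 15, 20, 25])   # years of service
    salary = np.array([52.0, 55, 58, 62, 65, 70, 75])   # salary in $1,000s

    slope_without, _ = np.polyfit(seniority, salary, 1)

    # Add one former administrator paid far above his colleagues:
    slope_with, _ = np.polyfit(np.append(seniority, 18.0),
                               np.append(salary, 160.0), 1)

    print(round(slope_without, 2), round(slope_with, 2))  # the b value gets larger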
Example of calculating a bivariate regression
problem.
You are asked to examine the relationship
between years of service since receiving a PhD degree, and nine-month salaries
of ten history professors. You need to plot the following points on graph
paper, and then calculate the b value (unstandardized regression coefficient
value or slope) and the y-intercept. Also, calculate what salary would have to
be given to a senior professor with 30 years of service since their PhD (if
they were hired from another university). Also, what would the starting salary
be (for someone with zero years of service who just got their PhD):
In this example, you can see that the b value or
slope would be $1,000. Two points on the line might be 10 and 20 years of service,
so change in Y would be $75,000-$65,000= $10,000. Change in X would be (20-10)
= 10. So, $10,000/10 = $1,000.
The Y intercept would be $55,000. You can check
the graph paper, or just do the calculation: each year of service is worth $1,000,
so the starting salary (zero years of service) would be the $57,000 paid at 2 years
of service, minus $2,000, which equals $55,000. The equation is Y = $55,000 + ($1,000) * (X).
Someone with 30 years of service, X = 30, just
put 30 into the equation where X is. So, Y = $55,000 + ($1,000) * (30) =
$85,000
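If you want to double-check that arithmetic, a few lines of Python will do it:

    x1, y1 = 10, 65000    # 10 years of service -> $65,000 (a point on the line)
    x2, y2 = 20, 75000    # 20 years of service -> $75,000

    b = (y2 - y1) / (x2 - x1)    # slope: 1,000 dollars per year of service
    a = y1 - b * x1              # Y intercept: 55,000
    print(b, a, a + b * 30)      # 30 years of service -> 85,000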
MULTIPLE REGRESSION
Multiple regression is a
really cool technique that lets you increase your predictive ability by
adding more independent variables. For example, we only predicted 16% of the
variance in family income from one's education. What other variables can we add
to the equation to increase our predictive ability? Well, how about the number of
wage earners in the family? A two-income couple is likely to have a higher
family income than a one-income household. Also, how about what someone majored
in? Someone in engineering will make more than someone in philosophy (because,
who wants to work in math or engineering their whole life, no offense…).
Another factor is the quality of the education institution. I think MSU has a
high reputation among employers; indeed, many of them are our alumni. Look at
the year of the Covid crisis. MSU stayed open with many faculty offering in-person classes, doing the best we could to make the best
of a difficult situation. Many other universities were closed for the entire year, offering only remote classes. I
heard that at one university in Texas, one dean self-isolated at
home for two weeks (his faculty laughed and said he never came to the office
anyway), and when he tried to dump more of his work on his department heads, at
least one department head promptly resigned. You can pretty much figure out
what reputation such universities have.
Multiple Regression is linear regression applied to more than one independent
variable. With two independent variables, the predicted values comprise a plane
(instead of a line in the one independent variable case).
A Multiple Regression Equation
with four predictors:
Y = a + (b1 * X1) + (b2 * X2) + (b3 * X3) + (b4 * X4)
The b value is the
unstandardized regression coefficient, controlling for the effects of all other
predictors. It is used to predict the value of the dependent variable from the
known values of the independent variables.
The b value is also used in making comparisons
across subsamples. For example, is an independent variable more important in
affecting the dependent variable among men or among women?
Beta is the standardized regression coefficient, controlling for the
effects of all other predictors. It tells the relative importance of the
independent variables in influencing the dependent variable. Its absolute value
ranges from 0 to 1, with 1 being most important and 0 being least important.
Negative signs reflect the direction of the variables' coding.
Beta = standardized regression coefficient = b *
(sdX / sdY), where sd means standard deviation. It adjusts
for the differing ranges and scales of the variables.
Pearson R is the correlation coefficient. It
equals the Beta in the bivariate case only.
Repeating, Beta ranges from -1 to +1, with 0
being no relationship between the independent and dependent variables. The sign
depends on the direction of the coding of your variables. A +1 or -1 is a
perfect relationship. b values have a greater range, which is not confined to
-1 to +1.
In other words, Beta is kind of like Gamma. Take
the absolute value of the Beta, and that shows which variable is the most important,
which is second most important, and which is least important.
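Here is a minimal Python sketch of the same idea, with made-up data and variable names of my own choosing. It fits a two-predictor regression and converts each b into a Beta using the formula above, so their absolute values can be compared:

    import numpy as np

    rng = np.random.default_rng(0)
    educ = rng.uniform(8, 20, 200)                    # years of education
    earners = rng.integers(1, 4, 200).astype(float)   # wage earners in the family
    income = -15 + 4.5 * educ + 12 * earners + rng.normal(0, 10, 200)  # $1,000s

    X = np.column_stack([np.ones(200), educ, earners])  # 1s column = intercept
    coefs, *_ = np.linalg.lstsq(X, income, rcond=None)  # least-squares fit
    a, b_educ, b_earners = coefs                        # intercept and the b values

    # Beta = b * (sd of X / sd of Y), as in the formula above
    beta_educ = b_educ * educ.std() / income.std()
    beta_earners = b_earners * earners.std() / income.std()
    print(round(beta_educ, 2), round(beta_earners, 2))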
You’ll notice in your computer output that I
ran a multiple regression equation for the three variables in your model. That
regression analysis largely confirmed your bivariate table results. I then
came up with multivariate tables unique to each of your papers that I thought
gave interesting results to comment on.
Multiple R is the correlation
between the actual Y value and the predicted Y value from the multiple
regression equation.
R² is the variance in
the dependent variable explained by all of the independent variables.
Once again, the Adjusted R² is the
most accurate value to use: as explained above, it shrinks the value of R²
by penalizing for each additional independent variable.
In the education-income example, what do you
think would happen to this Adjusted R squared value when I add other predictors
that are indeed relevant and important? Obviously, you’d expect your predictive
ability to go up. So, this is a really cool technique. But it is a little
advanced for undergrads, so it is not required in your research paper.
It is covered more in graduate school or other statistics classes.
The advantage of multiple regression is that it
can simultaneously look at more than two predictors, and estimate how important
each predictor is after discounting the other predictors. Your multivariate tables
approach actually looks at only two predictors at the same time. Multiple
regression can look at 3, 4, however many you need. It is a very powerful
technique. But there are all sorts of statistical assumptions that have to be
met to use multiple regression. Those are typically covered in a graduate program.
Example of a multiple regression equation
problem (taken from the 2006-2010 Mississippi Polls).
Predicting who believes they have been racially
profiled. This dependent variable is coded 1 for reported being profiled, and 2
for reported not being profiled. The independent variables and their coding
follow:
The Betas or standardized regression
coefficients for these predictors follow:
The significance levels for each of these
regression coefficients follow:
The adjusted R-squared for this regression
equation is 12%.
Using the above information, answer the
following questions:
Test example for multiple regression. Try to
answer it yourself:
(20 points) A multiple regression analysis of the 2010-2014
Mississippi Poll data examines the causes of party identification using five
independent variables. The variables are coded as follows:
Party identification ranges from 1 for Strong Democrat to 7 for Strong Republican.
Ideology ranges from 1 for Very Liberal to 5 for Very Conservative.
Race is coded as 1 for White, and 2 for African American.
Family Income is coded as 1 for Under $10,000 to 8 for Over $70,000.
Sex is coded as 1 for Male and 2 for Female.
Age ranges from 18 to 93, and is age in years.
The adjusted R squared for this multiple regression equation is .52.
The Betas or standardized regression coefficients and their signs are:
Ideology = +.276
Race = -.531
Income = +.157
Sex = -.035
Age = -.052
The significance levels of these five predictors are:
Ideology = .001
Race = .001
Income = .001
Sex = .129
Age = .023
Answer the following questions:
A) How good a job are these five predictors doing in explaining party
identification? That is, what percentage of the variance in party identification
is being explained by these five independent variables?
B) What predictor is MOST important in affecting party identification? What
predictor is Second in importance? What predictor is Third in importance? What
predictor is Fourth in importance? What predictor is Least important?
C) List each predictor that is statistically significant at the .05 level or at
an even more significant level.
D) What category of each independent variable is most Republican? Circle the
correct category in each of the following pairs:
Very Liberal or Very Conservative
White or African American
Under $10,000 or Over $70,000
Males or Females
The younger in age, or the older in age
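After you have worked the questions out on paper, a few lines of Python can check the ranking and significance parts (the numbers are copied from the problem above):

    betas = {"Ideology": .276, "Race": -.531, "Income": .157, "Sex": -.035, "Age": -.052}
    sigs = {"Ideology": .001, "Race": .001, "Income": .001, "Sex": .129, "Age": .023}

    # Rank predictors by the absolute value of Beta, flagging the .05 level:
    for name in sorted(betas, key=lambda v: abs(betas[v]), reverse=True):
        flag = "significant" if sigs[name] < .05 else "not significant"
        print(name, betas[name], flag)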
CAUSAL MODELING
We’ve run out of time.
This topic is typically covered in a PhD course. I’ll give you the basics now,
just so that you can think about it. I actually drew up a nice causal model for
my Multivariate Explanation of Voter Turnout article in the American Journal of
Political Science, a top journal; the article got a lot of citations (see page 87).
Multiple regression
provides only the direct effects that independent variables exert on
dependent variables. Yet outside variables may also affect the dependent
variable by affecting an intervening variable in the model. Hence, an outside
variable may exert an indirect effect on the dependent variable.
Total effects of an independent
variable are equal to the sum of the direct effect of that variable and all of
its indirect effects. Each indirect effect is the product of the effect that
the outside variable has on an intervening variable and the effect that the
intervening variable has on the dependent variable. (A small sketch illustrating
this arithmetic follows the numbered steps below.)
Causal Modeling
procedures.
1) Devise a model that shows temporal-causal ordering of the variables
2) Use the SPSS multiple regression program to regress each dependent variable in
the model on all of the independent variables that are "earlier" than
it is
3) Draw arrows for all statistically significant linkages. Put Betas just above
each line.
4) Indirect effects involve multiplying the relevant Betas together
5) Total effect = direct effect + indirect effects
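As a small illustration of steps 4 and 5, here is a Python sketch with a hypothetical three-variable model (all of the path Betas are invented, loosely in the spirit of a turnout model):

    # Suppose Education affects Turnout directly, and also indirectly
    # through an intervening variable, Political Interest.
    beta_educ_interest = 0.40      # Education -> Interest (hypothetical)
    beta_interest_turnout = 0.50   # Interest  -> Turnout  (hypothetical)
    beta_educ_turnout = 0.20       # Education -> Turnout (direct effect)

    indirect = beta_educ_interest * beta_interest_turnout  # product of path Betas
    total = beta_educ_turnout + indirect                   # direct + indirect
    print(indirect, total)   # 0.2 and 0.4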