OLS Regression Analysis Tutorial

OLS is an acronym for ordinary least squares.

What this means is that OLS regression creates a linear model that minimizes the sum of the squared prediction errors.  Of course, the goal of any predictive model is to predict as accurately as possible.  So it should come as no surprise that OLS regression is concerned with finding the regression line that provides the best fit to the data points.

Objectives

·        Interpret a correlation matrix

·        Know how to generate a regression equation. 

·        Understand average prediction error (residual difference).

·        Use a multiple regression model to predict a criterion* variable

·        Determine whether there is a relationship between the criterion* variable and the predictor** variables used in the regression model

·        Determine which predictor** variables make a significant contribution to the regression model

·        Interpret the coefficient of multiple determination

·        Interpret the partial regression coefficients (beta weights).

·        Understand how categorical predictor** variables can be included in the regression model

·        Understand regression models that include interaction terms

·        Recognize when multicollinearity is a problem and how it affects your regression model

·        Know when to use logistic regression to predict a criterion* variable

* The criterion variable is analogous to the dependent variable, but is generally referred to as the criterion in correlational analyses.

** The predictor variable is analogous to the independent variable, but is generally referred to as a predictor in correlational analyses.

Because regression is a technique based on correlation, the first step should always be interpretation of the correlation matrix.  It is my personal belief that you should always view the scatterplots as well.  This will allow you to ensure that the relationships are in fact linear.
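For example, here is a minimal sketch of how you might produce the correlation matrix and scatterplots in Python with pandas (just one option among many; the data and variable names are made up for illustration):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical data: one criterion (exam_score) and two predictors.
    df = pd.DataFrame({
        "hours_studied": [2, 4, 5, 7, 8, 10, 12, 14],
        "hours_slept":   [8, 7, 7, 6, 6, 5, 5, 4],
        "exam_score":    [55, 60, 62, 70, 74, 80, 85, 88],
    })

    # Correlation matrix: pairwise Pearson correlations among all variables.
    print(df.corr())

    # Scatterplot matrix: check each pairwise relationship for linearity by eye.
    pd.plotting.scatter_matrix(df, figsize=(6, 6))
    plt.show()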

At this point it would be prudent to point out the assumptions of OLS regression.

  1. Linearity: Relationships among the variables are linear.
  2. Homoscedasticity: The variation around the regression line is constant.
  3. Normality: Values of the criterion variable are normally distributed.
  4. Independence of error: Error (residual difference) is independent for each predictor value.  This is mainly a concern when dealing with time series data.
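If you want a quick visual check on assumptions 2 and 3, plotting the residuals from a fitted model works well.  Here is a rough sketch using simulated data and the statsmodels package (my choice of tool is an assumption; any package that produces residual plots will do):

    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

    # Simulated stand-in data; replace with your own predictors (X) and criterion (y).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = 3 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)

    results = sm.OLS(y, sm.add_constant(X)).fit()

    # Homoscedasticity: residuals vs. fitted values should show no funnel shape.
    plt.scatter(results.fittedvalues, results.resid)
    plt.xlabel("Fitted values")
    plt.ylabel("Residuals")
    plt.show()

    # Normality: residuals should fall roughly along the straight line.
    sm.qqplot(results.resid, line="s")
    plt.show()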
Now we are ready to generate a multiple regression equation.  The appropriate analysis will have to be done in a statistical software package, but I can show you the formula (which will not be provided in the output). I believe the easiest way to create your regression equation is to use the equation of a line.  (All linear regression models use the equation of a line, but some people try to trip you up with fancy new symbols.)
                    y = mx + b            (y = a + bx is another commonly used form; don't let it fool you, it's the same!)
y = the criterion score            
m = slope (beta weight)
x = predictor score
b = intercept (constant)
So, the formula with multiple predictors would be:
    y = b + m1x1 + m2x2 + m3x3 + m4x4 ...        (each predictor gets its own slope)
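If your statistical software happens to be Python, here is a minimal sketch of fitting that equation with statsmodels (the data and the names x1, x2, y are hypothetical); whatever package you use, the output will contain the same pieces, the intercept b and one slope m per predictor:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data: criterion y and two predictors x1, x2.
    df = pd.DataFrame({
        "x1": [1, 2, 3, 4, 5, 6, 7, 8],
        "x2": [2, 1, 4, 3, 6, 5, 8, 7],
        "y":  [3, 4, 8, 9, 13, 14, 19, 20],
    })

    # Fit y = b + m1*x1 + m2*x2 by ordinary least squares.
    results = smf.ols("y ~ x1 + x2", data=df).fit()

    print(results.params)     # the intercept (b) and one slope (m) per predictor
    print(results.summary())  # full output: coefficients, significance tests, R squared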
You may have heard the term residual as opposed to prediction error (e).  Either way, it refers to the difference between the predicted criterion score and the actual criterion score.  The regression line with the smallest sum of squared prediction errors is the line chosen by OLS regression.
Predicting a criterion score is as simple as plugging the value of each predictor (for the case you wish to predict) into your equation (once you've found b and the m's using a statistical software package).
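Continuing the same hypothetical example, prediction looks like this (the new case's values of 5.5 and 4.0 are made up):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Same hypothetical data and model as the previous sketch.
    df = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6, 7, 8],
                       "x2": [2, 1, 4, 3, 6, 5, 8, 7],
                       "y":  [3, 4, 8, 9, 13, 14, 19, 20]})
    results = smf.ols("y ~ x1 + x2", data=df).fit()

    # Predict the criterion for a new case by plugging in its predictor values.
    new_case = pd.DataFrame({"x1": [5.5], "x2": [4.0]})
    print(results.predict(new_case))

    # The same prediction by hand, straight from the equation of a line.
    b, m1, m2 = results.params   # intercept, slope for x1, slope for x2
    print(b + m1 * 5.5 + m2 * 4.0)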

The coefficient of multiple determination, R squared, indicates the proportion of criterion variance that is explained by the predictor variables.

The partial regression coefficients, more commonly referred to as B or beta weights, indicate the variation in the criterion variable that is explained by each predictor variable while controlling for the other predictor variables.  To put it more simply, partial regression coefficients indicate the unique contribution of each predictor variable.
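Here is a small sketch, again with the hypothetical data from above, of pulling out R squared and computing beta weights by standardizing the variables before refitting (one common way to get standardized coefficients, not the only one):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Same hypothetical data as above.
    df = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6, 7, 8],
                       "x2": [2, 1, 4, 3, 6, 5, 8, 7],
                       "y":  [3, 4, 8, 9, 13, 14, 19, 20]})

    # Coefficient of multiple determination (R squared).
    results = smf.ols("y ~ x1 + x2", data=df).fit()
    print(results.rsquared)

    # Beta weights: refit after z-scoring every variable so the slopes are on
    # the same scale and directly comparable across predictors.
    z = (df - df.mean()) / df.std()
    print(smf.ols("y ~ x1 + x2", data=z).fit().params)  # intercept ~0; the rest are beta weights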

A dummy variable is not something you call your roommate when he eats your last frozen burrito.   Dummy variables are actually the vehicle that permits us to consider categorical predictor variables as part of the regression model.
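As a sketch of how that works in practice (hypothetical data; the group variable and its levels are made up), statsmodels will dummy-code a categorical predictor for you, or you can build the 0/1 columns yourself:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data with a categorical predictor (group) and one numeric predictor.
    df = pd.DataFrame({
        "group": ["a", "a", "b", "b", "c", "c", "a", "b"],
        "x1":    [1, 2, 3, 4, 5, 6, 7, 8],
        "y":     [3, 4, 9, 10, 16, 17, 8, 14],
    })

    # C() dummy-codes the categorical predictor: one 0/1 column per level,
    # with the first level held out as the reference category.
    results = smf.ols("y ~ x1 + C(group)", data=df).fit()
    print(results.params)

    # The same dummy coding done explicitly with pandas.
    print(pd.get_dummies(df["group"], drop_first=True))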

Interaction terms are created by taking the product of two predictor variables; they allow the effect of one predictor on the criterion to depend on the level of the other predictor.
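A minimal sketch with made-up data, showing the formula shortcut and the do-it-yourself product column (both give the same model):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical data for an interaction between two predictors.
    df = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6, 7, 8],
                       "x2": [2, 1, 4, 3, 6, 5, 8, 7],
                       "y":  [3, 4, 11, 10, 28, 22, 52, 44]})

    # In a formula, x1:x2 is the product term; x1*x2 is shorthand for x1 + x2 + x1:x2.
    print(smf.ols("y ~ x1 * x2", data=df).fit().params)

    # Equivalent by hand: create the product column yourself.
    df["x1_x2"] = df["x1"] * df["x2"]
    print(smf.ols("y ~ x1 + x2 + x1_x2", data=df).fit().params)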

Multicollinearity is a condition in which two (or more) of the predictor variables are highly correlated.  It is generally considered a serious problem if predictor variables are correlated above .70.
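Besides eyeballing the correlation matrix, variance inflation factors are a common diagnostic.  A rough sketch with made-up predictors, where x2 is deliberately almost a copy of x1:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Hypothetical predictors: x2 is nearly a copy of x1, so the two are highly correlated.
    X = pd.DataFrame({"x1": [1, 2, 3, 4, 5, 6, 7, 8],
                      "x2": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0, 7.2, 7.9],
                      "x3": [5, 3, 8, 1, 9, 2, 7, 4]})

    # First check: the predictor correlation matrix (the .70 rule of thumb above).
    print(X.corr())

    # Second check: variance inflation factors; values above about 10
    # (some say 5) are a common red flag for multicollinearity.
    Xc = sm.add_constant(X)
    for i, name in enumerate(Xc.columns):
        if name != "const":
            print(name, variance_inflation_factor(Xc.values, i))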

Logistic regression is an alternative approach to regression that allows us to use a criterion variable that is categorical.  An example of this type of analysis comes from the health sciences, where predicting survival (a categorical outcome) is of paramount concern.
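A minimal sketch with a made-up survival data set (the variable names and values are purely illustrative):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical health-sciences data: survival (0/1) predicted from age and dose.
    df = pd.DataFrame({
        "age":      [34, 45, 52, 60, 63, 70, 75, 80, 41, 58],
        "dose":     [10, 12,  8, 15,  9, 14,  7, 16, 11, 10],
        "survived": [ 1,  1,  1,  1,  0,  1,  0,  0,  1,  0],
    })

    # Logistic regression models the probability of the categorical criterion.
    results = smf.logit("survived ~ age + dose", data=df).fit()
    print(results.summary())

    # Predicted probability of survival for a new, hypothetical patient.
    print(results.predict(pd.DataFrame({"age": [55], "dose": [12]})))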