Math 221 – Principles of Statistics

Simple Least-Squares Linear Regression Reference Page


Purpose: Linear regression is used to quantify the linear association (if there is one) between two variables, by producing a formula (for a line) that precisely and concisely describes the linear association.

 

 

Conditions:

 

1) Measurements must be pairs, each response value in the data set being accompanied by the corresponding value of the explanatory variable.

 

2) Both variables are quantitative, in that it makes sense to do arithmetic with measurements from each variable.

 

3) There is some excuse for randomness in the data. (Simple random sampling ought to be used, but exactly what gets sampled differs between correlation and regression. We’ll let your second-semester teachers sort this out!)

 

4) A scatterplot suggests…

(a) …that a linear relationship exists between the variables (or better yet, there are sound theoretical reasons to believe that the variables may be linearly associated.)

(b) …that there are no influential points or outliers outside the linear pattern.

 

5) The points in a plot of standardized residuals versus standardized predictions are (a) randomly scattered throughout a (b) linear band of (c) roughly constant width, (d) centered on the line “residual = 0,” with (e) outliers (if any) under control.

 

In advanced classes, the above conditions will be modified. There are many variations on the regression theme, each of which carries its own conditions.
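
The quantities plotted for condition 5 can be sketched in a few lines of Python. This is a minimal illustration, not the SPSS procedure: the data are made up, the helper names fit_line and standardize are our own, and the standardization used here is a plain z-score (software such as SPSS may scale residuals slightly differently).

```python
from statistics import mean, stdev

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def standardize(values):
    """Convert values to z-scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Made-up paired measurements, for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

b0, b1 = fit_line(xs, ys)
predictions = [b0 + b1 * x for x in xs]
residuals = [y - p for y, p in zip(ys, predictions)]

z_pred = standardize(predictions)
z_resid = standardize(residuals)

# Each printed pair is one point of the residual plot:
# standardized prediction (horizontal) and standardized residual (vertical).
for zp, zr in zip(z_pred, z_resid):
    print(f"{zp:7.3f} {zr:7.3f}")
```

Plotting the second column against the first gives the residual plot; the checks (a) through (e) are then made by eye.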

 

 

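How to Compute It: We recommend using software such as SPSS (or a graphing calculator, though we are less enthusiastic about this option). However, we note here that if you have the means $\bar{x}$ and $\bar{y}$, the standard deviations $s_x$ and $s_y$, and the correlation coefficient $r$, then you can calculate the least-squares regression line thus: compute the slope $b_1 = r \, s_y / s_x$ and the intercept $b_0 = \bar{y} - b_1 \bar{x}$, and write down the formula $\hat{y} = b_0 + b_1 x$.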
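
For readers who want to see the hand calculation spelled out, here is a minimal Python sketch. It assumes the summaries in hand are the two sample means, the two sample standard deviations, and Pearson's correlation coefficient r; the function name and the numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def regression_from_summaries(x_bar, y_bar, s_x, s_y, r):
    """Return (intercept, slope) of the least-squares line from summary statistics."""
    if s_x <= 0:
        raise ValueError("s_x must be positive: the explanatory variable must vary")
    slope = r * s_y / s_x               # slope = r * (s_y / s_x)
    intercept = y_bar - slope * x_bar   # the line passes through (x_bar, y_bar)
    return intercept, slope

# Hypothetical summary values, for illustration only.
b0, b1 = regression_from_summaries(x_bar=10.0, y_bar=50.0, s_x=2.0, s_y=8.0, r=0.5)
print(f"y-hat = {b0:.1f} + {b1:.1f} x")
```

With these made-up summaries the slope is 0.5 × 8 / 2 = 2 and the intercept is 50 − 2 × 10 = 30.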

 

 

Interpretation:

    1) The slope of the regression line can be interpreted as the change in response per unit increase in the explanatory variable.

 

    2) A one-standard-deviation increase in the explanatory variable yields a change of $r$ standard deviations in the response variable, where $r$ is Pearson’s correlation coefficient.

 

    3) Predictions (Forecasts):

  

        A) IF THE NULL HYPOTHESIS $H_0\colon \beta_1 = 0$ (NO LINEAR ASSOCIATION) IS REJECTED BY THE APPROPRIATE TEST, then the formula $\hat{y} = b_0 + b_1 x$ can be used to predict values of the response variable $y$ by substituting values of the explanatory variable $x$.

 

        B) OTHERWISE, THE BEST PREDICTED VALUE OF $y$ IS $\bar{y}$, REGARDLESS OF THE VALUE OF $x$.

 

    4) Subject to the warning against extrapolation given below, the intercept can be interpreted as the expected value of $y$ when the explanatory factor is absent or has value 0. However, variation in measurement often leads to ridiculous values of the intercept $b_0$.
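
Interpretations 1 and 2 above can be checked numerically. The sketch below uses made-up data and our own variable names; it fits the line from the summary statistics and then confirms that a one-unit increase in x changes the prediction by the slope, while a one-standard-deviation increase in x changes the prediction by r standard deviations of y.

```python
from statistics import mean, stdev

# Made-up paired data, for illustration only.
xs = [2.0, 4.0, 5.0, 7.0, 9.0]
ys = [10.0, 14.0, 13.0, 20.0, 22.0]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)

# Pearson's correlation coefficient from its definition.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)

b1 = r * s_y / s_x        # slope
b0 = y_bar - b1 * x_bar   # intercept

def predict(x):
    """Predicted response for a value x inside the range of the data."""
    return b0 + b1 * x

# Interpretation 1: change in prediction per unit increase in x.
change_per_unit = predict(6.0) - predict(5.0)

# Interpretation 2: change in prediction, measured in standard deviations
# of y, for a one-standard-deviation increase in x.
change_per_sd = (predict(x_bar + s_x) - predict(x_bar)) / s_y

print(f"slope = {b1:.4f}, change per unit = {change_per_unit:.4f}")
print(f"r     = {r:.4f}, change per SD   = {change_per_sd:.4f}")
```

All x-values used for prediction here lie inside the range of the data, in keeping with the warning against extrapolation.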

 

 

Warnings:

 

    1) Every regression line ought to be accompanied at the very least by the coefficient of determination $r^2$, as a measure of the “quality” of the line, by a scatterplot of the data, and by a residual plot of the data.

 

    2) Linear regression describes linear associations only. Thus, a regression line must not be used for predictive purposes unless there are reasons to believe there is a linear association between the two variables. A HIGH VALUE OF $r^2$ IS NOT, BY ITSELF, SUFFICIENT EVIDENCE TO CONCLUDE THAT THERE IS A LINEAR ASSOCIATION. Either there must be sound theoretical reasons to believe in a linear association, or there must be independent evidence, such as a scatterplot showing a clear, non-horizontal linear association, with consistent variation in responses throughout the range of the explanatory variable.

 

    3) The slope of the regression line is quite sensitive to influential points, to outliers, and to skewness in the responses.

 

    4) After computing the regression line, you must not use it to predict values of the response for values of the explanatory variable outside the range of the data used to compute the line in the first place. This practice, called extrapolation, is dangerous because the original data can only produce a formula that describes the association for values found in the original data.

 

    5) Using regression on data that are averages is unwise, as any averaging process hides the nature of the distribution of the underlying measurements, as well as that of the association (if any) between the variables in question. Predictions made using the line will not acknowledge the hidden information, and may be terribly wrong.

 

    6) Regression more or less ignores any lurking variables.

 

    7) Linear regression more or less ignores any non-linearity present in the data. Hence, the fact that a line has been calculated does not imply that the association is linear, even if both $r$ and $r^2$ are high.

 

    8) A regression line does not, by itself, establish a cause-and-effect relationship. A regression line can be calculated for any set of paired, quantitative data. If the roles of the two variables are reversed, a new regression line can be calculated. So if you originally thought of the explanatory variable as the cause, you must now view it as the effect! However, regression may be used carefully as part of a larger program for establishing causality.

 

    9) If there is only one explanatory and one response variable, it doesn’t take a very large sample to get enough power to reject a null hypothesis. Thus, very weak linear associations can be statistically significant, but insignificant in any practical sense. THIS IS ESPECIALLY A PROBLEM IN EDUCATION AND THE SOCIAL SCIENCES, WHERE DATA TYPICALLY HAVE A HUGE AMOUNT OF VARIATION. The social consequences of using statistically significant but practically insignificant experimental results can be very costly.

 

    10) If there is more than one explanatory variable, more than one response variable, or if the type of regression is not linear, then people with sophomore-level statistical education are ill-prepared to face the attendant challenges. (In other words, leave it to people with upper-division or, better yet, graduate-level education in statistics.)

 

    11) Even if there is a linear association between variables, and even if the coefficient of determination is high, there can be a large amount of variation in y-values, so predictions based on the regression line can be very inaccurate.
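
Warnings 4 and 7 can be seen together in a small made-up example. The sketch below (our own helper fit_line, artificial data) fits a line to points that lie exactly on the curve y = x², so the association is plainly not linear. The coefficient of determination is still high, the residuals show a systematic pattern, and extrapolating the line goes badly wrong.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (intercept, slope, r_squared) of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Artificial data that follow a curve exactly: y = x^2 for x = 1, ..., 10.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

b0, b1, r_squared = fit_line(xs, ys)

# Residual = observed response minus predicted response.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
signs = "".join("+" if e > 0 else "-" for e in residuals)

print(f"r^2 = {r_squared:.3f}")
print("residual signs:", signs)

# Extrapolation: x = 20 is far outside the data range 1..10.
x_new = 20
print("line predicts", b0 + b1 * x_new, "but the curve gives", x_new ** 2)
```

Here the coefficient of determination comes out at about 0.95, yet the residuals are positive at both ends and negative in the middle, which is exactly the kind of pattern a residual plot is meant to expose. At x = 20 the line predicts roughly 198, while the curve that generated the data gives 400.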

 

 

SPSS instructions for calculating regression lines: Click here.

 

 

Examples of regression lines with the coefficient of determination and scatterplots:

 

(NOT YET AVAILABLE)

 

 

Related topics:

 

Scatterplots

 

Pearson’s (linear) correlation coefficient

 

The coefficient of determination

 

Testing claims about a regression line

 

Conducting a complete least-squares linear regression analysis


David E. Brown

BYU-Idaho                                            brownd@byui.edu

232H Ricks Building                           208-496-1839 voice

Rexburg, ID 83460                              208-496-2005 fax

                Please do not call me at home.