Simple Least-Squares Linear Regression Reference Page
Purpose: Linear regression is used to quantify the linear association (if there is one) between two variables, by producing the formula of a line that precisely and concisely describes that association.
When to Use It:
1) Measurements must be paired: each response value in the data set is accompanied by the corresponding value of the explanatory variable.
2) Both variables are quantitative, in that it makes sense to do arithmetic with measurements from each variable.
3) There is some excuse for randomness in the data. (Simple random sampling ought to be used, but exactly what gets sampled is different for correlation than regression. We’ll let your second-semester teachers sort this out!)
4) A scatterplot suggests…
(a) …that a linear relationship exists between the variables (or better yet, there are sound theoretical reasons to believe that the variables may be linearly associated.)
(b) …that there are no influential points or outliers outside the linear pattern.
5) The points in a plot of standardized residuals versus standardized predictions are (a) randomly scattered throughout a (b) linear band of (c) roughly constant width, (d) centered on the line “residual = 0,” and (e) outliers (if any) are under control.
In advanced classes, the above conditions will be modified. There are many variations on the regression theme, each of which carries its own conditions.
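Condition 5 can be checked numerically before plotting. The sketch below (with illustrative data and variable names of our own choosing, not from any particular package) computes residuals and a rough standardization by the residual standard error; plotting `standardized` against `fitted` gives the residual plot described above. Note this simple standardization ignores leverage, which a full treatment would include.

```python
# Sketch only: computing (roughly) standardized residuals to check condition 5.
# Data and names are illustrative.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares slope and intercept
b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
     / sum((xi - x_bar) ** 2 for xi in x))
a = y_bar - b * x_bar

fitted = [a + b * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Residual standard error; a rough standardization that ignores leverage
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
standardized = [e / s_e for e in residuals]
```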
How to Compute It: We recommend using software such as SPSS (or a graphing calculator, though we are less enthusiastic about this option). However, we note here that if you have the means x̄ and ȳ, the standard deviations s_x and s_y, and the correlation coefficient r, then you can calculate the least-squares regression line thus: compute the slope b = r(s_y / s_x) and the intercept a = ȳ − b x̄, and write down the formula ŷ = a + b x.
How to Interpret It:
1) The slope b of the regression line can be interpreted as the expected change in the response per unit increase in the explanatory variable.
2) A one-standard-deviation increase in the explanatory variable yields an expected change of r standard deviations in the response variable.
3) Predictions (Forecasts):
A) IF THE NULL HYPOTHESIS OF ZERO SLOPE IS REJECTED BY THE APPROPRIATE TEST, then the formula ŷ = a + b x can be used to predict values of the response variable by substituting values of the explanatory variable x.
B) OTHERWISE, THE BEST PREDICTED VALUE OF y IS THE MEAN RESPONSE ȳ, REGARDLESS OF THE VALUE OF x.
4) Subject to the warning against extrapolation given below, the intercept a can be interpreted as the expected value of y when the explanatory variable is absent or has value 0. However, variation in measurement often leads to ridiculous values of a.
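Points 3A and 3B above amount to a simple rule, sketched below with hypothetical fitted values (the significance test itself is a separate topic):

```python
# Sketch of the prediction rule in 3A/3B (hypothetical fitted values).
a, b = 0.14, 1.96   # intercept and slope of a fitted line
y_bar = 6.02        # mean of the observed responses

def predict(x, slope_significant):
    """Use the line only if the zero-slope null hypothesis was rejected;
    otherwise the best prediction is the mean response, whatever x is."""
    return a + b * x if slope_significant else y_bar

print(predict(3.5, slope_significant=True))    # uses y-hat = a + b x
print(predict(3.5, slope_significant=False))   # falls back to y-bar
```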
Cautions:
1) Every regression line ought to be accompanied, at the very least, by the coefficient of determination r² as a measure of the “quality” of the line, by a scatterplot of the data, and by a residual plot of the data.
2) Linear regression describes linear associations only. Thus, a regression line must not be used for predictive purposes unless there are reasons to believe there is a linear association between the two variables. A HIGH VALUE OF r² IS NOT, BY ITSELF, SUFFICIENT EVIDENCE TO CONCLUDE THAT THERE IS A LINEAR ASSOCIATION. Either there must be sound theoretical reasons to believe in a linear association, or there must be independent evidence, such as a scatterplot showing a clear, non-horizontal linear association, with consistent variation in responses throughout the range of the explanatory variable.
3) The slope of the regression line is quite sensitive to influential points, and is also sensitive to outliers and to skew in the responses.
4) After computing the regression line, you must not use it to predict values of the response for values of the explanatory variable outside the range of the data used to compute the line in the first place. This practice, called extrapolation, is dangerous because the original data can only justify the formula over the range of values actually observed.
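One way to honor this caution in software is to refuse to predict outside the observed x-range. A sketch (data, fitted values, and names are all illustrative):

```python
# Sketch: refusing to extrapolate beyond the observed x-range (illustrative).
x_data = [1.0, 2.0, 3.0, 4.0, 5.0]
a, b = 0.14, 1.96   # hypothetical fitted intercept and slope

def predict_in_range(x):
    lo, hi = min(x_data), max(x_data)
    if not (lo <= x <= hi):
        raise ValueError(f"x = {x} is outside [{lo}, {hi}]; refusing to extrapolate")
    return a + b * x
```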
5) If your data are averages, regression is unwise: any averaging process hides the nature of the distribution of the underlying measurements, as well as that of the association (if any) between the variables in question. Predictions made using the line will not reflect the hidden variation, and may be terribly wrong.
6) Regression more or less ignores any lurking variables.
7) Linear regression more or less ignores any non-linearity present in the data. Hence, the fact that a line has been calculated does not imply that the association is linear, even if both r and r² are high.
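This caution can be seen in a tiny example: fit a line to points lying on a parabola. The correlation is high, yet the residuals show a clear curved pattern (data are illustrative):

```python
# Sketch: a high r does not rule out curvature (illustrative data on y = x^2).
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [xi ** 2 for xi in x]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)   # about 0.986: "high"
b = sxy / sxx
a = y_bar - b * x_bar
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
# The residuals run +, -, -, -, + : a U-shape, so the association is not linear.
```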
8) A regression line is not a cause-and-effect relationship. A regression line can be calculated for any set of paired, quantitative data. If the roles of the two variables are reversed, a new regression line can be calculated. So if you originally thought of the explanatory variable as the cause, you must now view it as the effect! However, regression may be used carefully as part of a larger program for establishing causality.
9) If there is only one explanatory and one response variable, it doesn’t take a very large sample to get enough power to reject a null hypothesis. Thus, very weak linear associations can be statistically significant, but insignificant in any practical sense. THIS IS ESPECIALLY A PROBLEM IN EDUCATION AND THE SOCIAL SCIENCES, WHERE DATA TYPICALLY HAVE A HUGE AMOUNT OF VARIATION. The social consequences of using statistically significant but practically insignificant experimental results can be very costly.
10) If there is more than one explanatory variable, more than one response variable, or if the type of regression is not linear, then people with sophomore-level statistical education are ill-prepared to face the attendant challenges. (In other words, leave it to people with upper-division or, better yet, graduate-level education in statistics.)
11) Even if there is a linear association between variables, and even if the coefficient of determination is high, there can be a large amount of variation in y-values, so predictions based on the regression line can be very inaccurate.
SPSS instructions for calculating regression lines are available on a separate page.
Examples of regression lines with the coefficient of determination and scatterplots:
(NOT YET AVAILABLE)
Pearson’s (linear) correlation coefficient
The coefficient of determination
Testing claims about a regression line
Conducting a complete least-squares linear regression analysis
David E. Brown
232H Ricks Building
Rexburg, ID 83460
208-496-1839 (voice)
208-496-2005 (fax)
Please do not call me at home.