# The coefficient of determination r²

Purpose: The coefficient of determination measures the predictive ability of a least-squares regression line.

Requirements: The conditions under which r² may be used are similar to those under which least-squares linear regression may be used. To wit:

1. Measurements must be pairs, each response value in the data set being accompanied by the corresponding value of the explanatory variable.
2. Both variables are quantitative, in that it makes sense to do arithmetic with measurements from each variable.
3. There is some justification for treating the data as random. (Preferably, both variables should be random. Using a simple random sample ought to be sufficient for our class.)
4. YOU MUST HAVE A REASON TO BELIEVE THAT THERE IS A LINEAR ASSOCIATION BETWEEN THE TWO VARIABLES BEFORE YOU COMPUTE r². IF THE ASSOCIATION IS NONLINEAR, r² CAN STILL BE HIGH, WHICH CAN INCORRECTLY LEAD YOU TO BELIEVE THAT A NONLINEAR ASSOCIATION IS LINEAR. In my classes, it is enough to have a scatterplot that suggests that a linear relationship exists between the variables. But it's better to have sound theoretical reasons to believe that the variables may be linearly associated.
5. A scatterplot shows that any unusual points are under control (not too many, not too far away, preferably within the linear pattern; no influential points).

The above conditions are simplified, for use in introductory courses. They may be modified in advanced classes.
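
To see why condition 4 matters, here is a minimal Python sketch using made-up data that lie exactly on the curve y = x² rather than on a line. The helper name `fit_line_and_r2` and the data are assumptions chosen for illustration; they are not part of any course software.

```python
def fit_line_and_r2(xs, ys):
    """Return (slope, intercept, r_squared) for the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical data lying exactly on the curve y = x^2 (clearly not linear).
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

slope, intercept, r_squared = fit_line_and_r2(xs, ys)

# Residual = observed response minus the response predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("r squared:", round(r_squared, 3))
print("residuals:", [round(e, 1) for e in residuals])
```

For these data, r² comes out at about 0.95 even though the relationship is a curve. The residuals are positive at both ends and negative in the middle, which is the kind of pattern a scatterplot reveals and r² alone hides.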

How to Compute It: I recommend using software. The University's current statistical software is SPSS. Any spreadsheet program can calculate r², as well. However, I note here that if you have r (Pearson's linear correlation coefficient), then you can calculate the coefficient of determination r² as r × r (that is, as r times itself).
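
As a rough illustration of the "r times itself" shortcut, the following Python sketch computes Pearson's r by hand and then squares it. The function name `pearson_r` and the small data set are made up for this example; in practice, SPSS or a spreadsheet would do the same arithmetic.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's linear correlation coefficient for paired measurements."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired data: explanatory values and their responses.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

r = pearson_r(hours, scores)
r_squared = r * r  # coefficient of determination

print(round(r, 4), round(r_squared, 4))  # 0.7746 0.6
```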

Interpretation:

It doesn't make much sense to use r² without a regression line, since r² is a measure of the predictive ability of the regression line's formula. You also ought to have a scatterplot of the data.

r² is literally the proportion of the variation in the response accounted for, or explained by, the regression line. In other words, the percentage of the variation in the response variable that is explained by the linear association with the explanatory variable is 100 × r². Therefore, r² is always between 0 and 1; the closer r² is to 1, the better a predictor the line is.
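
The "proportion of variation explained" reading can be checked directly. The sketch below, which assumes the usual sum-of-squares definition (one minus the ratio of residual variation to total variation), fits a least-squares line to a small made-up data set and computes r² from it. The function names and data are hypothetical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared_from_line(xs, ys):
    """Proportion of the variation in ys explained by the regression line."""
    slope, intercept = least_squares_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Total variation of the responses about their mean.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    # Variation left over after using the line to predict.
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r2 = r_squared_from_line(xs, ys)
print(round(r2, 4))  # 0.6, i.e. the line explains 60% of the variation
```

For simple least-squares linear regression, this value agrees with r × r computed from Pearson's r on the same data.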

Warnings:

1) r² describes linear associations only. Thus, r² must not be used to describe any other kind or any unknown kind of association. A HIGH VALUE OF r² IS NOT, BY ITSELF, SUFFICIENT EVIDENCE TO CONCLUDE THAT THERE IS A LINEAR ASSOCIATION BETWEEN YOUR VARIABLES. Either there must be sound theoretical reasons to believe that there is a linear association, or there must be "independent" evidence, such as a scatterplot showing a clear, non-horizontal linear association, with consistent variation in responses throughout the range of the explanatory variable.

2) r² is quite sensitive to outliers, influential points, and skewing in the responses.

3) r² says nothing about the association (if any) between the variables beyond the range of the data used to compute r² in the first place. r² can only describe the association for values found in the original data. In short, don't extrapolate!

4) If your data are averages, use of r² is unwise, as any averaging process hides the nature of the distributions in the measurements, as well as the true nature of the association (if any) between the variables.

5) r² can be affected by lurking variables, without giving explicit information about them.

6) If the association between variables is not linear, r² CAN STILL BE MISLEADINGLY HIGH. Hence, the fact that r² is high does not (by itself) imply that an association is linear.

7) The value of r² does not change if the explanatory and response variables trade roles. Therefore, r² is incapable of establishing the existence of a cause-effect relationship. However, r² may be used carefully as part of a larger program for establishing causality.

8) If there is more than one explanatory variable, more than one response variable, or if the type of regression is not linear, then people with sophomore-level statistical education are ill-prepared to face the attendant challenges. (In other words, leave it to people with upper-division or, better yet, graduate-level education in statistics.)

9) Even if there is a linear association between variables, and even if the coefficient of determination is relatively high, there can be a large amount of variation in y-values, so individual predictions based on the regression line can be very inaccurate. (This is true in spite of the fact that r² measures the predictive ability of the line!)
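
Two of the warnings above, sensitivity to outliers and the fact that trading the roles of the variables leaves the value unchanged, can be seen in a short Python sketch. The data, the added outlier, and the function name `r_squared` are all made up for illustration.

```python
def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

# Hypothetical data with a strong, roughly linear pattern.
xs = [1, 2, 3, 4, 5]
ys = [1.1, 1.9, 3.2, 3.9, 5.0]

# Trading the roles of the variables gives the same value,
# so the value cannot say which variable drives the other.
r2_y_on_x = r_squared(xs, ys)
r2_x_on_y = r_squared(ys, xs)

# Adding one hypothetical point far from the pattern changes the value sharply.
xs_with_outlier = xs + [6]
ys_with_outlier = ys + [0.0]
r2_with_outlier = r_squared(xs_with_outlier, ys_with_outlier)

print("y on x:", round(r2_y_on_x, 3))
print("x on y:", round(r2_x_on_y, 3))
print("with one outlier:", round(r2_with_outlier, 3))
```

With these numbers, the first two values match and are close to 1, while the single added point pulls the value down close to 0.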

Examples of regression lines with the coefficient of determination and scatterplots:

(NOT YET AVAILABLE)

Related topics:

Testing claims about Pearson's linear correlation coefficient

Simple least-squares linear regression

Testing claims about simple least-squares linear regression

Conducting a complete linear regression analysis