### Math 221 – Principles of Statistics

Coefficient of Determination ($r^2$) Reference Page

Purpose: The coefficient of determination measures the predictive ability of a least-squares regression line.

Conditions: The conditions under which $r^2$ may be used are similar to those under which least-squares linear regression may be used. To wit:

1) Measurements must be pairs, each response value in the data set being accompanied by the corresponding value of the explanatory variable.

2) Both variables are quantitative, in that it makes sense to do arithmetic with measurements from each variable.

3) There is some excuse for randomness in the data. (Preferably, both variables should be random. Using a simple random sample ought to be sufficient for our class.)

4) A scatterplot suggests that…

(a) …a linear relationship exists between the variables (or better yet, there are sound theoretical reasons to believe that the variables may be linearly associated).

(b) …any unusual points are under control (not too many, not too far away, preferably within the linear pattern; no influential points).

In advanced classes, the above conditions may be modified so as to be more accurate.

How to Compute It: We recommend using software such as SPSS (or a graphing calculator, though we are less enthusiastic about this option). However, we note here that if you have the correlation coefficient $r$, then you can calculate the coefficient of determination as $r^2$ (that is, as $r$ times itself).

It doesn't make much sense to use $r^2$ without a regression line, as $r^2$ is a measure of the predictive ability of the regression line's formula. You also ought to have a scatterplot of the data.
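For readers without SPSS at hand, the "square the correlation coefficient" shortcut can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `correlation` and `coefficient_of_determination` are ours, and the data (hours studied versus exam score) are made up for the example, not taken from the course.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired quantitative data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)

def coefficient_of_determination(xs, ys):
    """r squared: the correlation coefficient times itself."""
    r = correlation(xs, ys)
    return r * r

# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

r = correlation(hours, scores)
r_squared = coefficient_of_determination(hours, scores)
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

For these made-up numbers $r^2$ works out to 0.98. A scatterplot is still needed before trusting that figure.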

Interpretation:

$r^2$ is literally the proportion of the variation in the response accounted for, or explained by, the regression line. In other words, the percentage of the variation in the response variable that is explained by the linear association with the explanatory variable is $100 \times r^2$. Therefore, $r^2$ is always between 0 and 1; the closer $r^2$ is to 1, the better a predictor the line is.
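The "proportion of variation explained" reading can be written out as a formula. The notation here is our assumption, not something this page defines: $y_i$ are the observed responses, $\bar{y}$ is their mean, $\hat{y}_i$ is the response the least-squares line predicts for the $i$-th observation, SSE is the variation left over around the line, and SST is the total variation in the responses.

```latex
r^2 \;=\; 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
    \;=\; 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The least-squares line never fits worse than the horizontal line at $\bar{y}$, so $0 \le \mathrm{SSE} \le \mathrm{SST}$, which is why $r^2$ stays between 0 and 1. For a single explanatory variable this quantity equals the square of the correlation coefficient.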

Warnings:

1) $r^2$ describes linear associations only. Thus, $r^2$ must not be used to describe any other kind or any unknown kind of association. A HIGH VALUE OF $r^2$ IS NOT, BY ITSELF, SUFFICIENT EVIDENCE TO CONCLUDE THAT THERE IS A LINEAR ASSOCIATION. Either there must be sound theoretical reasons to believe in a linear association, or there must be "independent" evidence, such as a scatterplot showing a clear, non-horizontal linear association, with consistent variation in responses throughout the range of the explanatory variable.

2) $r^2$ is quite sensitive to outliers, influential points, and skewing in the responses.

3) $r^2$ says nothing about the association (if any) between the variables beyond the range of the data used to compute $r^2$ in the first place. $r^2$ can only describe the association for values found in the original data.

4) If your data are averages, use of $r^2$ is unwise, as any averaging process hides the nature of the distributions in the measurements, as well as the true nature of the association (if any) between the variables.

5) $r^2$ more or less ignores any lurking variables.

6) If the association between variables is not linear, $r^2$ CAN STILL BE MISLEADINGLY HIGH even though a straight line is the wrong model. Hence, the fact that $r^2$ is high does not (by itself) imply that an association is linear.

7) The value of $r^2$ does not change if the explanatory and response variables trade roles. Therefore, $r^2$ is incapable of establishing the existence of a cause-effect relationship. However, $r^2$ may be used carefully as part of a larger program for establishing causality.

8) If there is more than one explanatory variable, more than one response variable, or if the type of regression is not linear, then people with sophomore-level statistical education are ill-prepared to face the attendant challenges. (In other words, leave it to people with upper-division or, better yet, graduate-level education in statistics.)

10) Even if there is a linear association between variables, and even if the coefficient of determination is relatively high, there can be a large amount of variation in y-values, so individual predictions based on the regression line can be very inaccurate. (This is true in spite of the fact that $r^2$ measures the predictive ability of the line!)
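Three of the warnings above (sensitivity to outliers, high values for curved data, and indifference to which variable plays which role) are easy to check numerically. The following is a minimal sketch, not part of the course materials: the helper `r_squared` is a name we chose, and every data set is invented for illustration.

```python
def r_squared(xs, ys):
    """Coefficient of determination for paired data (square of Pearson's r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

# Curved data: y = x^2 is clearly not linear, yet r^2 is high.
xs = list(range(11))                # 0, 1, ..., 10
curved = [x ** 2 for x in xs]
r2_curved = r_squared(xs, curved)

# Trading roles: swapping explanatory and response leaves r^2 unchanged.
r2_swapped = r_squared(curved, xs)

# Outliers: one wild response can wreck an otherwise near-perfect fit.
line_xs = [1, 2, 3, 4, 5, 6]
line_ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1]       # roughly y = 2x
outlier_ys = line_ys[:-1] + [2.0]                # last response replaced
r2_clean = r_squared(line_xs, line_ys)
r2_outlier = r_squared(line_xs, outlier_ys)

print(f"curved data:      r^2 = {r2_curved:.3f}")
print(f"roles swapped:    r^2 = {r2_swapped:.3f}")
print(f"without outlier:  r^2 = {r2_clean:.3f}")
print(f"with one outlier: r^2 = {r2_outlier:.3f}")
```

In this sketch the parabola yields an $r^2$ above 0.9, the swapped version gives the same value, and a single changed point drops $r^2$ from above 0.99 to below 0.2. None of this would be visible without also looking at the scatterplots.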

SPSS instructions for calculating regression lines and the coefficient of determination: Click here.

Examples of regression lines with the coefficient of determination and scatterplots:

(NOT YET AVAILABLE)

