Math 221 – Principles of Statistics

Pearson’s (Linear) Correlation Coefficient  Reference Page

Purpose: Pearson’s correlation coefficient (r) is a numerical measure of the strength of a linear association between two variables.

Conditions:

1) Both variables must be quantitative, in the sense that for each variable, it makes sense to do arithmetic with the measurements.

2) YOU MUST HAVE A REASON TO BELIEVE THAT THERE IS A LINEAR ASSOCIATION BETWEEN THE TWO VARIABLES BEFORE YOU COMPUTE r. IF THE ASSOCIATION IS NONLINEAR, r IS GUARANTEED TO BE HIGHER THAN IT SHOULD BE, WHICH WILL SOMETIMES INCORRECTLY LEAD YOU TO BELIEVE THE ASSOCIATION IS LINEAR.

3) Strictly speaking, both variables should be random; indeed, the interpretation of r is safest when the two variables are “jointly normal” (whatever that means). However, many people use r even when the values of one of the variables are not random. We discourage this practice; the coefficient of determination, r², is a better tool to use in that situation.

4) This coefficient should be accompanied by a scatterplot whenever possible.

There are additional technical considerations; unfortunately, they are too technical for an introductory course. Consequently, many people who use r do so without the proper safeguards, which is irresponsible and potentially hazardous.

How to Compute It: We recommend using software such as SPSS (or a calculator, though we are less enthusiastic about this option).
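For readers curious about what the software does behind the scenes, the following is a minimal sketch in Python (a language chosen here only for illustration). It assumes the usual textbook definition of r: the sum of the products of the deviations from the two means, divided by the square root of the product of the two sums of squared deviations. The function name pearson_r and the data values are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the textbook definition (illustrative sketch)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)

# Made-up measurements, for illustration only.
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]
r = pearson_r(hours, score)
print(round(r, 4))  # 0.7746
```

A statistical package should report the same value for the same data; in keeping with condition 4, a scatterplot of the data should still be examined alongside it.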

Interpretation: The value of r is always between -1 and 1; the closer r is to either of these values, the stronger the linear association between the variables. If r is close to 0, the linear association is weak. If r is zero, there is no linear association between the variables. If r is positive, so is the association; likewise for negative r. The value of r does not distinguish between explanatory and response variables.
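These properties can be checked numerically. The sketch below is in Python, with a hypothetical helper pearson_r built on the textbook definition and made-up data. It suggests that swapping the two variables leaves r unchanged, that reversing the direction of the association flips the sign of r, and that an exactly linear relationship gives r = 1.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the textbook definition (illustrative sketch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data, for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 5]            # tends to rise with x
y_down = [-v for v in y_up]       # same pattern, reversed direction
y_line = [2 * v + 1 for v in x]   # exactly linear in x

r_xy = pearson_r(x, y_up)         # x treated as explanatory
r_yx = pearson_r(y_up, x)         # roles swapped: same value
r_neg = pearson_r(x, y_down)      # negative association: sign flips
r_line = pearson_r(x, y_line)     # perfect positive linear association

print(r_xy, r_yx, r_neg, r_line)
```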

Warnings:

1) Pearson’s correlation coefficient measures the strength of linear associations only. There may be times when you see the strength of a nonlinear association described by something called r; if those responsible are using their statistical tools correctly, their r is not a direct measure of the strength of that nonlinear association.

2) A low value of r does not mean that a linear association does not exist, only that if there is a linear association present, it is weak. Likewise, a high value of r does not mean that a linear association is present, only that if one is present, it is strong. Thus, as a descriptive statistic, r must not be used to infer the nature of an association, only to measure the strength of a linear association.

3) Note carefully that r only describes this strength to the extent of the actual measured values of the two variables included in its calculation. It cannot measure the strength of a linear association for values of variables outside the ranges of the measurements made on individuals in the sample.

4) r is quite sensitive to outliers, influential points, and skewing, and if the data include unusual points that fit the linear pattern, these can cause r to be higher than it should be. Also, it is possible for a decidedly nonlinear association to yield a high value of r; this is due largely to bias caused by the nonlinear nature of the association.

5) Use of r with averages is unwise, as any averaging process hides not only the variation in the individual measurements, but the nature of the distribution of those measurements and therefore the nature of the association (if any) between the variables whose averages are used in the calculation. Typically, r is lower for individual measurements and higher for averages made from those very same measurements.

6) Moreover, r more or less ignores any lurking variables.

7) Indeed, the value of r does not depend on whether either variable is seen as the “cause” (explanatory variable) or the “effect” (response)! So r is utterly incapable of establishing that one thing causes another. Hence, above all, a value of r close to 1 or -1 must not be construed as "proof" of a cause-and-effect relationship between the variables. However, it may be used carefully as part of a larger program for establishing causality.
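Several of the warnings above can be seen with small made-up data sets. The following Python sketch (the helper pearson_r and all data values are hypothetical, chosen only for illustration) shows three things: a perfectly curved relationship that yields a high r, a perfectly curved relationship that yields r = 0, and a single unusual point that raises r sharply.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from the textbook definition (illustrative sketch)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A decidedly nonlinear association (y = x squared) measured only for
# x from 1 to 10: r comes out high even though the pattern is a curve.
x_pos = list(range(1, 11))
r_curve_high = pearson_r(x_pos, [x ** 2 for x in x_pos])

# The same curve measured symmetrically around zero: r is 0, even though
# y is completely determined by x.
x_sym = list(range(-3, 4))
r_curve_zero = pearson_r(x_sym, [x ** 2 for x in x_sym])

# Five points with little linear pattern, then the same points plus one
# far-away point at (20, 20).
x_base = [1, 2, 3, 4, 5]
y_base = [3, 1, 4, 2, 3]
r_without = pearson_r(x_base, y_base)
r_with = pearson_r(x_base + [20], y_base + [20])

print(round(r_curve_high, 3), round(r_curve_zero, 3))
print(round(r_without, 3), round(r_with, 3))
```

In each case a scatterplot would reveal at a glance what the single number r conceals.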

Related topics:

232 Ricks Building                              208-496-1839 voice

Rexburg, ID 83460                              208-496-2005 fax

Please do not call me at home.