# Pearson's linear correlation coefficient as a descriptive statistic

Purpose: Pearson’s correlation coefficient r is a numerical measure of the strength of a linear association between two variables. (This is the version of "correlation" most people have in mind when they use the word "correlation.")

Requirements:

1. Both variables must be quantitative, in the sense that for each variable, it makes sense to do arithmetic with the measurements.
2. YOU MUST HAVE A REASON TO BELIEVE THAT THERE IS A LINEAR ASSOCIATION BETWEEN THE TWO VARIABLES BEFORE YOU COMPUTE r. IF THE ASSOCIATION IS NONLINEAR, r CAN BE MISLEADINGLY LARGE, WHICH CAN INCORRECTLY LEAD YOU TO BELIEVE THAT A NONLINEAR ASSOCIATION IS LINEAR.
3. Strictly speaking, both variables should be random; indeed, the interpretation of r is safest when the two variables are “jointly normal” (a distribution not studied in introductory Statistics courses at BYU-Idaho). However, many people use r even when the values of one of the variables are not random. I discourage this practice; the coefficient of determination r² is a better tool to use in that situation.
4. Never use r by itself. Also look at a scatterplot whenever possible.

There are additional considerations; unfortunately, they are too technical for an introductory course. Consequently, many people who use r do so without the proper safeguards, which is irresponsible and potentially hazardous.

How to Compute It: I recommend using software. The statistical software currently used in statistics courses at this university is SPSS. Any spreadsheet program can calculate r, as well.
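For readers who want to see what the software is doing, here is a minimal sketch in Python that computes r directly from its definition, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). The function name `pearson_r` and the data (`hours`, `scores`) are made up for illustration; they are not part of any course material.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed directly from its definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]          # deviations from the mean of x
    dy = [y - mean_y for y in ys]          # deviations from the mean of y
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements, invented for this example only.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)
print(round(r, 3))  # 0.775
```

In practice a library routine is the safer choice; for example, Python 3.10 and later ship `statistics.correlation` in the standard library, which should agree with the hand computation above.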

Interpretation: The value of r is always between -1 and 1; the closer r is to either of these values, the stronger the linear association between the variables. If r is close to 0, the linear association is weak. If r is zero, there is no linear association between the variables. If r is positive, so is the association; likewise for negative r. The value of r does not distinguish between explanatory and response variables.
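The properties just described can be checked numerically. The sketch below uses a small hand-written helper (`pearson_r`, a hypothetical name) and invented data to show that r stays between -1 and 1, that swapping the roles of the two variables leaves r unchanged, and that reversing the direction of the association flips its sign.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from its definition (assumes neither variable is constant)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Invented data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_xy = pearson_r(x, y)                    # x treated as explanatory
r_yx = pearson_r(y, x)                    # roles swapped
r_neg = pearson_r(x, [-v for v in y])     # same strength, opposite direction

print(f"r(x, y)  = {r_xy:.3f}")
print(f"r(y, x)  = {r_yx:.3f}")
print(f"r(x, -y) = {r_neg:.3f}")
```

The first two values match and the third is their negative, which is why r alone cannot tell you which variable is explanatory and which is the response.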

Warnings:

1) Pearson’s linear correlation coefficient measures the strength of linear associations only. There may be times when you see the strength of a nonlinear association described by something called r (or perhaps r²). If those responsible are using their statistical tools in ways generally approved by the statistics-using culture, their r (or r²) is not a direct measure of the strength of that nonlinear association. Their r (or r²) measures the strength of a related linear association, and the relationship does not allow the interpretation of r (or r²) to carry over directly from the related linear association to the given nonlinear one. And if they are not using generally accepted tools, their r (or r²) is not to be trusted without sufficient explanation of what it actually measures.

2) A low value of r does not mean that a linear association does not exist, only that if there is a linear association present, it is weak. Likewise, a high value of r does not mean that a linear association is present, only that if one is present, it is strong. Thus, r is to be used as a descriptive statistic only after you're already validly convinced there's a linear association between your variables, and even then, only to measure the strength of the linear association.

3) Note carefully that r describes the strength of a linear association only so far as the values in your data extend. It cannot measure the strength of a linear association for values of the variables outside the ranges of the measurements you have actually made. In short, don't extrapolate!

4) r is quite sensitive to outliers, influential points, and skewness. Also, if the data include unusual points that fit the linear pattern, these can cause r to be higher than it should be. It also happens that r is large even when the association is decidedly nonlinear. This is due (largely) to bias caused by the nonlinear nature of the association.

5) Use of r with averages is unwise, as any averaging process hides not only the variation in the individual measurements, but the nature of the distribution of those measurements and therefore the nature of the association (if any) between the variables in use. Typically, r is lower for individual measurements and higher for averages made from those very same measurements.

6) Unsurprisingly, r can be influenced by lurking variables. If the only variable that can possibly affect your response variable is your explanatory variable, this is not a problem. Otherwise, I recommend keeping an open mind about what r tells you about the things that influence your responses.

7) Indeed, the value of r does not depend on whether either variable is seen as the “cause” (explanatory variable) or the “effect” (response variable)! So r is utterly incapable of establishing that one thing causes another. Hence, above all, a value of r close to 1 or -1 must not be construed as "proof" of a cause-and-effect relationship between the variables. However, it may be used carefully as part of a larger effort to establish causality.
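Several of these warnings are easy to see with small numerical experiments. The sketch below is only an illustration: every data set in it is invented, and `pearson_r` is a hypothetical helper written out from the definition of r. It touches on warning 2 (r and nonlinear patterns), warning 4 (a single unusual point), and warning 5 (averages versus individual measurements).

```python
import math

def pearson_r(xs, ys):
    """Pearson's r from its definition (assumes neither variable is constant)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sxy = sum(a * b for a, b in zip(dx, dy))
    return sxy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Warning 2: the exact curve y = x^2, which is not a linear association.
x_pos = list(range(1, 11))                       # x = 1, ..., 10
r_curve_high = pearson_r(x_pos, [x * x for x in x_pos])
x_sym = list(range(-5, 6))                       # x = -5, ..., 5
r_curve_zero = pearson_r(x_sym, [x * x for x in x_sym])

# Warning 4: a weakly related cloud, then the same cloud plus one far-away point.
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 4, 1, 5, 2]
r_without = pearson_r(x, y)
r_with = pearson_r(x + [20], y + [20])

# Warning 5: nine individual measurements versus the three group averages.
x_ind = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_ind = [0, 2, 4, 1, 3, 5, 2, 4, 6]
r_individual = pearson_r(x_ind, y_ind)
r_averages = pearson_r([1, 2, 3], [2, 3, 4])     # mean of y within each x group

print(f"y = x^2 on 1..10:       r = {r_curve_high:.3f}")
print(f"y = x^2 on -5..5:       r = {r_curve_zero:.3f}")
print(f"cloud without outlier:  r = {r_without:.3f}")
print(f"cloud with outlier:     r = {r_with:.3f}")
print(f"individual values:      r = {r_individual:.3f}")
print(f"group averages:         r = {r_averages:.3f}")
```

With these made-up numbers, the curve gives r of roughly 0.97 on one range and 0 on the other, the single added point moves r from about 0.13 to about 0.95, and averaging raises r from about 0.45 to 1. None of this would be a surprise after looking at a scatterplot of each data set, which is the point of never using r by itself.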