# Bro. Brown's statistics reference pages:

# Pearson's linear correlation coefficient as a descriptive statistic

**Purpose:** Pearson’s correlation coefficient *r* is a numerical measure of the strength of a linear association between two variables. (This is the version of "correlation" most people have in mind when they use the word "correlation.")

**Requirements:**

- Both variables must be quantitative, in the sense that for each variable, it makes sense to do arithmetic with the measurements.
- YOU MUST HAVE A REASON TO BELIEVE THAT THERE IS A LINEAR ASSOCIATION BETWEEN THE TWO VARIABLES **BEFORE** YOU COMPUTE *r*. IF THE ASSOCIATION IS NONLINEAR, *r* CAN **BADLY MISREPRESENT THE STRENGTH OF THE ASSOCIATION**, WHICH CAN SOMETIMES INCORRECTLY LEAD YOU TO BELIEVE THAT A NONLINEAR ASSOCIATION IS LINEAR.
- Strictly speaking, both variables should be random; indeed, the interpretation of *r* is safest when the two variables are “jointly normal” (a distribution not studied in introductory Statistics courses at BYU--Idaho). However, many people use *r* even when the values of one of the variables are not random. I discourage this practice; the coefficient of determination *r*^{2} is a better tool to use in that situation.
- Never use *r* by itself. Use also a scatterplot whenever possible.

There are additional considerations; unfortunately, they are too technical for an introductory course. Consequently, many people who use *r* do so without the proper safeguards, which is irresponsible and potentially hazardous.

**How to Compute It:** I recommend using software. The statistical software currently used in statistics courses at this university is SPSS. Any spreadsheet program can calculate *r*, as well.
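If you want to see what the software is doing (or check it by hand), the definition of *r* can be sketched in a few lines of Python. The data values below are made up purely for illustration:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r from its definition: the sum of products of deviations
    from the means, divided by the product of the sums-of-squares roots."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical measurements with a strong positive linear pattern:
heights = [150, 160, 165, 170, 180]
weights = [50, 58, 61, 66, 74]
r = pearson_r(heights, weights)  # very close to 1 for these values
```

Spreadsheets and SPSS compute exactly this quantity; the hand formula is shown only so the output of the software is not a black box.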

**Interpretation:** The value of *r* is always between -1 and 1; the closer *r* is to either of these values, the stronger the linear association between the variables. If *r* is close to 0, the linear association is weak; if *r* is exactly 0, there is no linear association between the variables at all. If *r* is positive, the association is positive (the variables tend to increase together); if *r* is negative, the association is negative. The value of *r* does not distinguish between explanatory and response variables.

**Warnings:**

1) Pearson’s linear correlation coefficient measures the strength of **linear** associations **only**. There may be times when you see the strength of a nonlinear association described by something called *r* (or perhaps *r*^{2}). If those responsible are using their statistical tools in ways generally approved by the statistics-using culture, their *r* (or *r*^{2}) is **not** a direct measure of the strength of that nonlinear association. Their *r* (or *r*^{2}) measures the strength of a *related* linear association, and the relationship does not allow the interpretation of *r* (or *r*^{2}) to carry over directly from the related linear association to the given nonlinear one. And if they are not using generally accepted tools, their *r* (or *r*^{2}) is not to be trusted without sufficient explanation of what it actually measures.

2) A low value of *r* does **not** mean that a linear association does not exist, only that **if** there is a linear association present, it is weak. Likewise, a high value of *r* does **not** mean that a linear association is present, only that **if** one is present, it is strong. Thus, *r* is to be used as a descriptive statistic **only after** you're already validly convinced there's a linear association between your variables, and even then, **only** to measure the strength of the linear association.

3) Note carefully that *r* describes the strength of a linear association only as far as the values in your data extend. It cannot measure the strength of a linear association for values of the variables outside the ranges of the measurements you have actually made. In short, don't extrapolate!

4) *r* is quite sensitive to outliers, influential points, and skewing. If the data include unusual points that happen to fit the linear pattern, these can inflate *r*, making it higher than it should be. It can also happen that *r* is large even when the association is decidedly nonlinear: a strongly nonlinear but steadily increasing (or decreasing) pattern still contains a substantial linear trend, and *r* picks up that trend.

5) Use of *r* with averages is unwise, as any averaging process hides not only the *variation* in the individual measurements, but the *nature of the distribution* of those measurements and therefore the nature of the association (if any) between the variables in use. Typically, *r* is lower for individual measurements and higher for averages made *from those very same measurements*.

6) Unsurprisingly, *r* can be influenced by lurking variables. If the only variable that can possibly affect your response variable is your explanatory variable, this is not a problem. Otherwise, I recommend keeping an open mind about what *r* tells you about the things that influence your responses.

7) Indeed, the value of *r* does not depend on whether either variable is seen as the “cause” (explanatory variable) or the “effect” (response)! So ***r* is utterly incapable of establishing that one thing causes another**. Hence, **above all**, a value of *r* close to 1 or -1 must **not** be construed as "proof" of a cause-and-effect relationship between the variables. However, it may be used carefully as part of a larger effort to establish causality.

**SPSS instructions for computing Pearson’s correlation coefficient:** Click here.

**Examples of Pearson’s correlation coefficient, with scatterplots:** Click here.

**Related topics:**