```
# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import math
import numpy as np
from scipy import stats
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import nbinteract as nbi
```

In this section we will develop a measure of how tightly clustered a scatter diagram is about a straight line. Formally, this is called measuring *linear association*.

The *correlation coefficient* measures the strength of the linear relationship between two variables. Graphically, it measures how clustered the scatter diagram is around a straight line.

The term *correlation coefficient* is a quite long word, so usually the term shortened to *correlation* and denoted by $r$.

Here are some mathematical facts about $r$ that we will just observe by simulation.

- The correlation coefficient $r$ is a number between -1 and 1.
- $r$ measures the extent to which the scatter plot clusters around a straight line.
- $r$ = 1 if the scatter diagram is a perfect straight line sloping upwards, and $r$ = -1 if the scatter diagram is a perfect straight line sloping downwards.

The function `r_scatter`

takes a value of $r$ as its argument and simulates a scatter plot with a correlation very close to $r$. Because of randomness in the simulation, the correlation is not expected to be exactly equal to $r$.

Call `r_scatter`

a few times, with different values of $r$ as the argument, and see how the scatter plot changes.

When $r$ = 1 the scatter plot is perfectly linear and slopes upward. When $r$ = -1, the scatter plot is perfectly linear and slopes downward. When $r$ = 0, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be *uncorrelated*.

```
z = np.random.normal(0, 1, 500)
def r_scatter(xs, r):
"""
Generate y-values for a scatter plot with correlation approximately r
"""
return r*xs + (np.sqrt(1-r**2))*z
corr_opts = {
'aspect_ratio': 1,
'xlim': (-3.5, 3.5),
'ylim': (-3.5, 3.5),
}
nbi.scatter(np.random.normal(size=500), r_scatter, options=corr_opts, r=(-1, 1, 0.05))
```

The formula for $r$ is not apparent from our observations so far. It has a mathematical basis that is outside the scope of this class. However, as you will see, the calculation is straightforward and helps us understand several of the properties of $r$.

**Formula** for $r$:

$r$ is the **average of the products of the two variables**, when both variables are measured in standard units.

Here are the steps in the calculation. We will apply the steps to a simple table of values of **x** and **y**.

```
x = np.arange(1, 7, 1)
y = make_array(2, 3, 1, 5, 2, 7)
t = Table().with_columns(
'x', x,
'y', y
)
t
```

Based on the scatter diagram, we expect that $r$ will be positive but not equal to 1.

```
nbi.scatter(t.column(0), t.column(1), options={'aspect_ratio': 1})
```

**Step 1.** Convert each variable to standard units.

```
def standard_units(nums):
return (nums - np.mean(nums)) / np.std(nums)
```

```
t_su = t.with_columns(
'x (standard units)', standard_units(x),
'y (standard units)', standard_units(y)
)
t_su
```

**Step 2.** Multiply each pair of standard units.

```
t_product = t_su.with_column('product of standard units', t_su.column(2) * t_su.column(3))
t_product
```

**Step 3.** $r$ is the average of the products computed in Step 2.

```
# r is the average of the products of standard units
r = np.mean(t_product.column(4))
r
```

As expected, $r$ is positive but not equal to 1.

The calculation shows that:

- $r$ is a pure number. It has no units. This is because $r$ is based on standard units.
- $r$ is unaffected by changing the units on either axis. This too is because $r$ is based on standard units.
- $r$ is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called
**x**and which**y**. Geometrically, switching axes reflects the scatter plot about the line**y = x**, but does not change the amount of clustering nor the sign of the association.

```
nbi.scatter(t.column(1), t.column(0), options={'aspect_ratio': 1})
```

We are going to be calculating correlations repeatedly, so it will help to define a function that computes it by performing all the steps described above. Let's define a function `correlation`

that takes a table and the labels of two columns in the table. The function returns $r$, the mean of the products of those column values in standard units.

```
def correlation(t, x, y):
return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))
```

```
interact(correlation, t=fixed(t),
x=widgets.ToggleButtons(options=['x', 'y'], description='x-axis'),
y=widgets.ToggleButtons(options=['x', 'y'], description='y-axis'))
```

Let's call the function on the `x`

and `y`

columns of `t`

. The function returns the same answer to the correlation between $x$ and $y$ as we got by direct application of the formula for $r$.

```
correlation(t, 'x', 'y')
```

As we noticed, the order in which the variables are specified doesn't matter.

```
correlation(t, 'y', 'x')
```

Calling `correlation`

on columns of the table `suv`

gives us the correlation between price and mileage as well as the correlation between price and acceleration.

```
suv = (Table.read_table('https://raw.githubusercontent.com/data-8/materials-fa17/master/lec/hybrid.csv')
.where('class', 'SUV'))
interact(correlation, t=fixed(suv),
x=widgets.ToggleButtons(options=['mpg', 'msrp', 'acceleration'],
description='x-axis'),
y=widgets.ToggleButtons(options=['mpg', 'msrp', 'acceleration'],
description='y-axis'))
```

```
correlation(suv, 'mpg', 'msrp')
```

```
correlation(suv, 'acceleration', 'msrp')
```

These values confirm what we had observed:

- There is a negative association between price and efficiency, whereas the association between price and acceleration is positive.
- The linear relation between price and acceleration is a little weaker (correlation about 0.5) than between price and mileage (correlation about -0.67).

Correlation is a simple and powerful concept, but it is sometimes misused. Before using $r$, it is important to be aware of what correlation does and does not measure.