```# HIDDEN
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')
import math
import numpy as np
from scipy import stats
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
import nbinteract as nbi
```

### Correlation¶

In this section we will develop a measure of how tightly clustered a scatter diagram is about a straight line. Formally, this is called measuring linear association.

#### The correlation coefficient¶

The correlation coefficient measures the strength of the linear relationship between two variables. Graphically, it measures how clustered the scatter diagram is around a straight line.

The term correlation coefficient is a quite long word, so usually the term shortened to correlation and denoted by \$r\$.

Here are some mathematical facts about \$r\$ that we will just observe by simulation.

• The correlation coefficient \$r\$ is a number between -1 and 1.
• \$r\$ measures the extent to which the scatter plot clusters around a straight line.
• \$r\$ = 1 if the scatter diagram is a perfect straight line sloping upwards, and \$r\$ = -1 if the scatter diagram is a perfect straight line sloping downwards.

The function `r_scatter` takes a value of \$r\$ as its argument and simulates a scatter plot with a correlation very close to \$r\$. Because of randomness in the simulation, the correlation is not expected to be exactly equal to \$r\$.

Call `r_scatter` a few times, with different values of \$r\$ as the argument, and see how the scatter plot changes.

When \$r\$ = 1 the scatter plot is perfectly linear and slopes upward. When \$r\$ = -1, the scatter plot is perfectly linear and slopes downward. When \$r\$ = 0, the scatter plot is a formless cloud around the horizontal axis, and the variables are said to be uncorrelated.

```z = np.random.normal(0, 1, 500)
def r_scatter(xs, r):
"""
Generate y-values for a scatter plot with correlation approximately r
"""
return r*xs + (np.sqrt(1-r**2))*z

corr_opts = {
'aspect_ratio': 1,
'xlim': (-3.5, 3.5),
'ylim': (-3.5, 3.5),
}

nbi.scatter(np.random.normal(size=500), r_scatter, options=corr_opts, r=(-1, 1, 0.05))
```

#### Calculating the correlation¶

The formula for \$r\$ is not apparent from our observations so far. It has a mathematical basis that is outside the scope of this class. However, as you will see, the calculation is straightforward and helps us understand several of the properties of \$r\$.

Formula for \$r\$:

\$r\$ is the average of the products of the two variables, when both variables are measured in standard units.

Here are the steps in the calculation. We will apply the steps to a simple table of values of x and y.

```x = np.arange(1, 7, 1)
y = make_array(2, 3, 1, 5, 2, 7)
t = Table().with_columns(
'x', x,
'y', y
)
t
```
x y
1 2
2 3
3 1
4 5
5 2
6 7

Based on the scatter diagram, we expect that \$r\$ will be positive but not equal to 1.

```nbi.scatter(t.column(0), t.column(1), options={'aspect_ratio': 1})
```

Step 1. Convert each variable to standard units.

```def standard_units(nums):
return (nums - np.mean(nums)) / np.std(nums)
```
```t_su = t.with_columns(
'x (standard units)', standard_units(x),
'y (standard units)', standard_units(y)
)
t_su
```
x y x (standard units) y (standard units)
1 2 -1.46385 -0.648886
2 3 -0.87831 -0.162221
3 1 -0.29277 -1.13555
4 5 0.29277 0.811107
5 2 0.87831 -0.648886
6 7 1.46385 1.78444

Step 2. Multiply each pair of standard units.

```t_product = t_su.with_column('product of standard units', t_su.column(2) * t_su.column(3))
t_product
```
x y x (standard units) y (standard units) product of standard units
1 2 -1.46385 -0.648886 0.949871
2 3 -0.87831 -0.162221 0.142481
3 1 -0.29277 -1.13555 0.332455
4 5 0.29277 0.811107 0.237468
5 2 0.87831 -0.648886 -0.569923
6 7 1.46385 1.78444 2.61215

Step 3. \$r\$ is the average of the products computed in Step 2.

```# r is the average of the products of standard units

r = np.mean(t_product.column(4))
r
```
`0.6174163971897709`

As expected, \$r\$ is positive but not equal to 1.

#### Properties of \$r\$¶

The calculation shows that:

• \$r\$ is a pure number. It has no units. This is because \$r\$ is based on standard units.
• \$r\$ is unaffected by changing the units on either axis. This too is because \$r\$ is based on standard units.
• \$r\$ is unaffected by switching the axes. Algebraically, this is because the product of standard units does not depend on which variable is called x and which y. Geometrically, switching axes reflects the scatter plot about the line y = x, but does not change the amount of clustering nor the sign of the association.
```nbi.scatter(t.column(1), t.column(0), options={'aspect_ratio': 1})
```

#### The correlation function¶

We are going to be calculating correlations repeatedly, so it will help to define a function that computes it by performing all the steps described above. Let's define a function `correlation` that takes a table and the labels of two columns in the table. The function returns \$r\$, the mean of the products of those column values in standard units.

```def correlation(t, x, y):
return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))
```
```interact(correlation, t=fixed(t),
x=widgets.ToggleButtons(options=['x', 'y'], description='x-axis'),
y=widgets.ToggleButtons(options=['x', 'y'], description='y-axis'))
```
`<function __main__.correlation(t, x, y)>`

Let's call the function on the `x` and `y` columns of `t`. The function returns the same answer to the correlation between \$x\$ and \$y\$ as we got by direct application of the formula for \$r\$.

```correlation(t, 'x', 'y')
```
`0.6174163971897709`

As we noticed, the order in which the variables are specified doesn't matter.

```correlation(t, 'y', 'x')
```
`0.6174163971897709`

Calling `correlation` on columns of the table `suv` gives us the correlation between price and mileage as well as the correlation between price and acceleration.

```suv = (Table.read_table('https://raw.githubusercontent.com/data-8/materials-fa17/master/lec/hybrid.csv')
.where('class', 'SUV'))

interact(correlation, t=fixed(suv),
x=widgets.ToggleButtons(options=['mpg', 'msrp', 'acceleration'],
description='x-axis'),
y=widgets.ToggleButtons(options=['mpg', 'msrp', 'acceleration'],
description='y-axis'))
```
`<function __main__.correlation(t, x, y)>`
```correlation(suv, 'mpg', 'msrp')
```
`-0.6667143635709919`
```correlation(suv, 'acceleration', 'msrp')
```
`0.48699799279959155`

These values confirm what we had observed:

• There is a negative association between price and efficiency, whereas the association between price and acceleration is positive.
• The linear relation between price and acceleration is a little weaker (correlation about 0.5) than between price and mileage (correlation about -0.67).

Correlation is a simple and powerful concept, but it is sometimes misused. Before using \$r\$, it is important to be aware of what correlation does and does not measure.