Module 5 · Lesson 9 of 15 · ~35 min · ⚡ +95 XP available

Bivariate Data Analysis

Ice cream sales and drowning deaths are strongly correlated — $r \approx 0.9$. Does ice cream cause drowning? Of course not. Both are driven by hot weather. This is the most dangerous trap in statistics: mistaking correlation for causation. You'll learn to measure, interpret, and critique relationships between two variables.

Today's hook: Nicolas Cage film appearances and swimming pool drownings were noticeably correlated ($r \approx 0.67$) from 1999–2009. Did Cage cause drownings? No, and understanding why not is the most exam-tested idea in all of statistics.

Think First — gut answer before you read

+5 XP warm-up

Ice cream sales and drowning deaths are strongly correlated. Does ice cream cause drowning? Write your gut answer — no peeking ahead.

Formula reference for this lesson

+5 XP to read

Two formulas. Lock them down before the worked examples.

Pearson's $r$ (definition)

$$r = \dfrac{\sum(x - \bar{x})(y - \bar{y})}{(n - 1)s_x s_y}$$

Pearson's $r$ (computational)

$$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$

Range & $r^2$

$-1 \leq r \leq 1$ always. $r^2$ = proportion of variation in $y$ explained by the linear relationship with $x$.

Key insight. $r$ measures linear association only. A perfect parabola can have $r = 0$ despite a clear relationship. Always draw the scatter plot before trusting $r$.
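As a quick sanity check that the two formulas above describe the same quantity, the sketch below computes $r$ both ways. The data set and the function names `r_definition` and `r_computational` are made up for illustration, and $s_x$, $s_y$ are assumed to be sample standard deviations (divisor $n - 1$), which is what `statistics.stdev` returns.

```python
import math
import statistics


def r_definition(xs, ys):
    """Pearson's r from deviations about the means (definition form)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample SDs (n - 1)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)


def r_computational(xs, ys):
    """Pearson's r from raw sums (computational form)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical data, chosen only to exercise both formulas.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_def = r_definition(xs, ys)
r_comp = r_computational(xs, ys)
print(round(r_def, 4), round(r_comp, 4))  # both round to 0.7746
```

The computational form is usually the easier one by hand because it needs only five running totals; the definition form makes it clearer why $r$ has no units.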

What you'll master

Know

Key facts

$r = \frac{\sum(x - \bar{x})(y - \bar{y})}{(n-1)s_x s_y}$; range $-1 \leq r \leq 1$
Sign = direction; magnitude = strength
Correlation does not imply causation

Understand

Concepts

$r$ measures linear relationships only
Outliers can dramatically affect $r$
Confounding variables create spurious correlations

Can do

Skills

Interpret scatter plots for direction, form, and strength
Calculate $r$ from raw data using the computational formula
Critique claims of causation from correlational evidence

Key terms

Scatter plot: A graph of pairs $(x,y)$ as points on a coordinate plane, used to visualise relationships.

Pearson's $r$: A number in $[-1,1]$ that quantifies the strength and direction of a linear relationship.

Correlation: A statistical association between two variables; it does not imply causation.

Causation: When one variable directly produces a change in another, established by controlled experiments.

Coefficient of determination ($r^2$): The proportion of variation in $y$ explained by the linear relationship with $x$.

Confounding variable: A third variable $Z$ that drives both $X$ and $Y$, creating a spurious correlation.

Reading scatter plots

core concept

Before calculating any number, always draw and inspect the scatter plot. Three features to describe:

Direction — positive (uphill left to right), negative (downhill), or no trend.
Form — linear, curved, clustered, or no pattern.
Strength — how tightly points follow the form: strong (tight), moderate, or weak (scattered).

Also note any outliers — points that deviate substantially from the overall pattern. A single outlier can dramatically inflate or deflate $r$.

Critical limitation: Two data sets can have identical $r$ values but very different scatter plots. Always look before you calculate.
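To see how much a single outlier can move $r$, here is a small sketch using invented data: eight points with no overall trend, then the same points plus one extreme value at $(20, 20)$. The helper name `pearson_r` and the data are assumptions for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r via the computational (raw sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Hypothetical data: y zig-zags with no overall trend as x increases.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 3, 6, 4, 5, 3, 6, 4]

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs + [20], ys + [20])  # add one extreme point at (20, 20)

print(r_without)         # 0.0
print(round(r_with, 2))  # 0.89
```

One added point turns "no linear correlation" into what the strength table would call a strong correlation, which is why the scatter plot has to be inspected before $r$ is interpreted.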

Scatter plot features: Direction (positive/negative/none), Form (linear/curved), Strength (strong/moderate/weak); Always check for outliers — a single point can dominate $r$

Pause: copy into your book the three scatter plot features to describe, Direction (positive/negative/none), Form (linear/curved), and Strength (strong/moderate/weak), and note that you should always check for outliers that could dominate $r$.

Did you get this? True or false: $r = 0$ proves there is no relationship between $X$ and $Y$.

Pearson's correlation coefficient

core concept

We just saw that scatter plots show direction, form, and strength of association by eye. That raises a question: how do we measure the strength of a linear association precisely with a single number? This card answers it → Pearson's $r$ quantifies linear correlation: sign gives direction, $|r|$ gives strength (from 0 = none to 1 = perfect).

Pearson's $r$ quantifies the strength and direction of a linear relationship. Always $-1 \leq r \leq 1$.

$$r = \dfrac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\bigl[n\sum x^2 - (\sum x)^2\bigr]\bigl[n\sum y^2 - (\sum y)^2\bigr]}}$$

Interpreting strength:

$|r|$ range	Descriptor
0.00 – 0.30	Weak or no linear correlation
0.30 – 0.50	Moderate correlation
0.50 – 0.70	Moderate-strong correlation
0.70 – 0.90	Strong correlation
0.90 – 1.00	Very strong correlation

$-1 \leq r \leq 1$

Context matters — in physics $r = 0.9$ might be expected; in psychology $r = 0.5$ could be groundbreaking.
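The two-part reading of $r$ (sign for direction, $|r|$ for strength) can be sketched as a small lookup. The function name `describe_r` is hypothetical, and because the bands in the table share their boundary values, this sketch assumes each boundary belongs to the stronger band.

```python
def describe_r(r):
    """Return (direction, strength) for a Pearson's r value."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie between -1 and 1")

    # Sign of r gives the direction of the linear association.
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"

    # |r| gives the strength; thresholds follow the lesson's table.
    size = abs(r)
    if size < 0.30:
        strength = "weak or no"
    elif size < 0.50:
        strength = "moderate"
    elif size < 0.70:
        strength = "moderate-strong"
    elif size < 0.90:
        strength = "strong"
    else:
        strength = "very strong"
    return direction, strength


print(describe_r(-0.45))  # ('negative', 'moderate')
print(describe_r(0.95))   # ('positive', 'very strong')
```

The descriptors are conventions rather than laws, so the labels returned here should still be weighed against the context of the data.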

$r$ formula: $r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$; Sign of $r$ = direction; $|r|$ = strength

Pause: copy into your book the Pearson $r$ formula and its two interpretations: the sign of $r$ gives direction, and $|r|$ gives the strength of linear association (range: $-1 \leq r \leq 1$).

Quick check: A data set has $r = -0.82$. Which statement best describes the relationship?

Worked examples · reveal step by step

PROBLEM 1 · CALCULATE PEARSON'S r

For 5 students: study hours and test scores are $(2, 55), (4, 62), (5, 70), (6, 75), (8, 88)$. Calculate $r$.

$\sum x = 25,\quad \sum y = 350,\quad n = 5$

Tally the sums first — these feed every other calculation.
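The remaining tallies for Problem 1 feed straight into the computational formula. A minimal sketch of those steps, with illustrative variable names that are not part of the lesson:

```python
import math

# (study hours, test score) for the 5 students in Problem 1
data = [(2, 55), (4, 62), (5, 70), (6, 75), (8, 88)]

n = len(data)
sum_x = sum(x for x, _ in data)        # 25
sum_y = sum(y for _, y in data)        # 350
sum_xy = sum(x * y for x, y in data)   # 1862
sum_x2 = sum(x * x for x, _ in data)   # 145
sum_y2 = sum(y * y for _, y in data)   # 25138

numerator = n * sum_xy - sum_x * sum_y   # 9310 - 8750 = 560
x_term = n * sum_x2 - sum_x ** 2         # 725 - 625 = 100
y_term = n * sum_y2 - sum_y ** 2         # 125690 - 122500 = 3190
r = numerator / math.sqrt(x_term * y_term)

print(round(r, 2))  # 0.99
```

A value this close to 1 indicates a very strong positive linear association between study hours and test score for these five students.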

Coefficient of determination $r^2$

core concept

We just saw that $r$ measures the strength of the linear relationship between $x$ and $y$. That raises a question: what does it actually mean to say the relationship is "strong" — how much of the variation in $y$ does $x$ actually explain? This card answers it → $r^2$ is the proportion of variation in $y$ explained by the linear relationship with $x$.

$r^2$ tells you the proportion of variation in $y$ explained by the linear relationship with $x$.

$$r^2 = (\text{Pearson's } r)^2$$

$r = 0.8$

$r^2 = 0.64$ → 64% of variation in $y$ explained by $x$

$r = -0.9$

$r^2 = 0.81$ → 81% explained (sign does not affect $r^2$)

$r = 0.3$

$r^2 = 0.09$ → only 9% explained — 91% is from other factors

$r^2$ = proportion of variation in $y$ explained by the linear relationship with $x$; Remaining $(1 - r^2)$ is due to other factors, noise, or non-linear effects
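One way to make "proportion of variation explained" concrete is to fit a least-squares line and compare the leftover (residual) variation with the total variation in $y$. The sketch below reuses the Problem 1 data; the least-squares line is an assumption brought in for illustration, and the variable names are hypothetical.

```python
import math

xs = [2, 4, 5, 6, 8]        # study hours (Problem 1 data)
ys = [55, 62, 70, 75, 88]   # test scores

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares line y = intercept + slope * x
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Variation in y left over after using the line to predict y
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
explained = 1 - ss_residual / s_yy

print(round(r ** 2, 3), round(explained, 3))  # both 0.983
```

Here about 98% of the variation in test scores is accounted for by the linear relationship with study hours, and the remaining 2% or so is the scatter of points around the line.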

Pause: copy into your book the interpretation: $r^2$ = proportion of variation in $y$ explained by the linear relationship with $x$; $(1 - r^2)$ = proportion due to other factors or noise.

Fill in the blank: If $r = 0.7$, then $r^2 = $ ___, meaning ___% of variation in $y$ is explained by $x$.

Correlation does NOT imply causation

most exam-tested idea

A strong correlation between $X$ and $Y$ does not mean $X$ causes $Y$. Setting aside pure coincidence, there are three possible explanations:

Explanation 1

$X$ causes $Y$

Smoking causes lung cancer, but this was established through decades of converging evidence (large cohort studies, dose-response patterns, and laboratory experiments), not correlation alone.

Explanation 2

$Y$ causes $X$ (reverse)

More police in high-crime areas — does police presence cause crime, or do high-crime areas attract more police?

Explanation 3

A third variable $Z$ causes both

Hot weather causes both ice cream sales AND drowning. $Z$ = heat, not ice cream → drowning.
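Explanation 3 can be illustrated with a small simulation in which $X$ and $Y$ never influence each other and both depend only on $Z$. All numbers below (temperature range, slopes, noise levels, seed) are invented for the sketch, and `residuals` is a hypothetical helper that removes the linear effect of $Z$ from a variable.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r via the computational (raw sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def residuals(ys, zs):
    """Remove the least-squares linear effect of zs from ys."""
    n = len(ys)
    z_bar = sum(zs) / n
    y_bar = sum(ys) / n
    slope = (sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys))
             / sum((z - z_bar) ** 2 for z in zs))
    return [(y - y_bar) - slope * (z - z_bar) for z, y in zip(zs, ys)]


rng = random.Random(42)  # fixed seed so the run is repeatable
n_days = 2000

# Z: daily temperature. X and Y each depend on Z plus their own noise.
temperature = [rng.uniform(10, 40) for _ in range(n_days)]
ice_cream = [5 * t + rng.gauss(0, 20) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

# Raw correlation between X and Y: strongly positive.
r_raw = pearson_r(ice_cream, drownings)

# Correlation after removing temperature's effect from both: close to 0.
r_adjusted = pearson_r(residuals(ice_cream, temperature),
                       residuals(drownings, temperature))

print(round(r_raw, 2), round(r_adjusted, 2))
```

Once temperature is accounted for, the association between the two variables all but disappears, which is the signature of a confounding variable rather than a causal link.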

Nicolas Cage and pool drownings. From 1999–2009, the number of people who drowned in swimming pools was noticeably correlated ($r \approx 0.67$) with the number of films Nicolas Cage appeared in. This is pure coincidence: the two series happened to rise and fall together over the same period, and nothing links them. A correlation, however strong or statistically significant, needs a plausible mechanism before it can support a causal claim.

When NOT to trust Pearson's $r$

We just saw that $r^2$ tells us how much of the variation in $y$ is explained by $x$, but a high $r$ does not mean $x$ causes $y$. That raises a question: in what specific situations does $r$ give a misleading picture of the relationship? This card answers it → three failure cases: $r = 0$ can still hide a strong non-linear pattern; outliers can inflate or deflate $r$; a restricted range of $x$ values can make $r$ understate the relationship.

$r$ is powerful but has critical limitations — examiners love testing these:

Trap 01

Non-linear relationships

$r$ measures linear association only. A perfect parabola can have $r = 0$. Always look at the scatter plot first.

Trap 02

Outliers

A single extreme point can inflate or deflate $r$ dramatically. Check the scatter plot for outliers before interpreting $r$.

Trap 03

Restricted range

If you sample only a narrow range of $x$ values, $r$ tends to underestimate the strength of the relationship across the full range, e.g. measuring temperature vs ice cream sales only in December–February.

Always: (1) Draw a scatter plot first. (2) Check for outliers. (3) Consider whether a linear model is appropriate before calculating or interpreting $r$.
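Traps 01 and 03 can both be checked numerically. The sketch below uses a parabola on a domain symmetric about $x = 0$, then simulated temperature and sales data sampled over a full range and over a narrow slice. The simulated relationship, noise level, seed, and cut-off temperatures are all assumptions made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r via the computational (raw sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Trap 01: y is completely determined by x, yet r is zero.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_parabola = pearson_r(xs, ys)
print(r_parabola)  # 0.0

# Trap 03: the same linear relationship, seen over two ranges of x.
rng = random.Random(1)  # fixed seed so the run is repeatable
temps = [rng.uniform(0, 40) for _ in range(5000)]
sales = [2 * t + rng.gauss(0, 10) for t in temps]

r_full = pearson_r(temps, sales)

# Keep only days in a narrow temperature band (hypothetical 25 to 30 degrees).
narrow = [(t, s) for t, s in zip(temps, sales) if 25 <= t <= 30]
r_narrow = pearson_r([t for t, _ in narrow], [s for _, s in narrow])

# r_narrow comes out far smaller than r_full under these assumptions.
print(round(r_full, 2), round(r_narrow, 2))
```

In the restricted sample the underlying relationship has not changed; there is simply too little spread in $x$ for the trend to stand out from the noise.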

Three possible explanations for a correlation: $X \to Y$, $Y \to X$ (reverse), or a confounding $Z \to$ both; $r = 0$ means no linear relationship, not no relationship at all

Pause: copy into your book the three possible explanations for a correlation ($X \to Y$; $Y \to X$ (reverse causation); confounding variable $Z$ causes both) and the rule that $r = 0$ means no LINEAR relationship, not no relationship at all.

Match each scenario to the correct explanation for the correlation.

Shoe size and reading ability in children (both increase with age)

Ice cream and drowning (both increase in summer)

More police in high-crime areas

Confounding variable (age)

Confounding variable (hot weather)

Reverse causation

Activities

Activity 1 — Calculate

Calculate $r$ for: $(1, 2), (2, 4), (3, 6), (4, 8)$. What does this value indicate? What is $r^2$ and what does it mean?

Calculate $r$ for: $(1, 8), (2, 5), (3, 4), (4, 3), (5, 2)$. Describe the relationship and find $r^2$.

Calculate $r$ for: $(1, 1), (2, 4), (3, 9), (4, 16)$. Why is $r$ not close to 0 despite the relationship being curved?

A study finds $r = 0.3$ between hours of TV watched and exam scores. What is $r^2$? Is this practically significant?

For a data set with $r = 0.95$, describe the scatter plot you would expect to see. What percentage of variation is explained?

Activity 2 — Analyse and connect

Studies show a strong correlation between chocolate consumption per capita and Nobel Prize winners per capita. Propose at least two explanations other than "chocolate makes you smarter."

A pharmaceutical company reports $r = 0.8$ between their supplement and weight loss in an observational study. Why can they not claim the supplement causes weight loss?

Explain why $r$ is inappropriate for measuring the relationship between temperature and ice cream sales if you only collect data from December to February.

Revisit your thinking

Earlier you wrote about ice cream and drowning. Ice cream does not cause drowning. The correlation is spurious — both are driven by a confounding variable: hot weather. When temperatures rise, more people buy ice cream and more people swim, increasing drowning risk. The lesson: correlation measures association, not mechanism. To establish causation you need controlled experiments, temporal precedence (cause before effect), and a plausible mechanism. Use the ice cream example as your mental alarm bell every time a headline says "X linked to Y."

Short answer

Apply · Band 4 · 3 marks

Q1. A researcher records hours of sleep ($x$) and test scores ($y$) for 6 students:

Sleep (h)	5	6	7	7	8	9
Test score	55	62	68	70	78	85

(a) Calculate Pearson's correlation coefficient $r$. (b) Describe the direction, form, and strength of the relationship. (c) Predict the test score of a student who sleeps 6.5 hours, explaining whether this is interpolation or extrapolation. (3 marks)

Apply · Band 4 · 3 marks

Q2. A scatter plot of monthly data shows a strong positive correlation ($r = 0.92$) between umbrella sales and traffic accidents. (a) Describe what the scatter plot would look like. (b) Explain why this correlation does not mean umbrella sales cause traffic accidents. (c) Identify a likely confounding variable and explain how it produces this correlation. (d) Describe an experimental design that could test whether umbrellas affect road safety. (3 marks)

Analyse · Band 5 · 3 marks

Q3. A pharmaceutical company funds a study finding $r = 0.75$ between their vitamin supplement and reduced cold symptoms (observational, not randomised). (a) What percentage of variation in cold symptoms is explained by the supplement? (b) A journalist writes: "New study proves vitamin supplement prevents colds." Analyse three critical flaws in this headline. (c) The company later runs a randomised controlled trial ($p < 0.05$, 15% fewer colds in treatment group). Explain why this is stronger evidence but describe one remaining limitation. (3 marks)

Comprehensive answers

Activity 1:

1. $r = 1$ — perfect positive linear correlation ($y = 2x$). $r^2 = 1$ = 100% variation explained.

2. $r \approx -0.96$: very strong negative linear correlation. $r^2 \approx 0.92$, so about 92% explained.

3. $r \approx 0.98$: high because over this small, positive-only domain $y = x^2$ is nearly linear. For a parabola on a domain symmetric about $x = 0$, $r = 0$.

4. $r^2 = 0.09$ — only 9% of variation explained. Practically, TV watching is a weak predictor.

5. Points cluster tightly around an upward straight line. $r^2 = 0.9025$ → 90.25% explained.

Activity 2:

1. (1) Wealthier nations eat more chocolate AND invest more in research. (2) Both are associated with European cultural traditions. (3) The correlation is driven by a few outlier countries.

2. Observational studies cannot control confounders (diet, exercise). People who take supplements may already be health-conscious. No randomisation → no causation.

3. Restricted range — Dec–Feb captures only warm temperatures, giving a misleadingly narrow picture; the full year may show a non-linear seasonal pattern.

Q1 (3 marks): (a) $\Sigma x = 42$, $\Sigma y = 418$, $\Sigma x^2 = 304$, $\Sigma y^2 = 29{,}702$, $\Sigma xy = 3{,}002$, $n = 6$. $r = \frac{6(3002) - (42)(418)}{\sqrt{[6(304)-1764][6(29702)-174724]}} = \frac{456}{\sqrt{60 \times 3488}} \approx 0.997$ [1]. (b) Positive, approximately linear, very strong [0.5]. (c) Interpolation (6.5 within range 5–9), predicted ≈ 65–66 [0.5+0.5].

Q2 (3 marks): (a) Points rise left-to-right, tightly clustered around an upward trend [0.5]. (b) No mechanism — umbrellas do not affect driving [0.5]. (c) Confounding: rainy weather → more umbrella sales AND more accidents [1]. (d) Control for weather in analysis; or randomised driving trial [0.5+0.5].

Q3 (3 marks): (a) $r^2 = 0.5625$ → 56.25% explained [0.5]. (b) (1) "Proves" — correlation cannot prove causation. (2) "Prevents" — observational; health-conscious people may take supplements AND have better immunity (confounding). (3) "New study" — single studies can be flukes; replication needed [1.5]. (c) RCT balances confounders by randomisation; placebo controls expectation. Limitation: may not generalise to real-world conditions or long timeframes; $p < 0.05 \neq$ large effect [0.5+0.5].
