Bivariate Data Analysis
Ice cream sales and drowning deaths are strongly correlated — $r \approx 0.9$. Does ice cream cause drowning? Of course not. Both are driven by hot weather. This is the most dangerous trap in statistics: mistaking correlation for causation. You'll learn to measure, interpret, and critique relationships between two variables.
Ice cream sales and drowning deaths are strongly correlated. Does ice cream cause drowning? Write your gut answer — no peeking ahead.
Two formulas. Lock them down before the worked examples.
Key facts
- $r = \frac{\sum(x - \bar{x})(y - \bar{y})}{(n-1)s_x s_y}$; range $-1 \leq r \leq 1$
- Sign = direction; magnitude = strength
- Correlation does not imply causation
Concepts
- $r$ measures linear relationships only
- Outliers can dramatically affect $r$
- Confounding variables create spurious correlations
Skills
- Interpret scatter plots for direction, form, and strength
- Calculate $r$ from raw data using the computational formula
- Critique claims of causation from correlational evidence
Before calculating any number, always draw and inspect the scatter plot. Three features to describe:
Direction — positive (uphill left to right), negative (downhill), or no trend.
Form — linear, curved, clustered, or no pattern.
Strength — how tightly points follow the form: strong (tight), moderate, or weak (scattered).
Also note any outliers — points that deviate substantially from the overall pattern. A single outlier can dramatically inflate or deflate $r$.
Critical limitation: Two data sets can have identical $r$ values but very different scatter plots. Always look before you calculate.
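To see how much a single point can move $r$, here is a minimal Python sketch. The data are made up for illustration, and `pearson_r` is a hypothetical helper (not from any library). It divides the sum of co-deviations by $\sqrt{S_{xx} S_{yy}}$, which is algebraically the same as dividing by $(n-1)s_x s_y$.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r: sum of co-deviations over the root of the squared-deviation sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up data: five points with almost no linear trend.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]
r_without = pearson_r(xs, ys)

# Same data plus one extreme point far from the rest.
r_with = pearson_r(xs + [20], ys + [20])

print(round(r_without, 2))  # 0.14
print(round(r_with, 2))     # 0.97
```

On this made-up data, one extreme point turns a weak correlation into a very strong one, which is why the scatter plot has to be inspected first.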
Scatter plot features: Direction (positive/negative/none), Form (linear/curved), Strength (strong/moderate/weak); Always check for outliers — a single point can dominate $r$
Pause: copy into your book the three scatter plot features to describe, Direction (positive/negative/none), Form (linear/curved), Strength (strong/moderate/weak), plus the reminder to always check for outliers that could dominate $r$.
Did you get this? True or false: $r = 0$ proves there is no relationship between $X$ and $Y$.
We just saw that scatter plots show direction, form, and strength of association by eye. That raises a question: how do we measure the strength of a linear association precisely with a single number? This card answers it → Pearson's $r$ quantifies linear correlation: sign gives direction, $|r|$ gives strength (from 0 = none to 1 = perfect).
Pearson's $r$ quantifies the strength and direction of a linear relationship. Always $-1 \leq r \leq 1$.
Interpreting strength:
| $\lvert r \rvert$ range | Descriptor |
|---|---|
| 0.00 – 0.30 | Weak or no linear correlation |
| 0.30 – 0.50 | Moderate correlation |
| 0.50 – 0.70 | Moderate-strong correlation |
| 0.70 – 0.90 | Strong correlation |
| 0.90 – 1.00 | Very strong correlation |
Context matters — in physics $r = 0.9$ might be expected; in psychology $r = 0.5$ could be groundbreaking.
$r$ formula: $r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$; Sign of $r$ = direction; $|r|$ = strength
Pause: copy into your book the Pearson $r$ formula and its two interpretations: sign of $r$ = direction, $|r|$ = strength of linear association (range: $-1 \leq r \leq 1$).
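The computational formula and the definitional formula $r = \frac{\sum(x - \bar{x})(y - \bar{y})}{(n-1)s_x s_y}$ should give the same value. The following Python sketch checks this on a small made-up data set; the function names `r_definitional` and `r_computational` are illustrative, not part of any library.

```python
import math
import statistics


def r_definitional(xs, ys):
    """r from deviations about the means, divided by (n - 1) * s_x * s_y."""
    n = len(xs)
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample standard deviations
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return co_dev / ((n - 1) * s_x * s_y)


def r_computational(xs, ys):
    """r from raw sums only, matching the computational formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

print(round(r_computational(xs, ys), 4))  # 0.9
print(round(r_definitional(xs, ys), 4))   # 0.9
```

The computational form needs only five running sums, which is why it suits hand calculation from a table.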
Quick check: A data set has $r = -0.82$. Which statement best describes the relationship?
Worked examples
For 5 students: study hours and test scores are $(2, 55), (4, 62), (5, 70), (6, 75), (8, 88)$. Calculate $r$.
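One way to work this through with the computational formula, taking $n = 5$ and the sums from the five pairs above:

$$\sum x = 25,\quad \sum y = 350,\quad \sum xy = 1{,}862,\quad \sum x^2 = 145,\quad \sum y^2 = 25{,}138$$

$$r = \frac{5(1862) - (25)(350)}{\sqrt{[5(145) - 25^2][5(25138) - 350^2]}} = \frac{560}{\sqrt{100 \times 3190}} \approx 0.991$$

So study hours and test scores show a very strong positive linear correlation in this sample.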
We just saw that $r$ measures the strength of the linear relationship between $x$ and $y$. That raises a question: what does it actually mean to say the relationship is "strong" — how much of the variation in $y$ does $x$ actually explain? This card answers it → $r^2$ is the proportion of variation in $y$ explained by the linear relationship with $x$.
$r^2$ tells you the proportion of variation in $y$ explained by the linear relationship with $x$.
$r^2$ = proportion of variation in $y$ explained by the linear relationship with $x$; Remaining $(1 - r^2)$ is due to other factors, noise, or non-linear effects
Pause: copy into your book the interpretation: $r^2$ = proportion of variation in $y$ explained by the linear relationship with $x$; $(1 - r^2)$ = proportion due to other factors or noise.
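The phrase "proportion of variation explained" can be checked numerically. The sketch below assumes the least-squares line of best fit (a tool this section does not otherwise define) and uses made-up data. It compares $r^2$ with one minus the ratio of leftover variation to total variation in $y$.

```python
import math

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
s_yy = sum((y - mean_y) ** 2 for y in ys)  # total variation in y

r = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares line: y_hat = intercept + slope * x (assumed fitting method).
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Variation in y left over after the line has done its best.
unexplained = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
explained_fraction = 1 - unexplained / s_yy

print(round(r ** 2, 4), round(explained_fraction, 4))  # 0.81 0.81
```

The two numbers agree, which is the sense in which $r^2$ measures variation explained by the linear relationship.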
Fill in the blank: If $r = 0.7$, then $r^2 = $ ___, meaning ___% of variation in $y$ is explained by $x$.
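A strong correlation between $X$ and $Y$ does not mean $X$ causes $Y$. Three explanations are always possible: $X$ causes $Y$; $Y$ causes $X$ (reverse causation); or a confounding variable $Z$ causes both.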
We just saw that $r^2$ tells us how much of the variation in $y$ is explained by $x$ — but a high $r$ does not mean $x$ causes $y$. That raises a question: in what specific situations does $r$ give a misleading picture of the relationship? This card answers it → three failure cases: $r = 0$ can still hide a strong non-linear pattern; outliers can inflate or deflate $r$; causation requires more than correlation.
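$r$ is powerful but has critical limitations, and examiners love testing these: $r = 0$ can still hide a strong non-linear pattern; outliers can inflate or deflate $r$; and a high $r$ alone does not establish causation.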
Three reasons correlation ≠ causation: $X \to Y$, $Y \to X$ (reverse), or confounding $Z \to$ both; $r = 0$ means no linear relationship, not no relationship at all
Pause: copy into your book the three possible explanations for a correlation ($X \to Y$; $Y \to X$ (reverse causation); confounding variable $Z$ causes both) and the rule that $r = 0$ means no LINEAR relationship, not no relationship at all.
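Two of these failure cases can be simulated. The Python sketch below is an illustration under stated assumptions: the temperature, ice cream, and drowning figures are randomly generated rather than real data, and `pearson_r` and `residuals` are hypothetical helpers. The first part shows a perfect curved relationship with $r = 0$; the second shows a confounder $Z$ (temperature) producing a strong correlation that largely disappears once $Z$ is accounted for.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


def residuals(ys, xs):
    """What is left of ys after removing its least-squares linear fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


# Case 1: y = x^2 sampled symmetrically about 0. Perfect pattern, no LINEAR trend.
parabola_x = [-3, -2, -1, 0, 1, 2, 3]
parabola_y = [x ** 2 for x in parabola_x]
r_parabola = pearson_r(parabola_x, parabola_y)

# Case 2: simulated confounding. Temperature drives both variables;
# neither variable appears in the formula for the other.
rng = random.Random(0)  # fixed seed so the run is repeatable
temps = [rng.uniform(15, 35) for _ in range(200)]
ice_cream = [10 * t + rng.gauss(0, 20) for t in temps]
drownings = [0.5 * t + rng.gauss(0, 1.5) for t in temps]

r_raw = pearson_r(ice_cream, drownings)
# Correlate what remains of each variable once temperature is removed.
r_adjusted = pearson_r(residuals(ice_cream, temps), residuals(drownings, temps))

print("parabola r:", round(abs(r_parabola), 4))
print("ice cream vs drownings:", round(r_raw, 2))
print("after removing temperature:", round(r_adjusted, 2))
```

In the simulation, ice cream sales and drownings are strongly correlated even though the code never links them directly. Removing the part of each variable explained by temperature leaves a correlation near zero.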
Match each scenario to the correct explanation for the correlation.
Activities
Calculate $r$ for: $(1, 2), (2, 4), (3, 6), (4, 8)$. What does this value indicate? What is $r^2$ and what does it mean?
Calculate $r$ for: $(1, 8), (2, 5), (3, 4), (4, 3), (5, 2)$. Describe the relationship and find $r^2$.
Calculate $r$ for: $(1, 1), (2, 4), (3, 9), (4, 16)$. Why is $r$ not close to 0 despite the relationship being curved?
A study finds $r = 0.3$ between hours of TV watched and exam scores. What is $r^2$? Is this practically significant?
For a data set with $r = 0.95$, describe the scatter plot you would expect to see. What percentage of variation is explained?
Studies show a strong correlation between chocolate consumption per capita and Nobel Prize winners per capita. Propose at least two explanations other than "chocolate makes you smarter."
A pharmaceutical company reports $r = 0.8$ between their supplement and weight loss in an observational study. Why can they not claim the supplement causes weight loss?
Explain why $r$ is inappropriate for measuring the relationship between temperature and ice cream sales if you only collect data from December to February.
Earlier you wrote about ice cream and drowning. Ice cream does not cause drowning. The correlation is spurious — both are driven by a confounding variable: hot weather. When temperatures rise, more people buy ice cream and more people swim, increasing drowning risk. The lesson: correlation measures association, not mechanism. To establish causation you need controlled experiments, temporal precedence (cause before effect), and a plausible mechanism. Use the ice cream example as your mental alarm bell every time a headline says "X linked to Y."
Q1. A researcher records hours of sleep ($x$) and test scores ($y$) for 6 students:
| Sleep (h) | 5 | 6 | 7 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|
| Test score | 55 | 62 | 68 | 70 | 78 | 85 |
(a) Calculate Pearson's correlation coefficient $r$. (b) Describe the direction, form, and strength of the relationship. (c) Predict the test score of a student who sleeps 6.5 hours, explaining whether this is interpolation or extrapolation. (3 marks)
Q2. A scatter plot of monthly data shows a strong positive correlation ($r = 0.92$) between umbrella sales and traffic accidents. (a) Describe what the scatter plot would look like. (b) Explain why this correlation does not mean umbrella sales cause traffic accidents. (c) Identify a likely confounding variable and explain how it produces this correlation. (d) Describe an experimental design that could test whether umbrellas affect road safety. (3 marks)
Q3. A pharmaceutical company funds a study finding $r = 0.75$ between their vitamin supplement and reduced cold symptoms (observational, not randomised). (a) What percentage of variation in cold symptoms is explained by the supplement? (b) A journalist writes: "New study proves vitamin supplement prevents colds." Analyse three critical flaws in this headline. (c) The company later runs a randomised controlled trial ($p < 0.05$, 15% fewer colds in treatment group). Explain why this is stronger evidence but describe one remaining limitation. (3 marks)
Comprehensive answers
Activity 1:
1. $r = 1$ — perfect positive linear correlation ($y = 2x$). $r^2 = 1$ = 100% variation explained.
2. $r \approx -0.96$: very strong negative linear correlation. $r^2 \approx 0.92$, so about 92% explained.
3. $r \approx 0.98$: high because over this small domain $y = x^2$ is increasing throughout and nearly linear. For a parabola sampled symmetrically about $x = 0$, $r = 0$.
4. $r^2 = 0.09$ — only 9% of variation explained. Practically, TV watching is a weak predictor.
5. Points cluster tightly around an upward straight line. $r^2 = 0.9025$ → 90.25% explained.
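The first three answers can be checked with a short Python sketch. The helper `pearson_r` is illustrative (not a library function) and applies the computational formula to each activity data set.

```python
import math


def pearson_r(points):
    """Pearson's r from a list of (x, y) pairs, using the computational formula."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Data sets from Activity 1, questions 1 to 3.
activity_data = {
    1: [(1, 2), (2, 4), (3, 6), (4, 8)],
    2: [(1, 8), (2, 5), (3, 4), (4, 3), (5, 2)],
    3: [(1, 1), (2, 4), (3, 9), (4, 16)],
}

results = {q: pearson_r(points) for q, points in activity_data.items()}

for q, r in results.items():
    # Print r and r squared to two decimal places for comparison with the answers.
    print(f"Q{q}: r = {r:.2f}, r^2 = {r * r:.2f}")
```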
Activity 2:
1. (1) Wealthier nations eat more chocolate AND invest more in research. (2) Both are associated with European cultural traditions. (3) The correlation is driven by a few outlier countries.
2. Observational studies cannot control confounders (diet, exercise). People who take supplements may already be health-conscious. No randomisation → no causation.
3. Restricted range: December to February covers only warm temperatures, so the data show a narrow slice of the relationship. A restricted range of $x$ typically weakens $r$, and the full year may show a non-linear seasonal pattern.
Q1 (3 marks): (a) $\Sigma x = 42$, $\Sigma y = 418$, $\Sigma x^2 = 304$, $\Sigma y^2 = 29{,}702$, $\Sigma xy = 3{,}002$, $n = 6$. $r = \frac{6(3002) - (42)(418)}{\sqrt{[6(304)-1764][6(29702)-174724]}} = \frac{456}{\sqrt{60 \times 3488}} \approx 0.997$ [1]. (b) Positive, approximately linear, very strong [0.5]. (c) Interpolation (6.5 within range 5–9), predicted ≈ 65–66 [0.5+0.5].
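The arithmetic for Q1 can be checked with a short Python sketch. The prediction step assumes a least-squares line of best fit; the answer above does not name the method, so other reasonable approaches (such as reading from a hand-drawn line) may give slightly different values.

```python
import math

# Data from the Q1 table.
sleep = [5, 6, 7, 7, 8, 9]
score = [55, 62, 68, 70, 78, 85]

n = len(sleep)
sum_x = sum(sleep)
sum_y = sum(score)
sum_x2 = sum(x * x for x in sleep)
sum_y2 = sum(y * y for y in score)
sum_xy = sum(x * y for x, y in zip(sleep, score))

# Pearson's r via the computational formula.
numerator = n * sum_xy - sum_x * sum_y
r = numerator / math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

# Least-squares line (assumed method) for the 6.5-hour prediction.
slope = numerator / (n * sum_x2 - sum_x ** 2)
intercept = (sum_y - slope * sum_x) / n
predicted = intercept + slope * 6.5

print("sums:", sum_x, sum_y, sum_x2, sum_y2, sum_xy)
print("r:", round(r, 3))
print("predicted score at 6.5 h:", round(predicted, 1))
```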
Q2 (3 marks): (a) Points rise left-to-right, tightly clustered around an upward trend [0.5]. (b) No mechanism — umbrellas do not affect driving [0.5]. (c) Confounding: rainy weather → more umbrella sales AND more accidents [1]. (d) Control for weather in analysis; or randomised driving trial [0.5+0.5].
Q3 (3 marks): (a) $r^2 = 0.5625$ → 56.25% explained [0.5]. (b) (1) "Proves" — correlation cannot prove causation. (2) "Prevents" — observational; health-conscious people may take supplements AND have better immunity (confounding). (3) "New study" — single studies can be flukes; replication needed [1.5]. (c) RCT balances confounders by randomisation; placebo controls expectation. Limitation: may not generalise to real-world conditions or long timeframes; $p < 0.05 \neq$ large effect [0.5+0.5].