Mathematics Advanced • Year 12 • Module 5 • Lesson 9

Bivariate Data Analysis

Apply scatter plots, Pearson's r and the correlation-vs-causation distinction to real contexts — exam study, ice cream sales, chocolate & Nobel prizes, medical claims and restricted-range sampling.

Apply · Problem Set

Problem 1 — Study hours vs test score (computing and interpreting r)

Six Year 12 students record their study hours x and the corresponding trial-exam score y:

Student    S1   S2   S3   S4   S5   S6
Hours x     2    3    5    6    8    9
Score y    52   58   66   70   82   88

Set up: What are we solving for?

(i) Calculate Pearson's r using the computational formula (show n and the sums Σx, Σy, Σx², Σy², Σxy).   3 marks

(ii) Calculate r² and write a one-sentence interpretation of its value in context.   2 marks

(iii) A parent reads the result and concludes "more study causes higher exam marks". Reply in one sentence, drawing on the lesson's correlation-vs-causation principle.   2 marks

Stuck on (i)? Build the table of x, y, x², y², xy row by row before substituting.

Problem 2 — Ice cream sales and drowning (confounding)

A council records, for 12 randomly chosen days during the year, the number of ice-creams sold at the local pool kiosk (x) and the number of swim-related rescues that day (y). It finds r = +0.92.

Set up: What are we solving for?

(i) Interpret r = +0.92 in context, stating its strength and direction in one phrase.   1 mark

(ii) A local newspaper headline reads "Council blames ice-cream for rise in pool rescues!" Explain in 2 sentences why the headline is statistically wrong, naming the most likely confounding variable.   2 marks

(iii) Describe an experimental or observational refinement to the data collection that would help separate the effect of ice-cream sales from the effect of the confounding variable.   2 marks

Stuck? Revisit lesson § Causation vs Correlation and § Real-World Anchor — Nicolas Cage.

Problem 3 — Chocolate consumption vs Nobel laureates (spurious correlation)

A widely-cited study reports r = 0.79 between chocolate consumption per capita and Nobel laureates per capita across 23 countries.

Set up: What are we solving for?

(i) Compute r² and state in one phrase what proportion of variation in Nobel laureates per capita the linear model "explains".   1 mark

(ii) Propose three alternative explanations (other than "chocolate makes you smarter") for the strong correlation. For each, briefly say whether it is a confounder, reverse causation, or coincidence.   3 marks

(iii) Drawing on the Nicolas Cage Real-World Anchor in the lesson, complete the sentence: "Statistical significance must be paired with ____________________ before we infer causation."   1 mark

Problem 4 — Supplement and weight loss (observational study)

A pharmaceutical company observes 200 customers who took its weight-loss supplement and reports a correlation of r = 0.80 between dose taken (mg/day) and kilograms of weight lost over 12 weeks.

Set up: What are we solving for?

(i) Interpret r = 0.80 (strength, direction) and compute r² as a percentage.   2 marks

(ii) Identify two reasons why the company should not, on this evidence alone, claim that the supplement causes weight loss.   2 marks

(iii) Outline (in 2-3 lines) the study design that would let the company support a causal claim. Include the words "randomised", "controlled", and "placebo".   3 marks

Problem 5 — Restricted-range sampling (temperature & ice cream)

A student measures daily ice-cream sales y and daily maximum temperature x over just the summer months December–February in Sydney, when x is almost always between 25°C and 35°C.

Set up: What are we solving for?

(i) The student calculates r = 0.20 and concludes "temperature and ice-cream sales are barely related." Explain in 2 sentences why this conclusion is misleading, naming the restricted-range issue from the lesson.   2 marks

(ii) Suggest a data-collection change that would let the student compute a more meaningful r — and predict (with one sentence justification) whether the new r would be larger or smaller than 0.20.   2 marks

(iii) Briefly explain why always inspecting the scatter plot would have helped the student spot the problem before trusting r.   1 mark

Stuck? Revisit lesson § Limitations of r — restricted range.

How did this worksheet feel?

What I'll revisit before next class:

Answers — Do not peek before attempting

Problem 1 — Study hours vs test score

Set up. We are quantifying the linear association between study hours and exam scores, interpreting r and r², and pushing back on a causal interpretation.

(i) Σx = 2+3+5+6+8+9 = 33, Σy = 52+58+66+70+82+88 = 416, n = 6.
Σx² = 4+9+25+36+64+81 = 219.
Σy² = 2704+3364+4356+4900+6724+7744 = 29,792.
Σxy = 104+174+330+420+656+792 = 2476.
Numerator = 6(2476) − 33(416) = 14,856 − 13,728 = 1128.
n·Σx² − (Σx)² = 1314 − 1089 = 225.
n·Σy² − (Σy)² = 178,752 − 173,056 = 5696.
r = 1128 / √(225 × 5696) = 1128 / √1,281,600 = 1128 / 1132.07 ≈ 0.996.
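The arithmetic above can be checked with a short script. This is a minimal sketch using only the Problem 1 data and the computational formula; the variable names are illustrative choices, not part of the lesson.

```python
from math import sqrt

# Data from the Problem 1 table
hours = [2, 3, 5, 6, 8, 9]
scores = [52, 58, 66, 70, 82, 88]

# The six quantities the question asks students to show
n = len(hours)
sum_x = sum(hours)
sum_y = sum(scores)
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in scores)
sum_xy = sum(x * y for x, y in zip(hours, scores))

# Computational formula for Pearson's r
numerator = n * sum_xy - sum_x * sum_y
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_y, sum_x2, sum_y2, sum_xy)  # 33 416 219 29792 2476
print(round(r, 3))                           # 0.996
```

Building the sums separately mirrors the row-by-row table suggested in the hint, so each intermediate value can be compared with the working by hand.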

(ii) r² ≈ 0.993 → about 99% of the variation in trial-exam scores in this sample is explained by the linear relationship with study hours.
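For reference, the computational formula used in (i) and the exact fraction behind r² can be written out as follows. The numbers are taken directly from the working in (i).

```latex
\[
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)
                \left(n\sum y^2 - \left(\sum y\right)^2\right)}}
  = \frac{1128}{\sqrt{225 \times 5696}}
\]
\[
r^2 = \frac{1128^2}{225 \times 5696}
    = \frac{1\,272\,384}{1\,281\,600}
    \approx 0.9928
\]
```

Rounded to three decimal places this gives 0.993, which matches the value quoted in (ii).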

(iii) The data show a very strong positive association, but not causation. Correlation does not imply causation: confounders (motivation, sleep, prior ability) could drive both more study and higher scores, so a controlled study would be needed to support the "more study causes higher marks" claim.

Problem 2 — Ice cream sales and drowning

Set up. We are diagnosing a textbook confounding-variable example and proposing a refinement to control for it.

(i) r = +0.92 indicates a very strong positive linear association between ice-cream sales and pool rescues — days with high ice-cream sales tend to be days with more rescues.

(ii) The headline confuses correlation with causation. The most likely confounding variable is hot weather: on hot days more people buy ice-cream and more people swim, which lifts the rescue count — neither variable causes the other.

(iii) Record the daily maximum temperature alongside x and y, then compute the partial correlation between sales and rescues at fixed temperature (or stratify the data by temperature bands and compute r within each band). If the within-band correlations are near zero, that supports the confounding-by-temperature explanation.
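The refinement in (iii) can be illustrated with simulated data. The sketch below is hypothetical throughout: the temperatures, sales and rescue counts are randomly generated (not council data), and both variables are built to depend on temperature only. It uses the standard first-order partial correlation formula, r_xy.z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)).

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def partial_r(xs, ys, zs):
    """Correlation of x and y after controlling for z."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Simulated (made-up) days: temperature drives BOTH variables,
# and sales have no direct effect on rescues.
random.seed(1)
temps = [random.uniform(15, 40) for _ in range(2000)]
sales = [10 * t + random.gauss(0, 20) for t in temps]
rescues = [0.3 * t + random.gauss(0, 1.5) for t in temps]

raw_r = pearson_r(sales, rescues)
adjusted_r = partial_r(sales, rescues, temps)
print(round(raw_r, 2), round(adjusted_r, 2))
```

In this simulation the raw correlation is strong while the partial correlation is close to zero, which is the pattern that would support the confounding-by-temperature explanation.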

Problem 3 — Chocolate and Nobel prizes

Set up. We are critiquing a widely cited correlation, generating alternative explanations and naming the type (confounder / reverse causation / coincidence) for each.

(i) r² = 0.624 → the linear model "explains" about 62% of variation in Nobel laureates per capita — though "explains" is in scare quotes because no causal mechanism is established.

(ii) Three plausible alternatives:
1. Confounder — national wealth (GDP per capita): richer countries can afford both more chocolate and more world-class universities/research funding.
2. Confounder — Western European concentration: the 23 countries in the study are heavily Western European, where both chocolate consumption and historic Nobel performance are high for unrelated cultural reasons.
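3. Coincidence: the cohort of Nobel laureates each year is tiny and only 23 countries are compared, so chance variation alone can produce spurious patterns at this level of aggregation. (Reverse causation, with Nobel success somehow raising chocolate consumption, is conceivable but far less plausible.)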

(iii) "Statistical significance must be paired with a plausible mechanism (ideally supported by controlled experiments) before we infer causation."

Problem 4 — Supplement and weight loss

Set up. We are distinguishing observational evidence from experimental evidence and naming the design that does support causal claims.

(i) r = 0.80 → strong positive linear association between dose and kg lost. r² = 0.64 → 64% of variation in weight lost is explained linearly by dose in this sample.

(ii) Two reasons: (a) The study is observational, so customers who chose to take a higher dose may differ systematically from those who took less (motivation, diet, baseline weight); these are confounders. (b) There is no placebo or control group, so the placebo effect and regression to the mean cannot be ruled out, and reverse causation (people losing weight for other reasons may then take more supplement) remains a plausible explanation.

(iii) A randomised controlled trial: recruit a large sample, randomly assign participants to a supplement group and a placebo group (preferably double-blind), track weight loss over the same 12 weeks, and compare. Randomisation balances confounders across groups; the placebo controls for the placebo effect; the controlled design isolates the supplement's effect.
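The random-assignment step of such a trial can be sketched in a few lines. This is an illustration only: the sample size of 200 and the participant IDs are hypothetical, and a real trial would also need blinding and an allocation procedure kept hidden from the researchers who measure weight.

```python
import random

# Hypothetical IDs for 200 recruited volunteers
participant_ids = list(range(1, 201))

# Fixed seed so the allocation can be reproduced for this example
random.seed(42)
random.shuffle(participant_ids)

# First half of the shuffled list gets the supplement, second half the placebo
half = len(participant_ids) // 2
supplement_group = sorted(participant_ids[:half])
placebo_group = sorted(participant_ids[half:])

print(len(supplement_group), len(placebo_group))  # 100 100
```

Because chance alone decides who lands in each group, factors such as motivation, diet and baseline weight should be balanced on average across the two groups.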

Problem 5 — Restricted range

Set up. We are diagnosing why a small r value need not mean "no relationship" when the range of x is artificially narrow.

(i) The student sampled only summer days (x ≈ 25–35°C), so most of the year-round temperature variation that drives ice-cream demand is missing from the data. With a restricted range of x, r tends to understate the strength of the relationship: the small value r ≈ 0.20 reflects the narrow spread of x, not a lack of relationship between temperature and sales.

(ii) Collect data across all four seasons (or at least add winter days, when x is well below 25°C). The new r would very likely be larger than 0.20, because the full range of x reveals a clearer positive trend (cold days have low sales; hot days have high sales).
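The restricted-range effect can be demonstrated with simulated data. Everything in this sketch is hypothetical: the temperatures and sales are randomly generated under an assumed linear relationship with noise, so the r values it prints are illustrative and will not reproduce the student's 0.20.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Simulated (made-up) year of data: sales rise linearly with temperature, plus noise
random.seed(7)
temps = [random.uniform(10, 40) for _ in range(3000)]
sales = [20 * t + random.gauss(0, 150) for t in temps]

# r over the full temperature range
r_full = pearson_r(temps, sales)

# r using only days in the 25-35 degree band, as in the student's sample
summer = [(t, s) for t, s in zip(temps, sales) if 25 <= t <= 35]
r_summer = pearson_r([t for t, _ in summer], [s for _, s in summer])

print(round(r_full, 2), round(r_summer, 2))
```

The underlying relationship is identical in both calculations; only the spread of x differs, and the restricted sample gives the noticeably smaller r.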

(iii) Inspecting the scatter plot first would show every point squeezed into the narrow band 25–35°C on the x-axis, immediately revealing that x has very little variation and warning the student that an r computed from these data says little about the relationship across the full temperature range.