Mathematics Advanced • Year 12 • Module 5 • Lesson 9

Bivariate Data Analysis

Practise HSC-style writing on scatter plots, Pearson's r and the correlation-vs-causation distinction — with a structured extended response on critiquing a causal claim.

Master · Past-Paper Style

1. Short-answer questions

1.1 For the data (1, 5), (2, 7), (3, 11), (4, 13), (5, 17), compute Pearson's r using the computational formula.    3 marks    Band 3-4

1.2 A study reports r = −0.85 between hours of social media use per day and self-reported sleep hours. (a) State r² and interpret it in context. (b) Explain in one sentence what the negative sign of r means in this context.    3 marks    Band 3

1.3 A researcher sees a strong correlation (r = 0.9) in a sample but, when plotting the scatter, notices a perfect U-shape. (a) Explain in 1-2 sentences why Pearson's r is misleading here. (b) State two of the lesson's five "limitations of r" that would lead you to mistrust this number.    3 marks    Band 4

Stuck on 1.3(b)? Revisit lesson § Limitations of r.

2. Extended response

2.1 A wellness-influencer's blog post claims:

"My new study of 500 Australians shows a correlation of r = 0.72 between daily green-tea consumption (cups) and self-reported happiness score (out of 10). Drinking more green tea makes you happier — clearly the only sensible recommendation is two cups a day."

Write an HSC-style critique of this claim. Your response must:

(a) State and interpret r = 0.72 (strength, direction, r²), being precise about what "explained" means.
(b) Identify and explain at least three alternative explanations for the correlation — covering at least one each of (i) confounding variable, (ii) reverse causation, and (iii) limitations of r (selection bias / measurement / restricted range).
(c) Conclude with a one-paragraph recommendation about what kind of study would support the blogger's causal claim, explicitly referencing the principle "correlation does not imply causation".

   8 marks    Band 5-6

Explicit marking criteria

Part (a) — 2 marks

1 mark — strength (strong) and direction (positive) named, with r² = 0.5184 ≈ 0.52 calculated.

1 mark — interprets r² in context: about 52% of the variation in happiness score is "explained" by the linear relationship with cups of green tea — and flags that "explained" does not mean "caused".

Part (b) — 4 marks

1 mark — confounder: identifies a plausible third variable (e.g. income / education / overall healthy-lifestyle bundle) that drives both green-tea drinking and self-reported happiness.

1 mark — reverse causation: argues that happier people may simply be more likely to take up green-tea rituals, not the other way around.

1 mark — limitation of r / selection bias: notes that the sample is likely self-selected (e.g. blog readers), that happiness is self-reported, and that the range of consumption may be restricted (e.g. only people who drink > 0 cups).

1 mark — explanation quality: each alternative is described in enough detail that the marker can see how it produces the observed r without requiring a green-tea → happiness causal arrow.

Part (c) — 2 marks

1 mark — recommends a randomised controlled trial (or equivalent) with placebo / control group, blinding, and pre-specified outcome measure, in enough detail to count as a real proposal.

1 mark — closes by explicitly invoking the lesson principle "correlation does not imply causation" (or equivalent) and linking it to why a controlled experiment, not an observational correlation, is needed to justify the recommendation "two cups a day".

Your response:

Stuck on (b)? Pick three distinct mechanisms — don't lump "confounder" and "selection bias" together.

How did this worksheet feel?

What I'll revisit before next class:

Answers — sample responses + marking notes

1.1 — r for (1, 5), (2, 7), (3, 11), (4, 13), (5, 17) (3 marks)

Sample response. Σx = 15, Σy = 53, Σx² = 55, Σy² = 653, Σxy = 1(5)+2(7)+3(11)+4(13)+5(17) = 5+14+33+52+85 = 189, n = 5.
Numerator = 5(189) − 15(53) = 945 − 795 = 150.
n·Σx² − (Σx)² = 275 − 225 = 50.
n·Σy² − (Σy)² = 3265 − 2809 = 456.
r = 150 / √(50 × 456) = 150 / √22,800 ≈ 150 / 151.0 ≈ 0.993.

Marking notes. 1 mark — all sums correct. 1 mark — numerator and denominator computed correctly. 1 mark — final r ≈ 0.99 (accept 0.99 – 1.00). Stopping with the right formula but no arithmetic scores 1; getting r > 1 or r < −1 is a hard zero on the last mark (impossible value).
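
The arithmetic in 1.1 can be checked with a short script. This is a minimal sketch of the computational (raw-sums) formula used in the sample response; the function name pearson_r is an illustrative choice, not something defined in the lesson.

```python
from math import sqrt


def pearson_r(points):
    """Pearson's r from raw sums: the computational formula used in 1.1."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_x2 = sum(x * x for x, _ in points)
    sum_y2 = sum(y * y for _, y in points)
    sum_xy = sum(x * y for x, y in points)

    # n*Sxy - Sx*Sy over sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


data = [(1, 5), (2, 7), (3, 11), (4, 13), (5, 17)]
r = pearson_r(data)
print(f"r = {r:.3f}")  # prints "r = 0.993"
```

A value outside the interval from −1 to 1 would signal an arithmetic slip, which is why the marking notes treat it as impossible.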

1.2 — Social media use and sleep (3 marks)

Sample response. (a) r² = 0.7225 ≈ 0.72: about 72% of the variation in self-reported sleep hours is explained by the linear relationship with daily social-media use. (b) The negative sign means more social-media use is associated with fewer sleep hours (and vice versa) — when one goes up, the other tends to go down.

Marking notes. (a) 1 mark — r² value correct; 1 mark — "% of variation in y explained by the linear relationship with x" phrasing in context. (b) 1 mark — clear "as x increases, y tends to decrease" interpretation. Students who write "r negative means there is no relationship" score 0 for (b).
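
As a quick check of 1.2(a), the sketch below squares r and formats the result as a percentage. It also makes visible that squaring discards the sign, so the direction in part (b) has to be read from r itself. The variable names are illustrative.

```python
r = -0.85
r_squared = r ** 2  # squaring removes the sign, so direction must come from r

print(f"r^2 = {r_squared:.4f}")  # prints "r^2 = 0.7225"
print(f"about {r_squared:.0%} of the variation in sleep hours")  # prints "about 72% ..."

direction = "negative" if r < 0 else "positive"
print(f"direction of association: {direction}")
```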

1.3 — Misleading r & limitations (3 marks)

Sample response. (a) Pearson's r measures only the linear component of association. A U-shape (e.g. y = x² − 1) has opposite linear trends on its two arms: if the points are spread evenly across both arms the trends cancel and r is close to 0, while if most points lie on one arm r can be large, as with r = 0.9 here. In both cases r says nothing about the curvature, even though the relationship is strong and entirely deterministic. Reporting r = 0.9 without showing the scatter plot would mask the curved shape.
(b) Two limitations from the lesson: (1) Non-linear relationships — r is blind to curved patterns; (2) Outliers — a single extreme point can dramatically inflate or deflate r.

Marking notes. (a) 1 mark — explicit statement that r measures linear association only; 1 mark — names how a curved relationship can produce a misleading r. (b) 1 mark — two distinct limitations correctly named (accept any two of: non-linear, outliers, heteroscedasticity, restricted range, ecological fallacy).
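
To see how a U-shape interacts with Pearson's r, the sketch below compares two made-up datasets that both lie exactly on a parabola: one sampled evenly across both arms, and one sampled mostly on the right-hand arm. The data and the function name pearson_r are illustrative assumptions, not part of the question.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from raw sums (computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Symmetric U: both arms sampled evenly, so the linear trends cancel.
xs_sym = [-3, -2, -1, 0, 1, 2, 3]
ys_sym = [x ** 2 - 1 for x in xs_sym]
r_sym = pearson_r(xs_sym, ys_sym)

# Lopsided U: vertex at x = 1, most points on the rising arm.
xs_lop = [0, 1, 2, 3, 4, 5, 6]
ys_lop = [(x - 1) ** 2 for x in xs_lop]
r_lop = pearson_r(xs_lop, ys_lop)

print(f"symmetric U: r = {r_sym:.2f}")  # prints "symmetric U: r = 0.00"
print(f"lopsided U:  r = {r_lop:.2f}")  # prints "lopsided U:  r = 0.92"
```

Both datasets are perfectly deterministic curves, yet r ranges from 0 to about 0.92, which is why the scatter plot must be checked before r is interpreted.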

2.1 — Extended response (8 marks): sample Band-6 response with annotations

Sample Band-6 response.

(a) Interpretation of r = 0.72. r = 0.72 indicates a strong positive linear association between daily green-tea consumption and happiness score. r² = 0.5184, so about 52% of the variation in happiness score in this sample is "explained" by the linear relationship with cups of green tea. The word "explained" is statistical jargon, not a causal claim: it tells us how tightly the points cluster around a linear trend, not whether tea causes happiness. [1 mark — strength + direction + r²; 1 mark — context interpretation that flags "explained" ≠ "caused".]

(b) Three alternative explanations.

(i) Confounding variable — income / overall healthy-lifestyle bundle. People with higher disposable income, more time for self-care, regular exercise, and a balanced diet are more likely to both drink green tea (a discretionary purchase) and report higher happiness scores. The third variable "healthy-lifestyle bundle" plausibly drives both x and y, generating the correlation without green tea itself causing happiness. [1 mark — confounder.]

(ii) Reverse causation. Happier people may be more likely to take up wellness rituals (including green-tea drinking) because they have more energy, optimism and openness to new habits — not the other way around. The blogger has assumed an x → y arrow; the data are equally consistent with a y → x arrow. [1 mark — reverse causation.]

(iii) Selection bias / measurement issues (limitation of r). The blog post does not say how the 500 respondents were recruited. If they were drawn from blog readers, they are a self-selected sample of people already interested in wellness. Happiness is also self-reported (subjective and prone to social-desirability bias), and the sample may be range-restricted to people who drink some green tea. None of these features supports a generalisable causal claim. [1 mark — selection bias / limitation of r.]

Each of these alternatives can produce r = 0.72 without any causal arrow from green tea to happiness. The data are consistent with all three at once, and Pearson's r alone cannot distinguish them. [1 mark — explanation quality: ties the three together by showing each alone can produce the observed r.]

(c) What would actually support a causal claim?

The blogger's data are observational. To support a causal "two cups a day will make you happier" claim, we would need a randomised controlled trial: recruit a representative sample (not blog readers), randomly assign participants to a "two cups of green tea per day" arm and a matched placebo arm (e.g. caffeine-matched herbal tea), blind both participants and assessors to which beverage they receive, then measure happiness with a validated scale over a pre-specified period. Randomisation balances confounders (income, lifestyle) across groups, the placebo controls for the act of "drinking a special tea" and for any caffeine effect, and blinding controls for reporting bias. Until such an experiment is run, the data show only association: this is precisely the lesson principle correlation does not imply causation, and it is why no health recommendation should be issued from r = 0.72 alone. [1 mark — concrete RCT proposal; 1 mark — explicit invocation of the lesson principle and link to why observational correlation cannot justify the recommendation.]

Total: 8/8.
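
The confounder argument in part (b)(i) can be illustrated with a small simulation. This is a hypothetical sketch, not the blogger's data: a hidden "lifestyle" score drives both tea consumption and happiness, and tea has no direct effect on happiness at all. The coefficients, noise levels and seed are arbitrary assumptions chosen for illustration.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


random.seed(12)  # fixed seed so the run is repeatable
n = 500          # same sample size as the blog post

tea, happiness = [], []
for _ in range(n):
    lifestyle = random.gauss(0, 1)  # hidden confounder (assumed, not observed)
    # Each variable depends on lifestyle plus its own independent noise.
    # Note that tea never appears in the happiness formula.
    tea.append(2 + lifestyle + random.gauss(0, 0.6))
    happiness.append(6 + lifestyle + random.gauss(0, 0.6))

r = pearson_r(tea, happiness)
print(f"r between tea and happiness: {r:.2f}")
```

Under these assumptions the correlation comes out strong and positive, comparable to the blogger's figure, even though the simulated tea has no causal effect. Randomly assigning the tea, as in the trial proposed in part (c), would break the link between lifestyle and tea consumption.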

Band descriptors for marker.

Band 3: States "strong positive correlation" and computes r²; lists one alternative explanation but not three distinct categories; no controlled-trial proposal. ≈ 3-4 marks.

Band 4: (a) complete; (b) names two of three categories (confounder + reverse causation, say) but omits limitations of r; (c) vague proposal ("do a better study"). ≈ 5-6 marks.

Band 5: All three alternative-explanation categories present and clearly distinguished; (c) names an RCT but with weak justification. ≈ 7 marks.

Band 6: Full marks across (a)-(c); explicit "explained ≠ caused" flag in (a); three categorically distinct alternatives in (b), each plausibly producing the observed r; (c) concrete RCT with randomisation + placebo + blinding, and an explicit invocation of "correlation does not imply causation" linking the lesson principle to the practical recommendation. 8/8.