Mathematics Advanced • Year 12 • Module 5 • Lesson 10
Regression Analysis
Practise HSC-style writing on regression — slope, intercept, residual analysis, and a structured extended response critiquing a prediction whose intercept is physically impossible.
1. Short-answer questions
1.1 Given x̄ = 8, s_x = 3, ȳ = 50, s_y = 12, r = 0.75, find the equation of the least-squares regression line, then use it to predict y when x = 10. 3 marks Band 3
1.2 For a regression model ŷ = 25 + 4x where y is weekly sales and x is advertising spend ($000s), interpret the slope and intercept in plain English, then state the residual for an actual observation (x = 6, y = 55). 3 marks Band 3-4
1.3 A regression model ŷ = 12 + 0.4x is built from data with x ∈ [10, 50]. (a) Predict y at x = 25 and at x = 80. (b) For each prediction, state whether it is interpolation or extrapolation, and whether it is reliable, justifying your answer in one sentence per prediction. 4 marks Band 4
Stuck on 1.3? Refer to the data range [10, 50] explicitly when judging each prediction.
2. Extended response
2.1 A medical researcher publishes a regression of newborn birth weight y (kg) on mother's age x (years), based on data from mothers aged 20–40 in one hospital:
ŷ = −2.5 + 0.25 x
and reports r = 0.45.
(a) Compute the predicted birth weight at x = 25 and at x = 35; calculate the residual for a baby weighing 4.0 kg born to a 30-year-old mother. (b) Interpret the slope and the intercept in context, and judge the intercept's meaningfulness. (c) The researcher's press release uses the equation to "warn teenage mothers that the model predicts very low birth weights" by plugging in x = 14, getting ŷ = 1.0 kg. Write a structured critique that (i) identifies the statistical error (naming the correct principle from the lesson), (ii) explains why even an r = 0.45 model has limited predictive power, and (iii) recommends one improvement to the study before any public-health statement is made.
8 marks Band 5-6
Explicit marking criteria
Part (a) — 2 marks
• 1 mark — ŷ(25) = 3.75 kg and ŷ(35) = 6.25 kg correctly computed.
• 1 mark — ŷ(30) = 5.0 kg and residual = 4.0 − 5.0 = −1.0 kg, with the interpretation "the model over-predicted by 1 kg".
Part (b) — 2 marks
• 1 mark — slope: "for each extra year of mother's age, the model predicts an additional 0.25 kg of birth weight on average".
• 1 mark — intercept: intercept −2.5 kg represents the predicted birth weight at mother's age 0, which is biologically impossible; therefore the intercept is not meaningful in this model.
Part (c) — 4 marks
• 1 mark — statistical error named: x = 14 is outside the data range [20, 40], so this is extrapolation, and the lesson principle "extrapolation is dangerous" applies.
• 1 mark — predictive-power critique: r = 0.45 ⇒ r² = 0.2025, so only about 20% of variation in birth weight is explained by mother's age; 80% is driven by other factors (genetics, nutrition, gestational age, etc.).
• 1 mark — biological / practical link: the predicted ŷ = 1.0 kg is biologically improbable for a teenage pregnancy (typical newborns are 2.5–4.5 kg), confirming that the linear model has broken down outside the observed range.
• 1 mark — recommendation: proposes a specific improvement (collect data on teenage mothers; control for confounders; use a non-linear model; or run a controlled clinical study) and explicitly couples it with the principle that no public-health recommendation should rest on an extrapolation from a low-r² observational regression.
Your response:
Stuck on (c)? Be precise: name the extrapolation, the low r², the implausible prediction, and how to fix the study.
How did this worksheet feel?
What I'll revisit before next class:
1.1 — Fit and predict (3 marks)
Sample response. b = 0.75 × (12 / 3) = 0.75 × 4 = 3.00. a = 50 − 3.00(8) = 50 − 24 = 26. ŷ = 26 + 3x. At x = 10: ŷ = 26 + 30 = 56.
Marking notes. 1 mark — b correct. 1 mark — a correct. 1 mark — prediction at x = 10 correct. Students who write a = 50 + 3(8) (sign error) lose the intercept mark.
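The arithmetic in 1.1 can be checked with a short Python sketch. It uses the same formulas as the sample response (b = r × s_y / s_x and a = ȳ − b x̄); the function name fit_from_summary and its parameter names are illustrative choices, not part of the worksheet.

```python
def fit_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for the least-squares line y-hat = a + b*x.

    Hypothetical helper: works from summary statistics only.
    """
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # the line passes through (x_bar, y_bar)
    return a, b


# Values given in question 1.1
a, b = fit_from_summary(x_bar=8, s_x=3, y_bar=50, s_y=12, r=0.75)
prediction = a + b * 10

print(f"y-hat = {a:g} + {b:g}x")
print(f"prediction at x = 10: {prediction:g}")
```

Running it should reproduce a = 26, b = 3 and a prediction of 56, matching the sample response.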
1.2 — Interpret slope/intercept & compute residual (3 marks)
Sample response. Slope = 4: for each additional $1000 spent on advertising, the model predicts that weekly sales rise by 4 units on average (in whatever units y is measured). Intercept = 25: with no advertising spend (x = 0), the model predicts weekly sales of 25 units; this is meaningful provided x = 0 is inside or near the data range. At (x = 6, y = 55): ŷ = 25 + 4(6) = 49. Residual = 55 − 49 = +6, so actual sales exceeded the prediction by 6 units (the model under-predicted).
Marking notes. 1 mark — slope interpreted in context with "per unit increase in x" phrasing. 1 mark — intercept interpreted with a comment on whether x = 0 is realistic. 1 mark — residual computed correctly with sign and one-sentence interpretation.
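A minimal sketch of the residual calculation in 1.2, using the convention residual = actual − predicted from the sample response. The function names predict_sales and residual are assumptions made for illustration.

```python
def predict_sales(spend_thousands, intercept=25.0, slope=4.0):
    """Predicted weekly sales from the model y-hat = 25 + 4x."""
    return intercept + slope * spend_thousands


def residual(x, y_actual):
    """Residual = actual value minus predicted value."""
    return y_actual - predict_sales(x)


y_hat = predict_sales(6)
res = residual(6, 55)

# A positive residual means the model under-predicted.
direction = "under-predicted" if res > 0 else "over-predicted"
print(f"y-hat = {y_hat:g}, residual = {res:+g} (model {direction})")
```

The sign convention matters: swapping the order to predicted − actual would flip the sign and the interpretation.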
1.3 — Interpolation vs extrapolation (4 marks)
Sample response. (a) ŷ(25) = 12 + 0.4(25) = 22. ŷ(80) = 12 + 0.4(80) = 44. (b) x = 25: inside the data range [10, 50], so this is interpolation — the prediction is reliable (within the usual residual scatter), because the regression line was fitted on data that surrounds this x-value. x = 80: outside the data range, so this is extrapolation — the prediction is unreliable because we have no evidence that the linear relationship continues to hold above x = 50, and other mechanisms or physical limits may apply.
Marking notes. (a) 1 mark — both predictions correct. (b) 1 mark — labels x = 25 as interpolation and judges it reliable; 1 mark — labels x = 80 as extrapolation and judges it unreliable; 1 mark — references the data range [10, 50] explicitly and gives a one-sentence reason for the "dangerous extrapolation" judgement.
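The range check behind 1.3 can be sketched as follows. The helper name classify_prediction is hypothetical, and the code only labels a prediction as interpolation or extrapolation; it does not measure reliability, which still has to be argued in words.

```python
X_MIN, X_MAX = 10, 50   # data range stated in question 1.3


def predict(x):
    """Model from question 1.3: y-hat = 12 + 0.4x."""
    return 12 + 0.4 * x


def classify_prediction(x, x_min=X_MIN, x_max=X_MAX):
    """Label a prediction by whether x lies inside the observed range."""
    return "interpolation" if x_min <= x <= x_max else "extrapolation"


results = {x: (predict(x), classify_prediction(x)) for x in (25, 80)}
for x, (y_hat, label) in results.items():
    print(f"x = {x}: y-hat = {y_hat:g} ({label})")
```

This gives 22 (interpolation) at x = 25 and 44 (extrapolation) at x = 80, as in the sample response.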
2.1 — Extended response (8 marks): sample Band-6 response with annotations
Sample Band-6 response.
(a) Predictions and a residual. ŷ(25) = −2.5 + 0.25(25) = −2.5 + 6.25 = 3.75 kg. ŷ(35) = −2.5 + 0.25(35) = −2.5 + 8.75 = 6.25 kg. [1 mark — both predictions.] ŷ(30) = −2.5 + 7.5 = 5.0 kg. Residual = 4.0 − 5.0 = −1.0 kg; the model over-predicted by 1 kg for this baby (actual was 1 kg below the line). [1 mark — residual with interpretation.]
(b) Slope and intercept in context. Slope = 0.25 kg/year: for each additional year of mother's age, the model predicts an additional 0.25 kg of birth weight on average. [1 mark — slope in context.] Intercept = −2.5 kg: this is the model's prediction at mother's age 0. The value is both biologically impossible (a negative weight cannot exist) and far outside the data range (data covers ages 20–40). The intercept is therefore not meaningful — it is a mathematical artefact of fitting a straight line, not a real-world estimate. [1 mark — intercept critique.]
(c) Critique of the press release.
(i) Statistical error. The press release plugs x = 14 into a model fitted on data from ages 20–40. This is extrapolation beyond the observed range. The lesson principle is explicit: "extrapolation is dangerous because the linear relationship may not hold outside the observed range; physical limits may apply, and different mechanisms may dominate at extreme values." The same model that produces a plausible ŷ at age 25 produces an absurd ŷ = 1.0 kg at age 14 — and an impossible negative weight at age 9. [1 mark — extrapolation principle correctly named.]
(ii) Limited predictive power. Even inside the data range, r = 0.45 gives r² ≈ 0.20: only about 20% of variation in birth weight is explained by the linear relationship with mother's age. The other 80% comes from genetics, gestational age, nutrition, smoking status, prior pregnancies, ethnicity, foetal sex and so on. A regression that explains 20% of variation cannot, in isolation, drive individual or public-health predictions; it tells us about average trends, not individual outcomes. [1 mark — r² critique tied to predictive power.]
Biological reality check. Typical full-term newborns weigh 2.5–4.5 kg. Predicting ŷ = 1.0 kg for a teenage mother is implausible: very-low-birth-weight outcomes do occur and may be more common in teenage pregnancies, but they are driven by gestational age and a complex bundle of confounders rather than chronological age alone. The model's prediction is not just statistically extrapolated; it is physiologically inconsistent with what we know about teenage pregnancies. [1 mark — biological/practical sanity check.]
(iii) Recommendation. Before any public-health statement is made, the researcher should (1) collect data on teenage mothers explicitly so that ages 14–19 are inside the observed range, (2) control for the key confounders (gestational age, parity, smoking, nutrition) using multivariable methods, and (3) use a non-linear or piecewise model if the residuals show a curved pattern. The general principle: no health recommendation should rest on extrapolation from a low-r² observational regression. Until the study is redesigned, the press release should be withdrawn. [1 mark — concrete recommendation explicitly linked to the extrapolation/predictive-power principle.]
Total: 8/8.
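The numbers quoted in the Band-6 response can be verified with a short sketch. The constant and function names below are illustrative assumptions; the model, data range, r value and observed baby are taken from the question.

```python
INTERCEPT = -2.5          # kg
SLOPE = 0.25              # kg per year of mother's age
AGE_MIN, AGE_MAX = 20, 40 # ages covered by the hospital data


def predicted_weight(age):
    """Model from question 2.1: y-hat = -2.5 + 0.25x."""
    return INTERCEPT + SLOPE * age


def in_data_range(age):
    return AGE_MIN <= age <= AGE_MAX


for age in (25, 35, 30, 14, 9):
    label = "inside data range" if in_data_range(age) else "extrapolation"
    print(f"age {age}: y-hat = {predicted_weight(age):.2f} kg ({label})")

# Residual for the 4.0 kg baby born to a 30-year-old mother
residual = 4.0 - predicted_weight(30)

# Share of variation explained by the linear model
r_squared = 0.45 ** 2

# Age at which the line crosses zero; below it predictions are negative
zero_age = -INTERCEPT / SLOPE

print(f"residual = {residual:+.1f} kg, r^2 = {r_squared:.4f}, "
      f"y-hat = 0 at age {zero_age:g}")
```

The check confirms the residual of −1.0 kg, r² = 0.2025, and that the line predicts zero weight at age 10, so any age below 10 (such as the age 9 mentioned in the response) gives a negative prediction.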
Band descriptors for marker.
Band 3: Computes the predictions correctly, but the residual sign or interpretation slips; in (b) interprets the slope but does not critique the intercept; in (c) says only "the prediction is wrong" without naming extrapolation. ≈ 3-4 marks.
Band 4: All of (a) and (b) correct; (c) names extrapolation but omits the r² critique or biological check; recommendation generic. ≈ 5-6 marks.
Band 5: All sub-parts correct; (c) extrapolation + low r² critique, but recommendation does not explicitly tie back to the lesson principle. ≈ 7 marks.
Band 6: Full computational accuracy in (a) and (b); explicit "intercept is not meaningful" with two reasons (biological impossibility + out of range); (c) extrapolation + r² + biological reality check + concrete recommendation that explicitly invokes "extrapolation is dangerous" and "low-r² observational regression cannot drive health policy". 8/8.