Mathematics Advanced • Year 12 • Module 5 • Lesson 10
Regression Analysis
Apply the least-squares regression line to real contexts — ice-cream sales, car depreciation, marathon training, real-estate distance-from-CBD and pandemic projections — and critique extrapolation risk.
Problem 1 — Ice-cream sales vs temperature (full regression workflow)
For 30 trading days in summer, a kiosk records the daily maximum temperature x (°C) and number of ice-cream cones sold y. Summary statistics:
x̄ = 25, s_x = 5, ȳ = 150, s_y = 40, r = 0.90
Set up: What are we solving for?
(i) Find the equation of the least-squares regression line ŷ = a + bx. 2 marks
(ii) Predict sales when the temperature is 22°C. If the actual sales that day were 130 cones, calculate the residual and interpret its sign in one sentence. 3 marks
(iii) Interpret the slope and intercept in context. Comment in one sentence on whether the intercept is meaningful here. 2 marks
Stuck on (iii)? The intercept is the predicted y when x = 0°C — is that inside the data range?
Problem 2 — Car depreciation (interpolation vs extrapolation)
A regression of car value y ($000s) on age x (years), using data from cars aged 1–10 years, produces:
ŷ = 28 − 2.4 x
Set up: What are we solving for?
(i) Interpret the slope and the intercept in this car-depreciation context. 2 marks
(ii) Predict the value of a 12-year-old car and a 15-year-old car. State whether each is interpolation or extrapolation, and whether each prediction is reliable. 3 marks
(iii) Solving ŷ = 0 gives x ≈ 11.67. Explain in one sentence why this "the car is worthless at ~12 years" interpretation is misleading, referring to the lesson principle about extrapolation. 2 marks
Problem 3 — Marathon training (residual analysis)
A runner records weekly training mileage x (km) and her best 10 km time y (minutes) over an 8-week training block. The least-squares regression line is
ŷ = 55 − 0.20 x
The actual data points are tabulated below.
| Week | x (km) | y (min) | ŷ (predicted) | Residual y − ŷ |
|---|---|---|---|---|
| 1 | 30 | 50.0 | | |
| 2 | 40 | 47.5 | | |
| 3 | 50 | 46.0 | | |
| 4 | 60 | 44.5 | | |
| 5 | 70 | 43.5 | | |
| 6 | 80 | 42.5 | | |
| 7 | 90 | 42.0 | | |
| 8 | 100 | 41.5 | | |
Set up: What are we solving for?
(i) Complete the ŷ and residual columns in the table. 2 marks
(ii) Sketch a rough residual plot (residual vs x) in the space below. Describe in one sentence the pattern you see. 2 marks
(iii) Based on the residual pattern, is the linear model appropriate? If not, what type of relationship does the residual plot suggest (e.g. curved, fan-shaped, systematic trend)? 2 marks
Stuck? Revisit lesson § Residuals — "problematic patterns" table.
Problem 4 — House prices vs distance from CBD
A real-estate agent records median house price y ($000) and distance from CBD x (km) for 25 suburbs:
x̄ = 12, s_x = 5, ȳ = 950, s_y = 250, r = −0.80
Set up: What are we solving for?
(i) Find the regression equation. Interpret the slope in plain English ("for each additional km from the CBD…"). 3 marks
(ii) A suburb 8 km from the CBD has actual median price $1,150,000. Calculate the predicted price and the residual, then say in one sentence whether this suburb is "over-priced" or "under-priced" relative to the linear model. 3 marks
(iii) Explain in one sentence why predicting the median price 40 km from the CBD using this equation would be unreliable. 1 mark
Problem 5 — Early COVID-19 projections (a warning tale)
In March 2020, an early modeller fits a linear regression of confirmed cases y on days since first case x over the first 14 days of the outbreak in a country. The line is
ŷ = −300 + 250 x
(Yes — for an early-exponential outbreak, a straight line is a poor fit; assume the modeller used it anyway.)
Set up: What are we solving for?
(i) Predict cases at day 10 (interpolation) and day 90 (extrapolation). 2 marks
(ii) At day 1, the model predicts ŷ = −50 cases — physically impossible. Use this fact to write a 2-sentence critique of using a linear model for early outbreak data, drawing on the lesson's COVID-19 Real-World Anchor. 2 marks
(iii) Name two reasons why even extrapolation from a well-fitted early-outbreak curve (e.g. an exponential one) failed during 2020. The lesson lists at least three; pick any two. 2 marks
Stuck? Revisit lesson § Real-World Anchor — COVID-19 Projections.
How did this worksheet feel?
What I'll revisit before next class:
Problem 1 — Ice-cream sales vs temperature
Set up. We are fitting the regression line, using it to predict and compute a residual, and judging whether the intercept makes sense in context.
(i) b = 0.90 × (40 / 5) = 0.90 × 8 = 7.20. a = 150 − 7.20(25) = 150 − 180 = −30. ŷ = −30 + 7.20x.
(ii) ŷ at x = 22 = −30 + 7.20(22) = −30 + 158.4 = 128.4 cones. Residual = 130 − 128.4 = +1.6 — the model slightly under-predicted: actual sales were just above the line.
(iii) Slope = 7.2: for each 1 °C increase in daily maximum temperature, sales rise by about 7.2 cones on average. Intercept = −30: the line predicts negative sales at 0 °C, which is nonsensical (you can't sell a negative number of cones). The intercept is not meaningful because 0 °C is far outside the summer data range (the data only cover days with x near 25) and a negative prediction is physically impossible.
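The arithmetic in (i) and (ii) can be checked with a minimal Python sketch. It uses the summary-statistic formulas from the working above (b = r × s_y / s_x, a = ȳ − b x̄); the function name `fit_from_summary` is a hypothetical helper, not something from the lesson.

```python
# Summary statistics from Problem 1
x_bar, s_x = 25.0, 5.0
y_bar, s_y = 150.0, 40.0
r = 0.90


def fit_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for y-hat = a + b*x from summary statistics.

    Hypothetical helper: b = r * s_y / s_x, a = y_bar - b * x_bar.
    """
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


a, b = fit_from_summary(x_bar, s_x, y_bar, s_y, r)

# Prediction at 22 degrees and residual for an actual sale of 130 cones
predicted = a + b * 22
residual = 130 - predicted  # residual = actual - predicted

print(f"y-hat = {a:.1f} + {b:.2f}x")
print(f"predicted = {predicted:.1f} cones, residual = {residual:+.1f}")
```

A positive residual means the actual point lies above the line, which matches the interpretation in (ii).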
Problem 2 — Car depreciation
Set up. We are interpreting a fitted line, then distinguishing safe interpolation from dangerous extrapolation, then critiquing a "value = 0" interpretation.
(i) Slope = −2.4: for each additional year of age, the car's value drops by $2,400 on average. Intercept = 28: a brand-new (0-year-old) car is predicted to be worth $28,000; this is reasonable provided the linear model holds back to age 0, although the data only cover ages 1–10.
(ii) Age 12: ŷ = 28 − 2.4(12) = 28 − 28.8 = −0.8 ($000s), i.e. −$800. This is extrapolation (12 > 10) and unreliable, both because we are extrapolating and because a negative value is physically impossible. Age 15: ŷ = 28 − 2.4(15) = 28 − 36 = −8 ($000s), i.e. −$8,000. This is also extrapolation (15 > 10), further from the data and so even less reliable; the larger negative value is even more absurd.
(iii) Solving ŷ = 0 gives x = 11.67, but 11.67 lies outside the data range [1, 10] — that is extrapolation, and the linear model breaks down (in reality, cars depreciate non-linearly, with the drop tapering off so that 12-year-old cars are worth a few thousand dollars, not nothing). The lesson principle: extrapolation is dangerous because the linear relationship may not hold outside the observed range.
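The interpolation/extrapolation check in (ii) can be sketched in Python as below. The function `predict_value` is a hypothetical helper, and age 5 is included only as an illustrative in-range contrast; it is not part of the question.

```python
def predict_value(age, a=28.0, b=-2.4, data_range=(1, 10)):
    """Return (predicted value in $000s, kind of prediction).

    Hypothetical helper: a prediction counts as interpolation only when
    the age lies inside the range of ages used to fit the line.
    """
    lo, hi = data_range
    kind = "interpolation" if lo <= age <= hi else "extrapolation"
    return a + b * age, kind


# Age 5 is an illustrative in-range case; 12 and 15 are from the question.
for age in (5, 12, 15):
    value, kind = predict_value(age)
    flag = " (negative value: impossible)" if value < 0 else ""
    print(f"age {age}: {value:.1f} ($000s), {kind}{flag}")
```

Tagging each prediction with its kind makes the risk visible before the number is quoted.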
Problem 3 — Marathon training
Set up. We are computing predictions and residuals to interrogate whether a straight line is the right model for this relationship.
(i)
| Week | x | y | ŷ = 55 − 0.20x | Residual |
|---|---|---|---|---|
| 1 | 30 | 50.0 | 49.0 | +1.0 |
| 2 | 40 | 47.5 | 47.0 | +0.5 |
| 3 | 50 | 46.0 | 45.0 | +1.0 |
| 4 | 60 | 44.5 | 43.0 | +1.5 |
| 5 | 70 | 43.5 | 41.0 | +2.5 |
| 6 | 80 | 42.5 | 39.0 | +3.5 |
| 7 | 90 | 42.0 | 37.0 | +5.0 |
| 8 | 100 | 41.5 | 35.0 | +6.5 |
(ii) Residual plot: residuals are all positive and, after a small dip from +1.0 to +0.5 at x = 40, increase steadily as x increases, reaching +6.5 at x = 100.
(iii) The systematic upward trend in residuals shows the linear model is not appropriate — actual times are higher than the line predicts at high x, suggesting diminishing returns (a curved relationship: time falls less per extra km as mileage rises). A better model would be exponential decay or a power curve approaching a non-zero asymptote. (Pattern type from the lesson's table: "curved pattern" / "systematic trend".)
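One caveat is worth checking numerically. Residuals from a genuine least-squares line (with an intercept) always sum to zero, so the all-positive residual column above suggests that ŷ = 55 − 0.20x is not exactly the least-squares line for these eight points. The Python sketch below reproduces the table's residuals for the stated line and then refits the line from the tabulated data for comparison. The helper names are hypothetical.

```python
# Data from the Problem 3 table
xs = [30, 40, 50, 60, 70, 80, 90, 100]
ys = [50.0, 47.5, 46.0, 44.5, 43.5, 42.5, 42.0, 41.5]


def residuals(a, b):
    """Residuals y - y-hat for the line y-hat = a + b*x."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


def least_squares(xs, ys):
    """Return (a, b) minimising the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b


stated = residuals(55.0, -0.20)      # line given in the question
a_fit, b_fit = least_squares(xs, ys)  # line refitted from the table
refit = residuals(a_fit, b_fit)

print(f"refitted line: y-hat = {a_fit:.2f} + ({b_fit:.3f})x")
print("x    stated  refit")
for x, r1, r2 in zip(xs, stated, refit):
    print(f"{x:<4} {r1:+6.2f}  {r2:+6.2f}")
print(f"sum of refit residuals = {sum(refit):.6f}")
```

With the refitted line the residuals are positive at both ends of the mileage range and negative in the middle, a U-shaped pattern that still points to a curved relationship, so the conclusion in (iii) is unchanged.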
Problem 4 — House prices vs distance from CBD
Set up. We are fitting a regression line with a negative slope, predicting and finding a residual, then critiquing extrapolation beyond the data range.
(i) b = −0.80 × (250 / 5) = −0.80 × 50 = −40. a = 950 − (−40)(12) = 950 + 480 = 1430. ŷ = 1430 − 40x ($000). Slope: for each additional km from the CBD, the median house price drops by about $40,000 on average.
(ii) ŷ at x = 8 = 1430 − 40(8) = 1430 − 320 = 1110 ($000), i.e. $1,110,000. Residual = 1150 − 1110 = +40 ($000), i.e. +$40,000. The suburb sits $40,000 above the line, so it is mildly over-priced relative to the model (equivalently, the model under-predicts prices for that suburb).
(iii) x = 40 km is well outside the observed range (assuming the 25 suburbs lie roughly within 20 km of the CBD), so this is extrapolation; the linear relationship between price and distance may not hold for distant outer-suburban or rural areas where different market dynamics apply.
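Because prices here are recorded in $000, it is easy to slip between units. The short Python sketch below keeps every quantity in $000 until the final print, using the same summary-statistic formulas as the working in (i). Variable names are illustrative only.

```python
# Summary statistics from Problem 4 (prices in $000, distance in km)
x_bar, s_x = 12.0, 5.0
y_bar, s_y = 950.0, 250.0
r = -0.80

b = r * s_y / s_x        # slope, in $000 per km
a = y_bar - b * x_bar    # intercept, in $000

predicted = a + b * 8            # predicted median price at 8 km, $000
residual = 1150.0 - predicted    # actual - predicted, $000

# Sign of the residual tells us where the suburb sits relative to the line
position = "above" if residual > 0 else "below" if residual < 0 else "on"

print(f"y-hat = {a:.0f} + ({b:.0f})x  ($000)")
print(f"predicted = ${predicted * 1000:,.0f}")
print(f"residual = {residual * 1000:+,.0f} dollars ({position} the line)")
```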
Problem 5 — Early COVID-19 projections
Set up. We are using a deliberately-bad linear fit to dramatise both the cost of extrapolation and the limits of even good models when underlying dynamics shift.
(i) Day 10: ŷ = −300 + 250(10) = 2200 cases. Day 90: ŷ = −300 + 250(90) = 22,200 cases.
(ii) At day 1, ŷ = −300 + 250(1) = −50, which is physically impossible: you cannot have negative cases. A straight line is the wrong shape for early-outbreak data, which grow exponentially rather than linearly, and a model that predicts negative cases inside its own data range should be discarded. As the lesson's COVID-19 anchor warns, extrapolation from early-phase pandemic data was one of the most consequential statistical failures in recent history.
(iii) Any two of: (a) behavioural changes: lockdowns, masking and physical distancing reduce transmission below early estimates; (b) herd immunity / saturation effects: as the susceptible pool shrinks, the growth rate falls; (c) interventions such as vaccination programmes and improved treatments shift the trajectory. All three change the underlying mechanism, so a curve fitted to early data no longer applies to later x values.
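The sanity check used in (ii) can be sketched in Python: evaluate the fitted line over its own data range and flag any impossible (negative) predictions. The sketch assumes the 14 fitted days are numbered 1 to 14, which the question does not state explicitly; `cases` is a hypothetical helper.

```python
def cases(day, a=-300.0, b=250.0):
    """Predicted confirmed cases from the line y-hat = -300 + 250x."""
    return a + b * day


# Assumption: the 14 fitted days are numbered 1..14.
fitted_days = range(1, 15)

# Days inside the data range where the model predicts negative cases
impossible = [d for d in fitted_days if cases(d) < 0]
print("days with negative predictions:", impossible)

# Day 10 is interpolation; day 90 is extrapolation far beyond day 14
for day in (10, 90):
    kind = "interpolation" if day in fitted_days else "extrapolation"
    print(f"day {day}: {cases(day):,.0f} cases ({kind})")
```

A negative prediction inside the fitted range is a cheap, automatic warning that the model's shape is wrong, before any extrapolation is attempted.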