Mathematics Advanced • Year 12 • Module 5 • Lesson 10

Regression Analysis

Apply the least-squares regression line to real contexts — ice-cream sales, car depreciation, marathon training, real-estate distance-from-CBD and pandemic projections — and critique extrapolation risk.

Apply · Problem Set

Problem 1 — Ice-cream sales vs temperature (full regression workflow)

For 30 trading days in summer, a kiosk records the daily maximum temperature x (°C) and number of ice-cream cones sold y. Summary statistics:

x̄ = 25,   s_x = 5,   ȳ = 150,   s_y = 40,   r = 0.90

Set up: What are we solving for?

(i) Find the equation of the least-squares regression line ŷ = a + bx.   2 marks

(ii) Predict sales when the temperature is 22°C. If the actual sales that day were 130 cones, calculate the residual and interpret its sign in one sentence.   3 marks

(iii) Interpret the slope and intercept in context. Comment in one sentence on whether the intercept is meaningful here.   2 marks

Stuck on (iii)? The intercept is the predicted y when x = 0°C — is that inside the data range?

Problem 2 — Car depreciation (interpolation vs extrapolation)

A regression of car value y ($000s) on age x (years), using data from cars aged 1–10 years, produces:

ŷ = 28 − 2.4 x

Set up: What are we solving for?

(i) Interpret the slope and the intercept in this car-depreciation context.   2 marks

(ii) Predict the value of a 12-year-old car and a 15-year-old car. State whether each is interpolation or extrapolation, and whether each prediction is reliable.   3 marks

(iii) Solving ŷ = 0 gives x ≈ 11.67. Explain in one sentence why the interpretation "the car is worthless at about 12 years" is misleading, referring to the lesson principle about extrapolation.   2 marks

Problem 3 — Marathon training (residual analysis)

A runner records weekly training mileage x (km) and her best 10 km time y (minutes) over an 8-week training block. The least-squares regression line is

ŷ = 55 − 0.20 x

The actual data points are tabulated below.

Week   x (km)   y (min)   ŷ (predicted)   Residual y − ŷ
1      30       50.0
2      40       47.5
3      50       46.0
4      60       44.5
5      70       43.5
6      80       42.5
7      90       42.0
8      100      41.5

Set up: What are we solving for?

(i) Complete the ŷ and residual columns in the table.   2 marks

(ii) Sketch a rough residual plot (residual vs x) in the space below. Describe in one sentence the pattern you see.   2 marks

(iii) Based on the residual pattern, is the linear model appropriate? If not, what type of relationship does the residual plot suggest (e.g. curved, fan-shaped, systematic trend)?   2 marks

Stuck? Revisit lesson § Residuals — "problematic patterns" table.

Problem 4 — House prices vs distance from CBD

A real-estate agent records median house price y ($000) and distance from CBD x (km) for 25 suburbs:

x̄ = 12,   s_x = 5,   ȳ = 950,   s_y = 250,   r = −0.80

Set up: What are we solving for?

(i) Find the regression equation. Interpret the slope in plain English ("for each additional km from the CBD…").   3 marks

(ii) A suburb 8 km from the CBD has actual median price $1,150,000. Calculate the predicted price and the residual, then say in one sentence whether this suburb is "over-priced" or "under-priced" relative to the linear model.   3 marks

(iii) Explain in one sentence why predicting the median price 40 km from the CBD using this equation would be unreliable.   1 mark

Problem 5 — Early COVID-19 projections (a cautionary tale)

In March 2020, an early modeller fits a linear regression of confirmed cases y on days since first case x over the first 14 days of the outbreak in a country. The line is

ŷ = −300 + 250 x

(Yes — for an early-exponential outbreak, a straight line is a poor fit; assume the modeller used it anyway.)

Set up: What are we solving for?

(i) Predict cases at day 10 (interpolation) and day 90 (extrapolation).   2 marks

(ii) At day 1, the model predicts ŷ = −50 cases — physically impossible. Use this fact to write a 2-sentence critique of using a linear model for early outbreak data, drawing on the lesson's COVID-19 Real-World Anchor.   2 marks

(iii) Name two reasons why even extrapolation from a well-fitted early-outbreak curve (e.g. an exponential one) failed during 2020. The lesson lists at least three; pick any two.   2 marks

Stuck? Revisit lesson § Real-World Anchor — COVID-19 Projections.

How did this worksheet feel?

What I'll revisit before next class:

Answers — Do not peek before attempting

Problem 1 — Ice-cream sales vs temperature

Set up. We are fitting the regression line, using it to predict and compute a residual, and judging whether the intercept makes sense in context.

(i) b = 0.90 × (40 / 5) = 0.90 × 8 = 7.20. a = 150 − 7.20(25) = 150 − 180 = −30. ŷ = −30 + 7.20x.
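
The arithmetic in (i) can be checked with a short Python sketch. The function name `line_from_summary` is hypothetical (it does not come from the lesson); it simply applies b = r × (s_y / s_x) and a = ȳ − b x̄ to the summary statistics given in the problem.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for the least-squares line y_hat = a + b*x.

    Hypothetical helper: works from summary statistics only.
    """
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return a, b


# Summary statistics from Problem 1
a, b = line_from_summary(x_bar=25, s_x=5, y_bar=150, s_y=40, r=0.90)
print(f"y_hat = {a:.2f} + {b:.2f}x")
```

The same helper should work for any problem that supplies x̄, s_x, ȳ, s_y and r, including cases where r is negative.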

(ii) At x = 22: ŷ = −30 + 7.20(22) = −30 + 158.4 = 128.4 cones. Residual = 130 − 128.4 = +1.6. The positive sign means the model slightly under-predicted: actual sales were just above the line.
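
A minimal sketch of the prediction and residual steps, assuming the fitted line ŷ = −30 + 7.20x from part (i). The helper names `predict` and `residual` are illustrative only.

```python
def predict(a, b, x):
    """Predicted value on the line y_hat = a + b*x."""
    return a + b * x


def residual(actual, predicted):
    """Residual = observed value minus predicted value."""
    return actual - predicted


a, b = -30.0, 7.2                 # line from part (i)
y_hat = predict(a, b, 22)         # predicted cones at 22 degrees C
res = residual(130, y_hat)        # actual sales were 130 cones

print(f"predicted = {y_hat:.1f}, residual = {res:+.1f}")
```

A positive residual means the point lies above the line (the model under-predicted); a negative residual means it lies below.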

(iii) Slope = 7.2: for each 1 °C increase in daily maximum temperature, sales rise by about 7.2 cones on average. Intercept = −30: the line predicts negative sales at 0 °C, which is nonsensical (you can't sell a negative number of cones). The intercept is not meaningful because 0 °C is far outside the summer data range (the data only cover days with x near 25) and a negative prediction is physically impossible.

Problem 2 — Car depreciation

Set up. We are interpreting a fitted line, then distinguishing safe interpolation from dangerous extrapolation, then critiquing a "value = 0" interpretation.

(i) Slope = −2.4: for each additional year of age, the car's value drops by $2,400 on average. Intercept = 28: a brand-new (0-year-old) car is predicted to be worth $28,000; this is reasonable provided the linear model holds back to age 0, although the data only cover ages 1–10.

(ii) Age 12: ŷ = 28 − 2.4(12) = 28 − 28.8 = −0.8, i.e. −$800. This is extrapolation (12 > 10) and unreliable, both because we are extrapolating and because a negative value is impossible. Age 15: ŷ = 28 − 2.4(15) = 28 − 36 = −8, i.e. −$8,000. This is extrapolation even further from the data, so it is even less reliable and the result is even more absurd.
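
The interpolation/extrapolation check can be sketched in Python as below. The helper `predict_with_flag` is hypothetical, and age 5 is an extra value that is not part of the problem; it is included only to contrast a prediction inside the data range (ages 1 to 10) with the two outside it.

```python
def predict_with_flag(a, b, x, x_min, x_max):
    """Return (prediction, label) for the line y_hat = a + b*x.

    The label is 'interpolation' when x lies inside the observed
    range [x_min, x_max], otherwise 'extrapolation'.
    """
    label = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return a + b * x, label


# y_hat = 28 - 2.4x in $000s, fitted on cars aged 1 to 10 years
for age in (5, 12, 15):
    value, label = predict_with_flag(28, -2.4, age, x_min=1, x_max=10)
    print(f"age {age}: {value:.1f} ($000s), {label}")
```

The flag only reports whether x is inside the observed range; it says nothing about how well the line fits within that range.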

(iii) Solving ŷ = 0 gives x ≈ 11.67, but 11.67 lies outside the data range [1, 10], so this is extrapolation, and the linear model breaks down there (in reality, cars tend to depreciate non-linearly, with the drop tapering off so that 12-year-old cars are typically still worth a few thousand dollars, not nothing). The lesson principle: extrapolation is dangerous because the linear relationship may not hold outside the observed range.

Problem 3 — Marathon training

Set up. We are computing predictions and residuals to interrogate whether a straight line is the right model for this relationship.

(i)

Week   x     y      ŷ = 55 − 0.20x   Residual
1      30    50.0   49.0             +1.0
2      40    47.5   47.0             +0.5
3      50    46.0   45.0             +1.0
4      60    44.5   43.0             +1.5
5      70    43.5   41.0             +2.5
6      80    42.5   39.0             +3.5
7      90    42.0   37.0             +5.0
8      100   41.5   35.0             +6.5
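
The table can be reproduced with a few lines of Python. This is a sketch that uses the eight data points from the problem and the line ŷ = 55 − 0.20x as given; the variable names are illustrative.

```python
# Weekly mileage (km) and best 10 km time (min) from the problem table
xs = [30, 40, 50, 60, 70, 80, 90, 100]
ys = [50.0, 47.5, 46.0, 44.5, 43.5, 42.5, 42.0, 41.5]

# Predicted times from the given line, then residual = actual - predicted
predicted = [55 - 0.20 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

rows = zip(xs, ys, predicted, residuals)
for week, (x, y, y_hat, res) in enumerate(rows, start=1):
    print(f"{week}  {x:>3}  {y:5.1f}  {y_hat:5.1f}  {res:+5.1f}")
```

Plotting `residuals` against `xs` gives the residual plot asked for in part (ii).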

(ii) Residual plot: residuals are all positive and increase as x increases, drifting from about +1 at x = 30 up to +6.5 at x = 100.

(iii) The systematic upward trend in residuals shows the linear model is not appropriate: actual times are higher than the line predicts at high x, suggesting diminishing returns (a curved relationship in which time falls less per extra km as mileage rises). A better model might be exponential decay or a power curve approaching a non-zero asymptote. (Pattern type from the lesson's table: "curved pattern" / "systematic trend".)

Problem 4 — House prices vs distance from CBD

Set up. We are fitting a regression line with a negative slope, predicting and finding a residual, then critiquing extrapolation beyond the data range.

(i) b = −0.80 × (250 / 5) = −0.80 × 50 = −40. a = 950 − (−40)(12) = 950 + 480 = 1430. ŷ = 1430 − 40x ($000). Slope: for each additional km from the CBD, the median house price drops by about $40,000 on average.

(ii) At x = 8: ŷ = 1430 − 40(8) = 1430 − 320 = 1110, i.e. $1,110,000. Residual = 1150 − 1110 = +40, i.e. +$40,000. The suburb sits $40,000 above the line, so it is mildly over-priced relative to the model (equivalently, the model under-predicts prices for that suburb).

(iii) x = 40 km is well outside the observed range (assuming the 25 suburbs lie roughly within 20 km of the CBD), so this is extrapolation; the linear relationship between price and distance may not hold for distant outer-suburban or rural areas where different market dynamics apply.
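
One quick way to see the problem is to evaluate the fitted line at 40 km. The sketch below assumes the line ŷ = 1430 − 40x ($000) from part (i); the variable names are illustrative.

```python
a, b = 1430, -40              # y_hat = 1430 - 40x, in $000

price_at_40 = a + b * 40      # prediction 40 km from the CBD
zero_crossing = -a / b        # distance at which the line predicts a price of 0

print(price_at_40)    # -170, i.e. a median price of -$170,000
print(zero_crossing)  # 35.75 (km)
```

The line predicts a negative median price beyond about 36 km, which is a sign that it cannot be trusted that far from the data.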

Problem 5 — Early COVID-19 projections

Set up. We are using a deliberately-bad linear fit to dramatise both the cost of extrapolation and the limits of even good models when underlying dynamics shift.

(i) Day 10: ŷ = −300 + 250(10) = 2,200 cases. Day 90: ŷ = −300 + 250(90) = 22,200 cases.
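
These predictions, along with the day 1 value used in part (ii), can be checked with a minimal Python sketch. The function name `predicted_cases` is hypothetical; it just evaluates the modeller's line ŷ = −300 + 250x.

```python
def predicted_cases(day):
    """Evaluate the modeller's line y_hat = -300 + 250x at a given day."""
    return -300 + 250 * day


for day in (1, 10, 90):
    print(day, predicted_cases(day))
# 1 -50
# 10 2200
# 90 22200
```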

(ii) At day 1, ŷ = −50 is impossible, since case counts cannot be negative, so the line fails even inside its own data range and should be discarded. A straight line is the wrong shape for early-outbreak data, which grow roughly exponentially; as the lesson's COVID-19 anchor warns, extrapolation from early-phase pandemic data was one of the most consequential statistical failures in recent history.

(iii) Any two of: (a) behavioural changes (lockdowns, masking and physical distancing) reduce transmission below early estimates; (b) herd immunity / saturation effects (as the susceptible pool shrinks, the growth rate falls); (c) interventions such as vaccination programmes and improved treatments shift the trajectory. All three change the underlying mechanism, so a curve fitted to early data no longer applies to later x values.