Mathematics Advanced • Year 12 • Module 5 • Lesson 10

Regression Analysis

Build procedural fluency in fitting the least-squares regression line from summary statistics, interpreting slope and intercept, and computing residuals.

Build · Skill Drill

1. Quick recall

Answer each question in the space provided.    1 mark each

Q1.1 Write the three regression formulas (general line, slope, intercept):

ŷ = ____________

b = ____________

a = ____________

Q1.2 Complete the residual formula and one-sentence interpretation of its sign.

Residual = ____________

Positive residual means the model ____________ (under / over)-predicted; negative residual means the model ____________ (under / over)-predicted.

Q1.3 The least-squares regression line always passes through the point ( ______ , ______ ).

Stuck? Revisit lesson § Formula Reference and § Regression Line.

2. Worked example — fit the regression line and compute a residual

Problem. For study hours x vs test scores y, summary statistics give x̄ = 5, s_x = 2, ȳ = 70, s_y = 10, r = 0.8. Find (a) the regression equation, (b) the predicted score at x = 6, (c) the residual if the actual score at x = 6 was 78.

Step 1 — Slope.

b = r × (s_y / s_x) = 0.8 × (10/2) = 0.8 × 5 = 4

Reason: b has the same sign as r, and its size is |r| multiplied by the ratio of the y-spread to the x-spread (here 0.8 × 5 = 4 marks per extra hour of study).

Step 2 — Intercept.

a = ȳ − b·x̄ = 70 − 4(5) = 70 − 20 = 50

Reason: forces the line to pass through (x̄, ȳ) = (5, 70).

Step 3 — Regression equation.

ŷ = 50 + 4x

Step 4 — Predict at x = 6.

ŷ = 50 + 4(6) = 74 marks

Step 5 — Residual at x = 6, observed y = 78.

Residual = y − ŷ = 78 − 74 = +4

Reason: positive residual ⇒ the model under-predicted (actual was 4 marks above the line).

Conclusion. ŷ = 50 + 4x. Predicted score at 6 hours = 74. The actual score 78 sits 4 marks above the line — a positive residual.
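
The five steps above can be sketched in a few lines of Python. This is only an illustrative check of the hand calculation, not a required method; the function name `fit_line` and its parameter names are assumptions introduced here.

```python
def fit_line(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for the line y_hat = a + b*x from summary statistics."""
    b = r * s_y / s_x        # Step 1: slope
    a = y_bar - b * x_bar    # Step 2: intercept, so the line passes through (x_bar, y_bar)
    return a, b


# Summary statistics from the worked example (study hours vs test score).
a, b = fit_line(x_bar=5, s_x=2, y_bar=70, s_y=10, r=0.8)

y_hat = a + b * 6            # Step 4: predicted score at x = 6
residual = 78 - y_hat        # Step 5: observed minus predicted

print(f"y_hat = {a:.2f} + {b:.2f}x")
print(f"prediction at x = 6: {y_hat:.2f}")
print(f"residual: {residual:+.2f}")
```

Running it should reproduce the values found by hand: slope 4, intercept 50, prediction 74 and residual +4.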

3. Faded example — fill in the missing steps

For house price y ($000s) vs distance from CBD x (km): x̄ = 10, s_x = 4, ȳ = 800, s_y = 200, r = −0.75. Find the regression equation and predict the price 15 km from the CBD.    4 marks

Step 1 — Slope.

b = r × (s_y / s_x) = ______ × ( ______ / ______ ) = ____________

Step 2 — Intercept.

a = ȳ − b·x̄ = ______ − ( ______ )( ______ ) = ____________

Step 3 — Regression equation. ŷ = __________ + __________ · x

Step 4 — Predict at x = 15 km.

ŷ = __________ + __________ (15) = ____________

Conclusion. Predicted price ≈ $____________ (give in actual dollars, not $000s).

Stuck? Revisit lesson § Worked Example. Remember b inherits the sign of r.

4. Graduated practice — fit, predict, interpret, residuals

Show your working. Quote slope and intercept to 2 decimal places.

Foundation — single-step calculations (4 questions)

4.1 Given r = 0.6, s_x = 1.5, s_y = 12, compute b.    1 mark

4.2 Given x̄ = 4, ȳ = 60, b = 4.8, compute the intercept a.    1 mark

4.3 For ŷ = 20 + 3x, predict y when x = 5.    1 mark

4.4 For ŷ = 30 + 2.5x and a data point (x = 6, y = 45), find the residual.    1 mark

Standard — typical HSC difficulty (6 questions)

Show your working in the space below each part. Quote final numbers with units.

4.5 Find the full regression line ŷ = a + bx given x̄ = 4, s_x = 1.5, ȳ = 60, s_y = 12, r = 0.6.    2 marks

4.6 Using the line from 4.5, predict y when x = 5, and find the residual if the actual value at x = 5 was 65.    2 marks

4.7 A model predicts test scores from hours studied using data from students who studied 2–10 hours. State whether predicting for x = 15 hours is interpolation or extrapolation, and whether it is reliable.    2 marks

4.8 Given ŷ = 10 − 2x where y is car value ($000s) and x is age (years), interpret the slope and the intercept in context.    2 marks

4.9 The regression line from 4.5 passes through which special point? Verify by substitution.    1 mark

4.10 A residual plot shows a clear U-shape. What does this pattern indicate about the choice of model, and what should you do instead?    2 marks

Extension — combine concepts (2 questions)

4.11 A regression of children's height vs age uses data from children aged 5–12 years. The intercept is 80 cm. State whether the intercept is meaningful, and explain in two sentences why.    3 marks

4.12 A regression equation ŷ = 30 + 4x is built from data with x ∈ [2, 20]. (i) Predict y at x = 10 and at x = 30. (ii) Explain in 1-2 sentences why one prediction is safe and the other is dangerous, naming the principle from the lesson.    3 marks

Stuck on 4.10? Revisit lesson § Residuals — "problematic patterns".

5. Self-check the easy 3

Tick the first three once you've checked your method works.

How did this worksheet feel?

What I'll revisit before next class:

Answers — Do not peek before attempting

Q1.1 — Regression formulas

ŷ = a + bx. b = r · (s_y / s_x). a = ȳ − b · x̄.

Q1.2 — Residual formula and sign

Residual = y − ŷ (observed − predicted). Positive residual ⇒ under-predicted (point above the line). Negative residual ⇒ over-predicted (point below the line).

Q1.3 — Special point

The line always passes through (x̄, ȳ) — the mean of x and the mean of y.
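
A short check of why this holds, written as a sketch that uses only the intercept formula from Q1.1: substitute x = x̄ into the line and replace a by ȳ − b·x̄.

```latex
\begin{align*}
\hat{y}(\bar{x}) &= a + b\bar{x} \\
                 &= (\bar{y} - b\bar{x}) + b\bar{x} \\
                 &= \bar{y}
\end{align*}
```

The slope terms cancel, so the result does not depend on the value of b.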

Q3 — Faded example: house price vs distance from CBD

Step 1: b = −0.75 × (200/4) = −0.75 × 50 = −37.5.
Step 2: a = 800 − (−37.5)(10) = 800 + 375 = 1175.
Step 3: ŷ = 1175 − 37.5x.
Step 4: ŷ at 15 km = 1175 − 37.5(15) = 1175 − 562.5 = 612.5.
Conclusion: predicted price ≈ $612,500.
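
The same arithmetic can be checked with a short Python sketch, shown here mainly to illustrate the unit conversion from $000s to dollars. The variable names are assumptions chosen for this example.

```python
# Summary statistics from the faded example (price in $000s, distance in km).
x_bar, s_x = 10, 4
y_bar, s_y = 800, 200
r = -0.75

b = r * s_y / s_x              # slope is negative because r is negative
a = y_bar - b * x_bar          # subtracting a negative term raises the intercept

price_thousands = a + b * 15   # prediction at 15 km, still in $000s
price_dollars = price_thousands * 1000

print(f"y_hat = {a:.1f} + ({b:.1f})x")
print(f"predicted price: ${price_dollars:,.0f}")
```

The last line converts the model output (612.5, in $000s) to dollars before reporting it, which is the step most often missed in the conclusion.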

Q4.1 — Slope

b = 0.6 × (12 / 1.5) = 0.6 × 8 = 4.80.

Q4.2 — Intercept

a = 60 − 4.8(4) = 60 − 19.2 = 40.80.

Q4.3 — Predict for ŷ = 20 + 3x at x = 5

ŷ = 20 + 3(5) = 35.

Q4.4 — Residual for ŷ = 30 + 2.5x at (6, 45)

ŷ = 30 + 2.5(6) = 45. Residual = 45 − 45 = 0 (the point lies exactly on the line).

Q4.5 — Full regression line from x̄ = 4, s_x = 1.5, ȳ = 60, s_y = 12, r = 0.6

b = 0.6 × (12/1.5) = 4.80. a = 60 − 4.80(4) = 40.80. ŷ = 40.80 + 4.80x.

Q4.6 — Predict and residual at x = 5

ŷ = 40.80 + 4.80(5) = 40.80 + 24 = 64.80. Residual = 65 − 64.80 = +0.20.

Q4.7 — Predicting at x = 15 from data x ∈ [2, 10]

x = 15 is outside the observed range, so this is extrapolation. It is not reliable — the linear relationship may not hold beyond x = 10, and physical limits (e.g. exam scores capped at 100%) may make the prediction nonsensical.

Q4.8 — Slope and intercept for ŷ = 10 − 2x (car value vs age)

Slope = −2: for each additional year of age, the car's predicted value decreases by $2,000 on average. Intercept = 10: a brand-new car (age 0) is predicted to be worth $10,000. The intercept is meaningful only if x = 0 is inside or near the data range and not contradicted by physical reality.

Q4.9 — Does the line pass through (x̄, ȳ)?

(x̄, ȳ) = (4, 60). Check: ŷ at x = 4 = 40.80 + 4.80(4) = 40.80 + 19.20 = 60 ✓. Confirmed: the regression line always passes through (x̄, ȳ).
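
The same check can be run from raw data rather than summary statistics. The sketch below uses a small hypothetical data set (invented for this example, not taken from the lesson), computes x̄, ȳ, s_x, s_y and r by hand, and then confirms that substituting x̄ returns ȳ.

```python
from math import isclose, sqrt

# Hypothetical raw data, invented for this check only.
xs = [2, 4, 6, 8, 10]
ys = [55, 62, 70, 75, 88]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squares and cross-products about the means.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Sample standard deviations and correlation coefficient.
s_x = sqrt(s_xx / (n - 1))
s_y = sqrt(s_yy / (n - 1))
r = s_xy / sqrt(s_xx * s_yy)

b = r * s_y / s_x          # slope
a = y_bar - b * x_bar      # intercept

# Substituting x = x_bar should return y_bar (up to rounding error).
print(isclose(a + b * x_bar, y_bar))
```

The comparison uses `isclose` rather than `==` because floating-point rounding can leave a tiny difference even when the algebra is exact.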

Q4.10 — Residual plot with U-shape

A U-shaped residual plot indicates the linear model is the wrong shape — the underlying relationship is curved. You should fit a non-linear model (e.g. quadratic, exponential, log) or transform one of the variables (e.g. plot y vs x², y vs ln(x)) before refitting and re-checking the residuals for randomness.
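
To see the pattern numerically, the sketch below fits a straight line to hypothetical data that follow y = x² exactly, then refits after transforming x to x². The data and the helper names `fit` and `residuals` are assumptions for illustration. The slope is computed directly from the data as the sum of cross-products divided by the sum of squares of x, which is algebraically the same as r · (s_y / s_x).

```python
def fit(xs, ys):
    """Least-squares intercept a and slope b for y_hat = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def residuals(xs, ys):
    """Observed minus predicted for each data point."""
    a, b = fit(xs, ys)
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical curved data: y = x^2 exactly.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

linear_res = residuals(xs, ys)                          # fit y against x
transformed_res = residuals([x ** 2 for x in xs], ys)   # fit y against x^2

print(linear_res)       # positive at the ends, negative in the middle
print(transformed_res)  # all zero: no pattern left
```

The linear fit leaves residuals of 2, −1, −2, −1, 2, which trace a U-shape, while the fit against x² leaves residuals of zero. Real data would not fit perfectly, but the residuals after a suitable transformation should scatter randomly about zero.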

Q4.11 — Children's height vs age, intercept 80 cm

The intercept is not meaningful. It represents the predicted height at age 0 (a newborn), but the data only cover ages 5–12, so x = 0 lies far outside the data range and using it is extrapolation. In addition, 80 cm is roughly a toddler's height, not a newborn's (about 50 cm), so the linear model breaks down at that extreme; the intercept is a mathematical artifact, not a clinical estimate.

Q4.12 — Predict ŷ = 30 + 4x at x = 10 vs x = 30

(i) x = 10: ŷ = 30 + 40 = 70. x = 30: ŷ = 30 + 120 = 150.
(ii) x = 10 is inside the data range [2, 20] → interpolation, which is safe. x = 30 is outside the data range → extrapolation, which is dangerous because the linear relationship may not hold beyond the observed data and there may be physical limits or behavioural changes at high x.
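
One way to make the range check routine is to report it alongside every prediction. The sketch below is a minimal illustration; the function name `predict` and its parameters are assumptions introduced here.

```python
def predict(a, b, x, x_min, x_max):
    """Return the prediction and whether x lies inside the observed range."""
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


# Line and data range from Q4.12.
for x in (10, 30):
    y_hat, kind = predict(a=30, b=4, x=x, x_min=2, x_max=20)
    print(f"x = {x}: y_hat = {y_hat} ({kind})")
```

The arithmetic is identical for both inputs; only the label changes, which is a reminder that the formula itself gives no warning when x is outside the data range.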