Module 5 · Lesson 10 of 15 · ~40 min

Regression Analysis

Correlation told you two variables move together. Regression tells you exactly how. The least-squares line is statistics' most powerful prediction tool — but predicting far outside your data is dangerous, a negative intercept might be impossible, and a curved residual plot reveals your straight line is simply the wrong model.

Today's hook — A regression of baby birth weight on mother's age gives an intercept of $-2.5$ kg. Is this meaningful? Should you trust it? That one question unlocks the entire art of regression interpretation.

Think First — gut answer before you read

A regression predicts a baby's birth weight from the mother's age. The intercept is $-2.5$ kg. Is this meaningful? Explain before reading on.

Formula reference for this lesson

Four formulas. The line, the slope, the intercept, and the residual — in that order.

Regression line

$\hat{y} = a + bx$ — predicted value of $y$ for a given $x$

Slope

$b = r \times \dfrac{s_y}{s_x}$ — sign of $b$ matches sign of $r$

Intercept

$a = \bar{y} - b\bar{x}$ — line always passes through $(\bar{x}, \bar{y})$

Residual

$\text{Residual} = y - \hat{y}$

Key insight. Predictions inside the data range (interpolation) are generally reliable. Predictions outside the data range (extrapolation) are dangerous, because the linear relationship may not hold at extreme values.

What you'll master

Know

Key facts

$\hat{y} = a + bx$ where $b = r \cdot s_y/s_x$ and $a = \bar{y} - b\bar{x}$
Residual = observed $-$ predicted = $y - \hat{y}$
Interpolation = safe; extrapolation = dangerous

Understand

Concepts

The regression line minimises the sum of squared residuals
Residual plots reveal whether a linear model is appropriate
The intercept may not be meaningful if $x = 0$ is outside the data range

Can do

Skills

Calculate the least-squares regression line from summary statistics
Use the line to predict and calculate residuals
Critique a regression model using residual analysis

Key terms

Least-squares regression line: The line $\hat{y} = a + bx$ that minimises the sum of squared residuals.

Slope ($b$): The predicted change in $y$ for each 1-unit increase in $x$. Sign matches sign of $r$.

Intercept ($a$): Predicted $y$ when $x = 0$. May not be meaningful if $x = 0$ is outside the data range.

Residual: $y - \hat{y}$, the difference between observed and predicted. Positive = model underpredicted.

Interpolation: Predicting $y$ for an $x$ value inside the original data range. Generally reliable.

Extrapolation: Predicting $y$ for an $x$ value outside the original data range. Often unreliable.

Fitting the least-squares regression line

core concept

The least-squares regression line is the straight line that minimises the sum of the squared vertical distances from each point to the line.
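To see the least-squares property in action, here is a minimal Python sketch. The data set (hours studied and test scores) is invented for illustration, and the helper names `mean`, `sample_sd`, `correlation` and `sum_squared_residuals` are assumptions rather than anything defined in this lesson. It fits the line using $b = r \cdot s_y / s_x$ and $a = \bar{y} - b\bar{x}$, then checks that a few nudged lines all have a larger sum of squared residuals.

```python
from math import sqrt

# Hypothetical data (not from the lesson): hours studied and test score.
xs = [1, 2, 3, 4, 5, 6]
ys = [52, 58, 61, 68, 70, 77]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def sum_squared_residuals(a, b, xs, ys):
    # Residual = observed y minus predicted y-hat, squared and summed.
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Least-squares slope and intercept from the lesson's formulas.
b = correlation(xs, ys) * sample_sd(ys) / sample_sd(xs)
a = mean(ys) - b * mean(xs)
best = sum_squared_residuals(a, b, xs, ys)

# Nudging either coefficient away from the fitted values
# should only increase the sum of squared residuals.
for da, db in [(1, 0), (-1, 0), (0, 0.5), (0, -0.5), (2, -0.5)]:
    assert sum_squared_residuals(a + da, b + db, xs, ys) > best

print(f"y-hat = {a:.2f} + {b:.2f}x, sum of squared residuals = {best:.2f}")
```

Trying a handful of nudges is only a spot check, not a proof, but it shows what "minimises" means: any other straight line leaves a larger total of squared vertical gaps.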

The regression line always passes through $(\bar{x}, \bar{y})$. Residuals (purple) are the vertical gaps between data and line.

$$b = r \times \dfrac{s_y}{s_x} \qquad a = \bar{y} - b\bar{x} \qquad \hat{y} = a + bx$$

Key property: The regression line always passes through $(\bar{x}, \bar{y})$. The sign of slope $b$ always matches the sign of $r$.

Regression line: $\hat{y} = a + bx$ where $b = r \cdot s_y/s_x$ and $a = \bar{y} - b\bar{x}$; Line always passes through $(\bar{x}, \bar{y})$

Pause — copy the regression line $\hat{y} = a + bx$ with formulas $b = r \cdot \frac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$, and the key anchor fact: the line always passes through $(\bar{x}, \bar{y})$ into your book.

Did you get this? True or false: the least-squares regression line always passes through $(\bar{x}, \bar{y})$.

Worked examples

PROBLEM 1 · FIND THE REGRESSION LINE

For study hours ($x$) vs test scores ($y$): $\bar{x} = 5$, $s_x = 2$, $\bar{y} = 70$, $s_y = 10$, $r = 0.8$. Find (a) the regression equation, (b) predicted score for 6 hours, (c) residual if actual score was 78.

$b = 0.8 \times \dfrac{10}{2} = 4$

Slope formula: $b = r \cdot s_y/s_x$. Sign is positive (matches $r > 0$).
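The remaining parts of Problem 1 follow the same formulas. As a hedged sketch (the variable names and the `predict` helper are chosen here for illustration), the calculation for parts (a) to (c) can be laid out in Python:

```python
# Summary statistics given in Problem 1 (study hours vs test scores).
x_bar, s_x = 5, 2
y_bar, s_y = 70, 10
r = 0.8

b = r * s_y / s_x        # slope: b = r * s_y / s_x
a = y_bar - b * x_bar    # intercept: the line passes through (x_bar, y_bar)

def predict(x):
    """Predicted test score for x hours of study."""
    return a + b * x

y_hat = predict(6)       # part (b): prediction for 6 hours
residual = 78 - y_hat    # part (c): observed minus predicted

print(f"(a) y-hat = {a:g} + {b:g}x")
print(f"(b) predicted score for 6 hours: {y_hat:g}")
print(f"(c) residual for an actual score of 78: {residual:g}")
```

This gives $\hat{y} = 50 + 4x$, a predicted score of 74, and a residual of $78 - 74 = 4$. The positive residual means the student scored above the line's prediction.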

PROBLEM 2 · NEGATIVE SLOPE (HOUSE PRICES)

For house price ($y$, \$000s) vs distance from CBD ($x$, km): $\bar{x} = 10$, $s_x = 4$, $\bar{y} = 800$, $s_y = 200$, $r = -0.75$. Find the regression line and predict the price 15 km from the CBD.

$b = -0.75 \times \dfrac{200}{4} = -37.5$

Negative slope (matches $r < 0$): for each additional km from the CBD, the predicted price decreases by \$37,500 on average.
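To finish Problem 2, the intercept and the prediction can be computed the same way. The function name `line_from_summary` below is a hypothetical helper introduced for this sketch, not part of the lesson.

```python
def line_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for y-hat = a + b*x from summary statistics."""
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

# Summary statistics given in Problem 2 (price in thousands of dollars, distance in km).
a, b = line_from_summary(x_bar=10, s_x=4, y_bar=800, s_y=200, r=-0.75)
price_at_15 = a + b * 15

print(f"y-hat = {a:g} {b:+g}x")
print(f"predicted price 15 km from the CBD: {price_at_15:g} thousand dollars")
```

The line is $\hat{y} = 1175 - 37.5x$, so the predicted price 15 km from the CBD is $1175 - 37.5(15) = 612.5$, that is, \$612,500. Whether 15 km counts as interpolation depends on the distances in the original data, which the problem does not state.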

Interpreting slope and intercept in context

core concept

We just saw how to calculate the regression line $\hat{y} = a + bx$ from $r$, $s_y$, $s_x$, and the mean point. Once we have the equation, what do the numbers $a$ and $b$ actually mean in the real-world context of the data? The slope $b$ means "for each additional unit of $x$, $y$ changes by $b$ units on average"; interpret the intercept $a$ only if $x = 0$ is meaningful.

Slope interpretation template: "For each additional [unit of $x$], [variable $y$] [increases/decreases] by [|b|] units on average."

Intercept — when meaningful vs when not:

Meaningful

$x = 0$ in or near the data range and makes sense

$\hat{y} = 40.8 + 4.8x$ (exam score from study hours, 0–12 hours range): a student who studies 0 hours is predicted to score 40.8 — plausible.

Not meaningful

$x = 0$ is outside the data range or physically impossible

Birth weight from mother's age (data: ages 20–40): intercept = weight at age 0 — biologically impossible. This is the Think First example.

Rule

Only interpret the intercept when $x = 0$ is meaningful

If in doubt, state: "The intercept is not meaningful in this context because $x = 0$ is outside the data range / physically impossible."

Slope interpretation: "For each additional [unit of $x$], $y$ [increases/decreases] by $|b|$ units on average."; Only interpret the intercept if $x = 0$ is within or near the data range and makes practical sense
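The template and the intercept rule can be turned into a small checklist in code. This is a sketch under stated assumptions: both function names are hypothetical, and the `margin` parameter is one possible way to express "near the data range", which the lesson leaves to judgement.

```python
def interpret_slope(b, x_unit, y_name, y_unit):
    """Fill in the slope template for a fitted slope b."""
    direction = "increases" if b > 0 else "decreases"
    return (f"For each additional {x_unit}, {y_name} {direction} "
            f"by {abs(b):g} {y_unit} on average.")

def intercept_is_interpretable(x_min, x_max, zero_makes_sense, margin=0.0):
    """True only if x = 0 is within (or within `margin` of) the data range
    AND x = 0 makes practical sense in context."""
    zero_in_or_near_range = (x_min - margin) <= 0 <= (x_max + margin)
    return zero_in_or_near_range and zero_makes_sense

# Exam score from study hours: y-hat = 40.8 + 4.8x, data from 0 to 12 hours.
print(interpret_slope(4.8, "hour of study", "the predicted exam score", "marks"))
print(intercept_is_interpretable(0, 12, zero_makes_sense=True))

# Birth weight from mother's age: data from ages 20 to 40, age 0 is impossible.
print(intercept_is_interpretable(20, 40, zero_makes_sense=False))
```

The second argument to the rule, whether $x = 0$ makes sense, cannot be computed from the data. It has to come from knowing the context.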

Pause — copy the slope template ("for each additional [unit of $x$], $y$ increases/decreases by $|b|$ units on average") and the intercept caution (only interpret if $x = 0$ is within or near the data range) into your book.

Quick check: A regression of car value ($\$000$s) on age (years) gives $\hat{y} = 28 - 2.4x$. Which statement correctly interprets the slope?

Residuals and model adequacy

core concept

A residual is $y - \hat{y}$ (observed minus predicted). Plotting residuals against $x$ reveals whether the linear model is appropriate.

Good model

Residuals are scattered randomly around zero. No obvious pattern — linear model is appropriate.

Curved pattern (U-shape)

Linear model is wrong — relationship is curved. Consider a quadratic or logarithmic model.

Fan shape

Spread increases with $x$ (heteroscedasticity). Model is unreliable for large $x$ values. Consider transforming $y$.

Systematic trend in residuals: An important variable is missing from the model.

Cluster of large residuals: Outliers or subgroups with different relationships.
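A curved residual pattern is easy to reproduce. The sketch below uses made-up data that follow $y = x^2$ exactly (an assumption chosen so the arithmetic is clean), fits a straight line by least squares, and prints the sign of each residual in order of $x$.

```python
# Hypothetical curved data: y = x^2 for x = 0, 1, ..., 6.
xs = list(range(7))
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept from the raw data
# (algebraically the same as b = r * s_y / s_x and a = y_bar - b * x_bar).
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
     / sum((x - x_bar) ** 2 for x in xs))
a = y_bar - b * x_bar

# Residual = observed minus predicted, in order of increasing x.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

def sign(e, tol=1e-9):
    return "+" if e > tol else "-" if e < -tol else "0"

print("".join(sign(e) for e in residuals))  # prints +0---0+
```

The residuals are positive at both ends and negative in the middle: the U-shape that signals a straight line is the wrong model, even though the residuals still sum to zero.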

The danger of extrapolation

exam favourite

We have seen how to interpret the slope and intercept in context, including the caution about meaningless intercepts, and how residual plots test the model. What happens when we use the regression line to predict $y$ at an $x$-value far outside the data range: is that reliable? In short, extrapolation is dangerous; only interpolation (inside the data range) is generally trustworthy.

Interpolation — predicting $y$ for an $x$ value inside the data range. Generally safe and reliable.

Extrapolation — predicting $y$ for an $x$ value outside the data range. Dangerous and often wildly wrong.

Danger 1

The linear relationship may not hold

A regression of tree height vs age (5–20 years): $\hat{y} = 2 + 0.5 \times \text{age}$. Extrapolating to age 200 gives $2 + 0.5(200) = 102$ m, a height that only a handful of the world's tallest trees ever reach, because real growth slows with age.

Danger 2

Physical limits apply

A negative predicted weight or a test score over 100% signals that the model has been applied outside its valid range.
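One practical habit is to label every prediction as interpolation or extrapolation before trusting it. A minimal sketch, assuming the data range is known (the function name `predict_with_range_check` is hypothetical):

```python
def predict_with_range_check(a, b, x, x_min, x_max):
    """Return (y_hat, kind), where kind says whether x lies inside
    the range of x values the line was fitted on."""
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind

# Tree example from the text: height = 2 + 0.5 * age, fitted on ages 5 to 20.
print(predict_with_range_check(2, 0.5, 12, 5, 20))   # age inside the data range
print(predict_with_range_check(2, 0.5, 200, 5, 20))  # age far outside the data range
```

The label does not make an extrapolated value wrong by itself. It is a prompt to ask whether the linear relationship and any physical limits still hold at that $x$.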

COVID-19 projections (2020). Some early projections extended exponential growth indefinitely, implying billions of infections within weeks. These projections failed because they did not account for behavioural changes, growing immunity, and interventions, all of which alter the relationship outside the initial data range. It is a high-stakes reminder that a pattern fitted to early-phase data need not continue.

Interpolation (inside data range) = generally safe; extrapolation (outside) = dangerous; A good residual plot shows random scatter around zero — no pattern

Pause — copy the interpolation vs extrapolation distinction (inside data range = safe; outside = dangerous) and the residual plot test (random scatter around zero = good fit; pattern = poor fit) into your book.

Fill in the blank: For a data point with $x = 6$, $y = 45$, and regression line $\hat{y} = 30 + 2.5x$, the predicted value is $\hat{y} = $ ___ and the residual is ___.

Match each residual plot pattern to its meaning.

Random scatter around zero

Clear U-shape (curve)

Fan shape — spread grows with $x$

Linear model is appropriate

Relationship is curved, not linear

Heteroscedasticity — unreliable for large $x$

Activities

Activity 1 — Calculate

Find the regression line given: $\bar{x} = 4$, $s_x = 1.5$, $\bar{y} = 60$, $s_y = 12$, $r = 0.6$.

For the line $\hat{y} = 20 + 3x$, predict $y$ when $x = 5$ and calculate the residual if the observed value was 38.

A model predicts exam scores from hours studied using data from students who studied 2–10 hours. Is predicting for 15 hours interpolation or extrapolation? Is it reliable?

Given $\hat{y} = 10 - 2x$, interpret the slope and intercept in the context of predicting car value ($y$, in \$000s) from age ($x$, in years).

For a data point with $x = 6$, $y = 45$, and regression line $\hat{y} = 30 + 2.5x$, find the residual.

Activity 2 — Analyse and connect

A residual plot shows a clear U-shape. What does this tell you about the regression model? What should you do instead?

A regression of children's height vs age has an intercept of 50 cm. Is this meaningful? Explain.

Explain why predicting house prices from data collected in 2019 might fail badly for 2024, even if the model had $r = 0.95$.

Revisit your thinking

The intercept of $-2.5$ kg is not meaningful. It predicts a baby's weight when the mother is 0 years old — biologically impossible. The regression was fitted using data for mothers aged perhaps 20–40, so $x = 0$ is far outside the data range. This is extrapolation combined with a physically impossible scenario. The intercept is a mathematical artifact of the algebra, not a meaningful prediction. Critical principle: only interpret the intercept when $x = 0$ is within or near your data range and makes practical sense.

Short answer

Apply · Band 4 · 3 marks

Q1. For a data set relating temperature ($x$, °C) and ice cream sales ($y$, cones): $\bar{x} = 25$, $s_x = 5$, $\bar{y} = 150$, $s_y = 40$, $r = 0.9$. (a) Find the equation of the least-squares regression line. (b) Interpret the slope and intercept in context. (c) Predict sales when temperature is 22°C. (d) Calculate the residual if actual sales were 130 cones at 22°C. (3 marks)

Apply · Band 4 · 3 marks

Q2. A regression of car value ($y$, \$000s) on age ($x$, years) using data from cars 1–10 years old produces $\hat{y} = 28 - 2.4x$. (a) Interpret the slope. (b) Interpret the intercept. (c) Predict the value of a 12-year-old car. Is this prediction reliable? (d) Predict the value of a 15-year-old car and explain why this prediction is problematic. (3 marks)

Analyse · Band 5 · 3 marks

Q3. A real estate agent uses a regression model to predict house prices from lot size, built on data from suburban lots of 400–800 m². The agent wants to use it for: (a) a 600 m² lot in the same suburb, (b) a 200 m² apartment, (c) a 1500 m² rural property. For each case, evaluate the reliability of the prediction. (d) The residual plot shows a fan shape. What does this reveal, and what modelling improvement could address it? (3 marks)

Comprehensive answers

Activity 1:

1. $b = 0.6 \times (12/1.5) = 4.8$; $a = 60 - 4.8(4) = 40.8$; $\hat{y} = 40.8 + 4.8x$

2. $\hat{y} = 20 + 3(5) = 35$; Residual = $38 - 35 = 3$

3. Extrapolation (15 > 10). Not reliable — relationship may not remain linear beyond 10 hours; diminishing returns may apply.

4. Slope = $-2$: each additional year of age decreases predicted value by \$2,000 on average. Intercept = $10$: a new car (age 0) is predicted to be worth \$10,000.

5. $\hat{y} = 30 + 2.5(6) = 45$; Residual = $45 - 45 = 0$

Activity 2:

1. U-shape = non-linear relationship. Linear model is inappropriate. Consider quadratic or exponential model.

2. Meaningful — age 0 = birth, and 50 cm is a plausible newborn length. $x = 0$ is within the natural range of the data.

3. Extrapolation across time. Market conditions changed (interest rates, supply, post-pandemic). A 2019 model cannot capture 2024 economics even if it fit 2019 data perfectly.

Q1 (3 marks): (a) $b = 0.9 \times (40/5) = 7.2$ [0.5]; $a = 150 - 7.2(25) = -30$ [0.5]; $\hat{y} = -30 + 7.2x$ [0.5]. (b) Slope: each 1°C increase predicts 7.2 more cones sold on average [0.5]. Intercept: at 0°C, predicted sales are −30 — not meaningful (extrapolation below data range) [0.5]. (c) $\hat{y}(22) = -30 + 7.2(22) = 128.4$ cones [0.5]. (d) Residual = $130 - 128.4 = 1.6$ cones [0.5].

Q2 (3 marks): (a) Slope = $-2.4$: each additional year decreases predicted value by \$2,400 on average [0.5]. (b) Intercept = $28$: a new car (age 0) is predicted to be worth \$28,000; meaningful because $x = 0$ is near the data range [0.5]. (c) $\hat{y}(12) = 28 - 2.4(12) = 28 - 28.8 = -0.8$: negative, physically impossible; unreliable (12 is outside the 1–10 year range = extrapolation) [0.5+0.5]. (d) $\hat{y}(15) = 28 - 2.4(15) = 28 - 36 = -8$. Problematic: far outside data range; physically impossible; depreciation is rarely perfectly linear over long periods [0.5+0.5].

Q3 (3 marks): (a) 600 m²: Reliable — interpolation within 400–800 m², same suburb [0.5]. (b) 200 m² apartment: Unreliable — extrapolation below range; apartments have different pricing structures to suburban houses [0.5+0.5]. (c) 1500 m² rural: Very unreliable — extrapolation far above range; rural properties have different value drivers (zoning, infrastructure) [0.5+0.5]. (d) Fan shape = heteroscedasticity — errors increase with lot size, predictions less reliable for large lots. Improvement: transform $y$ (e.g. $\log(\text{price})$) or use weighted least squares [0.5+0.5].
