Regression Analysis
Correlation told you two variables move together. Regression tells you exactly how. The least-squares line is one of the most widely used prediction tools in statistics, but predicting far outside your data is dangerous, a negative intercept might be impossible, and a curved residual plot reveals that a straight line is simply the wrong model.
A regression predicts a baby's birth weight from the mother's age. The intercept is $-2.5$ kg. Is this meaningful? Explain before reading on.
Four formulas. The line, the slope, the intercept, and the residual — in that order.
Key facts
- $\hat{y} = a + bx$ where $b = r \cdot s_y/s_x$ and $a = \bar{y} - b\bar{x}$
- Residual = observed $-$ predicted = $y - \hat{y}$
- Interpolation = generally safe; extrapolation = dangerous
Concepts
- The regression line minimises the sum of squared residuals
- Residual plots reveal whether a linear model is appropriate
- The intercept may not be meaningful if $x = 0$ is outside the data range
Skills
- Calculate the least-squares regression line from summary statistics
- Use the line to predict and calculate residuals
- Critique a regression model using residual analysis
The least-squares regression line is the straight line that minimises the sum of the squared vertical distances from each point to the line.
The regression line always passes through $(\bar{x}, \bar{y})$. Residuals (purple) are the vertical gaps between data and line.
Key property: The regression line always passes through $(\bar{x}, \bar{y})$. The sign of slope $b$ always matches the sign of $r$.
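The two properties above can be checked numerically. The sketch below uses a small data set invented purely for illustration (the values of `xs` and `ys` are assumptions, not lesson data). It fits the line with $b = r \cdot s_y/s_x$ and $a = \bar{y} - b\bar{x}$, then confirms that nudging the slope or intercept makes the sum of squared residuals larger, and that the line passes through $(\bar{x}, \bar{y})$.

```python
from statistics import mean, stdev

# Hypothetical data, invented purely for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def least_squares(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r * s_y / s_x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    # Sample correlation coefficient r.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b


def sum_squared_residuals(a, b, xs, ys):
    """Sum of squared vertical gaps between each point and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


a, b = least_squares(xs, ys)
best = sum_squared_residuals(a, b, xs, ys)

# Nudging the intercept or slope in either direction increases the sum.
for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]:
    assert sum_squared_residuals(a + da, b + db, xs, ys) > best

# The fitted line passes through the mean point (x-bar, y-bar).
assert abs((a + b * mean(xs)) - mean(ys)) < 1e-9
```

For this invented data the fitted line is $\hat{y} = 2.2 + 0.6x$. The positive slope matches the positive correlation, as the key property says it must.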
Regression line: $\hat{y} = a + bx$ where $b = r \cdot s_y/s_x$ and $a = \bar{y} - b\bar{x}$; Line always passes through $(\bar{x}, \bar{y})$
Pause — copy the regression line $\hat{y} = a + bx$ with formulas $b = r \cdot \frac{s_y}{s_x}$ and $a = \bar{y} - b\bar{x}$, and the key anchor fact: the line always passes through $(\bar{x}, \bar{y})$ into your book.
Did you get this? True or false: the least-squares regression line always passes through $(\bar{x}, \bar{y})$.
Worked examples
For study hours ($x$) vs test scores ($y$): $\bar{x} = 5$, $s_x = 2$, $\bar{y} = 70$, $s_y = 10$, $r = 0.8$. Find (a) the regression equation, (b) predicted score for 6 hours, (c) residual if actual score was 78.
For house price ($y$, $\$000$s) vs distance from CBD ($x$, km): $\bar{x} = 10$, $s_x = 4$, $\bar{y} = 800$, $s_y = 200$, $r = -0.75$. Find the regression line and predict the price 15 km from the CBD.
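Both worked examples follow the same three steps: slope, then intercept, then substitute. A minimal sketch of that calculation is below; the function name `regression_from_summary` and the variable names are illustrative choices, not part of the lesson.

```python
def regression_from_summary(x_bar, s_x, y_bar, s_y, r):
    """Return (a, b) for y-hat = a + b*x from summary statistics."""
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept, so the line passes through (x_bar, y_bar)
    return a, b


# Study hours (x) vs test scores (y)
a1, b1 = regression_from_summary(x_bar=5, s_x=2, y_bar=70, s_y=10, r=0.8)
predicted_score = a1 + b1 * 6       # predicted score for 6 hours
residual = 78 - predicted_score     # observed minus predicted

# House price (y, in thousands of dollars) vs distance from CBD (x, km)
a2, b2 = regression_from_summary(x_bar=10, s_x=4, y_bar=800, s_y=200, r=-0.75)
price_15km = a2 + b2 * 15           # predicted price 15 km from the CBD
```

For the first example this gives $\hat{y} = 50 + 4x$, a predicted score of 74 and a residual of $78 - 74 = 4$. For the second it gives $\hat{y} = 1175 - 37.5x$ and a predicted price of 612.5, in thousands of dollars.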
We just saw how to calculate the regression line $\hat{y} = a + bx$ using $r$, $s_y$, $s_x$, and the mean point. That raises a question: once we have the equation, what do the numbers $a$ and $b$ actually mean in the real-world context of the data? This card answers it → $b$ means "for each additional unit of $x$, $y$ changes by $b$ units on average"; only interpret $a$ if $x = 0$ is meaningful.
Slope interpretation template: "For each additional [unit of $x$], [variable $y$] [increases/decreases] by [|b|] units on average."
Intercept — when meaningful vs when not:
Slope interpretation: "For each additional [unit of $x$], $y$ [increases/decreases] by $|b|$ units on average."; Only interpret the intercept if $x = 0$ is within or near the data range and makes practical sense
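The slope template is mechanical enough to write as a small helper. The sketch below is illustrative only: the function names, the unit label "points", and the `margin` parameter (a stand-in for "near the data range") are all assumptions. The range check covers only half of the intercept rule; whether $x = 0$ makes practical sense still needs your own judgement.

```python
def interpret_slope(b, x_unit, y_name, y_unit):
    """Fill in the slope template for a fitted slope b."""
    if b == 0:
        return f"{y_name} does not change with each additional {x_unit}, on average."
    direction = "increases" if b > 0 else "decreases"
    return (f"For each additional {x_unit}, {y_name} {direction} "
            f"by {abs(b):g} {y_unit} on average.")


def zero_is_in_range(x_min, x_max, margin=0.0):
    """True if x = 0 lies within the data range, allowing a hypothetical margin."""
    return x_min - margin <= 0 <= x_max + margin


# Slope 4 is the value that follows from the study-hours summary statistics
# in the first worked example; "points" is an assumed unit for test scores.
sentence = interpret_slope(4, "hour of study", "the predicted test score", "points")

# Mothers aged 20 to 40: x = 0 is far outside the range, so do not interpret a.
mothers_ok = zero_is_in_range(20, 40)
```

Here `sentence` reads "For each additional hour of study, the predicted test score increases by 4 points on average." and `mothers_ok` is `False`.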
Pause — copy the slope template ("for each additional [unit of $x$], $y$ increases/decreases by $|b|$ units on average") and the intercept caution (only interpret if $x = 0$ is within or near the data range) into your book.
Quick check: A regression of car value ($\$000$s) on age (years) gives $\hat{y} = 28 - 2.4x$. Which statement correctly interprets the slope?
A residual is $y - \hat{y}$ (observed minus predicted). Plotting residuals against $x$ reveals whether the linear model is appropriate.
- Systematic trend in residuals: an important variable may be missing from the model.
- Cluster of large residuals: outliers, or subgroups with different relationships.
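To see what a residual pattern looks like in numbers, the sketch below fits a straight line to data that actually follows a curve. The data set is invented for illustration ($y = x^2$ at five points). The slope is computed as $S_{xy}/S_{xx}$, which is algebraically the same as $r \cdot s_y/s_x$.

```python
from statistics import mean

# Hypothetical data that actually follows a curve (y = x^2).
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

x_bar, y_bar = mean(xs), mean(ys)

# Slope as S_xy / S_xx, equivalent to r * s_y / s_x.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residual = observed minus predicted.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(f"x = {x}: residual = {e:+.1f}")
```

The fitted line is $\hat{y} = -7 + 6x$ and the residuals are $+2, -1, -2, -1, +2$: positive at both ends and negative in the middle. They still sum to zero, but the U-shape shows that a straight line is the wrong model for this data.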
We just saw how to interpret the slope and intercept in context, including the caution about meaningless intercepts. That raises a question: what happens when we use the regression line to predict $y$ at an $x$-value far outside the data range — is that reliable? This card answers it → extrapolation is dangerous; only interpolation (inside the data range) is generally trustworthy.
Interpolation — predicting $y$ for an $x$ value inside the data range. Generally safe and reliable.
Extrapolation — predicting $y$ for an $x$ value outside the data range. Dangerous and often wildly wrong.
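One way to build the habit is to check the data range every time you predict. The sketch below uses the car-value line $\hat{y} = 28 - 2.4x$ from this lesson's questions, fitted on cars aged 1 to 10 years; the function name `predict` and its labels are illustrative assumptions.

```python
def predict(x, a=28.0, b=-2.4, x_min=1.0, x_max=10.0):
    """Return (y_hat, kind), where kind flags whether x is inside the data range."""
    y_hat = a + b * x
    kind = "interpolation" if x_min <= x <= x_max else "extrapolation"
    return y_hat, kind


for age in (5, 12, 15):
    value, kind = predict(age)
    print(f"age {age}: predicted value {value:.1f} ({kind})")
```

A 5-year-old car sits inside the data range and gets a sensible prediction of 16 (thousand dollars). At 12 and 15 years the same formula returns $-0.8$ and $-8$: the arithmetic still works, but the predictions are extrapolations and the values are impossible.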
Interpolation (inside data range) = generally safe; extrapolation (outside) = dangerous; A good residual plot shows random scatter around zero — no pattern
Pause — copy the interpolation vs extrapolation distinction (inside data range = safe; outside = dangerous) and the residual plot test (random scatter around zero = good fit; pattern = poor fit) into your book.
Fill in the blank: For a data point with $x = 6$, $y = 45$, and regression line $\hat{y} = 30 + 2.5x$, the predicted value is $\hat{y} = $ ___ and the residual is ___.
Match each residual plot pattern to its meaning.
Activities
Find the regression line given: $\bar{x} = 4$, $s_x = 1.5$, $\bar{y} = 60$, $s_y = 12$, $r = 0.6$.
For the line $\hat{y} = 20 + 3x$, predict $y$ when $x = 5$ and calculate the residual if the observed value was 38.
A model predicts exam scores from hours studied using data from students who studied 2–10 hours. Is predicting for 15 hours interpolation or extrapolation? Is it reliable?
Given $\hat{y} = 10 - 2x$, interpret the slope and intercept in the context of predicting car value ($y$, in $\$000$s) from age ($x$, in years).
For a data point with $x = 6$, $y = 45$, and regression line $\hat{y} = 30 + 2.5x$, find the residual.
A residual plot shows a clear U-shape. What does this tell you about the regression model? What should you do instead?
A regression of children's height vs age has an intercept of 50 cm. Is this meaningful? Explain.
Explain why predicting house prices from data collected in 2019 might fail badly for 2024, even if the model had $r = 0.95$.
The intercept of $-2.5$ kg is not meaningful. It predicts a baby's weight when the mother is 0 years old — biologically impossible. The regression was fitted using data for mothers aged perhaps 20–40, so $x = 0$ is far outside the data range. This is extrapolation combined with a physically impossible scenario. The intercept is a mathematical artifact of the algebra, not a meaningful prediction. Critical principle: only interpret the intercept when $x = 0$ is within or near your data range and makes practical sense.
Q1. For a data set relating temperature ($x$, °C) and ice cream sales ($y$, cones): $\bar{x} = 25$, $s_x = 5$, $\bar{y} = 150$, $s_y = 40$, $r = 0.9$. (a) Find the equation of the least-squares regression line. (b) Interpret the slope and intercept in context. (c) Predict sales when temperature is 22°C. (d) Calculate the residual if actual sales were 130 cones at 22°C. (3 marks)
Q2. A regression of car value ($y$, $\$000$s) on age ($x$, years) using data from cars 1–10 years old produces $\hat{y} = 28 - 2.4x$. (a) Interpret the slope. (b) Interpret the intercept. (c) Predict the value of a 12-year-old car. Is this prediction reliable? (d) Predict the value of a 15-year-old car and explain why this prediction is problematic. (3 marks)
Q3. A real estate agent uses a regression model to predict house prices from lot size, built on data from suburban lots of 400–800 m². The agent wants to use it for: (a) a 600 m² lot in the same suburb, (b) a 200 m² apartment, (c) a 1500 m² rural property. For each case, evaluate the reliability of the prediction. (d) The residual plot shows a fan shape. What does this reveal, and what modelling improvement could address it? (3 marks)
Comprehensive answers
Activity 1:
1. $b = 0.6 \times (12/1.5) = 4.8$; $a = 60 - 4.8(4) = 40.8$; $\hat{y} = 40.8 + 4.8x$
2. $\hat{y} = 20 + 3(5) = 35$; Residual = $38 - 35 = 3$
3. Extrapolation (15 > 10). Not reliable — relationship may not remain linear beyond 10 hours; diminishing returns may apply.
4. Slope = $-2$: each additional year of age decreases predicted value by $\$2{,}000$ on average. Intercept = $10$: a new car (age 0) is predicted to be worth $\$10{,}000$.
5. $\hat{y} = 30 + 2.5(6) = 45$; Residual = $45 - 45 = 0$
Activity 2:
1. U-shape = non-linear relationship. Linear model is inappropriate. Consider quadratic or exponential model.
2. Meaningful — age 0 = birth, and 50 cm is a plausible newborn length. $x = 0$ is within the natural range of the data.
3. Extrapolation across time. Market conditions changed (interest rates, supply, post-pandemic). A 2019 model cannot capture 2024 economics even if it fit 2019 data perfectly.
Q1 (3 marks): (a) $b = 0.9 \times (40/5) = 7.2$ [0.5]; $a = 150 - 7.2(25) = -30$ [0.5]; $\hat{y} = -30 + 7.2x$ [0.5]. (b) Slope: each 1°C increase predicts 7.2 more cones sold on average [0.5]. Intercept: at 0°C, predicted sales are −30 — not meaningful (extrapolation below data range) [0.5]. (c) $\hat{y}(22) = -30 + 7.2(22) = 128.4$ cones [0.5]. (d) Residual = $130 - 128.4 = 1.6$ cones [0.5].
Q2 (3 marks): (a) Slope = $-2.4$: each additional year decreases predicted value by $\$2{,}400$ on average [0.5]. (b) Intercept = $28$: a new car (age 0) is predicted to be worth $\$28{,}000$; meaningful because $x = 0$ is near the data range [0.5]. (c) $\hat{y}(12) = 28 - 2.4(12) = 28 - 28.8 = -0.8$: negative, physically impossible; unreliable (12 is outside the 1–10 year range = extrapolation) [0.5+0.5]. (d) $\hat{y}(15) = 28 - 2.4(15) = 28 - 36 = -8$. Problematic: far outside data range; physically impossible; depreciation is rarely perfectly linear over long periods [0.5+0.5].
Q3 (3 marks): (a) 600 m²: Reliable — interpolation within 400–800 m², same suburb [0.5]. (b) 200 m² apartment: Unreliable — extrapolation below range; apartments have different pricing structures to suburban houses [0.5+0.5]. (c) 1500 m² rural: Very unreliable — extrapolation far above range; rural properties have different value drivers (zoning, infrastructure) [0.5+0.5]. (d) Fan shape = heteroscedasticity — errors increase with lot size, predictions less reliable for large lots. Improvement: transform $y$ (e.g. $\log(\text{price})$) or use weighted least squares [0.5+0.5].