Mathematics Advanced • Year 12 • Module 5 • Lesson 9

Bivariate Data Analysis

Build procedural fluency in describing scatter plots, calculating Pearson's r from raw data, and interpreting r² as the proportion of variation explained.

Build · Skill Drill

1. Quick recall

Answer each question in the space provided. 1 mark each

Q1.1 State the range of Pearson's r:

______ ≤ r ≤ ______ and describe what each extreme value means: r = +1 → ______________________, r = −1 → ______________________.

Q1.2 Name the three features used to describe a scatter plot.

________________, ________________, ________________

Q1.3 If r = 0.7, state r² and complete the sentence: "____% of the variation in y can be explained by the linear relationship with x."

Stuck? Revisit lesson § Formula Reference and § Coefficient of Determination.

2. Worked example — Pearson's r for 5 (x, y) pairs

Data: (2, 55), (4, 62), (5, 70), (6, 75), (8, 88) — study hours x vs test scores y.

Problem. Calculate Pearson's correlation coefficient r using the computational formula.

Step 1 — Build the sums (set up a tidy table).

x: 2 + 4 + 5 + 6 + 8 = 25
y: 55 + 62 + 70 + 75 + 88 = 350
x²: 4 + 16 + 25 + 36 + 64 = 145
y²: 3025 + 3844 + 4900 + 5625 + 7744 = 25,138
xy: 110 + 248 + 350 + 450 + 704 = 1,862
n = 5

Step 2 — Apply the computational formula.

r = [ n·Σxy − (Σx)(Σy) ] / √{ [ n·Σx² − (Σx)² ] [ n·Σy² − (Σy)² ] }

Step 3 — Plug in.

Numerator = 5(1862) − (25)(350) = 9310 − 8750 = 560
Denominator = √{ [5(145) − 25²][5(25138) − 350²] }
= √{ [725 − 625][125,690 − 122,500] }
= √{ 100 × 3190 } = √319,000 ≈ 564.8

Step 4 — Compute r and r².

r = 560 / 564.8 ≈ 0.9915
r² = 560² / 319,000 ≈ 0.983

Conclusion. r ≈ 0.99 indicates a very strong positive linear relationship between study hours and test scores. About 98% of the variation in test scores is explained by the linear relationship with study hours.
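Optional check. The four steps above can be sketched in Python as a way to verify hand working. The function name pearson_r and the variable study are illustrative choices, not part of the lesson; the formula is the computational formula from Step 2.

```python
from math import sqrt

def pearson_r(pairs):
    """Pearson's r via the computational (sums) formula."""
    n = len(pairs)
    # Step 1: build the five sums from the (x, y) pairs.
    sum_x = sum(x for x, _ in pairs)
    sum_y = sum(y for _, y in pairs)
    sum_x2 = sum(x * x for x, _ in pairs)
    sum_y2 = sum(y * y for _, y in pairs)
    sum_xy = sum(x * y for x, y in pairs)
    # Steps 2 and 3: numerator and denominator of the formula.
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    # Step 4: the ratio is r.
    return numerator / denominator

# Study hours x vs test scores y from the worked example.
study = [(2, 55), (4, 62), (5, 70), (6, 75), (8, 88)]
r = pearson_r(study)
print(round(r, 2), round(r * r, 2))  # 0.99 0.98
```

Squaring the unrounded r (rather than a rounded value) avoids a small rounding drift in r².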

3. Faded example — fill in the missing steps

Calculate r for the data (1, 3), (2, 5), (3, 4), (4, 8), (5, 9). Fill in each blank. 4 marks

Step 1 — Sums.

Σx = 1+2+3+4+5 = ______ · Σy = 3+5+4+8+9 = ______

Σx² = 1+4+9+16+25 = ______ · Σy² = 9+25+16+64+81 = ______

Σxy = 1(3)+2(5)+3(4)+4(8)+5(9) = ______ · n = ______

Step 2 — Numerator. n·Σxy − (Σx)(Σy) = ____·____ − ____·____ = ____________

Step 3 — Denominator pieces.

n·Σx² − (Σx)² = ____·____ − ____² = ____________

n·Σy² − (Σy)² = ____·____ − ____² = ____________

Step 4 — r and r². r = ______ / √( ____ × ____ ) ≈ ____________ · r² ≈ ____________

Conclusion. r ≈ ____ describes a ____________ correlation; r² ≈ ____ means ____% of the variation in y is explained.

Stuck? Revisit lesson § Worked Example.

4. Graduated practice — calculate or interpret r

Show your working. Quote r and r² to 2 decimal places.

Foundation — single-step calculations (4 questions)

4.1 If r = 0.6, find r² and complete: "______% of variation explained". 1 mark

4.2 If r = −0.9, find r² and state the strength descriptor. 1 mark

4.3 State the value of r for: (1, 2), (2, 4), (3, 6), (4, 8). No formula needed; explain in one phrase. 1 mark

4.4 From the lesson's strength table, classify |r| = 0.42 (e.g. "weak", "moderate"). 1 mark

Standard — typical HSC difficulty (6 questions)

Compute r using the computational formula. Show the sums.

4.5 Calculate r for (1, 8), (2, 5), (3, 4), (4, 3), (5, 2). Describe the relationship and find r². 3 marks

4.6 Calculate r for (1, 1), (2, 4), (3, 9), (4, 16). Explain in one sentence why r is < 1 despite the clear pattern. 3 marks

4.7 A study reports r = 0.3 between hours of TV watched and exam scores. State r² and answer: is this a practically significant relationship? Justify in one sentence. 2 marks

4.8 For a data set with r = 0.95, sketch (in the space below) the kind of scatter plot you would expect, and state what percentage of variation in y is explained by x. 2 marks

4.9 The scatter plot of two variables looks like a perfect parabola (a "U" shape). What value of Pearson's r is consistent with this, and why does this not mean there is no relationship? 2 marks

4.10 A small data set has Σx = 30, Σy = 60, Σxy = 372, Σx² = 220, Σy² = 736, n = 5. Compute r. 2 marks

Extension — combine concepts (2 questions)

4.11 A data set has r = 0.4. An additional data point is added that lies exactly on the original trend line. State (without recomputing) whether r will increase, decrease, or stay roughly the same, and justify in one sentence. 3 marks

4.12 Two studies of the same phenomenon report r = 0.9 (Study 1) and r = 0.3 (Study 2). State the proportion of variation explained in each, then explain in one sentence what it tells us — geometrically — about the spread of points around the trend line in each study. 3 marks

Stuck on 4.9? Recall: r measures linear association only.

5. Self-check the easy 3

Tick the first three once you've checked your method works.

For 4.1 I squared 0.6 to get r² = 0.36, then said "36% of variation explained" — not 60%.

For 4.2 I noted that r² = 0.81 ignores the sign, and that |r| = 0.9 is a very strong correlation (the 0.90 – 1.00 band).

For 4.3 I named r = +1 because the points (1,2), (2,4), (3,6), (4,8) lie exactly on the line y = 2x.

How did this worksheet feel?

Got it · Partly · Lost

What I'll revisit before next class:

Answers — Do not peek before attempting

Q1.1 — Range of r

−1 ≤ r ≤ +1. r = +1: perfect positive linear relationship. r = −1: perfect negative linear relationship.

Q1.2 — Three scatter-plot features

Direction (positive / negative / none), form (linear / curved / clustered), strength (strong / moderate / weak).

Q1.3 — r = 0.7

r² = 0.49 → 49% of the variation in y is explained by the linear relationship with x.

Q3 — Faded example: (1, 3), (2, 5), (3, 4), (4, 8), (5, 9)

Σx = 15, Σy = 29, Σx² = 55, Σy² = 195, Σxy = 3 + 10 + 12 + 32 + 45 = 102, n = 5.
Numerator = 5(102) − 15(29) = 510 − 435 = 75.
n·Σx² − (Σx)² = 275 − 225 = 50. n·Σy² − (Σy)² = 975 − 841 = 134.
r = 75 / √(50 × 134) = 75 / √6700 ≈ 0.916. r² ≈ 0.840.
Conclusion: very strong positive correlation; ≈ 84% of variation in y is explained by the linear relationship with x.

Q4.1 — r = 0.6

r² = 0.36 → 36% of variation explained.

Q4.2 — r = −0.9

r² = 0.81 (the sign drops out). |r| = 0.9 falls in the 0.90 – 1.00 band → very strong (negative) correlation.

Q4.3 — (1, 2), (2, 4), (3, 6), (4, 8)

r = +1 — the four points lie exactly on the line y = 2x (perfect positive linear relationship).

Q4.4 — Classify |r| = 0.42

Falls in the 0.30 – 0.50 band → moderate correlation.

Q4.5 — (1, 8), (2, 5), (3, 4), (4, 3), (5, 2)

Σx = 15, Σy = 22, Σx² = 55, Σy² = 118, Σxy = 8 + 10 + 12 + 12 + 10 = 52, n = 5.
Numerator = 5(52) − 15(22) = 260 − 330 = −70.
n·Σx² − (Σx)² = 275 − 225 = 50. n·Σy² − (Σy)² = 590 − 484 = 106.
r = −70 / √(50 × 106) = −70 / √5300 ≈ −70 / 72.80 ≈ −0.96. r² ≈ 0.92 → very strong negative linear relationship; ≈ 92% of variation in y is explained.

Q4.6 — (1, 1), (2, 4), (3, 9), (4, 16)

The points lie on y = x² (a perfect parabola). Σx = 10, Σy = 30, Σx² = 30, Σy² = 354, Σxy = 1(1)+2(4)+3(9)+4(16) = 1+8+27+64 = 100, n = 4.
Numerator = 4(100) − 10(30) = 400 − 300 = 100.
n·Σx² − (Σx)² = 120 − 100 = 20. n·Σy² − (Σy)² = 1416 − 900 = 516.
r = 100 / √(20 × 516) = 100 / √10320 ≈ 0.984. r² ≈ 0.969. r is < 1 because the true relationship is curved (y = x²), so no straight line passes through all four points; there are non-zero vertical gaps between the points and any line. r measures linear association only, not how well a curve fits.

Q4.7 — r = 0.3 for TV vs exam scores

r² = 0.09 → only 9% of the variation in exam scores is explained by the linear relationship with TV hours. This is not a practically significant relationship: the remaining 91% of the variation is left unexplained and must be associated with other factors (study time, sleep, etc.).

Q4.8 — Scatter plot for r = 0.95

Sketch: tightly clustered points hugging an upward-sloping straight line, with very little vertical scatter. r² = 0.9025 → about 90% of variation in y is explained by x.

Q4.9 — Perfect parabola

r ≈ 0 (or very small) when the points are spread evenly across both arms of the U: the negative association on the falling arm cancels the positive association on the rising arm. This does not mean there is no relationship: there is a strong non-linear relationship that Pearson's r is simply blind to.
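Optional check. A minimal Python sketch of the cancellation, assuming illustrative data on y = x² chosen symmetrically about the vertex (the x values and the function name pearson_r are not from the lesson).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r via the computational (sums) formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Full U shape: both arms of y = x^2, symmetric about x = 0.
xs_full = [-3, -2, -1, 0, 1, 2, 3]
ys_full = [x * x for x in xs_full]
r_full = pearson_r(xs_full, ys_full)
print(r_full)  # 0.0

# Rising arm only: the same curve, but now r is large and positive.
xs_arm = [0, 1, 2, 3]
ys_arm = [x * x for x in xs_arm]
r_arm = pearson_r(xs_arm, ys_arm)
print(r_arm > 0.9)  # True
```

The contrast shows that r depends on which part of the curve is sampled: a single arm looks nearly linear, while the full symmetric U gives r = 0.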

Q4.10 — From sums Σx = 30, Σy = 60, Σxy = 372, Σx² = 220, Σy² = 736, n = 5

Numerator = 5(372) − 30(60) = 1860 − 1800 = 60. n·Σx² − (Σx)² = 1100 − 900 = 200. n·Σy² − (Σy)² = 3680 − 3600 = 80. r = 60 / √(200 × 80) = 60 / √16000 ≈ 0.474 (≈ moderate).
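Optional check. When only the sums are given, the same formula can be evaluated directly. This is a small sketch; the function name r_from_sums and its parameter names are hypothetical.

```python
from math import sqrt

def r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson's r from pre-computed sums (computational formula)."""
    numerator = n * sum_xy - sum_x * sum_y
    x_term = n * sum_x2 - sum_x ** 2  # n·Σx² − (Σx)²
    y_term = n * sum_y2 - sum_y ** 2  # n·Σy² − (Σy)²
    return numerator / sqrt(x_term * y_term)

# Sums quoted in Q4.10.
r = r_from_sums(n=5, sum_x=30, sum_y=60, sum_xy=372, sum_x2=220, sum_y2=736)
print(round(r, 2))  # 0.47
```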

Q4.11 — Adding a point exactly on the trend line

r will increase slightly (or stay the same if the new point sits at the mean point (x̄, ȳ)). A point exactly on the trend line adds no scatter about the line but generally adds to the total spread of the data, so a larger share of the variation is explained by the line and r moves closer to 1.
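Optional check. The effect can be explored numerically. The sketch below uses a small hypothetical data set (not from the lesson), takes "trend line" to mean the least-squares line, and compares r before and after adding a point on that line, once far from the centre of the data and once at the mean point.

```python
from math import sqrt

def fit(pairs):
    """Return (r, slope, intercept) for the least-squares line through pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    syy = sum(y * y for _, y in pairs)
    sxy = sum(x * y for x, y in pairs)
    num = n * sxy - sx * sy
    dx = n * sxx - sx ** 2
    dy = n * syy - sy ** 2
    slope = num / dx
    intercept = (sy - slope * sx) / n
    return num / sqrt(dx * dy), slope, intercept

# Hypothetical data with a weak positive trend (mean of x is 3).
data = [(1, 2), (2, 5), (3, 1), (4, 6), (5, 3)]
r_before, slope, intercept = fit(data)

def r_with_extra_point(x_new):
    """r after adding the point on the original trend line at x_new."""
    y_new = slope * x_new + intercept
    return fit(data + [(x_new, y_new)])[0]

r_far = r_with_extra_point(10)   # on the line, far from the mean point
r_mean = r_with_extra_point(3)   # on the line, at the mean point

print(r_before < r_far)               # True
print(abs(r_mean - r_before) < 1e-9)  # True
```

In this example the far point raises r noticeably, while the point at the mean leaves r unchanged.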

Q4.12 — r = 0.9 vs r = 0.3

r² = 0.81 → 81% of variation in y explained (Study 1). r² = 0.09 → 9% explained (Study 2). Geometrically: Study 1's scatter plot has points tightly hugging the trend line; Study 2's scatter plot has points loosely scattered around the trend line, with most of the up-and-down variation driven by something other than the linear relationship with x.
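Optional check. The "spread around the trend line" reading of r² can be made concrete: for a least-squares line, r² equals 1 minus the ratio of the squared vertical gaps about the line to the squared gaps about the mean of y. The sketch below checks this on the worked example data from Section 2; the variable names are illustrative.

```python
# Study hours x vs test scores y from the worked example.
pairs = [(2, 55), (4, 62), (5, 70), (6, 75), (8, 88)]
n = len(pairs)
sx = sum(x for x, _ in pairs)
sy = sum(y for _, y in pairs)
sxx = sum(x * x for x, _ in pairs)
syy = sum(y * y for _, y in pairs)
sxy = sum(x * y for x, y in pairs)

# Least-squares trend line y = slope * x + intercept.
slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
intercept = (sy - slope * sx) / n
mean_y = sy / n

# Squared vertical gaps: about the trend line, and about the mean of y.
gaps_about_line = sum((y - (slope * x + intercept)) ** 2 for x, y in pairs)
gaps_about_mean = sum((y - mean_y) ** 2 for _, y in pairs)

r_squared_from_gaps = 1 - gaps_about_line / gaps_about_mean
r_squared_from_formula = (n * sxy - sx * sy) ** 2 / (
    (n * sxx - sx ** 2) * (n * syy - sy ** 2)
)

print(round(r_squared_from_gaps, 3))     # 0.983
print(round(r_squared_from_formula, 3))  # 0.983
```

Small gaps about the line relative to the overall spread give r² near 1 (as in Study 1); gaps nearly as large as the overall spread give r² near 0 (as in Study 2).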