Biology • Year 12 • Module 5 • Lesson 16

Frequency Data and SNP Analysis

Apply allele-frequency arithmetic, Hardy–Weinberg expectations, and SNP comparison to real-style population data sets.

Apply · Data & Reasoning

1. Calculate and compare allele frequencies across populations

Researchers genotyped a single SNP (the T/C variant at locus rs4988235, near the LCT lactase gene) in four sampled human populations. Each population sample consisted of 100 unrelated individuals (200 alleles per population). 8 marks

Population	TT	TC	CC	Total individuals
Northern European	49	42	9	100
Southern European	16	48	36	100
East Asian	1	10	89	100
West African	4	30	66	100

Use the rule: p(T) = (2 × TT + TC) ÷ (2 × total individuals). Then q(C) = 1 − p.

1.1 Calculate the frequency of the T allele p in each of the four populations. Show working for at least one. 4 marks

1.2 Describe the trend in T-allele frequency across the four populations using cautious scientific language. 2 marks

1.3 Each sample is only 100 individuals. State two limitations of using these frequencies to characterise each whole population. 2 marks

Stuck? Card 1 frames the arithmetic; Card 2 frames the limitations.

2. Bar chart — observed allele frequencies at one SNP

The chart below plots the T-allele frequency from your calculations in Q1 across the four populations. 6 marks

Figure 2.1. T-allele frequency at SNP rs4988235 in four sampled populations. Stylised data after Bersaglieri et al. (2004), Am. J. Hum. Genet. 74:1111–1120.

2.1 Which population has the highest T-allele frequency, and which has the lowest? State the values. 2 marks

2.2 Calculate the difference in T-allele frequency between the Northern European and East Asian samples. 1 mark

2.3 A student concludes: "These four groups must be different species because the bar heights are so different." Explain, using lesson terms, what is wrong with that conclusion. 3 marks

Stuck? Card 4 — "What SNPs cannot do alone".

3. Hardy–Weinberg — observed vs expected genotypes

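Under Hardy–Weinberg equilibrium, the expected genotype frequencies are p² (TT), 2pq (TC) and q² (CC), which sum to 1: p² + 2pq + q² = 1. Use this to compare the observed and expected genotype counts in the Northern European sample (TT = 49, TC = 42, CC = 9; p(T) = 0.70, q(C) = 0.30). 7 marks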

Genotype	Observed count (n=100)	HW expected frequency	HW expected count
TT	49	p² = (0.70)² = 0.49	3.1 ____
TC	42	2pq = 2 × 0.70 × 0.30 = 0.42	3.2 ____
CC	9	q² = (0.30)² = 0.09	3.3 ____
Sum	100	1.00	100

3.4 Compare the observed and expected counts. Does this sample look close to Hardy–Weinberg equilibrium? Justify in one sentence. 2 marks

3.5 State two assumptions Hardy–Weinberg requires for the equation p² + 2pq + q² = 1 to hold. 2 marks

HW assumptions to draw on: large population, random mating, no migration, no mutation, no selection.

4. Punnett square — how allele frequency drives offspring genotype frequency

In an idealised, randomly mating population, the probability that a gamete carries allele T is p = 0.6 and the probability that it carries allele C is q = 0.4. Complete the Punnett square below (offspring genotype probabilities, expressed as decimals to 2 d.p.). 6 marks

	Egg: T (p = 0.6)	Egg: C (q = 0.4)
Sperm: T (p = 0.6)	4.1 Genotype: ____ P = ____	4.2 Genotype: ____ P = ____
Sperm: C (q = 0.4)	4.3 Genotype: ____ P = ____	4.4 Genotype: ____ P = ____

4.5 Sum the heterozygote probabilities (cells 4.2 and 4.3) and confirm that the four offspring probabilities sum to 1.00. 1 mark

4.6 Relate the four cell probabilities back to the Hardy–Weinberg expression p² + 2pq + q². Which cells correspond to which term? 1 mark

Each cell probability = (probability of egg allele) × (probability of sperm allele).

5. Apply — interpret a SNP-based relatedness claim

A press release reports: "DNA testing has shown that Population P and Population Q share 99.8% of their genome, but a single SNP at locus X differs between them. Therefore the two populations are essentially identical." Population P (n = 240) carries the A allele at frequency 0.05; Population Q (n = 260) carries the A allele at frequency 0.94 at locus X. 6 marks

5.1 Calculate the difference in A-allele frequency at locus X between the two populations. 1 mark

5.2 Explain why the press release's overall conclusion is too strong, despite the genome-wide similarity figure being high. Use the words marker, variation and sample. 3 marks

5.3 Suggest one way the team could strengthen their evidence for the claim about relatedness. 2 marks

Stuck? Card 4 — multiple markers, larger samples, representative sampling.

Answers — Do not peek before attempting

Q1.1 — Allele frequencies (4 marks)

Using p(T) = (2 × TT + TC) ÷ 200:

Northern European: (2 × 49 + 42) ÷ 200 = 140 ÷ 200 = 0.70 [1]
Southern European: (2 × 16 + 48) ÷ 200 = 80 ÷ 200 = 0.40 [1]
East Asian: (2 × 1 + 10) ÷ 200 = 12 ÷ 200 = 0.06 [1]
West African: (2 × 4 + 30) ÷ 200 = 38 ÷ 200 = 0.19 [1]
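
The arithmetic above can also be checked with a short script. This is a minimal sketch in Python, assuming the genotype counts from the Q1 table; the names genotype_counts, t_allele_frequency and frequencies are illustrative and are not part of the lesson materials.

```python
# Genotype counts (TT, TC, CC) copied from the Q1 table.
genotype_counts = {
    "Northern European": (49, 42, 9),
    "Southern European": (16, 48, 36),
    "East Asian": (1, 10, 89),
    "West African": (4, 30, 66),
}


def t_allele_frequency(tt, tc, cc):
    """Return p(T) = (2 * TT + TC) / (2 * total individuals)."""
    total_individuals = tt + tc + cc
    return (2 * tt + tc) / (2 * total_individuals)


# Apply the rule to every population in the table.
frequencies = {
    population: t_allele_frequency(*counts)
    for population, counts in genotype_counts.items()
}

for population, p in frequencies.items():
    q = 1 - p  # q(C) follows from p + q = 1
    print(f"{population}: p(T) = {p:.2f}, q(C) = {q:.2f}")
```

Each TT individual contributes two T alleles and each TC individual contributes one, which is why the numerator is 2 × TT + TC and the denominator is twice the number of individuals.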

Q1.2 — Trend (2 marks)

The T-allele frequency is highest in the Northern European sample (0.70), lower in the Southern European (0.40) and West African (0.19) samples, and lowest in the East Asian sample (0.06) [1]. Cautious wording describes this pattern as holding "in the sampled groups" rather than "in every member of these populations" [1].

Q1.3 — Limitations (2 marks)

Any two of: (a) a sample of only 100 individuals is small relative to the size of each population, so the estimates carry sampling error [1]; (b) one geographic sample may not represent the diversity within a population (e.g. one city only) [1]; (c) the data describe only one SNP locus, not the whole genome [1]; (d) sampling or recruitment may be biased toward certain groups [1].

Q2.1 — Highest / lowest (2 marks)

Highest: Northern European at 0.70 [1]. Lowest: East Asian at 0.06 [1].

Q2.2 — Difference (1 mark)

0.70 − 0.06 = 0.64 (or 64 percentage points) [1].

Q2.3 — Why the species claim is wrong (3 marks)

A SNP is a single nucleotide position, one of millions of variable positions in the genome, so one position cannot define species identity [1]. Frequency differences at one locus typically reflect normal genetic variation within a species, not separation between species [1]. All four populations are Homo sapiens; differences in this SNP's frequency reflect population history (e.g. selection for lactase persistence in populations with a history of dairying) and not speciation [1].

Q3.1–3.3 — HW expected counts (3 marks)

Expected counts = HW frequency × 100:

3.1 TT = 0.49 × 100 = 49
3.2 TC = 0.42 × 100 = 42
3.3 CC = 0.09 × 100 = 9

Q3.4 — Observed vs expected (2 marks)

Observed (49, 42, 9) matches expected (49, 42, 9) exactly [1], so this sample is consistent with Hardy–Weinberg equilibrium at this SNP [1].
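
The observed-versus-expected comparison in Q3.1–3.4 can be reproduced with a few lines of Python. This is a sketch only, assuming the Northern European counts from the Q3 table; hw_expected_counts, observed and expected are illustrative names, not part of the lesson.

```python
def hw_expected_counts(p, n):
    """Return Hardy-Weinberg expected genotype counts for n individuals."""
    q = 1 - p
    return {
        "TT": p ** 2 * n,     # p squared
        "TC": 2 * p * q * n,  # 2pq
        "CC": q ** 2 * n,     # q squared
    }


# Observed genotype counts for the Northern European sample (Q3 table).
observed = {"TT": 49, "TC": 42, "CC": 9}
n = sum(observed.values())

# Estimate p(T) from the same sample, as in Q1.
p = (2 * observed["TT"] + observed["TC"]) / (2 * n)

expected = hw_expected_counts(p, n)

for genotype, count in observed.items():
    print(f"{genotype}: observed = {count}, expected = {expected[genotype]:.1f}")
```

Note that p is estimated from the same sample it is then tested against, so a close match here shows consistency with Hardy–Weinberg expectations rather than proof that every assumption holds.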

Q3.5 — HW assumptions (2 marks; any two)

Any two of: large population (no genetic drift) [1]; random mating with respect to the locus [1]; no migration in or out [1]; no mutation at the locus [1]; no natural selection acting on the alleles [1].

Q4.1–4.4 — Punnett square cells (4 marks)

4.1 (T × T) — TT, P = 0.36 = p²
4.2 (T sperm × C egg) — TC, P = 0.24 = pq
4.3 (C sperm × T egg) — TC, P = 0.24 = pq
4.4 (C × C) — CC, P = 0.16 = q²

Q4.5 — Heterozygote total & check (1 mark)

Heterozygotes: 0.24 + 0.24 = 0.48 = 2pq. Total: 0.36 + 0.24 + 0.24 + 0.16 = 1.00. [1]

Q4.6 — Mapping back to HW (1 mark)

Cell 4.1 = p² (TT homozygotes); cells 4.2 + 4.3 = 2pq (heterozygotes); cell 4.4 = q² (CC homozygotes). The Punnett-square sum (p + q)² expands exactly to p² + 2pq + q². [1]
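
The Punnett-square answers for Q4 can be checked the same way. The sketch below assumes the gamete probabilities given in Q4 (p = 0.6, q = 0.4); gametes, cells and genotype_probs are illustrative names chosen for this example.

```python
from itertools import product

# Probability that a gamete carries each allele (values given in Q4).
gametes = {"T": 0.6, "C": 0.4}

# One Punnett-square cell per (sperm allele, egg allele) pair.
# Cell probability = P(sperm allele) * P(egg allele).
cells = {
    (sperm, egg): gametes[sperm] * gametes[egg]
    for sperm, egg in product(gametes, repeat=2)
}

# Pool the two heterozygote cells (T sperm x C egg, C sperm x T egg) as "TC".
genotype_probs = {}
for (sperm, egg), prob in cells.items():
    genotype = "".join(sorted((sperm, egg), reverse=True))
    genotype_probs[genotype] = genotype_probs.get(genotype, 0.0) + prob

for genotype, prob in genotype_probs.items():
    print(f"{genotype}: {prob:.2f}")

print(f"Total: {sum(cells.values()):.2f}")
```

Pooling the two heterozygote cells is what turns pq + pq into the 2pq term of the Hardy–Weinberg expression.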

Q5.1 — Allele-frequency difference (1 mark)

0.94 − 0.05 = 0.89 (an extremely large per-locus difference). [1]

Q5.2 — Why the press release overreaches (3 marks)

The 99.8% similarity figure is an average across the whole genome, so it says little about any individual marker; at locus X the A-allele frequency differs by 89 percentage points (0.05 vs 0.94) [1]. "Essentially identical" is therefore misleading: the populations show substantial variation at this marker, which may have biological or evolutionary significance [1]. Conclusions should be limited to the sample measured (n = 240 and n = 260) and be proportionate to the evidence; the press release dismisses a large marker-level difference in order to make a sweeping identity claim [1].

Q5.3 — Strengthening the evidence (2 marks)

Any one of: examine many SNPs across the genome rather than a single locus and report the pattern of differences [1]; increase and diversify sample sizes (more individuals across more locations) so the samples better represent each population [1]; report both genome-wide similarity and per-locus variation rather than mixing them.