Biology • Year 12 • Module 5 • Lesson 16
Frequency Data and SNP Analysis
Apply allele-frequency arithmetic, Hardy–Weinberg expectations, and SNP comparison to real-style population data sets.
1. Calculate and compare allele frequencies across populations
Researchers genotyped a single SNP (the T/C variant at locus rs4988235, near the LCT lactase gene) in four sampled human populations. Each population sample consisted of 100 unrelated individuals (200 alleles per population). 8 marks
| Population | TT | TC | CC | Total individuals |
|---|---|---|---|---|
| Northern European | 49 | 42 | 9 | 100 |
| Southern European | 16 | 48 | 36 | 100 |
| East Asian | 1 | 10 | 89 | 100 |
| West African | 4 | 30 | 66 | 100 |
Use the rule: p(T) = (2 × TT + TC) ÷ (2 × total individuals). Then q(C) = 1 − p.
1.1 Calculate the frequency of the T allele, p(T), in each of the four populations. Show working for at least one. 4 marks
1.2 Describe the trend in T-allele frequency across the four populations using cautious scientific language. 2 marks
1.3 Each sample is only 100 individuals. State two limitations of using these frequencies to characterise each whole population. 2 marks
2. Bar chart — observed allele frequencies at one SNP
The chart below plots the T-allele frequency from your calculations in Q1 across the four populations. 6 marks
Figure 2.1. T-allele frequency at SNP rs4988235 in four sampled populations. Stylised data after Bersaglieri et al. (2004), Am. J. Hum. Genet. 74:1111–1120.
2.1 Which population has the highest T-allele frequency, and which has the lowest? State the values. 2 marks
2.2 Calculate the difference in T-allele frequency between the Northern European and East Asian samples. 1 mark
2.3 A student concludes: "These four groups must be different species because the bar heights are so different." Explain, using lesson terms, what is wrong with that conclusion. 3 marks
3. Hardy–Weinberg — observed vs expected genotypes
Under Hardy–Weinberg equilibrium, the expected genotype frequencies are p² (TT), 2pq (TC) and q² (CC), which sum to 1: p² + 2pq + q² = 1. Use this to compare observed and expected genotype counts in the Northern European sample (TT=49, TC=42, CC=9; p(T)=0.70, q(C)=0.30). 7 marks
| Genotype | Observed count (n=100) | HW expected frequency | HW expected count |
|---|---|---|---|
| TT | 49 | p² = (0.70)² = 0.49 | 3.1 ____ |
| TC | 42 | 2pq = 2 × 0.70 × 0.30 = 0.42 | 3.2 ____ |
| CC | 9 | q² = (0.30)² = 0.09 | 3.3 ____ |
| Sum | 100 | 1.00 | 100 |
3.4 Compare the observed and expected counts. Does this sample look close to Hardy–Weinberg equilibrium? Justify in one sentence. 2 marks
3.5 State two assumptions Hardy–Weinberg requires for the equation p² + 2pq + q² = 1 to hold. 2 marks
4. Punnett square — how allele frequency drives offspring genotype frequency
In an idealised mating, the probability of each gamete carrying allele T is p = 0.6 and allele C is q = 0.4. Complete the Punnett square below, giving offspring genotype probabilities as decimals to 2 d.p. 6 marks
| | Egg: T (p = 0.6) | Egg: C (q = 0.4) |
|---|---|---|
| Sperm: T (p = 0.6) | 4.1 Genotype: ____ P = ____ | 4.2 Genotype: ____ P = ____ |
| Sperm: C (q = 0.4) | 4.3 Genotype: ____ P = ____ | 4.4 Genotype: ____ P = ____ |
4.5 Sum the heterozygote probabilities (cells 4.2 and 4.3) and confirm that the four offspring probabilities sum to 1.00. 1 mark
4.6 Relate the four cell probabilities back to the Hardy–Weinberg expression p² + 2pq + q². Which cells correspond to which term? 1 mark
5. Apply — interpret a SNP-based relatedness claim
A press release reports: "DNA testing has shown that Population P and Population Q share 99.8% of their genome, but a single SNP at locus X differs between them. Therefore the two populations are essentially identical." Population P (n = 240) carries the A allele at frequency 0.05; Population Q (n = 260) carries the A allele at frequency 0.94 at locus X. 6 marks
5.1 Calculate the difference in A-allele frequency at locus X between the two populations. 1 mark
5.2 Explain why the press release's overall conclusion is too strong, despite the genome-wide similarity figure being high. Use the words marker, variation and sample. 3 marks
5.3 Suggest one way the team could strengthen their evidence for the claim about relatedness. 2 marks
Q1.1 — Allele frequencies (4 marks)
Using p(T) = (2 × TT + TC) ÷ 200:
- Northern European: (2 × 49 + 42) ÷ 200 = 140 ÷ 200 = 0.70 [1]
- Southern European: (2 × 16 + 48) ÷ 200 = 80 ÷ 200 = 0.40 [1]
- East Asian: (2 × 1 + 10) ÷ 200 = 12 ÷ 200 = 0.06 [1]
- West African: (2 × 4 + 30) ÷ 200 = 38 ÷ 200 = 0.19 [1]
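The same arithmetic can be checked with a short script. This is only a sketch: the dictionary `samples` and the function name `t_allele_frequency` are illustrative names chosen here, and the genotype counts are copied from the Q1 table.

```python
# Genotype counts (TT, TC, CC) copied from the Q1 table.
samples = {
    "Northern European": (49, 42, 9),
    "Southern European": (16, 48, 36),
    "East Asian": (1, 10, 89),
    "West African": (4, 30, 66),
}


def t_allele_frequency(tt, tc, cc):
    """Return p(T) = (2 * TT + TC) / (2 * total individuals)."""
    total_individuals = tt + tc + cc
    return (2 * tt + tc) / (2 * total_individuals)


# p(T) for every sampled population; q(C) follows as 1 - p.
frequencies = {
    name: t_allele_frequency(*counts) for name, counts in samples.items()
}

for name, p in frequencies.items():
    print(f"{name}: p(T) = {p:.2f}, q(C) = {1 - p:.2f}")
```

Each TT individual contributes two T alleles and each TC individual contributes one, which is why the numerator is 2 × TT + TC and the denominator is twice the number of individuals.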
Q1.2 — Trend (2 marks)
The T-allele frequency is highest in the Northern European sample (~0.70) and falls progressively through the Southern European (~0.40) and West African (~0.19) samples to the lowest in the East Asian sample (~0.06) [1]. The trend should be described as "in the sampled groups" rather than "in every member of these populations" [1].
Q1.3 — Limitations (2 marks)
Any two of: (a) sample of only 100 is small relative to the size of each population [1]; (b) one geographic sample may not be representative of the diversity within a population (e.g. one city only) [1]; (c) data describes only one SNP locus, not the whole genome; (d) sampling/recruitment may be biased toward certain groups.
Q2.1 — Highest / lowest (2 marks)
Highest: Northern European at 0.70 [1]. Lowest: East Asian at 0.06 [1].
Q2.2 — Difference (1 mark)
0.70 − 0.06 = 0.64 (or 64 percentage points) [1].
Q2.3 — Why the species claim is wrong (3 marks)
A SNP is a single position out of millions in the genome; one position cannot define species identity [1]. Frequency differences at one locus typically reflect normal genetic variation within a species, not separation between species [1]. All four populations are Homo sapiens; differences in this SNP's frequency reflect population history (e.g. selection for lactase persistence in groups with a history of dairying), not speciation [1].
Q3.1–3.3 — HW expected counts (3 marks)
Expected counts = HW frequency × 100:
- 3.1 TT = 0.49 × 100 = 49
- 3.2 TC = 0.42 × 100 = 42
- 3.3 CC = 0.09 × 100 = 9
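As a cross-check, the expected counts can be derived directly from the observed genotype counts. The sketch below assumes nothing beyond the Hardy–Weinberg expression used in Q3; the function name `hw_expected_counts` is a hypothetical helper, not part of the lesson materials.

```python
def hw_expected_counts(tt, tc, cc):
    """Return Hardy-Weinberg expected genotype counts for one sample."""
    n = tt + tc + cc                 # number of individuals
    p = (2 * tt + tc) / (2 * n)      # frequency of allele T
    q = 1 - p                        # frequency of allele C
    return {"TT": p * p * n, "TC": 2 * p * q * n, "CC": q * q * n}


# Northern European sample from Q3.
observed = {"TT": 49, "TC": 42, "CC": 9}
expected = hw_expected_counts(49, 42, 9)

for genotype in observed:
    print(
        f"{genotype}: observed {observed[genotype]}, "
        f"expected {expected[genotype]:.1f}"
    )
```

The expected counts always sum to the sample size, because p² + 2pq + q² = (p + q)² = 1.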
Q3.4 — Observed vs expected (2 marks)
Observed counts (49, 42, 9) match the expected counts (49, 42, 9) exactly [1], so this sample is consistent with Hardy–Weinberg equilibrium at this SNP [1].
Q3.5 — HW assumptions (2 marks; any two)
Any two of: large population (no genetic drift) [1]; random mating with respect to the locus [1]; no migration in or out [1]; no mutation at the locus [1]; no natural selection acting on the alleles [1].
Q4.1–4.4 — Punnett square cells (4 marks)
- 4.1 (T × T) — TT, P = 0.36 = p²
- 4.2 (T sperm × C egg) — TC, P = 0.24 = pq
- 4.3 (C sperm × T egg) — TC, P = 0.24 = pq
- 4.4 (C × C) — CC, P = 0.16 = q²
Q4.5 — Heterozygote total & check (1 mark)
Heterozygotes: 0.24 + 0.24 = 0.48 = 2pq. Total: 0.36 + 0.24 + 0.24 + 0.16 = 1.00. [1]
Q4.6 — Mapping back to HW (1 mark)
Cell 4.1 = p² (TT homozygotes); cells 4.2 + 4.3 = 2pq (heterozygotes); cell 4.4 = q² (CC homozygotes). The Punnett-square sum (p + q)² expands exactly to p² + 2pq + q². [1]
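The Punnett-square cells in Q4 can be reproduced by multiplying the gamete probabilities pairwise. This is a minimal sketch under the question's assumption that sperm and egg are drawn independently with the same allele probabilities; the names `gametes`, `cells` and `totals` are illustrative.

```python
from itertools import product

# Gamete allele probabilities given in Q4 (the same for sperm and egg).
gametes = {"T": 0.6, "C": 0.4}

# Build the four cells in the order 4.1, 4.2, 4.3, 4.4.
cells = []
for (sperm, p_sperm), (egg, p_egg) in product(gametes.items(), repeat=2):
    # Write the genotype with T first so that "CT" and "TC" are both "TC".
    genotype = "".join(sorted((sperm, egg), reverse=True))
    cells.append((sperm, egg, genotype, p_sperm * p_egg))

# Combine the two heterozygote cells into a single genotype total.
totals = {}
for _, _, genotype, probability in cells:
    totals[genotype] = totals.get(genotype, 0.0) + probability

for sperm, egg, genotype, probability in cells:
    print(f"sperm {sperm} x egg {egg}: {genotype}, P = {probability:.2f}")
print({genotype: round(value, 2) for genotype, value in totals.items()})
```

Summing the two TC cells gives the 2pq term, and the three genotype totals together add to 1.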
Q5.1 — Allele-frequency difference (1 mark)
0.94 − 0.05 = 0.89 (an extremely large per-locus difference). [1]
Q5.2 — Why the press release overreaches (3 marks)
The 99.8% figure is an average across the whole genome, so it can hide large differences at individual loci: at this one marker the A-allele frequency differs by nearly 90 percentage points [1]. "Essentially identical" is therefore misleading, because the populations show substantial variation at locus X, which may have biological or evolutionary significance [1]. Conclusions should be proportional to the sample actually measured; the press release turns a genome-wide average into a sweeping identity claim and ignores marker-level variation [1].
Q5.3 — Strengthening the evidence (2 marks)
Any one of: examine many SNPs across the genome rather than a single locus and report the pattern of differences [1]; increase and diversify sample sizes (more individuals across more locations) so the samples better represent each population [1]; report both genome-wide similarity and per-locus variation rather than mixing them.