Biology • Year 12 • Module 5 • Lesson 16
Frequency Data and SNP Analysis
Build HSC band 5–6 extended-response technique on interpreting allele-frequency data, Hardy–Weinberg expectations and SNP analysis — including their limits.
1. Stimulus-based extended response — interpret a Hardy–Weinberg departure (Band 5–6)
8 marks Band 5–6
Stimulus. A research team sampled 400 adults from a small, isolated island village and genotyped them at a single SNP linked to a recessive metabolic condition. The C allele has frequency q = 0.30; T allele p = 0.70. The team report the following observed counts and Hardy–Weinberg expected counts.
| Genotype | HW expected count (n=400) | Observed count |
|---|---|---|
| TT | 196 (p² × 400) | 232 |
| TC | 168 (2pq × 400) | 104 |
| CC | 36 (q² × 400) | 64 |
The team note that the village has been geographically isolated for ~200 years, with documented marriages overwhelmingly within a small number of extended families.
Q1. Analyse and explain, using the data and at least two Hardy–Weinberg assumptions, why the observed genotype distribution departs from the Hardy–Weinberg expectation. Comment on what conclusions are — and are not — safe to draw from a single SNP measured in this sample.
2. Multi-criteria evaluation — interpreting SNP frequency data (Band 5–6)
7 marks Band 5–6
Q2. Compare and evaluate the use of (a) a single SNP and (b) a whole-genome SNP panel (thousands of SNPs) for inferring relatedness between two human populations. In your response you must:
- Define a SNP and explain how SNP frequencies are calculated from genotype counts.
- Compare the two approaches on at least three criteria (e.g. resolution, sensitivity to single-locus selection, susceptibility to sampling error, ease of interpretation).
- Use a worked numerical example to show how a per-locus frequency difference (e.g. 0.05 vs 0.94 in Worksheet 2) can coexist with high genome-wide similarity.
- Reach a context-aware judgement — not a one-winner ranking.
3. Evaluate this claim (Band 5–6)
6 marks Band 5–6
"A small sample of 30 people gave a trait frequency of 80% in Population X. Therefore exactly 80% of all individuals in Population X have this trait, and any change in this frequency between generations must be due to natural selection."
Q3. Evaluate this claim. Identify which parts are correct, which are wrong, and rewrite the claim into a biologically defensible statement using the lesson's framing of frequency data, sample size and the multiple causes of allele-frequency change.
Q1 — Sample Band 6 response (8 marks), annotated
Under Hardy–Weinberg, the expected genotype frequencies are p² (TT), 2pq (TC) and q² (CC), which together satisfy p² + 2pq + q² = 1. With p(T) = 0.70 and q(C) = 0.30 in a sample of 400, expected counts are 0.49 × 400 = 196 TT, 2(0.70)(0.30) × 400 = 168 TC, and 0.09 × 400 = 36 CC. [1 — states HW + computes expecteds]
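The expected counts above can be reproduced with a short calculation. The following is a minimal Python sketch using the stimulus values; the function name `hw_expected_counts` is illustrative and not part of the lesson.

```python
def hw_expected_counts(p, n):
    """Return Hardy-Weinberg expected genotype counts for a two-allele locus.

    p is the frequency of the T allele, n is the number of individuals sampled.
    """
    q = 1.0 - p  # frequency of the C allele
    return {
        "TT": p * p * n,      # p squared
        "TC": 2 * p * q * n,  # 2pq
        "CC": q * q * n,      # q squared
    }


expected = hw_expected_counts(p=0.70, n=400)
for genotype, count in expected.items():
    print(f"{genotype}: {count:.0f}")
```

The three counts sum to the sample size, which is a quick check that the working is consistent.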
The observed counts depart from this in a characteristic way: TT is elevated (232 vs 196), CC is elevated (64 vs 36), and TC is depressed (104 vs 168). [1 — numerical departure] Qualitatively, the village has a deficit of heterozygotes and an excess of homozygotes. [1 — names the signal]
This pattern is the signature of non-random mating, which violates the Hardy–Weinberg assumption of random mating: when matings preferentially occur within extended families (consanguinity), parents share more alleles than chance would predict, and so produce more homozygous offspring. The documented history of within-family marriage in the village directly violates this assumption. [1 — links to assumption 1]
A second relevant violation is the assumption of large population size. The village is small and isolated (the sample is only 400 adults), so it is also vulnerable to genetic drift, in which chance alone can shift allele frequencies and genotype counts away from HW expectations even without active selection. [1 — second assumption]
Importantly, the allele frequencies themselves are essentially unchanged: the observed counts give p(T) = (2 × 232 + 104) ÷ 800 = 0.71 and q(C) = 0.29, close to the stated 0.70 and 0.30. What has shifted is the distribution of those alleles across genotypes. Hardy–Weinberg equilibrium concerns expected genotype frequencies given p and q; two populations can have the same allele frequencies while one is in HW equilibrium and the other is not. [1 — allele frequency vs genotype frequency]
However, the data describe one SNP in one isolated village. They cannot characterise the genetic structure of any wider population, nor confirm a specific cause of the deviation — n=400 from a single village is not representative of any larger group, and the conclusion rests on a single marker. [1 — limitation]
A defensible conclusion is: the observed deficit of heterozygotes is consistent with non-random mating (and possibly drift) in this isolated village, but a stronger genetic claim would require multiple SNPs across the genome, comparison with a mainland reference sample, and explicit statistical testing of HW. [1 — proportionate conclusion]
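One common form of "explicit statistical testing of HW" is a chi-square goodness-of-fit test. The sketch below applies it to the stimulus table. It assumes the conventional 1 degree of freedom for a two-allele locus (three genotype classes, minus one, minus one for the allele frequency) and the standard critical value of 3.841 at the 0.05 level; the function name is illustrative.

```python
observed = {"TT": 232, "TC": 104, "CC": 64}
expected = {"TT": 196, "TC": 168, "CC": 36}


def chi_square(obs, exp):
    """Sum of (observed - expected)^2 / expected over all genotype classes."""
    return sum((obs[g] - exp[g]) ** 2 / exp[g] for g in exp)


chi2 = chi_square(observed, expected)

# Critical value for 1 degree of freedom at the 0.05 significance level.
# Using 1 degree of freedom is an assumption stated in the lead-in.
CRITICAL_VALUE = 3.841

print(f"chi-square = {chi2:.2f}")
print("departs from HW" if chi2 > CRITICAL_VALUE else "consistent with HW")
```

A statistic far above the critical value indicates that the departure is unlikely to be sampling noise alone, but it does not identify which assumption was violated.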
Q2 — Sample Band 6 response (7 marks), annotated
A SNP (single nucleotide polymorphism) is a one-base difference at a specific position in the DNA between individuals or populations. The frequency of an allele at a SNP is calculated as p = (2 × number of homozygotes for that allele + number of heterozygotes) ÷ (2 × total individuals), so a population of 100 people with 49 TT, 42 TC and 9 CC has p(T) = (98 + 42) ÷ 200 = 0.70. [1 — definition + formula]
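The allele-frequency formula can be written as a small function. This is a sketch only: the function name is illustrative, and the count of 9 CC individuals is inferred from the total of 100 rather than stated in the example.

```python
def allele_frequencies(n_TT, n_TC, n_CC):
    """Return (p, q): the frequencies of the T and C alleles.

    Each individual carries two allele copies, so the denominator is
    twice the number of individuals.
    """
    total_copies = 2 * (n_TT + n_TC + n_CC)
    p = (2 * n_TT + n_TC) / total_copies  # T copies
    q = (2 * n_CC + n_TC) / total_copies  # C copies
    return p, q


# 100 people: 49 TT, 42 TC, and the remaining 9 assumed to be CC.
p, q = allele_frequencies(49, 42, 9)
print(f"p(T) = {p:.2f}, q(C) = {q:.2f}")
```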
A single SNP gives one comparison between populations. A whole-genome SNP panel (thousands of SNPs) gives thousands of effectively independent comparisons, so the resolution of relatedness inference is dramatically higher with the panel. [1 — resolution]
A single SNP is also vulnerable to single-locus forces: natural selection acting at one site (e.g. the lactase-persistence SNP near LCT) can push the frequency of one allele far apart between populations (0.06 in East Asian samples versus 0.70 in Northern European samples) without the two populations being distantly related overall. A whole-genome panel averages over these locus-specific signals, so it is much less likely to mislead about overall relatedness. [1 — locus-level selection criterion]
Both approaches share the same sample-size and bias issues, because they use the same individuals, but per-locus noise is much higher than panel-wide noise. A frequency of 0.05 estimated at one locus from a sample of 50 has a much larger margin of error than an average taken across thousands of loci in the same sample. [1 — sampling-error criterion]
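The size of the per-locus noise can be approximated with the binomial standard error. The sketch below assumes each of the 2n allele copies is an independent draw, and that averaging over L loci treats them as independent with similar noise; both are simplifying assumptions, and the function name and the choice of 1000 loci are illustrative.

```python
import math


def allele_freq_se(p, n_individuals):
    """Approximate standard error of an allele frequency estimated
    from n diploid individuals (2n allele copies)."""
    copies = 2 * n_individuals
    return math.sqrt(p * (1 - p) / copies)


# One locus, frequency 0.05, sample of 50 people.
se_single = allele_freq_se(0.05, 50)

# Averaging over 1000 loci, assumed independent with similar noise,
# shrinks the standard error by a factor of sqrt(1000).
se_panel = se_single / math.sqrt(1000)

print(f"single locus: 0.05 +/- {se_single:.3f}")
print(f"panel average noise: {se_panel:.4f}")
```

On these assumptions the single-locus estimate is uncertain by roughly two percentage points either way, which is large relative to a frequency of 0.05.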
For example, two populations might differ by 0.89 in the frequency of one SNP (0.05 vs 0.94) yet share 99.8% of their overall genome. The first number is per-locus; the second is genome-wide. Treating the two as the same would be the error the lesson warns against — drawing a sweeping identity (or non-identity) claim from a single marker. [1 — worked numerical example interpreted]
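A toy calculation shows how one extreme locus can sit inside an otherwise similar panel. Only the 0.05 vs 0.94 pair comes from the example above; the 999 background loci and their frequencies are invented purely for illustration, and the mean frequency difference is a simple summary, not the same measure as percentage of genome shared.

```python
# One outlier locus taken from the worked example.
outlier = [(0.05, 0.94)]

# Hypothetical background: 999 loci where the two populations
# differ by only 0.01 in allele frequency (invented values).
background = [(0.30, 0.31)] * 999

panel = outlier + background

# Absolute frequency difference between the populations at each locus.
diffs = [abs(a - b) for a, b in panel]

largest_diff = max(diffs)            # per-locus view
mean_diff = sum(diffs) / len(diffs)  # panel-wide view

print(f"largest single-locus difference: {largest_diff:.2f}")
print(f"mean difference across panel:    {mean_diff:.4f}")
```

Reading only the outlier suggests the populations are very different; reading the panel average suggests they are very similar. Both numbers are correct for the question each one answers.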
One SNP is sometimes the right tool: e.g. clinical testing for a specific disease allele, where the question is "does this person carry this allele?" — the relatedness question is not in play. A panel is the right tool when the question is overall relatedness, ancestry, or population structure. [1 — appropriate context]
Neither approach is universally better — it depends on the question. The lesson's caution applies: the strength of a conclusion drawn from SNP data must be proportional to how many positions were compared, how many individuals were sampled, and how representative those samples are. [1 — context-aware judgement linked to lesson framing]
Q3 — Sample Band 6 response (6 marks), annotated
The claim is partly correct but largely flawed. [1 — judgement]
What is defensible: If 24 of 30 sampled individuals carry the trait, then 24 ÷ 30 = 80% is the correct observed frequency in the sample — this part is arithmetic and accurate. [1 — concedes the defensible element]
What is wrong:
- "Exactly 80% of all individuals." A sample of 30 is small and may not be representative. The 80% observed in the sample is only a point estimate of the frequency in the wider population; the true value may be quite different, and the figure does not determine any individual's outcome. A larger, representative sample is required before generalising. [1 — refutes "exactly 80% of all"]
- "Must be due to natural selection." Allele frequencies change through several mechanisms: natural selection, genetic drift (especially powerful in small populations), mutation, and gene flow / migration. Concluding "natural selection" from a frequency change alone requires additional evidence — for example, evidence of differential survival or reproduction tied to the allele. [1 — refutes "must be natural selection" using the lesson's misconceptions box]
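That frequencies can change between generations with no selection at all can be illustrated with a simple simulation of genetic drift. This is a sketch of a simplified random-sampling model (each generation's allele copies are drawn at random from the previous generation's frequency). The starting frequency, population sizes, number of generations and seed are arbitrary assumptions, not values from the lesson.

```python
import random


def simulate_drift(p0, n_individuals, generations, seed):
    """Track an allele frequency under drift alone (no selection,
    mutation or migration). Returns one frequency per generation."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    copies = 2 * n_individuals
    freqs = [p0]
    p = p0
    for _generation in range(generations):
        # Each allele copy in the next generation is the allele of
        # interest with probability p.
        count = sum(1 for _copy in range(copies) if rng.random() < p)
        p = count / copies
        freqs.append(p)
    return freqs


small = simulate_drift(p0=0.5, n_individuals=30, generations=20, seed=1)
large = simulate_drift(p0=0.5, n_individuals=3000, generations=20, seed=1)

print("small population:", [round(f, 2) for f in small])
print("large population:", [round(f, 2) for f in large])
```

Running this with different seeds typically shows the small population wandering much further from 0.5 than the large one, even though no allele has any survival advantage in the model.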
Defensible reformulation: "In a sample of 30 individuals from Population X, 80% showed the trait. This is an observed sample frequency; the wider population frequency may differ, and a larger representative sample would give a more reliable estimate. A change in this frequency between generations could be caused by natural selection, but also by genetic drift, mutation or migration — additional evidence is required before attributing the change to any one mechanism." [1 — uses precise lesson terminology and rewrites the claim defensibly]
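How far the wider population frequency "may differ" from 80% can be estimated with a confidence interval. The sketch below uses the Wilson score interval, one standard method for a proportion from a small sample. It assumes the 30 individuals are a simple random sample, which the claim does not establish; the function name is illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion
    (z = 1.96 corresponds to 95% confidence)."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)
    ) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(24, 30)
print(f"sample frequency 0.80, interval {low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from roughly 0.63 to 0.90, so a sample of 30 cannot support the word "exactly".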