Biology • Year 12 • Module 5 • Lesson 18

Large-Scale Population Genetics Data — Disease, Conservation, Human Evolution

Build HSC Band 5–6 extended-response technique on large-scale population genetics — value, limits, and three flagship case studies (Tasmanian devils, gnomAD, Out-of-Africa).

Master · Extended Response

1. Extended response — compare what large-scale data can and cannot deliver (Band 5–6)

7 marks Band 5–6

Q1. Compare and evaluate the value of large-scale collaborative population-genetics data sets with their limitations. In your response you must:

Define what makes a data set "large-scale collaborative" and explain why size strengthens inference.
Compare what such data improves (pattern detection, trend confidence, between-population comparison) with what it does not remove (sampling bias, method limits, per-individual uncertainty).
Use at least one named example per application context (conservation, disease inheritance, human evolution) — e.g. Save the Tasmanian Devil Program, gnomAD / 1000 Genomes, or the National Centre for Indigenous Genomics.
Reach an evaluative judgement that frames the conclusion as a strong inference, not certainty.

Stuck? Plan: definition + size→pattern → three named examples → limits (sampling, methods, individual) → evaluative judgement framing conclusions as inferences.

2. Stimulus-based extended response — DFTD and the Tasmanian devil (Band 5–6)

8 marks Band 5–6

Stimulus. Devil Facial Tumour Disease (DFTD) is a transmissible clonal cancer first reported in north-eastern Tasmania in 1996. The tumour cells themselves are the infectious agent — they are spread between devils through biting and are not rejected by the host immune system, in part because of low Major Histocompatibility Complex (MHC) variation across the species. Population declines exceed 80% in long-affected regions. The Save the Tasmanian Devil Program has used genome-wide SNP genotyping of >3,000 devils to track allele frequencies, identify candidate resistance loci, and design an insurance population. Epstein et al. (Nature Communications 2016) reported two genomic regions showing strong signatures of selection in surviving devils — suggesting partial genetic adaptation to DFTD within ~6 generations.

Q2. Analyse and evaluate how large-scale population-genetics data has contributed to conservation of the Tasmanian devil, and explain what it cannot guarantee about the species' future.

In your answer:

Explain why genome-wide SNP data is more powerful than a few markers from a small sample.
Describe three specific contributions of the data (bottleneck detection, candidate resistance alleles, insurance-population design).
Identify two limits of inference that remain despite the data set's size.
Reach a justified judgement using the lesson's framing of inference vs certainty.

Stuck? Use Cards 2 + 5 as your skeleton: conservation value (bottleneck detection, resistance loci, insurance population) → limits (drift confounds, individual uncertainty) → inference, not certainty.

3. Evaluate this claim (Band 5–6)

6 marks Band 5–6

"Now that we have sequenced over one million human genomes through projects like gnomAD and the UK Biobank, the story of human evolution and the genetic basis of every inherited disease are essentially solved. A clinician reading any patient's genome can predict that person's disease risk and ancestry with certainty."

Q3. Evaluate this claim. Identify which parts are defensible and which are wrong, and reformulate the claim into a biologically defensible statement using the lesson's framing of inference and the well-documented limitations of current data sets.

Stuck? Revisit Card 5 ("Limits of Inference") and the lesson's misconceptions box about the Human Genome Project.

Answers — Do not peek before attempting

Q1 — Sample Band 6 response (7 marks), annotated

A large-scale collaborative population-genetics project pools genetic data from many laboratories, sites and populations so that allele frequencies, rare variants and between-population patterns can be measured with statistical power impossible for a single small study. The larger the sample, the more detectable rare alleles and subtle trends become — this is why projects like the 1000 Genomes Project, gnomAD, the UK Biobank, the Save the Tasmanian Devil Program and the National Centre for Indigenous Genomics exist. [1 — definition + size→pattern; 1 — names large-scale projects]
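Optional worked check (not needed in an exam response). The "size strengthens inference" step can be made concrete with a short calculation. The sketch below assumes random sampling of diploid individuals and a hypothetical allele frequency of 1 in 10,000 chromosomes; the function name and the sample sizes are illustrative choices, not figures from any of the named projects.

```python
def detection_probability(allele_freq: float, n_individuals: int) -> float:
    """Chance that an allele at `allele_freq` shows up at least once
    among the 2 * n_individuals chromosomes of a random diploid sample."""
    n_chromosomes = 2 * n_individuals
    # P(at least one copy) = 1 - P(no copies on any sampled chromosome)
    return 1.0 - (1.0 - allele_freq) ** n_chromosomes


rare_allele = 0.0001  # hypothetical: 1 in 10,000 chromosomes carries the allele

for n in (50, 1_000, 100_000):
    p = detection_probability(rare_allele, n)
    print(f"sample of {n:>7,} individuals: P(detect) = {p:.3f}")
```

Under these assumptions a study of 50 individuals will almost always miss the allele, while a study of 100,000 will almost always find it. That is the quantitative reason pooled data sets reveal rare variants that single small studies cannot.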

In conservation, Hohenlohe et al. (Conservation Genetics 2019) used genome-wide SNP genotyping of Tasmanian devils to show that heterozygosity has dropped ~30% in north-eastern populations affected by Devil Facial Tumour Disease since 1996, while populations not yet exposed retain near-original diversity. This pattern is direct evidence of a disease-driven bottleneck and informs which individuals enter the insurance population. [1 — conservation example]

In disease inheritance, the gnomAD consortium (Karczewski et al. Nature 2020) aggregated >125,000 exomes and showed that pathogenic CFTR allele frequencies vary by an order of magnitude across ancestry groups (~1 in 27 carriers in Ashkenazi populations vs ~1 in 416 in East Asian populations). This supports population-specific carrier screening. [1 — disease example]
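Optional worked check (not needed in an exam response). The carrier frequencies quoted above can be turned into population-average birth risks for an autosomal recessive condition. This is a simplified sketch: it assumes random mating within each group, treats every carrier as a heterozygote, and ignores affected parents. The function name is illustrative.

```python
from fractions import Fraction


def affected_birth_risk(carrier_freq: Fraction) -> Fraction:
    """Population-average chance of an affected child:
    both parents are carriers (c * c), then a 1/4 chance per pregnancy."""
    return carrier_freq * carrier_freq * Fraction(1, 4)


# Carrier frequencies as quoted in the sample response
groups = (("1 in 27", Fraction(1, 27)), ("1 in 416", Fraction(1, 416)))

for label, carrier in groups:
    risk = affected_birth_risk(carrier)
    print(f"carrier frequency {label}: about 1 in {round(1 / risk):,} births")
```

The two population averages differ by a factor of more than 200, which is the case for population-specific screening. For a couple already known to be carriers, however, the chance is 1 in 4 at every pregnancy regardless of ancestry, so the population figure still says nothing certain about any one family.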

In human evolution, Ramachandran et al. (PNAS 2005) showed that pairwise F_ST between human populations rises approximately linearly with the geographic distance separating them, and that heterozygosity declines with distance from East Africa, both consistent with a serial founder effect during the Out-of-Africa expansion. Malaspinas et al. (Nature 2016) extended this with whole-genome data from Aboriginal Australian groups, supporting continuous occupation of Sahul for >50,000 years. [1 — human-evolution example]
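Optional worked check (not needed in an exam response). F_ST measures how much of the total genetic diversity is due to differences between populations. The sketch below uses the textbook heterozygosity-based form, F_ST = (H_T - H_S) / H_T, for one biallelic locus in two equally sized populations. It is a simplification for illustration, not the estimator used in the cited studies, and the allele frequencies in the loop are invented.

```python
def expected_heterozygosity(p: float) -> float:
    """Expected heterozygosity 2pq at a biallelic locus (Hardy-Weinberg)."""
    return 2.0 * p * (1.0 - p)


def fst(p1: float, p2: float) -> float:
    """F_ST for one locus in two equally sized populations."""
    # H_S: average diversity within each population
    h_s = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    # H_T: diversity if the two populations were pooled into one
    h_t = expected_heterozygosity((p1 + p2) / 2.0)
    if h_t == 0.0:
        raise ValueError("locus is fixed for the same allele in both populations")
    return (h_t - h_s) / h_t


# Invented allele frequencies: identical, moderately different, very different
for p1, p2 in ((0.50, 0.50), (0.40, 0.60), (0.20, 0.80)):
    print(f"p1 = {p1:.2f}, p2 = {p2:.2f}: F_ST = {fst(p1, p2):.2f}")
```

Populations with identical allele frequencies give F_ST = 0, and the value climbs as frequencies diverge. A serial founder effect predicts exactly this kind of divergence accumulating with geographic separation.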

However, the lesson's "Limits of Inference" position remains correct: even data sets of millions of individuals carry sampling biases (European ancestry dominates gnomAD, Sirugo et al. 2019), methodological assumptions (linkage models, neutral evolution), and unavoidable individual uncertainty — a population trend does not predict any one carrier couple's pregnancy outcome, nor any one devil's tumour survival. [1 — explicit comparison of what is improved vs not removed]

Overall, large-scale collaborative data has transformed conservation, disease inheritance and human evolution from inference based on a handful of markers to inference based on whole-genome trends across thousands of individuals. The conclusions it supports are stronger inferences — not certainties — and that distinction is essential when the data is used to make management or clinical decisions. [1 — evaluative judgement framed as inference]

Marking criteria.

1 mark — Defines large-scale collaborative project (pooled samples from many sites/labs/populations) and links size to detectability of patterns and rare alleles.
1 mark — Names at least one specific large-scale project (e.g. 1000 Genomes, gnomAD, UK Biobank, Save the Tasmanian Devil Program, National Centre for Indigenous Genomics, H3Africa).
1 mark — Conservation example with mechanism (e.g. Tasmanian devil H_O loss after DFTD bottleneck, Hohenlohe et al. 2019).
1 mark — Disease-inheritance example with mechanism (e.g. gnomAD ancestry-specific CFTR carrier frequencies, Karczewski et al. 2020).
1 mark — Human-evolution example with mechanism (e.g. F_ST–distance relationship from Ramachandran et al. 2005, or Aboriginal Australian deep ancestry, Malaspinas et al. 2016).
1 mark — Explicitly compares what large data improves with what it does not remove (uses precise terminology — sampling bias, ancestry under-representation, methodological limits, individual uncertainty).
1 mark — Reaches an evaluative judgement that frames the outcome as an evidence-based inference, not certainty, and links back to all three contexts.

Q2 — Sample Band 6 response (8 marks), annotated

Genome-wide SNP genotyping measures variation at many thousands of loci spread across the genome, rather than the handful of markers used in early conservation studies. This provides high-resolution estimates of heterozygosity, relatedness, effective population size and selection signatures, and lets researchers detect changes that small marker panels would miss entirely. [1 — power of genome-wide data]

One contribution has been direct detection of the DFTD-driven bottleneck. Hohenlohe et al. (Conservation Genetics 2019) reported a ~30% decline in observed heterozygosity in long-affected north-eastern Tasmanian devil populations compared with pre-DFTD samples, with smaller declines in more recently affected populations and almost none in regions DFTD has not yet reached. This gradient, in which diversity loss scales with time since the disease arrived, strongly supports DFTD as the proximate cause of diversity loss. [1 — bottleneck detection]
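Optional worked check (not needed in an exam response). Observed heterozygosity (H_O) is simply the fraction of genotype calls that are heterozygous. The sketch below assumes genotypes coded as the number of copies of the alternate allele (0, 1 or 2). The two small data sets are invented toy numbers chosen to show a 30% decline; they are not real devil data.

```python
def observed_heterozygosity(genotypes: list[list[int]]) -> float:
    """Fraction of genotype calls that are heterozygous.
    genotypes[i][j] = copies of the alternate allele (0, 1 or 2)
    carried by individual i at SNP j."""
    calls = [g for individual in genotypes for g in individual]
    if not calls:
        raise ValueError("no genotype calls supplied")
    return sum(1 for g in calls if g == 1) / len(calls)


def percent_decline(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


# Invented toy data: 4 individuals x 5 SNPs, sampled before and after a bottleneck
pre_disease = [[1, 0, 1, 2, 1], [0, 1, 1, 2, 0], [1, 1, 0, 1, 2], [2, 1, 0, 1, 0]]
post_disease = [[0, 0, 1, 2, 1], [0, 1, 0, 2, 0], [1, 1, 0, 2, 2], [2, 1, 0, 1, 0]]

h_before = observed_heterozygosity(pre_disease)
h_after = observed_heterozygosity(post_disease)
print(f"H_O before = {h_before:.2f}, after = {h_after:.2f}")
print(f"decline = {percent_decline(h_before, h_after):.0f}%")
```

With only 20 genotype calls per sample, one or two changed genotypes shift the estimate noticeably. Averaging over thousands of SNPs and hundreds of animals is what makes a reported decline of this size trustworthy.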

A second contribution is the identification of candidate resistance loci. Epstein et al. (Nature Communications 2016) compared allele frequencies before and after DFTD outbreak and found two genomic regions showing strong shifts within ~6 generations — a signature of natural selection acting in real time, suggesting partial genetic adaptation to the tumour. [1 — resistance loci]

A third contribution is the design of the insurance population housed in zoos and on Maria Island. Genome-wide relatedness and diversity estimates are used to select founders whose collective allele content maximises retained genetic diversity, so that future re-introductions to the wild start from a representative gene pool. [1 — insurance population design]

Two limits of inference remain. First, candidate-resistance signals can still be artefacts of drift in a small, highly bottlenecked population — the apparent selection may not be DFTD-driven, and replication with functional studies is required before any allele is treated as causal. [1 — drift confound] Second, none of these analyses can predict whether any individual devil will catch DFTD or survive it, nor whether the species will persist under ongoing habitat fragmentation, road mortality and climate change — population-level inference does not translate to individual or future certainty. [1 — individual / species-future uncertainty]
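Optional worked check (not needed in an exam response). The drift confound can be quantified with the standard Wright-Fisher result for the variance of a neutral allele's frequency after t generations: p0(1 - p0)[1 - (1 - 1/(2N_e))^t]. The starting frequency and the two effective population sizes below are hypothetical values chosen for contrast, not estimates for devil populations.

```python
import math


def drift_sd(p0: float, n_e: int, generations: int) -> float:
    """Standard deviation of a neutral allele's frequency after `generations`
    of pure drift in an idealised Wright-Fisher population of effective
    size n_e, starting from frequency p0."""
    variance = p0 * (1.0 - p0) * (1.0 - (1.0 - 1.0 / (2.0 * n_e)) ** generations)
    return math.sqrt(variance)


p0 = 0.30  # hypothetical starting allele frequency

for n_e in (50, 5_000):  # hypothetical small vs large effective population size
    sd = drift_sd(p0, n_e, generations=6)
    print(f"N_e = {n_e:>5,}: typical drift-only change over 6 generations = {sd:.3f}")
```

Under these assumptions, six generations of drift alone typically move the frequency by about 0.11 when N_e is 50, but only by about 0.01 when N_e is 5,000. A large allele-frequency shift in a small, bottlenecked population is therefore weaker evidence of selection than the same shift in a large one, which is why replication and functional studies are needed.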

Applying the lesson framing: large-scale population data has substantially strengthened the inference that the devil is recoverable — bottleneck visible, partial resistance plausible, insurance population designed — but it has not removed uncertainty. Strong inference for management, not certainty, is the honest scientific position. [1 — inference vs certainty framing]

The Save the Tasmanian Devil Program therefore illustrates the lesson's two-sided message: large collaborative data turns "small isolated samples and guesses" into actionable patterns, while still requiring scientists and managers to acknowledge sampling assumptions, methodological limits and individual unpredictability. [1 — integrates all four required elements with precise terminology]

Marking criteria.

1 mark — Identifies that genome-wide SNP data provides many independent loci across the genome, enabling estimates of heterozygosity, relatedness and selection signatures that a small panel of markers cannot resolve.
1 mark — Contribution 1: detects the DFTD bottleneck through loss of heterozygosity / rare alleles (e.g. Hohenlohe et al. 2019, ~30% H_O drop in NE).
1 mark — Contribution 2: identifies candidate resistance loci under selection (e.g. Epstein et al. 2016 — two regions showing rapid allele-frequency change in survivors).
1 mark — Contribution 3: informs design of the insurance population (selecting individuals that maximise retained allelic diversity / minimise relatedness).
1 mark — Limit 1: candidate-locus signals can still be false positives from drift in a bottlenecked population — the inference is not certainty.
1 mark — Limit 2: data cannot predict whether any individual devil will develop or survive DFTD, nor guarantee species persistence under ongoing habitat / climate pressures.
1 mark — Frames the strongest conclusion as a justified inference: large data raises confidence in trends and management decisions but does not turn them into deterministic predictions.
1 mark — Uses precise lesson terminology throughout (bottleneck, genetic diversity, selection signature, inference, sampling assumption) and integrates all four required elements into one coherent argument.

Q3 — Sample Band 6 response (6 marks)

The claim is partly correct but largely overstated. [1 — judgement]

What is defensible: Aggregating over one million genomes through gnomAD, the UK Biobank, FinnGen and similar projects has genuinely sharpened allele-frequency estimates, accelerated disease-gene discovery (e.g. confirmation of pathogenic BRCA1/BRCA2 alleles, identification of loss-of-function tolerant genes), and added important detail to human population history (Karczewski et al. 2020; Bycroft et al. Nature 2018). [1 — concedes defensible element]

What is wrong:

"Story of human evolution is solved." Ancient-DNA work (Reich 2018; Prüfer et al. 2014) keeps revising the narrative: Neanderthal and Denisovan admixture, multiple Out-of-Africa pulses, and Aboriginal Australian deep ancestry (Malaspinas et al. 2016) were all reported after the Human Genome Project was declared complete. [1 — refutes "solved"]
"Every inherited disease is solved." Most rare variants catalogued in gnomAD have no established clinical interpretation, and many that have been assessed remain variants of uncertain significance (VUS); Sirugo et al. (Cell 2019) document that under-representation of African, Indigenous and South Asian samples means many disease-relevant variants are still missing or misclassified. [1 — refutes "every disease"]
"Predict any patient with certainty." The lesson is explicit: population-level allele frequencies do not predict individual outcomes. Phenotype depends on other genes, environment, chance and the limits of genotype–phenotype mapping; even highly penetrant alleles vary in expression. [1 — refutes "individual certainty"]

Defensible reformulation: "Pooling millions of human genomes has substantially strengthened our inference about disease-variant frequencies and human population history. However, conclusions remain inferences open to revision, ancestry under-representation continues to bias them, and they describe population trends — not deterministic predictions for any individual." [1 — biologically defensible reformulation with inference framing]

Marking criteria.

1 mark — States an overall evaluative judgement (the claim is partly correct but overstated / largely flawed).
1 mark — Correctly identifies the defensible element: pooling 10⁶+ genomes has sharpened allele-frequency estimates and substantially advanced disease-gene discovery and ancestry inference.
1 mark — Refutes "story of human evolution is solved" — ancient DNA, archaic admixture (Neanderthal/Denisovan), and under-represented populations continue to reshape the narrative (e.g. Malaspinas et al. 2016, Reich 2018).
1 mark — Refutes "every inherited disease solved": most rare variants in gnomAD lack an established clinical interpretation or remain variants of uncertain significance (VUS); ancestry under-representation (Sirugo et al. 2019) means many populations remain poorly characterised.
1 mark — Refutes "predict any patient with certainty" — population trends do not translate to individual certainty; phenotype depends on other genes, environment and chance.
1 mark — Reformulates the claim into a defensible alternative that frames large-scale data as strengthening inference while preserving uncertainty, ancestry equity and individual variation (and explicitly cites the lesson's two-sided position).