Epidemiology — Incidence, Prevalence, Mortality and Study Design
Every claim about disease — "smoking causes lung cancer," "obesity increases heart disease risk," "this vaccine reduces infection by 95%" — comes from epidemiology. This lesson builds the tools to understand how those claims are generated, what makes them reliable, and how to critically evaluate them.
Epidemiology underpins IQ3 — how we measure whether treatments and prevention strategies work
Between 2000 and 2022, the total number of Australians diagnosed with Type 2 diabetes more than doubled. Headlines reported this as evidence that Australia's diabetes epidemic was worsening catastrophically.
But over that same period, Australia's population also grew substantially — and aged significantly. More people were also being screened and diagnosed than ever before. A researcher argues that when you adjust for population size and age structure, the age-standardised incidence rate of Type 2 diabetes has actually been relatively stable or even declining in some age groups.
Before reading on:
Q1: What is the difference between the total number of cases of a disease and the rate of disease in a population? Why does this distinction matter for public health decisions?
Q2: Why might improved screening and diagnosis make a disease appear to be increasing even if the underlying rate is unchanged?
Know
- The definitions and formulas for incidence, prevalence, and mortality rate
- The key features of cohort, case-control, cross-sectional, and randomised controlled trial (RCT) study designs
- What a confounding variable is and why it matters
- The difference between correlation and causation in epidemiological data
Understand
- Why incidence and prevalence give different pictures of disease burden
- Why age-standardised rates are needed for valid population comparisons
- Why observational studies can establish association but not proof of causation alone
- Why RCTs are considered the gold standard and when they cannot be used
Can Do
- Calculate or interpret incidence, prevalence, and mortality rates from data
- Identify the most appropriate study design for a given research question
- Identify confounding variables in epidemiological scenarios
- Evaluate whether epidemiological data supports a causal or merely associative conclusion
Core Content
Three different measurements of disease burden, each answering a different question about population health
Before any analysis of disease patterns can be done, the disease must be measured consistently. Epidemiologists use three core measures — incidence, prevalence, and mortality — each capturing a different aspect of disease burden. Confusing these three measures is one of the most common errors in interpreting health data.
Incidence — New Cases Over Time
What it measures: The rate at which new cases of a disease arise in a population over a defined time period. Answers the question: "How fast is this disease spreading or developing?"
Example: If 1,800 people are newly diagnosed with melanoma in a population of 10 million in one year, the incidence rate is 18 per 100,000 per year.
Best used for: Measuring the risk of developing a disease; assessing whether a disease is becoming more or less common; evaluating the impact of prevention programs.
Interpretation note: Rising incidence can reflect genuinely increased disease burden, OR improved screening/diagnosis detecting cases that previously went undetected.
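The melanoma figure above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `rate_per_100k` is a hypothetical helper introduced here for illustration, not part of any library.

```python
def rate_per_100k(new_cases: int, population_at_risk: int) -> float:
    """Return new cases per 100,000 people for the period in which the cases were counted."""
    if population_at_risk <= 0:
        raise ValueError("population_at_risk must be positive")
    return new_cases / population_at_risk * 100_000


# Figures from the melanoma example: 1,800 new cases among 10 million people in one year.
melanoma_incidence = rate_per_100k(1_800, 10_000_000)
print(f"Incidence: {melanoma_incidence:.1f} per 100,000 per year")
```

Expressing the result per 100,000 is only a scaling choice; it makes rates from populations of different sizes directly comparable.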
Prevalence — All Existing Cases at a Point in Time
What it measures: The total proportion of a population that has a condition at a specific time point (point prevalence) or during a specified period (period prevalence). Answers: "How much of this disease exists in the community right now?"
Example: If 1.3 million of Australia's 26 million people have Type 2 diabetes at a given time, prevalence is 5%.
Best used for: Healthcare planning (how many people need treatment?); allocating health resources; understanding disease burden on the healthcare system.
Key relationship: Prevalence ≈ Incidence × Average duration of disease (a good approximation when the disease is fairly uncommon and rates are stable). A disease with low incidence but long duration (e.g. Type 2 diabetes, which is chronic and lifelong) has high prevalence. A disease with high incidence but short duration (e.g. influenza, which resolves or kills quickly) has lower prevalence relative to incidence.
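To see how duration drives prevalence, here is a rough sketch of the incidence × duration relationship. The incidence and duration figures are hypothetical, chosen only to contrast a chronic condition with a short-lived infection, and the approximation assumes stable rates and a fairly uncommon disease.

```python
def approximate_prevalence(incidence_per_person_year: float, mean_duration_years: float) -> float:
    """Steady-state approximation: prevalence (as a proportion) = incidence x mean duration."""
    return incidence_per_person_year * mean_duration_years


# Hypothetical chronic condition: 250 new cases per 100,000 per year, lasting about 20 years.
chronic = approximate_prevalence(250 / 100_000, 20)

# Hypothetical acute infection: 5,000 new cases per 100,000 per year, lasting about one week.
acute = approximate_prevalence(5_000 / 100_000, 1 / 52)

print(f"Chronic condition: {chronic:.2%} of the population at any one time")
print(f"Acute infection:   {acute:.2%} of the population at any one time")
```

With these assumed numbers the acute infection has twenty times the incidence of the chronic condition, yet far fewer people have it at any one moment (roughly 0.1% compared with 5%).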
Mortality Rate — Deaths From Disease
What it measures: The number of deaths attributable to a specific disease per unit of population per unit of time. Distinct from case fatality rate (proportion of cases that die).
Example: If 1,800 people die of coronary heart disease per year in a population of 10 million, mortality rate is 18 per 100,000 per year.
Best used for: Measuring how much death a disease causes in a population; assessing the impact of treatment advances; comparing causes of death. (How lethal a disease is for the people who have it is measured by the case fatality rate.)
Important distinction: A disease can have high incidence but low mortality (e.g. most skin cancers — common but rarely fatal if caught early) or low incidence but high mortality (e.g. pancreatic cancer — rare but ~90% mortality within 5 years).
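The difference between mortality rate and case fatality rate comes down to the denominator. The sketch below uses the coronary heart disease deaths from the example above; the number of existing cases (`ASSUMED_CASES`) is a hypothetical value added purely for illustration.

```python
def mortality_rate_per_100k(deaths: int, population: int) -> float:
    """Deaths per 100,000 people in the whole population (denominator: everyone)."""
    return deaths / population * 100_000


def case_fatality_pct(deaths: int, cases: int) -> float:
    """Percentage of people WITH the disease who die of it (denominator: cases only)."""
    return deaths / cases * 100


# From the example: 1,800 deaths per year in a population of 10 million.
chd_mortality = mortality_rate_per_100k(1_800, 10_000_000)

# Hypothetical assumption: 30,000 people had the disease during that year.
ASSUMED_CASES = 30_000
chd_case_fatality = case_fatality_pct(1_800, ASSUMED_CASES)

print(f"Mortality rate: {chd_mortality:.1f} per 100,000 per year")
print(f"Case fatality:  {chd_case_fatality:.1f}% of cases")
```

The same 1,800 deaths give a small number when divided by the whole population and a much larger one when divided by cases only, which is why the two measures should never be mixed up.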
Age-standardisation — making fair comparisons
Raw rates cannot always be fairly compared between populations with different age structures. Older populations tend to have higher crude rates of age-related diseases (cancer, cardiovascular disease, dementia) simply because they have more older people, not necessarily because those diseases are more common at any given age. Age-standardisation applies a standard age distribution to both populations, allowing the underlying disease rates to be compared on a level playing field.
This is why Australia's age-standardised cancer mortality has been falling for decades even as the total number of cancer deaths has risen: improved treatment has lowered the risk of dying from cancer at any given age, but the population is larger and older, producing more total deaths despite the improved rate.
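One common way to do this is direct standardisation: multiply each age group's rate by that group's share of a standard population and add the results. The sketch below is illustrative only. The age bands, rates, population sizes and standard weights are all hypothetical, and the two populations are deliberately given identical age-specific rates so that any difference in the crude rate comes from age structure alone.

```python
# Hypothetical age-specific death rates (per 100,000 per year), identical in both populations.
RATES = {"0-39": 10, "40-64": 100, "65+": 800}

# Hypothetical population counts by age group.
YOUNG_POPULATION = {"0-39": 700_000, "40-64": 250_000, "65+": 50_000}
OLD_POPULATION = {"0-39": 400_000, "40-64": 350_000, "65+": 250_000}

# Hypothetical standard population, expressed as the share of people in each age group.
STANDARD_WEIGHTS = {"0-39": 0.55, "40-64": 0.30, "65+": 0.15}


def crude_rate(population: dict, rates: dict) -> float:
    """Overall rate per 100,000: age-specific rates weighted by the population's OWN age mix."""
    total_people = sum(population.values())
    return sum(population[age] * rates[age] for age in population) / total_people


def age_standardised_rate(rates: dict, weights: dict) -> float:
    """Directly standardised rate per 100,000: rates weighted by the STANDARD age mix."""
    return sum(weights[age] * rates[age] for age in rates)


crude_young = crude_rate(YOUNG_POPULATION, RATES)
crude_old = crude_rate(OLD_POPULATION, RATES)
standardised = age_standardised_rate(RATES, STANDARD_WEIGHTS)

print(f"Crude rate, younger population: {crude_young:.1f} per 100,000")
print(f"Crude rate, older population:   {crude_old:.1f} per 100,000")
print(f"Age-standardised rate (both):   {standardised:.1f} per 100,000")
```

The crude rate of the older population comes out more than three times higher, even though the risk at every age is the same in both. After standardisation the two populations have the same rate, which is the fair comparison.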
What to write in your book
- Incidence = NEW cases ÷ population at risk × 100,000/yr (risk of developing disease).
- Prevalence = ALL existing cases ÷ total population × 100 (% snapshot); Prevalence ≈ Incidence × duration.
- Mortality rate = deaths ÷ population × 100,000/yr.
- Age-standardised rates adjust for age structure → fair comparison between populations.
A disease with effective treatment that extends life shows rising prevalence even though incidence is falling. Why?
Different questions require different study designs — each with characteristic strengths, limitations, and appropriate uses
Epidemiologists cannot randomly assign people to smoke cigarettes or eat unhealthy diets for decades to study the effect on health — most important questions about disease and exposure must be studied observationally. The choice of study design determines what questions can be answered and what conclusions can be drawn.
Cohort Study (Prospective)
Design: A group of disease-free people is followed over time. Exposed and unexposed subgroups are compared for disease development.
Strength: Establishes temporal sequence (exposure before disease); good for common outcomes; can study multiple outcomes from one exposure.
Limitation: Slow and expensive; loss to follow-up; not practical for rare diseases.
Example: British Doctors Study (Doll and Hill) — followed 40,000 doctors from 1951, comparing smoking status to lung cancer rates over decades.
Case-Control Study (Retrospective)
Design: People with a disease (cases) are compared with disease-free people (controls). Past exposures are compared between groups.
Strength: Efficient for rare diseases; quick and inexpensive; can study multiple exposures simultaneously.
Limitation: Relies on recalled exposure (recall bias); cannot establish temporal sequence as clearly; cannot directly calculate incidence.
Example: Comparing asbestos exposure history in mesothelioma patients vs controls without mesothelioma.
Cross-Sectional Study
Design: Measures both exposure and disease at the same time point — a population snapshot.
Strength: Quick and cheap; good for measuring prevalence; generates hypotheses for further study.
Limitation: Cannot establish which came first (exposure or disease); susceptible to prevalence bias; cannot calculate incidence.
Example: National Health Survey measuring smoking status and cardiovascular disease in a sample of Australians at one time point.
Randomised Controlled Trial (RCT)
Design: Participants randomly allocated to intervention (treatment/exposure) or control (placebo/no treatment) groups. Outcomes compared after defined follow-up period.
Strength: Randomisation controls for confounders — the gold standard for establishing causation. Double-blinding reduces bias.
Limitation: Cannot be used for harmful exposures (unethical); expensive; may lack real-world generalisability.
Example: HPV vaccine trials — participants randomly assigned to vaccine or placebo, HPV infection and precancerous lesion rates compared.
What to write in your book
- Cohort (prospective): follow exposed/unexposed forward → establishes temporal sequence.
- Case-control (retrospective): cases vs controls, look back at exposure → efficient for rare diseases (recall bias).
- Cross-sectional: snapshot of exposure + disease at one time → measures prevalence, can't show which came first.
- RCT: randomised, gold standard for causation; can't use for harmful exposures (unethical).
Why can't a randomised controlled trial be used to study whether smoking causes lung cancer?
Association ≠ causation — understanding what can go wrong in epidemiological studies
Epidemiology measures associations between exposures and diseases in real populations — which means it must contend with all the complexity of real life. Confounding variables, biases, and chance findings can all produce apparent associations that are not genuinely causal. Critical evaluation of epidemiological evidence requires recognising these limitations.
Confounding variables
A confounding variable is one that is associated with both the exposure being studied and the disease outcome, and whose presence can create a spurious or distorted apparent relationship. Classic example: a study finds that coffee drinking is associated with lung cancer. Apparent conclusion: coffee causes lung cancer. But coffee drinkers in the 1950s–1980s were also far more likely to smoke. Smoking is the confounder — it is associated with both coffee drinking (same social context) and lung cancer (causally). When you control for smoking status, the coffee-cancer association largely disappears.
Confounders can be controlled by: matching cases and controls on confounding variables; statistical adjustment; stratified analysis; or — best of all — randomisation (which distributes confounders equally between groups by chance).
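Stratified analysis can be sketched with the coffee and lung cancer example. All counts below are hypothetical, invented so that coffee has no effect within either smoking group while coffee drinkers are more likely to be smokers.

```python
# Hypothetical data: (lung cancer cases, group size) for each combination.
STRATA = {
    "smokers":     {"coffee": (80, 800), "no_coffee": (20, 200)},
    "non_smokers": {"coffee": (2, 200),  "no_coffee": (8, 800)},
}


def relative_risk(exposed: tuple, unexposed: tuple) -> float:
    """Risk in the exposed group divided by risk in the unexposed group."""
    exposed_cases, exposed_total = exposed
    unexposed_cases, unexposed_total = unexposed
    return (exposed_cases / exposed_total) / (unexposed_cases / unexposed_total)


def pooled(exposure: str) -> tuple:
    """Add cases and group sizes across strata, ignoring smoking status."""
    cases = sum(STRATA[s][exposure][0] for s in STRATA)
    total = sum(STRATA[s][exposure][1] for s in STRATA)
    return cases, total


# Crude analysis: smoking status ignored.
crude_rr = relative_risk(pooled("coffee"), pooled("no_coffee"))

# Stratified analysis: coffee drinkers compared with non-drinkers within each smoking group.
stratum_rr = {
    s: relative_risk(STRATA[s]["coffee"], STRATA[s]["no_coffee"]) for s in STRATA
}

print(f"Crude relative risk (coffee vs no coffee): {crude_rr:.2f}")
for name, rr in stratum_rr.items():
    print(f"Relative risk among {name}: {rr:.2f}")
```

In this made-up dataset the crude relative risk is close to 3, but within smokers and within non-smokers it is 1.0. The apparent coffee effect comes entirely from the uneven distribution of smokers between the coffee groups.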
Types of bias
- Selection bias: The sample does not represent the target population. Example: the healthy worker effect (people in employment are, on average, healthier than the general population, so comparing workers with the general population can underestimate the harm caused by an occupational exposure).
- Recall bias: Cases (who have a disease) may remember past exposures differently from controls (who do not). People who have developed cancer may more carefully recall exposure to potential carcinogens than healthy controls.
- Information bias: Systematic errors in measuring exposure or outcome. Misclassification of disease status or exposure level.
- Reporting bias: Certain outcomes are more likely to be published (publication bias — positive results are more publishable than null findings).
Correlation vs causation
Two variables can be correlated (statistically associated) without one causing the other. The classic examples: ice cream sales correlate with drowning rates (both rise in summer — confounded by hot weather). Countries with higher chocolate consumption have more Nobel Prize winners per capita (confounded by wealth and education). In epidemiology, establishing causation requires more than statistical association — it requires the Bradford Hill criteria (from L08): strength, consistency, specificity, temporality, dose-response, biological plausibility, coherence, experiment, and analogy.
| Bradford Hill criterion | Meaning | Example: tobacco & lung cancer |
|---|---|---|
| Strength | Large relative risk | Smokers have 15–25× higher lung cancer risk than non-smokers |
| Consistency | Association replicated in multiple studies/populations | Found in studies across dozens of countries and populations |
| Temporality | Exposure precedes disease | Smoking precedes cancer by 20–40 years |
| Dose-response | More exposure = more disease | More pack-years = higher risk; quitting reduces risk |
| Biological plausibility | Known mechanism | PAHs form DNA adducts → G→T mutations in TP53 (L08) |
| Specificity | Exposure linked to specific disease(s) | Tobacco specifically causes lung and other cancers, not all diseases equally |
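The "strength" criterion is usually expressed as a relative risk. A minimal sketch of the calculation follows; the cohort counts are hypothetical, chosen so that the result falls inside the 15–25× range quoted for smoking and lung cancer.

```python
def risk(cases: int, group_size: int) -> float:
    """Proportion of a group that develops the disease during follow-up."""
    return cases / group_size


def relative_risk(exposed_cases: int, exposed_total: int,
                  unexposed_cases: int, unexposed_total: int) -> float:
    """Risk in the exposed group divided by risk in the unexposed group."""
    return risk(exposed_cases, exposed_total) / risk(unexposed_cases, unexposed_total)


# Hypothetical cohort: 10,000 smokers (200 develop lung cancer)
# and 10,000 non-smokers (10 develop lung cancer).
rr_smoking = relative_risk(200, 10_000, 10, 10_000)
print(f"Relative risk of lung cancer, smokers vs non-smokers: {rr_smoking:.0f}x")
```

A relative risk of 1 means no association, values above 1 mean higher risk in the exposed group, and values below 1 suggest a protective association.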
What to write in your book
- Confounder: a variable associated with BOTH exposure and outcome (e.g. smoking confounds coffee–lung cancer).
- Control confounders: matching, statistical adjustment, stratification, or randomisation.
- Biases: selection, recall, information, reporting/publication.
- Correlation ≠ causation → need Bradford Hill criteria (strength, consistency, temporality, dose-response, plausibility…).
A variable associated with both the exposure and the disease outcome, which can create a false apparent association, is called a _____ variable.
The practical skills needed to interpret tables, graphs, and data from studies — tested directly in HSC exams
HSC Biology exams regularly include tables or graphs of epidemiological data and ask students to interpret, analyse, and evaluate them. These questions test whether you can read what the data shows (describe), identify patterns and relationships (analyse), and assess whether the data supports a conclusion (evaluate).
Worked example — interpreting a data table
The following table shows hypothetical data on Type 2 diabetes in Australia:
| Year | Total diagnosed cases | Population (millions) | Crude prevalence (%) | Age-standardised prevalence (%) |
|---|---|---|---|---|
| 2000 | 640,000 | 19.2 | 3.3% | 4.1% |
| 2010 | 970,000 | 22.3 | 4.3% | 4.3% |
| 2022 | 1,300,000 | 25.9 | 5.0% | 4.2% |
What you should notice and state:
- Total cases increased by ~100% from 2000 to 2022 — but this partly reflects population growth.
- Crude prevalence increased from 3.3% to 5.0% — but this partly reflects the ageing of the population (older people have higher T2D rates).
- Age-standardised prevalence changed much less (4.1% → 4.2%) — suggesting the underlying disease rate in comparable age groups has been relatively stable, not dramatically increasing. Much of the apparent increase reflects demographic change rather than worsening epidemic.
- This illustrates why age-standardised rates are essential for valid comparisons over time and between populations.
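The percentage changes in the points above can be checked directly from the table. The sketch below recomputes crude prevalence from the case and population columns of the hypothetical data; the helper names are illustrative only.

```python
# (year, total diagnosed cases, population in millions), taken from the hypothetical table.
TABLE = [
    (2000, 640_000, 19.2),
    (2010, 970_000, 22.3),
    (2022, 1_300_000, 25.9),
]


def crude_prevalence_pct(cases: int, population_millions: float) -> float:
    """Existing cases as a percentage of the total population."""
    return cases / (population_millions * 1_000_000) * 100


def percent_change(old: float, new: float) -> float:
    """Change from old to new, as a percentage of the old value."""
    return (new - old) / old * 100


crude = {year: crude_prevalence_pct(cases, pop) for year, cases, pop in TABLE}
case_growth = percent_change(TABLE[0][1], TABLE[-1][1])
population_growth = percent_change(TABLE[0][2], TABLE[-1][2])

for year, value in crude.items():
    print(f"{year}: crude prevalence {value:.2f}%")
print(f"Growth in total cases 2000 to 2022: {case_growth:.0f}%")
print(f"Growth in population 2000 to 2022:  {population_growth:.0f}%")
```

Total cases roughly double while the population grows by about a third, so part of the rise in case numbers is explained by population size before age structure is even considered.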
What to write in your book
- Rising total cases ≠ rising rate (check population size).
- Crude rate vs age-standardised rate (check age structure).
- Rising prevalence ≠ rising incidence (check treatment/survival).
- Always quote specific data values in "analyse" answers.
If the total number of diagnosed cases of a disease rises, the underlying disease rate must also be rising.
Incidence measures the number of new cases of a disease in a population over a specific time period.
Prevalence and incidence are the same measure and can be used interchangeably in epidemiological studies.
Calculating and Comparing Disease Measures
Use the data provided to calculate and interpret epidemiological measures. Show your working.
1. In a population of 5 million people, 450 new cases of bowel cancer are diagnosed in one year. Of those already living with bowel cancer (total existing cases = 8,500), 90 die from the disease during that year. Calculate: (a) the incidence rate per 100,000 per year; (b) the prevalence (%); (c) the case fatality rate (% of existing cases who die).
2. The table below shows cardiovascular disease data for two countries in the same year. Interpret the data and explain what can and cannot be concluded from these crude vs age-standardised rates.
| Country | CVD deaths | Population | Crude mortality (per 100k) | Age-standardised mortality (per 100k) |
|---|---|---|---|---|
| Country A | 48,000 | 24 million | 200 | 145 |
| Country B | 18,000 | 12 million | 150 | 190 |
Choosing and Evaluating Study Designs
For each research question, identify the most appropriate study design, justify your choice, and identify one major limitation or potential confounding variable.
1. Researchers want to test whether a new drug reduces Type 2 diabetes progression in patients with insulin resistance. The drug has been safety-tested in Phase 1 and 2 trials and is believed to be beneficial.
2. Researchers want to investigate whether childhood sun exposure (before age 10) increases adult melanoma risk. Participants are adults aged 40–60 who either have or do not have melanoma.
3. A study finds that people who drink red wine have lower rates of cardiovascular disease than non-drinkers. A journalist reports: "Red wine prevents heart disease." Identify at least two confounding variables that could explain this association, and explain why the study design cannot establish causation.
In 1951, Richard Doll and Austin Bradford Hill sent questionnaires to every doctor on the British Medical Register asking about their smoking habits. They then followed these ~40,000 doctors for decades, recording causes of death. This was one of the first large prospective cohort studies — and it produced the most compelling epidemiological evidence for the smoking-lung cancer causal link.
Within 4 years, the data were clear enough that Doll himself — a smoker — quit. After 50 years of follow-up, the study had quantified that smoking reduced life expectancy by approximately 10 years, established the dose-response relationship between pack-years and lung cancer mortality, and documented the survival benefit of quitting at different ages. Doctors who quit before age 35 had near-normal life expectancy; those who quit at 65 had reduced but still significant benefit.
The study design was crucial: by following people forward in time (prospective cohort), it established that smoking preceded lung cancer — ruling out reverse causation. By following a large, well-defined professional cohort with reliable death certification, it minimised selection bias and information bias. The results were consistent across subgroups, showed a clear dose-response, and had an identified biological mechanism (carcinogens in smoke). This is exactly how Bradford Hill's criteria for causation are applied in practice.
Three Disease Measures
- Incidence = NEW cases ÷ population at risk × 100,000 per year
- Prevalence = ALL existing cases ÷ total population × 100 (%)
- Mortality rate = deaths ÷ population × 100,000 per year
- Age-standardised: adjusts for age structure to allow fair comparison
Study Designs
- Cohort (prospective): follows exposed/unexposed forward in time
- Case-control (retrospective): cases vs controls, looks back at exposure
- Cross-sectional: snapshot of exposure + disease at one time
- RCT: randomised, gold standard for causation, can't use for harmful exposures
Confounding + Bias
- Confounding variable: associated with both exposure AND outcome
- Selection bias: sample not representative
- Recall bias: cases remember exposure differently to controls
- Correlation ≠ causation — need Bradford Hill criteria
Data Interpretation
- Rising total cases ≠ rising rate (check population size)
- Crude rate vs age-standardised rate (check age structure)
- Rising prevalence ≠ rising incidence (check treatment/survival)
- Always quote data values in exam answers
1. (Apply, Band 4, 4 marks) Distinguish between incidence and prevalence, and explain why effective treatment for a disease can cause its prevalence to rise even if its incidence is falling. Use a specific example in your answer.
2. (Analyse, Band 4–5, 5 marks) A researcher is investigating whether regular physical activity reduces the risk of Type 2 diabetes. Describe how you would design a cohort study to investigate this question. Identify the cohort, the exposure and outcome variables, how data would be collected, and what would constitute evidence of an association. Identify one confounding variable and explain how it would be controlled.
3. (Evaluate, Band 5–6, 5 marks) Evaluate the following claim using your knowledge of epidemiological evidence and study design: "Because an RCT is the gold standard for medical evidence, we should require RCT evidence before accepting any claim that an environmental exposure causes disease."
Activity 1 — Calculations and Interpretation
1. Bowel cancer calculations. (a) Incidence rate = 450 ÷ 5,000,000 × 100,000 = 9 per 100,000 per year. (b) Prevalence = 8,500 ÷ 5,000,000 × 100 = 0.17%. (c) Case fatality rate = 90 ÷ 8,500 × 100 = 1.06% per year — about 1 in 100 existing patients dies of the disease each year, reflecting that many are diagnosed early and survive for years while a smaller advanced-disease group contributes most deaths.
2. CVD country comparison. From crude rates, Country A appears to have higher CVD mortality (200 vs 150 per 100,000). But age-standardised rates reverse this — Country B has the higher rate (190 vs 145). This reversal indicates Country A has an older population: the large elderly proportion inflates Country A's crude rate even though its underlying risk at each age is lower. Age-standardised rates are more valid for comparing underlying disease burden because they remove the confounding effect of different age structures. Country B has the greater underlying CVD risk despite fewer total deaths per 100,000 in the raw data.
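The arithmetic in both Activity 1 answers can be checked with a short script. This is only a sketch for checking working, using the figures given in the questions; the variable names are illustrative.

```python
# Question 1: bowel cancer in a population of 5 million.
POPULATION = 5_000_000
incidence_per_100k = 450 / POPULATION * 100_000   # new cases per 100,000 per year
prevalence_pct = 8_500 / POPULATION * 100         # existing cases as % of population
case_fatality_pct = 90 / 8_500 * 100              # deaths as % of existing cases

# Question 2: crude CVD mortality per 100,000 for each country.
crude_a = 48_000 / 24_000_000 * 100_000
crude_b = 18_000 / 12_000_000 * 100_000

print(f"Incidence:     {incidence_per_100k:.1f} per 100,000 per year")
print(f"Prevalence:    {prevalence_pct:.2f}%")
print(f"Case fatality: {case_fatality_pct:.2f}%")
print(f"Crude CVD mortality: A = {crude_a:.0f}, B = {crude_b:.0f} per 100,000")
```

The age-standardised figures in the table cannot be recomputed this way, because they depend on age-specific rates and a standard population that the question does not provide.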
Activity 2 — Study Design Evaluation
1. New drug for T2D. Best design: Randomised Controlled Trial. The drug is safety-tested and believed beneficial, so it is ethical to assign participants to drug vs placebo. Randomisation balances known and unknown confounders between the groups on average (groups have similar baseline characteristics by chance), so a difference in progression can be attributed to the drug; double-blinding minimises assessment bias. Limitation: the trial population may not represent all T2D patients (often excludes very old, pregnant, or multi-morbid patients), limiting generalisability; trial duration may be too short for long-term effects.
2. Childhood sun exposure and melanoma. Best design: Case-control. We cannot follow children prospectively for 30–40 years, so we recruit adults who already have melanoma (cases) and adults without (controls) and compare recalled childhood sun exposure (retrospective). Limitation: recall bias — melanoma patients may recall childhood sun exposure more carefully than controls, artificially inflating the association. Partially controlled by objective measures (e.g. geographical sun-exposure records) rather than self-report.
3. Red wine and CVD. Confounder 1: socioeconomic status — moderate red-wine drinkers tend to have higher SES, which is independently associated with lower CVD risk (healthcare, diet, activity). Confounder 2: diet quality — wine drinkers often follow Mediterranean-style diets that independently reduce CVD risk. Why causation can't be established: the study is observational — it shows wine drinkers have lower CVD rates but cannot determine whether wine is causal or whether confounders explain the association. Without controlling these (statistical adjustment, matching, or an RCT), the headline claim is unjustified; existing RCT/polyphenol evidence does not support a strong protective effect.
Short Answer Model Answers
SA1 (4 marks): Incidence is the rate of new cases arising in a defined population over a specified time, (new cases ÷ population at risk) × 100,000 — it measures how fast disease develops. Prevalence is the total proportion with the disease at a given time, (existing cases ÷ total population) × 100 — it measures how much disease exists [2]. Why effective treatment raises prevalence despite falling incidence: prevalence ≈ incidence × duration. Effective treatment extends survival, so patients remain in the existing-cases pool for longer; even if incidence falls, the pool grows [1]. Example: HIV in high-income countries — antiretroviral therapy extended life, so prevalence rose through the 2000s while incidence (new infections) fell. The same pattern occurs for Type 2 diabetes (better treatment → longer survival → rising prevalence despite stable incidence) [1].
SA2 (5 marks): Cohort: recruit a large sample (50,000+) of adults aged 35–65 without T2D, willing to be followed 15–20 years [1]. Exposure: measure physical activity at baseline and every ~2 years (questionnaires or accelerometers) — type, duration, intensity, frequency; classify into activity categories [1]. Outcome: development of T2D (fasting glucose ≥7.0 mmol/L, HbA1c ≥48 mmol/mol, or diagnosis), measured at each follow-up [1]. Evidence of association: compare annual T2D incidence in high- vs low-activity groups; calculate relative risk (<1.0 supports protection); test dose-response [1]. Confounding variable: diet (healthier eaters exercise more AND have lower T2D risk). Control: collect dietary data and statistically adjust, or restrict analysis to similar dietary patterns [1].
SA3 (5 marks): RCTs are the gold standard because randomisation distributes known and unknown confounders equally by chance, and blinding reduces bias, which allows causation to be established [1]. But RCTs cannot ethically be used for harmful exposures: you cannot assign people to smoke, inhale asbestos, or receive high UV exposure for decades; an ethics board would never approve it. Requiring RCT evidence would mean causation could never be accepted for environmental carcinogens [2]. Observational evidence can establish causation via the Bradford Hill criteria: strength, consistency, temporality, dose-response, biological plausibility, specificity. The smoking–lung cancer link was established entirely through observational studies (including the Doll and Hill cohort) plus mechanistic evidence, with no RCT [1]. Conclusion: the claim is partly valid (RCTs are ideal when ethical, e.g. for drugs, vaccines and other interventions) but inappropriate as a universal standard for harmful exposures; the appropriate standard is convergent evidence from multiple study types satisfying the Bradford Hill criteria [1].
Return to your Think First responses at the start of the lesson.
- Q1 — total cases vs rate: Total case count is influenced by population size. Rate (cases per 100,000) controls for this — allowing valid comparison between different-sized populations and over time. Rate = cases ÷ population size.
- Q2 — improved screening making disease appear to increase: Screening detects cases that previously existed but were undiagnosed. When screening uptake increases, the diagnosed (recorded) prevalence rises even if true prevalence is stable — this is ascertainment bias.
- Write the formulas for incidence rate and prevalence from memory, and state in one sentence why age-standardised rates are more useful than crude rates for comparing populations.