Lesson 12 ~35 min Unit 4 · Data Science +85 XP

Evaluating Data Quality

In 1998, Dr Andrew Wakefield published autism-vaccine data from just 12 children, a sample so small and compromised that 10 of the paper's 13 authors retracted its interpretation in 2004.

Today's hook: In 1998, a doctor named Andrew Wakefield published a study claiming a link between the MMR vaccine and autism. His entire dataset was 12 children, a tiny, unrepresentative sample. The data collection method was later found to be invalid, and in 2004 ten of his twelve co-authors formally retracted the paper's interpretation. Yet the publicity around that one flawed study helped push vaccination rates in the UK from 92% down to 79% within five years, leading to measles outbreaks. How can data with so many problems cause so much real-world harm? What would you check before trusting a study?

Think First

warm-up

You find two websites about climate change. One is written by a team of climate scientists with referenced data. The other is a social media post with no sources and lots of exclamation marks.

Which source is more trustworthy? List three specific reasons for your choice.

Write your prediction in your book before reading on.

What Makes Data High Quality?

A study surveyed 8 students at one school and concluded that “teenagers across Australia prefer online learning to face-to-face.” You read the headline. Do you believe it? Eight students at one school cannot speak for millions of teenagers. The survey questions may have been fine, but the sample was broken. Validity, reliability and representativeness are the three main criteria scientists use to judge whether data is worth trusting.

Validity asks whether the data actually measures what it claims to measure. If you use a ruler that is marked incorrectly, your length data is invalid no matter how carefully you read it. Reliability asks whether the data is consistent. If you repeat the same measurement five times and get wildly different results, your method is unreliable. Representativeness asks whether the sample reflects the whole population. Surveying only your friends about a national issue gives unrepresentative data.

Another key factor is peer review. When independent experts check the methods and conclusions before publication, many errors and biases are caught. Peer review does not guarantee a finding is correct, but it greatly reduces the chance of sloppy work being accepted.

Example

A student tracks the temperature of cooling water using a thermometer with a cracked bulb. The readings are consistent (reliable) but always 3 °C low (invalid). The data fails on validity, so any conclusion about the water's actual temperature is untrustworthy even though the numbers look neat.
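
The thermometer example can be checked with numbers. The sketch below uses made-up values (a hypothetical true temperature of 60.0 °C and five hypothetical readings from the faulty thermometer) to separate the two ideas: a small spread between repeated readings suggests reliability, while a large offset from the true value signals a validity problem.

```python
from statistics import mean, pstdev

# Hypothetical values for illustration only (not from a real experiment).
TRUE_TEMP_C = 60.0
readings_c = [57.1, 56.9, 57.0, 57.0, 57.0]  # faulty thermometer, repeated trials

def summarise(readings, true_value):
    """Return (spread, offset) for a set of repeated readings."""
    spread = pstdev(readings)             # small spread -> consistent readings
    offset = mean(readings) - true_value  # far from 0 -> systematically wrong
    return spread, offset

spread, offset = summarise(readings_c, TRUE_TEMP_C)
print(f"spread: {spread:.2f} °C, offset: {offset:.2f} °C")
```

With these assumed numbers the spread is tiny but the offset is about 3 °C below the true value, which is the "reliable but invalid" pattern described in the example.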

Real-world anchor

CSIRO publishes peer-reviewed research on bushfire behaviour, drought patterns and renewable energy. Research of this kind depends on valid instruments, repeated measurements and representative sampling sites across Australia. This quality control is one reason CSIRO advice is used by state emergency services.

Watch out

Many students think that big numbers always mean trustworthy data. They do not. A survey of ten thousand people is useless if the questions are leading or the sample is biased. Size matters, but only when the data is also valid, reliable and fairly collected.

Predict then reveal

A researcher surveys 5 people about their favourite sport and finds 80% choose soccer. How many people should they survey to be confident the result reflects the whole school?

What You'll Master

objectives

Know

High-quality data is reliable, valid and sufficient to support a conclusion.
Source credibility, sample size and methodology all affect data quality.

Understand

Bias and error can reduce data quality at any stage of collection and analysis.
Evaluating data quality is essential before using data to make decisions or form arguments.

Can Do

Apply criteria to evaluate the quality of a dataset or source.
Identify potential sources of bias and error in data collection.

Cross-lesson links: The quality criteria you learn here sharpen everything from Lesson 7 (Outliers, Anomalies and Measurement Error) and Lesson 8 (Accuracy, Precision and Repeated Trials), and they're essential for Lesson 13 (Communicating Findings Clearly), where your report is only as good as the data behind it.

Words You Need

vocabulary

Data quality: The overall usefulness and trustworthiness of data for its intended purpose.

Reliability: The consistency of data when measurements are repeated under the same conditions.

Validity: The extent to which data actually measures what it is intended to measure.

Bias: A systematic distortion in data collection or interpretation that leads to unfair or unrepresentative results.

Sample size: The number of individuals, trials or measurements included in a study.

Peer review: The process by which scientific work is evaluated by other experts before publication.

Spot the Trap

heads-up

Wrong: If data is published on the internet, it must be reliable.

Right: Reliable data comes from credible sources with transparent methods and peer review — not just because it is online.

Wrong: A large sample size always guarantees good data.

Right: A large biased sample is still biased. Quality of sampling matters as much as quantity.

Wrong: Trusting data because it comes from a familiar brand or website.

Right: Familiarity does not equal scientific rigour. Always check the source credentials, methodology and whether the data has been peer-reviewed.

Wrong: Assuming that correlation in a dataset proves the data is high quality.

Right: Correlation can exist in biased or invalid data too. Evaluate the collection method and source independently of the results.

Source and Credibility

Before you trust a piece of data, you need to know where it came from. Source credibility means judging whether the person or organisation that produced the data has the expertise, honesty and evidence to back their claims.

Ask three questions. Who? Are the authors qualified? A climate scientist and a random blogger have very different authority on global warming. Why? What is their motive? A study funded by a tobacco company may downplay smoking risks. How? Can you see their methods and data? Transparent sources let you check their work; secretive sources hide flaws.

Websites ending in .gov or .edu are usually more reliable than social media posts, but they are not perfect. Even reputable sources can contain outdated or oversimplified information. The key is to cross-check: find two or three independent sources that agree before treating a claim as settled.

Example

A student finds two articles on renewable energy. One is a peer-reviewed paper from ANU researchers with cited data; the other is an anonymous forum post with no references. The ANU paper is the more credible source because its methods are transparent and its authors are accountable.

Real-world anchor

The Australian Academy of Science publishes clear, evidence-based briefs on topics like vaccination and climate change. Their credibility comes from independent expert panels, transparent funding and public accountability. When you see their name on a brief, you can expect the claims to have been through rigorous review.

Watch out

Students often believe that if a source is popular or viral, it must be credible. Popularity is not proof. A TikTok video with a million views can still be wrong. Credibility comes from expertise and evidence, not from likes and shares.

Which source is most credible for a report on Australian bushfire causes?

Bias in Data Collection

Bias is any systematic distortion that pushes data in a particular direction. It can creep into a study through the way questions are asked, the way participants are chosen, or the way results are interpreted. Bias does not always mean dishonesty; sometimes it is accidental, but it still undermines trust.

Selection bias happens when the sample is not representative. If you survey only people at a gym about exercise habits, you will overestimate how active the general population is. Confirmation bias happens when a researcher only notices evidence that supports their existing belief. Leading questions nudge respondents toward a desired answer — for example, “Don’t you agree that homework is unfair?” presumes the answer.
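
The gym example can be turned into a small calculation. The sketch below builds a made-up population of 100 people (the group sizes and exercise hours are hypothetical, chosen only to illustrate the idea) and compares the average for everyone with the average you would get by surveying only gym-goers.

```python
from statistics import mean

# Hypothetical population: (goes_to_gym, exercise_hours_per_week). Made-up numbers.
population = (
    [(True, 6)] * 20      # 20 gym-goers who exercise about 6 h a week
    + [(False, 2)] * 80   # 80 non-gym-goers who exercise about 2 h a week
)

def average_hours(people):
    """Mean weekly exercise hours for a list of (goes_to_gym, hours) records."""
    return mean(hours for _, hours in people)

# Selection bias: only people found at the gym get surveyed.
gym_sample = [person for person in population if person[0]]

true_average = average_hours(population)     # the whole population
biased_average = average_hours(gym_sample)   # gym-goers only

print(f"whole population: {true_average} h/week")
print(f"gym-only sample:  {biased_average} h/week")
```

In this invented population the gym-only sample reports a much higher average than the population as a whole, even though nobody lied and every answer was recorded correctly.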

Different kinds of bias need different defences. In experiments, a strong defence is blinding: hide group assignments from participants and researchers so expectations cannot influence measurements. Random selection guards against selection bias, and neutral wording guards against leading questions.

Example

A company asks customers “How much do you love our new product?” instead of “What do you think of our new product?” The leading wording creates bias because it pressures people to be positive. A neutral question gives more honest data.

Real-world anchor

The ABS designs survey questions carefully to avoid leading wording and, for its household surveys, selects households randomly across all states and territories. This painstaking neutrality helps keep official Australian statistics fair and representative, which in turn guides fair government spending.

Watch out

Some students think bias only means lying. It does not. A well-meaning researcher can produce biased data by accidentally choosing an unrepresentative sample or asking questions that hint at the “right” answer. Good scientists actively design their studies to prevent bias, not just avoid dishonesty.

True or false?

Bias only occurs when a researcher deliberately falsifies data.

Sample Size and Representation

The size and makeup of your sample decide how far you can stretch your conclusions. A sample is the subset of the population you actually study. If the sample is too small or skewed, your findings may be a fluke rather than a real pattern.

Sample size matters because random variation is more extreme in small groups. Toss a coin five times and you might get four heads; toss it five hundred times and the result will be close to 50/50. Scientists use statistical rules to decide the minimum sample size needed to detect a real effect. Random selection matters because it gives every member of the population an equal chance of being included, which reduces selection bias.
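
The coin-toss claim can be tested with a short simulation. The sketch below is one possible way to do it: the function names, the 1000 repeats and the fixed seed are all choices made for this illustration. It repeats the 5-toss and 500-toss experiments many times and measures how much the fraction of heads varies from one run to the next.

```python
import random
from statistics import pstdev

def heads_fraction(n_tosses, rng):
    """Toss a fair coin n_tosses times; return the fraction that came up heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

def spread_of_results(n_tosses, n_repeats=1000, seed=1):
    """Repeat the whole experiment many times and measure how much results vary."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    fractions = [heads_fraction(n_tosses, rng) for _ in range(n_repeats)]
    return pstdev(fractions)

small = spread_of_results(5)
large = spread_of_results(500)
print(f"spread with 5 tosses:   {small:.3f}")
print(f"spread with 500 tosses: {large:.3f}")
```

The spread for 5 tosses should come out roughly ten times larger than the spread for 500 tosses, which is why results from small samples swing so far from 50/50.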

Even with a large sample, you must check representation. A survey of 1 000 city dwellers cannot speak for rural Australians unless rural areas were deliberately included. Always ask: who is missing from this sample?

Example

A student tests whether a new fertiliser works by growing three plants with it and three without. The difference in height is 2 cm. Because the sample is tiny, random variation could easily explain the result. A larger trial with thirty plants per group would give a more trustworthy conclusion.
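
A rough calculation shows why three plants per group is too few. The sketch below assumes, purely for illustration, that plant heights within a group vary with a standard deviation of 3 cm (a hypothetical figure that the example does not give). It uses the standard formula for the standard error of a difference between two group means with equal spread and equal size, sd × sqrt(2 / n).

```python
import math

ASSUMED_SD_CM = 3.0      # hypothetical plant-to-plant variation within a group
OBSERVED_DIFF_CM = 2.0   # difference in height from the example

def se_of_difference(sd, n_per_group):
    """Standard error of the difference between two group means (equal sd and n)."""
    return sd * math.sqrt(2 / n_per_group)

for n in (3, 30):
    se = se_of_difference(ASSUMED_SD_CM, n)
    print(f"{n} plants per group: chance variation of about {se:.2f} cm "
          f"vs observed difference of {OBSERVED_DIFF_CM} cm")
```

Under this assumption, the 2 cm difference is smaller than the typical chance variation with three plants per group, but clearly larger than it with thirty plants per group.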

Real-world anchor

Medical trials in Australia, such as those funded by the NHMRC, often recruit thousands of participants from diverse backgrounds so that results apply to the whole population. Small, unrepresentative trials can miss side effects or exaggerate benefits, which is why trial size and representation are scrutinised so carefully.

Watch out

Students often believe that a sample of their classmates is representative of all teenagers in Australia. It is not. Your classmates share a school, a neighbourhood and often similar backgrounds. Generalising beyond your sample without evidence is a common scientific mistake.

Drop the right term into each blank.

A study with a ____ sample size is more affected by ____ variation. Random selection helps reduce selection bias. A ____ sample reflects the whole population.

Speed Round · 6 questions

True or false? Tap as fast as you can. Build a streak.

Reliable data gives consistent results when repeated.

Revisit Your Thinking

reflect

Consider this question: "A study has 5 data points and a clear trend. Can you trust the conclusion?" Your instinct might be to say yes, because the trend looks convincing.

Now that you know the criteria for data quality — sample size, reliability, validity and bias — what's your answer? What would need to be true about those 5 data points for the conclusion to be trustworthy, and what warning signs would make you doubt it?

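Now return to the two climate change sources from the warm-up: the article by climate scientists with referenced data, and the social media post with no sources. Apply the data quality criteria from this lesson to evaluate both sources, and explain which one you would use for a school assignment and why.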

Write your updated thinking in your book.

Quick Check · 5 questions

1. Which best describes valid data?

2. Why is peer review important for scientific data?

3. A study surveys only people leaving a health food shop about their diet. What is the main problem?

4. Which factor is least important when evaluating data quality?

5. Data that gives similar results when the experiment is repeated is described as:

Check Your Understanding · 3 questions

short answer

1. Explain the difference between reliability and validity, and describe a situation where data could be one but not the other.