Evaluating Data Quality
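In 1998, Dr Andrew Wakefield published a paper claiming a link between the MMR vaccine and autism, based on just 12 children. The sample was so small and the study so compromised that 10 of his 12 co-authors retracted its interpretation in 2004, and The Lancet fully retracted the paper in 2010.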
You find two websites about climate change. One is written by a team of climate scientists with referenced data. The other is a social media post with no sources and lots of exclamation marks.
Which source is more trustworthy? List three specific reasons for your choice.
A study surveyed 8 students at one school and concluded that “teenagers across Australia prefer online learning to face-to-face.” You read the headline. Do you believe it? Eight students at one school is not enough to speak for millions. The survey questions may have been fine, but the sample was broken. Validity, reliability and representativeness are the three main criteria scientists use to judge whether data is worth trusting.
Validity asks whether the data actually measures what it claims to measure. If you use a ruler that is marked incorrectly, your length data is invalid no matter how carefully you read it. Reliability asks whether the data is consistent. If you repeat the same measurement five times and get wildly different results, your method is unreliable. Representativeness asks whether the sample reflects the whole population. Surveying only your friends about a national issue gives unrepresentative data.
Another key factor is peer review. When independent experts check the methods and conclusions before publication, many errors and biases are caught. Peer review does not guarantee a finding is correct, but it greatly reduces the chance of sloppy work being accepted.
A student measures the cooling rate of water using a thermometer with a cracked bulb. The readings are consistent (reliable) but always 3 °C low (invalid). The temperature data fails on validity, so any conclusion that depends on the actual temperatures is untrustworthy even though the numbers look neat.
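The thermometer example can be checked with a few lines of Python. This is a minimal sketch, not a standard method: the readings, the known temperature of 60.0 °C and the function name summarise are all made up for illustration. A small spread suggests reliable readings, while an offset far from zero suggests the instrument is not valid.

```python
from statistics import mean, stdev

def summarise(readings, true_value):
    """Return (spread, offset) for repeated readings of one known temperature."""
    spread = stdev(readings)              # small spread -> consistent (reliable)
    offset = mean(readings) - true_value  # far from zero -> systematic error (invalid)
    return spread, offset

# Hypothetical readings of water known to be at 60.0 °C
readings = [57.1, 56.9, 57.0, 57.0, 57.0]
spread, offset = summarise(readings, true_value=60.0)
print(f"spread = {spread:.2f} °C, offset = {offset:.1f} °C")
# prints: spread = 0.07 °C, offset = -3.0 °C
```

The readings agree with each other to within a fraction of a degree, yet every one of them is about 3 °C too low. Repeating the measurement more times would not reveal the problem; only comparing against a trusted reference would.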
CSIRO publishes peer-reviewed research on bushfire behaviour, drought patterns and renewable energy. Peer-reviewed data sets are checked for valid instruments, reliable repetition and representative sampling sites across Australia. This quality control is one reason CSIRO advice is used by state emergency services.
Many students think that big numbers always mean trustworthy data. They do not. A survey of ten thousand people is useless if the questions are leading or the sample is biased. Size matters, but only when the data is also valid, reliable and fairly collected.
A researcher surveys 5 people about their favourite sport and finds 80 % choose soccer. How many people should they survey to be confident the result reflects the whole school?
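One rough way to think about this question is the margin of error for a proportion. The sketch below assumes a simple random sample from a population much larger than the sample, and uses the common 95 % approximation of 1.96 standard errors. The function name margin_of_error is made up for illustration, and the approximation is crude for very small samples, so treat the first number as a warning sign rather than a precise value.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate 95 % margin of error for a proportion p from a random sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

# Same 80 % result, three different sample sizes
for n in (5, 50, 500):
    print(n, round(margin_of_error(0.8, n), 2))
# prints:
# 5 0.35
# 50 0.11
# 500 0.04
```

With 5 people, the "80 %" could plausibly be out by around 35 percentage points either way. With 500 randomly chosen people it narrows to about 4 points, but only if the sample is also representative.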
Know
- High-quality data is reliable, valid and sufficient to support a conclusion.
- Source credibility, sample size and methodology all affect data quality.
Understand
- Bias and error can reduce data quality at any stage of collection and analysis.
- Evaluating data quality is essential before using data to make decisions or form arguments.
Can Do
- Apply criteria to evaluate the quality of a dataset or source.
- Identify potential sources of bias and error in data collection.
Wrong: If data is published on the internet, it must be reliable.
Right: Reliable data comes from credible sources with transparent methods and peer review — not just because it is online.
Wrong: A large sample size always guarantees good data.
Right: A large biased sample is still biased. Quality of sampling matters as much as quantity.
Wrong: Trusting data because it comes from a familiar brand or website.
Right: Familiarity does not equal scientific rigour. Always check the source credentials, methodology and whether the data has been peer-reviewed.
Wrong: Assuming that correlation in a dataset proves the data is high quality.
Right: Correlation can exist in biased or invalid data too. Evaluate the collection method and source independently of the results.
Before you trust a piece of data, you need to know where it came from. Source credibility means judging whether the person or organisation that produced the data has the expertise, honesty and evidence to back their claims.
Ask three questions. Who? Are the authors qualified? A climate scientist and a random blogger have very different authority on global warming. Why? What is their motive? A study funded by a tobacco company may downplay smoking risks. How? Can you see their methods and data? Transparent sources let you check their work; secretive sources hide flaws.
Websites ending in .gov or .edu are usually more reliable than social media posts, but they are not perfect. Even reputable sources can contain outdated or oversimplified information. The key is to cross-check: find two or three independent sources that agree before treating a claim as settled.
A student finds two articles on renewable energy. One is a peer-reviewed paper from ANU researchers with cited data; the other is an anonymous forum post with no references. The ANU paper is the more credible source because its methods are transparent and its authors are accountable.
The Australian Academy of Science publishes clear, evidence-based briefs on topics like vaccination and climate change. Their credibility comes from independent expert panels, transparent funding and public accountability. These practices, not the logo itself, are why you can expect their claims to have been through rigorous review.
Students often believe that if a source is popular or viral, it must be credible. Popularity is not proof. A TikTok video with a million views can still be wrong. Credibility comes from expertise and evidence, not from likes and shares.
Bias is any systematic distortion that pushes data in a particular direction. It can creep into a study through the way questions are asked, the way participants are chosen, or the way results are interpreted. Bias does not always mean dishonesty; sometimes it is accidental, but it still undermines trust.
Selection bias happens when the sample is not representative. If you survey only people at a gym about exercise habits, you will overestimate how active the general population is. Confirmation bias happens when a researcher only notices evidence that supports their existing belief. Leading questions nudge respondents toward a desired answer — for example, “Don’t you agree that homework is unfair?” presumes the answer.
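The gym survey described above can be simulated. This is a sketch with invented numbers: a hypothetical town of 10,000 people in which 2,000 gym members average about 6 hours of exercise a week and everyone else averages about 2 hours. It compares a survey handed out at the gym with a survey of randomly chosen residents.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical town: 2,000 gym members who exercise more, 8,000 others.
gym_members = [max(0.0, rng.gauss(6.0, 1.5)) for _ in range(2000)]
others = [max(0.0, rng.gauss(2.0, 1.0)) for _ in range(8000)]
population = gym_members + others

gym_sample = rng.sample(gym_members, 500)    # survey handed out at the gym
random_sample = rng.sample(population, 500)  # every person equally likely

true_mean = mean(population)
gym_mean = mean(gym_sample)
random_mean = mean(random_sample)
print(f"whole town: {true_mean:.1f} h, gym survey: {gym_mean:.1f} h, "
      f"random survey: {random_mean:.1f} h")
```

Both surveys ask 500 people, so sample size is not the difference. The gym survey overestimates weekly exercise because of who was asked, while the random survey lands close to the town's real average.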
One of the strongest defences against bias is blinding: hide group assignments from participants and researchers so expectations cannot influence measurements. Random selection guards against selection bias, and neutral wording guards against leading questions.
A company asks customers “How much do you love our new product?” instead of “What do you think of our new product?” The leading wording creates bias because it pressures people to be positive. A neutral question gives more honest data.
The ABS designs survey questions carefully to avoid leading wording, and its household surveys select households randomly across all states and territories. This painstaking neutrality helps keep ABS statistics as free from bias as possible, which supports fair government spending decisions.
Some students think bias only means lying. It does not. A well-meaning researcher can produce biased data by accidentally choosing an unrepresentative sample or asking questions that hint at the “right” answer. Good scientists actively design their studies to prevent bias, not just avoid dishonesty.
The size and makeup of your sample decide how far you can stretch your conclusions. A sample is the subset of the population you actually study. If the sample is too small or skewed, your findings may be a fluke rather than a real pattern.
Sample size matters because random variation is more extreme in small groups. Toss a coin five times and you might get four heads; toss it five hundred times and the result will be close to 50/50. Scientists use statistical rules to decide the minimum sample size needed to detect a real effect. Random selection matters because it gives every member of the population an equal chance of being included, which reduces selection bias.
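The coin example can be tried in Python. The sketch below tosses a simulated fair coin in batches of 5 and batches of 500, repeats each batch 200 times, and reports the lowest and highest share of heads seen. The function name heads_fraction_range, the number of repeats and the seed are arbitrary choices for illustration.

```python
import random

def heads_fraction_range(n_tosses, trials=200, seed=1):
    """Toss a fair coin n_tosses times, repeat, and return the lowest and highest share of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    fractions = []
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
        fractions.append(heads / n_tosses)
    return min(fractions), max(fractions)

for n in (5, 500):
    low, high = heads_fraction_range(n)
    print(f"{n} tosses: heads share ranged from {low:.2f} to {high:.2f}")
```

Batches of 5 tosses swing widely, while batches of 500 stay close to one half. The coin is the same in both cases; only the sample size has changed.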
Even with a large sample, you must check representation. A survey of 1 000 city dwellers cannot speak for rural Australians unless rural areas were deliberately included. Always ask: who is missing from this sample?
A student tests whether a new fertiliser works by growing three plants with it and three without. The difference in height is 2 cm. Because the sample is tiny, random variation could easily explain the result. A larger trial with thirty plants per group would give a more trustworthy conclusion.
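To see how easily chance alone could produce a 2 cm gap, the fertiliser trial can be simulated with a fertiliser that does nothing. This is a sketch under invented assumptions: plant heights average 20 cm and vary with a standard deviation of 3 cm, and the function name fluke_rate is made up. Real plants may vary more or less than this.

```python
import random
from statistics import mean

def fluke_rate(n_per_group, gap_cm=2.0, sd_cm=3.0, trials=2000, seed=7):
    """Share of simulated trials where two identical groups differ by gap_cm or more."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    flukes = 0
    for _ in range(trials):
        # Both groups come from the same distribution: the fertiliser has no effect here.
        treated = [rng.gauss(20.0, sd_cm) for _ in range(n_per_group)]
        control = [rng.gauss(20.0, sd_cm) for _ in range(n_per_group)]
        if abs(mean(treated) - mean(control)) >= gap_cm:
            flukes += 1
    return flukes / trials

for n in (3, 30):
    print(f"{n} plants per group: {fluke_rate(n):.0%} of no-effect trials show a 2 cm gap")
```

Under these assumptions, a 2 cm gap turns up often with three plants per group even though nothing real is happening, and only rarely with thirty. That is why the larger trial gives a more trustworthy conclusion.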
Medical trials in Australia, such as those funded by the NHMRC, often recruit thousands of participants from diverse backgrounds to ensure results apply to the whole population. Small, unrepresentative trials can miss side effects or exaggerate benefits, which is why size and representation are checked carefully when trials are designed and reviewed.
Students often believe that a sample of their classmates is representative of all teenagers in Australia. It is not. Your classmates share a school, a neighbourhood and often similar backgrounds. Generalising beyond your sample without evidence is a common scientific mistake.
A study with a small sample size is more affected by random variation. Random selection helps reduce selection bias. A representative sample reflects the whole population.
Speed Round · 6 questions
True or false?
Reliable data gives consistent results when repeated.
Valid data always supports your hypothesis.
A peer-reviewed journal is generally a more credible source than an anonymous blog.
A large sample size automatically guarantees that a study is free from bias.
Bias can enter data collection through leading questions or poor sampling.
If data is published on the internet, it must be reliable.
At the start of the lesson you were asked: "A study has 5 data points and a clear trend — can you trust the conclusion?" Your instinct might have been to say yes, because the trend looks convincing.
Now that you know the criteria for data quality — sample size, reliability, validity and bias — what's your answer? What would need to be true about those 5 data points for the conclusion to be trustworthy, and what warning signs would make you doubt it?
Apply the data quality criteria from this lesson to evaluate both sources, and explain which one you would use for a school assignment and why.
Check Your Understanding · 3 questions
1. Explain the difference between reliability and validity, and describe a situation where data could be one but not the other.
2. Why is peer review important for ensuring data quality in scientific research?
3. Describe two ways bias can enter data collection, and suggest how to prevent each.
Show Your Working · 3 questions
SA1. Define reliability and validity, and explain why both are necessary for high-quality scientific data. Include an example where data might be reliable but not valid.
SA2. Describe three criteria you would use to evaluate whether a dataset about mobile phone use and sleep is trustworthy enough to support a health recommendation.
Hint: Think about source, method, sample and bias.
SA3. Explain why a large sample size does not automatically mean a study is free from bias.
Data Quality: Reliable, valid and sufficient
Reliability: Consistent results when repeated
Validity: Measures what it claims to measure
Bias: Systematic distortion from the truth
Peer Review: Evaluation by experts before publication
Sample Size: Must be sufficient AND representative