This topic explains how to design valid data collection methods (surveys, questionnaires, sampling),
how to choose sensible categories when converting numerical data into a χ2 table, and how to
assess reliability and validity of measures.
| Term / concept | Definition / short explanation |
|---|---|
| Survey / questionnaire design | Structured tool to collect data. Good design uses clear, unbiased questions, consistent answer choices, and pilot testing. |
| Categorisation for χ2 | Grouping continuous data into classes so that expected frequencies ≥ 5 and categories are meaningful and justified. |
| Reliability | Consistency of a measure (repeatability). A reliable test gives similar results under similar conditions. |
| Validity | The test measures what it is intended to measure (content, criterion-related validity). A valid test must also be reliable, but a reliable test is not necessarily valid. |
📌 1. Designing valid data collection methods (surveys & questionnaires)
When designing a survey, follow these core principles (each explained below with practical notes):
- Clear purpose: define the research question and which variables are needed.
- Question clarity: use simple language, avoid double-barrelled questions, define timeframes and units.
- Answer choices: provide mutually exclusive, exhaustive response options; use Likert scales consistently.
- Sampling plan: decide who will be sampled and how (random, stratified, convenience) and justify your choice.
- Pilot testing: trial the questionnaire to find ambiguous items and estimate completion time.
- Ethics and consent: obtain informed consent, anonymise data, and consider sensitive questions carefully.
📐 IA Spotlight
For an Internal Assessment you can base your investigation on a well-designed survey: state your sampling frame, pilot the questionnaire, present the final instrument in an appendix, and discuss how design choices (question wording, response categories) influence validity and reliability.
📌 2. Sampling methods and selecting relevant variables
- Simple random sampling: each member has equal chance — minimises selection bias but may be impractical for large populations.
- Stratified sampling: divide population into subgroups (strata) and sample proportionally — ensures key groups are represented and reduces sampling error for subgroup estimates.
- Cluster sampling: sample entire clusters (e.g., classes) when individual sampling is costly — efficient but increases design effect and standard errors.
- Convenience sampling: easy but biased — use only when limitations are acknowledged and results are not generalised beyond the sample.
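The stratified approach above can be sketched in a few lines of Python. This is a minimal illustration under assumed data: the population records, the `year` field, and the sample size of 20 are all hypothetical, and proportional allocation is rounded with at least one unit per stratum.

```python
import random

def stratified_sample(population, strata_key, total_n, seed=0):
    """Draw a proportionally stratified sample from a list of records."""
    rng = random.Random(seed)
    # Group the population into strata by the chosen key
    strata = {}
    for rec in population:
        strata.setdefault(rec[strata_key], []).append(rec)
    sample = []
    for group in strata.values():
        # Proportional allocation, with at least 1 unit per stratum
        n_g = max(1, round(total_n * len(group) / len(population)))
        sample.extend(rng.sample(group, min(n_g, len(group))))
    return sample

# Hypothetical school population: 60 juniors and 40 seniors
pop = [{"id": i, "year": "junior"} for i in range(60)] + \
      [{"id": i, "year": "senior"} for i in range(60, 100)]
s = stratified_sample(pop, "year", 20)
# Proportional allocation gives 12 juniors and 8 seniors
```

Note how the strata sizes in the sample mirror the population proportions (60:40 becomes 12:8), which is exactly the property that reduces sampling error for subgroup estimates.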
Explicit guidance on variable selection:
- Choose variables that directly address the research question — collecting more variables than necessary increases respondent burden and noise.
- Prefer objective measures where possible (e.g., logged minutes, measured height) rather than subjective recall which may be biased.
- If using derived variables (e.g., indices, ratios), define calculation steps clearly and test for sensitivity to outliers.
🔍 TOK Perspective
Consider the role of question framing and sampling in shaping “what we know”. How do choice of sample and wording influence the reliability of the knowledge claims derived from data?
📌 3. Categorising numerical data for χ2 analysis — principled choices
When converting continuous numerical data into categories for a χ2 goodness-of-fit or contingency table:
- Meaningful boundaries: choose class limits with practical meaning (e.g., age groups 0–17, 18–34, 35–54, 55+), not arbitrary tiny splits.
- Ensure expected counts ≥ 5: combine adjacent classes where necessary to keep expected frequencies above the commonly used threshold (reduces χ2 bias).
- K (number of classes): prefer fewer classes with sufficient counts rather than many sparse classes; justify K based on sample size and research aim.
- Document decisions: always explain why classes were chosen and give the expected counts used in the χ2 calculation (transparency builds trust in conclusions).
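The binning and combining steps above can be sketched as follows. This is a simple illustration, not a prescribed method: the class edges and counts are hypothetical, and the merge rule (fold a sparse class into its neighbour, rescanning after each merge) is one reasonable way to enforce expected counts ≥ 5.

```python
def bin_counts(data, edges):
    """Count observations falling in classes [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(edges) - 1):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return counts

def merge_small_expected(observed, expected, threshold=5):
    """Merge adjacent classes until every expected count meets the threshold."""
    obs, exp = list(observed), list(expected)
    i = 0
    while i < len(exp) and len(exp) > 1:
        if exp[i] < threshold:
            j = i + 1 if i + 1 < len(exp) else i - 1  # neighbour to absorb class i
            obs[j] += obs[i]
            exp[j] += exp[i]
            del obs[i], exp[i]
            i = 0  # restart the scan after a merge
        else:
            i += 1
    return obs, exp

# Hypothetical example: the first expected count (1.5) is below 5,
# so classes 1 and 2 are combined before running the chi-squared test
obs, exp = merge_small_expected([2, 3, 10, 9], [1.5, 3.5, 10, 9])
```

Whatever rule you use, record which classes were merged — that written justification is part of a transparent χ2 analysis.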
Example scenarios:
- Public health: grouping blood pressure readings into ‘normal’, ‘elevated’, ‘stage 1’, ‘stage 2’ — categories chosen based on clinical thresholds, ensuring enough observations per category.
- Education research: grouping exam scores into performance bands (fail / pass / merit / distinction) with bands wide enough to avoid very small expected counts.
🌍 Real-World Connection
Governments categorise income into brackets for policy analysis; choices of bracket width affect analyses of inequality. Analysts must justify bracket selection and show expected counts before reporting χ2 results.
📌 4. Reliability and validity — definitions, tests and interpretation
These concepts are distinct and both essential. Below are explicit methods and how to interpret them.
**Reliability**
- Test–retest: apply the same instrument to the same group at two times (with conditions stable). High correlation between scores suggests temporal reliability. Interpret carefully: true change vs measurement error must be considered.
- Parallel forms: two different versions of the test administered to same subjects; if scores correlate highly, forms are consistent.
- Internal consistency (conceptual): for multi-item scales (Likert), check whether items measuring the same construct agree. (Cronbach's alpha is the usual statistic, but it lies beyond the SL syllabus — mention it conceptually only.)
- Practical note: high reliability does not guarantee validity — a consistently wrong ruler gives reliable but invalid length measurements.
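Test–retest reliability can be quantified with the correlation between the two administrations. A minimal sketch, assuming hypothetical scores for six students tested two weeks apart:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

# Hypothetical test-retest scores for 6 students, two weeks apart
time1 = [12, 15, 9, 20, 14, 17]
time2 = [13, 14, 10, 19, 15, 18]
r = pearson_r(time1, time2)
# r close to 1 suggests good temporal reliability
```

A high r here supports repeatability, but remember the caveat above: it says nothing about whether the instrument measures the right construct.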
**Validity**
- Content validity: does the instrument cover the construct fully? For example, a “physical activity” questionnaire should cover intensity, duration and frequency, not just frequency.
- Criterion-related validity: does the measure agree with an established standard? e.g., a new fitness test compared to VO2 max measurements.
- Face validity: does the instrument seem to measure what it should? This is weaker but useful for survey acceptance by respondents.
- Practical checks: cross-validate findings where possible (compare with external data sources); discuss threats to validity such as social desirability bias or misreporting.
❤️ CAS Ideas
Run a data literacy workshop for younger students: design a short questionnaire, collect data, show how wording affects responses and discuss ethical consent and anonymity.
📌 5. Choosing degrees of freedom and justifying categorisation for χ2 tests
- Degrees of freedom when estimating parameters: if you estimate parameters from the data (for example, the mean and variance when fitting a normal distribution), reduce the degrees of freedom accordingly in the χ2 goodness-of-fit test: df = (number of categories) − 1 − (number of estimated parameters).
- Document the estimation: if you estimate parameters from the same data you test, explain method and adjust df; this affects critical values and p-values.
- Practical justification: always provide a short paragraph explaining why categories were chosen, how many degrees of freedom were used, and why expected frequencies meet the required thresholds.
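The statistic and df adjustment described above can be sketched directly. The counts below are hypothetical; the point is the df calculation when two parameters (mean and standard deviation) were estimated from the same data used in the test.

```python
def chi2_stat(observed, expected):
    """Chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_df(n_categories, n_estimated_params=0):
    """df = k - 1 - (number of parameters estimated from the same data)."""
    return n_categories - 1 - n_estimated_params

# Hypothetical example: 4 classes with expected counts from a fitted normal
# whose mean and sd were estimated from the same data (2 parameters)
obs = [8, 22, 24, 6]
exp = [10.0, 20.0, 20.0, 10.0]
x2 = chi2_stat(obs, exp)               # 0.4 + 0.2 + 0.8 + 1.6 = 3.0
df = chi2_df(len(obs), n_estimated_params=2)  # 4 - 1 - 2 = 1
```

Forgetting the parameter adjustment would give df = 3 instead of 1, a larger critical value, and potentially the wrong conclusion — which is why the estimation method must be documented.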
📝 Paper tip — categorisation & χ2
- Always show the raw observed counts, the formula or table used to compute expected counts, the χ2 sum and df, then state p and conclusion in context.
- When you combine classes to get expected ≥ 5, write exactly which classes were combined and why — examiners look for this justification.
- If small expected counts remain, briefly state the limitation and how it affects reliability of the χ2 result.
📌 6. Final recommendations & best practice (explicit checklist)
- Design checklist: define aims → choose variables → select sampling method → draft questions → pilot → finalise instrument → collect data ethically.
- Categorisation checklist: pick meaningful boundaries → ensure expected ≥ 5 → document combining of classes → adjust df if parameters are estimated.
- Reliability & validity checklist: test–retest or parallel forms where possible; check content coverage and compare with external measures if available.
- Reporting: include your instrument (appendix), sampling frame, response rate, treatment of missing data, and a short paragraph on limitations and possible biases.