SL 4.11 — hypothesis testing, Chi Square tests and t-tests

Term / concept: Definition / short explanation
Null hypothesis (H0): The default claim to be tested (e.g., “no association”, “population mean = μ0”). We assume H0 unless the data give strong evidence otherwise.
Alternative hypothesis (H1): The claim we suspect may be true instead of H0 (e.g., “μ ≠ μ0”, “association exists”). Can be one- or two-sided.
Significance level (α): The threshold probability for rejecting H0 (common values 0.05, 0.01, 0.10). If p ≤ α, reject H0.
p-value: Probability (under H0) of obtaining data at least as extreme as observed. A small p-value is evidence against H0.
χ² statistic: Σ (Observed − Expected)² / Expected, summed across cells; compares observed counts to the counts expected under H0.
Degrees of freedom (df): For χ² goodness-of-fit, df = k − 1 (k categories). For a contingency table, df = (rows − 1)(cols − 1).

📌 1. Formulating hypotheses (H0 and H1)

Follow these rules when writing H0 and H1:

  • State H0 as an equality or “no effect” claim (e.g., H0: p = 0.5, H0: no association).
  • State H1 as the alternative (e.g., H1: p ≠ 0.5, H1: association exists).
  • Decide one-tailed vs two-tailed before seeing the data (affects p-value interpretation).

🔍 TOK Perspective

Consider how the wording of hypotheses shapes evidence. Does rejecting H0 demonstrate the alternative is true, or only that H0 is unlikely under the data observed?

📌 2. Significance levels and p-values (interpretation)

  • Decision rule: choose α before testing; if p ≤ α → reject H0; if p > α → fail to reject H0.
  • p-value meaning: not the probability H0 is true; rather, how surprising the data are if H0 were true.
  • Reporting: give numeric p-value and conclusion in context (e.g., “There is evidence at the 5% level that …”).
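
The decision rule above is mechanical enough to sketch in a few lines of Python. The function name `decide` and its return strings are illustrative choices, not part of the syllabus.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H0 when p <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"


print(decide(0.032))        # reject H0
print(decide(0.20))         # fail to reject H0
print(decide(0.032, 0.01))  # fail to reject H0
```

The last call shows why α must be fixed before testing: the same p-value leads to different decisions at the 5% and 1% levels.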

🌍 Real-World Connection

Medical trials report p-values when testing new treatments. Policymakers must interpret small p with caution — consider effect size and sample design, not p alone.

📌 3. χ² goodness-of-fit test (categorical data)

Purpose: compare observed counts to expected counts under a specified probability model.

  1. State H0 (e.g., “data follow the claimed distribution”) and H1 (“do not follow”).
  2. Compute expected counts: Expected = n × p_category for each category.
  3. Calculate χ² = Σ (O − E)² / E across categories.
  4. Degrees of freedom df = k − 1 (k = number of categories). If m parameters are estimated from the data, df = k − 1 − m.
  5. Find the p-value from the χ² distribution with df degrees of freedom (use technology in the exam). Compare to α.

Worked example — goodness-of-fit

A six-sided die is rolled 120 times; observed counts for faces 1–6 are: 18, 20, 19, 24, 20, 19. Test H0: die is fair (p = 1/6 each) at α = 0.05.

Expected per face E = 120 × 1/6 = 20. χ² = Σ (O − 20)²/20 = ((−2)² + 0² + (−1)² + 4² + 0² + (−1)²)/20 = (4 + 0 + 1 + 16 + 0 + 1)/20 = 22/20 = 1.1.

df = 6 − 1 = 5. p ≈ 0.954 (use GDC). Since p > 0.05, fail to reject H0; there is insufficient evidence that the die is unfair.
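
For readers who want to check the die example without a GDC, here is a minimal Python sketch using only the standard library. The helper names (`chi2_upper_tail`, `goodness_of_fit`) are assumptions made for illustration; the tail probability is computed from a power series for the incomplete gamma function, which is adequate for the small statistics met in class but is not how a calculator necessarily does it.

```python
import math


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with df degrees of freedom.

    Uses the power series for the regularised lower incomplete gamma
    function P(df/2, x/2) and returns 1 minus that value.
    """
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= half_x / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(half_x) - half_x - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def goodness_of_fit(observed, probabilities):
    """Return (chi2, df, p) for observed counts against a probability model."""
    n = sum(observed)
    expected = [n * p for p in probabilities]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return chi2, df, chi2_upper_tail(chi2, df)


# Die example: 120 rolls, H0 says each face has probability 1/6.
observed = [18, 20, 19, 24, 20, 19]
chi2, df, p = goodness_of_fit(observed, [1 / 6] * 6)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.3f}")
```

The p-value comes out close to 0.95, far above α = 0.05, so the observed counts are entirely consistent with a fair die.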

📌 4. χ² test for independence (contingency tables)

Purpose: test whether two categorical variables are independent.

  1. Form a contingency table of observed counts O_ij (rows × columns).
  2. Compute expected counts E_ij = (row total × column total) / grand total.
  3. Compute χ² = Σ (O_ij − E_ij)² / E_ij, summed over all cells.
  4. Degrees of freedom df = (r − 1)(c − 1). Use technology to get the p-value and conclusion.
  5. Check expected counts: best practice is for every expected count to be at least 5; if several fall below 5, interpret χ² with caution (consider Fisher’s exact test for small tables).

📐 IA Spotlight

  • Use contingency tables when investigating relationships in survey data (e.g., gender vs. preference). Show how expected counts are computed and discuss limitations when expected counts are small.

Worked example — independence (2×2)

Surveyed 100 students for (A) studies online (Yes/No) and (B) prefers recorded lectures (Yes/No). Observed:

Table (totals in brackets): rows = studies online Yes (30), No (70); columns = prefers recorded lectures Yes (40), No (60).

Expected for cell (Yes, Yes): E = (row total 30 × column total 40) / 100 = 12. Compute χ² across all 4 cells (use GDC). df = (2 − 1)(2 − 1) = 1. Compare the p-value to α.
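
The expected-count and χ² calculations can be sketched in Python as follows. The example above gives only the row and column totals, so the four cell counts used here are HYPOTHETICAL values chosen to match those totals; the function names are also illustrative. The p-value line relies on the identity P(χ² ≥ x) = erfc(√(x/2)), which holds only for df = 1, and no continuity correction is applied.

```python
import math


def expected_counts(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[r * c / grand for c in col_totals] for r in row_totals]


def chi2_statistic(observed, expected):
    """Sum of (O - E)^2 / E over every cell of the table."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# HYPOTHETICAL cell counts (rows: studies online Yes/No,
# columns: prefers recorded Yes/No). Row totals 30 and 70,
# column totals 40 and 60, grand total 100.
observed = [[18, 12],
            [22, 48]]

expected = expected_counts(observed)
chi2 = chi2_statistic(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 1 only: upper-tail probability of chi-square.
p_value = math.erfc(math.sqrt(chi2 / 2))

# Assumption check: every expected count should be at least 5.
smallest_expected = min(min(row) for row in expected)

print("expected:", expected)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.4f}")
print("all expected counts >= 5:", smallest_expected >= 5)
```

With these made-up counts the top-left expected value is 12, matching the hand calculation, and every expected count passes the "at least 5" check.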

Yates continuity correction: sometimes applied for 2×2 tables with small counts to reduce bias in χ². In exams, technology will usually handle this; mention the continuity correction if counts are small.

📌 5. The t-test (comparing two means) — SL perspective

SL conditions: two independent (unpaired) samples; population variances unknown and assumed equal → use pooled two-sample t-test. Technology computes t and p.

  • Hypotheses: example H0: μ1 = μ2; H1: μ1 ≠ μ2 (two-sided) or >/< for one-sided.
  • Assumptions: both populations approx normal (especially important for small samples), independent samples, equal variances (pooled t-test).
  • Test statistic (pooled): technology computes t with df = n1 + n2 − 2; report p and conclude.

🧠 Examiner Tip

  • Always state H0 and H1 clearly (equation / inequality).
  • Show method: write which test was used (e.g., “pooled two-sample t-test”) and justify it (independence, approx normal, equal variances).
  • Include numeric result and context: show t, df (if asked), p-value and interpret in plain language (conclusion about means in context).

📌 6. Use of technology and practical advice

  • In examinations use a GDC or software to compute χ², t and p-values; display key intermediate values (observed and expected counts, t-statistic) for clarity.
  • Always check assumptions: expected counts for χ², normality and equal variances for the t-test. If assumptions fail, mention limitations.
  • For small counts in χ² (expected < 5), note that results may be unreliable and consider alternative tests (Fisher’s exact test for 2×2).

❤️ CAS Link

Run a small community survey (e.g., about transport choices) and use χ² tests to check associations. Present results and discuss limitations of small expected counts.

Worked example — two-sample t-test (illustrative)

Sample A: n1 = 12, mean = 50, s = 5. Sample B: n2 = 14, mean = 46, s = 6. Test H0: μ1 = μ2 against H1: μ1 ≠ μ2 at α = 0.05.

Use technology (pooled two-sample t-test): t ≈ 1.83, df = 24, p ≈ 0.080 (two-sided). Since p > 0.05, fail to reject H0; there is insufficient evidence at the 5% level that the means differ (state the exact p-value and the context in your conclusion).
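
A sketch of how a pooled two-sample t statistic can be computed from summary statistics like those above, using only the Python standard library. The function names are assumptions for illustration, and the two-sided p-value is obtained by integrating the t density numerically with Simpson's rule rather than by whatever method a GDC uses.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|), via Simpson's rule on the density over [0, |t|]."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 < T < |t|)
    return 1 - 2 * central


def pooled_t(n1, mean1, s1, n2, mean2, s2):
    """Pooled (equal-variance) t statistic and its degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_error, df


# Summary statistics from the worked example.
t_stat, df = pooled_t(12, 50, 5, 14, 46, 6)
p = two_sided_p(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df}, two-sided p = {p:.3f}")
```

Working from the pooled variance makes the equal-variance assumption explicit: both sample variances are combined into one estimate, weighted by their degrees of freedom.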

🌐 EE Focus

Explore statistical testing choices in an EE: comparing χ² vs Fisher for small counts, or studying the robustness of t-tests to non-normality with simulations.

📌 Quick summary & checklist

  • Write H0 and H1 clearly, state α.
  • Choose the correct test: χ² goodness-of-fit (categorical vs model), χ² independence (contingency), t-test for two means (SL conditions).
  • Check assumptions (expected counts, normality, equal variances). Use technology for calculations and give contextual interpretation of p.
  • When small expected counts appear, mention Yates/Fisher as appropriate and highlight limitations.

📝 Paper tips — hypothesis tests

  • Label everything: show O and E tables, state df, give χ², p and conclusion in context.
  • When using technology: still present the formula or intermediate E-values to earn method marks.
  • Always interpret: end with a one-line sentence linking conclusion to the real-world context of the question.
  • At the end of the question: conclude with “We do / do not have enough evidence to reject the null hypothesis” (very important to remember).

📌 SL 4.11 — Hypothesis Testing, Chi-Square Tests & t-Tests

Multiple Choice Questions

MCQ 1
A hypothesis test is carried out at the 5% significance level. The p-value obtained is 0.032.
Which conclusion is correct?

  • A. Accept H0 because the p-value is small
  • B. Reject H0 because the p-value is less than 0.05
  • C. Accept H1 because the p-value is greater than 0.05
  • D. Do not reject H0 because the result is inconclusive
Answer & Explanation

Correct answer: B

In hypothesis testing, the decision rule using the p-value is:

If p-value < α, reject H0.

Here, the p-value is 0.032 and the significance level is α = 0.05.
Since 0.032 < 0.05, the result is considered statistically significant.

This means the observed data are unlikely under the assumption that H0 is true,
so we state “we have enough evidence to reject the null hypothesis”.


MCQ 2
Which of the following situations is most appropriate for a chi-square test for independence?

  • A. Comparing two population means with unknown variance
  • B. Testing whether a die is fair
  • C. Testing whether two categorical variables are related
  • D. Estimating a confidence interval for a mean
Answer & Explanation

Correct answer: C

A chi-square test for independence is used when:

  • Both variables are categorical
  • Data are presented in a contingency table
  • We want to see whether the variables are associated or independent

Options A and D concern means, which call for t-based methods rather than a chi-square test.
Option B refers to a chi-square goodness-of-fit test, not a test for independence.


MCQ 3
In a one-sample t-test, which assumption is required?

  • A. The population standard deviation must be known
  • B. The population must be normally distributed
  • C. The sample size must be greater than 30
  • D. The data must be categorical
Answer & Explanation

Correct answer: B

A one-sample t-test is used when the population standard deviation is unknown.

The key assumption is that the population distribution is normal, especially for small sample sizes.
If the sample size is large, the test is more robust, but normality is still the formal assumption in IB.


Short Answer Questions

Short Question 1
Explain what is meant by a Type I error in hypothesis testing.

Model Answer

A Type I error occurs when the null hypothesis H0 is rejected even though it is actually true.

In other words, it is a false positive result, where the test suggests evidence for an effect or difference
that does not truly exist.

When H0 is true, the probability of making a Type I error is equal to the chosen significance level α.
For example, if α = 0.05, there is a 5% chance of rejecting a true null hypothesis.
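
The link between α and the Type I error rate can be illustrated with a small simulation: roll a genuinely fair die many times, run a χ² goodness-of-fit test on each batch, and count how often H0 is wrongly rejected. This is a sketch under stated assumptions: the batch size of 120 rolls, the 2000 repetitions and the seed are arbitrary choices, and 11.070 is the standard tabulated critical value for df = 5 at the 5% level.

```python
import random

# Tabulated chi-square critical value for df = 5 at alpha = 0.05.
CRITICAL_VALUE = 11.070


def chi2_fair_die(counts):
    """Chi-square statistic for six face counts against a fair-die model."""
    expected = sum(counts) / 6
    return sum((o - expected) ** 2 / expected for o in counts)


def type_one_error_rate(trials=2000, rolls=120, seed=1):
    """Fraction of simulated fair-die experiments in which H0 is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        counts = [0] * 6
        for _ in range(rolls):
            counts[rng.randrange(6)] += 1
        if chi2_fair_die(counts) > CRITICAL_VALUE:
            rejections += 1
    return rejections / trials


rate = type_one_error_rate()
print(f"H0 (fair die) was rejected in {rate:.1%} of experiments")
```

Because the die really is fair, every rejection here is a Type I error, and the observed rate should land near 5%, up to simulation noise.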


Short Question 2
State two conditions required for a chi-square test to be valid.

Model Answer

First, all expected frequencies in the contingency table should be sufficiently large,
typically at least 5, to ensure the chi-square approximation is valid.

Second, the observations must be independent, meaning that each individual or outcome
contributes to only one cell of the table.

If these conditions are not met, the conclusions of the test may not be reliable.


Long Answer Questions

Long Question 1 — One-Sample t-Test

A manufacturer claims that the mean lifetime of a certain type of battery is 120 hours.
A random sample of 10 batteries has a mean lifetime of 114 hours with a sample standard deviation of 8 hours.

(a) State the null and alternative hypotheses.
(b) Explain why a t-test is appropriate.
(c) Determine the test statistic.
(d) State the conclusion at the 5% significance level.

Full Worked Solution

(a) Hypotheses

H0: μ = 120
H1: μ < 120

The alternative hypothesis reflects suspicion that the true mean lifetime is less than the advertised value.

(b) Choice of test

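The population standard deviation is unknown and the sample size is small, so the sample standard deviation is used in its place.
Therefore, assuming battery lifetimes are approximately normally distributed, a one-sample t-test is appropriate.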

(c) Test statistic

t = (x̄ − μ0) / (s / √n)

t = (114 − 120) / (8 / √10) ≈ −2.37

(d) Conclusion

Using the GDC, the one-tailed p-value corresponding to t ≈ −2.37 with 9 degrees of freedom is approximately 0.021, which is less than 0.05.

Since p-value < 0.05, we reject H0.
There is sufficient evidence at the 5% level to suggest that the mean battery lifetime is less than 120 hours.
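
The battery calculation can be checked with a short Python sketch that uses only the standard library. The function names are illustrative assumptions, and the lower-tail probability is found by numerically integrating the t density with Simpson's rule, which stands in for the GDC's built-in routine.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_lower_tail(t, df, steps=2000):
    """P(T <= t), via Simpson's rule on the density over [0, |t|]."""
    b = abs(t)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # P(0 < T < |t|)
    return 0.5 - central if t < 0 else 0.5 + central


def one_sample_t(n, sample_mean, s, mu0):
    """t statistic and degrees of freedom for H0: mu = mu0."""
    t = (sample_mean - mu0) / (s / math.sqrt(n))
    return t, n - 1


# Battery data: n = 10, sample mean 114, s = 8, claimed mean 120.
t_stat, df = one_sample_t(10, 114, 8, 120)
p = t_lower_tail(t_stat, df)  # H1: mu < 120, so use the lower tail
print(f"t = {t_stat:.2f}, df = {df}, one-tailed p = {p:.3f}")
```

The lower tail is used because H1 is μ < 120; for a two-sided alternative the p-value would be doubled.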


Long Question 2 — Chi-Square Test for Independence

A school records whether students prefer online or in-person learning, classified by gender.
The results are shown in a contingency table.

(a) State the null and alternative hypotheses.
(b) Explain how expected frequencies are calculated.
(c) Describe how the test statistic is obtained.
(d) Interpret a decision to reject H0.

Full Worked Solution

(a) Hypotheses

H0: Gender and learning preference are independent.
H1: Gender and learning preference are not independent.

(b) Expected frequencies

Expected frequency = (row total × column total) / grand total.

This represents the frequency we would expect if the variables were truly independent.

(c) Test statistic

The chi-square statistic is calculated using:

χ² = Σ ( (Observed − Expected)² / Expected )

Each cell’s contribution is summed to obtain the final test statistic.

(d) Interpretation

If H0 is rejected, this indicates there is a statistically significant association
between gender and learning preference.

This does not imply causation, only that the variables are related in the population.