AHL 4.18 — Hypothesis testing, critical values & decision rules

Key terms

  • Null hypothesis (H0): statement of no effect or status quo to be tested. Example: μ = μ0.
  • Alternative hypothesis (H1): what we suspect might be true. Can be one-tailed or two-tailed.
  • Significance level (α): probability threshold used to reject H0 (common values 0.05, 0.01, 0.10).
  • Type I error: rejecting H0 when it is true. Probability = α (controlled by design).
  • Type II error (β): failing to reject H0 when H1 is true. Power = 1 − β.
  • Critical value / critical region: threshold(s) on the test-statistic scale beyond which H0 is rejected.
  • p-value: probability, under H0, of observing a test statistic at least as extreme as the one measured.

📌 Overview: hypothesis testing and decision rules

Inferential hypothesis testing gives a formal way to decide whether observed data provide sufficient evidence
against a null hypothesis (H0). The standard workflow (AIHL / IB) is:

  1. State H0 and H1 (one-tailed or two-tailed).
  2. Choose the significance level α (commonly 0.05 or 0.01).
  3. Choose an appropriate test (normal z, t, binomial, Poisson, correlation test) based on the data and assumptions.
  4. Use the GDC to compute the test statistic and p-value, or find the critical region.
  5. Compare the p-value to α (or the statistic to the critical values) and conclude. State Type I / II considerations where relevant.

Key idea — decision by p-value or critical region

  • If p-value < α → reject H0 (evidence supports H1).
  • If p-value ≥ α → do not reject H0 (no sufficient evidence).
  • Alternatively compare the test statistic with critical value(s) for the chosen α and tail type.

🌍 Real-World Connection

Clinical trials use hypothesis tests to check whether a new treatment improves outcomes versus standard care.
Regulatory bodies require strict control of Type I error to prevent false claims of effectiveness.

📌 Tests for a population mean

Choose the test according to whether the population standard deviation σ is known, and according to the sample size / normality assumption:

  • Normal (z) test: use when σ is known. Test statistic: z = (x̄ − μ0)/(σ/√n).
  • t-test: use when σ is unknown; in IB AIHL, always use the t-test when σ is not given.
    Test statistic: t = (x̄ − μ0)/(s/√n) with n − 1 degrees of freedom.
  • One-sample vs two-sample: use one-sample when comparing mean to a fixed μ0; two-sample when comparing two group means.
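
The two test statistics above can be computed directly from summary values. A minimal Python sketch (the sample figures in the final line are the ones from Example 1 below; any names here are illustrative, not a prescribed method):

```python
import math

def z_statistic(xbar, mu0, sigma, n):
    """z = (x̄ − μ0)/(σ/√n): use when the population σ is known."""
    return (xbar - mu0) / (sigma / math.sqrt(n))

def t_statistic(xbar, mu0, s, n):
    """t = (x̄ − μ0)/(s/√n) with n − 1 degrees of freedom: σ unknown."""
    return (xbar - mu0) / (s / math.sqrt(n))

# Illustrative values: x̄ = 250.8, μ0 = 251, s = 2.4, n = 16
print(round(t_statistic(250.8, 251, 2.4, 16), 3))  # -0.333
```

The two formulas differ only in whether σ (given) or s (estimated from the sample) appears in the denominator; the t-statistic is then compared against the t-distribution, not the normal.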

🧠 Examiner Tip

  • Always state which test you use (z or t) and justify it from the given information (σ known/unknown, sample size, distributional assumptions).
  • Quote both the test statistic and the p-value directly from the GDC, and write "p-value < α" or "p-value ≥ α" explicitly.
  • Write the final conclusion in plain language, referencing the chosen significance level (e.g. "Reject H0 at the 5% level: there is evidence that …").

📱 GDC Tips

TI-Nspire CX II

  • Open Menu → Statistics → Stat Tests and select the appropriate test (t-Test, z-Test, Binomial Test, or Correlation).
  • Choose Stats when summary values (x̄, s, n) are given, or Data when raw lists are provided in columns.
  • Set the alternative hypothesis carefully (≠, <, >), as this directly affects the p-value and final conclusion.
  • Record both the test statistic and p-value, then explicitly compare the p-value with α in your written conclusion.

Casio fx-CG50 / fx-CG100

  • Navigate to MENU → STAT → TEST and select the required hypothesis test from the list provided.
  • Enter data into List 1 and List 2 where needed, or choose the option to input summary statistics directly.
  • Ensure the correct tail is selected (left, right, or two-tailed) before executing the test.
  • Use the displayed p-value to make a decision, then write a full sentence conclusion referencing the chosen significance level.

Example 1 — one-sample t-test (GDC solution)

A machine fills jars with coffee. A sample of n = 16 jars has sample mean x̄ = 250.8 g and sample standard deviation s = 2.4 g.
Test at α = 0.05 whether the machine is filling to the labelled 251 g (H0: μ = 251, H1: μ ≠ 251).

GDC steps (TI-style / general):

  1. Menu: STAT > TESTS > t-Test (or 1-Sample t-test on your calculator).
  2. Choose Data (if you have raw list) or Stats (enter x̄ = 250.8, s = 2.4, n = 16).
  3. Set μ0 = 251, alternative = ≠ (two-tailed), and press Calculate.
  4. Read output: test statistic t = (250.8 − 251)/(2.4/√16) ≈ −0.333 and the two-tailed p-value ≈ 0.74.

Interpretation: since p ≈ 0.74 > 0.05, do not reject H0.
Conclude: there is no evidence at the 5% level that the mean fill differs from 251 g.
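
Without a GDC to hand, the two-tailed p-value can be checked by integrating the t density numerically. This is only a verification sketch (the trapezium-rule integration is my own device, not how a GDC computes it):

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2)) / math.sqrt(df * math.pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, upper=60.0, steps=20000):
    """P(|T| >= |t|) via the trapezium rule on [|t|, upper]."""
    a, b = abs(t), upper
    h = (b - a) / steps
    area = 0.5 * (t_pdf(a, df) + t_pdf(b, df))
    for i in range(1, steps):
        area += t_pdf(a + i * h, df)
    return 2 * h * area

t = (250.8 - 251) / (2.4 / math.sqrt(16))  # ≈ -0.333
p = two_tailed_p(t, df=15)
print(round(t, 3), round(p, 2))            # t ≈ -0.333, two-tailed p ≈ 0.74
```

In an exam, quote the GDC output directly; this kind of check is only useful for revision or an IA.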

🔍 TOK Perspective

A non-significant result does not prove the null hypothesis is true.
It only indicates insufficient evidence, raising questions about certainty and knowledge claims in statistics.

📌 Tests for a proportion (binomial)

Use a binomial-based test for tests about a population proportion p when data are counts of successes out of n trials.
For large n and p not near 0 or 1, a normal approximation can be used (z-test for proportion), but AIHL expects use of technology.

  • Exact binomial approach (preferred on GDC): use binomial PDF/CDF to find p-value exactly.
  • Normal approx (only when n large and np, n(1 − p) ≥ 10): use z = (p̂ − p0)/√(p0(1 − p0)/n).

Example 2 — binomial test (GDC solution)

A quality inspector tests 40 bulbs and finds 6 defective. Test H0: p = 0.05 vs H1: p > 0.05 at α = 0.05.

GDC steps (exact binomial p-value):

  1. Use distribution menu: DISTR > BinomCDF / BinomTest or BINOMTEST (some models have a direct binomial test).
  2. For one-tailed (p > p0): compute probability of getting ≥ 6 successes under n = 40, p0 = 0.05:
    use p-value = 1 − binomcdf(40, 0.05, 5) (because binomcdf gives P(X ≤ k)).
  3. Compare the p-value to α. Here p ≈ 0.0139 < 0.05, so reject H0: there is evidence that the defect rate exceeds 5%.

Note: AIHL exam calculators will return this p-value directly from a Binomial Test routine; report p-value and decision.
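
The exact binomial p-value from step 2 can be reproduced from first principles; a small stdlib-only Python sketch (function names are illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ B(n, p): the exact one-tailed p-value for H1: p > p0."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

# Example 2: 6 defective out of 40, H0: p = 0.05
p_value = binom_upper_tail(6, 40, 0.05)
print(round(p_value, 4))  # 0.0139 < 0.05, so reject H0
```

This mirrors the GDC computation 1 − binomcdf(40, 0.05, 5), since summing the upper tail and subtracting the lower-tail CDF from 1 are equivalent.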


📐 IA Spotlight

  • Hypothesis testing is ideal for IAs involving real data such as defect counts or survey responses.
  • Explain clearly why the chosen probability model is appropriate for the context.
  • Discuss limitations, assumptions, and possible sources of statistical error.

📌 Tests using the Poisson distribution

Use the Poisson model for counts that occur at a constant average rate and independently (events per unit time or space).
Example use-cases: arrivals at a call centre, typos per page, emergency admissions per hour.

  • Poisson test: test whether observed count(s) are consistent with a hypothesised rate λ0 (per unit).
  • Use GDC’s Poisson CDF/PDF (DISTR menus) to compute p-values exactly.

Example 3 — Poisson test (GDC)

Hospital expects on average λ = 2 emergency calls per hour. In a particular hour there were 6 calls. Test H0: λ = 2 vs H1: λ > 2 at α = 0.01.

GDC steps (Poisson right-tail):

  1. Use DISTR > Poisson CDF (or POISSONCDF). Compute P(X ≤ 5) for λ = 2.
  2. p-value = 1 − P(X ≤ 5) ≈ 1 − 0.9834 = 0.0166. Since 0.0166 > 0.01, do not reject H0 at the 1% level.

This exact approach is preferred; do not approximate with normal unless stated/justified by continuity and large λ.
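
The exact Poisson p-value in Example 3 follows directly from the CDF; a short stdlib-only sketch to verify the GDC output:

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Po(lam): sum of e^(−λ) λ^i / i! for i = 0..k."""
    return math.exp(-lam) * sum(lam**i / math.factorial(i) for i in range(k + 1))

# Example 3: 6 calls observed, H0: λ = 2, H1: λ > 2
p_value = 1 - poisson_cdf(5, 2)  # P(X >= 6) under H0
print(round(p_value, 4))         # 0.0166 > 0.01, so do not reject H0
```

Note the off-by-one convention: the right-tail probability P(X ≥ 6) is 1 − P(X ≤ 5), not 1 − P(X ≤ 6).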

📌 Testing correlation (ρ = 0)

To test whether the population correlation ρ equals 0 (no linear association), AIHL expects use of technology:

  • Enter paired data lists into the GDC and run Correlation Test (STAT > TESTS > LinRegTTest or similar).
  • The GDC returns the test statistic t and p-value for H0: ρ = 0. Use these to decide whether to reject H0 (avoid writing "accept H0").
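
Behind the GDC's LinRegTTest output, the statistic is t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom. A sketch from scratch, with hypothetical paired data for illustration:

```python
import math

def pearson_r(xs, ys):
    """Sample product-moment correlation coefficient r."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_t(xs, ys):
    """t = r√(n − 2)/√(1 − r²), with n − 2 degrees of freedom, for H0: ρ = 0."""
    r, n = pearson_r(xs, ys), len(xs)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Hypothetical strongly correlated data
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]
print(round(pearson_r(xs, ys), 4), round(correlation_t(xs, ys), 1))
```

A large |t| (equivalently, a small p-value) is evidence of a linear association; in the exam, quote r, t, and the p-value from the GDC rather than computing by hand.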

📝 Paper Tips

  • Always write a concluding sentence after using technology.
  • Explicitly state whether H₀ is rejected or not rejected.
  • Use correct statistical language rather than everyday phrasing.

📌 Type I and Type II errors — practical tradeoffs

  • Type I error (α): false positive — limit it by choosing small α (e.g. 0.01) but that increases β.
  • Type II error (β): false negative — reduce β (increase power) by increasing n, increasing effect size or raising α.
  • Power (1 − β): probability test detects a true effect of a given size — consider when designing studies.
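
For a right-tailed z-test with known σ, the power at a true mean μ1 > μ0 is Φ((μ1 − μ0)√n/σ − z₁₋α), so increasing n or raising α both increase it. A sketch (the numbers are illustrative, not from any example above):

```python
from statistics import NormalDist

def power_one_sided_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power of the right-tailed z-test of H0: μ = mu0 when the true mean is mu1 > mu0."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha)            # critical z for level α
    shift = (mu1 - mu0) * n**0.5 / sigma      # standardised shift of x̄ when μ = mu1
    return 1 - nd.cdf(z_crit - shift)

# Illustrative numbers: doubling n raises the power
print(round(power_one_sided_z(100, 102, 5, 25), 2))
print(round(power_one_sided_z(100, 102, 5, 50), 2))
```

Note the sanity check hidden in the formula: when μ1 = μ0 (no real effect), the "power" collapses to α itself, which is exactly the Type I error rate.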

🔗 Connections

  • Links to other subjects: medical trials (Type I/II balance), engineering (reliability testing).
  • Ethics: setting α too high can cause harm (false positives); too low can miss real effects.

🧠 Final examiner notes

  • Always state H0, H1, α and the test chosen with justification (assumptions, distributional conditions).
  • Use technology for test statistic and p-value. Quote both and make the decision clearly (reject / do not reject).
  • When asked for interpretation, write a plain sentence (e.g., “At the 5% level we reject H0; there is evidence that …”).
  • Discuss practical implications (Type I / II consequences) where relevant.

📌 Practice Questions — Type I & Type II Errors

Multiple Choice Questions (MCQs)

MCQ 1
A hypothesis test is carried out at a significance level of α = 0.05.
Which statement correctly defines the probability of a Type I error?

  • A. The probability that the alternative hypothesis is false
  • B. The probability that the null hypothesis is true
  • C. The probability of rejecting a true null hypothesis
  • D. The probability of failing to reject a false null hypothesis
Answer & Detailed Explanation

Correct answer: C

A Type I error occurs when the null hypothesis is rejected even though it is actually true.
By definition, the probability of making this error is equal to the chosen significance level α.
Since α = 0.05, there is a 5% risk of falsely concluding that a statistically significant effect exists when it does not.


MCQ 2
The power of a hypothesis test is best defined as:

  • A. The probability of rejecting a true null hypothesis
  • B. The probability of failing to reject a false null hypothesis
  • C. The probability of accepting the null hypothesis
  • D. The probability of correctly rejecting a false null hypothesis
Answer & Detailed Explanation

Correct answer: D

The power of a test measures the test’s ability to detect a real effect.
Mathematically, power = 1 − β, where β is the probability of a Type II error.
A higher power indicates a greater likelihood of correctly rejecting the null hypothesis when it is false.


MCQ 3
Which of the following changes will most directly reduce the probability of a Type II error?

  • A. Decreasing the significance level α
  • B. Increasing the sample size
  • C. Narrowing the critical region
  • D. Using a two-tailed test instead of a one-tailed test
Answer & Detailed Explanation

Correct answer: B

Increasing the sample size reduces sampling variability and improves the precision of estimates.
This makes it easier to detect a genuine effect when one exists, thereby reducing the probability of failing to reject a false null hypothesis.
As a result, the probability of a Type II error decreases and the power of the test increases.


Short Answer Questions

Short Question 1
Explain what a Type I error represents in the context of hypothesis testing.

Model Answer

A Type I error occurs when the null hypothesis is rejected even though it is actually true.
In practical terms, this means concluding that there is sufficient statistical evidence for an effect, difference, or relationship when none genuinely exists.
The probability of committing a Type I error is controlled by the chosen significance level α.


Short Question 2
Explain why increasing the significance level α generally increases the probability of a Type I error.

Model Answer

The significance level α defines the threshold for rejecting the null hypothesis.
Increasing α enlarges the critical region, making it easier to reject H₀.
As a result, the likelihood of incorrectly rejecting a true null hypothesis increases, leading to a higher probability of a Type I error.


Long Answer Questions (IB AIHL Style)

Long Question 1 — Type I Error

A medical researcher tests whether a new drug has no effect on recovery time.
The test is carried out at a 5% significance level.

(a) State appropriate null and alternative hypotheses.
(b) Explain what a Type I error would mean in this context.
(c) State the probability of making a Type I error.
(d) Explain why minimizing Type I errors is particularly important in medical research.

Fully Worked Explanation

(a)
H₀: The drug has no effect on recovery time.
H₁: The drug affects recovery time.

(b)
A Type I error would occur if the researcher concludes that the drug has an effect when, in reality, it does not.
This could lead to the approval of an ineffective treatment.

(c)
The probability of a Type I error is equal to the significance level, α = 0.05.

(d)
In medical research, false positives can lead to harmful consequences such as unnecessary treatment, side effects, or wasted resources.
Therefore, controlling the probability of a Type I error is essential to protect patient safety and maintain scientific integrity.


Long Question 2 — Type II Error and Power

A manufacturer tests whether the mean lifetime of a battery exceeds 10 hours.
The test is performed at a 1% significance level.

(a) State suitable hypotheses.
(b) Explain what a Type II error would represent in this context.
(c) Define the power of the test.
(d) Describe two methods for increasing the power of the test.

Fully Worked Explanation

(a)
H₀: μ = 10 hours
H₁: μ > 10 hours

(b)
A Type II error occurs if the manufacturer fails to detect that the battery lifetime exceeds 10 hours when it actually does.
This could result in missed marketing or performance opportunities.

(c)
The power of a test is the probability of correctly rejecting a false null hypothesis.
It is equal to 1 − β, where β is the probability of a Type II error.

(d)
Power can be increased by increasing the sample size, which reduces variability, or by increasing the significance level, which enlarges the rejection region.
Both changes make it more likely to detect a true effect.


Long Question 3 — Trade-off Between Errors

Explain why it is generally impossible to simultaneously minimize both Type I and Type II error probabilities.
Support your explanation with reference to the role of the significance level.

Fully Worked Explanation

Reducing the probability of a Type I error requires lowering the significance level α, which shrinks the critical region.
However, this makes it harder to reject the null hypothesis, increasing the probability of a Type II error.
Conversely, increasing α reduces Type II errors but increases Type I errors.
This inherent trade-off means both probabilities cannot be minimized at the same time.