Hypothesis testing is a way to ask whether sample data looks too inconsistent with a starting claim. That starting claim is called the null hypothesis, written H₀.
The method does not prove H₀ true or false. It asks a narrower question: if H₀ were true, would data this extreme be unusual enough that we should doubt it?
The Core Idea
Every hypothesis test has two competing statements:
- The null hypothesis H₀, which is the default claim being tested.
- The alternative hypothesis H₁ (or Hₐ), which is what you would support if the data gives enough evidence against H₀.
You then choose a significance level α, often 0.05, before looking at the result. This is the cutoff for how much evidence you require before rejecting H₀.
Two outcomes are possible:
- Reject H₀: the data is sufficiently inconsistent with the null model.
- Fail to reject H₀: the data is not strong enough to rule out the null model.
"Fail to reject" is not the same as "accept H₀ as true." It only means the sample did not provide strong enough evidence against H₀.
The Usual Steps
The workflow is usually:
- State H₀ and H₁ clearly.
- Choose α and a test that matches the data and assumptions.
- Compute a test statistic from the sample.
- Turn that statistic into a p-value or compare it with a critical value.
- Make the decision and interpret it in context.
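The steps above can be sketched in code. This is a minimal illustration using a one-sample z-test with made-up numbers (the target mean, σ, and sample values are all hypothetical), not a general-purpose testing library:

```python
# Sketch of the usual workflow for a one-sample, two-tailed z-test.
# All numbers (target mean, sigma, sample data) are made up for illustration.
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Return (z statistic, two-tailed p-value, reject decision)."""
    n = len(sample)
    xbar = sum(sample) / n
    se = sigma / math.sqrt(n)                 # standard error of the mean
    z = (xbar - mu0) / se                     # test statistic
    p = 2.0 * (1.0 - normal_cdf(abs(z)))      # two-tailed p-value
    return z, p, p < alpha

# Hypothetical sample; H0 says the true mean is 10.
z, p, reject = one_sample_z_test(
    [10.2, 9.8, 10.5, 10.1, 9.9, 10.4], mu0=10.0, sigma=0.3, alpha=0.05
)
```

Here the decision step is just the comparison `p < alpha`; the interpretation step still has to be done in words, in context.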
The test statistic depends on the situation. A z-test, t-test, chi-square test, and many others are all examples of hypothesis tests. There is no single formula for all of hypothesis testing.
What The p-Value Means
A p-value is the probability, assuming H₀ is true and the test assumptions hold, of getting a result at least as extreme as the one observed.
A small p-value means the data would be unusual under H₀. That is why small p-values count as evidence against the null hypothesis.
It does not mean:
- The probability that H₀ is false.
- The probability that your result happened "by random chance" in a vague everyday sense.
- The size or importance of the effect.
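The definition can be checked directly by simulation: generate data under a hypothetical null model and count how often the result is at least as extreme as the one observed. The null model and observed mean below are assumptions chosen for illustration:

```python
# Monte Carlo check of the p-value definition: under H0, how often does
# a result at least as extreme as the observed one occur? (Illustrative
# null model: mean 0, sigma 1, n = 25; observed sample mean 0.5.)
import random
import statistics

random.seed(0)
mu0, sigma, n = 0.0, 1.0, 25
observed_mean = 0.5

trials = 20_000
count = 0
for _ in range(trials):
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    if statistics.fmean(sample) >= observed_mean:  # right-tailed "as extreme"
        count += 1

p_sim = count / trials
# p_sim approximates P(sample mean >= 0.5 | H0), roughly 0.006 here,
# matching the exact value 1 - Phi(0.5 / (1/5)) = 1 - Phi(2.5).
```

The simulated fraction and the exact normal-tail probability agree, which is exactly what the definition says they should do.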
Main Types Of Hypothesis Tests
There are two useful ways to group tests.
By Direction
A one-tailed test looks for change in one direction only.
- Right-tailed: values larger than the null claim support H₁.
- Left-tailed: values smaller than the null claim support H₁.
A two-tailed test looks for a difference in either direction. If H₁ is "not equal," the rejection region is split across both tails.
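The three directions correspond to three different tail probabilities of the same statistic. A small sketch, using the standard normal CDF and a hypothetical z value:

```python
# One- vs two-tailed p-values computed from the same z statistic.
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def p_value(z: float, tail: str) -> float:
    if tail == "left":
        return normal_cdf(z)                      # P(Z <= z)
    if tail == "right":
        return 1.0 - normal_cdf(z)                # P(Z >= z)
    if tail == "two":
        return 2.0 * (1.0 - normal_cdf(abs(z)))   # both tails
    raise ValueError(f"unknown tail: {tail}")

z = -1.8  # hypothetical test statistic
# For any z, the two-tailed p-value is double the matching one-tailed value,
# which is why a two-tailed test demands a more extreme statistic to reject.
```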
By Data Situation
- A z-test is used for some mean-testing settings when the population standard deviation is known or a justified large-sample approximation is being used.
- A t-test is common for means when the population standard deviation is unknown and conditions are reasonable.
- A chi-square test is used for categorical count data.
The right test depends on the variable type, sample design, and assumptions. Choosing the formula first and the question second is a common mistake.
Worked Example
Suppose a filling machine is supposed to average a target of μ₀ mL per bottle. A quality-control team takes a sample of n bottles and gets a sample mean x̄ that falls below the target.
Assume, for this example, that the population standard deviation σ is known and the sampling conditions justify a one-sample z-test.
Set up the hypotheses: H₀: μ = μ₀ versus H₁: μ < μ₀.
This is a left-tailed test because the concern is underfilling.
The standard error is SE = σ/√n.
So the test statistic is z = (x̄ − μ₀)/SE.
If α = 0.05 for a left-tailed z-test, the critical value is about −1.645. Suppose the computed z falls below −1.645; then the result lands in the rejection region.
So the decision is to reject H₀ at the 0.05 level. In context, the sample provides evidence that the machine is underfilling on average.
That conclusion depends on the test assumptions. If the assumptions are poor, the conclusion may be unreliable even if the arithmetic is correct.
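To make the arithmetic concrete, here is the same left-tailed z-test run with hypothetical numbers (target 500 mL, σ = 5 mL, n = 40, x̄ = 498.2 mL are all invented for illustration):

```python
# Left-tailed one-sample z-test with hypothetical numbers:
# target mu0 = 500 mL, sigma = 5 mL, n = 40 bottles, sample mean 498.2 mL.
import math

mu0, sigma, n, xbar, alpha = 500.0, 5.0, 40, 498.2, 0.05

se = sigma / math.sqrt(n)    # standard error of the mean
z = (xbar - mu0) / se        # test statistic
z_crit = -1.645              # left-tailed critical value at alpha = 0.05

reject = z < z_crit          # True: z is in the rejection region
```

With these invented numbers, z comes out near −2.28, below the −1.645 cutoff, so the test rejects H₀ at the 0.05 level.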
Type I And Type II Errors
Hypothesis testing always involves error risk.
A Type I error means rejecting H₀ even though it is true. Its probability is controlled by α.
A Type II error means failing to reject H₀ even though H₁ is true. Its probability is usually written β.
Lowering α makes false alarms less likely, but it can also make true effects harder to detect if nothing else changes. That tradeoff is one reason sample size matters.
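Both error rates can be estimated by simulation. This sketch uses a right-tailed z-test with an invented null (mean 0) and an invented true effect (mean 0.5); all settings are assumptions for illustration:

```python
# Simulated Type I and Type II error rates for a right-tailed z-test.
# Hypothetical setup: H0 says mu = 0; the alternative scenario uses mu = 0.5.
import math
import random

random.seed(1)
n, sigma = 30, 1.0
z_crit = 1.645                      # right-tailed critical value at alpha = 0.05

def rejects(true_mu: float) -> bool:
    """Draw one sample from N(true_mu, sigma) and test H0: mu = 0."""
    sample = [random.gauss(true_mu, sigma) for _ in range(n)]
    z = (sum(sample) / n - 0.0) / (sigma / math.sqrt(n))
    return z > z_crit

trials = 5_000
type1 = sum(rejects(0.0) for _ in range(trials)) / trials   # H0 actually true
power = sum(rejects(0.5) for _ in range(trials)) / trials   # effect truly 0.5
type2 = 1.0 - power
# type1 lands near alpha = 0.05; type2 shrinks if the effect or n grows.
```

Rerunning with a larger n shows the tradeoff in the text: α stays fixed by construction, while β drops as the sample grows.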
Common Mistakes
One common mistake is saying a non-significant result proves there is no effect. Usually it only shows the data was not strong enough to detect one.
Another mistake is treating statistical significance as practical importance. A tiny effect can be statistically significant in a very large sample.
People also misuse tests by ignoring assumptions about independence, distribution shape, variance, or data type. A clean-looking p-value does not rescue a mismatched test.
When Hypothesis Testing Is Used
Hypothesis testing is used in science, manufacturing, medicine, surveys, A/B testing, and policy analysis. The goal is usually the same: decide whether the sample gives enough evidence to question a default claim.
In practice, good testing is not just about the calculation. It also requires a sensible null hypothesis, a defensible design, and an interpretation that matches what the test can actually say.
Try Your Own Version
Take the same bottle-filling example, but move the sample mean closer to the target. Recompute the test statistic and see whether the decision changes at α = 0.05. That is a quick way to see how evidence gets stronger or weaker as the sample result moves closer to the null value.