Probability & Statistics — Rules, Mean & Standard Deviation

Probability and statistics share one toolkit for handling uncertainty. Probability turns assumptions about random events into numbers between $0$ and $1$, while descriptive statistics compress a dataset into a few summary numbers: the mean, the variance, and the standard deviation. This guide covers the core probability rules, conditional probability, and the spread measures, with the formulas, short derivations, and worked examples you need to compute answers.

0 \le P(A) \le 1, \qquad \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}

Probability rules

The probability of an event $A$ is the long-run fraction of trials in which $A$ happens. For a finite sample space where every outcome is equally likely,

P(A) = \frac{\text{number of favorable outcomes}}{\text{total number of outcomes}}.
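As a minimal sketch of this counting formula, the snippet below enumerates the 36 equally likely outcomes of two fair dice and counts those that sum to 7. The two-dice setup and the helper name `classical_probability` are illustrative assumptions, not part of this guide.

```python
from fractions import Fraction
from itertools import product

def classical_probability(sample_space, is_favorable):
    """Favorable outcomes divided by total outcomes (equally likely case only)."""
    outcomes = list(sample_space)
    favorable = sum(1 for outcome in outcomes if is_favorable(outcome))
    return Fraction(favorable, len(outcomes))

# Hypothetical example: two fair six-sided dice, event "the faces sum to 7".
two_dice = list(product(range(1, 7), repeat=2))
p_sum_seven = classical_probability(two_dice, lambda roll: sum(roll) == 7)
print(p_sum_seven)  # 1/6
```

Using `Fraction` keeps the answer exact, which makes it easy to compare with a hand calculation.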

Three rules generate almost everything else.

Complement rule

Every event either happens or it does not, and those two cases cover the whole sample space:

P(A) + P(A^c) = 1 \quad\Longrightarrow\quad P(A^c) = 1 - P(A).

This is the fastest route to "at least one" problems, where the complement ("none") is easier to count.
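For instance, here is a small sketch of an "at least one" calculation, assuming a hypothetical problem: the chance of at least one 6 in four rolls of a fair die. It applies the complement rule and then checks the result by brute-force enumeration.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: at least one 6 in four rolls of a fair die.
rolls = 4
p_no_six = Fraction(5, 6) ** rolls     # complement event: "no six at all"
p_at_least_one_six = 1 - p_no_six      # complement rule

# Brute-force check over all 6**4 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=rolls))
direct = Fraction(sum(1 for o in outcomes if 6 in o), len(outcomes))

print(p_at_least_one_six, direct)  # 671/1296 671/1296
```

The complement needs one multiplication, while the direct count has to visit all 1296 outcomes.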

Addition rule

For the probability that $A$ or $B$ occurs, add the two probabilities but subtract the overlap so it is not counted twice:

P(A \cup B) = P(A) + P(B) - P(A \cap B).

If $A$ and $B$ are mutually exclusive (they cannot both happen), then $P(A \cap B) = 0$ and the rule reduces to $P(A \cup B) = P(A) + P(B)$.
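A quick way to see the overlap correction is to check it on a single die. The sketch below assumes two illustrative events, $A$ = "even" and $B$ = "greater than 3", which are not taken from this guide.

```python
from fractions import Fraction

die = set(range(1, 7))
A = {x for x in die if x % 2 == 0}   # even faces: {2, 4, 6}
B = {x for x in die if x > 3}        # faces above 3: {4, 5, 6}

def prob(event):
    """Probability of an event on one fair die (equally likely faces)."""
    return Fraction(len(event), len(die))

by_rule = prob(A) + prob(B) - prob(A & B)   # addition rule
direct = prob(A | B)                        # count the union directly
naive = prob(A) + prob(B)                   # forgets the overlap {4, 6}

print(by_rule, direct, naive)  # 2/3 2/3 1
```

The naive sum reaches $1$ even though the face $1$ and the face $3$ belong to neither event, which shows the double counting.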

Multiplication rule

For the probability that $A$ and $B$ both occur,

P(A \cap B) = P(A)\, P(B \mid A).

If the events are independent, knowing $A$ tells you nothing about $B$, so $P(B \mid A) = P(B)$ and the rule simplifies to $P(A \cap B) = P(A)\,P(B)$.

Situation              Use this
---------------------  ----------------------------------
"at least one"         complement: 1 - P(none)
"A or B"               addition: P(A)+P(B)-P(A∩B)
"A and B"              multiplication: P(A)·P(B|A)
mutually exclusive     P(A∩B) = 0
independent            P(B|A) = P(B)

Conditional probability

Conditional probability asks: given that $B$ already happened, how likely is $A$? It is defined as

P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0.

Why it is defined this way

Once you know $B$ occurred, $B$ becomes the new sample space: outcomes outside $B$ are now impossible. So you restrict attention to the part of $A$ that lives inside $B$, namely $A \cap B$, and rescale by the total probability of $B$ so the conditional probabilities still sum to $1$. Dividing by $P(B)$ is exactly that rescaling. Rearranging the definition gives back the multiplication rule $P(A \cap B) = P(B)\,P(A \mid B)$, which is why the two are really one idea.
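The "restrict, then rescale" idea can be sketched in code. The example below assumes two fair dice with the illustrative events $A$ = "sum is at least 10" and $B$ = "first die shows 6"; it computes $P(A \mid B)$ once from the definition and once by literally shrinking the sample space to $B$.

```python
from fractions import Fraction
from itertools import product

space = list(product(range(1, 7), repeat=2))   # 36 equally likely rolls

def in_a(roll):
    return sum(roll) >= 10      # hypothetical event A

def in_b(roll):
    return roll[0] == 6         # hypothetical event B

def prob(event):
    return Fraction(sum(1 for roll in space if event(roll)), len(space))

# Definition: P(A | B) = P(A and B) / P(B)
by_definition = prob(lambda roll: in_a(roll) and in_b(roll)) / prob(in_b)

# Same thing, treating B as the new sample space
inside_b = [roll for roll in space if in_b(roll)]
by_restriction = Fraction(sum(1 for roll in inside_b if in_a(roll)), len(inside_b))

print(prob(in_a), by_definition, by_restriction)  # 1/6 1/2 1/2
```

Knowing that the first die is a 6 raises the probability of a high sum from $1/6$ to $1/2$.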

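The same intersection can also be written with the roles of $A$ and $B$ swapped, $P(A \cap B) = P(A)\,P(B \mid A)$. Substituting this into the definition gives Bayes' theorem: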

P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.
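To make the formula concrete, here is a hedged numeric sketch. All three input probabilities are hypothetical numbers chosen for illustration (a screening test for a rare condition). The denominator $P(B)$ is obtained by splitting $B$ into the part inside $A$ and the part outside $A$, a step this guide does not derive.

```python
from fractions import Fraction

# Hypothetical inputs, for illustration only.
p_a = Fraction(1, 100)               # P(A): condition is present
p_b_given_a = Fraction(95, 100)      # P(B | A): test is positive when present
p_b_given_not_a = Fraction(5, 100)   # P(B | not A): false positive rate

# P(B) = P(B | A) P(A) + P(B | not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem
p_a_given_b = p_b_given_a * p_a / p_b

print(p_a_given_b, round(float(p_a_given_b), 3))  # 19/118 0.161
```

With these assumed numbers, $P(B \mid A) = 0.95$ but $P(A \mid B) \approx 0.16$, so the two conditional probabilities are far apart.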

Worked example 1 — conditional probability

A box has $10$ marbles: $4$ red and $6$ blue. You draw two without replacement. What is the probability that both are red?

Let $A$ be "first red" and $B$ be "second red." Then $P(A) = \tfrac{4}{10}$. After removing one red, $3$ reds remain out of $9$ marbles, so $P(B \mid A) = \tfrac{3}{9}$. By the multiplication rule,

P(A \cap B) = \frac{4}{10} \times \frac{3}{9} = \frac{12}{90} = \frac{2}{15} \approx 0.133.

The "without replacement" detail is what makes $P(B \mid A) \ne P(B)$: the draws are dependent.
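The result can be double-checked by listing every ordered pair of draws. This is a brute-force sketch, not the method used above; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["R"] * 4 + ["B"] * 6

# Every ordered pair of distinct marbles: 10 * 9 = 90 equally likely draws.
draws = list(permutations(range(len(marbles)), 2))
both_red = sum(1 for i, j in draws if marbles[i] == "R" and marbles[j] == "R")

p_enumerated = Fraction(both_red, len(draws))
p_rule = Fraction(4, 10) * Fraction(3, 9)             # multiplication rule
p_if_independent = Fraction(4, 10) * Fraction(4, 10)  # wrong here: assumes replacement

print(p_enumerated, p_rule, p_if_independent)  # 2/15 2/15 4/25
```

Treating the draws as independent would give $4/25 = 0.16$ instead of $2/15 \approx 0.133$.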

Mean, variance, and standard deviation

These three numbers describe a dataset's center and spread.

The mean (average) is

\mu = \frac{1}{N}\sum_{i=1}^{N} x_i.

The variance is the average squared distance from the mean:

\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2.

The standard deviation is the square root of the variance:

\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}.
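The three formulas translate almost line for line into code. The sketch below uses plain Python and a small hypothetical dataset; the function names are illustrative.

```python
import math

def mean(xs):
    """mu = (1/N) * sum of x_i"""
    return sum(xs) / len(xs)

def population_variance(xs):
    """sigma^2 = (1/N) * sum of (x_i - mu)^2"""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def population_std(xs):
    """sigma = square root of the population variance"""
    return math.sqrt(population_variance(xs))

data = [1, 2, 3, 4, 5]  # hypothetical values
print(mean(data), population_variance(data), round(population_std(data), 4))
# 3.0 2.0 1.4142
```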

Why we square the deviations

If you simply added the raw deviations $x_i - \mu$, the positives and negatives would cancel and the sum would always be $0$, which is useless as a spread measure. Squaring removes the sign and weights large deviations more heavily. But squaring also changes the units (dollars become dollars-squared), so taking the square root at the end returns the standard deviation to the original units, which is why $\sigma$ is the number people actually report.
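A short numeric check of the cancellation, using a hypothetical four-value dataset:

```python
import math

data = [3, 8, 10, 15]                    # hypothetical values
mu = sum(data) / len(data)               # 9.0
deviations = [x - mu for x in data]      # [-6.0, -1.0, 1.0, 6.0]

raw_sum = sum(deviations)                        # positives and negatives cancel
squared_sum = sum(d ** 2 for d in deviations)    # signs removed, large gaps weigh more
variance = squared_sum / len(data)               # in squared units
sigma = math.sqrt(variance)                      # back in the original units

print(raw_sum, squared_sum, variance, round(sigma, 3))  # 0.0 74.0 18.5 4.301
```

The two outer values contribute $36$ each to the squared sum while the two inner values contribute only $1$ each, which is the extra weight on large deviations.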

Population vs. sample

When your data is a sample meant to estimate a larger population, divide by $N-1$ instead of $N$:

s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2.

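Dividing by the smaller $N-1$ (Bessel's correction) offsets a bias: deviations measured from the sample mean $\bar{x}$ are, on average, smaller than deviations from the true population mean, so dividing by $N$ would underestimate the population variance. Use $N$ for a full population, $N-1$ for a sample.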
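The bias can be seen exactly on a tiny case. The sketch below assumes a hypothetical population $\{1, 2, 3\}$ and looks at every possible sample of size 2 drawn with replacement, then averages the two candidate variance formulas over all of those samples.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]   # hypothetical population
mu = Fraction(sum(population), len(population))
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)  # true variance

def squared_spread(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample) / divisor

n = 2
samples = list(product(population, repeat=n))   # all 9 samples, drawn with replacement

avg_div_n = sum(squared_spread(s, n) for s in samples) / len(samples)
avg_div_n_minus_1 = sum(squared_spread(s, n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)  # 2/3 1/3 2/3
```

Averaged over all samples, dividing by $N$ gives $1/3$, half the true variance of $2/3$, while dividing by $N-1$ recovers $2/3$ exactly in this setup.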

Worked example 2 — mean, variance, standard deviation

Find the mean, variance, and standard deviation of the dataset $\{2, 4, 4, 4, 5, 5, 7, 9\}$, treating it as a full population ($N = 8$).

Step 1 — mean:

\mu = \frac{2+4+4+4+5+5+7+9}{8} = \frac{40}{8} = 5.

Step 2 — squared deviations:

x    x - μ    (x - μ)²
2     -3         9
4     -1         1
4     -1         1
4     -1         1
5      0         0
5      0         0
7      2         4
9      4        16
                ---
   sum =         32

Step 3 — variance and standard deviation:

\sigma^2 = \frac{32}{8} = 4, \qquad \sigma = \sqrt{4} = 2.

So the data is centered at $5$ and typically lies about $2$ units away from the mean.
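These numbers can be cross-checked with Python's standard `statistics` module, where `pvariance` and `pstdev` divide by $N$ and `variance` divides by $N-1$. The sample-variance line is included only for contrast and is not part of the worked example.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mu = statistics.mean(data)             # population mean, equals 5
pop_var = statistics.pvariance(data)   # divides by N, equals 4
pop_std = statistics.pstdev(data)      # square root of pop_var, equals 2
sample_var = statistics.variance(data) # divides by N - 1, equals 32/7

print(mu, pop_var, pop_std)
print(round(sample_var, 3))  # 4.571
```

Treating the same eight values as a sample raises the variance from $4$ to about $4.571$, since the same sum of $32$ is divided by $7$ instead of $8$.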

Try it yourself, then check the answer

A fair six-sided die is rolled once. What is the probability of rolling at least a 5 (that is, a $5$ or a $6$), and what is the probability of not doing so?

There are $2$ favorable outcomes out of $6$, so $P(\ge 5) = \tfrac{2}{6} = \tfrac{1}{3}$. By the complement rule, $P(< 5) = 1 - \tfrac{1}{3} = \tfrac{2}{3}$. Notice it was faster to count the small favorable set and use the complement than to count all four "not" outcomes by hand.


Calculation traps to watch for

Adding probabilities that overlap. For $P(A \cup B)$, forgetting to subtract $P(A \cap B)$ double-counts the shared outcomes. Only skip the subtraction when the events are mutually exclusive.
Confusing $P(A \mid B)$ with $P(B \mid A)$. These are generally different. The probability of the conditioning event (what is given) goes in the denominator. Reversing them is the classic base-rate mistake.
Forgetting the square root for standard deviation. Variance is in squared units; the standard deviation is its square root. Reporting $\sigma^2$ as if it were $\sigma$ misstates the spread: it overstates it when $\sigma > 1$, understates it when $\sigma < 1$, and carries the wrong units either way.
Mixing up $N$ and $N-1$. Divide by $N$ for a full population and by $N-1$ for a sample estimate. Using the wrong one shifts every variance and standard deviation you report.
Treating dependent draws as independent. "Without replacement" changes the conditional probability of later draws; using $P(B)$ instead of $P(B \mid A)$ gives the wrong joint probability.

Master these three ideas — the probability rules, the conditioning step, and the spread formulas — and most introductory statistics problems become a matter of identifying which tool the wording is pointing at.

Frequently Asked Questions

What is the difference between probability and statistics?: Probability starts from assumptions about random events and predicts how likely outcomes are. Statistics starts from observed data and summarizes or draws conclusions from it. They use the same core ideas of distributions, mean, and variance.
How do you calculate conditional probability?: Use P(A given B) = P(A and B) divided by P(B), with P(B) greater than zero. You restrict attention to outcomes inside B, then rescale by the total probability of B.
Why do we square the deviations when finding variance?: If you added raw deviations from the mean, positives and negatives would cancel to zero. Squaring removes the sign and weights large deviations more, giving a usable measure of spread.
What is the difference between variance and standard deviation?: Variance is the average squared distance from the mean, so it is in squared units. Standard deviation is the square root of the variance, which returns the measure to the original units of the data.
When do you divide by N versus N minus 1?: Divide by N for a full population. Divide by N minus 1 for a sample used to estimate a larger population, which corrects a bias that would otherwise make the variance too small.
