
1.2: The Axioms of Probability Theory


As the applications of probability theory became increasingly varied and complex during the 20th century, the need arose to put the theory on a firm mathematical footing. This was accomplished by an axiomatization of the theory, successfully carried out by the great Russian mathematician A. N. Kolmogorov in 1933. Before stating and explaining these axioms of probability theory, the following two examples explain why the simple approach of the last section, assigning a probability to each sample point, often fails with infinite sample spaces.

    Example 1.2.1

    Suppose we want to model the phase of a sine wave, where the phase is viewed as being “uniformly distributed” between 0 and \(2 \pi\). If this phase is the only quantity of interest, it is reasonable to choose a sample space consisting of the set of real numbers between 0 and \(2 \pi\). There are uncountably5 many possible phases between 0 and \(2 \pi\), and with any reasonable interpretation of uniform distribution, one must conclude that each sample point has probability zero. Thus, the simple approach of the last section leads us to conclude that any event in this space with a finite or countably infinite set of sample points should have probability zero. That simple approach does not help in finding the probability, say, of the interval \((0, \pi)\).

    Solution

    For this example, the appropriate view is that taken in all elementary probability texts, namely to assign a probability density \(\frac{1}{2 \pi}\) to the phase. The probability of an event can then usually be found by integrating the density over that event. Useful as densities are, however, they do not lead to a general approach over arbitrary sample spaces.6
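To make the density view concrete, here is a short Python sketch (an illustration added here, not part of the original text). It computes \(\operatorname{Pr}\{(0, \pi)\}\) under the density \(\frac{1}{2 \pi}\) exactly and checks it with a simple Monte Carlo estimate; the function names are illustrative only.

```python
import math
import random

def pr_interval_exact(a, b):
    """Pr of the phase landing in (a, b): integrate the constant density 1/(2*pi)."""
    return (b - a) / (2.0 * math.pi)

def pr_interval_monte_carlo(a, b, trials=200_000, seed=1):
    """Crude sanity check: fraction of uniformly drawn phases that land in (a, b)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if a < rng.uniform(0.0, 2.0 * math.pi) < b)
    return hits / trials

print(pr_interval_exact(0.0, math.pi))        # 0.5, i.e. (pi - 0) / (2*pi)
print(pr_interval_monte_carlo(0.0, math.pi))  # close to 0.5
```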

    Example 1.2.2

    Consider an infinite sequence of coin tosses. The usual probability model is to assign probability \(2^{-n}\) to each possible initial \(n\)-tuple of individual outcomes. Then in the limit \(n \longrightarrow \infty\), the probability of any given sequence is 0. Again, expressing the probability of an event involving infinitely many tosses as a sum of individual sample-point probabilities does not work. The obvious approach (which we often adopt for this and similar situations) is to evaluate the probability of any given event as an appropriate limit, as \(n \rightarrow \infty\), of the outcome from the first \(n\) tosses.

    Solution

    We will later find a number of situations, even for this almost trivial example, where working with a finite number of elementary experiments and then going to the limit is very awkward. One example, to be discussed in detail later, is the strong law of large numbers (SLLN). This law looks directly at events consisting of infinite length sequences and is best considered in the context of the axioms to follow.
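To make the limit-over-\(n\) approach of Example 1.2.2 concrete, the following Python sketch (an illustration added here, not from the text) evaluates one particular event, "at least one head appears in the sequence", by restricting attention to the first \(n\) tosses. Each restricted probability is \(1-2^{-n}\), and the limit as \(n \rightarrow \infty\) is 1.

```python
def pr_at_least_one_head(n):
    """Pr{at least one head among the first n tosses}: 1 minus the all-tails n-tuple."""
    return 1.0 - 2.0 ** (-n)

for n in (1, 2, 5, 10, 20, 50):
    print(n, pr_at_least_one_head(n))  # increases toward 1 as n grows
```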

    Although appropriate probability models can be generated for simple examples such as those above, there is a need for a consistent and general approach. In such an approach, rather than assigning probabilities to sample points, which are then used to assign probabilities to events, probabilities must be associated directly with events. The axioms to follow establish consistency requirements between the probabilities of different events. The axioms, and the corollaries derived from them, are consistent with one’s intuition, and, for finite sample spaces, are consistent with our earlier approach. Dealing with the countable unions of events in the axioms will be unfamiliar to some students, but will soon become both familiar and consistent with intuition.

    The strange part of the axioms comes from the fact that defining the class of events as the set of all subsets of the sample space is usually inappropriate when the sample space is uncountably infinite. What is needed is a class of events that is large enough that we can almost forget that some very strange subsets are excluded. This is accomplished by having two simple sets of axioms, one defining the class of events,7 and the other defining the relations between the probabilities assigned to these events. In this theory, all events have probabilities, but those truly weird subsets that are not events do not have probabilities. This will be discussed more after giving the axioms for events.

The axioms for events use the standard notation of set theory. Let \(\Omega\) be the set of all sample points for a given experiment. The events are subsets of the sample space. The union of \(n\) subsets (events) \(A_{1}, A_{2}, \cdots, A_{n}\) is denoted by either \(\bigcup_{i=1}^{n} A_{i}\) or \(A_{1} \cup \cdots \cup A_{n}\), and consists of all points in at least one of \(A_{1}, \ldots, A_{n}\). Similarly, the intersection of these subsets is denoted by either \(\bigcap_{i=1}^{n} A_{i}\) or8 \(A_{1} A_{2} \cdots A_{n}\) and consists of all points in all of \(A_{1}, \ldots, A_{n}\).

A sequence of events is a collection of events in one-to-one correspondence with the positive integers, i.e., \(A_{1}, A_{2}, \ldots\), ad infinitum. A countable union, \(\bigcup_{i=1}^{\infty} A_{i}\), is the set of points in one or more of \(A_{1}, A_{2}, \ldots\) Similarly, a countable intersection \(\bigcap_{i=1}^{\infty} A_{i}\) is the set of points in all of \(A_{1}, A_{2}, \ldots\) Finally, the complement \(A^{c}\) of a subset (event) \(A\) is the set of points in \(\Omega\) but not \(A\).
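The following Python sketch (purely illustrative; the sample space and event names are not from the text) mirrors this notation for a small finite sample space.

```python
omega = set(range(1, 7))     # sample space for one roll of a die
a1 = {1, 2, 3}
a2 = {2, 4, 6}
a3 = {3, 6}

union = a1 | a2 | a3         # points in at least one of A1, A2, A3
intersection = a1 & a2 & a3  # points in all of A1, A2, A3
complement = omega - a1      # points of Omega not in A1

print(union, intersection, complement)  # {1, 2, 3, 4, 6} set() {4, 5, 6}
```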

    Axioms for Events

    Given a sample space \(\Omega\), the class of subsets of \(\Omega\) that constitute the set of events satisfies the following axioms:

    1. \(\Omega\) is an event.
    2. For every sequence of events \(A_{1}, A_{2}, \ldots\), the union \(\bigcup_{n=1}^{\infty} A_{n}\) is an event.
    3. For every event \(A\), the complement \(A^{c}\) is an event.

    There are a number of important corollaries of these axioms. First, the empty set \(\phi\) is an event. This follows from Axioms 1 and 3, since \(\phi=\Omega^{c}\). The empty set does not correspond to our intuition about events, but the theory would be extremely awkward if it were omitted.

Second, every finite union of events is an event. This follows by expressing \(A_{1} \cup \cdots \cup A_{n}\) as \(\bigcup_{i=1}^{\infty} A_{i}\) where \(A_{i}=\phi\) for all \(i>n\). Third, every finite or countable intersection of events is an event. This follows from De Morgan's law,

    \(\left[\bigcup_{n} A_{n}\right]^{c}=\bigcap_{n} A_{n}^{c}\)

    Although we will not make a big fuss about these axioms in the rest of the text, we will be careful to use only complements and countable unions and intersections in our analysis. Thus subsets that are not events will not arise.

    Note that the axioms do not say that all subsets of \(\Omega\) are events. In fact, there are many rather silly ways to define classes of events that obey the axioms. For example, the axioms are satisfied by choosing only the universal set \(\Omega\) and the empty set \(\phi\) to be events. We shall avoid such trivialities by assuming that for each sample point \(\omega\), the singleton subset \(\{\omega\}\) is an event. For finite sample spaces, this assumption, plus the axioms above, imply that all subsets are events.

    For uncountably infinite sample spaces, such as the sinusoidal phase above, this assumption, plus the axioms above, still leaves considerable freedom in choosing a class of events. As an example, the class of all subsets of \(\Omega\) satisfies the axioms but surprisingly does not allow the probability axioms to be satisfied in any sensible way. How to choose an appropriate class of events requires an understanding of measure theory which would take us too far afield for our purposes. Thus we neither assume nor develop measure theory here.9

    From a pragmatic standpoint, we start with the class of events of interest, such as those required to define the random variables needed in the problem. That class is then extended so as to be closed under complementation and countable unions. Measure theory shows that this extension is always possible, and we simply accept that as a known result.
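For a finite sample space, this closure can be carried out by brute force. The Python sketch below (a simplified illustration added here; it is not the measure-theoretic extension itself) repeatedly adds complements and pairwise unions until nothing new appears. Starting from the singletons it generates every subset, consistent with the earlier remark about finite sample spaces.

```python
from itertools import combinations

def close_events(omega, initial):
    """Smallest class containing `initial`, Omega, and the empty set, closed under
    complementation and (finite) unions; for finite Omega this serves as the event class."""
    events = {frozenset(), frozenset(omega)} | {frozenset(a) for a in initial}
    changed = True
    while changed:
        new = {frozenset(omega) - a for a in events}          # complements
        new |= {a | b for a, b in combinations(events, 2)}    # pairwise unions
        changed = not new <= events
        events |= new
    return events

omega = {1, 2, 3}
print(len(close_events(omega, [{1}, {2}, {3}])))        # 8: all subsets of a 3-point space
print(sorted(map(sorted, close_events(omega, [{1}]))))  # [[], [1], [1, 2, 3], [2, 3]]
```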

Axioms of Probability

    Given any sample space \(\Omega\) and any class of events \(\mathcal{E}\) satisfying the axioms of events, a probability rule is a function Pr{} mapping each \(A \in \mathcal{E}\) to a (finite10) real number in such a way that the following three probability axioms11 hold:

    1. \(\operatorname{Pr}\{\Omega\}=1\).
    2. For every event \(A, \operatorname{Pr}\{A\} \geq 0\).
    3. The probability of the union of any sequence \(A_{1}, A_{2}, \ldots\) of disjoint events is given by

\[\operatorname{Pr}\left\{\bigcup_{n=1}^{\infty} A_{n}\right\}=\sum_{n=1}^{\infty} \operatorname{Pr}\left\{A_{n}\right\},\label{1.1} \]

    where \(\sum_{n=1}^{\infty} \operatorname{Pr}\left\{A_{n}\right\}\) is shorthand for \(\lim _{m \rightarrow \infty} \sum_{n=1}^{m} \operatorname{Pr}\left\{A_{n}\right\}\).
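As a concrete (and numerically truncated) check of these axioms, the Python sketch below (an illustration added here) uses the countably infinite sample space \(\Omega=\{1,2,3, \ldots\}\) with \(\operatorname{Pr}\{n\}=2^{-n}\), which can be read as the number of the toss on which the first head appears in Example 1.2.2. Countable additivity (1.1) shows up as a limit of partial sums, here truncated at a large cutoff.

```python
def point_mass(n):
    """Pr of the sample point n: the first head appears on toss n."""
    return 2.0 ** (-n)

def pr(event, cutoff=200):
    """Pr of an event given as a predicate on sample points (partial sum up to cutoff)."""
    return sum(point_mass(n) for n in range(1, cutoff + 1) if event(n))

print(pr(lambda n: True))         # Axiom 1: Pr{Omega} = 1, up to truncation error
print(pr(lambda n: n % 2 == 1))   # Pr{first head on an odd-numbered toss} = 2/3
print(pr(lambda n: n % 2 == 1) + pr(lambda n: n % 2 == 0))  # disjoint events sum to 1
```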

    The axioms imply the following useful corollaries:

    \[\operatorname{Pr}\{\phi\}=0\label{1.2} \]

\[\operatorname{Pr}\left\{\bigcup_{n=1}^{m} A_{n}\right\}=\sum_{n=1}^{m} \operatorname{Pr}\left\{A_{n}\right\} \quad \text { for } A_{1}, \ldots, A_{m} \text { disjoint }\label{1.3} \]

    \[\operatorname{Pr}\left\{A^{c}\right\}=1-\operatorname{Pr}\{A\}\quad
    \text { for all } A\label{1.4} \]

    \[\operatorname{Pr}\{A\} \leq \operatorname{Pr}\{B\} \quad \text { for all } A \subseteq B\label{1.5} \]

    \[\operatorname{Pr}\{A\} \leq 1 \quad
    \text { for all } A \label{1.6} \]

\[\sum_{n} \operatorname{Pr}\left\{A_{n}\right\} \leq 1\quad
\text { for } A_{1}, A_{2}, \ldots \text { disjoint }\label{1.7} \]

\[\operatorname{Pr}\left\{\bigcup_{n=1}^{\infty} A_{n}\right\}=\lim _{m \rightarrow \infty} \operatorname{Pr}\left\{\bigcup_{n=1}^{m} A_{n}\right\}\label{1.8} \]

\[\operatorname{Pr}\left\{\bigcup_{n=1}^{\infty} A_{n}\right\}=\lim _{n \rightarrow \infty} \operatorname{Pr}\left\{A_{n}\right\} \quad \text { for } A_{1} \subseteq A_{2} \subseteq \cdots\label{1.9} \]

    To verify (1.2), consider a sequence of events, \(A_{1}, A_{2}, \ldots\), for which \(A_{n}=\phi\) for each \(n\). These events are disjoint since \(\phi\) contains no outcomes, and thus has no outcomes in common with itself or any other event. Also, \(\bigcup_{n} A_{n}=\phi\) since this union contains no outcomes. Axiom 3 then says that

    \(\operatorname{Pr}\{\phi\}=\lim _{m \rightarrow \infty} \sum_{n=1}^{m} \operatorname{Pr}\left\{A_{n}\right\}=\lim _{m \rightarrow \infty} m \operatorname{Pr}\{\phi\}\)

    Since \(\operatorname{Pr}\{\phi\}\) is a real number, this implies that \(\operatorname{Pr}\{\phi\}=0\).

    To verify (1.3), apply Axiom 3 to the disjoint sequence \(A_{1}, \ldots, A_{m}, \phi, \phi, \ldots\)
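Spelled out (an expanded version of the same one-line argument), with \(A_{n}=\phi\) for all \(n>m\), Axiom 3 and \ref{1.2} give

\[\operatorname{Pr}\left\{\bigcup_{n=1}^{m} A_{n}\right\}=\operatorname{Pr}\left\{\bigcup_{n=1}^{\infty} A_{n}\right\}=\sum_{n=1}^{\infty} \operatorname{Pr}\left\{A_{n}\right\}=\sum_{n=1}^{m} \operatorname{Pr}\left\{A_{n}\right\}+\sum_{n=m+1}^{\infty} \operatorname{Pr}\{\phi\}=\sum_{n=1}^{m} \operatorname{Pr}\left\{A_{n}\right\}. \nonumber \]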

One might reasonably guess that (1.3), along with Axioms 1 and 2, implies Axiom 3. Exercise 1.3 shows why this guess is incorrect.

To verify (1.4), note that \(\Omega=A \cup A^{c}\). Then apply \ref{1.3} to the disjoint sets \(A\) and \(A^{c}\).

To verify (1.5), note that if \(A \subseteq B\), then \(B=A \cup(B-A)\) where \(B-A\) is an alternate way to write \(B \cap A^{c}\). We see then that \(A\) and \(B-A\) are disjoint, so from (1.3),

\(\operatorname{Pr}\{B\}=\operatorname{Pr}\{A \cup(B-A)\}=\operatorname{Pr}\{A\}+\operatorname{Pr}\{B-A\} \geq \operatorname{Pr}\{A\}\),

    where we have used Axiom 2 in the last step.

    To verify \ref{1.6} and (1.7), first substitute \(\Omega\) for \(B\) in \ref{1.5} and then substitute \(\bigcup_{n} A_{n}\) for \(A\).

    Finally, \ref{1.8} is established in Exercise 1.4, part (e), and \ref{1.9} is a simple consequence of (1.8).
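As a small numerical illustration of \ref{1.9} (an added example, reusing the uniform-phase model of Example 1.2.1), take the increasing events \(A_{n}=(0, \pi-1 / n)\). Their union is \((0, \pi)\), and \(\operatorname{Pr}\left\{A_{n}\right\}\) increases to \(\operatorname{Pr}\{(0, \pi)\}=1 / 2\).

```python
import math

def pr_phase_interval(a, b):
    """Pr of the phase falling in (a, b) under the density 1/(2*pi)."""
    return max(b - a, 0.0) / (2.0 * math.pi)

for n in (1, 2, 10, 100, 10_000):
    print(n, pr_phase_interval(0.0, math.pi - 1.0 / n))  # increases toward 0.5
print(pr_phase_interval(0.0, math.pi))                   # 0.5, the probability of the union
```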

The axioms specify the probability of any disjoint union of events in terms of the individual event probabilities, but what about a finite or countable union of arbitrary events? Exercise 1.4 (b) shows that in this case, \ref{1.3} can be generalized to

\[\operatorname{Pr}\left\{\bigcup_{n=1}^{m} A_{n}\right\}=\sum_{n=1}^{m} \operatorname{Pr}\left\{B_{n}\right\}, \label{1.10} \]

where \(B_{1}=A_{1}\) and for each \(n>1\), \(B_{n}=A_{n}-\bigcup_{j=1}^{n-1} A_{j}\) is the set of points in \(A_{n}\) but not in any of the sets \(A_{1}, \ldots, A_{n-1}\). The probability of a countable union is then given by (1.8). In order to use this, one must know not only the event probabilities for \(A_{1}, A_{2}, \ldots\), but also the probabilities of their intersections. The union bound, which is derived in Exercise 1.4 (c), depends only on the individual event probabilities, and gives the following frequently useful upper bound on the union probability.

\[\operatorname{Pr}\left\{\bigcup_{n} A_{n}\right\} \leq \sum_{n} \operatorname{Pr}\left\{A_{n}\right\} \quad\label{1.11}
\text { (Union bound). } \nonumber \]
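The Python sketch below (an added illustration, using arbitrarily chosen overlapping events on a six-point sample space with equiprobable points) evaluates a union probability three ways: directly, through the disjoint sets \(B_{n}\) of \ref{1.10}, and through the union bound \ref{1.11}.

```python
from fractions import Fraction

omega = set(range(1, 7))                      # one roll of a fair die
pr_point = {w: Fraction(1, 6) for w in omega}

def pr(event):
    return sum(pr_point[w] for w in event)

a = [{1, 2, 3}, {2, 4}, {3, 4, 5}]            # overlapping events A_1, A_2, A_3

# Disjointify as in (1.10): B_n is the part of A_n not already covered by A_1, ..., A_{n-1}.
b, covered = [], set()
for a_n in a:
    b.append(a_n - covered)
    covered |= a_n

exact = pr(set().union(*a))                   # Pr of the union, computed directly
via_disjoint = sum(pr(b_n) for b_n in b)      # sum of the disjoint pieces, as in (1.10)
union_bound = sum(pr(a_n) for a_n in a)       # right side of (1.11)

print(exact, via_disjoint, union_bound)       # 5/6 5/6 4/3; the bound can exceed 1
```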


    Reference

5A set is uncountably infinite if it is infinite and its members cannot be put into one-to-one correspondence with the positive integers. For example, the set of real numbers over some interval such as \((0,2 \pi)\) is uncountably infinite. The Wikipedia article on countable sets provides a friendly introduction to the concepts of countability and uncountability.

6It is possible to avoid the consideration of infinite sample spaces here by quantizing the possible phases. This is analogous to avoiding calculus by working only with discrete functions. Both usually result in artificiality and added complexity.

    7A class of elements satisfying these axioms is called a \(\sigma\)-algebra or, less commonly, a \(\sigma\)-field.

    8Intersection is also sometimes denoted as \(A_{1} \cap \cdots \cap A_{n}\), but is usually abbreviated as \(A_{1} A_{2} \cdots A_{n}\).

    9There is no doubt that measure theory is useful in probability theory, and serious students of probability should certainly learn measure theory at some point. For application-oriented people, however, it seems advisable to acquire more insight and understanding of probability, at a graduate level, before concentrating on the abstractions and subtleties of measure theory.

10The word finite is redundant here, since the set of real numbers, by definition, does not include \(\pm \infty\). The set of real numbers with \(\pm \infty\) appended is called the set of extended real numbers.

    11Sometimes finite additivity, (1.3), is added as an additional axiom. This addition is quite intuitive and avoids the technical and somewhat peculiar proofs given for \ref{1.2} and (1.3).

