19.6: Sums of Random Variables

• Eric Lehman, F. Thomson Leighton, & Albert R. Meyer
    • Google and Massachusetts Institute of Technology via MIT OpenCourseWare

    If all you know about a random variable is its mean and variance, then Chebyshev’s Theorem is the best you can do when it comes to bounding the probability that the random variable deviates from its mean. In some cases, however, we know more—for example, that the random variable has a binomial distribution— and then it is possible to prove much stronger bounds. Instead of polynomially small bounds such as \(1/c^2\), we can sometimes even obtain exponentially small bounds such as \(1/e^c\). As we will soon discover, this is the case whenever the random variable \(T\) is the sum of \(n\) mutually independent random variables \(T_1, T_2, \ldots, T_n\) where \(0 \leq T_i \leq 1\). A random variable with a binomial distribution is just one of many examples of such a \(T\). Here is another.

    Motivating Example

Fussbook is a new social networking site oriented toward unpleasant people. Like all major web services, Fussbook has a load balancing problem: it receives lots of forum posts that computer servers have to process. If any server is assigned more work than it can complete in a given interval, then it is overloaded and system performance suffers. That would be bad, because Fussbook users are not a tolerant bunch. So balancing the work load across multiple servers is vital.

    An early idea was to assign each server an alphabetic range of forum topics. (“That oughta work!”, one programmer said.) But after the computer handling the “privacy” and “preferred text editor” threads melted from overload, the drawback of an ad hoc approach was clear: it’s easy to miss something that will mess up your plan.

    If the length of every task were known in advance, then finding a balanced distribution would be a kind of “bin packing” problem. Such problems are hard to solve exactly, but approximation algorithms can come close. Unfortunately, in this case task lengths are not known in advance, which is typical of workload problems in the real world.

So the load balancing problem seems hopeless, because there is no data available to guide decisions. The programmers of Fussbook therefore gave up and just randomly assigned posts to computers. Imagine their surprise when the system stayed up and hasn’t crashed yet!

    As it turns out, random assignment not only balances load reasonably well, but also permits provable performance guarantees. In general, a randomized approach to a problem is worth considering when a deterministic solution is hard to compute or requires unavailable information.

    Specifically, Fussbook receives 24,000 forum posts in every 10-minute interval. Each post is assigned to one of several servers for processing, and each server works sequentially through its assigned tasks. It takes a server an average of \(1/4\) second to process a post. Some posts, such as pointless grammar critiques and snide witticisms, are easier, but no post—not even the most protracted harangues—takes more than one full second.

Measuring workload in seconds, this means a server is overloaded when it is assigned more than 600 units of work in a given 600-second interval. Fussbook’s average processing load of \(24,000 \cdot 1/4 = 6000\) seconds per interval would keep 10 computers running at 100% capacity with perfect load balancing. Surely, more than 10 servers are needed to cope with random fluctuations in task length and imperfect load balance. But would 11 be enough? . . . or 15, 20, 100? We’ll answer that question with a new mathematical tool.

    The Chernoff Bound

    The Chernoff5 bound is a hammer that you can use to nail a great many problems. Roughly, the Chernoff bound says that certain random variables are very unlikely to significantly exceed their expectation. For example, if the expected load on a processor is just a bit below its capacity, then that processor is unlikely to be overloaded, provided the conditions of the Chernoff bound are satisfied.

    More precisely, the Chernoff Bound says that the sum of lots of little, independent random variables is unlikely to significantly exceed the mean of the sum. The Markov and Chebyshev bounds lead to the same kind of conclusion but typically provide much weaker bounds. In particular, the Markov and Chebyshev bounds are polynomial, while the Chernoff bound is exponential.

    Here is the theorem. The proof will come later in Section 19.6.6.

    Theorem \(\PageIndex{1}\)

    (Chernoff Bound). Let \(T_1, \ldots, T_n\) be mutually independent random variables such that \(0 \leq T_i \leq 1\) for all \(i\). Let \(T = T_1 + \cdots + T_n\). Then for all \(c \geq 1\),

    \[\text{Pr}[T \geq c \text{Ex}[T]] \leq e^{-\beta (c) \text{Ex}[T]}\]

    where \(\beta(c) ::= c \ln c - c + 1\).
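The bound is easy to evaluate numerically. Here is a minimal Python sketch (the function names `beta` and `chernoff_bound` are mine, not from the text):

```python
import math

def beta(c):
    """The exponent beta(c) ::= c ln(c) - c + 1 from the theorem."""
    return c * math.log(c) - c + 1

def chernoff_bound(ex_t, c):
    """Upper bound on Pr[T >= c * Ex[T]] for a sum T of mutually
    independent random variables in [0, 1]; valid for c >= 1."""
    return math.exp(-beta(c) * ex_t)

# For example, with Ex[T] = 500 and c = 1.2 the bound is about 8.3e-5.
p = chernoff_bound(500, 1.2)
```

Notice that the bound depends only on \(c\) and \(\text{Ex}[T]\), not on \(n\) or on the individual distributions of the \(T_i\).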

    The Chernoff bound applies only to distributions of sums of independent random variables that take on values in the real interval \([0, 1]\). The binomial distribution is the most well-known distribution that fits these criteria, but many others are possible, because the Chernoff bound allows the variables in the sum to have differing, arbitrary, or even unknown distributions over the range \([0, 1]\). Furthermore, there is no direct dependence on either the number of random variables in the sum or their expectations. In short, the Chernoff bound gives strong results for lots of problems based on little information—no wonder it is widely used!

    Chernoff Bound for Binomial Tails

    The Chernoff bound can be applied in easy steps, though the details can be daunting at first. Let’s walk through a simple example to get the hang of it: bounding the probability that the number of heads that come up in 1000 independent tosses of a coin exceeds the expectation by 20% or more. Let \(T_i\) be an indicator variable for the event that the \(i\)th coin is heads. Then the total number of heads is

    \[\nonumber T = T_1 + \cdots + T_{1000}.\]

    The Chernoff bound requires that the random variables \(T_i\) be mutually independent and take on values in the range \([0, 1]\). Both conditions hold here. In this example the \(T_i\)’s only take the two values 0 and 1, since they’re indicators.

    The goal is to bound the probability that the number of heads exceeds its expectation by 20% or more; that is, to bound \(\text{Pr}[T \geq c \text{Ex}[T]]\) where \(c = 1.2\). To that end, we compute \(\beta(c) \) as defined in the theorem:

    \[\nonumber \beta(c) = c \ln (c) - c + 1 = 0.0187 \ldots\]

If we assume the coin is fair, then \(\text{Ex}[T] = 500\). Plugging these values into the Chernoff bound gives:

\[\begin{aligned} \text{Pr}[T \geq 1.2 \text{Ex}[T]] &\leq e^{-\beta (c) \cdot \text{Ex}[T]} \\ &= e^{-(0.0187\ldots) \cdot 500} < 0.0000834. \end{aligned}\]

    So the probability of getting 20% or more extra heads on 1000 coins is less than 1 in 10,000.

    The bound rapidly becomes much smaller as the number of coins increases, because the expected number of heads appears in the exponent of the upper bound. For example, the probability of getting at least 20% extra heads on a million coins is at most

    \[\nonumber e^{-(0.0187\ldots) \cdot 500000} < e^{-9392},\]

    which is an inconceivably small number.

    Alternatively, the bound also becomes stronger for larger deviations. For example, suppose we’re interested in the odds of getting 30% or more extra heads in 1000 tosses, rather than 20%. In that case, \(c = 1.3\) instead of 1.2. Consequently, the parameter \(\beta (c)\) rises from 0.0187 to about 0.0410, which may not seem significant, but because \(\beta (c)\) appears in the exponent of the upper bound, the final probability decreases from around 1 in 10,000 to about 1 in a billion!
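These figures are straightforward to reproduce. A quick Python check (a sketch; `beta` is the exponent from the theorem):

```python
import math

def beta(c):
    return c * math.log(c) - c + 1

# 20% extra heads on 1000 fair coins: Ex[T] = 500, c = 1.2
p_20 = math.exp(-beta(1.2) * 500)   # about 8.3e-5, i.e. less than 1/10,000

# 20% extra heads on a million coins: Ex[T] = 500,000.
# The bound underflows to 0.0 in floating point, so compare exponents:
exponent = beta(1.2) * 500_000      # about 9393, so the bound is below e^-9392

# 30% extra heads on 1000 coins: c = 1.3
p_30 = math.exp(-beta(1.3) * 500)   # about 1.2e-9, i.e. roughly 1 in a billion
```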

    Chernoff Bound for a Lottery Game

    Pick-4 is a lottery game in which you pay $1 to pick a 4-digit number between 0000 and 9999. If your number comes up in a random drawing, then you win $5,000. Your chance of winning is 1 in 10,000. If 10 million people play, then the expected number of winners is 1000. When there are exactly 1000 winners, the lottery keeps $5 million of the $10 million paid for tickets. The lottery operator’s nightmare is that the number of winners is much greater—especially at the point where more than 2000 win and the lottery must pay out more than it received. What is the probability that will happen?

    Let \(T_i\) be an indicator for the event that the \(i\)th player wins. Then \(T = T_1 + \cdots + T_n\) is the total number of winners. If we assume6 that the players’ picks and the winning number are random, independent and uniform, then the indicators \(T_i\) are independent, as required by the Chernoff bound.

    Since 2000 winners would be twice the expected number, we choose \(c = 2\), compute \(\beta(c) = 0.386 \ldots\), and plug these values into the Chernoff bound:

\[\begin{aligned} \text{Pr}[T \geq 2000] &= \text{Pr}[T \geq 2 \text{Ex}[T]] \\ &\leq e^{-\beta(c) \text{Ex}[T]} = e^{-(0.386 \ldots) \cdot 1000} \\ &< e^{-386}. \end{aligned}\]

    So there is almost no chance that the lottery operator pays out more than it took in. In fact, the number of winners won’t even be 10% higher than expected very often. To prove that, let \(c = 1.1\), compute \(\beta(c) = 0.00484 \ldots\), and plug in again:

\[\begin{aligned} \text{Pr}[T \geq 1.1 \text{Ex}[T]] &\leq e^{-\beta(c) \text{Ex}[T]} \\ &= e^{-(0.00484 \ldots) \cdot 1000} < 0.01. \end{aligned}\]

    So the Pick-4 lottery may be exciting for the players, but the lottery operator has little doubt as to the outcome!
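Both lottery bounds can be checked the same way (a sketch, assuming the stated expectation of 1000 winners):

```python
import math

def beta(c):
    return c * math.log(c) - c + 1

ex_winners = 1000  # 10 million players, each wins with probability 1/10,000

# Chance of at least 2000 winners (c = 2): astronomically small.
# beta(2) * 1000 = (2 ln 2 - 1) * 1000 ≈ 386.3, so the bound is below e^-386.
p_double = math.exp(-beta(2) * ex_winners)

# Chance of at least 10% more winners than expected (c = 1.1):
p_10pct = math.exp(-beta(1.1) * ex_winners)   # about 0.0079, i.e. under 1%
```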

    Randomized Load Balancing

    Now let’s return to Fussbook and its load balancing problem. Specifically, we need to determine a number, \(m\), of servers that makes it very unlikely that any server is overloaded by being assigned more than 600 seconds of work in a given interval.

To begin, let’s find the probability that the first server is overloaded. Letting \(T\) be the number of seconds of work assigned to the first server, this means we want an upper bound on \(\text{Pr}[T \geq 600]\). Let \(T_i\) be the number of seconds that the first server spends on the \(i\)th task: then \(T_i\) is zero if the task is assigned to another machine, and otherwise \(T_i\) is the length of the task. So \(T = \sum_{i = 1}^{n} T_i\) is the total number of seconds of work assigned to the first server, where \(n = 24,000\).

    The Chernoff bound is applicable only if the \(T_i\) are mutually independent and take on values in the range \([0, 1]\). The first condition is satisfied if we assume that assignment of a post to a server is independent of the time required to process the post. The second condition is satisfied because we know that no post takes more than 1 second to process; this is why we chose to measure work in seconds.

    In all, there are 24,000 tasks, each with an expected length of \(1/4\) second. Since tasks are assigned to the \(m\) servers at random, the expected load on the first server is:

    \[\begin{align} \nonumber \text{Ex}[T] &= \frac{24,000 \text{ tasks } \cdot 1/4 \text{ second per task}}{m \text{ servers }} \\ \label{19.6.2} &= 6000/m \text{ seconds }. \end{align}\]

    So if there are fewer than 10 servers, then the expected load on the first server is greater than its capacity, and we can expect it to be overloaded. If there are exactly 10 servers, then the server is expected to run for \(6000/10 = 600\) seconds, which is 100% of its capacity.

    Now we can use the Chernoff bound based on the number of servers to bound the probability that the first server is overloaded. We have from (\ref{19.6.2})

    \[\nonumber 600 = c \text{Ex}[T] \quad \text{where } c::= m/10,\]

    so by the Chernoff bound

\[\nonumber \text{Pr}[T \geq 600] = \text{Pr}[T \geq c \text{Ex}[T]] \leq e^{-(c \ln (c) - c + 1) \cdot 6000 / m}.\]

    The probability that some server is overloaded is at most \(m\) times the probability that the first server is overloaded, by the Union Bound in Section 16.5.2. So

    \[\begin{aligned} \text{Pr}[\text{some server is overloaded}] &\leq \sum_{i = 1}^{m} \text{Pr}[\text{server } i \text{ is overloaded}] \\ &= m \text{Pr}[\text{the first server is overloaded}] \\ &\leq me^{-(c \ln (c) - c + 1) \cdot 6000 / m}, \end{aligned}\]

    where \(c = m/10\). Some values of this upper bound are tabulated below:

    \[\begin{aligned} m &= 11 : 0.784 \ldots \\ m &= 12 : 0.000999 \ldots \\ m &= 13 : 0.0000000760 \ldots \end{aligned}\]

    These values suggest that a system with \(m = 11\) machines might suffer immediate overload, \(m = 12\) machines could fail in a few days, but \(m = 13\) should be fine for a century or two!
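The table is easy to regenerate. A short Python sketch (the name `overload_bound` is mine) that combines the union bound with the Chernoff bound:

```python
import math

def beta(c):
    return c * math.log(c) - c + 1

def overload_bound(m):
    """Upper bound on Pr[some server overloaded] with m servers:
    union bound over m servers, with c = m/10 and Ex[T] = 6000/m."""
    c = m / 10
    return m * math.exp(-beta(c) * 6000 / m)

for m in (11, 12, 13):
    print(m, overload_bound(m))   # reproduces the tabulated values
```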

    Proof of the Chernoff Bound

    The proof of the Chernoff bound is somewhat involved. In fact, Chernoff himself couldn’t come up with it: his friend, Herman Rubin, showed him the argument. Thinking the bound not very significant, Chernoff did not credit Rubin in print. He felt pretty bad when it became famous!7

    Proof. (of Theorem 19.6.1)

    For clarity, we’ll go through the proof “top down.” That is, we’ll use facts that are proved immediately afterward.

    The key step is to exponentiate both sides of the inequality \(T \geq c \text{Ex}[T]\) and then apply the Markov bound:

    \[\begin{aligned} \text{Pr}[T \geq c \text{Ex}[T]] &= \text{Pr}[c^T \geq c^{c \text{Ex}[T]}] \\ &\leq \frac{\text{Ex}[c^T]}{c^{c\text{Ex}[T]}} & (\text{Markov Bound})\\ &\leq \frac{e^{(c-1) \text{Ex}[T]}}{c^{c\text{Ex}[T]}} & (\text{Lemma } 19.6.2 \text{ below}) \\ &= \frac{e^{(c-1) \text{Ex}[T]}}{e^{c \ln (c) \text{Ex}[T]}} = e^{-(c \ln (c) - c + 1) \text{Ex}[T]}. & & \quad \blacksquare \end{aligned} \]

    Algebra aside, there is a brilliant idea in this proof: in this context, exponentiating somehow supercharges the Markov bound. This is not true in general! One unfortunate side-effect of this supercharging is that we have to bound some nasty expectations involving exponentials in order to complete the proof. This is done in the two lemmas below, where variables take on values as in Theorem 19.6.1.

    Lemma 19.6.2.

    \[\nonumber \text{Ex}\left[c^T\right] \leq e^{(c-1)\text{Ex}[T]}.\]

    Proof.

    \[\begin{aligned} \text{Ex}\left[c^T\right] &= \text{Ex}\left[c^{T_1 + \cdots + T_n}\right] & (\text{def of } T)\\ &= \text{Ex}\left[c^{T_1} \cdots c^{T_n}\right] \\ &= \text{Ex}\left[c^{T_1}\right] \cdots \text{Ex}\left[c^{T_n}\right] & (\text{independent product Cor }18.5.7) \\ &\leq e^{(c-1)\text{Ex}[T_1]} \cdots e^{(c-1)\text{Ex}[T_n]} & (\text{Lemma 19.6.3 below}) \\ &= e^{(c-1)(\text{Ex}[T_1] + \cdots + \text{Ex}[T_n])} \\ &= e^{(c-1)(\text{Ex}[T_1+ \cdots + T_n])} \\ &= e^{(c-1)(\text{Ex}[T])}.\end{aligned}\]

    The third equality depends on the fact that functions of independent variables are also independent (see Lemma 18.2.2).\(\quad \blacksquare\)

    Lemma 19.6.3.

    \[\nonumber \text{Ex}[c^{T_i}] \leq e^{(c-1)\text{Ex}[T_i]}\]

    Proof. All summations below range over values \(v\) taken by the random variable \(T_i\), which are all required to be in the interval \([0, 1]\).

\[\begin{aligned} \text{Ex}[c^{T_i}] &= \sum c^v \text{Pr}[T_i = v] & (\text{def of Ex}[\cdot]) \\ &\leq \sum (1 + (c-1)v) \text{Pr}[T_i = v] & (\text{convexity—see below}) \\ &= \sum \left( \text{Pr}[T_i = v] + (c - 1) v \text{Pr}[T_i = v] \right) \\ &= \sum \text{Pr}[T_i = v] + (c - 1) \sum v \text{Pr}[T_i = v] \\ &= 1 + (c-1) \text{Ex}[T_i] \\ &\leq e^{(c-1)\text{Ex}[T_i]} & (\text{since } 1 + z \leq e^z) \end{aligned}\]

    The second step relies on the inequality

    \[\nonumber c^v \leq 1 + (c-1)v,\]

which holds for all \(v\) in \([0, 1]\) and \(c \geq 1\). This follows from the general principle that a convex function, namely \(c^v\), lies below the linear function, \(1 + (c - 1)v\), between their points of intersection, namely \(v = 0\) and \(v = 1\). This inequality is why the variables \(T_i\) are restricted to the real interval \([0, 1]\). \(\quad \blacksquare\)
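The convexity inequality \(c^v \leq 1 + (c - 1)v\) is also easy to sanity-check numerically on a grid of values (a quick sketch):

```python
# Check c^v <= 1 + (c-1)v for v in [0, 1] and several c >= 1.
# Equality holds at the endpoints v = 0 and v = 1; the exponential
# lies strictly below the chord in between.
for c in (1.0, 1.5, 2.0, 10.0):
    for k in range(101):
        v = k / 100
        assert c ** v <= 1 + (c - 1) * v + 1e-12
```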

    Comparing the Bounds

    Suppose that we have a collection of mutually independent events \(A_1, A_2, \ldots, A_n\), and we want to know how many of the events are likely to occur.

    Let \(T_i\) be the indicator random variable for \(A_i\) and define

    \[\nonumber p_i = \text{Pr}[T_i = 1] = \text{Pr}[A_i]\]

    for \(1 \leq i \leq n.\) Define

    \[\nonumber T = T_1 + T_2 + \cdots + T_n\]

    to be the number of events that occur.

    We know from Linearity of Expectation that

    \[\begin{aligned} \text{Ex}[T] &= \text{Ex}[T_1] + \text{Ex}[T_2] + \cdots + \text{Ex}[T_n] \\ &= \sum_{i = 1}^{n} p_i. \end{aligned} \]

    This is true even if the events are not independent.

    By Theorem 19.3.8, we also know that

    \[\begin{aligned} \text{Var}[T] &= \text{Var}[T_1] + \text{Var}[T_2] + \cdots + \text{Var}[T_n] \\ &= \sum_{i = 1}^{n} p_i (1 - p_i), \end{aligned} \]

    and thus that

    \[\nonumber \sigma_T = \sqrt{\sum_{i = 1}^{n} p_i (1 - p_i)}.\]

    This is true even if the events are only pairwise independent.

    Markov’s Theorem tells us that for any \(c>1\),

    \[\nonumber \text{Pr}[T \geq c \text{Ex}[T]] \leq \frac{1}{c}.\]

    Chebyshev’s Theorem gives us the stronger result that

    \[\nonumber \text{Pr}[|T - \text{Ex}[T]| \geq c \sigma_T] \leq \frac{1}{c^2}.\]

The Chernoff Bound gives us an even stronger result, namely, that for any \(c > 1\),

\[\nonumber \text{Pr}[T \geq c \text{Ex}[T]] \leq e^{-(c \ln (c) - c + 1) \text{Ex}[T]}.\]

In this case, the probability of exceeding the mean by a factor of \(c\) is exponentially small in \(\text{Ex}[T]\).

    By considering the random variable \(n - T\), we can also use the Chernoff Bound to prove that the probability that \(T\) is much lower than \(\text{Ex}[T]\) is also exponentially small.
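For the coin-tossing example earlier, the three bounds can be computed side by side (a sketch; the Chebyshev bound is applied to the same deviation of 100 heads):

```python
import math

n, p = 1000, 0.5
ex = n * p                           # Ex[T] = 500
sigma = math.sqrt(n * p * (1 - p))   # sigma_T = sqrt(250) ≈ 15.8

dev = 0.2 * ex                       # a deviation of 100 heads (c = 1.2)

markov = 1 / 1.2                     # Pr[T >= 1.2 Ex[T]] <= 1/c ≈ 0.83
chebyshev = (sigma / dev) ** 2       # Pr[|T - Ex[T]| >= 100] <= 0.025
b = 1.2 * math.log(1.2) - 1.2 + 1    # beta(1.2)
chernoff = math.exp(-b * ex)         # about 8.3e-5
```

The progression from polynomial to exponential is striking: Markov gives roughly \(0.83\), Chebyshev \(0.025\), and Chernoff less than \(0.0001\) for the very same event.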

    Murphy’s Law

    If the expectation of a random variable is much less than 1, then Markov’s Theorem implies that there is only a small probability that the variable has a value of 1 or more. On the other hand, a result that we call Murphy’s Law8 says that if a random variable is an independent sum of 0–1-valued variables and has a large expectation, then there is a huge probability of getting a value of at least 1.

    Theorem \(\PageIndex{4}\)

    (Murphy’s Law). Let \(A_1, A_2, \ldots, A_n\) be mutually independent events. Let \(T_i\) be the indicator random variable for \(A_i\) and define

    \[\nonumber T ::= T_1 + T_2 + \cdots + T_n\]

    to be the number of events that occur. Then

    \[\nonumber \text{Pr}[T = 0] \leq e^{-\text{Ex}[T]},\]

    Proof

    \[\begin{aligned}
    \text{Pr}[T=0] &=\text{Pr}\left[\bar{A}_{1} \cap \bar{A}_{2} \cap \ldots \cap \bar{A}_{n}\right] & (T = 0 \text{ iff no }A_i \text{ occurs})\\
    &=\prod_{i=1}^{n} \text{Pr}\left[\bar{A}_{i}\right] & (\text{independence of } A_i)\\
    &=\prod_{i=1}^{n}\left(1-\text{Pr}\left[A_{i}\right]\right) \\
    & \leq \prod_{i=1}^{n} e^{-\text{Pr}\left[A_{i}\right]} & (\text{since } 1 - x \leq e^{-x})\\
    &=e^{-\sum_{i=1}^{n} \text{Pr}\left[A_{i}\right]} \\
    &=e^{-\sum_{i=1}^{n} \text{Ex}\left[T_{i}\right]} & (\text{since } T_i \text{ is an indicator for } A_i)\\
    &=e^{-\text{Ex}[T]} & (\text{linearity of expectation}) \\ & & \quad \blacksquare
    \end{aligned}\]

For example, given any set of mutually independent events, if you expect 10 of them to happen, then at least one of them will happen with probability at least \(1 - e^{-10}\). The probability that none of them happen is at most \(e^{-10} < 1/22000\).

    So if there are a lot of independent things that can go wrong and their probabilities sum to a number much greater than 1, then Theorem 19.6.4 proves that some of them surely will go wrong.

    This result can help to explain “coincidences,” “miracles,” and crazy events that seem to have been very unlikely to happen. Such events do happen, in part, because there are so many possible unlikely events that the sum of their probabilities is greater than one. For example, someone does win the lottery.

In fact, if there are 100,000 random tickets in Pick-4, Theorem 19.6.4 says that the probability that there is no winner is less than \(e^{-10} < 1/22000\). More generally, there are literally millions of one-in-a-million possible events and so some of them will surely occur.
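The Pick-4 figure is easy to confirm, and the Murphy's Law bound can be compared with the exact probability of no winner (a sketch, assuming all 100,000 picks are uniform and independent):

```python
import math

n_tickets = 100_000
p_win = 1 / 10_000
ex_winners = n_tickets * p_win        # 10 expected winners

murphy_bound = math.exp(-ex_winners)  # Pr[no winner] <= e^-10 ≈ 4.54e-5
exact = (1 - p_win) ** n_tickets      # exact Pr[no winner]

# e^-10 is indeed less than 1/22000 ≈ 4.55e-5, and the exact
# probability sits just below the bound.
assert exact <= murphy_bound < 1 / 22000
```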

    5Yes, this is the same Chernoff who figured out how to beat the state lottery—this guy knows a thing or two.

    6As we noted in Chapter 18, human choices are often not uniform and they can be highly dependent. For example, lots of people will pick an important date. The lottery folks should not get too much comfort from the analysis that follows, unless they assign random 4-digit numbers to each player.

    7See “A Conversation with Herman Chernoff,” Statistical Science 1996, Vol. 11, No. 4, pp 335– 350.

    8This is in reference and deference to the famous saying that “If something can go wrong, it probably will.”


This page titled 19.6: Sums of Random Variables is shared under a CC BY-NC-SA license and was authored, remixed, and/or curated by Eric Lehman, F. Thomson Leighton, & Albert R. Meyer (MIT OpenCourseWare).
