4.4: Birthday Probabilities and Sampling with/out Replacement

Last updated
Save as PDF

Page ID: 7337

\( \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } \) \( \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} \)\(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\) \(\newcommand{\id}{\mathrm{id}}\) \( \newcommand{\Span}{\mathrm{span}}\) \( \newcommand{\kernel}{\mathrm{null}\,}\) \( \newcommand{\range}{\mathrm{range}\,}\) \( \newcommand{\RealPart}{\mathrm{Re}}\) \( \newcommand{\ImaginaryPart}{\mathrm{Im}}\) \( \newcommand{\Argument}{\mathrm{Arg}}\) \( \newcommand{\norm}[1]{\| #1 \|}\) \( \newcommand{\inner}[2]{\langle #1, #2 \rangle}\) \( \newcommand{\Span}{\mathrm{span}}\)\(\newcommand{\AA}{\unicode[.8,0]{x212B}}\)

In many cryptographic schemes, the users repeatedly choose random strings (e.g., each time they encrypt a message), and security breaks down if the same string is ever chosen twice. Hence, it is important that the probability of a repeated sample is negligible. In this section we compute the probability of such events and express our findings in a modular way, as a statement about the indistinguishability of two libraries.

Birthday Probabilities

If \(q\) people are in a room, what is the probability that two of them have the same birthday (if we assume that each person’s birthday is uniformly chosen from among the possible days in a year)? This question is known as the birthday problem, and it is famous because the answer is highly unintuitive to most people. \({ }^{9}\)

Let’s make the question more general. Imagine taking \(q\) independent, uniform samples from a set of \(N\) items. What is the probability that the same value gets chosen more than once? In other words, what is the probability that the following program outputs 1 ?

Let’s give a name to this probability: \[\text { BirthdayProb }(q, N) \stackrel{\text { def }}{=} \operatorname{Pr}[\mathcal{B}(q, N) \text { outputs } 1]\] It is possible to write an exact formula for this probability:

Lemma 4.9

\(\operatorname{Birthday} \operatorname{Prob}(q, N)=1-\prod_{i=1}^{q-1}\left(1-\frac{i}{N}\right)\)

Proof

Let us instead compute the probability that \(\mathcal{B}\) outputs 0 , which will allow us to then solve for the probability that it outputs 1 . In order for \(\mathcal{B}\) to output 0 , it must avoid the early termination conditions in each iteration of the main loop. Therefore:

\[\begin{aligned} \operatorname{Pr}[\mathcal{B}(q, N) \text { outputs } 0]=& \operatorname{Pr}[\mathcal{B}(q, N) \text { doesn't terminate early in iteration } i=1] \\ & \cdot \operatorname{Pr}[\mathcal{B}(q, N) \text { doesn't terminate early in iteration } i=2] \\ & \vdots \\ & \cdot \operatorname{Pr}[\mathcal{B}(q, N) \text { doesn't terminate early in iteration } i=q] \end{aligned}\]

In iteration \(i\) of the main loop, there are \(i-1\) previously chosen values \(s_{1}, \ldots, s_{i-1}\). The program terminates early if any of these are chosen again as \(s_{i}\), otherwise it continues to the next iteration. Put differently, there are \(i-1\) (out of \(N\) ) ways to choose \(s_{i}\) that lead to early termination \(-\) all other choices of \(s_{i}\) avoid early termination. Since the \(N\) possibilities for \(s_{i}\) happen with equal probability:

\[\operatorname{Pr}[\mathcal{B}(q, N) \text { doesn't terminate early in iteration } i]=1-\frac{i-1}{N}\]

Putting everything together:

\[\begin{aligned} \operatorname{Birthday} \operatorname{Prob}(q, N) &=\operatorname{Pr}[\mathcal{B}(q, N) \text { outputs } 1] \\ &=1-\operatorname{Pr}[\mathcal{B}(q, N) \text { outputs } 0] \\ &=1-\left(1-\frac{1}{N}\right)\left(1-\frac{2}{N}\right) \cdots\left(1-\frac{q-1}{N}\right) \\ &=1-\prod_{i=1}^{q-1}\left(1-\frac{i}{N}\right) \end{aligned}\] This completes the proof.

Example

This formula for BirthdayProb \((q, N)\) is not easy to understand at a glance. We can get a better sense of its behavior as a function of q by plotting it. Below is a plot with \(N=365\), corresponding to the classic birthday problem:

With only \(q=23\) people the probability of a shared birthday already exceeds 50%. The graph could be extended to the right (all the way to \(q=365\) ), but even at \(q=70\) the probability exceeds \(99.9 \%\).

Asymptotic Bounds on the Birthday Probability

It will be helpful to have an asymptotic formula for how BirthdayProb \((q, N)\) grows as a function of \(q\) and \(N\). We are most interested in the case where \(q\) is relatively small compared to \(N\) (e.g., when \(q\) is a polynomial function of \(\lambda\) but \(N\) is exponential).

Lemma 4.10

If \(q \leqslant \sqrt{2 N}\), then \[0.632 \frac{q(q-1)}{2 N} \leqslant \text { BirthdayProb }(q, N) \leqslant \frac{q(q-1)}{2 N}\] Since the upper and lower bounds differ by only a constant factor, it makes sense to write BirthdayProb \((q, N)=\Theta\left(q^{2} / N\right)\)

Proof

We split the proof into two parts.

To prove the upper bound, we use the fact that when \(x\) and \(y\) are positive, \[\begin{aligned} (1-x)(1-y) &=1-(x+y)+x y \\ & \geqslant 1-(x+y) \end{aligned}\] More generally, when all terms \(x_{i}\) are positive, \(\prod_{i}\left(1-x_{i}\right) \geqslant 1-\sum_{i} x_{i}\). Hence, \[1-\prod_{i}\left(1-x_{i}\right) \leqslant 1-\left(1-\sum_{i} x_{i}\right)=\sum_{i} x_{i}\] Applying that fact, \[\text { BirthdayProb }(q, N) \stackrel{\text { def }}{=} 1-\prod_{i=1}^{q-1}\left(1-\frac{i}{N}\right) \leqslant \sum_{i=1}^{q-1} \frac{i}{N}=\frac{\sum_{i=1}^{q-1} i}{N}=\frac{q(q-1)}{2 N}\]
To prove the lower bound, we use the fact that when \(0 \leqslant x \leqslant 1\), \[1-x \leqslant e^{-x} \leqslant 1-0.632 x\] This fact is illustrated below. The significance of \(0.632\) is that \(1-\frac{1}{e}=0.63212 \ldots\)

Figure \(\PageIndex{1}\): Copy and Paste Caption here. (Copyright; author via source)

We can use both of these upper and lower bounds on \(e^{-x}\) to show the following: \[\prod_{i=1}^{q-1}\left(1-\frac{i}{N}\right) \leqslant \prod_{i=1}^{q-1} e^{-\frac{i}{N}}=e^{-\sum_{i=1}^{q-1} \frac{i}{N}}=e^{-\frac{q(q-1)}{2 N}} \leqslant 1-0.632 \frac{q(q-1)}{2 N} .\] With the last inequality we used the fact that \(q \leqslant \sqrt{2 N}\), and therefore \(\frac{q(q-1)}{2 N} \leqslant 1\) (this is necessary to apply the inequality \(e^{-x} \leqslant 1-0.632 x\) ). Hence: \[\begin{aligned} \text { BirthdayProb }(q, N) \stackrel{\text { def }}{=} 1-\prod_{i=1}^{q-1}\left(1-\frac{i}{N}\right) \\ & \geqslant 1-\left(1-0.632 \frac{q(q-1)}{2 N}\right)=0.632 \frac{q(q-1)}{2 N} \end{aligned}\] This completes the proof.

Example

Below is a plot of these bounds compared to the actual value of BirthdayProb( q, \(N)(\) for \(N=\) \(365)\)

As mentioned previously, BirthdayProb \((q, N)\) grows roughly like \(q^{2} / N\) within the range of values we care about ( \(q\) small relative to \(N)\).

The Birthday Problem in Terms of Indistinguishable Libraries

Below are two libraries which will also be useful for future topics.

Both libraries provide a SAMP subroutine that samples a random element of \(\{0,1\}^{\lambda}\). The implementation in \(\mathcal{L}_{\text {samp-L }}\) samples uniformly and independently from \(\{0,1\}^{\lambda}\) each time. It samples with replacement, so it is possible (although maybe unlikely) for multiple calls to SAMP to return the same value in \(\mathcal{L}_{\text {samp-L. }}\).

On the other hand, \(\mathcal{L}_{\text {samp-R }}\) samples \(\lambda\)-bit strings without replacement. It keeps track of a set \(R\), containing all the values it has previously sampled, and avoids choosing them again (" \(\{0,1\}^{\lambda} \backslash R^{\prime \prime}\) is the set of \(\lambda\)-bit strings excluding the ones in \(R\) ). In this library, SAMP will never output the same value twice.

The "obvious" distinguishing strategy. A natural way (but maybe not the only way) to distinguish these two libraries, therefore, would be to call sAMP many times. If you ever see a repeated output, then you must certainly be linked to \(\mathcal{L}_{\text {samp-L. After some number }}\) of calls to SAMP, if you still don’t see any repeated outputs, you might eventually stop and guess that you are linked to \(\mathcal{L}_{\text {samp-R. }}\).

Let \(\mathcal{A}_{q}\) denote this "obvious" calling program that makes \(q\) calls to sAMP and returns 1 if it sees a repeated value. Clearly, the program can never return 1 when it is linked to \(\mathcal{L}_{\text {samp-R. }}\) On the other hand, when it is linked to \(\mathcal{L}_{\text {samp-L }}\), it returns 1 with probability exactly BirthdayProb \(\left(q, 2^{\lambda}\right)\). Therefore, the advantage of \(\mathcal{A}_{q}\) is exactly BirthdayProb \(\left(q, 2^{\lambda}\right)\).

This program behaves differently in the presence of these two libraries, therefore they are not interchangeable. But are the libraries indistinguishable? We have demonstrated a calling program with advantage BirthdayProb \(\left(q, 2^{\lambda}\right)\). We have not specified \(q\) exactly, but if \(\mathcal{A}_{q}\) is meant to run in polynomial time (as a function of \(\lambda\) ), then \(q\) must be a polynomial function of \(\lambda\). Then the advantage of \(\mathcal{A}_{q}\) is \(\operatorname{BirthdayProb}\left(q, 2^{\lambda}\right)=\Theta\left(q^{2} / 2^{\lambda}\right)\), which is negligible!

To show that the librares are indistinguishable, we have to show that all calling programs have negligible advantage. It is not enough just to show that this particular calling program has negligible advantage. Perhaps surprisingly, the "obvious" calling program that we considered is the best possible distinguisher!

Lemma 4.11 (Repl. Sampling)

Let \(\mathcal{L}_{\mathrm{samp}-\mathrm{L}}\) and \(\mathcal{L}_{\mathrm{samp}-\mathrm{R}}\) be defined as above. Then for all calling programs \(\mathcal{A}\) that make \(q\) queries to the SAMP subroutine, the advantage of \(\mathcal{A}\) in distinguishing the libraries is at most BirthdayProb \(\left(q, 2^{\lambda}\right)\).

In particular, when \(\mathcal{A}\) is polynomial-time (in \(\lambda\) ), \(q\) grows as a polynomial in the security parameter. Hence, \(\mathcal{A}\) has negligible advantage. Since this is true for all polynomial-time \(\mathcal{A}\), we have \(\mathcal{L}_{\mathrm{samp}-\mathrm{L}} \approx \mathcal{L}_{\mathrm{samp}-\mathrm{R} \text {. }}\)

Proof

Consider the following hybrid libraries:

First, let us prove some simple observations about these libraries:

\(\mathcal{L}_{\text {hyb-L }} \equiv \mathcal{L}_{\text {samp-L }}:\)

Note that \(\mathcal{L}_{\text {hyb-L }}\) simply samples uniformly from \(\{0,1\}^{\lambda}\). The extra \(R\) and bad variables in \(\mathcal{L}_{\text {hyb-L }}\) don’t actually have an effect on its external behavior (they are used only for convenience later in the proof).

\(\mathcal{L}_{\text {hyb-R }} \equiv \mathcal{L}_{\text {samp-R }}:\)

Whereas \(\mathcal{L}_{\text {samp-R }}\) avoids repeats by simply sampling from \(\{0,1\}^{\lambda} \backslash R\), this library \(\mathcal{L}_{\text {hyb-R samples } r}\) uniformly from \(\{0,1\}^{\lambda}\) and retries if the result happens to be in \(R\). This method is called rejection sampling, and it has the same effect \({ }^{10}\) as sampling \(r\) directly from \(\{0,1\}^{\lambda} \backslash R\).

Conveniently, \(\mathcal{L}_{\text {hyb-L }}\) and \(\mathcal{L}_{\text {hyb-R }}\) differ only in code that is reachable when bad \(=1\) (highlighted). So, using Lemma 4.8, we can bound the advantage of the calling program:

\[\begin{aligned} \mid \operatorname{Pr}[\mathcal{A}&\left.\diamond \mathcal{L}_{\text {samp-L }} \Rightarrow 1\right]-\operatorname{Pr}\left[\mathcal{A} \diamond \mathcal{L}_{\text {samp-R }} \Rightarrow 1\right] \mid \\ &=\left|\operatorname{Pr}\left[\mathcal{A} \diamond \mathcal{L}_{\text {hyb-L }} \Rightarrow 1\right]-\operatorname{Pr}\left[\mathcal{A} \diamond \mathcal{L}_{\text {hyb-R }} \Rightarrow 1\right]\right| \\ & \leqslant \operatorname{Pr}\left[\mathcal{A} \diamond \mathcal{L}_{\text {hyb-L }} \text { sets bad }:=1\right] \end{aligned}\]

Finally, we can observe that \(\mathcal{A} \diamond \mathcal{L}_{\text {hyb-L }}\) sets bad := 1 only in the event that it sees a repeated sample from \(\{0,1\}^{\lambda}\). This happens with probability BirthdayProb \(\left(q, 2^{\lambda}\right)\).

Discussion

Stating the birthday problem in terms of indistinguishable libraries makes it a useful tool in future security proofs. For example, when proving the security of a construction we can replace a uniform sampling step with a sampling-without-replacement step. This change has only a negligible effect, but now the rest of the proof can take advantage of the fact that samples are never repeated.
Another way to say this is that, when you are thinking about a cryptographic construction, it is "safe to assume" that randomly sampled long strings do not repeat, and behave accordingly.
However, if a security proof does use the indistinguishability of the birthday libraries, it means that the scheme can likely be broken when a user happens to repeat a uniformly sampled value. Since this becomes inevitable as the number of samples approaches \(\sqrt{2^{\lambda+1}} \sim 2^{\lambda / 2}\), it means the scheme only offers \(\lambda / 2\) bits of security. When a scheme has this property, we say that it has birthday bound security. It is important to understand when a scheme has this property, since it informs the size of keys that should be chosen in practice.

A Generalization

A calling program can distinguish between the previous libraries if sAmp ever returns the same value twice. In any given call to SAMP, the variable \(\mathcal{R}\) denotes the set of "problematic" values that cause the libraries to be distinguished. At any point, \(R\) has only polynomially many values, so the probability of chosing such a problematic one is negligible.

Suppose we considered a different set of values to be problematic. As long as there are only polynomially many problematic values in each call to SAMP, the reasoning behind the proof wouldn’t change much. This idea leads to the following generalization, in which the calling program explicitly writes down all of the problematic values:

Lemma 4.12

The following two libraries are indistinguishable, provided that the argument \(\mathcal{R}\) to SAMP is passed as an explicit list of items.

Suppose the calling program makes \(q\) calls to SAMP, and in the \(i\) th call it uses an argument \(\mathcal{R}\) with \(n_{i}\) items. Then the advantage of the calling program is at most: \[1-\prod_{i=1}^{q}\left(1-\frac{n_{i}}{2^{\lambda}}\right)\] We can bound this advantage as before. If \(\sum_{i=1}^{q} n_{i} \leqslant 2^{\lambda}\), then the advantage is between \(0.632\left(\sum_{i=1}^{q} n_{i}\right) / 2^{\lambda}\) and \(\left(\sum_{i=1}^{q} n_{i}\right) / 2^{\lambda}\). When the calling program runs in polynomial time and must pass \(\mathcal{R}\) as an explicit list (i.e., take the time to "write down" the elements of \(\mathcal{R}\) ), \(\sum_{i=1}^{q} n_{i}\) is a polynomial in the security parameter and the calling program’s advantage is negligible.

The birthday scenario corresponds to the special case where \(n_{i}=i-1\) (in the \(i\) th call, \(R\) consists of the \(i-1\) results from previous calls to SAMP). In that case, \(\sum_{i=1}^{q} n_{i}=q(q-1) / 2\) and the probabilities collapse to the familiar birthday probabilities.

\({ }^{9}\) It is sometimes called the "birthday paradox," even though it is not really a paradox. The actual birthday paradox is that the "birthday paradox" is not a paradox.

\({ }^{10}\) The two approaches for sampling from \(\{0,1\}^{\lambda} \backslash R\) may have different running times, but our model considers only the input-output behavior of the library.