# 4.4: Sampling with Replacement and the Birthday Bound

$$\newcommand{\vecs}{\overset { \rightharpoonup} {\mathbf{#1}} }$$ $$\newcommand{\vecd}{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}}$$$$\newcommand{\id}{\mathrm{id}}$$ $$\newcommand{\Span}{\mathrm{span}}$$ $$\newcommand{\kernel}{\mathrm{null}\,}$$ $$\newcommand{\range}{\mathrm{range}\,}$$ $$\newcommand{\RealPart}{\mathrm{Re}}$$ $$\newcommand{\ImaginaryPart}{\mathrm{Im}}$$ $$\newcommand{\Argument}{\mathrm{Arg}}$$ $$\newcommand{\norm}{\| #1 \|}$$ $$\newcommand{\inner}{\langle #1, #2 \rangle}$$ $$\newcommand{\Span}{\mathrm{span}}$$ $$\newcommand{\id}{\mathrm{id}}$$ $$\newcommand{\Span}{\mathrm{span}}$$ $$\newcommand{\kernel}{\mathrm{null}\,}$$ $$\newcommand{\range}{\mathrm{range}\,}$$ $$\newcommand{\RealPart}{\mathrm{Re}}$$ $$\newcommand{\ImaginaryPart}{\mathrm{Im}}$$ $$\newcommand{\Argument}{\mathrm{Arg}}$$ $$\newcommand{\norm}{\| #1 \|}$$ $$\newcommand{\inner}{\langle #1, #2 \rangle}$$ $$\newcommand{\Span}{\mathrm{span}}$$

Below is an example of two libraries which are not interchangeable (they do have mathematically different behavior), but are indistinguishable. These two libraries happen to be convenient for later topics, as well.

Lemma 4.7: Replacement Sampling

Define two libraries as below Then for all calling programs ? that make q queries to the SAMP subroutine, the advantage of ? in distinguishing the libraries is at most q(q − 1)/2λ+1.

In particular, when ? is polynomial-time (in λ), then q grows as a polynomial in the security parameter. Hence,? has negligible advantage, andsamp-L ≋ ℒsamp-R.

samp-L uniformly samples λ-bit strings with replacement (i.e., allowing for duplicates). In ℒsamp-R, we’ve initialized the set R outside of any subroutine, by which we mean that R is a static (its value is maintained across different calls to SAMP) and private (its value is not directly accessible to the calling program). ℒsamp-R uses the set Rto keep track of which values have been previously sampled, to sample without replacement and ensure that there are no duplicate outputs.

The overall idea here is that the only way to distinguish the two libraries seems to be to wait for a repeated output of the SAMP algorithm. But when sampling uniformly from a huge set with 2λ items, it is very unlikely (but not impossible!) to ever see a repeated value unless one asks for a huge number of samples. For that reason, the two libraries should be indistinguishable. This is an accurate intuition, but it’s not a formal proof. Among other things, we need to determine just how unlikely repeated outputs are, and argue that the distinguishing advantage is related to the probability of repeated outputs.

Proof:

Consider the following hybrid libraries: First, let us prove some simple observations about these libraries:

hyb-L ≡ ℒsamp-L: Note that ℒhyb-L simply samples uniformly from {0,1}λ. The extra R and bad variables in ℒhyb-Ldon’t actually have an effect on its external behavior (they are used only for convenience later in the proof).

hyb-R ≡ ℒsamp-R: Whereas ℒsamp-R avoids repeats by simply sampling from {0,1}λ\ R, this library ℒhyb-R samples r uniformly from {0,1}λ and retries if the result happens to be in R. This method is called rejection sampling, and it has the same effect7 as sampling r directly from {0,1}λ\ R.

We can also see that ℒhyb-L and ℒhyb-R differ only in code that is reachable only when bad = 1 (highlighted). So, using Lemma 4.6, we can bound the advantage of the calling program: It suffices to show that in ? ◊ ℒhyb-L the variable bad is set 1 only with negligible probability. It turns out to be easier to reason about the complement event, and argue that bad remains 0 with probability negligibly close to 1.

Suppose ? makes q calls to the SAMP subroutine. Let r1,. . . ,rq be the responses of these calls. For variable bad to remain 0, it must be the case that for each i, we have ri∉ {r1,. . . ,ri-1}. Then,

Pr[bad remains 0 in ? ◊ ℒhyb-L]  For the inequality we used the convenient fact that when x and y are positive, we have (1 − x)(1 − y) = 1 − (x + y) + xy ⩾ 1 − (x + y). More generally, when all terms xi are positive, ∏i (1 − xi) ⩾ 1 − ∑i xi.

Summing up, the advantage of ? in distinguishing ℒsamp-L and ℒsamp-R is: When ? is a polynomial-time distinguisher, q is a polynomial function of λ, so ?’s advantage is negligible. This shows that ℒsamp-L ≋ ℒsamp-R.

Birthday Bound

As part of the previous proof, we showed that the probability of repeating an item while taking q uniform samples from {0,1}λ is O(q2/2λ). This is an upper bound, but we can come up with a lower bound as well:

Lemma 4.8: Birthday Bound

When taking q uniform samples from {0,1}λ, where q ⩽ √(2λ + 1), the probability of encountering a repeated value is at least 0.632 * (q(q−1)/2λ+1). In particular, when q = √(2λ+1), a repeated value is encountered with probability at least 1/2.

Proof:

We use the fact that for x ∈ [0,1], we have 1 − xe−x ⩽ 1 − (0.632…)x, where the constant 0.632 … is 1 −(1/e).

Let r1,… ,rq be the values uniformly sampled from {0,1}λ. As above, we fail to encounter a repeated value if the ri’s are all distinct. Applying the bounds from above, we have: The second inequality follows from the fact that q(q − 1)/2λ+1 < 1 by our bound on q. So the probability of a repeated value among the ri’s is More generally, when sampling uniformly from a set of N items, taking √(2N) samples is enough to ensure a repeated value with probability -0.63. The bound gets its name from considering q = √(2 · 365) ≈ 27 people in a room, and assuming that the distribution of birthdays is uniform across the calendar. With probability at least 0.63, two people will share a birthday. This counterintuitive fact is often referred to as the birthday paradox, although it’s not an actual paradox.8

In the context of cryptography, we often design schemes in which the users take many uniform samples from {0,1}λ. Security often breaks down when the same value is sampled twice, and this happens after about √(2λ+ 1) ∼ 2λ/2 steps.

7The two approaches for sampling from {0,1}λ\ R may have different running times, but our model considers only the input-output behavior of the library.