# 4.4: Sampling with Replacement and the Birthday Bound

- Page ID
- 7337

Below is an example of two libraries which are not *interchangeable* (they do have mathematically different behavior), but are *indistinguishable*. These two libraries happen to be convenient for later topics, as well.

**Lemma 4.7: Replacement Sampling**

*Define two libraries as below*

*Then for all calling programs* ? *that make q queries to the SAMP subroutine, the advantage of* ? *in distinguishing the libraries is at most q*(*q* − 1)/2^{λ+1}.

* In particular, when* ? *is polynomial-time* (*in λ*), *then q grows as a polynomial in the security parameter. Hence,*? *has negligible advantage, and* ℒ^{samp-L} ≋ ℒ^{samp-R}.

ℒ^{samp-L} uniformly samples *λ*-bit strings with replacement (*i.e.*, allowing for duplicates). In ℒ^{samp-R}, we’ve initialized the set *R* outside of any subroutine, by which we mean that *R* is a static (its value is maintained across different calls to SAMP) and *private* (its value is not directly accessible to the calling program). ℒ_{samp-R} uses the set *R*to keep track of which values have been previously sampled, to sample *without* replacement and ensure that there are no duplicate outputs.

The overall idea here is that the only way to distinguish the two libraries *seems to be* to wait for a repeated output of the SAMP algorithm. But when sampling uniformly from a huge set with *2 ^{λ}* items, it is very unlikely (but not impossible!) to ever see a repeated value unless one asks for a huge number of samples. For that reason, the two libraries should be indistinguishable. This is an accurate intuition, but it’s not a formal proof. Among other things, we need to determine just how unlikely repeated outputs are, and argue that the distinguishing advantage is related to the probability of repeated outputs.

**Proof:**

Consider the following hybrid libraries:

First, let us prove some simple observations about these libraries:

ℒ_{hyb-L} ≡ ℒ_{samp-L}: Note that ℒ_{hyb-L} simply samples uniformly from {0,1}^{λ}. The extra *R* and bad variables in ℒ_{hyb-L}don’t actually have an effect on its external behavior (they are used only for convenience later in the proof).

ℒ_{hyb-R} ≡ ℒ_{samp-R}: Whereas ℒ_{samp-R} avoids repeats by simply sampling from {0,1}^{λ}\ *R*, this library ℒ_{hyb-R} samples *r *uniformly from {0,1}^{λ} and retries if the result happens to be in *R*. This method is called *rejection sampling*, and it has the same effect^{7} as sampling *r* directly from {0,1}^{λ}\ *R*.

We can also see that ℒ_{hyb-L} and ℒ_{hyb-R} differ only in code that is reachable only when bad = 1 (highlighted). So, using __ Lemma 4.6__, we can bound the advantage of the calling program:

It suffices to show that in ? ◊ ℒ_{hyb-L} the variable bad is set 1 only with negligible probability. It turns out to be easier to reason about the complement event, and argue that bad remains 0 with probability negligibly close to 1.

Suppose ? makes *q* calls to the SAMP subroutine. Let *r*_{1},. . . ,*r*_{q} be the responses of these calls. For variable bad to remain 0, it must be the case that for each *i*, we have *r*_{i}∉ {*r _{1}*,. . . ,

*r*}. Then,

_{i-1}Pr[bad remains 0 in ? ◊ ℒ_{hyb-L}]

For the inequality we used the convenient fact that when *x* and *y* are positive, we have (1 − *x*)(1 − *y*) = 1 − (*x* + *y*) + *xy* ⩾ 1 − (*x* + *y*). More generally, when all terms *x _{i}* are positive, ∏

_{i}(1 −

*x*) ⩾ 1 − ∑

_{i}_{i}

*x*.

_{i}Summing up, the advantage of ? in distinguishing ℒ_{samp-L} and ℒ_{samp-R} is:

When ? is a polynomial-time distinguisher, *q* is a polynomial function of *λ*, so ?’s advantage is negligible. This shows that ℒ_{samp-L} ≋ ℒ_{samp-R}.

**Birthday Bound**

As part of the previous proof, we showed that the probability of repeating an item while taking *q* uniform samples from {0,1}^{λ} is *O*(q^{2}/2^{λ}). This is an upper bound, but we can come up with a lower bound as well:

**Lemma 4.8: Birthday Bound**

*When taking q uniform samples from *{0,1}* ^{λ}, where q* ⩽ √(2

^{λ + 1}),

*the probability of encountering a repeated value is at least*0.632 * (

^{q(q−1)}/

_{2λ+1}).

*In particular, when q*= √(2

^{λ+1}),

*a repeated value is encountered with probability at least 1/2.*

**Proof:**

We use the fact that for *x* ∈ [0,1], we have 1 − *x* ⩽ *e*^{−x} ⩽ 1 − (0.632…)*x*, where the constant 0.632 … is 1 −(^{1}/_{e}).

Let *r _{1}*,… ,

*r*be the values uniformly sampled from {0,1}

_{q}^{λ}. As above, we

*fail*to encounter a repeated value if the

*r*’s are all distinct. Applying the bounds from above, we have:

_{i}The second inequality follows from the fact that *q*(*q* − 1)/2^{λ+1} < 1 by our bound on *q*. So the probability of a repeated value among the *r _{i}*’s is

More generally, when sampling uniformly from a set of *N* items, taking √(2*N*) samples is enough to ensure a repeated value with probability -0.63. The bound gets its name from considering *q* = √(2 · 365) ≈ 27 people in a room, and assuming that the distribution of birthdays is uniform across the calendar. With probability at least 0.63, two people will share a birthday. This counterintuitive fact is often referred to as the *birthday paradox*, although it’s not an actual paradox.^{8}

In the context of cryptography, we often design schemes in which the users take many uniform samples from {0,1}^{λ}. Security often breaks down when the same value is sampled twice, and this happens after about √(2^{λ+ 1}) ∼ 2^{λ/2} steps.

^{7}The two approaches for sampling from {0,1}^{λ}\ *R* may have different running times, but our model considers only the input-output behavior of the library.

^{8}I think the actual birthday paradox is that the “birthday paradox” is not a paradox.