# 11.1: Security Properties for Hash Functions

- Page ID
- 7390

Superficially, it seems like we have already given the formal definition of security: A hash function *H* is collision-resistant if no polynomial-time algorithm can output a collision under *H*. Unfortunately, this definition is impossible to achieve!

Fix your favorite hash function *H*. We argued that collisions in *H*definitely exist in a mathematical sense. Let *x,x’* be one such collision, and consider the adversary ? that has *x,x’* hard-coded and simply outputs them. This adversary runs in constant time and finds a collision in *H*. Of course, even though ? exists in a mathematical sense, it might be hard to *write down* the code of such an ? given *H*. But (and this is a subtle technical point!) security definitions consider only the running time of ?, and not the effort that goes into *finding the source code* of ?.^{2}

The way around this problem is to introduce some random choice made by the user of the hash function, who wants collisions to be hard to find. A **hash function family** is a set ℋ of functions, where each function *H* ∈ ℋ is a hash function with the same output length. We will require that collisions are hard to find, in a hash function *chosen randomly from the family*. This is enough to foil the hard-coded-collision distinguisher mentioned above. Think of a hash function family as having exponentially many functions in it — then no polynomial-time program can have a hard-coded collision for *all* of them.

Now the difficulty of finding collisions rests in the random choice of functions. An adversary can know every fact about ℋ, it just doesn’t know which *H* ∈ ℋ it is going to be challenged on to find a collision. It’s similar to how the security of other cryptographic schemes rests in the random choice of key. But in this case there is no secrecy involved, only unpredictability. The choice of *H* is made public to the adversary.

Note also that this definition is a mismatch to the way hash functions are typically used in practice. There, we usually do have a *single* hash function that we rely on and standardize. While it is possible to adapt the definitions and results in this lecture to the setting of fixed hash functions, it’s simpler to consider hash function *families*. If you’re having trouble connecting the idea of a hash function *family* to reality, imagine taking a standardized hash function like MD5 or SHA3 and considering the family of functions you get by varying the initialization parameters in those standards.

**Towards the Formal Definition**

The straight-forward way to define collision resistance of a hash family ℋ is to say that the following two libraries should be indistinguishable:

Indeed, this is a fine definition of collision resistance, and it follows in the style of previous security definitions. The two libraries give different outputs only when called on (*x,x’*) where *x* ≠ *x’* but *H*(*x*) = *H*(*x’*) — *i.e.*, an *H*-collision.

However, it turns out to not be a particularly convenient definition to *use* when proving security of a construction that involves a hash function as a building block. To see why, think back to the security of MACs. The difference between the two ℒ_{mac-*} libraries is in the verification subroutine VER. And indeed, constructions that use MACs as a building block often perform MAC verification. The libraries — in particular, their VER subroutine — are a good fit for how MACs are used.

On the other hand, cryptographic constructions don’t usually *explicitly test for collisions* when using a hash function. Rather, they typically compute the hash of some value and move on with their business *under the assumption that a collision has never been encountered*. To model this in a security definition, we use the approach below:

Definition \(\PageIndex{1}\)

*Let* ℋ *be a family of hash functions. Then* ℋ *is collision-resistant if* ℒ

^{ℋ}

_{cr-real}≋ ℒ

^{ℋ}

_{cr-fake},

*where:*

**Discussion**

- The two libraries have identical behavior, except in the event that ℒ
_{cr-fake}triggers a “**self destruct**” statement. Think of this statement as an exception that kills the entire program, including the distinguisher. In the case that the distinguisher is killed in this way, we take its output to be 0. Suppose a distinguisher*A*always outputs 1 under normal, explosion-free execution. Since an explosion happens only in ℒ_{cr-fake},*A*’s advantage is simply the probability of an explosion. So if the two libraries are supposed to be indistinguishable, then it must be that**self destruct**happens with only negligible probability. - In ℒ
_{cr-fake}we have given the associative array “*H*^{−1}” a somewhat suggestive name. During normal operation,*H*^{−1}[*y*] contains the*unique*value seen by the library whose hash is*y*. A collision happens when the library has seen two distinct values*x*and*x’*that hash to the same*y*. When the library sees the second of these values*x’*, it computes*H*(*x’*) =*y*and discovers that*H*^{−1}[*y*] already exists but is not equal to*x’*. This is the situation in which the library**self destructs**. - Why do we make the library
**self destruct**when it sees a collision, rather than just returning some error indicator? The reason is simply that it’s easier to just*assume*there is no collision than to check the return value of HASH after each call.

Think of ℒ_{cr-real} as a world in which you take hashes of things but you might see a collision and never notice. Then ℒ_{cr-fake} is a kind of thought- experiment in which some all-seeing judge ends the game immediately if there was ever a collision among the values that you hashed. The judge’s presence simplifies things a bit. There’s no real need to explicitly check for collisions yourself; you can just go about your business knowing that as long as the game is still going, the hash function is injective among all the values you’ve seen.

- The libraries have no secrets!
*H*is public (the adversary can freely obtain it through GETH). However, note that the library — not the adversary — chooses*H*. This random choice of*H*is the sole source of security.

Since *H* is public, the adversary doesn’t really need to call the subroutine HASH(*x*) to compute *H*(*x*) — he could compute *H*(*x*) locally. Intuitively, the library is used in a security proof to model the actions of the “good guys” who are operating on the assumption that *H* is collision-resistant. If an adversary finds a collision but never causes the “good guys” to evaluate *H* on it, then the adversary never violates their security assumption.

**Other variants of collision-resistance**

There are some other variants of collision-resistance that are often discussed in practice. We don’t define them formally, but give the rough idea:

**Target Collision Resistance.** Given *H* and *H*(*x*), where *H* ← ℋ and *x* ← {0,1}^{ℓ} are chosen randomly, it should be infeasible to compute a value *x’* (possibly equal to *x*) such that *H*(*x*) = *H*(*x’*).

**Second preimage resistance.** Given *H* and *x*, where *H* ← ℋ and *x* ← {0,1}* ^{ℓ}* are chosen randomly, it should be infeasible to compute a value

*x’*≠

*x*such that

*H*(

*x*) =

*H*(

*x’*).

These conditions are weaker than the condition of (plain) collision-resistance, in the sense that if ℋ is collision-resistant, then ℋ is also target collision-resistant and secondpreimage resistant. Hence, we focus on plain collision resistance in this course.

^{2}The reason we don’t define security this way is that as soon as someone does find the code of such an ?, the hash function *H* is “broken” forever. Nothing anyone does (like choosing a new key) can salvage it.