# 11.1: Security Properties for Hash Functions


There are two common security properties of hash functions:

**Collision resistance.** It should be hard to compute any collision \(x \neq x^{\prime}\) such that \(H(x) = H\left(x^{\prime}\right)\).

**Second-preimage resistance.** Given \(x\), it should be hard to compute any collision involving \(x\). In other words, it should be hard to compute \(x^{\prime} \neq x\) such that \(H(x) = H\left(x^{\prime}\right)\).

## Brute Force Attacks on Hash Functions

There is an important difference between collision resistance and second-preimage resistance, which is reflected in the difficulty of their respective brute force attacks. Suppose \(H\) is a hash function whose outputs are \(n\) bits long. Let’s make a simplifying assumption: for any \(m > n\), the distribution of \(H(x)\) for uniformly chosen \(x \leftarrow \{0,1\}^{m}\) is roughly uniform over \(\{0,1\}^{n}\).

This is quite a realistic assumption for practical hash functions. If this were not true, then \(H\) would introduce some bias towards some outputs and away from other outputs, which would be perceived as suspicious. Also, as the output of \(H\) deviates farther from a uniform distribution, it only makes finding collisions easier.

Below are straightforward brute-force attacks for collision resistance and for second-preimage resistance:
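The two brute-force attacks can be sketched in code. This is a hedged illustration, not the book's exact pseudocode: the function name `H`, the toy 16-bit output (obtained by truncating SHA-256), and the way fresh inputs are generated are all choices made here so the attacks finish quickly.

```python
import hashlib

N_BITS = 16  # toy output length n; real hash functions use 256+ bits

def H(x: bytes) -> bytes:
    # Truncated SHA-256 as a stand-in hash with a tiny n-bit output.
    return hashlib.sha256(x).digest()[: N_BITS // 8]

def brute_force_collision():
    """A_cr: hash fresh inputs until any two of them collide (birthday attack)."""
    seen = {}  # maps output y -> the input x that produced it
    i = 0
    while True:
        x = i.to_bytes(8, "big")
        y = H(x)
        if y in seen:
            return seen[y], x  # a collision: x != x' with H(x) = H(x')
        seen[y] = x
        i += 1

def brute_force_second_preimage(x: bytes):
    """A_2pi: given x, hash fresh inputs until one matches the target y = H(x)."""
    y = H(x)
    i = 0
    while True:
        xp = b"sp-" + i.to_bytes(8, "big")  # distinct from x by construction
        if H(xp) == y:
            return xp
        i += 1
```

With a 16-bit output, the collision search succeeds after roughly \(2^{8}\) hashes, while the second-preimage search needs roughly \(2^{16}\), matching the analysis below.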

Under the simplifying assumption on \(H\), the collision-resistance brute force attack \(\mathcal{A}_{\mathrm{cr}}\) is essentially choosing each \(y_{i}\) uniformly at random. Since each \(y_{i} \in\{0,1\}^{n}\), the probability of finding a repeated value after \(q\) times through the main loop is roughly \(\Theta\left(q^{2} / 2^{n}\right)\) by the birthday bound. While in the **worst case** it could take \(2^{n}\) steps to find a collision in \(H\), the birthday bound implies that it takes only \(2^{n / 2}\) attempts to find a collision with \(99 \%\) probability (or any constant probability).

On the other hand, the second-preimage brute force attack \(\mathcal{A}_{\text{2pi}}\) is given \(x\) as input and (under our simplifying assumption on \(H\)) essentially samples outputs \(y^{\prime}\) uniformly at random until the target value \(y = H(x)\) appears. It will therefore take \(\Theta\left(2^{n}\right)\) attempts in expectation to terminate successfully.\({ }^{1}\)

There is a fundamental difference in how hard it is to break collision resistance and second-preimage resistance. Breaking collision-resistance is like inviting more people into the room until the room contains 2 people with the same birthday. Breaking second-preimage resistance is like inviting more people into the room until the room contains another person with your birthday. One of these fundamentally takes longer than the other.

This difference explains why you will typically see cryptographic hash functions in practice that have 256- to 512-bit output length (but not 128-bit output length), while you only typically see block ciphers with 128-bit or 256-bit keys. In order to make brute force attacks cost \(2^{n}\), a block cipher needs only an \(n\)-bit key while a collision-resistant hash function needs a \(2n\)-bit output.

These attacks can also be discussed in terms of graphs, where the number of edges counts the "number of chances" to get a collision. The collision-resistance brute-force attack corresponds to a complete graph: it needs only \(\sqrt{N}\) vertices to have \(N\) edges (chances for a collision). The second-preimage brute-force attack corresponds to a star graph: it needs \(N\) vertices to have \(N\) edges. This view generalizes, e.g., to a complete bipartite graph between \(\sqrt{N}+\sqrt{N}\) vertices.

## Hash Function Security In Practice

We will focus on developing a formal definition for collision resistance. We can take some inspiration from the security definition for MACs. Security for a MAC means that it should be hard to produce a forgery. The MAC security definition formalized that idea with one library that checks for a forgery and another library that assumes a forgery is impossible. If the two libraries are indistinguishable, then it must be hard to find a forgery.

We can take a similar approach to say that it should be hard to produce a collision. Here is an attempt:
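The attempted pair of libraries is not reproduced here, but a plausible sketch of the idea follows; the class and method names are illustrative, not the book's exact notation. The "real" library reports whether the caller submitted a genuine collision, while the "fake" library behaves as if collisions are impossible.

```python
import hashlib

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

class LibCrReal:
    def test(self, x: bytes, xp: bytes) -> bool:
        # True exactly when the caller has found a genuine collision.
        return x != xp and H(x) == H(xp)

class LibCrFake:
    def test(self, x: bytes, xp: bytes) -> bool:
        # Pretends that collisions never happen.
        return False
```

If no polynomial-time caller can distinguish the two, then finding a collision must be hard; that is the intuition the attempt tries to capture.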

This corresponds to what I would call the "folk definition" of collision resistance. It makes intuitive sense (as long as you’re comfortable with our style of security definition), but unfortunately the definition suffers from a very subtle technical problem.

Because of Kerckhoffs’ principle, we allow calling programs to depend arbitrarily on the source code of the two libraries. This is a way of formalizing the idea that "the attacker knows everything about the algorithms." Our security definitions restrict calling programs to be polynomial-time algorithms, but they never consider *the effort that goes into finding the source code of the calling program!*

This strange loophole leads to the following valid attack. When we consider the security of some function \(H\), we know that there exist many collisions \(\left(x, x^{\prime}\right)\) in \(H\): since \(H\) maps a large input space onto only \(2^{n}\) outputs, the pigeonhole principle guarantees that some outputs repeat. These collisions may be hard to find, but they certainly exist. With exponential time, we could find such an \(\left(x, x^{\prime}\right)\) pair and write down the code of an attacker:

Here, the values \(x\) and \(x^{\prime}\) are hard-coded into \(\mathcal{A}\). The algorithm \(\mathcal{A}\) is clearly polynomial-time (in fact, constant-time). The "loophole" is that the definition considers only the cost of running the algorithm \(\mathcal{A}\), and not the cost of finding the source code of \(\mathcal{A}\).

The way this kind of situation is avoided in other security definitions is that the libraries have some secret randomness. While the attacker is allowed to depend arbitrarily on the *source code* of the libraries, it is not allowed to depend on the *choice of outcomes* for random events in the libraries, like sampling a secret key. Since the calling program can’t "prepare" for the random choice that it will be faced with, we don’t have such trivial attacks. On the other hand, these two libraries for collision resistance are totally deterministic. There are no "surprises" about which function \(H\) the calling program will be asked to compute a collision for, so there is nothing to prevent a calling program from being "prepared" with a pre-computed collision in \(H\).
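The loophole can be demonstrated concretely. In the sketch below (an illustration, with a toy 16-bit hash chosen so the preparation step is fast), the expensive collision search happens offline; the attacker program itself is constant-time, as if the collision pair had been pasted into its source code as literals.

```python
import hashlib

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()[:2]  # toy 16-bit output, for speed

def find_collision():
    """The (conceptually exponential-time) preparation, done once, offline."""
    seen = {}
    i = 0
    while True:
        x = i.to_bytes(8, "big")
        y = H(x)
        if y in seen:
            return seen[y], x
        seen[y] = x
        i += 1

# Imagine these two values written as literals in the attacker's source:
X, XP = find_collision()

def attacker(test) -> bool:
    # Constant-time: just submits the hard-coded collision to the library.
    return test(X, XP)
```

Running `attacker` against a "real" test returns true and against a "fake" test returns false, so it distinguishes the two deterministic libraries with certainty, yet its own running time is trivial.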

## Hash Function Security In Theory

The way around this technical issue is to introduce some randomness into the libraries and into the inputs of \(H\). We define hash functions to take two arguments: a randomly chosen, public value \(s\) called a **salt**, and an adversarially chosen input \(x\).

A hash function \(H\) is **collision-resistant** if \(\mathcal{L}_{\text{cr-real}}^{H} \approx \mathcal{L}_{\text{cr-fake}}^{H}\), where:

The library initially samples the salt \(s\). Unlike in other libraries, this value \(s\) is meant to be provided to the calling program, and so the library provides a way (GETSALT) for the calling program to learn it. The calling program then attempts to find a collision \(x \neq x^{\prime}\) where \(H(s, x) = H\left(s, x^{\prime}\right)\).
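A hedged sketch of these salted libraries follows. The class names, the salt length `LAMBDA`, and the modeling of \(H(s, x)\) as SHA-256 over \(s \| x\) are all choices made for this illustration.

```python
import hashlib
import secrets

LAMBDA = 16  # salt length in bytes; a stand-in for the security parameter

def H(s: bytes, x: bytes) -> bytes:
    return hashlib.sha256(s + x).digest()  # salted hash, modeled as H(s || x)

class LibCrReal:
    def __init__(self):
        self.s = secrets.token_bytes(LAMBDA)  # salt sampled once, up front

    def getsalt(self) -> bytes:
        return self.s  # the salt is public: the caller may learn it

    def test(self, x: bytes, xp: bytes) -> bool:
        # True exactly when the caller found a collision under H(s, .)
        return x != xp and H(self.s, x) == H(self.s, xp)

class LibCrFake:
    def __init__(self):
        self.s = secrets.token_bytes(LAMBDA)

    def getsalt(self) -> bytes:
        return self.s

    def test(self, x: bytes, xp: bytes) -> bool:
        return False  # behaves as if collisions under H(s, .) never occur
```

Because the salt is sampled after the attacker's code is fixed, a pre-computed collision for one particular salted function is useless against a fresh salt.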

I don’t know why the term "salt" is used with hash functions. The reason appears to be a mystery to the Internet.\({ }^{2}\) Think of salt as an extra value that **"personalizes" the hash function** for a given application. Here is a good analogy: an encryption scheme can be thought of as a different encryption algorithm \(\operatorname{Enc}(k, \cdot)\) for each choice of key \(k\). When I choose a random \(k\), I get a personalized encryption algorithm \(\operatorname{Enc}(k, \cdot)\) that is unrelated to the algorithm \(\operatorname{Enc}\left(k^{\prime}, \cdot\right)\) that someone else would get when they choose their own \(k\). When I choose a salt \(s\), I get a personalized hash function \(H(s, \cdot)\) that is unrelated to other \(H\left(s^{\prime}, \cdot\right)\) functions. Because the salt is chosen uniformly from \(\{0,1\}^{\lambda}\), a calling program cannot predict what salt (which personalized hash function) it will be challenged with.

Definition \(11.1\) is a valid definition for collision resistance, free of strange loopholes like the "folklore" definition. However, it is not a particularly *useful* definition to use in security proofs, when a hash function is used as a building block in a bigger system.

It becomes cumbersome to use in those cases, because when you use a hash function, you typically don’t explicitly check whether you’ve seen a collision. Instead, you simply proceed as if collisions are not going to happen.

In this chapter, we won’t see provable statements of security referring to this definition.

## Salts in Practice

When we define hash functions in theory, we require that the hash function accept two inputs, the first of which is interpreted as a salt. The hash functions that you see in practice have only one input, a string of arbitrary length. You can simulate the effect of a salt for such a hash function by simply concatenating the two inputs, e.g., \(H(s \| x)\) instead of \(H(s, x)\).
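The concatenation trick can be sketched in a few lines, here using SHA-256 as the one-input hash:

```python
import hashlib
import secrets

def salted_hash(s: bytes, x: bytes) -> bytes:
    # Simulate the two-input H(s, x) with a one-input hash: H(s || x).
    return hashlib.sha256(s + x).digest()

# Two different salts "personalize" the function: the same input
# produces unrelated outputs under each salt.
s1, s2 = secrets.token_bytes(16), secrets.token_bytes(16)
h1, h2 = salted_hash(s1, b"hello"), salted_hash(s2, b"hello")
```

Using a fixed-length salt (16 bytes here) keeps the split point of \(s \| x\) unambiguous, so two different \((s, x)\) pairs cannot concatenate to the same string.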

The concept of a **salted hash** is not just useful to make a coherent security definition, it is also just good practice. Hash functions are commonly used to store passwords. A server may store user records of the form (username, \(h = H(\textit{password})\)). When a user attempts to login with password \(p^{\prime}\), the server computes \(H\left(p^{\prime}\right)\) and compares it to \(h\). Storing hashed passwords means that, in the event that the password file is stolen, an attacker would need to find a preimage of \(h\) in order to impersonate the user.

Best practice is to use a separate salt for each user. Instead of storing (username, \(H(\textit{password})\)), choose a random salt \(s\) for each user and store (username, \(s\), \(H(s, \textit{password})\)). The security properties of a hash function do not require \(s\) to be secret, although there is also no good reason to broadcast a user’s salt publicly. The salt is only needed by the server, when it verifies a password during a login attempt.
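A minimal sketch of this per-user storage scheme follows. Note that real systems should use a deliberately slow password hash (e.g., Argon2 or bcrypt) rather than plain SHA-256; SHA-256 appears here only to keep the sketch standard-library-only, and the record layout is an assumption of this illustration.

```python
import hashlib
import secrets

records = {}  # username -> (salt, hash); stands in for the server's database

def register(username: str, password: str) -> None:
    s = secrets.token_bytes(16)  # fresh random salt for this user
    h = hashlib.sha256(s + password.encode()).digest()
    records[username] = (s, h)   # store (username, s, H(s, password))

def login(username: str, password: str) -> bool:
    s, h = records[username]     # only the server ever needs the salt
    guess = hashlib.sha256(s + password.encode()).digest()
    return secrets.compare_digest(guess, h)  # timing-safe comparison
```

The salt is stored alongside the hash in plaintext; its job is not secrecy but personalization, as the next list explains.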

A user-specific salt means that each user gets their own "personalized" hash function to store their password. Salts offer the following benefits:

- Without salts, it would be evident when two users have the same password: they would have the same password hashes. The same password hashed with different salts will result in unrelated hash outputs.
- An attacker can compute a dictionary of \((p, H(p))\) for common passwords. Without salts, this dictionary makes it easy to attack all users at once, since all users are using the same hash function. With salts, each user has a personalized hash function, each of which would require its own dictionary. Salt makes an attacker’s effort scale with the number of victims.
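Both benefits can be demonstrated directly (a sketch using SHA-256 as the hash, with hypothetical users "alice" and "bob"):

```python
import hashlib
import secrets

def H(s: bytes, p: bytes) -> bytes:
    return hashlib.sha256(s + p).digest()  # salted hash, H(s || p)

# Two users who happen to choose the same weak password:
s_alice, s_bob = secrets.token_bytes(16), secrets.token_bytes(16)
h_alice, h_bob = H(s_alice, b"123456"), H(s_bob, b"123456")
# h_alice != h_bob: the shared password is no longer evident from the hashes.

# An attacker's precomputed dictionary of *unsalted* hashes of common passwords:
dictionary = {hashlib.sha256(p).digest(): p for p in [b"123456", b"password"]}
# The dictionary does not match any salted record; the attacker would need a
# separate dictionary per salt, i.e., per victim.
```

This is exactly the scaling argument above: one dictionary attacks all unsalted users at once, while salts force per-user effort.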

\({ }^{1}\) A well-known and useful fact from probability theory is that if an event happens with probability \(p\), then the expected number of times to repeat before seeing the event is \(1 / p\). For example, the probability of rolling a 1 on a D6 die is \(1 / 6\), so it takes 6 rolls in expectation before seeing a 1. The probability of sampling a particular \(y\) from \(\{0,1\}^{n}\) in one try is \(1 / 2^{n}\), so the expected number of trials before seeing \(y\) is \(2^{n}\).

\({ }^{2}\) If you have an additional random argument to a hash function, but you keep it secret, it is called a "pepper." I’m serious, this is a real thing.