# 11.3: Hash Functions vs. MACs: Length-Extension Attacks

- Page ID
- 7392

Constructing a hash function seems like a challenging task, especially given that it must accept strings of arbitrary length as input. In this section, we’ll see one approach for constructing hash functions, called the Merkle-Damgård construction.

Instead of a full-fledged hash function, imagine that we had a collision-resistant function (family) whose inputs were of a single fixed length, but longer than its outputs. In other words, suppose we had a family ℋ of functions *h* : {0,1}* ^{n+t}* → {0,1}

*, where*

^{n}*t*> 0. We call such an

*h*a

**compression function**. This is not compression in the usual sense of data compression — we are not concerned about recovering the input from the output. We call it a compression function because it “compresses” its input by

*t*bits (analogous to how a pseudorandom generator “stretches” its input by some amount).

We can apply the standard definition of collision-resistance to a family of *compression* functions, by restricting our interest to inputs of length exactly *n* + *t*. The functions in the family are not defined for any other input length.

The following construction is one way to extend a compression function into a full-edged hash function accepting arbitrary-length inputs:

**Construction 12.4: Merkle-Damgård**

*Let h* : {0,1}^{n + t} → {0,1}* ^{n} be a compression function. Then the Merkle-Damgård transformation of h is MD_{h}* : {0,1}

^{∗}→ {0,1}

^{n}, where:

The idea of the Merkle-Damgård construction is to split the input *x* into blocks of size *t*. The end of the string is filled out with 0s if necessary. A final block called the “padding block” is added, which encodes the (original) length of *x* in binary.

We are presenting a simplified version, in which MD* _{h}* accepts inputs whose maximum length is 2

*− 1 bits (the length of the input must fit into*

^{t}*t*bits). By using multiple padding blocks (when necessary) and a suitable encoding of the original string length, the construction can be made to accomodate inputs of arbitrary length (see the exercises).

The value *y*_{0} is called the **initialization vector** (IV), and it is a hard-coded part of the algorithm. In practice, a more “random-looking” value is used as the initialization vector. Or one can think of the Merkle-Damgård construction as defining a **family** of hash functions, corresponding to the different choices of IV.

**Claim 12.5**

*Let* ℋ *be a family of compression functions, and define MD*_{ℋ} = {*MD _{h}* |

*h*∈ ℋ} (

*a family of hash functions*).

*If*ℋ

*is collision-resistant, then so is MD*

_{ℋ}.

**Proof**

While the proof can be carried out in the style of our library-based security definitions, it’s actually much easier to simply show the following: given any collision under MD_{h}, we can efficiently find a collision under *h*. This means that any successful adversary violating the collision-resistance of MD_{ℋ} can be transformed into a successful adversary violating the collision resistance of ℋ. So if ℋ is collision-resistant, then so is MD_{ℋ}.

Suppose that *x,x’* are a collision under MD_{h}. Define the values *x*_{1},. . . ,*x*_{k + 1} and *y*_{1},. . . ,*y*_{k + 1} as in the computation of MD* _{h}* (

*x*). Similarly, define

*x’*

_{1},. . . ,

*x’*

_{k’+1}and

*y’*

_{1},. . . ,

*y’*

_{k’ + 1}as in the computation of MD

*(*

_{h}*x’*). Note that, in general,

*k*may not equal

*k’*.

Recall that:

MD* _{h}* (x) =

*y*

_{k + 1}=

*h*(

*y*‖

_{k}*x*

_{k + 1})

MD*h* (*x’*) = *y’*_{k’ + 1} = *h*(*y’ _{k’}*‖

*x’*

_{k’ + 1})

Since we are assuming MD_{h}(*x*) = MD* _{h}*(

*x’*), we have

*y*

_{k + 1}=

*y’*

_{k’ + 1}. We consider two cases:

*Case 1:* If |*x*| ≠ |*x’*|, then the padding blocks *x*_{k + 1} and *x’*_{k’ + 1} which encode |*x*| and |*x’*| are not equal. Hence we have *y _{k}*‖

*x*

_{k + 1}≠

*y’*

_{k’}‖

*x’*

_{k’ + 1}, so

*y*‖

_{k}*x*

_{k + 1}and

*y’*‖

_{k’}*x’*

_{k’ + 1}are a collision under

*h*and we are done.

*Case 2:* : If |*x*| = |*x’*|, then *x*

and *x’* are broken into the same number of blocks, so *k* = *k’*. Let us work backwards from the final step in the computations of MD* _{h}* (

*x*) and MD

*(*

_{h}*x’*). We know that:

If *y _{k}*‖

*x*

_{k + 1}and

*y’*‖

_{k}*x’*

_{k + 1}are not equal, then they are a collision under

*h*and we are done. Otherwise, we can apply the same logic again to

*y*and

_{k}*y’*, which are equal by our assumption.

_{k}More generally, if *y _{i}* =

*y’*, then either

_{i}*y*

_{i − 1}‖

*x*and

_{i}*y’*

_{i − 1}‖

*x’*are a collision under

_{i}*h*(and we say we are “lucky”), or else

*y*

_{i − 1}=

*y’*

_{i − 1}(and we say we are “unlucky”). We start with the premise that

*y*=

_{k}*y’*. Can we ever get “unlucky” every time, and not encounter a collision when propagating this logic back through the computations of MD

_{k}*(*

_{h}*x*) and MD

*(*

_{h}*x’*)? The answer is no, because encountering the unlucky case every time would imply that

*x*=

_{i}*x’*for

_{i}*all i*. That is,

*x*=

*x’*. But this contradicts our original assumption that

*x ≠ x’*. Hence we must encounter some “lucky” case and therefore a collision in

*h*.