11.2: Merkle-Damgård Construction
( \newcommand{\kernel}{\mathrm{null}\,}\)
Building a hash function, especially one that accepts inputs of arbitrary length, seems like a challenging task. In this section, we’ll see one approach for constructing hash functions, called the Merkle-Damgård construction.
Instead of a full-fledged hash function, imagine that we had a collision-resistant function whose inputs were of a single fixed length, but longer than its outputs. In other words, h:{θ,1}n+t→{0,1}n, where t>0. We call such an h a compression function. This is not compression in the usual sense of the word - we are not concerned about recovering the input from the output. We call it a compression function because it "compresses" its input by t bits (analogous to how a pseudorandom generator "stretches" its input by some amount).
The following construction is one way to build a full-fledged hash function (supporting inputs of arbitrary length) out of such a compression function:
Let h:{0,1}n+t→{0,1}n be a compression function. Then the Merkle-Damgård transformation of h is MDh:{0,1}∗→{0,1}n, where:

The idea of the Merkle-Damgård construction is to split the input x into blocks of size t. The end of the string is filled out with ϑs if necessary. A final block called the "padding block" is added, which encodes the (original) length of x in binary.
Suppose we have a compression function h:{0,1}48→{0,1}32, so that t=16. We build a Merkle-Damgård hash function out of this compression function and wish to compute the hash of the following 5-byte (40-bit) string:

We must first padx appropriately (MDPAD(x)) :
- Since x is not a multiple of t=16 bits, we need to add 8 bits to make it so.
- Since |x|=40, we need to add an extra 16-bit block that encodes the number 40 in binary(101000)
After this padding, and splitting the result into blocks of length 16 , we have the following:

The final hash of x is computed as follows:

We are presenting a simplified version, in which MDh accepts inputs whose maximum length is 2t−1 bits (the length of the input must fit into t bits). By using multiple padding blocks (when necessary) and a suitable encoding of the original string length, the construction can be made to accommodate inputs of arbitrary length (see the exercises).
The value y0 is called the initialization vector (IV), and it is a hard-coded part of the algorithm.
As discussed above, we will not be making provable security claims using the librarystyle definitions. However, we can justify the Merkle-Damgård construction with the following claim:
Suppose h is a compression function and MDh is the Merkle-Damgård construction applied to h. Given a collision x,x′ in MDh, it is easy to find a collision in h. In other words, if it is hard to find a collision in h, then it must also be hard to find a collision in MDh.
Proof
Suppose that x,x′ are a collision under Mh. Define the values x1,…,xk+1 and y1,…,yk+1 as in the computation of MDh(x). Similarly, define x′1,…,x′k′+1 and y′1,…,y′k′+1 as in the computation of MDh(x′). Note that, in general, k may not equal k′
Recall that: MDh(x)=yk+1=h(yk‖xk+1)MDh(x′)=y′k′+1=h(y′k′‖x′k′+1) Since we are assuming MDh(x)=MDh(x′), we have yk+1=y′k′+1. We consider two cases:
Case 1: If |x|≠|x′|, then the padding blocks xk+1 and x′k′+1 which encode |x| and |x′| are not equal. Hence we have yk‖xk+1≠y′k′‖x′k′+1, so yk‖xk+1 and y′k′‖x′k′+1 are a collision under h and we are done.
Case 2: If |x|=|x′|, then x and x′ are broken into the same number of blocks, so k=k′. Let us work backwards from the final step in the computations of MDDh(x) and Mh(x′).
We know that: yk+1=h(yk‖xk+1)=y′k+1=h(y′k‖x′k+1) If yk‖xk+1 and y′k‖x′k+1 are not equal, then they are a collision under h and we are done. Otherwise, we can apply the same logic again to yk and y′k, which are equal by our assumption.
More generally, if yi=y′i, then either yi−1‖xi and y′i−1‖x′i are a collision under h (and we say we are "lucky"), or else yi−1=y′i−1 (and we say we are "unlucky"). We start with the premise that yk=y′k. Can we ever get "unlucky" every time, and not encounter a collision when propagating this logic back through the computations of MDh(x) and MDh(x′) ? The answer is no, because encountering the unlucky case every time would imply that xi=x′i for all i. That is, x=x′. But this contradicts our original assumption that x≠x′. Hence we must encounter some "lucky" case and therefore a collision in h.