# 5.1: ChainedHashTable - Hashing with Chaining


A ChainedHashTable data structure uses hashing with chaining to store data as an array, $$\mathtt{t}$$, of lists. An integer, $$\mathtt{n}$$, keeps track of the total number of items in all lists (see Figure $$\PageIndex{1}$$):

    List<T>[] t;
    int n;


The hash value of a data item $$\mathtt{x}$$, denoted $$\mathtt{hash(x)}$$, is a value in the range $$\{0,\ldots,\texttt{t.length}-1\}$$. All items with hash value $$\mathtt{i}$$ are stored in the list at $$\mathtt{t[i]}$$. To ensure that lists don't get too long, we maintain the invariant

$\mathtt{n} \le \texttt{t.length} \nonumber$

so that the average number of elements stored in one of these lists is $$\mathtt{n}/\texttt{t.length} \le 1$$.
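Before any of the operations below can run, the array $$\mathtt{t}$$ must be filled with empty lists. A minimal initialization sketch (the helper name `allocTable` is illustrative, not from the source): Java cannot create a generic array directly, so a raw `List[]` is allocated and cast.

```java
import java.util.ArrayList;
import java.util.List;

class TableUtil {
    // Hypothetical helper (not from the source): allocate an array
    // of m empty chains. The unchecked cast is needed because Java
    // forbids "new List<T>[m]" directly.
    @SuppressWarnings("unchecked")
    static <T> List<T>[] allocTable(int m) {
        List<T>[] t = (List<T>[]) new List[m];
        for (int i = 0; i < m; i++)
            t[i] = new ArrayList<>();  // each slot starts as an empty chain
        return t;
    }
}
```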

To add an element, $$\mathtt{x}$$, to the hash table, we first check if the length of $$\mathtt{t}$$ needs to be increased and, if so, we grow $$\mathtt{t}$$. With this out of the way we hash $$\mathtt{x}$$ to get an integer, $$\mathtt{i}$$, in the range $$\{0,\ldots,\texttt{t.length}-1\}$$, and we append $$\mathtt{x}$$ to the list $$\mathtt{t[i]}$$:

    boolean add(T x) {
        if (find(x) != null) return false;
        if (n+1 > t.length) resize();
        t[hash(x)].add(x);
        n++;
        return true;
    }


Growing the table, if necessary, involves doubling the length of $$\mathtt{t}$$ and reinserting all elements into the new table. This strategy is exactly the same as the one used in the implementation of ArrayStack and the same result applies: The cost of growing is only constant when amortized over a sequence of insertions (see Lemma 2.1.1 in Section 2.1.2).
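The $$\mathtt{resize()}$$ method itself is not shown in this section. The doubling-and-reinserting strategy just described can be sketched as follows; this is a self-contained illustration, and the modulo hash function in it is a placeholder for demonstration only (Section 5.1.1 replaces it with multiplicative hashing):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the doubling strategy (assumptions: the modulo hash and
// the class name ResizeDemo are illustrative, not from the source).
class ResizeDemo {
    List<Integer>[] t = newTable(1);
    int n = 0;

    @SuppressWarnings("unchecked")
    static List<Integer>[] newTable(int m) {
        List<Integer>[] tab = (List<Integer>[]) new List[m];
        for (int i = 0; i < m; i++) tab[i] = new ArrayList<>();
        return tab;
    }

    // Placeholder hash: any function into {0,...,t.length-1} works here.
    int hash(int x) { return Math.floorMod(x, t.length); }

    void add(int x) {
        if (n + 1 > t.length) resize();  // maintain n <= t.length
        t[hash(x)].add(x);
        n++;
    }

    // Double the table and reinsert every element. Each call costs
    // O(n), but amortized over a sequence of insertions the cost per
    // insertion is constant.
    void resize() {
        List<Integer>[] old = t;
        t = newTable(2 * old.length);
        for (List<Integer> chain : old)
            for (int x : chain)
                t[hash(x)].add(x);  // rehash into the larger table
    }
}
```

Note that every element must be rehashed, not just copied, because the hash values depend on the table size.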

Besides growing, the only other work done when adding a new value $$\mathtt{x}$$ to a ChainedHashTable involves appending $$\mathtt{x}$$ to the list $$\mathtt{t[hash(x)]}$$. For any of the list implementations described in Chapters 2 or 3, this takes only constant time.

To remove an element, $$\mathtt{x}$$, from the hash table, we iterate over the list $$\mathtt{t[hash(x)]}$$ until we find $$\mathtt{x}$$ so that we can remove it:

    T remove(T x) {
        Iterator<T> it = t[hash(x)].iterator();
        while (it.hasNext()) {
            T y = it.next();
            if (y.equals(x)) {
                it.remove();
                n--;
                return y;
            }
        }
        return null;
    }


This takes $$O(\mathtt{n}_{\mathtt{hash(x)}})$$ time, where $$\mathtt{n}_{\mathtt{i}}$$ denotes the length of the list stored at $$\mathtt{t[i]}$$.

Searching for the element $$\mathtt{x}$$ in a hash table is similar. We perform a linear search on the list $$\mathtt{t[hash(x)]}$$:

    T find(Object x) {
        for (T y : t[hash(x)])
            if (y.equals(x))
                return y;
        return null;
    }


Again, this takes time proportional to the length of the list $$\mathtt{t[hash(x)]}$$.

The performance of a hash table depends critically on the choice of the hash function. A good hash function will spread the elements evenly among the $$\texttt{t.length}$$ lists, so that the expected size of the list $$\mathtt{t[hash(x)]}$$ is $$O(\mathtt{n}/\texttt{t.length}) = O(1)$$. On the other hand, a bad hash function will hash all values (including $$\mathtt{x}$$) to the same table location, in which case the size of the list $$\mathtt{t[hash(x)]}$$ will be $$\mathtt{n}$$. In the next section we describe a good hash function.

## $$\PageIndex{1}$$ Multiplicative Hashing

Multiplicative hashing is an efficient method of generating hash values based on modular arithmetic (discussed in Section 2.3) and integer division. It uses the $$\textrm{div}$$ operator, which calculates the integral part of a quotient, while discarding the remainder. Formally, for any integers $$a\ge 0$$ and $$b\ge 1$$, $$a\textrm{ div } b = \lfloor a/b\rfloor$$.
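For nonnegative operands, Java's built-in integer division already computes $$\textrm{div}$$, since truncation toward zero coincides with the floor in that case. A small illustration:

```java
// a div b = floor(a/b) for integers a >= 0 and b >= 1.
// Java's / operator truncates toward zero, which equals the
// floor when both operands are nonnegative.
class DivExample {
    static int div(int a, int b) {
        return a / b;  // e.g., 17 div 5 = 3; the remainder 2 is discarded
    }
}
```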

In multiplicative hashing, we use a hash table of size $$2^{\mathtt{d}}$$ for some integer $$\mathtt{d}$$ (called the dimension). The formula for hashing an integer $$\mathtt{x}\in\{0,\ldots,2^{\mathtt{w}}-1\}$$ is

$\mathtt{hash(x)} = ((\mathtt{z}\cdot\mathtt{x})\bmod 2^{\mathtt{w}}) \textrm{ div } 2^{\mathtt{w}-\mathtt{d}} \enspace . \nonumber$

Here, $$\mathtt{z}$$ is a randomly chosen odd integer in $$\{1,\ldots,2^{\mathtt{w}}-1\}$$. This hash function can be realized very efficiently by observing that, by default, operations on integers are already done modulo $$2^{\mathtt{w}}$$ where $$\mathtt{w}$$ is the number of bits in an integer.1 (See Figure $$\PageIndex{2}$$.) Furthermore, integer division by $$2^{\mathtt{w}-\mathtt{d}}$$ is equivalent to dropping the rightmost $$\mathtt{w}-\mathtt{d}$$ bits in a binary representation (which is implemented by shifting the bits right by $$\mathtt{w}-\mathtt{d}$$ using the $$\mathtt{\text{>>>}}$$ operator). In this way, the code that implements the above formula is simpler than the formula itself:

    int hash(Object x) {
        return (z * x.hashCode()) >>> (w-d);
    }

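One can check on small inputs that the shift-based code computes the same value as the formula. The following sketch (not from the source) fixes $$\mathtt{w}=32$$, Java's `int` width, and evaluates the formula's mod and div steps explicitly in 64-bit arithmetic, while the shift version relies on `int` overflow to perform the mod $$2^{32}$$ step implicitly:

```java
// Compare the formula ((z*x) mod 2^w) div 2^(w-d) against the
// shift-based code for w = 32. Valid for 1 <= d <= 32.
class MultiplicativeHashCheck {
    static int hashFormula(int z, int x, int d) {
        long prod = ((long) z * x) & 0xffffffffL; // (z*x) mod 2^32, explicit
        return (int) (prod >>> (32 - d));          // div 2^(32-d): drop low bits
    }

    static int hashShift(int z, int x, int d) {
        return (z * x) >>> (32 - d);               // mod 2^32 happens via overflow
    }
}
```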

The following lemma, whose proof is deferred until later in this section, shows that multiplicative hashing does a good job of avoiding collisions:

Lemma $$\PageIndex{1}$$.

Let $$\mathtt{x}$$ and $$\mathtt{y}$$ be any two values in $$\{0,\ldots,2^{\mathtt{w}}-1\}$$ with $$\mathtt{x}\neq \mathtt{y}$$. Then $$\Pr\{\mathtt{hash(x)}=\mathtt{hash(y)}\} \le 2/2^{\mathtt{d}}$$.

With Lemma $$\PageIndex{1}$$, the performance of $$\mathtt{remove(x)}$$ and $$\mathtt{find(x)}$$ is easy to analyze:

Lemma $$\PageIndex{2}$$.

For any data value $$\mathtt{x}$$, the expected length of the list $$\mathtt{t[hash(x)]}$$ is at most $$\mathtt{n}_{\mathtt{x}} + 2$$, where $$\mathtt{n}_{\mathtt{x}}$$ is the number of occurrences of $$\mathtt{x}$$ in the hash table.

Proof. Let $$S$$ be the (multi-)set of elements stored in the hash table that are not equal to $$\mathtt{x}$$. For an element $$\mathtt{y}\in S$$, define the indicator variable

$I_{\mathtt{y}} = \left\{ \begin{array}{ll} 1 & \text{if } \mathtt{hash(x)}= \mathtt{hash(y)} \\ 0 & \text{otherwise} \end{array}\right. \nonumber$

and notice that, by Lemma $$\PageIndex{1}$$, $$\mathrm{E}[I_{\mathtt{y}}] \le 2/2^{\mathtt{d}}=2/\texttt{t.length}$$. The expected length of the list $$\mathtt{t[hash(x)]}$$ is given by

\begin{align} \mathrm{E}\left[\texttt{t[hash(x)].size()}\right] &=\mathrm{E}\left[\mathtt{n}_{\mathtt{x}} + \sum_{\mathtt{y}\in S} I_{\mathtt{y}}\right]\nonumber\\ &=\mathtt{n}_{\mathtt{x}} + \sum_{\mathtt{y}\in S} \mathrm{E}[I_{\mathtt{y}} ]\nonumber\\ &\le \mathtt{n}_{\mathtt{x}} + \sum_{\mathtt{y}\in S} 2/\texttt{t.length}\nonumber\\ &\le \mathtt{n}_{\mathtt{x}} + \sum_{\mathtt{y}\in S} 2/\mathtt{n}\nonumber\\ &\le \mathtt{n}_{\mathtt{x}} + (\mathtt{n}-\mathtt{n}_{\mathtt{x}})2/\mathtt{n}\nonumber\\ &\le \mathtt{n}_{\mathtt{x}} + 2 \enspace ,\nonumber \end{align}\nonumber

as required.

Now, we want to prove Lemma $$\PageIndex{1}$$, but first we need a result from number theory. In the following proof, we use the notation $$(b_r,\ldots,b_0)_2$$ to denote $$\sum_{i=0}^r b_i2^i$$, where each $$b_i$$ is a bit, either 0 or 1. In other words, $$(b_r,\ldots,b_0)_2$$ is the integer whose binary representation is given by $$b_r,\ldots,b_0$$. We use $$\star$$ to denote a bit of unknown value.

Lemma $$\PageIndex{3}$$.

Let $$S$$ be the set of odd integers in $$\{1,\ldots,2^{\mathtt{w}}-1\}$$; let $$q$$ and $$i$$ be any two elements in $$S$$. Then there is exactly one value $$\mathtt{z}\in S$$ such that $$\mathtt{z}q\bmod 2^{\mathtt{w}} = i$$.

Proof. Since the number of choices for $$\mathtt{z}$$ and $$i$$ is the same, it is sufficient to prove that there is at most one value $$\mathtt{z}\in S$$ that satisfies $$\mathtt{z}q\bmod 2^{\mathtt{w}} = i$$.

Suppose, for the sake of contradiction, that there are two such values $$\mathtt{z}$$ and $$\mathtt{z'}$$, with $$\mathtt{z}>\mathtt{z}'$$. Then

$\mathtt{z}q\bmod 2^{\mathtt{w}} = \mathtt{z}'q \bmod 2^{\mathtt{w}} = i \nonumber$

So

$(\mathtt{z}-\mathtt{z}')q\bmod 2^{\mathtt{w}} = 0 \nonumber$

But this means that

$(\mathtt{z}-\mathtt{z}')q = k 2^{\mathtt{w}} \label{factors}$

for some integer $$k$$. Thinking in terms of binary numbers, we have

$(\mathtt{z}-\mathtt{z}')q = k\cdot(1,\underbrace{0,\ldots,0}_{\mathtt{w}})_2 \enspace , \nonumber$

so that the $$\mathtt{w}$$ trailing bits in the binary representation of $$(\mathtt{z}-\mathtt{z}')q$$ are all 0's.

Furthermore $$k\neq 0$$, since $$q\neq 0$$ and $$\mathtt{z}-\mathtt{z}'\neq 0$$. Since $$q$$ is odd, it has no trailing 0's in its binary representation:

$q = (\star,\ldots,\star,1)_2 \enspace . \nonumber$

Since $$\vert\mathtt{z}-\mathtt{z}'\vert < 2^{\mathtt{w}}$$, $$\mathtt{z}-\mathtt{z}'$$ has fewer than $$\mathtt{w}$$ trailing 0's in its binary representation:

$\mathtt{z}-\mathtt{z}' = (\star,\ldots,\star,1,\underbrace{0,\ldots,0}_{<\mathtt{w}})_2 \enspace . \nonumber$

Therefore, the product $$(\mathtt{z}-\mathtt{z}')q$$ has fewer than $$\mathtt{w}$$ trailing 0's in its binary representation:

$(\mathtt{z}-\mathtt{z}')q = (\star,\cdots,\star,1,\underbrace{0,\ldots,0}_{<\mathtt{w}})_2 \enspace . \nonumber$

Therefore $$(\mathtt{z}-\mathtt{z}')q$$ cannot satisfy ($$\ref{factors}$$), yielding a contradiction and completing the proof.

The utility of Lemma $$\PageIndex{3}$$ comes from the following observation: If $$\mathtt{z}$$ is chosen uniformly at random from $$S$$, then, for any fixed $$q\in S$$, $$\mathtt{z}q\bmod 2^{\mathtt{w}}$$ is uniformly distributed over $$S$$. In the following proof, it helps to think of the binary representation of $$\mathtt{z}$$, which consists of $$\mathtt{w}-1$$ random bits followed by a 1.

Proof of Lemma $$\PageIndex{1}$$.

First we note that the condition $$\mathtt{hash(x)}=\mathtt{hash(y)}$$ is equivalent to the statement "the highest-order $$\mathtt{d}$$ bits of $$\mathtt{z} \mathtt{x}\bmod2^{\mathtt{w}}$$ and the highest-order $$\mathtt{d}$$ bits of $$\mathtt{z} \mathtt{y}\bmod 2^{\mathtt{w}}$$ are the same." A necessary condition of that statement is that the highest-order $$\mathtt{d}$$ bits in the binary representation of $$\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}$$ are either all 0's or all 1's. That is,

$\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^\mathtt{w}=( \underbrace{0,\ldots,0}_{\mathtt{d}}, \underbrace{\star,\ldots,\star}_{\mathtt{w}-\mathtt{d}})_2\label{all-zeros}$

when $$\mathtt{zx}\bmod 2^{\mathtt{w}} > \mathtt{zy}\bmod 2^{\mathtt{w}}$$, or

$\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^\mathtt{w} = (\underbrace{1,\ldots,1}_{\mathtt{d}}, \underbrace{\star,\ldots,\star}_{\mathtt{w}-\mathtt{d}})_2\label{all-ones}$

when $$\mathtt{zx}\bmod 2^{\mathtt{w}} < \mathtt{zy}\bmod 2^{\mathtt{w}}$$. Therefore, we only have to bound the probability that $$\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}$$ looks like ($$\ref{all-zeros}$$) or ($$\ref{all-ones}$$).

Let $$q$$ be the unique odd integer such that $$(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}=q2^r$$ for some integer $$r\ge 0$$. By Lemma $$\PageIndex{3}$$, the binary representation of $$\mathtt{z}q\bmod 2^{\mathtt{w}}$$ has $$\mathtt{w}-1$$ random bits, followed by a 1:

$\mathtt{z}q\bmod 2^{\mathtt{w}} = (\underbrace{b_{\mathtt{w}-1},\ldots,b_{1}}_{\mathtt{w}-1},1)_2 \nonumber$

Therefore, the binary representation of $$\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}=\mathtt{z}q2^r\bmod 2^{\mathtt{w}}$$ has $$\mathtt{w}-r-1$$ random bits, followed by a 1, followed by $$r$$ 0's:

$\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}} =\mathtt{z}q2^r \bmod 2^{\mathtt{w}} =(\underbrace{b_{\mathtt{w}-r-1},\ldots,b_{1}}_{\mathtt{w}-r-1},1,\underbrace{0,0,\ldots,0}_{r})_2 \nonumber$

We can now finish the proof: If $$r > \mathtt{w}-\mathtt{d}$$, then the $$\mathtt{d}$$ highest-order bits of $$\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}$$ contain both 0's and 1's, so the probability that $$\mathtt{z}(\mathtt{x}-\mathtt{y})\bmod 2^{\mathtt{w}}$$ looks like ($$\ref{all-zeros}$$) or ($$\ref{all-ones}$$) is 0. If $$r=\mathtt{w}-\mathtt{d}$$, then the probability of looking like ($$\ref{all-zeros}$$) is 0, but the probability of looking like ($$\ref{all-ones}$$) is $$1/2^{\mathtt{d}-1}=2/2^{\mathtt{d}}$$ (since we must have $$b_1,\ldots,b_{\mathtt{d}-1}=1,\ldots,1$$). If $$r < \mathtt{w}-\mathtt{d}$$, then we must have $$b_{\mathtt{w}-r-1},\ldots,b_{\mathtt{w}-r-\mathtt{d}}=0,\ldots,0$$ or $$b_{\mathtt{w}-r-1},\ldots,b_{\mathtt{w}-r-\mathtt{d}}=1,\ldots,1$$. The probability of each of these cases is $$1/2^{\mathtt{d}}$$ and they are mutually exclusive, so the probability of either of these cases is $$2/2^{\mathtt{d}}$$. This completes the proof.

## $$\PageIndex{2}$$ Summary

The following theorem summarizes the performance of a ChainedHashTable data structure:

Theorem $$\PageIndex{1}$$.

A ChainedHashTable implements the USet interface. Ignoring the cost of calls to $$\mathtt{resize()}$$, a ChainedHashTable supports the operations $$\mathtt{add(x)}$$, $$\mathtt{remove(x)}$$, and $$\mathtt{find(x)}$$ in $$O(1)$$ expected time per operation.

Furthermore, beginning with an empty ChainedHashTable, any sequence of $$m$$ $$\mathtt{add(x)}$$ and $$\mathtt{remove(x)}$$ operations results in a total of $$O(m)$$ time spent during all calls to $$\mathtt{resize()}$$.

#### Footnotes

1This is true for most programming languages, including C, C#, C++, and Java. Notable exceptions are Python and Ruby, in which the result of a fixed-length $$\mathtt{w}$$-bit integer operation that overflows is upgraded to a variable-length representation.

This page titled 5.1: ChainedHashTable - Hashing with Chaining is shared under a CC BY license and was authored, remixed, and/or curated by Pat Morin (Athabasca University Press) .