# 5.2: LinearHashTable - Linear Probing

The `ChainedHashTable` data structure uses an array of lists, where the \(\mathtt{i}\)th list stores all elements \(\mathtt{x}\) such that \(\mathtt{hash(x)}=\mathtt{i}\). An alternative, called open addressing, is to store the elements directly in an array, \(\mathtt{t}\), with each array location in \(\mathtt{t}\) storing at most one value. This approach is taken by the `LinearHashTable` described in this section. In some places, this data structure is described as open addressing with linear probing.

The main idea behind a `LinearHashTable` is that we would, ideally, like to store the element \(\mathtt{x}\) with hash value \(\mathtt{i=hash(x)}\) in the table location \(\mathtt{t[i]}\). If we cannot do this (because some element is already stored there) then we try to store it at location \(\mathtt{t}[(\mathtt{i}+1)\bmod\texttt{t.length}]\); if that's not possible, then we try \(\mathtt{t}[(\mathtt{i}+2)\bmod\texttt{t.length}]\), and so on, until we find a place for \(\mathtt{x}\).
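To make the probe sequence concrete, here is a standalone sketch; the `probeSequence` helper is hypothetical and not part of the `LinearHashTable` code, it simply lists the indices examined for a given hash value:

```java
public class ProbeDemo {
    // First `count` indices examined by linear probing for hash value i
    // in a table of length `length` (wrapping around at the end).
    static int[] probeSequence(int i, int length, int count) {
        int[] seq = new int[count];
        for (int j = 0; j < count; j++)
            seq[j] = (i + j) % length;
        return seq;
    }

    public static void main(String[] args) {
        // With t.length = 8 and hash(x) = 6, probing examines 6, 7, 0, 1, ...
        System.out.println(java.util.Arrays.toString(probeSequence(6, 8, 4)));
        // prints [6, 7, 0, 1]
    }
}
```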

There are three types of entries stored in \(\mathtt{t}\):

1. data values: actual values in the `USet` that we are representing;
2. \(\mathtt{null}\) values: at array locations where no data has ever been stored; and
3. \(\mathtt{del}\) values: at array locations where data was once stored but that has since been deleted.

In addition to the counter, \(\mathtt{n}\), that keeps track of the number of elements in the `LinearHashTable`, a counter, \(\mathtt{q}\), keeps track of the number of elements of Types 1 and 3. That is, \(\mathtt{q}\) is equal to \(\mathtt{n}\) plus the number of \(\mathtt{del}\) values in \(\mathtt{t}\). To make this work efficiently, we need \(\mathtt{t}\) to be considerably larger than \(\mathtt{q}\), so that there are lots of \(\mathtt{null}\) values in \(\mathtt{t}\). The operations on a `LinearHashTable` therefore maintain the invariant that \(\texttt{t.length}\ge 2\mathtt{q}\).

To summarize, a `LinearHashTable` contains an array, \(\mathtt{t}\), that stores data elements, and integers \(\mathtt{n}\) and \(\mathtt{q}\) that keep track of the number of data elements and non-\(\mathtt{null}\) values of \(\mathtt{t}\), respectively. Because many hash functions only work for table sizes that are a power of 2, we also keep an integer \(\mathtt{d}\) and maintain the invariant that \(\texttt{t.length}=2^\mathtt{d}\).

```java
T[] t;   // the table
int n;   // the size
int d;   // t.length = 2^d
int q;   // number of non-null entries in t
```

The \(\mathtt{find(x)}\) operation in a `LinearHashTable` is simple. We start at array entry \(\mathtt{t[i]}\) where \(\mathtt{i}=\mathtt{hash(x)}\) and search entries \(\mathtt{t[i]}\), \(\mathtt{t}[(\mathtt{i}+1)\bmod \texttt{t.length}]\), \(\mathtt{t}[(\mathtt{i}+2)\bmod \texttt{t.length}]\), and so on, until we find an index \(\mathtt{i'}\) such that, either, \(\mathtt{t[i']=x}\), or \(\mathtt{t[i']=null}\). In the former case we return \(\mathtt{t[i']}\). In the latter case, we conclude that \(\mathtt{x}\) is not contained in the hash table and return \(\mathtt{null}\).

```java
T find(T x) {
    int i = hash(x);
    while (t[i] != null) {
        if (t[i] != del && x.equals(t[i])) return t[i];
        i = (i == t.length-1) ? 0 : i + 1; // increment i
    }
    return null;
}
```

The \(\mathtt{add(x)}\) operation is also fairly easy to implement. After checking that \(\mathtt{x}\) is not already stored in the table (using \(\mathtt{find(x)}\)), we search \(\mathtt{t[i]}\), \(\mathtt{t}[(\mathtt{i}+1)\bmod \texttt{t.length}]\), \(\mathtt{t}[(\mathtt{i}+2)\bmod \texttt{t.length}]\), and so on, until we find a \(\mathtt{null}\) or \(\mathtt{del}\) and store \(\mathtt{x}\) at that location, increment \(\mathtt{n}\), and \(\mathtt{q}\), if appropriate.

```java
boolean add(T x) {
    if (find(x) != null) return false;
    if (2*(q+1) > t.length) resize();      // max 50% occupancy
    int i = hash(x);
    while (t[i] != null && t[i] != del)
        i = (i == t.length-1) ? 0 : i + 1; // increment i
    if (t[i] == null) q++;
    n++;
    t[i] = x;
    return true;
}
```

By now, the implementation of the \(\mathtt{remove(x)}\) operation should be obvious. We search \(\mathtt{t[i]}\), \(\mathtt{t}[(\mathtt{i}+1)\bmod \texttt{t.length}]\), \(\mathtt{t}[(\mathtt{i}+2)\bmod \texttt{t.length}]\), and so on until we find an index \(\mathtt{i'}\) such that \(\mathtt{t[i']=x}\) or \(\mathtt{t[i']=null}\). In the former case, we set \(\mathtt{t[i']=del}\) and return the removed value. In the latter case we conclude that \(\mathtt{x}\) was not stored in the table (and therefore cannot be deleted) and return \(\mathtt{null}\).

```java
T remove(T x) {
    int i = hash(x);
    while (t[i] != null) {
        T y = t[i];
        if (y != del && x.equals(y)) {
            t[i] = del;
            n--;
            if (8*n < t.length) resize(); // min 12.5% occupancy
            return y;
        }
        i = (i == t.length-1) ? 0 : i + 1; // increment i
    }
    return null;
}
```

The correctness of the \(\mathtt{find(x)}\), \(\mathtt{add(x)}\), and \(\mathtt{remove(x)}\) methods is easy to verify, though it relies on the use of \(\mathtt{del}\) values. Notice that none of these operations ever sets a non-\(\mathtt{null}\) entry to \(\mathtt{null}\). Therefore, when we reach an index \(\mathtt{i'}\) such that \(\mathtt{t[i']=null}\), this is a proof that the element, \(\mathtt{x}\), that we are searching for is not stored in the table; \(\mathtt{t[i']}\) has always been \(\mathtt{null}\), so there is no reason that a previous \(\mathtt{add(x)}\) operation would have proceeded beyond index \(\mathtt{i'}\).

The \(\mathtt{resize()}\) method is called by \(\mathtt{add(x)}\) when the number of non-\(\mathtt{null}\) entries exceeds \(\texttt{t.length}/2\) or by \(\mathtt{remove(x)}\) when the number of data entries is less than \(\texttt{t.length/8}\). The \(\mathtt{resize()}\) method works like the \(\mathtt{resize()}\) methods in other array-based data structures. We find the smallest non-negative integer \(\mathtt{d}\) such that \(2^{\mathtt{d}} \ge 3\mathtt{n}\). We reallocate the array \(\mathtt{t}\) so that it has size \(2^{\mathtt{d}}\), and then we insert all the elements in the old version of \(\mathtt{t}\) into the newly-resized copy of \(\mathtt{t}\). While doing this, we reset \(\mathtt{q}\) equal to \(\mathtt{n}\) since the newly-allocated \(\mathtt{t}\) contains no \(\mathtt{del}\) values.

```java
void resize() {
    d = 1;
    while ((1<<d) < 3*n) d++;
    T[] told = t;
    t = newArray(1<<d);
    q = n;
    // insert everything from told
    for (int k = 0; k < told.length; k++) {
        if (told[k] != null && told[k] != del) {
            int i = hash(told[k]);
            while (t[i] != null)
                i = (i == t.length-1) ? 0 : i + 1;
            t[i] = told[k];
        }
    }
}
```
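The choice of the smallest \(\mathtt{d}\) with \(2^{\mathtt{d}}\ge 3\mathtt{n}\) can be checked in isolation. In this standalone sketch, the `newDimension` helper is hypothetical; it mirrors the loop at the top of \(\mathtt{resize()}\):

```java
public class ResizeDemo {
    // Smallest d >= 1 with 2^d >= 3*n, mirroring the loop in resize().
    static int newDimension(int n) {
        int d = 1;
        while ((1 << d) < 3 * n) d++;
        return d;
    }

    public static void main(String[] args) {
        // With n = 20 elements, the loop picks d = 6, since 2^6 = 64 >= 60.
        System.out.println(newDimension(20)); // prints 6
        // Since 3n <= 2^d < 6n, the new table is at most one-third full,
        // so the invariant t.length >= 2q holds with room to spare.
    }
}
```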

## \(\PageIndex{1}\) Analysis of Linear Probing

Notice that each operation, \(\mathtt{add(x)}\), \(\mathtt{remove(x)}\), or \(\mathtt{find(x)}\), finishes as soon as (or before) it discovers the first \(\mathtt{null}\) entry in \(\mathtt{t}\). The intuition behind the analysis of linear probing is that, since at least half the elements in \(\mathtt{t}\) are equal to \(\mathtt{null}\), an operation should not take long to complete because it will very quickly come across a \(\mathtt{null}\) entry. We shouldn't rely too heavily on this intuition, though, because it would lead us to (the incorrect) conclusion that the expected number of locations in \(\mathtt{t}\) examined by an operation is at most 2.

For the rest of this section, we will assume that all hash values are independently and uniformly distributed in \(\{0,\ldots,\texttt{t.length}-1\}\). This is not a realistic assumption, but it will make it possible for us to analyze linear probing. Later in this section we will describe a method, called tabulation hashing, that produces a hash function that is "good enough" for linear probing. We will also assume that all indices into the positions of \(\mathtt{t}\) are taken modulo \(\texttt{t.length}\), so that \(\mathtt{t[i]}\) is really a shorthand for \(\mathtt{t}[\mathtt{i}\bmod\texttt{t.length}]\).

We say that a run of length \(k\) that starts at \(\mathtt{i}\) occurs when all the table entries \(\mathtt{t[i]}, \mathtt{t[i+1]},\ldots,\mathtt{t}[\mathtt{i}+k-1]\) are non-\(\mathtt{null}\) and \(\mathtt{t}[\mathtt{i}-1]=\mathtt{t}[\mathtt{i}+k]=\mathtt{null}\). The number of non-\(\mathtt{null}\) elements of \(\mathtt{t}\) is exactly \(\mathtt{q}\) and the \(\mathtt{add(x)}\) method ensures that, at all times, \(\mathtt{q}\le\texttt{t.length}/2\). There are \(\mathtt{q}\) elements \(\mathtt{x}_1,\ldots,\mathtt{x}_{\mathtt{q}}\) that have been inserted into \(\mathtt{t}\) since the last \(\mathtt{resize()}\) operation. By our assumption, each of these has a hash value, \(\mathtt{hash}(\mathtt{x}_j)\), that is uniform and independent of the rest. With this setup, we can prove the main lemma required to analyze linear probing.

**Lemma \(\PageIndex{1}\).** *Fix a value \(\mathtt{i}\in\{0,\ldots,\texttt{t.length}-1\}\). Then the probability that a run of length \(k\) starts at \(\mathtt{i}\) is \(O(c^k)\) for some constant \(0<c<1\).*

*Proof*. If a run of length \(k\) starts at \(\mathtt{i}\), then there are exactly \(k\) elements \(\mathtt{x}_j\) such that \(\mathtt{hash}(\mathtt{x}_j)\in\{\mathtt{i},\ldots,\mathtt{i}+k-1\}\). The probability that this occurs is exactly

\[ p_k = \binom{\mathtt{q}}{k} \left(\frac{k}{\texttt{t.length}}\right)^{k} \left(\frac{\texttt{t.length}-k}{\texttt{t.length}}\right)^{\mathtt{q}-k} \enspace , \nonumber\]

since, for each choice of \(k\) elements, these \(k\) elements must hash to one of the \(k\) locations and the remaining \(\mathtt{q}-k\) elements must hash to the other \(\texttt{t.length}-k\) table locations.^{2}

In the following derivation we will cheat a little and replace \(r!\) with \((r/e)^r\). Stirling's Approximation (Section 1.3.2) shows that this is only a factor of \(O(\sqrt{r})\) from the truth. This is just done to make the derivation simpler; Exercise 5.4 asks the reader to redo the calculation more rigorously using Stirling's Approximation in its entirety.

The value of \(p_k\) is maximized when \(\texttt{t.length}\) is minimum, and the data structure maintains the invariant that \(\texttt{t.length} \ge 2\mathtt{q}\), so

\[\begin{align}
p_k &\le \binom{\mathtt{q}}{k} \left(\frac{k}{2\mathtt{q}}\right)^k \left(\frac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}\nonumber\\
&= \left(\frac{\mathtt{q}!}{(\mathtt{q}-k)!k!}\right) \left(\frac{k}{2\mathtt{q}}\right)^k \left(\frac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}\nonumber\\
&\approx \left(\frac{\mathtt{q}^{\mathtt{q}}}{(\mathtt{q}-k)^{\mathtt{q}-k}k^k}\right) \left(\frac{k}{2\mathtt{q}}\right)^k \left(\frac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}\qquad\text{[Stirling's approximation]}\nonumber\\
&= \left(\frac{\mathtt{q}^{k}\mathtt{q}^{\mathtt{q}-k}}{(\mathtt{q}-k)^{\mathtt{q}-k}k^k}\right) \left(\frac{k}{2\mathtt{q}}\right)^k \left(\frac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}\nonumber\\
&= \left(\frac{\mathtt{q}k}{2\mathtt{q}k}\right)^k \left(\frac{\mathtt{q}(2\mathtt{q}-k)}{2\mathtt{q}(\mathtt{q}-k)}\right)^{\mathtt{q}-k}\nonumber\\
&= \left(\frac{1}{2}\right)^k \left(\frac{2\mathtt{q}-k}{2(\mathtt{q}-k)}\right)^{\mathtt{q}-k}\nonumber\\
&= \left(\frac{1}{2}\right)^k \left(1+\frac{k}{2(\mathtt{q}-k)}\right)^{\mathtt{q}-k}\nonumber\\
&\le \left(\frac{\sqrt{e}}{2}\right)^k \enspace .\nonumber
\end{align}\nonumber\]

(In the last step, we use the inequality \((1+1/x)^x \le e\), which holds for all \(x>0\).) Since \(\sqrt{e}/{2}< 0.824360636 < 1\), this completes the proof.
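Both numeric facts used in this last step are easy to check directly. The following standalone sketch (the `growthValue` helper is hypothetical, introduced only for this check) evaluates \((1+1/x)^x\) for growing \(x\) and the constant \(c=\sqrt{e}/2\):

```java
public class BoundCheck {
    // (1 + 1/x)^x, which increases toward e as x grows but never exceeds it
    static double growthValue(double x) {
        return Math.pow(1 + 1 / x, x);
    }

    public static void main(String[] args) {
        // 2.0, 2.25, 2.594..., 2.705..., 2.717... -- all below e = 2.71828...
        for (double x : new double[]{1, 2, 10, 100, 1000})
            System.out.println(growthValue(x));
        // the constant c = sqrt(e)/2 from the lemma is strictly less than 1
        double c = Math.sqrt(Math.E) / 2;
        System.out.println(c < 1); // prints true: c is about 0.8244
    }
}
```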

Using Lemma \(\PageIndex{1}\) to prove upper-bounds on the expected running time of \(\mathtt{find(x)}\), \(\mathtt{add(x)}\), and \(\mathtt{remove(x)}\) is now fairly straightforward. Consider the simplest case, where we execute \(\mathtt{find(x)}\) for some value \(\mathtt{x}\) that has never been stored in the `LinearHashTable`. In this case, \(\mathtt{i}=\mathtt{hash(x)}\) is a random value in \(\{0,\ldots,\texttt{t.length}-1\}\) independent of the contents of \(\mathtt{t}\). If \(\mathtt{i}\) is part of a run of length \(k\), then the time it takes to execute the \(\mathtt{find(x)}\) operation is at most \(O(1+k)\). Thus, the expected running time can be upper-bounded by

\[ O\left(1 + \left(\frac{1}{\texttt{t.length}}\right) \sum_{i=1}^{\texttt{t.length}} \sum_{k=0}^{\infty} k\,\Pr\{\text{$\mathtt{i}$ is part of a run of length $k$}\}\right) \enspace . \nonumber\]

Note that each run of length \(k\) contributes to the inner sum \(k\) times for a total contribution of \(k^2\), so the above sum can be rewritten as

\[\begin{align}
&{ } O\left(1 + \left(\frac{1}{\texttt{t.length}}\right)\sum_{i=1}^{\texttt{t.length}}\sum_{k=0}^{\infty} k^2\Pr\{\text{$\mathtt{i}$ starts a run of length $k$}\}\right)\nonumber\\
&\le O\left(1 + \left(\frac{1}{\texttt{t.length}}\right)\sum_{i=1}^{\texttt{t.length}}\sum_{k=0}^{\infty} k^2 p_k\right)\nonumber\\
&= O\left(1 + \sum_{k=0}^{\infty} k^2 p_k\right)\nonumber\\
&= O\left(1 + \sum_{k=0}^{\infty} k^2\cdot O(c^k)\right)\nonumber\\
&= O(1) \enspace .\nonumber
\end{align}\nonumber\]

The last step in this derivation comes from the fact that \(\sum_{k=0}^{\infty} k^2\cdot O(c^k)\) is an exponentially decreasing series.^{3} Therefore, we conclude that the expected running time of the \(\mathtt{find(x)}\) operation for a value \(\mathtt{x}\) that is not contained in a `LinearHashTable` is \(O(1)\).
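This convergence is easy to observe numerically. In the following standalone sketch (the `partialSum` helper is hypothetical), the partial sums of \(\sum k^2 c^k\) with \(c=\sqrt{e}/2\) stabilize quickly:

```java
public class SeriesCheck {
    // Partial sum of k^2 * c^k for k = 0..kmax
    static double partialSum(double c, int kmax) {
        double s = 0;
        for (int k = 0; k <= kmax; k++)
            s += (double) k * k * Math.pow(c, k);
        return s;
    }

    public static void main(String[] args) {
        double c = Math.sqrt(Math.E) / 2; // the constant from the lemma
        // Extending the sum from 200 to 2000 terms barely changes it,
        // so the series converges to a constant and the bound is O(1).
        System.out.println(partialSum(c, 200));
        System.out.println(partialSum(c, 2000));
    }
}
```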

If we ignore the cost of the \(\mathtt{resize()}\) operation, then the above analysis gives us all we need to analyze the cost of operations on a `LinearHashTable`.

First of all, the analysis of \(\mathtt{find(x)}\) given above applies to the \(\mathtt{add(x)}\) operation when \(\mathtt{x}\) is not contained in the table. To analyze the \(\mathtt{find(x)}\) operation when \(\mathtt{x}\) is contained in the table, we need only note that this is the same as the cost of the \(\mathtt{add(x)}\) operation that previously added \(\mathtt{x}\) to the table. Finally, the cost of a \(\mathtt{remove(x)}\) operation is the same as the cost of a \(\mathtt{find(x)}\) operation.

In summary, if we ignore the cost of calls to \(\mathtt{resize()}\), all operations on a `LinearHashTable` run in \(O(1)\) expected time. Accounting for the cost of resize can be done using the same type of amortized analysis performed for the `ArrayStack` data structure in Section 2.1.

## \(\PageIndex{2}\) Summary

The following theorem summarizes the performance of the `LinearHashTable` data structure:

**Theorem \(\PageIndex{1}\).** *A LinearHashTable implements the USet interface. Ignoring the cost of calls to \(\mathtt{resize()}\), a LinearHashTable supports the operations \(\mathtt{add(x)}\), \(\mathtt{remove(x)}\), and \(\mathtt{find(x)}\) in \(O(1)\) expected time per operation.*

*Furthermore, beginning with an empty LinearHashTable, any sequence of \(m\) \(\mathtt{add(x)}\) and \(\mathtt{remove(x)}\) operations results in a total of \(O(m)\) time spent during all calls to \(\mathtt{resize()}\).*

## \(\PageIndex{3}\) Tabulation Hashing

While analyzing the `LinearHashTable` structure, we made a very strong assumption: that for any set of elements, \(\{\mathtt{x}_1,\ldots,\mathtt{x}_\mathtt{n}\}\), the hash values \(\mathtt{hash}(\mathtt{x}_1),\ldots,\mathtt{hash}(\mathtt{x}_\mathtt{n})\) are independently and uniformly distributed over the set \(\{0,\ldots,\texttt{t.length}-1\}\). One way to achieve this is to store a giant array, \(\mathtt{tab}\), of length \(2^{\mathtt{w}}\), where each entry is a random \(\mathtt{w}\)-bit integer, independent of all the other entries. In this way, we could implement \(\mathtt{hash(x)}\) by extracting a \(\mathtt{d}\)-bit integer from \(\mathtt{tab[x.hashCode()]}\):

```java
int idealHash(T x) {
    return tab[x.hashCode() >>> w-d];
}
```

Unfortunately, storing an array of size \(2^{\mathtt{w}}\) is prohibitive in terms of memory usage. The approach used by tabulation hashing is to, instead, treat \(\mathtt{w}\)-bit integers as being comprised of \(\mathtt{w}/\mathtt{r}\) integers, each having only \(\mathtt{r}\) bits. In this way, tabulation hashing only needs \(\mathtt{w}/\mathtt{r}\) arrays, each of length \(2^{\mathtt{r}}\). All the entries in these arrays are independent random \(\mathtt{w}\)-bit integers. To obtain the value of \(\mathtt{hash(x)}\) we split \(\texttt{x.hashCode()}\) up into \(\mathtt{w}/\mathtt{r}\) \(\mathtt{r}\)-bit integers and use these as indices into these arrays. We then combine all these values with the bitwise exclusive-or operator to obtain \(\mathtt{hash(x)}\). The following code shows how this works when \(\mathtt{w}=32\) and \(\mathtt{r}=8\):

```java
int hash(T x) {
    int h = x.hashCode();
    return (tab[0][h&0xff]
            ^ tab[1][(h>>>8)&0xff]
            ^ tab[2][(h>>>16)&0xff]
            ^ tab[3][(h>>>24)&0xff])
            >>> (w-d);
}
```

In this case, \(\mathtt{tab}\) is a two-dimensional array with four columns and \(2^{\mathtt{r}}=2^{8}=256\) rows.
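As a sanity check, every hash value produced this way lies in \(\{0,\ldots,2^{\mathtt{d}}-1\}\). The following standalone sketch (with assumed values \(\mathtt{w}=32\), \(\mathtt{d}=8\), a fixed random seed, and a plain `int` argument in place of `x.hashCode()`) verifies this over a range of inputs:

```java
import java.util.Random;

public class TabulationDemo {
    static final int w = 32, d = 8;             // assumed: table size 2^d
    static final int[][] tab = new int[4][256]; // 4 blocks of r = 8 bits each
    static {
        Random rng = new Random(42);            // fixed seed for reproducibility
        for (int[] row : tab)
            for (int j = 0; j < row.length; j++)
                row[j] = rng.nextInt();         // independent random w-bit entries
    }

    // Tabulation hash of an integer hash code, following the text
    static int hash(int h) {
        return (tab[0][h & 0xff]
              ^ tab[1][(h >>> 8) & 0xff]
              ^ tab[2][(h >>> 16) & 0xff]
              ^ tab[3][(h >>> 24) & 0xff]) >>> (w - d);
    }

    public static void main(String[] args) {
        // Every hash value fits in d bits, i.e., lies in {0, ..., 2^d - 1}
        for (int h = -100000; h <= 100000; h += 7) {
            int v = hash(h);
            if (v < 0 || v >= (1 << d))
                throw new AssertionError("out of range: " + v);
        }
        System.out.println("all hash values in [0, " + (1 << d) + ")");
    }
}
```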

One can easily verify that, for any \(\mathtt{x}\), \(\mathtt{hash(x)}\) is uniformly distributed over \(\{0,\ldots,2^{\mathtt{d}}-1\}\). With a little work, one can even verify that any pair of values have independent hash values. This implies that tabulation hashing could be used in place of multiplicative hashing for the `ChainedHashTable` implementation.

However, it is not true that any set of \(\mathtt{n}\) distinct values gives a set of \(\mathtt{n}\) independent hash values. Nevertheless, when tabulation hashing is used, the bound of Theorem \(\PageIndex{1}\) still holds. References for this are provided at the end of this chapter.

#### Footnotes

^{2}Note that \(p_k\) is greater than the probability that a run of length \(k\) starts at \(\mathtt{i}\), since the definition of \(p_k\) does not include the requirement \(\mathtt{t}[\mathtt{i}-1]=\mathtt{t}[\mathtt{i}+k]=\mathtt{null}\).

^{3}In the terminology of many calculus texts, this sum passes the ratio test: There exists a positive integer \(k_0\) such that, for all \(k\ge k_0\), \(\dfrac{(k+1)^2c^{k+1}}{k^2c^k} < 1\).