5.2: LinearHashTable - Linear Probing

Last updated
Save as PDF

Page ID: 8457

$ \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } $ $ \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} $$\newcommand{\id}{\mathrm{id}}$ $ \newcommand{\Span}{\mathrm{span}}$ $ \newcommand{\kernel}{\mathrm{null}\,}$ $ \newcommand{\range}{\mathrm{range}\,}$ $ \newcommand{\RealPart}{\mathrm{Re}}$ $ \newcommand{\ImaginaryPart}{\mathrm{Im}}$ $ \newcommand{\Argument}{\mathrm{Arg}}$ $ \newcommand{\norm}[1]{\| #1 \|}$ $ \newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $ \newcommand{\Span}{\mathrm{span}}$ $\newcommand{\id}{\mathrm{id}}$ $ \newcommand{\Span}{\mathrm{span}}$ $ \newcommand{\kernel}{\mathrm{null}\,}$ $ \newcommand{\range}{\mathrm{range}\,}$ $ \newcommand{\RealPart}{\mathrm{Re}}$ $ \newcommand{\ImaginaryPart}{\mathrm{Im}}$ $ \newcommand{\Argument}{\mathrm{Arg}}$ $ \newcommand{\norm}[1]{\| #1 \|}$ $ \newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $ \newcommand{\Span}{\mathrm{span}}$$\newcommand{\AA}{\unicode[.8,0]{x212B}}$

The ChainedHashTable data structure uses an array of lists, where the $\mathtt{i}$th list stores all elements $\mathtt{x}$ such that $\mathtt{hash(x)}=\mathtt{i}$. An alternative, called open addressing is to store the elements directly in an array, $\mathtt{t}$, with each array location in $\mathtt{t}$ storing at most one value. This approach is taken by the LinearHashTable described in this section. In some places, this data structure is described as open addressing with linear probing.

The main idea behind a LinearHashTable is that we would, ideally, like to store the element $\mathtt{x}$ with hash value $\mathtt{i=hash(x)}$ in the table location $\mathtt{t[i]}$. If we cannot do this (because some element is already stored there) then we try to store it at location $\mathtt{t}[(\mathtt{i}+1)\bmod\texttt{t.length}]$; if that's not possible, then we try $\mathtt{t}[(\mathtt{i}+2)\bmod\texttt{t.length}]$, and so on, until we find a place for $\mathtt{x}$.

There are three types of entries stored in $\mathtt{t}$:

data values: actual values in the USet that we are representing;
$\mathtt{null}$ values: at array locations where no data has ever been stored; and
$\mathtt{del}$ values: at array locations where data was once stored but that has since been deleted.

In addition to the counter, $\mathtt{n}$, that keeps track of the number of elements in the LinearHashTable, a counter, $\mathtt{q}$, keeps track of the number of elements of Types 1 and 3. That is, $\mathtt{q}$ is equal to $\mathtt{n}$ plus the number of $\mathtt{del}$ values in $\mathtt{t}$. To make this work efficiently, we need $\mathtt{t}$ to be considerably larger than $\mathtt{q}$, so that there are lots of $\mathtt{null}$ values in $\mathtt{t}$. The operations on a LinearHashTable therefore maintain the invariant that $\texttt{t.length}\ge 2\mathtt{q}$.

To summarize, a LinearHashTable contains an array, $\mathtt{t}$, that stores data elements, and integers $\mathtt{n}$ and $\mathtt{q}$ that keep track of the number of data elements and non-$\mathtt{null}$ values of $\mathtt{t}$, respectively. Because many hash functions only work for table sizes that are a power of 2, we also keep an integer $\mathtt{d}$ and maintain the invariant that $\texttt{t.length}=2^\mathtt{d}$.

    T[] t;   // the table
    int n;   // the size
    int d;   // t.length = 2^d
    int q;   // number of non-null entries in t

The $\mathtt{find(x)}$ operation in a LinearHashTable is simple. We start at array entry $\mathtt{t[i]}$ where $\mathtt{i}=\mathtt{hash(x)}$ and search entries $\mathtt{t[i]}$, $\mathtt{t}[(\mathtt{i}+1)\bmod \texttt{t.length}]$, $\mathtt{t}[(\mathtt{i}+2)\bmod \texttt{t.length}]$, and so on, until we find an index $\mathtt{i'}$ such that, either, $\mathtt{t[i']=x}$, or $\mathtt{t[i']=null}$. In the former case we return $\mathtt{t[i']}$. In the latter case, we conclude that $\mathtt{x}$ is not contained in the hash table and return $\mathtt{null}$.

    T find(T x) {
        int i = hash(x);
        while (t[i] != null) {
            if (t[i] != del && x.equals(t[i])) return t[i];
            i = (i == t.length-1) ? 0 : i + 1; // increment i
        }
        return null;
    }

The $\mathtt{add(x)}$ operation is also fairly easy to implement. After checking that $\mathtt{x}$ is not already stored in the table (using $\mathtt{find(x)}$), we search $\mathtt{t[i]}$, $\mathtt{t}[(\mathtt{i}+1)\bmod \texttt{t.length}]$, $\mathtt{t}[(\mathtt{i}+2)\bmod \texttt{t.length}]$, and so on, until we find a $\mathtt{null}$ or $\mathtt{del}$ and store $\mathtt{x}$ at that location, increment $\mathtt{n}$, and $\mathtt{q}$, if appropriate.

    boolean add(T x) {
        if (find(x) != null) return false;
        if (2*(q+1) > t.length) resize(); // max 50% occupancy
        int i = hash(x);
        while (t[i] != null && t[i] != del)
            i = (i == t.length-1) ? 0 : i + 1; // increment i
        if (t[i] == null) q++;
        n++;
        t[i] = x;
        return true;
    }

By now, the implementation of the $\mathtt{remove(x)}$ operation should be obvious. We search $\mathtt{t[i]}$, $\mathtt{t}[(\mathtt{i}+1)\bmod \texttt{t.length}]$, $\mathtt{t}[(\mathtt{i}+2)\bmod \texttt{t.length}]$, and so on until we find an index $\mathtt{i'}$ such that $\mathtt{t[i']=x}$ or $\mathtt{t[i']=null}$. In the former case, we set $\mathtt{t[i']=del}$ and return $\mathtt{true}$. In the latter case we conclude that $\mathtt{x}$ was not stored in the table (and therefore cannot be deleted) and return $\mathtt{false}$.

    T remove(T x) {
        int i = hash(x);
        while (t[i] != null) {
            T y = t[i];
            if (y != del && x.equals(y)) { 
                t[i] = del;
                n--;
                if (8*n < t.length) resize(); // min 12.5% occupancy
                return y;
            }
            i = (i == t.length-1) ? 0 : i + 1;  // increment i
        }
        return null;
    }

The correctness of the $\mathtt{find(x)}$, $\mathtt{add(x)}$, and $\mathtt{remove(x)}$ methods is easy to verify, though it relies on the use of $\mathtt{del}$ values. Notice that none of these operations ever sets a non-$\mathtt{null}$ entry to $\mathtt{null}$. Therefore, when we reach an index $\mathtt{i'}$ such that $\mathtt{t[i']=null}$, this is a proof that the element, $\mathtt{x}$, that we are searching for is not stored in the table; $\mathtt{t[i']}$ has always been $\mathtt{null}$, so there is no reason that a previous $\mathtt{add(x)}$ operation would have proceeded beyond index $\mathtt{i'}$.

The $\mathtt{resize()}$ method is called by $\mathtt{add(x)}$ when the number of non-$\mathtt{null}$ entries exceeds $\texttt{t.length}/2$ or by $\mathtt{remove(x)}$ when the number of data entries is less than $\texttt{t.length/8}$. The $\mathtt{resize()}$ method works like the $\mathtt{resize()}$ methods in other array-based data structures. We find the smallest non-negative integer $\mathtt{d}$ such that $2^{\mathtt{d}} \ge 3\mathtt{n}$. We reallocate the array $\mathtt{t}$ so that it has size $2^{\mathtt{d}}$, and then we insert all the elements in the old version of $\mathtt{t}$ into the newly-resized copy of $\mathtt{t}$. While doing this, we reset $\mathtt{q}$ equal to $\mathtt{n}$ since the newly-allocated $\mathtt{t}$ contains no $\mathtt{del}$ values.

    void resize() {
        d = 1;
        while ((1<<d) < 3*n) d++;
        T[] told = t;
        t = newArray(1<<d);
        q = n;
        // insert everything from told
        for (int k = 0; k < told.length; k++) {
            if (told[k] != null && told[k] != del) {
                int i = hash(told[k]);
                while (t[i] != null) 
                    i = (i == t.length-1) ? 0 : i + 1;
                t[i] = told[k];
            }
        }
    }

$\PageIndex{1}$ Analysis of Linear Probing

Notice that each operation, $\mathtt{add(x)}$, $\mathtt{remove(x)}$, or $\mathtt{find(x)}$, finishes as soon as (or before) it discovers the first $\mathtt{null}$ entry in $\mathtt{t}$. The intuition behind the analysis of linear probing is that, since at least half the elements in $\mathtt{t}$ are equal to $\mathtt{null}$, an operation should not take long to complete because it will very quickly come across a $\mathtt{null}$ entry. We shouldn't rely too heavily on this intuition, though, because it would lead us to (the incorrect) conclusion that the expected number of locations in $\mathtt{t}$ examined by an operation is at most 2.

For the rest of this section, we will assume that all hash values are independently and uniformly distributed in $\{0,\ldots,\texttt{t.length}-1\}$. This is not a realistic assumption, but it will make it possible for us to analyze linear probing. Later in this section we will describe a method, called tabulation hashing, that produces a hash function that is "good enough" for linear probing. We will also assume that all indices into the positions of $\mathtt{t}$ are taken modulo $\texttt{t.length}$, so that $\mathtt{t[i]}$ is really a shorthand for $\mathtt{t}[\mathtt{i}\bmod\texttt{t.length}]$.

We say that a run of length $k$ that starts at $\mathtt{i}$ occurs when all the table entries $\mathtt{t[i]}, \mathtt{t[i+1]},\ldots,\mathtt{t}[\mathtt{i}+k-1]$ are non-$\mathtt{null}$ and $\mathtt{t}[\mathtt{i}-1]=\mathtt{t}[\mathtt{i}+k]=\mathtt{null}$. The number of non-$\mathtt{null}$ elements of $\mathtt{t}$ is exactly $\mathtt{q}$ and the $\mathtt{add(x)}$ method ensures that, at all times, $\mathtt{q}\le\texttt{t.length}/2$. There are $\mathtt{q}$ elements $\mathtt{x}_1,\ldots,\mathtt{x}_{\mathtt{q}}$ that have been inserted into $\mathtt{t}$ since the last $\mathtt{rebuild()}$ operation. By our assumption, each of these has a hash value, $\mathtt{hash}(\mathtt{x}_j)$, that is uniform and independent of the rest. With this setup, we can prove the main lemma required to analyze linear probing.

Lemma $\PageIndex{1}$.

Fix a value $\mathtt{i}\in\{0,\ldots,\texttt{t.length}-1\}$. Then the probability that a run of length $k$ starts at $\mathtt{i}$ is $O(c^k)$ for some constant $0<c<1$.

Proof. If a run of length $k$ starts at $\mathtt{i}$, then there are exactly $k$ elements $\mathtt{x}_j$ such that $\mathtt{hash}(\mathtt{x}_j)\in\{\mathtt{i},\ldots,\mathtt{i}+k-1\}$. The probability that this occurs is exactly

\[ p_k =
\binom{\mathtt{q}}{k}
\left(\frac{k}{\texttt{t.length}}\right)^{k}
\left(\dfrac{\texttt{t.length}-k}{\texttt{t.length}}\right)^{\mathtt{q}-k}
\enspace , \nonumber\]

since, for each choice of $k$ elements, these $k$ elements must hash to one of the $k$ locations and the remaining $\mathtt{q}-k$ elements must hash to the other $\texttt{t.length}-k$ table locations.²

In the following derivation we will cheat a little and replace $r!$ with $(r/e)^r$. Stirling's Approximation (Section 1.3.2) shows that this is only a factor of $O(\sqrt{r})$ from the truth. This is just done to make the derivation simpler; Exercise 5.4 asks the reader to redo the calculation more rigorously using Stirling's Approximation in its entirety.

The value of $p_k$ is maximized when $\texttt{t.length}$ is minimum, and the data structure maintains the invariant that $\texttt{t.length} \ge 2\mathtt{q}$, so

\[\begin{align}
p_k
&\le \dbinom{\mathtt{q}}{k} \left(\dfrac{k}{2\mathtt{q}}\right)^k \left(\dfrac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}\nonumber\\
& = \left(\dfrac{\mathtt{q}!}{(\mathtt{q}-k)!k!}\right) \left(\dfrac{k}{2\mathtt{q}}\right)^k \left(\dfrac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k} \nonumber\\
& \approx \left(\dfrac{\mathtt{q}^{\mathtt{q}}}{(\mathtt{q}-k)^{\mathtt{q}-k}k^k} \right ) \left(\dfrac{k}{2\mathtt{q}} \right )^k \left(\dfrac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}\qquad\text{[Stirling's approximation]}\nonumber\\
& = \left(\dfrac{\mathtt{q}^{k}\mathtt{q}^{\mathtt{q}-k}}{(\mathtt{q}-k)^{\mathtt{q}-k}k^k}\right ) \left(\dfrac{k}{2\mathtt{q}} \right )^k \left(\dfrac{2\mathtt{q}-k}{2\mathtt{q}}\right)^{\mathtt{q}-k}\nonumber\\
& = \left( \dfrac{\mathtt{q}k}{2\mathtt{q}k} \right)^k \left( \dfrac{\mathtt{q}(2\mathtt{q}-k)}{2\mathtt{q}(\mathtt{q}-k)} \right)^{\mathtt{q}-k}\nonumber\\
& = \left(\dfrac{1}{2}\right)^k \left(\dfrac{(2\mathtt{q}-k)}{2(\mathtt{q}-k)}\right)^{\mathtt{q}-k}\nonumber\\
& = \left(\dfrac{1}{2}\right)^k \left(1+\dfrac{k}{2(\mathtt{q}-k)}\right)^{\mathtt{q}-k}\nonumber\\
& \le \left(\dfrac{\sqrt{e}}{2}\right)^k \enspace .\nonumber
\end{align}\nonumber\]

(In the last step, we use the inequality $(1+1/x)^x \le e$, which holds for all $x>0$.) Since $\sqrt{e}/{2}< 0.824360636 < 1$, this completes the proof. $\qedsymbol$

Using Lemma $\PageIndex{1}$ to prove upper-bounds on the expected running time of $\mathtt{find(x)}$, $\mathtt{add(x)}$, and $\mathtt{remove(x)}$ is now fairly straightforward. Consider the simplest case, where we execute $\mathtt{find(x)}$ for some value $\mathtt{x}$ that has never been stored in the LinearHashTable. In this case, $\mathtt{i}=\mathtt{hash(x)}$ is a random value in $\{0,\ldots,\texttt{t.length}-1\}$ independent of the contents of $\mathtt{t}$. If $\mathtt{i}$ is part of a run of length $k$, then the time it takes to execute the $\mathtt{find(x)}$ operation is at most $O(1+k)$. Thus, the expected running time can be upper-bounded by

\[ O \left(1 + \left(\frac{1}{\texttt{t.length}}\right)
\sum_{i=1}^{\texttt{t.length}}
\sum_{k=0}^{\infty}k\,\text{Pr}
\{ \text{$\mathtt{i}$ is part of a run of length $k$} \}
\right) \enspace . \nonumber\]

Note that each run of length $k$ contributes to the inner sum $k$ times for a total contribution of $k^2$, so the above sum can be rewritten as

\[\begin{align}
&{ } O \left( 1 + \left(\frac{1}{\texttt{t.length}} \right)\sum_{i=1}^{\texttt{t.length}}\sum_{k=0}^{\infty} k^2\Pr\{\mbox{$\mathtt{i}$ starts a run of length $k$}\}\right)\nonumber\\
&\le O\left(1 + \left(\dfrac{1}{\texttt{t.length}}\right)\sum_{i=1}^{\texttt{t.length}}\sum_{k=0}^{\infty} k^2p_k\right)\nonumber\\
&= O\left(1 + \sum_{k=0}^{\infty} k^2p_k\right)\nonumber\\
&= O\left(1 + \sum_{k=0}^{\infty} k^2\cdot O(c^k)\right)\nonumber\\
&= O(1) \enspace .\nonumber
\end{align}\nonumber\]

The last step in this derivation comes from the fact that $\sum_{k=0}^{\infty} k^2\cdot O(c^k)$ is an exponentially decreasing series.³ Therefore, we conclude that the expected running time of the $\mathtt{find(x)}$ operation for a value $\mathtt{x}$ that is not contained in a LinearHashTable is $O(1)$.

If we ignore the cost of the $\mathtt{resize()}$ operation, then the above analysis gives us all we need to analyze the cost of operations on a LinearHashTable.

First of all, the analysis of $\mathtt{find(x)}$ given above applies to the $\mathtt{add(x)}$ operation when $\mathtt{x}$ is not contained in the table. To analyze the $\mathtt{find(x)}$ operation when $\mathtt{x}$ is contained in the table, we need only note that this is the same as the cost of the $\mathtt{add(x)}$ operation that previously added $\mathtt{x}$ to the table. Finally, the cost of a $\mathtt{remove(x)}$ operation is the same as the cost of a $\mathtt{find(x)}$ operation.

In summary, if we ignore the cost of calls to $\mathtt{resize()}$, all operations on a LinearHashTable run in $O(1)$ expected time. Accounting for the cost of resize can be done using the same type of amortized analysis performed for the ArrayStack data structure in Section 2.1.

$\PageIndex{2}$ Summary

The following theorem summarizes the performance of the LinearHashTable data structure:

Theorem $\PageIndex{1}$.

A LinearHashTable implements the USet interface. Ignoring the cost of calls to $\mathtt{resize()}$, a LinearHashTable supports the operations $\mathtt{add(x)}$, $\mathtt{remove(x)}$, and $\mathtt{find(x)}$ in $O(1)$ expected time per operation.

Furthermore, beginning with an empty LinearHashTable, any sequence of $m$ $\mathtt{add(x)}$ and $\mathtt{remove(x)}$ operations results in a total of $O(m)$ time spent during all calls to $\mathtt{resize()}$.

$\PageIndex{3}$ Tabulation Hashing

While analyzing the LinearHashTable structure, we made a very strong assumption: That for any set of elements, $\{\mathtt{x}_1,\ldots,\mathtt{x}_\mathtt{n}\}$, the hash values $\mathtt{hash}($x $_1),\ldots,\mathtt{hash}(\mathtt{x}_\mathtt{n})$ are independently and uniformly distributed over the set $\{0,\ldots,\texttt{t.length}-1\}$. One way to achieve this is to store a giant array, $\mathtt{tab}$, of length $2^{\mathtt{w}}$, where each entry is a random $\mathtt{w}$-bit integer, independent of all the other entries. In this way, we could implement $\mathtt{hash(x)}$ by extracting a $\mathtt{d}$-bit integer from $\mathtt{tab[x.hashCode()]}$:

    int idealHash(T x) {
        return tab[x.hashCode() >>> w-d];
    }

Unfortunately, storing an array of size $2^{\mathtt{w}}$ is prohibitive in terms of memory usage. The approach used by tabulation hashing is to, instead, treat $\mathtt{w}$-bit integers as being comprised of $\mathtt{w}/\mathtt{r}$ integers, each having only $\mathtt{r}$ bits. In this way, tabulation hashing only needs $\mathtt{w}/\mathtt{r}$ arrays each of length $2^{\mathtt{r}}$. All the entries in these arrays are independent random $\mathtt{w}$-bit integers. To obtain the value of $\mathtt{hash(x)}$ we split $\texttt{x.hashCode()}$ up into $\mathtt{w}/\mathtt{r}$ $\mathtt{r}$-bit integers and use these as indices into these arrays. We then combine all these values with the bitwise exclusive-or operator to obtain $\mathtt{hash(x)}$. The following code shows how this works when $\mathtt{w}=32$ and $\mathtt{r}=4$:

    int hash(T x) {
        int h = x.hashCode();
        return (tab[0][h&0xff] 
                 ^ tab[1][(h>>>8)&0xff]
                 ^ tab[2][(h>>>16)&0xff] 
                 ^ tab[3][(h>>>24)&0xff])
                  >>> (w-d);
    }

In this case, $\mathtt{tab}$ is a two-dimensional array with four columns and $2^{32/4}=256$ rows.

One can easily verify that, for any $\mathtt{x}$, $\mathtt{hash(x)}$ is uniformly distributed over $\{0,\ldots,2^{\mathtt{d}}-1\}$. With a little work, one can even verify that any pair of values have independent hash values. This implies tabulation hashing could be used in place of multiplicative hashing for the ChainedHashTable implementation.

However, it is not true that any set of $\mathtt{n}$ distinct values gives a set of $\mathtt{n}$ independent hash values. Nevertheless, when tabulation hashing is used, the bound of Theorem $\PageIndex{1}$ still holds. References for this are provided at the end of this chapter.

Footnotes

²Note that $p_k$ is greater than the probability that a run of length $k$ starts at $\mathtt{i}$, since the definition of $p_k$ does not include the requirement $\mathtt{t}[\mathtt{i}-1]=\mathtt{t}[\mathtt{i}+k]=\mathtt{null}$.

³In the terminology of many calculus texts, this sum passes the ratio test: There exists a positive integer $k_0$ such that, for all $k\ge k_0$, $\dfrac{(k+1)^2c^{k+1}}{k^2c^k} < 1$.

\(\PageIndex{1}\) Analysis of Linear Probing

\(\PageIndex{2}\) Summary

\(\PageIndex{3}\) Tabulation Hashing

Footnotes