8.1: ScapegoatTree - A Binary Search Tree with Partial Rebuilding

Last updated
Save as PDF

Page ID: 8469

$ \newcommand{\vecs}[1]{\overset { \scriptstyle \rightharpoonup} {\mathbf{#1}} } $ $ \newcommand{\vecd}[1]{\overset{-\!-\!\rightharpoonup}{\vphantom{a}\smash {#1}}} $$\newcommand{\id}{\mathrm{id}}$ $ \newcommand{\Span}{\mathrm{span}}$ $ \newcommand{\kernel}{\mathrm{null}\,}$ $ \newcommand{\range}{\mathrm{range}\,}$ $ \newcommand{\RealPart}{\mathrm{Re}}$ $ \newcommand{\ImaginaryPart}{\mathrm{Im}}$ $ \newcommand{\Argument}{\mathrm{Arg}}$ $ \newcommand{\norm}[1]{\| #1 \|}$ $ \newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $ \newcommand{\Span}{\mathrm{span}}$ $\newcommand{\id}{\mathrm{id}}$ $ \newcommand{\Span}{\mathrm{span}}$ $ \newcommand{\kernel}{\mathrm{null}\,}$ $ \newcommand{\range}{\mathrm{range}\,}$ $ \newcommand{\RealPart}{\mathrm{Re}}$ $ \newcommand{\ImaginaryPart}{\mathrm{Im}}$ $ \newcommand{\Argument}{\mathrm{Arg}}$ $ \newcommand{\norm}[1]{\| #1 \|}$ $ \newcommand{\inner}[2]{\langle #1, #2 \rangle}$ $ \newcommand{\Span}{\mathrm{span}}$$\newcommand{\AA}{\unicode[.8,0]{x212B}}$

A ScapegoatTree is a BinarySearchTree that, in addition to keeping track of the number, $\mathtt{n}$, of nodes in the tree also keeps a counter, $\mathtt{q}$, that maintains an upper-bound on the number of nodes.

    int q;

At all times, $\mathtt{n}$ and $\mathtt{q}$ obey the following inequalities:

\[ \mathtt{q}/2 \le \mathtt{n} \le \mathtt{q} \enspace . \nonumber\]

In addition, a ScapegoatTree has logarithmic height; at all times, the height of the scapegoat tree does not exceed

\[\log_{3/2} \mathtt{q} \le \log_{3/2} 2\mathtt{n} < \log_{3/2} \mathtt{n} + 2\enspace .\label{scapegoat-height}\]

Even with this constraint, a ScapegoatTree can look surprisingly unbalanced. The tree in Figure $\PageIndex{1}$ has $\mathtt{q}=\mathtt{n}=10$ and height $5<\log_{3/2}10 \approx 5.679$.

Figure $\PageIndex{1}$: A `ScapegoatTree` with 10 nodes and height 5.

Implementing the $\mathtt{find(x)}$ operation in a ScapegoatTree is done using the standard algorithm for searching in a BinarySearchTree (see Section 6.2). This takes time proportional to the height of the tree which, by ($\ref{scapegoat-height}$) is $O(\log \mathtt{n})$.

To implement the $\mathtt{add(x)}$ operation, we first increment $\mathtt{n}$ and $\mathtt{q}$ and then use the usual algorithm for adding $\mathtt{x}$ to a binary search tree; we search for $\mathtt{x}$ and then add a new leaf $\mathtt{u}$ with $\texttt{u.x}=\mathtt{x}$. At this point, we may get lucky and the depth of $\mathtt{u}$ might not exceed $\log_{3/2}\mathtt{q}$. If so, then we leave well enough alone and don't do anything else.

Unfortunately, it will sometimes happen that $\mathtt{depth(u)} > \log_{3/2}\mathtt{q}$. In this case, we need to reduce the height. This isn't a big job; there is only one node, namely $\mathtt{u}$, whose depth exceeds $\log_{3/2}\mathtt{q}$. To fix $\mathtt{u}$, we walk from $\mathtt{u}$ back up to the root looking for a scapegoat, $\mathtt{w}$. The scapegoat, $\mathtt{w}$, is a very unbalanced node. It has the property that

\[\frac{\texttt{size(w.child)}}{\mathtt{size(w)}} > \frac{2}{3} \enspace ,\label{scapegoat}\]

where $\texttt{w.child}$ is the child of $\mathtt{w}$ on the path from the root to $\mathtt{u}$. We'll very shortly prove that a scapegoat exists. For now, we can take it for granted. Once we've found the scapegoat $\mathtt{w}$, we completely destroy the subtree rooted at $\mathtt{w}$ and rebuild it into a perfectly balanced binary search tree. We know, from ($\ref{scapegoat}$), that, even before the addition of $\mathtt{u}$, $\mathtt{w}$'s subtree was not a complete binary tree. Therefore, when we rebuild $\mathtt{w}$, the height decreases by at least 1 so that the height of the ScapegoatTree is once again at most $\log_{3/2}\mathtt{q}$.

    boolean add(T x) {
        // first do basic insertion keeping track of depth
        Node<T> u = newNode(x);
        int d = addWithDepth(u);
        if (d > log32(q)) {
            // depth exceeded, find scapegoat
            Node<T> w = u.parent;
            while (3*size(w) <= 2*size(w.parent))
                w = w.parent;
            rebuild(w.parent);
        }
        return d >= 0;
    }

Figure $\PageIndex{2}$: Inserting 3.5 into a `ScapegoatTree` increases its height to 6, which violates ($\ref{scapegoat-height}$) since $6 > \log_{3/2} 11 \approx 5.914$. A scapegoat is found at the node containing 5.

If we ignore the cost of finding the scapegoat $\mathtt{w}$ and rebuilding the subtree rooted at $\mathtt{w}$, then the running time of $\mathtt{add(x)}$ is dominated by the initial search, which takes $O(\log \mathtt{q}) = O(\log \mathtt{n})$ time. We will account for the cost of finding the scapegoat and rebuilding using amortized analysis in the next section.

The implementation of $\mathtt{remove(x)}$ in a ScapegoatTree is very simple. We search for $\mathtt{x}$ and remove it using the usual algorithm for removing a node from a BinarySearchTree. (Note that this can never increase the height of the tree.) Next, we decrement $\mathtt{n}$, but leave $\mathtt{q}$ unchanged. Finally, we check if $\mathtt{q} > 2\mathtt{n}$ and, if so, then we rebuild the entire tree into a perfectly balanced binary search tree and set $\mathtt{q}=\mathtt{n}$.

    boolean remove(T x) {
        if (super.remove(x)) {
            if (2*n < q) {
                rebuild(r);
                q = n;
            }
            return true;
        }
        return false;
    }

Again, if we ignore the cost of rebuilding, the running time of the $\mathtt{remove(x)}$ operation is proportional to the height of the tree, and is therefore $O(\log \mathtt{n})$.

$\PageIndex{1}$ Analysis of Correctness and Running-Time

In this section, we analyze the correctness and amortized running time of operations on a ScapegoatTree. We first prove the correctness by showing that, when the $\mathtt{add(x)}$ operation results in a node that violates Condition ($\ref{scapegoat-height}$), then we can always find a scapegoat:

Lemma $\PageIndex{1}$.

Let $\mathtt{u}$ be a node of depth $h>\log_{3/2} \mathtt{q}$ in a ScapegoatTree. Then there exists a node $\mathtt{w}$ on the path from $\mathtt{u}$ to the root such that

\[ \frac{\mathtt{size(w)}}{\mathtt{size(parent(w))}} > 2/3 \enspace . \nonumber\]

Proof. Suppose, for the sake of contradiction, that this is not the case, and

\[ \frac{\mathtt{size(w)}}{\mathtt{size(parent(w))}} \le 2/3 \enspace . \nonumber\]

for all nodes $\mathtt{w}$ on the path from $\mathtt{u}$ to the root. Denote the path from the root to $\mathtt{u}$ as $\mathtt{r}=\mathtt{u}_0,\ldots,\mathtt{u}_h=\mathtt{u}$. Then, we have $\mathtt{size(u}_0\mathtt{)}=\mathtt{n}$, $\mathtt{size(u}_1\mathtt{)}\le\frac{2}{3}\mathtt{n}$, $\mathtt{size(u}_2\mathtt{)}\le\frac{4}{9}\mathtt{n}$ and, more generally,

\[ \mathtt{size(u}_i\mathtt{)}\le\left(\frac{2}{3}\right)^i\mathtt{n} \enspace . \nonumber\]

But this gives a contradiction, since $\mathtt{size(u)}\ge 1$, hence

\[ 1 \le \mathtt{size(u)} \le \left(\frac{2}{3}\right)^{h} \mathtt{n}<\left(\frac{2}{3}\right)^{\log _{3 / 2} \mathtt{q}} \mathtt{n} \leq\left(\frac{2}{3}\right)^{\log _{3 / 2} \mathtt{n}} \mathtt{n}=\left(\frac{1}{\mathtt{n}}\right) \mathtt{n} = 1 \enspace . \nonumber\]

$\qedsymbol$

Next, we analyze the parts of the running time that are not yet accounted for. There are two parts: The cost of calls to $\mathtt{size(u)}$ when searching for scapegoat nodes, and the cost of calls to $\mathtt{rebuild(w)}$ when we find a scapegoat $\mathtt{w}$. The cost of calls to $\mathtt{size(u)}$ can be related to the cost of calls to $\mathtt{rebuild(w)}$, as follows:

Lemma $\PageIndex{2}$.

During a call to $\mathtt{add(x)}$ in a ScapegoatTree, the cost of finding the scapegoat $\mathtt{w}$ and rebuilding the subtree rooted at $\mathtt{w}$ is $O(\mathtt{size(w)})$.

Proof. The cost of rebuilding the scapegoat node $\mathtt{w}$, once we find it, is $O(\mathtt{size(w)})$. When searching for the scapegoat node, we call $\mathtt{size(u)}$ on a sequence of nodes $\mathtt{u}_0,\ldots,\mathtt{u}_k$ until we find the scapegoat $\mathtt{u}_k=\mathtt{w}$. However, since $\mathtt{u}_k$ is the first node in this sequence that is a scapegoat, we know that

\[ \mathtt{size(u}_{i}\mathtt{)} < \frac{2}{3}\mathtt{size(u}_{i+1}\mathtt{)} \nonumber\]

for all $i\in\{0,\ldots,k-2\}$. Therefore, the cost of all calls to $\mathtt{size(u)}$ is

\[\begin{align}
O\left( \sum_{i=0}^k \mathtt{size(u}_{k-i}\mathtt{)} \right)
&=O\left(\mathtt{size(u}_k\mathtt{)}+\sum_{i=0}^{k-1} \mathtt{size(u}_{k-i-1}\mathtt{)}\right) \nonumber\\
&=O\left(\mathtt{size(u}_k\mathtt{)}+ \sum_{i=0}^{k-1}\left(\frac{2}{3}\right)^i\mathtt{size(u}_{k}\mathtt{)}\right) \nonumber\\
&=O\left(\mathtt{size(u}_k\mathtt{)}\left(1+\sum_{i=0}^{k-1} \left(\frac{2}{3}\right)^i\right)\right) \nonumber\\
&=O(\mathtt{size(u}_k\mathtt{)}) = O(\mathtt{size(w)}) \enspace , \nonumber\\
\end{align}\nonumber\]

where the last line follows from the fact that the sum is a geometrically decreasing series. $\qedsymbol$

All that remains is to prove an upper-bound on the cost of all calls to $\mathtt{rebuild(u)}$ during a sequence of $m$ operations:

Lemma $\PageIndex{3}$.

Starting with an empty ScapegoatTree any sequence of $m$ $\mathtt{add(x)}$ and $\mathtt{remove(x)}$ operations causes at most $O(m\log m)$ time to be used by $\mathtt{rebuild(u)}$ operations.

Proof. To prove this, we will use a credit scheme. We imagine that each node stores a number of credits. Each credit can pay for some constant, $c$, units of time spent rebuilding. The scheme gives out a total of $O(m\log m)$ credits and every call to $\mathtt{rebuild(u)}$ is paid for with credits stored at $\mathtt{u}$.

During an insertion or deletion, we give one credit to each node on the path to the inserted node, or deleted node, $\mathtt{u}$. In this way we hand out at most $\log_{3/2}\mathtt{q}\le \log_{3/2}m$ credits per operation. During a deletion we also store an additional credit "on the side." Thus, in total we give out at most $O(m\log m)$ credits. All that remains is to show that these credits are sufficient to pay for all calls to $\mathtt{rebuild(u)}$.

If we call $\mathtt{rebuild(u)}$ during an insertion, it is because $\mathtt{u}$ is a scapegoat. Suppose, without loss of generality, that

\[ \frac{\texttt{size(u.left)}}{\mathtt{size(u)}} > \frac{2}{3} \enspace . \nonumber\]

Using the fact that

\[ \mathtt{size(u)} = 1 + \texttt{size(u.left)} + \texttt{size(u.right)} \nonumber\]

we deduce that

\[ \frac{1}{2}\texttt{size(u.left)} > \texttt{size(u.right)} \enspace \nonumber\]

and therefore

\[ \texttt{size(u.left)} - \texttt{size(u.right)}>\frac{1}{2}\texttt{size(u.left)} > \frac{1}{3}\mathtt{size(u)} \enspace . \nonumber\]

Now, the last time a subtree containing $\mathtt{u}$ was rebuilt (or when $\mathtt{u}$ was inserted, if a subtree containing $\mathtt{u}$ was never rebuilt), we had

\[ \texttt{size(u.left)} - \texttt{size(u.right)} \le 1 \enspace . \nonumber\]

Therefore, the number of $\mathtt{add(x)}$ or $\mathtt{remove(x)}$ operations that have affected $\texttt{u.left}$ or $\texttt{u.right}$ since then is at least

\[ \frac{1}{3}\mathtt{size(u)} - 1 \enspace . \nonumber\]

and there are therefore at least this many credits stored at $\mathtt{u}$ that are available to pay for the $O(\mathtt{size(u)})$ time it takes to call $\mathtt{rebuild(u)}$.

If we call $\mathtt{rebuild(u)}$ during a deletion, it is because $\mathtt{q} > 2\mathtt{n}$. In this case, we have $\mathtt{q}-\mathtt{n}> \mathtt{n}$ credits stored "on the side," and we use these to pay for the $O(\mathtt{n})$ time it takes to rebuild the root. This completes the proof. $\qedsymbol$

$\PageIndex{2}$ Summary

The following theorem summarizes the performance of the ScapegoatTree data structure:

Theorem $\PageIndex{1}$.

A ScapegoatTree implements the SSet interface. Ignoring the cost of $\mathtt{rebuild(u)}$ operations, a ScapegoatTree supports the operations $\mathtt{add(x)}$, $\mathtt{remove(x)}$, and $\mathtt{find(x)}$ in $O(\log \mathtt{n})$ time per operation.

Furthermore, beginning with an empty ScapegoatTree, any sequence of $m$ $\mathtt{add(x)}$ and $\mathtt{remove(x)}$ operations results in a total of $O(m\log m)$ time spent during all calls to $\mathtt{rebuild(u)}$.