# 8.1: ScapegoatTree - A Binary Search Tree with Partial Rebuilding

A `ScapegoatTree` is a `BinarySearchTree` that, in addition to keeping track of the number, \(\mathtt{n}\), of nodes in the tree, also keeps a counter, \(\mathtt{q}\), that maintains an upper bound on the number of nodes.

```java
int q;
```

At all times, \(\mathtt{n}\) and \(\mathtt{q}\) obey the following inequalities:

\[ \mathtt{q}/2 \le \mathtt{n} \le \mathtt{q} \enspace . \nonumber\]

In addition, a `ScapegoatTree` has logarithmic height; at all times, the height of the scapegoat tree does not exceed

\[\log_{3/2} \mathtt{q} \le \log_{3/2} 2\mathtt{n} < \log_{3/2} \mathtt{n} + 2\enspace .\label{scapegoat-height}\]

Even with this constraint, a `ScapegoatTree` can look surprisingly unbalanced. The tree in Figure \(\PageIndex{1}\) has \(\mathtt{q}=\mathtt{n}=10\) and height \(5<\log_{3/2}10 \approx 5.679\).
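Since \(\log_{3/2}\) is not a built-in function, an implementation typically computes it with the change-of-base formula \(\log_{3/2}\mathtt{q} = \ln\mathtt{q}/\ln(3/2)\). The following is a minimal sketch of such a helper; the name `log32` matches the helper called in the \(\mathtt{add(x)}\) code, but the floor-based rounding here is an assumption, not necessarily the book's exact implementation:

```java
// Sketch of a log_{3/2} helper via the change-of-base formula.
// The floor-based rounding is an assumption; because depths are
// integers, "d > floor(log_{3/2} q)" is equivalent to
// "d > log_{3/2} q".
class Log32 {
    static int log32(int q) {
        return (int) Math.floor(Math.log(q) / Math.log(1.5));
    }
}
```

For the tree in Figure \(\PageIndex{1}\), \(\mathtt{q}=10\) gives \(\log_{3/2}10\approx 5.679\), so a node of depth 6 or more would violate the height constraint.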

Implementing the \(\mathtt{find(x)}\) operation in a `ScapegoatTree` is done using the standard algorithm for searching in a `BinarySearchTree` (see Section 6.2). This takes time proportional to the height of the tree, which, by (\(\ref{scapegoat-height}\)), is \(O(\log \mathtt{n})\).

To implement the \(\mathtt{add(x)}\) operation, we first increment \(\mathtt{n}\) and \(\mathtt{q}\) and then use the usual algorithm for adding \(\mathtt{x}\) to a binary search tree; we search for \(\mathtt{x}\) and then add a new leaf \(\mathtt{u}\) with \(\texttt{u.x}=\mathtt{x}\). At this point, we may get lucky and the depth of \(\mathtt{u}\) might not exceed \(\log_{3/2}\mathtt{q}\). If so, then we leave well enough alone and don't do anything else.

Unfortunately, it will sometimes happen that \(\mathtt{depth(u)} > \log_{3/2}\mathtt{q}\). In this case, we need to reduce the height. This isn't a big job; there is only one node, namely \(\mathtt{u}\), whose depth exceeds \(\log_{3/2}\mathtt{q}\). To fix \(\mathtt{u}\), we walk from \(\mathtt{u}\) back up to the root looking for a scapegoat, \(\mathtt{w}\). The scapegoat, \(\mathtt{w}\), is a very unbalanced node. It has the property that

\[\frac{\texttt{size(w.child)}}{\mathtt{size(w)}} > \frac{2}{3} \enspace ,\label{scapegoat}\]

where \(\texttt{w.child}\) is the child of \(\mathtt{w}\) on the path from the root to \(\mathtt{u}\). We'll very shortly prove that a scapegoat exists. For now, we can take it for granted. Once we've found the scapegoat \(\mathtt{w}\), we completely destroy the subtree rooted at \(\mathtt{w}\) and rebuild it into a perfectly balanced binary search tree. We know, from (\(\ref{scapegoat}\)), that, even before the addition of \(\mathtt{u}\), \(\mathtt{w}\)'s subtree was not a complete binary tree. Therefore, when we rebuild \(\mathtt{w}\), the height decreases by at least 1 so that the height of the `ScapegoatTree` is once again at most \(\log_{3/2}\mathtt{q}\).

```java
boolean add(T x) {
    // first do basic insertion keeping track of depth
    Node<T> u = newNode(x);
    int d = addWithDepth(u);
    if (d > log32(q)) {
        // depth exceeded, find scapegoat
        Node<T> w = u.parent;
        while (3*size(w) <= 2*size(w.parent))
            w = w.parent;
        rebuild(w.parent);
    }
    return d >= 0;
}
```

If we ignore the cost of finding the scapegoat \(\mathtt{w}\) and rebuilding the subtree rooted at \(\mathtt{w}\), then the running time of \(\mathtt{add(x)}\) is dominated by the initial search, which takes \(O(\log \mathtt{q}) = O(\log \mathtt{n})\) time. We will account for the cost of finding the scapegoat and rebuilding using amortized analysis in the next section.
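The rebuilding step itself can be done in \(O(\mathtt{size(w)})\) time: flatten the subtree in-order into an array, then rebuild by repeatedly taking the middle element as the root. Here is a self-contained sketch of that idea; the node class and method names are illustrative and differ from the book's `Node<T>` machinery:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of rebuilding a subtree into a perfectly balanced BST.
// Names (Node, flatten, buildBalanced) are illustrative only.
class RebuildSketch {
    static class Node {
        int x;
        Node left, right;
        Node(int x) { this.x = x; }
    }

    // In-order traversal collects the subtree's values in sorted order.
    static void flatten(Node u, List<Integer> out) {
        if (u == null) return;
        flatten(u.left, out);
        out.add(u.x);
        flatten(u.right, out);
    }

    // The middle element becomes the root; recurse on the two halves.
    static Node buildBalanced(List<Integer> a, int lo, int hi) {
        if (lo > hi) return null;
        int mid = (lo + hi) / 2;
        Node u = new Node(a.get(mid));
        u.left = buildBalanced(a, lo, mid - 1);
        u.right = buildBalanced(a, mid + 1, hi);
        return u;
    }

    static Node rebuild(Node u) {
        List<Integer> a = new ArrayList<>();
        flatten(u, a);
        return buildBalanced(a, 0, a.size() - 1);
    }

    // Height in edges; an empty tree has height -1.
    static int height(Node u) {
        if (u == null) return -1;
        return 1 + Math.max(height(u.left), height(u.right));
    }
}
```

Both phases visit each of the \(\mathtt{size(w)}\) nodes a constant number of times, which is where the \(O(\mathtt{size(w)})\) cost in the analysis below comes from.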

The implementation of \(\mathtt{remove(x)}\) in a `ScapegoatTree` is very simple. We search for \(\mathtt{x}\) and remove it using the usual algorithm for removing a node from a `BinarySearchTree`. (Note that this can never increase the height of the tree.) Next, we decrement \(\mathtt{n}\), but leave \(\mathtt{q}\) unchanged. Finally, we check if \(\mathtt{q} > 2\mathtt{n}\) and, if so, then we rebuild the entire tree into a perfectly balanced binary search tree and set \(\mathtt{q}=\mathtt{n}\).

```java
boolean remove(T x) {
    if (super.remove(x)) {
        if (2*n < q) {
            rebuild(r);
            q = n;
        }
        return true;
    }
    return false;
}
```

Again, if we ignore the cost of rebuilding, the running time of the \(\mathtt{remove(x)}\) operation is proportional to the height of the tree, and is therefore \(O(\log \mathtt{n})\).
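To see how the two counters interact, here is a toy bookkeeping demo; it maintains only \(\mathtt{n}\) and \(\mathtt{q}\) (no actual tree) and records when the deletion rule \(2\mathtt{n} < \mathtt{q}\) triggers a full rebuild. The class and field names are illustrative:

```java
// Toy demo of the n/q bookkeeping (no actual tree).
// Illustrative names; counts how often the deletion rule fires.
class Bookkeeping {
    int n = 0, q = 0, rebuilds = 0;

    void add() { n++; q++; }      // add(x) increments both counters

    void remove() {               // remove(x) decrements only n ...
        n--;
        if (2*n < q) {            // ... until n falls below q/2,
            rebuilds++;           // at which point the whole tree is
            q = n;                // rebuilt and q is reset to n
        }
    }
}
```

Starting from \(\mathtt{n}=\mathtt{q}=16\), the first 8 removals leave \(\mathtt{n}=8=\mathtt{q}/2\) without violating the invariant; only the 9th removal triggers a rebuild, so rebuilds of the whole tree cannot happen too often.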

## \(\PageIndex{1}\) Analysis of Correctness and Running-Time

In this section, we analyze the correctness and amortized running time of operations on a `ScapegoatTree`. We first prove the correctness by showing that, when the \(\mathtt{add(x)}\) operation results in a node that violates Condition (\(\ref{scapegoat-height}\)), then we can always find a scapegoat:

Lemma \(\PageIndex{1}\).

*Let \(\mathtt{u}\) be a node of depth \(h>\log_{3/2} \mathtt{q}\) in a `ScapegoatTree`. Then there exists a node \(\mathtt{w}\) on the path from \(\mathtt{u}\) to the root such that*

\[ \frac{\mathtt{size(w)}}{\mathtt{size(parent(w))}} > 2/3 \enspace . \nonumber\]

*Proof*. Suppose, for the sake of contradiction, that this is not the case, and that

\[ \frac{\mathtt{size(w)}}{\mathtt{size(parent(w))}} \le 2/3 \nonumber\]

for all nodes \(\mathtt{w}\) on the path from \(\mathtt{u}\) to the root. Denote the path from the root to \(\mathtt{u}\) as \(\mathtt{r}=\mathtt{u}_0,\ldots,\mathtt{u}_h=\mathtt{u}\). Then, we have \(\mathtt{size(u}_0\mathtt{)}=\mathtt{n}\), \(\mathtt{size(u}_1\mathtt{)}\le\frac{2}{3}\mathtt{n}\), \(\mathtt{size(u}_2\mathtt{)}\le\frac{4}{9}\mathtt{n}\) and, more generally,

\[ \mathtt{size(u}_i\mathtt{)}\le\left(\frac{2}{3}\right)^i\mathtt{n} \enspace . \nonumber\]

But this gives a contradiction, since \(\mathtt{size(u)}\ge 1\), hence

\[ 1 \le \mathtt{size(u)} \le \left(\frac{2}{3}\right)^{h} \mathtt{n}<\left(\frac{2}{3}\right)^{\log _{3 / 2} \mathtt{q}} \mathtt{n} \leq\left(\frac{2}{3}\right)^{\log _{3 / 2} \mathtt{n}} \mathtt{n}=\left(\frac{1}{\mathtt{n}}\right) \mathtt{n} = 1 \enspace . \nonumber\]
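The second-to-last equality in this chain uses the identity

\[ \left(\frac{2}{3}\right)^{\log_{3/2}\mathtt{n}} = \left(\frac{3}{2}\right)^{-\log_{3/2}\mathtt{n}} = \mathtt{n}^{-1} \enspace , \nonumber\]

which follows because \(2/3\) is the reciprocal of the logarithm's base \(3/2\).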

Next, we analyze the parts of the running time that are not yet accounted for. There are two parts: The cost of calls to \(\mathtt{size(u)}\) when searching for scapegoat nodes, and the cost of calls to \(\mathtt{rebuild(w)}\) when we find a scapegoat \(\mathtt{w}\). The cost of calls to \(\mathtt{size(u)}\) can be related to the cost of calls to \(\mathtt{rebuild(w)}\), as follows:

Lemma \(\PageIndex{2}\).

*During a call to \(\mathtt{add(x)}\) in a `ScapegoatTree`, the cost of finding the scapegoat \(\mathtt{w}\) and rebuilding the subtree rooted at \(\mathtt{w}\) is \(O(\mathtt{size(w)})\).*

*Proof*. The cost of rebuilding the scapegoat node \(\mathtt{w}\), once we find it, is \(O(\mathtt{size(w)})\). When searching for the scapegoat node, we call \(\mathtt{size(u)}\) on a sequence of nodes \(\mathtt{u}_0,\ldots,\mathtt{u}_k\) until we find the scapegoat \(\mathtt{u}_k=\mathtt{w}\). However, since \(\mathtt{u}_k\) is the first node in this sequence that is a scapegoat, we know that

\[ \mathtt{size(u}_{i}\mathtt{)} < \frac{2}{3}\mathtt{size(u}_{i+1}\mathtt{)} \nonumber\]

for all \(i\in\{0,\ldots,k-2\}\). Therefore, the cost of all calls to \(\mathtt{size(u)}\) is

\[\begin{align}
O\left( \sum_{i=0}^k \mathtt{size(u}_{k-i}\mathtt{)} \right)
&=O\left(\mathtt{size(u}_k\mathtt{)}+\sum_{i=0}^{k-1} \mathtt{size(u}_{k-i-1}\mathtt{)}\right) \nonumber\\
&=O\left(\mathtt{size(u}_k\mathtt{)}+ \sum_{i=0}^{k-1}\left(\frac{2}{3}\right)^i\mathtt{size(u}_{k}\mathtt{)}\right) \nonumber\\
&=O\left(\mathtt{size(u}_k\mathtt{)}\left(1+\sum_{i=0}^{k-1} \left(\frac{2}{3}\right)^i\right)\right) \nonumber\\
&=O(\mathtt{size(u}_k\mathtt{)}) = O(\mathtt{size(w)}) \enspace , \nonumber
\end{align}\nonumber\]

where the last line follows from the fact that the sum is a geometrically decreasing series.
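Concretely, the sum in the last step is a partial geometric series bounded by a constant that does not depend on \(k\):

\[ 1+\sum_{i=0}^{k-1}\left(\frac{2}{3}\right)^i < 1+\sum_{i=0}^{\infty}\left(\frac{2}{3}\right)^i = 1+\frac{1}{1-2/3} = 4 \enspace , \nonumber\]

so the total cost of the \(\mathtt{size(u)}\) calls is at most a constant times \(4\,\mathtt{size(u}_k\mathtt{)}\).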

All that remains is to prove an upper-bound on the cost of all calls to \(\mathtt{rebuild(u)}\) during a sequence of \(m\) operations:

Lemma \(\PageIndex{3}\).

*Starting with an empty `ScapegoatTree`, any sequence of \(m\) \(\mathtt{add(x)}\) and \(\mathtt{remove(x)}\) operations causes at most \(O(m\log m)\) time to be used by \(\mathtt{rebuild(u)}\) operations.*

*Proof*. To prove this, we will use a credit scheme. We imagine that each node stores a number of credits. Each credit can pay for some constant, \(c\), units of time spent rebuilding. The scheme gives out a total of \(O(m\log m)\) credits and every call to \(\mathtt{rebuild(u)}\) is paid for with credits stored at \(\mathtt{u}\).

During an insertion or deletion, we give one credit to each node on the path to the inserted node, or deleted node, \(\mathtt{u}\). In this way we hand out at most \(\log_{3/2}\mathtt{q}\le \log_{3/2}m\) credits per operation. During a deletion we also store an additional credit "on the side." Thus, in total we give out at most \(O(m\log m)\) credits. All that remains is to show that these credits are sufficient to pay for all calls to \(\mathtt{rebuild(u)}\).

If we call \(\mathtt{rebuild(u)}\) during an insertion, it is because \(\mathtt{u}\) is a scapegoat. Suppose, without loss of generality, that

\[ \frac{\texttt{size(u.left)}}{\mathtt{size(u)}} > \frac{2}{3} \enspace . \nonumber\]

Using the fact that

\[ \mathtt{size(u)} = 1 + \texttt{size(u.left)} + \texttt{size(u.right)} \nonumber\]

we deduce that

\[ \frac{1}{2}\texttt{size(u.left)} > \texttt{size(u.right)} \enspace \nonumber\]

and therefore

\[ \texttt{size(u.left)} - \texttt{size(u.right)}>\frac{1}{2}\texttt{size(u.left)} > \frac{1}{3}\mathtt{size(u)} \enspace . \nonumber\]

Now, the last time a subtree containing \(\mathtt{u}\) was rebuilt (or when \(\mathtt{u}\) was inserted, if a subtree containing \(\mathtt{u}\) was never rebuilt), we had

\[ \texttt{size(u.left)} - \texttt{size(u.right)} \le 1 \enspace . \nonumber\]

Therefore, the number of \(\mathtt{add(x)}\) or \(\mathtt{remove(x)}\) operations that have affected \(\texttt{u.left}\) or \(\texttt{u.right}\) since then is at least

\[ \frac{1}{3}\mathtt{size(u)} - 1 \enspace , \nonumber\]

and there are therefore at least this many credits stored at \(\mathtt{u}\) that are available to pay for the \(O(\mathtt{size(u)})\) time it takes to call \(\mathtt{rebuild(u)}\).

If we call \(\mathtt{rebuild(u)}\) during a deletion, it is because \(\mathtt{q} > 2\mathtt{n}\). In this case, we have \(\mathtt{q}-\mathtt{n}> \mathtt{n}\) credits stored "on the side," and we use these to pay for the \(O(\mathtt{n})\) time it takes to rebuild the root. This completes the proof.

## \(\PageIndex{2}\) Summary

The following theorem summarizes the performance of the `ScapegoatTree` data structure:

Theorem \(\PageIndex{1}\).

*A `ScapegoatTree` implements the `SSet` interface. Ignoring the cost of \(\mathtt{rebuild(u)}\) operations, a `ScapegoatTree` supports the operations \(\mathtt{add(x)}\), \(\mathtt{remove(x)}\), and \(\mathtt{find(x)}\) in \(O(\log \mathtt{n})\) time per operation.*

*Furthermore, beginning with an empty `ScapegoatTree`, any sequence of \(m\) \(\mathtt{add(x)}\) and \(\mathtt{remove(x)}\) operations results in a total of \(O(m\log m)\) time spent during all calls to \(\mathtt{rebuild(u)}\).*