
Acknowledgement

These notes are based on the lectures and lecture notes by the professor, Meng He, for the class "CSCI-4117 Advanced Data Structures". I am taking/took this class in the Winter 2023-2024 semester. If there are corrections or issues with the notes, please use the contact page to let me know.

Amortized Analysis

To Amortize: To pay off over time.

This kind of analysis determines the time taken over a sequence of operations rather than for a single operation in isolation. An example would be to take the average time over \(n\) operations.

An amortized analysis guarantees the average performance of each operation in the worst case. It does not require probability.

This is different from average-case analysis, which uses probability and statistics to analyze the average time over all possible inputs.

Amortized analysis can be used to show that the average cost of an operation is small even though the cost of a single operation may be high.

Aggregate Analysis

This is a form of amortized analysis.

  1. First we show that for all n, a sequence of n operations uses \(T(n)\) worst case time in total.

  2. Amortized cost per operation then is \(T(n) / n\)

Stack With Multi-Pop

Consider the following operation added to the regular stack operations (PUSH, POP)

# pop k elements off the stack S
def MULTIPOP(S, k):
    while (k != 0 and S.size() != 0):
        POP(S)
        k = k - 1

The worst-case running time of this algorithm would be \(O(min(k, s))\) where \(s\) is the number of elements in the stack, which can be simplified to \(O(n)\).

Now, consider n operations of PUSH, POP, and MULTIPOP combined.

The worst case for each operation is \(O(n)\) (from the multi-pop) and therefore, the naive impression would then be that n operations would take \(O(n) * n\) total time. Then, the average time of each operation would be \(O(n) * n / n\) which would be \(O(n)\).

If we consider the situation where we start with an initially empty stack, then we can make the observation that each object can be popped at most once for each time it is pushed. The number of times POP can be called on an initially empty stack, including calls within MULTIPOP, is at most the number of times PUSH is called.

PUSH is called \(O(n)\) times, which means POP, whether called inside MULTIPOP or on its own, can be called at most \(O(n)\) times. Therefore, the running time of \(n\) operations is \(O(n)\).

More precisely, the cost of n PUSH, POP, and MULTIPOP operations on an initially empty stack would be the following:

\[ 2n - 2 \rightarrow O(n) \]

Therefore, the amortized cost of all operations would be:

\[ \frac{O(n)}{n} = O(1) \]

Conclusion: The use of MULTIPOP, an \(O(n)\) operation, when amortized over \(n\) operations, results in an average cost of \(O(1)\). Therefore, the operation does not increase the order of the running time.
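As a quick sanity check of the aggregate argument, below is a minimal Python sketch (my own, not from the lecture) that instruments the stack operations and confirms that the total number of pops, including those made inside MULTIPOP, never exceeds the number of pushes.

# Instrumented stack: total pops (direct or inside MULTIPOP) <= total pushes.
class MultiPopStack:
    def __init__(self):
        self.items = []
        self.pushes = 0
        self.pops = 0

    def push(self, x):
        self.items.append(x)
        self.pushes += 1

    def pop(self):
        if self.items:
            self.pops += 1
            return self.items.pop()

    def multipop(self, k):
        # pop k elements, or stop early if the stack becomes empty
        while k != 0 and self.items:
            self.pop()
            k -= 1

s = MultiPopStack()
for i in range(100):
    s.push(i)
    if i % 10 == 9:
        s.multipop(7)
assert s.pops <= s.pushes   # each element is popped at most once per push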

Incrementing a binary counter of k-bits

Consider a binary counter stored in an array \(A[0..k-1]\) of bits. Let \(A[0]\) be the lowest bit and \(A[k-1]\) be the highest bit.

We implement the INCREMENT operation on this structure to increase the counter by one. We scan the bits from right to left, zeroing out 1s until we find a 0, which we then flip to a 1.

INCREMENT(A[0..k-1])
    i = 0
    while (i < k and A[i] != 0) do
        A[i] = 0
        i = i + 1

    if (i == k)
        // Overflow
        return

    A[i] = 1

The worst case running time would be \(O(k)\) where \(k\) is the number of bits in the counter.

If we perform \(n\) operations then the naive impression would be that average running time:

\[ \frac{O(k) \cdot n}{n} = O(k) \]

Observation: \(A[0]\) flips each time increment is called. \(A[1]\) flips every other time. Looking at the pattern, we see that \(A[i]\) flips every \(2^i\) times increment is called.

For \(i=0,1,\dots, \lfloor \lg n \rfloor\), bit \(A[i]\) flips \(\lfloor n / 2^i \rfloor\) times in a sequence of \(n\) INCREMENT operations on an initially 0 counter.

For \(i > \lfloor \lg n \rfloor\), bit \(A[i]\) never flips.

The total number of flips in a sequence of n operations would then be:

\[ \begin{align} & \;\sum_{i=0}^{\lfloor \lg n \rfloor} \lfloor n/2^i \rfloor \\ \leq& \; \sum_{i=0}^{\infty} n/2^i \\ =& \; n \sum_{i=0}^{\infty} 1/2^i \\ =& \; 2n \end{align} \]

For the second step, we extend the sum to infinity and drop the floor, since we are only concerned with an upper bound and this simplifies the expression.

Since this sum is at most \(2n\), the worst-case running time for a sequence of \(n\) INCREMENT operations on an initially zero counter is \(O(n)\), i.e., \(O(1)\) amortized per operation.
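The same bound can be checked empirically. Below is a small Python sketch (my own, not from the notes) that runs \(n\) INCREMENT operations on an initially zero counter and counts the bit flips; the total stays at most \(2n\).

# Count bit flips over n increments of an initially zero k-bit counter.
def increment(A, flips):
    i = 0
    while i < len(A) and A[i] != 0:
        A[i] = 0            # flip a 1 to a 0
        flips[0] += 1
        i += 1
    if i == len(A):
        return              # overflow
    A[i] = 1                # flip a 0 to a 1
    flips[0] += 1

k, n = 16, 1000
A, flips = [0] * k, [0]
for _ in range(n):
    increment(A, flips)
assert flips[0] <= 2 * n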

Accounting Method

  1. The fundamental idea behind this method of amortized analysis is to assign an estimated or guessed amortized "cost" to each operation.

  2. If the amortized cost exceeds the actual cost then the surplus remains as credit associated with the data structure.

  3. If the amortized cost is less than the actual cost the accumulated credit is used to pay for the cost overflow.

  4. To show the amortized cost is correct, it suffices to show that at all times the total credit associated with the data structure is non-negative. That is, the total amortized cost is greater than the total cost at all times.

Stack With Multi-Pop

The actual cost of each of the operations:

  • PUSH: O(1) (1 dollar)
  • POP: O(1) (1 dollar)
  • MULTIPOP(S, k): O(min(s,k)) (min(s, k) dollars)

Note that we use "dollar", but this is just a placeholder for one unit of work.

Let's declare the amortized costs to be:

  • PUSH: 2 dollars
  • POP: 0 dollars
  • MULTIPOP: 0 dollars

We start with an empty stack and 0 dollars of credit. Each time we push an object onto the stack, we use 1 dollar to push the object and the remaining one dollar is stored as credit with the object that was pushed. Each time we pop an element, either in multi-pop or in pop, we use the 1 dollar credit associated with the object from when it was pushed to pay for the operation.

Therefore, the total credit associated with the stack is non-negative at all times, so the declared amortized costs are valid. Since the amortized cost of each operation is \(O(1)\), the total cost of any sequence of \(n\) operations is \(O(n)\), i.e., \(O(1)\) amortized per operation.

Incrementing a binary counter of k-bits

Since there is only one operation in this problem, the INCREMENT operation, we have to think a little outside the box: let's assign amortized costs to the bit flips in the counter.

Actual cost of flipping bits: \(O(1)\) - 1 dollar.

Let's assign the following amortized cost to flipping bits:

  • Flip from 0 to 1: 2 dollars.
  • Flip from 1 to 0: 0 dollars.

When we change a 0 to a 1, we incur an actual cost of 1 dollar and since the amortized cost is 2 dollars, we get a credit of 1 dollar.

When we change a 1 to a 0, we incur an actual cost of 1 dollar and since the amortized cost is 0 dollars, we use the credit associated to this bit when it was flipped to a 1.

Since we start with all 0s, the only way we can change a 1 to a 0 is to first change a 0 to a 1. This will ensure that all 1s have an associated dollar.

Going back to our pseudocode:

INCREMENT(A[0..k-1])
    i = 0
    while (i < k and A[i] != 0) do
        A[i] = 0
        i = i + 1

    if (i == k)
        // Overflow
        return

    A[i] = 1

We see that the INCREMENT operation changes at most one bit from 0 to 1. Therefore, the amortized cost of the operation is at most 2 dollars. This shows that the amortized cost per operation is \(O(1)\) over \(n\) operations when starting from a zero counter.

Potential Method

In this method, we represent prepaid work as "potential energy" or "potential" which can be released for future operations. Here potential is a function of the entire data structure.

Let \(D_i\) be the data structure after the \(i\)th operation.

Let \(\Phi(D_i)\) be the potential of \(D_i\) and \(C_i\) be the cost of the \(i\)th operation.

The amortized cost of the \(i\)th operation is then defined as:

\[ a_i = C_i + \Phi(D_i) - \Phi(D_{i-1}) \]

where the difference between \(\Phi(D_i)\) and \(\Phi(D_{i-1})\) is known as the change in potential.

The total amortized cost of n operations would then be:

\[ \begin{align} \sum_{i=1}^{n} a_i &= \sum_{i=1}^{n} \left(C_i + \Phi(D_i) - \Phi(D_{i-1})\right) \\ &= \left(\sum_{i=1}^{n} C_i\right) + \Phi(D_n) - \Phi(D_{0}) \end{align} \]

\(\Phi(D_i)\) can be as complicated as a function as necessary. It's a function on the entire data structure.

Our task is to define a potential function \(\Phi\) such that

  • \(\Phi(D_0) = 0\)
  • \(\Phi(D_i) \geq 0\) for all \(i\).
\[ \sum_{i=1}^n C_i = \left(\sum_{i=1}^n a_i \right) - \Phi(D_n) \leq \sum_{i=1}^n a_i \]

Stack With Multi-Pop

We need to first define the potential function. Let's define the function \(\Phi\) on an initially empty stack to be the number of objects in the stack.

For the empty stack, \(D_0\), the potential energy is \(\Phi(D_0) = 0\). Since the number of objects in the stack cannot be negative, \(\Phi(D_i) \geq 0\) at all times. Therefore, the total amortized cost of \(n\) operations with respect to \(\Phi\) represents an upper bound on the actual cost.

If the \(i\)th operation on a stack S containing s objects is PUSH,

\[ \Phi(D_i) - \Phi(D_{i-1}) = (s + 1) - s = 1 \]

Then \(a_i\) for a push operation is

\[ a_i = C_i + \Phi(D_i) - \Phi(D_{i-1}) = 1 + 1 = 2 \]

Suppose the \(i\)th operation on a stack S with s objects is MULTIPOP(S, k) which causes \(k' = min(k,s)\) objects to be popped off the stack.

\[ \Phi(D_i) - \Phi(D_{i-1}) = (s - k') - s = -k' \]

Then \(a_i\) for a multipop operation is

\[ a_i = C_i + \Phi(D_i) - \Phi(D_{i-1}) = k' - k' = 0 \]

Therefore, the amortized cost of the operations are all \(O(1)\).

Incrementing a binary counter of k-bits

Let's define \(\Phi(D_i)\) be the number of \(1\)s \((b_i)\) in the binary counter \(D_i\).

For the empty counter, \(D_0\), the potential energy is \(\Phi(D_0) = 0\) since the number of 1s in the counter is 0. Since the number of 1s cannot be negative therefore \(\Phi(D_i) \geq 0\) at all times. Therefore, the total amortized cost of \(n\) operations with respect to \(\Phi\) represents an upper bound on the actual cost.

Suppose the \(i\)th INCREMENT sets \(t_i\) bits from 1 to 0. The actual cost of the operation is then at most \(t_i + 1\).

If \(b_i = 0\), then the \(i\)th operation caused an overflow in the counter, therefore, it set all \(k\) bits from 1 to 0. So, \(b_{i-1} = t_i = k\). If \(b_i > 0\), then \(b_i\) is equal to \(b_{i-1} - t_i + 1\). In either case, \(b_i \leq b_{i-1} - t_i + 1\) - this can be used as an upper bound.

The change in potential caused by the \(i\)th operations then is:

\[ \begin{align} \Phi(D_i) - \Phi(D_{i-1}) &= b_i - b_{i-1} \\ &\leq (b_{i-1} - t_i + 1) - b_{i-1} \\ &= 1 - t_i \end{align} \]

Therefore, the amortized cost is:

\[ \begin{align} a_i &= C_i + \Phi(D_i) - \Phi(D_{i-1}) \\ &\leq (t_i + 1) + (1 - t_i) \\ &= 2 \\ &= O(1) \end{align} \]

Therefore, the upper bound on the amortized cost of the operation is \(O(1)\).

Resizable Arrays

Arrays are the most space-efficient way to store a sequence of homogeneous elements.

A resizable array \(A\) represents \(n\) fixed-size elements, each assigned a unique index from \(0\) to \(n-1\), to support three operations:

  1. LOCATE(i): which determines the location of \(A[i]\).
  2. INSERT(x): store \(x\) in \(A[n]\), incrementing \(n\).
  3. DELETE(): delete the last element \(A[n-1]\), decrementing \(n\).

Vector Solution

Upon inserting an element into a full array, we allocate a new array with twice the size of the old one. Then, we copy the old elements into the new array and delete the original array.

  • \(n\): the number of elements in \(A\).
  • \(s\): the maximum number of elements that can be stored in \(A\).
  • Load factor \(\alpha\): \(n/s\) when \(s \neq 0\), else \(1\).

\(1/2 \leq \alpha \leq 1\) since we always allocate something that's twice as big.

INSERT(x)
    if s == 0
        allocate A with 1 slot
        s = 1
    if n == s
        allocate A' with 2 * s slots
        A'[0..n-1] = A[0..n-1]
        deallocate A
        A = A'  // renaming variable
        s = 2 * s
    A[n] = x
    n = n + 1

The actual cost of the \(i\)th insert is the number of elements written. If \(A\) is not full, this cost is just 1. Otherwise, \(i\) elements are written: the \(i-1\) old elements are copied and the new element is stored (assuming deletion is not allowed).

Let \(A_i\) denote \(A\) after the \(i\)th insert.

Idea 1

Suppose \(\Phi(A_i) = n\). The change in potential per insert is then \(1\): we keep gaining a potential of 1 but never spend it, so the expensive inserts are not paid for and the amortized bound remains \(O(n)\).

A costly operation must release potential (a negative change in potential) for the analysis to give a tight bound.

Idea 2

Consider the function \(\Phi(A_i) = n - s\), where \(s\) is the capacity of the array. When we double the array size, \(s\) doubles, so potential is released to pay for the expensive operation. However, since \(n \leq s\) at all times, the potential would always be negative (or zero), so the requirement that \(\Phi(A_i) \geq 0\) is not met.

Idea 3

Consider the following potential function:

\[ \Phi(A_i) = 2n - s \]

Then, \(\Phi(A_0) = 0\) and \(\Phi(A_i) \geq 0\) for all \(i\). Therefore, this is a valid potential function. To analyze the amortized cost, \(a_i\), of the \(i\)th INSERT, let \(n_i\) denote the number of elements stored in \(A_i\) and \(s_i\) denote the capacity of \(A_i\).

  • Case 1: The \(i\)th insert does not require a new array to be allocated. Then \(s_i = s_{i-1}\). Thus,
\[ \begin{align} a_i &= C_i + \Phi(A_i) - \Phi(A_{i-1}) \\ &= 1 + (2n_i - s_i) - (2n_{i-1} - s_{i-1}) \\ &= 1 + (2(n_{i-1} + 1) - s_{i-1}) - (2n_{i-1} - s_{i-1}) \\ &= 3 \end{align} \]
  • Case 2: The \(i\)th insert does require a new array to be allocated. Then \(s_i = 2 s_{i-1}\) and \(s_{i-1} = n_{i-1}\). Thus,
\[ \begin{align} a_i &= C_i + \Phi(A_i) - \Phi(A_{i-1}) \\ &= n_i + (2n_i - s_i) - (2n_{i-1} - s_{i-1}) \\ &= (n_{i-1}+1) + (2(n_{i-1} + 1) - 2n_{i-1}) - (2n_{i-1} - n_{i-1}) \\ &= 3 \end{align} \]

We see that in both cases, our INSERT requires \(O(1)\) running time.

Handling Delete

If we do not care about memory usage, we can just let it be. However, in real cases, we may want to optimize our memory usage, hence releasing the memory when not used.

When \(\alpha\) is too small, allocate a new, smaller array and copy the elements from the old array to the new one.

A strategy that does not work for \(O(1)\) amortized time is to double the array size when inserting into a full array and halve it when a deletion would cause the array to be less than half full. This is because, in the worst case, the operations could repeatedly cross the resizing threshold back and forth, causing the array to grow or shrink on nearly every operation.

Counter Example:

For simplicity, assume \(n\) is a power of \(2\). Consider following \(n\) operations:

The first \(n/2\) operations are insertions, which cause \(A\) to be full (\(2^x + 1\)th insertion cause the array to grow). The next \(n/2\) operations are INSERT, DELETE, DELETE, INSERT, INSERT (we keep doing 2 inserts and 2 deletes for the next \(n/2\) operations).

counter example for naive delete resize

This leads to a situation where the worst case running time is \(\Theta(n^2)\). This would lead to an amortized time of \(\Theta(n)\)

The problem is that we are too strict about the load factor: there are not enough operations between resizes for the potential to build up.

Solution: We double the array size when inserting into a full array. We halve the array size when a deletion would cause the array to be less than one-quarter full (sketched below).
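A minimal Python sketch of this policy (the class and names are my own, not the lecture's): grow to double the capacity when inserting into a full array, and shrink to half the capacity when a deletion would leave the array less than one-quarter full.

# Grow-at-full / shrink-at-quarter-full resizing policy.
class ResizableArray:
    def __init__(self):
        self.cap = 0        # s: number of allocated slots
        self.n = 0          # number of stored elements
        self.data = []      # fixed-size storage, simulated with a Python list

    def _resize(self, new_cap):
        new_data = [None] * new_cap
        new_data[:self.n] = self.data[:self.n]   # copy the elements over
        self.data, self.cap = new_data, new_cap

    def insert(self, x):
        if self.cap == 0:
            self._resize(1)
        elif self.n == self.cap:
            self._resize(2 * self.cap)           # double on overflow
        self.data[self.n] = x
        self.n += 1

    def delete(self):
        assert self.n > 0
        self.n -= 1
        self.data[self.n] = None
        if self.cap > 1 and self.n < self.cap // 4:
            self._resize(self.cap // 2)          # halve when under 1/4 full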

Analysis: Let \(\alpha_i\) be the load factor after the \(i\)th operation.

\[ \Phi(A_i) = \begin{cases} 2n_i - s_i &\text{ if } \alpha_i \geq \frac12 \\ \frac{s_i}{2} - n_i &\text{ if } \alpha_i < \frac12 \\ \end{cases} \]

Space Overhead

The overhead of a data structure, currently storing \(n\) elements, is the amount of memory usage beyond the minimal required to store \(n\) elements.

Vector: \(\Theta(n)\), in the worst case, when we are shrinking the array from \(4n\) to \(2n\) we require a total of \(6n\) space (old and new array combined). Therefore, the worst-case overhead requirement is \(5n\).

<< Insert diagram for this case >>

Lower bound for array: At some point in time, \(\Omega(\sqrt n)\) overhead is necessary, for any data structure that supports inserting elements and locating any element in some order, where \(n\) is the number of elements stored in the data structure.

Proof: Consider the following sequence of operations for any \(n\).

INSERT(a1)
INSERT(a2)
...
INSERT(an)

After inserting \(a_n\), let \(k(n)\) be the number of memory blocks occupied by the data structure. Let \(s(n)\) be the size of the largest of these blocks. For the vector solution, \(k(n) = \Theta(1)\) and \(s(n) = \Theta(n)\). In a linked list, \(k(n) = \Theta(n)\), \(s(n) = \Theta(1)\).

Since any element can be located, the data structure must store all the elements in memory. Hence, \(k(n) \cdot s(n) \geq n\). At this time, the overhead is at least \(k(n)\) to store book-keeping information for each block (memory address) since you need to keep track of these memory blocks. Furthermore, immediately after the allocation of the block of size \(s(n)\), the overhead is at least \(s(n)\) (since it has yet to be populated).

Therefore, the worst-case overhead is at least \(max(k(n), s(n))\).

Since \(s(n) \cdot k(n) \geq n\), at some point, the overhead is at least \(\sqrt n\) (if not, the product would be less than \(n\)).

Optimal Solution

Let Allocate(s) be a function that returns a new block of size \(s\) and Deallocate(B) be a function that frees the space used by the given block \(B\). Let Reallocate(B, s) be a function that, if possible, resizes block \(B\) to size \(s\); otherwise it allocates a block of size \(s\), copies the contents of \(B\) into the new block, deallocates \(B\), and returns the new block. We can assume Allocate and Deallocate take constant \(O(1)\) time, and that Reallocate takes \(O(s)\) time in the worst case.

NOTE: Some programming languages initialize or clear a memory block when allocating or deallocating it. For such languages, Allocate and Deallocate would take \(O(s)\) time.

Approaches That Do Not Work

Approach 1

Try to store elements in \(\Theta(\sqrt n)\) blocks, and the size of the largest block is also \(\Theta(\sqrt n)\).

B1: size = 1
B2: size = 2
B3: size = 3
...
Bk: size = k

Every time we insert an element, if there is not space we increase the size of the block by 1.

The first \(k\) blocks store \(k(k+1)/2\) elements in total. The number of blocks required to store \(n\) elements is therefore found by solving:

\[ \begin{align} \frac{k(k+1)}{2} &= n \\ k &= \left\lceil \frac{\sqrt{1 + 8n} - 1}{2} \right\rceil \end{align} \]

This gives us a solution with space \(\Theta(\sqrt n)\).

Issue: LOCATE(i) must determine which block contains element \(i\), which is block number:

\[ \left\lceil \frac{\sqrt{8(i+1) + 1} - 1}{2} \right\rceil \]

However, computing a square root cannot be assumed to take constant time.

Approach 2

Try to store elements in \(\Theta(\sqrt n)\) blocks, and the size of the largest block is also \(\Theta(\sqrt n)\).

B1: size = 1
B2: size = 2
B3: size = 4
...
Bk: size = 2^k

Locate(i): the index of the block containing element \(i\) is the number of bits in the binary representation of \(i+1\), minus one (i.e., \(\lfloor \lg(i+1) \rfloor\), counting blocks from zero). This can be computed in constant time on modern CPUs.

Issues: The size of the largest block is \(\Theta(n)\) (it holds \(2^k\) elements). Immediately after allocating the largest block, the space overhead is on the order of \(n\).

Solution

Our solution contains two types of blocks, an "index block" and "data blocks". The index block stores pointers to the data blocks which is stored as a vector. The data blocks store the actual elements in the array. We have \(d\) data blocks that are non-empty.

super block data block visualized

Super blocks (\(SB_0\), \(SB_1\), ..., \(SB_{s-1}\)) are groups of data blocks. Data blocks in the same super block are of the same size.

When super block \(SB_k\) is fully allocated, it consists of \(2^{\lfloor k/2 \rfloor}\) data blocks, each of which are \(2^{\lceil k/2 \rceil}\) in size.

Therefore, super block \(SB_{k}\) holds \(2^k\) elements.

Only \(SB_{s-1}\) might not be fully allocated.

Reference: Resizable Arrays in Optimal Time and Space

Operations

Supporting random access in O(1)

LOCATE(i): Return the address of the \(i\)th element.

Let \(r\) be the binary representation of \((i + 1)\). The \(i\)th element is element \(e\) of data block \(b\) of super block \(k\), where:

\[ k = |r| - 1 \]

\(b\) is the \(\lfloor k/2 \rfloor\) bits of \(r\) immediately after the leading 1-bit.

\(e\) is the last \(\lceil k/2 \rceil\) bits of \(r\).

Where \(|r|\) is the number of bits in \(r\).

Example: for \(i = 5\), \(r\) is the binary representation of \(5 + 1 = 6\), which is \(110\). So \(k = 2\), \(b = 1\) (the \(\lfloor k/2 \rfloor = 1\) bit after the leading 1), and \(e = 0\) (the last \(\lceil k/2 \rceil = 1\) bit).

But to index the element, we need to first calculate how many data blocks are before the kth super block. This is because we only know the relative index of the data block within the super block.

\[ p = \sum_{j=0}^{k-1} 2^{\lfloor j/2 \rfloor} = \begin{cases} 2 \left(2^{k/2} - 1\right) & \text{if } k \text{ is even} \\ 3 \cdot 2^{(k-1)/2} - 2 & \text{if } k \text{ is odd} \end{cases} \]

Return the location of element \(e\) in data block \(DB_{p+b}\) (we can get this from the index block in constant time.)
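A short Python sketch of this index arithmetic (my own naming), returning the super block number \(k\), the index \(p+b\) of the data block, and the element index \(e\) within it:

def locate(i):
    # Decompose index i following the scheme above.
    r = i + 1
    k = r.bit_length() - 1                   # super block number
    e_bits = (k + 1) // 2                    # ceil(k/2) low bits give e
    b_bits = k // 2                          # next floor(k/2) bits give b
    b = (r >> e_bits) & ((1 << b_bits) - 1)
    e = r & ((1 << e_bits) - 1)
    # p: number of data blocks in super blocks 0 .. k-1
    if k % 2 == 0:
        p = 2 * (2 ** (k // 2) - 1)
    else:
        p = 3 * 2 ** ((k - 1) // 2) - 2
    return k, p + b, e                       # element e of data block DB_{p+b}

# Example: i = 5 gives r = 6 = 110 in binary, so k = 2, b = 1, e = 0, p = 2.
print(locate(5))   # (2, 3, 0)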

Updates:

Some easy to maintain book-keeping information:

  • \(n\) Number of items in the array.
  • \(s\) Number of super blocks in the array.
  • \(d\) Number of non-empty data blocks.
  • \(d_e\) Number of empty data blocks (0 or 1).
  • \(d_s\) The size and occupancy of the last non-empty data block.
  • \(s_s\) The size and occupancy of the last non-empty super block.
  • \(i_s\) The size and occupancy of the last non-empty index block.
  • The index block is maintained as a vector.

Insert Operation:

  • If \(SB_{s-1}\) (last super block) is full, \(s = s+1\).

  • If \(DB_{d-1}\) is full and there is no empty data block, \(d = d+1\) and allocate a new data block as \(DB_{d-1}\).

  • If \(DB_{d-1}\) is full and there is an empty data block, \(d = d+1\) and \(DB_{d-1}\) is the next empty data block.

  • Store \(x\) in \(DB_{d-1}\).

Delete Operation:

  • Remove the last element.

  • If \(DB_{d-1}\) is empty then \(d = d-1\). If \(d_e = 1\), then we deallocate the empty data block. If not, set \(d_e = 1\).

  • If \(SB_{s-1}\) is empty, \(s = s - 1\)

Analysis

\(S = \lceil \lg (1+n) \rceil\).

Proof: The number of elements in the first s super blocks:

\[ \sum_{i=0}^{s-1} 2^i = 2^s - 1 \]

If we let this equal \(n\), then \(s=\lg(1+n)\). For slightly smaller \(n\), the same number of super blocks are required, hence we take the ceiling.

The number of data blocks is \(O(\sqrt n)\). Proof: As there is at most one empty data block, it suffices to prove that \(d = O(\sqrt n)\):

\[ \begin{align} d &\leq \sum_{i=0}^{s-1} 2^{\lfloor i/2 \rfloor} \\ &\leq \sum_{i=0}^{s-1} 2^{i/2} \\ &= \frac{2^{s/2} - 1}{\sqrt{2} - 1} \\ &= O(\sqrt n) \end{align} \]

The last data block has size \(\Theta(\sqrt n)\). Proof: Size of \(DB_{d-1}\) (the last non-empty data block) is \(2^{\lceil (s-1)/2 \rceil}\) which is equal to \(\Theta(\sqrt n)\). If there is an empty data block, either it's the same size or twice the size (next super block) which would also be \(\Theta(\sqrt n)\).

Therefore, the total overhead of the data structure is \(O(\sqrt n)\).

Update time:

Even if the model charges time proportional to the size of a data block for allocating or deallocating it, the amortized update time is still \(O(1)\).

If allocate or deallocate is called when \(n\) equals \(n_0\), then the next call to allocate or deallocate occurs only after \(\Omega(\sqrt{n_0})\) further operations.

To prove this:

Immediately after allocating or deallocating a data block, there is exactly one empty data block. We only deallocate a data block once a second data block becomes empty, which requires calling DELETE as many times as the size of \(DB_{d-1}\), i.e., \(\Omega(\sqrt{n_0})\) times.

Similarly, we only allocate a new data block when the single empty data block becomes full, which requires \(\Omega(\sqrt{n_0})\) insertions.

Summary

The data structure uses the concepts of data blocks and super blocks to provide a low-overhead resizable array.

It achieves an overhead space requirement of \(O(\sqrt{n})\).

It supports LOCATE in \(O(1)\) worst-case time, and INSERT and DELETE in \(O(1)\) amortized time.

Van Emde Boas Trees

Problem Statement

Maintain an ordered set of \(n\) elements to support:

  • Membership (search)
  • Successor / Predecessor of a given element (strict)
  • Insert / Delete from set

Example: 2, 7, 11, 20, 25.

  • Is 11 in the set? Yes.
  • Is 12 in the set? No.
  • What's the successor of 11? 20.
  • What's the predecessor of 3? 2.
  • Insert 21.
  • Delete 11.

Well-Known Solutions

Binary Search Trees

The most well known solution to this problem is a balanced search tree.

  • \(\Theta(\lg n)\) time.
  • \(O(n)\) space.

Lower bound for membership under the binary decision tree model is \(\Omega(\lg n)\).

A binary decision tree model is a binary tree where each node is labelled \(x < y\) (a comparison between two specific elements). A program execution corresponds to a root-to-leaf path, and a leaf contains the result of the computation. Decision trees correspond to algorithms that only use comparisons to gain knowledge about the input.

The binary search algorithm is an algorithm on this binary decision tree model.

Proof of lower bound: Any binary decision tree for membership must have at least \(n+1\) leaves, one for each element + additional leaf for "not found". As a binary tree of \(n+1\) leaves, the tree must have height \(\Omega(\lg n)\), since any program execution corresponds to root-to-leaf path, in the worst-case the algorithm looks for this leaf. Therefore, the algorithm has a running time lower bound of \(\Omega(\lg n)\).

Dynamic Perfect Hashing

Perfect hashing would achieve \(O(1)\) worst-case time for membership queries. The INSERT and DELETE operations are \(O(1)\) expected amortized time.

<< read reference posted online >>

This is based on the RAM model and we use input values as memory addresses, and this is the reason we achieve the constant-running time.

The "expected time" stands for average time.

Hashing also doesn't support predecessor and successor operations efficiently.

Word RAM Model

This model tries to model a realistic computer. In original RAM model you could store an infinitely precise number. However, in the Word RAM model there is a maximum number of bits in a word.

  • Word size \(w \geq \lg n\). Think of it as the minimum size of a pointer. We need a pointer/memory address to each element.

  • \(O(1)\): "C-style" operations.

  • Universe size: \(u\), where elements have values in \([0, u)\) from \(\mathbb{Z}\).

  • \(w \geq \lg u\) since we use \(w\) bits to encode any element from universe.

Preliminary Approaches

Direct Addressing

Uses a bit-vector of length \(u\).

Example, let \(u=16\), \(n=4\), \(s\) is the current set containing 1, 9, 10, 15.

\[ B = [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1] \]

\(B[i] = 1\) if \(i\) is in the set \(s\).

  • Insert/Delete/Membership: \(O(1)\).
  • Successor/Predecessor: \(O(u)\).
  • Space: \(O(u)\)

Carved Bit-Vector

Carve the bit-vector into widgets of size \(\sqrt{u}\) (assume \(u=2^{2k}\)).

carved bit vector

\(W_i[j]\) denotes \(i \sqrt{u} + j\), \(j = 0, \dots \sqrt{u} -1\). The summary widget has size \(\sqrt{u}\) and summary\([i] = 1\) iff \(W_i\) contains a \(1\).

Aliases: \(high(x) = h(x)\) and \(low(x) = l(x)\)

  • \(h(x) = \lfloor x / \sqrt{u} \rfloor\).
  • \(l(x) = x \text{ mod } \sqrt{u}\).
  • \(x = h(x) \sqrt{u} + l(x)\).
  • When \(x\) is represented as a binary number with \(\lg u\) bits, \(h(x)\) is simply the highest \((\lg u) / 2\) bits and \(l(x)\) is the lowest \((\lg u)/2\) bits (see the sketch below).
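A tiny Python sketch (my own) of this bit splitting, assuming the universe size \(u\) is a power of two with an even number of bits so that \(\sqrt{u}\) is an integer:

# Split x into its high and low halves of bits for universe size u.
def high(x, u):
    half = (u.bit_length() - 1) // 2      # (lg u) / 2
    return x >> half

def low(x, u):
    half = (u.bit_length() - 1) // 2
    return x & ((1 << half) - 1)

u, x = 16, 13                 # x = 1101 in binary
print(high(x, u), low(x, u))  # 3 and 1: x = 3 * sqrt(16) + 1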

The algorithms then are performed by doing:

  • Membership(x): \(W_{h(x)}[l(x)]\) is checked: \(O(1)\).

  • Successor(x): We search to the right in the widget that \(x\) belongs to. If no elements are found, then we search the summary block for the next widget with a value in it. In that widget, we search all the values. This can be done in \(O(\sqrt{u})\).

  • Predecessor(x): Same as Successor(x) but in the reverse direction. This can be done in \(O(\sqrt{u})\).

  • Insert(x): We just insert the element in the corresponding widget and update the summary widget to 1 for the corresponding widget. This can be done in \(O(1)\).

  • Delete(x): Set the position of \(x\) within its widget to \(0\), then check whether the entire widget is empty and update the summary widget accordingly. This can be done in \(O(\sqrt{u})\).

The space overhead for this solution is \(O(u)\).

Recursive Splitting Of Universe

The best we can hope for

\[ T(u) = T(\sqrt{u}) + O(1) \]

Let \(m = \lg u\), so that \(u = 2^m\) then

\[\begin{align} T(2^m) &= T(2^{m/2}) + O(1) \\ S(m) &= T(2^m) \\ S(m) &= S(m/2) + O(1) = O(\lg m) \\ T(u) &= O(\lg \lg u) \end{align}\]

Now we design this recursive approach:

initial recursive solution

And for implementation purposes, it's easier to understand it as the following diagram:

implementation of recursive solution

Now we implement the algorithms:

Membership(w, x)
    if |w| = 2
        return w[x]
    else
        return Membership(Sub[w][high(x)], low(x))
\[ T(u) = T(\sqrt{u}) + O(1) = O(\lg \lg u) \]
Insert(w, x)
    if |w| = 2
        w[x] = 1 
    else
        Insert(Sub[w][high(x)], low(x))
        Insert(Summary[w], high(x))

This requires two recursive calls

\[ \begin{align} T(u) &= 2T(\sqrt{u}) + \Theta(1) \\ m &= \lg u \\ T(2^m) &= 2T(2^{m/2}) + \Theta(1) \\ S(m) &= T(2^m) \\ S(m) &= 2 S(m/2) + \Theta(1) \\ S(m) &= \Theta(m) \\ T(u) &= \Theta(\lg u) \end{align} \]

Therefore, the running time of the Insert operation is \(\Theta(\lg u)\).

Successor(w, x)
    if (|w| = 2)
        // base case: return the smallest element of w that is greater than x
        if x < 0 and w[0] = 1
            return 0
        if x < 1 and w[1] = 1
            return 1
        return infinity

    j = Successor(sub[w][high(x)], low(x))
    if j < infinity
        return high(x) * \sqrt(|w|) + j
    else
        i = Successor(summary[w], high(x))
        j = Successor(sub[w][i], -infinity)
        return i * \sqrt(|w|) + j

This requires three recursive calls

\[ \begin{align} T(u) &= 3T(\sqrt{u}) + \Theta(1) \\ m &= \lg u \\ T(2^m) &= 3T(2^{m/2}) + \Theta(1) \\ S(m) &= T(2^m) \\ S(m) &= 3 S(m/2) + \Theta(1) \\ S(m) &= \Theta(m^{\lg 3}) \\ T(u) &= \Theta((\lg u)^{\lg 3}) \end{align} \]

Van Emde Boas Tree

van Emde Boas Tree visualized

Compared to the recursive structure, it has two improvements:

  1. For each widget w, we store the minimum and maximum value in the widget. If the set is empty, both the min and the max are -1. This removes the need for two of the extra recursive calls.
Successor(w, x)
    if x < min[w]
        return min[w]

    else if x >= max[w]
        return infinity

    // sub widget containing x
    w' = sub[w][high(x)]

    if low(x) < max[w']
        return high(x) * \sqrt{|w|} + Successor(w', low(x))

    else
        i = Successor(summary[w], high(x))
        return i * \sqrt(|w|) + min[sub[w][i]]

We compare low(x) against max[w'] since the sub-widget is a sub-universe: it only contains values in \([0, \sqrt{|w|})\), where \(|w|\) is the universe size of the current widget \(w\).

This is also why we add \(high(x) \cdot \sqrt{|w|}\) to the result of the recursive call - it only returns the value relative to the sub-universe.

Since there is only one recursive call, the Successor operation runs in \(O(\lg \lg u)\) time.

Insert(w, x)
    if sub[w][high(x)] is empty:
        Insert(Summary[w], high(x))

    Insert(sub[w][high(x)], low(x))

    if x < min[w] or min[w] = -1
        min[w] = x
    if x > max[w]
        max[w] = x

The running time is \(O(\lg u)\), since both recursive calls may do non-constant work.

To fix this issue, we make a second improvement to the data structure:

  2. We do not store min[w] or max[w] in any sub-widget of w; we store each of them only once, in w itself.

The summary of w only records sub-widgets that contain elements other than the min and the max.

The Membership function needs only trivial changes (checking the min and max fields before recursing). Therefore, the running time is still \(O(\lg \lg u)\).

The Successor function also needs only trivial changes, still making just one recursive call, which leads to a running time of \(O(\lg \lg u)\).

Insert(w, x)
    if min[w] = max[w] = -1
        min[w] = x
        max[w] = x

    else if min[w] = max[w]
        if x < min[w]
            min[w] = x
        else
            max[w] = x

    else
        if x < min[w]
            swap(x, min[w])

        else if x > max[w]
            swap(x, max[w])

        w' = sub[w][high(x)]

        Insert(w', low(x))

        if max[w'] == min[w']
            Insert(summary[w], high(x))

If the second recursive call takes place then the first recursive call has a running time of \(O(1)\) since the sub widget is empty.

\[ T(u) = T(\sqrt{u}) + O(1) \]

Therefore, the running time of the algorithm is \(O(\lg \lg u)\)

Delete(w, x)
    if min[w] = -1
        return

    if min[w] = max[w]
        if x = min[w]
            min[w] = -1
            max[w] = -1

        return

    if min[summary[w]] = -1
        // there are two elements - one in min and one in max
        if min[w] = x
            min[w] = max[w]
        else if max[w] = x
            max[w] = min[w]
        return

    if min[w] = x
        // the first subwidget that has at least one element
        j = min[summary[w]]

        min[w] = min[sub[w][j]] + j \sqrt{|w|}
        x = min[w]

    if max[w] = x
        // the last subwidget that has at least one element
        j = max[summary[w]]

        max[w] = max[sub[w][j]] + j \sqrt{|w|}
        x = max[w]

    // get the subwidget that contains x
    w' = sub[w][high(x)]
    Delete(w', low(x))
    if min[w'] == -1
        Delete(summary[w], high(x))

Again, it looks like there are two recursive calls, but for min[w'] = -1 to hold after the deletion, w' must have contained exactly one element, so the first recursive call takes only constant time. Therefore, effectively only one recursive call does non-constant work.

Therefore, the running time of this algorithm is \(O(\lg \lg u)\).

Space Cost Analysis

Let \(S(u)\) be the space of the data structure. The space taken by the pointers and the min/max field are going to be \(\Theta(\sqrt{u})\).

\[ S(u) = (\sqrt{u} + 1) S(\sqrt{u}) + \Theta(\sqrt{u}) \]

This is going to be space of \(O(u)\).

Sqrt u details

If \(\sqrt{u}\) is not an integer, then we can just take the ceiling of \(\sqrt{u}\).

\[ T(u) = T(\lceil \sqrt{u} \rceil) + O(1) \leq T(u^{2/3}) + O(1) \]

Which simplifies to \(O(\lg \lg u)\).

Summary

van Emde Boas Trees is a recursive structure where we store a min and max for each widget and a summary for all the other elements in the widgets.

For an element x, high(x) (the first half of the bits in the binary representation) identifies the sub-widget that contains element x. low(x) (the lower half of the bits) is the value of x within the sub-widget (sub-universe).

  • Space overhead: \(O(u)\)
  • Membership: \(O(\lg \lg u)\)
  • Successor and Predecessor: \(O(\lg \lg u)\)
  • Insert and Delete: \(O(\lg \lg u)\).

Fast Tries

The objective of the data structure is the same as van Emde Boas Trees but we wish to reduce the space complexity of the data structure to \(O(n)\) rather than \(O(u)\) for cases where the universe size is really large.

Preliminary Approach

Let the following be the case: u=16, n=4, S={1,9,10,15}.

Binary Tree Approach

binary tree approach

We store elements as a doubly-linked list.

We store 3 different data structures: bit-vector, tree, and a doubly-linked list.

  • Membership: \(O(1)\)
  • Successor(x):
    • if \(x \in S\) then we use the doubly-linked list - \(O(1)\).
    • else we walk up the tree until we find a 1-node. This node must have exactly one 1-child. Then there are two cases:
      • if that child is a right child, the smallest leaf of that right sub-tree is the successor: \(O(\lg u)\).
      • if that child is a left child, the largest leaf of that left sub-tree is the predecessor (\(O(\lg u)\)); we then use the linked list to find the successor.
  • Insert/Delete: \(O(\lg u)\)

The bit vector requires \(O(u)\) space.

X-fast Tries

From this we learn that we cannot keep the bit-vector. We keep the doubly-linked list, and from the tree we keep only the 1-nodes.

x-fast tries visualized

Successor

How to locate the lowest 1 node from the leaf x to the root?

Solution: Each ancestor of x corresponds to a prefix of the binary representation of x. Consider x = 13 = 1101: reading the bits from the most significant, we go to the right child on a 1 and to the left child on a 0.

We use such prefixes as the keys for the 1-nodes and store them in a dynamic perfect hash table. A dynamic hash table of \(m\) keys uses \(O(m)\) space and supports \(O(1)\) worst-case membership and \(O(1)\) expected amortized insert and delete.

The lowest 1 node that is an ancestor of x corresponds to the longest prefix of x that is in the hash table. To find this longest prefix, we can perform a binary-search over the length of the prefix. Therefore, we can find the lowest 1 node in \(O(\lg \lg u)\) time.
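A hedged Python sketch of this binary search over prefix lengths. Here prefixes is a stand-in for the dynamic perfect hash table, modeled as a set of (length, prefix value) pairs; this representation is my own assumption, not necessarily the lecture's exact one.

# Find the length of the longest prefix of x (a (lg u)-bit number) stored in
# the set of 1-node prefixes, using O(lg lg u) membership tests.
def longest_one_node_prefix(x, prefixes, lg_u):
    lo, hi = 1, lg_u
    best = 0                  # the empty prefix (the root) always matches
    while lo <= hi:
        mid = (lo + hi) // 2
        if (mid, x >> (lg_u - mid)) in prefixes:
            best = mid        # a prefix of length mid exists; try longer ones
            lo = mid + 1
        else:
            hi = mid - 1      # too long; try shorter ones
    return best

# Example with u = 16: the prefixes of 9 = 1001 are (1, 0b1), (2, 0b10),
# (3, 0b100), and (4, 0b1001); they are added to the set when 9 is inserted.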

x-fast tries with hash table

For each element added to the data structure, we store at most \(\lg u\) prefixes (one per ancestor) in the hash table. Since there are \(n\) elements, the worst-case space overhead of the hash table is \(O(n \lg u)\).

If we walked down the tree from the lowest 1-ancestor to find the successor or predecessor, it would take \(O(\lg u)\) time. Therefore, each 1-node with only one child stores a descendant pointer:

  • If the node has a 1-child as its right child, we store a descendant pointer to the smallest leaf of the right sub-tree (successor).

  • If the node has a 1-child as its left child, we store a descendant pointer to the largest leaf of its left sub-tree (predecessor).

If this node has a 1-child as its right child, we follow the descendant pointer to find the successor.

If this node has a 1-child as its left child, we follow the descendant pointer to find the predecessor and then use the doubly-linked list to find the successor.

descendant pointers

Note: Only nodes with exactly one child require descendant pointers: if the lowest 1-ancestor of the query had two children, the child on the query's side would also be a 1-node, contradicting that it is the lowest. We are finding the closest 1-node above the queried x, which must have exactly one child (or x itself is in the set).

Other Operations

  • Membership: \(O(1)\) thanks to the hash table.

  • Insert/Delete: \(O(\lg u)\) amortized expected time.

Space Analysis

  • List: \(O(n)\)
  • Hash table: \(O(n \lg u)\)
  • Tree: \(O(n \lg u)\)

Therefore, the total space is \(O(n \lg u)\).

Summary

The x-fast trie avoids using \(O(u)\) space by maintaining three data structures: a doubly-linked list, a binary tree of 1-nodes, and a hash table. It uses the hash table and the doubly-linked list for the successor and predecessor operations, and maintains additional descendant pointers to speed up locating the successor and predecessor.

  • Space: \(O(n \lg u)\) (for tree and hash-table)
  • Successor and Predecessor: \(O(\lg \lg u)\)
  • Member: \(O(1)\)
  • Insert and Delete: \(O(\lg u)\) amortized expected.

Y-fast Tries

Indirection: We cluster elements of \(S\) into consecutive groups of size \((\lg u)/4\) to \(2\lg u\). From this we get that the number of groups is \(O(n / \lg u)\).

y-fast tries

We then store the elements of each group in a balanced binary search tree. We maintain a set of representative elements, one per group, stored in an x-fast trie.

A group's representative does not have to be in \(S\), but it has to be strictly between the max and the min elements of the preceding and succeeding groups respectively.

If the set of elements were static, we could simply make each group of size \(\lg u\) and pick any element in the group as its representative.

Space Analysis

The x-fast trie uses: $$ O(\frac{n}{\lg u} \lg u) = O(n) $$

BST: \(O(n)\)

Operations Analysis

Membership:

If \(x\) is stored in the x-fast trie, we check the corresponding BST to see whether \(x \in S\). If \(x\) is not in the x-fast trie, we find the predecessor and successor of \(x\) in the x-fast trie; if \(x\) is present at all, it must be in one of their BSTs.

It takes \(O(1)\) time to check whether \(x\) is in the x-fast trie, and searching a BST of size \(O(\lg u)\) takes \(O(\lg \lg u)\) time.

Running time: \(O(\lg \lg u)\)

Predecessor/Successor:

This has a similar idea to membership and has a running time of \(O(\lg \lg u)\).

Insert/Delete:

We use x-fast trie to locate the group that \(x\) is in. Then we just insert \(x\) into the balanced binary search tree of that group or we just delete \(x\) from the binary search tree of that group.

We need to ensure the group size must remain between \(1/4 \lg u\) and \(2 \lg u\). To achieve this, when a group is of size greater than \(2 \lg u\) (after insert), we just split the group into two groups of about \(\lg u\) size each. When the group size is less than \(1/4 \lg u\) (after delete) we merge it with an adjacent group. If this causes a group to be too large, we can split again.

  • Find element position: \(O(\lg \lg u)\)
  • Insert into BST: \(O(\lg \lg u)\)
  • Maintain the group size: \(O(\lg u)\) and \(O(1)\) amortized.

The split and merge operations can be done using an in-order traversal in time linear in the tree size, i.e., \(O(\lg u)\). We may also need to insert or delete a representative in the x-fast trie, which takes \(O(\lg u)\) amortized expected time.

A merge or split takes place only once every \(\Omega(\lg u)\) insert/delete operations, so these merge and split operations cost \(O(1)\) amortized expected time per operation.

Therefore the total worst-case amortized expected running time is \(O(\lg \lg u)\).

Indirection is a general approach to decreasing space cost.

Summary

To avoid the \(O(n \lg u)\) space of the x-fast trie, we store only \(O(n / \lg u)\) representative items in it. We maintain BST groups whose representatives are stored in the x-fast trie. The groups are sized dynamically between \(\frac14 \lg u\) and \(2 \lg u\).

  • Space: \(O(n)\)
  • Membership: \(O(\lg \lg u)\)
  • Predecessor and Successor: \(O(\lg \lg u)\)
  • Insert and Delete: \(O(\lg \lg u)\) amortized expected.

The insert and delete require \(O(\lg u)\) time in the worst-case to maintain group size. However, this is amortized to \(O(1)\) over \(n\) operations.

Self-adjusting Lists

Problem Statement

Linear search! We have a list of \(n\) elements and we are required to perform a linear search over these elements for a particular element \(x\). The worst-case running time for this is \(n\) comparisons.

If all elements have equal probability of search then the expected time is \(\frac{(n+1)}{2}\) if it is present in the list and \(n\) for unsuccessful search.

\[ \begin{align} \text{Cost } &= \sum_{i=1}^{n} \frac{1}{n} i \\ &= \frac{n + 1}{2} \end{align} \]
  1. What if probabilities of access differ between elements? Let \(p_i\) be the probability of element \(i\) being requested.

  2. What if the independent probabilities are not given?

  3. How do these solutions compare with the best we could do?

Independent And Equal Probabilities

If probabilities are the same and independent it doesn't matter how you change the structure.

The best you can hope for is \(\frac{n+1}{2}\) number of comparisons.

Fixed Probabilities And Lists

The probabilities and the list are fixed and cannot be changed. In this case, we want to minimize the expected number of comparisons.

\[ \sum_{i=1}^{n} i p_i \]

To solve this, we can use a greedy algorithm and pick the element with the highest probability to be first and so on.

This is the static optimal solution \(S_{opt}\).

There are cases where binary search does not perform as well as an optimized linear search. Binary search has a low cache-hit rate which may cause slowdowns in the real world.

Changing Unknown Probabilities

How do we adapt to changes in probabilities or "self-adjust" to do better?

We should move elements forward as accessed. There are many heuristics on how to move elements forward:

  1. Move-to-front (MTF): After accessing an item, we move that item to the front of the list.
  2. Transpose (TR): After accessing an item exchange it with an immediate preceding item.
  3. Frequency Count (FC): We maintain a frequency count for each item, initially 0. Increase the counter by 1 whenever the item is accessed, and maintain the list so that items are in decreasing order of frequency count.

In practice, usually they use MTF.
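Below is a minimal Python sketch (my own) of the move-to-front heuristic on a linear list, counting the comparisons made by each access; the charge for an element not yet in the list follows the model described below and is an assumption of this sketch.

# Move-to-front list: scan from the front, then move the accessed item there.
class MTFList:
    def __init__(self):
        self.items = []

    def access(self, x):
        for pos, item in enumerate(self.items):
            if item == x:
                self.items.pop(pos)          # remove from position pos
                self.items.insert(0, x)      # move to the front
                return pos + 1               # comparisons performed
        # not present: charge as if found at the end, insert, then move to front
        cost = len(self.items) + 1
        self.items.insert(0, x)
        return cost

lst = MTFList()
print(sum(lst.access(x) for x in [1, 2, 3, 3, 3, 2, 2, 1]))   # total cost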

Static Optimality Of Move-to-front

Compare the cost of MTF with the cost of the \(S_{opt}\) (static optimal) solution.

For the model, we start with an empty list. We scan for the requested element; if it is not present in the list, we insert it (and charge as if it were found at the end of the list), then apply the heuristic to self-adjust.

The cost is the number of comparisons performed over a sequence of operations.

Consider an arbitrary list, but focus on search for \(b\) and \(c\), and in particular, we focus on "unsuccessful" comparisons where we compare query value \(b\) or \(c\) against \(c\) or \(b\) respectively.

Assume request sequence has \(k\) \(b\)'s and \(m\) \(c\)'s, without loss of generality, we assume \(k \leq m\).

\(S_{opt}\) order is [c, b] because \(c\) is accessed more times than \(b\). There are \(k\) unsuccessful comparisons since we only miss the \(b\)s.

For the worst case for MTF, the requests alternate between \(b\) and \(c\), so the pattern looks like \(c^{m-k} (bc)^k\). From this, we get that the maximum number of unsuccessful comparisons between \(b\) and \(c\) is \(2k\).

This holds for any list and any pair of elements. Therefore, summing over all pairs of elements, the total number of unsuccessful comparisons made by MTF is at most 2 times the total number of unsuccessful comparisons made by \(S_{opt}\).

The number of successful comparisons of MTF and \(S_{opt}\) are equal, the order doesn't matter for the successful comparisons.

Therefore, cost of \(MTF \leq 2 S_{opt}\).

This bound is tight. Given \(a_1\), \(a_2\), \(\dots\), \(a_n\), we repeatedly ask for the last element in the list. The cost of MTF would be \(n\) for each search. For \(S_{opt}\), the average cost would be \((n+1)/2\).

For some sequences, MTF does a lot better than \(S_{opt}\). Consider the request sequence: \(a_1^n\), \(a_2^n\), \(\dots\), \(a_n^n\). For such a sequence, we have the following costs per access:

  • For \(S_{opt}\): \((n+1)/2\)

  • For MTF:

\[ \begin{gather} \frac{[1 + (n-1)] + [2 + (n-1)] + \dots + [n + (n-1)]}{n^2} \\ = \frac{3}{2} - \Theta(\frac{1}{n}) \end{gather} \]

Therefore, we find that MTF performs the search in constant amortized time.

Dynamic Optimality Of Move-to-front

Online Vs Offline Algorithms

  • Online: Must respond as requests comes, without knowledge of the future.

  • Offline: The algorithm gets to see the entire schedule and determine how to proceed.


Competitive Ratio of an algorithm is the worst-case of the online time of using the algorithm divided by the optimal offline time.

\[ \text{Competitive Ratio } = \frac{\text{Online time of Alg.}}{\text{Optimal Offline time}} \]

A method is competitive if this ratio is a constant.

For the cost model there are two parts:

  • Search for or change element at position \(i\): we scan to location \(i\) at cost \(i\).

  • Free exchanges: exchanges (swaps) between adjacent elements required to move element \(i\) closer to the front of the list.

  • Paid exchanges: Any other exchanges (swapping), moving element further from the front, costs 1.

It is reasonable to consider moving elements closer to the front "free", since we have to perform a linear search to reach the element anyway: the number of such swaps is at most a constant multiple of the number of comparisons made to find the element, so this changes the cost only by a constant factor.

  • Access: costs \(i\) if element is in position \(i\).
  • Delete: costs \(i\) if element is in position \(i\).
  • Insert: costs \(n+1\) if \(n\) elements are already there.

The above costs are the costs before self-adjusting.

The request sequence \(S\) has a total of \(M\) queries and a maximum \(n\) data values. We start with an empty list.

The cost for a search that ends by finding the element in position \(i\) is \(i\) plus the number of paid exchanges.

  • Let \(C_A(S)\) be the total cost of all the operations for algorithm \(A\), not counting paid exchanges.

  • Let \(X_A(S)\) be the number of paid exchanges.

  • Let \(F_A(S)\) be the number of free exchanges.

\[ X_{MTF}(S) = X_{TR}(S) = X_{FC}(S) = 0 \]

This is because in all the above algorithms (MTF, TR, FC) we only move elements to the front.

For any algorithm \(A\), \(F_A(S) \leq C_A(S) - M\). After accessing the \(i\)th element, there are at most \(i-1\) free exchanges that can be made.

However, to perform competitive analysis we need to know the optimal offline algorithm. For the question of self-adjusting lists, this is not a known solution. Therefore, we compare it against a general case (any algorithm).

\[ C_{MTF}(S) \leq 2 C_A(S) + X_A(S) - F_A(S) - M \]

We try to prove the above for any algorithm \(A\) starting with the empty list.

We run algorithm \(A\) and \(MTF\), in parallel, on the same request sequence \(S\). The potential function \(\Phi\) is defined as the number of inversions between MTF's list and A's list.

An inversion is a pair of elements \(i\) and \(j\) that appear in different orders in the two lists (\(i\) before \(j\) in one list and \(j\) before \(i\) in the other). We consider all pairs to count the total number of inversions.

Since we start with an empty list, the initial potential is \(0\) and \(\Phi \geq 0\) since the number of inversions cannot be negative.

Now we bound the amortized time of access for a particular element. Consider an access by \(A\) to position \(i\) and the corresponding access by \(MTF\) to position \(k\). Let the element being accessed be \(x\).

visualized MTF vs A

Let \(X_i\) be the number of items preceding the searched element in MTF's list but not in A's list. So the number of items preceding the searched element in both lists is \(k - 1 - X_i\).

Moving \(x\) to the front of MTF's list creates \(k - 1 - X_i\) inversions (from the items that precede \(x\) in both lists) and destroys \(X_i\) inversions (from the items that precede \(x\) only in MTF's list).

So the amortized time is \(k\) + \((k - 1 - X_i)\) - \(X_i\) = \(2(k-X_i) -1\).

Since \(k - 1 - X_i \leq i - 1\) (the items that precede \(x\) in both lists also precede it in A's list, where \(x\) is at position \(i\)), we get \(k - X_i \leq i\).

Therefore, amortized time of access is at most \(2i - 1 = 2C_A - 1\), where \(C_A\) is the cost of access for \(A\). The same arguments hold for INSERT and DELETE operations.

However, this assumes that \(A\) doesn't self-adjust. But \(A\) is also allowed to self-adjust and this needs to be accounted for in the potential function.

An exchange by \(A\) has zero cost to \(MTF\). So amortized time of an exchange by \(A\) is just the increase in the number of inversions caused by the exchanges made by \(A\), that is, \(1\) for any paid exchange made by \(A\), \(-1\) for any free exchanges.

free and paid exchanges by algorithm A

Therefore, summing the amortized costs over the whole request sequence, the total cost of MTF is at most:

\[ 2 C_A(S) + X_A(S) - F_A(S) - M \]

MTF is 2-competitive (its competitive ratio is 2):

\[ \begin{align} T_A(S) &= C_A(S) + X_A(S) \\ X_{MTF}(S) &= 0 \\ T_{MTF}(S) &= C_{MTF}(S) \\ &\leq 2C_A(S) + X_A(S) - F_A(S) - M \\ &\leq 2C_A(S) + X_A(S) \\ &\leq 2C_A(S) + 2X_A(S) \\ &= 2 T_A(S) \end{align} \]

This shows that MTF performs at most twice as badly as the best offline algorithm (which sees the entire request sequence in advance).

Self-Adjusting Binary Search Trees

Problem statement: We're given a sequence of access queries, what is the best way to organize the values of a search tree?

Model: Searches must start at the root and the cost is the number of nodes inspected.

Worst-case behaviour: \(\Theta(\lg n)\)

Static Optimal Solution

Let \(p_i\) be the probability of accessing element \(A_i\), and let \(q_i\) be the probability that a query falls in the gap between \(A_i\) and \(A_{i+1}\) (\(i = 0, 1, \dots, n\)).

Let \(A_0 = -\infty\) and \(A_{n+1} = + \infty\) and \(p_0 = 0\) and \(p_{n+1} = 0\).

Let \(T[i, j]\) be the root of the optimal tree on range \(q_{i-1}\) to \(q_j\).

Let \(C[i, j]\) be the cost of the optimal tree rooted at \(T[i, j]\); this is the sum of the probability of looking for one of the values (or one of the gaps) in this range times the cost of that item (or gap).

We use a dynamic programming approach to solve this problem:

Observation: First we define \(W[tree]\) (the weight of the tree) as the probability of being in the tree. The cost of the optimal tree rooted at \(r\):

\[ C[\text{tree rooted at }r] = W[tree] + C[L] + C[R] \]

We define a table, \(W[i, j]\) to be the probability of accessing any node or gap in the range \(q_{i-1}, \dots, q_j\) and \(p_i, \dots, p_j\).

\[ W[i,j] = \sum_{k=i}^{j} p_k + \sum_{k=i-1}^{j} q_k \]

This can be calculated in \(O(n^2)\) time.

\[ W[i, j+1] = W[i,j] + p_{j+1} + q_{j+1} \]

Dynamic Programming

\[ C[i,j] = \begin{cases} \min_{r=i}^{j}\left(W[i,j] + C[i, r-1] + C[r+1, j]\right) & \text{if } i \leq j \\ 0 & \text{if } j = i - 1 \text{ (the range is a single gap)} \end{cases} \]
OPTIMAL_BST(p, q, n)
  for i = 1 to n + 1
    C[i, i-1] = 0
    W[i, i-1] = q[i-1]

  // l = j - i + 1
  for l = 1 to n
    for i = 1 to n-l+1
      j = i+l-1

      C[i, j] = infinity
      W[i, j] = W[i, j-1] + p[j] + q[j]

      for r = i to j
        t = W[i, j] + C[i, r-1] + C[r+1, j]
        if t < C[i, j]:
          C[i, j] = t
          T[i, j] = r

At the end C[1, n] is the cost of the optimal binary search tree and the root is stored in T[1, n]. The running time for this algorithm is \(O(n^3)\).

The tree itself can be built from the T table, as sketched below.
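A short Python sketch (my own) of reconstructing the tree from the T table, assuming T is indexed as T[i][j] exactly as in the pseudocode above:

def build_tree(T, i, j):
    # Returns the optimal BST on the key range i..j as a nested
    # (root index, left subtree, right subtree) tuple; None for an empty range.
    if i > j:
        return None
    r = T[i][j]
    return (r, build_tree(T, i, r - 1), build_tree(T, r + 1, j))

# Example: build_tree(T, 1, n) returns the whole optimal tree.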

It is possible to compute the optimal binary search tree in \(O(n^2)\) running time instead of \(O(n^3)\).

Splay Trees

Performing single rotations to move an accessed node to the root is a bad approach: the accessed node reaches the root, but the other nodes on the access path can stay just as deep, so repeated accesses can still be expensive.

On accessing a node, we move it (splay it) to the root by a sequence of local moves, where each move is called a splay step.

Let \(x\) be the node being splayed. \(y\) is the parent of \(x\) and \(z\) is the parent of \(y\).

There are 3 unique cases:

  1. Zig step: If \(y\) is the root, we rotate \(x\) so that it becomes the root (single rotation).

zig step visualized

  2. Zig-Zig step: If \(y\) is not the root and \(x\) and \(y\) are both left children or both right children, we rotate \(y\) first and then rotate \(x\).

zig-zig step visualized

  3. Zig-Zag step: If \(y\) is not the root and \(x\) is a right child while \(y\) is a left child, or vice-versa, we perform a double rotation on \(x\).

zig-zag step visualized

The cost of splaying a node \(x\) at depth \(d\) is \(O(d)\).

Splaying roughly halves the length of the access path.
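Below is a hedged Python sketch (my own, not the lecture's code) of bottom-up splaying with the three cases; the Node class and rotation helper are illustrative assumptions.

# Bottom-up splay: repeatedly apply zig, zig-zig, or zig-zag steps to x.
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = self.parent = None

def rotate_up(x):
    # Single rotation that moves x above its parent.
    p, g = x.parent, x.parent.parent
    if p.left is x:                 # x is a left child: right rotation
        p.left = x.right
        if x.right: x.right.parent = p
        x.right = p
    else:                           # x is a right child: left rotation
        p.right = x.left
        if x.left: x.left.parent = p
        x.left = p
    p.parent = x
    x.parent = g
    if g:
        if g.left is p: g.left = x
        else: g.right = x

def splay(x):
    while x.parent:
        p, g = x.parent, x.parent.parent
        if g is None:                           # zig: parent is the root
            rotate_up(x)
        elif (g.left is p) == (p.left is x):    # zig-zig: rotate y, then x
            rotate_up(p)
            rotate_up(x)
        else:                                   # zig-zag: double rotation on x
            rotate_up(x)
            rotate_up(x)
    return x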

Definitions:

  • \(w(i)\) is the weight of \(i\) item (always positive weights).
  • \(s(x)\) is the size of node \(x\) in the tree, the sum of the individual weights of all items in the subtree rooted at \(x\).
  • \(r(x)\) the rank of the node \(x\), which is \(\lg(s(x))\).

The potential function \(\Phi\) of a tree is the sum of the ranks of all its nodes.

\[ \Phi(T) = \sum_{x \in T} r(x) \]

Note \(\Phi_i\) is not necessarily non-negative.

\[ \sum_{i=1}^{m} C_i = \sum_{i=1}^{m} a_i + \Phi_0 - \Phi_m \]

Now the equation still holds but we can't just throw away the potential, we need to consider it in our upper bound (since it may be negative).

Access lemma: The amortized time to splay a tree with root \(t\) at node \(x\) is at most:

\[ 3(r(t) - r(x)) + 1 = O\left(\lg \frac{s(t)}{s(x)}\right) \]

Proof: Access Lemma Proof

Note that \(r\left(t\right) = r_k\left(x\right)\) since \(x\) is the root after the splay step.

Balance Theorem: Given a sequence of \(m\) accesses into an n-node splay tree it has total access time of \(O\left(\left(m+n\right) \lg n + m\right)\).

Proof: We first assign a weight of \(1/n\) to each item. Then, \(1/n \leq s\left(x\right) \leq 1\). Therefore, the amortized cost of any access is \(3 \lg\left(1\right) - 3 \lg\left(1/n\right) + 1\) which is upper bounded by \(3 \lg n + 1\).

The maximum potential of \(T\) is given by:

\[ \Phi = \sum_{\forall x} r\left(x\right) = \sum_{i=1}^n \lg \left(s\left(i\right)\right) \leq \sum_{i=1}^n \lg 1 = 0 \]

The minimum potential of \(T\) is given by:

\[ \Phi = \sum_{\forall x} r\left(x\right) = \sum_{i=1}^n \lg \left(s\left(i\right)\right) \geq \sum_{i=1}^n \lg \left(1/n\right) = -n \lg n \]

Therefore, \(\Phi_0 - \Phi_m \leq 0 - \left(-n \lg n\right) = n \lg n\) is the maximum difference in potential. We substitute this into the amortized cost calculation to get an upper bound on the amortized cost.

\[ \begin{align} \text{Total cost } &\leq m \left(3\lg n + 1\right) + \left(n \lg n\right) \\ &\leq m \left(3 \lg n + 1\right) + n \lg n \\ &= O\left(\left(m+n\right) \lg n + m\right) \end{align} \]

The worst case can be pretty bad since the splay tree doesn't guarantee any shape. Therefore, the worst case can be linear time \(O\left(n\right)\). However, if we access \(m \geq n\) times the amortized cost per access is \(O\left(\lg n\right)\).

Static Optimality Theorem: Given a sequence of \(m\) accesses into an \(n\)-node splay tree \(T\), where each item is accessed at least once, the total access time is given by:

\[ O\left(m + \sum_{i=1}^n q\left(i\right) \lg \left(\frac{m}{q\left(i\right)}\right)\right) \]

Where \(q\left(i\right)\) is the access frequency of item \(i\).

Proof: Assign a weight of \(q\left(i\right)/m\) to each item \(i\). Then:

\[ \frac{q\left(x\right)}{m} \leq s\left(x\right) \leq 1 \]

The amortized cost of each access is at most:

\[ \begin{align} &3 \left(\lg 1 - \lg \left(\frac{q\left(x\right)}{m}\right)\right) + 1 \\ =& 3 \lg \left(\frac{m}{q\left(x\right)}\right) + 1 \end{align} \]

The maximum potential of \(T\) is given by:

\[ \Phi = \sum_{\forall x} r\left(x\right) = \sum_{i=1}^n \lg \left(s\left(i\right)\right) \leq \sum_{i=1}^n \lg 1 = 0 \]

The minimum potential of \(T\) is given by:

\[ \Phi = \sum_{\forall x} r\left(x\right) = \sum_{i=1}^n \lg \left(s\left(i\right)\right) \geq \sum_{i=1}^n \lg \left(\frac{q\left(i\right)}{m}\right) \]

Summing the amortized cost over all \(m\) accesses gives

\[ \sum_{i=1}^n q\left(i\right) \left(3 \lg \left(\frac{m}{q\left(i\right)}\right) + 1\right) \]

and, as in the balance theorem, the difference in potential \(\Phi_0 - \Phi_m\) is at most \(\sum_{i=1}^n \lg\left(\frac{m}{q\left(i\right)}\right)\).

Note that the sum of all \(q\left(i\right)\) is equal to \(m\), the total number of accesses. Adding the amortized costs and the potential difference, the total cost is:

\[ \begin{align} &3 \sum_{i=1}^{n} \left(q\left(i\right) \cdot \lg \left(\frac{m}{q\left(i\right)}\right)\right) + m + \sum_{i=1}^{n} \lg \frac{m}{q\left(i\right)} \\ &= O\left(m + \sum_{i=1}^{n} q\left(i\right) \lg \left(\frac{m}{q\left(i\right)}\right)\right) \end{align} \]

The entropy of the access is given by the following equation:

\[ H = \sum_{i=1}^n q\left(i\right) \lg \left(\frac{m}{q\left(i\right)}\right) \]

Insert Operation

We insert the value as a new leaf treating the tree as a regular binary search tree. Then we splay the newly inserted node to the top of the tree.

  1. Insert the node normally.
  2. Splay at the newly inserted node.

insert operation on splay trees

Delete Operation

We first perform a binary search to find the node, then delete it. Deleting the node can leave up to 3 disconnected trees: the tree containing the parent, the left sub-tree, and the right sub-tree.

To combine these 3 trees into one, we splay the largest value in the left sub-tree to its root. This new root has no right child (it holds the largest value), so we attach the right sub-tree as its right child. We then attach the combined tree back to the parent. Finally, we splay at the parent of the deleted node.

  1. Delete the node.
  2. Splay the largest node in the left sub-tree.
  3. Attach the right sub-tree to the left sub-tree.
  4. Attach the parent to the new tree.
  5. Splay the parent of the deleted node to the top.

delete operation on splay tree

Summary

  1. It's unknown if splay trees are competitive.

  2. The average running time of an access is \(O(\lg n)\) if the number of accesses is at least the number of elements.

  3. Access lemma: amortized cost is less than \(3(r(t) - r(x)) + 1\).

Entropy

Consider a set of objects \(S = \{k_1, k_2, \dots, k_n\}\) and a set of probabilities \(D = \{p_1, p_2, \dots, p_n\}\), where \(p_i\) is the probability associated with object \(k_i\).

The entropy of \(D\) is given by:

\[ H(D) = -\sum_{i=1}^{n} p_i \lg p_i = \sum_{i=1}^{n} p_i \lg \left(\frac{1}{p_i}\right) \]
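As a quick sanity check of the formula, here is a small Python helper (the name entropy is illustrative) that computes \(H(D)\) from a list of probabilities.

import math

def entropy(probs):
    # H(D) = sum of p_i * lg(1 / p_i); terms with p_i = 0 contribute nothing.
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# A uniform distribution over 8 symbols has entropy lg 8 = 3.
print(entropy([1 / 8] * 8))   # 3.0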

Shannon's Theorem

Consider a sender (Alice) and a receiver (Bob). The sender wants to send a sequence \(S_1, S_2, \dots, S_m\) where each \(S_j\) is chosen randomly and independently according to \(D\) (the distribution). So that \(S_j = k_i\) with probability \(p_i\).

For any protocol the sender and receiver might use, the expected number of bits required to transmit the sequence using that protocol is at least \(m * H(D)\), where \(m\) is the length of the sequence and \(H(D)\) is the entropy of the distribution.

For a uniform distribution, \(p_i = 1/n\) and therefore, \(H(D) = \lg n\). This is the worst case with maximum entropy.

For a geometric distribution with \(p_i = 1/2^i\) for all \(1 \leq i < n\) (and \(p_n = 1/2^{n-1}\) so the probabilities sum to 1), the entropy works out to be:

\[ H(D) = \left[ \sum_{i=1}^{n-1} \frac{i}{2^i} \right] + \frac{n-1}{2^{n-1}} \leq 2 \]

Therefore, the expected number of bits required to transmit the data is about \(2m\), regardless of the size of \(n\). Note that \(n\) is the number of symbols in the distribution.


Theorem: For any comparison-based data structure storing \(k_1, \dots, k_n\), where item \(k_i\) is searched for with probability \(p_i\), the expected number of comparisons while searching for an item is \(\Omega(H(D))\).

Proof: The sender and receiver both store \(k_1, \dots, k_n\) in the same comparison-based data structure beforehand. When the sender wants to transmit \(S\), they perform a search for \(S\) in the data structure. The results of the comparisons form a bit sequence, with 1 encoding true and 0 encoding false, which is sent to the receiver.

On the receiving end, the receiver runs the search algorithm without knowing the value of \(S\). This can be done since the receiver is doing exactly the same comparisons and knows the result of the comparison. This way the sender can transmit \(S\) to the receiver.

By applying Shannon's Theorem, the expected number of bits required to transmit \(S_1, \dots, S_m\) using this approach is at least \(m \cdot H(D)\). However, the expected number of bits to transmit the sequence must be less than or equal to the number of comparisons to search in the data structure.

Therefore, the expected number of comparisons for the \(m\) searches is at least \(m \cdot H(D)\), i.e., \(\Omega(H(D))\) per search.

Static Optimal Binary Search Tree

We start with the assumption that all elements in the request sequence are in the tree. Let \(p_i\) be the probability of access of \(A_i\) and in this case, the entropy would be:

\[ H(D) = - \sum_{i=1}^{n} p_i \lg p_i \]

The expected cost of the static optimal binary search tree would be greater than or equal to the entropy (\(H\)).

If not all request elements are in the tree, let \(q_i\) be the probability of accessing a gap \((A_i, A_{i+1})\). For this the entropy of the access would be:

\[ H(D) = - \sum_{i=1}^{n} p_i \lg p_i - \sum_{i=1}^{n} q_i \lg q_i \]

However, constructing an optimal binary search tree is expensive (\(O(n^2)\)), so we instead consider a NEARLY optimal binary search tree.

We first convert the gaps into items in the tree with dummy nodes.

optimal BST dummy nodes

Goal: Build a BST such that the expected query time is \(O(H+1)\). We write \(O(H+1)\) because \(H\) can be much smaller than \(1\), and the \(+1\) ensures the bound is at least constant.

A common trick: Probability splitting.

We find a key \(k_i\) such that, on each side of \(k_i\), the total probability is at most half the total probability of all the elements. At the root the probabilities sum to \(1\), but since we apply this recursively, the sum will generally not be \(1\) in the subproblems.

\[ \begin{gather} \sum_{j=1}^{i-1} p_j \leq \frac{1}{2} \sum_{j=1}^{n} p_j \\ \text{and} \\ \sum_{j=i+1}^{n} p_j \leq \frac{1}{2} \sum_{j=1}^{n} p_j \\ \end{gather} \]

The key \(k_i\) becomes the root of a BST and then we do this recursively. We're assuming we're given a sorted set of keys.
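A minimal Python sketch of this recursive construction, assuming the keys are given in sorted order together with their probabilities (all names are illustrative). The linear scan used here to find the split key is what makes the naive version quadratic; the speed-ups discussed below replace it.

def build_nearly_optimal_bst(keys, probs):
    # keys are sorted; probs[i] is the access probability of keys[i].
    if not keys:
        return None
    total = sum(probs)
    # Find a split index i such that the probability mass strictly to the
    # left and strictly to the right of keys[i] is at most half of total.
    left_sum = 0.0
    for i in range(len(keys)):
        right_sum = total - left_sum - probs[i]
        if left_sum <= total / 2 and right_sum <= total / 2:
            break
        left_sum += probs[i]
    return {
        "key": keys[i],
        "left": build_nearly_optimal_bst(keys[:i], probs[:i]),
        "right": build_nearly_optimal_bst(keys[i + 1:], probs[i + 1:]),
    }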

Observation: In \(T\), if node \(k_i\) has \(depth_T(k_i)\), then if we sum \(p_j\) over all \(k_j \in T(k_i)\) where \(T(k_i)\) is the set of nodes in the sub-tree rooted at \(k_i\):

\[ \sum_{k_j \in T(k_i)} p_j \leq \frac{1}{2^{depth_T(k_i)}} \]

This follows since at each level we divide the nodes into two parts, the left and the right sub-tree, in such a way that the sum of the probabilities is less than half the total probability.

process of building the tree

The tree \(T\) can answer queries in \(O(H+1)\) time.

Proof: We know

\[ \begin{gather} p_i \leq \sum_{k_j \in T(k_i)} p_j \leq \frac{1}{2^{depth_T(k_i)}} \\ \text{and}\\ depth_T(k_i) \leq \lg \left(\frac{1}{p_i}\right) \end{gather} \]

Thus the expected depth of a key chosen is:

\[ \sum_{i=1}^{n} p_i \cdot depth_T(k_i) \leq \sum_{i=1}^{n} p_i \cdot \lg \left(\frac{1}{p_i} \right) = H \]

Now we show that the tree can be constructed quickly. Finding a key \(k_i\) with this property naively requires \(\Theta(n)\) time per subproblem, which would lead to an overall construction time of \(\Theta(n^2)\), similar to the other method of building the tree.

To speed this up, we first compute a prefix sum array of the probabilities. Each split key \(k_i\) can then be found with a binary search over this array for the point where the prefix sum passes half of the current range's mass. The total running time to build the tree then becomes \(\Theta(n \lg n)\).

But we can do even better: First we find \(k_i\) by doing a doubling search or an exponential search.

  1. We calculate a prefix sum array.
  2. Check positions \(1, 2, 4, 8, \dots\) until we overshoot the expected prefix sum.
  3. After we overshoot, we perform a binary search between the two elements.

Even though this is still \(\Theta(\lg n)\) in the worst case, it is more precisely \(\Theta(\lg i)\) (where \(i\) is the index found), which is preferable when the key we are looking for is near the beginning of the list.

Therefore, we run the search simultaneously from both ends: we start at \(k_1\) and walk forward, and at the same time start at \(k_n\) and walk backward.

\(k_i\) can then be found in \(O(\lg \min(i, n-i+1))\) time, and the total running time to build the tree is given by:

\[ T(n) = O(\lg \min(i, n-i+1)) + T(i-1) + T(n-i) \]

This works out to be \(O(n)\).
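Here is a sketch of the galloping (doubling) search over a precomputed prefix-sum array; it probes offsets \(1, 2, 4, 8, \dots\) until the half-mass point is bracketed and then binary searches inside the bracket. For simplicity it gallops from the left end only; the lecture's version runs the same search from both ends simultaneously to obtain the \(O(\lg \min(i, n-i+1))\) bound. All names are illustrative.

import bisect

def split_index(prefix, lo, hi):
    # prefix[i] = p_1 + ... + p_i, with prefix[0] = 0 (keys are 1-based).
    # Returns the smallest i in [lo, hi] whose prefix sum reaches half of the
    # probability mass of keys lo..hi; each side of key i then carries at
    # most half of that mass.
    half = (prefix[lo - 1] + prefix[hi]) / 2.0
    # Gallop: probe offsets 1, 2, 4, 8, ... from lo until we pass the half mass.
    step = 1
    while lo + step - 1 < hi and prefix[lo + step - 1] < half:
        step *= 2
    hi_probe = min(lo + step - 1, hi)
    lo_probe = max(lo + step // 2 - 1, lo)
    # Binary search for the crossing point between the last two probes.
    return bisect.bisect_left(prefix, half, lo_probe, hi_probe + 1)

prefix = [0, 0.1, 0.2, 0.7, 0.9, 1.0]   # probabilities 0.1, 0.1, 0.5, 0.2, 0.1
print(split_index(prefix, 1, 5))         # 3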

MTF Compression

We use the concept of entropy to bound the efficiency of compression algorithms.

Elias Gamma Coding

Commonly used to code positive integers whose maximum value is not known in advance.

  1. We first write down the binary representation of \(i\). The number of bits needed would be \(\lfloor \lg i \rfloor + 1\) bits.

  2. We prepend it by \(\lfloor \lg i \rfloor\) 0's.

Example: 9 -> 1001 -> 0001001.

When reading a sequence of bits, the number of zeros will tell us the number of bits in the next number.

For the sequence: 9, 5, 7 we would encode it as 0001001 00101 00111.

The number of bits required to encode \(i\) would then be \(1 + 2 \lfloor \lg i \rfloor\).
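A small Python sketch of the encoder and a stream decoder (function names are illustrative), working on strings of '0'/'1' characters for readability.

def gamma_encode(i):
    # Elias gamma code of a positive integer i:
    # floor(lg i) zeros followed by the binary representation of i.
    assert i >= 1
    b = bin(i)[2:]                  # binary representation: floor(lg i) + 1 bits
    return "0" * (len(b) - 1) + b

def gamma_decode_stream(bits):
    # Decode a concatenation of gamma codes back into integers.
    out, pos = [], 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        out.append(int(bits[pos:pos + zeros + 1], 2))
        pos += zeros + 1
    return out

print(gamma_encode(9))                                   # 0001001
print(gamma_decode_stream(gamma_encode(9) + gamma_encode(5) + gamma_encode(7)))  # [9, 5, 7]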

Better Than Elias Gamma Encoding

This is a scheme that uses fewer bits than the Elias Gamma Coding method by just replacing the zeroes that prefix numbers with the Elias Gamma encoded number.

That is, the number of bits are encoded in unary which we can replace with the Elias gamma encoding for that number.

Example: 9 -> 000 1001 -> 011 1001

elias gamma next level example

The number of bits required to encode \(i\) is now going to be \(1 + 2 \lfloor \lg \lg i \rfloor\) bits to encode the length and then \(\lg i\) bits to encode the number. Therefore the total required bits would be \(1 + 2 \lg \lg i + \lg i\) bits. To summarize: \(\lg i + O(\lg \lg i)\).

We can apply this idea recursively to get a slightly better encoding scheme, known as recursive Elias or Elias omega coding, but we will not need it here.


MTF compression encodes a sequence of \(m\) integers drawn from \(1, \dots, n\).

The sender and receiver each maintain identical lists containing the integers \(1 \dots n\). To send the symbol \(j\), the sender looks for \(j\) in their list and finds it at position \(i\). The sender then encodes \(i\) using the two-level Elias gamma scheme and sends it to the receiver.

The receiver then decodes \(i\) and looks at the ith element in their list to find \(j\). The sender and receiver both move element \(j\) to the front of their lists and continue.
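A Python sketch of the two sides of this protocol (names illustrative). The encoder reports the 1-based position of each symbol and moves it to the front; in the full scheme each position would then be written with the two-level Elias gamma code above.

def mtf_encode(sequence, n):
    table = list(range(1, n + 1))            # both sides start with the list 1..n
    positions = []
    for j in sequence:
        i = table.index(j) + 1               # 1-based position of j in the list
        positions.append(i)
        table.insert(0, table.pop(i - 1))    # move j to the front
    return positions

def mtf_decode(positions, n):
    table = list(range(1, n + 1))
    out = []
    for i in positions:
        j = table[i - 1]
        out.append(j)
        table.insert(0, table.pop(i - 1))
    return out

seq = [3, 3, 1, 3]
print(mtf_encode(seq, 4))                    # [3, 1, 2, 2]
print(mtf_decode(mtf_encode(seq, 4), 4))     # [3, 3, 1, 3]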

Jensen's Inequality:

Let \(f: \mathbb{R} \to \mathbb{R}\) be a strictly increasing concave function. Then \(\sum_{i=1}^{n} f(t_i)\), subject to the constraint \(\sum_i t_i \leq m\), is maximized when \(t_1 = t_2 = \dots = t_n = m/n\).

A concave function is one where, between any two points, the curve lies above the straight line connecting them; equivalently, \(f'\) is decreasing.

concave function

The MTF compression algorithm compresses the sequence \(S_1, S_2, \dots, S_m\) into \(n \lg n\) + \(mH\) + \(O(n \lg\lg n + m\lg H)\) bits.

Proof: If the number of distinct symbols between two consecutive occurrences of the integer \(j\) is \(t-1\), then the cost of encoding the second occurrence of \(j\) is \(\lg t + O(\lg \lg t)\). This is because element \(j\) can have moved back to at most position \(t\).

Therefore, the total cost of encoding all occurrences of \(j\) is

\[ C_j \leq \lg n + O(\lg \lg n) + \sum_{i=1}^{m_j} \left( \lg t_i + O(\lg \lg t_i) \right) \]

Since \(\sum_i t_i \leq m\), Jensen's inequality (with every \(t_i = m/m_j\)) gives

\[ C_j \leq \lg n + O(\lg \lg n) + m_j \lg \left(\frac{m}{m_j}\right) + O\left(m_j \lg \lg \left(\frac{m}{m_j}\right)\right) \]

Summing this over all \(j\), the total number of bits required:

\[ \sum_{j=1}^{n} C_j \leq n\lg n + O(n \lg \lg n) + mH + \sum_{j=1}^{n} O\left(m_j \lg \lg \frac{m}{m_j}\right) \]

Here \(p_j = m_j / m\); applying Jensen's inequality once more to the last sum shows it is \(O(m \lg H)\), so the bound can be expressed in terms of the entropy.

\[ \text{Total cost} \leq n \lg n + mH + O(n \lg \lg n) + O(m \lg H) \]

Here \(n\) is the size of the alphabet that we are encoding and \(m\) is the number of elements in the sequence. \(mH\) is the theoretical lower bound. Therefore, this solution gets pretty close to the theoretical lower bound.

Note: This assumes that symbols are purely independent from each other, which is often not true. For example, the probability that you see "e" after the letters "th" is not just 1 in 26 but a lot higher since "the" is a very frequent word.

Extra Compression Details

Bzip2 is based on the Burrows-Wheeler Transform (BWT): it first transforms the data with BWT and then applies MTF followed by Huffman coding. (Gzip, by contrast, uses LZ77-based DEFLATE rather than BWT.) These schemes offer much faster compression in practice.

Text Indexing Structures

We have an alphabet \(\Sigma\) with \(\sigma\) number of symbols. We are given a text \(T\) of length \(n\) over the alphabet \(\Sigma\). We are required to search for the pattern \(P\) of length \(p\) in text \(T\).

There are two separate approaches based on the situation.

If no preprocessing is allowed, then we cannot pre-construct a data structure to help answer queries, and we must use a string matching algorithm directly. The obvious approach leads to a running time of \(O(n \cdot p)\).

Even without preprocessing, we can speed this up to \(O(n + p)\) by using one of the following algorithms:

  • Knuth Morris Pratt
  • Boyer-Moore
  • Rabin-Karp

If we are allowed to preprocess and build data structures, then we can create text indexes that answer queries for different patterns \(P\). The index speeds up the text search each time a new pattern is given.

We are trading space for query time.

Warm Up Structure: Trie

  1. We represent a set of strings \(\{T_1, T_2, \dots, T_k\}\).
  2. A rooted tree with edges labelled with symbols from \(\Sigma\).
  3. Each string is represented as a root-to-leaf path in the trie, terminated by a special end-of-string symbol.

We use an end of string symbol eg: $.

example of trie for strings

We sort the edges in lexicographic order.

Listing the leaves from left to right then gives the strings in sorted order, because the children of each node are ordered.

To check whether \(p\) is in the set of strings \(\{T_1, T_2, \dots, T_k\}\), we perform a top-down traversal, assuming each node stores pointers to its children in sorted order.

The number of nodes in the trie is \(n \leq \sum |T_i| + 1\).

Query time to check if \(p\) is in the trie: \(O(p \lg \sigma)\).
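A minimal dictionary-based trie sketch in Python (names illustrative). Children are stored in a dict and '$' marks end of string; note that a hash map gives expected \(O(1)\) time per character instead of the \(O(p \lg \sigma)\) bound for sorted child pointers.

END = "$"

def trie_insert(root, word):
    node = root
    for ch in word + END:
        node = node.setdefault(ch, {})   # descend, creating nodes as needed

def trie_contains(root, word):
    node = root
    for ch in word + END:
        if ch not in node:
            return False
        node = node[ch]
    return True

root = {}
for t in ["car", "cart", "cat"]:
    trie_insert(root, t)
print(trie_contains(root, "cart"))   # True
print(trie_contains(root, "ca"))     # False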

Since the nodes require a lot of space, we can compress the trie (a compressed trie) by contracting each non-branching path into a single edge.

Suffix Trees

A compressed trie of all suffixes of T$.

example suffix tree

We have \(n+1\) leaves and each edge label is a substring \(T[i \dots j]\).

We only store two indices \(i, j\) to store the label of an edge.

The space required for this tree: \(O(n)\).

Searching for \(P\) from the root, if we successfully match all characters of \(P\), we arrive at a sub-tree whose leaves correspond to all occurrences of \(P\). If we hit a mismatch, then \(T\) does not contain \(P\) as a substring.

If each node stores pointers to its children in sorted order, then for a particular \(P\) the running time is \(O(p \lg \sigma)\). If we use perfect hashing instead, the query time goes down to \(O(p)\).

To find all occurrences of the substring requires running time: \(O(p + occ)\) where \(occ\) is the number of occurrences of the pattern.

Applications:

  • Longest repeated substring in \(T\): can be found by finding the branching node with the maximum letter depth. This requires \(O(n)\) time.

However, for natural language the suffix tree requires around 15-20 times the size of \(T\) to store the tree.

Suffix Arrays

We sort the suffixes of \(T\) and then we just store the starting indices.

suffix array example for banana

This requires \(n+1\) words. To search for a pattern \(P\) we can just perform a simple binary search which would lead to a running time of \(O(p \lg n)\) to see if \(P\) occurs in \(T\).

To find all occurrences of \(P\) in \(T\) would require \(O(p\lg n + occ)\).
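A naive Python sketch of building a suffix array and answering a pattern query with binary search (names illustrative). The construction here sorts whole suffixes, so it is far from the linear-time constructions mentioned later, but the query follows the \(O(p \lg n + occ)\) idea described above.

def build_suffix_array(T):
    # Naively sort all suffixes of T$ and keep their starting indices.
    T += "$"
    return sorted(range(len(T)), key=lambda i: T[i:])

def find_occurrences(T, SA, P):
    S = T + "$"
    lo, hi = 0, len(SA)
    while lo < hi:                                   # leftmost suffix >= P
        mid = (lo + hi) // 2
        if S[SA[mid]:] < P: lo = mid + 1
        else: hi = mid
    start = lo
    lo, hi = start, len(SA)
    while lo < hi:                                   # first suffix not starting with P
        mid = (lo + hi) // 2
        if S[SA[mid]:SA[mid] + len(P)] <= P: lo = mid + 1
        else: hi = mid
    return sorted(SA[start:lo])                      # all starting positions of P

T = "banana"
SA = build_suffix_array(T)
print(SA)                                # [6, 5, 3, 1, 0, 4, 2]
print(find_occurrences(T, SA, "ana"))    # [1, 3]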

A Simple Acceleration

Let \(L\) and \(R\) be the left and right boundaries of the current search interval. We modify the binary search to use the midpoint to update \(L\) and \(R\) (normally, we would update them to \(midpoint \pm 1\)).

Let \(l\) be the length of the longest prefix of \(T[SA[L] .. n]\) that matches \(P\). Let \(r\) be the length of the longest prefix of \(T[SA[R] .. n]\) that matches \(P\).

The initial values of \(l\) and \(r\) are obtained by direct comparison.

Here is an example search for the string "ABA":

search for string aba in suffix array

If there is a match, at least one of them will equal the length of \(P\). Let \(h = \min(l, r)\).

Observation: the first \(h\) characters of \(P\) and of every suffix \(T[SA(L) .. n], T[SA(L+1) .. n], \dots, T[SA(R) .. n]\) agree; they all equal \(P[0 .. h-1]\).

Since the first \(h\) characters are the same, we do not need to compare them, we can just start from \(P[h]\). This way we are comparing fewer characters when searching for the substring.

This approach doesn't change the theoretical upper bound but in practice speeds up the search by quite a lot.

Super Accelerant

First we define a couple of terms:

  • The longest common prefix (lcp) of two strings \(s_1\) and \(s_2\).
  • \(lcp(i, j)\): the length of the longest common prefix of \(T[SA(i)..n]\) and \(T[SA(j)..n]\).

Call an examination of a character in \(P\) redundant if that character has been examined before.

Goal: Reduce the number of redundant character examinations to at most 1 per iteration of the binary search. This would lead to a running time of \(O(\lg n)\) for the binary search. In turn leading to a time of \(O(p + \lg n)\).

To achieve this, we perform a little more preprocessing: for each triple \((L, M, R)\) that can possibly arise during the binary search we pre-compute \(lcp(L, M)\) and \(lcp(M, R)\).

There are \((n+1)-2 = n - 1\) triples in total. We never use \(0\) or \(n\) as a value of \(M\) since it is always a midway point.

The total extra space required for this table is \(2n-2\) lcp values.

Let \(Pos(i)\) be the suffix that corresponds to the \(i\)th entry of the suffix array: \(T[SA[i] .. n]\).

Case 1:

In any iteration of the binary search, if \(l = r\), we compare \(P\) to the suffix \(Pos(M)\), starting from \(P[h]\). Since the first \(h\) characters match, we do not need to compare them again.

Case 2 (a):

If \(l \neq r\), assume that \(l > r\) without loss of generality (just a symmetric case).

If \(lcp(L, M) > l\) then we have a situation where the \(L\) and \(M\) strings agree with each other more than \(L\) agrees with \(P\).

super accelerant case 2 a

Since \(P > Pos(L)\) and \(lcp(L, M) > l\), it follows that \(P > Pos(M)\). Therefore, no character of \(P\) needs to be examined: \(L \gets M\), and \(l\) and \(r\) do not change.

Case 2 (b):

If \(l \neq r\), assume that \(l > r\) without loss of generality (just a symmetric case).

If \(lcp(L, M) < l\) we have a situation where \(L\) and \(M\) agree less than \(L\) agrees with \(P\).

Since \(P > Pos(L)\) and \(lcp(L, M) < l\), it follows that \(P < Pos(M)\). Rationale: \(Pos(L)\) and \(Pos(M)\) first differ at position \(lcp(L, M)\), where \(Pos(L)\)'s character is smaller; \(P\) still agrees with \(Pos(L)\) at that position, so \(P < Pos(M)\). Therefore, \(R \gets M\), \(r \gets lcp(L, M)\), and \(l\) remains unchanged.

Case 2 (c):

If \(l \neq r\), assume that \(l > r\) without loss of generality (just a symmetric case).

If \(lcp(L, M) = l\), then \(Pos(L)\), \(Pos(M)\), and \(P\) all agree on the same number of characters. Therefore \(P\) agrees with \(Pos(M)\) up to position \(l-1\), and we start the comparison from position \(P[l]\).

Running Time Analysis

In the two cases where the algorithm examines characters of \(P\) during an iteration, the comparison starts at \(P[\max(l, r)]\). If \(k\) characters are examined in that iteration, then \(k-1\) of them match and \(\max(l, r)\) increases by \(k-1\). Hence, at the start of any iteration, \(P[\max(l, r)]\) may have been examined before, but the character after it has not.

Therefore, we have reduced the number of redundant comparisons to at most 1 per iteration.

From this, we get a running time for the entire search to be \(O(\lg n + p)\).

Construction Time

Suffix trees and suffix arrays for strings over a constant-sized alphabet can be constructed in \(O(n)\) time. Trees and arrays for strings over alphabets \(\Sigma = \{1, 2, \dots, n\}\) can also be constructed in \(O(n)\) time.

Succinct Data Structures

Succinct Trees

Number Of Possible Trees

First we calculate the number of possible trees with \(n\) nodes.

For binary trees, the number of possible trees on \(n\) nodes is the Catalan number \(C_n = \binom{2n}{n} / (n + 1)\), where \(\binom{2n}{n}\) is \(2n\) choose \(n\).

Ordinal trees (trees whose children are ordered: 1st, 2nd, ...) can be mapped to binary trees: the left child of a node in the binary tree corresponds to that node's first child in the ordinal tree, and the right child corresponds to its immediate right sibling.

Left child (binary tree) -> first child (ordinal tree) and right child (binary tree) -> immediate right sibling (ordinal tree).

ordinal tree to binary tree

Note that since the root of the ordinal tree has no immediate right sibling, the root of the corresponding binary tree never has a right child. To convert an arbitrary binary tree to an ordinal tree, we first add a dummy root to ensure the root has no right child.

From this conversion we see that the number of ordinal trees is \(C_{n-1}\) for \(n\) nodes.

Information Theoretic Lower Bound

Now we find the information-theoretic lower bound for representing a binary tree. For a combinatorial object of size \(n\), if \(u\) is the number of distinct such objects of size \(n\), then the lower bound for representing one of them is \(\lg u\) bits.

For binary trees:

\[ \begin{align} \lg C_n &= \lg \left( \frac{(2n)!}{(n!)^2} \right) - \lg(n+1) \\ &= \lg ((2n)!) - 2\lg(n!) - \lg (n+1) \\ \text{(Using }& \text{Stirling's Approximation)} \\ &= 2n \lg(2n) - 2n\lg(n) + o(n) \\ &= 2n(1 + \lg n) - 2n \lg n + o(n) \\ &= 2n + o(n) \end{align} \]

More careful analysis leads to a bound of:

\[ \lg C_n \approx 2n - \frac32 \lg n + O(1) \]

Level Order Representation

  1. Append external nodes \((\square)\) for each missing child so that we form a complete binary tree.

  2. For each node in the level-order we write a \(0\) if it is external or \(1\) if it is internal.

binary tree using level order representation

This requires \(2n + 1\) bits of space.

We can navigate the tree in the following way: The left and right children of the \(i\)th internal node are at positions \(2i\) and \(2i+1\) in the bit vector.

Proof: Induction

Clearly, this is true for \(i=1\): the first internal node is the root, and its children are at positions 2 and 3.

For \(i>1\), by the induction hypothesis, the children of the \((i-1)\)st internal node are at positions \(2(i-1)\) and \(2(i-1)+1\). (There are two cases depending on whether the \((i-1)\)st and \(i\)th internal nodes are on the same level; the argument below covers both.)

Level ordering is preserved in the children. That is, \(A\)'s children precedes \(B\)'s children if and only if \(A\) precedes \(B\). All nodes between \((i-1)\)st internal node and the \(i\)th internal node are external nodes, with no children. So, the children of the \(i\)th internal node immediately follow the children of the \((i-1)\)th internal node, at positions \(2i\) and \(2i+1\).

Rank And Select in Bit Vectors

  • \(rank_1(i)\) is the number of 1 bits up to and including position \(i\) in a bit vector.

  • \(select_1(j)\) returns the position of the \(j\)th 1 bit.

rank and select example

If we identify each node by its position in the bit vector:

  • left_child(i): \(2 \cdot rank_1(i)\)
  • right_child(i): \(2 \cdot rank_1(i) + 1\)
  • parent(i): \(select_1(\lfloor i/2 \rfloor)\)
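A sketch of these navigation formulas in Python, using naive (linear-time) rank and select on a 1-based bit list; the succinct structures described next are what make both operations \(O(1)\). Names are illustrative.

def rank1(B, i):
    # number of 1 bits in B[1..i] (B[0] is an unused placeholder)
    return sum(B[1:i + 1])

def select1(B, j):
    # position of the j-th 1 bit
    count = 0
    for pos in range(1, len(B)):
        count += B[pos]
        if count == j:
            return pos

def left_child(B, i):  return 2 * rank1(B, i)
def right_child(B, i): return 2 * rank1(B, i) + 1
def parent(B, i):      return select1(B, i // 2)   # undefined for the root (i = 1)

B = [0, 1, 1, 0, 0, 0]   # level-order bits of a root whose only child is a left child
print(left_child(B, 1), right_child(B, 1), parent(B, 2))   # 2 3 1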

Rank

The objective is to support the rank operation in \(O(1)\) running time.

We split the bit vector into super-blocks of \(\lg^2 n\) bits each. For each super-block we store the cumulative rank up to it, which requires \(\lg n\) bits per super-block.

The total space required for this additional information is going to be:

\[ O\left(\frac{n}{\lg^2 n} \cdot \lg n\right) = O\left(\frac{n}{\lg n}\right) \text{ bits} \]

Then we split each super-block into blocks of \(\frac{1}{2} \lg n\) bits each. For each block we store its cumulative rank within its super-block, which requires only \(O(\lg \lg n)\) bits per block.

\[ O\left(\frac{n}{\lg n} \cdot \lg \lg n\right) = o(n) \text{ bits} \]

We use a lookup table for bit vectors of length \(1/2 \lg n\), i.e. for any possible bit vector of length \(1/2 \lg n\) and for each of its positions we store the rank of that position.

rank using O(1) method

We look the answer up in this 2D table. There are \(\sqrt{n}\) possible bit vectors of length \(\frac{1}{2} \lg n\). The table therefore requires space:

\[ O(\sqrt{n} \lg n \lg \lg n) \]

Where \(\sqrt{n}\) is the rows in the table, \(\lg n\) is the number of columns, and \(\lg \lg n\) is the number of bits required to store the answer.

The overall space for this structure:

\[ O\left(\frac{n \lg \lg n}{\lg n}\right) = o(n) \text{ bits} \]

We retrieve the rank of a position using exactly 3 steps (\(O(1)\)):

  1. Find the rank of the super-block containing it.
  2. Find the relative rank within the super-block and add it.
  3. Find the relative rank of the element within the block using the lookup table and add it.

The extra space needed to support rank in \(O(1)\) time can in fact be reduced to \(O(n / \lg^k n)\) bits for any constant \(k\).
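A simplified Python sketch of the two directory levels (super-blocks and blocks); the final in-block step is done by scanning here, standing in for the precomputed lookup table, and the exact block sizes are illustrative.

import math

def build_rank_directories(bits):
    n = len(bits)
    lg = max(2, int(math.log2(n)))
    blk = lg // 2                    # block size, roughly (1/2) lg n bits
    sb = blk * 2 * lg                # super-block size, roughly lg^2 n bits (a multiple of blk)
    super_dir, block_dir = [], []
    total = 0
    for s in range(0, n, sb):
        super_dir.append(total)      # cumulative rank before this super-block
        rel = 0
        for b in range(s, min(s + sb, n), blk):
            block_dir.append(rel)    # cumulative rank within the super-block
            rel += sum(bits[b:b + blk])
        total += rel
    return blk, sb, super_dir, block_dir

def rank1(bits, i, directories):
    blk, sb, super_dir, block_dir = directories
    r = super_dir[i // sb]                     # 1. rank of the containing super-block
    r += block_dir[i // blk]                   # 2. relative rank of the containing block
    r += sum(bits[(i // blk) * blk : i + 1])   # 3. in-block step (a table lookup in the real structure)
    return r

bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
d = build_rank_directories(bits)
print(rank1(bits, 7, d), sum(bits[:8]))   # both print 4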

Select

This uses a similar idea but is a bit more involved. Here is a reference paper that supports the select operation in \(O(1)\) time.

Reference paper for rank and select

Binary Tree Bit Vector Summary

We can store a binary tree using \(2n + o(n)\) bits and support left child, right child, and parent queries in \(O(1)\) time.

Binary Trees Using Balanced Parenthesis Repr.

If we want to represent a binary tree, we first need to convert it to an ordinal tree before building the balanced parenthesis representation. This is the same process as demonstrated above, but here is another example. Note that we use the dummy node as the root of the ordinal tree.

converting a binary tree to an ordinal tree

  1. We perform a DFS and the first time we visit a node we append a (.
  2. When we visit it after visiting all its descendants we append a ).

For the above tree, the balanced parenthesis representation would look like:

example of balanced parenthesis representation

There are exactly \(2n\) parentheses.
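A tiny Python sketch of producing the balanced parenthesis string by exactly this DFS rule, with the ordinal tree given as nested lists of children (an illustrative input format).

def balanced_parens(children):
    # Append '(' on the first visit to a node and ')' once all of its
    # descendants have been visited.
    return "(" + "".join(balanced_parens(c) for c in children) + ")"

# A root with three children; the second child has one child of its own.
print(balanced_parens([[], [[]], []]))   # (()(())()) -- 2n = 10 characters for n = 5 nodes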

| Binary Tree | Ordinal Tree | Parenthesis Representation |
|---|---|---|
| Node | Node | Left paren and its matching right paren |
| Left child | First child | Next character; if it is '(', the child exists |
| Right child | Next sibling | First character after the matching closing paren |
| Parent | Previous sibling / parent | Previous character: '(' => parent, ')' => previous sibling |
| | Sub-tree size | Half the number of characters between the matching parentheses |
| Sub-tree size | size(node) + size(siblings) | |

Matching-parenthesis queries on this bit vector can be answered in \(O(1)\) time using auxiliary structures of \(o(n)\) bits.

The total space required would be \(2n + o(n)\) bits since we still need to support rank and select.

Outer Planar Graphs

An adjacency matrix representation of a graph requires \(O(n^2)\) bits and an adjacency list representation requires \(O(E + V)\) words.

An outer planar graph is a graph such that if all the nodes are drawn in a line, their connections all lie on one side and the edges do not cross.

Outer planar graph example

We require two data structures to represent this graph:

  1. \(B\): A bit vector. If a node has degree \(d\) then we store 1 followed by \(d\) zeroes.

  2. \(P\): A parenthesis string with one character per edge endpoint: ')' if the edge connects to a node on the left, '(' otherwise.

For the above graph we would have the following representation:

B: 1000 100 10 10 ...
P: (((  )(  (  )  ...
  • The degree(x) is given by select(x+1) - select(x) - 1.
  • neighbours(x) is found by locating the 0s that correspond to the edges incident on x: positions select(x) + 1, ..., select(x+1) - 1.

For each such edge, we find the other endpoint by parenthesis matching in P. For example, neighbours(2) = 1, 5, 3. This requires \(O(d(x))\) running time.
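A small sketch of degree(x) and of locating the incident-edge slots using a naive select over B (names illustrative). Recovering the identity of each neighbour additionally requires parenthesis matching in P, which is not shown here; for the last node a sentinel 1 can be appended to B so that select(x+1) is defined.

def select1(B, j):
    # 0-based position of the j-th 1 bit in the bit list B
    count = 0
    for pos, bit in enumerate(B):
        count += bit
        if count == j:
            return pos

def degree(B, x):
    # number of 0s between the x-th and (x+1)-st 1 bits
    return select1(B, x + 1) - select1(B, x) - 1

def incident_slots(B, x):
    # positions in B of the 0s encoding the edges incident to node x
    return list(range(select1(B, x) + 1, select1(B, x + 1)))

B = [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0]   # prefix of the example above: degrees 3, 2, 1, 1
print(degree(B, 1), degree(B, 2))        # 3 2
print(incident_slots(B, 2))              # [5, 6]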

For Adjacency(x, y) with the requirement that \(x \leq y\) we perform the following steps:

  1. Find the matching parenthesis of the first ( corresponding to \(x\).
  2. If the position is before \(y\), the answer is false.
  3. If the position is exactly \(y\), the answer is true.
  4. If the position is past all of \(y\)'s edge slots, find the matching opening parenthesis of the last closing parenthesis that corresponds to \(y\); if it lands at \(x\), return true, otherwise false.

This operation can be done using rank and select and requires a running time of \(O(1)\).

The total space required by this structure is then \(n + 4m\) bits, plus lower-order terms for the rank/select and parenthesis-matching structures.
