aez-notes-algorithms

Theory

Order of growth classification

Most of the time we can limit consideration to:

  • \(1\)
  • \(\log N\)
  • \(N\)
  • \(N \log N\) (linearithmic)
  • \(N^2\)
  • \(N^3\)
  • \(2^N\)

Data structures

Hash table

Separate-chaining

Idea: use a hash function to index into an array where each element of the array points to a linked list of key-value pairs. Under the uniform hashing assumption this gives near-constant-time operations on average.
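
A minimal separate-chaining sketch in Python (the class name and fixed number of buckets are illustrative choices):

    class SeparateChainingHashTable:
        def __init__(self, num_buckets=97):
            # Each bucket holds a list of (key, value) pairs.
            self._buckets = [[] for _ in range(num_buckets)]

        def _bucket(self, key):
            return self._buckets[hash(key) % len(self._buckets)]

        def put(self, key, value):
            bucket = self._bucket(key)
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)  # overwrite an existing key
                    return
            bucket.append((key, value))

        def get(self, key):
            for k, v in self._bucket(key):
                if k == key:
                    return v
            return None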

Linear probing

Idea: if there is a hash collision, just use the next position in the array. This has similar properties to separate chaining but avoids the need for additional linked lists. The problem is that performance degrades rapidly as the array fills up. An alternative to probing the next array position is to use a second hash function to determine where to look next; this variant is known as double hashing.
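
A linear-probing sketch, assuming the table never fills completely (a real implementation would resize as the load factor grows):

    class LinearProbingHashTable:
        def __init__(self, capacity=16):
            self._keys = [None] * capacity
            self._values = [None] * capacity

        def put(self, key, value):
            i = hash(key) % len(self._keys)
            # Probe successive positions until we find the key or an empty slot.
            while self._keys[i] is not None and self._keys[i] != key:
                i = (i + 1) % len(self._keys)
            self._keys[i], self._values[i] = key, value

        def get(self, key):
            i = hash(key) % len(self._keys)
            while self._keys[i] is not None:
                if self._keys[i] == key:
                    return self._values[i]
                i = (i + 1) % len(self._keys)
            return None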

Symbol tables

  • An abstraction of key-value pairs, a.k.a. an associative array. Dictionaries in Python are the canonical example.
    • A linked-list implementation has linear-time operations, so it is not useful.
    • A sorted array with binary search makes lookups efficient, but insertion is expensive.

Heap (binary tree)

  • This data structure is defined recursively, it is either empty or a node with links to left and right subtrees.
  • A binary tree is complete if it is perfectly balanced, except possibly at the final level. A complete tree with \(N\) nodes has depth \(\lfloor \lg N \rfloor\).
  • Heap-ordering in a binary tree means that the label on each node is never smaller than the label on its parent node.
    • The swim operation is used to move a node up the tree if it violates the heap-ordering: this is how the insertion method works. There is also a sink method to push nodes down a tree.
    • Completeness provides a natural way to represent the binary tree with an array: the node at index \(k\) has children at indices \(2k\) and \(2k+1\) (see the sketch below).
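
A min-heap sketch matching the ordering above (each node is never smaller than its parent), using 1-based array indexing; names are illustrative:

    class MinHeap:
        def __init__(self):
            self._pq = [None]  # index 0 unused, so children of node k sit at 2k and 2k+1

        def insert(self, key):
            self._pq.append(key)
            self._swim(len(self._pq) - 1)

        def del_min(self):
            top = self._pq[1]
            self._pq[1] = self._pq[-1]
            self._pq.pop()
            if len(self._pq) > 1:
                self._sink(1)
            return top

        def _swim(self, k):
            # Move node k up while it is smaller than its parent.
            while k > 1 and self._pq[k] < self._pq[k // 2]:
                self._pq[k], self._pq[k // 2] = self._pq[k // 2], self._pq[k]
                k //= 2

        def _sink(self, k):
            # Move node k down while it is larger than its smaller child.
            n = len(self._pq) - 1
            while 2 * k <= n:
                j = 2 * k
                if j < n and self._pq[j + 1] < self._pq[j]:
                    j += 1
                if self._pq[k] <= self._pq[j]:
                    break
                self._pq[k], self._pq[j] = self._pq[j], self._pq[k]
                k = j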

Binary search tree

  • While a heap uses an implicit tree represented as an array and uses heap-order, the BST uses an explicit tree and stores keys in symmetric order:
    • All keys in the left subtree are smaller; all keys in the right subtree are larger.
  • When working with BSTs recursive methods lead to very slick code (some of which was invented by Robert Sedgewick).
  • Insertion into a BST built from random keys takes logarithmic time on average, and unlike a heap a BST also supports efficient search and ordered operations, which is why you might prefer it.
  • Deletion (Hibbard deletion) degrades the tree: after many random insertions and deletions its height tends to \(\sqrt{N}\), which can kill performance. There are alternative structures that get around this.
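
A recursive BST sketch in the slick style mentioned above, where a subtree is either None (empty) or a node:

    class Node:
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    def put(node, key, value):
        # Return the root of the subtree with (key, value) inserted.
        if node is None:
            return Node(key, value)
        if key < node.key:
            node.left = put(node.left, key, value)
        elif key > node.key:
            node.right = put(node.right, key, value)
        else:
            node.value = value
        return node

    def get(node, key):
        if node is None:
            return None
        if key < node.key:
            return get(node.left, key)
        if key > node.key:
            return get(node.right, key)
        return node.value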

2-3 tree

  • A tree allowing both 2-nodes (one key, two links) and 3-nodes (two keys, three links); it is complicated to implement directly but stays perfectly balanced under all the operations we care about.

Left-leaning red-black tree

  • This binary tree labels its links red or black to encode a 2-3 tree, getting guaranteed logarithmic performance with simpler code.

kd tree

  • Tree data structure for working with geometric data in a \(k\) dimensional space.
    • During insertion, label each node by the dimension in which it partitions the space; each time the tree depth increases, cycle to the next dimension (see the sketch below).
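
A kd-tree insertion sketch, assuming points are tuples of length \(k\) and cycling the splitting dimension with depth:

    class KdNode:
        def __init__(self, point):
            self.point = point
            self.left = self.right = None

    def insert(node, point, depth=0):
        # Return the root of the subtree with `point` inserted.
        if node is None:
            return KdNode(point)
        dim = depth % len(point)  # dimension to partition on at this depth
        if point[dim] < node.point[dim]:
            node.left = insert(node.left, point, depth + 1)
        else:
            node.right = insert(node.right, point, depth + 1)
        return node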

Stacks and queues

  • A min/max priority queue will dequeue the min/max element that has been enqueued. You probably want a binary heap to get an efficient implementation.
  • A bag or set can be implemented as a randomized queue.
  • Stacks have push and pop
    • Can be implemented as a linked list (which has constant-time operations).
    • Can be implemented as a resizing array, but this requires keeping track of the number of elements and avoiding the loitering problem for the garbage collector: double the array when full, halve it at one-quarter full (see the sketch after this list).
    • The linked list has more consistent performance but is slower on average than the array-based implementation.
  • Queues have enqueue and dequeue
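
A resizing-array stack sketch; Python lists already resize themselves, so the explicit double-when-full, halve-at-a-quarter policy here is spelled out purely for illustration:

    class ResizingArrayStack:
        def __init__(self):
            self._items = [None]  # backing array
            self._n = 0           # number of elements actually on the stack

        def push(self, item):
            if self._n == len(self._items):
                self._resize(2 * len(self._items))  # double when full
            self._items[self._n] = item
            self._n += 1

        def pop(self):
            if self._n == 0:
                raise IndexError("pop from empty stack")
            self._n -= 1
            item = self._items[self._n]
            self._items[self._n] = None  # avoid loitering
            if 0 < self._n <= len(self._items) // 4:
                self._resize(len(self._items) // 2)  # halve at one quarter
            return item

        def _resize(self, capacity):
            new = [None] * capacity
            new[:self._n] = self._items[:self._n]
            self._items = new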

Algorithms

Dynamic connectivity

Algorithm             Initialisation  Union      Find
Quick-find            \(N\)           \(N\)      \(1\)
Quick-union           \(N\)           \(N\)      \(N\)
Weighted quick-union  \(N\)           \(\lg N\)  \(\lg N\)

Quick-find

Consider a graph, \(G\), with \(N\) nodes. The operations we care about are checking whether two nodes are connected and updating the graph by adding an edge. Quick-find refers to a way of checking whether two nodes are connected by some path. The data structure used is an array where the \(i\)-th entry records the connected component that node \(i\) belongs to. To check whether two nodes are connected you just look up whether they have the same value in this array; a union, however, requires a full pass over the array to relabel one of the components, which is why it is linear.

Table 1: Quick-find on \(N\) nodes.
Initialisation  Union  Find
\(N\)           \(N\)  \(1\)
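
A quick-find sketch; the linear pass in union is what makes that operation cost \(N\):

    class QuickFind:
        def __init__(self, n):
            self.ids = list(range(n))  # each node starts in its own component

        def connected(self, p, q):
            return self.ids[p] == self.ids[q]  # constant-time find

        def union(self, p, q):
            pid, qid = self.ids[p], self.ids[q]
            # Relabel every node in p's component: a full linear pass.
            for i in range(len(self.ids)):
                if self.ids[i] == pid:
                    self.ids[i] = qid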

Quick-union

Same problem and data structure as quick-find, but the connected components are modelled as trees: the array value at each index is the parent node, or the index itself if it is a root. Checking whether two nodes are connected is just checking whether their roots are the same. Updating when a new edge is added is just making one of the roots a child of the other root, if necessary.

There is the potential for the trees involved to grow very tall, which makes this algorithm too slow.

Table 2: Quick-union on \(N\) nodes.
Initialisation  Union  Find
\(N\)           \(N\)  \(N\)
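
A quick-union sketch; the connected components are trees encoded in a parent array:

    class QuickUnion:
        def __init__(self, n):
            self.parent = list(range(n))

        def _root(self, p):
            while self.parent[p] != p:  # walk up until we hit a root
                p = self.parent[p]
            return p

        def connected(self, p, q):
            return self._root(p) == self._root(q)

        def union(self, p, q):
            root_p, root_q = self._root(p), self._root(q)
            if root_p != root_q:
                self.parent[root_p] = root_q  # one root becomes the other's child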

Weighted quick-union

This is the same as quick-union but, during a union operation, it always adds the smaller tree as a subtree of the root of the bigger tree. To make this work you need another array that counts the number of nodes in the tree rooted at each index. Note that this weights by the size of the trees rather than their depth.

Table 3: Weighted quick-union on \(N\) nodes.
Initialisation  Union      Find
\(N\)           \(\lg N\)  \(\lg N\)

An additional modification, path compression, can further improve the efficiency with very little added bookkeeping (see the sketch below).
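
A weighted quick-union sketch with one-pass path compression, in which every examined node is pointed at its grandparent:

    class WeightedQuickUnion:
        def __init__(self, n):
            self.parent = list(range(n))
            self.size = [1] * n  # nodes in the tree rooted at each index

        def _root(self, p):
            while self.parent[p] != p:
                self.parent[p] = self.parent[self.parent[p]]  # path compression
                p = self.parent[p]
            return p

        def connected(self, p, q):
            return self._root(p) == self._root(q)

        def union(self, p, q):
            root_p, root_q = self._root(p), self._root(q)
            if root_p == root_q:
                return
            # Attach the smaller tree beneath the root of the larger tree.
            if self.size[root_p] < self.size[root_q]:
                root_p, root_q = root_q, root_p
            self.parent[root_q] = root_p
            self.size[root_p] += self.size[root_q]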

Sorting

A partially sorted array is one which differs from a sorted array by a linear number of inversions.

A stable sort is a sorting algorithm that preserves the relative order of items with equal keys. Insertion sort and mergesort are stable but selection sort and shell sort are not stable.

A lower bound on the complexity of sorting can be obtained by representing an arbitrary comparison-based sort as a binary decision tree. Such a tree must have at least \(N!\) leaves to distinguish all possible orderings, so its depth is at least \(\lg(N!) \sim N \lg N\), yielding the lower bound! This argument is used to prove that mergesort is an optimal algorithm. In practice, quicksort is the best choice if you don't need stability.

Selection sort

Go through and select the smallest element, then repeat on the remaining elements. This requires on the order of \(N^2 / 2\) comparisons and \(N\) exchanges to sort a list of \(N\) elements. The amount of work does not decrease if the collection is already partially sorted.
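
A selection-sort sketch:

    def selection_sort(xs):
        n = len(xs)
        for i in range(n):
            smallest = i
            for j in range(i + 1, n):  # find the minimum of the unsorted suffix
                if xs[j] < xs[smallest]:
                    smallest = j
            xs[i], xs[smallest] = xs[smallest], xs[i]  # one exchange per pass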

Insertion sort

Go through and move each element to the left until all elements further left are smaller, then repeat. For a randomly ordered array it requires on the order of \(N^2 / 4\) comparisons and \(N^2 / 4\) movements of data. If the collection is already partially sorted, though, this can be much quicker: it runs in linear time for partially sorted arrays.
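
An insertion-sort sketch:

    def insertion_sort(xs):
        for i in range(1, len(xs)):
            j = i
            # Swap element i leftwards until nothing smaller lies to its left.
            while j > 0 and xs[j] < xs[j - 1]:
                xs[j], xs[j - 1] = xs[j - 1], xs[j]
                j -= 1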

Shell sort

Perform a sequence of h-sorts for a decreasing sequence of h. An h-sort sorts the subsequences with indices \(1, h+1, 2h+1, \dots\), and \(2, h+2, 2h+2, \dots\), and \(3, h+3, 2h+3, \dots\), etcetera. This substantially reduces the number of times that data needs to be moved. You can use insertion sort for each h-sort. Analysing its complexity is an open problem, but it performs well in practice.
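
A shellsort sketch using the \(3x+1\) increment sequence (1, 4, 13, 40, ...), which is one common choice; each pass is an h-sort done by insertion sort:

    def shell_sort(xs):
        n = len(xs)
        h = 1
        while h < n // 3:
            h = 3 * h + 1  # largest increment below n/3
        while h >= 1:
            # h-sort the array: insertion sort with stride h.
            for i in range(h, n):
                j = i
                while j >= h and xs[j] < xs[j - h]:
                    xs[j], xs[j - h] = xs[j - h], xs[j]
                    j -= h
            h //= 3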

Knuth shuffle

This is a linear-time shuffling algorithm. Iterate over the list; on iteration \(i\), exchange element \(i\) with an element chosen uniformly at random from positions \(0\) through \(i\) (possibly itself). There is a Haskell implementation on Rosetta Code.
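
A Python sketch of the shuffle (the Rosetta Code version mentioned above is in Haskell):

    import random

    def knuth_shuffle(xs):
        # At step i, swap element i with a uniform choice from indices 0..i,
        # which yields a uniformly random permutation.
        for i in range(1, len(xs)):
            j = random.randint(0, i)  # inclusive of both endpoints
            xs[i], xs[j] = xs[j], xs[i]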

Graham scan for computing the convex hull

This is an algorithm for computing the convex hull in two dimensions. It uses sorting to order the points by polar angle relative to the bottom-most point (taking the right-most such point in case of ties).
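
A Graham-scan sketch; ties in angle are broken by distance from the start point, and collinear points are discarded rather than handled robustly:

    import math

    def ccw(a, b, c):
        # Twice the signed area of triangle abc: positive for a counter-clockwise turn.
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

    def convex_hull(points):
        # Start from the bottom-most point, right-most in case of ties.
        start = min(points, key=lambda p: (p[1], -p[0]))
        by_angle = sorted(
            (p for p in points if p != start),
            key=lambda p: (math.atan2(p[1] - start[1], p[0] - start[0]),
                           (p[0] - start[0]) ** 2 + (p[1] - start[1]) ** 2))
        hull = [start]
        for p in by_angle:
            # Pop while the top two hull points and p fail to turn counter-clockwise.
            while len(hull) > 1 and ccw(hull[-2], hull[-1], p) <= 0:
                hull.pop()
            hull.append(p)
        return hull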

Mergesort

Split the list in two, recursively sort each half, then merge the two. The merge step makes use of an auxiliary array. The number of comparisons and array accesses is order \(N \lg N\) (recalling that this is the base-2 logarithm). This can be proved with a recurrence relation.

A simple performance improvement is to switch to insertion sort once subarrays get down to about 7 elements, because at that size the overhead of mergesort means it is no longer superior.
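
A top-down mergesort sketch with the insertion-sort cutoff described above; the cutoff value of 7 is just the figure quoted in the text:

    CUTOFF = 7

    def merge_sort(xs):
        aux = xs[:]  # auxiliary array for merging
        _sort(xs, aux, 0, len(xs) - 1)

    def _sort(xs, aux, lo, hi):
        if hi - lo < CUTOFF:
            _insertion_sort(xs, lo, hi)  # small subarrays: insertion sort
            return
        mid = (lo + hi) // 2
        _sort(xs, aux, lo, mid)
        _sort(xs, aux, mid + 1, hi)
        _merge(xs, aux, lo, mid, hi)

    def _merge(xs, aux, lo, mid, hi):
        aux[lo:hi + 1] = xs[lo:hi + 1]
        i, j = lo, mid + 1
        for k in range(lo, hi + 1):
            if i > mid:                # left half exhausted
                xs[k] = aux[j]; j += 1
            elif j > hi:               # right half exhausted
                xs[k] = aux[i]; i += 1
            elif aux[j] < aux[i]:      # take from the left on ties: stable
                xs[k] = aux[j]; j += 1
            else:
                xs[k] = aux[i]; i += 1

    def _insertion_sort(xs, lo, hi):
        for i in range(lo + 1, hi + 1):
            j = i
            while j > lo and xs[j] < xs[j - 1]:
                xs[j], xs[j - 1] = xs[j - 1], xs[j]
                j -= 1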

Quicksort

  • In-place sorting method.
  • Depth of recursion is logarithmic assuming a random ordering of elements in the array (hence the initial shuffle in the sketch after this list).
  • On average takes \(\approx 1.39 N \lg N\) compares.
  • Not stable!
  • Again, default to insertion sort once subarrays get below about 10 elements.
  • The partitioning idea also yields quickselect, a selection algorithm that runs in linear time on average.
  • There is a variant which does a three-way partition, which is more efficient when there are lots of duplicate keys.
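
A quicksort sketch (without the insertion-sort cutoff); the initial shuffle is what justifies the logarithmic recursion depth:

    import random

    def quicksort(xs):
        random.shuffle(xs)  # make a random ordering likely
        _sort(xs, 0, len(xs) - 1)

    def _sort(xs, lo, hi):
        if lo >= hi:
            return
        p = _partition(xs, lo, hi)
        _sort(xs, lo, p - 1)
        _sort(xs, p + 1, hi)

    def _partition(xs, lo, hi):
        pivot = xs[lo]
        i, j = lo + 1, hi
        while True:
            while i <= hi and xs[i] < pivot:  # scan right for an element >= pivot
                i += 1
            while xs[j] > pivot:              # scan left for an element <= pivot
                j -= 1
            if i >= j:
                break
            xs[i], xs[j] = xs[j], xs[i]
            i += 1
            j -= 1
        xs[lo], xs[j] = xs[j], xs[lo]  # put the pivot into its final position
        return j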

Heapsort

Heapsort builds a binary heap from the array and then uses it to put the elements in order. This takes \(N \lg N\) time in the worst case and sorts in place. However, it is not stable, and in practice it is slower than quicksort and mergesort.
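
A heapsort sketch using 0-based indexing: heapify bottom-up with sink, then repeatedly swap the maximum to the end and sink the new root:

    def heap_sort(xs):
        n = len(xs)
        for k in range(n // 2 - 1, -1, -1):  # bottom-up heap construction
            _sink(xs, k, n)
        for end in range(n - 1, 0, -1):
            xs[0], xs[end] = xs[end], xs[0]  # move current max into place
            _sink(xs, 0, end)

    def _sink(xs, k, n):
        while 2 * k + 1 < n:
            j = 2 * k + 1
            if j + 1 < n and xs[j + 1] > xs[j]:
                j += 1  # pick the larger child
            if xs[k] >= xs[j]:
                break
            xs[k], xs[j] = xs[j], xs[k]
            k = j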

Author: Alex Zarebski

Created: 2021-10-27 Wed 19:09
