aez-notes-algorithms
Theory
Order of growth classification
Most of the time we can limit consideration to:
- \(1\)
- \(\log N\)
- \(N\)
- \(N \log N\) (linearithmic)
- \(N^2\)
- \(N^3\)
- \(2^N\)
Data structures
Hash table
Separate-chaining
Idea: use a hash function to index into an array where each element of the array points to a linked list of key-value pairs. Under the uniform hashing assumption this leads to near constant-time operations on average.
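A minimal sketch in Python (class and method names are illustrative; Python lists stand in for the linked lists):

```python
class SeparateChainingHashTable:
    """Sketch of a separate-chaining hash table with a fixed bucket count."""

    def __init__(self, m=97):
        self.m = m                            # number of buckets
        self.buckets = [[] for _ in range(m)]

    def _index(self, key):
        return hash(key) % self.m             # hash function picks the bucket

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                      # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))           # otherwise chain a new pair

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)
```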
Linear probing
Idea: if there is a hash collision, just use the next position in the array. This has similar properties to the separate-chaining solution but avoids the need for additional linked lists. The problem is that its performance degrades rapidly as the array fills up. An alternative to just using the next array position is to use a second hash function to determine where to look next (double hashing).
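A sketch of the same interface with linear probing (illustrative names; a real implementation would resize once the table gets about half full):

```python
class LinearProbingHashTable:
    """Sketch of linear probing: on collision, try the next array slot."""

    def __init__(self, m=16):
        self.m = m
        self.keys = [None] * m
        self.values = [None] * m

    def put(self, key, value):
        # Assumes the table never fills completely.
        i = hash(key) % self.m
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % self.m              # probe the next position
        self.keys[i], self.values[i] = key, value

    def get(self, key):
        i = hash(key) % self.m
        while self.keys[i] is not None:
            if self.keys[i] == key:
                return self.values[i]
            i = (i + 1) % self.m
        raise KeyError(key)
```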
Symbol tables
- An abstraction of key-value pairs, a.k.a. an associative array. Dictionaries in Python are the canonical example.
- A linked-list implementation has linear-time operations, so it is not useful.
- A sorted array with binary search makes the lookup methods efficient, but insertion remains expensive (linear time).
Heap (binary tree)
- This data structure is defined recursively, it is either empty or a node with links to left and right subtrees.
- A binary tree is complete if it is perfectly balanced at every node, except possibly at the final layer. A complete tree with \(N\) nodes has depth \(\lfloor \lg N \rfloor\).
- Heap-ordering in a binary tree means that the label on each node is never smaller than the label on its parent node.
- The swim operation is used to move a node up the tree if it violates the heap-ordering: this is how the insertion method works. There is also a sink method to push nodes down a tree.
- Completeness provides a natural way to represent the binary tree with an array: the children of the node at index \(k\) sit at indices \(2k\) and \(2k+1\) (see the sketch after this list).
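A sketch of a heap in Python following the conventions above (1-indexed array with entry 0 unused; min-oriented so no node is smaller than its parent; names are illustrative):

```python
class MinHeap:
    """Sketch of a binary heap: the minimum label sits at the root.

    Entry 0 is unused so the children of node k sit at 2k and 2k + 1."""

    def __init__(self):
        self.a = [None]

    def insert(self, key):
        self.a.append(key)
        self._swim(len(self.a) - 1)       # new node swims up if it violates order

    def del_min(self):
        a = self.a
        a[1], a[-1] = a[-1], a[1]         # swap the root with the last element
        key = a.pop()
        self._sink(1)                     # displaced element sinks back down
        return key

    def _swim(self, k):
        a = self.a
        while k > 1 and a[k] < a[k // 2]:
            a[k], a[k // 2] = a[k // 2], a[k]
            k //= 2

    def _sink(self, k):
        a, n = self.a, len(self.a) - 1
        while 2 * k <= n:
            j = 2 * k
            if j < n and a[j + 1] < a[j]:
                j += 1                    # pick the smaller child
            if a[k] <= a[j]:
                break
            a[k], a[j] = a[j], a[k]
            k = j
```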
Binary search tree
- While a heap uses an implicit tree represented as an array and uses heap-order, the BST uses an explicit tree and stores keys in symmetric order:
- All keys in the left subtree are smaller; all keys in the right subtree are bigger.
- When working with BSTs recursive methods lead to very slick code (some of which was invented by Robert Sedgewick).
- Insertion into a BST takes \(\sim \ln N\) time on average for randomly ordered keys, which is why you might prefer it to a heap.
- Deletion (Hibbard deletion) skews the tree: with randomised data the operations degrade to \(\sqrt{N}\), which can kill performance. There are alternative structures that get around this. A sketch of insertion and search follows this list.
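A minimal Python sketch of BST insertion and search (names are illustrative; deletion is omitted):

```python
class Node:
    """Sketch of a BST node; the links obey the symmetric order."""

    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None


def put(node, key, value):
    # Recursive insertion: returns the (possibly new) subtree root.
    if node is None:
        return Node(key, value)
    if key < node.key:
        node.left = put(node.left, key, value)
    elif key > node.key:
        node.right = put(node.right, key, value)
    else:
        node.value = value                # key already present: overwrite
    return node


def get(node, key):
    # Walk down the tree, going left or right by the symmetric order.
    while node is not None:
        if key < node.key:
            node = node.left
        elif key > node.key:
            node = node.right
        else:
            return node.value
    raise KeyError(key)
```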
2-3 tree
- A tree allowing for both 2-nodes and 3-nodes, which is complicated to implement but keeps the tree perfectly balanced under all the operations we care about.
Left-leaning red-black tree
- This binary tree uses red and black edge labels to encode a 2-3 tree, which gets its performance close to that of a 2-3 tree.
kd tree
- Tree data structure for working with geometric data in a \(k\)-dimensional space.
- During insertion, label each node by the dimension in which it partitions the space: each time the tree depth increases, partition in the next dimension, cycling through all \(k\) of them (see the sketch after this list).
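A sketch of kd-tree insertion in Python, using plain dicts for nodes (illustrative, not from the source):

```python
def insert(node, point, depth=0, k=2):
    """Sketch of kd-tree insertion for points given as k-tuples.

    Each node partitions space in dimension depth % k, so the splitting
    dimension cycles as the depth increases."""
    if node is None:
        return {"point": point, "left": None, "right": None}
    dim = depth % k
    if point[dim] < node["point"][dim]:
        node["left"] = insert(node["left"], point, depth + 1, k)
    else:
        node["right"] = insert(node["right"], point, depth + 1, k)
    return node
```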
Stacks and queues
- A min/max priority queue will dequeue the min/max element that has been enqueued. You probably want a heap-ordered binary tree to get an efficient implementation.
- A bag or set can be implemented as a randomized queue.
- Stacks have push and pop
- Can be implemented as a linked list (which has constant time operations)
- Can be implemented as an array, but this requires keeping track of the array size and avoiding the loitering problem for the garbage collector. Double the array when full; halve it when one-quarter full (see the sketch after this list).
- Linked list has more consistent performance but will be slower on average than the array-based implementation.
- Queues have enqueue and dequeue
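A sketch of the array-based stack in Python. This is purely illustrative, since Python lists already resize themselves; the bookkeeping is spelled out only to show the doubling/halving rule and the loitering fix:

```python
class ResizingArrayStack:
    """Sketch of an array-backed stack with doubling and halving."""

    def __init__(self):
        self.a = [None]                   # backing array
        self.n = 0                        # number of items actually stored

    def push(self, item):
        if self.n == len(self.a):
            self._resize(2 * len(self.a))         # double when full
        self.a[self.n] = item
        self.n += 1

    def pop(self):
        self.n -= 1
        item = self.a[self.n]
        self.a[self.n] = None             # avoid loitering: drop the reference
        if 0 < self.n <= len(self.a) // 4:
            self._resize(len(self.a) // 2)        # halve at one-quarter full
        return item

    def _resize(self, capacity):
        new = [None] * capacity
        new[:self.n] = self.a[:self.n]
        self.a = new
```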
Algorithms
Dynamic connectivity
Algorithm | Initialisation | Union | Find |
---|---|---|---|
Quick-find | \(N\) | \(N\) | \(1\) |
Quick-union | \(N\) | \(N\) | \(N\) |
Weighted quick-union | \(N\) | \(\lg N\) | \(\lg N\) |
Quick-find
Consider a graph, \(G\), with \(N\) nodes. The operations we care about are checking if two nodes are connected, and updating the graph by adding an edge. Quick-find refers to a way of checking if two nodes are connected by some path. The data structure used is an array where the \(i\)-th entry lists the connected component node \(i\) belongs to. To check if two nodes are connected you just need to look up whether they have the same value in this array; union is expensive because merging two components means rewriting every entry carrying one of the two component ids. A sketch follows the table below.
Initialisation | Union | Find |
---|---|---|
\(N\) | \(N\) | \(1\) |
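A sketch in Python (class and method names are illustrative):

```python
class QuickFind:
    """Sketch: id[i] names the connected component containing node i."""

    def __init__(self, n):
        self.id = list(range(n))          # initialisation: N array writes

    def connected(self, p, q):
        return self.id[p] == self.id[q]   # find: one array access each

    def union(self, p, q):
        pid, qid = self.id[p], self.id[q]
        # Rewrite every entry in p's component: this is the linear cost.
        for i in range(len(self.id)):
            if self.id[i] == pid:
                self.id[i] = qid
```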
Quick-union
Same problem and data structure as used by Quick-find but the connected components are modelled as trees and the array values represent either the parent node or the node index if it is the root. Checking if two nodes are connected is just checking if their roots are the same. Updating when a new edge is added is just changing one of the root nodes to be the child of the other root if necessary.
There is the potential for the trees involved to grow very tall which then makes this algorithm too slow.
Initialisation | Union | Find |
---|---|---|
\(N\) | \(N\) | \(N\) |
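A sketch in Python (illustrative names):

```python
class QuickUnion:
    """Sketch: parent[i] is node i's parent, or i itself at a root."""

    def __init__(self, n):
        self.parent = list(range(n))

    def _root(self, i):
        while self.parent[i] != i:        # climb until we hit a root
            i = self.parent[i]
        return i

    def connected(self, p, q):
        return self._root(p) == self._root(q)

    def union(self, p, q):
        # Make one root the child of the other; tall trees are the risk.
        self.parent[self._root(p)] = self._root(q)
```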
Weighted quick-union
This is the same as Quick-union but always selects the smaller tree to be added as a subtree of the root of the bigger tree during a union operation. To make this work you need another array to count the number of nodes in the tree rooted at each index. Note that this considers the size of the trees rather than their depth.
Initialisation | Union | Find |
---|---|---|
\(N\) | \(\text{lg} N\) | \(\text{lg} N\) |
An additional modification, path compression, can further improve the efficiency with very little added bookkeeping.
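A sketch in Python combining weighting with a simple form of path compression, path halving (pointing each visited node at its grandparent); names are illustrative:

```python
class WeightedQuickUnion:
    """Sketch of weighting plus path compression (path halving)."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n               # nodes in the tree rooted at each index

    def _root(self, i):
        while self.parent[i] != i:
            # Path compression: point each visited node at its grandparent.
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def connected(self, p, q):
        return self._root(p) == self._root(q)

    def union(self, p, q):
        rp, rq = self._root(p), self._root(q)
        if rp == rq:
            return
        # Weighting: hang the smaller tree off the root of the bigger one.
        if self.size[rp] < self.size[rq]:
            rp, rq = rq, rp
        self.parent[rq] = rp
        self.size[rp] += self.size[rq]
```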
Sorting
A partially sorted array is one which differs from a sorted array by a linear number of inversions.
A stable sort is a sorting algorithm that preserves the relative order of items with equal keys. Insertion sort and mergesort are stable but selection sort and shell sort are not stable.
A lower bound on the complexity of sorting can be obtained by representing an arbitrary computation by a binary decision tree. Any decision tree that distinguishes all \(N!\) possible orderings must have at least \(N!\) leaves, so its depth is at least \(\lg(N!) \sim N \lg N\) by Stirling's approximation, yielding the lower bound! This argument is used to prove that mergesort is an optimal algorithm. In practice quicksort is the best if you don't need stability.
Selection sort
Go through and select the smallest element, then repeat on the remaining elements. This requires order \(N^2 / 2\) comparisons and \(N\) exchanges to sort a list of \(N\) elements. The amount of work required does not change if the collection is already partially sorted.
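A minimal Python sketch:

```python
def selection_sort(a):
    # For each position, select the smallest remaining element and swap it in.
    n = len(a)
    for i in range(n):
        smallest = i
        for j in range(i + 1, n):         # ~N^2/2 comparisons in total
            if a[j] < a[smallest]:
                smallest = j
        a[i], a[smallest] = a[smallest], a[i]   # at most N exchanges in total
```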
Insertion sort
Go through and move each element to the left of the collection until all elements further left are lower, and repeat. For a randomly ordered array it requires order \(N^2 / 4\) comparisons and \(N^2 / 4\) movements of data. If the collection is already partially sorted, though, this can be much quicker: it runs in linear time for partially sorted arrays.
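A minimal Python sketch:

```python
def insertion_sort(a):
    # Move each element left until everything to its left is no larger.
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j] < a[j - 1]:
            a[j], a[j - 1] = a[j - 1], a[j]
            j -= 1
```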
Shell sort
Perform a sequence of h-sorts for a decreasing sequence of h. An h-sort sorts the subsequences with indices \(1,h+1,2h+1,\dots\), and \(2,h+2,2h+2,\dots\), and \(3,h+3,2h+3,\dots\), etcetera. This substantially reduces the number of times that data needs to be moved. You can use insertion sort for each h-sort. Analysing its complexity is an open problem, but it performs well in practice.
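A Python sketch using Knuth's \(3h+1\) increment sequence, one common choice (the notes don't fix a particular sequence):

```python
def shell_sort(a):
    n = len(a)
    h = 1
    while h < n // 3:
        h = 3 * h + 1                     # Knuth's 1, 4, 13, 40, ... sequence
    while h >= 1:
        # h-sort: insertion sort over each subsequence of stride h.
        for i in range(h, n):
            j = i
            while j >= h and a[j] < a[j - h]:
                a[j], a[j - h] = a[j - h], a[j]
                j -= h
        h //= 3                           # move to the next smaller increment
```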
Knuth shuffle
A linear-time shuffling algorithm: iterate over the list, and on iteration \(i\) exchange element \(i\) with a uniformly random element from positions \(0\) through \(i\) (possibly itself). There is a Haskell implementation on Rosetta Code.
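The notes point to a Haskell version on Rosetta Code; here is a minimal Python sketch of the same idea:

```python
import random

def knuth_shuffle(a):
    # On iteration i, exchange a[i] with a uniformly random a[j], 0 <= j <= i.
    # Starting at i = 1 is fine: i = 0 could only swap with itself.
    for i in range(1, len(a)):
        j = random.randint(0, i)          # randint is inclusive on both ends
        a[i], a[j] = a[j], a[i]
```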
Graham scan for computing the convex hull
This is an algorithm for computing the convex hull in two dimensions. It uses sorting to order the points by polar angle relative to the point with the lowest y-coordinate, then scans the points in that order, keeping a stack of hull candidates and discarding any point that would create a clockwise turn.
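A Python sketch under the assumptions that the points are distinct and at least three are given; ties in polar angle are broken by distance from the anchor:

```python
from math import atan2

def convex_hull(points):
    """Graham scan sketch; points is a list of (x, y) tuples."""
    # Anchor: lowest y-coordinate, breaking ties by lowest x.
    anchor = min(points, key=lambda p: (p[1], p[0]))

    def ccw(a, b, c):
        # Twice the signed area of abc; positive means a counter-clockwise turn.
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

    def key(p):
        # Sort by polar angle around the anchor, then by squared distance.
        dx, dy = p[0] - anchor[0], p[1] - anchor[1]
        return (atan2(dy, dx), dx * dx + dy * dy)

    hull = [anchor]
    for p in sorted((p for p in points if p != anchor), key=key):
        # Discard points that would make a clockwise (or straight) turn.
        while len(hull) > 1 and ccw(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull
```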
Mergesort
Split the list in two, recursively sort, then merge the two. The merge step makes use of an auxiliary array. The number of comparisons and array accesses is order \(N \lg N\) (recalling that this is the base-2 logarithm). This can be proved with a recurrence relation.
A trivial improvement to performance is to switch to insertion sort once the subarray size has gotten down to about 7, because at this size the overhead required for mergesort means that it is no longer superior (the sketch below includes this cutoff).
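A Python sketch of top-down mergesort with the insertion-sort cutoff described above (the CUTOFF constant and function names are illustrative):

```python
CUTOFF = 7  # below this size insertion sort beats mergesort's overhead

def merge_sort(a, lo=0, hi=None, aux=None):
    # Top-down mergesort sketch; aux is the auxiliary array.
    if hi is None:
        hi, aux = len(a) - 1, a[:]
    if hi - lo < CUTOFF:
        for i in range(lo + 1, hi + 1):   # insertion sort on small runs
            j = i
            while j > lo and a[j] < a[j - 1]:
                a[j], a[j - 1] = a[j - 1], a[j]
                j -= 1
        return
    mid = (lo + hi) // 2
    merge_sort(a, lo, mid, aux)
    merge_sort(a, mid + 1, hi, aux)
    aux[lo:hi + 1] = a[lo:hi + 1]         # copy the range to the auxiliary array
    i, j = lo, mid + 1
    for k in range(lo, hi + 1):           # merge back, taking the smaller head
        if i > mid:
            a[k] = aux[j]; j += 1
        elif j > hi:
            a[k] = aux[i]; i += 1
        elif aux[j] < aux[i]:
            a[k] = aux[j]; j += 1
        else:
            a[k] = aux[i]; i += 1         # ties go left, keeping the sort stable
```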
Quicksort
- In-place sorting method
- Depth of recursion is logarithmic assuming a random ordering of elements in array.
- On average takes \(\sim 2 N \ln N\) compares (equivalently \(\approx 1.39 N \lg N\))
- Not stable!
- Again default to insertion sort once arrays get below about 10.
- Related to quick-select, a partition-based selection algorithm with linear expected running time.
- There is a variant which does a three-way partition, which is more efficient if there are lots of duplicate keys. A sketch of the basic algorithm follows this list.
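A Python sketch of the basic two-way-partitioning quicksort (the initial shuffle supplies the random order the analysis assumes; the cutoff and three-way variant are omitted):

```python
import random

def quicksort(a, lo=0, hi=None):
    if hi is None:
        random.shuffle(a)                 # random order makes tall recursions unlikely
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot, i, j = a[lo], lo + 1, hi
    while True:
        while i <= hi and a[i] < pivot:   # scan right for an element >= pivot
            i += 1
        while a[j] > pivot:               # scan left for an element <= pivot
            j -= 1
        if i >= j:
            break
        a[i], a[j] = a[j], a[i]
        i += 1; j -= 1
    a[lo], a[j] = a[j], a[lo]             # put the pivot in its final place
    quicksort(a, lo, j - 1)
    quicksort(a, j + 1, hi)
```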
Heapsort
Heapsort builds up a binary heap from the array and then uses it to put the elements in order. This takes \(N \lg N\) time in the worst case and does the sorting in place. However, it is not stable, and in practice it is slower than quicksort and mergesort.
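A Python sketch. Unlike the 1-indexed heap earlier, this uses 0-based indexing and a max-oriented heap so the sortdown leaves the array ascending:

```python
def heapsort(a):
    """In-place heapsort sketch: 0-based indexing, max-heap orientation."""
    n = len(a)

    def sink(k, size):
        # Exchange a[k] with its larger child until heap order is restored.
        while 2 * k + 1 < size:
            j = 2 * k + 1
            if j + 1 < size and a[j + 1] > a[j]:
                j += 1
            if a[k] >= a[j]:
                break
            a[k], a[j] = a[j], a[k]
            k = j

    # Heap construction: sink every node that has at least one child.
    for k in range(n // 2 - 1, -1, -1):
        sink(k, n)
    # Sortdown: repeatedly move the maximum to the end and shrink the heap.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sink(0, end)
```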