Notes taken from "8. Hashing with Chaining"; MIT OpenCourseWare 6.006: Introduction to Algorithms.

Dictionary: Abstract Data Type

Maintain a set of items, each with a key. The interface looks like the following:

  • insert(item)
  • delete(item)
  • search(key) --> return item with given key, or error.

Inserting an item with an existing key overwrites any existing data for that key.

We can accomplish each of the methods of this interface using an AVL tree in O(lg n) time. However, it turns out we can do a lot better: we can search in expected constant time, O(1).

This is a randomised data structure (one of the few in 6.006).

Python

In Python, the data type closest to this ADT is the dict.

  • D[key] ~ search
  • D[key] = val ~ insert
  • del D[key] ~ delete

An item is simply a (key, value) pair (as in D.items()).
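
To make the correspondence concrete, here's a minimal sketch using the dict (the key and values are illustrative):

  D = {}
  D["apple"] = 3        # insert: associate the value 3 with the key "apple"
  D["apple"] = 5        # inserting with an existing key overwrites the old value
  print(D["apple"])     # search: prints 5
  del D["apple"]        # delete
  try:
      D["apple"]        # searching for a missing key raises KeyError (the 'error' case)
  except KeyError:
      pass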

Motivation

  • DocDist: finding the distance between two documents is fastest using a data structure like this.
  • Databases: There are two kinds of databases: ones that use hashing, and ones that use search trees.
    E.g. on Merriam-Webster, searching for a word's definition uses a hashtable. In a spell-checker, we
    'tweak' each of the letters of a word and look each variant up in a hashtable.
  • Programming Languages / compilers / interpreters. E.g. in Python, whenever we type a variable name,
    the interpreter looks that variable up in a hashtable to find its value.
  • Network Router: routers have dictionaries of the devices connected locally. Also, in the network
    stack of a machine, the computer needs to look up the port in a hashtable to find which application is
    connected to it.
  • Searching for substrings in a string: grep, Emacs "C-s" etc.
  • String commonalities: e.g. to find the similarity between two DNA strings.
  • File & directory synchronisation: rsync, unison, Dropbox etc.
  • Crypto: cryptographic hash functions.

Implementation

Simple approach

The simplest approach is a "direct-access table". We store items in an array, indexed by key. Each empty "slot" in the array will be NULL.
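
A minimal sketch of a direct-access table (assuming keys are small non-negative integers; the class and names are illustrative):

  class DirectAccessTable:
      def __init__(self, universe_size):
          self.slots = [None] * universe_size   # one slot per possible key

      def insert(self, key, value):
          self.slots[key] = value               # O(1)

      def search(self, key):
          if self.slots[key] is None:
              raise KeyError(key)
          return self.slots[key]                # O(1)

      def delete(self, key):
          self.slots[key] = None                # O(1)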

Problems:

  1. Keys may not be non-negative integers;
  2. Will use a huge amount of memory - one slot per possible key in the universe;

Solutions

What if our keys aren't integers?

Prehash functions map whatever keys we have to non-negative integers. Given some key, e.g. a string, a prehash function will map that key to a non-negative integer.

In theory, keys are finite and discrete. We know that everything stored on the computer is a string of bits; treat these bits as an unsigned integer, and we're done!
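
A quick sketch of the 'bits as an unsigned integer' idea in Python (the key is illustrative):

  key = b"hello"                        # any object is ultimately a string of bits
  as_int = int.from_bytes(key, "big")   # read those bits as one non-negative integer
  print(as_int)                         # 448378203247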

In Python, it's not quite so simple; hash(x) is the prehash of x (yes, it's badly named!). This maps every hashable (i.e. immutable) object to an integer.

However, this has some issues; in older versions of CPython, for example:

hash("\0B") == hash("\0\0C") == 64

In an ideal (theoretical) world, (prehash(x) == prehash(y)) iff x == y. In Python, this is not quite true.

Python implementation note: defining a __hash__ method changes the behaviour of the hash() function.
If no __hash__ method is specified, the id() function is used. Make sure to define it in such a way that hash(x) does not change over time. In general, assume mutable objects are not hashable.
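
A minimal sketch of a well-behaved __hash__, assuming a class whose fields never change after construction (the class is illustrative):

  class Point:
      def __init__(self, x, y):
          self._x = x   # treated as immutable after construction
          self._y = y

      def __eq__(self, other):
          return (isinstance(other, Point)
                  and (self._x, self._y) == (other._x, other._y))

      def __hash__(self):
          # Hash only fields that never change, so hash(p) is stable over
          # time; defining it alongside __eq__ keeps the invariant that
          # equal objects have equal hashes.
          return hash((self._x, self._y))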

How do we reduce space?

This is the 'magic' part of the lecture: hashing. We need to reduce the 'universe' U of all possible keys down to a reasonable range of m slot indices for use in the hash table.

The keyspace is HUGE: storing a direct-access table with slots for each key in the universe would be utterly impractical. To get around this, we define m to be the size of our hashtable. Let Q ⊆ U be the set of keys actually stored in the dictionary (i.e. the keys whose hashes index non-NULL slots in the table). Our idea: we want m to be O(n), where n is the size of Q. In other words, we want there to be (very roughly) the same number of slots in the hashtable as there are items stored in the dictionary.

However, by the pigeonhole principle, we know that there will be some two keys Ki and Kj such that h(Ki) == h(Kj), but Ki != Kj. This is a "collision".

(Note here: the hashing function maps the universe of keys to integers in the set [0, m-1]; therefore it needs to know the value of m. The hashing function does the work of reducing the keyspace.)

This is relatively easily solved by chaining (and by another method called "open addressing", but that's outside the scope of this lecture).

Chaining

If multiple items hash to a single slot in the array, we just 'chain' them together in a linked list.

Each element in the array is a pointer: either to NULL, if no stored item has a key which hashes to that index; or to a linked list of every item whose key hashes to that index. Note that this linked list may contain just one node, if there are no collisions at that slot.
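
A minimal sketch of chaining, using Python lists as the chains and the built-in hash() as the prehash (the class and names are illustrative):

  class ChainedHashTable:
      def __init__(self, m=8):
          self.m = m
          self.table = [[] for _ in range(m)]   # one chain per slot

      def _slot(self, key):
          return hash(key) % self.m             # prehash, then reduce to [0, m-1]

      def insert(self, key, value):
          chain = self.table[self._slot(key)]
          for i, (k, _) in enumerate(chain):
              if k == key:
                  chain[i] = (key, value)       # existing key: overwrite
                  return
          chain.append((key, value))

      def search(self, key):
          for k, v in self.table[self._slot(key)]:
              if k == key:
                  return v
          raise KeyError(key)

      def delete(self, key):
          chain = self.table[self._slot(key)]
          for i, (k, _) in enumerate(chain):
              if k == key:
                  del chain[i]
                  return
          raise KeyError(key)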

In the worst case, we have to iterate through a linked list of all n items. This is true for all hashing schemes. Worst case: O(n) for searches, inserts, deletions.

However, a good hash function will evenly distribute keys across the space of indices.

In 6.006, we make an unrealistic (false) assumption of Simple Uniform Hashing. This is the assumption that each key is equally likely to be hashed to any slot in the hash table.

Therefore, the hashes of the keys in the universe will be evenly distributed across [0, m-1]. This is independent of where other keys are hashed.

Analysis

Expected length of a chain, for n keys and m slots:

This will be n/m: n keys distributed over m slots.

n/m is often called α (alpha), or the "load factor" of the hash table.

This will be O(1) if we choose the table size so that m = Θ(n).

In general, the running time of an operation (let's say, search) will be 1 + the length of the chain, i.e. O(1 + α). We write 1 + α since we always pay constant time for hashing the key and jumping to that slot in the hash table; we then pay α, the expected chain length, for walking the chain.

Remember: this assumes Simple Uniform Hashing, which isn't really true.
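
A quick empirical sketch of the α = n/m intuition; simple uniform hashing is simulated here with a random number generator (the values of n and m are illustrative):

  import random

  n, m = 10_000, 2_048
  chain_lengths = [0] * m
  for _ in range(n):
      chain_lengths[random.randrange(m)] += 1        # each key lands in a uniformly random slot

  print("alpha =", n / m)                            # 4.8828125
  print("average chain =", sum(chain_lengths) / m)   # exactly n/m
  print("longest chain =", max(chain_lengths))       # typically a small multiple of alpha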

In Practice: Hash Functions

Hash functions must map from a universe U to a smaller range of slots [0, m-1].

1. Division Method: h(k) = k mod m

m should be prime, and not close to a power of 2 or 10, so that regularities in real-world data don't map to regularities in the indices.
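
A minimal sketch (the choice of prime is illustrative):

  m = 1009                         # a prime, not close to a power of 2 or 10

  def division_hash(k):
      return k % m                 # maps any integer key into [0, m-1]

  print(division_hash(42424242))   # 837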

2. Multiplication Method: h(k) = [(a . k) mod 2^W] >> (W - r)

Here, we're assuming that we're on a W-bit machine. Our key k is W bits long. We have some constant a (ideally a random, odd integer), also W bits long. We multiply the key and the constant to get a product two words (2W bits) long.

We then take this result, mod 2^W. This effectively ignores all but the rightmost W bits, so we once again have a single word. This word will be made up of lots of different copies of k, all diced up and added together, so it's pretty random.

Let's assume our 'm' is 2^r : we have 2^r entries in our table, i.e. our indices will be in [0, 2^r - 1].

This means that a binary number of 'r' bits can represent the index of every number in our table: from 0 to 2^r - 1. For instance, if r is 3, we will have 2^3 = 8 entries in our table. These will be indexed from 0 to 7. In binary, 7 is 0b111: the greatest number representable with 3 bits. Therefore, 3 bits are sufficient to represent any index for this table.

So our goal is now to get a binary number of 'r' bits from the provided 'k'. We already have an effectively-random word, (a . k) mod 2^W, at our disposal. We're going to isolate the leftmost 'r' bits of this word by shifting it right until only 'r' bits remain. Since we know that the word is W bits long, we can shift it (W - r) bits to the right in order to isolate the 'r' bits we'll use.

[Figure: multiplication hashing method - a visual representation of each step]

The reason we don't want, say, the last r bits of the result instead of the first r bits is that in the 'middle' of the 2W-bit-long product of k and a, there are lots of possible carries, and it's difficult to predict whether a bit will be 1 or 0. However, at the rightmost end of this product, it's fairly easy to reason that if 'a' or 'k' ends in a zero, there'll be a zero at the end of our 'r' bits. This effectively removes a bit of entropy: we'd only be using half our table.
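
A minimal sketch in Python (W, r, and the odd constant a are all illustrative choices):

  W = 64                       # word size: assume a 64-bit machine
  r = 8                        # table size m = 2**r = 256 slots
  a = 0x9E3779B97F4A7C15       # some odd W-bit constant

  def mult_hash(k):
      product = (a * k) & ((1 << W) - 1)   # (a . k) mod 2^W: keep the low W bits
      return product >> (W - r)            # shift right to keep the top r bits

  print(mult_hash(123456789))  # an index in [0, 255]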

3. Universal Hashing

A bit beyond the scope of 6.006. There are many methods of accomplishing this; here's one:

h(k) = [(a . k + b) mod p] mod m

  • p is a prime number s.t. p > |U|
  • a is a random number in {1, ..., p-1}, and b is a random number in {0, ..., p-1}
  • m, as before, is the size of the table

For any worst-case pair of distinct keys, k1 != k2:

Probability, over the random choice of a and b, of {h(k1) == h(k2)} is 1/m
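
A minimal sketch (p and m are illustrative; a is drawn from {1, ..., p-1} so the a . k term never vanishes):

  import random

  p = (1 << 61) - 1             # a Mersenne prime, assumed larger than |U|
  m = 1024                      # table size
  a = random.randrange(1, p)    # chosen once, when the table is created
  b = random.randrange(0, p)

  def universal_hash(k):
      return ((a * k + b) % p) % m

  print(universal_hash(123456789))   # an index in [0, m-1]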