HashTables for Fun
My buddy and co-worker Blake Caldwell posted a fun little note on Hashtables today. It got me thinking about the simple hash table implementation I have been using for cough cough years. So I thought I would put it up on github and write a bit about it.
First, let me introduce a few motivations. At the time I wrote the Objective-C version, there wasn't one in the standard library. The Java map is targetted at storing doubles as keys and values, something that avoids boxing. Again, this was written early on, before auto-boxing in Java. The Go implementation is really a Set, which is basically a hash table with booleans as the value. I wrote this to compare performance to the Go map, and found it not to be faster in simple tests, and thus it became a dead experiment.
I started using this implementation in Objective-C on a NeXT. I think the original ideas came from Practical Algorithsm for Programmers. The concept is this, a hashtable is an array of arrays, plus a capacity, or size, and a counter of how many elements are in the table. Each element in the main array holds an array of key-value pairs in a simple linked list struct.
In old Objective-C this looked like:
@interface Dictionary:Object
{
struct listnode **hashtable;
int size;
int used;
}
struct listnode{
struct listnode *next;
id key;
id value;
};
In Java:
public class DoubleDoubleMap
{
protected DoubleDoubleNode[] hashtable;
protected int size;
protected int used;
...
class DoubleDoubleNode
{
DoubleDoubleNode next;
double key;
double value;
}
Or in Go (note, we don't need a value for a set, only the key):
type TagSet struct {
nodes []*tagSetNode
size int32
used int32
}
type tagSetNode struct {
next *tagSetNode
key string
}
Collisions, where two keys have the same hash value, are handled using these linked lists. To look up an item, you:
- Calculate the hash
- Modulo the hash by the size of the table
- Go to that index
- Start walking the linked list looking for a node with that key using some sort of equality
For example, in Objective-C:
- objectForKey: (String*)aKey
{
id returnValue = nil;
if((aKey != nil)&&(used > 0))
{
unsigned long index;
struct listnode *node = NULL;
index = [aKey hash] % (size -1);
node = hashtable[index];
if(node != NULL)
{
do
{
if((node->key != nil) &&([aKey isEqual:node->key] == YES))
{
returnValue = node->value;
}
else
{
node = node->next;
}
} while((returnValue == nil)&&(node != NULL));
}
}
return returnValue;
}
or Java:
public double get(double c)
{
double returnValue = Double.NaN;
if(used > 0)
{
int index;
DoubleDoubleNode node = null;
index = ((int)c) % (size - 1);
node = hashtable[index];
if(node != null)
{
do
{
if(c == node.key)
{
returnValue = node.value;
break;
}
else
{
node = node.next;
}
} while(node != null);
}
}
return returnValue;
}
or Go:
func (set *TagSet) Contains(key string) bool {
if set.used > 0 {
index := set.hash(key) % uint32(set.size)
for node := set.nodes[index]; node != nil; node = node.next {
if key == node.key {
return true
}
}
}
return false
}
Adding items to the hashtable is more complicated. For performance reasons we want to grow the main array if there are enough items in the table. This attempts to insure that the table is sparse, with short linked lists. If the linked lists get too long the hashtable looses its main performance beneifit which is near constant time regardless of size. An easy example of this is the degenerate case of a hashtable with a main array of length 1. If you put 100 items into that hashtable they are all in a linked list. Which means that finding an item is O(n), the worst case. But if we keep the main array large compared to the number of elements in the table we can hope that they aren't all stacked at one index, and they shouldn't be unless the hash function is terrible.
So, in Java, my add method looks like this:
public double put(double c, double val)
{
int index;
DoubleDoubleNode node = null;
double valueToReturn = Double.NaN;
boolean gotIt = false;
index = ((int)c) % (size - 1);
node = hashtable[index];
if(node != null)
{
do
{
if(node.key == c)
{
valueToReturn = node.value;//cache the old one and
node.value = val;//replace it
gotIt = true;
break;
}
else
{
node = node.next;
}
} while(node != null);
}
if(!gotIt)
{
node = new DoubleDoubleNode();
node.key = c;
node.value = val;
node.next = hashtable[index];
hashtable[index] = node;
used++;
//check if we need to grow
if(used >= (GROW_PERCENTAGE * size))
{
DoubleDoubleNode[] newTable;
int newS = GROWTH_RATE * size, newUsed = 0;
DoubleDoubleIterator iterator = new DoubleDoubleIterator(hashtable, size);
DoubleDoubleNode newNode = null;
DoubleDoubleNode nextNode = null;
int curIndex, i;
newS = NumberUtils.ceilPrime(newS);
newTable = new DoubleDoubleNode[newS];
while(iterator.nextState())
{
curIndex = ((int)iterator.key()) % (newS - 1);
newNode = new DoubleDoubleNode();
newNode.key = iterator.key();
newNode.value = iterator.value();
newNode.next = newTable[curIndex];
newTable[curIndex] = newNode;
newUsed++;
}
for(i = 0; i < size; i++)
{
newNode = hashtable[i];
while(newNode != null)
{
nextNode = newNode.next;
newNode = nextNode;
}
hashtable[i] = null;
}
hashtable = newTable;
size = newS;
used = newUsed;
}
}
return valueToReturn;
}
Note the use of NumberUtils.ceilPrime
. This is my simple way of trying to use prime number sized tables. The Java version of this class is available on github. By using primes, I am trying to improve my chances of getting fewer collisions when I modulo the hash by the size of the main array.
Of course there are some other useful methods on hashtables, you can see my full implmentation for each of these on github.
One of the issues with the Go implementation is that there wasn't a standard hash function for strings, so I used the Java string hash function:
func (set *TagSet) hash(key string) uint32 {
var val uint32 = 1
var l = len(key)
for i := 0; i < l; i = i + 1 {
val += (val * 37) + uint32(key[i])
}
return val
}
And Blake, it was fun to look back on some of this old code, even if it did make me feel old.