You’ll need to be able to work with groups or collections of items. A game that only has one character with one quest and with one shop that sells one item isn’t going to be very fun. You have choices and this episode continues more than a week long exploration of collection types available. You’ll learn when to use each type of collection and why. Up today is the hash table. I’ll explain what hash tables are and then give you some guidance on when to use them.
Hash tables require a method that knows how to turn some information into an index that we can use to place the item in an array. The array size is fixed and we usually end up placing items not directly in the array but in a linked list. There’s a different linked list for each location in the array. It works like this assuming we have an item with a property called name that we’ll use to generate a hash value:
item.name => hash method => index
array[maximum index] -> list
array[…] -> list
array[index + 1] -> list
array[index] -> list -> item
array[index – 1] -> list
array[…] -> list
array -> list
You can see here that the hash method is used to generate a hash value that will be used for the index. then each index in the array starting from zero all the way up to the maximum indeed value all point to linked lists. The item we are interested in gets added to the list at its index value.
Listen to the full episode or you can also read the full transcript below.
Sometimes all you need is to be able to add and remove items in a collection and then find them later. If you’re not concerned about order, then a hash table is a good choice.
Let’s say that you know exactly where you want to add an item based on a number. Each item has its own number. You could allocate an array big enough to hold all the items and then put each one in its own spot. Think of a parking lot where everybody has a reserved spot. You don’t need to search for a spot. You just drive your car directly to your reserved space. If you’re not using the parking lot, then your space remains empty. This is actually a very good solution if all the spaces are normally full and you can control where everything goes.
But what do you do if your parking lot can hold up to two thousand cars but there’s usually only a hundred or so at any given time? You’re just wasting all that space. If this was a computer program, you’d be wasting memory.
A better solution would be to reserve less space and calculate the position an object will occupy based on some property of the object. Taking the parking lot example, maybe we could add up all the numbers and letter values on the license plate, then divide that by the total number of parking spaces. The division probably won’t be exact but even if it is, that doesn’t matter because what we’re really interested in is the remainder. If all the parking spaces are numbered, then the remainder will identify the making space to be used.
If you forget where you parked your car, just go through the process again to calculate the remainder and that’s where your car will be. The actual name for the final spot in the collection is the hash. And a hash table is sort of like this parking lot.
Whatever method is used to calculate the location or the hash, it should have some important properties. What happens if two cars arrive at the parking lot with different license plate numbers but they just happen to arrive at the same hash? This is called a collision and that’s exactly what happens if you try to park two cars in the same spot. The first property is that the method should produce hash results as random as possible. This randomness should result in a large difference in the hash even with a small difference in the properties. Just think of two cars arriving with only a single number different between their two license plates. A good hash method will place these two cars in very different locations.
I mention randomness but any hash method needs to be repeatable. You won’t be very happy if you forget where you parked your car only to have the hash method send you to a different spot each time.
While collisions in the parking lot are very bad, computer hash tables can handle collisions very easy. I’ll explain how right after this message from our sponsor.
( Message from Sponsor )
Instead of putting items directly into hash table locations, each location can have a double linked list. Now there’s no problem if two items hash to the same value because they’ll both go into a list. If you want to find an item, this means that you can sometimes need to search through a linked list but if your hash table is constructed properly, you should have very few collisions and therefore just a few items to examine in the list.
The worst case for a hash table happens when all the items hash to the same value and end up going into the same linked list. This gives a Big-O of N speed. You shouldn’t run into this situation in practice so the normal time required to insert, find, and delete items in a hash table is Big-O of one.
You might wonder how can it be a Big-O of one? I mean, calculating a hash value requires reading various properties, combining them in some way, performing some mathematical operation, and then possibly scanning through a linked list. Remember that Big-O notation doesn’t specify exactly how long something will take because different computers run at different speeds. And you can even have different hashing methods with more or less amount of work to be done. But none of this depends on how many items are in the hash table. This is all fixed work that takes some amount of time. Because the time required is constant, that’s what gives us a running time of Big-O of one. As long as the number of collisions is kept low, then the time required to scan through the list just gets added into the time required to calculate the hash value.
I mentioned how you could divide some number by however many spaces you have in the hash table and use the remainder as the index into the hash table for where an item will be stored. This is called the division method. It’s important that you carefully select how many spaces you have available in your table or you’ll end up with a poor hash method. A good way to select this value is to choose a prime number that’s as far from a power of two as possible. Powers of two are just the binary place holder values 1, 2, 4, 8, 16, 32, 64, etc. Just find a prime number between one of the power of two values that will give you a good number of places in your hash table based on how many items you expect to be placing in the hash table.
Another way that you could generate your hash values is called the multiplication method. With this method, you multiply your key properties by a fractional value between zero and one and then take just the fractional value of that result and multiple by the size of your hash table. It’s just a matter of choosing a good initial fractional value. The nice thing about this approach is that you don’t have to worry about adjusting the size of your hash table itself to be a prime number.
I’ll end this episode with some advice that could help you avoid a denial of service attack. I talked about denial of service attacks in two Q & A Friday episodes from Jan 8th and 16th of this year. Let’s say that an attacker is providing the items that your code is going to put into a hash table. The attacker could then choose items that even though they’re different, result in collisions. If you add enough of these items to your hash table, then you could degrade the performance of your hash table so that is starts approaching a running time of Big-O of N. That’s bad. The attacker can do this if your hash method is known. But what if you had many hash methods available and could select one at random. Once you select a hash method, then you’ll need to keep using that same method. But just the fact that you have choices will make it harder for an attacker because the exact method will not be known ahead of time.