If you receive some information, how do you know if it’s intact or has been changed?

You use a hash in a similar way to a checksum by including it along with the data. If you want to verify that the data is correct, recalculate the hash and compare. There are many ways to calculate a hash. This episode will describe two broad categories of hash functions, non-cryptographic hash functions and cryptographic hash functions.

You can create or use hash functions that avoid accidental collisions or sometimes don’t worry much at all about collisions. I was at a pharmacy recently and noticed behind the counter was space to hold prescriptions that were waiting to be picked up. Each prescription was placed in a paper bag with what looked like the first two letters of the person’s last name. That’s a hash function. Not very complicated, but still a hash function. Two letters gives you 26 times 26 for a total of 676 possible hash values. There’s many reasons this is not a very good hash function for general use including the fact that it’s not very random. This means that some hash values will be more common than others. And I don’t know anybody who’s last name begins with double Q’s.

But for the pharmacy, using the first two letters of a person’s last name works great. It’s a good hash function for their needs. There’s no need for anything more complicated or elaborate.
Other hash functions could be more complicated and make use of thousands or millions or even more possible hash values. These are all non-cryptographic hash functions because while they can help detect accidental errors, they won’t help protect your information against attacks done on purpose. A non-cryptographic hash function is designed to meet your application’s needs assuming you don’t have to worry about your hash values being used by an attacker. For that, you need a cryptographic hash function.

Listen to the whole episode for more information about well-known cryptographic hash functions MD5, SHA-1, SHA-2 (including SHA-256 and SHA-512), and SHA-3. You can also read the full transcript below.

Transcript

You might want to listen to episode 43 about hash tables. It’s a different way to use hashes but I explained several properties of hashes that apply even here to error detection. I’ll assume you already know what I’m talking about when I refer to hash collisions. You use a hash in a similar way to a checksum by including it along with the data. If you want to verify that the data is correct, recalculate the hash and compare. There are many ways to calculate a hash. This episode will describe two broad categories of hash functions, non-cryptographic hash functions and cryptographic hash functions.

Before that, though, why are hash functions sometimes better than checksums? One reason is checksums can’t detect if extra zero values are inserted or if some portions of the data are swapped. Let’s say you have a simple checksum that really does add bytes. If you have a short message with the values 5 and 3, then the sum is 8. But what’s 5 plus 3 plus zero? Still 8. What’s 5 plus 0 plus 3? Still 8. What about 3 plus 5? Still 8.

Depending on how you’re calculating the hash, it could detect these types of errors.

You can create or use hash functions that avoid accidental collisions or sometimes don’t worry much at all about collisions. I was at a pharmacy recently and noticed behind the counter was space to hold prescriptions that were waiting to be picked up. Each prescription was placed in a paper bag with what looked like the first two letters of the person’s last name. That’s a hash function. Not very complicated, but still a hash function. Two letters gives you 26 times 26 for a total of 676 possible hash values. There’s many reasons this is not a very good hash function for general use including the fact that it’s not very random. This means that some hash values will be more common than others. And I don’t know anybody who’s last name begins with double Q’s.

But for the pharmacy, using the first two letters of a person’s last name works great. It’s a good hash function for their needs. There’s no need for anything more complicated or elaborate.

Other hash functions could be more complicated and make use of thousands or millions or even more possible hash values. These are all non-cryptographic hash functions because while they can help detect accidental errors, they won’t help protect your information against attacks done on purpose. A non-cryptographic hash function is designed to meet your application’s needs assuming you don’t have to worry about your hash values being used by an attacker. For that, you need a cryptographic hash function.

What’s so special about a cryptographic hash function? There are three main qualities that a cryptographic hash function needs to have.

◦ 1. Pre-image resistance means that if you have a hash, then it should be hard to find a message that would generate that hash. This is also referred to as a one-way property. It’s easy go from a message to the hash but hard to go the other way.
◦ 2. Weak collision resistance means that if you have a message already then it should be hard to find another message that would generate the same hash.
◦ 3. Strong collision resistance means that it should be hard to find any two messages that result in the same hash.

You might be wondering how difficult can it be. Computers are supposed to be really good at calculating values. What does it mean that it should be hard to find hash collisions. Well, because there’s no way to guess what data will lead to what cryptographic hash value until you actually perform the hash function, you just have to try many times. This is called brute force.

It’s like trying to figure out a combination lock. If that lock has a random combination and there’s no way to listen to the tumblers, then you’re only option is to check 0-0-0, then try 0-0-1, then 0-0-2, and keep going. As long as you know how many numbers are in the combination, you can keep trying until you open the lock. It’s a lot harder if you don’t know how many numbers but still possible. As long as there’s no limit to failed attempts, you can just keep trying until the lock opens.

But if there was a way to try birthdays or common numbers, then maybe you wouldn’t need to try all possible values. Once you find yourself in this situation, the problem is not as hard anymore. And the same thing applies to computers and hash values. Finding hash collisions is hard until somebody figures out a way to avoid trying all possible messages looking for some message that just happens to generate the same hash value as one you already have.

Even then, you might wonder what’s the big deal with just trying everything. I mean, how many possible hash values are we talking about? Trillions? It’s a lot more than that. A good cryptographic hash function will generate hash values of at least 160 bits or longer. A hash value is always the same length no matter how long the data is that’s being hashed.

Now, 160 bits is a really big number. You only need about 40 bits to get into the trillions and each additional bit doubles the possible number of hash values. With 160 bits, the possible number of hash values is 1.46 times 10 to the 48th power. That’s a one with 48 zeros. Add a couple more zeros and you have about as many atoms in the entire Earth.

And 160 bits is borderline. There are some well-known cryptographic hash functions you should be aware of. One of them is called MD5 and it creates hashes that are 128 bits long. MD5 has already been broken 13 years ago which means that an attacker can find collisions fairly easily. The problem with MD5 was more in the steps taken than in the length but still 128 bits is small enough that it’s a problem even without a flaw in the calculation steps.

Another famous cryptographic hash function is called SHA-1 and it has some extra steps above MD5 plus it generates hash values that are 160 bits long. SHA-1 is one of the most widely used hash functions because the problems with MD5 have been known for a long time and many applications chose to use SHA-1 instead. But how long before even a brute force attack on 160 bits becomes quick enough to consider a problem?

That’s why you may want to consider using SHA-2 or even the latest SHA-3 hash functions. SHA-2 creates hash values that are 256 or even 512 bits long. These specific forms of SHA-2 are actually called SHA-256 and SHA-512 but they’re both SHA-2. This hash function is also widely used for securely visiting web pages and logging into computers. Even some Bitcoin implementations use SHA-2 for verifying financial transactions.

There’s also a new SHA-3 function that was created as a result of a competition.

So far, though, SHA-2 seems to be your best choice whenever you need to create hash values that need to withstand attacks. You can still use SHA-1 and even MD5 whenever all you really need is the ability to check for accidental errors.