fbpx

Consistency, Availability, and Partition Tolerance are three aspects of distributed computing. And a fourth not included in the theorem is Latency.

The original CAP theorem said that you could only be guaranteed to get two aspects between the three options, consistency, availability, and partition tolerance.

But here’s where it went wrong. You see, consistency and availability are up to you to decide how to handle. And they don’t even have to be all of one and zero of the other. They can be combined. But a network failure is outside of your ability to control. Sure CAP says you can design around this too. But CAP implies that you could choose to design a system that’s consistent and available. Those are two of the original three choices. This ignores the fact that networks will fail.

Modern distributed systems can no longer ignore partition tolerance. As designs scale out more by using more computers across increasing distances, network disruptions become more frequent. Because partition tolerance is becoming a necessary choice, you’re left with the choice of how you want to deal with the situation when you don’t have a complete system.

If you choose consistency over availability, then when a request for some information is made you can either wait for the network to come back or return an error. Or maybe you wait for a while and then return an error. Either way, the computer making the request won’t wait forever and will timeout eventually on its own. Consistency says that you need to make sure you have the most up-to-date information before replying. Even if you have some information and could reply with what you have, consistency says you double-check first. And since you can’t do that, then an error is best.

Now if you choose availability, then you reply with the information you have. Even if there might be more up-to-date information that you just can’t get to right at that moment. This means that the information should be available. You can’t just make up a response. If you reply with an answer, then it should at least have a chance to be the right answer. If a computer requests some information that a server simply does not have and can’t get to at all because of the network outage, then choosing availability means that you should at least return right away with an error.

Listen to the full episode, or read the full transcript below, to hear more including how latency fits into all this.

Transcript

Many explanations simplify this by saying that you need to choose two of the first three aspects. But that’s not quite right.

I’ll explain each of these aspects and then how they’re related.

First of all, consistency. If you know about databases, then this has a different meaning. I’ll explain databases in a future series of episodes. For now, consistency means that if you write some information, then you should read back the same information. If you instead get back old information, then this is called stale information. Sort of like how bread tastes better when it’s fresh.

Availability also has a slightly different meaning than what you might expect. It doesn’t mean that a system is working fully. And it doesn’t even mean that you can always get a reply at all. It just means that if you do happen to connect to a working server computer, then you should get back some reply even if it’s not the latest information.

Partition tolerance means that a distributed system can continue to work when there are network problems that isolate some server computers from others. Let’s say you have computers participating in a distributed system located in the United States, Canada, and Australia. If a hurricane causes all the united States computers to be disconnected from both Canada and Australia but otherwise they remain running, then a system is said to be partition tolerant if customers can continue to use the system.

The original CAP theorem said that you could only be guaranteed to get two aspects between the three options, consistency, availability, and partition tolerance.

But here’s where it went wrong. You see, consistency and availability are up to you to decide how to handle. And they don’t even have to be all of one and zero of the other. They can be combined. But a network failure is outside of your ability to control. Sure CAP says you can design around this too. But CAP implies that you could choose to design a system that’s consistent and available. Those are two of the original three choices. This ignores the fact that networks will fail.

Modern distributed systems can no longer ignore partition tolerance. As designs scale out more by using more computers across increasing distances, network disruptions become more frequent.

Just think about how reliable your internet connection is. Even if you have no problems yourself, imagine for a moment that you need to design a system that needs to connect computers from different countries with weather conditions, natural disasters, and even simple machine failures. As this system relies on more and more connections, the chances that one or more of these connections will break just keeps getting higher.

I’ll describe more and also get to that fourth aspect, latency, right after this message from our sponsor.

Because partition tolerance is becoming a necessary choice, you’re left with the choice of how you want to deal with the situation when you don’t have a complete system.

If you choose consistency over availability, then when a request for some information is made you can either wait for the network to come back or return an error. Or maybe you wait for a while and then return an error. Either way, the computer making the request won’t wait forever and will timeout eventually. Consistency says that you need to make sure you have the most up-to-date information before replying. Even if you have some information and could reply with what you have, consistency says you double-check first. And since you can’t do that, then an error is best.

Now if you choose availability, then you reply with the information you have. Even if there might be more up-to-date information that you just can’t get to right at that moment. This means that the information should be available. You can’t just make up a response. If you reply with an answer, then it should at least have a chance to be the right answer. If a computer requests some information that a server simply does not have and can’t get to at all because of the network outage, then choosing availability means that you should at least return right away with an error.

In order to prevent this where a server has no access to its own data, that means information needs to be replicated to several locations. That just means the information needs to be copied to each location and updated at each location whenever it changes. Copying information like this takes time.

Let’s say a customer connects to a server in the United States and orders a new bag. The inventory remaining for that bag needs to be reduced. And in order to improve availability, this inventory is found in both Canada servers and in Australian servers. To be consistent, you need to design the software to send a message to all the places where it needs to be updated before confirming the order with the customer. This can take a while. And this is latency. As your system becomes more consistent, it takes longer and becomes more latent. It might be better to sacrifice some consistency in order to speed things up and improve latency.

So to sum everything up, network partitions are a fact of life and something that just can’t be ignored. You can choose how to handle network errors by sticking to a firm commitment to verify any local information such as inventory amounts or to reply with whatever information you have. And this doesn’t have to be an all or nothing approach. You have some room to decide the best approach depending on the system you’re designing and what customers will find acceptable. And even when everything is working just fine, you still need to balance consistency with latency.