147: Distributed Computing: Ready? Yes. Done.

Two-phase and three-phase commits will help you design solutions that need to work across multiple computers.

Let’s say that you’re working on an event reservation application. You need to transfer some money to several vendors for a large event.

Maybe one of the vendors is a food catering company, and they're waiting for the payment before they begin preparation. Another vendor is a local hotel that is waiting for payment to reserve a meeting room. You send the food payment and it goes through. But the hotel refuses your payment because the room is now reserved for another customer. What do you do? Most likely, you now need to go back to the food vendor and try to cancel your order. If they refuse, then you're in trouble.

What if, instead, there were a system that both the food vendor and the hotel agreed to work with to help coordinate things like this? It's actually good business for both of them, because they can advertise their relationship and highlight the benefits it gives you. This new system lets you place the food order and the room reservation and get them both ready to go before you send the final instruction to proceed. Any problems will be discovered before you proceed with either vendor.

This is called a two-phase commit. You send a message to each system asking whether there are any problems, and they all have to reply before you send a second message telling each one to proceed.
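Here is a minimal Python sketch of the two-phase commit described above. The Vendor class, its prepare, commit and abort methods, and the can_accept flag are hypothetical names for illustration; a real system would send these messages over the network and record each vote durably.

```python
from dataclasses import dataclass


@dataclass
class Vendor:
    """Hypothetical participant: one vendor system in the event booking."""
    name: str
    can_accept: bool = True  # simulates whether the prepare step will succeed
    state: str = "idle"      # idle -> prepared -> committed, or aborted

    def prepare(self) -> bool:
        # Phase 1: check for problems and hold the resources, without finalizing.
        if self.can_accept:
            self.state = "prepared"
            return True
        self.state = "aborted"
        return False

    def commit(self) -> None:
        # Phase 2, success path: make the prepared work final.
        self.state = "committed"

    def abort(self) -> None:
        # Phase 2, failure path: release anything that was being held.
        self.state = "aborted"


def two_phase_commit(vendors: list[Vendor]) -> bool:
    # Phase 1: ask every participant whether it is ready and collect every vote.
    votes = [v.prepare() for v in vendors]
    if all(votes):
        # Phase 2: everyone said yes, so tell everyone to proceed.
        for v in vendors:
            v.commit()
        return True
    # At least one said no: tell everyone to abort, including those that said yes.
    for v in vendors:
        v.abort()
    return False


caterer = Vendor("caterer")
hotel = Vendor("hotel", can_accept=False)  # the room went to another customer
print(two_phase_commit([caterer, hotel]))  # False
print(caterer.state, hotel.state)          # aborted aborted
```

Nothing is paid or finalized until every vote is in, so the caterer is never left with an order you cannot use.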

We can turn this into a three-phase commit. First, make sure everything is ready to proceed. Then, instead of committing right away, send a message telling each vendor to go ahead and actually do the work. The important point is that while the work is being done, the final commitment has not been given yet. We want as much of the work as possible to be done first, and only when every vendor has said that everything is done and they are waiting for the final word do we commit.
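A rough Python sketch of the three-step flow described above, using a similar hypothetical Vendor class (the can_commit, pre_commit, do_commit and abort names, plus the ready and work_ok flags, are illustrative assumptions). The textbook three-phase commit protocol names its phases CanCommit, PreCommit and DoCommit and is designed mainly to avoid blocking when the coordinator fails; this sketch follows the episode's simpler framing, where the middle step does the reversible work.

```python
from dataclasses import dataclass


@dataclass
class Vendor:
    """Hypothetical participant in the episode's three-step flow."""
    name: str
    ready: bool = True    # answer to "can you do this?"
    work_ok: bool = True  # whether actually doing the work succeeds
    state: str = "idle"

    def can_commit(self) -> bool:
        # Step 1: only check whether the work could be done.
        return self.ready

    def pre_commit(self) -> bool:
        # Step 2: do as much of the work as possible, but keep it reversible.
        if self.work_ok:
            self.state = "work done, awaiting final word"
            return True
        return False

    def do_commit(self) -> None:
        # Step 3: the final commitment.
        self.state = "committed"

    def abort(self) -> None:
        # Undo any work from step 2 and put things back the way they were.
        self.state = "aborted"


def three_phase_commit(vendors: list[Vendor]) -> bool:
    # Each step needs a "yes" from every vendor before the next step starts.
    for step in (Vendor.can_commit, Vendor.pre_commit):
        # Build the full list first so every vendor is asked, not just the first "no".
        if not all([step(v) for v in vendors]):
            for v in vendors:
                v.abort()
            return False
    for v in vendors:
        v.do_commit()
    return True


caterer = Vendor("caterer", work_ok=False)  # the main dish burnt
hotel = Vendor("hotel")
print(three_phase_commit([caterer, hotel]))  # False
print(hotel.state)                           # aborted
```

Because the final commitment is only sent after every vendor reports the work as done, a failure during the work step (here, the burnt main dish) still rolls everyone back.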

Make sure to listen to the full episode for more information and advice. You can also read the full transcript below.

Transcript

Imagine you and a friend are talking and you both agree to do something together. You’ll complete half of the work and your friend will do the other half. You each go about your work and after a few days and a lot of hard work, you call your friend to ask when you can put everything together. This is when you find out that your friend ran into a problem and stopped working. All your work is ruined because there’s no way you can complete everything on your own.

Computers face the same problems all the time. This episode will explain the problem in more detail and give you some advice for how to handle multiple computers that need to coordinate activities. The problem doesn’t even really need multiple computers. Multiple separate transactions are enough as long as they all need to be treated as one overall transaction.

Let’s say that you’re working on an event reservation application. You need to transfer some money to several vendors for a large event.

What if, instead, there were a system that both vendors agreed to work with? This new system lets you place the food order and the room reservation and get them both ready to go before you send the final instruction to proceed. Any problems will be discovered before you proceed with either vendor. Seems like a much better system, right?

There are still a few problems, even though this is already a big improvement. I'll explain more right after this message from our sponsor.

The first problem that comes to mind is what happens if both the food vendor and the hotel say everything is good and you make the payment and send the final confirmation to both. Then, the hotel realizes it doesn’t have enough tables, or maybe while setting up for your event, the hotel remembered it needed to make repairs, or maybe the food vendor burnt the main dish because their timers broke. The point is, there are many things that can go wrong. In real life, there’s not much you can do about these things. But when programming, instead of just making sure everything is okay to proceed, what if we add another step?

You might think this is unreasonable in real life. But many companies will allow work to begin with just a deposit. Or maybe there will be a contract that will hold each vendor responsible for all the other vendors if any of them cannot complete their portion. Luckily, we don’t need to go to such lengths in programming.

It's usually easy to reverse work that's already been done by a computer and put things back the way they were. But not always. Systems that must be absolutely sure work can be reversed at the last moment need to do things like log everything that was done to a separate system, or prepare two sets of actions (one to do the work, and another to roll everything back) before acknowledging that the work has been done.
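One way to picture the logging and two-sets-of-actions idea is a small undo log, sketched here in Python. UndoLog, do and roll_back are hypothetical names, the payment amounts are made up, and the in-memory list stands in for the separate, durable log system a real deployment would need.

```python
from typing import Callable


class UndoLog:
    """Hypothetical sketch: pair every action with its undo and log both."""

    def __init__(self) -> None:
        self.records: list[str] = []  # stand-in for a separate, durable log
        self.undo_steps: list[Callable[[], None]] = []

    def do(self, description: str, action: Callable[[], None],
           undo: Callable[[], None]) -> None:
        # Log first, so a human operator can still find and reverse this later.
        self.records.append(f"do: {description}")
        action()
        self.undo_steps.append(undo)

    def roll_back(self) -> None:
        # Undo in reverse order, so later work is reversed before earlier work.
        while self.undo_steps:
            self.undo_steps.pop()()
        self.records.append("rolled back")


balance = {"caterer": 0, "hotel": 0}
log = UndoLog()
log.do("pay caterer 500",
       lambda: balance.update(caterer=balance["caterer"] + 500),
       lambda: balance.update(caterer=balance["caterer"] - 500))
log.do("pay hotel 300",
       lambda: balance.update(hotel=balance["hotel"] + 300),
       lambda: balance.update(hotel=balance["hotel"] - 300))
log.roll_back()
print(balance)           # {'caterer': 0, 'hotel': 0}
print(len(log.records))  # 3
```

The undo action is prepared before the work runs, which is what lets the system roll everything back at the last moment.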

This is because even with a three-phase commit, there's still a possibility that something can go wrong and the work can't be put back automatically. Sometimes, a human operator will need to manually go through the logs and reverse transactions if needed.

If the software system that you're working on is even more critical, another option is to make sure alternate systems are ready to take over if needed. This would be similar to booking two different hotels for the meeting room and two different caterers for the food, just in case there was a problem with one of them. This would be wasteful in real life, but it can be a smart decision when we're dealing with virtual bits of ones and zeros.
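The alternate-systems idea can be sketched as a simple failover loop in Python. The book_with_failover function and the provider functions are hypothetical; each provider returns True when its booking succeeds and may raise ConnectionError when it is unreachable.

```python
from typing import Callable, Optional


def book_with_failover(
        providers: list[tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Try each provider in order and return the name of the first that succeeds."""
    for name, try_booking in providers:
        try:
            if try_booking():
                return name
        except ConnectionError:
            # This provider is unreachable; move on to the next one.
            continue
    # Every provider either refused or was unreachable.
    return None


def primary_hotel() -> bool:
    # Simulated: the primary hotel's system is down.
    raise ConnectionError("primary hotel unreachable")


def backup_hotel() -> bool:
    # Simulated: the backup hotel confirms the room.
    return True


print(book_with_failover([("primary hotel", primary_hotel),
                          ("backup hotel", backup_hotel)]))  # backup hotel
```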

What happens, though, if the network connection is broken at some point? You'll need to decide how to handle this. Maybe you want the other systems to wait for a while, and if there's no response, they assume everything's been cancelled. Or you can get even more elaborate and set up a system where the other systems can talk amongst themselves, and if at least a majority of the systems can still communicate with each other, then the transaction can proceed anyway. This could be a good thing or a bad thing. I mention it so you can start to get an idea of what's possible.
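The two network-failure strategies above can be sketched in Python: a participant that waits a limited time for the final word and assumes cancellation otherwise, and a strict-majority check. wait_for_decision and has_majority are hypothetical names, and queue.Queue stands in for a real network channel. Assuming cancellation after a timeout is only safe if this participant has not already said it was ready, because the coordinator may have decided to commit in the meantime; that is part of why this choice can be good or bad.

```python
import queue


def wait_for_decision(inbox: "queue.Queue[str]", timeout_s: float) -> str:
    """Wait for the coordinator's final word; assume cancellation on timeout."""
    try:
        # Blocks for at most timeout_s seconds.
        return inbox.get(timeout=timeout_s)
    except queue.Empty:
        # No response in time: treat the transaction as cancelled.
        return "abort"


def has_majority(reachable: int, total: int) -> bool:
    """True if more than half of all systems can still reach each other."""
    return reachable > total // 2


inbox: "queue.Queue[str]" = queue.Queue()
print(wait_for_decision(inbox, timeout_s=0.1))  # abort (nothing arrived)
inbox.put("commit")
print(wait_for_decision(inbox, timeout_s=0.1))  # commit
print(has_majority(3, 5), has_majority(2, 4))   # True False
```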
