Anytime a thread tries to access some memory or resource that another thread can change, you have a race condition. There is no winner for this kind of race. The whole application will lose.
This episode first explains what a race condition is and then explains two ways to avoid race conditions. You can use atomic operations or critical sections.
The problem that makes this type of bug difficult to catch is that you may not notice it. You could test your software and find no problems. After passing all the tests, you release it to the world. And that’s where the problems begin. You’ve just exposed your software to environments that are different than what you tested.
You can use atomic operations if all you need to do is modify a single numeric value. Anything more complicated, even if it’s a series of atomic operations, will need a critical section to protect the entire sequence. Listen to the full episode or read the full transcript below.
The problem that makes this type of bug difficult to catch is that you may not notice it all the time. Let’s compare this to real athletes. We’ll get a record-holding olympian for our volunteer and the local high school running team that will be racing around town. Because we don’t want any cheating, nobody knows the directions for the race. Now it’s the job of the olympian to run around town and place instructions showing where to run next. Because olympians are faster than most people, this works great. By the time any of the high school runners reach a destination, they find instructions have already been placed and know where to go next.
If this scenario represented your code, you could test it and find no problems. After passing all the tests, you release it to the world. And that’s where the problems begin. You’ve just exposed your software to environments that are different than what you tested. If one of those environments has a high school with a runner soon to be the fastest in the world, then when that runner reaches the first checkpoint, there won’t be any instructions yet.
If you’re lucky, this will cause a crash. What? How is that lucky? Since when is a crash a good thing?
Well, a crash is never a good thing. But I call it lucky in this case because of the alternative. We can all agree that if the high school runner beats the olympian to the instructions, then the race should be cancelled. This is similar to a crash. What if, instead, the high school runner reached the checkpoint ahead of the olympian and found what looked like valid instructions? If those instructions send the teenager someplace unknown, then we have a much bigger problem than just a cancelled race. Now we have a missing child. That’s a lot worse.
And relating this to computer programming, this would be like losing all your work because the application continued to run with bad information and corrupted your files, not just in memory but on the hard drive too. You don’t even have an online backup because when the corrupt files were written to disk, your backup software kicked in, copied the bad files to your online storage, and overwrote those too. Compared to this situation, a crash is much better.
Now that you understand the stakes involved, I’ll explain more about avoiding race conditions right after this message from our sponsor.
( Message from Sponsor )
Going back to our make-believe race for a moment, how could we fix this? I mean fix it properly. Getting a faster olympian doesn’t really solve the problem. It just makes the problem even harder to spot. We need some way to fix race conditions for all cases.
One good way is to avoid trying to modify anything that the threads need to access while the threads are running. If you can get everything ready before starting your threads, then your code will be fine. In the example, this would be like sending the olympian out to update all the instructions before the race begins. We need to wait for confirmation that all the instructions have been updated and then we can start the race.
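Here’s a minimal sketch of that idea in C++, assuming the standard `std::thread` library. The names `run_race` and `prepare_then_race` and the instruction strings are made up for illustration. All the instructions are in place before any runner thread starts, and the runners only read them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical example: each runner (thread) only reads the checkpoint
// instructions, which are all in place before any runner starts.
std::vector<std::size_t> run_race(const std::vector<std::string>& instructions,
                                  std::size_t runner_count) {
    assert(runner_count > 0);
    // One result slot per runner, so no two threads write to the same element.
    std::vector<std::size_t> checkpoints_seen(runner_count, 0);
    std::vector<std::thread> runners;

    for (std::size_t r = 0; r < runner_count; ++r) {
        runners.emplace_back([&instructions, &checkpoints_seen, r] {
            for (const std::string& step : instructions) {
                if (!step.empty()) {
                    ++checkpoints_seen[r];  // only this runner touches slot r
                }
            }
        });
    }
    for (std::thread& t : runners) {
        t.join();  // wait for every runner to finish
    }
    return checkpoints_seen;
}

// Build the full course first, then start the race.
std::vector<std::size_t> prepare_then_race() {
    const std::vector<std::string> instructions = {
        "north to the library", "east to the park", "south to the finish"};
    return run_race(instructions, 4);
}
```

Each runner writes only to its own slot in `checkpoints_seen`, so no thread ever changes something that another thread reads.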
Sometimes though it’s not possible to do this. You may find that you do need to modify resources while threads are running.
These resources could be global variables stored in memory that’s available to all the threads, or they could be information stored in files, or practically anything. If it can change and be referenced by other threads while changing, then you have a race condition. It doesn’t matter if it leads to a bug right now or not. The race is still there.
Let’s assume that you can at least get your resources into good initial states before other threads need access. We don’t want any missing kids here. Then if you need to change something, you have some options.
Here’s what you want to avoid. Let’s start with an integer variable set to 0. This is a known good initial value. Then we’re going to start two threads that will each increment this value. After both threads have run, we expect the final value to be 2. That’s because one of the threads, and it doesn’t matter which one, will get to the variable first and increment the value to 1. Then when the slower thread gets to the variable, it will increment the value to 2. When I say faster or slower thread, sometimes that might be accurate. If one of the threads has to do a lot more work than the other one before getting to the variable, then it’s probably going to be slower. But this is unreliable at best. It’s up to the operating system to schedule threads. So really, you never can tell which thread is faster. Both threads run at the same processor clock speed, and when they get to run is outside of their ability to control.
Let’s zoom into that increment operation for a bit. Let’s say that I told you to walk into a room and change a number written on a whiteboard by adding one. You can’t just blindly walk into the room and write some number. You have to read the existing number first, right? Only then can you add one and then write the new value. What if somebody else slipped past you and wrote a different number just as you were about to write yours? Well, you’d notice this and start over, right? Computers don’t work like that. Once your code reads a value from memory and modifies it, unless you read the value again before writing it, you won’t know if it changed while you were adding one to the value you first read.
And will reading the number again right before updating it actually solve anything? No, all this does is make the race condition harder to detect. It doesn’t solve it. No matter how many times you make a “final check”, there’s always the possibility of another thread rushing past you and changing the value at the most inconvenient time.
This causes problems with our two threads trying to each increment the integer. Let’s say that the first thread reads the 0 and adds one. But before that thread can write the new value of 1 back to memory, the operating system stops it and gives the second thread its turn. The second thread does the same thing and reads the current value, which is still 0, adds one, and then writes 1 to memory. It’s done. The operating system resumes the first thread, which picks up exactly where it left off and writes the value 1 to memory. The end result after both threads incremented the value is that the value actually only got incremented once.
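To make that unlucky schedule easier to follow, here’s a small C++ sketch that replays it step by step. This is only a simulation: nothing here actually runs on two threads, and the variable names are made up. It just performs the reads and writes in the same order as the story so the result is repeatable.

```cpp
#include <cassert>

// Replays the unlucky schedule on one thread so the result is repeatable.
// The two "threads" are simulated; nothing here runs concurrently.
int replay_lost_update() {
    int shared = 0;            // the integer both threads want to increment

    int first_copy = shared;   // first thread reads 0
    first_copy += 1;           // first thread adds one and now holds 1
    // ... the operating system suspends the first thread here ...

    int second_copy = shared;  // second thread also reads 0
    second_copy += 1;          // second thread adds one
    shared = second_copy;      // second thread writes 1 and finishes

    // Both threads computed the same new value from the same old value.
    assert(first_copy == second_copy);

    // ... the first thread resumes exactly where it left off ...
    shared = first_copy;       // first thread writes its stale 1

    return shared;             // 1, not the 2 we expected
}
```

Two increments happened, but `replay_lost_update()` returns 1 because the second write simply repeats the first.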
A long time ago, scientists thought that atoms were the smallest thing, something that couldn’t be divided. Now we know that there are smaller things than atoms, but the name is still used to mean something that can’t be split. One way to solve this problem is to reduce the whole operation of reading, modifying, and then writing down to something so small that no other thread can get past you. You want something so small that the operating system can’t possibly break into the middle. Remember that the operating system can only suspend a thread and swap it out between instructions. It’ll never try to stop an instruction that’s already started. We need a way to read, change, and write all in a single processor instruction. Luckily, modern processors have this ability. We just have to use it. This is called an atomic operation, which gets its name from that old idea of the atom as something too small to divide. Your programming language will have methods you can call that are guaranteed to be atomic.
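As one example of what those methods can look like, here’s a sketch in C++ using `std::atomic<int>` and its `fetch_add` method from the standard library. The helper name `count_with_atomic` and its parameters are my own invention for this illustration, and other languages spell this differently.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: starts thread_count threads that each add one to a
// shared counter increments_each times, then returns the final value.
int count_with_atomic(int thread_count, int increments_each) {
    assert(thread_count >= 0 && increments_each >= 0);
    std::atomic<int> counter{0};  // known good initial value

    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&counter, increments_each] {
            for (int i = 0; i < increments_each; ++i) {
                // Read, add one, and write back as one indivisible step.
                counter.fetch_add(1);
            }
        });
    }
    for (std::thread& th : threads) {
        th.join();  // wait until every thread has finished
    }
    return counter.load();
}
```

With two threads that each increment once, `count_with_atomic(2, 1)` returns 2 no matter how the operating system schedules them.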
There’s another way you can solve this that I’ll explain before ending this episode. What if when you entered the room to change the number on the whiteboard, you made sure to close and lock the door first? If you know that you’re the only one in the room, then it doesn’t matter how many steps it takes you to update the number. Once you change the number, you unlock the door and leave. If somebody else tries to slip past you, then they’ll just have to wait at the door until you exit.
It’s amazing, isn’t it, how computer programming concepts relate so well to real life? Okay, maybe we don’t often need to change numbers and make sure to lock the door first, but I hope the example is still something you can relate to.
Whenever you have to perform many operations and a single atomic operation is out of the question, then you’ll just have to lock the door first. This is called a critical section. Your language will provide the means to enter and leave a critical section. The only thing you need to make sure of is that any code that needs access to consistent information should enter and leave the same critical section. The operating system and your processor will manage the critical sections to make sure that only one thread is ever allowed to enter at any given time.
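Here’s a rough sketch of locking the door in C++, assuming `std::mutex` and `std::lock_guard` from the standard library play the role of the critical section. The `Scoreboard` type and the function names are hypothetical. The point is that two values have to change together, which is more than a single atomic operation can cover.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared state: two values that must always change together.
struct Scoreboard {
    long total = 0;    // sum of all scores recorded
    long entries = 0;  // how many scores were recorded
};

std::mutex scoreboard_door;  // the lock on the door to the room

void record_score(Scoreboard& board, long score) {
    // Entering the critical section: lock the door. The lock_guard unlocks
    // it automatically when this function returns.
    std::lock_guard<std::mutex> locked(scoreboard_door);
    board.total += score;  // step one
    board.entries += 1;    // step two, still behind the locked door
}

Scoreboard record_from_threads(int thread_count, int scores_each) {
    assert(thread_count >= 0 && scores_each >= 0);
    Scoreboard board;
    std::vector<std::thread> threads;
    for (int t = 0; t < thread_count; ++t) {
        threads.emplace_back([&board, scores_each] {
            for (int i = 0; i < scores_each; ++i) {
                record_score(board, 10);
            }
        });
    }
    for (std::thread& th : threads) {
        th.join();
    }
    return board;  // safe to read without the lock: all threads have finished
}
```

With four threads each recording 5,000 scores of 10, `entries` ends at 20,000 and `total` at 200,000, because no thread can slip in between the two steps.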
What happens if you forget to wrap some of your code in a critical section? That would be like giving the thread a key to unlock the door to the room. You think your thread has sole access to the information and can perform operations when in fact, some other thread enters anyway. Don’t do this. Make sure to add critical sections around all the places in your code that need to access the same information.
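One way to make it harder to forget, sketched here in C++, is to keep the data and its lock together in one class and make the data private. This is just one possible design, and the `Whiteboard` class and `two_writers` function are hypothetical names. Every read and every write has to go through a method that locks the same door.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical class: the number and its lock live together, and the number
// is private, so no caller can reach it without going through the door.
class Whiteboard {
public:
    void add_one() {
        std::lock_guard<std::mutex> locked(door_);
        number_ += 1;
    }

    // Reads take the same lock as writes.
    int read() const {
        std::lock_guard<std::mutex> locked(door_);
        return number_;
    }

private:
    mutable std::mutex door_;  // mutable so read() can lock it while const
    int number_ = 0;
};

int two_writers(int additions_each) {
    assert(additions_each >= 0);
    Whiteboard board;
    auto work = [&board, additions_each] {
        for (int i = 0; i < additions_each; ++i) {
            board.add_one();
        }
    };
    std::thread first(work);
    std::thread second(work);
    first.join();
    second.join();
    return board.read();
}
```

Because `number_` is private, code outside the class has no way to touch it without taking the lock first.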
You can have many different critical sections. They’re all independent of each other. This would be like having multiple rooms that can each lock their door and where you can access different information in each room.
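As a small illustration of separate rooms, here’s a C++ sketch with two `std::mutex` objects, each guarding its own value. The `Rooms` struct and its member names are made up for this example.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical example: two unrelated values, each in its own "room".
struct Rooms {
    std::mutex visitors_door;  // guards only visitors
    int visitors = 0;

    std::mutex messages_door;  // guards only messages
    int messages = 0;
};

std::pair<int, int> use_both_rooms(int updates_each) {
    assert(updates_each >= 0);
    Rooms rooms;

    // This thread only ever needs the first room.
    std::thread visitor_counter([&rooms, updates_each] {
        for (int i = 0; i < updates_each; ++i) {
            std::lock_guard<std::mutex> locked(rooms.visitors_door);
            ++rooms.visitors;
        }
    });
    // This thread only needs the second room, so it never waits on the first.
    std::thread message_counter([&rooms, updates_each] {
        for (int i = 0; i < updates_each; ++i) {
            std::lock_guard<std::mutex> locked(rooms.messages_door);
            ++rooms.messages;
        }
    });

    visitor_counter.join();
    message_counter.join();
    return std::make_pair(rooms.visitors, rooms.messages);
}
```

The two threads never block each other here, because each one locks a different door.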
It’s also important to remember that critical sections don’t prevent a thread from being suspended and swapped out. This can still happen. But as long as that thread was inside a critical section, then another thread wanting to enter the same critical section will need to wait.
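Here’s a small C++ sketch of that last point, assuming `std::mutex` as the critical section. The function name is hypothetical. One thread holds the lock while a second thread uses `try_lock`, which asks whether it can get in without waiting at the door.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical check: while one thread is inside the critical section,
// can another thread get in? try_lock() answers without waiting.
bool second_thread_gets_in_while_locked() {
    std::mutex door;
    bool got_in = true;

    door.lock();  // this thread is now inside the critical section
    // Even if the operating system suspended this thread right here,
    // the door would stay locked until unlock() is called.

    std::thread visitor([&door, &got_in] {
        got_in = door.try_lock();  // returns false: the door is locked
        if (got_in) {
            door.unlock();
        }
    });
    visitor.join();

    door.unlock();  // leave the critical section
    return got_in;
}
```

The function returns `false`. A thread that called `lock` instead of `try_lock` would simply wait until the first thread unlocked the door.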