If you’re not careful, you can cause a processor to come to an immediate and full stop while it waits for data to move around in memory. That’s probably not the performance boost you were looking for.
To avoid these delays, you need to be aware of two things, and you need to test your code to understand what it's actually doing.
- Cache invalidation can occur when a thread running on one processor modifies a variable that's also sitting in the cache of another processor. The processors have to keep that data consistent, so the affected cache line gets reloaded, and a processor can halt while it waits for the updated data.
- You can also get cache invalidation even when two threads are each working with their own variables, if those variables happen to reside in the same cache line. This is called false sharing. Packing your data together so it observes locality of reference will help avoid it, but you also need to make sure you don't split that packed data between multiple threads item by item.
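A minimal sketch of that second point, assuming a 64-byte cache line (typical on x86-64): padding each thread's counter so it occupies its own cache line keeps one thread's writes from invalidating the other thread's cached data.

```cpp
#include <cstddef>

// Two per-thread counters. Without padding, both would likely land in
// the same cache line, so writes from one thread would invalidate the
// other thread's cached copy (false sharing).
struct PaddedCounter {
    alignas(64) long value = 0;  // 64 is an assumed cache-line size
};

PaddedCounter counters[2];  // one per thread; each in its own line

static_assert(alignof(PaddedCounter) == 64,
              "each counter starts on its own cache line");
static_assert(sizeof(PaddedCounter) >= 64,
              "no two counters can share a line");
```

The `alignas(64)` both aligns each counter to a 64-byte boundary and pads the struct out to 64 bytes, so adjacent array elements can't share a line.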
Listen to the full episode or you can also read the full transcript below.
I’ve found few books that explain this and I actually skipped one of the patterns in the Game Programming Patterns book that explained only half the story.
Let me first explain what a cache is because if you don’t understand this, then you’ll not get much benefit from this episode.
A cache is a simple concept that we use all the time to speed things up. When you buy groceries, do you buy one item at a time? Do you walk in the store, pick up a loaf of bread, wait in line, pay for it, take it all the way home, and then repeat the whole process for the sugar, juice, and apples? Of course not. You put everything in a cart first.
When you need to fold clothes, do you go to the dryer, take out a pair of socks, fold them, and keep them in a drawer before going back to the laundry room for another pair of socks? Of course not, you grab everything out of the dryer at once, put it all in a basket and take it somewhere to fold.
These are caches, and computers use the same concept. When a processor wants a value from memory, it grabs more than it needs. Chances are good that the next value it wants will be nearby, so it might get lucky and pick up upcoming values along with the first.
When a processor does find the value it needs in the cache, this is called a cache hit. And when it can’t, this is called a cache miss. And just like when you accidentally leave some clothes in the dryer and have to make a special trip to get them, a cache miss causes a processor to have to make another trip back to memory too.
The book Game Programming Patterns describes a design pattern called Data Locality which teaches you to keep your data nearby in memory. We sometimes use linked lists which allow us to easily insert or remove items from a collection without needing to move the other items in memory. Each item can remain wherever it is in memory and the collection just stores a pointer to the item.
While this can definitely be a good thing, it also causes the processor to jump around in memory and that causes the processor to have to reload its cache much more often. You see, the cache is much faster than the main memory but it’s also much smaller.
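To see data locality at work, compare summing the same values stored contiguously in a vector versus scattered through a linked list. This is a sketch, not a benchmark; both loops compute the same sum, but the vector walk touches memory sequentially, so far more of it comes straight from the cache.

```cpp
#include <list>
#include <numeric>
#include <vector>

// Contiguous storage: elements sit side by side, so one cache line
// fetch brings in several upcoming elements for free (cache hits).
long sum_vector(const std::vector<int>& v) {
    return std::accumulate(v.begin(), v.end(), 0L);
}

// Node-based storage: each element lives wherever the allocator put
// it, so the traversal jumps around memory (more cache misses).
long sum_list(const std::list<int>& l) {
    return std::accumulate(l.begin(), l.end(), 0L);
}
```

Both functions return the same answer; only the memory access pattern, and therefore the speed on large collections, differs.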
Being aware of this and testing your code to find the slow spots will help you to write better code. But this isn’t the whole story. When you add multithreading into your application, you need to again reconsider your approach to organizing data in memory. I’ll explain how multithreading changes things right after this message from our sponsor.
( Message from Sponsor )
Alright, you know how to protect your data from race conditions by first obtaining a lock. And you also know that it’s a good idea to get the lock, do what you need as quickly as possible, and then release the lock. You want to be quick because some other threads might be waiting.
Let’s say that you have a loop where you need to modify many pieces of information and decide to follow this advice by obtaining and releasing a lock each time through the loop. Now if another thread is also going through the same loop or even a different loop that uses the same lock, then both threads are going to want this lock many times over. And ownership of the lock will likely pass back and forth between threads. From a locking point of view, that’s exactly what you want. But from a performance point of view, this can cause some major problems.
Even trying to obtain a lock that's held by a thread on another processor can cause that processor's cache to be invalidated. The result is a processor stall: the thread that legitimately holds the lock can be temporarily stopped just because another thread tried to acquire the same lock.
The solution seems simple right? Avoid sharing data between threads and realize that locks alone aren’t the full answer. You may need to shift where and when you obtain locks based on the results of your testing.
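One way to shift where you obtain locks, sketched here with hypothetical names: instead of acquiring and releasing the lock once per item, each thread acquires it once and processes a whole batch. The names and the work being done are made up for illustration.

```cpp
#include <mutex>
#include <vector>

std::mutex m;
std::vector<int> shared_data;

// Fine-grained: the lock changes hands on every iteration, which can
// ping-pong the cache line holding the mutex between processors.
void update_per_item(int n) {
    for (int i = 0; i < n; ++i) {
        std::lock_guard<std::mutex> guard(m);
        shared_data.push_back(i);
    }
}

// Coarser-grained: one acquisition covers the whole batch. Other
// threads wait longer per acquisition, but they contend far less often.
void update_per_batch(int n) {
    std::lock_guard<std::mutex> guard(m);
    for (int i = 0; i < n; ++i) {
        shared_data.push_back(i);
    }
}
```

Neither version is always right; the coarse version holds the lock longer, which is why the episode says to let your testing decide where the lock belongs.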
But here’s another thing to consider. You can actually run into this same situation even when the threads never share anything. How’s this possible? Well, it has to do with data locality and something called a cache line.
You see, caches aren’t filled with just the exact data you need. The processors don’t know for sure what data you’ll try to access next, so they grab memory in fixed-size blocks. Each block is called a cache line, commonly 64 bytes on modern processors, and you can think of it as a line drawn around a group of memory addresses that defines what’s in the cache and what couldn’t fit.
What do you think will happen if two threads are each going about their separate business and only ever reading and writing to memory that they each solely own? This seems like the perfect arrangement.
When your data is spread out in memory, it’s more likely to be mixed in with nearby data from other threads. Now, even though the two threads are each only ever touching their own memory, if that memory is close enough to be in the same cache line, then when one thread modifies a value, it’ll invalidate the cache for the other thread. This is called false sharing.
What should have been smooth sailing turns into a drawn out ping-pong match between the two threads as they each invalidate the cache of the other thread. This is not a race condition that your code is aware of. This is a race condition that happens at a much lower level. You don’t have to worry about your code getting the wrong or partially updated values. But how the processor goes about guaranteeing this can really slow things down.
By being more careful about where your data lives in memory and trying to get your data closer together, you not only improve the performance of your thread because it has more cache hits but you also reduce the chance that your data will get mixed into the same cache line as data from another thread.
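Here's a sketch of the difference, again assuming a 64-byte cache line. Two threads each increment only their own counter. In the first layout the counters are adjacent and almost certainly share a cache line, so the threads invalidate each other's caches; in the second, each counter is padded into its own line. Both layouts compute the same result; only their speed under contention differs.

```cpp
#include <thread>

// Adjacent counters: very likely in the same cache line, so each
// thread's writes invalidate the other thread's cached copy.
struct Shared {
    long a = 0;
    long b = 0;
};

// Padded counters: each starts its own (assumed 64-byte) cache line,
// so the threads never touch the same line.
struct Padded {
    alignas(64) long a = 0;
    alignas(64) long b = 0;
};

template <typename Counters>
void run(Counters& c, long iters) {
    std::thread t1([&] { for (long i = 0; i < iters; ++i) ++c.a; });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) ++c.b; });
    t1.join();
    t2.join();
}
```

Notice there's no lock anywhere: neither thread ever reads the other's counter, yet the `Shared` layout still pays for cache coherence traffic.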
In the previous episode about how to divide your data, I explained how you could have a thousand objects and if you have four cores, then it makes sense to give 250 objects to each thread. Here’s where you need to be careful. If those thousand objects are spread out in memory, then it’s possible that they’re going to live in the same cache line as other objects. This can cause processor stalls if you’re not careful.
And let’s say that you can get your thousand objects packed together in contiguous memory. This packing will only hurt you if you decide to allocate objects to the four threads like you might deal cards. If you give the first item to thread one, the second item to thread two, the third item to thread three, and so on, then you’re almost guaranteed to get false sharing. Instead, you want to give each thread a chunk of items that are all together. Now, not only will each thread have its own items, but the memory that a processor puts in a cache will tend to consist of only those items.
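The two allocation strategies can be sketched like this, assuming four threads and a contiguous array of objects; the function names are made up for illustration, and the thread bodies themselves are left out.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Round-robin ("dealing cards"): thread t gets items t, t+4, t+8, ...
// Adjacent items go to different threads, inviting false sharing.
std::vector<std::size_t> deal(std::size_t thread, std::size_t threads,
                              std::size_t count) {
    std::vector<std::size_t> mine;
    for (std::size_t i = thread; i < count; i += threads) {
        mine.push_back(i);
    }
    return mine;
}

// Chunked: thread t gets one contiguous run [first, last), so the
// cache lines a processor loads hold mostly that thread's own items.
std::pair<std::size_t, std::size_t> chunk(std::size_t thread,
                                          std::size_t threads,
                                          std::size_t count) {
    std::size_t per = count / threads;  // assumes count % threads == 0
    return {thread * per, (thread + 1) * per};
}
```

With a thousand objects and four threads, the chunked split hands thread zero items 0 through 249, thread one items 250 through 499, and so on, matching the 250-objects-per-thread division from the previous episode.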