Stop Doing Data Synchronization
Yes, I'm serious. Let me cook. My ingredients can be found in this repository (please read the instructions before compiling and executing yourself).
Let's start with a simple idea: make a program that counts from 0 to 1 billion by incrementing a counter 1 billion times, as fast as possible. Yes, I know there are many ways to do this instantly, but we're going to use incrementing a counter as a proxy for actually useful and interesting work.
The Control
The first step of any experiment is a control, so I wrote a simple control program (source here).
The program simply increments a counter in a loop until the counter reaches 1 billion. On my machine, I got the following timings:
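For reference, here's a minimal sketch of what such a control roughly looks like. This is my own reconstruction, not the exact repository source; the volatile is only there so the compiler can't optimize the loop away:

```c
/* Minimal sketch of the control: one thread counting to 1 billion.
   Not the repository's exact source. */
#include <stdio.h>

int main(void) {
    /* volatile keeps the compiler from deleting or collapsing the loop */
    volatile long counter = 0;
    while (counter < 1000000000L)
        counter++;
    printf("%ld\n", (long)counter);
    return 0;
}
```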
real    0m2.450s
user    0m2.450s
sys     0m0.000s
Note we don't care about the exact time, we just want to see generally how strategies perform so we can find patterns.
Anyway, we can see that a single thread simply counting up a loop to 1 billion takes around 2.5 seconds on my CPU. This is the benchmark that we will want to compare future timings against.
Mutexes
Now, let's look at using the most basic of data synchronization primitives: the mutex (source here).
This program spawns 4 threads, and these 4 threads use a mutex to control access to the global counter. Yes, the time for spawning the threads will be included in the timings, but it's negligible.
real    0m2.578s
user    0m3.905s
sys     0m4.987s
Now this is a bit more interesting. The real time is a little bit higher, but realistically there's not much of a difference. Looking at the user time though, the threads used almost 50% more CPU time than wall-clock time to do the same work, and about twice the wall-clock time was spent in the kernel context switching (which is just wasted CPU time). This is obviously quite inefficient, but wait, there's more!
This program only counted up to 10 million, not 1 billion. So using a mutex and 4 threads uses 4 times as many resources and is 100 times slower compared to using a single thread. That's beyond inefficient, that's enterprise!
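For reference, the mutex version is shaped roughly like the sketch below. This is my reconstruction rather than the repository's exact code, and it only counts to 10 million for the reason just mentioned:

```c
/* Sketch of the mutex version: 4 threads sharing one counter behind a
   pthread mutex. Not the repository's exact source. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define TARGET      10000000L   /* 10 million */

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);          /* only one thread counts at a time */
        if (counter >= TARGET) {
            pthread_mutex_unlock(&lock);
            break;
        }
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("%ld\n", counter);
    return 0;
}
```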
Atomics
Now I'm sure that the more educated among you readers are sitting there and thinking to yourself "Everybody knows that mutexes are slow, atomics are much better and faster!" Well, good sir or madam or whatever pronoun, set your tea cup back on your coaster and watch this (source here).
This program spawns 4 threads, and these 4 threads each implement RCU by having a local counter and using compare exchange to update the global counter. Yes, I know C has many other options like atomic_fetch_add() that are much better for counting atomically, but remember that counting is a proxy for real work, and real work is done using RCU.
real    0m2.676s
user    0m10.417s
sys     0m0.001s
Looking at the results, we can see that using an atomic compare exchange is slightly slower than the control, but the user time is very close to 4 * 2.676s, so that means all 4 threads actually used the CPU pretty efficiently! Also, sys is more or less 0, which means there was very little context switching during execution. That's great, but this is still about the same speed as using a single thread... This is just overhead from doing multithreading, right? We can just add more threads and it'll outpace the single thread, right?
Ok I lied. The atomics test also only counts to 10 million, because 1 billion was also too slow. So just like with mutexes, multiply the timings by 100 to get an accurate comparison to the control.
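Again for reference, the per-thread logic in the compare exchange version looks roughly like this sketch (my reconstruction, not the repository source):

```c
/* Sketch of the compare exchange version: each thread prepares the next value
   locally, then tries to publish it with a CAS, retrying on failure.
   Not the repository's exact source. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_THREADS 4
#define TARGET      10000000L   /* 10 million */

static _Atomic long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        long expected = atomic_load(&counter);
        if (expected >= TARGET)
            break;
        long desired = expected + 1;        /* "prepare" the new value locally */
        while (!atomic_compare_exchange_weak(&counter, &expected, desired)) {
            if (expected >= TARGET)         /* another thread finished the job */
                return NULL;
            desired = expected + 1;         /* redo the local work and try again */
        }
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("%ld\n", (long)atomic_load(&counter));
    return 0;
}
```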
Everything Sucks
We can clearly see that everything sucks and there is no easy answer, so let's do a little bit of thinking and investigating. Using 4 threads was incredibly slow with both a mutex and an atomic operation, but why? Let's rerun the previous two tests, but using only a single thread:
Mutex - Single Thread (up to 1 billion)
real    0m28.922s
user    0m28.922s
sys     0m0.000s
Atomics - Single Thread (up to 1 billion)
real    0m15.105s
user    0m15.101s
sys     0m0.002s
It turns out that atomics are about twice as fast as mutexes: compared to not doing data synchronization at all, mutexes are about an 11x slowdown, while atomics are about a 6x slowdown. But why does adding more threads make things significantly slower?
Why Are More Threads Slower?
Contention.
Uh, Can You Explain Contention?
Ok fine, let me actually explain what contention is. The entire point of a mutex is to block all threads from executing a critical section without holding the mutex, and only one thread can hold the mutex at a time... That's just making your program single threaded with extra steps. While that one thread holds the mutex, all of the other threads that are waiting on the mutex basically just sleep(). What's even worse, the OS doesn't even know which thread in your program currently holds the mutex, so if the thread holding the mutex gets context switched out, all of the other threads waiting for that mutex are just stalled until that thread gets context switched back in.
But what about atomics? Atomics also make your program single threaded with extra steps, but at the hardware level: the mutex is effectively built into every atomic operation via the LOCK instruction prefix (section 1.2.5 of this document). This generally results in higher throughput, since the hardware lock only lasts for a single CPU instruction, so threads don't end up blocking each other for long. But why did atomics perform a little slower than a mutex with 4 threads, and then almost twice as fast as a mutex with a single thread? Well, think about how atomics work.
With a mutex, a thread waits on the mutex. Once the thread gets the mutex, the thread does its work, then the thread releases the mutex. Pretty straightforward, and the thread isn't doing any meaningful amount of extra work. Compare this to atomics, which usually use RCU. Every time the compare exchange fails, the thread has to prepare the data again so it can attempt the compare exchange again. Not to mention that looping around this behavior is unpredictable and will likely cause lots of branch prediction misses.

All of this extra work is much more expensive than just waiting, and all of this extra work simply means that possible chances to make forward progress are missed. This is why the common wisdom is that atomics are faster than mutexes, but mutexes are faster than atomics under heavy contention.
Can Threading Even Be Fast?
Fortunately, the answer is yes! Just don't do data synchronization! It really is that easy. Oh, what's that? Race conditions? Right, it's time for me to explain.
What I mean by "don't do data synchronization" is "architect your program in a way such that you minimize race conditions and therefore reduce contention". I bet you wouldn't be reading this rant if I called it that.
Anyway, to demonstrate this, I wrote a version of the program that does exactly this using a simple divide and conquer approach (source here). Yes, this actually counts up to 1 billion this time.
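The shape of it is roughly this sketch (my reconstruction, not the repository source): each thread counts its own quarter of the range in private memory, and the results are only combined after the threads finish, so there is nothing to contend over.

```c
/* Sketch of the divide and conquer version: 4 threads, each with a private
   counter, summed once at the end. Not the repository's exact source. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define TARGET      1000000000L

static void *worker(void *arg) {
    long *result = arg;
    volatile long local = 0;                /* private to this thread: no sharing */
    while (local < TARGET / NUM_THREADS)
        local++;
    *result = local;
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    long results[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, &results[i]);

    long total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        total += results[i];                /* combined only after the threads are done */
    }
    printf("%ld\n", total);
    return 0;
}
```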
real    0m0.609s
user    0m2.425s
sys     0m0.001s
As you can see, 4 threads is 4 times faster, which is exactly what we want. Although "just avoid race conditions bro" seems pretty obvious, this program is a trivial case that doesn't exist in the real world. I can't seriously be suggesting doing this for every program? Right?
...Right?
Ok no meme this time, but I am seriously suggesting that most programs can be written in a way that greatly reduces contention. But what causes contention? Contention happens every time a race condition would manifest, so what if we just make race conditions really unlikely?
Methodology
The start of any good methodology is to quantify and classify the possibilities. Let's start by organizing all possible programs into 2 different classes: programs that have all of the data upfront, and programs that receive data during execution. I don't think there's any other way to give a program data.
Let's start with the easier of the two classes: programs that have all of the data upfront. This scenario is quite trivial thanks to TLS. Huh? No, not Transport Layer Security, Thread Local Storage. Just like with the divide and conquer approach, each thread has its own isolated memory that it can work with. To be more methodical than the previous test, let's change the implementation slightly (source here).
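Here's a rough sketch of the idea, assuming the change is simply moving each thread's counter into a _Thread_local variable and handing the finished count back through the thread's return value (this is my guess at the shape of it, not the repository source):

```c
/* Sketch of the thread local storage variant. Not the repository's exact source. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS 4
#define TARGET      1000000000L

static _Thread_local volatile long tls_counter;   /* one instance per thread */

static void *worker(void *arg) {
    (void)arg;
    while (tls_counter < TARGET / NUM_THREADS)
        tls_counter++;
    return (void *)(intptr_t)tls_counter;         /* hand back this thread's count */
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);

    long total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        void *ret;
        pthread_join(threads[i], &ret);
        total += (long)(intptr_t)ret;
    }
    printf("%ld\n", total);
    return 0;
}
```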
real    0m0.698s
user    0m2.775s
sys     0m0.002s
The timings are a little bit slower because now we're touching memory and doing a bit of extra stuff, but you can see that the timings aren't significantly different from the previous test. Some variation of this pattern- Ew. No. Sorry for swearing. Some variation of this, uh, technique should be possible for any program in this class.
Writing A Hashmap In C
For the other class of problems, the first step is to make a hashmap in C. Don't worry though, because this hashmap implementation can be very simple. Our objective isn't necessarily to have O(1) time complexity; the goal is to have a bunch of isolated shared resources that the threads can interact with while having a low likelihood of contending with each other.
Network events are my go-to example for these kinds of programs. In this case, socket file descriptors are just small integers, so we can implement a simple hashmap by using some number of the lowest bits of the socket fd as the "hash".
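Here's a sketch of what I mean. The names (HASH_BITS, struct entry, lookup) are all made up for illustration:

```c
/* Sketch of the "hash": mask off the low bits of the socket fd.
   Names and sizes are placeholders, not the repository's. */
#include <stdatomic.h>

#define HASH_BITS 12
#define HASH_SIZE (1u << HASH_BITS)   /* 4096 buckets */

struct entry {
    _Atomic long counter;             /* per-bucket shared state */
};

static struct entry map[HASH_SIZE];

struct entry *lookup(int fd) {
    /* two fds only collide if their low HASH_BITS bits match */
    return &map[(unsigned)fd & (HASH_SIZE - 1)];
}
```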
Ok, But How Are We Going To Test This?
I can't really set up and use epoll in a quick and easy way, and anything involving networking is very hard to test in depth. Instead, let's start by writing a test that assumes the best case (no hash collisions, minimal contention) and test it with a bunch of different parameters (source here).
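To give a feel for it, the per-thread loop I have in mind looks something like the sketch below; the bucket layout, the staggered starting index, and the names are my guesses, not the repository's code:

```c
/* Sketch of the kind of per-thread loop being benchmarked: walk the buckets
   and bump each bucket's counter with a CAS, so two threads only fight when
   they land on the same bucket at the same moment. */
#include <stdatomic.h>
#include <stdint.h>

#define HASH_SIZE   4096
#define PER_THREAD  (1000000000L / 4)

static _Atomic long buckets[HASH_SIZE];

void *worker(void *arg) {
    unsigned idx = (unsigned)(uintptr_t)arg;        /* stagger the starting bucket per thread */
    for (long i = 0; i < PER_THREAD; i++) {
        _Atomic long *b = &buckets[idx];
        long expected = atomic_load(b);
        while (!atomic_compare_exchange_weak(b, &expected, expected + 1))
            ;                                       /* retries are rare when threads are spread out */
        idx = (idx + 1) & (HASH_SIZE - 1);          /* move on to the next bucket */
    }
    return 0;
}
```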
Hashmap - Single Thread - Hashmap Size 1 (2^0) (up to 1 billion)
real    0m16.085s
user    0m16.085s
sys     0m0.000s
As we can see, this is pretty comparable to the atomics test from before with a single thread, but this is just doing the same thing in a more complex way. Let's add some more entries to the hashmap.
Hashmap - Single Thread - Hashmap Size 4096 (2^12) (up to 1 billion)
real    0m16.379s
user    0m16.375s
sys     0m0.003s
Hashmap - Single Thread - Hashmap Size 65536 (2^16) (up to 1 billion)
real    0m16.663s
user    0m16.661s
sys     0m0.000s
Hashmap - Single Thread - Hashmap Size 16777216 (2^24) (up to 1 billion)
real    0m18.220s
user    0m18.140s
sys     0m0.059s
As we can see, there is a pretty small impact on performance even at massive hashmap sizes. Real world programs will likely take less of a performance hit, since this test iterates over the entire hashmap twice (once at the beginning to initialize the entries, and once at the end to sum the total). It's important to note that for a single thread on this test, the optimal hashmap size is 1 (quite obviously).
Hashmap - Two Threads - Hashmap Size 1 (2^0) (up to 100 million)
real    0m16.003s
user    0m31.959s
sys     0m0.001s
Hashmap - Two Threads - Hashmap Size 2 (2^1) (up to 100 million)
real    0m11.719s
user    0m23.361s
sys     0m0.000s
Hashmap - Two Threads - Hashmap Size 4096 (2^12) (up to 1 billion)
real    0m9.325s
user    0m18.591s
sys     0m0.004s
Hashmap - Two Threads - Hashmap Size 65536 (2^16) (up to 1 billion)
real    0m9.285s
user    0m18.413s
sys     0m0.002s
Hashmap - Two Threads - Hashmap Size 16777216 (2^24) (up to 1 billion)
real    0m9.883s
user    0m19.049s
sys     0m0.072s
As we can see from the tests above, contention is very expensive, and the cost of a massive hashmap is far lower than the cost of contention. (Keep in mind that the two smallest hashmap sizes only count to 100 million, so scale those timings up accordingly.) Another thing to note is that the optimal hashmap size for this test with 2 threads seems to be somewhere between 1024 and 2048.
Hashmap - Four Threads - Hashmap size 1 (2^0) (up to 10 million)
real    0m2.872s
user    0m11.341s
sys     0m0.001s
Hashmap - Four Threads - Hashmap size 2 (2^1) (up to 10 million)
real    0m2.538s
user    0m10.122s
sys     0m0.002s
Hashmap - Four Threads - Hashmap size 4 (2^2) (up to 10 million)
real    0m1.793s
user    0m7.139s
sys     0m0.000s
Hashmap - Four Threads - Hashmap size 16 (2^4) (up to 100 million)
real    0m9.467s
user    0m37.842s
sys     0m0.001s
Hashmap - Four Threads - Hashmap size 256 (2^8) (up to 1 billion)
real    0m8.085s
user    0m32.256s
sys     0m0.004s
Hashmap - Four Threads - Hashmap size 4096 (2^12) (up to 1 billion)
real    0m4.832s
user    0m19.287s
sys     0m0.001s
Hashmap - Four Threads - Hashmap size 65536 (2^16) (up to 1 billion)
real    0m4.856s
user    0m19.325s
sys     0m0.002s
Hashmap - Four Threads - Hashmap size 16777216 (2^24) (up to 1 billion)
real    0m5.0906s
user    0m18.892s
sys     0m0.078s
Again, we see that contention is very, very bad (and again, the smaller hashmap sizes only count to 10 or 100 million, so scale those timings accordingly). The optimal hashmap size for 4 threads seems to be somewhere between 2048 and 4096.
Hashmap Takeaways
Now that we have a bunch of numbers, let's make the important observations. The optimal hashmap size seems to be somewhere around 1000 * the number of threads. Another important takeaway is that 1 thread took around 16 seconds of both real and user time, two threads took around 9.3 seconds real time with 18.4 seconds of user time, and four threads took around 4.8 seconds real time with around 19 seconds of user time.
The user time differences when using threads indicate that there was still some amount of contention, but the more important real time measurements were nearly twice as fast with two threads and well over 3 times as fast with 4 threads. Nothing in life is ever perfect, but these results are pretty good!
However, this is still the best case of a trivial example of something that's not even a hashmap. Let's look at something closer to a worst case scenario.
Hashmap Worst Case
Remember that hashmaps are often backed by linked lists, and although there are numerous lock free data structures and algorithms, a practical lock free linked list is notoriously hard to get right (especially around removing nodes and reclaiming memory). To keep things simple, each hashmap entry gets its own lock to protect its linked list. Fortunately, rather than using a traditional mutex, we can just make a lock with atomics.
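Something along these lines, with made-up names (the actual test lives in the repository):

```c
/* Sketch of a bucket whose linked list is guarded by a tiny lock built from
   an atomic flag. Names and layout are mine, not the repository's. */
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    long value;
    struct node *next;
};

struct bucket {
    atomic_flag lock;      /* initialize with ATOMIC_FLAG_INIT */
    struct node *head;     /* the linked list the lock protects */
};

void bucket_lock(struct bucket *b) {
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;                  /* spin until the flag was previously clear */
}

void bucket_unlock(struct bucket *b) {
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

void bucket_push(struct bucket *b, long value) {
    struct node *n = malloc(sizeof(*n));
    if (!n)
        return;
    n->value = value;
    bucket_lock(b);
    n->next = b->head;     /* the list itself is only touched while holding the lock */
    b->head = n;
    bucket_unlock(b);
}
```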
I won't post as many numbers as before, but it's time to run some tests (source here).
Hashmap Worst Case - Single Thread - Hashmap Size 1 (2^0) (up to 1 billion)
real    0m30.079s
user    0m30.066s
sys     0m0.012s
Right away, this test is about twice as slow as the previous test, but that's fine because it's doing more work.
Hashmap Worst Case - Single Thread - Hashmap Size 65536 (2^16) (up to 1 billion)
real    0m35.559s
user    0m35.548s
sys     0m0.009s
And this is why testing so many different things is important. With a single hashmap entry, the CPU was just caching that entry, which sped things up a lot compared to our next tests. Linked lists are not cache friendly.
Hashmap Worst Case - Two Threads - Hashmap Size 65536 (2^16) (up to 1 billion)
real    0m23.641s
user    0m47.256s
sys     0m0.016s
Hashmap Worst Case - Four Threads - Hashmap Size 65536 (2^16) (up to 1 billion)
real    0m17.719s
user    1m10.834s
sys     0m0.016s
Now let's look at the numbers. A single thread took 35.5 seconds of both real and user time. Two threads took around 23.6 seconds real time and 47.2 seconds user time, and four threads took around 17.7 seconds real time and 70.8 seconds user time.
The real time is indeed shrinking, a bit more slowly than we'd like, but remember that this is simulating the worst case scenario, so the fact that this isn't atrocious is already a W. Looking at the user time, it grows the way we'd expect, which is good: each CPU core was being utilized fully, and we expect (and see) a lot of contention.
Improving The Worst Case
One obvious improvement to these tests is the access pattern: each thread starts at the beginning of the hashmap and walks through it sequentially, so the threads pile onto the same entries at the same time. Real workloads access entries much more randomly, so the real world will likely show better results than what we saw here. I can't think of an obvious and easy way to test this, but it's pretty clear it would be an improvement.
The other obvious improvement is backing off during contention. How is this done? Usually the thread just does a sleep(0) to tell the OS to context switch away to another thread. The idea is that mutexes are faster than atomics under heavy contention, so under heavy contention, just do what a mutex does. One less thread contending for the resource makes it easier for the other threads to make forward progress, and the sleeping thread comes back later when there is hopefully less contention. There are even different algorithms for deciding when to back off: linear backoff, exponential backoff...
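As a sketch, the backoff idea looks something like this; the threshold of 4 failed attempts is an arbitrary number I picked for illustration:

```c
/* Sketch of backing off under contention: after a few failed CAS attempts,
   yield the CPU and come back later. sched_yield() is the usual way to spell
   sleep(0) on Linux. */
#include <stdatomic.h>
#include <sched.h>

void contended_increment(_Atomic long *counter) {
    int failures = 0;
    long expected = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &expected, expected + 1)) {
        if (++failures >= 4) {    /* arbitrary threshold */
            sched_yield();        /* give another thread a turn */
            failures = 0;
        }
    }
}
```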
But doing any kind of backoff at all is wasting CPU cycles because of context switching, and decreases throughput. In fact, talking about backoff is what made me want to do this rant in the first place. Finally, we're at the good stuff.
WAKE UP
Ok, so think about this. Instead of sleeping during resource contention, what if the thread actually did something useful during resource contention? There are 2 approaches I've thought of, but sadly writing test programs for each of these approaches would be an astronomical amount of work, so I will leave this as an exercise to the reader.
Backoff Alternative 1 - Track State of Processing
The first approach is something I've done, and I can anecdotally say it is effective. For each piece of potentially contentious data, use one of the following three states to track the processing of that piece of data: inactive, processing, and keep_processing. The data is initialized in the inactive state.
When some activity occurs that's related to a piece of data, do not read the activity information. Instead, the thread immediately finds the associated data and reads the current processing state for the data.
- If the state is inactive, the thread will attempt to compare exchange the state with processing, and upon success the thread is now responsible for processing that event and its related data.
- If the state is processing, the thread will attempt to compare exchange the state with keep_processing, and upon success the thread goes back to waiting for another event to occur.
- If the state is keep_processing, the thread simply goes back to waiting for another event to occur.
- If updating the state fails, go through this process again.
When the processing thread goes back to update the processing state for the potentially contentious data:
- If the state is keep_processing, the thread will set the state to processing and then attempt to process events related to the data again.
- If the state is processing, the thread will attempt to compare exchange the state to inactive, and upon success the thread goes back to waiting for another event to occur.
- If updating the state fails, go through this process again.
The great thing about this technique is that the threads that are waiting for events count the state up, and the thread that is processing the event and data counts the state down. Since there is nothing above the keep_processing state, this ensures that no thread will ever have to do more than 3 atomic compare exchanges, barring some very poorly timed context switching.
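To make the description above concrete, here's my sketch of the scheme. Only the state names come from the text; everything else (the struct, the helper names) is made up for illustration:

```c
/* Sketch of the three-state scheme described above. Illustrative only. */
#include <stdatomic.h>
#include <stdbool.h>

enum proc_state { INACTIVE, PROCESSING, KEEP_PROCESSING };

struct item {
    _Atomic int state;    /* holds one of enum proc_state */
    /* ... the actual data lives here ... */
};

/* Called by any thread that sees activity for this item. Returns true if the
   caller just became responsible for processing it. */
bool claim_or_flag(struct item *it) {
    for (;;) {
        int s = atomic_load(&it->state);
        if (s == INACTIVE) {
            if (atomic_compare_exchange_weak(&it->state, &s, PROCESSING))
                return true;      /* we own the processing now */
        } else if (s == PROCESSING) {
            if (atomic_compare_exchange_weak(&it->state, &s, KEEP_PROCESSING))
                return false;     /* the owner will loop again; go wait for more events */
        } else {                  /* KEEP_PROCESSING */
            return false;         /* nothing to do, the owner already knows */
        }
        /* a compare exchange failed: re-read the state and try again */
    }
}

/* Called by the owning thread after a round of processing. Returns true if it
   should process the item again. */
bool finish_round(struct item *it) {
    for (;;) {
        int s = atomic_load(&it->state);
        if (s == KEEP_PROCESSING) {
            atomic_store(&it->state, PROCESSING);
            return true;          /* more work arrived while we were busy */
        }
        if (atomic_compare_exchange_weak(&it->state, &s, INACTIVE))
            return false;         /* done until the next event */
        /* a compare exchange failed: go through this again */
    }
}
```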
Backoff Alternative 2 - Try Again Later
Just have the thread hold ownership of the data a little bit longer, and go off and do other work before coming back and trying the atomic operation again. Think of async. This approach requires structuring your program in a specific way though, and would likely be difficult to retrofit into an existing program. However, I think this approach would have optimal performance.
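Here's one way I imagine that looking, purely as a sketch: a thread that fails to publish its update stashes the work on a small local retry list, keeps handling new work, and retries the stashed items on its next pass. Every name here is made up.

```c
/* Sketch of "try again later": defer contended updates instead of spinning.
   Entirely illustrative; assumes the pending array has room. */
#include <stdatomic.h>
#include <stdbool.h>

struct work { _Atomic long *target; long delta; };

/* One attempt only: success, or "come back later". */
bool try_publish(struct work *w) {
    long expected = atomic_load(w->target);
    return atomic_compare_exchange_strong(w->target, &expected, expected + w->delta);
}

void worker_pass(struct work *pending, int *npending,
                 struct work *incoming, int nincoming) {
    int still = 0;

    /* First, retry anything we deferred last time around. */
    for (int i = 0; i < *npending; i++)
        if (!try_publish(&pending[i]))
            pending[still++] = pending[i];

    /* Then handle new work, deferring anything that is contended right now. */
    for (int i = 0; i < nincoming; i++)
        if (!try_publish(&incoming[i]))
            pending[still++] = incoming[i];

    *npending = still;
}
```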
Conclusion
Threading is hard, and threading fast is even harder. I don't know all of the answers, and I'm certain someone smarter than me will come along and figure out a better way to minimize overhead in multithreaded programs.
However, the point of this rant isn't for me to pretend I know everything or to speculate that these suggestions are the best possible implementations. The point is that I see a problem and a solution, and I wanted to share that knowledge and hopefully inspire others to think more critically about threading fast. Feel free to take the example programs from the repository and play with them. Experiment. Learn. Maybe you'll figure out something great and write a rant about how I'm stupid.
Happy Coding,
AOP