Stop Doing Data Synchronization
Yes, I'm serious. Let me cook. My ingredients can be found in this repository (please read the instructions before compiling and executing yourself).
Let's start with a simple idea: make a program that counts from 0 to 1 billion by incrementing a counter 1 billion times, as fast as possible. Yes, I know there are many ways to do this instantly, but we're going to use incrementing a counter as a proxy for actually useful and interesting work.
The Control
The first step of any experiment is a control, so I wrote a simple control program (source here).
The program simply increments a counter in a loop until the counter reaches 1 billion. On my machine, I got the following timings:
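For reference, here's a minimal sketch of what such a control roughly looks like. This is my own reconstruction, not the exact repository source; the volatile is only there so the compiler can't optimize the loop away:

```c
/* Minimal sketch of the control: one thread counting to 1 billion.
   Not the repository's exact source. */
#include <stdio.h>

int main(void) {
    /* volatile keeps the compiler from deleting or collapsing the loop */
    volatile long counter = 0;
    while (counter < 1000000000L)
        counter++;
    printf("%ld\n", (long)counter);
    return 0;
}
```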
real    0m2.450s
user    0m2.450s
sys     0m0.000s
Note we don't care about the exact time, we just want to see generally how strategies perform so we can find patterns.
Anyway, we can see that a single thread simply counting up a loop to 1 billion takes around 2.5 seconds on my CPU. This is the benchmark that we will want to compare future timings against.
Mutexes
Now, let's look at using the most basic of data synchronization primitives: the mutex (source here).
This program spawns 4 threads, and these 4 threads use a mutex to control access to the global counter. Yes, the time for spawning the threads will be included in the timings, but it's negligible.
real    0m2.578s
user    0m3.905s
sys     0m4.987s
Now this is a bit more interesting. The real time is a little bit higher, but realistically there's not much of a difference. Looking at the user time though, the threads used almost 50% more CPU time than wall-clock time to do the same work, and about twice the wall-clock time was spent in the kernel context switching (which is just wasted CPU time). This is obviously quite inefficient, but wait, there's more!
This program only counted up to 10 million, not 1 billion. So using a mutex and 4 threads uses 4 times as many resources and is 100 times slower compared to using a single thread. That's beyond inefficient, that's enterprise!
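For reference, the mutex version is shaped roughly like the sketch below. This is my reconstruction rather than the repository's exact code, and it only counts to 10 million for the reason just mentioned:

```c
/* Sketch of the mutex version: 4 threads sharing one counter behind a
   pthread mutex. Not the repository's exact source. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define TARGET      10000000L   /* 10 million */

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);          /* only one thread counts at a time */
        if (counter >= TARGET) {
            pthread_mutex_unlock(&lock);
            break;
        }
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("%ld\n", counter);
    return 0;
}
```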
Atomics
Now I'm sure that the more educated among you readers are sitting there and thinking to yourself "Everybody knows that mutexes are slow, atomics are much better and faster!" Well, good sir or madam or whatever pronoun, set your tea cup back on your coaster and watch this (source here).
This program spawns 4 threads, and these 4 threads each implement RCU by having a local counter and using compare exchange to update the global counter. Yes, I know C has many other options like atomic_fetch_add() that are much better for counting atomically, but remember that counting is a proxy for real work, and real work is done using RCU.
real    0m2.676s
user    0m10.417s
sys     0m0.001s
Looking at the results, we can see that using an atomic compare exchange is slightly slower than the control, but the user time is very close to 4 * 2.676s, so that means all 4 threads actually used the CPU pretty efficiently! Also, sys is more or less 0, which means there was very little context switching during execution. That's great, but this is still about the same speed as using a single thread... This is just overhead from doing multithreading, right? We can just add more threads and it'll outpace the single thread, right?
Ok I lied. The atomics test also only counts to 10 million, because 1 billion was also too slow. So just like with mutexes, multiply the timings by 100 to get an accurate comparison to the control.
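Again for reference, the per-thread logic in the compare exchange version looks roughly like this sketch (my reconstruction, not the repository source):

```c
/* Sketch of the compare exchange version: each thread prepares the next value
   locally, then tries to publish it with a CAS, retrying on failure.
   Not the repository's exact source. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_THREADS 4
#define TARGET      10000000L   /* 10 million */

static _Atomic long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        long expected = atomic_load(&counter);
        if (expected >= TARGET)
            break;
        long desired = expected + 1;        /* "prepare" the new value locally */
        while (!atomic_compare_exchange_weak(&counter, &expected, desired)) {
            if (expected >= TARGET)         /* another thread finished the job */
                return NULL;
            desired = expected + 1;         /* redo the local work and try again */
        }
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    printf("%ld\n", (long)atomic_load(&counter));
    return 0;
}
```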
Everything Sucks
We can clearly see that everything sucks and there is no easy answer, so let's do a little bit of thinking and investigating. Using 4 threads was incredibly slow with both a mutex and an atomic operation, but why? Let's rerun the previous two tests, but using only a single thread:
Mutex - Single Thread (up to 1 billion)
real    0m28.922s
user    0m28.922s
sys     0m0.000s
Atomics - Single Thread (up to 1 billion)
real    0m15.105s
user    0m15.101s
sys     0m0.002s
It turns out that atomics are about twice as fast as mutexes: compared to not doing data synchronization at all, mutexes are about an 11x slowdown, while atomics are about a 6x slowdown. But why does adding more threads make things significantly slower?
Why Are More Threads Slower?
Contention.
Uh, Can You Explain Contention?
Ok fine, let me actually explain what contention is. The entire point of a mutex is to block all threads from executing a critical section without holding the mutex, and only one thread can hold the mutex at a time... That's just making your program single threaded with extra steps. While that one thread holds the mutex, all of the other threads that are waiting on the mutex basically just sleep(). What's even worse, the OS doesn't even know which thread in your program currently holds the mutex, so if the thread holding the mutex gets context switched out, all of the other threads waiting for that mutex are just stalled until that thread gets context switched back in.
But what about atomics? Atomics also make your program single threaded with extra steps, but at the hardware level: the mutex is effectively built into every atomic operation via the LOCK instruction prefix (section 1.2.5 of this document). This generally results in higher throughput, since the hardware lock only lasts for a single CPU instruction, so threads don't end up blocking each other for long. But why did atomics perform a little slower than a mutex with 4 threads, and then almost twice as fast as a mutex with a single thread? Well, think about how atomics work.
With a mutex, a thread waits on the mutex. Once the thread gets the mutex, the thread does its work, then the thread releases the mutex. Pretty straightforward, and the thread isn't doing any meaningful amount of extra work. Compare this to atomics, which usually use RCU. Every time the compare exchange fails, the thread has to prepare the data again so it can attempt the compare exchange again. Not to mention that looping around this behavior is unpredictable and will likely cause lots of branch prediction misses.

All of this extra work is much more expensive than just waiting, and all of this extra work simply means that possible chances to make forward progress are missed. This is why the common wisdom is that atomics are faster than mutexes, but mutexes are faster than atomics under heavy contention.
Can Threading Even Be Fast?
Fortunately, the answer is yes! Just don't do data synchronization! It really is that easy. Oh, what's that? Race conditions? Right, it's time for me to explain.
What I mean by "don't do data synchronization" is "architect your program in a way such that you minimize race conditions and therefore reduce contention". I bet you wouldn't be reading this rant if I called it that.
Anyway, to demonstrate this, I wrote a version of the program that does exactly this using a simple divide and conquer approach (source here). Yes, this actually counts up to 1 billion this time.
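The shape of it is roughly this sketch (my reconstruction, not the repository source): each thread counts its own quarter of the range in private memory, and the results are only combined after the threads finish, so there is nothing to contend over.

```c
/* Sketch of the divide and conquer version: 4 threads, each with a private
   counter, summed once at the end. Not the repository's exact source. */
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define TARGET      1000000000L

static void *worker(void *arg) {
    long *result = arg;
    volatile long local = 0;                /* private to this thread: no sharing */
    while (local < TARGET / NUM_THREADS)
        local++;
    *result = local;
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    long results[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, &results[i]);

    long total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(threads[i], NULL);
        total += results[i];                /* combined only after the threads are done */
    }
    printf("%ld\n", total);
    return 0;
}
```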
real    0m0.609s
user    0m2.425s
sys     0m0.001s
As you can see, 4 threads is 4 times faster, which is exactly what we want. Although "just avoid race conditions bro" seems pretty obvious, this program is a trivial case that doesn't exist in the real world. I can't seriously be suggesting doing this for every program? Right?
...Right?
Ok no meme this time, but I am seriously suggesting that most programs can be written in a way that greatly reduces contention. But what causes contention? Contention happens every time a race condition would manifest, so what if we just make race conditions really unlikely?
Methodology
The start of any good methodology is to quantify and classify the possibilities. Let's start by organizing all possible programs into 2 different classes: programs that have all of the data upfront, and programs that receive data during execution. I don't think there's any other way to give a program data.
Let's start with the easier of the two classes: programs that have all of the data upfront. This scenario is quite trivial thanks to TLS. Huh? No, not Transport Layer Security, Thread Local Storage. Just like with the divide and conquer approach, each thread has its own isolated memory that it can work with. To be more methodical than the previous test, let's change the implementation slightly (source here).
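Here's a rough sketch of the idea, assuming the change is simply moving each thread's counter into a _Thread_local variable and handing the finished count back through the thread's return value (this is my guess at the shape of it, not the repository source):

```c
/* Sketch of the thread local storage variant. Not the repository's exact source. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_THREADS 4
#define TARGET      1000000000L

static _Thread_local volatile long tls_counter;   /* one instance per thread */

static void *worker(void *arg) {
    (void)arg;
    while (tls_counter < TARGET / NUM_THREADS)
        tls_counter++;
    return (void *)(intptr_t)tls_counter;         /* hand back this thread's count */
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);

    long total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        void *ret;
        pthread_join(threads[i], &ret);
        total += (long)(intptr_t)ret;
    }
    printf("%ld\n", total);
    return 0;
}
```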
real    0m0.698s
user    0m2.775s
sys     0m0.002s
The timings are a little bit slower because now we're touching memory and doing a bit of extra stuff, but you can see that the timings aren't significantly different from the previous test. Some variation of this pattern- Ew. No. Sorry for swearing. Some variation of this, uh, technique should be possible for any program in this class.
Writing A Hashmap In C
For the other class of problems, the first step is to make a hashmap in C. Don't worry though, because this hashmap implementation can be very simple. Our objective isn't necessarily to have O(1) time complexity; the goal is to have a bunch of isolated shared resources that the threads can interact with while having a low likelihood of contending with each other.
Network events are my go-to example for these kinds of programs. In this case, socket file descriptors are just small integers, so we can implement a simple hashmap by using some number of the lowest bits of the socket fd as the "hash".
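Here's a sketch of what I mean. The names (HASH_BITS, struct entry, lookup) are all made up for illustration:

```c
/* Sketch of the "hash": mask off the low bits of the socket fd.
   Names and sizes are placeholders, not the repository's. */
#include <stdatomic.h>

#define HASH_BITS 12
#define HASH_SIZE (1u << HASH_BITS)   /* 4096 buckets */

struct entry {
    _Atomic long counter;             /* per-bucket shared state */
};

static struct entry map[HASH_SIZE];

struct entry *lookup(int fd) {
    /* two fds only collide if their low HASH_BITS bits match */
    return &map[(unsigned)fd & (HASH_SIZE - 1)];
}
```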
Ok, But How Are We Going To Test This?
I can't really set up and use epoll in a quick and easy way, and anything involving networking is very hard to test in depth. Instead, let's start by writing a test that assumes the best case (no hash collisions, minimal contention) and test it with a bunch of different parameters (source here).
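To give a feel for it, the per-thread loop I have in mind looks something like the sketch below; the bucket layout, the staggered starting index, and the names are my guesses, not the repository's code:

```c
/* Sketch of the kind of per-thread loop being benchmarked: walk the buckets
   and bump each bucket's counter with a CAS, so two threads only fight when
   they land on the same bucket at the same moment. */
#include <stdatomic.h>
#include <stdint.h>

#define HASH_SIZE   4096
#define PER_THREAD  (1000000000L / 4)

static _Atomic long buckets[HASH_SIZE];

void *worker(void *arg) {
    unsigned idx = (unsigned)(uintptr_t)arg;        /* stagger the starting bucket per thread */
    for (long i = 0; i < PER_THREAD; i++) {
        _Atomic long *b = &buckets[idx];
        long expected = atomic_load(b);
        while (!atomic_compare_exchange_weak(b, &expected, expected + 1))
            ;                                       /* retries are rare when threads are spread out */
        idx = (idx + 1) & (HASH_SIZE - 1);          /* move on to the next bucket */
    }
    return 0;
}
```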
Hashmap - Single Thread - Hashmap Size 1 (2^0) (up to 1 billion)
real    0m16.085s
user    0m16.085s
sys     0m0.000s
As we can see, this is pretty comparable to the atomics test from before with a single thread, but this is just doing the same thing in a more complex way. Let's add some more entries to the hashmap.
Hashmap - Single Thread - Hashmap Size 4096 (2^12) (up to 1 billion)
real    0m16.379s
user    0m16.375s
sys     0m0.003s
Hashmap - Single Thread - Hashmap Size 65536 (2^16) (up to 1 billion)
real    0m16.663s
user    0m16.661s
sys     0m0.000s
Hashmap - Single Thread - Hashmap Size 16777216 (2^24) (up to 1 billion)
real    0m18.220s
user    0m18.140s
sys     0m0.059s
As we can see, there is a pretty small impact on performance even at massive hashmap sizes. Real world programs will likely take less of a performance hit, since this test iterates over the entire hashmap twice (once at the beginning to initialize the entries, and once at the end to sum the total). It's important to note that for a single thread on this test, the optimal hashmap size is 1 (quite obviously).
Hashmap - Two Threads - Hashmap Size 1 (2^0) (up to 100 million)
real    0m16.003s
user    0m31.959s
sys     0m0.001s
Hashmap - Two Threads - Hashmap Size 2 (2^1) (up to 100 million)
real    0m11.719s
user    0m23.361s
sys     0m0.000s
Hashmap - Two Threads - Hashmap Size 4096 (2^12) (up to 1 billion)
real    0m9.325s
user    0m18.591s
sys     0m0.004s
Hashmap - Two Threads - Hashmap Size 65536 (2^16) (up to 1 billion)
real    0m9.285s
user    0m18.413s
sys     0m0.002s
Hashmap - Two Threads - Hashmap Size 16777216 (2^24) (up to 1 billion)
real    0m9.883s
user    0m19.049s
sys     0m0.072s
As we can see from the tests above, contention is very expensive, and the cost of a massive hashmap is far lower than the cost of contention. (Keep in mind that the two smallest hashmap sizes only count to 100 million, so scale those timings up accordingly.) Another thing to note is that the optimal hashmap size for this test with 2 threads seems to be somewhere between 1024 and 2048.
Hashmap - Four Threads - Hashmap size 1 (2^0) (up to 10 million)
real    0m2.872s
user    0m11.341s
sys     0m0.001s
Hashmap - Four Threads - Hashmap size 2 (2^1) (up to 10 million)
real    0m2.538s
user    0m10.122s
sys     0m0.002s
Hashmap - Four Threads - Hashmap size 4 (2^2) (up to 10 million)
real    0m1.793s
user    0m7.139s
sys     0m0.000s
Hashmap - Four Threads - Hashmap size 16 (2^4) (up to 100 million)
real    0m9.467s
user    0m37.842s
sys     0m0.001s
Hashmap - Four Threads - Hashmap size 256 (2^8) (up to 1 billion)
real    0m8.085s
user    0m32.256s
sys     0m0.004s
Hashmap - Four Threads - Hashmap size 4096 (2^12) (up to 1 billion)
real    0m4.832s
user    0m19.287s
sys     0m0.001s
Hashmap - Four Threads - Hashmap size 65536 (2^16) (up to 1 billion)
real    0m4.856s
user    0m19.325s
sys     0m0.002s
Hashmap - Four Threads - Hashmap size 16777216 (2^24) (up to 1 billion)
real    0m5.0906s
user    0m18.892s
sys     0m0.078s
Again, we see that contention is very, very bad (and again, the smaller hashmap sizes only count to 10 or 100 million, so scale those timings accordingly). The optimal hashmap size for 4 threads seems to be somewhere between 2048 and 4096.
Hashmap Takeaways
Now that we have a bunch of numbers, let's make the important observations. The optimal hashmap size seems to be somewhere around 1000 * the number of threads. Another important takeaway is that 1 thread took around 16 seconds of both real and user time, two threads took around 9.3 seconds real time with 18.4 seconds of user time, and four threads took around 4.8 seconds real time with around 19 seconds of user time.
The user time differences when using threads indicate that there was still some amount of contention, but the more important real time measurements were nearly twice as fast with two threads and well over 3 times as fast with 4 threads. Nothing in life is ever perfect, but these results are pretty good!
However, this is still the best case of a trivial example of something that's not even a hashmap. Let's look at something closer to a worst case scenario.
Hashmap Worst Case
Remember that hashmaps are often backed by linked lists, and although there are numerous lock free data structures and algorithms, a practical lock free linked list is notoriously hard to get right (especially around removing nodes and reclaiming memory). To keep things simple, each hashmap entry gets its own lock to protect its linked list. Fortunately, rather than using a traditional mutex, we can just make a lock with atomics.
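Something along these lines, with made-up names (the actual test lives in the repository):

```c
/* Sketch of a bucket whose linked list is guarded by a tiny lock built from
   an atomic flag. Names and layout are mine, not the repository's. */
#include <stdatomic.h>
#include <stdlib.h>

struct node {
    long value;
    struct node *next;
};

struct bucket {
    atomic_flag lock;      /* initialize with ATOMIC_FLAG_INIT */
    struct node *head;     /* the linked list the lock protects */
};

void bucket_lock(struct bucket *b) {
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;                  /* spin until the flag was previously clear */
}

void bucket_unlock(struct bucket *b) {
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

void bucket_push(struct bucket *b, long value) {
    struct node *n = malloc(sizeof(*n));
    if (!n)
        return;
    n->value = value;
    bucket_lock(b);
    n->next = b->head;     /* the list itself is only touched while holding the lock */
    b->head = n;
    bucket_unlock(b);
}
```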
I won't post as many numbers as before, but it's time to run some tests (source here).
Hashmap Worst Case - Single Thread - Hashmap Size 1 (2^0) (up to 1 billion)
real    0m30.079s
user    0m30.066s
sys     0m0.012s
Right away, this test is about twice as slow as the previous test, but that's fine because it's doing more work.
Hashmap Worst Case - Single Thread - Hashmap Size 65536 (2^16) (up to 1 billion)
real    0m35.559s
user    0m35.548s
sys     0m0.009s
And this is why testing so many different things is important. With a single hashmap entry, the CPU was just caching that entry, which sped things up a lot compared to our next tests. Linked lists are not cache friendly.
Hashmap Worst Case - Two Threads - Hashmap Size 65536 (2^16) (up to 1 billion)
real    0m23.641s
user    0m47.256s
sys     0m0.016s
Hashmap Worst Case - Four Threads - Hashmap Size 65536 (2^16) (up to 1 billion)
real    0m17.719s
user    1m10.834s
sys     0m0.016s
Now let's look at the numbers. A single thread took 35.5 seconds of both real and user time. Two threads took around 23.6 seconds real time and 47.2 seconds user time, and four threads took around 17.7 seconds real time and 70.8 seconds user time.
The real time is indeed shrinking, a bit more slowly than we'd like, but remember that this is simulating the worst case scenario, so the fact that this isn't atrocious is already a W. Looking at the user time, it grows the way we'd expect, which is good: each CPU core was being utilized fully, and we expect (and see) a lot of contention.
Improving The Worst Case
One obvious improvement to these tests is the access pattern: each thread starts at the beginning of the hashmap and walks through it sequentially, so the threads pile onto the same entries at the same time. Real workloads access entries much more randomly, so the real world will likely show better results than what we saw here. I can't think of an obvious and easy way to test this, but it's pretty clear it would be an improvement.
The other obvious improvement is backing off during contention. How is this done? Usually the thread just does a sleep(0) to tell the OS to context switch away to another thread. The idea is that mutexes are faster than atomics under heavy contention, so under heavy contention, just do what a mutex does. One less thread contending for the resource makes it easier for the other threads to make forward progress, and the sleeping thread comes back later when there is hopefully less contention. There are even different algorithms for deciding when to back off: linear backoff, exponential backoff...
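As a sketch, the backoff idea looks something like this; the threshold of 4 failed attempts is an arbitrary number I picked for illustration:

```c
/* Sketch of backing off under contention: after a few failed CAS attempts,
   yield the CPU and come back later. sched_yield() is the usual way to spell
   sleep(0) on Linux. */
#include <stdatomic.h>
#include <sched.h>

void contended_increment(_Atomic long *counter) {
    int failures = 0;
    long expected = atomic_load(counter);
    while (!atomic_compare_exchange_weak(counter, &expected, expected + 1)) {
        if (++failures >= 4) {    /* arbitrary threshold */
            sched_yield();        /* give another thread a turn */
            failures = 0;
        }
    }
}
```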
But doing any kind of backoff at all is wasting CPU cycles because of context switching, and decreases throughput. In fact, talking about backoff is what made me want to do this rant in the first place. Finally, we're at the good stuff.
WAKE UP
Ok, so think about this. Instead of sleeping during resource contention, what if the thread actually did something useful during resource contention? There are 2 approaches I've thought of, but sadly writing test programs for each of these approaches would be an astronomical amount of work, so I will leave this as an exercise to the reader.
Backoff Alternative 1 - Track State of Processing
The first approach is something I've done, and I can anecdotally say it is effective. For each piece of potentially contentious data, use one of the following three states to track the processing of that piece of data: inactive, processing, and keep_processing. The data is initialized in the inactive state.
When some activity occurs that's related to a piece of data, do not read the activity information. Instead, the thread immediately finds the associated data and reads the current processing state for the data.
- If the state is inactive, the thread will attempt to compare exchange the state with processing, and upon success the thread is now responsible for processing that event and its related data.
- If the state is processing, the thread will attempt to compare exchange the state with keep_processing, and upon success the thread goes back to waiting for another event to occur.
- If the state is keep_processing, the thread simply goes back to waiting for another event to occur.
- If updating the state fails, go through this process again.
When the processing thread goes back to update the processing state for the potentially contentious data:
- If the state is keep_processing, the thread will set the state to processing and then attempt to process events related to the data again.
- If the state is processing, the thread will attempt to compare exchange the state to inactive, and upon success the thread goes back to waiting for another event to occur.
- If updating the state fails, go through this process again.
The great thing about this technique is that the threads that are waiting for events count the state up, and the thread that is processing the event and data counts the state down. Since there is nothing above the keep_processing state, this ensures that no thread will ever have to do more than 3 atomic compare exchanges, barring some very poorly timed context switching.
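To make the description above concrete, here's my sketch of the scheme. Only the state names come from the text; everything else (the struct, the helper names) is made up for illustration:

```c
/* Sketch of the three-state scheme described above. Illustrative only. */
#include <stdatomic.h>
#include <stdbool.h>

enum proc_state { INACTIVE, PROCESSING, KEEP_PROCESSING };

struct item {
    _Atomic int state;    /* holds one of enum proc_state */
    /* ... the actual data lives here ... */
};

/* Called by any thread that sees activity for this item. Returns true if the
   caller just became responsible for processing it. */
bool claim_or_flag(struct item *it) {
    for (;;) {
        int s = atomic_load(&it->state);
        if (s == INACTIVE) {
            if (atomic_compare_exchange_weak(&it->state, &s, PROCESSING))
                return true;      /* we own the processing now */
        } else if (s == PROCESSING) {
            if (atomic_compare_exchange_weak(&it->state, &s, KEEP_PROCESSING))
                return false;     /* the owner will loop again; go wait for more events */
        } else {                  /* KEEP_PROCESSING */
            return false;         /* nothing to do, the owner already knows */
        }
        /* a compare exchange failed: re-read the state and try again */
    }
}

/* Called by the owning thread after a round of processing. Returns true if it
   should process the item again. */
bool finish_round(struct item *it) {
    for (;;) {
        int s = atomic_load(&it->state);
        if (s == KEEP_PROCESSING) {
            atomic_store(&it->state, PROCESSING);
            return true;          /* more work arrived while we were busy */
        }
        if (atomic_compare_exchange_weak(&it->state, &s, INACTIVE))
            return false;         /* done until the next event */
        /* a compare exchange failed: go through this again */
    }
}
```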
Backoff Alternative 2 - Try Again Later
Just have the thread hold ownership of the data a little bit longer, and go off and do other work before coming back and trying the atomic operation again. Think of async. This approach requires structuring your program in a specific way though, and would likely be difficult to retrofit into an existing program. However, I think this approach would have optimal performance.
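Here's one way I imagine that looking, purely as a sketch: a thread that fails to publish its update stashes the work on a small local retry list, keeps handling new work, and retries the stashed items on its next pass. Every name here is made up.

```c
/* Sketch of "try again later": defer contended updates instead of spinning.
   Entirely illustrative; assumes the pending array has room. */
#include <stdatomic.h>
#include <stdbool.h>

struct work { _Atomic long *target; long delta; };

/* One attempt only: success, or "come back later". */
bool try_publish(struct work *w) {
    long expected = atomic_load(w->target);
    return atomic_compare_exchange_strong(w->target, &expected, expected + w->delta);
}

void worker_pass(struct work *pending, int *npending,
                 struct work *incoming, int nincoming) {
    int still = 0;

    /* First, retry anything we deferred last time around. */
    for (int i = 0; i < *npending; i++)
        if (!try_publish(&pending[i]))
            pending[still++] = pending[i];

    /* Then handle new work, deferring anything that is contended right now. */
    for (int i = 0; i < nincoming; i++)
        if (!try_publish(&incoming[i]))
            pending[still++] = incoming[i];

    *npending = still;
}
```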
Conclusion
Threading is hard, and threading fast is even harder. I don't know all of the answers, and I'm certain someone smarter than me will come along and figure out a better way to minimize overhead in multithreaded programs.
However, the point of this rant isn't for me to pretend I know everything or to speculate that these suggestions are the best possible implementations. The point is that I see a problem and a solution, and I wanted to share that knowledge and hopefully inspire others to think more critically about threading fast. Feel free to take the example programs from the repository and play with them. Experiment. Learn. Maybe you'll figure out something great and write a rant about how I'm stupid.
Happy Coding,
AOP