A thread-safe hash table for multi-cores (1/2)

24 October 2019

Update (October 27 2019): I’ve received some suggestions to include the stm-containers package in the benchmarks and the results are now updated.

This is Part 1 of a two-post series on the concurrent-hashtable package. Here we look at the performance aspects of the concurrent hash table. The next post goes into more details on how we can show correctness by identifying linearization points.

Features and Performance

If you’re in a hurry: here’s a link to a benchmark. You can find the explanation here.
This package provides Data.Map-style operationsThis is an early release and contains only a very basic set of operations.

for multi-threaded environments by implementing a thread-safe hash table. As the benchmarks show, this is in general much faster than using a single MVar/TVar container to hold your data.
lookup: lookups are non-blocking and are guaranteed to only perform two readTVarIOs.
insert/delete: As long as two requests do not access the same table entry and there is no resizing happening, insert/delete will be (almost) contention-free. We use a fine-grained approach based on software transactional memory (STM) to protect the individual entries of the table, by keeping the transactions as small as possible. To keep track of the load of the table, we modify an atomic counter with atomicModifyIORefCAS, as this turned out to be a lot faster than using STM. As a compromise, we accept that the load counter is not fully synchronized with the table: it could happen that, while one thread completes the insertion of a key-value pair another thread might see the inserted value but does not yet see the updated table load. This does not affect correctness but only the timing of resizing.
resize: This uses all available cores to speed up resizing (unless you tell it not to). See Linearization Points for more details on the implementation.

Benchmark

Apart from stm-containers, there don’t seem to be any thread-safe hash tables or similar thread-safe dictionary-style data structures available in Haskell and so the benchmark will be necessarily a bit unfair: For the most part, we are just going to take pure data structures and wrap them in thread-safe containers. There is also tskiplist, written by myself, but that package is really all about composability of STM-actions and does not scale. If you just want to use a fast hash table without any concurrent-access guarantees, you might want to give hashtables a try; however, the implementation does not seem to be thread-safe.

If you know of some other data structures that should be added to this benchmark, please let me know!

To make these data structures accessible from multiple threads, we place them into IORefs and/or TVars. In case that made you jump and shout “NO! You can’t use IORefs for multi-threading!”, let me ease your mind: the IORefs are being modified using atomic-writes.

Here is the full list of combinations of containers and data structures in the same order in which they appear in the benchmark chart:

Container	Data Structure	Lookup	Insert / Delete
`IORef`	`HashMap`	`readIORef`	`atomicModifyIORef`
`IORef`	`IntMap`	`readIORef`	`atomicModifyIORefCAS`
`IORef`	`HashMap`	`readIORef`	`atomicModifyIORefCAS`
`TVar`	`HashMap`	`readTVarIO`	`writeTVar`
`MVar`	`HashMap`	`readMVar`	`modifyMVar`
-	`StmContainers.Map`	`lookup`	`insert`/`delete`
-	`HashTable`	`lookup`	`insert`/`delete`

Some explanations about this table:

HashMap is in package unordered-containers and IntMap is in containers. The reason for choosing these two is that they seem to be the fastest pure data structures that provide a dictionary interface similar to a hash table.
atomicModifyIORefCAS is in package atomic-primops and can be used as a drop-in replacement for atomicModifyIORef. Both use hardware operations (i.e. compare-and-set) to atomically modify the IORef.
For lookups, we use readTVarIO (instead of readTVar) because readTVarIO is much faster as it only reads the value of the TVar without performing a full transaction.
StmContainers.Map is from the excellent package stm-containers, which provides a similar interface as Data.Map but in the STM monad. One advantage of this package is that you can easily compose multiple requests into a single atomic action.
The last entry is Data.HashTable (this package), which is thread-safe by design and doesn’t need a container.

Workload Pattern and Benchmark Parameters

We look at the different workload patterns, which should capture a sufficiently broad range of possible usage scenarios:

0% lookups; 50% insertions; 50% deletions
20% lookups; 40% insertions; 40% deletions
40% lookups; 30% insertions; 30% deletions
60% lookups; 20% insertions; 20% deletions
70% lookups; 15% insertions; 15% deletions
80% lookups; 10% insertions; 10% deletions
90% lookups; 5% insertions; 5% deletions

For each of these patterns, we generate \(10^6\) requests at random according to the workload distribution. Finally, we vary the number of threads that process the generated requests. For \(k\) threads and \(n\) requests, we split the generated requests into roughly \((n/k)\)-size parts and use forConcurrently (from async) to spawn threads and process them simultaneously. The concurrent hash table starts with an initial size of 10 in all benchmarks.

We execute the criterion benchmark suite from the command line as follows:I’m using environment variables instead of command line arguments because the latter interfere with criterion’s arguments; I’m sure there is a way around it but I haven’t had time to look into this yet.

# threads=32 range=4 expon=6 stack bench +RTS -N8 -RTS --benchmark-arguments='--output bench-32.html --csv bench-32.csv'

If you only have time to look at a single benchmark result, the one for 32 threads should give you a good overview:

Threads = 32; key range size = \(10^4\): Results (32)

Both, the MVar and TVar containers, turned out to be dreadfully slow compared to using IORefs. Interestingly, MVars are much slower than TVars, which is surprising, given that the way we handle requests is very similar for both (see the Dictionary instances in Main.hs in subdirectory benchmark): We read the HashMap that is stored in the container, perform the operation on the HashMap, and place the updated HashMap back into the container. This performance gap might be caused by
1. the additional fairness guarantee that an MVar provides, and
2. the “optimistic concurrency” approach of the STM runtime.
We can also see that IORef combined with atomicModifyIORef is significantly slower than IORef with atomicModifyIORefCAS from the atomic-primops package. This means that, unless you really know what you’re doing, you probably want to stay clear from atomicModifyIORef as far as multi-threading is concerned.
The concurrent HashTable performs best in this scenario under all workload patterns. We can see that the gap to the IORef-atomic-primops and TVar containers becomes small as we increase the ratio of reads (i.e. lookups). This shouldn’t be too surprising, given that having fewer writes means having less contention between the threads.
StmContainers.Map comes in second best under most workloads. However, the difference to the concurrent HashTable is still a factor of 2 in most settings. It seems that its performance becomes worse as we increase the number of lookups. For instance, for the case where we have 90%-lookup requests, it only takes the 4th spot behind atomicModifyIORefCAS and, surprisingly, the simple TVar-container.

Complete benchmark results for different numbers of threads

Threads = 1; key range size = \(10^4\): Results (1)

In this scenario we don’t have any concurrent requests at all: a single thread will process all \(10^6\) requests. Here we get a very different picture:

IORef combined with atomicModifyIORef beats all other options hands-down. The (somewhat obvious) conclusion is that you shouldn’t use concurrency mechanisms when you don’t actually need them!
Nevertheless, it’s encouraging that the concurrent hash table comes in second best.
As mentioned above, the performance of StmContainers.Map suffers under workloads with few writes (e.g. ≥60% lookups), as it takes the last spot.

Threads = 2; key range size = \(10^4\): Results (2)

This is still not a very concurrent setting, but we can already see some changes:

The MVar container is quite slow, followed by the TVar. The concurrent HashTable is the fastest in workloads that have at most 60% lookups.
Once the fraction of lookups increases to 70%, the IORef container with atomicModifyIORef takes the lead.
There is a gap of a factor of 2 between the concurrent HashTable and the 3rd-fastest, which is StmContainers.Map. Similarly, as for the single-threaded execution, StmContainers loses ground once we increase the lookups to 70%.
The containers using atomicModifyIORefCAS take the middle ground between StmContainers.Map and TVar/MVar containers.

Threads = 4,8,16; key range size = \(10^4\): Results (4) Results (8) Results (16)

The concurrent HashTable wins in all workload patterns.
When moving from 4 threads to 8, the performance of atomicModifyIORef suffers for workloads with at most 40% lookups, making it drop from second best to second worst (still ahead of the MVar container).
The performance of the TVar container also deteriorates but the drop is not as dramatic.
StmContainers.Map maintains its second place, except for workloads with ≥90% lookups, where it falls behind atomicModifyIORefCAS.

Threads = 32, 64, 128, 256; key range size = \(10^4\): Results (32) Results (64) Results (128) Results (256)

As we further increase the number of threads, we can see that the performance of atomicModifyIORef and the TVar container deteriorates even for workloads with 90% lookups, while the MVar container continues to take the last spot.
On the other hand, atomicModifyIORefCAS scales much better, coming in second place after concurrent HashTable, and followed by StmContainers.Map. Once the number of threads increases to 256, StmContainers.Map outperforms the atomicModifyIORefCAS containers.

Threads = 32; key range size = \(10^5\): Results (32 threads,range \(10^5\))

If we increase the key range size to \(10^5\), the difference between the concurrent hash table and StmContainers.Map becomes very small. Moreover, the containers using atomicModifyIORefCAS perform better than both, for workloads with 90% lookups. The reasons for this may be that a larger key range implies fewer insertions and deletions of the same key, which in turn means that the hash table needs to do a lot of resizing.

To be continued…

In the next post we will look at how to reason about correctness when multiple threads access the table simultaneously.