24 October 2019
Update (October 27, 2019): I’ve received some suggestions to include the stm-containers package in the benchmarks, and the results are now updated.
This is Part 1 of a two-post series on the concurrent-hashtable package. Here we look at the performance aspects of the concurrent hash table. The next post goes into more details on how we can show correctness by identifying linearization points.
If you’re in a hurry: here’s a link to a benchmark. You can find the explanation here.
This package provides `Data.Map`-style operations for multi-threaded environments by implementing a thread-safe hash table. (This is an early release and contains only a very basic set of operations.) As the benchmarks show, this is in general much faster than using a single `MVar`/`TVar` container to hold your data.
- `lookup`: Lookups are non-blocking and are guaranteed to perform only two `readTVarIO`s.
- `insert`/`delete`: As long as two requests do not access the same table entry and there is no resizing happening, `insert`/`delete` will be (almost) contention-free. We use a fine-grained approach based on software transactional memory (STM) to protect the individual entries of the table, keeping the transactions as small as possible. To keep track of the load of the table, we modify an atomic counter with `atomicModifyIORefCAS`, as this turned out to be a lot faster than using STM. As a compromise, we accept that the load counter is not fully synchronized with the table: it could happen that, while one thread completes the insertion of a key-value pair, another thread already sees the inserted value but does not yet see the updated table load. This does not affect correctness, only the timing of resizing.
- `resize`: This uses all available cores to speed up resizing (unless you tell it not to). See Linearization Points for more details on the implementation.
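To make this concrete, here is a minimal usage sketch. The function names, argument order, and the initial-size parameter are my assumptions based on the `Data.Map`-style interface described above; check the package’s Haddock documentation for the exact API.

```haskell
import qualified Data.HashTable as HT  -- from concurrent-hashtable

main :: IO ()
main = do
  -- assumed constructor: create a table with initial size 10,
  -- the size used in the benchmarks below
  ht <- HT.newWithDefaults 10
  _  <- HT.insert ht (1 :: Int) "one"  -- (almost) contention-free
  mv <- HT.lookup ht 1                 -- non-blocking lookup
  print (mv :: Maybe String)
  _  <- HT.delete ht 1
  pure ()
```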
Apart from stm-containers, there don’t seem to be any thread-safe hash tables or similar thread-safe dictionary-style data structures available in Haskell, so the benchmark will necessarily be a bit unfair: for the most part, we are just going to take pure data structures and wrap them in thread-safe containers. There is also tskiplist, written by myself, but that package is really all about composability of STM actions and does not scale. If you just want to use a fast hash table without any concurrent-access guarantees, you might want to give hashtables a try; however, its implementation does not seem to be thread-safe.
If you know of some other data structures that should be added to this benchmark, please let me know!
To make these data structures accessible from multiple threads, we place them into `IORef`s and/or `TVar`s. In case that made you jump and shout “NO! You can’t use `IORef`s for multi-threading!”, let me ease your mind: the `IORef`s are being modified using atomic writes.
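As an illustration of this wrapping pattern, here is a sketch of a pure `HashMap` placed in an `IORef` (the names `Dict`, `lookupD`, `insertD`, and `deleteD` are mine, not the benchmark’s actual `Dictionary` instances):

```haskell
import Data.IORef
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM  -- unordered-containers

-- A pure HashMap in an IORef, made thread-safe via atomic writes.
type Dict k v = IORef (HM.HashMap k v)

lookupD :: (Eq k, Hashable k) => Dict k v -> k -> IO (Maybe v)
lookupD ref k = HM.lookup k <$> readIORef ref

insertD :: (Eq k, Hashable k) => Dict k v -> k -> v -> IO ()
insertD ref k v = atomicModifyIORef ref (\m -> (HM.insert k v m, ()))

deleteD :: (Eq k, Hashable k) => Dict k v -> k -> IO ()
deleteD ref k = atomicModifyIORef ref (\m -> (HM.delete k m, ()))
```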
Here is the full list of combinations of containers and data structures in the same order in which they appear in the benchmark chart:
Container | Data Structure | Lookup | Insert / Delete
---|---|---|---
`IORef` | `HashMap` | `readIORef` | `atomicModifyIORef`
`IORef` | `IntMap` | `readIORef` | `atomicModifyIORefCAS`
`IORef` | `HashMap` | `readIORef` | `atomicModifyIORefCAS`
`TVar` | `HashMap` | `readTVarIO` | `writeTVar`
`MVar` | `HashMap` | `readMVar` | `modifyMVar`
\- | `StmContainers.Map` | `lookup` | `insert`/`delete`
\- | `HashTable` | `lookup` | `insert`/`delete`
Some explanations about this table:
`HashMap` is in package unordered-containers and `IntMap` is in containers. The reason for choosing these two is that they seem to be the fastest pure data structures that provide a dictionary interface similar to a hash table.
`atomicModifyIORefCAS` is in package atomic-primops and can be used as a drop-in replacement for `atomicModifyIORef`. Both use hardware operations (i.e. compare-and-set) to atomically modify the `IORef`.
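Since both functions have the same shape, swapping one for the other is a one-line change. A small sketch, using a hypothetical load counter like the one described earlier (`incrementLoad` is my name, not the package’s):

```haskell
import Data.IORef (IORef)
import Data.Atomics (atomicModifyIORefCAS)  -- atomic-primops

-- Bump an atomic counter and return the new value; same shape
-- as atomicModifyIORef, but CAS-based.
incrementLoad :: IORef Int -> IO Int
incrementLoad ref = atomicModifyIORefCAS ref (\n -> (n + 1, n + 1))
```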
For lookups, we use `readTVarIO` (instead of `readTVar`) because `readTVarIO` is much faster, as it only reads the value of the `TVar` without performing a full transaction.
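For illustration, here are the two lookup variants side by side (hypothetical helper names, assuming a `HashMap` stored in a `TVar` as in the benchmark containers):

```haskell
import Control.Concurrent.STM (TVar, atomically, readTVar, readTVarIO)
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM

-- Slower: sets up a full STM transaction just to read one value.
lookupTx :: (Eq k, Hashable k) => TVar (HM.HashMap k v) -> k -> IO (Maybe v)
lookupTx tv k = HM.lookup k <$> atomically (readTVar tv)

-- Faster: reads the current value directly, no transaction.
lookupIO :: (Eq k, Hashable k) => TVar (HM.HashMap k v) -> k -> IO (Maybe v)
lookupIO tv k = HM.lookup k <$> readTVarIO tv
```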
`StmContainers.Map` is from the excellent package stm-containers, which provides an interface similar to `Data.Map`, but in the `STM` monad. One advantage of this package is that you can easily compose multiple requests into a single atomic action.
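For example, here is a sketch of such a composed atomic action (the `moveKey` helper is my own example, not part of the package):

```haskell
import Control.Concurrent.STM (atomically)
import Data.Hashable (Hashable)
import qualified StmContainers.Map as SM  -- stm-containers

-- Atomically move a key from one map to another: no other thread
-- can observe an intermediate state where the key is in neither map.
-- Note stm-containers' argument order: insert takes value before key.
moveKey :: (Eq k, Hashable k) => SM.Map k v -> SM.Map k v -> k -> IO ()
moveKey src dst k = atomically $ do
  mv <- SM.lookup k src
  case mv of
    Nothing -> pure ()
    Just v  -> do
      SM.delete k src
      SM.insert v k dst
```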
The last entry is `Data.HashTable` (this package), which is thread-safe by design and doesn’t need a container.
We look at different workload patterns, varying the ratio of lookup requests to insertions and deletions; these should capture a sufficiently broad range of possible usage scenarios.
For each of these patterns, we generate \(10^6\) requests at random according to the workload distribution. Finally, we vary the number of threads that process the generated requests. For \(k\) threads and \(n\) requests, we split the generated requests into roughly \((n/k)\)-size parts and use `forConcurrently` (from async) to spawn threads and process them simultaneously; a sketch follows below. The concurrent hash table starts with an initial size of 10 in all benchmarks.
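A sketch of how such a split might look (this uses `chunksOf` from the split package and `forConcurrently_`, the unit-returning variant; the benchmark code may do it differently):

```haskell
import Control.Concurrent.Async (forConcurrently_)  -- async
import Data.List.Split (chunksOf)                   -- split

-- Split the request list into k roughly (n/k)-sized parts and
-- process the parts concurrently, one thread per part.
runOn :: Int -> [req] -> (req -> IO ()) -> IO ()
runOn k reqs process =
  forConcurrently_ (chunksOf chunkSize reqs) (mapM_ process)
  where
    chunkSize = max 1 ((length reqs + k - 1) `div` k)
```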
We execute the criterion benchmark suite from the command line as follows. (I’m using environment variables instead of command-line arguments because the latter interfere with criterion’s arguments; I’m sure there is a way around it, but I haven’t had time to look into this yet.)

```sh
threads=32 range=4 expon=6 stack bench +RTS -N8 -RTS --benchmark-arguments='--output bench-32.html --csv bench-32.csv'
```
If you only have time to look at a single benchmark result, the one for 32 threads should give you a good overview:
Both the `MVar` and `TVar` containers turned out to be dreadfully slow compared to using `IORef`s. Interestingly, `MVar`s are much slower than `TVar`s, which is surprising, given that the way we handle requests is very similar for both (see the `Dictionary` instances in `Main.hs` in subdirectory `benchmark`): we read the `HashMap` that is stored in the container, perform the operation on the `HashMap`, and place the updated `HashMap` back into the container. This performance gap might be caused, among other things, by the fairness guarantees that `MVar` provides.
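For reference, this is the read-modify-write pattern just described, sketched for both containers (hypothetical helper names, not the benchmark’s exact `Dictionary` instances):

```haskell
import Control.Concurrent.MVar (MVar, modifyMVar_)
import Control.Concurrent.STM (TVar, atomically, readTVar, writeTVar)
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM

-- MVar version: take the map, modify it, put it back.
insertMVar :: (Eq k, Hashable k) => MVar (HM.HashMap k v) -> k -> v -> IO ()
insertMVar mv k v = modifyMVar_ mv (pure . HM.insert k v)

-- TVar version: read-modify-write inside a single transaction.
insertTVar :: (Eq k, Hashable k) => TVar (HM.HashMap k v) -> k -> v -> IO ()
insertTVar tv k v = atomically $ do
  m <- readTVar tv
  writeTVar tv (HM.insert k v m)
```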
We can also see that `IORef` combined with `atomicModifyIORef` is significantly slower than `IORef` with `atomicModifyIORefCAS` from the atomic-primops package. This means that, unless you really know what you’re doing, you probably want to steer clear of `atomicModifyIORef` as far as multi-threading is concerned.
The concurrent `HashTable` performs best in this scenario under all workload patterns. We can see that the gap to the `IORef`-atomic-primops and `TVar` containers becomes small as we increase the ratio of reads (i.e. lookups). This shouldn’t be too surprising, given that having fewer writes means having less contention between the threads.
`StmContainers.Map` comes in second best under most workloads. However, the difference to the concurrent `HashTable` is still a factor of 2 in most settings. It seems that its performance becomes worse as we increase the number of lookups: for instance, in the case where we have 90% lookup requests, it only takes the 4th spot, behind `atomicModifyIORefCAS` and, surprisingly, the simple `TVar` container.
In this scenario we don’t have any concurrent requests at all: a single thread will process all \(10^6\) requests. Here we get a very different picture: `IORef` combined with `atomicModifyIORef` beats all other options hands-down. The (somewhat obvious) conclusion is that you shouldn’t use concurrency mechanisms when you don’t actually need them!
Nevertheless, it’s encouraging that the concurrent hash table comes in second best.
As mentioned above, the performance of `StmContainers.Map` suffers under workloads with few writes (e.g. ≥60% lookups), where it takes the last spot.
This is still not a very concurrent setting, but we can already see some changes:
- The `MVar` container is quite slow, followed by the `TVar` container.
- The concurrent `HashTable` is the fastest in workloads that have at most 60% lookups. For workloads with more lookups, the `IORef` container with `atomicModifyIORef` takes the lead, ahead of the concurrent `HashTable` and the 3rd-fastest, which is `StmContainers.Map`. Similarly to the single-threaded execution, `StmContainers` loses ground once we increase the lookups to 70%.
- The containers using `atomicModifyIORefCAS` take the middle ground between `StmContainers.Map` and the `TVar`/`MVar` containers.

Moving to 4 threads, the concurrent `HashTable` wins in all workload patterns.
When moving from 4 threads to 8, the performance of `atomicModifyIORef` suffers for workloads with at most 40% lookups, making it drop from second best to second worst (still ahead of the `MVar` container).
The performance of the `TVar` container also deteriorates, but the drop is not as dramatic.
`StmContainers.Map` maintains its second place, except for workloads with ≥90% lookups, where it falls behind `atomicModifyIORefCAS`.
As we further increase the number of threads, we can see that the performance of `atomicModifyIORef` and the `TVar` container deteriorates even for workloads with 90% lookups, while the `MVar` container continues to take the last spot.
On the other hand, `atomicModifyIORefCAS` scales much better, coming in second place after the concurrent `HashTable` and followed by `StmContainers.Map`. Once the number of threads increases to 256, `StmContainers.Map` outperforms the `atomicModifyIORefCAS` containers.
For a larger key range, the gap between the concurrent `HashTable` and `StmContainers.Map` becomes very small. Moreover, the containers using `atomicModifyIORefCAS` perform better than both for workloads with 90% lookups. The reason for this may be that a larger key range implies fewer insertions and deletions of the same key, which in turn means that the hash table needs to do a lot of resizing.

In the next post we will look at how to reason about correctness when multiple threads access the table simultaneously.