A thread-safe hash table for multi-cores (1/2)

24 October 2019

Update (October 27, 2019): I’ve received some suggestions to include the stm-containers package in the benchmarks; the results below have been updated accordingly.

This is Part 1 of a two-post series on the concurrent-hashtable package. Here we look at the performance aspects of the concurrent hash table; the next post goes into more detail on how we can show correctness by identifying linearization points.

Features and Performance

Benchmark

Apart from stm-containers, there don’t seem to be any thread-safe hash tables or similar thread-safe dictionary-style data structures available in Haskell, so the benchmark is necessarily a bit unfair: for the most part, we simply take pure data structures and wrap them in thread-safe containers. There is also tskiplist, written by me, but that package is all about composability of STM actions and does not scale. If you just want a fast hash table without any concurrent-access guarantees, you might want to give hashtables a try; its implementation, however, does not appear to be thread-safe.

If you know of some other data structures that should be added to this benchmark, please let me know!

To make these data structures accessible from multiple threads, we place them into IORefs and/or TVars. In case that made you jump and shout “NO! You can’t use IORefs for multi-threading!”, let me ease your mind: the IORefs are only ever modified using atomic writes.
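To make this concrete, here is a minimal sketch of the wrapping approach for the IORef/HashMap combination. The ConcMap type and the helper names are my own, and I use the strict atomicModifyIORef' here to avoid building up thunks inside the reference:

```haskell
import Data.Hashable (Hashable)
import Data.IORef
import qualified Data.HashMap.Strict as HM

-- A pure HashMap behind an IORef; all writers go through
-- atomicModifyIORef', so no updates are lost under contention.
newtype ConcMap k v = ConcMap (IORef (HM.HashMap k v))

newConcMap :: IO (ConcMap k v)
newConcMap = ConcMap <$> newIORef HM.empty

-- Readers never block: they just read the current snapshot.
lookupC :: (Eq k, Hashable k) => ConcMap k v -> k -> IO (Maybe v)
lookupC (ConcMap ref) k = HM.lookup k <$> readIORef ref

insertC :: (Eq k, Hashable k) => ConcMap k v -> k -> v -> IO ()
insertC (ConcMap ref) k v =
  atomicModifyIORef' ref (\m -> (HM.insert k v m, ()))

deleteC :: (Eq k, Hashable k) => ConcMap k v -> k -> IO ()
deleteC (ConcMap ref) k =
  atomicModifyIORef' ref (\m -> (HM.delete k m, ()))
```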

Here is the full list of combinations of containers and data structures in the same order in which they appear in the benchmark chart:

Container   Data Structure      Lookup       Insert / Delete
IORef       HashMap             readIORef    atomicModifyIORef
IORef       IntMap              readIORef    atomicModifyIORefCAS
IORef       HashMap             readIORef    atomicModifyIORefCAS
TVar        HashMap             readTVarIO   writeTVar
MVar        HashMap             readMVar     modifyMVar
-           StmContainers.Map   lookup       insert / delete
-           HashTable           lookup       insert / delete

A few notes on this table: the first five rows wrap a pure data structure (HashMap from unordered-containers, IntMap from containers) in a mutable container and list the operations used for lookups and for insertions/deletions. atomicModifyIORefCAS is the compare-and-swap based variant of atomicModifyIORef from the atomic-primops package. The last two rows need no wrapper container, since StmContainers.Map (from stm-containers) and HashTable (from concurrent-hashtable) come with their own thread-safe operations.
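For the TVar row, lookups can bypass the STM machinery entirely via readTVarIO, while writes run in a small read-modify-write transaction. A sketch, with function names of my own choosing:

```haskell
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM

-- Lookups read the current value without starting a transaction.
lookupT :: (Eq k, Hashable k) => TVar (HM.HashMap k v) -> k -> IO (Maybe v)
lookupT tv k = HM.lookup k <$> readTVarIO tv

-- Writes read and rewrite the map inside a single transaction.
insertT :: (Eq k, Hashable k) => TVar (HM.HashMap k v) -> k -> v -> IO ()
insertT tv k v = atomically $ do
  m <- readTVar tv
  writeTVar tv $! HM.insert k v m
```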

Workload Pattern and Benchmark Parameters

We look at several different workload patterns, chosen to capture a sufficiently broad range of possible usage scenarios.

For each of these patterns, we generate \(10^6\) requests at random according to the workload distribution. Finally, we vary the number of threads that process the generated requests. For \(k\) threads and \(n\) requests, we split the generated requests into roughly \(n/k\)-sized parts and use forConcurrently (from the async package) to spawn threads that process them simultaneously. The concurrent hash table starts with an initial size of 10 in all benchmarks.
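In sketch form, the driver looks roughly like the following, with a hypothetical Request type and process function standing in for the actual benchmark code. (I use the result-discarding forConcurrently_ variant here.)

```haskell
import Control.Concurrent.Async (forConcurrently_)
import Data.List.Split (chunksOf)  -- from the split package

-- Hypothetical request type; the real benchmark draws these at random
-- according to the workload distribution.
data Request = Lookup Int | Insert Int Int | Delete Int

-- Split the n requests into roughly (n/k)-sized chunks and let k
-- threads process them concurrently.
runWith :: Int -> (Request -> IO ()) -> [Request] -> IO ()
runWith k process reqs =
  forConcurrently_ (chunksOf chunkSize reqs) (mapM_ process)
  where
    chunkSize = max 1 ((length reqs + k - 1) `div` k)
```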

We execute the criterion benchmark suite from the command line as follows. (I’m using environment variables instead of command-line arguments because the latter interfere with criterion’s own arguments; I’m sure there is a way around this, but I haven’t had time to look into it yet.)

$ threads=32 range=4 expon=6 stack bench +RTS -N8 -RTS --benchmark-arguments='--output bench-32.html --csv bench-32.csv'
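For reference, here is a hypothetical sketch of how such parameters can be picked up inside the benchmark executable. The variable names match the command above, but the helper and the defaults are made up:

```haskell
import System.Environment (lookupEnv)

-- Hypothetical helper: read an Int parameter from the environment,
-- falling back to a default when the variable is unset.
envParam :: String -> Int -> IO Int
envParam name def = maybe def read <$> lookupEnv name

main :: IO ()
main = do
  threads <- envParam "threads" 1  -- number of worker threads
  range   <- envParam "range"   4  -- keys drawn from [0, 10^range)
  expon   <- envParam "expon"   6  -- 10^expon requests in total
  print (threads, range, expon)    -- stand-in for the actual benchmark
```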

If you only have time to look at a single benchmark result, the one for 32 threads should give you a good overview:

Threads = 32; key range size = \(10^4\): Results (32)

Complete benchmark results for different numbers of threads

Threads = 1; key range size = \(10^4\): Results (1)

In this scenario we don’t have any concurrent requests at all: a single thread processes all \(10^6\) requests, and we get a very different picture.

Threads = 2; key range size = \(10^4\): Results (2)

This is still not a very concurrent setting, but we can already see some changes.

Threads = 4,8,16; key range size = \(10^4\): Results (4) Results (8) Results (16)

Threads = 32, 64, 128, 256; key range size = \(10^4\): Results (32) Results (64) Results (128) Results (256)

Threads = 32; key range size = \(10^5\): Results (32 threads,range \(10^5\))

To be continued…

In the next post we will look at how to reason about correctness when multiple threads access the table simultaneously.