25 October 2019
This is Part 2 of a two-post series on the concurrent-hashtable package. In the previous post, we looked at the performance and discussed some benchmark results. Here we will take a look at the details of our implementation to see whether the underlying ideas are sound, in particular with respect to linearizability, deadlock-freedom, and starvation. To claim that each operation is thread-safe, we need to assign linearization points, which we can think of as the points in time when the method takes effect.
Both insert and delete call genericModify to do the actual data modification. For a key k, genericModify performs 2 reads: it reads the table vector, and then reads the Chain at the index to which k is hashed. A Chain is essentially a TVar [(k,v)] that contains all key-value pairs that map to that index.
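To make the following discussion concrete, here is a minimal sketch of this layout in Haskell. The type and field names are ours, not necessarily those of the package, and the MigrationStatus field is explained next:

```haskell
import Control.Concurrent.STM (TVar)
import qualified Data.Vector as V

-- Per-chain migration flag; its role during resizing is explained below.
data MigrationStatus = NotStarted | Ongoing | Finished

-- One chain per table index: the key-value pairs hashing to that index,
-- plus the chain's own migration flag to keep contention low.
data Chain k v = Chain
  { migrationStatus :: TVar MigrationStatus
  , itemList        :: TVar [(k, v)]
  }

-- The table itself sits behind a TVar so that resizing can atomically
-- swap in a larger vector of chains.
newtype HashTable k v = HashTable
  { table :: TVar (V.Vector (Chain k v)) }
```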
Since reading the table vector and then reading the chain is non-atomic, it could happen that some other thread has started resizing the table in between these two reads. Of course, we could have made our lives much simpler by putting both reads into the same transaction, but this would cause all other concurrently modifying threads to conflict with this thread when trying to commit their STM transactions, defeating the whole point of using a fine-grained synchronization approach.
Let’s go through the details of how interleaved updates are handled. To keep the contention between threads low, each chain has its own MigrationStatus value, which is NotStarted initially. If resizing has already started for this table entry, then MigrationStatus must have the value Ongoing (ensured by the resizing thread), and the STM transaction in genericModify will block (by calling retry) until the migration status changes to Finished. Then, we simply call genericModify again to find the right index in the now-updated table where, initially, all chains have their MigrationStatus set to NotStarted. This continues until the thread finds a point in time at which the chain isn’t affected by resizing.
Now suppose that the thread observes a migration status of NotStarted, i.e., either the chain hasn’t been processed yet by the resizing thread or no resizing is happening at this point. In this case, we go ahead and perform the modification of the chain for this index. It’s crucial for correctness that reading the MigrationStatus and performing the update to the chain happen atomically, and hence we make them part of the same STM transaction.
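Putting both paths together, the logic of genericModify might look roughly like the following sketch. It reuses the hypothetical types from above and simplifies the real code, e.g., it takes a pre-computed hash and a pure modification function:

```haskell
import Control.Concurrent.STM
import qualified Data.Vector as V

-- Sketch only: 'f' performs the actual modification of the chain's items.
genericModify :: HashTable k v -> Int -> ([(k, v)] -> [(k, v)]) -> IO ()
genericModify ht h f = do
  -- First (non-atomic) read: the current table vector.
  chains <- readTVarIO (table ht)
  let chain = chains V.! (h `mod` V.length chains)
  status <- atomically $ do
    -- Reading the status and updating the chain happen in ONE transaction.
    st <- readTVar (migrationStatus chain)
    case st of
      NotStarted -> do
        -- Safe to modify; the commit of this transaction is the
        -- linearization point of insert/delete.
        modifyTVar' (itemList chain) f
        pure st
      Ongoing ->
        -- This chain is being migrated: block until the status changes.
        retry
      Finished -> pure st
  case status of
    -- The chain was migrated to a new table; start over with a fresh read
    -- of the table vector.
    Finished -> genericModify ht h f
    _        -> pure ()
```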
We set the linearization point for insert/delete exactly at the point where the STM transaction in genericModify commits, assuming that the MigrationStatus value that was read in the transaction was NotStarted.
It follows from the code of genericModify that NotStarted is the only possible value of the MigrationStatus for which genericModify returns to insert/delete; in all other cases, genericModify will just keep trying again. Therefore, linearizability follows from the atomicity of the STM transaction in genericModify. As mentioned above, the load of the hash table is not updated atomically in insert/delete, but this only affects the timing of resizing and not the overall correctness.
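To illustrate this last point, here is a hypothetical sketch of insert. The totalLoad counter is a field we made up for this example (the package may track its load differently):

```haskell
import Data.Hashable (Hashable, hash)
import Data.IORef (IORef, atomicModifyIORef')

-- Hypothetical: assume HashTable also carried a field
--   totalLoad :: IORef Int
-- alongside the table vector.
insert :: Hashable k => HashTable k v -> k -> v -> IO ()
insert ht k v = do
  -- A real insert would also remove any existing binding for k.
  genericModify ht (hash k) ((k, v) :)
  -- The chain modification above commits atomically (the linearization
  -- point), but the load counter is bumped in a separate step, so the
  -- load is not updated atomically with the insert itself.
  atomicModifyIORef' (totalLoad ht) (\n -> (n + 1, ()))
```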
We first describe how resizing works internally. Initially, the MigrationStatus is set to NotStarted for each chain and, once resizing is in progress, the resizing thread sets this field to Ongoing for each chain, starting at index 0. Since the collection of these writes is not atomic (we want to avoid “stop the world” synchronization as much as possible), threads can still insert/delete data into/from the chains at higher indices while the chains at lower indices are already being migrated to the new table.
For resize, we set the linearization point to be at the time when the STM transaction that replaces the old table with the new one commits. No other thread has access to the new table until it is written, and hence linearizability follows from the STM’s atomicity guarantee.
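Here is a sketch of the whole procedure, again using our hypothetical types. The real resize has to handle further details, such as ensuring that only one thread resizes at a time, which we gloss over here:

```haskell
import Control.Concurrent.STM
import Data.Hashable (Hashable, hash)
import qualified Data.Vector as V

resize :: Hashable k => HashTable k v -> IO ()
resize ht = do
  oldChains <- readTVarIO (table ht)
  newChains <- atomically $
    V.replicateM (2 * V.length oldChains)
                 (Chain <$> newTVar NotStarted <*> newTVar [])
  -- Migrate chain by chain, starting at index 0.  Marking a chain as
  -- Ongoing and draining it happen in one small transaction per chain,
  -- so there is no "stop the world" synchronization.
  V.forM_ oldChains $ \chain -> do
    items <- atomically $ do
      writeTVar (migrationStatus chain) Ongoing
      readTVar (itemList chain)
    -- The new table is still private to this thread.
    mapM_ (\(k, v) ->
             let c = newChains V.! (hash k `mod` V.length newChains)
             in  atomically $ modifyTVar' (itemList c) ((k, v) :))
          items
  -- Linearization point: the commit that publishes the new table.
  atomically $ writeTVar (table ht) newChains
  -- Finally, mark the old chains Finished, waking up any thread blocked
  -- in genericModify.
  V.forM_ oldChains $ \chain ->
    atomically $ writeTVar (migrationStatus chain) Finished
```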
For lookups, we perform 2 reads using readTVarIO: first we read the table vector, and then we read the chain at index \(i\) to retrieve the item. There are some complications because something else could have happened between the first and the second read. Suppose lookup reads the table vector at time \(t_1\) and reads the chain at index \(i\) at time \(t_2\):
1. Another thread could have completed resizing the table, i.e., its linearization point for the resizing lies in the open interval \((t_1,t_2)\) (note that this interval excludes \(t_1\) and \(t_2\)), and so the table index that we read at time \(t_1\) was out of date at time \(t_2\). (We need to be careful because it could be that the item we’re looking for was already deleted from the new table by time \(t_2\), and we might not know about it because we read the chain in the old table!)
2. No resizing was completed during the interval \((t_1,t_2)\). However, another thread inserted/deleted an item with the same key during that time.
We will look at these 2 cases separately.
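Before analyzing the two cases, here is a simplified sketch of the lookup path, using the hypothetical types from above. Performing two plain readTVarIO reads keeps lookups cheap, but it is exactly this non-atomicity that gives rise to the two cases:

```haskell
import Control.Concurrent.STM (readTVarIO)
import Data.Hashable (Hashable, hash)
import qualified Data.Vector as V

lookup :: (Eq k, Hashable k) => HashTable k v -> k -> IO (Maybe v)
lookup ht k = do
  -- First read, at time t1: the table vector.
  chains <- readTVarIO (table ht)
  let chain = chains V.! (hash k `mod` V.length chains)
  -- Second read, at time t2: the chain at index i.  Nothing ties the two
  -- reads together into one atomic transaction.
  items <- readTVarIO (itemList chain)
  pure (Prelude.lookup k items)
```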
For the first case, let \(t_4\) be the linearization point of the first resize that was completed during \((t_1,t_2)\), and let \(t_3\) be the linearization point of the latest insert/delete for the chain at index \(i\) during \((t_1,t_4)\); if no such insert/delete occurred, define \(t_3=t_1\). We choose the linearization point for lookup to be at some time in \((t_3,t_4)\).
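In other words, the relevant events in this case are ordered as follows, where \(\mathrm{LP}\) denotes a linearization point:
\[ t_1 \le t_3 < \mathrm{LP}(\mathit{lookup}) < t_4 < t_2. \]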
If there wasn’t any insert/delete, then the lookup just returns the value for the key k that was in the chain at index \(i\) in the old table (before resizing), and its linearization point is set to happen before the one for resizing. Therefore, whatever happens after the resizing doesn’t affect the return value of lookup.
On the other hand, if the latest insert/delete had its linearization point at time \(t_3\), then its STM transaction must have committed before time \(t_2\), which is when the lookup reads the chain. Hence the readTVarIO issued by lookup will already see the value updated by this modification, which is perfectly fine because the linearization points are exactly in this order.
For the second case, we set the linearization point to be at the time of the second read, i.e., time \(t_2\).
Suppose that a lookup for key k returns Just a and we set the linearization point at time \(t_2\) as defined above. Let’s assume that an item was inserted into the same chain, with a linearization point at time \(t_7\); moreover, we assume that no other modifications of key k took place in the period \([t_7,t_2]\). Since \(t_7 < t_2\), it follows that the read of the chain at index \(i\) by the lookup happened after the STM transaction for the insert committed, and hence the lookup will return the correct value. A similar argument works for delete.
It’s relatively straightforward to show freedom from deadlock. From the code, we can see that the only point where there is any potential for deadlock is the call to retry in genericModify, which is reached if and only if the thread observes a MigrationStatus value of Ongoing. How did the status of the chain get this value? That must have happened in resize. Looking at the code of resize, we can see that the status of each chain is eventually updated to Finished, which will wake up any thread that is blocked because it read Ongoing. Since resize is non-blocking, it follows that at least some thread will be making progress, and hence we have deadlock-freedom.
Unfortunately, we cannot rule out that some thread starves. This is a property that our hash table inherits from the STM runtime, which itself is not starvation-free. Apart from that, it is also possible for a thread to starve by getting stuck in genericModify. For instance, if thread A is very slow and other threads are constantly inserting new items, it could happen that, each time thread A reads the chain’s status, it has the value Ongoing because of a simultaneous call to resize by another thread. That said, both of the above scenarios are quite unlikely to be an issue for practical workloads.