Byte-addressable non-volatile main memory (NVM) combines the performance benefits of conventional (volatile) main memory with the durability of secondary storage. Systems where non-volatile memory co-exists with volatile main memory already exist and are expected to become more prevalent in the future. This increases the interest in the crash-recovery model, where a failed process may be resurrected by the system following a crash. Traditional log-based recovery techniques can be applied correctly in such systems but fail to take full advantage of the parallelism and efficiency that may be gained by allowing processing cores to concurrently access recovery data directly from NVM, rather than by performing slow block transfers from secondary storage. Consequently, there is increasing interest in the design of recoverable concurrent objects that are robust to crash-failures, since their operations are able to recover from such failures by using state retained in NVM (see e.g. [AttiyaBH18, FriedmanHMP18, GolabH17recoverable, GolabH18recoverable, GolabR16recoverable, JayantiJJ18, JayantiJ17]).
Of particular interest are recoverable algorithms that, in addition to ensuring object consistency, also provide detectability [FriedmanHMP18]. Detectability requires that the code recovering from a failed operation can infer whether or not it was linearized and, in the former case, obtain its response. Several recent works presented detectable algorithms [AttiyaBH18, Ben-DavidBFW19, FriedmanHMP18]. In particular, both Ben-David et al. [Ben-DavidBFW19] and Attiya et al. [AttiyaBH18] presented detectable CAS algorithms and [AttiyaBH18] also presented a detectable read/write object. All these algorithms augment the arguments of recoverable operations with unique identifiers, which allow the recovery code to detect whether or not the failed operation was linearized, and consequently incur unbounded space complexity. (This is also the case with the durable queue algorithm of [FriedmanHMP18].) In addition, [AttiyaBH18] proved that every lock-free detectable test-and-set implementation from (non-recoverable) test-and-set objects must use unbounded space. This raises the question of whether unbounded space complexity is inherent to nonblocking detectable implementations of these objects. We provide a negative answer to this question by presenting the first nonblocking bounded-space detectable CAS and read/write algorithms. Both algorithms are wait-free. Our -process bounded-space CAS algorithm uses bits in addition to those storing the CAS object’s value. In our second contribution, we show that every obstruction-free detectable CAS implementation, assuming values from a domain of size at least , must have different reachable shared-memory configurations, thus establishing that our CAS algorithm’s space complexity is asymptotically optimal.
Detectable algorithms often require auxiliary state that helps them infer where in the execution the failure occurred. Informally, auxiliary state is information that is provided to the recoverable operation that is not provided to (nor required by) the “original” (non-recoverable) operation. In some works, it is assumed that this information is provided by the system. For example, the recoverable mutual exclusion algorithms presented by Golab and Hendler [GolabH18recoverable] assume a model in which the system provides to each operation an epoch number whose value increases after each (system-wide) failure. Some detectable algorithms presented by Attiya et al. [AttiyaBH18] assume that the system provides to the recovery code information identifying the instruction that the failed operation was about to execute via a non-volatile variable. However, auxiliary state is not necessarily provided by the system. For example, the read/write algorithm of [AttiyaBH18], the CAS algorithm of Ben-David et al. [Ben-DavidBFW19] and the queue algorithm of Friedman et al. [FriedmanHMP18] rely on auxiliary state (e.g. unique identifiers) passed to recoverable operations via their arguments by the operations that invoke them.
We show that, for a large class of objects that includes read/write, CAS and FIFO queue objects, any obstruction-free detectable implementation must receive auxiliary state. As we prove, this auxiliary state must be made available to recoverable operations either via their arguments or via a non-volatile variable accessible by them whose value must be modified outside the operation. In contrast, this external support is, in general, not required if the recoverable algorithm is not detectable.
The rest of the paper is organized as follows. We describe the system model in Section 2. We then present our bounded-space detectable read/write and CAS algorithms in Sections 3 and LABEL:section:_CAS, respectively. In Section LABEL:section:_CAS we also prove a lower bound on the space complexity of detectable CAS. This is followed by a proof that detectable implementations of a large class of objects require auxiliary state in Section LABEL:section:_auxiliary-state. The paper is concluded by a short discussion in Section LABEL:section:_discussion.
2 System Model
A set of asynchronous crash-prone processes communicate through shared objects. The system provides base objects (also called shared variables or registers) that support atomic read, write, and read-modify-write primitive operations. Base objects are used to implement higher-level concurrent objects by defining algorithms, for each process, which use primitive operations to carry out the operations of the implemented object.
The state of the system consists of non-volatile shared-memory variables and per-process local variables stored in its local volatile cache. Local variables are accessed only by the process to which they belong. For presentation simplicity, we assume that each process may own non-volatile private variables that reside in the NVM but are accessed only by . We also assume the abstract private cache model [Ben-DavidBFW19, IzraelevitzMS16], in which all shared variables are always persistent and there is no shared cache. In this model, primitive operations to shared variables are applied directly to the NVM. At any point during the execution of an operation, a system-wide crash-failure (or simply a crash) may occur, which resets the local variables of all processes to their initial values, but preserves the values of all non-volatile variables.
As we explain in Section LABEL:section:_discussion, all our results hold also in the more realistic shared-cache model. In this model, in addition to per-process private caches, there is a single (volatile) shared cache. Primitive operations to shared variables are applied to this cache and explicit persistency instructions may be required for guaranteeing that values written to this cache get persisted to the NVM in the correct order [IzraelevitzMS16].
To start executing an operation , a process invokes . We say that Op completes once control returns to the caller of . Before completing, returns a response value, which is stored in a local variable of . The response value is lost if crashes before persisting it (i.e., writing it to a non-volatile variable). We say that a process is idle if it is not in the midst of executing any operation. Each recoverable operation of a shared object is associated with a recovery function, denoted , which is responsible for inferring whether or not was linearized and for obtaining its response in the former case. is performed by in order to recover from a failure that occurred while was executing . We assume that is called with the same arguments as those with which was invoked when the crash occurred. If infers that was not yet linearized, it returns a special fail value; otherwise, it returns ’s response.
Our lower bounds (Theorems LABEL:theorem:cas-lower-bound and LABEL:theorem:auxiliary-data-required) require only the model assumptions specified above. However, as we prove in Theorem LABEL:theorem:auxiliary-data-required, detectable algorithms must receive auxiliary state whose value is modified either by the operation’s caller or by the system. We therefore make the following additional assumptions that are used by the algorithms we present. Each process is associated with a private non-volatile structure consisting of three fields. stores the type of recoverable operation currently performed by , as well as the arguments with which it was called. It is accessed only by the caller of the recoverable operation , which sets its value (thus announcing the operation it is about to perform) immediately before invoking . The function (if any) that should be invoked by in order to recover from a failure is determined by the value of . Field stores the response of the recoverable operation and is initialized to immediately before is invoked. The third field, , may be used by recoverable operations and recovery functions for managing checkpoints in their execution flow. Field is set to 0 by the caller of the recoverable operation immediately before invoking it. can be read and written by recoverable operations and their recovery functions and is used by in order to record (in the NVM) the fact that the execution reached a certain point. The recovery function can then use this information in order to recover correctly and to avoid re-execution of critical instructions.
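The caller-side protocol described above can be illustrated with a minimal Python sketch. It is not part of the algorithms' pseudo-code: the names `Announcement`, `op`, `result`, `checkpoint`, `invoke` and `recover` are hypothetical, the sentinel `BOTTOM` stands in for the initial "no response" value, and an ordinary Python object stands in for a private non-volatile structure.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple

BOTTOM = object()  # hypothetical stand-in for the initial "no response yet" value


@dataclass
class Announcement:
    """Hypothetical model of the per-process structure with its three fields."""
    op: Optional[Tuple[str, tuple]] = None  # operation type and its arguments
    result: Any = BOTTOM                    # response of the recoverable operation
    checkpoint: int = 0                     # progress marker used during recovery


def invoke(ann, name, args, operation):
    # Caller-side steps: announce the operation, reset the result and the
    # checkpoint, and only then invoke the recoverable operation.
    ann.op = (name, args)
    ann.result = BOTTOM
    ann.checkpoint = 0
    return operation(ann, *args)


def recover(ann, recovery_functions):
    # After a crash, the announced operation type selects the recovery
    # function, which is called with the same arguments as the operation.
    if ann.op is None:
        return None  # nothing was announced, so there is nothing to recover
    name, args = ann.op
    return recovery_functions[name](ann, *args)
```

In this sketch, `recovery_functions` is assumed to be a dictionary mapping operation names to their recovery functions; each of them would return either the persisted response or a fail value.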
Failed processes recover in an asynchronous manner, independently of each other. Specifically, the recovery of some processes may have already completed while other processes may not yet have completed (or even started) their recovery. may be invoked multiple times before it completes, because the system may undergo multiple crashes in the course of executing it. If all the operations of an implementation are recoverable, then the implementation is called recoverable.
Linearizability [HerlihyW90] requires that each operation applied to a concurrent object takes effect instantaneously at some point between its invocation and response. The correctness condition ensured by our algorithms is durable linearizability (DL) [IzraelevitzMS16]. DL requires that linearizability be maintained in spite of crash-failures. In other words, once the system recovers after a crash-failure, the state of the data structure reflects a history containing all operations that completed before the crash and may also contain some operations that have not completed before the crash. This captures the idea that an operation can be linearized only once its effect gets persisted to NVM.
The progress conditions we consider are wait-freedom [Herlihy91] and obstruction-freedom [HerlihyLM03]. A recoverable operation or a recovery function is wait-free (resp. obstruction-free) if, starting from any reachable configuration, completes it in a finite number of its own steps (resp. when running solo), when the system experiences no crashes. We emphasize that all our results hold also in a model where processes may fail independently, such as that assumed by [AttiyaBH18].
3 Detectable Read/Write Object
Algorithm LABEL:bounded-space-recoverable-register presents the pseudo-code of a detectable read/write object that uses bounded space from (bounded-space) variables that support read/write primitive operations. To the best of our knowledge, this is the first detectable read/write algorithm that uses bounded space. The checkpoint field is used by process in order to allow the recovery function to infer where in the recoverable operation the failure occurred. Each process owns two private variables: , storing data used during recovery, and , storing an index to one of two size- toggle-bit arrays, and , that are used by ’s write operations in an alternating manner. ’s state consists of a single shared read/write register storing a triplet of values , where is ’s current value, is the identifier of the process that (last) wrote , and is the index of the toggle-bit array used by for that write operation. Initially, , where is ’s initial value, thus “attributing” this value to a write by process that used toggle-bit array . Register stores bits in addition to the application value , in contrast with the unbounded state required by the read/write object implementation of Attiya et al. [AttiyaBH18]. A 3-dimensional array allows each writing process to coordinate with any other process using ’s two toggle-bit arrays.
The key challenge that Algorithm LABEL:bounded-space-recoverable-register addresses is the ABA problem. Attiya et al. [AttiyaBH18] avoid it by ensuring that all written values are distinct, at the cost of using a register of unbounded size. Algorithm LABEL:bounded-space-recoverable-register allows the same value to be written multiple times, so a process may read from a value (written by process ) and then write some value that is later overwritten by another write of by . In this case, if recovers after a system crash, it needs a mechanism that allows it to detect whether or not its operation was linearized. As we explain below, per-process toggle bits are used to implement this mechanism. Before invoking an operation on the object, its caller initializes the structure as described in Section 2. Specifically, is initialized to and is initialized to .
The Write operation
To write, process reads (line LABEL:write-enter), thus learning that was the last to write to and which toggle-bit array was used by for writing. Next, resets the bit in ’s other toggle-bit array corresponding to (line LABEL:write-set-0), and persists the value read from , as well as the index of the toggle-bit array used by ’s current write (stored in ), into (lines LABEL:write-read-Tp-LABEL:write-RD-update). Then, reads again (line LABEL:write-read-R-again) and proceeds to write to (line LABEL:write-R-update) only if it read from the same value as in line LABEL:write-enter. In this case, sets its checkpoint field to (line LABEL:write-CP-first) immediately before the write to and sets it to (line LABEL:write-CP-termination) immediately after it. It then sets all the bits in the toggle-bit array used by its current write operation, switches its toggle-bit array index, persists the response and returns (lines LABEL:write-for-LABEL:write-return).
If the condition of line LABEL:write-read-R-again is not satisfied then, as we prove, a write operation by a process other than is linearized between ’s first and second reads of , hence can be assumed to have been overwritten by . In this case, skips lines LABEL:write-CP-first-LABEL:write-R-update and proceeds directly to line LABEL:write-CP-termination.
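The steps of the Write operation can be traced with the following single-threaded Python sketch. It is an illustration under stated assumptions and not the pseudo-code of Algorithm LABEL:bounded-space-recoverable-register: the names `R`, `A`, `T`, `RD`, `CP` and `Res`, the checkpoint values 1 and 2, and the initial attribution of the value to process 0 with toggle index 1 are all hypothetical choices, and ordinary Python lists stand in for non-volatile memory.

```python
ACK, BOTTOM = "ack", None


class Register:
    """Single-threaded model of the shared and private state (names hypothetical)."""

    def __init__(self, n, initial):
        self.n = n
        # R holds a triple (value, last writer, toggle index used by that write).
        # Assumption: the initial value is attributed to process 0, index 1.
        self.R = (initial, 0, 1)
        # A[q][i][p] is the bit of process p in toggle-bit array i of writer q.
        self.A = [[[0] * n for _ in range(2)] for _ in range(n)]
        self.T = [0] * n          # private: index of the array used by the next write
        self.RD = [None] * n      # private: data persisted for recovery
        self.CP = [0] * n         # checkpoint field of each process
        self.Res = [BOTTOM] * n   # result field of each process

    def write(self, p, val):
        self.CP[p], self.Res[p] = 0, BOTTOM  # done by the caller in the text
        first = self.R                       # first read of R
        _, q, i = first
        self.A[q][1 - i][p] = 0              # reset p's bit in q's other array
        self.RD[p] = (first, self.T[p])      # persist the triple read and p's index
        if self.R == first:                  # second read (always equal in this model)
            self.CP[p] = 1                   # checkpoint immediately before the write
            self.R = (val, p, self.T[p])
        self.CP[p] = 2                       # checkpoint after the write (or the skip)
        self.finish(p)
        return ACK

    def finish(self, p):
        used = self.RD[p][1]                 # index persisted before the write
        for j in range(self.n):              # set all bits of the array just used
            self.A[p][used][j] = 1
        self.T[p] = 1 - used                 # switch arrays for the next write
        self.Res[p] = ACK                    # persist the response
```

Because the model runs one operation at a time, the second read always matches the first; in a concurrent execution the `if` branch would be skipped whenever another write intervened, as described above.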
The Write.Recover recovery function
Upon recovery from a failed Write operation , first reads (line LABEL:write-recover-read-RD) and then checks if was set (line LABEL:write-recover-if-result-set). If so, was completed and has been linearized, so the recovery function returns ack. Next, checks if equals (line LABEL:write-recover-if-no-CP). In this case, as we prove, was not linearized before the failure, so the recovery function returns fail (line LABEL:write-reinvoke-if-CP-0); the caller of the failed operation can now decide whether or not to reattempt performing . Otherwise, if equals (line LABEL:write-recover-if-after-first-CP), then the recovery code must determine whether or not was written in line LABEL:write-R-update (either by or by another process) after read from in line LABEL:write-enter. This is done in line LABEL:write-recover-if-no-write as follows. If ’s value differs from , then was written and so either performed line LABEL:write-R-update or can be assumed to have been overwritten by another write, so the recovery code proceeds by performing lines LABEL:write-recover-set-CP-LABEL:write-recover-return (which are identical to lines LABEL:write-CP-termination-LABEL:write-return). Otherwise, ’s value equals but it is still possible that wrote to again after was read by . This is checked by the second condition of line LABEL:write-recover-if-no-write, which relies on the following key observation used by our correctness proof: in order for to write again using the same toggle-bit index, it must first complete a write operation using the other toggle-bit index. However, in that earlier write operation, sets all the toggle bits of that array to 1 (either in lines LABEL:write-for-LABEL:write-for-end of its write operation or in lines LABEL:write-recover-for-LABEL:write-recover-set-toggle-bits of its recovery function).
Therefore, upon recovery, if reads the same value from as before the crash, it can conclude that a write occurred in between its two reads of if and only if ’s toggle bit that it has set to 0 is now 1. If this is not the case, concludes that was not linearized and returns fail (line LABEL:write-recover-no-write-fail).
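The case analysis performed by the recovery function can be summarized as a pure decision function. The Python sketch below is only an illustration: the function name, its parameters, the checkpoint values 0, 1 and 2, and the `FINISH` outcome (meaning "perform the termination steps and then return ack") are hypothetical and do not appear in the algorithm's pseudo-code.

```python
ACK, FAIL, FINISH, BOTTOM = "ack", "fail", "finish", None


def write_recover_decision(result, checkpoint, saved, current, toggle_bit):
    """Decide how recovery from a failed write proceeds (illustrative only).

    result     -- persisted result field (BOTTOM if no response was stored)
    checkpoint -- persisted checkpoint field (assumed values: 0, 1 or 2)
    saved      -- triple read from the register and persisted before the crash
    current    -- triple read from the register during recovery
    toggle_bit -- current value of the bit that the failed write reset to 0
    """
    if result is not BOTTOM:
        return ACK       # the response was persisted, so the write completed
    if checkpoint == 0:
        return FAIL      # the crash occurred before the first checkpoint
    if checkpoint == 1:
        if current != saved:
            return FINISH  # the register changed since the first read
        if toggle_bit == 1:
            return FINISH  # same triple, but it was rewritten in between (ABA)
        return FAIL        # no write occurred, so the operation was not linearized
    return FINISH          # the checkpoint after the write was already reached
```

For example, `write_recover_decision(BOTTOM, 1, s, s, 1)` returns `FINISH` for any triple `s`: the register content is unchanged, yet the toggle bit shows that an intervening write took place.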
The Read operation reads a triplet of values from and then extracts its first component, writes it to and returns it. Its recovery function re-invokes Read if holds; otherwise, it returns the persisted response. This simple code is not presented in Algorithm LABEL:bounded-space-recoverable-register.
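Since the Read code is omitted from the algorithm, the following Python sketch shows one way it might look. It assumes that the recovery function tests whether the result field still holds its initial value; the dictionary keys `"R"` and `"result"` and the sentinel `BOTTOM` are hypothetical names.

```python
BOTTOM = object()  # hypothetical stand-in for the initial value of the result field


def read(register, ann):
    value, _writer, _index = register["R"]  # read the triple, keep the first component
    ann["result"] = value                   # persist the response before returning
    return value


def read_recover(register, ann):
    # Assumption: an unset result field means the read did not persist a
    # response, so it is safe to simply re-invoke it.
    if ann["result"] is BOTTOM:
        return read(register, ann)
    return ann["result"]
```

Re-invocation is harmless here because a read does not modify the register, so a read that crashed before persisting its response leaves no effect to detect.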
It is easily seen that Algorithm LABEL:bounded-space-recoverable-register uses bounded space, assuming that the values written by Write operations are of bounded size. It remains to show that the algorithm satisfies durable linearizability, detectability and wait-freedom.
Algorithm LABEL:bounded-space-recoverable-register is wait-free and satisfies durable linearizability and detectability.
Consider an execution of Algorithm LABEL:bounded-space-recoverable-register. Assume process completes a Write() operation in (either directly or by completing the recovery function). We prove that one of the following holds: 1) writes to exactly once, and this is ’s linearization point; 2) does not write to and there is a concurrent write operation by a different process that writes to , hence we can linearize immediately before ; or 3) the failure occurred before wrote to , in which case Write.Recover returns fail.