Byte-addressable non-volatile main memory (NVRAM) combines the performance benefits of conventional main memory with the durability of secondary storage. Systems where NVRAM co-exists with (or even replaces) traditional volatile memory are anticipated to be more prevalent in the near future. The availability of durable main memory has increased the interest in the crash-recovery model, in which failed processes may be resurrected after the system crashes. Of particular interest is the design of recoverable concurrent objects (also called persistent [ChenQ-VLDB2015, CoburnCAGGJW-Asplos2011] or durable [VenkataramanTRC-FAST2011]), whose operations can recover from crash-failures.
In many computer systems (e.g., databases), recovery is supported by tracking the progress of computations, logging it to non-volatile storage, and replaying the log during recovery. Logging of this kind imposes significant overheads in time and space. This cost is even more pronounced in the context of concurrent data structures, where there is an extra cost of synchronizing accesses to the log. Furthermore, replaying an operation in our setting, where several processes may be concurrently recovering from crash-failures while still others have already completed recovery and proceeded with their normal execution, often requires adding new mechanisms to the original, non-recoverable implementation.
A key observation is that in the context of concurrent data structures, full-fledged logging is not needed, and progress can be tracked in a per-process manner, which is sufficient for making them recoverable. Moreover, many lock-free implementations use helping and already encompass such tracking mechanisms; they can be easily adapted to support recoverability. Exploiting these observations yields the tracking approach to designing recoverable objects. This approach is based on explicitly maintaining an information structure (called Info structure), stored in non-volatile memory, to track an operation’s progress as it executes. The Info structure allows a process to decide, upon recovery, whether it is required to restart a failed operation or whether the operation’s effect has already been made visible to other processes, in which case it only remains for the operation to determine and return its response. The tracking approach proved to be widely applicable—to all the data structures we have inspected.
We also present a variant of the tracking approach, called direct tracking, applicable to implementations in which every update takes effect in a single CAS operation, e.g., [RF2004, DBLP:conf/wdag/Harris01, MichaelS-PODC1996]. Direct tracking relies on an arbitration mechanism that helps determine the responses of updates that failed while competing to apply the same change to the data structure (e.g., deleting the same node). Upon recovery, each of these processes competes to become the one to which the successful execution of the primitive operation is attributed, thereby determining its response value.
As we show, the tracking approach can be used to derive recoverable versions of a large collection of concurrent data structures [DBLP:conf/podc/EllenFRB10, RF2004, DBLP:conf/wdag/Harris01, DBLP:journals/jpdc/HendlerSY10, MichaelS-PODC1996] and succeeds in doing so with relatively minor changes to their original code. It significantly reduces the cost (in both time and memory) of tracking operations’ progress: rather than recording exactly which instructions have been performed, it records only whether specific instructions have been reached. Furthermore, even this can often be piggybacked on information already tracked by lock-free concurrent data structures. This means that updates efficiently maintain and persist sufficient information for recovery, and that the corresponding recovery code infers whether the operation took effect before the failure (in which case its response value is computed and returned) or not (in which case it is re-executed). Our approach does not modify operations that do not update the data structure.
For simplicity, we present the tracking approach and our recoverable data structures assuming that caches are non-volatile. However, flushes can be added to ensure that they work correctly even if cache memories are volatile and their content is lost upon a system-wide failure. Section LABEL:section:evaluation presents an initial experimental evaluation, showing the feasibility of the tracking approach. Our experimental evaluation inserts flushes into the code, ensuring that cache values are persisted in correct order.
2 System Model
A set of asynchronous crash-prone processes communicate through shared objects. The system provides base objects supporting atomic read, write, and Compare-And-Swap (CAS) primitive operations. Base objects are used to implement higher-level concurrent objects by defining algorithms, for each process, which use primitives to carry out the operations of the more complex object.
The state of each process consists of non-volatile shared-memory variables, non-volatile private variables, and local variables stored in volatile processor registers. Private and local variables are accessed only by the process to which they belong. At any point during the execution of an operation, a system-wide crash-failure (or simply a crash) resets all variables to their initial values, except the values of (shared and private) non-volatile variables. A process p invokes Op to start its execution; Op completes by returning a response value, which is stored in a local variable of p. The response value is lost if a crash occurs before p persists it (i.e., writes it to a non-volatile variable).
A recoverable operation Op has an associated recovery function, denoted Op.Recover, which the system calls when recovering p after a system-failure that occurred while it was executing Op. Failed processes are recovered by the system asynchronously, independently of each other; the system may recover only a subset of these processes before another crash occurs. The recovery code is responsible for finishing Op’s execution and returning its response. An implementation is recoverable if all its operations are recoverable. Process p may incur multiple crashes along the execution of Op and during the execution of Op.Recover, so Op.Recover may be invoked multiple times before Op completes. We assume that the system invokes Op.Recover with the same arguments as those with which Op was invoked when the crash occurred. For each process p, we also assume the existence of a non-volatile private variable CP_p, which may be used by recoverable operations and recovery functions for managing check-points in their execution flow. (Footnote 1: Some form of system support seems necessary for designing recoverable data structures, as assumed in previous works [AttiyaBH-PODC2018, DBLP:journals/corr/abs-1806-04780, DBLP:conf/ppopp/FriedmanHMP18, DBLP:conf/podc/GolabH17]. E.g., in [AttiyaBH-PODC2018], recovery code knows the address of Op’s instruction that p was about to execute when it crashed.) When p invokes a recoverable operation Op, the system resets CP_p just before Op’s execution starts. CP_p can be read and written by recoverable operations (and their recovery functions). CP_p is used by Op in order to persistently report that the execution reached a certain point. The recovery function can use this information in order to correctly recover and to avoid re-execution of critical instructions such as CAS.
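The role of the check-point variable can be illustrated with a small single-threaded simulation. This is only a sketch under stated assumptions: the `NVRAM` class, the `claim` operation, the `crash_at` injection points, and the check-point values 0 and 1 are all hypothetical and chosen for illustration. The example also assumes that the shared `owner` field is never reset once claimed, which is what lets recovery detect a successful CAS from the process ID stored in it.

```python
class Crash(Exception):
    """Simulated system-wide crash-failure."""

class NVRAM:
    """Hypothetical stand-in for non-volatile memory; it survives simulated crashes."""
    def __init__(self):
        self.owner = None   # shared base object, changed only by CAS
        self.cp = {}        # per-process check-point variable
        self.resp = {}      # per-process persisted response

    def cas_owner(self, expected, new):
        # Sequential simulation of an atomic CAS on the shared object.
        if self.owner == expected:
            self.owner = new
            return True
        return False

def claim(nv, p, crash_at=None):
    """Recoverable operation: process p tries to become the owner."""
    nv.cp[p] = 0                    # assumption: done by the system at invocation
    if crash_at == "before_cas":
        raise Crash
    ok = nv.cas_owner(None, p)      # critical instruction, must not be repeated blindly
    if crash_at == "after_cas":
        raise Crash                 # `ok` was held in a volatile local and is lost
    nv.resp[p] = ok                 # persist the response first...
    nv.cp[p] = 1                    # ...then report that this point was reached
    return ok

def claim_recover(nv, p):
    """Recovery function; may itself be invoked several times."""
    if nv.cp[p] == 1:
        return nv.resp[p]           # response already persisted, just return it
    if nv.owner == p:
        ok = True                   # the CAS took effect before the crash
    else:
        ok = nv.cas_owner(None, p)  # the CAS did not succeed earlier, so retrying is safe
    nv.resp[p] = ok
    nv.cp[p] = 1
    return ok
```

For instance, if `claim(nv, "p", crash_at="after_cas")` raises `Crash`, a later `claim_recover(nv, "p")` returns `True` without executing the CAS a second time.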
An operation Op can be completed either directly or when, following one or more crashes, the execution of the last instance of Op.Recover invoked for p is complete. In either case, the response of Op is written to a local variable of p. Our algorithms are strictly recoverable [AttiyaBH-PODC2018, Definition 1]: the response of a recoverable operation Op is made persistent before Op completes, so that a higher-level operation that invokes Op is able to access Op’s response value, even if a crash occurs after Op completes. Our algorithms satisfy Nesting-safe Recoverable Linearizability (NRL) [AttiyaBH-PODC2018, Definition 4]: a failed operation is linearized within an interval that includes its failures and recovery attempts. NRL implies detectability [DBLP:conf/ppopp/FriedmanHMP18], i.e., the ability to conclude upon recovery whether or not the operation took effect, and to access its response value if it did.
3 The Tracking Approach
Tracking with Info Structures:
Many lock-free implementations of data structures employ a helping mechanism to ensure global progress, even if some processes crash while executing their operations, e.g., [DBLP:conf/spaa/Barnes93, DBLP:conf/podc/EllenFHR13, DBLP:conf/podc/EllenFRB10, feldman2016efficient, DBLP:conf/ipps/WalulyaT17]. They associate an information (Info) structure with each update, tracking the progress of the update by storing sufficient information to allow its completion by concurrent operations.
Our approach applies to implementations using Info-Structure-Based (ISB) helping: An update Op by process p initializes an Info structure inf and then attempts to install it (using CAS) in a node nd that Op is trying to change, by setting a designated field of nd to point to inf; updates that change several nodes may install Info structures in all of them. If inf is successfully installed, p continues the execution of Op using the information in inf. Once the update completes, p uninstalls inf by resetting nd’s designated field. Every time p fails to install inf, it must be that an Info structure inf’ of another process q is installed at nd. In this case, p helps q’s operation to complete (using the information in inf’) and then restarts Op.
ISB helping goes a long way towards making a data structure recoverable: updates are idempotent and not susceptible to the ABA problem, since they must ensure that an update is done exactly once, even if several processes attempt to concurrently help it complete. So, if the system crashes while p is executing an operation Op, upon recovery, p can essentially re-execute the failed update to completion by either using the information in the Info structure for Op (if it has already been installed) or by starting from scratch.
Two issues need to be addressed to support recoverability. First, when p recovers from a crash that occurred while executing an update Op, its recovery code must be able to access the Info structure installed by Op. We address this by allocating, for every process p, a designated persistent variable RD_p (for Recovery Data) that provides access to p’s recovery data and, specifically, holds a reference to the Info structure of Op. Second, a recovering process must be able to figure out whether its failed operation took effect, and if it did, to discover its response. To ensure this, we add a result field to each Info structure. Process p, as well as each process helping Op, must set the result field in Op’s Info structure before uninstalling the Info structure from the relevant nodes. This allows p to retrieve its response at some later time.
Upon recovery, p checks RD_p to find the reference to the Info structure, inf, it last stored there. Then, p checks whether its last operation Op (i.e., the operation to which inf belongs) is still in progress. This can be done by accessing the data structure nodes that are to be modified. If any of these nodes stores a reference to inf, then Op is still in progress and p tries to complete Op by using the information recorded in inf. Finally, if inf’s result field is set, the operation took effect, and p returns the value recorded in this field. Otherwise, inf’s result field has not been set and none of the nodes store a reference to it, thus Op did not take effect, and p restarts Op. We note that if Op’s changes have been performed and later obliterated by other operations, then the result field of inf would have been set. This is so because once Op becomes visible, processes that operate on the same nodes as Op help Op to complete. Therefore, the effect of Op cannot be obliterated if the result field of inf has not been set.
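The install, help, and recover steps described above can be sketched for a toy object with a single node. This is a minimal illustration, not one of the algorithms of the paper: the `swap` operation (write a new value, return the previous one), the lock-based `Atomic` cell standing in for hardware CAS, the `Box` wrapper, the dictionaries `RD` and `CP` modelling per-process non-volatile variables, the check-point values 0 and 1, and the `crash_at` injection point are all assumptions made for the example.

```python
import threading

UNSET = object()  # sentinel: field not yet written

class Crash(Exception):
    """Simulated system-wide crash-failure."""

class Atomic:
    """Simulated atomic cell; a lock stands in for a hardware CAS."""
    def __init__(self, value):
        self._v, self._lock = value, threading.Lock()
    def load(self):
        with self._lock:
            return self._v
    def cas(self, expected, new):
        with self._lock:
            if self._v is expected:
                self._v = new
                return True
            return False

class Box:
    """Fresh wrapper per written value, so identity-based CAS cannot suffer ABA."""
    def __init__(self, val):
        self.val = val

class Node:
    def __init__(self, val):
        self.value = Atomic(Box(val))
        self.info = Atomic(None)     # designated field; None means nothing installed

class Info:
    def __init__(self, node, new):
        self.node, self.new = node, Box(new)
        self.old = Atomic(UNSET)     # value being replaced, fixed by the first helper
        self.result = Atomic(UNSET)  # response, set before inf is uninstalled

def help_op(inf):
    """Idempotent completion of an installed update; run by its owner and by helpers."""
    inf.old.cas(UNSET, inf.node.value.load())
    old = inf.old.load()
    inf.node.value.cas(old, inf.new)  # takes effect at most once
    inf.result.cas(UNSET, old.val)    # record the response first...
    inf.node.info.cas(inf, None)      # ...then uninstall

RD = {}  # RD[p]: recovery data of process p (assumed non-volatile)
CP = {}  # CP[p]: check-point of process p (assumed non-volatile)

def swap(p, node, new, crash_at=None):
    """Recoverable update: write `new` into node and return the previous value."""
    CP[p] = 0                         # assumption: done by the system at invocation
    inf = Info(node, new)
    RD[p] = inf                       # make inf reachable from the recovery code
    CP[p] = 1                         # RD[p] now belongs to this operation
    while not node.info.cas(None, inf):
        other = node.info.load()
        if other is not None:
            help_op(other)            # help the installed update, then retry
    if crash_at == "installed":
        raise Crash
    help_op(inf)
    return inf.result.load()

def swap_recover(p, node, new):
    """Recovery function for swap, following the steps in the text."""
    if CP.get(p) != 1:
        return swap(p, node, new)     # RD[p] is stale: this update never started
    inf = RD[p]
    if node.info.load() is inf:
        help_op(inf)                  # still in progress: complete it
    if inf.result.load() is not UNSET:
        return inf.result.load()      # took effect: return the recorded response
    return swap(p, node, new)         # never installed: restart
```

If p crashes right after installing its Info structure and another process q then updates the same node, q first helps p's update to completion. When p recovers, it finds the result field set and returns it, even though q has since overwritten p's value.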
We apply this scheme to an exchanger (Section 4) and a Binary Search Tree (Section LABEL:section:BST and Appendix LABEL:section:_appendix-BST).
While the tracking approach described above is applicable in all cases we have considered, in some of these implementations, e.g. [RF2004, DBLP:conf/wdag/Harris01, MichaelS-PODC1996], an alternative approach can be used to facilitate recovery. In these implementations, updates change the abstract state of the data structure using a single realization primitive, and they become visible to other processes only after executing this primitive. For example, in Harris’ linked-list-based set implementation [DBLP:conf/wdag/Harris01], an update changes the abstract state of the set with a single successful realization CAS: Inserts do so by atomically modifying the pointer of a node in the data structure, while deletes do so by using the standard logical deletion technique, in which a node is marked as having been removed from the set (and physically removed from the data structure lazily later on).
An implementation with a single realization CAS can be made recoverable with direct tracking: Process p, executing an update Op, stores in RD_p a direct reference to the node nd to be changed, instead of referencing nd indirectly via a reference to an Info structure. Upon recovery from a failure, p checks RD_p to find the reference to the node nd associated with Op.
For an insert, p can determine whether nd is still in the data structure by searching for it; if it is found, then Op was successful in adding nd to the data structure. If p does not find nd, it is possible that p crashed after inserting nd but nd was deleted in the meantime; in this case, however, nd would have been marked by the deleting process, allowing p to infer that Op indeed took effect. Note that under this technique, recovery from failed inserts has higher overhead since the recovery function must search for a node in the list. However, updates have lower overhead, which is preferable when crashes are rare.
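The recovery rule for inserts can be sketched with a single-threaded simulation of a sorted linked list. The sketch makes several simplifying assumptions: the realization CAS is replaced by a plain pointer assignment, the deletion mark is a boolean field rather than a bit in the next pointer, a delete unlinks the node immediately, and `RD` is cleared at invocation as a stand-in for the system's check-point support. The names `ListSet`, `insert_recover`, and `crash_at` are hypothetical.

```python
class Crash(Exception):
    """Simulated system-wide crash-failure."""

class Node:
    def __init__(self, key, nxt=None):
        self.key, self.next = key, nxt
        self.marked = False           # logical-deletion mark

RD = {}  # RD[p]: direct reference to the node p is inserting (assumed non-volatile)

class ListSet:
    def __init__(self):
        self.head = Node(float("-inf"), Node(float("inf")))

    def _search(self, key):
        pred, curr = self.head, self.head.next
        while curr.key < key:
            pred, curr = curr, curr.next
        return pred, curr

    def contains_node(self, nd):
        _, curr = self._search(nd.key)
        return curr is nd

    def insert(self, p, key, crash_at=None):
        RD[p] = None                  # assumption: cleared at invocation
        pred, curr = self._search(key)
        if curr.key == key:
            return False              # key already present, nothing to track
        nd = Node(key, curr)
        RD[p] = nd                    # persist the reference before the realization step
        if crash_at == "before_cas":
            raise Crash
        pred.next = nd                # stands in for the single realization CAS
        if crash_at == "after_cas":
            raise Crash
        return True

    def delete(self, key):
        pred, curr = self._search(key)
        if curr.key != key:
            return False
        curr.marked = True            # logical deletion
        pred.next = curr.next         # physical removal (done eagerly in this sketch)
        return True

    def insert_recover(self, p, key):
        nd = RD.get(p)
        # The insert took effect if nd is still in the list, or if it was
        # inserted and has since been marked by a deleting process.
        if nd is not None and (nd.marked or self.contains_node(nd)):
            return True
        return self.insert(p, key)    # no effect before the crash: re-execute
```

The search in `insert_recover` is the extra recovery cost mentioned above; the failure-free path of `insert` only adds the single write to `RD[p]`.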
Recovering a delete requires handling the scenario of several processes simultaneously attempting to delete the same node nd by applying CAS to mark it as logically deleted. Exactly one operation (say that of process p) succeeds in marking nd, while the CAS performed by the other processes to mark the node fails. If the system crashes after some of the processes apply their CAS but before obtaining the responses, then the identity of the winner (p) is lost. An arbitration mechanism is required for choosing a single process q (not necessarily p) to which the successful CAS (and delete) will be attributed. Then, q can return success for its delete, whereas all other deletes must return fail. To implement arbitration, a deleter field is added to each node. Following the logical deletion of a node nd, a Delete by each process q attempts to swap its ID to nd.deleter in order to attribute nd’s deletion to q. If the system fails when q attempts to delete nd, then, when it recovers, q checks if nd is logically deleted, and if so, attempts to swap nd.deleter as well.
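The arbitration step can be sketched in isolation for a single node. The code below is an illustration under assumptions, not the paper's linked-list algorithm: the node is passed to `delete` directly instead of being located by a search, `Atomic` is a lock-based stand-in for hardware CAS, `RD` is cleared at invocation as a stand-in for check-point support, and in the failure-free path only the process whose marking CAS succeeded tries to claim the deletion. The names `claim`, `delete_recover`, and `crash_at` are hypothetical.

```python
import threading

class Crash(Exception):
    """Simulated system-wide crash-failure."""

class Atomic:
    """Simulated atomic cell; a lock stands in for a hardware CAS."""
    def __init__(self, value):
        self._v, self._lock = value, threading.Lock()
    def load(self):
        with self._lock:
            return self._v
    def cas(self, expected, new):
        with self._lock:
            if self._v is expected:
                self._v = new
                return True
            return False

class Node:
    def __init__(self, key):
        self.key = key
        self.marked = Atomic(False)   # logical-deletion mark
        self.deleter = Atomic(None)   # ID of the process credited with the deletion

RD = {}  # RD[p]: node that p is trying to delete (assumed non-volatile)

def claim(p, nd):
    """Arbitration: the first process to write its ID into nd.deleter wins."""
    nd.deleter.cas(None, p)
    return nd.deleter.load() == p

def delete(p, nd, crash_at=None):
    RD[p] = None                      # assumption: cleared at invocation
    if nd.marked.load():
        return False                  # deleted before p started competing
    RD[p] = nd                        # persist a direct reference to the node
    if crash_at == "before_cas":
        raise Crash
    won = nd.marked.cas(False, True)  # realization CAS (logical deletion)
    if crash_at == "after_cas":
        raise Crash                   # `won` was volatile and is lost
    return won and claim(p, nd)

def delete_recover(p, nd_arg):
    """Recovery function, invoked with the same argument as the failed delete."""
    nd = RD.get(p)
    if nd is None:
        return delete(p, nd_arg)      # crashed before tracking anything
    if not nd.marked.load():
        return delete(p, nd)          # nobody marked nd yet: re-execute
    return claim(p, nd)               # nd is deleted: compete for the credit
```

Whichever recovering process claims `nd.deleter` first returns success, even if another process executed the marking CAS; every other competitor returns fail, so the deletion is attributed exactly once.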
Direct tracking is applied to obtain a recoverable linked-list-based set in Section LABEL:section:linked-list. ISB tracking and direct tracking are combined in a recoverable elimination stack [DBLP:journals/jpdc/HendlerSY10] (Section LABEL:section:elimination-stack and Appendix LABEL:section:_appendix-elimination-stack).
4 Recoverable Exchanger
An Exchanger [DBLP:books/daglib/0020056, scherer2005scalable] allows two processes to pair up the operations they are executing and exchange values. Following [DBLP:books/daglib/0020056], an Exchanger object comprises two fields, value and state, which are manipulated atomically with a CAS. The state field can hold the values Empty, Waiting, or Busy, and it is initially Empty. The first process, p, to arrive finds the Exchanger free (i.e., finds its state equal to Empty) and captures it by atomically writing to it its value and changing its state to Waiting. Then, p busy-waits until another process collides with it: if q arrives while p is waiting, it reads p’s value in the Exchanger, and tries to atomically write its value to it and change the state to Busy, informing p of a successful collision. If it succeeds, when p next reads the Exchanger, it gets Busy and q’s value and resets the Exchanger’s state to Empty. Another process reading Busy in the Exchanger’s state (before p resets it) busy-waits until p changes it again to Empty (hence, this implementation is not lock-free).
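The state machine just described can be sketched as follows. This covers only the original, non-recoverable exchanger, without timeouts, and rests on assumptions: the (state, value) pair is modelled as a tuple held in a lock-based `Atomic` cell that stands in for a hardware CAS over both fields, and the names `Exchanger.slot` and `exchange` are illustrative.

```python
import threading

EMPTY, WAITING, BUSY = "Empty", "Waiting", "Busy"

class Atomic:
    """Simulated atomic cell; a lock stands in for a hardware CAS."""
    def __init__(self, value):
        self._v, self._lock = value, threading.Lock()
    def load(self):
        with self._lock:
            return self._v
    def cas(self, expected, new):
        with self._lock:
            if self._v is expected:
                self._v = new
                return True
            return False

class Exchanger:
    def __init__(self):
        self.slot = Atomic((EMPTY, None))   # (state, value), updated atomically

    def exchange(self, my_value):
        while True:
            cur = self.slot.load()
            state, value = cur
            if state == EMPTY:
                # First to arrive: capture the exchanger and wait for a partner.
                if self.slot.cas(cur, (WAITING, my_value)):
                    while True:
                        now = self.slot.load()
                        if now[0] == BUSY:
                            # A partner collided: take its value, free the slot.
                            self.slot.cas(now, (EMPTY, None))
                            return now[1]
            elif state == WAITING:
                # Second to arrive: try to collide with the waiting process.
                if self.slot.cas(cur, (BUSY, my_value)):
                    return value
            # state == BUSY: spin until the waiting process resets the slot.
```

A process that observes Busy can only spin until the waiting process resets the slot, which is why this implementation is not lock-free.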
We employ the tracking approach to achieve recoverability: processes exchange Info structures (ExInfo) instead of values. See Algorithm LABEL:alg:exchanger: pseudocode in blue handles recoverability; black pseudocode is the original implementation. In addition to state and value fields, ExInfo contains a result field, and a partner field pointing to the ExInfo of the operation with which the operation is trying to collide. Initially, the Exchanger stores a pointer to a default ExInfo with an Empty state.