In this paper, we are interested in the management of concurrently shared resources, and in particular in safely destructing them. Resources can include memory, file pointers, communication channels, unique labels, operating system resources, or any resource that needs to be constructed and destructed, and can be accessed concurrently by many threads. The problem being addressed is that when a handle to a resource is stored in a shared location, there can be an unsafe race between a thread that reads the handle (often a pointer) and then accesses its content, and a thread that clears the location and then destructs the content. Such read-destruct races are a common issue in the design of concurrent software. In the context of allocating and freeing memory blocks, the problem has been referred to as the memory reclamation problem [23, 4], the repeat offender problem, and read-reclaim races. Elsewhere, read-destruct races come up in managing reference counts, OS resources, and multiversioning.
Read-destruct races can be addressed by placing a lock around the reads and destructs, but this comes with all the standard issues with locks. To avoid locks, many lock-free methods have been suggested and are widely used, including read copy update (RCU), epochs, hazard pointers, pass-the-buck, interval-based reclamation, and others [26, 4]. RCU is implemented in the Linux operating system, and McKenney et al. point out that it is used in over 6500 places, largely to protect against read-destruct races. Epoch-based reclamation and hazard pointers are also widely used for memory management in concurrent data structures [33, Chapter 7.2] to protect against read-destruct races.
Our contribution is to solve the read-destruct problem in a constant time (expected) wait-free manner—that is to say that both the read sections and the destructs can be implemented safely with expected constant time overhead. Furthermore, our solution is memory efficient, only requires single-word CAS, and ensures timely destruction. Our solution is expected rather than worst-case constant time because each process uses a local hash table.
Specifically, we support an acquire-retire interface consisting of four operations: acquire, release, retire, and eject. An acquire takes a pointer to a location containing a resource handle, reads the handle and protects the resource, returning the handle. A later paired release releases the protection. A retire follows an update on a location containing a resource handle, and is applied to the handle that is overwritten to indicate the resource is no longer needed. A later paired eject will return the resource handle, indicating it is now safe to destruct the resource (i.e., it is no longer protected). This definition is illustrated in Figure 1. We allow the same handle to be retired multiple times concurrently, each paired with its own eject. This is useful in our reference counting collector where the destruct is a decrement of the reference count. All operations are linearizable, i.e., must appear to be atomic. Section 2.1 compares the acquire-retire interface to similar interfaces.
As an example, Figure 2 defines a structure for protecting a resource based on acquire-retire, and uses it for redirecting a file stream. The use function wraps a function call (on ) in an acquire-retire, while the redirect overwrites the value of the resource. A thread (other) uses the stream concurrently while the main thread redirects the stream. Without acquire-retire, there is a read-destruct race that would allow other to write to a stream that has already been closed. Here, the retire binds a function (the destructor) that is applied when ejected.
The acquire-retire interface can be implemented using hazard pointers or pass-the-buck. With these approaches, however, the acquire is only lock-free. In particular, after installing the value read from the handle location into a hazard slot, it is necessary to go back and check that the location has not been updated, and repeat if it has. Epoch-based reclamation can be used to support the operations in constant time, but it provides no bound on the number of delayed retires. We say a retire-eject pair is delayed between the retire and the eject.
In this paper we describe an efficient implementation of acquire-retire, which gives the following results.
Result 1 (Acquire-Retire): For an arbitrary number of resources and locations, processes, and at most resources protected on each process at a given time, the acquire-retire interface can be supported using:
$O(1)$ time for acquire, release, and retire,
$O(1)$ expected time for eject,
space overhead and delayed retires, and
single-word (pointer-width) read, write and CAS.
For the space and delayed resources bounds, we assume that every retire comes with at least one eject. We note that $O(1)$ time for each operation is stronger than wait-free since wait-free only guarantees finite time. Also, for resources and , the space overhead is a low order term compared to the space for the resources themselves. Limiting ourselves to pointer-width words implies that we do not use unbounded counters. Such counters are used in many other non-blocking data structures to avoid the ABA problem [15, 22]. We know of no previous solution that is constant (expected or worst-case) time, bounded space, and does not use much more powerful primitives. Our results make use of recent results on atomic copy.
Applications of Acquire-Retire.
We describe several applications that can use acquire-retire, and significantly improve on previous bounds for each of these. We first consider the classic memory reclamation problem solved by hazard pointers and pass-the-buck. Here, the resources are memory blocks managed with explicit allocates and frees. A read-destruct race can happen when one process reads a variable with a pointer to a block and uses the block, and a concurrent process overwrites the variable and frees the overwritten block. We say the protected-block memory reclamation problem is to support a protected_read that reads and uses a pointer from a location, and a safe_free that is applied when the last pointer to a block is overwritten and ready to free, but that actually frees the block only when it is not protected. (We call it “protected-block” to distinguish it from other “protected-region” interfaces such as epochs or RCU, which protect all blocks in a region of code instead of individual blocks.) We say a block is delayed if it has been safe freed but has not been actually freed. We describe an implementation with the following bounds.
Result 2 (Memory Reclamation): For an arbitrary number of blocks and processes each with at most protected blocks at any time, protected-block memory reclamation can be supported with:
$O(1)$ time overhead for protected_read,
$O(1)$ expected time for safe_free,
space + bits per block,
delayed blocks at any time, and
single-word (pointer-width) read, write, and CAS.
Hazard pointers solve the problem, but with unbounded time on protected reads (i.e., lock-free). Epochs can solve the problem in constant time overhead, but with unbounded space. RCU ensures bounded space, but is blocking. We know of no other constant time, bounded space solution that does not require powerful primitives with operating system support. Based on our solution to the protected-block memory reclamation problem we develop data structures for concurrent stacks and queues that improve on the time of peeking at the top of the stack or end of the queue.
Result 3 (Stacks and Queues): On processes, a collection of concurrent stacks or queues with a maximum of word-sized items total across them (a word could be a pointer to a much larger item, in which case we count only the space of the pointer, not of the larger item) can be supported with:
lock-free updates, and $O(1)$ time peeks,
single-word (pointer-width) read, write and CAS.
The best previous implementations of concurrent stacks and queues either require unbounded counters with double-word-width CAS [21, 30], take unbounded time on peeks, or require space and double-word-width LL/SC.
Next, we describe an implementation of reference counting for garbage collection. Here, the resource is the reference count. Copying a reference (pointer) to an object involves reading the reference and then incrementing the count on the object. Destructing the reference decrements the count on the object and collects it if the count goes to zero. When the reference is kept in a shared mutable location, a read-destruct race can occur where a read and then increment is split by an overwrite, decrement and collect. To protect against this, we present the following result.
Result 4 (Reference Counting): On processes, any number of reference counted objects with references stored in shared mutable locations can be implemented safely with:
references as just pointers (i.e., memory addresses),
$O(1)$ expected time for reading, copying and overwriting references,
space overhead and delayed decrements,
single-word (pointer-width) read, write, CAS, FAS, and FAA.
We say a decrement is delayed if a reference has been overwritten or otherwise deleted, but the count on its object has not yet been decremented. Previous approaches to protect against the read-destruct race in reference counting are either only lock-free [31, 21, 7, 12, 33], wait-free with time  per operation, require unbounded decrement-delay , or use unbounded sequence numbers and double-word fetch-and-add [17, 24], which is not available on modern machines. We note that although we are able to improve previous results for memory reclamation, stacks, and queues by simply applying our constant time acquire, for reference counting we have to change previous algorithms. In particular, we cannot use sticky counters [31, 21, 12] since they need a lock-free CAS-loop instead of a single FAA for updating the reference count. Our implementation avoids sticky counters by allowing multiple retires of the same resource. This was not allowed in previous interfaces.
Next, we generalize the approach used in reference counting collection to support objects that are managed with ownership—i.e., every object (resource) has a single owner, but objects can be copied or moved to another owner, or destructed when the object is dropped or the owner terminates. This generalization is particularly useful in programming languages, such as C++ and Rust, that manage all objects with ownership and automatically insert move, copy and destruct methods into code at the appropriate locations (e.g., assignments, function calls, and end of scope) . The race that we want to protect against is between the copy and the destruct. We assume that the copies and destructs are safe if there are no such races, which we refer to as race-free safe (Section 5). We say a destruction is delayed between when the ownership of an object is released, i.e., ready to be destructed, and when it is actually destructed. We present the following result.
Result 5 (Copy and Destruct): For race-free safe methods, the copy, move and destruct methods can be applied safely on shared mutable locations with
$O(1)$ expected time overhead per operation,
total space overhead, delayed destructions,
single-word (pointer-width) read, write, CAS, and FAS.
As far as we know, we are the first to consider this general form of protection for read-destruct races. Reference counting is a special case in which the copy increments the count and the destruct decrements it. The approach can also, for example, be used to safely manage the widely used C++ standard template library (STL) vectors and strings, as well as any other data structure that requires a deep copy.
In addition to the theoretical results mentioned above, we have implemented most of the techniques in C++. The code is, perhaps surprisingly, simple and elegant. For the copy and destruct idea, we develop a weak_atomic that can wrap any type and make it safe for copy and destruct if it satisfies the copy-safe assumptions (almost always true). The weak_atomic can be wrapped around C++ shared pointers to give the equivalent of atomic shared pointers, but with constant time instead of using locks or being just lock-free. In Section 6 we describe some simple benchmarking that shows that weak_atomic wrapped around C++ shared pointers performs well in practice.
Model and Assumptions.
We assume the standard concurrent shared memory model with asynchronous processes and sequential consistency. Appropriate fences are needed for weaker memory models, and are included in our implementation. We use the standard definitions of wait-free, lock-free and linearizability, invocation, and response. Throughout, when we talk about the time of an operation, we mean the number of instructions (both local and shared) performed by that operation before it completes. By pointer-width, we mean a word that is just big enough for a pointer, i.e., not even an extra bit hidden somewhere. By space we mean the number of words including both shared and local memory. The “expected time” bounds we give are purely due to hashing. Beyond reads and writes, we consider three other atomic read-modify-write primitives: compare_and_swap (CAS), fetch_and_store (FAS), and fetch_and_add (FAA). All three instructions are supported by modern processors.
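As an illustration only, the three read-modify-write primitives named above correspond closely to operations on C++ std::atomic. The wrapper names cas, fas, and faa below are hypothetical and chosen for this sketch; the default memory order of std::atomic operations is sequentially consistent, which matches the model assumed here.

```cpp
#include <atomic>
#include <cassert>

// compare_and_swap: install `desired` only if the location holds `expected`.
// Returns true on success.
bool cas(std::atomic<long>& loc, long expected, long desired) {
    return loc.compare_exchange_strong(expected, desired);
}

// fetch_and_store: unconditionally install `desired`, returning the old value.
long fas(std::atomic<long>& loc, long desired) {
    return loc.exchange(desired);
}

// fetch_and_add: add `delta`, returning the value held before the addition.
long faa(std::atomic<long>& loc, long delta) {
    return loc.fetch_add(delta);
}
```

Unlike CAS, neither FAS nor FAA can fail, which is why an algorithm that needs only these two can avoid retry loops.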
The Acquire-Retire interface can be used to efficiently protect against read-destruct races and, as we will see in the following sections, can solve a handful of important problems. The interface consists of four operations acquire, release, retire, and eject. The acquire and release operations are used to protect and unprotect resources. The retire and eject operations are used to indicate when a resource is ready to destruct, and when it is safe to destruct, respectively.
We assume that resources are accessed via a single-word handle. In the uses in this paper the handle is always a memory pointer, but in general it need not be; e.g., it could be a key into a hash table, or an index into an array. In the following discussion we say an operation $a$ is linked to another operation $b$ if $a$ maps to just $b$, but there could be many operations that link to $b$. We say $a$ and $b$ are paired if they each are only matched to the other. We now formally define the interface.
Definition 2.1 (Acquire-Retire Specification).
The acquire-retire interface supports the following four operations, where T is the type of a resource handle.
T acquire(T* ptr, int i): returns the resource handle in the location pointed to by ptr and “protects” it.
void release(int i): paired with the previous acquire(*, i) on the same process. Releases the resource protected by that acquire.
void retire(T h): used after an atomic update on a location that overwrites a resource handle h. The update is paired with the retire.
T eject(): it is either paired with a previous retire(h) and returns h, or not and returns .
We say that an acquire(p, i) links to the next update on location p (if any). This update is then paired with a retire (if any) which is paired with an eject (if any). By transitivity, every acquire is linked to at most one retire, and the resource handle returned by the acquire is equal to the handle passed to its linked retire. The interface guarantees that for any acquire, if it is linked to an eject, then the release paired with the acquire must have happened before the eject.
We note that the guarantee captures our intuition of what the interface is supposed to protect against. In particular, it ensures that any destruct of a resource placed after the eject will happen after all processes release that resource.
Our interface allows multiple concurrent retires of the same handle, which leads to a subtle point: every acquire can be linked to (at most) one retire, but a retire(h) has the possibility of being paired with any future eject, many of which could return the same handle, i.e., one cannot tell from the handle returned by an eject which retire it is matched with. Our guarantee is simply that there exists a matching of retires to ejects that satisfies the condition.
In most use cases, each process only needs to protect a single resource at a time, so we omit the parameter i in acquire and release. We assume that all four operations are atomic (i.e. linearizable).
2.1 Related Interfaces
The acquire-retire interface is similar to the interfaces for hazard-pointers  and pass-the-buck . A key difference is that those interfaces have a non-atomic acquire. This means that after their “weak” acquire, they have to check the location to ensure the value has not changed in the meantime. If it has, they have to try again. For this reason a “strong” acquire, which we support, would be lock-free but not wait-free in their interface. Another important difference is that our interface allows multiple concurrent retires on the same handle, but since theirs was designed specifically for memory reclamation, theirs insists that a handle is ejected before it can be retired a second time. Allowing concurrent retires enables us to implement a reference counting collector that uses fetch_and_add instead of compare_and_swap for incrementing and decrementing counters. It also allows us to define a copy-destruct interface that permits multiple destructs (e.g. an object that requires two destructs to kill it). The interfaces also have some minor less important differences—e.g., in pass-the-buck, the eject is part of the retire.
Read Copy Update (RCU) and epoch-based reclamation also serve a similar purpose and have a somewhat similar interface. Instead of acquiring and releasing individual resources, however, they read_lock and read_unlock regions that are not paired with any particular resource. They guarantee that everything that is retired, by any process, during the protected (locked) region will be ejected after the region is finished. In some cases, this can make protection easier. On the other hand, it means a retired resource cannot be destructed until all protected regions that overlap the retire finish. Most of the regions could have nothing to do with the particular retire. This implies that the memory required by RCU and epochs can be unbounded. The Linux implementation of RCU mitigates this problem by disabling interrupts during a locked region and asking the user to ensure that they hold the read lock very briefly. Brown gives bounds for a similar method. These approaches require OS support. Variants of epochs based on intervals [26, 32] can be used to bound the memory, but are not wait-free, and the memory bound is very large: where is the most memory that was live (allocated but not retired) at any time.
Ben-David et al. describe an interface similar to acquire-retire, and use it for multi-versioning. It has an acquire, release, and set, where the set is equivalent to a CAS followed by a retire, and the release includes the eject. They say an implementation of the interface is precise if the last release that holds a resource (version) ejects the resource. They describe a precise data structure that has constant time acquire, and time release. It requires a two-word wide CAS and uses unbounded counters. Our implementation of acquire-retire is not precise, but we get $O(1)$ time release, and without two-word CAS or unbounded counters. We conjecture it is not possible to be both precise, and $O(1)$ time for all operations of the acquire-retire interface.
2.2 Implementing the Acquire-Retire Interface
Lock-free acquire with constant time release, retire, and expected constant time eject can be implemented with techniques similar to the ones used in Hazard Pointers . To get acquire down to constant time, we leverage a recently proposed primitive called swcopy . The swcopy atomically copies from one location to another location, but requires that the destination location is only written to by a single process. This primitive can be implemented in constant time with space overhead .
We begin by implementing the lock-free version. As with Hazard Pointers, for a process to protect resources, it owns slots in a shared announcement array . The total number of slots in is therefore . A process uses its slots to protect resources by placing their handles in these slots.
A lock-free acquire(T* ptr, i) begins by reading *ptr and writing the result in its ith announcement slot. To check if the announcement happened “in time”, it reads *ptr again to see if it has changed. If it’s still the same, then the announced handle has been successfully protected, and can be returned. Otherwise, the acquire has to restart from the beginning.
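The announce-and-validate loop just described can be sketched as follows. This is a minimal illustration and not the code of Figure 3: the sizes kProcs and kSlots, the explicit pid parameter, and the name acquire_lockfree are assumptions made for this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

using Handle = void*;

// Hypothetical sizes for this sketch.
constexpr std::size_t kProcs = 4;  // number of processes
constexpr std::size_t kSlots = 2;  // announcement slots owned by each process

// Shared announcement array; process pid owns slots
// [pid * kSlots, (pid + 1) * kSlots). Static storage starts out as nullptr.
std::atomic<Handle> announce[kProcs * kSlots];

// Lock-free acquire: read the handle, announce it, then re-read the location
// to check that the announcement happened "in time". Retry if it changed.
Handle acquire_lockfree(std::atomic<Handle>* ptr, std::size_t pid, std::size_t i) {
    std::atomic<Handle>& slot = announce[pid * kSlots + i];
    for (;;) {
        Handle h = ptr->load();
        slot.store(h);
        if (ptr->load() == h) return h;  // still the same: h is protected
    }
}

// release: clear the process's ith announcement slot.
void release(std::size_t pid, std::size_t i) {
    announce[pid * kSlots + i].store(nullptr);
}
```

The loop has no bound on its number of iterations when other processes keep updating the location, which is why this version is lock-free but not wait-free.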
A release(i) operation unprotects by simply clearing the process’s ith announcement location, and retire(T x) simply adds x to a process-local retired list called rlist. To determine which handles are safe to eject, we implement an ejectAll(rl) operation which first loops through the announcement array and makes a list of all the handles that it sees. We call this list of handles plist for “protected list”. If a handle is seen multiple times in the announcement array, then it will also appear that many times in plist. Next, ejectAll computes a multi-set difference between rl and plist (denoted rl \ plist). The result of this multi-set difference is the collection of handles that can be safely ejected without violating the specification in Definition 2.1. It is important that we keep track of multiplicity and perform a multi-set difference because, when there are multiple retires of the same handle, each occurrence of this handle in the announcement array might be linked to a different retire. So if a handle appears $s$ times in rlist and $t$ times in the announcement array, it is safe to eject only $\max(s - t, 0)$ copies of this handle.
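To make the multi-set difference concrete, the following sketch computes rl \ plist with a hash table of multiplicities. The function name eject_all and its signature are assumptions for this example; in particular, it takes plist as an already collected snapshot of the announcement array rather than scanning the array itself.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

using Handle = const void*;

// Returns the handles that are safe to eject and leaves in `retired` the
// handles that must stay retired. Each occurrence of a handle in `plist`
// blocks exactly one occurrence of that handle in `retired`.
std::vector<Handle> eject_all(std::vector<Handle>& retired,
                              const std::vector<Handle>& plist) {
    // Count how many times each handle is currently announced.
    std::unordered_map<Handle, std::size_t> protected_count;
    for (Handle h : plist)
        if (h != nullptr) ++protected_count[h];

    std::vector<Handle> safe, still_retired;
    for (Handle h : retired) {
        auto it = protected_count.find(h);
        if (it != protected_count.end() && it->second > 0) {
            --it->second;                // this copy is matched to an announcement
            still_retired.push_back(h);
        } else {
            safe.push_back(h);           // no unmatched announcement remains
        }
    }
    retired.swap(still_retired);
    return safe;
}
```

For example, if a handle is retired three times and announced twice, exactly one copy is returned as safe and two remain in the retired list.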
There are two ways of performing the multi-set difference. The most general method is to use a hash table and this would result in a expected time ejectAll(rl), where is the size of rl. If ejectAll(rlist) is run once every retires, then we can guarantee that rlist does not become too big. This is because only handles can be protected at any time, so if rl has more than handles, then ejectAll(rl) will add at least of them to the free list. The size of rlist is always upper bounded by .
An eject is essentially a deamortized version of ejectAll. Every time it is called, it performs a small (expected) constant number of steps towards ejectAll(rlist). When ejectAll returns a list of handles, they get stored in a local free list to be returned one at a time by the following ejects. If eject is called after every retire, then an entire ejectAll() will complete every calls to eject. To ensure that this is the case, eject does not break up hash table operations. Thus, it takes expected constant time.
Finally, we turn our attention to making acquire constant time. This can be done by making the read of *ptr and the write to the announcement array appear to happen atomically. This is exactly the functionality provided by the Single-Writer Copy (swcopy) primitive. To use this primitive, we replace each element of the announcement array with a Destination object. These Destination objects support read, write and swcopy(T* src), and they are single-writer, which means that only one process is allowed to write and copy into each object. The swcopy primitive takes a pointer to an arbitrary memory location and appears to atomically copy the word from that memory location into the Destination object. Blelloch and Wei present an implementation of Destination objects using space such that read, write and swcopy all take constant time. In our implementation, we use such objects for the announcement array.
Pseudo-code for this implementation appears in Figure 3.
Proof outline of Result 1.
We need to prove that for an acquire, the eject that it links to (if any) happens after the release it is paired with. We consider an acquire returning h, the release it is paired with, the update it is linked with, the retire(h) that the update is paired with, and the eject that the retire is paired with. Since swcopy is atomic, the acquire is linearized at the copy point. After this point and before the release, the announcement array will contain the protected handle h. The next update in linear order on the same location is linked to the acquire, and this is in turn possibly paired with a retire(h). If the retire happens after the release, then clearly any paired eject will happen after the release, so we assume the retire(h) happens before the release. Between the retire and the release the announcement array contains a copy of h for the specific acquire (it could also hold other copies). Because of the multi-set difference applied between the retired list and announcement array, every h in the announcement array can be paired with up to one element in the retired list of a process, and that one will not be ejected. Therefore, by the pigeonhole principle, any that are ejected are no longer protected. Hence, the eject linked to an acquire can only happen after the paired release that removes h from the announcement array.
The time for acquire, release, and retire is constant. If each resource comes with an extra bits, then eject can also be implemented in constant time. Otherwise, it is only expected constant time. If eject is called after each retire, the space is bounded by since each process requires at most space. We say that a retire-eject pair is delayed between the retire and the eject. The number of delayed retire-eject pairs is bounded by since each process can have at most in its retired list, in a partially completed ejectAll, and in the results from the previous ejectAll that have not been ejected yet. The implementation only uses atomic single-word read, write and CAS.
3 Memory Reclamation
The memory reclamation problem is the special case of read-destruct races where the resource is a pointer to a block of memory, and the destruct is a call to the free of that block—i.e., to return the block to a pool for later reuse. In this context, the race is sometimes called a read-reclaim race  and has been widely studied [23, 4, 12, 10, 9, 26, 32]. We note that in this special case, there can only be one concurrent destruct on a resource (block) otherwise we would double free a block.
Within the context of memory reclamation there are two classes of interfaces that both protect against read-reclaim races. In the first, supported by hazard pointers and pass-the-buck, individual memory blocks are protected. We refer to this as protected-block memory reclamation. In the second, which includes RCU and epochs, regions of code are protected along with all the blocks that are read in the region. This can be referred to as protected-region memory reclamation. There is a tradeoff between the interfaces. The first can support bounds on the number of resources used even if threads stall or fail, while the second can more easily protect a large collection of objects. Here, we are interested in the protected-block version. In this case making memory access safe for read-reclaim races just involves wrapping the use of a memory block in an acquire-release on the block, and splitting the reclamation of memory into a retire and then an eject that frees the pointer at some later point when it is safe. The interface and code are shown in Figure 4.
We say that a memory block pointed to by a pointer is delayed if safe_free has been called on the pointer but the corresponding free has not yet been called. Proper usage of protected-block memory reclamation requires that while a pointer is delayed, no protected_read(loc) can use that pointer (i.e., it cannot be stored in loc), and no safe_free can be called on the pointer. The first corresponds to a use after free and the second to a double free.
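The shape of protected_read and safe_free on top of acquire-retire can be sketched as follows. This is an illustration under simplifying assumptions and not the code of Figure 4: it uses one announcement slot per process, the lock-free announce-and-validate acquire as a stand-in for the constant-time swcopy-based acquire, and a non-deamortized eject that scans the whole announcement array. The names Block, kProcs, live_blocks, and the explicit pid parameters are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

constexpr std::size_t kProcs = 4;     // hypothetical number of processes
std::atomic<int> live_blocks{0};      // instrumentation for this example only

struct Block {
    int value;
    explicit Block(int v) : value(v) { ++live_blocks; }
    ~Block() { --live_blocks; }
};

std::atomic<Block*> announce[kProcs]; // one announcement slot per process
std::vector<Block*> rlist[kProcs];    // process-local retired lists

// Stand-in acquire: lock-free announce-and-validate (not constant time).
Block* acquire(std::atomic<Block*>* loc, std::size_t pid) {
    for (;;) {
        Block* b = loc->load();
        announce[pid].store(b);
        if (loc->load() == b) return b;
    }
}
void release(std::size_t pid) { announce[pid].store(nullptr); }
void retire(Block* b, std::size_t pid) { rlist[pid].push_back(b); }

// Simplified eject: returns one retired block that no announcement can be
// matched to, or nullptr if every retired block is still matched.
Block* eject(std::size_t pid) {
    std::unordered_map<Block*, std::size_t> prot;
    for (auto& slot : announce)
        if (Block* b = slot.load()) ++prot[b];
    std::vector<Block*>& rl = rlist[pid];
    for (std::size_t j = 0; j < rl.size(); ++j) {
        auto it = prot.find(rl[j]);
        if (it != prot.end() && it->second > 0) { --it->second; continue; }
        Block* b = rl[j];
        rl[j] = rl.back();
        rl.pop_back();
        return b;
    }
    return nullptr;
}

// protected_read: apply f to the block stored in *loc while it is protected.
// f must return a value in this sketch.
template <class F>
auto protected_read(std::atomic<Block*>* loc, std::size_t pid, F f) {
    Block* b = acquire(loc, pid);
    auto result = f(b);
    release(pid);
    return result;
}

// safe_free: retire the block, then free at most one block that is safe.
void safe_free(Block* b, std::size_t pid) {
    retire(b, pid);
    if (Block* e = eject(pid)) delete e;
}
```

Note that safe_free may free a block other than the one it was given: the block passed in stays delayed for as long as some process has it announced.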
Proof outline of Result 2.
We first consider safety—i.e., with proper usage, a block is never freed during a protected read. This follows from the proper usage requirement that if the block is protected after it is retired, the protected regions must have been acquired before the retire. Due to the semantics of acquire-retire, the eject cannot happen until after these protected regions are all released, and hence the free will happen after all releases and is safe.
We now consider the five properties. The first two about constant time overhead follow directly from the acquire-retire results. The space plus the per block (3) also follow from the acquire-retire results. The number of delayed blocks is bounded by the delayed retires given by the acquire-retire result (4). The implementation uses only the primitives needed by acquire-retire (5).
Stacks and Queues.
Protected-block memory reclamation can be used to implement lock-free stacks and queues with a single-word CAS and without unbounded tags [23, 12]. The standard ABA problem is avoided since a link (memory block) in the structure cannot be recycled while someone is doing a protected access on its pointer. However, by doing this, even just peeking at the top of a stack or the front of a queue requires an acquire. Therefore, hazard pointers and pass-the-buck support only lock-free peeks. By using our constant time acquire algorithm, we get constant time peeks, while preserving the time for updates. The code for peek, push, and pop for stacks is shown in Figure 5. The push and pop are effectively the same as previous results [23, 12], but the peek ensures constant time. We assume the allocate leaves enough space for tags. The code for enqueue and dequeue on queues is again the same as the previous results and the peek is easy to implement (not shown).
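The structure of peek, push, and pop over protected reads can be sketched as below. This is not the code of Figure 5. It is a plain Treiber-style stack that assumes one announcement slot per process and uses the lock-free announce-and-validate acquire as a stand-in, so the peek in this sketch is lock-free rather than constant time; substituting a constant-time acquire is what would give the bound stated above. The names Node, Stack, kProcs, and the pid parameters are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

constexpr std::size_t kProcs = 4;     // hypothetical number of processes

struct Node { int value; Node* next; };

std::atomic<Node*> announce[kProcs];  // one announcement slot per process
std::vector<Node*> rlist[kProcs];     // process-local retired lists

// Stand-in acquire: lock-free announce-and-validate.
Node* acquire(std::atomic<Node*>* loc, std::size_t pid) {
    for (;;) {
        Node* n = loc->load();
        announce[pid].store(n);
        if (loc->load() == n) return n;
    }
}
void release(std::size_t pid) { announce[pid].store(nullptr); }
void retire(Node* n, std::size_t pid) { rlist[pid].push_back(n); }

// Simplified eject: one unprotected retired node, or nullptr.
Node* eject(std::size_t pid) {
    std::unordered_map<Node*, std::size_t> prot;
    for (auto& slot : announce)
        if (Node* n = slot.load()) ++prot[n];
    std::vector<Node*>& rl = rlist[pid];
    for (std::size_t j = 0; j < rl.size(); ++j) {
        auto it = prot.find(rl[j]);
        if (it != prot.end() && it->second > 0) { --it->second; continue; }
        Node* n = rl[j];
        rl[j] = rl.back();
        rl.pop_back();
        return n;
    }
    return nullptr;
}

struct Stack {
    std::atomic<Node*> top{nullptr};

    void push(int v) {
        Node* n = new Node{v, top.load()};
        // On failure, compare_exchange_weak reloads the current top into n->next.
        while (!top.compare_exchange_weak(n->next, n)) {}
    }

    // peek: a single protected read of the top node.
    bool peek(std::size_t pid, int& out) {
        Node* t = acquire(&top, pid);
        bool ok = (t != nullptr);
        if (ok) out = t->value;
        release(pid);
        return ok;
    }

    bool pop(std::size_t pid, int& out) {
        for (;;) {
            Node* t = acquire(&top, pid);
            if (t == nullptr) { release(pid); return false; }
            Node* expected = t;
            // t is protected, so it cannot be freed and reused: no ABA here.
            bool won = top.compare_exchange_strong(expected, t->next);
            if (won) out = t->value;
            release(pid);
            if (won) {
                retire(t, pid);                       // safe_free of the node
                if (Node* e = eject(pid)) delete e;
                return true;
            }
        }
    }
};
```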
Proof outline of Result 3.
The correctness and lock-free updates follow from previous work [23, 12]. The time bounds for peek (1) follow from the fact that the protected_read takes $O(1)$ time. The space includes a constant number of words for each stack or queue (), plus three pointers per node in the linked lists for the tag, the value, and the next pointer (), plus the space overhead of protected-block memory reclamation (). The total is as claimed (2). We note that for stacks the constant in the is one. The implementation only uses single-word atomics (3).
4 Reference Counting
Reference counting (RC) garbage collectors keep a count associated with every memory block managed by the collector, indicating how many pointers reference the block. Whenever a pointer to the block is copied, this count is incremented, and whenever a pointer is overwritten or destructed, the count is decremented. If the count goes to zero, the block itself is destructed, which in addition to freeing memory will destruct any pointers within the block, possibly causing more blocks to be destructed. Reference counting is widely used both in collected languages (e.g., Python and Swift) and in non-collected languages managed by ownership (e.g., C++ with std::shared_ptr and Rust with Rc and Arc pointers). With ownership, the counters can be automatically incremented and decremented by copy and destruct methods.
Compared to other garbage collection schemes, such as stop-and-copy and mark-and-sweep, reference counting has the advantage that it more aggressively collects garbage, often as soon as it becomes unreachable. However, it has the disadvantage that it cannot collect cycles in the memory graph. There are techniques to alleviate this problem, such as weak pointers or using separate collectors for cycles.
In this paper, we are interested in using RC in the concurrent setting. Reference counters can be incremented and decremented concurrently with an atomic FAA instruction. If the references are thread private or immutable, this is all that is needed. However, if the references are held in locations that can be simultaneously read by one thread and updated by another, which is true in just about any lock-free data structure, then there can be a read-destruct race. In particular the read will read the location, increment the count, and return the location, while an update will overwrite the location and then decrement the count of the old value, possibly collecting it if the count goes to zero. If an update, decrement and collect splits the read and increment, then the memory block will be collected before it is incremented and returned, hence incrementing a collected counter and returning a pointer to garbage.
The read-destruct problem for reference counts is well understood. Valois, Michael and Scott [31, 21] first developed a lock-free approach to prevent it. Their approach, however, can increment the counter of freed memory and requires a CAS to update the counter, which could fail and need to be repeatedly retried. Detlefs et al. describe a lock-free method that avoids these two issues, but it requires a DCAS (two-word CAS), which is not supported by any machine. Herlihy et al. are able to remove the DCAS assumption, leaving us with a lock-free solution to the problem with just single-word CAS, but it still requires a CAS loop instead of a FAA because it needs a sticky counter. Most importantly, all these solutions are lock-free but not wait-free. In particular, a thread trying to read a pointer could retry indefinitely as other threads update or copy the pointer.
Sundell developed a wait-free solution . However, like the Valois-Michael-Scott method, it can increment freed memory, making it inappropriate in many practical situations. Also, and perhaps more critically, the approach requires time to retire each location, which is expensive. RCU and epochs can also be used, but with no bound on memory usage since GC can be delayed arbitrarily. From a practical standpoint, the C++ standard library supports atomic shared pointers, which are reference counted pointers that are safe for concurrency. The standard implementations used in GCC and LLVM are based on locks. However, there is at least one library that supports a lock-free solution . It is based on split reference counters . It requires double-word CAS and is lock-free, but not wait-free. A similar technique was independently developed by Lee and generalized by Plyukhin . Their version requires atomic double-word FAA on a location containing both a pointer and an unbounded sequence number. Unlike double-word CAS, double-word FAA is not supported by modern machines.
As far as we know, no previous work simultaneously supports constant time access, has bounded memory overhead, and uses only instructions supported by hardware. Here, we support constant time operations on mutable pointers to reference counted objects. We do this by protecting the read and increment of a reference pointer with an acquire/release, and using a retire instead of decrement when overwriting an old pointer. We ensure that no more than locations per process ( total) have been retired but not decremented.
Figure 6(a) gives a safe implementation of a mutable reference to a counted object. It assumes the object has an add_counter method that can be used to atomically increment or decrement the counter. The copy constructor (line 6), copies a pointer by acquiring the pointer, incrementing the count of the object being pointed to, setting p to the copied pointer, and then releasing the pointer. The update function (line 6) uses a fetch_and_store to swap in the new value and return the old value. It then retires the old value and runs an eject step. The eject can return a previously retired pointer. If it does, then the pointer is decremented. Line 6 is needed since C++ will insert a destructor for b on exit of update and we do not want this to decrement the counter for what is now stored in p.
The with_ptr function (line 6) takes a function as argument, and applies the function to the raw object pointer. As with the copy, this is wrapped in an acquire and release to protect the access. The decrement (line 6) attempts to garbage collect by decrementing its count and checking if it went to zero (the result returned by add_counter is the old value). If it went to zero, it deletes . The destructor for can then destruct other reference counted pointers, potentially recursively collecting more objects. Note that the destructor (line 6) calls collect immediately instead of calling the delayed retire. This is because in a proper program, when the pointer itself is being destructed, no one else will hold a copy of the pointer so a read-destruct race cannot occur.
Figure 6(b) gives an example use of ref_ptr for binary trees. A tree node consists of the reference count, a value, and a left and right child. The constructor sets the counter to 0, and sets the value and children appropriately. The add_counter atomically updates the count. The example below the node definition creates a tree with three nodes. It then concurrently updates and copies the tree. This would normally create a read-destruct race and would be unsafe (e.g., using STL shared_ptr). It is, however, safe with our version. After the join point, and depending on the outcome of the race, Tr2 will either contain the original tree or the new singleton tree. If it contains the new tree, then the original was retired by the update and the reference count on the root of the new tree is 2.
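The shape of the protected copy and the retiring update can be sketched as follows. This is not the code of Figure 6, and it does not use the paper's constant-time wait-free acquire-retire. As a stand-in it swaps in a simple announcement-based scheme in the style of hazard pointers, whose acquire is only lock-free and whose eject scans all announcements. Thread ids are passed explicitly, and `kMaxThreads`, `Obj`, `AcquireRetire`, `RefCell`, and the `deleted` counter are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kMaxThreads = 4;             // assumption: small fixed thread count

struct Obj {                               // hypothetical counted object
    std::atomic<long> count{1};
    long add_counter(long d) { return count.fetch_add(d); }  // returns old value
};

std::atomic<int> deleted{0};               // lets a caller observe collection

void decrement(Obj* o) {
    if (o != nullptr && o->add_counter(-1) == 1) { delete o; deleted.fetch_add(1); }
}

// Stand-in for acquire-retire: one announcement slot and one private
// retired list per thread (NOT the paper's wait-free construction).
struct AcquireRetire {
    std::atomic<Obj*> announce[kMaxThreads];
    std::vector<Obj*> retired[kMaxThreads];

    AcquireRetire() { for (auto& a : announce) a.store(nullptr); }

    Obj* acquire(int tid, const std::atomic<Obj*>& loc) {
        Obj* p = loc.load();
        for (;;) {                         // lock-free retry loop
            announce[tid].store(p);
            Obj* q = loc.load();
            if (q == p) return p;          // p was in loc while announced
            p = q;
        }
    }
    void release(int tid) { announce[tid].store(nullptr); }
    void retire(int tid, Obj* p) { if (p) retired[tid].push_back(p); }

    // Returns a retired pointer that no thread announces, or nullptr.
    Obj* eject(int tid) {
        auto& r = retired[tid];
        for (std::size_t i = 0; i < r.size(); ++i) {
            bool announced = false;
            for (auto& a : announce) announced = announced || (a.load() == r[i]);
            if (!announced) { Obj* p = r[i]; r[i] = r.back(); r.pop_back(); return p; }
        }
        return nullptr;
    }
};

struct RefCell {                           // mutable reference to a counted object
    AcquireRetire& ar;
    std::atomic<Obj*> p;
    RefCell(AcquireRetire& a, Obj* init) : ar(a), p(init) {}

    // Protected copy: the increment happens between acquire and release.
    Obj* copy(int tid) {
        Obj* o = ar.acquire(tid, p);
        if (o) o->add_counter(+1);
        ar.release(tid);
        return o;
    }
    // Swap in `fresh`, retire (rather than decrement) the old value, then
    // decrement whatever eject reports as safe, if anything.
    void update(int tid, Obj* fresh) {
        ar.retire(tid, p.exchange(fresh));
        decrement(ar.eject(tid));
    }
};
```

In this sketch, if thread 0 holds an acquire on an object while thread 1 calls `update`, the eject inside `update` returns nothing and the decrement is deferred until a later eject that runs after thread 0's release.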
Proof outline of Result 1.
We first consider safety. From Definition 2.1 of the acquire-retire interface, every acquire is linked to at most one retire. This happens through the update that overwrote the acquired handle. Furthermore, the linked eject will happen after the release paired with the acquire. Since the decrement happens after the eject, the decrement must happen after the protected region of the acquire. Therefore, during any copy of a pointer, or protected read using with_ptr, a decrement due to the next update of the location cannot happen until after the protected region. Hence, the counter will never go to zero and be collected during the protected region.
We now consider the four properties from Result 1. (1) References are just pointers, as claimed. The expected time (2) and space (3) follow directly from the acquire-retire results. Note that we cannot use the worst-case version since we have multiple retires on a single resource. Similarly, the number of delayed decrements is at most (3) because there are at most delayed retires. The implementation uses the primitives needed by acquire-retire and additionally a FAA for incrementing and decrementing the reference count (4).
In the next section we generalize the idea we used in reference counters to a broad set of resources that are based on copying and destructing. The generalized approach actually has a practical advantage over what we described even in the context of reference counting in that it more cleanly separates private or immutable pointers to reference counted objects from shared ones. This allows it to avoid an acquire/release on pointers that are private or immutable.
5 Copy and Destruct
Languages based on the resource acquisition is initialization (RAII) paradigm (e.g., C++, Ada, and Rust) manage resources (objects and values) by ownership [27, 16]. In this setting, objects have one owner that is responsible for destructing the object when it is overwritten or goes out of scope. This typically uses a destruct method, which can be defaulted or user defined. Most objects also have copy and move methods that are used to make a copy of the object or move the object to a new owner. The compiler inserts calls to the copy, move, and destruct methods throughout the code where needed to ensure single ownership and proper destruction, e.g., at function calls, assignments, the end of lexical blocks, and exception points. (We note that C++ copies by default when passing by value or assigning, while Rust moves by default. In both languages the other choice can be made by an explicit move or copy (called clone), respectively.) Ownership is particularly powerful in languages that are not garbage collected since, if used properly, it avoids any memory leaks or other memory problems, even in the presence of exceptions. Typical examples of objects managed using non-trivial copy, move and destruct methods are strings or vectors in the C++ standard template library (STL). For strings the copy allocates a new block of memory, copies the characters and returns a new header with the length and other information, along with a pointer to the new memory. The move just moves the header, clearing the old location. The destruct will free the memory block, and clear the header. Another example is shared reference counted pointers as discussed in the previous section, where the copy increments the counter, and the destruct decrements the counter and deletes the object pointed to if the counter goes to zero. The approach is also often applied to nested objects such that the copy does a deep copy of the object.
Ownership can also be used for locks (the destruct releases the lock), and file pointers (the destruct closes the file), but in these cases the copy is often disabled.
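As a small illustration of the copy, move, and destruct methods described above, the sketch below defines a string-like type. The class `Buffer`, the global `live_blocks` counter, and `demo_scopes` are hypothetical names introduced here. The type is far simpler than the STL string, but it follows the same pattern: the copy allocates a new block, the move transfers the header, and the destruct frees the block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

int live_blocks = 0;   // illustrative: number of heap blocks currently owned

class Buffer {
    char* data_;
    std::size_t len_;
public:
    explicit Buffer(const char* s)
        : data_(new char[std::strlen(s) + 1]), len_(std::strlen(s)) {
        std::memcpy(data_, s, len_ + 1);
        ++live_blocks;
    }
    // Copy: allocate a new block and copy the characters.
    Buffer(const Buffer& o) : data_(new char[o.len_ + 1]), len_(o.len_) {
        std::memcpy(data_, o.data_, len_ + 1);
        ++live_blocks;
    }
    // Move: take over the header and clear the old location.
    Buffer(Buffer&& o) noexcept : data_(o.data_), len_(o.len_) {
        o.data_ = nullptr;
        o.len_ = 0;
    }
    Buffer& operator=(const Buffer&) = delete;  // omitted to keep the sketch short
    // Destruct: free the block if this object still owns one.
    ~Buffer() {
        if (data_ != nullptr) { delete[] data_; --live_blocks; }
    }
    std::size_t size() const { return len_; }
    bool empty_header() const { return data_ == nullptr; }
};

// The compiler inserts the destructor calls at the end of each scope.
int demo_scopes() {
    Buffer a("abc");
    {
        Buffer b = a;              // copy: a second block is allocated
        Buffer c = std::move(b);   // move: still two blocks, b's header is cleared
        if (live_blocks != 2 || !b.empty_header()) return -1;
    }                              // c frees its block; b owns nothing
    return live_blocks;            // evaluated before a is destructed
}
```

Calling `demo_scopes()` returns the number of blocks still owned just before `a` goes out of scope; after the call returns, every block has been freed without any explicit cleanup code.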
Here, we generalize the approach used for reference counting to other objects managed by ownership. As with reference counts, we are concerned with read-destruct races that can occur when an object is stored in a shared mutable location, where the race is between a load operation (consisting of a read of the location followed by applying the copy on the object), and a store operation (consisting of swapping a new value into the location, and then destructing the old value). As previously mentioned, the problem occurs when the store fully splits the read and copy.
In the following discussion, we assume that all objects have a method that makes a copy of , returning it, and a method that destructs the object , returning nothing. We assume objects of type have an empty instance, denoted as , and that does nothing. We consider the following interface for a mutable reference cell for storing objects of type .
new: creates a cell, initializing it with the given object;
load: reads the object stored in the cell, makes a copy, and returns it;
store: stores the given object into the cell, calling destruct on the old object;
move: writes the given object into the cell and returns the old value; and
delete: deletes the cell and destructs its contents.
We note this interface supports “ownership” since it copies the value on reading it (load), and destructs the owned value when overwriting it (store) or destructing the cell itself (delete). Figure 7(a) gives C++-like code that supports the interface based on copy and destruct. The new is defined as the constructor in the fourth line, and the delete as the destructor in the last line. In the code, we are explicitly inserting the copy and destruct methods—C++ itself will put them in implicitly.
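The interface can be sketched sequentially as below. This is not the code of Figure 7(a), and it offers no protection against read-destruct races. The object type `Counted`, its free functions `copy` and `destruct`, the helper `make_counted`, the `freed` counter, and the class name `Cell` are all assumptions made for illustration. The empty instance is represented by a null payload, on which `destruct` does nothing.

```cpp
#include <cassert>
#include <utility>

int freed = 0;                              // illustrative: payloads deleted so far

// Hypothetical object type: a handle to a counted payload.
struct Counted { int* refs; };              // refs == nullptr is the empty instance

Counted make_counted() { return Counted{new int(1)}; }

// copy: returns a second handle to the same payload.
Counted copy(const Counted& x) { if (x.refs) ++*x.refs; return x; }

// destruct: drops one handle; does nothing on the empty instance.
void destruct(Counted& x) {
    if (x.refs && --*x.refs == 0) { delete x.refs; ++freed; }
    x.refs = nullptr;
}

// Sequential mutable reference cell (unsafe if load and store overlap).
template <typename T>
class Cell {
    T val_;
public:
    explicit Cell(T x) : val_(x) {}         // new: the cell takes ownership of x
    T load() { return copy(val_); }         // read the cell, then copy
    void store(T x) {                       // swap in x, destruct the old value
        std::swap(val_, x);
        destruct(x);
    }
    T move(T x) {                           // write x, hand the old value back
        std::swap(val_, x);
        return x;
    }
    ~Cell() { destruct(val_); }             // delete: destruct the contents
    Cell(const Cell&) = delete;
    Cell& operator=(const Cell&) = delete;
};
```

Here the calls to `copy` and `destruct` are written out explicitly, mirroring the text; with an ordinary C++ value type they would be the copy constructor and destructor that the compiler inserts.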
Our goal is to create an implementation of the interface that protects against a read-destruct race between the load and the destruct in the store. To support this efficiently we assume the type can be atomically read and swapped. This is without loss of generality (although perhaps some loss of efficiency) since if the object is larger than an atomic word, it can be stored indirectly with a pointer to the object, i.e., it can be boxed.
We require that the supplied copy and destruct methods are safe when there are no read-destruct races. In particular we say the copy and destruct are race-free safe if they are safe when a copy and its linked destruct (if any) never overlap in time. The destruct linked with a copy follows the definition of the acquire-retire interface. In particular the copy of an object from a location links with the next update of the object in that location (if any), which is then paired with a destruct of the object. We note that copy and destruct methods are typically race-free safe. This is because it is typically safe to run multiple copy operations on the same source object concurrently since copy often just reads the shared source object, and writes to the local (unshared) copy that is returned. This is true even if the copy is a deep copy. In many cases there is only one destruct, so as long as it does not overlap with the copy it is safe. For shared pointers, the copy updates a shared reference count, but this increment can be done atomically with a fetch-and-add. There can also be overlapping destructs on the same count (since a destruct can be applied multiple times), but these are safe with an atomic decrement. As long as copies do not overlap their linked destruct, the counter will only go to zero when no references are left and there are no ongoing copies.
Assuming the copy and destruct are race-free safe, our implementation can work as shown in Figure 7(b). We note that there are basically only three changes. The first is to wrap the copy that is part of the load inside of an acquire release. The second is to use retire instead of destruct in the store, and add an eject which does the destruct later when safe. The third is to use atomic swaps instead of regular swaps for the store and move. It’s possible to add a CAS to this interface and implement it similarly to store. We note that the destructor for the reference itself can use a destruct instead of retire on since correct use of the reference (or any object) requires that all operations on it respond before invoking the destruct.
Proof outline of Result 1.
The correctness proof follows our other proofs. In particular, if an acquire links to an eject, the eject will happen after the paired release. Therefore, the copy will be finished before the object is destructed. Along with being race-free safe this ensures correctness. The bounds on time, space and delayed destructions also follow from the acquire-retire interface.
In our C++ implementation we have created a templated class:
It supports protected copies and destructs if the methods are race-free safe. For example, it can be used safely for the C++ STL structures shared_ptr, vector, and string. All these have copy and destruct methods that are race-free safe. The weak_atomic interface supports a load, store interface similar to Figure 7, which is similar to the C++ atomic structure. When used as weak_atomic<shared_ptr>, for example, it gives a safe type for reference counting that can be a plug-in replacement for atomic<shared_ptr>. We use this in the experiments in the next section.
6 Reference Counting Experiments
In this section, we present some experimental results for our concurrent reference counted garbage collector presented in Section 4. These experiments are not meant to be a comprehensive evaluation of existing concurrent garbage collection techniques nor of our acquire-retire interface. Instead, their purpose is just to show that our approach is lightweight enough to be used in practice and that it scales well.
Setup. We ran our experiments on a 4-socket machine with 32 physical cores (AMD Opteron 6278 processor), 2-way hyperthreading, 2.4 GHz, 6MB L3 cache, and 200 GB main memory. All our experiments were run up to 32 cores without hyperthreading and we interleaved memory across sockets using numactl -i all. For scalable memory allocation we used the jemalloc library . All our experiments were written in C++ and compiled with g++ version 9.2.1 at optimization level O3.
Workloads. The workloads we ran involve loads and stores on an array of reference counted pointers, each pointing to a different object. This array is padded so that each pointer is on a different cache line. A load involves reading the pointer and creating a new reference counted pointer to that object. A store involves allocating a new object and changing a reference counted pointer to point to that object. The location of the load/store is picked uniformly randomly between to , so smaller means more contention. We show results for (representing a highly contended workload) and (representing a workload with almost no contention). Note that for , less than 1% of the array fits in L3 cache. Stores are performed with probability, and we try three different settings: load only (), load mostly (), and store heavy (). We run each experiment for 3 seconds and report the overall throughput of loads and stores (averaged across 5 runs).
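The operation mix described above can be sketched as follows. This is a single-threaded illustration of the workload shape only, not the benchmark harness used for the experiments: it uses plain std::shared_ptr with no concurrency, a fixed operation count instead of a 3-second timer, and a 64-byte cache line. The names `PaddedSlot`, `OpCounts`, and `run_workload`, and the parameters `ops`, `store_prob`, and `seed`, are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <vector>

// One pointer per cache line (64 bytes assumed) to avoid false sharing.
struct alignas(64) PaddedSlot {
    std::shared_ptr<int> ptr;
};

struct OpCounts {
    std::size_t loads = 0;
    std::size_t stores = 0;
};

// Runs `ops` operations on `slots`. Each operation picks a slot uniformly
// at random and is a store with probability `store_prob`, otherwise a load.
OpCounts run_workload(std::vector<PaddedSlot>& slots, std::size_t ops,
                      double store_prob, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, slots.size() - 1);
    std::bernoulli_distribution is_store(store_prob);
    OpCounts c;
    for (std::size_t i = 0; i < ops; ++i) {
        PaddedSlot& s = slots[pick(rng)];
        if (is_store(rng)) {
            // Store: allocate a new object and point the slot at it.
            s.ptr = std::make_shared<int>(static_cast<int>(i));
            ++c.stores;
        } else {
            // Load: create a new counted reference to the current object.
            std::shared_ptr<int> local = s.ptr;
            ++c.loads;
        }
    }
    return c;
}
```

A smaller array concentrates the operations on fewer slots, which is what raises contention once several threads run this loop against a shared array of concurrency-safe pointers.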
Implementations. We compare two implementations based on our approach with the atomic_shared_ptr implementation in the GNU C++ library . We also compare with a commercial implementation of atomic_shared_ptr from Anthony Williams’s just::thread library . All four implementations provide a similar interface and solve the same problem. The implementation in GNU is lock-based whereas just::thread is lock-free, and according to this post , it uses something similar to the split reference count technique described in [33, Chapter 7.2.4].
For our implementations, we use the weak_atomic class described in Section 5, which can be wrapped around any type to make it safe for copy and destruct if it satisfies the copy-safe assumptions. In one implementation, we wrap weak_atomic around C++’s shared pointer type, std::shared_ptr , to achieve the equivalent of atomic_shared_ptr. We refer to this implementation as Weak Atomic std::shared_ptr. We also implement our own shared pointer type, which is more efficient but less general than std::shared_ptr since it does not support weak pointers. We refer to this implementation as Weak Atomic Custom shared_ptr. In our implementation of Acquire-Retire, we applied a fast-path/slow-path optimization where the acquire operation runs the lock-free version of the acquire for a few tries before switching over to the wait-free version.
Analysis. We selected three graphs from our benchmarks to show in Figure 8. Across all our benchmarks, both versions of our implementations scaled reasonably and performed better than the two competitors. The just::thread library performed better whenever we lowered the contention level or lowered the frequency of stores. The load only workload in Figure 8(a) shows the closest just::thread gets to the throughput of our implementations. The single-core performance of the GNU implementation is actually very close to Weak Atomic std::shared_ptr in all the workloads we tried, but it achieves relatively little speedup (less than 3x on 32 cores). This is because the GNU implementation shares a small number of global locks across all pointers.
In the low contention case (Figures 8(a) and 8(b)) where it is rare for two processes to access the same pointer, we observed that both our implementations scale nearly perfectly, achieving 30-31x speedup on 32 cores. The just::thread library also achieves 31x and 29x speedup on workloads (a) and (b) respectively. While just::thread gets a lot of self-speedup in the store-heavy workload (Figure 8(b)), its single-core performance is over 8x slower than ours. In the high contention case (Figure 8(c)), our implementations still manage to achieve modest scaling.
We have presented a constant-time per operation and bounded-space interface for protecting against read-destruct races, and used it for a handful of important applications. While our specific results are important on their own, we believe the framework is also important. By considering resources in general, our acquire-retire interface generalizes previous interfaces, which focused mostly on the memory-reclamation problem. We allow, for example, multiple retires/destructs on the same resource, and define a general copy-destruct interface that covers a reasonably broad set of applications. The weak_atomic class, based on the copy-destruct interface, is able to implement a concurrent reference counting collector in C++ as easily as weak_atomic<shared_ptr>. Furthermore, our preliminary experimental results indicate that it is fast.
We would like to thank Daniel Anderson for his C++ expertise and the improvements he made to our code.
- (2016) Upper bounds for boundless tagging with bounded objects. In International Symposium on Distributed Computing (DISC), pp. 442–457.
- (2019) Multiversion concurrency with bounded delay and precise garbage collection. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 241–252.
- (2019) LL/SC and atomic copy: constant time, space efficient implementations using only pointer-width CAS.
- (2015) Reclaiming memory for lock-free data structures: there has to be a better way. In ACM Symposium on Principles of Distributed Computing (PODC), pp. 261–270.
- (1960-12) A method for overlapping and erasure of lists. Commun. ACM 3 (12), pp. 655–657.
- (2015) Asynchronized concurrency: the secret to scaling concurrent search data structures. In ACM SIGARCH Computer Architecture News, Vol. 43, pp. 631–644.
- (2002) Lock-free reference counting. Distributed Computing 15 (4), pp. 255–271.
- (2019, accessed November 5, 2019) Scalable memory allocation using jemalloc.
- (2004) Practical lock-freedom. Technical report, University of Cambridge, Computer Laboratory.
- (2008) The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with Linux. IBM Systems Journal 47 (2), pp. 221–236.
- (2007) Performance of memory reclamation for lockless synchronization. J. Parallel Distrib. Comput. 67 (12), pp. 1270–1285.
- (2005-05) Nonblocking memory management support for dynamic-sized data structures. ACM Trans. Comput. Syst. 23 (2).
- (1990) Linearizability: a correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems 12 (3), pp. 463–492.
- (2008) The art of multiprocessor programming. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
- (2005) Efficiently implementing a large number of LL/SC objects. In ACM Symposium on Principles of Distributed Computing (PODC), pp. 17–31.
- RustBelt: securing the foundations of the Rust programming language. ACM Symposium on Principles of Programming Languages (POPL).
- (2010) Fast local-spin abortable mutual exclusion with bounded space. In International Conference On Principles Of Distributed Systems, pp. 364–379.
- (2019, accessed November 5, 2019) The GNU C++ library.
- (2013) RCU usage in the Linux kernel: one decade later. Technical report.
- (2007) Overview of Linux-kernel reference counting. Note: Prepared for the Concurrency Working Group of the C/C++ standards committee.
- (1995) Correction of a memory management method for lock-free data structures. Technical report, Computer Science Department, University of Rochester.
- (2004) ABA prevention using single-word instructions. IBM Research Division, RC23089 (W0401-136), Tech. Rep.
- (2004) Hazard pointers: safe memory reclamation for lock-free objects. IEEE Transactions on Parallel and Distributed Systems 15 (6), pp. 491–504.
- (2015, accessed November 5, 2019) (Website).
- (2019, accessed November 5, 2019) C++ shared pointer.
- (2017) Brief announcement: hazard eras-non-blocking memory reclamation. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 367–369.
- (2015) (Website).
- (2001) Exception safety: concepts and techniques. In Advances in Exception Handling Techniques, A. Romanovsky, C. Dony, J. L. Knudsen, and A. Tripathi (Eds.).
- (2005) Wait-free reference counting and memory management. In International Parallel and Distributed Processing Symposium (IPDPS).
- (1986) Systems programming: coping with parallelism. Technical Report RJ5118, IBM Almaden Research Center.
- (1995) Lock-free linked lists using compare-and-swap. In ACM Symposium on Principles of Distributed Computing (PODC).
- (2018) Interval-based memory reclamation. In ACM SIGPLAN Notices, Vol. 53, pp. 1–13.
- (2012) C++ concurrency in action: practical multithreading. Manning Publ.
- (2019, accessed November 5, 2019) Just::thread concurrency library.
- (2019, accessed November 5, 2019) Why do we need atomic_shared_ptr?.