Synchronized data structures and algorithms expose programmers to many pitfalls, such as deadlock, livelock, and priority inversion.[mAlgorithmsScalableSynchronization1991] Non-blocking data structures and algorithms offer benefits over their synchronized counterparts, chief among them guarantees on liveness, a property describing how threads make progress through a system. One such liveness property is obstruction-freedom[herlihyObstructionfreeSynchronizationDoubleended2003], which guarantees that a thread completes its operation so long as it is not obstructed by some other thread, for example when executing in isolation; in lock-freedom[massalin1992lock], at least one thread is guaranteed to progress and succeed in a bounded number of steps; in wait-freedom[10.1145/114005.102808], all threads are guaranteed to progress and succeed in a bounded number of steps. These properties are attainable in shared memory, where decades of research on non-blocking data structures are available, but to the author’s knowledge few, if any, such data structures exist for distributed memory.
The Partitioned Global Address Space (PGAS) model offers an abstraction of distributed memory systems that gives them semantics very similar to those of shared memory. For example, PUTs and GETs, which are remote direct memory access (RDMA) operations that remotely write and read values in memory without intervention of the CPU, are analogous to shared-memory store and load operations, and can even be modeled as such.[hayashiLLVMbasedCommunicationOptimizations2015] RDMA atomic operations, which are handled entirely by the network interface controller (NIC) on high-performance computing networks such as Mellanox’s Infiniband and Cray’s Gemini and Aries, allow for extremely low latency atomic operations that complete in around a microsecond. Applying non-blocking algorithms and data structures to PGAS is enticing, but there are roadblocks and hurdles that must be dealt with first. For example, the Chapel programming language lacks native support for atomics on arbitrary objects such as class instances, which is necessary for any non-blocking algorithm. In addition, concurrent-safe memory reclamation, that is, the reclamation of memory when arbitrarily many threads could be accessing said memory at any given time, is a real problem in shared memory; perhaps its solutions [lockfreerefcounting, michaelHazardPointersSafe2004, wenIntervalbasedMemoryReclamation2018, hartPerformanceMemoryReclamation2007a] can be directly applied or modified to work in the PGAS domain.
In this work, the hurdle of atomic operations is overcome by providing AtomicObject and the locally optimized variant LocalAtomicObject. Both provide atomic operations on class instances with optional ABA protection, and the former is designed to support RDMA atomics on class instances, making some truly scalable algorithms and data structures possible. Also provided are the EpochManager and LocalEpochManager, which are based on Epoch-Based Reclamation (EBR) [fraser2004practical] and serve as a pseudo garbage collection mechanism that scales not only in shared memory but in distributed memory as well.
II Design & Implementation
In the development of the EpochManager, there were prerequisites that needed to be addressed, such as the need for atomic operations on class instances. Only after overcoming those hurdles is it possible to create the infrastructure and building blocks needed for non-blocking algorithms in both shared memory and distributed memory.
II-A Atomic Objects
In Chapel, atomic operations, which are operations that appear to take place either all at once or not at all, are defined only on integers, Booleans, and real numbers. As of today, there is no official support for atomic operations on class instances, which are represented as widened pointers that contain not only the 64-bit virtual address but also 64 bits of locality information, making up a 128-bit struct. For reasons of portability, Chapel does not implement atomics on this 128-bit representation, making it impossible to create even the most primitive of non-blocking data structures, such as queues, stacks, and linked lists. Furthermore, atomic operations over the network in Chapel rely on Remote Direct Memory Access (RDMA), which currently supports atomic operations only up to 64 bits; this is true of both Infiniband on commodity clusters and Gemini/Aries on Cray-XTs and Cray-XCs respectively.
To develop the EpochManager, it was essential to make it non-blocking so that it did not weaken the non-blocking guarantees of the data structures that it is employed in. The three non-blocking guarantees, from weakest to strongest, are as follows: Obstruction-Freedom, which ensures that if a thread runs in isolation, that is, its progress is not obstructed by any other thread, it will complete in a bounded number of steps; Lock-Freedom, which ensures that at least one thread must complete within a bounded number of steps even when obstructed; and Wait-Freedom, which ensures that all threads must complete within a bounded number of steps, regardless of obstruction. The EpochManager has been made lock-free.

In the initial prototype, which has been adapted into its own independent module called LocalAtomicObject, the locality information is ignored and an atomic holding only the 64-bit virtual address is maintained. As LocalAtomicObject only works in a shared-memory context, the AtomicObject offers pointer compression, which takes advantage of the fact that current processors use only the lowest 48 bits of the virtual address, enabling 16 bits of locality information to be encoded in the 64-bit pointer. This approach only works in distributed setups with fewer than $2^{16}$ compute nodes, which consequently enables RDMA atomics on Cray Aries. (Footnote 1: RDMA atomics are not yet possible on Infiniband networks due to a lack of support in Chapel.) In the event that more than $2^{16}$ compute nodes are used, the implementation falls back to Intel’s CMPXCHG16B instruction, also known as a ‘Double-word-Compare-and-Swap’ (DCAS) operation, which is able to atomically update both the virtual address and the 64 bits of locality information.
Unfortunately, this fallback demotes atomic operations on remote memory from RDMA atomics, which take around a microsecond to complete and do not require intervention of the CPU, to active messages, which are handled entirely by the progress thread of the recipient compute node. Support is restricted to unmanaged class instances: owned and shared types are already wrapped in a record and are larger than 64 bits, and borrowed types are tracked by the compiler in a way that makes support impossible, as their lifetime cannot be tracked statically.
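The pointer compression described above can be illustrated with plain integer arithmetic. The following is a minimal sketch, not the module's implementation: the helper names `compress`, `localeOf`, and `addressOf` are hypothetical, and the virtual address is assumed to already be available as a `uint(64)` (how a class instance is converted to an integer is not shown).

```chapel
// Hypothetical helpers; the names and bit layout are assumptions for illustration.
param addrBits = 48;
const addrMask : uint(64) = (1:uint(64) << addrBits) - 1:uint(64);
const maxLocales : uint(64) = 1:uint(64) << 16;

// Store a 16-bit locale id in the upper bits of a 48-bit virtual address.
proc compress(localeId : uint(64), addr : uint(64)) : uint(64) {
  assert(localeId < maxLocales, "locale id must fit in 16 bits");
  assert(addr <= addrMask, "address must fit in 48 bits");
  return (localeId << addrBits) | addr;
}

// Recover the locale id from the upper 16 bits.
proc localeOf(packed : uint(64)) : uint(64) {
  return packed >> addrBits;
}

// Recover the virtual address from the lower 48 bits.
proc addressOf(packed : uint(64)) : uint(64) {
  return packed & addrMask;
}

// Because the packed value is 64 bits wide, it fits in an ordinary atomic.
const lid : uint(64) = 3;
const otherLid : uint(64) = 5;
const addr : uint(64) = 0x7f123456789a;

var slot : atomic uint(64);
slot.write(compress(lid, addr));
const seen = slot.read();
const swapped = slot.compareAndSwap(seen, compress(otherLid, addr));
writeln(swapped, " ", localeOf(slot.read()), " ", addressOf(slot.read()) == addr);
```

Since the compressed value is a single 64-bit word, every operation on it is one that the 64-bit atomics already support, which is the property the text relies on for RDMA atomics.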
Another problem that had to be overcome was the ABA problem, which occurs in scenarios with at least two threads and typically arises when performing a compare-and-swap operation. In one such scenario, consider an atomic linked list accessed by multiple threads, where a thread $T_1$ reads the head of the list and receives the node with virtual address $A$. Imagine that $T_1$ gets preempted and some thread $T_2$ also reads the head of the list, atomically moves the head of the list forward, and deletes the node such that $A$ is put back on some free-list. Later, some other thread allocates a new node which happens to have the same address $A$ and atomically inserts it at the head of the list; $T_1$ then wakes up and incorrectly succeeds in its compare-and-swap, despite the fact that the head of the list has changed. There are two known ways to solve the ABA problem: use a concurrent memory reclamation system, which is precisely what is being built here and so leads to a chicken-and-egg paradox, or use a DCAS, where a 64-bit counter is held adjacent to the 64-bit word being atomically updated. In the DCAS approach, the counter is incremented after each ABA-dependent operation, which causes a DCAS to fail in the event of the ABA problem, since the 64-bit counter will have changed. The AtomicObject and LocalAtomicObject provide a 128-bit wrapper for 64-bit types, called ABA, in which such a 64-bit counter is held adjacent to the 64-bit virtual address; in conjunction with pointer compression, this provides ABA-free atomic operations on remote objects as well, albeit using remote execution rather than RDMA. Each operation has an ABA variant, which carries the suffix ‘ABA’ and takes the 64-bit counter into account, but the advanced user is free to use both ABA and normal variants interchangeably.
Due to Chapel’s forwarding, it is possible to use the ABA in a seamless manner as if it were the type it is wrapping, as all methods and field accesses will forward to the underlying instance.
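    var node = new unmanaged Node(newObj);
    do {
      // Snapshot the head together with its ABA counter
      var oldHead = head.readABA();
      node.next = oldHead.getObject();
      // Retry if the head or its counter changed in the meantime
    } while (!head.compareAndSwapABA(oldHead, node));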
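To see why an adjacent counter defeats the ABA problem, the following minimal sketch replays the scenario on a single `atomic uint(64)`. It is a scaled-down stand-in under stated assumptions: a 32-bit value and a 32-bit counter share one 64-bit word in place of the 64-bit address and 64-bit counter updated by DCAS, and the helper names `pack`, `valueOf`, `countOf`, and `casABA` are hypothetical.

```chapel
// Assumption: 32-bit value in the low half, 32-bit counter in the high half.
const lowMask : uint(64) = 0xFFFFFFFF;

proc pack(value : uint(64), count : uint(64)) : uint(64) {
  return (count << 32) | (value & lowMask);
}
proc valueOf(word : uint(64)) : uint(64) { return word & lowMask; }
proc countOf(word : uint(64)) : uint(64) { return word >> 32; }

// Replace the value and bump the counter in one compare-and-swap.
proc casABA(ref cell : atomic uint(64), expected : uint(64),
            newValue : uint(64)) : bool {
  const bumped = countOf(expected) + 1:uint(64);
  return cell.compareAndSwap(expected, pack(newValue, bumped));
}

const A : uint(64) = 0xA;
const B : uint(64) = 0xB;
const C : uint(64) = 0xC;
const zero : uint(64) = 0;

var head : atomic uint(64);
head.write(pack(A, zero));

// A slow task takes a snapshot while the value is A.
const stale = head.read();

// Meanwhile, other tasks change A to B and then back to A.
const firstOk = casABA(head, head.read(), B);
const secondOk = casABA(head, head.read(), A);

// The value matches the snapshot again, but the counter has moved on,
// so the slow task's compare-and-swap is correctly rejected.
const sameValue = valueOf(head.read()) == valueOf(stale);
const staleOk = casABA(head, stale, C);
writeln(firstOk, " ", secondOk, " ", sameValue, " ", staleOk);
```

A plain compare-and-swap on the value alone would have succeeded in the last step, which is exactly the failure mode described above.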
II-B Epoch-Based Reclamation (EBR)
Epoch-Based Reclamation (EBR) is a concurrent-safe memory reclamation system which utilizes epochs, which are descriptors for a specific period of time, to determine the quiescence of objects and when they are safe to be reclaimed. Concurrent-safe memory reclamation is a non-trivial problem and is at the very root of non-blocking algorithms and data structures. The problem presented by concurrent access is that it is not easy to know whether or not a thread is accessing data we are interested in deleting, and naively deallocating data can result in undefined behavior from a use-after-free error. That is, once an object is freed, it is normally placed on some type of free-list, where it can be used in some future allocation; a stale access can then cause data corruption in the case of arbitrary writes, or segmentation faults in the case of dereferencing pointers. EBR combats this issue by tracking the epoch that each participating thread is in, where each participating thread must enter an epoch before accessing data and must leave the epoch afterwards. Generally, if a thread is not in an epoch, it is considered quiescent in that it no longer has access to the objects we are interested in at that given moment. A thread inside of an epoch may or may not be accessing the object at that given time, so out of safety the deallocation of said objects is deferred until later.
To delete an object $o$, it is first logically removed from the data structure that it is accessible from, where an example of logical removal would be the removal of a node from a linked list. The logically removed object is then put in a limbo list, which is a list of objects to be reclaimed, associated with a given epoch. More formally, an object $o$ that is associated with an epoch $e$ must not be deleted until it is certain that no thread is in epoch $e$. The only hazard in concurrent memory reclamation occurs when another thread is accessing $o$ while it is being deleted, but the logical removal of $o$ entirely removes it from the data structure, and so only threads that have had access prior to the removal can access $o$. Eventually, once the epoch has been advanced to $e+1$, which occurs after all threads are guaranteed to be in at least epoch $e$ and not epoch $e-1$, it is safe to delete the objects in the limbo list for $e-1$. Note that $o$ is not reclaimed at this point; instead the epoch must advance once more, after which there is utmost certainty that $o$ can safely be reclaimed, as all participating threads were quiescent after the logical removal of $o$, and since $o$ is no longer accessible from the current epoch.
II-C Epoch Manager
The EpochManager is built on top of the notion of epoch-based reclamation and limbo lists, in that objects that are marked for deletion during an epoch are held in limbo until they are safe to be deleted. To implement the limbo lists, it was necessary to implement a non-blocking data structure that was optimized not only for concurrent insertion but also for bulk removal, as all objects in the limbo list are deleted at once, and not incrementally. The limbo list can be viewed as having two phases, an insertion phase, which is entirely concurrent, and a deletion phase, which occur at disjoint times. A somewhat novel but simple data structure has been designed to significantly reduce overall latency: deferring an object for deletion during the insertion phase and removing all objects during the deletion phase are both wait-free, as each is handled in one atomic exchange. Nodes are recycled using a lock-free stack[hendlerScalableLockfreeStack2004] and the ABA protection provided by the AtomicObject.
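    // Insertion phase: a single atomic exchange publishes the new node
    proc push(obj : unmanaged object?) {
      var node = recycleNode(obj);
      var oldHead = _head.exchange(node);
      node.next = oldHead;
    }

    // Deletion phase: a single atomic exchange detaches the entire list
    proc pop() {
      return _head.exchange(nil);
    }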
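The exchange-based limbo list above can be exercised without atomics on class instances by using array indices in place of node pointers. This is a minimal sketch under that assumption, with hypothetical names (`nilIdx`, `popAll`); as in the text, it relies on the insertion and deletion phases being disjoint, since a node's `next` link is written only after the exchange.

```chapel
config const numNodes = 1000;
const nilIdx = -1;                        // plays the role of nil

var head : atomic int;                    // index of the most recently pushed node
var next : [0..#numNodes] int = nilIdx;   // next[i] is node i's successor
head.write(nilIdx);

// Wait-free insertion: one exchange, then link to the previous head.
proc push(idx : int) {
  const oldHead = head.exchange(idx);
  next[idx] = oldHead;
}

// Bulk removal: one exchange detaches the whole list.
proc popAll() : int {
  return head.exchange(nilIdx);
}

// Insertion phase: many tasks defer objects concurrently.
forall i in 0..#numNodes do push(i);

// Deletion phase: runs only after the insertion phase has finished.
var count = 0;
var cur = popAll();
while cur != nilIdx {
  count += 1;
  cur = next[cur];
}
writeln("drained ", count, " of ", numNodes, " nodes");
```

Neither operation ever retries, which is what separates this structure from a compare-and-swap based stack; the cost is that a traversal concurrent with insertion could observe a node whose link has not yet been written.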
The EpochManager is privatized, in that an instance of the EpochManager is created and maintained on each individual locale, and all accesses to the EpochManager, such as field accesses or method invocations, are forwarded to the instance that is local to that locale. That is, even though the EpochManager can be used in distributed contexts, such as in distributed parallel forall loops or inside of remote procedure call (RPC) on statements, all accesses are guaranteed to respect locality. This is achieved by remote-value forwarding and record-wrapping, where a record which holds the data required to look up the instance is passed by value, and not by reference as is the default in Chapel for forall statements, which allows for zero communication when acquiring the privatized instance. This results in a massive speedup, since replication across locales cuts out all unnecessary communication and allows the caching of data or even keeping locale-specific instances of data, and the record-wrapping eliminates an additional round-trip communication required to obtain the metadata needed to find the privatized instance. In practice, this has been observed by the authors to allow distributed objects to no longer be communication bound, that is, bound by the available bandwidth and latency of the network, and allows for some truly scalable algorithms. This technology is not new, either, as it has been used in previous works to create distributed data structures[jenkinsChapelAggregationLibrary2018, jenkinsChapelGraphLibrary2019, jenkinsChapelHyperGraphLibrary2018, jenkinsRCUArrayRCULikeParallelSafe2018], and is used as the backbone for Chapel’s arrays, domains, and distributions.
An array of three limbo lists is maintained on each locale in the privatized instance, one for each of the three epochs that any given thread can be in. Each locale caches its own epoch, which is used when deciding which limbo list to defer deletion of objects to. When it is time to update the global epoch, a task gets elected. Election is handled on a first-come-first-served basis, first via a local atomic flag is_setting_epoch for the task's locale, and then via a global flag for the locales that the EpochManager is distributed over. This stems the unnecessary communication that would arise if multiple tasks across multiple locales attempted to update the global epoch at the same time, as only one of them can succeed in doing so. As each locale has its own individual instance, a class instance wraps the global epoch itself so that there is a single centralized and coherent epoch that all locales can come to a consensus on.
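The interplay between the three limbo lists and epoch advancement can be sketched with simple arithmetic. The numbering below is an assumption made for illustration and is not taken from the implementation: epochs cycle through 1, 2, and 3, with 0 reserved for a token that is not pinned, and the helper names `nextEpoch`, `reclaimEpochFor`, and `canAdvance` are hypothetical.

```chapel
param numEpochs = 3;

// Assumed numbering: epochs cycle 1 -> 2 -> 3 -> 1; 0 means "not pinned".
proc nextEpoch(e : int) : int {
  return (e % numEpochs) + 1;
}

// After advancing from e to newEpoch = e+1, the limbo list for e-1 is safe.
// With three slots, e-1 occupies the same slot as nextEpoch(newEpoch).
proc reclaimEpochFor(newEpoch : int) : int {
  return nextEpoch(newEpoch);
}

// The epoch may advance only if every pinned token is in the current epoch.
proc canAdvance(tokenEpochs : [] int, current : int) : bool {
  for e in tokenEpochs do
    if e != 0 && e != current then return false;
  return true;
}

const allCaughtUp = [0, 2, 2];   // one unpinned token, two pinned in epoch 2
const oneLagging  = [1, 2, 0];   // a token is still pinned in epoch 1

writeln(canAdvance(allCaughtUp, 2), " ", canAdvance(oneLagging, 2));
writeln(nextEpoch(2), " ", reclaimEpochFor(nextEpoch(2)));
```

Under this numbering, an object deferred in epoch 2 sits untouched when the epoch advances to 3 and is reclaimed only on the following advance to 1, which matches the two-step delay described for epoch-based reclamation.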
The EpochManager creates a set of tokens, which are class instances that keep track of the epoch that a task is currently engaged in. Before a task is free to access a data structure that is protected by the epoch-based reclamation provided by the EpochManager, it must first register and obtain one of these tokens, and when it is finished it must unregister and relinquish the token. Two separate lists are maintained for tokens: one which keeps track of free tokens, used when registering and unregistering, and one which lists all allocated tokens, which is scanned to find the minimum epoch. Once registered, the token is not yet in an epoch, and in fact can be reused for multiple operations in the same task as an optimization. A token must be pinned and unpinned just as it must be registered and unregistered, where pinning enters the current epoch and unpinning exits it, and when an object is to be deleted, it is always added to the current epoch associated with the token. The token is itself wrapped in a managed class so that it is automatically unregistered when it goes out of scope, which is useful with task-private variables on forall loops, as shown below.
    var em = new EpochManager();

    // Serial and Shared Memory
    var tok = em.register();
    tok.pin();
    tok.unpin();
    tok.unregister();

    // Parallel and Distributed (forall)…
    forall x in X with (var tok = em.register()) {
      tok.pin();
      tok.deferDelete(x);
      tok.unpin();
    } // automatic unregister
    em.clear(); // Reclaim everything at once.
The EpochManager will not advance the epoch on its own, and requires user intervention to do so. The user is free to call tryReclaim, which attempts to advance the epoch if and only if no token on any locale is in a previous epoch. In addition, since the objects to be deleted can be remote, and since remote deallocation would result in an RPC, a scatter list is constructed that sorts objects by the locales they are allocated on, significantly cutting down unnecessary communication. The tryReclaim method is intended to be invoked on the token or the EpochManager and is a global operation. It is optimized such that if another task is attempting to update the epoch on the current locale, other tasks return swiftly without much wasted effort; if another task is attempting to update the global epoch, the caller also returns after clearing the local flag. The clear method is intended to be invoked directly on the EpochManager and performs the same action as tryReclaim, with the exception that it always reclaims all objects across all epochs, and should be called only when there is a guarantee that no other thread is interacting with the EpochManager.
    proc tryReclaim() {
      // Elect a single task: first on this locale, then globally
      if (is_setting_epoch.testAndSet()) then return;
      if (global_epoch.is_setting_epoch.testAndSet()) {
        is_setting_epoch.clear();
        return;
      }

      // Is it safe to reclaim across all locales?
      const this_epoch = global_epoch.read();
      var safeToReclaim = true;
      coforall loc in Locales with (&& reduce safeToReclaim) do on loc {
        var _this = getPrivatizedInstance();
        for tok in _this.allocated_list {
          var local_epoch = tok.local_epoch.read();
          if (local_epoch != 0 && local_epoch != this_epoch) {
            safeToReclaim = false;
            break;
          }
        }
      }

      if safeToReclaim {
        const new_epoch = (current_global_epoch …
        global_epoch.write(new_epoch);
        coforall loc in Locales do on loc {
          // Update each locale's epoch
          var _this = getPrivatizedInstance();
          _this.locale_epoch.write(new_epoch);

          const reclaim_epoch = _this.getReclaimEpoch();
          var reclaim_limbo_list = _this.limbo_list[reclaim_epoch];
          var head = reclaim_limbo_list.pop();

          while (head != nil) {
            var obj = head.val;
            var next = head.next;
            // Scatter objects to their locale
            _this.objsToDelete[obj.locale.id].append(obj);
            delete head;
            head = next;
          }
          coforall loc in Locales do on loc {
            // Bulk transfer and delete
            var ourObjs = _this.objsToDelete[here.id].getArray();
            delete ourObjs;
          }

          // Clear scatter list
          forall i in LocaleSpace do _this.objsToDelete[i].clear();
        }
      }
      global_epoch.is_setting_epoch.clear();
      is_setting_epoch.clear();
    }
The LocalEpochManager is a shared-memory optimized variant that functions in a similar way to the EpochManager, but differs in that it lacks a global epoch and does not take remote objects into consideration, speeding up computations that do not require epoch-based reclamation support across multiple locales.
III Performance Evaluation
All experiments were conducted on a Cray XC-50 as part of Cray’s Marketing Partner Network, on compute nodes with 44-core Broadwell CPUs, compiled using Chapel 1.20 with the ‘fast’ flag to enable all compiler optimizations. Performance is tested both with and without CHPL_NETWORK_ATOMICS, which enables RDMA atomic operations. These RDMA atomics are not coherent, and so all atomic operations, including those that are performed locally on the same system, must go through the Gemini or Aries NIC. The overhead of using network atomics for local operations has been measured by the authors to be as much as an order of magnitude. While care has been taken in the development of both the EpochManager and AtomicObject to eliminate the usage of RDMA atomics when they are unnecessary by ‘opting out’, it is still useful to compare performance with and without the support for RDMA atomics. Since RDMA atomics are only available on Cray Gemini or Aries interconnects, the performance observed with RDMA atomics disabled would mirror the performance seen on Infiniband clusters, as Chapel does not utilize Infiniband RDMA atomics even when present on the system.
The performance of both AtomicObject and EpochManager is measured to show the raw overhead of both constructs, with the goal of showing that both are scalable. Such microbenchmarks are important in that it is impossible to build scalable non-blocking algorithms without scalable building blocks, beginning with AtomicObject, which is a fundamental building block that is used even in EpochManager. The AtomicObject is compared to atomic int, one of the only types on which atomics are natively supported in Chapel. The atomic int is also a sibling of the atomic uint, on top of which the AtomicObject is built. Microbenchmarks involving the AtomicObject test the overhead injected by the abstraction. Microbenchmarks involving the EpochManager focus on different use-cases and workloads.
In this section, we explore two performance criteria. First, we evaluate the performance of Atomic Objects against Chapel’s atomic variables. We use Chapel’s atomic int. The experiments focus on the common set of operations available between Chapel’s atomic variables and Atomic Objects: read, write, compare and swap, and exchange. Second, we evaluate the scalability of EpochManager, testing raw acquire/release, memory reclamation with remote objects and manual garbage collection every fixed number of iterations.
III-A Atomic Objects Performance Evaluation
We compare the performance of Atomic Objects, with and without ABA protection, against Chapel’s atomic int in shared memory and distributed memory, as shown in Figure 1. The experiment evaluates strong scaling, with each task performing the same number of operations, comprising 25% read, 25% write, 25% compare-and-swap, and 25% exchange operations. Shared-memory experiments show that all three of atomic int, AtomicObject (ABA), and AtomicObject scale linearly with an increasing number of tasks. AtomicObject without ABA protection performs equivalently to Chapel’s atomic int, and AtomicObject (ABA) takes the most time, with a constant overhead. In distributed memory, the performance of AtomicObject without ABA protection is equivalent to Chapel’s atomic int, showing that even in distributed memory there is no noticeable overhead, and it scales linearly with the number of locales, whether RDMA atomics are used or not. AtomicObject (ABA) also scales linearly with an increasing number of locales, and performs equivalently to Chapel’s atomic int without network atomics.
III-B Epoch Manager Performance Evaluation
    // Create manager instance
    var manager = new EpochManager();
    var objsDom = {0..#numObjects} dmapped Cyclic(startIdx=0);
    var objs : [objsDom] unmanaged C();
    // Randomize locale that each object is allocated on
    randomizeObjs(objs);
    forall obj in objs with (var tok = manager.register(), var M : int) {
      tok.pin();
      // If we are deleting…
      tok.deferDelete(obj);
      tok.unpin();
      M += 1;
      // If we are tryReclaim'ing…
      if M … tok.tryReclaim();
    }
    // Reclaim all objects at end
    manager.clear();
The microbenchmarks for EpochManager are similar to Listing III-B.
We evaluate the scalability of EpochManager under various workloads, which should be representative of the different use-cases. In a read-only workload, such as for a read-often write-rarely data structure when performing a lookup in a hash table or a linked list, it may be suitable to just pin at the beginning of the operation and then unpin at the end. As demonstrated in Figure 5, performance is essentially stable across multiple locales, showing that even in distributed contexts it can scale reasonably well, as all locales forward their accesses to their privatized instances despite being in a parallel and distributed forall loop. In Figure 4, another typical workload is analyzed where no reclamation is performed until the very end, which is typical when the number of objects is bounded and fits within memory. The fraction of remote objects to be reclaimed is varied between 0%, 50%, and 100%, and the ability to keep remote objects in limbo lists as if they were any other object preserves scalability. The case where tryReclaim is invoked only occasionally, the results of which are demonstrated in Figure 2, still shows scalability, as yet again there are limbo lists on each locale. When reclamation is performed, not even the locale where the global epoch is allocated is bogged down by redundant requests, thanks to the first-come-first-served election of tasks, and performance scales equally both with and without RDMA atomics. In the case where the user does not want to take any chances and invokes tryReclaim on every iteration, there is still scalability, as shown in Figure 3.
The AtomicObject is a solution to the lack of language support for atomic operations on objects. The implementation provides the operations not only in shared memory but also in distributed memory, utilizing pointer compression to enable RDMA atomic operations whose performance is on par with atomic operations on integers, while also providing protection from the ABA problem via a double-word compare-and-swap. The EpochManager is a non-blocking epoch-based reclamation garbage collection system that allows for concurrent-safe reclamation even in distributed-memory contexts. Both of these are essential building blocks for developing non-blocking algorithms in both shared memory and distributed memory. In future work, it is planned to allow more than $2^{16}$ locales while still allowing RDMA atomic operations. An application of both constructs in the porting of the Interlocked Hash Table[jenkins2017redesigning] is complete and awaiting release; their application in the creation of other distributed algorithms is also planned.