
Disaggregation and the Application

10/29/2019
by Sebastian Angel, et al.

This paper examines disaggregated data center architectures from the perspective of the applications that would run on these data centers, and challenges the abstractions that have been proposed to date. In particular, we argue that operating systems for disaggregated data centers should not abstract disaggregated hardware resources, such as memory, compute, and storage away from applications, but should instead give them information about, and control over, these resources. To this end, we propose additional OS abstractions and interfaces for disaggregation and show how they can improve data transfer in data parallel frameworks and speed up failure recovery in replicated, fault-tolerant applications. This paper studies the technical challenges in providing applications with this additional functionality and advances several preliminary proposals to overcome these challenges.


1 Introduction

Disaggregation splits existing monolithic servers into a number of consolidated single-resource pools that communicate over a fast interconnect [37, 8, 32, 49, 50, 34, 61]. This model decouples individual hardware resources, including tightly bound ones such as processors and memory, and enables the creation of “logical servers” with atypical hardware configurations. Disaggregation has long been the norm for disk-based storage [33] because it allows individual resources to scale, evolve, and be managed independently of one another. In this paper, we target the new trend of memory disaggregation.

Existing works on disaggregated data centers (DDCs) have focused primarily on the operational benefits of disaggregation—it allows resources to be packed more densely and improves utilization by eliminating the bin-packing problem. As a result, these works strive to preserve existing abstractions and interfaces and propose runtimes and OSes that make the unique characteristics of DDCs transparent to applications [14, 61]. The implicit underlying assumption in these works is that from the perspective of the OS, the distributed nature of processors and memory is an inconvenient truth of the underlying hardware, much like paging or interrupts, that should be abstracted away from applications.

Our position is that the disaggregated nature of DDCs is not just a hardware trend to be tolerated and abstracted away to support legacy applications, but rather one that should be exposed to applications and exploited for their benefit. We draw inspiration from decades-old distributed shared memory systems (which conceptually closely resemble disaggregation) where early attempts at full transparency quickly gave way to weaker consistency and more restrictive programming models for performance reasons [45, 35, 16, 36]. While the driving rationale for externalizing memory has changed, along with the underlying hardware and target applications, we believe that co-designing applications and disaggregated operating systems remains an attractive proposition.

Two properties of disaggregated hardware with the potential to benefit applications are the ability to reassign memory, by dynamically reconfiguring the mapping between processors and memory, and the failure independence of different hardware components (i.e., the fact that processors may fail without the associated memory failing, or vice versa). Memory reassignment can be leveraged by applications performing bulk data transfers across the network to achieve zero-copy operations by remapping memory from the source to the destination, or, after processor failures, to give orphaned memory a new home. Failure independence also allows processors to remain useful despite memory failures by acting as fast and reliable failure informers [4] and triggering recovery protocols.

We target data center applications that are logically cohesive, but physically distributed across multiple co-operating instances—examples of these include most microservice-based applications, data parallel frameworks, distributed data stores, and fault-tolerant locking and metadata services—and propose extending existing OSes for disaggregated systems, such as LegoOS [61], with primitives for memory reassignment and failure notification. Below is a discussion of the proposed primitive operations and the challenges in implementing them, all of which are exacerbated by the fact that the exact nature of disaggregation and the functionality of each component is in flux (§2).

  • Memory grant. This is a voluntary memory reassignment called by a source application instance to yield its memory pages and move them to a destination application instance. This reassignment requires a degree of flexibility from the interconnect, which must be able to handle modifying memory mappings quickly and at fine granularities.

  • Memory steal. This is an involuntary reassignment of memory from one application instance to another. While similar to a memory grant from the perspective of the interconnect, a key difference is that the source application instance may not have any prior warning. Since volatile state can now outlive an application instance, the programming model needs to guarantee crash consistency to ensure that state is semantically coherent at all times.

  • Failure notification. An application instance can opt to receive notifications for memory failures or it can register other instances to automatically be notified in such cases. This requires making failure information visible to applications, as well as retaining group membership at the processor so other instances can be notified if the local instance cannot handle or mask the memory failure.

Data parallel frameworks, such as MapReduce and Dryad, can use these primitives to eliminate unnecessary data transfer during shuffles or between nodes in the data flow graph, while Chubby [12] and other applications based on Paxos [39] can recover and reassign the committed state machine from a failed replica. In addition, early detection of memory failures can trigger recovery mechanisms without waiting for conservative end-to-end timeouts. While this paper focuses on these two applications, we believe that the interfaces are broad enough to benefit other applications. For example, scalable data stores, such as Redis or memcached, could use memory grants to delegate part of their key space to new instances sans copying, while microservice-based applications can use grants and steals to achieve performance comparable to monolithic services and still retain some modularity.

This paper is largely speculative and poses more questions than it answers. We discuss some operating system primitives that empower forward-looking applications to benefit from disaggregation (while allowing legacy applications to remain oblivious to it) and describe how these primitives could be implemented and used in the context of an abstract disaggregated data center that resembles existing designs. Our hope is to foster a broader discussion around disaggregation, not from the perspective of operators, but as an opportunity, and also a challenge, for systems and application developers.

2 DDC architecture and resource allocation

In the absence of existing disaggregated data centers, a number of different architectures have been proposed [34, 61, 49, 50, 57]. While these architectures differ in some of the details, the general strokes are similar. We assume the architecture given in Figure 1, which has three core components: individual blades with compute elements and memory elements, connected over a low-latency programmable resource interconnect. While we have chosen to explore these ideas in the context of a single architecture for simplicity, we believe that they are broadly applicable to other disaggregation models.

Figure 1: Proposed architecture for disaggregated data centers (DDCs). Racks consist of blades housing compute or memory elements that are connected through the Rack MMU. Compute elements have caches and some local memory, while memory elements rely on an attached processor to mediate accesses. Network communication between compute elements within and across racks uses the Top-of-Rack Ethernet switch.

Compute elements.

The basic compute elements in our rack are commodity processors which retain the existing memory hierarchy with private core and shared socket caches. While some architectures have processors operate entirely on remote memory [57], this requires major modifications to the processor to support instructions such as PUSH and POP that implicitly reference the stack, as well as to the memory and caching subsystems. In line with the majority of proposed architectures, we assume a small amount of locally-attached memory at the compute elements, which is used for the operating system and as a small cache to improve performance [24, 61].

Memory elements.

Memory elements, which are conventional DRAM or NVRAM chips, can be exposed directly across the interconnect (Fabric-Attached Memory) or fronted by a low-power processing element (e.g., mobile processor, FPGA, or ASIC) that interacts with memory through a standard MMU (Proxied-Memory). We assume a form of proxied-memory where addressing, virtualization, and access control are delegated to the local processing element which interposes on memory requests. Similar functionality can be achieved for fabric-attached memory by coordinating memory controllers at multiple compute elements.

Resource interconnect.

The resource interconnect allows processor and memory elements to communicate and can be based on RDMA over InfiniBand or Ethernet, Omnipath [11], Gen-Z [25], or a switched PCIe fabric [13, 22]. Our design is agnostic to the physical layer, but we assume a degree of programmability and on-the-fly reconfiguration within the interconnect (that we call the Rack MMU) that allows compute and memory elements to be dynamically connected and disconnected in arbitrary configurations. Recent work [63] proposes and implements one such network fabric; although the proposed architecture lacks a programmable switch, it emulates its functionality through a Clos network of switches and a coordination-free scheduling protocol.

Resource partitioning and allocation.

We assume that the unit for disaggregation is a single rack (i.e., compute and memory elements reside in the same rack), with resources being partitioned into the desired compute abstractions, such as virtual machines, containers, or processes, and presented to applications (we generically refer to all of these compute abstractions, which host application workloads, as processes). The Rack MMU acts as a resource manager for the rack and is responsible for resource partitioning within the rack and assigning compute and memory elements to processes.

The Rack MMU has a similar policy regarding sharing of hardware resources as LegoOS [61]: processes may share the same memory element, but not the same regions of memory (i.e., there is no shared memory). Similarly, compute elements can host multiple processes, but all the threads of a process are restricted to a single compute element. This simplifies caching, as shared memory would require coherence across the local memory attached to the compute elements. Memory is allocated at a fixed page-sized granularity, which is chosen according to the addressing architecture of compute and memory elements. The Rack MMU is responsible for high-level placement decisions for processes and picks compute and memory elements on the basis of some bin-packing policy, while fine-grained sharing and isolation across co-hosted processes are managed by the local OS.

Addressing and access control.

Traditional processes expect to operate on a private virtual address space, regardless of the physical layout of the underlying memory. To preserve this illusion, the Rack MMU stores a virtual-to-physical (V2P) mapping for each process, which resembles a traditional per-process page table. Compute elements query this V2P mapping, which may be cached locally, to route requests to the correct memory element.

The Rack MMU is also responsible for configuring access control to memory. When memory is allocated, the Rack MMU ensures that the topology of the interconnect allows for the existence of a path between the corresponding compute and memory elements. It also configures the page tables at the memory elements with the process identifier (effectively the CR3), the virtual address, and the appropriate permissions, enabling local enforcement at the memory elements. While this mechanism is specific to proxied memory, fabric-attached memory systems have proposed a capability-based protection system to achieve similar functionality [2].
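
As a concrete illustration, below is a minimal sketch, under assumed field names, of the page-table entry a proxied memory element might hold and the check its attached processing element could perform when interposing on a request; none of this is prescribed by existing DDC designs.

```c
#include <stdbool.h>
#include <stdint.h>

#define PERM_R 0x1
#define PERM_W 0x2

/* Entry installed by the Rack MMU at the memory element when the page
 * is allocated; field names and widths are illustrative assumptions. */
struct me_pte {
    uint64_t proc_id;  /* process identifier (effectively the CR3)     */
    uint64_t vaddr;    /* virtual address this entry protects          */
    uint8_t  perms;    /* read/write permissions set by the Rack MMU   */
};

/* Local enforcement at the memory element: a request is served only if
 * it matches an entry installed by the Rack MMU at allocation time. */
bool me_allow(const struct me_pte *pte, uint64_t proc_id,
              uint64_t vaddr, uint8_t requested)
{
    return pte->proc_id == proc_id &&
           pte->vaddr   == vaddr   &&
           (pte->perms & requested) == requested;
}
```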

Scaling out.

Not all applications want to live within a single rack: to span racks, traditional Ethernet-based networking is available through a commodity top-of-rack (ToR) switch that connects to the rest of the data center network. Distributed applications comprising multiple processes have to choose the appropriate deployment: intra-rack deployments enjoy lower latencies, while cross-rack deployments have greater failure independence. This decision is analogous to the one faced by developers when selecting the appropriate placement group [5] or availability set [55] in cloud deployments today.

3 Exposing disaggregation

In traditional architectures, the OS is responsible for managing hardware resources, allocating them to processes, and enforcing isolation of shared resources. In a disaggregated environment, this is no longer true and resource allocation is now within the bailiwick of the Rack MMU; the local OS at compute elements continues to be responsible for managing the underlying hardware, providing local scheduling and isolation, and presenting a standard programming interface to applications. Additionally, the OS is responsible for transparently synchronizing application state between local and remote memory and, if any state is locally cached, managing the contents and coherence of this cache [27, 61].

Prior OSes for DDCs [61, 14] have chosen to implement a standard POSIX API and abstract away the disaggregated nature of DDCs from applications. While this allows existing unmodified applications to run on DDCs, our case studies (§4 and §5) argue that many of these applications could achieve better performance if they had more visibility and control. Accordingly, we advocate for the design and implementation of the following three operations as OS interfaces.

3.1 Memory grant

Memory is reassigned at page granularity by moving it from the V2P mapping of one process to another at the resource manager (Rack MMU) and invalidating any cached V2P mappings at compute elements. Following this, the resource manager revokes access to that memory region by modifying the page table entries for proxied memory (or by generating a new capability for fabric-attached memory); the detached memory can then be attached to an existing process similar to newly allocated memory.
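
The sketch below illustrates this reassignment path from the Rack MMU's side; the toy table representation, function names, and invalidation hooks are illustrative assumptions, not an existing implementation.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

#define TABLE_SLOTS 1024

/* Per-process virtual-to-physical map held at the Rack MMU (toy,
 * fixed-size representation purely for illustration). */
struct v2p_table {
    uint64_t vaddr[TABLE_SLOTS];
    uint64_t paddr[TABLE_SLOTS];
    bool     used[TABLE_SLOTS];
};

/* Placeholder side effects: in a real rack these would reprogram the
 * compute elements' cached mappings and the proxied element's page table. */
static void invalidate_cached_mapping(pid_t proc, uint64_t vaddr) {
    printf("invalidate cached V2P of pid %d for 0x%llx\n",
           (int)proc, (unsigned long long)vaddr);
}
static void memelem_set_owner(uint64_t paddr, pid_t new_owner) {
    printf("page 0x%llx now owned by pid %d\n",
           (unsigned long long)paddr, (int)new_owner);
}

/* Move one page mapping from `src` to `dst`, keeping the same virtual
 * address so that internal pointers remain valid (as discussed below). */
int rackmmu_reassign(struct v2p_table *src_t, pid_t src,
                     struct v2p_table *dst_t, pid_t dst, uint64_t vaddr)
{
    for (int i = 0; i < TABLE_SLOTS; i++) {
        if (!src_t->used[i] || src_t->vaddr[i] != vaddr) continue;
        uint64_t paddr = src_t->paddr[i];
        src_t->used[i] = false;                 /* detach from the source   */
        invalidate_cached_mapping(src, vaddr);  /* drop stale cached copies */
        memelem_set_owner(paddr, dst);          /* revoke src, admit dst    */
        for (int j = 0; j < TABLE_SLOTS; j++) { /* attach at the same vaddr */
            if (!dst_t->used[j]) {
                dst_t->vaddr[j] = vaddr;
                dst_t->paddr[j] = paddr;
                dst_t->used[j]  = true;
                return 0;
            }
        }
        return -1;                              /* destination table full   */
    }
    return -1;                                  /* page not owned by src    */
}
```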

Memory reassignment is conceptually similar to a memory grant operation in L4 [48], with one significant difference: as reassigned pages may contain data structures with internal references, these pages must be attached to the same virtual address to prevent dangling pointers. To avoid a situation in which the receiving process has already used the provided virtual addresses (which would create ambiguity), we propose reserving a fixed number of bits of the virtual address to act as a process identifier.

Mechanistically, we envision memory reassignment to occur, like in L4, through message passing between OS instances. This transfer is initiated by the application through a system call similar to vmsplice() in Linux: when called with the SPLICE_F_GIFT flag, the process “gifts” the memory to the kernel, promising to never access it again. As the page continues to use the same virtual address space in the receiving process, the sender OS marks the virtual address as being “in use” and prevents further allocations or mappings to it. Receiving processes are notified about the addition of new pages by their local OS through signals.
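
A hypothetical user-level view of a grant might look like the sketch below; the dgc_grant() call, the PID_BITS address layout, and the stub body are assumptions meant to mirror the vmsplice()-with-SPLICE_F_GIFT analogy above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Reserve the top bits of every virtual address as a process identifier
 * (see above), so a granted page keeps an unambiguous address at the
 * receiver; PID_BITS = 16 is an arbitrary illustrative choice. */
#define PID_BITS  16
#define PID_SHIFT (64 - PID_BITS)
#define ADDR_OWNER(a) ((uint64_t)(uintptr_t)(a) >> PID_SHIFT)

/* Hypothetical syscall wrapper: gift `len` bytes of page-aligned memory
 * at `addr` to process `dst`, in the spirit of vmsplice() with
 * SPLICE_F_GIFT. The caller must never touch the range again; the
 * sender's OS keeps the virtual range reserved. Stub body so the
 * sketch compiles stand-alone. */
int dgc_grant(void *addr, size_t len, pid_t dst) {
    printf("granting %zu bytes at %p (owner id %llu) to pid %d\n",
           len, addr, (unsigned long long)ADDR_OWNER(addr), (int)dst);
    return 0;   /* a real version would trap into the local OS, which asks
                   the Rack MMU to move the pages between V2P tables */
}

int main(void) {
    size_t len = (size_t)1 << 21;          /* 2 MiB destined for a peer   */
    void *buf  = aligned_alloc(4096, len); /* stands in for pages backed  */
    if (!buf) return 1;                    /* by a remote memory element  */
    /* ... fill buf with data for the receiving process (pid 4242) ... */
    return dgc_grant(buf, len, 4242);
}
```

As with vmsplice(), the gift is one-way: after the call returns, the pages belong to the destination, and the sender's OS keeps the corresponding virtual range reserved.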

3.2 Memory steal

Memory grants are the most natural flavor of memory reassignment, but are not particularly useful in the case of compute element failures. An alternative is for other entities (processes or local OSes) to be able to take away, or steal, a process’ memory. For example, when a compute element crashes, another process belonging to the same application could request the crashed process’ memory. This is similar to how servers in Frangipani [64] keep their logs remotely, and can request the logs of servers that have crashed to resume their operations.

Two questions naturally arise in this case: first, who is allowed to trigger memory reassignments, and when is it acceptable to do so? Second, how does the application guarantee the semantic consistency of memory that may abruptly be stolen? While it is clear in the context of memory grants that a process should have the authority to give away its own memory, the policy around forcible reassignment is less clear. One possibility is to group trusted processes together and allow any group member to initiate reassignment; another is to require that a group of processes reach consensus before reassigning any memory. In terms of timing, while we envision this primarily as an aid to recovery mechanisms when a process has crashed (or is suspected of having crashed), there might be applications where stealing memory from a running process is acceptable and even profitable.

We propose to expose memory stealing via a syscall that requires the id of the source process and uses the group of the calling process as a capability for authentication; memory allocated using brk or mmap can disallow future reallocation with the appropriate flags. We do not enforce a specific policy at the Rack MMU and instead leave it up to the application to determine what is appropriate (we explore one such policy in the context of Paxos in Section 4.1). While a buggy application can mistakenly steal its own memory and crash, this is not morally different from threads stomping on each other’s memory in buggy shared memory applications.
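
A sketch of the corresponding user-level interface appears below; the dgc_steal() syscall, the MAP_NOSTEAL opt-out flag, and the group-as-capability convention are assumptions that follow the description above rather than an existing API.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Assumed mmap()-style flag with which an allocation could opt out of
 * future stealing; the name and value are placeholders. */
#define MAP_NOSTEAL 0x40000000

/* Hypothetical syscall wrapper: reassign `len` bytes at virtual address
 * `addr` from process `victim` to the caller. The kernel forwards the
 * request to the Rack MMU, which checks that caller and victim belong
 * to the same registered group before revoking the victim's access and
 * remapping the pages. Stub body so the sketch compiles stand-alone. */
int dgc_steal(pid_t victim, uint64_t addr, size_t len) {
    printf("stealing %zu bytes at 0x%llx from pid %d\n",
           len, (unsigned long long)addr, (int)victim);
    return 0;
}

int main(void) {
    /* Adopt the committed state of a crashed peer at its original
     * virtual address (the pid and address here are arbitrary). */
    return dgc_steal(/*victim=*/4242, 0x00007f0000000000ULL, (size_t)1 << 21);
}
```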

The second challenge is maintaining crash consistency for reassignments. This is non-trivial since most applications are not written with the idea that memory should be consistent and all invariants maintained at every point in the midst of a computation; most applications do in fact have temporary windows of inconsistency. While certain programming abstractions such as transactional memory and objects [62, 28] provide atomicity, they are not sufficient in the case of compute element failures. Storage systems have historically faced similar challenges in allowing application state to outlive compute, and building transactional, crash-consistent programming models for non-volatile memory (NVRAM) is an active area of research [67, 66, 17, 65, 68, 54]. Applications can adopt any of these programming models, which rely on a combination of techniques such as journaling, soft updates [23], shadow copies [18], and undo logs [17] to remain crash consistent when updating structures in remote memory.

Nevertheless, even when these structures are consistent in remote memory, the metadata required to locate them may reside in processor registers, caches, or stack variables that are not part of remote memory. Applications typically do not have a namespace for locating internal objects and instead rely on the compiler to keep track of them; consequently, when memory is reassigned to a new process, finding the necessary objects in raw memory pages would be a monumental task, akin to searching for lost treasure without a map.

Our suggestions for this are two-fold. First, applications can use an asynchronous, event-based model that forces them to reason about all critical state and package it into a heap object before yielding (i.e., stack ripping [3]), since that is all that persists across invocations. Second, the application can use a file-system-like namespace for objects [20], or it can distribute, in anticipation of failures, metadata about its heap objects (depending on the application, this could be as minimal as the root address of a tree) that acts as a “map” to help locate critical state.
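
One way to realize the “map” idea, shown below purely as an assumption, is a small root table that the application keeps at a well-known location in remote memory and updates only at crash-consistent points; a process that later steals the pages reads this table to find the critical objects instead of scanning raw memory.

```c
#include <stdint.h>

/* All names, sizes, and the magic value here are illustrative. */
#define RECOVERY_MAGIC 0x52434d50u   /* arbitrary marker for a valid map */
#define MAX_ROOTS      16

struct recovery_root {
    char     name[32];    /* e.g. "log", "kv_index", "paxos_state"   */
    uint64_t vaddr;       /* virtual address of the object's root    */
    uint64_t version;     /* bumped after each consistent update     */
};

struct recovery_map {
    uint32_t magic;                        /* identifies a valid map  */
    uint32_t nroots;                       /* number of live entries  */
    struct recovery_root roots[MAX_ROOTS]; /* the “map” to the state  */
};
```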

3.3 Failure notification

Compute elements should be notified about memory failures either asynchronously using liveness information from a reliable interconnect or explicitly in response to accesses on unreliable interconnects. In the latter case, compute elements can receive messages from the controller of the memory element (when specific elements have failed), or rely on timeouts (when the entire memory element is unreachable). Error notifications are propagated back to the application through OS signals (SIGBUS); applications that want to manage faults can register for these signals and trigger a failure-recovery protocol, while legacy applications may safely ignore them.

As memory failures may result in the loss of application state, it is sometimes unclear how an application should leverage failure notifications. To guard against such cases, an application can pre-register a group of processes with the OS that will be informed in case of failures (these processes essentially serve as “emergency contacts”). This group is stored in a per-process forwarding table within the OS. As the OS is local to the compute element, memory failures do not affect the forwarding table; consequently, the application can defer notification to the OS using a syscall which broadcasts the error to the corresponding group. This allows other processes to learn of the failure and respond appropriately, making the compute element a local failure informer [4, 44].
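
The sketch below shows how an application might consume these interfaces. SIGBUS delivery follows the description above, while dgc_register_group() and dgc_forward_failure() are hypothetical names for the registration and group-broadcast syscalls, stubbed here so the example compiles.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

/* Hypothetical syscalls, stubbed so the sketch compiles stand-alone. */
static int dgc_register_group(const pid_t *members, int n) {
    (void)members; (void)n; return 0;  /* would fill the OS forwarding table */
}
static int dgc_forward_failure(const void *addr) {
    fprintf(stderr, "memory failure near %p; notifying group\n", addr);
    return 0;
}

static void on_memory_failure(int sig, siginfo_t *info, void *ctx) {
    (void)sig; (void)ctx;
    /* Local state referencing the failed element is gone; tell the
     * "emergency contacts" instead of dying silently, then exit. */
    dgc_forward_failure(info->si_addr);
    _exit(1);
}

int main(void) {
    pid_t peers[] = { 101, 102 };        /* other instances (example pids) */
    dgc_register_group(peers, 2);

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_memory_failure;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);        /* opt in to memory-failure signals */

    /* ... application work against (possibly remote) memory ... */
    return 0;
}
```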

Failures of compute elements are harder to detect, as the absence of accesses to a particular memory element need not be a sign of failure. We propose the addition of a rack-level monitor that periodically verifies the health of compute elements using heartbeats and triggers the appropriate action when failures are detected. Applications can register a group of processes to inform in the case of failures, similar to the groups registered for memory failures; alternatively, they can register a lightweight failure handler to be run, in an isolated context, at the monitor. While this monitor is a single point of failure and may not detect all failures, we view it as an optimization of, rather than a replacement for, failure detection using end-to-end timeouts at the application.

Setting                                        Mean RTT (μs)
Cross-rack (Cloud)                                        45
Intra-rack (eRPC [31])                                     2
Future intra-rack (Mellanox ConnectX-6 [1])                1
Figure 2: Comparison of the latency of data transfer between VMs in the same rack and on different racks. The cross-rack number is derived experimentally and represents the mean round-trip time (RTT) between two VMs, with accelerated networking, within a cloud data center. We ensure that VMs are placed on different racks using the appropriate availability primitives [55, 5]. Both current and future intra-rack numbers are taken from the referenced publications.

One might wonder why local failure informers are better than simply using application-level timeouts to detect failures, especially given the reliance on timeouts to detect compute and memory failures. The answer is that we can exploit the difference between intra-rack and cross-rack latencies; as we show in Figure 2, this difference is more than an order of magnitude. As compute and memory are located within the same rack, we assume that the Rack MMU achieves comparable latencies. This allows local failure detectors to use more aggressive timeouts and trigger recovery procedures earlier. (The exact gains are hard to quantify since network latency is only one of many factors considered when setting end-to-end timeouts [4].)

3.4 Feasibility of implementing the Rack MMU

The memory interconnect described so far is capable of routing requests between any compute and memory elements within the rack, as well as blocking communication between any such elements, at very low latency. It has enough space to store address mappings for each process, so that accesses from compute elements are transparently routed to the correct memory element; further, it supports dynamic reconfiguration of routes and mappings without requiring any downtime.

While existing research and production hardware satisfies some of these requirements, achieving their composition remains an open problem. Programmable switches, such as the Barefoot Tofino and Cavium XPliant, offer low-latency, reconfigurable routing between compute and memory elements, but are limited in their port counts and memory, restricting their scale. In contrast, Shoal [63] supports high-density racks with hundreds of compute and memory elements, but does not currently offer the low latency, programmability, and reconfigurability required for grant and steal operations.

4 Case study: Paxos

Applications use Paxos [39] to tolerate failures via the replicated state machine approach [38, 58, 60]: Paxos ensures that different replicas (which are deterministic state machines that implement the application’s logic) execute the same commands in the same order, ensuring that all replicas transition through the same sequence of states. If a replica fails, a client can simply issue its requests to a live replica.

Replica failures lead the system into a state of reconfiguration where the old failed replica is removed and a new replica is introduced [40, 41, 15]. This prevents too many failures from accumulating over time and making the system unavailable. Mechanistically, reconfiguration achieves two goals: first, it brings new replicas up to date by having them fetch the latest state from existing replicas or persistent storage [15]. Second, it prevents old replicas that have been excluded from the current configuration (presumably because they have failed) from participating if they come back online.

Detecting failures.

Detecting failures is a challenging proposition in an asynchronous environment due to the difficulty of distinguishing between crashed and slow processes [21, 4]. Consequently, Paxos implementations rely on heartbeats and keep-alives with conservative end-to-end timeouts to ascertain the state of processes. Recent failure detectors [44, 42, 43] quickly and reliably detect failures and kickstart recovery mechanisms in asynchronous settings using a combination of local, host-based monitors that track the health of components across the stack, and lethal force. In cases where failures are suspected but cannot be confirmed, these detectors forcibly kill the process—the intuition behind this protocol (called STONITH or “Shoot the Other Node in the Head”) is that unnecessary failures are preferable to uncertainty.

4.1 Paxos reconfiguration in DDCs

The failure independence of DDCs enables new ways to detect and recover from failures in fault-tolerant applications using Paxos. We assume that the replicas of this application run in different racks within the same data center—a reasonable assumption for applications that want greater failure independence without paying the costs of wide-area traffic. Within this deployment, we explore two scenarios: a compute element that loses some or all of its memory elements, and a faulty compute element with functional memory elements.

Dead compute with live memory.

When a replica dies, one could in principle reassign the state machine’s memory to another compute element and the system could continue operating unimpeded. Such reassignment effectively reincarnates the old node, from the perspective of Paxos, which allows the consensus group to return to full health faster (no need to retrieve the state from a checkpoint or another replica). However, as we discuss in Section 3.2, the developer must ensure that the state machine’s transition function preserves memory consistency after crashes.

Should the failure of the compute element be detected faster than the end-to-end timeout of the Paxos group—a likely scenario due to the difference between intra- and cross-rack latencies—the reincarnation can be transparent to the rest of the system. In such cases, a client and other replicas will only observe a connection termination and will attempt to reconnect. Each Paxos replica registers a fast failure handler with the rack monitor that requests the Rack MMU to provision a new replica that can take ownership of the dead replica’s memory with the steal operator of Section 3.2. Failures detected by the monitor trigger this handler, while undetected failures are eventually detected by another replica which must gain consensus, either through a proposal or from a stable leader, before reincarnating the failed instance.

In response to a steal operation, the Rack MMU revokes and reassigns access to the region of memory. Revocation is needed because compute element failures are not always fail-stop and the system must prevent a temporarily unavailable compute element from returning and corrupting state. The ToR switch can redirect cross-rack traffic to the new compute element using OpenFlow rules; further, it can also use these rules to fence the old compute element off from the rest of the network [43].
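
Putting the pieces together, the sketch below outlines this recovery path with hypothetical interfaces: a lightweight handler registered with the rack monitor (§3.3) and the start-up path of the replacement replica, which adopts its predecessor's state machine via the steal operation (§3.2). All names are assumptions, stubbed so the sketch compiles.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>

/* Assumed interfaces, stubbed stand-alone for illustration. */
static int dgc_steal(pid_t victim, uint64_t addr, size_t len) {
    printf("steal %zu bytes at 0x%llx from pid %d\n",
           len, (unsigned long long)addr, (int)victim);
    return 0;
}
static int rack_provision_replacement(pid_t dead) {
    printf("asking the Rack MMU to provision a replacement for pid %d\n",
           (int)dead);
    return 0;
}
static void paxos_resume_from(uint64_t state_root) {
    printf("resuming state machine rooted at 0x%llx\n",
           (unsigned long long)state_root);
}

/* Runs in an isolated context at the rack monitor when the replica's
 * compute element is declared dead (its memory elements are still up). */
void on_compute_failure(pid_t dead_replica) {
    rack_provision_replacement(dead_replica);
}

/* The replacement replica adopts its predecessor's memory. The Rack MMU
 * revokes the old mapping first, so a slow-but-alive compute element
 * cannot come back and corrupt the state. */
void replacement_startup(pid_t dead_replica, uint64_t state_root, size_t len) {
    if (dgc_steal(dead_replica, state_root, len) == 0)
        paxos_resume_from(state_root);
}
```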

Dead memory with live compute.

When a compute element flushes its operations to a remote memory element, it is possible for this operation to fail if the memory element is down. Instead of terminating the application right away, as we discuss in Section 3.3, the OS propagates a signal up the stack or forwards the signal to other replicas. This mechanism allows other replicas to detect memory failures more quickly than relying on end-to-end timeouts. Indeed, were the application to be terminated immediately without this notification, other replicas would not know whether the memory remains alive or not, leading to ambiguity as to the type of failure.

5 Case study: Data parallel computations

In-memory data parallel frameworks such as data flow and graph processing systems [56, 69, 19, 29, 26, 51] express computations as a series of nodes, where each node performs an operation on its inputs. In these systems, it is often necessary to move data between nodes so that the output of a node may be used as the input to the next node. For example, in MapReduce [19], the output of mappers is shuffled and sent to reducers that operate on a chunk of related data.

Applications represent large compute jobs as a set of smaller tasks, and distribute these tasks across nodes using the data parallel framework. While completing a job requires all of its individual tasks to finish, tasks are often unexpectedly delayed due to factors such as load imbalances, workload skews, failures, and hardware defects. Because such stragglers hold up the entire job and significantly impact completion times, frameworks employ a variety of mitigation techniques, including blacklisting slow machines, speculatively timing out and rerunning tasks [19, 7], and even proactively launching multiple replicas of the same task [6].

We believe that executing data parallel systems transparently on a DDC would leave performance on the table, and argue for using the operators described in Section 3 to speed up data movement and straggler mitigation.

Figure 3: Data transfer between two nodes, A and B, in the same rack. This figure assumes that not all of the data is cached in blade-local memory. (1) Node A loads its data from the remote memory via the Rack MMU. (2) Node A sends the data to Node B over a TCP/IP stream via the ToR switch. (3) Node B stores the data into its remote memory via the Rack MMU.

Faster data movement.

Deploying an unmodified data parallel framework on a transparent DDC results in unnecessary data movement between computational nodes; for example, Figure 3 shows how transferring data between nodes forces 3 network and memory RTTs. First, the source processor fetches data from its remote memory over the memory interconnect. Then, the source processor sends this data over the network to the destination processor via the ToR switch. Finally, the destination processor forwards the data via the memory interconnect to the remote memory for storage.

Data transfer is often a bottleneck in these systems: McSherry and Schwarzkopf [53] demonstrate that Timely Dataflow [52] achieves up to 3× higher throughput when provided with a faster network. Memory grants convert the 3 RTTs for data transfer into a single RTT over the memory interconnect. The source, node A, would call grant on the memory pages storing the data that it plans to send to the destination, node B, and would indicate B as the recipient of these pages; the Rack MMU would make the necessary adjustments to page permissions before notifying B that the pages are ready to be mapped into its local address space. With this scheme, the only traffic is the small control messages exchanged via the Rack MMU, bypassing the slower ToR switch.
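
A sketch of this shuffle path, again with hypothetical names and a stubbed grant call, appears below: each mapper grants the pages holding a partition directly to the consuming reducer instead of pushing the bytes through the ToR switch.

```c
#include <stddef.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical grant syscall from §3.1, stubbed so the sketch compiles. */
int dgc_grant(void *addr, size_t len, pid_t dst) {
    printf("grant %zu bytes at %p to reducer %d\n", len, addr, (int)dst);
    return 0;
}

struct partition {
    void  *pages;    /* page-aligned output buffer (in remote memory)  */
    size_t len;      /* bytes of shuffle output in this partition      */
    pid_t  reducer;  /* process that will consume the partition        */
};

/* One control message per partition via the Rack MMU replaces a trip
 * through remote memory, the ToR switch, and remote memory again. */
void shuffle(struct partition *parts, int n) {
    for (int i = 0; i < n; i++)
        dgc_grant(parts[i].pages, parts[i].len, parts[i].reducer);
}
```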

Dealing with stragglers.

Straggler nodes in data parallel systems can have their memory forcibly reassigned to another node by having the job orchestrator steal the appropriate memory pages. The recipient node can resume and complete the half-completed computation, rather than starting from scratch. In case of failures, as with the Paxos example (§4.1), the failure notification interfaces can inform the job orchestrator, allowing it to relaunch the task more quickly than relying on an end-to-end timeout. If only the compute elements of the node have failed, the newly launched task can resume computation from where it had stalled.

6 Discussion

We are by no means the first to observe either the ability to reassign memory across processes or the failure independence of resources in DDCs. While recent works on disaggregated systems have advocated for transparent solutions—RAID-style [59] memory replication in LegoOS [61] and replication and switch-based failover by Carbonari and Beschastnikh [14]—this has largely been driven by the desire to benefit legacy applications. Carbonari and Beschastnikh also observe that applications could benefit from information about failures but do not go further; we build on that observation and look at how applications that eschew transparency could use this information. More specifically, we borrow ideas from systems for single-host IPC [9, 10, 46, 47, 48], distributed shared memory [45, 35, 16, 36], and accelerated RPCs [30, 31] for fast, zero-copy data transfer and from reliable failure informers [4, 44, 42, 43] for faster recovery.

Disaggregation represents a fundamental change in how hardware resources are built, provisioned, and presented to applications for consumption. As befitting an operator-driven initiative, early research has focused on changes necessary within the hardware rather than the application. But application developers are not averse to major changes in their programming model as long as they receive commensurate benefits; in fact, as witnessed by the prevalence of MapReduce, good models can help applications transition more smoothly. Reasoning about memory grants and steals is a significant departure from existing programming models, but there is encouraging precedent: the Rust programming language successfully introduced ownership and move semantics to guarantee memory safety and data race freedom. We believe that similar abstractions could be useful in our context.

Acknowledgements

We thank Andrew Baumann, Natacha Crooks, Joshua Leners, Youngjin Kwon, Srinath Setty, and Nathan Taylor for feedback and discussions that improved this paper.

References

  • [1] ConnectX-6 single/dual-port adapter supporting 200Gb/s with VPI. https://www.mellanox.com/page/products_dyn?product_family=265&mtag=connectx_6_vpi_card.
  • [2] R. Achermann, C. Dalton, P. Faraboschi, M. Hoffmann, D. Milojicic, G. Ndu, A. Richardson, T. Roscoe, A. L. Shaw, and R. N. M. Watson. Separating Translation from Protection in Address Spaces with Dynamic Remapping. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS), 2017.
  • [3] A. Adya, J. Howell, M. Theimer, W. J. Bolosky, and J. R. Douceur. Cooperative Task Management Without Manual Stack Management. In Proceedings of the USENIX Annual Technical Conference (ATC), 2002.
  • [4] M. K. Aguilera and M. Walfish. No time for asynchrony. In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS), 2009.
  • [5] Amazon. Placement Groups. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/placement-groups.html.
  • [6] G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective Straggler Mitigation: Attack of the Clones. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2013.
  • [7] G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the Outliers in Map-reduce Clusters Using Mantri. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2010.
  • [8] K. Asanović. FireBox: A Hardware Building Block for 2020 Warehouse-Scale Computers. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2014.
  • [9] B. Bershad, T. Anderson, E. Lazowska, and H. Levy. Lightweight Remote Procedure Call. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 1989.
  • [10] B. N. Bershad, T. E. Anderson, E. D. Lazowska, and H. M. Levy. User-level Interprocess Communication for Shared Memory Multiprocessors. ACM Transactions on Computer Systems (TOCS), 9(2), 1991.
  • [11] M. S. Birrittella, M. Debbage, R. Huggahalli, J. Kunz, T. Lovett, T. Rimmer, K. D. Underwood, and R. C. Zak. Intel Omni-path Architecture: Enabling Scalable, High Performance Fabrics. In Proceedings of the 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects (HOTI), 2015.
  • [12] M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.
  • [13] BusinessWire. Liqid Fulfils the Promise of Rack-Scale Composable Infrastructure with General Availability. https://www.businesswire.com/news/home/20171114006064/en/Liqid-Fulfils-Promise-Rack-Scale-Composable-Infrastructure-General, 2017.
  • [14] A. Carbonari and I. Beschastnikh. Tolerating Faults in Disaggregated Datacenters. In Proceedings of the ACM Workshop on Hot Topics in Networks (HotNets), 2017.
  • [15] T. Chandra, R. Griesemer, and J. Redstone. Paxos Made Live—An Engineering Perspective. In Proceedings of the Symposium on Principles of Distributed Computing (PODC), 2007.
  • [16] J. S. Chase, H. M. Levy, M. J. Feeley, and E. D. Lazowska. Sharing and Protection in a Single-address-space Operating System. ACM Transactions on Computer Systems (TOCS), 12(4), 1994.
  • [17] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, and S. Swanson. NV-Heaps: Making Persistent Objects Fast and Safe with Next-generation, Non-volatile Memories. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.
  • [18] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, and D. Coetzee. Better I/O Through Byte-addressable, Persistent Memory. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2009.
  • [19] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004.
  • [20] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, and J. Jackson. System Software for Persistent Memory. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2014.
  • [21] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of Distributed Consensus with One Faulty Process. Journal of the ACM, 32(2), 1985.
  • [22] S. Foskett. Liqid Takes Composable Infrastructure to a New Level. https://gestaltit.com/exclusive/stephen/liqid-takes-composable-infrastructure-to-a-new-level/, 2018.
  • [23] G. R. Ganger, M. K. McKusick, C. A. N. Soules, and Y. N. Patt. Soft Updates: A Solution to the Metadata Update Problem in File Systems. ACM Transactions on Computer Systems (TOCS), 18(2), 2000.
  • [24] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker. Network Requirements for Resource Disaggregation. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
  • [25] Gen-Z Core Specification, Revision 1.0. https://www.genzconsortium.com.
  • [26] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012.
  • [27] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin. Efficient Memory Disaggregation with INFINISWAP. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2017.
  • [28] N. Herman, J. P. Inala, Y. Huang, L. Tsai, E. Kohler, B. Liskov, and L. Shrira. Type-aware transactions for faster concurrent code. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2016.
  • [29] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2007.
  • [30] A. Kalia, M. Kaminsky, and D. G. Andersen. FaSST: Fast, Scalable and Simple Distributed Transactions with Two-sided (RDMA) Datagram RPCs. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
  • [31] A. Kalia, M. Kaminsky, and D. G. Andersen. Datacenter RPCs can be General and Fast. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019.
  • [32] K. Katrinis, D. Syrivelis, D. Pnevmatikatos, G. Zervas, D. Theodoropoulos, I. Koutsopoulos, K. Hasharoni, D. Raho, C. Pinto, F. Espina, S. López-Buedo, Q. Chen, M. Nemirovsky, D. Roca, H. Klos, and T. Berends. Rack-scale Disaggregated Cloud Data Centers: The dReDBox Project Vision. In Proceedings of the Design, Automation Test in Europe Conference Exhibition (DATE), 2016.
  • [33] R. H. Katz. High Performance Network and Channel-Based Storage. Technical Report UCB/CSD-91-650, EECS Department, University of California, Berkeley, Sep 1991.
  • [34] K. Keeton. The Machine: An Architecture for Memory-centric Computing. In Proceedings of the Workshop on Runtime and Operating Systems for Supercomputers (ROSS), 2015.
  • [35] P. Keleher, A. L. Cox, S. Dwarkadas, and W. Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the USENIX Winter Technical Conference, WTEC’94, 1994.
  • [36] L. Kontothanassis, R. Stets, G. Hunt, U. Rencuzogullari, G. Altekar, S. Dwarkadas, and M. L. Scott. Shared Memory Computing on Clusters with Symmetric Multiprocessors and System Area Networks. ACM Transactions on Computer Systems (TOCS), 23(3), 2005.
  • [37] J. Kyathsandra and E. Dahlen. Intel Rack Scale Architecture Overview. http://presentations.interop.com/events/las-vegas/2013/free-sessions---keynote-presentations/download/463, 2013.
  • [38] L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Communications of the ACM, 21(7), 1978.
  • [39] L. Lamport. The Part-Time Parliament. ACM Transactions on Computer Systems (TOCS), 16(2), 1998.
  • [40] L. Lamport, D. Malkhi, and L. Zhou. Vertical Paxos and Primary-Backup Replication. In Proceedings of the Symposium on Principles of Distributed Computing (PODC), 2009.
  • [41] L. Lamport, D. Malkhi, and L. Zhou. Reconfiguring a state machine. ACM SIGACT News, 41(1), 2010.
  • [42] J. B. Leners, T. Gupta, M. K. Aguilera, and M. Walfish. Improving Availability in Distributed Systems with Failure Informers. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2013.
  • [43] J. B. Leners, T. Gupta, M. K. Aguilera, and M. Walfish. Taming uncertainty in distributed systems with help from the network. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2015.
  • [44] J. B. Leners, H. Wu, W.-L. Hung, M. K. Aguilera, and M. Walfish. Detecting failures in distributed systems with the FALCON spy network. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2011.
  • [45] K. Li and P. Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems (TOCS), 7(4), 1989.
  • [46] J. Liedtke. Improving IPC by Kernel Design. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 1993.
  • [47] J. Liedtke. On Micro-kernel Construction. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 1995.
  • [48] J. Liedtke. Toward Real Microkernels. Communications of the ACM, 39(9), 1996.
  • [49] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch. Disaggregated Memory for Expansion and Sharing in Blade Servers. In Proceedings of the International Symposium on Computer Architecture (ISCA), 2009.
  • [50] K. Lim, Y. Turner, J. Chang, J. Renato Santos, and P. Ranganathan. Disaggregated Memory Benefits for Server Consolidation. Technical Report HPL-2011-31, HP Laboratories, 2011.
  • [51] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the ACM SIGMOD Conference, 2010.
  • [52] F. McSherry. Timely dataflow. https://github.com/TimelyDataflow/timely-dataflow.
  • [53] F. McSherry and M. Schwarzkopf. The impact of fast networks on graph analytics, part 1. http://www.frankmcsherry.org/pagerank/distributed/performance/2015/07/08/pagerank.html, 2015.
  • [54] A. Memaripour, A. Badam, A. Phanishayee, Y. Zhou, R. Alagappan, K. Strauss, and S. Swanson. Atomic In-place Updates for Non-volatile Main Memories with Kamino-Tx. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2017.
  • [55] Microsoft. Regions and availability for virtual machines in Azure. https://docs.microsoft.com/en-us/azure/virtual-machines/windows/regions-and-availability.
  • [56] D. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A Timely Dataflow System. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 2013.
  • [57] S. Novaković, A. Daglis, E. Bugnion, B. Falsafi, and B. Grot. Scale-out NUMA. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014.
  • [58] B. M. Oki and B. H. Liskov. Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems. In Proceedings of the Symposium on Principles of Distributed Computing (PODC), 1988.
  • [59] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the ACM SIGMOD Conference, 1988.
  • [60] F. B. Schneider. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys (CSUR), 22(4), 1990.
  • [61] Y. Shan, Y. Huang, Y. Chen, and Y. Zhang. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018.
  • [62] N. Shavit and D. Touitou. Software transactional memory. In Proceedings of the Symposium on Principles of Distributed Computing (PODC), 1995.
  • [63] V. Shrivastav, A. Valadarsky, H. Ballani, P. Costa, K. S. Lee, H. Wang, R. Agarwal, and H. Weatherspoon. Shoal: A Network Architecture for Disaggregated Racks. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019.
  • [64] C. A. Thekkath, T. Mann, and E. K. Lee. Frangipani: a scalable distributed file system. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP), 1997.
  • [65] S. Venkataraman, N. Tolia, P. Ranganathan, and R. H. Campbell. Consistent and Durable Data Structures for Non-volatile Byte-addressable Memory. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2011.
  • [66] H. Volos, A. J. Tack, and M. M. Swift. Mnemosyne: Lightweight Persistent Memory. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011.
  • [67] M. Wu and W. Zwaenepoel. eNVy: A Non-Volatile, Main Memory Storage System. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1994.
  • [68] J. Xu and S. Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. In Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2016.
  • [69] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.