Serverless compute is emerging as an attractive cloud computing model that lets developers focus only on the core applications—building the workloads as small, fine-grained custom programs (i.e., lambdas)—without having to worry about the infrastructure they run on. Cloud providers dynamically provision, deploy, patch, and monitor the infrastructure and its resources (e.g., compute, storage, memory, and network) for these workloads, with tenants paying only for the resources they consume, in millisecond increments. Serverless compute lowers the barrier to entry, especially for organizations lacking the expertise, manpower, and budget to efficiently manage the underlying infrastructure resources.
Today, all major cloud vendors offer some form of serverless framework, such as Amazon Lambda (amzn_lmda_wp), Google Cloud Functions (goog_func), and Microsoft Azure Functions (msft_func), along with open-source versions like OpenFaaS (openfaas) and OpenWhisk (openwhisk). These frameworks rely on server virtualization (i.e., virtual machines (VMs) (vm)) and container technologies (i.e., Docker (docker)) to execute and scale tenants’ workloads. These technologies are designed to maximize utilization of the providers’ physical infrastructure while presenting each tenant with its own view of a completely isolated machine. However, in serverless compute, where server management is hidden from tenants, these virtualization technologies become redundant: they unnecessarily bloat the code size of serverless workloads, and cause processing delays (of hundreds of milliseconds) and memory overheads (of tens of megabytes) (7899408). The increased overheads also limit the number of workloads that can execute concurrently on a single server (to fewer than a hundred or so), raising the overall cost of running such workloads in a data center.
The cloud-computing industry is now recognizing these issues, and some providers, such as Google and Cloudflare, have already started developing alternative frameworks (like Isolate (cloudflare_isolate)) that remove these technology layers (e.g., containers) and run serverless workloads directly on a bare-metal server. However, these bare-metal alternatives are inherently limited by the design restrictions of CPU-based architectures, which are exacerbated when running at the scale of cloud data centers (187030). CPUs are designed to process a sequence of instructions (i.e., a function) blazingly fast. They are not designed to run thousands of such small, discrete functions in parallel—a typical server CPU has on the order of 4 to 28 cores that can run up to 56 simultaneous threads (intel_platinum). Each serverless function interrupts a CPU core to store the state (e.g., registers and memory) of the currently running function and load its own, wasting tens of milliseconds worth of CPU cycles per context switch (such wasted cycles increase overall costs for cloud providers (Firestone:2018:AAN:3307441.3307446)). Thus, with ever-increasing network speeds—100/400G NICs are on the horizon—these overheads quickly add up, limiting throughput and leading to long-tail latency on the order of 100s of milliseconds (Kaufmann:2016:HPP:2954679.2872367).
Recently, public cloud providers have begun deploying SmartNICs in an attempt to reduce load on host CPUs (Firestone:2018:AAN:3307441.3307446). So far, these attempts have been limited to offloading ad-hoc application tasks (like TCP offload, VXLAN tunneling, overlay networking, or some partial computation) to accelerate the network processing of hosts (floem; liu2019e3; Liu:2019:ODA:3341302.3342079). However, modern SmartNICs—more specifically, ASIC-based NICs—consist of hundreds of RISC processors (i.e., NPUs) (netronome_cx), each with its own local instruction store and memory. These SmartNICs are more flexible and can run many discrete functions in parallel at high speed and low latency—unlike GPUs and FPGAs, which are optimized to accelerate specific workloads (DBLP:journals/corr/BahrampourRSS15; Firestone:2018:AAN:3307441.3307446; Putnam:2014:RFA:2665671.2665678).
Serverless workloads are by design small, presenting unique opportunities for SmartNICs to accelerate them while also achieving strict tail-latency guarantees. However, the main shortcoming of using SmartNICs is their programming complexity. Programming SmartNICs is a formidable undertaking that requires intimate familiarity with a NIC’s system and resource architecture (e.g., memory hierarchy, and multi-core parallelism and pipelining). Developers need to carefully partition, schedule, and arbitrate these resources to maximize the performance of their applications—a requirement that runs counter to the motivation behind serverless compute (i.e., where developers are unaware of the architectural details of the underlying infrastructure). Furthermore, each application has to explicitly handle packet processing, as there is no notion of a network stack in SmartNICs.
In this paper, we present λ-NIC, a framework for running interactive serverless workloads entirely on SmartNICs. λ-NIC supports a new programming abstraction (Match+Lambda) along with a machine model—an extension of P4’s match-action abstraction with more sophisticated actions—and helps address the shortcomings of SmartNICs in five key ways. First, users provide their lambdas, which λ-NIC compiles and, at runtime, selects for execution by matching on the headers of incoming requests’ packets. Second, users program their lambdas assuming a flat memory model; λ-NIC analyzes the memory-access patterns (i.e., read, write, or both) and sizes, and optimally maps these lambdas across the different memory hierarchies of the NICs while ensuring that memory accesses are isolated. Third, λ-NIC infers which packet headers are used by each lambda and automatically generates the corresponding parser for those headers, eliminating the need to manually specify packet-processing logic within these lambdas. Fourth, instead of partitioning and scheduling a single lambda across multiple NPUs, λ-NIC assumes a run-to-completion (RTC) model, exploiting the fact that lambdas are small and can run within a single NPU. The vast array of NPU cores and short service times of lambdas further mitigate head-of-line blocking and performance issues that lead to high tail latency. Lastly, serverless functions mostly communicate using independent, mutually-exclusive request-response pairs and do not need the strict in-order, streaming delivery semantics provided by TCP (Kogias:2019:RMR:3358807.3358881). λ-NIC, therefore, employs weakly-consistent delivery semantics (Caulfield:2016:CAA:3195638.3195647; Kogias:2019:RMR:3358807.3358881), alongside RDMA (rdma), for communication between serverless workloads—processing requests directly within the NIC cores without involving the host CPU. In summary, we make the following contributions:
We evaluate λ-NIC and show up to two orders of magnitude improvement in latency and throughput compared to existing serverless compute frameworks (§6).
We begin with background on the current state of the art in cloud-computing frameworks and SmartNICs (§2), followed by the challenges and motivations behind λ-NIC (§3). We then describe the overall architecture of λ-NIC (§4), along with an extensive evaluation of the system (§6). Finally, we conclude by discussing λ-NIC’s limitations and future work (§7), and comparisons with related implementations (§8).
2. Background

We now discuss the latest advancements in cloud-computing frameworks and programmable SmartNICs, which are the core building blocks of λ-NIC.
2.1. Cloud Computing Frameworks
Figure 1 illustrates four different cloud-computing frameworks in use today. Server virtualization is the foremost technology underneath cloud computing, allowing a bare-metal server to host multiple VMs, each with its own independent, isolated environment containing a separate operating system (OS), libraries, and applications. It arrived at a time when advances in hardware made it difficult for a single application to efficiently consume an entire bare-metal server’s resources, and having multiple applications co-exist on a single server raised various issues (such as isolation and contention for resources). However, as trends shifted from monoliths to building applications as microservices (microservice)—to increase manageability, resiliency, and scalability—the overheads of having a separate OS for each microservice were no longer negligible. This gave rise to container virtualization (docker), a way to isolate applications’ binaries and libraries while sharing an OS.
Still, with the growing complexity and scale of cloud workloads, it became daunting for many users to provision and manage infrastructure resources, with tasks requiring fine-grained allocation of resources under changing workload demands. Early solutions settled on over-provisioning these resources, incurring added cost for idle resources. More recently, serverless compute (serverless) has emerged as a favorable compute model that lifts such operational and financial burdens from users by letting them specify only the workloads, their memory and timing constraints, and when and which events (e.g., API calls, database updates, incoming data streams) trigger them. In response, cloud providers independently provision infrastructure resources, deploying a set of containers on-demand to serve workloads’ requests. These containers are quickly taken down once the workloads complete, and users are charged only for the time a container executes. Serverless workloads are therefore short-lived, with strict compute-time and memory limits (e.g., up to 15 minutes of compute time for Amazon Lambda (lambda_limit)).
Figure 2 depicts a typical serverless compute framework. The workload manager compiles users’ workloads into executable binaries, which, along with their data dependencies, are stored in a global storage (e.g., Amazon S3 (s3), Google Cloud Storage (google_cloud_storage), or Microsoft Azure Storage (Calder:2011:WAS:2043556.2043571)). The gateway proxies users’ requests (or events) to the appropriate workloads, which typically run as containers on a set of dynamically provisioned servers, called worker nodes. Upon completion, results are written back to the storage or forwarded to other workloads for further execution.
Serverless compute frameworks embed users’ workloads within containers managed by orchestration engines (e.g., Docker) running atop an OS, which provide memory, compute, and file-system isolation using OS-based mechanisms (e.g., cgroups (cgroups) and namespaces (namespaces)). Each container maintains its own libraries and binaries, and communicates with other containers using an overlay network that is set up using virtual switches (like Open vSwitch (OVS) (ovs)).
However, running workloads as containers incurs additional processing and networking overheads. As an alternative, Isolate functions (cloudflare_isolate) run workloads directly on the bare-metal server itself (Figure 1), while providing all the benefits of a typical serverless framework (e.g., resource isolation between workloads). Although still in early development, Isolate functions are showing promising results: up to 3x improvement in request latency, 90% less memory consumption than containers, and faster startup times.
Today, serverless frameworks find applications in two types of use cases (azure_serverless_cookbook; lambda_use_cases; goog_cf_usecases): (1) running API backends that serve interactive applications, such as returning static or dynamic content, or key-value responses based on users’ requests; and (2) processing changes in data stores in real time, such as cropping a newly uploaded image or running a custom operation on a newly added database entry.
The complexity of lambda functions ranges from simple arithmetic operations to machine-learning decisions, and these functions are generally short-lived. A bare-metal server can house thousands of such lambda functions; however, doing so causes CPUs to constantly context switch between them. These context switches, along with other communication overheads, make it difficult for services (using lambda functions) to meet their tail-latency service-level objectives (SLOs) (cloudflare_isolate).
To mitigate these issues, efforts are underway to move computation down to SmartNICs; however, unlike λ-NIC, the focus is mostly on offloading either network processing or some small portion of applications to these NICs (floem; liu2019e3; Liu:2019:ODA:3341302.3342079).
|             | FPGA-based SmartNICs   | ASIC-based SmartNICs    | SoC-based SmartNICs     |
| Performance | 10+ cores, low latency | 200+ cores, low latency | 50+ cores, high latency |
2.2. Programmable SmartNICs
In addition to handling basic networking tasks, SmartNICs can offload more general tasks that a CPU normally handles (e.g., checksum, TCP offload, and more). Based on their architecture and processing capabilities, these SmartNICs come in three different types: FPGA-, ASIC-, and SoC-based (Firestone:2018:AAN:3307441.3307446). Table 2.1 summarizes the main differences.
Alongside other issues (e.g., steep development cost and power consumption (Firestone:2018:AAN:3307441.3307446)), FPGA-based NICs (fpga) are dominated by the on-chip interconnect overhead (chen2010optimization) that significantly limits the lookup tables (LUTs) and memory (e.g., SRAM) resources available for executing lambda functions—today’s large FPGAs can barely support a small number of processing cores. SoC-based NICs (e.g., Mellanox Bluefield (mellanox_bluefield) and Broadcom Stingray (broadcom_smartnic)) are easier to program as they run a Linux-like OS on embedded cores (like ARM); however, similar to server CPUs, they are susceptible to high tail latency due to context-switch and network-stack overheads. It is therefore questionable whether these SoC-based NICs can sustain higher line rates with low latency (Firestone:2018:AAN:3307441.3307446).
Due to these limitations of FPGA- and SoC-based NICs, we opted for ASIC-based NICs when designing λ-NIC. These NICs consist of an ASIC that can sustain high traffic rates; contain hundreds of non-cache-coherent, multi-threaded RISC cores (e.g., NPU, ARM, or RISC-V), operating at high clock speeds, along with specialized hardware functions (e.g., lookup, load balancing, queuing, and more); and are capable of running embarrassingly parallel workloads with low latency. Furthermore, recent advances in the design of these SmartNICs (e.g., Netronome Agilio (netronome_cx) and Marvell LiquidIO (cavium_liquidio)) make it easier for users to customize the NIC’s data-plane logic (e.g., parse, match, and action) using, for example, P4 (Bosshart:2014:PPP:2656877.26568900) and Micro-C (netronome_programming) programs, thus exposing dataflow and C-like abstractions a typical programmer is familiar with, without the need for an OS.
In §6, we demonstrate how the unique characteristics of serverless workloads (i.e., short-lived with strict compute and memory limits) make ASIC-based SmartNICs a viable execution platform to accelerate lambda functions.
3. Motivation & Challenges
The key motivation behind λ-NIC is to accelerate interactive serverless workloads by: (1) mitigating excessive compute and network virtualization overheads and inefficiencies of the modern server architecture to achieve low and bounded tail latencies for lambda functions, and (2) exploiting the right domain-specific architecture (i.e., ASIC-based SmartNICs) to sustain high throughput while reducing CPU load and cost-per-watt in cloud data centers.
Low latency and high throughput serverless functions.
The key tenet of serverless compute is that it establishes a clear demarcation between users and infrastructure providers; users only specify programs (or functions) that the providers efficiently execute on their infrastructure. Yet, all modern serverless frameworks are based on technologies (i.e., VMs and containers) that were designed to give users explicit control over the underlying infrastructure from the get-go. This control—in the form of compute and network (physical and overlay) virtualization—adds a significant overhead to serverless functions. For interactive serverless workloads, with strict tail latency SLOs, eliminating such computational and networking overheads is becoming crucial (227649).
The modern server architecture (with CPUs and GPUs) further adds to these overheads. CPUs are Von Neumann machines designed to efficiently execute a long sequence of instructions (or a function). However, they perform poorly when executing a large number of small serverless functions, where significant time is wasted context switching between functions. Similarly, GPUs are Single-Instruction-Multiple-Data (SIMD) machines that serve as look-aside accelerators (lookaside) in a typical server, controlled by the primary CPU. Although these GPUs have, in recent years, shown orders of magnitude improvement in accelerating machine-learning workloads (Sujeeth:2011:OIP:3104482.3104559; Coates:2013:DLC:3042817.3043086) (which by nature are long-running, batch jobs), they perform poorly for low-latency, interactive tasks (like serverless functions). Even with technologies (such as GPUDirect RDMA (GPUDirectRDMA) and RDMA-over-Converged-Ethernet (RoCE) (rocev2)) that can bypass a CPU and push data directly into the GPU or main memory, the requests for serverless functions still have to traverse a NIC—adding non-negligible, sub-microsecond delays.
λ-NIC eliminates both these virtualization and architectural overheads by running serverless workloads directly on the vast array of NIC-resident RISC cores.
A domain-specific processor for serverless functions.
Until now, cloud providers have almost always relied on newer, faster CPUs to improve applications’ performance. More recently, they have started looking into other fine-grain, domain-specific processors, like look-aside or bump-in-the-wire accelerators (GPUs or FPGAs (Firestone:2018:AAN:3307441.3307446)). This is because, with Moore’s Law slowing down (Hennessy:2019:NGA:3310134.3282307), CPUs today are no longer a viable solution to meet the ever-rising performance demands of customer workloads in a cost-efficient way—as has been demonstrated by both GPUs (for improving machine-learning training and inference throughput (Coates:2013:DLC:3042817.3043086)) and FPGAs (for accelerating search indexes (Putnam:2014:RFA:2665671.2665678) and host networking (Firestone:2018:AAN:3307441.3307446)). We believe that ASIC-based SmartNICs present the same opportunity for accelerating serverless workloads, with orders of magnitude improvement in performance-per-watt at one-tenth of the hardware cost (netronome_power; colfax_netronome; Sohan2010Characterizing10G), compared to server CPUs and GPUs (Firestone:2018:AAN:3307441.3307446).
3.1. Key Challenges
The embarrassingly-parallel and independent nature of serverless workloads takes away much of the complexity that arises when synchronizing state between functions (Liu:2019:ODA:3341302.3342079), making them an ideal candidate for SmartNICs with hundreds of cores. Still, executing them on these NICs is no panacea and comes with its own unique challenges:
a. Programming SmartNICs
Due to their non-cache-coherent design, programming ASIC-based SmartNICs has always been considered hard (Chen:2005:SAH:1065010.1065038); non-coherency requires developers to program each NPU core separately, forcing them to manually handle synchronization between individual functions. With serverless functions, however, this is no longer an issue, as functions do not share state and can run independently. Still, the lack of an OS layer in these NICs—though useful in reducing unwanted processing—puts the onus of mapping and placing these functions, across various clusters of cores and the memory hierarchy, on the developers, requiring them to have low-level knowledge of the NIC architecture, its firmware, and the specialized languages it supports. To take this burden away from developers, we need a high-level abstraction and a framework that can automatically and efficiently compile, optimize, and deploy serverless programs across a collection of these SmartNICs.
b. Offloading serverless workloads
NPUs are optimized for packet forwarding and typically do not support features (e.g., floating-point operations, dynamic-memory allocation, recursion, and reliable transport) that a general-purpose CPU supports (jungck2011packetc; george2003taming). A serverless framework, therefore, must be able to compile workloads that rely on these features by, for example, transforming programs with floating-point operations to fixed-point arithmetic (Banner:2018:SMT:3327345.3327421), dynamic-memory allocations to explicit memory calls, and recursions to iterations, as well as employing other forms of reliable (or weakly-consistent) delivery protocols (e.g., RoCEv2 (rocev2), Lightweight Transport Layer (Caulfield:2016:CAA:3195638.3195647), or R2P2 (Kogias:2019:RMR:3358807.3358881)). To achieve high throughput and low latency, emerging workloads are also lowering their dependence on these features. For example, deep-learning training and inference are shown to perform well with lower, fixed bit-width integers (Banner:2018:SMT:3327345.3327421; dettmers20158). Moreover, serverless request-response (RPC) pairs are mostly independent and mutually exclusive, and do not need TCP’s strict, reliable, in-order streaming delivery of messages (Kogias:2019:RMR:3358807.3358881).
c. Ensuring security under multi-tenancy
Lambdas run alongside other workloads (e.g., microservices) in a data center and share infrastructure resources. When using SmartNICs: (1) a serverless framework should ensure that lambdas running on the NICs do not interfere with each other or degrade the network performance between the NIC and host CPUs that are running traditional workloads; (2) the framework should reserve ample SmartNIC resources (i.e., cores and memory) for basic NIC operations (e.g., TCP/IP offload and checksums) while maximally consuming the remaining resources for serverless functions; (3) serverless functions should execute in their own isolated sandboxes, and the framework should restrict them from accessing each other’s working sets; and (4) the framework should be robust against security attacks (e.g., DDoS) from both outside actors and malicious tenants.
4. λ-NIC Overview
λ-NIC adds a new backend to existing serverless frameworks with its own programming abstraction, called Match+Lambda (§4.1), and an accompanying machine model (§4.2) that makes it easier to program and deploy lambdas directly on a SmartNIC.
4.1. Match+Lambda Abstraction
λ-NIC implements a Match+Lambda programming abstraction that extends the traditional Match+Action Table (MAT) abstraction (bosshart13) with more complicated actions (lambdas).
In λ-NIC, users provide one or more lambdas written in a restricted C-like language, called Micro-C (netronome_programming); we use Micro-C as it is the native language of the SmartNICs we use for evaluation. The Micro-C language can support a large class of serverless functions (§3.1); however, λ-NIC is not limited to this language and can work with more feature-rich languages supported by other SmartNICs.
Listing LABEL:lst:lambda_signature shows the signature of the top-level function with which each lambda must begin, having two arguments. The number and structure of all the supported headers (i.e., the EXTRACTED_HEADERS_T data structure) and function parameters (i.e., MATCH_DATA_T) are defined a-priori in λ-NIC.
The lambdas operate directly on these parameters and headers without having to parse packets, which is done at the parse stage (Figure 3).
Furthermore, these functions can have both local objects and global objects that persist state across runs.
Listing LABEL:lst:web_server_lambda shows a real-world example of a lambda running as a web server.
The function reads the server address (Line 2) from the incoming request’s headers. It then copies the requested web content from memory into the header location pointed to by the server address (Line 2), before returning.
The user further specifies the corresponding P4 code for the match stage (Listing LABEL:lst:generated_match); we use P4 as it is the most widely used data-plane language (Bosshart:2014:PPP:2656877.26568900).
During compilation, the workload manager assigns unique identifiers (IDs) to each of these lambdas, shares this mapping with the gateway, and populates the ID variables (e.g., OTHER_LAMBDA_ID) in the P4 code.
For each incoming request, the gateway inserts the ID of the destined lambda as a new header.
The match stage of λ-NIC (as defined in the P4 code) checks the ID listed in the new header and calls the matching lambda (implemented as an extern in P4 (p4_spec)), or sends the packet to the host OS in cases where no matching ID is found.
In the end, the workload manager pairs the lambdas (Micro-C code) and the match stage (P4 code) into a single Match+Lambda program, and prepends it with generic P4 packet-parsing logic. It then compiles and transforms this program into a format that the target SmartNIC can execute (§5), while ensuring fair allocation of resources and isolation between lambda workloads.
4.2. Abstract Machine Model
In λ-NIC, users write their Match+Lambda workloads against an abstract machine model (Figure 3). In this model: (1) lambdas are independent programs that do not share state and are isolated from each other; only a matching rule can invoke these functions. (2) The match stage serves as a scheduler (analogous to the OS networking stack) that forwards packets to the matching lambdas or the host OS. Finally, (3) a parser handles packet operations (like header identification), and lambdas operate directly on the parsed headers.
These properties of the Match+Lambda machine model make it easier for software developers to express serverless functions by separating out the parsing and matching logic, and for hardware designers to efficiently support the model on their target SmartNICs (§5 demonstrates one such implementation using Netronome SmartNICs). Thus, the abstract machine model enables unique optimizations that let serverless workloads run as lambdas in parallel without any interference from each other.
4.2.1. Design Characteristics
The abstract machine model has the following three design characteristics:
D1: Run-to-completion execution
λ-NIC executes each lambda to completion. The machine model exposes a dense array of discrete, non-coherent processing threads (Figure 3), each having its own instruction and data store in memory. The lambdas execute in the context of a given thread, maximally utilizing the resources of that thread only (e.g., CPU and memory). Given the short service times and strict memory footprints of serverless functions, modern SmartNICs hold ample resources per thread to execute these lambdas (netronome_programming).
Having a large number of parallel threads further mitigates issues related to head-of-line blocking, where lambdas wait behind other lambdas to finish or context switch, as in the case of server CPUs. These issues severely affect the performance of lambdas at the tail and require more complicated scheduling policies (like preemption (kaffes2019shinjuku) or core allocation (227649)). With λ-NIC, however, this is not the case, as lambdas—even at the tail—can run to completion without degradation in performance (§6).
Moreover, the highly-parallel nature and run-to-completion characteristic of the machine model ensure strong performance isolation between different lambdas running as separate threads, and λ-NIC implements weighted fair queuing (WFQ) (wfq) to route requests between these threads. We leave it as future work to explore more sophisticated resource-allocation mechanisms (e.g., DRF (Ghodsi:2011:DRF:1972457.1972490)) to further improve the performance of λ-NIC.
D2: Flat memory access
The abstract machine model lets users write lambdas assuming a flat (virtual) memory address space. All objects (local and global) on the thread stack are allocated from within that address space. This has the advantage that users do not have to worry about the complex memory structures and hierarchies present in modern SmartNICs (netronome_programming). Each of these memories comes with its own performance benefits and is necessary for reaching high speeds in these NICs; however, exposing them makes it the programmer’s responsibility to utilize these memories efficiently. λ-NIC’s machine model takes this onus away by exposing a single, uniform memory to the user.
When deploying to a particular SmartNIC, the compiler (or the workload manager) can take the NIC’s specifications into account and perform target-specific optimizations to effectively utilize its memory resources (§5.1). Users can also provide pragmas—specifying which objects are read more frequently—to guide the compiler in allocating objects to memories based on their access needs; it can place small or hot objects in core-local memories, and large or less frequently used ones in external, shared memories.
Furthermore, having a virtual memory space per lambda can let the compiler enforce policies for data isolation, since virtual spaces do not interfere and are isolated from each other. The compiler can insert static and dynamic assertions (aho1986compilers) to ensure that a lambda does not access the physical memory of other lambdas on the target SmartNIC.
D3: Network transport
In λ-NIC, the primary mode of communication between the gateway, lambdas, and external services (e.g., storage) is via Remote Procedure Calls (RPCs). These RPCs are small, typically single-packet, request-response messages (spdy; Kogias:2019:RMR:3358807.3358881). The parser, in the abstract machine model (Figure 3), decomposes these messages into headers and forwards them to the match stage, which further directs them to a matching lambda (by looking up the lambda ID associated with each message). Multi-packet RPCs, depending upon their size, can either be processed directly by the parse and match stage or pushed into the memory over RDMA (i.e., RoCEv2 (rocev2)). (In the latter case, an event RPC triggers the lambda to start reading data from the desired memory location.)
Already, modern datacenter applications (like Amazon DynamoDB (amzn_dynamo) and deep learning (Banner:2018:SMT:3327345.3327421; dettmers20158)) are choosing to forgo the strict, reliable, in-order guarantees of TCP, which are far stronger and more computationally intensive than what these applications need. Instead, they are being designed to work with weaker guarantees to achieve low tail latency (Kogias:2019:RMR:3358807.3358881). λ-NIC exploits these facts and assumes weakly-consistent delivery semantics for RPCs that are processed by the parse and match stage. A sender (the gateway or an external service) tracks outgoing RPCs to lambdas and is responsible for resending a message in case of timeouts or packet drops. λ-NIC, on the other hand, performs packet reordering at the SmartNIC for multi-packet RPCs. (We measured that Netronome SmartNICs can reorder four packets using 120 instructions, which is only 1.3% of the instructions used by our benchmark lambdas (§6.4).)
5. λ-NIC Implementation

In this section, we present an implementation of λ-NIC on P4-enabled Netronome SmartNICs (Figure 4). These NICs contain hundreds of RISC cores grouped into islands. Each core has its own instruction store and local memory, along with a per-island Cluster Target Memory (CTM) (netronome_programming), and is capable of executing multiple threads concurrently. There are also on-chip internal memories (IMEMs) and an external memory (EMEM) shared between all islands and their cores, and a dedicated scheduler unit that directs incoming packets to cores. The architecture of Netronome SmartNICs therefore has the necessary elements to efficiently implement λ-NIC’s Match+Lambda abstract machine model: cores can execute lambdas to completion (D1), data can reside in different memories (e.g., local, CTM, or more) based on their usage patterns (D2), and the scheduler can direct RPCs to lambdas (D3).
A more programmable scheduler (e.g., RMT/PIFO-based (Sivaraman:2016:PPS:2934872.2934899; sp-pifo)) could execute the parse and match stages of the machine model directly, with cores processing only the lambda logic. However, the scheduler inside the Netronome NICs we used for our evaluation is not programmable; it is work-conserving and uniformly distributes incoming traffic to all cores. We therefore execute all three stages (parse, match, and lambda) together inside a core, with every core running the same Match+Lambda program. (An alternative approach is to pipeline these stages and run them on separate cores; we intend to explore this as future work. Also, at present, the binary running on a SmartNIC must be swapped wholesale with a new one each time, resulting in downtime; this constraint is expected to disappear in next-generation NICs (§7).) For single-packet messages, the scheduler directs an incoming packet to a core at random, which, after parsing, selects the matching lambda using the ID embedded in the packet. Multi-packet messages are committed to memory via RDMA, and lambdas read directly from the desired location.
5.1. Target-Specific Optimizations
Next, we discuss optimizations that improve the execution time and binary size of Match+Lambda programs (§6.4); these optimizations arise from our design choices and the architectural constraints of Netronome NICs.
Lambda coalescing. As multiple lambdas run on a single core, the workload manager performs program analysis (i.e., dead-code elimination and code motion (Shahbaz:2015:CIR:2774993.2775000)) to remove duplicate logic (e.g., for modifying similar headers or generating packets), moving it into shared libraries as helper functions.
Match reduction. The workload manager can further reduce the number of tables (e.g., for route management) defined in the match stage of a Match+Lambda program. Each new lambda consists of both a parse and a match stage; the workload manager can compose these stages with the logic already running on the core, removing unused headers and duplicate match fields from the final code. Furthermore, the P4 tables are converted into if-else sequences, which the NIC core can execute more efficiently. Transforming tables into if-else sequences also helps reduce the total number of instructions in the final binary running on a core.
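To illustrate the table-to-branches transformation, the sketch below turns a list of exact-match entries into an if-else chain. The emitted C-like syntax and all names are illustrative assumptions; the real compiler targets Netronome's micro-C rather than generating text.

```python
# Sketch: convert a P4-style exact-match table into an if-else sequence.
def table_to_if_else(field, entries, default_action):
    """entries: list of (match_value, action_name) pairs."""
    lines = []
    kw = "if"
    for value, action in entries:
        lines.append(f"{kw} ({field} == {value}) {{ {action}(); }}")
        kw = "else if"  # subsequent entries chain off the first branch
    lines.append(f"else {{ {default_action}(); }}")
    return "\n".join(lines)

code = table_to_if_else("hdr.lambda_id",
                        [(1, "web_server"), (2, "kv_client")],
                        "drop")
```

Straight-line branches like these avoid the table-lookup machinery entirely, which is why they execute faster on the NIC cores and shrink the instruction count.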
Memory stratification. Based on access patterns, the workload manager chooses the most efficient memory for an object at compile time. It can also use the object's size, or hints from the user (as pragmas), to decide whether to place the object in local memory, CTM, IMEM, or EMEM.
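A minimal sketch of this placement policy, assuming illustrative capacities for each memory tier (the real Netronome limits differ), with a user pragma overriding the size-based choice:

```python
# Sketch of compile-time memory placement: pick the smallest (fastest)
# memory tier that fits the object, unless a user pragma overrides it.
TIERS = [                      # (name, capacity in bytes), fastest first
    ("local", 4 * 1024),       # per-core local memory
    ("ctm",   256 * 1024),     # per-island Cluster Target Memory
    ("imem",  4 * 1024 * 1024),# on-chip internal memory
    ("emem",  float("inf")),   # external memory
]

def place_object(size, pragma=None):
    if pragma is not None:     # explicit user hint wins
        return pragma
    for name, capacity in TIERS:
        if size <= capacity:
            return name
    return "emem"
```

This mirrors the split reported in §6.4, where a large image buffer lands in IMEM while small web-server results stay in the island's CTM.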
We now compare the performance of -NIC with the bare-metal and container backends, both in isolation, when executing a single lambda (§6.3.1), and in a shared setting, when running multiple lambdas together (§6.3.2). We also evaluate -NIC's impact on resource utilization and startup times, as well as the effectiveness of the -NIC compiler in optimizing lambda program size when running on the NIC cores (§6.4).
6.1. Test Methodology
6.1.1. Baseline framework
To evaluate -NIC against existing serverless backends, we select OpenFaaS (openfaas) as our baseline serverless compute framework; it closely resembles the architecture depicted in Figure 2. We choose OpenFaaS for its simplicity, ease of deployment, extensive feature set, and broad adoption by the community. (OpenFaaS is among the most popular open-source serverless compute frameworks, with 12,000 stars on GitHub (serverless_comparison).) It is written in Golang (golang) and includes: (1) a web UI, (2) an autoscaler to scale lambdas as demand changes, (3) a Prometheus (prometheus)-based monitoring engine to analyze system state, and (4) a gateway with a NAT (rfc2663) to proxy users' requests to the appropriate lambdas. Each of these components, as well as the lambdas, runs as a Docker (docker) container, managed via Kubernetes (kubernetes) or Docker Swarm (docker_swarm).
Adding a bare-metal backend
For evaluating emerging runtimes like Isolate (cloudflare_isolate), we add support for a bare-metal backend to OpenFaaS. It is implemented as a Python service that runs on a bare-metal server as a standalone process, launching lambdas as new threads to serve users' requests. (We found the performance of the Python service similar to a comparable C implementation, except for the one-time startup cost of the backend; we therefore used the Python service for its ease of integration with OpenFaaS.) The service relies on a Raft-based (Ongaro:2014:SUC:2643634.2643666) distributed key-value store, called etcd (etcd), to sync lambda-related state (e.g., the number of active lambdas, their placement, and load-balancing policies) with the gateway to correctly proxy requests. Our goal with the bare-metal backend is to analyze how lambda performance improves in the absence of the container processing stack.
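The thread-per-request structure of this backend can be sketched as follows; the etcd synchronization and OpenFaaS integration are omitted, and all names are illustrative assumptions.

```python
# Sketch of the bare-metal backend: a standalone process that launches
# each lambda invocation as a new thread.
import threading

LAMBDAS = {"echo": lambda req: req.upper()}  # registered lambda handlers

def handle_request(name, req, results):
    results.append(LAMBDAS[name](req))

def serve(requests):
    """Run one thread per incoming (lambda name, request) pair."""
    results = []
    threads = [threading.Thread(target=handle_request, args=(n, r, results))
               for n, r in requests]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Skipping the container stack is exactly what this backend isolates: requests go straight from the service's socket to a thread, with no image, namespace, or overlay-network layers in between.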
Introducing -NIC extensions
We built -NIC as an extension to our baseline framework, inheriting all of OpenFaaS’s core features with additional support for running lambdas on P4-enabled Netronome SmartNICs. With these extensions, the baseline framework can simultaneously deploy lambdas to containers, bare-metal, and SmartNIC backends. We also augment etcd to share state and manage -NIC deployments across multiple worker nodes.
6.1.2. Testbed setup
Our evaluation testbed consists of a cluster of five servers (Figure 5), each housing two Intel Xeon Gold 5117 processors (14 physical cores each) with DDR4 dual-ranked RAM and a SATA SSD. One of the servers (M1) acts as the master node, running the Kubernetes services, gateway, workload manager, memcached server (memcached), web interface, and monitoring engine. M1 comes with a Broadcom 57412 dual-port NIC and a quad-port NIC, which are used for management traffic. The other servers (M2–5) are worker nodes, each equipped with a dual-port Netronome Agilio CX SmartNIC (netronome_cx) having 56 RISC cores (8 threads per core) and on-board RAM. All servers connect to an Arista DCS-7124S switch. The backends communicate over an overlay network using Kubernetes' Calico (calico) networking plugin for high-performance switching and policy management across nodes inside a Kubernetes cluster.
6.2. Benchmark Workloads
We evaluate -NIC on three different types of interactive lambdas (i.e., a web server, a key-value client, and an image transformer), each reflecting a popular use case (azure_serverless_cookbook; lambda_use_cases; goog_cf_usecases).
a. Web server
A common usage pattern for lambdas is to serve web contents (lambda_use_cases), such as text or HTML pages, similar to traditional web servers (like nginx (nginx)). These workloads are typically self-contained and do not need information from external sources (e.g., data stores) to service a request. For our experiments, we wrote a lambda that returns text responses based on the incoming requests.
b. Key-value client
Next, we consider workloads with external dependencies, which need information from remote services. These workloads query users' data from external storage (e.g., databases or key-value stores such as memcached (memcached)), customize the retrieved data, and finally send the processed data back to the user. Moreover, these workloads generate extensive intra-data center requests and typically have strict tail-latency requirements to meet user service-level objectives (SLOs). To evaluate this case, we implement lambdas acting as key-value clients that generate write (SET) and read (GET) requests to a memcached server.
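A minimal sketch of such a key-value client lambda, with an in-memory dict standing in for the remote memcached server (the real lambda issues network requests); the request schema and the trivial post-processing step are illustrative assumptions.

```python
# Sketch of a key-value client lambda issuing SET/GET requests.
STORE = {}  # stand-in for the external memcached server

def kv_set(key, value):
    STORE[key] = value
    return "STORED"

def kv_get(key):
    return STORE.get(key)

def kv_client_lambda(request):
    """request: {'op': 'SET'|'GET', 'key': ..., 'value': ...}"""
    if request["op"] == "SET":
        return kv_set(request["key"], request["value"])
    value = kv_get(request["key"])
    # "customize the retrieved data" -- here, a trivial transform
    return value.upper() if value is not None else "MISS"
```

Because each invocation triggers a round trip to the store, this workload's latency is dominated by intra-data center RPCs, which is what makes it tail-latency sensitive.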
c. Image transformer
Finally, we evaluate workloads that involve real-time, interactive processing of large datasets (e.g., image processing or stream processing) (Viola:2004:RRF:966432.966458; Chen:2016:RDP:2882903.2904441), where the datasets span multiple packets and must be stored in memory (i.e., DRAM). These workloads customize the requested datasets and either return a response to the user immediately or store results back to memory for further processing (lambda_image_editing). For our evaluation, we consider lambdas that transform RGBA images to grayscale.
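A sketch of the grayscale conversion such a lambda might perform, using the common ITU-R BT.601 luma weights (an assumption; the paper does not specify the exact formula):

```python
# Sketch of the image-transformer lambda: RGBA pixels -> grayscale.
def rgba_to_gray(pixels):
    """pixels: list of (r, g, b, a) tuples -> list of 0-255 luma values.

    Uses BT.601 weights; the alpha channel is dropped.
    """
    return [round(0.299 * r + 0.587 * g + 0.114 * b)
            for r, g, b, _a in pixels]
```

Unlike the previous two workloads, the pixel buffer here spans many packets, so on -NIC it would first be committed to NIC memory via RDMA before the lambda reads it.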
6.3. System Performance
We now discuss how -NIC performs, both in terms of latency and throughput, compared to the bare-metal and container backends. We evaluate two cases: (1) when there is only a single lambda running on a backend in isolation, and (2) when there are multiple lambdas running, all contending for the shared resources (i.e., processing cores and memory).
6.3.1. Performance in Isolation
We first look at the latency and throughput of a lambda in isolation.
We measure the latency of each backend: the time it takes for the gateway to send a request to a node running a single thread of a pre-loaded (warm) lambda and receive a response back (Figure 6). For the web-server and key-value client lambdas, -NIC outperforms containers by 880x and bare-metal by 30x in average latency—completing requests in under —while still achieving 5x to 3x improvements for the data-intensive image-transformer lambda. Improvements are more pronounced at the tail (i.e., 99th percentile), where -NIC achieves 5x to 24x better tail latency than bare-metal across the three benchmark lambdas. For the key-value client lambda, -NIC even improves upon latencies reported for a highly-optimized cloud-scale data center (fb_memcached) by three orders of magnitude. Furthermore, the container and bare-metal backends exhibit longer tail latency, particularly for the short-lived web-server and key-value lambdas—likely an artifact of miscellaneous software overheads (e.g., context switching, cache management, and the network stack).
We see similar improvements in throughput, measured as requests serviced per second, for the three lambdas when running on -NIC. We carry out two separate experiments: (1) closed-loop testing, with the sender generating each request one after the other, and (2) parallel testing with 56 concurrent requests—the maximum number of threads that can run simultaneously on our testbed server CPU—to stress the backends under concurrent load. -NIC outperforms both the container and bare-metal backends (Figure 7), servicing requests 27x to 736x faster than the two backends for the web-server and key-value client lambdas, and 5x to 15x faster for the image-transformer lambda.
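The two measurement modes can be sketched as follows; the backend here is a stand-in function, the worker count mirrors our testbed's 56 threads, and the numbers produced are not the paper's results.

```python
# Sketch of the two throughput-measurement modes described above.
import time
from concurrent.futures import ThreadPoolExecutor

def closed_loop(backend, n):
    """Closed-loop testing: send n requests back-to-back.

    Returns requests serviced per second.
    """
    start = time.perf_counter()
    for i in range(n):
        backend(i)
    return n / (time.perf_counter() - start)

def parallel(backend, n, workers=56):
    """Parallel testing: keep up to `workers` requests in flight at once."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(backend, range(n)))
    return n / (time.perf_counter() - start)
```

Closed-loop testing exposes per-request latency, while the parallel mode saturates the backend the way 56 simultaneous CPU threads would.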
6.3.2. Performance Under Resource Contention
In a real setting, a serverless backend will typically run multiple lambdas at the same time. These workloads will contend for their fair share of resources (i.e., CPU and memory), leading to added delays due to, for example, context switching of lambdas and movement of data to and from CPU and memory.
Effects of context switching
In the previous experiment, we measured the performance of each backend when running a single lambda in isolation. Now, we evaluate a setup with three distinct web-server lambdas running on a single backend at once. We generate requests for each of these workloads in round-robin fashion, causing the processor to context switch between lambdas when servicing each incoming request. Figure 8 shows the context-switching overhead on latency for -NIC as well as the bare-metal backend (with 1 and 56 threads) when executing the warm (pre-loaded) web-server lambdas. With multiple lambdas running concurrently, the bare-metal backend suffers even higher latency (178x to 330x) compared to -NIC. Moreover, -NIC completes requests 55x to 100x faster than the bare-metal backend with both 1 and 56 threads (Table 2)—a difference of at least 9% compared to running a single lambda in isolation (§6.3.1), whereas -NIC shows no significant change. The performance of containers was far worse than even the bare-metal backend (not shown). These experiments demonstrate that, unlike the container and bare-metal backends (on server CPUs), -NIC is not susceptible to context switching and performs better under resource contention, by virtue of its vast array of on-chip NPU cores and the absence of an operating system and container software stack.
6.4. Other Metrics
We also compare the memory and CPU usage at the host and the SmartNIC when running a single data-intensive image-transformer instance in isolation. Table 3 shows the additional resources utilized by each backend when servicing 56 concurrent requests. Containers have the largest memory footprint and consume an order of magnitude more host memory and CPU cycles than the bare-metal backend. On the other hand, as expected, -NIC’s impact on the host memory and CPU is negligible, and it consumes roughly the same amount of NIC memory for the image-transformer workload as the bare-metal backend on the host.
Table 3:
| | -NIC | Bare-metal | Container |
|---|---|---|---|
| Host CPU (Avg. %) | +0.1 | +9.2 | +13.7 |
| Host Memory (MiB) | 0 | +62.5 | +219.5 |
| NIC Memory (MiB) | +63.2 | 0 | 0 |

Table 4:
| | -NIC | Bare-metal | Container |
|---|---|---|---|
| Workload Size (MiB) | 11.0 | 17.0 | 153.0 |
| Startup Time (s) | 19.8 | 5.0 | 31.7 |
Next, we measure the startup time: the total time a backend takes to download the image-transformer binary and its dependencies and start serving requests. To compare startup times, we use: (1) the lambda binary size (i.e., the SmartNIC's compiled firmware, a Python library packaged using setuptools and Wheel (python_wheel), and the Docker container), and (2) the boot-up time of the lambda, from launching the system to responding to a user request. -NIC's image-transformer binary is 13x smaller than a container image and comparable in size to a bare-metal binary (Table 4). In addition, the image-transformer lambda takes 38% less time on -NIC to service the first request compared to a container. On the bare-metal backend, the image-transformer binary starts up in under 5 seconds (4x faster than -NIC); however, our evaluation does not account for any framework overheads (like Isolate (cloudflare_isolate)), which would likely lead to higher startup times. In summary, while slow start is a known issue in serverless compute, -NIC keeps its additional delay over the bare-metal backend to half that of the container overhead.
We now report the results of our compiler optimizations. The number of instructions in the naïve implementation—consisting of two key-value clients, a web server, and an image-transformer lambda—is gradually reduced by applying the target-specific optimizations from §5.1. First, we perform lambda coalescing for the two distinct key-value clients: they contain equivalent logic for generating a new packet to query memcached, which we combine and reuse. We further coalesce the web-server and image-transformer lambdas, both of which follow a response pattern that does not query external services; hence, we combine their reply logic. Next, we apply match reduction. The naïve implementation adds a separate table for managing routes for each lambda; we combine these tables into one and use individual parameter values (defined as P4 metadata) for route management. Finally, we perform memory stratification to place variables into appropriate memories based on their sizes. For example, the image variable within the image-transformer lambda is mapped to IMEM, whereas the web-server results are mapped to the CTM inside the island. These optimizations bring the total number of instructions in the final binary down to 8,050 (a reduction of 9.56% from the naïve implementation), hence improving average latency (Crowley:2000:CPA:335231.335237) or allowing additional lambdas to fit within the program-size constraints of the Netronome SmartNIC.
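The match-reduction step can be sketched as merging per-lambda route tables into a single table keyed by (lambda ID, destination); the key structure and names are illustrative assumptions, standing in for the P4-metadata parameterization described above.

```python
# Sketch of match reduction: merge per-lambda route tables into one
# table keyed by (lambda_id, destination).
def merge_route_tables(per_lambda_tables):
    """per_lambda_tables: {lambda_id: {dst: next_hop}} -> one table."""
    merged = {}
    for lam_id, table in per_lambda_tables.items():
        for dst, next_hop in table.items():
            merged[(lam_id, dst)] = next_hop
    return merged

def route(merged, lam_id, dst):
    """One lookup replaces a separate table per lambda."""
    return merged.get((lam_id, dst))
```

One shared table means one set of match logic in the final binary, rather than a copy per lambda, which is where the instruction savings come from.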
Choice of hardware
-NIC is not limited to NPU-based SmartNICs. In fact, -NIC's abstract machine model can run (with varying benefits) on SmartNICs with more general-purpose processors: FPGA- or SoC-based SmartNICs, or ASICs with ARM cores (netronome_fx). These alternatives can further extend -NIC's processing capabilities, providing support for more features (such as floating-point operations, a deeper instruction store, and dynamic memory allocation) to run more complicated workloads.
Hot swapping workloads
For each new lambda, -NIC needs to recompile the firmware and swap it with the one currently running on the SmartNIC. Present versions of Netronome SmartNICs do not support hot swapping or hitless updates (netronome_programming), resulting in downtime each time a new firmware is loaded. Hitless updates are not technically challenging, as devices like FPGAs (xilinx_fpga_reconfig) and programmable switch fabrics (bosshart13) already support them (using partial reconfiguration and versioning techniques). We believe this limitation will go away in future versions of SmartNICs as well, allowing -NIC to load new lambdas without causing downtime.
Accelerating other forms of workloads
In this work, we primarily focused on offloading interactive serverless workloads to SmartNICs. However, -NIC can accelerate other parts of the serverless framework as well. For example, the gateway is a proxy that routes users' requests to -NIC's worker nodes (Figure 2); its performance is therefore crucial to the end-to-end behavior of the system. By running the gateway directly on a SmartNIC, -NIC can provide strict bounds on its tail latency and throughput. Certain types of data stores (like key-value stores (memcached)) can also benefit from -NIC: their restricted compute pattern (Jin:2017:NBK:3132747.3132764) lends itself nicely to -NIC's Match+Lambda machine model. Likewise, existing large, long-running workloads (like code compilation and video processing) have been shown to perform better when broken down into small serverless functions (fouladi2017thunk), making -NIC a suitable target for these workloads as well. We believe these trends will inspire more workloads to migrate to -NIC in the future, with serverless frameworks automatically determining which backend to execute a workload on.
Security and reliable transport
The serverless framework (i.e., gateway, workload manager, and worker nodes) typically runs within a trusted domain of a provider or a tenant, such that any malicious attempt to trigger the lambdas will be blocked by the gateway. In addition, each lambda operates within its own memory region on the NIC, restricting lambdas from accessing each other's data. -NIC enforces this policy using compile-time assertions; in the future, we plan to explore enforcing assertions at runtime for dynamically-allocated memory (once it becomes available in upcoming SmartNICs). For transport, -NIC relies on RDMA for multi-packet messages. However, with the recent focus on terminating the entire transport layer on the NIC (Montazeri:2018:HRL:3230543.3230564; tonic), -NIC's serverless functions could instead operate on complete messages rather than individual packets.
8. Related Work
NIC offloading technologies
Offloads for network features, such as L3/L4 checksum computation, large send and receive offload (LSO, LRO) (lso), and Receive Side Scaling (RSS) (rss), have been around for decades. More recent work, however, looks into offloading more complex, application-related functions to modern, programmable SmartNICs (Firestone:2018:AAN:3307441.3307446). For example, Microsoft is using FPGA-based SmartNICs to offload hypervisor switching tasks (Firestone:2018:AAN:3307441.3307446), which were previously handled by CPU-based software switches (like Open vSwitch (OVS) (ovs)). HyperLoop (Kim:2018:HGN:3230543.3230572) provides methods for accelerating replicated storage operations using RDMA on NICs. -NIC can assist these efforts by providing a framework to easily deploy general compute on a cluster of nodes hosting SmartNICs. Both Floem (floem) and iPipe (Liu:2019:ODA:3341302.3342079) provide frameworks that ease the development of NIC-assisted applications. However, these frameworks can offload only a portion of an application to a NIC as a bump-in-the-wire and need CPUs to do the remaining processing. In contrast, -NIC runs complete workloads on the NIC, mitigating the effects of any CPU-related overheads.
Orthogonal to NIC offloading technologies, there is a recent focus on moving various application tasks inside the network. P4 (Bosshart:2014:PPP:2656877.26568900) and RMT (bosshart13) provided the initial building blocks: a data-plane language and an architecture for programmable network devices, which enabled developers to run various applications in-network. For example, SilkRoad (Miao:2017:SMS:3098822.3098824) and HULA (Katta:2016:HSL:2890955.2890968) present methods for offloading load balancers, NetCache (Jin:2017:NBK:3132747.3132764) implements a key-value store, and NetPaxos (Dang:2015:NCN:2774993.2774999) runs the Paxos (Lamport:1998:PP:279227.279229) consensus algorithm inside switches. Tokusashi et al. (Tokusashi:2019:CIC:3302424.3303979) further demonstrate that in-network computing not only improves performance but is also more power efficient. -NIC, alongside these networking devices, can provide a more programmable environment for accelerated application processing. In fact, SmartNICs have more memory and a less-restricted programming model, which can help alleviate the limitations present in these switches.
Improving serverless compute
Serverless compute is a relatively new idea, and many of its details are not yet disclosed by the cloud providers. Thus, most recent work focuses on reverse engineering existing frameworks to study their internals and educate the public. For example, OpenLambda (openlambda) provides an open-source serverless compute framework that closely resembles the ones deployed by cloud providers. Glikson et al. (Glikson:2017:DEC:3078468.3078497) proposed another framework with support for edge deployments. -NIC complements these efforts by presenting a high-performance, open-source serverless compute framework for testing and developing lambdas on SmartNICs.
9. Conclusion
Server CPUs are not the right architecture for serverless compute. The particular characteristics of serverless workloads (i.e., fine-grained functions with short service times and small memory footprints), together with the slowdown of Moore's Law and Dennard scaling, demand a radically different architecture for accelerating serverless functions (or lambdas): one that can execute many of these lambdas in parallel with minimal contention and context-switching overhead. In this paper, we present -NIC, a serverless framework along with an abstraction, Match+Lambda, and a machine model for executing thousands of lambdas on an ASIC-based SmartNIC. These SmartNICs host a vast array of RISC cores, each with its own instruction store and local memory, and can execute embarrassingly parallel workloads with very low latency. -NIC's Match+Lambda abstraction hides the architectural complexities of these SmartNICs and efficiently compiles, optimizes, and deploys serverless functions on these NICs—improving average latency by 880x and throughput by 736x. With new and emerging developments in both SmartNICs and serverless frameworks, we believe offloading lambdas to SmartNICs will become common practice, inspiring even more complex real-world workloads to run on -NIC.