CARGO: Context Augmented Critical Region Offload for Network-bound Datacenter Workloads

08/17/2020
by Siddharth Rai, et al.

Network-bound applications, such as a database server executing OLTP queries or a caching server storing objects for a dynamic web application, are essential services that consumers and businesses use daily. These services run in large datacenters and are required to meet predefined Service Level Objectives (SLOs), or latency targets, with high probability. Efficient datacenter applications should therefore optimize their execution in terms of both power and performance. However, to support large-scale data storage, these workloads make heavy use of pointer-connected data structures (e.g., hash tables, large fan-out trees, tries) and exhibit poor instruction- and memory-level parallelism. Our experiments show that, because of long memory access latency, these workloads occupy processor resources (e.g., ROB entries, RS buffers, LS queue entries) for prolonged periods, which delays the processing of subsequent requests. Delayed execution not only increases request processing latency but also severely affects an application's throughput and power efficiency. To overcome this limitation, we present CARGO, a novel mechanism that overlaps queuing latency with request processing by executing select instructions on an application's critical path at the network interface card (NIC) while requests wait for processor resources to become available. Our mechanism dynamically identifies the critical instructions and includes the register state needed to compute the long-latency memory accesses. This context-augmented critical region is often executed at the NIC well before execution begins at the core, effectively prefetching the data ahead of time. Across a variety of interactive datacenter applications, our proposal improves latency, throughput, and power efficiency by 2.7X, 2.7X, and 1.5X, respectively, while incurring a modest amount of storage overhead.
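
The overlap of queuing latency with memory access described above can be illustrated with a toy model. The sketch below is not the paper's mechanism or evaluation: it assumes a single core serving requests in FIFO order, and the names `Request`, `simulate`, and `offload`, as well as all timing values, are hypothetical choices made for illustration. With `offload=True`, the model assumes a request's long-latency memory accesses are issued when it arrives at the NIC, so any time it spends queued is subtracted from its memory stall.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: float    # time the request reaches the NIC (arbitrary units)
    compute: float    # core time excluding memory stalls
    mem_stall: float  # time spent waiting on long-latency memory accesses


def simulate(requests, offload):
    """Toy single-core FIFO model; returns the latency of each request.

    Assumption: with offload enabled, memory accesses start at arrival,
    so queue wait time hides an equal amount of memory stall.
    """
    core_free = 0.0
    latencies = []
    for r in requests:
        start = max(r.arrival, core_free)  # wait until the core is free
        wait = start - r.arrival
        if offload:
            stall = max(0.0, r.mem_stall - wait)  # stall left after overlap
        else:
            stall = r.mem_stall
        finish = start + r.compute + stall
        core_free = finish
        latencies.append(finish - r.arrival)
    return latencies


# Illustrative values only; they are not taken from the paper.
reqs = [Request(arrival=float(t), compute=1.0, mem_stall=3.0) for t in range(3)]
baseline = simulate(reqs, offload=False)
overlapped = simulate(reqs, offload=True)
print(baseline, overlapped)
```

In this toy setup, queued requests in the baseline pay the full memory stall after reaching the core, so latency grows with queue depth, while in the overlapped case the stall is hidden behind the wait. The model ignores NIC capacity, cache eviction, and mispredicted critical regions, all of which a real design would have to account for.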

