
AgileWatts: An Energy-Efficient CPU Core Idle-State Architecture for Latency-Sensitive Server Applications

03/04/2022
by   Jawad Haj-Yahya, et al.

User-facing applications running in modern datacenters exhibit irregular request patterns and are implemented using a multitude of services with tight latency requirements. These characteristics render existing energy-conserving techniques ineffective when processors are idle, due to the long transition time from a deep idle power state (C-state). While prior works propose management techniques to mitigate this inefficiency, we tackle it at its root with AgileWatts (AW): a new deep C-state architecture optimized for datacenter server processors targeting latency-sensitive applications. AW is based on three key ideas. First, AW eliminates the latency overhead of saving/restoring the core context (i.e., micro-architectural state) when powering-off/-on the core in a deep idle power state by i) implementing medium-grained power-gates, carefully distributed across the CPU core, and ii) retaining context in the power-ungated domain. Second, AW eliminates the flush latency overhead (several tens of microseconds) of the L1/L2 caches when entering a deep idle power state by keeping the L1/L2 cache content power-ungated. A minimal control logic also remains power-ungated to serve cache coherence traffic (i.e., snoops) seamlessly. AW implements sleep-mode in the caches to reduce their leakage power consumption and lowers the core voltage to the minimum operational level to minimize the leakage power of the power-ungated domain. Third, using a state-of-the-art power-efficient all-digital phase-locked loop (ADPLL) clock generator, AW keeps the PLL active and locked during the idle state, further cutting precious microseconds of wake-up latency at a negligible power cost. Our evaluation with an accurate simulator calibrated against an Intel Skylake server shows that AW reduces the energy consumption of Memcached by up to 71% (35% on average).


1 Introduction

Recently, Google heralded the era of the killer microseconds for datacenter servers running user-facing applications  [barroso2017attack, barroso2018datacenter, prekas2017zygos]. Killer microseconds refer to microsecond-scale idleness during CPU execution caused by the combination of two major trends. First, various events (e.g., related to NVM storage, faster datacenter networking, and main memory) with latencies in the range of microseconds are prevalent [barroso2017attack, chou2019mudpm, cho2018taming]. Second, a new software architecture is deployed in datacenters based on microservices, i.e., a large application composed of numerous interconnected smaller services that explicitly communicate with each other. These applications exhibit irregular request streams, and their services have very tight (i.e., few tens to few hundreds of microseconds) latency requirements [dmitry2014micro].

C-state | Transition time | Residency time | Power per core
(numeric values are not recoverable from this rendering)
Table 1: C-states available on the Intel Skylake server (SKX) core [intel_idle], at the base (P1) and minimum (Pn) frequency levels, together with AW's new C6A and C6AE C-states.
Latency-critical microservice-based applications follow a request/response model and exhibit a busy/quiet execution pattern [chou2016dynsleep, chou2019mudpm, meisner2009powernap]. Ideally, systems should be in an active power state when a burst of requests arrives and then transition to an idle, low-power state during the quiet period. Table 1 reports a typical hierarchy of existing idle states (i.e., C-states C0, C1, C1E, and C6) and our new proposed idle states C6A and C6AE (which replace C1 and C1E, respectively; discussed in Sec. 4). Deeper package-level C-states (e.g., C8), which further reduce idle power, incur an even longer transition latency and require a longer residency time [gough2015cpu, haj2018power]. For each state and two frequency levels (i.e., the base frequency P1 and the minimum frequency Pn), the table shows the power of modern server CPU cores during idle periods. Note that the microsecond-granularity transition times in Table 1 represent the worst-case software+hardware entry+exit latency (latency to start executing the first instruction) and not the actual hardware transition latency [CPU_idle]. For example, the hardware latency of the C1 C-state is only a few nanoseconds (cycles) since C1 mainly performs clock-gating (Fig. 3(a)) [rogers2012core, schone2015wake, schone2019energy, intel_atom_2010].

Clearly, the C-state in which the CPU core resides while idle determines its power consumption. Transitioning to a deeper (or shallower) C-state requires bookkeeping work with varying transition latencies, during which the CPU core cannot perform useful work. For this reason, power management controllers only decide to switch to a deeper C-state if they predict that waking up will not be necessary before a target residency time.
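For illustration, this residency-based selection policy can be sketched as follows. The state names and target-residency numbers in this Python snippet are assumed placeholders for the example, not the actual values used by any shipping governor.

```python
# Minimal sketch of a residency-based C-state selection policy (illustrative only).
# A deeper state is chosen only if the predicted idle time exceeds its target
# residency, so the energy saved outweighs the transition cost.

C_STATES = [
    # (name, target_residency_us)  -- assumed example values
    ("C1", 2.0),
    ("C1E", 20.0),
    ("C6", 600.0),
]

def select_c_state(predicted_idle_us: float) -> str:
    """Return the deepest C-state whose target residency fits the predicted idle time."""
    chosen = "C0 (poll)"  # stay active if even C1 does not pay off
    for name, target_residency_us in C_STATES:
        if predicted_idle_us >= target_residency_us:
            chosen = name
    return chosen

if __name__ == "__main__":
    for idle_us in (1, 10, 100, 1000):
        print(f"predicted idle {idle_us:>4} us -> {select_c_state(idle_us)}")
```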

Under these C-state transition policies, servers running latency-critical services rarely enter a deep idle power state (e.g., C6) because: 1) the residency time is hard to estimate, as the duration of busy/quiet periods is irregular; and 2) the stringent service-latency requirements (a few tens to a few hundreds of microseconds [chou2019mudpm, zhan2016carb]) cannot be met when several tens of microseconds are required to transition out of deep C-states. As a result, CPU cores only enter shallow (e.g., C1) C-states during the quiet period and burn excess energy while idle.

We claim that the inefficiency of the C-state hierarchy for microsecond-scale events is not fundamental, but a byproduct of its design being focused on client systems. Major server vendors commonly design a common base microarchitecture upon which both client and server processors are built. For example, the Intel CPU core design is a single development project where client and server processors are based on the same master design [Skylake_die_server, kumar2017intel]. Within this development flow, optimizations for energy efficiency are mostly targeted at client processors, which are used in battery-operated devices, while server processors are mostly optimized for performance. Therefore, features such as C-states are designed for client applications (e.g., video playback, conferencing, gaming [19_MSFT, mobilemark]), which, contrary to latency-critical microservices, typically present long and predictable idle periods, allowing processors to exploit existing deep package C-states [kurd2014haswell, chi2015low, haj2020sysscale, haj2020techniques]. For example, a client processor spends a large fraction of video streaming time in deep package C-states [haj2021burstlink].

Prior work (reviewed in Sec. 8) proposes various management techniques to mitigate the inability of datacenter processors to leverage deep C-states effectively. In contrast, our goal is to directly address the root cause of the inefficiency, namely the high transition latency (tens of microseconds; see Table 1) to/from deep C-states. We propose AgileWatts (AW): a new deep C-state architecture optimized for server processors in modern datacenters running user-facing workloads. AW drastically reduces the transition latency of deep idle power states while retaining most of their power savings, making deep C-states usable in practice. AW redesigns state-of-the-art CPU core deep C-states by leveraging three key power management techniques. First, instead of completely shutting off the core power supply when entering a deep C-state, AW uses medium-grained power-gates distributed across the core and maintains the core context in the power-ungated domain. This approach shaves off several microseconds of latency by removing the need for saving and restoring the architectural state. Second, AW keeps the private caches (i.e., L1 and L2) and minimal control logic for cache coherence power-ungated instead of completely shutting them down, which would otherwise require flushing their content and inflate the transition latency by several tens of microseconds. AW implements a sleep mode in the caches and lowers the core voltage to the minimum operational level to reduce the leakage power consumption of the power-ungated domain. Third, instead of shutting down the clock distribution completely, AW clock-gates the core components and clock distribution while keeping the power-efficient all-digital phase-locked loop (ADPLL [fayneh20164]) clock generator on and locked. This approach shaves a few more microseconds off the transition latency at a minimal extra idle power cost.

We demonstrate AW's potential for Intel server processors, which account for the majority of the server processor market [intel_amd_marketshare]. However, our proposed techniques are general and hence applicable to most server processor architectures.

This work makes the following contributions:


  • To our knowledge, AgileWatts (AW) is the first practical, highly efficient core C-state design that directly targets the killer microseconds problem in datacenter servers running latency-critical applications.

  • We propose an agile idle power states architecture, AW, that drastically reduces the transition latency of deep idle states while retaining most of their power savings.

  • The AW architecture employs medium-grained power gating and voltage control to eliminate the need to save/restore the microarchitectural context and to flush caches, saving several tens of microseconds of transition latency.

  • Our evaluation with an accurate model calibrated against an Intel Skylake server shows that AW reduces the energy consumption of Memcached by up to 71%. AW's new deep C-states, C6A and C6AE, provide a much shorter transition latency than the deepest idle state (C6) while consuming only a small fraction of the active-state (C0) power.

2 Motivation

Before diving into details, we analyze the opportunity that a new, agile deep idle state holds for datacenter processors.

As discussed in Sec. 1, servers running latency-critical applications usually operate at low load, and their workload characteristics prevent modern cores from entering deep idle states during quiet periods. Previous work showed that, for a key-value store (e.g., Memcached [jose2011memcached]) workload, cores never go to a C-state deeper than C1 (the shallowest C-state shown in Table 1) when running at even modest load [chou2016dynsleep, lo2014towards] (our experimental analysis of this workload in Sec. 7 confirms this observation). A search workload was shown to be slightly more efficient, with cores reaching deeper C-states for a larger fraction of the time at low load but only rarely at higher load, thus still spending most of the time in the shallow C1 state in both cases [lo2014towards]. These examples point to a large opportunity to design a new C-state that has latency similar to C1 but much lower idle power than C1.

Since the deepest C-state is C6 (Table 1), we can estimate an upper bound on the average power (AvgP) savings opportunity by considering the ideal case of a deep idle state with the same transition latency as C1 and the same power consumption as C6, as in Equation 1.

AvgP = \sum_{s \in \text{C-states}} R_s \cdot P_s \qquad (1)

R_s denotes the residency at power state s, i.e., the percentage of the total time the CPU core spends in power state s. P_s denotes the average CPU core power consumption in power state s (reported in Table 1).

Referring to our examples from previous work, given 1) the measured C-state residencies for the search workload at low and moderate load and for the key-value store workload at low load, and 2) the C-state power consumption from Table 1, the resulting savings correspond to a large reduction in core power consumption in all three cases.
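The upper-bound estimate of Equation 1 can be reproduced with a short script. The residency and power values below are assumed placeholders (the concrete numbers depend on the workload and platform); only the structure, the residency-weighted sum, follows the equation above.

```python
# Sketch of the Equation 1 estimate: average core power is the residency-weighted
# sum of per-state power. All numbers below are assumed placeholders.

def avg_power(residency, power):
    """AvgP = sum over states s of R_s * P_s, with R_s given as fractions of time."""
    return sum(residency[s] * power[s] for s in residency)

power = {"C0": 2.0, "C1": 0.8, "C6": 0.1}        # watts per core (assumed)
baseline = {"C0": 0.3, "C1": 0.7, "C6": 0.0}      # shallow-only baseline residencies
ideal = {"C0": 0.3, "C1": 0.0, "C6": 0.7}         # ideal: C1-like latency, C6-like power

base_p = avg_power(baseline, power)
ideal_p = avg_power(ideal, power)
print(f"upper bound on core power savings: {100 * (base_p - ideal_p) / base_p:.1f}%")
```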

The rest of the paper illustrates how AW reaps a large part of this massive power-saving opportunity by defining a new low-latency deep idle state we call C6A (C6 Agile).

3 Background

We provide a brief overview of different power management components and techniques used in modern processors.

Figure 1: A Skylake server die with annotations of various units (reverse-engineered in [skx_die_annotations]). The CPU core is bordered in green, and the AVX-512 and L2 extensions [kumar2017intel] (unavailable in client CPU cores) are bordered in pink. The 256-bit and 512-bit AVX units have separate power gates [mandelblat2015technology, intel_avx512, fayneh20164, haj2021ichannels], as shaded in the figure.

Server and Client Cores. Major server vendors use nearly the same CPU core microarchitecture for client and server processors. For example, the Intel CPU core design is a single development project, leading to a master superset core. The project has two derivatives, one for server and one for client processors [Skylake_die_server]. Fig. 1 shows the AVX-512 and L2 extensions of the Intel Skylake server core over the client core [kumar2017intel].

Clock Distribution Network (CDN). A clock distribution network distributes the clock signal from a common point (e.g., a clock generator) to all the elements in the system that need it. Modern processors use an all-digital phase-locked loop (ADPLL) to generate the CPU core clock [tam2018skylake]. An ADPLL can maintain high performance while significantly reducing power compared to conventional PLLs. For example, the power of the ADPLL in Skylake, shown at the bottom of Fig. 1, is only a few milliwatts [fayneh20164].

Power Delivery Network (PDN). There are three commonly used PDNs in modern processors: 1) integrated voltage regulator (IVR) [burton2014fivr, nalamalpu2015broadwell, tam2018skylake, icelake2020], 2) motherboard voltage regulator (MBVR) [rotem2011power, fayneh20164, haj2019comprehensive], and 3) low-dropout voltage regulator (LDO VR) [singh20173, singh2018zen, burd2019zeppelin, beck2018zeppelin, toprak20145]. For example, recent Intel server processors implement a fully integrated voltage regulator (FIVR) per core, as shown at the bottom of Fig. 1 [fayneh20164, Skylake_die_server].

Figure 2: Staggering power-gate wake-up by daisy-chaining the control signals of the power-gating switches.

Staggered Unit Power-Gate Wake-up. Power gating is a circuit-level technique used to eliminate the leakage power of an idle circuit [hu2004microarchitectural, haj2018power, gough2015cpu]. Typically, the wake-up latency from the power-gated state can take a few cycles to tens of cycles [kahng2013many, gough2015cpu]. However, to reduce the worst-case peak in-rush current [chadha2013architectural, usami2009design, agarwal2006power, abba2014improved] and voltage noise on the power delivery network (e.g., di/dt noise [larsson1997di, gough2015cpu, haj2018power]) when waking up a power-gate, the power-gate controller applies a staggered wake-up technique [chadha2013architectural, usami2009design, agarwal2006power], as illustrated in Fig. 2. The technique turns on different power-gating switch cells in stages to limit the current spike drawn from the power supply. To do so, the control inputs and outputs of the switch cells are daisy-chained. The power controller issues a wake-up signal to the first switch cell and receives an acknowledge signal (ready) from the last one, indicating that the power-gate is fully conducting. An alternative wake-up technique groups switch cells into multiple chains and uses a daisy-chain configuration within each chain. Doing so allows the power-gate controller to tune (e.g., post-silicon) a unit's wake-up time by controlling the assertion time of each chain.

Modern processors implement this staggering technique [kahng2013many, akl2009effective, kahng2012tap]; for example, the Intel Skylake core staggers the wake-up of the AVX power gates to reduce in-rush current [fayneh20164][haj2021ichannels, Sec. 5].
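The trade-off behind staggering can be illustrated with a small timing sketch. The cell counts, per-cell current, and per-stage delay below are assumptions for the example, not measured Skylake values.

```python
# Sketch of staggered power-gate wake-up: switch cells are enabled in stages along
# the daisy chain, trading a longer wake-up time for a smaller current step.
# All parameters are assumed example values.

def staggered_wakeup(num_cells=1000, cells_per_stage=50,
                     stage_delay_ns=1.0, cell_current_ma=0.2):
    """Return (total wake-up time in ns, current step per stage in mA)."""
    stages = -(-num_cells // cells_per_stage)          # ceiling division
    return stages * stage_delay_ns, cells_per_stage * cell_current_ma

for label, per_stage in (("all at once", 1000), ("staggered", 50)):
    t_ns, step_ma = staggered_wakeup(cells_per_stage=per_stage)
    print(f"{label:12s}: wake-up {t_ns:6.1f} ns, current step {step_ma:6.1f} mA")
```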

Core C-states. Power saving states, i.e., C-states, enable processors to reduce power consumption during idle periods when no instructions are available to execute. Modern processors offer deeper C-states for larger power savings during idle periods. For example, Intel's Skylake architecture offers the following four core C-states: C0, C1, C1E, and C6 [gough2015cpu, haj2018power]. Table 2 describes the microarchitectural state while in each core C-state. While C-states enable processors to achieve power savings, entering and exiting a C-state incurs a latency overhead during which the core cannot be utilized. For example, the C6 state is estimated to require tens of microseconds of transition time, as shown in Table 1. These entry/exit latencies can significantly affect the performance of workloads whose request processing latencies are of similar magnitude, such as user-facing applications [jose2011memcached].

C-state | Clocks | ADPLL | L1/L2 cache | Voltage | Context
C0 | Running | On | Coherent | Active | Maintained
C1 | Stopped | On | Coherent | Active | Maintained
C6A | Stopped | On | Coherent | PG/Ret/Active | In-place S/R
C1E | Stopped | On | Coherent | Min V/F | Maintained
C6AE | Stopped | On | Coherent | PG/Ret/Min V/F | In-place S/R
C6 | Stopped | Off | Flushed | Shut-off | S/R SRAM
Table 2: Microarchitectural state while in Skylake server core C-states [oneintel] together with AW's new C6A and C6AE C-states.

Core C-state Entry and Exit Flow. The core C1/C1E and C6 entry and exit flows are shown in Fig. 3(a) and Fig. 3(b), respectively, and are described in more detail in [rogers2012core, schone2015wake, schone2019energy, intel_atom_2010].

Figure 3: (a) C1 entry, exit, and snoop flows. (b) C6 entry and exit flows.

Core C6 Entry/Exit Latency. We analyze the entry/exit latency based on available data from an x86 implementation [rogers2012core]. On C6 entry, the latency is dominated by the L1/L2 cache flush loop. This flush time varies depending on 1) the number of dirty (modified) cache lines and 2) the core frequency when entering C6; for example, flushing a mostly dirty cache at a low frequency (e.g., 800 MHz) takes tens of microseconds. The time to save the core state into the save/restore (S/R) SRAM also depends on the core clock frequency and adds several microseconds. Including the control flow overhead and the power-gate controller, the overall CPU core C6 entry time is several tens of microseconds.

C6 exit latency is significantly shorter than entry latency. From the wake-up interrupt to the resumption of instruction execution in the core, this latency includes the hardware wake-up (power-ungating, PLL relock, reset, and fuse propagation) as well as state restoration and microcode restoration [rogers2012core, schone2015wake, schone2019energy, intel_atom_2010].

4 AgileWatts (AW) Architecture

AW introduces a new core deep idle power state, C6A (C6 Agile), with close-to-zero power consumption and nanosecond-scale entry/exit latency. This new state allows servers running latency-critical applications to frequently enter the C6A state during short and irregular idle periods, thereby significantly reducing overall energy consumption. Additionally, AW defines a lower-power variant of C6A, C6AE (C6A Enhanced, analogous to C1E), which further reduces C6A leakage power consumption by lowering the core voltage to the minimum operational voltage. We focus on the C6A architecture and operation and highlight C6AE differences when relevant. The C6A C-state is based on two key innovations.

First, AW defines a Units Fast Power-Gating (UFPG) architecture that uses distributed power-gates within the core to power-gate the majority of the core while avoiding the need for saving and restoring microarchitectural context, which is kept in a separate, non-gated domain. Keeping the microarchitectural context in place removes the need to save and restore it, as normally done by the C6 state, and allows cutting power to the majority of the core within tens of nanoseconds. Sec. 4.1 introduces AW's UFPG architecture.

Second, AW defines a Caches Coherency and Sleep Mode (CCSM) mechanism that avoids the need to flush the private caches, thereby saving tens of microseconds of transition latency compared to the C6 state. To avoid flushing the caches, CCSM clock-gates them (keeping them powered) and, to minimize their idle power, reduces their voltage to retention level while the cache is idle. Since caches are not flushed, CCSM needs to respond to requests from the coherency protocol; for this purpose, CCSM includes a mechanism that quickly reactivates the caches to respond to snoop requests while keeping the rest of the core in an idle state. Sec. 4.2 presents the CCSM architecture.

AW defines a new power management flow that precisely coordinates the UFPG and the CCSM at nanosecond granularity to enter and exit the C6A state. Sec. 4.3 expands on how this flow transitions the core to sleep mode, allows responding to cache snoops while keeping most of the core asleep, and wakes up the core to an active state.

4.1 Units Fast Power-Gating (UFPG)

AW UFPG is a low-latency power-gating architecture that shuts off most of the core units while retaining context in place, thus enabling a transition latency of tens of nanoseconds. Conventional context retention techniques (e.g., the C6 C-state flow in Fig. 3(b)) sequentially save/restore the context to/from an external SRAM before/after power-gating/un-gating [rogers2012core, haj2021ichannels, fayneh20164, haj2020techniques]; this process adds several microseconds of overhead to the entry/exit latency. Instead, AW retains the context in place, completely removing that overhead at a very small additional idle power cost.

AW enables in-place context retention with a medium-grain power-gating approach. This is in contrast to the coarse-grain power-gating approach used in Skylake client cores, where the entire core is under the same power gate [rotem2011power, 10_jahagirdar2012power, fayneh20164, 12_howse2015tick] and the context needs to be saved/restored externally. Our approach leverages the same ideas used by the fine-grain power gating of the AVX-256 and AVX-512 core units [mandelblat2015technology, intel_avx512, fayneh20164, haj2021ichannels] in recent server and client cores (see Fig. 1). These fine-grain power-gating techniques require only nanosecond-scale latency to power-gate/un-gate a unit [fayneh20164][haj2021ichannels, Sec. 5] because they retain the context in place and do not require saving and restoring it externally.

Figure 4: Medium-grain power gating for the majority (area shaded in red) of the core units, excluding the L1 and L2 caches and their controllers.

AW’s medium-grain power gating applies to the majority of the core units (shown as a red shaded area in Fig. 4) and excludes the L1 and L2 caches and their controllers, which AW handles separately (see Sec. 4.2).

Within the medium-grain power-gating region, AW leverages multiple techniques to retain context in place, enabling fast (within a few nanoseconds) transition latencies. The context of a modern CPU core (estimated as the amount of SRAM that C6 saves [gerosa2008sub, jahagirdar2012method]) falls into two categories: i) registers, such as configuration and status registers (CSRs) or fuse registers, and ii) SRAMs, such as firmware persistent data and firmware patches [haj2020techniques, rogers2012core]. (In [haj2020techniques], the context to be saved/restored for the entire Skylake client mobile SoC, which includes a quad-core, an integrated GPU, the uncore, and the system agent, is reported; we extrapolate the single-core context from it based on the die area of a single core relative to the entire quad-core SoC.) In the following paragraphs, we illustrate the three distinct techniques AW uses to efficiently retain the context during the C6A state; the first two apply to registers, while the third one deals with SRAM.

Figure 5: Context retention techniques AW uses when power-gating a unit: (a) Placing a context in the core ungated power domain; (b) placing an SRAM with context (e.g., microcode patch) in the core ungated power-domain; (c) using state retention power gating (SRPG) flops for distributed context.

4.1.1 Placing Unit Context in the Ungated Domain

One option to retain the context of a power-gated unit is placing its registers outside the power-gated region, i.e., in the core ungated domain, as illustrated in Fig. 5(a).

This solution is suitable for units that have a small context that needs to be retained (e.g., execution units); for example, Intel likely uses this technique for the AVX execution units [mandelblat2015technology, lackey2002managing]. AW uses this context retention technique for all core units that require only a local context to be retained, i.e., units that do not require retaining a distributed context that would be impractical to relocate to a centralized un-gated region. We identify the following units that satisfy these requirements: 1) all the execution units (besides AVX), 2) the execution ports, and 3) the out-of-order engine.

4.1.2 State Retention Power Gates (SRPGs)

Moving large context to a separate un-gated area is impractical (e.g., due to timing and wiring constraints). For this reason, AW employs a different retention technique – SRPGs – for units that contain large context. As Fig. 5(c) illustrates, SRPGs (i.e., retention flops) are special flops fed with two power supplies: power-gated and power-ungated. These flops typically use a shadow register to retain state even when the unit in which they are residing is power-gated [mahmoodi2004data, rabinowicz2021new, lackey2002managing]. For example, Intel uses this technique in the chipset to retain the state of autonomously power-gated units [mandelblat2015technology]. As Sec. 5 further illustrates, using SRPGs is more expensive than moving context to a separate, un-gated domain; therefore, we only apply this technique when relocating context to a centralized un-gated area is impractical.

4.1.3 Place Context SRAMs on the Ungated Power Supply

Part of the CPU core context is located in SRAMs [haj2020techniques]. While the microcode firmware is located inside a read-only memory known as the microcode sequencer ROM (MS-ROM), microcode patches and data are stored in an SRAM [gwennap1997p6, ermolov2021undocumented]. This SRAM is initialized at boot time and should be retained when power-gating the microcode unit. The C6 state deals with the microcode patches by re-initializing the content of this SRAM from another SRAM located in a separate un-gated domain; this re-initialization process is sequential and can take multiple microseconds [haj2020techniques, rogers2012core]. Instead, AW avoids the need to re-initialize the microcode patch SRAM by powering it with a separate, core-ungated supply, as Fig. 5(b) illustrates.

4.2 Caches Coherency and Sleep Mode (CCSM)

To avoid the high latency (tens of microseconds) of flushing the private caches (i.e., L1D and L2) before power-gating them, AW instead keeps them in a non-power-gated area (see Fig. 4) when transitioning to C6A. This choice has two consequences: first, AW needs to employ other power-saving techniques to reduce the impact of not power-gating the cache domain; second, a core in the C6A state still needs to serve requests coming from the coherence protocol [molka2009memory, hackenberg2009comparing].

AW employs two key techniques to reduce the power consumption of the non-power-gated private cache domain. First, unless a coherency request is being served, AW keeps this domain clock-gated to save its dynamic power. Second, AW leverages the cache sleep-mode technique [huang2013energy, flautner2002drowsy, chen201322nm, rusu20145, rusu201422], which adds sleep transistors to the SRAM array of private caches. These sleep transistors reduce the SRAM array’s supply voltage to the lowest level that can safely retain the SRAM content while significantly reducing leakage power.

Since private caches are not flushed when a core enters C6A, AW must allow the core to respond to snoop requests coming from the coherency protocol [gough2015cpu, haj2018power, rotem2011power]. To do so, AW keeps the logic required to handle cache snoops in the non-power-gated (but clock-gated) domain together with the private caches. Additionally, it utilizes minimal logic (the same logic used in C1) to detect incoming snoop requests in an always-active (i.e., neither power- nor clock-gated) domain. As soon as this minimal logic detects incoming snoop traffic, it temporarily increases the SRAM array voltage through the sleep transistors and reactivates the clock to the private caches for the time required to respond to the snoop requests.

4.3 C6A Power Management Flow

AW implements the C6A power management flow within the core power management agent (PMA) [rotem2012power]. This flow, illustrated in Fig. 6, orchestrates transitions between the C0 and C6A C-states and handles coherency traffic while in the C6A state.

Figure 6: Core power management flow for the new C6A (C6 Agile) state.

Similar to other C-states, the operating system triggers C6A entry by executing the MWAIT instruction [gough2015cpu, haj2018power]. The first step (1) in the entry flow clock-gates the UFPG domain (see Sec. 4.1), while keeping the core phase-locked loop (PLL) powered on. When entering C6AE, the PMA additionally initiates a non-blocking transition to Pn, the P-state with the lowest frequency and voltage. Subsequently (2), the flow saves (in place) the UFPG domain context and shuts down its power. Finally (3), the flow sets the private caches into sleep mode (see Sec. 4.2) and shuts down their clock. After these three steps the core is in the C6A (or C6AE) state.

When a snoop request arrives while the core is in C6A (or C6AE), the PMA has to temporarily activate the private caches to respond. First (a), the flow clock-ungates the private cache domain and contextually adjusts its supply voltage to make it exit sleep mode. At this point (b), the caches can handle the snoop requests. Finally (c), when all outstanding snoop requests are serviced, the flow rolls back the changes in reverse order and brings the core back into the full C6A (or C6AE) state.

When an interrupt occurs, the core exits from C6A (or C6AE) and goes back into the C0 (active) state. The exit flow is simply the reverse of the entry flow. First (4), the flow clock-ungates the L1/L2 caches and exits sleep mode. Next (5), it power-ungates the UFPG units and triggers the restore signal to the SRPG flops (see Fig. 5(c)). Finally (6), the flow clock-ungates the UFPG units, bringing the core to the C0 active state.
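The sequencing of Fig. 6 can be summarized in a compact sketch. The Core class and its methods below are hypothetical stand-ins for the actual hardware controls (clock gates, power gates, sleep transistors); only the ordering of the steps follows the flow described above.

```python
# Condensed sketch of the C6A/C6AE power-management flow of Fig. 6.
# The Core class is a hypothetical stand-in for the hardware controls.

class Core:
    def clock_gate_ufpg(self, gate):  print("UFPG clocks", "gated" if gate else "ungated")
    def power_gate_ufpg(self, gate):  print("UFPG power", "gated (context kept in place)" if gate else "ungated (SRPG restore)")
    def cache_sleep(self, sleep):     print("L1/L2", "sleep-mode, clock-gated" if sleep else "awake, clock-ungated")
    def request_pstate(self, pstate): print("non-blocking transition to", pstate)

def enter_c6a(core, enhanced=False):
    core.clock_gate_ufpg(True)        # step 1: clock-gate UFPG, keep the PLL locked
    if enhanced:                      # C6AE only
        core.request_pstate("Pn")     # lowest frequency/voltage, non-blocking
    core.power_gate_ufpg(True)        # step 2: shut off UFPG power, context retained
    core.cache_sleep(True)            # step 3: caches to sleep-mode, clock-gated

def handle_snoop(core):
    core.cache_sleep(False)           # step a: wake only the cache domain
    print("serve snoop from L1/L2")   # step b
    core.cache_sleep(True)            # step c: roll back in reverse order

def exit_c6a(core):
    core.cache_sleep(False)           # step 4
    core.power_gate_ufpg(False)       # step 5: staggered power-ungate + SRPG restore
    core.clock_gate_ufpg(False)       # step 6: core back in C0

if __name__ == "__main__":
    core = Core()
    enter_c6a(core, enhanced=True)
    handle_snoop(core)
    exit_c6a(core)
```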

5 Implementation and Hardware Cost

As discussed in Sec. 4, AW requires the implementation of the three main components in each CPU core: the UFPG subsystem, the CCSM subsystem, and the C6A controller. In this section, we discuss implementation details of each component, their area and power cost, and the resulting transition latency for the new C6A and C6AE states.

5.1 Units Fast Power-Gating (UFPG)

As illustrated in Sec. 4.1, AW places the majority of the core components behind power gates that can be implemented similarly to the existing power-gates used for the AVX units in recent cores. The other key part of the UFPG is the in-place context retention mechanism for the core context [gerosa2008sub, jahagirdar2012method]. AW implements power-gates for the majority of the core area (as Fig. 4 shows) and requires a small amount of extra core area. The exact overhead depends on the specific implementation, the exact size of the gated area, the number of required isolation cells (isolation cells isolate the always-on units from the floating outputs of the power-gated units; they are typically placed on the outputs of the shut-off power domain during the physical placement stage [chadha2013architectural]), and the technology node [petrica2013flicker, ditomaso2017machine, rahman2006determination, flynn2007low, zimmer2017reprogrammable]. Area requirements for the three context retention techniques vary, as follows. First, moving the context of a unit to the ungated power domain typically requires a small fraction of the context area, mainly due to the isolation cells [chadha2013architectural]. Second, the state retention power-gates (SRPGs) needed for components whose state is too large or too distributed to be moved to the non-power-gated domain are a mature technique already used in shipping products, e.g., Intel Skylake processors [mandelblat2015technology]; efficient SRPG designs, which use selective context retention, require only a small additional area compared to the power-gated area they control [rabinowicz2021new, hyun2019allocation]. Finally, we do not expect additional overheads for context held in SRAM (e.g., the microcode patches, which account for a sizable portion of the core context [gwennap1997p6, ermolov2021undocumented]), since we place these SRAMs in the non-power-gated domain; however, placing a context SRAM in the ungated area does require isolation cells proportional to the SRAM area [chadha2013architectural].
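To make the composition of the area overhead concrete, the sketch below sums the contributions just described; every fraction is an assumed placeholder rather than a number reported by this work.

```python
# Sketch of how AW's extra core area composes from the parts discussed above.
# All fractions are assumed placeholders.

gated_area         = 0.60   # fraction of core area behind the new power-gates (assumed)
switch_overhead    = 0.03   # power-gate switch cells, relative to the gated area (assumed)
context_reg_area   = 0.01   # context registers relocated to the ungated domain (assumed)
isolation_overhead = 0.04   # isolation cells, relative to relocated context/SRAM (assumed)
srpg_area          = 0.10   # units whose state is retained via SRPG flops (assumed)
srpg_overhead      = 0.10   # extra SRPG flop area, relative to the area it protects (assumed)
context_sram_area  = 0.02   # microcode-patch SRAM kept on the ungated supply (assumed)

extra_area = (gated_area * switch_overhead
              + context_reg_area * isolation_overhead
              + srpg_area * srpg_overhead
              + context_sram_area * isolation_overhead)
print(f"estimated extra core area: {100 * extra_area:.2f}% of the core (assumed inputs)")
```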

5.2 Caches Coherency and Sleep Mode (CCSM)

AW implements the SRAM sleep-mode for the private caches in a way similar to the L3 cache sleep-mode found in multiple server products [huang2013energy, chen201322nm, rusu20145, rusu201422]. SRAM sleep-mode implements P-type sleep transistors with seven programmable sleep settings and local bit-line float to reduce SRAM cell leakage in the data array; additionally, it employs word-line sleep to reduce array leakage power. In this implementation, only the data array (which accounts for most of the cache area) is placed in sleep-mode (low voltage), while the other control arrays (e.g., tag, state) operate at nominal voltage. Doing so allows hiding the data array wake-up latency behind the control array access time, thus eliminating any performance degradation compared to operation without the sleep mode. Implementing sleep-mode using sleep transistors for the SRAM data array of the private caches requires an additional area overhead similar to that of power-gates (i.e., a small fraction of the SRAM area) [petrica2013flicker, ditomaso2017machine, rahman2006determination, flynn2007low, flautner2002drowsy]; a recent implementation reports a small area requirement [zimmer2017reprogrammable].

5.3 C6A/C6AE Power Management Control Flow

The main implementation detail required to enable the control flow illustrated in Fig. 6 is a mechanism to control in-rush current [fayneh20164, jeong2012mapg]. This mechanism needs to enable staggered wake-up, so as to ensure the stability of the power delivery network [fayneh20164, jeong2012mapg, haj2021ichannels, kahng2013many, akl2009effective, kahng2012tap]; we further discuss it in Sec. 5.7. The remaining capabilities, such as clock-gating and detection of events (interrupts, snoops), are all commonly supported in state-of-the-art SoCs, e.g., Skylake processors. The C6A controller is implemented as a simple finite-state-machine (FSM) within the core's power management agent (PMA), which resides in the uncore [rotem2012power] and controls clock gating/un-gating, save/restore signals, and L1/L2 entry to/exit from sleep-mode. The snoop flow reuses the existing snoop handling mechanisms used in the C1 state (shown in Fig. 3(a)). Based on a comparable power-management flow implemented in [haj2020techniques], we estimate the area required to implement the C6A controller to be a small fraction of the core PMA area.

Table 3 summarizes the requirements to implement AW on top of a Skylake server processor.

Component | Sub-component | Area requirement | C6A power | C6AE power
Units Fast Power-Gating (UFPG) | Units power-gates (majority of the core) | fraction of the power-gated area | mW | mW
 | Ungated context registers | fraction of the ungated context registers | mW | mW
 | State retention power-gates (SRPG) | fraction of the gated unit area | |
 | Ungated context SRAM | fraction of the SRAM area | |
Caches Coherency & Sleep Mode (CCSM) | L1/L2 caches sleep-mode | fraction of the private cache area | mW | mW
 | Rest of ungated memory subsystem | fraction of the ungated units | mW | mW
PMA flow | Implemented in the uncore [rotem2012power] | fraction of the core PMA [haj2020techniques] | mW | mW
Core ADPLL & FIVR | ADPLL | | mW [fayneh20164] | mW [fayneh20164]
 | Core FIVR inefficiency | | mW | mW
 | FIVR static losses | | mW | mW
Overall | | fraction of the core area | mW | mW

Notes: the residual power of the power-gated units assumes power-gates eliminate only part of the leakage power [flynn2007low], estimated relative to the C1 power; the context retention power follows [haj2020techniques]; the L1+L2 sleep-mode power is scaled from the 22 nm L3 implementation [chen201322nm_ppt, chen201322nm] to the 14 nm node based on [shahidi2018chip]; sleep-transistor efficiency is higher at the lower Pn voltage used in C6AE [luria2016dual, huang2016fully, haj2020flexwatts]; the PMA flow power is based on scaled wake-up logic power from [haj2020techniques]; the FIVR efficiency figure assumes light-load operation [haj2020flexwatts, haj2019comprehensive, radhakrishnan2021power]; FIVR static losses [haj2019comprehensive, haj2020flexwatts, asyabi2020peafowl] also apply in the C6 state [nalamalpu2015broadwell].

Table 3: Area and power requirements to implement AW in a Skylake-like core (numeric values are not recoverable from this rendering).

5.4 Idle Power Analysis

The AVX power-gates and the new UFPG shut off all the units in the core front-end and execution domains; however, since power gates can eliminate only part of the leakage power [flynn2007low], the UFPG domain has residual idle power when in C6A. Using the Intel core-power-breakdown tool [haj2016fine], which provides a power breakdown for different components, we estimate the leakage power contribution of the power-gated units starting from the leakage power of the entire core. The leakage of the entire core is approximately equivalent to the C1 power (see Table 1), since C1 only removes dynamic power by applying clock-gating. Our estimation shows that the newly power-gated units contribute the large majority of the core leakage. Hence, the remaining power overhead of UFPG (i.e., the fraction of the gated leakage power that the power-gates cannot eliminate) is small at nominal frequency/voltage (P1) and even smaller at minimum frequency/voltage (Pn).

The three UFPG context retention techniques collectively retain the core context [gwennap1997p6, ermolov2021undocumented], which consumes a small amount of power at retention voltage [haj2020techniques]. To estimate the retention power at nominal (P1) and minimum (Pn) frequency/voltage, we conservatively scale up the retention-level power for each operating point; the resulting context retention power estimates are reported in Table 3.

The CCSM implements sleep-mode using sleep transistors in the L1/L2 data arrays; this technique is already used for efficient design of the L3 cache SRAM in multiple server processors on the market [huang2013energy, chen201322nm, rusu20145, rusu201422]. We estimate the leakage power of the L1/L2 SRAM data arrays when in sleep-mode starting from the leakage power of the L3 cache SRAM with a sleep-mode implementation at 22 nm [chen201322nm_ppt, chen201322nm]. Based on established methodology [shahidi2018chip], we scale this value to the cumulative L1 and L2 capacity (about 1 MB per core, see Table 5) and to the 14 nm technology node used for Skylake. We use the same method to estimate the power of the rest of the power-ungated units (controllers and tags), which adds a small amount of power at the P1 voltage level. Reducing the core voltage to the Pn level increases the sleep-transistor efficiency and further reduces the leakage power. This is because a sleep transistor is effectively a linear voltage regulator (LVR): the LVR power-conversion efficiency is the ratio of the desired output voltage to the input voltage, hence the closer the input voltage is to the output voltage, the higher the power-conversion efficiency [luria2016dual, huang2016fully, haj2020flexwatts].
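The linear-regulator argument can be made concrete with a two-line calculation; the retention and core voltages below are assumed example values.

```python
# Sketch of the sleep-transistor (linear regulator) efficiency argument:
# efficiency ~ Vout / Vin, so lowering the core input voltage toward the SRAM
# retention voltage raises the conversion efficiency. Voltages are assumed examples.

def lvr_efficiency(v_in, v_out):
    return v_out / v_in

v_retention = 0.55                                   # assumed SRAM retention voltage (V)
for label, v_in in (("P1", 0.95), ("Pn", 0.65)):     # assumed core voltages (V)
    eff = lvr_efficiency(v_in, v_retention)
    print(f"{label}: sleep-transistor efficiency ~ {100 * eff:.0f}%")
```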

AW implements the C6A controller as an FSM within the PMA. Based on a comparable power-management flow implementation [haj2020techniques], we estimate that the C6A controller adds only a small amount of power to the PMA.

Finally, we estimate the C6A idle power needed by AW to keep the PLL on and locked during sleep while accounting for voltage regulator inefficiencies. The Skylake core uses an all-digital phase-locked loop (ADPLL) and a fully integrated voltage regulator (FIVR) [fayneh20164, Skylake_die_server]. The ADPLL consumes a few milliwatts (fixed across core voltage/frequency [fayneh20164]). The FIVR presents a dynamic efficiency loss, due to conduction and switching inefficiency [lakkas2016mosfet], and a static efficiency loss, due to the power consumption of the control and feedback circuits [lakkas2016mosfet, haj2020flexwatts, nalamalpu2015broadwell]. The static loss still applies even when the FIVR output power is negligible, and it amounts to a few milliwatts per core [haj2019comprehensive, haj2020flexwatts, asyabi2020peafowl]. The FIVR efficiency at light load is relatively low (excluding the static power losses) [haj2020flexwatts, haj2019comprehensive, radhakrishnan2021power].

5.5 C6A and C6AE Latency

We estimate the overall transition time (i.e., entry followed by a direct exit) of the C6A and C6AE states of AW to be on the order of 100 ns (see Table 4), roughly three orders of magnitude faster than the transition time that C6 requires. Next, we discuss in detail the entry, exit, and snoop handling latency for C6A and C6AE; we refer to the power management flow shown in Fig. 6.

5.5.1 C6A and C6AE Entry Latency

Clock-gating all domains while keeping the PLL on (step 1) typically takes a few cycles in an optimized clock distribution system [el2011clocking, shamanna2010scalable]. Transitioning to Pn (required for C6AE) happens with a non-blocking, parallel P-state flow that can take a few microseconds to tens of microseconds, depending on the power management architecture [gendler2021dvfs]. Since AW keeps the context in place, saving context to power-gate the core units (step 2) only requires asserting the Ret signal followed by deasserting the Pwr signal, as Fig. 5(c) illustrates; we estimate that this process takes a few cycles. Finally, placing the L1/L2 caches in sleep-mode and clock-gating them (step 3) takes a few cycles. Hence, the overall entry flow takes on the order of ten cycles of the power management controller clock, i.e., tens of nanoseconds. (Typically, the power management controller of a modern system-on-chip operates at a clock frequency of several hundred megahertz (e.g., 500 MHz [peterson2019fully]) to handle nanosecond-scale events, such as di/dt prevention [fayneh20164][haj2021ichannels, Sec. 5].)

5.5.2 C6A and C6AE Exit Latency

Clock-ungating the L1/L2 caches and exiting sleep-mode (step 4) takes a few cycles [huang2013energy]. Power-ungating the core units (step 5) takes on the order of 100 ns (further discussed in Sec. 5.7), and restoring the core context (i.e., deasserting the Ret signal after restoring power) takes a few cycles. Finally, clock-ungating all domains (step 6) typically takes a few cycles. Hence, the overall exit flow is dominated by the staggered power-ungating and takes on the order of 100 ns at the power management controller clock frequency.
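The latency budget described in this subsection and the previous one can be tallied with a short sketch; the per-step cycle counts and the controller clock frequency below are assumed placeholders consistent with the qualitative description.

```python
# Sketch of the C6A entry/exit latency budget: a handful of power-management
# controller cycles per step, plus the staggered UFPG power-ungating on exit.
# All numbers are assumed placeholders.

PMC_CLOCK_MHZ = 500.0                  # assumed controller clock (see footnote in Sec. 5.5.1)
NS_PER_CYCLE = 1e3 / PMC_CLOCK_MHZ

entry_cycles = {"clock-gate UFPG": 4, "assert Ret / deassert Pwr": 3, "caches to sleep": 2}
exit_cycles  = {"caches out of sleep": 2, "SRPG restore": 3, "clock-ungate": 4}
power_ungate_ns = 100.0                # staggered UFPG wake-up (Sec. 5.7)

entry_ns = sum(entry_cycles.values()) * NS_PER_CYCLE
exit_ns = sum(exit_cycles.values()) * NS_PER_CYCLE + power_ungate_ns
print(f"entry ~ {entry_ns:.0f} ns, exit ~ {exit_ns:.0f} ns, round trip ~ {entry_ns + exit_ns:.0f} ns")
```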

5.5.3 C6A and C6AE Snoop Handling

Snoop handling latency in C6A and C6AE is similar to that of C1 and C1E, respectively; more specifically: clock-ungating the L1/L2 caches and exiting sleep-mode (step a) takes two cycles. In the first cycle, the flow ungates the clock; in the second cycle, the snoop request simultaneously 1) accesses the cache tags (power-ungated) and 2) wakes up the cache data array [huang2013energy, chen201322nm, rusu20145, rusu201422]. Placing the L1/L2 caches back in sleep-mode and clock-gating them (step c) after servicing the snoop traffic (step b) takes a few cycles.

5.6 Performance Penalty

In an active CPU core, simultaneous operations in memory and/or logic circuits demand high current flow, which creates fast transient voltage droops from the nominal voltage [cho2016postsilicon]. One of the power-gating design challenges is the resistive voltage (IR) drop across a power gate, which exacerbates voltage droops [nithin2010dynamic, radhakrishnan2021power, haj2020flexwatts, shekhar2016power, haj2019comprehensive, jotwani2010x86]. The worst-case voltage droop can degrade the maximum attainable frequency at a given voltage, since an additional voltage (droop) guardband above the nominal voltage is required to enable the CPU core to run at the target frequency [nithin2010dynamic, radhakrishnan2021power, cho2016postsilicon]. An x86 implementation of a CPU core power-gate reports a frequency loss due to power-gating [jotwani2010x86]. AW's analytical model (discussed in Sec. 6.2) assumes a small frequency degradation due to the additional CPU core power-gates (i.e., UFPG).

Scheme | Core type | Power-gating trigger | Power-gated blocks | Wake-up overhead
roy2010state | In-order CPU | Cache miss | Register file | 5 cycles
jeong2012mapg | In-order CPU | Cache miss | Core | 10 ns
hu2004microarchitectural | OoO CPU | Execution unit idle | Execution units | 9 cycles
battle2012flexible | OoO CPU | Register file bank idle | Register file bank | 17 cycles
jeon2015gpu | GPU | Register subarray unused | Register subarray | 10 cycles
haj2021ichannels | OoO CPU | AVX execution unit idle | Intel AVX execution unit | 10 ns
AW (This work) | OoO CPU | Core idle | Most of core units | 100 ns
Table 4: Comparison of CPU/GPU core power-gating schemes.

5.7 Staggered Unit Wake-up

As discussed in Sec. 3, rapid wake-up of a power-gated domain results in a sudden increase in current demand, known as in-rush current [chadha2013architectural, usami2009design, agarwal2006power, abba2014improved], that could damage the chip. The Skylake core's AVX power-gating mitigates this issue by staggering the AVX un-gating over time [fayneh20164][haj2021ichannels, Sec. 5]. AW can exacerbate in-rush current effects, since during C6A/C6AE exit it wakes up a power-gated domain (i.e., UFPG, the red shaded area in Fig. 4) that has a much larger area and capacitance than the AVX units [haj2016fine].

To mitigate this issue, we divide the UFPG area into five zones, each with a local power-gate controller (as shown in Fig. 2). Each of the five local power-gate controllers has a zone sleep signal that is controlled by the core PMA. The PMA sequentially wakes up the five zones using these signals. Since each of the five zones has a smaller area than the AVX power-gated units, staggering the wake-up of each zone over the same interval used for the AVX units keeps the in-rush current within limits [chadha2013architectural, jeong2012mapg, agarwal2006power]. Hence, waking up all five zones takes approximately 100 ns (Table 4).

Several prior works propose nanosecond-scale staggered power gates wake-up for different units including the entire core. Table 4 summarizes some of these works.

5.8 Design Effort and Complexity

AW's proposed techniques involve non-negligible front-end and back-end design complexity and effort. First, although the medium-grained power-gates of Units Fast Power-Gating (UFPG) are less invasive than fine-grained power-gating, they still require significant back-end (e.g., power delivery, floor-planning, place and route) design effort. Second, the Caches Coherency and Sleep Mode (CCSM) and the C6A/C6AE control flows require careful pre-silicon verification to make sure that all the hardware flows (described in Fig. 6) operate as expected by the architecture specification. The effort and complexity can be significant if a processor vendor chooses to maintain two separate designs for client and server processors in order to eliminate AW's overhead from client processors.

However, AW's effort and complexity are comparable to those of recent techniques implemented in modern processors to increase their energy efficiency (e.g., hybrid cores). Therefore, we believe that once there is strong demand from customers and/or pressure from competitors, processor vendors will eventually implement an architecture similar to AW to significantly increase servers' energy efficiency.

6 Experimental Methodology

We outline our methodology for evaluating AW. First, we describe our workloads and the experimental setup used to run the workloads and collect input data for our analytical power model. Second, we introduce our new industry-grade analytical power and performance model (which we will open-source upon paper acceptance) for evaluating the baseline system and AW. Third, we discuss our process for validating our model against power measurements from a real modern server system based on the Intel Skylake [tam2018skylake] processor.

6.1 Workloads and Experimental Setup

We evaluate AW using Memcached [memcached], a lightweight key-value store that is widely deployed as a distributed caching service to accelerate user-facing applications with stringent latency requirements [nishtala2013memcached, yang2020twemcache, pymemcache]. Memcached has been the focus of numerous studies [lim2013memcached, xu2014memcached, leverich2014mutilate, prekas2017zygos], including efforts to provide low microsecond-scale tail latency [nishtala2013memcached, jialin2014memcached, asyabi2020peafowl].

Processor code name | Skylake
Model | Intel Xeon Silver 4114 [Skylake_4114]
Base/Minimum frequency | 2.2 GHz / 0.8 GHz
Max Turbo Boost frequency | 3 GHz
Thermal design power (TDP) | 85 W
L1I/L1D/L2 caches per core | 32 KB / 32 KB / 1 MB
L3 cache | 13.75 MB
Process technology node | 14 nm
Total cores/threads | 10/20
DRAM | 192 GB ECC DDR4-2666
Network round-trip delay | 117 µs
Linux kernel version | 4.15.18
Table 5: Hardware details of the servers running Memcached.

We use a small cluster of six server machines to run the Memcached service and workload clients. Table 5 provides the details of the server hardware. On one of the server machines, we deploy a single Memcached server process whose worker threads are each pinned to a distinct CPU core to minimize the impact of the OS scheduler on the key-value store tail latency. On the rest of the machines, we run a modified version of the Mutilate load generator [leverich2014mutilate] for Memcached. We configure the load generator to recreate the ETC workload from Facebook [atikoglu2012workload], using one master and four workload-generator clients, each running on a separate server. Each workload client maintains persistent connections to Memcached; we find that, in total, these connections are enough to stress the single Memcached server.

6.2 In-house Power and Performance Analytical Model

We model the average power of the baseline and AW CPU cores through the weighted sum of the durations of the different C-states and transition periods with their corresponding power consumption. We model the power consumption of the system enhanced with the two new C-states of AW (i.e., C6A and C6AE) using our power estimates (discussed in Sec. 5.4) and our baseline power model. Recall that the C6A and C6AE C-states of AW replace C1 and C1E, respectively. Our model takes into account the extra power in active mode and during transitions due to AW's performance degradation. Our performance model considers 1) the performance penalty of the power-gates (discussed in Sec. 5.6), which we scale according to the performance scalability of the workload (we define the performance scalability of a workload with respect to CPU frequency as the performance improvement the workload experiences with a unit increase in frequency [yasin2017performance, haj2015doee, gendler2021dvfs]), and 2) the extended C6A/C6AE transition latency compared to C1/C1E. We obtain power-state residency using the processor's residency reporting counters [intel_skl_dev]. Similar to previous works [kasture2015rubik, lo2014towards, kanev2014tradeoffs, chou2016dynsleep, fan2007power, chou2019mudpm, asyabi2020peafowl, mirhosseini2019enhancing, zhan2016carb], we focus on CPU power, which is the single largest contributor to server power [jin2020review, vasques2019review]. We describe our models for the baseline system and AW in more detail below.

Modeling the Baseline CPU Core. We develop a new analytical power model that estimates the average CPU core power, AvgP, within a workload, as follows:

\text{AvgP}_{\text{baseline}} = \sum_{s \in \text{C-states}} R_s \cdot P_s \qquad (2)

P_s denotes the core power consumption in power state s (reported in Table 1). R_s denotes the residency at power state s, i.e., the percentage of the total time the system spends in power state s. We obtain C-state residency and the number of transitions using the processor's residency reporting counters [intel_skl_dev]. We use the RAPL interface [HDC_intel] to measure power consumption.

Modeling the AW CPU Core. We model the power consumption of the CPU core enhanced with the two new C-states of AW (i.e., C6A and C6AE) using 1) measured data from our baseline power model, with the C-state residency scaled using our performance model (more details below), and 2) the estimated power of C6A and C6AE, as summarized in Table 3. The C6A and C6AE C-states of AW replace C1 and C1E, respectively, as follows:

\text{AvgP}_{\text{AW}} = \sum_{s \in \text{C-states}} R'_s \cdot P'_s \qquad (3)

Therefore, for a given workload, we perform the following steps. 1) We obtain the power and residency of each core C-state from the baseline model, and we scale the C-state residency taking into account i) AW's performance penalty, scaled using the workload's performance scalability metric, and ii) the extended C6A/C6AE transition latency compared to C1/C1E. 2) We replace the C1/C1E C-state residencies with the C6A/C6AE residencies. 3) We replace the C1/C1E power consumption with the estimated C6A/C6AE power consumption (Table 3). We plug the new values into our analytical model to estimate the average CPU core power consumption when applying AW.
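These steps can be condensed into a small sketch of the model. All numeric inputs below are assumed placeholders, and the residency adjustment is a simplified stand-in for the scalability-based scaling described above.

```python
# Sketch of the analytical model: Eq. 2 computes the baseline average core power
# from C-state residencies; Eq. 3 re-attributes the C1/C1E residency to C6A/C6AE
# and applies a simple performance-penalty adjustment. All inputs are assumed.

def avg_power(residency, power):
    return sum(residency[s] * power[s] for s in residency)

baseline_power = {"C0": 2.0, "C1": 0.8, "C1E": 0.6, "C6": 0.1}   # W per core (assumed)
aw_power = dict(baseline_power, C6A=0.15, C6AE=0.12)             # W, assumed AW states

residency = {"C0": 0.35, "C1": 0.50, "C1E": 0.15, "C6": 0.0}     # measured fractions (assumed)
perf_penalty = 0.02                                              # frequency-loss scaling (assumed)

aw_residency = dict(residency)
aw_residency["C0"] = residency["C0"] * (1 + perf_penalty)        # core stays active slightly longer
aw_residency["C6A"] = max(residency["C1"] - perf_penalty * residency["C0"], 0.0)
aw_residency["C6AE"] = residency["C1E"]
aw_residency["C1"] = aw_residency["C1E"] = 0.0                   # C1/C1E replaced by C6A/C6AE

print(f"baseline AvgP: {avg_power(residency, baseline_power):.2f} W")
print(f"AW AvgP      : {avg_power(aw_residency, aw_power):.2f} W")
```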

6.3 Measurements and Power Model Validation

The power consumption at each processor C-state and frequency step (i.e., P1 and Pn) is collected from measurements of real systems; the Skylake power consumption is shown in Table 1. To validate our model, we run five representative server workloads, Kafka [kreps2011kafka], Nginx [nginx], MySQL [mysql], Spark [spark], and Hive [hive], at multiple CPU utilization levels. We measure average power and collect core C-state residencies and transition times during each run. We use our analytical power model to estimate the average power consumption of these workloads and then compare the measured vs. estimated average power consumption. We find that our analytical power model estimates the measured average power with high accuracy for the Kafka, Nginx, MySQL, Spark, and Hive workloads.

7 Evaluation

We evaluate AW against the baseline Intel Skylake server (see Table 1) running a Memcached [memcached] service.

7.1 Power Savings and Overhead at Varying Load Levels

Fig. 7 shows how AW affects power consumption and request latency against the baseline with P-states disabled and C-states and Turbo enabled.

Figure 7: Comparison of AW against the baseline configuration (P-states disabled, Turbo enabled, C-states enabled) with varying request rates. (a) Residency of the baseline system in different C-states. (b) AW core average power (AvgP) reduction and average/tail latency degradation when replacing C1/C1E with C6A/C6AE. (c) Average response time degradation. (d) Performance scalability when increasing the core frequency.

We expect AW to achieve significant power savings, thanks to the lower power of C6A/C6AE compared to C1/C1E, while having a small impact on request latency because of the increased transition time and the frequency loss (discussed in Sec. 5.6). While the increased transition time impacts each C-state transition, the impact of the frequency loss depends on the sensitivity of the workload to the core frequency reduction.

Fig. 7(a) shows the residency of the system in each C-state: as expected, idle time is progressively reduced as the load (reported in queries per second, QPS) increases. Therefore, we expect AW to provide higher power savings and a lower impact on performance at low load. Indeed, Fig. 7(b) shows that AW substantially reduces the average power consumption at low load, with minimal impact on both average and tail latency. At high load, AW still provides power savings, with only a small impact on tail latency. For reference, Fig. 7(d) quantifies the scalability of our Memcached service when increasing the core frequency (discussed in Sec. 6.2).

Fig. 7(c) further analyzes the impact of AW on average response time. We consider end-to-end (including network latency) and server-side-only response time in two cases: the worst case, where we assume a C-state transition at each query, and the expected case, which keeps the actual C-state transitions observed in the baseline. As expected, the gap between the worst case and the expected case is larger at high load, since multiple queries are serviced within the same wake period. Finally, we observe that the degradation of the end-to-end response time (i.e., the time observed by a client) is negligible because the (unchanged) network latency dominates the overall response time.

AW attempts to make modern servers more energy proportional. Google observes in its work on latency-critical applications [lo2014towards]: “Modern servers are not energy proportional: they operate at peak energy efficiency when they are fully utilized, but have much lower efficiencies at lower utilizations”. The utilization of servers running latency-critical applications is kept low to meet target tail latency requirements, as reported by multiple works from academia and industry [lo2014towards, B1, B2, B3, B4]. For example, Alibaba recently reported that the utilization of its servers running latency-critical applications is typically 10% [B4]. Therefore, AW targets precisely the regime in which modern servers running latency-critical applications are least efficient: low utilization, i.e., low queries per second (QPS).

We conclude that AW significantly reduces the average core power consumption of the Memcached service across load levels, with minimal performance overhead relative to the baseline with P-states disabled and Turbo enabled.

7.2 Comparison to Commonly Used Configurations

Server vendors commonly provide recommended system configurations [cisco_cstates, dell_cstates, lenovo_cstates], such as disabling certain C-states to increase system performance and/or disabling Turbo Boost to reduce power consumption. We analyze three common configurations derived from our baseline (which has P-states disabled and Turbo and C-states enabled) by successively disabling Turbo, C6, and C1E. Before analyzing the impact of AW on these three tuned configurations, we study them individually.
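
On a Linux server, this kind of per-state disabling can also be applied at runtime through the kernel's cpuidle sysfs interface, independently of any firmware setting. The sketch below assumes root privileges, a kernel exposing /sys/devices/system/cpu/cpu*/cpuidle/, and intel_idle-style state names; the retention policy it encodes is purely illustrative.

# Sketch: disable deep idle states (e.g., C6) via the Linux cpuidle sysfs interface.
# Requires root; state names depend on the idle driver (e.g., intel_idle on Skylake).
import glob, os

KEEP = {"POLL", "C1", "C1E"}          # illustrative policy: disable everything deeper

for state_dir in glob.glob("/sys/devices/system/cpu/cpu*/cpuidle/state*"):
    with open(os.path.join(state_dir, "name")) as f:
        name = f.read().strip()
    with open(os.path.join(state_dir, "disable"), "w") as f:
        f.write("0" if name in KEEP else "1")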

Figure 8: Analysis of three tuned variants of the baseline configuration (NT_Baseline disables Turbo; NT_No_C6 disables Turbo and C6; NT_No_C6,No_C1E disables Turbo, C6, and C1E) in terms of (a) average latency, (b) tail latency, (c) package power, and (d) C-state residency at increasing load levels. All configurations have P-states disabled. QPS refers to queries per second.

Fig. 8 reports latency (average and tail), package power, and C-state residency for the three tuned configurations. We observe that NT_No_C6,No_C1E has the lowest average and tail latency, but also the highest average power, across the entire range of request rates. Latency improves because disabling C6 and C1E removes their long transition overheads (shown in Table 1), in contrast to the other two configurations, which spend significant time in these deeper C-states, as shown in Fig. 8(d). However, disabling these C-states also increases average power because, as shown in Fig. 8(d), the core now spends more time in shallower C-states, which have higher power than the deeper C-states (Table 1). This analysis shows that a new C-state that consumes power similar to (or lower than) C6, but with a transition time close to that of the shallow C-states, can maintain low average and tail latency at reduced power consumption. Next, we show that our newly proposed C-state, C6A, achieves this balance.

Fig. 9 compares AW against the three tuned configurations.

Figure 9: Evaluation of AW compared to three configurations at different request rates (QPS): the baseline (P-states disabled, C-states enabled) with Turbo disabled (NT_Baseline), with Turbo and C6 disabled (NT_No_C6), and with Turbo, C6, and C1E disabled (NT_No_C6,No_C1E), with respect to average latency, tail latency, and core average power (AvgP).

We observe that AW significantly reduces power consumption relative to all three tuned configurations. The reason is that AW replaces the time these configurations spend in shallow C-states with its much lower-power C6A state. Additionally, AW moderately improves performance compared to NT_Baseline and NT_No_C6, while degrading performance only marginally compared to NT_No_C6,No_C1E. Based on this analysis, we conclude that AW provides average and tail latencies comparable to those of tuned configurations that disable some power-management features, while providing significant power savings.

7.3 Analysis of Turbo Boost Performance Improvement

To maximize performance, server vendors recommend 1) enabling Turbo for better burst performance and 2) disabling C6 and C1E to avoid the performance overhead of C-state transition times [cisco_cstates, dell_cstates, lenovo_cstates]. However, server vendors also note that disabling C6 can hamper Turbo performance, since the processor is kept at high power and thus does not accumulate the thermal capacitance needed during Turbo Boost periods [lenovo_cstates, rotem2011power, rotem2012power, rotem2013power, rotem2015intel]. Therefore, with the current C-state architecture, we cannot have the best of both worlds, i.e., 1) the performance gain from removing C-state transition overhead and 2) Turbo frequency bursting. In this section, we demonstrate an additional benefit of AW: it enables high Turbo performance while eliminating this transition overhead.

Fig. 10 shows average and tail request latency for four configurations that combine enabling/disabling Turbo with enabling/disabling the C6 and C1E C-states, highlighting the effect of C-states on Turbo performance, compared to AW's C6A state with and without Turbo.

Figure 10: Average and tail latency at different request rates (QPS) for four configurations that show the effect of idle power states on Turbo performance: Turbo/No-Turbo with C6 disabled (T_No_C6/NT_No_C6), and Turbo/No-Turbo with C6 and C1E disabled (T_No_C6,No_C1E/NT_No_C6,No_C1E), compared with AW: Turbo/No-Turbo with C6A enabled and C6 and C1E disabled (T_C6A_No_C6_No_C1E/NT_C6A_No_C6_No_C1E).

We make three key observations. First, with Turbo disabled (Fig. 10(a,c)), disabling both C6 and C1E (i.e., NT_No_C6,No_C1E) improves average/tail latency over the same configuration with C1E enabled (i.e., NT_No_C6). Second, comparing Fig. 10(c) with Fig. 10(d) shows that enabling Turbo while disabling C6 and C1E (i.e., T_No_C6,No_C1E) does not improve performance over the same configuration with Turbo disabled (i.e., NT_No_C6,No_C1E). Third, with Turbo enabled (Fig. 10(b,d)), disabling only C6 (i.e., T_No_C6) has the same performance as additionally disabling C1E (i.e., NT_No_C6,No_C1E). The reason is that, in the T_No_C6 configuration, the transition overhead of C1E on average/tail latency offsets any thermal capacitance gains, and thus any ensuing performance gains, from Turbo.

We conclude that, in a configuration in which C6 and C1E are disabled while Turbo is enabled, large performance benefits can be obtained by additionally enabling C6A, i.e., T_C6A_No_C6,No_C1E. Doing so 1) provides larger thermal capacitance for Turbo than keeping the core in shallow C-states only, and 2) avoids the long transition latency overhead of the deeper C-states (i.e., C6 and C1E). We illustrate the potential benefits of Turbo with AW in Fig. 10(b,d) (dashed green line).

8 Related Work

To our knowledge, AW is the first practical proposal for a new C-state design directly targeting latency-critical applications in datacenters. While the problem of low server efficiency for latency-critical workloads has been studied before, previous work proposes management and scheduling techniques to mitigate the problem, rather than addressing it directly; here we review such recent proposals.

Fine-grained, Latency-Aware DVFS Management. Besides C-states, the other major power-management feature of modern processors is dynamic voltage and frequency scaling (DVFS). Previous work proposes fine-grained DVFS control to save power while avoiding excessive latency degradation. Rubik [kasture2015rubik] scales core frequency at sub-ms scale based on a statistical performance model to save power while still meeting target tail latency requirements. Swan [zhou2020swan] extends this idea to computational sprinting (e.g., Intel Turbo Boost): requests are initially served on a core operating at low frequency and, depending on the load, Swan scales the frequency up (including sprinting levels) to catch up and meet latency requirements. NMAP [kang2021nmap] focuses on the network stack and leverages transitions between polling and interrupt mode as a signal to drive DVFS management. The new C6A state of AW facilitates the effective use of idle states and makes a simple race-to-halt approach more attractive compared to complex DVFS management techniques.

Workload-Aware Idle State Management. Various proposals exist for techniques that profile incoming request streams and use that information to improve power management decisions. SleepScale [liu2014sleepscale] is a runtime power management tool that selects the most efficient C-state and DVFS setting for a given QoS constraint based on workload profiling information. WASP [yao2017wasp] proposes a two-level power management framework: the first level tries to steer bursty request streams to a subset of servers, such that other machines can leverage deeper, longer-latency idle states; the second level adjusts local power management decisions based on workload characteristics such as job size, arrival pattern, and system utilization. Similarly, CARB [zhan2016carb] tries to pack requests into a small subset of cores, while limiting latency degradation, so that the other cores have longer quiet times and can transition to deeper C-states. The idea of packing requests onto a subset of active cores, so as to extend quiet periods on other cores, is further explored by other work focusing on both C-state and DVFS management [chou2016dynsleep, asyabi2020peafowl, chou2019mudpm]. These proposals are orthogonal to AW: while our new C6A C-state can provide most of the benefits of a deep idle state at much lower latency, advanced and workload-aware sleep management techniques can still bring additional power savings by enabling cores to enter traditional deeper, higher-latency C-states.

9 Conclusion

This paper presents the design of AgileWatts (AW): a new C-state architecture that leverages existing power-gating and power-saving technologies along with a judicious choice of which core units to power-gate and which to keep ungated, so as to maximize power reduction while minimizing exit-latency overhead. AW represents the first attempt to reconcile deep low-power states with fast exit latency, with the end goal of eliminating the killer microseconds that currently prevent servers running latency-critical microservices from entering deep energy-saving states. Our detailed evaluation reveals that AW has the potential to realize power savings of up to 71% per core at a worst-case 1% performance degradation. These findings support the adoption of AW in future CPUs targeting datacenter servers running microservices and call for further research exploring additional opportunities, across applications, operating systems, microarchitecture, and technology, to make deep idle states more energy proportional and latency friendly.

References