ALERT: Accurate Anytime Learning for Energy and Timeliness

10/31/2019 ∙ by Chengcheng Wan, et al.

An increasing number of software applications incorporate runtime Deep Neural Network (DNN) inference for its great accuracy in many problem domains. While much prior work has separately tackled the problems of improving DNN-inference accuracy and improving DNN-inference efficiency, an important problem is under-explored: disciplined methods for dynamically managing application-specific latency, accuracy, and energy tradeoffs and constraints at run time. To address this need, we propose ALERT, a co-designed combination of runtime system and DNN nesting technique. The runtime takes latency, accuracy, and energy constraints, and uses dynamic feedback to predict the best DNN-model and system power-limit setting. The DNN nesting creates a type of flexible network that efficiently delivers a series of results with increasing accuracy as time goes on. These two parts complement each other well: the runtime is aware of the tradeoffs of different DNN settings, and the nested DNNs' flexibility allows the runtime prediction to satisfy application requirements even in unpredictable, changing environments. On real systems for both image and speech, ALERT achieves close-to-optimal results. Compared with the optimal static DNN-model and power-limit setting, which is impractical to predict, ALERT achieves a harmonic mean 33% energy reduction while satisfying accuracy and latency constraints, and reduces image-classification error rate by 58% and sentence-prediction perplexity by 52% while satisfying latency and energy constraints.


1 Introduction

1.1 Motivation

Deep neural networks (DNNs) have become a key workload for many computing systems due to their high inference accuracy. This accuracy, however, comes at the cost of long latency, high energy usage, or both. Recent work in DNN design and system support has improved not only DNN accuracy [17, 33, 41], but also model efficiency [39, 102, 5], performance, and energy usage [12, 13, 19, 27]. Successful deployment of DNN inference in the field, however, requires meeting a variety of user-defined, application-specific performance goals for latency, accuracy, and possibly energy, on a variety of hardware platforms and in sometimes unpredictable, dynamic environments, a problem that little work has explored.

Latency constraints naturally arise with DNN deployments when inference interacts with the real world as a consumer or producer: processing latency-sensitive data from a sensor or returning a timely answer to a human. For example, in simultaneous interpretation, translation must be provided every 2–4 seconds [64]; in motion tracking, a frame must be processed at camera speed. Violating these deadlines may lead to severe consequences: if a self-driving vehicle cannot act within a small time budget, life-threatening accidents could follow [61], and machine-learning service providers, like Google, have strict DNN-inference latency constraints [48].

Requirements on accuracy and energy are also common and may vary for different applications in different operating environments. For example, in both mobile systems and user-facing cloud systems, DNN-based applications may have inference accuracy requirements for correct functionality. While ensuring the accuracy requirement, it would be highly beneficial to minimize energy or resource usage to extend mobile-battery time or reduce server-operation cost [48, 61]. In extremely resource-limited systems, like some embedded and intermittent computing systems [23], applications might be constrained in energy and want to return the most accurate results given the energy constraint.

Previous research has produced huge tradeoff spaces of DNN models and system-resource settings, where different selections can be used to meet different requirements. Along the DNN dimension, people have found DNN design particularly well suited to approximate computing [7, 36, 80], where small sacrifices in accuracy can yield big gains in speed [13, 27, 34]. Along the system-resource dimension, DNN latency can be reduced by bringing more powerful hardware [100]—or even the same hardware operated at higher frequency—to bear, sacrificing energy to reduce latency without affecting accuracy.

The unexplored challenge is then to find a disciplined way to make the right DNN and system-resource selections to satisfy different requirements and goals for different users and applications. Note that, while it is possible to make the selection statically based on profiling results, real DNN deployments are further complicated by dynamic environmental variation:

  • Input Variety: Although DNN computation is quite regular, time variation among different inputs still exists. For example, our results show that the 75th-percentile latency is about 1.4x the median latency for some NLP tasks, even with a fixed DNN on dedicated hardware.

  • Contention Variety: An inference job may compete for resources against unpredictable, co-located jobs.

  • Requirement Variety: A job's latency requirement could vary dynamically depending on how much time related jobs have already consumed. Furthermore, other requirements could change; for example, an application might have only power requirements until it detects a specific event, at which point it switches to a strict accuracy requirement to correctly capture that event.

Thus, it would be ideal for systems to support inference by dynamically choosing the DNN model and resource settings to meet user-defined latency, accuracy, and energy goals.

1.2 Contributions

This paper makes several related contributions. We first quantify the degree to which DNN inference is affected by environmental variability. We then develop ALERT, a complementary runtime system and flexible DNN design. The runtime system leverages the DNN design to meet user goals by adapting to overcome the challenges of environmental changes.

Understanding the challenges

We measure DNN performance for a range of requirements across applications, inputs, hardware, and resource contention and observe high variation in inference time. We compare 42 existing DNN models and find they offer a wide spectrum of latency, energy, and accuracy. Although some models are more efficient than others (i.e., producing similar accuracy with less execution time), the overall trend is that higher accuracy comes at the cost of longer latency and/or higher energy consumption (Section 2).

Platforms: ODroid | Core i5 | Razer | Xeon (Skylake) | P100 | K80
CPU: ARM Cortex A-15 @ 2.0 GHz | Core i5 @ 2.9 GHz | Core i7 @ 2.2 GHz | Xeon Gold 6126 @ 2.60 GHz | Xeon E5-2670 @ 2.30 GHz
GPU: none on the CPU platforms | Tesla P100 | Tesla K80
Memory: DDR3 2GB | DDR3 16GB | DDR4 16GB | DDR4 16GB x12 | DDR4 16GB x8
LLC: 2MB | 3MB | 9MB | 19.25MB | 30MB
Table 1: Hardware platforms used in our experiments
ID   | Task                 | DNN Model     | Dataset
IMG1 | Image Classification | VGG16 [85]    | ILSVRC2012 (ImageNet)
IMG2 | Image Classification | ResNet50 [33] | ILSVRC2012 (ImageNet)
NLP1 | Sentence Prediction  | RNN           | Penn Treebank [66]
NLP2 | Question Answering   | Bert [17]     | Stanford Q&A Dataset (SQuAD) [77]
Table 2: ML tasks and benchmark datasets in our experiments
Figure 1: ALERT inference system
Run-time inference controller

We design an inference controller that uses dynamic feedback and prediction to select a DNN model and system resource setting that together provide a disciplined balance between energy consumption, inference latency, and accuracy. The controller predicts the distribution of possible behavior and can make aggressive choices when the environment is quiescent or conservative choices when it is volatile. The runtime meets constraints in any two dimensions while optimizing the third, for example minimizing energy given accuracy and latency requirements or maximizing accuracy given latency and energy budgets (Section 3).

Controller-friendly DNN design

Although the runtime works with any set of candidate DNN models with latency/energy/accuracy tradeoffs, we further design a special type of Anytime DNN that provides additional flexibility to handle unpredictable system dynamics.

Our Anytime-DNN design employs novel width- and depth-nesting to create a DNN family. Instead of producing one result, the nested family produces a series of results with non-decreasing accuracy and little additional overhead. While a traditional DNN produces no result if it misses a deadline, our Anytime-DNN produces slightly degraded inference results if execution is unexpectedly slow or the deadline comes early, giving the run-time more flexibility to reduce resources without violating constraints (Section 4).

We test ALERT using various types of DNN models (CNN, RNN, and self-attention) and application domains (image classification, recommendation systems, and NLP) on different machines and various constraints. Our evaluation shows that our proposed Anytime-DNN allows existing state-of-the-art approaches to provide Anytime inference with negligible cost. Across various experimental settings, ALERT runtime meets constraints while achieving within 93–99% of optimal energy saving or accuracy optimization. Compared to an impractical scheme that uses perfect latency and energy predication to select static optimal DNN and power settings, ALERT saves 33% energy while satisfying latency and accuracy constraints, or reduces the image-classification error rate by 58% and the perplexity of sentence prediction by 52% while satisfying latency and energy constraints (Section 5).

2 Understanding DNN Deployment Challenges

To better understand the challenges of DNN inference in latency critical environments, we profile two canonical machine learning tasks on a set of platforms. The two tasks, image classification and text translation (NLP), could both be deployed with deadlines; e.g., for motion tracking and simultaneous interpretation, and both have received wide attention, primarily in pursuit of accuracy, leading to a diverse set of available models. We use the state-of-the-art networks and common datasets (Table 2). The platforms cover representative embedded systems (ODroid), laptops (Core i5, Razer), CPU servers (Xeon), and GPU servers (P100, K80), as shown in Table 1.

2.1 Understanding Inference Latency

To emulate real-world scenarios, we feed the network one input at a time and use 1/10 of the total data for warm-up. After the warm-up phase, we run each input multiple times and use the average inference latency for that input.
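For concreteness, the sketch below (Python) shows the shape of this measurement loop; the function and parameter names are illustrative, and the per-input repetition count, which was elided above, is a placeholder.

```python
import time
import statistics

def measure_latencies(model, inputs, warmup_frac=0.1, repeats=5):
    """Sketch of per-input latency measurement with a warm-up phase.

    The first `warmup_frac` of the inputs are run once and discarded;
    every remaining input is run `repeats` times (a placeholder value)
    and its average latency recorded.
    """
    warmup = int(len(inputs) * warmup_frac)
    for x in inputs[:warmup]:
        model(x)                        # warm up caches, JIT, and frequency governors
    latencies = []
    for x in inputs[warmup:]:
        runs = []
        for _ in range(repeats):
            start = time.perf_counter()
            model(x)
            runs.append(time.perf_counter() - start)
        latencies.append(statistics.mean(runs))
    return latencies
```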

Q1: Might deadline violations occur?

Yes (Figure 2). Different applications will likely impose different inference deadlines. Image classification on video might have deadlines from 1 second down to the camera latency (e.g., 1/60 seconds) [47]. The two NLP tasks, particularly NLP2, also risk deadline violations, as studies show human users typically expect text results in 1 second [71]. There is clearly no single DNN design that meets all deadlines on all hardware.

Q2: Does latency vary across inputs?

Yes (Figure 2). While latency variance across hardware is not surprising, the variance is still non-negligible even for fixed networks and hardware. Except for NLP2, all tasks have outlier inputs that take much longer to process across all platforms. For example, in IMG1's Tesla P100 setting, the longest latency is well above the median. Among all tasks, NLP1 has the largest input variance: its 75th (90th) percentile latency is at least 1.37x (1.72x) the median latency across all platforms. This large variance is likely because NLP1 inputs have different lengths.

Q3: Is latency affected by contention?

Yes (Figure 3). We repeat the experiments from Figure 2, except that the inference job now competes for resources with a co-located job, the memory-intensive STREAM benchmark [67]. Comparing Figures 3 and 2, we see that the co-located memory-intensive job increases both the median and the tail inference latency. The same trend applies to CPU contention; we omit the result figure due to space constraints.

Figure 2: Latency variance across inputs for different tasks and hardware (most tasks have 5 boxplots for 5 hardware platforms: Core i5, Razer, Xeon, P100, K80, from left to right; NLP1 has an extra boxplot for the ODroid platform; the other three tasks ran out of memory on ODroid; every box shows the 25th–75th percentile; points beyond the whiskers are >90th or <10th percentile).
Figure 3: Latency variance under memory contention

Summary: Deadline violations are realistic concerns, and inference latency varies greatly across inputs and platforms, with or without resource competition. Sticking to one static DNN design across inputs, platforms, and workloads leads to an unpleasant trade-off: either always meet the deadline by sacrificing accuracy in most settings, or achieve high accuracy in some settings but miss the deadline in others.

Figure 4: Accuracy and latency tradeoffs for 42 DNNs on Xeon.

2.2 Understanding Tradeoffs Across Networks

Many DNN models have been designed for classic tasks such as image classification. To understand the tradeoffs offered by different DNN models for the same task, we run all 42 trained image-classification models provided by the TensorFlow website [83] on the 50,000 images from ImageNet [16]. We also include a naive model that takes a random guess, incurring almost 100% error with no latency. Figure 4 shows the accuracy/latency trade-offs (blue circles) offered by these models on Xeon (numbers are averaged across all images in ImageNet). The trends on other machines are similar.

Q4: What are the tradeoffs?

As shown in Figure 4, these 42 models cover a wide spectrum of latency and accuracy tradeoffs. The fastest model runs almost 12x faster than the slowest one, and the most accurate model has about 7.8x lower error rate than the least accurate model. While omitted due to space, we also find the networks span a wide range of energy usage. These results reflect the intuition that greater DNN accuracy generally comes at the cost of greater resource usage (e.g., time and energy).

Q5: Which network provides the best tradeoff?

There is no magic network that provides both the highest accuracy and the lowest latency or energy consumption. In Figure 4, all the networks that sit above the lower-convex-hull curve represent sub-optimal tradeoffs. For example, MobileNet-v2 (width multiplier 0.75), a network sitting above the curve, incurs similar latency to ResNet50 but 1.5x its error rate.

Summary:

Inference latency tends to increase with DNN accuracy and energy usage. Given different deadlines and run-time environments, different DNN models should be chosen. Furthermore, since existing models are far from covering the whole latency spectrum (Figure 4), it is generally impossible to pick a DNN model whose inference uses up exactly the time available before the deadline. These observations demonstrate the room for improvement in resource management and energy saving.

3 ALERT Run-time Inference Management

ALERT’s runtime system addresses the challenges highlighted earlier of meeting user-specified latency, accuracy, and energy constraints and optimization goals across different DNN models, inputs, run-time contention, and hardware.

As illustrated in Figure 1, ALERT takes a set of DNN models and user-specified requirements on latency, accuracy, and energy usage. Then, for each input at runtime, ALERT measures feedback on each requirement and selects for the next inference (1) a DNN model m from a candidate set M, and (2) a resource setting, expressed as a power cap p from a set P, to meet constraints in any two of the three dimensions of latency, accuracy, and energy, while optimizing the third. More formally, ALERT fulfills either of these user-specified goals:¹

¹In this paper, we examine constraints on latency and accuracy as well as latency and energy. For space, we omit discussion of meeting energy and accuracy constraints while minimizing latency, as it is a trivial extension of the discussed techniques and we believe it to be the least practically useful.


  • Maximizing inference accuracy for an energy budget E_budget and inference deadline T:

    max_{m∈M, p∈P} a_m   s.t.   t_{m,p} ≤ T,  e_{m,p} ≤ E_budget    (1)

  • Minimizing energy use for an accuracy goal A_goal and inference deadline T:

    min_{m∈M, p∈P} e_{m,p}   s.t.   t_{m,p} ≤ T,  a_m ≥ A_goal    (2)

Here t_{m,p}, e_{m,p}, and a_m denote the expected latency, energy, and accuracy of running model m under power cap p.

We use a power budget as the interface between the run-time and the system resource manager. Both hardware [15] and software resource managers [99, 35, 78] can convert power budgets into optimal performance resource allocations. So, ALERT should be compatible with many different schemes from both commercial products and the research literature.

3.1 Key ideas

Satisfying the constraints of Eqs. 1 and 2 would be trivial if the deployment environment were guaranteed to match the training and profiling environment: we could simply estimate t_{m,p} to be the average inference time over a set of profiling inputs under model m and power setting p. However, this static approach does not work given dynamic hardware, input, contention, and requirement variation. We break down the tasks and present ALERT's key ideas below.

How to estimate the inference time t_{m,p}?

To handle run-time variation, a potential solution is to apply an online learner, like a Kalman filter [63], to make dynamic estimates based on recent history. The problem is that most models and power settings will not have been picked recently and hence have no recent history to feed into the learner.

Idea 1

To make effective online estimation for all combinations of models and power settings, ALERT introduces a global slow-down factor ξ that captures how the current environment differs from the profiled environment (e.g., due to co-running processes, hardware variation, or other causes). Such an environmental slow-down factor is independent of any individual model or power selection. It can be estimated from execution history no matter which models and power settings were recently used; it can then be used to estimate t_{m,p}, from the profiled latency, for all m and p combinations.
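A minimal sketch of this idea follows, assuming a small hypothetical table of profiled latencies (the configurations and numbers below are illustrative, not from the paper):

```python
# Hypothetical profiled mean latencies (seconds), indexed by (model, power_cap).
profiled_latency = {
    ("resnet50", 35): 0.080, ("resnet50", 25): 0.110,
    ("mobilenet", 35): 0.020, ("mobilenet", 25): 0.028,
}

def observed_slowdown(observed_latency, model, power_cap):
    """One observation of the global slow-down factor, from whatever configuration just ran."""
    return observed_latency / profiled_latency[(model, power_cap)]

def estimate_all_latencies(xi):
    """Scale every configuration's profiled latency by the same factor xi."""
    return {cfg: xi * t for cfg, t in profiled_latency.items()}

# Example: the last input ran MobileNet at a 35W cap and took 30 ms instead of 20 ms.
xi = observed_slowdown(0.030, "mobilenet", 35)   # xi = 1.5
predictions = estimate_all_latencies(xi)          # every configuration predicted 1.5x slower
```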

How to estimate the accuracy under a deadline?

Given a deadline T, the inference accuracy delivered by model m under power setting p is determined by three factors, as shown in Eq. 3: (1) whether the inference result, which takes t_{m,p} to compute, can be generated before the deadline T; (2) if yes, the accuracy is that of the model m, denoted a_m;² (3) if not, without any output at the deadline, a random guess is used as the inference result, greatly reducing accuracy to a value denoted a_rand.

²Since it could be infeasible to calculate the exact inference accuracy at run time, ALERT uses the average training accuracy of the selected DNN model m, denoted a_m, as the inference accuracy, as long as the inference computation finishes before the specified deadline.

    a_{m,p}(T) = a_m  if t_{m,p} ≤ T;   a_rand  otherwise.    (3)

A potential solution for estimating the accuracy at deadline T is to simply feed the estimated t_{m,p} into Eq. 3. However, this approach could lead to sub-optimal decisions under high run-time variation, when the real inference time has a high probability of rising above the estimate. To understand this problem, imagine two configurations c1 = {m1, p1} and c2 = {m2, p2}. Under c1, the inference is estimated to finish right before the deadline with high accuracy; under c2, the inference is estimated to finish long before the deadline with medium accuracy. The straightforward solution always picks c1; however, if the run-time environment is highly variable—as in Figure 3—this pick might be wrong. It is then better to conservatively pick c2, ensuring that a result, rather than a random guess, is available at the deadline.

Idea 2

To handle run-time variability, ALERT treats the execution time t_{m,p} and the global slow-down factor ξ as random variables drawn from a normal distribution. ALERT uses a Kalman filter to estimate not only the mean but also the standard deviation of ξ (and hence of t_{m,p}), which reflects the run-time environment's variation. ALERT then computes the expected range of t_{m,p}, instead of just its mean.

Idea 3

To further address run-time variance and alleviate the cost of an incorrect estimate of t_{m,p}—no estimation can guarantee to be error free, after all—ALERT proposes a novel DNN design so that even when the deadline is missed (e.g., the real t_{m,p} is much longer than estimated), an inference result much better than a random guess can still be provided. This design is presented in Section 4.

How to minimize energy or satisfy energy constraints?

Minimizing energy, or satisfying energy constraints, for given latency (and accuracy) constraints is complicated, as energy is related to, but cannot be easily calculated from, the complexity of the selected model m and the power cap p. For a given model m, one could set a higher power limit to finish inference early and then transition to a low-power idle mode; or one could set a lower power limit so that the inference task runs longer but consumes less power at each time unit.

Idea 4

ALERT leverages insight from previous research, which shows that energy minimization for latency-constrained systems can be efficiently expressed as a mathematical optimization problem [56, 10, 59, 70]. The optimization finds the greatest energy savings while accounting for both idle power and the deadline constraint:

(4)

ALERT guarantees an energy constraint by predicting both the power drawn while the DNN is running and the power drawn while the DNN is idle. We refer to the latter as DNN-idle power to denote that the system may be performing other, non-DNN tasks. ALERT picks the configuration within the energy budget with the highest accuracy, turning the problem (Eq. 1) into:

(5)

If a user instead provides a power budget P_budget, ALERT converts it into an energy budget over the inference window: E_budget = P_budget · T.

3.2 Algorithm

3.2.1 Overview

ALERT follows four steps for each input:

1) Measurement. ALERT measures the processing time and energy usage, and computes the inference accuracy, of the input that was just processed.

2) Goal adjustment. ALERT updates the latency goal if necessary. Different inputs may demand different inference deadlines (e.g., the NLP1 workload in Table 2), and delays in processing previous inputs could shorten the time available for the next input [2, 55]. Additionally, ALERT tightens the goal latency to compensate for its own worst-case overhead, so that the runtime system itself will not cause violations. ALERT also updates the accuracy goal if needed.³

³In experiments, we consider the accuracy goal to be the average accuracy over any N consecutive inputs. Consequently, the goal for each individual input is adjusted based on the accuracy actually delivered for the preceding inputs.

3) Feedback-based estimation. ALERT updates the global slow-down factor ξ, including its mean and standard deviation (see Section 3.2.2), and computes the expected accuracy for every combination of DNN model and power setting (Section 3.2.3). ALERT uses the measured energy to update the DNN-idle power ratio and then predicts the energy consumption of each configuration (Section 3.2.4).

4) Picking a configuration. ALERT feeds all the updated estimates of latency, accuracy, and energy into Eqs. 4 and 5 and obtains the desired DNN model and power-cap setting for the next input.

The key task is to estimate ξ, and from it the latency, accuracy, and energy of every combination of models and power limits.
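A sketch of the final selection step follows (Python). The estimate maps and constraint values are stand-ins, and the real ALERT accuracy estimate additionally accounts for the deadline-miss probability as in Section 3.2.3.

```python
def pick_configuration(est_latency, est_accuracy, est_energy,
                       deadline, energy_budget=None, accuracy_goal=None):
    """Pick a (model, power_cap) configuration in the spirit of Eq. 1 or Eq. 2.

    Each `est_*` argument maps a configuration (model, power_cap) to its current
    estimate. If `energy_budget` is given, accuracy is maximized (Eq. 1);
    if `accuracy_goal` is given, energy is minimized (Eq. 2).
    """
    best_cfg, best_score = None, None
    for cfg in est_latency:
        if est_latency[cfg] > deadline:
            continue                              # would miss the deadline
        if energy_budget is not None:             # Eq. 1: maximize accuracy
            if est_energy[cfg] > energy_budget:
                continue
            score = est_accuracy[cfg]
            better = best_score is None or score > best_score
        else:                                     # Eq. 2: minimize energy
            if est_accuracy[cfg] < accuracy_goal:
                continue
            score = est_energy[cfg]
            better = best_score is None or score < best_score
        if better:
            best_cfg, best_score = cfg, score
    return best_cfg
```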

3.2.2 Global Slow-down Factor

ALERT introduces a global slow-down factor ξ to reflect how the run-time environment differs from the profiling environment. Conceptually, if the inference task under model m and power cap p took t_{m,p} to finish at run time and took t^profile_{m,p} on average during profiling, the corresponding slow-down factor would be ξ = t_{m,p} / t^profile_{m,p}. We can estimate ξ using recent execution history under any model and power setting and then apply it to make estimations for all combinations of models and power settings.

After an input i, ALERT computes the observed slow-down factor ξ_i as the ratio of the observed time to the profiled time, and then uses a Kalman filter to estimate the mean and standard deviation of the slow-down factor at input i. A Kalman filter is an optimal estimator that assumes a normal distribution and estimates a varying quantity based on multiple, potentially noisy, observations [63]. ALERT's formulation is defined in Eq. 6; it tracks the Kalman-filter gain, a constant measurement noise, and a process noise whose variance is adapted with a forgetting factor [3], with all filter state initialized following standard conventions [63].

(6)

Then, using the estimated ξ, ALERT estimates the inference time of the next input under any model m and power cap p as t_{m,p} = ξ · t^profile_{m,p}.
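The sketch below shows the general shape of such an estimator (Python). It is not ALERT's exact Eq. 6 (in particular, it omits the adaptive process noise with a forgetting factor), but it illustrates how a scalar Kalman filter can track both a mean estimate of ξ and a measure of its variability.

```python
class SlowdownEstimator:
    """Minimal scalar Kalman-filter sketch for the slow-down factor xi.

    Tracks a mean estimate via the usual predict/update steps, and keeps a
    smoothed variance of the observations so the runtime can also reason
    about environment variability (Idea 2). The noise constants are
    illustrative, not the paper's values.
    """
    def __init__(self, q=0.01, r=0.04):
        self.xi = 1.0        # mean estimate of the slow-down factor
        self.p = 1.0         # estimate covariance
        self.q = q           # process noise (how fast the environment drifts)
        self.r = r           # measurement noise
        self.var = 0.0       # smoothed spread of observed xi values

    def update(self, observed_xi, alpha=0.9):
        self.p += self.q                           # predict: model xi as a random walk
        k = self.p / (self.p + self.r)             # Kalman gain
        self.xi += k * (observed_xi - self.xi)     # update the mean with the new observation
        self.p *= (1.0 - k)
        # Exponentially weighted variance of the residuals, exposed as a std. deviation.
        self.var = alpha * self.var + (1 - alpha) * (observed_xi - self.xi) ** 2
        return self.xi, self.var ** 0.5
```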

3.2.3 Accuracy

As discussed earlier, ALERT computes the estimated inference accuracy by treating t_{m,p} as a random variable that follows a normal distribution, with mean and standard deviation derived from those of ξ. The expected accuracy at deadline T weighs the model's accuracy by the probability of finishing in time:

    E[a_{m,p}(T)] = a_m · Pr[t_{m,p} ≤ T] + a_rand · Pr[t_{m,p} > T]    (7)
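Under the normality assumption, this expectation can be computed with the normal CDF; the sketch below (Python) uses illustrative numbers.

```python
import math

def normal_cdf(x, mean, std):
    if std <= 0:
        return 1.0 if x >= mean else 0.0
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def expected_accuracy(deadline, mean_latency, std_latency, model_acc, random_acc):
    """Expected accuracy following the structure of Eq. 7: the model's accuracy
    weighted by the probability of finishing in time, plus the random-guess
    accuracy weighted by the probability of missing the deadline."""
    p_meet = normal_cdf(deadline, mean_latency, std_latency)
    return model_acc * p_meet + random_acc * (1.0 - p_meet)

# Example: 76% top-1 accuracy, 80 ms +/- 20 ms predicted latency, 100 ms deadline.
acc = expected_accuracy(0.100, 0.080, 0.020, 0.76, 0.001)   # ~0.64
```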

3.2.4 Energy

ALERT predicts each configuration's energy consumption by separately estimating the energy (1) during DNN execution, estimated by multiplying the power limit by the estimated latency, and (2) between inference inputs, estimated from the predicted DNN-idle power ratio, which is tracked by the Kalman filter in Eq. 8 with its own process variance, process noise, measurement noise, and filter gain, initialized following the same conventions as Eq. 6.

(8)

ALERT then predicts the energy by Eq. 9.

(9)
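A hedged sketch of this two-part estimate follows (Python); the per-input window length and the treatment of the idle-power ratio as a fraction of the power cap are illustrative assumptions, not the exact form of Eq. 9.

```python
def predict_energy(power_cap, est_latency, idle_power_ratio, window):
    """Per-input energy estimate: (1) DNN execution approximated as the power cap
    times the predicted latency, plus (2) the remainder of the input's window
    spent at the predicted DNN-idle power."""
    run_energy = power_cap * est_latency
    idle_time = max(window - est_latency, 0.0)
    idle_energy = idle_power_ratio * power_cap * idle_time
    return run_energy + idle_energy

# Example: a 25 W cap, 60 ms predicted latency, 40% idle-power ratio, 100 ms window.
joules = predict_energy(25.0, 0.060, 0.4, 0.100)   # 1.5 J running + 0.4 J idle = 1.9 J
```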

3.2.5 Limitations of ALERT

(1) ALERT applies its global slow-down factor to all DNN/power configurations. It is possible that some perturbations lead to different slowdowns for different configurations. However, we believe the slight loss of accuracy here is outweighed by the benefit of having a simple mechanism that allows prediction even for configurations that have not been used recently. (2) ALERT's prediction, particularly the Kalman filter, relies on feedback from recent input processing. Consequently, it requires at least one input to react to sudden changes. (3) ALERT incurs overhead in both scheduler computation and switching from one DNN/power setting to another. We find this overhead to be small compared with inference cost, just 0.6–1.7% of input processing time. We explicitly account for the overhead by subtracting it from the user-specified goal (see step 2 in Section 3.2.1); in effect, ALERT schedules both itself and the inference. (4) ALERT provides probabilistic guarantees, not hard guarantees. Because ALERT estimates not just average timing, but the distributions of possible timings, it can provide arbitrarily many nines of assurance that it will meet latency or accuracy goals; e.g., scheduling for slow-down factors of up to three standard deviations corresponds to meeting the goals 99.7% of the time. Providing 100% guarantees, however, requires much more conservative configuration selection—hurting both energy and accuracy—a property shared by all systems that must choose between probabilistic and hard guarantees [9].

3.3 Implementation

ALERT currently adjusts power through Intel's Running Average Power Limit (RAPL) interface [15], which allows software to set a hardware power limit. ALERT can also work with other approaches that translate power limits into settings for combinations of resources [35, 42, 78, 99]. For example, ALERT could be trivially modified to adjust GPU power through nvidia-smi. We could not explore this in this paper because the GPU drivers that support machine-learning frameworks (e.g., TensorFlow and PyTorch) on our test systems do not allow software to set the GPU speed. Furthermore, ALERT's design is compatible with any emerging specialized DNN accelerator that exports a power-management interface.
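For reference, on Linux the RAPL package power limit is exposed through the powercap sysfs interface; the snippet below is a minimal sketch of setting it (root privileges are required, the zone path varies across machines, and this is illustrative rather than ALERT's actual implementation).

```python
# Minimal sketch of setting a package power cap through Linux's powercap/RAPL
# sysfs interface. The zone path below is a common default but is machine-dependent.
RAPL_ZONE = "/sys/class/powercap/intel-rapl:0"   # package-0 power zone

def set_power_cap(watts, zone=RAPL_ZONE):
    with open(f"{zone}/constraint_0_power_limit_uw", "w") as f:
        f.write(str(int(watts * 1_000_000)))     # the limit is specified in microwatts

def read_power_cap(zone=RAPL_ZONE):
    with open(f"{zone}/constraint_0_power_limit_uw") as f:
        return int(f.read()) / 1_000_000         # convert back to watts

set_power_cap(35.0)   # e.g., one of ALERT's discrete power buckets
```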

ALERT does not change the power setting continuously, but instead uses a set of discrete buckets within the feasible range of power-limit settings. ALERT considers a series of settings with a 2.5W interval on our test laptop and a 5W interval on our test server, as the latter has a wider power range than the former. The number of power buckets is configurable.

Equation 6 requires feedback to update the global slow-down factor ξ. ALERT cannot observe the full latency when a deadline is missed (because it immediately starts processing the next input). To account for this inability to measure the true latency, we inflate the latency passed to the Kalman filter by 20% on a deadline miss, ensuring ALERT chooses a more conservative configuration in the next iteration.

Users may set goals that are not achievable. If ALERT cannot meet all constraints, it prioritizes latency highest, then accuracy, then power. This hierarchy is configurable.

4 ALERT Anytime DNN Design

ALERT provides probabilistic, not absolute, guarantees. When ALERT occasionally makes an incorrect prediction, perhaps during a sudden change of resource contention, the inference task may miss its deadline and cause inference accuracy to drop to that of a random guess (Eq. 3).

Figure 5: Ensemble on 3 different networks (each in a row).

To alleviate the damage of a missed deadline, we develop a new class of DNN model that outputs a series of increasingly accurate results over time: an Anytime DNN. Conceptually, it is a nested family of DNNs {DNN_1, DNN_2, …}. As time goes on, it produces outputs o_1, o_2, …, with o_{k+1} more reliable than o_k. Even if ALERT fails to handle an unpredictable run-time variation and the expected inference result is not ready at the deadline, an earlier result o_k can still be used with only a slight accuracy loss, turning Eq. 3 into:

    a_{m,p}(T) = a_{DNN_k},  where DNN_k is the largest nested DNN with t_{DNN_k,p} ≤ T;  a_{m,p}(T) = a_rand only if no nested DNN finishes by T.    (10)

Using a novel design methodology, we construct families of DNNs with extremely efficient execution properties for use as an Anytime DNN. Our key insight is to architect the network family so that later DNN models re-use internal state of earlier models, accelerating the execution of the entire Anytime DNN.
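A minimal sketch of how a runtime could consume such a nested family is shown below (Python). The model interface and names are assumptions, and a real deployment would read the latest completed result when the deadline fires rather than checking only after each member completes.

```python
import time

def anytime_infer(nested_models, x, deadline_s):
    """Run the nested family, ordered from fastest/least accurate to slowest/most
    accurate, until the deadline. Each member is abstracted as a callable taking
    (input, state) and returning (output, state) so it can reuse the internal
    state computed by the previous member. Returns the most accurate completed
    output, or None if even the first member does not finish (the caller then
    falls back to a random guess, as in Eq. 3)."""
    start = time.perf_counter()
    best_output, state = None, None
    for model in nested_models:
        output, state = model(x, state)          # reuse earlier internal state
        best_output = output
        if time.perf_counter() - start >= deadline_s:
            break                                # deadline reached: stop refining
    return best_output
```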

4.1 Design Principles

Strawman

One could simply take a sequence of independently designed and trained DNNs with non-decreasing accuracy and latency; Figure 4 contains many such sequences. At run time, given an input, inference starts with the fastest, least accurate network. After the first network completes, the second network is run, and so on until the deadline. As an accuracy improvement, all outputs can be rolled into a classic ensemble [18]: whenever the k-th model finishes, the ensemble outputs a weighted average of the outputs of the first k independent models, as illustrated in Figure 5.

This baseline design is intuitively ineffective: the computation of the first k−1 independent models seems largely wasted, with the ensemble only slightly improving, if at all, on the accuracy of the k-th individual model. To avoid similar problems in our Anytime DNN, we leverage the following two observations.

Observation-1

An effective way to improve accuracy with more computation is to grow a network larger, moving to both deeper (more layers) [85, 87, 33] and wider (more neurons per layer) [98] designs, an idea that applies not only to vision models but also to language models [17]. Moreover, while not apparent from Figure 4, the trend is especially pronounced within an architecture family: if residual networks (ResNets) are preferred for a particular domain, a 50-layer ResNet will deliver better accuracy than a 34-layer ResNet [33].

Observation-2

We can reduce the computational cost of an Anytime DNN by fully reusing the internal states of earlier networks to bootstrap later networks. Each layer within a very deep neural network might be learning to perform an incremental update on a certain representation [25, 46]; deeper networks might refine their representations more gradually than shallower networks. If we were to align layers of different networks trained for the same task according to their relative depth within their own network, we might find they are using related representations. This representation similarity suggests that we can jump-start computation in larger networks by giving them direct access to the previously computed internal states of smaller networks. It also suggests specific design rules for connecting the nested smaller and larger networks.

Figure 6: Cascade with 3 outputs.

Note that having aligned state sharing is important. For example, an established image-classification approach, the cascade, is used in several recent network designs [40, 68, 89]: some inputs traverse only a subset of layers while others traverse all layers. We do not use this design for our Anytime inference, as it does not have the alignment and connectivity patterns required to nest networks for optimal state re-use. Figure 6 illustrates a particular cascade pattern, typical of prior work. Here, because some outputs are generated without traversing later pipeline stages—which tend to capture high-level input features—this design can lead to large accuracy loss; the additional computation on every early output path to compensate for the skipped structures (red boxes in Figure 6) can recover some, but not all, of the accuracy loss, and leads to inefficiency.

4.2 Proposed Nested Anytime DNN

Figure 7: Width-wise nesting.

Our Anytime design contains a sequence of fully nested DNNs: the first is completely contained within the second, which is in turn a subpart of the third, and so on. Progressing from one nested DNN to the next, our scheme permits growing the DNN in width, depth, or both. In each case, we connect internal layers of the smaller network to the internal layers of the larger network most appropriate for consuming their signals, according to the alignment principle explained previously. Together with the nesting requirement, our Anytime DNNs have the following properties:

  • Pipeline structure: Every subnetwork follows the usual pipeline structure of a traditional DNN.

  • Feed-forward: Neurons feed outputs into deeper layers of the same network or units in a larger network; connections are purely feed-forward in depth or nesting level.

Every subnetwork will closely, but not exactly, resemble a traditional non-nested network. The difference is that some neurons in an outer nesting level will not feed backward into neurons in inner nesting levels. Dropping these connections slightly lightens the compute load and parameter count, slightly shifting the network's position on the latency-accuracy curve,⁴ which is inconsequential in the big picture of an Anytime DNN that populates the tradeoff curve with many nested networks. Critically, our approach is general and works with different types of DNNs without application-specific changes.

⁴Compared with skipping a neural-network pipeline stage, cutting a few edges does not hurt accuracy much. Indeed, many standard DNN designs use sparse or grouped connection structures, which do not connect channels in a subsequent layer with all channels in the prior layer.

4.2.1 Width Nesting

Architecture

Width nesting divides a traditional DNN into horizontal stripes, forming a family of width-nested DNNs, with the k-th DNN including all the neurons inside the first k stripes. The connections in the original DNN then fall into three categories, as shown in Figure 7: (1) connections between neurons in the same stripe (horizontal edges and invisible edges inside each box); (2) connections from an earlier-stripe neuron to a later-stripe neuron (downward edges); (3) connections from a later-stripe neuron to an earlier-stripe neuron (upward gray edges). The corresponding width-nested Anytime DNN contains all edges of types (1) and (2), but drops edges of type (3) to satisfy the nesting property.

Assessment

This design applies to almost all DNNs; Section 5 demonstrates this generality on multiple DNN types, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and a self-attention network.

We use a power-of-2 sequence for stripe widths, so that each nested network contains twice as many neurons per layer as the previous one. This choice creates a good trade-off curve for accuracy and latency. Additionally, comparing the outermost DNN in the Anytime family to a traditional DNN of the same width, only a small portion of edges are pruned due to the nesting requirement.

Execution

Inference is conducted in a zig-zag manner: (1) the first result is produced using all neurons in the first stripe (from left to right along the top layers in Figure 7); (2) execution then moves down to the second stripe, and the second result is produced by going through all layers of the second stripe (left to right), which receive, as additional inputs, the neuron activations from the first stripe. Inference proceeds down the stripes until the deadline, at which time the system reports the most recent result, or an ensemble of all completed results.
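The sketch below illustrates width nesting and its zig-zag execution on a small multi-layer perceptron (Python/PyTorch). It is an illustration only: the paper applies width nesting to CNNs, RNNs, and self-attention networks and doubles stripe widths, whereas this sketch uses equal stripe widths and invented class and parameter names.

```python
import torch
import torch.nn as nn

class WidthNestedMLP(nn.Module):
    """Width nesting on a toy MLP: stripe k at layer l consumes the previous-layer
    activations of stripes 0..k only, so running the first stripe alone yields the
    smallest network and each added stripe refines the result without recomputing
    earlier stripes (the upward, type-(3) connections are simply never created)."""

    def __init__(self, in_dim=32, stripe_width=16, num_classes=10, stripes=3, depth=2):
        super().__init__()
        self.stripes, self.depth = stripes, depth
        self.layers = nn.ModuleList()   # self.layers[k][l]: stripe k, layer l
        self.heads = nn.ModuleList()    # one output head per nesting level
        for k in range(stripes):
            cols = nn.ModuleList()
            for l in range(depth):
                fan_in = in_dim if l == 0 else (k + 1) * stripe_width
                cols.append(nn.Linear(fan_in, stripe_width))
            self.layers.append(cols)
            self.heads.append(nn.Linear((k + 1) * stripe_width, num_classes))

    def forward(self, x, out_of_time=None):
        acts = [[None] * self.depth for _ in range(self.stripes)]
        outputs = []
        for k in range(self.stripes):                 # zig-zag: one stripe at a time
            h = x
            for l in range(self.depth):
                if l > 0:                             # reuse earlier stripes' activations
                    h = torch.cat([acts[j][l - 1] for j in range(k + 1)], dim=-1)
                acts[k][l] = torch.relu(self.layers[k][l](h))
            top = torch.cat([acts[j][self.depth - 1] for j in range(k + 1)], dim=-1)
            outputs.append(self.heads[k](top))        # anytime output after stripe k
            if out_of_time is not None and out_of_time():
                break                                 # deadline reached: stop refining
        return outputs

# Example: three increasingly refined predictions for one batch of inputs.
model = WidthNestedMLP()
preds = model(torch.randn(4, 32))    # list of 3 logits tensors, coarse to fine
```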

4.2.2 Depth Nesting

Architecture

Depth nesting is more complicated. A standard cascade (Figure 6) destroys the traditional DNN pipeline structure; intermediate layers serve heterogeneous branches (e.g., a blue block is connected to both a red and a blue block in Figure 6) and those branches might require different representations.


Figure 8: Depth-wise nesting.

We use a fundamentally different approach to nest networks by depth: we interlace layers following the same pipeline structure as the original DNN, as Figure 8 illustrates. In this example, we partition a traditional DNN into odd and even layers, create a shallower network consisting of only the odd-numbered layers (Fig. 8a), and nest it within the full network (Fig. 8b). Figure 8c shows an alternative, but equivalent, view. Gray edges (from layer 2 to 3, and from layer 4 to 5), present in the standard DNN, must be pruned to achieve nesting. Repeatedly nesting via interlacing, the depths of the nested networks naturally follow an incremental sequence of powers of 2. As with width nesting, this offers a series of meaningful latency/energy-accuracy trade-offs.

Assessment

Unlike width nesting, our depth-nesting strategy applies only to DNNs satisfying an additional architectural requirement. Notice, in Figure 8, the presence of additional skip connections between layers, even in the basic shallow DNN. Indeed, within any DNN in the sequence, each layer connects directly to every other layer separated in depth by a power of 2 (not accounting for pruned connections).

Fortunately, this power-of-2 skip-connection design is exactly the SparseNet architecture [102], which is a state-of-the-art variant of the ResNet [33] (or DenseNet [41]) CNNs.

Our width and depth nesting designs can be easily combined in any arbitrary nesting order.

4.3 Training and Implementation

We apply our Anytime DNN design to a variety of DNN models (ResNet, SparseNet, RNN, Self-attention), with implementations using both Pytorch [74] and Tensorflow [1]. Pytorch uses eager execution, which naturally fits nesting computation. Tensorflow conducts graph-based computation, so we use its partial run API [24] to fulfill the same functionality.

Training

We explore two types of training: greedy and joint. The greedy method first trains the first stripe to achieve the highest possible accuracy. It then freezes all network weights inside the first stripe and trains the second network, which contains the first two stripes; it then continues to the third, and so on. This strategy guarantees high accuracy for the narrowest network. The joint training method considers the whole network together, but places different importance on the outputs of the different subnetworks. The flexibility to specify per-output importance could be used to train a custom network to match known operating-environment characteristics.
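A minimal sketch of the joint objective follows (Python/PyTorch); the uniform default weighting and helper names are assumptions, since the paper leaves per-output importance configurable.

```python
import torch
import torch.nn.functional as F

def joint_anytime_loss(outputs, target, weights=None):
    """Weighted sum of the losses of all nested subnetwork outputs.

    `outputs` is the list of logits returned by an anytime model (e.g., the
    WidthNestedMLP sketch above); `weights` sets the per-output importance.
    """
    if weights is None:
        weights = [1.0] * len(outputs)
    return sum(w * F.cross_entropy(o, target) for w, o in zip(weights, outputs))

# Greedy training instead optimizes one nesting level at a time while freezing
# earlier stripes, e.g.:
#   for p in model.layers[0].parameters():
#       p.requires_grad = False
```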

Infrastructure-induced overheads

In theory, Anytime DNN inference should take no more time than a similar traditional DNN that outputs a single result, as the Anytime DNN prunes some connections. Unfortunately, existing PyTorch and TensorFlow infrastructures are highly optimized for computing one DNN with one result, not nested DNNs. Thus, we observe slowdowns of our Anytime DNNs across different infrastructures and DNN models, ranging from negligible to, in rare cases, more than 50%. These slowdowns are not inherent to the Anytime DNN design and should be mitigated with future APIs. As Section 5 shows, even with these infrastructure-induced slowdowns, Anytime DNNs have significant benefits.

Limitations

Of course, an Anytime DNN is not completely a free lunch. Compared with a traditional DNN, we expect a slight accuracy drop at certain nesting levels because (1) some connections are pruned, and (2) the training process cannot optimize the accuracy of every single nested DNN. Fortunately, as we will see in the evaluation (Section 5.2), the accuracy drop is small, e.g., only 0.3 percent for the deepest Anytime Sparse ResNet with joint training.

5 Experiments

Run-time environment settings
  Default | Inference task has no co-running process
  CPU     | Co-located with Bodytrack from PARSEC-3.0 [8]
  Memory  | Co-located with the memory-hungry STREAM benchmark [67]
Ranges of constraint settings
  Latency  | 0.4x–2x the mean latency* of the largest Anytime DNN
  Accuracy | Whole range achievable by traditional and Anytime DNNs
  Energy   | Whole feasible power-cap range on the machine
Task             | Trad. DNN model | Nesting | Fixed deadline?
Image Classific. | Sparse ResNet50 | Depth   | Yes
Sentence Pred.   | RNN             | Width   | No
Scheme ID     | DNN model selection         | Power selection
Oracle        | Dynamic optimal             | Dynamic optimal
Oracle_Static | Static optimal              | Static optimal
ALERT         | ALERT default               | ALERT default
ALERT_Trad    | ALERT w/o Anytime DNNs      | ALERT default
ALERT_DNN     | ALERT default               | System default
ALERT_Power   | The fastest traditional DNN | ALERT default
Table 3: Settings, tasks, and schemes under evaluation (* measured under the default setting without resource contention)
Platform | DNN | Env. | Energy in Minimize-Energy Task (ALERT, ALERT_Trad, ALERT_DNN, ALERT_Power, Oracle) | Error Rate in Maximize-Accuracy Task (ALERT, ALERT_Trad, ALERT_DNN, ALERT_Power, Oracle)
Razer Sparse Resnet Default 0.98 1.33 0.93 0.91 0.93 0.95 1.25 0.89
CPU 0.95 1.92 0.93 0.38 0.41 0.55 0.51 0.36
Memory 0.93 1.82 0.92 0.34 0.35 0.42 0.46 0.33
RNN Default 0.61 1.06 0.61 0.87 0.90 0.89 1.02 0.86
CPU 0.60 1.28 0.60 0.42 0.45 0.51 0.49 0.42
Memory 0.54 1.06 0.54 0.45 0.47 0.50 0.51 0.44
Skylake Sparse Resnet Default 0.97 0.97 0.91 0.68 0.69 0.70 0.93 0.66
CPU 0.58 0.91 0.57 0.58 0.59 0.69 0.79 0.55
Memory 0.50 0.78 0.48 0.23 0.32 0.33 0.37 0.21
RNN Default 0.87 0.85 0.83 0.84 0.87 0.85 0.99 0.84
CPU 0.60 0.89 0.60 0.53 0.54 0.57 0.61 0.52
Memory 0.52 0.77 0.52 0.26 0.35 0.39 0.34 0.25
Harmonic mean 0.67 1.04 0.66 0.45 0.50 0.55 0.58 0.43
Table 4: Average energy consumption and error rate normalized to Oracle_Static (smaller is better). A superscript indicates how many settings violate constraints in more than 10% of cases; we report the average normalized energy only for non-violating settings.
(a) Razer, Image Classification
(b) Razer, Sentence Prediction
(c) Skylake, Image Classification
(d) Skylake, Sentence Prediction
Figure 9: ALERT versus Oracle and Oracle_Static on the minimize-energy task (lower is better; whisker: whole range; circle: mean)
(a) Razer, Image Classification
(b) Razer, Sentence Prediction
(c) Skylake, Image Classification
(d) Skylake, Sentence Prediction
Figure 10: ALERT versus Oracle and Oracle_Static on the maximize-accuracy task (lower is better; whisker: whole range; circle: mean)

5.1 ALERT runtime management

5.1.1 Methodology

We use the Razer and Skylake machines (Table 1), which represent a typical laptop and server. The evaluation covers various constraint settings, 3 different run-time environments, 2 inference tasks, and 2 DNN types, with different ALERT nesting approaches, as shown in Table 3.

We compare ALERT with 5 other schemes, as shown in Table 3. The two Oracle schemes have perfect predictions of latency and energy consumption for every input under every DNN/power setting (i.e., they are impractical). "Dynamic" schemes allow DNN/power settings to change across inputs; "Static" uses one fixed setting across all inputs. "System default" refers to the default power-management scheme on our machines, which is a race-to-idle approach [56]. The three alternative ALERT schemes each use only part of the ALERT algorithm.

5.1.2 Results

Table 4 shows the average energy consumption and error rate (smaller is better) for all combinations of latency and accuracy/energy constraints (35–40 combinations for each task), normalized to Oracle_Static. As we can see, ALERT achieves almost identical results to the Oracle scheme, providing much better optimization than all other schemes (even impractical ones) while satisfying constraints.

Comparison with Oracle and Oracle_Static

The Oracle scheme represents the theoretically optimal result using traditional DNN designs, perfect prediction capability, dynamic configuration across inputs, and no management overhead. As shown in Table 4 and visualized in Figures 9 and 10, ALERT achieves 93–99% of Oracle's energy and accuracy optimization while satisfying constraints. Oracle_Static, the baseline in Table 4, represents the best one can achieve by selecting one DNN model and one power setting for all inputs. ALERT greatly outperforms Oracle_Static: it reduces energy consumption by 2–50% (33% in harmonic mean) and reduces the error rate by 9–77% (45% in harmonic mean).

Across all tests, ALERT satisfies the constraints in 99.9% of cases for image classification and 98.5% of cases for sentence prediction. For the latter, due to the large input variability (NLP1 in Figure 2), some input sentences simply take too long to process before the deadline even with the fastest DNN; the Oracle fails on these inputs, too.

Note that ALERT runs on the same machines as the DNN workloads, so all results include ALERT's run-time latency and power overhead.

Comparison with alternative designs

As we can see in Table 4, the three alternative/partial designs of ALERT are all much worse than ALERT, indicating that ALERT’s different components complement each other well.

Among the three, ALERT_DNN is the only one that can continue to satisfy accuracy constraints, benefiting from Anytime DNNs. However, its adoption of system-default power management causes it to save much less energy than ALERT. On the other hand, with only ALERT power management but not ALERT DNN selection, ALERT_Power not only violates a huge number of accuracy constraints, but also cannot save as much energy as ALERT. Finally, ALERT_Trad's optimization results are only slightly worse than ALERT's when it can satisfy constraints. Unfortunately, without Anytime DNNs, it frequently violates accuracy constraints, particularly for the sentence-prediction task (RNN), where the latency deadline changes from one RNN input (i.e., a word) to the next, depending on how much time remains for the whole sentence.

Case study: changing environment

Figure 11 shows the different dynamic behavior of ALERT (blue curve) and ALERT_Trad (orange curve) when the environment changes from Default to Memory-intensive and back.

At the beginning, due to a loose latency constraint, ALERT and ALERT_Trad both select the biggest traditional DNN, which provides the highest accuracy within the energy budget. When the memory contention suddenly starts, this DNN choice leads to a deadline miss and an energy-budget violation (as the idle period disappears), which causes an accuracy dip. Fortunately, both ALERT and ALERT_Trad quickly detect this problem, including not just the misses but also the high variability in the expected latency. ALERT switches to an Anytime network and a lower power cap. This switch is very effective: although the environment is still unstable, the inference accuracy remains high, with slight ups and downs depending on which nested DNN finishes before the deadline (ALERT occasionally goes back to the fastest traditional DNN to catch the deadline). Only able to choose from traditional DNNs, ALERT_Trad conservatively switches to much simpler and hence lower-accuracy traditional DNNs to avoid deadline misses. This switch does eliminate deadline misses under the highly dynamic environment, but many of the conservatively chosen DNNs finish well before the deadline (see the Latency panel), wasting the opportunity to produce more accurate results and causing ALERT_Trad to have lower accuracy than ALERT. When the system quiesces, both schemes quickly shift back to the highest-accuracy traditional DNN.

Overall, these detailed results demonstrate that ALERT makes use of the full potential of the proposed DNN and resource management co-design process.

Figure 11: Maximize accuracy for image classification on Razer. ALERT in blue; ALERT_Trad in orange; constraints in red. Memory contention occurs from about input 46 to 119. Deadline: 1.25x the mean latency of the largest Anytime DNN in the Default environment; power limit: 35W.

5.2 Anytime-DNN design

5.2.1 Methodology

We evaluate three inference tasks: image classification with the CIFAR-10 dataset [57],⁵ NLP sentence prediction with Penn Treebank [66], and a recommendation system with the MovieLens 1M dataset [29].

⁵We do not use ImageNet here, as it took too long to train all the different models described below. CIFAR is a good indicator for ImageNet [104], and some popular vision networks [33, 41, 98] also use CIFAR-10.

ALERT's width-nesting approach can apply to all kinds of DNNs. For image classification, we apply width nesting to ResNet50 [33] and compare our nesting design with a corresponding family of 6 ResNet50 networks [33] whose width doubles from one member to the next. For sentence prediction and the recommendation system, width nesting is applied to an RNN [69] and a self-attention network [53], and compared with a family of 5 RNN networks and a family of 5 self-attention networks, respectively. ALERT's depth nesting is applied to Sparse ResNet50 [102] for image classification and compared with a family of 5 Sparse ResNet50 networks whose depth doubles from one member to the next. The number of networks in a family is determined by whether they offer meaningful accuracy-latency tradeoffs.

Figure 12: Accuracy-latency trade-offs (lower is better; all results from the Xeon machine in Table 1). (a) Image classification: ResNet50; (b) Image classification: Sparse ResNet50; (c) Sentence prediction: RNN; (d) Recommendation: Self-Attention.

5.2.2 Results

Figure 12 shows the average accuracy and latency tradeoff points offered by different members of (1) an ALERT Anytime DNN, (2) an ensemble DNN (Figure 5), and (3) a family of traditional DNNs. We name the last scheme Oracle in the figure, because its trade-off points are impractical to achieve in a realistic latency-constrained setting: they require perfect knowledge of which member DNN can finish execution before the deadline, and any mis-prediction drops the average accuracy greatly.

As we can see, ALERT Anytime DNNs, with both depth nesting and width nesting, offer accuracy-latency tradeoffs that are much better than the Ensemble and very close to the infeasible Oracle, confirming that our nesting-based Anytime DNN design gives up little accuracy for a large gain in flexibility.

6 Related work

Anytime DNN Design

Anytime algorithms apply to many learning techniques [103]. For example, anytime decision trees use ensembles or boosting [26, 93, 96, 97]. Constructing anytime DNNs requires the more involved process of integrating anytime capability into the DNN during training.

Adaptive inference skips parts of a DNN on an input-dependent basis to save time [21, 68, 91, 95]. This differs from anytime inference: input difficulty, rather than a deadline, determines which DNN components are executed, so deadline violations remain likely for difficult inputs.

As discussed earlier, cascade methods for anytime DNNs lack generality and efficiency under tight deadlines [40, 58, 89]. Wang et al. develop anytime prediction specifically for stereo depth estimation to output images with increasing resolution, an approach that is tied to image output and does not apply to other tasks [94].

Dynamic Resource Management

Past resource management systems have used machine learning [6, 60, 75, 76, 86] or control theory [51, 52, 70, 38, 81, 101, 43] to make dynamic decisions and adapt to changing environments or application needs. Several of these also make use of the Kalman filter because it has optimal error properties [43, 51, 52, 70]. There are two major differences between these prior approaches and ALERT. First, prior approaches use the Kalman filter to estimate physical quantities such as CPU utilization [52] or job latency [43], while ALERT estimates a virtual quantity that is then used to update a large number of latency estimates. Second, while standard deviation is naturally computed as part of the filter, ALERT actually uses it, in addition to the mean, to help produce estimates that better account for environment variability.

Past work designed resource managers explicitly to coordinate approximate applications with system resource usage [38, 37, 54, 20]. Although related, these managers work with independently developed approximate applications and manage them separately from system resources, which is fundamentally different from that in ALERT, where the application and the manager are co-designed. Specifically, when an environmental change occurs, prior approaches first adjust the application and then the system serially (or vice versa) so that the change’s effects on each can be established independently [37, 38]. In practice this design forces the manager to model each combination of application and system setting independently, and delays response to changes. In contrast, ALERT’s global slowdown factor allows it to easily model and update prediction about all application and system configurations simultaneously, leading to very fast response times, like the single input delay demonstrated in Figure 11.

In addition, anytime DNNs differ from traditional approximation methods, which configure the approximation level before execution. No configuration is needed for anytime DNNs; the approximation level is naturally determined when the time budget expires, i.e., the model moves from one tradeoff point to another naturally during execution. Consequently, anytime inference allows ALERT to meet both accuracy and timing constraints while aggressively reducing energy, whereas prior work had to either forgo accuracy guarantees—e.g., [20]—or sacrifice optimality to achieve them—e.g., [37].

DNN Support

Much work accelerates DNNs and (largely) retains the base accuracy through ASICs, FPGAs, micro-architectures and ISA extensions [12, 14, 31, 19, 62, 79, 4, 13, 27, 50, 34, 22, 65, 82, 73, 90, 44], compilers [11, 72], or system support [32, 61]. These approaches are orthogonal to ALERT. ALERT has the complementary objective of managing the tradeoff spaces induced by a combination of user goals and a family of DNN designs. All the above techniques shift the range of the tradeoffs, but the tradeoffs still exist, and we expect ALERT to handle them just as well in this new context. Other work improves DNN latency or energy by changing the required computation through low-precision representation and DNN pruning [92, 28, 49, 30, 88, 45, 84], like traditional approximation methods. These techniques are also orthogonal to ALERT: they add tradeoff points, but do not provide policies for meeting user needs or for navigating tradeoffs dynamically.

Some research supports hard real-time guarantees for DNNs [100]. This work occupies a very different point in the design space, providing 100% timing guarantees while assuming that the DNN model gives the desired accuracy, the environment is completely predictable, and energy consumption is not a concern. ALERT provides slightly weaker timing guarantees, but manages accuracy and power goals as well. Furthermore, ALERT provides more flexibility to adapt to unpredictable environments. Hard real-time systems would fail in the co-located scenario unless they explicitly account for the impact of all possible co-located applications at design time.

7 Conclusion

This paper demonstrates the challenges behind the important problem of ensuring timely, accurate, and energy efficient neural network inference on a variety of platforms with dynamic input, contention, and requirement variation. ALERT, our runtime manager, achieves these goals through dynamic DNN model selection, dynamic power management, and feedback control. Our general approach to creating anytime inference, width-nesting and depth-nesting, is particularly friendly for latency critical environments and makes the best use of ALERT.

References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016) TensorFlow: a system for large-scale machine learning. In OSDI, Cited by: §4.3.
  • [2] B. AI (2018) (Website). Cited by: §3.2.1.
  • [3] S. Akhlaghi, N. Zhou, and Z. Huang (2017) Adaptive adjustment of noise covariance in kalman filter for dynamic state estimation. In 2017 IEEE Power Energy Society General Meeting, Cited by: §3.2.2.
  • [4] J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. E. Jerger, and A. Moshovos (2016) Cnvlutin: ineffectual-neuron-free deep neural network computing. In ISCA, pp. 1–13. Cited by: §6.
  • [5] M. Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, M. S. Nasrin, M. Hasan, B. C. Van Essen, A. A. Awwal, and V. K. Asari (2019) A state-of-the-art survey on deep learning theory and architectures. Electronics. Cited by: §1.1.
  • [6] J. Ansel, M. Pacula, Y. L. Wong, C. Chan, M. Olszewski, U. O’Reilly, and S. Amarasinghe (2012) Siblingrivalry: online autotuning through local competitions. In CASES, Cited by: §6.
  • [7] W. Baek and T. M. Chilimbi (2010) Green: a framework for supporting energy-conscious programming using controlled approximation. In ACM Sigplan Notices, Cited by: §1.1.
  • [8] C. Bienia, S. Kumar, J. P. Singh, and K. Li (2008-10) The parsec benchmark suite: characterization and architectural implications. In PACT, Cited by: Table 3.
  • [9] G. C. Buttazzo, G. Lipari, L. Abeni, and M. Caccamo (2006) Soft real-time systems: predictability vs. efficiency. Springer. Cited by: §3.2.5.
  • [10] A. Carroll and G. Heiser (2013) Mobile multicores: use them or waste them. In HotPower, Cited by: §3.1.
  • [11] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al. (2018) TVM: an automated end-to-end optimizing compiler for deep learning. In OSDI, pp. 578–594. Cited by: §6.
  • [12] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam (2014) Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. SIGPLAN Not., pp. 269–284. Cited by: §1.1, §6.
  • [13] Y. Chen, T. Krishna, J. S. Emer, and V. Sze (2016) Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. JSSC. Cited by: §1.1, §1.1, §6.
  • [14] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al. (2014) Dadiannao: a machine-learning supercomputer. In MICRO 47, pp. 609–622. Cited by: §6.
  • [15] H. David, E. Gorbatov, U. R. Hanebutte, R. Khanna, and C. Le (2010) RAPL: memory power estimation and capping. In ISLPED, Cited by: §3.3, §3.
  • [16] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §2.2.
  • [17] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. External Links: arXiv:1810.04805 Cited by: §1.1, Table 2, §4.1.
  • [18] T. G. Dietterich (2000) Ensemble methods in machine learning. In MCS, pp. 1–15. Cited by: §4.1.
  • [19] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam (2015) ShiDianNao: shifting vision processing closer to the sensor. In ISCA, pp. 92–104. Cited by: §1.1, §6.
  • [20] A. Farrell and H. Hoffmann (2016) MEANTIME: achieving both minimal energy and timeliness with approximate computing. In USENIX ATC, Cited by: §6, §6.
  • [21] M. Figurnov, M. D. Collins, Y. Zhu, L. Zhang, J. Huang, D. P. Vetrov, and R. Salakhutdinov (2017) Spatially adaptive computation time for residual networks.. In CVPR, pp. 7. Cited by: §6.
  • [22] M. Gao, C. Delimitrou, D. Niu, K. T. Malladi, H. Zheng, B. Brennan, and C. Kozyrakis (2016) DRAF: a low-power dram-based reconfigurable acceleration fabric. ISCA, pp. 506–518. Cited by: §6.
  • [23] G. Gobieski, B. Lucia, and N. Beckmann (2019) Intelligence beyond the edge: inference on intermittent embedded systems. In ASPLOS, Cited by: §1.1.
  • [24] Google (2019)(Website) External Links: Link Cited by: §4.3.
  • [25] K. Greff, R. K. Srivastava, and J. Schmidhuber (2017) Highway and residual networks learn unrolled iterative estimation. ICLR. Cited by: §4.1.
  • [26] A. Grubb and D. Bagnell (2012) Speedboost: anytime prediction with uniform near-optimality. In AISTATS, pp. 458–466. Cited by: §6.
  • [27] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally (2016) EIE: efficient inference engine on compressed deep neural network. In ISCA, pp. 243–254. Cited by: §1.1, §1.1, §6.
  • [28] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. External Links: arXiv:1510.00149 Cited by: §6.
  • [29] F. M. Harper and J. A. Konstan (2015) The movielens datasets: history and context. ACM TIIS. Cited by: §5.2.1.
  • [30] S. Hashemi, N. Anthony, H. Tann, R. I. Bahar, and S. Reda (2017) Understanding the impact of precision quantization on the accuracy and energy of neural networks. In DATE, pp. 1474–1479. Cited by: §6.
  • [31] J. Hauswald, Y. Kang, M. A. Laurenzano, Q. Chen, C. Li, T. Mudge, R. G. Dreslinski, J. Mars, and L. Tang (2015) DjiNN and tonic: dnn as a service and its implications for future warehouse scale computers. In ISCA, pp. 27–40. Cited by: §6.
  • [32] J. Hauswald, M. A. Laurenzano, Y. Zhang, C. Li, A. Rovinski, A. Khurana, R. G. Dreslinski, T. Mudge, V. Petrucci, L. Tang, et al. (2015) Sirius: an open end-to-end voice and vision personal assistant and its implications for future warehouse scale computers. In ASPLOS, pp. 223–238. Cited by: §6.
  • [33] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §1.1, Table 2, §4.1, §4.2.2, §5.2.1, footnote 5.
  • [34] P. Hill, A. Jain, M. Hill, B. Zamirai, C. Hsu, M. A. Laurenzano, S. Mahlke, L. Tang, and J. Mars (2017) Deftnn: addressing bottlenecks for dnn execution on gpus via synapse vector elimination and near-compute data fission. In MICRO, pp. 786–799. Cited by: §1.1, §6.
  • [35] H. Hoffmann and M. Maggio (2014) PCP: A generalized approach to optimizing performance under power constraints through resource management. In ICAC, pp. 241–247. Cited by: §3.3, §3.
  • [36] H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard (2011) Dynamic knobs for responsive power-aware computing. In ACM SIGARCH Computer Architecture News, Cited by: §1.1.
  • [37] H. Hoffmann (2014) CoAdapt: predictable behavior for accuracy-aware applications running on power-aware systems. In ECRTS, pp. 223–232. Cited by: §6, §6.
  • [38] H. Hoffmann (2015) JouleGuard: energy guarantees for approximate applications. In SOSP, Cited by: §6, §6.
  • [39] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. External Links: arXiv:1704.04861 Cited by: §1.1.
  • [40] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger (2017) Multi-scale dense convolutional networks for efficient prediction. Cited by: §4.1, §6.
  • [41] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, Cited by: §1.1, §4.2.2, footnote 5.
  • [42] C. Imes and H. Hoffmann (2016) Bard: a unified framework for managing soft timing and power constraints. In SAMOS, pp. 31–38. Cited by: §3.3.
  • [43] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann (2015-04) POET: a portable approach to minimizing energy under soft real-time constraints. In 21st IEEE Real-Time and Embedded Technology and Applications Symposium, pp. 75–86. External Links: Document, ISSN 1545-3421 Cited by: §6.
  • [44] A. Jain, M. A. Laurenzano, G. A. Pokam, J. Mars, and L. Tang (2018) Architectural support for convolutional neural networks on modern cpus. In PACT, Cited by: §6.
  • [45] S. Jain, S. Venkataramani, V. Srinivasan, J. Choi, P. Chuang, and L. Chang (2018) Compensated-dnn: energy efficient low-precision deep neural networks by compensating quantization errors. In DAC, pp. 1–6. Cited by: §6.
  • [46] S. Jastrzebski, D. Arpit, N. Ballas, V. Verma, T. Che, and Y. Bengio (2018) Residual connections encourage iterative inference. ICLR. Cited by: §4.1.
  • [47] J. Jiang, G. Ananthanarayanan, P. Bodik, S. Sen, and I. Stoica (2018) Chameleon: scalable adaptation of video analytics. In ACM SIGCOMM, pp. 253–266. Cited by: §2.1.
  • [48] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon (2017) In-datacenter performance analysis of a tensor processing unit. In ISCA, Cited by: §1.1, §1.1.
  • [49] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, N. E. Jerger, and A. Moshovos (2016) Proteus: exploiting numerical precision variability in deep neural networks. In ICS, pp. 23. Cited by: §6.
  • [50] P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, and A. Moshovos (2016) Stripes: bit-serial deep neural network computing. In MICRO, pp. 1–12. Cited by: §6.
  • [51] E. Kalyvianaki, T. Charalambous, and S. Hand (2009) Self-adaptive and self-configured cpu resource provisioning for virtualized servers using kalman filters. In ICAC, Cited by: §6.
  • [52] E. Kalyvianaki, T. Charalambous, and S. Hand (2014) Adaptive resource provisioning for virtualized servers using kalman filters. TAAS. Cited by: §6.
  • [53] W. Kang and J. McAuley (2018) Self-attentive sequential recommendation. In ICDM, Cited by: §5.2.1.
  • [54] A. Kansal, S. Saponas, A. Brush, K. S. McKinley, T. Mytkowicz, and R. Ziola (2013) The latency, accuracy, and battery (lab) abstraction: programmer productivity and energy efficiency for continuous mobile context sensing. In ACM SIGPLAN Notices, Cited by: §6.
  • [55] S. Kato, S. Tokunaga, Y. Maruyama, S. Maeda, M. Hirabayashi, Y. Kitsukawa, A. Monrroy, T. Ando, Y. Fujii, and T. Azumi (2018) Autoware on board: enabling autonomous vehicles with embedded systems. In ICCPS, pp. 287–296. Cited by: §3.2.1.
  • [56] D. H. K. Kim, C. Imes, and H. Hoffmann (2015) Racing and pacing to idle: theoretical and empirical analysis of energy optimization heuristics. In ICCPS, Cited by: §3.1, §5.1.1.
  • [57] A. Krizhevsky and G. Hinton (2012) Learning multiple layers of features from tiny images. Technical report University of Toronto. Cited by: §5.2.1.
  • [58] G. Larsson, M. Maire, and G. Shakhnarovich (2016) Fractalnet: ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648. External Links: arXiv:1605.07648 Cited by: §6.
  • [59] E. Le Sueur and G. Heiser (2011-06) Slow down or sleep, that is the question. In USENIX ATC, Cited by: §3.1.
  • [60] B. C. Lee and D. Brooks (2008) Efficiency trends and limits from comprehensive microarchitectural adaptivity. ACM SIGARCH computer architecture news. Cited by: §6.
  • [61] S. Lin, Y. Zhang, C. Hsu, M. Skach, M. E. Haque, L. Tang, and J. Mars (2018) The architectural implications of autonomous driving: constraints and acceleration. In ASPLOS, pp. 751–766. Cited by: §1.1, §1.1, §6.
  • [62] D. Liu, T. Chen, S. Liu, J. Zhou, S. Zhou, O. Teman, X. Feng, X. Zhou, and Y. Chen (2015) Pudiannao: a polyvalent machine learning accelerator. In ISCA, pp. 369–381. Cited by: §6.
  • [63] J. S. Liu and R. Chen (1998) Sequential monte carlo methods for dynamic systems. Journal of the American statistical association. Cited by: §3.1, §3.2.2.
  • [64] A. LS (2010)(Website) External Links: Link Cited by: §1.1.
  • [65] D. Mahajan, J. Park, E. Amaro, H. Sharma, A. Yazdanbakhsh, J. K. Kim, and H. Esmaeilzadeh (2016) Tabla: a unified template-based framework for accelerating statistical machine learning. In HPCA, pp. 14–26. Cited by: §6.
  • [66] M. P. Marcus, B. Santorini, M. A. Marcinkiewicz, and A. Taylor (1999)(Website) External Links: Link Cited by: Table 2, §5.2.1.
  • [67] J. D. McCalpin (1995) Memory bandwidth and machine balance in current high performance computers. TCCA. Cited by: §2.1, Table 3.
  • [68] M. McGill and P. Perona (2017) Deciding how to decide: dynamic routing in artificial neural networks. arXiv preprint arXiv:1703.06217. External Links: arXiv:1703.06217 Cited by: §4.1, §6.
  • [69] T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In ISCA, Cited by: §5.2.1.
  • [70] N. Mishra, C. Imes, J. D. Lafferty, and H. Hoffmann (2018) CALOREE: learning control for predictable latency and low energy. In ASPLOS, Cited by: §3.1, §6.
  • [71] J. Nielsen (1994) Usability engineering. Elsevier. Cited by: §2.1.
  • [72] NVIDIA (2018)(Website) External Links: Link Cited by: §6.
  • [73] K. Ovtcharov, O. Ruwase, J. Kim, J. Fowers, K. Strauss, and E. S. Chung (2015) Accelerating deep convolutional neural networks using specialized hardware. Microsoft Research Whitepaper. Cited by: §6.
  • [74] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §4.3.
  • [75] P. Petrica, A. M. Izraelevitz, D. H. Albonesi, and C. A. Shoemaker (2013) Flicker: a dynamically adaptive architecture for power limited multicore systems. In ACM SIGARCH computer architecture news, Cited by: §6.
  • [76] D. Ponomarev, G. Kucuk, and K. Ghose (2001) Reducing power requirements of instruction scheduling through dynamic allocation of multiple datapath resources. In MICRO, Cited by: §6.
  • [77] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250. External Links: arXiv:1606.05250 Cited by: Table 2.
  • [78] S. Reda, R. Cochran, and A. K. Coskun (2012) Adaptive power capping for servers with multithreaded workloads. IEEE Micro. Cited by: §3.3, §3.
  • [79] M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler (2016) VDNN: virtualized deep neural networks for scalable, memory-efficient neural network design. In MICRO, pp. 18. Cited by: §6.
  • [80] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman (2011) EnerJ: approximate data types for safe and general low-power computation. In ACM SIGPLAN Notices, Cited by: §1.1.
  • [81] M. H. Santriaji and H. Hoffmann (2016) GRAPE: minimizing energy for gpu applications with performance requirements. In MICRO, Cited by: §6.
  • [82] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh (2016) From high-level deep neural models to fpgas. In MICRO, pp. 17. Cited by: §6.
  • [83] N. Silberman and Guadarrama. S (2016)(Website) External Links: Link Cited by: §2.2.
  • [84] H. Sim, S. Kenzhegulov, and J. Lee (2018) DPS: dynamic precision scaling for stochastic computing-based deep neural networks. In DAC, pp. 13. Cited by: §6.
  • [85] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. Cited by: Table 2, §4.1.
  • [86] S. Sridharan, G. Gupta, and G. S. Sohi (2013) Holistic run-time parallelism management for time and energy efficiency. In ICS, Cited by: §6.
  • [87] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. CVPR. Cited by: §4.1.
  • [88] H. Tann, S. Hashemi, R. I. Bahar, and S. Reda (2017) Hardware-software codesign of accurate, multiplier-free deep neural networks. In DAC, Cited by: §6.
  • [89] S. Teerapittayanon, B. McDanel, and H.T. Kung (2016) BranchyNet: fast inference via early exiting from deep neural networks. In CVPR, Cited by: §4.1, §6.
  • [90] V. Vanhoucke, A. Senior, and M. Z. Mao (2011) Improving the speed of neural networks on cpus. In Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, pp. 4. Cited by: §6.
  • [91] A. Veit and S. Belongie (2018) Convolutional networks with adaptive inference graphs. In ECCV, Cited by: §6.
  • [92] S. Venkataramani, A. Ranjan, K. Roy, and A. Raghunathan (2014) AxNN: energy-efficient neuromorphic systems using approximate computing. In ISLPED, Cited by: §6.
  • [93] P. Viola and M. J. Jones (2004) Robust real-time face detection. IJCV, pp. 137–154. Cited by: §6.
  • [94] Y. Wang, Z. Lai, G. Huang, B. H. Wang, L. van der Maaten, M. Campbell, and K. Q. Weinberger (2018) Anytime stereo image depth estimation on mobile devices. arXiv preprint arXiv:1810.11408. External Links: arXiv:1810.11408 Cited by: §6.
  • [95] Z. Wu, T. Nagarajan, A. Kumar, S. Rennie, L. S. Davis, K. Grauman, and R. Feris (2018) Blockdrop: dynamic inference paths in residual networks. In CVPR, pp. 8817–8826. Cited by: §6.
  • [96] Z. Xu, M. Kusner, K. Weinberger, and M. Chen (2013) Cost-sensitive tree of classifiers. In ICML, pp. 133–141. Cited by: §6.
  • [97] Z. Xu, K. Weinberger, and O. Chapelle (2012) The greedy miser: learning under test-time budgets. arXiv preprint arXiv:1206.6451. External Links: arXiv:1206.6451 Cited by: §6.
  • [98] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146. Cited by: §4.1, footnote 5.
  • [99] H. Zhang and H. Hoffmann (2016) Maximizing performance under a power cap: a comparison of hardware, software, and hybrid techniques. In ASPLOS, Cited by: §3.3, §3.
  • [100] H. Zhou, S. Bateni, and C. Liu (2018) S3DNN: supervised streaming and scheduling for gpu-accelerated real-time dnn workloads. In RTAS, Cited by: §1.1, §6.
  • [101] Y. Zhou, H. Hoffmann, and D. Wentzlaff (2016) CASH: supporting iaas customers with a sub-core configurable architecture. In ISCA, Cited by: §6.
  • [102] L. Zhu, R. Deng, M. Maire, Z. Deng, G. Mori, and P. Tan (2018) Sparsely aggregated convolutional networks. CoRR. Cited by: §1.1, §4.2.2, §5.2.1.
  • [103] S. Zilberstein (1996) Using anytime algorithms in intelligent systems. AI magazine, pp. 73. Cited by: §6.
  • [104] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR, Cited by: footnote 5.