Data-Driven Offline Optimization For Architecting Hardware Accelerators

10/20/2021
by Aviral Kumar, et al.
Google Research, UC Berkeley

Industry has gradually moved towards application-specific hardware accelerators in order to attain higher efficiency. While such a paradigm shift is already starting to show promising results, designers need to spend considerable manual effort and perform a large number of time-consuming simulations to find accelerators that can accelerate multiple target applications while obeying design constraints. Moreover, such a "simulation-driven" approach must be re-run from scratch every time the set of target applications or design constraints change. An alternative paradigm is to use a "data-driven", offline approach that utilizes logged simulation data to architect hardware accelerators, without needing any form of simulation. Such an approach not only alleviates the need to run time-consuming simulations, but also enables data reuse and applies even when the set of target applications changes. In this paper, we develop such a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME, that enjoys all of these properties. Our approach learns a conservative, robust estimate of the desired cost function, utilizes infeasible points, and optimizes the design against this estimate without any additional simulator queries during optimization. PRIME architects accelerators – tailored towards both single and multiple applications – improving performance upon state-of-the-art simulation-driven methods by about 1.54× and 1.20×, while considerably reducing the required total simulation time by 93%. PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26×.


1 Introduction

The death of Moore’s Law [esmaeilzadeh2011dark] and its spiraling effect on the semiconductor industry have driven the growth of specialized hardware accelerators. These specialized accelerators are tailored to specific applications [yazdanbakhsh2021apollo, reagen2017case, prac_dse:mascots:2019, shi2020learned]. To design specialized accelerators, designers first spend considerable amounts of time developing simulators that closely model the real accelerator performance, and then optimize the accelerator using the simulator. While such simulators can automate accelerator design, this automation requires a large number of simulator queries for each new design, which is costly in both simulation time and compute, and this cost increases with the size of the design space [yazdanbakhsh2021evaluation, shi2020learned, hegdemind]. Moreover, most of the accelerators in the design space are typically infeasible [hegdemind, yazdanbakhsh2021apollo] because of build errors in silicon or compilation/mapping failures. When the target applications change or a new application is added, the complete simulation-driven procedure is generally repeated. To make such approaches efficient and practically viable, designers typically “bake-in” constraints or otherwise narrow the search space, but such constraints can leave out high-performing solutions [dmazerunner, timeloop, marvel].

An alternate approach, proposed in this work, is to devise a data-driven optimization method that only utilizes a database of previously tested accelerator designs, annotated with measured performance metrics, to produce new optimized designs without additional active queries to an explicit silicon or a cycle-accurate simulator. Such a data-driven approach provides three key benefits: (1) it significantly reduces the recurring cost of running large-scale simulation sweeps, (2) it alleviates the need to explicitly bake in domain knowledge or search space pruning, and (3) it enables data re-use by empowering the designer to optimize accelerators for new unseen applications, by virtue of effective generalization. While data-driven approaches have shown promising results in biology [fu2021offline, brookes19a, trabucco2021conservative], using offline optimization methods to design accelerators has been challenging primarily due to the abundance of infeasible design points [yazdanbakhsh2021apollo, hegdemind] (see Figures 3 and 11).

The key contribution of this paper is a data-driven approach, PRIME, to automatically architect high-performing application-specific accelerators by using only previously collected offline data. PRIME learns a robust surrogate model of the task objective function from an existing offline dataset, and finds high-performing application-specific accelerators by optimizing the architectural parameters against this learned surrogate function, as shown in Figure 1. While naïvely learned surrogate functions usually produce poor-performing, out-of-distribution designs that appear quite optimistic under the learned surrogate [kumar2019model, brookes19a, trabucco2021conservative], the robust surrogate in PRIME is explicitly trained to prevent overestimation on “adversarial” designs that would be found during optimization. Furthermore, in contrast to prior works that discard infeasible points [hegdemind, trabucco2021conservative], our proposed method instead incorporates infeasible points when learning the conservative surrogate by treating them as additional negative samples. Additionally, PRIME can be used effectively for multi-model and zero-shot optimization, capabilities that prior works [trabucco2021conservative, hegdemind] do not show. During evaluation, PRIME optimizes the learned surrogate using a discrete optimizer.

Figure 1: Overview of PRIME. We use a one-time collected dataset of prior hardware accelerator designs, including TPU-style [yazdanbakhsh2021evaluation], NVDLA-style [nvdla], and ShiDianNao-style [shidiannao] accelerators, to train a conservative surrogate model, which can then be used to design accelerators that meet desired goals and constraints.

Our results show that PRIME architects hardware accelerators that improve over the best design in the training dataset, on average, by 2.46× (up to 6.7×) when specializing for a single application. In this case, PRIME also improves over the best conventional simulator-driven optimization methods by 1.54× (up to 6.6×). These performance improvements are obtained while reducing the total simulation time to merely 7% and 1% of that of the simulator-driven methods for single-task and multi-task optimization, respectively. More importantly, a contextual version of PRIME can design accelerators that are jointly optimal for a set of nine applications without requiring any additional domain information. In this challenging setting, PRIME improves over simulator-driven methods, which tend to scale poorly as more applications are added, by 1.38×. Finally, we show that the surrogates trained with PRIME on a set of training applications can be readily used to obtain accelerators for unseen target applications, without any retraining on the new application. Even in this zero-shot optimization scenario, PRIME outperforms simulator-based methods that require re-training and active simulation queries by up to 1.67×. In summary, PRIME allows us to effectively address the shortcomings of simulation-driven approaches: it significantly reduces simulation time, enables data reuse, enjoys generalization properties, and does not require domain-specific engineering or search space pruning. To facilitate further research in architecting hardware accelerators, we will also release the dataset used in our experiments, consisting of many accelerator design points.

2 Background on Hardware Accelerators

The goal of specialized hardware accelerators—Google TPUs [jouppi2017datacenter, edgetpu:arxiv:2020], Nvidia GPUs [nvidia], GraphCore [graphcore]—is to improve the performance of specific applications, such as machine learning models. To design such accelerators, architects typically create a parameterized design and sweep over parameters using simulation. In this section, we provide an overview of hardware accelerators, present the design of our template-based accelerator, and explain how an accelerator works.

Figure 2: An industry-level machine learning accelerator [yazdanbakhsh2021evaluation].

Target hardware accelerators. Our primary evaluation uses an industry-grade, highly parameterized, template-based accelerator following prior work [yazdanbakhsh2021evaluation]. This template enables architects to determine the organization of various components, such as compute units, memory cells, memory, etc., by searching over these configurations in a discrete design space. Some ML applications may have large memory requirements (e.g., large language models [brown2020language]), demanding sufficient on-chip memory resources, while others may benefit from more compute blocks. The hardware design workflow directly selects the values of these parameters. In addition to this accelerator, and to further show the generality of our method to other accelerator design problems, we evaluate two distinct dataflow accelerators with different search spaces, namely NVDLA-style [nvdla] and ShiDianNao-style [shidiannao] accelerators from kao2020confuciux (See Section 6 and Appendix D for a detailed discussion; see Table 6 for results).

How does an accelerator work? We briefly explain the computation flow on our template-based accelerators (Figure 2) and refer the readers to Appendix D for details on other accelerators. This template-based accelerator is a 2D array of processing elements (PEs). Each PE is capable of performing matrix multiplications in a single instruction multiple data (SIMD) paradigm [simd]. A controller orchestrates the data transfer (both activations and model parameters) between off-chip DRAM memory and the on-chip buffers, and also reads in and manages the instructions (e.g., convolution, pooling, etc.) for execution. The computation stages on such accelerators start by sending a set of activations to the compute lanes, executing them in a SIMD manner, and either storing the partial computation results or offloading them back into off-chip memory. Compared to prior works [hegdemind, shidiannao, kao2020confuciux], this parameterization is unique—it includes multiple compute lanes per PE and enables a SIMD execution model within each compute lane—and yields a distinct accelerator search space accompanied by an end-to-end simulation framework. Appendix D elaborates on other accelerators evaluated in this work.
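To make this computation flow concrete, the following toy Python sketch mimics how a blocked matrix multiplication could be scheduled over a 2D PE array with SIMD compute lanes. The array and lane sizes, and the scheduling itself, are illustrative assumptions for exposition only, not the parameters or behavior of the actual template-based accelerator.

```python
import numpy as np

PES_X, PES_Y, LANES = 4, 4, 8          # hypothetical PE-array and SIMD-lane sizes

def pe_matmul(acts: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Blocked matmul acts (M x K) @ weights (K x N) over a PES_X x PES_Y PE grid."""
    M, K = acts.shape
    _, N = weights.shape
    out = np.zeros((M, N), dtype=acts.dtype)
    row_tiles = np.array_split(np.arange(M), PES_X)     # output rows owned by PE row i
    col_tiles = np.array_split(np.arange(N), PES_Y)     # output cols owned by PE col j
    for rows in row_tiles:                              # each (rows, cols) pair is one PE's tile
        for cols in col_tiles:
            tile = np.zeros((len(rows), len(cols)), dtype=acts.dtype)
            for k0 in range(0, K, LANES):               # SIMD lanes consume LANES elements per step
                ks = range(k0, min(k0 + LANES, K))
                tile += acts[np.ix_(rows, ks)] @ weights[np.ix_(ks, cols)]
            out[np.ix_(rows, cols)] = tile              # store partial result / offload
    return out

A, W = np.random.randn(32, 64), np.random.randn(64, 16)
assert np.allclose(pe_matmul(A, W), A @ W)              # matches a plain matmul
```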

3 Problem Statement, Training Data and Evaluation Protocol

Our template-based parameterization maps the accelerator, denoted as $x$, to a discrete design space, $x = [x_1, x_2, \ldots, x_K]$, where each $x_i$ is a discrete-valued variable representing one component of the microarchitectural template, as shown in Table 1 (See Appendix D for the description of other accelerator search spaces studied in our work). A design may be infeasible due to various reasons, such as a compilation failure or the limitations of physical implementation, and we denote the set of all such feasibility criteria as $\mathrm{Feasible}(x)$. The feasibility criterion depends on both the target software and the underlying hardware, and it is not easy to identify whether a given $x$ is infeasible without explicit simulation. We will require our optimization procedure not only to learn the value of the objective function but also to learn to navigate through a sea of infeasible solutions to high-performing feasible solutions satisfying $\mathrm{Feasible}(x) = 1$.

Our training dataset consists of a modest set of accelerators $x_i$ that are randomly sampled from the design space and evaluated by the hardware simulator. We partition the dataset into two subsets, $\mathcal{D}_{\text{feasible}}$ and $\mathcal{D}_{\text{infeasible}}$. Let $f(x)$ denote the desired objective (e.g., latency, power, etc.) we intend to optimize over the space of accelerators $x$. We do not possess functional access to $f(x)$, and the optimizer can only access $f(x)$ values for accelerators $x$ in the feasible partition of the data, $\mathcal{D}_{\text{feasible}}$. For all infeasible accelerators, the simulator does not provide any value of $f(x)$. In addition to satisfying feasibility, the optimizer must handle explicit constraints on parameters such as area and power [flynn2011computer]. In our applications, we impose an explicit area constraint, $\mathrm{Area}(x) \le \alpha_0$, though additional explicit constraints are also possible. To account for different constraints, we formulate this task as a constrained optimization problem. Formally:

$$\min_{x} \; f(x) \quad \text{s.t.} \quad \mathrm{Feasible}(x) = 1, \;\; \mathrm{Area}(x) \le \alpha_0 \qquad (1)$$

While Equation 1 may appear similar to other standard black-box optimization problems, solving it over the space of accelerator designs is challenging due to the large number of infeasible points, the need to handle explicit design constraints, and the difficulty in navigating the non-smooth landscape (See Figure 3 and Figure 10 in the Appendix) of the objective function.

Accelerator Parameter # discrete values Accelerator Parameter # discrete values
# of PEs-X 10 # of PEs-Y 10
PE Memory 7 # of Cores 7
Core Memory 11 # of Compute Lanes 10
Instruction Memory 4 Parameter Memory 5
Activation Memory 7 DRAM Bandwidth 6
Table 1: The accelerator design space parameters for the primary accelerator search space targeted in this work. The maximum possible number of accelerator designs (including feasible and infeasible designs) is 452,760,000. PRIME only uses a small, randomly sampled subset of the search space.
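As a quick sanity check on Table 1, the snippet below (a standalone illustration, not code from the paper) multiplies the per-parameter cardinalities and recovers the 452,760,000 possible configurations quoted in the caption.

```python
from math import prod

# Cardinalities of the discrete accelerator parameters listed in Table 1.
design_space = {
    "# of PEs-X": 10, "# of PEs-Y": 10,
    "PE Memory": 7, "# of Cores": 7,
    "Core Memory": 11, "# of Compute Lanes": 10,
    "Instruction Memory": 4, "Parameter Memory": 5,
    "Activation Memory": 7, "DRAM Bandwidth": 6,
}

total = prod(design_space.values())
print(f"{total:,} candidate accelerator configurations")   # 452,760,000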

What makes optimization over accelerators challenging? Compared to other domains where model-based optimization methods have been applied [brookes19a, trabucco2021conservative], optimizing accelerators introduces a number of practical challenges. First, accelerator design spaces typically feature a narrow manifold of feasible accelerators within a sea of infeasible points [prac_dse:mascots:2019, shi2020learned, gelbart2014bayesian], as visualized in Figure 3 and the Appendix (Figure 11). While some of these infeasible points can be identified via simple rules (e.g., estimating chip area usage), most infeasible points correspond to failures during compilation or hardware simulation. These infeasible points are generally not straightforward to formulate into the optimization problem and require simulation to identify [shi2020learned, timeloop, yazdanbakhsh2021apollo].

Second, the optimization objective can exhibit high sensitivity to small variations in some architecture parameters (Figure 10) in some regions of the design space, but remain relatively insensitive in other parts, resulting in a complex optimization landscape. This suggests that optimization algorithms based on local parameter updates (e.g., gradient ascent, evolutionary schemes, etc.) may struggle to traverse the nearly flat landscape of the objective, which can lead to poor performance.

Figure 3: Top: histogram of infeasible (right orange bar with large score values) and feasible (left cluster of bars) data points; Bottom: zoomed-in histogram focused on feasible points, highlighting the variable latencies.

Training dataset. We construct an offline dataset of (accelerator parameters, latency) pairs via random sampling from the space of 452M possible accelerator configurations. Our method is only provided with a relatively modest set of feasible points (a few thousand) for training, and these points are the worst-performing feasible points across the pool of randomly sampled data. This dataset is meant to reflect an easily obtainable, application-agnostic dataset of accelerators that could have been generated once and stored to disk, or might come from real physical experiments. We emphasize that no assumptions or domain knowledge about the application use case were made during dataset collection. The list of target applications evaluated in this work (Table 2) includes three variants of MobileNet [edgetpu:arxiv:2020, mnv2:arxiv:2018, mnv3:cvpr:2019], three in-house industry-level models for object detection (M4, M5, M6; names redacted to preserve anonymity), a U-Net model [unet], and two RNN-based encoder-decoder language models [trnn01, trnn02, trnn03, trnn04]. These applications span the gamut from small models, such as M6, with only 0.4 MB of model parameters and thus modest on-chip memory demands, to medium-sized models (∼5 MB), such as MobileNetV3 and M4, and large models (≥19 MB), such as the t-RNNs, which require larger on-chip memory.

Name Domain # of XLA Ops (Conv, D/W, FF) Model Param Instr. Size # of Compute Ops.
MobileNetEdge Image Class. (45, 13, 1) 3.87 MB 476,736 1,989,811,168
MobileNetV2 Image Class. (35, 17, 1) 3.31 MB 416,032 609,353,376
MobileNetV3 Image Class. (32, 15, 17) 5.20 MB 1,331,360 449,219,600
M4 Object Det. (32, 13, 2) 6.23 MB 317,600 3,471,920,128
M5 Object Det. (47, 27, 0) 2.16 MB 328,672 939,752,960
M6 Object Det. (53, 33, 2) 0.41 MB 369,952 228,146,848
U-Net Image Seg. (35, 0, 0) 3.69 MB 224,992 13,707,214,848
t-RNN Dec Speech Rec. (0, 0, 19) 19 MB 915,008 40,116,224
t-RNN Enc Speech Rec. (0, 0, 18) 21.62 MB 909,696 45,621,248
Table 2: Description of the applications: their domains, number of (convolution, depth-wise convolution, feed-forward) XLA ops, model parameter size, instruction size in bytes, and number of compute operations.

Evaluation protocol. To compare state-of-the-art simulator-driven methods and our data-driven method, we limit the number of feasible points (which are costly to evaluate) that can be used by any algorithm to equal amounts. We still provide infeasible points to every method and leave it up to the optimization method to use them or not. This ensures our comparisons are fair in terms of the amount of data available to each method. It is worth noting, however, that the simulator-driven methods have an inherent advantage: in contrast to our method, which uses the worse-quality points from a small offline dataset, they can steer the query process towards points that are more likely to perform better. Following prior work [brookes19a, trabucco2021conservative, trabucco2021designbench], we evaluate each run of a method by first sampling the top design candidates according to the algorithm’s predictions, evaluating all of these under the ground-truth objective function, and recording the performance of the best accelerator design. The final reported result is the median of these ground-truth objective values across five independent runs.
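A minimal sketch of this evaluation protocol is shown below, assuming a learned surrogate `predict_latency`, a ground-truth `simulate` call, and a pool of candidate designs; these names and the default top-k value are placeholders rather than the paper's exact implementation.

```python
import numpy as np

def evaluate_run(candidates, predict_latency, simulate, top_k=256):
    """Rank candidates with the surrogate, simulate only the top-k, return the best."""
    ranked = sorted(candidates, key=predict_latency)       # lowest predicted latency first
    ground_truth = [simulate(x) for x in ranked[:top_k]]   # the only costly queries
    return min(ground_truth)                               # best true latency in the batch

def report(per_run_best_latencies):
    """Final metric: median of the per-run best latencies across independent runs."""
    return float(np.median(per_run_best_latencies))
```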

4 PRIME: Architecting Accelerators via Conservative Surrogates

As shown in Figure 4, our method first learns a conservative surrogate model of the optimization objective using the offline dataset. Then, it optimizes the learned surrogate using a discrete optimizer. The optimization process does not require access to a simulator, nor to real-world experiments beyond the initial dataset, except when evaluating the final top-performing designs (Section 3).

Learning conservative surrogates using logged offline data. Our goal is to utilize a logged dataset of feasible accelerator designs labeled with the desired performance metric (e.g., latency), $\mathcal{D}_{\text{feasible}} = \{(x_i, y_i)\}$, and infeasible designs, $\mathcal{D}_{\text{infeasible}}$, to learn a mapping $f_\theta(x)$ that maps an accelerator configuration $x$ to its corresponding metric $y$. This learned surrogate can then be optimized by the optimizer. While a straightforward approach for learning such a mapping is to train it via supervised regression, by minimizing the mean-squared error $\mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_{\text{feasible}}}[(f_\theta(x_i) - y_i)^2]$, prior work [kumar2019model, kumar2020conservative, trabucco2021conservative] has shown that such predictive models can arbitrarily overestimate the value of an unseen input $x$. This can cause the optimizer to find a solution that performs poorly in the simulator but looks promising under the learned model. We empirically validate this overestimation hypothesis and find that it confounds the optimizer in our problem domain as well (see Figure 12 in the Appendix).

Figure 4: Overview of PRIME, which trains a conservative surrogate $f_\theta(x)$ using Equation 3. Our neural network model for $f_\theta(x)$ utilizes two transformer layers [vaswani2017attention] and a multi-headed architecture that is pooled via a soft-attention layer.

To prevent overestimated values at unseen inputs from confounding the optimizer, we build on COMs [trabucco2021conservative] and train $f_\theta(x)$ with an additional term that explicitly maximizes the function value at unseen $x$ values. Such unseen designs $x^-$, where the learned function $f_\theta(x^-)$ is likely to be overestimated, are “negatively mined” by running a few iterations of an approximate stochastic optimization procedure that aims to maximize $f_\theta$ in the inner loop. This procedure is analogous to adversarial training [goodfellow2014explaining] in supervised learning. Equation 2 formalizes this objective:

$$\theta^* := \arg\min_{\theta} \; \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_{\text{feasible}}}\!\left[\left(f_\theta(x_i) - y_i\right)^2\right] \; - \; \alpha\, \mathbb{E}_{x^- \sim \mathrm{Opt}(f_\theta)}\!\left[f_\theta(x^-)\right] \qquad (2)$$

Here, $\mathrm{Opt}(f_\theta)$ denotes the distribution of negative samples produced by an optimizer that attempts to maximize the current learned model, $f_\theta$. We discuss our choice of $\mathrm{Opt}(\cdot)$ in Appendix C.

Incorporating design constraints via infeasible points. While prior work [trabucco2021conservative] simply optimizes Equation 2 to learn a surrogate, this is not sufficient when optimizing over accelerators, as we also show empirically (Appendix A.1). This is because explicit negative mining does not provide any information about accelerator design constraints. Fortunately, this information is provided by the infeasible points, $\mathcal{D}_{\text{infeasible}}$. The training procedure in Equation 2 provides a simple way to incorporate such infeasible points: we simply treat them as additional negative samples and maximize the prediction at these points. This gives rise to our final objective:

$$\theta^* := \arg\min_{\theta} \; \mathbb{E}_{(x_i, y_i) \sim \mathcal{D}_{\text{feasible}}}\!\left[\left(f_\theta(x_i) - y_i\right)^2\right] \; - \; \alpha\, \mathbb{E}_{x^- \sim \mathrm{Opt}(f_\theta)}\!\left[f_\theta(x^-)\right] \; - \; \beta\, \mathbb{E}_{x' \sim \mathcal{D}_{\text{infeasible}}}\!\left[f_\theta(x')\right] \qquad (3)$$
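The PyTorch-style sketch below illustrates one way the training objective in Equation 3 could be implemented under the reconstruction above: a regression loss on feasible designs plus terms that push predictions up (toward worse predicted latency) on negatively mined and infeasible designs. The model interface, batching, and coefficient values are assumptions, not the paper's code.

```python
import torch

def conservative_loss(f_theta, x_feas, y_feas, x_neg, x_infeas, alpha=0.1, beta=0.1):
    """Sketch of Equation 3: MSE on feasible designs, minus alpha/beta-weighted
    predictions on negative samples and infeasible designs, so that minimizing the
    loss pushes those predictions up (i.e., toward worse predicted latency)."""
    mse = ((f_theta(x_feas).squeeze(-1) - y_feas) ** 2).mean()
    neg_term = f_theta(x_neg).mean()        # x_neg ~ Opt(f_theta), negatively mined
    infeas_term = f_theta(x_infeas).mean()  # logged infeasible accelerator configs
    return mse - alpha * neg_term - beta * infeas_term

# One gradient step (model, optimizer, and batches are assumed to exist):
#   loss = conservative_loss(model, xb, yb, x_neg, x_inf)
#   loss.backward(); opt.step(); opt.zero_grad()
```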

Multi-application optimization and zero-shot generalization. One of the central benefits of a data-driven approach is that it enables learning powerful surrogates that generalize over the space of applications, potentially being effective for new unseen application domains. In our experiments, we evaluate PRIME on designing accelerators for multiple applications, denoted $k = 1, \ldots, K$, jointly, or for a novel unseen application. In this case, we utilize a dataset $\mathcal{D} = \{\mathcal{D}_1, \ldots, \mathcal{D}_K\}$, where each $\mathcal{D}_k$ consists of a set of accelerator designs annotated with the latency value and the feasibility criterion for the given application $k$. While a few overlapping designs appear in several parts of the dataset and are annotated with latency values for more than one application, most designs appear in only one part, so our training procedure does not have access to latency values for more than one application for such designs. This presents a challenging scenario for any data-driven method, which must generalize correctly to unseen combinations of accelerators and applications.

To train a single conservative surrogate for multiple applications, we extend the training procedure in Equation 3 to incorporate context vectors $c_k$ for the various applications, derived from the list of application properties in Table 2. The learned function in this setting, $f_\theta(x, c_k)$, is now conditioned on the context. We train via the objective in Equation 3, but in expectation over all the contexts and their corresponding datasets, i.e., by averaging the loss in Equation 3 over $k = 1, \ldots, K$. Once such a contextual surrogate is learned, we can either optimize the average surrogate across a set of contexts to obtain an accelerator that is, on average, jointly optimal for multiple applications (“multi-model” optimization), or optimize this contextual surrogate for a novel context vector corresponding to an unseen application (“zero-shot” generalization). In the latter case, PRIME is not allowed to train on any data corresponding to this new unseen application. While such zero-shot generalization might appear surprising at first, note that the context vectors are not simply one-hot vectors, but consist of parameters with semantic information, which the surrogate can generalize over.
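A small sketch of how the contextual surrogate could be scored during optimization: the model is conditioned on an application's context vector, and multi-model optimization averages the conditioned predictions over a set of contexts (zero-shot use is the same call with the context of an application never seen in training). The `f_theta(x, c)` interface is an assumption.

```python
import torch

def contextual_score(f_theta, x, contexts):
    """Average predicted latency of design x across application context vectors.

    f_theta(x, c) is the context-conditioned surrogate; `contexts` is a list of
    per-application feature vectors (e.g., derived from the properties in Table 2).
    """
    preds = [f_theta(x, c) for c in contexts]
    return torch.stack(preds).mean()

# Multi-model optimization: minimize contextual_score(f_theta, x, train_contexts).
# Zero-shot optimization:   minimize contextual_score(f_theta, x, [unseen_context]).
```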

Optimizing the learned conservative surrogate. Prior work [yazdanbakhsh2021apollo] has shown that the most effective optimizers for accelerator design are meta-heuristic/evolutionary optimizers. We therefore use the firefly algorithm [yang2010nature, yang2010eagle, liu2013adaptive] to optimize our conservative surrogate. This algorithm maintains a set of optimization candidates (a.k.a. “fireflies”) and jointly updates them towards regions of low objective value, while adjusting their relative distances appropriately to maintain multiple high-performing, yet diverse, solutions. We discuss additional details in Appendix C.1.
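The sketch below shows a simplified discrete adaptation of a firefly-style search over the index-valued design vectors of Table 1; it is a rough illustration of the idea (dimmer candidates move toward brighter ones, with distance-dependent attractiveness and random perturbation), not the exact optimizer used in the paper.

```python
import numpy as np

def firefly_minimize(score, cardinalities, n_fireflies=20, iters=50,
                     beta0=0.9, gamma=0.05, mutate_p=0.05, rng=None):
    """Simplified discrete firefly-style search over index-valued design vectors.

    `score(x)` returns the surrogate's predicted latency (lower is better) and
    `cardinalities[k]` is the number of discrete choices for parameter k (Table 1).
    Dimmer fireflies copy coordinates from brighter ones with a distance-dependent
    probability, plus a small random mutation that keeps the population diverse.
    """
    rng = rng or np.random.default_rng(0)
    dims = len(cardinalities)
    cards = np.asarray(cardinalities)
    pop = rng.integers(0, cards, size=(n_fireflies, dims))
    for _ in range(iters):
        scores = np.array([score(x) for x in pop])   # "brightness" = low latency
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if scores[j] < scores[i]:            # firefly j is brighter than i
                    dist = np.count_nonzero(pop[i] != pop[j])       # Hamming distance
                    attract = beta0 * np.exp(-gamma * dist ** 2)
                    copy = rng.random(dims) < attract
                    pop[i][copy] = pop[j][copy]      # move i toward j
            mutate = rng.random(dims) < mutate_p     # random exploration
            pop[i][mutate] = rng.integers(0, cards)[mutate]
    return min(pop, key=score)                       # best design found
```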

Cross-validation: which model and checkpoint should we evaluate? As in supervised learning, models trained via Equation 3 can overfit, leading to poor solutions. Thus, we require a procedure to select which hyperparameters and checkpoints should actually be used for the design. This is crucial, because we cannot arbitrarily evaluate as many models as we want against the simulator. While effective methods for model selection have been hard to develop in offline optimization [trabucco2021conservative, trabucco2021designbench], we devised a simple scheme that uses a validation set for choosing the values of $\alpha$ and $\beta$ (Equation 3), as well as the checkpoint to use for generating the design. For each training run, we hold out the best 20% of the points from the training set and use them only for cross-validation, as follows. Typical cross-validation strategies in supervised learning involve tracking validation error (or risk), but since our model is trained conservatively, its predictions may not match the ground truth, making such validation risk values unsuitable for our use case. Instead, we track Kendall’s ranking correlation between the predictions of the learned model and the ground-truth values of the held-out points for each run (Appendix C). We pick the values of $\alpha$ and $\beta$, and the checkpoint, that attain the highest validation ranking correlation. We present the pseudo-code for PRIME (Algorithm 1) and implementation details in Appendix C.1.
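A sketch of this selection scheme using SciPy's Kendall rank correlation on the held-out split; the checkpoint bookkeeping and `predict` interface are placeholders.

```python
from scipy.stats import kendalltau

def select_checkpoint(checkpoints, x_val, y_val):
    """Pick the checkpoint whose predictions best preserve the ground-truth ranking
    on the held-out designs, measured by Kendall's tau (higher is better)."""
    best_ckpt, best_tau = None, float("-inf")
    for ckpt in checkpoints:                    # e.g., one per (alpha, beta, epoch)
        preds = [ckpt.predict(x) for x in x_val]
        tau, _ = kendalltau(preds, y_val)       # rank correlation, not absolute error
        if tau > best_tau:
            best_ckpt, best_tau = ckpt, tau
    return best_ckpt, best_tau
```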

5 Related Work

Optimizing hardware accelerators has become more important recently. Prior works [bo:frontiers:2020, flexibo:arxiv:2020, cnn_gen:cyber:2020, prac_dse:mascots:2019, accel_gen:dac:2018, spatial:pldi:2018, automomml:hpc:2016, opentuner:pact:2014, hegdemind] mainly rely on expensive-to-query hardware simulators to navigate the search space. For example, HyperMapper [prac_dse:mascots:2019] targets compiler optimization for FPGAs by continuously interacting with the simulator in a design space with relatively few infeasible points. Mind Mappings [hegdemind] optimizes software mappings to a fixed hardware, provided access to millions of feasible points, and throws away infeasible points during learning. kao2020confuciux utilizes reinforcement learning against a simulator to optimize the parameters of a set of simple accelerators. In contrast, our data-driven approach, PRIME, not only learns a conservative surrogate using offline data, but can also effectively leverage information from the large number of infeasible points, and is effective with just a few thousand feasible points. In addition, to the best of our knowledge, our work is the first to demonstrate generalization to unseen applications for accelerator design, outperforming state-of-the-art online methods.

A popular approach for solving black-box optimization problems is model-based optimization (MBO) [snoek15scalable, shahriari2016TakingTH, snoek2012practical]. Most of these methods fail to scale to high dimensions, and have been extended with neural networks [snoek15scalable, snoek2012practical, kim2018attentive, garnelo18neural, garnelo18conditional, angermueller2020population, angermueller2019model, mirhoseini2020chip]. While these methods work well in the active setting, they are susceptible to out-of-distribution inputs [trabucco2021designbench] in the offline, data-driven setting. To prevent this, offline MBO methods that constrain the optimizer to the manifold of valid, in-distribution inputs have been developed [brookes19a, fannjiang2020autofocused, kumar2019model]. However, modeling the manifold of valid inputs can be challenging for accelerators. PRIME dispenses with the need for generative modeling, while still avoiding out-of-distribution inputs. PRIME builds on “conservative” offline RL and offline MBO methods that train robust surrogates [kumar2020conservative, trabucco2021conservative]. However, unlike these approaches, PRIME can handle constraints by learning from infeasible data and utilizes a better optimizer (see Appendix Table 7 for a comparison). In addition, while prior works are mostly restricted to a single application, we show that PRIME is effective in multi-task optimization and zero-shot generalization.

6 Experimental Evaluation

Our evaluations aim to answer the following questions: Q(1) Can PRIME design accelerators tailored to a given application that are better than the best observed configuration in the training dataset, and comparable to or better than state-of-the-art simulation-driven methods under a given simulator-query budget? Q(2) Does PRIME reduce the total simulation time compared to other methods? Q(3) Can PRIME produce hardware accelerators for a family of different applications? Q(4) Can PRIME trained for a family of applications extrapolate to designing a high-performing accelerator for a new, unseen application, thereby enabling effective data reuse? Additionally, we ablate various properties of PRIME (Appendix A.6) and evaluate its efficacy in designing accelerators with distinct dataflow architectures and a larger search space (up to 2.5 possible candidates). We also show that PRIME improves over a human-engineered accelerator in Appendix A.4, and show how PRIME can be reused when the design constraints (such as the area constraint) change in Appendix A.3.

Baselines and comparisons. We compare PRIME against three state-of-the-art online optimization methods that actively query the simulator: (1) evolutionary search with the firefly optimizer [yazdanbakhsh2021apollo] (“Evolutionary”), which prior work [yazdanbakhsh2021apollo] has shown to be the best online method for designing the accelerators we consider; (2) Bayesian optimization (“Bayes Opt”) [vizier:sigkdd:2017], implemented via the Google Vizier framework, a Gaussian-process-based optimizer that is widely used to tune machine learning models at Google and more broadly; and (3) MBO [angermueller2019model], a state-of-the-art online MBO method for designing biological sequences. In all of our experiments, we allow all methods access to an identical number of feasible points. Note, however, that while online methods can actively select which points to query, our offline method PRIME is constrained to utilizing the worst-performing feasible designs. “(Best in Training)” denotes the best latency value in the training dataset used by PRIME. We also present ablation results with different components of our method removed in Appendix A.6, where we observe that utilizing both infeasible points and negative sampling is generally important for attaining good optimization performance. Appendix A.1 presents additional comparisons to COMs [trabucco2021conservative]—which only obtains negative samples via gradient ascent on the learned surrogate and does not utilize infeasible points—and P3BO [p3bo:arxiv:2020]—a state-of-the-art online method. PRIME outperforms both of these prior approaches.

Figure 5: Comparing the total simulation time needed by PRIME and online evolutionary optimization on MobileNetEdge. Note that optimizing with PRIME only requires about 7% of the total simulation time of the online method.

Architecting application-specific accelerators. We first evaluate PRIME in designing specialized accelerators for each of the applications in Table 2. We train a conservative surrogate using the method in Section 4 on the logged dataset for each application separately. The area constraint in Equation 1 is set to a realistic budget for accelerators [yazdanbakhsh2021apollo]. Table 3 summarizes the results. On average, the best accelerators designed by PRIME outperform the best accelerator configuration in the training dataset (last row of Table 3) by 2.46×.

Online Optimization
Application  PRIME (Ours)  Bayes Opt  Evolutionary  MBO  (Best in Training)
MobileNetEdge 298.50 319.00 320.28 332.97 354.13
MobileNetV2 207.43 240.56 238.58 244.98 410.83
MobileNetV3 454.30 534.15 501.27 535.34 938.41
M4 370.45 396.36 383.58 405.60 779.98
M5 208.21 201.59 198.86 219.53 449.38
M6 131.46 121.83 120.49 119.56 369.85
U-Net 740.27 872.23 791.64 888.16 1333.18
t-RNN Dec 132.88 771.11 770.93 771.70 890.22
t-RNN Enc 130.67 865.07 865.07 866.28 584.70
Geomean of PRIME’s Improvement  1.0×  1.58×  1.54×  1.61×  2.46×
Table 3: Optimized objective values (i.e., latency in milliseconds) obtained by various methods for the task of learning accelerators specialized to a given application. Lower latency is better. From left to right: our method, online Bayesian optimization (“Bayes Opt”), online evolutionary algorithm (“Evolutionary”), online model-based optimization (“MBO”), and the best design in the training dataset. On average (last row), PRIME improves over the best design in the dataset by 2.46× (up to 6.69× in t-RNN Dec) and outperforms the best online optimization methods by 1.54× (up to 6.62× in t-RNN Enc). The best accelerator configuration for each application is highlighted in bold.

PRIME also outperforms the accelerators designed by the best online method by 1.54× (up to 5.80× and 6.62× in t-RNN Dec and t-RNN Enc, respectively). Moreover, perhaps surprisingly, PRIME generates accelerators that are better than those of the best online optimization method for 7/9 applications, and performs on par in the other two scenarios (on average, only a 6.8% slowdown compared to the best accelerator found with online methods on M5 and M6). These results indicate that offline optimization of accelerators using PRIME can be more data-efficient than online methods with active simulation queries. We also emphasize again that PRIME exhibits this strong performance while learning only from the same number of feasible points as the online methods, and these are the worst feasible points in the training dataset.

To answer Q(2), we compare the total simulation time of PRIME and the best evolutionary approach from Table 3 on the MobileNetEdge domain. Not only does PRIME outperform the best online method, it also considerably reduces the total simulation time, by 93%, as shown in Figure 5. Even the total simulation time up to the first occurrence of the final design eventually returned by the online methods is about 11× what PRIME requires to find a better design. This indicates that the data-driven PRIME is much preferable, both because it attains better performance and because it requires only a tiny fraction of the simulation time of online methods.

Architecting accelerators for multiple applications. To answer Q(3), we evaluate the efficacy of the contextual version of PRIME in designing an accelerator that attains the lowest latency averaged over a set of application domains.

Applications  Area  PRIME (Ours)  Evolutionary (Online)  MBO (Online)
MobileNet (Edge, V2, V3) 29 mm² (310.21, 334.70) (315.72, 325.69) (342.02, 351.92)
MobileNet (V2, V3), M5, M6 29 mm² (268.47, 271.25) (288.67, 288.68) (295.21, 307.09)
MobileNet (Edge, V2, V3), M4, M5, M6 29 mm² (311.39, 313.76) (314.31, 316.65) (321.48, 339.27)
MobileNet (Edge, V2, V3), M4, M5, M6, U-Net, t-RNN Enc 29 mm² (305.47, 310.09) (404.06, 404.59) (404.06, 412.90)
MobileNet (Edge, V2, V3), M4, M5, M6, t-RNN Enc 100 mm² (286.45, 287.98) (404.25, 404.59) (404.06, 404.94)
MobileNet (Edge, V2, V3), M4, M5, M6, t-RNN (Dec, Enc) 29 mm² (426.65, 426.65) (586.55, 586.55) (626.62, 692.61)
MobileNet (Edge, V2, V3), M4, M5, M6, U-Net, t-RNN (Dec, Enc) 100 mm² (383.57, 385.56) (518.58, 519.37) (526.37, 530.99)
Geomean of PRIME’s Improvement  —  (1.0×, 1.0×)  (1.21×, 1.20×)  (1.24×, 1.27×)
Table 4: Optimized average latency (lower is better) across multiple applications (up to nine applications) from diverse domains, obtained by PRIME and the best online algorithms (Evolutionary and MBO) under different area constraints. Each row shows the (best, median) average latency across five runs. The geometric mean of PRIME’s improvement over the other methods (last row) indicates that PRIME is at least 21% better.

As discussed previously, the training data does not label a given accelerator with latency values for every application, and thus PRIME must extrapolate accurately to estimate the latency of an accelerator for a context it is not paired with in the training dataset. This also means that PRIME cannot simply return the accelerator with the best average latency and must run non-trivial optimization to find a design that actually accelerates all the applications. We evaluate our method in seven different multi-application design scenarios (Table 4), comprising various combinations of models from Table 2 under different area constraints, where the smallest set consists of the three MobileNet variants and the largest set consists of nine models spanning image classification, object detection, image segmentation, and speech recognition. This scenario is especially challenging for online methods, since the number of jointly feasible designs is expected to drop significantly as more applications are added. For instance, for the three MobileNet variants, random sampling only finds a few (20-30) accelerator configurations that are jointly feasible and high-performing (Appendix C.2, Figure 8), but for the largest scenario of nine applications, there are very few jointly feasible designs.

Figure 6: Comparing the total simulation time needed by PRIME and online methods on seven models (area constraint of 100 mm²). PRIME only requires about 1%, 6%, and 0.9% of the total simulation time of Evolutionary, MBO, and Bayes Opt, respectively, while outperforming the best online method by 41%.

Table 4 shows that, on average, PRIME finds accelerators that outperform the best online method by 1.2× (up to 41%). While PRIME performs similarly to online methods in the smallest three-model scenario (first row), it outperforms them as the number of applications increases and the set of applications becomes more diverse. In addition, compared with the best jointly feasible design point across the target applications, PRIME finds significantly better accelerators (3.95×). Finally, as the number of models increases, the total simulation time gap between online methods and PRIME widens further (Figure 6). These results indicate that PRIME is effective in designing accelerators jointly optimized across multiple applications while reusing the same dataset as in the single-task case, and scales more favorably than its simulation-driven counterparts. Appendix B expounds on the details of the designed accelerators for nine applications, comparing our method and the best online method.

Accelerating previously unseen applications (“zero-shot” optimization). Finally, we answer Q(4) by demonstrating that our data-driven offline method, PRIME, enables effective data reuse by using logged accelerator data from a set of applications to design an accelerator for an unseen new application, without requiring any training on data from the new application. We train a contextual version of PRIME using a set of “training applications” and then optimize an accelerator using the learned surrogate with different contexts corresponding to “test applications,” without any additional designs evaluated against the test applications. For comparison, we train the evolutionary (online) method on the test application domain for 1000 iterations. We also compare directly to the accelerator found by the evolutionary (online) method when jointly optimizing over the training applications in Appendix A.2, but found that to be worse than running a few iterations of the evolutionary (online) method on the test applications. Table 5 shows that, on average, PRIME outperforms the best online method on the test applications by 1.26× (up to 1.66×), with only a 2% slowdown in one case. Note that the difference in performance increases as the number of training applications increases. These results show the effectiveness of PRIME in the zero-shot setting (more results in Appendix A.5) and highlight the effectiveness of data reuse by our offline approach.

Applying PRIME to other accelerator architectures and dataflows. Finally, to test the generalizability of PRIME to other accelerator architectures [kao2020confuciux], we use PRIME to optimize the latency of two styles of dataflow accelerators—NVDLA-style and ShiDianNao-style—across three applications (Appendix D details the methodology). As shown in Table 6, PRIME outperforms the online evolutionary method by 6% and improves over the best point in the training dataset by 3.75×. This demonstrates that PRIME is effective in optimizing accelerators with different dataflow architectures and can successfully optimize over extremely large hardware search spaces.

Train Applications  Test Applications  Area  PRIME (Ours)  Evolutionary (Online)
MobileNet (Edge, V3) MobileNetV2 29 mm² (311.39, 313.76) (314.31, 316.65)
MobileNet (V2, V3), M5, M6 MobileNetEdge, M4 29 mm² (357.05, 364.92) (354.59, 357.29)
MobileNet (Edge, V2, V3), M4, M5, M6, t-RNN Enc U-Net, t-RNN Dec 29 mm² (745.87, 745.91) (1075.91, 1127.64)
MobileNet (Edge, V2, V3), M4, M5, M6, t-RNN Enc U-Net, t-RNN Dec 100 mm² (517.76, 517.89) (859.76, 861.69)
Geomean of PRIME’s Improvement  (1.0×, 1.0×)  (1.24×, 1.26×)
Table 5: Optimized objective values (i.e., latency in milliseconds) in the zero-shot setting. Lower latency is better. From left to right: the applications used to train the surrogate model in PRIME, the target applications for which the accelerator is optimized, the area constraint of the accelerator, PRIME’s (best, median) latency, and the best online method’s (best, median) latency. PRIME does not use any additional data from the target applications. On average (last row), PRIME yields optimized accelerators for the target applications (with zero queries to the target applications’ data) with 1.26× (up to 1.66×) lower latency than the best online method. The best accelerator configurations are highlighted in bold.
Applications  Dataflow  PRIME (Ours)  Evolutionary (Online)  (Best in Training)
MobileNetV2 NVDLA 2.5110 2.7010 1.3210
MobileNetV2 ShiDianNao 2.6510 2.8410 1.2710
ResNet50 NVDLA 2.8310 3.1310 1.6310
ResNet50 ShiDianNao 3.4410 3.7410 2.0510
Transformer NVDLA 7.810 7.810 1.310
Transformer ShiDianNao 7.810 7.810 1.510
Geomean of PRIME’s Improvement  1.0×  1.06×  3.75×
Table 6: Optimized objective values (i.e., total number of cycles) for two different dataflow architectures, NVDLA-style [nvdla] and ShiDianNao-style [shidiannao], across three classes of applications. The maximum search space for the studied accelerators is 2.510. PRIME generalizes to other classes of accelerators with larger search spaces and outperforms the best online method by 1.06× and the best data seen in training by 3.75× (last column). The best accelerator configurations are highlighted in bold.

7 Discussion

In this work, we present a data-driven offline optimization method, PRIME, to automatically architect hardware accelerators. Our method learns a conservative surrogate of the objective function by leveraging infeasible data points to better model the desired objective function of the accelerator, using a one-time collected dataset of accelerators, thereby alleviating the need for time-consuming simulation. Our results show that, on average, our method outperforms the best designs observed in the logged data by 2.46× and improves over the best simulator-driven approach by about 1.54×. In the more challenging setting of designing accelerators jointly optimal for multiple applications or, zero-shot, for new unseen applications, PRIME outperforms simulator-driven methods by 1.2×, while reducing the total simulation time by 99%.

At a high level, the efficacy of PRIME highlights the potential of utilizing logged offline data in conjunction with strong offline optimization methods in an accelerator design pipeline. While PRIME, in principle, can itself be used in the inner loop of an online method that performs active data collection beyond an initial offline dataset, the strong generalization ability of neural networks trained with good offline methods on offline datasets consisting of low-performing designs can serve as a highly effective ingredient for multiple design problems. Utilizing PRIME for other problems in architecture and systems, such as software-hardware co-optimization, is an interesting avenue for future work.

Acknowledgements

We thank the “Learn to Design Accelerators” team at Google Research and the Google EdgeTPU team for their invaluable feedback and suggestions. In addition, we extend our gratitude to the Vizier team, Christof Angermueller, Sheng-Chun Kao, Samira Khan, Stella Aslibekyan, and Xinyang Geng for their help with experiment setups and insightful comments.

References

Appendix A Additional Experiments

A.1 Comparison to Other Baseline Methods

Comparison to COMs. In this section, we perform a comparative evaluation of PRIME against the COMs method [trabucco2021conservative]. Like several offline reinforcement learning algorithms [kumar2020conservative], both our method, PRIME, and COMs are based on the key idea of learning a conservative surrogate of the desired objective function, such that it does not overestimate the value of unseen data points, which prevents the optimizer from finding accelerators that appear promising under the learned model but are not actually promising under the actual objective. The key differences between our method and COMs are: (1) PRIME uses an evolutionary optimizer (Opt(·)) for negative sampling, compared to the gradient ascent used by COMs, which can be vastly beneficial in discrete design spaces, as our results show empirically; (2) PRIME can explicitly learn from the infeasible data points provided to the algorithm, while COMs has no mechanism to incorporate infeasible points into the learning of the surrogate. To further assess the importance of these differences in practice, we run COMs on three tasks from Table 3 and present a comparison of our method, COMs, and the Standard method in Table 7. The “Standard” method represents a surrogate model trained without utilizing any infeasible points. On average, PRIME outperforms COMs by 1.17× (up to 1.24× on M6).

Application  PRIME (Ours)  COMs  Standard
MobileNetV2 207.43 251.58 374.52
MobileNetV3 454.30 485.66 575.75
M6 131.46 163.94 180.24
Geomean of PRIME’s Improvement  1.0×  1.17×  1.46×
Table 7: Optimized objective values (i.e., latency in milliseconds) obtained by PRIME and COMs [trabucco2021conservative] when optimizing over single applications (MobileNetV2, MobileNetV3, and M6), extending Table 3. Note that PRIME outperforms COMs; however, COMs improves over the baseline “Standard” method (last column).

Comparison to generative offline MBO methods. We also ran a prior offline MBO method, model inversion networks (MINs) [kumar2019model], which trains a generative model of the accelerator on our data. However, we were unable to train a discrete objective-conditioned GAN model to attain 0.5 discriminator accuracy on our offline dataset, and often observed a collapse of the discriminator. As a result, we trained a δ-VAE [razavi2019preventing], conditioned on the objective function (i.e., latency). A standard VAE [kingma2013auto] suffered from posterior collapse, which informed our choice of a δ-VAE. The latent space of the trained objective-conditioned δ-VAE, for accelerators in a held-out validation dataset (not used for training), is visualized in the t-SNE plot in the figure on the right. This is a 2D t-SNE of the accelerator configurations (Table 1). The color of a point denotes the latency value of the corresponding accelerator configuration, partitioned into three bins. Observe that while we would expect these objective-conditioned models to disentangle accelerators with different objective values in the latent space, the models we trained did not exhibit such structure, which hampers optimization. While our method PRIME could also benefit from a generative optimizer (i.e., by using a generative optimizer in place of Opt(·) with a conservative surrogate), we leave the design of effective generative optimizers for accelerators to future work.

Comparison to P3BO. We perform a comparison against P3BO, a state-of-the-art online MBO method from biology [p3bo:arxiv:2020]. On average, PRIME outperforms the P3BO method by 2.5× (up to 8.7× in U-Net) in terms of the latency of the optimized accelerators found. In addition, we present a comparison between the total simulation time of the P3BO and Evolutionary methods in Figure 7. Not only is the total simulation time of P3BO around 3.1× higher than that of the Evolutionary method, but the latency of its final optimized accelerator is also around 18% worse for MobileNetEdge. In contrast, the total simulation time of PRIME for designing an accelerator for MobileNetEdge is lower than that of both methods (only 7% of the Evolutionary method, as shown in Figure 5).

Application  PRIME (Ours)  P3BO
MobileNetEdge 298.50 376.02
M4 370.45 483.39
U-Net 740.27 771.70
t-RNN Dec 132.88 865.12
t-RNN Enc 130.67 1139.48
Geomean of PRIME’s Improvement  1.0×  2.5×
Table 8: Optimized objective values (i.e., latency in milliseconds) obtained by PRIME and P3BO [p3bo:arxiv:2020] when optimizing over single applications (MobileNetEdge, M4, U-Net, t-RNN Dec, and t-RNN Enc). On average, PRIME outperforms P3BO by 2.5×.
Figure 7: Comparing the total simulation time needed by the P3BO and Evolutionary methods on MobileNetEdge. Not only is the total simulation time of P3BO around 3.1× higher than that of the Evolutionary method, but the latency of its final optimized accelerator is also around 18% worse for MobileNetEdge. The total simulation time of our method is around 7% of that of the Evolutionary method (see Figure 5).

A.2 Learned Surrogate Model Reuse for Accelerator Design

Extending our results in Table 4, we present another variant of optimizing accelerators jointly for multiple applications. In this scenario, the learned surrogate model is reused to architect an accelerator for a subset of the applications used for training. We train a contextual conservative surrogate on the variants of MobileNet (Table 2), as discussed in Section 4, but generate optimized designs by optimizing the average surrogate over only two variants of MobileNet (MobileNetEdge and MobileNetV2). This tests the ability of PRIME to provide a general contextual conservative surrogate that can be trained once and optimized multiple times with respect to different subsets of applications. Observe in Table 9 that PRIME architects high-performing accelerator configurations (better than the best point in the dataset by 3.29× – last column) while outperforming the online optimization methods by 7%.

Online Optimization
Applications  PRIME (All)  -Opt(·)  -Infeasible  Standard  Bayes Opt  Evolutionary  (Best in Training)
(MobileNetEdge, MobileNetV2) 253.85 297.36 264.85 341.12 275.21 271.71 834.68
Table 9: Optimized objective values (i.e., latency in milliseconds) obtained by PRIME when reusing the model jointly trained on three variants of MobileNet to design an accelerator for MobileNetEdge and MobileNetV2, under different dataset configurations. PRIME outperforms the best online method by 7% and finds an accelerator that is 3.29× better than the best accelerator in the training dataset (last column). The best accelerator configuration is highlighted in bold.

A.3 Learned Surrogate Model Reuse under Different Design Constraints

We also test the robustness of our approach in handling variable constraints at test time, such as a different chip area budget. We evaluate the learned conservative surrogate trained via PRIME under a reduced value of the area threshold $\alpha_0$ in Equation 1. To do so, we utilize a variant of rejection sampling: we take the learned model trained for the default area constraint and then reject all optimized accelerator configurations that do not satisfy the reduced area constraint. Table 10 summarizes the results for this scenario for the MobileNetEdge [edgetpu:arxiv:2020] application under the new area constraint of 18 mm². A method that produces diverse designs which are both high-performing and spread across diverse values of the area constraint is expected to perform better. As shown in Table 10, PRIME provides a better accelerator than the best online optimization run from scratch with the new constraint value, by 4.4%, even though PRIME does not train its conservative surrogate with this unseen test-time design constraint. Note that when the design constraint changes, online methods generally need to restart the optimization process from scratch and undergo costly queries to the simulator. This would impose additional overhead in terms of total simulation time (see Figure 5 and Figure 6). However, the results in Table 10 show that our learned surrogate model can be reused under different test-time design constraints, eliminating these additional queries to the simulator.

Online Optimization
Applications  PRIME (All)  -Opt(·)  -Infeasible  Standard  Bayes Opt  Evolutionary  (Best in Training)
MobileNetEdge, Area ≤ 18 mm² 315.15 433.81 351.22 470.09 331.05 329.13 354.13
Table 10: Optimized objective values (i.e., latency in milliseconds) obtained by various methods for the task of learning accelerators specialized to MobileNetEdge under a chip area budget of 18 mm², reusing the model already learned by our method for MobileNetEdge (shown in Table 3). Lower latency/runtime is better. From left to right: our method, our method without negative sampling (“-Opt(·)”) and without utilizing infeasible points (“-Infeasible”), the standard surrogate (“Standard”), online Bayesian optimization (“Bayes Opt”), online evolutionary algorithm (“Evolutionary”), and the best design in the training dataset. Note that PRIME improves over the best design in the dataset by 12% and outperforms the best online optimization method by 4.4%. The best accelerator configuration is highlighted in bold.
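A minimal sketch of the rejection-sampling reuse described above: designs proposed by optimizing the already-trained surrogate are simply filtered by the new area budget. The `area_of` helper and candidate list are assumptions.

```python
def filter_by_area(optimized_designs, area_of, max_area_mm2=18.0):
    """Reuse designs optimized under the default constraint by keeping only those
    that also satisfy a tighter test-time area budget (rejection sampling)."""
    return [x for x in optimized_designs if area_of(x) <= max_area_mm2]
```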

A.4 Comparison with Human-Engineered Accelerators

In this section, we compare the optimized accelerator designs found by PRIME to the EdgeTPU design [yazdanbakhsh2021evaluation, edgetpu:arxiv:2020] when targeting single applications. The goal of this comparison is to present the potential benefit of specialization towards single applications using architecture exploration alone. We utilize an area constraint of 27 mm² and a DRAM bandwidth of 25 Gbps, identical to the specifications of the EdgeTPU accelerator. Table 11 summarizes the results in two sections, namely “Latency” and “Chip Area”. The first and second columns under each section show the results for PRIME and EdgeTPU, respectively, and the final column of each section shows the improvement of the design suggested by PRIME over EdgeTPU. On average (as shown in the last row), PRIME finds accelerator designs that are 2.69× (up to 11.84× in t-RNN Enc) better than EdgeTPU in terms of latency. Our method achieves this improvement while, on average, reducing the chip area usage by 1.50× (up to 2.28× in MobileNetV3).

Latency (milliseconds)  Chip Area (mm²)
Application  PRIME  EdgeTPU  Improvement  PRIME  EdgeTPU  Improvement
MobileNetEdge 294.34 523.48 1.78 18.03 27 1.50
MobileNetV2 208.72 408.24 1.96 17.11 27 1.58
MobileNetV3 459.59 831.80 1.81 11.86 27 2.28
M4 370.45 675.53 1.82 19.12 27 1.41
M5 208.42 377.32 1.81 22.84 27 1.18
M6 132.98 234.88 1.77 16.93 27 1.59
U-Net 1465.70 2409.73 1.64 25.27 27 1.07
t-RNN Dec 132.43 1384.44 10.45 14.82 27 1.82
t-RNN Enc 130.45 1545.07 11.84 19.87 27 1.36
Average Improvement  —  —  2.69×  —  —  1.50×
Table 11: Comparison between the accelerator designs suggested by PRIME and EdgeTPU [yazdanbakhsh2021evaluation, edgetpu:arxiv:2020] for single-model specialization. On average (last row), single-model specialization with our method reduces latency by 2.69× while reducing chip area usage by 1.50×.

A.5 Comparison with Online Methods in the Zero-Shot Setting

We evaluated the Evolutionary (online) method under two protocols for the last two rows of Table 5: first, we picked the best designs (the top-performing 256 designs, similar to the PRIME setting in Section 4) found by the evolutionary algorithm on the training set of applications and evaluated them on the target applications; second, we let the evolutionary algorithm continue simulator-driven optimization on the target applications. The latter is unfair, in that the online approach is allowed to query additional designs in the simulator. Nevertheless, we found that in either configuration the evolutionary approach performed worse than PRIME, which does not access any training data from the target application domain. For the area constraints of 29 mm² and 100 mm², the Evolutionary algorithm reduces the latency from 1127.64 to 820.11 and from 861.69 to 552.64, respectively, although this is still worse than PRIME. In the second experiment, in which we unfairly allow the evolutionary algorithm to continue optimizing on the target application, the Evolutionary algorithm suggests worse designs than those in Table 5 (e.g., 29 mm²: 1127.64 to 1181.66 and 100 mm²: 861.69 to 861.66).

A.6 Ablation Study of Various Components of PRIME

Application    PRIME (All)    PRIME-Opt    PRIME-Infeasible    Standard    Bayes Opt (Online)    Evolutionary (Online)    Best in Training
MobileNetEdge 298.50 435.40 322.20 411.12 319.00 320.28 354.13
MobileNetV2 207.43 281.01 214.71 374.52 240.56 238.58 410.83
MobileNetV3 454.30 489.45 483.96 575.75 534.15 501.27 938.41
370.45 478.32 432.78 1139.76 396.36 383.58 779.98
208.21 319.61 246.80 307.57 201.59 198.86 449.38
131.46 197.70 162.12 180.24 121.83 120.49 369.85
U-Net 740.27 740.27 765.59 763.10 872.23 791.64 1333.18
t-RNN Dec 132.88 172.06 135.47 136.20 771.11 770.93 890.22
t-RNN Enc 130.67 134.84 137.28 150.21 865.07 865.07 584.70
Table 12: Optimized objective values (i.e., latency in milliseconds) obtained by various methods for the task of learning accelerators specialized to a given application. Lower latency/runtime is better. From left to right: our method (PRIME), our method without negative sampling ("PRIME-Opt") and without utilizing infeasible points ("PRIME-Infeasible"), standard surrogate ("Standard"), online Bayesian optimization ("Bayes Opt"), online evolutionary algorithm ("Evolutionary"), and the best design in the training dataset. Note that, in all applications, PRIME improves over the best design in the dataset, outperforms the online optimization methods in 7/9 applications, and the complete version of PRIME generally performs best. The best accelerator designs are in bold.

Here we ablate over two components of our method: (1) negative sampling via the optimizer was not used ("PRIME-Opt" in Table 12), and (2) infeasible points were not used ("PRIME-Infeasible" in Table 12). As shown in Table 12, these variants generally perform worse than the full method, in which both negative sampling and infeasible data points are utilized when training the surrogate model.

Appendix B Comparing Optimized Accelerators Found by PRIME and Evolutionary Methods

Latency (ms)
Applications    PRIME    Evolutionary (Online)    Improvement of PRIME over Evolutionary
MobileNetEdge 288.38 319.98 1.10
MobileNetV2 216.27 255.95 1.18
MobileNetV3 487.46 573.57 1.17
400.88 406.28 1.01
248.18 239.18 0.96
164.98 148.83 0.90
U-Net 1268.73 908.86 0.71
t-RNN Dec 191.83 862.14 5.13
t-RNN Enc 185.41 952.44 4.49
Average (Latency in ms)  383.57  518.58  1.35x
Table 13: Per-application latency for the best accelerator designs suggested by PRIME and the Evolutionary method according to Table 4, for multi-task accelerator design (nine applications, area constraint 100 mm²). PRIME outperforms the Evolutionary method by 1.35x.
Accelerator Parameter    PRIME    Evolutionary (Online)
# of PEs-X 4 4
 # of PEs-Y  6  8
# of Cores  64  128
 # of Compute Lanes  4  6
 PE Memory  2,097,152  1,048,576
Core Memory 131,072 131,072
 Instruction Memory  32,768  8,192
Parameter Memory 4,096 4,096
 Activation Memory  512  2,048
DRAM Bandwidth (Gbps) 30 30
 Chip Area (mm²)  46.78  92.05
Table 14: Optimized accelerator configurations (see Table 1) found by PRIME and the Evolutionary method for multi-task accelerator design (nine applications, area constraint 100 mm²). The last row shows the accelerator area in mm². PRIME reduces the overall chip area usage by 1.97x. The parameters that differ between the two accelerator configurations are shaded in gray.

In this section, we overview the best accelerator configurations that PRIME and the Evolutionary method found for multi-application accelerator design (see Table 4), when the number of target applications is nine and the area constraint is set to 100 mm². The average latencies of the best accelerators found by PRIME and the Evolutionary method across the nine target applications are 383.57 ms and 518.58 ms, respectively; in this setting, our method outperforms the best online method by 1.35x. Table 13 shows per-application latencies for the accelerators suggested by our method and by the Evolutionary method, and the last column shows the latency improvement of PRIME over the Evolutionary method. Interestingly, while the accelerator found by our method attains lower latency on six of the nine applications (including MobileNetEdge, MobileNetV2, MobileNetV3, t-RNN Dec, and t-RNN Enc), the accelerator identified by the online method yields lower latency on U-Net and two other applications (see Table 13).

To better understand the trade-offs in the designs produced by our method and the Evolutionary method, we present all the accelerator parameters (see Table 1) in Table 14. The accelerator parameters that differ between the two designs are shaded in gray (namely # of PEs-Y, # of Cores, # of Compute Lanes, PE Memory, Instruction Memory, and Activation Memory). The last row of Table 14 shows the overall chip area usage in mm². PRIME not only outperforms the Evolutionary algorithm in reducing the average latency across the set of target applications, but also reduces the overall chip area usage by 1.97x. Studying the identified accelerator configurations, we observe that PRIME trades off compute (64 cores vs. 128 cores) for a larger PE memory (2,097,152 vs. 1,048,576). These results show that PRIME favors a larger PE memory to accommodate the larger memory requirements of t-RNN Dec and t-RNN Enc (see Table 2, Model Parameters), where large gains lie. Favoring larger on-chip memory, however, comes at the expense of lower compute power in the accelerator. This reduction in compute power leads to higher latency for the models with a large number of compute operations, such as U-Net (see the last row of Table 2). An interesting case is the application for which the model favors both compute power and on-chip memory (6.23 MB of model parameters and 3,471,920,128 compute operations); this is why its latency is comparable on the two accelerators designed by our method and by the Evolutionary method (400.88 ms with PRIME vs. 406.28 ms with the online method).

Appendix C Details of PRIME

In this section, we provide the training details of our method, PRIME, including hyperparameters and compute requirements, as well as details of the different tasks.

C.1 Hyperparameter and Training Details

Algorithm 1 outlines our overall system for accelerator design. PRIME parameterizes the learned surrogate f_θ(x) as a deep neural network, as shown in Figure 4. The architecture first embeds the discrete-valued accelerator configuration into a continuous-valued 640-dimensional embedding via two layers of a self-attention transformer [vaswani2017attention]. Rather than directly converting this 640-dimensional embedding into a scalar output via a simple feed-forward network, which we found somewhat unstable to train with Equation 3 (possibly due to the presence of competing objectives), we pass the 640-dimensional embedding into several separate networks that map it to different scalar predictions. Finally, akin to attention [vaswani2017attention] and mixture-of-experts models [shazeer2017outrageously], we train an additional head to predict the weights of a linear combination of these per-head predictions, which forms the final prediction. Such an architecture allows the model to weight the per-head predictions differently depending on the input, which leads to more stable training. To train f_θ, we use the Adam optimizer [kingma2014adam]. Equation 3 relies on negative samples obtained by approximately optimizing the learned function; we use the same technique as in Section 4 ("optimizing the learned surrogate") to obtain these negative samples, and we periodically refresh the optimizer, once every 20K gradient steps on f_θ during training.
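As a concrete illustration of this multi-head surrogate, here is a minimal PyTorch sketch. It is not the authors' code: the number of heads, the per-parameter embedding size, and the parameter cardinalities are made-up placeholders chosen so that the flattened embedding is 640-dimensional.

```python
import torch
import torch.nn as nn

class MultiHeadSurrogate(nn.Module):
    """Sketch of f_theta: per-parameter embeddings -> two self-attention layers
    -> several scalar heads combined by learned mixture weights."""

    def __init__(self, num_params=10, cardinality=12, embed_dim=64, num_heads=7):
        super().__init__()
        # One embedding table per categorical accelerator parameter (placeholder sizes).
        self.embeds = nn.ModuleList(
            nn.Embedding(cardinality, embed_dim) for _ in range(num_params))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        flat_dim = num_params * embed_dim  # 10 x 64 = 640-dimensional embedding
        # Several independent scalar prediction heads.
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(flat_dim, 256), nn.ReLU(), nn.Linear(256, 1))
            for _ in range(num_heads))
        # Extra head that predicts the mixture weights over the scalar heads.
        self.mixer = nn.Sequential(nn.Linear(flat_dim, 256), nn.ReLU(),
                                   nn.Linear(256, num_heads))

    def forward(self, x):
        # x: (batch, num_params) integer-coded accelerator configurations.
        tokens = torch.stack([emb(x[:, i]) for i, emb in enumerate(self.embeds)], dim=1)
        z = self.encoder(tokens).flatten(start_dim=1)           # (batch, 640)
        preds = torch.cat([h(z) for h in self.heads], dim=-1)   # (batch, num_heads)
        weights = torch.softmax(self.mixer(z), dim=-1)          # non-negative, sums to 1
        return (weights * preds).sum(dim=-1)                    # final scalar prediction
```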

1: Initialize a neural surrogate model f_θ and a set of negative-mining particles (fireflies), set to random configurations from the design space, to be updated by the firefly optimizer.
2: for each training iteration, until convergence do
3:     for each firefly update step do                                ▷ Inner loop
4:         Update the fireflies according to the firefly update rule in Equation 5,
5:             moving them towards optimizing the current surrogate f_θ.  ▷ Negative mining
6:     end for
7:     Set x⁻ to the best firefly found in these steps.                ▷ Find negative sample
8:     Run one gradient step on θ using Equation 3, with x⁻ as the negative sample.
9:     if the number of gradient steps taken is a multiple of p (p = 20000) then  ▷ Periodically reinitialize the optimizer
10:        Reinitialize the firefly particles to random designs.
11:    end if
12: end for
13: Return the final model f_θ
Algorithm 1: Training the conservative surrogate in PRIME
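The sketch below shows how Algorithm 1 might look in code, under two assumptions that are ours rather than the paper's: the surrogate predicts latency (lower is better), and the conservative loss in Equation 3 is approximated by a regression term plus terms that raise the surrogate's predictions at the mined negative sample and at infeasible points (consistent with the clipping discussion below). `sample_random_designs`, `firefly_step`, and the data iterators are hypothetical helpers.

```python
import torch

def train_conservative_surrogate(model, feasible_data, infeasible_data, firefly_step,
                                 num_steps=200_000, refresh_period=20_000,
                                 inner_steps=5, alpha=1.0, beta=1.0):
    opt = torch.optim.Adam(model.parameters())
    particles = sample_random_designs(23)          # hypothetical helper: random configs
    for step in range(num_steps):
        # Inner loop: move fireflies toward designs the current surrogate scores well
        # (low predicted latency); these become candidate negative samples.
        with torch.no_grad():
            for _ in range(inner_steps):
                particles = firefly_step(particles, scores=-model(particles))
            x_neg = particles[(-model(particles)).argmax()].unsqueeze(0)
        x, y = feasible_data.sample_batch()
        x_inf = infeasible_data.sample_batch()
        loss = ((model(x) - y) ** 2).mean()          # fit logged feasible data
        loss = loss - alpha * model(x_neg).mean()    # raise predicted latency at the negative
        loss = loss - beta * model(x_inf).mean()     # raise predicted latency at infeasible points
        opt.zero_grad()
        loss.backward()
        opt.step()
        if (step + 1) % refresh_period == 0:         # periodic reinitialization (p = 20000)
            particles = sample_random_designs(23)
    return model
```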

The hyperparameters for training the conservative surrogate via Equation 3, and its contextual version, are as follows:

  • Architecture of f_θ. As indicated in Figure 4, our architecture takes in the list of categorical (one-hot) values of the different accelerator parameters (listed in Table 1), converts each parameter into an embedding vector, thus obtaining a (number of parameters × embedding size) matrix for each accelerator, and then runs two layers of self-attention [vaswani2017attention] on it. The resulting output is flattened into a 640-dimensional vector and fed into several separate prediction networks that produce the per-head scalar predictions, as well as an additional 2-layer feed-forward attention network that determines non-negative weights summing to 1. The final output is simply the weighted sum of the per-head predictions under these weights.

  • Optimizer for training f_θ. Adam, with default values for its remaining hyperparameters.

  • Validation set split. The top 20% highest-scoring points in the training dataset are held out as a validation set, used for choosing the loss coefficients in Equation 3 and the checkpoint to evaluate.

  • Ranges of the loss coefficients. We trained several models over a range of coefficient values and then selected the best values based on the highest Kendall's ranking correlation on the validation set. Kendall's ranking correlation between two sets of objective values, the ground-truth latency values y_1, ..., y_n on the validation set and the corresponding predicted latency values ŷ_1, ..., ŷ_n, is given by:

    τ = (2 / (n(n−1))) · Σ_{i<j} sign(y_i − y_j) · sign(ŷ_i − ŷ_j)     (4)
  • Clipping during training. Equation 3 increases the value of the learned function at the negative samples and the infeasible points. We found that, with a small dataset, these linear terms can cause numerical instability and produce unbounded predictions. To avoid this, we clip the predicted function value both above and below so that it stays within the valid range of ground-truth latency values.

  • Negative sampling with the firefly optimizer. As discussed in Section 4, we utilize the firefly optimizer for both the negative sampling step and the final optimization of the learned conservative surrogate. When used during negative sampling, we refresh (i.e., reinitialize) the firefly parameters after every 20K gradient steps of training the conservative surrogate, and run a fixed number of firefly optimization steps per gradient step taken on the conservative surrogate.

  • Details of the firefly optimizer. The initial population of fireflies depends on the number of accelerator parameters; in our setting with ten accelerator parameters (see Table 1), the initial population is 23 fireflies. We use the same firefly hyperparameters for the optimizer in all the experiments and never modify them. The update to a particular optimization particle (i.e., a firefly) x_i at the t-th step of optimization is given by:

    x_i^{t+1} = x_i^t + β(r_{ij}) · (x_j^t − x_i^t) + α · ε_t     (5)

    where x_j is a different firefly that achieves a better objective value than x_i, r_{ij} = ||x_i^t − x_j^t|| is the distance between the two fireflies, ε_t is a random perturbation, and the attractiveness function is given by β(r) = β_0 · exp(−γ r²).
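Below is a minimal NumPy sketch of this update rule over a continuous relaxation of the design vector; the hyperparameter values and the eventual rounding back to discrete accelerator configurations are assumptions on our part, not the authors' settings.

```python
import numpy as np

def firefly_update(x, scores, beta0=1.0, gamma=1.0, alpha=0.1, rng=None):
    """One synchronous firefly step; x: (n, d) particle positions, higher score = brighter."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x_new = x.copy()
    for i in range(len(x)):
        for j in range(len(x)):
            if scores[j] > scores[i]:                  # x_j is a brighter (better) firefly
                r2 = np.sum((x[i] - x[j]) ** 2)        # squared distance r_ij^2
                attract = beta0 * np.exp(-gamma * r2)  # beta(r) = beta0 * exp(-gamma * r^2)
                x_new[i] += attract * (x[j] - x[i]) + alpha * rng.normal(size=x.shape[1])
    return x_new
```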

C.2 Details of Architecting Accelerators for Multiple Applications Simultaneously

Now we provide details of the tasks from Table 4, where the goal is to architect an accelerator that is jointly optimized for multiple application models. For such tasks, we augment the data points for each model with the context vector from Table 2, which summarizes certain parameters of each application. For entries of this context vector with extremely large magnitudes (e.g., model parameters and number of compute operations), we normalize the values by their sum across the applications considered, so as to encode only the relative scale rather than the absolute value, which is not needed. To better visualize the number of feasible accelerators for joint optimization, Figure 8 shows a tSNE plot (using the raw architecture configurations as input) of high-performing accelerator configurations. The blue dots are the jointly feasible accelerators in the combined dataset; note that there are no more than 20-30 such points in total. The highlighted red star marks the best design suggested by PRIME, with an average latency of 334.70 (Table 4). This indicates that the contextual, multi-application problem poses a challenge for data-driven methods: they must produce optimized designs even though very few accelerators are jointly feasible in the combined dataset. Despite this limitation, PRIME successfully finds more efficient accelerator configurations that attain low latency on each of the applications jointly, as shown in Table 4.
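A minimal sketch of the normalization step described above, assuming the context for each application is stored as a row of a matrix; the choice of which columns count as "large-magnitude" is illustrative, not the paper's exact context layout.

```python
import numpy as np

def normalize_contexts(contexts, large_cols=(0, 1)):
    """contexts: (num_apps, num_features) per-application statistics;
    large_cols: indices of large-magnitude entries (e.g., #parameters, #compute ops)."""
    ctx = np.asarray(contexts, dtype=float).copy()
    cols = list(large_cols)
    # Divide each large-magnitude entry by its sum over the co-optimized applications,
    # so the context encodes only relative scale, not absolute values.
    ctx[:, cols] /= ctx[:, cols].sum(axis=0, keepdims=True)
    return ctx
```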

Figure 8: tSNE plot of the joint dataset and randomly sampled infeasible data points. The blue points show the accelerator configurations that are jointly feasible for all the applications. The highlighted red star shows the best design proposed by PRIME. The rest of the points show the infeasible points.

C.3 Dataset Sensitivity to Accelerator Parameters

Figure 10 visualizes the sensitivity of the objective function (e.g., latency) to changes in certain accelerator parameters, such as memory size (Table 1). As shown in the figure, the latency objective that we seek to optimize can exhibit high sensitivity to small variations in the architecture parameters, making the optimization landscape particularly ill-behaved: a small change in one of the discrete parameters can induce a large change in the optimization objective. This characteristic of the dataset makes the optimization task more challenging.

Appendix D Overview of Accelerators and Search Space

This section briefly discusses the additional accelerators (similar to [kao2020confuciux]) that we evaluate in this work, namely NVDLA [nvidia] and ShiDianNao [shidiannao], and their corresponding search spaces.

NVDLA: Nvidia Deep Learning Accelerator. NVDLA [nvdla] is an open-architecture inference accelerator designed and maintained by Nvidia. In contrast to other inference accelerators, NVDLA is a weight-stationary accelerator: it retains the model parameters within each processing element and parallelizes the computation across input and output channels. NVDLA-style dataflow accelerators generally yield better performance for the layers at later processing stages, because these layers generally have more model parameters and therefore benefit from reduced data movement of the parameters.
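To make the weight-stationary idea concrete, here is a toy loop-nest sketch in plain Python (not NVDLA's actual microarchitecture or a cost model): each weight stays pinned in its notional PE while the inputs that need it stream past, so each weight is fetched once and reused many times.

```python
import numpy as np

def weight_stationary_matmul(W, X):
    """Toy weight-stationary schedule for Y = W @ X, with W: (O, I) and X: (I, B)."""
    O, I = W.shape
    _, B = X.shape
    Y = np.zeros((O, B))
    for o in range(O):
        for i in range(I):
            w = W[o, i]                     # the weight is "stationary" in the PE
            for b in range(B):
                Y[o, b] += w * X[i, b]      # inputs and partial sums stream through
    return Y
```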

ShiDianNao: Vision Accelerator. Figure 9 shows a high-level schematic of the ShiDianNao accelerator [shidiannao]. A ShiDianNao-style dataflow accelerator is an output-stationary accelerator.

Figure 9: Overview of the ShiDianNao dataflow accelerator. This accelerator exhibits an output-stationary dataflow, keeping the partial results stationary within each processing element (PE).

That is, it keeps the partial results inside each PE and instead moves the model parameters and input-channel data. As such, compared to NVDLA-style accelerators, ShiDianNao provides better performance for layers with large output channels (generally the first few layers of a model).
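For contrast, a toy output-stationary loop nest (again a sketch, not ShiDianNao's actual design) pins each partial result in a PE-local accumulator while weights and input-channel data stream past; both schedules compute the same result and differ only in which operand is reused in place.

```python
import numpy as np

def output_stationary_matmul(W, X):
    """Toy output-stationary schedule for Y = W @ X, with W: (O, I) and X: (I, B)."""
    O, I = W.shape
    _, B = X.shape
    Y = np.zeros((O, B))
    for o in range(O):
        for b in range(B):
            acc = 0.0                        # the partial result is "stationary" in the PE
            for i in range(I):
                acc += W[o, i] * X[i, b]     # weights and inputs stream through
            Y[o, b] = acc
    return Y
```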

Search space of dataflow accelerators. We follow a methodology similar to [kao2020confuciux] to evaluate the additional hardware accelerators discussed in the previous paragraphs. We use MAESTRO [maestro], an analytical cost model that supports performance modeling of various dataflow accelerators. In this joint accelerator design and dataflow optimization problem, the total number of parameters to be optimized is up to 106 (the tuple of (# of PEs, Buffers) per model layer), with each parameter taking one of 12 discrete values. This makes the hardware search space consist of roughly 2.5 × 10^114 accelerator configurations. We also note that while the method proposed in [kao2020confuciux] treats accelerator design as a sequential decision-making problem and uses reinforcement learning techniques, PRIME simply designs the whole accelerator in a single step, treating it as a model-based optimization problem.
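As a quick sanity check on this count (under the assumption that all 106 parameters vary independently over their 12 discrete values): 12^106 = 10^(106 · log10 12) ≈ 10^114.4 ≈ 2.5 × 10^114 configurations.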


Figure 10: (a) Histogram of infeasible (orange bars, with large score values) and feasible (blue bars) data points, and (b) sensitivity of runtime to the size of core memory, for the MobileNetEdge [efficientnet:2020] dataset.
Figure 11: tSNE plot of the infeasible and feasible hardware accelerator designs. Note that feasible designs (shown in blue) are embedded in a sea of infeasible designs (shown in red), which makes this a challenging domain for optimization methods.
Figure 12: To verify the overestimation hypothesis in our domain (namely, that optimizing an accelerator against a naïvely trained standard surrogate model is likely to find designs that appear promising under the learned model but do not actually attain low latency), we plot a calibration plot of the top accelerator designs found by optimizing such a surrogate. In the scatter plot, each accelerator is a point whose x-coordinate is the actual latency obtained by running the design in the simulator and whose y-coordinate is the latency predicted by the learned surrogate. For a large fraction of designs, the predicted latency is much smaller than the actual latency (i.e., these designs lie beneath the y = x line in the plot above). This means that optimizing designs under a naïve surrogate model is prone to finding designs that appear overly promising (i.e., attain low predicted latency) but are not actually promising, confirming the presence of the overestimation hypothesis in our problem domain.