The death of Moore’s Law [esmaeilzadeh2011dark] and its spiraling effect on the semiconductor industry have driven the growth of specialized hardware accelerators. These specialized accelerators are tailored to specific applications [yazdanbakhsh2021apollo, reagen2017case, prac_dse:mascots:2019, shi2020learned]. To design specialized accelerators, designers first spend considerable amounts of time developing simulators that closely model the real accelerator performance, and then optimize the accelerator using the simulator. While such simulators can automate accelerator design, doing so requires a large number of simulator queries for each new design, which is costly in both simulation time and compute, and this cost increases with the size of the design space [yazdanbakhsh2021evaluation, shi2020learned, hegdemind]. Moreover, most of the accelerators in the design space are typically infeasible [hegdemind, yazdanbakhsh2021apollo] because of build errors in silicon or compilation/mapping failures. When the target applications change or a new application is added, the complete simulation-driven procedure is generally repeated. To make such approaches efficient and practically viable, designers typically “bake in” constraints or otherwise narrow the search space, but such constraints can leave out high-performing solutions [dmazerunner, timeloop, marvel].
An alternate approach, proposed in this work, is to devise a data-driven optimization method that only utilizes a database of previously tested accelerator designs, annotated with measured performance metrics, to produce new optimized designs without additional active queries to an explicit silicon or a cycle-accurate simulator. Such a data-driven approach provides three key benefits: (1) it significantly shortens the recurring cost of running large-scale simulation sweeps, (2) it alleviates the need to explicitly bake in domain knowledge or search space pruning, and (3) it enables data reuse by empowering the designer to optimize accelerators for new unseen applications, by virtue of effective generalization. While data-driven approaches have shown promising results in biology [fu2021offline, brookes19a, trabucco2021conservative], using offline optimization methods to design accelerators has been challenging, primarily due to the abundance of infeasible design points [yazdanbakhsh2021apollo, hegdemind] (see Figures 3 and 11).
The key contribution of this paper is a data-driven approach to automatically architect high-performing application-specific accelerators by using only previously collected offline data. Our method learns a robust surrogate model of the task objective function from an existing offline dataset, and finds high-performing application-specific accelerators by optimizing the architectural parameters against this learned surrogate function, as shown in Figure 1. Naïvely learned surrogate functions usually produce poor-performing, out-of-distribution designs that appear quite optimistic under the learned surrogate [kumar2019model, brookes19a, trabucco2021conservative]. The robust surrogate in our method, by contrast, is explicitly trained to prevent overestimation on “adversarial” designs that would be found during optimization. Furthermore, unlike prior works that discard infeasible points [hegdemind, trabucco2021conservative], our proposed method incorporates infeasible points when learning the conservative surrogate by treating them as additional negative samples. Additionally, our method can be used effectively for multi-model and zero-shot optimization, capabilities that prior works [trabucco2021conservative, hegdemind] do not show. During evaluation, our method optimizes the learned surrogate using a discrete optimizer.
Our results show that our method architects hardware accelerators that improve over the best design in the training dataset, on average, by 2.46× (up to 6.7×) when specializing for a single application. In this case, our method also improves over the best conventional simulator-driven optimization methods by 1.54× (up to 6.6×). These performance improvements are obtained while reducing the total simulation time to merely 7% and 1% of that of the simulator-driven methods for single-task and multi-task optimization, respectively. More importantly, a contextual version of our method can design accelerators that are jointly optimal for a set of nine applications without requiring any additional domain information. In this challenging setting, our method improves over simulator-driven methods, which tend to scale poorly as more applications are added, by 1.38×. Finally, we show that the surrogates trained with our method on a set of training applications can be readily used to obtain accelerators for unseen target applications, without any retraining on the new application. Even in this zero-shot optimization scenario, our method outperforms simulator-based methods that require re-training and active simulation queries by up to 1.67×. In summary, our method addresses the shortcomings of simulation-driven approaches: it significantly reduces the simulation time, enables data reuse, generalizes across applications, and does not require domain-specific engineering or search space pruning. To facilitate further research in architecting hardware accelerators, we will also release the dataset used in our experiments, consisting of many accelerator design points.
2 Background on Hardware Accelerators
The goal of specialized hardware accelerators, such as Google TPUs [jouppi2017datacenter, edgetpu:arxiv:2020], Nvidia GPUs [nvidia], and GraphCore [graphcore], is to improve the performance of specific applications, such as machine learning models. To design such accelerators, architects typically create a parameterized design and sweep over parameters using simulation. In this section, we provide an overview of hardware accelerators, present the design of our template-based accelerator, and explain how an accelerator works.
Target hardware accelerators. Our primary evaluation uses an industry-grade and highly parameterized template-based accelerator following prior work [yazdanbakhsh2021evaluation]. This template enables architects to determine the organization of various components, such as compute units, memory cells, memory, etc., by searching for these configurations in a discrete design space. Some ML applications may have large memory requirements (e.g., large language models [brown2020language]) demanding sufficient on-chip memory resources, while others may benefit from more compute blocks. The hardware design workflow directly selects the values of these parameters. In addition to this accelerator, and to further show the generality of our method to other accelerator design problems, we evaluate two distinct dataflow accelerators with different search spaces, namely NVDLA-style [nvdla] and ShiDianNao-style [shidiannao] from [kao2020confuciux] (see Section 6 and Appendix D for a detailed discussion; see Table 6 for results).
How does an accelerator work? We briefly explain the computation flow on our template-based accelerators (Figure 2) and refer the readers to Appendix D for details on other accelerators. This template-based accelerator is a 2D array of processing elements (PEs). Each PE is capable of performing matrix multiplications in a single instruction multiple data (SIMD) paradigm [simd]. A controller orchestrates the data transfer (both activations and model parameters) between off-chip DRAM memory and the on-chip buffers and also reads in and manages the instructions (e.g. convolution, pooling, etc.) for execution. The computation stages on such accelerators start by sending a set of activations to the compute lanes, executing them in a SIMD manner, and either storing the partial computation results or offloading them back into off-chip memory. Compared to prior works [hegdemind, shidiannao, kao2020confuciux], this parameterization is unique: it includes multiple compute lanes per PE and enables a SIMD execution model within each compute lane, which yields a distinct accelerator search space accompanied by an end-to-end simulation framework. Appendix D elaborates on other accelerators evaluated in this work.
3 Problem Statement, Training Data and Evaluation Protocol
Our template-based parameterization maps the accelerator, denoted as , to a discrete design space, , and each is a discrete-valued variable representing one component of the microarchitectural template, as shown in Table 1 (see Appendix D for the description of other accelerator search spaces studied in our work). A design may be infeasible for various reasons, such as a compilation failure or the limitations of physical implementation, and we denote the set of all such feasibility criteria as . The feasibility criteria depend on both the target software and the underlying hardware, and it is not easy to identify whether a given is infeasible without explicit simulation. We require our optimization procedure not only to learn the value of the objective function but also to learn to navigate through a sea of infeasible solutions to high-performing feasible solutions satisfying .
Our training dataset consists of a modest set of accelerators that are randomly sampled from the design space and evaluated by the hardware simulator. We partition the dataset into two subsets, and . Let denote the desired objective (e.g., latency, power, etc.) we intend to optimize over the space of accelerators . We do not possess functional access to , and the optimizer can only access values for accelerators in the feasible partition of the data, . For all infeasible accelerators, the simulator does not provide any value of . In addition to satisfying feasibility, the optimizer must handle explicit constraints on parameters such as area and power [flynn2011computer]. In our applications, we impose an explicit area constraint, , though additional explicit constraints are also possible. To account for different constraints, we formulate this task as a constrained optimization problem. Formally:
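A sketch of the constrained problem described above, written with notation that is assumed here for illustration and is not taken from the source: $x$ is an accelerator configuration, $f(x)$ the objective to minimize (e.g., latency), $\mathrm{Area}(x)$ its chip area, $\alpha_0$ the area budget, and $\mathcal{F}$ the set of designs that satisfy the feasibility criteria.

```latex
\min_{x} \; f(x)
\quad \text{s.t.} \quad \mathrm{Area}(x) \le \alpha_0,
\qquad x \in \mathcal{F}
```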
While Equation 1 may appear similar to other standard black-box optimization problems, solving it over the space of accelerator designs is challenging due to the large number of infeasible points, the need to handle explicit design constraints, and the difficulty in navigating the non-smooth landscape of the objective function (see Figure 3, and Figure 10 in the Appendix).
|Accelerator Parameter||# discrete values||Accelerator Parameter||# discrete values|
|# of PEs-X||10||# of PEs-Y||10|
|PE Memory||7||# of Cores||7|
|Core Memory||11||# of Compute Lanes||10|
|Instruction Memory||4||Parameter Memory||5|
|Activation Memory||7||DRAM Bandwidth||6|
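The size of this design space follows directly from Table 1: multiplying the number of discrete values per parameter gives the total number of configurations. The short sketch below does this arithmetic. The names `TABLE_1_CHOICES` and `design_space_size` are illustrative; the counts are copied from the table.

```python
from math import prod

# Number of discrete values per parameter, copied from Table 1.
TABLE_1_CHOICES = {
    "# of PEs-X": 10, "# of PEs-Y": 10,
    "PE Memory": 7, "# of Cores": 7,
    "Core Memory": 11, "# of Compute Lanes": 10,
    "Instruction Memory": 4, "Parameter Memory": 5,
    "Activation Memory": 7, "DRAM Bandwidth": 6,
}


def design_space_size(choices):
    """Total number of configurations in a fully factorized discrete space."""
    return prod(choices.values())


print(f"{design_space_size(TABLE_1_CHOICES):,}")  # 452,760,000
```

The product is roughly 452M configurations, which matches the design-space size quoted for the training dataset.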
What makes optimization over accelerators challenging? Compared to other domains where model-based optimization methods have been applied [brookes19a, trabucco2021conservative], optimizing accelerators introduces a number of practical challenges. First, accelerator design spaces typically feature a narrow manifold of feasible accelerators within a sea of infeasible points [prac_dse:mascots:2019, shi2020learned, gelbart2014bayesian], as visualized in Figure 3 and Appendix (Figure 11). While some of these infeasible points can be identified via simple rules (e.g. estimating chip area usage), most infeasible points correspond to failures during compilation or hardware simulation. These infeasible points are generally not straightforward to formulate into the optimization problem and require simulation to identify [shi2020learned, timeloop, yazdanbakhsh2021apollo].
Second, the optimization objective can exhibit high sensitivity to small variations in some architecture parameters (Figure 10) in some regions of the design space, but remain relatively insensitive in other parts, resulting in a complex optimization landscape. This suggests that optimization algorithms based on local parameter updates (e.g., gradient ascent, evolutionary schemes, etc.) may struggle to traverse the nearly flat regions of the objective landscape, which can lead to poor performance.
Training dataset. We used an offline dataset of (accelerator parameters, latency) pairs collected via random sampling from the space of 452M possible accelerator configurations. Our method is only provided with a relatively modest set of feasible points ( points) for training, and these points are the worst-performing feasible points across the pool of randomly sampled data. This dataset is meant to reflect an easily obtainable, application-agnostic dataset of accelerators that could have been generated once and stored to disk, or might come from real physical experiments. We emphasize that no assumptions or domain knowledge about the application use case were made during dataset collection. Table 2 lists the target applications evaluated in this work, which include three variations of MobileNet [edgetpu:arxiv:2020, mnv2:arxiv:2018, mnv3:cvpr:2019], three in-house industry-level models for object detection (M4, M5, M6; names redacted to prevent anonymity violation), a U-Net model [unet], and two RNN-based encoder-decoder language models [trnn01, trnn02, trnn03, trnn04]. These applications span the gamut from small models, such as , with only 0.4 MB of model parameters and thus lower on-chip memory demands, to medium-sized models ( 5 MB), such as MobileNetV3 and models, and large models ( 19 MB), such as t-RNNs, which require larger on-chip memory.
|Name||Domain||# of XLA Ops (Conv, D/W, FF)||Model Param||Instr. Size||# of Compute Ops.|
|MobileNetEdge||Image Class.||(45, 13, 1)||3.87 MB||476,736||1,989,811,168|
|MobileNetV2||Image Class.||(35, 17, 1)||3.31 MB||416,032||609,353,376|
|MobileNetV3||Image Class.||(32, 15, 17)||5.20 MB||1,331,360||449,219,600|
|M4||Object Det.||(32, 13, 2)||6.23 MB||317,600||3,471,920,128|
|M5||Object Det.||(47, 27, 0)||2.16 MB||328,672||939,752,960|
|M6||Object Det.||(53, 33, 2)||0.41 MB||369,952||228,146,848|
|U-Net||Image Seg.||(35, 0, 0)||3.69 MB||224,992||13,707,214,848|
|t-RNN Dec||Speech Rec.||(0, 0, 19)||19 MB||915,008||40,116,224|
|t-RNN Enc||Speech Rec.||(0, 0, 18)||21.62 MB||909,696||45,621,248|
Evaluation protocol. To compare state-of-the-art simulator-driven methods and our data-driven method, we limit the number of feasible points (costly to evaluate) that can be used by any algorithm to equal amounts. We still provide infeasible points to every method and leave it up to the optimization method whether to use them. This ensures our comparisons are fair in terms of the amount of data available to each method. However, it is worth noting that, in contrast to our method, which uses lower-quality data points from a small offline dataset, the simulator-driven methods have an inherent advantage because they can steer the query process towards points that are more likely to perform better. Following prior work [brookes19a, trabucco2021conservative, trabucco2021designbench], we evaluate each run of a method by first sampling the top design candidates according to the algorithm’s predictions, evaluating all of these under the ground truth objective function, and recording the performance of the best accelerator design. The final reported result is the median of ground truth objective values across five independent runs.
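The protocol above can be sketched in a few lines. This is a minimal illustration that assumes a lower-is-better objective such as latency; the function names and the parameter `k` (the number of top candidates, whose value is not stated here) are hypothetical.

```python
from statistics import median


def evaluate_run(candidates, predicted, ground_truth, k):
    """Score one run: take the k designs the algorithm predicts to be best
    (lowest predicted value), evaluate them under the ground-truth
    objective, and keep the best true value among them."""
    top_k = sorted(candidates, key=predicted)[:k]
    return min(ground_truth(x) for x in top_k)


def report(runs, ground_truth, k):
    """Median over runs of the per-run best ground-truth value.

    runs is a list of (candidates, predicted) pairs, one per independent run.
    """
    return median(evaluate_run(c, p, ground_truth, k) for c, p in runs)


# Toy usage: designs are plain numbers and the true objective is the identity.
truth = lambda x: float(x)
runs = [
    ([5, 3, 9, 1], lambda x: x),    # accurate predictor
    ([5, 3, 9, 1], lambda x: -x),   # predictor that ranks designs backwards
    ([4, 8], lambda x: x),
]
result = report(runs, truth, k=2)
```

Note that a run is scored by the true objective of its selected designs, not by the predictions, so a predictor that ranks badly is penalized even if its predicted values look good.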
4 : Architecting Accelerators via Conservative Surrogates
As shown in Figure 4, our method first learns a conservative surrogate model of the optimization objective using the offline dataset. Then, it optimizes the learned surrogate using a discrete optimizer. The optimization process does not require access to a simulator, nor to real-world experiments beyond the initial dataset, except when evaluating the final top-performing designs (Section 3).
Learning conservative surrogates using logged offline data. Our goal is to utilize a logged dataset of feasible accelerator designs labeled with the desired performance metric (e.g., latency), , and infeasible designs, to learn a mapping , that maps the accelerator configuration to its corresponding metric . This learned surrogate can then be optimized by the optimizer. While a straightforward approach for learning such a mapping is to train it via supervised regression, by minimizing the mean-squared error , prior work [kumar2019model, kumar2020conservative, trabucco2021conservative] has shown that such predictive models can arbitrarily overestimate the value of an unseen input . This can cause the optimizer to find a solution that performs poorly in the simulator but looks promising under the learned model. We empirically validate this overestimation hypothesis and find it to confound the optimizer in our problem domain as well (see Figure 12 in the Appendix).
To prevent overestimated values at unseen inputs from confounding the optimizer, we build on COMs [trabucco2021conservative] and train with an additional term that explicitly maximizes the function value at unseen values. Such unseen designs , where the learned function is likely to be overestimated, are “negatively mined” by running a few iterations of an approximate stochastic optimization procedure that aims to maximize in the inner loop. This procedure is analogous to adversarial training [goodfellow2014explaining] in supervised learning. Equation 2 formalizes this objective:
denotes the negative samples produced from an optimizer that attempts to maximize the current learned model, . We will discuss our choice of in the Appendix Section C.
Incorporating design constraints via infeasible points. While prior work [trabucco2021conservative] simply optimizes Equation 2 to learn a surrogate, this is not enough when optimizing over accelerators, as we also show empirically (Appendix A.1). This is because explicit negative mining does not provide any information about accelerator design constraints. Fortunately, this information is provided by infeasible points, . The training procedure in Equation 2 provides a simple way to incorporate such infeasible points: we simply incorporate as additional negative samples and maximize the prediction at these points. This gives rise to our final objective:
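As a rough illustration of the objective described above, the sketch below combines a regression term on feasible data with two terms that push the prediction up at mined negatives and at infeasible designs. It assumes the surrogate predicts a lower-is-better quantity such as latency, so raising predictions makes the surrogate pessimistic at those points. The function name, the argument names, and the weights `alpha` and `beta` are assumptions made for this example, not the paper's notation.

```python
def conservative_loss(f, feasible, negatives, infeasible, alpha, beta):
    """Illustrative conservative regression objective (to be minimized).

    f          -- surrogate mapping a design to a predicted latency
    feasible   -- list of (design, measured_latency) pairs
    negatives  -- designs mined by optimizing against the current surrogate
    infeasible -- designs reported as infeasible (no latency available)
    alpha/beta -- hypothetical weights on the two conservative terms
    """
    def mean(values):
        return sum(values) / len(values)

    mse = mean([(f(x) - y) ** 2 for x, y in feasible])
    # Subtracting the mean prediction means that minimizing this loss
    # raises the predicted value at negatives and infeasible designs.
    return (mse
            - alpha * mean([f(x) for x in negatives])
            - beta * mean([f(x) for x in infeasible]))


# Toy usage with scalar "designs" and a linear surrogate.
loss = conservative_loss(
    f=lambda x: 2 * x,
    feasible=[(1, 2), (2, 5)],
    negatives=[3],
    infeasible=[4, 6],
    alpha=0.1,
    beta=0.2,
)
```

In an actual training loop the surrogate would be a parametric model and this quantity would be minimized with respect to its parameters, with the negatives re-mined as the model changes.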
Multi-application optimization and zero-shot generalization. One of the central benefits of a data-driven approach is that it enables learning powerful surrogates that generalize over the space of applications, potentially being effective for new unseen application domains. In our experiments, we evaluate on designing accelerators for multiple applications denoted as , jointly or for a novel unseen application. In this case, we utilized a dataset , where each consists of a set of accelerator designs, annotated with the latency value and the feasibility criterion for a given application . While there are a few overlapping designs that appear in each of the parts of the dataset, and these designs are annotated with latency values for more than one application, most of the designs only appear in one part, and so our training procedure does not have access to the latency values corresponding to more than one application for such designs. This presents a challenging scenario for any data-driven method, which must generalize correctly to unseen combinations of accelerators and applications.
To train a single conservative surrogate for multiple applications, we extend the training procedure in Equation 3 to incorporate context vectors for various applications, derived from the list of application properties in Table 2. The learned function in this setting is now conditioned on the context . We train via the objective in Equation 3, but in expectation over all the contexts and their corresponding datasets: . Once such a contextual surrogate is learned, we can either optimize the average surrogate across a set of contexts to obtain an accelerator that is optimal for multiple applications simultaneously on average (“multi-model” optimization), or optimize this contextual surrogate for a novel context vector, corresponding to an unseen application (“zero-shot” generalization). In the latter case, our method is not allowed to train on any data corresponding to this new unseen application. While such zero-shot generalization might appear surprising at first, note that the context vectors are not simply one-hot vectors, but consist of parameters with semantic information, which the surrogate can generalize over.
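The two ways of using a contextual surrogate can be sketched as follows. This is an illustration only: `surrogate(design, context)` stands for any learned function conditioned on an application context, and the helper names and the toy surrogate are hypothetical.

```python
def multi_model_objective(surrogate, contexts):
    """Average the contextual surrogate over a set of application contexts."""
    def objective(design):
        return sum(surrogate(design, c) for c in contexts) / len(contexts)
    return objective


def zero_shot_objective(surrogate, new_context):
    """Condition the same surrogate on the context of an unseen application."""
    return lambda design: surrogate(design, new_context)


# Toy surrogate: predicted latency is a context-weighted sum of design values.
toy_surrogate = lambda design, ctx: sum(d * c for d, c in zip(design, ctx))

joint = multi_model_objective(toy_surrogate, contexts=[(1, 0), (0, 1)])
unseen = zero_shot_objective(toy_surrogate, new_context=(1, 1))
```

Either objective can then be handed to the same discrete optimizer; only the function being minimized changes, not the trained surrogate.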
Optimizing the learned conservative surrogate. Prior work [yazdanbakhsh2021apollo] has shown that the most effective optimizers for accelerator design are meta-heuristic/evolutionary optimizers. We therefore use the firefly algorithm [yang2010nature, yang2010eagle, liu2013adaptive] to optimize our conservative surrogate. This algorithm maintains a set of optimization candidates (a.k.a. “fireflies”) and jointly updates them towards regions of low objective value, while adjusting their relative distances appropriately to ensure multiple high-performing but diverse solutions. We discuss additional details in Appendix C.1.
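To give a feel for this style of optimizer, here is a heavily simplified firefly-style search over a discrete index space. It is a sketch under stated assumptions, not the configuration used in the paper: each candidate moves towards every better candidate with an attraction that decays with distance, plus random noise, and the result is rounded back onto the grid. All hyperparameter names and default values are hypothetical.

```python
import math
import random


def firefly_minimize(objective, sizes, n_fireflies=15, n_steps=30,
                     beta0=1.0, gamma=0.01, noise=0.2, seed=0):
    """Simplified firefly search over index vectors; lower objective is better."""
    rng = random.Random(seed)

    def clip(value, n):
        # Round to the nearest valid index in [0, n - 1].
        return min(max(int(round(value)), 0), n - 1)

    swarm = [[rng.randrange(n) for n in sizes] for _ in range(n_fireflies)]
    scores = [objective(x) for x in swarm]
    best_idx = min(range(n_fireflies), key=scores.__getitem__)
    best, best_score = list(swarm[best_idx]), scores[best_idx]

    for _ in range(n_steps):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if scores[j] >= scores[i]:
                    continue  # only move towards strictly better candidates
                dist2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                pull = beta0 * math.exp(-gamma * dist2)  # decays with distance
                swarm[i] = [
                    clip(a + pull * (b - a) + noise * n * (rng.random() - 0.5), n)
                    for a, b, n in zip(swarm[i], swarm[j], sizes)
                ]
                scores[i] = objective(swarm[i])
                if scores[i] < best_score:
                    best, best_score = list(swarm[i]), scores[i]
    return best, best_score


sizes = [10, 10, 7, 7, 11, 10, 4, 5, 7, 6]      # value counts from Table 1
toy = lambda x: sum((v - 2) ** 2 for v in x)     # stand-in for a learned surrogate
design, score = firefly_minimize(toy, sizes)
```

Because the objective here is a learned surrogate and not a simulator, each call inside the loop is cheap, which is what makes a population-based search of this kind affordable.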
Cross validation: which model and checkpoint should we evaluate? Similarly to supervised learning, models trained via Equation 3 can overfit, leading to poor solutions. Thus, we require a procedure to select which hyperparameters and checkpoints should actually be used for the design. This is crucial, because we cannot arbitrarily evaluate as many models as we want against the simulator. While effective methods for model selection have been hard to develop in offline optimization [trabucco2021conservative, trabucco2021designbench], we devised a simple scheme using a validation set for choosing the values of and (Equation 3), as well as which checkpoint to utilize for generating the design. For each training run, we hold out the best 20% of the points from the training set and use them only for cross-validation as follows. Typical cross-validation strategies in supervised learning involve tracking validation error (or risk), but since our model is trained conservatively, its predictions may not match the ground truth, making such validation risk values unsuitable for our use case. Instead, we track Kendall’s ranking correlation between the predictions of the learned model and the ground truth values (Appendix C) for the held-out points for each run. We pick values of , and the checkpoint that attain the highest validation ranking correlation. We present the pseudo-code for our method (Algorithm 1) and implementation details in Appendix C.1.
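The selection rule above can be sketched as follows. This is a minimal example that uses the simple tau-a form of Kendall's rank correlation (ties count as neither concordant nor discordant); which variant the paper uses is not stated here, and the function and checkpoint names are hypothetical.

```python
from itertools import combinations


def kendall_tau(pred, truth):
    """Kendall's tau-a between two equally long sequences."""
    concordant = discordant = 0
    for (p1, t1), (p2, t2) in combinations(zip(pred, truth), 2):
        sign = (p1 - p2) * (t1 - t2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / n_pairs


def select_checkpoint(predictions_by_checkpoint, truth):
    """Pick the checkpoint whose held-out predictions rank designs best."""
    return max(predictions_by_checkpoint,
               key=lambda name: kendall_tau(predictions_by_checkpoint[name], truth))


# Toy usage: true latencies of four held-out designs and two checkpoints.
held_out_latency = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "ckpt_a": [4.0, 3.0, 2.0, 1.0],      # ranks the designs in reverse
    "ckpt_b": [10.0, 20.0, 40.0, 30.0],  # wrong scale, mostly right order
}
chosen = select_checkpoint(candidates, held_out_latency)
```

The toy case shows why a ranking metric suits conservative models: `ckpt_b` predicts values far from the ground truth yet orders the designs almost correctly, which is what the downstream optimizer relies on.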
5 Related Work
Optimizing hardware accelerators has become more important recently. Prior works [bo:frontiers:2020, flexibo:arxiv:2020, cnn_gen:cyber:2020, prac_dse:mascots:2019, accel_gen:dac:2018, spatial:pldi:2018, automomml:hpc:2016, opentuner:pact:2014, hegdemind] mainly rely on expensive-to-query hardware simulators to navigate the search space. For example, HyperMapper [prac_dse:mascots:2019] targets compiler optimization for FPGAs by continuously interacting with the simulator in a design space with relatively few infeasible points. Mind Mappings [hegdemind] optimizes software mappings to a fixed hardware, is provided access to millions of feasible points, and throws away infeasible points during learning. Kao et al. [kao2020confuciux] utilize reinforcement learning against a simulator to optimize the parameters of a set of simple accelerators. In contrast, our data-driven approach not only learns a conservative surrogate using offline data but can also effectively leverage information from the large number of infeasible points and is effective with just a few thousand feasible points. In addition, to the best of our knowledge, our work is the first to demonstrate generalization to unseen applications for accelerator design, outperforming state-of-the-art online methods.
A popular approach for solving black-box optimization problems is model-based optimization (MBO) [snoek15scalable, shahriari2016TakingTH, snoek2012practical]. Most of these methods fail to scale to high dimensions, and have been extended with neural networks [snoek15scalable, snoek2012practical, kim2018attentive, garnelo18neural, garnelo18conditional, angermueller2020population, angermueller2019model, mirhoseini2020chip]. While these methods work well in the active setting, they are susceptible to out-of-distribution inputs [trabucco2021designbench] in the offline, data-driven setting. To prevent this, offline MBO methods that constrain the optimizer to the manifold of valid, in-distribution inputs have been developed [brookes19a, fannjiang2020autofocused, kumar2019model]. However, modeling the manifold of valid inputs can be challenging for accelerators. Our method dispenses with the need for generative modeling, while still avoiding out-of-distribution inputs. It builds on “conservative” offline RL and offline MBO methods that train robust surrogates [kumar2020conservative, trabucco2021conservative]. However, unlike these approaches, our method can handle constraints by learning from infeasible data and utilizes a better optimizer (see Table 7 in the Appendix for a comparison). In addition, while prior works are mostly restricted to a single application, we show that our method is effective in multi-task optimization and zero-shot generalization.
6 Experimental Evaluation
Our evaluations aim to answer the following questions: Q(1) Can our method design accelerators tailored for a given application that are better than the best observed configuration in the training dataset, and comparable to or better than state-of-the-art simulation-driven methods under a given simulator-query budget? Q(2) Does our method reduce the total simulation time compared to other methods? Q(3) Can our method produce hardware accelerators for a family of different applications? Q(4) Can our method, trained for a family of applications, extrapolate to designing a high-performing accelerator for a new, unseen application, thereby enabling effective data reuse? Additionally, we ablate various properties of our method (Appendix A.6) and evaluate its efficacy in designing different accelerators with distinct dataflow architectures, with a larger search space (up to 2.5 possible candidates). In Appendix A.4, we show that our method improves over a human-engineered accelerator, and in Appendix A.3 we show how it can be reused when the design constraints (such as area constraints) change.
Baselines and comparisons. We compare our method against three state-of-the-art online optimization methods that actively query the simulator: (1) evolutionary search with the firefly optimizer [yazdanbakhsh2021apollo] (“Evolutionary”), which prior work [yazdanbakhsh2021apollo] has shown to be the best online method for designing the accelerators that we consider; (2) Bayesian Optimization (“Bayes Opt”) [vizier:sigkdd:2017] implemented via the Google Vizier framework, a Gaussian process-based optimizer that is widely used to tune machine learning models at Google and more broadly; and (3) MBO [angermueller2019model], a state-of-the-art online MBO method for designing biological sequences. In all of our experiments, we allow all the methods access to an identical number of feasible points. Note, however, that while online methods can actively select which points to query, our offline method is constrained to utilizing the worst-performing feasible designs. “(Best in Training)” denotes the best latency value in the training dataset used by our method. We also present ablation results with different components of our method removed in Appendix A.6, where we observe that utilizing both infeasible points and negative sampling is generally important for attaining good optimization performance. Appendix A.1 presents additional comparisons to COMs [trabucco2021conservative], which only obtains negative samples via gradient ascent on the learned surrogate and does not utilize infeasible points, and to P3BO [p3bo:arxiv:2020], a state-of-the-art online method. Our method outperforms both of these prior approaches.
Architecting application-specific accelerators. We first evaluate our method in designing specialized accelerators for each of the applications in Table 2. We train a conservative surrogate using the method in Section 4 on the logged dataset for each application separately. The area constraint (Equation 1) is set to , a realistic budget for accelerators [yazdanbakhsh2021apollo]. Table 3 summarizes the results. On average, the best accelerators designed by our method outperform the best accelerator configuration in the training dataset (last row of Table 3) by 2.46×.
|Application||Bayes Opt||Evolutionary||MBO||(Best in Training)|
|Geomean of ’s Improvement||1.0||1.58||1.54||1.61||2.46|
: our method, online Bayesian optimization (“Bayes Opt”), online evolutionary algorithm (“Evolutionary”), and the best design in the training dataset. On average (last row), our method improves over the best in the dataset by 2.46× (up to 6.69× in t-RNN Dec) and outperforms the best online optimization methods by 1.54× (up to 6.62× in t-RNN Enc). The best accelerator configurations identified are highlighted in bold.
Our method also outperforms the accelerators found by the best online method by 1.54× (up to 5.80× and 6.62× in t-RNN Dec and t-RNN Enc, respectively). Moreover, perhaps surprisingly, our method generates accelerators that are better than the best online optimization method for 7/9 applications, and performs on par in the other two scenarios (on average only a 6.8% slowdown compared to the best accelerator with online methods in and ). These results indicate that offline optimization of accelerators using our method can be more data-efficient than online methods with active simulation queries. We also emphasize again that our method exhibits this strong performance by only learning from an equal number of feasible points as the online methods, and these are the worst feasible points in the training dataset.
To answer Q(2), we compare the total simulation time of our method and the best evolutionary approach from Table 3 on the MobileNetEdge domain. On average, our method not only outperforms the best online method, but also reduces the total simulation time by 93%, as shown in Figure 5. Even the total simulation time to the first occurrence of the final design that is eventually returned by the online methods is about 11× what our method requires to find a better design. This indicates that the data-driven approach is preferable, both because it attains better performance and because it requires only a small fraction of the simulation time of online methods.
Architecting accelerators for multiple applications. To answer Q(3), we evaluate the efficacy of the contextual version of our method in designing an accelerator that attains the lowest latency averaged over a set of application domains.
|Applications||Area||(Ours)||Evolutionary (Online)||MBO (Online)|
|MobileNet (Edge,V2,V3)||29 mm²||(310.21, 334.70)||(315.72, 325.69)||(342.02, 351.92)|
|MobileNet (V2,V3), ,||29 mm²||(268.47, 271.25)||(288.67, 288.68)||(295.21, 307.09)|
|MobileNet (Edge, V2, V3), , ,||29 mm²||(311.39, 313.76)||(314.31, 316.65)||(321.48, 339.27)|
|MobileNet (Edge, V2, V3), , , , U-Net, t-RNN-Enc||29 mm²||(305.47, 310.09)||(404.06, 404.59)||(404.06, 412.90)|
|MobileNet (Edge, V2, V3), , , , t-RNN-Enc||100 mm²||(286.45, 287.98)||(404.25, 404.59)||(404.06, 404.94)|
|MobileNet (Edge, V2, V3), , , , t-RNN (Dec,Enc)||29 mm²||(426.65, 426.65)||(586.55, 586.55)||(626.62, 692.61)|
|MobileNet (Edge, V2, V3), , , , U-Net, t-RNN (Dec,Enc)||100 mm²||(383.57, 385.56)||(518.58, 519.37)||(526.37, 530.99)|
|Geomean of ’s Improvement||—||(1.0, 1.0)||(1.21, 1.20)||(1.24, 1.27)|
Optimized average latency (the lower, the better) across multiple applications (up to nine applications) from diverse domains, obtained by our method and the best online algorithms (Evolutionary and MBO) under different area constraints. Each row shows the (Best, Median) of average latency across five runs. The geometric mean of our method’s improvement over other methods (last row) indicates that our method is at least 21% better in terms of the best design found.
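The geometric-mean row can be checked against the per-scenario entries. The sketch below recomputes the improvement over the Evolutionary (Online) column from the (Best, Median) pairs in Table 4; the variable and function names are illustrative, and the numbers are copied from the table.

```python
from math import prod

# (best, median) average latency per scenario, copied from Table 4.
OURS = [(310.21, 334.70), (268.47, 271.25), (311.39, 313.76), (305.47, 310.09),
        (286.45, 287.98), (426.65, 426.65), (383.57, 385.56)]
EVOLUTIONARY = [(315.72, 325.69), (288.67, 288.68), (314.31, 316.65),
                (404.06, 404.59), (404.25, 404.59), (586.55, 586.55),
                (518.58, 519.37)]


def geomean_improvement(ours, baseline, column):
    """Geometric mean of baseline/ours latency ratios (column 0=best, 1=median)."""
    ratios = [b[column] / o[column] for o, b in zip(ours, baseline)]
    return prod(ratios) ** (1 / len(ratios))


print(round(geomean_improvement(OURS, EVOLUTIONARY, 0), 2))  # 1.21
print(round(geomean_improvement(OURS, EVOLUTIONARY, 1), 2))  # 1.2
```

A geometric mean is the natural average for ratios like these, since a scenario with a large absolute latency does not dominate the summary.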
As discussed previously, the training data used does not label a given accelerator with latency values corresponding to each application, and thus, our method must extrapolate accurately to estimate the latency of an accelerator for a context it is not paired with in the training dataset. This also means that our method cannot simply return the accelerator with the best average latency and must run non-trivial optimization to find a design that actually accelerates all the applications. We evaluate our method in seven different multi-application design scenarios (Table 4), comprising various combinations of models from Table 2 and under different area constraints, where the smallest set consists of the three MobileNet variants and the largest set consists of nine models from image classification, object detection, image segmentation, and speech recognition. This scenario is also especially challenging for online methods since the number of jointly feasible designs is expected to drop significantly as more applications are added. For instance, for the case of the MobileNet variants, random sampling only finds a few (20-30) accelerator configurations that are jointly feasible and high-performing (Appendix C.2, Figure 8), but for the largest scenario of nine applications, there are very few jointly feasible designs.
Table 4 shows that, on average, our method finds accelerators that outperform the best online method by 1.2× (up to 41%). While our method performs similarly to online methods in the smallest three-model scenario (first row), it outperforms online methods as the number of applications increases and the set of applications becomes more diverse. In addition, compared with the best jointly feasible design point across the target applications, our method finds significantly better accelerators (3.95×). Finally, as the number of models increases, the difference in total simulation time between online methods and our method widens further (Figure 6). These results indicate that our method is effective in designing accelerators jointly optimized across multiple applications while reusing the same dataset as in the single-task setting, and that it scales more favorably than its simulation-driven counterparts. Appendix B expounds on the details of the designed accelerators for nine applications, comparing our method and the best online method.
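The multi-application objective described above amounts to scoring each candidate design by its predicted latency averaged over the target contexts. The sketch below illustrates this with a toy stand-in for the contextual surrogate; `toy_surrogate`, the two-number design encoding, and the context tuples are all hypothetical and chosen only to make the example runnable, and the exhaustive `min` stands in for the actual optimizer.

```python
def average_predicted_latency(surrogate, design, contexts):
    """Mean of the surrogate's latency predictions over all application contexts."""
    return sum(surrogate(design, c) for c in contexts) / len(contexts)


def select_design(surrogate, candidate_designs, contexts):
    """Pick the candidate with the lowest context-averaged predicted latency."""
    return min(
        candidate_designs,
        key=lambda d: average_predicted_latency(surrogate, d, contexts),
    )


# Hypothetical surrogate: latency falls with more cores and more memory.
def toy_surrogate(design, context):
    cores, memory = design
    compute_need, memory_need = context
    return compute_need / cores + memory_need / memory


contexts = [(100.0, 10.0), (10.0, 100.0)]  # compute-heavy and memory-heavy apps
candidates = [(8, 1), (1, 8), (4, 4)]      # (cores, memory) under a fixed budget
best = select_design(toy_surrogate, candidates, contexts)
print(best)
```

In this toy setting the balanced design wins on average even though each specialized design is better for one of the two applications, which mirrors the trade-off analysis in Appendix B.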
Accelerating previously unseen applications (“zero-shot” optimization). Finally, we answer Q(4) by demonstrating that our data-driven offline method enables effective data reuse: it uses logged accelerator data from a set of applications to design an accelerator for an unseen new application, without requiring any training on data from the new application(s). We train a contextual version of our method using a set of “training applications” and then optimize an accelerator using the learned surrogate with different contexts corresponding to “test applications,” without evaluating any additional designs against the test applications. For comparison, we run the evolutionary (online) method on the test application domain for 1000 iterations. We also compare directly to the accelerator found by the evolutionary (online) method when jointly optimizing the training applications in Appendix A.2, but found that to be worse than running a few iterations of the evolutionary (online) method on the test applications. Table 5 shows that, on average, our method outperforms the best online method on the test applications by 1.26× (up to 66%), with only a 2% slowdown in one case. Note that the difference in performance increases as the number of training applications increases. These results show the effectiveness of our method in the zero-shot setting (more results in Appendix A.5) and highlight the effectiveness of data reuse by our offline approach.
Applying our method to other accelerator architectures and dataflows. Finally, to test the generalizability of our method to other accelerator architectures [kao2020confuciux], we use it to optimize the latency of two styles of dataflow accelerators, NVDLA-style and ShiDianNao-style, across three applications (Appendix D details the methodology). As shown in Table 6, our method outperforms the online evolutionary method by 6% and improves over the best point in the training dataset by 3.75×. This demonstrates that our method is effective in optimizing accelerators with different dataflow architectures and can successfully optimize over extremely large hardware search spaces.
|Train Applications||Test Applications||Area||Ours||Evolutionary (Online)|
|MobileNet (Edge, V3)||MobileNetV2||29 mm²||(311.39, 313.76)||(314.31, 316.65)|
|MobileNet (V2, V3), ,||MobileNetEdge,||29 mm²||(357.05, 364.92)||(354.59, 357.29)|
|MobileNet (Edge, V2, V3), , , , t-RNN Enc||U-Net, t-RNN Dec||29 mm²||(745.87, 745.91)||(1075.91, 1127.64)|
|MobileNet (Edge, V2, V3), , , , t-RNN Enc||U-Net, t-RNN Dec||100 mm²||(517.76, 517.89)||(859.76, 861.69)|
|Geomean of Our Method’s Improvement||—||—||(1.0, 1.0)||(1.24, 1.26)|
|Applications||Dataflow||Ours||Evolutionary (Online)||Best in Training|
|Geomean of Our Method’s Improvement||—||1.0||1.06||3.75|
In this work, we present a data-driven offline optimization method to automatically architect hardware accelerators. Our method learns a conservative surrogate of the objective function from a one-time collected dataset of accelerators, leveraging infeasible data points to better model the desired objective and thereby alleviating the need for time-consuming simulation. Our results show that, on average, our method outperforms the best designs observed in the logged data by 2.46× and improves over the best simulator-driven approach by about 1.54×. In the more challenging settings of designing accelerators jointly optimal for multiple applications or, zero-shot, for new unseen applications, our method outperforms simulator-driven methods by 1.2× while reducing the total simulation time by 99%.
At a high level, the efficacy of our method highlights the potential of utilizing logged offline data in conjunction with strong offline methods in an accelerator design pipeline. While our method can, in principle, itself be used in the inner loop of an online method that performs active data collection beyond an initial offline dataset, the strong generalization ability of neural networks, when trained with good offline methods on offline datasets that consist of low-performing designs, can serve as a highly effective ingredient for multiple design problems. Utilizing our method for other problems in architecture and systems, such as software-hardware co-optimization, is an interesting avenue for future work.
We thank the “Learn to Design Accelerators” team at Google Research and the Google EdgeTPU team for their invaluable feedback and suggestions. In addition, we extend our gratitude to the Vizier team, Christof Angermueller, Sheng-Chun Kao, Samira Khan, Stella Aslibekyan, and Xinyang Geng for their help with experiment setups and insightful comments.
Appendix A Additional Experiments
A.1 Comparison to Other Baseline Methods
Comparison to COMs. In this section, we perform a comparative evaluation of our method against the COMs method [trabucco2021conservative]. Like several offline reinforcement learning algorithms [kumar2020conservative], both our method and COMs are based on the key idea of learning a conservative surrogate of the desired objective function that does not overestimate the value of unseen data points. This prevents the optimizer from finding accelerators that appear promising under the learned model but are not actually promising under the true objective. The key differences between our method and COMs are: (1) our method uses an evolutionary optimizer for negative sampling, compared to the gradient ascent of COMs, which can be vastly beneficial in discrete design spaces, as our results show empirically; and (2) our method can explicitly learn from infeasible data points provided to the algorithm, while COMs does not have a mechanism to incorporate infeasible points into the learning of the surrogate. To further assess the importance of these differences in practice, we run COMs on three tasks from Table 3 and present a comparison between our method, COMs, and the Standard method in Table 7. The “Standard” method represents a surrogate model trained without utilizing any infeasible points. On average, our method outperforms COMs by 1.17× (up to 1.24×).
|Geomean of Our Method’s Improvement||1.0||1.17||1.46|
Comparison to generative offline MBO methods. We also ran a prior offline MBO method, model inversion networks (MINs) [kumar2019model], which trains a generative model of the accelerator on our data. However, we were unable to train a discrete objective-conditioned GAN model to 0.5 discriminator accuracy on our offline dataset, and often observed a collapse of the discriminator. As a result, we trained a δ-VAE [razavi2019preventing], conditioned on the objective function (i.e., latency). A standard VAE [kingma2013auto] suffered from posterior collapse, which informed our choice of a δ-VAE. The latent space of a trained objective-conditioned δ-VAE, computed for accelerators in a held-out validation dataset (not used for training), is visualized in the accompanying t-SNE plot. This is a 2D t-SNE of the accelerator configurations (Table 1). The color of a point denotes the latency value of the corresponding accelerator configuration, partitioned into three bins. Observe that while we would expect these objective-conditioned models to disentangle accelerators with different objective values in the latent space, the models we trained did not exhibit such a structure, which hampers optimization. While our method could also benefit from a generative optimizer (i.e., by using a generative optimizer in place of our optimizer, together with a conservative surrogate), we leave it for future work to design effective generative optimizers for accelerators.
Comparison to P3BO. We perform a comparison against P3BO, a state-of-the-art online MBO method in biology [p3bo:arxiv:2020]. On average, our method outperforms P3BO by 2.5× (up to 8.7× on U-Net) in terms of the latency of the optimized accelerators found. In addition, we present the comparison between the total simulation runtime of the P3BO and Evolutionary methods in Figure 7. Note that not only is the total simulation time of P3BO around 3.1× higher than that of the Evolutionary method, but the latency of its final optimized accelerator is also around 18% higher for MobileNetEdge. On the other hand, the total simulation time of our method for the task of accelerator design for MobileNetEdge is lower than that of both methods (only 7% of the Evolutionary method, as shown in Figure 5).
|Geomean of Our Method’s Improvement||1.0||2.5|
A.2 Learned Surrogate Model Reuse for Accelerator Design
Extending our results in Table 4, we present another variant of optimizing accelerators jointly for multiple applications. In this scenario, the learned surrogate model is reused to architect an accelerator for a subset of the applications used for training. We train a contextual conservative surrogate on the variants of MobileNet (Table 2) as discussed in Section 4, but generate optimized designs by optimizing the average surrogate on only two variants of MobileNet (MobileNetEdge and MobileNetV2). This tests the ability of our approach to provide a general contextual conservative surrogate that can be trained only once and optimized multiple times with respect to different subsets of applications. Observe in Table 9 that our method architects high-performing accelerator configurations (better than the best point in the dataset by 3.29×, last column) while outperforming the online optimization methods by 7%.
|Applications||All||-Opt||-Infeasible||Standard||Bayes Opt||Evolutionary||(Best in Training)|
A.3 Learned Surrogate Model Reuse under Different Design Constraints
We also test the robustness of our approach in handling constraints that vary at test time, such as a different chip area budget. We evaluate the conservative surrogate learned by our method under a reduced value of the area threshold in Equation 1. To do so, we utilize a variant of rejection sampling: we take the learned model trained for a default area constraint and then reject all optimized accelerator configurations that do not satisfy the reduced area constraint. Table 10 summarizes the results for this scenario for the MobileNetEdge [edgetpu:arxiv:2020] application under the new area constraint (18 mm²). A method that produces diverse designs that are both high-performing and spread across diverse values of the area constraint is expected to perform better. As shown in Table 10, our method provides a better accelerator than the best online optimization run from scratch with the new constraint value, by 4.4%, even though our method does not train its conservative surrogate with this unseen test-time design constraint. Note that when the design constraint changes, online methods generally need to restart the optimization process from scratch and undergo costly queries to the simulator. This imposes additional overhead in terms of total simulation time (see Figure 5 and Figure 6). However, the results in Table 10 show that our learned surrogate model can be reused under a different test-time design constraint, eliminating additional queries to the simulator.
|Applications||All||-Opt||-Infeasible||Standard||Bayes Opt||Evolutionary||(Best in Training)|
|MobileNetEdge, Area 18 mm²||315.15||433.81||351.22||470.09||331.05||329.13||354.13|
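The rejection step described in this subsection can be sketched as a filter over already-optimized candidates followed by a selection on predicted latency. The dictionary keys, candidate names, and numeric values below are hypothetical placeholders, not data from the paper.

```python
def best_design_under_area(candidates, area_limit):
    """Reject candidates above the area limit, then return the one with
    the lowest predicted latency (or None if nothing is feasible)."""
    feasible = [c for c in candidates if c["area_mm2"] <= area_limit]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["predicted_latency"])


# Hypothetical designs produced by optimizing the surrogate under the
# default (larger) area budget.
candidates = [
    {"name": "a", "area_mm2": 27.0, "predicted_latency": 300.0},
    {"name": "b", "area_mm2": 17.5, "predicted_latency": 320.0},
    {"name": "c", "area_mm2": 16.0, "predicted_latency": 340.0},
]

choice = best_design_under_area(candidates, area_limit=18.0)
print(choice["name"])
```

Because no surrogate retraining or simulator query is involved, tightening the constraint only costs a pass over the candidate set; this works best when the candidates are spread across a range of area values.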
A.4 Comparison with Human-Engineered Accelerators
In this section, we compare the optimized accelerator designs found by our method, each targeted towards a single application, to the EdgeTPU design [yazdanbakhsh2021evaluation, edgetpu:arxiv:2020]. The goal of this comparison is to show the potential benefit of specialization towards single applications using architecture exploration alone. We utilize an area constraint of 27 mm² and a DRAM bandwidth of 25 Gbps, identical to the specifications of the EdgeTPU accelerator. Table 11 summarizes the results in two sections, namely “Latency” and “Chip Area”. The first and second columns under each section show the results for our method and EdgeTPU, respectively. The final column of each section shows the improvement of the design suggested by our method over EdgeTPU. On average (as shown in the last row), our method finds accelerator designs that are 2.69× (up to 11.84× on t-RNN Enc) better than EdgeTPU in terms of latency. Our method achieves this improvement while, on average, reducing the chip area usage by 1.50× (up to 2.28× on MobileNetV3).
|Latency (milliseconds)||Chip Area (mm²)|
A.5 Comparison with Online Methods in Zero-Shot Setting
We evaluated the Evolutionary (online) method under two protocols for the last two rows of Table 5: first, we picked the best designs (the top-performing 256 designs, similar to the setting in Section 4) found by the evolutionary algorithm on the training set of applications and evaluated them on the target applications; second, we let the evolutionary algorithm continue simulator-driven optimization on the target applications. The latter is unfair, in that the online approach is allowed to query more designs in the simulator. Nevertheless, we found that in either configuration the evolutionary approach performed worse than our method, which does not access training data from the target application domain. For the area constraints of 29 mm² and 100 mm², the Evolutionary algorithm reduces the latency from 1127.64 to 820.11 and from 861.69 to 552.64, respectively, although this is still worse than our method. In the second experiment, in which we unfairly allow the evolutionary algorithm to continue optimizing on the target application, the Evolutionary algorithm suggests worse designs than in Table 5 (e.g., 29 mm²: 1127.64 to 1181.66, and 100 mm²: 861.69 to 861.66).
A.6 Ablation Study of Various Components of Our Method
|Application||All||-Opt||-Infeasible||Standard||Bayes Opt||Evolutionary||(Best in Training)|
Here we ablate over the components of our method: (1) the optimizer was not used for negative sampling (“-Opt” in Table 12), and (2) infeasible points were not used (“-Infeasible” in Table 12). As shown in Table 12, these variants of our method generally perform worse than the case in which both negative sampling and infeasible data points are utilized in training the surrogate model.
Appendix B Comparing Optimized Accelerators Found by Our Method and the Evolutionary Method
|Applications||Ours||Evolutionary (Online)||Improvement of Ours over Evolutionary|
|Average (Latency in ms)||383.57||518.58||1.35|
|Accelerator Parameter||Ours||Evolutionary (Online)|
|# of PEs-X||4||4|
|# of PEs-Y||6||8|
|# of Cores||64||128|
|# of Compute Lanes||4||6|
|DRAM Bandwidth (Gbps)||30||30|
|Chip Area (mm²)||46.78||92.05|
In this section, we give an overview of the best accelerator configurations that our method and the Evolutionary method found for multi-application accelerator design (see Table 4) when the number of target applications is nine and the area constraint is set to 100 mm². The average latencies of the best accelerators found by our method and the Evolutionary method across the nine target applications are 383.57 ms and 518.58 ms, respectively. In this setting, our method outperforms the best online method by 1.35×. Table 13 shows the per-application latencies for the accelerators suggested by our method and the Evolutionary method. The last column shows the latency improvement of our method over the Evolutionary method. Interestingly, while the latencies of the accelerator found by our method for MobileNetEdge, MobileNetV2, MobileNetV3, , t-RNN Dec, and t-RNN Enc are better, the accelerator identified by the online method yields lower latency in , , and U-Net.
To better understand the design trade-offs of the accelerators produced by our method and the Evolutionary method, we present all the accelerator parameters (see Table 1) in Table 14. The accelerator parameters that differ between the two designed accelerators are shaded in gray (e.g., # of PEs-Y, # of Cores, # of Compute Lanes, PE Memory, Instruction Memory, and Activation Memory). The last row of Table 14 gives the overall chip area usage in mm². Our method not only outperforms the Evolutionary algorithm in reducing the average latency across the set of target applications, but also reduces the overall chip area usage by 1.97×. Studying the identified accelerator configuration, we observe that our method trades compute (64 cores vs. 128 cores) for a larger PE memory size (2,097,152 vs. 1,048,576). These results show that our method favors PE memory size to accommodate the larger memory requirements of t-RNN Dec and t-RNN Enc (see Model Parameters in Table 2), where large gains lie. Favoring larger on-chip memory comes at the expense of lower compute power in the accelerator. This reduction in the accelerator’s compute power leads to higher latency for the models with a large number of compute operations, namely , , and U-Net (see the last row of Table 2). One model is an interesting case in which both compute power and on-chip memory are favored (6.23 MB of model parameters and 3,471,920,128 compute operations). This is why the latencies of this model on the two accelerators, designed by our method and the Evolutionary method, are comparable (400.88 ms for our method vs. 406.28 ms for the online method).
Appendix C Details of Our Method
In this section, we provide training details of our method including hyperparameters and compute requirements and details of different tasks.
C.1 Hyperparameter and Training Details
Algorithm 1 outlines our overall system for accelerator design. Our method parameterizes the surrogate function as a deep neural network, as shown in Figure 4. The architecture first embeds the discrete-valued accelerator configuration into a continuous-valued 640-dimensional embedding via two layers of a self-attention transformer [vaswani2017attention]. Rather than directly converting this 640-dimensional embedding into a scalar output via a simple feed-forward network (which we found somewhat unstable to train with Equation 3, possibly due to the presence of competing objectives), we pass the 640-dimensional embedding into several different networks that map it to different scalar predictions. Finally, akin to attention [vaswani2017attention] and mixture of experts [shazeer2017outrageously], we train an additional head to predict the weights of a linear combination of the predictions at the different heads, and this linear combination is the final prediction. Such an architecture allows the model to use different predictions depending on the input, which allows for more stable training. To train the surrogate, we utilize the Adam [kingma2014adam] optimizer. Equation 3 relies on a procedure that approximately maximizes the learned function. We utilize the same technique as in Section 4 (“optimizing the learned surrogate”) to obtain these negative samples. We periodically refresh the optimizer, once every 20K gradient steps during training.
The hyperparameters for training the conservative surrogate in Equation 3 and its contextual version are as follows:
Architecture of the surrogate. As indicated in Figure 4, our architecture takes in a list of categorical (one-hot) values of the different accelerator parameters (listed in Table 1), converts each parameter into an embedding, thus obtaining one embedding matrix for each accelerator, and then runs two layers of self-attention [vaswani2017attention] on it. The resulting output is flattened into a 640-dimensional vector and fed into the different prediction networks that give rise to the per-head predictions, and into an additional 2-layer feed-forward attention network that determines the weights of the linear combination. Finally, the output is simply the weighted combination of the per-head predictions.
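The final combination step of this architecture can be sketched as follows. This is an illustrative NumPy forward pass under several assumptions that are ours, not the paper's: each prediction head and the weighting network are reduced to a single linear map, the number of heads is set to 3, and the combination weights are normalized with a softmax (by analogy with attention). The transformer embedding is replaced by random inputs.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def combine_heads(embedding, head_weights, gate_weights):
    """embedding: (batch, d); head_weights and gate_weights: (d, k).

    Returns the mixed scalar prediction per input and the mixing weights.
    """
    head_preds = embedding @ head_weights    # (batch, k): one scalar per head
    mix = softmax(embedding @ gate_weights)  # (batch, k): weights sum to 1
    return (mix * head_preds).sum(axis=-1), mix


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 640))              # stand-in for the 640-d embedding
head_w = 0.01 * rng.normal(size=(640, 3))    # 3 heads is an arbitrary choice
gate_w = 0.01 * rng.normal(size=(640, 3))
pred, mix = combine_heads(emb, head_w, gate_w)
print(pred.shape, mix.sum(axis=-1))
```

With softmax weights the output is a convex combination, so it always lies between the smallest and largest head prediction for a given input, which is one plausible reason such a mixture trains more stably than a single unconstrained output.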
Optimizer/learning rate for training . Adam, , default , .
Validation set split. Top 20% high scoring points in the training dataset are used to provide a validation set for deciding coefficients , and the checkpoint to evaluate.
Ranges of , . We trained several models with and . Then we selected the best values of and based on the highest Kendall’s ranking correlation on the validation set. Kendall’s ranking correlation between two sets of objective values, the ground-truth latency values on the validation set and the predicted latency values on the validation set, is given by:
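As a concrete reference for this model-selection criterion, the sketch below computes Kendall's ranking correlation in its standard pairwise form (often called τ-a): the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. This is the textbook definition and may differ from the paper's exact equation in normalization or in how ties are handled; the function name is ours.

```python
from itertools import combinations


def kendall_tau(truth, pred):
    """Kendall's tau-a between ground-truth and predicted values."""
    concordant = discordant = 0
    for i, j in combinations(range(len(truth)), 2):
        sign = (truth[i] - truth[j]) * (pred[i] - pred[j])
        if sign > 0:
            concordant += 1   # the pair is ordered the same way in both lists
        elif sign < 0:
            discordant += 1   # the pair is ordered oppositely
    n_pairs = len(truth) * (len(truth) - 1) // 2
    return (concordant - discordant) / n_pairs


true_latency = [1.0, 2.0, 3.0]
print(kendall_tau(true_latency, [10.0, 20.0, 30.0]))  # identical ranking
print(kendall_tau(true_latency, [30.0, 20.0, 10.0]))  # reversed ranking
```

A surrogate only needs to rank designs correctly for the optimizer to find good ones, which is why a rank correlation is a natural selection criterion here rather than, say, mean squared error.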
Clipping during training. Equation 3 increases the value of the learned function at and . We found that with the small dataset, these linear objectives can run into numerical instability, and produce predictions. To avoid this, we clip the predicted function value both above and below by , where the valid range of ground-truth values is .
Negative sampling with . As discussed in Section 4, we utilize the firefly optimizer for both the negative sampling step and the final optimization of the learned conservative surrogate. When used during negative sampling, we refresh (i.e., reinitialize) the firefly parameters after every gradient steps of training the conservative surrogate, and run steps of firefly optimization per gradient step taken on the conservative surrogate.
Details of firefly: The initial population of fireflies depends on the number of accelerator configurations () following the formula . In our setting with ten accelerator parameters (See Table 1), the initial population of fireflies is 23. We use the same hyperparameters: , for the optimizer in all the experiments and never modify it. The update to a particular optimization particle (i.e., a firefly) , at the -th step of optimization is given by:
where is a different firefly that achieves a better objective value compared to and the function is given by: .
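For orientation, the following sketch implements one sweep of the textbook firefly update, in which each particle moves toward every particle with a better objective value, with an attraction that decays with squared distance, plus a small random perturbation. It is not the paper's exact update rule: the continuous relaxation, the exponential attraction function, and the parameter names `eta`, `gamma`, and `noise` are all assumptions made for illustration.

```python
import numpy as np


def firefly_step(population, scores, eta=0.1, gamma=1.0, noise=0.01, rng=None):
    """One sweep of a generic firefly update for minimization.

    population: (n, d) float array of particles; scores: (n,) objective values.
    """
    rng = np.random.default_rng() if rng is None else rng
    new_pop = population.copy()
    for i in range(len(population)):
        for j in range(len(population)):
            if scores[j] < scores[i]:  # firefly j is better (lower is better)
                diff = population[j] - new_pop[i]
                attraction = eta * np.exp(-gamma * np.sum(diff ** 2))
                new_pop[i] += attraction * diff
                new_pop[i] += noise * rng.normal(size=diff.shape)
    return new_pop


# Toy objective: squared distance from the origin.
population = np.array([[0.0, 0.0], [1.0, 1.0]])
scores = (population ** 2).sum(axis=1)
moved = firefly_step(population, scores, noise=0.0)
print(moved)
```

With the noise term switched off, the best particle stays in place and the other particle takes a small step toward it; in a discrete design space the continuous positions would additionally need to be mapped back to valid parameter values.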
C.2 Details of Architecting Accelerators for Multiple Applications Simultaneously
We now provide details of the tasks from Table 4, where the goal is to architect an accelerator that is jointly optimized for multiple application models. For such tasks, we augment the data points for each model with the context vector from Table 2, which summarizes certain parameters of each application. For entries in this context vector that have extremely high magnitudes (e.g., model parameters and number of compute operations), we normalize the values by the sum of the values across the applications considered, so that only the relative scale is encoded and not the absolute value, which is not required. To better visualize the number of feasible accelerators for joint optimization, Figure 8 shows the t-SNE plot (raw architecture configurations are used as input) of high-performing accelerator configurations. The blue dots are the jointly feasible accelerators in the combined dataset; note that there are no more than 20-30 such data points in total. The highlighted red star marks the best design suggested by our method, with an average latency of 334.70 (Table 4). This indicates that the contextual, multi-application problem poses a challenge for data-driven methods: they need to produce optimized designs even though very few accelerators are jointly feasible in the combined dataset. Despite this limitation, our method successfully finds more efficient accelerator configurations that attain low latency values on each of the applications jointly, as shown in Table 4.
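The normalization of high-magnitude context entries described above can be sketched in a few lines. The feature values below are hypothetical placeholders rather than the entries of Table 2, and the function name is ours.

```python
import numpy as np


def normalize_context(raw_features):
    """Divide each feature column by its sum over the considered applications,
    so that only the relative scale across applications is encoded."""
    raw = np.asarray(raw_features, dtype=float)
    return raw / raw.sum(axis=0, keepdims=True)


# Rows: applications; columns: (model parameters in MB, number of compute ops).
# These numbers are illustrative only.
raw_context = [
    [4.0, 1.0e9],
    [6.0, 3.0e9],
    [10.0, 6.0e9],
]
context = normalize_context(raw_context)
print(context)
```

After normalization every column sums to one, so features that originally differ by many orders of magnitude end up on a comparable scale before being fed to the surrogate.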
C.3 Dataset Sensitivity to Accelerator Parameters
In Figure 10, we visualize the sensitivity of the objective function (e.g., latency) to changes in certain accelerator parameters, such as memory size (Table 1). As shown in the figure, the latency objective that we seek to optimize can exhibit high sensitivity to small variations in the architecture parameters, making the optimization landscape particularly ill-behaved. Thus, a small change in one of the discrete parameters can induce a large change in the optimization objective. This characteristic of the dataset makes the optimization task even more challenging.
Appendix D Overview of Accelerators and Search Space
This section briefly discusses the additional accelerators (similar to [kao2020confuciux]) that we evaluate in this work, namely NVDLA [nvidia] and ShiDianNao [shidiannao], and their corresponding search spaces.
NVDLA: Nvidia Deep Learning Accelerator. NVDLA [nvdla] is an open-architecture inference accelerator designed and maintained by Nvidia. In contrast to some other inference accelerators, NVDLA is a weight-stationary accelerator. That is, it retains the model parameters on each processing element and parallelizes the computations across input and output channels. NVDLA-style dataflow accelerators generally yield better performance for the computations of layers at the later processing stages. This is because these layers generally have more model parameters and therefore benefit from the reduced data movement associated with the model parameters.
ShiDianNao: Vision Accelerator. Figure 9 shows the high-level schematic of the ShiDianNao accelerator [shidiannao]. A ShiDianNao-style dataflow accelerator is an output-stationary accelerator. That is, it keeps the partial results inside each PE and instead moves the model parameters and input channel data. As such, compared to NVDLA-style accelerators, ShiDianNao provides better performance for the computations of layers with large output channels (generally the first few layers of a model).
Search space of dataflow accelerators. We follow a methodology similar to [kao2020confuciux] to evaluate the additional hardware accelerators discussed in the previous paragraphs. We use MAESTRO [maestro], an analytical cost model that supports the performance modeling of various dataflow accelerators. In this joint accelerator design and dataflow optimization problem, the total number of parameters to be optimized is up to 106 (the tuple of (# of PEs, Buffers) per model layer), with each parameter taking one of 12 discrete values. This makes the hardware search space consist of roughly 2.5×10^114 accelerator configurations. We also note that while the method proposed in [kao2020confuciux] treats the accelerator design problem as a sequential decision-making problem and uses reinforcement learning techniques, our method simply designs the whole accelerator in a single step, treating it as a model-based optimization problem.
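The size of this search space follows directly from the two numbers stated above, assuming all 106 parameters are chosen independently from 12 values each (the upper bound of the "up to 106" range). A short check with Python's arbitrary-precision integers:

```python
num_parameters = 106       # upper bound on the number of tunable parameters
values_per_parameter = 12  # discrete choices per parameter

search_space_size = values_per_parameter ** num_parameters
num_digits = len(str(search_space_size))

# Scientific notation with one decimal place, for comparison with the text.
print(f"{search_space_size:.1e} configurations ({num_digits} digits)")
```

The count has 115 decimal digits, i.e., it is on the order of 10^114, which rules out exhaustive enumeration regardless of how fast a single cost-model query is.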