1. Introduction
Deep learning (DL) is a fundamental technology for many emerging applications such as autonomous driving (Bojarski et al., 2016), translation (Wu et al., 2016), and image classification (Russakovsky et al., 2015), with accuracy close to, and even surpassing, that of humans (Karpathy and Fei-Fei, 2015; Toshev and Szegedy, 2014; Farabet et al., 2013). Modern deep neural networks (DNNs) such as convolutional neural networks (CNNs) require millions of input activations and weights, resulting in hundreds of millions of multiply-and-accumulate (MAC) operations for each inference pass. Moreover, DL inference is increasingly being deployed on mobile devices
and edge devices for fast response times and data privacy. Achieving low latency and energy goals under the stringent computation and memory constraints of mobile and edge devices has emerged as an important challenge. To cope with this challenge, specialized hardware accelerators for DNN inference are being developed and deployed (Chen et al., 2019; Chung et al., 2018; NVIDIA, 2018). Most of these accelerators are "spatial", i.e., they are built by interconnecting hundreds to thousands of processing elements (PEs). They achieve high throughput by exploiting parallelism over the PEs, and energy efficiency by maximizing data reuse within the PE array via direct data forwarding between PEs and the use of scratchpad memories (Chen et al., 2016, 2014; NVIDIA, 2018; Parashar et al., 2017; Sharma et al., 2016; Jouppi et al., 2017; Ma et al., 2017; Zhang et al., 2015). The mechanism used by a spatial accelerator to exploit parallelism and perform data staging is known as its dataflow (Chen et al., 2016), and it is a crucial component of an accelerator design because it directly impacts the accelerator's performance and energy efficiency. Some of the state-of-the-art dataflows include the row-stationary (RS) dataflow of Eyeriss (Chen et al., 2016), the DLA dataflow of NVDLA (NVIDIA, 2018), and the output-stationary dataflow of ShiDianNao (Du et al., 2015).
A mapper (or compiler) for such a spatial accelerator takes a DNN model, a dataflow, and the hardware parameters of the accelerator as inputs, and generates dataflow-compatible mappings for execution (Sze et al., 2017).
Even though dataflows have been shown to significantly impact the performance of DNN accelerators, no single dataflow is an excellent choice for all shapes and types of convolution layers (Lu et al., 2017; Kwon et al., 2019). As a result, there is substantial interest in flexible spatial accelerators, whose on-chip networks and state machines are fully programmable, allowing dynamic reconfiguration of dataflows during execution (Lu et al., 2017; Kwon et al., 2018; Parashar et al., 2019; Chen et al., 2019; Krishna, 2019). The overall performance and energy of these flexible accelerators depend heavily on the compiler's ability to generate efficient mappings, thereby reinforcing the importance of the "mapping problem" (Parashar et al., 2019) for DNN accelerators. The focus of this paper is on efficiently mapping convolutions and related computations onto flexible accelerators for optimized throughput and energy efficiency.
The efficiency of any mapping is tightly coupled with both the algorithmic aspects of DNN models and the microarchitectural aspects of accelerators. On the algorithmic end, DNNs have been changing at an exponential rate since the success of early models like AlexNet (Krizhevsky et al., 2012). For convolutions alone, many new algorithmic techniques have been developed, such as depthwise convolution (Howard et al., 2017; Sandler et al., 2018), pointwise convolution (Sandler et al., 2018; He et al., 2016), and skip connections (also referred to as residual links or identity operators) (He et al., 2016; Xie et al., 2017). These techniques sacrifice arithmetic intensity (or algorithmic reuse) for fewer computations (Table 2). On the microarchitecture end, DNN accelerators have evolved from simple systolic arrays with limited flexibility to spatial arrays that are becoming increasingly complex, with various network-on-chip (NoC) implementations (Chen et al., 2019; Kwon et al., 2018; Lu et al., 2017), reuse mechanisms (Chen et al., 2016, 2019), and flat/hierarchical organizations (Chung et al., 2018). However, much of the prior work on mapping (Zhang et al., 2015; Ma et al., 2017; Zhao et al., 2019; Yang et al., 2016) targeted hardware with limited capabilities and used DNNs with standard convolutions.
Prior work. The mapping problem for convolutions is described as a loop optimization problem in the literature (Parashar et al., 2019; Ma et al., 2017; Zhang et al., 2015; Zhao et al., 2019), involving several transformations such as multi-level loop tiling, parallelization, and interchange applied to the seven nested loops of the convolution. As a result, the mapping problem has an enormous search space to explore. For example, mapping the ResNet50 (He et al., 2016) and MobileNetV2 (Sandler et al., 2018) models onto an Eyeriss-like accelerator (Chen et al., 2016) configuration yields an enormous number of valid mappings per convolution layer on average. Prior work has fixed certain aspects of the mapping space, such as the choice of parallel loops and loop orders, and performed a brute-force exploration (Zhang et al., 2015; Ma et al., 2017; Motamedi et al., 2016) of the search space, but fixing such choices may not be efficient for rapidly evolving convolutions. To the best of our knowledge, TimeLoop (Parashar et al., 2019) is the only framework that considers all aspects of the loop-transformation space for a fully flexible spatial accelerator. However, it employs either an exhaustive linear search or a random-sampling-based heuristic to explore the optimization search space. Also, none of the prior work has included the data layouts of tensors as part of the mapping space for spatial accelerators.
Our approach to the mapping problem is motivated by the observation that off-chip data movement between DRAM and the accelerator is 2-3 orders of magnitude more expensive than on-chip data movement involving the PE array and L1/L2 buffers (Chen et al., 2016; Sze et al., 2017). Hence, we propose a novel approach, referred to as "decoupled off-chip/on-chip", that decomposes the mapping space into two subspaces, i.e., off-chip and on-chip subspaces, and first explores the off-chip subspace, followed by the on-chip mapping subspace constructed with the optimal mappings from the off-chip subspace. In contrast to prior work (Parashar et al., 2019), we use different approaches and cost models for these subspaces, i.e., a classical distinct-block (DB) locality cost model (Ferrante et al., 1991; Sarkar, 1997) to explore the off-chip subspace, and a state-of-the-art DNN accelerator cost model, MAESTRO (Kwon et al., 2019), for the on-chip subspace. Note that MAESTRO's DSE tool (Kwon et al., 2019) is limited to design-space exploration of hardware parameters, and doesn't explore the mapping (dataflow) search space as our approach does.
Fig. 1 shows Marvel in a DNN accelerator compilation flow. The goal of Marvel is to formulate and explore the mapping space of convolutions and their variants for a target fully flexible accelerator, and to recommend efficient mapping(s) for code/configuration generation. We explore a much larger space of mappings relative to past work, because our mapping space includes dimension permutation, a form of data layout, along with the loop transformations. Even though our approach generates efficient per-layer mappings for flexible accelerators, it can also be used to find a uniform data layout (fig. 7) and a uniform dataflow for a rigid accelerator that is good across layers.
We ran Marvel on 15 representative layers with diverse types, shapes, and sizes across ResNet50 (He et al., 2016) and MobileNetV2 (Sandler et al., 2018) on two accelerator configurations. Our approach reduces the mapping space by several orders of magnitude, and the generated optimal mappings for the on-chip subspace demonstrate a geometric-mean improvement of 5.23x higher throughput and 1.12x lower energy consumption relative to the optimal mappings for three state-of-the-art dataflows (Chen et al., 2016; NVIDIA, 2018; Du et al., 2015). In addition, the performance cost of the mappings obtained by our decoupled approach is only 5% higher than that of the best mappings identified by a brute-force exploration similar to the one proposed in the TimeLoop framework (Parashar et al., 2019) (stopped after 48 hours), while our search is close to 300x faster.
2. Background and Related work
In this section, we provide a brief overview of convolutions and their variations in modern DNNs. Then, we briefly discuss spatial DNN accelerators and prior work on mapping convolutions onto these accelerators.
2.1. Convolutions
Convolutional Neural Networks (CNNs) are one of the most popular DNNs for image recognition (Russakovsky et al., 2015; Karpathy and Fei-Fei, 2015; Toshev and Szegedy, 2014; Farabet et al., 2013). Among the many layers in CNN models, convolution layers account for more than 90% of overall computation (Cong and Xiao, 2014; Chen et al., 2016), dominating the overall latency and energy consumption of inference. In general, a convolution layer deals with three four-dimensional tensors: one for the filters (weights), one for the inputs, and one for the outputs, whose conventions and visualizations are shown in fig. 2 (a) and (b), respectively. A regular convolution operation (CONV2D) comprises three-dimensional multiply-and-accumulate (MAC) operations enclosed in four loops, where each three-dimensional MAC yields one element of the output activation tensor. The loop-nest representation of a regular convolution is shown in fig. 2 (c). Fully-connected (FC) layers are also very common and can be viewed as a special case of convolution whose sliding window is as large as the input activation. In addition to CONV2D and FC layers, recent DNN models such as MobileNetV2 (Sandler et al., 2018) and ResNet50 (He et al., 2016) employ diverse layer types, which are briefly described below.
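The CONV2D loop nest of fig. 2 (c) can be sketched as follows. This is a minimal reference implementation, assuming unit stride, no padding, and NCHW-style nested-list indexing, which may differ from the exact conventions of fig. 2 (a):

```python
# A minimal sketch of the seven-loop CONV2D nest described above.
# Dimension names (N, K, C, Y, X, R, S) follow the common convention;
# unit stride and no padding are assumed for simplicity.
def conv2d(inputs, weights, N, K, C, Y, X, R, S):
    P, Q = Y - R + 1, X - S + 1          # output height/width
    outputs = [[[[0.0 for _ in range(Q)] for _ in range(P)]
                for _ in range(K)] for _ in range(N)]
    for n in range(N):                   # batch
        for k in range(K):               # output channels (filters)
            for p in range(P):           # output height
                for q in range(Q):       # output width
                    for c in range(C):   # input channels
                        for r in range(R):      # filter height
                            for s in range(S):  # filter width
                                outputs[n][k][p][q] += (
                                    inputs[n][c][p + r][q + s]
                                    * weights[k][c][r][s])
    return outputs
```

Each output element is the result of a three-dimensional (C x R x S) MAC reduction, matching the description above.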
Pointwise Convolution (PWCONV) Layer. These layers are a special case of regular convolution that operates on a 1x1 filter, i.e., they accumulate partial sums only across input channels (depth) to generate an output activation, resulting in no convolution reuse (sliding-window behavior).
Depthwise Convolution (DWCONV) Layer. These layers perform the same operation as regular convolutions but do not accumulate partial sums across input channels (depth). In addition, these layers have a filter batch size (K) of one, resulting in no filter reuse. However, they can still exploit the convolution reuse present within an input activation channel. A loop-nest representation of these layers is shown in fig. 2 (d).
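The DWCONV nest can be sketched similarly (unit stride, no padding, illustrative nested-list indexing):

```python
# A sketch of the depthwise-convolution (DWCONV) nest described above:
# unlike CONV2D, there is no accumulation across input channels; each
# input channel c produces its own output channel (K = 1 filter batch).
def dwconv(inputs, weights, N, C, Y, X, R, S):
    P, Q = Y - R + 1, X - S + 1          # output height/width
    outputs = [[[[0.0 for _ in range(Q)] for _ in range(P)]
                for _ in range(C)] for _ in range(N)]
    for n in range(N):
        for c in range(C):               # one filter per input channel
            for p in range(P):
                for q in range(Q):
                    for r in range(R):
                        for s in range(S):
                            outputs[n][c][p][q] += (
                                inputs[n][c][p + r][q + s]
                                * weights[c][r][s])
    return outputs
```

Note the missing channel-reduction loop relative to CONV2D, which is what removes the cross-channel partial-sum accumulation.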
Residual Addition (RAdd) Layer. Residual links in inference perform element-wise additions of inputs and weights within a channel, and do not perform reduction across channels. These layers can be viewed as depthwise separable layers with filter width and height of one; as a result, they have neither filter reuse nor convolution reuse.
2.2. Spatial DNN Accelerators
Spatial DNN accelerators based on ASICs and FPGAs have emerged to address the extreme performance and energy-efficiency demands of CNN layers (Chen et al., 2016, 2014; NVIDIA, 2018; Parashar et al., 2017; Sharma et al., 2016; Jouppi et al., 2017). Such accelerators are built from an array of processing elements (PEs) to provide high parallelism, and use direct communication instead of communication via shared memory for energy efficiency. An abstract model of spatial accelerators is shown in fig. 3, where each PE consists of one or more ALUs dedicated to multiply-accumulate operations (MACs) and a local scratchpad (L1 buffer). Accelerators also employ various networks-on-chip (NoCs) for direct communication among PEs and between the PE array and the L2 scratchpad buffer. The interconnection network often supports multicasting data to multiple PEs, which can reduce the total number of data reads from the L2 buffer to the PEs. Unlike GPU cores, PEs can communicate with adjacent PEs (data forwarding) over the NoC, which can significantly reduce the energy spent on expensive L2 buffer accesses. Accelerators also typically employ a large shared L2 scratchpad buffer to stage data from DRAM as well as partial accumulations from the PE array. Both L1 and L2 scratchpad buffers are software-controlled memories, i.e., the programmer/compiler directly controls the contents of the buffers, unlike cache memories, which are managed implicitly; this is possible because the memory traffic of accelerators is known in advance. Many spatial accelerators can be further interconnected to create a scale-out system (Chung et al., 2018).
Systolic arrays (Jouppi et al., 2017; xDNN) are also popular DNN accelerators; they rely entirely on point-to-point connections among adjacent PEs for input data distribution and partial-sum accumulation. That is, systolic arrays distribute input data and accumulate partial sums via store-and-forward. Typically, systolic arrays are two-dimensional, with one dimension used for data forwarding and the other for partial-sum accumulation. Although systolic arrays can provide high throughput and energy efficiency, they lack flexibility in their dataflow due to their rigid NoC architecture. Such inflexibility permits only limited dataflow styles, which can lead to low compute-unit utilization depending on the layer type and dimensions. Therefore, in this work, we focus on spatial accelerators, which provide more flexibility via the NoC, so that we can explore the substantial benefits of scheduling convolutions onto them.
2.3. Past work on Mapping
The problem of optimally mapping a convolution operation onto a spatial accelerator is described as a loop optimization problem in the literature (Parashar et al., 2019; Ma et al., 2017; Zhang et al., 2015; Zhao et al., 2019), involving multilevel loop tiling, parallelization, and interchange to the seven nested loops of the convolution. As a result, this optimization problem has a huge search space of possible mappings.
Some prior works (Zhao et al., 2019; Chen et al., 2016) focused on developing mappers specific to their architectures, e.g., the mRNA mapper (Zhao et al., 2019) for the MAERI accelerator (Kwon et al., 2018), limiting their applicability to generic spatial accelerators. Other prior works (Zhang et al., 2015; Ma et al., 2017; Motamedi et al., 2016) fixed certain aspects of the loop optimization problem, such as the choice of parallel loops and loop orders, but such choices may not be efficient for rapidly evolving DNN models. Furthermore, the work in (Lu et al., 2017) focused only on selecting parallel loops and the degree of parallelism, ignoring other aspects of the optimization problem. In addition to the above limitations, most prior works use approximate cost models for measuring PE utilization and on-chip communication. Such approximate cost models are not sufficient for precise estimation of throughput and energy efficiency, because edge conditions arising from layer shapes and the degree of parallelism can lead to significant slowdowns.
The TVM compiler infrastructure (Chen et al., 2018) offers an ML-based cost model to find optimal implementations of convolutions on a variety of platforms, including accelerators. However, we believe such ML-based cost models may not be suitable for spatial accelerators for two reasons: 1) statistical ML-based cost models are generally not accurate enough to precisely estimate performance and energy, and failing to account for PE underutilization and edge conditions can lead to significant imprecision; and 2) the ML-based cost models require retraining for even a slight change in the number of PEs in the accelerator configuration, which makes them challenging to use for design-space exploration.
To the best of our knowledge, TimeLoop (Parashar et al., 2019) is the only framework that considers all aspects of the loop-transformation space for a fully flexible spatial accelerator. However, it employs either an exhaustive linear search or a random-sampling-based heuristic to explore the optimization search space. In addition, the framework doesn't consider the data layouts of convolution tensors in its mapping space formulation. In contrast, our approach (Marvel) includes dimension permutation, a form of data layout, in the mapping space formulation along with the loop transformations. Marvel then leverages the proposed "decoupled off-chip/on-chip" approach along with a set of pruning strategies to reduce the mapping space exploration, as described in section 3.
3. Our Decoupled Model-driven Approach
The first step in our approach is to convert a given convolution layer into an equivalent loop-nest form, because loop-nest notation (Parashar et al., 2019; Zhang et al., 2015; Ma et al., 2017) has been widely used to describe mappings of convolutions onto spatial accelerators with multiple levels of memory hierarchy. A sample mapping in loop-nest form for a parametric version of a 1D convolution is shown in fig. 4 (b), and its visualization is shown in fig. 4 (c). We now briefly describe the different aspects of a mapping.
1) Multi-level tiling (including parallelization). A mapping includes multi-level tiling for the multiple levels of the memory hierarchy and the PE array of the accelerator (in this work, we focus on spatial accelerators having only three levels of memory hierarchy, i.e., L1 buffer, L2 buffer, and DRAM; however, our formulation and approach can be extended to more levels): 1) level-1 tiling on the iteration space of a convolution to enhance temporal reuse via the private L1 buffer of a PE, 2) level-2 tiling to parallelize multiple level-1 tiles across the PE array, and 3) level-3 tiling to enhance temporal reuse via the shared L2 buffer. We denote the level-i tile sizes corresponding to the seven loops of the convolution loop nest as T^i_K, T^i_C, T^i_R, T^i_S, T^i_P, T^i_Q, and T^i_N, following the naming conventions described in fig. 2 (a). Similarly, we denote the loop iterators over level-i tiles as t^i_k, t^i_c, t^i_r, t^i_s, t^i_p, t^i_q, and t^i_n for the seven nested loops of the convolution.
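The three tiling levels can be illustrated with a short sketch. This is a hypothetical helper for the parametric 1D convolution of fig. 4, executed sequentially for clarity; T3, T2, and T1 are the assumed level-3/2/1 tile sizes along the output dimension X, and the level-2 loop is the one that would be mapped in parallel across PEs:

```python
# A sketch of three-level tiling for a 1D convolution O[x] += I[x+s] * W[s].
def tiled_conv1d(I, W, X, S, T3, T2, T1):
    O = [0.0] * X
    for x3 in range(0, X, T3):                         # level-3 tiles (DRAM -> L2)
        for x2 in range(x3, min(x3 + T3, X), T2):      # level-2 tiles (parallel over PEs)
            for x1 in range(x2, min(x2 + T2, X), T1):  # level-1 tiles (L2 -> L1)
                for x in range(x1, min(x1 + T1, X)):   # intra-tile work inside a PE
                    for s in range(S):
                        O[x] += I[x + s] * W[s]
    return O
```

The `min(...)` bounds handle partial tiles at loop edges; the factor-based pruning discussed later avoids such partial tiles altogether.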
2) Inter-tile loop orders. A mapping also includes temporal ordering via the inter-tile loop orders (an n-dimensional loop nest after one level of tiling has 2n loops; the first n loops are referred to as inter-tile loops and the latter n loops as intra-tile loops), which describe the execution order of tiles in time, exposing reuse opportunities across adjacent tiles. For example, the level-2 inter-tile loop order reflects spatio-temporal reuse opportunities via the PE array, and similarly, the level-3 inter-tile loop order reflects temporal reuse opportunities via the on-chip L2 buffer. However, the level-1 inter-tile loop order doesn't reflect any reuse opportunities, because these loops are annotated with parallelism and run simultaneously across the PE array. Similarly, the level-1 intra-tile loop order doesn't provide any reuse opportunities, because there is no further intermediate staging between a PE and its L1 buffer.
3) Data layouts. We include dimension permutation (Li et al., 2016) of the convolution tensors in DRAM as part of a mapping, which prior work did not consider in the mapping space of convolutions for spatial accelerators. Data layouts are beneficial in reducing off-chip data movement because accessing data that is laid out contiguously in DRAM requires fewer block transfers. Dimension-permutation layouts were extensively explored in the past to improve spatial data locality for better vectorization efficiency (Kandemir et al., 1998), and also to optimize memory efficiency for CNNs on GPUs (Li et al., 2016). However, to the best of our knowledge, no prior work includes data layouts as part of the search space of mappings for convolutions onto spatial accelerators.

Overall, the mapping space is a Cartesian product of six dimensions representing the different aspects of a mapping: 1) level-1 tile sizes, 2) level-2 tile sizes (parallelism), 3) level-2 inter-tile loop orders, 4) level-3 tile sizes, 5) level-3 inter-tile loop orders, and 6) the data layouts of tensors. The first three dimensions are grouped under the "on-chip mapping subspace" since they influence parallelization and on-chip data movement, and the remaining three dimensions are grouped under the "off-chip mapping subspace" since they influence off-chip data movement.
The size of the mapping space is enormous with the above formulation, making it impractical to explore the mapping space of a layer via brute-force enumeration in a reasonable amount of time. Our approach to this optimization problem is motivated by the observation that off-chip data movement between DRAM and the accelerator is 2-3 orders of magnitude more expensive than on-chip data movement. Hence, we propose a novel approach, referred to as "decoupled off-chip/on-chip", that decomposes the mapping space into two subspaces, i.e., off-chip and on-chip subspaces, and first optimizes the off-chip subspace, followed by the on-chip subspace, which is constructed with the optimal mappings from the off-chip subspace. In contrast to prior work (Parashar et al., 2019), we use different approaches and cost models for these subspaces, i.e., a classical distinct-block (DB) locality cost model (Ferrante et al., 1991; Sarkar, 1997) to explore the off-chip subspace, and a state-of-the-art DNN accelerator cost model, MAESTRO (Kwon et al., 2019), for the on-chip subspace. The overall approach, summarized in fig. 5, is implemented as a standalone tool that takes the convolution layer sizes of a DNN model and the hardware parameters of the target accelerator, and outputs optimal mappings for each of three optimization goals: runtime, energy consumption, and energy-delay product.
3.1. Solving the off-chip mapping subspace
The goal of exploring the off-chip mapping subspace is to find an optimal mapping that minimizes off-chip data movement between DRAM and the L2 buffer of an accelerator. In our work, we assume the L2 buffer to be a software-managed scratchpad buffer, and reducing off-chip data movement is equivalent to finding a level-3 tile with the highest arithmetic intensity, because the highest arithmetic intensity results in the most reuse and the least data transfer. (In the case of non-software-managed scratchpad buffers, reducing data movement between DRAM and the L2 buffer is instead equivalent to finding a maximal level-3 tile whose memory footprint fits into the L2 buffer.)
Since off-chip data movement happens in multiples of the DRAM block size, we redefine arithmetic intensity as the number of operations performed per DRAM block. Minimizing the inverse of this arithmetic intensity is exactly the goal of the optimal tile-size selection problem, which the compiler research community has studied over the last couple of decades using a variety of approaches ranging from analytical models (Sarkar, 1997; Sarkar and Megiddo, 2000; Shirako et al., 2012) to machine learning models (Rahman et al., 2010). However, none of the prior works consider the tile-size selection problem together with data layouts to minimize off-chip data movement, even for CPUs/GPUs.

In our approach, we use the classical distinct-block (DB) locality cost model (Ferrante et al., 1991) to measure off-chip data movement cost; it was developed as part of the memory cost analysis that guides automatic selection of loop transformations and optimal tile sizes (Sarkar, 1997; Sarkar and Megiddo, 2000; Shirako et al., 2012) in IBM XL compilers. The DB model starts with a data layout of the multi-dimensional arrays and a parametrically tiled version of a perfectly nested loop. The model then symbolically estimates the off-chip data movement cost of a tile of computation by counting the number of distinct DRAM blocks required for all the array references in the tile. To explain in detail, let us assume the layouts of the four-dimensional filter, input, and output tensors are SRCK, XYCN, and PQKN (leftmost dimension is the innermost storage dimension), respectively. Then, the number of distinct DRAM blocks (with block size B) required for each of the array references (W[t_s][t_r][t_c][t_k], I[t_q + t_s][t_p + t_r][t_c][t_n], and O[t_p][t_q][t_k][t_n]) inside a level-3 tile of the convolution computation can be expressed as a function of the level-3 tile size variables.
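Under the stated layouts (SRCK, XYCN, PQKN), unit stride, and DRAM block size B, the distinct-block counts can be sketched as follows. This is a reconstruction using ceiling-based block counts along each innermost storage dimension; the exact expressions depend on the conventions of fig. 2 (a):

```latex
\begin{align*}
DB_W &= \left\lceil \frac{T_S}{B} \right\rceil \cdot T_R \cdot T_C \cdot T_K
        && \text{(filter, layout SRCK)} \\
DB_I &= \left\lceil \frac{T_Q + T_S - 1}{B} \right\rceil \cdot (T_P + T_R - 1) \cdot T_C \cdot T_N
        && \text{(input, layout XYCN)} \\
DB_O &= \left\lceil \frac{T_P}{B} \right\rceil \cdot T_Q \cdot T_K \cdot T_N
        && \text{(output, layout PQKN)}
\end{align*}
```

Here the T variables denote level-3 tile sizes; the input expression uses the sliding-window extents (T_Q + T_S - 1 and T_P + T_R - 1) of the input region touched by a tile.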
In this formulation, the innermost access of each reference is divided by the block size, because data movement to and from DRAM happens in multiples of the block size. The total data movement cost (DMC), a.k.a. memory cost per iteration, of a tile is then computed as the number of distinct DRAM blocks required for all references in the tile divided by the total number of operations in the tile.
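The DMC computation can be sketched as a small function. This is a hypothetical helper: the distinct-block expressions follow the SRCK/XYCN/PQKN layouts described above, with unit stride and ceiling-based block counts assumed:

```python
from math import ceil

# A sketch of the data-movement cost (DMC) of a level-3 tile: distinct DRAM
# blocks touched by all three references, divided by the MACs in the tile.
def dmc(Tk, Tc, Tr, Ts, Tp, Tq, Tn, B=64):
    db_w = ceil(Ts / B) * Tr * Tc * Tk                        # filter, layout SRCK
    db_i = ceil((Tq + Ts - 1) / B) * (Tp + Tr - 1) * Tc * Tn  # input, layout XYCN
    db_o = ceil(Tp / B) * Tq * Tk * Tn                        # output, layout PQKN
    ops = Tk * Tc * Tr * Ts * Tp * Tq * Tn                    # MACs in the tile
    return (db_w + db_i + db_o) / ops
```

A lower DMC corresponds to a higher arithmetic intensity per DRAM block, which is exactly what the off-chip search minimizes.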
The optimal level-3 tile sizes and data layouts are computed by minimizing the above data-movement cost function over every layout and tile size in the off-chip mapping subspace, subject to two constraints: 1) the tile size of a loop must be greater than 0 and must not exceed its corresponding loop bound, and 2) the total data required (including double buffering) for a level-3 computation tile must fit into the on-chip L2 buffer. In the past, the problem of determining optimal tiles using the DB model was modeled as a geometric program, transformed into a convex optimization problem (Renganarayana and Rajopadhye, 2008, 2004), and solved using integer geometric programming (IGP) frameworks instead of enumeration. Marvel currently supports both an exhaustive search (feasible because there is only one level of tiling for off-chip data movement) and the IGP formulation for tile-size estimation.
After computing the optimal level-3 tile sizes and data layouts of the tensors, our approach computes the partial derivatives (slopes) of the data-movement cost function (based on the optimal data layout) with respect to the parametric level-3 tile sizes (similar to (Sarkar, 1997)), and evaluates the partial derivatives by substituting the optimal level-3 tile sizes. The key insight is that a more negative partial derivative along a loop indicates fewer distinct elements referenced along that loop, i.e., higher reuse along the loop, so that loop should be placed in the innermost position to exploit maximum temporal reuse. The remaining loops are ordered similarly based on their partial derivative values.
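This derivative-based ordering can be sketched numerically. The helper below is hypothetical: `cost` stands in for the symbolic DMC function, and finite differences stand in for symbolic partial derivatives:

```python
# A sketch of derivative-based loop ordering: rank loops by the
# (finite-difference) slope of an assumed data-movement cost function at
# the optimal level-3 tile sizes; the most negative slope (highest reuse)
# goes innermost.
def order_loops(cost, tiles, h=1.0):
    # tiles: {loop_name: optimal level-3 tile size}
    slopes = {}
    for name in tiles:
        bumped = dict(tiles)
        bumped[name] += h
        slopes[name] = (cost(bumped) - cost(tiles)) / h
    # innermost first: most negative slope = most reuse along that loop
    return sorted(tiles, key=lambda n: slopes[n])
```

For example, with a toy cost 1/T_a + 2/T_b at T_a = T_b = 1, loop b has the steeper negative slope and is placed innermost.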
Rationale for using the DB model: The DB model is a good choice for our approach since it focuses specifically on optimizing off-chip data movement, and it targets perfectly nested loops, of which convolutions are a prime example.
3.2. Solving the on-chip mapping subspace
The on-chip mapping subspace is constructed based on the optimal level-3 tile sizes from the off-chip mapping subspace. Then, our approach explores the constructed subspace using a set of pruning strategies to find optimal mappings for each of the three optimization goals, i.e., higher throughput (runtime), lower energy consumption, and lower energy-delay product. In this work, we use MAESTRO (Kwon et al., 2019), the state-of-the-art DNN accelerator behavioral cost model, to estimate various metrics, including the latency and energy of each mapping in the on-chip subspace.
Rationale for using MAESTRO: A good cost model for on-chip DNN mapping exploration needs three capabilities: (i) describing/modeling the behavior of diverse DNN accelerators; (ii) precisely computing performance and energy, accounting for underutilization, edge conditions, and data reuse or movement across time (via L1/L2 buffers (Chen et al., 2016)), space (via broadcast links (Kwon et al., 2018)), and space-time (via neighboring links (Jouppi et al., 2017; Chen et al., 2017)); and (iii) being lightweight and fast, to enable rapid evaluation of a large search space. We found that MAESTRO (Kwon et al., 2019) meets all three requirements. MAESTRO can model hierarchically composed spatial accelerators with a variable number of PEs and variable connectivity at each level, and can analytically describe the behavior of a variety of DNN accelerators without requiring explicit RTL/cycle-level simulations or access to real hardware. Moreover, the analytical cost model within the MAESTRO framework has been validated against RTL implementations of Eyeriss (Chen et al., 2016) and MAERI (Kwon et al., 2018) on the VGG16 and AlexNet models. In addition, its data-centric representation enables faster computation of estimates because the data movement of a mapping is explicit in its specification, unlike compute-centric representations, which require heavyweight linear-algebra frameworks to estimate data movement precisely. So, before invoking the MAESTRO framework, Marvel translates an on-chip mapping in loop-nest form into the data-centric representation understood by MAESTRO; we omit the translation details in the interest of space. Algorithm 1 shows an overview of our approach to exploring the on-chip mapping subspace along with the pruning strategies, which we describe in detail below.
Level-2 inter-tile loop order. There are a total of 5040 (= 7!) possible level-2 loop orders in the on-chip subspace, and our approach can leverage (if specified) the following two domain-specific pruning strategies to reduce the number of loop orders: 1) a symmetry property (does not prune optimal mappings), and 2) unrolling the loops corresponding to filter width and height (can prune optimal mappings).
The symmetry property is motivated by the observation that a convolution operation generally operates on square-shaped input activation and filter tensors. In addition, the loop iterators along the input width (t_q and t_s) are tightly coupled in the input array subscript (I[t_q + t_s][t_p + t_r][t_c][t_n]), and similarly for the input height (t_p and t_r). This leads to an interesting observation: exchanging the iterators corresponding to input width with those for input height, and filter width with filter height, doesn't alter the semantics if the input activation and filter tensors after level-3 tiling are square-shaped. This property helps prune loop orders: exploring one loop order allows us to safely ignore its width/height-exchanged counterpart without missing optimal mappings.
The width and height of filters are very small (e.g., 3x3 or 1x1) because modern DNN models focus on reducing the total number of operations (Sandler et al., 2018). We leverage this trend by unrolling these loops, which reduces the 7D loop nest to a 5D loop nest. Besides, in most inference scenarios a batch size (N) of 1 is used, especially on mobile/edge devices; this further prunes the search space by ignoring the loop corresponding to batch size. As a result, the total number of level-2 inter-tile loop orders is reduced from 7! (= 5040) to 4!/2 (= 12), a 420x reduction in the on-chip mapping subspace.
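The loop-order reduction above can be reproduced with a short sketch (loop names are illustrative; the symmetry property is applied here by identifying each order with its width/height-exchanged counterpart):

```python
from itertools import permutations

# Start from the 7 level-2 inter-tile loops, then (i) unroll the filter
# loops R and S, (ii) drop the batch loop N, and (iii) keep only one
# representative of each P<->Q (height<->width) exchanged pair of orders.
loops = ['K', 'C', 'P', 'Q', 'R', 'S', 'N']
all_orders = list(permutations(loops))                      # 7! = 5040

remaining = [l for l in loops if l not in ('R', 'S', 'N')]  # K, C, P, Q
orders = set(permutations(remaining))                       # 4! = 24

def canonical(order):
    # Identify an order with its P<->Q mirror and keep one representative.
    swapped = tuple({'P': 'Q', 'Q': 'P'}.get(l, l) for l in order)
    return min(order, swapped)

pruned = {canonical(o) for o in orders}                     # 4!/2 = 12
print(len(all_orders), len(orders), len(pruned))            # 5040 24 12
```

Each mirrored pair contributes exactly one representative, giving the 420x reduction (5040 / 12) stated above.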
Level-2 tile sizes. The level-2 tile size of a loop indicates the degree of parallelism along the loop, and a tile size of one indicates that no parallelism is exploited along that loop dimension. Prior works (Chen et al., 2016; NVIDIA, 2018; Zhang et al., 2015; Ma et al., 2017) exploited at most two loops for parallelism, while the work in (Chen et al., 2019) demonstrates the need to go beyond two loops to achieve peak performance, especially for modern convolution layers in which certain dimensions are small. In Marvel, the number of loops to exploit for parallelism is a configuration setting provided by the user, and our approach prunes the level-2 tile sizes based on the provided value. In addition, we introduce a new parameter called the "PE utilization bound (p)" to further prune the search space of level-2 tile sizes by requiring the overall PE-array utilization to be at least the bound. This technique is beneficial when the optimization goal is throughput, because the highest throughput is typically obtained at high PE utilization rates (Chen et al., 2019).
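As an illustrative sketch (function and parameter names are hypothetical), candidate level-2 tile sizes can be enumerated for a user-specified number of parallel loops and filtered by the PE-utilization bound p:

```python
from itertools import combinations, product

# Level-2 tile-size (parallelism) pruning with a PE-utilization bound.
def level2_candidates(dims, num_pes, util_bound=0.8, max_parallel_loops=2):
    # dims: {loop_name: loop_bound}; a level-2 tile size d along a loop
    # means d level-1 tiles run in parallel along that loop.
    candidates = []
    for chosen in combinations(list(dims), max_parallel_loops):
        for sizes in product(*(range(1, dims[n] + 1) for n in chosen)):
            used = 1
            for s in sizes:
                used *= s
            # keep only tile sizes that fit on the PE array AND meet the
            # utilization bound p (= util_bound)
            if used <= num_pes and used / num_pes >= util_bound:
                candidates.append(dict(zip(chosen, sizes)))
    return candidates
```

For example, with loop bounds K = C = 16 on a 168-PE array and p = 0.8, only tile-size pairs occupying 135 to 168 PEs survive the pruning.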
Level-1 tile sizes. Level-1 tiling is ignored in most prior accelerator designs (Zhang et al., 2015; Ma et al., 2017; Zhao et al., 2019) because those accelerators have no private buffer inside a PE. However, the Eyeriss accelerator (Chen et al., 2016) showcased the benefit of a private buffer by exploiting row-stationary reuse through it. In our approach, we include level-1 tile sizes in the on-chip space and explore tile sizes such that the memory footprint of the level-1 tile computation (including double buffering) fits in the L1 buffer.
Our approach also includes a pruning strategy that chooses level-1 and level-2 tile sizes that do not result in any prologues or epilogues, i.e., the tile sizes are factors of the loop bounds. This strategy simplifies the design of a PE and the control-signal generation inside the accelerator, but it can miss optimal mappings. All of the above pruning strategies can be enabled or disabled in Marvel via input parameters.
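The factor-only and L1-capacity pruning can be sketched as follows. The footprint formula, the 8-bit element size, and the example layer shape are illustrative assumptions, not Marvel's exact accounting:

```python
def divisors(n):
    # Factor-only level-1 tile sizes: every tile evenly divides its
    # loop bound, so no prologue/epilogue tiles arise.
    return [d for d in range(1, n + 1) if n % d == 0]

def l1_footprint_bytes(k, c, r, s, p, q, bytes_per_elem=1):
    # Footprint of one level-1 tile at 8-bit precision, doubled for
    # double buffering; the input tile carries an (R-1)/(S-1) halo.
    weights = k * c * r * s
    inputs = c * (p + r - 1) * (q + s - 1)
    outputs = k * p * q
    return 2 * (weights + inputs + outputs) * bytes_per_elem

# e.g. tiling only the output-width loop (bound 56) of a 3x3 layer
# with K=4, C=1 under a 512 B L1 buffer:
tiles = [tp for tp in divisors(56)
         if l1_footprint_bytes(4, 1, 3, 3, tp, 1) <= 512]
print(tiles)  # [1, 2, 4, 7, 8, 14, 28] -- the full bound 56 no longer fits
```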
4. Evaluation
In this section, we begin with an overview of the experimental setup and DNN layer descriptions used in our evaluation. Then, we present the evaluation of mappings generated by Marvel, and discuss insights from the mappings while comparing them with previous work.



Table 1. Target accelerator platforms.
Parameter | Platform 1 | Platform 2
#PEs | 168 | 1024
Clock frequency | 200 MHz | 200 MHz
GigaOps/sec (GOPS) | 67.2 | 409.6
NoC bandwidth (GB/s) | 2.4 | 25.6
L1 buffer size | 512 B | 512 B
L2 buffer size | 108 KB | 108 KB
DRAM block size (Jedec, 2017) | 64 B | 64 B
Target accelerators. Marvel is applicable to any spatial accelerator, since it abstracts accelerator details as #PEs, L1/L2 buffer sizes, NoC bandwidth, reduction/multicast support, etc., which can model a wide variety of accelerators, including Eyeriss (Chen et al., 2016), NVDLA (NVIDIA, 2018), TPU (11), and xDNN. Due to space limitations, we present our evaluation for only two accelerator platforms (shown in table 1): an Eyeriss-like accelerator (Chen et al., 2016) with 168 PEs and 2.4 GB/s NoC bandwidth, and another accelerator with 1024 PEs and 25.6 GB/s. We inherit the L1 buffer, L2 buffer, and clock frequency of both platforms from Eyeriss (Chen et al., 2016), i.e., a 512 B L1 buffer, a 108 KB L2 buffer, and a 200 MHz clock frequency. The bidirectional NoC used in our evaluation is a two-level hierarchical bus with multicast support similar to Eyeriss. It delivers input activations and filter values (collectively, ingress traffic) from the L2 buffer to the PE array and returns output activations (egress traffic) from the PE array to the L2 buffer.
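The peak-throughput entries in table 1 follow directly from #PEs and clock frequency, assuming each PE completes one MAC (two operations) per cycle:

```python
def peak_gops(num_pes, clock_mhz, ops_per_mac=2):
    # One MAC = multiply + add = 2 ops; MHz * ops -> GOPS via /1000.
    return num_pes * clock_mhz * ops_per_mac / 1000

print(peak_gops(168, 200))   # 67.2  (platform 1)
print(peak_gops(1024, 200))  # 409.6 (platform 2)
```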









Table 2. Arithmetic-intensity ranges (minimum-maximum) for each layer type of MobileNetV2 and ResNet-50. Intensities range from below one for residual adds to several hundred for regular and pointwise convolutions.
Table 3. Layers used in our evaluation: layer sizes, optimal level-3 tile sizes, and optimal data-layouts (innermost dimension of the Weight, Input, and Output tensors).
DNN model | Type | No. | Layer | Layer sizes (K C R S P Q) | Stride | Level-3 tile sizes (T_K T_C T_R T_S T_P T_Q) | Data-layouts (W I O)
MobileNetV2 | RAdd | L1 | | 1 16 1 1 112 112 | 1 | 1 2 1 1 112 112 | C X P
| DSCONV | L2 | Bottleneck1_1_2 | 1 32 3 3 110 110 | 1 | 1 1 3 3 110 110 | S X P
| | L3 | Bottleneck4_3_2 | 1 192 3 3 12 12 | 1 | 1 64 3 3 12 12 | C C K
| | L4 | Bottleneck6_2_2 | 1 576 3 3 5 5 | 1 | 1 64 3 3 5 5 | C C K
| PWCONV | L5 | Bottleneck1_1_3 | 16 32 1 1 112 112 | 1 | 16 32 1 1 56 16 | C X P
| SPWCONV | L6 | CONV1 | 32 32 1 1 224 224 | 2 | 32 32 1 1 56 14 | K X P
| FC | L7 | CONV2D_3 | 1000 1280 1 1 1 1 | 1 | 125 320 1 1 1 1 | C C K
ResNet50 | REGULAR | L8 | CONV2_2_2 | 64 64 3 3 54 54 | 1 | 64 32 3 3 54 6 | K X K
| | L9 | CONV5_3_2 | 512 512 3 3 5 5 | 1 | 64 64 3 3 5 5 | K C K
| RAdd | L10 | CONV5_3_Residual | 1 2048 1 1 7 7 | 1 | 1 1024 1 1 7 7 | C C P
| PWCONV | L11 | CONV2_1_2 | 64 64 1 1 56 56 | 1 | 64 64 1 1 7 56 | K C K
| | L12 | CONV3_4_1 | 128 256 1 1 28 28 | 1 | 64 256 1 1 4 28 | K C K
| | L13 | CONV5_1_3 | 2048 512 1 1 7 7 | 1 | 128 256 1 1 7 7 | K C K
| SPWCONV | L14 | CONV5_1_1 | 512 1024 1 1 7 7 | 2 | 128 256 1 1 7 7 | K C K
| FC | L15 | FC1000 | 1000 2048 7 7 1 1 | 1 | 500 2 7 7 1 1 | K X K
Network layers. Modern DNNs such as MobileNetV2 (Sandler et al., 2018) and ResNet-50 (He et al., 2016) include diverse types of convolution layers, such as regular, depthwise-separable, and pointwise convolutions. As shown in table 2, these layers have diverse arithmetic intensities; e.g., residual links have very low arithmetic intensity because they are identity (element-wise) operators, whereas regular convolutions with non-unit filter sizes and pointwise convolutions have very high arithmetic intensity because they reuse input activations across multiple filters. In our evaluation, we consider layers from each type listed in table 2, and we paid special attention to layers with lower arithmetic intensity (e.g., depthwise-separable) since they are bandwidth-limited, appear frequently in recent networks, and have not been deeply explored in prior work. table 3 lists the 15 layers, labeled L1 to L15, considered in our evaluation. We use 8-bit fixed-point precision for both activations and filters (weights).
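As a rough illustration of why pointwise layers sit far above depthwise layers in arithmetic intensity, one can compute MACs per byte of ingress traffic. This best-case formula, which assumes each operand is fetched once, is our sketch, not the exact definition behind table 2:

```python
def arithmetic_intensity(k, c, r, s, p, q, stride=1, bytes_per_elem=1):
    # MACs per byte of ingress traffic (weights + input activations),
    # assuming each operand is fetched from L2 exactly once.
    x = (p - 1) * stride + s   # input width incl. halo
    y = (q - 1) * stride + r   # input height incl. halo
    macs = k * c * r * s * p * q
    ingress = (k * c * r * s + c * x * y) * bytes_per_elem
    return macs / ingress

# A 1x1 (pointwise) layer reuses every input across all K filters,
# so its intensity dwarfs that of a single-filter depthwise layer:
pointwise = arithmetic_intensity(512, 512, 1, 1, 7, 7)   # ~44.7
depthwise = arithmetic_intensity(1, 576, 3, 3, 5, 5)     # ~3.9
```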
Methodology. In the evaluation, we passed the following pruning strategies as input to Marvel: 1) fully unrolling the loops corresponding to filter width and height; 2) setting the batch size (N) to one, which captures the low-latency use case and also represents a more challenging setup for energy efficiency and throughput (Chen et al., 2019); 3) pruning tile sizes based on the finite L1 and L2 buffer sizes; 4) a minimum PE-array utilization bound of 0.1; 5) the symmetric pruning strategy; and 6) exploring only tile sizes that divide the loop bounds evenly, i.e., no prologues or epilogues (we also evaluated without this strategy). table 4 shows the impact of our decoupling and pruning strategies on the original search space of schedules for all fifteen layers, with an average reduction of roughly ten orders of magnitude in the mapping space.
Table 4. Minimum, average, and maximum sizes of the mapping search space across the fifteen layers: the original space, the off-chip and on-chip schedule subspaces after decoupling, and both subspaces after decoupling combined with pruning.
4.1. Evaluation of generated off-chip subspace mappings
The first step in Marvel is to compute optimal mappings of the off-chip subspace (i.e., level-3 tile sizes, level-3 inter-tile loop order, and tensor data-layouts) to minimize the off-chip data movement cost; the optimal mappings for all fifteen layers are shown in table 3. Since both accelerator platforms have the same DRAM block size and on-chip L2 buffer, the optimal mappings of the off-chip subspace are the same for both platforms. Some interesting observations from these results are described below.
Impact of layer type on level-3 tile sizes. The generated optimal level-3 tile sizes of the P, Q, R, S loops for depthwise convolutions (e.g., L2, L3, and L4) are the same as the corresponding loop bounds. Such tile sizes completely exploit the convolution (sliding-window) reuse present in an input channel without going back to DRAM, which matches the reuse structure of depthwise-separable convolutions: they exhibit only convolution reuse, not filter reuse.
Impact of data-layouts on level-3 tile sizes. In this work, Marvel integrates data-layouts (array dimension permutations) with the level-3 tile-selection problem to minimize off-chip data movement, which none of the prior works considered, even for CPUs/GPUs. Since one of the pruning strategies fixes the batch size at one, output activations have three dimensions, input activations have three dimensions, and weights have four dimensions. As a result, there are 864 (= 4! x 3! x 3!) possible dimension permutations as data-layouts; however, we only consider permutations with a unique innermost dimension, since the order of the remaining dimensions does not affect our data movement cost function. Hence, only 36 (= 4 x 3 x 3) possibilities remain, all of which we considered in our evaluation.
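The two counts can be checked directly (the tensor dimension names are the ones used above):

```python
from itertools import permutations

# With batch size 1: weights have 4 dims (K, C, R, S), input
# activations 3 (C, X, Y), output activations 3 (K, P, Q).
full = (len(list(permutations("KCRS")))
        * len(list(permutations("CXY")))
        * len(list(permutations("KPQ"))))
print(full)  # 864 = 4! * 3! * 3!

# Only the innermost dimension of each tensor affects the
# block-level data-movement cost, so one representative layout
# per innermost-dimension choice suffices:
unique = 4 * 3 * 3
print(unique)  # 36
```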
In fig. 6, each colored circle represents the impact of the optimal level-3 tile sizes for a given layer (x-axis) and a layout (circle) on the off-chip data movement cost, normalized to the best of all layouts for that layer. Considering data-layouts enabled Marvel to find better tile sizes, especially for energy efficiency; e.g., the optimal tile sizes of CNN layer L7 in fig. 6 with a good layout resulted in close to 100x smaller data movement cost (which translates into energy reduction) compared to the optimal tile sizes with a bad layout.
Even though data-layouts of the input, weight, and output tensors can significantly reduce the off-chip data movement cost of a single convolution layer, an explicit data-layout conversion operator, in hardware or software, is required whenever adjacent layers use different layouts, and the overhead of such conversion depends on the complexity of the layout translation and on tensor shapes and sizes. However, Marvel can also be used to find uniform data-layouts across different convolution layers. Such uniform layouts avoid translation overhead at every layer (except the first), but may be suboptimal for some layers. fig. 7 shows the impact of each layout on the total off-chip data movement cost, normalized to the optimal per-layer layouts (column "Data-layouts" in table 3) without translation overhead. With unified data-layouts (laying out input-activation height (X), output-activation channels (K), and the batch of filters (K) in the innermost position of the input, output, and filter tensors, respectively), the degradation is within 5% of the total off-chip data movement cost obtained with the optimal layout at each layer.
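Selecting a uniform layout can be sketched as picking the single layout that minimizes the summed per-layer cost; the cost numbers and layout tuples below are hypothetical:

```python
def best_uniform_layout(cost):
    # cost: {layout: per-layer off-chip data-movement costs}.
    # Pick the single layout minimizing total cost over all layers,
    # trading per-layer optimality for zero conversion overhead
    # between adjacent layers.
    return min(cost, key=lambda layout: sum(cost[layout]))

cost = {
    # (input, output, filter) innermost dims -- hypothetical costs
    ("X", "K", "K"): [10, 12, 11],
    ("C", "P", "C"): [9, 30, 15],
}
print(best_uniform_layout(cost))  # ('X', 'K', 'K')
```

Note that ("C", "P", "C") is cheaper for the first layer but loses overall, mirroring the within-5% trade-off observed in fig. 7.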
4.2. Evaluation of generated on-chip subspace mappings
Marvel uses the computed optimal mappings of the off-chip subspace to construct the on-chip subspace, and then explores that space with the set of pruning techniques described in our methodology to find optimal on-chip mappings for higher throughput, lower energy consumption, and lower energy-delay product.
Experimental variants. We compare Marvel-generated on-chip mappings for each layer and accelerator platform with three other state-of-the-art on-chip mappings: QR-partitioned (RS-like (Chen et al., 2016)), KC-partitioned (DLA-like (NVIDIA, 2018)), and QP-partitioned (ShiDianNao-like (Du et al., 2015)). A QR-partitioned mapping refers to a mapping that exploits parallelism across output height (Q) and filter height (R) with row-stationary reuse behavior, and similarly for the other mappings. The complete on-chip mappings of the state-of-the-art dataflows were obtained by fixing the aspects dictated by each dataflow (such as loop order and parallelization), varying the free aspects with Marvel, and picking the best. We also compared with a variant of Marvel in which the pruning strategy on prologues and epilogues is disabled. In addition, for throughput, we compared with the attainable (peak) performance given by the roofline models of both accelerator platforms. To evaluate all of the above variants, we used MAESTRO (Kwon et al., 2019), a state-of-the-art analytical cost model, validated against RTL, that accurately estimates metrics such as throughput, energy consumption, and latency. A comparison of Marvel-generated on-chip mappings with the other variants for throughput and energy consumption is shown in fig. 8; Marvel achieved geometric-mean improvements of 5.23x higher throughput and 1.12x lower energy consumption relative to the state-of-the-art mappings across all fifteen layers and both accelerator platforms. Some interesting insights from the Marvel-generated mappings are described below.
Low-arithmetic-intensity layers. Residual layers, fully-connected layers, and depthwise convolution layers have low arithmetic intensity of ingress data, i.e., weights and input activations. This is because residual layers (e.g., L1, L10) and fully-connected layers (e.g., L7, L15) have no convolution (sliding-window) reuse, and depthwise convolutions (e.g., L2, L3, L4) have no filter reuse. As a result, the performance of these layers is often bounded by NoC bandwidth, and the Marvel-generated mappings achieve performance close to the attainable bound from the roofline. In addition, fully-connected layers produce only a single output activation per output channel, so any mapping exploiting parallelism across the width (P) or height (Q) of an output channel results in lower performance (e.g., the QP-partitioned and QR-partitioned mappings on L7 and L15). Similarly, residual and depthwise layers have only one filter, so any mapping exploiting parallelism along the number of filters results in lower performance (e.g., KC-partitioned on L1 and L10). Unlike residual layers, depthwise layers have non-unit filter width (S) and height (R), which exploits convolution reuse via the sliding behavior. But the resolution (width and height) of the input activations decreases from early to late layers, yielding less convolution reuse deeper in the model; hence, the performance of the Marvel-generated mappings decreases from L2 to L4 while remaining close to the attainable roofline bound.
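The attainable bound referenced above is the standard roofline, shown here with platform 1's parameters from table 1 (the example intensities are illustrative):

```python
def attainable_gops(intensity_ops_per_byte, peak_gops, noc_gb_per_s):
    # Classic roofline: performance is capped by either the compute
    # roof or the NoC bandwidth times arithmetic intensity.
    return min(peak_gops, noc_gb_per_s * intensity_ops_per_byte)

# Platform 1: 67.2 peak GOPS, 2.4 GB/s NoC bandwidth.
# A residual add with intensity < 1 op/byte is bandwidth bound:
print(attainable_gops(0.9, 67.2, 2.4))   # 2.16 GOPS
# High-intensity layers hit the compute roof instead:
print(attainable_gops(50.0, 67.2, 2.4))  # 67.2 GOPS
```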
High-arithmetic-intensity layers. In general, the arithmetic intensities of regular convolutions, pointwise convolutions, and their strided variants are higher because of filter reuse, input-activation reuse, and convolution reuse (except for pointwise). Pointwise convolutions have no convolution reuse because of their unit filter width and height, so mappings exploiting convolution reuse (e.g., QR-partitioned, i.e., row-stationary-like, on L5, L11, L12, and L13) result in lower performance. However, Marvel-generated mappings leveraged the other forms of reuse, such as filter and input-activation reuse, via appropriate loop orders; as a result, Marvel achieved a 6.1x geometric-mean throughput improvement for the pointwise layers. Such large improvements are possible because the state-of-the-art mappings from recent accelerators are optimized for early CNN models with 3x3 or larger kernels, not 1x1 (pointwise) kernels.
For regular convolutions and their strided variants, Marvel-generated mappings did not reach peak performance (e.g., L9 on platform 1). This is because of PE underutilization arising from a mismatch between the parallelism granularity in the mappings and the size of the PE array. For example, the optimal level-3 tile size corresponding to output channels in L14 is 512; if this channel dimension is parallelized over 168 PEs, then only 8 PEs (= 512 mod 168) are utilized in the last step, leaving 95% of the PE array idle during that step. In addition, some layer dimensions are prime numbers (e.g., output width and height are 7 in L13), leading to fewer factorization choices and hence PE underutilization. Such mappings and observations were possible because the accurate cost model, MAESTRO (Kwon et al., 2019), precisely estimates data movement behavior and PE-underutilization scenarios. Furthermore, the performance of the QP mapping on L9 and the KC mapping on L2 is very low because the optimal level-3 tile sizes along those parallel dimensions are very small (e.g., 5 for Q and P in L9, and 1 for K and C in L2); parallelizing across such dimensions yields little performance, which is why these bars appear very small in fig. 8.
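The granularity mismatch can be made concrete with a small wave-scheduling sketch (the wave model is our simplification of how parallel iterations fold onto the array):

```python
from math import ceil

def wave_schedule(parallel_iters, num_pes):
    # Fold parallel_iters onto the PE array: (waves - 1) full waves
    # plus a possibly small final wave that underutilizes the array.
    waves = ceil(parallel_iters / num_pes)
    last = parallel_iters - (waves - 1) * num_pes
    return waves, last

waves, last = wave_schedule(512, 168)
print(waves, last)  # 4 waves; only 8 of 168 PEs (~5%) busy in the last
```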
Disabling pruning on prologues/epilogues. We compared Marvel-generated on-chip mappings with a variant of Marvel in which the pruning strategy on prologues/epilogues is disabled, i.e., tile sizes that are not factors of the loop bounds are also allowed in the mapping space. The throughput of the optimal on-chip mappings found by this variant is on average 2% higher than that of the mappings obtained with the pruning enabled. The energy-optimal mappings found by the variant are 10% better than those obtained with the pruning enabled, but the search time increases by 20x relative to enabling the pruning.
Uniform dataflow. Even though our approach generates efficient per-layer mappings for flexible accelerators, it can also be used to find a uniform dataflow for a rigid accelerator that works well across layers. The dataflow of an accelerator can be viewed as a unique combination of parallelization (parallel loops) and staging behavior (loop order). Marvel keeps a list of optimal mappings for each dataflow behavior and constructs a dataflow that performs well across layers; e.g., it identified a dataflow exploiting parallelism over input channels (C) and output width (P), i.e., CP-partitioned, as the best dataflow for higher throughput on platform 1 across all layers. This is because layers L1 to L6 have higher values of output tensor width and height (P, Q) than of output and input channels (K, C) after level-3 tiling, and the reverse holds for the remaining layers.
4.3. Performance on entire DNN models
Along with the evaluation of the different convolution layer types on both accelerator platforms in the previous subsection, we also evaluated Marvel on four popular DNN models involving convolutions, VGG (Simonyan and Zisserman, 2015), AlexNet (Krizhevsky et al., 2012), ResNet-50 (He et al., 2016), and MobileNetV2 (Sandler et al., 2018), to demonstrate the robustness of our approach.
As seen from Figure 9, our approach achieved a geometric-mean improvement of 3.43x in throughput and a geometric-mean reduction of 9.19% in energy consumption across all four DNN models on both platforms relative to the state-of-the-art dataflows described in Section 4.2. This comprehensive evaluation demonstrates the robustness of our approach and its consistent improvement over the state-of-the-art dataflows across multiple DNN models.
4.4. Comparison with brute-force and random-sampling exploration strategies
To evaluate the quality of the mappings generated by our approach, we compared them with brute-force exploration and random-sampling exploration of the mapping space (similar to the approach proposed in the Timeloop framework (Parashar et al., 2019)), each stopped after 48 hours, for all fifteen layers on both accelerator platforms. (The best mappings found by the brute-force and random-sampling explorations were obtained only after roughly 47 of the allotted 48 hours.) The comparison results are shown in fig. 10. Even though brute-force and random-sampling exploration found better mappings for layers such as L1, L3, L7, and L9, the overall geometric-mean performance (shown in fig. 10) in runtime and energy of both exploration strategies is almost 50% below that of the mappings reported by our approach. In addition, our approach was close to 300x faster in search time than either exploration run for 48 hours.
5. Conclusion & Future work
In this paper, we propose a decoupled off-chip/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces, and optimizes the off-chip subspace first, followed by the on-chip subspace. The motivation for this decomposition is to dramatically reduce the size of the search space, and also to prioritize the optimization of off-chip data movement, which is two to three orders of magnitude more expensive than on-chip data movement. We then introduce Marvel, which implements this approach by leveraging two cost models to explore the subspaces: a classical distinct-blocks (DB) locality cost model for the off-chip subspace, and a state-of-the-art DNN accelerator behavioral cost model, MAESTRO, for the on-chip subspace. Marvel considers a large search space involving data-layouts along with loop transformations, and uses our decoupled approach to reduce the search space by a factor of about ten billion (table 4). Mappings found by Marvel offer a geometric-mean improvement of 5.23x higher throughput and 1.12x lower energy consumption compared to state-of-the-art mappings across 15 layers from the MobileNetV2 and ResNet-50 models on two DNN spatial accelerators. Furthermore, we compared our approach with the brute-force and random-sampling techniques (used in Timeloop (Parashar et al., 2019)) for search-space exploration. In the future, we envision that Marvel can be used for a wide range of applications, including neuro-architecture search.
References
 SnaPEA: predictive early activation for reducing computation in deep convolutional neural networks. In International Symposium on Computer Architecture (ISCA).
 End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
 TVM: an automated end-to-end optimizing compiler for deep learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'18, pp. 579–594.
 DianNao: a small-footprint high-throughput accelerator for ubiquitous machine learning. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
 Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In International Symposium on Computer Architecture (ISCA).
 Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52 (1), pp. 127–138.
 Eyeriss v2: a flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems.
 Serving DNNs in real time at datacenter scale with Project Brainwave. IEEE Micro 38 (2), pp. 8–20.
 Minimizing computation in convolutional neural networks. In International Conference on Artificial Neural Networks (ICANN), pp. 281–290.
 ShiDianNao: shifting vision processing closer to the sensor. In International Symposium on Computer Architecture (ISCA).
 [11] (2019) Edge TPU: Google's purpose-built ASIC designed to run inference at the edge. Note: https://cloud.google.com/edgetpu/
 Learning hierarchical features for scene labeling. PAMI 35 (8), pp. 1915–1929.
 On estimating and enhancing cache effectiveness. In International Workshop on Languages and Compilers for Parallel Computing, pp. 328–343.

 Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
 MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
 DDR4 SDRAM standard. Note: https://www.jedec.org/standardsdocuments/docs/jesd794a
 In-datacenter performance analysis of a tensor processing unit. In International Symposium on Computer Architecture (ISCA), pp. 1–12.
 Improving locality using loop and data transformations in an integrated framework. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 31, pp. 285–297.
 Deep visual-semantic alignments for generating image descriptions. In Conference on Computer Vision and Pattern Recognition (CVPR).
 A communication-centric approach for designing flexible DNN accelerators. In Proceedings of the 12th International Workshop on Network on Chip Architectures (NoCArc), pp. 6:1–6:1.
 ImageNet classification with deep convolutional neural networks. In NIPS.
 Understanding reuse, performance, and hardware cost of DNN dataflow: a data-centric approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO '52, pp. 754–768.
 MAERI: enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 461–475.
 Optimizing memory efficiency for deep convolutional neural networks on GPUs. In SC'16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 633–644.
 FlexFlow: a flexible dataflow accelerator architecture for convolutional neural networks. In International Symposium on High Performance Computer Architecture (HPCA).
 Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 45–54.
 Design space exploration of FPGA-based deep convolutional neural networks. In 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 575–580.
 NVIDIA Deep Learning Accelerator (NVDLA). Note: https://nvldla.org
 Timeloop: a systematic approach to DNN accelerator evaluation.
 SCNN: an accelerator for compressed-sparse convolutional neural networks. In International Symposium on Computer Architecture (ISCA), pp. 27–40.
 Neural network assisted tile size selection. In International Workshop on Automatic Performance Tuning (IWAPT'2010).
 A geometric programming framework for optimal multi-level tiling. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, SC '04, pp. 18–.
 Positivity, posynomials and tile size selection. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, SC '08, pp. 55:1–55:12.
 ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
 MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
 An analytical model for loop tiling and its solution. In Proceedings of the 2000 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS '00, pp. 146–153.
 Automatic selection of high-order transformations in the IBM XL FORTRAN compilers. IBM J. Res. Dev. 41 (3), pp. 233–264.
 From high-level deep neural models to FPGAs. In IEEE/ACM International Symposium on Microarchitecture (MICRO).
 Analytical bounds for optimal tile size selection. In Proceedings of the 21st International Conference on Compiler Construction, CC'12, pp. 101–121.
 Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
 Efficient processing of deep neural networks: a tutorial and survey. CoRR abs/1703.09039.
 [42] (2017) The future is here - iPhone X (neural engine). Note: https://www.apple.com/newsroom/2017/09/thefutureishereiphonex/

 DeepPose: human pose estimation via deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR).
 Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
 [45] Accelerating DNNs with Xilinx Alveo accelerator cards. Note: https://www.xilinx.com/support/documentation/white_papers/wp504acceldnns.pdf
 Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
 A systematic approach to blocking convolutional neural networks. CoRR abs/1606.04209.
 Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161–170.
 mRNA: enabling efficient mapping space exploration on a reconfigurable neural accelerator. In Proceedings of 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).