Deep learning (DL) is a fundamental technology for many emerging applications such as autonomous driving (Bojarski et al., 2016), translation (Wu et al., 2016), and image classification (Russakovsky et al., 2015), with accuracy close to, and even surpassing, that of humans (Karpathy and Fei-Fei, 2015; Toshev and Szegedy, 2014; Farabet et al., 2013). Modern deep neural networks (DNNs) such as convolutional neural networks (CNNs) require millions of input activations and weights, resulting in hundreds of millions of multiply-and-accumulate (MAC) operations for each inference pass. Moreover, DL inference is increasingly deployed on mobile devices (42) and edge devices (11) for fast response times and data privacy. Achieving low latency and energy goals under the stringent computation and memory constraints of mobile and edge devices has emerged as an important challenge.
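To make the compute footprint concrete, the MAC count of a single convolution layer is simply the product of its output volume and the per-output reduction depth. The sketch below (the helper name is ours, layer dimensions are the standard published ones) counts the MACs of ResNet-50's first 7x7 convolution layer:

```python
def conv_macs(n, k, c, r, s, p, q):
    """MACs in one CONV2D layer: each of the n*k*p*q output elements
    needs a c*r*s-deep multiply-accumulate reduction."""
    return n * k * p * q * c * r * s

# ResNet-50's first layer (7x7 conv, 64 filters, 3 input channels,
# 112x112 output) already costs ~118M MACs for a single image.
macs = conv_macs(n=1, k=64, c=3, r=7, s=7, p=112, q=112)
print(macs)  # 118013952
```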
To cope with this challenge, specialized hardware accelerators for DNN inference are being developed and deployed (Chen et al., 2019; Chung et al., 2018; NVIDIA, 2018; 11; 45). Most of these accelerators are "spatial", i.e., they are built by interconnecting hundreds to thousands of processing elements (PEs). They achieve high throughput by exploiting parallelism over the PEs, and energy efficiency by maximizing data reuse within the PE array via direct data forwarding between PEs and the use of scratchpad memories (Chen et al., 2016, 2014; NVIDIA, 2018; Parashar et al., 2017; Sharma et al., 2016; Jouppi et al., 2017; Ma et al., 2017; Zhang et al., 2015). The mechanism a spatial accelerator uses to exploit parallelism and perform data staging is known as its dataflow (Chen et al., 2016), a crucial component of an accelerator design because it directly impacts the accelerator's performance and energy efficiency. Some state-of-the-art dataflows include the row-stationary (RS) dataflow of Eyeriss (Chen et al., 2016), the DLA dataflow of NVDLA (NVIDIA, 2018), and the output-stationary dataflow of ShiDianNao (Du et al., 2015). A mapper (or compiler) for such a spatial accelerator takes a DNN model, a dataflow, and the hardware parameters of the accelerator as inputs, and generates dataflow-compatible mappings for execution (Sze et al., 2017).
Even though dataflows have been shown to significantly impact the performance of DNN accelerators, no single dataflow is an excellent choice for all shapes and types of convolution layers (Lu et al., 2017; Kwon et al., 2019). As a result, there is substantial interest in flexible spatial accelerators, whose networks and state machines are fully programmable, allowing dynamic reconfiguration of dataflows during execution (Lu et al., 2017; Kwon et al., 2018; Parashar et al., 2019; Chen et al., 2019; Krishna, 2019). The overall performance and energy of these flexible accelerators depend heavily on the compiler's ability to generate efficient mappings, reinforcing the importance of the "mapping problem" (Parashar et al., 2019) for DNN accelerators. The focus of this paper is on efficiently mapping convolutions and related computations onto flexible accelerators for optimized throughput and energy efficiency.
The efficiency of any mapping is tightly coupled with both the algorithmic aspects of DNN models and the microarchitectural aspects of accelerators. On the algorithmic side, DNNs have been changing at an exponential rate since the success of early models like AlexNet (Krizhevsky et al., 2012). For convolutions alone, many new algorithmic techniques have been developed, such as depth-wise convolution (Howard et al., 2017; Sandler et al., 2018), point-wise convolution (Sandler et al., 2018; He et al., 2016), and skip connections (also referred to as residual links or identity operators) (He et al., 2016; Xie et al., 2017). These techniques sacrifice arithmetic intensity (algorithmic reuse) for fewer computations (Table 2). On the microarchitecture side, DNN accelerators have evolved from simple systolic arrays with limited flexibility to spatial arrays that are becoming increasingly complex, with various network-on-chip (NoC) implementations (Chen et al., 2019; Kwon et al., 2018; Lu et al., 2017), reuse mechanisms (Chen et al., 2016, 2019), and flat/hierarchical organizations (Chung et al., 2018). However, much of the prior work on mapping (Zhang et al., 2015; Ma et al., 2017; Zhao et al., 2019; Yang et al., 2016) targeted hardware with limited capabilities and used DNNs with standard convolutions.
Prior work. The mapping problem for convolutions is described as a loop optimization problem in the literature (Parashar et al., 2019; Ma et al., 2017; Zhang et al., 2015; Zhao et al., 2019), involving several transformations, such as multi-level loop tiling, parallelization, and interchange, applied to the seven nested loops of the convolution. As a result, the mapping problem has an enormous search space to explore; for example, a single convolution layer admits a vast number of valid mappings on average when mapping the ResNet50 (He et al., 2016) and MobileNetV2 (Sandler et al., 2018) models onto an Eyeriss-like accelerator (Chen et al., 2016) configuration. Prior work has fixed certain aspects of the mapping space, such as the choice of parallel loops and loop orders, and performed a brute-force exploration (Zhang et al., 2015; Ma et al., 2017; Motamedi et al., 2016) of the remaining search space, but fixing such choices may not be efficient for rapidly evolving convolutions. To the best of our knowledge, TimeLoop (Parashar et al., 2019) is the only framework that considers all aspects of the loop transformation space for a fully flexible spatial accelerator, but it employs either an exhaustive linear search or a random sampling-based heuristic to explore the optimization search space. Also, none of the prior work has included data layouts of tensors as part of the mapping space for spatial accelerators.
Our approach to the mapping problem is motivated by the observation that off-chip data movement between DRAM and the accelerator is 2-3 orders of magnitude more expensive than on-chip data movement involving the PE array and L1/L2 buffers (Chen et al., 2016; Sze et al., 2017). Hence, we propose a novel approach, referred to as "decoupled off-chip/on-chip", that decomposes the mapping space into two subspaces, i.e., off-chip and on-chip subspaces, and first explores the off-chip subspace, followed by the on-chip mapping subspace constructed with the optimal mappings from the off-chip subspace. In contrast to prior work (Parashar et al., 2019), we use different approaches and cost models for these subspaces, i.e., a classical distinct-block (DB) locality cost model (Ferrante et al., 1991; Sarkar, 1997) to explore the off-chip subspace, and a state-of-the-art DNN accelerator cost model, MAESTRO (Kwon et al., 2019), for the on-chip subspace. Note that MAESTRO's DSE tool (Kwon et al., 2019) is limited to design-space exploration of hardware parameters, and doesn't explore the mapping (dataflow) search space as in our approach.
Fig. 1 shows Marvel in a DNN accelerator compilation flow. The goal of Marvel is to formulate and explore the mapping space of convolutions and their variants for a target fully flexible accelerator, and to recommend efficient mapping(s) for code/configuration generation. We explore a much larger space of mappings relative to past work, because our mapping space includes dimension permutation, a form of data-layout, along with the loop transformations. Although our approach generates efficient per-layer mappings for flexible accelerators, it can also be used to find a uniform data-layout (fig. 7), or a uniform dataflow for a rigid accelerator, that works well across layers.
We ran Marvel for 15 representative layers with diverse types, shapes, and sizes across ResNet50 (He et al., 2016) and MobileNetV2 (Sandler et al., 2018) on two accelerator configurations. Our approach reduces the mapping space by a large factor, and the generated optimal mappings for the on-chip subspace demonstrate a geometric mean improvement of 5.23x higher throughput and 1.12x lower energy consumption relative to the optimal mappings of three state-of-the-art dataflows (Chen et al., 2016; NVIDIA, 2018; Du et al., 2015). In addition, the performance cost of the mappings obtained by our decoupled approach is only 5% higher than that of the best mappings identified by a brute-force exploration similar to the one proposed in the TimeLoop framework (Parashar et al., 2019) (stopped after 48 hours), while being close to 300x faster in search time.
2. Background and Related work
In this section, we provide a brief overview of convolutions and their variations in modern DNNs. Then, we briefly discuss spatial DNN accelerators and prior work on mapping convolutions to these accelerators.
Convolution Neural Networks (CNNs) are among the most popular DNNs for image recognition (Russakovsky et al., 2015; Karpathy and Fei-Fei, 2015; Toshev and Szegedy, 2014; Farabet et al., 2013). Among the many layers in CNN models, convolution layers account for more than 90% of the overall computation (Cong and Xiao, 2014; Chen et al., 2016), dominating overall latency and energy consumption in inference. In general, a convolution layer deals with three high-dimensional tensors: a four-dimensional tensor for filters (weights), a four-dimensional tensor for inputs, and another four-dimensional tensor for outputs, whose conventions and visualization are shown in fig. 2 (a) and (b), respectively. A regular convolution operation (CONV2D) comprises three-dimensional multiply-and-accumulate (MAC) operations enclosed in four loops, where each three-dimensional multiply-and-accumulate yields an element of the output activation tensor. The loop-nest representation of a regular convolution is shown in fig. 2 (c). Fully-connected (FC) layers are also very common and can be viewed as a special case of convolution whose sliding window is as large as the input activation. In addition to CONV2D and FC layers, recent DNN models such as MobileNetV2 (Sandler et al., 2018) and ResNet50 (He et al., 2016) employ diverse layer types, which are briefly described below.
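The CONV2D loop nest of fig. 2 (c) can be sketched directly as seven nested loops. The following is an illustrative reference implementation (stride 1, no padding; the tensor orderings are our assumption), not the figure's exact code:

```python
import numpy as np

def conv2d(I, W):
    """Direct CONV2D (stride 1, no padding): seven nested loops.
    I: inputs  [N][C][X][Y],  W: filters [K][C][R][S],
    O: outputs [N][K][P][Q] with P = X-R+1, Q = Y-S+1."""
    N, C, X, Y = I.shape
    K, _, R, S = W.shape
    P, Q = X - R + 1, Y - S + 1
    O = np.zeros((N, K, P, Q))
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):          # one output element per (n,k,p,q)
                    for c in range(C):      # 3D MAC reduction over C, R, S
                        for r in range(R):
                            for s in range(S):
                                O[n, k, p, q] += I[n, c, p + r, q + s] * W[k, c, r, s]
    return O
```

The four outer loops enumerate output elements; the three inner loops form the MAC reduction described above.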
Point-wise Convolution (PWCONV) Layer. These layers are special cases of regular convolutions that operate on a 1x1 filter size, i.e., they accumulate partial sums only across input channels (depth) to generate an output activation, resulting in no convolution reuse (sliding-window behavior).
Depth-wise Convolution (DWCONV) Layer. These layers perform the same operation as regular convolutions, but do not accumulate partial sums across input channels (depth). In addition, these layers have a filter batch size (K) of one, resulting in no filter reuse. However, these layers can still exploit the convolution reuse present within an input activation channel. A loop-nest representation of these layers is shown in fig. 2 (d).
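A sketch of the DWCONV loop nest of fig. 2 (d), under the same illustrative assumptions as the CONV2D sketch (stride 1, no padding, assumed tensor orderings): the channel loop moves out of the reduction, so there is no accumulation across input channels.

```python
import numpy as np

def dwconv2d(I, W):
    """Depth-wise convolution sketch: one filter per input channel,
    and no reduction across channels (C leaves the MAC loop)."""
    N, C, X, Y = I.shape
    _, R, S = W.shape          # W: [C][R][S], filter batch size K = 1
    P, Q = X - R + 1, Y - S + 1
    O = np.zeros((N, C, P, Q))
    for n in range(N):
        for c in range(C):     # each channel is convolved independently
            for p in range(P):
                for q in range(Q):
                    for r in range(R):   # 2D MAC reduction over R, S only
                        for s in range(S):
                            O[n, c, p, q] += I[n, c, p + r, q + s] * W[c, r, s]
    return O
```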
Residual Addition (RAdd) Layer. Residual links in inference perform element-wise additions of inputs and weights within a channel, and do not perform reduction across channels. These layers can be viewed as depth-wise separable layers with filter width and height of one; as a result, they have no filter reuse and no convolution reuse.
2.2. Spatial DNN Accelerators
Spatial DNN accelerators based on ASICs and FPGAs have emerged to address the extreme demands on performance and energy efficiency of CNN layers (Chen et al., 2016, 2014; NVIDIA, 2018; Parashar et al., 2017; Sharma et al., 2016; Jouppi et al., 2017). Such accelerators are built using an array of processing elements (PEs) to provide high parallelism, and use direct communication instead of communication via shared memory for energy efficiency. An abstract model of spatial accelerators is shown in fig. 3, where each PE of an accelerator consists of one or more ALUs dedicated to multiply-accumulate operations (MACs) and a local scratchpad (L1 buffer). Accelerators also employ various networks-on-chip (NoCs) for direct communication among PEs and between the PE array and the L2 scratchpad buffer. The interconnection network often supports multicasting data to multiple PEs, which can reduce the total number of data reads from the L2 buffer to the PEs. Unlike GPU cores, PEs can communicate with adjacent PEs (data forwarding) using the NoC, which can significantly reduce the energy consumed by expensive L2 buffer accesses. Accelerators also typically employ a large shared L2 scratchpad buffer to stage data from DRAM and partial accumulations from the PE array. Both L1 and L2 scratchpad buffers are software-controlled memories, i.e., the programmer/compiler directly controls the contents of the buffers, unlike cache memories, which manage their contents implicitly; this is possible because the memory traffic in accelerators is known in advance. Many spatial accelerators can be further interconnected to create a scale-out system (Chung et al., 2018).
Systolic arrays (Jouppi et al., 2017; xDNN, ) are also popular DNN accelerators; they rely entirely on point-to-point connections among adjacent PEs for input data distribution and partial sum accumulation. That is, systolic arrays distribute input data and accumulate partial sums via store-and-forward. Typically, systolic arrays are two-dimensional, with one dimension used for data forwarding and the other for partial sum accumulation. Although systolic arrays can provide high throughput and energy efficiency, they lack flexibility in their dataflow due to their rigid NoC architecture. Such inflexibility permits only limited dataflow styles, which can lead to low compute unit utilization depending on the layer type and dimensions. Therefore, in this work, we focus on spatial accelerators, which provide more flexibility via the NoC, so that we can explore the massive benefits of scheduling convolutions onto them.
2.3. Past work on Mapping
The problem of optimally mapping a convolution operation onto a spatial accelerator is described as a loop optimization problem in the literature (Parashar et al., 2019; Ma et al., 2017; Zhang et al., 2015; Zhao et al., 2019), involving multi-level loop tiling, parallelization, and interchange to the seven nested loops of the convolution. As a result, this optimization problem has a huge search space of possible mappings.
Some prior works (Zhao et al., 2019; Chen et al., 2016) focused on developing mappers specific to their architectures, e.g., the mRNA mapper (Zhao et al., 2019) for the MAERI accelerator (Kwon et al., 2018), limiting their applicability to generic spatial accelerators. In addition, other prior works (Zhang et al., 2015; Ma et al., 2017; Motamedi et al., 2016) fixed certain aspects of the loop optimization problem, such as the choice of parallel loops and loop orders, but such choices may not be efficient for rapidly evolving DNN models. Furthermore, past work in (Lu et al., 2017) focused only on selecting parallel loops and the degree of parallelism, ignoring other aspects of the optimization problem. In addition to the above limitations, most prior works use approximate cost models for measuring PE utilization and on-chip communication. Such approximate cost models are not sufficient for precise estimation of throughput and energy efficiency, because edge conditions arising from layer shapes and the degree of parallelism can lead to significant slowdowns.
The TVM compiler infrastructure (Chen et al., 2018) offers an ML-based cost model to find optimal implementations of convolutions on a variety of platforms, including accelerators. However, we believe that such ML-based cost models may not be suitable for spatial accelerators for two reasons: 1) statistical ML-based cost models are generally not accurate enough to precisely estimate performance and energy, and failing to account for PE under-utilization and edge conditions can lead to significant imprecision, and 2) the ML-based cost models must be retrained even for a slight change in the number of PEs in the accelerator configuration, which makes them challenging to use for design-space exploration.
To the best of our knowledge, TimeLoop (Parashar et al., 2019) is the only framework that considers all aspects of the loop transformation space for a fully flexible spatial accelerator. However, it employs either an exhaustive linear search or a random sampling-based heuristic to explore the optimization search space. In addition, the framework doesn't consider data-layouts of convolution tensors in its mapping space formulation. Our approach (Marvel), in contrast, includes dimension permutation, a form of data-layout, in the mapping space formulation along with the loop transformations. Marvel then leverages the proposed "decoupled off-chip/on-chip" approach along with a set of pruning strategies to reduce the mapping space exploration, as described in section 3.
3. Our Decoupled Model-driven Approach
The first step in our approach is to convert a given convolution layer into an equivalent loop-nest form, because loop-nest notation (Parashar et al., 2019; Zhang et al., 2015; Ma et al., 2017) has been widely used to describe mappings of convolutions onto spatial accelerators with multiple levels of memory hierarchy. A sample mapping in loop-nest form for a parametric version of a 1D convolution is shown in fig. 4 (b), and a visualization of it in fig. 4 (c). We now briefly describe the different aspects of a mapping.
1) Multi-level tiling (including parallelization). A mapping includes multi-level tiling for the multiple levels of the memory hierarchy and the PE array of the accelerator (in this work, we focus on spatial accelerators having only three levels of memory hierarchy, i.e., L1 buffer, L2 buffer, and DRAM; however, our formulation and approach can be extended to more levels): 1) level-1 tiling on the iteration space of a convolution to enhance temporal reuse via the private L1 buffer of a PE, 2) level-2 tiling to parallelize multiple level-1 tiles across the PE array, and 3) level-3 tiling to enhance temporal reuse via the shared L2 buffer. We denote the level-i tile sizes corresponding to the seven loops of the convolution loop-nest as T^i_N, T^i_K, T^i_C, T^i_P, T^i_Q, T^i_R, and T^i_S, following the naming conventions described in fig. 2 (a). Similarly, we denote the loop iterators over level-i tiles as t_N, t_K, t_C, t_P, t_Q, t_R, and t_S for the seven nested loops of the convolution.
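For intuition, the three tiling levels can be sketched on the 1D convolution of fig. 4. The code below is a schematic, purely sequential rendering (names and structure are our assumptions); on real hardware the level-2 loop would be distributed across the PE array rather than executed serially.

```python
# Hedged sketch of three-level tiling for a 1D convolution
# O[p] += I[p + s] * W[s], in the spirit of fig. 4(b).
# T3_P / T2_P / T1_P are the level-3 / level-2 / level-1 tile sizes.
def tiled_conv1d(I, W, P, S, T3_P, T2_P, T1_P):
    O = [0.0] * P
    for t3_p in range(0, P, T3_P):                # level-3 tiles: DRAM -> L2
        for t2_p in range(t3_p, min(t3_p + T3_P, P), T2_P):      # level-2: parallel over PEs
            for t1_p in range(t2_p, min(t2_p + T2_P, P), T1_P):  # level-1: L2 -> L1
                for p in range(t1_p, min(t1_p + T1_P, P)):       # intra-tile: inside one PE
                    for s in range(S):
                        O[p] += I[p + s] * W[s]
    return O
```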
2) Inter-tile loop orders. A mapping also includes temporal ordering via the inter-tile loop orders (an n-dimensional loop-nest after one level of tiling has 2n loops; the first n loops are referred to as inter-tile loops and the latter n loops as intra-tile loops), which describe the execution order of tiles in time, exposing reuse opportunities across adjacent tiles. For example, the level-2 inter-tile loop order reflects spatio-temporal reuse opportunities via the PE array, and similarly, the level-3 inter-tile loop order reflects temporal reuse opportunities via the on-chip L2 buffer. However, the level-1 inter-tile loop order doesn't reflect any reuse opportunities, because these loops are annotated with parallelism and run simultaneously across the PE array. Similarly, the level-1 intra-tile loop order doesn't provide any reuse opportunities because there is no further intermediate staging between a PE and its L1 buffer.
3) Data-layouts. We include dimension permutation (Li et al., 2016) as part of the data-layouts of convolution tensors in DRAM in a mapping, which was not considered in the mapping space of convolutions for spatial accelerators by prior work. Data-layouts help reduce off-chip data movement because accessing data that is laid out contiguously in DRAM requires fewer block transfers. Dimension permutation layouts were explored in the past exclusively to improve spatial data locality for better vectorization efficiency (Kandemir et al., 1998), and to optimize memory efficiency for CNNs on GPUs (Li et al., 2016). But, to the best of our knowledge, no prior work has included data-layouts as part of the search space of mappings for convolutions onto spatial accelerators.
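A small sketch of why dimension permutation matters for off-chip traffic, using a distinct-block-style approximation in which only the innermost (contiguous) dimension amortizes over DRAM blocks (the helper and tile values here are illustrative assumptions, not Marvel's implementation):

```python
from math import prod

def distinct_blocks(tile, layout, B=64):
    """Approximate distinct DRAM blocks touched when reading a rectangular
    tile of a tensor. Only the innermost (contiguous) dimension amortizes
    over blocks of B elements; each outer dimension multiplies the count.
    tile: dim name -> tile extent; layout[0] is the innermost dimension."""
    inner = layout[0]
    return (tile[inner] / B + 1) * prod(tile[d] for d in layout[1:])

tile = {"S": 3, "R": 3, "C": 64, "K": 32}
# Filter tile under an S-innermost layout vs a K-innermost layout:
# a tiny contiguous extent (S = 3) wastes most of every 64-element block.
s_inner = distinct_blocks(tile, ["S", "R", "C", "K"])   # 6432.0 blocks
k_inner = distinct_blocks(tile, ["K", "C", "R", "S"])   # 864.0 blocks
```

The same tile costs roughly 7.4x fewer block transfers under the second layout, which is exactly the effect the data-layout dimension of the mapping space exploits.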
Overall, the mapping space is a Cartesian product of six dimensions which represent different aspects of a mapping, i.e., 1) level-1 tile sizes, 2) level-2 tile sizes (parallelism), 3) level-2 inter-tile loop orders, 4) level-3 tile sizes, 5) level-3 inter-tile loop orders, and 6) data-layout of tensors. The first three dimensions are grouped under “on-chip mapping subspace” since they influence parallelization and on-chip data movement, and the remaining three dimensions are grouped under “off-chip mapping subspace” since they influence the off-chip data movement.
The size of the mapping space under the above formulation is enormous, and hence it is impractical to explore the mapping space of a layer via a brute-force enumeration approach in a reasonable amount of time. Our approach to this optimization problem is motivated by the observation that the off-chip data movement between DRAM and the accelerator is 2-3 orders of magnitude more expensive than the on-chip data movement. Hence, we propose a novel approach, referred to as "decoupled off-chip/on-chip", that decomposes the mapping space into two subspaces, i.e., off-chip and on-chip subspaces, and first optimizes the off-chip subspace, followed by the on-chip subspace constructed with the optimal mappings from the off-chip subspace. In contrast to prior work (Parashar et al., 2019), we use different approaches and cost models for these subspaces, i.e., a classical distinct-block (DB) locality cost model (Ferrante et al., 1991; Sarkar, 1997) to explore the off-chip subspace, and a state-of-the-art DNN accelerator cost model, MAESTRO (Kwon et al., 2019), for the on-chip subspace. The overall approach, summarized in fig. 5, is implemented as a standalone tool that takes the convolution layer sizes from a DNN model along with the hardware parameters of the target accelerator, and outputs optimal mappings for each of the three optimization goals, i.e., runtime, energy consumption, and energy-delay product.
3.1. Solving off-chip mapping subspace
The goal of the off-chip mapping subspace exploration is to find an optimal mapping that minimizes off-chip data movement between DRAM and the L2 buffer of an accelerator. In our work, we assume the L2 buffer to be a software-managed scratchpad buffer, and reducing the off-chip data movement (in the case of non-software-managed scratchpad buffers, reducing data movement between DRAM and the L2 buffer is equivalent to finding a maximal level-3 tile whose memory footprint fits into the L2 buffer) is equivalent to finding a level-3 tile with the highest arithmetic intensity; this is because the highest arithmetic intensity results in higher reuse and less data transfer.
Since off-chip data movement happens in multiples of the DRAM block size, we redefine the arithmetic intensity as the number of operations performed per DRAM block. Minimizing the inverse of this arithmetic intensity is precisely the goal of the optimal tile size selection problem, which the compiler research community has studied over the last couple of decades using a variety of approaches, ranging from analytical models (Sarkar, 1997; Sarkar and Megiddo, 2000; Shirako et al., 2012) to machine learning models (Rahman et al., 2010). However, none of the prior work considers the tile size selection problem together with data-layouts to minimize off-chip data movement, even for CPUs/GPUs.
In our approach, we consider the classical distinct-block (DB) locality cost model (Ferrante et al., 1991) to measure the off-chip data movement cost; it was developed as part of the memory cost analysis used to guide automatic selection of loop transformations and optimal tile sizes (Sarkar, 1997; Sarkar and Megiddo, 2000; Shirako et al., 2012) in the IBM XL compilers. The DB model starts with a data-layout of multi-dimensional arrays and a parametric tiled version of a perfectly nested loop. The model then symbolically estimates the off-chip data movement cost of a tile of computation by measuring the number of distinct DRAM blocks required for all the references in the tile. To explain in detail, let us assume the layouts of the 4-dimensional filter, input, and output tensors are SRCK, XYCN, and PQKN (leftmost dimension is the innermost storage dimension), respectively. Then, the number of distinct DRAM blocks (with block size B) required for each of the array references (W[t_S][t_R][t_C][t_K], I[t_S + t_P][t_R + t_Q][t_C][t_N], and O[t_P][t_Q][t_K][t_N]) inside a level-3 tile of the convolution computation is expressed below as a function of the level-3 tile size variables.
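The expressions themselves appear to have been lost from this version of the text. Under the DB model's usual approximation (distinct blocks of a reference: innermost tile extent divided by the block size, plus one, times the remaining tile extents), they would take roughly the following form, where T_d denotes the level-3 tile size of loop d and B the DRAM block size (a reconstruction, not the authors' exact formulas):

```latex
\mathrm{DB}_W = \left(\frac{T_S}{B} + 1\right) \cdot T_R \cdot T_C \cdot T_K
\qquad
\mathrm{DB}_I = \left(\frac{T_S + T_P}{B} + 1\right) \cdot \left(T_R + T_Q\right) \cdot T_C \cdot T_N
\qquad
\mathrm{DB}_O = \left(\frac{T_P}{B} + 1\right) \cdot T_Q \cdot T_K \cdot T_N
```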
In the above formulation, the innermost access of each reference is divided by the block size because data movement with DRAM happens in multiples of the block size. The total data movement cost (DMC), a.k.a. memory cost per iteration, of a tile is then computed as the number of distinct DRAM blocks required for all references in the tile divided by the total number of operations in the tile.
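Putting the pieces together, a minimal sketch of the DMC computation, assuming the SRCK/XYCN/PQKN layouts of the running example and approximate distinct-block counts of the (T/B + 1) form (the function name and exact counts are our assumptions):

```python
def dmc(Ts, B=64):
    """Data movement cost of a level-3 tile: distinct DRAM blocks across
    the three references, divided by the MACs in the tile. Ts maps each
    loop dimension (N, K, C, P, Q, R, S) to its level-3 tile size."""
    db_w = (Ts["S"] / B + 1) * Ts["R"] * Ts["C"] * Ts["K"]                    # filter, SRCK
    db_i = ((Ts["S"] + Ts["P"]) / B + 1) * (Ts["R"] + Ts["Q"]) * Ts["C"] * Ts["N"]  # input, XYCN
    db_o = (Ts["P"] / B + 1) * Ts["Q"] * Ts["K"] * Ts["N"]                    # output, PQKN
    ops = Ts["N"] * Ts["K"] * Ts["C"] * Ts["P"] * Ts["Q"] * Ts["R"] * Ts["S"]
    return (db_w + db_i + db_o) / ops
```

As expected, larger tiles amortize block transfers over more operations and thus lower the DMC, until the L2 capacity constraint bites.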
The optimal level-3 tile sizes and data-layouts are computed by minimizing the above data movement cost function over every layout and tile size in the off-chip mapping subspace, subject to two constraints: 1) the tile size of a loop must be greater than 0 and must not exceed its corresponding loop bound, and 2) the total data required (including double buffering) for a level-3 computation tile must fit into the on-chip L2 buffer. In the past, the problem of determining optimal tiles using the DB model was modeled as a geometric program, then transformed into a convex optimization problem (Renganarayana and Rajopadhye, 2008, 2004), and solved using integer geometric programming (IGP) frameworks instead of enumeration. Marvel currently supports both an exhaustive search (feasible because there is only one level of tiling for off-chip data movement) and the IGP formulation for tile size estimation.
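The exhaustive variant of this search can be sketched as follows; `footprint` and `cost` stand in for the L2 footprint model and the data movement cost function, and all names are illustrative assumptions rather than Marvel's actual interface:

```python
from itertools import product

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def best_level3_tile(bounds, l2_bytes, footprint, cost):
    """Exhaustive level-3 tile search sketch: enumerate factor tile sizes
    per loop, keep tiles whose double-buffered footprint fits in L2, and
    return the tile minimizing the caller-supplied cost model."""
    best, best_cost = None, float("inf")
    dims = sorted(bounds)
    for tile_sizes in product(*(divisors(bounds[d]) for d in dims)):
        tile = dict(zip(dims, tile_sizes))
        if 2 * footprint(tile) > l2_bytes:   # constraint 2: double buffering in L2
            continue
        c = cost(tile)
        if c < best_cost:
            best, best_cost = tile, c
    return best
```

Restricting tile sizes to divisors of the loop bounds also enforces the no-prologue/epilogue pruning described in section 3.2.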
After computing the optimal level-3 tile sizes and data-layouts of tensors, our approach computes the partial derivatives (slopes) of the data movement cost function (based on the optimal data-layout) with respect to the parametric level-3 tile sizes (similar to (Sarkar, 1997)), and evaluates the partial derivatives by substituting the optimal level-3 tile sizes. The key insight is that a more negative partial derivative along a loop indicates fewer distinct elements referenced along that loop, i.e., higher reuse along the loop, so that loop should be kept in the innermost position to exploit maximum temporal reuse. The remaining loops are similarly ordered based on their partial derivative values.
Rationale for using the DB model: The DB model is a good choice for our approach because it focuses precisely on off-chip data movement, and because it targets perfectly nested loops, of which convolutions are a prime example.
3.2. Solving on-chip mapping subspace
The on-chip mapping subspace is constructed based on the optimal values of the level-3 tile sizes from the off-chip mapping subspace. Then, our approach explores the constructed subspace using a set of pruning strategies to find optimal mappings for each of the three optimization goals, i.e., higher throughput (lower runtime), lower energy consumption, and lower energy-delay product. In this work, we use MAESTRO (Kwon et al., 2019), a state-of-the-art DNN accelerator behavioral cost model, to estimate various metrics, including the latency and energy of each mapping in the on-chip subspace.
Rationale for using MAESTRO: A good cost model for on-chip DNN mapping exploration needs three capabilities: (i) describing/modeling the behavior of diverse DNN accelerators, (ii) precisely computing performance and energy, accounting for under-utilization, edge conditions, and data reuse or movement across time (via L1/L2 buffers (Chen et al., 2016)), space (via broadcast links (Kwon et al., 2018)), and space-time (via neighboring links (Jouppi et al., 2017; Chen et al., 2017)), and (iii) being lightweight and fast enough to enable rapid evaluation of a large search space. We found MAESTRO (Kwon et al., 2019) valuable in meeting all three requirements. MAESTRO can model hierarchically composed spatial accelerators with a variable number of PEs and variable connectivity at each level, and can analytically describe the behavior of a variety of DNN accelerators without requiring explicit RTL/cycle-level simulations or access to real hardware. Moreover, the analytical cost model within the MAESTRO framework is validated against the RTL implementations of Eyeriss (Chen et al., 2016) and MAERI (Kwon et al., 2018) on the VGG16 and AlexNet models. In addition, its data-centric representation enables faster computation of estimates because the data movement of a mapping is explicit in its specification, unlike a compute-centric representation, which requires heavyweight linear-algebra frameworks to estimate data movement precisely. Hence, before invoking the MAESTRO framework, Marvel translates an on-chip mapping in loop-nest form into the data-centric representation understood by MAESTRO. We omit the translation details in the interest of space. Algorithm 1 shows an overview of our approach to exploring the on-chip mapping subspace along with the pruning strategies, which we describe in detail below.
Level-2 inter-tile loop order. There are a total of 5040 (= 7!) possible level-2 loop orders in the on-chip subspace, and our approach can leverage (if specified) the following two domain-specific pruning strategies to reduce the number of loop orders: 1) the symmetric property (which doesn't prune optimal mappings), and 2) unrolling the loops corresponding to filter width and height (which can prune optimal mappings).
The symmetric property is motivated by the observation that a convolution operation generally operates on square-shaped input activation and filter tensors. In addition, the loop iterators corresponding to the input and filter width (t_P, t_S) are tightly coupled in the input array subscript (I[t_S + t_P][t_R + t_Q][t_C][t_N]), and similarly for the input and filter height (t_Q, t_R). This leads to an interesting observation: exchanging the iterators corresponding to input width with those of input height, and filter width with filter height, doesn't alter the semantics if the input activation and filter tensors after level-3 tiling are square-shaped. This property helps prune loop orders; for example, exploring the loop order t_N t_K t_C t_P t_Q t_R t_S allows us to safely ignore the loop order t_N t_K t_C t_Q t_P t_S t_R without missing optimal mappings.
The width and height of filters are very small (typically 3x3 or 1x1) because modern DNN models focus on reducing the total number of operations (Sandler et al., 2018). We leverage this trend in our approach by unrolling these loops, which reduces the 7D loop nest to a 5D loop nest. Besides, in most inference scenarios, a batch size (N) of 1 is used, especially on mobile/edge devices. This further prunes the search space by ignoring the loop corresponding to the batch size. As a result, the total number of level-2 tile orders is reduced from 7! (= 5040) to 12, leading to a reduction of 420x in the on-chip mapping subspace.
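The resulting count can be checked by enumeration: with R and S unrolled and the batch loop dropped, four loops remain, and deduplicating orders under the width/height swap of the symmetric property leaves 12 (a sketch; the canonicalization scheme is our assumption):

```python
from itertools import permutations

def canonical(order):
    """Pick one representative per symmetric pair by comparing an order
    against its width/height mirror (P swapped with Q)."""
    swap = {"P": "Q", "Q": "P"}
    mirrored = tuple(swap.get(d, d) for d in order)
    return min(order, mirrored)

# Remaining loops after unrolling R, S and dropping N (batch size 1).
orders = {canonical(p) for p in permutations(["K", "C", "P", "Q"])}
print(len(orders))  # 24 permutations collapse to 12 under the symmetry
```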
Level-2 tile sizes. The level-2 tile size of a loop indicates the degree of parallelism along the loop, and a tile size of one indicates that no parallelism is exploited along that loop dimension. Prior works (Chen et al., 2016; NVIDIA, 2018; Zhang et al., 2015; Ma et al., 2017) exploited only a maximum of two loops for parallelism, and the work in (Chen et al., 2019) demonstrates the need to go beyond two loops to achieve peak performance, especially for modern convolution layers where certain layer dimensions are small. In Marvel, the number of loops to exploit for parallelism is a configuration setting provided by the user, and our approach prunes the level-2 tile sizes based on the provided value. In addition, we introduce a new parameter called the "PE utilization bound (p)" to further prune the search space of level-2 tile sizes by requiring the overall PE array utilization to be at least the utilization bound. This technique is beneficial in finding optimal on-chip mappings when the optimization goal is throughput, because the highest throughput is typically obtained at higher PE utilization rates (Chen et al., 2019).
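A minimal sketch of the utilization-bound pruning (the helper and numbers are illustrative assumptions, not Marvel's code):

```python
from math import prod

def pe_utilization(level2_tiles, num_pes):
    """The level-2 tile sizes give the degree of parallelism per loop;
    their product is the number of PEs actually occupied."""
    used = prod(level2_tiles.values())
    return used / num_pes

# Example: parallelizing K=16 and P=8 on a 168-PE (Eyeriss-like) array
# occupies 128 PEs, i.e. ~76% utilization; with a bound p = 0.8 this
# candidate tiling would be pruned from the search.
u = pe_utilization({"K": 16, "P": 8}, num_pes=168)
```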
Level-1 tile sizes. Level-1 tiling is ignored in most prior accelerator designs (Zhang et al., 2015; Ma et al., 2017; Zhao et al., 2019) because those accelerators have no private buffer inside a PE. However, the Eyeriss accelerator (Chen et al., 2016) showcased the benefit of a private buffer by exploiting row-stationary reuse through it. In our approach, we consider level-1 tile sizes in the on-chip space and explore only tile sizes whose memory footprint (including double buffering) for the level-1 tile computation fits in the L1 buffer.
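The footprint check can be sketched as follows; the tile-dimension order, the unit-stride default, and 1-byte operands (matching the 8-bit precision used later in the evaluation) are assumptions of this sketch:

```python
def level1_tile_fits(tile, l1_bytes=512, bytes_per_elem=1, stride=1):
    """Check whether a level-1 tile's footprint, with double buffering,
    fits in the per-PE L1 scratchpad. tile = (K, C, P, Q, R, S) tile sizes."""
    K, C, P, Q, R, S = tile
    inputs = C * (stride * (P - 1) + R) * (stride * (Q - 1) + S)
    weights = K * C * R * S
    outputs = K * P * Q
    footprint = bytes_per_elem * (inputs + weights + outputs)
    return 2 * footprint <= l1_bytes   # x2 for double buffering

# A (1,1,1,1,3,3) tile: 9 inputs + 9 weights + 1 output = 19B, i.e., 38B
# with double buffering, well within a 512B L1 buffer.
fits = level1_tile_fits((1, 1, 1, 1, 3, 3))
```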
Our approach also includes a pruning strategy that chooses level-1 and level-2 tile sizes that do not result in any prologues or epilogues, i.e., the tile sizes are factors of the loop bounds. This strategy helps simplify the design of a PE and the control-signal generation inside the accelerator, but it can miss optimal mappings. All of the above pruning strategies can be enabled or disabled in Marvel via input parameters.
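Restricting tile sizes to divisors of the loop bounds can be sketched in one line; the bound 56 is an illustrative value, not taken from the paper's layers:

```python
def divisor_tiles(bound):
    # Candidate tile sizes that divide the loop bound evenly, so the tiled
    # loops need no prologue or epilogue.
    return [t for t in range(1, bound + 1) if bound % t == 0]

# For a loop bound of 56, only its 8 divisors are explored instead of all
# 56 possible tile sizes.
tiles = divisor_tiles(56)
```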
In this section, we begin with an overview of the experimental setup and DNN layer descriptions used in our evaluation. Then, we present the evaluation of mappings generated by Marvel, and discuss insights from the mappings while comparing them with previous work.
Table 1. Accelerator platform configurations used in our evaluation.

| Parameter | Platform 1 (Eyeriss-like) | Platform 2 |
| --- | --- | --- |
| Clock frequency | 200 MHz | 200 MHz |
| L1 buffer size | 512B | 512B |
| L2 buffer size | 108KB | 108KB |
| DRAM block size (Jedec, 2017) | 64 | 64 |
Target accelerators. Marvel is applicable to any spatial accelerator, since it abstracts accelerator details as #PEs, L1/L2 buffer sizes, NoC bandwidth, reduction/multicast support, etc., which can model a wide variety of accelerators including Eyeriss (Chen et al., 2016), NVDLA (NVIDIA, 2018), TPU (11), and xDNN. Due to space limitations, we present our evaluation for only two accelerator platforms (shown in Table 1): an Eyeriss-like accelerator (Chen et al., 2016) with 168 PEs and 2.4GB/s NoC bandwidth, and another accelerator with 1024 PEs and 25.6GB/s NoC bandwidth. We inherit the L1 buffer, L2 buffer, and clock frequency of both platforms from Eyeriss (Chen et al., 2016), i.e., a 512B L1 buffer, a 108KB L2 buffer, and a 200MHz clock. The bidirectional NoC used in our evaluation is a two-level hierarchical bus with multicast support similar to Eyeriss. It delivers input activations and filter values (collectively, ingress traffic) from the L2 buffer to the PE array, and output activations (egress traffic) from the PE array back to the L2 buffer.
Table 3. Layer sizes and their optimal level-3 tile sizes.
Network layers. Modern DNNs such as MobileNetV2 (Sandler et al., 2018) and ResNet50 (He et al., 2016) include diverse types of convolution layers, such as regular, depth-wise separable, and point-wise convolutions. As shown in Table 2, these layers have diverse arithmetic intensities: residual links have very low arithmetic intensity because they are identity (element-wise) operators, whereas regular convolutions with non-unit filter sizes and point-wise convolutions have very high arithmetic intensity because they reuse input activations across multiple filters. In our evaluation, we consider layers of each type listed in Table 2, and we paid special attention to layers with low arithmetic intensity (e.g., depth-wise separable) since they are bandwidth-limited, appear frequently in recent networks, and have not been deeply explored in prior work. Table 3 lists the 15 layers, labeled L1 to L15, considered in our evaluation. We use 8-bit fixed-point precision for both activations and filters (weights).
Methodology. In the evaluation, we passed the following pruning strategies as input to Marvel: 1) fully unrolling the loops corresponding to filter width and height; 2) setting the batch size (N) to one, which captures the low-latency use case and also represents a more challenging setup for energy efficiency and throughput (Chen et al., 2019); 3) pruning tile sizes based on the finite L1 and L2 buffer sizes; 4) a minimum PE-array utilization bound of 0.1; 5) the symmetric pruning strategy; and 6) exploring only tile sizes that divide the loop bounds evenly without any remainder, i.e., no prologues or epilogues (we also evaluated without this strategy). Table 4 shows the impact of our decoupling and pruning strategies on the original search space of schedules for all fifteen layers, with an average reduction factor of ten billion in the mapping space.
Table 4. Sizes of the original mapping space, and of the off-chip and on-chip schedule subspaces after decoupling, and after decoupling combined with pruning.
4.1. Evaluation of generated off-chip subspace mappings
The first step in Marvel is to compute the optimal mappings of the off-chip subspace (i.e., level-3 tile sizes, level-3 inter-tile loop order, and tensor data-layouts) that minimize the off-chip data movement cost; the optimal mappings for all fifteen layers are shown in Table 3. Since both accelerator platforms have the same DRAM block size and on-chip L2 buffer, the optimal mappings of the off-chip subspace are the same for both platforms. Some interesting observations from the results are described below.
Impact of layer type on level-3 tile sizes. The generated optimal level-3 tile sizes of the P, Q, R, S loops for depth-wise convolutions (e.g., L2, L3, and L4) are the same as their corresponding loop bounds. Such tile sizes completely exploit the convolution (sliding-window) reuse present in an input channel without going back to DRAM, which is consistent with depth-wise separable convolutions exhibiting only convolution reuse and no filter reuse.
Impact of data-layouts on level-3 tile sizes. In this work, Marvel integrates data-layouts (array dimension permutations) with the level-3 tile-selection problem to minimize off-chip data movement, which none of the prior works considered, even for CPUs/GPUs. Since one of the pruning strategies sets the batch size to one, the output activations have three dimensions, the input activations three, and the weights four. As a result, there are 864 (= 4!·3!·3!) possible dimension permutations as data-layouts; however, we only consider permutations with a unique innermost dimension, since the order of the remaining dimensions does not affect our data movement cost function. Hence, only 36 (= 4×3×3) possibilities remain, which we consider in our evaluation.
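The two counts can be reproduced directly; the dimension names (K,C,R,S for weights, C,X,Y for inputs, K,P,Q for outputs) follow the paper's notation:

```python
from math import factorial

# Layout-space size with batch size 1: weights have 4 dims (K,C,R,S),
# inputs and outputs 3 dims each (C,X,Y) and (K,P,Q).
all_layouts = factorial(4) * factorial(3) * factorial(3)   # 4! * 3! * 3! = 864

# Only the innermost dimension affects the off-chip cost model, so one
# representative per innermost-dimension choice suffices.
pruned_layouts = 4 * 3 * 3                                 # 36
```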
In Fig. 6, each colored circle represents the impact of the optimal level-3 tile sizes for a *given layer (x-axis) and layout (circle)* on the off-chip data movement cost, normalized to the best of all layouts for that layer. Considering data layouts enabled Marvel to find better tile sizes, especially for energy efficiency; e.g., the optimal tile sizes for CNN layer L7 in Fig. 6 with a good layout resulted in a nearly 100x smaller data movement cost (which translates into energy reduction) than the optimal tile sizes with a bad layout.
Even though the data-layouts of the input, weight, and output tensors can significantly reduce the off-chip data movement cost of a single convolution layer, an explicit data-layout conversion operator, supported in hardware or software, is required if the data layouts of adjacent layers differ; the overhead of such conversion depends on the complexity of the layout translation and on the tensor shapes and sizes. However, Marvel can also find uniform data-layouts across convolution layers. Such uniform layouts incur no translation overhead at any layer (except the first), but may be sub-optimal for some layers. Fig. 7 shows the impact of each layout on the total off-chip data movement cost, normalized to the optimal per-layer layouts (column "Data-layouts" in Table 3) without translation overhead. With unified data layouts (laying out input activation height (X), output activation channels (K), and the batch of filters (K) in the innermost position of the input, output, and filter tensors, respectively), the degradation is within 5% of the total off-chip data movement cost achieved with the optimal layout at each layer.
4.2. Evaluation of generated on-chip subspace mappings
Marvel uses the computed optimal mappings of the off-chip subspace to construct the on-chip subspace, and then explores that space with the pruning techniques mentioned in our methodology to find the on-chip mappings with the highest throughput, the lowest energy consumption, and the lowest energy-delay product.
Experimental variants. We compare Marvel-generated on-chip mappings for each layer and accelerator platform with three state-of-the-art on-chip mappings: QR-partitioned (RS-like (Chen et al., 2016)), KC-partitioned (DLA-like (NVIDIA, 2018)), and QP-partitioned (ShiDianNao-like (Du et al., 2015)). A QR-partitioned mapping refers to a mapping that exploits parallelism across output height (Q) and filter height (R) with row-stationary reuse behavior, and similarly for the other mappings. The full on-chip mappings for the state-of-the-art dataflows are obtained by fixing aspects such as loop order and parallelization as prescribed by the dataflow, varying the free aspects with Marvel, and picking the best. We also compared with a variant of Marvel in which the pruning strategy on prologues and epilogues is disabled. In addition, for throughput, we compared with the attainable (peak) performance given by roof-line models of both accelerator platforms. To evaluate all of the above variants, we used MAESTRO (Kwon et al., 2019), a state-of-the-art, RTL-validated analytical cost model that accurately estimates metrics such as throughput, latency, and energy consumption. A comparison of Marvel-generated on-chip mappings with the other variants for throughput and energy consumption is shown in Fig. 8: Marvel achieved geometric mean improvements of 5.23x higher throughput and 1.12x lower energy consumption over all the state-of-the-art mappings across the fifteen layers and both accelerator platforms. Some interesting insights from the Marvel-generated mappings are described below.
Low arithmetic intensity layers. Residual layers, fully-connected layers, and depth-wise convolution layers have low arithmetic intensity for ingress data, i.e., weights and input activations. This is because residual layers (e.g., L1, L10) and fully-connected layers (e.g., L7, L15) have no convolution reuse (sliding behavior), and depth-wise convolutions (e.g., L2, L3, L4) have no filter reuse. As a result, the performance of these layers is often bounded by the NoC bandwidth; hence the Marvel-generated mappings achieve performance close to the attainable roofline. In addition, fully-connected layers produce only a single output activation per output channel, so any mapping exploiting parallelism across the width (P) or height (Q) of an output channel results in lower performance (e.g., the QP-partitioned and QR-partitioned mappings on L7 and L15). Similarly, residual and depth-wise layers have only one filter, so any mapping exploiting parallelism along the number of filters results in lower performance (e.g., KC-partitioned on L1 and L10). Unlike residual layers, depth-wise layers have non-unit filter width (S) and height (R), which exploit convolution reuse via the sliding behavior. However, the resolution (width and height) of the input activations decreases from early to late layers, resulting in less convolution reuse deeper in the model; hence the performance of the Marvel-generated mappings decreases from L2 to L4 while remaining close to the attainable roofline.
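Why bandwidth bounds these layers can be sketched with the classic roofline formula; the ~2 ops/byte intensity below is an assumed, illustrative value, not a number from the paper:

```python
def attainable_gops(arith_intensity, peak_gops, noc_gbps):
    # Classic roofline: attainable performance is the minimum of the
    # compute peak and NoC bandwidth times operational intensity.
    return min(peak_gops, noc_gbps * arith_intensity)

# Platform 1 sketch: 168 PEs x 200 MHz x 2 ops per MAC = 67.2 GOPS peak,
# with a 2.4 GB/s NoC. A low-intensity layer at ~2 ops/byte is
# bandwidth-bound well below the compute peak.
perf = attainable_gops(2.0, 67.2, 2.4)   # 4.8 GOPS
```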
High arithmetic intensity layers. In general, the arithmetic intensity of regular convolutions, point-wise convolutions, and their strided variants is higher because of the presence of filter reuse, input activation reuse, and convolution reuse (except for point-wise). Point-wise convolutions have no convolution reuse because of their unit filter width and height, so any mapping exploiting convolution reuse results in lower performance (e.g., QR-partitioned, or row-stationary-like, on L5, L11, L12, and L13). However, Marvel-generated mappings leveraged other forms of reuse, such as filter and input activation reuse, via appropriate loop orders; as a result, Marvel achieved a 6.1x geometric mean throughput improvement on the point-wise layers. Such large improvements are possible because the state-of-the-art mappings from recent accelerators are optimized for early CNN models with 3x3 or larger kernels, not 1x1 (point-wise) kernels.
For regular convolutions and their strided variants, the Marvel-generated mappings did not achieve close-to-peak performance (e.g., L9 on platform 1). This is due to PE under-utilization arising from a mismatch between the parallelism granularity of the mappings and the size of the PE array. For example, the optimal level-3 tile size of the output-channel dimension in L14 is 512; if this dimension is parallelized across 168 PEs, only 8 PEs (= 512 mod 168) are utilized in the last step, leading to 95% under-utilization of the PE array in that step. In addition, some layer dimensions are prime numbers (e.g., output width and height are 7 in L13), leaving fewer factorization choices and again causing PE under-utilization. Such mappings and observations were possible because the MAESTRO cost model (Kwon et al., 2019) precisely estimates data movement behavior and PE under-utilization. Furthermore, the performance of the QP mapping on L9 and the KC mapping on L2 is very low because the optimal level-3 tile sizes along those parallel dimensions are very small (e.g., 5 for Q and P in L9, and 1 for K and C in L2); parallelizing across such dimensions yields little benefit, hence the small bars in Fig. 8.
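The under-utilization arithmetic above can be sketched as follows; `pe_utilization` is a hypothetical helper written for this illustration:

```python
def pe_utilization(dim_size, num_pes):
    # Average utilization when one dimension is parallelized across the
    # PE array in ceil(dim_size / num_pes) sequential steps.
    steps = -(-dim_size // num_pes)          # ceiling division
    return dim_size / (steps * num_pes)

# 512 output channels on 168 PEs: three full steps of 168 PEs each, then
# a final step with only 512 % 168 = 8 PEs busy (~5% of the array).
last_step_pes = 512 % 168
avg_util = pe_utilization(512, 168)          # ~0.76 average utilization
```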
Disabling pruning on epilogues/prologues. We compared Marvel-generated on-chip mappings with a variant of Marvel in which the pruning strategy on epilogues/prologues is disabled, i.e., tile sizes that are not factors of the loop bounds are also allowed in the mapping space. The throughput of the optimal on-chip mappings found by this variant is on average 2% higher than that of the mappings obtained with the pruning enabled. The energy-optimal mappings found by the variant are 10% better, but the search time increases by 20x relative to enabling the pruning.
Uniform dataflow. Even though our approach generates efficient per-layer mappings for flexible accelerators, it can also be used to find a uniform dataflow for a rigid accelerator that is good across layers. The dataflow of an accelerator can be viewed as a unique combination of parallelization (parallel loops) and staging behavior (loop order). Marvel keeps a list of optimal mappings for each dataflow behavior and constructs a dataflow that works well across layers; e.g., it identified a dataflow exploiting parallelism on input channels (C) and output width (P), i.e., CP-partitioned, as the best dataflow for higher throughput on platform 1 across all layers. This is because layers L1 to L6 have higher values of output tensor width and height (P, Q) than of output and input tensor channels (K, C) after level-3 tiling, and the reverse holds for the remaining layers.
4.3. Performance on entire DNN models
Along with the evaluation of the different convolution layer types on both accelerator platforms in the previous subsection, we also evaluated Marvel on four popular DNN models (involving convolutions): VGG (Simonyan and Zisserman, 2015), AlexNet (Krizhevsky et al., 2012), ResNet50 (He et al., 2016), and MobileNetV2 (Sandler et al., 2018), to demonstrate the robustness of our approach.
As seen in Figure 9, our approach achieved a geometric mean improvement of 3.43x in throughput and a geometric mean reduction of 9.19% in energy consumption across all four DNN models on both platforms, relative to the state-of-the-art dataflows (described in Section 4.2). This comprehensive evaluation demonstrates the robustness of our approach and its consistent improvement over the state-of-the-art dataflows across multiple DNN models.
4.4. Comparison with Brute-force and Random Sampling exploration strategies
To evaluate the quality of mappings generated by our approach, we compared against brute-force and random-sampling exploration of the mapping space (similar to the exploration proposed in the Timeloop framework (Parashar et al., 2019)) for all fifteen layers on both accelerator platforms, with each exploration stopped after 48 hours (the best mappings found by both explorations were obtained after roughly 47 of the 48 hours). The comparison results are shown in Fig. 10. Even though brute-force and random-sampling exploration found better mappings for layers such as L1, L3, L7, and L9, the overall geometric mean runtime and energy performance of both strategies (shown in Fig. 10) reaches barely 50% of that of the mappings reported by our approach. In addition, our approach's search was close to 300x faster than both 48-hour explorations.
5. Conclusion & Future work
In this paper, we propose a decoupled off-chip/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces, and first optimizes the off-chip subspace followed by the on-chip subspace. The motivation for this decomposition is to dramatically reduce the size of the search space, and also to prioritize the optimization of off-chip data movement, which is 2-3 orders of magnitude more costly than on-chip data movement. We then introduce Marvel, which implements this approach by leveraging two cost models to explore the subspaces: a classical distinct-block (DB) locality cost model for the off-chip subspace, and a state-of-the-art DNN accelerator behavioral cost model, MAESTRO, for the on-chip subspace. Marvel considers a large search space involving data layouts along with loop transformations, and our decoupled approach reduces that search space by a factor of about ten billion (Table 4). Mappings found by Marvel offer geometric mean improvements of 5.23x higher throughput and 1.12x lower energy consumption compared to the state-of-the-art mappings across 15 layers from the MobileNetV2 and ResNet50 models on two DNN spatial accelerators. Furthermore, we compared our approach with the brute-force and random-sampling techniques (used in Timeloop (Parashar et al., 2019)) for search space exploration. In the future, we envision that Marvel can be used for a wide range of applications, including neural architecture search.
- SnaPEA: predictive early activation for reducing computation in deep convolutional neural networks. In International Symposium on Computer Architecture (ISCA).
- End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
- TVM: an automated end-to-end optimizing compiler for deep learning. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation (OSDI '18), pp. 579-594.
- DianNao: a small-footprint high-throughput accelerator for ubiquitous machine learning. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
- Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. In International Symposium on Computer Architecture (ISCA).
- Eyeriss: an energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits 52(1), pp. 127-138.
- Eyeriss v2: a flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems.
- Serving DNNs in real time at datacenter scale with Project Brainwave. IEEE Micro 38(2), pp. 8-20.
- Minimizing computation in convolutional neural networks. In International Conference on Artificial Neural Networks (ICANN), pp. 281-290.
- ShiDianNao: shifting vision processing closer to the sensor. In International Symposium on Computer Architecture (ISCA).
- Edge TPU: Google's purpose-built ASIC designed to run inference at the edge (2019). https://cloud.google.com/edge-tpu/
- Learning hierarchical features for scene labeling. PAMI 35(8), pp. 1915-1929.
- On estimating and enhancing cache effectiveness. In International Workshop on Languages and Compilers for Parallel Computing, pp. 328-343.
- Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.
- MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- DDR4 SDRAM standard (JESD79-4). https://www.jedec.org/standards-documents/docs/jesd79-4a
- In-datacenter performance analysis of a tensor processing unit. In International Symposium on Computer Architecture (ISCA), pp. 1-12.
- Improving locality using loop and data transformations in an integrated framework. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO 31), pp. 285-297.
- Deep visual-semantic alignments for generating image descriptions. In Conference on Computer Vision and Pattern Recognition (CVPR).
- A communication-centric approach for designing flexible DNN accelerators. In Proceedings of the 12th International Workshop on Network on Chip Architectures (NoCArc), pp. 6:1-6:1.
- ImageNet classification with deep convolutional neural networks. In NIPS.
- Understanding reuse, performance, and hardware cost of DNN dataflow: a data-centric approach. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '52), pp. 754-768.
- MAERI: enabling flexible dataflow mapping over DNN accelerators via reconfigurable interconnects. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pp. 461-475.
- Optimizing memory efficiency for deep convolutional neural networks on GPUs. In SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 633-644.
- FlexFlow: a flexible dataflow accelerator architecture for convolutional neural networks. In International Symposium on High Performance Computer Architecture (HPCA).
- Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 45-54.
- Design space exploration of FPGA-based deep convolutional neural networks. In 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 575-580.
- NVIDIA Deep Learning Accelerator (NVDLA). http://nvdla.org
- Timeloop: a systematic approach to DNN accelerator evaluation.
- SCNN: an accelerator for compressed-sparse convolutional neural networks. In International Symposium on Computer Architecture (ISCA), pp. 27-40.
- Neural network assisted tile size selection. In International Workshop on Automatic Performance Tuning (IWAPT '2010).
- A geometric programming framework for optimal multi-level tiling. In Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), p. 18.
- Positivity, posynomials and tile size selection. In Proceedings of the 2008 ACM/IEEE Conference on Supercomputing (SC '08), pp. 55:1-55:12.
- ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), pp. 211-252.
- MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520.
- An analytical model for loop tiling and its solution. In Proceedings of the 2000 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS '00), pp. 146-153.
- Automatic selection of high-order transformations in the IBM XL FORTRAN compilers. IBM Journal of Research and Development 41(3), pp. 233-264.
- From high-level deep neural models to FPGAs. In IEEE/ACM International Symposium on Microarchitecture (MICRO).
- Analytical bounds for optimal tile size selection. In Proceedings of the 21st International Conference on Compiler Construction (CC '12), pp. 101-121.
- Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
- Efficient processing of deep neural networks: a tutorial and survey. CoRR abs/1703.09039.
- The future is here - iPhone X (neural engine) (2017). https://www.apple.com/newsroom/2017/09/the-future-is-here-iphone-x/
- DeepPose: human pose estimation via deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR).
- Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Accelerating DNNs with Xilinx Alveo accelerator cards. https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf
- Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492-1500.
- A systematic approach to blocking convolutional neural networks. CoRR abs/1606.04209.
- Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 161-170.
- mRNA: enabling efficient mapping space exploration on a reconfigurable neural accelerator. In Proceedings of the 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).