1 Introduction
Deep neural networks (DNNs) are being deployed at an increasing scale—across the cloud and IoT platforms—to solve complex regression and classification problems in image recognition [24], speech recognition [3], language translation [30], and many more, with accuracy close to and even surpassing that of humans [15, 27, 12]. Tight latency, throughput, and energy constraints when running DNNs have led to a meteoric increase in hardware accelerators.
DNN accelerators achieve high performance by exploiting parallelism over hundreds of processing elements (PEs) and high energy efficiency by maximizing data reuse within PEs and on-chip scratchpads [6, 5, 1, 22, 23, 14]. For a specific DNN workload and a hardware accelerator, the achieved utilization and data reuse directly depend on (1) how we schedule the DNN computations (e.g., choice of loop transformations) and (2) how we map computations across PEs. These two components are collectively referred to as dataflow in the accelerator literature [6, 22, 17, 16]. It has been shown that the energy cost of moving data exceeds the cost of computation [6, 13], so understanding and optimizing dataflow is a critical component of DNN accelerator design, as it directly determines how data is transferred between multipliers (L0), staged in local buffers (L1), and held in the global buffer hierarchy (L2 and beyond).
The performance and energy efficiency of DNN accelerators depend on (1) the target DNN model and its layer types/dimensions, (2) the dataflow, and (3) the available hardware resources and their connectivity. These three dimensions are tightly coupled, and optimizing DNN accelerators across them is a challenging task. For example, a dataflow that exploits input channel parallelism [1] in convolutional neural networks (CNNs) may not achieve high utilization on layers with a small number of channels. Alternatively, dataflows that require more transfer bandwidth than the network-on-chip (NoC) provides may result in under-utilization of the hardware. In such cases, increasing the L1 scratchpad size may allow the same dataflow to require less data bandwidth, but the larger L1 may increase area and energy consumption. Thus, co-optimizing the hardware microarchitecture and the dataflow is one of the primary optimization targets for any accelerator design. This remains an open challenge, as evidenced by the number of novel dataflows and microarchitectures that continue to be proposed [17, 13, 19, 7].
Dataflows are often expressed as loop nests [18, 7, 17], a syntax that resembles a simple imperative programming language with explicit parallelism and from which potential data reuse opportunities are inferred. This approach makes it possible for humans to read, write, and reason about dataflows using familiar concepts from software development. However, once a dataflow is specified, there is a need to represent its properties and the relationships between its various entities in a precise, compiler-friendly format, and to build cost models that support the format. Doing so enables an ecosystem of tools that leverage the dataflow concept to significantly increase productivity in architectural design space exploration.
This paper makes three contributions. First, we introduce a data-centric notation to represent various accelerator dataflows with data mappings and reuses as first-class entities, unlike compute-centric notation, which infers data reuse from a loop-nest representation. These data-centric directives can express a wide range of data reuse (across space, time, and space+time) over arbitrary hierarchies of PEs for both dense and sparse DNN layers such as convolutions, LSTMs, and fully-connected layers. We believe that our data-centric notation can complement the commonly used loop-nest notation: it can be viewed as an intermediate representation (IR) that can be extracted from a high-level loop-nest notation or specified directly.
Second, we introduce an analytical cost model named MAESTRO (shown in Fig. 1) that takes as input (1) a DNN model with a set of layers, (2) a dataflow description for each layer specified using our proposed directives, and (3) the hardware configuration. Based on these inputs, MAESTRO outputs estimates of end-to-end execution time, energy (including all compute, buffer, and interconnect activities), NoC costs, and so on. A key challenge in our proposed approach is to provide a cost estimation that is both efficient and sufficiently precise to effectively support design space exploration. As demonstrated in this paper, the abstract hardware model and analytic model within MAESTRO are within 90-95% accuracy of actual open-source RTL [16] while being 1029.1× to 4116.3× faster (10 ms to run MAESTRO versus 7.2 to 28.8 hours for an equivalent RTL simulation on a workstation with a Xeon E5-2699 processor and 64 GB of memory). Fig. 2 shows two example scenarios that can use MAESTRO.
Finally, we demonstrate how the MAESTRO cost model can be used by accelerator designers to determine Pareto-optimal parameters for an accelerator with a given area, energy, or throughput budget (Fig. 2(b)). For the NVDLA [1] dataflow in VGG16 [25] convolutional layer 11, we see up to a 2.16× difference in power consumption between an energy-optimized and a throughput-optimized design. The energy-optimized design employs 10.6× more buffers and 0.8× the number of PEs of the throughput-optimized design, which degrades throughput by 61.7% and improves performance per energy by 1148.6%.
2 Background
Although our approach can be applied to various DNN layers (convolution, fully-connected (FC), LSTM, separable convolution, and so on), we focus on convolutions in this paper because convolutional neural networks (CNNs) are popular and convolution accounts for more than 90% of overall computation in CNNs [10, 6]. To understand the cost-benefit tradeoffs of various approaches to computing convolutions, we first introduce common DNN accelerator architectures and then discuss core concepts related to data reuse and dataflows.
2.1 DNN Accelerators
DNN accelerators are specialized hardware for running DNN applications with high throughput and energy efficiency. As described in Fig. 3, most DNN accelerators employ hundreds of processing elements (PEs) to exploit the inherent parallelism in DNN applications. PEs typically include scratchpad memories (L1) and ALUs that perform multiply-accumulate operations (MACs). To reduce energy- and time-consuming DRAM accesses, most DNN accelerators also include a shared scratchpad buffer (L2) large enough to stage data to feed all the PEs. The shared L2 buffer and the PEs are interconnected with a network-on-chip (NoC). Our approach supports a wide range of interconnect designs in the NoC module. For example, a systolic array could be represented as a 2D array that provides unidirectional links toward East and South. Depending on the hardware parameters selected, our approach can support architecture designs that efficiently execute a wide range of DNN operations, including convolutions, because it enables exploiting not only parallelism but also data reuse via buffers and forwarding/multicasting NoCs.
2.2 Tensors in Convolution
We present an example of 7D convolution in Fig. 4 that involves seven dimensions across three data classes: input activation, output activation, and weight tensors. As presented in Fig. 4 (e), the tensors in those three data classes correlate to the seven dimensions in a complex manner. For example, the row/column indices of the output can be deduced from the input row/column and filter row/column indices (a.k.a. the input-centric view of the convolution loop nest). Also, the input channel appears in both the filter and the input, and the output channel appears in both the filter and the output activation. Because of these specific data access patterns, we can transform the loop nest to keep one of the data classes stationary, which can significantly reduce global/local buffer access counts in DNN accelerators, as well as energy consumption. Such combinations of loop transformations and mappings to PEs are termed dataflows [6], and can be categorized into three classes (weight stationary, output stationary, and no-local-reuse) that cover a subset of the dataflows frequently used in practice.
2.3 Data Reuse Taxonomy
We observe that data reuse originates from three behaviors of DNN accelerators: staging, multicasting, and forwarding. Staging keeps data points in a local buffer to reuse them in the future, multicasting simultaneously sends the same data points to multiple PEs to reuse them across PEs, and forwarding sends data points to adjacent PEs so that they can be reused in future iterations. We refer to these reuse behaviors as temporal, spatial, and spatio-temporal reuse, respectively.
Fig. 5 shows a timeline of a row-stationary (logical PE version) accelerator [6] with four PEs. Fig. 5 (b) highlights examples of (1) temporal, (2) spatial, and (3) spatio-temporal data reuse. The data point W1 of the weight data class is used at cycle 0 in PE3 and used again at cycle 2 in the same PE, which is temporal reuse. The data point I1 of the input data class is used at cycle 0 across PE2 and PE1, which is spatial reuse. The partial sum P is accumulated at cycle 3 in PE0 but generated by PE3 at cycle 2, which is spatio-temporal reuse. Note that this example assumes an architecture that enables all three types of data reuse.
Table 1 summarizes the three types of data reuse in DNN accelerators and presents the opportunities and implications of each. Each type of data reuse requires specific conditions to be met in the dataflow (or loop nest structure). Temporal reuse is mainly related to the loop order and buffers. Spatial reuse is mainly related to the NoC structure and to how the dimensions of a data class correlate with the spatially mapped dimension (i.e., the one iterated in a parallel-for). Spatio-temporal reuse is related to all of the aspects of temporal and spatial reuse. That is, loop nest structures need to be organized in a specific style to reveal data reuse opportunities on a suitable accelerator. Such restructured loop nests correspond to various dataflows.
2.4 Dataflow Description
Dataflows are often expressed as loop nests, a syntax that resembles a simple imperative programming language with explicit parallelism, as presented in Eyeriss v2 [7]. We call the loop nest notation a compute-centric representation since the data movement is implicit in the loop order and the explicit parallelism specified by the programmer. The loop order dictates the schedule/ordering of computations, the explicit annotation of loops with parallel-for captures parallelism, and tiling a loop into multiple loops enables data reuse. The example in Fig. 7 (a), which describes a version of the row-stationary dataflow [6], has two parallel-for loops with a specific loop order that constructs the row-stationary dataflow. Also, loop c in the original loop nest in Fig. 4 (d) is tiled into c1 and c0, which are separated by parallel_for y’ in Fig. 7 (a) to implement tiling. Building on those components, the loop order in the loop nest, i.e., (n -> k1 -> c1 -> y’ -> k0 -> c0 -> x’ -> r -> s), implicitly guides data movement within accelerators across multiple levels of the memory hierarchy, which influences data reuse. Therefore, computer architects started to explore optimized loop nests encompassing all three aspects: loop order, parallelism, and tiling. For example, Eyeriss v2 [7] describes its dataflow in a 22-dimensional loop nest.
Since parallelism and loop order are the first-class entities in a compute-centric (loop-nest) representation, inferring accurate data movement in the accelerator and then developing cost models to precisely estimate efficiency is very challenging. This motivates us to explore an alternative intermediate representation (IR) of dataflows that focuses on data: a data-centric representation in which data movement and organization are the first-class entities.
3 Dataflow Description
To explicitly describe key aspects of dataflows, we propose a data-centric IR that clearly describes (1) the data iteration schedule, (2) data tiling, and (3) data mapping on PEs, unlike compute-centric representations in loop-nest form, which only imply them. The IR is based on three data-centric dataflow directives (spatial map, temporal map, and cluster) that describe how each dimension of data iterates over time (iterations) and space (PEs), how large the mapping in each iteration is for each dimension, and how we organize the granularity of the mapping over space (PE clustering). We discuss the syntax and semantics of those three directives in the following subsection.
3.1 Data-centric Dataflow Directives
Because data in DNN accelerators are high-dimensional tensors, we take a divide-and-conquer strategy: we describe the tiling and mapping of each dimension and stack them to represent the tiling and mapping of the entire data in a data class (input/weight/output), as presented in Fig. 7 (b). We represent the data iteration order using the order of the dataflow directives for each dimension, similar to the loop order in the compute-centric representation. Although the concept of a schedule implied by the order of data/compute dimensions is similar, two of the proposed directives, temporal map and spatial map, innately encapsulate tiling and mapping over time/space. The third directive, cluster, enables manipulating the spatial granularity of the mapping. We discuss the syntax and semantics of each directive next.
3.1.1 Cluster: Dimensionality in PE array
Syntax: Cluster(size, type), where size is an integer and type is either L (logical) or P (physical).
Semantics: The base clusters at the innermost directive level are logical PEs. Cluster directives are processed starting from the innermost one. Each bundles size sub-clusters at the level where it is specified to construct super-clusters, which become the unit of the spatial dimension for the directives outside that Cluster directive. If the type is physical, the super-cluster constructed by the directive becomes the unit physical compute unit, or physical PE. If no physical cluster is defined, the logical PEs are treated as physical PEs.
Implication: Cluster makes it possible to describe mappings over a multi-dimensional processing element array with static or dynamic grouping of PEs. For example, Eyeriss [6] has a 2D PE array and constructs a dynamic cluster over rows to cover the row dimension of a sliding window. Two Cluster directives can be used to describe such clustering.
Example: In the example in Fig. 6 (c), two Cluster directives are defined in the order Cluster(3,P) and Cluster(2,L). Because clusters are constructed from the innermost level, Cluster(2,L) first operates on the 12 logical PEs and generates level 1 clusters that contain two logical PEs each. Other directives between Cluster(3,P) and Cluster(2,L) operate on the level 1 clusters. Then, Cluster(3,P) groups three level 1 clusters at a time and constructs two level 2 clusters. All directives above Cluster(3,P) operate on the level 2 clusters. Because of the type P specified in Cluster(3,P), each level 2 cluster is mapped on a physical PE. Logical PEs within a level 2 cluster, or a physical PE, potentially operate as vector lanes of ALUs, depending on the vector lanes of the target hardware.
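The inside-out construction in this example can be sketched in a few lines of Python. The function name and the list-of-PE-ids representation are hypothetical and are not part of the directive syntax; the sketch only reproduces the grouping of 12 logical PEs by Cluster(2,L) followed by Cluster(3,P), and it assumes the unit count divides evenly at each level.

```python
def build_clusters(num_logical_pes, sizes_inner_to_outer):
    """Group logical PEs level by level, starting from the innermost Cluster.

    Returns a list of levels; each level is a list of clusters, and each
    cluster is the list of logical PE ids it contains.
    """
    units = [[pe] for pe in range(num_logical_pes)]  # level 0: logical PEs
    levels = [units]
    for size in sizes_inner_to_outer:
        # Bundle `size` consecutive sub-clusters into one super-cluster.
        units = [
            sum(units[i:i + size], [])
            for i in range(0, len(units), size)
        ]
        levels.append(units)
    return levels


# Cluster(2,L) is innermost, Cluster(3,P) is outermost.
levels = build_clusters(12, [2, 3])
level1, level2 = levels[1], levels[2]
```

Here `level1` holds six clusters of two logical PEs and `level2` holds two clusters of six logical PEs, matching the description above; the physical/logical distinction is not modeled.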
3.1.2 Temporal Map: Tiling and mapping over time
Syntax: TemporalMap(size, offset) x, where size and offset are integers and x is a data dimension.
Semantics: A TemporalMap directive specifies (1) the number of elements mapped in a dimension x (size) and (2) how the mapping moves in the next iteration of x (offset), for all of the sub-clusters.
Implication: TemporalMap maps the same set of elements in a dimension to all the sub-clusters. It expresses data tiling using size and the tile iteration rule over time using offset.
Example: TemporalMap(3,1) X in Fig. 6 (a) indicates that three elements in the X dimension are mapped in each iteration, and the mapping shifts by one in the next iteration. That is, the mapped x over time is (0,1,2), (1,2,3), (2,3,4), and so on.
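As an illustration of these semantics, the sketch below enumerates the index tiles that a TemporalMap produces over successive iterations. The function name, the dimension size of 6, and the choice to clip the last tile at the dimension bound are our assumptions; the source does not specify edge handling.

```python
def temporal_map(size, offset, dim_size):
    """List the index tuples mapped in each temporal iteration.

    Assumption: a tile that would run past the end of the dimension is
    clipped, and iteration stops once the end of the dimension is reached.
    """
    if size <= 0 or offset <= 0:
        raise ValueError("size and offset must be positive")
    tiles = []
    start = 0
    while start < dim_size:
        tiles.append(tuple(range(start, min(start + size, dim_size))))
        if start + size >= dim_size:
            break  # this tile already reaches the end of the dimension
        start += offset
    return tiles


# TemporalMap(3,1) X on a hypothetical X dimension of size 6.
x_tiles = temporal_map(3, 1, 6)
```

Every sub-cluster receives the same tile in a given iteration; only the iteration index selects which entry of `x_tiles` is active.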
3.1.3 Spatial Map: Tiling and mapping over space
Syntax: SpatialMap(size, offset) x, where size and offset are integers and x is a data dimension.
Semantics: A SpatialMap directive specifies (1) the number of elements mapped in a dimension x (size) and (2) how the mapping moves in the next cluster (offset). When the size of the spatial dimension (the number of sub-clusters) at the level where the SpatialMap is specified is not sufficient, the spatial mapping is folded over time.
Implication: SpatialMap maps different sets of elements in a dimension to sub-clusters. It expresses data tiling using size and the tile iteration rule over space using offset.
Example: SpatialMap(3,1) Y in Fig. 6 (b) indicates that three elements in the Y dimension are mapped on each cluster (in this example, logical PEs) and that the mapped elements shift by one over space (PEs). That is, the mapped y on each PE is as follows: PE0 <= (0,1,2), PE1 <= (1,2,3), PE2 <= (2,3,4), ..., PE5 <= (5,6,7). Because the number of PEs is not sufficient to cover the entire Y dimension, the mapping is folded over time.
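Folding can be illustrated with a short sketch: the tiles of a SpatialMap are assigned to PEs in order, and whatever does not fit in one pass is deferred to a later pass. The helper names, the Y dimension size of 10, and the clipping rule for edge tiles are assumptions made for this example only.

```python
def tiles_of(size, offset, dim_size):
    """Index tuples produced by sliding a window of `size` by `offset`."""
    tiles, start = [], 0
    while start < dim_size:
        tiles.append(tuple(range(start, min(start + size, dim_size))))
        if start + size >= dim_size:
            break
        start += offset
    return tiles


def spatial_map(size, offset, dim_size, num_units):
    """Distribute tiles over `num_units` sub-clusters, folding over time.

    Returns a list of folds; fold f holds the tiles mapped to unit 0, 1, ...
    during the f-th pass over the spatial dimension.
    """
    tiles = tiles_of(size, offset, dim_size)
    return [tiles[i:i + num_units] for i in range(0, len(tiles), num_units)]


# SpatialMap(3,1) Y over six PEs, with a hypothetical Y dimension of size 10.
folds = spatial_map(3, 1, 10, 6)
```

In this configuration the first fold fills all six PEs and the second fold uses only two of them, which is also a simple instance of PE under-utilization caused by a mismatch between the PE count and the dimension size.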
3.2 Dataflow Construction
We have discussed the syntax, semantics, and implications of the dataflow directives we propose. To represent a full dataflow, directives must be defined over all dimensions of the data, i.e., the seven dimensions introduced in Fig. 4 (a). Fig. 7 (b) shows a row-stationary dataflow [6] described in our directives over the layer defined in Fig. 4 (b). The dataflow is equivalent to the one implied by Fig. 7 (a). Using this example, we discuss how to determine the directives for each dimension and organize them to represent the row-stationary dataflow described in Fig. 7.
Clustering PEs: The row-stationary dataflow treats R × S logical PEs (R/S: filter row/column size) as a cluster and maps an output row on each cluster. For simplicity, we allocate logical PEs as physical PEs and map S elements of the filter column dimension (unrolling the S dimension) on each PE, which corresponds to TemporalMap(3,3) S in Fig. 7 (b). Because the logical PEs are physical PEs, the cluster directive has the type "logical." Also, because the size of the R dimension is three in the target layer, the cluster size is three. Therefore, we obtain the cluster directive Cluster(3,L), as described in Fig. 7 (b). Note that clusters might introduce additional directives (at most, as many extra map directives per dimension as there are cluster directives) because a SpatialMap directive defined over a cluster does not indicate how the mapped elements are distributed within the cluster. For example, Fig. 7 (b) contains two SpatialMap directives for the Y dimension, one at each cluster level.
Tiling Tensors: In the row-stationary dataflow, PEs in a cluster compute partial sums for the output, and each PE is responsible for one row of the sliding window, generating S partial sums. Therefore, for each PE, one R element, three S elements (unrolled), and one Y element need to be mapped. The number of X elements to be mapped is a free choice, and we select the minimum size (three, following the S dimension size). For the input and output channels, we tile into different sizes (C: three, K: two) to show how tiling works in the example. For the input batch, we use a batch size of one because the row-stationary accelerator, Eyeriss, also processes one batch. Therefore, the tile size of each dimension in a PE is as follows: (N:1, K:2, C:3, Y:1, X:3, R:1, S:3).
Scheduling data iteration: Because the S dimension is unrolled in each PE, the map directive for the S dimension is placed at the bottom. Above the S dimension, because each PE generates partial sums for a filter row, R and Y elements need to be distributed over the logical PEs. To prevent duplicated partial sums, we specify a SpatialMap for R and for Y. R and Y can appear in either order because the R dimension is unrolled in a cluster. Although all the clusters operate on different output rows, they iterate over the output column (X dimension) in the same manner, as presented in Fig. 7 (c), which requires a TemporalMap on X. Another SpatialMap of Y above the X dimension needs to be specified to allocate different output rows to the clusters. The C and K dimensions can be in any order above the last SpatialMap of Y. We place the C dimension first for output reuse and use TemporalMap to prevent redundant data mapping. Finally, we place the batch at the top because a different batch can be interpreted as another run over a different input activation tensor. Therefore, we obtain the following data iteration schedule: (tMap N -> tMap K -> tMap C -> sMap Y -> tMap X -> (Cluster) -> sMap Y -> sMap R -> tMap S), where tMap and sMap denote TemporalMap and SpatialMap directives.
Constructing stacked directives: We aggregate the cluster, tiles, and schedule constructed in the previous steps to describe a full dataflow. To determine the offsets, we consider whether the dataflow requires overlapped data points between temporal/spatial iterations. For tMap N, tMap K, and tMap C, we select an offset equal to the tile size because the tiles should not overlap. In contrast, we select an offset of one for sMap Y and tMap X to describe the overlap between adjacent sliding windows. Within a cluster, sMap Y, sMap R, and tMap S all have offsets equal to their tile sizes to prevent redundant data mapping. Applying those offset values, we finally obtain the dataflow described in Fig. 7 (b).
3.3 Legal Dataflows
The dataflow description directives are flexible enough that users can specify any data mapping that follows a uniform pattern for each dimension. However, just as the flexibility of programming languages allows users to write buggy code, dataflow directives allow users to specify illegal dataflows. Legal dataflows must meet the following conditions:
Bound condition: The bound condition is the first requirement; it prohibits mappings of non-existent indices. For example, in the example layer presented in Fig. 4, the number of output channels (K) is four. Therefore, TemporalMap(5,5) K is an illegal mapping directive because the mapping size (five) in the directive exceeds the bound of the K dimension (four).
Coverage condition: The coverage condition requires the data mapping to cover all the pairs of operands (inputs and weights in CNNs) needed to generate the desired outputs. For example, in the dataflow in Fig. 7, if we replace the second directive, TemporalMap(2,2) K, with TemporalMap(2,4) K, the dataflow becomes illegal because it does not cover all output channels (K). MAESTRO prints warning messages when the dataflow does not cover all the pairs of operands, assuming CNNs as the target program. However, some cases, such as a stride larger than one in a CNN or dilated convolutions, intentionally drop some data points to implement their functionality. Therefore, users need to carefully review the dataflow when they use MAESTRO to model such cases.
No redundancy condition: The no redundancy condition prevents redundant data mappings that produce the same output as previously mapped data points. That is, once a set of operands is mapped on a PE, all the computation using those operands needs to be done at that PE. For example, in the dataflow in Fig. 7, if we replace the third directive, TemporalMap(3,3) C, with TemporalMap(2,1) C, the dataflow becomes illegal because a redundant input channel is mapped across two temporal iterations of C.
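The three conditions can be approximated for a single map directive with a simple check on its size and offset. This is a simplified sketch, not MAESTRO's actual legality checker: the function name and the `sliding_window` flag are hypothetical, the C dimension size of 6 is assumed, and the check ignores interactions between dimensions, strides, and dilation.

```python
def check_map(size, offset, dim_size, sliding_window=False):
    """Return the legality conditions a (size, offset) map appears to violate.

    sliding_window=True marks dimensions such as X and Y, where overlap
    between adjacent tiles is intended and is not treated as redundancy.
    """
    violations = []
    if size > dim_size:
        violations.append("bound")        # maps indices that do not exist
    if offset > size:
        violations.append("coverage")     # leaves gaps between tiles
    if offset < size and not sliding_window:
        violations.append("redundancy")   # re-maps already mapped elements
    return violations


# The examples from the text, with K = 4 and a hypothetical C = 6.
bound_case = check_map(5, 5, 4)        # TemporalMap(5,5) K
coverage_case = check_map(2, 4, 4)     # TemporalMap(2,4) K
redundancy_case = check_map(2, 1, 6)   # TemporalMap(2,1) C
legal_case = check_map(2, 2, 4)        # TemporalMap(2,2) K
```

A strided or dilated convolution would trip the coverage check here even though the gaps are intentional, which mirrors the caveat above about reviewing such dataflows manually.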
4 Dataflow Analysis
In this section, we introduce our approach to estimating runtime and energy efficiency of dataflows on a target DNN model and hardware configuration. Based on the approach, we implement an analysis framework, MAESTRO, which consists of data reuse, performance, and buffer analysis engines as described in Fig. 1. We discuss the input to MAESTRO and three analysis engines of MAESTRO next.
4.1 Input Specification
The input to MAESTRO consists of a DNN model, a hardware description, and dataflows described in the data-centric directives discussed in Section 3. Fig. 8 shows a sample specification of the VGG16 model as an input to the framework. Each layer of the model also includes a description of its dataflow using a sequence of the data-centric directives discussed in Section 3.1. In addition, a hardware description of the target accelerator is specified, including the number of PEs, the sizes of the L1/L2 buffers, and the bandwidth of the NoC. MAESTRO supports a wide range of layers, including convolutional, pooling, fully-connected, and so on. MAESTRO also models sparse DNN layers by taking a percentage of sparsity for each data class, assuming a uniform distribution of zeros.
4.2 Data Reuse Analysis Engine
Data reuse is a prime optimization target of DNN accelerators because it reduces energy-consuming DRAM and shared buffer accesses. Data reuse opportunities are implied by dataflows and exploited with proper hardware support. To analyze this process, we first analyze the reuse opportunities in the data reuse analysis engine, which computes the number of temporally, spatially, and spatio-temporally reused data points in each data class for each unit temporal iteration, a fundamental time concept in dataflows that we discuss next.
4.2.1 Unit Temporal and Spatial Iterations
The unit temporal and spatial iterations are the fundamental time concepts in the data reuse analysis engine. In the loop nest representation of a dataflow, the unit temporal iteration is the iteration of the first non-parallel loop above the innermost parallel loop. For example, the x loop in Fig. 7 (a) is the unit temporal iteration, so it is marked as a unit time step. In the data-centric representation, the first TemporalMap above the innermost SpatialMap that is not fully unrolled is the unit temporal iteration. For example, TemporalMap(3,1) X is the unit temporal iteration in Fig. 7 (b). Note that SpatialMap(1,1) R covers all elements in the R dimension, so it does not introduce any changes in the mapped R elements over time, as shown in Fig. 7 (d). Therefore, such fully-unrolled spatial/temporal maps are ignored when we determine unit iterations.
We define unit spatial iterations as the individual mappings of tiles over physical PEs. For example, in the SpatialMap example in Fig. 6, the mapping of Y elements (0,1,2) on PE0 is the first spatial iteration, (1,2,3) on PE1 is the second spatial iteration, and so on. Among spatial iterations, we take the innermost spatial iterations in a physical PE as the unit spatial iterations. That is, the most fine-grained spatial iteration in a physical PE is the unit spatial iteration. The unit temporal iteration can be viewed as the iteration of entire tiles mapped over a physical PE array. The unit spatial iteration can be viewed as the iteration of individual tiles mapped over each physical PE within a temporal iteration. These time concepts are the most fine-grained time units from the perspective of a physical PE array and a physical PE, respectively. Based on these time concepts, we analyze temporal, spatial, and spatio-temporal data reuse by tracking the overlapped data points between unit temporal/spatial iterations.
4.2.2 Data Reuse Analysis
To identify the amount of data reuse, we investigate the number of data points that overlap among adjacent unit temporal and spatial iterations. We discuss how we analyze each data reuse class introduced in Table 1.
Spatial Reuse: The amount of spatial reuse is the number of overlapped data points across unit spatial iterations within a unit temporal iteration. For example, y elements [1-2] are reused across clusters in Fig. 7 (d) at unit temporal iteration (time step) zero. Because the other dimensions correlated to the input tensor (N, C, and X) are all temporally mapped, the overlapped input tensor volume is $1 \times 3 \times 2 \times 3 = 18$ (the N, C, and X tile sizes times the two overlapped Y elements). That is, 18 input tensor data points can be spatially reused, or multicasted.
Temporal Reuse: The amount of temporal reuse is the number of overlapped data points between two consecutive unit temporal iterations. For example, in Fig. 7 (d), all of the weight data points for each PE remain stationary between unit temporal iterations 0 and 1. Therefore, the number of temporally reused weight data points is the full weight tile of each PE, i.e., $2 \times 3 \times 1 \times 3 = 18$ for the K, C, R, and S tile sizes chosen in Section 3.2.
Spatio-temporal Reuse: The amount of spatio-temporal reuse in partial sums is the number of PEs that share the same output data points in each unit temporal iteration. For example, in Fig. 7 (d), the PEs in each cluster share the mapped output tensor data points. That is, all the partial sums generated in each PE within a unit temporal iteration can be first accumulated within a PE and then across the PEs. For operand spatio-temporal reuse, we check the same condition as for spatial reuse when the target NoC does not support multicasting.
We have discussed how we estimate the number of temporally, spatially, and spatio-temporally reused data points. MAESTRO computes the overlapped elements in each dimension across unit temporal/spatial iterations using the mapping size and offset arguments, taking cluster structures into account. Using the number of overlapped elements, we compute the number of reused data points in each data class in the same way, by multiplying the numbers of reused elements in each dimension correlated to that data class. The amount of each type of data reuse serves as the fundamental information for the other engines.
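The multiplication rule described above can be sketched as follows. The function names are hypothetical, and the per-dimension numbers are taken from the row-stationary example (tile sizes N:1, K:2, C:3, X:3, R:1, S:3, and a cluster-level SpatialMap(3,1) on Y, the latter being our reading of the text rather than a directive copied from Fig. 7). The sketch ignores cluster hierarchies and edge iterations.

```python
from math import prod


def overlap(size, offset):
    """Elements shared by two adjacent tiles of a (size, offset) map."""
    return max(size - offset, 0)


def reused_points(per_dim_reused):
    """Reused data points of a data class: product over its dimensions."""
    return prod(per_dim_reused.values())


# Spatial reuse of inputs: Y is spatially mapped with overlap, while
# N, C, and X are temporally mapped, so their full tiles are shared.
input_spatial_reuse = reused_points(
    {"N": 1, "C": 3, "Y": overlap(3, 1), "X": 3}
)

# Temporal reuse of weights: the whole per-PE weight tile stays stationary.
weight_temporal_reuse = reused_points({"K": 2, "C": 3, "R": 1, "S": 3})
```

Both quantities evaluate to 18 data points under these assumptions, which agrees with the spatial-reuse figure quoted for the input tensor.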
4.3 Performance Analysis Engine
The runtime of a DNN accelerator consists of the communication delay (L2 to L1, L1 to L2, and local forwarding) and the computation delay in each PE. Considering double buffering and latency hiding, we perform a worst-case analysis to estimate the delay of each unit temporal iteration. That is, the longest latency among L2-to-L1/L1-to-L2 communication, local forwarding, and computation is the delay of each unit temporal iteration. Each delay component is estimated as follows:
L2-to-L1 Delay: MAESTRO identifies the number of data points to be transferred from the shared buffer (L2) to the local scratchpads (L1), considering all the reuse (i.e., only the unique data points between unit temporal/spatial iterations contribute to the traffic). Using the NoC bandwidth and latency information, MAESTRO computes the total delay to finish the transactions for a unit temporal iteration.
L1-to-L2 Delay: MAESTRO identifies the number of data points to be sent back to the shared buffer (L2). The amount of traffic is the number of unique output tensor data points mapped over the PE array. We apply our NoC model to compute the delay.
Local Forwarding: MAESTRO assumes a one-cycle delay for each spatio-temporally reused data point because such data points are sent via a direct link between adjacent PEs.
Compute Delay: MAESTRO analyzes the tile size of each data class mapped on each PE and deduces the number of partial sums to be generated from them. Considering the vector width of the ALU in each PE, MAESTRO computes the compute delay for each unit temporal iteration.
MAESTRO considers non-trivial aspects such as spatial folding and PE under-utilization due to a mismatch between the number of PEs and the size of the tensor dimension to be mapped. An example is presented in the SpatialMap example in Fig. 6. MAESTRO also considers the irregularity among the first, last, and steady-state unit temporal iterations, the distribution of NoC traffic over the compute delay (an optimization that works with double buffering), the lifetime of temporally reused data points, and so on, which together encompass most of the practical aspects of DNN accelerator execution. Note that MAESTRO does not run a cycle-accurate simulation; it instead delivers sufficiently precise results within a few milliseconds, as we discuss in Section 4.5 and Section 5.
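The worst-case rule for one unit temporal iteration can be sketched as a maximum over the delay components. Everything below is an illustrative assumption rather than MAESTRO's actual model: the function and parameter names are hypothetical, the NoC delay is taken as latency plus serialized transfer time, and forwarding is read as one cycle per forwarded data point.

```python
from math import ceil


def noc_delay(num_points, bandwidth, latency):
    """Assumed NoC model: fixed latency plus serialization over bandwidth."""
    if num_points == 0:
        return 0
    return latency + ceil(num_points / bandwidth)


def unit_iteration_delay(ingress, egress, forwarded, psums,
                         bandwidth, latency, vector_width):
    """Delay of one unit temporal iteration and its bottleneck component.

    With double buffering, communication overlaps with computation, so
    the slowest component determines the iteration delay.
    """
    delays = {
        "l2_to_l1": noc_delay(ingress, bandwidth, latency),
        "l1_to_l2": noc_delay(egress, bandwidth, latency),
        "forwarding": forwarded,  # one cycle per forwarded data point
        "compute": ceil(psums / vector_width),
    }
    bottleneck = max(delays, key=delays.get)
    return delays[bottleneck], bottleneck


# The same traffic is compute-bound with a scalar ALU and becomes
# communication-bound with a 9-wide vector ALU.
scalar = unit_iteration_delay(18, 3, 2, 27, bandwidth=4, latency=2,
                              vector_width=1)
vector = unit_iteration_delay(18, 3, 2, 27, bandwidth=4, latency=2,
                              vector_width=9)
```

Returning the bottleneck label alongside the delay is a design choice for this sketch; it makes it easy to see whether a dataflow is limited by NoC bandwidth or by compute on a given hardware configuration.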
4.4 Buffer Analysis Engine
Buffers are a crucial component that enables not only tiling of data points but also temporal reuse, and they require significant area and energy in DNN accelerators. Therefore, precisely estimating the amount of buffer space required to support an input dataflow is crucial for DNN accelerator designers. Also, global and local buffer access counts are directly related to energy consumption because the energy cost per access differs between the buffer levels. We analyze buffer sizes and access counts as follows.
Buffer Size: MAESTRO defines the minimum local scratchpad (L1) size to support a dataflow as a double of tile size mapped on each PE in each unit temporal iteration. Note that we double the tile size to model double buffering. For shared buffer (L2) size requirement, MAESTRO computes a double of unique data points across the PE array within two temporal iterations for double buffering support. Depending on the dataflow, MAESTRO allocates some shared buffer spaces for partial sums to prevent redundant DRAM accesses.
Buffer Access Counts: For the local scratchpad (L1), MAESTRO identifies the number of data points needed to generate all the partial sums within a tile mapped on a PE in each unit iteration. This number contributes to both L1 reads and L1 writes. The number of output tensor data points is also counted as L1 writes. These numbers are multiplied by the total number of unit temporal iterations. For the shared buffer (L2), MAESTRO identifies the number of unique data points mapped over the entire PE array in each unit temporal iteration, multiplies it by the total number of unit temporal iterations, and adds the result to the L2 read count. We count the number of data points read from DRAM as L2 writes. In both cases, partial sum accumulation may increase L2 access counts when the input dataflow generates partial outputs from each unit spatial iteration. Note that the buffer analysis engine also considers all the realistic aspects discussed in Section 4.3. In particular, data reuse converts expensive L2/DRAM accesses into affordable L1 accesses, effectively reducing energy consumption.
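The counting rules above can be written down as a short sketch. It is a simplified reading of the description, not MAESTRO's code: the function name, its parameters, and the example numbers are hypothetical, and the partial-sum accumulation term is left out.

```python
def access_counts(points_per_pe_iter, outputs_per_pe_iter,
                  unique_points_per_iter, dram_reads, num_iters):
    """Buffer access counts under the simplified rules described in the text."""
    return {
        # Data points needed for the partial sums count as L1 reads ...
        "l1_read": points_per_pe_iter * num_iters,
        # ... and as L1 writes, together with the output data points.
        "l1_write": (points_per_pe_iter + outputs_per_pe_iter) * num_iters,
        # Unique data points across the PE array are read from L2.
        "l2_read": unique_points_per_iter * num_iters,
        # Every data point fetched from DRAM is written into L2 once.
        "l2_write": dram_reads,
    }

counts = access_counts(points_per_pe_iter=18, outputs_per_pe_iter=1,
                       unique_points_per_iter=100, dram_reads=5000, num_iters=10)
print(counts)
```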
4.5 Model Validation
We validated the runtime model of MAESTRO against two accelerators: MAERI [16] and Eyeriss [8]. For MAERI, we use the cycle-accurate simulation for VGG16, which the developers have uploaded to an open-source repository. For Eyeriss, we use the reported runtime for AlexNet because detailed mapping parameters are described only for AlexNet. We described their dataflows using our dataflow directives and used them as inputs to MAESTRO. As the results in Fig. 9 show, the runtime estimated by MAESTRO closely matches the cycle-accurate simulation results and the reported runtime, with an average error rate of 5.90% for MAERI and 19.26% for Eyeriss. Note that we had full access to the simulation infrastructure for MAERI but not for Eyeriss, so we had to choose some hardware parameters ourselves. Also, the Eyeriss data is based on a real chip, which might involve irregularity, although we used the reported processing time rather than the total runtime. However, because our framework's estimates are consistently optimistic and follow the trend of the data, the runtime model of MAESTRO remains sound for comparing dataflows relative to one another.
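An average error rate of this kind is commonly computed as a mean absolute percentage error over the validated layers. The sketch below shows that computation; the choice of metric is an assumption (the text does not state the exact formula), and the runtimes are placeholder numbers, not measurements from the paper.

```python
def mean_abs_pct_error(estimated, reference):
    """Mean absolute percentage error of estimates against reference runtimes."""
    errors = [100 * abs(e - r) / r for e, r in zip(estimated, reference)]
    return sum(errors) / len(errors)

# Placeholder per-layer runtimes in cycles (hypothetical values).
reference_cycles = [1000, 2000, 4000]
estimated_cycles = [950, 1900, 3600]

print(mean_abs_pct_error(estimated_cycles, reference_cycles))
```

In this made-up example every estimate is lower than its reference, which mirrors the consistently optimistic behavior described above.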
4.6 Supported Dataflows
MAESTRO can model a variety of layers (LSTM hidden layer, pooling, fully-connected, separable convolution, and so on) thanks to the generality of our data-centric approach, which specifies a mapping of input tensors. For an LSTM hidden layer, we use the input width (dimension X) to specify the input size to the hidden layer and the input channel (dimension C) to specify the different gates. For convolution with stride, pooling layers, and transposed convolution, users need to specify the stride, pooling size, and expansion factor, respectively. MAESTRO also models uniformly distributed sparsity for any supported dataflow.
MAESTRO does not support programs with non-affine indices; this is a minor limitation because most DNNs have only affine data indices. MAESTRO also does not support non-uniform tiling that maps a different number of data points to each PE. However, such mappings are very rare because PEs are regular and load balancing is required across PEs. Finally, MAESTRO does not support accelerators with memory hierarchies other than the two levels we target (global L2 and local L1 in each PE). We support only this hierarchy because it is the most common design in recent accelerators [6, 22, 23].
5 Case Studies
Using MAESTRO, we present two case studies: (1) estimating the potential of the five dataflows in Table 2 and (2) hardware design space exploration.
5.1 Case study I: Dataflow Potentials
| Accelerator | Dataflow Strategy | Dataflow |
|---|---|---|
| Example for this work | No Local Reuse (NLR) | TemporalMap (1,1) K; TemporalMap (1,1) C; TemporalMap (1,1) Y; TemporalMap (1,1) R; TemporalMap (1,1) S; SpatialMap (1,1) X |
| Example for this work | Weight Stationary (WS) | TemporalMap (1,1) K; TemporalMap (1,1) C; TemporalMap (3,3) Y; SpatialMap (3,1) X; TemporalMap (Sz(R),Sz(R)) R; TemporalMap (Sz(S),Sz(S)) S |
| ShiDiannao [11] | Output Stationary (OS) | TemporalMap (1,1) K; TemporalMap (1,1) C; TemporalMap (Sz(R),1) Y; SpatialMap (Sz(S),1) X; TemporalMap (Sz(R),Sz(R)) R; TemporalMap (Sz(S),Sz(S)) S |
| Eyeriss [6] | Row-stationary (RS) | TemporalMap (1,1) K; TemporalMap (1,1) C; SpatialMap (Sz(R),1) Y; Cluster (Sz(R)); TemporalMap (Sz(S),1) X; TemporalMap (Sz(R),Sz(R)) R; TemporalMap (Sz(S),Sz(S)) S |
| NVDLA [1] | Weight Stationary | TemporalMap (Sz(R),Sz(R)) R; TemporalMap (Sz(S),Sz(S)) S; TemporalMap (64,64) C; TemporalMap (1,1) Y; TemporalMap (1,1) X; Cluster (P_cluster_size, L); SpatialMap (1,1) K |
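To make the directive notation concrete, the following sketch parses a directive sequence into a small data structure. The `Directive` class, the regular expression, and the `parse_dataflow` helper are hypothetical names introduced here for illustration; they are not part of MAESTRO, and symbolic arguments such as Sz(R) are kept as strings rather than evaluated.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Directive:
    kind: str                 # "TemporalMap", "SpatialMap", or "Cluster"
    args: Tuple[str, ...]     # (size, offset) for maps; cluster arguments otherwise
    dim: Optional[str]        # tensor dimension for maps, None for Cluster

# Arguments may contain one level of nested parentheses, e.g. "Sz(R),1".
_ARGS = r"((?:[^()]|\([^()]*\))*)"
_PATTERN = re.compile(
    rf"(TemporalMap|SpatialMap)\s*\({_ARGS}\)\s*([A-Z])\b|(Cluster)\s*\({_ARGS}\)"
)

def parse_dataflow(text):
    """Parse directives in order; separators such as ';' between them are ignored."""
    directives = []
    for m in _PATTERN.finditer(text):
        if m.group(1):   # TemporalMap or SpatialMap
            args = tuple(a.strip() for a in m.group(2).split(","))
            directives.append(Directive(m.group(1), args, m.group(3)))
        else:            # Cluster
            args = tuple(a.strip() for a in m.group(5).split(","))
            directives.append(Directive("Cluster", args, None))
    return directives

row_stationary = ("TemporalMap (1,1) K TemporalMap (1,1) C SpatialMap (Sz(R),1) Y "
                  "Cluster (Sz(R)) TemporalMap (Sz(S),1) X "
                  "TemporalMap (Sz(R),Sz(R)) R TemporalMap (Sz(S),Sz(S)) S")
for d in parse_dataflow(row_stationary):
    print(d)
```

Keeping the directives as an ordered list preserves the order in which they are written, and a consumer of this structure can pick out the spatially mapped dimensions by filtering on `kind == "SpatialMap"`.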
Fig. 10 presents the bandwidth and L1 memory requirements of the five dataflows listed in Table 2. Throughput is measured for a hypothetical 64-PE architecture running in a steady state (non-edge regions). Fig. 11 plots the energy consumption across the MAC, L1, and L2 for the same dataflows. We perform this analysis for two CONV layers of VGG16. We emphasize that this is an evaluation of these dataflows' applicability to this hypothetical architecture and is not meant as a comparison of the original systems, which vary widely in number of PEs, buffer sizes, network topology, and so on. (Footnote 1: For DLA and Shi, however, a vector read of size 16 and 4, respectively, from the L1 is assumed in the energy calculations instead of multiple expensive scalar reads, as their dataflows are tuned for such an implementation.)
We gather useful insights across the dataflows and layers. Between dataflows, we observe, as expected, that NLR has the smallest L1 memory requirement (as it does not perform temporal reuse at the PE) and therefore has significant L2 energy consumption, as presented in Fig. 11. For CONV1, the NVDLA dataflow consumes 98% of the average amount of energy. For CONV11, however, this trend changes: NVDLA consumes 63% of the average amount of energy, which is 2× lower than NLR, WS, and Shi, on average. This is because the ratio of input feature map to weight size is dramatically different in CONV1 (input-dominated) and CONV11 (weight-dominated), and the NVDLA dataflow is tuned to work efficiently in weight-dominated layers. In detail, CONV1 has just 3 input channels, while CONV11 has 512; NVDLA is tuned for layers with many input channels (as TemporalMap (64,64) on variable C of the NVDLA dataflow in Table 2 shows), which makes it inefficient for early layers, since it still pays the energy cost of vector reads, but much more efficient than other dataflows in later layers. For the same reason, NVDLA requires notably high NoC bandwidth in CONV11 (compared to CONV1), since more partial sums are mapped on each PE of NVDLA with CONV11, leading to more L1-to-L2 communication for partial sums and outputs. The RS dataflow is observed to be the most energy-efficient because it performs very few L2 reads, demonstrating the best input and weight reuse. Compared to NVDLA, it has much worse roofline throughput in CONV1 but slightly better throughput in CONV11. The Shi dataflow has the highest L1 buffer requirement among all dataflows, as it spatially replicates variable X across 3 PEs and two variables (R and S) are unrolled in each X iteration.
Fig. 12 plots the MAC and buffer access energy of the five dataflows on two convolutional layers with 16, 32, 64, 128, and 256 PEs. Note that the number of PEs varies within each dataflow bucket. As Fig. 12 shows, how energy consumption scales with the number of PEs depends on both the target layer and the dataflow. The row-stationary dataflow scales well in an early layer of VGG16 (CONV1); however, its energy consumption in a late layer of VGG16 (CONV11) increases superlinearly. This sharp increase occurs because the characteristics of the late layer (small input and a large number of channels) do not work well with spatially mapped input columns. Because of the underutilization of PEs and the halo in the Y dimension (TemporalMap (3,1) Y in Table 2), the row-stationary dataflow needs to read the same input data over small tiles, which results in a large number of L2 reads. Because of its good scalability, the DLA dataflow performs better with a large number of PEs on CONV11. However, the DLA dataflow performs worst in CONV1 because of the lack of input/output channels in early layers. Therefore, we need to perform a careful study that considers layer dimensions, dataflow characteristics, and scalability before we select a dataflow.
5.2 Case study II: Design Space Exploration
Using MAESTRO, we implement a hardware design space exploration (DSE) tool that searches four hardware parameters (the number of PEs, L1/L2 buffer sizes, and NoC bandwidth) optimized for either energy efficiency, throughput, or throughput per Watt within given hardware area and power constraints. The DSE tool receives the same set of inputs as MAESTRO, together with the hardware area/power constraints and the area/power of building blocks synthesized with the target technology. For the cost of building blocks, we implement floating-point/fixed-point multipliers and adders, a bus, a bus arbiter, and global and local scratchpads in RTL and synthesize them using 28nm technology. For the bus and arbiter costs, we fit the costs to a linear and a quadratic model, respectively, using regression, because the bus cost increases linearly and the arbiter (matrix arbiter) cost increases quadratically. Users can specify one of three optimization targets: runtime (throughput), energy, or runtime-energy product.
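The regression step can be illustrated with ordinary least squares. This is a minimal sketch under assumptions: the exact model forms are not given in the text, so the bus cost is taken as a·n + b and the arbiter cost as a·n² + b (no linear term), and the sample points are synthetic values, not synthesis results.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_bus_cost(ports, costs):
    """Linear model: cost = a * ports + b."""
    return fit_line(ports, costs)

def fit_arbiter_cost(ports, costs):
    """Quadratic model without a linear term: cost = a * ports**2 + b."""
    return fit_line([p * p for p in ports], costs)

# Synthetic sample points (hypothetical area units).
ports = [4, 8, 16, 32]
bus_costs = [2.0 * p + 1.0 for p in ports]
arbiter_costs = [0.5 * p * p + 3.0 for p in ports]

print(fit_bus_cost(ports, bus_costs))
print(fit_arbiter_cost(ports, arbiter_costs))
```

Fitting the arbiter cost against the squared port count keeps the problem linear in its coefficients, so the same closed-form solver serves both models.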
The DSE tool sweeps a target design space specified by the range and search granularity of each parameter. However, at each iteration over a hardware parameter, it checks the minimum area and power of all the design points reachable from the inner loops over the remaining parameters and skips the corresponding part of the design space if even that minimum violates the constraints. This optimization skips invalid design points at various granularities and eliminates a large number of futile searches, which leads to a large effective DSE rate ranging from 3.3K to 0.46M designs per second, as presented in Table 3.
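The pruning idea can be sketched with nested loops over sorted parameter values. This is an illustration under assumptions, not the DSE tool's code: it uses only three parameters and an area constraint, and the area model with its constants is hypothetical. It relies on the area growing monotonically with every parameter.

```python
PE_AREA = 10   # hypothetical area units per PE, excluding its L1
L1_AREA = 1    # hypothetical area units per L1 entry in each PE
BW_AREA = 50   # hypothetical area units per unit of NoC bandwidth

def area(pes, l1, bw):
    """Assumed area model, monotonically increasing in every parameter."""
    return pes * (PE_AREA + L1_AREA * l1) + BW_AREA * bw

def sweep(pe_opts, l1_opts, bw_opts, area_budget):
    """Return (valid design points, number of points skipped without evaluation)."""
    pe_opts, l1_opts, bw_opts = sorted(pe_opts), sorted(l1_opts), sorted(bw_opts)
    valid, skipped = [], 0
    for i, pes in enumerate(pe_opts):
        # Cheapest design reachable from the inner loops for this PE count.
        if area(pes, l1_opts[0], bw_opts[0]) > area_budget:
            skipped += (len(pe_opts) - i) * len(l1_opts) * len(bw_opts)
            break
        for j, l1 in enumerate(l1_opts):
            if area(pes, l1, bw_opts[0]) > area_budget:
                skipped += (len(l1_opts) - j) * len(bw_opts)
                break
            for k, bw in enumerate(bw_opts):
                if area(pes, l1, bw) > area_budget:
                    skipped += len(bw_opts) - k
                    break
                valid.append((pes, l1, bw))
    return valid, skipped

valid, skipped = sweep([16, 32, 64], [64, 128], [8, 16], area_budget=3000)
print(valid, skipped)
```

Because the options are visited in ascending order, one failed minimum-area check rules out every remaining value of that parameter together with all of its inner-loop combinations.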
Table 3 shows statistics of the four DSE runs that explored the design space presented in Fig. 13. We ran the DSEs on a machine with an i7-8700K CPU and 32GB of memory running Linux Mint 19. We ran the four DSE runs on the machine at the same time, and all of them terminated within 24 minutes, with an effective DSE rate of 0.17M designs per second on average. Using the DSE tool, we explore the design space of DLA [1] and row-stationary [6] dataflow accelerators. We set the area and power constraints to 16 mm² and 450 mW, which are the reported chip area and power of Eyeriss [8]. We plot the entire design space we explored in Fig. 13.
Design Space Analysis: Whether an accelerator can achieve peak throughput depends not only on the number of PEs but also on the NoC bandwidth. In particular, even when an accelerator has a sufficient number of PEs to exploit the maximum degree of parallelism a dataflow allows, it suffers from a communication bottleneck if the NoC does not provide sufficient bandwidth. Such design points can be observed in the bottom-right region of the area-throughput plots of each DSE run in Fig. 13.
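A roofline-style bound makes this interaction explicit. The sketch below is a simplification introduced here, not the model MAESTRO uses: it assumes that throughput is limited by the smaller of the compute peak and the rate at which the NoC can deliver operands, and the parameter names and numbers are hypothetical.

```python
def bounded_throughput(num_pes, noc_bandwidth, words_per_mac, macs_per_pe=1):
    """Upper bound on MACs per cycle under a simple roofline assumption.

    noc_bandwidth: words the NoC delivers per cycle.
    words_per_mac: average words that must cross the NoC per MAC (after reuse).
    """
    compute_peak = num_pes * macs_per_pe
    noc_bound = noc_bandwidth / words_per_mac
    return min(compute_peak, noc_bound)

# Same NoC, more PEs: beyond 128 PEs the NoC becomes the bottleneck.
for pes in (64, 128, 256):
    print(pes, bounded_throughput(pes, noc_bandwidth=32, words_per_mac=0.25))
```

In this made-up configuration, going from 128 to 256 PEs adds area without adding throughput, which is the kind of design point described above.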
During DSE runs, MAESTRO reports the buffer requirements of an input dataflow, and the DSE tool allocates exactly the amount of buffer that MAESTRO reports. Contrary to intuition, a larger buffer does not always provide higher throughput, as shown in the buffer-throughput plots in Fig. 13 (plots in the second column). The optimal points with respect to throughput per buffer size are in the top-left region of the buffer-throughput plots. The existence of such points indicates that the tiling strategy of the dataflow (the mapping sizes in our directive representation) significantly affects the efficiency of buffer use.
We observe that the throughput-optimized designs have a moderate number of PEs and moderate buffer sizes, implying that hardware resources need to be distributed not only to PEs but also to the NoC and buffers to achieve high PE utilization. Likewise, we observe that a larger buffer does not directly increase throughput or energy efficiency. These results imply that all the components are intertwined and that they need to be well balanced to obtain a highly efficient accelerator.
6 Related Works
Hardware DSE and dataflow optimization: Dataflow optimization is one of the key optimization targets in many recent DNN accelerators such as Eyeriss [6], Flexflow [17], SCNN [22], and NVDLA [1]. C-brain [26] and Flexflow [17] analyzed the cost-benefit tradeoff of three dataflows and explored the opportunity of adaptive dataflow based on the tradeoff. Ma et al. [18] also constructed an analytic model for convolutions on FPGAs focusing on three loop transformations: interchange, unroll, and tiling. Although their analytic model provides an intuition for the tradeoffs of dataflows, the model focuses on the one dataflow style they propose, does not consider regional spatial reuse or spatio-temporal reuse opportunities in DNN accelerators, and does not consider communication delay in the NoC, which can dominate for dataflows with large tile sizes. Also, the target dataflow is optimized for an HLS flow and must be expressed as a complex annotated loop nest with HLS synthesis directives. Caffeine [31] proposed a fully automatic FPGA flow that includes pragma-based annotation in programs, a dataflow optimization framework, and DSE for FPGAs based on an analytic model defined over loop tiling and unrolling. However, its dataflow search space is limited by fixed loop orders: three presets termed straightforward, input-major, and weight-mapping.
DNN loop optimization frameworks: Domain-specific compiler frameworks (including polyhedral-based ones) for DNN applications are becoming very popular as a way to improve productivity and performance by efficiently mapping specified DNN models onto a variety of hardware, including CPUs [28], GPUs [29], FPGAs [9], and specialized accelerators [20, 4]. These compiler frameworks support either automatic loop-nest-based optimization or explicit user scheduling directives to map DNN computations effectively onto hardware. A significant difference from these frameworks is that we use data-centric directives to represent and map dataflows instead of a compute-centric (loop-nest) notation. However, we believe that the compute-centric and data-centric representations are complementary; that is, an IR with our data-centric directives can be obtained from the compute-centric (loop-nest) representation to facilitate design space exploration and to find optimal dataflows automatically.
7 Discussion and Future work
This work is motivated by the observation that co-optimizing the DNN accelerator microarchitecture and its internal dataflow(s) is crucial for accelerator designers to achieve both higher performance and energy efficiency. In this work, we introduced data-centric directives to specify DNN dataflows in a compact form, and an analytical model (MAESTRO) to estimate the execution time, energy efficiency, and hardware cost of dataflows. We evaluated our analytical model against the MAERI and Eyeriss accelerators and found it to be highly consistent with cycle-accurate simulations and reported runtimes, which shows the soundness of the analytic model. Finally, we demonstrated the use of MAESTRO for design space exploration of two dataflows. Our MAESTRO-based DSE tool enabled fast DSE through an optimization that skips invalid designs, which led to a high average DSE rate of 0.17M designs per second.
References

[1] NVDLA deep learning accelerator. http://nvdla.org, 2017.
[2] V Aklaghi, Amir Yazdanbakhsh, Kambiz Samadi, H Esmaeilzadeh, and RK Gupta. Snapea: Predictive early activation for reducing computation in deep convolutional neural networks. In International Symposium on Computer Architecture (ISCA), 2018.
[3] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, et al. Deep speech 2: End-to-end speech recognition in english and mandarin. arXiv preprint arXiv:1512.02595, 2015.
[4] Andre Xian Ming Chang, Aliasger Zaidy, Lukasz Burzawa, and Eugenio Culurciello. Deep Neural Networks Compiler for a Trace-based Accelerator (Short WIP Paper). In Proceedings of the 19th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, LCTES 2018, pages 89–93, New York, NY, USA, 2018. ACM.

[5] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In International conference on Architectural support for programming languages and operating systems (ASPLOS), 2014.
[6] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks. In International Symposium on Computer Architecture (ISCA), 2016.
[7] Yu-Hsin Chen, Joel Emer, and Vivienne Sze. Eyeriss v2: A Flexible and High-Performance Accelerator for Emerging Deep Neural Networks, 2018.
[8] Yu-Hsin Chen, Tushar Krishna, Joel S Emer, and Vivienne Sze. Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks. IEEE Journal of Solid-State Circuits, 52(1):127–138, 2017.
[9] Jason Cong and Jie Wang. PolySA: Polyhedral-based Systolic Array Auto-compilation. In Proceedings of the International Conference on Computer-Aided Design, ICCAD ’18, pages 117:1–117:8, New York, NY, USA, 2018. ACM.
[10] Jason Cong and Bingjun Xiao. Minimizing computation in convolutional neural networks. In International conference on artificial neural networks (ICANN), pages 281–290. Springer, 2014.
[11] Zidong Du, Robert Fasthuber, Tianshi Chen, Paolo Ienne, Ling Li, Tao Luo, Xiaobing Feng, Yunji Chen, and Olivier Temam. Shidiannao: Shifting vision processing closer to the sensor. In International Symposium on Computer Architecture (ISCA), 2015.
[12] Clement Farabet, Camille Couprie, Laurent Najman, and Yann LeCun. Learning hierarchical features for scene labeling. PAMI, 35(8):1915–1929, 2013.
[13] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris: Scalable and efficient neural network acceleration with 3d memory. ACM SIGOPS Operating Systems Review, 51(2):751–764, 2017.
[14] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In International Symposium on Computer Architecture (ISCA), pages 1–12. IEEE, 2017.

[15] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[16] Hyoukjun Kwon, Ananda Samajdar, and Tushar Krishna. Maeri: Enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 461–475, 2018.
[17] Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In International Symposium on High Performance Computer Architecture (HPCA), 2017.
[18] Yufei Ma, Yu Cao, Sarma Vrudhula, and Jae-sun Seo. Optimizing loop operation and dataflow in fpga acceleration of deep convolutional neural networks. In International Symposium on Field-Programmable Gate Arrays (FPGA), pages 45–54. ACM, 2017.
[19] Mostafa Mahmoud, Kevin Siu, and Andreas Moshovos. Diffy: a déjà vu-free differential deep neural network accelerator. In International Symposium on Microarchitecture (MICRO), 2018.
[20] Thierry Moreau, Tianqi Chen, Ziheng Jiang, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. VTA: An Open Hardware-Software Stack for Deep Learning. arXiv preprint arXiv:1807.04188, 2018.
[21] Naveen Muralimanohar, Rajeev Balasubramonian, and Norman P Jouppi. Cacti 6.0: A tool to model large caches. HP laboratories, pages 22–31, 2009.
[22] A. Parashar et al. Scnn: An accelerator for compressed-sparse convolutional neural networks. In International Symposium on Computer Architecture (ISCA), pages 27–40, 2017.
[23] Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. From high-level deep neural models to fpgas. In IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016.
[24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
[26] Lili Song, Ying Wang, Yinhe Han, Xin Zhao, Bosheng Liu, and Xiaowei Li. C-brain: a deep learning accelerator that tames the diversity of cnns through adaptive data-level parallelization. In Design Automation Conference (DAC), pages 1–6, 2016.
[27] Alexander Toshev and Christian Szegedy. Deeppose: Human pose estimation via deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[28] Leonard Truong, Rajkishore Barik, Ehsan Totoni, Hai Liu, Chick Markley, Armando Fox, and Tatiana Shpeisman. Latte: A Language, Compiler, and Runtime for Elegant and Efficient Deep Neural Networks. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’16, pages 209–223, New York, NY, USA, 2016. ACM.
[29] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew Adams, and Albert Cohen. Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions. arXiv preprint arXiv:1802.04730, 2018.
[30] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv preprint arXiv:1609.08144, 2016.
[31] Chen Zhang, Guangyu Sun, Zhenman Fang, Peipei Zhou, Peichen Pan, and Jason Cong. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2018.