Marvel: A Data-centric Compiler for DNN Operators on Spatial Accelerators

02/18/2020 · by Prasanth Chatarasi, et al. · Georgia Institute of Technology, Nvidia

The efficiency of a spatial DNN accelerator depends heavily on the compiler and its cost model's ability to generate optimized mappings for various operators of DNN models onto the accelerator's compute and memory resources. However, existing cost models lack a formal boundary over the operators for precise and tractable analysis, which poses adaptability challenges for new DNN operators. To address this challenge, we leverage the recently introduced Maestro Data-Centric (MDC) notation. We develop a formal understanding of DNN operators whose mappings can be described in the MDC notation, because any mapping adhering to the notation is always analyzable by the MDC's cost model. Furthermore, we introduce a transformation for translating mappings into the MDC notation for exploring the mapping space. Searching for the optimal mappings is challenging because of the large space of mappings, and this challenge is exacerbated by new operators and diverse accelerator configurations. To address this challenge, we propose a decoupled off-chip/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces, and first optimizes the off-chip subspace followed by the on-chip subspace. The motivation for this decomposition is to dramatically reduce the size of the search space, and also to prioritize the optimization of off-chip data movement, which is 2-3 orders of magnitude more expensive than on-chip data movement. We implemented our approach in a tool called Marvel, and another major benefit of our approach is that it is applicable to any DNN operator conformable with the MDC notation.


1 Introduction

Deep learning (DL) is a fundamental technology for many emerging applications such as autonomous driving [bojarski2016end], translation [wu2016google], and image classification [russakovsky2015imagenet], with accuracy close to, and even surpassing, that of humans [karpathy2015deep, toshev2014deeppose, farabet2013learning]. Achieving low latency and energy goals under the stringent computation and memory constraints of deep neural network (DNN) models for mobile [apple_neural_core] and edge [edge_tpu] devices has emerged as an important challenge. To cope with this challenge, specialized hardware accelerators for DNN inference are being developed and deployed [chen2019eyeriss, DBLP:journals/micro/ChungFOPCMLLAHA18, nvdla, edge_tpu, xDNN-web]. Most of these accelerators are "spatial", i.e., they are built by interconnecting hundreds to thousands of processing elements (PEs). They achieve high throughput by exploiting parallelism over the PEs, and energy efficiency by maximizing data reuse within the PE array via direct data forwarding between PEs and the use of scratchpad memories [eyeriss_isca, chen2014diannao, nvdla, parashar2017scnn, sharma2016high, jouppi2017datacenter, ma2017optimizing, zhang2015optimizing].

Figure 1: Overview of the design-time flow for computer architects developing new accelerators, and the compilation flow for ML programmers leveraging the accelerators. Scope of this work is the mapping explorer and the loop optimizer in the above diagram.

The efficiency of accelerators depends heavily on the compiler's ability to generate optimized mappings for various operators of DNN models onto the accelerator's compute and memory resources. A mapping involves parallelization, tiling, and scheduling strategies [angshu2019timeloop, kwon2018maestro]. Optimizing compilers (or mappers) for various DNN operators are needed both at compile time by ML programmers and at design time by computer architects, who must understand reuse and data movement behaviors to design a new accelerator, as shown in Figure 1. Thus, expressing DNN mappings and determining optimal ones is a crucial component of DNN deployment on accelerators.

Mappings are often expressed as loop nests, a syntax that resembles a simple imperative programming language with explicit parallelism. Many cost models, such as TimeLoop [angshu2019timeloop], dMazeRunner [DMazeRunner], and Interstellar [interstellar], are built on the loop nest description of mappings. The loop nest syntax is very generic and can help architects/compilers express a wide range of operator mappings, but the underlying cost models may not be able to analyze all possible mappings expressible as loop nests. Furthermore, these cost models do not have a formal boundary over DNN operators for precise and tractable analysis. The absence of such formal boundaries poses adaptability challenges for these cost models within compiler infrastructures, and also for computer architects performing design-time exploration of new DNN operators on accelerators.

In this paper, we address the above challenge. We leverage the recently introduced "Maestro Data-Centric" (MDC) notation [kwon2018maestro] for expressing mappings. MDC is promising because any mapping adhering to the notation is analyzable by the MDC's cost model. Moreover, the notation explicitly defines data mapping and organization, instead of inferring it from loop nests. The overall focus of this work is on (1) developing a formal understanding of DNN operators whose mappings can be described in the MDC notation, (2) introducing a transformation for translating mappings into the MDC notation for exploring the mapping space, and finally (3) proposing an efficient exploration strategy to quickly navigate the large mapping space of DNN operators. The key contributions are briefly described below.

1) Conformable DNN operators. The promising aspect of the MDC notation, i.e., analyzability, comes at the cost of its expressiveness. In this work, we introduce a formal set of rules (Section 3) for identifying DNN operators whose mappings can be described in the MDC notation. We call an operator satisfying these rules a conformable operator, and Table 1 lists the conformability of popular operators with the MDC notation.

2) Transformation. The MDC notation is powerful for expressing and reasoning about complex mappings of DNN operators onto diverse spatial accelerators, but explicitly writing and exploring such mappings can be error-prone and tedious. Computer architects [angshu2019timeloop] and DNN compiler frameworks [Chen:2018:TAE:3291168.3291211] primarily view operators and their mappings in loop nest form. Hence, we introduce a transformation (Section 4) that translates a mapping specified in loop nest form to the MDC notation and can help both architects and compilers with mapping space exploration.

3) Mapping space exploration. The efficiency of any mapping is tightly coupled with both the algorithmic aspects of DNN operators and the microarchitectural aspects of accelerators. Searching for the optimal mapping is challenging because of the massive space of possible loop transformations on the operators. For example, there are over 10 valid mappings for a CONV2D operator on average when mapping ResNet50 [Resnet] and MobileNetV2 [sandler2018mobilenetv2] on a representative DNN edge accelerator. This challenge is exacerbated by new operators (e.g., depth-wise) and diverse hardware accelerator configurations. Much of the prior work [zhang2015optimizing, ma2017optimizing, zhao2019mRNA, DBLP:journals/corr/YangPRBRKRPH16] targeted hardware with limited capabilities or fixed certain aspects of the mapping space, such as the choice of parallel loops and loop orders [interstellar, DMazeRunner, zhang2015optimizing, ma2017optimizing, motamedi2016design]. Approaches supporting broader classes of architectures and mappings suffer from a combinatorial explosion in the size of the mapping space.

Our approach to the mapping problem is motivated by the observation that the off-chip data movement between DRAM and the accelerator is 2-3 orders of magnitude more expensive than the on-chip data movement involving the PE array and the local scratchpad buffers [eyeriss_isca, DBLP:journals/corr/SzeCYE17]. Hence, we propose an approach (Section 5) referred to as "decoupled off-chip/on-chip" that decomposes the mapping space into two subspaces, i.e., off-chip and on-chip subspaces, and first optimizes the off-chip subspace, followed by exploring the on-chip mapping subspace constructed from the optimal mappings of the off-chip subspace. In contrast to prior work [angshu2019timeloop, interstellar, DMazeRunner], we use different approaches and cost models for these subspaces, i.e., a classical distinct-block (DB) locality cost model [ferrante1991estimating, Sarkar:1997:ASH:271819.271828] to explore the off-chip subspace, and the MDC's cost model [kwon2018maestro] for the on-chip subspace.

We implemented the above approach in a tool called "Marvel", and our approach is applicable to any operator conformable with the MDC notation. Given a conformable DNN operator, workload sizes, and a target accelerator configuration, Marvel explores the mapping space of the operator using the decoupled approach and then outputs the mappings optimized for runtime and energy. Overall, our approach reduced the mapping space by a factor of ten billion on average for the four major CNN models (AlexNet, VGG16, ResNet50, MobileNetV2), while generating mappings that demonstrate a geometric mean improvement of 10.25x higher throughput and 2.01x lower energy consumption compared with three state-of-the-art mapping styles from past work. We also evaluated our approach on GEMM, LSTM, and MLP workloads, and compared Marvel-generated mappings with optimizers from past work.

2 Background

In this section, we provide a brief overview of spatial DNN accelerators and of the MDC notation used to describe mappings of a DNN operator onto such accelerators.

2.1 Spatial DNN Accelerators

Figure 2: Abstract spatial accelerator model which is pervasive in many state-of-the-art accelerators [eyeriss_isca, nvdla, kwon2018maeri, jouppi2017datacenter].

Spatial DNN accelerators based on ASICs and FPGAs have emerged to address extreme demands on the performance and energy efficiency of CNN layers [eyeriss_isca, chen2014diannao, nvdla, parashar2017scnn, sharma2016high, jouppi2017datacenter]. Such accelerators are built using an array of processing elements (PEs) to provide high parallelism, and use direct communication instead of shared memory for energy efficiency. An abstract model of spatial accelerators is shown in fig. 2, where each PE of an accelerator consists of one or more ALUs dedicated to multiply-accumulate (MAC) operations and a local scratchpad (L1 buffer). Accelerators also employ various networks-on-chip (NoCs) for direct communication among PEs and between the PE array and the L2 scratchpad buffer. The interconnection network often supports multicasting data to multiple PEs, which can reduce the total number of data reads from the L2 buffer to the PEs. Unlike GPU cores, PEs can communicate with adjacent PEs (data forwarding) over the NoC, which can significantly reduce the energy consumption of expensive L2 buffer accesses. Accelerators also typically employ a large shared L2 scratchpad buffer to stage data from DRAM and partial accumulations from the PE array. Both L1 and L2 scratchpads are software-controlled memories, i.e., the programmer/compiler directly controls the contents of the buffers, unlike cache memories, which manage their contents implicitly; this is possible because the memory traffic in accelerators is known in advance. Many spatial accelerators can be further interconnected to create a scale-out system [DBLP:journals/micro/ChungFOPCMLLAHA18].

2.2 MDC Notation

The Maestro Data-Centric (MDC) notation for a DNN operator mapping onto a spatial accelerator consists of two aspects: 1) computation and tensor sizes, and 2) data mapping directives over tensor dimensions. A sample mapping of the CONV1D operator in the MDC notation is shown in fig. 3(B). A major novelty of the MDC notation is that the data mappings of tensors across space (PEs) and time are explicitly specified using a set of data mapping directives, which enables the MDC cost model to estimate the data movement and reuse behaviors of a mapping precisely and quickly. We briefly describe the data mapping directives of the MDC notation using the mapping in fig. 3(B) as the running example.

Figure 3: A mapping of the CONV1D in the MDC notation along with the visualization of its data mappings.

1) TemporalMap(size, offset) specifies a distribution of a tensor dimension across time steps within a PE, and the mapped set of dimension indices is the same across all PEs in a given time step. The size parameter refers to the number of contiguous indices of the dimension mapped to each PE, and the offset parameter describes the shift in the starting index across consecutive time steps in a PE. For instance, the directive TemporalMap(2,2) in the running example represents the distribution of the first dimension of the weight tensor with two indices mapped in each time step (i.e., indices {0,1} in PE0 and PE1 at t = 0). Also, the offset of two denotes the increment in the starting index after each time step (i.e., indices {2,3} in PE0 and PE1 at t = 1) until the extent of the dimension is covered.

2) SpatialMap(size, offset) specifies a distribution of a tensor dimension across PEs. The size parameter refers to the number of contiguous indices of the dimension mapped to each PE, and the offset describes the shift in the starting index across consecutive PEs. For instance, the directive SpatialMap(1,1) in the running example represents the distribution of the first dimension of the output tensor with one index mapped to each PE (i.e., index {0} in PE0 and index {1} in PE1 at t = 0). If the number of PEs is not sufficient to cover all indices of the mapped dimension, then the mapping is folded over time across the same set of PEs. (A small sketch after these directive descriptions illustrates the index ranges these two directives produce in the running example.)

3) Directive order. The sequence of spatial and temporal map directives in a mapping dictates how the data mapping to PEs changes across time. Similar to a loop order, all the dimension indices corresponding to a mapping directive are explored before its outer mapping directive in the sequence begins exploring its next set of indices. For instance, the sequence of directives in the running example, i.e., the spatial map over the output dimension followed by the temporal map over the weight dimension, dictates that all the weight dimension indices need to be explored before exploring the next set of output indices. This order results in accumulating the partial results of an output before computing another output, popularly referred to as an "output stationary" mapping [du2015shidiannao]. However, the sequence notation has a limitation: it cannot capture scenarios where more than one dimension index changes simultaneously over time (except at dimension boundaries).

4) Cluster(size) logically groups multiple PEs or nested sub-clusters, with the group size given by the size parameter. For example, a Cluster(2) directive on an accelerator with ten PEs arranges the PEs into five clusters of size two. All mapping directives above a cluster directive operate over the introduced logical clusters, while those below the cluster directive operate within a logical cluster. The cluster directive is extremely useful for exploiting spatial distribution of more than one tensor dimension (e.g., the row-stationary mapping [eyeriss_isca]). Also, the directive helps in constructing hierarchical accelerators by recursive grouping.
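To make the TemporalMap/SpatialMap semantics above concrete, the following minimal C sketch (assumed, simplified semantics for illustration, not MAESTRO's implementation) prints the output and weight indices that each PE works on at each time step for the running CONV1D example.

#include <stdio.h>

/* A minimal sketch of the index ranges produced by the directives in the
 * running CONV1D example:
 *   SpatialMap(1,1)  over the output dimension -> varies across PEs
 *   TemporalMap(2,2) over the weight dimension -> varies across time steps */
typedef struct { int size, offset; } Map;

int main(void) {
    Map out_dim = {1, 1};   /* SpatialMap(1,1)  */
    Map wgt_dim = {2, 2};   /* TemporalMap(2,2) */
    for (int t = 0; t < 2; t++) {
        for (int pe = 0; pe < 2; pe++) {
            int o_lo = pe * out_dim.offset;   /* same at every time step t */
            int w_lo = t * wgt_dim.offset;    /* same across all PEs       */
            printf("t=%d PE%d: output={%d}, weights={%d..%d}\n",
                   t, pe, o_lo, w_lo, w_lo + wgt_dim.size - 1);
        }
    }
    return 0;
}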

The above aspects of the MDC notation make it possible to precisely specify a wide range of mappings, including popular and sophisticated mapping styles such as row-stationary in Eyeriss [eyeriss_isca], weight-stationary in NVDLA [nvdla], and output-stationary in ShiDianNao [du2015shidiannao]. However, it is not clear whether all mapping behaviors of an operator can be represented in the MDC notation.

3 Conformable DNN Operators

In this section, we introduce formal rules for identifying conformable DNN operators, i.e., operators whose mappings (reuse, parallelization, and tiling strategies) can be described using the MDC notation. We discuss the rules over the abstract loop nest form of DNN operators, without any transformations applied for reuse or parallelization (e.g., CONV1D in fig. 4).

R1: A conformable DNN operator in the abstract loop nest form must be a perfectly nested loop without any conditional statements.

The MDC notation restricts its computation to be uniform across all PEs at all time steps. This restriction is satisfied if the computation is enclosed in a perfectly nested loop without any conditional statements. Most DNN operators, such as CONV2D, GEMM, and MLP (more in table 1), can be expressed as perfectly nested loops without any conditionals. However, there can be implementations of certain operators, such as fused convolutions, in which each PE must execute non-uniform computation. Such operators are therefore discarded as non-conformable to the MDC notation.
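For reference, the abstract loop nest form of CONV1D is a minimal example of such a perfectly nested loop; the array names and bounds below are illustrative.

#define X 16   /* illustrative output size */
#define S 3    /* illustrative filter size */

/* CONV1D in abstract loop nest form: perfectly nested, no conditionals (R1).
 * The only dependences are reduction dependences on the output O (see R2),
 * so the loops can be freely reordered, tiled, and parallelized. */
void conv1d(float O[X], const float I[X + S - 1], const float W[S]) {
    for (int x = 0; x < X; x++)
        for (int s = 0; s < S; s++)
            O[x] += I[x + s] * W[s];
}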

R2: The perfectly nested loop must not have any dependences (flow, anti, output) except reduction dependences, and thus the loops can be freely reordered.

The MDC notation restricts the input and output tensors of an operator to be different, which rules out flow and anti dependences between input and output tensors. However, the notation can support reduction operations (e.g., add, max, min), and hence reduction dependences, i.e., flow, anti, and output dependences only on the output tensor. As with rule R1, most DNN operators have only reduction dependences; a few operators, such as parametric multi-step LSTMs, have flow dependences.

R3: The dimension dependence graph (DDG) of the perfectly nested loop must have a topological ordering, and the subscripts of dependent dimension variables of the DDG must be linear combinations of their loop iterators.

The directive order (sequence of mapping directives) of the MDC notation dictates how the data mapping to PEs changes across time. As described in section 2.2, the directive order cannot capture more than one tensor dimension variable changing simultaneously over time (except at boundaries). We introduce a directed graph called the Dimension Dependence Graph (DDG) to detect the possibility of such data movement behaviors in a DNN operator.

Figure 4: The dimension dependence graphs (DDGs) of simple operators such as CONV1D and a stencil satisfying rule R3, and an example violating rule R3. The d variables denote tensor dimension variables corresponding to the output, input, and weight tensors.

Each node of a DDG denotes a tensor dimension variable along with the array subscript referenced in that dimension. For instance, the node for the input tensor dimension in fig. 4(a) carries the subscript formed by the sum of two loop iterators. The edges of the DDG are constructed as follows: 1) An edge is added from a node having an SIV/MIV subscript (a Single Index Variable (SIV) subscript involves one loop iterator, whereas a Multiple Index Variable (MIV) subscript involves more than one loop iterator [DBLP:books/mk/AllenK2001]) to another node having an MIV subscript if there is a loop iterator common to their subscripts. For example, there is a directed edge from the output-dimension node to the input-dimension node in fig. 4(a) since their subscripts share a loop iterator. 2) All the SIV subscripts are grouped based on their loop iterators, and then edges are added from the SIV subscript of a group having the lowest constant offset (chosen arbitrarily if there are multiple) to the other SIV subscripts in the same group, as in fig. 4(b). 3) If a loop iterator (say i) depends on another loop iterator (say j) through its loop bounds, then edges are constructed from the nodes whose subscripts contain the loop iterator i to the nodes whose subscripts contain the loop iterator j.

The possibility of having multiple dimension variables changing simultaneously is thus reduced to the problem of finding a topological ordering of the DDG. In essence, the absence of a topological ordering indicates the presence of mutually dependent dimension variables (e.g., the example in fig. 4(c)). In the presence of a topological ordering, the MDC notation requires the data mappings of only the independent dimension variables to be specified, and these variables are identified as the nodes of the DDG having zero in-degree. For example, in the case of CONV1D in fig. 4(a), only the data mappings of the dimension variables of the output and weight tensors must be specified, and the dimension variable of the input tensor is inferred by the underlying MDC cost model. Hence, the subscripts of dependent dimension variables need to be linear expressions of loop iterators so as to be analyzable by the MDC's cost model. In addition, the MDC notation expects only one data mapping over an independent dimension variable. If more than one zero in-degree node in the DDG is associated with the same dimension variable, then we consider that DNN operator non-conformable.
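The following C sketch (with the CONV1D DDG hard-coded; constructing nodes and edges from subscripts is assumed rather than shown) illustrates the rule R3 check: collect the zero in-degree nodes as the independent dimension variables and verify that a topological ordering exists using Kahn's algorithm.

#include <stdio.h>

#define N 3  /* DDG nodes for CONV1D: 0 = output dim, 1 = weight dim, 2 = input dim */

int main(void) {
    /* adj[u][v] = 1 means an edge u -> v. For CONV1D, the output and weight
     * dimensions (SIV subscripts) both feed the input dimension (MIV subscript
     * that sums their iterators), so the input dimension is the only dependent
     * node. */
    int adj[N][N] = { {0, 0, 1}, {0, 0, 1}, {0, 0, 0} };
    int indeg[N] = {0}, queue[N], head = 0, tail = 0, visited = 0;

    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++) indeg[v] += adj[u][v];

    /* Zero in-degree nodes are the independent dimension variables whose
     * data mappings must be specified in the MDC notation. */
    for (int v = 0; v < N; v++)
        if (indeg[v] == 0) { queue[tail++] = v; printf("independent dim: %d\n", v); }

    while (head < tail) {                     /* Kahn's algorithm */
        int u = queue[head++];
        visited++;
        for (int v = 0; v < N; v++)
            if (adj[u][v] && --indeg[v] == 0) queue[tail++] = v;
    }
    puts(visited == N ? "topological order exists: R3 satisfied"
                      : "cycle detected: R3 violated, non-conformable");
    return 0;
}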

R4: The subscripts associated with the independent dimension variables of the DDG must be linear combinations of their loop iterators with positive unit coefficients and no constants.

A mapping directive (either spatial or temporal) over a dimension variable restricts the variable to start from zero and increase with unit stride. These restrictions do not allow the dimension variable to have strided increments or negative strides. To characterize the implication of these restrictions, we assume the abstract loop nest form of the DNN operator to be normalized, i.e., its loop iterators start from zero and have unit strides. To satisfy the restrictions imposed by the mapping directives, each subscript (in the normalized form) associated with an independent dimension variable must be a linear combination of the subscript's loop iterators with positive unit coefficients and no constants. For example, the subscript associated with the output dimension variable in fig. 4(a) is a single loop iterator with a coefficient of one and no constant.

With positive unit coefficients and no constants, the SIV subscript associated with an independent dimension variable is simply a unique loop iterator (e.g., the subscripts of the output and weight dimension variables in fig. 4(a)). Furthermore, an MIV subscript associated with an independent dimension variable is a sum of the subscript's loop iterators. These loop iterators cannot be part of any subscript associated with another dimension variable; otherwise, the node's in-degree would not be zero. Hence, the loop iterators of such an MIV subscript can be merged into a single loop. Overall, the subscripts associated with the independent dimension variables are simply unique loop iterators.

Finally, an operator is said to be MDC-conformable if it satisfies all four rules described above. Table 1 lists a set of popular DNN operators and their conformability with the MDC notation. As can be seen, the MDC notation can capture most DNN operators except parametric LSTMs, and the mappings of these operators are analyzable by the MDC's cost model.

DNN Operator | Types | R1 | R2 | R3 | R4 | Conformable to MDC
CONV1D | Regular | Y | Y | Y | Y | Y
CONV2D | Regular | Y | Y | Y | Y | Y
CONV2D | Point-wise, Depth-wise | Y | Y | Y | Y | Y
CONV2D | Strided, Dilated | Y | Y | Y | Y | Y
MLP | Fully connected | Y | Y | Y | Y | Y
Pooling | Max, Avg | Y | Y | Y | Y | Y
GEMM | Regular | Y | Y | Y | Y | Y
GEMM | Triangular | Y | Y | Y | Y | Y
LSTM | Single cell | Y | Y | Y | Y | Y
LSTM | Parametric multi-cell | Y | N | Y | Y | N
Element-wise | Residual | Y | Y | Y | Y | Y
Element-wise | ReLU | Y | Y | Y | Y | Y
Stencils | Regular | Y | Y | Y | Y | Y
Table 1: Conformability of popular DNN operators with the MDC notation (Y/N refers to YES/NO).

4 Transformation

The MDC notation is powerful for expressing and reasoning about complex mappings of DNN operators onto diverse spatial accelerators, but explicitly writing and exploring such mappings can be error-prone and tedious. Computer architects [angshu2019timeloop] and DNN compiler frameworks [Chen:2018:TAE:3291168.3291211] primarily view operators and their mappings in loop nest form [angshu2019timeloop, zhang2015optimizing, ma2017optimizing]. This section introduces a transformation to translate a mapping of a conformable DNN operator in loop nest form into the MDC notation. In this work, we assume that target spatial accelerators have three levels of memory hierarchy (private L1 buffer, shared L2 buffer, and DRAM). However, our transformation can be easily extended to more levels of hierarchy.

As described in section 2.2, the MDC notation consists of two aspects: 1) computation and tensor sizes, and 2) data mapping directives over independent tensor dimensions. The statements enclosed in the perfectly nested loop form of the conformable DNN operator are used as the computation, and the tensor sizes are extracted from the workload configuration. The computation and tensor sizes of the MDC notation remain the same for every mapping of the operator. Then, the dimension dependence graph of the operator is constructed to identify the set of independent tensor dimension variables (those having zero in-degree). If there are no such independent dimension variables, then the operator is discarded as non-conformable. The rest of the section focuses on generating data mapping directives for each mapping.

4.1 Data Mapping directives

According to the rule R2, the loops of a conformable DNN operator can be freely reordered, so it is safe to perform multi-level tiling to exploit temporal reuse across each level of the memory hierarchy and also to exploit parallelism of the accelerator. Each tiling, reuse and parallelization behavior of an operator onto a spatial accelerator is referred to as a “mapping”. An example of the mapping of a CONV1D operation over a 3-level accelerator is shown in fig. 5 (C), and the different aspects of the mapping are described below.

1) Multi-level tile sizes. A mapping includes tile sizes of all loop iterators for each level of tiling, i.e., 1) level-1 tiling for the private L1 buffer, 2) level-2 tiling for parallelism, and 3) level-3 tiling for the shared L2 buffer.

2) Inter-tile loop orders. A mapping also includes inter-tile loop orders (an n-dimensional loop nest after one level of tiling has 2n loops; the outer n loops are referred to as inter-tile loops and the inner n loops as intra-tile loops, and the innermost n loops after multi-level tiling are called point loops) to describe the execution order of tiles, reflecting various reuse opportunities. For example, the level-2 inter-tile loop order reflects spatio-temporal reuse over the PE array, and the level-3 inter-tile loop order reflects temporal reuse through the on-chip L2 buffer. The level-1 inter-tile loop order does not reflect any reuse, because these loops are annotated with parallelism. Also, the loop order among the point loops does not provide any reuse opportunities, because there is no further intermediate staging between a PE and its L1 buffer. A sketch of such a multi-level tiled loop nest is shown below.
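For concreteness, the following C sketch shows one possible three-level tiled form of CONV1D in the spirit of fig. 5(c); the tile sizes (T3, T2, T1), loop names, and the choice of which loop is parallelized are illustrative assumptions rather than the paper's exact example.

#define X  64   /* illustrative output size          */
#define S  9    /* illustrative filter size          */
#define T3 32   /* level-3 tile size (L2 buffer)     */
#define T2 8    /* level-2 tile size (parallel PEs)  */
#define T1 2    /* level-1 tile size (L1 buffer)     */

void conv1d_tiled(float O[X], const float I[X + S - 1], const float W[S]) {
    for (int x3 = 0; x3 < X; x3 += T3)             /* level-3 inter-tile loop */
      for (int x2 = x3; x2 < x3 + T3; x2 += T2)    /* level-2 loop: each      */
        /* iteration of x2 is mapped to a different PE (parallelism)          */
        for (int x1 = x2; x1 < x2 + T2; x1 += T1)  /* level-1 inter-tile loop */
          for (int x = x1; x < x1 + T1; x++)       /* point loop              */
            for (int s = 0; s < S; s++)            /* point loop              */
              O[x] += I[x + s] * W[s];
}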

Figure 5: A brief overview of the mapping expressed in the loop-nest form of CONV1D, and its translation into the MDC notation with data mapping directives.

An n-level tiling will have n sets of tile loops (including parallel loops) and one set of point loops. Each set of loops can have a different data movement (reuse) behavior based on its sizes and loop order. We introduce the term "region" to denote a sequence of data mapping directives over independent tensor dimension variables (e.g., region R1 in fig. 5(d)) without any cluster directives; each region captures the data movement behavior of one set of loops. Given a mapping of the operator in the form of multi-level tile sizes and inter-tile loop orders, our approach transforms the mapping into the MDC notation with the following steps.

1) Point loops. As described in rule R4, each subscript associated with an independent dimension variable is simply a unique loop iterator. Our approach translates each point loop into a temporal map directive over the corresponding independent dimension variable, with the size and offset parameters of the directive set to the point-loop size. For example, a point loop with tile size T in fig. 5(c) is directly translated into TemporalMap(T,T) over its dimension variable in region R1 of fig. 5(d). Since the loop order among the point loops does not provide any reuse benefits, the directive order within region R1 does not matter.

2) Parallel loops. Since each independent dimension variable is uniquely associated with a loop iterator, parallel execution of each loop iterator introduces a different data movement behavior. Hence, for each parallel loop, we introduce a region with a spatial map over the dimension variable associated with that parallel loop, and temporal maps for the rest of the dimension variables in the region. For example, there are two regions, R2 and R3, for the two parallel loops in fig. 5(c), and the dimension variable associated with each parallel loop iterator is translated into a spatial map in its corresponding region. The size and offset of each spatial map over a dimension variable are derived from the stride of the parallel loop iterator corresponding to that dimension variable. The order of directives in each region corresponding to a parallel loop does not matter, because the number of iterations arising from the remaining temporal maps is one. Each region corresponding to a parallel loop (except the innermost) ends with a cluster directive whose size is the number of iterations of the parallel loop; for example, region R3 ends with a cluster directive whose size is the number of iterations of its parallel loop.

3) Inter-tile loops. For each set of tile loops excluding parallel loops, our transformation generates a region by creating a temporal map directive for each loop of the set, with the size and offset of the directive set to the loop stride. For example, an inter-tile loop with stride T in fig. 5(c) is directly translated into TemporalMap(T,T) over its dimension variable in region R4 of fig. 5(d). The order of directives in a region is governed by the loop order among the corresponding tile loops; for example, the level-3 inter-tile loop order dictates which temporal map appears outer in region R5. Furthermore, consecutive regions are separated by a cluster directive of size one to support different data movement behaviors across the sets of tile loops.
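The loop-to-directive correspondence can be summarized with a small C sketch; the loop descriptions below refer to the tiled CONV1D sketch shown earlier (with illustrative dimension names d_O and d_W), the directive parameters follow the rules above, and everything else is a simplifying assumption.

#include <stdio.h>

typedef enum { POINT, PARALLEL, INTER_TILE } Kind;

typedef struct {
    const char *iter;  /* loop iterator in the tiled nest                      */
    const char *dim;   /* associated independent dimension variable            */
    int size;          /* directive size/offset: point-loop extent for point   */
                       /* loops, loop stride for parallel and inter-tile loops */
    Kind kind;
} Loop;

int main(void) {
    Loop loops[] = {
        { "x3", "d_O", 32, INTER_TILE },  /* level-3 inter-tile loop        */
        { "x2", "d_O",  8, PARALLEL   },  /* level-2 loop mapped across PEs */
        { "x1", "d_O",  2, INTER_TILE },  /* level-1 inter-tile loop        */
        { "x",  "d_O",  2, POINT      },  /* point loop (extent T1 = 2)     */
        { "s",  "d_W",  9, POINT      },  /* point loop (extent S = 9)      */
    };
    for (unsigned i = 0; i < sizeof loops / sizeof loops[0]; i++) {
        Loop l = loops[i];
        if (l.kind == PARALLEL)
            printf("%-2s -> SpatialMap(%d,%d) %s, region ends with a Cluster directive\n",
                   l.iter, l.size, l.size, l.dim);
        else
            printf("%-2s -> TemporalMap(%d,%d) %s (%s region)\n",
                   l.iter, l.size, l.size, l.dim,
                   l.kind == POINT ? "point-loop" : "inter-tile");
    }
    return 0;
}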

5 Mapping Space Exploration

The mapping space of a conformable DNN operator onto an accelerator having three levels of memory hierarchy is a cross product of valid level-1 tile sizes, level-2 tile sizes (parallelism), level-2 inter-tile loop orders, level-3 tile sizes, and level-3 inter-tile loop orders. For example, there are over 10 valid mappings for a single CONV2D operator on average when mapping ResNet50 and MobileNetV2 on a representative DNN edge accelerator. Because of this massive space, searching for efficient mappings is challenging, and the challenge is exacerbated by new operators (e.g., depth-wise) and diverse hardware accelerator configurations (e.g., tree-based interconnects [kwon2018maeri]).

We optionally consider a limited form of data layouts, i.e., innermost dimension reordering [li2016optimizing], for the operator's tensors in DRAM. Overall, the mapping space of an operator is a Cartesian product of six dimensions that represent different aspects of a mapping, i.e., 1) level-1 tile sizes, 2) level-2 tile sizes (parallelism), 3) level-2 inter-tile loop orders, 4) level-3 tile sizes, 5) level-3 inter-tile loop orders, and 6) data layouts of tensors. The first three dimensions are grouped under the "on-chip mapping subspace" since they influence parallelization and on-chip data movement, and the remaining three dimensions are grouped under the "off-chip mapping subspace" since they influence off-chip data movement.

Our approach to mapping space exploration is motivated by the observation that the off-chip data movement between DRAM and the accelerator is 2-3 orders of magnitude more expensive than the on-chip data movement. Hence, we propose an approach referred to as "decoupled off-chip/on-chip" that decomposes the mapping space into two subspaces, i.e., off-chip and on-chip subspaces, and first optimizes the off-chip subspace, followed by the on-chip subspace constructed from the optimal mappings of the off-chip subspace. In contrast to prior work [angshu2019timeloop, interstellar, DMazeRunner], we use different approaches and cost models for these subspaces, i.e., a classical distinct-block (DB) locality cost model [ferrante1991estimating, Sarkar:1997:ASH:271819.271828] to explore the off-chip subspace, and the MDC's cost model [kwon2018maestro] for the on-chip subspace. The overall approach is implemented as a standalone tool (shown in fig. 6) that takes a conformable DNN operator, workload sizes, and a target accelerator configuration, explores the mapping space of the operator using the decoupled approach, and finally outputs the mappings optimized for runtime and energy.

Figure 6: An overview of our approach along with pruning strategies for searching the mapping space of convolutions. The pruning strategies in green preserve optimal mappings, whereas the strategies in red may prune optimal mappings.

5.1 Solving off-chip mapping subspace

The goal of finding an optimal mapping in the off-chip mapping subspace is to minimize off-chip data movement between DRAM and the L2 buffer of an accelerator. In our work, we assume the L2 buffer to be a software-managed scratchpad buffer, and reducing off-chip data movement is equivalent to finding a level-3 tile with the highest arithmetic intensity, because higher arithmetic intensity results in higher reuse and less data transfer. (In the case of non-software-managed scratchpad buffers, reducing data movement between DRAM and the L2 buffer is equivalent to finding a level-3 tile whose memory footprint fits in the L2 buffer and is maximal.)

In our approach, we use the classical distinct-block (DB) locality cost model [ferrante1991estimating] to measure the off-chip data movement cost; it was developed as part of memory cost analysis to guide the automatic selection of loop transformations and optimal tile sizes [Sarkar:1997:ASH:271819.271828, Sarkar:2000:AML:1153923.1154542, Shirako:2012:ABO:2259230.2259238] in the IBM XL compilers. The DB model is a good choice for our approach since it focuses only on optimizing off-chip data movement. Moreover, it targets only perfectly nested loops, and conformable DNN operators are perfectly nested loops per rule R1 in section 3.

The distinct blocks (DB) model starts with the data layouts of multi-dimensional arrays and the parametric tiled version of a perfectly nested loop. Then, the model symbolically estimates the off-chip data movement cost of a tile of computation by counting the number of distinct DRAM blocks required for all the array references in the tile. For example, with the array I laid out in row-major order, the model estimates the number of distinct DRAM blocks (with block size B and tile sizes T_x, T_y) required for the array reference I[x+y][y] enclosed in a triply nested loop with iterators x, y, z. In this estimate, the innermost (fastest-varying) access of the reference is divided by the block size, because data movement with DRAM happens in multiples of the block size. (Setting the block size to one ignores the impact of the data layouts we consider in our approach, i.e., innermost dimension reordering [li2016optimizing].) Now, the total data movement cost (DMC), a.k.a. memory cost per iteration, of a tile is computed as the number of distinct DRAM blocks required for all references in the tile divided by the total number of iterations in the tile. The optimal level-3 tile sizes and data layouts are computed by minimizing the data movement cost function over every layout and tile size in the off-chip mapping subspace, subject to two constraints: 1) the tile size of a loop must be greater than zero and must not exceed its corresponding loop bound, and 2) the total data required (including double buffering) for a level-3 computation tile must fit in the on-chip L2 buffer.
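A minimal sketch of this estimate for the single reference I[x+y][y] is given below; the closed-form count (rows spanned by x+y times blocks spanned by y per row) is an assumed approximation for illustration, not the DB model's exact formula.

#include <stdio.h>

/* Assumed approximation of the distinct-block count for the reference
 * I[x+y][y] in a tile of sizes Tx x Ty x Tz with DRAM block size B:
 * the row subscript x+y spans (Tx + Ty - 1) rows, and within each row the
 * column subscript y touches roughly ceil(Ty / B) blocks. */
static long distinct_blocks(long Tx, long Ty, long B) {
    long rows = Tx + Ty - 1;
    long blocks_per_row = (Ty + B - 1) / B;   /* ceil(Ty / B) */
    return rows * blocks_per_row;
}

/* Data movement cost per iteration (DMC) contributed by this reference:
 * distinct blocks touched by the tile divided by the tile's iteration count. */
static double dmc(long Tx, long Ty, long Tz, long B) {
    return (double)distinct_blocks(Tx, Ty, B) / (double)(Tx * Ty * Tz);
}

int main(void) {
    long B = 64;  /* DRAM block size, as in Table 2 */
    printf("Tx=16, Ty=16, Tz=16 : DMC = %f\n", dmc(16, 16, 16, B));
    printf("Tx=4,  Ty=64, Tz=16 : DMC = %f\n", dmc(4, 64, 16, B));
    return 0;
}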

After computing the optimal level-3 tile sizes and data layouts of tensors, our approach computes the partial derivatives (slopes) of the data movement cost function (based on the optimal data layout) with respect to the parametric level-3 tile sizes (similar to [Sarkar:1997:ASH:271819.271828]), and evaluates these partial derivatives at the optimal level-3 tile sizes. The key insight is that a more negative partial derivative along a loop indicates fewer distinct elements referenced along that loop, i.e., higher reuse along the loop, so that loop should be kept in the innermost position to exploit maximum temporal reuse. The remaining loops are ordered analogously based on their partial derivative values.

5.2 Solving on-chip mapping subspace

The on-chip mapping subspace is constructed based on the optimal values of the level-3 tile sizes. Then, our approach explores the constructed subspace to find optimal mappings for each of three optimization goals, i.e., lower runtime (higher throughput), lower energy consumption, and lower energy-delay product. For each mapping of the constructed subspace, our approach transforms the mapping into its equivalent MDC notation (described in section 4). Then, our approach uses the MDC's cost model [kwon2018maestro] to estimate metrics such as latency and energy for each mapping in the on-chip subspace. The MDC's cost model precisely computes performance and energy, accounting for under-utilization, edge conditions, and data reuse or movement across time (via L1/L2 buffers [eyeriss_isca]), space (via broadcast links [kwon2018maeri]), and space-time (via neighboring links [jouppi2017datacenter, chen2017eyeriss_issc]), without requiring explicit RTL/cycle-level simulations or access to real hardware.

for every level-2 inter-tile loop order do
    for every level-2 tile size do
        Hardware pruning: PE utilization bound
        Hardware pruning: no prologues/epilogues
        for every level-1 tile size do
            Hardware pruning: finite L1 buffer size
            Hardware pruning: no prologues/epilogues
            Translate the mapping into the MDC form
            Invoke the MDC's cost model (runtime, energy, and other metrics)
Algorithm 1: Our approach to explore the on-chip mapping subspace, including pruning strategies

Algorithm 1 shows an overview of our approach to exploring the on-chip mapping subspace along with pruning strategies. We introduce a parameter called the "PE utilization bound (p)" to prune the search space of level-2 tile sizes by requiring the overall PE array utilization to be at least p. This technique is beneficial for finding optimal on-chip mappings when the optimization goal is throughput, because the highest throughput is typically obtained at high PE utilization rates [chen2019eyeriss]. Our approach also includes a pruning strategy that chooses level-1 and level-2 tile sizes such that they do not result in any prologues or epilogues, i.e., the tile sizes are factors of the loop bounds. All of the above pruning strategies can be enabled or disabled in Marvel via input parameters. A sketch of these pruning checks follows.
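The following C sketch illustrates the two pruning checks; the utilization formula (a single spatially mapped loop) and the L1 footprint computation (a CONV1D-like tile at one byte per element, matching the 8-bit precision used later) are simplifying assumptions rather than Marvel's exact implementation.

#include <stdbool.h>

/* Prune level-2 tile sizes whose PE utilization falls below the bound p,
 * approximating utilization as (parallel iterations) / (#PEs) under the
 * assumption of a single spatially mapped loop. */
static bool utilization_ok(int parallel_iters, int num_pes, double p) {
    double util = (double)parallel_iters / (double)num_pes;
    if (util > 1.0) util = 1.0;
    return util >= p;                  /* e.g., p = 0.1 in our evaluation */
}

/* Prune tile sizes that would create prologues/epilogues: the tile size
 * must evenly divide the loop bound (or the enclosing tile size). */
static bool divides_evenly(int bound, int tile) {
    return tile > 0 && bound % tile == 0;
}

/* Prune level-1 tile sizes whose working set does not fit in the L1 buffer:
 * one output tile, one weight tile, and the corresponding input halo for a
 * CONV1D-like tile, at one byte per element. */
static bool fits_in_l1(int t_out, int t_wgt, int l1_bytes) {
    int footprint = t_out + t_wgt + (t_out + t_wgt - 1);
    return footprint <= l1_bytes;
}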

6 Evaluation

In this section, we begin with an overview of the experimental setup used in our evaluation. Then, we present the evaluation of mappings generated by Marvel for a wide variety of DNN operators (CONV2D, GEMM, MLP, and LSTM), and discuss insights from the mappings while comparing them with previous work.

Parameter | Accelerator platform P1 (Eyeriss-like [eyeriss_isca]) | Accelerator platform P2 (Edge/IoT-like [edge_tpu])
#PEs | 168 | 1024
Clock frequency | 200 MHz | 200 MHz
GigaOpsPerSec (GOPS) | 67.2 | 409.6
NoC bandwidth (GB/s) | 2.4 | 25.6
L1 buffer size | 512 B | 512 B
L2 buffer size | 108 KB | 108 KB
DRAM block size [ddr4_spec] | 64 | 64
Table 2: Accelerator setups in our evaluation.

Target accelerators. Marvel is applicable to any spatial accelerator, since it abstracts accelerator details as #PEs, L1/L2 buffer sizes, NoC bandwidth, reduction/multicast support, etc., which can be used to model a wide variety of accelerators including Eyeriss [eyeriss_isca], NVDLA [nvdla], TPU [edge_tpu], and xDNN. Due to space limitations, we present our evaluation for only two accelerator platforms (shown in table 2): an Eyeriss-like accelerator [eyeriss_isca] having 168 PEs and 2.4 GB/s NoC bandwidth, and another accelerator having 1024 PEs and 25.6 GB/s NoC bandwidth. We inherit the L1 buffer, L2 buffer, and clock frequency of both platforms from Eyeriss [eyeriss_isca], i.e., a 512 B L1 buffer, a 108 KB L2 buffer, and a 200 MHz clock. The bidirectional NoC used in our evaluation is a two-level hierarchical bus, which supports multicasting similar to Eyeriss.

Experimental variants. We have implemented a few of the exploration strategies of recent optimizers such as Interstellar [interstellar] and dMazeRunner [DMazeRunner] in our framework. For instance, the Interstellar optimizer focuses on parallelizing input and output channels of CONV2D operators, whereas the dMazeRunner optimizer focuses on parallelizing only output channels and explores a limited set of loop orders. We compare Marvel-generated mappings for each workload and accelerator platform with three variants: 1) mappings generated by our implementation of an Interstellar-like optimizer [interstellar], 2) mappings generated by our implementation of a dMazeRunner-like optimizer [DMazeRunner], and 3) the roof-line peak based on workload arithmetic intensities and accelerator configurations.

Methodology. We evaluated all the mappings generated by the experimental variants using the MAESTRO cost model [kwon2018maestro]. The analytical cost model within the MAESTRO framework is validated against the RTL implementations of Eyeriss [eyeriss_isca] and MAERI [kwon2018maeri] on the VGG16 and AlexNet models. We passed a pruning option to Marvel to choose tile sizes that divide loop bounds evenly without any remainder, a convention also followed by other approaches [angshu2019timeloop, interstellar, DMazeRunner, ma2017optimizing, zhang2015optimizing]. We also set the minimum PE array utilization bound to 0.1, i.e., at least 10% of the PE array must be mapped with computation. We use 8-bit fixed-point precision for all tensors in our evaluation.

6.1 Evaluation on CONV2D

CONV2D is a widely used DNN operator in convolutional neural networks; these operators account for more than 90% of overall computation [cong2014minimizing, eyeriss_isca], dominating the overall latency and energy consumption of inference. In our evaluation, we considered popular CNN models, such as AlexNet [Alexnet], VGG16 [VGGnet], ResNet50 [Resnet], and MobileNetV2 [sandler2018mobilenetv2], with a batch size of one, as this captures the low-latency use case and also represents a more challenging setup for energy efficiency and throughput [chen2019eyeriss]. In addition, these models encompass a broad spectrum of CONV2D operators such as regular, point-wise, depth-wise, and strided variants with different filter shapes.

Figure 7: Performance comparison of Marvel generated mappings with the mappings of dMazeRunner-like optimizer [DMazeRunner] and Interstellar-like optimizer [interstellar] relative to the roof-line peaks of the AlexNet and VGG-16 models on both the platforms (P1 and P2).
Figure 8: Runtime and energy comparison of Marvel generated mappings with the popular mapping styles such as row-stationary (RS) from Eyeriss  [eyeriss_isca], weight-stationary from DLA [nvdla], output-stationary from ShiDianNao [du2015shidiannao] for the AlexNet [Alexnet], VGG-16 [VGGnet], ResNet-50 [Resnet], MobileNet-V2 [sandler2018mobilenetv2] models on both the platforms (P1 and P2).

Comparison with existing optimizers. Figure 7 presents the runtimes of optimized mappings generated by Marvel, the dMazeRunner-like optimizer [DMazeRunner], and the Interstellar-like optimizer [interstellar] relative to the roof-line peaks of the AlexNet and VGG-16 models on both platforms. Since each model involves multiple CONV2D operations, we summed the runtimes of the individual CONV2D operators to present our evaluation at the level of DNN models. The Interstellar-like optimizer is almost equivalent to brute-force exploration, except that it restricts parallelism to only input and output channels. As a result, evaluation with the Interstellar-like optimizer is very time-consuming (multiple days for MobileNetV2 and ResNet50), and hence we restricted this comparison to the AlexNet and VGG16 models. As can be observed from fig. 7, Marvel-generated mappings are a geometric mean of 2.35x and 1.15x faster compared to the mappings obtained by the dMazeRunner-like optimizer and the Interstellar-like optimizer, respectively. The dMazeRunner-like optimizer exploits parallelism along only output channels (in the presence of unit batch size) to avoid inter-PE communication, and this results in under-utilization of the PE array for both models. The Interstellar-like optimizer performs close to Marvel, because the number of input and output channels in these models is large (except at the initial layers). Furthermore, our approach is able to identify mappings in seconds to a few minutes per operator on a local machine, unlike the Interstellar-like optimizer, which takes almost 1-5 hours per operator. We do not compare search time with the dMazeRunner-like optimizer, because we have not implemented all of its heuristic strategies, e.g., exploring tiling factors that highly utilize (at least 60%) the scratchpad buffers. Table 3 shows the impact of our decoupling and pruning strategies on the original search space of mappings of the four models, with an average reduction factor of ten billion in the mapping space.

Variants | Min | Avg | Max
Original search space | 2.7 | 9.4 | 1.8
Off-chip schedules search space after decoupling | 7.3 | 3.6 | 1.3
On-chip schedules search space after decoupling | 2.9 | 2.4 | 1.4
Off-chip schedules search space after decoupling + pruning | 9.9 | 1.5 | 6.3
On-chip schedules search space after decoupling + pruning | 3.8 | 5.9 | 2.4
Table 3: The statistics (min/avg/max) of the CONV2D mapping space in our evaluation and the resultant mapping subspaces after decoupling and pruning strategies.
Figure 9: Performance comparison of Marvel generated mappings with the mappings of dMazeRunner-like optimizer [DMazeRunner], and Interstellar-like optimizer [interstellar] relative to the roof-line peaks of the GEMM workloads in table 4 and LSTM, MLP in table 5 on both the platforms (P1 and P2).

Comparison with popular mapping styles. Some of the state-of-the-art mapping styles are row-stationary (RS) from Eyeriss [eyeriss_isca], weight-stationary from DLA [nvdla], and output-stationary from ShiDianNao [du2015shidiannao]. In our evaluation, we encoded the above mapping styles as parallelization and loop order constraints on the on-chip mapping space of our decoupled approach. For instance, the weight-stationary (DLA) mapping style includes parallelization across input and output channels, with the loop iterators corresponding to the weight tensor in the innermost positions of the loop orders. As can be observed from fig. 8, the runtimes of Marvel-generated mappings for all the models are only 1.31x and 1.10x higher relative to the roof-line peaks on accelerator platforms P1 and P2, respectively.

The Eyeriss-like mappings [eyeriss_isca] exploit parallelism along the output width and filter width dimensions, whereas the ShiDianNao-like mappings [du2015shidiannao] exploit parallelism along the output width and height. But the extents of these dimensions are relatively small, especially in modern DNN models such as ResNet50 and MobileNetV2. Hence, these mappings often result in under-utilization of the PE array, leading to higher runtimes compared to the roof-line peak (e.g., 100.36x for Eyeriss-like mappings on platform P2). However, these mappings exploit the popular row-stationary and output-stationary reuse behaviors, leading to relatively low energy consumption (e.g., 2.91x for Eyeriss-like mappings on platform P2) relative to the Marvel-reported energy-efficient mappings.

The DLA-like mappings exploit parallelism along input and output channels, and the extents of these dimensions are large enough to keep the PE array busy for most layers of the AlexNet, VGG16, and ResNet50 models. However, the MobileNetV2 model introduces depth-wise operators, which lack parallelism along input channels. This results in lower performance of the DLA-like mappings relative to the roof-line peak, whereas our approach exploited alternative dimensions (more than one) for parallelism. On the other hand, the DLA-like mappings exploit weight-stationary reuse behavior, and these DNN models have a large number of weight parameters compared to other tensors. This resulted in only 1.10x higher energy consumption relative to the Marvel-reported energy-efficient mappings.

6.2 Evaluation on GEMM

In this evaluation, we considered GEMM workloads from the recent work in [qin2020sigma]. An interesting aspect of these workloads is that they are irregular in their shapes, making it hard for rigid accelerators (e.g., TPUs) to reach their peak utilization [qin2020sigma]. A summary of these workloads is shown in table 4, where M, N, and K refer to the number of rows and columns of the first matrix, followed by the columns of the second matrix.

Workload | Application | M | N | K
GNMT | Machine Translation | 128 | 2048 | 4096
GNMT | Machine Translation | 320 | 3072 | 4096
GNMT | Machine Translation | 1632 | 36548 | 1024
GNMT | Machine Translation | 2048 | 4096 | 32
DeepBench | General Workload | 1024 | 16 | 500000
DeepBench | General Workload | 35 | 8457 | 2560
Transformer | Language Understanding | 31999 | 1024 | 84
Transformer | Language Understanding | 84 | 1024 | 84
NCF | Collaborative Filtering | 2048 | 1 | 128
NCF | Collaborative Filtering | 256 | 256 | 2048
Table 4: Description of the GEMM workloads, taken from the recent work in [qin2020sigma].

We translated the GEMM workloads into their equivalent CONV2D workloads for the Interstellar-like and dMazeRunner-like optimizers, because their exploration strategies are specific to CONV2D workloads (e.g., parallelization strategies). Figure 9 presents the runtime of optimized mappings generated by Marvel, the dMazeRunner-like optimizer [DMazeRunner], and the Interstellar-like optimizer [interstellar] relative to the roof-line peak of each GEMM workload. The runtimes of Marvel-generated mappings are only 1.24x and 1.10x higher relative to the roof-line peaks of accelerator platforms P1 and P2, respectively, demonstrating how close the mappings obtained using our approach are to the peak. Furthermore, we observed that maximum reuse (spatial, temporal, spatio-temporal) is exploited only when all dimensions of the GEMM operator are parallelized. Hence, Marvel-generated mappings parallelize all three dimensions to keep the PE array occupied while exploiting maximum reuse. This is in contrast to the other approaches, i.e., the Interstellar-like optimizer parallelizes only the (N,K) dimensions and the dMazeRunner-like optimizer parallelizes only the (K) dimension. As a result, Marvel-generated mappings are 6.87x and 1.81x faster compared to the mappings obtained by the dMazeRunner-like optimizer and the Interstellar-like optimizer for all the GEMM workloads on both accelerator platforms.
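A sketch of one common way to recast a GEMM as a CONV2D workload is shown below; the specific dimension assignment (a 1x1 convolution with K as input channels and N as output channels) is an illustrative assumption, since the paper does not spell out its exact translation.

/* One common recasting of GEMM (C[M][N] = A[M][K] x B[K][N]) as a CONV2D
 * workload for CONV2D-specific optimizers: a 1x1 convolution where K becomes
 * the input channels, N the output channels, and the M rows become output
 * pixels. This dimension assignment is an assumption for illustration. */
typedef struct {
    int batch, in_ch, out_ch;   /* batch, input channels, output channels */
    int out_h, out_w;           /* output spatial dimensions              */
    int fil_h, fil_w;           /* filter spatial dimensions              */
} Conv2DShape;

static Conv2DShape gemm_as_conv2d(int M, int N, int K) {
    Conv2DShape c = {
        .batch = 1, .in_ch = K, .out_ch = N,
        .out_h = M, .out_w = 1,     /* M rows mapped to output pixels  */
        .fil_h = 1, .fil_w = 1,     /* 1x1 filter: pure channel mixing */
    };
    return c;
}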

Figure 10: Comparison of Marvel with prior approaches (mRNA [zhao2019mRNA], Zhang et al. [zhang2015optimizing], Ma et al. [ma2017optimizing], Auto-TVM [Chen:2018:TAE:3291168.3291211], dMazeRunner [DMazeRunner], Interstellar [interstellar], TimeLoop [angshu2019timeloop]) for the mapping space exploration of DNN operators. Our approach (Marvel) supports any operator conformable with the MDC notation.

6.3 Evaluation on MLP and LSTM

In this evaluation, we considered the MLP and LSTM workloads from the Interstellar work [interstellar]; a summary of these workloads is shown in table 5.

Network | Layer | Input channels | Output channels
MLP-M | FC1 | 784 | 1000
MLP-M | FC2 | 1000 | 500
MLP-M | FC3 | 500 | 250
MLP-L | FC1 | 784 | 1500
MLP-L | FC2 | 1500 | 1000
MLP-L | FC3 | 1000 | 500

Network | Embedding size | Batch size
LSTM-M | 500 | 128
LSTM-L | 1000 | 128
RHN | 1500 | 128
Table 5: Description of the MLP and LSTM workloads, taken from the Interstellar work in [interstellar].

We translated the MLP workloads into CONV2D workloads for the Interstellar-like and dMazeRunner-like optimizers. We also translated the LSTM workloads into their equivalent CONV2D workloads by first converting them into GEMM workloads. For instance, an LSTM workload with batch size B and embedding size E (the embedding size is the size of the input and hidden vectors) can be translated into a GEMM workload with M being the batch size (B), N being the embedding size (E), and K being 2E. Figure 9 presents the runtime of optimized mappings generated by Marvel, the dMazeRunner-like optimizer [DMazeRunner], and the Interstellar-like optimizer [interstellar] relative to the roof-line peak of each workload in table 5. Marvel-generated mappings are 4.46x and 1.22x faster compared to the mappings obtained by the dMazeRunner-like optimizer and the Interstellar-like optimizer for all the workloads on both accelerator platforms. The benefit over the dMazeRunner-like optimizer is higher because it parallelizes only a single dimension (embedding size for LSTM and output channels for MLP) and explores only a limited set of loop orders for reuse. In addition, Marvel does better than the Interstellar-like optimizer by exploring more levels of parallelism to keep the PE array occupied (e.g., only 1.04x higher relative to the roof-line peak on platform P2).
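A minimal sketch of the LSTM-to-GEMM translation described above (using only the dimension relations stated in the text; the struct and function names are illustrative):

typedef struct { int M, N, K; } GemmShape;

/* Translate an LSTM cell workload (batch size B, embedding size E) into the
 * equivalent GEMM shape used above: M = B, N = E, and K = 2E, since the
 * input and hidden vectors (each of size E) feed the reduction dimension. */
static GemmShape lstm_as_gemm(int batch, int embedding) {
    GemmShape g = { batch, embedding, 2 * embedding };
    return g;
}

/* Example: LSTM-M from Table 5 (E = 500, B = 128) -> M = 128, N = 500, K = 1000. */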

7 Related Work

In this section, we discuss prior work only on compilers/mappers (shown in Figure 10) for finding efficient mappings of DNN operators onto spatial accelerators. Prior work [zhao2019mRNA, eyeriss_isca] focused on developing mappers specific to their architectures, e.g., the mRNA mapper [zhao2019mRNA] for the MAERI accelerator [kwon2018maeri], limiting their applicability to generic spatial accelerators. Prior work such as Auto-TVM [Chen:2018:TAE:3291168.3291211], Zhang et al. [zhang2015optimizing], and Ma et al. [ma2017optimizing] focused on spatial accelerators without L1 buffers inside a PE, again limiting their mapping space formulation. Furthermore, they do not employ accurate cost models and focus only on optimizing for runtime.

In addition, other prior works such as Interstellar [interstellar] and dMazeRunner [DMazeRunner] fixed certain aspects of the mapping space, such as the choice of parallel loops and loop orders, and these choices may not yield efficient mappings for a wide variety of DNN operators. To the best of our knowledge, TimeLoop [angshu2019timeloop] is the only framework that considers all aspects of a mapping for a fully flexible spatial accelerator. However, it employs either an exhaustive linear search or a random sampling-based heuristic to explore the search space. In contrast to all of the above works, our approach considers all aspects of the mapping space and uses the decoupled strategy to efficiently navigate it.

Most prior work focuses on optimizing convolution operators, and it is not clear whether those approaches are applicable to any DNN operator expressed in loop nest form. In contrast, our approach is guaranteed to work on any operator conformable to the MDC notation.

8 Conclusion & Future work

In this paper, we provide a formal understanding of DNN operators whose mappings can be described in the MDC notation by introducing a set of rules over the abstract loop nest form of the operators. Furthermore, we introduce a transformation for translating mappings into the MDC notation for exploring the mapping space. We then propose a decoupled off-chip/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces, and first optimizes the off-chip subspace followed by the on-chip subspace. We implemented our decoupled approach in a tool called Marvel, and a major benefit of our approach is that it is applicable to any DNN operator conformable with the MDC notation. Our approach reduced the search space of CONV2D operators from four major DNN models by a factor of ten billion (Table 3), while generating mappings that demonstrate a geometric mean improvement of 10.25x higher throughput and 2.01x lower energy consumption compared with three state-of-the-art mapping styles from past work. In the future, we envision 1) integrating Marvel with the MLIR compiler infrastructure for wide usability, 2) extending the MDC notation and its cost model to support non-conformable operators, and 3) applying it to a wide range of applications, including neuro-architecture search.

References