Optimally Scheduling CNN Convolutions for Efficient Memory Access

02/04/2019
by   Arthur Stoutchinin, et al.
University of Bologna
ETH Zurich

Embedded inference engines for convolutional networks must be parsimonious in memory bandwidth and buffer sizing to meet power and cost constraints. We present an analytical memory bandwidth model for loop-nest optimization targeting architectures with application-managed buffers. We applied this model to optimize the CNN convolution loop-nest. We show that our model is more accurate than previously published models. Using this model we can identify non-trivial dataflow schedules that result in the lowest communication bandwidth given tight local buffering constraints. We show that optimal dataflow schedules are implementable in practice and that our model is accurate with respect to a real implementation; moreover, we introduce an accelerator architecture, named Hardware Convolution Block (HWC), which implements the optimal schedules, and we show it achieves up to 14x memory bandwidth reduction compared to a previously published accelerator with a similar memory interface, but implementing a non-optimal schedule.


I Introduction

Convolutional neural networks (CNNs) are widely used for solving artificial intelligence problems, such as object and voice recognition, scene labeling and others [1]. Many research and development efforts have recently focused on domain-specific CNN accelerators in order to meet the high computational requirements of CNN inference with reasonable energy and cost efficiency [2].

One of the primary bottlenecks for implementing efficient and cost-effective embedded CNN accelerators for state-of-the-art deep convolutional networks is the memory system. The large volume of data accessed (weights) and produced (feature maps) during CNN inference makes it difficult to simultaneously buffer the input feature maps, the output feature maps, and the filter weights in the limited internal accelerator memory. One way to alleviate this issue is to use large SRAM buffers (up to a few MBytes may be used) in order to completely eliminate main memory traffic [3, 4]. When a massive accelerator area budget is available, this may be an acceptable approach. However, large amounts of memory are not affordable in deeply-embedded markets, such as mobile or IoT clients.

While completely absorbing all CNN memory traffic in internal accelerator storage is usually not possible, the memory bandwidth requirement for a given accelerator storage capacity can be significantly reduced if sufficient data reuse happens. The data reuse pattern, in time and space, is determined by the dataflow schedule of computation. Generally, the efficiency and performance impact of the dataflow schedule varies with CNN topology and size making it difficult to adapt the accelerator architecture to different CNNs.

Existing work on scheduling CNN computations [5, 6, 7] is based on memory models originally developed for cache-based memory hierarchies [8]. Previously published models essentially search the tiling (or blocking) space of the CNN convolution loop-nest with the objective of identifying a set of innermost loops such that the working set of these innermost loops fits in the available internal storage while the data transfers between the internal and external memories are minimized. However, existing models have not been adapted to explicitly application-managed buffers, which constitute by far the most common memory architecture template for CNN accelerators [9, 10, 11, 12, 13, 3, 14, 15, 16, 17]. In this case, published models overestimate the internal storage requirements of the CNN computation and result in sub-optimal dataflow schedules.

In this paper, we provide three key contributions to the state-of-the-art. First, we propose a new analytical memory performance model to evaluate dataflow schedules in terms of their local memory requirements and overall external (off-accelerator) memory traffic. We compared the best dataflow schedules that can be identified with our model against those of the current state-of-the-art models [5], showing that they require 5-15% less memory traffic when applied to a number of state-of-the-art CNN topologies, enabling a better usage of the available resources. Moreover, to validate our model, we applied it to the case study of the design of a flexible CNN accelerator for deeply embedded Systems-on-Chip. Our accelerator architecture is based on a shared memory cluster, similar to [18], enhanced with dedicated convolution hardware units (HWCs).

We have used our analytical memory bandwidth model to perform an architectural exploration of the HWC accelerator with the objective of finding a dataflow schedule that results in the best trade-off between the capacity of the hardware unit storage and the traffic to shared memory. The dataflow schedule found using our methodology is non-trivial, and it achieves up to a 10-14x reduction in memory traffic with respect to a previously published accelerator based on a similar architecture [11], while assuming similar amounts of internal (in-accelerator) memory. In absolute terms, a complete HWC implemented from the dataflow schedule by means of high-level synthesis can sustain a throughput of up to 16 Multiply-Accumulate (MAC) operations per clock cycle using only 1KB of internal storage and only 3 32-bit ports to the off-accelerator shared memory. Finally, we verify the accuracy of our memory model by comparing the predicted memory traffic with the real number of memory accesses measured on our accelerator implementation, showing a maximum deviation of 0.5% in a few corner cases.

The rest of the paper is organized as follows: Section II introduces the CNN convolution notations, the opportunities and challenges of application-managed buffers with respect to caches, and the CNN data locality optimization problem; Section III explains previously published related work; Section IV presents our analytical memory performance model applied to the CNN convolution computation and compares it to previously published models; finally, Section V illustrates application of this model for the development of a shared memory CNN convolution hardware accelerator.

II Background and Notation

Fig. 1: Generic view of a CNN accelerator combining a computing datapath with an optimized application-managed local reuse buffer, and off-accelerator memory external to the datapath.

    // M output fmaps loop
    LOF: for (m = 0; m < M; m++)
      // C input fmaps loop
      LIF: for (c = 0; c < C; c++)
        // spatial loops (ExE)
        LSY: for (y = 0; y < E; y++)
          LSX: for (x = 0; x < E; x++)
            // filter loops (RxR, stride S)
            LFY: for (k = 0; k < R; k++)
              LFX: for (l = 0; l < R; l++) {
                p = I[c][y*S+k][x*S+l];
                w = W[m][c][k][l];
                O[m][y][x] += p*w;
              }

Fig. 2: Canonical form of the CNN convolution layer loop-nest.

In this paper we propose an analytical model for evaluating the storage requirements and memory bandwidth of a generic CNN convolutional layer, expressed as a six-level loop-nest, as shown in Figure 2 in a "canonical" form. Table I lists the symbols used in discussing the CNN convolutional layer and their meaning. In Figure 2, there are three arrays referenced in the loop-nest: I holds the C input feature maps of size N x N; W holds the M x C convolution kernels, with R x R weights each; and O holds the M output feature maps of size E x E. (We use square feature maps to avoid overburdening the mathematical notation, although the model can be easily extended to general rectangular feature maps.)

Symbol  Description
N       Input feature map size (N x N)
E       Output feature map size (E x E)
R       Convolution kernel size (R x R)
S       Convolution kernel stride
C       Number of input feature maps
M       Number of output feature maps
TABLE I: Symbols used in this paper

We consider the problem of optimizing the CNN convolutional layer memory access pattern for a two-level memory hierarchy. Figure 1 shows a generic view for an accelerator consisting of computing datapath, local reuse buffer optimized for this datapath, and off-accelerator memory external to the computing datapath. The problem consists in minimizing the number of memory accesses to the off-accelerator memory given a limited local buffer capacity.

II-A Data reuse in CNN convolutional layers

Data reuse occurs when a reference within a loop accesses the same data element in different iterations. The convolution loop-nest shown in Figure 2 contains several opportunities for data reuse (quantified in the short example after this list):

  • convolution reuse: Each input feature map pixel is reused up to R x R times within each input feature map.

  • weight reuse: Each kernel weight is reused E x E times.

  • input fmap reuse: Each input feature map pixel is reused across the M output feature map computations.

  • output fmap accumulation: Each output feature map pixel is reused across the C accumulations of partial results from the C input feature maps.
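To make these reuse opportunities concrete, the short C program below counts how many times the canonical loop-nest of Figure 2 touches each array and how many distinct elements each array contains; the ratio of the two is the reuse factor available to a scheduler. The layer parameters are illustrative assumptions, not taken from any specific network.

    #include <stdio.h>

    int main(void) {
        /* Illustrative layer parameters (assumptions, not a specific CNN). */
        const long C = 64, M = 128;      /* input / output feature maps      */
        const long E = 56, R = 3, S = 1; /* output size, kernel size, stride */
        const long N = (E - 1) * S + R;  /* input feature map size           */

        /* Every innermost iteration of the loop-nest in Figure 2 reads one
           element of I, one element of W, and accumulates one element of O. */
        const long accesses = M * C * E * E * R * R;

        const long distinct_I = C * N * N;     /* input pixels   */
        const long distinct_W = M * C * R * R; /* kernel weights */
        const long distinct_O = M * E * E;     /* output pixels  */

        printf("accesses per array : %ld\n", accesses);
        printf("I reuse factor     : %.1f\n", (double)accesses / distinct_I);
        printf("W reuse factor     : %.1f\n", (double)accesses / distinct_W);
        printf("O reuse factor     : %.1f\n", (double)accesses / distinct_O);
        return 0;
    }

For these values the W reuse factor equals E x E and the O reuse factor equals C x R x R, matching the list above; the I reuse factor is close to M x R x R, boundary effects aside.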

Carries?  LFX      LFY      LSX      LSY      LIF  LOF
I         yes/no*  yes/no*  yes/no*  yes/no*  no   yes
W         no       no       yes      yes      no   no
O         yes      yes      no       no       yes  no
TABLE II: Carrying loops of array references of the CNN loop-nest. *The reuse of I is carried by the outer loop of each of the pairs (LFX, LSX) and (LFY, LSY), depending on their relative order.

We say that the reuse of a reference is carried by a loop if the same memory location is used by different iterations of that loop [19]. In Figure 2, reuse of the W reference is carried by two loops, LSX and LSY; reuse of the O reference is carried by the loops LFX, LFY, and LIF. The reuse of the I reference is slightly more complex because it is carried by a combination of loops: which pair of loops between (LFX, LSX) or (LFY, LSY) is carrying the reuse of array I depends on the relative ordering of the loops; the reuse is carried by the outer loop in each pair. The reuse of the I reference is also carried by loop LOF. Table II summarizes which loops carry the reuse of each array reference in the CNN loop-nest.

In order to take full advantage of the reuse, such that every data element is loaded from/stored to off-accelerator memory only once, a large local buffering capacity is necessary. In Figure 2, the entire set of input feature maps needs to be stored in the local buffer to reuse the input data across the loop LOF. To reuse the accumulated partial output feature maps across loop LIF, one full output feature map needs to be stored in the local buffer. To put this into perspective, the total amount of buffering required for the second convolution layer of AlexNet [20] exceeds 284 kB if we consider that each element in the input and output feature maps and the kernel weights is 1 byte in size.
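As a quick sanity check of the volumes involved, the snippet below adds up the two buffering contributions named above, the whole input feature map volume plus one output feature map, at 1 byte per element, for made-up layer parameters (not AlexNet's, whose exact configuration is not repeated here):

    #include <stdio.h>

    int main(void) {
        /* Made-up layer parameters, 1 byte per element as in the text. */
        const long C = 64, E = 56, R = 3, S = 1;
        const long N = (E - 1) * S + R;     /* input feature map size       */

        const long all_inputs = C * N * N;  /* needed to reuse I across LOF */
        const long one_output = E * E;      /* needed to reuse O across LIF */

        printf("local buffer for full reuse: %ld bytes (%.1f kB)\n",
               all_inputs + one_output,
               (all_inputs + one_output) / 1024.0);
        return 0;
    }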

If no large local buffer is available, optimal data reuse cannot be achieved and some data must be accessed from the external level in the memory hierarchy multiple times. From Figure 2, if only R lines of input feature maps can be buffered locally, similar to the line buffers of 2D convolvers [21], then every input feature map pixel will need to be re-loaded M times, once for each iteration of loop LOF. For the second convolution layer of AlexNet, this would multiply the required number of memory accesses for the I array by 256. The problem is that, although there is plenty of data reuse in the CNN loop-nest, unless the entire working set fits in the local buffer, this reuse cannot be fully exploited. The remainder of this Section builds up the necessary background and notation to tackle the problem of maximizing data reuse in local memory in an analytical fashion.

II-B Local Reuse Buffers

A generic 2-level memory hierarchy such as that exemplified in Figure 1 exposes two memory levels: off-accelerator memory, where we assume all data involved in the CNN loop-nest is resident, and a local reuse buffer that usually cannot host the entirety of the data, but is vastly faster and more energy-efficient than off-accelerator memory. We assume the local reuse buffer to be implemented either as a data cache or as a software-managed scratchpad memory. This buffer is used to host data that is reused multiple times. While data reuse is inherent in the computation and not dependent on a particular shape of the loop-nest, the usage of a local reuse buffer of any kind implies that the reuse only translates into a reduction of memory accesses if there is enough data locality, i.e. if data inside the local buffer are reused within a short period of time and are not replaced between reuse accesses.

II-B1 Data caches

Existing reuse evaluation methods derive from cache behavior and build a localized iteration space, i.e. a set of innermost loops of a loop-nest where data locality is exposed [8]. It is assumed that all array references inside the localized iteration space need to be simultaneously stored in a local reuse buffer [19, 9, 6, 7]. Indeed, in the context of a data cache, every array reference is mapped to a unique location in the cache. If the cache capacity is smaller than required for holding all array references, reused data can be displaced from the cache and are not guaranteed to remain in the cache in every iteration in which they are referenced. Thus, all data touched inside the localized iteration space need to be cached in order to benefit from the data reuse. For example, in order to reuse elements of array I across the loop LSY, the localized iteration space has to include the loops LFX, LFY, LSX and LSY. The data cache would need to hold all the elements of I touched inside this iteration space; otherwise referencing W, for example, could displace some other reused data. The data cache would also need to hold the touched elements of the output feature map array O; otherwise, referencing the O array may displace elements of I from the cache before they have been reused.

II-B2 Application-managed scratchpad memories

Dedicated hardware accelerators, including CNN accelerators, commonly use application-managed scratchpad memories as local reuse buffers instead of caches, as these are deemed to provide better performance, predictability, and energy efficiency [22, 2]. In this case, data placement, reuse, and transfer have to be managed explicitly by partitioning the local reuse buffer into a set of application-managed buffers, one for each array referenced in the loop-nest. Application-managed buffers can be partitioned statically (i.e. by using physically separate memories to implement them) or dynamically (i.e. by partitioning a single piece of memory so that all array references fit). Instead of a single localized iteration space, each array reference can have its own data locality scope. Thus, the utilization of the local reuse buffer can be optimized by choosing, for each array reference, the nested loop level at which its data are buffered for reuse. We call this level the buffering level of the array reference.

The number of loop iterations, D_A^L, that separates two consecutive accesses to the same data element is called the dependence distance of data reuse, or simply the reuse distance [8]. Only the elements of the array touched during D_A^L loop iterations need to be buffered in the application-managed local buffer in order to ensure that reused data is preserved across loop iterations. For example, in Figure 2, the reuse distance of the reference to array I across the loop LSY is R iterations for a unit-stride convolution (R input lines are touched iterating over the LSX loop). Therefore, if an implementation chooses to buffer the I array at the LSY level, enabling data reuse across this loop, only those R input lines of I need to be kept in the local buffer, similar to the line buffers of 2D convolvers. Similarly, buffering the O array at the LFY level, i.e. reusing the accumulated O data across the two innermost loops, requires only a single element of the array to be buffered inside the local buffer.

In the remainder of this work, we will focus mostly on the case of application-managed scratchpad memories, which are used in the majority of computing platforms dedicated to CNNs [9, 10, 11, 12, 13, 3, 14, 15, 16, 17].

II-C Data Locality Optimization

    // output fmaps - loop on tiles
    LTOF: for (mm = 0; mm < M; mm += mss)
      // input fmaps - loop on tiles
      LTIF: for (cc = 0; cc < C; cc += css)
        // spatial - loops on tiles
        LTSY: for (yy = 0; yy < E; yy += iss)
          LTSX: for (xx = 0; xx < E; xx += jss)
            // output fmaps - tile loop
            LOF: for (m = mm; m < min(mm+mss, M); m++)
              // input fmaps - tile loop
              LIF: for (c = cc; c < min(cc+css, C); c++)
                // spatial - tile loops
                LSY: for (y = yy; y < min(yy+iss, E); y++)
                  LSX: for (x = xx; x < min(xx+jss, E); x++)
                    // kernel - tile loops
                    LFY: for (k = 0; k < R; k++)
                      LFX: for (l = 0; l < R; l++) {
                        p = I[c][y*S+k][x*S+l];
                        w = W[m][c][k][l];
                        O[m][y][x] += p*w;
                      }

Fig. 3: Tiled CNN convolution layer loop-nest.

Memory performance optimization methods, such as [19] or [9], use reordering and tiling of the loop-nest to maximize the data locality. Loop reordering places a subset of the loops that form the localized iteration space at the innermost position. Loop tiling partitions the loop-nest iteration space into a number of smaller blocks, such that data used inside the localized iteration space stays in the local buffer until it is reused. Figure 3 shows a general form of tiled convolution loop-nest, where the choice of tile sizes leads to different data locality results.

With caches, the order in which loops from the localized iteration space execute is not important because the ability to reuse data depends solely on the total number of data elements loaded between reuses, i.e. only on the size of the localized iteration space. Conversely, with application-managed buffers, the order of loop execution has a substantial effect on data locality. To see this, consider reusing the partially accumulated elements of array O across the LIF loop in Figure 3. Let F be the number of elements of O touched across the iss and jss iterations of loops LSY and LSX, i.e. one spatial output tile. Since O is not reused across the loops LSY and LSX, and its reuse must be preserved across the loop LIF, F elements need to be buffered in the local buffer to ensure the reuse of O across the loop LIF. Consider now reordering loops LIF and LOF. The reuse distance of O with respect to the loop LOF is one iteration, therefore the elements touched by all mss iterations of this loop need to be buffered in order to ensure the reuse of O across the loop LIF. A local buffer of mss x F elements of O is thus necessary.

Generally, with a limited buffering capacity it is not possible to fully exploit the available data reuse for all array references at the same time. Different choices of loop order, tile sizes, and buffering levels lead to dataflow schedules with different reuse characteristics. In the following Sections, we first review the related work in the state-of-the-art, with previously proposed models to evaluate and compare dataflow schedules; then we introduce our proposed analytical memory performance model for evaluating this tradeoff in application to CNN convolution loop-nests.

III Related Work

The main difficulty in implementing cost effective embedded CNN accelerators is optimizing memory organization to minimize off-chip memory transfers and perform them as efficiently as possible.

One research direction that aims at alleviating the CNN memory bottleneck is data compression. Various techniques have been proposed in the literature: quantization [23, 24, 25], binarization [26, 27, 28, 29, 30], and data compression [31]. A survey on CNN data compression can be found in [32]. Our work is orthogonal to these techniques and can be used on top of them. Indeed, in our implementation we have used the dynamic fixed-point data quantization technique from [24]. We shall also point out that sparse neural networks have recently emerged as one solution for reducing the amount of computation and memory required for CNN processing [33, 34, 35, 36]. This approach is beyond the scope of our work.

The straightforward approach to scheduling CNN computations is to use the 2D convolution along with line buffers for the data reuse [21, 37, 38, 10, 11, 39, 40]. Although simple to implement, such accelerators are often limited to a particular convolution kernel size and input image size. Additionally, they can exploit only a fraction of the data reuse that exists inside the CNN convolution computation.

Several architectures attempted to overcome the 2D convolution limitations by scheduling the CNN convolution dataflow differently, with the dataflow schedules being the result of ad-hoc and empirical exploration. For example, the Eyeriss accelerator [41, 3] proposed the row-stationary dataflow, which schedules the CNN convolution on a 2D array of processing elements and optimizes the data reuse by exploiting the low-cost memory levels, the PE scratchpads and the inter-PE communication, while minimizing data accesses to the high-cost levels, including the large on-chip global buffer and the off-chip DRAM. Another CNN accelerator, Angel-Eye [42], is an FPGA accelerator that combines multiple parallel 2D convolvers such that several partial sums are accumulated simultaneously. The DianNao architecture [43] determined its dataflow schedule and buffer sizes experimentally. The authors acknowledged that, like many processing architectures, DianNao's efficiency and scalability remained severely limited by memory bandwidth constraints. DianNao's successor, the ShiDianNao accelerator [4], was directly integrated with an image sensor, thereby fully eliminating main memory. This approach is not scalable, as only a few small CNNs can be accommodated. Another successor, DaDianNao [44], employed a sufficiently large on-chip eDRAM for storing large CNNs close to the datapath. The DLAU architecture [45] utilizes tiling techniques and minimizes main memory transfers by accumulating a few partial results in internal buffers.

To solve the same problem in a more formal way, several publications proposed analytical memory performance models. For example, memory optimization models for stencil computational kernels were published in [46] and in [47]. Stencils differ from CNN convolutional layers in that they do not need to handle a large number of convolution kernel weights; therefore, these models cannot be used to optimize the CNN computation. TETRIS [48] analytically derived optimal dataflow schedules for the CNN convolution by simplifying the problem. They proposed the bypass ordering, where the internal on-chip storage is bypassed for two out of the three input streams in the convolution layer, using a large register file for buffering the two bypassed streams instead. Bypass ordering is significantly simpler than the general computation scheduling problem, and it is possible to analytically derive the optimal loop-nest shape without resorting to an exhaustive search of the solution space. However, the bypass ordering relies on a particular architecture and requires a large local register file and buffer.

Wolf and Lam [8, 19] used loop blocking and reordering techniques to capture data locality and reduce the memory traffic in scientific computations. They used a combination of loop interchange, skewing, and reversal (unimodular transformations) with loop tiling (blocking) to improve the data locality of loop-nests in the context of memory hierarchies with caches. Such a model is inaccurate in the context of memory hierarchies with application-managed buffers. As a result, Wolf's method overestimates the real buffering requirements and required memory bandwidth of a computation and leads to sub-optimal dataflow schedules.

Peemen et al. [5, 9] proposed an architecture model where a computation loop-nest is split into two parts: an innermost tile (similar to M. Wolf's localized iteration space) for execution on the accelerator, and outer controlling loops that run on a host processor. Peemen's approach improves on M. Wolf's cache model significantly by taking into account that, with application-managed buffers, some data can be reused between consecutive executions of the innermost tiles. They proposed a design flow for selecting the best dataflow schedule to maximize data reuse given a buffer size restriction. This is achieved by tiling the CNN convolution loop-nest and reordering the controlling loops. However, as explained in Section IV, the buffer estimation remains inaccurate and the solution is sub-optimal.

Zhang et al. [6] proposed an analytical approach for analyzing the computing throughput and required memory bandwidth of a CNN design on an FPGA platform. Similar to Peemen's method, the CNN loop-nest is tiled, with the innermost tiled loops executing in the FPGA while the controlling loops execute on the host. Data need to be loaded into (and read from) the FPGA internal memory for each execution of the innermost tile. In order to reduce the main memory traffic they use local memory promotion [49] for placing out-of-FPGA communication operations optimally. Local memory promotion allows moving the transfer of an array across the innermost controlling loop when this loop carries full reuse of that array, i.e. when the loop's iterator does not appear in any reference to the array. Similarly to Peemen's method, Zhang et al. explore 4 possibilities for the 4 different innermost controlling loops in the convolution loop-nest. The resulting dataflow schedules are less efficient than in Peemen's approach because only a subset of array references benefit from data reuse across consecutive executions of the innermost tile.

Yang et al. [7] published a method for convolution loop-nest tiling for multi-level memory hierarchies. The authors focus on improving the total energy consumption in such systems. In Yang's work, one loop blocking is performed for each target memory hierarchy level, building one localized iteration space per memory level. Blocking the convolution loop-nest at each level can be thought of as tiling the loops in the loop-nest, and then exchanging the order in which the controlling loops are executed. Yang et al. acknowledge that optimal application of their algorithm to multiple memory hierarchy levels is quite computationally costly. In order to achieve a reasonable computation time, instead of optimal multi-level blocking, the authors propose to apply a 2-level blocking repeatedly, starting from the lowest memory hierarchy level and moving upwards, while adjusting the lower-level results at each new level. However, for the 2-level blocking, since the memory-level buffering requirements are estimated as the sum of all data elements touched inside the level's localized iteration space, this approach results in dataflow schedules of essentially the same quality as Peemen's method.

Overall, being derived from cache behavior, the above memory performance models assume that the local reuse buffer must be dimensioned to simultaneously hold all data elements in the localized iteration space. Our method is based on the observation that, under application control, different data may be buffered at different loop-nest levels, i.e. there is no single localized iteration space. As a result, our memory performance model yields a more accurate buffer size estimation for application-managed buffers and better dataflow schedules.

IV Memory Performance Model

Given a limited local reuse buffer capacity, memory performance optimization consists in finding a dataflow schedule, i.e. the computation order, such that i) the working set of the computation fits in the local reuse buffer; ii) traffic to off-accelerator memory is minimized, therefore reducing the energy and time dedicated to data movement. Therefore, using the notation introduced in Section II, building a dataflow schedule involves specifying the loop-nest shape via loop tiling and reordering, and choosing a buffering level for each array reference.

In this section, we develop an analytical model for evaluating dataflow schedules that is more accurate than previously published models in the case of application-managed buffers. Given a dataflow schedule, our analytical model computes the local buffer size and the number of bytes accessed from the off-accelerator memory required for its execution. Note that though we develop our model in the context of the CNN convolution computation, the model itself is generic and applicable to other types of computation organized in loop-nests.
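Concretely, a dataflow schedule in this model can be captured by a small data structure holding the loop order, the tile sizes, and one buffering level per array; the encoding below is only one possible sketch, and its field names are illustrative assumptions rather than part of the paper's implementation.

    #include <stdio.h>

    /* The six loops of the canonical CNN convolution nest (Figure 2). */
    enum loop_id { LFX, LFY, LSX, LSY, LIF, LOF, NUM_LOOPS };

    /* The three arrays referenced in the nest. */
    enum array_id { ARR_I, ARR_W, ARR_O, NUM_ARRAYS };

    /* One dataflow schedule: a loop order (innermost first), a tile size per
       loop, and a buffering level per array reference.                      */
    struct schedule {
        enum loop_id order[NUM_LOOPS]; /* order[0] is the innermost loop      */
        int tile[NUM_LOOPS];           /* tile size chosen for each loop      */
        int buf_level[NUM_ARRAYS];     /* index into 'order' at which each
                                          array is buffered for reuse         */
    };

    int main(void) {
        /* Example: canonical (untiled) nest, O buffered across the filter
           loops, I across the spatial loops, W at the innermost level.      */
        struct schedule s = {
            .order = { LFX, LFY, LSX, LSY, LIF, LOF },
            .tile  = { 0, 0, 0, 0, 0, 0 },  /* 0 = full loop, no tiling      */
            .buf_level = { [ARR_I] = 3, [ARR_W] = 0, [ARR_O] = 1 },
        };
        printf("innermost loop id: %d, O buffered at level %d\n",
               s.order[0], s.buf_level[ARR_O]);
        return 0;
    }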

IV-A Local reuse buffer size

Array  LFX      LFY      LSX      LSY      LIF  LOF
I      1 or R*  1 or R*  1 or R*  1 or R*  1    M
W      1        1        E        E        1    1
O      R        R        1        1        C    1
TABLE III: Reuse distance of arrays in the CNN convolution loop-nest.
*The reuse of the I array is carried by a combination of two loops, as explained earlier: which pair of loops, (LFX, LSX) or (LFY, LSY), is carrying the reuse depends on the relative ordering of the loops.

Let us first introduce the concept of the footprint, necessary to build the analytical model for a loop-nest. As explained earlier, the reuse distance of an array reference with respect to a loop corresponds to the number of iterations during which the corresponding array element is used by the computation. Table III lists the reuse distances of the CNN convolution array references with respect to all loops in the loop-nest. The footprint of an array inside a loop is the portion of the array that is touched while iterating over the loop. Thus, the footprint of an array A in loop L measures the number of distinct elements of A used inside L. Assume L_1, ..., L_n are the loops in the current dataflow schedule, ordered from the innermost to the outermost one. If we call n_k the number of iterations of loop L_k, and D_A^k the reuse distance of the array reference A with respect to loop L_k, the footprint F_A^k of array A in loop L_k can be computed as follows:

F_A^{k} = F_A^{k-1} \cdot \lceil n_k / D_A^{k} \rceil    (1)

with F_A^{0} = 1. Intuitively, this means that the footprint of an array over a given loop L_k is the footprint of the same array over one iteration of L_k, multiplied by the number of times that the elements of this array have to be replaced in the local buffer during the n_k iterations of L_k.

The footprint takes into account any reuse of data elements that exists in a loop. Thus, in order for the elements of array A to be reused across iterations of a loop L_k, the local buffer must be big enough to hold the full footprint of one iteration of the loop. Furthermore, if the loop does not carry reuse of the array, the application-managed buffer can be shared by data elements from multiple loop iterations. This means that the actual required size B_A^k of the application-managed buffer, when array A is buffered at the level of loop L_k, is computed as follows:

B_A^{k} = \begin{cases} F_A^{k-1} & \text{if } D_A^{k} > 1 \text{ (L}_k\text{ carries reuse of } A\text{)} \\ B_A^{k-1} & \text{otherwise} \end{cases}, \quad B_A^{0} = 1    (2)

Given a reordered and tiled loop-nest, Equation 2 makes it possible to recursively compute the local buffer requirements for all array references, starting from the innermost loop in the loop-nest. Therefore, it allows evaluating the buffering requirements of a dataflow schedule, given its loop-nest shape with buffering levels annotated for all data references.
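A minimal sketch of how Equations 1 and 2 could be evaluated in code is shown below; the ceiling-based replacement count and the handling of loops that do not carry reuse follow the reconstruction above and are our assumptions, not the authors' reference implementation. The example evaluates the O array of the canonical nest with assumed layer sizes.

    #include <stdio.h>

    #define NUM_LOOPS 6

    /* Footprint of an array over loops L1..Lk (Equation 1):
       F_k = F_{k-1} * ceil(n_k / D_k), with F_0 = 1.
       n[i] is the iteration count of the (i+1)-th loop (innermost first),
       d[i] is the reuse distance of the array with respect to that loop.  */
    static long footprint(const long n[], const long d[], int k) {
        long f = 1;
        for (int i = 0; i < k; i++)
            f *= (n[i] + d[i] - 1) / d[i];   /* ceil(n_i / D_i) */
        return f;
    }

    /* Local buffer requirement when the array is buffered at level k
       (Equation 2): if L_k carries reuse (D_k > 1), keep the full footprint
       of one iteration of L_k; otherwise the lower-level buffer is shared. */
    static long buffer_size(const long n[], const long d[], int k) {
        if (k == 0) return 1;
        if (d[k - 1] > 1) return footprint(n, d, k - 1);
        return buffer_size(n, d, k - 1);
    }

    int main(void) {
        /* Illustrative example: array O in the canonical nest (loops ordered
           LFX, LFY, LSX, LSY, LIF, LOF), with R = 3, E = 56, C = 64, M = 128. */
        const long n[NUM_LOOPS] = { 3, 3, 56, 56, 64, 128 };
        const long d[NUM_LOOPS] = { 3, 3, 1, 1, 64, 1 };  /* Table III, row O */

        printf("footprint of O over the four inner loops: %ld\n",
               footprint(n, d, 4));            /* 56 * 56 = 3136             */
        printf("buffer needed to reuse O across LIF  : %ld\n",
               buffer_size(n, d, 5));          /* footprint of one LIF iter. */
        return 0;
    }

With these values, reusing O across LIF requires buffering one full 56 x 56 output feature map (3136 elements), consistent with the observation made in Section II.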

IV-B Off-accelerator memory traffic

Equation 2 allows us to evaluate which dataflow schedules are feasible given a certain buffer size, as well as to compare them based on the minimum local buffer size they require, but it does not give any indication of their quality in terms of off-accelerator memory traffic. Let us call T_A the memory traffic, computed as the number of bytes accessed in off-accelerator memory for array A, and p_A the numerical precision (in bytes) used for its storage. If A is buffered at level b_A, then the number of memory accesses to A is given by the footprint of A with respect to loop L_{b_A} multiplied by the total number of times that this loop is executed:

T_A = p_A \cdot F_A^{b_A} \cdot \prod_{i > b_A} n_i    (3)

Note that T_A does not depend explicitly on the size of the local buffer, but only on the dataflow schedule, through the footprint and the iteration counts of the outermost loops. Whether the schedule fits within a given local buffer capacity depends only on Equation 2.

Memory accesses to the O array constitute a special case as they include two distinct contributions: writes of the final, fully computed output feature maps, and memory accesses corresponding to the accumulation of intermediate partial results (each accumulation composed of two accesses, 1 write + 1 read). Storage of the accumulated partial results typically uses a different (higher) numerical precision than that used for the final O array. The total traffic is therefore the sum of the traffic of the three I, W, O arrays:

T = T_I + T_W + T_O    (4)

Equations 3 and 4 enable a quantitative comparison of dataflow schedules in terms of memory traffic, which is known to be strongly correlated with energy consumption, and of course with system cost [2, 30].
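The sketch below evaluates Equations 3 and 4 for one example choice of buffering levels; the reuse distances follow Table III, while the precisions, layer sizes, and the exact accounting of the partial-sum read-modify-write traffic are our assumptions for illustration, not the authors' tool.

    #include <stdio.h>

    #define NUM_LOOPS 6

    /* Footprint over loops L1..Lk, as in Equation 1 (innermost first). */
    static long footprint(const long n[], const long d[], int k) {
        long f = 1;
        for (int i = 0; i < k; i++)
            f *= (n[i] + d[i] - 1) / d[i];
        return f;
    }

    /* Equation 3: element accesses to an array buffered at level k =
       footprint over L1..Lk times the trip count of the enclosing loops. */
    static long accesses(const long n[], const long d[], int k) {
        long a = footprint(n, d, k);
        for (int i = k; i < NUM_LOOPS; i++)
            a *= n[i];
        return a;
    }

    int main(void) {
        /* Illustrative layer: R = 3, E = 56, C = 64, M = 128, canonical
           loop order LFX, LFY, LSX, LSY, LIF, LOF (assumed values).        */
        const long n[NUM_LOOPS]   = { 3, 3, 56, 56, 64, 128 };
        const long d_I[NUM_LOOPS] = { 1, 1, 3, 3, 1, 128 }; /* row I, this order */
        const long d_W[NUM_LOOPS] = { 1, 1, 56, 56, 1, 1 }; /* row W             */
        const long d_O[NUM_LOOPS] = { 3, 3, 1, 1, 64, 1 };  /* row O             */

        const long p_in = 1, p_w = 1, p_out = 1, p_acc = 4; /* bytes (assumed)   */
        const long final_O = 128L * 56 * 56;                /* M*E*E elements    */

        /* Buffering levels are schedule choices; these are only an example. */
        long t_I = p_in * accesses(n, d_I, 4);  /* I buffered at the LSY level */
        long t_W = p_w  * accesses(n, d_W, 4);  /* W buffered at the LSY level */
        long a_O = accesses(n, d_O, 5);         /* O buffered at the LIF level */

        /* Equation 4 with the O special case: one final write per output
           element, plus a read + write of a higher-precision partial sum for
           every additional spill (our interpretation of the text above).    */
        long t_O = p_out * final_O + 2 * p_acc * (a_O - final_O);
        printf("T = T_I + T_W + T_O = %ld bytes\n", t_I + t_W + t_O);
        return 0;
    }

In this example O is fully accumulated locally, so the partial-sum term vanishes; schedules that spill partial sums pay the 2 x p_acc penalty instead.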

IV-C Dataflow schedule selection procedure

We conduct the CNN convolution design space exploration in two steps. In the first step, we compute the local buffering requirements for each array reference at different loop levels across an enumeration of different loop orders and loop tile sizes. This step is independent of a particular CNN layer shape because the local buffering requirements at any loop-nest level depend only on the loop order and the tile sizes of the different loops. In the second step, using these pre-enumerated buffer requirements, we analyze a particular CNN layer, exhaustively searching for the best combination of buffering levels for the three CNN arrays under different local buffer capacities.

The first step requires the enumeration of 6! = 720 loop-nest permutations. However, we can reduce this number by not considering permutations between the two kernel loops (LFX and LFY in Figure 3), and between the two image loops (LSX and LSY in Figure 3). These permutations can be omitted without affecting our conclusions because they result in symmetric dataflow schedules. In order to reduce the enumeration size further, tile sizes are enumerated selectively. We want to quantify how different loop orderings and tile sizes affect the number of memory accesses; for this it is not necessary to enumerate all possible tile sizes. Instead, we examine a sequence of monotonically increasing, power-of-two tile sizes, as well as tile sizes that correspond to common CNN layer configurations. The first step results in buffering requirements for each CNN array reference at each loop level for different loop-nest shapes.

With the above search-space reduction, the first exploration step yields 180 possible convolution loop-nest permutations with multiple tiling shapes each. In the second step, we search within this pre-enumerated loop-nest permutation space for dataflow schedules that fit local reuse buffer capacities between 1KB and 512KB, while minimizing the required off-accelerator memory access bandwidth. This search results in one best dataflow schedule, i.e. loop-nest order, tiling sizes, and buffering levels for the CNN memory references, for each evaluated convolution layer and for each buffer capacity.
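Step two of the procedure then reduces to a filter-and-minimize pass over the pre-enumerated candidates; the sketch below uses three made-up candidate schedules and three buffer capacities purely as illustrative data.

    #include <stdio.h>

    /* A pre-enumerated candidate: buffer requirement and resulting
       off-accelerator traffic (both in bytes) for one loop order, tiling,
       and set of buffering levels, as produced by step one.              */
    struct candidate {
        const char *name;
        long buffer_bytes;
        long traffic_bytes;
    };

    int main(void) {
        /* Made-up illustrative candidates (not real exploration results). */
        const struct candidate cand[] = {
            { "schedule A",  2 * 1024, 40L * 1024 * 1024 },
            { "schedule B", 16 * 1024, 12L * 1024 * 1024 },
            { "schedule C", 96 * 1024,  6L * 1024 * 1024 },
        };
        const int n = sizeof cand / sizeof cand[0];
        const long capacities[] = { 4 * 1024, 32 * 1024, 128 * 1024 };

        /* For each local buffer capacity, keep the feasible candidate with
           the lowest off-accelerator traffic.                              */
        for (int c = 0; c < 3; c++) {
            const struct candidate *best = NULL;
            for (int i = 0; i < n; i++) {
                if (cand[i].buffer_bytes > capacities[c]) continue;
                if (!best || cand[i].traffic_bytes < best->traffic_bytes)
                    best = &cand[i];
            }
            printf("capacity %6ld B -> %s\n", capacities[c],
                   best ? best->name : "none feasible");
        }
        return 0;
    }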

IV-D Comparison vs existing models

Fig. 4: Memory traffic comparison of the best dataflow schedules identified by our model, the model proposed in Peemen et al. and the cache model on a set of representative CNNs, while sweeping the local memory constraint from 1 to 256 KB. Both axes are in logarithmic scale. Results aggregate traffic contributions from all convolutional layers; the dashed red line indicates the ideal memory traffic when all data reuse is fully exploited.
Fig. 5: Memory traffic overhead with respect to our proposed model when using the model proposed in Peemen et al. to select the best dataflow schedule, while sweeping the local memory size from 1 to 256 kB.

The main use of the memory performance model is to evaluate the quality of dataflow schedules under a set of local buffering constraints, in order to determine which schedule minimizes memory traffic. Therefore, to compare our model with the current state-of-the-art, we first investigated whether it can identify better dataflow schedules than those that can be extracted by previously published models [8, 5, 6, 7]. We selected two representative models, cache and Peemen, to compare with our own proposed model. Similarly to what happens with our model, the selection of the best dataflow schedules involves exhaustive search over a large solution space, considering all possible loop tile sizes and loop orderings.

IV-D1 Cache

M. Wolf et al. [8] were the first to apply loop-nest reordering and tiling in order to find the localized iteration space. The method conservatively assumes that each data element in the localized iteration space working set needs to be allocated a place in the cache, thus overestimating the required memory footprint. It is also assumed that the entire localized iteration space working set needs to be transferred between the memory and the cache for each new localized iteration space execution, because it cannot be guaranteed that data from a previous execution is still present in the cache. The original paper [8] also proposed a heuristic for trimming the number of tiling possibilities, guided by the cache behavior in scientific computations.

IV-D2 Peemen

The memory performance model proposed by Peemen et al. [5], and independently by Zhang et al. [6], improves on the original cache model significantly by taking into account that, with application-managed buffers, some data can be reused between consecutive executions of the localized iteration space, which they call the innermost tile. By changing the order of the controlling loops, solutions of different quality are obtained. Yang et al. [7] published a similar model extended for optimal CNN loop-nest tiling for multiple levels of memory hierarchy. As explained in Section III, the models developed by Zhang et al. and by Yang et al. yield dataflow schedules essentially similar to the schedules generated by the Peemen et al. method; therefore we implemented the latter as a representative model for the three approaches. (Appendix A describes Peemen's memory performance model as we derived it for the loop-nest of Figure 3; for a detailed explanation of these formulas the reader is referred to [5].)

In all these methods, the ordering of the loops inside the innermost tile is not important because the ability to reuse data depends solely on the total number of data elements loaded into the local reuse buffer between reuses. In Peemen's and Zhang's models, only the innermost controlling loop affects the data reuse across consecutive iterations of the innermost tile. Therefore, fewer loop-nest permutations need to be explored, considerably reducing the solution search space.

We have compared the dataflow schedules generated by our model with the dataflow schedules computed by the previously published models [8], [5], [6], and [7] over the convolutional layers from five state-of-the-art CNN topologies: AlexNet, ZFNet, VGG16, Inception v3, and ResNet-20. (Configurations of the CNN layers that we have used in our evaluations are listed in Appendix B.) Although the following comparison is based on generated dataflow schedules, we show in Section V that our model is exact with respect to the real execution. Therefore, this estimation corresponds to the actual amount of memory traffic generated by these CNN layers.

Figure 4 plots the total number of data transfers to and from off-accelerator memory, using the best dataflow schedules estimated by the three models, while sweeping the size of the local reuse buffer from 1 to 256 KB. We aggregate the memory traffic from the best dataflow schedules identified for each layer, and we also show the ideal result given by the “essential” memory traffic that is present when all data reuse is exploited. From the plot, it is clear that the cache model largely overestimates the memory traffic requirements compared to the two application managed buffer models, and always results in sub-optimal dataflow schedules for any buffer size and for all CNN convolution layers that we tested, by a factor of up to 3.5.

The advantage of our model when compared with Peemen's method is more subtle, as both exploit the characteristics of application-managed buffers to yield a better schedule. To analyze the difference between the two models, Figure 5 plots the bandwidth requirements estimated from Peemen's model as a percentage overhead with respect to our model, given local reuse buffer capacities between 1KB and 256KB, for the same CNN convolution layers. Our model finds dataflow schedules with between 2.5% and 17.5% lower memory traffic, with over 10% memory traffic reduction for several CNNs, especially when targeting smaller local reuse buffer sizes. Even for relatively large local buffer sizes of 128KB and 256KB, our method results in dataflow schedules with more than 5% memory traffic reduction over the set of convolution layers for several CNN networks.

For most of the evaluated CNN convolution layers, our method results in some reduction of memory traffic due to its ability to exploit data reuse across all levels of the CNN convolution loop-nest. With rare exceptions, Peemen's method is able to find similar schedules only when the full volume of the input or the output feature maps can be stored in the local reuse buffer. We have noticed the following points that contribute to these results:

  • Our method's footprint calculation accounts for data reuse more accurately because it considers independent buffering for the I, W, O arrays. As a result, our buffer requirements are systematically lower for a given loop-nest shape, and leave room for bigger tiles to be placed in the local reuse buffers.

  • Due to independent buffering of the 3 arrays, our method always places memory transfers optimally with respect to the total memory traffic.

  • In Peemen's method, unless LTIF is the innermost controlling loop, the memory traffic for the O array is multiplied by 2 to account for 1 read and 1 write of the partial accumulations. This happens even when the LIF loop is not tiled.

Moreover, for a given CNN convolution layer, the memory traffic overhead of Peemen's method does not necessarily decrease when the local reuse buffer capacity is increased. Increasing the local buffer capacity allows, in the first place, generating larger tiles. However, at some buffer capacity points, our method is able to find a completely different loop ordering and different buffering levels for the data references, such that a larger memory traffic reduction can be achieved than by simply increasing the tile sizes.

It is worth noticing that Yang's algorithm can be modified to achieve dataflow schedules of the same quality as the ones obtained with our approach. It can be verified that applying a 4-level blocking at each memory hierarchy level, essentially enumerating the permutations of the 4 loops LSX, LSY, LIF, and LOF from Figure 3, would lead to schedules equivalent to our method. However, such 4-level blocking is quite computationally expensive: Yang et al. reported that a 4-level blocking of a single CNN layer takes 24 hours on a Xeon E5645 processor. Our method is significantly faster: the first step described in Section IV-C is performed only once for many different CNN layers, whereas in Yang's method a solution search needs to be performed for each convolution layer individually. Even with a densely sampled tiling space, for example exploring all tile sizes multiple of 2, our first step takes 2 minutes per layer on a simple Intel i7 processor running at 3.4 GHz.

V Case Study: ASMP Cluster Accelerator

As a case study for our proposed model, in this Section we illustrate how it can be used for a practical implementation of a low-cost CNN accelerator. Specifically, we (1) show how our model can be used to derive a dataflow schedule for a specialized hardware block, dedicated for processing CNN convolution layers, (2) show that this dataflow schedule is implementable, and (3) evaluate its efficiency in terms of memory bandwidth utilization and accuracy.

Fig. 6: ASMP Accelerator Platform tightly-coupled shared-memory cluster.

V-A Target architecture

As a target architectural template, we chose to integrate special-purpose convolution hardware blocks (HWCs) inside STMicroelectronics' ASMP tightly-coupled shared-memory cluster [50]. Fig. 6 shows the block diagram of this architectural template. An ASMP cluster is composed of a number of programmable cores, HWCs, and a DMA, all connected to a shared Tightly-Coupled Data Memory (TCDM) via a single-cycle logarithmic interconnect [51]. The shared memory enables efficient exchange of data between the convolutional units and the programmable processors, achieving a high degree of flexibility by computing non-convolution functions, such as pooling, normalization, etc., in software. Additionally, such a shared memory cluster can efficiently support traditional computer vision algorithms, such as ORB, HOG, etc., because many image processing algorithms are essentially based on convolutional operations.

The current ASMP cluster implementation targets mobile image processing applications and includes up to 16 RISC processor cores and HWC blocks running at a moderate frequency (500 MHz), and up to 256KB of TCDM memory. The interconnect can be designed with 32-bit or 64-bit TCDM access width, with a maximum peak bandwidth of 64 GB per second. The key element of the ASMP cluster is the logarithmic interconnect that allows multiple concurrent accesses to the multi-bank TCDM memory. The logarithmic interconnect provides a common infrastructure for core-to-core, core-to-hardware-block, and hardware-block-to-hardware-block communication.

The specialized convolution hardware blocks, the HWCs, are essential for achieving the required performance while keeping the cost and the power consumption low. For designing the HWC, we have leveraged a shared-memory dedicated hardware block methodology similar to the one described in [52]. The tight coupling of the HWC to shared memory allows the usage of complex memory access patterns, such as sliding kernels, repeated re-fetching of data, etc. This flexibility enables efficient implementation of a vast variety of dataflow schedules. We have chosen to implement a Single Instruction Multiple Data (SIMD) type of datapath. The SIMD processing ensures a steady datapath utilization independent of the convolution kernel size. Furthermore, we have opted for a relatively narrow 16-byte SIMD width such that the HWC efficiency is maintained even for smaller images.

V-B HWC Design Space Exploration

The HWC is dedicated to processing the CNN convolutional layer, which accounts for most of the computational work and most of the partial results data bandwidth in existing CNNs [2]. A critical point in the design of HWCs, like most accelerators, is what data has to be internalized within a local buffer and what is accessed from outside the accelerator, in this case from the cluster shared TCDM. This problem is readily mapped to the conceptual view of our model (Figure 1), where internalized memories constitute the application-managed local reuse buffer, whereas the TCDM is the off-accelerator memory. It is therefore straightforward to apply the memory performance model presented in Section IV as a tool for design space exploration, with the objective to find the best trade-off between the HWC local storage capacity and the required TCDM bandwidth.

Local storage and TCDM bandwidth are generally conflicting objectives. On the one hand, minimizing local storage capacity is important to reduce area and therefore cost of the HWC IP. Our shared cluster platform includes several HWC and a TCDM memory for buffering data on-chip. Making the total HWC internal storage capacity close to the TCDM capacity would make its usage redundant, as the access energy for local memory would be comparable to that of a TCDM access. On the other hand, the bandwidth to cluster shared memory remains a scarce resource because multiple actors in the system are accessing it simultaneously; without any local storage at all, every data access from HWC would be done to the TCDM memory. The resulting bandwidth requirement would exceed the local interconnect capacity, leading to a drop in performance and to high energy consumption. Furthermore, accessing the cluster shared memory is more expensive in terms of energy consumption than accessing internal HWC storage [30], and the number of ports used to connect the HWCs to the TCDM has a significant impact on its size and the maximum working frequency of the cluster – which means that minimizing TCDM bandwidth requirements is also important.

In general, the trade-off between local storage and memory bandwidth depends on the shape of the specific CNN layer: convolution kernel size, number and size of the feature maps, feature map and kernel numeric precision. Therefore, for a general-purpose HWC we want to build a CNN loop-nest dataflow schedule that, given a local reuse buffer capacity in the order of a few KBs, minimizes the TCDM memory traffic, and therefore bandwidth, across a large number of CNN layers taken from different CNNs. We conducted this design space exploration in two steps as described in Section IV. We analyzed 71 different representative CNN layers chosen from the AlexNet, ZFNet, VGG, Inception v3 and ResNet topologies, exhaustively searching for the best dataflow schedule for each of the 180 different loop-nest permutations under different local buffer capacities.

Fig. 7: Distribution of dataflow schedules across different loop-nest permutations, binned depending on the amount of memory traffic they generate.

Figure 7 shows the distribution of the dataflow schedule quality across the 180 loop-nest permutations obtained with different local memory constraints, binned according to the amount of memory traffic they generate. The Y-axis shows, for each local buffer capacity, the percentage of loop-nest permutations that result in optimal traffic, or that add up to 10%, 20%, etc. of overhead to the optimal traffic, or that exceed 2 times the optimal traffic, respectively. For small local buffer capacities, less than 20% of the loop-nest permutations can achieve optimal bandwidth. On the other hand, with a large local buffer, 50% of the loop-nest permutations can be tiled in such a way that optimal bandwidth is achieved.

    LTOF: for (mm = 0; mm < M; mm += mss)
      LTIF: for (cc = 0; cc < C; cc += css)
        LTSY: for (yy = 0; yy < E; yy += iss)
          LTSX: for (xx = 0; xx < E; xx += jss)
            // Buffering O
            LIF: for (c = cc; c < min(cc+css, C); c++)
              // Buffering I
              LSY: for (y = yy; y < min(yy+iss, E); y++)
                LFY: for (k = -R/2; k < R/2; k++)
                  // Buffering W
                  LOF: for (m = mm; m < min(mm+mss, M); m++)
                    LSX: for (x = xx; x < min(xx+jss, E); x++)
                      LFX: for (l = -R/2; l < R/2; l++) {
                        p = I[c][y*S+k][x*S+l];
                        w = W[m][c][k][l];
                        O[m][y][x] += p*w;
                      }

Fig. 8: HWC dataflow schedule: gives best bandwidth trade-off with local storage capacity less than 4KB.

By analyzing the best dataflow schedules obtained in this experiment, we confirmed the intuition that those dataflow schedules that allow the output feature maps to be fully accumulated locally by buffering the partial sums tend to be the best performing ones, especially when the local buffer capacity is small. Although in these schedules the input feature maps and weights are read from memory multiple times, they still result in fewer total memory accesses compared to dataflow schedules where the output feature maps are swapped out to memory before being fully accumulated through all of the input feature maps. Swapping and re-fetching the output feature maps to complete the accumulation generates twice the traffic compared with the read-only input feature maps and weights. Furthermore, the partially accumulated output feature maps require higher precision and therefore are more costly in terms of required bandwidth. It is interesting to notice that given less than 512KB buffering capacity, no one permutation resulted in a dataflow schedule with optimal bandwidth across the entire set of tested convolution layers.

Among several small-footprint schedules, we chose one that, for most tested CNN layers, leads to minimal memory bandwidth requirements for local buffer sizes from 1KB to 4KB. Our selection was also guided by several hardware implementation criteria, such as the number of required simultaneous local buffer accesses, access alignment, etc., and by compatibility with a SIMD datapath. Figure 8 shows the dataflow schedule chosen for the HWC implementation, with the buffering level for each array shown as a comment on top of the corresponding loop. The HWC main loop executes an innermost tile in the order LIF - LSY - LFY - LOF - LSX - LFX. The relative order of the loops LSX, LFX and LFY ensures that the partial sum accumulation remains internal to the HWC as much as possible. In the actual implementation, the tiling factor for the LIF loop is fixed and equals the SIMD datapath width. The remaining tile dimensions, i.e. the number of output feature maps, the number of input feature maps, and the number of output lines in a tile, are computed for each particular convolution layer, also using our performance model. In practice, over all tested CNN layers, the input feature map volume was never tiled (css = C), allowing a complete accumulation of the partial sums inside the HWC local buffer.
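One plausible way to size the spatial tile for a given layer, consistent with the requirement that the 32-bit partial sums of the output tile stay resident in the accumulator buffer, is sketched below; the buffer capacity, the full-width tile assumption (jss = E), and the simple downward search are illustrative assumptions, not the HWC's actual sizing code.

    #include <stdio.h>

    /* Pick the largest number of output lines (iss) such that the 32-bit
       partial sums of an mss x iss x jss output tile fit in the accumulator
       buffer.  The capacity and the jss = E choice are assumptions.        */
    static int pick_iss(int E, int mss, long acc_bytes) {
        const int jss = E;               /* tile spans the full output width */
        for (int iss = E; iss >= 1; iss--)
            if ((long)mss * iss * jss * 4 <= acc_bytes)
                return iss;
        return 0;                        /* tile does not fit at all         */
    }

    int main(void) {
        /* Illustrative values: E = 56, mss = 1 output map, and a 768-byte
           accumulator buffer (assumed, not the HWC's actual figure).        */
        printf("iss = %d output lines per tile\n", pick_iss(56, 1, 768));
        return 0;
    }

With these example numbers, three 56-wide output lines of one output map fit in the assumed 768-byte accumulator.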

V-C HWC evaluation

Fig. 9: HWC block internal architecture.

To explore the hardware implications of the dataflow schedule proposed in Figure 8, we specified a HWC prototype directly in C, using the CatapultC high-level synthesis (HLS) tool to derive a Verilog design. The design exposes one slave configuration port and 3 32-bit master ports towards the TCDM (one for each array reference), in order to separate the control of the memory accesses for each array and to ease the high-level synthesis process. We specified the HWC datapath so that it is capable of handling convolution kernel sizes up to 11 x 11, with stride support, and unlimited input/output feature map sizes. It is designed as a single-instruction multiple-data (SIMD) engine and is capable of sixteen 8-bit x 8-bit or eight 16-bit x 16-bit fixed-point MAC operations per clock cycle. Figure 9 shows the block diagram of the HWC and the final breakdown of the local storage capacity for the three CNN arrays. The prototype HWC has slightly over 1KB of internal storage, including small input, weight, and accumulated-sum buffers, with partial sums accumulated in 32-bit precision.

Using the HWC design generated by HLS, we performed synthesis and place-and-route of an ASMP cluster with 4 HWCs targeting a 28nm technology node, achieving a maximum frequency of 500 MHz. The computing cluster achieves up to 64 8-bit x 8-bit MAC operations per cycle, for a total of 32 GMAC/s at 500 MHz, with an average utilization of 80% across the set of convolutional layers used for the design space exploration.

To understand how accurately our memory performance model evaluates the memory traffic with respect to an actual implementation, we have measured the actual number of bytes transferred between the HWC and the TCDM during the execution of various CNN layers and compared these measurements with the bandwidth predicted by our memory performance model. Our memory performance model is almost exact with respect to the measured bandwidth. The memory performance model does not explicitly account for strided convolutional layers when the tile sizes are not integral multiples of the stride; a small overestimation of the memory traffic can be observed in such cases. From our experiments, such overestimation is always within 0.5% of the total memory traffic.

Fig. 10: Histogram of ratios of memory traffic achieved using the HWCE dataflow schedule vs the one proposed for the HWC, under the constraint of a local storage capacity of 1KB.

In order to put the HWC schedule into perspective, we compare the amount of memory traffic generated by the HWC to the memory traffic generated by a state-of-the-art tightly-coupled CNN convolution accelerator, HWCE [11, 53]. Both hardware units target ultra low-cost and low-power applications and implement very little internal buffer storage. Both hardware units target tightly-coupled shared memory clusters and are constrained by the performance of the shared memory logarithmic interconnect in a similar way. The HWCE implements a 2D convolution dataflow with linear input buffering similar to [21].

  // X-direction, image split in stripes in SW
  LTSX: for (xx = 0; xx < E; xx += jss)
   // output fmaps – loop tiled in SW
   LTOF: for (mm = 0; mm < M; mm += 1)
    // input fmaps – loop tiled in SW
    LTIF: for (cc = 0; cc < C; cc += 1)
     // Y-direction – innermost controlling loop
     LTSY: for (yy = 0; yy < E; yy += iss)
      // Below executed in HWCE
      // Buffering I, W
      LOF: for (m = mm; m < mm+1; m++)
       LIF: for (c = cc; c < cc+1; c++)
        // spatial – tile loops
        LSY: for (y = yy; y < min(yy+iss, E); y++)
         LSX: for (x = xx; x < min(xx+jss, E); x++)
          // Buffering O
          LFY: for (k = 0; k < R; k++)
           LFX: for (l = 0; l < R; l++) {
            p = I[c][y*S+k][x*S+l];
            w = W[m][c][k][l];
            O[m][y][x] += p*w;
           }

Fig. 11: HWCE CNN convolution layer loop-nest

As shown in Figure 11, the HWCE uses a variant of the canonical tiled loop-nest shown in Figure 3, with the outermost tiling loops executed in software: it executes a 2D convolution over a single input feature map, producing one partially computed output feature map. The HWCE takes only limited advantage of the data reuse available in the convolution loop-nest. Convolution reuse comes from the input feature maps being buffered in a linear input buffer [21]. Weight reuse is ensured by buffering the convolution kernel weights required for processing a single pair of input and output feature maps. The output feature maps and the partially accumulated sums are stored in the TCDM memory; the elements of the output array are buffered only while a single convolution kernel is being applied.
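To quantify the cost of keeping partial sums in shared memory, the sketch below gives a rough estimate of the partial-sum component of the HWCE TCDM traffic, under the simplifying assumption that every output pixel is read and written back as a 32-bit partial sum once per input feature map; the function name and the example layer shape are illustrative, not taken from the HWCE implementation.

  #include <stdint.h>
  #include <stdio.h>

  // Rough partial-sum traffic estimate for the Figure 11 schedule: each of the
  // M*E*E output pixels is read and written back as a 32-bit value once per
  // input feature map (boundary effects and stripe overlap ignored).
  static uint64_t hwce_psum_bytes(uint64_t M, uint64_t C, uint64_t E)
  {
      const uint64_t rw_per_pass = 2;  // one read plus one write
      const uint64_t psum_bytes  = 4;  // 32-bit partial sums
      return M * C * E * E * rw_per_pass * psum_bytes;
  }

  int main(void)
  {
      // Example: an AlexNet-3-like layer, 256 input fmaps, 384 output fmaps, 13x13 output.
      printf("%llu bytes\n", (unsigned long long)hwce_psum_bytes(384, 256, 13));
      return 0;
  }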

Figure 10 shows the ratio of the HWCE traffic to that of the HWC prototype using 1KB of internal buffer, for different CNN layers. While the HWC and the HWCE expose the same number of ports towards the TCDM (three 32-bit ports), the HWC dataflow schedule results in lower traffic for all tested convolutional layers. For some layers, the HWC dataflow schedule reduces traffic by up to 14 times compared to the HWCE dataflow. The HWCE dataflow schedule suffers from a significant penalty due to the large amount of partial accumulation sums that are stored in the TCDM memory: writing and reading these partial sums at higher numeric precision results in a noticeable increase in memory traffic. Additionally, with a 1KB memory budget, the HWCE linear input buffer would be too small for some layers, such as AlexNet-1, ZFNet-1, or ResNet-1; a bar is absent for such layers in the figure. To handle an insufficient linear input buffer, the actual HWCE implementation splits the input feature maps into several smaller stripes that fit the small linear input buffer; this results in a further slight increase of redundant shared-memory traffic due to the overlap between stripes. Since the HWC and the HWCE expose the same number of TCDM ports, the ones on the HWC are used, on average, significantly less. This translates into less energy spent in memory transactions and lower contention, making it easier to combine HWC operation with other computations in the cluster [52] without significant performance hits.

VI Conclusion and Future Work

We have presented an analytic memory performance model suitable for memory hierarchies that use application-managed buffers. We have shown that our model yields more accurate memory traffic estimates than previously published models and is accurate with respect to a real implementation. We have used this model to design a CNN convolution hardware block in the context of the ASMP shared-memory cluster. Our future work includes applying the model to the automatic generation of dataflow schedules from standard CNN descriptions in Caffe, TensorFlow, or similar frameworks.

References

  • [1] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning.” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [2] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey.” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
  • [3] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks.” J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
  • [4] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, “Shidiannao: shifting vision processing closer to the sensor.” in ISCA, D. T. Marr and D. H. Albonesi, Eds.   ACM, 2015, pp. 92–104.
  • [5] A. Peemen, B. Mesman, and H. Corporaal, “Optimal iteration scheduling for intra- and inter-tile reuse in nested loop accelerators,” Eindhoven University of Technology, Tech. Rep. EC Reports; ESR-2013-3, January 2013.
  • [6] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing fpga-based accelerator design for deep convolutional neural networks.” in FPGA, G. A. Constantinides and D. Chen, Eds.   ACM, 2015, pp. 161–170.
  • [7] X. Yang, J. Pu, B. B. Rister, N. Bhagdikar, S. Richardson, S. Kvatinsky, J. Ragan-Kelley, A. Pedram, and M. Horowitz, “A systematic approach to blocking convolutional neural networks.” CoRR, vol. abs/1606.04209, 2016.
  • [8] M. E. Wolf and M. S. Lam, “A data locality optimizing algorithm.” in PLDI, D. S. Wise, Ed.   ACM, 1991, pp. 30–44.
  • [9] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, “Memory-centric accelerator design for convolutional neural networks.” in ICCD.   IEEE Computer Society, 2013, pp. 13–19.
  • [10] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, “A 240 g-ops/s mobile coprocessor for deep neural networks.” in CVPR Workshops.   IEEE Computer Society, 2014, pp. 696–701.
  • [11] F. Conti and L. Benini, “A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters.” in DATE, W. Nebel and D. Atienza, Eds.   ACM, 2015, pp. 683–688.
  • [12] Y. Chen, T. Chen, Z. Xu, N. Sun, and O. Temam, “Diannao family: energy-efficient hardware accelerators for machine learning.” Commun. ACM, vol. 59, no. 11, pp. 105–112, 2016.
  • [13] J. Sim, J.-S. Park, M. Kim, D. Bae, Y. Choi, and L.-S. Kim, “14.6 a 1.42tops/w deep convolutional neural network recognition processor for intelligent ioe systems.” in ISSCC.   IEEE, 2016, pp. 264–265.
  • [14] “Intel nervana neural network processor: Architecture update,” https://ai.intel.com/intel-nervana-neural-network-processor-architecture-update, Intel, 2018.
  • [15] “An in-depth look at google’s first tensor processing unit (tpu),” https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu, Google, 2018.
  • [16] “Powervr series2nx neural network accelerators (nna),” https://www.imgtec.com/powervr/vision/series2nx, Imagination Technologies, 2018.
  • [17] G. Desoli, V. Tomaselli, E. Plebani, G. Urlini, D. Pau, V. D’Alto, T. Majo, F. D. Ambroggi, T. Boesch, S. pal Singh, E. Guidetti, and N. Chawla, “The orlando project: A 28 nm fd-soi low memory embedded neural network asic.” in ACIVS, ser. Lecture Notes in Computer Science, J. Blanc-Talon, C. Distante, W. Philips, D. C. Popescu, and P. Scheunders, Eds., vol. 10016, 2016, pp. 217–227.
  • [18] D. Rossi, F. Conti, A. Marongiu, A. Pullini, I. Loi, M. Gautschi, G. Tagliavini, A. Capotondi, P. Flatresse, and L. Benini, “Pulp: A parallel ultra low power platform for next generation iot applications.” in Hot Chips Symposium.   IEEE, 2015, pp. 1–39.
  • [19] M. S. Lam, E. E. Rothberg, and M. E. Wolf, “The cache performance and optimizations of blocked algorithms.” in ASPLOS, D. A. Patterson, Ed.   ACM Press, 1991, pp. 63–74, sIGARCH Computer Architecture News 19(2), SIGOPS Operating System Review 25(Special Issue April 1991), and SIGPLAN Notices 26(4).
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [21] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, “Cnp: An fpga-based processor for convolutional networks.” in FPL, M. Danek, J. Kadlec, and B. E. Nelson, Eds.   IEEE, 2009, pp. 32–37.
  • [22] F. Conti, C. Pilkington, A. Marongiu, and L. Benini, “He-P2012: architectural heterogeneity exploration on a scalable many-core platform,” in CASSAP, 2014, pp. 114–120.
  • [23] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision.” CoRR, vol. abs/1502.02551, 2015.
  • [24] P. Gysel, “Ristretto: Hardware-oriented approximation of convolutional neural networks.” CoRR, vol. abs/1605.06402, 2016.
  • [25] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, “Quantized convolutional neural networks for mobile devices.” in CVPR.   IEEE Computer Society, 2016, pp. 4820–4828.
  • [26] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations.” CoRR, vol. abs/1511.00363, 2015.
  • [27] M. Courbariaux and Y. Bengio, “Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1.” CoRR, vol. abs/1602.02830, 2016.
  • [28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks.” CoRR, vol. abs/1603.05279, 2016.
  • [29] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “Yodann: An architecture for ultralow power binary-weight cnn acceleration.” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 37, no. 1, pp. 48–60, 2018.
  • [30] F. Conti, P. D. Schiavone, and L. Benini, “XNOR Neural Engine: A Hardware Accelerator IP for 21.6-fJ/op Binary Neural Network Inference,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2940–2951, Nov 2018.
  • [31] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and B. Dally, “Deep compression and eie: Efficient inference engine on compressed deep neural network.” in Hot Chips Symposium.   IEEE, 2016, pp. 1–6.
  • [32] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, “A survey of model compression and acceleration for deep neural networks.” CoRR, vol. abs/1710.09282, 2017.
  • [33] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, “Sparse convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 806–814.
  • [34] S. Changpinyo, M. Sandler, and A. Zhmoginov, “The power of sparsity in convolutional neural networks,” CoRR, vol. abs/1702.06257, 2017.
  • [35] X. Zhou, Z. Du, S. Zhang, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, “Addressing sparsity in deep neural networks,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. Early Access, pp. 1–1, 2018.
  • [36] L. Cavigelli and L. Benini, “Extended Bit-Plane Compression for Convolutional Neural Network Accelerators,” arXiv:1810.03979 [cs], Oct. 2018.
  • [37] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culurciello, “Hardware accelerated convolutional neural networks for synthetic vision systems,” in Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, May 2010, pp. 257–260.
  • [38] P. Merolla, J. V. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha, “A digital neurosynaptic core using embedded crossbar memory with 45pj per spike in 45nm.” in CICC, R. Patel, T. Andre, and A. Khan, Eds.   IEEE, 2011, pp. 1–4.
  • [39] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. Horowitz, “Convolution engine: balancing efficiency and flexibility in specialized computing.” Commun. ACM, vol. 58, no. 4, pp. 85–93, 2015.
  • [40] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, “Neurostream: Scalable and energy efficient deep learning with smart memory cubes.” CoRR, vol. abs/1701.06420, 2017.
  • [41] Y.-H. Chen, J. S. Emer, and V. Sze, “Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks.” in ISCA.   IEEE Computer Society, 2016, pp. 367–379.
  • [42] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, “Angel-eye: A complete design flow for mapping cnn onto embedded fpga.” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2018.
  • [43] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, “Diannao: a small-footprint high-throughput accelerator for ubiquitous machine-learning.” in ASPLOS, R. Balasubramonian, A. Davis, and S. V. Adve, Eds.   ACM, 2014, pp. 269–284.
  • [44] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “Dadiannao: A machine-learning supercomputer.” in MICRO.   IEEE, 2014, pp. 609–622.
  • [45] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, “Dlau: A scalable deep learning accelerator unit on fpga.” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 36, no. 3, pp. 513–517, 2017.
  • [46] V. Rana, I. Beretta, F. Bruschi, A. A. Nacci, D. Atienza, and D. Sciuto, “Efficient hardware design of iterative stencil loops.” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 35, no. 12, pp. 2018–2031, 2016.
  • [47] J. Cong, P. Li, B. Xiao, and P. Zhang, “An optimal microarchitecture for stencil computation acceleration based on nonuniform partitioning of data reuse buffers.” IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 35, no. 3, pp. 407–418, 2016.
  • [48] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, “Tetris: Scalable and efficient neural network acceleration with 3d memory.” in ASPLOS, Y. Chen, O. Temam, and J. Carter, Eds.   ACM, 2017, pp. 751–764.
  • [49] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong, “Polyhedral-based data reuse optimization for configurable computing.” in FPGA, B. L. Hutchings and V. Betz, Eds.   ACM, 2013, pp. 29–38.
  • [50] L. Benini, E. Flamand, D. Fuin, and D. Melpignano, “P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator,” in DATE, 2012, pp. 983–987.
  • [51] A. Rahimi, I. Loi, M. R. Kakoee, and L. Benini, “A fully-synthesizable single-cycle interconnection network for shared-l1 processor clusters,” in Design, Automation & Test in Europe Conference & Exhibition (DATE), 2011.   IEEE, 2011, pp. 1–6.
  • [52] F. Conti, A. Marongiu, C. Pilkington, and L. Benini, “He-P2012: Performance and Energy Exploration of Architecturally Heterogeneous Many-Cores.” Journal of Signal Processing Systems, vol. 85, no. 3, pp. 325–340, 2016.
  • [53] F. Conti, R. Schilling, P. D. Schiavone, A. Pullini, D. Rossi, F. K. Gürkaynak, M. Muehlberghuber, M. Gautschi, I. Loi, G. Haugou, S. Mangard, and L. Benini, “An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 64, no. 9, pp. 2481–2494, Sep. 2017.

Appendix A Peemen Equations for the CNN Convolution Loop-Nest

The buffering requirements for the three array references in the CNN convolution loop-nest are computed as shown in Equation 5. Let $T_m$ and $T_c$ denote the tile sizes along the output and input feature map dimensions, $T_x$ and $T_y$ the spatial tile sizes of the output feature map, $R$ the convolution kernel size, and $S$ the convolution stride:

(5)  $B_I = T_c \,\big((T_x - 1)S + R\big)\big((T_y - 1)S + R\big)$, $\quad B_W = T_m \, T_c \, R^2$, $\quad B_O = T_m \, T_x \, T_y$.

The terms $(T_x - 1)S + R$ and $(T_y - 1)S + R$ compute the dimensions, in pixels, of the input feature map tile, given the tile sizes of the output feature map, $T_x$ and $T_y$, the convolution kernel size, $R$, and the convolution stride, $S$.
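A small C helper makes this bookkeeping concrete; it simply evaluates the three expressions of Equation 5 for a given tiling (the struct and function names are illustrative):

  // Per-tile buffer sizes (in elements) for the CNN convolution loop-nest,
  // following Equation 5. Tm/Tc are the output/input feature map tile sizes,
  // Tx/Ty the spatial output tile sizes, R the kernel size, S the stride.
  typedef struct {
      unsigned b_in;   // input feature map tile
      unsigned b_w;    // weight tile
      unsigned b_out;  // output feature map tile
  } tile_buffers_t;

  static tile_buffers_t peemen_buffers(unsigned Tm, unsigned Tc,
                                       unsigned Tx, unsigned Ty,
                                       unsigned R, unsigned S)
  {
      tile_buffers_t b;
      b.b_in  = Tc * ((Tx - 1) * S + R) * ((Ty - 1) * S + R);
      b.b_w   = Tm * Tc * R * R;
      b.b_out = Tm * Tx * Ty;
      return b;
  }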

The total memory traffic is computed by multiplying the buffering requirements by the total number of tiles, with one refinement: Peemen observed that, with application-managed buffers, data can also be reused between consecutive innermost tile executions. Accounting for this reuse significantly improves the accuracy of the memory traffic estimate over the prologue, steady state, and epilogue of the computation. Equation 6 shows the general form of the memory traffic computation:

(6)

The four cases that need to be considered for the CNN loop-nest, each corresponding to one of the controlling loops LTSX, LTSY, LTIF, or LTOF being the innermost controlling loop, are shown in Equations 7, 8, 9, and 10 below.

  1. Innermost LTOF: Equation (7).

  2. Innermost LTIF: Equation (8).

  3. Innermost LTSY: Equation (9).

  4. Innermost LTSX: Equation (10).

In Equations 7–10, in order to account for the data reuse across consecutive innermost tile executions, the memory traffic computation is done as if the innermost controlling loop were not tiled. For example, with loop LTOF being the innermost controlling loop, the tile size of this loop is set to the total loop count, i.e., $T_m = M$, inside the general Equation 6.

Appendix B Configuration of the Used CNN Layers

Layer        Input fmaps   Output fmaps
AlexNet 1    3             96
AlexNet 2    96            256
AlexNet 3    256           384
AlexNet 4    384           384
AlexNet 5    384           256
ZFNet 1      3             96
ZFNet 3      256           384
ZFNet 4      384           384
ZFNet 5      384           256
ZFNet 6      256           256
VGG 1        3             64
VGG 2        64            64
VGG 3        64            128
VGG 4        128           128
VGG 5        128           256
VGG 6        256           256
VGG 8        512           256
VGG 9        512           512
VGG 11       512           512
TABLE IV: Configuration of CNN layers used in our evaluation.
#    Input fmaps   Output fmaps
0    192           64
     192           32
1    256           64
     256           64
2    288           64
     288           64
3    288           384
     288           64
4    798           192
     768           192
TABLE V: Configuration of Inception v3 used in our evaluation.
#    Input fmaps   Output fmaps
1    3             64
2
3
4
5
TABLE VI: Configuration of ResNet used in our evaluation.