I Introduction
Convolutional neural networks (CNNs) are widely used for solving artificial intelligence problems, such as object and voice recognition, scene labeling, and others [1]. Many research and development efforts have recently focused on domain-specific CNN accelerators in order to meet the high computational requirements of CNN inference with reasonable energy and cost efficiency [2].
One of the primary bottlenecks for implementing efficient and cost-effective embedded CNN accelerators for state-of-the-art deep convolutional networks is the memory system. The large volume of data accessed (weights) and produced (feature maps) during CNN inference makes it difficult to simultaneously buffer the input feature maps, the output feature maps, and the filter weights in the limited internal accelerator memory. One way to alleviate this issue is to use large SRAM buffers (up to a few MBytes may be used) in order to completely eliminate main memory traffic [3, 4]. When a massive accelerator area budget is available, this may be an acceptable approach. However, large amounts of memory are not affordable in deeply-embedded markets, such as mobile or IoT clients.
While completely absorbing all CNN memory traffic in internal accelerator storage is usually not possible, the memory bandwidth requirement for a given accelerator storage capacity can be significantly reduced if sufficient data reuse occurs. The data reuse pattern, in time and space, is determined by the dataflow schedule of the computation. Generally, the efficiency and performance impact of the dataflow schedule varies with CNN topology and size, making it difficult to adapt the accelerator architecture to different CNNs.
Existing work on scheduling CNN computations [5, 6, 7] is based on memory models originally developed for cache-based memory hierarchies [8]. Previously published models essentially search the tiling (or blocking) space of the CNN convolution loop-nest with the objective of identifying a set of innermost loops such that the working set of these innermost loops fits the available internal storage while the data transfers between the internal and external memories are minimized. However, existing models have not been adapted to explicitly application-managed buffers, which constitute by far the most common memory architecture template for CNN accelerators [9, 10, 11, 12, 13, 3, 14, 15, 16, 17]. In this case, published models overestimate the internal storage requirements of the CNN computation and result in suboptimal dataflow schedules.
In this paper, we provide three key contributions to the state-of-the-art. First, we propose a new analytical memory performance model to evaluate dataflow schedules in terms of their local memory requirements and overall external (off-accelerator) memory traffic. We compared the best dataflow schedules that can be identified with our model with the current state-of-the-art models [5], showing that they require 5-15% less memory traffic when applied to a number of state-of-the-art CNN topologies, enabling better usage of the available resources. Moreover, to validate our model, we applied it to the case study of the design of a flexible CNN accelerator for deeply-embedded Systems-on-Chip. Our accelerator architecture is based on a shared-memory cluster, similar to [18], enhanced with dedicated convolution hardware units (HWC).
We have used our analytical memory bandwidth model to perform architectural exploration of the HWC accelerator with the objective of finding a dataflow schedule that results in the best trade-off between the capacity of the hardware-unit storage and the traffic to shared memory. The dataflow schedule found using our methodology is non-trivial, and it achieves up to 10-14x traffic reduction with respect to a previously published accelerator based on a similar architecture [11], while assuming similar amounts of internal (in-accelerator) memory. In absolute terms, a complete HWC implemented from this dataflow schedule by means of high-level synthesis can sustain a throughput of up to 16 Multiply-Accumulate (MAC) operations per clock cycle using only 1 KB of internal storage and only 3 32-bit ports to the off-accelerator shared memory. Finally, we verify the accuracy of our memory model by comparing the predicted memory traffic with the real number of memory accesses measured on our accelerator implementation, showing a maximum deviation of 0.5% in a few corner cases.
The rest of the paper is organized as follows: Section II introduces the CNN convolution notation, the opportunities and challenges of application-managed buffers with respect to caches, and the CNN data locality optimization problem; Section III reviews previously published related work; Section IV presents our analytical memory performance model applied to the CNN convolution computation and compares it to previously published models; finally, Section V illustrates the application of this model to the development of a shared-memory CNN convolution hardware accelerator.
II Background and notation
In this paper we propose an analytical model for evaluating the storage requirements and memory bandwidth of a generic CNN convolutional layer, expressed as a 6-level loop-nest in the "canonical" form shown in Figure 2. Table I lists the symbols used in discussing the CNN convolutional layer and their meaning. In Figure 2, there are three arrays referenced in the loop-nest: one holds the input feature maps, one holds the convolution kernel weights, and one holds the output feature maps. (We use square feature maps to avoid overburdening the mathematical notation, although the model can easily be extended to general rectangular feature maps.)
Symbol  Description
H       Input feature map size (H x H)
E       Output feature map size (E x E)
R       Convolution kernel size (R x R)
S       Convolution kernel stride
C       Number of input feature maps
M       Number of output feature maps
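To make the canonical loop-nest of Figure 2 concrete, the following sketch spells out the six loops in plain Python. It is our own illustration, not the paper's notation: the function and variable names are invented, and a unit stride is used by default.

```python
def conv_layer(inp, weights, stride=1):
    """Canonical 6-loop CNN convolution (illustrative sketch).

    inp:     C x H x H nested lists   -- input feature maps
    weights: M x C x R x R nested lists -- convolution kernels
    Returns the M x E x E output feature maps,
    with E = (H - R) // stride + 1.
    """
    M = len(weights)
    C = len(inp)
    R = len(weights[0][0])
    H = len(inp[0])
    E = (H - R) // stride + 1
    out = [[[0.0] * E for _ in range(E)] for _ in range(M)]
    for of in range(M):                      # LOF: output feature maps
        for ci in range(C):                  # LIF: input feature maps
            for sy in range(E):              # LSY: output rows
                for sx in range(E):          # LSX: output columns
                    for fy in range(R):      # LFY: kernel rows
                        for fx in range(R):  # LFX: kernel columns
                            out[of][sy][sx] += (
                                weights[of][ci][fy][fx]
                                * inp[ci][sy * stride + fy][sx * stride + fx]
                            )
    return out
```

Every loop permutation and tiling discussed in the remainder of the paper is a reshaping of this reference nest that leaves the computed values unchanged.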
We consider the problem of optimizing the CNN convolutional layer memory access pattern for a two-level memory hierarchy. Figure 1 shows a generic view of an accelerator consisting of a computing datapath, a local reuse buffer optimized for this datapath, and off-accelerator memory external to the computing datapath. The problem consists of minimizing the number of accesses to the off-accelerator memory given a limited local buffer capacity.
II-A Data reuse in CNN convolutional layers
Data reuse occurs when a reference within a loop accesses the same data element in different iterations. The convolutional loop-nest shown in Figure 2 contains several opportunities for data reuse:

convolution reuse: each input feature map pixel is reused within each input feature map, once for each of the R x R kernel positions that cover it.

weight reuse: each kernel weight is reused E x E times, once per output pixel.

input fmap reuse: each input feature map pixel is reused across the M output feature map computations.

output fmap accumulation: each output feature map pixel is reused across the accumulation of partial results from the C input feature maps.
Carries?       LFX      LFY      LSX      LSY     LIF   LOF
Input fmaps   ✓ or ✗   ✗ or ✓   ✓ or ✗   ✗ or ✓    ✗     ✓
Weights          ✗        ✗        ✓        ✓       ✗     ✗
Output fmaps     ✓        ✓        ✗        ✗       ✓     ✗
We say that the reuse of a reference is carried by a loop if the same memory location is used by different iterations of that loop [19]. In Figure 2, reuse of the weight reference is carried by two loops, LSX and LSY; reuse of the output feature map reference is carried by the loops LFX, LFY, and LIF. The reuse of the input feature map reference is slightly more complex because it is carried by a combination of loops: which pair of loops between (LFX, LSX) or (LFY, LSY) carries the reuse of the input array depends on the relative ordering of the loops; the reuse is carried by the outer loop in each pair. The input reference reuse is also carried by loop LOF. Table II summarizes which loops carry the reuse of each array reference in the CNN loop-nest.
In order to take full advantage of the reuse, such that every data element is loaded from or stored to off-accelerator memory only once, a large local buffering capacity is necessary. In Figure 2, the entire set of input feature maps needs to be stored in the local buffer to reuse the input data across the loop LOF. To reuse the accumulated partial output feature maps across loop LIF, one full output feature map needs to be stored in the local buffer. To put this into perspective, the total amount of buffering required for the second convolution layer of AlexNet [20] exceeds 284 kB if we consider that each element of the input and output feature maps and of the kernel weights is 1 byte in size.
If no large local buffer is available, optimal data reuse cannot be achieved and some data must be accessed from the external level of the memory hierarchy multiple times. In Figure 2, if only a few lines of the input feature maps can be buffered locally, as in 2D-convolver line buffers [21], then every input feature map pixel needs to be reloaded once for each iteration of loop LOF. For the second convolution layer of AlexNet, this multiplies the required number of memory accesses to the input array by 256. The problem is that, although there is plenty of data reuse in the CNN loop-nest, unless the entire working set fits in the local buffer this reuse cannot be fully exploited. The remainder of this Section builds up the necessary background and notation to tackle the problem of maximizing data reuse in local memory in an analytical fashion.
II-B Local Reuse Buffers
A generic two-level memory hierarchy, such as that exemplified in Figure 1, exposes two memory levels: off-accelerator memory, where we assume all data involved in the CNN loop-nest is resident, and a local reuse buffer that usually cannot host the entirety of the data, but is vastly faster and more energy-efficient than off-accelerator memory. We assume the local reuse buffer is implemented either as a data cache or as a software-managed scratchpad memory. This buffer is used to host data that is reused multiple times. While data reuse is inherent in the computation and does not depend on a particular shape of the loop-nest, with a local reuse buffer of any kind the reuse only translates into a reduction of memory accesses if there is enough data locality, i.e. if data inside the local buffer are reused within a short period of time and are not replaced between reuse accesses.
II-B1 Data caches
Existing reuse evaluation methods derive from cache behavior and build a localized iteration space, i.e. a set of innermost loops of a loop-nest where data locality is exposed [8]. It is assumed that all array references inside the localized iteration space need to be simultaneously stored in the local reuse buffer [19, 9, 6, 7]. Indeed, in the context of a data cache, every array reference is mapped to a unique location in the cache. If the cache capacity is smaller than required for holding all array references, reused data can be displaced from the cache and are not guaranteed to remain in the cache in every iteration in which they are referenced. Thus, all data touched inside the localized iteration space need to be cached in order to benefit from the data reuse. For example, in order to reuse the kernel weights across the loop LSY in Figure 2, the localized iteration space has to include loops LFX, LFY, LSX, and LSY. The data cache would need to hold all input feature map elements touched inside these loops; otherwise, referencing the input array could displace some other reused data. The data cache would also need to hold the output feature map elements touched inside these loops; otherwise, referencing the output array may displace reused weights or input elements from the cache before they have been reused.
II-B2 Application-managed scratchpad memories
Dedicated hardware accelerators, including CNN accelerators, commonly use application-managed scratchpad memories as local reuse buffers instead of caches, as these are deemed to provide better performance, predictability, and energy efficiency [22, 2]. In this case, data placement, reuse, and transfer have to be managed explicitly by partitioning the local reuse buffer into a set of application-managed buffers, one per array referenced in the loop-nest. Application-managed buffers can be partitioned statically (i.e. implemented as physically separate memories) or dynamically (i.e. by partitioning a single memory so that all array references fit). Instead of a single localized iteration space, each array reference can have its own data locality scope. Thus, utilization of the local reuse buffer can be optimized by choosing, for each array reference, the nested loop level at which its data are buffered for reuse. We call this level the buffering level of the array reference.
The number of loop iterations that separates two consecutive accesses to the same data element is called the dependence distance of data reuse, or simply the reuse distance [8]. Only the elements of the array touched within the reuse-distance iterations need to be kept in the application-managed local buffer in order to ensure that reused data is preserved across loop iterations. For example, in Figure 2, the reuse distance of the input feature map reference across the loop LSY spans the iterations that touch a band of input rows while iterating over the LSX loop. Therefore, if an implementation chooses to buffer the input array at the LSY level, enabling data reuse across this loop, only that band of input elements needs to be kept in the local buffer. Similarly, buffering the output array at the LFY level, i.e. reusing output data across the two innermost loops, requires only a single element of the output array (the accumulator) to be buffered inside the local buffer.
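As a toy illustration of the reuse-distance notion (our own sketch, not part of the paper's model), the following code traces the weight accesses of a 1-D convolution with the kernel loop innermost and measures the gap, in loop iterations, between consecutive accesses to each weight:

```python
from collections import defaultdict

def reuse_distances(access_trace):
    """Given a sequence of accessed element ids (one per loop iteration),
    return for each element the maximum iteration gap between two
    consecutive accesses -- its reuse distance in this schedule."""
    last_seen = {}
    dist = defaultdict(int)
    for t, elem in enumerate(access_trace):
        if elem in last_seen:
            dist[elem] = max(dist[elem], t - last_seen[elem])
        last_seen[elem] = t
    return dict(dist)

# Toy 1-D convolution: kernel size R = 3, output size E = 4.
# Kernel loop innermost, spatial loop outermost; trace the weight index:
R, E = 3, 4
trace = [fx for sx in range(E) for fx in range(R)]
d = reuse_distances(trace)
# every weight is re-touched once per spatial iteration, i.e. a reuse
# distance of R iterations -- a buffer of R weights preserves the reuse
```

Swapping the two loops in the trace changes the measured distances, which is exactly why the dataflow schedule determines the buffering requirement.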
II-C Data Locality Optimization
Memory performance optimization methods, such as [19] or [9], use reordering and tiling of the loop-nest to maximize data locality. Loop reordering places the subset of loops that form the localized iteration space at the innermost position. Loop tiling partitions the loop-nest iteration space into a number of smaller blocks, such that data used inside the localized iteration space stays in the local buffer until it is reused. Figure 3 shows the general form of the tiled convolution loop-nest, where different choices of tile sizes lead to different data locality results.
With caches, the order in which the loops of the localized iteration space execute is not important because the ability to reuse data depends solely on the total number of data elements loaded between reuses, i.e. only on the size of the localized iteration space. Conversely, with application-managed buffers, the order of loop execution has a substantial effect on data locality. To see this, consider reusing the partial output accumulations across the LIF loop in Figure 3. Let n be the number of output elements touched across the iterations of the tiled spatial loops LSX and LSY. If LIF is the innermost controlling loop, n elements need to be buffered in the local buffer to ensure the reuse of the output across the loop LIF. Consider now reordering loops LIF and LOF. The loop LOF then iterates between two consecutive LIF iterations, touching a different output tile in each of its iterations; therefore, the elements touched by all iterations of LOF need to be buffered in order to preserve the reuse of the output across the loop LIF, and a local buffer many times larger is necessary.
Generally, with limited buffering capacity it is not possible to fully exploit the data reuse of all array references at the same time. Different choices of loop order, tile sizes, and buffering levels lead to dataflow schedules with different reuse characteristics. In the following Sections, we first review the related state-of-the-art, with previously proposed models for evaluating and comparing dataflow schedules; we then introduce our proposed analytical memory performance model for evaluating this trade-off in application to CNN convolution loop-nests.
III Related Work
The main difficulty in implementing cost-effective embedded CNN accelerators is optimizing the memory organization to minimize off-chip memory transfers and to perform them as efficiently as possible.
One research direction that aims at alleviating the CNN memory bottleneck is data compression. Various techniques have been proposed in the literature: quantization [23, 24, 25], binarization [26, 27, 28, 29, 30], and data compression [31]. A survey on CNN data compression can be found in [32]. Our work is orthogonal to these techniques and can be used on top of them. Indeed, in our implementation we have used a dynamic fixed-point data quantization technique from [24]. We shall also point out that sparse neural networks have recently emerged as one solution to reduce the amount of computation and memory required for CNN processing [33, 34, 35, 36]. This approach is beyond the scope of our work.

The straightforward approach to scheduling CNN computations is to use the 2D convolution along with line buffers for data reuse [21, 37, 38, 10, 11, 39, 40]. Although simple to implement, such accelerators are often limited to a particular size of convolution kernels and input images. Additionally, they can exploit only a fraction of the data reuse that exists inside the CNN convolution computations.
Several architectures attempt to overcome the 2D-convolution limitations by scheduling the CNN convolution dataflow differently, with the dataflow schedules being the result of ad-hoc, empirical exploration. For example, the Eyeriss accelerator [41, 3] proposed the row-stationary dataflow, which schedules the CNN convolution on a 2D array of processing elements and optimizes data reuse by exploiting the low-cost memory levels, the PE scratchpads and inter-PE communication, while minimizing data accesses to the high-cost levels, including the large on-chip global buffer and the off-chip DRAM. Another CNN accelerator, Angel-Eye [42], is an FPGA accelerator that combines multiple parallel 2D convolvers such that several partial sums are accumulated simultaneously. The DianNao architecture [43] determined its dataflow schedule and buffer sizes experimentally. The authors acknowledged that, like many processing architectures, DianNao's efficiency and scalability remained severely limited by memory bandwidth constraints. DianNao's successor, the ShiDianNao accelerator [4], was directly integrated with an image sensor, thereby fully eliminating main memory; this approach is not scalable, as only a few small CNNs can be accommodated. Another successor, DaDianNao [44], employed a sufficiently large on-chip eDRAM to store large CNNs close to the datapath. The DLAU architecture [45] utilizes tiling techniques and minimizes main memory transfers by accumulating a few partial results in internal buffers.
To solve the same problem in a more formal way, several publications proposed analytical memory performance models. For example, memory optimization models for stencil computational kernels were published in [46] and in [47]. Stencils differ from CNN convolutional layers in that they do not need to handle a large amount of convolution kernel weights; therefore, these models cannot be used to optimize the CNN computation. TETRIS [48] analytically derived optimal dataflow schedules for the CNN convolution by simplifying the problem. It proposes bypass ordering, where the internal on-chip storage is bypassed for two of the three data streams in the convolution layer, using a large register file for buffering the two bypassed streams instead. Bypass ordering is significantly simpler than the general computation-scheduling problem, and it is possible to analytically derive the optimal loop-nest shape without resorting to an exhaustive search of the solution space. However, bypass ordering relies on a particular architecture and requires a large local register file and buffer.
M. Wolf et al. [8] used loop blocking and reordering techniques to capture data locality and reduce memory traffic in scientific computations. They used a combination of loop interchange, skewing, and reversal (unimodular transformations) with loop tiling (blocking) to improve the data locality of loop-nests in the context of memory hierarchies with caches. Such a model is inaccurate in the context of memory hierarchies with application-managed buffers. As a result, Wolf's method overestimates the real buffering requirements and required memory bandwidth of a computation and leads to suboptimal dataflow schedules.

Peemen et al. [5, 9] proposed an architecture model where a computation loop-nest is split into two parts: an innermost tile (similar to M. Wolf's localized iteration space) for execution on the accelerator, and outer controlling loops that run on a host processor. Peemen's approach improves on M. Wolf's cache model significantly by taking into account that, with application-managed buffers, some data can be reused between consecutive executions of the innermost tiles. They proposed a design flow for selecting the best dataflow schedule to maximize data reuse given a buffer size restriction. This is achieved by tiling the CNN convolution loop-nest and reordering the controlling loops. However, as explained in Section IV, the buffer estimation remains inaccurate and the solution is suboptimal.
Zhang et al. [6] proposed an analytical approach for analyzing the computing throughput and required memory bandwidth of a CNN design on an FPGA platform. Similar to Peemen's method, the CNN loop-nest is tiled, with the innermost tiled loops executing on the FPGA while the controlling loops execute on the host. Data need to be loaded into (and read from) the FPGA internal memory for each execution of the innermost tile. In order to reduce main memory traffic they use local memory promotion [49] to place out-of-FPGA communication operations optimally. Local memory promotion allows moving the transfer of an array out of the innermost controlling loop when this loop carries full reuse of that array, i.e. when the loop's iterator does not appear in any reference to the array. Similarly to Peemen's method, Zhang et al. explore 4 possibilities for the 4 different innermost controlling loops in the convolution loop-nest. The resulting dataflow schedules are less efficient than in Peemen's approach because only a subset of the array references benefit from data reuse across consecutive executions of the innermost tile.
Yang et al. [7] published a method for convolution loop-nest tiling in multi-level memory hierarchies, focusing on improving the total energy consumption of such systems. In Yang's work, one loop blocking is performed for each target memory hierarchy level, building one localized iteration space per memory level. Blocking the convolution loop-nest at each level can be thought of as tiling the loops in the loop-nest and then exchanging the order in which the controlling loops are executed. Yang et al. acknowledge that optimal application of their algorithm to multiple memory hierarchy levels is computationally quite costly. In order to achieve a reasonable computation time, instead of optimal multi-level blocking, the authors propose to apply 2-level blocking repeatedly, from the lowest memory hierarchy level to the uppermost, adjusting the lower-level results at each new level. However, since for the 2-level blocking the memory-level buffering requirements are estimated as the sum of all data elements touched inside the level's localized iteration space, this approach results in dataflow schedules of essentially the same quality as Peemen's method.
Overall, being derived from cache behavior, the above memory performance models assume that the local reuse buffer must be dimensioned to simultaneously hold all data elements of the localized iteration space. Our method is based on the observation that, under application control, different data may be buffered at different loop-nest levels, i.e. there is no single localized iteration space. As a result, our memory performance model produces a more accurate buffer size estimation for application-managed buffers and better dataflow schedules.
IV Memory Performance Model
Given a limited local reuse buffer capacity, memory performance optimization consists in finding a dataflow schedule, i.e. a computation order, such that i) the working set of the computation fits in the local reuse buffer, and ii) the traffic to off-accelerator memory is minimized, thereby reducing the energy and time dedicated to data movement. Using the notation introduced in Section II, building a dataflow schedule therefore involves specifying the loop-nest shape via loop tiling and reordering, and choosing a buffering level for each array reference.
In this section, we develop an analytical model for evaluating dataflow schedules that is more accurate than previously published models in the case of application-managed buffers. Given a dataflow schedule, our analytical model computes the local buffer size and the number of bytes accessed from the off-accelerator memory required for the schedule's execution. Note that, though we develop our model in the context of the CNN convolution computation, the model itself is generic and applicable to other types of computation organized in loop-nests.
IV-A Local reuse buffer size
Reuse distance   LFX     LFY     LSX     LSY    LIF   LOF
Input fmaps     1 or R  R or 1  1 or R  R or 1    1     M
Weights            1       1       E       E      1     1
Output fmaps       R       R       1       1      C     1
Let us first introduce the concept of footprint, necessary to build the analytical model for a loop-nest. As explained earlier, the reuse distance of an array reference with respect to a loop corresponds to the number of iterations during which the corresponding array element is used by the computation. Table III lists the reuse distances of the CNN convolution array references with respect to all loops in the loop-nest. The footprint of an array inside a loop is the portion of the array that is touched while iterating over the loop. Thus, the footprint of array A in loop L measures the number of distinct elements of A used inside L. Assume L_1, ..., L_n are the loops in the current dataflow schedule, ordered from the innermost to the outermost one. If we call N_i the number of iterations of loop L_i, and D_{A,i} the reuse distance of the reference to array A with respect to loop L_i, the footprint F_{A,i} of array A in loop L_i can be computed as follows:

F_{A,i} = F_{A,i-1} · ⌈ N_i / D_{A,i} ⌉    (1)

with F_{A,0} = 1. Intuitively, this means that the footprint of an array over a given loop is the footprint of the same array over one loop iteration, multiplied by the number of times that elements of this array have to be replaced in the local buffer during the N_i iterations of L_i.
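Under our reading of this recurrence (a footprint of one element at the innermost level and ceiling division by the reuse distance), the footprints at every loop level can be computed mechanically; the function below is our own sketch, not the paper's reference implementation.

```python
from math import ceil

def footprints(num_iters, reuse_dists):
    """Footprint of one array at every loop level, innermost first.

    num_iters[i]   -- iteration count N_i of loop L_i
    reuse_dists[i] -- reuse distance D_i of the array w.r.t. loop L_i
    Returns F[i], the number of distinct elements touched inside L_i,
    computed as F_0 = 1 and F_i = F_{i-1} * ceil(N_i / D_i).
    """
    f = 1
    out = []
    for n, d in zip(num_iters, reuse_dists):
        f *= ceil(n / d)  # one replacement factor per loop level
        out.append(f)
    return out

# Toy 1-D weight array: kernel loop (N=3, D=1: a new weight each
# iteration) inside a spatial loop (N=4, D=4: weights live across all
# 4 iterations) -> footprint stays at 3 weights at the outer level.
```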
The footprint takes into account any reuse of data elements that exists in a loop. Thus, in order for the elements of array A to be reused across iterations of a loop L_i, the local buffer must be big enough to hold the full footprint of one iteration of the loop. Furthermore, if the loop does not carry reuse of the array, the application-managed buffer can be shared by data elements from multiple loop iterations. This means that the actual required size B_{A,i} of the application-managed buffer for array A buffered at loop level L_i is computed as follows:

B_{A,i} = D_{A,i} · F_{A,i-1}    (2)

where F_{A,i-1} is the footprint of A over one iteration of L_i and D_{A,i} is the reuse distance of A with respect to L_i.
Given a reordered and tiled loop-nest, Equation 2 allows us to recursively compute the local buffer requirements for all array references, starting from the innermost loop in the loop-nest. Therefore, it allows us to evaluate the buffering requirements of a dataflow schedule, given its loop-nest shape with buffering levels annotated for all data references.
IV-B Off-accelerator memory traffic
Equation 2 allows us to evaluate which dataflow schedules are feasible given a certain buffer size, as well as to compare them based on the minimum local buffer size they require, but it gives no indication of a schedule's quality in terms of off-accelerator memory traffic. Let us call T_A the memory traffic, computed as the number of bytes accessed in off-accelerator memory for array A, and p_A the numerical precision (in bytes) used for its storage. If A is buffered at level L_b, then the number of memory accesses to A is given by the footprint of A with respect to L_b multiplied by the total number of times that loop L_b is executed:

T_A = p_A · F_{A,b} · ∏_{j=b+1..n} N_j    (3)
Note that T_A does not depend explicitly on the size of the local buffer, but only on the dataflow schedule, through the footprint and the iteration counts of the outermost loops. Whether the schedule fits a given local buffer capacity depends only on Equation 2.
Memory accesses to the output array constitute a special case, as they include two distinct contributions: writes of the final, fully computed output feature maps, and memory accesses corresponding to the accumulation of intermediate partial results (each accumulation comprising two accesses, 1 write + 1 read). Storage of the accumulated partial results typically uses a different (higher) numerical precision than that used for the final output array. The total traffic is therefore the sum of the traffic of the three arrays, inputs (I), weights (W), and outputs (O):

T = T_I + T_W + T_O    (4)
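Following the prose description above, the per-array traffic is a footprint multiplied by the trip counts of the loops outside the buffering level. The sketch below is our own illustration; the concrete numbers in the usage comment are invented, not taken from the paper.

```python
from math import prod

def array_traffic(precision_bytes, footprint_at_level, outer_iters):
    """Off-accelerator traffic for one array: the footprint held at the
    buffering level is (re)transferred once per execution of every loop
    outside that level (our reading of the model)."""
    return precision_bytes * footprint_at_level * prod(outer_iters)

# Hypothetical example: a 1024-element footprint of 1-byte data,
# buffered under a level whose two outer loops run 4 and 8 times.
# Each of the 32 outer iterations reloads the footprint once:
traffic = array_traffic(1, 1024, [4, 8])
```

Summing such terms for the input, weight, and output arrays (with the output term counting both final writes and partial-sum read+write accumulations) yields the total traffic of a schedule.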
Equations 3 and 4 enable a quantitative comparison of dataflow schedules in terms of memory traffic, which is known to be strongly correlated with energy consumption and, of course, with system cost [2, 30].
IV-C Dataflow schedule selection procedure
We conduct the CNN convolution design space exploration in two steps. In the first step, we compute the local buffering requirements for each array reference at different loop levels across an enumeration of different loop orders and tile sizes. This step is independent of a particular CNN layer shape because the local buffering requirements at any loop-nest level depend only on the loop order and the tile sizes. In the second step, using these pre-enumerated buffer requirements, we analyze a particular CNN layer, exhaustively searching for the best combination of buffering levels for the three CNN arrays under different local buffer capacities.
The first step requires the enumeration of 6! = 720 loop-nest permutations. However, we can reduce this number by not considering permutations between the two kernel loops (LFX and LFY in Figure 3) or between the two image loops (LSX and LSY in Figure 3); these permutations can be omitted without affecting our conclusions because they result in symmetric dataflow schedules. To reduce the enumeration size further, tile sizes are enumerated selectively. We want to quantify how different loop orderings and tile sizes affect the number of memory accesses; for this it is not necessary to enumerate all possible tile sizes. Instead, we examine a sequence of monotonically increasing, power-of-two tile sizes, as well as tile sizes that correspond to common CNN layer configurations. The first step results in the buffering requirements of each CNN array reference at each loop level for different loop-nest shapes.
With the above search-space reduction, the first exploration step yields 180 possible convolution loop-nest permutations with multiple tiling shapes each. In the second step, we search within this pre-enumerated space for dataflow schedules that fit local reuse buffer capacities between 1 KB and 512 KB while minimizing the required off-accelerator memory bandwidth. This search results in one best dataflow schedule, i.e. loop-nest order, tile sizes, and buffering levels for the CNN memory references, for each evaluated convolution layer and for each buffer capacity.
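The two-step procedure can be sketched as a brute-force search over loop orders and tile shapes. In this illustrative sketch (our own, not the paper's implementation), `buffer_need` and `traffic` stand in for the model's cost functions and are supplied by the caller; the toy callables in the test usage are invented for illustration.

```python
from itertools import permutations, product

def explore(loops, tile_grid, buffer_capacity, buffer_need, traffic):
    """Exhaustive dataflow-schedule search (sketch).

    loops          -- loop labels, e.g. ["LFX","LFY","LSX","LSY","LIF","LOF"]
    tile_grid      -- candidate tile sizes, e.g. powers of two [1,2,4,...]
    buffer_need(s) -- callable: schedule -> required local buffer bytes
    traffic(s)     -- callable: schedule -> off-accelerator bytes
    A schedule s is a (loop_order, tile_sizes) pair. Returns the feasible
    schedule minimizing traffic, together with its traffic.
    """
    best, best_cost = None, float("inf")
    for order in permutations(loops):
        for tiles in product(tile_grid, repeat=len(loops)):
            sched = (order, tiles)
            if buffer_need(sched) > buffer_capacity:
                continue  # schedule does not fit the local buffer
            cost = traffic(sched)
            if cost < best_cost:
                best, best_cost = sched, cost
    return best, best_cost
```

In practice the symmetric kernel-loop and image-loop permutations described above would be pruned before this loop, and the buffer and traffic callables would implement the model's equations.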
IV-D Comparison vs existing models
The main use of the memory performance model is to evaluate the quality of dataflow schedules under a set of local buffering constraints, in order to determine which schedule minimizes memory traffic. Therefore, to compare our model with the current state-of-the-art, we first investigated whether it can identify better dataflow schedules than those extracted by previously published models [8, 5, 6, 7]. We selected two representative models, cache and Peemen, to compare with our own proposed model. As with our model, the selection of the best dataflow schedules involves an exhaustive search over a large solution space, considering all possible loop tile sizes and loop orderings.
IV-D1 Cache
M. Wolf et al. [8] were the first to apply loop-nest reordering and tiling to find the localized iteration space. The method conservatively assumes that each data element in the localized iteration space working set needs to be allocated a place in the cache, thus overestimating the required memory footprint. It also assumes that the entire working set must be transferred between memory and the cache for each new execution of the localized iteration space, because it cannot be guaranteed that data from a previous execution is still present in the cache. The original paper [8] also proposed a heuristic for trimming the number of tiling possibilities, guided by cache behavior in scientific computations.
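Under these conservative assumptions, the cache-model estimate can be sketched as follows (our simplified rendering, not Wolf and Lam's exact formulation; all parameter names are illustrative):

```python
import math

# Conservative cache-model estimate: every element of a tile's working set
# occupies cache space, and the whole working set is re-transferred for
# each tile execution. The layer has n_of/n_if output/input feature maps,
# sx*sy output pixels, fx*fy kernels with the given stride; t_* are tile
# sizes along the corresponding dimensions.
def cache_model_traffic(n_of, n_if, sx, sy, fx, fy, stride,
                        t_of, t_if, t_ox, t_oy):
    t_ix = (t_ox - 1) * stride + fx          # input tile width
    t_iy = (t_oy - 1) * stride + fy          # input tile height
    working_set = (t_if * t_ix * t_iy        # input tile
                   + t_of * t_if * fx * fy   # weight tile
                   + t_of * t_ox * t_oy)     # output tile
    n_tiles = (math.ceil(n_of / t_of) * math.ceil(n_if / t_if)
               * math.ceil(sx / t_ox) * math.ceil(sy / t_oy))
    # Every tile execution conservatively reloads its full working set.
    return working_set * n_tiles
```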
IV-D2 Peemen
The memory performance model proposed by Peemen et al. [5], and independently by Zhang et al. [6], improves significantly on the original cache model by taking into account that, with application-managed buffers, some data can be reused between consecutive executions of the localized iteration space, which they call the innermost tile. Different-quality solutions are obtained by changing the order of the controlling loops. Yang et al. [7] published a similar model, extended to optimal CNN loop-nest tiling for multiple levels of memory hierarchy. As explained in Section III, the models developed by Zhang et al. and by Yang et al. yield dataflow schedules essentially similar to those generated by the Peemen et al. method; therefore, we implemented the latter as a representative of the three approaches. (Appendix A describes Peemen's memory performance model as derived for the loop-nest of Figure 3; for a detailed explanation of these formulas, the reader is referred to [5].)
In all these methods, the ordering of the loops inside the innermost tile is unimportant, because the ability to reuse data depends solely on the total amount of data loaded into the local reuse buffer between reuses. In Peemen's and Zhang's models, only the innermost controlling loop affects data reuse across consecutive iterations of the innermost tile. Therefore, fewer loop-nest permutations need to be explored, considerably reducing the solution search space.
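The inter-tile reuse that these models capture can be illustrated in one dimension (an assumed simplification of ours, not the full model): when the innermost controlling loop walks over output rows, consecutive input tiles overlap, and only the new rows need loading.

```python
# 1-D illustration of inter-tile input reuse: consecutive tiles along the
# innermost controlling loop overlap, so only t_oy * stride new input rows
# are loaded per tile after the first.
def rows_loaded_with_reuse(n_oy, t_oy, fy, stride):
    t_iy = (t_oy - 1) * stride + fy   # input rows covered by one tile
    step = t_oy * stride              # new input rows per subsequent tile
    n_tiles = -(-n_oy // t_oy)        # ceiling division
    return t_iy + (n_tiles - 1) * step

def rows_loaded_without_reuse(n_oy, t_oy, fy, stride):
    t_iy = (t_oy - 1) * stride + fy
    n_tiles = -(-n_oy // t_oy)
    return n_tiles * t_iy
```

For 8 output rows, 2-row tiles, a 3-row kernel, and stride 1, exploiting inter-tile reuse cuts the loaded input rows from 16 to 10, which is exactly the input height.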
We have compared the dataflow schedules generated by our model with those computed by the previously published models [8], [5], [6], and [7] over the convolutional layers of five state-of-the-art CNN topologies: AlexNet, ZFNet, VGG16, Inception v3, and ResNet20. (The configurations of the CNN layers used in our evaluations are listed in Appendix B.) Although the following comparison is based on generated dataflow schedules, we show in Section V that our method is exact with respect to real execution; this estimation therefore corresponds to the actual amount of memory traffic generated by these CNN layers.
Figure 4 plots the total number of data transfers to and from off-accelerator memory using the best dataflow schedules estimated by the three models, while sweeping the size of the local reuse buffer from 1 to 256 KB. We aggregate the memory traffic from the best dataflow schedules identified for each layer, and we also show the ideal result given by the "essential" memory traffic that remains when all data reuse is exploited. The plot makes clear that the cache model largely overestimates the memory traffic requirements compared to the two application-managed buffer models, by a factor of up to 3.5, and always results in suboptimal dataflow schedules, for every buffer size and every CNN convolution layer that we tested.
The advantage of our model over Peemen's method is more subtle, as both exploit the characteristics of application-managed buffers to yield a better schedule. To analyze the difference between the two models, Figure 5 plots the bandwidth requirements estimated by Peemen's model as a percentage overhead relative to our model, for local reuse buffer capacities between 1 KB and 256 KB and the same CNN convolution layers. Our model finds dataflow schedules with between 2.5% and 17.5% lower memory traffic, exceeding a 10% reduction for several CNNs, especially when targeting smaller local reuse buffer sizes. Even for relatively large local buffer sizes of 128 KB and 256 KB, our method produces dataflow schedules with more than 5% memory traffic reduction over the set of convolution layers for several CNN networks.
For most of the evaluated CNN convolution layers, our method reduces memory traffic thanks to its ability to exploit data reuse across all levels of the CNN convolution loop-nest. With rare exceptions, Peemen's method is able to find similar schedules only when a full volume of the input or output feature maps can be stored in the local reuse buffer. We have identified the following points that contribute to these results:

- Our method's footprint calculation accounts for data reuse more accurately because it considers independent buffering for the I, W, and O arrays. As a result, our buffer requirements are systematically lower for a given loop-nest shape, leaving room for bigger tiles in the local reuse buffers.
- Due to the independent buffering of the three arrays, our method always places memory transfers optimally with respect to the total memory traffic.
- In Peemen's method, unless LTIF is the innermost controlling loop, the memory traffic for the output array is multiplied by 2 to account for one read and one write of the partial accumulations. This happens even when the LIF loop is not tiled.
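The doubling penalty for partial accumulations described in the last point can be sketched as follows (a simplified count of our own, not Peemen's exact formula):

```python
# If the output tile is evicted before accumulation over the input feature
# maps completes, each output element is written once per input-tile pass
# and read back once per pass except the first.
def output_traffic(n_out_elems, n_if, t_if, accumulate_locally):
    n_if_tiles = -(-n_if // t_if)     # ceiling division
    if accumulate_locally or n_if_tiles == 1:
        return n_out_elems            # a single final write per element
    return n_out_elems * (2 * n_if_tiles - 1)
```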
Moreover, for a given CNN convolution layer, the memory traffic overhead of Peemen's method does not necessarily decrease when the local reuse buffer capacity is increased. Increasing the local buffer capacity primarily allows larger tiles to be generated. However, at some buffer capacity points, our method finds a completely different loop ordering and different buffering levels for the data references, achieving a larger memory traffic reduction than simply increasing the tile sizes.
It is worth noting that Yang's algorithm can be modified to achieve dataflow schedules of the same quality as our approach. It can be verified that applying a 4-level blocking at each memory hierarchy level, essentially enumerating the permutations of the four loops LSX, LSY, LIF, and LOF from Figure 3, would lead to schedules equivalent to our method. However, such 4-level blocking is computationally expensive: Yang et al. reported that 4-level blocking of a single CNN layer takes 24 hours on a Xeon E5645 processor. Our method is significantly faster: the first step described in Section IV-C is performed only once for many different CNN layers, whereas in Yang's method a solution search must be performed for each convolution layer individually. Even with a densely sampled tiling space, for example exploring all tile sizes that are multiples of 2, our first step takes 2 minutes per layer on an Intel i7 processor running at 3.4 GHz.
V Case Study: ASMP Cluster Accelerator
As a case study for our proposed model, in this section we illustrate how it can be used for a practical implementation of a low-cost CNN accelerator. Specifically, we (1) show how our model can be used to derive a dataflow schedule for a specialized hardware block dedicated to processing CNN convolution layers, (2) show that this dataflow schedule is implementable, and (3) evaluate its efficiency in terms of memory bandwidth utilization and accuracy.
V-A Target Architecture
As a target architectural template, we chose to integrate special-purpose convolution hardware blocks (HWCs) inside STMicroelectronics' ASMP tightly-coupled shared-memory cluster [50]. Figure 6 shows the block diagram of this architectural template. An ASMP cluster is composed of a number of programmable cores, HWCs, and a DMA, all connected to a shared Tightly-Coupled Data Memory (TCDM) via a single-cycle logarithmic interconnect [51]. The shared memory enables efficient data exchange between the convolutional units and the programmable processors, achieving a high degree of flexibility by computing non-convolution functions, such as pooling and normalization, in software. Additionally, such a shared-memory cluster can efficiently support traditional computer vision algorithms (ORB, HOG, etc.), because many image processing algorithms are essentially based on convolutional operations.
The current ASMP cluster implementation targets mobile image processing applications and includes up to 16 RISC processor cores and HWC blocks running at moderate frequency (500 MHz), and up to 256 KB of TCDM memory. The interconnect can be designed with 32-bit or 64-bit TCDM access width, with a maximum peak bandwidth of 64 GB per second. The key element of the ASMP cluster is the logarithmic interconnect, which allows multiple concurrent accesses to the multi-bank TCDM memory. The logarithmic interconnect provides a common infrastructure for core-to-core, core-to-hardware-block, and hardware-block-to-hardware-block communication.
The specialized convolution hardware blocks, the HWCs, are essential for achieving the required performance while keeping cost and power consumption low. For designing the HWC, we leveraged a shared-memory dedicated-hardware-block methodology similar to the one described in [52]. The tight coupling of the HWC to shared memory permits complex memory access patterns, such as sliding kernels and repeated re-fetching of data. This flexibility enables efficient implementation of a wide variety of dataflow schedules. We chose to implement a Single Instruction Multiple Data (SIMD) datapath, as SIMD processing ensures steady datapath utilization independent of the convolution kernel size. Furthermore, we opted for a relatively narrow 16-byte SIMD width, so that HWC efficiency is maintained even for smaller images.
V-B HWC Design Space Exploration
The HWC is dedicated to processing the CNN convolutional layer, which accounts for most of the computational work and most of the partial-results data bandwidth in existing CNNs [2]. A critical point in the design of HWCs, as for most accelerators, is which data are internalized in a local buffer and which are accessed from outside the accelerator, in this case from the cluster shared TCDM. This problem maps readily to the conceptual view of our model (Figure 1), where the internalized memories constitute the application-managed local reuse buffer and the TCDM is the off-accelerator memory. It is therefore straightforward to apply the memory performance model presented in Section IV as a design space exploration tool, with the objective of finding the best trade-off between HWC local storage capacity and required TCDM bandwidth.
Local storage and TCDM bandwidth are generally conflicting objectives. On the one hand, minimizing local storage capacity is important to reduce the area, and therefore the cost, of the HWC IP. Our shared-cluster platform includes several HWCs and a TCDM memory for buffering data on-chip; making the total HWC internal storage capacity close to the TCDM capacity would make the TCDM redundant, as the access energy for local memory would be comparable to that of a TCDM access. On the other hand, the bandwidth to cluster shared memory is a scarce resource because multiple actors in the system access it simultaneously. Without any local storage at all, every data access from the HWC would go to the TCDM memory; the resulting bandwidth requirement would exceed the local interconnect capacity, leading to a drop in performance and high energy consumption. Furthermore, accessing the cluster shared memory costs more energy than accessing internal HWC storage [30], and the number of ports connecting the HWCs to the TCDM has a significant impact on its size and on the maximum operating frequency of the cluster; minimizing TCDM bandwidth requirements is therefore also important.
In general, the trade-off between local storage and memory bandwidth depends on the shape of the specific CNN layer: convolution kernel size, number and size of the feature maps, and feature map and kernel numeric precision. Therefore, for a general-purpose HWC we want to build a CNN loop-nest dataflow schedule that, given a local reuse buffer capacity on the order of a few KB, minimizes the TCDM memory traffic, and therefore bandwidth, across a large number of CNN layers taken from different CNNs. We conducted this design space exploration in the two steps described in Section IV. We analyzed 71 representative CNN layers chosen from the AlexNet, ZFNet, VGG, Inception v3, and ResNet topologies, exhaustively searching for a best dataflow schedule for each of the 180 loop-nest permutations under different local buffer capacities.
Figure 7 shows the distribution of dataflow schedule quality across the 180 loop-nest permutations obtained under different local memory constraints, binned according to the amount of memory traffic they generate. The Y-axis shows, for each local buffer capacity, the percentage of loop-nest permutations that achieve optimal traffic, or add up to 10%, 20%, etc. of overhead over the optimal traffic, or exceed twice the optimal traffic. For small local buffer capacities, fewer than 20% of the loop-nest permutations achieve optimal bandwidth; with a large local buffer, 50% of the permutations can be tiled such that optimal bandwidth is achieved.
By analyzing the best dataflow schedules obtained in this experiment, we confirmed the intuition that schedules that allow the output feature maps to be fully accumulated locally, by buffering the partial sums, tend to perform best, especially when the local buffer capacity is small. Although these schedules read the input feature maps and weights from memory multiple times, they still generate fewer total memory accesses than schedules in which the output feature maps are swapped out to memory before being fully accumulated across all input feature maps. Swapping out and re-fetching the output feature maps to complete the accumulation generates twice the traffic of the read-only input feature maps and weights; furthermore, the partially accumulated output feature maps require higher precision and are therefore more costly in terms of bandwidth. Interestingly, given less than 512 KB of buffering capacity, no single permutation resulted in a dataflow schedule with optimal bandwidth across the entire set of tested convolution layers.
Among several small-footprint schedules, we chose one that, for most tested CNN layers, leads to minimal memory bandwidth requirements for local buffer sizes from 1 KB to 4 KB. Our selection was also guided by several hardware implementation criteria, such as the number of required simultaneous local buffer accesses, access alignment, and compatibility with a SIMD datapath. Figure 8 shows the dataflow schedule chosen for the HWC implementation, with the buffering level for each array shown as a comment above the corresponding loop. The HWC main loop executes an innermost tile in the order LIF, LSY, LFY, LOF, LSX, LFX. The relative order of loops LSX, LFX, and LFY ensures that the partial-sum accumulation remains internal to the HWC as much as possible. In the actual implementation, the tiling factor for the LIF loop is fixed and equals the SIMD datapath width. The remaining tile dimensions, i.e., the number of output feature maps, input feature maps, and output lines in a tile, are computed for each particular convolution layer, also using our performance model. In practice, over all tested CNN layers, the input feature map volume was never tiled, allowing complete accumulation of the partial sums inside the HWC local buffer.
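The chosen innermost-tile ordering can be expressed as a reference loop nest (a functional sketch in Python; array layouts and parameter names are ours, not the hardware's, and unit stride is assumed):

```python
# Innermost-tile loop order chosen for the HWC:
# LIF - LSY - LFY - LOF - LSX - LFX.
# inp[i][y][x] holds input maps, wgt[o][i][ky][kx] holds kernels.
def conv_tile(inp, wgt, t_if, t_of, sx, sy, fx, fy):
    out = [[[0.0] * sx for _ in range(sy)] for _ in range(t_of)]
    for i in range(t_if):                     # LIF: input feature maps
        for y in range(sy):                   # LSY: output rows
            for ky in range(fy):              # LFY: kernel rows
                for o in range(t_of):         # LOF: output feature maps
                    for x in range(sx):       # LSX: output columns
                        for kx in range(fx):  # LFX: kernel columns
                            out[o][y][x] += (inp[i][y + ky][x + kx]
                                             * wgt[o][i][ky][kx])
    return out
```

Because the LIF loop is outermost within the tile, the partial sums for the whole output tile remain in `out` until accumulation over all buffered input maps completes.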
V-C HWC Evaluation
To explore the hardware implications of the dataflow schedule proposed in Figure 8, we specified a HWC prototype directly in C, using the Catapult C high-level synthesis (HLS) tool to derive a Verilog design. The design exposes one slave configuration port and three 32-bit master ports towards the TCDM (one for each array reference), separating the control of memory accesses for each array and easing the high-level synthesis process. We specified the HWC datapath to handle convolution kernel sizes up to 11×11, with stride support, and unlimited input/output feature map sizes. It is designed as a single-instruction multiple-data (SIMD) engine capable of sixteen 8-bit × 8-bit or eight 16-bit × 16-bit fixed-point MAC operations per clock cycle. Figure 9 shows the block diagram of the HWC and the final breakdown of the local storage capacity for the three CNN arrays. The prototype HWC has slightly over 1 KB of internal storage, comprising small input, weight, and accumulated-sum buffers, with partial sums accumulated in 32-bit precision.
Using the HWC design generated by HLS, we performed synthesis and place-and-route of an ASMP cluster with 4 HWCs targeting a 28 nm technology node, achieving a maximum frequency of 500 MHz. The computing cluster achieves up to 64 8-bit × 8-bit MAC operations per cycle, for a total of 32 GMAC/s at 500 MHz, with an average utilization of 80% across the set of convolutional layers used for the design space exploration.
To understand how accurately our memory performance model estimates the memory traffic of an actual implementation, we measured the number of bytes transferred between the HWC and the TCDM during the execution of various CNN layers and compared these measurements with the bandwidth predicted by our memory performance model. The model is almost exact with respect to the measured bandwidth. It does not explicitly account for strided convolutional layers when the tile sizes are not integral multiples of the stride; a small overestimation of the memory traffic can be observed in such cases. In our experiments this overestimation is always within 0.5% of the total memory traffic.
To put the HWC schedule into perspective, we compare the amount of memory traffic generated by the HWC with that generated by a state-of-the-art tightly-coupled CNN convolution accelerator, the HWCE [11, 53]. Both hardware units target ultra-low-cost and low-power applications and implement very little internal buffer storage. Both target tightly-coupled shared-memory clusters and are constrained by the performance of the shared-memory logarithmic interconnect in a similar way. The HWCE implements a 2D convolution dataflow with linear input buffering similar to [21].
As shown in Figure 11, the HWCE uses a variant of the canonical tiled loop-nest shown in Figure 3, with the outermost tiling loops executed in software: it executes a 2D convolution over a single input feature map, producing one partially computed output feature map. The HWCE takes limited advantage of the data reuse available in the convolution loop-nest. Convolution reuse results from the input feature maps being buffered in a linear input buffer [21]; weight reuse is ensured by buffering the convolution kernel weights required for processing a single pair of input and output feature maps. The output feature maps and the partially accumulated sums are stored in the TCDM memory, and elements of the output array are buffered only while a single convolution kernel is applied.
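The traffic behavior of such a per-feature-map-pair dataflow can be approximated as follows (our simplification for illustration, not the HWCE's exact cost model):

```python
# Per-pair 2-D convolution dataflow: weights are loaded once per
# (input map, output map) pair, each input map is re-read once per output
# map, and partial sums round-trip through shared memory once per input
# map (one write per pass, one read-back per pass except the first).
def pairwise_dataflow_traffic(n_if, n_of, in_px, out_px, fx, fy):
    weights = n_if * n_of * fx * fy
    inputs = n_of * n_if * in_px
    partials = out_px * n_of * (2 * n_if - 1)
    return weights + inputs + partials
```

The `partials` term grows linearly with the number of input feature maps, which is the penalty quantified in Figure 10.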
Figure 10 shows the ratio of HWCE traffic to that of the HWC prototype using 1 KB of internal buffer, for different CNN layers. While the HWC and the HWCE expose the same number of ports towards the TCDM (three 32-bit ports), the HWC dataflow schedule results in lower traffic for all tested convolutional layers; for some layers, it reduces traffic by up to 14 times compared to the HWCE dataflow. The HWCE dataflow schedule suffers a significant penalty from the large volume of partial accumulation sums stored in the TCDM memory: writing and reading these partial sums at higher numeric precision noticeably increases memory traffic. Additionally, with a 1 KB memory budget, the HWCE linear input buffer would be too small for some layers, such as AlexNet 1, ZFNet 1, or ResNet 1; the bar is absent for such layers in the figure. To handle an insufficient linear input buffer, the actual HWCE implementation splits the input feature maps into several smaller stripes that fit the small linear input buffer; this causes a further slight increase in redundant shared-memory traffic due to the overlap between stripes. Since the HWC and the HWCE expose the same number of TCDM ports, those of the HWC are used, on average, significantly less. This translates into less energy spent in memory transactions and less contention, making it easier to combine HWC operation with other computations in the cluster [52] without significant performance loss.
VI Conclusion and Future Work
We have presented an analytic memory performance model suitable for memory hierarchies that use application-managed buffers. We have shown that our model yields more accurate memory footprint estimation than previously published models and is accurate with respect to a real implementation. We have used this model to design a CNN convolution hardware block in the context of the ASMP shared-memory cluster. Our future work includes applying our model to the automatic generation of dataflow schedules from standard CNN descriptions in Caffe, TensorFlow, or similar tools.
References

[1] Y. LeCun, Y. Bengio, and G. E. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[2] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient processing of deep neural networks: A tutorial and survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[3] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[4] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in ISCA, 2015, pp. 92–104.
[5] M. Peemen, B. Mesman, and H. Corporaal, "Optimal iteration scheduling for intra- and inter-tile reuse in nested loop accelerators," Eindhoven University of Technology, Tech. Rep. ESR-2013-3, January 2013.
[6] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in FPGA, 2015, pp. 161–170.
[7] X. Yang, J. Pu, B. B. Rister, N. Bhagdikar, S. Richardson, S. Kvatinsky, J. Ragan-Kelley, A. Pedram, and M. Horowitz, "A systematic approach to blocking convolutional neural networks," CoRR, vol. abs/1606.04209, 2016.
[8] M. E. Wolf and M. S. Lam, "A data locality optimizing algorithm," in PLDI, 1991, pp. 30–44.
[9] M. Peemen, A. A. A. Setio, B. Mesman, and H. Corporaal, "Memory-centric accelerator design for convolutional neural networks," in ICCD, 2013, pp. 13–19.
[10] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks," in CVPR Workshops, 2014, pp. 696–701.
[11] F. Conti and L. Benini, "A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters," in DATE, 2015, pp. 683–688.
[12] Y. Chen, T. Chen, Z. Xu, N. Sun, and O. Temam, "DianNao family: Energy-efficient hardware accelerators for machine learning," Commun. ACM, vol. 59, no. 11, pp. 105–112, 2016.
[13] J. Sim, J.-S. Park, M. Kim, D. Bae, Y. Choi, and L.-S. Kim, "A 1.42 TOPS/W deep convolutional neural network recognition processor for intelligent IoE systems," in ISSCC, 2016, pp. 264–265.
[14] Intel, "Intel Nervana neural network processor: Architecture update," https://ai.intel.com/intelnervananeuralnetworkprocessorarchitectureupdate, 2018.
[15] Google, "An in-depth look at Google's first Tensor Processing Unit (TPU)," https://cloud.google.com/blog/bigdata/2017/05/anindepthlookatgooglesfirsttensorprocessingunittpu, 2018.
[16] Imagination Technologies, "PowerVR Series2NX neural network accelerators (NNA)," https://www.imgtec.com/powervr/vision/series2nx, 2018.
[17] G. Desoli, V. Tomaselli, E. Plebani, G. Urlini, D. Pau, V. D'Alto, T. Majo, F. D. Ambroggi, T. Boesch, S. pal Singh, E. Guidetti, and N. Chawla, "The Orlando project: A 28 nm FD-SOI low memory embedded neural network ASIC," in ACIVS, ser. Lecture Notes in Computer Science, vol. 10016, 2016, pp. 217–227.
[18] D. Rossi, F. Conti, A. Marongiu, A. Pullini, I. Loi, M. Gautschi, G. Tagliavini, A. Capotondi, P. Flatresse, and L. Benini, "PULP: A parallel ultra low power platform for next generation IoT applications," in Hot Chips Symposium, 2015, pp. 1–39.
[19] M. S. Lam, E. E. Rothberg, and M. E. Wolf, "The cache performance and optimizations of blocked algorithms," in ASPLOS, 1991, pp. 63–74.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[21] C. Farabet, C. Poulet, J. Y. Han, and Y. LeCun, "CNP: An FPGA-based processor for convolutional networks," in FPL, 2009, pp. 32–37.
[22] F. Conti, C. Pilkington, A. Marongiu, and L. Benini, "He-P2012: Architectural heterogeneity exploration on a scalable many-core platform," in ASAP, 2014, pp. 114–120.
[23] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, "Deep learning with limited numerical precision," CoRR, vol. abs/1502.02551, 2015.
[24] P. Gysel, "Ristretto: Hardware-oriented approximation of convolutional neural networks," CoRR, vol. abs/1605.06402, 2016.
[25] J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng, "Quantized convolutional neural networks for mobile devices," in CVPR, 2016, pp. 4820–4828.
[26] M. Courbariaux, Y. Bengio, and J.-P. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," CoRR, vol. abs/1511.00363, 2015.
[27] M. Courbariaux and Y. Bengio, "BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1," CoRR, vol. abs/1602.02830, 2016.
[28] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," CoRR, vol. abs/1603.05279, 2016.
[29] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, "YodaNN: An architecture for ultra-low power binary-weight CNN acceleration," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 37, no. 1, pp. 48–60, 2018.
[30] F. Conti, P. D. Schiavone, and L. Benini, "XNOR Neural Engine: A hardware accelerator IP for 21.6 fJ/op binary neural network inference," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 37, no. 11, pp. 2940–2951, Nov. 2018.
[31] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, and B. Dally, "Deep compression and EIE: Efficient inference engine on compressed deep neural network," in Hot Chips Symposium, 2016, pp. 1–6.
[32] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, "A survey of model compression and acceleration for deep neural networks," CoRR, vol. abs/1710.09282, 2017.
[33] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky, "Sparse convolutional neural networks," in CVPR, 2015, pp. 806–814.
[34] S. Changpinyo, M. Sandler, and A. Zhmoginov, "The power of sparsity in convolutional neural networks," CoRR, vol. abs/1702.06257, 2017.
[35] X. Zhou, Z. Du, S. Zhang, L. Zhang, H. Lan, S. Liu, L. Li, Q. Guo, T. Chen, and Y. Chen, "Addressing sparsity in deep neural networks," IEEE Trans. on CAD of Integrated Circuits and Systems, early access, pp. 1–1, 2018.
[36] L. Cavigelli and L. Benini, "Extended bit-plane compression for convolutional neural network accelerators," arXiv:1810.03979 [cs], Oct. 2018.
[37] C. Farabet, B. Martini, P. Akselrod, S. Talay, Y. LeCun, and E. Culurciello, "Hardware accelerated convolutional neural networks for synthetic vision systems," in ISCAS, May 2010, pp. 257–260.
[38] P. Merolla, J. V. Arthur, F. Akopyan, N. Imam, R. Manohar, and D. S. Modha, "A digital neurosynaptic core using embedded crossbar memory with 45 pJ per spike in 45 nm," in CICC, 2011, pp. 1–4.
[39] W. Qadeer, R. Hameed, O. Shacham, P. Venkatesan, C. Kozyrakis, and M. Horowitz, "Convolution engine: Balancing efficiency and flexibility in specialized computing," Commun. ACM, vol. 58, no. 4, pp. 85–93, 2015.
[40] E. Azarkhish, D. Rossi, I. Loi, and L. Benini, "Neurostream: Scalable and energy efficient deep learning with smart memory cubes," CoRR, vol. abs/1701.06420, 2017.
[41] Y.-H. Chen, J. S. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in ISCA, 2016, pp. 367–379.
[42] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2018.
[43] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ASPLOS, 2014, pp. 269–284.
[44] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in MICRO, 2014, pp. 609–622.
[45] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou, "DLAU: A scalable deep learning accelerator unit on FPGA," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 36, no. 3, pp. 513–517, 2017.
[46] V. Rana, I. Beretta, F. Bruschi, A. A. Nacci, D. Atienza, and D. Sciuto, "Efficient hardware design of iterative stencil loops," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 35, no. 12, pp. 2018–2031, 2016.
[47] J. Cong, P. Li, B. Xiao, and P. Zhang, "An optimal microarchitecture for stencil computation acceleration based on nonuniform partitioning of data reuse buffers," IEEE Trans. on CAD of Integrated Circuits and Systems, vol. 35, no. 3, pp. 407–418, 2016.
[48] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," in ASPLOS, 2017, pp. 751–764.
[49] L.-N. Pouchet, P. Zhang, P. Sadayappan, and J. Cong, "Polyhedral-based data reuse optimization for configurable computing," in FPGA, 2013, pp. 29–38.
[50] L. Benini, E. Flamand, D. Fuin, and D. Melpignano, "P2012: Building an ecosystem for a scalable, modular and high-efficiency embedded computing accelerator," in DATE, 2012, pp. 983–987.
[51] A. Rahimi, I. Loi, M. R. Kakoee, and L. Benini, "A fully-synthesizable single-cycle interconnection network for shared-L1 processor clusters," in DATE, 2011, pp. 1–6.
[52] F. Conti, A. Marongiu, C. Pilkington, and L. Benini, "He-P2012: Performance and energy exploration of architecturally heterogeneous many-cores," Journal of Signal Processing Systems, vol. 85, no. 3, pp. 325–340, 2016.
[53] F. Conti, R. Schilling, P. D. Schiavone, A. Pullini, D. Rossi, F. K. Gürkaynak, M. Muehlberghuber, M. Gautschi, I. Loi, G. Haugou, S. Mangard, and L. Benini, "An IoT endpoint system-on-chip for secure and energy-efficient near-sensor analytics," IEEE Trans. on Circuits and Systems I: Regular Papers, vol. 64, no. 9, pp. 2481–2494, Sep. 2017.
Appendix A Peemen Equations for the CNN Convolution Loop-Nest
The buffering requirements for the 3 array references in the CNN convolution loop-nest are computed as shown in Equation 5:

(5) $B_{IF} = T_{if}\,\big(S(T_x-1)+K\big)\,\big(S(T_y-1)+K\big)$, $\quad B_{W} = T_{of}\,T_{if}\,K^2$, $\quad B_{OF} = T_{of}\,T_x\,T_y$

The terms $S(T_x-1)+K$ and $S(T_y-1)+K$ compute the dimensions, in pixels, of the input feature map tile, given the tile sizes for the output feature map, $T_x$ and $T_y$, the convolution kernel size, $K$, and the convolution stride $S$.
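As a sanity check, the buffer-size computation above can be expressed as a short Python sketch. The function name `peemen_buffers` and its argument order are our own illustration, not taken from Peemen's code:

```python
def peemen_buffers(Tx, Ty, Tif, Tof, K, S):
    """Per-tile buffer requirements (in elements) for the three arrays.

    Tx, Ty   : output feature map tile width and height
    Tif, Tof : input / output feature map tile depths
    K, S     : convolution kernel size and stride
    """
    in_w = S * (Tx - 1) + K      # input tile width, in pixels
    in_h = S * (Ty - 1) + K      # input tile height, in pixels
    b_if = Tif * in_w * in_h     # input feature map buffer
    b_w = Tof * Tif * K * K      # filter weight buffer
    b_of = Tof * Tx * Ty         # output feature map buffer
    return b_if, b_w, b_of
```

For example, a 4x4 output tile over 8 output and 16 input feature maps with a 3x3, stride-1 kernel requires a 6x6x16 input tile.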
The total memory traffic is computed by multiplying the buffering requirements by the total number of tiles, with one improvement. Peemen noticed that, with application-managed buffers, data can also be reused between consecutive innermost tile executions. This significantly improves the memory traffic estimation accuracy for the computations involving the prologue, steady state, and epilogue. Equation 6 shows the general form of the memory traffic computation:

(6) $M = \dfrac{X}{T_x}\cdot\dfrac{Y}{T_y}\cdot\dfrac{N_{if}}{T_{if}}\cdot\dfrac{N_{of}}{T_{of}}\cdot\big(B_{IF} + B_{W} + B_{OF}\big)$

where $X$ and $Y$ are the output feature map width and height, and $N_{if}$ and $N_{of}$ are the numbers of input and output feature maps.
The 4 cases that need to be considered for the CNN loop-nest, each corresponding to one of the controlling loops, LTSX, LTSY, LTIF, and LTOF, being the innermost controlling loop, are shown in Equations 7, 8, 9, and 10 below.

Innermost LTOF:
(7) $M_{OF} = \dfrac{X}{T_x}\cdot\dfrac{Y}{T_y}\cdot\dfrac{N_{if}}{T_{if}}\cdot\big(B_{IF} + B_{W} + B_{OF}\big)$, with $T_{of} = N_{of}$

Innermost LTIF:
(8) $M_{IF} = \dfrac{X}{T_x}\cdot\dfrac{Y}{T_y}\cdot\dfrac{N_{of}}{T_{of}}\cdot\big(B_{IF} + B_{W} + B_{OF}\big)$, with $T_{if} = N_{if}$

Innermost LTSY:
(9) $M_{SY} = \dfrac{X}{T_x}\cdot\dfrac{N_{if}}{T_{if}}\cdot\dfrac{N_{of}}{T_{of}}\cdot\big(B_{IF} + B_{W} + B_{OF}\big)$, with $T_y = Y$

Innermost LTSX:
(10) $M_{SX} = \dfrac{Y}{T_y}\cdot\dfrac{N_{if}}{T_{if}}\cdot\dfrac{N_{of}}{T_{of}}\cdot\big(B_{IF} + B_{W} + B_{OF}\big)$, with $T_x = X$

In Equations 7–10, in order to account for the data reuse across consecutive innermost tile executions, the memory traffic computation is done as if the innermost controlling loop were not tiled. For example, with loop LTOF being the innermost controlling loop, the tile size of this loop is set to the total loop count, i.e. $T_{of} = N_{of}$, inside the general Equation 6. Note that the buffer terms $B_{IF}$, $B_{W}$, and $B_{OF}$ must also be evaluated with the substituted tile size.
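The four schedule variants can be compared with a small Python sketch that applies the substitution above before evaluating the general traffic equation. Function names such as `schedule_traffic` are hypothetical, and the sketch assumes every dimension divides evenly by its tile size:

```python
def schedule_traffic(X, Y, Nif, Nof, Tx, Ty, Tif, Tof, K, S, innermost):
    """Memory traffic (in elements) for one choice of innermost controlling loop.

    The innermost controlling loop is treated as untiled: its tile size is
    replaced by the full loop extent, which models the reuse of buffered
    data across consecutive innermost tile executions.
    """
    if innermost == "LTOF":
        Tof = Nof
    elif innermost == "LTIF":
        Tif = Nif
    elif innermost == "LTSY":
        Ty = Y
    elif innermost == "LTSX":
        Tx = X
    # Buffer terms, evaluated with the substituted tile size.
    b_if = Tif * (S * (Tx - 1) + K) * (S * (Ty - 1) + K)
    b_w = Tof * Tif * K * K
    b_of = Tof * Tx * Ty
    n_tiles = (X // Tx) * (Y // Ty) * (Nif // Tif) * (Nof // Tof)
    return n_tiles * (b_if + b_w + b_of)


def best_schedule(X, Y, Nif, Nof, Tx, Ty, Tif, Tof, K, S):
    """Pick the innermost controlling loop with the least memory traffic."""
    loops = ("LTOF", "LTIF", "LTSY", "LTSX")
    return min(loops, key=lambda l: schedule_traffic(
        X, Y, Nif, Nof, Tx, Ty, Tif, Tof, K, S, l))
```

Sweeping the candidate tile sizes and calling `best_schedule` for each point reproduces the exhaustive tiling-space search that the published models perform.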
Appendix B Configuration of the CNN Layers Used
Layer       Nif   Nof
AlexNet 1     3    96
AlexNet 2    96   256
AlexNet 3   256   384
AlexNet 4   384   384
AlexNet 5   384   256
ZFNet 1       3    96
ZFNet 3     256   384
ZFNet 4     384   384
ZFNet 5     384   256
ZFNet 6     256   256
VGG 1         3    64
VGG 2        64    64
VGG 3        64   128
VGG 4       128   128
VGG 5       128   256
VGG 6       256   256
VGG 8       512   256
VGG 9       512   512
VGG 11      512   512
Module      Nif   Nof
0           192    64
            192    32
1           256    64
            256    64
2           288    64
            288    64
3           288   384
            288    64
4           768   192
            768   192
Layer       Nif   Nof
1             3    64
2
3
4
5