1. Introduction
The rapid adoption of convolutional neural networks (CNNs) has transformed machine learning. CNNs have been embraced across a wide array of fields, such as recommendation systems (Oord et al., 2013), natural language processing (Collobert and Weston, 2008), and computer vision
(Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; Iandola et al., 2016; Szegedy et al., 2015). In particular, image object recognition has become the de facto benchmark for CNNs, with new networks shattering prior records in object detection and classification every year. However, improvements in CNN accuracy are accompanied by a rapid increase in computational cost. CNNs have already grown to the point where multicore CPUs are no longer a viable computing platform. At the same time, while GPUs offer adequate performance, GPU power consumption brings its own set of challenges, particularly at datacenter scale. As a result, FPGAs have seen a surge in interest for CNN acceleration due to their programmable, massively parallel, and power-efficient computing substrate. The combination of high performance and power efficiency on machine learning tasks is leading to the adoption of FPGAs in data center environments (Putnam et al., 2014).
CNNs comprise multiple computation layers, whose inputs are arrays of different dimensions. The prior state of the art for using FPGAs for CNNs is to implement an accelerator, which we call a convolutional layer processor (CLP), that processes the layers iteratively, one by one. A CLP design is parameterized by the dimensions of its computational grid; its speed depends on the compatibility of these dimensions with the CNN layers it computes. To achieve peak performance, the CLP parameters are jointly optimized for the ensemble of the layers to maximize the collective throughput of the accelerator. This approach closely follows from an ASIC accelerator design flow, where a given hardware design is optimized for the ensemble of the benchmarks that will run on it, so as to perform well for all likely workloads that will be used once the ASIC is deployed.
We observe that jointly optimizing one CLP for all CNN layers leads to dynamic underutilization of FPGA resources, giving up performance that could be achieved on the FPGA platform. Although the CLP is optimized for maximum throughput, the fixed dimensions of the computational grid are suboptimal for some, or even all, of the individual layers. Figure 1 (top) illustrates this problem. The Single-CLP hardware (white box) iteratively processes the three layers (blue boxes). The dimensions of the hardware and the layers are represented by the size and shape of the boxes. L1 is smaller than the CLP dimensions, leaving some hardware unused when computing this layer (Figure 1(a)). L2’s size exactly matches the CLP, but L3’s dimensions exceed the CLP size. Therefore, the CLP computational grid must be used iteratively to compute different parts of L3 (first, its top portion, then, its bottom portion), again underutilizing the available hardware (Figure 1(b)). On the popular AlexNet CNN (Krizhevsky et al., 2012), an “optimal” Single-CLP derived from the state-of-the-art methodology (Zhang et al., 2015) has a dynamic utilization of less than 24%. This means that, on average, more than three quarters of the CLP’s arithmetic units (multipliers and adders built from the FPGA’s DSP slices) remain unused.
To overcome this problem, we propose a new CNN accelerator design that partitions FPGA resources among multiple CLPs, which operate on multiple images concurrently. We illustrate the operation of the Multi-CLP in Figure 1 (bottom), where the hardware resources are partitioned among two smaller CLPs that operate in parallel on different images. Note that the two CLPs are specialized and have different dimensions; this allows CLP1 to work well for L1 and L3, while CLP2’s dimensions are compatible with L2. The key is that these sizes allow the layers to be processed with very little idle hardware, enabling the Multi-CLP to do the same amount of work in less time (Figure 1(c)).
We develop an optimization algorithm that, given CNN layer dimensions and a resource budget, computes a partitioning of the FPGA resources into multiple CLPs for an efficient, high-performance design. Our algorithm runs in minutes and produces a set of CLP dimensions. We then use these dimensions to parameterize a CLP design specified using high-level synthesis (HLS), combining the resulting CLPs to form a complete CNN implementation.
Our results demonstrate that partitioning FPGA resources into multiple CLPs yields high arithmetic unit utilization, in some cases close to 100%. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach for the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x, respectively.
The rest of the paper is organized as follows. In Section 2, we provide background on CNNs. Section 3 describes the state-of-the-art FPGA implementation and analyzes the inefficiency of Single-CLP accelerators. Section 4 presents our Multi-CLP optimization methodology. Section 5 describes our design and implementation, and Section 6 details experimental results. Section 7 discusses related work and we conclude in Section 8.
2. CNN Background
In typical object recognition examples (e.g., (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014)), a CNN passes images through a number of convolutional layers, which convolve the input (an array of two-dimensional matrices called feature maps) with an array of two-dimensional filters, whose values, called weights, were previously learned using an algorithm such as backpropagation. Non-linear layers, which typically perform computations such as subsampling or activation functions, interleave convolutional layers. In the end, the network includes one or more fully-connected layers, each of which performs a number of dot-products across its entire input. Figure 2 shows AlexNet (Krizhevsky et al., 2012), which contains five stages of paired convolutional layers (e.g., 1a and 1b), followed by three stages of fully-connected layers (FC1–FC3). In this figure, the small non-linear layers are omitted. As in prior work (Zhang et al., 2015), we focus on the convolutional layers of the network, because they are the most compute-intensive layers.

Figure 3 illustrates a convolutional layer and Listing 1 presents the pseudo code to compute it. To simplify presentation, we omit biases. Each layer takes N input feature maps and convolves them with the filters. There are M sets of filters; by convolving one set of filters (N × K × K weights) with the input feature maps, one of the M output feature maps is obtained. For example, the blue point in the lowest output feature map in Figure 3 is computed by taking the dot-product of the blue weights with the portion of the input feature maps shaded in blue. All points of the output feature map are computed by sliding the blue shaded region around the input feature maps. Repeating this process with each of the M sets of filters, we compute each of the M output feature maps.
3. Resource Utilization Problem
A common approach for building a CNN accelerator is what we call a convolutional layer processor (CLP), which computes the nested loop in Listing 1. Because a CNN has multiple convolutional layers, the same CLP is used to process all layers, one by one. Because different layers have different dimensions (N, M, R, C, and K), such a “one size fits all” approach creates a resource utilization problem, as illustrated in Figure 1. In this section, we analyze how this problem affects a state-of-the-art FPGA CNN accelerator.
3.1. State-of-the-Art Design
We base our analysis on the design in (Zhang et al., 2015). This design employs loop transformations, such as loop reordering, tiling, and unrolling, to reorder computations and memory accesses, increasing throughput and reducing data transfer. The transformed loop is used as a template for constructing the accelerator.
Using the methodology in (Zhang et al., 2015), the nested loops in Listing 1 are transformed into Listing 2, illustrated as a datapath in Figure 4. The In, Out, and W arrays represent on-chip buffers for input, output, and weight data, respectively. Copying data in or out of these arrays corresponds to transferring data between the on-chip buffers and off-chip memory. Double-buffering is used to overlap data transfer with computation and requires provisioning each memory with twice the capacity. To simplify presentation, Listing 2 omits a required boundary check when copying data.
The R, C, M, and N loops are tiled with factors Tr, Tc, Tm, and Tn, respectively. These loop tiling factors control how much data are transferred per buffer refill or write-out, and the order in which data are transferred. Because the innermost two loops are unrolled (based on Tn and Tm), loop tiling also controls how the compute modules are constructed. In particular, to implement these two unrolled loops, Tm vector dot-product units are constructed, each of width Tn. An accumulation adder is added after each unit, as shown in Figure 4. This yields Tn × Tm multipliers and Tn × Tm adders.
Given a resource budget (e.g., a number of DSP slices), one can find the optimal ⟨Tn, Tm, Tr, Tc⟩ for a given convolutional layer. In (Zhang et al., 2015), a joint optimization is performed to create a single CLP to compute all of the convolutional layers in the CNN. The optimization finds the parameters ⟨Tn, Tm, Tr, Tc⟩ that maximize the aggregate performance of the CLP.
3.2. Arithmetic Unit Utilization Problem
Although the methodology in (Zhang et al., 2015) produces a CLP optimized for the collective performance of all convolutional layers, we observe that its speed is limited by the fact that different convolutional layers of the CNN have different dimensions, but all are computed on the same CLP. Thus, the CLP that gives the best performance across all layers is not necessarily well suited for any one layer. Because the limiting factor of performance is the number of parallel arithmetic units in the CLP, the cost of the mismatch can be quantified by considering the utilization of the arithmetic units. That is, we can quantify the percentage of the time that the arithmetic units in the CLP are doing work versus the percentage of the time they are idle.
The primary cause of the utilization penalty is a mismatch between the tile parameters ⟨Tn, Tm⟩ and their corresponding loop sizes ⟨N, M⟩. In particular, if N is less than Tn or M is less than Tm, then there must be cycles where some of the multipliers are not used. For example, following the methodology in (Zhang et al., 2015), we generated an accelerator for SqueezeNet (Iandola et al., 2016)
that targets the Virtex-7 690T FPGA and uses single precision floating-point arithmetic. The best ⟨Tn, Tm⟩ we obtained is ⟨9, 64⟩. However, the N of layer one of the network is 3, therefore Tn > N, leading to an arithmetic unit utilization of 33.3%. The M of layer two of the network is 16, so Tm > M. To make things worse, for layer two, N is not a perfect multiple of Tn, which is another source of underutilization. Eight iterations are needed for Tn to cover N = 64. The first seven iterations cover the first 63 input feature maps, leaving only one for the eighth iteration, during which only 1/9 of the Tn is used. The compound effect of the mismatch on both N and M leads to a utilization of only 22.2% for layer 2. Overall, analyzing all convolutional layers in SqueezeNet gives an arithmetic unit utilization of 76.4%.

When fixed-point arithmetic is used, more adders and multipliers can be built using the same DSP slice budget, exacerbating the mismatch problem. The worst case we observed is running AlexNet on a Virtex-7 690T FPGA with 16-bit fixed-point arithmetic units, which has an overall arithmetic unit utilization of less than 24%.
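The per-layer utilization arithmetic above can be reproduced with a short helper. This is a sketch under the assumption, as in the analysis above, that idle multiplier cycles arise only from the mismatch between the layer dimensions ⟨N, M⟩ and the CLP dimensions ⟨Tn, Tm⟩.

```cpp
#include <cassert>
#include <cmath>

// Fraction of multiplier-cycles doing useful work when a layer with N input
// and M output feature maps runs on a CLP sized <Tn, Tm>. Each dimension
// contributes (actual size) / (iterations * tile size).
double utilization(int N, int M, int Tn, int Tm) {
  const double nFrac = N / (std::ceil(double(N) / Tn) * Tn);
  const double mFrac = M / (std::ceil(double(M) / Tm) * Tm);
  return nFrac * mFrac;
}
```

For the ⟨9, 64⟩ CLP discussed above, a layer with N = 3 gives 3/9 = 33.3%, and a layer with N = 64 and M = 16 gives (64/72) × (16/64) = 22.2%, matching the text.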
4. Multi-CLP Design
To improve the resource utilization and thus CNN performance, we propose Multi-CLP accelerators, where the available resources are partitioned across several smaller convolutional layer processors rather than a single large one. The advantage comes from the CLPs having different sizes, more closely matching the dimensions of the CNN layers. This approach is possible because CNN accelerators process many input images, allowing CLPs to concurrently work on independent inputs.
To construct a Multi-CLP accelerator for a given resource budget and CNN structure, one must decide how many CLPs to use, how to partition the resources among them, and how to distribute and schedule the processing of individual convolutional layers from multiple images on the CLPs. We describe (a) the operation of a Multi-CLP, (b) a model to predict CLP costs and performance given its parameters, and (c) an optimization algorithm to find the best Multi-CLP design for a given resource budget and set of CNN layers.
4.1. Multi-CLP Accelerators for CNNs
Due to the feed-forward nature of the CNN structure, it is natural to think of the layers of the CNN as a pipeline. Therefore, one way to construct a CNN accelerator with multiple CLPs is to implement a separate CLP for each layer (Li et al., 2016). An accelerator for a CNN with k convolutional layers would have k CLPs and would operate on k independent input images. (That is, CLP1 would work on image i while CLP2 works on image i−1, etc.) This would have the benefit of allowing each CLP to be optimized solely for the dimensions of one CNN layer, improving efficiency.
A limitation of this approach, however, is that it requires the number of CLPs to be equal to the number of convolutional layers. This poses several problems for practical CNNs. First, it forces the design to divide the on-chip BRAM resources, reducing the buffer size of each CLP. As a result, the ability to exploit data reuse in each CLP diminishes, greatly increasing the overall memory bandwidth requirement and slowing down each CLP. Second, this one-to-one mapping of CLPs to convolutional layers requires coordinating a large number of accesses to off-chip memory, which is costly in terms of performance and logic resources. Third, each CLP has an overhead cost (i.e., control logic for address calculation and loop index state machine). If there are many CLPs, significant resources are devoted to control logic instead of CNN computation.
To address these problems, we target Multi-CLP designs that minimize the number of CLPs in an accelerator. This approach requires at least one CLP in the design to compute multiple CNN layers. We use a static assignment of layers to CLPs, where each layer is bound to one CLP. Layers assigned to the same CLP need not be adjacent in the CNN structure.
The timeline of the accelerator operation is divided into epochs. In each epoch, each CLP sequentially processes its layers, with each layer having its own independent data. The epoch ends when all CLPs finish. Figure 5 shows an example where CLP0 processes three layers (L1, L3, and L4) and CLP1 processes two layers (L2 and L5). In each epoch, each CLP only consumes data generated during the previous epoch, avoiding data dependencies within an epoch. For example, the output produced by L1 in epoch i will be used as input for L2 in epoch i+1. This means that processing an image requires five epochs; therefore, data from five different images will be in flight at a time. Because the intermediate data are typically too large to hold on chip, all CLPs read their inputs from and write their outputs to off-chip memory.

If the evaluation latency must be limited further, one can constrain the layer assignment such that layers for the same CLP are adjacent in the CNN structure. This way, a CLP can process multiple layers for an image in a single epoch, and the total number of in-flight images is equal to the number of CLPs. This means one can reduce latency by limiting the number of CLPs, but this is achieved at the cost of throughput.
There are several considerations for a Multi-CLP system to achieve high throughput; we use these as the targets of our optimization method (Section 4.3). First, the epoch length, and thus the system throughput, is limited by the CLP that takes the longest to complete its assigned work. For example, in Figure 5, CLP0 is idle after it finishes L4, until the next epoch begins. Second, the convolutional layers assigned to a CLP should have dimensions compatible with the CLP’s dimensions to ensure high arithmetic unit utilization. Third, the on-chip memory allocated to each CLP is inversely related to the off-chip bandwidth that it requires; larger CLP buffers reduce off-chip data transfer.
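The first consideration, that the epoch length is set by the slowest CLP, can be expressed directly. This is a sketch; in practice the per-layer cycle counts would come from the performance model of Section 4.2.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Epoch length of a Multi-CLP schedule: each CLP processes its assigned
// layers back to back (on independent images), and the epoch ends when the
// slowest CLP finishes. cyclesPerCLP[c] holds the cycle counts of the layers
// assigned to CLP c.
uint64_t epochCycles(const std::vector<std::vector<uint64_t>>& cyclesPerCLP) {
  uint64_t worst = 0;
  for (const auto& clp : cyclesPerCLP)
    worst = std::max<uint64_t>(
        worst, std::accumulate(clp.begin(), clp.end(), uint64_t{0}));
  return worst;
}
```

A CLP whose layers sum to fewer cycles than the slowest CLP sits idle for the remainder of the epoch, which is exactly the idle time the optimization tries to minimize.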
4.2. Modeling CLP Cost and Performance
To find an efficient Multi-CLP design for a CNN, we first construct a model of the CLP costs (DSP slices, BRAMs, memory bandwidth) and speed (cycles). A CLP is parameterized by its size ⟨Tn, Tm⟩ and the tiling parameters ⟨Tr, Tc⟩ of each of its layers. The model also uses the dimensions of the convolutional layers: N, M, R, C, and K (Section 2). From these parameters, one can derive the resource use and performance of a CLP. Because our CLP is based on (Zhang et al., 2015), a number of the formulas used in our models appear similar, but include several notable differences.
Performance Model. Assuming a fixed frequency target, the performance of a CLP is dictated by the number of cycles needed to compute each of its layers. Because arithmetic units may be underutilized, the cycle count cannot be calculated by dividing the amount of work by the number of arithmetic units in the design. Instead, an exact counting of loop iterations based on Listing 2 is needed. In Listing 2, the innermost two loops are unrolled, thus they do not contribute to the iteration count. Along the R dimension, the combination of the outer loop (looping over tiles) and the inner loop (looping over elements in a tile) runs for R iterations. Similarly, there are C iterations along the C dimension. The remaining four loops have ⌈M/Tm⌉, ⌈N/Tn⌉, K, and K iterations, respectively. Together, the cycle count needed to compute one layer is

cycles = ⌈M/Tm⌉ × ⌈N/Tn⌉ × R × C × K × K.
Importantly, if a CLP is found to be memory-bandwidth bound, our optimization procedure determines the cycle count by using the data transfer time instead of the cycle count computed by this formula.
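The compute-bound cycle count is a direct transcription of the formula above (memory-bound layers would instead use the transfer time):

```cpp
#include <cassert>
#include <cstdint>

// Cycles to compute one layer on a <Tn, Tm> CLP (compute-bound case):
// ceil(M/Tm) * ceil(N/Tn) * R * C * K * K.
uint64_t layerCycles(int N, int M, int R, int C, int K, int Tn, int Tm) {
  auto ceilDiv = [](int a, int b) { return (a + b - 1) / b; };
  return uint64_t(ceilDiv(M, Tm)) * ceilDiv(N, Tn) * R * C * K * K;
}
```

Note that the ceilings are exactly where underutilization enters: a layer with N = 64 on a Tn = 9 CLP pays for ⌈64/9⌉ = 8 full iterations even though the eighth processes only one input feature map.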
Modeling Bandwidth Usage.
We are primarily focused on the peak bandwidth use of a CLP, to estimate how much bandwidth is needed to support the maximum computation speed. When the peak bandwidth is unavailable on the target platform, the model must be able to estimate the throughput of the accelerator, taking into consideration how compute may be blocked by data transfer. This allows design space exploration to find the best-performing design under a bandwidth limitation.
A CLP uses double-buffering to overlap data transfer with computation. Layers are processed back to back, allowing data transfer for one layer to be overlapped with computation for another. The model calculates the start and end times of data transfer and compute, and takes into account the dependencies between them to determine the cycles required to finish computation in the bandwidth-bound cases.
Modeling DSP Slice Usage. The dominant use of DSP slices is the dot-product units (each of size Tn) and accumulator adders (see Figure 4); each CLP contains Tn × Tm multipliers and Tn × Tm adders. For floating-point arithmetic, each multiplier comprises two DSP slices and each adder comprises three. For 16-bit fixed-point arithmetic, a single DSP slice provides both an adder and a multiplier. Therefore, the respective DSP counts are

DSP_float = 2(Tn × Tm) + 3(Tn × Tm) = 5 × Tn × Tm and DSP_fixed = Tn × Tm.
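In code, the counts above can be sketched as follows. The Tn × Tm adder total is our accounting assumption: Tn − 1 tree adders plus one accumulator per dot-product unit, times Tm units.

```cpp
#include <cassert>

// DSP slices for a <Tn, Tm> CLP: Tn*Tm multipliers and Tn*Tm adders.
// Floating point: 2 DSPs per multiplier and 3 per adder.
// 16-bit fixed point: a single DSP slice provides a multiply-add.
int dspCount(int Tn, int Tm, bool fixedPoint) {
  const int mults = Tn * Tm, adds = Tn * Tm;
  return fixedPoint ? mults : 2 * mults + 3 * adds;
}
```

For example, a ⟨7, 64⟩ floating-point CLP costs 5 × 448 = 2,240 DSP slices, which matches the 485T DSP target used in the evaluation, while the same slice budget supports far more fixed-point units.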
Modeling BRAM Usage. BRAMs are used to construct the various on-chip buffers. Modeling must account for the banking of these buffers, the number of read/write ports a buffer uses, whether double-buffering is used, the capacity and capabilities of a BRAM, and the size of a word. Here, we assume the word size to be either 16 or 32 bits, although this can easily be extended. We report usage in terms of Virtex-7 BRAM18Kb units, which can store 512 32-bit words and operate in “Simple Dual-Port mode” (Xilinx, 2016), allowing one read and one write concurrently.
The input buffer is organized into Tn banks of size B_in, which is provisioned to support the computation of all of the layers on a CLP. When computing a layer, each bank stores (S·Tr + K − S)(S·Tc + K − S) words, where S is the layer’s stride. Because the ⟨Tr, Tc⟩ parameters change from layer to layer, each layer needs a different amount of data, requiring B_in to be large enough to support the most demanding layer. An input bank must be double-buffered to support the overlapping of computation and data transfer, using one read port and one write port. With 32-bit words, this buffer is constructed with 2⌈B_in/512⌉ BRAMs per bank. However, because a single BRAM already provides a read port and a write port, when 2·B_in ≤ 512, one BRAM is sufficient to construct a double-buffered input bank.
The weight buffer is similar to the input buffer. There are Tn × Tm banks. When computing a layer, each weight bank stores a filter (K × K words). Thus, of the layers that a CLP computes, the layer with the largest K determines the size of a weight bank. Other aspects of the weight buffer are modeled in the same way as the input buffer.
The output buffer is organized into Tm banks. When computing a layer, each bank stores Tr × Tc words. The output buffer is provisioned for the most-demanding layer and is double-buffered. However, to support the accumulation used in the CLP, an output-buffer bank requires at least two BRAMs for double-buffering, because the accumulation buffer requires both a read port and a write port. The remaining aspects of the output buffer model are the same as the input and weight buffer models.
For all buffers, the number of banks is halved for the 16-bit fixed-point data type, because pairs of 16-bit words are packed into 32-bit wide BRAMs. Additionally, if a bank stores only a small number of values (fewer than 10), we do not count them toward the BRAM budget, because small memories are implemented as LUTRAMs.
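The per-bank BRAM arithmetic can be sketched as follows. The assumptions match the rules above (32-bit words, a BRAM18Kb holding 512 words with one read and one write port, the under-10-words LUTRAM rule, and the single-BRAM case for small ping-pong pairs); exact corner cases in the authors' model may differ.

```cpp
#include <cassert>

// BRAM18Kb units needed for one double-buffered bank of `words` 32-bit words.
int bramsPerBank(int words) {
  if (words < 10) return 0;          // tiny banks become LUTRAM instead
  if (2 * words <= 512) return 1;    // both halves of the double buffer fit
                                     // in one BRAM (1 read + 1 write port)
  return 2 * ((words + 511) / 512);  // ceil(words/512) BRAMs per half
}
```

Multiplying by the bank count (Tn for input banks, Tn × Tm for weight banks, Tm for output banks) gives the buffer's total BRAM cost, with the output banks additionally floored at two BRAMs each as described above.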
4.3. Optimization of Multi-CLP Designs
We now describe an optimization tool that uses the above model to find the fastest Multi-CLP configuration (for a given CNN) that fits within the specified resource constraints. Because we are targeting FPGAs, we optimize our accelerator only for a specific target CNN. However, this optimization can be simultaneously applied to multiple target CNNs to jointly optimize their performance.
The optimization takes as input the CNN layer descriptions and target FPGA resource constraints, and produces the parameters to construct and use a Multi-CLP design. The optimization result includes the number of CLPs and their dimensions ⟨Tn, Tm⟩. It also includes the distribution of the CNN layers to the CLPs and, for each layer, the ⟨Tr, Tc⟩ parameters, which dictate how the layer is computed by the CLP (Section 3.1).
Given a set of parameters, evaluating the model is simple. However, there are far too many possibilities to perform an exhaustive search for the fastest design that meets the resource constraints, requiring the use of heuristics during optimization.
The optimization process comprises two steps, repeated iteratively until at least one solution that meets the constraints is discovered. At the beginning of each iteration, a performance target is set. If the performance target cannot be met at the end of the iteration, a new iteration is started with a slightly lower target.
In each iteration, the first step focuses on the partitioning of DSP slices. The output of this step is a collection of partition candidates. Each candidate is a partial solution that specifies the number of CLPs, the ⟨Tn, Tm⟩ of each, and the assignment of CNN layers to the CLPs.
The challenge of this step arises in assigning the CNN’s layers to CLPs. Because the number of possible assignments is exponential in the number of layers, a complete search of this space is impractical. We mitigate this problem through the observation that, in the best designs, a CLP is assigned “similar” layers. We first produce an ordered list of the layers based on a heuristic (e.g., compute-to-data ratio for bandwidth-limited accelerators or Euclidean distance between layer dimensions for compute-bound accelerators). Then, when we assign layers to CLPs, we only consider candidates where a CLP computes a set of adjacent layers in this order, allowing our search to prune inefficient solutions where incompatible layers would share a CLP.
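This pruning reduces the assignment search to contiguous splits of the ordered layer list, which can be enumerated as follows. This is a sketch, not the authors' implementation: `layers` holds layer indices already sorted by the chosen heuristic, and `k` is the number of CLPs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One Partition maps each CLP to a contiguous run of the ordered layer list.
using Partition = std::vector<std::vector<int>>;

// Enumerate all partitions of layers[start..] into exactly k non-empty
// contiguous groups, appending each complete partition to `out`.
void splits(const std::vector<int>& layers, std::size_t start, int k,
            Partition cur, std::vector<Partition>& out) {
  if (k == 1) {  // the last CLP takes all remaining layers
    cur.push_back(std::vector<int>(layers.begin() + start, layers.end()));
    out.push_back(cur);
    return;
  }
  // leave at least k-1 layers for the remaining CLPs
  for (std::size_t end = start + 1; end + (k - 1) <= layers.size(); ++end) {
    Partition next = cur;
    next.push_back(std::vector<int>(layers.begin() + start,
                                    layers.begin() + end));
    splits(layers, end, k - 1, next, out);
  }
}
```

For L layers and k CLPs this enumerates only C(L−1, k−1) candidates, rather than the k^L arbitrary assignments, which is what makes the search tractable.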
The second step of the optimization focuses on the partitioning of BRAM slices. For each candidate from the first step, this step finds the ⟨Tr, Tc⟩ values to use for each layer that minimize the peak memory bandwidth use. These parameters in turn determine the buffer sizes of each CLP. A single candidate from the first step can result in multiple solutions in the second; we consider all of these solutions in the optimization process. If all candidates have peak memory bandwidth use higher than the available budget, a new iteration of optimization is needed, implying that the final solution will be bandwidth bound. When estimating the bandwidth requirements of a design during optimization, we allow computation of some CLPs to be blocked by data transfer. This potentially sacrifices dynamic utilization of some CLPs in the design, but in some cases results in the highest-performing designs overall, despite including bandwidth-bound CLPs that are idle and waiting for data for a small portion of the overall execution.
Listing 3 shows the pseudo code of the optimization procedure, which we implemented in C++. We separate each iteration into the two steps described above and use different algorithms for each. Both steps use memoization to avoid redundant work. The search can be further sped up by limiting the maximum number of CLPs to consider. Our C++ implementation can complete an optimization of a Multi-CLP accelerator for the GoogLeNet network in several minutes.
Lastly, we note that the same optimization method can be used for Single-CLP accelerator designs by constraining the search to only consider solutions with one CLP.
5. Design and Implementation
The optimization algorithm (Section 4) determines the characteristics of an optimized Multi-CLP accelerator, producing parameters for a C++ high-level synthesis (HLS) template. The template is compiled to synthesizable Verilog using Vivado HLS 2016.3. Our optimizer works for any set of CNN layers and resource budget, and our template supports computation over arbitrary data types.
5.1. Convolutional Layer Processor Template
The accelerator template is constructed based on nine parameters: Tn and Tm (to size the CLP compute module), four buffer depths (to size the on-chip bias, weight, input, and output buffers), and three port counts (to specify the number of AXI stream ports for transferring input, weight, and output data). Each CLP produced with the HLS tool has an auto-generated AXI4-Lite slave interface to trigger the start of computation and AXI stream interfaces for reading and writing data. For the Multi-CLP designs, each parameterized CLP template is passed through the HLS tool separately, producing independent IP cores that can be inserted into the top-level system and interconnected using a standard AXI crossbar and AXI DataMovers. One AXI4 port is used at the start of CLP operation to perform a burst transfer of a 32-byte descriptor containing the arguments for the computation (the layer dimensions and per-layer tiling parameters). After these arguments are retrieved and the derived variables (e.g., the loop trip counts) are computed, the design state machine executes the four nested loops shown in Listing 4.
Each iteration of the top-level loops performs computation for one input tile. The DATAFLOW directive ensures that the operations inside the loop are pipelined using ping-pong buffers for the concurrently accessed data structures. The input feature maps and weights are read on every iteration, using the ping-pong buffers to allow reading the subsequent iteration’s data while computation is in progress. The output feature maps are double-buffered to allow the start of the subsequent computation while the write of the previous output is in progress; a guard condition prevents output transfer on all but the last input tile. The bias read is similarly limited to occur only on the initial iterations to avoid unnecessary transfers.
To minimize port idle time, all transfer functions perform the maximum-length contiguous bursts allowed by the data structures. To minimize the number of bursts performed, we prioritize bursting along the row and column dimensions over the feature map dimension of the output array, as the CLP designs have Tm smaller than M. The read_input(), read_weights(), and write_output() functions are parameterized to support concurrent transfers across multiple ports by partitioning the transfers according to the top array dimension, as demonstrated in write_output().
The PIPELINE directive in compute() unrolls the Tn and Tm loops to create the CLP compute module. To ensure concurrent access to the data, the bias, weight, input, and output arrays are partitioned across different memory banks. The loops carrying the accumulation are kept outermost to avoid back-to-back (“loop-carry”) data dependencies across pipelined loop iterations. The effective tile sizes equal Tr and Tc, except for the last iterations of the R and C loops, in which case they take the layer boundary into account.
6. Evaluation
We evaluate the Multi-CLP approach to CNN accelerator design by applying our method to four networks (AlexNet, VGGNet-E, SqueezeNet, and GoogLeNet), targeting two Xilinx Virtex-7 FPGAs (485T and 690T). We consider designs with both single precision floating-point and 16-bit fixed-point arithmetic.
We use our optimization method (Section 4) to determine the best Single-CLP and Multi-CLP designs for each chip’s available resources and compare the model results. Further, we implement the designs using the HLS-based method (Section 5) and use simulation, synthesis, and place-and-route tools to compare the Single-CLP and Multi-CLP methods, and to quantitatively evaluate the correctness of our models. To fairly compare with prior work, we first demonstrate that our Single-CLP design for AlexNet on a Virtex-7 485T FPGA with single precision floating point is equivalent to the design in (Zhang et al., 2015).
Overall, our results show that our Multi-CLP methodology yields improvements ranging from 1.01x (VGGNet-E on 485T, single precision floating point) to 3.8x (AlexNet on 690T, fixed point) compared to Single-CLP designs.
6.1. Methodology
We use the optimization procedure from Section 4.3 to find the highest-throughput Single-CLP and Multi-CLP designs for each configuration. Optimization times range from less than a minute to less than an hour on one CPU. As input to the optimization procedure, we set the DSP and BRAM targets to 80% of the FPGA’s capacity: 1,648 BRAMs and 2,240 DSP slices on the 485T, and 2,352 BRAMs and 2,880 DSP slices on the 690T.
6.2. Utilization Benefits of Multi-CLP
We first compare the dynamic arithmetic unit utilization of Single-CLP and Multi-CLP designs across the 16 cases (four networks, two data types, two FPGAs). For this comparison, we do not restrict bandwidth, examining the effectiveness of Multi-CLP in improving dynamic arithmetic unit utilization.
Table 1 shows the arithmetic unit utilization of all designs. Multi-CLP achieves higher dynamic utilization than Single-CLP in all cases. The smallest improvement (1.01x) is seen when targeting VGGNet-E, because the convolutional layers of VGGNet-E have very regular dimensions. The best improvement (3.8x) is seen when targeting AlexNet, because the first layer of AlexNet requires a large amount of computation, but has a small N of 3. Multi-CLP gives significant improvements on SqueezeNet (2.1x) and GoogLeNet (2.0x), showing that it benefits a wide range of convolutional neural networks, including large and deep networks like GoogLeNet. Also noteworthy is that larger improvements are seen when the number of available arithmetic units increases, both on the larger FPGA (690T) and when using fixed-point arithmetic (meaning more arithmetic units are possible using the same resources). This demonstrates that Single-CLP has an inherent scaling problem: as the number of arithmetic units increases, a Single-CLP struggles to use them all. Conversely, our Multi-CLP design makes much more efficient use of the available units.
Table 1. Dynamic arithmetic unit utilization of SingleCLP (SCLP) and MultiCLP (MCLP) designs.

               AlexNet   VGGNetE   SqueezeNet   GoogLeNet
485T (float)
  SCLP          74.1%     96.8%      78.0%        81.9%
  MCLP          95.4%     97.5%      95.8%        96.9%
690T (float)
  SCLP          65.4%     96.0%      76.4%        78.1%
  MCLP          99.0%     98.7%      96.7%        96.0%
485T (fixed)
  SCLP          31.0%     89.7%      51.1%        50.2%
  MCLP          93.9%     97.3%      93.6%        93.8%
690T (fixed)
  SCLP          23.7%     88.3%      42.0%        44.0%
  MCLP          90.6%     96.1%      93.1%        89.3%
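To make the utilization metric concrete, the following is an illustrative sketch (our own simplified accounting, not the exact model of Section 4):

```python
def dynamic_utilization(total_ops, num_units, cycles):
    """Fraction of unit-cycles doing useful work: ops / (units * cycles).
    A simplified accounting; the full model tracks this per layer."""
    return total_ops / (num_units * cycles)

def speedup_from_utilization(util_multi, util_single):
    """With equal arithmetic unit counts, the speedup of one design over
    another equals the ratio of their dynamic utilizations
    (e.g., 0.951 / 0.726 is roughly 1.31)."""
    return util_multi / util_single
```

This ratio view is why, throughout this section, throughput gains at a fixed DSP budget track the utilization gains directly.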
6.3. Detailed Comparison: SingleCLP vs. MultiCLP
To examine the differences between SingleCLP and MultiCLP designs, we present detailed comparisons for two networks and scenarios. First, to compare with the SingleCLP design in (Zhang et al., 2015), we choose the same network and parameters: AlexNet using floating point at 100 MHz. Then, we evaluate a more aggressive scenario: SqueezeNet using 16-bit fixed-point arithmetic at 170 MHz.
Tables 2 and 4 present the parameters chosen by our optimization for AlexNet and SqueezeNet on each FPGA. Each table gives the parallelism parameters of the compute module (Figure 4); additionally, for AlexNet we show the parameters that control the on-chip data tiling (Section 3.1).




For AlexNet, Table 2 shows that when we target the same system as (Zhang et al., 2015) (SingleCLP, 32-bit floating point, 485T), our optimization yields the same parameters and the same speed (2.0 million cycles). Two caveats apply to this comparison. First, the cycle counts in (Zhang et al., 2015) only account for half of the convolutional layers (i.e., layers 1a, 2a, …, 5a of Figure 2, but not layers 1b, 2b, …, 5b); we therefore double the cycle count in Table 4 of (Zhang et al., 2015) to compare with our implementation. Second, (Zhang et al., 2015) does not report its tiling parameters. Underscoring the fairness of the comparison, we note that the SingleCLP and MultiCLP designs have the same arithmetic unit cost, which the MultiCLP design spreads among several CLPs. For example, on the 690T FPGA, the SingleCLP and MultiCLP designs use the same numbers of multipliers and adders, but the MultiCLP design distributes them over six CLPs.
Table 2 also shows which of the 10 convolutional layers of the network map to which CLP (e.g., for the MultiCLP on the 485T, CLP0 executes layers 4a, 4b, 5a, and 5b). The last column of each table shows the number of cycles that each CLP takes to execute its layers, with the overall cycles per image for each system shown at the bottom. For the SingleCLP designs, the overall cycle count is the sum of how long the CLP takes to compute all ten layers. When multiple layers are listed in the same row (such as 4a and 4b), the cycle count is the number of cycles needed to execute all of those layers.
In the MultiCLP system, the CLPs operate concurrently. The overall cycle count for the accelerator is the maximum of the cycle counts of its CLPs, because this dictates the epoch length (the interval between times when the pipelined MultiCLP system is able to start processing a new image). For example, in the AlexNet 485T MultiCLP design, the four CLPs have nearly equal cycle counts, and the largest of them determines the overall time per image.
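The two cycle-count rules (sum for SingleCLP, max for a pipelined MultiCLP) can be stated directly; a minimal sketch:

```python
def single_clp_cycles(layer_cycles):
    """A single CLP computes layers one after another, so the per-image
    cycle count is the sum over all layers."""
    return sum(layer_cycles)

def multi_clp_epoch(per_clp_cycles):
    """CLPs run concurrently in a pipeline; the slowest CLP sets the
    epoch length between successive images, so take the max."""
    return max(per_clp_cycles)
```

In a well-balanced MultiCLP design the per-CLP cycle counts are nearly equal, so the max is close to the mean and little work is wasted waiting.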
Because our optimization maximizes the overall throughput, the MultiCLP designs it produces tend to be balanced. This balance indicates that the resources are effectively spread across the computation pipeline, such that each CLP can be kept busy most of the time. Table 3 shows the arithmetic unit utilization of each AlexNet design, as well as the throughput (for convolutional layers) and the modeled consumption of DSP slices, BRAMs, and bandwidth. We see that on both FPGAs, the MultiCLP designs provide a significant throughput advantage over the SingleCLP: 1.31x on the 485T and 1.54x on the 690T. Because the Single and MultiCLP designs use an equal number of arithmetic units (built from the same number of DSP slices), the speedup is proportional to the MultiCLP improvement in arithmetic unit utilization. The 485T and 690T SingleCLP designs are only able to provide useful work to the arithmetic units 72.6% and 64.0% of the time, respectively, while MultiCLP improves utilization to 95.1% and 98.9%. The GFlop/s rate (in the last column) is proportional to the throughput.
Table 3. AlexNet designs: modeled resources, utilization, and throughput.

          BRAM     DSP    B/w (GB/s)   Arith. Util. (%)   Thr. (img/s)   GFlop/s
485T
  SCLP     618    2,240      1.40           72.6              48.85        65.05
  MCLP     731    2,240      1.38           95.1              63.98        85.20
690T
  SCLP     758    2,880      1.78           64.0              55.40        73.77
  MCLP   1,238    2,880      1.49           98.9              85.55       113.92
As the rate of computation increases, the amount of data that must be kept on chip increases commensurately. On the 485T FPGA, the 1.31x throughput improvement comes at a cost of 1.18x higher BRAM usage. On the 690T, the throughput improvement of the MultiCLP designs grows to 1.54x, requiring 1.63x higher BRAM usage. However, it is worth noting that there is a tradeoff between the number of BRAMs used and off-chip memory bandwidth. We can save bandwidth by adding more input and output buffers, or we can reduce buffer sizes at the cost of higher bandwidth.
We illustrate this phenomenon in Figure 6, showing the options for the two MultiCLP designs. The MultiCLP designs shown in Table 3 were chosen to roughly match the memory bandwidth used by the SingleCLP system. However, one could also adjust this tradeoff, saving BRAMs while using more bandwidth. All alternatives for each system have nearly identical throughput (e.g., all 690T designs have the same throughput as shown in the table, with the differences bounded by the tolerance of the optimization procedure in Section 4.3), but each makes a different tradeoff between BRAM capacity and off-chip bandwidth. For example, the points labeled A and C correspond to the iso-bandwidth designs described above. Another useful example is represented by points B and D, which allow the MultiCLP designs to approximate the BRAM utilization of the SingleCLP designs, at the expense of bandwidth. Point B reduces the 485T MultiCLP BRAM usage to 619, but increases the bandwidth requirement to 1.46 GB/s. On the 690T FPGA, point D represents an alternate design using only 1,075 BRAMs, but requiring a bandwidth of 2.44 GB/s. Given specific bandwidth and BRAM constraints, the optimization tool or the designer can choose between different points along the curve.
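Choosing among such iso-throughput points can be sketched as a simple filter. In the example below, the first two (BRAM, GB/s) pairs are the 485T values from Table 3 and point B; the third pair is hypothetical:

```python
# Candidate 485T MultiCLP designs with nearly identical throughput that
# trade BRAM capacity against off-chip bandwidth. The first two points
# come from Table 3 and point B; the third is hypothetical.
points = [(731, 1.38), (619, 1.46), (560, 1.80)]

def feasible(points, bram_budget, bw_budget):
    """Designs fitting both the BRAM and bandwidth budgets."""
    return [(b, w) for (b, w) in points if b <= bram_budget and w <= bw_budget]

def min_bram(points, bw_budget):
    """The BRAM-cheapest design under a bandwidth cap."""
    return min(p for p in points if p[1] <= bw_budget)
```

For instance, capping bandwidth at 1.5 GB/s selects the 619-BRAM point, mirroring the point-B discussion above.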
Tables 4 and 5 present the 16-bit fixed-point SqueezeNet designs at 170 MHz. Here, we expect the accelerator to be bandwidth bound, so we direct the optimizer to use a heuristic that groups layers by their compute-to-data ratios (Section 4.3). To reduce optimization time, we limit the number of CLPs to at most six. Similar to the AlexNet case, each CLP of the MultiCLP SqueezeNet accelerator finishes its work in roughly the same time, minimizing idling due to work imbalance. The results show a dramatic improvement in arithmetic utilization and thus throughput, up to a 2.33x improvement over the SingleCLP design on the same FPGA. We also observe that the peak bandwidth for SqueezeNet is significantly higher than for AlexNet, due both to the characteristics of the network and to the accelerators running at a higher clock frequency. Although the two MultiCLP SqueezeNet designs require 1.23x and 1.32x more BRAMs than the SingleCLP designs, they provide 1.91x and 2.33x higher throughput with a lower off-chip bandwidth requirement.
To estimate the bandwidth required for a CNN to reach peak throughput on a given FPGA, we set the target throughput of the optimization procedure (Section 4.3) to the best performance achievable when bandwidth is unlimited, then gradually relax the bandwidth constraint until solutions within 2% of this target can be found. The 2% margin is set to avoid overestimating bandwidth requirements due to instantaneous data transfer spikes. Throughputs in Tables 3 and 5 are bandwidth-optimized, whereas cycle counts in Tables 2 and 4 are bandwidth-unconstrained.
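This search can be sketched as follows. Here `throughput_at` stands in for a run of our optimizer under a fixed bandwidth cap; the interface is hypothetical and shown only for illustration:

```python
def required_bandwidth(throughput_at, bw_candidates, margin=0.02):
    """Smallest candidate bandwidth whose achievable throughput is
    within `margin` (the 2% of the text) of the bandwidth-unlimited
    best. `throughput_at(bw)` is a stand-in for an optimizer run with
    an off-chip bandwidth cap of `bw` GB/s."""
    best = throughput_at(float("inf"))  # unconstrained peak
    for bw in sorted(bw_candidates):
        if throughput_at(bw) >= (1.0 - margin) * best:
            return bw
    return None  # no candidate reaches the target
```

The margin keeps short bursts of data transfer from inflating the reported bandwidth requirement.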




Table 5. SqueezeNet designs: modeled resources, utilization, and throughput.

          BRAM     DSP    B/w (GB/s)   Arith. Util. (%)   Thr. (img/s)   GOp/s
485T
  SCLP     400    2,176     19.7           50.3             480.0         372.2
  MCLP     492    2,240     15.3           93.0             913.4         708.3
690T
  SCLP     480    2,784     20.5           41.3             504.1         391.0
  MCLP     635    2,880     19.5           92.9           1,173.0         909.7
6.4. Model Validation
To validate our model, we synthesized and implemented (place and route) four designs using the HLS-based template described in Section 5. We place-and-route the CLPs in isolation, omitting infrastructure like AXI crossbars and memory controllers. First, we use our methodology to design a 32-bit floating point SingleCLP for the 485T FPGA running at 100 MHz; this enables a direct comparison to the SingleCLP HLS results in (Zhang et al., 2015). Then, we evaluate MultiCLP designs for AlexNet on the 485T and 690T FPGAs, and a MultiCLP design for SqueezeNet on the 690T. Tables 6 (AlexNet) and 7 (SqueezeNet) compare our model predictions with the implementation results in terms of DSP slices and BRAMs. For the MultiCLP designs, we compare the metrics of each CLP in addition to the overall values for the entire MultiCLP design.
Table 6. AlexNet model validation: predicted vs. implemented resources.

                    BRAM              DSP
                 model   impl.    model    impl.
485T SingleCLP
  CLP0            618     698     2,240    2,309
485T MultiCLP
  CLP0            130     132       640      689
  CLP1            193     195       480      529
  CLP2            186     242       360      410
  CLP3            222     243       760      815
  Overall         731     812     2,240    2,443
690T MultiCLP
  CLP0            129     131       320      369
  CLP1            193     195       480      529
  CLP2            130     132       640      689
  CLP3            166     226       240      290
  CLP4            160     162       240      290
  CLP5            460     590       960    1,010
  Overall       1,238   1,436     2,880    3,177
Table 7. SqueezeNet model validation: predicted vs. implemented resources.

                    BRAM              DSP
                 model   impl.    model    impl.
690T MultiCLP
  CLP0             24      42       128      227
  CLP1            152     218       192      264
  CLP2             44      78       352      508
  CLP3             72     138       512      592
  CLP4            259     520     1,280    1,416
  CLP5             84     112       416      478
  Overall         635   1,108     2,880    3,494
The model predictions are extremely close to the implemented system, with only minor divergences. For example, the model underestimates the DSP counts by approximately 50–100 DSP slices per CLP, as the model accounts only for the DSP slices used for the compute module of the convolution arithmetic and does not include DSP slices that are used in the address calculations, loop indexing, and control logic. By examining the resource utilization of specific components, we verified that the number of DSP slices used in the compute modules perfectly matches the model prediction. Similarly, we find small discrepancies between the predicted and implemented BRAM counts, caused by the way the tools map memories.
We conclude that the differences observed between the model and the implementation results are not symptoms of the model failing to match the design and its requirements. Instead, the differences occur because the model does not take into account some toolflow-specific considerations. Several minor modifications could be made to correct these errors, at the cost of making the model toolflow-specific. Furthermore, we also performed RTL simulation of the resulting designs; the simulated number of cycles differs from our model only by the pipeline depth of the implementation.
6.5. CNN Accelerator Resource Utilization
Tables 8 and 9 report the total resources (including FPGA flip-flops and LUTs) used by each of the designs, extracted after place-and-route. The counts include the CLPs only, excluding platform-specific memory controllers and crossbars. Closely following our model validation results, we observe that, for the same FPGA target, a MultiCLP implementation uses more DSP slices than the corresponding SingleCLP implementation. Although the compute modules (i.e., the arithmetic units used for the convolution’s multiplications and additions) use the same number of DSP slices, a difference arises due to the logic for address calculation and loop indexing, adding approximately 6% more DSP slices to the MultiCLP designs. Similar increases are seen in the flip-flop and LUT counts; more CLPs require additional control logic beyond the DSP slices and BRAMs. However, ultimately, the DSP slices limit the implementations significantly more than the flip-flops or LUTs. For completeness, we use Vivado to produce post-place-and-route power estimates, which are reported in Watts for each design.
Table 8. AlexNet accelerator resource utilization after place-and-route.

                485T                        690T
           SingleCLP       MultiCLP        MultiCLP
BRAM18K      698 (34%)       812 (39%)     1,436 (49%)
DSP        2,309 (82%)     2,443 (87%)     3,177 (88%)
FF       219,815 (36%)   270,991 (45%)   348,049 (40%)
LUT      146,325 (48%)   176,876 (58%)   236,877 (55%)
Power        6.6 W           7.6 W          10.2 W
Table 9. SqueezeNet MultiCLP (690T) resource utilization after place-and-route.

BRAM18K       DSP           FF             LUT            Power
1,108 (38%)   3,494 (97%)   161,411 (19%)  133,854 (31%)  7.2 W
6.6. Projections to Future FPGAs
In addition to comparing the SingleCLP and MultiCLP designs on the Virtex-7 FPGAs, it is also instructive to examine how the SingleCLP and MultiCLP designs scale as the FPGA resource budget grows. For example, the Xilinx roadmap includes UltraScale+ devices with over 10,000 DSP slices. Figure 7 projects the throughput of the MultiCLP and SingleCLP floating point AlexNet designs for DSP slice budgets ranging from 100 to 10,000. For each point, we perform an optimization to find the best SingleCLP and MultiCLP designs and report the estimated throughput.
The x-axis shows the number of DSP slices used for each design. The BRAM budget is set at a ratio of one BRAM (18Kb) to every 1.3 DSP slices, an approximation of the relationship we observe in the Virtex-7 parts. Dashed vertical lines mark the total number of DSP slices available on the Virtex-7 485T, Virtex-7 690T, Virtex UltraScale+ 9P, and Virtex UltraScale+ 11P FPGAs. Note that the dashed lines are provided only to give perspective on resource capacity; real designs constrain the resources available to the CNN accelerator below the full chip capacity.
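The BRAM budget at each projected point follows the stated ratio; a one-line sketch:

```python
def bram_budget(dsp_slices, ratio=1.3):
    """One 18Kb BRAM allotted per 1.3 DSP slices, the approximate
    Virtex-7 ratio used for the Figure 7 projection."""
    return round(dsp_slices / ratio)
```

For example, a 2,600-DSP budget corresponds to roughly 2,000 BRAMs under this ratio.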
As the number of available DSP slices increases, the throughput difference between the Single and MultiCLP designs grows. For example, going from 2,240 to 9,600 DSP slices, the MultiCLP improvement over SingleCLP designs increases from 1.3x to 3.3x.
7. Related Work
Eyeriss (Chen et al., 2017, 2016) is a recent ASIC CNN accelerator that couples a compute grid with a NoC, enabling flexibility in scheduling CNN computation. This flexibility limits arithmetic unit underutilization. However, underutilization still exists when a CNN layer’s kernel size and output feature map height are incompatible with the dimensions of the compute grid.
(Li et al., 2016) proposes an FPGA accelerator for AlexNet that has one module per layer, and thus can achieve high arithmetic unit utilization. However, this design stores all intermediate data of the network on chip, limiting the size of the network that can be supported with this approach. Moreover, as discussed in Section 4.1, building one module per layer does not work well for larger networks.
Our baseline SingleCLP design is based on (Zhang et al., 2015). Similar designs are used in (Chen et al., 2014; Chen et al., 2014). Other recent works propose different CNN acceleration hardware. For example, (Farabet et al., 2009; Farabet et al., 2010, 2011; Sankaradas et al., 2009; Chakradhar et al., 2010) focus on 2D-convolvers, which play the roles of both compute modules and data caches, while (Peemen et al., 2013, 2015) use FMA units for computation. The key differences between these approaches are the order of data transfer and the choice of memory organization. Several key similarities cause these methods to suffer from the underutilization problem we observe in our SingleCLP design. For example, the 2D-convolvers used in (Farabet et al., 2009; Farabet et al., 2010; Sankaradas et al., 2009; Chakradhar et al., 2010) must be provisioned for the largest filter across layers; they will necessarily be underutilized when computing layers with smaller filters. In (Peemen et al., 2013), the organization of the compute modules depends on the number of output feature maps and their number of rows. Both of these parameters typically change across layers, resulting in an analogous resource underutilization problem. Our MultiCLP resource partitioning technique can be used by these designs to improve arithmetic unit utilization.
C-Brain (Song et al., 2016) offers an orthogonal approach, transforming a convolution with stride greater than one into multiple stride-1 convolutions to increase PE utilization for CLPs. However, this method can only be used when the convolution stride of a layer is greater than one, and its effectiveness depends on the stride size.

Several recent works explored other promising, but orthogonal, aspects of CNN accelerators. (Albericio et al., 2016) proposes a CNN accelerator design that can skip computations on input values that are zeros. (Qiu et al., 2016; Judd et al., 2016) reduce an accelerator’s bandwidth and buffer use: (Qiu et al., 2016) uses per-layer data quantization and matrix decomposition, whereas (Judd et al., 2016) uses per-layer numerical precision reduction. (Alwani et al., 2016) uses a fused-layer technique to reduce the bandwidth use of convolutional layers. (Shen et al., 2017) optimizes batch sizes to reduce off-chip data transfer. These techniques can be integrated into MultiCLP designs.
(Sharma et al., 2016) and (Wang et al., 2016) propose complete frameworks for generating FPGA-based accelerators from CNN specifications. Our MultiCLP approach can be integrated into these frameworks to improve the performance of auto-generated accelerators. (Chi et al., 2016) and (Shafiee et al., 2016) explore in-memory processing to accelerate CNNs. (Suda et al., 2016) develop an OpenCL-based HLS tool to implement CNN accelerators that use different modules for different kinds of layers, but all convolutional layers are computed with a single CLP.
8. Conclusions
The traditional approach to FPGA-based CNN accelerator design follows a “one size fits all” methodology, where a single convolutional layer processor (CLP) computes all convolutional layers of the CNN. In this paper, we observed that variation in the dimensions of the CNN layers limits the throughput of this “SingleCLP” approach: on layers whose dimensions are a poor fit for the CLP parameters, the arithmetic units exhibit low dynamic utilization, with adders and multipliers frequently remaining idle. To overcome this inefficiency, we presented a new design paradigm that partitions hardware resources among multiple cooperating CLPs. Our “MultiCLP” approach allows the CLP dimensions to more closely match the CNN layer dimensions, resulting in better dynamic resource utilization and higher throughput.
The optimization algorithm we developed finds efficient MultiCLP designs within a given resource budget (DSP slices, BRAMs, and bandwidth). For example, on the Virtex-7 690T FPGA, we showed that a MultiCLP accelerator yields 3.8x higher throughput than the state-of-the-art SingleCLP design when accelerating AlexNet with 16-bit fixed-point arithmetic, corresponding to an improvement in dynamic utilization from 24% to 91%. For the more recent SqueezeNet and GoogLeNet networks, our method results in speedups of 2.2x and 2.0x, respectively. Further, we showed that the disparity between the throughput of the SingleCLP and MultiCLP designs grows rapidly as the resource budget increases.
Acknowledgements.
The authors would like to thank Cheng-Yang Fu and Alex C. Berg from the Computer Vision Group at the University of North Carolina at Chapel Hill for their help. This material is based on work supported by the National Science Foundation (NSF) under Grant Nos. 1533739 and 1453460. The experiments were conducted with equipment purchased through NSF CISE Research Infrastructure Grant No. 1405641.

References
Albericio et al. (2016) Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Cnvlutin: Ineffectual-neuron-free Deep Neural Network Computing. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA ’16). IEEE Press, Piscataway, NJ, USA, 1–13. DOI:https://doi.org/10.1109/ISCA.2016.11
Alwani et al. (2016) M. Alwani, H. Chen, M. Ferdman, and P. Milder. 2016. Fused-layer CNN accelerators. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’16). IEEE Computer Society, Washington, DC, USA, 1–12. DOI:https://doi.org/10.1109/MICRO.2016.7783725
 Chakradhar et al. (2010) Srimat Chakradhar, Murugan Sankaradas, Venkata Jakkula, and Srihari Cadambi. 2010. A Dynamically Configurable Coprocessor for Convolutional Neural Networks. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA ’10). ACM, New York, NY, USA, 247–257. DOI:https://doi.org/10.1145/1815961.1815993
Chen et al. (2014) Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A Small-footprint High-throughput Accelerator for Ubiquitous Machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’14). ACM, New York, NY, USA, 269–284. DOI:https://doi.org/10.1145/2541940.2541967
Chen et al. (2014) Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’14). IEEE Computer Society, Washington, DC, USA, 609–622. DOI:https://doi.org/10.1109/MICRO.2014.58
Chen et al. (2016) Yu-Hsin Chen, Joel Emer, and Vivienne Sze. 2016. Eyeriss: A Spatial Architecture for Energy-efficient Dataflow for Convolutional Neural Networks. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA ’16). IEEE Press, Piscataway, NJ, USA, 367–379. DOI:https://doi.org/10.1109/ISCA.2016.40
Chen et al. (2017) Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. 2017. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits 52, 1 (Jan 2017), 127–138. DOI:https://doi.org/10.1109/JSSC.2016.2616357
Chi et al. (2016) Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA ’16). IEEE Press, Piscataway, NJ, USA, 27–39. DOI:https://doi.org/10.1109/ISCA.2016.13
 Collobert and Weston (2008) Ronan Collobert and Jason Weston. 2008. A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. In Proceedings of the 25th International Conference on Machine Learning (ICML ’08). ACM, New York, NY, USA, 160–167. DOI:https://doi.org/10.1145/1390156.1390177
 Farabet et al. (2010) Clément Farabet, Berin Martini, Polina Akselrod, Selçuk Talay, Yann LeCun, and Eugenio Culurciello. 2010. Hardware accelerated convolutional neural networks for synthetic vision systems. In Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS ’10). 257–260. DOI:https://doi.org/10.1109/ISCAS.2010.5537908
 Farabet et al. (2011) Clément Farabet, Berin Martini, Benoit Corda, Polina Akselrod, Eugenio Culurciello, and Yann LeCun. 2011. NeuFlow: A runtime reconfigurable dataflow processor for vision. In CVPR 2011 WORKSHOPS. 109–116. DOI:https://doi.org/10.1109/CVPRW.2011.5981829
Farabet et al. (2009) Clément Farabet, Cyril Poulet, Jefferson Y. Han, and Yann LeCun. 2009. CNP: An FPGA-based processor for Convolutional Networks. In Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL ’09). 32–37. DOI:https://doi.org/10.1109/FPL.2009.5272559
Iandola et al. (2016) Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. 2016. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR abs/1602.07360 (2016).
Judd et al. (2016) Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks. In Proceedings of the 2016 International Conference on Supercomputing (ICS ’16). ACM, New York, NY, USA, Article 23, 23:1–23:12 pages. DOI:https://doi.org/10.1145/2925426.2926294
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS ’12). Curran Associates Inc., Red Hook, NY, USA, 1097–1105.
Li et al. (2016) Huimin Li, Xitian Fan, Li Jiao, Wei Cao, Xuegong Zhou, and Lingli Wang. 2016. A high performance FPGA-based accelerator for large-scale convolutional neural networks. In Proceedings of the 26th International Conference on Field Programmable Logic and Applications (FPL ’16). IEEE Computer Society, Los Alamitos, CA, USA, 1–9. DOI:https://doi.org/10.1109/FPL.2016.7577308
Oord et al. (2013) Aäron van den Oord, Sander Dieleman, and Benjamin Schrauwen. 2013. Deep Content-based Music Recommendation. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS ’13). Curran Associates Inc., Red Hook, NY, USA, 2643–2651.
Peemen et al. (2015) Maurice Peemen, Bart Mesman, and Henk Corporaal. 2015. Inter-tile Reuse Optimization Applied to Bandwidth Constrained Embedded Accelerators. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE ’15). EDA Consortium, San Jose, CA, USA, 169–174.
Peemen et al. (2013) Maurice Peemen, Arnaud A. A. Setio, Bart Mesman, and Henk Corporaal. 2013. Memory-centric accelerator design for Convolutional Neural Networks. In Proceedings of the 31st IEEE International Conference on Computer Design (ICCD ’13). 13–19. DOI:https://doi.org/10.1109/ICCD.2013.6657019
Putnam et al. (2014) Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2014. A Reconfigurable Fabric for Accelerating Large-scale Datacenter Services. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA ’14). IEEE Press, Piscataway, NJ, USA, 13–24. DOI:https://doi.org/10.1145/2678373.2665678
Qiu et al. (2016) Jiantao Qiu, Jie Wang, Song Yao, Kaiyuan Guo, Boxun Li, Erjin Zhou, Jincheng Yu, Tianqi Tang, Ningyi Xu, Sen Song, Yu Wang, and Huazhong Yang. 2016. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. In Proceedings of the 24th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’16). ACM, New York, NY, USA, 26–35. DOI:https://doi.org/10.1145/2847263.2847265
Sankaradas et al. (2009) Murugan Sankaradas, Venkata Jakkula, Srihari Cadambi, Srimat Chakradhar, Igor Durdanovic, Eric Cosatto, and Hans Peter Graf. 2009. A Massively Parallel Coprocessor for Convolutional Neural Networks. In Proceedings of the 20th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP ’09). IEEE Computer Society, Washington, DC, USA, 53–60. DOI:https://doi.org/10.1109/ASAP.2009.25
Shafiee et al. (2016) Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar. 2016. ISAAC: A Convolutional Neural Network Accelerator with In-situ Analog Arithmetic in Crossbars. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA ’16). IEEE Press, Piscataway, NJ, USA, 14–26. DOI:https://doi.org/10.1109/ISCA.2016.12
Sharma et al. (2016) Hardik Sharma, Jongse Park, Divya Mahajan, Emmanuel Amaro, Joon Kyung Kim, Chenkai Shao, Asit Mishra, and Hadi Esmaeilzadeh. 2016. From high-level deep neural models to FPGAs. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’16). IEEE Computer Society, Washington, DC, USA, 1–12. DOI:https://doi.org/10.1109/MICRO.2016.7783720
Shen et al. (2017) Yongming Shen, Michael Ferdman, and Peter Milder. 2017. Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer. In Proceedings of the 25th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM ’17). IEEE Computer Society, Los Alamitos, CA, USA.
Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014).
Song et al. (2016) Lili Song, Ying Wang, Yinhe Han, Xin Zhao, Bosheng Liu, and Xiaowei Li. 2016. C-Brain: A Deep Learning Accelerator That Tames the Diversity of CNNs Through Adaptive Data-level Parallelization. In Proceedings of the 53rd Annual Design Automation Conference (DAC ’16). ACM, New York, NY, USA, Article 123, 123:1–123:6 pages. DOI:https://doi.org/10.1145/2897937.2897995
Suda et al. (2016) Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. In Proceedings of the 24th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’16). ACM, New York, NY, USA, 16–25. DOI:https://doi.org/10.1145/2847263.2847276
Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’15). 1–9. DOI:https://doi.org/10.1109/CVPR.2015.7298594
Wang et al. (2016) Ying Wang, Jie Xu, Yinhe Han, Huawei Li, and Xiaowei Li. 2016. DeepBurning: Automatic Generation of FPGA-based Learning Accelerators for the Neural Network Family. In Proceedings of the 53rd Annual Design Automation Conference (DAC ’16). ACM, New York, NY, USA, Article 110, 110:1–110:6 pages. DOI:https://doi.org/10.1145/2897937.2898003
 Xilinx (2016) Xilinx. 2016. 7 Series FPGAs Memory Resources User Guide. (2016).
Zhang et al. (2015) Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. In Proceedings of the 23rd ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA ’15). ACM, New York, NY, USA, 161–170. DOI:https://doi.org/10.1145/2684746.2689060