I Introduction
There has been much recent work on developing FPGA implementations of Convolutional Neural Networks (CNNs). While significant progress has been made in optimising the inference process of general CNN models on FPGAs, training and optimising CNNs for various domain-specific applications remains a demanding task. CNN models for domain-specific applications only need to detect or classify objects from a narrow range of classes. Recent work in transfer learning [1], a research topic focusing on exploiting features reusable from one task to another, shows that CNN models pre-trained on general datasets can be efficiently fine-tuned [2] for specific domains. This approach works well for medical image analysis: a pre-trained CNN with adequate fine-tuning can outperform or perform as well as a CNN trained from scratch [3].
While the transfer learning approach is promising, the challenge is to exploit it for domain-specific applications on FPGA, where efficient processing is vital. For tasks in a specific domain, standard convolution layers dedicated to extracting general features are over-parameterised and can be replaced by efficient convolution blocks, which consist of multiple small convolution layers with far fewer parameters. Example blocks are bottleneck [4], depthwise separable [5], and separable bottleneck [6]. They can reduce computational redundancy while maintaining satisfactory accuracy. Moreover, since a layer-replaced model can normally be fine-tuned easily, the cost of layer replacement is minor. However, these blocks are rarely explored or implemented in previous work on FPGA acceleration of CNNs.
This paper proposes TuRF, a novel framework that generates efficient CNN models on FPGA for domain-specific applications (Fig. 1). TuRF accepts a CNN model pre-trained on a large-scale dataset, replaces selected standard convolution layers with various convolution blocks, fine-tunes and evaluates the layer-replaced model, and finally outputs an efficient FPGA design. To process convolution blocks efficiently, TuRF generates FPGA designs that fuse their inner convolution layers. The major contributions are as follows:

Characterisation of the design space of CNN models for domain-specific applications, and a transfer-learning-inspired layer-wise optimisation that replaces standard convolution layers by blocks with fine-tuning (Section V).
II Motivations and Background
Implementing CNNs on FPGA for domain-specific applications is challenging. Most previous efforts focus on data quantisation and binarisation [8, 9], arithmetic transformations [10], or exploiting model sparsity with pruning [11]. However, it is difficult to apply these methods from a domain-specific perspective because their evaluations and results are based on specific CNN models and are not guaranteed to be reproducible with other models. Also, no clear correlation between accuracy and data representation or sparsity has been discovered yet. Finally, a sparse CNN model is much harder to process on FPGA, and different sparsity patterns can result in very different performance.
The motivation for the proposed framework rests on the recent trend in efficient CNN architecture design [4, 5, 6]. Essentially, the redundancy in a CNN is sometimes architectural and can be reduced by replacing the standard convolution layer with efficient convolution blocks, as shown in Fig. 2. A convolution block is a set of convolution layers that extracts features as a whole. More importantly, a convolution block generally consumes fewer resources than its equivalent standard convolution layer. Compared to quantisation and pruning, which rely on statistical properties of pre-trained CNN models, this approach is more generic and has been evaluated for different domain-specific applications, as shown in [5]. One of the major tasks of TuRF is to explore the optimisation opportunity of switching from convolution layers to blocks. Since there is no dedicated FPGA implementation for convolution blocks, TuRF also aims to accelerate CNNs for domain-specific applications by exploring the FPGA implementation of convolution blocks.
II-A Convolution Layer and Convolution Block
A convolution layer correlates an input feature map with a filter. Suppose the input is an image with multiple channels and two spatial dimensions, and the filter is a 4D tensor indexed by output channel, input channel, and the two kernel dimensions. The resulting feature map is defined as (1), in which each output channel is the sum over input channels of the spatial convolution between an input channel and the corresponding kernel.
(1)  
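As a sketch under assumed notation, the standard convolution layer can be written as below. The symbol names are illustrative assumptions rather than the paper's own: $D$ is the input with $C$ channels, $W$ is the 4D filter with $F$ output channels and $K \times K$ kernels, $Y$ is the output feature map, and $*$ is the spatial convolution operator.

```latex
Y_f \;=\; \sum_{c=1}^{C} D_c * W_{f,c}, \qquad f = 1, \dots, F
```

Each output channel $Y_f$ therefore needs $C$ spatial convolutions with $K \times K$ kernels, which is where the $K^2 \cdot C \cdot F$ parameter count of a standard layer comes from.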
A standard convolution layer can be replaced by a convolution block to improve efficiency. There are four basic types:
II-A1 Stacked Block
It simply stacks two standard convolution layers together and reduces the number of channels. Its input and output are connected by a shortcut connection. Please refer to ResNet34 [4] for more details.
II-A2 Depthwise Separable
It is proposed in [12, 13, 5], where the spatial and cross-channel correlations are treated separately using depthwise and pointwise convolution respectively. The depthwise convolution only performs spatial convolution within each channel of the input feature map, and the pointwise convolution is a special case of the standard convolution with the kernel size set to 1. With a 3D depthwise filter and a 2D pointwise filter, (2) defines the depthwise separable convolution layer as described in [5].
(2) 
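The two stages of a depthwise separable layer can be illustrated with a minimal pure-Python sketch. The function names and the nested-list tensor layout are assumptions made for this example; stride 1 and no padding are assumed, and the code is meant to show the data dependencies rather than to be efficient.

```python
def depthwise_conv(x, dw):
    """x: [C][H][W] input, dw: [C][K][K] depthwise filter.

    Returns a [C][H-K+1][W-K+1] feature map (stride 1, no padding).
    """
    C, H, W, K = len(x), len(x[0]), len(x[0][0]), len(dw[0])
    out = [[[0] * (W - K + 1) for _ in range(H - K + 1)] for _ in range(C)]
    for c in range(C):  # every channel is filtered independently
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                out[c][i][j] = sum(
                    x[c][i + u][j + v] * dw[c][u][v]
                    for u in range(K) for v in range(K))
    return out


def pointwise_conv(x, pw):
    """x: [C][H][W] input, pw: [F][C] pointwise (1x1) filter -> [F][H][W]."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[[sum(pw[f][c] * x[c][i][j] for c in range(C))
              for j in range(W)] for i in range(H)]
            for f in range(len(pw))]


def depthwise_separable_conv(x, dw, pw):
    """Spatial correlation first (depthwise), then cross-channel (pointwise)."""
    return pointwise_conv(depthwise_conv(x, dw), pw)
```

Only the pointwise stage mixes channels, so the spatial kernels are shared across all output channels.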
II-A3 Bottleneck Block
A recent trend in constructing CNNs is the prevalent use of the bottleneck block, as demonstrated in ResNet50 [4], which is economical and easy to train for deeper networks. A bottleneck block consists of a stack of 1×1, 3×3, and 1×1 convolution layers, in which the first and last reduce and increase the number of channels respectively. The input of the bottleneck block is the output of the previous stack, and a residual connection performing element-wise addition links the input of the block to its output.
II-A4 Separable Bottleneck
Evolved from the original bottleneck block, the linear bottleneck is proposed and used in MobileNet V2 [6] to improve efficiency. Compared to the original, the middle convolution is replaced with its depthwise version and the activation layer after the last convolution is removed, so as to combine the efficiency of depthwise separable convolution with the effectiveness of the bottleneck block.
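To make the parameter savings of these blocks concrete, the following sketch counts convolution weights for each block type. The function names and the choice of a free `c_mid` width are assumptions for illustration; biases, batch normalisation parameters, and shortcut connections are ignored.

```python
def standard_params(c_in, c_out, k=3):
    """Weights of one standard k x k convolution layer."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """k x k depthwise filter per input channel, then a 1x1 pointwise layer."""
    return k * k * c_in + c_in * c_out


def bottleneck_params(c_in, c_out, c_mid, k=3):
    """1x1 (c_in -> c_mid), k x k (c_mid -> c_mid), 1x1 (c_mid -> c_out)."""
    return c_in * c_mid + k * k * c_mid * c_mid + c_mid * c_out


def separable_bottleneck_params(c_in, c_out, c_mid, k=3):
    """Bottleneck whose middle k x k layer is replaced by a depthwise layer."""
    return c_in * c_mid + k * k * c_mid + c_mid * c_out
```

For example, with 256 input and output channels, a 3×3 standard layer holds 589,824 weights, while `depthwise_separable_params(256, 256)` gives 67,840 and `bottleneck_params(256, 256, 64)` gives 69,632.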
II-B Winograd Algorithm
Although it dates back to the 1980s, Winograd's minimal filtering algorithm [14] has proven powerful in optimising convolution layer processing in recent research [7]. An instance of the 2D Winograd algorithm, denoted by $F(m \times m, r \times r)$ where $m \times m$ is the 2D output tile size and $r \times r$ is the filter kernel size, can be formulated as (3),
(3)  $Y = A^{T}\left[(G g G^{T}) \odot (B^{T} d B)\right] A$
where $\odot$ is the Hadamard product. $A$, $G$, and $B$ are three transformation matrices of shape $(m+r-1) \times m$, $(m+r-1) \times r$, and $(m+r-1) \times (m+r-1)$ respectively. $g$ is an $r \times r$ filter kernel and $d$, of size $(m+r-1) \times (m+r-1)$, is a tile of the input feature map. Compared to standard convolution, which needs $m^2 r^2$ multiplications to produce $m^2$ output elements, 2D Winograd-based convolution reduces the count to $(m+r-1)^2$ multiplications, which is equivalent to an $m^2 r^2 / (m+r-1)^2$ times speedup. For details about the Winograd algorithm, please refer to [7].
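As a worked instance, the sketch below evaluates the 2D Winograd algorithm for the small configuration F(2×2, 3×3), using the transformation matrices given in [7], and checks it against direct sliding-window computation. This configuration is chosen here only for brevity and is not necessarily the one used in the designs of this paper; the helper names are assumptions.

```python
from fractions import Fraction as Fr

# Transformation matrices for F(2x2, 3x3) as given in [7].
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0],
     [Fr(1, 2), Fr(1, 2), Fr(1, 2)],
     [Fr(1, 2), Fr(-1, 2), Fr(1, 2)],
     [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def winograd_2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 kernel -> 2x2 output tile."""
    U = matmul(matmul(G, g), transpose(G))      # G g G^T, 4x4
    V = matmul(matmul(BT, d), transpose(BT))    # B^T d B, 4x4
    # Hadamard product: 16 multiplications instead of 2*2*3*3 = 36.
    M = [[u * v for u, v in zip(ru, rv)] for ru, rv in zip(U, V)]
    return matmul(matmul(AT, M), transpose(AT))  # A^T M A, 2x2


def direct_2x2_3x3(d, g):
    """Reference: direct sliding-window computation on the same tile."""
    return [[sum(d[i + u][j + v] * g[u][v]
                 for u in range(3) for v in range(3))
             for j in range(2)] for i in range(2)]
```

Exact rational arithmetic (`Fraction`) is used so that the two results can be compared with `==`; a fixed-point hardware implementation would instead have to account for the scaling introduced by the 1/2 entries of `G`.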
II-C Efficient CNN Models
In this paper, we mainly study three efficient CNN models, ResNet50 [4], MobileNet V1 [5], and MobileNet V2 [6], to gain insight for our hardware template and to provide a comparison with the CNN models generated for domain-specific applications. Default configurations are used for these networks. As shown in Fig. 3, convolution blocks are responsible for most of the operations and parameters. These models are also compared to VGG16 [18] in Table I, which also lists the ImageNet top-1 accuracy of each model.
Model        Block^{1}  Ops (GOP)  Params (M)  Top-1 (%)
VGG16        -          30.95      138.3       71.5
ResNet50     (c)        7.72       24.3        75.2
MobileNetV1  (b)        1.14       4.01        70.9
MobileNetV2  (d)        0.61       3.31        71.9

^{1} (b), (c), (d) are symbols that denote convolution blocks in Fig. 2.
III Hardware Design Template
TuRF uses a scalable hardware design template as a fundamental component of the framework. The template enables generation of optimised CNN hardware by supporting various convolution types. In particular, the convolution blocks discussed in Section II are the primary focus of the template. The Winograd transformation is also used to accelerate spatial convolution.
III-A Design Template Overview
Our design template can be configured to support all the layers used in recent efficient CNN models. We focus on the convolution layer and convolution blocks, which are the most time-consuming parts of these models. An accelerator for a convolution layer or block can be constructed from the basic building modules in our template. Each module can be configured in terms of its level of parallelism and computation sequence. Our design is implemented with fixed-point representation by default, and its configuration is decided by the data range.
Similar to [19], a design module is described as a tuple consisting of a set of module configurations and the widths of its input and output streams. The module configurations comprise:

Tile shape: the height, width, and number of channels of the input, and the number of output channels.

Level of parallelism: the number of elements to process in parallel along the input channel, output channel, height, and width axes.

Layer specifics: the kernel size, denoted by a symbol that replaces the one used in Section II.
III-B Basic Building Modules
III-B1 Line Buffer
Given by (4), the line buffer is a module that creates sliding windows over an input feature map. Its window size is either the convolution kernel size or the Winograd input tile size. This module is implemented using shift registers organised into rows.
(4) 
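The behaviour of a line buffer can be modelled in software as follows. This is a minimal sketch for a single-channel feature map streamed in row-major order; the function name and the use of one `deque` to stand in for the chained rows of shift registers are assumptions of the example.

```python
from collections import deque


def line_buffer(stream, width, k):
    """Yield k x k sliding windows from a row-major pixel stream.

    width: number of pixels per row of the feature map.
    k:     window size (kernel size or Winograd input tile size).
    """
    # A k x k window spans (k - 1) full rows plus k pixels of the stream.
    depth = (k - 1) * width + k
    regs = deque(maxlen=depth)  # models the chain of shift registers
    for idx, pixel in enumerate(stream):
        regs.append(pixel)
        col = idx % width
        # A window is valid once the registers are full and the window
        # does not wrap around the end of a row.
        if len(regs) == depth and col >= k - 1:
            yield [[regs[r * width + c] for c in range(k)] for r in range(k)]
```

Each new pixel shifts the registers by one position, so one window is produced per cycle once the buffer is primed, without re-reading earlier rows from memory.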
III-B2 Input and Output Buffers
III-B3 Winograd Transformation
The Winograd algorithm is applied to standard and depthwise convolution to reduce the computational complexity. According to (3), three transformation modules are required to process a Winograd convolution. Parameterised by the Winograd tile size, (7), (8), and (9) give the configurations and interfaces of the transformation modules for the input feature map, the weights, and the output respectively. Each transformation consists of two multiplications between an input and a constant matrix, which are implemented with either LUT-based multipliers or shift operators (for constants) to save resources.
(7)  
(8)  
(9) 
III-B4 Arithmetic Module
Most of the arithmetic computation in a typical CNN workload is dot-product, which is employed in spatial and cross-channel convolution as well as in fully-connected (FC) layers. Each dot-product module consists of an array of multipliers followed by an adder tree. The dot-product modules are further organised into a higher-level array for parallelisation. This module can be shared among convolution and FC layers when necessary.
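The structure of a dot-product module, an array of multipliers feeding an adder tree, can be sketched in software as below. The function names are assumptions for this example; in hardware each level of the tree would typically be one pipeline stage.

```python
def adder_tree(values):
    """Sum values by pairwise reduction, one tree level per iteration."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:        # pad odd-sized levels with a zero input
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def dot_product(a, b):
    """Multiplier array (one product per lane) followed by an adder tree."""
    products = [x * y for x, y in zip(a, b)]
    return adder_tree(products)
```

With N parallel multipliers the tree has about log2(N) levels, which is why the same module can serve spatial convolution, pointwise convolution, and FC layers by only changing which operands are routed to its lanes.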
III-B5 Other Design Modules
An element-wise addition module performs addition of two identically sized feature maps. An activation module implements non-linear activation functions. A normalisation module normalises its input by Batch Normalisation. We omit details about these modules because they are simple and have limited impact on the overall performance.
III-C Implementation of a Single Layer
An accelerator constructed from the building modules can perform the computation of only one layer at a time, in contrast to the fused-layer design (Section IV). Fig. 4 shows the system diagram when a single layer is implemented. It supports different types of convolution, such as depthwise or pointwise, and fully-connected layers by efficiently sharing dot-products in the arithmetic module. Modules are connected using dataflow streams with matching input and output widths. Outputs from building modules should be consumed immediately to avoid congestion. In each design, a global state controller together with counters assigns addresses for buffers, enables or disables read and write actions, and controls dataflow directions. Multiplexers are implicitly inserted between the columns in the figure to control computation. The arithmetic module is shared by convolution, with or without Winograd transformation, and the fully-connected layer.
The computation sequence of convolution, which is a permutation of the interchangeable nested loop indices in (1), has a large impact on the architectural structure. The impact of such permutations on an individual convolution layer has been extensively studied [20, 21]. Therefore, we only discuss two computation sequences, filter-major and channel-major, and present their impact on buffer sizes and pipelining in the rest of this paper. The size of the buffer for the major index is linear in its parallel factor; for example, the size of the output buffer is linear in the parallel factor of the index it is iterated with. The pipeline behaviour between two adjacent layers can differ if their computation sequences are configured in different ways (6). This is discussed further in the next section.
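The effect of the loop order can be illustrated on a pointwise layer. The sketch below assumes that filter-major means the output-filter index is the outer loop and channel-major means the input-channel index is the outer loop; this interpretation, the function name, and the step counting are assumptions of the example rather than definitions taken from the text.

```python
def pointwise_layer(x, w, order):
    """x: list of C input values, w: [F][C] weights.

    Returns (outputs, done_at), where done_at[f] is the multiply-accumulate
    step after which output f has received all of its contributions.
    """
    F, C = len(w), len(x)
    if order == "filter-major":
        loop = [(f, c) for f in range(F) for c in range(C)]
    elif order == "channel-major":
        loop = [(f, c) for c in range(C) for f in range(F)]
    else:
        raise ValueError(f"unknown order: {order}")
    y = [0] * F
    done_at = [0] * F
    for step, (f, c) in enumerate(loop, start=1):
        y[f] += w[f][c] * x[c]
        if c == C - 1:            # last input channel for this output
            done_at[f] = step
    return y, done_at
```

Both orders give the same outputs, but under filter-major the first output is final after C steps and can be passed on at once, whereas under channel-major all outputs stay partial until the last input channel is processed, so F partial sums must be buffered.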
IV Fused Convolution Blocks
Convolution blocks account for most of the operations according to Fig. 3 and, following Amdahl's law, should be well optimised to improve performance. Similar to previous work [22], a baseline accelerator for the convolution block is mainly based on layer-by-layer execution. This approach incurs significant off-chip data transfer and consequently cannot fully exploit the potential of pipelining CNN layers.
To overcome this drawback, we propose a fused accelerator for the convolution block that enables the computation of all layers to complete in one launch. The benefits of layer fusion are explained by the roofline model [23], and Fig. 5 illustrates the possible benefits of layer fusion for different convolution blocks. We extend the analysis from [10] by adding computational roofs for three convolution blocks. Note that the bandwidth of the evaluation system (Maxeler MPC-X node) is so large that only depthwise separable blocks can take advantage of layer fusion. However, all convolution blocks can benefit from layer fusion at lower bandwidths, which are common for commodity FPGA devices.
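The roofline argument can be sketched numerically. In the roofline model [23], attainable performance is the minimum of the computational roof and the product of memory bandwidth and operational intensity. The function names below and the simple traffic model (an unfused block writes each intermediate feature map off-chip once and reads it back once) are assumptions made for illustration.

```python
def attainable_performance(peak_gops, bandwidth_gb_per_s, intensity):
    """Roofline: min(compute roof, bandwidth * operational intensity).

    intensity is in operations per byte of off-chip traffic.
    """
    return min(peak_gops, bandwidth_gb_per_s * intensity)


def block_intensity(ops, io_bytes, intermediate_bytes, fused):
    """Operational intensity of a convolution block.

    io_bytes:           block input + weights + block output traffic.
    intermediate_bytes: total size of feature maps between inner layers.
    A fused block keeps intermediates on-chip; an unfused block is assumed
    to write each intermediate once and read it back once.
    """
    traffic = io_bytes if fused else io_bytes + 2 * intermediate_bytes
    return ops / traffic
```

Fusion raises the operational intensity, so it only improves attainable performance while the block sits on the bandwidth-limited slope of the roofline; with a high bandwidth such as the 38 GB/s of the evaluation platform, a block may already be compute-bound before fusion.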
The idea of layer fusion is inspired by [24, 25]. These works only fuse standard convolution layers with uniform kernel size and target high-end FPGAs. Layer fusion is more difficult in our case because there are other convolution variants, and FPGAs do not always have sufficient resources to fully place the fused block. We improve on the previous approaches to address the following new challenges:

The fusion method can support various convolution types.

There are many options to explore when the layers are fused, such as buffer size and computation sequence.

Tiling must be considered to support small FPGAs.
IV-A Fusion Method for General Layer
We propose a method that can automatically generate the fused design for typical convolution blocks in the following steps. 1) The hardware implementation of each convolution layer is selected layer by layer. 2) The input and output buffers between adjacent layers are combined. 3) The final configurations are aggregated and decided by the predicted latency.
The first step can be easily implemented because our design template supports the convolution types in all known blocks. In the second step, we decide the buffer usage between two adjacent layers by the layer types and performance requirements. If two adjacent layers are standard and pointwise convolution, using a single buffer minimises the area cost but may stall the entire pipeline, as the previous layer can only write into the buffer when the subsequent layer finishes its computation. Doubling the buffer eliminates this issue at an increased area cost.
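The trade-off between a single and a double buffer can be illustrated with a simple timing model. The model below is a hypothetical simplification, not the cycle-accurate simulator used in this work: the producer layer needs `t_produce` cycles to fill a buffer with one tile and the consumer layer needs `t_consume` cycles to drain it.

```python
def single_buffer_latency(n_tiles, t_produce, t_consume):
    """Producer and consumer take turns on one buffer, so their times add."""
    return n_tiles * (t_produce + t_consume)


def double_buffer_latency(n_tiles, t_produce, t_consume):
    """Producing tile i+1 overlaps with consuming tile i."""
    if n_tiles == 0:
        return 0
    # Fill the first buffer, run the overlapped steady state at the pace
    # of the slower side, then drain the last tile.
    return t_produce + (n_tiles - 1) * max(t_produce, t_consume) + t_consume
```

For instance, with 4 tiles, 3 cycles to produce and 2 cycles to consume a tile, the single-buffer pipeline takes 20 cycles while the double-buffered one takes 14, at the cost of twice the buffer area.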
The third step determines the final configurations of the fused accelerator, which include the level of parallelism, buffer sizes, and computation sequences. Each layer in a given module has its own parallelisation parameters, and its computation sequence is denoted as either filter-major or channel-major.
IV-A1 Parallelisation Parameters
The parameters of a fused design should satisfy the constraints in (10), which ensure that the widths of connected input and output ports match across all design modules. As a consequence of (10), the parallelisation parameters of a fused design can be expressed with a reduced set of parameters.
(10) 
Output  Input  Double Buffering  
inefficient 
IV-A2 Buffer Size
The size of a buffer depends on the computation sequences of the layers that it connects. The first input and the last output buffers are similar to the ones in Section III-B, while the intermediate buffers are more complicated to analyse. The size of an intermediate buffer, which connects two adjacent layers, depends on the computation sequences of those two layers and on the following options: 1) using the same size as the input or output buffer; 2) double buffering to avoid stalling the pipeline.
Table II lists the size of the intermediate buffer under different configurations. The first column is the sequence of the two layers connected by a buffer. Double buffering is only applied when it actually improves pipeline performance. A configuration is inefficient if the buffer is too small to store the required input or output.
IV-A3 Computation Sequence and Pipeline
The fused design is a streaming architecture in which the computation of all layers is pipelined. The computation sequence of each layer can affect the pipelining, as illustrated in Fig. 6: one combination of sequences has a lower latency than the other (used in [25]) because the first output finalised by a layer can be immediately consumed by the next layer. For more complicated cases, we implement a cycle-accurate simulator to enumerate all combinations of computation sequences and evaluate their latency. Apart from latency, we also consider buffer sizes, since they are affected by computation sequences as shown in Table II. Fig. 7 shows the exploration result for the bottleneck and stacked blocks.
IV-A4 Tiling
A convolution block can be tiled into smaller pieces when the on-chip resources are limited. Specifically, a tile is defined for a convolution block across all of its layers. Unlike tiling a convolution layer, which mainly introduces an off-chip transfer overhead, tiling a convolution block can also incur much redundant computation. Therefore, tiling configurations should be carefully explored to avoid such cases.
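The source of redundant computation in a tiled, fused block can be sketched as follows. The example assumes square tiles, stride 1, and no padding, and the function names are hypothetical. To produce one output tile, each earlier layer must produce a larger tile that includes a halo, and the halo is recomputed by neighbouring tiles.

```python
def tile_sizes(out_tile, kernels):
    """Tile edge lengths through a fused block of stride-1 layers.

    kernels: kernel sizes from the first to the last layer.
    Returns sizes where sizes[i] is the input tile edge of layer i and
    sizes[-1] is the output tile edge of the block.
    """
    sizes = [out_tile]
    for k in reversed(kernels):   # walk backwards from the block output
        sizes.append(sizes[-1] + k - 1)
    sizes.reverse()
    return sizes


def recompute_ratio(out_tile, kernels):
    """Per-layer work relative to untiled execution (1.0 = no redundancy)."""
    sizes = tile_sizes(out_tile, kernels)
    # sizes[i + 1] is the edge of the tile that layer i has to produce.
    return [(sizes[i + 1] / out_tile) ** 2 for i in range(len(kernels))]
```

For a bottleneck block with kernels 1, 3, 1 and an 8×8 output tile, the first 1×1 layer has to produce a 10×10 tile, i.e. about 1.56 times the work of the untiled layer; smaller tiles make this overhead grow quickly.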
IV-B Design Space Exploration
Combining the previously discussed configurations, we can characterise the hardware design space of a convolution block by the parallelisation parameters, buffer sizes, computation sequences, and tiling discussed above. A Winograd-based design is used by default. The performance and area cost are evaluated in two steps: first, a cycle-accurate simulator finds the best computation sequence and its latency, from which the performance figures are derived; second, the resource consumption is computed by a linear prediction model built upon synthesised designs. Finally, the roofline model is used to find the best design under resource constraints. Fig. 8 presents the exploration results for three efficient CNN models and a baseline VGG16 model using our hardware template. The performance in GOPS of the efficient models is generally lower. However, these networks also require fewer operations, and hence the overall inference time is still shorter (Section VI).
                    Efficient CNN Models                  VGG16 Variants^{1}
                    ResNet50  MobileNet V1  MobileNet V2  VGG16   VGG16 (1)  VGG16 (2)  VGG16 (5)
# Ops (GOP)         7.74      1.14          0.611         30.95   26.29      19.36      3.82
# Param. (M)        25.5      4.21          3.47          138.3   132.1      129.83     125.3
Clock Freq. (MHz)   200       200           200           200     200        200        200
Bit width           16 bit    16 bit        16 bit        16 bit  16 bit     16 bit     16 bit
DSP usage           1680      1664          1856          1738    1872       1536       1680
Latency (ms)        7.95      0.884         1.02          14.5    10.3       9.65       8.42
Throughput (GOPS)   973.2     1287.2        592           1928.4  2561.5     2007.0     453.6
Top-1 Accuracy (%)  93.5      88.3          87.5          90.5    93.5       92.75      84.75

^{1} The number in parentheses in VGG16 (·) denotes how many layers of the VGG16 variant are replaced by depthwise separable convolution.
V Layer-wise Model Optimisation
The objective of TuRF is to find an efficient CNN model and its corresponding FPGA design for a given domain-specific application. Sections III and IV discuss how to map efficient CNN models onto FPGA designs. In this section, we look into the design space exploration of the CNN model, which uses a layer-wise optimisation inspired by transfer learning, and the final toolflow of TuRF.
V-A CNN Model Selection and Optimisation
CNN model optimisation searches for the most efficient network under predefined accuracy requirements. Putting the hardware factor aside for now, we define model efficiency in terms of the number of parameters and operations required to achieve a certain accuracy. The remaining challenges are the characterisation and exploration of the model design space.
V-A1 Model Design Space
A typical CNN model is a sequence of cascaded layers, some of which are convolution layers. To restrict the scale of the design space, we only explore models that are derived from either VGG16 or ResNet50. To further limit the design space for feasible exploration, a model derived from VGG16 or ResNet50 can have its convolution layers replaced only by a particular separable convolution block, as shown in (11). Such replacements are also part of the motivation for the design of MobileNet. We represent a model in our design space by its base model together with a per-layer indicator of whether each convolution layer is replaced.
standard convolution  (11)  
bottleneck block 
V-A2 Exploration Method
In most cases we do not have a sufficient budget for training every possible model. Therefore, we devise the following optimisation approach, inspired by the principles of transfer learning. The input to our exploration procedure can be any model pre-trained on ImageNet, which is assumed to be general and to contain redundancies that are removable for the target application. We aim to achieve the required accuracy by fine-tuning the input model, in which only the top layers are trained and the others are fixed. We also assume that replacing top convolution layers is more beneficial than replacing bottom ones. This is based on [26], which explains the mechanism of CNNs for computer vision: convolution layers closer to the bottom extract lower-level features such as edges and shapes, and those closer to the top capture higher-level features such as faces and eyes. Hence, our assumption is mostly valid because the model now focuses on high-level domain knowledge.
As such, we propose a heuristic, greedy algorithm to explore the model design space. It starts with a pre-trained model and tries to replace layers from the top. In each iteration, the algorithm fine-tunes the model candidate for a fixed number of steps. The procedure terminates once a candidate fails to satisfy the accuracy requirement. Note that when the budget is sufficient, we need not stop and can continue searching.
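The greedy procedure can be sketched as follows. The function name and the `evaluate` callable are hypothetical: `evaluate(n)` is assumed to build the candidate with its top `n` convolution layers (or groups) replaced, fine-tune it for a fixed number of steps, and return its accuracy.

```python
def greedy_replace(num_layers, evaluate, min_accuracy):
    """Replace layers from the top until the accuracy requirement fails.

    Returns the largest number of top layers that could be replaced while
    every candidate tried so far met min_accuracy (0 if none did).
    """
    best = 0
    for replaced in range(1, num_layers + 1):
        accuracy = evaluate(replaced)   # fine-tune and measure the candidate
        if accuracy < min_accuracy:
            break                       # greedy: stop at the first failure
        best = replaced
    return best
```

Because the search stops at the first failing candidate, it needs at most `num_layers` fine-tuning runs, but it can miss a deeper replacement that would have recovered the accuracy; with a sufficient budget the `break` can be dropped and all candidates compared.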
V-A3 Evaluation
We evaluate our method on a flowers classification problem [2], where the original model is a pre-trained VGG16. In this case, the convolution layers in VGG16 are replaced group by group. Fig. 9 presents the evaluation results in two aspects. The left figure shows the final exploration results: each point gives the accuracy and size of an explored model, and the most efficient model is the one with the top convolution group replaced (the second point from the left), which is even better than the one with no layer replacement (the leftmost point). Layers are replaced consecutively from top to bottom. The right figure evaluates our assumption that replacing the top convolution layers is more beneficial than replacing the bottom ones. In the figure, a replaced group is closer to the top if its ID is larger. The rightmost column, with the topmost group replaced, achieves almost the same top-1 accuracy as the baseline model. This evaluation is a proof of concept. We will further evaluate this approach for other models and applications in future work.
V-B Final Toolflow
Combining the model optimisation procedure described above with the hardware optimisation and generation methods in Sections III and IV, we derive the final toolflow of our framework, as illustrated in Algorithm 1. This algorithm jointly explores the design space of the CNN model and hardware for efficient inference.
The inputs are the domain-specific dataset, the requirements, the platform specification, and the pre-trained models. The algorithm is driven by the ModelGen procedure in line 10, which generates new models from the pre-trained models and information from the current iteration, such as the performance of the intermediate model. DesignGen automatically explores the hardware design space for the given model and platform and generates an optimised design. This design is then evaluated to obtain its performance metrics. The best record is updated if the performance is better than both the performance requirement and the current best. The algorithm terminates when the accuracy falls below the accuracy requirement. If we have a sufficient training budget, we can loosen the terminating condition in line 4 by removing the accuracy requirement and checking the accuracy only after all possible models have been searched.
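The control flow described above can be sketched as a loop over model candidates. This is an illustrative reading of the description, not a transcription of Algorithm 1: all four callables (`model_gen`, `fine_tune`, `design_gen`, `evaluate`) and the scalar performance metric are hypothetical stand-ins for ModelGen, fine-tuning on the dataset, DesignGen, and design evaluation.

```python
def turf_toolflow(model_gen, fine_tune, design_gen, evaluate,
                  min_accuracy, min_perf):
    """Jointly explore models and hardware designs.

    model_gen(prev) -> next model candidate, or None when exhausted
                       (prev is None on the first call).
    fine_tune(model) -> accuracy after fine-tuning on the target dataset.
    design_gen(model) -> optimised hardware design for the platform.
    evaluate(design) -> performance metric (higher is better).
    Returns the best (model, design, perf) record, or None.
    """
    best = None
    model = model_gen(None)
    while model is not None:
        accuracy = fine_tune(model)
        if accuracy < min_accuracy:      # terminating condition
            break
        design = design_gen(model)
        perf = evaluate(design)
        if perf >= min_perf and (best is None or perf > best[2]):
            best = (model, design, perf)
        model = model_gen(model)         # informed by the current iteration
    return best
```

Loosening the terminating condition, as mentioned for a sufficient training budget, would correspond to replacing the `break` with a filter on the recorded candidates after the loop.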
                   VGG16                                            ResNet50
                   [9]     [27]    [22]    [10]    [17]    Ours     [22]    [22]    Ours (Plain)  Ours (Fused)
Year               2016    2016    2017    2017    2018    2018     2017    2017    2018          2018
FPGA board         ZC706   KU060   GX1150  ZCU102  VCU440  5SGSD8   GXA7    GX1150  5SGSD8        5SGSD8
Tech.              28nm    20nm    20nm    16nm    16nm    28nm     28nm    20nm    28nm          28nm
Clock Freq. (MHz)  150     200     200     200     200     200      150     200     200           200
Bit width          16-bit  16-bit  16-bit  16-bit  16-bit  16-bit   16-bit  16-bit  16-bit        16-bit
Max DSP blocks     900     2760    1518    2520    2880    1963     256     1518    1963          1963
Perf. (GOPS)       137.0   266.0   720.2   2941    821.0   1928     250.75  619.13  890.5         973.2
VI Evaluation
In this evaluation, we first examine the capability of TuRF by generating hardware designs for typical efficient CNN models. We then evaluate TuRF in terms of model transformation and optimisation by supplying a conventional VGG16 model pre-trained on a large dataset and generating a set of smaller models with different numbers of groups replaced. Finally, we compare our approach with previous work.
VI-A Experimental Setup
The domain-specific application selected for this evaluation is the flowers classification problem [2] mentioned above. The data representation in the generated hardware designs is quantised to 16-bit fixed-point, which does not hurt the accuracy of the evaluated application. All the CNN models evaluated here are built, trained, and evaluated using TensorFlow (v1.6). Pre-trained models are downloaded directly from TF-Slim. The experimental FPGA platform is a Stratix V 5SGSD8 on a Maxeler MPC-X node, which contains 262.4K adaptive logic modules (ALMs), 1963 variable-precision DSP blocks, and 2567 BRAMs (M20K). The bandwidth of off-chip data transfer is 38 GB/s. The hardware template prototype is implemented in OpenSPL [28]. MaxCompiler (v2016.1.1) synthesises the generated designs.
VI-B Performance of Efficient CNN Models
We first evaluate the performance of three popular efficient CNN models generated by our framework, ResNet50, MobileNet V1, and MobileNet V2; the results are shown in Table III. Each model is fine-tuned to find the highest attainable top-1 accuracy for flowers classification. According to the table, ResNet50 achieves the best accuracy of the three (93.5%) but has the worst latency (7.95 ms). Its network size is also substantially larger than that of the others. On the other hand, MobileNet V1 is better than V2 in terms of both latency (0.884 ms versus 1.02 ms) and accuracy (88.3% versus 87.5%), with just a minor increase in network size.
We also study the benefits of layer fusion for convolution blocks by analysing the performance in GOPS. The layers within each convolution block are fused in each efficient model. Fig. 10 (b) compares the performance and shows that the fused designs always outperform the implementations without layer fusion. Layer fusion is particularly effective for MobileNet V1, indicating that depthwise separable blocks can be fused more effectively. It also explains the compelling performance of MobileNet V1 compared to V2 in Table III. Our framework allows users to choose which models to use based on their requirements. This involves repeated execution of the exploration procedure, and the model that best satisfies the given requirements is chosen.
VI-C Evaluation on Model Optimisation
A pre-trained VGG16 is used as an input to our framework to evaluate its capability to perform model optimisation. The accuracy requirement supplied to the framework is gradually adjusted to generate implementations with different numbers of groups replaced. This enables us to understand the implications of replacing the standard convolution layer with various types of convolution block in a conventional CNN model. Table III shows the results, where VGG16 (1), (2), and (5) denote that one, two, and five groups are replaced respectively. VGG16 (1) and (2) perform better than the original model in flowers classification in both accuracy and hardware efficiency. VGG16 (5) shows only a minor performance gain with a large accuracy drop.
Furthermore, Fig. 10 (a) shows the accuracy versus the latency and size of all models. MobileNet is more suitable for performance-aware applications, while ResNet50 and VGG16 (2) are more appropriate for accuracy-aware applications. However, the model size of VGG16 cannot drop significantly because most of its parameters are in the FC layers.
VI-D Comparison with Previous Work
To demonstrate the performance of our hardware template, we compare it with prior work on automatic CNN accelerator generation on FPGA. The original pre-trained CNN models, VGG16 and ResNet50, are used in this experiment. The convolution layers of our VGG16 accelerator are not replaced by any convolution blocks to ensure a fair comparison. The Winograd algorithm is applied to reduce the computational complexity. Table IV shows that our approach is better than most of the previous work and remains competitive with [10] once the technology difference is taken into account: our performance normalised to 16 nm technology (3374 GOPS) is higher than that of [10] (2941 GOPS). Moreover, to show that layer fusion is beneficial for efficient convolution blocks, we evaluate our accelerator on ResNet50. As shown in Table IV, our implementations outperform the ones given in [22], and the fused design achieves the best performance. Here the plain design is generated with only the Winograd algorithm, and the fused design performs layer fusion for all bottleneck blocks.
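The normalised figure can be reproduced with a simple scaling rule. The text does not state how the normalisation is done; the sketch below assumes performance scales linearly with the ratio of technology feature sizes, which is an assumption of this example that happens to match the reported number.

```python
def normalise_gops(gops, from_nm, to_nm):
    """Scale performance linearly with the technology node ratio (assumed)."""
    return gops * from_nm / to_nm


# Our VGG16 design: 1928 GOPS at 28 nm, normalised to 16 nm.
normalised = normalise_gops(1928, 28, 16)
```

Under this assumption `normalised` evaluates to 3374.0, compared with the 2941 GOPS reported for [10] on a 16 nm device.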
VII Conclusion
This paper proposes TuRF, a new CNN optimisation framework inspired by efficient CNN architectures and transfer learning, which supports domain-specific optimisations. The novel aspects include a design template for various convolution blocks, a layer fusion method, and a model optimisation technique that allows layer replacement and fine-tuning of pre-trained CNNs. The proposed approach is capable of producing some of the fastest CNN designs targeting FPGA implementations. Further research includes design space exploration with functional evaluation tools, such as ADAM [29], and extending our approach to support various applications.
Acknowledgements
The support of the United Kingdom EPSRC (grant numbers EP/I012036/1, EP/L00058X/1, EP/L016796/1, EP/N031768/1 and EP/K034448/1), European Union Horizon 2020 Research and Innovation Programme (grant number 671653), Corerain, Intel, Maxeler and the Lee Family Scholarship is gratefully acknowledged.
References

[1] Y. Bengio, “Deep Learning of Representations for Unsupervised and Transfer Learning,” JMLR, 2011.
[2] “FineTuning TensorFlow Flowers Dataset.” [Online]. Available: https://www.tensorflow.org/tutorials/image_retraining#training_on_flowers
[3] N. Tajbakhsh et al., “Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning?” IEEE Trans. on Medical Imaging, 2017.
[4] K. He et al., “Deep Residual Learning for Image Recognition,” in CVPR, 2016.
[5] A. G. Howard et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” arXiv:1704.04861, 2017.
[6] M. Sandler et al., “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” arXiv:1801.04381, 2018.
[7] A. Lavin and S. Gray, “Fast Algorithms for Convolutional Neural Networks,” in CVPR, 2016, pp. 4013–4021.
[8] Y. Umuroglu et al., “FINN: A Framework for Fast, Scalable Binarized Neural Network Inference,” in FPGA, 2017, pp. 65–74.
[9] J. Qiu et al., “Going Deeper with Embedded FPGA Platform for Convolutional Neural Network,” in FPGA, 2016, pp. 26–35.
[10] L. Lu et al., “Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs,” in FCCM, 2017, pp. 101–108.
[11] S. Han et al., “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA,” in FPGA, 2017, pp. 75–84.
[12] F. Mamalet and C. Garcia, “Simplifying ConvNets for Fast Learning,” in ICANN, 2012, pp. 58–65.
[13] F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” arXiv:1610.02357, 2016.
[14] S. Winograd, Arithmetic Complexity of Computations. SIAM, 1980, vol. 33.
[15] U. Aydonat et al., “An OpenCL(TM) Deep Learning Accelerator on Arria 10,” in FPGA, 2017, pp. 55–64.
[16] J. Yu et al., “Instruction Driven Cross-Layer CNN Accelerator with Winograd Transformation on FPGA,” in ICFPT, 2017, pp. 227–230.
[17] J. Shen et al., “Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA,” in FPGA, 2018, pp. 97–106.
[18] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in ICLR, 2015, pp. 1–14.
[19] S. I. Venieris and C. S. Bouganis, “fpgaConvNet: A Framework for Mapping Convolutional Neural Networks on FPGAs,” in FCCM, 2016, pp. 40–47.
[20] C. Zhang et al., “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks,” in FPGA, 2015, pp. 161–170.
[21] Y. Ma et al., “Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks,” in FPGA, 2017, pp. 45–54.
[22] Y. Ma et al., “An Automatic RTL Compiler for High-Throughput FPGA Implementation of Diverse Deep Convolutional Neural Networks,” in FPL, 2017, pp. 1–8.
[23] S. W. Williams et al., “Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, Apr. 2009.
[24] M. Alwani et al., “Fused-Layer CNN Accelerators,” in MICRO, 2016, pp. 1–12.
[25] Q. Xiao et al., “Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs,” in DAC, 2017, pp. 1–6.
[26] M. D. Zeiler and R. Fergus, “Visualizing and Understanding Convolutional Networks,” in ECCV, vol. 8689, pp. 818–833, 2014.
[27] C. Zhang et al., “Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks,” in ICCAD, 2016, pp. 1–8.
[28] OpenSPL Consortium, “OpenSPL: Revealing the Power of Spatial Computing,” Tech. Rep., 2013.
[29] H.-C. Ng et al., “ADAM: Automated Design Analysis and Merging for Speeding Up FPGA Development,” in FPGA, 2018, pp. 189–198.