1 Introduction
Although the neural network community has been prospering for decades, state-of-the-art CNNs still demand significant computing resources (e.g., high-performance GPUs) and are eminently unsuited for resource- and power-limited embedded hardware or Internet-of-Things (IoT) platforms [13]. Reasons for the high resource needs include the complexity of connections among layers, the sheer number of fixed-point multiply-accumulate (MAC) operations, and the storage requirements for weights and biases. Even if network training is done offline, only a few high-end IoT devices can realistically carry out the forward propagation of even a simple CNN for image classification.
Acknowledgement: This work was performed within the CRAFT project (DARPA Award HR0011-16-C-0037) and supported by NSF IIS-1618477 and a research gift from Xilinx, Inc. Any findings in this material are those of the author(s) and do not reflect the views of any of the above funding agencies. The U.S. and U.K. Governments are authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
Binarized convolutional neural networks (BCNNs) [6, 3, 18, 9, 13] have been proposed as a more hardware-friendly model with extremely reduced precision of weights and activations. A BCNN replaces floating- or fixed-point multiplications with XNOR operations (which can be implemented extremely efficiently on ASIC or FPGA devices) and achieved near state-of-the-art accuracy on a number of real-world image datasets at the time of publication. Unfortunately, this hardware efficiency is offset by the fact that a BCNN model is typically tens or hundreds of times the size of a CNN model of equal accuracy. To make BCNNs practical, an effective way to reduce the model size is required.
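To see why binarization makes the arithmetic hardware-friendly, consider the standard identity for vectors with entries in {-1, +1}: a dot product becomes an XNOR followed by a population count. The sketch below is our illustration of this general idea, not the authors' code.

```python
import numpy as np

# For a, b with entries in {-1, +1}, encode -1 -> 0 and +1 -> 1.
# Then a . b = 2 * popcount(XNOR(bits_a, bits_b)) - n, so a popcount of an
# XNOR word replaces n multiply-accumulate operations.
def binary_dot(a, b):
    bits_a = (a > 0)
    bits_b = (b > 0)
    xnor = ~(bits_a ^ bits_b)          # True exactly where the signs agree
    return 2 * int(np.count_nonzero(xnor)) - a.size

rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=64)
b = rng.choice([-1, 1], size=64)
assert binary_dot(a, b) == int(a @ b)  # matches the ordinary dot product
```

On an FPGA the XNOR and popcount map directly to LUTs and an adder tree, which is the efficiency the text refers to.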
In this paper, we introduce Separable Filters (SF) on binarized filters, as shown in Fig. 1(c), to further reduce the hardware complexity in two aspects:

SF reduces the number of possible unique k x k filters from 2^(k^2) to just 2^(2k-1) (for 3x3 filters, from 512 to 32), enabling the use of a small lookup table during forward propagation. This directly results in a reduction of the memory footprint.

SF replaces each k x k 2D convolution with two length-k 1D convolutions, which reduces the number of MAC operations from k^2 to 2k. This translates to either a speedup or the same throughput with fewer resources.
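The filter-count reduction claimed above can be checked by direct enumeration; for k = 3 there are 2^9 = 512 possible binary filters, but far fewer distinct rank-1 (separable) ones, since u v^T and (-u)(-v)^T coincide. A sketch of the enumeration (ours):

```python
import numpy as np
from itertools import product

# Enumerate all outer products of +-1 vectors of length k = 3 and count the
# distinct results. Sign symmetry pairs up (u, v) with (-u, -v), so the count
# is 2**(2k) / 2 = 2**(2k - 1).
k = 3
vectors = [np.array(bits) for bits in product([-1, 1], repeat=k)]
unique = {tuple(np.outer(u, v).ravel()) for u in vectors for v in vectors}

assert len(vectors) ** 2 == 64          # 8 * 8 (u, v) pairs
assert len(unique) == 2 ** (2 * k - 1)  # only 32 distinct separable filters
assert 2 ** (k * k) == 512              # versus 512 unconstrained binary filters
```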
In addition, we propose two methods to train BCNNw/SF:
Method 1 (Extended Straight-through Estimator, eSTE): treat the rank-1 approximation of SFs as a process that adds noise to the model, and rely on batch normalization to regularize the noise. During backward propagation, we extend the straight-through estimator (STE) to propagate the gradient across the decomposition.
Method 2 (Gradient over SVD): use the analytic closed form of the gradient over SVD to push the chain rule in backward propagation through to the binarized filters, i.e., the filters before SVD.
The rest of the paper is organized as follows: Sec. 2 provides a brief survey of previous work, Sec. 3 presents the design of BCNNw/SF and some implementation details, Sec. 4 presents two methods for training BCNNw/SF, Sec. 5 shows experimental results, Sec. 6 describes the implementation of BCNNw/SF on an FPGA platform, and Sec. 7 concludes the paper.
2 Related Work
We leverage the lightweight method for training a BCNN proposed by Hubara et al. [6, 3], which achieved state-of-the-art results on datasets such as CIFAR-10 and SVHN.
Two important ideas contributed to the effectiveness of their BCNN:
Batch normalization with scaling and shifting [7]: a BN layer regularizes the training process by shifting the mean to zero, making binarization more discriminative. It also introduces two extra degrees of freedom in every neuron to further compensate for additive noise.
Larger model: as with the well-known XOR problem [15], using a larger network increases the power of the model by increasing the number of dimensions for projection, making the decision boundary more complex.
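A toy illustration of the first point (ours, not from the paper): if pre-activations all share one sign, sign() collapses every neuron to the same value and carries no information; shifting the mean to zero, as a BN layer does, makes the binarization discriminative again.

```python
import numpy as np

# Hypothetical pre-activations that are all positive.
x = np.array([0.2, 0.9, 1.7, 2.4, 3.1])
assert np.all(np.sign(x) == 1)        # degenerate: every neuron binarizes to +1

# After mean-centering (the shift a BN layer learns), signs become informative.
centered = x - x.mean()
assert len(np.unique(np.sign(centered))) == 2   # both -1 and +1 now occur
```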
Rastegari et al. proposed XNOR-Net [13], an alternative BCNN formulation that relies on a multiplicative scaling layer instead of batch normalization to regularize the additive noise introduced by binarization. The scaling factors are calculated to minimize the 1-norm error between the real-valued and binary filters. While Hubara's BCNN did not perform well on a larger dataset such as ImageNet [4], obtaining a top-1 error rate of 72.1%, XNOR-Net improves this error rate to 55.8%.
Rigamonti et al. [14]
proposed a rank-1 approximation method to replace the 2D convolution in a CNN with two successive 1D convolutions. Every filter was approximated by the outer product of a column vector and a row vector obtained through Singular Value Decomposition (SVD). The authors proposed two schemes for learning separable filters: (1) retain only the largest singular value and its corresponding vectors to reconstruct a filter; (2) linearly combine the outer products to lower the error rate. However, the first scheme sacrificed too much performance, because the other singular values can be comparable with the largest one in magnitude. The second scheme was designed to compensate for this loss of performance, but using more singular values to recover a filter means less benefit from the approximation. Although learning with separable filters was computationally expensive, low-rank approximation is an important idea for alleviating hardware complexity.
Inspired by Rigamonti's work, more research has been conducted to explore more economical models, i.e., networks with smaller memory requirements for the kernels. Jaderberg et al. [8] proposed a filter compression method that analyzed the redundancy in a pretrained model, decomposed the filters into single-channel separable filters, and then linearly combined the separable filters to recover the original filters. The decomposition was optimized to minimize the L2 reconstruction error of the original filters. Alvarez et al. [1] presented DecomposeMe, which further reduced the redundancy by sharing the separated filters within the same layer. To alleviate the computational congestion of GoogLeNet [22], Szegedy et al. [21, 20] proposed a multi-channel asymmetric convolutional structure, which has the same architecture as the second scheme in the work of Jaderberg et al. [8] but a different purpose: Szegedy used the asymmetric convolutional structure to avoid expensive 2D convolutions and train the filters directly, while Jaderberg decomposed pretrained filters to exploit both input and output redundancies. However, both Jaderberg's and Alvarez's methods required a pretrained model, and both Jaderberg's and Szegedy's multi-channel asymmetric convolutions brought additional channels requiring a larger memory footprint.
Our proposed method differs from the three methods above because we maintain the network structure during the training phase, train rank-1 separable filters directly, and then decompose the rank-1 filters into pairs of vector filters for hardware implementation. Last but not least, to the best of our knowledge no existing work provides an analytic closed form of the gradient of the filter-decomposition process for backward propagation.
3 Binarized CNN with Separable Filters
Here we describe the theory of BCNN with separable filters in detail. Our main idea is to apply SVD to binarized filters to further reduce the memory requirements and computational complexity of hardware implementations. We present the details of forward propagation in this section and two methods of backward propagation in the next section.
3.1 The Subject of Decomposition
For BCNN, there are two approaches to binary filter decomposition; Fig. 2 depicts the two choices. If we adopt flow 1 and apply the rank-1 approximation (the red box) directly to the real-valued filters, we cannot avoid real-time decomposition during training, because the input filter has an infinite number of possible combinations of pixel strengths. Therefore, we introduce an extra binarization (the blue box) on the real-valued filters and apply the rank-1 approximation to the binarized filters. The number of possible input filters for the rank-1 approximation is then limited to 2^(k^2), where k is the width or height of a filter. With flow 2, we can build a lookup table beforehand and avoid real-time SVD during training.
Naturally, the rank-1 approximation and the extra binarization limit the size of the basis used to recover the original filters and thus introduce more noise into the model, as shown in Fig. 1 from (b) to (c). Instead of introducing an additional linear-combination layer to improve accuracy, we rely on the two aforementioned techniques that make BCNN work.
3.2 Binarized Separable Filters
Here we provide the detailed steps from binarized filters to binarized separable filters. The SVD of a binarized filter F_b yields three matrices, as shown in Eq. 1.

F_b = U Σ V^T  (1)
Similar to the real-valued rank-1 approximation for separable filters, the binarized separable filters are obtained with an extra binarization applied to the dominant singular vectors, as shown in Eq. 2.

F_s = b(u_1) b(v_1)^T  (2)
where u_1 and v_1 stand for the left and right singular vectors corresponding to the largest singular value, respectively, and the function b(.) denotes the binarization, which can be implemented as either a deterministic function or a stochastic process [6]. Note that the largest singular value is dropped, because all singular values are non-negative and have no effect on binarization.
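A minimal sketch (ours) of the binarized rank-1 approximation in Eqs. 1-2, assuming the deterministic sign(.) binarization:

```python
import numpy as np

# Take the singular vectors of the largest singular value, binarize them, and
# drop the singular value itself (it is non-negative, so it cannot flip signs).
def binarized_separable(F):
    U, S, Vt = np.linalg.svd(F)
    u1 = np.sign(U[:, 0])
    v1 = np.sign(Vt[0, :])
    return np.outer(u1, v1)

# A non-separable binary 3x3 filter and its binarized separable approximation.
F = np.array([[1., -1., 1.],
              [1.,  1., 1.],
              [1., -1., 1.]])
Fs = binarized_separable(F)
assert set(np.unique(Fs)) <= {-1.0, 1.0}   # entries stay binary
assert np.linalg.matrix_rank(Fs) == 1      # and the result is separable
```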
Fig. 3(a) and (c) illustrate a kernel with three filters before and after the binarized rank-1 approximation. As in [6], we keep a copy of the real-valued filters during each training iteration and accumulate the weight gradients on them, since SGD and other convex optimization methods presume a continuous hypothesis space. This also allows us to train the kernels as if the model were real-valued, without the need for penalty rules [14] during backward propagation. For the test phase, all filters are binarized and rank-1 approximated to be binarized separable.
In our FPGA implementation, we use the pairs of vectors in Fig. 3(b) to replace the 2D filters and perform separable convolution, which involves a row-wise 1D convolution followed by a column-wise 1D convolution in back-to-back fashion before accumulating across channels. More details on the FPGA implementation are presented in Sec. 6.
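The equivalence this relies on can be sketched as follows (our illustration, using the 'valid' correlation convention): a 2D convolution with a rank-1 filter u v^T equals a row-wise 1D pass with v followed by a column-wise 1D pass with u.

```python
import numpy as np

# Reference 2D 'valid' correlation: 9 MACs per output pixel for a 3x3 filter.
def conv2d_valid(x, f):
    kh, kw = f.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * f)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))
u = np.array([1.0, -1.0, 1.0])       # column-vector filter
v = np.array([1.0, 1.0, -1.0])       # row-vector filter

full2d = conv2d_valid(x, np.outer(u, v))
# Separable version: 3 + 3 = 6 MACs per pixel (kernels reversed so that
# np.convolve computes a correlation).
rows = np.apply_along_axis(lambda r: np.convolve(r, v[::-1], 'valid'), 1, x)
sep = np.apply_along_axis(lambda c: np.convolve(c, u[::-1], 'valid'), 0, rows)
assert np.allclose(full2d, sep)
```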
3.3 Details of the Implementation
As mentioned in Sec. 3.1, the benefit of flow 2 is the ability to leverage a finite-sized lookup table (LUT) to replace the costly SVD computation during the forward propagation of the training phase. Although training takes place on a highly optimized parallel computing machine, LUT access is still a potential bottleneck if searching for an entry in the mapping is not efficient enough.
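As a preview of the key function defined in Eqs. 4-5 below, the content-addressing idea can be sketched as follows (our illustration): map the {-1, +1} filter entries to bits, take the first element as the least significant bit, and fold with powers of two to obtain an integer index into the table.

```python
import numpy as np

# Binary-to-integer key for a binarized filter: every 3x3 binary filter
# indexes one of 2**9 = 512 table slots in O(1).
def filter_key(Fb):
    bits = (Fb.ravel() > 0).astype(int)          # -1 -> 0, +1 -> 1
    return int(bits @ (2 ** np.arange(bits.size)))  # first element is the LSB

assert filter_key(-np.ones((3, 3))) == 0         # all -1 -> smallest key
assert filter_key(np.ones((3, 3))) == 511        # all +1 -> 2**9 - 1
```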
We build two tables to avoid real-time SVD. The first table is composed of all binarized separable filters. The number of entries in the first table can be calculated with Eq. 3.

N_SF = 2^(2k-1)  (3)

where k is the width or height of a filter.
The second table is the mapping from every possible binary filter to its corresponding binarized separable filter. We design an estimation function to make the tables content-addressable. The key to index the first table can be obtained with Eq. 4.

key(F_b) = Σ_i w_i ((F_b)_i + 1)/2  (4)

where the weights w_i are in one-to-one correspondence with the elements of F_b and convert the matrix into a number. The simplest choice is binary-to-integer conversion: we take the first element of F_b as the least significant bit (LSB), so the weights are designed as in Eq. 5

w_i = 2^(i-1), i = 1, ..., n  (5)

where n is the number of elements of F_b, and the ±1 entries are first mapped to {0, 1}.
With this simple hash function and the efficient broadcasting technique in Theano
[23], we are able to efficiently obtain the keys for all filters in a convolutional layer.

4 Backward Propagation of Separable Filters
Besides the extra degrees of freedom introduced in BCNNw/SF's forward propagation, two more important techniques make binarized separable filters work. In this section, we present two methods that utilize these techniques for training BCNNw/SF.
4.1 Method 1: Extended STE
As shown in Fig. 2(b), during the forward propagation all filters are degraded three times. Since binarization can be considered as adding noise to the model, which is regularized by batch normalization, the rank-1 approximation, which is just another noise-adding process, can be regularized as well. In detail, we extend the straight-through estimator across the three degradation processes in Fig. 2 to update the real-valued filters with the gradient of the rank-1 approximated filters. Eq. 6 shows the backward propagation from the gradient of the rank-1 approximated filter, g_{F_s}, to the gradient of the real-valued filter, g_{F_r}.

g_{F_r} = g_{F_s}  (6)
This simple method relies entirely on batch normalization to regularize the noise introduced by the two binarizations and the rank-1 approximation.
4.2 Method 2: Gradient over SVD
Since binarization is not a continuous function, Hubara et al. [6] resorted to the STE to update the real-valued weights with the gradient of the loss w.r.t. the binarized weights. However, owing to the continuity of singular value decomposition, we are able to calculate the gradient w.r.t. the result of the first binarization, F_b. More specifically, the rank-1 approximation is differentiable, because all three resultant matrices of the SVD in Eq. 1, i.e., U, Σ, and V, are differentiable w.r.t. every element a_{ij} of the original input matrix F_b. From the approximation we adopt for separable filters in Eq. 2, one can obtain the derivative of F_s w.r.t. the elements of the original matrix as Eq. 7, if the STE is applied to the binarization.

∂F_s/∂a_{ij} = (∂u_1/∂a_{ij}) v_1^T + u_1 (∂v_1/∂a_{ij})^T  (7)
Papadopoulo et al. [11] provided the mathematical closed form of the gradients of the resultant matrices, as shown in Eqs. 8 and 9.

∂U/∂a_{ij} = U Ω_U^{ij}  (8)

∂V/∂a_{ij} = -V Ω_V^{ij}  (9)
where Ω_U^{ij} and Ω_V^{ij} are antisymmetric matrices with zeros on their diagonals, and all off-diagonal elements can be obtained by solving Eqs. 10 and 11.

σ_l (Ω_U^{ij})_{kl} + σ_k (Ω_V^{ij})_{kl} = u_{ik} v_{jl}  (10)

σ_k (Ω_U^{ij})_{kl} + σ_l (Ω_V^{ij})_{kl} = -u_{il} v_{jk}  (11)
Eq. 12 shows the general form of the derivative of the separable filter, where e_1 denotes the first standard basis vector.

∂F_s/∂a_{ij} = (U Ω_U^{ij} e_1) v_1^T - u_1 (V Ω_V^{ij} e_1)^T  (12)

(∂F_s/∂a_{ij})_{pq} = Σ_k U_{pk} (Ω_U^{ij})_{k1} V_{q1} - U_{p1} Σ_k V_{qk} (Ω_V^{ij})_{k1}  (13)
From Papadopoulo's Eqs. 8 to 11, we can derive every element in Eq. 12 as shown in Eq. 13 and see that cross-terms exist between elements. The gradient of an SVD resultant matrix w.r.t. one element of the original input matrix is itself a matrix of the same dimensions, k by k; i.e., a change in a single element of the input matrix can affect all elements of the SVD result. The intuition is that the rank-1 approximation is a matrix-wise, filter-level mapping rather than an element-wise operation, and multiple elements contribute to the mapping result of a filter.
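This cross-term behavior can be checked numerically without the closed form (our sketch; the convention used to fix the SVD sign ambiguity is an assumption):

```python
import numpy as np

# Rank-1 approximation u1 v1^T with a fixed sign convention so that finite
# differences are well defined near the chosen matrix.
def rank1(A):
    U, S, Vt = np.linalg.svd(A)
    u1, v1 = U[:, 0], Vt[0, :]
    if u1[np.argmax(np.abs(u1))] < 0:   # make the largest entry of u1 positive
        u1, v1 = -u1, -v1
    return np.outer(u1, v1)

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 3.0, 1.0],
              [1.0, 0.0, 2.0]])
eps = 1e-6
E = np.zeros_like(A)
E[0, 0] = eps
# Central difference: derivative of the whole rank-1 filter w.r.t. a_00.
J00 = (rank1(A + E) - rank1(A - E)) / (2 * eps)
# Perturbing one input entry changes entries well beyond position (0, 0).
assert np.count_nonzero(np.abs(J00) > 1e-8) > 1
```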
To incorporate Eq. 12 into the chain-rule calculation of backward propagation, we follow a fashion similar to how higher-layer neurons collect errors from the lower layer. Eq. 14 shows the inner product that collects the error from the lower layer and propagates it to every element of the binarized filter F_b. For method 2, we also build a table of the derivatives together with the binarized rank-1 approximation to avoid real-time calculation of Eq. 12.

∂C/∂a_{ij} = Σ_{p,q} (∂C/∂(F_s)_{pq}) (∂(F_s)_{pq}/∂a_{ij})  (14)
5 Experiments
We conduct experiments in Theano [23], based on Courbariaux's framework [2], using two GPUs (an NVIDIA GeForce GTX Titan X and a GTX 970) for the training and testing process. In most of the experiments, we obtain near state-of-the-art results using BCNNw/SF.
In this section, we describe the network structures we use and list the classification results on each dataset. We compare our results with relevant works and then provide analysis from different perspectives, including the binarized separable filters and learning ripples.
5.1 Datasets and Models
We evaluate our methods on three benchmark image classification datasets: MNIST, CIFAR-10, and SVHN. MNIST is a dataset of 28x28 grayscale handwritten digits, with a training set of 60K examples and a testing set of 10K examples. SVHN is a real-world image dataset of street view house numbers, cropped to 32x32 color images, with 604K digits for training and 26K digits for testing. Both of these datasets classify digits ranging from 0 to 9. The CIFAR-10 dataset consists of 60K 32x32 color images in 10 mutually exclusive classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), with 6,000 images per class; there are 50K training images and 10K test images. The convolutional neural networks we use have almost the same architecture as Hubara et al.'s [6] except for some small modifications. This architecture is inspired by the VGG [17] network. It contains three fully-connected layers and six convolutional layers, in which the kernels for the convolutional layers are 3x3. For detailed network structure parameters, see Table 1.
Name  MNIST(CNN)  CIFAR10  SVHN 



Input  1x28x28  3x32x32  3x32x32 
Conv1  64x3x3  128x3x3  64x3x3 
Conv2  64x3x3  128x3x3  64x3x3 
Pooling  2 x 2 Max Pooling 

Conv3  128x3x3  256x3x3  128x3x3 
Conv4  128x3x3  256x3x3  128x3x3 
Pooling  2 x 2 Max Pooling  
Conv5  256x3x3  512x3x3  256x3x3 
Conv6  256x3x3  512x3x3  256x3x3 
Pooling  2 x 2 Max Pooling  
FC1  1024  1024  1024 
FC2  1024  1024  1024 
FC3  10  10  10 
In each experiment, we split the training data into two parts: most of the training set is used for training the network, and the remainder is used as a validation set. During training, we use both the training loss on the training set and the inference error rate on the validation set as performance measurements. To evaluate the different trained models, we use the classification accuracy on the testing set as the evaluation protocol.
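For reference, the squared hinge loss used in the training setup described below can be sketched as follows (the ±1 target encoding is our assumption, following common BinaryNet practice):

```python
import numpy as np

# Squared hinge loss: penalize scores on the wrong side of the unit margin.
# outputs: (batch, classes) real-valued scores
# targets: (batch, classes) with +1 for the true class, -1 elsewhere
def square_hinge(outputs, targets):
    return np.mean(np.maximum(0.0, 1.0 - targets * outputs) ** 2)

y = np.array([[2.0, -1.5, -0.5]])
t = np.array([[1.0, -1.0, -1.0]])    # class 0, encoded as +-1
# Only the third score (-0.5, margin 0.5 short of -1) is penalized: 0.5**2 / 3.
assert np.isclose(square_hinge(y, t), 0.25 / 3)
```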
In order for all these benchmarks to remain challenging, we did not use any preprocessing, data augmentation, or unsupervised learning. We use the binarized hard tanh [6] as the activation function. The ADAM adaptive learning rate method [10] is used to minimize the squared hinge loss with an exponentially decaying learning rate. We also apply batch normalization to our networks, with mini-batch sizes chosen separately for MNIST, CIFAR-10, and SVHN, to speed up the learning, and we scale the learning rate for each convolutional layer with a Glorot-style factor [7]. We train our networks for a set number of epochs on the MNIST and CIFAR-10 datasets, and for a different number on SVHN. The results are given in Sec. 5.2.

5.2 Benchmark Results
Dataset  MNIST(CNN)  CIFAR10  SVHN 



No binarization (standard results)  
Maxout Networks [5]  0.94%  11.68%  2.47% 
Binarized Network  
BCNN(BinaryNet) [6]  0.47%  11.40%  2.80% 
Binarized Network with Separable Filters  
BCNNw/SF Method 1 (this work)  0.48%  14.12%  4.60% 
BCNNw/SF Method 2 (this work)  0.56%  15.46%  4.18% 
Fig. 4 depicts the learning curves on the CIFAR-10 dataset. There is some accuracy degradation between BCNN and our methods due to the more aggressive noise. By the end of the training phase, our method 1 yields an accuracy below that of BCNN by roughly 2.7%, and method 2 reaches an even lower accuracy. Given CIFAR-10's higher difficulty, this loss of accuracy meets our expectation. We discuss the benefit of using the exact gradient over the rank-1 approximation in detail in the next subsection.
Tab. 2 summarizes the experimental results in terms of error rate. Compared with BCNN [6], for grayscale handwritten digit classification both of our training methods achieve an accuracy close to that of the binarized convolutional neural network; the difference is within 0.09%. It is noteworthy that our method 2 outperforms method 1 on SVHN by 0.42% in error rate. For CIFAR-10 and SVHN, our methods are inferior to BCNN by a difference of less than 2.72%, because we limit the choice of filters from 512 to 32, where the filter size is 3x3. Since the performance degradation on CIFAR-10 is the largest, we implement a hardware accelerator in FPGA to inspect to what extent hardware complexity can be improved at the cost of the roughly 2.7% accuracy loss. Sec. 6 provides the details and a comparison with a BCNN accelerator to demonstrate the benefits of BCNNw/SF.
5.3 Scalability
We also explore different network sizes to improve the accuracy and examine the scalability of BCNNw/SF. Tab. 3 lists two additional larger models and an AlexNet-like model for CIFAR-10. The wider one is a model with the number of kernels in every layer doubled, and the deeper one is a network with two extra convolutional layers. Different from the models above, the AlexNet-like model includes three filter sizes: 5x5, 3x3, and 1x1. Applying our rank-1 approximation to a 5x5 filter yields an even larger memory reduction.
Name  Deeper  Wider  AlexNetlike 



Input  3x32x32  3x32x32  3x32x32 
Conv1  128x3x3  256x3x3  96x5x5 
Conv2  128x3x3  256x3x3  256x5x5 
Pooling  2 x 2 Max Pooling  
Conv3  256x3x3  512x3x3  512x3x3 
Conv4  256x3x3  512x3x3  512x3x3 
Pooling  2 x 2 Max Pooling  
Conv5  512x3x3  1024x3x3  256x3x3 
Conv6  512x3x3  1024x3x3  512x1x1 
Pooling  2 x 2 Max Pooling  
Conv7  512x3x3     
Conv8  512x3x3     
Pooling  2x2 Max Pooling     
FC1  1024  1024  1024 
FC2  1024  1024  128 
FC3  10  10  10 
We train the three bigger networks with our method 1, and Fig. 5 shows the learning curves of the two enlarged models for CIFAR-10. Since the number of trainable parameters is increased, more epochs are required to travel through the hypothesis space and reach a local minimum. Therefore, we train these two bigger networks for more epochs and compare with BCNN (BinaryNet). As shown in Fig. 5, the wider one (blue) starts with the largest ripple yet catches up to the same performance as BCNN (black) around the 175th epoch.
Dataset  CIFAR10 



BCNN(BinaryNet) [6]  11.40% 
Binarized Network with Separable Filters (this work)  
BCNNw/SF Method 1  14.12% 
BCNNw/SF Method 1 deeper  14.11% 
BCNNw/SF Method 1 wider  11.68% 
BCNNw/SF Method 1 AlexNetlike  15.1% 
Classification error rates of the three larger models.
Tab. 4 lists the CIFAR-10 results of the three bigger models alongside the CIFAR-10 results from Tab. 2. The performance improvement of the deeper network is marginal, since the feature maps pass through the extra destructive max-pooling layer shown in Tab. 3, which reduces the size of the first fully-connected layer, FC1, and hence suppresses the improvement. The wider network achieves 11.68%, which is very close to the performance of BCNN (BinaryNet). The AlexNet-like model demonstrates that a model with 5x5 filters sacrifices more accuracy in exchange for a higher memory reduction. In summary, the accuracy degradation of BCNNw/SF can be compensated by enlarging the network.
5.4 Discussion
In this section, we use the experimental results on CIFAR-10 for a detailed analysis. We unpack the trained rank-1 filters and the learning curves to gain a better understanding of the mechanics of BCNNw/SF.
Fig. 6 lists all the rank-1 filters and their frequencies on CIFAR-10. Although certain filters are rarely used, no filter is abandoned entirely. From Fig. 6 we can see that the all-positive and all-negative filters are trained most frequently; these two filters reduce the convolution to a running-sum calculation with a sliding window. As mentioned in Sec. 3.1, through the summation of separated convolutions from a preceding layer, we can achieve the entangled linear combinations that are essential to BCNNw/SF.
Since the spectrum of the ripple is unknown, we apply a Savitzky-Golay filter [16] to obtain the baseline of the validation accuracy and then subtract the baseline from the original accuracy to get the ripple. We fit the original learning curve with a quadratic polynomial inside each window of the Savitzky-Golay filter, and quantize all ripples into bins for the statistical analysis.
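The baseline/ripple split can be sketched with a sliding quadratic fit, which is what a quadratic Savitzky-Golay filter computes (the window width and the synthetic curve below are our assumptions, not the paper's values):

```python
import numpy as np

# Fit a quadratic in each sliding window and evaluate it at the window center;
# the residual (curve minus baseline) is the ripple.
def quadratic_baseline(y, window=11):
    half = window // 2
    pad = np.pad(y, half, mode='edge')
    t = np.arange(window)
    out = np.empty(len(y))
    for i in range(len(y)):
        coeffs = np.polyfit(t, pad[i:i + window], 2)  # local quadratic fit
        out[i] = np.polyval(coeffs, half)             # baseline at the center
    return out

epochs = np.arange(100)
acc = 80 + 0.1 * epochs + np.sin(epochs)   # trend plus a synthetic ripple
ripple = acc - quadratic_baseline(acc)
# The baseline absorbs the slow trend, leaving mostly the oscillation.
assert np.std(ripple) < np.std(acc - acc.mean())
```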
Statistics  mean  std  max 



BCNN(BinaryNet) [6]  0.052  1.213  5.09 
BCNNw/SF Method 1  0.055  1.059  4.465 
BCNNw/SF Method 2  0.035  0.723  3.622 
Tab. 5 compares our methods 1 and 2 with BCNN. All three statistics are reduced for method 2: the analytic gradient over the rank-1 approximation stabilizes the descending trajectory with a more accurate gradient calculation. Both BCNN and our method 1 rely on the gradient w.r.t. the binarized filters to update all parameters, for lack of an analytic gradient w.r.t. the real-valued filters. However, the more rigorous gradient also limits the possibility of escaping a local minimum on the error surface; as we can see in Tab. 2, the results of our method 1 are closer to those of BCNN. We use the trained binarized separable filters from our method 1 to implement an FPGA accelerator for CIFAR-10 in the following section.
6 FPGA Accelerator
6.1 Platform and Implementation
To quantify the benefits that BCNNw/SF can bring to hardware BCNN accelerators, we created an FPGA accelerator for the six convolutional layers of Courbariaux's CIFAR-10 network. Our accelerator is built from the open-source FPGA implementation in [24]. The dense layers were excluded, as they are not affected by our technique. As BCNNw/SF is ideal for small, low-power platforms, we targeted a Zedboard with a Xilinx XC7Z020 FPGA and an embedded ARM processor, a much smaller FPGA device than those used in existing CNN FPGA accelerators [12, 19]. We write our design in C++ and use Xilinx's SDSoC tool to generate Verilog through high-level synthesis. We implement both BCNN and BCNNw/SF and examine the performance and resource usage of the accelerator with and without separable filters.
Our accelerator is designed to be small and resource-efficient; it classifies a single image at a time and executes each layer sequentially. The accelerator contains two primary compute complexes: Conv1 computes the first (non-binary) convolutional layer, and Conv2-5 is configurable to compute any of the binary convolutional layers. Other elements include hardware to perform pooling and batch normalization, as well as on-chip RAMs to store the feature maps and weights. Computation with the accelerator proceeds as follows. Initially, all input images and layer weights are stored in off-chip memory accessible from both the CPU and the FPGA. The FPGA loads an image into local RAM, then for each layer it loads the layer's weights and performs the computation. Larger layers require several accelerator calls due to limited on-chip weight storage. Intermediate feature maps are stored entirely on-chip. After completing the convolutional layers, we write the feature maps back to main memory and the CPU computes the dense layers.
We kept the BCNN and BCNNw/SF implementations as similar as possible, with the main difference being the convolution logic and the storage of the weights. For BCNN, each output pixel requires 9 MAC operations to compute. For BCNNw/SF we can apply a 3x1 vertical followed by a 1x3 horizontal convolution, a total of 6 MACs. As the MACs are implemented by XORs and an adder tree, BCNNw/SF can potentially save resources.
In terms of storage, BCNN requires 9 bits to store each 3x3 filter. Naively, BCNNw/SF requires 6 bits, as each filter is represented as two 3-bit vectors. However, recall that we only use rank-1 filters: Eq. 3 shows that the number of unique 3x3 binarized separable filters is 32, meaning we can encode them losslessly with only 5 bits. A small decoder in the design maps the 5-bit encodings back into 3x3 filters.
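The lossless encoding of rank-1 filters described above can be sketched as follows (our illustration; the table ordering is arbitrary). For 3x3 filters the table of unique separable filters has 32 entries, so a 5-bit index suffices and the decoder is a plain table lookup.

```python
import numpy as np
from itertools import product

# Enumerate the unique 3x3 binarized separable filters once; store each filter
# as an index into that table instead of 9 raw bits.
vectors = [np.array(bits) for bits in product([-1, 1], repeat=3)]
table = sorted({tuple(np.outer(u, v).ravel()) for u in vectors for v in vectors})
encode = {f: i for i, f in enumerate(table)}     # filter -> small index

filt = np.outer([1, -1, 1], [1, 1, -1])          # some rank-1 binary filter
idx = encode[tuple(filt.ravel())]
assert 0 <= idx < 32                             # the index fits in 5 bits

decoded = np.array(table[idx]).reshape(3, 3)     # the "decoder": a lookup
assert np.array_equal(decoded, filt)             # lossless round trip
```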
6.2 Results and Discussion
Table 6 compares the execution time and resource usage of the two FPGA implementations. Resource numbers are reported post place-and-route, and runtime is wall-clock time measured on a real Zedboard. We exclude the time taken to transfer the final feature maps from the FPGA to main memory, as it is equal between the two networks; transfer times for the initial image and weights are included.
Metric  BCNN  BCNNw/SF (this work)  Change 
Conv layer runtime (ms)  0.949  0.652  -31.3% 
LUT  35255  36384  +3.2% 
FF  41418  41054  -1.0% 
Block RAM  94  78  -17.0% 
DSP  8  8  0.0% 
Our experimental results show that BCNNw/SF achieves a runtime reduction of 31% over BCNN, which equates to a 1.46X speedup. This is due mostly to the reduction in memory transfer time for the compressed weight filters. For similar reasons, BCNNw/SF saves 17% of the total block RAM (RAMs are used for both features and weights). Lookup table (LUT) counts have increased slightly, most likely due to the additional logic needed to map the 5-bit encodings to actual filters. Overall, BCNNw/SF realizes significant improvements in performance and memory requirements with minimal logic overhead.
7 Conclusion and Future Work
In this paper, we proposed the binarized convolutional neural network with separable filters (BCNNw/SF) to make BCNN more hardware-friendly. Through a binarized rank-1 approximation, 2D filters are separated into pairs of vectors, which reduces the memory footprint and the number of logic operations. We implemented two methods to train BCNNw/SF in Theano and verified them with various CNN architectures on a suite of realistic image datasets. The first method relies on batch normalization to regularize the noise, making it simpler and faster to train, while the second uses the gradient over SVD to make the learning curve smoother and potentially achieve better accuracy. We also implemented an accelerator for the inference of a CIFAR-10 network on an FPGA platform. With separable filters, the block RAM usage is reduced by 17% and the performance of the convolutional layers is improved by 1.46X compared to the baseline BCNN.
References
 [1] J. Alvarez and L. Petersson. DecomposeMe: Simplifying ConvNets for End-to-End Learning. arXiv e-print, arXiv:1606.05426, Jun 2016.
 [2] M. Courbariaux. BinaryNet. https://github.com/MatthieuCourbariaux/BinaryNet/, 2016.
 [3] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training Deep Neural Networks with binary weights during propagations. Advances in Neural Information Processing Systems (NIPS), pages 3123–3131, 2015.
 [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, Jun 2009.
 [5] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout Networks. Int'l Conf. on Machine Learning (ICML), pages 1319–1327, Feb 2013.
 [6] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks. Advances in Neural Information Processing Systems (NIPS), 2016.
 [7] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv eprint, arXiv:1502.03167, Mar 2015.
 [8] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up Convolutional Neural Networks with Low Rank Expansions. arXiv e-print, arXiv:1405.3866, May 2014.
 [9] M. Kim and P. Smaragdis. Bitwise Neural Networks. arXiv eprint, arXiv:1601.06071, Jan 2016.
 [10] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv eprint, arXiv:1412.6980, Dec 2014.
 [11] T. Papadopoulo and M. I. Lourakis. Estimating the Jacobian of the singular value decomposition: Theory and applications. European Conference on Computer Vision (ECCV), pages 554–570, 2000.
 [12] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al. Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 26–35, Feb 2016.
 [13] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. European Conference on Computer Vision (ECCV), Oct 2016. arXiv:1603.05279.
 [14] R. Rigamonti, A. Sironi, V. Lepetit, and P. Fua. Learning Separable Filters. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2754–2761, 2013.
 [15] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Technical report, University of California, San Diego, Institute for Cognitive Science, 1985.
 [16] A. Savitzky and M. J. Golay. Smoothing and differentiation of data by simplified least squares procedures. Analytical Chemistry, pages 1627–1639, Jul 1964.
 [17] K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv e-print, arXiv:1409.1556, Sep 2014.
 [18] D. Soudry, I. Hubara, and R. Meir. Expectation backpropagation: parameter-free training of multilayer neural networks with continuous or discrete weights. Advances in Neural Information Processing Systems (NIPS), pages 963–971, 2014.
 [19] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s. Seo, and Y. Cao. Throughput-Optimal OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), pages 16–25, Feb 2016.
 [20] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv e-print, arXiv:1602.07261, Feb 2016.
 [21] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv eprint, arXiv:1512.00567, Dec 2015.
 [22] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
 [23] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv eprint, arXiv:1605.02688, May 2016.
 [24] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta, and Z. Zhang. Accelerating Binarized Convolutional Neural Networks with Software-Programmable FPGAs. Int'l Symp. on Field-Programmable Gate Arrays (FPGA), Feb 2017.