I Introduction
Binarization has been used to augment the performance of Deep Neural Networks (DNNs), by quantizing network parameters to binary states, replacing many resource-hungry multiply-accumulate operations with simple accumulations [1]. It has been demonstrated that Binarized Neural Networks (BNNs) implemented on customized hardware can perform inference faster than conventional DNNs on state-of-the-art Graphics Processing Units (GPUs) [2, 3], while offering notable improvements in power consumption and resource utilization [4, 5, 6]. However, there is still a performance gap between DNNs and conventional BNNs [7], which binarize parameters deterministically or stochastically. Moreover, the training routines of conventional BNNs are inherently unstable [8].
During backward propagations of conventional BNN training routines, gradients are approximated using a Straight-Through Estimator (STE), as the signum function is not continuously differentiable [1]. The gap in performance, and the general instability of conventional BNNs compared to DNNs, can be largely attributed to the lack of an accurate derivative for weights and activations in BNNs, which creates a mismatch between binary and floating- or fixed-point real-valued representations [9].
The training routines of DNNs that utilize continuously differentiable and adjustable functions in place of the signum function, which we denote Progressively Binarizing NNs (PBNNs), transform a complex and non-smooth optimization problem into a sequence of smooth sub-optimization problems. Such training routines, which progressively binarize network parameters, were first used to binarize the last layer of DNNs to yield significant multimedia retrieval performance on standard benchmarks [10]. Since then, various works have detailed training routines of complete PBNNs [11, 12, 13]. However, efficient customized hardware implementations of PBNNs are yet to be explored.
In this paper, we use the Intel FPGA SDK for OpenCL development environment to implement and train novel and scalable PBNNs on an OpenVINO FPGA, which progressively binarize a single set of fixed-point network parameters. We compare our approach to conventional BNNs and benchmark our implementations using CIFAR-10 [14]. Our specific contributions are as follows:

We implement and present the first PBNNs using customized hardware and fixed-point number representations;

We use a Piece-Wise Linear (PWL) function with a constant derivative for binarization and activations, to simplify computations;

We demonstrate that, compared to training BNNs deterministically or stochastically on CIFAR-10, PBNNs yield a marginal, yet consistent, increase in classification accuracy, and decrease both resource and power utilization.
II Preliminaries
II-A Conventional BNNs
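The training routines of conventional BNNs binarize parameters either deterministically or stochastically after performing parameter optimizations. Deterministic binarization is performed as per Eq. (1),
$$x^b = \mathrm{sign}(x) = \begin{cases} +1, & x \ge 0, \\ -1, & \text{otherwise}, \end{cases} \tag{1}$$
where $x^b$ denotes binarized parameters and $x$ denotes real-valued full-precision parameters. Stochastic binarization is performed as per Eq. (2), where $\sigma$ is the hard sigmoid function described in Eq. (3).
$$x^b = \begin{cases} +1, & \text{with probability } p = \sigma(x), \\ -1, & \text{with probability } 1 - p, \end{cases} \tag{2}$$
$$\sigma(x) = \mathrm{clip}\left(\frac{x + 1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x + 1}{2}\right)\right) \tag{3}$$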
During backward propagations, large parameters are clipped using , as per Eq. (4), where denotes the objective function.
(4) 
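The two binarization modes and the clipped gradient pass-through can be sketched in a few lines of plain Python. This is only an illustrative sketch: the hard sigmoid is taken in the form used in [1], and the names `t_clip` (the clipping threshold, defaulting to 1 here) and `ste_gradient` are assumptions introduced for this example, not identifiers from our implementation.

```python
import random


def hard_sigmoid(x):
    """Hard sigmoid in the form used in [1]: clip((x + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def binarize_deterministic(x):
    """Signum-style binarization: +1 for x >= 0, otherwise -1."""
    return 1.0 if x >= 0.0 else -1.0


def binarize_stochastic(x, rng=random):
    """Return +1 with probability hard_sigmoid(x), otherwise -1."""
    return 1.0 if rng.random() < hard_sigmoid(x) else -1.0


def ste_gradient(grad_wrt_binary, x, t_clip=1.0):
    """Straight-through estimate: pass the gradient only where |x| <= t_clip.

    t_clip is a hypothetical threshold name chosen for this sketch.
    """
    return grad_wrt_binary if abs(x) <= t_clip else 0.0
```

With this form, a parameter far above the threshold always binarizes to +1 in both modes, and its gradient is cancelled during the backward pass.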
II-B Progressively Binarizing DNNs
PBNNs use a set of constrained real-valued parameters at each layer, $W_b^l$, which are not directly learnable, but are a function of learnable parameters at each layer, $W^l$ [11, 12]. The shape of this function evolves during training, to closely resemble the signum function once training is complete. The hyperbolic tangent function is commonly used to relate $W_b^l$ and $W^l$, as described in Eq. (5),
$$W_b^l = \tanh\left(\nu W^l\right), \tag{5}$$
where $\nu$ is an adjustable scale parameter, which is used to evolve the shape of Eq. (5). The derivative of Eq. (5) is described in Eq. (6).
$$\frac{\partial W_b^l}{\partial W^l} = \nu \left(1 - \tanh^2\left(\nu W^l\right)\right) \tag{6}$$
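A short numerical sketch may help to illustrate how the scaled hyperbolic tangent approaches the signum function. It assumes that Eq. (5) has the form tanh(scale · w); the function names and the sample value of `w` are hypothetical.

```python
import math


def soft_binarize(w, scale):
    """Constrained parameter tanh(scale * w), the assumed form of Eq. (5)."""
    return math.tanh(scale * w)


def soft_binarize_grad(w, scale):
    """Derivative of tanh(scale * w) with respect to w."""
    return scale * (1.0 - math.tanh(scale * w) ** 2)


if __name__ == "__main__":
    w = 0.05  # a hypothetical learnable parameter
    for scale in (1.0, 10.0, 100.0, 1000.0):
        # As the scale grows, the output saturates towards sign(w) and the
        # derivative vanishes everywhere except very close to w = 0.
        print(scale, soft_binarize(w, scale), soft_binarize_grad(w, scale))
```

For small scales the function is close to linear around zero, so gradients flow freely; for large scales it is numerically indistinguishable from the signum function for all but the smallest parameters.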
As the scale parameter, $\nu$, increases, the shape of Eq. (5) better mimics that of the signum function. During training, the learnable parameters, denoted using $W^l$, are optimized to minimize a loss function, while $\nu$ is progressively increased. After training is completed, $\nu$ is sufficiently large that the constrained parameters, $W_b^l$, are very close to $\pm 1$, as depicted in Fig. 1. The final binary parameters can simply be obtained by passing $W_b^l$ through the signum function.
III Implementation Details
III-A Our Progressive Binarization Training Routine
We employ a PWL function to approximate the hyperbolic tangent function, described in Eq. (7) and depicted in Fig. 2 to simplify computations. In addition to reducing a nonlinear function to a linear function, the derivative of Eq. (7), when bounded, is constant and does not depend on . Consequently, when all activations are computed simultaneously, the output of each layer during forward propagations, , does not need to be stored in memory to determine gradients during backward propagations.
(7) 
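To make the constant-derivative property concrete, the following sketch uses a hard-tanh style PWL function, i.e. a scaled input clipped to [-1, 1]. This specific form is an assumption made for illustration, since it may differ in detail from Eq. (7); `pwl`, `pwl_grad`, and `scale` are hypothetical names.

```python
def pwl(x, scale):
    """Assumed PWL approximation of tanh(scale * x): scale * x clipped to [-1, 1]."""
    return max(-1.0, min(1.0, scale * x))


def pwl_grad(x, scale):
    """Slope of pwl: the constant `scale` in the linear region, zero outside it."""
    return scale if abs(scale * x) < 1.0 else 0.0
```

Under this assumption, the slope inside the bounded region is simply the scale parameter, so no transcendental function has to be evaluated in either the forward or the backward pass.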
Algorithm 1 provides a high-level overview of our progressive binarization training routine. Trained binary parameters can be computed after each training epoch to determine performance on the test set during training. Here, denotes the output of the th layer at the th epoch. As the sign of the output of Batch Normalization (BN) is reformulated to reduce computation as per Eq. (8) [12].
(8) 
where is defined in Eq. (9). denotes the input, and are parameters that define an affine transform, and and are the running mean and standard deviation of the feature maps that pass through them.
(9) 
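The idea behind reformulating the sign of a BN output can be sketched as follows. This is not a transcription of Eq. (8) and Eq. (9); it is the standard algebraic observation that, for a non-zero affine scale and a positive standard deviation, the sign of the BN output is decided by a single threshold comparison. The names `gamma`, `beta`, `mean`, `std`, and `threshold` are assumptions chosen for this example.

```python
import random


def batch_norm(x, gamma, beta, mean, std):
    """Affine batch normalization applied to a single value."""
    return gamma * (x - mean) / std + beta


def sign(v):
    """Signum with sign(0) = +1, matching deterministic binarization."""
    return 1.0 if v >= 0.0 else -1.0


def bn_sign(x, gamma, beta, mean, std):
    """Sign of the BN output via one comparison (requires gamma != 0, std > 0)."""
    threshold = mean - beta * std / gamma
    if gamma > 0.0:
        return 1.0 if x >= threshold else -1.0
    return 1.0 if x <= threshold else -1.0


if __name__ == "__main__":
    rng = random.Random(0)
    x, gamma, beta = rng.uniform(-3, 3), -0.8, 0.3
    # Both routes should agree on the sign of the normalized value.
    print(sign(batch_norm(x, gamma, beta, 0.1, 1.5)), bn_sign(x, gamma, beta, 0.1, 1.5))
```

Because the threshold depends only on the BN parameters and running statistics, it can be computed once per feature map, and the per-element work reduces to a comparison.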
We trained all networks until improvement on the test set was negligible (for 50 epochs) with a batch size . This is the largest possible batch size that makes comparison across devices possible. The initial learning rate was , which was decayed by an order of magnitude every 20 training epochs, i.e. when .
During training, each network’s scale parameter, , was increased logarithmically, from 1, at the first epoch, to 1000, at the final epoch. Eq. (8) was used to determine the output of all batch normalization layers when . Adam [15] was used to optimize network parameters and Cross Entropy (CE) [16] was used to determine network losses. After the trained binary parameters were determined, for all our implementations, a conventional OpenCL BNN inference accelerator was used to perform inference on the CIFAR-10 test set.
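The two schedules described above can be sketched as simple functions of the epoch index. The sketch interprets "increased logarithmically" as geometric interpolation between 1 and 1000 over 50 epochs, which is an assumption; `initial_lr` is left as a caller-supplied, hypothetical parameter.

```python
def scale_schedule(epoch, num_epochs=50, start=1.0, end=1000.0):
    """Geometric interpolation from `start` (first epoch) to `end` (final epoch).

    `epoch` is zero-based, so valid values are 0 .. num_epochs - 1.
    """
    fraction = epoch / (num_epochs - 1)
    return start * (end / start) ** fraction


def learning_rate(epoch, initial_lr, decay_every=20, factor=0.1):
    """Decay the learning rate by an order of magnitude every `decay_every` epochs."""
    return initial_lr * factor ** (epoch // decay_every)


if __name__ == "__main__":
    for epoch in (0, 19, 20, 40, 49):
        print(epoch, scale_schedule(epoch), learning_rate(epoch, initial_lr=1.0))
```

With 50 epochs, the learning rate is decayed twice, at epochs 20 and 40, while the scale parameter grows by a constant factor per epoch.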
III-B Network Architecture
The network architecture, previously used in [17], was used in all of our DNNs. This architecture is a variant of the VGG [18] family of network architectures. It is summarized in Table I. For each convolutional and pooling layer, denotes the number of filters, determines the filter size, is the stride length, and denotes the padding. Here, is the number of output neurons for each fully connected layer. All convolutional and fully connected layers are sequenced with batch normalization and activation layers. The last fully connected layer adopts real-valued representations.
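Table III: Device utilization of our FPGA implementations (device: Intel FPGA OpenVINO; dataset: CIFAR-10).

| Training Routine | Deterministic | Stochastic | Progressive |
|---|---|---|---|
| 8-bit Fixed Point | | | |
| Flip Flops (%) | 63.19 | 66.42 | 62.95 |
| ALMs (%) | 81.38 | 84.87 | 76.92 |
| DSPs (%) | 100.00 | 100.00 | 93.20 |
| 16-bit Fixed Point | | | |
| Flip Flops (%) | 96.06 | 98.43 | 91.96 |
| ALMs (%) | 90.40 | 94.31 | 85.54 |
| DSPs (%) | 100.00 | 100.00 | 100.00 |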
III-C Hardware Architecture
All of our implementations are described using the heterogeneous OpenCL [20] framework, in which multiple OpenCL kernels are accelerated using either FPGAs or GPUs that are controlled using C++ host controllers. For FPGA implementations, an SoC is used as the host controller, whereas for GPU implementations, a CPU is used. We note that the power consumption of our FPGA implementations could be further decreased by realizing them in a Hardware Description Language (HDL) and removing the host controller; however, this would make fair comparisons between GPU and FPGA implementations difficult [21].
IV Implementation Results
To investigate the performance of our progressively binarizing training routine, CIFAR-10 was used. Prior to training, the color channels of each image were normalized using mean and standard deviation values of (0.4914, 0.2023), (0.4822, 0.1994), and (0.4465, 0.2010), for the red, green, and blue image channels, respectively. This normalization was performed because it has demonstrated significant performance on the ImageNet dataset [22]. We compare FPGA implementations adopting 16-bit and 8-bit fixed-point real-valued representations, as a large degradation in performance was observed when using smaller bit widths.
To compile OpenCL kernels for the OpenVINO FPGA, the Intel FPGA SDK for OpenCL Offline Compiler (IOC) was used, as part of the Intel FPGA SDK for OpenCL and Quartus Prime Design Suite 18.1. For our GPU implementations, a Titan V GPU was used to execute OpenCL kernels and an AMD Ryzen 2700X @ 4.10 GHz Overclocked (OC) CPU was used to drive the host controller. We used version 430.50 of the Titan V GPU driver to launch compute kernels. We report all GPU and FPGA implementation results in Table II.
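The per-channel normalization described above amounts to subtracting the channel mean and dividing by the channel standard deviation. The sketch below assumes pixel intensities have already been scaled to the range [0, 1], which the magnitude of the quoted means suggests; the constant and function names are hypothetical.

```python
# Per-channel statistics quoted above, in red, green, blue order.
CIFAR10_MEAN = (0.4914, 0.4822, 0.4465)
CIFAR10_STD = (0.2023, 0.1994, 0.2010)


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are assumed to lie in [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, CIFAR10_MEAN, CIFAR10_STD))


if __name__ == "__main__":
    print(normalize_pixel((1.0, 1.0, 1.0)))  # a pure white pixel
```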
From Table II, it can be observed that our progressive training routine consumed the least power and had the smallest total training time on FPGA. Moreover, when adopting 16-bit fixed-point real-valued representations during training, it achieved the highest test set accuracy. We believe that, similarly to [1], this can be attributed to the additional regularization that binarized parameters introduce. We note that the total training times of our GPU and FPGA implementations are not indicative of those with larger batch sizes, and that the available resources on the FPGA used restricted the batch size that could be used across all devices.
The device utilization of our FPGA implementations is presented in Table III. Our progressively binarizing training routine consumes notably fewer Adaptive Logic Modules (ALMs) and Flip Flops than deterministic and stochastic routines for both 16-bit and 8-bit fixed-point representations. Digital Signal Processor (DSP) utilization is similar to deterministic and stochastic routines, and is only decreased marginally when 8-bit fixed-point real-valued representations are adopted.
V Conclusion
We proposed and implemented novel and scalable PBNNs on GPUs and FPGAs. We compared our approach to conventional BNNs and real-valued DNNs using GPUs and FPGAs and demonstrated notable reductions in power and resource utilization for CIFAR-10. This was achieved through approximations and hardware optimizations, as well as using only one set of network parameters compared to conventional BNNs. We leave further hardware-level dissemination, upscaling, hyperparameter optimization, and tuning to future works.
References
 [1] M. Courbariaux and Y. Bengio, “BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1,” CoRR, vol. abs/1602.02830, 2016. [Online]. Available: http://arxiv.org/abs/1602.02830
 [2] E. Nurvitadhi, D. Sheffield, J. Sim, A. Mishra, G. Venkatesh, and D. Marr, “Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC,” in 2016 International Conference on Field-Programmable Technology (FPT), Dec 2016, pp. 77–84.
 [3] C. Lammie, A. Olsen, T. Carrick, and M. Rahimi Azghadi, “Low-Power and High-Speed Deep FPGA Inference Engines for Weed Classification at the Edge,” IEEE Access, vol. 7, pp. 51171–51184, 2019.
 [4] L. Yang, Z. He, and D. Fan, “A Fully Onchip Binarized Convolutional Neural Network FPGA Impelmentation with Accurate Inference,” in Proceedings of the International Symposium on Low Power Electronics and Design, ser. ISLPED ’18. New York, NY, USA: ACM, 2018, pp. 50:1–50:6. [Online]. Available: http://doi.acm.org/10.1145/3218603.3218615
 [5] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “FP-BNN: Binarized neural network on FPGA,” Neurocomputing, vol. 275, pp. 1072–1086, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231217315655
 [6] C. Lammie, W. Xiang, and M. R. Azghadi, “Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL,” in 2019 IEEE 62nd International Midwest Symposium on Circuits and Systems (MWSCAS), Aug 2019, pp. 626–629.
 [7] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia, “BNN+: Improved Binary Network Training,” CoRR, vol. abs/1812.11800, 2018. [Online]. Available: http://arxiv.org/abs/1812.11800
 [8] W. Tang, G. Hua, and L. Wang, “How to Train a Compact Binary Neural Network with High Accuracy?” 2017. [Online]. Available: https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14619
 [9] X. Lin, C. Zhao, and W. Pan, “Towards Accurate Binary Convolutional Neural Network,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 345–353. [Online]. Available: http://papers.nips.cc/paper/6638-towards-accurate-binary-convolutional-neural-network.pdf
 [10] Z. Cao, M. Long, J. Wang, and P. S. Yu, “HashNet: Deep Learning to Hash by Continuation,” CoRR, vol. abs/1702.00758, 2017. [Online]. Available: http://arxiv.org/abs/1702.00758
 [11] C. Sakr, J. Choi, Z. Wang, K. Gopalakrishnan, and N. Shanbhag, “True Gradient-Based Training of Deep Binary Activated Neural Networks Via Continuous Binarization,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 2346–2350.
 [12] F. Lahoud, R. Achanta, P. Márquez-Neila, and S. Süsstrunk, “Self-Binarizing Networks,” CoRR, vol. abs/1902.00730, 2019. [Online]. Available: http://arxiv.org/abs/1902.00730
 [13] Z. Li, D. He, F. Tian, W. Chen, T. Qin, L. Wang, and T. Liu, “Towards Binary-Valued Gates for Robust LSTM Training,” CoRR, vol. abs/1806.02988, 2018. [Online]. Available: http://arxiv.org/abs/1806.02988
 [14] A. Krizhevsky et al., “Learning Multiple Layers of Features from Tiny Images,” Citeseer, Tech. Rep., 2009.
 [15] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” CoRR, vol. abs/1412.6980, 2014.
 [16] Z. Zhang and M. R. Sabuncu, “Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels,” CoRR, vol. abs/1805.07836, 2018. [Online]. Available: http://arxiv.org/abs/1805.07836

 [17] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-Supervised Nets,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Lebanon and S. V. N. Vishwanathan, Eds., vol. 38. San Diego, California, USA: PMLR, 09–12 May 2015, pp. 562–570. [Online]. Available: http://proceedings.mlr.press/v38/lee15a.html
 [18] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” CoRR, vol. abs/1409.1556, 2014.
 [19] X. Glorot, A. Bordes, and Y. Bengio, “Deep Sparse Rectifier Neural Networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Gordon, D. Dunson, and M. Dudik, Eds., vol. 15. Fort Lauderdale, FL, USA: PMLR, 11–13 Apr 2011, pp. 315–323. [Online]. Available: http://proceedings.mlr.press/v15/glorot11a.html
 [20] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems,” Computing in Science & Engineering, vol. 12, no. 3, pp. 66–73, May 2010.
 [21] T. Sorensen and A. F. Donaldson, “The Hitchhiker’s Guide to Cross-Platform OpenCL Application Development,” in Proceedings of the 4th International Workshop on OpenCL, ser. IWOCL ’16. New York, NY, USA: ACM, 2016, pp. 2:1–2:12. [Online]. Available: http://doi.acm.org/10.1145/2909437.2909440
 [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017. [Online]. Available: http://doi.acm.org/10.1145/3065386