I Introduction

Fig. 1: Depiction of (A) network parameter representations and (B) binarization and activation functions required during training for various DNNs and BNNs. Different levels of discretization are depicted using various shade palettes. DNNs require one set of real-valued (A1) or limited-precision parameters (A2) and typically have continuously differentiable activation functions (B1). In addition to the real-valued parameters (A3), BNNs also require another set of binarized parameters (A4), for which they use an STE (B2) to determine gradients of the signum function, which is not continuously differentiable. In contrast to BNNs, PBNNs require one set of real-valued parameters (A5) and use a continuously differentiable activation and binarization function, with a shape that progressively evolves during training (B3).
Binarization has been used to augment the performance of Deep Neural Networks (DNNs) by quantizing network parameters to binary states, replacing many resource-hungry multiply-accumulate operations with simple accumulations [1]. It has been demonstrated that Binarized Neural Networks (BNNs) implemented on customized hardware can perform inference faster than conventional DNNs on state-of-the-art Graphics Processing Units (GPUs) [2, 3], while offering notable improvements in power consumption and resource utilization [4, 5, 6]. However, there is still a performance gap between DNNs and conventional BNNs [7], which binarize parameters deterministically or stochastically. Moreover, the training routines of conventional BNNs are inherently unstable [8].
During the backward propagations of conventional BNN training routines, gradients are approximated using a Straight-Through Estimator (STE), as the signum function is not continuously differentiable [1]. The gap in performance, and the general instability of conventional BNNs compared to DNNs, can be largely attributed to the lack of an accurate derivative for weights and activations in BNNs, which creates a mismatch between binary and floating- or fixed-point real-valued representations [9]. The training routines of DNNs that utilize continuously differentiable and adjustable functions in place of the signum function, which we denote Progressively Binarizing NNs (PBNNs), transform a complex and non-smooth optimization problem into a sequence of smooth sub-optimization problems. Training routines that progressively binarize network parameters were first used to binarize the last layer of DNNs, yielding significant multimedia retrieval performance on standard benchmarks [10]. Since then, various works have detailed training routines of complete PBNNs [11, 12, 13]. However, efficient customized hardware implementations of PBNNs are yet to be explored.
In this paper, we use the Intel FPGA SDK for OpenCL development environment to implement and train novel and scalable PBNNs on an OpenVINO FPGA, which progressively binarize a singular set of fixed-point network parameters. We compare our approach to conventional BNNs and benchmark our implementations using CIFAR-10 [14]. Our specific contributions are as follows:
- We implement and present the first PBNNs using customized hardware and fixed-point number representations;
- We use a Piece-Wise Linear (PWL) binarization and activation function with a constant derivative to simplify computations;
- We demonstrate that, compared to BNNs trained deterministically or stochastically on CIFAR-10, PBNNs yield a marginal, yet consistent, increase in classification accuracy, and decrease both resource and power utilization.
II Preliminaries
II-A Conventional BNNs
The training routines of conventional BNNs binarize parameters either deterministically or stochastically after performing parameter optimizations. Deterministic binarization is performed as per Eq. (1),

$$w_b = \mathrm{sign}(w) = \begin{cases} +1, & \text{if } w \geq 0, \\ -1, & \text{otherwise,} \end{cases} \tag{1}$$

where $w_b$ denotes binarized parameters and $w$ denotes real-valued full-precision parameters. Stochastic binarization is performed as per Eq. (2), where $\sigma(w)$ is the hard sigmoid function described in Eq. (3).

$$w_b = \begin{cases} +1, & \text{with probability } p = \sigma(w), \\ -1, & \text{with probability } 1 - p, \end{cases} \tag{2}$$

$$\sigma(w) = \mathrm{clip}\!\left(\frac{w + 1}{2}, 0, 1\right) = \max\!\left(0, \min\!\left(1, \frac{w + 1}{2}\right)\right) \tag{3}$$

During backward propagation, gradients of large parameters are canceled using the indicator function $\mathbf{1}_{|w| \leq 1}$, as per Eq. (4), where $\mathcal{L}$ denotes the objective function.

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial w_b} \mathbf{1}_{|w| \leq 1} \tag{4}$$
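As a concrete illustration of Eqs. (1)-(4), the following is a minimal NumPy sketch of deterministic and stochastic binarization and of the gradient cancellation used by the STE; the function names are illustrative and not part of any library.

```python
import numpy as np

def hard_sigmoid(w):
    """Eq. (3): clip((w + 1) / 2, 0, 1)."""
    return np.clip((w + 1.0) / 2.0, 0.0, 1.0)

def binarize_deterministic(w):
    """Eq. (1): sign(w), mapping zeros to +1."""
    return np.where(w >= 0.0, 1.0, -1.0)

def binarize_stochastic(w, rng):
    """Eq. (2): +1 with probability hard_sigmoid(w), -1 otherwise."""
    return np.where(rng.random(w.shape) < hard_sigmoid(w), 1.0, -1.0)

def ste_grad(grad_wb, w):
    """Eq. (4): pass the gradient through, canceling it where |w| > 1."""
    return grad_wb * (np.abs(w) <= 1.0)

rng = np.random.default_rng(0)
w = np.array([-1.5, -0.2, 0.0, 0.4, 2.0])
print(binarize_deterministic(w))     # [-1. -1.  1.  1.  1.]
print(binarize_stochastic(w, rng))   # sampled binary values
print(ste_grad(np.ones_like(w), w))  # [0. 1. 1. 1. 0.]
```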
II-B Progressively Binarizing DNNs
PBNNs use a set of constrained real-valued parameters $w_c^l$ at each layer $l$, which are not directly learnable, but are a function of learnable parameters $w^l$ at each layer [11, 12]. The shape of the constraining function evolves during training, so that it closely resembles the signum function once training is complete. The hyperbolic tangent function is commonly used to relate $w_c$ and $w$, as described in Eq. (5),

$$w_c = \tanh(\nu w), \tag{5}$$

where $\nu$ is an adjustable scale parameter, which is used to evolve the shape of Eq. (5). The derivative of Eq. (5) is described in Eq. (6).

$$\frac{\partial w_c}{\partial w} = \nu \left(1 - \tanh^2(\nu w)\right) \tag{6}$$

As $\nu$ increases, the shape of Eq. (5) better mimics that of the signum function. During training, the learnable parameters, $w$, are optimized to minimize a loss function, while $\nu$ is progressively increased. After training is completed, $\nu$ is sufficiently large that the constrained parameters, $w_c$, are very close to their binary values, $\pm 1$, as depicted in Fig. 1. The final binary parameters can simply be obtained by passing $w_c$ through the signum function.

Algorithm 1 inputs: network hyperparameters (the learning rate schedule, the scale parameter schedule, the batch size, the gradient optimizer, the loss function, and the number of training epochs).
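The following is a minimal sketch of the tanh-based progressive binarization of Eqs. (5) and (6), showing how the constrained parameters approach binary values as the scale parameter grows; the function names are illustrative.

```python
import numpy as np

def progressive_binarize(w, nu):
    """Constrain real-valued parameters w using a scaled tanh (Eq. (5))."""
    return np.tanh(nu * w)

def progressive_binarize_grad(w, nu):
    """Derivative of Eq. (5) with respect to w (Eq. (6))."""
    return nu * (1.0 - np.tanh(nu * w) ** 2)

w = np.array([-0.8, -0.1, 0.05, 0.6])
for nu in (1.0, 10.0, 1000.0):
    # As nu grows, the constrained parameters approach sign(w).
    print(nu, progressive_binarize(w, nu))

# After training, the final binary parameters are obtained with the signum function.
w_b = np.sign(progressive_binarize(w, 1000.0))
```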
III Implementation Details
III-A Our Progressive Binarization Training Routine
We employ a PWL function to approximate the hyperbolic tangent function, described in Eq. (7) and depicted in Fig. 2, to simplify computations. In addition to reducing a non-linear function to a linear one, the derivative of Eq. (7), when bounded, is constant and does not depend on the input. Consequently, when all activations are computed simultaneously, the output of each layer during forward propagation does not need to be stored in memory to determine gradients during backward propagation.

$$\mathrm{PWL}(\nu x) = \mathrm{clip}(\nu x, -1, +1) = \begin{cases} -1, & \nu x < -1, \\ \nu x, & -1 \leq \nu x \leq 1, \\ +1, & \nu x > 1. \end{cases} \tag{7}$$
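A minimal sketch of this PWL function and its gradient, assuming the clipped-linear form of Eq. (7), is given below; names are illustrative.

```python
import numpy as np

def pwl(x, nu):
    """PWL approximation of tanh(nu * x) (Eq. (7)): clip(nu * x, -1, +1)."""
    return np.clip(nu * x, -1.0, 1.0)

def pwl_grad(x, nu):
    """Derivative of Eq. (7): constant (nu) inside the bounds, zero outside.

    Because the bounded derivative does not depend on x, layer outputs do not
    need to be stored during forward propagation to compute gradients.
    """
    return nu * (np.abs(nu * x) <= 1.0).astype(x.dtype)

x = np.linspace(-2.0, 2.0, 9)
print(pwl(x, nu=2.0))
print(pwl_grad(x, nu=2.0))
```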
Algorithm 1 provides a high-level overview of our progressive binarization training routine. Trained binary parameters can be computed after each training epoch to determine performance on the test set during training. Here, $a_j^l$ denotes the output of the $l$th layer at the $j$th epoch. As only the sign of the output of Batch Normalization (BN) is required when activations are binarized, it is reformulated to reduce computation as per Eq. (8) [12],

$$\mathrm{sign}(\mathrm{BN}(x)) = \mathrm{sign}(\gamma)\,\mathrm{sign}(x - \tau), \tag{8}$$

where $\tau$ is defined in Eq. (9), $x$ denotes the input, $\gamma$ and $\beta$ are parameters that define an affine transform, and $\mu$ and $\sigma$ are the running mean and standard deviation of the feature maps that pass through them.

$$\tau = \mu - \frac{\beta \sigma}{\gamma} \tag{9}$$
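The reformulation can be verified with the following minimal sketch, assuming a positive running standard deviation and a non-zero scale parameter $\gamma$; names are illustrative.

```python
import numpy as np

def sign_bn_reference(x, gamma, beta, mu, sigma):
    """Reference: sign of the full batch normalization output."""
    return np.sign(gamma * (x - mu) / sigma + beta)

def sign_bn_threshold(x, gamma, beta, mu, sigma):
    """Reformulated Eq. (8): a single comparison against the threshold of Eq. (9)."""
    tau = mu - beta * sigma / gamma
    return np.sign(gamma) * np.sign(x - tau)

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
gamma, beta, mu, sigma = 0.7, -0.3, 0.1, 1.2
# Both formulations produce the same binary outputs.
print(np.array_equal(sign_bn_reference(x, gamma, beta, mu, sigma),
                     sign_bn_threshold(x, gamma, beta, mu, sigma)))
```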
We trained all networks until improvement on the test set was negligible (for 50 epochs), using the largest batch size that made comparisons across all devices possible. The initial learning rate was decayed by an order of magnitude every 20 training epochs.

During training, each network's scale parameter, $\nu$, was increased logarithmically, from 1, at the first epoch, to 1000, at the final epoch. Eq. (8) was used to determine the output of all batch normalization layers. Adam [15] was used to optimize network parameters and Cross Entropy (CE) [16] was used to determine network losses. After the trained binary parameters were determined, for all of our implementations, a conventional OpenCL BNN inference accelerator was used to perform inference on the CIFAR-10 test set.
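For illustration, a logarithmic scale parameter schedule of this form, from 1 at the first epoch to 1000 at the last, can be generated as follows; the helper name is ours.

```python
import numpy as np

def nu_schedule(num_epochs, nu_start=1.0, nu_end=1000.0):
    """Logarithmically spaced scale parameter values, one per epoch."""
    return np.logspace(np.log10(nu_start), np.log10(nu_end), num=num_epochs)

schedule = nu_schedule(50)
print(schedule[0], schedule[-1])  # starts at 1.0 and ends at 1000.0
```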
III-B Network Architecture
The network architecture used in all of our DNNs was previously used in [17] and is a variant of the VGG [18] family of network architectures. It is summarized in Table I, which lists, for each convolutional and pooling layer, the number of filters, the filter size, the stride length, and the padding, and, for each fully connected layer, the number of output neurons. All convolutional and fully connected layers are sequenced with batch normalization and activation layers. The last fully connected layer adopts real-valued representations.
Table I: Network architecture.

Layer | Output Shape | Binarized |
---|---|---|
Convolutional | | ✓ |
Convolutional | | ✓ |
Max Pooling | | ✓ |
Convolutional | | ✓ |
Convolutional | | ✓ |
Max Pooling | | ✓ |
Convolutional | | ✓ |
Convolutional | | ✓ |
Max Pooling | | ✓ |
Fully Connected | | ✓ |
Fully Connected | | ✓ |
Fully Connected | | |
Table II: Total kernel power usage, total training time, and test set accuracy of our FPGA and GPU implementations for each training routine.

Training Routine | Total Kernel Power Usage, FPGA (W) | Total Kernel Power Usage, GPU (W) | Total Training Time, FPGA (s) | Total Training Time, GPU (s) | Test Set Accuracy, FPGA (%) | Test Set Accuracy, GPU (%) |
---|---|---|---|---|---|---|
8-bit Fixed Point | | | | | | |
Stochastic | 8.06 | 133.9 | 1,592.73 | 2,613.67 | 85.91 | 85.91 |
Deterministic | 7.95 | 133.0 | 1,523.17 | 2,497.72 | 85.56 | 85.56 |
Progressive | 7.60 | 130.3 | 1,383.17 | 2,315.97 | 86.28 | 86.28 |
16-bit Fixed Point | | | | | | |
Stochastic | 10.19 | 134.2 | 1,989.25 | 3,147.23 | 86.45 | 86.45 |
Deterministic | 10.03 | 132.8 | 1,907.17 | 2,909.62 | 86.16 | 86.16 |
Progressive | 9.27 | 130.5 | 1,729.32 | 2,685.22 | 86.94 | 86.94 |
FP32 Baseline | | | | | | |
Real-valued | — | 137.1 | — | 2,524.20 | — | 86.77 |
Similarly to conventional networks, the unbounded ReLU [19] activation function was used instead of Eq. (7) for the real-valued FP32 baseline implementation on GPU. The same test set accuracy was achieved for the GPU and FPGA implementations.

Table III: Resource utilization of our FPGA implementations for each training routine.

Training Routine | Deterministic | Stochastic | Progressive |
---|---|---|---|
Device | Intel FPGA OpenVINO | ||
Dataset | CIFAR-10 | ||
8-bit Fixed Point | | | |
Flip Flops (%) | 63.19 | 66.42 | 62.95 |
ALMs (%) | 81.38 | 84.87 | 76.92 |
DSPs (%) | 100.00 | 100.00 | 93.20 |
16-bit Fixed Point | | | |
Flip Flops (%) | 96.06 | 98.43 | 91.96 |
ALMs (%) | 90.40 | 94.31 | 85.54 |
DSPs (%) | 100.00 | 100.00 | 100.00 |
III-C Hardware Architecture
All of our implementations are described using the heterogeneous OpenCL [20] framework, in which multiple OpenCL kernels are accelerated using either FPGAs or GPUs controlled by C++ host controllers. For the FPGA implementations, an SoC is used as the host controller, whereas for the GPU implementations, a CPU is used. We note that the power consumption of our FPGA implementations could be decreased further by realizing them using a Hardware Description Language (HDL) and removing the host controller; however, this would make fair comparisons between GPU and FPGA implementations difficult [21].
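The following minimal sketch illustrates this heterogeneous host-kernel pattern using the pyopencl bindings, rather than the C++ host controllers used in our implementations, with the PWL function of Eq. (7) as a toy kernel; all names are illustrative.

```python
import numpy as np
import pyopencl as cl

# OpenCL C kernel implementing the PWL activation of Eq. (7).
KERNEL_SRC = """
__kernel void pwl(__global const float *x, __global float *y, const float nu) {
    int i = get_global_id(0);
    y[i] = clamp(nu * x[i], -1.0f, 1.0f);
}
"""

ctx = cl.create_some_context()  # selects an available OpenCL device
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, KERNEL_SRC).build()

x = np.linspace(-2, 2, 1024).astype(np.float32)
y = np.empty_like(x)
mf = cl.mem_flags
x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=x)
y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, y.nbytes)

# The host enqueues the kernel and reads back the result.
prg.pwl(queue, x.shape, None, x_buf, y_buf, np.float32(2.0))
cl.enqueue_copy(queue, y, y_buf)
print(y.min(), y.max())  # expected: -1.0, 1.0
```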
IV Implementation Results
In order to investigate the performance of our progressively binarizing training routine, CIFAR-10 was used. Prior to training, the color channels of each image were normalized using mean and standard deviation values of (0.4914, 0.2023), (0.4822, 0.1994), and (0.4465, 0.2010) for the red, green, and blue channels, respectively, as this normalization has been demonstrated to perform well on the ImageNet dataset [22]. We compare FPGA implementations adopting 16-bit and 8-bit fixed-point real-valued representations, as a large degradation in performance was observed when using smaller bit widths.

To compile OpenCL kernels for the OpenVINO FPGA, the Intel FPGA SDK for OpenCL Offline Compiler (IOC) was used, as part of the Intel FPGA SDK for OpenCL and Quartus Prime Design Suite 18.1. For our GPU implementations, a Titan V GPU was used to execute OpenCL kernels and an Overclocked (OC) AMD Ryzen 2700X @ 4.10 GHz CPU was used as the host controller. Version 430.50 of the Titan V GPU driver was used to launch compute kernels. We report all GPU and FPGA implementation results in Table II.
From Table II, it can be observed that our progressive training routine consumed the least power and had the smallest total training time on the FPGA. Moreover, when 16-bit fixed-point real-valued representations were adopted during training, it achieved the largest test set accuracy. We believe that, similarly to [1], this can be attributed to the additional regularization that binarized parameters introduce. We note that the total training times of our GPU and FPGA implementations are not indicative of those with larger batch sizes, and that the available resources on the FPGA restricted the batch size that could be used across all devices.
The device utilization of our FPGA implementations is presented in Table III. Our progressive binarizing training routine consumes notably fewer Adaptive Logic Modules (ALMs) and Flip Flops than the deterministic and stochastic routines for both 16- and 8-bit fixed-point representations. Digital Signal Processor (DSP) utilization is similar to that of the deterministic and stochastic routines, and decreases only marginally when 8-bit fixed-point real-valued representations are adopted.
V Conclusion
We proposed and implemented novel and scalable PBNNs on GPUs and FPGAs. We compared our approach to conventional BNNs and real-valued DNNs using GPUs and FPGAs and demonstrated notable reductions in power and resource utilization on CIFAR-10. This was achieved through approximations and hardware optimizations, as well as by using only one set of network parameters, compared to the two sets required by conventional BNNs. We leave further hardware-level dissemination, upscaling, hyperparameter optimization, and tuning to future work.
References
- [1] M. Courbariaux and Y. Bengio, “BinaryNet: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1,” CoRR, vol. abs/1602.02830, 2016. [Online]. Available: http://arxiv.org/abs/1602.02830
- [2] E. Nurvitadhi, D. Sheffield, Jaewoong Sim, A. Mishra, G. Venkatesh, and D. Marr, “Accelerating Binarized Neural Networks: Comparison of FPGA, CPU, GPU, and ASIC,” in 2016 International Conference on Field-Programmable Technology (FPT), Dec 2016, pp. 77–84.
- [3] C. Lammie, A. Olsen, T. Carrick, and M. Rahimi Azghadi, “Low-Power and High-Speed Deep FPGA Inference Engines for Weed Classification at the Edge,” IEEE Access, vol. 7, pp. 51 171–51 184, 2019.
- [4] L. Yang, Z. He, and D. Fan, “A Fully Onchip Binarized Convolutional Neural Network FPGA Impelmentation with Accurate Inference,” in Proceedings of the International Symposium on Low Power Electronics and Design, ser. ISLPED ’18. New York, NY, USA: ACM, 2018, pp. 50:1–50:6. [Online]. Available: http://doi.acm.org/10.1145/3218603.3218615
- [5] S. Liang, S. Yin, L. Liu, W. Luk, and S. Wei, “FP-BNN: Binarized neural network on FPGA,” Neurocomputing, vol. 275, pp. 1072 – 1086, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0925231217315655
- [6] C. Lammie, W. Xiang, and M. R. Azghadi, “Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL,” in 2019 IEEE 62nd International Midwest Symposium on Circuits and Systems (MWSCAS), Aug 2019, pp. 626–629.
- [7] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia, “BNN+: Improved Binary Network Training,” CoRR, vol. abs/1812.11800, 2018. [Online]. Available: http://arxiv.org/abs/1812.11800
- [8] W. Tang, G. Hua, and L. Wang, “How to Train a Compact Binary Neural Network with High Accuracy?” 2017. [Online]. Available: https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14619
- [9] X. Lin, C. Zhao, and W. Pan, “Towards Accurate Binary Convolutional Neural Network,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 345–353. [Online]. Available: http://papers.nips.cc/paper/6638-towards-accurate-binary-convolutional-neural-network.pdf
- [10] Z. Cao, M. Long, J. Wang, and P. S. Yu, “HashNet: Deep Learning to Hash by Continuation,” CoRR, vol. abs/1702.00758, 2017. [Online]. Available: http://arxiv.org/abs/1702.00758
- [11] C. Sakr, J. Choi, Z. Wang, K. Gopalakrishnan, and N. Shanbhag, “True Gradient-Based Training of Deep Binary Activated Neural Networks Via Continuous Binarization,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April 2018, pp. 2346–2350.
- [12] F. Lahoud, R. Achanta, P. Márquez-Neila, and S. Süsstrunk, “Self-Binarizing Networks,” CoRR, vol. abs/1902.00730, 2019. [Online]. Available: http://arxiv.org/abs/1902.00730
- [13] Z. Li, D. He, F. Tian, W. Chen, T. Qin, L. Wang, and T. Liu, “Towards Binary-Valued Gates for Robust LSTM Training,” CoRR, vol. abs/1806.02988, 2018. [Online]. Available: http://arxiv.org/abs/1806.02988
- [14] A. Krizhevsky et al., “Learning Multiple Layers of Features from Tiny Images,” Citeseer, Tech. Rep., 2009.
- [15] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” CoRR, vol. abs/1412.6980, 2014.
- [16] Z. Zhang and M. R. Sabuncu, “Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels,” CoRR, vol. abs/1805.07836, 2018. [Online]. Available: http://arxiv.org/abs/1805.07836
- [17] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-Supervised Nets,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Lebanon and S. V. N. Vishwanathan, Eds., vol. 38. San Diego, California, USA: PMLR, 09–12 May 2015, pp. 562–570. [Online]. Available: http://proceedings.mlr.press/v38/lee15a.html
- [18] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” CoRR, vol. abs/1409.1556, 2014.
- [19] X. Glorot, A. Bordes, and Y. Bengio, “Deep Sparse Rectifier Neural Networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, ser. Proceedings of Machine Learning Research, G. Gordon, D. Dunson, and M. Dudik, Eds., vol. 15. Fort Lauderdale, FL, USA: PMLR, 11–13 Apr 2011, pp. 315–323. [Online]. Available: http://proceedings.mlr.press/v15/glorot11a.html
- [20] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems,” Computing in Science Engineering, vol. 12, no. 3, pp. 66–73, May 2010.
- [21] T. Sorensen and A. F. Donaldson, “The Hitchhiker’s Guide to Cross-Platform OpenCL Application Development,” in Proceedings of the 4th International Workshop on OpenCL, ser. IWOCL ’16. New York, NY, USA: ACM, 2016, pp. 2:1–2:12. [Online]. Available: http://doi.acm.org/10.1145/2909437.2909440
- [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017. [Online]. Available: http://doi.acm.org/10.1145/3065386