Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL

05/15/2019
by   Corey Lammie, et al.
James Cook University

Recent technological advances have proliferated the available computing power, memory, and speed of modern Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). Consequently, the performance and complexity of Artificial Neural Networks (ANNs) are burgeoning. While GPU-accelerated Deep Neural Networks (DNNs) currently offer state-of-the-art performance, they consume large amounts of power. Training such networks on CPUs is inefficient, as data throughput and parallel computation are limited. FPGAs are considered a suitable candidate for performance-critical, low-power systems, e.g. Internet of Things (IoT) edge devices. Using the Xilinx SDAccel or Intel FPGA SDK for OpenCL development environments, networks described using the high-level OpenCL framework can be accelerated on heterogeneous platforms. Moreover, the resource utilization and power consumption of DNNs can be further reduced by utilizing regularization techniques that binarize network weights. In this paper, we introduce, to the best of our knowledge, the first FPGA-accelerated stochastically binarized DNN implementations, and compare them to implementations accelerated using both GPUs and FPGAs. Our developed networks are trained and benchmarked using the popular MNIST and CIFAR-10 datasets, and achieve near state-of-the-art performance, while offering a >16-fold improvement in power consumption, compared to conventional GPU-accelerated networks. Both our FPGA-accelerated deterministic and stochastic BNNs reduce inference times on MNIST and CIFAR-10 by >9.89x and >9.91x, respectively.


I Introduction

Deep Neural Network (DNN) architectures have become integral to a variety of applications in Artificial Intelligence (AI) and Machine Learning (ML). While these learning networks and their underpinning elements have been actively researched since 1974 [11], the advent of modern GPUs and faster CPU architectures has greatly facilitated Neural Network (NN) research and enabled the development of highly accurate and complex DNNs.

However, high-performance CPU- and GPU-accelerated DNNs are known to consume large amounts of power. As a result, accelerating DNNs on low-power and resource-constrained devices, such as portable smart electronics and IoT edge devices, becomes a formidable challenge. Considerable efforts are currently being made to utilize customized hardware solutions based on FPGAs, which offer significant reductions in power consumption for both Fully Connected (FC) and Convolutional Neural Networks (CNNs) [10, 6, 4].

Despite the many improvements that recent FPGA studies offer in boosting parallelism and power efficiency, the large number of high-resolution multiplications required during learning and inference means that such accelerated implementations are limited by the number of dedicated multipliers and Digital Signal Processing (DSP) blocks available on FPGAs. Therefore, new techniques have recently been developed to account for the limited hardware resources available. A very popular technique, which quantizes network weights to binary states, has been proposed to greatly reduce resource utilization, and as a result power consumption, while exhibiting minimal performance degradation [1]. Within these networks, denoted Binarized Neural Networks (BNNs), many resource-hungry multiply-accumulate operations, required during learning and inference, are replaced with simple accumulations.

Deterministic [2], stochastic [1], and recursive [8] BNNs binarize weights during forward and backward propagation cycles, while retaining the precision of the stored weights to which gradients are accumulated. Self-binarizing networks [8] train using a unique representation of network weights, involving a smooth activation function which is iteratively sharpened during training until it becomes equivalent to the binary sign activation function.

While hardware implementations of deterministic BNNs are plentiful [6, 12], to the best of our knowledge, there are no current FPGA implementations of stochastic BNNs. Therefore, here we propose the first FPGA implementations of stochastic BNNs, as it has been demonstrated that stochastic BNNs further improve the learning performance of BNNs, compared to their deterministic counterparts [1].

In addition, we provide comprehensive results by investigating the acceleration of deterministic and stochastic BNNs on both GPUs and FPGAs using High Level Synthesis (HLS) techniques built on the OpenCL framework, to encourage deployment on heterogeneous platforms. Resource usage and performance of the implemented networks are also compared for permutation-invariant DNNs and CNNs, trained and tested on MNIST [5] and CIFAR-10 [3]. For all our hardware implementations, we draw comparisons among designs utilizing deterministic, stochastic, or no regularization techniques. Our specific contributions are as follows:

  • We implement and present the first FPGA-accelerated stochastically binarized DNNs and CNNs.

  • We deploy complete FPGA-accelerated DNNs and CNNs on a standalone System on a Chip (SoC), requiring no host computer or additional device for partial computation.

  • We demonstrate that our new binarized FPGA-accelerated DNNs and CNNs offer significantly reduced power usage and shorter inference times, compared to their equivalent full-precision counterparts, on MNIST and CIFAR-10, implemented on both GPU and FPGA.

  • We report and investigate the learning times required for all of our implemented networks.

II Preliminaries

This section briefly reviews and presents the algorithms and methods used in our developed networks for the MNIST and CIFAR-10 classification benchmarks.

0:  Require: a mini-batch of (inputs, targets), previous parameters W (weights) and b (biases), and a learning rate η.
0:  Ensure: updated parameters W and b.
1:  Forward Propagation: W_b ← binarize(W). For k = 1 to L, compute a_k knowing a_(k−1), W_b, and b.
2:  Backward Propagation: Initialize the output layer's activation gradient ∂C/∂a_L. For k = L to 1, compute ∂C/∂a_(k−1) using ∂C/∂a_k and W_b.
3:  Parameter Update: Compute ∂C/∂W_b and ∂C/∂b, using ∂C/∂a_k and a_(k−1). W ← clip(W − η ∂C/∂W_b), b ← b − η ∂C/∂b.
4:  Weight Normalization
Algorithm 1 Training Algorithm of the Accelerated Binarized Neural Networks

II-A Binary Weight Regularization

Binary weight regularization [1] constrains network weights to the binary states {+1, -1} during forward and backward propagations. The binarization operation transforms the full-precision weights into binary values, using either a deterministic or a stochastic approach.

II-A1 Deterministic Binarization

Deterministic binarization is defined in Equation (1):

$$w_b = \begin{cases} +1, & \text{if } w \geq 0, \\ -1, & \text{otherwise,} \end{cases} \qquad (1)$$

where $w_b$ is the binarized weight and $w$ is the real-valued full-precision weight.
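As a minimal sketch of Equation (1), the C++ snippet below applies the deterministic sign binarization element-wise to a weight vector. The function and variable names are our own illustrative choices, not identifiers from the paper's repository.

```cpp
#include <cstddef>
#include <vector>

// Deterministic binarization (Equation 1): w_b = +1 if w >= 0, -1 otherwise.
// Illustrative sketch only.
std::vector<float> binarize_deterministic(const std::vector<float>& w) {
    std::vector<float> w_b(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        w_b[i] = (w[i] >= 0.0f) ? 1.0f : -1.0f;
    }
    return w_b;
}
```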

II-A2 Stochastic Binarization

Stochastic binarization is an alternative technique, which binarizes weights stochastically. The stochastic binary projection is presented in Equation (2):

$$w_b = \begin{cases} +1, & \text{with probability } p = \sigma(w), \\ -1, & \text{with probability } 1 - p, \end{cases} \qquad (2)$$

where $\sigma(\cdot)$ is the hard sigmoid function described in Equation (3):

$$\sigma(x) = \operatorname{clip}\!\left(\frac{x+1}{2},\, 0,\, 1\right) = \max\!\left(0,\, \min\!\left(1,\, \frac{x+1}{2}\right)\right). \qquad (3)$$
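A corresponding sketch of the stochastic projection in Equations (2) and (3) is shown below: each binary weight is drawn as +1 with probability given by the hard sigmoid of the real-valued weight. The random number generator and function names are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Hard sigmoid (Equation 3): sigma(x) = clip((x + 1) / 2, 0, 1).
inline float hard_sigmoid(float x) {
    return std::max(0.0f, std::min(1.0f, (x + 1.0f) / 2.0f));
}

// Stochastic binarization (Equation 2): w_b = +1 with probability sigma(w),
// and -1 otherwise. Illustrative sketch; the RNG choice is an assumption.
std::vector<float> binarize_stochastic(const std::vector<float>& w, std::mt19937& rng) {
    std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
    std::vector<float> w_b(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        w_b[i] = (uniform(rng) < hard_sigmoid(w[i])) ? 1.0f : -1.0f;
    }
    return w_b;
}
```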

II-B Training Algorithm

Algorithm (1) provides a high-level overview of the training algorithm used for deterministic and stochastic BNNs. Here, W, b, and η represent the weights, biases, and learning rate, while C denotes the cost function for each mini-batch. Furthermore, W_b represents the binary weights and a_k represents the k-th layer activation, while binarize() implements Equation (1) or (2) depending on the utilized regularization, and clip() clips values between -1 and +1. By adopting this training algorithm, during learning and inference, network outputs can be determined using simple Multiply and Accumulate (MAC) operations, in place of dedicated multiplier blocks [2]. A minimal C++ sketch of this idea is given below.
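The sketch below (our own illustrative code, not the paper's OpenCL kernels) computes a fully connected layer's pre-activations with binary weights: because every weight is +1 or -1, each multiply-accumulate collapses to an addition or subtraction.

```cpp
#include <cstddef>
#include <vector>

// Fully connected forward pass with binary weights. Since each weight is
// +1 or -1, every multiply-accumulate reduces to an add or a subtract,
// removing the need for dedicated multiplier blocks. Illustrative sketch only.
std::vector<float> fc_forward_binary(const std::vector<float>& input,  // size: n_in
                                     const std::vector<float>& w_b,    // size: n_out * n_in, values in {+1, -1}
                                     const std::vector<float>& bias,   // size: n_out
                                     std::size_t n_in, std::size_t n_out) {
    std::vector<float> output(n_out);
    for (std::size_t j = 0; j < n_out; ++j) {
        float acc = bias[j];
        for (std::size_t i = 0; i < n_in; ++i) {
            // Accumulate +input or -input depending on the binary weight.
            acc += (w_b[j * n_in + i] > 0.0f) ? input[i] : -input[i];
        }
        output[j] = acc;
    }
    return output;
}
```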

III Network Architecture

The complete architecture of the implemented networks consists of two main components: (A) the software architecture, and (B) the hardware architecture. The software architecture defines the targeted neural network structure, which is described using C++ and OpenCL kernels. The hardware architecture describes the integration between the hardware used to run the OpenCL kernels and a host controller, the program executed on a host processor. This processor is used to launch OpenCL kernels and to manage device memory.

III-A Software Architecture

We implement two distinct neural network architectures: a permutation-invariant FC network for MNIST, and the VGG-16 [9] CNN for CIFAR-10. Details pertaining to each network are provided in a publicly available GitHub repository: https://github.com/coreylammie/Accelerating-Stochastically-Binarized-Neural-Networks-on-FPGAs-using-OpenCL.

To decrease the quantization error introduced by binarization, the output of each layer is normalized using batch normalization. The output of the final layer is fed through a Softmax activation layer, and the network's loss is minimized using cross-entropy. SGD with momentum is used to optimize the network parameters, with an initial learning rate and momentum. In order to accelerate convergence and maximize each network's performance, an adaptive decaying learning rate, η, is used, as described in Equation (4).

(4)
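As a minimal sketch of the optimizer described above, the snippet below combines SGD with momentum, weight clipping as in Algorithm 1, and a decaying learning rate. The exponential per-epoch decay and all hyper-parameter names are illustrative assumptions, not the specific schedule of Equation (4).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// SGD with momentum, weight clipping, and a decaying learning rate.
// The exponential decay form used here is an assumption for illustration.
struct SGDMomentum {
    float base_lr;   // initial learning rate
    float momentum;  // momentum coefficient
    float decay;     // assumed per-epoch decay factor, e.g. 0.98

    float lr_at_epoch(int epoch) const {
        return base_lr * std::pow(decay, static_cast<float>(epoch));
    }

    // One update: v <- momentum * v - lr * grad; w <- clip(w + v, -1, +1).
    void step(std::vector<float>& w, std::vector<float>& v,
              const std::vector<float>& grad, int epoch) const {
        const float lr = lr_at_epoch(epoch);
        for (std::size_t i = 0; i < w.size(); ++i) {
            v[i] = momentum * v[i] - lr * grad[i];
            w[i] += v[i];
            // Clip the stored real-valued weights, as in Algorithm 1.
            if (w[i] > 1.0f) w[i] = 1.0f;
            if (w[i] < -1.0f) w[i] = -1.0f;
        }
    }
};
```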
Fig. 1: Top-level flow diagram of the proposed network implemented on a SoC, consisting of a processor running the host controller and an FPGA running the OpenCL kernels.

III-B Hardware Architecture

The developed hardware architectures consist of C++ host controllers and multiple OpenCL kernels, which are accelerated using either an FPGA or a GPU. For x86-based systems, OpenCL kernels accelerated using FPGAs typically reside on an FPGA development board, which is connected to a separate, independent host system through the PCI Express (PCIe) interface [10]. For ARM-based systems, the FPGA is typically connected to a Hard Processor System (HPS) on a SoC through specialized bridges, as in the case of the Intel DE1-SoC development board used here. This allows the proposed networks to run completely independently on the SoC without using a separate device for computation. The full top-level flow diagram of our implemented FPGA-accelerated networks is presented in Figure (1).
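A minimal host-side OpenCL sketch of the flow in Figure (1) is shown below: the host controller loads a pre-compiled FPGA binary, creates device buffers, launches one kernel, and reads back the result. The kernel name, file name, and layer sizes are placeholders rather than the paper's actual kernels, and error handling is omitted for brevity.

```cpp
#include <CL/cl.h>
#include <fstream>
#include <iterator>
#include <vector>

// Minimal OpenCL host-controller sketch. "network.aocx" and "fc_forward"
// are illustrative placeholders, not the paper's actual artifacts.
int main() {
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, nullptr);

    cl_context context = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, nullptr);

    // Load an offline-compiled FPGA binary (e.g. produced by the Intel FPGA SDK for OpenCL).
    std::ifstream file("network.aocx", std::ios::binary);
    std::vector<unsigned char> binary((std::istreambuf_iterator<char>(file)),
                                      std::istreambuf_iterator<char>());
    const unsigned char* binary_ptr = binary.data();
    size_t binary_size = binary.size();
    cl_program program = clCreateProgramWithBinary(context, 1, &device, &binary_size,
                                                   &binary_ptr, nullptr, nullptr);
    clBuildProgram(program, 1, &device, "", nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(program, "fc_forward", nullptr);

    // Create device buffers and copy one input image to the device.
    const size_t n_in = 784, n_out = 10;
    std::vector<float> input(n_in, 0.0f), output(n_out, 0.0f);
    cl_mem d_in  = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                  n_in * sizeof(float), input.data(), nullptr);
    cl_mem d_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                  n_out * sizeof(float), nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);

    // Launch the kernel and read back the result.
    size_t global = n_out;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, n_out * sizeof(float), output.data(),
                        0, nullptr, nullptr);
    clFinish(queue);
    return 0;
}
```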

In addition to accelerating the targeted MNIST and CIFAR-10 networks on an FPGA development board, each network is also accelerated using a state-of-the-art Titan V GPU, which executes the OpenCL kernels, with an overclocked (OC) AMD Ryzen 2700X CPU @ 4.10 GHz driving the operating system.

Dataset  | Regularizer    | Total Kernel Power Usage (W) FPGA / GPU | Learning Time per Epoch (s) FPGA / GPU | Inference Time per Image (s) FPGA / GPU | Validation Accuracy (%) FPGA / GPU
MNIST    | No Regularizer | 7.0 / 126.1                             | 26.09 / 5.13                           | 7.04E-05 / 3.12E-05                     | 98.70 / 98.54
MNIST    | Deterministic  | 6.3 / 125.9                             | 9.75 / 8.87                            | 6.84E-06 / 9.71E-06                     | 97.76 / 97.94
MNIST    | Stochastic     | 6.3 / 125.4                             | 11.58 / 8.20                           | 7.12E-06 / 9.92E-06                     | 98.33 / 98.23
CIFAR-10 | No Regularizer | 7.9 / 128.4                             | 43.97 / 28.45                          | 1.15E-02 / 5.09E-03                     | 86.72 / 86.73
CIFAR-10 | Deterministic  | 6.5 / 126.3                             | 16.91 / 34.86                          | 1.11E-03 / 1.63E-03                     | 86.48 / 86.46
CIFAR-10 | Stochastic     | 6.6 / 126.9                             | 20.08 / 33.79                          | 1.16E-03 / 1.66E-03                     | 86.75 / 86.76

TABLE I: Implementation results obtained using the MNIST and CIFAR-10 datasets for GPU and FPGA accelerated networks. The Learning Time per Epoch and Inference Time per Image metrics are averaged over all recorded samples during 200 training epochs.

IV Implementation Results

Fig. 2: Validation accuracy during training for the FPGA- and GPU-accelerated permutation-invariant FCNN on the MNIST test set. Solid lines represent networks accelerated using the FPGA and dashed lines represent networks accelerated using the GPU.

In order to validate and investigate the performance of the proposed FPGA- and GPU-accelerated BNN architectures, the MNIST and CIFAR-10 datasets are used. To ensure a fair comparison, on account of the limited resources of the Intel DE1-SoC development board used, the batch size was fixed to 4 for all networks. The validation accuracy of all developed networks over 200 training epochs is presented in Figures (2) and (3).

From Figures (2) and (3), it can be observed that both GPU- and FPGA-accelerated networks achieve very similar validation accuracy rates during learning. For all implementations, regularized networks require more training epochs to converge.

The variations in validation accuracy trends reported between platforms can be attributed to the different initial weights generated using the He weight initialization technique. Figures (2) and (3) also demonstrate that networks employing stochastic and deterministic binarization techniques perform very similarly to their baseline architectures employing no binary regularization. For our FPGA-accelerated networks learning MNIST, the validation accuracy degrades by only 0.37% (stochastic) and 0.94% (deterministic), compared to no regularization. For our FPGA-accelerated networks learning CIFAR-10, a validation accuracy decrease of 0.24% was observed for the network employing deterministic binarization, while the network employing stochastic binarization showed a validation accuracy increase of 0.03%. These findings are in good agreement with the software implementations of binarized networks reported in [1].

To comprehensively compare the implemented FPGA- and GPU-accelerated networks, the total kernel power usage, learning time per epoch, inference time per image, and learning performance were determined and are presented in Table I. The total kernel power usages were determined using the Post Place & Route Estimator for the FPGA (post-synthesis), and NVIDIA-SMI for the GPU. It was found that the power consumption of all FPGA-accelerated networks is reduced by more than 16 times compared to their GPU-accelerated counterparts.

Fig. 3: Validation accuracy per epoch during training for the FPGA- and GPU-accelerated VGG-16 CNN on the CIFAR-10 test set. Solid lines represent networks accelerated using the FPGA and dashed lines represent networks accelerated using the GPU.

Despite this drastic reduction in power consumption, our deterministic and stochastic regularized FPGA-accelerated networks require training durations similar to their GPU-accelerated counterparts, which run at much higher operating frequencies than our utilized FPGA. As reported in Table I, our FPGA-accelerated permutation-invariant FC deterministic and stochastic BNNs for MNIST require 1.10x and 1.41x longer training intervals, respectively. Our FPGA-accelerated CNNs adopting the VGG-16 architecture accelerate learning by 2.06x and 1.68x, respectively. These findings are in agreement with [2], which investigates execution times for FC and CNN BNNs, and demonstrates that convolutional operations are accelerated to a greater extent than the matrix multiplications required for FC layers.

When considering inference time, all our FPGA-accelerated stochastic and deterministic regularized networks require shorter times to perform inference, compared to their GPU-accelerated counterparts. This is notable, considering our GPU-accelerated networks use the state-of-the-art Titan V GPU to execute OpenCL kernels, while the limited resources available on the utilized FPGA create a large bottleneck on the maximum synthesizable frequency, and thus limit the speed of our FPGA-accelerated networks. We believe the shorter inference times observed are mainly due to the binarized parameters during inference, which accelerate the required computations. This also explains why, when no regularizer is used, our GPU-accelerated implementations require shorter inference times than our FPGA-accelerated implementations. Modern FPGAs such as the Stratix® V GXA7 and Virtex-7 VX485T, used in other recent works [7, 10], are expected to demonstrate even more significant improvements in speed during training and inference. This promises further inference acceleration for FPGA-based deterministically and stochastically binarized networks.

V Conclusion

We designed and implemented various FC and convolutional BNN architectures using the high-level OpenCL framework. We then accelerated the developed networks on both GPUs and FPGAs. The performance, power consumption, and learning/inference times of these network architectures were investigated. It was found that both FPGA-accelerated BNNs with deterministic and stochastic regularizers reduce inference times on MNIST and CIFAR-10 by an order of magnitude, compared to the non-regularized FPGA case, and require over 25% shorter inference times than their GPU counterparts. Moreover, our FPGA-accelerated BNNs consumed less than 1/16th of the power required by non-regularized GPU-accelerated networks. Finally, our BNNs achieved only slightly degraded validation accuracy on MNIST, and in some instances outperformed our baseline non-regularized GPU-accelerated networks on CIFAR-10. In summary, our modular and scalable FC and CNN architectures can be extrapolated to accelerate larger and more complex networks.

References

  • [1] M. Courbariaux, Y. Bengio, and J. David (2015) BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131.
  • [2] M. Courbariaux and Y. Bengio (2016) BinaryNet: training deep neural networks with weights and activations constrained to +1 or -1. arXiv:1602.02830.
  • [3] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research).
  • [4] C. Lammie and M. R. Azghadi (2019) Stochastic computing for low-power and high-speed deep learning on FPGA. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS).
  • [5] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/.
  • [6] Y. Li, Z. Liu, K. Xu, H. Yu, and F. Ren (to appear) A GPU-outperforming FPGA accelerator architecture for binary convolutional neural networks. ACM Journal on Emerging Technologies in Computing (JETC).
  • [7] J. Lin, T. Xing, R. Zhao, Z. Zhang, M. Srivastava, Z. Tu, and R. K. Gupta (2017) Binarized convolutional neural networks with separable filters for efficient hardware acceleration. Computer Vision and Pattern Recognition.
  • [8] C. Sakr, J. Choi, Z. Wang, K. Gopalakrishnan, and N. R. Shanbhag (2018) True gradient-based training of deep binary activated neural networks via continuous binarization. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
  • [9] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
  • [10] D. Wang, K. Xu, and D. Jiang (2017) PipeCNN: an OpenCL-based open-source FPGA accelerator for convolution neural networks. In 2017 International Conference on Field Programmable Technology (ICFPT), pp. 279–282.
  • [11] H. Wang, B. Raj, and E. P. Xing (2017) On the origin of deep learning. arXiv:1702.07800.
  • [12] L. Yang, Z. He, and D. Fan (2018) A fully on-chip binarized convolutional neural network FPGA implementation with accurate inference. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED '18), New York, NY, USA, pp. 50:1–50:6.