Training Progressively Binarizing Deep Networks Using FPGAs

01/08/2020
by   Corey Lammie, et al.
James Cook University

While hardware implementations of inference routines for Binarized Neural Networks (BNNs) are plentiful, current realizations of efficient BNN hardware training accelerators, suitable for Internet of Things (IoT) edge devices, leave much to be desired. Conventional BNN hardware training accelerators perform forward and backward propagations with parameters adopting binary representations, and optimization using parameters adopting floating- or fixed-point real-valued representations, requiring two distinct sets of network parameters. In this paper, we propose a hardware-friendly training method that, contrary to conventional methods, progressively binarizes a singular set of fixed-point network parameters, yielding notable reductions in power and resource utilizations. We use the Intel FPGA SDK for OpenCL development environment to train our progressively binarizing DNNs on an OpenVINO FPGA. We benchmark our training approach on both GPUs and FPGAs using CIFAR-10 and compare it to conventional BNNs.

I Introduction

Figure 1: Depiction of (A) network parameter representations and (B) binarization and activation functions required during training for various DNNs and BNNs. Different levels of discretization are depicted using various shade palettes. DNNs require one set of real-valued (A1) or limited-precision parameters (A2) and typically have continuously differentiable activation functions (B1). In addition to the real-valued parameters (A3), BNNs also require another set of binarized parameters (A4), for which they use an STE (B2) to determine gradients of the signum function, which is not continuously differentiable. In contrast to BNNs, PBNNs require one set of real-valued parameters (A5) and use a continuously differentiable activation and binarization function, whose shape progressively evolves during training (B3).

Binarization has been used to improve the efficiency of Deep Neural Networks (DNNs) by quantizing network parameters to binary states, replacing many resource-hungry multiply-accumulate operations with simple accumulations [1]. It has been demonstrated that Binarized Neural Networks (BNNs) implemented on customized hardware can perform inference faster than conventional DNNs on state-of-the-art Graphics Processing Units (GPUs) [2, 3], while offering notable reductions in power consumption and resource utilization [4, 5, 6]. However, there is still a performance gap between DNNs and conventional BNNs [7], which binarize parameters deterministically or stochastically. Moreover, the training routines of conventional BNNs are inherently unstable [8].

During backward propagations of conventional BNN training routines, gradients are approximated using a Straight-Through Estimator (STE), as the signum function is not continuously differentiable [1]. The gap in performance, and the general instability of conventional BNNs compared to DNNs, can be largely attributed to the lack of an accurate derivative for weights and activations in BNNs, which creates a mismatch between binary and floating- or fixed-point real-valued representations [9].

The training routines of DNNs that use continuously differentiable and adjustable functions in place of the signum function, which we denote Progressively Binarizing NNs (PBNNs), transform a complex and non-smooth optimization problem into a sequence of smooth sub-optimization problems. Such training routines, which progressively binarize network parameters, were first used to binarize the last layer of DNNs, yielding significant multimedia retrieval performance on standard benchmarks [10]. Since then, various works have detailed training routines of complete PBNNs [11, 12, 13]. However, efficient customized hardware implementations of PBNNs are yet to be explored.

In this paper, we use the Intel FPGA SDK for OpenCL development environment to implement and train novel and scalable PBNNs, which progressively binarize a singular set of fixed-point network parameters, on an OpenVINO FPGA. We compare our approach to conventional BNNs and benchmark our implementations using CIFAR-10 [14]. Our specific contributions are as follows:

  • We implement and present the first PBNNs using customized hardware and fixed-point number representations;

  • We use a Piece-Wise Linear (PWL) function, with a constant derivative, for binarization and activations, to simplify computations;

  • We demonstrate that, compared to BNNs trained deterministically or stochastically on CIFAR-10, PBNNs yield a marginal, yet consistent, increase in classification accuracy, while decreasing both resource and power utilization.

II Preliminaries

II-A Conventional BNNs

The training routines of conventional BNNs binarize parameters either deterministically or stochastically after performing parameter optimizations. Deterministic binarization is performed as per Eq. (1):

W_b = \mathrm{sign}(W) = \begin{cases} +1, & W \geq 0 \\ -1, & \text{otherwise} \end{cases}    (1)

where W_b denotes binarized parameters and W denotes real-valued full-precision parameters. Stochastic binarization is performed as per Eq. (2), where \sigma(W) is the hard sigmoid function described in Eq. (3):

W_b = \begin{cases} +1, & \text{with probability } p = \sigma(W) \\ -1, & \text{with probability } 1 - p \end{cases}    (2)

\sigma(W) = \mathrm{clip}\left(\frac{W + 1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{W + 1}{2}\right)\right)    (3)

During backward propagations, gradients corresponding to large parameters are clipped, as per Eq. (4), where C denotes the objective function:

\frac{\partial C}{\partial W} = \frac{\partial C}{\partial W_b} \, \mathbf{1}_{|W| \leq 1}    (4)
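For concreteness, the following minimal NumPy sketch mirrors Eqs. (1)-(4); the function names and the choice of NumPy are ours, and it is illustrative only rather than part of the paper's OpenCL implementation.

```python
import numpy as np

def binarize_deterministic(w):
    # Eq. (1): +1 where w >= 0, -1 otherwise.
    return np.where(w >= 0, 1.0, -1.0)

def hard_sigmoid(w):
    # Eq. (3): clip((w + 1) / 2, 0, 1).
    return np.clip((w + 1.0) / 2.0, 0.0, 1.0)

def binarize_stochastic(w, rng):
    # Eq. (2): +1 with probability p = sigma(w), -1 with probability 1 - p.
    p = hard_sigmoid(w)
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)

def ste_gradient(grad_wb, w):
    # Eq. (4): straight-through estimator; gradients are cancelled
    # (clipped) wherever |w| > 1.
    return grad_wb * (np.abs(w) <= 1.0)

w = np.array([-1.7, -0.3, 0.2, 1.4])
print(binarize_deterministic(w))
print(binarize_stochastic(w, np.random.default_rng(0)))
```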

II-B Progressively Binarizing DNNs

PBNNs use a set of constrained real-valued parameters, W_b^l, at each layer, l, which are not directly learnable, but are a function of learnable parameters, W^l, at each layer [11, 12]. The shape of W_b^l evolves during training, to closely resemble the signum function once training is complete. The hyperbolic tangent function is commonly used to relate W_b^l and W^l, as described in Eq. (5):

W_b^l = \tanh(\nu W^l)    (5)

where \nu is an adjustable scale parameter, which is used to evolve the shape of Eq. (5). The derivative of Eq. (5) is described in Eq. (6):

\frac{\partial W_b^l}{\partial W^l} = \nu \left(1 - \tanh^2(\nu W^l)\right)    (6)
Figure 2: Activation and binarization functions used for our progressive binarization training routine.

As \nu increases, the shape of Eq. (5) better mimics that of the signum function. During training, the parameters, denoted using W^l, are optimized to minimize a loss function, while \nu is progressively increased. After training is completed, \nu is sufficiently large that the constrained parameters, W_b^l, are very close to \pm 1, as depicted in Fig. 1. The final binary parameters can simply be obtained by passing W^l through the signum function.
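The sketch below (NumPy, using our reconstructed notation) illustrates Eqs. (5) and (6): for a fixed scale parameter \nu, the constrained parameters and their exact derivative are obtained directly from the learnable parameters, and the constrained values approach \pm 1 as \nu grows.

```python
import numpy as np

def constrain(w, nu):
    # Eq. (5): W_b = tanh(nu * W); approaches sign(W) as nu increases.
    return np.tanh(nu * w)

def constrain_derivative(w, nu):
    # Eq. (6): dW_b/dW = nu * (1 - tanh(nu * W)^2).
    return nu * (1.0 - np.tanh(nu * w) ** 2)

w = np.linspace(-1.0, 1.0, 5)
for nu in (1.0, 10.0, 1000.0):
    print(nu, np.round(constrain(w, nu), 3))  # progressively closer to {-1, 0, +1}
```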

Require: Network hyperparameters (the learning rate schedule, \eta, scale parameter schedule, \nu, batch size, B, gradient optimizer, loss function, L, and the number of training epochs).
Ensure: Trained binary weights and biases, W_b.
  for each training epoch do
     1. Forward Propagation
     for each training batch do
        for each layer, l, do
           Determine the constrained parameters, W_b^l, from W^l and \nu, and the layer output, a^l
        end for
     end for
     2. Backward Propagation
     Determine the gradient of the loss with respect to the output of the last layer
     for all other layers do
        Determine \partial L / \partial a^l using \partial L / \partial a^{l+1} and W_b^{l+1}
     end for
     3. Parameter Optimization
     for each layer, l, do
        Determine \partial L / \partial W^l using \partial L / \partial a^l and \partial W_b^l / \partial W^l
        Determine the updated W^l using \eta and the gradient optimizer
     end for
  end for
  4. Determine the Trained Binary Parameters
  for each layer, l, do
     W_b^l = \mathrm{sign}(W^l)
  end for
Algorithm 1: The training routine adopted by all of our progressively binarizing DNNs.
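To make Algorithm 1's control flow concrete, here is a self-contained toy sketch in NumPy: a single progressively binarized linear layer trained on synthetic data. Plain SGD and a hinge loss stand in for the gradient optimizer and loss function of the paper, and all names and hyperparameter values here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 32))              # toy inputs
y = rng.integers(0, 2, size=256) * 2 - 1        # toy labels in {-1, +1}
w = 0.01 * rng.standard_normal(32)              # learnable parameters W

epochs, lr = 50, 0.01
for epoch in range(epochs):
    nu = 10.0 ** (3.0 * epoch / (epochs - 1))   # scale parameter schedule: 1 -> 1000
    # 1. Forward propagation with constrained parameters (Eq. (5)).
    w_b = np.tanh(nu * w)
    out = x @ w_b
    # 2. Backward propagation (hinge loss used as a simple stand-in).
    g_out = np.where(y * out < 1.0, -y, 0.0) / len(y)
    g_wb = x.T @ g_out
    g_w = g_wb * nu * (1.0 - w_b ** 2)          # chain rule through Eq. (6)
    # 3. Parameter optimization (plain SGD as a stand-in for Adam).
    w -= lr * g_w

# 4. Determine the trained binary parameters.
w_binary = np.where(w >= 0, 1.0, -1.0)
print(w_binary)
```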

III Implementation Details

III-A Our Progressive Binarization Training Routine

We employ a PWL function to approximate the hyperbolic tangent function, described in Eq. (7) and depicted in Fig. 2, to simplify computations. In addition to reducing a non-linear function to a linear function, the derivative of Eq. (7), when bounded, is constant and does not depend on its input. Consequently, when all activations are computed simultaneously, the output of each layer during forward propagations, a^l, does not need to be stored in memory to determine gradients during backward propagations.

W_b^l = \mathrm{clip}(\nu W^l, -1, 1) = \max\left(-1, \min\left(1, \nu W^l\right)\right)    (7)
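A short NumPy sketch of the PWL form reconstructed in Eq. (7); its derivative is the constant \nu inside the bounds and zero outside, which is the property that allows gradients to be computed without storing layer outputs (notation and function names are ours).

```python
import numpy as np

def pwl(w, nu):
    # Eq. (7): linear with slope nu, clipped to [-1, 1].
    return np.clip(nu * w, -1.0, 1.0)

def pwl_derivative(w, nu):
    # Constant (nu) inside the bounds, zero outside; independent of the
    # layer output, so no activations need to be stored for the backward pass.
    return np.where(np.abs(nu * w) <= 1.0, nu, 0.0)
```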

Algorithm 1 provides a high-level overview of our progressive binarization training routine. Trained binary parameters can be computed after each training epoch to determine performance on the test set during training. Here, a_k^l denotes the output of the l-th layer at the k-th epoch. Once \nu is sufficiently large, the sign of the output of Batch Normalization (BN) is reformulated to reduce computation, as per Eq. (8) [12]:

\mathrm{sign}(\mathrm{BN}(x)) = \mathrm{sign}(x - \tau)    (8)

where \tau is defined in Eq. (9) for \gamma > 0, x denotes the input, \gamma and \beta are parameters that define an affine transform, and \mu and \sigma are the running mean and standard deviation of the feature maps that pass through them:

\tau = \mu - \frac{\beta \sigma}{\gamma}    (9)
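The reconstructed BN reformulation above reduces, at this stage of training, to a single threshold comparison per feature map. A minimal NumPy sketch, assuming \gamma > 0 and the notation introduced in Eqs. (8) and (9):

```python
import numpy as np

def bn_sign_threshold(mu, sigma, gamma, beta):
    # Eq. (9): tau = mu - beta * sigma / gamma (assuming gamma > 0).
    return mu - beta * sigma / gamma

def bn_sign(x, mu, sigma, gamma, beta):
    # Eq. (8): sign(BN(x)) computed as a comparison against tau,
    # avoiding the full normalization arithmetic.
    tau = bn_sign_threshold(mu, sigma, gamma, beta)
    return np.where(x >= tau, 1.0, -1.0)

x = np.array([-0.5, 0.1, 0.8])
print(bn_sign(x, mu=0.2, sigma=0.5, gamma=1.5, beta=0.3))
```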

We trained all networks for 50 epochs, until improvement on the test set was negligible, using the largest batch size that makes comparison across all devices possible. The initial learning rate was decayed by an order of magnitude every 20 training epochs.

During training, each network's scale parameter, \nu, was increased logarithmically, from 1, at the first epoch, to 1000, at the final epoch. Eq. (8) was used to determine the output of all batch normalization layers during inference. Adam [15] was used to optimize network parameters and Cross Entropy (CE) [16] was used to determine network losses. After the trained binary parameters were determined, for all of our implementations, a conventional OpenCL BNN inference accelerator was used to perform inference on the CIFAR-10 test set.
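The schedules described above can be written compactly as follows; the initial learning rate value is a hypothetical placeholder, since it is not reproduced in the text.

```python
epochs = 50
initial_lr = 1e-3  # hypothetical placeholder value

for epoch in range(epochs):
    # Scale parameter: increased logarithmically from 1 (first epoch) to 1000 (final epoch).
    nu = 10.0 ** (3.0 * epoch / (epochs - 1))
    # Learning rate: decayed by an order of magnitude every 20 training epochs.
    lr = initial_lr * 0.1 ** (epoch // 20)
    print(epoch, round(nu, 2), lr)
```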

III-B Network Architecture

The network architecture, previously used in [17], was used in all of our DNNs. This architecture is a variant of the VGG [18] family of network architectures. It is summarized in Table I. Each convolutional and pooling layer is parameterized by its number of filters, filter size, stride length, and padding, and each fully connected layer by its number of output neurons. All convolutional and fully connected layers are sequenced with batch normalization and activation layers. The last fully connected layer adopts real-valued representations.

Layer
Convolutional
Convolutional
Max Pooling
Convolutional
Convolutional
Max Pooling
Convolutional
Convolutional
Max Pooling
Fully Connected
Fully Connected
Fully Connected

Table I: Adopted Network Architecture.
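Since the layer hyperparameters of Table I are not reproduced above, the PyTorch sketch below assumes the widely used VGG-style BNN configuration of [1] (paired 128-, 256-, and 512-filter 3x3 convolutions with 2x2 max pooling, followed by 1024-1024-10 fully connected layers); it sketches only the topology, not the progressive binarization of its parameters, and the exact values used in the paper may differ.

```python
import torch.nn as nn

def conv_block(c_in, c_out, pool=False):
    # Convolutional layer sequenced with batch normalization and a hard-tanh
    # (PWL-style) activation, optionally followed by 2x2 max pooling.
    layers = [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
              nn.BatchNorm2d(c_out),
              nn.Hardtanh()]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return layers

model = nn.Sequential(
    *conv_block(3, 128), *conv_block(128, 128, pool=True),
    *conv_block(128, 256), *conv_block(256, 256, pool=True),
    *conv_block(256, 512), *conv_block(512, 512, pool=True),
    nn.Flatten(),
    nn.Linear(512 * 4 * 4, 1024), nn.BatchNorm1d(1024), nn.Hardtanh(),
    nn.Linear(1024, 1024), nn.BatchNorm1d(1024), nn.Hardtanh(),
    nn.Linear(1024, 10),  # last fully connected layer keeps real-valued outputs
)
```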

Training Routine      Total Kernel Power Usage (W)    Total Training Time (s)    Training Time per Epoch (s)    Test Set Accuracy (%)
                      FPGA      GPU                   FPGA        GPU            FPGA      GPU                  FPGA     GPU

8-bit Fixed Point
Stochastic            8.06      133.9                 1,592.73    2,613.67       —         —                    85.91    85.91
Deterministic         7.95      133.0                 1,523.17    2,497.72       —         —                    85.56    85.56
Progressive           7.60      130.3                 1,383.17    2,315.97       —         —                    86.28    86.28

16-bit Fixed Point
Stochastic            10.19     134.2                 1,989.25    3,147.23       —         —                    86.45    86.45
Deterministic         10.03     132.8                 1,907.17    2,909.62       —         —                    86.16    86.16
Progressive           9.27      130.5                 1,729.32    2,685.22       —         —                    86.94    86.94

FP32 Baseline
Real-valued           —         137.1                 —           2,524.20       —         —                    —        86.77

Table II: Implementation results obtained using the CIFAR-10 dataset for GPU and FPGA accelerated networks. The mean and standard deviations reported for the Training Time per Epoch (s) metric are determined over 50 training epochs.

Similarly to conventional networks, the unbounded ReLU [19] activation function was used instead of Eq. (7) for the real-valued FP32 baseline implementation on GPU. The same test set accuracy was achieved for the GPU and FPGA implementations.
Training Routine Deterministic Stochastic Progressive
Device Intel FPGA OpenVINO
Dataset CIFAR-10
8-bit Fixed Point
Flip Flops (%) 63.19 66.42 62.95
ALMs (%) 81.38 84.87 76.92
DSPs (%) 100.00 100.00 93.20
16-bit Fixed Point
Flip Flops (%) 96.06 98.43 91.96
ALMs (%) 90.40 94.31 85.54
DSPs (%) 100.00 100.00 100.00
Table III: Comparison of FPGA device utilization for various binarization training approaches. The numbers are extracted from reports generated by Quartus Prime Design Suite 18.1.

III-C Hardware Architecture

All of our implementations are described using the heterogeneous OpenCL [20] framework, in which multiple OpenCL kernels are accelerated using either FPGAs or GPUs and controlled using C++ host controllers. For FPGA implementations, an SoC is used as the host controller, whereas for GPU implementations, a CPU is used. We note that the power consumption of our FPGA implementations could be decreased further by realizing them using a Hardware Description Language (HDL) and removing the host controller; however, this would make fair comparisons between GPU and FPGA implementations difficult [21].

IV Implementation Results

In order to investigate the performance of our progressively binarizing training routine, CIFAR-10 was used. Prior to training, the color channels of each image were normalized using mean and standard deviation values of (0.4914, 0.2023), (0.4822, 0.1994), and (0.4465, 0.2010), for the red, green, and blue image channels, respectively. This normalization was adopted because it has been demonstrated to improve performance on the ImageNet dataset [22]. We compare FPGA implementations adopting 16-bit and 8-bit fixed-point real-valued representations, as a large degradation in performance was observed when using smaller bit widths.

To compile OpenCL kernels for the OpenVINO FPGA, the Intel FPGA SDK for OpenCL Offline Compiler (IOC) was used, as part of the Intel FPGA SDK for OpenCL and Quartus Prime Design Suite 18.1. For our GPU implementations, a Titan V GPU was used to execute OpenCL kernels and an AMD Ryzen 2700X @ 4.10 GHz Overclocked (OC) CPU was used to drive the host controller. We used version 430.50 of the Titan V GPU driver to launch compute kernels. We report all GPU and FPGA implementation results in Table II.

From Table II, it can be observed that our progressive training routine consumed the least power and had the smallest total training time on FPGA. Moreover, when adopting 16-bit fixed-point real-valued representations during training, it achieved the largest test set accuracy. We believe that, similarly to [1], this can be attributed to the additional regularization that binarized parameters introduce. We note that the total training times of our GPU and FPGA implementations are not indicative of those with larger batch sizes, and that the available resources on the FPGA used restricted the batch size adopted across all devices.

The device utilization of our FPGA implementations is presented in Table III. Our progressive binarizing training routine consumes notably fewer Adaptive Logic Modules (ALMs) and Flip Flops than the deterministic and stochastic routines for both 16- and 8-bit fixed-point representations. Digital Signal Processor (DSP) utilization is similar to that of the deterministic and stochastic routines, and is only marginally decreased when 8-bit fixed-point real-valued representations are adopted.

V Conclusion

We proposed and implemented novel and scalable PBNNs on GPUs and FPGAs. We compared our approach to conventional BNNs and real-valued DNNs using GPUs and FPGAs and demonstrated notable reductions in power and resource utilization on CIFAR-10. This was achieved through approximations and hardware optimizations, as well as using only one set of network parameters compared to conventional BNNs. We leave further hardware-level dissemination, upscaling, hyperparameter optimization, and tuning to future work.

References