In recent years, neural networks (NNs) have demonstrated outstanding performance in a variety of applications ranging from image classification ferianc2020vinnas or segmentation kendall2015bayesian to human action recognition fan2019f. However, one of the main drawbacks of standard NNs is that they are not able to capture model uncertainty, which is crucial for many safety-critical applications such as healthcare liang2018bayesian or autonomous vehicles azevedo2020stochasticyolo. In contrast to standard NNs, Bayesian neural networks (BNNs) [neal1993bayesian], which adopt Bayesian inference to provide a principled uncertainty estimation, have become more popular in these applications.
BNNs neal1993bayesian can describe complex stochastic patterns with well-calibrated confidence estimates. An example of this is shown in Figure 1, which demonstrates that the BNN is uncertain in its predictions when shown completely irrelevant input, in comparison to a standard NN, which is wrongfully overconfident. Hence with BNNs, we can treat special cases explicitly gal2016dropout and they have become relevant in applications where the notion of uncertainty is essential.
However, the advantage of BNNs comes with a burden: due to the high dimensionality of modern BNNs, it is intractable to compute their predictive uncertainty analytically. Instead, it is necessary to approximate the predictive distribution through Monte Carlo sampling, which requires repeatedly sampling random numbers and running the same input data through the BNN multiple times, and this degrades the hardware performance. Several algorithmic approximation techniques and hardware architectures cai2018vibnn, myojin, 9116302 have been proposed to improve the hardware performance of BNNs. Nevertheless, these approaches have particular drawbacks: 1) The need to implement both an NN engine and a sampler makes the design resource- and memory-demanding, and thus current accelerators can only support BNNs consisting solely of linear layers or binary operations, which does not reflect the needs of the industrial or research communities in terms of the current state-of-the-art BNN architectures; 2) To obtain the uncertainty prediction, these accelerators simply perform forward passes through the whole network repeatedly without considering the actual algorithmic needs of BNNs, which makes them multiple times slower than standard NNs.
To address these challenges, we propose an FPGA-based design with support for fine-grained parallelism to accelerate BNNs inferred through Monte Carlo Dropout (MCD) gal2016dropout with high performance. The proposed accelerator is versatile enough to support a variety of BNN architectures. To further improve the hardware performance, we consider partial BNNs to decrease the amount of computation required by BNNs. An automatic framework is proposed to explore the trade-off between hardware and algorithmic performance; it finds a suitable hardware configuration and algorithmic parameters given users’ hardware constraints and algorithmic requirements. In summary, our contributions include:
A novel hardware architecture with an intermediate-layer caching technique to accelerate Bayesian neural networks inferred through Monte Carlo Dropout, which achieves high performance and resource efficiency (Section III).
An exploration framework for hardware-algorithmic performance trade-off and uncertainty estimation provided by partial Bayesian neural network design (Section IV).
A comprehensive evaluation of algorithmic and hardware performance on different datasets with respect to different state-of-the-art neural architectures (Section V).
II Related Work
II-A Field Programmable Gate Array-based Accelerators
Acceleration of standard NNs has enjoyed extensive research and industrial interest in recent years mittal2020survey. Given the high computational demands of NNs, custom hardware accelerators are vital for boosting their performance. The high energy efficiency, computing capabilities and reconfigurability of FPGAs in particular make them a promising platform for accelerating many different NN architectures mittal2020survey. Nevertheless, acceleration of BNNs specifically has not gained similar interest in the research community and only a few works have approached this challenge cai2018vibnn, myojin, 9116302.
In VIBNN cai2018vibnn, the authors developed an efficient FPGA-based accelerator for BNNs; however, they focused on BNNs consisting only of linear layers. Myojin et al. myojin propose a method for reducing the sampling time required for MCD gal2016dropout in edge computing by parallelising the calculation circuit using an FPGA. However, their method needs to binarise the BNN and they again focus only on linear layers. Awano & Hashimoto 9116302 propose BYNQNet, a custom inference algorithm for BNNs consisting exclusively of linear layers, which employs quadratic nonlinear activation functions so that the uncertainty propagation can be achieved using only polynomial operations. Although the design can achieve a high throughput, the restriction of the nonlinear activation functions limits generality for different application scenarios. In azevedo2020stochasticyolo, the authors propose software-based intermediate-layer caching (IC), evaluated in last-layer BNNs.
In comparison to these works, we focus on accelerating BNNs consisting of different layers with or without residual connections he2016deep, including convolutions or pooling, that have been popular in present-day networks fan2019f, he2016deep. Additionally, our work aims to appeal to the already widespread MCD without any additional software re-implementation effort.
II-B Monte Carlo Dropout (MCD)
The concept of MCD gal2016dropout lies in casting dropout srivastava2014dropout in NNs as Bayesian inference. Unlike the dropout used in standard NNs, which is only enabled during training, MCD applies dropout during both training and evaluation. MCD can be described as applying a random filter-wise mask to the output feature maps of a layer, with one mask entry per filter. The mask follows a Bernoulli distribution, whose probability practically trades off certainty, accuracy and calibration of the BNN. After MCD zeroes out the dropped output feature maps, the remaining non-zero elements are scaled by a factor determined by that probability. The final output under MCD is thus the Hadamard product of the mask with the output feature maps, followed by this scaling, where the mask is generated by a Bernoulli sampler at runtime for different filters and layers. The uncertainty estimation and prediction are thus obtained by running the same input through the BNN multiple times, each time with a different set of sampled masks for each layer where MCD is applied, and averaging the outputs. The works gal2016dropout, kendall2015bayesian demonstrate that MCD can achieve a high quality of uncertainty estimation.
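The masking and averaging described above can be sketched in a few lines of NumPy. This is only an illustrative software model, not the accelerator's implementation: the names `keep_prob`, `num_samples` and `forward` are assumptions introduced here, and the sketch assumes the common inverted-dropout convention (kept feature maps are divided by the keep probability), since the paper's own notation is not reproduced in this text.

```python
import numpy as np

def mc_dropout_mask(num_filters, keep_prob, rng):
    """Draw one filter-wise Bernoulli mask (1 = keep the filter, 0 = drop it)."""
    return (rng.random(num_filters) < keep_prob).astype(np.float32)

def apply_mcd(feature_maps, mask, keep_prob):
    """Zero whole feature maps and rescale the survivors.

    feature_maps has shape (filters, height, width); mask has shape (filters,).
    The division by keep_prob is the assumed inverted-dropout scaling.
    """
    return feature_maps * mask[:, None, None] / keep_prob

def mc_predict(forward, x, num_samples, rng):
    """Average the outputs of num_samples stochastic forward passes.

    forward(x, rng) is a hypothetical callable that samples fresh masks
    internally on every call.
    """
    outputs = [forward(x, rng) for _ in range(num_samples)]
    return np.mean(outputs, axis=0)
```

Because the mask is drawn per filter rather than per element, a layer needs only as many random bits as it has output filters, which is what makes a small hardware sampler sufficient.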
II-C Partial Bayesian Inference
A full BNN should be trained with MCD applied after every layer gal2016dropout. However, the authors in kristiadi2020being, kendall2015bayesian have demonstrated theoretically and empirically that making a standard NN Bayesian in different parts of the NN, thus making it partially Bayesian, can improve uncertainty estimation and can also improve accuracy. Given a multi-layer NN, partial Bayesian inference applies MCD only in the last few layers and makes the preceding layers behave as a feature extractor for the Bayesian remainder. Partially applied dropout then represents a trade-off between hardware performance, algorithmic performance and uncertainty estimation kendall2015bayesian. In this paper, we exploit this trade-off by proposing a framework for exploring the positioning of MCD at different parts of the NN, which results in a partial BNN.
III Hardware Design
III-A Design Overview
An overview of the proposed hardware design is illustrated in Figure 2. The computation of the BNN is performed layer-by-layer using the same hardware design. The intermediate outputs of each layer are transferred to off-chip memory to reduce the on-chip memory consumption, and they are loaded back to the input buffer for processing of the next layer. The weights of different layers are stored in off-chip memory and loaded to the weight buffer while processing the corresponding layer.
The main component in the proposed accelerator is a neural network engine (NNE), which is designed to run one layer at a time and is general enough to run linear and convolutional layers with different kernel sizes. The NNE consists of a processing engine (PE), a functional unit (FU) and a dropout unit (DU). These sub-modules are queried in a pipelined manner to improve the hardware performance. The PE is designed to perform matrix multiplication and supports three types of fine-grained parallelism: filter parallelism, channel parallelism and vector parallelism. The PE is built from processing units (PUs). Each PU contains multiplication-addition modules, and each module contains multipliers followed by an adder tree for channel accumulation. After each PU, there is a chain of FU modules including batch normalization (BN) [ioffe2015batch], ReLU activation, Pooling (Pool) and Shortcut (SC). The DU is placed at the end and is a batch of multiplexers controlled by the zeros and ones generated from the Bernoulli sampler.
III-B Bernoulli Sampler
MCD is applied filter-wise, which means the number of Bernoulli random variables generated for each layer is equal to the number of output filters. Therefore, we adopt the single-bit linear feedback shift register (LFSR) design to implement a Bernoulli sampler, which is illustrated in Figure 3.
The LFSR is composed of a chain of shift registers formed as a loop. The maximum sequence length of an LFSR depends on the number of shift registers used in the loop: with n registers it is 2^n - 1. The LFSR design used would take 1500 years to iterate through the whole sequence when clocked at 160MHz [andraka1998fpga]. Since a single LFSR can only support Bernoulli sampling with probability 0.5, the number of LFSRs depends on the required dropout rate. For instance, two LFSRs with an extra AND gate are required to implement a Bernoulli sampler with probability 0.25. Also, as mentioned in Section III, the NNE only processes a limited number of filters at a time, so we design a serial-in-parallel-out (SIPO) module, placed after the LFSRs, to assemble the single Bernoulli bits into an MCD mask of matching width. Since different filters are processed at different speeds, a first-in-first-out (FIFO) buffer is placed at the end of the Bernoulli sampler to cache generated Bernoulli random variables and pop out the mask when required. If the overall processing is parallelised, it is not necessary to use more than one sampler; however, the samples drawn at runtime for each instance need to be distinct.
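A bit-level software model can make the sampler structure concrete. The sketch below is an assumption-laden illustration rather than the paper's circuit: the class and function names are hypothetical, the register widths are chosen for readability, and the tap positions (4, 3) and (16, 14, 13, 11) are commonly tabulated maximal-length configurations, not necessarily the ones used in the design.

```python
class LFSR:
    """Fibonacci LFSR; taps are 1-indexed bit positions of the feedback polynomial."""

    def __init__(self, seed, taps, width):
        assert seed != 0, "an all-zero state would lock the register up"
        self.state = seed & ((1 << width) - 1)
        self.taps = taps
        self.width = width

    def step(self):
        """Shift once and return the bit that leaves the register."""
        out = self.state & 1
        feedback = 0
        for tap in self.taps:
            feedback ^= (self.state >> (self.width - tap)) & 1
        self.state = (self.state >> 1) | (feedback << (self.width - 1))
        return out


def period(lfsr):
    """Count the steps until the register returns to its current state."""
    start, steps = lfsr.state, 0
    while True:
        lfsr.step()
        steps += 1
        if lfsr.state == start:
            return steps


def bernoulli_quarter(lfsr_a, lfsr_b):
    """AND of two roughly fair bits: equals 1 about a quarter of the time."""
    bit_a = lfsr_a.step()
    bit_b = lfsr_b.step()
    return bit_a & bit_b


def sipo(bit_source, width):
    """Serial-in-parallel-out: gather `width` serial bits into one mask."""
    return [bit_source() for _ in range(width)]
```

With the 4-bit register, `period` returns 15, matching the 2^n - 1 bound for n = 4. Calling `sipo(lambda: bernoulli_quarter(a, b), width)` with two differently seeded registers `a` and `b` yields one mask with as many bits as filters processed in parallel.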
III-C Intermediate-layer Caching (IC)
To further improve the overall hardware performance, we propose a hardware implementation of the IC technique [azevedo2020stochasticyolo] to decrease the required compute and the number of memory accesses. An example of using IC is illustrated in Figure 4, where the NN contains two layers and, when the partial Bayesian technique is applied, only the dropout mask and the last layer need to be re-run for each sample. In IC, the input of the last layer is stored on chip until the sampling is finished. When the NN needs to run only its last layers repeatedly to obtain the prediction, IC reduces both the compute and the number of memory accesses, because the preceding layers are executed only once.
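The saving from IC can be illustrated with a simple count of layer executions. This is a rough sketch under a strong simplifying assumption, namely that every layer costs the same; the exact reduction factors stated in the paper are not reproduced here, and all function and parameter names are hypothetical.

```python
def layer_runs(num_layers, num_bayes_layers, num_samples, use_ic):
    """Number of layer executions needed for one prediction."""
    if use_ic:
        # The non-Bayesian prefix runs once; only the Bayesian tail is repeated.
        prefix = num_layers - num_bayes_layers
        return prefix + num_bayes_layers * num_samples
    # Without caching, the whole network is re-run for every sample.
    return num_layers * num_samples


def predict_with_ic(prefix_fn, bayes_fn, x, num_samples):
    """Run the prefix once, cache its output, then sample the tail repeatedly."""
    cached = prefix_fn(x)  # stands in for the on-chip intermediate result
    outputs = [bayes_fn(cached) for _ in range(num_samples)]
    return sum(outputs) / num_samples
```

For the two-layer example with a Bayesian last layer and ten samples, the count drops from 20 to 11 layer executions. When every layer is Bayesian, the two counts coincide and caching brings no benefit, which is consistent with the trend reported later for larger Bayesian portions.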
IV Optimization Framework
IV-A Workflow of Framework
As mentioned in Section II-C, partially applying MCD represents a trade-off between latency, accuracy, confidence and uncertainty estimation. The trade-off is decided by three types of parameters: 1) the portion of Bayesian layers, 2) the number of times the Bayesian parts need to be run repeatedly and 3) the hardware parallelism levels. In this paper, we propose a framework, shown in Figure 5, which automatically optimizes the configuration of the BNN with respect to these parameters according to user requirements for the target hardware platform. In our hardware design space, two of the parallelism levels share one domain of candidate values, while the third is chosen from a separate set.
At the beginning, the framework requires users to specify the hardware constraints, optimization mode and the minimal requirement for each metric. The hardware constraints include the available DSPs and memory resources of the target hardware platform. The optimization mode is selected from optimal-latency, optimal-accuracy, optimal-uncertainty prediction and optimal-confidence to minimise or maximise the chosen objective through greedy optimisation with respect to software and hardware configurations.
The first optimization is the hardware optimization, which determines the maximum parallelism level implementable on the target hardware in terms of the three parallelism parameters. The resource model is used at this step to estimate the resource consumption given the available degrees of parallelism. During algorithmic optimization, based on the determined hardware parameters, we obtain the latency from the performance lookup table for various BNNs with different portions of Bayesian layers and different numbers of samples. At the same time, the accuracy, the quality of uncertainty prediction and the confidence of the BNN are evaluated in software. Then, the configurations which do not meet the minimal requirements are filtered out. The final configurations are selected according to the optimization mode specified at the beginning.
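The filter-then-select step of the workflow can be sketched as follows. This is a minimal illustration, not the framework's code: it uses an exhaustive pass over an already evaluated candidate table rather than the greedy optimisation mentioned above, and the `Candidate` fields, mode names and thresholds are hypothetical. It also assumes that a higher aPE on noise inputs and a lower ECE are the desirable directions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    bayes_layers: int   # how many layers are Bayesian
    samples: int        # number of Monte Carlo samples
    latency_ms: float   # taken from the performance lookup table
    accuracy: float
    ape: float          # average predictive entropy on noise, in nats
    ece: float          # expected calibration error


# Each mode maps to (metric accessor, whether to minimise or maximise it).
MODES = {
    "opt-latency": (lambda c: c.latency_ms, min),
    "opt-accuracy": (lambda c: c.accuracy, max),
    "opt-uncertainty": (lambda c: c.ape, max),
    "opt-confidence": (lambda c: c.ece, min),
}


def select(candidates, mode, max_latency_ms=math.inf, min_accuracy=0.0,
           min_ape=0.0, max_ece=math.inf):
    """Drop candidates that violate the user constraints, then pick the best one."""
    feasible = [
        c for c in candidates
        if c.latency_ms <= max_latency_ms and c.accuracy >= min_accuracy
        and c.ape >= min_ape and c.ece <= max_ece
    ]
    if not feasible:
        return None  # no configuration meets the minimal requirements
    metric, pick = MODES[mode]
    return pick(feasible, key=metric)
```

Separating the constraint filter from the objective keeps the four optimization modes interchangeable: the same feasible set is reused and only the ranking metric changes.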
IV-B Resource Model
As memory and DSPs are the limiting resources in FPGA-based NN accelerators [liu2018optimizing], we mainly consider memory and DSP usage. The DSP usage depends on the multipliers used in the NNE. Due to 8-bit processing, we implement two multipliers using one DSP, and thus the DSP consumption is half the total number of multipliers. The memory resources are mainly consumed by the weight buffer and input buffer in the NNE and by the FIFO buffer in the Bernoulli sampler. The memory consumption of the FIFO is determined by its width and by the depth of the FIFO used in the Bernoulli sampler, and the data width also enters the memory model. As our design processes the NN layer-by-layer, the memory usage of the input buffer is dominated by the layer with the maximal input size, given by the height, width and number of input channels of that layer. Since the weight buffer only needs to cache the filters that are processed in parallel, its memory consumption is determined by that number of filters together with the kernel size of the layers.
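A resource model of this kind can be sketched as a handful of functions. The formulas below are a hedged reconstruction from the prose only, since the original expressions are not reproduced in this text: the parameter names (`pf`, `pc`, `pv`, `depth`, `data_width`), the assumption that the FIFO is one mask bit wide per parallel filter, and the way the data width enters the buffer sizes are all assumptions made for illustration.

```python
import math


def dsp_usage(pf, pc, pv):
    """Two 8-bit multipliers share one DSP, so DSPs = multipliers / 2."""
    return math.ceil(pf * pc * pv / 2)


def fifo_bits(pf, depth):
    """Mask FIFO: assumed to hold `depth` masks of one bit per parallel filter."""
    return pf * depth


def input_buffer_bits(layers, data_width=8):
    """Sized by the layer with the largest height * width * input channels."""
    return max(h * w * c for (h, w, c, k) in layers) * data_width


def weight_buffer_bits(layers, pf, data_width=8):
    """Caches pf filters of the layer with the largest channels * k * k."""
    return pf * max(c * k * k for (h, w, c, k) in layers) * data_width


def fits(layers, pf, pc, pv, depth, dsp_budget, memory_bits_budget):
    """Check one parallelism setting against the user's hardware constraints."""
    memory = (fifo_bits(pf, depth) + input_buffer_bits(layers)
              + weight_buffer_bits(layers, pf))
    return dsp_usage(pf, pc, pv) <= dsp_budget and memory <= memory_bits_budget
```

Each layer is described here by a hypothetical `(height, width, in_channels, kernel_size)` tuple. Sweeping `fits` over candidate parallelism values and keeping the largest feasible setting mirrors the hardware optimization step of the workflow.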
|Opt-Mode||Latency [ms]||aPE [nats]||ECE [%]||Accuracy [%]|
V-A Experimental Setup
In this paper, an Intel Arria 10 SX660 FPGA is set as our target hardware platform. 1GB DDR4 SDRAM is installed as off-chip memory. The PyTorch framework is used for the software implementation. We focus on image classification. We evaluate the networks on tuples of an image and a target, where the target is a one-hot encoding of the classes. Given the image input, we approximate the predictive distribution over the target by averaging the network outputs over the Monte Carlo samples, each obtained with its own set of Bernoulli masks. We consider the same dropout setting for all MCD instances.
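Written out, the Monte Carlo approximation described above takes the usual form below. The symbols are assumptions introduced for this sketch because the original notation is not reproduced here: $x$ is the input image, $y$ the target, $S$ the number of samples, and $\mathcal{M}_s$ the set of Bernoulli masks drawn for sample $s$ across the $L$ layers where MCD is applied.

```latex
p(y \mid x) \approx \frac{1}{S} \sum_{s=1}^{S} p\left(y \mid x, \mathcal{M}_s\right),
\qquad
\mathcal{M}_s = \left\{ M_s^{(1)}, \dots, M_s^{(L)} \right\}
```

Each term in the sum is the softmax output of one stochastic forward pass, so the estimate costs $S$ passes through the Bayesian part of the network.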
The evaluated networks are LeNet-5 lecun1998gradient, VGG-11 simonyan2014very for SVHN and ResNet-18 he2016deep for CIFAR-10. We reduced the channel size of VGG-11 and ResNet-18 to fit them into memory. In terms of partial Bayesian inference, we explore adding dropout in different parts of the NN, always following convolutional, BN and ReLU layers, and optionally pooling. Similarly to the datasets, we explore state-of-the-art architectures of increasing complexity, whose core is widely used across practical applications. Their structural irregularities present challenges to the accelerator’s design. We consider partial BNNs. All experiments were repeated 5 times.
In addition to measuring the classification accuracy, we establish metrics for the evaluation of the predictive uncertainty and confidence. For input that should rightfully confuse the net, we measure the quality of the uncertainty prediction on random Gaussian noise with the mean and variance of the training data, using the average predictive entropy (aPE), i.e. the entropy of the predictive distribution averaged over the noise dataset. Additionally, we measure the confidence with which the net makes its predictions on the test data through the expected calibration error (ECE) guo2017calibration. ECE signals that a BNN is uncalibrated if it makes predictions whose confidence does not match its accuracy. We calculate ECE with respect to 10 bins.
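Both metrics are short to compute once the predictive distributions are available. The NumPy sketch below follows the standard definitions of predictive entropy and of ECE with equal-width confidence bins; the function names and the small `eps` guard against log(0) are assumptions of this illustration, not details taken from the paper.

```python
import numpy as np


def average_predictive_entropy(probs, eps=1e-12):
    """Mean entropy in nats; probs has shape (num_inputs, num_classes)."""
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return float(np.mean(entropy))


def expected_calibration_error(probs, labels, num_bins=10):
    """Weighted gap between accuracy and confidence over equal-width bins."""
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, num_bins + 1)
    ece = 0.0
    for low, high in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > low) & (confidence <= high)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)
```

On pure noise, a well-behaved BNN should approach the maximum entropy, which for a uniform prediction over K classes is log(K) nats. On test data, an ECE close to zero means the reported confidence tracks the achieved accuracy.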
We implement our design in Verilog and use Quartus Prime Pro 17 for synthesis and implementation. Based on the resource model and the available resources on our FPGA, the three parallelism levels are fixed and the final design is clocked at 225 MHz. The resource usage of the proposed accelerator is presented in Table II. Since our accelerator is based on 8-bit precision, 8-bit linear quantization [jacob2018quantization] is applied to the trained models.
V-B Hardware Performance Comparison
For each network, we measure the hardware performance on the FPGA, an Intel Core i9-9900K CPU and an NVIDIA RTX 2080 SUPER GPU. The batch size is 1 on all the hardware platforms for a fair comparison. (Since PyTorch does not support 8-bit quantization on a GPU, the GPU latency is estimated by dividing its floating-point latency by 4, which is theoretically the lowest latency that the GPU can achieve.) For the FPGA implementation, we measure the latency with and without IC (Section III-C) to demonstrate its effect. The results are shown in Table III, where the down and up arrows indicate the desired tendency for a given metric. Comparing the FPGA implementations with and without IC on VGG-11 and ResNet-18, it can be seen that the speed up brought by IC goes down when the portion of Bayesian layers increases and the number of samples decreases. In comparison to the CPU and GPU implementations, the BNNs on the FPGA with IC achieve a speed up over both (Table III). There are two reasons for the speedup: 1) The adoption of the IC technique together with MCD and partial Bayesian inference, which decreases the amount of memory accesses and computation; 2) The support for fine-grained parallelism on the accelerator, which fully utilizes the extensive concurrency exhibited in BNNs. On LeNet-5, since the execution time is mainly occupied by the last layer, IC does not bring much improvement on the FPGA compared with the GPU and CPU. However, because current state-of-the-art NNs only spend a small portion of the execution time in the last layer, our accelerator can still achieve a speed up on most NNs, and thus is practical enough for real-life applications.
|w/ IC||w/o IC|
We also compare our work with the other BNN accelerators cai2018vibnn, 9116302 in Table IV. Because the three-layer BNNs evaluated in cai2018vibnn and 9116302 are unrealistic for real-life applications, we run a commonly used ResNet-101 he2016deep on our design with MCD applied onto every layer. However, as neither cai2018vibnn nor 9116302 supports ResNet-101, their reported performance is still based on the three-layer BNNs in their original papers. For a fair comparison, we evaluate all the accelerators in terms of throughput, compute efficiency and energy efficiency. (The energy efficiency is quoted in giga-operations per second per watt (GOP/s/W) and the total board power consumption is 45W.) As shown in Table IV, our accelerator can achieve 3 times to 4 times higher energy efficiency and 6 times to 9 times better compute efficiency. It is also worth mentioning that previous BNN accelerators only support linear layers, while our proposed accelerator is versatile enough to support a wide range of operations including convolution, pooling or residual addition.
| | VIBNN cai2018vibnn | BYNQNet 9116302 | Our work |
| FPGA | Cyclone V | Zynq | Arria 10 |
| Clock frequency [MHz] | 212.95 | 200 | 225 |
| Total number of DSPs | 342 | 220 | 1473 |
| Energy Eff. [GOP/s/W] | 9.75 | 8.77 | 33.3 |
| Comp. Eff. [GOP/s/DSP] | 0.174 | 0.121 | 1.079 |
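The efficiency gains quoted in the text follow directly from the last two rows of Table IV. The short script below only divides the tabulated values; the dictionary and function names are introduced here for illustration.

```python
# Values copied from the energy- and compute-efficiency rows of Table IV.
energy_eff = {"VIBNN": 9.75, "BYNQNet": 8.77, "Our work": 33.3}
compute_eff = {"VIBNN": 0.174, "BYNQNet": 0.121, "Our work": 1.079}


def gains(table, ours="Our work"):
    """Ratio of our figure to each baseline's figure."""
    return {name: table[ours] / value
            for name, value in table.items() if name != ours}


for label, table in (("energy", energy_eff), ("compute", compute_eff)):
    print(label, {name: round(ratio, 2) for name, ratio in gains(table).items()})
```

The ratios come out at roughly 3.4 and 3.8 for energy efficiency and roughly 6.2 and 8.9 for compute efficiency, which is where the "3 to 4 times" and "6 to 9 times" ranges come from.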
V-C Effectiveness of Framework
As introduced in Section IV, our framework is designed to explore the trade-off between accuracy, latency, uncertainty estimation and confidence. This section investigates design space exploration with and without user constraints.
V-C1 Exploration Without Constraints
To find the global optimal latency, accuracy, uncertainty and confidence points, we set four different optimization modes, Opt-Latency, Opt-Accuracy, Opt-Uncertainty and Opt-Confidence, for all BNNs without any constraints. The results are illustrated in Table I, which lists the lowest latency that our accelerator can achieve on each of the three NNs as well as the accuracy each NN reaches on its corresponding dataset with the Opt-Accuracy mode enabled. The framework also suggests different configurations to achieve the optimal aPE and ECE.
V-C2 Exploration With Constraints
To demonstrate that our framework is able to find the optimal points when the user’s requirements are given, we set latency, accuracy and uncertainty constraints for ResNet-18 on CIFAR-10 dataset and select the Opt-Confidence mode for optimization. Figure 6 shows all the candidate points with respect to accuracy, latency, aPE and ECE. The global optimal points with respect to different metrics are highlighted by the black arrows. The feasible design space constructed by accuracy, latency and uncertainty constraints is represented by the black box. Within this feasible design space, our framework generates the point with the lowest ECE, which is marked by the red arrow. Therefore, the proposed framework is able to find the optimal points when users’ constraints are given.
This work proposes a high-performance FPGA-based design to accelerate Bayesian neural networks (BNNs) inferred through Monte Carlo Dropout. The accelerator is versatile enough to support a variety of Bayesian neural networks and it achieves up to 4 times higher energy efficiency and 9 times better compute efficiency than other state-of-the-art accelerators. Additionally, we presented a framework to automatically trade off hardware and algorithmic performance, given hardware constraints and algorithmic requirements. In the future, we aim to explore neural architecture search for BNNs and to co-develop the hardware design for the BNNs found.