I Introduction
In recent years, neural networks (NNs) have demonstrated outstanding performance in a variety of applications, ranging from image classification [ferianc2020vinnas] and segmentation [kendall2015bayesian] to human action recognition [fan2019f]. However, one of the main drawbacks of standard NNs is that they cannot capture model uncertainty, which is crucial for many safety-critical applications such as healthcare [liang2018bayesian] or autonomous vehicles [azevedo2020stochasticyolo]. In contrast to standard NNs, Bayesian neural networks (BNNs) [neal1993bayesian], which adopt Bayesian inference to provide a principled uncertainty estimation, have become more popular in these applications.
BNNs [neal1993bayesian] can describe complex stochastic patterns with well-calibrated confidence estimates. An example is shown in Figure 1: the BNN is uncertain in its predictions when shown completely irrelevant input, whereas a standard NN is wrongly overconfident. Hence, with BNNs we can treat such special cases explicitly [gal2016dropout], and they have become relevant in applications where the notion of uncertainty is essential.
However, the advantage of BNNs comes with a burden: due to the high dimensionality of modern BNNs, it is intractable to compute their predictive uncertainty analytically. Instead, the predictive distribution must be approximated through Monte Carlo sampling, which requires repeatedly sampling random numbers and running the same input data through the BNN multiple times, and this degrades hardware performance. Several algorithmic approximation techniques and hardware architectures [cai2018vibnn, myojin, 9116302] have been proposed to improve the hardware performance of BNNs. Nevertheless, these approaches have particular drawbacks: 1) The need to implement both an NN engine and a sampler makes the design resource- and memory-demanding, and thus current accelerators only support BNNs consisting solely of linear layers or binary operations, which does not reflect the needs of the industrial or research communities in terms of current state-of-the-art BNN architectures; 2) To obtain the uncertainty prediction, these accelerators simply perform forward passes through the whole network repeatedly without considering the actual algorithmic needs of BNNs, which makes them slower than standard NNs by a factor of the number of forward passes.
To address these challenges, we propose an FPGA-based design with support for fine-grained parallelism to accelerate BNNs inferred through Monte Carlo Dropout (MCD) [gal2016dropout] with high performance. The proposed accelerator is versatile enough to support a variety of BNN architectures. To further improve the hardware performance, we consider partial BNNs to decrease the amount of computation required by BNNs. An automatic framework is proposed to explore the trade-off between hardware and algorithmic performance; it finds a suitable hardware configuration and algorithmic parameters given the user's hardware constraints and algorithmic requirements. In summary, our contributions include:

A novel hardware architecture with an intermediate-layer caching technique to accelerate Bayesian neural networks inferred through Monte Carlo Dropout, which achieves high performance and resource efficiency (Section III).

An exploration framework for the hardware-algorithmic performance trade-off and the uncertainty estimation provided by partial Bayesian neural network design (Section IV).

A comprehensive evaluation of algorithmic and hardware performance on different datasets with respect to different state-of-the-art neural architectures (Section V).
II Related Work
II-A Field-Programmable Gate Array-based Accelerators
Acceleration of standard NNs has attracted extensive research and industrial interest in recent years [mittal2020survey]. Given the high computational demands of NNs, custom hardware accelerators are vital for boosting their performance. The high energy efficiency, computing capabilities and reconfigurability of FPGAs in particular make them a promising platform for accelerating many different NN architectures [mittal2020survey]. Nevertheless, the acceleration of BNNs specifically has not attracted similar interest in the research community, and only a few works have approached this challenge [cai2018vibnn, myojin, 9116302].
In VIBNN [cai2018vibnn], the authors developed an efficient FPGA-based accelerator for BNNs; however, they focused on BNNs consisting only of linear layers. Myojin et al. [myojin] propose a method for reducing the sampling time required for MCD [gal2016dropout] in edge computing by parallelising the calculation circuit using an FPGA. However, their method needs to binarise the BNN, and they again focus only on linear layers. Awano & Hashimoto [9116302] propose BYNQNet, a custom inference algorithm for BNNs consisting exclusively of linear layers, which employs quadratic nonlinear activation functions so that uncertainty propagation can be achieved using only polynomial operations. Although the design can achieve a high throughput, the restriction on the nonlinear activation functions limits its generality for different application scenarios. In [azevedo2020stochasticyolo], the authors propose software-based intermediate-layer caching (IC), evaluated in last-layer BNNs.
In comparison to these works, we focus on accelerating BNNs consisting of different layers with or without residual connections [he2016deep], including convolutions or pooling, which are popular in present-day networks [fan2019f, he2016deep]. Additionally, our work targets the already widespread MCD without any additional software reimplementation effort.
II-B Monte Carlo Dropout (MCD)
The concept of MCD [gal2016dropout] lies in casting dropout [srivastava2014dropout] in NNs as Bayesian inference. Unlike the dropout used in standard NNs, which is only enabled during training, MCD applies dropout during both training and evaluation. MCD can be described as applying a random filter-wise mask to the output feature maps of a layer, with one mask entry per filter. The mask follows a Bernoulli distribution, which generates binary random variables (0 or 1) with a given probability. In practice, this probability trades off the certainty, accuracy and calibration of the BNN. After MCD zeroes out the dropped output feature maps, the non-zero elements are rescaled. The final output under MCD is therefore the scaled Hadamard (element-wise) product of the output feature maps and the mask, where the mask is generated by a Bernoulli sampler at runtime for different filters and layers. The uncertainty estimate and the prediction are thus obtained by running the same input through the BNN multiple times, each time with a different set of sampled masks for each layer where MCD is applied, and averaging the outputs. The works [gal2016dropout, kendall2015bayesian] demonstrate that MCD can achieve a high quality of uncertainty estimation.
II-C Partial Bayesian Inference
A full BNN should be trained with MCD applied after every layer [gal2016dropout]. However, the authors of [kristiadi2020being, kendall2015bayesian] have demonstrated theoretically and empirically that making only some parts of a standard NN Bayesian, thus making it partially Bayesian, can improve both uncertainty estimation and accuracy. Given a multi-layer NN, partial Bayesian inference applies MCD in the last layers only and makes the preceding layers behave as a feature extractor for the Bayesian remainder. Partially applied dropout then represents a trade-off between hardware performance, algorithmic performance and uncertainty estimation [kendall2015bayesian]. In this paper, we exploit this trade-off by proposing a framework for exploring the positioning of MCD at different parts of the NN, which results in a partial BNN.
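As an illustration of filter-wise MCD in a partial BNN, the following minimal Python sketch averages several stochastic forward passes while applying dropout only after the last layers. The function names, the toy layers and the inverted-dropout scaling by `1 / keep_prob` are assumptions made for this example and are not taken from the accelerator itself.

```python
import random

def mc_dropout(feature_maps, keep_prob, rng):
    """Filter-wise dropout: zero whole feature maps and rescale the survivors."""
    # One Bernoulli draw per filter (1 = keep, 0 = drop).
    mask = [1 if rng.random() < keep_prob else 0 for _ in feature_maps]
    scale = 1.0 / keep_prob  # assumed inverted-dropout scaling
    return [[v * m * scale for v in fmap] for fmap, m in zip(feature_maps, mask)]

def mcd_predict(x, layers, num_bayesian, num_samples, keep_prob, rng):
    """Average num_samples stochastic passes of a partial BNN."""
    first_bayesian = len(layers) - num_bayesian
    total = None
    for _ in range(num_samples):
        h = x
        for idx, layer in enumerate(layers):
            h = layer(h)
            if idx >= first_bayesian:  # MCD only after the last layers
                h = mc_dropout(h, keep_prob, rng)
        if total is None:
            total = [list(fmap) for fmap in h]
        else:
            total = [[a + b for a, b in zip(ta, hb)] for ta, hb in zip(total, h)]
    return [[v / num_samples for v in fmap] for fmap in total]

# Toy "layers" acting on a list of feature maps (each a flat list of floats).
def double(fmaps):
    return [[2.0 * v for v in f] for f in fmaps]

def add_one(fmaps):
    return [[v + 1.0 for v in f] for f in fmaps]

x = [[1.0, 2.0], [3.0, 4.0]]
print(mcd_predict(x, [double, add_one], num_bayesian=1, num_samples=100,
                  keep_prob=0.75, rng=random.Random(0)))
```

Setting `num_bayesian` to zero recovers a standard deterministic NN, while setting it to the number of layers gives a full BNN.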
III Hardware Design
III-A Design Overview
An overview of the proposed hardware design is illustrated in Figure 2. The computation of the BNN is performed layer-by-layer using the same hardware design. The intermediate outputs of each layer are transferred to off-chip memory to reduce the on-chip memory consumption, and they are loaded back to the input buffer for processing of the next layer. The weights of different layers are stored in off-chip memory and loaded to the weight buffer while processing the corresponding layer.
The main component of the proposed accelerator is a neural network engine (NNE), which is designed to run one layer at a time and is general enough to run linear and convolutional layers with different kernel sizes. The NNE consists of a processing engine (PE), a functional unit (FU) and a dropout unit (DU). These submodules operate in a pipelined manner to improve the hardware performance. The PE is designed to perform matrix multiplication and supports three types of fine-grained parallelism: filter parallelism, channel parallelism and vector parallelism. The PE contains as many processing units (PUs) as the filter parallelism. Each PU contains as many multiplication-addition modules as the vector parallelism, and each module contains as many multipliers as the channel parallelism, followed by an adder tree for channel accumulation. After each PU, there is a chain of FU modules including batch normalization (BN) [ioffe2015batch], rectified linear unit (ReLU) activation, pooling (Pool) and shortcut (SC). The DU is placed at the end; it is a batch of multiplexers controlled by the zeros and ones generated by the Bernoulli sampler.
III-B Bernoulli Sampler
MCD is applied filter-wise, which means the number of Bernoulli random variables generated for each layer is equal to the number of output filters. Therefore, we adopt the single-bit linear feedback shift register (LFSR) design to implement a Bernoulli sampler, which is illustrated in Figure 3.
The LFSR is composed of a chain of shift registers formed as a loop. The maximum sequence length of an LFSR depends on the number of shift registers used in the loop: with $n$ shift registers it is $2^{n}-1$. The LFSR design used here would take 1500 years to iterate through the whole sequence when clocked at 160 MHz [andraka1998fpga]. Since a single LFSR can only support Bernoulli sampling with a probability of 0.5, the number of LFSRs depends on the required dropout rate. For instance, two LFSRs with an extra AND gate are required to implement a Bernoulli sampler with a probability of 0.25. Also, as mentioned in Section III, the NNE only processes a fixed number of filters at a time, given by the filter parallelism, so we design a serial-in-parallel-out (SIPO) module, placed after the LFSRs, which assembles the single Bernoulli bits into an MCD mask with one bit per filter processed in parallel. Since different filters are processed at different speeds, a first-in-first-out (FIFO) buffer is placed at the end of the Bernoulli sampler to cache the generated Bernoulli random variables and pop out the mask when required. If the overall processing is parallelised, it is not necessary to use more than one sampler; however, the samples drawn at runtime for each instance need to be distinct.
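The sampler structure described above can be modelled in software as follows. This is only a behavioural sketch: the 16-bit register length, the tap positions (16, 14, 13, 11, a commonly tabulated maximal-length choice), the seeds and the mask width of 8 are assumptions chosen for illustration, not the parameters of the actual design.

```python
from collections import deque
from itertools import islice

def lfsr16(seed):
    """16-bit Fibonacci LFSR (taps 16, 14, 13, 11); yields one bit per clock."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("an all-zero state locks up the LFSR")
    while True:
        feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (feedback << 15)
        yield state & 1

def bernoulli_quarter(seed_a, seed_b):
    """AND of two LFSR bit streams: ones appear with a probability near 0.25."""
    for a, b in zip(lfsr16(seed_a), lfsr16(seed_b)):
        yield a & b

def sipo_masks(bits, width):
    """Serial-in-parallel-out: group single bits into masks of `width` bits."""
    while True:
        yield [next(bits) for _ in range(width)]

PERIOD = 2 ** 16 - 1  # maximum sequence length for 16 shift registers

# Fill a small FIFO with masks, then pop one when the dropout unit needs it.
masks = sipo_masks(bernoulli_quarter(0xACE1, 0x1D2F), width=8)
fifo = deque(islice(masks, 4))
print(fifo.popleft())

# Empirical frequency of ones over one full period of the AND-ed stream.
ones = sum(islice(bernoulli_quarter(0xACE1, 0x1D2F), PERIOD))
print(ones / PERIOD)
```

The two LFSRs must start from different non-zero seeds; otherwise their streams coincide and the AND gate no longer lowers the probability.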
III-C Intermediate-layer Caching (IC)
To further improve the overall hardware performance, we propose a hardware implementation of the IC technique [azevedo2020stochasticyolo] to decrease the required compute and the number of memory accesses. An example of using IC is illustrated in Figure 4, where the NN contains two layers and, with the partial Bayesian technique applied, only the last layer needs to be run repeatedly, each time with a new dropout mask. In IC, the input of the last layer is stored on chip until the sampling is finished. Assuming the last layers of the NN must be run multiple times to obtain the prediction, IC avoids recomputing the preceding layers for every sample, which reduces both the compute and the number of memory accesses.
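The saving from IC can be illustrated by counting layer executions with and without caching. The sketch below is a software analogy under simplifying assumptions: every layer is treated as equally expensive, and the toy Bayesian layer uses the sample index as a deterministic stand-in for a freshly sampled mask. All names are hypothetical.

```python
def predict_without_ic(x, front_layers, bayes_layers, num_samples):
    """Naive MCD: every sample re-runs the whole network."""
    layer_runs, outputs = 0, []
    for sample in range(num_samples):
        h = x
        for layer in front_layers:
            h = layer(h)
            layer_runs += 1
        for layer in bayes_layers:
            h = layer(h, sample)
            layer_runs += 1
        outputs.append(h)
    return outputs, layer_runs

def predict_with_ic(x, front_layers, bayes_layers, num_samples):
    """IC: run the non-Bayesian front once and reuse its cached output."""
    layer_runs, outputs = 0, []
    h = x
    for layer in front_layers:
        h = layer(h)
        layer_runs += 1
    cached = h  # stands in for the activation kept on chip
    for sample in range(num_samples):
        h = cached
        for layer in bayes_layers:
            h = layer(h, sample)
            layer_runs += 1
        outputs.append(h)
    return outputs, layer_runs

front = [lambda h: h + 1.0] * 4           # four deterministic layers
bayes = [lambda h, s: h * (s % 2)]        # one toy "Bayesian" layer

out_a, runs_a = predict_without_ic(1.0, front, bayes, num_samples=100)
out_b, runs_b = predict_with_ic(1.0, front, bayes, num_samples=100)
print(runs_a, runs_b)  # 500 104
```

Both variants return identical samples; only the amount of repeated work differs, and the gap widens as the Bayesian part shrinks relative to the cached front.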
IV Optimization Framework
IV-A Workflow of Framework
As mentioned in Section II-C, partially applying MCD represents a trade-off between latency, accuracy, confidence and uncertainty estimation. The trade-off is decided by three types of parameters: 1) the portion of Bayesian layers, 2) the number of times the Bayesian part needs to be run, i.e. the number of samples, and 3) the hardware parallelism factors. In this paper, we propose a framework, shown in Figure 5, which automatically optimizes the configuration of the BNN with respect to these three types of parameters according to user requirements for the target hardware platform. In our hardware design space, two of the parallelism factors share one domain of candidate values, while the third is chosen from a separate set.
At the beginning, the framework requires users to specify the hardware constraints, the optimization mode and the minimal requirement for each metric. The hardware constraints include the available DSPs and memory resources of the target hardware platform. The optimization mode is selected from optimal-latency, optimal-accuracy, optimal-uncertainty prediction and optimal-confidence to minimise or maximise the chosen objective through greedy optimisation with respect to software and hardware configurations.
The first optimization is the hardware optimization, which determines the maximum parallelism level implementable on the target hardware in terms of the parallelism factors. The resource model is used at this step to estimate the resource consumption given the available degrees of parallelism. During algorithmic optimization, based on the determined hardware parameters, we obtain the latency from the performance look-up table for various BNNs with different portions of Bayesian layers and numbers of samples. At the same time, the accuracy, the quality of uncertainty prediction and the confidence of the BNN are evaluated in software. Then, the configurations which do not meet the minimal requirements are filtered out. The final configurations are selected according to the optimization mode specified at the beginning.
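The filter-then-select step of the algorithmic optimization can be sketched as below. This is a plain exhaustive filter over a candidate list rather than the greedy search used by the framework, and the candidate values, field names and the assumed direction of each metric (lower latency and ECE, higher accuracy and aPE) are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    bayesian_layers: int
    samples: int
    latency_ms: float
    accuracy: float
    ape: float
    ece: float

# Mode name -> (metric getter, True if larger is better). Directions are assumed.
MODES = {
    "opt-latency": (lambda c: c.latency_ms, False),
    "opt-accuracy": (lambda c: c.accuracy, True),
    "opt-uncertainty": (lambda c: c.ape, True),
    "opt-confidence": (lambda c: c.ece, False),
}

def select(candidates, mode, max_latency_ms=float("inf"), min_accuracy=0.0,
           min_ape=0.0, max_ece=float("inf")):
    """Drop candidates violating the minimal requirements, then pick the best."""
    feasible = [c for c in candidates
                if c.latency_ms <= max_latency_ms and c.accuracy >= min_accuracy
                and c.ape >= min_ape and c.ece <= max_ece]
    if not feasible:
        return None
    key, larger_is_better = MODES[mode]
    return max(feasible, key=key) if larger_is_better else min(feasible, key=key)

# Hypothetical look-up-table entries, not measured results.
candidates = [
    Candidate(1, 3, 0.5, 0.90, 0.4, 3.0),
    Candidate(1, 100, 1.2, 0.91, 0.6, 2.0),
    Candidate(4, 100, 30.0, 0.93, 1.5, 1.0),
]
best = select(candidates, "opt-confidence", max_latency_ms=5.0, min_accuracy=0.9)
print(best)
```

Returning `None` when no candidate is feasible makes an over-constrained request explicit instead of silently relaxing a requirement.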
IV-B Resource Model
As memory and DSPs are the limiting resources in FPGA-based NN accelerators [liu2018optimizing], we mainly consider memory and DSP usage. The DSP usage depends on the multipliers used in the NNE. Due to 8-bit processing, we implement two multipliers using one DSP, and thus the DSP consumption is half the total number of multipliers. The memory resources are mainly consumed by the weight buffer and input buffer in the NNE and by the FIFO buffer in the Bernoulli sampler. The memory consumption of the FIFO is given by its width multiplied by its depth. As our design processes the NN layer-by-layer, the memory usage of the input buffer is dominated by the layer with the maximal input size, i.e. the largest product of height, width and number of input channels over all layers, multiplied by the data width. Since the weight buffer only needs to cache the filters processed in parallel, its memory consumption is determined by the number of parallel filters together with the number of input channels and the kernel size of the largest layer, again multiplied by the data width.
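A minimal sketch of such a resource model is given below. The exact expressions are assumptions reconstructed from the description: the multiplier count is taken as the product of the three parallelism factors, the FIFO width as the filter parallelism, and the per-filter weight count as input channels times the squared kernel size. The layer shapes, FIFO depth and budgets are hypothetical.

```python
from collections import namedtuple

# Input height, width, channels and kernel size of one layer.
Layer = namedtuple("Layer", "height width channels kernel")

def dsp_usage(p_filter, p_channel, p_vector):
    """Two 8-bit multipliers share one DSP block."""
    multipliers = p_filter * p_channel * p_vector
    return (multipliers + 1) // 2

def fifo_bits(mask_width, depth):
    """FIFO of the Bernoulli sampler: width times depth."""
    return mask_width * depth

def input_buffer_bits(layers, data_width):
    """Sized by the layer with the largest input."""
    return max(l.height * l.width * l.channels for l in layers) * data_width

def weight_buffer_bits(layers, p_filter, data_width):
    """Holds only the filters processed in parallel."""
    return p_filter * max(l.channels * l.kernel ** 2 for l in layers) * data_width

def fits(layers, p_filter, p_channel, p_vector, dsp_budget, mem_budget_bits,
         data_width=8, fifo_depth=64):
    """Check one parallelism setting against the device budgets."""
    memory = (fifo_bits(p_filter, fifo_depth)
              + input_buffer_bits(layers, data_width)
              + weight_buffer_bits(layers, p_filter, data_width))
    return (dsp_usage(p_filter, p_channel, p_vector) <= dsp_budget
            and memory <= mem_budget_bits)

layers = [Layer(32, 32, 3, 3), Layer(16, 16, 64, 3)]
print(dsp_usage(8, 8, 4), input_buffer_bits(layers, 8),
      weight_buffer_bits(layers, 8, 8))
```

A hardware optimization step could call `fits` for each candidate parallelism setting and keep the largest one that returns `True`.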
V Experiments
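Table I: Configurations selected by the framework under each optimization mode.

| Network | Opt-Mode | Bayesian layers, samples | FPGA latency [ms] | CPU latency [ms] | GPU latency [ms] | aPE [nats] | ECE [%] | Accuracy [%] |
|---|---|---|---|---|---|---|---|---|
| LeNet-5 | Opt-Latency | 1, 3 | 0.42 | 0.67 | 0.24 | | | |
| LeNet-5 | Opt-Accuracy | , 100 | 14.32 | 24.69 | 12.87 | | | |
| LeNet-5 | Opt-Uncertainty | , 100 | 14.83 | 42.0 | 19.91 | | | |
| LeNet-5 | Opt-Confidence | , 9 | 1.29 | 3.68 | 1.68 | | | |
| VGG-11 | Opt-Latency | 1, 3 | 0.57 | 0.95 | 0.68 | | | |
| VGG-11 | Opt-Accuracy | , 100 | 57.32 | 186.24 | 88.93 | | | |
| VGG-11 | Opt-Uncertainty | , 100 | 42.89 | 110.32 | 59.78 | | | |
| VGG-11 | Opt-Confidence | , 100 | 42.89 | 110.32 | 59.78 | | | |
| ResNet-18 | Opt-Latency | 1, 3 | 0.47 | 1.31 | 0.87 | | | |
| ResNet-18 | Opt-Accuracy | 1, 8 | 0.50 | 2.03 | 1.17 | | | |
| ResNet-18 | Opt-Uncertainty | , 100 | 32.04 | 173.53 | 93.23 | | | |
| ResNet-18 | Opt-Confidence | , 3 | 1.20 | 7.66 | 3.93 | | | |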
V-A Experimental Setup
In this paper, an Intel Arria 10 SX660 FPGA is our target hardware platform, with 1 GB of DDR4 SDRAM installed as off-chip memory. The PyTorch framework is used for the software implementation. We focus on image classification and evaluate the networks on tuples of an input image and a target, where the target is a one-hot encoding over the classes. Given the image input, we approximate the predictive distribution over the target by averaging the network outputs over the samples, each of which is obtained with its own set of Bernoulli masks. The number of samples is chosen from a fixed set of candidate values, and we use the same dropout probability for all MCD instances.
For datasets, we consider classifying images of increasing difficulty: MNIST, SVHN and CIFAR-10, through which we control the complexity of the experiments. We implement LeNet-5 [lecun1998gradient] for MNIST, VGG-11 [simonyan2014very] for SVHN and ResNet-18 [he2016deep] for CIFAR-10. We reduced the channel size of VGG-11 and ResNet-18 to fit them into memory. In terms of partial Bayesian inference, we explore adding dropout in different parts of the NN, always following a convolutional, BN and ReLU layer, and optionally pooling. Similarly to the datasets, we explore state-of-the-art architectures of increasing complexity, whose core is widely used across practical applications. Their structural irregularities present challenges to the accelerator's design. We consider partial BNNs with varying portions of Bayesian layers. All experiments were repeated 5 times.
In addition to measuring the classification accuracy, we establish metrics for the evaluation of the predictive uncertainty and confidence. For input that should rightfully confuse the net, we measure the quality of the uncertainty prediction with respect to random Gaussian noise with the mean and variance of the training data, using the average predictive entropy (aPE), i.e. the entropy of the predictive distribution averaged over all inputs in the noise dataset. Additionally, we measure the confidence with which the net makes its predictions on the test data through the expected calibration error (ECE) [guo2017calibration]. ECE signals that a BNN is uncalibrated if the confidence of its predictions does not match its accuracy. We calculate ECE with respect to 10 bins.
We implement our design in Verilog, and Quartus Prime Pro 17 is used for synthesis and implementation. The three parallelism factors are set based on the resource model and the available resources on our FPGA, and the final design is clocked at 225 MHz (see Table IV). The resource usage of the proposed accelerator is presented in Table II. Since our accelerator is based on 8-bit precision, 8-bit linear quantization [jacob2018quantization] is applied to the trained models.
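Table II: Resource usage of the proposed accelerator.

| Resources | ALMs | Registers | DSPs | M20K |
|---|---|---|---|---|
| Used | 303,913 | 889,869 | 1,473 | 2,334 |
| Total | 427,200 | 1,708,800 | 1,518 | 2,713 |
| Utilization | 71% | 52% | 97% | 86% |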
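The two metrics introduced in the experimental setup, aPE and ECE, can be computed as in the following sketch. It follows the standard definitions (entropy in nats, equal-width confidence bins); the function names and the half-open binning convention are choices made for this example.

```python
import math

def predictive_entropy(probs):
    """Entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def average_predictive_entropy(pred_probs):
    """aPE: predictive entropy averaged over a dataset."""
    return sum(predictive_entropy(p) for p in pred_probs) / len(pred_probs)

def expected_calibration_error(pred_probs, labels, num_bins=10):
    """ECE: weighted gap between accuracy and confidence over confidence bins."""
    bins = [[] for _ in range(num_bins)]
    for probs, label in zip(pred_probs, labels):
        confidence = max(probs)
        predicted = probs.index(confidence)
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, predicted == label))
    ece = 0.0
    for items in bins:
        if items:
            avg_conf = sum(c for c, _ in items) / len(items)
            accuracy = sum(1 for _, ok in items if ok) / len(items)
            ece += len(items) / len(labels) * abs(accuracy - avg_conf)
    return ece

# A maximally uncertain prediction over 10 classes has entropy ln(10).
print(average_predictive_entropy([[0.1] * 10]))
# Two predictions made with 0.9 confidence, only one of them correct.
print(expected_calibration_error([[0.9, 0.1], [0.9, 0.1]], [0, 1]))
```

On noise inputs a well-behaved BNN should yield a high aPE, while on test data a low ECE indicates that confidence tracks accuracy.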
V-B Hardware Performance Comparison
For each network, we measure the hardware performance on the FPGA, an Intel Core i9-9900K CPU and an NVIDIA RTX 2080 SUPER GPU. The batch size is 1 on all hardware platforms for a fair comparison. (Since PyTorch does not support 8-bit quantization on a GPU, the GPU latency is estimated by dividing its floating-point latency by 4, which is theoretically the lowest latency that the GPU can achieve.) For the FPGA implementation, we measure the latency with and without IC (Section III-C) to demonstrate its effect, and the results are shown in Table III; the down and up arrows indicate the desired tendency for a given metric. Comparing the FPGA implementations with and without IC on VGG-11 and ResNet-18, the speed-up brought by IC goes down when the portion of Bayesian layers increases and the number of samples decreases. In comparison to the CPU and GPU implementations, the BNNs on the FPGA with IC are considerably faster on VGG-11 and ResNet-18; for the first VGG-11 configuration in Table III, the FPGA with IC takes 0.76 ms against 11.76 ms on the CPU and 6.33 ms on the GPU. There are two reasons for the speed-up: 1) The adoption of the IC technique together with MCD and partial Bayesian inference, which decreases the number of memory accesses and the amount of computation; 2) The support for fine-grained parallelism on the accelerator, which fully utilizes the extensive concurrency exhibited in BNNs. On LeNet-5, since the execution time is mainly occupied by the last layer, IC does not bring much improvement on the FPGA compared with the GPU and CPU. However, because current state-of-the-art NNs spend only a small portion of the execution time in the last layer, our accelerator can still achieve a speed-up on most NNs and is thus practical enough for real-life applications.
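Table III: Latency [ms] on the FPGA with and without IC, compared with the CPU and GPU.

| Network | Bayesian layers, samples | FPGA w/ IC | FPGA w/o IC | CPU | GPU |
|---|---|---|---|---|---|
| LeNet-5 | 1, 100 | 13.73 | 14.38 | 11.17 | 5.81 |
| LeNet-5 | , 50 | 7.16 | 7.20 | 12.02 | 6.07 |
| VGG-11 | 1, 100 | 0.76 | 57.3 | 11.76 | 6.33 |
| VGG-11 | , 50 | 21.52 | 28.67 | 55.94 | 30.09 |
| ResNet-18 | 1, 100 | 1.22 | 44.97 | 13.96 | 7.05 |
| ResNet-18 | , 50 | 18.90 | 22.48 | 131.41 | 65.9 |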
We also compare our work with the other BNN accelerators [cai2018vibnn, 9116302] in Table IV. Because the three-layer BNNs evaluated in [cai2018vibnn] and [9116302] are unrealistic for real-life applications, we run a commonly used ResNet-101 [he2016deep] on our design with MCD applied to every layer. However, as neither [cai2018vibnn] nor [9116302] supports ResNet-101, their reported performance is still based on the three-layer BNNs in their original papers. For a fair comparison, we evaluate all the accelerators in terms of throughput, compute efficiency and energy efficiency. (The energy efficiency is quoted in giga-operations per second per watt (GOP/s/W), and the total board power consumption of our design is 45 W.) As shown in Table IV, our accelerator achieves 3 to 4 times higher energy efficiency and 6 to 9 times better compute efficiency. It is also worth mentioning that previous BNN accelerators only support linear layers, while our proposed accelerator is versatile enough to support a wide range of operations, including convolution, pooling and residual addition.
Table IV: Comparison with other BNN accelerators.

| | VIBNN [cai2018vibnn] | BYNQNet [9116302] | Our work |
|---|---|---|---|
| FPGA | Cyclone V 5CGTFD9E5F35C7 | Zynq XC7Z020 | Arria 10 SX660 |
| Clock frequency [MHz] | 212.95 | 200 | 225 |
| Total number of DSPs | 342 | 220 | 1473 |
| Power [W] | 6.11 | 2.76 | 45.00 |
| Throughput [GOP/s] | 59.6 | 24.22 | 1590 |
| Energy Eff. [GOP/s/W] | 9.75 | 8.77 | 33.3 |
| Comp. Eff. [GOP/s/DSP] | 0.174 | 0.121 | 1.079 |
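The improvement factors quoted for Table IV follow directly from the reported efficiency rows, as the short check below shows. It uses only the efficiency values as they appear in the table; the dictionary layout and variable names are arbitrary.

```python
# Efficiency values as reported in Table IV.
reported = {
    "VIBNN": {"energy_eff": 9.75, "compute_eff": 0.174},
    "BYNQNet": {"energy_eff": 8.77, "compute_eff": 0.121},
    "Our work": {"energy_eff": 33.3, "compute_eff": 1.079},
}

ours = reported["Our work"]
ratios = {
    name: {metric: ours[metric] / values[metric] for metric in values}
    for name, values in reported.items() if name != "Our work"
}

for name, r in ratios.items():
    print(f"{name}: {r['energy_eff']:.2f}x energy, {r['compute_eff']:.2f}x compute")
```

The energy-efficiency ratios fall between 3 and 4 and the compute-efficiency ratios between 6 and 9, matching the ranges stated in the text.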
V-C Effectiveness of Framework
As introduced in Section IV, our framework is designed to explore the trade-off between accuracy, latency, uncertainty estimation and confidence. This section investigates design space exploration with and without user constraints.
V-C1 Exploration Without Constraints
To find the globally optimal latency, accuracy, uncertainty and confidence points, we set four different optimization modes, Opt-Latency, Opt-Accuracy, Opt-Uncertainty and Opt-Confidence, for all BNNs without any constraints. The results are shown in Table I. The lowest latencies that our accelerator achieves on these three NNs are 0.42 ms, 0.57 ms and 0.47 ms respectively. With the Opt-Accuracy mode enabled, the framework returns, for each of the three NNs, the configuration with the highest accuracy on its corresponding dataset. The framework also suggests different configurations to achieve the optimal aPE and ECE.
V-C2 Exploration With Constraints
To demonstrate that our framework is able to find the optimal points when the user's requirements are given, we set latency, accuracy and uncertainty constraints for ResNet-18 on the CIFAR-10 dataset and select the Opt-Confidence mode for optimization. Figure 6 shows all the candidate points with respect to accuracy, latency, aPE and ECE. The globally optimal points with respect to different metrics are highlighted by the black arrows. The feasible design space defined by the accuracy, latency and uncertainty constraints is represented by the black box. Within this feasible design space, our framework selects the point with the lowest ECE, which is marked by the red arrow. Therefore, the proposed framework is able to find the optimal points when users' constraints are given.
VI Conclusion
This work proposes a high-performance FPGA-based design to accelerate Bayesian neural networks (BNNs) inferred through Monte Carlo Dropout. The accelerator is versatile enough to support a variety of Bayesian neural networks, and it achieves up to 4 times higher energy efficiency and 9 times better compute efficiency than other state-of-the-art accelerators. Additionally, we presented a framework that automatically trades off hardware and algorithmic performance, given hardware constraints and algorithmic requirements. In the future, we aim to explore neural architecture search for BNNs and to co-develop the hardware design for the BNNs found.