1. Introduction
Deep learning based AI technologies, such as deep convolutional neural networks (DNNs), have recently achieved tremendous success in numerous application domains such as object detection and image classification (Szegedy et al., 2013; Bojarski et al., 2016; Farooq et al., 2017; Chéron et al., 2015; Simonyan and Zisserman, 2014a). However, DNN models are known to be prone to overfitting due to insufficient training data in the real world, which can lead to wrong predictions when the model is deployed in unfamiliar environments. With the increasing adoption of safety-critical AI applications (e.g., healthcare and self-driving), wrong predictions can result in catastrophic incidents. For example, several accidents have recently been reported regarding poor safety-critical AI designs (NHTSA, 2017; Tesla, 2020); e.g., in 2020 an autopilot car crashed into a white truck because the sensor failed to distinguish the truck from the bright sky (Tesla, 2020). Therefore, enhancing the reliability and robustness of deep learning has become an urgent demand among AI practitioners.
As one of the most popular probabilistic machine learning tools, Bayesian Neural Networks (BNNs) have been increasingly employed in a wide range of real-world AI applications that require reliable and robust decision making, such as self-driving, rescue robots, disease diagnosis, and scene understanding (Amini et al., 2018; Wulfmeier, 2018; Leibig et al., 2017; Kendall et al., 2015). BNNs have also emerged as a promising solution in today's data center services for improving product experiences (e.g., Instagram and YouTube) and infrastructure, and for aiding cutting-edge research (Facebook, 2021). Different from traditional DNNs, which require massive training data, BNN models can more easily learn from small datasets and are more robust to overfitting (Blundell et al., 2015). Furthermore, BNNs are capable of providing valuable uncertainty information for users to better interpret the situation without making overconfident decisions (Gal and Ghahramani, 2016; Amodei et al., 2016; Cheung et al., 2011). Generally, a BNN model can be viewed as a probabilistic model in which each model parameter, i.e., weight, is a probability distribution. Training a BNN essentially calculates the probability distribution of the weights, which requires integrating over an infinite number of neural networks and is often intractable. To tackle this, recent efforts (Blundell et al., 2015; Graves, 2011; Shridhar et al., 2019) leverage Gaussian distributions to approximate the target weight distributions via weight sampling to identify the mean and standard deviation of each weight.
Training DNN models on current hardware devices has long been considered a slow and energy-consuming task (Goel et al., 2020; Venkataramani et al., 2017; Das et al., 2016). Compared with traditional DNN training, BNN training inefficiency is further exacerbated by the requirement of training an ensemble of sampled DNN models to ensure robustness. As a consequence, we have observed that the total data movement during a BNN training procedure can be orders of magnitude larger than training one single DNN model. Moreover, as the existing DNN training optimization techniques (Zheng et al., 2020; Rhu et al., 2016; Zhang et al., 2019; Yang et al., 2020a; Wang et al., 2018a) are oblivious to the unique sampling process of the probabilistic BNN models, they lack the capabilities to efficiently and effectively deal with the excessive data movement induced by the memory-intensive BNN training, resulting in poor energy efficiency and long training latency.
In this paper, we first conduct a comprehensive characterization of state-of-the-art BNN training on current DNN accelerators and analyze its inefficiency. By carefully breaking down the memory activities in each BNN layer, we observe that the dominant factor inducing BNN training inefficiency is the massive data movement from Gaussian random variables (GRVs). These variables are generated during forward propagation for weight sampling and sent to off-chip memory for later reuse during backward propagation. They contribute the major portion of the total off-chip memory accesses for BNN training (e.g., up to 71%). To tackle this challenge, we propose a novel design that is capable of eliminating all the off-chip memory accesses by GRVs without incurring any training accuracy loss. Our design is based on a key observation that the software-level “forth-back” training procedure shares great similarity with the classic hardware-level reversed shifting of the Linear Feedback Shift Registers (LFSRs) which are used in modern BNNs to generate the GRVs (Cai et al., 2018). By leveraging the reversible property of LFSRs, we build a highly efficient, memory-friendly design based on LFSR reversed shifting, which can accurately retrieve all the GRVs (i.e., the bit patterns generated in forward propagation) locally during backpropagation without ever storing them during forward propagation. Furthermore, to investigate the compatibility of our LFSR reversion strategy with real hardware, we qualitatively study the design possibilities of directly integrating our strategy into existing DNN accelerators that adopt various computation mapping schemes, and eventually identify the optimal mapping to support our BNN training design. Based on this knowledge, we design and prototype the first highly efficient hardware accelerator for BNN training, named ShiftBNN, which takes advantage of the drastically reduced data movement enabled by our LFSR reversion strategy. This study makes the following contributions:
We characterize modern BNN training on state-of-the-art DNN accelerators and reveal that the root cause of its training inefficiency originates from the massive data transfer induced by GRVs;

We propose a novel design that eliminates all the off-chip data transfer related to GRVs through local LFSR reversed shifting without affecting the training accuracy;

We identify the potential hardware-level challenges of directly applying our design to BNN training and significantly mitigate these issues via a thorough qualitative design space exploration;

We design and prototype the first highly efficient BNN training accelerator that is low-cost and scalable, well supported by a hybrid dataflow;

Extensive evaluation on five representative BNN models demonstrates that ShiftBNN achieves an average of 4.9× (up to 10.8×) improvement in energy efficiency and 1.6× (up to 2.8×) speedup over the baseline accelerator. ShiftBNN also scales well to larger BNN model sample sizes.
2. Background
2.1. Training BNNs with Variational Inference
A Bayesian neural network (BNN) can be viewed as a probabilistic model in which each model parameter, e.g., weight, is a probability distribution. One of the most popular methods for training BNN models is known as Variational Inference (Blei et al., 2017; Hoffman et al., 2013; Zhang et al., 2018), which finds a probability distribution $q(\mathbf{w}|\theta)$ to approximate the target weight distribution $p(\mathbf{w}|\mathcal{D})$ ($q$ belongs to a common distribution family). Searching for $q(\mathbf{w}|\theta)$ is an optimization problem that aims to minimize the loss function $F(\mathcal{D}, \theta)$ with respect to $\theta$ (Eq. 1):

$F(\mathcal{D}, \theta) \approx \sum_{i=1}^{S} \log q(\mathbf{w}^{(i)}|\theta) - \log p(\mathbf{w}^{(i)}) - \log p(\mathcal{D}|\mathbf{w}^{(i)})$  (1)

In Eq. 1, $\mathbf{w}^{(i)}$ denotes the $i$-th sample of weights drawn from the approximation distribution $q(\mathbf{w}|\theta)$. Typically, $q$ is assumed to be a Gaussian distribution, where $\mu$ and $\sigma$ are the mean and standard deviation of the Gaussian distribution, respectively, i.e., $\theta = (\mu, \sigma)$. Each sample of weights can be obtained by using $\mathbf{w} = \mu + \sigma \circ \epsilon$, where $\epsilon^{(i)}$ denotes the $i$-th random variable drawn from the unit Gaussian distribution and $\circ$ represents pointwise multiplication. $\log q(\mathbf{w}|\theta)$, $\log p(\mathbf{w})$ and $\log p(\mathcal{D}|\mathbf{w})$ are defined as the posterior, prior and log-likelihood terms, respectively. In summary, the model parameters $\mu$ and $\sigma$ can be learned progressively by repeating the following steps (details are shown in Fig.1 (a)):

Generate $S$ $\epsilon$'s from $\mathcal{N}(0, 1)$ for each weight;

Obtain $S$ samples for each weight via $\mathbf{w} = \mu + \sigma \circ \epsilon$;

Calculate the loss function $F(\mathcal{D}, \theta)$, where $\theta = (\mu, \sigma)$;

Calculate the gradients of $F$ with respect to $\mu$ and $\sigma$;

Update the model parameters $\mu$ and $\sigma$.
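The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration of the sampling and update loop only, using a single linear layer and a squared-error likelihood; it omits the prior and posterior terms of the loss, and all names and sizes below are our own choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-layer Bayesian "network": y_hat = x @ w, with each w ~ N(mu, sigma^2).
D_in, D_out, S = 3, 2, 4                      # input dim, output dim, number of samples
mu    = rng.normal(0.0, 0.1, (D_in, D_out))   # variational means
sigma = np.full((D_in, D_out), 0.1)           # variational standard deviations

x = rng.normal(size=(8, D_in))                # a mini-batch of training examples
y = rng.normal(size=(8, D_out))               # ground-truth targets

grad_mu, grad_sigma = np.zeros_like(mu), np.zeros_like(sigma)
for _ in range(S):
    eps = rng.normal(size=mu.shape)           # step 1: draw a unit-Gaussian eps per weight
    w   = mu + sigma * eps                    # step 2: reparameterized weight sample
    err = x @ w - y                           # step 3: (likelihood part of the) loss
    g_w = x.T @ err / len(x)                  # dLoss/dw for this sample
    grad_mu    += g_w                         # step 4: dLoss/dmu    = dLoss/dw
    grad_sigma += g_w * eps                   #         dLoss/dsigma = dLoss/dw * eps

lr = 0.01
mu    -= lr * grad_mu / S                     # step 5: update both distribution parameters
sigma -= lr * grad_sigma / S
```

Note that the same `eps` drawn during sampling reappears in the `sigma` gradient: this reuse across training stages is exactly what Section 3 later shows to be so costly in hardware.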
2.2. Computation Flow of BNN Training
From an algorithmic perspective, Fig.1 (a) illustrates the computation flow of BNN training which consists of three main stages: Forward (FW), Backward (BW) and Gradient Calculation (GC).
Forward (FW) stage aims to calculate the loss of the network given an input training example $x$. For simplicity of discussion, we assume a mini-batch of size 1. In each layer $l$, for one input training example, the Gaussian random variables $\epsilon$ are sampled S times to obtain S samples of the weights, denoted as process ①. These weights are convolved with their corresponding input samples, producing S samples of the output, which are then treated as the input for the next layer. For the first layer, all weight samples are convolved with the same input $x$. The outputs of the last layer are compared with the ground truth to obtain the loss (error).
Backward (BW) stage propagates the network errors from the last layer to the first layer. In each layer $l$, the S samples of weight matrices are reconstructed using the original Gaussian random variables $\epsilon$ and the model parameters $(\mu, \sigma)$, denoted as process ②. The reconstructed kernels are then rotated and convolved with the corresponding samples of errors to obtain the errors of the previous layer.
Gradient Calculation (GC) stage updates the model parameters $\mu$ and $\sigma$ to minimize the training loss, which requires calculating the gradients of the model parameters, $\nabla\mu$ and $\nabla\sigma$. The gradient of a sampled weight comes from the prior $\log p(\mathbf{w})$, posterior $\log q(\mathbf{w}|\theta)$ and likelihood $\log p(\mathcal{D}|\mathbf{w})$ terms. The gradient of the likelihood is generated by convolving the feature maps with the errors; this part is the same as in normal DNN training. The gradients of the prior and posterior can be easily derived once the original weights are reconstructed, because their computation requires no intermediate feature maps. Finally, the S samples of the gradients are summed up and then multiplied with a small coefficient (the learning rate) to produce the parameter updates. Based on the sampling rule $\mathbf{w} = \mu + \sigma \circ \epsilon$, the Gaussian random variables $\epsilon$ are used to calculate the final updates of $\sigma$. This step corresponds to process ③.
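The role of $\epsilon$ in this stage follows directly from the sampling rule: since $\mathbf{w} = \mu + \sigma \circ \epsilon$, the chain rule gives $\partial L/\partial\mu = \partial L/\partial \mathbf{w}$ and $\partial L/\partial\sigma = \partial L/\partial \mathbf{w} \cdot \epsilon$, so the original $\epsilon$ must be available again during GC. A small self-contained numerical check, with an arbitrary toy loss of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, eps = 0.5, 0.2, rng.normal()

loss = lambda w: (w - 1.0) ** 2            # any differentiable scalar loss of one weight
w   = mu + sigma * eps                     # the sampling rule
g_w = 2.0 * (w - 1.0)                      # analytic dL/dw

# Numerical gradients through the sampling rule via central differences.
h = 1e-6
g_mu_num    = (loss((mu + h) + sigma * eps) - loss((mu - h) + sigma * eps)) / (2 * h)
g_sigma_num = (loss(mu + (sigma + h) * eps) - loss(mu + (sigma - h) * eps)) / (2 * h)

assert abs(g_mu_num - g_w) < 1e-4          # dL/dmu    == dL/dw
assert abs(g_sigma_num - g_w * eps) < 1e-4 # dL/dsigma == dL/dw * eps
```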
Fig.1 (b) illustrates the detailed computation within a single BNN convolutional layer. The key feature here is a sample dimension added on top of the normal DNNs' six-dimension convolution. Note that different samples execute independently without any data exchange.
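A naïve loop-level sketch of this seven-dimensional nest is shown below. The dimension names are our own, and real accelerators tile and parallelize these loops rather than running them sequentially:

```python
import numpy as np

# samples, batch, output/input channels, output rows/cols, kernel size (all toy values)
S, B, M, N, R, C, K = 2, 1, 4, 3, 5, 5, 3
rng = np.random.default_rng(0)
x = rng.random((S, B, N, R + K - 1, C + K - 1))  # each sample has its own activations
w = rng.random((S, M, N, K, K))                  # ...and its own sampled weights

out = np.zeros((S, B, M, R, C))
for s in range(S):              # the extra sample dimension: fully independent executions
    for b in range(B):          # the remaining six loops are an ordinary convolution
        for m in range(M):
            for n in range(N):
                for r in range(R):
                    for c in range(C):
                        out[s, b, m, r, c] += np.sum(x[s, b, n, r:r+K, c:c+K] * w[s, m, n])
```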
3. Challenges of BNN training
Traditional DNN training. DNN training has long been considered a slow and energy-consuming task (Goel et al., 2020; Venkataramani et al., 2017; Das et al., 2016). At a high level, the massive energy consumption and high latency mainly come from millions of Multiply-accumulate operations (MACs) and intensive data movement between memory and processing elements (PEs). As the unit energy cost (J/bit) of off-chip memory accesses is orders of magnitude higher than that of MACs (Chen et al., 2016; Dally, 2011; Horowitz, 2014), data movement usually poses greater challenges for energy-efficient DNN training (Wang et al., 2019). Moreover, the ongoing development of low-precision training techniques (Gupta et al., 2015; Wang et al., 2018b; Fu et al., 2020) can potentially reduce the unit energy cost of MACs, but this could also result in a proportionally higher impact of data movement on the overall training energy efficiency.
Current BNN training. Compared to traditional DNN training, BNN training inefficiency is further exacerbated by the requirement of training an ensemble of sampled DNN models, shown in Fig. 1 (a). This is necessary because a sufficient number of training samples is essential for building a robust BNN model. But it could also incur an explosive amount of data movement during the training process.
To further quantify this, we investigate the impact of the number of samples on the overall BNN training efficiency. We implemented five types of widely-adopted BNN models representing a broad range of domains, as well as their corresponding DNN models. Note that BNN models are typically built upon their matching DNN models, e.g., Bayesian AlexNet or B-AlexNet is based on AlexNet. For verification purposes, the training process is performed on a general DianNao-like DNN accelerator equipped with an output-stationary dataflow (Chen et al., 2014). The detailed experimental setup can be found in Section 7.1. Three metrics are used for training evaluation: data transfer, overall energy consumption, and training latency. The data transfer represents the amount of data read from and written to the off-chip memory. Due to the architectural heterogeneity of the five BNN models, each result is normalized to its corresponding baseline DNN model. Fig.2 shows that a BNN model with only 8 samples would drastically increase the off-chip data transfer by an average of 9.1× compared with its corresponding DNN model. This number grows to 35.3× as the number of BNN training samples scales up to 32. Specifically, for the B-VGG model with 16 samples (S=16), training each input example for one iteration would require 22.6GB of data transfer from/to off-chip memory, a 17.9× increase over the original VGG model. Since an off-chip memory access is often considered a high-cost operation, the large amount of data transfer during BNN training could produce massive energy consumption and potentially lead to performance degradation. For example, we observed that the overall energy consumption and training latency with 32 samples incur an average of 33.2× and 31.8× increase over those of the baseline DNN models, respectively.
Fig.3 shows the breakdown of the total off-chip data transfer when the accelerator evaluates every input training example during one training iteration. It can be observed that the Gaussian random variables $\epsilon$ take up the major portion of the total data transfer (i.e., 71% on average). Meanwhile, the weight parameters and the input/output feature maps only contribute 16% and 12% on average, respectively. There are several reasons behind such a dominating presence of $\epsilon$. First, as a unique variable introduced by BNN execution, $\epsilon$ must be stored and reused in two different stages. As shown in Fig.1 (a) ①, during the forward stage, S samples of $\epsilon$ are generated from the local random number generators for each pair of $(\mu, \sigma)$ to obtain S samples of weights. After that, the $\epsilon$'s have to be stored into the off-chip memory due to their large data volume, and they reside there until the later weight reconstruction during the backward stage (②) and the gradient computation of $\sigma$ during the gradient calculation (GC) stage (③). Note that recent memory-centric approaches such as vDNN (Rhu et al., 2016), Echo (Zheng et al., 2020) and SuperNeurons (Wang et al., 2018a) reduce memory accesses through smart recomputation in backpropagation from selected small intermediate data saved in forward propagation. However, since the $\epsilon$'s are a large amount of independent random numbers that cannot be recomputed, these works cannot help reduce the intensive memory accesses in BNN training. Second, the size of $\epsilon$ is much larger than that of the weight parameters and the intermediate feature maps/errors. Since one pair of weight parameters $(\mu, \sigma)$ requires S samples of $\epsilon$ for weight sampling, the total size of $\epsilon$ can be $S/2\times$ that of the weight parameters. And for the current BNN models, the size of the weights (i.e., half of the weight parameters) is still much larger than the size of the feature maps. For instance, on average the size of the weights is 122× the size of the feature maps/errors across the five BNN models. Therefore, although the input/output feature maps also consist of S samples, the total transferred intermediate data size is still much less than that from $\epsilon$.
In summary, the long reuse distance of a large amount of Gaussian random variables across different training stages is the key problem that causes a huge amount of off-chip memory accesses (the transferred amount of $\epsilon$'s grows linearly with the sample size). This further leads to massive energy consumption and potential performance degradation during BNN training. Beyond the existing DNN accelerators, such a challenge is also observed on conventional CPU/GPU platforms, as the cross-stage memory access of $\epsilon$ is inevitable in the BNN training algorithm. Therefore, a specialized solution is needed.
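A back-of-envelope model makes this scaling explicit. All constants below (weight count, precision, number of stages touching $\epsilon$) are illustrative assumptions of ours, not the paper's measured configuration:

```python
def grv_traffic_bytes(num_weights, S, bytes_per_eps=2, touches=3):
    # One eps per weight per sample; written once in FW, then read again in BW and GC.
    return num_weights * S * bytes_per_eps * touches

def weight_param_traffic_bytes(num_weights, bytes_per_param=4, params_per_weight=2):
    # mu and sigma are fetched once per layer pass regardless of the sample size.
    return num_weights * params_per_weight * bytes_per_param

n_w = 138_000_000                             # rough weight count of a VGG-scale model
print(grv_traffic_bytes(n_w, S=16) / 1e9)     # GRV traffic grows linearly with S
print(weight_param_traffic_bytes(n_w) / 1e9)  # parameter traffic stays flat as S scales
```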
4. Key Design Insights of ShiftBNN
To overcome these challenges brought by the excessive data movement of the Gaussian random variables $\epsilon$ (or GRVs), we propose a novel design that is able to eliminate all the memory accesses related to $\epsilon$ without training accuracy loss. We made a key observation that the software-level “forth-back” training procedure shares similarity with the classic hardware-level reversed shifting of the Linear Feedback Shift Register (LFSR), which is used in BNNs to generate the Gaussian random variables (Cai et al., 2018). Specifically, we can potentially retrieve all the $\epsilon$'s locally during the Backward stage by shifting the LFSRs backward, instead of storing them during the Forward stage. In the following subsections, we first introduce the principles of LFSR operation, and then illustrate how to use LFSR reversed shifting to retrieve the Gaussian random variables $\epsilon$. Finally, we showcase a detailed example to demonstrate the feasibility of our strategy while also exposing some potential hardware-level issues when directly applying it to BNN training.
4.1. Generating GRVs via LFSR Shifting
According to the Central Limit Theorem (Brosamler, 1988), a binomial distribution $B(n, p)$ can approximate a Gaussian distribution if $n$ is large enough. Here $n$ represents the total number of independent trials and $p$ denotes the probability of success for each trial. For instance, if there are $n$ individual bits that have an equal probability of being 0 or 1 (i.e., $p = 0.5$), the total number of “1”s in these $n$ bits will follow the binomial distribution $B(n, 0.5)$, which further approximates the Gaussian distribution $\mathcal{N}(n/2, n/4)$ when $n$ is large enough. Based on this insight, previous efforts (Kang, 2010; Cai et al., 2018; Andraka and Phelps, 1998; Condo and Gross, 2015) have proposed efficient Gaussian Random Number Generators (GRNGs) by implementing an n-bit LFSR for uniformly distributed random bit generation and an adder tree for counting the number of “1”s. The structure of an 8-bit Fibonacci LFSR is illustrated in Fig. 4(a). In each cycle, the values in the tap registers are combined using three XOR gates to produce one bit that updates the value in the head register (highlighted in blue). Meanwhile, the remaining values shift to their neighbor registers from left to right, and the value in the tail register is dropped (highlighted in red). Through this procedure, the LFSR creates a new random bit sequence, named a “pattern”, upon every shift. For each pattern, the number of “1”s is counted by the adder tree to form a Gaussian random variable (GRV).

4.2. Retrieving $\epsilon$ via Pattern Reproduction
Assume we employ one LFSR to generate the $\epsilon$'s for sampling all the weights during BNN training. At the Forward stage, the $\epsilon$'s are generated sequentially to sample from the first weight of the first layer to the last weight of the last layer, during which the LFSR continuously shifts from its initial pattern #1 to the latest pattern #N. At the Backward stage, we notice that the generated $\epsilon$'s are requested in a reversed order, i.e., from the latest pattern #N back to the initial pattern #1, due to two key features of the training process. At the layer level, backpropagation executes from the last layer to the first layer, so the $\epsilon$'s generated in the last layer during the Forward stage are needed first. At the kernel level, constructing the kernels that are rotated during backpropagation is equivalent to sampling the previous weights in reverse order (shown in Fig. 5 (a)). These insights motivate us to reproduce the previous LFSR patterns, also in reversed order, so that all the previous $\epsilon$'s can be retrieved locally by the LFSRs instead of being stored/fetched during the Forward/Backward stages.
Key design insight. This comes from our finding that reproducing previous LFSR patterns can be simply accomplished by shifting the current LFSR pattern in the opposite direction, combined with three XOR operations on certain registers within the LFSR, as illustrated in Fig. 4 (b). Assume an n-bit LFSR with taps $(R_a, R_b, R_c, R_n)$ is shifting right to generate the latest pattern #2 from its initial pattern #1. The value in the head register $R_1$ of pattern #2 is generated by XORing the tail tap with the other taps in order:

$R_1^{\#2} = R_n^{\#1} \oplus R_c^{\#1} \oplus R_b^{\#1} \oplus R_a^{\#1}$  (2)

where $\oplus$ denotes the XOR operation. Meanwhile, the value in the tail register $R_n$ is dropped from the LFSR. In order to reproduce pattern #1 from pattern #2, the values in $R_1, \ldots, R_{n-1}$ of pattern #1 can be obtained by left-shifting pattern #2. Now the key question is how to reproduce the value in $R_n$ of pattern #1, since it has been dropped previously. Interestingly, for the XOR operation, one can prove that $y = x \oplus z$ if $x = y \oplus z$. Thus we rewrite Eq. 2 in a reversed order:

$R_n^{\#1} = R_1^{\#2} \oplus R_c^{\#1} \oplus R_b^{\#1} \oplus R_a^{\#1}$  (3)

where $R_1^{\#2}$ is the head register of pattern #2, and $R_a$, $R_b$, $R_c$ of pattern #1 are actually $R_{a+1}$, $R_{b+1}$, $R_{c+1}$ of pattern #2. Therefore, we can simply set $R_{a+1}$, $R_{b+1}$, $R_{c+1}$ as the tap registers of pattern #2 for the retrieval of $R_n$ of pattern #1, as shown in the right part of Fig. 4(b). Furthermore, since the LFSR in pattern #2 shifts reversely, the tail register of pattern #2 should be updated by XORing these taps with the head register of pattern #2 in order. In this fashion, this feature can always be leveraged to retrieve the value in $R_n$ through Eq. 3. As can be seen, pattern #1 is successfully retrieved from pattern #2 via very simple logic operations. Fig. 4 (c) provides an example of reversing an 8-bit LFSR to retrieve the previous patterns.
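The forward and reversed shifting can be demonstrated in a few lines of Python. The register width and tap positions below are an arbitrary example (not necessarily those of Fig. 4); the reversal works for any Fibonacci LFSR whose tap set includes the tail register:

```python
N_BITS = 8
TAPS = [3, 4, 5, 7]            # 0-indexed tap positions; must include the tail (index 7)

def shift_forward(state):
    # Forward: XOR the tap registers into a new head bit, shift right, drop the tail.
    head = 0
    for i in TAPS:
        head ^= state[i]
    return [head] + state[:-1]

def shift_reverse(state):
    # Reverse: shift left, then rebuild the dropped tail bit (Eq. 3). The taps of the
    # previous pattern now sit one position to the right in the current pattern.
    tail = state[0]
    for i in TAPS:
        if i != N_BITS - 1:
            tail ^= state[i + 1]
    return state[1:] + [tail]

def grv(state):
    # The "adder tree": counting the 1s in a pattern yields one Gaussian random variable.
    return sum(state)

# Generate 20 patterns forward, then recover every one of them in reverse order.
state = [1, 0, 1, 1, 0, 0, 1, 0]
history = []
for _ in range(20):
    history.append(state)
    state = shift_forward(state)
for pattern in reversed(history):
    state = shift_reverse(state)
    assert state == pattern    # each earlier pattern is retrieved locally, never stored

print(grv(state))              # → 4: the GRV of the fully recovered initial pattern
```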
4.3. Potential Issues of Directly Applying LFSR Reversion to BNN Training
Fig. 6 depicts the details of applying our LFSR reversion strategy to a two-layer (convolutional + fully-connected (FC)) BNN training. For simplicity of discussion, we assume two LFSRs are deployed for GRV generation. During the Forward stage, for the convolutional layer, the LFSRs shift from status 1 to 6. Each status contains 9 sequential patterns that generate the GRVs for a kernel (one pattern per weight). For the FC layer, the LFSRs continue shifting from status 7 to 14. Each status contains 4 sequential patterns for a weight vector. During the Backward stage, by shifting the LFSRs reversely, all the previous statuses are retrieved in a reversed sequence that satisfies the weight fetching order requested by backpropagation. Note that for convolutional layers, the flipped (i.e., rotated) kernels x′ can be constructed by reversing the order of the original kernels x according to Fig. 5(a). And for the FC layers, since the internal weight order of each weight column of the weight matrix is not altered, the original weight matrices can all be retrieved via LFSR reversion. However, as shown in Fig.5 (b), since the kernels are reorganized across the input channel (N) dimension and output channel (M) dimension during the Backward stage, the computation flow can become inconsistent with that of the Forward stage. For example, at status 6 during the Forward stage in Fig.6, the partial sums calculated by kernels ⑨ and ⑫ are accumulated separately for the last two output channels (i.e., the blue blocks highlighted in red). When applying our LFSR reversion, kernels ⑨′ and ⑫′ will be constructed at status 6 during the Backward stage. At this time, instead of being accumulated separately, the partial sums calculated by kernels ⑨′ and ⑫′ are added together for one single output channel (i.e., the green block highlighted in red). Although our LFSR reversed shifting can still retrieve all the $\epsilon$'s, such computation inconsistency between the Forward and Backward stages may cause significant inefficiency in the training accelerator design. Furthermore, this factor complicates design choice selection due to the unclear impact our LFSR reversion strategy may have on accelerators that adopt different computation mapping schemes. Thus, it is important to first understand the accelerator design space for our ShiftBNN.

5. Design Space Exploration
As discussed in Section 2.2 (also see Fig. 1 (b)), processing a typical DNN layer during any training stage can be decomposed into a six-dimension for-loop execution. Instead of executing each dimension sequentially, state-of-the-art DNN accelerators usually select several dimensions and compute them simultaneously, during which the MACs along a certain dimension are mapped onto a group of Processing Elements (PEs) that operate in parallel. Choosing different mapping dimensions creates a significant divergence in design efficiency. Generally, there have been three major types of computation mapping strategies for DNN inference: kernel (K-dimension) mapping, e.g., systolic arrays (Farabet et al., 2009); input channel and output channel (MN-dimension) mapping, e.g., DianNao (Chen et al., 2014) and NVDLA (Nvidia, 2021a); and output feature map (RC-dimension) mapping, e.g., ShiDianNao (Du et al., 2015). Since DNN training can also perform mini-batch processing, a batch and output channel (BM-dimension) mapping method (Yang et al., 2020a) is also under consideration. To efficiently apply our design insights to BNN training, we comprehensively study the impact of our LFSR reversed shifting strategy on these four types of state-of-the-art computation mappings to explore the design space for a BNN training accelerator. Specifically, we qualitatively discuss the design possibilities of each mapping, and finally select the optimal mapping to support our proposed ShiftBNN design. In the following analysis, we use superscripts $m$ and $n$ to denote the indices of the output and input channels, and subscripts $k$, $(r, c)$ and $b$ to denote the weight location inside a kernel, the neuron/error location on an output feature map, and the index of a training example in a mini-batch, respectively.
MN-dimension mapping. Fig. 7 (a) illustrates a basic architecture for MN-dimension mapping. The x-axis of the 2D PE array (assumed square for simplicity) represents the M-dimension mapping and the y-axis represents the N-dimension mapping. As BNN training demands weight sampling, a GRNG is attached to each PE to generate the $\epsilon$'s for the weight parameters $(\mu, \sigma)$; this is the common case among all four types of mapping methods. In each cycle, an input neuron from a certain input channel broadcasts horizontally to a row of PEs, where each PE calculates the partial sums for a certain output channel. These partial sums are collected vertically by an adder tree (denoted by the yellow bar) and summed up until an output neuron is generated. In this scheme, a PE requires the kernel from its corresponding input channel and output channel to produce the partial sum of an output neuron. Therefore, during FW, the LFSR in each GRNG generates $\epsilon$'s sequentially to produce a sampled kernel, as shown in Fig. 7 (d). With the proposed LFSR reversion strategy, the flipped kernel can be reconstructed by shifting the LFSR reversely during BW. However, also during this stage, as the kernels are reorganized in the MN-dimension, the partial sums generated in PE rows should be summed up instead of being accumulated separately (Sec.4.3). This results in inconsistent computation patterns between FW and BW. To address this inconsistency in a uniform architecture design, one possible solution is to swap the Gaussian random variables, i.e., the $\epsilon$'s, between each pair of symmetric PEs and then load the corresponding weight parameters and input neurons during the BW stage, as shown in Fig. 7 (b). Nevertheless, such a design requires extra interconnections between PEs, leading to wiring overhead for the PE array, which hinders design scalability. Moreover, there must be an equal number of PEs in a row and a column due to the swapping mechanism, which further limits design flexibility. Fig. 7 (c) shows an alternative design that avoids the data communication between PEs. In this design, during BW the partial sums generated by a PE row are summed up to an output neuron with duplicated adder trees, while the partial sums generated by a PE column are accumulated separately by directly sending each of them to the output buffer. However, this method still requires an input adder tree for each row of PEs, which incurs extra resource and energy overheads.
RC-dimension mapping. Fig. 7 (e) shows the basic output feature map (RC) dimension mapping strategy, where the neurons on an output feature map are mapped to a 2D PE array and computed simultaneously. In each cycle, one weight from a kernel is broadcast to all PEs while a group of new input neurons is fed to the rightmost (or bottom) PEs. The partial sums stay in the PEs and are accumulated to generate the output neurons as the input neurons flow from right to left (or bottom to top) through the PE array. Since the weights are fetched sequentially from a kernel, the GRNG also produces the $\epsilon$'s sequentially during FW. Thus, the flipped kernels can be reconstructed by shifting the LFSR reversely during BW. Furthermore, since RC-dimension mapping is independent of M- or N-dimension parallelism, it does not suffer from the swapping issue of MN-mapping. Nevertheless, kernel reorganization still has a slight impact on RC-mapping. During the FW stage, since the kernels are fetched along the N-dimension first and then the M-dimension, the partial sum of an output neuron is accumulated inside the PE continuously until the output neuron is generated. However, during the BW stage, the kernels are fetched along the M-dimension first and then the N-dimension, so the partial sum of an output neuron is sent to the output buffer and waits to be read back and accumulated in the PE intermittently. Therefore, two types of control modes are required in RC-mapping.
K-dimension mapping. Fig. 7 (g) shows the basic kernel (K) dimension mapping method, where a kernel is mapped to a 2D PE array and stays until all the computation related to that kernel is completed. In each cycle, an input neuron is broadcast to all the PEs and multiplied with the weights inside a kernel. The partial sums are propagated and accumulated through the PEs to generate the output neurons. Under this scheme, during FW the PE array requires the kernel from the next input channel when the computation of the current kernel is finished. Hence, the GRNG generates $\epsilon$'s for weights along the N-dimension sequentially from the first to the last input channel, as shown in Fig. 7 (i). During BW, reverse shifting the LFSR can retrieve the original kernels from the last to the first input channel. However, K-dimension mapping cannot reorder the weights to construct the flipped kernels required by the BW stage, as the weights inside a kernel are sampled simultaneously. In fact, due to kernel flipping, the $\epsilon$ generated by a certain PE during FW is required by another PE during BW. Fig. 7 (h) illustrates a solution for K-dimension mapping: adding datapaths between PEs for swapping. However, similar to MN-dimension-v1 (as shown in Fig.7 (b)), this design causes wiring overhead for the PE array. Moreover, due to the kernel reorganization, K-mapping also requires two types of control modes for the different accumulation manners.
BM-dimension mapping. Fig. 7 (j) illustrates the basic batch and output channel (BM) dimension mapping strategy, where the horizontally distributed PEs process different training examples and the vertically distributed PEs calculate neurons in different output channels separately. In each cycle, a pair of weight parameters $(\mu, \sigma)$ from a certain output channel is broadcast to an entire row of PEs, while an input neuron from a certain training example is broadcast to an entire column. The output neurons are collected in each PE. As the weights inside a certain kernel are requested sequentially (shown in Fig. 7 (l)), LFSR reversion can help reconstruct the flipped kernels. However, due to the kernel reorganization, the reconstructed kernels in a column of PEs should be used for N-dimension computation instead of M-dimension computation. Specifically, at the BW stage, the partial sums generated by PE columns should be summed up instead of being accumulated separately. To address this issue, an additional multi-input adder tree is required for each PE column. Meanwhile, different input neurons from different input channels are sent to each PE column, resulting in two different input buffer designs (Fig. 7 (k)). Therefore, this architecture not only incurs large hardware overhead but also leads to high design complexity.
In conclusion, the RC-dimension mapping strategy (Fig. 7 (e)) incurs only modest design overhead under our LFSR reversion strategy compared to the other three mapping methods, which makes it the ideal fundamental computation mapping for our ShiftBNN architecture.
6. ShiftBNN Architecture Design
6.1. Architecture Overview
Figure 8 illustrates the overall architecture of our proposed ShiftBNN training accelerator, which comprises a 3D PE array distributed across 16 Sample Processing Units (SPUs), a weight parameter buffer (WPB), and a central controller. Each SPU consists of an input/output neuron buffer (NBin/NBout), 16 slices of GRNG and function units, a PE tile, a 4×4 array of shift units, and a crossbar. Following the aforementioned LFSR reversion technique and the computation mapping considerations, our accelerator presents the following features: (1) a hybrid dataflow that adopts RC-dimension mapping on the 2D PE tiles and sample-level parallelism across SPUs, both of which exploit significant opportunities for data reuse; (2) an efficient GRNG design which generates the Gaussian random variables ε sequentially during the FW stage and reproduces them in reverse order during the BW stage; (3) a function unit design that supports the necessary mathematical operations, i.e., weight sampling, derivative calculation of the prior and posterior, and weight updating during BNN training; (4) a lightweight implementation of the RC-dimension mapping architecture using a PE tile, an array of shift units and a crossbar.
6.2. SPUs and Dataflow
Since the weight parameters are shared among the sampled models, it is natural to process a batch of sampled models in parallel to increase the data reuse of weight parameters. Our design leverages this opportunity by allocating the workload of training each sampled model to an individual SPU, which operates independently of and in parallel with the other SPUs. Each SPU is further equipped with the RC-dimension mapping scheme, which maximizes the data reuse of input neurons on a 2D feature map. We describe the main features of an SPU as follows.
PE tile, shift unit and crossbar. All convolution operations are performed in the 2D PE tile during all three stages of BNN training (i.e., FW, BW and GC). For simplicity of discussion, we use the FW stage as an example to illustrate the datapath design and the computation flow. Fig. 8 (a) shows the datapath for a convolutional layer, in which a sampled weight from the GRNG & function units is broadcast to all the PEs and multiplied with the input neuron, which then shifts to the left (or up) neighbour PE in the next cycle (Fig. 7 (e)(f)). To support this type of dataflow, a dedicated PE design is implemented upon a typical inference accelerator (Du et al., 2015) that adopts RC-dimension mapping, shown in Fig. 8 (c). The right part of the PE is a shift unit. It determines which input neuron (Nin) should be received by the PE and which neuron stored in RegH/RegV should be sent out (Nout) to the other PEs. The selected input neuron and the broadcast weight then enter the computation unit, depicted at the left part of the PE, which performs basic MAC operations, ReLU functions and max pooling operations to produce the output neurons. Importantly, due to the kernel reorganization and ε reproducing technique at the BW stage (Sec. 5), our PE design supports two types of accumulation modes. (1) During the FW stage, since the kernels are fetched along the N-dimension first and then the M-dimension, the partial sum is repeatedly fetched back into the PE, depicted by the green arrow in Fig. 8 (c). (2) During the BW stage, the kernels are fetched along the M-dimension first and then the N-dimension, so the partial sum (named psum in the figure) is fetched from NBout and accumulated in the PE intermittently, depicted by the orange arrow in Fig. 8 (c). Our PE design switches between these two accumulation modes for the FW and BW stages. Furthermore, to satisfy the complex data requests from the PE tile, a crossbar is inserted between WPB, NBin, NBout and the PE tile to select the appropriate data read from the buffers. Additionally, instead of using a column buffer as in (Du et al., 2015), we employ a lightweight shift unit array which stores the candidate input neurons that the PE tile will need in the next four cycles. The array is organized spatially in the same way as the PE tile, and each shift unit is identical to the right part of a PE, performing simple data shifting operations.

Efficient GRNG design. An SPU contains 16 GRNGs, corresponding to the PEs in the tile. For a convolutional layer, since one weight is shared by every PE, only one GRNG needs to be enabled to generate one ε at a time. For an FC layer, the PEs require different sampled weights from the GRNG & function units, so all GRNGs are enabled to provide ε's to sample the weights for their corresponding PEs. The left part of Fig. 8 (b) illustrates the microarchitecture of a single GRNG, which consists of a 256-bit LFSR and an ε generator. The GRNG features two properties. Firstly, it possesses three operating modes. (1) The forward mode for the FW stage, during which the LFSR shifts from left to right: each register of the LFSR receives the value of its left neighbour, while the leftmost register gets updated by the orange taps.
(2) The backward mode for the BW stage, during which the GRNG switches to reverse operation and the LFSR shifts from right to left: each register receives the value of its right neighbour, while the rightmost register gets updated by the blue taps. (3) The idle mode, during which the registers in the LFSR hold their own values and are not updated. Secondly, since counting the number of "1"s (i.e., the sum) of an LFSR pattern with an adder tree may cause large overhead (Cai et al., 2018), the proposed ε generator uses a more efficient way to generate ε's based on the LFSR patterns. Specifically, we store the sum of the bits of the LFSR's initial seed in a register and track the difference between the old value and the updated value at the end register (the leftmost or the rightmost one, depending on the operating mode). This difference, i.e., the bit update, is added to the stored sum to form the current sum of the LFSR, which is then used to update the sum register.
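The reversible-shift behaviour described above can be sketched in software. The toy below uses an 8-bit Fibonacci LFSR (the paper's GRNG uses a 256-bit LFSR; the width and tap set here are illustrative) whose feedback taps include bit 0, which makes every forward shift invertible, and it tracks the population-count sum incrementally instead of re-summing all bits:

```python
N = 8
MASK = (1 << N) - 1
TAPS = 0b10111001  # illustrative tap set; bit 0 must be tapped for reversibility

def parity(x: int) -> int:
    return bin(x).count("1") & 1

def step_forward(state: int) -> int:
    # shift right; the feedback bit computed from the tapped positions
    # enters at the leftmost register
    fb = parity(state & TAPS)
    return (state >> 1) | (fb << (N - 1))

def step_backward(state: int) -> int:
    # undo one forward shift: bits 1..N-1 of the old state are bits 0..N-2
    # of the current state, and the old bit 0 is recovered from the tap equation
    fb = state >> (N - 1)
    high = (state << 1) & MASK & ~1
    return high | (fb ^ parity(high & TAPS))

def step_forward_with_sum(state: int, bit_sum: int):
    # only the bit entering (left) and the bit leaving (right) change the
    # popcount, so no adder tree over all registers is needed
    new = step_forward(state)
    bit_sum += (new >> (N - 1)) - (state & 1)
    return new, bit_sum
```

Summing the bits of a wide LFSR and normalising approximates a Gaussian sample by the central limit theorem, which is the basis of the ε generator.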
Function units. The function units consist of a sampler, a derivative processing unit (DPU), and a weight parameter updater. As a whole, the function units receive the weight parameters (μ, σ) from the crossbar and the ε from the GRNG, and accomplish two tasks: weight sampling and final gradient calculation of the weight parameters. During both the FW and BW stages, weight sampling is performed in the sampler, which applies the weight parameters to the Gaussian random number (i.e., w = μ + σ · ε) using a multiplier and an adder. The produced weight is sent to the PE tile and the DPU. During the BW stage, the DPU and the updater are both activated. The DPU uses the received reconstructed weight to calculate the derivative of the sum of the prior and posterior terms with respect to the weight. By decomposing the prior and posterior terms into log form, this derivative admits a simple closed form. Since the standard deviation of the prior distribution is a constant, usually chosen as 0.5, the division by its square can be computed by left shifting 2 bits. The resulting derivative is then added to the gradient of the likelihood computed in the GC stage to obtain the final gradient with respect to the weight. Lastly, in order to update the weight parameters, the updater calculates the gradients of μ and σ using this final gradient and ε, which corresponds to process 3⃝ in Fig. 1 (a). The produced gradients are further averaged across the different SPUs and then used to update the weight parameters.
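As a concrete (and deliberately simplified) sketch of the math the function units implement, the snippet below samples a weight via the reparameterisation w = μ + σ · ε and forms the derivative of log q − log P for a Gaussian posterior and a zero-mean Gaussian prior. The symbol names and the σ_p = 0.5 choice follow the discussion above, but this is a floating-point illustration, not the fixed-point hardware:

```python
def sample_weight(mu: float, sigma: float, eps: float) -> float:
    # reparameterisation trick: w = mu + sigma * eps
    return mu + sigma * eps

def d_logq_minus_logP_dw(w: float, mu: float, sigma: float,
                         sigma_p: float = 0.5) -> float:
    # d/dw [log q(w) - log P(w)] for posterior q = N(mu, sigma^2) and
    # prior P = N(0, sigma_p^2); with sigma_p = 0.5 the prior term
    # w / sigma_p**2 equals 4*w, i.e. a 2-bit left shift in fixed point
    return -(w - mu) / sigma**2 + w / sigma_p**2

def param_grads(df_dw: float, eps: float):
    # chain rule through w = mu + sigma * eps: dw/dmu = 1, dw/dsigma = eps
    return df_dw, df_dw * eps
```

For an integer weight in fixed point, `w << 2` realises the division by σ_p² = 0.25 without a divider.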
Buffer design. To support the dataflow of RC-mapping in an SPU, we follow a design principle similar to (Du et al., 2015) to organize the data in the neuron buffers, i.e., NBin/NBout. NBin/NBout comprises multiple banks; each bank provides the neurons requested by a PE row through the crossbar. The weight parameter buffer (WPB) is split into two sub-buffers that store μ and σ separately. Each sub-buffer also consists of multiple banks, and each entry of a bank stores the weights for a PE row. For a convolutional layer, one weight parameter is selected by the crossbar in each cycle, while for an FC layer the entire entry read from the bank is sent to a PE row. Note that although the convolution operands (e.g., weight, neuron, error, and gradient) vary across the three BNN training stages, our uniform organization of data in WPB, NBin and NBout is beneficial for buffer function swapping. For example, during the BW stage, the error feature maps of a layer stored in NBout can serve as the weights for that layer's gradient calculation by temporarily treating NBout as WPB.
7. Evaluation
7.1. Experimental Methodology
BNN models and training datasets. We evaluate ShiftBNN by training five representative BNN models. Among them, BMLP (Cai et al., 2018) (a fully-connected BNN with 3 hidden layers) is trained on MNIST (Deng, 2012), and BLeNet (built on LeNet (LeCun and others, 2015)) is trained on CIFAR-10 (Krizhevsky et al., 2009). These two networks are mostly adopted to handle small but safety-critical tasks. BAlexNet (built on AlexNet (Krizhevsky et al., 2012)), BVGG (built on VGG-16 (Simonyan and Zisserman, 2014b)) and BResNet (built on ResNet-18 (He et al., 2016)) are trained on the ImageNet dataset (Deng et al., 2009) and are used to deal with more complex tasks in unfamiliar environments. For generality, the BNN models are trained with various numbers of samples, e.g., 8, 16, 32, 64, and 128 (if needed).

Comparison cases. To demonstrate the effectiveness of ShiftBNN, we compare it with three training accelerators. Firstly, since ShiftBNN adopts RC-mapping as its fundamental design strategy, we compare it with an RC-accelerator that adopts the RC-mapping strategy but without the LFSR reversion technique. Secondly, since MN-mapping is commonly used in existing DNN training accelerators (Mahmoud et al., 2020; Zhang et al., 2019), we employ an MN-accelerator that adopts the MN-mapping strategy without the LFSR reversion technique as the baseline accelerator for generality; it is also used for our preliminary investigation in Sec. 3. Thirdly, to verify the analysis of design alternatives (see Sec. 5), we further test the effectiveness of our LFSR reversion strategy on the MN-accelerator by comparing with an MN-Shift-accelerator that adopts both the MN-mapping strategy and the LFSR reversion technique. To overcome the challenges our LFSR reversion poses to the MN-mapping scheme, we follow the design principle in Fig. 7 (c). For fair comparison, all accelerators employ 16 PE tiles and are allocated on-chip buffers of the same size. The 16 PE tiles process 16 sampled models simultaneously for the same extent of weight parameter reuse. We also evaluate the energy efficiency (performance/power) of ShiftBNN against a modern GPU, i.e., the Nvidia Tesla P100. We use PyTorch (Paszke et al., 2019) to implement and train the BNNs from scratch, and the training hyperparameters (e.g., batch size, epochs, etc.) are kept the same as in the other comparison cases. The execution latency and energy consumption are extracted from the GPU runtime information obtained by the Nvidia Profiler (Nvidia, 2021b).

Experimental setup. All accelerator designs are implemented in Verilog RTL and synthesized on a Xilinx Virtex-7 VC709 FPGA evaluation board. For off-chip memory access, the accelerators communicate with two sets of DDR3 DRAM that provide sufficient data transfer rate to the PE tiles via a Memory Interface Generator (Xilinx, 2019). The execution time results are obtained from the post-synthesis design, and the energy consumption is further evaluated with the Xilinx Power Estimator (XPE) (Xilinx, 2020). The data precision for all architectures is set to 16-bit and the operating frequency is set to 200 MHz.
Training quality. Figure 9 compares the training curves of BLeNet when using the vanilla BNN training algorithm on PyTorch (baseline) and ShiftBNN. The training hyperparameters and data type are kept the same in the baseline and ShiftBNN. It can be seen that ShiftBNN affects neither the number of training iterations to convergence nor the final accuracy. Similar behavior is observed on the other networks. This is because our LFSR reversion strategy fundamentally does not modify the training algorithm; it simply retrieves all the ε's accurately during the entire training process. Hence, we only evaluate and validate the training quality of ShiftBNN under different bit lengths. Table 1 shows how different bit lengths affect the validation accuracy of the five BNN models. The accuracy results are obtained after the same number of training epochs for a given network. As can be seen, training with 16-bit precision only brings an average 0.31% accuracy drop compared with single-precision training. This negligible loss may be due to the error-tolerant nature of the sampling process during BNN training. While ShiftBNN can employ 32-bit floating point arithmetic to achieve lossless training, lower-precision training is more attractive since it potentially consumes much less energy.
Table 1. Validation accuracy of the five BNN models under different training bit lengths.

Network        BMLP     BLeNet    BAlexNet  BVGG      BResNet
Dataset        MNIST    CIFAR-10  ImageNet  ImageNet  ImageNet
Val acc (8b)   95.67%   62.80%    NaN*      45.50%    NaN*
Val acc (16b)  98.05%   65.62%    59.95%    67.52%    68.12%
Val acc (32b)  98.11%   65.81%    60.10%    67.76%    69.03%

*The network hardly converges under low-precision 8-bit BNN training.
7.2. Evaluation Results
Effectiveness on energy and performance. Fig. 10 illustrates the energy consumption of ShiftBNN compared against the other accelerators. As it shows, the ShiftBNN accelerator achieves on average 62% (up to 76%), 70% (up to 82%), and 39% (up to 44%) energy consumption reduction compared with the RC-accelerator (RC-Acc), MN-accelerator (MN-Acc), and MN-Shift-accelerator (MN-Shift-Acc), respectively. The outstanding energy reduction of ShiftBNN stems from the elimination of the ε DRAM accesses by our LFSR reversion strategy. MN-Shift-Acc reduces energy consumption by 53% on average compared with MN-Acc, which is less than the reduction of ShiftBNN over RC-Acc (i.e., 62%). This implies that our LFSR reversion technique is also effective on the MN-accelerator but reaps less energy saving than when applied to the RC-accelerator. As discussed in Section 6, this is caused by the large design overhead, e.g., duplicated adder trees, when applying the LFSR reversion strategy to the MN-mapping scheme. We further observe that ShiftBNN achieves 68% and 70% energy consumption reduction over RC-Acc on the BMLP and BLeNet models, respectively. These numbers are larger than those of the other BNN models. This is because ε takes a larger portion of the total off-chip data transfer, and off-chip memory access consumes a larger portion of the total training energy, for BMLP and BLeNet.
Since ShiftBNN mainly targets reducing data transfer during training, it is interesting to see whether the data transfer reduction can be converted into performance improvement. Fig. 11 shows the speedup of ShiftBNN over the other accelerators. From the figure, we observe that the ShiftBNN accelerator achieves an average 1.6× (up to 2.8×) speedup over RC-Acc. We found that the reduced execution time mainly comes from the removal of all memory accesses of ε in FC layers. The memory access of ε and the computation of a certain layer can be done in parallel by using double buffering. Thus, in the computation-dominated convolutional layers, removing the memory accesses of ε may not reduce the latency. However, in the parameter-dominated FC layers, the memory access time (including storing ε in FW and fetching it in BW) of S samples of ε significantly exceeds the computation time, since the number of MACs in FC layers is much smaller than that in convolutional layers. For example, the memory access time of ε is 8× the computation time in the 1st layer of BMLP-8. Accordingly, there is an obvious variance in the performance improvement across different BNN models. For instance, for the fully-connected BMLP models, ShiftBNN gains the maximum speedup of 2.6× on average, while for the convolution-dominated BVGG and BResNet models, ShiftBNN achieves an average 1.18× performance improvement.

Fig. 12 shows the energy efficiency of the ShiftBNN accelerator compared with the other designs. The energy efficiency is defined as throughput per watt (GOPS/Watt). ShiftBNN boosts the energy efficiency by 4.9× (up to 10.8×), 10.3× (up to 26.1×) and 2.5× (up to 4.6×) on average compared with RC-Acc, MN-Acc and MN-Shift-Acc, respectively. The highest energy efficiency achieved by ShiftBNN is observed on the BMLP-32 model, which enjoys significant reductions in both energy consumption and latency. Furthermore, ShiftBNN also yields on average 4.7× higher energy efficiency than the Tesla P100. We observe that the GPU outperforms the baseline accelerator when training deeper BNNs with larger sample sizes because of its highly parallel computing and sufficient memory bandwidth. However, it is still beaten by the variants equipped with our techniques; e.g., even MN-Shift-Acc outperforms the GPU by 1.9× in energy efficiency. This is because the off-chip memory accesses of a large amount of GRVs cannot be avoided when training BNNs on GPUs either.
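The double-buffering argument above can be captured by a tiny latency model: when memory transfers overlap computation, a layer's latency is the maximum of the two times, so removing ε traffic only helps layers whose memory time dominates. The numbers below are made up for illustration; they are not the paper's measurements:

```python
def layer_latency(macs: float, bytes_moved: float,
                  peak_macs_per_s: float, bw_bytes_per_s: float) -> float:
    # with double buffering, DRAM transfer overlaps computation,
    # so the slower of the two sets the pace for the layer
    return max(macs / peak_macs_per_s, bytes_moved / bw_bytes_per_s)

# hypothetical conv layer: compute-bound, so halving its traffic
# does not change latency
conv = layer_latency(macs=1e9, bytes_moved=1e6,
                     peak_macs_per_s=1e9, bw_bytes_per_s=1e9)
conv_less_traffic = layer_latency(1e9, 5e5, 1e9, 1e9)

# hypothetical FC layer: memory-bound, so the same traffic cut
# halves its latency
fc = layer_latency(macs=1e6, bytes_moved=1e8,
                   peak_macs_per_s=1e9, bw_bytes_per_s=1e9)
fc_less_traffic = layer_latency(1e6, 5e7, 1e9, 1e9)
```

This is consistent with the observation that the speedup concentrates in the parameter-dominated FC layers.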
Reduction of DRAM accesses and memory footprint.
Fig. 14 shows the number of DRAM accesses and the memory footprint breakdown of the four accelerators when training the BNN models with 16 samples. For the DRAM accesses, we observe that MN-Acc and RC-Acc always require far more DRAM accesses than ShiftBNN and MN-Shift-Acc across the BNN models. For example, the number of DRAM accesses in MN-Acc (RC-Acc) is 5.7× (5.8×) larger than that in MN-Shift-Acc (ShiftBNN) on the ε-dominated BLeNet-16 model. Even in the wider and deeper models (e.g., BVGG-16 and BResNet-16), where the weight parameters and intermediate feature maps occupy a considerable portion of the total data transfer, ShiftBNN still gains an average 2.6× reduction in DRAM accesses.
The significant reduction of DRAM accesses is the major source of ShiftBNN's high energy efficiency. As various lower-precision training techniques (Banner et al., 2018; Yang et al., 2020c; Fu et al., 2020), e.g., 8-bit integer training, have been proposed recently, the cost of MACs could become much lower. Thus, the memory saving techniques of ShiftBNN could bring even more benefit once these techniques are extended to BNN models.
Furthermore, as the figure shows, both ShiftBNN and MN-Shift-Acc reduce the memory footprint during training by 76.1% on average compared with the accelerators without the LFSR reversion technique. From the figure, we can also observe that the memory footprint taken by the Gaussian variables is completely eliminated by MN-Shift-Acc and ShiftBNN.
Scalability to larger sample sizes. In some high-risk applications, one may need a more robust BNN model to make decisions, which requires training BNNs with a larger sample size to more strictly approximate the loss function in Eq. 1. We evaluate three BNN models, BMLP, BLeNet and BVGG, by training them with different numbers of samples and report the corresponding energy consumption reduction and energy efficiency under the different hardware designs. As can be seen, for all three models, the energy reduction achieved by both MN-Shift-Acc and ShiftBNN increases as the sample size grows. For example, the energy savings increase from 55.5% to 78.8% as the sample size grows from 4 to 128 for BLeNet. This outstanding scalability of our LFSR reversion technique stems from the increasing ratio of ε in the total off-chip memory accesses as more samples are used. We observe a similar increase in energy efficiency for MN-Shift-Acc and ShiftBNN as the number of training samples increases. Lastly, compared with MN-Shift-Acc, ShiftBNN achieves higher energy efficiency across the various sample sizes.
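The growing ε share can be seen with a back-of-the-envelope traffic model (the formula and names are illustrative, not the paper's accounting): each of the S sampled models needs one GRV per weight, while μ and σ are shared by all samples, so the ε fraction of off-chip traffic rises monotonically with S:

```python
def eps_traffic_fraction(num_weights: int, act_bytes: int,
                         num_samples: int, bytes_per_val: int = 2) -> float:
    # one GRV per weight per sampled model
    eps_bytes = num_samples * num_weights * bytes_per_val
    # mu and sigma are fetched once and shared by every sample
    param_bytes = 2 * num_weights * bytes_per_val
    # activation/error traffic scales with the number of samples
    total = eps_bytes + param_bytes + num_samples * act_bytes
    return eps_bytes / total
```

For any fixed model, this fraction, and hence the benefit of eliminating ε traffic, grows with the sample size S.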
Table 2. Resource usage and average power of the hardware modules in one SPU.

Resource   PE tile  Shift array  Function units  GRNGs  NBin/NBout
LUT        966      222          785             2277   0
FF         469      464          399             4224   0
DSP        16       0            32              0      0
BRAM       0        0            0               0      48
Power (W)  0.076    0.016        0.008           0.005  0.112
Resource usage and power. Table 2 lists the resource usage and average power of the different hardware modules in one SPU. As can be seen, the shift unit array and the function units consume fewer LUT and FF resources than the PE tile. Although the function units require more DSPs due to the sampling, derivative calculation and updating processes, their average power dissipation is much smaller than that of the PEs, since only 1 of the 16 function units is activated during convolutional layers. A similar effect can be observed for the GRNGs, whose average power is only 0.005 W, albeit occupying more LUT and FF resources than the other modules.
8. Related Work
Accelerators for BNNs. There is an increasing demand for specialized BNN accelerators. VIBNN (Cai et al., 2018) optimizes the hardware design of GRNGs and proposes an FPGA-based implementation for BNN inference. Fast-BCNN (Wan and Fu, 2020) targets accelerating BNN inference via a neuron skipping technique. (Yang et al., 2020b) proposes a BNN inference accelerator leveraging post-CMOS technology. Different from the above efforts, our work proposes a highly efficient BNN accelerator that focuses on optimizing the training procedure.
DNN training optimization has been extensively studied (Song et al., 2019; Qin et al., 2020; Zhang et al., 2019; Yang et al., 2020a; Mahmoud et al., 2020). For example, Eager Pruning (Zhang et al., 2019) and Procrustes (Yang et al., 2020a) exploit weight sparsity during the training stage by leveraging aggressive pruning algorithms and develop customized hardware to improve performance. Procrustes also employs LFSR-based GRNGs, but for the purpose of weight initialization and decay. Since our work reveals the key challenge in BNN training and mainly focuses on reducing the data transfer of ε, which is irrelevant to sparsity, the above works are orthogonal to ours.
Reducing DRAM energy consumption. Many works focus on addressing costly DRAM accesses during the DNN inference or training process. EDEN (Koppula et al., 2019) leverages approximate DRAM techniques to reduce energy and latency while strictly meeting the target accuracy. ShapeShifter (Lascorz et al., 2019) explores opportunities to shorten the transferred data width during DNN inference. These works are orthogonal to ours, since we explore a unique feature of BNN training and eliminate intensive data transfer without accuracy loss from a different perspective.
9. Conclusion
In this paper, we reveal that the massive data movement of GRVs is the key bottleneck that causes BNN training inefficiency. We propose an innovative method that eliminates all the off-chip memory accesses related to the GRVs without affecting the training accuracy. We further explore the hardware design space and propose a low-cost and scalable BNN accelerator to conduct highly efficient BNN training. Our experimental results show that our design achieves on average a 4.9× (up to 10.8×) boost in energy efficiency and a 1.6× (up to 2.8×) speedup compared with the baseline accelerator.
Acknowledgements. This research is partially supported by NSF grants CCF-2130688, CCF-1900904 and CNS-2107057, University of Sydney faculty startup funding, and Australian Research Council (ARC) Discovery Project DP210101984.

References
Spatial uncertainty sampling for end-to-end control. arXiv preprint arXiv:1805.04829.
Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
An FPGA based processor yields a real time high fidelity radar environment simulator. In Military and Aerospace Applications of Programmable Devices and Technologies Conference, pp. 220–224.
Scalable methods for 8-bit training of neural networks. arXiv preprint arXiv:1805.11046.
Variational inference: a review for statisticians. Journal of the American Statistical Association 112 (518), pp. 859–877.
Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622.
End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316.
An almost everywhere central limit theorem. In Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 104, pp. 561–574.
VIBNN: hardware acceleration of Bayesian neural networks. ACM SIGPLAN Notices 53 (2), pp. 476–488.
DianNao: a small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Computer Architecture News 42 (1), pp. 269–284.
Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks. ACM SIGARCH Computer Architecture News 44 (3), pp. 367–379.
P-CNN: pose-based CNN features for action recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3218–3226.
Bayesian uncertainty analysis with applications to turbulence modeling. Reliability Engineering & System Safety 96 (9), pp. 1137–1149.
Pseudorandom Gaussian distribution through optimised LFSR permutations. Electronics Letters 51 (25), pp. 2098–2100.
Power, programmability, and granularity: the challenges of exascale computing. In 2011 IEEE International Test Conference, pp. 12–12.
Distributed deep learning using synchronous stochastic gradient descent. arXiv preprint arXiv:1602.06709.
ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29 (6), pp. 141–142.
ShiDianNao: shifting vision processing closer to the sensor. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pp. 92–104.
Bayesian optimization research. External link.
CNP: an FPGA-based processor for convolutional networks. In 2009 International Conference on Field Programmable Logic and Applications, pp. 32–37.
A deep CNN based multiclass classification of Alzheimer's disease using MRI. In 2017 IEEE International Conference on Imaging Systems and Techniques (IST), pp. 1–6.
FracTrain: fractionally squeezing bit savings both temporally and spatially for efficient DNN training. arXiv preprint arXiv:2012.13113.
Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059.
A survey of methods for low-power deep learning and computer vision. In 2020 IEEE 6th World Forum on Internet of Things (WF-IoT), pp. 1–6.
Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348–2356.
Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Stochastic variational inference. Journal of Machine Learning Research 14 (5).
1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
FPGA implementation of Gaussian-distributed pseudo-random number generator. In 6th International Conference on Digital Content, Multimedia Technology and its Applications, pp. 11–13.
Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680.
EDEN: enabling energy-efficient, high-performance deep neural network inference using approximate DRAM. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 166–181.
Learning multiple layers of features from tiny images.
ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105.
ShapeShifter: enabling fine-grain data width adaptation in deep learning. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 28–41.
LeNet-5, convolutional neural networks. URL: http://yann.lecun.com/exdb/lenet.
Leveraging uncertainty information from deep neural networks for disease detection. Scientific Reports 7 (1), pp. 1–14.
TensorDash: exploiting sparsity to accelerate deep neural network training. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 781–795.
Tesla crash preliminary evaluation report. Technical report, U.S. Department of Transportation, National Highway Traffic Safety Administration.
Nvidia deep learning accelerator. External link.
Nvidia visual profiler. External link.
PyTorch: an imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703.
SIGMA: a sparse and irregular GEMM accelerator with flexible interconnects for DNN training. In 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 58–70.
vDNN: virtualized deep neural networks for scalable, memory-efficient neural network design. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13.
A comprehensive guide to Bayesian convolutional neural network with variational inference. arXiv preprint arXiv:1901.02731.
Two-stream convolutional networks for action recognition in videos. arXiv preprint arXiv:1406.2199.
Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
HyPar: towards hybrid parallelism for deep learning accelerator array. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 56–68.
Deep neural networks for object detection.
Tesla car crash report 2020. External link.
ScaleDeep: a scalable compute architecture for learning and evaluating deep networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 13–26.
Fast-BCNN: massive neuron skipping in Bayesian convolutional neural networks. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 229–240.
SuperNeurons: dynamic GPU memory management for training deep neural networks. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 41–53.
Training deep neural networks with 8-bit floating point numbers. arXiv preprint arXiv:1812.08011.
E2-Train: training state-of-the-art CNNs with over 80% energy savings. arXiv preprint arXiv:1910.13349.
On machine learning and structure for mobile robots. arXiv preprint arXiv:1806.06003.
Xilinx memory interface generator. External link.
Xilinx power estimator. External link.
Procrustes: a dataflow and accelerator for sparse deep neural network training. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 711–724.
All-spin Bayesian neural networks. IEEE Transactions on Electron Devices 67 (3), pp. 1340–1347.
Training high-performance and large-scale deep neural networks with full 8-bit integers. Neural Networks 125, pp. 70–82.
Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 2008–2026.
Eager pruning: algorithm and architecture support for fast training of deep neural networks. In 2019 ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA), pp. 292–303.
Echo: compiler-based GPU memory footprint reduction for LSTM RNN training. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 1089–1102.