I Introduction
Custom floating point (FP) formats, such as Google’s bfloat16 [bfloat] and Nvidia’s TensorFloat-32 [tf32], are increasingly replacing IEEE 754 32-bit floating point (FP32) for DNN training. These formats more efficiently fit the empirical distribution of DNN weight, data, and gradient values, leading to a smaller hardware footprint for the multiplier-accumulator (MAC) unit. However, these formats are still significantly more expensive to implement than fixed point (INT) formats of similar bitwidths due to the mantissa alignment required for each floating point MAC operation.
By comparison, Block Floating Point (BFP) [Wilkinson_64] formats offer a middle ground between FP and INT formats, by enforcing that a group of values share a common exponent while maintaining individual mantissas. This constraint enables BFP to achieve higher efficiency than FP for dot product (DP) computations for three reasons. First, mantissa alignments are only required on input values in each BFP group, as opposed to after each FP multiplication. Second, there is only one exponent addition per group, as opposed to one per FP multiplication. Third, the partial products within a group are already aligned, so they can be accumulated in fixed point without per-product FP normalization. Therefore, performing DP computations in BFP can lead to a significant improvement in training efficiency.
In this work, we propose a Fast First, Accurate Second Training (FAST) system for variable precision BFP DNN training. Here, variable precision means that (1) the system efficiently supports BFP formats across a range of mantissa widths during training and (2) dot products between BFPs with different mantissa widths are permitted. To support our system, we have designed a FAST multiplier-accumulator (fMAC) which operates on n-bit chunks of mantissas across two groups of BFP numbers being multiplied. Throughout this paper, we use 2-bit chunks. Subdividing the computation into 2-bit chunks allows the same fMAC to implement arithmetic operations involving higher-precision mantissas by simply running multiple passes of the fMAC. For instance, multiplying two groups with 2-bit and 4-bit BFP mantissas translates to (2/2) × (4/2) = 2 passes. The rate at which our FAST system performs dot product computations is based on the BFP precision of the two vectors being multiplied.
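The pass count follows directly from the chunk decomposition: one pass per pair of 2-bit chunks across the two operands. A minimal sketch (the function name is ours, not part of the FAST hardware):

```python
def fmac_passes(m_x: int, m_y: int, chunk_bits: int = 2) -> int:
    """Passes needed to multiply two BFP groups whose mantissas are
    m_x and m_y bits wide, processed chunk_bits at a time."""
    assert m_x % chunk_bits == 0 and m_y % chunk_bits == 0
    return (m_x // chunk_bits) * (m_y // chunk_bits)
```

For example, `fmac_passes(2, 4)` gives the 2 passes described above, while `fmac_passes(4, 4)` gives 4, so DP throughput scales inversely with the product of the operands' chunk counts.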
With FAST, we propose a DNN training regime that starts with low-precision BFP and increases the precision of weights, data, and gradients over the course of training. Figure 1 presents an overview of how FAST can accelerate training via low-precision operations. The left side of the figure provides a sketch of the FAST system, which supports FP32 to BFP conversion for a range of mantissa bitwidths and stochastic rounding for gradients (to maintain training stability under low-precision BFP). The FAST compute engine consists of a systolic array [kung1982systolic]
of fMAC units, which efficiently supports BFP dot products with varying mantissa bitwidths. The right side of the figure shows how the precision of the mantissa field in BFP for a DNN increases across DNN layers and over training iterations. While each DNN layer in the figure is presented with a single precision, in practice the precision of the weight, data, and gradient tensors in each layer is selected independently for a given iteration (see Figure 21 for how this precision selection works). Using this approach, FAST is able to accelerate training while preserving model accuracy on (1) CNNs for ImageNet
[deng2009imagenet], (2) Transformers for the IWSLT14 German-English benchmark [iwsltbenchmark], and (3) YOLOv2 [redmon2017yolo9000] for the PASCAL VOC2012 [everingham2011pascal] dataset.

In FAST, we use (1) BFP for variable precision training and (2) BFP with stochastic rounding. BFP is an old idea dating back as early as the 1960s (see, e.g., [Wilkinson_64]), and there has been recent literature demonstrating the advantages of using BFP in DNN inference [rouhani20bfp] and training [drumondbfp18]. We believe that our ideas of (1) and (2) are novel. For (1), we point out in Section VII the convenience of implementing variable precision hardware for BFP. For (2), we note in Section III-C that using stochastic rounding in conjunction with BFP is critical to model accuracy, especially when using BFP for gradients with low-precision mantissas (e.g., 2 or 4 bits). In Section III-D we provide an analysis of the reasons for using stochastic rounding in BFP. The main contributions of the paper are:

The FAST variable precision training algorithm for efficient DNN training. The proposed solution reduces total training time by adaptively selecting the optimal precision for weights, data, and gradients in every DNN layer at each iteration.

A modular architecture consisting of FAST multiplier-accumulator (fMAC) units for groups of BFP values. The fMAC operates on chunks of BFP mantissas (e.g., 2-bit chunks) to support variable-width mantissas in 2-bit increments.
Table I lists the terminology and notation used in the paper.
II Background and Related Work
In Section II-A, we provide an overview of number formats for DNN training and inference. Section II-B provides an overview of the computation dataflow of DNN training. Finally, in Section II-C, we review related work on hardware accelerators for DNN training.
II-A Number Formats for DNNs
As the majority of the computation in both DNN training and inference is dot products, the formats of their underlying numbers have been extensively studied to make this computation efficient. Figure 2 divides various formats into three groups: fixed point (top), floating point (middle), and BFP (bottom). The number of exponent bits (e) and mantissa bits (m) is provided for each format.
Fixed point formats do not have an exponent field, which reduces the dynamic range that can be represented but simplifies the hardware. The use of fixed point formats for DNN training and inference has been well explored [courbariaux2014training, courbariaux2015binaryconnect, gupta2015deep, zhu2016trained, hubara2017quantized, park2017weighted, kapur2017low, banner2018scalable, jacob2018quantization, bilaniuk2019bit, kung2019packing]. The smallest format is a 1-bit binary representation (upper left of Figure 2) used by binarized neural networks, which has no exponent or mantissa bits. Floating point formats have a larger dynamic range than fixed point, making them more amenable to the wider dynamic range of gradients in DNN training
[chmiel2020neural]. IEEE 754 32-bit floating point, or FP32 (middle left of Figure 2), is the conventional format for DNN training. Mixed precision training operates on some tensors in a higher precision and other tensors in a lower precision. For instance, Nvidia Mixed Precision (MP) [micikevicius2018mixed] proposed to perform most computations in the forward pass in FP16, while keeping an FP32 copy of the weights for updating during training. Custom floating point formats like bfloat16 [bfloat], TensorFloat [tf32], and HFP8 [sun2019hybrid] (middle right of Figure 2) operate in a similar mixed precision regime and have been shown to work as well as FP32 for training accurate DNNs. For example, HFP8 performs forward pass computations using 8-bit FP with a 1-bit sign, 4-bit exponent, and 3-bit mantissa (1-4-3) and backward pass computations with a 1-bit sign, 5-bit exponent, and 2-bit mantissa (1-5-2).

While BFP represents a promising direction for improving the efficiency of DNN training as a middle ground between fixed and floating point, there has been only a small amount of prior work on designing efficient hardware to support it. Flexpoint [koster2017flexpoint] proposed a BFP format with a 16-bit mantissa (m = 16) and a 5-bit shared exponent across an entire tensor. Our paper focuses on how to adjust the BFP mantissa bitwidth adaptively (e.g., m = 2 or 4) during training to reduce training time and power consumption. Drumond et al. [drumondbfp18] proposed to use a large BFP group size of 576 (a 2D tile of size 24×24), which requires a wide mantissa bitwidth of m = 12 to achieve good accuracy. Compared to these prior approaches, we show in Section VI-B that training under INT with a similar number of bits (e.g., 12 bits) also has good performance, which suggests that BFP has little advantage over INT12 for such large tiles. Additionally, they did not provide a detailed hardware design for the implementation of BFP computation.
More recently, Microsoft proposed a BFP format (MSFP12 in the lower right of Figure 2) for DNN inference via post-training quantization [rouhani20bfp] on their Project Brainwave FPGA cloud platform [fowers2018brainwave]. In our paper, by using BFP-aware DNN training instead of post-training quantization, we are able to use a smaller exponent width of 4 instead of 8 while achieving similar inference accuracy. For training, as we show in Figure 22, via variable precision BFP, FAST reduces the training time and power consumption compared to the previously mentioned training based on floating point or BFP formats.
II-B Matrix Computation of DNN Training
Each iteration of DNN training on a minibatch consists of a forward pass to compute a loss and a backward pass to update the DNN weights with gradients computed from the loss. Figure 3 illustrates all of the matrix computations required for both forward and backward passes for one convolutional layer in a CNN (fully connected layers operate in a similar manner). Both the convolutional view and the corresponding matrix operation view are presented. During the forward pass, the input activations are convolved with the layer weights to compute the output, as depicted in Figure 3a. The output is then passed through normalization and a nonlinear activation function to become the input activations for the next layer.
During the backward pass, two convolutions are performed at each layer. Figure 3b shows how the output gradients are convolved with the transposed weights to compute the activation gradients, which are passed to the layer below. In Figure 3c, the transposed input activations are convolved with the output gradients in order to compute the weight gradients. These weight gradients are then added in an element-wise fashion to the weights in order to compute the updated weights.
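For the fully connected analogue, which the text notes behaves in a similar manner, the three matrix operations reduce to plain matrix multiplies. A NumPy sketch (shapes and the learning rate here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X  = rng.standard_normal((8, 16))     # input activations (batch x in_features)
W  = rng.standard_normal((16, 4))     # layer weights
dY = rng.standard_normal((8, 4))      # output gradients from the layer above

Y  = X @ W             # forward pass (Figure 3a)
dX = dY @ W.T          # activation gradients via transposed weights (Figure 3b)
dW = X.T @ dY          # weight gradients via transposed activations (Figure 3c)
W_new = W - 0.01 * dW  # element-wise weight update
```

Note that both backward products involve a transposed operand, which is exactly the pattern the systolic array of Section V-A is designed to handle without explicit transposition.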
II-C Accelerators for DNN Training
Previous work on accelerating DNN training has focused on leveraging the sparsity present in weights and activations [mahmoud2020tensordash, zhang2019eager, yang2020procrustes, choi2020energy]. TensorDash [mahmoud2020tensordash] accelerates the DNN training process while achieving higher energy efficiency by eliminating the ineffectual operations resulting from sparse input data. Eager Pruning [zhang2019eager] and Procrustes [yang2020procrustes] improve DNN training efficiency by co-designing the training algorithm with the target hardware platform (“hardware-aware training”). Insignificant DNN weights are pruned in the middle of the DNN training process, and ineffectual operations involving zero weights can be eliminated without impacting the final accuracy. In comparison, our approach applies BFP to dynamically adjust the precision of DNN training, which is orthogonal to these methods that exploit value-level sparsity.
Multi-precision methods of reducing the computation of DNNs have been explored in the literature. Stripes [judd2016stripes] multiplies two 16-bit integers by only adding those shifted multiplicands corresponding to the nonzero bits during DNN inference. In this work, we use short 2-bit chunks of mantissas, making a straightforward single-clock bit-parallel implementation efficient compared to a bit-serial approach for DNN training. Lee et al. [lee20197] proposed using fine-grained mixed precision (FGMP) of FP8-FP16, which represents some parts of a tensor in FP8 and other parts in FP16. FAST uses both 2-bit and 4-bit mantissas, and iterates like FGMP on the low-bitwidth hardware to perform the high-bitwidth arithmetic. FAST performs an integer MAC for each partial product in a BFP group, rather than an FP MAC as in FGMP.
III Overview of BFP with Stochastic Rounding
In this section, we provide an overview of how we use BFP under stochastic rounding (SR) in FAST to facilitate efficient and accurate DNN training.
III-A Quantization from FP to BFP under Stochastic Rounding
Figure 4 shows this quantization process for converting a group of three FP values. First, in Figure 4a, the largest exponent in the group is found, which becomes the shared exponent for the group. Then, in Figure 4b, the mantissa of each value is aligned based on the difference between the exponent of that value and the max exponent. Next, in Figure 4c, stochastic noise is added for gradients (critically important for low-bitwidth mantissas). Finally, in Figure 4d, the low-order mantissa bits are truncated to a specified mantissa bitwidth.
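These four steps can be written as a short behavioral model. This is a sketch of the conversion, not the hardware: mantissas are kept as signed integers here (the hardware stores sign bits separately), zero inputs are ignored for brevity, and `floor` stands in for hardware truncation:

```python
import math
import random

def fp_to_bfp(values, m=4, stochastic=False, rng=None):
    """Convert a group of FP values to BFP with an m-bit mantissa."""
    shared = max(math.frexp(v)[1] for v in values)  # (a) shared exponent = max exponent
    scale = 2.0 ** (shared - m)                     # weight of one mantissa LSB
    mants = []
    for v in values:
        x = v / scale                               # (b) align mantissa to shared exponent
        if stochastic:
            x += (rng or random).random()           # (c) add uniform noise in [0, 1)
        mants.append(math.floor(x))                 # (d) truncate low-order bits
    return shared, mants, scale

def bfp_to_fp(mants, scale):
    """Recover the (quantized) FP values represented by a BFP group."""
    return [mt * scale for mt in mants]
```

For instance, the group [1.5, 0.25, -0.375] with m = 4 shares exponent 1 and is represented exactly by the integer mantissas [12, 2, -3] scaled by 2^(1-4).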
III-B Dot Product under FP, INT, and BFP
In this section, we discuss the conversion and computation costs of dot product (DP) under three number formats (BFP, INT, and FP). We argue that BFP DP is less costly than FP DP due to its use of shared exponents, and BFP DP is less costly than INT DP because the former can achieve the same accuracy as the latter with a smaller mantissa bitwidth.
Consider the DP of two vectors of length g (e.g., an activation vector and a weight vector). The DP computation can be broken into two parts: (Part M) g integer multiplications for computing partial products and (Part A) accumulation of the g partial products resulting from Part M.
III-B1 FP Conversion Cost (BFP DP versus INT DP)
Before performing BFP DP, we must first convert values in the two FP input vectors into BFP values. After the DP is computed, we need to convert the result to FP to add to an accumulation across BFP groups. This conversion operation is an FP normalization that involves bit shifts of the mantissa.
For the INT DP, we need to perform similar conversions between FP and INT. Suppose that we use conventional uniform quantization (UQ) for the conversion from FP to INT. Since the scale factor is in FP, the INT conversion cost is much higher than the BFP conversion. The conversion of the DP result from INT to FP is an FP normalization, like the conversion of the BFP DP result to FP.
III-B2 Computation Cost (BFP DP versus FP DP)
Figure 5 illustrates the dot product between two BFP groups of size g = 4. We see that BFP DP costs substantially less than FP DP for three reasons. First, for Part M, BFP DP just needs to perform one exponent addition on the shared exponents of the two input vectors. In contrast, FP DP requires g exponent additions. Additionally, unlike FP DP, BFP DP does not perform FP normalization after each of the g multiplications. Finally, for Part A, unlike FP DP, BFP DP does not need to align partial products, as they are already aligned.
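The cost asymmetry is visible in a behavioral sketch of a single BFP group DP: all per-element work is integer arithmetic, and the exponents are touched exactly once per group pair (function and variable names are ours):

```python
def bfp_group_dot(e_x, mants_x, e_y, mants_y):
    """DP of two BFP groups. Parts M and A are pure integer operations;
    the only exponent work is a single addition per group pair."""
    acc = sum(mx * my for mx, my in zip(mants_x, mants_y))  # integer MACs, no alignment
    return acc, e_x + e_y  # one exponent addition for the whole group
```

The `(acc, exponent)` pair is then normalized once into an FP value for accumulation across groups, matching the conversion cost discussed in Section III-B1.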
III-B3 Computation Cost (BFP DP versus INT DP)
BFP can have a much smaller mantissa bitwidth (e.g., m = 4) than INT (e.g., 12 bits) while achieving similar classification accuracy (Table 18). Since the computational complexity of fixed point multipliers scales quadratically with bitwidth, for Part M, BFP DP costs much less than INT DP due to the reduced m. BFP does, however, incur a relatively small extra cost compared to INT for adding the shared exponents between the BFP groups.
III-C BFP Exponent and Mantissa Bitwidths
The numbers of exponent and mantissa bits in BFP play different roles in determining the amount of quantization error after conversion from FP to BFP. If the shared exponent bitwidth is too small, then it may not be able to represent numbers in the dynamic range of the group. If the mantissa bitwidth is too small, then some values with smaller exponents in a BFP group will have all mantissa bits shifted out of range, resulting in data loss (see the third value in Figures 4a and 4b with m = 2).
Figure 6 presents the distribution of the difference between the maximum exponent in a group and all other exponents, for three different group sizes (g = 8, 16, 32). The weight, data, and gradient tensors are taken from layer 10 in ResNet-18 at the halfway point of training on ImageNet (other layers and DNNs generally follow the same trend). The difference dictates the amount of shifting required to align mantissas, as depicted in Figure 4b. Large differences lead to large worst-case quantization errors. When the difference is larger than the mantissa bitwidth, all bits will be truncated. Compared to the weights and activations, the gradients have a much wider exponent disparity, leading to a larger quantization error. This is why stochastic rounding (SR) for gradient computations, as depicted in Figure 4c, is essential to achieve high accuracy when using a low number of mantissa bits. We notice that the mass of each distribution moves to the right as the group size increases, as indicated by the positions of the red vertical lines. Thus, increasing the group size g will increase truncation errors for the same mantissa bitwidth. In this paper, we set g to 16 unless otherwise stated.
III-D Stochastic Rounding (SR) of Gradients in Gradient Descent
In this section, we present an analysis and illustrations of how SR works in gradient descent for DNN training. The analysis shows that for low-precision BFP with a small mantissa bitwidth m, applying stochastic rounding to gradients, as illustrated in Figure 4c, can minimize the impact of rounding on gradient descent performance.
Let L be the training loss of a DNN (e.g., L can be based on cross entropy). Without loss of generality, we assume in this analysis that the learning rate η is 1. We consider the use of stochastic gradient descent (SGD) to minimize L using multiple rounds of iterations. Consider any specific parameter w of the neural network. For each iteration, the backpropagation algorithm computes the partial derivative ∂L/∂w and updates w with the following rule:

w ← w − η · g

where η is the learning rate and g = ∂L/∂w. We refer to a collection of multiple iterations over all training data as an epoch. In Figure 7 (left), we illustrate four iterations of updating w from its initial value w0 to w1, w2, w3, and w4, using full-precision floating point (FP32) without rounding.

III-D1 Impact of rounding on weight updates
In Figure 7 (right), we consider the scenario where we perform training iterations using fixed point integers quantized from FP32. The diagram illustrates that if the gradient at each of the four iterations is rounded down, then the total weight increment will be reduced, leading to a higher loss compared to the FP32 case without rounding. This is because the gradient at each red dot corresponding to an iteration is rounded down to a smaller value, causing a smaller weight increment.
III-D2 Use of stochastic rounding to minimize impact of rounding on weight updates
We use SR to minimize the impact of rounding on weight updates and on the corresponding reduction in loss.
Theorem 1
If the gradient remains the same over iterations, then SR is expected to yield the same total weight increments as FP32 without rounding, assuming that the stochastic noise used by SR is full precision.
The assumption that gradients stay the same is just to simplify the explanation below. The same argument can derive the expected increment on a weight w over an iteration based on the expected gradient value for that iteration, without having to make the assumption of constant gradients.
To explain Theorem 1, we first consider a simple case where we quantize a gradient g = 2/3 lying in the quantization decision interval [0, 1], as depicted in Figure 8a. Under SR, g is rounded to 0 and 1 with probability 1/3 and 2/3, respectively, reflecting the distance of g to each endpoint. Note that over 3 iterations, g is expected to round down to 0 once and round up to 1 twice. Figure 8d illustrates that g is rounded to 1, 1, and 0 for iterations 1, 2, and 3, respectively. We note that over these 3 iterations SR increments w by the same total amount (i.e., 2) in computing w3 from w0 as the FP32 case, as depicted in Figure 8b.

We now explain Theorem 1 by considering a general case, where we round a gradient g lying in the quantization decision interval [l, l + 1], as depicted in Figure 8c. In this case, we express the weight gradient as g = l + δ for some integer l and fraction δ with 0 ≤ δ < 1. (Note that if l = 0 and δ = 2/3, then Figure 8d depicts the scenario of Figure 8a.) Using SR, g is rounded to l and l + 1 with probability 1 − δ and δ, respectively. Note that the two probabilities sum to 1, as expected. Thus, each iteration is expected to increment the weight value by (1 − δ) · l + δ · (l + 1) = l + δ, which is the same weight increment as under FP32 without rounding.
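A small simulation illustrates Theorem 1 under its constant-gradient assumption: averaged over many iterations, SR's increment matches the unrounded FP32 value. This is a behavioral sketch using a software PRNG, not the hardware's LFSR-based rounding:

```python
import math
import random

def stochastic_round(g, rng):
    """Round g down to l with probability 1 - delta and up to l + 1
    with probability delta, where g = l + delta."""
    l = math.floor(g)
    return l + (1 if rng.random() < g - l else 0)

rng = random.Random(42)
g = 2.0 / 3.0                 # the gradient from the example above
n = 100_000
mean = sum(stochastic_round(g, rng) for _ in range(n)) / n
# mean is close to 2/3, the per-iteration increment of the FP32 case
```

By contrast, deterministic truncation of g = 2/3 would yield 0 on every iteration, which is exactly the loss-of-increment problem depicted in Figure 7 (right).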
IV FAST Strategy for Training
In this section, we describe our FAST strategy for DNN training, which varies the BFP precision of weights, activations, and gradients over the course of training. First, in Section IV-A, we provide motivation for progressively increasing the precision across both training iterations and DNN layers. Then, in Section IV-B, we propose an approach to adaptively increase the precision during training.
IV-A Progressive Precision Changes over DNN Training
Previous literature has demonstrated that adding zeromean Gaussian noise to the weight gradient can reduce overfitting and improve the convergence of DNN training [neelakantan2015adding]
. They show that decreasing the variance of the noise over iterations achieves better performance than using fixed Gaussian noise throughout training. We hypothesize that a similar effect can be achieved by adjusting the BFP precision of weights, activations, and gradients from low to high precision over training. To test this, we compare two training schemes (using ResNet-20 on CIFAR-10) that use different strategies for switching the DNN training precision over time. In the Temporal High-to-Low scheme, we use FP32 for weights, activations, and gradients for the first half of training, and low-precision BFP with a mantissa bitwidth of 3 and group size of 16 for the second half of training. For the Temporal Low-to-High scheme, we adopt the opposite approach by using low-precision BFP in the first half of training and FP32 in the second half of training. Figure 9 (left) shows the test accuracy of these two schemes over the training process. The Low-to-High scheme achieves higher performance, which indicates that training is more amenable to low-precision BFP in the early stages.

Additionally, during the backward pass of training, the BFP quantization error for the data gradient has a greater impact on the early layers than on later layers. We perform another experiment to show this impact by comparing two further training schemes. In the Layerwise High-to-Low scheme, we use FP32 precision for the first ten layers, and low-precision BFP with a mantissa bitwidth of 3 and group size of 16 for the last ten layers. For the Layerwise Low-to-High scheme, we apply the opposite precision setting by switching the training precision between the first and second halves of the DNN layers. To eliminate the impact of architectural differences, we change the structure of ResNet-20 so that the first and second halves have the same weight filter layout. The results shown in Figure 9 (right) indicate that applying low precision in the early layers works better than in the later layers.
IV-B FAST Adaptive Training
Based on the insights of the prior section, we propose an adaptive training strategy that progressively increases the BFP precision across both training iterations and layer depth. Algorithm 1 describes the mechanism of the FAST training algorithm. FAST supports two precision levels by representing the BFP mantissas with either 2 bits (low precision) or 4 bits (high precision). For a given FP tensor X, FAST first evaluates the relative improvement R(X), defined by Equation 2, of using a 4-bit mantissa compared to a 2-bit BFP mantissa. If the relative improvement of using the higher precision setting is smaller than a threshold T (i.e., R(X) < T), then a 4-bit mantissa does not offer significant improvement over a 2-bit mantissa. However, if the relative improvement is larger than the threshold, then using a 4-bit mantissa will significantly reduce the quantization error compared to a 2-bit mantissa.
To allow the BFP precision to increase incrementally across both layer depth and training iterations, the threshold T is set to vary with the training iteration t and layer depth l based on the following equation:
T(t, l) = α − β · (t / N_t + l / N_l)   (1)

where N_t and N_l are the total number of training iterations and DNN layers, respectively, and α and β are the hyperparameters that specify the offset and the slope of the threshold function. Equation 1 sets T to decrease gradually with both training iteration and layer depth, so that higher precisions will be used as the training iteration and layer depth grow.

We define the relative improvement R(X) of using higher-precision BFP (m = 4) compared to low-precision BFP (m = 2) as follows:
R(X) = ( Σ_i |Q4(X_i) − Q2(X_i)| ) / ( Σ_i |Q4(X_i)| )   (2)

where X_i denotes the i-th element of X, and Qm(X_i) represents X_i quantized with an m-bit BFP mantissa. The numerator of R(X) reflects the total difference between the BFP values with high-precision and low-precision mantissas across the elements of X. This difference is divided by the sum of the magnitudes of the BFP-quantized values so that the scale of R(X) is consistent across different training iterations and DNN layers. Finally, the numerator and denominator of R(X) are each computed by summing BFP-quantized numbers across the elements, which can be implemented with low hardware cost. The expensive division operation is only performed once, between the two sums.
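A behavioral sketch of this precision decision is below. The quantizer, the exact form of the threshold, and the hyperparameter values (`alpha`, `beta`) are our illustrative assumptions, not the paper's tuned settings:

```python
import math

def bfp_quantize(values, m):
    """Quantize a group to BFP with an m-bit mantissa (truncation only)."""
    shared = max(math.frexp(v)[1] for v in values)
    scale = 2.0 ** (shared - m)
    return [math.floor(v / scale) * scale for v in values]

def relative_improvement(x):
    """R(X): total |Q4 - Q2| difference, normalized by the magnitude sum."""
    q4, q2 = bfp_quantize(x, 4), bfp_quantize(x, 2)
    num = sum(abs(a - b) for a, b in zip(q4, q2))
    den = sum(abs(a) for a in q4)
    return num / den if den else 0.0

def pick_mantissa_bits(x, t, layer, n_iters, n_layers, alpha=0.1, beta=0.05):
    """Use 4-bit mantissas only when the improvement beats the threshold,
    which shrinks with training iteration t and layer depth."""
    threshold = alpha - beta * (t / n_iters + layer / n_layers)
    return 4 if relative_improvement(x) > threshold else 2
```

Because the threshold shrinks with t and layer depth, a tensor whose improvement score stays fixed will eventually flip from 2-bit to 4-bit mantissas, realizing the low-to-high progression of Section IV-A.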
V FAST System
The major components of the proposed FAST system are shown in Figure 10. We use a 2D systolic array (Section V-A) to perform the matrix multiplications for both the forward and backward passes of DNN training. The systolic array contains systolic cells, each implementing a FAST MAC (fMAC), discussed in Section V-B, which supports variable precision BFP. The memory subsystem has three SRAMs used to store weights, activations, and gradients, respectively. When performing matrix multiplication, the systolic array data generator is used to skew the input for data synchronization in the systolic array. The accumulator is used to buffer partial accumulations across multiple tiles (for matrices that are larger than the systolic array). The output of the accumulator is passed to the BFP converter (Section V-C), which converts groups of FP values into BFP groups.

V-A Systolic Array Operations
To support the matrix transposition required during the backward pass of training (see Figure 3), we have developed a systolic array that can perform matrix multiplication involving a transposed matrix operand without explicit transposition. This avoids extra data copying and thus reduces the implementation overhead of the matrix transposition operation. In Figure 12, we illustrate how this systolic array operates for each of the forward and backward pass matrix operations given in Figure 3. For clarity, we show each systolic cell with single INT values instead of a BFP group.
To compute the output for each layer (Figure 12a), the weights are first prestored in the systolic cells. Then, the activations enter the systolic array from the bottom and the output exits the systolic array from the right side (refer to Figure 3). During the backward pass, to compute the activation gradients (Figure 12b), the weights are also prestored in the systolic array. However, unlike the forward pass, in which the activations enter from below, the output gradients enter the systolic array from the left and the activation gradients are produced at the top of the systolic array. By changing the side from which the input enters the systolic array while keeping the orientation of the weights fixed, we can compute the activation gradients without explicitly transposing the weight matrix.
Finally, to compute the weight gradients (Figure 12c), the systolic array is reconfigured to be accumulation-stationary. During the computation, the input activations and output gradients enter the systolic array from the left and from below, respectively, and the weight gradients are computed and accumulated within each systolic cell. At the end of this computation, the accumulated gradient in each systolic cell is summed with the corresponding weight to generate the updated weight, which is then stored back in the weight SRAM. For optimizers like Adam [kingma2014adam], additional hardware is required to compute the first and second moments for the weight updates.
V-B Design of FAST MAC for BFP
Each cell in the systolic array implements an fMAC, which performs the DP between two BFP groups. Figure 11 shows the design of an fMAC for a group size g. A DP consists of g fixed-point multiplications between each pair of BFP mantissas for values in the groups, which are performed by the g multipliers in the fMAC. The outputs generated by the multipliers are summed using the adder tree. The FP generator takes the fixed point summation from the adder tree to create the FP mantissa. The shared exponents of the two BFP groups are added together to create the FP exponent (refer to Figure 5). The resulting FP value is added to the FP accumulator, which stores the partial result spanning many BFP groups.
Additionally, the fMAC can be reconfigured to support the different operations of DNN training described in Section V-A which may require matrix transposition. To compute the output during the forward pass (Figure 12a), the BFP shared exponent and mantissas of the weights are first prestored in the fMAC through its exponent and mantissa input ports, respectively. Then, the activations enter the fMAC via the same ports to perform the DP computation. The output generated by this computation exits to the right neighbor via an output port. To compute the activation gradient during the backward pass (Figure 12b), the weights are prestored in the same fashion, and the BFP output gradients are passed into the multipliers via the input ports, with the output exiting to the neighbor above. Finally, to compute the weight gradients (Figure 12c), the output gradients and input activations enter the fMAC through separate pairs of exponent and mantissa ports. The accumulator output then loops back to be summed with the prestored FP weights.
To support BFP DP with variable precision, each DP is processed in 2-bit mantissa chunks, as shown in Figure 13. Here, the mantissa bitwidths for the two operands (X and Y) are 4 bits and 2 bits, respectively. In the first round, the fMAC computes the dot product between Y and the first 2-bit chunk of X. The partial accumulation result is then buffered for subsequent processing. In the second round, the fMAC computes the dot product between Y and the second 2-bit chunk of X in order to finish the DP computation. More rounds are required for higher mantissa bitwidths. For example, multiplying a pair of BFP groups with 4-bit and 4-bit mantissas translates to 2 × 2 = 4 rounds. To account for the difference in exponent magnitude between the two chunks, the BFP exponent of the second 2-bit chunk is decremented by two. Note that this decrement is performed by the BFP converter when it generates each 2-bit chunk, and therefore the fMAC is agnostic to these exponent differences across chunks.
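The correctness of this multi-round chunking can be checked in software: summing the chunk-pair partial DPs, each shifted by the chunks' combined bit positions (the exponent adjustment described above), recovers the full-precision integer DP. Names are ours; mantissas are unsigned here for simplicity:

```python
def chunked_dot(xs, ys, m_x, m_y):
    """fMAC-style DP over 2-bit mantissa chunks: one round per chunk pair,
    with each partial result shifted by the chunks' combined bit positions."""
    acc = 0
    for sx in range(m_x - 2, -1, -2):          # X chunk positions, high-order first
        for sy in range(m_y - 2, -1, -2):      # Y chunk positions
            part = sum(((x >> sx) & 0b11) * ((y >> sy) & 0b11)
                       for x, y in zip(xs, ys))
            acc += part << (sx + sy)           # exponent adjustment between chunks
    return acc
```

With 4-bit `xs` and 2-bit `ys` this performs the two rounds described above and matches the direct integer dot product; with 4-bit operands on both sides it performs four rounds.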
V-C BFP Converter
The BFP converter, shown in Figure 14, takes a group of FP values and converts them into BFP following the process outlined earlier in Figure 4. The comparator consists of compare-and-forward (CF) blocks arranged in a tree structure. Each CF block takes a pair of FP exponents and forwards the larger exponent to the next tree level. The largest exponent is output and used as the shared exponent (Figure 4a). Then, a group of subtractors calculates the difference between the shared exponent and each exponent in the group. The shift blocks, which are implemented using barrel shifters [pillmeier2002design], perform right shifts on each FP mantissa based on the exponent difference for each value (Figure 4b). Then, to perform stochastic rounding (Figure 4c), a group of 8-bit random binary values produced by a linear feedback shift register (LFSR) is summed with the mantissas. Finally, the low-order bits of the BFP mantissas are truncated (Figure 4d). The BFP exponents and mantissas are also delivered to the improvement computation block, which computes the relative improvement defined in Equation 2.
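An LFSR provides cheap pseudo-random bits for this rounding step. A sketch of one step of an 8-bit Fibonacci LFSR is below; the tap choice (8, 6, 5, 4) is a standard maximal-length polynomial and is our illustrative pick, not necessarily the one used in the FAST hardware:

```python
def lfsr8_step(state: int) -> int:
    """Advance an 8-bit Fibonacci LFSR with taps at bits 8, 6, 5, 4,
    a maximal-length configuration (period 255 for any nonzero seed)."""
    fb = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
    return ((state << 1) | fb) & 0xFF
```

Successive states, treated as 8-bit random values added below the truncation point of the aligned mantissas, implement the add-then-truncate form of stochastic rounding shown in Figures 4c and 4d.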
V-D Memory Layout for BFP Values
We have developed an efficient storage format for variable precision BFP, where the shared exponent and BFP mantissas are stored separately. Figure 15 provides an example for m = 4, i.e., two 2-bit chunks per mantissa. The 2-bit chunks across all the mantissas in a group are saved in the same memory entry for efficient access during DP computation. The first 2-bit chunks of each BFP mantissa (Figure 15a) are stored together in the same memory entry (Figure 15b), followed by the second 2-bit chunks, which are saved in the next memory entry.
Under this storage scheme, each BFP group is represented by e + g(2c + 1) bits, where e is the bitwidth of the BFP exponent, g is the group size, and c is the number of 2-bit chunks in an m-bit mantissa. An additional bit is required per mantissa to represent the sign (e.g., 3 bits in total for a 2-bit mantissa). In our hardware system, e and g are set to 3 and 16, respectively, and c is 1 or 2 based on the current precision. This leads to an average of roughly 3.2 (c = 1) and 5.2 (c = 2) bits to store each value, which significantly reduces the storage overhead compared with the other formats we evaluate in Section VI. All outputs from the BFP converter (Figure 14) are stored in BFP with 4-bit mantissas divided into two 2-bit chunks. If Algorithm 1 selects the 2-bit mantissa, the low-order 2-bit chunk is discarded.
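The per-group storage cost follows directly from this layout; a small helper makes the arithmetic concrete (the parameter defaults mirror the e = 3, g = 16 configuration stated above):

```python
def bfp_group_bits(e=3, g=16, c=1, chunk_bits=2):
    # One shared exponent per group, plus c chunks of chunk_bits bits and
    # one sign bit per mantissa.
    return e + g * (c * chunk_bits + 1)

# Average bits per value:
#   bfp_group_bits(c=1) / 16  ->  51 / 16 ~ 3.2   (2-bit mantissas)
#   bfp_group_bits(c=2) / 16  ->  83 / 16 ~ 5.2   (4-bit mantissas)
```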
V-E Training Workflow
Figure 16 summarizes the overall workflow of DNN training using the FAST system. During the forward pass (Figure 16a), the filter weights in BFP format are first loaded and saved into each fMAC (step 1). Next, the BFP activations are loaded from the data SRAM into the systolic array (step 2). Each systolic cell (fMAC) performs a partial DP for two BFP groups, followed by an FP accumulation spanning many BFP groups. The output is then delivered to the BFP converter (step 3), which converts these values back into BFP format and stores them in the data SRAM for subsequent processing (step 4). Additionally, the activations must be kept in the data SRAM for the backward pass (Figures 16b and 16c).
To compute the activation gradients (Figure 16b), the weights are again pre-stored into the systolic array (step 1). Then, the output gradients are delivered to the systolic array from the left (step 2). The results are produced at the top of the systolic array and are converted into BFP format (step 3) before being saved into the gradient SRAM (step 4). To compute the weight gradients (Figure 16c), the input activations and the output gradients are delivered to the systolic array concurrently (step 2). The results are produced within each systolic cell and then used to generate the updated weights. Finally, the updated weights are converted into BFP and stored in the weight SRAM (step 3).
VI Training Evaluation of FAST
In this section, we evaluate FAST's training performance for DNNs. In Section VI-A, we visualize the FAST precision adaptation over the course of training to show how FAST achieves faster training time by staying in a low-resolution regime for a large portion of training. Next, in Section VI-B, we compare the accuracies of DNNs trained under BFP against other commonly used FP and INT formats. We also compare against three fixed BFP settings that do not change over the course of training: LowBFP uses (e = 3, m = 2) for all DNN weights, activations, and gradients, MidBFP uses (e = 3, m = 3), and HighBFP uses (e = 3, m = 4). Finally, in Section VI-C, we evaluate the performance of fixed BFP settings to show the relative advantage of using FAST. All CNNs are trained on ImageNet for 60 epochs (120,000 iterations). We use the hyperparameter settings from the PyTorch website [pytorchsettings]. For Transformers, we use a 12-layer model with 12 heads and a hidden size of 768. The Transformer is trained using the Adam optimizer with a batch size of 16 on the IWSLT14 German-English dataset [iwsltbenchmark] for 150 epochs. Finally, we train YOLOv2 [redmon2017yolo9000] on the PASCAL VOC2012 [everingham2011pascal] dataset for 120 epochs using a batch size of 64. We apply the SGD optimizer, dividing the initial learning rate by 10 at 60 and 90 epochs; the weight decay and momentum are set to 0.0005 and 0.9. The two FAST hyperparameters in Equation 1 are set to 0.6 and 0.3 for all DNNs.
VI-A FAST Precision Adaptation
In this section, we visualize the precision changes during the training of a DNN using the FAST-Adaptive algorithm (Algorithm 1). Since the weights W, activations X, and gradients G independently determine their BFP precision, there are 2^3 = 8 possible precision settings per layer using two different BFP resolutions (m = 2 and m = 4). In the figure, we order these settings by their computational cost when deployed in the FAST system (discussed next in Section VII). For instance, settings with the same total bitwidth can differ slightly in computational cost because the gradients are used multiple times during the backward pass (see Figure 3). Figure 21 shows how the BFP precisions of 5 layers in ResNet18 on ImageNet change over the course of training under FAST. As expected, we observe that the BFP precision grows across both layer depth and training iterations.
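For concreteness, the eight per-layer settings can be enumerated and ordered with a toy cost model; the extra weighting of the gradient term is an illustrative placeholder for the backward-pass reuse effect, not the paper's measured cost:

```python
from itertools import product

def ordered_precision_settings(resolutions=(2, 4)):
    # One mantissa width each for weights W, activations X, gradients G.
    settings = list(product(resolutions, repeat=3))      # 2**3 = 8 settings
    # Toy cost: gradients weighted more since they are reused during the
    # backward pass (illustrative assumption).
    return sorted(settings, key=lambda s: s[0] + s[1] + 2 * s[2])
```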
VI-B Comparing Number Formats
Table 18 shows the validation accuracies for different DNN models trained using a wide range of number formats. IEEE 754 32-bit FP (FP32) generally achieves the best performance (accuracy or BLEU) across all models. We find that bfloat16, Nvidia Mixed Precision (MP), MSFP12, and HFP8 achieve performance similar to the baseline FP32 model for all DNNs. Additionally, the HighBFP setting also achieves comparable accuracy. HighBFP with m = 4 represents a substantial saving compared to FP32 with m = 23. The LowBFP and MidBFP settings with m = 2 and m = 3 lose accuracy across all CNNs compared to the FP32 baseline. The INT8 setting has an even larger reduction in accuracy, losing 4-6% compared to the baseline, even though it has more mantissa bits than HighBFP. For fixed point to achieve a similar level of performance as the baseline FP model requires INT12 with 11 mantissa bits. As we note in Section VII, the cost of the fixed point multipliers used to perform this computation grows quadratically with the mantissa bitwidth, making large mantissa bitwidths costly to implement. By comparison, our FAST-Adaptive approach achieves performance comparable to FP32 across all DNNs.
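The quadratic multiplier-cost argument can be made concrete with a first-order model (a back-of-the-envelope estimate, not synthesis data):

```python
def relative_multiplier_area(bits, base_bits=2):
    # Array-multiplier area grows roughly with the product of the operand
    # widths, i.e. quadratically for square multipliers.
    return (bits / base_bits) ** 2
```

Under this model, an 11-bit mantissa multiplier (INT12) costs about (11/2)^2, i.e. roughly 30x a 2-bit chunk multiplier, which is why FAST keeps its multipliers at 2 bits and iterates over chunks instead.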
VI-C BFP Hyperparameter Sensitivity
In this section, we investigate the impact of the group size and mantissa bitwidth on DNN training accuracy. Figure 21 shows the validation accuracy on ResNet18 for different BFP configurations. The three curves represent different group sizes, and the x-axis corresponds to varying the number of mantissa bits (i.e., 2, 3, and 4) for each group size. For a given mantissa bitwidth, a smaller group size is generally able to achieve a higher accuracy than a larger one. However, a smaller group size incurs additional implementation overhead due to more FP exponent additions, as each shared exponent spans fewer elements. Overall, we observe that a group size of 16 with m = 4 produces the best trade-off, and we use this setting as our baseline for FAST training.
VII Hardware Evaluation of FAST
In this section, we evaluate the hardware performance of the FAST system described in Section V. We synthesized our system using the Synopsys Design Compiler [synopsysdesigncompiler] with the 45nm NanGate Open Cell Library [nangatelib] and CACTI [cacti]. CACTI is used to simulate the performance of the memory subsystem, and the Synopsys Design Compiler is used for all other subsystems shown in Figure 10. For our FPGA evaluation, we use a Xilinx VC707 FPGA evaluation board. The FAST system contains a systolic array of fMAC cells. The gradient SRAM, weight SRAM, and data SRAM each consist of 128 16 kB memory banks. The FAST system runs at 500 MHz. Table 18 summarizes the area and power breakdowns of FAST.
VII-A Evaluation of fMAC
We evaluate the efficiency of our fMAC design by comparing it against FP and INT MAC designs. For FP MACs, we implement bfloat16, FP16, and HFP8 variants (FP16 is used by Nvidia MP). An FP MAC performs a multiply between two FP numbers followed by a 32-bit FP accumulation. For INT MACs, we implement 8-bit (INT8) and 12-bit (INT12) variants. Refer to Figure 2 for details on each number format. HFP8 uses two floating point formats during training: 4-bit exponent/3-bit mantissa for the forward pass and 5-bit exponent/2-bit mantissa for the backward pass. For a hardware cost comparison to FAST, we implement a MAC that supports a 4-bit exponent/2-bit mantissa, so that its hardware cost is strictly less than either floating point format used by HFP8. Since a single fMAC performs a BFP DP across two groups of 16 numbers, we use 16 MACs for each of the other designs for a fair comparison.
Table II provides an ASIC evaluation in terms of area and power consumption, as well as FPGA resource consumption, for all MAC designs. Area consumption is normalized by the area of our fMAC design. The fMAC achieves superior area and power consumption compared to the other MAC designs. The main advantage of the fMAC over the INT MAC designs is the significantly reduced mantissa bitwidth, leading to a substantial reduction in the cost of the fixed point multipliers. When comparing the fMAC to the FP MACs, the expensive FP accumulator is amortized over the group for the fMAC instead of being required between each pair of elements for the FP MACs.
VII-B Training Speedup of FAST Strategies
In this section, we evaluate the performance of the FAST system by comparing it against other DNN training systems implemented with systolic arrays using different number formats. We configure each system for a given number format to have the same total area as our FAST system. Specifically, within the same area, we are able to fit DNN training systems with systolic arrays of HFP8 (4-bit exponent and 2-bit mantissa) MACs, MSFP12 MACs, INT12 MACs, bfloat16 MACs, and FP16 MACs, respectively. Note that our FAST system contains an fMAC systolic array, and each fMAC can perform multiply-accumulate operations for 16 BFP numbers within one cycle. The designs for the other major components (i.e., accumulator, numerical converter, systolic array data generator, and memory subsystem) of the baseline DNN systems are modified according to the given MAC design. For example, for bfloat16, a bfloat16 converter is used instead of a BFP converter. All designs run at a 500 MHz clock frequency.
We use Time-to-Accuracy (TTA) [coleman2017dawnbench] as the evaluation metric to compare the different approaches. Figure 21 shows the TTA for ResNet18 models trained under various number formats to achieve a validation accuracy of 68% on ImageNet. The training time is normalized by that of the FAST-Adaptive model, which achieves 68% the fastest. Settings that were unable to achieve 68% validation accuracy, such as INT8 and LowBFP, are omitted. The results are measured by performing a single round of forward pass and backward pass with an input minibatch of size 256. The evaluation results are generated based on the computation required for all convolutional and fully connected layers. Normalization and activation layers are not considered in the cost analysis of this paper; prior work suggests that activations and batch normalization take less than 5% of total running time [fleischer18] and a small amount of power relative to the systolic array and other components [lee2020, qin2020]. Generally, we see that FP32 is significantly slower than reduced/mixed precision formats such as bfloat16 and Nvidia MP. However, the floating point accumulation required for each MAC using these formats introduces overhead compared to fixed point or BFP formats. MSFP12 achieves the best performance of all prior work. Our proposed FAST schemes outperform MSFP12 by more than 2x by using lower mantissa and exponent bitwidths and switching to a higher precision only in the later stage of training.
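TTA itself is straightforward to compute from a training log; a minimal sketch, assuming per-epoch validation accuracies and a fixed per-epoch wall-clock time:

```python
def time_to_accuracy(val_accuracies, seconds_per_epoch, target):
    # Wall-clock time until validation accuracy first reaches the target;
    # returns None if the run never gets there (e.g. INT8 at 68% above).
    for epoch, acc in enumerate(val_accuracies, start=1):
        if acc >= target:
            return epoch * seconds_per_epoch
    return None
```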
Figure 22 depicts the normalized training time and energy cost for all evaluated DNNs to reach a target accuracy or BLEU score. We note that the performance trend and the gains of FAST-Adaptive are consistent across models, with prior reduced/mixed precision formats outperforming FP32 by a factor of 2-3x and our proposed BFP formats achieving an additional 2-3x improvement.
VIII Conclusion
The FAST system proposed in this paper uses block floating point (BFP) to support low-precision arithmetic, reducing DNN training time, power consumption, and hardware requirements. With FAST, we exploit the observation that earlier layers and earlier training iterations can afford larger error margins, making them amenable to efficient low-precision computation.
We empirically demonstrate a 2-6x speedup in training over prior work based on mixed-precision or BFP number systems while achieving similar accuracy. FAST's superior performance is due to our architectural choice of the BFP number system, the use of stochastic rounding in BFP, and a modular fMAC design that supports multiple precisions. This work shows that variable precision BFP with stochastic rounding offers a promising strategy for speeding up training and improving its efficiency. As DNN training is now often distributed across multi-chip systems [jouppi2020domain, tesladojo], future work will study how well FAST could scale in such multi-chip deployments.