I Introduction
Deep Neural Networks (DNNs) have achieved state-of-the-art performance across a variety of domains, including Recurrent Neural Networks (RNNs) and Transformers for natural language processing and Convolutional Neural Networks (CNNs) for computer vision. However, the high computational complexity of DNNs makes them expensive to deploy at scale in datacenter contexts, as a popular model (e.g., Google's Smart Compose email autocomplete [1]) may be queried millions of times per day, with each query requiring tens to hundreds of GFLOPs.
To address these high computational costs, significant research effort has been spent on developing techniques that reduce the computational complexity of pretrained DNNs. One of the most commonly used techniques is post-training quantization (see, e.g., [2]), where 32-bit floating-point DNN weights and data (activations) are converted to a fixed-point representation (e.g., 8-bit fixed point) to reduce the amount of computation performed per inference sample. One benefit of post-training quantization is that it does not require access to the original training dataset, and it can therefore be applied by a third party (such as a cloud service) as a cost-reduction step. In this work, we define a term as a nonzero bit in a quantized fixed-point value. For instance, we say that the 8-bit value 3 (00000011) is composed of two terms: 2^1 + 2^0.
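In code, counting the terms of a value is simply a population count over its magnitude bits. The minimal sketch below illustrates the definition (the helper name `num_terms` is ours, not from the paper):

```python
def num_terms(value: int) -> int:
    """Count the power-of-two terms (nonzero bits) in a non-negative 8-bit value."""
    return bin(value & 0xFF).count("1")

# The 8-bit value 3 (00000011) has two terms: 2^1 + 2^0.
assert num_terms(3) == 2
# 6 = 2^2 + 2^1 also has two terms.
assert num_terms(6) == 2
```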
In this paper, we further quantize the computation of an already quantized DNN at run time to realize additional substantial computation savings. That is, we propose to perform further run-time quantization on, for example, a quantized 8-bit DNN while still achieving the same level of model performance. Note that furthering quantization requires new techniques beyond conventional quantization methods; otherwise, the original DNN could have been quantized to a lower precision in the first place while maintaining acceptable performance (e.g., a 4-bit DNN instead of an 8-bit DNN).
Specifically, we introduce a novel group-based quantization method, which we call Term Revealing (TR). TR, shown in Figure 1, ranks the terms in a group of values associated with a dot-product computation to reveal a fixed number of top terms (called a group budget) to use for the dot-product computation. By limiting the number of terms to a group budget and pruning the remaining smaller terms, TR enables a more efficient implementation of dot-product computations in DNNs. With TR, the terms selected for a value are based on their relative ranking against the terms of the other values in the group. TR's run-time group-based quantization, a departure from traditional value-based quantization, allows TR to carry out additional quantization on an already quantized DNN.
While allowing further quantization at run time, TR is able to achieve the same level of model performance as the original quantized DNNs for two reasons. First, TR uses group-based term selection, which prunes only the smaller terms in a group (e.g., 2^0 and 2^1 terms), leading to minimal added quantization error. For the many groups with fewer terms than the allocated group budget, no additional quantization is performed at all. Second, by leveraging the normal-like weight and data distributions typically present in DNNs (see Section III-A), TR can use a small group budget without introducing quantization error for many groups.
With a simple FPGA design, a small number of control bits suffices to reconfigure hardware supporting quantized computation under conventional quantization into hardware supporting run-time TR quantization, and vice versa.
To simplify our introduction of TR, we use conventional binary representations, where all terms are nonnegative. However, shorter signed expressions that use both positive and negative terms, such as Booth encodings [3], can typically express a value with fewer terms and lead to increased computation savings under TR-enabled quantization. To this end, we have developed a new signed encoding, called Hybrid Encoding for Shortened Expressions (HESE), which we use in the later sections of the paper to express DNN weights and data.
The novel contributions of the paper are:

The concept of run-time quantization of already quantized DNNs to realize further computation savings.

A group-based term ranking mechanism, called term revealing (TR), and its term MAC (tMAC) hardware design for implementing our proposed further quantization at run time.

An FPGA system which requires minimal reconfiguration to efficiently support both conventional quantization and TR-enabled quantization.

A signed power-of-two encoding called Hybrid Encoding for Shortened Expressions (HESE). By using fewer terms than previous signed representations, HESE enhances TR's computation efficiency.
II Background and Related Work
In Section II-A, we discuss related work on pruning and quantization techniques for efficient DNN inference. Then, in Section II-B, we discuss prior work on hardware architectures that aim to exploit bit-level sparsity. Finally, in Section II-C, we illustrate how matrix multiplication is performed with systolic arrays.
II-A Pruning and Quantization Methods
There have been significant research efforts in pruning-based methods that exploit value-level sparsity in CNN weights, since performing multiplication with zero operands can be viewed as wasted computation [4, 5, 6, 7, 8, 9, 10, 11, 12]. However, these pruning methods typically require model retraining, making them infeasible for a third party that is hosting the model (as retraining requires access to the full training dataset). Additionally, the unstructured pruning methods that achieve the best performance (e.g., [4]) are hard to implement efficiently in special-purpose hardware, as the remaining nonzero weights are randomly distributed. In this paper, we propose to further reduce the amount of computation even for nonzero values by exploiting bit-level sparsity as opposed to conventional value-level sparsity.
Quantization [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26] lowers the precision of weights and data in order to reduce the associated storage, I/O, and/or computation costs. However, aggressive post-training quantization (e.g., to 4-bit representations) introduces additional error into the computation, leading to decreased model performance. For this reason, many low-precision quantization approaches, such as binary neural networks [27], must be applied during training. Our proposed TR approach is applied on top of 8-bit quantization and does not require additional training. Note that our approach does not reduce the precision of the weights (i.e., weights are still 8-bit fixed-point values after term revealing) but instead reduces the number of nonzero terms used at run time across a group of weights.
II-B Hardware Architectures for Exploiting Bit-level Sparsity
There has been growing interest in exploiting bit-level sparsity (i.e., the zero bits present in weight and data values) as opposed to the value-level sparsity discussed in the previous section. A bit-level multiplication with a zero bit can be viewed as wasted computation in the same manner as a value-level multiplication with a zero value, in that neither operation affects the result. Based on this observation, Bit-Pragmatic introduces an architecture that utilizes a nonzero-term-based representation to remove multiplications with zero bits in weights while keeping data in a conventional representation [28]. Bit-Tactical follows up this work by grouping nonzero weight and data values to achieve more efficient scheduling of nonzero computation [29]. Both of these approaches assume 8-bit or 16-bit fixed-point quantization (i.e., the first step in Figure 1).
However, due to the more fine-grained nature of these bit-level architectures, efficiently scheduling bit-level operations across multiple groups of computations becomes challenging, as each group may have a different amount of computation to perform. Generally, this leads to stragglers that require significantly more bit-level operations than other groups. Both Bit-Pragmatic and Bit-Tactical handle this straggler problem by adding a synchronization barrier which makes all groups wait until the straggler is finished. Due to this, when processing many groups concurrently, they can only exploit bit-level sparsity up to the degree of the group with the most bit-level operations (i.e., the straggler). We find that this worst case can require a factor of 2 to 3 more bit-level operations than the average case. By comparison, TR provides a tighter processing bound which enables synchronous computation across all groups. This is done by removing smaller terms from groups with a large number of terms (the second quantization step in Figure 1).
II-C Systolic Arrays for Matrix Multiplication
The majority of computation in the forward propagation of a DNN consists of matrix multiplications between the learned weight matrix of each layer and the input or intermediate data being propagated through the layer, as shown on the left side of Figure 2. Systolic arrays are known to implement matrix multiplication efficiently due to their regular design, dataflow architecture, and reduced memory access [30]. The right side of Figure 2 shows a 3×3 systolic array for computing dot products between the weight and data partitions highlighted in the weight and data matrices. The data in the partition are passed into the systolic array from below in a skewed fashion in order to maintain synchronization between cells. We use this systolic array design as the starting point for our FPGA system in Section V.
III Term Revealing
In this section, we introduce a group-based quantization method called term revealing (TR), which is applied to quantized DNNs at run time.
III-A DNN Weight and Data Distributions
As mentioned earlier, TR leverages the weight and data distributions of DNNs. DNNs are often trained with weight decay regularization [31, 32], which both improves the stability of convergence and improves the performance of the learned model. A consequence is that the weights are approximately normally distributed and the data follow a half-normal distribution (as ReLU sets negative values to 0). Figure 3 (top row) illustrates these distributions for the weights in the 7th convolutional layer of ResNet-18 [33] trained on ImageNet [34] and the data input to that layer. Both the weights and data are quantized to 8-bit fixed point using uniform quantization (QT).
The higher frequency of small values means that most elements are represented with only 2 or 3 power-of-two terms, as shown in Figure 3 (bottom). For instance, the value 6 is represented with two power-of-two terms: 2^2 + 2^1. In the figure, 79% of weight values and 84% of data values are represented with 3 or fewer power-of-two terms. Note that the most significant bit (MSB) in the 8-bit representation is used to represent the sign of each value; thus each value has at most 7 terms.
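A small simulation illustrates why such normal-like distributions concentrate term counts. The sketch below (our own illustration with an assumed standard deviation, not fitted to the ResNet-18 data) draws pseudo-weights, quantizes them to 7 magnitude bits, and measures the fraction representable in 3 or fewer terms:

```python
import random

def num_terms(v: int) -> int:
    """Power-of-two terms (nonzero bits) in a non-negative quantized value."""
    return bin(v).count("1")

random.seed(0)
# Draw magnitudes from a normal distribution (assumed sigma = 20) and clamp to
# the 7-bit magnitude range of sign-magnitude 8-bit fixed point.
samples = [min(127, abs(round(random.gauss(0, 20)))) for _ in range(10000)]
frac = sum(num_terms(v) <= 3 for v in samples) / len(samples)
print(f"fraction with <= 3 terms: {frac:.2f}")
```

Because small magnitudes dominate, most sampled values need only a few terms, mirroring the 79%/84% figures reported for real weights and data.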
III-B Computing Dot Products via Term-Pair Multiplications
Assume that we compute the dot products in a matrix-matrix multiplication between quantized weights and data by dividing both vectors into groups of a given length (e.g., 16). This group-based formulation is motivated by the efficient hardware implementations described later in Section V. Figure 4 illustrates how partial dot products, partitioned into groups of length 16, are computed using term-pair multiplications. In the example, the product of the first value in the weight vector and the first data value is computed using two term-pair multiplications, as shown on the right of Figure 4. Using this paradigm, we can analyze the number of term-pair multiplications required per partial dot product (e.g., with a group size of 16) across all groups in a matrix-matrix multiplication.
Figure 5 shows a histogram of the number of term-pair multiplications for partial dot products with groups of 16 values in the 7th convolutional layer of ResNet-18. Interestingly, 99% of these groups require under 110 term-pair multiplications, even though the theoretical maximum, where all weight and data values use 7 terms each, is 16 × 7 × 7 = 784. In this work, we propose to restrict the number of term-pair multiplications performed in each partial dot product (e.g., to 110 instead of 784) in order to achieve tightly synchronized parallel processing across systolic cells. As we know that DNN performance is robust to small amounts of error (e.g., through the initial uniform quantization step), it is reasonable to expect that DNNs would also tolerate an additional quantization step, such as our proposed TR-enabled quantization, that makes small modifications to enforce a tighter processing bound. We use term-pair multiplications as a proxy for the amount of computation performed during inference, as the hardware system described in Section V performs dot products using this term-pair multiplication approach.
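Counting term-pair multiplications for a group follows directly from the binary representations. The sketch below uses illustrative 16-value groups of our own (not the paper's measured data) and also confirms the 16 × 7 × 7 = 784 worst case:

```python
def num_terms(v: int) -> int:
    """Power-of-two terms (nonzero bits) in a value's magnitude."""
    return bin(abs(v)).count("1")

def term_pairs(weights, data):
    """Term-pair multiplications needed for one partial dot product."""
    return sum(num_terms(w) * num_terms(d) for w, d in zip(weights, data))

# An illustrative group of 16 weight/data pairs with mostly small values.
weights = [3, 6, 1, 0, 12, 5, 2, 8, 1, 3, 0, 7, 4, 2, 1, 9]
data    = [5, 2, 9, 4,  1, 3, 6, 2, 8, 1, 5, 2, 3, 7, 4, 1]
print(term_pairs(weights, data))            # → 31, far below the maximum

# Worst case: every value uses all 7 magnitude bits (127 = 1111111).
assert term_pairs([127] * 16, [127] * 16) == 16 * 7 * 7 == 784
```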
III-C Overview of Term Revealing
Term revealing is a group-based term ranking method that sets a limit on the number of terms allotted to a scheduling group. TR consists of three steps:

Grouping elements, as shown in Figure 6 (left). For a given weight matrix, we partition it into equal-size groups which are used in dot-product computations. The group size g denotes the number of values per group and may assume various values such as 2, 3, 4, 8, 16, etc.

Configuring a group budget k, which is used for every group. The budget bounds the number of terms used in dot-product computations across the values in a group.

Identifying the top k terms in the group using a receding water algorithm that ranks and selects the terms, as shown in Figure 6 (right). The algorithm keeps the k largest terms in a group and prunes the remaining smaller terms below a waterline. Note that some groups may have fewer than k terms, meaning that no pruning occurs.
Figure 6 illustrates how TR is applied to a group of values in a weight matrix with a given term budget. The values are decomposed into their term representations and scanned row by row (with each row viewed as a waterline), starting from the highest-order term and finishing at the 2^0 term, until the group budget is reached. The remaining low-order terms below the waterline are pruned, adding a small amount of additional quantization error. For instance, after TR, one value in the example is quantized from 81 to 80.
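The receding water selection described above can be sketched in a few lines of Python. This is a simplified behavioral model of ours for nonnegative magnitudes with at most 7 magnitude bits, not the hardware implementation; ties at the final waterline are broken by scan order:

```python
def term_reveal(group, budget):
    """Keep the `budget` largest power-of-two terms across a group of
    non-negative quantized values, pruning the remaining smaller terms."""
    kept = [0] * len(group)
    remaining = budget
    for bit in range(6, -1, -1):           # waterline scans from 2^6 down to 2^0
        for i, v in enumerate(group):
            if remaining == 0:             # budget exhausted: prune the rest
                return kept
            if v & (1 << bit):
                kept[i] |= 1 << bit        # reveal this term
                remaining -= 1
    return kept                            # group had <= budget terms: no pruning

# 81 = 2^6 + 2^4 + 2^0; with a budget of 5 its smallest term is pruned: 81 -> 80.
print(term_reveal([81, 36, 5], 5))         # → [80, 36, 4]
```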
Since the position of the waterline is determined by the distribution of terms in a group, the amount of pruning induced by TR varies across groups of values. Consider two groups of weights, group a and group b. Figure 7 illustrates the quantization error incurred when 4-bit QT (truncating the 2^0 through 2^3 terms) and TR (keeping the top k terms) are applied to both groups. For group a, we see that TR introduces no error, as the group has only 6 terms. By comparison, 4-bit QT introduces error by pruning all of the low-order terms, as conventional quantization keeps only the terms in the 4 highest bit positions of each value. For group b, which has significantly more terms, TR and 4-bit QT perform a similar amount of truncation. Group b represents a worst case for TR, as most groups will have significantly fewer terms.
Therefore, in practice, we can use a small group budget without introducing significant quantization error. By constraining the number of terms across the g values of a group to k, TR is able to ensure a tighter processing bound than 4-bit QT. Specifically, assuming each data value has up to s terms, the maximum number of term pairs per group with TR is reduced to s·k, compared to 4·s·g for 4-bit QT (up to 4 terms per weight), a reduction by a factor of 4g/k.
III-D Term-Pair Reduction for Term Revealing Groups
To more formally quantify the term-pair reduction due to TR, suppose for a group of size 3 that the group budget is k and that the receding water algorithm reveals t_1, t_2, and t_3 terms for weight values w_1, w_2, and w_3, respectively, with t_1 + t_2 + t_3 ≤ k. Suppose further that the data values x_1, x_2, and x_3 have s_1, s_2, and s_3 terms, respectively. Then, with TR, the total number of term pairs to be processed for the dot-product computation between (w_1, w_2, w_3) and (x_1, x_2, x_3) is
t_1·s_1 + t_2·s_2 + t_3·s_3 ≤ max(s_1, s_2, s_3) · (t_1 + t_2 + t_3) ≤ max(s_1, s_2, s_3) · k.
In reality, since most weights and data require significantly fewer than the maximum allotted number of power-of-two terms, most groups will complete the dot-product computation below this bound, as discussed earlier in relation to Figure 5. In this sense, TR can be viewed as shifting the upper bound from 7 × 7 × g = 49g term pairs per group in the baseline case (7 terms each for weights and data) to at most s·k term pairs per group, where s is the maximum number of terms per data value and s·k is much smaller than 49g. In Section V, we utilize this significantly reduced upper bound enabled by TR to implement tightly synchronized processor arrays for DNN inference.
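As a concrete check of this counting argument, the sketch below tallies term pairs for one group and verifies the bound. The term counts are illustrative values of our own choosing, not numbers from the paper:

```python
g, k = 3, 6                   # group size and group budget
t = [3, 2, 1]                 # terms revealed per weight value (sum <= k)
s = [2, 3, 1]                 # terms per data value

# Total term pairs for the group's dot product: t_1*s_1 + t_2*s_2 + t_3*s_3.
pairs = sum(ti * si for ti, si in zip(t, s))

assert sum(t) <= k            # TR's budget constraint
# The pair count is bounded by max(s) * k, far below the 7 * 7 * g baseline.
assert pairs <= max(s) * k < 7 * 7 * g
print(pairs)                  # → 13
```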
III-E Relationship Between Group Size and Group Budget
TR budgets k terms for a group of size g. Let k = αg for some α, where α is the average number of terms budgeted for each value in the group. Recall from Figure 3 that 79% of weight values are represented in 3 or fewer terms. This means that as the group size increases, the average number of budgeted terms per value can approach the mean of the weight term distribution. For the weight term distribution in Figure 3, the mean is small, even though some values have as many as 7 terms. Practically, this means that a larger group size allows for a smaller relative term budget α close to the mean, as it becomes increasingly unlikely that many groups have more than k = αg terms.
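This concentration effect can be seen with a deterministic toy tally. The per-value term counts below are our own illustrative sample, skewed toward small counts like the distribution in Figure 3; with the same per-value budget, fewer groups exceed the total budget as the group size grows:

```python
# Terms needed per value, skewed small (most values need <= 3 terms).
term_counts = [1, 2, 3, 2, 1, 4, 2, 3, 1, 2, 5, 2, 3, 1, 2, 2,
               3, 1, 2, 4, 2, 1, 3, 2, 2, 3, 1, 2, 6, 2, 1, 3]
alpha = 3                                  # budgeted terms per value

results = {}
for g in (2, 4, 16):                       # group size
    k = alpha * g                          # group budget k = alpha * g
    groups = [term_counts[i:i + g] for i in range(0, len(term_counts), g)]
    over = sum(sum(grp) > k for grp in groups)
    results[g] = (over, len(groups))
    print(f"g={g:2d}: {over}/{len(groups)} groups exceed the budget")
```

With g = 2, a couple of groups containing a high-term value exceed k = 6, but with g = 4 or g = 16 the sums average out and no group exceeds its budget.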
III-F Bounding Truncation-induced Error in Dot Products
TR strives to minimize the truncation-induced relative error δ of each weight value. Suppose that 2^u is the waterline determined by TR under a given group budget; that is, terms smaller than 2^u are truncated. Then each kept term has value at least 2^u, and the truncated terms of a value sum to at most 2^(u-1) + ... + 2^0 = 2^u − 1. Thus, for a value w with at least one kept term, the relative error satisfies δ = |w − w̃|/|w| ≤ (2^u − 1)/2^u < 1, and values whose kept terms extend well above the waterline have a correspondingly smaller δ. A larger gap between the kept terms and the waterline results in a reduced upper bound on δ.
We now bound the relative error introduced by TR in truncated dot products between weights and data. For a given group of data (x_1, x_2, ..., x_g), the dot product over the group is y = w_1·x_1 + w_2·x_2 + ... + w_g·x_g, where w_1, ..., w_g are the corresponding weights of the filter. Each weight may be positive, negative, or zero, while data values are nonnegative. For simplicity, we assume here that all w_i are positive, while noting that the result also holds when they are all negative. After TR, each w_i is replaced with a truncated w̃_i in the dot-product computation. Let δ_i denote the relative error of w_i induced by TR, i.e., w̃_i = (1 − δ_i)·w_i. Then, the dot-product result with the truncated weights can be decomposed as follows:
ỹ = Σ_i w̃_i·x_i = Σ_i (1 − δ_i)·w_i·x_i = y − Σ_i δ_i·w_i·x_i.
Therefore, the relative error of the dot product with truncated values, as an approximation to the original dot product, is:
(y − ỹ)/y = (Σ_i δ_i·w_i·x_i) / (Σ_i w_i·x_i).
Suppose that, as described above, by TR we can assure that δ_i ≤ δ_max for i = 1, ..., g. Then, since all products w_i·x_i are nonnegative,
(y − ỹ)/y ≤ δ_max.
Thus, the relative error in the computed dot products is bounded by δ_max.
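The bound can be sanity-checked numerically. The sketch below uses small made-up weights and data (not from any real layer) and verifies that the dot-product relative error never exceeds the largest per-weight truncation error:

```python
def dot(ws, xs):
    return sum(w * x for w, x in zip(ws, xs))

w  = [81, 36, 20]          # original (positive) weights
wt = [80, 36, 20]          # weights after truncating low-order terms (81 -> 80)
x  = [3, 5, 2]             # non-negative data

# Per-weight relative errors delta_i = (w_i - w~_i) / w_i.
deltas = [(wi - wti) / wi for wi, wti in zip(w, wt)]
rel_err = (dot(w, x) - dot(wt, x)) / dot(w, x)

# The dot-product relative error is a weighted average of the delta_i,
# so it cannot exceed the largest per-weight error.
assert 0 <= rel_err <= max(deltas)
print(rel_err, max(deltas))
```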
IV Hybrid Encoding for Shortened Expressions
In this section, we present Hybrid Encoding for Shortened Expressions (HESE), a signed power-of-two encoding which reduces the number of terms required to represent 8-bit fixed-point values for DNN weights and data. HESE complements TR by reducing the number of terms used before TR is applied.
IV-A Signed Power-of-two Representations
Booth encoding [3] converts a conventional binary representation with only positive power-of-two terms (e.g., 30 = 2^4 + 2^3 + 2^2 + 2^1) into a representation with both positive and negative power-of-two terms (e.g., 30 = 2^5 − 2^1). Note that while the underlying value is the same, the signed Booth representation uses only two terms as opposed to four in the positive-only binary case. Booth reduces the number of terms by encoding a string of consecutive 1s, corresponding to positive power-of-two terms (such as 2^4 + 2^3 + 2^2 + 2^1), into a pair of one positive and one negative term (2^5 − 2^1).
Booth radix-4 bounds the number of power-of-two terms in an n-bit value to roughly n/2 [3]. This bound is utilized in the design of efficient Booth multipliers to provide a smaller bound on the amount of computation required for any pair of n-bit values. Such a bound is necessary for synchronization purposes across multiple processing elements (e.g., systolic arrays). In this work, we are interested in representations with even fewer terms, as TR enforces a tighter computational bound by truncating smaller terms in a group.
IV-B Overview of HESE
HESE is a hybrid encoding method that combines Booth, which efficiently handles strings of 1s, with additional rules that reduce the number of terms required for isolated 1s and 0s. Figure 8 shows how HESE encodes 8-bit binary values into signed power-of-two expressions with fewer terms. For example, Booth radix-2 translates 95 (01011111) to 2^7 − 2^6 + 2^5 − 2^0. In comparison, as shown in Figure 8, HESE translates the value to the shorter expression 2^6 + 2^5 − 2^0. The rules generating the output for this example combine the strategy of Booth radix-2 encoding for strings of 1s with that of the standard binary representation for isolated 1s surrounded by zeros. For instance, in the example shown in Figure 8, 95 (01011111) has a single isolated 1 in the 7th bit position and a string of 1s from the 5th to 1st bit positions. This translates to 2^6 (for the isolated 1) plus 2^5 − 2^0 (for the string of 1s). By using the third rule in Figure 9a for an isolated 0, we can save an additional term for values such as 55 (0110111) by translating it into 2^6 − 2^3 − 2^0. Because this rule is rarely needed for 7-bit binary values, for implementation simplicity, we omit it in the design and analysis reported in this paper. That is, we use only the other four rules.
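The full HESE rule set is specified by the paper's rule table, so as a self-contained illustration we sketch a classic alternative with the same flavor: the non-adjacent form (NAF), a signed power-of-two encoding with provably minimal term count that likewise shortens runs of 1s. This is a stand-in of ours, not the paper's HESE algorithm:

```python
def naf(n: int):
    """Non-adjacent form: a minimal-weight signed power-of-two encoding.
    Returns a list of (sign, exponent) pairs with sign in {+1, -1}."""
    terms = []
    e = 0
    while n:
        if n & 1:
            d = 2 - (n & 3)        # choose +1 or -1 so the next bit becomes 0
            terms.append((d, e))   # signed term d * 2^e
            n -= d
        n >>= 1
        e += 1
    return terms

# 95 = 01011111 needs six terms in plain binary, but only three signed terms:
# NAF gives 2^7 - 2^5 - 2^0 (HESE's 2^6 + 2^5 - 2^0 is equally short).
print(naf(95))                     # → [(-1, 0), (-1, 5), (1, 7)]
```

Like HESE, NAF never uses more terms than the plain binary representation, which is the property TR exploits.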
IV-C Reducing the Number of Terms per Encoding
HESE encodings have strictly equal or fewer terms than binary and Booth radix-4. Figure 8 shows the number of terms required for these encodings across two distributions of values: data values from ResNet-18 and values drawn from a uniform distribution over the same range as the data. The x-axis is the number of terms required to represent a value, and the y-axis is the cumulative percentage of values that can be represented within a given number of terms. HESE outperforms both Booth and binary across both distributions of values.
As expected, Booth leads to more compact representations than binary for values drawn from the uniform distribution. However, most of the reduction in terms comes from larger values in the 8-bit range (with many 1s), which occur much less frequently in the data, as depicted in Figure 3 (bottom). Therefore, radix-4 performs equal to or worse than binary for the distribution of data values we are interested in. By comparison, when applying HESE to the data, the great majority of values are represented in 3 or fewer terms. Practically, this means we can use 3 power-of-two terms for both weights and data.
V Hardware Design for Efficient Term Revealing
In this section, we present our hardware design for TR-based quantization. Figure 9 provides an overview of the TR system design, consisting of the following components: (1) weight and data buffers which store DNN layer weights and input/intermediate data, (2) a systolic array which performs dot products between weights and data using term MACs (tMACs) described in Section V-B, (3) a binary stream converter to convert the systolic array output into a binary representation (Section V-C), (4) a ReLU block (Section V-C), (5) a HESE encoder (Section V-D) to convert the binary representations into shortened signed expressions, and (6) a term comparator (Section V-E) which applies TR by selecting the top k terms in a group. In Section V-A, we first give some high-level reasoning on how tMAC can save computation.
V-A High-level Comparison Between Bit-parallel MAC (pMAC) and Term MAC (tMAC)
To help understand the inherent advantage of TR, we provide a high-level argument for how our proposed term MAC (tMAC) saves a significant amount of work over a conventional bit-parallel MAC (pMAC). Here, we define work as the amount of computation, including both arithmetic and bookkeeping operations, performed per group. The work incurred by a method largely determines the energy, area, and latency of its implementation.
To be concrete, we study a 1D systolic array of 3 cells, as depicted in Figure 10, for processing groups of 3 data values (x_1, x_2, x_3) in computing their dot products with weights (w_1, w_2, w_3) prestored in the systolic array. To provide a baseline for comparison, we consider a conventional implementation where each cell is a pMAC performing an 8-bit bit-parallel multiplication w_i·x_i and a 32-bit accumulation adding the computed product to an intermediate result. In comparison, Figure 10 depicts a tMAC-based implementation, which significantly reduces the work by processing only the term pairs that are present. The number of terms is relatively small due to the high bit-level sparsity generally present in CNN weights and data. This comparison result applies to a general 2D systolic array, which is a stack of 1D systolic arrays. In Section VII-A, we show how this analysis translates to realized performance on an FPGA with a group size of 16.
For this illustrative analysis, we assume that tMAC uses a TR group of size 3 with a budget of 6 terms for the weight values, and 2 leading terms per data value under HESE (Section IV-B). Thus, for weights, each value in a group uses on average 2 terms. As we show in Table III, under similar settings, TR incurs a minimal decrease in classification accuracy when dropping lower-order terms exceeding the group budget across multiple CNNs.
Our analysis of work proceeds as follows. A conventional pMAC implementation of a single systolic cell incurs 7 8-bit additions for the multiplication and 1 32-bit accumulation operation for the running sum. Therefore, the pMAC implementation of a 1D systolic array with three cells requires 21 8-bit additions and 3 32-bit accumulation operations. In contrast, a tMAC implementation incurs significantly less work. Specifically, it uses at most 12 3-bit additions on the exponents of power-of-two terms (weight and data exponents are both less than 8) for term-pair multiplications. (Recall that we assume 6 weight terms per group and 2 terms per data value, as depicted in Figure 10, for at most 12 term pairs.) The updating of the accumulating coefficient vector (discussed in detail in the next section) requires bookkeeping operations for bit alignment, etc., whose work we assume is no larger than the equivalent of another 12 3-bit additions. Thus, tMAC substantially reduces work compared to pMAC: 24 3-bit additions vs. 21 8-bit additions plus 3 32-bit accumulations.
V-B Term MAC (tMAC) Design
The term MAC (tMAC) performs the dot product between a data vector and a weight vector of group size g by multiplying all term pairs. Figure 11 illustrates how these term-pair multiplications are performed in tMAC for a group of size g and a group budget k. In this example, TR ensures that there are k or fewer terms across all weight values in the group. For illustration simplicity, assume all data values can be represented with a single term (in our implementation there are as many as 3 terms per data value). Under these assumptions, at most k term-pair multiplications are performed, and the results are added to a coefficient vector depicted in the upper right of Figure 11.
The coefficient vector stores the current partial result of the dot product as coefficients, one for each power of two. For each term pair, the two exponents are added, and the coefficient for the resulting power of two is incremented or decremented by 1; for the first term pair in the figure, the coefficient is decremented, as the signs of the two terms differ. Once all exponent additions are completed for a dot product, the coefficient vector is reduced to a single value. As the largest term exponent under HESE is 7, the largest term pair is 2^7 × 2^7 = 2^14. Therefore, the coefficient vector has a length of 15 in order to store all possible term-pair results from 2^0 to 2^14. To ensure that overflow is not possible for the dot-product lengths encountered in DNN layers, each element in the coefficient vector is 12 bits.
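A behavioral model of this accumulation scheme fits in a few lines. The sketch below is our own simplification (it ignores bit-serial timing and the group budget) showing how exponent sums update a coefficient vector that is then reduced to the final dot product:

```python
def to_terms(v: int):
    """Decompose a non-negative magnitude into exponents of its binary terms."""
    return [e for e in range(8) if v & (1 << e)]

def tmac_dot(weights, data, width=15):
    """Dot product via term-pair multiplications: each pair adds exponents and
    applies a +/-1 update to a coefficient vector, reduced at the end."""
    coeff = [0] * width
    for w, d in zip(weights, data):
        sign = -1 if (w < 0) != (d < 0) else 1
        for ew in to_terms(abs(w)):
            for ed in to_terms(abs(d)):
                coeff[ew + ed] += sign     # exponents add; signs multiply
    return sum(c * (1 << e) for e, c in enumerate(coeff))

ws, ds = [3, -6, 12], [5, 2, 7]
assert tmac_dot(ws, ds) == sum(w * d for w, d in zip(ws, ds))
print(tmac_dot(ws, ds))                    # → 87
```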
The hardware design of tMAC is shown in Figure 12. The exponents of term pairs are stored in data and weight exponent arrays, with the sign of each term stored in parallel arrays with one bit per term. For instance, a term such as −2^5 would store a 5 in the exponent array and a minus in the sign array. The yellow, red, blue, and green colors denote the term-pair boundaries for each data-weight multiplication. The exponent duplicator takes in data exponents and duplicates them based on the number of weight exponents in each value. Each cycle, a pair of exponents from these two arrays is passed into the adder, which computes the sum of the exponents, sets the sign, and sends the result to a coefficient accumulator (CA) (Figure 12) within one cycle. Therefore, processing a group with 8 term pairs takes 8 cycles in total.
The CAs perform bit-serial addition between the coefficient vector and the output of the exponent adder in the tMAC. Due to the bit-serial design, the number of CAs must match the size of the data and weight register arrays (8 in this example) in order to maintain synchronization across the cells of the systolic array. In each cycle, one of the eight CAs takes the sum of two exponents from the adder and adds/subtracts 1 to/from the corresponding coefficient. In our implementation, each tMAC can choose to reuse the current coefficient vector or take the new coefficient vector from its neighboring cell via a selection signal, as depicted in Figure 12.
V-C Binary Stream Converter and ReLU Block
The binary stream converter takes the coefficient vectors output from the systolic array and transforms them into a binary format by multiplying each element of the coefficient vector with the corresponding power-of-two term and then summing the partial results. The outputs of the binary stream converter are sent to the ReLU block in a bit-serial fashion. Since the outputs use a two's complement representation, the sign can be determined by detecting the most significant bit of each output stream. The ReLU block buffers all the lower bits until the MSB arrives. Then, it outputs zero if the MSB indicates that the value is negative; otherwise, it outputs the original bit stream.
V-D HESE Encoder
The HESE encoder produces two bit streams, which represent the magnitude and sign of each power-of-two term, respectively. For example, for a bit-serial input of 95 (01011111), the HESE encoder produces a magnitude stream with 1s marking the 2^6, 2^5, and 2^0 terms and a sign stream marking the 2^0 term as negative, indicating 2^6 + 2^5 − 2^0. The HESE encoder is implemented as a finite state machine.
V-E Term Comparator
The term comparator in Figure 13 selects the top k terms from the outputs of every g consecutive HESE encoders, where k and g are the group budget and group size, respectively. Figure 13 shows the operation of the term comparator on the outputs of four HESE encoders; the HESE encoder outputs are divided into two groups, where each group has a group size of 2 and a group budget of k. The inputs enter the term comparator in reverse order, such that their most significant bits (MSBs) enter the term comparator first. Each cycle, the term comparator counts the total number of nonzero bits encountered so far and truncates the remaining low-order terms once the group budget is reached for a group.
The term comparator contains multiple accumulate-and-compare (A&C) blocks arranged in a tree structure. Each A&C block takes a single input bit stream and counts the total number of nonzero bits in the stream. Figure 14 shows how the A&C blocks can be reconfigured for different group sizes g. For g = 1, the A&C blocks on the first level of the tree compare the number of nonzero bits in their input stream against the group budget k and truncate each stream accordingly. If the group size is larger than 1 (e.g., 2), each A&C block in the first level of the tree forwards its input stream together with its nonzero-bit count to its parent A&C block. The parent A&C block then operates on these two streams in a similar fashion to its children. The tree architecture allows for minimal changes to the term comparator under different group sizes, which leads to a low reconfiguration overhead and a maximum level of hardware reuse.
V-F Memory Subsystem
Our memory subsystem consists of a data buffer and a weight buffer. The data buffer holds the term exponents and signs for both the input and result data of the current layer, and the weight buffer holds the term exponents and signs of the weights for each group. For the weight buffer, we use double buffering to prefetch the next weight tile from the off-chip DRAM so that the computation of the systolic array can overlap with the data transfer from the off-chip DRAM to the weight buffer. Note that TR does not reduce the storage complexity of the model, as each weight is stored in an 8-bit fixed-point format.
V-G FPGA Reconfiguration for QT and TR
Our TR system can be easily reconfigured for different group sizes and group budgets, adapting to dynamic requirements on the group size and group budget during inference with negligible delay. In addition, our system also supports conventional quantization (QT) by performing power-of-two operations on binary representations. Since QT requires neither TR nor HESE encoding, the term comparator and HESE encoders can be turned off via clock gating to reduce power consumption. Table I summarizes all of the control registers that need to be modified when switching between TR and QT. The switching process takes only several clock cycles (i.e., within 100 ns for our FPGA implementation).
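As a rough software model of this reconfigurability (the actual register names and fields are given in Table I, which is not shown in this excerpt, so everything below is hypothetical), switching between TR and QT amounts to rewriting a few mode bits and clock-gating the TR-specific blocks:

```python
from dataclasses import dataclass

@dataclass
class TRConfig:
    """Hypothetical control-register model; the real registers are in Table I."""
    group_size: int = 8
    group_budget: int = 16
    mode: str = "TR"    # "TR" or "QT"

    def clock_gate_mask(self):
        # In QT mode the term comparator and HESE encoders are clock-gated off.
        on = self.mode == "TR"
        return {"term_comparator": on, "hese_encoder": on}

# Switching modes is just a register rewrite; no resynthesis is needed.
cfg = TRConfig(mode="QT")
print(cfg.clock_gate_mask())   # → {'term_comparator': False, 'hese_encoder': False}
```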
VI Term Revealing Evaluation
In this section, we evaluate the performance of TR when applied to an MLP on MNIST [35], a broad range of CNNs (VGG16 [36], ResNet-18 [33], MobileNetV2 [37], and EfficientNet-b0 [38]) on ImageNet [34], and an LSTM [39] on WikiText-2 [40]. In Section VI-A, we compare TR against conventional uniform quantization (QT) on the performance (i.e., accuracy or perplexity) of these DNNs. Then, in Section VI-B, we analyze how the average number of terms and the group size impact classification accuracy. Next, in Section VI-C, we analyze the individual contributions of HESE and TR to model performance. Finally, in Section VI-D, we show that the quantization error introduced by TR is substantially less than that of a more aggressive QT setting (e.g., 6-bit uniform quantization).
To perform this analysis, we implemented a CUDA kernel for TR which only marginally increases the inference runtime of a pretrained model running on an NVIDIA 1080 Ti. This means that the validation accuracy of a pretrained CNN for ImageNet can still be obtained within several minutes. Using pretrained models has the advantage of making parameter search (e.g., over the group size and term budget) simple compared to methods such as weight pruning [4], which require model retraining that takes hours or days for each setting. Before applying TR, each model is quantized from 32-bit floating point to 8-bit fixed point using the layerwise procedure described in [41].
VI-A Comparing Term Revealing to Uniform Quantization
Motivated by the design in Section V, we are interested in minimizing the number of term pair multiplications per sample, as this directly translates to the processing latency of a sample. For the uniform quantization (QT) approach with 8-bit fixed-point weights and data, each multiplication translates to up to 8 × 8 = 64 term pair multiplications. By comparison, for term revealing (TR), the number of term pair multiplications is instead bounded by the group budget, which is shared across a group of values. We show that TR gives a significant reduction (e.g., 3-10×) over QT while maintaining nearly identical performance (e.g., within 0.1% accuracy).
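To make this cost proxy concrete, the following sketch counts term pair multiplications under plain binary encoding versus a TR-style group bound (the specific group size, budget, and per-value term counts below are illustrative, not the paper's settings).

```python
def popcount(v):
    return bin(v).count("1")

def qt_term_pairs(w, x):
    """Term pairs for one multiplication under plain binary encoding:
    every nonzero bit of w pairs with every nonzero bit of x."""
    return popcount(w) * popcount(x)

# Worst case for 8-bit QT: both operands have all eight bits set.
print(qt_term_pairs(0xFF, 0xFF))   # → 64

# Under TR, the weights of a group of size g contribute at most k terms
# in total, so the group's term pairs are bounded by k times the number
# of terms per data value (illustrative numbers, not the paper's).
g, k, data_terms = 8, 8, 3
print(k * data_terms, "vs up to", g * 64)   # → 24 vs up to 512
```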
VI-A1 MLP on MNIST
We train an MLP with one hidden layer of 512 neurons for MNIST, using the parameter settings given in the PyTorch examples for MNIST (https://github.com/pytorch/examples/tree/master/mnist). Figure 16 (left) shows the performance of QT and TR applied to the pretrained MLP. TR achieves a significant reduction in the number of term pair multiplications over QT while achieving a classification accuracy of 98.4% (compared to the 98.5% baseline).
VI-A2 CNNs on ImageNet
We use pretrained models provided by the PyTorch torchvision package (https://github.com/pytorch/vision/tree/master/torchvision/models) for VGG16, ResNet-18, and MobileNetV2, and a PyTorch implementation of EfficientNet (https://github.com/lukemelas/EfficientNet-PyTorch) with pretrained models. Figure 16 (center) shows the performance of TR and QT for the four CNNs. TR achieves a 14× reduction in term pair multiplications over QT for VGG16, which is known to be significantly overprovisioned (e.g., amenable to quantization and pruning). Even for more recent models with significantly fewer parameters, such as MobileNetV2 and EfficientNet-b0, TR still achieves a 4× and 6× reduction in term pair multiplications, respectively, losing less than 0.1% classification accuracy compared to the 8-bit QT settings. Generally, we see that more aggressive TR settings (e.g., with a reduced group budget) degrade accuracy more gracefully than more aggressive QT settings (e.g., with a reduced bitwidth for weights).
VI-A3 LSTM on WikiText-2
We train a 1-layer LSTM with 650 hidden units (i.e., neurons), a word embedding of length 650, and a dropout rate of 0.5, following the PyTorch word language model example (https://github.com/pytorch/examples/blob/master/word_language_model). This baseline model achieves a perplexity of 86.85. Figure 16 (right) shows how the perplexity of the pretrained model is impacted by QT and TR. Again, we find that TR reduces the number of term pair multiplications by a significant factor while achieving the same perplexity.
VI-B Improved Term Allocation with Larger Group Size
Figure 16 shows the classification accuracy for ResNet-18 as the average number of terms per value α is varied for different group sizes. As the group size g increases, the variance in the number of terms across values in a group shrinks, meaning that a larger group budget k at a fixed ratio α = k/g is strictly better than a smaller k for the same α. As observed, the classification accuracy for a larger group size strictly outperforms all settings with smaller group sizes. For instance, a group size of 8 with an α of 1 achieves a classification accuracy of 67.72%, which is 5.21% better than a group size of 1 at the same α. Note that a group size of 1 is equivalent to truncating each value to exactly α terms.
VI-C Isolating the Impact of TR and HESE
Figure 17 shows the relative impact of TR and HESE on classification accuracy by measuring them in isolation. The HESE and QT settings (without TR) apply term truncation by keeping only the top α terms in each individual weight. In this case, the group budget equals α, the number of terms kept per weight, as the group size is 1 (i.e., there is no grouping). We see that HESE substantially outperforms QT until a sufficient number of top terms are kept per weight, because HESE requires fewer terms to represent each value. For the settings with TR, QT + TR and HESE + TR, we use a larger group size with a range of term budget values chosen to generate values of α comparable to the settings without TR. We find that TR improves the performance of both the QT and HESE encoding methods, with HESE + TR achieving the best performance.
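HESE is defined earlier in the paper; to illustrate why a signed-digit encoding needs fewer terms than plain binary, the sketch below uses the classic non-adjacent form (NAF) recoding as a stand-in (NAF is our substitution and may differ from HESE in its details).

```python
def naf(v):
    """Non-adjacent form: signed digits in {-1, 0, +1}, LSB first."""
    digits = []
    while v != 0:
        if v & 1:
            d = 2 - (v % 4)   # +1 if v % 4 == 1, -1 if v % 4 == 3
            v -= d
        else:
            d = 0
        digits.append(d)
        v //= 2
    return digits

def nonzero_terms(digits):
    return sum(1 for d in digits if d != 0)

# 255 needs eight terms in plain binary, but only two signed terms: 256 - 1.
print(nonzero_terms(naf(255)), bin(255).count("1"))   # → 2 8
```

Any signed-digit recoding of this kind reduces the average number of nonzero terms per value, which is why HESE-style encodings outperform plain binary truncation at the same term budget.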
VI-D Quantization Error Analysis
TR's superior performance (e.g., accuracy or perplexity) over QT, discussed in Section VI-A, is due to TR introducing less quantization error. Figure 18 shows the quantization error across the layers of ResNet-18 for three QT settings (from 6-bit to 8-bit) and for TR. We see that TR introduces only a small amount of quantization error over 8-bit QT, which makes sense as TR is applied on top of 8-bit QT. The 7-bit and 6-bit QT settings truncate the low-order terms of every value, leading to larger quantization error and reduced classification accuracy.
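This gap can be reproduced qualitatively with a small simulation: truncating to 6 bits removes the two low-order bits of every value, while TR removes only the lowest-ranked terms group-wide. The group size of 8 and budget of 28 below are illustrative choices, not the paper's exact settings.

```python
import random

def terms(v):
    """Exponents of the nonzero bits of an 8-bit value."""
    return [e for e in range(8) if (v >> e) & 1]

def truncate_bits(v, drop):
    """Lower-precision QT: zero out the `drop` least significant bits."""
    return (v >> drop) << drop

def tr_group(group, k):
    """Keep the k largest-exponent terms across the whole group."""
    ranked = sorted(((e, i) for i, v in enumerate(group) for e in terms(v)),
                    key=lambda t: -t[0])[:k]
    out = [0] * len(group)
    for e, i in ranked:
        out[i] |= 1 << e
    return out

random.seed(0)
vals = [random.randint(0, 255) for _ in range(8000)]

# 6-bit QT: every value loses its two low-order bits.
qt_err = sum((v - truncate_bits(v, 2)) ** 2 for v in vals) / len(vals)

# TR on top of 8-bit: group size 8, keep the top 28 terms per group.
tr_err = 0.0
for i in range(0, len(vals), 8):
    group = vals[i:i + 8]
    tr_err += sum((a - b) ** 2 for a, b in zip(group, tr_group(group, 28)))
tr_err /= len(vals)

print(qt_err, tr_err)   # TR's mean squared error is far smaller than 6-bit QT's
```

Because TR prunes only the globally smallest terms, most values in a group are left untouched, whereas a lower QT bitwidth truncates every value.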
VII FPGA Evaluation
In this section, we evaluate the hardware performance of the TR system described in Section V. We synthesized our TR system for the Xilinx VC707 FPGA evaluation board. We first compare the performance of tMAC against a bit-parallel MAC in Section VII-A. Then, we demonstrate that our TR system can implement both QT and TR in Section VII-B. Finally, in Section VII-C, we compare our TR system against other FPGA-based CNN accelerators on ResNet-18.
VII-A Comparing Performance of Bit-parallel MAC and tMAC
In this section, we evaluate the hardware performance of a single tMAC by comparing it against the bit-parallel MAC (pMAC) shown in Figure 10. For both designs, we perform a group of MAC computations: y = a + w_1 x_1 + … + w_g x_g, where y, a, w_i, and x_i are 32-bit, 32-bit, 8-bit, and 8-bit, respectively, and g is the number of elements in the weight and data vectors (i.e., the group size in TR). In one cycle, the pMAC performs an 8-bit multiplication between w_i and x_i and a 32-bit accumulation of the result into the running sum. Therefore, y is generated in g cycles. By comparison, the tMAC takes a variable number of cycles to process each multiplication in the group, depending on the number of term pairs in the multiplication. In total, it requires no more than t_d · k cycles, where t_d is the maximum number of terms per data value and k is the group term budget.
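A toy cycle-count model makes the trade-off concrete (the bit patterns and variable names are ours; we assume one term pair is processed per tMAC cycle): the tMAC may use more, but far cheaper, cycles than the pMAC, and its total is bounded by the data term count times the group budget.

```python
def popcount(v):
    return bin(v).count("1")

def pmac_cycles(g):
    # pMAC: one 8-bit multiply plus 32-bit accumulate per cycle.
    return g

def tmac_cycles(weights, data):
    # tMAC: one term pair (3-bit exponent add + signed accumulate) per cycle.
    return sum(popcount(w) * popcount(x) for w, x in zip(weights, data))

w = [0b01000001, 0b00010000, 0b00000100, 0b00000011]   # TR-pruned weights (6 terms)
x = [0b00001100, 0b01000000, 0b00100010, 0b00010001]   # data values (<= 2 terms each)

k = sum(popcount(v) for v in w)      # group term budget actually used: 6
t_d = max(popcount(v) for v in x)    # max terms per data value: 2
print(pmac_cycles(len(w)))           # → 4
print(tmac_cycles(w, x), "<=", t_d * k)   # → 11 <= 12
```

Although the tMAC uses more cycles here, each cycle replaces a full 8-bit multiply and 32-bit accumulate with a small exponent addition, which is the source of its resource and energy advantage.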
Table II shows the FPGA resource consumption of the two MAC designs in terms of Look-Up Tables (LUTs) and Flip-Flops (FFs). The tMAC consumes fewer LUTs and FFs than the pMAC, as it performs 3-bit exponent additions as opposed to the 8-bit multiplications and 32-bit accumulations in the pMAC.
We evaluate the two designs in terms of energy efficiency, the ratio between throughput and power consumption. Table III shows the energy efficiency and classification accuracy of the two designs across four CNNs. For the tMAC settings, parameter values are selected for each CNN such that the classification accuracy stays competitive with the baseline model (a small difference in accuracy across all settings) while keeping the group size fixed. For each CNN, the energy efficiency of both MAC designs is normalized to that of the pMAC. We observe that tMAC achieves superior energy efficiency on average compared to pMAC across the four CNNs. This reflects that pMAC performs more work than tMAC, as our analysis in Section V-A shows.
VII-B System Comparison of QT and TR
In this section, we compare the hardware performance of TR against QT for the DNNs shown in Figure 16. The systolic array in the TR system has 128 rows by 64 columns, with each systolic cell implementing a tMAC. The group budget is chosen independently for each network such that TR is within 0.15% accuracy of the QT setting (or within 0.05 perplexity for the LSTM). We use the same TR system (Figure 9) for the implementation of both QT and TR in order to show the reconfigurability of our design. The implementation of QT does not require group-based ranking or HESE encoding, so we turn off these components of the hardware system to reduce dynamic power consumption. All control registers are configured based on Table I.
We evaluate our FPGA system with the following performance metrics: (1) average processing latency, the time for the hardware system to generate a prediction result, and (2) energy efficiency, the average amount of energy required to process a single input sample. As shown in Figure 19, our TR system outperforms QT on average in terms of both processing latency and energy efficiency. For more difficult tasks, such as WikiText-2 for the LSTM, a more conservative group budget is selected, leading to less relative improvement over QT. For overprovisioned models (e.g., VGG16), a more aggressive group budget is used, leading to more substantial improvements in latency and energy efficiency.
VII-C FPGA System Evaluation
In this section, we evaluate our TR system on ResNet-18 for ImageNet, using fixed group size and group budget settings. While an even larger group size could theoretically lead to additional savings, Figure 16 shows diminishing returns as the group size grows. Additionally, larger group sizes increase the complexity of the term comparator due to the additional tree levels of A&C blocks.
We compare our TR system with other FPGA-based accelerators that implement different CNN architectures (e.g., AlexNet) on ImageNet. We evaluate our design in terms of the average processing latency per input sample, the energy efficiency of the hardware system, and the classification accuracy. As shown in Table IV, our design achieves the highest classification accuracy and energy efficiency (in frames/J), and the second lowest latency.
Our hardware system achieves the best performance for multiple reasons. First, TR coupled with the proposed HESE encoding greatly reduces the number of term pair multiplications, which reduces the number of cycles in the tMACs. TR allows tMACs to achieve a much tighter processing bound on term pairs per group than standard binary encoding without TR. Second, the bit-serial design of the coefficient accumulator in the tMAC, together with the systolic architecture of the computing engine, leads to a highly regular layout with low routing complexity.
VIII Conclusion
We proposed term revealing (TR) as a general runtime approach for furthering quantized computation on already quantized DNNs. Departing from conventional quantization, which operates on individual values, TR is a group-based method that keeps a fixed number of terms within a group of values. TR leverages the weight and data distributions of DNNs, so that it can achieve good model performance even with a small group budget. We measure the computation cost of TR-enabled quantization using the number of term pair multiplications per inference sample. Under this clearly defined cost proxy, we have shown that TR significantly lowers computation costs for MLPs, CNNs, and LSTMs. As shown in Section VII-B, this reduction in operations translates to improved energy efficiency and reduced latency over conventional quantization for our FPGA system. Furthermore, our FPGA system demonstrates that by changing a small number of control bits we can reconfigure a quantized computation under conventional quantization to one under TR-enabled quantization, and vice versa (Table I). Quantization is one of the most widely used approaches for streamlining DNNs; the TR method proposed in this paper brings the success of the quantization approach to another level.
References
 [1] “Smart compose: Using neural networks to help write emails.” Available at: https://ai.googleblog.com/2018/05/smart-compose-using-neural-networks-to.html.

 [2] D. Lin, S. Talathi, and S. Annapureddy, “Fixed point quantization of deep convolutional networks,” in International Conference on Machine Learning, pp. 2849–2858, 2016.
 [3] A. D. Booth, “A signed binary multiplication technique,” The Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, no. 2, pp. 236–240, 1951.
 [4] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
 [5] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2074–2082, 2016.
 [6] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger, “Condensenet: An efficient densenet using learned group convolutions,” arXiv preprint arXiv:1711.09224, 2017.
 [7] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in International Conference on Computer Vision (ICCV), vol. 2, 2017.
 [8] J.-H. Luo, J. Wu, and W. Lin, “Thinet: A filter level pruning method for deep neural network compression,” arXiv preprint arXiv:1707.06342, 2017.
 [9] S. Narang, E. Undersander, and G. F. Diamos, “Block-sparse recurrent neural networks,” CoRR, vol. abs/1711.02782, 2017.
 [10] S. Gray, A. Radford, and D. Kingma, “Gpu kernels for block-sparse weights.” https://s3-us-west-2.amazonaws.com/openai-assets/blocksparse/blocksparsepaper.pdf, 2017. [Online; accessed 12-January-2018].
 [11] H. T. Kung, B. McDanel, and S. Q. Zhang, “Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization,” 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2019.
 [12] A. Ren, T. Zhang, S. Ye, J. Li, W. Xu, X. Qian, X. Lin, and Y. Wang, “Admm-nn: An algorithm-hardware co-design framework of dnns using alternating direction methods of multipliers,” in Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 925–938, ACM, 2019.
 [13] M. Courbariaux, Y. Bengio, and J.-P. David, “Training deep neural networks with low precision multiplications,” arXiv preprint arXiv:1412.7024, 2014.

 [14] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International Conference on Machine Learning, pp. 1737–1746, 2015.
 [15] C. Zhu, S. Han, H. Mao, and W. J. Dally, “Trained ternary quantization,” arXiv preprint arXiv:1612.01064, 2016.
 [16] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Quantized neural networks: Training neural networks with low precision weights and activations,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869–6898, 2017.

 [17] E. Park, J. Ahn, and S. Yoo, “Weighted-entropy-based quantization for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5456–5464, 2017.
 [18] S. Kapur, A. Mishra, and D. Marr, “Low precision rnns: Quantizing rnns without losing accuracy,” arXiv preprint arXiv:1710.07706, 2017.
 [19] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, “Incremental network quantization: Towards lossless cnns with low-precision weights,” arXiv preprint arXiv:1702.03044, 2017.
 [20] P. Wang and J. Cheng, “Fixed-point factorized networks,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 3966–3974, IEEE, 2017.
 [21] C.-Y. Chen, J. Choi, K. Gopalakrishnan, V. Srinivasan, and S. Venkataramani, “Exploiting approximate computing for deep learning acceleration,” in 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 821–826, IEEE, 2018.

 [22] E. Park, D. Kim, and S. Yoo, “Energy-efficient neural network accelerator based on outlier-aware low-precision computation,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 688–698, IEEE, 2018.
 [23] E. Park, S. Yoo, and P. Vajda, “Value-aware quantization for training and inference of neural networks,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 580–595, 2018.
 [24] Q. Hu, P. Wang, and J. Cheng, “From hashing to cnns: Training binary-weight networks via hashing,” arXiv preprint arXiv:1802.02733, 2018.
 [25] B. McDanel, S. Q. Zhang, H. T. Kung, and X. Dong, “Full-stack optimization for accelerating cnns using powers-of-two weights with fpga validation,” International Conference on Supercomputing, 2019.

 [26] A. Li, T. Geng, T. Wang, M. Herbordt, S. L. Song, and K. Barker, “Bstc: a novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–30, 2019.
 [27] M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in neural information processing systems, pp. 3123–3131, 2015.
 [28] J. Albericio, A. Delmás, P. Judd, S. Sharify, G. O’Leary, R. Genov, and A. Moshovos, “Bit-pragmatic deep neural network computing,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 382–394, ACM, 2017.
 [29] A. Delmas, P. Judd, D. M. Stuart, Z. Poulos, M. Mahmoud, S. Sharify, M. Nikolic, and A. Moshovos, “Bit-tactical: Exploiting ineffectual computations in convolutional neural networks: Which, why, and how,” 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2019.
 [30] H. T. Kung, “Why systolic architectures?,” IEEE Computer, vol. 15, pp. 37–46, 1982.
 [31] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
 [32] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 [34] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255, IEEE, 2009.
 [35] Y. LeCun, “The mnist database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
 [36] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” arXiv preprint arXiv:1312.6034, 2013.
 [37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520, 2018.
 [38] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.

 [39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
 [40] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” arXiv preprint arXiv:1609.07843, 2016.
 [41] J. H. Lee, S. Ha, S. Choi, W.-J. Lee, and S. Lee, “Quantization for rapid deployment of deep neural networks,” arXiv preprint arXiv:1810.05488, 2018.
 [42] X. Zhang, J. Wang, C. Zhu, Y. Lin, J. Xiong, W.-m. Hwu, and D. Chen, “Dnnbuilder: an automated tool for building high-performance dnn hardware accelerators for fpgas,” in Proceedings of the International Conference on Computer-Aided Design, p. 56, ACM, 2018.
 [43] Y. Shen, M. Ferdman, and P. Milder, “Maximizing cnn accelerator efficiency through resource partitioning,” in Computer Architecture (ISCA), 2017 ACM/IEEE 44th Annual International Symposium on, pp. 535–547, IEEE, 2017.
 [44] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang, N. Xu, S. Song, et al., “Going deeper with embedded fpga platform for convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, pp. 26–35, ACM, 2016.
 [45] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, “Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on fpgas,” in Proceedings of the 54th Annual Design Automation Conference 2017, p. 62, ACM, 2017.