Term Revealing: Furthering Quantization at Run Time on Quantized DNNs

07/13/2020
by   H. T. Kung, et al.
Harvard University

We present a novel technique, called Term Revealing (TR), for furthering quantization at run time for improved performance of Deep Neural Networks (DNNs) already quantized with conventional quantization methods. TR operates on power-of-two terms in the binary expressions of values. In computing a dot product, TR dynamically selects a fixed number of the largest terms to use from the values of the two vectors. By exploiting the normal-like weight and data distributions typically present in DNNs, TR has a minimal impact on DNN model performance (i.e., accuracy or perplexity). We use TR to facilitate tightly synchronized processor arrays, such as systolic arrays, for efficient parallel processing. We show an FPGA implementation that can use a small number of control bits to switch between conventional quantization and TR-enabled quantization with a negligible delay. To enhance TR efficiency further, we propose HESE (Hybrid Encoding for Shortened Expressions), a signed encoding of values, as opposed to classic binary encoding with nonnegative power-of-two terms. We evaluate TR with HESE-encoded values on an MLP for MNIST, multiple CNNs for ImageNet, and an LSTM for Wikitext-2, and show significant reductions in inference computations (3-10x) compared to conventional quantization for the same level of model performance.


I Introduction

Deep Neural Networks (DNNs) have achieved state-of-the-art performance across a variety of domains, including Recurrent Neural Networks (RNNs) and Transformers for natural language processing and Convolutional Neural Networks (CNNs) for computer vision. However, the high computational complexity of DNNs makes them expensive to deploy at scale in datacenter contexts, as a popular model (e.g., Google's Smart Compose email autocomplete [1]) may be queried millions of times per day, with each query requiring 10s to 100s of GFLOPs.

To address these high computational costs, significant research effort has been devoted to techniques that reduce the computational complexity of pre-trained DNNs. One of the most commonly used techniques is post-training quantization (see, e.g., [2]), where 32-bit floating-point DNN weights and data (activations) are converted to a fixed-point representation (e.g., 8-bit fixed-point) to reduce the amount of computation performed per inference sample. One benefit of post-training quantization is that it does not require access to the original training dataset, and can therefore be applied by a third party (such as a cloud service) as a step to reduce costs. In this work, we define a term as a nonzero bit in a quantized fixed-point value. For instance, we say that the 8-bit value 3 (00000011) is composed of two terms: $2^1 + 2^0$.
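To make the notion of a term concrete, the short Python sketch below (our own illustration, not part of the paper's implementation) lists the power-of-two terms of a nonnegative quantized value.

```python
def terms(value: int) -> list[int]:
    """Return the exponents of the nonzero bits (terms) of a nonnegative integer."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

# The 8-bit value 3 (00000011) has two terms: 2^1 + 2^0.
assert terms(3) == [0, 1]
# The value 6 (00000110) also has two terms: 2^2 + 2^1.
assert terms(6) == [1, 2]
```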

Fig. 1: In conventional quantization, a 32-bit floating-point weight matrix (left) is converted to an 8-bit fixed-point format via uniform quantization (middle). We propose further quantization via Term Revealing (TR), a group-based run-time quantization method (right). By limiting the number of power-of-two terms across a group of values, TR enables a tighter processing bound for DNN dot-product computations.

In this paper, we further quantize the computation of an already quantized DNN at run time to realize additional substantial computation savings. That is, we propose to perform further run-time quantization on, for example, a quantized 8-bit DNN while still achieving the same level of model performance. Note that for furthering quantization, we must use new techniques beyond conventional quantization methods, for otherwise, the original DNN could have been quantized to a lower precision in the first place while maintaining acceptable performance (e.g., a 4-bit DNN instead of an 8-bit DNN).

Specifically, we introduce a novel group-based quantization method, which we call Term Revealing (TR). TR, shown in Figure 1, ranks the terms in a group of values associated with a dot-product computation to reveal a fixed number of top terms (called a group budget) to use for the dot-product computation. By limiting the number of terms to a group budget and pruning the remaining smaller terms, TR enables a more efficient implementation of dot-product computations in DNNs. With TR, the selected terms for a value are based on their relative rankings against terms of other values in the group. TR's run-time group-based quantization, a departure from traditional value-based quantization, allows TR to carry out additional quantization on an already quantized DNN.

While allowing further quantization at run time, TR is able to achieve the same level of model performance as the original quantized DNNs for two reasons. First, TR uses group-based term selection, which prunes only the smaller terms in a group (e.g., the $2^1$ and $2^0$ terms), leading to minimal added quantization error. For many groups with fewer terms than the allocated group budget, no additional quantization is performed. Second, by leveraging the normal-like weight and data distributions typically present in DNNs (see Section III-A), TR can use a small group budget without introducing quantization error for many groups.

With a simple FPGA design, we can use a small number of control bits to reconfigure hardware supporting quantized computation under conventional quantization into hardware supporting run-time TR quantization, and vice versa.

To simplify our introduction of TR, we use conventional binary representations where all terms are nonnegative. However, shorter signed expressions which use both positive and negative terms, such as Booth encodings [3], can typically allow fewer terms in expressing a value and lead to increased computation savings in TR-enabled quantization. To this end, we have developed a new signed encoding called Hybrid Encoding for Shortened Expressions (HESE), which we use in the later sections of the paper to express DNN weights and data.

The novel contributions of the paper are:

  • The concept of run-time quantization on already quantized DNNs to realize further computation savings.

  • A group-based term ranking mechanism, called term revealing (TR) and its term MAC (tMAC) hardware design for the implementation of our proposed further quantization at run time.

  • An FPGA system which requires minimal reconfiguration to efficiently support both conventional quantization and TR-enabled quantization.

  • A signed power-of-two encoding called Hybrid Encoding for Shortened Expressions (HESE). Using fewer terms compared to previous signed representations, HESE enhances TR’s computation efficiency.

II Background and Related Work

In Section II-A, we discuss related work on pruning and quantization techniques for performing efficient DNN inference. Then, in Section II-B, we discuss prior work on hardware architectures which aim to exploit bit-level sparsity. Finally, in Section II-C we illustrate how matrix multiplication is performed with systolic arrays.

II-A Pruning and Quantization Methods

There has been significant research effort on pruning-based methods, which exploit value-level sparsity in CNN weights, as performing multiplication with zero operands can be viewed as wasted computation [4, 5, 6, 7, 8, 9, 10, 11, 12]. However, these pruning methods typically require model retraining, making them infeasible for a third party hosting the model (as retraining requires access to the full training dataset). Additionally, unstructured pruning methods, which achieve the best performance (e.g., [4]), are hard to implement efficiently in special-purpose hardware, as the remaining nonzero weights are randomly distributed. In this paper, we propose to further reduce the amount of computation even for nonzero values by exploiting bit-level sparsity as opposed to conventional value-level sparsity.

Quantization [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26] lowers the precision of weights and data in order to reduce the associated storage, I/O, and/or computation costs. However, aggressive post-training quantization (e.g., to 4-bit representations) introduces additional error into the computation, leading to decreased model performance. Because of this, many low-precision quantization approaches, such as binary neural networks [27], must be applied during training. Our proposed TR approach is applied on top of 8-bit quantization and does not require additional training. Note that our approach does not reduce the precision of the weights (i.e., weights are still 8-bit fixed-point values after term revealing) but instead reduces the number of nonzero terms used at run time across a group of weights.

II-B Hardware Architectures for Exploiting Bit-level Sparsity

There has been growing interest in exploiting bit-level sparsity (i.e., the zero bits present in weight and data values) as opposed to the value-level sparsity discussed in the previous section. A bit-level multiplication with a zero bit can be viewed as wasted computation in the same manner as a value-level multiplication with a zero value, in that neither operation affects the result. Based on this observation, Bit-Pragmatic introduces an architecture that utilizes a nonzero term-based representation to remove multiplications with zero bits in weights while keeping data in a conventional representation [28]. Bit-Tactical follows up this work by grouping nonzero weight and data values to achieve more efficient scheduling of nonzero computation [29]. Both of these approaches assume 8-bit or 16-bit fixed-point quantization (i.e., the first step in Figure 1).

However, due to the more fine-grained nature of these bit-level architectures, efficiently scheduling bit-level operations across multiple groups of computations becomes challenging, as each group may have a different amount of computation to perform. Generally, this leads to stragglers that require significantly more bit-level operations than other groups. Both Bit-Pragmatic and Bit-Tactical handle this straggler problem by adding a synchronization barrier which makes all groups wait until the straggler is finished. Due to this, when processing many groups concurrently, they can only exploit bit-level sparsity up to the degree of the group with the most bit-level operations (i.e., the straggler). We find that this worst case can be a factor of 2-3 more bit-level operations compared to the average case. By comparison, TR provides a tighter processing bound which enables synchronous computation across all groups. This is done by removing smaller terms from groups with a large number of terms (the second quantization step in Figure 1).

II-C Systolic Arrays for Matrix Multiplication

The majority of computation in the forward propagation of a DNN consists of matrix multiplications between a learned weight matrix in each layer and the input or data being propagated through the layer, as shown on the left side of Figure 2. Systolic arrays are known to implement matrix multiplication efficiently due to their regular design, dataflow architecture, and reduced memory access [30]. The right side of Figure 2 shows a 3x3 systolic array computing dot products between the highlighted weight and data tiles. The data in the partition are passed into the systolic array from below in a skewed fashion in order to maintain synchronization between cells. We use this systolic array design as the starting point for our FPGA system in Section V.

Fig. 2: The weight matrix and data matrix for a layer in a DNN (left) are partitioned into four tiles to be processed in a systolic array (right). The highlighted weight tile is shown loaded into the systolic array, with the corresponding data tile entering the systolic array from below.

III Term Revealing

In this section, we introduce a group-based quantization method called term revealing (TR), which is applied to quantized DNNs at run time.

III-A DNN Weight and Data Distributions

As mentioned earlier, TR leverages the weight and data distributions of DNNs. DNNs are often trained with weight decay regularization to improve model generalization [31] and with batch normalization [32] on data, which both improves the stability of convergence and improves the performance of the learned model. A consequence is that the weights are approximately normally distributed and the data follow a half-normal distribution (as ReLU sets negative values to 0). Figure 3 (top row) illustrates these distributions for the weights in the 7th convolutional layer of ResNet-18 [33] trained on ImageNet [34] and the data input to the layer. Both the weights and data are quantized to 8-bit fixed-point using uniform quantization (QT).

Fig. 3: The distributions of weight and data values (top) shape the distribution of the number of terms in a binary encoding for both weights and data (bottom).

The higher frequency of small values means that most elements are represented with only 2 or 3 power-of-two terms, as shown in Figure 3 (bottom). For instance, the value 6 is represented with two power-of-two terms ($2^2 + 2^1$). In the figure, 79% of weight values and 84% of data values are represented with 3 or fewer power-of-two terms. Note that the most significant bit (MSB) in the 8-bit representation is used to represent the sign of each value, thus each value has at most 7 terms.

III-B Computing Dot Products via Term Pair Multiplications

Assume that we compute the dot products in a matrix-matrix multiplication between quantized weights and data by dividing both vectors into groups of a given length (e.g., 16). This group-based formulation is motivated by the efficient hardware implementations described later in Section V. Figure 4 illustrates how partial dot products, partitioned into groups of length 16, are computed using term pair multiplications. In the example, the product of the first value in the weight vector and the first data value is computed using two term pair multiplications, as shown on the right of Figure 4. Using this paradigm, we can analyze the number of term pair multiplications required per partial dot product (e.g., with a group size of 16) across all groups in a matrix-matrix multiplication.
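As an illustration of this paradigm (our own sketch, not the paper's code, and restricted to nonnegative values for simplicity), the following Python computes a partial dot product for one group by enumerating all term pairs; the number of exponent additions is exactly the number of term pair multiplications.

```python
def terms(value: int) -> list[int]:
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

def group_dot_product(weights: list[int], data: list[int]) -> tuple[int, int]:
    """Compute a partial dot product over one group using term pair
    multiplications; also return how many term pairs were processed."""
    result, term_pairs = 0, 0
    for w, x in zip(weights, data):
        for ew in terms(w):              # exponents of the weight's terms
            for ex in terms(x):          # exponents of the data value's terms
                result += 1 << (ew + ex)  # 2^ew * 2^ex = 2^(ew+ex)
                term_pairs += 1
    return result, term_pairs

# Example group of size 4 (hypothetical values).
w = [3, 6, 1, 8]
x = [5, 2, 7, 4]
res, pairs = group_dot_product(w, x)
assert res == sum(wi * xi for wi, xi in zip(w, x))
```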

Fig. 4: A matrix-matrix multiplication between a weight and data matrix (left) divided into partial dot products of length 16 (one partial dot product is shown in the middle). Each partial dot product is computed by multiplying all pairs of terms (right) across the 16 values in the weight and data vectors.

Figure 5 shows a histogram of the number of term pair multiplications for partial dot products with groups of 16 values in the 7th convolutional layer of ResNet-18. Interestingly, 99% of these groups require under 110 term pair multiplications even though the theoretical maximum, where all weight and data values use 7 terms (i.e., every value has magnitude 127), is $16 \times 7 \times 7 = 784$. In this work, we propose to restrict the number of term pair multiplications performed in each partial dot product (e.g., to 110 instead of 784) in order to achieve tightly synchronized parallel processing across systolic cells. As we know that DNNs are robust to small amounts of error (e.g., through the initial uniform quantization step), it is reasonable to expect that they would also tolerate an additional quantization step such as our proposed TR-enabled quantization, which makes small modifications to enforce a tighter processing bound. We use term pair multiplications as a proxy for the amount of computation performed during inference, as the hardware system described in Section V performs dot products using this term pair multiplication approach.

Fig. 5: The number of term pair multiplications required for partial dot products with groups of size 16 in an 8-bit DNN.

III-C Overview of Term Revealing

Term revealing is a group-based term ranking method that sets a limit on the number of terms allotted to a scheduling group. TR consists of three steps:

  1. Grouping elements as shown in Figure 6 (left). For a given weight matrix, we partition it into equal-size groups which are used in dot product computations. The group size $g$ denotes the number of values per group and may assume various values such as 2, 3, 4, 8, or 16.

  2. Configure a group budget $k$, which is used for every group. The budget bounds the number of terms used in dot-product computations across the values in a group.

  3. Identify the top $k$ terms in the group using a receding water algorithm that ranks and selects the terms, as shown in Figure 6 (right). It keeps the $k$ largest terms in a group and prunes the remaining smaller terms below a waterline. Note that some groups may have fewer than $k$ terms, meaning that no pruning occurs.

Figure 6 illustrates how TR is applied to a group of values in a weight matrix with a term budget $k$. The three values are decomposed into their term representations and scanned row by row (viewed as a receding waterline), starting from the largest term and finishing at the $2^0$ term, until the group budget is reached. In the example, the group budget is reached before all terms are scanned, and the remaining low-order terms below the waterline are pruned, adding a small amount of additional quantization error. For instance, after TR, one of the values is quantized from 81 to 80. A minimal sketch of this selection procedure is given below.
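The sketch below is our own illustrative Python for the receding water selection, assuming nonnegative values and the binary terms defined earlier; the group and budget used in the example are hypothetical.

```python
def reveal_terms(group: list[int], budget: int, n_bits: int = 8) -> list[int]:
    """Keep the top `budget` power-of-two terms across a group of
    nonnegative values, scanning from the largest exponent downward
    (the waterline recedes); remaining smaller terms are pruned."""
    kept = [0] * len(group)
    remaining = budget
    for exp in range(n_bits - 1, -1, -1):        # 2^(n_bits-1) down to 2^0
        for i, v in enumerate(group):
            if remaining == 0:
                return kept
            if (v >> exp) & 1:                   # value v has a term 2^exp
                kept[i] += 1 << exp
                remaining -= 1
    return kept

# A hypothetical group of 3 values with a budget of 6 terms: values with few
# terms are kept exactly; only the smallest terms of the group may be pruned.
print(reveal_terms([81, 23, 5], budget=6))       # [80, 22, 4]
```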

Fig. 6: A weight matrix (left) is partitioned into groups of size 3. The elements of each group (middle) are passed into TR. The receding water algorithm (right) based on term ranking keeps the top terms (red) for the values in the group; the rest of the terms are pruned.

Since the position of the waterline is determined by the distribution of terms in a group, the amount of pruning induced by TR varies across groups of values. Consider the two groups of three weights each shown in Figure 7 (group a and group b). Figure 7 illustrates the quantization error incurred when 4-bit QT (which truncates the low-order terms of every value) and TR (which keeps the top 6 terms in each group) are applied to both groups. For group a, we see that TR introduces no error, as the group has only 6 terms. By comparison, 4-bit QT introduces error by pruning all of the low-order terms, as conventional quantization keeps only the largest bit positions of each individual value regardless of how the terms are distributed within the group. For group b, which has significantly more terms, TR and 4-bit QT perform a similar amount of truncation. Group b represents a worst case for TR, as most groups have significantly fewer terms.

Therefore, in practice, we can use a small group budget (such as 6 terms for a group of 3 values) without introducing significant quantization error. By constraining the number of terms across the values in a group, TR is able to ensure a tighter processing bound compared to 4-bit QT. Specifically, assuming each data value has up to a fixed number of terms, the maximum number of term pairs with TR is bounded by the group budget times that number, which is smaller than the corresponding bound under 4-bit QT.

III-D Term Pair Reduction for Term Revealing Groups

To more formally quantify the term pair reduction due to TR, suppose for a group of size 3 that the group budget is $k$ and that the receding water algorithm reveals $k_1$, $k_2$, and $k_3$ terms for weight values $w_1$, $w_2$, and $w_3$, respectively, with $k_1 + k_2 + k_3 \le k$. Suppose further that the data values $x_1$, $x_2$, and $x_3$ have $t_1$, $t_2$, and $t_3$ terms, respectively. Then, with TR, the total number of term pairs to be processed for the dot product computation between $(w_1, w_2, w_3)$ and $(x_1, x_2, x_3)$ is

$$k_1 t_1 + k_2 t_2 + k_3 t_3,$$

which is at most $k \cdot t$ when each data value has at most $t$ terms.

In reality, since most weights and data require significantly fewer than the maximum allotted number of power-of-two terms, most groups will complete their dot product computation below this bound, as discussed earlier in relation to Figure 5. In this sense, TR can be viewed as shifting the upper bound from $49 \cdot g$ term pairs per group in the baseline case (7 terms for both weights and data) to at most $k \cdot t$ term pairs per group, where $k \cdot t$ is much smaller. In Section V, we utilize this significantly reduced upper bound enabled by TR to implement tightly synchronized processor arrays for DNN inference.

III-E Relationship Between Group Size and Group Budget

TR budgets $k$ terms for a group of size $g$. Let $k = \alpha \cdot g$ for some $\alpha$, where $\alpha$ is the average number of terms budgeted for each value in the group. Recall from Figure 3 that 79% of weight values are represented in 3 or fewer terms. This means that as the group size increases, the average number of budgeted terms per value can approach the mean of the weight term distribution. For the weight term distribution in Figure 3, the mean is well below the maximum, even though some values have as many as 7 terms. Practically, this means that a larger group size allows for a smaller relative term budget which is close to the mean, as it becomes increasingly unlikely that many groups have more than $k = \alpha \cdot g$ terms.
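This averaging effect can be illustrated with a small Monte Carlo sketch (ours, using an assumed term-count distribution that only loosely resembles Figure 3; the probabilities are placeholders, not measured data): the fraction of groups whose total term count exceeds the budget $k = \alpha \cdot g$ shrinks as the group size grows.

```python
import random

# Assumed per-value term-count distribution (placeholder, not measured data).
term_count_dist = {0: 0.05, 1: 0.20, 2: 0.30, 3: 0.25, 4: 0.12, 5: 0.05, 6: 0.02, 7: 0.01}
counts, probs = zip(*term_count_dist.items())

def overflow_rate(group_size: int, alpha: float, trials: int = 100_000) -> float:
    """Fraction of random groups whose total term count exceeds the
    group budget k = alpha * group_size."""
    budget = alpha * group_size
    over = 0
    for _ in range(trials):
        total = sum(random.choices(counts, probs, k=group_size))
        over += total > budget
    return over / trials

for g in (1, 2, 4, 8, 16):
    print(g, overflow_rate(g, alpha=3.0))   # overflow rate drops with larger g
```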

Fig. 7: 4-bit uniform quantization (QT) always truncates the smaller terms of every value. This leads to large quantization error for groups with many small terms, as in group a. In contrast, by keeping the top 6 terms, TR introduces significantly less quantization error on average. Additionally, TR reduces the maximum number of term pair multiplications for the group compared to 4-bit QT.

III-F Bounding Truncation-induced Error in Dot Products

TR strives to minimize the truncation-induced relative error $\epsilon$. Suppose that $2^b$ is the waterline determined by TR under a given group budget; that is, terms smaller than $2^b$ are truncated. Then every value that keeps at least one term has a kept portion of at least $2^b$, while its truncated portion is at most $2^{b-1} + \dots + 2^0 = 2^b - 1$. The relative error of such a value therefore satisfies $\epsilon \le (2^b - 1)/2^b < 1$, and a larger kept portion results in a reduced upper bound on $\epsilon$.

We provide a bound on the relative error introduced by TR in truncated dot products between weights and data. For a given group of data $(x_1, x_2, x_3)$, the dot product over the group is $w_1 x_1 + w_2 x_2 + w_3 x_3$, where $w_1$, $w_2$, and $w_3$ are the corresponding weights of the filter. Each weight may be positive, negative, or zero, while data values are non-negative. For simplicity, we assume here that all $w_i$ are positive, while noting the result also holds when they are all negative. After TR, each $w_i$ is replaced with a truncated $\hat{w}_i$ in the dot product computation. Let $\epsilon_i$ denote the relative error of $\hat{w}_i$ induced by TR, i.e., $\epsilon_i = (w_i - \hat{w}_i)/w_i$. Then, the dot product result with the $\hat{w}_i$ can be decomposed as follows:

$$\hat{w}_1 x_1 + \hat{w}_2 x_2 + \hat{w}_3 x_3 = (w_1 x_1 + w_2 x_2 + w_3 x_3) - (\epsilon_1 w_1 x_1 + \epsilon_2 w_2 x_2 + \epsilon_3 w_3 x_3).$$

Therefore, the relative error of the dot product with truncated values, as an approximation to the original dot product, is:

$$\frac{\epsilon_1 w_1 x_1 + \epsilon_2 w_2 x_2 + \epsilon_3 w_3 x_3}{w_1 x_1 + w_2 x_2 + w_3 x_3}.$$

Suppose that, as described above, by TR we can assure that $\epsilon_i \le \epsilon$ for $i = 1, 2, 3$. Then,

$$\frac{\epsilon_1 w_1 x_1 + \epsilon_2 w_2 x_2 + \epsilon_3 w_3 x_3}{w_1 x_1 + w_2 x_2 + w_3 x_3} \le \frac{\epsilon\,(w_1 x_1 + w_2 x_2 + w_3 x_3)}{w_1 x_1 + w_2 x_2 + w_3 x_3} = \epsilon.$$

Thus, the relative error in the computed dot products is bounded by $\epsilon$.
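A quick numerical sanity check of this bound (our own sketch, with arbitrary positive weights and data, not values from the paper):

```python
# The relative error of a truncated dot product is a weighted average of the
# per-weight relative errors, so it cannot exceed their maximum.
w       = [81, 23, 5]          # original (positive) weights
w_trunc = [80, 22, 4]          # truncated weights after pruning small terms
x       = [17, 9, 3]           # nonnegative data values

eps = [(wi - ti) / wi for wi, ti in zip(w, w_trunc)]
dot       = sum(a * b for a, b in zip(w, x))
dot_trunc = sum(a * b for a, b in zip(w_trunc, x))
rel_err = (dot - dot_trunc) / dot
assert rel_err <= max(eps) + 1e-12
print(rel_err, max(eps))
```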

IV Hybrid Encoding for Shortened Expressions

In this section, we present Hybrid Encoding for Shortened Expressions (HESE), a signed power-of-two encoding which reduces the number of terms required to represent 8-bit fixed-point values for DNN weights and data. HESE complements TR by reducing the number of terms used before TR is applied.

IV-A Signed Power-of-Two Representations

Booth radix-4 encoding [3] converts a conventional binary representation with only positive power-of-two terms (e.g., $2^3 + 2^2 + 2^1 + 2^0$ for the value 15) into a representation with both positive and negative power-of-two terms (e.g., $2^4 - 2^0$). Note that while the underlying value is the same, the signed Booth representation uses only two terms as opposed to four in the positive-only binary case. Booth reduces the number of terms by encoding a string of consecutive 1s, corresponding to a run of positive power-of-two terms such as (1111) in 15, into a pair of one positive and one negative term ($2^4 - 2^0$).

Booth radix-4 bounds the number of power-of-two terms in an n-bit value to roughly half the bit width [3]. This is utilized in the design of efficient Booth multipliers to provide a smaller bound on the amount of computation required for any pair of n-bit values. Such a bound is necessary for synchronization purposes across multiple processing elements (e.g., systolic arrays). In this work, we are interested in representations with even fewer terms, as TR enforces a tighter computational bound by truncating smaller terms in a group.

IV-B Overview of HESE

Fig. 8: (a) HESE converts binary encodings into shorter signed encodings, such as the example in (b). (c) HESE requires fewer terms than both binary and radix-4 encodings for 8-bit values over DNN data (data) and a uniform distribution (unif).

HESE is a hybrid encoding method that combines Booth, which efficiently handles strings of 1s, with additional rules that reduce the number of terms required for isolated 1s and 0s. Figure 8 shows how HESE is used to encode 8-bit binary values into a signed power-of-two expression with fewer terms. For example, Booth translates 95 (01011111) into a longer signed expression, whereas HESE, as shown in Figure 8, translates the value to the shorter expression $2^6 + 2^5 - 2^0$. The generated output for this example can be explained by combining the strategy of Booth radix-2 encoding for strings of 1s with that of the standard binary representation for isolated 1s which are surrounded by zeros. For instance, in the example shown in Figure 8, (01011111) has a single isolated 1 in the 7th bit position and a string of 1s from the 5th to the 1st bit positions. This translates to $2^6$ (for the isolated 1) plus $2^5 - 2^0$ (for the string of 1s). By using the third rule in Figure 8a for an isolated 0, we can save an additional term for values such as (0110111) by translating it into $2^6 - 2^3 - 2^0$. Because this rule is rarely needed for 7-bit binary values, for implementation simplicity, we omit it in the design and analysis reported in this paper; that is, we only use the other four rules.
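The run-based behavior described above (without the rarely needed isolated-0 rule) can be sketched in Python as follows; this is our own illustration of the rules as we read them, not the paper's encoder implementation.

```python
def hese(value: int) -> list[tuple[int, int]]:
    """Encode a nonnegative integer as signed power-of-two terms:
    an isolated 1 at bit i becomes +2^i, and a maximal run of ones from
    bit i up to bit j becomes +2^(j+1) - 2^i. Returns (sign, exponent) pairs."""
    out, i, n = [], 0, value.bit_length()
    while i < n:
        if (value >> i) & 1:
            j = i
            while (value >> (j + 1)) & 1:   # extend to the end of the run of 1s
                j += 1
            if j == i:                       # isolated 1
                out.append((+1, i))
            else:                            # run of 1s: 2^(j+1) - 2^i
                out.append((+1, j + 1))
                out.append((-1, i))
            i = j + 1
        else:
            i += 1
    return out

# 95 = 01011111: an isolated 1 (2^6) plus a run of ones (2^5 - 2^0).
assert sum(s * (1 << e) for s, e in hese(95)) == 95
print(hese(95))   # [(1, 5), (-1, 0), (1, 6)], i.e., 2^5 - 2^0 + 2^6
```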

IV-C Reducing the Number of Terms per Encoding

HESE encodings never require more terms than binary or Booth radix-4 encodings. Figure 8 shows the number of terms required for these encodings across two distributions of values: data values from ResNet-18 and values drawn from a uniform distribution over the same range as the data. The x-axis is the number of terms required to represent a value and the y-axis is the cumulative percentage of values that are represented within a given number of terms. HESE outperforms both Booth and binary across both distributions of values.

As expected, Booth leads to more compact representations than binary for values drawn from the uniform distribution. However, most of the reduction in terms comes from larger values in the 8-bit range (with many 1s), which occur much less frequently in the data, as depicted in Figure 3 (bottom). Therefore, radix-4 performs equal to or worse than binary for the distribution of data values we are interested in. By comparison, when applying HESE to data, nearly all values are represented in 3 or fewer terms. Practically, this means we can use 3 power-of-two terms for both weights and data.

V Hardware Design for Efficient Term Revealing

In this section, we present our hardware design for TR-based quantization. Figure 9 provides an overview of the TR system design, consisting of the following components: (1) weight and data buffers which store DNN layer weights and input/intermediate data, (2) a systolic array which performs dot products between weights and data using term MACs (tMACs) described in Section V-B, (3) a binary stream converter to convert systolic array output into a binary representation (Section V-C), (4) a ReLU block (Section V-C), (5) a HESE encoder (Section V-D) to convert the binary representations to shortened signed expressions, and (6) a term comparator (Section V-E) which applies TR by selecting the top terms in a group. In Section V-A, we first give some high-level reasoning on how the tMAC saves computation.

Fig. 9: The term revealing system design.

V-A High-level Comparison Between Bit-parallel MAC (pMAC) and Term MAC (tMAC)

To help understand the inherent advantage of TR, we provide a high-level argument on how our proposed term MAC (tMAC) saves a significant amount of work over a conventional parallel MAC (pMAC). Here, we define work as the amount of computation, including both arithmetic and bookkeeping operations, which are performed per group. The work incurred by a method largely determines the energy, area, and latency of its implementation.

To be concrete, we study a 1D systolic array of 3 cells, as depicted in Figure 10, for processing groups of 3 data values in computing their dot products with 3 weights pre-stored in the systolic array. To provide a baseline for comparison, we consider a conventional implementation where each cell is a pMAC performing an 8-bit bit-parallel multiplication and a 32-bit accumulation that adds the computed product to an intermediate result. In comparison, Figure 10 depicts a tMAC-based implementation which significantly reduces the work by only processing the term pairs that are actually present. The number of terms is relatively small due to the high bit-level sparsity generally present in CNN weights and data. This comparison applies to a general 2D systolic array, which is a stack of 1D systolic arrays. In Section VII-A, we show how this analysis translates to realized performance on an FPGA with a group size of 8.

Fig. 10: (a) A systolic array where each of the three systolic cells is a conventional bit-parallel MAC (pMAC), which performs an 8-bit multiplication between a weight and a data value and a 32-bit accumulation in each systolic array cycle. (b) The proposed term MAC (tMAC) processes all term-pair multiplications for the same systolic array cycle across a group of weight and data values (group size of 3 here) in a bit-serial fashion. For a given group budget and number of terms per data value, the number of term-pair multiplications is bounded by their product; in the example shown here it is 8.

For this illustrative analysis, we assume that tMAC uses a TR group of size 3 with a small group budget for the weight values and a few leading terms for each data value under HESE (Section IV-B), so that each weight value in a group uses only a couple of terms on average. As we show in Table III, under similar settings, TR incurs a minimal decrease in classification accuracy (less than 0.2%) when dropping lower-order terms exceeding the group budget across multiple CNNs.

Our analysis of work proceeds as follows. A conventional pMAC implementation of a single systolic cell incurs 7 8-bit additions for the multiplication and 1 32-bit accumulation operation. Therefore, the pMAC implementation of a 1D systolic array with three cells requires 21 8-bit additions and 3 32-bit accumulation operations. In contrast, a tMAC implementation incurs significantly less work. Specifically, under the term counts assumed above and depicted in Figure 10, it uses at most 12 3-bit additions on exponents of power-of-two terms (weight and data exponents are both less than 8) for the term-pair multiplications. The updating of the accumulating coefficient vector (discussed in detail in the next section) requires bookkeeping operations for bit alignment, etc., whose work we assume is no larger than the equivalent of 12 3-bit additions. Thus, tMAC substantially reduces work compared to pMAC: 24 3-bit additions versus 21 8-bit additions plus 3 32-bit accumulations.
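To put this comparison in rough bit-operation terms (our own back-of-the-envelope arithmetic, under the simplifying assumption that an n-bit addition costs n bit operations):

```python
# pMAC for a group of 3: 21 8-bit additions plus 3 32-bit accumulations.
pmac_bit_ops = 21 * 8 + 3 * 32          # = 264
# tMAC for the same group: at most 24 equivalent 3-bit additions.
tmac_bit_ops = 24 * 3                   # = 72
print(pmac_bit_ops / tmac_bit_ops)      # roughly 3.7x less work for tMAC
```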

V-B Term MAC (tMAC) Design

The term MAC (tMAC) performs dot products between a data and a weight vector of a given group size by multiplying all term pairs. Figure 11 illustrates how these term pair multiplications are performed in tMAC for a group of size 4 and a group budget of 8. In this example, TR ensures that there are 8 or fewer terms across all weight values in the group. For illustration simplicity, assume all data values can be represented with a single term (in our implementation a data value may have multiple terms). Under these assumptions, at most 8 term pair multiplications are performed and the results are added to a coefficient vector depicted in the upper right of Figure 11.

Fig. 11: Term pair multiplications for a dot product across a group of 4 weight and data values over 8 cycles.

The coefficient vector stores the current partial result of the dot product as a coefficient for each power of two. In the example of Figure 11, the coefficient vector entries record how many (signed) term pairs have accumulated at each exponent so far. For the first term pair in the figure, the corresponding coefficient is decremented by 1, as the signs of the two terms differ. Once all exponent additions are completed for a dot product, the coefficient vector is reduced to a single value. Assuming 8-bit uniform quantization, the largest term of a value is $2^6$, so the largest term pair is $2^{12}$. Therefore, the coefficient vector has a length of 13 in order to store all possible term pair results from $2^0$ to $2^{12}$. To ensure that overflow is not possible even for long dot products, each element in the coefficient vector is 12 bits.
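The coefficient-vector bookkeeping can be sketched as follows (an illustrative Python model of the dataflow, assuming signed terms are given as (sign, exponent) pairs; this is not the RTL design):

```python
def tmac_group(weight_terms, data_terms):
    """Accumulate all term-pair products of one group into a coefficient
    vector indexed by exponent (0..12 for 8-bit values), then reduce it
    to a single integer, mimicking the binary stream converter."""
    coeff = [0] * 13                            # exponent sums range over 0..12
    for w_terms, x_terms in zip(weight_terms, data_terms):
        for sw, ew in w_terms:                  # weight term: sw * 2^ew
            for sx, ex in x_terms:              # data term:   sx * 2^ex
                coeff[ew + ex] += sw * sx       # +/-1 per term pair
    return sum(c << e for e, c in enumerate(coeff))

# Hypothetical group of size 2: 3*5 + (-6)*2 = 15 - 12 = 3, using signed terms.
w = [[(+1, 1), (+1, 0)], [(-1, 2), (-1, 1)]]    # 3 = 2^1+2^0, -6 = -(2^2+2^1)
x = [[(+1, 2), (+1, 0)], [(+1, 1)]]             # 5 = 2^2+2^0, 2 = 2^1
print(tmac_group(w, x))                          # 3
```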

The hardware design of tMAC is shown in Figure 12. The exponents for term pairs are stored in data and weight exponent arrays, with the sign of each term stored in parallel sign arrays with one bit per term. For instance, a negative term stores its exponent in the exponent array and a minus in the sign array. The yellow, red, blue, and green colors denote the term pair boundaries for each data-weight multiplication. The exponent duplicator takes in data exponents and duplicates them based on the number of weight exponents in each value. Each cycle, a pair of exponents from these two arrays is passed into the adder, which computes the sum of the exponents, sets the sign, and sends the result to a coefficient accumulator (CA) (Figure 12) within one cycle. Therefore, processing a group with 8 term pairs takes 8 cycles in total.

The CAs perform bit-serial addition between the coefficient vector and the output of the exponent adder in the tMAC. Due to the bit-serial design, the number of CAs must match the size of the data and weight register arrays (8 in this example) in order to maintain synchronization across the cells of the systolic array. In each cycle, one of the eight CAs takes the sum of two exponents from the adder and adds or subtracts 1 to or from the corresponding coefficient. In our implementation, each tMAC can choose either to reuse the current coefficient vector or to take the new coefficient vector from its neighboring cell via a selection signal, as depicted in Figure 12.

Fig. 12: (a) The term MAC (tMAC) performs term pair multiplications between data and weight terms for a group of values. (b) A coefficient accumulator (CA) takes the adder result and adds or subtracts 1 from the corresponding coefficient.

V-C Binary Stream Converter and ReLU Block

The binary stream converter takes the coefficient vectors output from the systolic array and transforms them into a binary format by multiplying each element of the coefficient vector with the corresponding power-of-two term and then summing the partial results. The outputs of the binary stream converter are sent to the ReLU block in a bit-serial fashion. Using a two's complement representation for the outputs, the sign can be determined by detecting the most significant bit of the output stream. The ReLU block buffers all the lower bits until the MSB arrives. It then outputs zero if the MSB indicates that the value is negative; otherwise it outputs the original bit stream.

V-D HESE Encoder

The HESE encoder produces two bit streams, which represent the magnitude and sign of each power-of-two term, respectively. For a bit-serial binary input, the magnitude stream marks the positions of the signed power-of-two terms, and the sign stream records whether each term is positive or negative. The HESE encoder is implemented with a finite state machine.

V-E Term Comparator

The term comparator in Figure 13 selects the top $k$ terms from the outputs of every $g$ consecutive HESE encoders, where $k$ and $g$ are the group budget and group size, respectively. Figure 13 shows the operation of the term comparator on the outputs of four HESE encoders. The HESE encoder outputs are divided into two groups, where each group has a group size of 2 and a given group budget. The inputs enter the term comparator in reverse order such that their most significant bits (MSBs) enter the term comparator first. Each cycle, the term comparator counts the total number of nonzero bits encountered so far, and truncates the remaining low-order terms once the group budget is reached for a group.

Fig. 13: (a) The design of the term comparator, which implements term revealing. The term comparator takes the group size and group budget as inputs, counts the total number of terms within each group, and sets the remaining terms to zero once the group budget is reached. (b) An example of the term comparator operating on two groups. At T=2, the group budget is reached for group 1 and all the remaining terms are pruned. At T=3, the group budget is also reached for group 2.

The term comparator contains multiple accumulate-and-compare (A&C) blocks which are arranged into a tree structure. Each A&C block takes a single input bit stream and counts the total number of nonzero bits in this stream. Figure 14 shows how the A&C blocks can be reconfigured for different group sizes. For a group size of 1, the A&C blocks on the first level of the tree compare the number of nonzero bits in their input stream against the group budget and truncate each stream accordingly. If the group size is larger than 1 (e.g., 2), each A&C block in the first level of the tree forwards its input stream together with its nonzero bit count to its parent A&C block. The parent A&C block then operates on these two streams in a similar fashion to its children. The tree architecture allows for minimal changes to the term comparator under different group sizes, which leads to a low reconfiguration overhead and a maximum level of hardware reuse.
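The MSB-first truncation performed by the term comparator can be modeled with the following sketch (our own bit-serial approximation over magnitude streams; the HESE sign stream would be handled alongside and is omitted here for brevity):

```python
def truncate_group(mag_streams: list[str], budget: int) -> list[str]:
    """Scan a group of MSB-first magnitude bit streams cycle by cycle,
    count the nonzero bits seen so far, and zero out every bit after
    the group budget has been reached."""
    n_cycles = len(mag_streams[0])              # all streams assumed equal length
    out = [list(s) for s in mag_streams]
    seen = 0
    for t in range(n_cycles):                   # one bit position per cycle
        for i, stream in enumerate(mag_streams):
            if stream[t] == '1':
                if seen < budget:
                    seen += 1
                else:
                    out[i][t] = '0'             # prune: budget already reached
    return [''.join(bits) for bits in out]

# Two hypothetical values in a group (MSB first), budget of 3 terms per group.
print(truncate_group(['0100101', '0011001'], budget=3))
```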

Fig. 14: Configurations of term comparator under different group sizes.

V-F Memory Subsystem

Our memory subsystem consists of a data buffer and a weight buffer. The data buffer holds the term exponents and signs for both the input and result data of the current layer, and the weight buffer holds the term exponents and signs of the weights for each group. For the weight buffer, we use double buffering to prefetch the next weight tile from the off-chip DRAM so that the computation of the systolic array can overlap with the data transfer from off-chip DRAM to the weight buffer. Note that TR does not reduce the storage complexity of the model, as each weight is stored in an 8-bit fixed-point format.

V-G FPGA Reconfiguration for QT and TR

Our TR system can easily be reconfigured for different group sizes and group budgets, in order to adapt to dynamic requirements on group size and group budget during inference with negligible delay. In addition, our system also supports conventional quantization (QT) by performing power-of-two operations on binary representations. Since QT does not require TR or HESE encoding, the term comparator and HESE encoder can be turned off using clock gating to reduce power consumption. Table I summarizes all of the control registers which need to be modified when switching between TR and QT. The switching process takes only several clock cycles (i.e., within 100 ns for our FPGA implementation).

Register | Uniform quantization (QT) | Term revealing (TR)
HESE_ENCODER_ON (1 bit) | HESE encoder is turned off by setting this bit to 0 | HESE encoder is turned on by setting this bit to 1
COMPARATOR_ON (1 bit) | Term comparator is turned off by setting this bit to 0 | Term comparator is turned on by setting this bit to 1
QUANT_BITWIDTH (4 bit) | Quantization bitwidth used for QT | Quantization bitwidth used for TR
DATA_TERMS (4 bit) | Same as the quantization bitwidth for QT | Maximum number of power-of-two terms in data for TR
GROUP_SIZE (3 bit) | Group size is set to 1 for QT | Group size is between 2 and 8 for TR
GROUP_BUDGET (5 bit) | Group budget is the same as the quantization bitwidth for QT | Group budget is configurable for TR

TABLE I: Control registers for supporting QT and TR.
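For illustration, the two operating modes might be captured as register settings like the following (a hypothetical configuration sketch based on Table I; the register names follow the table, while the specific TR values shown are one example choice, not the paper's fixed settings):

```python
# Hypothetical register settings derived from Table I.
QT_MODE = {
    "HESE_ENCODER_ON": 0,   # HESE encoder off under QT
    "COMPARATOR_ON":   0,   # term comparator off under QT
    "QUANT_BITWIDTH":  8,
    "DATA_TERMS":      8,   # same as the quantization bitwidth for QT
    "GROUP_SIZE":      1,   # group size is 1 for QT
    "GROUP_BUDGET":    8,   # same as the quantization bitwidth for QT
}

TR_MODE = {
    "HESE_ENCODER_ON": 1,   # HESE encoder on under TR
    "COMPARATOR_ON":   1,   # term comparator on under TR
    "QUANT_BITWIDTH":  8,
    "DATA_TERMS":      3,   # example: maximum power-of-two terms per data value
    "GROUP_SIZE":      8,   # example group size (2 to 8 supported)
    "GROUP_BUDGET":    12,  # example budget; chosen per model in practice
}
```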

VI Term Revealing Evaluation

In this section, we evaluate the performance of TR when applied to an MLP on MNIST [35], a broad range of CNNs (VGG-16 [36], ResNet-18 [33], MobileNet-V2 [37], and EfficientNet-b0 [38]) on ImageNet [34], and an LSTM [39] on Wikitext-2 [40]. In Section VI-A, we compare TR against conventional uniform quantization (QT) on the performance (i.e., accuracy or perplexity) of these DNNs. Then, in Section VI-B, we analyze how the average number of terms per value and the group size impact classification accuracy. Next, in Section VI-C, we analyze the individual contributions of HESE and TR to model performance. Finally, in Section VI-D, we show that the quantization error introduced by TR is substantially less than that of a more aggressive QT setting (e.g., 6-bit uniform quantization).

Fig. 15: Comparing uniform quantization (QT) and term revealing (TR) for an MLP on MNIST (left), CNNs on ImageNet (center), and an LSTM on Wikitext-2 (right). The QT settings vary the weight bit-width (from 4 to 8 bits), while the TR settings vary the group size and the number of terms per group. TR reduces the number of term pair multiplications per sample over QT by 3-10x across the three types of DNNs.
Fig. 16: A larger group size improves ResNet-18 ImageNet classification accuracy for a given average number of terms per value ($\alpha$).

To perform this analysis, we have implemented a CUDA kernel for TR which only marginally increases the inference runtime of a pre-trained model running on an NVIDIA 1080 Ti. This means that the validation accuracy of a pre-trained CNN for ImageNet can still be obtained within several minutes. Using pre-trained models has the advantage of making parameter search (e.g., over the group size and term budget) simple compared to methods such as weight pruning [4] that require model retraining, which takes hours or days for each setting. Before applying TR, each model is quantized from 32-bit floating-point to 8-bit fixed-point using the layerwise procedure described in [41].

VI-A Comparing Term Revealing to Uniform Quantization

Motivated by the design in Section V, we are interested in minimizing the number of term pair multiplications per sample, as this directly translates to the processing latency of a sample. For the uniform quantization (QT) approach with 8-bit fixed-point weights and data, each multiplication translates to up to $7 \times 7 = 49$ term pair multiplications. By comparison, for term revealing (TR), the number of term pair multiplications is instead bounded by the number of term pairs allotted to a group, which is shared across the values of the group. We show that TR gives a significant reduction (e.g., 3-10x) over QT while maintaining nearly identical performance (e.g., within 0.1% accuracy).

VI-A1 MLP on MNIST

We train an MLP with one hidden layer of 512 neurons for MNIST using the parameter settings given in the PyTorch examples for MNIST (https://github.com/pytorch/examples/tree/master/mnist). Figure 15 (left) shows the performance of QT and TR applied to the pre-trained MLP. TR achieves a substantial reduction in the number of term pair multiplications over QT while achieving a classification accuracy of 98.4% (compared to the 98.5% baseline).

VI-A2 CNNs on ImageNet

We use pre-trained models provided by the PyTorch torchvision package (https://github.com/pytorch/vision/tree/master/torchvision/models) for VGG-16, ResNet-18, and MobileNet-v2, and a PyTorch implementation of EfficientNet with pre-trained models (https://github.com/lukemelas/EfficientNet-PyTorch). Figure 15 (center) shows the performance of TR and QT for the 4 CNNs. TR achieves a 14x reduction in term pair multiplications over QT for VGG-16, which is known to be significantly overprovisioned (e.g., amenable to quantization and pruning). Even for more recent models with significantly fewer parameters, such as MobileNet-v2 and EfficientNet-b0, TR still achieves a 4x and 6x reduction in term pair multiplications, respectively, while losing less than 0.1% classification accuracy compared to the 8-bit QT settings. Generally, we see that more aggressive TR settings (e.g., with a reduced group budget) degrade accuracy more gracefully than more aggressive QT settings (e.g., with a reduced weight bit-width).

VI-A3 LSTM on WikiText-2

We train a 1-layer LSTM with 650 hidden units (i.e., neurons), a word embedding of length 650, and a dropout rate of 0.5, following the PyTorch word language model example (https://github.com/pytorch/examples/blob/master/word_language_model). This baseline model achieves a perplexity of 86.85. Figure 15 (right) shows how the perplexity of the pre-trained model is impacted by QT and TR. Again, we find that TR is able to reduce the number of term pair multiplications by a significant factor while achieving the same perplexity.

VI-B Improved Term Allocation with Larger Group Size

Figure 16 shows the classification accuracy for ResNet-18 as $\alpha$ (the average number of terms per value) is varied for different group sizes. As the group size increases, the variance in the number of terms across values in a group shrinks, meaning that a larger group budget at a fixed ratio $\alpha = k/g$ is strictly better than a smaller one at the same $\alpha$. As observed, the classification accuracy for a larger group size strictly outperforms all settings with smaller group sizes. For instance, a group size of 8 with an $\alpha$ of 1 achieves a classification accuracy of 67.72%, which is 5.21% better than a group size of 1 at the same $\alpha$ value. Note that a group size of 1 is equivalent to truncating each value to exactly $\alpha$ terms.

VI-C Isolating the Impact of TR and HESE

Figure 17 shows the relative impact of TR and HESE on classification accuracy by measuring them in isolation. The HESE and QT settings (without TR) apply term truncation by keeping the top terms in each individual weight. In this case, the per-value term budget is equal to $\alpha$, as the group size is 1 (i.e., there is no grouping). We see that HESE substantially outperforms QT when only a few top terms are kept per weight, due to its requiring fewer terms. For the settings with TR, QT + TR and HESE + TR, we use a fixed group size with a range of term budgets chosen to generate values of $\alpha$ comparable to the settings without TR. We find that TR improves the performance of both the QT and HESE encoding methods, with HESE + TR achieving the best performance.

Fig. 17: Measuring the individual contributions of TR and HESE in reducing the number of terms while maintaining high classification accuracy.

VI-D Quantization Error Analysis

TR's superior performance (e.g., accuracy or perplexity) over QT discussed in Section VI-A stems from TR introducing less quantization error. Figure 18 shows the quantization error across the layers in ResNet-18 for 3 QT settings (from 6-bit to 8-bit) and for TR with a fixed group size and group budget. We see that TR introduces only a small amount of quantization error over 8-bit QT, which makes sense as TR is applied on top of 8-bit QT. The 7-bit and 6-bit QT settings truncate the low-order terms of every value, leading to larger quantization error and reduced classification accuracy.

Fig. 18: The average quantization error (relative to the original 32-bit floating-point weights) across the convolutional layers in ResNet-18 for 3 QT settings and one TR setting.

VII FPGA Evaluation

In this section, we evaluate the hardware performance of the TR system described in Section V. We have synthesized our TR system for a Xilinx VC707 FPGA evaluation board. We first compare the performance of the tMAC against a bit-parallel MAC in Section VII-A. Then, we demonstrate that our TR system can be used to implement both QT and TR in Section VII-B. Finally, in Section VII-C, we compare our TR system against other FPGA-based CNN accelerators on ResNet-18.

VII-A Comparing Performance of Bit-parallel MAC and tMAC

In this section, we evaluate the hardware performance of a single tMAC by comparing it against the bit-parallel MAC (pMAC) shown in Figure 10. For both designs, we perform a group of MAC computations $y = c + \sum_{i=1}^{g} w_i \cdot x_i$, where $y$, $c$, $w_i$, and $x_i$ are 32-bit, 32-bit, 8-bit, and 8-bit values, respectively, and $g$ is the number of elements in the weight and data vectors (i.e., the group size in TR). In one cycle, the pMAC performs an 8-bit multiplication between $w_i$ and $x_i$ and a 32-bit accumulation between the result and the running sum. Therefore, $y$ is generated in $g$ cycles. By comparison, the tMAC takes a variable number of cycles to process each multiplication in the group, depending on the number of term pairs in the multiplication. In total, it requires no more than $k \cdot t$ cycles, where $t$ is the maximum number of terms for each data value and $k$ is the group term budget.

Table II shows the FPGA resource consumption of the two MAC designs in terms of lookup tables (LUTs) and flip-flops (FFs). The tMAC consumes far fewer LUTs and FFs than the pMAC (25 vs. 154 LUTs and 26 vs. 148 FFs). The tMAC requires fewer FPGA resources because it performs 3-bit exponent additions as opposed to the 8-bit additions and 32-bit accumulations of the pMAC.

We evaluate the two designs in terms of energy efficiency, the ratio between throughput and power consumption. Table III shows the energy efficiency and classification accuracy of the two designs across four CNNs. For the tMAC settings, the number of data terms and the group budget are selected for each CNN such that the classification accuracy stays competitive with the baseline model (less than a 0.2% difference in accuracy across all settings) while keeping the group size fixed at 8. For each CNN, the energy efficiency of both MAC designs is normalized to that of the pMAC. We observe that tMAC achieves superior energy efficiency (2.1x on average) compared to pMAC across the four CNNs. This reflects the fact that pMAC performs more work than tMAC, as our analysis in Section V-A shows.

MAC design | LUT | FF
pMAC | 154 | 148
tMAC | 25 | 26

TABLE II: FPGA resource consumption of pMAC and tMAC.

Model | MAC | Data terms | Group budget | Group size | Accuracy | Energy eff.
ResNet-18 | pMAC | - | - | - | 69.62% | 1.0
ResNet-18 | tMAC | 3 | 12 | 8 | 69.60% | 2.1
VGG-16 | pMAC | - | - | - | 73.11% | 1.0
VGG-16 | tMAC | 2 | 12 | 8 | 73.11% | 3.1
MobileNet-v2 | pMAC | - | - | - | 71.76% | 1.0
MobileNet-v2 | tMAC | 3 | 18 | 8 | 71.65% | 1.5
EfficientNet-b0 | pMAC | - | - | - | 75.99% | 1.0
EfficientNet-b0 | tMAC | 3 | 16 | 8 | 75.84% | 1.7

TABLE III: Classification accuracy and energy efficiency comparison for the two MAC designs across four CNNs.

VII-B System Comparison of QT and TR

In this section, we compare the hardware performance of TR against QT for the DNNs shown in Figure 15. The systolic array in the TR system has 128 rows by 64 columns, with each systolic cell implementing a tMAC with a group size of 8. The group budget is chosen independently for each network such that TR is within 0.15% accuracy of the QT setting (or 0.05 perplexity for the LSTM). In this section, we use the same TR system (Figure 9) for the implementation of both QT and TR in order to show the reconfigurability of our design. The implementation of QT does not require group-based ranking or HESE encoding, so we turn off these components of the hardware system to reduce dynamic power consumption. All control registers are configured based on Table I.

Fig. 19: Normalized energy efficiency and latency improvements of TR over QT. All models use a group size of 8. The group budget is selected for each model such that it is within 0.15% accuracy of the corresponding QT setting. All models keep the top 3 terms per data value except for VGG-16, which uses 2.

We evaluate our FPGA system with the following performance metrics: (1) average processing latency of the hardware system to generate the prediction result for an input sample, and (2) energy efficiency, i.e., the number of samples processed per unit of energy. As shown in Figure 19, our TR system outperforms QT in both processing latency and energy efficiency across all models. For more difficult tasks, such as Wikitext-2 for the LSTM, a more conservative group budget is selected, leading to less relative improvement over QT. For overprovisioned models (e.g., VGG-16), a more aggressive group budget is used, leading to more substantial improvements in latency and energy efficiency.

VII-C FPGA System Evaluation

In this section, we evaluate our TR system on ResNet-18 for ImageNet, using a group size of 8 and a corresponding group budget. While an even larger group size could theoretically lead to additional savings, there are diminishing returns, as shown in Figure 16 when comparing the two largest group sizes. Additionally, larger group sizes increase the complexity of the term comparator due to additional tree levels of A&C blocks.

We compare our TR system with other FPGA-based accelerators which implement different CNN architectures (e.g., AlexNet) on ImageNet. We evaluate our design in terms of the average processing latency for input samples, the energy efficiency of the hardware system, and classification accuracy. As shown in Table IV, our design achieves the highest classification accuracy (69.48%), the highest energy efficiency (25.22 frames/J), and the second lowest latency (7.21 ms).

Metric | [42] | [43] | [44] | [45] | Ours
FPGA chip | VC706 | Virtex-7 | ZC706 | ZC706 | VC707
Acc. (%) | 53.30 | 55.70 | 64.64 | N/A | 69.48
Frequency (MHz) | 200 | 100 | 150 | 100 | 170
FF | 51k (12%) | 348k (40%) | 127k (29%) | 96k (22%) | 316k (51%)
LUT | 86k (39%) | 236k (55%) | 182k (83%) | 148k (68%) | 201k (65%)
DSP | 808 (90%) | 3177 (88%) | 780 (89%) | 725 (80%) | 756 (27%)
BRAM | 303 (56%) | 1436 (49%) | 486 (86%) | 901 (82%) | 606 (59%)
Latency (ms) | 5.88 | 11.7 | 224 | 17.3 | 7.21
Energy eff. (frames/J) | 23.6 | 8.39 | 0.46 | 6.13 | 25.22

TABLE IV: Comparison of our FPGA implementation of ResNet-18 to other FPGA-based accelerators on ImageNet.

Our hardware system achieves the best overall performance for multiple reasons. First, TR coupled with the proposed HESE encoding greatly reduces the number of term pair multiplications, which reduces the number of cycles in the tMACs. TR allows the tMACs to achieve a much tighter processing bound of $k \cdot t$ term pairs per group, as opposed to the far larger worst case under standard binary encoding without TR. Second, the bit-serial design of the coefficient accumulator in the tMAC together with the systolic architecture of the computing engine leads to a highly regular layout with low routing complexity.

VIII Conclusion

We proposed term revealing (TR) as a general run-time approach for further quantizing the computation of already quantized DNNs. Departing from conventional quantization that operates on individual values, TR is a group-based method that keeps a fixed number of terms within a group of values. TR leverages the weight and data distributions of DNNs, so that it can achieve good model performance even with a small group budget. We measure the computation cost of TR-enabled quantization using the number of term pair multiplications per inference sample. Under this clearly defined cost proxy, we have shown that TR significantly lowers computation costs for MLPs, CNNs, and LSTMs. As shown in Section VII-B, this reduction in operations translates to improved energy efficiency and reduced latency over conventional quantization for our FPGA system. Furthermore, our FPGA system demonstrates that by changing a small number of control bits we can reconfigure a computation under conventional quantization into one under TR-enabled quantization, and vice versa (Table I). Quantization is one of the most widely used approaches to streamlining DNNs; TR, as proposed in this paper, brings the success of the quantization approach to another level.

References