I Introduction
Deep Neural Networks (DNNs) are state-of-the-art models for applications like object classification, image segmentation, and speech processing [8]. However, they have high computational complexity and memory footprints, which translate to high hardware and energy requirements [14]. This resource-hungry nature of DNNs challenges their high-accuracy deployment in resource-constrained scenarios, such as inference on embedded devices.
Methods like pruning and quantization are used to reduce the computational complexity and memory footprint of DNNs [1, 3, 20, 11]. However, these state-of-the-art approaches are highly effective only when used with retraining to minimize the accuracy loss. While retraining smaller DNNs designed for less complex datasets incurs fewer overheads, it is prohibitively costly to retrain larger DNNs designed for complex applications, which can take several days even on high-end GPU servers. Moreover, proper retraining of DNNs during the optimization phase requires access to a comprehensive dataset, which may not be possible in cases where the dataset is the intellectual property of a company that has not made it available (e.g., Google's JFT-300M dataset [13]) to the end user who wants to optimize a given pretrained DNN for a specific set of resource constraints. In such cases, an effective approach is post-training quantization, which enables low-precision Fixed-Point (FP) implementations (e.g., 8 bits) of DNNs [2][9], and thereby reduces power, latency, and memory footprint. However, reducing the precision introduces quantization errors, leading to potentially noticeable accuracy loss. This additional source of error degrades the application-level accuracy of DNNs and restricts designers from reducing the bit-widths of DNN data structures beyond a certain level without significantly affecting the accuracy.
State-of-the-Art Works and Their Limitations: To overcome the above-mentioned limitations and achieve significant efficiency gains, several works have been proposed. Compensated-DNN [4] proposed a technique to quantize DNN data structures and dynamically compensate for quantization errors. It achieves this using a novel Fixed Point with Error Compensation (FPEC) data representation format and a specialized Processing Element (PE) design containing a low-cost error compensation unit. In FPEC, each number consists of two sets of bits: (1) bits that represent the quantized value of the number in a traditional FP format; and (2) bits that store quantization-related information, such as the error direction, which is useful for error compensation. Techniques like [15] and [5] make use of power-of-two quantization and multi-scaled FP representation (respectively) to reduce the bit-width of DNN data structures as well as the complexity of the PEs (mainly the multipliers). Recently, CAxCNN [12] proposed the use of a reduced-precision Canonical Sign Digit (CSD) representation to decrease the complexity of the Multiply-and-Accumulate (MAC) units in DNN accelerators. Similarly, [16] combined the use of reduced-precision CSD with a customized delta encoding scheme to efficiently approximate and encode DNN parameters. A summary of the key characteristics of the most relevant state-of-the-art techniques is presented in Table I. The table shows that Compensated-DNN is the only work that employs error compensation (without retraining) to offer a better energy-accuracy trade-off. However, it does not exploit the advantages of efficient data representations such as power-of-two or reduced-precision CSD representations. Moreover, it also requires extra hardware support for error compensation.
Other approaches that improve the energy efficiency of DNN inference without the need for retraining involve the use of approximate hardware modules (mainly approximate multipliers) [18][10], and voltage scaling of the computational array and memory modules [19][7]. AxNN [18] selectively approximates the neurons that do not impact the application-level accuracy much. However, it offers close-to-baseline accuracy only with approximation-aware retraining. ALWANN [10] presents a method that employs functional approximations in the MAC units to improve the energy efficiency of DNN inference without involving approximation-aware retraining. It mainly determines the most suitable approximation for each layer of the given DNN to achieve a reasonable amount of savings. However, note that these techniques have shown effectiveness only at the 8-bit FP precision level (i.e., not below 8-bit precision) and only for relatively less complex datasets like CIFAR-10. Moreover, voltage scaling techniques such as ThunderVolt [19] and MATIC [7] either introduce faults at runtime that can lead to undesirable accuracy loss, or require expensive fault-aware retraining.
Summary of Key Challenges Targeted in this Work: Based on the above-mentioned limitations of the existing works, we highlight the following key challenges:


- There is a need to investigate existing DNN post-training quantization and approximation techniques to identify the methods and data representation formats that can enable high energy and performance efficiency without affecting the application-level accuracy of DNNs.

- To maintain high accuracy, dynamic error compensation requires additional resources that increase the DNN inference cost. Therefore, there is a need to explore low-cost software-level error compensation mechanisms that need to be applied only once during DNN conversion (i.e., at design time) and have the potential to offer benefits equivalent to costly dynamic error compensation techniques.
Our Novel Contributions: We propose an algorithm and architecture co-design framework, CoNLoCNN, for effectively approximating DNNs through post-training quantization to improve the power/energy efficiency of the DNN inference process without involving retraining. Towards this:


- We investigate different methods for quantizing DNN data structures in search of an effective method for approximating DNNs that offers a high energy-accuracy trade-off. We identify that choosing a data representation format that is aligned with the long-tailed data distribution of DNN parameters (see Fig. 3) results in less overall quantization error, and this selection can also help in simplifying the hardware components (mainly the multipliers in the processing array of DNN hardware accelerators) (Section III-A). Based on our analysis, we propose an Encoded Low-Precision Binary Signed Digit (ELP_BSD) data representation format that, at its core, is an evolved version of the power-of-two representation and thereby enables the use of low-cost shift operations instead of multiplications. This results in a significant simplification of the MAC units in DNN accelerators and thereby enables high energy and performance efficiency (Section IV).

- We propose a low-cost error compensation strategy that needs to be applied only once at conversion time (compile time) to compensate for quantization errors. This enables users to completely avoid the overheads associated with dynamic compensation (Section III-B).

- We propose a systematic methodology for quantizing DNNs while exploiting the proposed error compensation strategy (Section V).

- We propose a specialized Multiply-and-Accumulate (MAC) unit design that fully exploits the potential of the proposed approximation scheme for improving the energy efficiency of DNN inference (Section IV-4).
A summary of our novel contributions is shown in Fig. 1.
II Preliminaries
This section introduces the key terms used in this paper.
II-A Deep Neural Networks (DNNs)
A DNN is an interconnected network of neurons, where each neuron performs a weighted-sum operation and then passes the output through a non-linear activation function. The functionality of a neuron can mathematically be written as $y = f\left(\sum_i w_i x_i + b\right)$, where $y$ represents the output, $w_i$ represents the weights, $x_i$ represents the inputs (a.k.a. activations), $b$ represents the bias, and $f$ represents the non-linear activation function. In DNNs, neurons are arranged in the form of layers. The layers are then connected in a specialized format to form a DNN.
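For illustration, a minimal NumPy sketch of this weighted-sum-plus-activation computation (using ReLU as the non-linear function) is given below; the variable names are ours, not from the paper.

```python
import numpy as np

def neuron(x, w, b):
    """Compute f(sum_i w_i * x_i + b) with f = ReLU."""
    z = np.dot(w, x) + b          # weighted sum plus bias
    return np.maximum(z, 0.0)     # non-linear activation (ReLU)

x = np.array([0.5, -1.0, 2.0])    # inputs (activations)
w = np.array([0.8, 0.1, -0.3])    # weights
print(neuron(x, w, b=0.2))        # -> 0.0 (ReLU clips the negative sum)
```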
II-B Convolutional Neural Networks (CNNs)
A CNN is a type of DNN specialized for processing spatially correlated data such as images [8]. It is mainly composed of convolutional (CONV) layers and fully-connected (FC) layers (see Fig. 2(a)). The CONV layers extract features from the input by convolving it with a set of filters (see Fig. 2(b)). The output generated by convolving a filter with the input is referred to as a feature map. The FC layers are typically used at the end of the network for classification. An FC layer is composed of several neurons, where each neuron receives the complete output vector of the previous layer to generate an output.
III Strategies for Enabling Low-Precision and Energy-Efficient DNN Inference
III-A Strategy 1: Select a quantization scheme that is aligned with the data distribution
Fig. 3 shows the distributions of the weights, biases, and activations of the layers of a trained AlexNet. It can be observed from the figure that the distributions of weights and activations have long tails, i.e., in each data structure, the majority of the values have small magnitudes and only a limited number of values have large magnitudes. Moreover, for each layer, the distribution of weights is close to a Gaussian distribution with zero mean. Considering these data distributions, a low-precision uniform quantization (for example, see Fig. 4(a)) would result in a high overall quantization error compared to a non-uniform quantization (for example, see Fig. 4(b)) having the same (or a smaller) number of quantization levels, assuming the levels are distributed based on the data distribution, i.e., more narrowly-spaced quantization levels in dense regions and fewer widely-spaced quantization levels in sparse regions. Therefore, aligning the quantization scheme with the data distribution can help in reducing the overall/average quantization error. However, a potential limitation is that, in the case of low-precision non-uniform quantization, it can significantly increase the maximum quantization error, leading to a high error variance.
Hence, the ideal quantization scheme should balance between the average and the maximum quantization error to achieve minimum quality degradation.
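This trade-off can be reproduced with a small Monte-Carlo sketch: for the same number of levels, power-of-two (non-uniform) levels give a lower average error on Gaussian-distributed weights but a larger maximum error. The level sets below are illustrative choices, not the configurations used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.2, 100_000)   # Gaussian stand-in for a layer's weights

def quantize(values, levels):
    """Map each value to its nearest quantization level."""
    levels = np.asarray(levels)
    idx = np.abs(values[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

# 9 uniform levels vs. 9 power-of-two levels over roughly the same range
uniform = np.linspace(-1.0, 1.0, 9)
pow2 = np.array([-1, -0.5, -0.25, -0.125, 0, 0.125, 0.25, 0.5, 1.0])

for name, levels in [("uniform", uniform), ("power-of-two", pow2)]:
    err = np.abs(w - quantize(w, levels))
    print(f"{name:13s} mean |err| = {err.mean():.4f}  max |err| = {err.max():.4f}")
```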
III-B Strategy 2: Exploit correlation between neighboring feature map values to reduce the effective variance and mean of quantization error
Here, we first analyze the impact of variations in the bias values of a CNN on its classification accuracy. Note that this analysis mainly helps us understand the effects of errors that affect the mean of the activation values. Then, we study the correlation of data within and across the input feature maps of a layer to see if quantization errors can be partially compensated by distributing them across the weights in the same filter/neuron. Afterward, we present a mathematical analysis and show how the gained insights can be exploited to reduce the impact of quantization errors.
III-B1 Impact of Modifying the Bias Values in DNNs
Fig. 5 shows the impact of varying the bias of different numbers of filters of a layer of a pretrained AlexNet on its classification accuracy. From Fig. 5(a) and Fig. 5(c), it can be observed that, when a small constant value, i.e., a value close to the range of the original bias values (see Fig. 3(b)), is added to the bias values of a number of filters, the accuracy of the network stays close to its baseline. However, when the magnitude of the constant is large, it degrades the accuracy. The difference between the impact of positive and negative noise is mainly due to the presence of the ReLU activation function in AlexNet: a large negative bias leads to a large negative output, which is then mapped to zero by the ReLU function, thereby limiting the impact of the error on the final output. Fig. 5(b) and Fig. 5(d) show that, when the bias values of half of the filters are injected with positive noise and half with negative noise (all having the same magnitude), the behavior of the classification accuracy is dominated by the filters injected with positive noise.
To further study the impact of mean shifts in the output feature maps, we performed an experiment where we added noise generated using a Gaussian distribution to the bias values of the filters/neurons of different layers of a pretrained AlexNet (see Fig. 6). We observed that when the noise is generated using smaller standard deviation values and is injected into the intermediate layers of the network, it does not impact the accuracy much. However, if the noise is injected into the last layer of the network or is generated using larger standard deviation values, it leads to a significant drop in the DNN accuracy.
From the above analysis, we deduce the following three conclusions. (1) Small-to-moderate variations, due to any error/noise source, in the mean of the activation values of all layers except the last do not impact the accuracy much. (2) A mean shift in the output degrades the accuracy only if it is large in magnitude or if it occurs in the output of the last layer of a DNN. (3) The resilience of DNNs to small-to-moderate errors in bias values points to the significance of large activation values.
III-B2 Correlation Between Activation Values of Input Feature Maps
Intra-Feature-Map Correlation: Fig. 7 shows the correlation between neighboring input activation values located at a constant shift from each other (denoted by the shift variables in the figure) within the input feature maps of a convolutional layer. Based on the correlation values in the figure, it can be said that, in all the convolutional layers of a DNN, there is significant correlation between neighboring activation values.
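Such correlations can be measured with a short NumPy sketch. The smoothed random map below stands in for a real feature map (which would come from an actual network), but shows the same trend: correlation stays high for small shifts and decays with distance.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.normal(size=(68, 68))
# Cheap 5x5 box filter by averaging shifted views -> spatially correlated map
fmap = sum(noise[i:i + 64, j:j + 64] for i in range(5) for j in range(5)) / 25.0

def shifted_correlation(m, dy, dx):
    """Pearson correlation between the map and a copy shifted by (dy, dx)."""
    h, w = m.shape
    a = m[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
    b = m[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

for dy, dx in [(0, 1), (1, 0), (1, 1), (0, 4)]:
    print(f"shift ({dy},{dx}): corr = {shifted_correlation(fmap, dy, dx):.2f}")
```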
Inter-Feature-Map Correlation: Fig. 8 shows the distribution of the correlation between different input feature maps of a layer of a pretrained AlexNet. Fig. 8(a) shows that there is significant correlation between the input feature maps of layer 1 of the network. The distributions in Figs. 8(b), 8(c), and 8(d) show that, as we move deeper into the network, the inter-feature-map correlation moves towards zero.
Based on the above analysis, we conclude that only intra-feature-map correlation can be exploited for error compensation.
III-B3 Analysis of Quantization Error
To analyze the effects of quantization on the output quality, let us consider a scenario in which we quantize the weights of a layer of a DNN and keep the activations in full-precision format. A quantized weight $w_q$ can be written as

$w_q = w + \epsilon$   (1)

where $w$ represents the unquantized weight and $\epsilon$ represents the quantization error. If we assume $w \sim (\mu_w, \sigma_w^2)$ and $\epsilon \sim (\mu_\epsilon, \sigma_\epsilon^2)$, and $w$ and $\epsilon$ to be independent, then

$\mu_{w_q} = \mu_w + \mu_\epsilon, \quad \sigma_{w_q}^2 = \sigma_w^2 + \sigma_\epsilon^2$   (2)

Similar to the distribution of $w$, for the activations, we assume

$a \sim (\mu_a, \sigma_a^2)$   (3)

Now, for the product $w_q \cdot a$, using the above equations and assuming the weights and activations to be independent, we get

$\mathrm{Var}(w_q a) = \sigma_{w_q}^2 \sigma_a^2 + \sigma_{w_q}^2 \mu_a^2 + \mu_{w_q}^2 \sigma_a^2$   (4)

As highlighted in the earlier analysis, small deviations in the mean of the output activations do not impact the accuracy much; therefore, we mainly focus on comparing the variance of $w_q a$ with the variance of $w a$. Subtracting the variance of $w a$ from the variance of $w_q a$, we are left with

$\mathrm{Var}(w_q a) - \mathrm{Var}(w a) = \sigma_\epsilon^2 \left(\sigma_a^2 + \mu_a^2\right) + \left(2 \mu_w \mu_\epsilon + \mu_\epsilon^2\right) \sigma_a^2$   (5)
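As a sanity check, the additional variance term in Eq. (5) can be verified numerically; the sketch below uses arbitrary distribution parameters and the independence assumptions stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000_000
mu_w, s_w = 0.0, 0.05     # weight distribution
mu_e, s_e = 0.01, 0.02    # quantization-error distribution
mu_a, s_a = 0.5, 0.3      # activation distribution

w = rng.normal(mu_w, s_w, n)
e = rng.normal(mu_e, s_e, n)
a = rng.normal(mu_a, s_a, n)

# Empirical Var(w_q * a) - Var(w * a) vs. the closed-form of Eq. (5)
empirical = np.var((w + e) * a) - np.var(w * a)
analytic = (s_e**2) * (s_a**2 + mu_a**2) + (2 * mu_w * mu_e + mu_e**2) * s_a**2
print(f"empirical gap: {empirical:.6f}   analytic gap: {analytic:.6f}")
```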
Now, to reduce the intensity of this additional term, we need to reduce the intensity of $\mu_\epsilon$ and $\sigma_\epsilon^2$. We can achieve this by exploiting the high correlation among neighboring activation values in feature maps. Fig. 9(a) shows a possible way of decomposing the activation values of a block of a feature map based on the correlation among neighboring values. In the case of high correlation (for example, see Fig. 9(b)), the correlation coefficients in Fig. 9(a) have high values (i.e., close to 1). As the residual components in Fig. 9(a) represent the parts of the neighboring activations that are orthogonal to the reference activation, they exhibit low variance compared to the activations themselves when the correlation with the reference activation is high. We can exploit the presence of this shared, correlated component in the neighboring activations to partially compensate/balance the error introduced in the dot-product of weights and activations due to the quantization of weights, by modifying the quantization scheme in such a way that it balances the mean quantization error of the neighboring weights.
Example: To understand this, consider the activation block A shown in Fig. 9(c) and the 2D filter W shown in Fig. 9(d). The dot-product of A and W comes out to be 50.32. Now, if we quantize the weights of the filter to the nearest integer values and perform the dot-product operation, we get 57.7 as the output; see Fig. 9(e). However, if we map the value of the second weight to its other nearest integer value, i.e., 2 instead of 3 (see Figs. 9(e) and 9(f)), we reduce the mean absolute error (MAE) in the weights from 0.225 to 0.025 and the absolute error in the output of the dot-product from 7.38 to 1.12. This shows that high correlation among neighboring values enables us to reduce the effective mean and variance of the quantization error by minimizing the local mean quantization error inside filter channels.
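The same mechanism can be reproduced in a few lines of NumPy. Since the exact A and W values live in Fig. 9 (not reproduced here), the numbers below are hypothetical, but flipping one weight to its second-nearest level cancels most of the accumulated output error in the same way.

```python
import numpy as np

A = np.array([4.1, 3.9, 4.0, 4.2])    # highly correlated neighboring activations
W = np.array([0.6, 2.6, -1.4, 0.6])   # hypothetical filter weights

exact = A @ W
Wq = np.round(W)                       # nearest-integer quantization: [1, 3, -1, 1]
print("naive output error :", A @ Wq - exact)        # ~ +6.5

# Balance the mean quantization error: remap one weight (2.6) to its *other*
# nearest level (2 instead of 3), pushing the filter's mean error towards zero.
Wc = Wq.copy()
Wc[1] = 2.0
print("compensated error  :", A @ Wc - exact)        # ~ +2.6
print("mean weight error  :", (Wq - W).mean(), "->", (Wc - W).mean())
```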
III-B4 Impact of Adjusting Intra-Channel Mean Quantization Error in Weights
To study the impact of adjusting the mean error in filters, we performed an experiment where we injected noise generated using a Gaussian distribution into the weights of the filters of layer 1 and layer 4 of AlexNet. We studied three different cases: (1) no adjustment of the mean error of the weights; (2) mean error adjustment case 1, where the mean error of each filter is subtracted from the corresponding weights; and (3) mean error adjustment case 2, where the mean error of each filter channel is subtracted from the corresponding weights. Fig. 10 shows that, among the three cases, intra-channel adjustment (i.e., mean error adjustment case 2) leads to the highest compensation and thereby the best results.
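A minimal sketch of the three cases, assuming a weight tensor of shape (filters, channels, k, k) and Gaussian noise as a stand-in for the error source (quantization error, unlike random noise, is exactly known at conversion time):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.05, size=(64, 32, 3, 3))      # (filters, channels, k, k)
noise = rng.normal(0.01, 0.02, size=W.shape)      # injected weight error
Wn = W + noise
err = Wn - W

w1 = Wn                                            # Case 1: no adjustment
w2 = Wn - err.mean(axis=(1, 2, 3), keepdims=True)  # Case 2: per-filter mean
w3 = Wn - err.mean(axis=(2, 3), keepdims=True)     # Case 3: per-channel mean

for name, w in [("none", w1), ("per-filter", w2), ("per-channel", w3)]:
    residual = np.abs((w - W).mean(axis=(2, 3))).mean()
    print(f"{name:12s} residual |mean err| per channel: {residual:.5f}")
```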
IV Type of Non-Uniform Quantization and Design of Supporting DNN Hardware
Strategy 1 in Section III states that aligning the quantization scheme with the data distribution helps in restricting the overall quantization error. However, the key challenge is how this observation can be exploited to improve the efficiency of a DNN-based system. To address this, we propose the Encoded Low-Precision Binary Signed Digit (ELP_BSD) representation, which evolves from the power-of-two representation and thereby enables the use of shift operations instead of multiplications in the processing arrays of DNN accelerators. In the following, we discuss the details of our data representation format, starting with the initial concept, the limitations of the initial proposition, and how ELP_BSD overcomes these limitations.
IV-1 Initial Proposition
In this work, the key focus is on exploiting non-uniform quantization for simplifying the MAC unit design as well as for reducing the bit-width of weights, as both contribute towards improving the energy efficiency. Power-of-two quantization is one potential solution, as it allows us to replace a costly multiplication operation with a shift operation and to reduce the bit-width by storing only the power of 2. Moreover, the distribution of the quantization levels of power-of-two quantization is aligned with the distribution of DNN weights, as can be observed from Fig. 11(b). However, the use of power-of-two quantization results in a significant drop in application-level accuracy due to a significant reduction in the number of quantization levels compared to traditional FP quantization, which can be observed by comparing Fig. 11(b) with Fig. 11(a). Therefore, almost all previous works that use power-of-two quantization employ retraining to regain the lost accuracy. To offer additional quantization levels and avoid significant accuracy loss while benefiting from the advantages of the power-of-two quantization scheme, we propose to use sum-of-signed-power-of-two quantization, where more than one signed power-of-two digit is combined to offer additional quantization levels. Fig. 11(c) shows that the addition of only a single low-range, low-precision signed power-of-two digit can significantly increase the number of unique quantization levels and thereby help in achieving an ultra-efficient system.
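The growth in unique levels can be illustrated by enumerating them; the digit power ranges below are illustrative choices, not the exact configurations shown in Fig. 11.

```python
from itertools import product

def levels_one_digit(powers):
    """Signed power-of-two levels: s * 2^p with s in {-1, +1}."""
    return sorted({s * 2.0**p for s in (-1, 1) for p in powers})

def levels_two_digits(powers1, powers2):
    """Sum of two signed power-of-two digits."""
    d1 = [s * 2.0**p for s in (-1, 1) for p in powers1]
    d2 = [s * 2.0**p for s in (-1, 1) for p in powers2]
    return sorted({a + b for a, b in product(d1, d2)})

one = levels_one_digit(range(-4, 1))                 # powers 2^-4 ... 2^0
two = levels_two_digits(range(-4, 1), range(-6, -3)) # + low-range second digit
print(len(one), "levels with one digit")             # 10
print(len(two), "unique levels with two digits")     # far more, despite redundancy
```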
IV-2 Limitations of the Initial Proposition and New Improvements
One key issue with sum-of-signed-power-of-two quantization is redundant representations of values; for example, $2^2 - 2^0$ and $2^1 + 2^0$ both represent 3. To reduce the amount of redundancy, we propose to reduce the number of possible power-of-2 values per digit. For example, if a number is represented as the sum of two signed power-of-two digits, we can reduce the redundancy by shrinking the set of possible power-of-2 values of the first digit. Fig. 11(d) shows that reducing the number of possible power-of-2 values for the first digit leads to a reduction in the amount of redundancy, which can be observed from the reduced number of yellow semicircles (representing the range of the second digit) centered at every possible quantization level of the first digit. Note that the reduction in redundancy can be exploited to reduce the bit-width of numbers as well as to further simplify the MAC unit design. However, a drawback is that it can result in a small decrease in the number of quantization levels, as highlighted in Fig. 11(d). The redundancy can also be reduced by allowing some digits to have signed values and some only positive (i.e., unsigned) values. This is illustrated in Fig. 11(e) by restricting the range of the second digit to only positive values. Note that even though the sum of power-of-2 digits leads to some redundancy, it enables us to use shifters instead of multipliers in the hardware, which significantly improves the energy efficiency of DNN systems (as will be highlighted in Section VI). Moreover, the above explanation shows that the redundancy can be controlled through intelligent selection of the exact number representation format.
IV-3 ELP_BSD Data Representation
To efficiently represent low-precision sum-of-signed-power-of-two numbers, we define a novel data representation format, the Encoded Low-Precision Binary Signed Digit (ELP_BSD) representation. Fig. 12(a) shows how the specifications of an ELP_BSD representation are defined, and Fig. 12(b) shows the corresponding binary representation format. As shown in Fig. 12(b), the bits are divided into groups, where each group is responsible for representing a single signed power-of-two digit. Each group consists of a sign bit and the bits needed to index the digit's shift count, i.e., $\lceil \log_2 k_i \rceil$ bits for digit $i$, where $k_i$ is the number of different shift counts specified for that digit. Note that the sign bit is optional and is only used when the corresponding digit is signed. Fig. 12 also presents two examples to explain the conversion of ELP_BSD numbers to FP values.
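Since the exact bit layout is defined in Fig. 12 (not reproduced here), the decoder below assumes a simplified layout: per digit, an optional sign bit followed by the index bits that select one of the digit's allowed shift counts. Field ordering and names are our assumptions, not the paper's specification.

```python
import math

def decode_elp_bsd(bits, digit_specs, scale=1.0):
    """Decode an ELP_BSD-style bit string into a real value.

    digit_specs: list of (signed, shift_counts) tuples, one per digit, MSB-first.
    Assumed layout per digit: [sign bit (if signed)] [index bits].
    """
    value, pos = 0.0, 0
    for signed, shifts in digit_specs:
        sign = 1
        if signed:
            sign = -1 if bits[pos] == "1" else 1
            pos += 1
        idx_bits = max(1, math.ceil(math.log2(len(shifts))))  # >= 1 index bit
        idx = int(bits[pos:pos + idx_bits], 2)
        pos += idx_bits
        value += sign * 2.0 ** shifts[idx]   # a shift count s encodes 2^s
    return value * scale

# One signed digit with shift counts {-3,-2,-1,0} (2 index bits) plus one
# unsigned digit with shift counts {-5,-4} (1 index bit):
specs = [(True, [-3, -2, -1, 0]), (False, [-5, -4])]
print(decode_elp_bsd("101" + "1", specs))   # -(2^-2) + 2^-4 = -0.1875
```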
IV-4 Supporting Hardware Design
As shown in Fig. 12, the ELP_BSD format mainly stores power-of-two digits in an encoded format. To efficiently implement multiplication with a power-of-two digit at the hardware level, shifters (e.g., a barrel shifter) can be used. To realize a MAC unit, the shifter is followed by an adder that adds the previously computed partial sum to the newly computed product. Fig. 13(a) shows the MAC unit design for the case where weights are represented using a single signed power-of-two digit. If we use the same ELP_BSD format for all the weights, we can hard-code the indexing functionality in the shifter and use the indexes directly from the encoded weight for multiplication. Moreover, we can choose the set of possible shift counts in a manner that results in a less complex shifter design.
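In software terms, the single-digit MAC of Fig. 13(a) reduces to a shift and an add; the bit-exact integer sketch below is our illustration, not the Verilog design.

```python
def mac_pow2(acc, activation, sign, shift):
    """Single-digit MAC: acc += (+/-) (activation << shift).

    `acc` and `activation` are fixed-point integers; `shift` >= 0 selects the
    power-of-two weight magnitude (fractional weights would use a pre-scaled
    fixed-point representation); `sign` is +1 or -1.
    """
    product = activation << shift          # multiply by 2^shift via shifting
    return acc + (product if sign > 0 else -product)

acc = 0
for a, (s, sh) in zip([13, 7, 25], [(+1, 2), (-1, 0), (+1, 3)]):
    acc = mac_pow2(acc, a, s, sh)
print(acc)   # 13*4 - 7*1 + 25*8 = 245
```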
In case weights are represented using an ELP_BSD format that contains multiple digits, we can use multiple of these units (one per digit) in parallel to compute the partial products, which then have to be added together to generate one output. To achieve this addition, we propose to use a compressor tree followed by a multi-bit adder that adds the outputs of the shifters and the partial sum from the previous computation to generate a single output. Fig. 13(b) shows how multiple single-digit MAC units can be integrated into the Processing Elements (PEs) of a Neural Processing Array (NPU), e.g., like the Tensor Processing Unit (TPU) [6]. Fig. 13(c) shows the processing array design of the TPU-like architecture. This processing array follows a weight-stationary dataflow, where weights are kept stationary inside the PEs during execution. The input activations are fed from the left and moved towards the right over clock cycles. Similarly, the partial sums are moved towards the bottom of the array. Note that, for this work, we choose to represent activations using FP and 2's complement format, as changing the format of activations would not have any significant impact on the length of the adders in the PEs. It is also important to highlight that, in most cases, 1-3 single-digit MAC units per PE are sufficient to meet the user-defined accuracy constraints.
V Our Methodology for Efficient Approximation of CNNs through Non-Uniform Quantization
Fig. 14 presents our methodology for approximating CNNs through non-uniform quantization while exploiting the strategies mentioned in Section III. The following steps explain the working of the methodology.


1) Determine the critical FP bit-width of activations: Starting from the maximum allowed FP bit-width for activations, we gradually reduce the bit-width to find the critical point after which the accuracy loss of the input DNN rises above the user-defined accuracy loss constraint (AC). For this step, we assume the bit-width of all activations to be the same. The key intuition behind this step is that a decrease in the activations' bit-width results in a linear decrease in the width of the MAC units, which helps in improving the energy efficiency. This step outputs the critical bit-width for the activations.

2) Determine the scaling factor for weights: Given the data representation format for the weights, we compute the scaling factor for the weights of each layer of the input DNN separately. For this work, in the case of the ELP_BSD format, the scaling factor of a layer is derived from the range of the layer's weights relative to the range representable in the format.

3) Apply nearest-neighbor quantization: Using the data representation format and the scaling factors, we generate a table of possible quantization levels (TQL) for each layer of the input DNN. Then, to perform quantization, we replace the weights with their nearest values in the tables.

4) Apply the error compensation algorithm: For each convolutional layer, we pass the weights to Algo. 1, which (partially) compensates for the errors introduced due to the quantization of weights by exploiting Strategy 2 from Section III. Note that, as most state-of-the-art architectures use small filter sizes, e.g., 3x3, Algo. 1 focuses on compensating the overall channel-level mean quantization error in the filter weights. To achieve this, it computes the mean quantization error of a channel of a filter, locates the values that can be mapped to their other neighboring quantization level to reduce the mean error, sorts all the located values based on a cost function, and starts altering the values, beginning with the values having the least cost, as long as the absolute mean error keeps decreasing (a simplified sketch of steps 3 and 4 is given after this list).

5) Estimate the overall accuracy loss: In this step, we compute the accuracy to check whether the user-defined accuracy constraint is met. In case the constraint is not satisfied, the algorithm increases the activation bit-width by 1 and performs the accuracy evaluation again. If the constraint is still not met and the bit-width becomes equal to the maximum allowed bit-width, it outputs the latest quantized DNN.
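Below is a simplified, self-contained sketch of steps 3 and 4: nearest-neighbor quantization against a TQL, followed by greedy channel-level mean-error compensation in the spirit of Algo. 1 (the paper's exact cost function may differ).

```python
import numpy as np

def quantize_to_levels(w, levels):
    """Step 3: nearest-neighbor quantization against a table of levels (TQL)."""
    levels = np.asarray(sorted(levels))
    idx = np.abs(w[..., None] - levels).argmin(axis=-1)
    return levels[idx]

def compensate_channel(wq, w, levels):
    """Step 4 (one filter channel): greedily remap weights to their
    second-nearest level while doing so shrinks the channel's |mean error|."""
    levels = np.asarray(sorted(levels))
    wq, w = wq.ravel().copy(), w.ravel()
    for _ in range(w.size):                       # bounded number of passes
        mean_err = (wq - w).mean()
        best_i, best_gain, best_val = -1, 0.0, None
        for i in range(w.size):
            others = levels[levels != wq[i]]      # candidate: second-nearest level
            alt = others[np.abs(others - w[i]).argmin()]
            new_mean = mean_err + (alt - wq[i]) / w.size
            gain = abs(mean_err) - abs(new_mean)
            if gain > best_gain:
                best_i, best_gain, best_val = i, gain, alt
        if best_i < 0:                            # no remap reduces |mean error|
            break
        wq[best_i] = best_val
    return wq

levels = [k / 4 for k in range(-8, 9)]            # toy quantization table
rng = np.random.default_rng(1)
w = rng.normal(0, 0.4, 9)                         # one 3x3 filter channel
wq = quantize_to_levels(w, levels)
wc = compensate_channel(wq, w, levels)
print("mean err before:", round((wq - w).mean(), 4),
      " after:", round((wc - w).mean(), 4))
```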
VI Results and Discussion
VI-A Experimental Setup
To evaluate CoNLoCNN, we extended the MatConvNet [17] framework with support for FP implementations and our CoNLoCNN methodology. We evaluated CoNLoCNN using two popular DNNs used for benchmarking FP implementations, i.e., AlexNet and VGG-16, trained on the ImageNet dataset. For hardware synthesis, we implemented PEs composed of different MAC unit designs in Verilog and synthesized them for TSMC 65 nm technology using Cadence Genus.
VI-B Effectiveness of Our Error Compensation Strategy, i.e., Algorithm 1
To demonstrate the effectiveness of the error compensation algorithm, we applied it to an FP implementation and compared the results with a conventional FP implementation. Note that, for this experiment, we assumed the bit-width of the weights and activations to be the same and uniform across all the layers of the DNN. Fig. 15(a) shows the results for AlexNet. As shown in the figure, our error compensation strategy helps in improving the accuracy of the network, specifically at lower bit-widths.
VI-C Effectiveness of CoNLoCNN for State-of-the-Art DNNs
To demonstrate the effectiveness of our overall methodology, we considered four different ELP_BSD data representation formats. The formats are listed in Table II along with the hardware characteristics of the PEs implemented using the corresponding MAC designs for a 32x32 processing array (shown in Fig. 13(c)). Note that the scaling factor (represented by SF in the table) of the representations is not the same across layers; it is selected based on the statistics of the parameters of the corresponding layer. For each representation, we considered five different activation bit-widths, i.e., from 8-bit down to 4-bit, to study the impact of the activation bit-width on the accuracy of the network and the hardware efficiency. Fig. 15(b) shows the accuracy vs. Power-Delay Product (PDP) results achieved when CoNLoCNN is used for AlexNet, considering the ELP_BSD configurations listed in Table II. The plots on the left inside Fig. 15(b) are generated by CoNLoCNN, while the two points on the right represent the conventional designs considered for comparison. The plot shows that, as the bit-width of the activations decreases, the accuracy decreases slightly until a point after which the rate of degradation increases drastically. Moreover, different ELP_BSD formats offer different accuracy-efficiency characteristics. The key observation in Fig. 15(b) is that even the most power-consuming PE design generated by the proposed CoNLoCNN framework offers around a 50% reduction in PDP compared to the conventional designs. In case a 1.44% drop in accuracy is acceptable, the proposed method offers around a 76% reduction in PDP. Similar results were observed for the VGG-16 network.
VI-D Comparison with the State-of-the-Art
To compare our results with the state-of-the-art, we implemented CAxCNN [12] inside our framework. We selected the Canonical Approximate (CA) representation with 1 non-zero digit, where the weights are converted using their exhaustive search algorithm (i.e., their best algorithm), to have a fair comparison. For AlexNet, CAxCNN with the 1 non-zero digit CA representation achieves 50.9% top-1 accuracy, while CoNLoCNN achieves 55.4% accuracy (i.e., close to the baseline). This improvement is mainly due to our error compensation strategy. The quantization levels offered by the 1 non-zero digit CA representation are almost the same as those offered by the ELP_BSD{SF,[-1,0,1,2,3,4,5,6,7]} format, with the only difference being '0', which is not present in ELP_BSD{SF,[-1,0,1,2,3,4,5,6,7]}. However, the absence of '0' does not affect the accuracy, due to the presence of the -1 and 1 quantization levels and the use of error compensation. Note that the absence of '0' in the highlighted case helps in achieving a simplified PE design. Even in the best possible scenario, the CA representation would require 5 bits per weight, while ELP_BSD{SF,[-1,0,1,2,3,4,5,6,7]} requires 4 bits per weight. Similar to AlexNet, for VGG-16, we observed 3% higher accuracy with CoNLoCNN compared to CAxCNN. Note that CoNLoCNN not only helps in reducing the complexity of the hardware but also helps in reducing the memory footprint of DNNs, unlike other approximate computing works (e.g., [10]) that operate at 8-bit precision.
VII Conclusion
In this paper, we proposed CoNLoCNN, a framework to enable energy-efficient, low-precision, approximate DNN inference. CoNLoCNN mainly exploits non-uniform quantization of weights, to simplify the processing elements in the computational array of DNN accelerators, and the correlation between activation values, to (partially) compensate for the quantization errors without any runtime overhead. We also proposed the Encoded Low-Precision Binary Signed Digit (ELP_BSD) representation to reduce the bit-width of weights while ensuring direct use of the encoded weights in computations through supporting MAC unit designs.
Acknowledgment
This research is partly supported by the ASPIRE AARE Grant (S1561) on "Towards Extreme Energy Efficiency through Cross-Layer Approximate Computing".
References
[1] (2017) Structured pruning of deep convolutional neural networks. ACM JETC 13(3), pp. 1–18.
[2] (2016) Hardware-oriented approximation of convolutional neural networks. arXiv preprint arXiv:1604.03168.
[3] (2015) Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
[4] (2018) Compensated-DNN: Energy efficient low-precision deep neural networks by compensating quantization errors. In ACM/ESDA/IEEE DAC, pp. 1–6.
[5] (2019) BiScaled-DNN: Quantizing long-tailed datastructures with two scale factors for deep neural networks. In ACM/IEEE DAC, pp. 1–6.
[6] (2017) In-datacenter performance analysis of a tensor processing unit. In IEEE ISCA, pp. 1–12.
[7] (2018) MATIC: Learning around errors for efficient low-voltage neural network accelerators. In IEEE DATE, pp. 1–6.
[8] (2015) Deep learning. Nature 521(7553), pp. 436–444.
[9] (2016) Fixed point quantization of deep convolutional networks. In ICML, pp. 2849–2858.
[10] (2019) ALWANN: Automatic layer-wise approximation of deep neural network accelerators without retraining. arXiv preprint arXiv:1907.07229.
[11] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542.
[12] (2020) CAxCNN: Towards the use of canonic sign digit based approximation for hardware-friendly convolutional neural networks. IEEE Access 8, pp. 127014–127021.
[13] (2017) Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–852.
[14] (2017) Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE 105(12), pp. 2295–2329.
[15] (2017) Hardware-software codesign of accurate, multiplier-free deep neural networks. In ACM/EDAC/IEEE DAC, pp. 1–6.
[16] (2015) Efficient data encoding for convolutional neural network application. ACM TACO 11(4), pp. 1–21.
[17] (2015) MatConvNet – Convolutional neural networks for MATLAB. In Proceedings of the ACM International Conference on Multimedia.
[18] (2014) AxNN: Energy-efficient neuromorphic systems using approximate computing. In IEEE/ACM ISLPED, pp. 27–32.
[19] (2018) ThunderVolt: Enabling aggressive voltage underscaling and timing error resilience for energy efficient deep learning accelerators. In IEEE DAC, pp. 1–6.
[20] (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064.