With the increasing ubiquity of intelligent applications in our daily lives, machine learning and artificial neural networks are rapidly finding their way into a wide range of systems and devices, from large-scale cloud computing systems all the way down to handheld and even miniature implanted devices. As the amount of data captured and generated by devices such as smartphones and Internet-of-Things (IoT) sensors grows, the deep neural networks (DNNs) needed to accomplish the desired tasks can become too large or complex to realize entirely within a device. For devices located at the edge of a network, lightweight and mobile-friendly architectures [1, 2, 3] can facilitate the implementation of these DNNs. When a DNN that performs real-time inference or other compute-intensive operations is too complex to realize fully on an edge device, a collaborative intelligence [4, 5]
paradigm can be used to split the DNN so that the bulk of the computations can be performed in the cloud. In this case, a subset of layers of the DNN is computed inside the edge device, and then the output of the last layer on the device is signaled to the cloud, to be used as input to the remaining layers of the DNN. The ideal location to split the DNN can be determined by examining both the available computational resources in the edge device and the size of the data such as feature tensors that need to be signaled.
The amount of data or the size of feature tensors generated by hidden layers in a DNN can be quite inflated compared to the size of the DNN’s input or output data. If a network is split at a hidden layer, then data reduction and compression methods are needed to make the signaling of feature tensors practical. Typical approaches for data reduction in neural networks include network optimization methods such as factorization, pruning, and network parameter quantization [8, 9]. The goal of many of the neural network quantization methods is to reduce the size of the data, complexity of operations, and amount of memory so that the DNN can operate more efficiently on a device. In the context of collaborative intelligence, an additional goal is to process and efficiently compress the features produced by the DNN’s front-end, so that they can be signaled between an edge device and the cloud.
In our earlier work, we presented a lightweight compression method that is well-suited for coding the output of a split DNN in edge-based devices. This method uses simple and very coarse scalar quantization along with clipping, binarization, and entropy coding to compress the activations, without needing any retraining of network weights, i.e., post-training quantization. All the clipping ranges in that paper were determined empirically. In this paper, we introduce mathematical models for the distributions of feature tensors output by a leaky rectified linear unit (ReLU). We use these models to obtain closed-form expressions for clipping and quantization error, from which we can obtain optimal clipping values. We show how well these models estimate the error, and then we compare the overall neural network performance using these models to the performance obtained when using empirically determined clipping ranges. We also provide additional detailed experimental results for the networks used in this paper.
In Section II we describe prior related works on quantization and compression in neural networks. In Section III, we present the lightweight codec, models for optimal clipping and quantization, discussions and illustrations of how the models behave, methods for improving the system’s compression efficiency, and computational complexity comparisons. Section IV presents experimental results and comparisons between empirical and model-based performance, followed by conclusions in Section V.
II Related work
Methods for quantizing or compressing neural network weights or feature tensors generally fall into two categories: quantization-aware training and post-training quantization. In quantization-aware training, the network weights are trained with quantization applied or simulated in floating point, often in the forward pass, with higher-precision arithmetic used during stochastic gradient descent to assist with convergence of the training process. For example, Jacob et al. use that approach to quantize weights and activations to 8 bits, while quantizing bias parameters to 32 bits. Mishra et al. find that when increasing the batch size during training of a low-precision network, the proportion of memory used by activation maps increases significantly. To reduce the precision of weights and activations while maintaining the overall accuracy of the network, they insert additional filter maps into each layer. They find that for networks such as ResNet-34 and others, near full-precision performance can be achieved with 4-bit activations and 2-bit weights. In
, weights, activations, and some of the gradient and back-propagation computations are quantized to 8 bits, along with a scaling modification made to batch normalization. In, networks are trained with weights and activations quantized to two bits. Good performance is achievable even with extremely coarse quantization, as shown in [15, 16] where filters and the inputs to convolutional layers are one-bit values.
For quantizing neural networks that have already been trained, including those trained without considering quantization, post-training quantization can be applied during inference. If this quantization is sufficiently fine or if the quantizer design method is tailored to the characteristics of the weights or activations being quantized, then the performance of this inference can become quite good, and in some cases can match the performance of a system designed using quantization-aware training or even no quantization at all. For example, 
applies straightforward 8-bit quantization to weights and activations while maintaining nearly the same accuracy as obtained with floating point, and with 4-bit quantization, a 5% loss in accuracy is observed. Two approaches that deal with how to quantize floating-point values that have a high dynamic range are outlier-aware quantization and clipping. In [18, 19], quantization of weights and activations to 4 bits is achieved by using a coarse quantizer for 97% of the values and a finer quantizer for the remaining outliers. OCS instead splits a channel into two channels in which the weights and outputs are halved, which reduces the dynamic range of the outliers. DFQ
quantizes weights and activations to 8 bits by assuming that the inputs to the activations have a Gaussian distribution, so that a model can be used to equalize the dynamic range of the data being quantized, along with a correction to the bias introduced by quantization. Clipping-based approaches are used in [22, 23, 10], in which activations or weights are clipped prior to quantization. Banner et al. [22, 23] use a piecewise linear model of symmetric or one-sided Laplace and Gaussian distributions to model weights and activations to determine optimal clipping values, along with a simple bias correction for weights. That method can be used to quantize neural network layers to an average of four bits. For networks that use batch normalization, ZeroQ
obtains model parameters from synthetic data instead of from training or validation sets. The synthetic data is generated based upon minimizing a distortion metric that computes the respective differences between the means and variances of batch normalization and those of the reconstructed synthetic input. Pareto optimization is used to select the bit precision for each layer. In, empirical-based clipping is used to apply extremely coarse post-training quantization to activations, down to one bit. Fractional bit precisions are supported as well.
For collaborative intelligence applications, in which a DNN is split, additional compression is applied after quantization, so that the compressed activations can be efficiently signaled between a device and the cloud. In [25, 26], feature tensors are quantized to 8 bits and then converted to tiled images to be compressed by off-the-shelf codecs. However, conventional image and video codecs are tailored to coding camera-captured images, whereas images formed using feature tensors exhibit significantly different characteristics. To address this discrepancy, the codec of  is tailored according to the feature tensor statistics so that lossless coding can be applied to 8-bit pre-quantized features. Lossy compression can also be used to further compress the activations output at a split layer. In [28, 29], a small auto-encoder-like neural network is introduced between the edge front-end and the cloud back-end, to reduce the dimensions of the signaled feature tensor. However, this approach requires end-to-end (re)training of the original DNN to minimize accuracy loss caused by the dimension reduction. In , the dimension of the feature tensor is reduced using a small neural network in a process called Back-and-Forth (BaF) prediction, which does not require retraining of the original DNN. The reduced-dimension tensor is then further quantized to 6 bits and compressed with a state-of-the-art lossless codec.
To achieve high compression efficiency without requiring the complexity of an off-the-shelf image codec, our previous work  applies clipping and very coarse post-training quantization, e.g., up to two bits, to the output activations, followed by binarization and entropy coding to generate a compressed bit-stream. The optimal clipping values in  were determined empirically. Due to the asymmetry of the distributions of feature tensor values at our split layers and the extreme coarseness of our quantizers, the assumptions of symmetry and uniformity used for the piecewise model of [22, 23] do not apply. The purpose of this paper is to extend the work of 
by developing a mathematical model of the feature tensors output by a leaky ReLU activation function whose input is asymmetric; using this model to estimate clipping and quantization error of the activations; determining how these error estimates behave with extremely coarse quantization; and applying these models to determine optimal clipping ranges for quantization. We also review the complete lightweight compression system, which includes an entropy-constrained quantizer design algorithm modified to pin the outermost quantizer reconstruction levels so that the dynamic range of clipped activations is preserved when decoded.
III Modeling and lightweight compression of feature tensors
Fig. 1 shows an overview of the proposed lightweight compression method used in a collaborative intelligence environment. The first several layers of a DNN are processed on a mobile or edge device. A typical layer includes convolutions, batch normalization, and an activation function. The outputs of this subset of layers are signaled to the cloud or to another computing device for processing by the remaining DNN layers. To efficiently transmit the activations or feature tensors from the first portion of the DNN, data compression is needed. Because a mobile or edge device may have limitations on computing complexity or available energy, a low-complexity compression method is preferred, provided that it does not significantly impact the overall accuracy of the DNN. The lightweight compression system illustrated here uses simple operations including clipping and very coarse memoryless scalar quantization to represent the feature tensors using a small set of quantized symbols. These symbols are binarized and entropy coded to further compress the data, and the compressed bit-stream is transmitted to a different computing platform to be decoded and used as input to the remaining layers of the DNN. The minimum and maximum clipping values can be determined empirically or via a model-based analysis based on the sample mean and variance of the data to be compressed.
It is well known that 8-bit post-training quantization of 32-bit floating point neural network activations can be applied with little effect on the overall accuracy of the network . Likewise, the same quantization can be applied to the intermediate layer where a DNN is split for a collaborative intelligence application. When quantizing without special considerations to four or fewer bits, however, performance can be significantly degraded. For example, if we split ResNet-50 
at layer 21 and apply 3-bit uniform quantization to the activations output by that layer, then the Top-1 classification accuracy over the ImageNet ILSVRC2012 validation data set goes from 75.8% without this quantization to 59.7% with quantization.
Most if not all of this performance loss can be mitigated simply by clipping the activations before quantization, without any retraining of the DNN weights. In this section, we first examine the effects of clipping on the overall network performance. We then present an analytic model for obtaining optimal clipping parameters prior to coarse quantization, and we examine in detail how well this model works for different levels of coarse quantization. We also review the methods used to achieve further compression, such as binarization and a modified entropy-constrained design process for quantizing clipped values.
III-A Effects of clipping
Fig. 2(a) shows the effects that clipping and coarse quantization have on the Top-1 classification accuracy of the ResNet-50 classification network when averaged over 5k images from the ImageNet ILSVRC2012 validation data set. Here, the activations at the output of layer 21 are clipped (clamped) to lie between the minimum and maximum clipping values. Layer 21 of ResNet-50 corresponds to the output of the shortcut connection and element-wise addition applied to the output of the second (out of four) residual blocks in the conv3_x layer shown in [12, Table 1]. If we instead use AlexNet with the ImageNet validation set and apply clipping and quantization to the output of layer 4, we obtain Fig. 2(c). Layer 4 of AlexNet corresponds to the convolutional layer immediately after the second maxpool layer shown in [33, Figure 2]. A similar plot is shown in Fig. 2(b) for the mean Average Precision (mAP) of the YOLOv3
object detection network, when run on the COCO 2017 validation data set with Intersection-over-Union (IoU) threshold set to 0.5. In this case, the output of layer 12 is clipped and quantized. Layer 12 of YOLOv3 corresponds to the output of the convolution just before the first group of residual blocks shown in [34, Table 1]. We cut the networks at these layers because generally for collaborative intelligence applications, a subset of the neural network is implemented on a lightweight device. Therefore, the size of this subset is usually much smaller than the portion of the network implemented in the cloud. However, we do not want to cut the network too early, as the size of the feature tensor typically grows rapidly in the first few layers. Additionally, we do not want to cut across many data paths at once. For example, the feature tensors in YOLOv3 rapidly expand by a factor of more than ten, and it is not until layer 12 that the feature tensor size comes back down to the same order of magnitude as the input.
Each clipped activation value, denoted as , is processed by an -level quantizer as follows:
where rounds away from zero for halfway cases. Note that unlike related literature that focuses on reduced bit-depth architectures, our does not need to be a power of two, as the purpose of quantization is for compression and subsequent transmission or storage in a bit-stream. The mean-square reconstruction error (MSRE) computed between an unmodified activation and the inverse-quantized clipped activation is shown using dotted lines.
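As a minimal sketch of the clip-and-quantize step described above, the following assumes that the quantizer maps the clipping range linearly onto the available indices and rounds halfway cases away from zero. The exact form of (1) is not reproduced here, so the function names and the parameter names `c_min`, `c_max`, and `n_levels` are hypothetical.

```python
import math


def round_half_away(v: float) -> int:
    """Round to the nearest integer; halfway cases go away from zero."""
    magnitude = int(math.floor(abs(v) + 0.5))
    return magnitude if v >= 0 else -magnitude


def quantize(x: float, c_min: float, c_max: float, n_levels: int) -> int:
    """Clip x to [c_min, c_max] and return a quantizer index in 0..n_levels-1.

    Assumed form: a linear map of the clipping range onto the index range,
    followed by rounding. n_levels does not need to be a power of two.
    """
    clipped = min(max(x, c_min), c_max)
    scaled = (clipped - c_min) / (c_max - c_min) * (n_levels - 1)
    return round_half_away(scaled)


def dequantize(index: int, c_min: float, c_max: float, n_levels: int) -> float:
    """Map a quantizer index back to its reconstruction value."""
    return c_min + index * (c_max - c_min) / (n_levels - 1)
```

With this form, the first and last indices reconstruct exactly to the two clipping limits, so a three-level quantizer over a range of 0 to 9 reconstructs to 0, 4.5, and 9.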
For 8-level (3-bit) quantization of ResNet-50 layer 21 activations, Fig. 2(a) shows that peak Top-1 accuracy is achieved over a range of values between roughly 9.0 and 25.0. This range is indicated by the correspondingly colored shaded region. As the number of quantization levels is decreased, the optimal decreases, as does the range of values that achieves peak performance. With 1-bit (2-level) quantization, the optimal range is quite narrow. When the quantization is not extremely coarse, e.g. 8-level (3-bit) or higher, the minimum MSRE generally coincides with the peak accuracy of ResNet-50 and peak mAP performance of YOLOv3, as can be seen in the plot where the minimum of the MSRE curve falls within the shaded region corresponding to the maximum network accuracy. Earlier works, e.g. [22, 23] have leveraged this behavior to model the quantization error to select the optimal clipping range for all activations in a DNN. However, it is evident from these prior works and explicitly stated by the authors that deviations from the models occur with extremely coarse quantization, e.g. corresponding to 2-bit (4-level) and below in this example. We can see in Fig. 2(a) for ResNet-50 that the optimal for 2-level (1-bit) quantization is approximately 7.0, whereas the minimum MSRE occurs near . Similar behavior is exhibited in Fig. 2(b) for YOLOv3. Thus, choosing based on minimizing MSRE can result in a potential loss in accuracy of several percent when is small. Nonetheless, it is still worthwhile examining an analytical model for estimating the optimal clipping ranges, given that the model may still be useful for some values of , and because empirically determining optimal ranges increases the complexity of the design process.
III-B Model for computing optimal clipping ranges for activations
In a collaborative intelligence system where a neural network is split, we may only have access to the activations or feature tensors output by the front-end of the network. We would like to measure the statistical properties of that output to determine optimal clipping values. Prior works such as [22, 23]
assume a Laplace or Gaussian model for the distribution of feature tensors. When a rectified linear unit (ReLU) is used as the activation function, a single-sided Laplace, i.e. exponential, distribution is used as the model, on the assumption that all negative values are rectified. For networks such as ResNet-50, however, leaky ReLU is used for the activation function, in which negative values are preserved at a smaller scale. Fig. 3
shows the distribution of feature tensor elements immediately before and after leaky ReLU at the output of layer 21 of ResNet-50 when run on the full ImageNet ILSVRC2012 validation data set of 50k images. Due in part to the scaling of negative values in the earlier layers, we can see that the distribution of the data input to the layer 21 activation function is skewed and has a peak not located at zero. The output of layer 21 thus results from a leaky ReLU operation applied to an asymmetric distribution, rather than to a symmetric and zero-mean distribution, as assumed in earlier works.
For developing a model to compute optimal clipping ranges at the output of a leaky ReLU activation function, we model the input to the activation function as having an asymmetric Laplace distribution 
with probability density function (PDF)
where is a constant that determines the asymmetry of the distribution, is the location of the peak of the distribution, and . Note that, unlike the symmetric Laplace distribution, here is not the mean. For ResNet-50, we use to obtain an asymmetric Laplace distribution that approximates the distribution of feature tensor elements input to leaky ReLU. The density function now becomes
The ResNet-50 implementation that we use has a leaky ReLU activation function with a scaling factor of 0.1 for negative values, defined as follows:
We can see from Fig. 3(a) that the peak of the histogram corresponds to a negative value, therefore
, and the mean of a random variablehaving the PDF of (5) can be simplified to
and the variance becomes
By setting (6) equal to the sample mean and (7) equal to the sample variance measured at the output of the layer, we can solve for and . For the ResNet-50 layer 21 output described earlier, the sample mean over the full validation set is 1.1235656, and the sample variance is 4.9280124, so the numerical solution yields and . Now that we have numeric values for these parameters, we can substitute them into (5) to obtain a final analytic model for the PDF of data output by layer 21:
Fig. 3(b) shows the fit of this analytic model to the histogram of the feature tensor values output by that layer. We can see that the model captures the salient features of the empirical distribution rather well: sharp peak on the negative side, and slowly decaying exponential on the positive side.
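To illustrate how such a model can be evaluated numerically, the sketch below pushes an asymmetric Laplace density through a leaky ReLU with slope 0.1 and integrates the result to obtain the output mean and variance. It assumes the standard textbook parameterization of the asymmetric Laplace distribution (rate `lam`, asymmetry `kappa`, peak location `mu`), which may differ in notation from the expressions used in this paper. All names are hypothetical, and the parameter values in the usage note are placeholders, not the fitted values.

```python
import math


def asym_laplace_pdf(x, lam, kappa, mu):
    """Asymmetric Laplace density in its standard form (an assumption here)."""
    norm = lam / (kappa + 1.0 / kappa)
    if x < mu:
        return norm * math.exp((lam / kappa) * (x - mu))
    return norm * math.exp(-lam * kappa * (x - mu))


def leaky_relu(x, slope=0.1):
    """Leaky ReLU with a scaling factor of 0.1 for negative inputs."""
    return x if x >= 0 else slope * x


def output_pdf(y, lam, kappa, mu, slope=0.1):
    """Density of y = leaky_relu(x), obtained by a change of variables."""
    if y >= 0:
        return asym_laplace_pdf(y, lam, kappa, mu)
    return asym_laplace_pdf(y / slope, lam, kappa, mu) / slope


def integrate(fn, lo, hi, steps=90_000):
    """Midpoint-rule integration of fn over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(fn(lo + (i + 0.5) * h) for i in range(steps))


def output_moments(lam, kappa, mu, slope=0.1, lo=-10.0, hi=80.0):
    """Mean and variance of the leaky ReLU output under the model."""
    pdf = lambda y: output_pdf(y, lam, kappa, mu, slope)
    mean = integrate(lambda y: y * pdf(y), lo, hi)
    second = integrate(lambda y: y * y * pdf(y), lo, hi)
    return mean, second - mean * mean
```

For example, `output_moments(1.0, 0.5, -0.3)` returns the model mean and variance for one placeholder parameter set. Fitting would then amount to adjusting the parameters until these two values match the measured sample mean and variance.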
The feature tensor elements output by the activation function are clipped and quantized using (1). Unlike the quantizer model of  for which values are quantized to the midpoint of a quantizer bin, values falling within the first and last bins of our quantizer are quantized to the outer boundaries of those bins. Since these boundaries correspond to the minimum and maximum clipping limits, values that are clipped to or incur no further quantization error. For the -level quantizer of (1), the width of an interior bin is , and the width of each outermost bin is . Given a PDF , the quantization error for values inside the clipping range is
where the first and last integrals compute the quantization error for the outermost bins, and the summation portion accumulates quantization error over the interior bins. The error caused by clipping is
as was the case in , except that our clipped values incur no further distortion from quantization.
The total reconstruction error for the clipping and quantizing processes is . Given an -level quantizer and the analytic model for the density function from (8), we can numerically solve for the optimal clipping range by minimizing , or for the case when we want to be zero, we can solve for . An example illustrating the effects of these errors is shown in Fig. 4 for when clipping and quantization with levels is applied to the distribution model in (8).
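The interplay between the two error terms can be sketched numerically. The example below is an illustration under stated assumptions, not the closed-form derivation: it uses a unit-rate exponential density as a stand-in for the fitted model, a uniform quantizer whose outermost reconstruction values sit on the clipping limits, midpoint-rule integration, and a simple grid search over candidate upper clipping limits. All function names are hypothetical.

```python
import math


def integrate(fn, lo, hi, steps=4000):
    """Midpoint-rule integration of fn over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(fn(lo + (i + 0.5) * h) for i in range(steps))


def reconstruct(x, c_min, c_max, n_levels):
    """Clip, quantize uniformly, and map back to a reconstruction value."""
    step = (c_max - c_min) / (n_levels - 1)
    clipped = min(max(x, c_min), c_max)
    index = math.floor((clipped - c_min) / step + 0.5)
    return c_min + index * step


def total_error(pdf, c_min, c_max, n_levels, tail=40.0):
    """Return (quantization error, clipping error) for a given density."""
    sq_err = lambda x: (x - reconstruct(x, c_min, c_max, n_levels)) ** 2 * pdf(x)
    e_quant = integrate(sq_err, c_min, c_max)
    # Values outside the range reconstruct to the nearest clipping limit.
    e_clip = (integrate(sq_err, c_min - tail, c_min)
              + integrate(sq_err, c_max, c_max + tail))
    return e_quant, e_clip


def best_c_max(pdf, c_min, n_levels, candidates):
    """Grid search for the upper clipping limit with the lowest total error."""
    return min(candidates, key=lambda c: sum(total_error(pdf, c_min, c, n_levels)))


# Illustrative stand-in density: unit-rate exponential (not the fitted model).
exp_pdf = lambda x: math.exp(-x) if x >= 0 else 0.0
candidates = [0.5 + 0.25 * k for k in range(39)]  # 0.5, 0.75, ..., 10.0
c_max_2 = best_c_max(exp_pdf, 0.0, 2, candidates)
```

For this stand-in density with two levels and a lower limit of zero, setting the derivative of the total error to zero gives an optimal upper limit of 2, which the grid search recovers. Repeating the search with more levels moves the optimum to a larger value, in line with the behavior described for finer quantizers.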
As expected, the clipping error decreases monotonically as increases, because fewer values are being clipped. We can see from Fig. 3(b), which is the density plot corresponding to (8), that feature tensor values are in the tail of the distribution, causing the clipping error curve to level out. Note also from (10) that the clipping error is independent of the number of quantizer levels . The quantization error reduces when becomes small, given that the width of a quantization bin is also small. Within the clipping ranges of interest, continues to increase with , however, it will eventually level off because the leftmost bin will become so large that most values are quantized to zero. The total reconstruction error shown here is the sum of the clipping and quantization error, so we can see that for small clipping ranges, the clipping error dominates, and for large clipping ranges, the quantization error dominates. For this example with , the closed-form expression corresponding to that we want to minimize can be simplified to:
By applying this method over different values of , we obtain a complete set of closed-form expressions for that can be minimized to give us clipping ranges that are optimal for an -level quantizer. We can also apply the same technique for modeling the error for YOLOv3 layer 12 activations, whose distribution is
based upon sample mean and variance values of 0.4484323 and 0.5742644, respectively, obtained over the COCO 2017 validation set. We use the same methodology to obtain a model for AlexNet, which uses non-leaky ReLU.
For ResNet-50, the total error computed by the model does an excellent job of estimating the actual measured total error for a given clipping range. For YOLOv3 and AlexNet, although the curves do not overlap exactly, their minimum ranges yield corresponding values that are relatively close to the empirically found optima, which is the intent of these models. The deviation at larger clipping values of the model-based error from the measured error is shown for completeness. This deviation does not negatively impact the overall neural network performance, as we will show in Section IV-A.
To verify that this model-based method also works well on other layers, for the next two experiments we split ResNet-50 at layer 25 and then at layer 29. These activations correspond to the outputs of the next two shortcut layers after layer 21. Fig. 6 shows that the model provides a good fit to the measured total error for both of these layers.
Now that we have a method for optimally clipping and quantizing the feature tensors, we next present methods to improve the compression efficiency of the overall system.
III-C Modified entropy-constrained quantization for clipped activations
A uniform quantizer is optimal only for signals that are uniformly distributed. Since neural network activations are not uniformly distributed, as seen in the previous section, we need to look at methods for non-uniform quantizer design. Specifically, we consider entropy-constrained quantization in conjunction with novel adjustments designed to improve network accuracy under such quantization. In conventional quantization, the reconstruction value for each bin of an -norm optimized non-uniform quantizer corresponds to the centroid (conditional mean) of the data quantized to that bin. Clipped values would therefore be further quantized to the centroid of the outermost quantizer bins. This would, in turn, cause the reconstructed inverse-quantized values to span a range smaller than the optimal clipping range. We showed earlier that with coarse quantization, the accuracy of a DNN can be quite sensitive to the clipping range of a layer’s activation. To address this problem, we present a modified entropy-constrained quantizer design process where the reconstruction values of the outermost bins are pinned to and , to ensure that the reconstructed activations span the full clipping range. This pinning is only used in outermost bins; the reconstruction values for other bins, as well as bin boundaries, are not pinned and are optimized by the algorithm.
The modified algorithm is given below, with shaded areas indicating how it differs from the conventional entropy-constrained quantizer design approach. The main modifications are related to pinning the reconstruction values for the boundary bins and using codeword lengths instead of probabilities for computing rate-related terms. The boundary pinning occurs in Step 4. Here, the smallest and the largest reconstruction values are always set to the minimum and maximum activation clipping values, respectively. The interior levels are computed as is done in the conventional algorithm. For computing the rate terms, we replace the probability-based rate estimate with the known length of the binarized codeword used to represent the bin.
With these modifications in mind, we can now summarize the new quantizer design process. The feature tensor elements output by a split DNN are first clipped in Step 1 to be within . Next, the values to which input data are quantized, , are initialized uniformly over the interval. In Step 3, the samples used to train the quantizer, which is unrelated to the concept of training weights of a DNN, are assigned to a reconstruction value based on a cost function that uses a Lagrange multiplier . For our experiments with this process, we use feature tensors generated by 100 images from the validation set. For small , the objective is to minimize the quantization error, at the expense of having a larger bit-stream. Conversely, a large tries to minimize the bit-stream size while allowing the distortion to be large. Thus, can be used to determine our operating point on a rate-distortion curve.
Once all training samples have been assigned to a reconstruction value, Step 4 updates each reconstruction value by setting it to the average value of the training samples that were assigned to it, i.e. their centroid. However, our modifications pin the first and last reconstruction values to be equal to the minimum and maximum clipping values, respectively, to ensure that the decoded and reconstructed feature tensors span the optimal range. Steps 3 and 4 are repeated until the decrease in the cost function is less than a threshold, or until a certain number of iterations have occurred. Finally, Step 6 computes the decision boundaries between these reconstruction values, also using a Lagrangian cost function. When the quantizer is deployed, each input value is associated with a reconstructed value based on where it falls in relation to the decision thresholds, and the index corresponding to the associated reconstruction value is output by the quantizer. Next, we discuss the binarization of this index and subsequent entropy coding.
Input: Training samples
Number of quantizer bins
Codeword lengths
Lagrange multiplier
Activation clipping range
Output: Quantizer reconstruction values
Quantizer decision thresholds
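The design process summarized above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact algorithm: the cost is taken as squared error plus the Lagrange multiplier times the codeword length, empty bins keep their previous reconstruction value, and the thresholds are placed where the costs of neighboring levels are equal. The function and parameter names are hypothetical, and only the step numbers mentioned in the text are labeled.

```python
import random


def design_pinned_ecsq(samples, n_bins, code_lengths, lam, c_min, c_max,
                       max_iters=50, tol=1e-6):
    """Entropy-constrained scalar quantizer design with pinned outer levels."""
    # Step 1: clip the training samples to the activation clipping range.
    data = [min(max(s, c_min), c_max) for s in samples]
    # Initialise the reconstruction values uniformly over the clipping range.
    levels = [c_min + i * (c_max - c_min) / (n_bins - 1) for i in range(n_bins)]
    prev_cost = float("inf")
    for _ in range(max_iters):
        # Step 3: assign each sample to the level with the lowest
        # Lagrangian cost (distortion + lam * codeword length).
        sums = [0.0] * n_bins
        counts = [0] * n_bins
        cost = 0.0
        for x in data:
            costs = [(x - r) ** 2 + lam * length
                     for r, length in zip(levels, code_lengths)]
            best = min(range(n_bins), key=costs.__getitem__)
            sums[best] += x
            counts[best] += 1
            cost += costs[best]
        # Step 4: centroid update for interior levels only; the outermost
        # levels stay pinned to the clipping limits.
        for i in range(1, n_bins - 1):
            if counts[i]:
                levels[i] = sums[i] / counts[i]
        levels[0], levels[-1] = c_min, c_max
        # Stop once the cost no longer decreases by more than tol.
        if prev_cost - cost < tol:
            break
        prev_cost = cost
    # Step 6: decision thresholds where neighbouring Lagrangian costs match.
    thresholds = []
    for i in range(n_bins - 1):
        gap = levels[i + 1] - levels[i]
        mid = 0.5 * (levels[i] + levels[i + 1])
        shift = (lam * (code_lengths[i + 1] - code_lengths[i]) / (2 * gap)
                 if gap > 0 else 0.0)
        thresholds.append(mid + shift)
    return levels, thresholds


# Synthetic stand-in for training activations (not data from the paper).
rng = random.Random(0)
train = [rng.expovariate(1.0) for _ in range(2000)]
levels, thresholds = design_pinned_ecsq(train, 4, [1, 2, 3, 3], 0.05, 0.0, 4.0)
```

The codeword lengths `[1, 2, 3, 3]` correspond to a truncated unary binarization of four indices. Whatever the training data, the first and last entries of `levels` equal the clipping limits, which is the property the pinning is meant to guarantee.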
III-D Binarization and entropy coding
After quantizing an activation element, an index associated with the selected reconstruction value is coded and signaled to a bit-stream. For the DNNs considered in this work, the activation values tend to be concentrated around zero, as illustrated in Fig. 3(b) for the unclipped layer 21 activations of ResNet-50. Since we want to achieve good performance when quantizing to very few bins, a truncated unary binarization scheme is well suited for this purpose. Given a non-negative integer $n$, this binarization maps $n$ to a binary string comprising $n$ ones followed by a zero, except for the maximum value of $n$, which maps to just $n$ ones. For example, the binarization scheme for a 2-bit (4-level) value maps $\{0, 1, 2, 3\}$ to $\{0, 10, 110, 111\}$.
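The truncated unary mapping described above is simple enough to state directly in code. The sketch below is illustrative, and the function name is hypothetical.

```python
def truncated_unary(n: int, n_max: int) -> str:
    """Binarize n as n ones followed by a zero; the largest index omits the zero."""
    if not 0 <= n <= n_max:
        raise ValueError("index out of range")
    return "1" * n if n == n_max else "1" * n + "0"


# Code table for a 4-level quantizer (indices 0 to 3).
table = [truncated_unary(n, 3) for n in range(4)]
```

Because small indices receive the shortest strings, this binarization suits activations that are concentrated near zero.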
Bit strings produced by binarization can be further compressed using a binary entropy codec. In this work, we use a simplified version of the Context-based Adaptive Binary Arithmetic Coding (CABAC)  used in HEVC and related codecs. One context is used for each bit position in the binarized string. For the 2-bit example described above, three contexts would be used. CABAC builds a separate probability model for each context. When encoding a particular symbol, the probability model associated with the context of that symbol will be used for producing the compressed output following arithmetic coding operations.
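To give a feel for what one context per bit position buys, the sketch below estimates the ideal code length of a sequence of quantizer indices using a separate adaptive probability model for each position of the truncated unary string. This is a count-based stand-in and not CABAC itself: the real coder uses finite-state probability estimation and arithmetic coding, neither of which is modeled here. All names are hypothetical.

```python
import math


def estimate_bits(indices, n_levels):
    """Ideal code length in bits with one adaptive context per bit position."""
    n_max = n_levels - 1
    # Each context keeps counts of zeros and ones, both starting at 1.
    counts = [[1, 1] for _ in range(n_max)]
    total = 0.0
    for n in indices:
        bits = "1" * n if n == n_max else "1" * n + "0"
        for pos, ch in enumerate(bits):
            b = int(ch)
            ctx = counts[pos]
            total += -math.log2(ctx[b] / (ctx[0] + ctx[1]))
            ctx[b] += 1  # adapt the model after coding the bit
    return total
```

For a 4-level quantizer this uses three contexts. A run of 100 zero indices costs about 6.7 bits in total under this estimate, well below the 100 bits of the raw binarization, which illustrates why skewed index distributions compress well.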
III-E Computational complexity
In this section we look at the complexity of the lightweight compression codec and compare it to that of HEVC. When deployed, the lightweight codec itself comprises four steps: clipping, quantization, binarization, and entropy coding. For each feature tensor to be compressed, the clipping step performs two in-place comparisons (, ) for each tensor element. The clipped values are next quantized using (1), whose complexity is equivalent to one addition, two multiplications, and one rounding operation, assuming that the constant values are precomputed. The binarization of the quantizer index can be implemented simply via a lookup table. Given that the lightweight codec typically uses only a few quantization levels, this binarization could even be implemented using a few Boolean logic equations. To entropy code the binarized strings, we use the same entropy coder that HEVC uses (CABAC), except with only a few contexts. To further reduce latency and complexity, some of these operations could be fused into the layer whose output we are compressing.
The complexity of HEVC is broken down by class in [40, Table III]. If we compare the building blocks of our lightweight codec with those listed for HEVC All-Intra (AI), we can get an idea of the relative complexity between these two codecs. Our quantization process would fall into the TComTrQuant class. Note that we are only performing quantization, whereas HEVC performs transforms, quantization, and rate-distortion optimization, so our quantization is only a small fraction, perhaps a percent or two, out of the 24.4% listed in the table. Our binarization and entropy coding operations fall under the entropy-coding classes TEncSbac, TEncEntropy, and TEncBinCABAC. As noted in , those classes include scanning and context derivation, whereas our context is simply based on the position of the bit in the binarization table. The lightweight codec’s binarization and entropy coding therefore consume a small portion of the 11.8% total listed for these entropy coding classes. If we estimate that we use a few percent from each of these four classes, which is likely a generous overestimate, we can see that our total distribution is less than 10%. Thus, the lightweight codec is certainly well over 90% less complex than HEVC. Of course, the actual computation cost depends upon the implementation, but since the modules of the lightweight codec are a subset of those of HEVC, the same optimizations that can be applied to those modules in HEVC can be applied here.
During operation, the lightweight codec needs to know what clipping values to use just before quantization. To obtain the mean and variance estimates, we used in-line computations on the feature tensor elements at the split layer, over the validation set, before running the codec. For the ImageNet and COCO data sets used in our experiments, the validation sets contained 50k and approximately 5k images, respectively. However, we found that the estimated mean and variance converge after only a few hundred images, and after no more than 1k. Since these estimates are computed on unquantized feature tensors, they can easily be computed ahead of time, even during training. This codec is also amenable to adaptive operation if inference is performed in real time while processing video on an edge device. In that case, the measured statistics can be adjusted based on the most recent few hundred frames.
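The adaptive variant can be sketched as follows: keep per-frame sums for a sliding window of recent frames and recompute the mean and variance from them. The class name `WindowedStats` and the default window length are assumptions, and the mapping from these statistics to clipping values is not shown.

```python
from collections import deque


class WindowedStats:
    """Mean and variance of tensor elements over the most recent frames."""

    def __init__(self, window=300):
        # Each entry holds one frame's (count, sum, sum of squares).
        # deque(maxlen=...) discards the oldest frame automatically.
        self.frames = deque(maxlen=window)

    def update(self, elements):
        """Add one frame's feature tensor, given as a flat sequence of floats."""
        count = len(elements)
        total = sum(elements)
        total_sq = sum(v * v for v in elements)
        self.frames.append((count, total, total_sq))

    def mean_and_variance(self):
        """Return the (mean, population variance) over the current window."""
        count = sum(f[0] for f in self.frames)
        total = sum(f[1] for f in self.frames)
        total_sq = sum(f[2] for f in self.frames)
        mean = total / count
        variance = total_sq / count - mean * mean
        return mean, variance


# Tiny example with a window of two frames: the first frame is evicted.
stats = WindowedStats(window=2)
stats.update([1.0, 2.0, 3.0])
stats.update([2.0, 4.0])
stats.update([6.0, 8.0])
mean, variance = stats.mean_and_variance()  # computed over 2, 4, 6, 8
```

The sum-of-squares form is the simplest to window but can lose precision when the mean is large relative to the spread. A Welford-style update would be a more numerically stable alternative.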
IV Experimental results
We applied our lightweight compression technique to activations output from the split layer of a DNN, for two different inference scenarios: ResNet-50 image classification at layer 21, and YOLOv3 object detection at layer 12. The dimensions of the activations at these layers were 32×32×512 and 52×52×256, respectively. Pre-trained network weights were obtained from . The software used to run the experiments was a modified version of the Darknet software from . For ResNet-50 with a network input size of 256×256, classification accuracy metrics were obtained directly from the Darknet software using the full ImageNet ILSVRC2012 validation data set, which has 50k images. For YOLOv3 with a network input size of 416×416, mAP (IoU = 0.5) results were obtained using the COCO API  and the COCO 2017 validation data set, which includes just under 5k images. For experiments using entropy-constrained quantization, the quantizer design algorithms were run on activations output when running the first part of the network on 100 images from the data set. After clipping, quantization, and coding to a bit-stream, the activations were decoded and inverse quantized and then passed to the remainder of the neural network. The bit-streams also included side information needed by the decoder, e.g., the clipping and quantization parameters and some dimensional parameters for object detection, which together comprised 24 bytes for object detection and 12 bytes for classification networks. The size of the compressed data is reported as bits per feature tensor element, i.e., the size of the bit-stream divided by the number of elements in the activation’s output feature tensor.
In , Analytical Clipping for Integer Quantization (ACIQ) uses a piecewise linear model for the distribution of the feature tensor values, to approximate the clipping values to be used when all activations of a DNN are quantized to an average of 4 bits. Since ACIQ, like our method, does not require any training to compute the clipping values, we include comparisons to when computed by ACIQ is applied to the output of the layer where our DNN is split. As described in , the parameter can be estimated from the feature tensor values, assuming that for example the data fits a Laplace density function . If ReLU is the activation that is applied to this data, then ACIQ assumes . The equation used to compute can be simplified to:
where is the Lambert W function and is the number of bits to which elements are quantized. Although  only quantizes to an integer number of bits, the purpose of our quantization is for subsequent compression, so we can allow for non-integer bit-widths by substituting with -level quantization.
IV-A Clipping and uniform quantization performance
Fig. 7(a) shows the Top-1 performance of ResNet-50 when clipping and -level quantization is applied to the activations output from layer 21. The empirical curve is obtained by running the network over the full validation set with and set to the value that yields the best accuracy from Fig. 2(a). The curves based on our asymmetric Laplace distribution model are shown both for when the optimal is obtained after fixing , and for when is not constrained. Plots corresponding to YOLOv3 layer 12 and AlexNet layer 4 are shown in Fig. 7 (b) and (c), respectively. Note that the performance is plotted vs. the number of quantizer levels ranging from 2 to 8, which corresponds to between 1 and 3 bits before binarization and entropy coding are applied.
We can see that for 4-level (2-bit) or 5-level and finer quantization, the model with constrained to zero does a particularly good job of estimating the empirically determined optimal clipping range . For extremely coarse quantization, e.g. 2-level (1-bit) and 3-level quantization, the model’s deviation from empirically determined values is expected, given that the model is based on minimizing MSRE and clipping error, and minimizing MSRE does not maximize the overall neural network performance in these cases, as discussed in Section III-A. By removing the constraint , the performance can be slightly better or worse than with the constraint, especially with 2-level (1-bit) quantization. The empirical and model-based clipping ranges obtained from these experiments are summarized in Table I, along with the maximum clipping values computed using ACIQ. We can see that applying the constraint has essentially no effect on the size of the clipping interval, i.e. is shifted to . For 4-level (2-bit) and finer quantization, the effects of this constraint on overall network performance are negligible, and therefore fixing may be preferable for ease of implementation. We also can see that as the number of quantization levels decreases, generally the optimal clipping range decreases as well, as discussed in Section III-A.
When using ACIQ, the values are computed based on a linear approximation to the data’s distribution. We can see from Table I that for quantizers having few levels, the values from ACIQ are generally higher than our empirical and model-based values. As we saw earlier in Fig. 2, the network performance is sensitive to changes in with coarse quantization. As the quantizer becomes finer, the performance using ACIQ approaches that of the empirical and model-based methods, given that the range of acceptable becomes wider.
IV-B Lightweight compression system performance
Distortion-rate plots showing the overall performance of ResNet-50 and YOLOv3 when lightweight compression is used on layer 21 and layer 12 activations, respectively, are shown in Fig. 8(a) and (b), for when -level uniform quantization is used with . The rate or bit-count points in these plots have a one-to-one correspondence to the -level points in Fig. 7. To compute the amount of compressed bits per feature tensor element, we divide the size of the final output bit-stream by the total number of feature tensor elements that were coded. We also account for the side information included in the header of the bit-streams, which requires 12 bytes for image classification networks and 24 bytes for object-detection networks. This side information includes the original input image dimensions, , , and for object detection the input dimensions of the first layer are signaled so that bounding box coordinates around objects can be computed.
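As a worked example of the rate computation, the sketch below divides the bit-stream size, including the header, by the number of tensor elements. The tensor shapes and header sizes are those reported in this section. The function name `bits_per_element` and the payload size in the last line are hypothetical, chosen only to illustrate the arithmetic.

```python
def bits_per_element(payload_bytes, header_bytes, tensor_shape):
    """Compressed size in bits per feature tensor element, header included."""
    elements = 1
    for dim in tensor_shape:
        elements *= dim
    return 8 * (payload_bytes + header_bytes) / elements


yolo_shape = (52, 52, 256)      # YOLOv3 layer 12: 692,224 elements
resnet_shape = (32, 32, 512)    # ResNet-50 layer 21: 524,288 elements

# Overhead contributed by the side information alone.
yolo_header_cost = bits_per_element(0, 24, yolo_shape)      # 192 bits / 692,224
resnet_header_cost = bits_per_element(0, 12, resnet_shape)  # 96 bits / 524,288

# Hypothetical payload of 38,000 bytes for one YOLOv3 feature tensor.
example_rate = bits_per_element(38_000, 24, yolo_shape)
```

In both cases the header adds well under 0.001 bits per element, so the side information has a negligible effect on the reported rates.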
For ResNet-50, Fig. 8(a) shows that the activations could be quantized to 8 levels (3 bits) with no loss relative to when no quantization or clipping was used, and at 4 levels (2 bits), the drop in Top-1 accuracy was well below 1%. One-bit quantization was feasible with ResNet-50 and YOLOv3, which yielded 4.9% and 4.8% losses, respectively, with corresponding compressed sizes of 0.41 and 0.45 bits per element. Additional bit reductions are possible, e.g. to 0.39 bits per element as shown in  when nonzero values were used with empirical clipping on YOLOv3. We saw in Fig. 7(a) for ResNet-50 that the model-based clipping and empirical clipping yielded equivalent performance at 4-level and finer quantization. When the actual compressed rate is used, the distortion-rate performance in Fig. 8(a) shows that 6-level (between 2 and 3 bits) quantization and finer yields equivalent performance. For YOLOv3, the model-based and empirical clipping yielded similar performance with 5-level and finer quantization, and for distortion-rate performance, the model worked well with 3-level and finer quantization. Additionally, for YOLOv3 we were able to use the model to obtain clipping ranges that yielded better performance than those used in the empirical study. For YOLOv3, we showed in  that no loss in performance occurred when quantizing all the way down to 16 levels (4 bits), which is why we show here the performance for 8 and fewer levels. The drop in mAP was less than 1% with a 4-level (2-bit) quantizer.
We can also observe here the effect that the clipping range has on the compressed bit-stream size. For example, the leftmost point on the empirical curve in Fig. 8(b) corresponds to 2-level quantization with a maximum clipping value of 1.95. With ACIQ, the maximum clipping value is 2.46. We can see in Fig. 2 that 2.46 is beyond the optimal clip range for 2-level quantization. However, a wider clip range with uniform quantization causes the quantization bins to become wider. Since the distribution of the data being quantized is denser near zero, more elements will be quantized to the first quantizer bin, whose binarized representation uses only one bit, thus reducing the overall size of the compressed bit-stream.
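This effect can be illustrated with a small calculation under simplifying assumptions. The sketch assumes that the data follow a one-sided exponential density with unit scale, that the lower clipping value is zero, and that the 2-level quantizer's decision threshold lies at half the maximum clipping value. It reports the ideal entropy of the quantizer index, not the output of the actual entropy coder. The function names are hypothetical.

```python
import math


def first_bin_probability(c_max, scale=1.0):
    """P(index = 0) for a 2-level uniform quantizer on [0, c_max].

    Assumes an exponential density with the given scale, so that
    P(x < t) = 1 - exp(-t / scale), with threshold t = c_max / 2.
    """
    return 1.0 - math.exp(-(c_max / 2.0) / scale)


def binary_entropy(p):
    """Entropy in bits of a binary source with P(0) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


p_empirical = first_bin_probability(1.95)   # narrower clip range
p_aciq = first_bin_probability(2.46)        # wider clip range

rate_empirical = binary_entropy(p_empirical)
rate_aciq = binary_entropy(p_aciq)

print(f"c_max = 1.95: P(first bin) = {p_empirical:.3f}, entropy = {rate_empirical:.3f} bits")
print(f"c_max = 2.46: P(first bin) = {p_aciq:.3f}, entropy = {rate_aciq:.3f} bits")
```

Under these assumptions the wider range places more probability in the first bin and lowers the index entropy, which is consistent with the smaller bit-stream. Real feature tensors will differ in scale and shape, so the numbers printed here are illustrative only.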
Fig. 8 also shows the performance when coding the activations using the HM16.20  implementation of the HEVC screen content coding extension (HEVC-SCC) . HEVC-SCC includes tools that help with the coding of non-camera-captured pictures. As shown in , activation channels arranged as pictures exhibit much high-frequency content. HEVC-SCC includes a transform skip (TS) mode that is available for all transform block sizes, so we show results when enabling TS for 4×4 blocks only, and when enabling it for all block sizes. Each set of activation channels was quantized to 8 bits and mosaicked into an 832×832 picture for YOLOv3 and into a 1024×512 picture for ResNet-50. Given the fineness of the quantizer, clipping was not necessary. The mosaicked feature tensors for the validation set were coded by HEVC-SCC as an all-Intra sequence of monochrome (4:0:0) 8-bit pictures. Even with the improved performance with TS on all block sizes, the lightweight compression system outperformed HEVC-SCC by up to 1.3%, depending upon rate.
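The mosaicking step can be sketched as tiling the channels of a feature tensor into a single monochrome picture. The raster-order tile layout below is an assumption, since the channel order used in the experiments is not specified here, and the function name `mosaic` is hypothetical. A 52×52×256 tensor tiled 16×16 gives 832×832, and a 32×32×512 tensor tiled 16 rows by 32 columns gives a picture 1024 wide and 512 high.

```python
def mosaic(tensor, tile_rows, tile_cols):
    """Tile the channels of an H x W x C tensor into one 2-D picture.

    `tensor` is a nested list indexed as tensor[h][w][c]. Channel c is
    placed at tile (c // tile_cols, c % tile_cols), i.e. raster order
    (an assumed layout).
    """
    height = len(tensor)
    width = len(tensor[0])
    channels = len(tensor[0][0])
    if channels != tile_rows * tile_cols:
        raise ValueError("tile grid must hold exactly all channels")

    picture = [[0] * (tile_cols * width) for _ in range(tile_rows * height)]
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                row, col = divmod(c, tile_cols)
                picture[row * height + h][col * width + w] = tensor[h][w][c]
    return picture


# Tiny example: a 2 x 2 x 4 tensor whose values encode (channel, row, column).
tiny = [[[100 * c + 10 * h + w for c in range(4)] for w in range(2)] for h in range(2)]
picture = mosaic(tiny, tile_rows=2, tile_cols=2)

# Picture sizes for the tile grids assumed in the lead-in.
yolo_size = (16 * 52, 16 * 52)      # (width, height)
resnet_size = (32 * 32, 16 * 32)    # (width, height)
```

In practice this would usually be done with array reshaping rather than explicit loops. The loop form is used here only to make the index mapping explicit.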
The performance of lightweight compression with modified entropy-constrained quantization is shown in Fig. 9 for ResNet-50 image classification, and in Fig. 10 for YOLOv3 object detection. We show the performance here using extremely coarse quantization, namely 2–4 levels, which corresponds to a 1–2 bit quantized representation, followed by binarization and entropy coding to generate the final compressed bit-stream. For comparison, we include the best-performing HEVC-SCC curves from Fig. 8, as well as the performance when using an entropy-constrained quantizer designed using the conventional algorithm, which does not pin the outermost reconstruction levels. With 4-level quantization, using the modified quantizer design method improved the neural network performance by about 0.5–1.5% as compared to when using the conventional algorithm. We can also see that the entropy-constrained quantizer’s ability to cover a range of rates enabled us to improve the 2-level (1-bit) performance by about 1% for YOLOv3. For ResNet-50, the modified quantizer design algorithm also allowed us to obtain improved accuracies for 3- and 4-level (2-bit) quantizers. The range of achievable compressed bit-stream sizes for all these experiments with quantization to 4 and fewer levels was between about 0.3 and 1.0 bits per tensor element.
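The idea of pinning the outermost reconstruction levels can be sketched with a generic entropy-constrained scalar quantizer design loop. This is an illustrative sketch, not the design algorithm used in the experiments. It assumes that the outer levels are pinned to the clipping limits, that samples are assigned by minimizing squared error plus a Lagrangian rate term, and that interior levels are updated to cell centroids. The function name `design_ecsq` and the multiplier `lam` are hypothetical.

```python
import math


def design_ecsq(samples, levels, c_min, c_max, lam, iterations=50):
    """Entropy-constrained scalar quantizer design with pinned outer levels.

    Returns (reconstruction levels, index probabilities).
    """
    # Start from uniformly spaced levels with equal probabilities.
    step = (c_max - c_min) / (levels - 1)
    recon = [c_min + i * step for i in range(levels)]
    prob = [1.0 / levels] * levels

    for _ in range(iterations):
        # Ideal code length in bits; a floor on p avoids log2(0).
        length = [-math.log2(max(p, 1e-12)) for p in prob]
        sums = [0.0] * levels
        counts = [0] * levels

        for x in samples:
            x = min(max(x, c_min), c_max)  # clip before quantizing
            # Choose the level with the lowest distortion-plus-rate cost.
            best = min(
                range(levels),
                key=lambda i: (x - recon[i]) ** 2 + lam * length[i],
            )
            sums[best] += x
            counts[best] += 1

        # Interior levels move to the centroid of their cell.
        for i in range(1, levels - 1):
            if counts[i]:
                recon[i] = sums[i] / counts[i]
        # Outer levels stay pinned to the clipping limits (assumption).
        recon[0], recon[-1] = c_min, c_max
        prob = [c / len(samples) for c in counts]

    return recon, prob


# Deterministic test data: quantiles of a unit-scale exponential density.
samples = [-math.log(1.0 - (i + 0.5) / 1000.0) for i in range(1000)]
recon, prob = design_ecsq(samples, levels=4, c_min=0.0, c_max=2.0, lam=0.05)
```

Sweeping `lam` trades distortion against rate, which is what allows a design of this kind to cover a range of rates for a fixed number of levels. In the conventional algorithm the outer levels would also be updated to centroids, pulling them inside the clipping range.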
We presented an efficient and lightweight post-training compression method for coding the intermediate feature tensors of a split deep neural network. The codec only requires clipping, coarse quantization, binarization, and entropy coding to compress feature tensors, so it is estimated to be well over 90% less complex than existing image or video codecs such as HEVC that are typically used for picture compression. We improved upon our earlier results by presenting an analytic model for obtaining optimal clipping ranges for feature tensors output by leaky ReLU activation functions. We used the new models with ReLU and leaky ReLU to estimate clipping and quantization error and showed that they produced a good match to empirically obtained results. We also presented an entropy-constrained quantizer design algorithm that pinned boundary reconstruction levels to quantize clipped activations, resulting in a 0.5–1.5% improvement in performance as compared to using the conventional algorithm. With this lightweight lossy compression technique, we were able to quantize the 32-bit floating point activations output by a split DNN to fewer than 2 bits per element and then compress them further to 0.6 to 0.8 bits per element, while keeping the loss in output precision or accuracy to less than 1%. We also showed that the lightweight codec yielded accuracies up to 1.3% higher than HEVC-SCC. The performance and simplicity of this lightweight compression technique make it an attractive option for coding activations for edge/cloud DNN applications.
-  J. Chen and X. Ran, “Deep learning with edge computing: A review,” Proceedings of the IEEE, vol. 107, no. 8, pp. 1655–1674, Aug. 2019.
-  N. D. Lane and P. Warden, “The deep (learning) transformation of mobile and embedded computing,” Computer, vol. 51, no. 5, pp. 12–16, May 2018.
-  M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “MnasNet: Platform-aware neural architecture search for mobile,” , pp. 2815–2823, June 2018.
-  Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. N. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” in ASPLOS ’17, Apr. 2017.
-  I. V. Bajić, W. Lin, and Y. Tian, “Collaborative intelligence: Challenges and opportunities,” in Proc. IEEE ICASSP, 2021, to appear. Available: arXiv:2102.06841.
-  X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, “Efficient and accurate approximations of nonlinear convolutional networks,” in 2015 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1984–1992.
-  Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu, “Discrimination-aware channel pruning for deep neural networks,” in Deep Learning and Unsupervised Feature Learning Workshop, NIPS, 2018, pp. 875–886.
-  A. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr, “WRPN: Wide reduced-precision networks,” in 6th Int. Conf. on Learning Representations (ICLR), May 2018.
-  B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2704–2713.
-  R. A. Cohen, H. Choi, and I. V. Bajić, “Lightweight compression of neural network feature tensors for collaborative intelligence,” Proc. 21st IEEE Int. Conf. Multimedia and Expo (ICME), July 2020.
-  A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural network acoustic models,” in Int. Conf. Machine Learning (ICML), 2013.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pp. 770–778, June 2016.
-  R. Banner, I. Hubara, E. Hoffer, and D. Soudry, “Scalable methods for 8-bit training of neural networks,” in Proc. 32nd Int. Conf. Neural Information Processing Systems (NeurIPS), Dec. 2018, pp. 5151–5159.
-  J. Choi, S. Venkataramani, V. Srinivasan, K. Gopalakrishnan, Z. Wang, and P. Chuang, “Accurate and efficient 2-bit quantized neural networks,” in Proc. 2nd SysML Conf., Mar. 2019.
-  I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proc. 30th Int. Conf. Neural Information Processing Systems (NIPS), Dec. 2016, pp. 4114–4122.
-  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” in 14th European Conf. on Computer Vision (ECCV), Oct. 2016, pp. 525–542.
-  R. Krishnamoorthi, “Quantizing deep convolutional networks for efficient inference: A whitepaper,” arXiv abs/1806.08342, June 2018.
-  E. Park, D. Kim, and S. Yoo, “Energy-efficient neural network accelerator based on outlier-aware low-precision computation,” in 2018 ACM/IEEE 45th Annual Int. Symposium Computer Architecture (ISCA), 2018, pp. 688–698.
-  E. Park, S. Yoo, and P. Vajda, “Value-aware quantization for training and inference of neural networks,” in 15th European Conf. on Computer Vision (ECCV), Sep. 2018.
-  R. Zhao, Y. Hu, J. Dotzel, C. D. Sa, and Z. Zhang, “Improving neural network quantization without retraining using outlier channel splitting,” in Proc. 36th Int. Conf. on Machine Learning, ICML 2019, June 2019, pp. 7543–7552.
-  M. Nagel, M. van Baalen, T. Blankevoort, and M. Welling, “Data-free quantization through weight equalization and bias correction,” in 2019 IEEE Int. Conf. Computer Vision (ICCV), 2019.
-  R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry, “ACIQ: Analytical clipping for integer quantization of neural networks,” [Online]: https://openreview.net/forum?id=B1x33sC9KQ, Sept. 2018.
-  R. Banner, Y. Nahshan, E. Hoffer, and D. Soudry, “Post-training 4-bit quantization of convolution networks for rapid-deployment,” in Proc. 33rd Int. Conf. Neural Information Processing Systems (NeurIPS), May 2019, pp. 7950–7958.
-  Y. Cai, Z. Yao, Z. Dong, A. Gholami, M. W. Mahoney, and K. Keutzer, “ZeroQ: A novel zero shot quantization framework,” in IEEE Conf. Computer Vision and Pattern Recognition (CVPR), June 2020.
-  H. Choi and I. V. Bajić, “Deep feature compression for collaborative object detection,” Proc. 25th IEEE Int. Conf. Image Processing (ICIP), pp. 3743–3747, Oct. 2018.
-  A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,” IEEE Trans. Mobile Comput., Oct. 2019.
-  H. Choi and I. V. Bajić, “Near-lossless deep feature compression for collaborative intelligence,” IEEE 20th Int. Workshop on Multimedia Signal Processing (MMSP), Aug. 2018.
-  A. E. Eshratifar, A. Esmaili, and M. Pedram, “Towards collaborative intelligence friendly architectures for deep learning,” 20th Int. Symposium Quality Electronic Design (ISQED), pp. 14–19, Mar. 2019.
-  A. E. Eshratifar, A. Esmaili, and M. Pedram, “BottleNet: A deep learning architecture for intelligent mobile cloud computing services,” in Proc. IEEE/ACM Int. Symposium Low Power Electronics and Design (ISLPED), July 2019.
-  H. Choi, R. A. Cohen, and I. V. Bajić, “Back-and-forth prediction for deep tensor compression,” Proc. 45th IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP), May 2020, in press.
-  V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on CPUs,” in Deep Learning and Unsupervised Feature Learning Workshop, NIPS, Dec. 2011.
-  O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” Int. J. Comput. Vision, vol. 115, no. 3, pp. 211–252, Dec. 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems. 2012, vol. 25, Curran Associates, Inc.
-  J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, Apr. 2018.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in European Conf. on Computer Vision (ECCV), Sept. 2014.
-  T. Kozubowski and K. Podgorski, “A multivariate and asymmetric generalization of Laplace distribution,” Computational Statistics, vol. 15, pp. 531–540, Dec. 2000.
-  P. A. Chou, T. Lookabaugh, and R. M. Gray, “Entropy-constrained vector quantization,” IEEE Trans. Acoust., Speech, Signal Process., vol. 37, no. 1, pp. 31–42, Jan. 1989.
-  B. Girod, “Quantization,” EE398A Image and Video Compression, [Online]: https://web.stanford.edu/class/ee398a/handouts/lectures/05-Quantization.pdf, Accessed: 2020-04-02.
-  D. Marpe, H. Schwarz, and T. Wiegand, “Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 620–636, July 2003.
-  F. Bossen, B. Bross, K. Suhring, and D. Flynn, “HEVC complexity and implementation analysis,” IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1685–1696, Dec. 2012.
-  “Darknet: Open source neural networks in C,” [Online]: https://pjreddie.com/darknet, Accessed: 2021-02-25.
-  A. Bochkovskiy, “darknet,” [Online]: https://github.com/AlexeyAB/darknet/tree/8c80ba6, Accessed: 2020-11-22.
-  “COCO API,” [Online]: https://github.com/cocodataset/cocoapi, Accessed: 2019-03-19.
-  “HEVC reference software (HM 16.20),” [Online]: http://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.20+SCM-8.8, Accessed: 2019-12-12.
-  “High efficiency video coding,” ITU-T and ISO/IEC, Rec. ITU-T H.265 | ISO/IEC 23008-2:2017, 2017.