Neural Network Compression Framework for fast model inference

02/20/2020 ∙ by Alexander Kozlov, et al. ∙ Intel

In this work we present a new framework for neural network compression with fine-tuning, which we call Neural Network Compression Framework (NNCF). It leverages recent advances in network compression methods and implements some of them, such as sparsity, quantization, and binarization. These methods make it possible to obtain more hardware-friendly models that can be run efficiently on general-purpose hardware computation units (CPU, GPU) or on specialized Deep Learning accelerators. We show that the developed methods can be successfully applied to a wide range of models to accelerate inference while preserving the original accuracy. The framework can be used within the training samples supplied with it, or as a standalone package that can be seamlessly integrated into existing training code with minimal adaptations. Currently, a PyTorch <cit.> version of NNCF is available as a part of OpenVINO Training Extensions.




1 Introduction

Deep Neural Networks are perhaps the most important breakthrough in machine learning in the last ten years [AlexNet, VGG, GNMT, WAVENet]. They brought new opportunities to improve the accuracy of algorithms for almost all ML problems by introducing models with millions of parameters. However, such models dramatically affected the performance of algorithms, because most of them require billions of operations to make accurate predictions. The analysis of these models [zeiler2014, rodriguez2016, CReLU] has shown that many of them have a high level of redundancy, basically caused by the fact that most networks were created to achieve the highest accuracy in an academic environment, while deployment has the performance/accuracy trade-off as a guiding principle. This observation induced the development of new methods that create more computation-efficient Deep Learning (DL) models, so that these models can be used in real-world applications with constrained resources, such as edge inference.

Most of these methods can be roughly divided into two categories. The first category contains the so-called Neural Architecture Search (NAS) algorithms [NASNet, MNASNet, PNASNet], which allow constructing efficient neural networks for a particular dataset and the specific hardware that will be used for model inference. The second category of methods aims to improve the performance of existing, usually hand-crafted DL models without an impact on their architecture design. Moreover, as we show in our research, these methods can be successfully applied to models obtained by NAS algorithms. One example of such methods is quantization [QuantizationGoogle, PACT], which is used to transform the model from a floating-point to a fixed-point representation and thus make more effective use of hardware supporting fixed-point arithmetic. The extreme case of quantized networks is binary networks [BNN, XNOR, DOREFA], where the weights and/or activations are represented by one of two available values, so that the original convolution can be equivalently replaced by XNOR and POPCOUNT operations, leading to a dramatic decrease in inference time on suitable hardware. Another method belonging to this group is introducing sparsity into the model weights [SparseCNN, FasterCNNSparsity, SparsityL0], which can be further exploited to reduce the data transfer rate at inference time, or even bring a performance speed-up through the use of sparse arithmetic, provided it is supported by the hardware.

In general, any method from the second group can be applied either during or after training, which adds a further distinction of these methods into post-training methods and methods applied with fine-tuning. Our framework contains methods that use fine-tuning when compressing a model.

In short, our contribution is the new NNCF framework, which has the following important features:

  • Support of quantization, binarization, and sparsity algorithms with fine-tuning.

  • Automatic model graph transformation - the model is wrapped and additional layers are inserted in the model graph.

  • Ability to mix compression methods and apply them at the same time.

  • Training samples for Image Classification, Object Detection and Semantic Segmentation, as well as configuration files for compressing various models.

  • HW-accelerated layers for fast model fine-tuning and multi-GPU training support.

  • Compatibility with OpenVINO Toolkit [OpenVINO].

It is worth noting that we put an emphasis on production use cases in our work in order to provide a simple yet powerful solution for accelerating the inference of neural networks solving problems in various domains.

2 Related Work

Currently, there are multiple efforts to bring compression algorithms not only to the research community but also to a wider range of users who are interested in real-world DL applications. Almost all DL frameworks, in one way or another, provide support for compression features. For example, quantizing a model to INT8 precision is now becoming a mainstream approach to accelerating inference with minimum effort.

One of the influential works here is [QuantizationGoogle], which introduced the so-called Quantization-Aware Training (QAT) for TensorFlow. This work highlights problems of algorithmic aspects of uniform quantization for CNNs with fine-tuning, and also proposes an efficient inference pipeline based on the instructions available in specific hardware. QAT is based on the Fake Quantization operation, which, in turn, can be represented as a pair of Quantize/Dequantize operations. An important feature of the proposed software solution is the automatic insertion of Fake Quantization operations, which makes model optimization more straightforward for the user. However, this approach has significant drawbacks, namely increased training time and memory consumption. Another concern is that the quantization method of [QuantizationGoogle] is based on the naive min/max approach and may potentially achieve worse results than more sophisticated quantization range selection strategies. The latter problem is solved by the methods proposed in [PACT], where quantization parameters are learned using gradient descent. In our framework we use a similar quantization method, along with other quantization schemes, while also providing the ability to automatically insert Fake Quantization operations into the model graph.

Another TensorFlow-based framework, Graffitist, which also leverages the training of quantization thresholds [TQT], aims to improve upon the QAT techniques by providing range-precision balancing of the resultant per-tensor quantization parameters via training them jointly with the network weights. This scheme is similar to ours but is limited to symmetric quantization with power-of-two quantization scales and only allows for 4/8-bit quantization widths, while our framework imposes no such restrictions, making it more flexible for end users. Furthermore, NNCF does not perform additional network graph transformations during the quantization process, such as batch normalization folding, which requires double computation of the convolutional operation, and is therefore less demanding of memory and computational resources.

Among the PyTorch-based tools available for model compression, the Neural Network Distiller [distiller] is the most well-known one. It contains implementations of various compression methods, such as quantization, binarization, filter pruning, and others. However, this solution mostly focuses on research tasks rather than on applying the methods to real use cases. The most critical drawback of Distiller is the lack of a ready-to-use pipeline from model compression to inference.

The main feature of existing compression frameworks is usually the ability to quantize the weights and/or activations of the model from 32-bit floating point into lower bit-width representations without sacrificing much of the model's accuracy. However, as is now commonly known [SongHan_sparsity], deep neural networks can also typically tolerate high levels of sparsity, that is, a large proportion of weights or neurons in the network can be zeroed out without much harm to model performance. NNCF allows producing compressed models that are both quantized and sparsified. The algorithms implemented in NNCF constitute non-structured network sparsification approaches, i.e. methods that result in sparse weight matrices of convolutional and fully-connected layers with zeros randomly distributed inside the weight tensors. This is in contrast to the so-called structured sparsity methods, which aim to prune away whole neurons or convolutional filters [ChannelPruning]. The non-structured sparsity algorithms generally range from relatively straightforward magnitude-based weight pruning schemes [SongHan_sparsity, MagnitudeSparsity] to more complex approaches such as variational and targeted dropout [VariationalDropout, TargetedDropout] and regularization-based methods [RBSparsity].

3 Framework Architecture

NNCF is built on top of the popular PyTorch framework. Conceptually, NNCF consists of an integral core part with a set of compression methods, which forms the NNCF Python package, and a set of training samples which demonstrate the capabilities of the compression methods implemented in the package. Each compression method has three basic components with a defined interface:

  • Compression Method itself, responsible for the correct model transformation, initialization, and exporting the model for use outside PyTorch.

  • Compression Loss, representing an additional loss function introduced by the compression algorithm to facilitate compression.

  • Compression Scheduler, controlling the parameters of the compression method during the training process.

We assume that potentially any compression method can be implemented using these three abstractions. For example, the Regularization-Based (RB) sparsity method implemented in NNCF introduces a weight mask for Convolutional and Fully-Connected layers, which is an additional training parameter. This mask is added when the model is being wrapped and is multiplied by the weights during export. This sparsity method exploits a regularization loss and implements its own scheduler, which gradually increases the sparsity rate.
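As an illustration, the three abstractions can be sketched as plain Python classes. This is a schematic sketch only; the class and method names below are simplified and do not reproduce the actual NNCF API.

```python
# Schematic sketch of the three NNCF abstractions; names are illustrative,
# not the real NNCF API.

class CompressionLoss:
    """Additional loss term introduced by a compression algorithm."""
    def __call__(self) -> float:
        # A real implementation would compute e.g. an RB-sparsity penalty.
        return 0.0

class CompressionScheduler:
    """Controls a compression parameter (e.g. target sparsity) over training."""
    def __init__(self, initial: float, final: float, total_steps: int):
        self.initial, self.final, self.total_steps = initial, final, total_steps
        self.current_step = 0

    def step(self) -> None:
        self.current_step = min(self.current_step + 1, self.total_steps)

    @property
    def target(self) -> float:
        # Linear ramp from the initial to the final level.
        frac = self.current_step / self.total_steps
        return self.initial + (self.final - self.initial) * frac

class CompressionMethod:
    """Wraps a model and owns a Compression Loss and a Compression Scheduler."""
    def __init__(self, model):
        self.model = model
        self.loss = CompressionLoss()
        self.scheduler = CompressionScheduler(0.0, 0.6, total_steps=100)
```

During fine-tuning, the scheduler's step() would be called once per iteration and its target read by the method to update masks or quantizer parameters.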

As mentioned before, one of the important features of the framework is automatic model transformation, i.e. the insertion of the auxiliary layers and operations required for a particular compression algorithm. This requires access to the PyTorch model graph, which the PyTorch framework does not readily expose. To overcome this problem, we patch the PyTorch module in order to get access to all of its operations at any time.

Another important novelty of NNCF is support for mixing algorithms, whereby the user can build their own compression pipeline from several compression methods. An example is models that are trained to be sparse and quantized at the same time in order to efficiently utilize the sparse fixed-point arithmetic of the target hardware. The mixing feature implemented inside the framework does not require any adaptations on the user's side; to enable it, one only needs to specify the set of desired compression methods in the configuration file.
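To illustrate, a configuration enabling two methods at once might look as follows. The field names here are illustrative assumptions on our part, not the exact NNCF configuration schema.

```python
# Illustrative configuration for mixing sparsity and quantization;
# the exact key names in NNCF's configuration files may differ.
config = {
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": [
        {"algorithm": "rb_sparsity", "params": {"sparsity_target": 0.6}},
        {"algorithm": "quantization",
         "weights": {"mode": "symmetric", "bits": 8},
         "activations": {"mode": "asymmetric", "bits": 8}},
    ],
}

# The framework would instantiate one compression method per list entry.
enabled = [entry["algorithm"] for entry in config["compression"]]
```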

Fig. 1 shows the common training pipeline for model compression. At the initial step, a particular compression algorithm is instantiated and the model is wrapped with additional compression layers. After that, the wrapped model is fine-tuned on the target dataset using a modified training pipeline. The modification consists of a call to the Compression Loss computation, whose result is added to the main loss (e.g. the cross-entropy loss in the case of a classification task), after which the Compression Scheduler step is called. As we show in Appendix A, any existing training pipeline written in PyTorch can be easily adapted to support model compression using NNCF. After the compressed model is trained, we can export it to the ONNX format for further use with the OpenVINO [OpenVINO] inference toolkit.

Figure 1:

A common model compression pipeline with NNCF. At the first step, an original full-precision model is automatically wrapped by the compression algorithm and then fine-tuned for a pre-defined number of epochs. After the training is done the model is exported to ONNX format and can be inferred with OpenVINO.

4 Compression Methods Overview

In this section we give an overview of the compression methods implemented in the NNCF framework.

4.1 Quantization

The first and most common compression method is quantization. Our quantization approach combines the ideas of QAT [QuantizationGoogle] and PACT [PACT] and is very close to TQT [TQT]: we train quantization parameters jointly with the network weights using so-called "fake" quantization operations inside the model graph. In contrast to TQT, however, NNCF supports symmetric and asymmetric schemes for both activations and weights, as well as per-channel quantization of weights, which helps to quantize even lightweight models produced by NAS, such as EfficientNet-B0.

For all supported schemes, quantization is represented by the affine mapping of integers q to real numbers r:

r = s * (q - z)     (1)

where s and z are quantization parameters. The constant s ("scale factor") is a positive real number; z ("zero-point") has the same type as the quantized value q and maps to the real value r = 0. The zero-point is used for asymmetric quantization and provides proper handling of zero paddings. For symmetric quantization it is equal to 0.
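A minimal NumPy sketch of this affine mapping (our own illustration, not NNCF code) is:

```python
import numpy as np

def affine_quantize(r, s, z, q_min, q_max):
    """Quantize real values r to integers q so that r is approximately s * (q - z)."""
    q = np.clip(np.rint(r / s) + z, q_min, q_max)
    return q.astype(np.int64)

def affine_dequantize(q, s, z):
    """Map integers q back to real values via the affine mapping."""
    return s * (q - z)
```

Note that np.rint rounds half to even, i.e. the "bankers" rounding mode.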

Symmetric quantization. During training we optimize the parameter scale, which represents the range of the original signal:

[low, high] = [scale * (q_min / q_max), scale]

where [q_min, q_max] defines the quantization range. The zero-point is always equal to zero in this case. Quantization ranges for activations and weights are tailored toward the hardware options available in the OpenVINO Toolkit (see Table 1). Three point-wise operations are sequentially applied to quantize r to q: scaling, clamping and rounding:

q = round( clamp( r / s ; q_min, q_max ) ),   s = scale / q_max     (2)

where round(.) denotes the "bankers" rounding operation.
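A sketch of symmetric fake quantization under these definitions (our illustration with an 8-bit signed range assumed by default; not the NNCF implementation):

```python
import numpy as np

def symmetric_fake_quantize(r, scale, bits=8, signed=True):
    """Scale, clamp and round r, then dequantize back ("fake" quantization)."""
    if signed:
        q_min, q_max = -2 ** (bits - 1), 2 ** (bits - 1) - 1
    else:
        q_min, q_max = 0, 2 ** bits - 1
    s = scale / q_max
    # scaling, clamping and rounding (np.rint is round-half-to-even)
    q = np.rint(np.clip(r / s, q_min, q_max))
    return s * q
```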

                      q_min          q_max
Signed activation    -2^(b-1)      2^(b-1) - 1
Unsigned activation      0          2^b - 1
Table 1: Quantization ranges for symmetric mode for different bit-widths b of the integer representation.

Asymmetric quantization. Unlike symmetric quantization, in the asymmetric case we optimize the boundaries of the floating-point range (r_min, r_max) and use the zero-point z from (1):

q = round( clamp( r ; r_min, r_max ) / s ) + z,   s = (r_max - r_min) / (2^b - 1),   z = -round( r_min / s )     (3)

In addition, we add a constraint to the quantization scheme: the floating-point zero should be exactly mapped to an integer within the quantization range. This constraint allows an efficient implementation of layers with padding. Therefore, we "tune" the ranges before quantization with the following scheme: the range is first extended to include zero (r_min <- min(r_min, 0), r_max <- max(r_max, 0)) and then shifted so that r = 0 maps exactly to the integer z.
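The range tuning can be sketched as follows (our own illustration of the constraint, assuming b-bit quantization with 2^b - 1 steps; not the NNCF implementation):

```python
def tune_range(r_min, r_max, bits=8):
    """Adjust (r_min, r_max) so that the real value 0.0 maps exactly to an integer."""
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)  # range must contain zero
    levels = 2 ** bits - 1
    s = (r_max - r_min) / levels      # scale factor
    z = int(round(-r_min / s))        # integer zero-point
    r_min = -z * s                    # shift so that 0.0 sits exactly on level z
    r_max = r_min + levels * s
    return r_min, r_max, s, z
```

With these tuned boundaries, quantizing r = 0 yields exactly q = z, so zero paddings incur no quantization error.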

Comparing quantization modes. The main advantage of symmetric quantization is its simplicity: it does not have a zero-point, which would introduce additional logic in hardware. Asymmetric mode, however, allows fully utilizing the quantization ranges, which may potentially lead to better accuracy, especially for lower-than-8-bit quantization, as we show in Table 2.

Scheme W8/A8 W4/A8 W4/A4
Symmetric 65.93 64.42 62.9
Asymmetric 66.1 65.87 64.7
Table 2: Quantization scheme comparison for different bit-width for weights and activations using MobileNet v2 and CIFAR-100 dataset. Top-1 accuracy is used as a metric. 32-bit floating point model has 65.53%.

Training and inference. As mentioned above, quantization is simulated on the forward pass during training by means of FakeQuantization operations, which perform quantization according to (2) or (3) and dequantization (4) at the same time:

r' = s * (q - z)     (4)

FakeQuantization layers are automatically inserted into the model graph. Weights are "fake"-quantized before the corresponding operations. Activations are quantized when the preceding layer changes the data type of the tensor, except for basic fusion patterns which correspond to a single operation at inference time, such as Conv + ReLU or Conv + BatchNorm + ReLU.

Unlike QAT [QuantizationGoogle] and TQT [TQT], we do not perform BatchNorm folding, in order to avoid the double computation of convolutions and the additional memory consumption which significantly slow down training. However, to avoid a misalignment between BatchNorm statistics at training and at inference time, we need to use a large batch size (256 or more).

Model Dataset Metric type Acc. FP32 Acc. compressed
ResNet-50 ImageNet top-1 acc. 76.1 76.03
Inception-v3 ImageNet top-1 acc. 77.32 78.36
MobileNet-v1 ImageNet top-1 acc. 69.6 69.75
MobileNet-v2 ImageNet top-1 acc. 71.8 71.8
MobileNet-v3 Small ImageNet top-1 acc. 67.1 66.77
SqueezeNet v1.1 ImageNet top-1 acc. 58.19 58.16
SSD300-BN VOC07+12 mAP 78.28 78.18
SSD512-BN VOC07+12 mAP 80.26 80.32
UNet Camvid mIoU 72.5 73.0
UNet Mapillary Vistas mIoU 56.23 56.16
ICNet Camvid mIoU 67.89 67.78
BERT-base-chinese XNLI (test, Chinese) top-1 acc. 77.68 77.02
BERT-large-uncased-wwm* SQuAD v1.1 (dev) F1/EM 93.21/87.2 92.48/85.95

* Whole word masking.

Table 3: Accuracy results of INT8 quantization measured in the training framework in FP32 precision.

4.2 Binarization

Model Dataset Weight / activation bin type % ops binarized Acc. FP32 Acc. compressed
ResNet-18 ImageNet XNOR / scale-threshold 92.4 69.75 61.71
ResNet-18 ImageNet DoReFa / scale-threshold 92.4 69.75 61.58
Table 4: Binarization results measured in the training framework
Model Dataset Acc. metric type Acc. FP32 Acc. compressed
ResNet-50 INT8 w/ 60% of sparsity (RB) ImageNet top-1 acc. 76.13 75.2
Inception v3 INT8 w/ 60% of sparsity (RB) ImageNet top-1 acc. 77.32 76.8
MobileNet v2 INT8 w/ 51% of sparsity (RB) ImageNet top-1 acc. 71.8 70.9
MobileNet v2 INT8 w/ 70% of sparsity (RB) ImageNet top-1 acc. 71.8 70.1
SSD300-BN INT8 w/ 70% of sparsity (Magnitude) VOC07+12 mAP 78.28 77.94
SSD512-BN INT8 w/ 70% of sparsity (Magnitude) VOC07+12 mAP 80.26 80.11
UNet INT8 w/ 60% of sparsity (Magnitude) CamVid mIoU 72.5 73.27
UNet INT8 w/ 60% of sparsity (Magnitude) Mapillary mIoU 56.23 54.30
ICNet INT8 w/ 60% of sparsity (Magnitude) CamVid mIoU 67.89 67.53
Table 5: Sparsification+quantization results measured in the training framework
Model Accuracy drop
All per-tensor symmetric 0.75
All per-tensor asymmetric 0.21
Per-channel weights asymmetric 0.17
All per-tensor asymmetric
w/ 31% of sparsity 0.35
Table 6: Accuracy top-1 results for INT8 quantization of EfficientNet-B0 model on ImageNet measured in the training framework

Currently, NNCF supports binarizing the weights and activations of 2D convolutional PyTorch layers (Conv2d).

Weight binarization can be done either via the XNOR [rastegari2016xnor] or the DoReFa [zhou2016dorefa] binarization scheme. For DoReFa binarization, the scale of the binarized weights for each convolution operation is calculated as the mean of the absolute values of the non-binarized convolutional filter weights, while for XNOR binarization each convolution operation has scales that are calculated in the same manner, but per input channel of the convolutional filter.
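The two weight-scale computations can be sketched with NumPy as follows (an illustration, assuming a weight tensor of shape (out_channels, in_channels, kH, kW); not NNCF code):

```python
import numpy as np

def dorefa_weight_scale(w):
    """Single scale per convolution: mean absolute value of all filter weights."""
    return float(np.mean(np.abs(w)))

def xnor_weight_scales(w):
    """Per-input-channel scales: mean absolute value over all remaining axes."""
    return np.mean(np.abs(w), axis=(0, 2, 3))
```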

Activation binarization is implemented by binarizing the inputs of the convolutional layers in the following way:

a_b = s * H(a - s * t)     (5)

where a are the non-binarized activation values, a_b the binarized activation values, H is the Heaviside step function, and s and t are trainable parameters corresponding to the binarization scale and threshold, respectively. The thresholds are trained separately for each output activation channel dimension.
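A sketch of this activation binarization in NumPy (illustrative; scalar s and t are assumed here, whereas per-channel thresholds are trained in practice):

```python
import numpy as np

def binarize_activations(a, s, t):
    """Compute a_b = s * H(a - s*t), with H the Heaviside step function."""
    return s * np.heaviside(a - s * t, 0.0)
```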

It is usually not recommended to binarize certain layers of a CNN, for instance, the input convolutional layer, the fully-connected layer and the convolutional layer directly preceding it, or the ResNet "downsample" layers. NNCF allows picking the exact subset of layers to be binarized via the layer blacklist/whitelist mechanism, as in other NNCF compression methods.

Finally, training binarized networks requires special scheduling of the training process, tailored specifically to each model architecture. The NNCF samples demonstrate binarization of a ResNet-18 architecture pre-trained on ImageNet using a four-stage process, with each stage taking a certain number of fine-tuning epochs:

  • Stage 1: the network is trained without any binarization,

  • Stage 2: the training continues with binarization enabled for activations only,

  • Stage 3: binarization is enabled both for activations and weights,

  • Stage 4: the optimizer learning rate, which had been kept constant at previous stages, is decreased according to a polynomial law, while weight decay parameter of the optimizer is set to 0.

The configuration files for the NNCF binarization algorithm allow controlling the stage durations of this training schedule. Table 4 presents the results of binarizing ResNet-18 with either XNOR or DoReFa weight binarization and scale-threshold activation binarization (5).

4.3 Sparsity

NNCF supports two non-structured sparsity algorithms: i) a simple magnitude-based sparsity training scheme, and ii) regularization-based training, which is a modification of the method proposed in [RBSparsity]. It has been argued [SparsityBenchmark] that complex approaches to sparsification, such as regularization-based ones, produce inconsistent results when applied to large benchmark datasets (e.g. ImageNet for classification, as opposed to e.g. CIFAR-100), and that magnitude-based sparsity algorithms provide comparable or better results in these cases. However, we found in our experiments that the regularization-based (RB) approach to sparsity outperforms the simple magnitude-based method for several classification models trained on ImageNet, achieving higher accuracy at the same sparsity level (e.g. for MobileNetV2). Hence, both methods can be useful in different contexts, with RB sparsity requiring tuning of the training schedule and a longer training procedure, but ultimately producing better results for certain tasks. We briefly describe the details of both network sparsification algorithms implemented in NNCF below.

Magnitude-based sparsity. In the magnitude-based weight pruning algorithm, the magnitude of each weight is used as a measure of its importance. In the NNCF implementation of magnitude-based sparsity, a schedule for the desired sparsity rate (equal to 1 - SL, where SL is the sparsity level) over the training process is defined, and a threshold value is calculated each time the sparsity rate is changed by the compression scheduler. Weights with magnitudes lower than the calculated threshold value are then zeroed out. The compression scheduler can be set to increase the sparsity rate from an initial to a final level over a certain number of training epochs. The dynamics of the sparsity level increase during training are adjustable, with polynomial, exponential, and adaptive modes supported.
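The threshold computation can be sketched as follows (our own illustration of a magnitude criterion, not the NNCF scheduler itself):

```python
import numpy as np

def magnitude_threshold(weights, sparsity_level):
    """Smallest magnitude such that zeroing weights below it reaches sparsity_level."""
    magnitudes = np.sort(np.abs(np.concatenate([w.ravel() for w in weights])))
    k = int(sparsity_level * magnitudes.size)
    return magnitudes[k - 1] if k > 0 else 0.0

def apply_sparsity_mask(w, threshold):
    """Zero out weights whose magnitude does not exceed the threshold."""
    return w * (np.abs(w) > threshold)
```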

Regularization-based sparsity. In our formulation of the RB sparsity algorithm, a complexity loss term is added to the total loss function during training, defined as

L_reg = ( (1/N) * sum_i [w_i != 0] - SL )^2

where N is the number of network parameters and SL is the desired sparsity level (percentage of non-zero weights) the network is set out to achieve. Note that the above regularization loss term penalizes networks with sparsity levels both lower and higher than the defined level. Following the derivations in [RBSparsity], in order to make the loss term differentiable, the model weights are reparametrized as follows:

w_i = theta_i * g_i,   g_i = H(sigmoid(s_i) - u_i),   u_i ~ U(0, 1)

where g_i is the stochastic binary gate, H is the Heaviside step function and sigmoid the logistic function. It can be shown that the above formulation is equivalent to g_i being sampled from the Bernoulli distribution with probability parameter p_i = sigmoid(s_i). Hence, the s_i are the trainable parameters which control whether the weight w_i is going to be zeroed out at test time (which is done for sigmoid(s_i) < 0.5).

On each training iteration, the set of binary gate values is sampled once from the above distribution and multiplied with the network weights. In the Monte Carlo approximation of the loss function in [RBSparsity], the mask of binary gates is generally sampled and applied several times per training iteration, but single mask sampling is sufficient in practice (as shown in [RBSparsity]). The expected loss term was shown to be proportional to the sum of the probabilities of the gates being non-zero [RBSparsity], which in our case results in the following expression:

E[L_reg] = ( (1/N) * sum_i sigmoid(s_i) - SL )^2

To make the error loss term (e.g. cross-entropy for classification) differentiable w.r.t. the s_i, we treat the threshold function g_i = H(sigmoid(s_i) - u_i) as a straight-through estimator (i.e. its gradient is passed through unchanged on the backward pass).

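A sketch of the stochastic gate sampling and an expected complexity loss of this kind (our illustration under the assumptions above, not NNCF code; target_nonzero denotes the desired fraction of non-zero weights):

```python
import numpy as np

def sample_gates(s, rng):
    """Sample binary gates g = H(sigmoid(s) - u), u ~ U(0, 1)."""
    p = 1.0 / (1.0 + np.exp(-s))   # probability of each gate being 1 (weight kept)
    u = rng.uniform(size=s.shape)
    return (p > u).astype(np.float64), p

def expected_complexity_loss(p, target_nonzero):
    """Penalize deviation of the expected non-zero fraction from the target."""
    return float((p.mean() - target_nonzero) ** 2)
```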
5 Results

Some results for the different compression methods were already presented in the corresponding sections. Table 6 reports compression results for EfficientNet-B0, which gives the best combination of accuracy and performance on ImageNet. We compare the accuracy of the floating-point model (76.84% top-1) with the accuracy of the compressed ones.

Table 7 shows performance results for the compressed models inferred with the OpenVINO toolkit.

Model Accuracy drop (%) Speed up
MobileNet v2 INT8 0.44 1.82x
ResNet-50 v1 INT8 -0.34 3.05x
Inception v3 INT8 -0.62 3.11x
SSD-300 INT8 -0.12 3.31x
UNet INT8 -0.5 3.14x
ResNet-18 XNOR 7.25 2.56x
Table 7: Relative performance/accuracy results with OpenVINO on Intel Xeon Gold 6230 Processor

To extend the scope of trainable models and to validate that NNCF can be easily combined with existing PyTorch-based training pipelines, we integrated NNCF with the popular mmdetection object detection toolbox [MMDet]. As a result, we were able to train INT8-quantized and INT8-quantized+sparse object detection models available in mmdetection on the challenging COCO dataset and achieve a less than 1 mAP point drop for the COCO-based mAP evaluation metric. Specific results for compressed RetinaNet-FPN-based detection models are shown in Table 8.
Model                                            FP32   Compressed
RetinaNet-ResNet50-FPN INT8                      35.6   35.3
RetinaNet-ResNeXt101-64x4d-FPN INT8              39.6   39.1
RetinaNet-ResNet50-FPN INT8 + 50% sparsity       35.6   34.7
Table 8: Validation set metrics for original and compressed models. Shown are mAP values for models trained and tested on the COCO dataset.

6 Conclusions

In this work we presented the new NNCF framework for model compression with fine-tuning. It supports various compression methods and allows combining them to obtain more lightweight neural networks. We paid special attention to usability aspects, simplified the compression process setup, and validated the framework on a wide range of models. Models obtained with NNCF show state-of-the-art results in terms of the accuracy-performance trade-off. The framework is compatible with the OpenVINO inference toolkit, which makes it attractive for applying compression in real-world applications. We are constantly working on developing new features, improving the current ones, and adding support for new models.


Appendix A Appendix

Described below are the steps required to modify an existing PyTorch training pipeline in order to integrate it with NNCF. The described use case implies that there exists a PyTorch pipeline which reproduces model training in floating-point precision, together with a pre-trained model snapshot. The objective of NNCF is to compress this model in order to accelerate inference. Once the NNCF package is installed, the user needs to revise the training code and introduce minor changes to enable model compression. Below are the steps needed to modify the training pipeline code in PyTorch:

  • Add the following imports at the beginning of the training sample, right after importing PyTorch:

    from nncf.dynamic_graph \
        import patch_torch_operators
    from nncf.algo_selector \
        import create_compression_algorithm \
        as create_cm_algo
  • Once a model instance is created and the pre-trained weights are loaded, a compression algorithm should be created and the model should be wrapped:

    cm_algo = create_cm_algo(model, config)
    model = cm_algo.model

    where config is a dictionary in which all the options and hyperparameters of the compression methods are specified.

  • Then the model can be wrapped with the DataParallel or DistributedDataParallel classes for multi-GPU training. In the case of distributed training, you also need to call the cm_algo.distributed() method at this stage.

  • You should call the cm_algo.initialize() method before the start of your training loop to initialize model parameters related to compression (e.g. the parameters of FakeQuantize layers). Some compression algorithms (e.g. quantization) require arguments (e.g. the train_loader for your training dataset) to be supplied to the initialize() method.

  • The following changes have to be applied to the training loop code: after model inference is done on the current training iteration, the compression loss should be added (using the + operator) to the common loss, e.g. the cross-entropy loss:

    compress_loss = cm_algo.loss()
    loss = cross_entropy_loss + compress_loss

    Call the scheduler step() after each training iteration:

        cm_algo.scheduler.step()

    Call the scheduler epoch_step() after each training epoch:

        cm_algo.scheduler.epoch_step()