1 Introduction
Deep Neural Networks are perhaps the most important breakthrough in machine learning in the last ten years [AlexNet, VGG, GNMT, WAVENet]. They brought new opportunities to improve the accuracy of algorithms for almost all ML problems by introducing models with millions of parameters. However, such models dramatically affected the performance of algorithms, because most of them require billions of operations to make accurate predictions. Analyses of these models [zeiler2014, rodriguez2016, CReLU] have shown that many of them have a high level of redundancy, basically caused by the fact that most networks were created to achieve the highest accuracy in an academic environment, whereas deployment has the performance/accuracy trade-off as a guiding principle. This observation induced the development of new methods which create more computation-efficient Deep Learning (DL) models, so that these models can be used in real-world applications with constrained resources, such as edge inferencing.
Most of these methods can be roughly divided into two categories. The first category contains the so-called Neural Architecture Search (NAS) algorithms [NASNet, MNASNet, PNASNet], which allow constructing efficient neural networks for a particular dataset and the specific hardware that will be used for model inference. The second category of methods aims to improve the performance of existing, usually hand-crafted DL models without an impact on their architecture design. Moreover, as we show in our research, these methods can be successfully applied to models obtained by NAS algorithms. One example of such methods is quantization [QuantizationGoogle, PACT], which is used to transform the model from floating-point to fixed-point representation and make more effective use of hardware supporting fixed-point arithmetic. The extreme case of quantized networks are binary networks [BNN, XNOR, DOREFA], where the weights and/or activations are represented by one of two available values, so that the original convolution can be equivalently replaced by XNOR and POPCOUNT operations, leading to a dramatic decrease in inference time on suitable hardware. Another method belonging to this group is introducing sparsity into the model weights [SparseCNN, FasterCNNSparsity, SparsityL0], which can be further exploited to reduce the data transfer rate at inference time, or even bring a performance speedup through the use of sparse arithmetic, given that it is supported by the hardware.
In general, any method from the second group can be applied either during or after training, which adds a further distinction of these methods into post-training methods and methods applied with fine-tuning. Our framework contains methods which use fine-tuning when compressing a model.
In short, our contribution is the new NNCF framework, which has the following important features:

Support for quantization, binarization, and sparsity algorithms with fine-tuning.

Automatic model graph transformation: the model is wrapped and additional layers are inserted into the model graph.

Ability to mix compression methods and apply them at the same time.

Training samples for Image Classification, Object Detection and Semantic Segmentation, as well as configuration files for the compression of various models.

HW-accelerated layers for fast model fine-tuning and multi-GPU training support.

Compatibility with OpenVINO Toolkit [OpenVINO].
It is worth noting that our work has a strong focus on production, aiming to provide a simple but powerful solution for accelerating the inference of neural networks that solve problems in various domains.
2 Related Work
Currently, there are multiple efforts to bring compression algorithms not only to the research community but also to a wider range of users who are interested in real-world DL applications. Almost all DL frameworks, in one way or another, provide support for compression features. For example, quantizing a model to INT8 precision is now becoming a mainstream approach to accelerating inference with minimum effort.
One of the influential works here is [QuantizationGoogle]
, which introduced the so-called Quantization-aware Training (QAT) for TensorFlow. This work highlights the algorithmic problems of uniform quantization for CNNs with fine-tuning, and also proposes an efficient inference pipeline based on the instructions available in specific hardware. QAT is based on the Fake Quantization operation, which, in turn, can be represented as a pair of Quantize/Dequantize operations. An important feature of the proposed software solution is the automatic insertion of the Fake Quantization operations, which makes model optimization more straightforward for the user. However, this approach has significant drawbacks, namely increased training time and memory consumption. Another concern is that the quantization method of [QuantizationGoogle] is based on a naive min/max approach and may potentially achieve worse results than more sophisticated quantization range selection strategies. The latter problem is solved by the methods proposed in [PACT], where the quantization parameters are learned using gradient descent. In our framework we use a similar quantization method, along with other quantization schemes, while also providing the ability to automatically insert Fake Quantization operations into the model graph.
Another, TensorFlow-based, framework is Graffitist, which also leverages the training of quantization thresholds [TQT]
. It aims to improve upon the QAT techniques by providing range-precision balancing of the resulting per-tensor quantization parameters by training them jointly with the network weights. This scheme is similar to ours, but it is limited to symmetric quantization with factor-of-2 quantization scales and only allows for 4/8-bit quantization widths, while our framework imposes no such restrictions, making it more flexible for end users. Furthermore, NNCF does not perform additional network graph transformations during the quantization process, such as batch normalization folding, which requires double computation of the convolution operations, and is therefore less demanding in terms of memory and computational resources.
Among the PyTorch-based tools available for model compression, the Neural Network Distiller [distiller] is the best known. It contains implementations of various compression methods, such as quantization, binarization, filter pruning, and others. However, this solution mostly focuses on research tasks rather than on applying these methods to real use cases. The most critical drawback of Distiller is the lack of a ready-made pipeline from model compression to inference.
The main feature of existing compression frameworks is usually the ability to quantize the weights and/or activations of the model from 32-bit floating point to lower bit-width representations without sacrificing much of the model's accuracy. However, as is now commonly known [SongHan_sparsity], deep neural networks can also typically tolerate high levels of sparsity, that is, a large proportion of the weights or neurons in the network can be zeroed out without much harm to model performance. NNCF makes it possible to produce compressed models that are both quantized and sparsified. The algorithms implemented in NNCF constitute non-structured network sparsification approaches, i.e. methods that result in sparse weight matrices of convolutional and fully-connected layers with zeros randomly distributed inside the weight tensors. This is in contrast to the so-called structured sparsity methods, which aim to prune away whole neurons or convolutional filters [ChannelPruning]. The non-structured sparsity algorithms generally range from relatively straightforward magnitude-based weight pruning schemes [SongHan_sparsity, MagnitudeSparsity] to more complex approaches such as variational and targeted dropout [VariationalDropout, TargetedDropout] and regularization-based methods [RBSparsity].
3 Framework Architecture
NNCF is built on top of the popular PyTorch framework. Conceptually, NNCF consists of an integral core part with a set of compression methods, which together form the NNCF Python package, and a set of training samples which demonstrate the capabilities of the compression methods implemented in the package. Each compression method has three basic components with a defined interface:

Compression Method itself, responsible for the correct model transformation, its initialization, and exporting the model for use outside PyTorch.

Compression Loss, representing an additional loss function introduced by the compression algorithm to facilitate compression.

Compression Scheduler, controlling the parameters of the compression method during the training process.
We assume that potentially any compression method can be implemented using these three abstractions. For example, the Regularization-Based (RB) sparsity method implemented in NNCF introduces a weight mask for Convolutional and Fully-Connected layers as an additional training parameter. This mask is added when the model is wrapped and is multiplied by the weights during export. This sparsity method exploits a regularization loss and implements its own scheduler, which gradually increases the sparsity rate.
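The three-component interface described above can be sketched as follows. The class and attribute names below are illustrative only and do not reproduce the exact NNCF API:

```python
class CompressionLoss:
    """Additional penalty added to the main task loss (zero by default)."""
    def __call__(self):
        return 0.0


class CompressionScheduler:
    """Adjusts compression parameters (e.g. target sparsity) over training."""
    def __init__(self):
        self.current_epoch = 0

    def step(self):
        pass  # called after each training iteration

    def epoch_step(self):
        self.current_epoch += 1  # called after each training epoch


class CompressionMethod:
    """Wraps the model, owns a loss and a scheduler, handles export."""
    def __init__(self, model):
        self.model = model  # the transformed (wrapped) model
        self.loss = CompressionLoss()
        self.scheduler = CompressionScheduler()

    def export_model(self, path):
        raise NotImplementedError  # e.g. apply masks and save to ONNX
```

A concrete method, such as RB sparsity, would subclass all three: its loss is the regularization term, and its scheduler drives the sparsity rate upwards during training.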
As mentioned before, one of the important features of the framework is automatic model transformation, i.e. the insertion of the auxiliary layers and operations required by a particular compression algorithm. This requires access to the PyTorch model graph, which is not actually made available by the PyTorch framework. To overcome this problem we patch the PyTorch module in order to get access to all its operations at any time.
Another important novelty of NNCF is its support for algorithm mixing, where the user can build a custom compression pipeline from several compression methods. One example is models which are trained to be sparse and quantized at the same time in order to efficiently utilize the sparse fixed-point arithmetic of the target hardware. The mixing feature implemented inside the framework does not require any adaptations on the user's side; to enable it, one only needs to specify the set of desired compression methods in the configuration file.
Fig. 1 shows the common training pipeline for model compression. At the initial step, a particular compression algorithm is instantiated and the model is wrapped with additional compression layers. After that, the wrapped model is fine-tuned on the target dataset using a modified training pipeline: the Compression Loss is computed and added to the main loss (e.g. the cross-entropy loss in the case of a classification task), and then a Compression Scheduler step is called. As we show in Appendix A, any existing training pipeline written in PyTorch can easily be adapted to support model compression using NNCF. After the compressed model is trained, we can export it to the ONNX format for further usage in the OpenVINO [OpenVINO] inference toolkit.
4 Compression Methods Overview
In this section we give an overview of the compression methods implemented in the NNCF framework.
4.1 Quantization
The first and most common compression method is quantization. Our quantization approach combines the ideas of QAT [QuantizationGoogle] and PACT [PACT] and is very close to TQT [TQT]: we train the quantization parameters jointly with the network weights using so-called "fake" quantization operations inside the model graph. In contrast to TQT, however, NNCF supports both symmetric and asymmetric schemes for activations and weights, as well as per-channel quantization of weights, which helps to quantize even lightweight models produced by NAS, such as EfficientNet-B0.
For all supported schemes, quantization is represented by the affine mapping of integers q to real numbers r:

r = s(q − z)    (1)

where s and z are the quantization parameters. The constant s ("scale factor") is a positive real number, and z ("zero-point") has the same integer type as the quantized value q and maps to the real value r = 0. The zero-point is used for asymmetric quantization and provides proper handling of zero paddings. For symmetric quantization it is equal to 0.
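As a quick numerical illustration of the affine mapping (1), consider the following sketch; the `dequantize` helper is ours, not an NNCF function:

```python
def dequantize(q, s, z):
    """Affine mapping r = s * (q - z) from integers q to real numbers r."""
    return s * (q - z)

# Hypothetical 8-bit asymmetric parameters: scale s = 0.1, zero-point z = 30.
s, z = 0.1, 30
# The zero-point maps exactly to the real value zero, as required by (1):
r0 = dequantize(z, s, z)   # 0.0
r1 = dequantize(42, s, z)  # roughly 1.2
```

The exact mapping of z to r = 0 is what makes zero paddings representable without error in the asymmetric scheme.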
Symmetric quantization. During training we optimize the parameter s, which represents the range of the original signal; [q_low, q_high] defines the integer quantization range, and the zero-point is always equal to zero in this case. Quantization ranges for activations and weights are tailored to the hardware options available in the OpenVINO Toolkit (see Table 1). Three point-wise operations are sequentially applied to quantize r to q: scaling, clamping and rounding:

q = ⌊ clamp(r/s · q_high; q_low, q_high) ⌉    (2)

where ⌊·⌉ denotes the "banker's" rounding operation.
  q_low  q_high
Weights  −2^(b−1) + 1  2^(b−1) − 1
Signed Activation  −2^(b−1)  2^(b−1) − 1
Unsigned Activation  0  2^b − 1
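A minimal sketch of the scale-clamp-round sequence of (2), assuming an 8-bit weight range of [−127, 127]; the helper below is illustrative, not NNCF code:

```python
def quantize_symmetric(r, s, q_low, q_high):
    """Scale, clamp, round, as in eq. (2). Python's round() is banker's rounding."""
    scaled = r / s * q_high          # scaling
    clamped = max(q_low, min(q_high, scaled))  # clamping
    return round(clamped)            # banker's rounding

# 8-bit weights: q_low = -127, q_high = 127, trained scale s = 2.0.
q = quantize_symmetric(1.0, 2.0, -127, 127)   # 63.5 rounds to even -> 64
q_sat = quantize_symmetric(3.0, 2.0, -127, 127)  # saturates at 127
```

Values outside the trained range [−s, s] saturate at the integer boundaries, which is what makes the learned scale a trade-off between range and precision.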
Asymmetric quantization. Unlike symmetric quantization, in the asymmetric mode we optimize the boundaries of the floating-point range (r_min, r_max) and use the zero-point z from (1):

s = (r_max − r_min)/(2^b − 1),  z = ⌊ −r_min/s ⌉,  q = clamp(⌊ r/s ⌉ + z; 0, 2^b − 1)    (3)

In addition, we impose a constraint on the quantization scheme: the floating-point zero should be mapped exactly onto an integer within the quantization range. This constraint allows an efficient implementation of layers with padding. Therefore, we "tune" the ranges before quantization with the following scheme: the range is first extended to include zero, r_min ← min(r_min, 0) and r_max ← max(r_max, 0), and the boundaries are then shifted so that the rounded zero-point is exact, i.e. r_min ← −z·s and r_max ← (2^b − 1 − z)·s.
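The range-tuning step can be sketched as follows; this is our reconstruction of the scheme, with illustrative names, and it assumes r_max > r_min:

```python
def tune_range(r_min, r_max, bits=8):
    """Extend the range to include zero, then align it so that the real value
    zero maps exactly onto the integer zero-point z."""
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)  # range must contain zero
    s = (r_max - r_min) / (2 ** bits - 1)            # scale from the raw range
    z = round(-r_min / s)                            # integer zero-point
    # Shift boundaries so that -r_min / s is exactly the integer z:
    return -z * s, (2 ** bits - 1 - z) * s, s, z

r_lo, r_hi, s, z = tune_range(-0.37, 1.0)
# After tuning, r = 0 quantizes exactly to the integer z and back to 0.0.
```

The width of the tuned range is still (2^b − 1)·s; only the endpoints move slightly so that zero is exactly representable.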
Comparing quantization modes. The main advantage of symmetric quantization is its simplicity: it does not have a zero-point, which would introduce additional logic in hardware. The asymmetric mode, however, allows the quantization ranges to be fully utilized, which may potentially lead to better accuracy, especially for lower than 8-bit quantization, as we show in Table 2.
Scheme  W8/A8  W4/A8  W4/A4 
Symmetric  65.93  64.42  62.9 
Asymmetric  66.1  65.87  64.7 
Training and inference. As mentioned above, quantization is simulated on the forward pass during training by means of FakeQuantization operations, which perform quantization according to (2) or (3) and dequantization (4) in a single step:

r̃ = s(q − z)    (4)

FakeQuantization layers are automatically inserted into the model graph. Weights are "fake"-quantized before the corresponding operations. Activations are quantized when the preceding layer changes the data type of the tensor, except for basic fusion patterns which correspond to a single operation at inference time, such as Conv + ReLU or Conv + BatchNorm + ReLU.
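A fake-quantization operation can be sketched as a quantize-dequantize round trip; the helper below is illustrative, not the NNCF implementation:

```python
def fake_quantize(r, s, z, bits=8):
    """Quantize per eq. (3), then immediately dequantize per eq. (4)."""
    q = max(0, min(2 ** bits - 1, round(r / s) + z))  # quantize + clamp
    return s * (q - z)                                # dequantize

# Hypothetical asymmetric parameters: s = 0.02, z = 10.
s, z = 0.02, 10
x = 0.335
xq = fake_quantize(x, s, z)
# The simulated quantization error stays within half a quantization step.
```

Because the output stays in floating point, the rest of the training graph (and the gradient flow, via a straight-through estimator) is unaffected, while the forward pass sees exactly the rounding error that fixed-point inference will introduce.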
Unlike QAT [QuantizationGoogle] and TQT [TQT], we do not perform BatchNorm folding, in order to avoid the double computation of convolutions and the additional memory consumption which significantly slow down training. However, to avoid misalignment between the BatchNorm statistics used during training and inference, we need to use a large batch size (256 or more).
Model  Dataset  Metric type  Acc. FP32  Acc. compressed 
ResNet50  ImageNet  top1 acc.  76.1  76.03 
Inceptionv3  ImageNet  top1 acc.  77.32  78.36 
MobileNetv1  ImageNet  top1 acc.  69.6  69.75 
MobileNetv2  ImageNet  top1 acc.  71.8  71.8 
MobileNetv3 Small  ImageNet  top1 acc.  67.1  66.77 
SqueezeNet v1.1  ImageNet  top1 acc.  58.19  58.16 
SSD300BN  VOC07+12  mAP  78.28  78.18 
SSD512BN  VOC07+12  mAP  80.26  80.32 
UNet  Camvid  mIoU  72.5  73.0 
UNet  Mapillary Vistas  mIoU  56.23  56.16 
ICNet  Camvid  mIoU  67.89  67.78 
BERTbasechinese  XNLI (test, Chinese)  top1 acc.  77.68  77.02 
BERTlargeuncasedwwm  SQuAD v1.1 (dev)  F1/EM  93.21/87.2  92.48/85.95 
^{*} Whole word masking.
4.2 Binarization
Model  Dataset  Weight / activation bin type  % ops binarized  Acc. FP32  Acc. compressed 
ResNet18  ImageNet  XNOR / scale-threshold  92.4  69.75  61.71 
ResNet18  ImageNet  DoReFa / scale-threshold  92.4  69.75  61.58 
Model  Dataset  Acc. metric type  Acc. FP32  Acc. compressed 
ResNet50 INT8 w/ 60% of sparsity (RB)  ImageNet  top1 acc.  76.13  75.2 
Inception v3 INT8 w/ 60% of sparsity (RB)  ImageNet  top1 acc.  77.32  76.8 
MobileNet v2 INT8 w/ 51% of sparsity (RB)  ImageNet  top1 acc.  71.8  70.9 
MobileNet v2 INT8 w/ 70% of sparsity (RB)  ImageNet  top1 acc.  71.8  70.1 
SSD300BN INT8 w/ 70% of sparsity (Magnitude)  VOC07+12  mAP  78.28  77.94 
SSD512BN INT8 w/ 70% of sparsity (Magnitude)  VOC07+12  mAP  80.26  80.11 
UNet INT8 w/ 60% of sparsity (Magnitude)  CamVid  mIoU  72.5  73.27 
UNet INT8 w/ 60% of sparsity (Magnitude)  Mapillary  mIoU  56.23  54.30 
ICNet INT8 w/ 60% of sparsity (Magnitude)  CamVid  mIoU  67.89  67.53 
Quantization scheme  Accuracy drop 
All per-tensor symmetric  0.75 
All per-tensor asymmetric  0.21 
Per-channel weights asymmetric  0.17 
All per-tensor asymmetric w/ 31% of sparsity  0.35 
Currently NNCF supports the binarization of weights and activations of 2D convolutional PyTorch layers (Conv2D).
Weight binarization can be done either via the XNOR [rastegari2016xnor] or the DoReFa [zhou2016dorefa] binarization scheme. For DoReFa binarization, the scale of the binarized weights for each convolution operation is calculated as the mean of the absolute values of the non-binarized convolutional filter weights, while for XNOR binarization each convolution operation has scales that are calculated in the same manner, but per input channel of the convolutional filter.
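The two scale computations can be sketched as follows; the tensor layout (out_channels, in_channels, kH, kW) and the exact reduction axes are our assumptions for illustration:

```python
import numpy as np

def dorefa_scale(weights):
    """DoReFa: a single scale, the mean absolute value of the whole tensor."""
    return np.abs(weights).mean()

def xnor_scales(weights):
    """XNOR variant as described in the text: one scale per input channel,
    i.e. mean |w| over the output-filter and spatial axes."""
    return np.abs(weights).mean(axis=(0, 2, 3))

# Toy weight tensor: 1 output filter, 2 input channels, 2x2 kernel.
w = np.array([[[[1.0, -2.0], [3.0, -4.0]],
               [[0.5, -0.5], [0.5, -0.5]]]])
single = dorefa_scale(w)   # mean over all 8 values
per_ch = xnor_scales(w)    # one value per input channel
```

In both cases the binarized weights are sign(w) multiplied by the corresponding scale, so the scale restores the magnitude information lost by binarization.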
Activation binarization is implemented by binarizing the inputs of the convolutional layers in the following way:

out = s · H(in − t)    (5)

where in are the non-binarized activation values, out are the binarized activation values, H(x) is the Heaviside step function, and s and t are trainable parameters corresponding to the binarization scale and threshold respectively. The thresholds t are trained separately for each output activation channel dimension.
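A sketch of (5) on a toy tensor (illustrative helper, not NNCF code):

```python
import numpy as np

def binarize_activations(x, scale, threshold):
    """out = scale * H(x - threshold): each activation becomes 0 or scale."""
    return scale * np.heaviside(x - threshold, 0.0)

x = np.array([-0.2, 0.1, 0.7])
out = binarize_activations(x, scale=0.5, threshold=0.3)
# Only the activation above the threshold survives, rescaled to `scale`.
```

Since the output takes only the two values {0, s}, the subsequent binary convolution can be computed with XNOR/POPCOUNT logic and a single multiplication by the scale.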
It is usually not recommended to binarize certain layers of CNNs, for instance the input convolutional layer, the fully connected layer and the convolutional layer directly preceding it, or the ResNet "downsample" layers. NNCF allows picking the exact subset of layers to be binarized via the same layer blacklist/whitelist mechanism as in the other NNCF compression methods.
Finally, training binarized networks requires special scheduling of the training process, tailored specifically to each model architecture. The NNCF samples demonstrate the binarization of a ResNet18 architecture pretrained on ImageNet using a four-stage process, with each stage taking a certain number of fine-tuning epochs:

Stage 1: the network is trained without any binarization,

Stage 2: the training continues with binarization enabled for activations only,

Stage 3: binarization is enabled both for activations and weights,

Stage 4: the optimizer learning rate, which had been kept constant at the previous stages, is decreased according to a polynomial law, while the weight decay parameter of the optimizer is set to 0.
The configuration files for the NNCF binarization algorithm allow controlling the durations of the stages of this training schedule. Table 4 presents the results of binarizing ResNet18 with either XNOR or DoReFa weight binarization and scale-threshold activation binarization (5).
4.3 Sparsity
NNCF supports two non-structured sparsity algorithms: i) a simple magnitude-based sparsity training scheme, and ii) regularization-based training, which is a modification of the method proposed in [RBSparsity]. It has been argued [SparsityBenchmark] that complex approaches to sparsification such as regularization-based ones produce inconsistent results when applied to large benchmark datasets (e.g. ImageNet for classification, as opposed to e.g. CIFAR-100), and that magnitude-based sparsity algorithms provide comparable or better results in these cases. However, we found in our experiments that the regularization-based (RB) approach to sparsity outperforms the simple magnitude-based method for several classification models trained on ImageNet, achieving higher accuracy at the same sparsity level (e.g. for MobileNet-v2). Hence, both methods can be useful in different contexts, with RB sparsity requiring tuning of the training schedule and a longer training procedure, but ultimately producing better results for certain tasks. We briefly describe the details of both network sparsification algorithms implemented in NNCF below.
Magnitude-based sparsity. In the magnitude-based weight pruning algorithm, the magnitude of each weight is used as a measure of its importance. In the NNCF implementation of magnitude-based sparsity, a schedule for the desired sparsity rate (the share of weights to be zeroed out) over the training process is defined, and a threshold value is recalculated each time the sparsity rate is changed by the compression scheduler. Weights whose magnitudes are lower than the calculated threshold value are then zeroed out. The compression scheduler can be set to increase the sparsity rate from an initial to a final level over a certain number of training epochs, and the dynamics of the sparsity level increase during training are adjustable via several supported modes.
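The threshold computation can be sketched as follows (illustrative code, not NNCF's implementation):

```python
import numpy as np

def magnitude_threshold(weights, sparsity_rate):
    """Threshold such that the given fraction of smallest-|w| weights is zeroed."""
    flat = np.sort(np.abs(weights).ravel())
    k = int(sparsity_rate * flat.size)       # number of weights to prune
    return flat[k - 1] if k > 0 else 0.0

w = np.array([0.05, -0.6, 0.01, 0.8, -0.3])
t = magnitude_threshold(w, 0.4)              # prune the 2 smallest magnitudes
mask = np.abs(w) > t                         # surviving weights
```

As the scheduler raises the sparsity rate over training, the threshold grows accordingly and progressively more small-magnitude weights are masked out.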
Regularization-based sparsity. In our formulation of the RB sparsity algorithm, a complexity loss term is added to the total loss function during training, defined as

L_reg = ( (1/N) Σ_j 𝟙[w_j ≠ 0] − SL )²    (6)

where N is the number of network parameters and SL is the desired level of non-zero weights that the network is set out to achieve. Note that the above regularization loss term penalizes networks with sparsity levels both lower and higher than the defined level. Following the derivations in [RBSparsity], in order to make the loss term differentiable, the model weights are reparametrized as follows:

w_j = θ_j · ε_j,  ε_j = H(σ(s_j) − u_j),  u_j ∼ U(0, 1)
where ε_j is a stochastic binary gate, σ is the sigmoid function and H is the unit step function. It can be shown that the above formulation is equivalent to ε_j being sampled from the Bernoulli distribution with probability parameter p_j = σ(s_j). Hence, the s_j are the trainable parameters which control whether a weight is going to be zeroed out at test time (which is done for σ(s_j) < 0.5). On each training iteration, the set of binary gate values is sampled once from the above distribution and multiplied with the network weights. In the Monte Carlo approximation of the loss function in [RBSparsity], the mask of binary gates is generally sampled and applied several times per training iteration, but single mask sampling is sufficient in practice (as shown in [RBSparsity]). The expected loss term was shown to be proportional to the sum of the probabilities of the gates being non-zero [RBSparsity], which in our case results in the following expression:
L_reg = ( (1/N) Σ_j σ(s_j) − SL )²    (7)
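The gate sampling and the expected loss term (7) can be sketched as follows (illustrative code, not NNCF's implementation):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def sample_gates(s, rng):
    """Bernoulli gates: eps_j = H(sigmoid(s_j) - u_j) with u_j ~ U(0, 1)."""
    return (sigmoid(s) > rng.uniform(size=s.shape)).astype(float)

def rb_loss(s, target_nonzero_fraction):
    """Expected penalty per eq. (7): mean gate probability vs. target level."""
    return (sigmoid(s).mean() - target_nonzero_fraction) ** 2

# Two gates strongly open, two strongly closed: mean probability is 0.5,
# so the penalty for a 50% non-zero target is (near) zero.
s = np.array([-4.0, 4.0, 4.0, -4.0])
rng = np.random.default_rng(0)
gates = sample_gates(s, rng)   # a binary mask multiplied with the weights
```

The gates are binary in the forward pass, while the loss (7) is a smooth function of the s_j, which is exactly what makes the sparsity level trainable.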
To make the error loss term (e.g. cross-entropy for classification) differentiable w.r.t. s_j, we treat the threshold function H(x) as a straight-through estimator (i.e. dH/dx = 1).

5 Results
Some of the results for the different compression methods were already reported in the corresponding sections. Table 6 reports compression results for EfficientNet-B0, which gives one of the best combinations of accuracy and performance on ImageNet; we compare the accuracy of the floating-point model (76.84% top-1) with the accuracy of the compressed models.
Table 7 shows the performance results of the compressed models with the OpenVINO toolkit.
Model  Accuracy drop (%)  Speed up 
MobileNet v2 INT8  0.44  1.82x 
ResNet50 v1 INT8  0.34  3.05x 
Inception v3 INT8  0.62  3.11x 
SSD300 INT8  0.12  3.31x 
UNet INT8  0.5  3.14x 
ResNet18 XNOR  7.25  2.56x 
To extend the scope of trainable models and to validate that NNCF can be easily combined with existing PyTorch-based training pipelines, we integrated NNCF with the popular mmdetection object detection toolbox [MMDet]. As a result, we were able to train INT8-quantized and INT8-quantized+sparse object detection models available in mmdetection on the challenging COCO dataset and achieve a less than 1 mAP point drop for the COCO-style mAP evaluation metric. Specific results for compressed RetinaNet-FPN-based detection models are shown in Table 8.
Model  FP32  Compressed 
RetinaNet-ResNet50-FPN INT8  35.6  35.3 
RetinaNet-ResNeXt101-64x4d-FPN INT8  39.6  39.1 
RetinaNet-ResNet50-FPN INT8+50% sparsity  35.6  34.7 
6 Conclusions
In this work we presented the new NNCF framework for model compression with fine-tuning. It supports various compression methods and allows combining them to obtain more lightweight neural networks. We paid special attention to usability aspects, simplified the compression process setup, and validated the framework on a wide range of models. Models obtained with NNCF show state-of-the-art results in terms of the accuracy/performance trade-off. The framework is compatible with the OpenVINO inference toolkit, which makes it attractive for applying compression in real-world applications. We are constantly working on developing new features and improving the current ones, as well as adding support for new models.
References
Appendix A Appendix
Described below are the steps required to modify an existing PyTorch training pipeline in order for it to be integrated with NNCF. The described use case implies that there exists a PyTorch pipeline that reproduces model training in floating-point precision, along with a pretrained model snapshot. The objective of NNCF is to compress this model in order to accelerate its inference time. Once the NNCF package is installed, the user needs to revise the training code and introduce minor changes to enable model compression. Below are the steps needed to modify the training pipeline code in PyTorch:

Add the following imports at the beginning of the training sample, right after importing PyTorch:

from nncf.dynamic_graph import patch_torch_operators
from nncf.algo_selector import create_compression_algorithm as create_cm_algo

patch_torch_operators()
Once a model instance is created and the pretrained weights are loaded, a compression algorithm should be created and the model should be wrapped:

compression_algo = create_cm_algo(model, config)
model = compression_algo.model

where config is a dictionary in which all the options and hyperparameters of the compression methods are specified.

Then the model can be wrapped with the DataParallel or DistributedDataParallel classes for multi-GPU training. In the case of distributed training, you also need to call the compression_algo.distributed() method at this stage.

You should call the compression_algo.initialize() method before the start of your training loop to initialize model parameters related to its compression (e.g. parameters of FakeQuantize layers). Some compression algorithms (e.g. quantization) require arguments (e.g. the train_loader for your training dataset) to be supplied to the initialize() method.

The following changes have to be applied to the training loop code: after model inference is done on the current training iteration, the compression loss should be added (using the + operator) to the common loss, e.g. the cross-entropy loss:

loss = cross_entropy_loss + compression_algo.loss()
Call the scheduler step() after each training iteration:

compression_algo.scheduler.step()

Call the scheduler epoch_step() after each training epoch:

compression_algo.scheduler.epoch_step()