Technical Report: NEMO DNN Quantization for Deployment Model

by   Francesco Conti, et al.
University of Bologna
ETH Zurich

This technical report aims at defining a formal framework for Deep Neural Network (DNN) layer-wise quantization, focusing in particular on the problems related to the final deployment. It also serves as documentation for the NEMO (NEural Minimization for pytOrch) framework. It describes the four DNN representations used in NEMO (FullPrecision, FakeQuantized, QuantizedDeployable and IntegerDeployable), focusing in particular on a formal definition of the latter two. An important feature of this model, and in particular of the IntegerDeployable representation, is that it enables DNN inference using only integers, without resorting to real-valued numbers in any part of the computation and without relying on an explicit fixed-point numerical representation.




1 FullPrecision representation

The FullPrecision representation is simply the “normal” one for real-valued neural networks. We build a layer of a Deep Neural Network (DNN) out of a composition of operators in the Linear, Batch-Normalization and Activation classes.

Linear operators include convolutions and fully-connected layers (i.e., tensorwise matrix multiplication). Batch-Normalization operators are also linear or affine transformations, but we treat them separately. Non-linear Activation layers include the ReLU activation and its variants in normal DNNs.

We formally define a layer as any linear sequence of operators that takes as input the output of another layer and concludes with the first Activation layer in the sequence. Note that in our model, we disallow branches starting from a layer that is not an Activation layer.

In NEMO: For all intents and purposes, a FullPrecision representation in NEMO is simply a valid PyTorch DNN model respecting these restrictions.

1.1 Linear operators

A Linear operator has the form

$$\varphi = \langle w, x \rangle + b \tag{1}$$

where $w$, $x$ are two tensors of weights and input activations, and $\langle \cdot, \cdot \rangle$ indicates an elementwise product followed by a reduction along some of the tensor dimensions (essentially a scalar product). Often, it is possible to neglect the bias term $b$, as this can be incorporated in one of the following operators. In that case,

$$\varphi = \langle w, x \rangle \tag{2}$$

$w$ is always at least 2-dimensional, indicating a mapping between $N_i$ input channels and $N_o$ output channels; $x$ is always at least 1-dimensional, with $N_i$ input channels (this is the case of fully-connected operators, where “channels” are constituted by a single element, and are sometimes called neurons). As a consequence, $\varphi$ has at least 1 dimension, with $N_o$ output channels.

1.2 Batch-Normalization operators

Linear operators may be followed by Batch-Normalization (BN) operators. BN acts as a further affine transformation applied on $\varphi$ using parameters extracted statistically during training ($\mu$, $\sigma$) or trained with backpropagation ($\gamma$, $\beta$):

$$\varphi' = \frac{\gamma}{\sigma} \cdot (\varphi - \mu) + \beta \tag{3}$$

All BN parameters have only one dimension, of size $N_o$.

1.3 Activation operators

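Non-linear activations operate pointwise; as such, their output is dimensionally identical to $\varphi$. They have the form:

$$y = \mathrm{act}(\varphi) \tag{4}$$

The most common activation is ReLU:

$$y = \mathrm{ReLU}(\varphi) = \max(0, \varphi) \tag{5}$$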
2 FakeQuantized representation

In this Section, we discuss the FakeQuantized representation of NEMO, which is used to represent a DNN in a form that takes quantization into account, but is still entirely manageable both in terms of topological transformations and training. We start by formally defining what we mean by “quantization” in this document, along with a set of related definitions.

2.1 Formal definition of DNN tensor quantization


Definition 2.1.

We call quantized a tensor $\hat{t}$ where all elements can be written as

$$\hat{t} = \alpha_t + \varepsilon_t \cdot q, \quad q \in Z_t \tag{6}$$

where $\varepsilon_t$ is a scalar number in $\mathbb{R}$, which we call quantum (in this document we refer to layer-wise quantization; for channel-wise quantization, $\varepsilon_t$ is a vector with one entry per channel), $\alpha_t$ is a scalar in $\mathbb{R}$ called offset, and $Z_t$ is a finite subset of $\mathbb{Z}$, which we call quantized space.

Therefore, the problem of quantization of a DNN is that of defining a mapping of all the fundamental tensors of a DNN layer ($x$, $w$, $\varphi$, $y$) to quantized tensors. Considering that the “natural” representation of these tensors is real-valued ($t \in \mathbb{R}^N$) (and practically implemented using 32-bit floating point numbers), a reasonable approach is to define a function q to map $\mathbb{R} \to \mathbb{Z}$ and combine it with Eq. 6. This leads to the following definition:

Definition 2.2.

We call quantized version of $t$ a tensor $\hat{t}$ such that

$$\hat{t} = \alpha_t + \varepsilon_t \cdot Q_t(t) \tag{7}$$

where $Q_t : \mathbb{R}^N \to Z_t^N$ ($N$ being the dimensionality of $t$) is a mapping from real to integer numbers that is pointwise, monotonic and piecewise constant, called the quantization function. We call $Q_t(t)$ the integer image of $\hat{t}$.
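As an illustration of Definition 2.2, the following minimal sketch builds one possible quantization function (a uniform floor quantizer with clipping) and returns both the integer image and the quantized version of a tensor, represented here as a plain Python list. The name `quantize` and the parameters `q_min`, `q_max` are hypothetical and not part of NEMO; the uniform floor rule is only one choice that is pointwise, monotonic and piecewise constant.

```python
import math

def quantize(t, eps, alpha=0.0, q_min=0, q_max=255):
    """Return (integer image, quantized version) of a list of real values.

    Hypothetical helper: a uniform floor quantizer clipped to [q_min, q_max].
    """
    # Integer image: pointwise, monotonic, piecewise-constant mapping to integers
    image = [min(max(math.floor((v - alpha) / eps), q_min), q_max) for v in t]
    # Quantized version: offset plus quantum times integer image
    quantized = [alpha + eps * q for q in image]
    return image, quantized

image, t_hat = quantize([-0.3, 0.1, 0.26, 0.9, 5.0], eps=0.25, q_max=7)
print(image)   # integer image, values in {0, ..., 7}
print(t_hat)   # real-valued tensor restricted to multiples of eps
```

With a 3-bit space ($q \in \{0, \dots, 7\}$), every output element is a multiple of the quantum, and values outside the representable range saturate at the extremes.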

2.2 Quantization-aware training

The final objective of quantizing a DNN is using $\hat{t}$ in place of $t$ without dropping accuracy. This is targeted primarily by tuning the functions $Q_t$ used for the various tensors in a layer, and is currently the objective of extensive research. The smaller the cardinality $|Z_t|$ of the quantized space $Z_t$, the smaller the number of bits necessary to represent it in a hardware or software implementation.

In NEMO: A FakeQuantized representation is one that imposes that the weights of Linear operators and the output of Activation operators are real valued, but chosen from a restricted set of quantized values during forward-propagation. Note that this restriction is not usually applied to other layers. This version of the network net can be obtained by running

  net = nemo.transform.quantize_pact(net, dummy_input=dummy_input)

where dummy_input is a torch.Tensor sized like the network input. Currently, nemo supports a PACT-like [1] linear quantization scheme for both weights and activations.

In the example case of a ReLU Activation using PACT [1], this means that the activation is replaced with

$$y = \left\lfloor \frac{\mathrm{clip}_{[0, \beta_y]}(\varphi)}{\varepsilon_y} \right\rfloor \cdot \varepsilon_y$$

Two changes are introduced to the ReLU. First, the clipping function is not only clipping at 0, but also at a maximum value $\beta_y$, which can be set to the maximum value of $y$ in the FullPrecision stage (see later). Second, the Activation explicitly uses the quantum $\varepsilon_y$ inside. To represent the tensor with $n$ bits, $\varepsilon_y = \beta_y / (2^n - 1)$. Due to the clipping nature of ReLUs, we set $\alpha_y = 0$ for all activations.
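The quantized ReLU described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `pact_relu` and its arguments are hypothetical names, the tensor is a list of floats, and the quantum is derived from the clipping value and the bit-width as the clipping value divided by (2^bits - 1).

```python
import math

def pact_relu(phi, beta, bits):
    """Quantized ReLU sketch: clip to [0, beta], then snap down to a multiple of eps.

    Hypothetical helper, not the NEMO implementation.
    """
    n_max = 2**bits - 1          # highest integer level
    eps = beta / n_max           # quantum of the output
    # Integer level of each element, saturated to [0, n_max]
    image = [min(max(math.floor(v / eps), 0), n_max) for v in phi]
    # Real-valued output restricted to multiples of eps
    return [q * eps for q in image], image, eps

y, image, eps = pact_relu([-1.0, 0.4, 1.7, 2.99, 10.0], beta=3.0, bits=2)
print(y, image, eps)
```

Negative inputs are clipped to zero as in a regular ReLU, inputs above the clipping value saturate at the top level, and everything in between is rounded down to the nearest multiple of the quantum.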

In NEMO: For historical reasons, in PACT_Act activations the parameter we call $\beta_y$ in this document is saved in the alpha parameter. This may change in future versions!

Linear weights are stored in full-precision, but a similar clipping function is used at runtime in forward-propagation (when using linear PACT-like quantization):

$$\hat{w} = \left\lfloor \frac{\mathrm{clip}_{[\alpha_w, \beta_w]}(w)}{\varepsilon_w} \right\rfloor \cdot \varepsilon_w$$

$\hat{w}$ is used in place of $w$ when performing forward-propagation.

In NEMO: For historical reasons, in PACT_Conv2d and other Linear layers the parameter we call $\alpha_w$ in this document is saved in the alpha parameter with inverted sign (so it’s typically positive, because weights are usually zero-crossing). This may change in future versions!

To enable training of the network, quantization-aware training strategies replace tensors with their quantized version only during the forward-propagation step, but they use and update real tensors in backward-propagation. Most methods estimate gradients through non-linear quantization functions using the straight-through estimator (STE), i.e., they simply work on full-precision tensors ignoring all quantization functions [1]. The fundamentals behind the fact that STE works are only recently being understood (see Spallanzani et al. [2]).

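In NEMO: We use PACT-like quantization for both activations and weights, which employs the STE. Therefore, if $\mathcal{L}$ is the loss, for activations:

$$\frac{\partial \mathcal{L}}{\partial \varphi} = \begin{cases} \dfrac{\partial \mathcal{L}}{\partial y} & \text{if } 0 \leq \varphi < \beta_y \\ 0 & \text{otherwise} \end{cases}$$

and for weights:

$$\frac{\partial \mathcal{L}}{\partial w} = \begin{cases} \dfrac{\partial \mathcal{L}}{\partial \hat{w}} & \text{if } \alpha_w \leq w < \beta_w \\ 0 & \text{otherwise} \end{cases}$$

Both the forward- and backward-prop functions are defined in the same nemo.quant.pact.PACT_QuantFunc and nemo.quant.pact.PACT_QuantFunc_Asymm torch.autograd.Functions for activations and weights, respectively.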

3 QuantizedDeployable and IntegerDeployable representations

While the FakeQuantized representation is useful for training and quantization-aware fine-tuning, it cannot directly be used for deployment on an integer-only Quantized Neural Network (QNN), because quantization is defined rigorously only for weights and activations, but not for all the intermediate representations.

The QuantizedDeployable representation “completes” the task started by the FakeQuantized transformation: all operators in the network operate on quantized inputs and produce quantized outputs. Since all quantized tensors have an integer image as defined in Definition 2.2, it is possible to completely get rid of their real-valued nature and use only integer images along the network. This step yields an IntegerDeployable representation. In this Section, we describe the QuantizedDeployable and IntegerDeployable representations simultaneously, as each is the image of the other through Definition 2.2.

In NEMO: Transforming a model net into QuantizedDeployable representation requires three distinct operations. First, quantizing BatchNormalization layers (see Section 3.4):

  net = nemo.transform.bn_quantizer(net)

Second, freezing Linear weights in their quantized state (i.e., setting $w \leftarrow \hat{w}$):

  net.harden_weights()

Third, propagating quanta along the network, as explained in detail for each operator in all parts of this Section:

  net.set_deployment(eps_in=1.0/255)

To switch to IntegerDeployable, several operators have to be changed and all parameters have to be replaced by integer ones:

  net = nemo.transform.integerize_pact(net, eps_in=1.0/255)

Note that in all representations, NEMO utilizes float32 to represent data. This means that NEMO networks in IntegerDeployable format can be run on a GPU with no efficient integer support, paying only a small penalty because of the additional operators discussed in this section.

3.1 Quantization/Activation operators

From the simple consideration that the input of a DNN layer typically comes from the output of another layer, it follows that a favourable location to place the quantization function for activation tensors is within the activation operator, which produces the input to the next block. There is another fundamental consideration that singles out this operator as the right one for embedding the quantization function: q is by construction non-linear and clipped, both characteristics shared with ReLU (which is clipped only on the lower side) and other activation operators (most of which are clipped on both sides).

The Quantization/Activation operator, in this case, provides the double functionality of i) being the non-linear activation essential for the DNN to work; ii) squashing the input tensor $\varphi$ (which might be real or quantized within its own quantized space $Z_\varphi$) into a (generally smaller) quantization space $Z_y$. Therefore, whereas the quantization function as defined in Definition 2.2 is parametrized to the same quantization space to which it is applied (e.g., $Q_\varphi(\varphi)$), the quantization/activation function is parametrized to the target quantization space (e.g., $Q_y(\varphi)$).

General case of quantization functions.

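To understand in depth how a quantization function works, we start from the explicit mapping of a real-valued tensor to an arbitrarily defined integer image. By Definition 2.2, this function is a ladder mapping the input tensor $\varphi$ to the integer image of the target tensor $\hat{y}$:

$$Q_y(\varphi) = \begin{cases} q_{min} & \text{if } \varphi < \tau_{q_{min}+1} \\ q & \text{if } \tau_q \leq \varphi < \tau_{q+1}, \quad q_{min} < q < q_{max} \\ q_{max} & \text{if } \varphi \geq \tau_{q_{max}} \end{cases} \tag{8}$$

where $\tau_q$ and $\tau_{q+1}$ are a set of thresholds identifying the interval of $\varphi$ mapped to each value $q$; $q_{min}$, $q_{max}$ are the lower and upper value of $Z_y$, respectively. Here we focus on quantization functions that are continuously defined: each value $q$ represents a contiguous interval of $\varphi$, and the thresholds are placed along a continuous function mapping $\varphi$ to $y$.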
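A ladder function of this kind amounts to counting how many thresholds the input has reached. The sketch below illustrates this with the standard-library `bisect` module; the function name `ladder` and the convention that `thresholds[i]` is the lowest input mapped to level `q_min + i + 1` are assumptions made for the example only.

```python
from bisect import bisect_right

def ladder(value, thresholds, q_min=0):
    """Map a real value to an integer level using ascending thresholds.

    Hypothetical convention: thresholds[i] is the lowest input that is
    mapped to level q_min + i + 1.
    """
    # bisect_right counts how many thresholds are <= value
    return q_min + bisect_right(thresholds, value)

thresholds = [0.0, 0.5, 2.0]   # 3 thresholds define 4 levels
levels = [ladder(v, thresholds) for v in (-1.0, 0.0, 0.7, 5.0)]
print(levels)
```

The thresholds do not need to be evenly spaced: any ascending list yields a pointwise, monotonic and piecewise-constant mapping.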

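The quantization function does not need to be applied to a real-valued tensor, but can be applied directly on its integer image. If $\tau_q$ is the lower threshold of the interval mapped to the value $q$ and the input is the quantized tensor $\hat\varphi = \alpha_\varphi + \varepsilon_\varphi \cdot Q_\varphi(\varphi)$, each comparison with a threshold is equivalent to a comparison between integers:

$$\hat\varphi \geq \tau_q \iff Q_\varphi(\varphi) \geq \left\lceil \frac{\tau_q - \alpha_\varphi}{\varepsilon_\varphi} \right\rceil \tag{9}$$

By changing indices, and defining an offset $\alpha'_y = \alpha_y + \varepsilon_y \cdot q_{min}$ and a quantized space $Z'_y = \{0, \dots, q_{max} - q_{min}\}$ (with $q_{min}$, $q_{max}$ the lower and upper value of $Z_y$), it is possible to have the staircase always starting from index 0, which gives a “canonical” form of quantization function:

$$Q'_y(\hat\varphi) = \sum_{i=1}^{|Z_y|-1} H\left( Q_\varphi(\varphi) - \left\lceil \frac{\tau_{q_{min}+i} - \alpha_\varphi}{\varepsilon_\varphi} \right\rceil \right)$$

where $H$ is the Heaviside step function, with $H(z) = 1$ for $z \geq 0$ and $H(z) = 0$ otherwise.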

Linear quantization.

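Linear quantization uses an affine transformation to derive $Q_y$ from $\varphi$; this translates the abstract formulation of Eq. 8 to a clip function, which is what was shown without full explanation in previous Sections:

$$Q_y(\varphi) = \left\lfloor \frac{\mathrm{clip}_{[\alpha_y, \beta_y]}(\varphi)}{\varepsilon_y} \right\rfloor \tag{10}$$

with $\varepsilon_y = (\beta_y - \alpha_y) / (2^n - 1)$ for an $n$-bit representation, where $\alpha_y$ and $\beta_y$ are the lower and upper clipping bounds.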

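How to perform this operation when starting from an integer image? One possibility is to directly apply Eq. 9, which translates to a comparison with a set of explicitly defined thresholds. This approach might be expensive to perform in an actual deployment, but it requires no approximation. See also Section 3.4 for a practical case where we follow this route.

The alternative relies on a technique that we call requantization: this requires an approximation and is the object of the following section. Here we anticipate the final result in this case:

$$Q_y(\hat\varphi) = \mathrm{clip}_{[0,\, \beta_y / \varepsilon_y]}\left( \left\lfloor \frac{\lfloor \varepsilon_\varphi D / \varepsilon_y \rfloor}{D} \cdot Q_\varphi(\hat\varphi) \right\rfloor \right) \tag{11}$$

where $Q_\varphi(\hat\varphi)$ is the integer image of the input, $\varepsilon_\varphi$ and $\varepsilon_y$ are the input and output quanta, $\beta_y$ is the upper clipping bound, and $D$ is an appropriately chosen integer (see Section 3.2).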

In NEMO: When switching to QuantizedDeployable representation, nemo.quant.pact.PACT_Act activations use the “regular” definition of Eq. 10.

In NEMO: In IntegerDeployable representation, nemo.quant.pact.PACT_Acts are transformed into nemo.quant.pact.PACT_IntegerAct activations, which apply the requantization method presented in Eq. 11.

3.2 Requantization

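The requantization function is essential in any case where we have to transform a tensor from one quantized space, with quantum $\varepsilon_a$, to a different one, with quantum $\varepsilon_b$. Ideally, this would happen by simply scaling the integer image $Q_a$ by the ratio of the quanta:

$$Q_b = \frac{\varepsilon_a}{\varepsilon_b} \cdot Q_a$$

In general $\varepsilon_a / \varepsilon_b$ is not an integer, and so this function cannot be used to define an integer image $Q_b$. To solve this issue with an approximation, let us introduce an arbitrary natural number $D$. Then, we can express the ratio as a limit:

$$\frac{\varepsilon_a}{\varepsilon_b} = \lim_{D \to \infty} \frac{\lfloor \varepsilon_a D / \varepsilon_b \rfloor}{D}$$

While $D$ cannot be infinite in practice, this suggests we can make it arbitrarily big to reduce the error in the ratio as much as possible. What is the error in that case? By definition of the floor function,

$$\frac{\varepsilon_a D}{\varepsilon_b} - 1 < \left\lfloor \frac{\varepsilon_a D}{\varepsilon_b} \right\rfloor \leq \frac{\varepsilon_a D}{\varepsilon_b}$$

therefore the error is bound by $1/D$. To limit the relative error to less than a fraction $\eta$, then,

$$\frac{1/D}{\varepsilon_a / \varepsilon_b} \leq \eta \implies D \geq \frac{1}{\eta} \cdot \frac{\varepsilon_b}{\varepsilon_a}$$

Let us use this concept for a formal definition of the requantization function: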

Definition 3.1.

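Let us consider two quantized spaces $Z_a$, $Z_b$, their related quanta $\varepsilon_a$, $\varepsilon_b$, and an integer image $Q_a(t)$ in the first quantized space. We define the requantization function from $Z_a$ to $Z_b$ as

$$RQ_{a \to b}(Q_a(t)) = \left\lfloor \frac{\lfloor \varepsilon_a D / \varepsilon_b \rfloor}{D} \cdot Q_a(t) \right\rfloor \tag{12}$$

where $D$ is a parameter chosen arbitrarily.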

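Under this definition, we can approximate the integer image of tensor $t$ in the quantized space $Z_b$ as

$$Q_b(t) \approx RQ_{a \to b}(Q_a(t))$$

We typically choose $D = 2^d$ as a power of 2. In this way, the division reduces to a right shift:

$$RQ_{a \to b}(Q_a(t)) = \left( \left\lfloor \frac{\varepsilon_a 2^d}{\varepsilon_b} \right\rfloor \cdot Q_a(t) \right) \gg d \tag{13}$$

and the parameter $d$ can be bound to a relative error $\eta$ with

$$d = \left\lceil \log_2 \left( \frac{1}{\eta} \cdot \frac{\varepsilon_b}{\varepsilon_a} \right) \right\rceil \tag{14}$$

The requantization approximation can be used to derive the linear quantization transformation presented without proof in the previous section:

$$Q_y(\hat\varphi) = \mathrm{clip}_{[0,\, \beta_y / \varepsilon_y]}\left( RQ_{\varphi \to y}(Q_\varphi(\hat\varphi)) \right)$$

which is an alternative form of Eq. 11.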

In NEMO: nemo.quant.pact.PACT_IntegerAct activations used in the IntegerDeployable representation compute $D$ internally. They use Eq. 14 given an attribute called requantization_factor, that is $1/\eta$ and defaults to 16.
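The requantization idea can be tried out with integer arithmetic in plain Python. The sketch below is an illustration only: `shift_for` and `requantize` are hypothetical names, `factor` plays the role of the inverse of the relative error bound, and the shift amount is taken as the smallest integer whose power of two is at least `factor` times the ratio of output to input quantum.

```python
import math

def shift_for(eps_in, eps_out, factor=16):
    """Smallest shift d such that 2**d >= factor * eps_out / eps_in (hypothetical helper)."""
    return max(0, math.ceil(math.log2(factor * eps_out / eps_in)))

def requantize(q, eps_in, eps_out, d):
    """Approximate q * eps_in / eps_out with one integer multiply and one right shift."""
    multiplier = math.floor(eps_in * 2**d / eps_out)  # can be precomputed offline
    return (multiplier * q) >> d                       # >> floors, also for negative q

eps_in, eps_out = 0.001, 0.05
d = shift_for(eps_in, eps_out)
approx = requantize(1000, eps_in, eps_out, d)
exact = 1000 * eps_in / eps_out
print(d, approx, exact)
```

In this example the integer result falls one unit below the real-valued ratio: the multiplier is rounded down, and the final shift rounds down again. A larger `factor` (and hence a larger shift) reduces the first source of error at the cost of wider intermediate integers.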

3.3 Linear operators

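Let us now assume that $\hat{w}$, $\hat{x}$ are the quantized versions of $w$, $x$. Following Eq. 2, approximating a linear layer by using quantized versions of $w$, $x$ means the following

$$\hat\varphi = \langle \hat{w}, \hat{x} \rangle = \varepsilon_w \varepsilon_x \cdot \langle Q_w(w), Q_x(x) \rangle \tag{15}$$

neglecting the bias term (and taking zero offsets, with any negative values carried by the integer images). $\hat\varphi$ is not explicitly defined as the quantized version of $\varphi$; however, it is still a quantized tensor, where the quantum is $\varepsilon_\varphi = \varepsilon_w \varepsilon_x$, and the integer image is

$$Q_\varphi(\hat\varphi) = \langle Q_w(w), Q_x(x) \rangle \tag{16}$$

As a consequence, the quantization space of $\hat\varphi$ is given by

$$Z_\varphi = \left\{ \langle q_w, q_x \rangle \;:\; q_w \text{ with elements in } Z_w,\; q_x \text{ with elements in } Z_x \right\} \tag{17}$$

In a practical implementation $Z_\varphi$ will have to be represented with a larger number of bits than $Z_w$, $Z_x$.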

Note that nothing directly guarantees that $\hat\varphi$ is a good approximation of $\varphi$. However, if the network has been trained/fine-tuned in FakeQuantized representation, it is not really important to approximate $\varphi$: $\hat\varphi$ was actually used in forward-prop training, not $\varphi$! In practice, for not too strong quantizations, FakeQuantized fine-tuning might not even be necessary. A simple validation will verify that $\hat\varphi$ propagates the correct information through the network.

In NEMO: The behavior of Linear operators such as PACT_Conv2d does not change from FakeQuantized to QuantizedDeployable. The net.harden_weights() call replaces all weights with their quantized version $\hat{w}$. The quantum $\varepsilon_\varphi$ after the Linear operation is computed automatically by NEMO.

In NEMO: In IntegerDeployable representation, the operator also works in the same way, but the nemo.transform.integerize_pact function will replace all weights with their integer image $Q_w(w)$.
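The relation between a quantized Linear operator and its integer image can be checked on a toy fully-connected layer. The sketch below is illustrative only: the weights, activations and quanta are made-up values (chosen as powers of two so that the floating-point comparison is exact), offsets are taken as zero, and `dot` is a hypothetical helper.

```python
def dot(a, b):
    """Elementwise product followed by a reduction (scalar product)."""
    return sum(p * q for p, q in zip(a, b))

eps_w, eps_x = 0.5, 0.25                 # made-up quanta
w_img = [[-3, 1, 2], [0, -1, 4]]         # integer image of weights: 2 outputs, 3 inputs
x_img = [5, 0, 7]                        # integer image of input activations

# Integer-only computation: output integer image and output quantum
phi_img = [dot(row, x_img) for row in w_img]
eps_phi = eps_w * eps_x
phi_hat = [eps_phi * q for q in phi_img]

# Same computation on the real-valued quantized tensors
w_hat = [[eps_w * q for q in row] for row in w_img]
x_hat = [eps_x * q for q in x_img]
phi_ref = [dot(row, x_hat) for row in w_hat]

print(phi_img, phi_hat, phi_ref)
```

The integer accumulator takes values well outside the ranges of the individual operands, which is why the output space needs more bits than the weight and activation spaces.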

3.4 Batch-Normalization operators

Equation 3 involves an affine transformation with parameters ($\gamma$, $\beta$, $\mu$, $\sigma$) that are, in general, in the real domain. Batch-Normalization is often very important for the quantization strategy: it normalizes activations, constraining them “softly” in an interval that maps well to the clipping ($[\alpha_y, \beta_y]$) that is imposed through quantization. In general, three different strategies can be applied: i) fold the network BN operators in the previous linear operator, before performing its quantization; ii) replace the parameters with quantized versions; iii) merge the BN operator with the following activation function, creating appropriate thresholds.

BN Folding.

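Integrating Eq. 3 with Eq. 1,

$$\varphi' = \frac{\gamma}{\sigma} \cdot \left( \langle w, x \rangle + b - \mu \right) + \beta = \left\langle \frac{\gamma}{\sigma} \cdot w, x \right\rangle + \frac{\gamma}{\sigma} \cdot (b - \mu) + \beta$$

Therefore, folding a BN layer into the linear layer that precedes it involves replacing its parameters with the following transform:

$$w \leftarrow \frac{\gamma}{\sigma} \cdot w, \qquad b \leftarrow \frac{\gamma}{\sigma} \cdot (b - \mu) + \beta \tag{18}$$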


Note that even if the original linear layer had no bias term, the folded linear layer in general will have a bias to take into account the affine transformation in the BN layer.
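BN folding can be verified numerically on a small fully-connected layer. The sketch below is a minimal illustration with made-up parameters: `linear`, `batch_norm` and `fold_bn` are hypothetical helpers operating on Python lists, with one BN parameter per output channel, and the BN uses the scale (trained gain divided by the statistical standard deviation) and shift form of Eq. 3.

```python
def linear(w, b, x):
    """Fully-connected layer: one scalar product per output channel, plus bias."""
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(w, b)]

def batch_norm(phi, gamma, beta, mu, sigma):
    """Per-channel affine transformation gamma / sigma * (phi - mu) + beta."""
    return [g / s * (p - m) + be for p, g, be, m, s in zip(phi, gamma, beta, mu, sigma)]

def fold_bn(w, b, gamma, beta, mu, sigma):
    """Fold BN parameters into the weights and bias of the preceding linear layer."""
    scale = [g / s for g, s in zip(gamma, sigma)]
    w_folded = [[k * wij for wij in row] for k, row in zip(scale, w)]
    b_folded = [k * (bi - m) + be for k, bi, m, be in zip(scale, b, mu, beta)]
    return w_folded, b_folded

w = [[1.0, 2.0], [3.0, -1.0]]
b = [0.0, 0.0]                       # the original layer has no bias
gamma, beta = [2.0, 0.5], [0.5, 1.0]
mu, sigma = [1.0, -2.0], [4.0, 0.25]
x = [1.0, 2.0]

reference = batch_norm(linear(w, b, x), gamma, beta, mu, sigma)
w_f, b_f = fold_bn(w, b, gamma, beta, mu, sigma)
folded = linear(w_f, b_f, x)
print(reference, folded, b_f)
```

The second output channel ends up with a non-zero bias after folding even though the original layer had none.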

In NEMO: BN folding of a model net can be performed at the FakeQuantized stage by calling

  net.fold_bn()
  net.reset_alpha_weights()

with an optional dictionary of specific operators to be folded (the default is to fold all). The second command is necessary to reset the clipping parameters of the weights after folding.

Merging BN with Quantization/Activation.

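An alternative way to remove a BN layer, with respect to folding it into a convolution, is to merge it with the following quantization/activation function, i.e., folding the affine transformation into the thresholds shown in Eq. 8.

In the case of linear quantization (of all kinds), the procedure is particularly interesting and useful, as it can be used to absorb all real parameters without any approximation into a set of integer thresholds:

$$TH_i \in \mathbb{Z}, \quad i = 1, \dots, |Z_y| - 1 \tag{19}$$

These thresholds map directly the integer image of $\hat\varphi$ to that of the output $\hat{y}$, therefore enabling execution of the layer entirely in the integer domain:

$$Q_y(\hat{y}) = \sum_{i=1}^{|Z_y|-1} H\left( Q_\varphi(\hat\varphi) - TH_i \right) \tag{20}$$

where $H$ is the Heaviside step function, with $H(z) = 1$ for $z \geq 0$ and $H(z) = 0$ otherwise. Propagating Eq. 3 means

$$Q_y(\hat{y}) = \sum_{i=1}^{|Z_y|-1} H\left( \frac{\gamma}{\sigma} \cdot (\hat\varphi - \mu) + \beta - i \cdot \varepsilon_y \right)$$

Each element in the sum identified by index $i$ is non-zero if and only if

$$\frac{\gamma}{\sigma} \cdot (\hat\varphi - \mu) + \beta \geq i \cdot \varepsilon_y$$

By construction or simple transformations, we can safely assume that $\gamma / \sigma > 0$. Therefore the condition can be transformed in

$$\hat\varphi \geq \mu + \frac{\sigma}{\gamma} \cdot (i \cdot \varepsilon_y - \beta)$$

By Eq. 16, $\hat\varphi = \varepsilon_\varphi \cdot Q_\varphi(\hat\varphi)$, therefore this is equivalent to

$$Q_\varphi(\hat\varphi) \geq \frac{1}{\varepsilon_\varphi} \cdot \left( \mu + \frac{\sigma}{\gamma} \cdot (i \cdot \varepsilon_y - \beta) \right)$$

Finally, as $Q_\varphi(\hat\varphi)$ is integer, one can define a set of integer thresholds absorbing all real parameters without any further approximation:

$$TH_i = \left\lceil \frac{1}{\varepsilon_\varphi} \cdot \left( \mu + \frac{\sigma}{\gamma} \cdot (i \cdot \varepsilon_y - \beta) \right) \right\rceil$$

corresponding to the complete quantization function:

$$Q_y(\hat{y}) = \sum_{i=1}^{|Z_y|-1} H\left( Q_\varphi(\hat\varphi) - \left\lceil \frac{1}{\varepsilon_\varphi} \cdot \left( \mu + \frac{\sigma}{\gamma} \cdot (i \cdot \varepsilon_y - \beta) \right) \right\rceil \right)$$

The threshold-based approach is naturally especially effective when the number of thresholds is small, i.e. when the cardinality of $Z_y$ is small.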
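The claim that integer thresholds absorb the BN parameters without approximation can be checked exhaustively on a small example. The sketch below is an assumption-laden illustration: it uses `fractions.Fraction` so that all "real" parameters are exact, the function names are hypothetical, the output quantizer is a floor quantizer with levels 0 to `n_levels - 1`, and the BN scale (gain divided by standard deviation) is assumed positive.

```python
from fractions import Fraction as F
import math

def integer_thresholds(gamma, sigma, mu, beta, eps_phi, eps_y, n_levels):
    """Integer thresholds for levels 1..n_levels-1, assuming gamma / sigma > 0."""
    return [math.ceil((mu + sigma / gamma * (i * eps_y - beta)) / eps_phi)
            for i in range(1, n_levels)]

def threshold_activation(q_phi, thresholds):
    """Output integer image: number of thresholds reached by the input integer image."""
    return sum(1 for th in thresholds if q_phi >= th)

def reference(q_phi, gamma, sigma, mu, beta, eps_phi, eps_y, n_levels):
    """Real-valued BN followed by floor quantization and clipping."""
    bn = gamma / sigma * (eps_phi * q_phi - mu) + beta
    return min(max(math.floor(bn / eps_y), 0), n_levels - 1)

# Made-up parameters, exact rationals
params = dict(gamma=F(3, 2), sigma=F(2), mu=F(1, 3), beta=F(1, 5),
              eps_phi=F(1, 64), eps_y=F(1, 4), n_levels=8)
ths = integer_thresholds(**params)
mismatches = [q for q in range(-200, 400)
              if threshold_activation(q, ths) != reference(q, **params)]
print(ths, mismatches)
```

With 8 output levels only 7 integer comparisons are needed per element, and no multiplication or division by a real number takes place at inference time.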

In NEMO: While NEMO includes a threshold-based nemo.quant.pact.PACT_ThresholdAct activation layer, its operation is experimental and unsupported in the current version.

Integer BN.

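When the target cardinality of the output of a block ($|Z_y|$) is not particularly small, thresholds are not an efficient way to implement the BN and the quantization/activation; it is more effective to explicitly perform BN and then quantization/activation by means of Eq. 11. Executing the BN layer in the integer domain requires replacing the parameters of the BN with quantized versions (see Rusci et al. [3, 4]), which means deriving a $\hat\varphi'$ approximating $\varphi'$. Here we consider the “correct” input of the BN, which is the quantized tensor $\hat\varphi$. Let $\kappa = \gamma / \sigma$, $\lambda = \beta - \mu \gamma / \sigma$; then

$$\hat\varphi' = \hat\kappa \cdot \hat\varphi + \hat\lambda \tag{21}$$

where $\hat\kappa$ and $\hat\lambda$ are the quantized versions of the respective parameters. In general, $\hat\lambda$ is represented in its own precision chosen independently, and then requantized to $\varepsilon_\kappa \varepsilon_\varphi$ before using it. Then, Eq. 21 becomes

$$\hat\varphi' \approx \varepsilon_\kappa \varepsilon_\varphi \cdot \left( Q_\kappa(\kappa) \cdot Q_\varphi(\hat\varphi) + RQ_{\varepsilon_\lambda \to \varepsilon_\kappa \varepsilon_\varphi}(Q_\lambda(\lambda)) \right)$$

where $RQ_{\varepsilon_\lambda \to \varepsilon_\kappa \varepsilon_\varphi}$ denotes requantization from quantum $\varepsilon_\lambda$ to quantum $\varepsilon_\kappa \varepsilon_\varphi$ (see Section 3.2). Thus, in the domain of integer images,

$$Q_{\varphi'}(\hat\varphi') = Q_\kappa(\kappa) \cdot Q_\varphi(\hat\varphi) + RQ_{\varepsilon_\lambda \to \varepsilon_\kappa \varepsilon_\varphi}(Q_\lambda(\lambda)) \tag{22}$$

Similarly to Eq. 16, this allows the BN layer to be operated fully in the domain of integer images; the quantized space is

$$Z_{\varphi'} = \left\{ q_\kappa \cdot q_\varphi + q_\lambda \;:\; q_\kappa \in Z_\kappa,\; q_\varphi \in Z_\varphi,\; q_\lambda \in RQ_{\varepsilon_\lambda \to \varepsilon_\kappa \varepsilon_\varphi}(Z_\lambda) \right\}, \qquad \varepsilon_{\varphi'} = \varepsilon_\kappa \varepsilon_\varphi \tag{23}$$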


In NEMO: In QuantizedDeployable representation, torch.nn.BatchNorm2d is replaced with nemo.quant.pact.PACT_QuantizedBatchNorm2d. To quantize $\kappa$ and $\lambda$, we use a symmetric ($\alpha = -\beta$) $n$-bit quantizer: we compute the clipping bound as the maximum absolute value of the parameter statically and set the quantum to $2\beta / (2^n - 1)$. Requantization is not accurately represented at this representation level.

In NEMO: In IntegerDeployable representation, nemo.quant.pact.PACT_QuantizedBatchNorm2d is replaced with nemo.quant.pact.PACT_IntegerBatchNorm2d. $\hat\lambda$ is requantized to $\varepsilon_\kappa \varepsilon_\varphi$ before being used:

$$Q_\lambda(\lambda) \leftarrow RQ_{\varepsilon_\lambda \to \varepsilon_\kappa \varepsilon_\varphi}(Q_\lambda(\lambda))$$

In this way, the choice whether to store $\hat\lambda$ in a lower-precision format or directly in the target format (which typically requires 32 bits) is left to the deployment backend. () is currently wired.

3.5 Add operators

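When several paths in a DNN directed graph converge to the same node, they are typically combined through an Add operator. An obvious requirement for these situations is that the numerical representations of each branch should be equalized to that of the others to be summable. Each tensor $\hat{x}_i$ coming from a branch lives in its own space $Z_i$, with its own quantum $\varepsilon_i$: therefore,

$$\hat{s} = \sum_i \hat{x}_i = \sum_i \varepsilon_i \cdot Q_i(\hat{x}_i)$$

The solution passes through a requantization step similar to what is shown in the Quantization/Activation and BatchNormalization operators: one of the input branches (e.g., $\hat{x}_0$) is chosen as reference ($\varepsilon_s = \varepsilon_0$) and as a consequence,

$$Q_s(\hat{s}) = Q_0(\hat{x}_0) + \sum_{i \geq 1} RQ_{\varepsilon_i \to \varepsilon_0}(Q_i(\hat{x}_i)) \tag{24}$$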


In NEMO: To correctly represent Adds in the IntegerDeployable representation, the network must instantiate the nemo.quant.pact.PACT_IntegerAdd modules. Currently, instantiating this module is one of the few manual modifications required to a network’s definition. This is because the normal way of doing this in PyTorch (just using a +) does not instantiate a torch.nn.Module that can be augmented by NEMO.

In all modes except for IntegerDeployable, nemo.quant.pact.PACT_IntegerAdd behaves like a regular addition. In IntegerDeployable, it performs requantization as shown in Eq. 24. The $D$ parameter is set through a requantization_factor that defaults to 256, working in the same way as the one described in Section 3.2 (i.e., it defaults to a relative requantization error $\eta = 1/256$).
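The equalization of branches before an integer addition can be sketched as follows. This is not the NEMO module, only an illustration under stated assumptions: `integer_add` is a hypothetical function, the first branch is taken as reference, every other branch is rescaled with an integer multiply and a right shift, and `factor` stands for the inverse of the tolerated relative error.

```python
import math

def integer_add(images, quanta, factor=256):
    """Add integer images living in different quantized spaces (hypothetical sketch).

    images[0] / quanta[0] is the reference branch; the result uses its quantum.
    """
    eps_ref = quanta[0]
    total = list(images[0])
    for image, eps in zip(images[1:], quanta[1:]):
        # Shift chosen so that 2**d >= factor * eps_ref / eps
        d = max(0, math.ceil(math.log2(factor * eps_ref / eps)))
        multiplier = math.floor(eps * 2**d / eps_ref)
        # Requantize the branch to the reference quantum, then accumulate
        total = [t + ((multiplier * q) >> d) for t, q in zip(total, image)]
    return total, eps_ref

# Two branches with made-up quanta 0.5 and 0.125
total, eps_out = integer_add([[4, 10], [8, -16]], [0.5, 0.125])
print(total, eps_out)
```

In this example the quanta are related by an exact power of two, so the requantization introduces no error: the real-valued sums 4·0.5 + 8·0.125 and 10·0.5 − 16·0.125 both equal 3, which is 6 units of the reference quantum.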

Note that if the paths diverge from an operator that is not the final Quantization/Activation of a canonical layer, some of the operations explained in this document (e.g. BN folding) might be more complex and require additional work. See for example Palossi et al. [5] for further details on this issue.

In NEMO: The nemo.transform.fold_bn has experimental support for inverse folding as explained in Palossi et al. [5]. However, the strategy of branching from a non-Quantization/Activation operator is suboptimal and not recommended for networks that are meant to be quantized.

3.6 Pooling operators

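Max-Pooling is not touched by quantization, because all quantization mechanisms preserve relative ordering. Therefore,

$$Q_y(\hat{y}) = \mathrm{MaxPool}(Q_x(\hat{x})), \qquad \varepsilon_y = \varepsilon_x$$

Average-Pooling, on the other hand, involves an implicit division by a factor $K$ (the product of the pooling filter sizes), which could break the assumptions on integer images. For this reason, a requantization-like operation is necessary. To do that, we transform the division in a product by $1/K$, then we approximate it:

$$Q_y(\hat{y}) = \left\lfloor \frac{\lfloor D / K \rfloor}{D} \cdot \sum_{k=1}^{K} Q_x(\hat{x}_k) \right\rfloor, \qquad \varepsilon_y = \varepsilon_x \tag{25}$$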



In NEMO: In IntegerDeployable representation, the torch.nn.AvgPool2d operators are transformed into nemo.quant.pact.PACT_IntegerAvgPool2d. These operators perform pooling as defined in Eq. 25.
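The division-free average pooling can be illustrated on a single pooling window. The sketch below is not the NEMO operator: `integer_avg_pool` is a hypothetical function, the window is a flat list of integers, and the shift amount of 16 bits is an arbitrary choice for the example.

```python
def integer_avg_pool(window, d=16):
    """Average of an integer window using a precomputed multiplier and a right shift.

    Hypothetical sketch: d = 16 is an arbitrary choice, not a NEMO default.
    """
    k = len(window)                 # product of the pooling filter sizes
    multiplier = (1 << d) // k      # integer approximation of 2**d / k
    return (multiplier * sum(window)) >> d

print(integer_avg_pool([3, 5, 6, 10]))        # window of 4 elements
print(integer_avg_pool(list(range(1, 10))))   # window of 9 elements
```

When the window size is a power of two the multiplier is exact and the result equals the true average (6 for the first window). For the 3x3 window the multiplier is rounded down, so the result can fall one unit below the true average (4 instead of 5 here); a larger shift reduces how often this happens.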

3.7 Input representation

The rules defined in our model enable propagating quanta in the network graph from each node representing an operation to its successors. However, they leave out one question: what is the representation of the input of the network? Often, input is naturally quantized (e.g., coming from an image with 8-bit channels, from analog-to-digital conversion, etc.); when the input has no obvious quantized representation, it has to be converted into an appropriate quantized version.

If the input has a representation similar to that of other activations in the network, i.e., with $\alpha_x = 0$, then the model as described before directly applies to it, too. However, it is possible that the “natural” representation of input has $\alpha_x \neq 0$. In these cases, one possible approach is to add a bias to the first Linear node so that the input representation can be translated to the canonical one.

In NEMO: It is possible to perform this operation on a network net using the net.add_input_bias() method.

3.8 Other operators

There are many “exotic” operators that are not considered in this text (and not supported in NEMO). For most of them, what is described here can be directly applied with minimal changes. However, a particular mention is necessary for point-wise nonlinearities: most of these are used as alternative activation functions instead of ReLU. Some of them can be integrated in the quantization/activation functions, often as thresholds. Others, especially ones very sensitive in terms of dynamic range (e.g. exponentials) require switching back to real-valued (float) tensors to be applied.


NEMO is an outcome of the European Commission Horizon 2020 ALOHA Project, funded under the EU’s Horizon 2020 Research and Innovation Programme, grant agreement no. 780788. The author also wants to thank Manuele Rusci, Alessandro Capotondi and Matteo Spallanzani for the many discussions that resulted in this technical report. Thanks also to Marcello Zanghieri for proof-reading the first draft of this text.