In recent years, deep learning models have been shown to provide superior predictive capabilities in many domains. However, because these models are typically developed in full floating-point precision, they are both compute- and memory-intensive. Quantization, which generates models in lower precision, e.g., int8, has gained importance as a way to alleviate these computational requirements. In particular, quantization is a critical step in generating hardware-platform-optimized models for today's deep learning accelerators.
Typically, models are developed as full-precision (fp32) models and quantization is part of the compilation flow. Pre-quantized models, in contrast, are already quantized by the time they are passed to compilation.
Why is this important for co-design? The ability to separate the quantization process from the hardware-platform-specific compilation stage is important for optimal deep learning model execution. It allows researchers and modeling-toolchain developers to focus on the quantization process independent of the specific hardware platform. Thus, researchers and developers are able to create domain-specific pre-quantized models, rather than rely on the general-purpose quantization approach that a hardware-platform-specific compilation flow provides. However, to utilize hardware-architecture-specific features, the description of a pre-quantized model needs to be expressive. The design of this separated quantization process has the following goals:
Key quantization parameters should be embedded into the ONNX model.
No additional target-specific external metadata is required.
The model should be directly executable with standard ONNX tools, such as ONNXruntime [onnxruntime].
Only standardized ONNX [onnx] operators should be used.
No custom operators are used that would prevent model usage in standard tools.
The output should closely match (within narrow margins) on all inference environments (software or hardware).
The description of the model should be expressive enough to convey hardware-specific operations.
E.g., codify the integer scale and right bit shift used by hardware to perform rescaling.
In the following sections, the paper provides an introduction to symmetric quantization, followed by detailed examples of the ONNX representation of pre-quantized fully connected and convolution layers. This methodology is further applied to Tanh and Sigmoid activation functions.
2 Related Work
As quantization has gained importance, many deep learning frameworks and compilers have implemented some form of quantization. Below we cite related work. It should be noted that, in general, the cited work focuses on quantized networks within its own framework and compile chain; it does not address the need to codify an already quantized, i.e., pre-quantized, model in a standard format. We cite here:
TensorFlow lite [tf_lite]
NVIDIA® TensorRT™ [tensorRT]
Onnxruntime how to quantize [onnxruntime_howto_quantize]
Profile-guided quantization in Glow [rotem2019glow]
Quantized networks with TVM [DBLP:journals/corr/abs-2006-10226]
3 Symmetric Quantization
The most common quantization approach represents floating-point 32-bit (fp32) numbers with integer 8-bit (int8 or uint8) numbers [pete_warden, DBLP:journals/corr/abs-1712-05877, DBLP:journals/corr/WuLWHC15], which reduces the memory footprint by a factor of four. Furthermore, 8-bit integer operations can be executed highly efficiently (low power and high performance) on machine learning accelerator hardware.
For symmetric quantization, where the zero offset is zero, equation 1 expresses the relationship between the quantized tensor x_q and the original tensor x:

x = s_x · x_q    (1)

where s_x is the floating-point scale of the tensor.
There are multiple ways to determine the scale s_x in equation 1. One approach might be to profile the tensor to determine the maximum numerical range and map this range to the full integer range. Another might be to minimize the overall quantization error by creating profile histograms and saturating the numerical range prior to mapping. Precisely this is one of the motivations for this paper, i.e., decoupling quantized model development from the target hardware platform and its compiler. Given the scale and the data type of the quantized tensor (i.e., int8 or uint8), the quantized tensor can be calculated as x_q = x / s_x, with an additional rounding and clipping stage to ensure that the quantized tensor is represented as proper int8 or uint8 values.
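The range-based approach above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's tool flow: the function name is hypothetical and tensors are plain lists.

```python
def symmetric_quantize(x, bits=8, signed=True):
    """Symmetrically quantize a list of floats (illustrative sketch).

    The scale is chosen so that the maximum absolute value of the
    tensor maps to the edge of the integer range -- the range-based
    profiling approach described above.
    """
    qmax = 2 ** (bits - 1) - 1 if signed else 2 ** bits - 1
    qmin = -(2 ** (bits - 1)) if signed else 0
    max_abs = max(abs(v) for v in x)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # x_q = x / s_x, with rounding and clipping to proper int8/uint8
    x_q = [min(max(round(v / scale), qmin), qmax) for v in x]
    return x_q, scale

x_q, s = symmetric_quantize([2.0, -8.0, 5.0])  # scale becomes 8/127
```

A histogram-based approach would differ only in how `scale` is chosen; the rounding and clipping stage stays the same.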
Similar to quantizing individual tensors, each layer of a network needs to be quantized. The fully connected layer, i.e., a matrix-matrix multiply followed by the addition of a bias tensor, serves as an example here. The fully connected layer with input tensor X, weight W, bias B and output tensor Y is given in equation 2:

Y = X · W + B    (2)
The intermediate tensor Y_int32 is of data type int32 and is the result of the MatMulInteger of the quantized weight W_q and the quantized input tensor X_q with the integer addition of the quantized bias B_q, as given in equation 5:

Y_int32 = X_q · W_q + B_q    (5)
The bias is quantized to be of the same scale as the output of the MatMulInteger operation and is given as an int32 value.
The rescale value s_rescale represents the rescaling (aka output quantization) of the fully connected layer and is used in equation 4 to determine the quantized layer output:

Y_q = s_rescale · Y_int32    (4)

Just as for the individual tensor above, a rounding and clipping stage follows equation 4 to ensure that the output tensor is represented as proper int8 or uint8 values. Similarly, convolution layers include such a rescaling stage.
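The integer accumulation and rescaling of the fully connected layer can be sketched as follows. This is an illustrative pure-Python sketch with hypothetical names; in the pre-quantized model these steps are expressed as ONNX operators, as shown in the later sections.

```python
def quantized_fc(x_q, w_q, b_q, rescale, signed_out=True):
    """Quantized fully connected layer (illustrative sketch).

    Computes the int32 accumulator Y_int32 = X_q * W_q + B_q, then
    applies the floating-point rescale with rounding and clipping.
    """
    rows, inner, cols = len(x_q), len(w_q), len(w_q[0])
    # integer matrix-matrix multiply plus integer bias addition
    y_int32 = [[sum(x_q[i][k] * w_q[k][j] for k in range(inner)) + b_q[j]
                for j in range(cols)] for i in range(rows)]
    # rescale, round and clip to the quantized output range
    qmin, qmax = (-128, 127) if signed_out else (0, 255)
    return [[min(max(round(rescale * v), qmin), qmax) for v in row]
            for row in y_int32]
```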
As shown in the previous section for the fully connected layer, a rescaling with rounding and clipping is required after certain network layers. The rescale values are floating-point values, which can be greater or smaller than 1. To perform the rescaling with integer arithmetic, the floating-point multiplication is replaced with a multiplication by an integer value followed by a bitwise right shift. A right shift by n bits is equivalent to a division by 2^n. Utilizing two Mul operators, both the integer value and the number of right-shift bits can be codified within the ONNX network.
Method with 2 Mul operators:
s_rescale = m · 2^(-n)
m is an integer value represented as FLOAT.
2^(-n), representing a right shift by n bits.
Alternatively, only the floating-point scaling value is codified in ONNX, and the conversion to an integer value and a number of right-shift bits is the responsibility of the hardware-specific tool chain.
Method with 1 Mul operator
For example, a rescale of 0.25 can be represented by an m of 1 and an n of 2. A rescale of approximately 2/3 can be represented by an m of 11184810 and an n of 24. It should be noted that, as the integer value m is stored as FLOAT, the largest exactly represented integer value is 2^24.
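The decomposition of a floating-point rescale into an integer multiplier and a right shift can be sketched as below. The function name and the fixed choice of n are illustrative assumptions; a real tool chain would choose n based on hardware constraints and accuracy requirements.

```python
def decompose_rescale(rescale, n=24):
    """Approximate a floating-point rescale as m * 2**(-n), i.e. an
    integer multiply followed by a right shift by n bits (sketch).
    """
    m = round(rescale * (1 << n))
    return m, n

m, n = decompose_rescale(2.0 / 3.0)
approx = m * 2.0 ** -n  # close to 2/3, limited by the n-bit shift
```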
The rounding and clipping that follows rescaling is codified in the pre-quantized ONNX network with the ONNX operator QuantizeLinear with y_scale = 1. Per the QuantizeLinear API, the data type of the y_zero_point argument determines the data type of the output tensor, i.e., an int8 argument results in int8 output, while a uint8 argument results in uint8 output. Here, QuantizeLinear is not used for rescaling, and y_scale is set to 1, as the scaling has already been codified using one or two Mul operators.
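The QuantizeLinear behavior relied on here can be sketched as follows; this is a simplified model of the operator's semantics (divide by y_scale, round half to even, add the zero point, saturate), not an ONNX implementation.

```python
def quantize_linear(x, y_scale=1.0, y_zero_point=0, signed=True):
    """Sketch of QuantizeLinear semantics: divide by y_scale, round
    (Python's round is half-to-even), add the zero point, and
    saturate to the range implied by the zero-point data type."""
    qmin, qmax = (-128, 127) if signed else (0, 255)
    q = round(x / y_scale) + y_zero_point
    return min(max(q, qmin), qmax)
```

With y_scale = 1 this reduces to exactly the rounding and clipping stage needed after the Mul-based rescaling.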
4 Fully Connected Layer
As a first example, we show the fully connected layer of a Multi-Layer Perceptron (MLP) network. The example shows the methodology applied to a complete network with input and output that can be run within the ONNXruntime. This is the base pattern that can be applied to larger MLP models and other networks containing fully connected layers. Fig. 1 shows the ONNX flow for a fully connected layer without an activation function, while Fig. 2
shows the layer with ReLU activation. In these figures, the ONNX graphs are visualized using the Netron [netron] tool on the left, and the individual operator steps are shown on the right. For each operator, the data types of its input and output tensors are given. The ONNX operator MatMulInteger is used to express the matrix-matrix multiply of the layer input of type int8 or uint8 with the weight coefficients given as int8, resulting in an output tensor of type int32. Following the MatMulInteger, the bias, as int32 data type, is added using the ONNX Add operator. The rescaling is expressed here using two ONNX Mul operators with FLOAT inputs; thus, an ONNX Cast operator is added to cast the int32 into FLOAT. The final stage in this pattern is the rounding and clipping performed by the ONNX QuantizeLinear operator.
5 Convolution Layer
As a second example, we show the Convolution 2D layer of a Convolutional Neural Network (CNN). The methodology is applied to a complete network with input and output that can be run within the ONNXruntime. Fig. 3 shows a convolution layer without an activation function. A ReLU activation function is handled similarly to the fully connected layer shown in Fig. 2. The ONNX operator ConvInteger is used to express the convolution with the kernel coefficients given as int8 weights. Following the convolution, the bias, as int32 data type, is added using the ONNX Add operator. The rescaling is expressed using the ONNX Mul operator with FLOAT inputs; thus, an ONNX Cast operator is added to cast the int32 into FLOAT. The final stage in this pattern is the rounding and clipping performed by the ONNX QuantizeLinear operator.
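Numerically, the ConvInteger → Add → Mul → QuantizeLinear pattern behaves as sketched below. This is an illustrative single-channel, stride-1, valid-padding sketch with hypothetical names, not the ONNX operator implementation.

```python
def conv_integer(x_q, w_q, bias_q, rescale, signed_out=True):
    """Quantized 2D convolution (single channel, valid padding,
    stride 1) followed by bias addition, rescaling, rounding and
    clipping -- a sketch of the pattern described above."""
    kh, kw = len(w_q), len(w_q[0])
    qmin, qmax = (-128, 127) if signed_out else (0, 255)
    out = []
    for i in range(len(x_q) - kh + 1):
        row = []
        for j in range(len(x_q[0]) - kw + 1):
            acc = bias_q  # int32 accumulator holds bias plus products
            for di in range(kh):
                for dj in range(kw):
                    acc += x_q[i + di][j + dj] * w_q[di][dj]
            row.append(min(max(round(acc * rescale), qmin), qmax))
        out.append(row)
    return out
```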
6 Tanh and Sigmoid activation functions within a quantized network
The methodology described above is further expanded to Tanh and Sigmoid activation functions. The figures in this section show only the individual operator steps in detail and omit the ONNX graph visualization. Fig. 4 shows the sequence of ONNX operators to codify a quantized network with a tanh activation function. The rescale values are set such that the full input range of tanh is mapped to the quantized range. The output scale is determined by mapping the full output range of tanh, (-1, 1), to the quantized range. It should be noted that the ONNX operator for tanh takes FLOAT input and produces FLOAT output. Setting the rescale to map to the full input range and the output scale in the above-described way results in using a tanh approximation.
An alternative approach is to perform tanh or sigmoid as floating-point functions; in the following examples, they are executed as FLOAT operations. Fig. 5 shows a mixed int8/FLOAT flow which allows rescaling to a narrow input range (symmetric around zero) of tanh and execution of the tanh function in FLOAT precision. Fig. 6 shows the mixed int8/FLOAT flow for the sigmoid activation function. As the sigmoid activation always produces positive outputs, the quantized output will be uint8.
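The mixed integer/floating-point flow of Fig. 5 can be sketched as below. The scale names are assumptions for illustration, not ONNX attribute names; in the ONNX graph these steps appear as Cast, Mul, Tanh, and QuantizeLinear operators.

```python
import math

def quantized_tanh(x_q, in_scale, out_scale):
    """Mixed int/float tanh (illustrative sketch): dequantize the
    int8 input, evaluate tanh in floating point, then requantize
    the result to int8."""
    out = []
    for q in x_q:
        x = q * in_scale          # Cast + Mul: back to floating point
        y = math.tanh(x)          # activation in floating point
        out.append(min(max(round(y / out_scale), -128), 127))
    return out
```

A sigmoid variant would differ only in the activation call and in quantizing the always-positive output to uint8.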
7 Conclusion
This paper presents a methodology to separate the quantization process from the model compilation stage, which enables independent development while allowing hardware/software co-design. Detailed examples are given for both MLP- and CNN-based networks, which can be extended to other networks in a straightforward fashion.