On Periodic Functions as Regularizers for Quantization of Neural Networks

Maxim Naumov et al., November 24, 2018

Deep learning models have been successfully used in computer vision and many other fields. We propose an unorthodox algorithm for performing quantization of the model parameters. In contrast with popular quantization schemes based on thresholds, we use a novel technique based on periodic functions, such as continuous trigonometric sine or cosine as well as non-continuous hat functions. We apply these functions component-wise and add the sum over the model parameters as a regularizer to the model loss during training. The frequency and amplitude hyper-parameters of these functions can be adjusted during training. The regularization pushes the weights into discrete points that can be encoded as integers. We show that using this technique the resulting quantized models exhibit the same accuracy as the original ones on CIFAR-10 and ImageNet datasets.


1 Introduction

Deep learning models require a very large amount of resources during their training (repeated forward and backward propagation) as well as inference (forward propagation). Further, the latter is often performed on edge devices, such as smartphones or embedded systems, which operate within strict size, temperature and power budgets Shimpi (2011); Humrick (2017); Dolbeau (2018); NVIDIA (2018). As a result these devices can perform only a limited number of operations per second [1], as illustrated in Tab. 1.

[1] The peak for ARM Cortex is based on fp32 “VMLA.F32 Qd, Qn Dm”, fp16 “VMLA.F16 Qd, Qn Dm” and int8 “VMLAL.S8 Qd, Dn, Dm” instructions, with estimated reciprocal throughput and width for the A-7 and A-75. Then, peak ops are defined as (frequency/throughput)*width*cores. Also, the power is assumed to be 750mW and 1W per core for ARM Cortex A-7 and A-75, respectively.

CPU/GPU                    GHz    Watts   fp32    fp16    int8
ARM Cortex A-75 (2-core)   1.5    1.5     3       -       12
ARM Cortex A-75 (4-core)   3.0    4       48      96      96
NVIDIA Turing Tesla T4     1.35   70      8100    16200   130000

Table 1: Peak GFlops/Gops according to specifications

In order to decrease the storage and compute requirements of the model during inference, its parameters are often stored as integers with a low number of bits. It is common to use 8-bit integers (1 byte), rather than 16-bit (2 bytes) or 32-bit (4 bytes) floating point numbers. The process of converting model parameters from “continuous” floating point to discrete integer numbers is called quantization.

Let the original optimization problem be

min_w f(w)    (1)

where f(w) is the loss measured during training. There are many different quantization schemes based on symmetric vs. asymmetric intervals, uniform vs. non-uniform discrete partitioning, different rounding modes and choices for handling the outliers, i.e. the few elements that lie outside of the range of most of the other elements.

Let us consider a uniform quantization of the parameter weights w, for example from either the convolution (14) or fully connected (15) layers, and of the activations. Then, the quantized problem is commonly written as

min_w f(q(w))    (2)

where q is the quantization function.

For instance, if we use only two intervals then the process is referred to as binarization and the resulting element can be stored in a single bit Courbariaux et al. (2015, 2016); Hubara et al. (2018); Rastegari et al. (2016). It can be performed using a single threshold point t as shown below

q(w) = +1 if w ≥ t, and q(w) = −1 otherwise    (3)

The ternary networks use three intervals with the resulting elements stored in 2 bits Li et al. (2016); Mellempudi et al. (2017); Choi et al. (2018). Then, quantization can be performed using two threshold points t1 < t2 and resulting in

q(w) = +1 if w > t2, q(w) = 0 if t1 ≤ w ≤ t2, and q(w) = −1 if w < t1    (4)

Finally, let an arbitrary number of bits b correspond to 2^b points and 2^b − 1 intervals. Let us assume that we would like to quantize a floating point number w ∈ [−c, c], with c being the maximum weight in absolute value and interval length 2c. A uniform quantization can be performed symmetrically in the interval [−c, c] using multiplier s = (2^(b−1) − 1)/c, so that

q(w) = round(s w) / s    (5)

with 2^b − 1 effective points because 0 is double counted.

On the other hand, notice that we can shift an arbitrary interval [α, β] to the interval [0, β − α] by adding a scalar bias term z. Therefore, uniform asymmetric quantization can be performed with bias z = −α using multiplier s = (2^b − 1)/(β − α), so that

q(w) = round(s (w + z)) / s − z    (6)

where the round operation rounds a floating point number to an integer value Wen et al. (2016); Jacob et al. (2017); Krishnamoorthi (2018).
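As a concrete illustration of (5) and (6), the short sketch below (our own example, not code from the paper) quantizes a plain tensor with both schemes; the bit width and the tensor are arbitrary:

import torch

def quantize_symmetric(w, bits):
    # symmetric uniform quantization (5): multiplier s = (2^(b-1) - 1) / c
    c = w.abs().max()
    s = (2 ** (bits - 1) - 1) / c
    return torch.round(w * s) / s

def quantize_asymmetric(w, bits):
    # asymmetric uniform quantization (6): shift by the bias z = -min(w)
    lo, hi = w.min(), w.max()
    s = (2 ** bits - 1) / (hi - lo)
    return torch.round((w - lo) * s) / s + lo

w = torch.randn(1000)
for quantize in (quantize_symmetric, quantize_asymmetric):
    wq = quantize(w, bits=8)
    print(quantize.__name__, (w - wq).abs().max().item())  # small quantization error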

The advantage of symmetric quantization is that for sparse parameters, with a lot of zero elements, the sparsity is preserved. Note that computation with zeroes can be skipped in hardware Albericio et al. (2016); Venkatesh et al. (2016); Reagen et al. (2016); Chen et al. (2017); Kim et al. (2017); Parashar et al. (2017). The disadvantage is that for highly asymmetric intervals many discrete representations may be wasted.

The non-uniform quantization assigns discrete points to the interval based on the distribution of floating point values in it Bagherinezhad et al. (2017); Wang et al. (2018). Therefore, it does not have a fixed stride from one point to the next. Its advantage is that the encoded values are more representative of the original ones, but at the same time it can be hard to map back and perform operations with them.

The techniques for handling outliers and determining maximum thresholds, e.g. using adaptive schemes, the loss of information measured by Kullback–Leibler (KL) divergence, or L2 error minimization in Caffe2, have been investigated in Jia et al. (2014); Migasz; Zhou et al. (2017); Park et al. (2018).

However, independent of all of these choices, notice that a common trend among (3)–(6) is that fixed thresholds are used in the quantization function to clamp floating point values to discrete points. We point out that matrix- and neural network-based compression techniques are outside the scope of this paper Gong et al. (2014); Denton et al. (2014); Jaderberg et al. (2014); Mishra & Marr (2018).

In this paper we will focus on a very different approach to uniform quantization using periodic functions, such as the trigonometric sine (or cosine) as well as hat functions. We discuss uniform quantization, but our ideas can be generalized to the non-uniform case using variations of these periodic functions with decaying amplitude and increasing base lengths away from the origin Stenger (1993); Strang & Fix (2008).

2 Periodic Functions as Regularizers

We propose an unorthodox approach for quantizing the weights of a neural network. Instead of using the quantization function q(w), we propose adding a regularization term R(w) to the loss, so that the resulting optimization problem is written as

min_w f(w) + λ R(w)    (7)

where λ is a scalar scaling parameter.

The regularization term is a sum of periodic functions that push the values of the weights (and potentially activations) to a set of discrete points during training. Next we will discuss different choices for these functions.

2.1 Trigonometric (Continuous) Functions

Let us focus only on the weights and use the trigonometric sine, so that

R(w) = a Σ_i sin²(π f w_i / c)    (8)

where c is the maximum weight in absolute value as defined in (5), a is the amplitude and f is the frequency hyper-parameter.

Notice that for frequency f=1 the function R(w) attains its minimum 0.0 when the weight values are distributed at 3 discrete locations, while for frequency f=7 it attains its minimum 0.0 when the weight values are distributed at 15 discrete locations, as shown in Fig. 1(a) and 1(b), respectively.

Figure 1: Plot of sine and its square for amplitude=1; (a) frequency=1, (b) frequency=7.

Figure 2: Plot of cosine and its square for amplitude=1; (a) frequency=1, (b) frequency=8.

An analogous plot can also be created using the trigonometric cosine function and its square, with the difference being a shift in the discrete points so that there are 16 rather than 15 points where the minimum is attained, but perhaps most importantly zero is not preserved, see Fig. 2(a) and 2(b).
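To make the count of minima concrete, the short standalone check below (our own sketch, not code from the paper) evaluates the sine term of (8) at the 2f+1 grid points w = kc/f for f=7 and confirms that it vanishes there:

import math

f, c = 7, 1.0                                   # example frequency and max |w|
grid = [k * c / f for k in range(-f, f + 1)]    # 2*f + 1 = 15 candidate minima
values = [math.sin(math.pi * f * w / c) ** 2 for w in grid]
print(len(grid), max(values))                   # 15 points, all values ~0 up to rounding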

2.2 Hat (Non-Continuous) Functions

Let us focus only on the weights and use a hat function, so that

R(w) = a Σ_i | 2 (((w_i / c − 1/2) f) mod 1) − 1 |    (9)

where c is the maximum weight in absolute value as defined in (5), a is the amplitude and f is the frequency hyper-parameter.

Once again, notice that for frequency f=1 the function R(w) attains its minimum 0.0 when the weight values are distributed at 3 discrete locations, while for frequency f=7 it attains its minimum 0.0 when the weight values are distributed at 15 discrete locations, as shown in Fig. 3(a) and 3(b), respectively. Here we show the variant of the hat function corresponding to the sine, while a shifted variant corresponding to the cosine is also possible.

Figure 3: Plot of hat for amplitude=1; (a) frequency=1, (b) frequency=7.
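The same kind of check for the hat variant in (9) (again a standalone sketch with example values) shows that it vanishes at the same 15 grid points:

f, c = 7, 1.0
grid = [k * c / f for k in range(-f, f + 1)]
values = [abs(2 * (((w / c - 0.5) * f) % 1) - 1) for w in grid]
print(len(grid), max(values))   # 15 points, all values ~0 up to rounding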

Notice that the use of regularization for the purpose of quantization has been suggested in Hung et al. (2015). However, in this earlier work the authors use distance from fixed points (centroids) as a penalty measure to ensure quantization. This contrasts with our periodic trigonometric sine and hat functions, with amplitude and frequency hyper-parameters defined in (8) and (9), respectively.

It is important to highlight a few differences between the sine (or cosine) and hat functions. Notice that the sine function has very nice properties. It is periodic, continuous and differentiable. However, it is not convex, unlike many of the existing regularizers. Also, notice that the maximum value of the regularizer R(w) is known ahead of time. It can be computed by assuming that all weights translate into the value 1.0 after application of the function. Then, the regularizer can be scaled by a constant, such that R(w) ≤ 1. This can be used to facilitate and in fact define the regularizer scaling λ in (7), therefore reducing the number of hyper-parameters.

Also, the sine function has a gradient that is zero (or close to zero) in the neighborhood of points where it attains its minimum and maximum values. This property might make escaping the maximum or approaching the minimum slow in their respective neighborhoods. On the other hand, the hat function is non-convex and non-continuous, with a constant gradient towards the minimum except at the points where it attains its minimum and maximum values, where the gradient does not exist. These tradeoffs might guide the choice between these functions, in a way similar to the choice between Sigmoid and ReLU activation functions.

Finally, notice that the amplitude can be changed adaptively during the training procedure, which allows us to obtain higher test accuracy, as will be shown in the experiments section. The frequency can also be varied during training, but these experiments are outside of the scope of this paper.

2.3 From Bits to Frequency and Vice-Versa

In practice we are interested in selecting the number of bits b to be used for quantization. For the sine and the associated hat function the frequency f corresponding to the number of bits can be found by using

f = 2^(b−1) − 1    (10)

and vice-versa

b = log2(f + 1) + 1    (11)

so that frequency 1 implies 2 bits, while frequency 7 implies 4 bits.

On the other hand, for the cosine the frequency corresponding to the number of bits can be found by using

f = 2^(b−1)    (12)

and vice-versa

b = log2(f) + 1    (13)

For instance, frequency 1 implies 1 bit, while frequency 8 implies 4 bits, and so on.
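These relations are simple enough to express directly in code; the helpers below are a small sketch (the function names are ours, not from the paper):

import math

def sine_frequency(bits):
    # (10): frequency for the sine / hat regularizer
    return 2 ** (bits - 1) - 1

def sine_bits(frequency):
    # (11): number of bits for a given sine / hat frequency
    return int(math.log2(frequency + 1)) + 1

def cosine_frequency(bits):
    # (12): frequency for the cosine regularizer
    return 2 ** (bits - 1)

def cosine_bits(frequency):
    # (13): number of bits for a given cosine frequency
    return int(math.log2(frequency)) + 1

assert sine_frequency(2) == 1 and sine_frequency(4) == 7
assert cosine_bits(1) == 1 and cosine_bits(8) == 4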

3 Experiments

In this section we will investigate the accuracy of ResNet-20 on CIFAR-10 and ResNet-50 on ImageNet datasets He et al. (2015); Krizhevsky et al. (2009); Deng et al. (2009). We will compare the test error achieved by the original and quantized models with the loss function defined in (1) and (7), respectively. The regularization term we add to the loss in (7) relies on periodic functions: the trigonometric sine in (8) and the hat in (9). It can be computed using the following PyTorch Paszke et al. (2017) code snippet

def periodic_regularization(model, amplitude, frequency):
    pi = 3.141592
    total = 0
    for m in model.modules():
        if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
            w = m.weight
            ic = 1 / w.abs().max()
            # either sine, as in (8)
            rw = torch.sum(amplitude *
                           torch.pow(torch.sin(pi * frequency * (w * ic)), 2))
            # or hat function, as in (9)
            # rw = torch.sum(amplitude *
            #                torch.abs(((((w * ic) - 0.5) * frequency) % 1) * 2 - 1))
            total = total + rw
    return total
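For illustration, a minimal sketch of how this term might enter a training step; the tiny model, dummy data and hyper-parameter values below are our own placeholders, not the setup used in the paper:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Flatten(), nn.Linear(8 * 30 * 30, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 3, 32, 32)      # dummy batch of images
y = torch.randint(0, 10, (4,))     # dummy labels

amplitude, frequency = 0.001, 127  # placeholder amplitude; frequency 127 corresponds to 8 bits via (10)
loss = criterion(model(x), y) + periodic_regularization(model, amplitude, frequency)

optimizer.zero_grad()
loss.backward()
optimizer.step()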

The training is performed using batch size 256 with the default 100 epochs for the CIFAR-10 and 90 epochs for the ImageNet dataset. We use a fixed schedule that adjusts the amplitude hyper-parameter at a fixed interval of epochs. We start with a small amplitude and progressively adjust it until it reaches its final value towards the end of training. Notice that the amplitude subsumes the scaling hyper-parameter λ, which is always set to 1.0. Note that other than using a fixed schedule we do not require any special treatment for the first or last model layers or training epochs, which is otherwise often required to produce good approximations. We show the results of representative runs.
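As a concrete illustration of such a fixed schedule, the sketch below grows the amplitude at a fixed epoch interval; the starting value, growth factor and interval are hypothetical placeholders, not the values used in our experiments:

def amplitude_schedule(epoch, start=1e-5, growth=2.0, interval=10):
    # hypothetical fixed schedule: multiply the starting amplitude by
    # `growth` once every `interval` epochs
    return start * growth ** (epoch // interval)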

After training, the model is quantized using the symmetric uniform quantization in (5), which can be performed using the following PyTorch Paszke et al. (2017) code snippet

def quantize_model(model, frequency):
    def quantize_weights(m):
        if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
            # c is the maximum weight in absolute value, as in (5)
            c = m.weight.abs().max().data
            # scale onto the integer grid, round, and scale back
            m.weight.data.mul_(frequency/c)
            m.weight.data.round_()
            m.weight.data.mul_(c/frequency)

    model.apply(quantize_weights)
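For completeness, a possible way to tie the two snippets together after training (the 8-bit setting is an example; the corresponding frequency follows from (10)):

frequency = 2 ** (8 - 1) - 1    # 8 bits correspond to frequency 127 via (10)
quantize_model(model, frequency)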
Figure 4: Plot of Top1 test error for the CIFAR-10 dataset with ResNet-20

We illustrate the difference between the original, the original with regularization (Regu), and the quantized model (Quant) for the CIFAR-10 dataset in Fig. 4. The accuracy of the original model is plotted with a black dotted line, while the accuracy of the other models is plotted with color bars. Notice that the model accuracy changes significantly depending on the number of bits used for quantization. For instance, there is a clear boundary between 4 and 8 bits, below which there are not enough bits to represent the information. Notice that while the training succeeds in all cases, the quantization fails to produce accurate results with fewer than 8 bits.

Figure 5: Plot of Top1 test error for the CIFAR-10 dataset with ResNet-20
Figure 6: Plot of Top1 test error for the ImageNet dataset with ResNet-50

Also, we illustrate the attained model accuracy with different starting amplitudes for the CIFAR-10 dataset in Fig. 5. The accuracy of the original model is plotted with a black dotted line, while the accuracy of the 8-bit quantized model is plotted with color bars. Notice that using an adaptive rather than a static amplitude allows us to reach higher test accuracy. Also, in our experiments we have found that it is good practice to choose the initial amplitude and the fixed schedule such that the final amplitude is in the range of 0.01 - 0.001, which corresponds to a reasonable value of the regularization scaling λ. We observe similar results on the ImageNet dataset, as seen in Fig. 6.

Figure 7: Plot of Top1 test error for the ImageNet dataset with ResNet-50

Finally, notice that the sine and hat functions perform equally well on the ImageNet dataset, as shown in Fig. 7. Once again, the accuracy of the original model is plotted with a black dotted line, while the accuracy of the 8-bit quantized model is plotted with color bars. In all plots, amplitude denotes the final amplitude. The detailed results are also summarized in Tab. 2 and 3.

                     Default Model    Quantized model (with sine)
                     test (best)      8 bits   8 bits   8 bits   4 bits   2 bits
Amplitude            n/a              0.1      0.01     0.001    0.001    0.001
Test error (fixed)   84.72 (87.70)    29.26    58.18    79.18    n/a      n/a
Test error (dyn)     84.72 (87.70)    53.28    74.14    87.46    16.66    10.20

Table 2: CIFAR-10: summary of representative experiments
             Default Model    Quantized model (with sine)    (with hat)
                              8 bits   8 bits   8 bits       8 bits
Amplitude    n/a              0.1      0.01     0.001        0.001
Top1 error   75.84            1.29     69.02    75.77        75.57
Top5 error   92.90            4.62     89.27    92.54        92.58

Table 3: ImageNet: summary of representative experiments

4 Conclusion and Future Work

We have proposed a novel technique for quantizing neural networks, based on regularization with periodic functions. We have shown that it can be used effectively to quantize ResNets on the CIFAR-10 and ImageNet datasets. In our experiments we have achieved virtually no loss vis-à-vis the standard model by using amplitude scaling on a fixed schedule throughout training, followed by 8-bit integer quantization. While similar quality results exist for quantization of CNNs, in this note we have achieved them through a completely novel method. In the future, we would like to incorporate the quantization of activations into this approach and experiment with more classes of neural networks.

Acknowledgements

The authors would like to thank Marat Dukhan, Bram Wasti and Satish Nadathur for collecting ARM Cortex A-7 and A-75 processor instruction information, as well as Misha Smelyanskiy for his helpful comments and suggestions.

References

5 Appendix: Brief Background

Machine learning models are used in the fields of computer vision (CV) and natural language processing (NLP), among many others. In particular, deep learning models based on neural networks composed of multiple layers have achieved unprecedented gains in the accuracy of image classification and object detection tasks Krizhevsky et al. (2012); Szegedy et al. (2014); LeCun et al. (1998); Krizhevsky et al. (2009); Deng et al. (2009).

In this paper we focus on the CV deep learning models that often rely on convolutional neural networks (CNNs), which are mainly composed of multiple convolution, fully connected and batch normalization layers LeCun et al. (1989a, b); Goodfellow et al. (2016); Ioffe & Szegedy (2015). For example, we will investigate ResNet-20 on CIFAR-10 and ResNet-50 on ImageNet datasets He et al. (2015); Krizhevsky et al. (2009); Deng et al. (2009). For completeness we review the most common layers next.

The convolution layer can be defined as

Z = g(X ∗ F)    (14)

where the input image X ∈ R^(h×w) and the filter F ∈ R^(r×s), while ∗ denotes a convolution (in this context it is also common to use cross-correlation rather than a convolution) and g denotes a pooling operation, resulting in the output Z ∈ R^(p×q) with p = (h − r)/u + 1 and q = (w − s)/v + 1 for strides u and v. The operation is usually repeated for k output channels, resulting in Z ∈ R^(p×q×k).

The fully connected layer is defined as

Z = σ(W X + b eᵀ)    (15)

where the input X ∈ R^(m×n), the weights W ∈ R^(k×m), the bias b ∈ R^k and the unit vector e ∈ R^n, while the non-linear activation function σ is applied component-wise on the intermediate result, producing the output Z ∈ R^(k×n) for batch size n.

The typical batch normalization layer can be written as

Z = D (X − μ eᵀ) + β eᵀ    (16)

where the input X ∈ R^(k×n), μ is the vector of per-feature means over the batch, D ∈ R^(k×k) is a scaled diagonal matrix of weights (the learned scale divided by the per-feature standard deviation over the batch), β is the bias and e ∈ R^n is the unit vector for batch size n Devarakonda et al. (2017).
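As a quick sanity check of the shapes implied by (14)-(16), the following standalone sketch (with arbitrary example sizes, not tied to any model in the paper) uses the corresponding PyTorch layers:

import torch
import torch.nn as nn

h, w, r, s, u, v = 32, 32, 3, 3, 2, 2    # image size, filter size and strides
m, k, n = 64, 10, 8                      # input features, output features, batch size

x_img = torch.randn(n, 1, h, w)
conv = nn.Conv2d(1, k, kernel_size=(r, s), stride=(u, v))
print(conv(x_img).shape)                 # (n, k, (h-r)//u + 1, (w-s)//v + 1), cf. (14)

x = torch.randn(n, m)
fc = nn.Linear(m, k)                     # computes X W^T + b, cf. (15) up to transposition
bn = nn.BatchNorm1d(k)                   # normalizes, then applies scale and bias, cf. (16)
print(bn(torch.relu(fc(x))).shape)       # (n, k)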