Finding Non-Uniform Quantization Schemes using Multi-Task Gaussian Processes

We propose a novel method for neural network quantization that casts the neural architecture search problem as one of hyperparameter search to find non-uniform bit distributions throughout the layers of a CNN. We perform the search assuming a Multi-Task Gaussian Processes prior, which splits the problem into multiple tasks, each corresponding to a different number of training epochs, and explore the space by sampling those configurations that yield maximum information. We then show that with significantly lower precision in the last layers we achieve a minimal loss of accuracy with appreciable memory savings. We test our findings on the CIFAR10 and ImageNet datasets using the VGG, ResNet and GoogLeNet architectures.


I Introduction

The strategy of quantizing neural networks to achieve fast inference has been a popular way of deploying neural networks in compute-constrained environments. Its benefits include significant memory savings, improved computational speed, and a decreased energy cost per inference. Many methods have used this family of strategies, quantizing down to anywhere between 8 bits and 2 bits with little loss in accuracy [10, 30]. It also bears noting that in most of these methods, after quantizing to very low precisions (1 to 5 bits), retraining is necessary to recover accuracy.

Even though quantization algorithms have improved significantly in recent years, they have almost exclusively assumed, implicitly, that the best strategy is to quantize all layers uniformly with the same precision. However, there are two main reasons to believe otherwise: i) since different layers have been shown to extract different levels of features [17], it follows that different layers might require different levels of precision; ii) viewing quantization as an approximation to the floating point (FP) version of the network suggests that lower error in the early layers reduces the propagation of errors down the whole network, minimizing any drop in accuracy. We believe that a good strategy for distributing bits through the network, eliminating any redundant bits, is as important as a good quantization strategy itself. The goal is then to find a configuration in a search space that uses the fewest bits and achieves the highest accuracy per bit used.

Fig. 1: Gaussian Process prediction for bit distribution in memory vs accuracy plot

We cast this Neural Architecture Search (NAS) problem into the framework of hyperparameter search, since the bit-width of each layer should ideally be found automatically. As with many NAS approaches, measuring the accuracy of a single configuration can take a considerable amount of time. To mitigate this issue, we propose a two stage approach. First, we map the full search space into a lower dimensional counterpart through a parameterised constraint function, and second, we use a Multi-task Gaussian Process to predict the accuracy at a higher epoch number from lower epoch numbers. This approach allows us to reduce both the complexity of the search space as well as the time required to determine the accuracy of a given configuration. Finally, as our Gaussian Process based approach is suitable for probabilistic inference, we use Bayesian Optimisation (BO) to explore and search the hyperparameter space of variable bit-size configurations.

For the quantization of the network, we use the DSConv method [16]. It achieves high accuracy without significant retraining, meaning that the number of epochs needed for full training (and, implicitly, the demand placed on the GP's predictive power) is minimised.

To summarise, our main contributions are as follows:

  1. we cast NAS as hyperparameter search, which we apply to the problem of variable bit-size quantization;

  2. we reduce the time needed to measure the accuracy of a proposed bit configuration considerably by using multi-task GPs to infer future accuracy from current estimates;

  3. we demonstrate performance across a broad range of configurations, described by Bezier curves and Chebyshev series.

The remainder of the paper is organised as follows: Section II reviews previous work on quantization and hyperparameter search. Section III elaborates on the methodology used for the search, including the constraint, exploration, and sampling procedures. Section IV shows the results achieved on the CIFAR10 and ImageNet datasets using the networks listed above. Section V draws conclusions and discusses insights from the paper.

II Related Work

Neural Architecture / Hyperparameter Search

One can consider finding bit distributions as a form of model selection [18], given its complexity and the limits on the parameters that it accepts as a solution. Previous methods have predominantly used Reinforcement Learning (RL) and Evolutionary Algorithms (EA) to model the search, which is referred to in the literature as Neural Architecture Search. Examples include NASNet [34], MNasNet [26], ReLeQ-Net [6], HAQ [29], among others [1, 31] for RL and [33, 13, 15, 24] for EA. Our work overlaps with these papers only in the goal of finding an optimal strategy given a search space.

ReLeQ-Net and HAQ, to the best of our knowledge, are the only methods whose aim is to find the optimal bit distribution through the different layers of a network, and are therefore the papers that overlap the most with our work. Notably, both of them use an RL-based approach to search for optimal bit distributions. However, HAQ is more focused on hardware-specific optimization, whereas both our method and ReLeQ-Net's attempt to be hardware-agnostic. Recently, some work involving Bayesian Optimization (BO) for model architecture selection has been carried out, with systems such as NASBOT [11]. One of the reasons why BO has not traditionally been used for model selection is the difficulty of defining a measure of "distance" between two models, which is the main problem addressed by NASBOT.

Alternatively, one can see determining bit distribution as finding hyperparameters to be tuned given a model, i.e. not different from finding the optimal learning rates or weight decays. Historically, this has been tackled by BO techniques. In neural networks specifically, this was popularized after the work of [21], and followed by others [27, 2, 7, 22]. As a result BO can be considered a natural method for searching for optimum bit distribution configurations.

Quantization

Quantization strategies can be either trained from scratch or derived from a pretrained network. The methods of [30, 10, 4, 32] initialize their networks from scratch. This ensures that there is no initial bias on the values of the parameters, and they can achieve the minimum difference in accuracy when extremely low bit values are used (1-bit / 2-bit), a notable exception being DoReFa-Net [32], which reportedly had slightly better results when quantizing the network starting from a pretrained network. The methods of [8, 14, 16, 28] quantize the network starting from a pretrained network. These methods start with a bias on the values of the parameters, which can limit how much accuracy they recover. A benefit of these methods, though, is that they can be quickly fine-tuned over a few epochs, re-achieving state-of-the-art results. These methods are more interesting to us because of their quick deployment cycle. It is worth noting that all of these methods use a uniform distribution of precision, meaning that all layers are quantized to the same number of bits.

III Method

Our method consists of three parts: constraining, exploring, and sampling the search space. We first constrain the search space by assuming that the bit-width of the next layer depends, to some extent, on the bit-width used in the current layer. We do this by drawing bit distributions from a low-degree polynomial (in the experiments we use a first-degree Bezier curve and a fourth-order Chebyshev series). Given a drawn distribution, we quantize the network using the DSConv method. We explore the space by placing a Gaussian prior over the polynomial parameters, and sampling and retraining the sets of hyperparameters that give the most information about the final payoff function. After exploring, we rank the configurations by sampling the GP for accuracy, and choose the ones that are most appropriate for our end use. Each of these phases is explained further in this section.

Fig. 2: Three examples of modified Chebyshev functions and their clamped versions. The continuous lines represent the values of the modified Chebyshev functions as a function of the layer, which are then converted into bit-widths whose values are shown on the right-hand axis. This is for two 8-layer VGG11s and a 20-layer ResNet, corresponding to configurations 1) blue: 8,7,6,6,6,6,5,3; 2) orange: 8,3,2,4,6,8,7,2; and 3) green: 1,1,1,2,3,4,4,4,4,4,4,3,3,2,2,2,2,2,2,2.

III-A Constraining the Space

When trying to find the bit-width, from 1 to 8, for each layer, the search space has size $8^L$, where $L$ is the number of layers of the network. For a CNN of 50 layers, the search space is $8^{50} \approx 1.4 \times 10^{45}$, comparable in magnitude to the state space of a game of chess. Algorithms have been developed that consistently beat chess grandmasters, but one episode of chess takes considerably less time than one episode of training a neural network and measuring its accuracy. Exhaustively searching this space is therefore prohibitive.

Our method for constraining the search space relies on the use of parameterised functions. We model a function of degree $d$ with a few hyperparameters, which describe the search space. From this function, we pick the bit distribution such that it follows the function's curve. In this way, a bit configuration for a network with any number of layers can be sampled from a few hyperparameters alone.

We use two parameterised functions to illustrate our solution:

  • We define the Bezier function for $t \in [0, 1]$ as $B(t) = \boldsymbol{\phi}(t)^{\top}\mathbf{w}$, where $d$ is the degree of the polynomial and $\mathbf{w} \in \mathbb{R}^{d+1}$. The vector $\boldsymbol{\phi}(t)$ is the feature vector of the Bezier curve, i.e. $\boldsymbol{\phi}(t) = [1-t,\; t]$ for the Linear Bezier, $\boldsymbol{\phi}(t) = [(1-t)^2,\; 2t(1-t),\; t^2]$ for the Quadratic Bezier, etc.

  • We define the modified Chebyshev function for $t \in [0, 1]$ as $C(t) = \sum_{i=0}^{d} w_i\, T_i(2t - 1)$, where $d$ is the degree of the polynomial. The functions $T_i$ are the Chebyshev polynomials of the first kind, defined by $T_0(x) = 1$, $T_1(x) = x$ and $T_{i+1}(x) = 2xT_i(x) - T_{i-1}(x)$.

The constraint function is then a clamped and rounded version of the chosen polynomial function $f_{\mathbf{w}}(t)$ (either $B(t)$ or $C(t)$), such that the bit-width $b_l$ generated for each layer $l \in \{1, \dots, L\}$ satisfies $1 \le b_l \le 8$. We can then define $b_l = \mathrm{round}\big(\mathrm{clamp}(f_{\mathbf{w}}(t_l), 1, 8)\big)$, where $\mathrm{round}(\cdot)$ is the rounding function and $t_l$ is the normalised layer index in $[0, 1]$.
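To make the constraint concrete, the following is a minimal Python sketch of this mapping, assuming a linear Bezier basis and a Chebyshev series rescaled to the normalised layer index; the function names, the rounding-then-clamping order, and the rescaling are illustrative choices rather than the reference implementation.

```python
import numpy as np

def bezier_linear(w, t):
    """Linear Bezier curve B(t) = (1 - t) * w0 + t * w1 for t in [0, 1]."""
    w0, w1 = w
    return (1.0 - t) * w0 + t * w1

def chebyshev_series(w, t):
    """Chebyshev series C(t) = sum_i w_i * T_i(2t - 1), rescaling t to [-1, 1]."""
    return np.polynomial.chebyshev.chebval(2.0 * t - 1.0, w)

def bit_configuration(w, num_layers, basis=bezier_linear, b_min=1, b_max=8):
    """Map polynomial coefficients w to one bit-width per convolutional layer."""
    t = np.linspace(0.0, 1.0, num_layers)          # normalised layer index
    values = basis(np.asarray(w, dtype=float), t)  # continuous curve over layers
    bits = np.clip(np.rint(values), b_min, b_max)  # round, then clamp to [1, 8]
    return bits.astype(int)

# Example: a decreasing linear Bezier for a 13-conv-layer VGG16-style network.
print(bit_configuration([6.0, 1.0], num_layers=13))          # [6 6 5 5 4 4 4 3 3 2 2 1 1]
print(bit_configuration([4.0, 1.5, -0.5, 0.3, 0.1], 20,
                        basis=chebyshev_series))             # 4th-order Chebyshev
```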

Fig. 2 shows an example of a Chebyshev function and its clamped version. The y-axis on the left indicates the value of the Chebyshev function at different layer indices. This is then clamped, rounded, and scaled so that it becomes a piecewise-constant line representing the bit-width chosen for each layer of a CNN. The bit value is indicated on the y-axis on the right.

By constraining the search space in this way, the minimization problem shifts as follows:

Naïve approach: $\min_{\mathbf{b}} \; \mathcal{L}(\mathbf{b})$  s.t.  $b_l \in \{1, \dots, 8\}$, $l = 1, \dots, L$

Ours: $\min_{\mathbf{w}} \; \mathcal{L}\big(c(\mathbf{w})\big)$  s.t.  $\mathbf{w} \in \mathbb{R}^{d+1}$

where $\mathcal{L}$ is the loss function (to be introduced in Section III-C) and $c(\mathbf{w})$ is the constraint function mapping polynomial parameters to per-layer bit-widths.

The search then reduces to finding the parameters of the polynomial basis w, which consequently define the bit distribution throughout the layers. The search space is then continuous and compatible with GPs, and significantly reduced to only $d+1$ dimensions. Using this parameterisation, we are able to easily define a distance metric between configurations to be used when calculating the kernel function and predictive distribution of our Gaussian Process. With this setup, the search space can be sufficiently explored in a timely manner.

Fig. 3: Quantization given variable bit-widths. Notice that the input is the image, which is a uint8 tensor (the normalization can be folded into a KDS tensor [16]), so it is not quantized. The quantization of the activations is done before each convolution, so that the convolution can be performed at a single precision.

Quantization Strategy

As mentioned previously, the method used for quantizing the CNN is DSConv. This choice was made because our aim is to minimize time taken during training, and DSConv has consistently shown good accuracy properties in models, even when they are not retrained. This can also be seen as a time-constraint in the search space, such that minimal training time is needed to achieve meaningful accuracy estimations.

In this method, both the activations and the weights are quantized, so that fast inference is possible. Each weight tensor is quantized into two tensors, the Variable Quantized Kernel (VQK) and the Kernel Distribution Shift (KDS). Both are divided depthwise into blocks of size $B$. The VQK stores integer values in 2's complement, such that each value lies in $[-2^{b-1}, 2^{b-1}-1]$, where $b$ is the number of bits in that layer. It has the same shape as the FP32 weight tensor, and its values are found by scaling each block of the original FP32 weight tensor, then flooring and clipping to that range. The KDS is a tensor $B$ times shallower, which holds FP32/16 scaling values, each corresponding to one block of the VQK. Their values are calculated by minimizing the L2 norm of each block with respect to the corresponding original block, $\xi^{*} = \arg\min_{\xi} \lVert \mathbf{w} - \xi\,\mathbf{q} \rVert_2$, where $\mathbf{w}$ is the original FP32 block and $\mathbf{q}$ is the corresponding VQK block. The idea is that, at the end of the quantization process, the KDS multiplied by the VQK (block by block) will be as similar as possible to the original FP32 weight tensor. The activation tensor is quantized similarly, but using the Block Floating Point (BFP) format in each block instead.
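As an illustration, the following is a simplified sketch of this block-wise scheme for a single 2-D (output-channel x input-channel) weight slice, assuming a max-magnitude scale for the VQK and the closed-form L2 fit above for the KDS; it follows the description in the text rather than the reference DSConv implementation, and the function name and 2-D simplification are illustrative.

```python
import numpy as np

def dsconv_like_quantize(w, bits, block):
    """Block-wise quantization sketch: a low-bit integer tensor (VQK) plus one
    FP scale per depthwise block (KDS). `w` is a 2-D (out_ch, in_ch) slice; the
    spatial positions of a conv kernel would be handled in the same way."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # e.g. 3 bits -> [-4, 3]
    out_ch, in_ch = w.shape
    vqk = np.zeros_like(w, dtype=np.int32)
    kds = np.zeros((out_ch, int(np.ceil(in_ch / block))), dtype=w.dtype)
    for o in range(out_ch):
        for i, s in enumerate(range(0, in_ch, block)):
            blk = w[o, s:s + block]
            # Scale so the largest magnitude lands near the integer range,
            # then floor and clip to the two's-complement range (the VQK).
            scale = np.abs(blk).max() / q_max if np.abs(blk).max() > 0 else 1.0
            q = np.clip(np.floor(blk / scale), q_min, q_max)
            vqk[o, s:s + block] = q
            # L2-optimal scalar for this block (the KDS entry):
            #   argmin_xi ||blk - xi * q||_2  =>  xi = <blk, q> / <q, q>
            kds[o, i] = blk @ q / (q @ q) if np.any(q) else 0.0
    return vqk, kds

# Toy usage: a (64, 128) weight slice, 3-bit VQK, depthwise block size 32.
w = np.random.randn(64, 128).astype(np.float32)
vqk, kds = dsconv_like_quantize(w, bits=3, block=32)
w_hat = vqk * np.repeat(kds, 32, axis=1)   # dequantised approximation of w
print("mean reconstruction error:", np.abs(w - w_hat).mean())
```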

In order to take advantage of the low bit multiplication speed, the activation tensor and the weight tensor need to have the same precision. Figure 3 shows how this is done. The activation tensor prior to a convolution layer is set to be quantized to the same bit precision as that layer. The first convolution is not quantized since the input image is already in uint8 format. In this way, a quantization distribution strategy can be fully defined by providing the precision on each of the layers. Also note that we quantize only the convolutional layers. The Fully Connected layers are all left in the original FP32 precision for training.

This quantization strategy has shown good properties with little to no retraining at all. Since our goal is to show that an uneven distribution of precision through the layers of a network is a better strategy for quantization, we use the fact that DSConv needs little to no retraining to accelerate our search algorithm. Whereas many other algorithms need a large number of epochs to reach their optimum accuracy, DSConv needs only a few iterations before meaningful results are achieved.

Fig. 4: Multi-Task Gaussian Process for inferring the accuracy of a quantized network. The quantization function is a Linear Bezier with the first parameter fixed to 0.5, i.e. $w_0 = 0.5$. For all plots, the x-axis is the value of the second parameter $w_1$, and the left y-axis is the accuracy on CIFAR10 of a toy CNN with 10 layers. The right y-axis (red line) shows the model size for a given value of $w_1$. The epoch correspondence for each task is [0, 1, 2, 15], respectively. After this exploration phase, the decision procedure is run on the predictive distribution of Task 4.

III-B Exploring the Space

Next, we need a way of exploring the space in order to learn the accuracy of the network from a limited set of sampled w points. We place a Multi-Task Gaussian Process prior over the network's accuracy, such that each task corresponds to estimating the accuracy of the quantized neural network, given the parameters w, after a certain number of training epochs, e.g. task 1 corresponds to 0 epochs, task 2 to 1 epoch, task 3 to 2 epochs, and task 4 to 15 epochs. Let there be $T$ tasks with latent functions $f_t$, $t = 1, \dots, T$, and place a GP prior on each, $f_t \sim \mathcal{GP}(0, k)$; the tasks are correlated through the covariance structure introduced below. Let $y_{t,i}$ be the observation at hyperparameter value $\mathbf{w}_i$ for task $t$, and let $\epsilon_t \sim \mathcal{N}(0, \sigma^2_t)$ be the observation noise for that task, which is normally distributed. This defines $T$ independent Gaussian likelihoods $p(y_{t,i} \mid f_t(\mathbf{w}_i)) = \mathcal{N}\big(f_t(\mathbf{w}_i), \sigma^2_t\big)$. From this model, $N$ observations are drawn, such that $y_{t,i} = f_t(\mathbf{w}_i) + \epsilon_t$, where $y_{t,i}$ is the $i$-th observation of the $t$-th task.

We used the Intrinsic Correlation Model (ICM) of [5] and [3] for the kernel calculation (in our experiments we used the squared exponential kernel over the inputs). We can then define the mean and the correlation between tasks as:

$$\langle f_t(\mathbf{w}) \rangle = 0, \qquad \langle f_t(\mathbf{w})\, f_{t'}(\mathbf{w}') \rangle = K^f_{t t'}\, k^x(\mathbf{w}, \mathbf{w}') \tag{1}$$

where $K^f$ and $k^x$ are positive semi-definite, corresponding to the correlation between the task functions and the correlation between the inputs respectively. From this it follows that the covariance over all observations is $K^f \otimes K^x$, where $\otimes$ is the Kronecker product, $K^f$ is the matrix of correlations between the task functions and $K^x$ is the matrix of correlations between the inputs. For a new data point $\mathbf{w}_*$, the mean prediction for task $t$ can then be calculated using the usual formula for the GP predictive distribution:

$$\bar{f}_t(\mathbf{w}_*) = \big(\mathbf{k}^f_t \otimes \mathbf{k}^x_*\big)^{\top} \big(K^f \otimes K^x + D \otimes I\big)^{-1} \mathbf{y} \tag{2}$$

where $\mathbf{k}^f_t$ is the $t$-th column of $K^f$, $\mathbf{k}^x_*$ is the vector of input covariances between $\mathbf{w}_*$ and the training inputs, and $D$ is a $T \times T$ diagonal matrix whose $t$-th diagonal element is the noise variance $\sigma^2_t$.
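A minimal numerical sketch of the ICM predictive mean in Equation (2) is given below, assuming a unit-lengthscale squared-exponential input kernel and a hand-fixed task covariance $K^f$ (in practice $K^f$ and the kernel hyperparameters would be learned); since each configuration is not observed at every fidelity, the indexed form of the Kronecker covariance is used instead of the full Kronecker product.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def icm_predict(W, y, tasks, W_star, task_star, Kf, noise):
    """ICM multi-task GP predictive mean (Bonilla et al., 2008).
    W: (N, d) hyperparameter inputs; y: (N,) observed accuracies;
    tasks: (N,) integer task index of each observation; Kf: (T, T) task
    covariance; noise: (T,) per-task observation noise variances."""
    Kx = rbf(W, W)
    # Covariance of the observations: Kf[t_i, t_j] * k^x(w_i, w_j) + task noise.
    K = Kf[np.ix_(tasks, tasks)] * Kx + np.diag(noise[tasks])
    # Cross-covariance between test points (in task_star) and the observations.
    k_star = Kf[task_star, tasks] * rbf(W_star, W)
    return k_star @ np.linalg.solve(K, y)

# Toy usage: 3 cheap tasks (0, 1, 2 epochs) and one expensive task (15 epochs).
rng = np.random.default_rng(0)
W = rng.uniform(0, 1, size=(14, 2))              # 14 sampled Bezier parameters
tasks = rng.integers(0, 4, size=14)              # fidelity of each observation
y = rng.uniform(0.7, 0.95, size=14)              # observed (toy) accuracies
Kf = 0.9 * np.ones((4, 4)) + 0.1 * np.eye(4)     # assumed strong task correlation
noise = np.full(4, 1e-3)
W_star = rng.uniform(0, 1, size=(5, 2))
print(icm_predict(W, y, tasks, W_star, task_star=3, Kf=Kf, noise=noise))
```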

Figure 4 shows an example of the Multi-Task setting with a 1D Bezier curve for ease of visualization. Each plot shows the predictive mean and variance for each epoch count after 14 data points have been collected, using the exploration algorithm explained in Section III-B; e.g. Task 1 was set to 0 epochs (i.e. quantizing straight from the FP32 model), Task 2 to 1 epoch, Task 3 to 2 epochs, and Task 4 to 15 epochs. The idea is to predict the distribution of the last task given inputs from the earlier tasks.

Exploration Phase

In order to decide which parameters to evaluate, we need to explore the space so as to predict the accuracy of the last task. The exploration phase for the multi-task Gaussian Process follows the Low-Fidelity Search of [23]. The idea is to find the values of $\mathbf{w}$, and the fidelity at which to evaluate them, that yield maximal information $I(f; y_a \mid \mathbf{y})$ about the payoff function $f$, where $\mathbf{y}$ is the observation history and $a$ is the action to be performed. It is important to weight this information by a measure of the cost of performing that operation, so the exploration procedure chooses the action that maximizes $I(f; y_a \mid \mathbf{y})$ per unit cost. This means that the parameter carrying the most information about the payoff function is picked.

Depending on the dataset and model chosen, the user can favour exploration at one fidelity over another by decreasing the cost of running that particular task. Additionally, we set a budget on the amount of time (in unit cost) or the number of architectures that we are willing to explore; the exploration phase finishes when the budget has been fully used. After that, the user can run their preferred method of ranking configurations using the posterior of the trained GP.
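The sketch below is a simplified proxy for this criterion: it scores each candidate (parameters, fidelity) pair by the reduction in entropy of the target-task prediction at that point, divided by the fidelity's unit cost, and reuses the ICM quantities from the previous sketch. The exact information measure of the low-fidelity search in [23] is more general than this stand-in; function names and arguments are illustrative.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def target_variance(W, tasks, x, Kf, noise, target):
    """Posterior variance of the target-task accuracy f_target(x), given
    observations made at inputs W and (possibly cheaper) fidelities `tasks`."""
    K = Kf[np.ix_(tasks, tasks)] * rbf(W, W) + np.diag(noise[tasks])
    k = Kf[target, tasks] * rbf(x[None, :], W)[0]
    return Kf[target, target] - k @ np.linalg.solve(K, k)

def next_query(cands, cand_tasks, W, tasks, Kf, noise, cost, target):
    """Pick the candidate (parameters, fidelity) whose observation would most
    reduce the entropy of the target-task prediction, per unit cost."""
    best, best_score = None, -np.inf
    for x, t in zip(cands, cand_tasks):
        v_before = target_variance(W, tasks, x, Kf, noise, target)
        v_after = target_variance(np.vstack([W, x[None, :]]),
                                  np.append(tasks, t), x, Kf, noise, target)
        score = 0.5 * np.log(v_before / v_after) / cost[t]  # nats of info / cost
        if score > best_score:
            best, best_score = (x, t), score
    return best
```

Each selected pair is then trained for that task's number of epochs, the measured accuracy is appended to the observation history, and the loop repeats until the budget is exhausted.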

III-C Sampling the Space

The naïve goal is to find the highest accuracy per bit possible, which corresponds to finding the minimum of the loss function $\mathcal{L}$. However, there is a trade-off to consider. A model, e.g. ResNet20, using a total of 40 bits and achieving 80% accuracy (a ratio of 2%/bit) is arguably worse than a model that uses 43 bits and achieves 85% accuracy (a ratio of 1.97%/bit). The goal is instead to find a decision procedure that accounts for the regret of not using more bits, based on a set of constraints; this relationship should be linear rather than inversely proportional. A better strategy is to assume that using 4 bits for all layers is the best that can be done when quantizing without losing accuracy. Each bit used below this average should be rewarded, and each bit used above it should be penalised; the reward (or penalty) is added to (or subtracted from) the accuracy to obtain an "effective accuracy". We define the effective accuracy as $A_{\mathrm{eff}} = A + \kappa\,(4 - \bar{b})$, where $A$ is the accuracy of the network, $\bar{b}$ is the mean bit-width over the layers, and $\kappa$ is a constant penalty per bit. With $\kappa = 1\%$, each bit used above the 4-bit average incurs a penalty of 1% in the effective accuracy, and each bit below it a reward of 1%. The decision procedure is then to minimize the negative effective accuracy, $\mathcal{L} = -A_{\mathrm{eff}}$. Once we have enough information from the GPs, we can rank configurations based on this loss and pick the most relevant ones.
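In code, the decision loss can be sketched as follows; the 4-bit baseline and the 1%-per-bit constant come from the text, while aggregating a configuration by its mean bit-width is an assumption consistent with the description above.

```python
import numpy as np

def effective_accuracy(accuracy, bits, kappa=0.01, baseline=4.0):
    """Reward configurations that use fewer bits than a uniform 4-bit network:
    one percentage point of 'effective accuracy' per bit of mean saving."""
    mean_bits = float(np.mean(bits))
    return accuracy + kappa * (baseline - mean_bits)

def decision_loss(accuracy, bits):
    """Loss minimised when ranking configurations: negative effective accuracy."""
    return -effective_accuracy(accuracy, bits)

# The 40-bit / 43-bit example above: the second wins despite a slightly lower
# raw accuracy-per-bit ratio.
print(decision_loss(0.80, [2] * 20))            # mean 2.00 bits -> -0.82
print(decision_loss(0.85, [2] * 17 + [3] * 3))  # mean 2.15 bits -> -0.8685
```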

IV Experiments and Results

We tested our method in a variety of configurations, using versions of the original VGG, ResNet, and GoogLeNet models, altered to take CIFAR10 and ImageNet32 as input. For training on CIFAR10 and ImageNet32, we used data augmentation by cropping a 32x32 image from the 4-pixel-padded original. We used a Stochastic Gradient Descent optimiser with momentum of 0.9 and weight decay. The learning rate was divided by 10 after 150 and 250 epochs.
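A sketch of this training setup is given below, assuming PyTorch; the network constructor, batch size, horizontal flip, and the learning-rate and weight-decay values are placeholders, since the paper's exact values are not recoverable here.

```python
import torch
import torchvision
from torchvision import transforms

# Data augmentation as described: 4-pixel padding followed by a random 32x32 crop.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),            # commonly used; an assumption here
    transforms.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                         transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

model = torchvision.models.vgg16(num_classes=10)  # stand-in for the adapted VGG16

# SGD with momentum 0.9 and weight decay; lr=0.1 and wd=5e-4 are placeholders.
optim = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 after epochs 150 and 250, as described.
sched = torch.optim.lr_scheduler.MultiStepLR(optim, milestones=[150, 250], gamma=0.1)
```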

We ran the exploration procedure on a set of configurations for each network using the multi-task algorithm outlined above. From these configurations, we could then use the mean of the Gaussian Process posterior to draw estimates of the accuracy of many different quantization schemes. Using the decision procedure outlined above, we sorted the results by accuracy, memory, or computational complexity, and selected points of interest to better visualise the general trend of the configurations found.

| CNN | Configuration (bits per layer) | Top-1 estimate from GP | Mean Top-1 | Std | Delta | # Bits | Memory (MB) |
|---|---|---|---|---|---|---|---|
| VGG16 | 32-bit Floating Point | - | 93.7% | - | - | - | 58.8 |
| VGG16 | 6555443332211 | 95.5% | 93.7% | 0.2% | -1.8% | 50 | 4.84 |
| VGG16 | 1122333445556 | 91.3% | 87.7% | 0.2% | -3.6% | 50 | 9.50 |
| VGG16 | 7665544332221 | 92.1% | 93.7% | 0.1% | 1.6% | 50 | 5.26 |
| VGG16 | 1222334455667 | 90.1% | 91.5% | 0.4% | 1.4% | 50 | 10.75 |
| VGG16 | 4444444444444 | 93.3% | 93.8% | 0.1% | 0.5% | 52 | 8.28 |
| VGG16 | 3333333333333 | 92.9% | 93.5% | 0.2% | 0.6% | 39 | 6.44 |
| VGG19 | 32-bit Floating Point | - | 93.9% | - | - | - | 80.1 |
| VGG19 | 6555444433322211 | 94.4% | 93.7% | 0.1% | -0.7% | 54 | 6.95 |
| VGG19 | 1122233344445556 | 91.6% | 89.6% | 0.4% | -2.0% | 54 | 12.04 |
| VGG19 | 5444433333222211 | 93.9% | 93.5% | 0.1% | -0.4% | 46 | 6.14 |
| VGG19 | 1122223333344445 | 90.3% | 88.4% | 1.2% | -0.9% | 46 | 10.05 |
| VGG19 | 3333333333333333 | 92.9% | 93.4% | 0.2% | 0.5% | 48 | 8.76 |
| VGG19 | 2222222222222222 | 92.1% | 92.2% | 0.2% | 0.1% | 32 | 6.25 |
| ResNet18 | 32-bit Floating Point | - | 95.4% | - | - | - | 44.6 |
| ResNet18 | 66655554444333322221 | 96.3% | 95.4% | 0.1% | -0.9% | 75 | 3.72 |
| ResNet18 | 12222333344445555666 | 95.9% | 92.9% | 0.3% | -3.0% | 75 | 8.00 |
| ResNet18 | 44444444333333333322 | 95.3% | 95.3% | 0.1% | 0.0% | 60 | 4.34 |
| ResNet18 | 22333333333344444444 | 94.5% | 94.2% | 0.2% | -0.3% | 66 | 6.08 |
| ResNet18 | 33333333333333333333 | 94.4% | 95.0% | 0.1% | 0.6% | 60 | 4.90 |
| ResNet18 | 22222222222222222222 | 93.1% | 93.3% | 0.5% | 0.2% | 40 | 3.49 |
| GoogLeNet | 32-bit Floating Point | - | 95.5% | - | - | - | 24.32 |
| GoogLeNet | 4×21 3×27 2×16 | 94.7% | 95.3% | 0.1% | 0.6% | 207 | 2.35 |
| GoogLeNet | 2×16 3×27 4×21 | 94.6% | 94.2% | 0.1% | -0.4% | 207 | 2.98 |
| GoogLeNet | 6×8 5×13 4×12 3×13 2×13 1×5 | 95.8% | 95.3% | 0.1% | -0.5% | 231 | 2.65 |
| GoogLeNet | 1×5 2×13 3×13 4×12 5×13 6×8 | 94.7% | 90.5% | 0.2% | -4.2% | 231 | 3.73 |
| GoogLeNet | 3×64 | 94.7% | 95.1% | 0.1% | 0.4% | 192 | 2.68 |
| GoogLeNet | 2×64 | 93.4% | 93.5% | 0.2% | 0.1% | 127 | 1.92 |
TABLE I: Results for many configurations on CIFAR10. VGG16 and VGG19 correspond to the architectures introduced in [20]. ResNet18 is the architecture from [9], and the GoogLeNet architecture is from [25]. The Configuration refers to the bit value for each layer of a given model, from earlier layers on the left to later layers on the right. They are color coded for clarity: red for higher bits and green for lower bits. It is important to note that we quantize only the convolutional layers, which means that VGG16 has 13 values, VGG19 has 16 values, ResNet18 has 20 values, and GoogLeNet has 64. Because of its size, the GoogLeNet configurations are written as bit-width × number of consecutive layers using that bit-width. The column "Delta" refers to the difference between the GP estimation of the Top-1 accuracy and the actual mean Top-1 accuracy after properly retraining that particular configuration.

Results on Accuracy using the CIFAR10 Dataset

Results on CIFAR10 and ablation tests are displayed in Table I. The configurations are color coded for clarity, with red representing higher bit counts and green representing lower bit counts. These configurations were selected based on the decision procedure outlined above, using the Bezier Linear polynomials.

For comparison, we show six configurations for each network: the first and third were picked by the decision procedure outlined above; the second and fourth are simply the first and third in reverse order; and the fifth and sixth use the traditional uniform distribution of bits, for a fair comparison.

It is important to note that the decision to pick these configurations was based on the estimate of the GP rather than on the actual Top-1 results. In order to compare fairly, we also included the Top-1 scores after properly training each configuration for an additional 30 epochs, using the same hyperparameters and optimiser that were used to train the FP32 version of each network. We also included a delta column, which shows the difference between the Top-1 estimate from the GP and the Top-1 after fine-tuning the network. It is remarkable that most of the estimation error is within 1%, which shows that the GP was able to generalize and interpolate properly, as expected.

It can be seen that, in general, using more bits in earlier layers yields more accurate and lighter configurations. The higher accuracy can be explained numerically: since higher bit-widths are used in earlier layers, less error is propagated through the network. The lower memory usage is due to the fact that later layers have more channels, so using lower precision in those layers yields a large reduction in memory. For VGG16, the first configuration is lighter, faster, and more accurate than using 3 bits for all layers. This pattern repeats for the deeper VGG19, where the first configuration yielded superior results to a constant 3 bits for all layers, and for ResNet18 as well. This "rule of thumb" is somewhat weaker for the GoogLeNet architecture, although there is still a clear correlation.

Results on Accuracy using Chebyshev Series

In order to test the robustness of the method with respect to the choice of prior functions, we used a Chebyshev series of fourth degree, which has a larger search space than the Linear Bezier model. We tested it on the CIFAR10 dataset as well, and the results are shown in Table II.

As can be seen, the higher degree introduced more flexibility in the bit configurations the method is capable of finding. We found that with a higher polynomial degree, the number of architectures searched should also increase; in our experiments, we searched over more configurations before finding good results. The table shows the expected result that more bits at the beginning compensate for fewer bits at the end of the network. The ResNet-18 result resembles the configuration found in Table IV, although here the method found a configuration that uses 3 bits more heavily and performs slightly worse. As also expected, inverting the bit distribution across the network results in both higher memory and lower accuracy.

The same behaviour is found with the VGGs, with the slight difference that, as VGG11 is shallower, it requires more bits to recover the accuracy. VGG16 is considerably deeper, and therefore our algorithm was able to compress it more significantly.

These results show that our method can be used with a variety of bases. It is worth bearing in mind that GP inference requires a matrix inversion whose cost grows with the number of sampled configurations, which in turn grows with the degree of the polynomial chosen. Therefore our method only works in a timely manner when using few hyperparameters to describe the function.

| Method | Network | Bitwidths | Accuracy loss | Memory (% of original) |
|---|---|---|---|---|
| Ours | ResNet18 | 64332223333333322 | -0.6% | 8.0% |
| Ours | ResNet18 | 22333333332223346 | -1.2% | 11.5% |
| Ours | VGG11 | 76667654 | -0.1% | 16.8% |
| Ours | VGG11 | 45676667 | -0.5% | 19.7% |
| Ours | VGG16 | 4322233332211 | -0.8% | 6.3% |
| Ours | VGG16 | 1122333322234 | -2.4% | 8.3% |

TABLE II: Results of our method when using Chebyshev polynomials of 4th degree.

Results on Network Size

Figure 5 shows the GP-estimated accuracy of different configurations plotted against their model size. The solid purple line links the uniform configurations, starting with all 1s and finishing with all 6s. Any point that lies above that line is therefore interesting, since it gives better accuracy while using the same amount of memory as its uniform counterpart. We have highlighted a number of interesting configurations with red stars and labelled them A-M to better visualise what each point represents.

As can be seen in the figure, the choice of bit usage throughout the network plays an important role in both accuracy and memory usage. Even though there is a clear trend linking model size and accuracy, there are a handful of configurations which perform well on both fronts. In general, points above the purple line correspond to configurations whose bit-width decreases linearly with depth, whereas those below the purple line correspond to configurations whose bit-width increases linearly with depth.

The surprising result is that, in the CIFAR10 experiments, even though using 1 bit uniformly for all layers achieves poor results, just introducing a couple of extra bits in the first three quarters of the network (as in points A, C and F) gives an almost negligible memory increase but a significant recovery in accuracy. Adding bits at the end of the network achieves the opposite effect. It can also be noticed that points A, C, E, and F achieve better accuracy than the uniform 2-bit configuration whilst using 50% less memory. This is even more evident for point E, in which up to 6 bits are used in the first layers, yet the memory usage is still lower due to the use of 1 bit in the larger kernels at the end of the network.

Fig. 5: Scatter plot of the effect on accuracy versus model size of different bit configurations. The left three plots use the CIFAR10 dataset and the right three plots use the ImageNet32 dataset. Note that this is the plot of the estimate as given by the trained GP, and not the actual accuracy given proper training. The solid line refers to the uniform configurations, starting with all 1s and ending with all 6s. Points A-M highlight different configurations as shown in the text boxes. The string of numbers shown refers to the bit size on each layer of the given network. Note that VGG11 has 8 convolutional layers, and therefore points A and B have only 8 numbers. This applies to VGG16 (13 layers), VGG19 (16 layers), ResNet18 (20 layers), and ResNet34 (36 layers) as well.

In the ImageNet32 experiments, we also see some improvement, albeit less dramatic than in the CIFAR10 experiments. The overall message is still the same, as can be seen in points H, J and L, for which adding bits in the first layers achieved good accuracy with small memory increases. It is noteworthy that even with a dataset as challenging as ImageNet32, given its substantial reduction in information compared to full-resolution ImageNet, the GP could find good configurations without needing more data points. This shows that the method is robust to changes in dataset.

Brief Comparison with ReLeQ

One of the other papers that touches on this subject is ReLeQ [6]. As explained in the related work section, it uses a reinforcement learning approach to find optimal bit distributions over the network. Whilst its quantization methodology differs greatly from the one used in this paper, it is worth comparing its results to ours. Results for the CIFAR10 dataset on three networks are shown in Table III. It can be seen that we achieve similar results for ResNet, though with different mean bit-widths. Since ReLeQ's method does not use the same constraint as ours, it can find more varied solutions. This constraint is a limitation of our method: it allows solutions to be found faster and with less computational power, but it reduces the freedom of choice.

| Method | Network | Bitwidths | Accuracy loss | Model size, DSConv (MB) | Model size, WRPN×1 (MB) |
|---|---|---|---|---|---|
| ReLeQ [6] | ResNet-20 | 822322232333222322228 | 0.12% | 3.88 | 3.25 |
| Ours | ResNet-20 | 66655554444333322221 | 0.1% | 3.54 | 2.91 |
| ReLeQ [6] | VGG-11 | 858566668 | 0.17% | 6.86 | 6.61 |
| Ours | VGG-11 | 77766655 | 0.14% | 6.35 | 5.42 |
| ReLeQ [6] | VGG-16 | 8886868686868688 | 0.1% | 13.32 | 12.54 |
| Ours | VGG-16 | 6555443332211 | 0.1% | 4.62 | 3.74 |
TABLE III: Comparison of our accuracy results with ReLeQ's method [6] on CIFAR10. The authors of [6] did not provide their model sizes in memory, so we estimated them using both our quantisation scheme (DSConv) and the original authors' scheme (WRPN×1) to make a fair comparison.

Results on ImageNet using ResNet

For completeness, we include some of the results found by our algorithm on the more challenging ImageNet dataset [19]. We used the full-resolution dataset in order to have results comparable with other methods. The networks were trained using an Adam optimizer [12].

Table IV shows the results. As expected, the same pattern of decreasing precision with depth holds across datasets. Comparing these results with those of DSConv [16], we can see that a decreasing bit-width throughout the architecture, starting at 6 bits and finishing at 2 bits, is superior to the uniform "all 4s" and "all 3s" configurations.

| Method | # of layers | Bitwidths | Acc. loss | Size (MB) |
|---|---|---|---|---|
| Ours | 18 | 6×3 5×5 4×5 3×5 2×2 | 0.2% | 4.89 |
| DSConv [16] | 18 | 4 | 0.0% | 5.88 |
| DSConv [16] | 18 | 3 | 0.8% | 4.55 |
| Ours | 50 | 6×6 5×15 4×14 3×15 2×3 | 0.6% | 11.89 |
| DSConv [16] | 50 | 4 | 0.0% | 14.54 |
| DSConv [16] | 50 | 3 | 0.9% | 11.74 |
| HAQ [29] | 50 | flexible | 0.0% | 12.14 |
TABLE IV: Results of our method using the ImageNet dataset with the ResNet architecture.

As with ReLeQ’s method, HAQ’s method has a weaker constraint on bit distribution, which means it would be able to find configurations that our method would not;However, even with our very strong constraint, we were still able to find configurations that are competitive in memory requirements to those found by HAQ. This shows the strength of the conclusion that later layers require lower precision than earlier layers to maintain the same accuracy.

V Conclusion

In this paper, we demonstrated that a uniform distribution of bit-widths throughout a CNN is likely not the most efficient way to quantize a neural network. To demonstrate this, we used a Multi-Task Gaussian Process prior over different training epochs, together with a Bayesian Optimization exploration procedure based on information maximization, to estimate the accuracy of different configurations.

We observed that starting a CNN with higher bit-widths and decreasing the precision in later layers yields better accuracy and better memory usage than the traditional uniformly distributed bit-width. This can be interpreted either numerically (less error is propagated down the network) or in terms of the functionality of each layer (earlier layers are concerned with feature extraction, later layers with classification).

VI Acknowledgements

This research was supported by Intel and the EPSRC, and we thank our colleagues from the Programmable Solutions Group who greatly assisted in this work.

References

  • [1] B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167. Cited by: §II.
  • [2] J. Bergstra, D. Yamins, and D. D. Cox (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, Vol. 28, pp. I–115. Cited by: §II.
  • [3] E. V. Bonilla, K. M. Chai, and C. Williams (2008) Multi-task gaussian process prediction. In Advances in neural information processing systems, pp. 153–160. Cited by: §III-B.
  • [4] Z. Cai, X. He, J. Sun, and N. Vasconcelos (2017) Deep learning with low precision by half-wave gaussian quantization. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). External Links: ISBN 9781538604571, Document. Cited by: §II.
  • [5] K. M. Chai (2010) Multi-task learning with gaussian processes. Ph.D. Thesis, The University of Edinburgh. Cited by: §III-B.
  • [6] A. T. Elthakeb, P. Pilligundla, A. Yazdanbakhsh, S. Kinzer, and H. Esmaeilzadeh (2018) Releq: a reinforcement learning approach for deep quantization of neural networks. arXiv preprint arXiv:1811.01704. Cited by: §II, §IV, TABLE III.
  • [7] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter (2015) Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems 28, pp. 2962–2970. Cited by: §II.
  • [8] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §II.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: TABLE I.
  • [10] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2017) Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898. Cited by: §I, §II.
  • [11] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing (2018) Neural architecture search with bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pp. 2016–2025. Cited by: §II.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV.
  • [13] H. Kitano (1990) Designing neural networks using genetic algorithms with graph generation system. Complex Systems 4 (4), pp. 461–476. Cited by: §II.
  • [14] X. Lin, C. Zhao, and W. Pan (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems 30, pp. 345–353. Cited by: §II.
  • [15] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu (2017) Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436. Cited by: §II.
  • [16] M. G. d. Nascimento, R. Fawcett, and V. A. Prisacariu (2019) DSConv: efficient convolution operator. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5148–5157. Cited by: §I, §II, Fig. 3, §IV, TABLE IV.
  • [17] C. Olah, A. Mordvintsev, and L. Schubert (2017) Feature visualization. Distill. Note: https://distill.pub/2017/feature-visualization External Links: Document Cited by: §I.
  • [18] C. E. Rasmussen and C. K. I. Williams (2005) Gaussian processes for machine learning (adaptive computation and machine learning). The MIT Press. External Links: ISBN 026218253X Cited by: §II.
  • [19] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §IV.
  • [20] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: TABLE I.
  • [21] J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959. Cited by: §II.
  • [22] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. Patwary, M. Prabhat, and R. Adams (2015) Scalable bayesian optimization using deep neural networks. In International conference on machine learning, pp. 2171–2180. Cited by: §II.
  • [23] J. Song, Y. Chen, and Y. Yue (2019) A general framework for multi-fidelity bayesian optimization with gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3158–3167. Cited by: §III-B.
  • [24] K. O. Stanley and R. Miikkulainen (2002) Evolving neural networks through augmenting topologies. Evolutionary computation 10 (2), pp. 99–127. Cited by: §II.
  • [25] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: TABLE I.
  • [26] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820–2828. Cited by: §II.
  • [27] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown (2013) Auto-weka: combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, New York, NY, USA, pp. 847–855. External Links: ISBN 978-1-4503-2174-7, Link, Document Cited by: §II.
  • [28] V. Vanhoucke, A. Senior, and M. Z. Mao (2011) Improving the speed of neural networks on cpus. In in Deep Learning and Unsupervised Feature Learning Workshop, NIPS, Cited by: §II.
  • [29] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620. Cited by: §II, TABLE IV.
  • [30] D. Zhang, J. Yang, D. Ye, and G. Hua (2018) Lq-nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382. Cited by: §I, §II.
  • [31] Z. Zhong, J. Yan, W. Wu, J. Shao, and C. Liu (2018) Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423–2432. Cited by: §II.
  • [32] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §II.
  • [33] B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578. Cited by: §II.
  • [34] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8697–8710. Cited by: §II.