I Introduction
The strategy of quantizing neural networks to achieve fast inference has been a popular method of deploying neural networks in compute-constrained environments. Its benefits include significant memory savings, improved computational speed, and a decreased energy cost per inference. Many methods have used this family of strategies, quantizing down to anywhere between 8 bits and 2 bits with little loss in accuracy [10, 30]. It also bears noting that in most of these methods, after quantizing to very low precisions (1 to 5 bits), retraining is necessary to recover accuracy.
Although quantization algorithms have improved significantly in recent years, they have almost exclusively, and implicitly, assumed that the best strategy is to quantize all layers uniformly to the same precision. There are, however, two main reasons to believe otherwise: i) since different layers have been interpreted as extracting different levels of features [17], it follows that different layers might require different levels of precision; ii) viewing quantization as an approximation to the floating point (FP) version of the network suggests that lower error in the early layers reduces the propagation of errors down the whole network, minimizing any drop in accuracy. We believe that having a good strategy for distributing bits through the network, thereby eliminating redundant bits, is as important as having a good quantization strategy. The goal is then to find a configuration in the search space that uses the fewest bits and achieves the highest accuracy per bit used.
We cast this Neural Architecture Search (NAS) problem into the framework of hyperparameter search, since the bitwidth of each layer should ideally be found automatically. As with many NAS approaches, measuring the accuracy of a single configuration can take a considerable amount of time. To mitigate this issue, we propose a two-stage approach: first, we map the full search space into a lower dimensional counterpart through a parameterised constraint function, and second, we use a Multitask Gaussian Process to predict the accuracy at a higher epoch number from lower epoch numbers. This allows us to reduce both the complexity of the search space and the time required to determine the accuracy of a given configuration. Finally, as our Gaussian Process based approach is suitable for probabilistic inference, we use Bayesian Optimisation (BO) to explore and search the hyperparameter space of variable bitsize configurations.
For the quantization of the network, we use the DSConv method [16]. It achieves high accuracy without significant retraining, so the number of epochs needed for full training, and with it the predictive power required of our model, is minimised.
To summarise, our main contributions are as follows:

we cast NAS as hyperparameter search, which we apply to the problem of variable bitsize quantization;

we reduce the time needed to measure the accuracy of a proposed bit configuration considerably by using multitask GPs to infer future accuracy from current estimates;

we demonstrate performance across a broad range of configurations, described by Bezier curves and Chebyshev series.
The remaining sections are organised as follows: Section II reviews previous work on quantization and hyperparameter search. Section III elaborates on the methodology used for search, including the constraint, exploration, and sampling procedures. Section IV presents the results achieved on the CIFAR-10 and ImageNet datasets. Section V draws conclusions and considers insights from the paper.
II Related Work
Neural Architecture / Hyperparameter Search
One can consider finding bit distributions as a form of model selection [18], given its complexity and the limits on the parameters that it accepts as a solution. Previous methods have predominantly used Reinforcement Learning (RL) and Evolutionary Algorithms (EA) to model the search, which is referred to in the literature as Neural Architecture Search. Examples include NASNet [34], MnasNet [26], ReLeQ [6], and HAQ [29], among others [1, 31] for RL and [33, 13, 15, 24] for EA. Our work overlaps with these papers only in the goal of finding an optimal strategy given a search space. ReLeQ and HAQ, to the best of our knowledge, are the only methods whose aim is to find the optimal bit distribution across the different layers of a network, and they are therefore the papers that overlap the most with our work. Notably, both use an RL-based approach to search for optimal bit distributions. However, HAQ is more focused on hardware-specific optimization, whereas both our method and ReLeQ attempt to be hardware-agnostic. Recently, some work involving Bayesian Optimization (BO) for model architecture selection has been carried out, with systems such as NASBOT [11]. One of the reasons why BO had not previously been used for model selection is that it is unclear how to define a measure of “distance” between two models, which is the main problem addressed by NASBOT.
Alternatively, one can see determining the bit distribution as tuning hyperparameters of a given model, i.e. no different from finding the optimal learning rate or weight decay. Historically, this has been tackled with BO techniques. In neural networks specifically, this was popularized by the work of [21], and followed by others [27, 2, 7, 22]. As a result, BO can be considered a natural method for searching for optimal bit distribution configurations.
Quantization
Quantization strategies can either be trained from scratch or derived from a pretrained network. The methods of [30, 10, 4, 32] initialize their networks from scratch. This ensures that there is no initial bias on the values of the parameters, and they can achieve the minimum difference in accuracy when extremely low bit values are used (1 bit / 2 bits), a notable exception being DoReFa-Net [32], which reportedly had slightly better results when quantizing starting from a pretrained network. The methods of [8, 14, 16, 28] quantize the network starting from a pretrained network. These methods start with a bias on the values of the parameters, which can limit how much of the lost accuracy they recover. A benefit of these methods, though, is that they can be quickly fine-tuned over a few epochs, re-achieving state-of-the-art results. These methods are more interesting to us because of their quick deployment cycle. It is worth noting that all of these methods use a uniform distribution of precision, meaning that all layers are quantized to the same number of bits.
III Method
Our method consists of three parts: constraining, exploring, and sampling the search space. We first constrain the search space by assuming that the bitwidth of the next layer depends to some extent on the bitwidth used in the current layer. We do this by drawing bit distributions from a low-degree polynomial (in the experiments we use a linear Bezier curve and a fourth-order Chebyshev series). Given a drawn distribution, we quantize the network using the DSConv method. We explore the space by placing a Gaussian prior over the polynomial parameters, and sampling / retraining the set of hyperparameters that gives the most information about the final payoff function. After exploring, we rank the configurations by sampling the GP for accuracy, and choose the ones that are most appropriate for our end use. Each of these phases is explained further in this section.
III-A Constraining the Space
When trying to find the bitwidth, from 1 to 8, for each layer, the search space has size $8^{L}$, where $L$ is the number of layers of the network. For a CNN of 50 layers, the search space contains $8^{50} \approx 10^{45}$ configurations, comparable in size to the search space of a game of chess. Algorithms have been developed that consistently beat chess grandmasters, but one episode of chess takes considerably less time than training and measuring the accuracy of a neural network. Therefore, exhaustively searching this space is prohibitive.
Our method for constraining the search space relies on the use of parameterised functions. We model a function of a given degree with a few hyperparameters, which describe the search space. From this function, we pick the bit distribution such that it follows the function's curve. In this way, a bit configuration for a network with any number of layers can be sampled from a few hyperparameters alone.
We use two parameterised functions to illustrate our solution:

We define the Bezier function for $t \in [0, 1]$ as $f(t; \mathbf{w}) = \mathbf{w}^{\top}\boldsymbol{\phi}(t)$, where $n$ is the degree of the polynomial and $\boldsymbol{\phi}(t)$ is the feature vector of the Bezier curve built from the Bernstein basis, i.e. $\boldsymbol{\phi}(t) = [(1-t),\; t]^{\top}$ for the Linear Bezier, $\boldsymbol{\phi}(t) = [(1-t)^{2},\; 2t(1-t),\; t^{2}]^{\top}$ for the Quadratic Bezier, etc.

We define the modified Chebyshev function for $t \in [0, 1]$ as $f(t; \mathbf{w}) = \mathbf{w}^{\top}\boldsymbol{\phi}(t)$, where $n$ is the degree of the polynomial. The feature vector is $\boldsymbol{\phi}(t) = [T_{0}(t),\; T_{1}(t),\; \dots,\; T_{n}(t)]^{\top}$, where the Chebyshev polynomials satisfy $T_{0}(t) = 1$, $T_{1}(t) = t$, and $T_{i+1}(t) = 2tT_{i}(t) - T_{i-1}(t)$.
The constraint function is then a clamped and rounded version of the chosen polynomial $f(t; \mathbf{w})$, such that the bits $b_{l}$ generated for each layer $l$ are integers between 1 and 8. We define $c(t; \mathbf{w}) = \mathrm{round}\!\left(\mathrm{clamp}\!\left(f(t; \mathbf{w}),\, 1,\, 8\right)\right)$, where $\mathrm{round}(\cdot)$ is the rounding function, and the bit for layer $l$ of an $L$-layer network is $b_{l} = c(l/L; \mathbf{w})$.
Fig. 2 shows an example of a Chebyshev function and its clamped version. The y-axis on the left indicates the value of the Chebyshev function for different values of $t$. This is then clamped, rounded, and scaled so that it becomes a discontinuous line representing the bit chosen for each layer of a CNN. The bit value is indicated on the y-axis on the right.
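To make the constraint concrete, the sketch below samples a per-layer bit configuration from either a linear Bezier curve or a Chebyshev series, then clamps and rounds it into the 1-8 range. This is a minimal illustration of the idea rather than the paper's implementation; the function names and the choice of evaluating the curve at evenly spaced points per layer are our own assumptions.

```python
import numpy as np

def bezier_linear(t, w):
    # Degree-1 Bezier (Bernstein basis): f(t) = w0*(1-t) + w1*t
    return w[0] * (1.0 - t) + w[1] * t

def chebyshev_series(t, w):
    # Chebyshev series of degree len(w)-1, evaluated on t rescaled to [-1, 1]
    x = 2.0 * t - 1.0
    return np.polynomial.chebyshev.chebval(x, w)

def bits_from_curve(w, num_layers, curve=bezier_linear):
    # Evaluate the curve at one point per layer, then clamp to [1, 8] and round,
    # so that each layer receives an integer bitwidth.
    t = np.linspace(0.0, 1.0, num_layers)
    f = curve(t, np.asarray(w, dtype=float))
    return np.clip(np.rint(f), 1, 8).astype(int)

# Example: a decreasing linear Bezier curve for a 13-layer VGG-style network.
print(bits_from_curve([6.0, 1.0], num_layers=13))                 # e.g. [6 6 5 5 ... 1]
print(bits_from_curve([4.0, 0.5, -0.5, 0.2, 0.1], 13, chebyshev_series))
```

With this parameterisation, a whole bit configuration is determined by the two (Bezier) or five (fourth-order Chebyshev) entries of w, regardless of how many layers the network has.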
By constraining the search space in this way, the minimization problem shifts from the naïve formulation, a search over all per-layer bit assignments $\mathbf{b} \in \{1, \dots, 8\}^{L}$ subject to the accuracy constraint, to a search over the polynomial parameters $\mathbf{w}$ alone, subject to the same constraint, with the bits recovered through $c(\cdot; \mathbf{w})$.
The search then reduces to finding the parameters of the polynomial basis w, which in turn define the bit distribution throughout the layers. The search space becomes continuous and compatible with GPs, and is significantly reduced to only $n+1$ dimensions. Using this parameterisation, we can easily define a distance metric between configurations to be used when calculating the kernel function and predictive distribution of our Gaussian Process. With this setup, the search space can be explored sufficiently in a timely manner.
Quantization Strategy
As mentioned previously, the method used for quantizing the CNN is DSConv. This choice was made because our aim is to minimize the time taken during training, and DSConv has consistently shown good accuracy, even when models are not retrained. This can also be seen as a time constraint on the search, such that minimal training time is needed to achieve meaningful accuracy estimates.
In this method, both the activations and the weights are quantized, so that fast inference is possible. Each weight tensor is decomposed into two tensors, the Variable Quantized Kernel (VQK) and the Kernel Distribution Shift (KDS), both of which are divided depthwise into blocks of a fixed size. The VQK stores integer values in two's complement, so that each value lies in the range $[-2^{b-1}, 2^{b-1}-1]$, where $b$ is the number of bits in that layer. It has the same size as the FP32 weight tensor, and its values are found by simply scaling each block of the original FP32 weight tensor into the integer range, and then flooring and clamping to range. The KDS is a tensor shallower by a factor of the block size, holding FP32/16 scaling values, each of which corresponds to a block of the VQK. Their values are calculated by minimizing the L2 norm between each scaled VQK block and the corresponding block of the original tensor. The idea is that, at the end of the quantization process, the KDS multiplied by the VQK, block by block, is as similar as possible to the original FP32 weight tensor. The activation tensor is quantized similarly, but using the Block Floating Point (BFP) format in each of the blocks instead.
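The following sketch illustrates the flavour of this block quantization on a 1-D slice of weights: integer values are produced by scaling, rounding and clamping into the two's-complement range, and one FP scale per block is then fitted by least squares. It is a simplified stand-in for DSConv rather than the authors' implementation; the function names, the per-block max-magnitude scaling, and the 1-D block layout are our assumptions.

```python
import numpy as np

def dsconv_like_quantize(w, bits, block_size=128):
    """Quantize a 1-D depthwise slice of weights into integer values (VQK-like)
    plus one FP scale per block (KDS-like)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    vqk = np.zeros(len(w), dtype=np.int32)
    kds = []
    for start in range(0, len(w), block_size):
        blk = w[start:start + block_size]
        # Map the block's largest magnitude onto the integer range, then round
        # and clamp (one plausible choice; the paper describes scaling,
        # flooring and clamping to range).
        scale = qmax / (np.abs(blk).max() + 1e-12)
        q = np.clip(np.round(blk * scale), qmin, qmax)
        # Per-block scale fitted in closed form: xi = <w, q> / <q, q>,
        # which minimizes ||w - xi * q||_2 over xi.
        xi = float(blk @ q) / (float(q @ q) + 1e-12)
        vqk[start:start + block_size] = q.astype(np.int32)
        kds.append(xi)
    return vqk, np.array(kds, dtype=np.float32)

# Example: quantize 256 weights to 3 bits in blocks of 128.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=256).astype(np.float32)
vqk, kds = dsconv_like_quantize(w, bits=3, block_size=128)
recon = vqk.reshape(2, 128) * kds[:, None]  # blockwise reconstruction of the FP tensor
```

The per-layer bitwidth `bits` is exactly the quantity chosen by the constraint function of Section III-A, which is what couples the search to the quantizer.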
In order to take advantage of the low bit multiplication speed, the activation tensor and the weight tensor need to have the same precision. Figure 3 shows how this is done. The activation tensor prior to a convolution layer is set to be quantized to the same bit precision as that layer. The first convolution is not quantized since the input image is already in uint8 format. In this way, a quantization distribution strategy can be fully defined by providing the precision on each of the layers. Also note that we quantize only the convolutional layers. The Fully Connected layers are all left in the original FP32 precision for training.
This quantization strategy has shown good properties with little to no retraining. Since our goal is to show that an uneven distribution of precision through the layers of a network is a better quantization strategy, we exploit the fact that DSConv needs little to no retraining in order to accelerate our search algorithm. Whereas many other algorithms need numerous epochs to approach their optimum accuracy, DSConv needs only a few iterations before meaningful results are achieved.
III-B Exploring the Space
Next, we need a way of exploring the space in order to learn the accuracy of the network given a limited set of w points. We propose a Multi-Task Gaussian Process prior over the network's accuracy, such that each task corresponds to estimating the accuracy of the quantized neural network given the parameters w after a certain number of epochs, e.g. task 1 corresponds to 0 epochs, task 2 to 1 epoch, task 3 to 2 epochs, and task 4 to 15 epochs. Let there be $T$ tasks, and place a zero-mean GP prior on the latent functions $f_{t}(\mathbf{w})$, $t = 1, \dots, T$. We also place a probability distribution over the different tasks. Let $y_{t,i}$ be the observation at hyperparameter value $\mathbf{w}_{i}$ for task $t$, and let $\epsilon_{t} \sim \mathcal{N}(0, \sigma_{t}^{2})$ be the observation noise, which is normally distributed. This defines independent Gaussian likelihoods $y_{t,i} = f_{t}(\mathbf{w}_{i}) + \epsilon_{t}$. From this model, observations are drawn such that $y_{t,i}$ is the $i$-th observation of the $t$-th task. We used the Intrinsic Correlation Model (ICM) of [5] and [3] for the kernel calculation (in our experiments we made use of the squared exponential kernel). We can then define the mean and the correlation between tasks as:
$$\mathbb{E}\left[f_{t}(\mathbf{w})\right] = 0, \qquad \operatorname{cov}\left(f_{t}(\mathbf{w}),\, f_{t'}(\mathbf{w}')\right) = K^{f}_{t t'}\, k^{x}(\mathbf{w}, \mathbf{w}') \qquad (1)$$

where $K^{f}$ and $k^{x}$ are positive semi-definite, corresponding to the correlation between the functions and the correlation between the inputs respectively. From this it follows that the covariance of the stacked observations is $\Sigma = K^{f} \otimes K^{x} + D \otimes I$, where $\otimes$ is the Kronecker product, $K^{f}$ is the $T \times T$ matrix of correlations between the functions and $K^{x}$ is the $N \times N$ matrix of correlations between the inputs. For a new data point $\mathbf{w}_{*}$, the mean prediction for task $t$ can then be calculated using the normal formula for the predictive distribution:

$$\bar{f}_{t}(\mathbf{w}_{*}) = \left(\mathbf{k}^{f}_{t} \otimes \mathbf{k}^{x}_{*}\right)^{\top} \Sigma^{-1}\, \mathbf{y} \qquad (2)$$

where $\mathbf{k}^{f}_{t}$ is the $t$-th column of $K^{f}$, $\mathbf{k}^{x}_{*}$ is the vector of covariances between $\mathbf{w}_{*}$ and the training inputs, and $D$ is a $T \times T$ diagonal matrix whose $(t,t)$ entry is $\sigma_{t}^{2}$.
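As a concrete illustration of Eq. (2), the sketch below computes the ICM predictive mean with a squared exponential input kernel, stacking the observations task by task so that the covariance is $K^{f} \otimes K^{x} + D \otimes I$. It is a toy numpy sketch with made-up data, not the code used in the paper; all names are our own.

```python
import numpy as np

def sq_exp(A, B, lengthscale=1.0):
    # Squared-exponential kernel between two sets of hyperparameter vectors.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def icm_predict(X, Y, Kf, noise, x_star, task):
    """Predictive mean of an ICM multi-task GP (zero prior mean).
    X: (N, d) training inputs shared by all tasks; Y: (T, N) observations;
    Kf: (T, T) task-correlation matrix; noise: (T,) per-task variances."""
    T, N = Y.shape
    Kx = sq_exp(X, X)                                   # input correlations
    Sigma = np.kron(Kf, Kx) + np.kron(np.diag(noise), np.eye(N))
    y = Y.reshape(-1)                                   # stack observations task-by-task
    kx_star = sq_exp(X, x_star[None, :])[:, 0]          # (N,) cross-covariances
    k_star = np.kron(Kf[:, task], kx_star)              # (T*N,) ICM cross-covariance
    return k_star @ np.linalg.solve(Sigma, y)

# Toy example: 2 tasks (e.g. accuracy after 1 and after 15 epochs), 5 observed w's.
X = np.random.rand(5, 2)                 # Bezier parameters w of observed configurations
Y = np.vstack([np.sin(X.sum(1)), np.sin(X.sum(1)) + 0.1])   # fake per-task accuracies
Kf = np.array([[1.0, 0.9], [0.9, 1.0]])  # assumed strong correlation between the tasks
mean_last_task = icm_predict(X, Y, Kf, noise=np.array([1e-2, 1e-2]),
                             x_star=np.array([0.3, 0.7]), task=1)
```

The off-diagonal entries of Kf are what allow cheap, low-epoch observations to sharpen the prediction for the expensive final task.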
Figure 4 shows an example of the Multi-Task setting with a 1D Bezier curve for ease of visualization. Each plot shows the predictive mean and variance for each epoch after 14 data points have been collected, using the exploration algorithm explained in Section III-B, e.g. Task 1 was set to 0 epochs (i.e. quantizing straight from the FP32 model), Task 2 was set to 1 epoch, Task 3 to 2 epochs, and Task 4 to 15 epochs. The idea is to predict the distribution of the last task given inputs to the earlier tasks.

Exploration Phase
In order to decide which parameters to choose, we need to explore the space so as to predict the accuracy of the last task. The exploration phase for the multi-task Gaussian Process follows the low-fidelity search of [23]. The idea is to find the action, a pair of hyperparameter value $\mathbf{w}$ and fidelity (task), that gives us maximal information about the payoff function given the observation history $\mathbf{y}$. It is important to weight this information by a measure of the cost it takes to perform that operation, so the exploration procedure chooses the action that maximizes information per unit cost. This means that the action which carries the most information about the payoff function is picked.
Depending on the dataset and model chosen, the user can favour exploration of one fidelity over another by decreasing the cost assigned to that particular task. Additionally, we set a budget on the amount of exploration, expressed either in unit cost or in the number of architectures we are willing to evaluate. The exploration phase finishes when the budget has been fully used. After it finishes, the user can run their preferred method of ranking configurations using the posterior of the trained GP.
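A rough sketch of such a budgeted loop is shown below. For brevity it scores each candidate action by the Gaussian information gained from observing it (derived from its posterior variance under the ICM covariance), divided by the task's cost; this is a simplified proxy for the information-about-the-payoff criterion of [23], not the paper's exact acquisition, and all function and variable names are our own.

```python
import numpy as np

def action_cov(a, b, Kf, ls=1.0):
    # ICM covariance between two actions a = (w, task) and b = (w', task').
    (wa, ta), (wb, tb) = a, b
    d2 = np.sum((np.asarray(wa) - np.asarray(wb)) ** 2)
    return Kf[ta, tb] * np.exp(-0.5 * d2 / ls ** 2)

def explore(candidates, Kf, noise, costs, budget, run_fn, ls=1.0):
    """Budgeted, cost-aware exploration: repeatedly pick the action (w, task)
    with the largest information gain per unit cost, evaluate it with run_fn
    (e.g. quantize and briefly train the network), stop when the budget is spent."""
    actions = [(w, t) for w in candidates for t in range(Kf.shape[0])]
    observed, y, spent = [], [], 0.0
    while spent < budget and len(observed) < len(actions):
        scores = []
        for a in actions:
            if a in observed:
                scores.append(-np.inf)
                continue
            prior = action_cov(a, a, Kf, ls) + noise[a[1]]
            if observed:
                k = np.array([action_cov(a, o, Kf, ls) for o in observed])
                K = np.array([[action_cov(o, p, Kf, ls) for p in observed] for o in observed])
                K += np.diag([noise[o[1]] for o in observed])
                post_var = prior - k @ np.linalg.solve(K, k)
            else:
                post_var = prior
            # Gaussian information gain of observing this action, per unit cost
            # (note: the GP variance does not depend on the observed values).
            scores.append(0.5 * np.log(post_var / noise[a[1]]) / costs[a[1]])
        best = actions[int(np.argmax(scores))]
        observed.append(best)
        y.append(run_fn(*best))
        spent += costs[best[1]]
    return observed, y

# Example: two fidelities (1 epoch vs 15 epochs), the cheap one costing 1 unit.
cands = [(6.0, 1.0), (4.0, 4.0), (1.0, 6.0)]
Kf = np.array([[1.0, 0.9], [0.9, 1.0]])
picked, accs = explore(cands, Kf, noise=[1e-2, 1e-2], costs=[1.0, 10.0],
                       budget=15.0, run_fn=lambda w, t: 0.9)  # stub evaluator
```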
III-C Sampling the Space
The naïve goal is to find the highest accuracy per bit possible, which corresponds to finding the minimum of a loss based on that ratio. However, there is a trade-off to be considered: a model, e.g. ResNet20, using a total of 40 bits and achieving 80% accuracy (a ratio of 2%/bit) is arguably worse than a model that uses 43 bits and achieves 85% accuracy (a ratio of about 1.98%/bit). The goal is instead to find a decision procedure that takes into account the regret of not using more bits based on a set of constraints; this relationship should be linear rather than inversely proportional. A better strategy is to assume that using 4 bits for all layers is the best that can be done when quantizing without losing accuracy. Each bit used below this average should be rewarded, and each bit used above it should be penalised; this reward (or penalty) is added to (or subtracted from) the accuracy to give an “effective accuracy”. We define the effective accuracy as $A_{\mathrm{eff}} = A + \kappa\,(4 - \bar{b})$, where $A$ is the accuracy of the quantized network, $\bar{b}$ is the mean number of bits per layer, and $\kappa$ is a constant penalty per bit, which we set to 1%. Therefore, each bit used above the average of 4 bits incurs a penalty of 1% in the effective accuracy, and each bit used below it earns a reward of 1%. The decision procedure is then to minimize the negative effective accuracy, $-A_{\mathrm{eff}}$. Once we have enough information from the GPs, we can rank configurations based on this loss in order to pick the most relevant ones for our end use.
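The decision rule can be written in a few lines; the sketch below computes the effective accuracy from a configuration's mean bitwidth and ranks candidates by it. The helper name, the use of the mean bitwidth, and the example numbers are our reading of the rule above, not code from the paper.

```python
def effective_accuracy(top1, bits_per_layer, baseline=4.0, kappa=1.0):
    """Reward/penalise a configuration by kappa percentage points for every bit
    its mean bitwidth sits below/above the 4-bit baseline."""
    mean_bits = sum(bits_per_layer) / len(bits_per_layer)
    return top1 + kappa * (baseline - mean_bits)

# Rank candidate configurations by minimising the negative effective accuracy.
candidates = {
    (6, 5, 5, 4, 3, 2): 93.7,   # hypothetical (bit configuration, GP-estimated top-1 %)
    (4, 4, 4, 4, 4, 4): 93.8,
    (2, 2, 2, 2, 2, 2): 92.2,
}
ranked = sorted(candidates.items(),
                key=lambda kv: -effective_accuracy(kv[1], kv[0]))
```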
IV Experiments and Results
We tested our method on a variety of configurations, using versions of the original VGG, ResNet, and GoogLeNet models altered to take CIFAR-10 and ImageNet32 as input. For training on CIFAR-10 and ImageNet32, we used data augmentation by cropping a 32x32 image from the 4-pixel-padded original. We used a Stochastic Gradient Descent optimiser with momentum of 0.9 and weight decay. The learning rate was divided by 10 after 150 and 250 epochs. We ran the exploration procedure for each network using the multi-task algorithm outlined above. From these configurations, we could then use the posterior mean of the Gaussian Process to draw estimates of the accuracy of many different quantization schemes. Using the decision procedure outlined above, we sorted the results by accuracy, memory, or computational complexity, and selected points of interest to better visualise the general trend of the found configurations.
CNN       | Configuration               | GP Est. | Top-1 | Std  | Delta | # Bits | Size (MB)
VGG16     | 32-bit Floating Point       |         | 93.7% |      |       |        | 58.8
          | 6555443332211               | (95.5%) | 93.7% | 0.2% | 1.8%  | 50     | 4.84
          | 1122333445556               | (91.3%) | 87.7% | 0.2% | 3.6%  | 50     | 9.50
          | 7665544332221               | (92.1%) | 93.7% | 0.1% | 1.6%  | 50     | 5.26
          | 1222334455667               | (90.1%) | 91.5% | 0.4% | 1.4%  | 50     | 10.75
          | 4444444444444               | (93.3%) | 93.8% | 0.1% | 0.5%  | 52     | 8.28
          | 3333333333333               | (92.9%) | 93.5% | 0.2% | 0.6%  | 39     | 6.44
VGG19     | 32-bit Floating Point       |         | 93.9% |      |       |        | 80.1
          | 6555444433322211            | (94.4%) | 93.7% | 0.1% | 0.7%  | 54     | 6.95
          | 1122233344445556            | (91.6%) | 89.6% | 0.4% | 2.0%  | 54     | 12.04
          | 5444433333222211            | (93.9%) | 93.5% | 0.1% | 0.4%  | 46     | 6.14
          | 1122223333344445            | (90.3%) | 88.4% | 1.2% | 0.9%  | 46     | 10.05
          | 3333333333333333            | (92.9%) | 93.4% | 0.2% | 0.5%  | 48     | 8.76
          | 2222222222222222            | (92.1%) | 92.2% | 0.2% | 0.1%  | 32     | 6.25
ResNet18  | 32-bit Floating Point       |         | 95.4% |      |       |        | 44.6
          | 66655554444333322221        | (96.3%) | 95.4% | 0.1% | 0.9%  | 75     | 3.72
          | 12222333344445555666        | (95.9%) | 92.9% | 0.3% | 3.0%  | 75     | 8.00
          | 44444444333333333322        | (95.3%) | 95.3% | 0.1% | 0.0%  | 60     | 4.34
          | 22333333333344444444        | (94.5%) | 94.2% | 0.2% | 0.3%  | 66     | 6.08
          | 33333333333333333333        | (94.4%) | 95.0% | 0.1% | 0.6%  | 60     | 4.90
          | 22222222222222222222        | (93.1%) | 93.3% | 0.5% | 0.2%  | 40     | 3.49
GoogLeNet | 32-bit Floating Point       |         | 95.5% |      |       |        | 24.32
          | 4^21 3^27 2^16              | (94.7%) | 95.3% | 0.1% | 0.6%  | 207    | 2.35
          | 2^16 3^27 4^21              | (94.6%) | 94.2% | 0.1% | 0.4%  | 207    | 2.98
          | 6^8 5^13 4^12 3^13 2^13 1^5 | (95.8%) | 95.3% | 0.1% | 0.5%  | 231    | 2.65
          | 1^5 2^13 3^13 4^12 5^14 6^8 | (94.7%) | 90.5% | 0.2% | 4.2%  | 231    | 3.73
          | 3^64                        | (94.7%) | 95.1% | 0.1% | 0.4%  | 192    | 2.68
          | 2^64                        | (93.4%) | 93.5% | 0.2% | 0.1%  | 127    | 1.92

TABLE I: Results on accuracy using the CIFAR-10 dataset. Values in parentheses are GP-estimated Top-1 accuracies; Delta is the difference between the GP estimate and the fine-tuned Top-1. GoogLeNet configurations are written as bitwidth^(number of consecutive layers).

Results on Accuracy using the CIFAR-10 Dataset
Results on CIFAR-10 and ablation tests are displayed in Table I. These configurations were selected based on the decision procedure outlined above, using the linear Bezier parameterisation.
For comparison, we show six configurations for each network: the first and third were picked by the decision procedure outlined above; the second and fourth are simply the first and third in reverse order; and the last two use the traditional uniform distribution of bits for a fair comparison.
It is important to note that the decision to pick these configurations is based on the estimate from the GP rather than on the actual Top-1 results. In order to compare fairly, we also included the Top-1 scores after properly training each configuration for an additional 30 epochs, using the same hyperparameters and optimiser that were used to train the FP32 version of each of these networks. We have also included a delta column, which shows the difference between the Top-1 estimate from the GP and the Top-1 after fine-tuning the network. It is remarkable that most of the estimation error is within 1%, which shows that the GP was able to generalize and interpolate properly, as expected.
It can be seen that, in general, using more bits in earlier layers yields more accurate and lighter configurations. The higher accuracy can be explained numerically: since higher bitwidths are used in the earlier layers, the error propagated through the network is smaller. The lower memory usage is due to the fact that later layers have more channels, so using lower precision in those layers yields a large reduction in memory. For VGG16, the first configuration is lighter, faster, and more accurate than using 3 bits for all layers. This pattern is repeated for the deeper VGG19, where the first configuration yielded superior results to the constant 3 bits for all layers, and for ResNet18 as well. This “rule of thumb” is somewhat weaker for the GoogLeNet architecture, though there is still a clear correlation.
Results on Accuracy using Chebyshev Series
In order to test the robustness of the method with respect to the choice of prior function, we also used a fourth-degree Chebyshev series, which has a larger search space than the linear Bezier model. We tested this model on the CIFAR-10 dataset as well, and the results are shown in Table II.
As can be seen, the higher degree introduced more flexibility in the bit configurations the method is capable of finding. We found that with a higher polynomial degree, the number of architectures searched should also increase; in our experiments, we had to search more configurations before finding good results. The table shows the expected result that more bits at the beginning compensate for fewer bits at the end of the network. The ResNet18 result resembles the configuration found in Table IV, even though it found a configuration that makes more use of 3 bits and performs slightly worse. As also expected, when the bit distribution is inverted across the network, the result is both higher memory usage and lower accuracy.
The same behaviour is found with the VGGs, with the slight difference that, as VGG11 is shallower, it requires more bits to recover the accuracy. VGG16 is considerably deeper, and therefore our algorithm was able to compress it more significantly.
These results show that our method can be used with a variety of bases. It is worth bearing in mind that GP inference requires a matrix inversion, and the cost of the search grows with the degree of the polynomial chosen. Therefore, our method only works in a timely manner when the function is described by a small number of hyperparameters.
Method | Network  | Bitwidths
Ours   | ResNet18 |
Ours   | VGG11    |
Ours   | VGG16    |

TABLE II: Configurations found using the fourth-degree Chebyshev series on the CIFAR-10 dataset.
Results on Network Size
Figure 5 shows the GP-estimated accuracy of different configurations against their model size. The solid purple line links the uniform configurations, starting with all 1s and finishing with all 6s. Any point that lies above that line is therefore interesting, since it gives better accuracy while using the same amount of memory as its uniform counterpart. We have highlighted a number of interesting configurations with red stars and labelled them from A to M in order to better visualise what each point represents.
As can be seen in the figure, the choice of bit usage throughout the network plays an important role in both accuracy and memory usage. Even though there is a clear trend linking model size and accuracy, there are a handful of configurations which perform well on both fronts. In general, points above the purple line correspond to configurations whose bit usage decreases linearly through the layers, whereas those below the purple line correspond to configurations whose bit usage increases linearly.
The surprising result is that, in the CIFAR-10 experiments, even though using 1 bit uniformly for all layers performs poorly, just introducing a couple of extra bits in the first three quarters of the network (as in points A, C and F) increases memory only negligibly while recovering a significant amount of accuracy. Adding bits at the end of the network, however, has the opposite effect. It can also be noticed that points A, C, E, and F achieve better accuracy than the uniform 2-bit configuration whilst using 50% less memory. This is even more evident at point E, in which up to 6 bits are used in the first layers, yet memory usage is still lower because 1 bit is used for the larger kernels at the end of the network.
In the ImageNet32 experiments, we also see some improvement, albeit less dramatic than on CIFAR-10. The overall message is the same, as can be seen at points H, J and L, for which adding bits in the first layers achieves good accuracy with small memory increases. It is noteworthy that even with a dataset as challenging as ImageNet32, given its substantial reduction in information compared to the default ImageNet, the GP could find good configurations without needing more data points. This shows that the method is robust to changes in dataset.
Brief Comparison with ReLeQ
One of the other papers that has touched on this subject is ReLeQ [6]. As explained in the related work section, it uses a reinforcement learning approach to find optimum bit distributions over the network. Whilst its quantization methodology differs greatly from the one used in this paper, it is worth comparing its results to ours. ReLeQ's results on the CIFAR-10 dataset are shown in Table III. It can be seen that we achieve similar results for ResNet, though with different mean bitwidths. Since ReLeQ's method does not use the same constraint as ours, it can find more varied solutions. Our constraint is a limitation that allows our method to find solutions faster whilst using less computational power, but it reduces the freedom of choice.
Method    | Network  | Bitwidths             | Acc. drop |       |
ReLeQ [6] | ResNet20 | 822322232333222322228 | 0.12%     | 3.88  | 3.25
Ours      | ResNet20 | 66655554444333322221  | 0.1%      | 3.54  | 2.91
ReLeQ [6] | VGG11    | 858566668             | 0.17%     | 6.86  | 6.61
Ours      | VGG11    | 77766655              | 0.14%     | 6.35  | 5.42
ReLeQ [6] | VGG16    | 8886868686868688      | 0.1%      | 13.32 | 12.54
Ours      | VGG16    | 6555443332211         | 0.1%      | 4.62  | 3.74

TABLE III: Comparison with ReLeQ [6] on the CIFAR-10 dataset.
Results on ImageNet using ResNet
For completeness, we include some of the results found by our algorithm on the more challenging ImageNet dataset [19]. We decided to use this dataset in order to have results comparable with other methods. The networks were trained using the Adam optimizer [12].
Table IV shows the results. As expected, the same pattern of decreasing precision through the network holds across datasets. Comparing these results with the results from DSConv [16], we can see that a decreasing bitwidth throughout the architecture, starting with 6 bits and finishing with 2 bits, is superior to the “all 4s” and “all 3s” configurations.
Method      | # of Layers | Bitwidths              | Acc. drop | Size (MB)
Ours        | 18          | 6^3 5^5 4^5 3^5 2^2    | 0.2%      | 4.89
DSConv [16] | 18          | 4                      | 0.0%      | 5.88
DSConv [16] | 18          | 3                      | 0.8%      | 4.55
Ours        | 50          | 6^6 5^15 4^14 3^15 2^3 | 0.6%      | 11.89
DSConv [16] | 50          | 4                      | 0.0%      | 14.54
DSConv [16] | 50          | 3                      | 0.9%      | 11.74
HAQ [29]    | 50          | flexible               | 0.0%      | 12.14

TABLE IV: Results on ImageNet using ResNet. Configurations for “Ours” are written as bitwidth^(number of consecutive layers).
As with ReLeQ’s method, HAQ’s method has a weaker constraint on the bit distribution, which means it is able to find configurations that our method cannot. However, even with our very strong constraint, we were still able to find configurations that are competitive in memory requirements with those found by HAQ. This strengthens the conclusion that later layers require lower precision than earlier layers to maintain the same accuracy.
V Conclusion
In this paper, we demonstrated that a uniform distribution of bitwidths throughout a CNN is likely not the most efficient way to quantize a neural network. To demonstrate this, we used a Multi-Task Gaussian Process prior over different training epochs, together with a Bayesian Optimization exploration procedure based on information maximization, to estimate the accuracy of different configurations.
We observed that starting a CNN with higher bitwidths and decreasing the precision in later layers yields better accuracy and lower memory usage than the traditional uniformly distributed bitwidth. This can be interpreted either numerically (less error is propagated down the network) or in terms of the functionality of each layer (earlier layers are concerned with feature extraction while later layers are concerned with classification).
VI Acknowledgements
This research was supported by Intel and the EPSRC, and we thank our colleagues from the Programmable Solutions Group who greatly assisted in this work.
References
[1] (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167.

[2] (2013) Making a science of model search: hyperparameter optimization in hundreds of dimensions for vision architectures. In Proceedings of the 30th International Conference on Machine Learning, Vol. 28, pp. I-115.

[3] (2008) Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pp. 153-160.

[4] (2017) Deep learning with low precision by half-wave Gaussian quantization. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5] (2010) Multi-task learning with Gaussian processes. Ph.D. thesis, The University of Edinburgh.

[6] (2018) ReLeQ: a reinforcement learning approach for deep quantization of neural networks. arXiv preprint arXiv:1811.01704.

[7] (2015) Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems 28, pp. 2962-2970.

[8] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.

[9] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778.

[10] (2017) Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18 (1), pp. 6869-6898.

[11] (2018) Neural architecture search with Bayesian optimisation and optimal transport. In Advances in Neural Information Processing Systems, pp. 2016-2025.

[12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[13] (1990) Designing neural networks using genetic algorithms with graph generation system. Complex Systems 4 (4), pp. 461-476.

[14] (2017) Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems 30, pp. 345-353.

[15] (2017) Hierarchical representations for efficient architecture search. arXiv preprint arXiv:1711.00436.

[16] (2019) DSConv: efficient convolution operator. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5148-5157.

[17] (2017) Feature visualization. Distill. https://distill.pub/2017/feature-visualization

[18] (2005) Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.

[19] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211-252.

[20] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[21] (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951-2959.

[22] (2015) Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pp. 2171-2180.

[23] (2019) A general framework for multi-fidelity Bayesian optimization with Gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 3158-3167.

[24] (2002) Evolving neural networks through augmenting topologies. Evolutionary Computation 10 (2), pp. 99-127.

[25] (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9.

[26] (2019) MnasNet: platform-aware neural architecture search for mobile. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2820-2828.

[27] (2013) Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 847-855.

[28] (2011) Improving the speed of neural networks on CPUs. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS.

[29] (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612-8620.

[30] (2018) LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365-382.

[31] (2018) Practical block-wise neural network architecture generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2423-2432.

[32] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160.

[33] (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.

[34] (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697-8710.