1 Introduction
†The code of this work is available at https://github.com/sonysi/airesearch.

In recent years, convolutional neural networks (CNNs) have produced state-of-the-art results in many computer vision tasks, including image classification [14, 17, 20, 22, 38, 39], object detection [29, 36, 40] and semantic segmentation [31, 37]. Deploying these models on embedded devices is a challenging task due to limitations on available memory, computational power and power consumption. Many works address these issues using different methods, including pruning [16, 45, 47], efficient neural architecture design [14, 20, 24, 38], hardware and CNN co-design [14, 21, 43] and quantization [6, 13, 15, 23, 24, 46].

In this work, we focus on quantization, an approach in which the model is compressed by reducing the bit-widths of weights and activations. Besides reducing memory requirements, quantization usually also reduces latency and power consumption, depending on the specific hardware. The challenge of quantization is to reduce the model size without compromising its performance. For high compression rates, this is usually achieved by fine-tuning a pre-trained model for quantization. In addition, recent work in quantization has focused on making quantizers more hardware friendly (amenable to deployment on embedded devices) by restricting quantization schemes to be: per-tensor, uniform, symmetric and with thresholds that are powers of two
[24, 41].

Recently, mixed precision quantization was studied in [12, 41, 42, 44]. In these works, the bit-widths of weights and activations are not equal across the model and are learned during some optimization process. In [42], reinforcement learning is used, which requires training an agent that decides the bit-width of each layer. In [44], neural architecture search is used, which implies duplication of nodes in the network, so the size of the model grows proportionally to the size of the bit-width search space. Both of these methods limit the bit-width search space because of their computational cost. In [12], the bit-widths are not searched during training; rather, this method relies on the relationship between a layer's Hessian and its sensitivity to quantization.

An imperative requirement of many efficient edge hardware implementations is that their quantizers be symmetric, uniform and with power-of-two thresholds (see [24]). This removes the cost of special handling of zero points and real-valued scale factors. In this work, we introduce a novel quantization block we call the Hardware Friendly Mixed Precision Quantization Block (HMQ) that is designed to search over a finite set of quantization schemes that meet this requirement. HMQs utilize the Gumbel-Softmax estimator [25] in order to optimize over a categorical distribution whose samples correspond to quantization scheme parameters.
We propose a method, based on HMQs, in which both the bit-width and the quantizer's threshold are searched simultaneously. We present state-of-the-art results on MobileNetV1, MobileNetV2 and ResNet50 in most cases, in spite of the hardware friendly restrictions applied to the quantization schemes. Additionally, we present the first (that we know of) mixed precision quantization results for EfficientNetB0. In particular, our contributions are the following:

We introduce the HMQ, a novel, hardware friendly, mixed precision quantization block that enables a simple and efficient search for quantization parameters.

We present an optimization method, based on HMQs, for mixed precision quantization, in which we simultaneously search for both the bit-width and the threshold of each quantizer.

We present competitive and, in most cases, state-of-the-art results using our method to quantize the ResNet50, EfficientNetB0, MobileNetV1 and MobileNetV2 classification models on ImageNet.
2 Related Work
Quantization lies within an active area of research that tries to reduce memory requirements, power consumption and inference latencies of neural networks.
These works use techniques such as pruning, efficient network architectures and distillation (see e.g. [7, 14, 15, 18, 19, 20, 30, 34, 35, 38, 39, 48]).
Quantization is a key method in this area of research which compresses and accelerates the model by reducing the number of bits used to represent model weights and activations.
Quantization. Quantization techniques can be roughly divided into two families: post-training quantization techniques and quantization-aware training techniques. In post-training quantization, a trained model is quantized without retraining (see e.g. [1, 5]). In quantization-aware training, a model undergoes an optimization process during which it is quantized. A key challenge in this area of research is to compress the model without significant degradation to its accuracy. Post-training techniques suffer from a higher degradation in accuracy, especially at high compression rates.
Since the gradient of quantization functions is zero almost everywhere, most quantization-aware training techniques use the straight-through estimator (STE) [4] to estimate the gradients of quantization functions. These techniques mostly differ in their choice of quantizers, the quantizers' parametrization (thresholds, bit-widths, step size, etc.) and their training procedure. During training, the network weights are usually stored in full precision and are quantized before they are used in the feed-forward pass. The full-precision weights are then updated via back-propagation.
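This quantize-in-forward, update-in-full-precision scheme can be sketched in NumPy; `quantize` here stands in for any rounding quantizer and all values are illustrative:

```python
import numpy as np

def quantize(w, step=0.1):
    """Round weights to a uniform grid; its true gradient is zero almost everywhere."""
    return step * np.round(w / step)

def ste_grad(upstream_grad):
    """Straight-through estimator: treat the rounding as identity in backprop."""
    return upstream_grad

# Feed-forward uses quantized weights; back-propagation updates the
# full-precision master copy.
w_fp = np.array([0.23, -0.51, 0.08])    # full-precision master weights
w_q = quantize(w_fp)                    # weights used in the forward pass
grad = np.array([0.5, -1.0, 0.2])       # gradient w.r.t. the quantized weights
w_fp = w_fp - 0.1 * ste_grad(grad)      # SGD step on the master copy
```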
Uniform quantizers are an important family of quantizers that have several benefits from a hardware point of view (see e.g. [13, 24, 41]). Non-uniform quantizers include clustering, logarithmic quantization and others (see e.g. [3, 33, 46, 49]).
Mixed precision. Recent works on quantization produced state-of-the-art results using mixed precision quantization, that is, quantization in which the bit-widths are not constant across the model (weights and activations). In [42], reinforcement learning is used to determine bit-widths. In [12], second order gradient information is used to determine bit-widths; more precisely, the bit-widths are selected by ordering the network layers using this information. In [41], bit-widths are determined by learnable parameters whose gradients are estimated using the STE. This work focuses on the choice of parametrization of the quantizers and shows that the threshold (dynamic range) and step size are preferable over parametrizations that use bit-widths explicitly.
In [44], a mixed precision quantization-aware training technique is proposed in which the bit-width search is converted into a network architecture search (based on [27]). More precisely, in this solution, the search over all possible quantization schemes is, in fact, a search for a subgraph in a super net. The disadvantage of this approach is that the size of the super net grows substantially with every optional quantization edge/path that is added to it. In practice, this limits the architecture search space. Moreover, this work deals with bit-widths and thresholds as two separate problems, where the thresholds follow the solution in [8].
3 The HMQ Block
The Hardware Friendly Mixed Precision Quantization Block (HMQ) is a network block that learns, via standard SGD, a uniform and symmetric quantization scheme. The scheme is parametrized by a pair $(t, b)$ of threshold $t$ and bit-width $b$. During training, an HMQ searches for $(t, b)$ over a finite space $T \times B$. In this work, we make HMQs "hardware friendly" by also forcing their thresholds to be powers of two. We do this by restricting

$$T = \left\{2^{M-i}\right\}_{i=0}^{7} \tag{1}$$

where $M$ is an integer we configure per HMQ (see Section 4).
The step size $\Delta$ of a uniform quantization scheme is the (constant) gap between any two adjacent quantization points. $\Delta$ is parametrized by the threshold $t$ and bit-width $b$ differently for a signed quantizer, where $\Delta = \frac{2t}{2^b}$, and an unsigned one, where $\Delta = \frac{t}{2^b}$. Note that $\Delta$ ties the bit-width and threshold values into a single parameter but is not uniquely defined by them. The definition of the quantizer that we use in this work is similar to the one in [24]. The signed version of a quantizer of an HMQ is defined as follows:

$$Q(x, \Delta, t) = \Delta \cdot \mathrm{clip}\!\left(\left\lfloor \frac{x}{\Delta} \right\rceil, -\frac{t}{\Delta}, \frac{t}{\Delta} - 1\right) \tag{2}$$

where $\mathrm{clip}(x, a, b) = \min(\max(x, a), b)$ and $\lfloor \cdot \rceil$ is the rounding function. Similarly, the unsigned version is defined as follows:

$$Q(x, \Delta, t) = \Delta \cdot \mathrm{clip}\!\left(\left\lfloor \frac{x}{\Delta} \right\rceil, 0, \frac{t}{\Delta} - 1\right) \tag{3}$$
In the rest of this section, we assume that the quantizer of an HMQ is signed, but the discussion applies to both signed and unsigned quantizers.
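A minimal NumPy sketch of the signed and unsigned quantizers, with the step size derived from the threshold t and bit-width b as described above (the input values are illustrative):

```python
import numpy as np

def quantize_signed(x, t, b):
    """Signed uniform symmetric quantizer with threshold t and bit-width b."""
    delta = 2.0 * t / 2 ** b                               # step size, signed range
    q = np.clip(np.round(x / delta), -2 ** (b - 1), 2 ** (b - 1) - 1)
    return delta * q

def quantize_unsigned(x, t, b):
    """Unsigned uniform quantizer covering [0, t)."""
    delta = t / 2 ** b
    q = np.clip(np.round(x / delta), 0, 2 ** b - 1)
    return delta * q

x = np.array([-1.7, -0.3, 0.0, 0.4, 2.5])
y = quantize_signed(x, t=2.0, b=8)      # outputs lie in [-2, 2 - delta]
```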
In order to search over a discrete set, the HMQ represents each pair $(t, b)$ of threshold and bit-width in $T \times B$ as a sample of a categorical random variable of the Gumbel-Softmax estimator (see [25, 32]). This enables the HMQ to search for a pair of threshold and bit-width. The Gumbel-Softmax is a continuous distribution on the simplex that approximates categorical samples. In our case, we use this approximation as a joint discrete probability distribution of thresholds and bit-widths on $T \times B$:

$$\hat{P}_{t,b} = \frac{\exp\!\left(\frac{\log \pi_{t,b} + g_{t,b}}{\tau}\right)}{\sum_{t' \in T} \sum_{b' \in B} \exp\!\left(\frac{\log \pi_{t',b'} + g_{t',b'}}{\tau}\right)} \tag{4}$$

where $\pi$ is a matrix of class probabilities whose entries $\pi_{t,b}$ correspond to pairs in $T \times B$, the $g_{t,b}$ are i.i.d. random variables drawn from Gumbel(0, 1) and $\tau$ is a softmax temperature value. We define $\pi_{t,b} = \frac{e^{\theta_{t,b}}}{\sum_{t',b'} e^{\theta_{t',b'}}}$, where $\theta$ is a matrix of trainable parameters $\theta_{t,b}$. This guarantees that the matrix $\pi$ forms a categorical distribution.
The quantizers in Equations 2 and 3 are well defined for any two real numbers $\Delta$ and $t$. During training, in the feed-forward pass, we sample $\hat{P}_{t,b}$ and use these samples in the approximation of a categorical choice. The HMQ parametrizes its quantizer using an expected step size $\hat{\Delta}$ and an expected threshold $\hat{t}$ that are defined as follows:

$$\hat{\Delta} = \sum_{t \in T} \sum_{b \in B} \hat{P}_{t,b} \cdot \Delta(t, b) \tag{5}$$

$$\hat{t} = \sum_{t \in T} \hat{P}_t \cdot t \tag{6}$$

where $\hat{P}_t = \sum_{b \in B} \hat{P}_{t,b}$ is the marginal distribution of thresholds and $\Delta(t, b)$ is the step size implied by the pair $(t, b)$. In back-propagation, the gradients of rounding operations are estimated using the STE, and the rest of the block, i.e. Equations 4, 5 and 6, is differentiable. This implies that the HMQ smoothly updates its trainable parameters which, in turn, smoothly updates the estimated bit-width and threshold values of the quantization scheme. Figure 1 shows examples of HMQ quantization schemes during training. During inference, the HMQ's quantizer is parametrized by the pair $(t, b)$ that corresponds to the maximal trainable parameter $\theta_{t,b}$.
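The HMQ feed-forward described above (Gumbel-Softmax sampling followed by the expected step size and threshold) can be sketched in NumPy; the thresholds, bit-widths and parameter initialization below are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax_probs(theta, tau):
    """Soft categorical sample over (threshold, bit-width) pairs, as in Eq. (4).

    theta is the matrix of trainable HMQ parameters; the class
    probabilities pi are its softmax.
    """
    pi = np.exp(theta - theta.max())
    pi = pi / pi.sum()                               # class probabilities
    g = rng.gumbel(size=theta.shape)                 # i.i.d. Gumbel(0, 1) noise
    logits = (np.log(pi) + g) / tau
    e = np.exp(logits - logits.max())                # numerically stable softmax
    return e / e.sum()

# Illustrative search space: 3 thresholds (rows) x 3 bit-widths (columns).
thresholds = np.array([1.0, 2.0, 4.0])
bitwidths = np.array([2, 4, 8])
theta = np.zeros((3, 3))                             # trainable HMQ parameters

P = gumbel_softmax_probs(theta, tau=0.5)
t_hat = (P.sum(axis=1) * thresholds).sum()           # expected threshold (Eq. 6)
delta = 2.0 * thresholds[:, None] / 2.0 ** bitwidths[None, :]  # signed step sizes
delta_hat = (P * delta).sum()                        # expected step size (Eq. 5)
```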
Note that the temperature parameter $\tau$ of the Gumbel-Softmax estimator in Equation 4 has a dual effect during training. As $\tau$ approaches zero, in addition to approximating a categorical choice of a unique pair $(t, b)$, smaller values of $\tau$ also incur a larger variance of gradients, which adds instability to the optimization process. This problem is mitigated by annealing $\tau$ (see Section 4).

4 Optimization Process
In this section, we present a fine-tuning optimization process that is applied to a full precision, 32-bit floating point, pre-trained model after adding HMQs.
Throughout this work, we use the term model weights (or simply weights) to refer to all of the trainable model weights, not including the HMQ parameters.
We denote by $\{W_i\}$ the set of weight tensors to be quantized, by $\{A_i\}$ the set of activation tensors to be quantized and by $\Theta$ the set of HMQ parameters. Given a tensor $X$, we use the notation $|X|$ to denote the number of entries in $X$.
From a high level view, our optimization process consists of two phases. In the first phase, we simultaneously train both the model weights and the HMQ parameters. We take different approaches for the quantization of weights and activations; these are described in Sections 4.1 and 4.2. We split the first phase into cycles with an equal number of epochs each. In each cycle of the first phase, we reset the Gumbel-Softmax temperature in Equation 4 and anneal it until the end of the cycle. In the second phase of the optimization process, we fine-tune only the model weights. During this phase, similarly to the behaviour of HMQs during inference, the quantizer of every HMQ is parametrized by the pair of threshold and bit-width that corresponds to the maximal trainable parameter learnt in the first phase.

4.1 Weight Compression
Let $W$ be an input tensor of weights to be quantized by some HMQ. We define the set of thresholds $T$ in the search space of the HMQ by setting $M$ in Equation 1 to be the smallest integer such that $2^M \geq \max_{w \in W} |w|$. The bit-width set $B$ is different per experiment (see Section 5).
Denote by $\Theta_W \subseteq \Theta$ the subset containing all of the parameters of HMQs quantizing weights. The expected weight compression rate induced by the values of $\Theta_W$ is defined as follows:

$$R(\Theta_W) = \frac{32 \sum_i |W_i|}{\sum_i \mathbb{E}[b_i] \cdot |W_i|} \tag{7}$$

where $W_i$ is a tensor of weights and $\mathbb{E}[b_i] = \sum_{b \in B} \hat{P}_b \cdot b$ is the expected bit-width of $W_i$, with $\hat{P}_b$ the bit-width marginal distribution in the Gumbel-Softmax estimation of the corresponding HMQ. In other words, assuming that all of the model weights are quantized by HMQs, the numerator is the memory requirement of the weights of the model before compression and the denominator is the expected memory requirement during training.
During the first phase of the optimization process, we optimize the model with respect to a target weight compression rate $T_{WCR}$, by minimizing (via standard SGD) the following loss function:

$$\bar{L} = L_{task} + \lambda \cdot J_W \tag{8}$$

where $L_{task}$ is the original, task specific loss, e.g. the standard cross entropy loss, $J_W$ is a loss with respect to the target compression rate and $\lambda$ is a hyper-parameter that controls the trade-off between the two. We define $J_W$ as follows:

$$J_W = \big(\max\left(0,\; T_{WCR} - R(\Theta_W)\right)\big)^2 \tag{9}$$
In practice, we gradually increase the target compression rate during the first few cycles in the first phase of our optimization process. This approach of gradual quantization is widely used, see e.g. [2, 10, 12, 49]. In most cases, layers are gradually added to the training process, whereas in our process we gradually decrease the bit-width across the whole model, albeit with mixed precision.
By the definition of the loss term in Equation 9, if the target weight compression rate is met during training, then its gradients with respect to the parameters of the weight HMQs are zero and the task specific loss alone determines the gradients. In our experiments, the actual compression obtained for a specific target compression rate depends on the trade-off hyper-parameter and on the sensitivity of the architecture to quantization.
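The expected weight compression rate and its loss term can be sketched as follows; the tensor sizes and expected bit-widths are illustrative values:

```python
import numpy as np

def expected_compression_rate(sizes, expected_bits):
    """32-bit baseline size over the expected quantized size (Eq. 7)."""
    sizes = np.asarray(sizes, dtype=float)
    return 32.0 * sizes.sum() / (np.asarray(expected_bits) * sizes).sum()

def compression_loss(rate, target):
    """Penalize only while the target compression rate is not yet met (Eq. 9)."""
    return max(0.0, target - rate) ** 2

sizes = [1000, 5000, 2000]          # number of entries per weight tensor
expected_bits = [6.5, 3.2, 4.0]     # E[b] from each weight HMQ's bit-width marginal
rate = expected_compression_rate(sizes, expected_bits)
loss = compression_loss(rate, target=8.0)
```

Once the expected rate exceeds the target, the loss (and hence its gradient) vanishes, matching the behaviour described above.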
4.2 Activations Compression
We define the set of thresholds $T$ in the search space of an HMQ that quantizes a tensor of activations similarly to HMQs quantizing weights: we set $M$ in Equation 1 to be the smallest integer such that $2^M$ is greater than or equal to the maximum absolute value of an activation of the pre-trained model over the entire training set.
The objective of activations compression is to fit any single activation tensor, after quantization, into a given memory budget (number of bits). This objective is inspired by the one in [41] and is especially useful for DNNs in which the operators in the computational graph induce a path graph, i.e. the operators are executed sequentially. We define the target activations compression rate to be

$$T_{ACR} = \frac{32 \cdot \max_i |A_i|}{U} \tag{10}$$

where $\{A_i\}$ are the activation tensors to be quantized and $U$ is the memory budget in bits. Note that $T_{ACR}$ implies the precise (maximum) number of bits of every feature map $A_i$:

$$b_i = \left\lfloor \frac{32 \cdot \max_j |A_j|}{T_{ACR} \cdot |A_i|} \right\rfloor \tag{11}$$

We assume that $b_i \geq 1$ for every feature map (otherwise, the requirement cannot be met and the memory budget should be increased) and fix $b_i$ in the search space of the HMQ that corresponds to $A_i$.
Note that this method can also be applied to models with a more complex computational graph, such as ResNet, by applying Equation 11 to blocks instead of single feature maps.
Note also that, by definition, the maximum bit-width of every activation is 8. We can therefore assume that the target activations compression rate is at least 4.
Here, the bit-width of every feature map is determined by Equation 11. This is in contrast to the approach in [41] (for activations compression) and to our approach for weight compression in Section 4.1, where the choice of bit-widths is the result of an SGD minimization process. This allows a more direct approach for the quantization of activations, in which we gradually increase the target activations compression rate during the first few cycles of the first phase of the optimization process. In this approach, while activation HMQs learn the thresholds, their bit-widths are implied by the target compression rate. In contrast to adding a target activations compression component to the loss, this both guarantees that the target compression of activations is obtained and simplifies the loss function of the optimization process.
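Under the definitions above, each activation tensor's bit-width follows directly from the target activation compression rate; a sketch with illustrative tensor sizes:

```python
import math

def activation_bitwidths(tensor_sizes, target_acr, max_bits=8):
    """Bit-width per activation tensor implied by the target activation
    compression rate, with the largest tensor at 32 bits as the baseline
    (Eq. 11); every quantized tensor then fits the same memory budget."""
    budget = 32 * max(tensor_sizes) / target_acr    # memory budget in bits
    return [min(max_bits, math.floor(budget / n)) for n in tensor_sizes]

sizes = [100352, 50176, 25088]      # entries per activation tensor (illustrative)
bits = activation_bitwidths(sizes, target_acr=8.0)
```

The largest tensor receives the fewest bits, while small tensors saturate at the 8-bit cap, exactly the behaviour the text describes.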
5 Experimental Results
In this section, we present results using HMQs to quantize various classification models.
As proof of concept, we first quantize ResNet18 [17] trained on CIFAR10 [26].
For the more challenging ImageNet [9] classification task, we present results quantizing ResNet50 [17], EfficientNetB0 [39], MobileNetV1 [20] and MobileNetV2 [38].
In all of our experiments, we perform our fine-tuning process on a full precision, 32-bit floating point, pre-trained model in which an HMQ is added after every weight and every activation tensor per layer, including the first and last layers, namely the input convolutional layer and the fully connected layer. The parameters of every HMQ are initialized as a categorical distribution in which the parameter that corresponds to the pair of the maximum threshold and the maximum bit-width is initialized to a fixed value, and the remaining probability mass is uniformly distributed between the rest of the parameters. The bit-width set $B$ in the search space of HMQs is set differently for CIFAR10 and ImageNet (see Sections 5.1 and 5.2). Note that in all of the experiments, in all of the weight HMQs, the maximal bit-width is 8 (similarly to activation HMQs). This implies a weight compression rate of at least 4 throughout the fine-tuning process.
The optimizer that we use in all of our experiments is RAdam [28].
We use different learning rates for the model weights and the HMQ parameters.
The data augmentation that we use during fine-tuning is the same as the one used to train the base models.
The entire process is split into two phases, as described in Section 4. The first phase is split into cycles with an equal number of epochs each. In each cycle, the temperature $\tau$ in Equation 4 is reset and then annealed until the end of the cycle. We update the temperature at regular intervals within a cycle, relative to the number of steps in a single epoch. The annealing function that we use is similar to the one in [25]:

$$\tau(i) = \max\left(\tau_{\min},\; e^{-\gamma i}\right) \tag{12}$$

where $i$ is the training step (within the cycle) and $\tau_{\min}$ and $\gamma$ are hyper-parameters.
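A Gumbel-Softmax temperature annealing schedule in the style of [25] can be sketched as follows; the decay rate and temperature floor here are illustrative values, not the paper's:

```python
import math

def annealed_temperature(step, gamma=1e-3, tau_min=0.5):
    """Exponentially decay the Gumbel-Softmax temperature within a cycle,
    clipped from below; gamma and tau_min are illustrative values."""
    return max(tau_min, math.exp(-gamma * step))
```

Resetting the step counter at the start of each cycle reproduces the per-cycle reset-and-anneal behaviour described above.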
The second phase, in which only the model weights are fine-tuned, consists of 20 epochs.
As mentioned in Section 4, during the first phase, we gradually increase both the weight and activation target compression rates. Both target compression rates are initialized to a minimum compression rate of 4 (implying 8-bit quantization) and are increased, in equally sized steps, at the beginning of each cycle, during the first few cycles.
Figure 2 shows an example of the behaviour of the expected weight compression rate and the actual weight compression rate (implied by the quantization schemes corresponding to the maximal HMQ parameters) during training, as the value of the target weight compression rate is increased and the temperature of the Gumbel-Softmax is annealed in every cycle. Note how the difference between the expected and the actual compression rate values decreases with the temperature, in every cycle (as is to be expected from the behaviour of the Gumbel-Softmax estimator).
We compare our results with those of other quantization methods based on top-1 accuracy vs. compression metrics. We use weight compression rate (WCR) to denote the ratio between the total size (number of bits) of the weights in the original model and the total size of the weights in the compressed model. Activation compression rate (ACR) denotes the ratio between the size (number of bits) of the largest activation tensor in the original model and its size in the compressed model. As explained in Section 4.2, our method guarantees that the size of every single activation tensor in the compressed model is bounded from above by a predetermined number of bits.
5.1 ResNet18 on CIFAR10
As proof of concept, we use HMQs to quantize a ResNet18 model trained on CIFAR10 with the standard data augmentation from [17]. Our baseline model has a top-1 accuracy of 92.45%.
The maximal bit-width in the search space of HMQs quantizing weights is 8. For activations, the bit-width of each HMQ is set according to our method in Section 4.2.
In all of the experiments in this section, we use a fixed value for the hyper-parameter controlling the compression term in the loss function in Equation 8. The learning rate that we use for model weights is 1e−5; for HMQ parameters, the learning rate is 1e−3. The batch size that we use is 256.
Figure 3 presents the Pareto frontier of weight compression rate vs. top-1 accuracy for different quantization methods of ResNet18 on CIFAR10. In this figure, we show that our method is effective, in comparison to other methods, namely DNAS [44], UNIQ [3], LQ-Nets [46] and HAWQ [12], using different activation compression rates. We explain our better results, compared to LQ-Nets and UNIQ, in spite of our higher activation and weight compression rates, by the fact that HMQs take advantage of mixed precision quantization. Compared to DNAS, our method has a much larger search space, since in their method, each quantization scheme is translated into a subgraph in a super net. Moreover, HMQs tie the bit-width and threshold into a single parameter using Equation 5. Compared to HAWQ, which only uses Hessian information, we perform an optimization over the bit-width.
5.2 ImageNet
In this section, we present results using HMQs to quantize several model architectures, namely MobileNetV1 [20], MobileNetV2 [38], ResNet50 [17] and EfficientNetB0 [39] trained on the ImageNet [9] classification dataset.
In each of these cases, we use the same data augmentation as the one reported in the corresponding paper.
Our baseline models have the following top-1 accuracies: MobileNetV1 (70.6), MobileNetV2 (71.88²), ResNet50 (76.15²) and EfficientNetB0 (76.8³).

²Torchvision models (https://pytorch.org/docs/stable/torchvision/models.html)
³https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
In all of the experiments in this section, the maximal bit-width in the search space of HMQs quantizing weights is 8. For activations, the bit-width of each HMQ is set according to our method in Section 4.2.
As mentioned above, we use the RAdam optimizer in all of our experiments, with different learning rates for the model weights and the HMQ parameters. For model weights, we use the following learning rates: MobileNetV1 (5e−6), MobileNetV2 (2.5e−6), ResNet50 (2.5e−6) and EfficientNetB0 (2.5e−6). For HMQ parameters, the learning rate is equal to the learning rate of the weights multiplied by 1e3. The batch sizes that we use are: MobileNetV1 (256), MobileNetV2 (128), ResNet50 (64) and EfficientNetB0 (128).
5.2.1 Weight Quantization.
In Table 1, we present our results using HMQs to quantize MobileNetV1, MobileNetV2 and ResNet50. In all of our experiments in this table, we set the target activations compression rate in Equation 10 to 4, implying (single precision) 8-bit quantization of all of the activations. We split the comparison in this table into three weight compression rate groups, in rows 1–2, 3–4 and 5–6, respectively.
Method       MobileNetV1     MobileNetV2     ResNet50
             WCR      Acc    WCR      Acc    WCR      Acc
HAQ [42]     14.8     57.14  14.07    66.75  15.47    70.63
HMQ (ours)   14.15 () 68.36  14.4 ()  65.7   15.7 ()  75
HAQ          10.22    67.66  9.68     70.9   10.41    75.30
HMQ          10.68 () 69.88  9.71 ()  70.12  10.9 ()  76.1
HAQ          7.8      71.74  7.46     71.47  8        76.14
HMQ          7.6 ()   70.912 7.7 ()   71.4   9.01 ()  76.3
Note that our method excels at very high compression rates. Moreover, this is in spite of the fact that an HMQ uses uniform quantization and its thresholds are limited to powers of two, whereas HAQ uses k-means quantization. We explain our better results by the fact that in HAQ, the bit-widths are the product of a reinforcement learning agent and the thresholds are determined by the statistics, as opposed to HMQs, where they are the product of an SGD optimization.
5.2.2 Weight and Activation Quantization.
In Table 2, we compare mixed precision quantization methods in which both weights and activations are quantized. In all of the experiments in this table, the activation compression rate is equal to 8. This means (with some variation between methods) that the smallest number of bits used to quantize activations is equal to 4. This table shows that our method achieves on-par results with other mixed precision methods, in spite of the restrictions on the quantization schemes of HMQs. We believe that this is due to the fact that, during training, there is no gradient mismatch for HMQ parameters (see Equations 5 and 6). In other words, HMQs allow smooth propagation of gradients. Additionally, HMQs tie each pair of bit-width and threshold in their search space to a single trainable parameter (as opposed to determining the two separately).

5.2.3 EfficientNet.
In Table 3, we present results quantizing EfficientNetB0 using HMQs and, in Figure 4, we use the Pareto frontier of accuracy vs. model size to summarize our results on all four of the models mentioned in this section.
ACR   Target WCR   WCR     Acc
4     4            4       76.4
4     8            8.05    76
4     12           11.97   74.6
4     16           14.87   71.54
5.2.4 Additional Results.
In Figure 5, we present an example of the final bit-widths of weights and activations in MobileNetV1 quantized by HMQs. This figure implies that pointwise convolutions are less sensitive to quantization than their corresponding depthwise convolutions. Moreover, it seems that deeper layers are also less sensitive to quantization. Note that the bit-widths of activations in Figure 5(b) are not a result of fine-tuning but are predetermined by the target activation compression rate, as described in Section 4.2. In Table 4, we present additional results using HMQs to quantize models trained on ImageNet. This table extends the results in Table 1; here, both weights and activations are quantized using HMQs.



6 Conclusions
In this work, we introduced the HMQ, a novel quantization block that can be applied to weights and activations. The HMQ repurposes the Gumbel-Softmax estimator in order to smoothly search over a finite set of uniform and symmetric quantization schemes. We presented a standard SGD fine-tuning process, based on HMQs, for mixed precision quantization that achieves state-of-the-art results in accuracy vs. compression for various networks. Both the model weights and the quantization parameters are trained during this process. This method can facilitate different hardware requirements, including memory, power and inference speed, by configuring the HMQ's search space and the loss function. Empirically, we experimented with two image classification datasets: CIFAR10 and ImageNet. For ImageNet, we presented state-of-the-art results on MobileNetV1, MobileNetV2 and ResNet50 in most cases. Additionally, we presented the first (that we know of) quantization results for EfficientNetB0.
Acknowledgments
We would like to thank Idit Diamant and Oranit Dror for many helpful discussions and suggestions.
References
 [1] (2019) Post training 4-bit quantization of convolutional networks for rapid deployment. In Advances in Neural Information Processing Systems, pp. 7948–7956.
 [2] (2018) NICE: noise injection and clamping estimation for neural network quantization. arXiv preprint arXiv:1810.00162.
 [3] (2018) UNIQ: uniform noise injection for non-uniform quantization of neural networks. arXiv preprint arXiv:1804.10969.
 [4] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.
 [5] (2020) ZeroQ: a novel zero shot quantization framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13169–13178.
 [6] (2017) Deep learning with low precision by half-wave gaussian quantization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5918–5926.
 [7] (2017) Learning efficient object detection models with knowledge distillation. In Advances in Neural Information Processing Systems, pp. 742–751.
 [8] (2018) PACT: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085.
 [9] (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
 [10] (2017) Learning accurate low-bit deep neural networks with stochastic quantization. arXiv preprint arXiv:1708.01001.
 [11] (2019) HAWQ-V2: hessian aware trace-weighted quantization of neural networks. arXiv preprint arXiv:1911.03852.
 [12] (2019) HAWQ: hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 293–302.
 [13] (2019) Learned step size quantization. arXiv preprint arXiv:1902.08153.
 [14] (2018) SqueezeNext: hardware-aware neural network design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1638–1647.
 [15] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
 [16] (2015) Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems, pp. 1135–1143.
 [17] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
 [18] (2019) Filter pruning via geometric median for deep convolutional neural networks acceleration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4340–4349.
 [19] (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
 [20] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
 [21] (2019) Searching for MobileNetV3. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1314–1324.
 [22] (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141.
 [23] (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713.
 [24] (2019) Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. arXiv preprint arXiv:1903.08066.
 [25] (2017) Categorical reparametrization with Gumbel-Softmax. In International Conference on Learning Representations (ICLR 2017).
 [26] (2009) Learning multiple layers of features from tiny images. Technical report.
 [27] (2019) DARTS: differentiable architecture search. In International Conference on Learning Representations.
 [28] (2020) On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations.
 [29] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
 [30] (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
 [31] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
 [32] (2017) The concrete distribution: a continuous relaxation of discrete random variables. In International Conference on Learning Representations.
 [33] (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025.
 [34] (2017) Pruning convolutional neural networks for resource efficient inference. In International Conference on Learning Representations.
 [35] (2018) Model compression via distillation and quantization. In International Conference on Learning Representations.
 [36] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
 [37] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
 [38] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
 [39] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114.
 [40] (2019) EfficientDet: scalable and efficient object detection. arXiv preprint arXiv:1911.09070.
 [41] (2020) Mixed precision DNNs: all you need is a good parametrization. In International Conference on Learning Representations.
 [42] (2019) HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620.
 [43] (2019) FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10734–10742.
 [44] (2018) Mixed precision quantization of convnets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090.
 [45] (2018) NISP: pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203.
 [46] (2018) LQ-Nets: learned quantization for highly accurate and compact deep neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 365–382.
 [47] (2018) A systematic DNN weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199.
 [48] (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856.
 [49] (2017) Incremental network quantization: towards lossless CNNs with low-precision weights. In International Conference on Learning Representations.