Low Power Inference for On-Device Visual Recognition with a Quantization-Friendly Solution

03/12/2019 ∙ by Chen Feng, et al. ∙ 6

The IEEE Low-Power Image Recognition Challenge (LPIRC) is an annual competition, started in 2015, that encourages joint hardware and software solutions for computer vision systems with low latency and power. Track 1 of the 2018 competition focused on software innovation with a fixed inference engine and hardware. This decision allowed participants to submit models online without building and bringing custom hardware on-site, which attracted a historically large number of submissions. Among the diverse solutions, the winning solution proposed a quantization-friendly framework for MobileNets that achieves an accuracy of 72.67% on the evaluation dataset with an average latency of 27 ms on a single CPU core of a Google Pixel 2 phone, which is superior to the best real-time MobileNet models at the time.




1 Introduction

Competitions encourage diligent development of advanced technology. Historical examples include Ansari XPRIZE competitions for suborbital spaceflight, numerous Kaggle competitions such as identifying salt deposits beneath the Earth’s surface from seismic images, and the PASCAL VOC, ILSVRC, and COCO competitions for computer vision PASCAL ; ILSVRC15 ; COCO . The IEEE International Low-Power Image Recognition Challenge (LPIRC, https://rebootingcomputing.ieee.org/lpirc) accelerates the development of computer vision solutions that are low-latency, accurate, and low-power.

Started in 2015, LPIRC is an annual competition that identifies the best system-level solution for detecting objects in images while using as little energy as possible 7372672 ; 8342099 ; 7858303 ; arxiv . Although many competitions are held every year, LPIRC is the only one that integrates both computer vision and low power. In LPIRC, a contestant’s system is connected to a referee system through an intranet (wired or wireless). In 2018, the competition had three tracks. For track 1, teams submit neural network architectures optimized by Google’s TfLite engine and executed on Google’s Pixel 2 phone. For track 2, teams submit neural network architectures coded in Caffe and executed on NVIDIA Jetson TX. For track 3, teams optimize both software and hardware and bring the end system on-site for evaluation. We highlight track 1 below because it attracted a large number of high-quality solutions.

2 Track 1: Efficient Network Architectures for Mobile

Track 1’s goal is to help contestants develop real-time image classification on high-end mobile phones. The platform simplifies the development cycle by providing an automated benchmarking service. Once a model is submitted in TensorFlow format, the service uses TfLite to optimize it for on-device deployment and then dispatches the model to a Pixel 2 device for latency and accuracy measurements. The service also ensures that all models are benchmarked in the same environment for reproducibility and comparability.

Track 1 selects submissions with the best accuracy within a ms-per-image time constraint, using a batch size of 1 and a single big core in Pixel 2. Although no power or energy is explicitly measured, latency correlates reasonably well with energy consumption. Table 1 shows the score of the track 1 winner’s solution. The model is evaluated on both the ILSVRC2012-ImageNet ILSVRC15 validation set and a freshly collected holdout set.

Columns: Average Latency, Test Metric, Accuracy on Classified, Accuracy/Time, Number Classified
Rows: ImageNet Validation, Holdout Set
Table 1: Evaluation of the Track 1 Winner’s Solution. Average Latency is the single-threaded, non-batched run-time (ms) of classifying one image, measured on a single Pixel 2 big core. Test Metric (the primary metric) is the total number of images correctly classified within the wall time ( ms x N) divided by N, where N is the total number of test images. Accuracy on Classified is the accuracy computed only over the images classified within the wall time. Accuracy/Time is the ratio of the accuracy to either the total inference time or the wall time, whichever is longer. Number Classified is the number of images classified within the wall time.
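The metric definitions in the Table 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the referee system's code: the function name `track1_metrics`, its arguments, and the example numbers are hypothetical. It assumes images are processed sequentially, so that an image counts as classified only if it finishes before the cumulative wall time is used up, and it assumes that Accuracy/Time uses the test metric as its numerator.

```python
def track1_metrics(latencies_ms, correct, time_limit_ms):
    """Compute Table 1 style metrics from per-image latency and correctness.

    latencies_ms: per-image inference time in ms, in evaluation order.
    correct: booleans, True when the prediction for that image is right.
    time_limit_ms: per-image budget; the wall time is time_limit_ms * N.
    """
    n = len(latencies_ms)
    wall_time_ms = time_limit_ms * n
    elapsed = 0.0
    classified = 0
    correct_in_time = 0
    for latency, ok in zip(latencies_ms, correct):
        elapsed += latency
        if elapsed > wall_time_ms:
            break  # images finishing after the wall time are not counted
        classified += 1
        correct_in_time += int(ok)
    total_time_ms = sum(latencies_ms)
    test_metric = correct_in_time / n
    return {
        "average_latency_ms": total_time_ms / n,
        "test_metric": test_metric,
        "accuracy_on_classified": correct_in_time / classified if classified else 0.0,
        # Assumption: the numerator is the test metric.
        "accuracy_per_time": test_metric / max(total_time_ms, wall_time_ms),
        "number_classified": classified,
    }


# Hypothetical run: four images against a 30 ms-per-image budget.
metrics = track1_metrics([20, 40, 30, 50], [True, True, False, True], 30)
print(metrics)
```

In this toy run the fourth image finishes after the wall time, so it does not count toward the test metric even though its prediction is correct. That is why a model slower than the budget is penalized under the primary metric.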

Track 1 received a total of valid submissions (submissions that passed the bazel test and were successfully evaluated), and submissions received test metric scores between and . Slightly over half (%) of the solutions use 8-bit quantization. Most of the architectures (%) are variations of the existing MobileNet model family, namely quantized V (%), quantized V (%) and float V (%). The winning track 1 submission outperformed the previous state of the art below ms (based on quantized MobileNet V) in accuracy by %. The predominant dependence on MobileNets is expected considering their exceptional on-device performance and technical support, although future installments are looking to mechanistically discover novel architectures.

3 Winning Track 1 with Quantization-Friendly Mobilenets

3.1 Large Quantization Loss in Precision

The winning solution is based on MobileNet V, but modified to be quantization-friendly. Quantization is often critical for low-latency inference on mobile devices. Because most neural networks are trained as floating-point models, they need to be converted to fixed-point in order to run efficiently on mobile devices. Although Google’s MobileNet models successfully reduce the parameter size and computational latency by using separable convolution, direct post-quantization of a pre-trained MobileNet V model can result in significant precision loss. For example, the accuracy of a quantized MobileNet V could drop to % on the ImageNet validation dataset, as shown in Table 2.

Image Validation Accuracy
Model | Direct Post-Quantization, Floating-point | Direct Post-Quantization, 8-Bit Fixed-point | Quantization-Friendly, Floating-point | Quantization-Friendly, 8-Bit Fixed-point
MobileNetV_ (ImageNet) | % | % | % | %
MobileNetV_ (ImageNet) | % | % | % | %
MobileNetV_SSD | % | % | % | %
Table 2: Experimental results on the accuracy of floating-point and quantization-friendly MobileNets for image recognition and object detection.

The root cause of the accuracy loss due to quantization in such separable convolution networks is analyzed as follows. In separable convolutions, depth-wise convolution is applied to each channel independently, while the min and max values used for weight quantization are taken collectively from all channels. Furthermore, without correlation across channels, depth-wise convolution may be prone to producing all-zero weights in one channel. All-zero values in one channel have very small variance, which leads to a large “scale” value for that specific channel when the batch normalization transform is applied directly after depth-wise convolution. Therefore, such outliers in one channel may cause a large quantization loss for the whole model because of the unevenly distributed data range. This is commonly observed in both MobileNet V1 Quantization and V2 models. Figure 1 shows an example of the observed batch normalization scale values of channels extracted from the first depth-wise convolution layer in MobileNetV float model. As a result, the small values corresponding to informative channels are not well preserved after quantization, and this significantly reduces the representation power of the model.

Figure 1: An example of batch normalization scale values across channels of the first depthwise convolution layer from MobileNetV float model. Small variance in all-zero channels results in large scale values. With quantization across all channels, small values will suffer huge quantization loss.
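The effect of a single outlier channel on a shared quantization range can be reproduced with a small standard-library sketch. The values below are hypothetical and chosen only to make the effect visible; they are not taken from any MobileNet checkpoint. The helper names are illustrative, and 8 bits is assumed as the default quantization width.

```python
def quantize_dequantize(values, lo, hi, bits=8):
    """Linearly quantize values to 2**bits levels over [lo, hi], then map back to float."""
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    recon = []
    for v in values:
        clipped = min(max(v, lo), hi)
        code = round((clipped - lo) / step)  # integer code in [0, levels]
        recon.append(lo + code * step)
    return recon


def max_abs_error(values, lo, hi, bits=8):
    """Largest absolute reconstruction error over the given values."""
    recon = quantize_dequantize(values, lo, hi, bits)
    return max(abs(v - r) for v, r in zip(values, recon))


# Hypothetical values: small magnitudes from informative channels,
# plus two extreme values from a single outlier channel.
informative = [0.02, -0.05, 0.11, 0.07, -0.09, 0.03]
outlier = [35.0, -40.0]
all_values = informative + outlier

# Shared range across all channels: the outlier channel sets min and max.
shared = quantize_dequantize(informative, min(all_values), max(all_values))
shared_err = max_abs_error(informative, min(all_values), max(all_values))

# Range taken from the informative values alone, for comparison.
own_err = max_abs_error(informative, min(informative), max(informative))

print("distinct levels used under the shared range:", len(set(shared)))
print("max error, shared range:", shared_err)
print("max error, own range:", own_err)
```

In this toy case the step size under the shared range is larger than the spread of the informative values, so all of them land on the same quantization level and their differences are lost. With a range taken from the informative values alone, the reconstruction error is orders of magnitude smaller.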

3.2 The Winning Quantization-Friendly Approach

For a better solution, an effective quantization-friendly separable convolution architecture is proposed, as shown in Figure 2(c), where the non-linear operations (both batch normalization and ReLU) between the depth-wise and point-wise convolution layers are all removed, letting the network learn proper weights to handle the batch normalization transform directly. In addition, ReLU6 is replaced with ReLU in all point-wise convolution layers. In experiments on MobileNet V1 and V2 models, this architecture maintains high accuracy in the 8-bit quantized pipeline across tasks such as image recognition and object detection.

As an alternative, one can use Learn2Compress learn2compress , Google’s ML framework for directly training efficient on-device models, either from scratch or from an existing TensorFlow model, by combining quantization with other techniques such as distillation, pruning, and joint training. Compared with these options, the winners’ solution provides a much simpler way to modify separable convolution layers and make the whole network quantization-friendly without re-training.

Figure 2: The quantization-friendly separable convolution core layer design proposed by the winning solution. (a) and (b) illustrate the core layers of the standard convolution and the separable convolution. (c) is the proposed core layer design, based on (b), which removes the batch normalization and ReLU between the depthwise convolution and the pointwise convolution.
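The difference between designs (b) and (c) can be written down as an ordered list of operations. The sketch below is only a structural illustration under several assumptions that the text does not state: a 3x3 depthwise kernel, ReLU6 as the activation in the standard block, and batch normalization kept after the pointwise convolution in the proposed block. The function and operation names are hypothetical.

```python
def separable_core_layer(quantization_friendly):
    """Return the ordered operation names of one separable convolution core layer."""
    if quantization_friendly:
        # Design (c): nothing sits between the depthwise and pointwise convolutions,
        # and the pointwise convolution is followed by a plain ReLU.
        return ["depthwise_conv3x3", "pointwise_conv1x1", "batch_norm", "relu"]
    # Design (b): batch normalization and an activation follow each convolution.
    return [
        "depthwise_conv3x3", "batch_norm", "relu6",
        "pointwise_conv1x1", "batch_norm", "relu6",
    ]


def ops_between_convs(ops):
    """List the operations located between the depthwise and pointwise convolutions."""
    start = ops.index("depthwise_conv3x3") + 1
    end = ops.index("pointwise_conv1x1")
    return ops[start:end]


print(ops_between_convs(separable_core_layer(False)))
print(ops_between_convs(separable_core_layer(True)))
```

With the intermediate batch normalization gone, no per-channel scale is applied directly to the depthwise output, which removes the source of the outlier ranges described in Section 3.1.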

3.3 System Integration and Experimental Results

By considering the trade-off between accuracy and model complexity, MobileNetV__ is chosen as the base architecture to which the quantization-friendly changes are applied. Based on the proposed structure, a floating-point model can be trained on the dataset. During the post-quantization step, the model is run on a range of different inputs, one image from each class category in the training data, to collect min and max values as well as the data histogram distribution at each layer output. Values for the optimal “step size” and “offset” that minimize the sum of quantization loss and saturation loss during a greedy search are picked for linear quantization. Given the calculated range of min and max values, TensorFlow Lite provides a path to convert a graph model (.pb) to a TfLite model (.tflite) that can be deployed on edge devices.
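The search for a step size and offset can be illustrated with a small greedy procedure. This is a sketch under stated assumptions, not the winners' implementation: the squared-error objective, the 5% shrink schedule, and the names `quantization_error` and `greedy_range_search` are all hypothetical. The sketch starts from the observed min and max and keeps narrowing the range from whichever side lowers the combined rounding and clipping error. It expects samples that are not all identical.

```python
def quantization_error(samples, step, offset, bits=8):
    """Sum of squared errors after linear quantization with the given step and offset.

    Values outside [offset, offset + step * (2**bits - 1)] are clipped, so the
    total covers both rounding (quantization) loss and saturation loss.
    """
    levels = 2 ** bits - 1
    total = 0.0
    for x in samples:
        code = min(max(round((x - offset) / step), 0), levels)
        total += (x - (offset + code * step)) ** 2
    return total


def greedy_range_search(samples, bits=8, shrink=0.05, max_iters=50):
    """Greedily shrink [lo, hi] from the observed min/max while the error decreases."""
    levels = 2 ** bits - 1
    lo, hi = min(samples), max(samples)
    best = quantization_error(samples, (hi - lo) / levels, lo, bits)
    for _ in range(max_iters):
        width = hi - lo
        candidates = [(lo + shrink * width, hi), (lo, hi - shrink * width)]
        scored = [
            (quantization_error(samples, (h - l) / levels, l, bits), l, h)
            for l, h in candidates
        ]
        err, l, h = min(scored)
        if err >= best:
            break  # neither shrink step improves the loss
        best, lo, hi = err, l, h
    return (hi - lo) / levels, lo, best


# Hypothetical layer outputs: values spread over [-1, 1] plus one outlier at 3.0.
samples = [i / 1000 for i in range(-1000, 1001)] + [3.0]
bits = 4  # a low bit width makes the effect visible on a small sample
baseline = quantization_error(samples, (max(samples) - min(samples)) / (2 ** bits - 1),
                              min(samples), bits)
step, offset, best = greedy_range_search(samples, bits=bits)
print("min/max error:", baseline)
print("searched error:", best, "step:", step, "offset:", offset)
```

Clipping the outlier costs a small saturation loss but shrinks the step size for every other value, so the total error drops below the plain min/max baseline in this example. A greedy search of this kind can stop at a local minimum, which is one reason the histogram collected during calibration is useful for checking the chosen range.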

The proposed fixed-point model with an input resolution of 128 can achieve an accuracy of % on the ImageNet validation dataset and an accuracy of % on the holdout dataset. The base network model can also be used for different tasks such as image recognition and object detection. Table 2 shows our experimental results on the accuracy of the proposed quantization-friendly MobileNets for image recognition on ImageNet (with an input resolution of ) and object detection on the COCO dataset. With the proposed core network structure, the model largely increases accuracy on both tasks in the 8-bit fixed-point pipeline. Whereas direct quantization of pre-trained MobileNet V1 and V2 models would cause unacceptable accuracy loss, the quantization-friendly MobileNets managed to stay within of the float model’s accuracy for ImageNet classification, and within for COCO object detection.

4 Conclusions

By providing a convenient platform for evaluating on-device neural network architectures, LPIRC has successfully enlisted creative solutions from the field. Not only does the winning solution outperform the state of the art, it also provides insights regarding quantization that are applicable to other tasks such as object detection. This success showcases not only the effectiveness of the winners’ quantization-friendly approach, but also the importance of the platform that facilitated its development.