Deep Learning Acceleration Techniques for Real Time Mobile Vision Applications

by Gael Kamdem De Teyou, et al.

Deep Learning (DL) has become a crucial technology for Artificial Intelligence (AI). It is a powerful technique to automatically extract high-level features from complex data, which can be exploited for applications such as computer vision, natural language processing, cybersecurity and communications. For the particular case of computer vision, several algorithms, such as object detection in real-time videos, have been proposed, and they work well on Desktop GPUs and distributed computing platforms. However, these algorithms are still too heavy for mobile and embedded visual applications. The rapid spread of smart portable devices and the emerging 5G network are introducing new smart multimedia applications in mobile environments. As a consequence, the possibility of implementing deep neural networks in mobile environments has attracted many researchers. This paper presents emerging deep learning acceleration techniques that can enable the delivery of real-time visual recognition into the hands of end users, anytime and anywhere.



1 Introduction

Deep Learning is a powerful technique to automatically extract high-level features from complex data, which can be exploited for applications such as computer vision, natural language processing, cybersecurity and communications [1] [2] [3]. In particular, tremendous progress has been made in the field of computer vision, with Artificial Neural Networks (ANNs) repeatedly pushing the frontier of visual recognition technology. In recent years, the general trend has been to build deeper and more complicated networks in order to deliver higher accuracy for applications such as image classification, object detection, image segmentation and text recognition. This accuracy comes at the cost of high computational resources, which Desktop GPUs or cloud computing servers can provide but which remain beyond the capabilities of many real-world mobile and embedded vision applications, where high accuracy and high speed are required together with a small hardware footprint and low power and memory consumption. Indeed, unlike Desktop GPUs or cloud computing servers, visual recognition on mobile devices poses new challenges: models must run quickly and accurately in a resource-constrained environment, making use of limited computation, power and space.

This paper describes the state-of-the-art techniques used to implement deep neural networks on portable devices for real-time vision applications. In particular:

  • We present advanced mathematical operations such as pruning, hashing, quantization, matrix factorization and distillation that significantly reduce the complexity of deep neural networks in mobile environments.

  • We analyze the application of reinforcement learning and recurrent neural networks to improve the accuracy and latency of mobile recognition tasks.

  • We present popular software frameworks that have been developed to adapt the design and training of DNNs to the requirements of mobile devices.

  • We present new generation chipsets that can bring high-performance GPUs into portable devices, thus enabling the execution of deep learning algorithms in mobile devices.

This paper is organized as follows. Section 2 presents neural network architectures and advanced mathematical operations that are optimized for real-time vision applications in mobile environments. Section 3 discusses the application of reinforcement learning and recurrent neural networks to accurate, high-speed mobile vision applications. Section 4 then describes software frameworks that provide the necessary basis and contain different sets of elements for coding and implementing neural networks on portable devices. Section 5 presents new-generation chips that improve the computational power of portable devices. Finally, Section 6 concludes the paper.

2 Neural Network Architectures

Various neural network architectures for vision applications such as image classification, image captioning, object detection and segmentation have been proposed in the past few years along with the development of deep learning. Object detectors, for example, can generally be divided into two categories: single-stage methods and two-stage methods. Typical two-stage methods include R-CNN [7], Fast R-CNN [8], Faster R-CNN [9] and R-FCN [10]. Early methods like R-CNN [7] and Fast R-CNN [8] use external region proposal generation algorithms such as selective search [11] to produce candidate regions, and then perform classification on each candidate region. Later methods introduce region proposal networks (RPNs) to produce the region proposals, and integrate the backbone network, the RPN and front-end modules such as classification and bounding-box regression into one framework for end-to-end training. These approaches are accurate but have a heavy computational cost and high latency.

In contrast, typical single-stage methods like SSD [12], YOLO [13], YOLOv2 [14] and YOLOv3 [15] apply predefined default boxes of different scales and sizes over one or multiple feature maps to achieve a trade-off between speed and accuracy. These methods are usually faster than their two-stage counterparts, but less accurate. In general, all these methods still come at a high computational cost that is beyond the capabilities of many mobile and embedded vision applications in terms of latency, memory and power consumption. In addition to computation, these deep neural networks often suffer from over-parametrization and large amounts of redundancy, which worsen their inefficiency when it comes to deployment on portable devices.
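The default-box tiling used by single-stage detectors can be sketched as follows. This is a simplified, illustrative version loosely modeled on SSD [12]: the scale range (0.2 to 0.9), the aspect ratios (1, 2, 1/2), the square feature maps and the helper names `layer_scales` and `default_boxes` are assumptions of this sketch, and the extra box SSD adds per cell is omitted. Each feature-map cell receives one box per aspect ratio, centred on the cell, and coarser maps receive larger boxes; a detector then predicts class scores and box offsets for every default box in a single forward pass.

```python
import math
from typing import List, Tuple

# (cx, cy, w, h), all normalized to [0, 1] relative to the input image
Box = Tuple[float, float, float, float]


def layer_scales(num_maps: int, s_min: float = 0.2, s_max: float = 0.9) -> List[float]:
    """Linearly spaced box scales, one per feature map; coarser maps get larger boxes."""
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


def default_boxes(feature_map_sizes: List[int],
                  aspect_ratios: Tuple[float, ...] = (1.0, 2.0, 0.5)) -> List[Box]:
    """Tile every cell of every square feature map with one box per aspect ratio."""
    boxes: List[Box] = []
    scales = layer_scales(len(feature_map_sizes))
    for f, s in zip(feature_map_sizes, scales):
        for i in range(f):  # row of the cell
            for j in range(f):  # column of the cell
                cx, cy = (j + 0.5) / f, (i + 0.5) / f  # cell centre
                for a in aspect_ratios:
                    # width / height equals a, while the area stays s * s
                    boxes.append((cx, cy, s * math.sqrt(a), s / math.sqrt(a)))
    return boxes


if __name__ == "__main__":
    boxes = default_boxes([8, 4, 2])
    print(len(boxes))  # (64 + 16 + 4) cells x 3 aspect ratios = 252
```

Using several feature maps lets small objects be matched by boxes on the fine 8 × 8 map and large objects by boxes on the coarse 2 × 2 map, without a separate proposal stage.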

Recently, much effort has been made to build very small, low-latency models that can easily meet the design requirements of mobile and embedded vision applications. For example, architectures like Xception, SqueezeNet, ShuffleNet, MobileNets and MobileNetV2 [4] [5] [6] have been proposed. Most of these architectures use depthwise separable convolutions, which are more efficient than standard convolutions. The basic idea is to replace a full convolution operator with a factorized version that splits the convolution into two separate layers. The first layer, called a depthwise convolution, performs lightweight filtering by applying a single convolutional filter per input channel. The second layer is a 1 × 1 convolution, called a pointwise convolution, which builds new features by computing linear combinations of the input channels. Compared with a standard convolution with a $D_K \times D_K$ kernel, a depthwise separable convolution reduces the computation to a fraction $\frac{1}{N} + \frac{1}{D_K^2}$ of the original cost, where $N$ is the number of output channels; for $3 \times 3$ kernels this is roughly an 8 to 9 times reduction. However, these architectures are general-purpose and are not specifically designed for object detection tasks.
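The cost reduction can be checked with a little arithmetic. The sketch below counts multiply-adds for a standard convolution and for its depthwise separable replacement, using MobileNets-style notation assumed here: D_K is the kernel size, M and N are the numbers of input and output channels, and D_F is the spatial size of the output feature map. The function names are illustrative, and the ratio is computed exactly with `fractions.Fraction`; algebraically it simplifies to 1/N + 1/D_K^2.

```python
from fractions import Fraction


def standard_conv_cost(dk: int, m: int, n: int, df: int) -> int:
    """Multiply-adds of a standard DK x DK convolution, M -> N channels, DF x DF output."""
    return dk * dk * m * n * df * df


def separable_conv_cost(dk: int, m: int, n: int, df: int) -> int:
    """Depthwise pass (one DK x DK filter per input channel) plus pointwise 1 x 1 pass (M -> N)."""
    depthwise = dk * dk * m * df * df
    pointwise = m * n * df * df
    return depthwise + pointwise


def cost_ratio(dk: int, m: int, n: int, df: int) -> Fraction:
    """Exact ratio separable / standard, which equals 1/N + 1/DK^2."""
    return Fraction(separable_conv_cost(dk, m, n, df), standard_conv_cost(dk, m, n, df))


if __name__ == "__main__":
    r = cost_ratio(dk=3, m=256, n=256, df=14)
    print(r, float(r))  # 265/2304, i.e. roughly 8.7x fewer multiply-adds
```

Note that M and D_F cancel out of the ratio: the saving depends only on the kernel size and the number of output channels, and approaches 1/9 for 3 × 3 kernels as N grows.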

Another fundamental technique for reducing the complexity of deep neural networks is pruning [19] [20]. Network pruning has long been studied by researchers as a way to compress neural networks, including CNNs. It consists of reducing the number of network connections by removing the least relevant links. One of the first works on network pruning was [21], where the authors showed that it can be used to reduce both network complexity and over-fitting. In [22], state-of-the-art CNN models were pruned successfully with no loss of accuracy. The authors of [23] built on top of that approach. They started by learning the connectivity via normal training. Next, they pruned the small-weight connections by removing all links with weights below a threshold. Finally, they retrained the network to learn the final weights for the remaining sparse connections. As a result, they reduced the number of parameters by 9x for AlexNet and 13x for VGG-16, as reported in the paper.
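The train, prune, retrain procedure described above can be sketched in plain Python as below. The threshold value, the plain SGD update used for retraining, and all function names are illustrative assumptions rather than the exact setup of [22] or [23]; the key point is that pruned connections are recorded in a 0/1 mask that is re-applied at every retraining step, so they stay removed.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Mask = List[List[int]]


def prune_by_magnitude(weights: Matrix, threshold: float) -> Tuple[Matrix, Mask]:
    """Remove (zero out) every connection whose |weight| is below `threshold`."""
    mask = [[1 if abs(w) >= threshold else 0 for w in row] for row in weights]
    pruned = [[w if keep else 0.0 for w, keep in zip(row, mrow)]
              for row, mrow in zip(weights, mask)]
    return pruned, mask


def masked_sgd_step(weights: Matrix, grads: Matrix, mask: Mask, lr: float = 0.01) -> Matrix:
    """One retraining step: update surviving connections, keep pruned ones at zero."""
    return [[(w - lr * g) if keep else 0.0 for w, g, keep in zip(wr, gr, mr)]
            for wr, gr, mr in zip(weights, grads, mask)]


def compression_factor(mask: Mask) -> float:
    """Original number of connections divided by the number that survive pruning."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return total / kept if kept else float("inf")


if __name__ == "__main__":
    w = [[0.8, -0.05, 0.3], [0.01, -0.6, 0.02]]
    pruned, mask = prune_by_magnitude(w, threshold=0.1)
    print(pruned)                    # [[0.8, 0.0, 0.3], [0.0, -0.6, 0.0]]
    print(compression_factor(mask))  # 2.0
```

In practice the storage saving also depends on how the sparse matrix is encoded, since the indices of the surviving connections must be stored alongside their values.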

Hashing can also be used to shrink neural networks so that they fit on mobile devices, which have little memory. It exploits the inherent redundancy in neural networks to achieve a drastic reduction in model size. As described in [24], HashedNets use a low-cost hash function to randomly group connection weights into hash buckets, and all connections within the same hash bucket share a single parameter value. These shared parameters are tuned with standard backpropagation during training to fit the HashedNets weight-sharing architecture. The authors state that the hashing procedure introduces no additional memory overhead, and they demonstrate on several benchmark data sets that HashedNets substantially shrink the storage requirements of neural networks while mostly preserving generalization performance. This work was largely empirical but achieved strong results. More recently, the authors of [25] complemented [24] by providing rigorous theoretical guarantees on why such neural network compression should work. As reported in the paper, they provide provable guarantees for some hashing-based parameter reduction methods in neural nets. First, they introduce a neural network compression scheme based on random linear sketching (which is usually implemented efficiently via hashing), and show that the sketched (smaller) network is able to approximate the original network on all input data coming from any smooth and well-conditioned low-dimensional manifold. The sketched network can also be trained directly via back-propagation. Next, they study the previously proposed HashedNets architecture and show that the optimization landscape of one-hidden-layer HashedNets has a local strong convexity property similar to that of a normal fully connected neural network.
In [26], a few-parameter, low-latency and high-accuracy deep hashing approach was proposed for mobile visual search; it constructs binary hash codes that enable users to sense their surroundings with smartphones. The authors use the MobileNet architecture, which significantly decreases the number of parameters, and add a hash layer to it.
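A minimal sketch of the hashing trick of [24] for one fully connected layer is given below, assuming a CRC-32 based hash and hypothetical function names. The virtual weight matrix is never stored: entry V[i][j] is looked up as `params[bucket(i, j)]`, and during backpropagation each shared parameter accumulates the gradients of every connection hashed into its bucket. HashedNets additionally uses a second, sign-valued hash, which is omitted here for brevity.

```python
import zlib
from typing import List


def bucket(i: int, j: int, num_buckets: int, seed: int = 0) -> int:
    """Deterministic, cheap hash of connection (i, j) into one of `num_buckets` buckets."""
    return zlib.crc32(f"{seed}:{i}:{j}".encode()) % num_buckets


def hashed_forward(x: List[float], params: List[float], n_out: int) -> List[float]:
    """Fully connected layer whose virtual weight V[i][j] equals params[bucket(i, j)].

    Only len(params) real numbers are stored, however large n_out * len(x) is.
    """
    k = len(params)
    return [sum(params[bucket(i, j, k)] * xj for j, xj in enumerate(x))
            for i in range(n_out)]


def hashed_param_grad(x: List[float], upstream: List[float], k: int) -> List[float]:
    """Backpropagation into the shared parameters: each bucket sums the gradients
    dL/dV[i][j] = upstream[i] * x[j] of all connections mapped to it."""
    grad = [0.0] * k
    for i, gi in enumerate(upstream):
        for j, xj in enumerate(x):
            grad[bucket(i, j, k)] += gi * xj
    return grad


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 0.25]
    params = [0.1, -0.2, 0.3]  # 3 stored values stand in for a 5 x 4 virtual matrix
    print(hashed_forward(x, params, n_out=5))
```

Because the hash is recomputed on the fly, no index table has to be kept in memory, which is why the scheme adds no storage overhead beyond the shared parameters.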

Two efficient approximations to standard convolutional neural networks are presented in [27]: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values, which yields 32x memory savings. In XNOR-Networks, both the filters and the inputs to convolutional layers are binary, so convolutions can be approximated using primarily binary operations. As reported in [27], this results in 58x faster convolutional operations (in terms of the number of high-precision operations) and 32x memory savings. The authors claim that XNOR-Nets make it possible to run state-of-the-art networks in real time on CPUs rather than GPUs. The technique was evaluated on the ImageNet classification task, where the classification accuracy of a Binary-Weight-Network version of AlexNet is the same as that of the full-precision AlexNet. Similar network binarization techniques can be found in [28] [29] [30] [31] [32].
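The two approximations can be sketched as follows. For binary weights, approximating a real filter W by alpha * B with B in {-1, +1} is solved, in the L2 sense, by B = sign(W) and alpha equal to the mean absolute value of W, which is the closed form derived in [27]. When the inputs are binary too, the dot product of two {-1, +1} vectors packed into integers reduces to an XOR (equivalently XNOR) followed by a bit count. The bit-packing layout, the convention sign(0) = +1 and the function names are assumptions of this sketch; real implementations operate on machine words and also apply the input scaling factors of [27].

```python
from typing import List, Tuple


def binarize_weights(w: List[float]) -> Tuple[float, List[int]]:
    """Approximate a real filter W by alpha * B with B in {-1, +1}.

    The L2-optimal choice is B = sign(W) and alpha = mean(|W|); sign(0) is taken as +1.
    """
    alpha = sum(abs(v) for v in w) / len(w)
    b = [1 if v >= 0 else -1 for v in w]
    return alpha, b


def pack_bits(v: List[int]) -> int:
    """Encode a {-1, +1} vector as an integer bitmask (bit set <=> +1)."""
    bits = 0
    for idx, s in enumerate(v):
        if s > 0:
            bits |= 1 << idx
    return bits


def xnor_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two packed {-1, +1} vectors of length n.

    Differing positions contribute -1 and matching ones +1, so
    dot = n - 2 * popcount(a XOR b).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


if __name__ == "__main__":
    alpha, b = binarize_weights([0.5, -1.5, 1.0, -0.2])
    x = [1, -1, -1, 1]  # an input that is already binarized
    approx = alpha * xnor_dot(pack_bits(b), pack_bits(x), len(x))
    print(alpha, b, approx)
```

Storing one bit per weight instead of a 32-bit float is where the 32x memory saving comes from, while the XOR and bit count replace floating-point multiply-adds.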

Distillation is a simple way to deploy deep neural networks on devices with limited computational resources. In this technique, a cumbersome neural network is first trained on a large dataset. Then a different kind of training is used to transfer the knowledge from the cumbersome model to a small model that is more suitable for deployment on portable devices. In essence, the knowledge of the cumbersome network is compressed and transferred to a smaller network. This type of approach was successfully used in [33], where the authors convincingly demonstrate that the knowledge acquired by a large ensemble of models can be transferred to a single small model. In [34], this approach was developed further with a different compression technique and tested on the MNIST dataset, with some surprising results; the authors also showed that it is possible to significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model.
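The soft-target training of [34] can be sketched as a loss function. Both models' logits are divided by a temperature T before the softmax so that the teacher's relative probabilities for the wrong classes become visible; the student is then trained on a weighted mix of the cross-entropy against these soft targets and the ordinary cross-entropy against the true label. The T^2 factor on the soft term follows the recommendation in [34]; the default values T = 4 and alpha = 0.5 and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits / T; a larger T gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits: List[float], teacher_logits: List[float],
                      hard_label: int, temperature: float = 4.0, alpha: float = 0.5) -> float:
    """Weighted sum of a soft-target term and the usual hard-label cross-entropy.

    The soft term is the cross-entropy between the teacher's and the student's
    temperature-softened outputs, scaled by T^2 so that its gradient magnitude
    stays comparable when T changes.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard


if __name__ == "__main__":
    teacher = [4.0, 1.0, -2.0]  # logits of the large (cumbersome) model
    student = [2.5, 1.5, -1.0]  # logits of the small model
    print(distillation_loss(student, teacher, hard_label=0))
```

Only the student is updated with this loss; the teacher is run in inference mode to produce its logits for each training example.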

Matrix factorization has also been used by researchers to reduce the computation of deep neural networks [35] [36]. The idea behind matrix factorization is to exploit low-rank decompositions of convolutional tensors to speed up the evaluation of CNNs. In [36], the authors exploit cross-channel redundancy to construct a low-rank basis of filters that are rank-1 in the spatial domain. Their methods are architecture-agnostic and can easily be applied to existing CPU and GPU convolutional frameworks for tunable speedup. In [35], using large state-of-the-art models, the researchers demonstrate speedups of convolutional layers on both CPU and GPU by a factor of 2x, while keeping the accuracy within 1% of the original model.
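The benefit of filters that are rank-1 in the spatial domain can be seen on a single filter. A k_h × k_w kernel that factors as the outer product of a column vector and a row vector can be applied as two 1-D passes, costing k_h + k_w instead of k_h · k_w multiplies per output pixel. The sketch below, with hypothetical helper names, checks this on the Sobel-x kernel; it only illustrates the spatial rank-1 idea and does not reproduce the learned cross-channel basis of [36].

```python
from typing import List

Image = List[List[float]]


def conv2d_valid(img: Image, kernel: Image) -> Image:
    """Direct 'valid' 2-D correlation: kh * kw multiplies per output pixel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(kernel[u][v] * img[r + u][c + v] for u in range(kh) for v in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def separable_conv2d_valid(img: Image, col: List[float], row: List[float]) -> Image:
    """Same output as conv2d_valid(img, outer(col, row)), computed as two 1-D passes
    with only kh + kw multiplies per output pixel."""
    kh, kw = len(col), len(row)
    # Horizontal pass: filter every image row with `row`.
    tmp = [[sum(row[v] * line[c + v] for v in range(kw)) for c in range(len(line) - kw + 1)]
           for line in img]
    # Vertical pass: filter every column of the intermediate result with `col`.
    return [[sum(col[u] * tmp[r + u][c] for u in range(kh)) for c in range(len(tmp[0]))]
            for r in range(len(tmp) - kh + 1)]


def outer(col: List[float], row: List[float]) -> Image:
    """Rank-1 kernel K[u][v] = col[u] * row[v]."""
    return [[a * b for b in row] for a in col]


if __name__ == "__main__":
    img = [[float((3 * r + 7 * c) % 11) for c in range(6)] for r in range(6)]
    col, row = [1.0, 2.0, 1.0], [-1.0, 0.0, 1.0]  # Sobel-x is rank-1
    direct = conv2d_valid(img, outer(col, row))
    fast = separable_conv2d_valid(img, col, row)
    print(direct == fast)  # True (small integer values, so float sums are exact)
```

For a 3 × 3 kernel the saving is 6 versus 9 multiplies per pixel; the gap grows with kernel size, and [36] additionally shares such separable filters across channels.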

Quantization is a technique complementary to matrix factorization for reducing the memory consumption of deep neural networks. The main idea behind quantization is that high-precision parameters are not indispensable for achieving high performance in deep neural networks. The simplest form of quantization is binarization, where a parameter $w$ is set to $+1$ if $w \ge 0$ and to $-1$ if $w < 0$. In [37], an 8-bit fixed-point integer implementation was compared with 32-bit floating-point activations. In [38], another fixed-point network with ternary weights and 3-bit activations was proposed. In [39], a quantized CNN is implemented on mobile devices and evaluated for image classification on two benchmarks, MNIST and ILSVRC; the model can accurately classify images within one second. The authors propose an effective training scheme to suppress the errors that accumulate while quantizing the whole network. In [40], researchers address the problem of large parameter storage by investigating information-theoretic vector quantization methods for compressing the parameters of CNNs. In particular, they found that vector quantization methods have a clear gain over existing matrix factorization methods.
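As a concrete example of fixed-point quantization, the sketch below applies a generic symmetric 8-bit scheme: each tensor is stored as integers in [-127, 127] plus one floating-point scale, which cuts storage by 4x relative to 32-bit floats, and dot products are accumulated in integers and rescaled once at the end. This scheme, its rounding rule and the function names are assumptions for illustration; it is not the specific method of [37], [38] or [39], and the vector quantization of [40] would instead replace groups of weights by codebook indices.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric linear quantization: x ~= scale * q with q an integer in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map the stored integers back to approximate real values."""
    return [scale * x for x in q]


def int_dot(qa: List[int], sa: float, qb: List[int], sb: float) -> float:
    """Dot product with integer multiply-accumulates, rescaled once at the end."""
    acc = sum(x * y for x, y in zip(qa, qb))  # integer accumulator
    return sa * sb * acc


if __name__ == "__main__":
    w = [0.42, -1.3, 0.07, 0.9, -0.25]
    q, s = quantize_int8(w)
    err = max(abs(a - b) for a, b in zip(w, dequantize(q, s)))
    print(q, s, err)  # the rounding error is at most s / 2
```

Keeping a single scale per tensor makes the integer kernels simple; finer-grained scales (for example one per output channel) reduce the error at the cost of a little extra metadata.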

3 Reinforcement Learning and Recurrent Neural Networks

Several small deep neural network architectures for object detection in static images have also been proposed, such as Tiny YOLO [16] and Tiny SSD [17]. However, directly applying these detectors to real-time mobile vision applications faces new challenges. First, applying the deep networks to all frames introduces an unaffordable computational cost for mobile devices. Second, recognition accuracy suffers from deteriorated appearances in videos that are seldom observed in static images, such as motion blur, video defocus and rare poses. Third, the correlations between frames can be exploited both to improve accuracy and to reduce latency.

To tackle these challenges, reinforcement learning and recurrent neural networks can be used. Indeed, several real-time vision problems such as object detection can be seen as a sequential decision-making process where, at every frame, we have to predict the existence and location of a particular object based on information from the current and previous frames. Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks can therefore be combined with Reinforcement Learning: the agent is an RNN, the states of the environment are video frames, and the actions are bounding box locations. During training, the agent receives a reward for each bounding box prediction, and it learns policies that capture the correlation between frames. For example, in [55] the authors introduce a fully end-to-end approach for visual tracking in videos that learns to predict the bounding box locations of a target object at every frame. An important insight is that the tracking problem can be considered as a sequential decision-making process, and historical semantics encode highly relevant information for future decisions. In [56], the authors present a recurrent neural network model that can extract information from an image or video by adaptively selecting a sequence of regions or locations and processing only the selected regions at high resolution. Like convolutional neural networks, this model has a degree of translation invariance built in, but the amount of computation it performs can be controlled independently of the input image size.
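To make the reinforcement-learning formulation concrete, the sketch below scores an agent's per-frame bounding-box actions with an intersection-over-union (IoU) reward and converts the rewards into discounted returns, the quantity a policy-gradient method would try to increase. The IoU reward, the discount factor gamma = 0.9 and the function names are hypothetical choices for illustration; [55] defines its own reward, and the recurrent policy network itself is omitted.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2 and y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def episode_rewards(predicted: List[Box], ground_truth: List[Box]) -> List[float]:
    """One reward per frame: IoU between the agent's box (its action) and the target box."""
    return [iou(p, g) for p, g in zip(predicted, ground_truth)]


def discounted_returns(rewards: List[float], gamma: float = 0.9) -> List[float]:
    """G_t = r_t + gamma * G_{t+1}: credit for the action at frame t includes later frames."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return returns[::-1]


if __name__ == "__main__":
    truth = [(10, 10, 50, 50), (12, 10, 52, 50), (15, 11, 55, 51)]
    preds = [(10, 10, 50, 50), (20, 10, 60, 50), (40, 40, 80, 80)]
    r = episode_rewards(preds, truth)
    print(r, discounted_returns(r))
```

Because the return at frame t also counts rewards from later frames, the agent is encouraged to keep information in its recurrent state that helps future predictions, which is how inter-frame correlation enters the training signal.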

4 Software frameworks for Mobile Devices

Software frameworks provide the necessary basis and contain different sets of elements such as convolutional layers, max pooling and loss layers for implementing deep neural networks without coding from scratch. They also provide all of the necessary infrastructure to implement functions such as reading a network description file, linking core functions into a network, reading data from training and validation databases, running the network forward to generate output, computing loss, running the network backward to adapt the weights, and repeating this process as many times as is necessary to adequately train the network.

Several frameworks for training deep neural networks have appeared over the past five years. Some examples include Caffe, Facebook’s Caffe2, Microsoft’s Cognitive Toolkit, Darknet, MXNet, Google’s TensorFlow, Theano and Torch (Intel’s Deep Learning Training Tool and NVIDIA’s DIGITS are special cases, as they both run Caffe "under the hood"). Additionally, several chip suppliers provide proprietary tools for quantizing and otherwise optimizing networks for resource-constrained embedded applications. Such tools are sometimes integrated into a standalone framework; other times they require (or alternatively include a custom version of) another existing framework [41].


Although some of the above frameworks do include specific options for running on mobile platforms (e.g., MXNet, TensorFlow, Caffe and Torch), their fundamental goal is to design and train deep neural networks on powerful GPUs, where training a deep network is practical, rather than to perform inference on resource-constrained mobile devices. In response to this problem, several alternative frameworks have been developed to adapt the design and training of DNNs to the requirements of mobile devices.

DeepLearningKit [42] is an open source framework that supports using pre-trained deep learning models (convolutional neural networks) on iOS, OS X and tvOS. DeepLearningKit is developed in Metal, in order to use the GPU efficiently, and in Swift, for integration with applications, e.g. iOS-based mobile apps on iPhone/iPad, tvOS-based apps for the big screen, or OS X desktop applications. The goal of the authors is to support deep learning models previously trained with popular frameworks such as Caffe, TensorFlow, Torch and Theano.

DeepX [43] is a software accelerator that lowers the device resources (computation, memory and energy) required by deep learning architectures; these resource demands currently act as a severe bottleneck to mobile adoption. As detailed in [43], the foundation of DeepX is a pair of resource control algorithms, designed for the inference stage of deep learning, that: (1) decompose monolithic deep model network architectures into unit-blocks of various types, which are then executed more efficiently by heterogeneous local device processors (e.g., GPUs, CPUs); and (2) perform principled resource scaling that adjusts the architecture of deep models to shape the overhead each unit-block introduces. Experiments show that DeepX allows even large-scale deep learning models to execute efficiently on modern mobile processors, significantly outperforming existing solutions such as cloud-based offloading.

CNNdroid [44] is an open source GPU-accelerated library specifically designed and optimized for executing trained deep CNNs on Android-based mobile devices. It supports almost all CNN layer types and is compatible with CNN models trained by Caffe, Torch and Theano. Additionally, it can make use of both the GPU and CPU computing capabilities of the mobile device.

Many mobile visual applications such as motion tracking, activity recognition and pose estimation can require time series inputs from mobile sensors such as cameras, accelerometers and gyroscopes. For such problems, the measurements are systematically noisy and the noise distribution is hard to determine. DeepSense [45] is a deep learning framework designed to be robust to this sensor noise. It specifically targets time series data and has been applied to tasks such as car tracking with motion sensors, heterogeneous human activity recognition and user identification with biometric motion analysis, executing models in soft real time.

Boda-RTC [46] is an open source system that allows developers to rapidly develop new computational kernels for existing hardware targets, and to rapidly tune existing computational kernels for new hardware targets. It is built on a code-generation approach that targets vendor-neutral OpenCL platforms.

TensorFlow Lite [47] is TensorFlow’s official solution for running machine learning models on mobile and embedded devices. It enables on-device machine learning inference with low latency and a small binary size on Android, iOS and other operating systems, so developers can use models for classification, regression or other tasks without necessarily incurring a round trip to a server. It is currently supported on Android and iOS via a C++ API, with a Java wrapper for Android developers. Additionally, on Android devices that support it, the interpreter can use the Android Neural Networks API for hardware acceleration; otherwise it defaults to the CPU for execution [48].


With the growing popularity of machine learning and mobile applications, Apple launched its Core ML library, which allows mobile app developers to train models on powerful computers, then save the trained models on the phone and run an optimized version there [49].

Bender is a deep learning framework that allows developers to easily define and run neural networks in iOS apps [50]. It runs on the iPhone’s GPU through Apple’s Metal Performance Shaders (MPS) toolkit. First, you train a model on a platform such as TensorFlow, Theano, Caffe or Keras, and save the model as a frozen graph or export the weights to files. Bender then lets you either directly import that model as a frozen graph from a supported platform, or redefine the network structure and load the weights. Finally, Bender runs the model on the GPU using MPS. Bender supports the most common ML nodes and layers, and it is also extensible, so you can write your own custom functions.

Framework         Acceleration   Compatibility
DeepX             GPU            Android
DeepLearningKit   GPU            iOS
CNNdroid          GPU            Android
DeepSense         CPU            Android
Boda-RTC          GPU            OpenCL platforms
TensorFlow Lite   GPU            Android, iOS
Core ML           GPU            iOS
Bender            GPU            iOS
Table 1: Comparison of frameworks for mobile platforms

5 Hardware for Mobile Devices

In recent years, manufacturers have made considerable efforts to increase the computational power of mobile devices by producing new-generation chipsets that may bring high-performance GPUs into portable devices, thus enabling the execution of deep learning algorithms on mobile devices.

For example, NVIDIA proposed the Jetson series, which is built on Tegra chipsets for general-purpose computing [52]. NVIDIA tailored its desktop GPU technology to meet the requirements of mobile environments, and the Tegra chipsets can outperform mobile processors on DL as well as on other GPU-accelerable applications.

As the biggest mobile processor vendor, Qualcomm also introduced dedicated components into its newest Snapdragon system-on-chip (SoC) to accelerate DL applications, as reported in [51]. With the Snapdragon mobile platform, DL models can run directly on the smartphone without the need to access the cloud. Qualcomm also announced its next-generation processor, the Snapdragon 855, at the recently held Snapdragon Tech Summit. It aims to bring advanced capabilities in connectivity, camera, digital assistants, gaming and entertainment, speed and more.

Huawei introduced the Kirin 980, an AI chip built on a 7 nm process. It is being seen as the next generation of processing technology for smartphones. The company’s recently launched Mate 20 smartphones are the first devices to feature this chip, which Huawei presents as the world’s first 5G-ready 7 nm AI chip.

Samsung also developed a next-generation mobile SoC for deep learning. Called the Exynos 9820, it was expected to power the company’s Galaxy S10 phones. It features a dedicated neural processing unit (NPU) to handle smartphone AI functions on the device. This AI chip is considerably faster than its previous generations and can perform tasks such as image recognition, translation and speech transcription more efficiently.

Another way to speed up AI algorithms in portable environments is to pair development tools with existing mobile chips, especially mobile vision processing units (VPUs). For example, Movidius, a leader in low-power machine vision for connected devices, and Google brought the powerful Myriad 2 MA2450 chip, together with relevant development tools, to mobile devices [54]. The chip supports existing DNN frameworks such as TensorFlow and Caffe, after the DNNs are translated for the Myriad 2 VPU. The performance of this chip is better than that of the on-chip GPUs available in commercial mobile devices.

6 Conclusion

In this paper, we discussed the emerging deep learning acceleration techniques for real time mobile vision applications. First, we presented an overview of neural network architectures and advanced mathematical operations that are optimized for real time vision applications in mobile environment. Then we discussed the application of reinforcement learning and recurrent neural networks to improve the accuracy and speed of mobile vision applications. After that, we described software frameworks that provide the necessary basis and contain different sets of elements for coding and implementing neural networks in portable devices. Finally, we presented new generation chips that can bring high-performance GPUs into portable devices, thus enabling the execution of deep learning algorithms in mobile devices.


  • [1] Zhijin Qin et al. Deep Learning in Physical Layer Communications, arXiv preprint 1807.22723v3, 2019.
  • [2] Tom Young et al. Recent Trends in Deep Learning Based Natural Language Processing, arXiv 1708.02709v8, Nov 2018.
  • [3] Gael Kamdem De Teyou and Junior Ziazet. Convolutional Neural Network for Intrusion Detection System in Cyber Physical Systems, arXiv:1905.03168 [cs.CR], 2019.
  • [4] Francois Chollet. Xception: Deep Learning with Depthwise Separable Convolutions, arXiv, 2017.
  • [5] Andrew G. Howard et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv, 2017.
  • [6] Mark Sandler et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks, arXiv, 2018.
  • [7] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation, In CVPR, 2014.
  • [8] Ross Girshick. Fast R-CNN, In ICCV, 2015.
  • [9] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks, In NIPS, pages 91–99, 2015.
  • [10] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks, In NIPS, 2016.
  • [11] Jasper R. R. Uijlings, Koen E. A. Van De Sande, Theo Gevers, and Arnold W. M. Smeulders. Selective search for object recognition, IJCV, 2013.
  • [12] Wei Liu, Dragomir Anguelov, Dumitru Erhan et al. SSD: Single shot multibox detector, In ECCV, pages 21–37, 2016.
  • [13] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection, In CVPR, pages 779–788, 2016.
  • [14] Joseph Redmon and Ali Farhadi. YOLO9000: Better, Faster, Stronger, arXiv, 2016.
  • [15] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement, arXiv, 2018.
  • [16] Joseph Redmon. https://pjreddie.com/darknet/yolo
  • [17] Alexander Wong, Mohammad Javad Shafiee, Francis Li, Brendan Chwyl. Tiny SSD: A Tiny Single-shot Detection Deep Convolutional Neural Network for Real-time Embedded Object Detection, arXiv, 2018.
  • [18] Xizhou Zhu et al. Towards High Performance Video Object Detection for Mobiles, arXiv, 2018.
  • [19] G. Castellano et al. An iterative pruning algorithm for feedforward neural networks, IEEE Transactions on Neural Networks, Volume 8, Issue 3, May 1997.
  • [20] Ji Lin et al. Runtime Neural Pruning, 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
  • [21] Yann LeCun, John S. Denker, Sara A. Solla, Richard E. Howard, and Lawrence D. Jackel. Optimal brain damage, In NIPS, volume 89, 1989.
  • [22] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network, In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.
  • [23] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, arXiv, 2016.
  • [24] Wenlin Chen et al. Compressing Neural Networks with the Hashing Trick, arXiv, 2015.
  • [25] Yibo Lin et al. Towards a Theoretical Understanding of Hashing-Based Neural Nets, arXiv, 2019.
  • [26] Heng Qi, Wu Liu, Liang Liu. An efficient deep learning hashing neural network for mobile visual search, arXiv, 2017.
  • [27] Mohammad Rastegari et al. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, arXiv, 2016.
  • [28] Courbariaux, M., Bengio, Y. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1, CoRR, 2016.
  • [29] Courbariaux, M., Bengio, Y., David, J.P. Training deep neural networks with low precision multiplications, arXiv preprint, 2014.
  • [30] Soudry, D., Hubara, I., Meir, R. Expectation backpropagation: parameter-free training of multilayer neural networks with continuous or discrete weights, Advances in Neural Information Processing Systems (2014) 963–971.
  • [31] Esser, S.K., Appuswamy, R., Merolla, P., Arthur, J.V., Modha, D.S. Backpropagation for energy-efficient neuromorphic computing, Advances in Neural Information Processing Systems (2015) 1117–1125.
  • [32] Courbariaux, M., Bengio, Y., David, J.P. BinaryConnect: Training deep neural networks with binary weights during propagations, Advances in Neural Information Processing Systems (2015) 3105–3113.
  • [33] Cristian Bucila et al. Model Compression, In KDD, 2006.
  • [34] Geoffrey Hinton et al. Distilling the Knowledge in a Neural Network, arXiv, 2015.
  • [35] Denton, E.L., Zaremba, W., Bruna, J., LeCun, Y., Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation, Advances in Neural Information Processing Systems (2014) 1269–1277.
  • [36] Jaderberg, M., Vedaldi, A., Zisserman, A. Speeding up convolutional neural networks with low rank expansions, arXiv preprint arXiv:1405.3866 (2014).
  • [37] Vanhoucke, V., Senior, A., Mao, M.Z. Improving the speed of neural networks on CPUs, In: Proc. Deep Learning and Unsupervised Feature Learning, NIPS Workshop, Volume 1 (2011).
  • [38] Hwang, K., Sung, W. Fixed-point feedforward deep neural network design using weights +1, 0, and -1, Signal Processing Systems (SiPS), 2014 IEEE Workshop on, IEEE (2014) 1–6.
  • [39] Jiaxiang Wu et al. Quantized Convolutional Neural Networks for Mobile Devices, arXiv preprint, 2016.
  • [40] Yunchao Gong et al. Compressing deep convolutional networks using vector quantization, arXiv preprint, 2014.
  • [41] Software Frameworks and Toolsets for Deep Learning-based Vision Processing, available at https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/documents/pages/deep-learning-software
  • [42] Amund Tveit et al. DeepLearningKit - an GPU Optimized Deep Learning Framework for Apple’s iOS, OS X and tvOS developed in Metal and Swift, arXiv preprint, 2016.
  • [43] Nicholas D. Lane et al. DeepX: a software accelerator for low-power deep learning inference on mobile devices, Proceedings of the 15th International Conference on Information Processing in Sensor Networks, Article No. 23, 2016.
  • [44] Seyyed Salar Latifi Oskouei et al. CNNdroid: GPU-Accelerated Execution of Trained Deep Convolutional Neural Networks on Android, arXiv, 2016.
  • [45] Shuochao Yao et al. DeepSense: A Unified Deep Learning Framework for Time-Series Mobile Sensing Data Processing, arXiv, 2017.
  • [46] Matthew W. Moskewicz et al. Boda-RTC: Productive Generation of Portable, Efficient Code for Convolutional Neural Networks on Mobile Computing Platforms, arXiv, 2016.
  • [47] TensorFlow Lite, https://www.tensorflow.org/lite
  • [48] https://medium.com/tensorflow/using-tensorflow-lite-on-android-9bbc9cb7d69d
  • [49] Apple. Core ML.
  • [50] Xmartlabs. Bender, https://xmartlabs.github.io/Bender
  • [51] Andrey Ignatov et al. AI Benchmark: Running Deep Neural Networks on Android Smartphones, arXiv, 2018.
  • [52] NVIDIA Corporation. Embedded systems developer kits and modules, www.nvidia.com/object/embedded-systems-dev-kits-modules.html, 2017.
  • [53] HiAI Foundation Introduction, https://developer.huawei.com/consumer/en/devservice/doc/2020314, 2017.
  • [54] Google and Movidius to Enhance Deep Learning Capabilities in Next-Gen Devices, https://www.movidius.com/news/google-and-movidius-to-enhance-deep-learning-capabilities-in-next-gen-devic, 2016.
  • [55] Da Zhang et al. Deep Reinforcement Learning for Visual Object Tracking in Videos, arXiv, 2017.
  • [56] Volodymyr Mnih et al. Recurrent Models of Visual Attention, Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014.