Smart at what cost? Characterising Mobile Deep Neural Networks in the wild

09/28/2021 ∙ by Mario Almeida, et al. ∙ Samsung

With smartphones' omnipresence in people's pockets, Machine Learning (ML) on mobile is gaining traction as devices become more powerful. With applications ranging from visual filters to voice assistants, intelligence on mobile comes in many forms and facets. However, Deep Neural Network (DNN) inference remains a compute-intensive workload, with devices struggling to support intelligence at the cost of responsiveness. On the one hand, there is significant research on reducing model runtime requirements and supporting deployment on embedded devices. On the other hand, the drive to maximise the accuracy of a task is supported by deeper and wider neural networks, making mobile deployment of state-of-the-art DNNs a moving target. In this paper, we perform the first holistic study of DNN usage in the wild in an attempt to track deployed models and match how these run on widely deployed devices. To this end, we analyse over 16k of the most popular apps in the Google Play Store to characterise their DNN usage and performance across devices of different capabilities, both across tiers and generations. Simultaneously, we measure the models' energy footprint, as a core cost dimension of any mobile deployment. To streamline the process, we have developed gaugeNN, a tool that automates the deployment, measurement and analysis of DNNs on devices, with support for different frameworks and platforms. Results from our experience study paint the landscape of deep learning deployments on smartphones and indicate their popularity across app developers. Furthermore, our study shows the gap between bespoke techniques and real-world deployments and the need for optimised deployment of deep learning models in a highly dynamic and heterogeneous ecosystem.




1. Introduction

The recent popularity of Deep Neural Networks (DNNs) has seen them being applied to myriads of areas, from computer vision

(He et al., 2016) to speech recognition (Chan et al., 2015) and machine translation (Sutskever et al., 2014). DNNs are no longer deployed only in datacenters (Hazelwood et al., 2018), as they have found their way into mobile devices, ranging from IoT devices to flagship smartphones and self-driving cars. In fact, a large part of what makes smartphones smart can be attributed to the ever-increasing support for machine learning, be it in the form of camera optimisations, intelligent assistants or text predictions.

While DNNs have become more and more accurate, this was frequently at the expense of an increased number of parameters, energy consumption and computational load (Almeida et al., 2019; Simonyan and Zisserman, 2015; Huang et al., 2017; He et al., 2016), often resulting in poor performance on resource-restricted mobile and embedded devices (Zhang et al., 2018; Lee et al., 2019a; Almeida et al., 2019).

To address these challenges, there has been significant research towards mobile-specific DNN optimisations. Firstly, researchers have designed various mobile-specific architectures either manually (Howard et al., 2017; Laskaridis et al., 2020b) or automatically, through Network Architecture Search (NAS) (Tan et al., 2019). Secondly, numerous works have looked into reducing computation through weight sparsification and pruning (Lee et al., 2019b) and quantisation (Han et al., 2016). Thirdly, kernel optimisations have been proposed for mobile SoCs (Fernandez-Marques et al., 2020). Last but not least, inference offloading is an alternative approach where computation is partly or wholly outsourced to a remote endpoint for faster results (Kang et al., 2017; Laskaridis et al., 2020a).

At the same time, recent developments in mobile SoCs enable smartphones to support higher DNN computational throughput at a lower energy budget (Wang et al., 2020; Ignatov et al., 2019), either through heterogeneous multi-core processors (e.g. ARM big.LITTLE and DynamIQ) or through specialised hardware (e.g. DSPs and NPUs). However, the device ecosystem remains very heterogeneous, ranging from cheaper devices with older processors to flagship devices with dedicated processing units. As a result, it is extremely hard for developers to assess the performance of, and optimise, their DNN models for each possible device tier (Wu et al., 2019).

In this work, we attempt to measure what the actual mobile ML landscape looks like in the wild by studying real-world DNNs, as deployed with the most popular applications of the Google Play Store. Our goal is to examine whether real-life deployments follow the state-of-the-art of ML research and identify performance bottlenecks over devices of different tiers and generations. The gained experience will provide insights on the system and model-level optimisations required to push the current frontier of mobile intelligence. In particular, we make the following contributions:


  • We design a system, named gaugeNN, that automates the extraction, analysis and benchmarking of DNN models found in the most popular apps in the wild.

  • Using gaugeNN we analyse over 16k (33k across two snapshots) Google Play Store apps with respect to their DNN models. We characterise these models in terms of their usage, architecture, layer operations and optimisations as well as external cloud-based DNN API calls.

  • We compare our latest snapshot against one of the Google Play Store's most popular apps taken 12 months earlier and comment on the trajectory of mobile DNN penetration over the past year.

  • We perform a runtime measurement of hundreds of these DNN models across heterogeneous devices of different capabilities to further characterise these models in terms of their achieved latency and energy consumption.

  • We analyse model and system-level optimisations supported by publicly available toolsets and provide an overview of the current DNN optimisation landscape available to developers and practical guidelines for improving the development and deployment of future DNNs.

2. Research Questions & Results

With our study, we aim to answer the following Research Questions (RQ) that arise:


  • RQ#1: Given the forefront of ML research and the multitude of tools and devices in the wild, what kind of models are being deployed in mobile apps and utilised by developers and for which tasks?

  • RQ#2: In a highly heterogeneous ecosystem of smartphones, how are these models deployed and are they able to perform efficiently across different targets and tasks?

  • RQ#3: What are common model and system-level optimisations being used to make inference in the wild faster on smartphones? Can they be improved?

Results: Our results indicate that mobile developers choose to deploy simple off-the-shelf models on-device, potentially pretrained or fine-tuned to target different tasks, and often rely on cloud offloading to support larger tasks. This minimises the burden on the app developer and capitalises on existing, widely available models. Furthermore, we witness that devices of different tiers and generations have widely varying performance over the benchmarked models, with low-tier devices being significantly slower in DNN-based tasks. When it comes to performance per watt, we notice a general trajectory of devices getting incrementally more efficient from generation to generation, with SoCs integrating more and more specialised hardware in the die. However, the same trajectory cannot be traced in battery technology, which remains largely the same and mainly varies with the device's form factor. Last, we have observed that off-the-shelf model-level optimisations deployed with major frameworks more often than not do not result in latency or memory benefits during inference, but are focused on the compressibility of the model. Simultaneously, SoC vendor-specific tools offer a significant runtime benefit, at the expense of the generality of the deployed models. Still, we found no significant evidence of target-specific model deployment in the wild.

3. Methodology

Figure 1. Workflow of gaugeNN.

To fulfil these diverse characterisation goals, we employ the three step methodology depicted in Fig. 1. First, we crawl the Google Play Store to find the DNN models from within the most popular apps among mobile users and extract their associated ML models, validating them against certain rules (grey boxes). Second, we perform a device-agnostic app and model analysis (purple boxes). Specifically, we look at the app’s store metadata, where the DNN is used, as well as the model’s layers and operations. Finally, we benchmark the models on different devices to analyse their performance upon deployment (blue box). To automate this process and analyse ML models at scale we designed gaugeNN. We describe below each component in greater detail.

3.1. DNNs retrieval

The first step in our methodology is to find, extract and validate the DNNs from Google Play Store most popular apps.

App crawling. First, gaugeNN mimics the web API calls made from the Google Play Store app of a typical mobile device to crawl the store. In these requests, both the user-agent and locale headers are defined, which determine the variant of the store and apps retrieved. To perform the crawling, we fetch the list of the top free apps per category, which returns a maximum of 500 apps. Additionally, gaugeNN stores the store metadata for each app – including popularity, category, reviews, etc. – in an ElasticSearch instance for quick ETL (Extract, Transform, Load) analytics and cross-snapshot investigations (Sec. 4).

Model extraction. Given the downloaded apps, gaugeNN proceeds to extract the DNN models from each application's package. Traditionally, Android applications are packaged in a zip file, i.e. an apk, which contains the Java/Kotlin “bytecode” along with resources used by the app (e.g. textures, images, fonts). Apks have a size limit of 100MB, and files – such as DNN weights – can have a larger storage footprint. As a result, Google Play allows additional content to be shared either through expansion files (OBBs) (Google, 2021b) or through Android App Bundles with Play Asset Delivery (Google, 2021a). The former supplement the main apk file and are hosted and served by Google Play, whereas the latter offers the possibility of downloading assets on demand, as needed for a given device. gaugeNN supports file extraction from i) the base apk, ii) expansion files (OBBs) and iii) Android App Bundles, but does not track asset delivery outside of Google Play. Extracted files are matched against a compiled list of 69 known DNN framework formats (listed in the Appendix) to identify potential DNN models.
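The matching of extracted files against known framework formats can be sketched as a simple suffix lookup over an apk's contents. The mapping below is a small illustrative subset of the paper's 69 formats, not the actual list gaugeNN uses:

```python
import os
import zipfile

# Illustrative subset of known DNN model-file formats (the paper's
# full list has 69 entries; these suffixes are common examples).
KNOWN_FORMATS = {
    ".tflite": "TFLite",
    ".caffemodel": "caffe",
    ".prototxt": "caffe",
    ".pb": "TensorFlow",
    ".param": "ncnn",
    ".dlc": "SNPE",
}

def candidate_models(apk_path):
    """Yield (member_name, framework) for files inside an apk whose
    extension matches a known DNN model format."""
    with zipfile.ZipFile(apk_path) as apk:
        for name in apk.namelist():
            ext = os.path.splitext(name)[1].lower()
            if ext in KNOWN_FORMATS:
                yield name, KNOWN_FORMATS[ext]
```

Files surfaced this way are only candidates; the validation step described next filters out false positives.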

Figure 2. gaugeNN benchmark platform.

Model validation.

Many models use generic file formats (e.g., protobuf), so the number of candidate model files and extensions is quite large, and benchmarking all prospective ones quickly becomes computationally prohibitive at scale. Therefore, inspired by the open-source Netron tool (Roeder, 2020), gaugeNN employs a lightweight, framework- and format-specific validation process to filter out files that are not DNN models. This validation consists of checking the binary signature of the file for the presence of specific identifiers that a framework uses. For example, for TFLite, we know that the FlatBuffer files representing models include specific headers at certain positions of the binary file, so we check for the existence of e.g. the string “TFL3” there.
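The TFLite case of this signature check can be sketched in a few lines: FlatBuffer files carry a 4-byte file identifier at offset 4, and for TFLite models that identifier is "TFL3". This is a minimal sketch of the idea, not gaugeNN's actual validator:

```python
def looks_like_tflite(path):
    """Cheap signature check: a TFLite FlatBuffer stores the file
    identifier "TFL3" at bytes 4..8 of the binary."""
    with open(path, "rb") as f:
        header = f.read(8)
    return len(header) == 8 and header[4:8] == b"TFL3"
```

Analogous byte-level signatures exist for the other frameworks, so the filter stays cheap enough to run over thousands of candidate files.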

On the downside, encrypted and obfuscated models do not match such validation rules and are not extracted in our analysis. Moreover, models downloaded on demand by the application outside of the official Google Play distribution mechanisms are omitted from our benchmarks. However, we do track applications using such models indirectly by means of library inclusion in the application code and native libraries, even without explicitly analysing the models. The native code detection follows the methodology of Xu et al. (Xu et al., 2019).

3.2. Offline DNN analysis

After collecting the top apps from each category, we analyse the usage of Deep Neural Networks in the wild. Apps can use DNN models in different ways: i) they can execute the models on-device, or ii) offload the computation to external resources (e.g. cloud providers).

In-app DNN models. After identifying the model files within an application, gaugeNN extracts their DNN architecture either by parsing the file directly or by using the associated framework's interpreter. A DNN model is typically represented as a DAG (Directed Acyclic Graph), where layers are represented by vertices and data flows by edges. By traversing each model's graph, gaugeNN registers the type of each layer, its parameters (weights) and operations in a trace-based manner and uses this information to estimate the total operations (#FLOPs) and model size (#parameters). Model FLOPs are estimated as a function of the cumulative Multiply-Accumulate (MAC) operations performed by each of the model's layers. Furthermore, we can later individually run these models and measure their inference latency, energy and memory footprint.
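The MAC-based FLOP estimate described above can be sketched per layer type; the layer schema below is our own illustrative simplification, not gaugeNN's internal representation:

```python
def conv2d_macs(h_out, w_out, c_out, k_h, k_w, c_in):
    # Each output element requires k_h * k_w * c_in multiply-accumulates.
    return h_out * w_out * c_out * k_h * k_w * c_in

def dense_macs(in_features, out_features):
    # A fully-connected layer is one MAC per weight.
    return in_features * out_features

def estimate_flops(layers):
    """Sum MACs over the model graph; count 1 multiply + 1 add per MAC."""
    macs = 0
    for layer in layers:
        if layer["type"] == "conv2d":
            macs += conv2d_macs(*[layer[k] for k in
                                  ("h_out", "w_out", "c_out",
                                   "k_h", "k_w", "c_in")])
        elif layer["type"] == "dense":
            macs += dense_macs(layer["in"], layer["out"])
    return 2 * macs
```

For example, a 3x3 convolution over a 3-channel input producing a 112x112x32 output contributes 112 * 112 * 32 * 27 MACs on its own.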

DNN Cloud APIs. Alternatively, applications might integrate ML functionality through cloud-backed APIs, by means of offloading inference to a remote endpoint. To detect the usage of cloud-based DNN models, gaugeNN inspects the app code to search for common DNN framework API calls. Android apps are typically developed in Kotlin or Java and then compiled into the dex format (Google, 2020a) and packaged within the app binary. It is possible to extract this dex binary from the app package and decompile it into a human-readable (smali (Freke, 2020)) format using apktool (Tumbleson, 2020) to inspect the original code API calls. gaugeNN automates the process of decompiling these binaries and performs string matching on the smali files to detect known cloud DNN framework calls. In particular, gaugeNN recognises calls to libraries belonging to Google Firebase (Google, 2020c), Google Cloud (Google, 2020b) and Amazon AWS ML services (Amazon, 2020).
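The string-matching pass over decompiled smali can be sketched as a prefix scan for known SDK class paths. The class prefixes below are illustrative examples for the providers the paper names (Firebase, Google Cloud, AWS); the exact list gaugeNN matches against is not reproduced here:

```python
# Illustrative smali class-path prefixes for cloud ML SDKs.
CLOUD_ML_PREFIXES = {
    "Lcom/google/firebase/ml": "Firebase ML",
    "Lcom/google/cloud": "Google Cloud",
    "Lcom/amazonaws": "AWS ML services",
}

def detect_cloud_apis(smali_text):
    """Return the set of cloud ML providers referenced in a smali file."""
    hits = set()
    for prefix, provider in CLOUD_ML_PREFIXES.items():
        if prefix in smali_text:
            hits.add(provider)
    return hits
```

In smali, Java class references appear in the `Lpackage/path/Class;` form, which is why plain substring matching over the decompiled files is sufficient.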

Figure 3. gaugeNN benchmark workflow.

3.3. Model benchmarking

Next, we describe how gaugeNN assesses the on-device run time and energy consumption of DNNs.

Devices. To assess the performance of the deployed DNN models at runtime – i.e. latency, energy, memory and CPU utilisation – we deploy these models on the devices of Table 1. The devices of the first group represent three distinct tiers of smartphones (low to high-end) and showcase the performance across heterogeneous clients, while the development boards of the second group represent high-tier SoCs from different generations, whose open design allows us to measure energy consumption through cable probes connected to a Monsoon power monitor (Fig. 2).

Benchmark workflow. All benchmarks are written in native code and compiled for aarch64 with Android NDK. gaugeNN adopts a master-slave architecture depicted in Fig. 2. The server, where the models initially reside, is responsible for orchestrating the deployment and benchmarking of the models across client devices (phones), connected over USB. To control the power passthrough of mobile devices, we use a USB controller board (Yepkit, 2020) that can programmatically disable data and power channels during measurements. This component was necessary, as connecting the device over USB charges it, interfering with the energy measurements.

The benchmarking workflow is depicted in Fig. 3. Initially, the master (left side) pushes all the necessary dependencies to the device (right side) through adb and asserts the initial device state (WiFi and sensors off, maximum screen timeout, etc.). The benchmark consists of an unattended, headless script that runs on the device upon disconnection of the USB power, controlled through the USB board. This script is launched as a daemon process and performs the following tasks: 1) it waits until the USB power is off; 2) it runs a configurable number of warmup inferences to remove cold-cache outliers; 3) it runs the actual benchmark inferences with a configurable inter-experiment sleep period; 4) it turns on WiFi upon completion and communicates a TCP message through netcat to the server that the experiment is over. Subsequently, the server re-enables the USB power, connects over adb and gathers the job results before cleaning up and launching the next job.

Model    | SoC            | RAM | Battery capacity
Samsung devices
A20      | Exynos 7884    | 4GB | 4000mAh
A70      | Snapdragon 675 | 6GB | 4500mAh
S20      | Snapdragon 888 | 8GB | 4000mAh
Qualcomm development boards
Q845 HDK | Snapdragon 845 | 8GB | 2850mAh
Q855 HDK | Snapdragon 855 | 8GB | N/A
Q888 HDK | Snapdragon 888 | 8GB | N/A
Table 1. Device specifications.

Energy measurements. Energy on open-deck devices is measured via a Monsoon power monitor (AAA10F). To prevent Android's battery-saving mechanisms (e.g., Doze (Google, 2020d)) from killing background jobs when the screen goes off or scaling down the CPU frequency, we keep the phone screen on during the benchmark by interfacing with Android's Power Manager service. We also ensure that the screen is always in a similar state across devices by developing an app that shows a black background. While the screen does incur extra energy consumption, this is measured and accounted for.
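The screen compensation described above amounts to integrating the power trace over the run and subtracting a separately measured screen-on baseline. A minimal numeric sketch, with rectangle-rule integration and sample values of our own choosing (not gaugeNN's implementation):

```python
def energy_joules(power_samples_w, sample_period_s):
    # Rectangle-rule integration of instantaneous power over time.
    return sum(power_samples_w) * sample_period_s

def net_inference_energy(run_samples_w, screen_power_w, sample_period_s):
    """Subtract the measured screen-on baseline from the total energy
    drawn during a benchmark run."""
    total = energy_joules(run_samples_w, sample_period_s)
    screen = screen_power_w * len(run_samples_w) * sample_period_s
    return total - screen
```

For instance, a one-second run sampled at 10 Hz drawing a constant 2 W, with a 0.5 W screen baseline, attributes 1.5 J to inference.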

In the following sections, we present the findings of our experiments run with gaugeNN. First, we present an offline analysis of the apps and models found from crawling the Google Play Store (Sec. 4) and then we move to runtime analysis of these models on devices (Sec. 5) and specific optimisations (Sec. 6).

4. Dataset Collection & Analysis

In this section, we attempt to answer RQ#1 with regard to DNN deployment in the wild. In this direction, we first analyse our collected data with respect to the existence of DNN models in the top Google Play Store apps and their distribution to user devices. Then we move to more specific model and app categorisation and characterisation, and finally draw conclusions about the trajectory of mobile ML deployment from our temporal analysis results.

4.1. Datasets

As shown in Table 2, we collected two snapshots of the top free Google Play apps, in February 2020 and in April 2021. At these points in time, Android devices represented and of the mobile OS market share (Statista, 2020; GlobalStats, 2020) respectively. Data was collected from a UK-based account associated with a Samsung S10 (SM-G977B), downloading the most popular apps across all categories of the Google Play Store (up to 500 apps per category). This accounts for the top 0.6% of total applications available in the store (the Google Play Store is estimated to have 2.9M apps at the time of the latest snapshot (AppBrain, 2020)). In general, app downloads tend to follow a power-law distribution (Viennot et al., 2014): the most popular apps are installed on most users' phones while the rest follow a long tail. While we could not scale a study of paid apps for monetary reasons, these account for a very small percentage of downloaded apps (Viennot et al., 2014). For the rest of the paper, we report on the latest Play Store snapshot, unless explicitly stated otherwise.

4.2. Model distribution to devices

As described in Sec. 3.1, models in Android applications can be distributed post-installation (e.g. through OBBs or Asset Delivery). This allows developers to bypass the 100MB apk limit and to provide customised models for devices with different capabilities (e.g. devices with a specific NPU). To identify any models that are distributed post-installation, we downloaded all companion files and Google Play assets. We found no models being distributed outside of the main apk. Furthermore, we downloaded an extra snapshot with a device profile three Android generations older (Samsung S7 edge – SM-G935F, released in February ’16, three years before the S10 5G) and found no evidence of device-specific model customisation.

Observations: Our results indicate that the functionality offered by Play Services to download device-specific models may be underutilised in the realm of mobile ML or that developers choose not to specialise their models per device SoC or model. While specialising the model distribution per device target can be beneficial for performance and energy, it requires offline vendor-specific customisation of the model. Evidently, app developers seem to prefer generality of their deployment solutions, in line with (Wu et al., 2019), and defer optimisation to middleware in the stack, such as NNAPI drivers or specific hardware delegates (Ignatov et al., 2019).

                   | Snapshot ’20 | Snapshot ’21
Date               | Feb. 2020    | Apr. 2021
Total Apps         |              |
Apps w/ frameworks |              |
Apps w/ models     |              |
Total models       |              |
Unique models      |              |
Table 2. Dataset snapshots details.

4.3. ML frameworks

Next, we look into the models found per ML framework. Specifically, Fig. 4 depicts the number of models successfully extracted, validated and benchmarked, per category and ML framework. These models represent 90.72% of the total apps including ML libraries in their codebase (Table 2), with the rest accounting for obfuscated, encrypted or lazily downloaded models. In total, these account for 1,666 models – 1,436 (86.19%) TFLite, 176 (10.56%) caffe, 46 (2.76%) ncnn, 5 (0.3%) TensorFlow and 3 (0.18%) SNPE. TFLite is expectedly first in popularity, as the recommended solution from the OS provider for mobile ML inference. However, it is surprising to see caffe so widely used, since it has long been deprecated, replaced first by caffe2 in 2017 and now by PyTorch Mobile.

Observations: These results illustrate a long latency between the state-of-the-art frontier of ML frameworks and their adoption for in-the-wild deployment.

4.4. Model categorisation

Here, we perform a quantitative analysis of DNN models and their respective apps and correlate them with metadata from the Google Play Store. Our aim is to categorise the most popular DNN-powered apps and characterise their usage.

Figure 4. Number of models gaugeNN successfully extracted and executed per framework and Google Play category. Categories with fewer than 20 models are excluded.

Fig. 4 shows the number of ML models per framework and Google Play category. We observe that the top DNN-powered apps belong to “communication” and “finance” tools with several DNNs for face and object detection (e.g. for detecting a card or ID to make transactions in the latter case). These are followed by more traditionally DNN-backed categories, such as “photography” and “beauty”, which typically contain DNN-based filters to enhance photos. Potentially less expected categories include “food and drink”, “dating” and “parenting”. By manually examining these models, we found anecdotal examples of apps within these categories using DNNs to detect or recognise objects (e.g. a bottle of wine or a face), for recommendation systems (e.g. partner matching, advertising and food recipe recommendation) and even for baby monitoring.

To dig deeper into the purpose of each AI model, we manually looked into the naming, input/output dimensions and layer types of the encountered DNN models in order to characterise their usage. This labour-intensive job was split across three ML researchers, with a majority vote on the results. We were able to identify the usage of models, accounting for of all models, with around having names which hint at either the model, the task at hand, or both (e.g. “hair_segmentation_mobilenet.tflite”). Our characterisation shows that the most popular task for deploying Deep Learning is computer vision ( of all models), followed by NLP (17 models) and audio (15 models). Last, we found traces of DNN models (4 models) utilising sensor data, such as accelerometer, gyroscope, etc. Two anecdotal use-cases for sensor ML are horse movement tracking and car crash detection in insurance apps. Task-specific results are shown in Table 3, where it can be seen that most vision models were targeted at object, face and contour detection, most audio tasks at ambient sound recognition, most NLP tasks at text-completion and sensor tasks at movement tracking.


Vision models seem to be the most prevalent, with a focus on object and face detection and text recognition, and are used mostly across communication, photography and beauty apps.

4.5. Model uniqueness characterisation.

Diving deeper into the models distributed amongst the most popular applications, we found that not all models are bespoke or unique. Overall, we witness DNN models spread across different application categories, with a significant portion of these being off-the-shelf models without customisation. In fact, after checking for unique checksums on these models and their respective weights (most apps distribute the model weights in their apk, either in a single file along with the DNN graph, or in separate files (e.g. caffe); in either case, we perform an md5 checksum on both the model and weights), we find that only 318 models (19.1% of the models, as shown in Table 3) are unique. For the most prevalent vision task, i.e. object detection, FSSD (Li and Zhou, 2017) appears to be the most popular model. We found such occurrences even within popular Google apps (e.g. “Gallery Go” and “Arts & Culture”). For face detection, BlazeFace (Bazarevsky et al., 2019) is another very popular model. Spanning across tasks, MobileNet (Howard et al., 2017) is the most popular architecture, with variants (e.g. FSSD) used for other vision tasks, including semantic segmentation, pose estimation and classification. Last, we encounter multiple occurrences of models tackling a common task, e.g. recognising information from credit cards (Team, 2020), such as names and dates.
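The duplicate-model check described above can be sketched as an md5 digest over the model graph file plus any separate weight files, so that byte-identical off-the-shelf copies collapse to one entry. A minimal sketch of the idea (the file layout is illustrative):

```python
import hashlib

def model_fingerprint(file_paths):
    """md5 over all files that make up one model (graph + weights),
    in a path-order-independent way."""
    md5 = hashlib.md5()
    for path in sorted(file_paths):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                md5.update(chunk)
    return md5.hexdigest()

def count_unique(models):
    """models: list of lists of file paths, one list per model."""
    return len({model_fingerprint(m) for m in models})
```

Note that this only catches exact byte-level duplicates; fine-tuned variants require the layer-level analysis discussed next.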

Task                  | Models
Vision (1495 models)
object detection      | 788
face detection        | 197
contour detection     | 192
text recognition      | 185
augmented reality     | 51
semantic segmentation | 14
object recognition    | 14
pose estimation       | 8
photo beauty          | 8
image classification  | 7
nudity detection      | 5
other                 | 26
NLP (17 models)
auto-complete         | 9
sentiment prediction  | 4
content filter        | 2
text classification   | 1
translation           | 1
Audio (15 models)
sound recognition     | 12
speech recognition    | 2
keyword detection     | 1
Sensor (4 models)
movement tracking     | 3
crash detection       | 1
Table 3. DNN task classification.

Model fine-tuning. Taking this analysis one step further, we perform a checksum-based analysis at a finer granularity (layer level) to see to what degree developers train their own models from scratch or fine-tune the last layers through transfer learning (Pan and Yang, 2009). The intuition is that the first layers of the network typically extract low-level features (e.g. edges, shapes, etc. for vision tasks) that are shared between similar tasks, and only deeper in the DNN do the task-specific and semantically relevant features get extracted. Results from our analysis show that, excluding duplicate models, 9.02% of the remaining models share at least 20% of their weights with at least one other model. In fact, of the models only differ in up to three layers, indicating that some developers only fine-tune small portions of the network, resulting in a significantly smaller training footprint and exploiting transfer learning from other (typically off-the-shelf) networks. Moreover, we checked for traces of online fine-tuning done on-device (e.g. through TFLiteTransferConverter (2019)) and found none, indicating that on-device fine-tuning is not yet widely exploited in the wild, due to the significant computation requirements and the limited availability of labelled, high-quality on-device datasets.
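The layer-level variant of the checksum analysis can be sketched by hashing each layer's weight tensor separately and measuring what fraction of layers two models share; fine-tuned copies then show a high overlap with a few differing (head) layers. A minimal sketch, with weights represented as raw bytes for simplicity:

```python
import hashlib

def layer_hashes(weights):
    """weights: list of bytes objects, one serialised tensor per layer."""
    return [hashlib.md5(w).hexdigest() for w in weights]

def shared_fraction(model_a, model_b):
    """Fraction of model_a's distinct layer weights also found in model_b."""
    a, b = set(layer_hashes(model_a)), set(layer_hashes(model_b))
    if not a:
        return 0.0
    return len(a & b) / len(a)
```

Two models sharing all but their final classification layers would score close to 1.0 here, which is the pattern one would expect from transfer learning on an off-the-shelf backbone.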

Observations. Based on this type of evidence, we deduce that it is common for developers to leverage a pre-trained model that is widely available and pay the significantly smaller cost of training offline only a subset of the last DNN layers. While online on-device training is a prominent future avenue, be it through fine-tuning or federated learning, current support in mobile frameworks is limited and so are such deployments.

4.6. Temporal analysis across snapshots

Figure 5. Individual models removed/added between two snapshots taken one year apart.

As mentioned, we took two distinct snapshots of the most popular apps in the Google Play Store, 12 months apart. In this part of our analysis, we compare and contrast these two snapshots in terms of app popularity and in-the-wild DNN deployment, and draw conclusions about the current trajectory of ML penetration in smartphones. What is unique about our dataset is that we happened to measure DNN deployment across the COVID-19 pandemic, which had a crucial impact on human activity during 2020/2021. For this reason, we also compare our temporal analysis with similar analyses done in the past (Xu et al., 2019) to i) identify potential biases of our dataset during these exceptional circumstances and ii) see how app popularity and, by extension, DNN adoption has been affected by them.

Results from our temporal analysis indicate a surging number of DNN models being deployed on the Android platform, essentially doubling in the course of 12 months. Specifically, our traced models went from to for our latest snapshot (Table 2), with most additions belonging to vision tasks. TFLite remains the dominant mobile inference framework, going from to of the total models found (). The increase in models was less pronounced for ncnn () and caffe (). The latter is surprising given the fact it has been deprecated and newer frameworks have taken its place (caffe2 and PyTorch Mobile). Finally, we observe a drop in the TF () adoption rate, which is expected given the increasing popularity of its mobile counterpart.

Next, we analyse the DNN models across snapshots per category of application to which they belong. Fig. 5 depicts the number of individual models that were removed/added across our snapshots, sorted by the difference between the two. Interestingly, most additions of ML models happened in communication tools, taking the lead from “photography” applications, which was the top ML-powered category of 2020. This potentially indicates that communication apps became more important due to the pandemic and developer focus was diverted to this category. A similar trend can be witnessed for “finance” applications, where we observed many models aimed at the automated identification of people and their ID cards. Whilst this traditionally constituted a manual process done in person at financial institutions (e.g. banks), the pandemic might have created a new need for ML models to fill. Last, apps related to “health” and “medical” purposes seem to have a surging deployment of DNN models. On the other side of the spectrum, “lifestyle”, “food & drinks” and “Android Wear” applications seem to be falling in popularity, something that could potentially be attributed to people staying at home more.

Next, we integrate the results of previous analyses (Xu et al., 2019; Sun et al., 2021) to shape a more general trend for DNN adoption in the Android ecosystem. In (Xu et al., 2019), the authors report the total number of ML-backed apps going from in June 2018 to in September 2018. In (Sun et al., 2021), the authors traced ML-powered apps somewhere between (Xu et al., 2019) and June 2020 (the snapshot date is not reported, so we consider it to fall between that of (Xu et al., 2019), with which it compares, and the work's venue submission date). Last, for our trace, we report ML-powered apps going from to between February 2020 and April 2021. From these figures, we witness a soaring trajectory of ML apps deployed in the wild, with the adoption rate of ML accelerating.

Observations: While there was a big reshuffling in the type of AI models deployed during the pandemic, we observe a considerable general growth in the number of DNN models in AI-powered applications in the past 3 years (from 176 in 2018 (Xu et al., 2019) to 1,666 in April 2021). These results demonstrate how the proliferation of mobile AI frameworks, the availability of pre-trained models and the constant improvement of mobile hardware have driven this growth and the need to keep up with this ever-increasing adoption.

Figure 6. Model layer composition per input modality for TFLite, NCNN and caffe.

4.7. Mobile DNNs layers and operations

After having coarsely characterised the models based on their input modality, target task and app category, we take a finer-grained look into the models and analyse their structure in terms of the layers and operations they contain.

DNN layers and operation types. First, we go through the graph representing each DNN and trace the layer types it contains, grouping results per input modality. Results are shown in Fig. 6 for TFLite, NCNN and Caffe. We see convolution layers being amongst the most popular layer types across modalities (34%, 10%, 20% for image, text and audio, respectively). Originally applied in visual tasks, their usage nowadays spreads across recommender systems, natural language processing and time-series analysis. Variants such as depthwise-separable convolutions (depth_conv) (Howard et al., 2017) are computationally lighter and aimed at mobile deployments. Dense (or linear) layers are fully-connected layers that are typically found in the output of classification tasks, or in the implementation of RNNs. The majority of these layers are found in audio (19%) and text (9%) models. Activations impose non-linearity on DNNs and, in terms of implementation, can be fused with the preceding layer. Thus, the existence of such operations as distinct layers is framework dependent. Last, “helper” layers, such as math, quant, resize and slice operations, perform math or matrix representation operations and can be found across modalities.
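As a concrete illustration, the per-modality layer breakdown above can be computed by bucketing each model's operator names into coarse categories. The mapping below is a minimal sketch with a handful of illustrative operator names, not gaugeNN's actual taxonomy or an exhaustive TFLite/NCNN/Caffe operator list:

```python
from collections import Counter

# Illustrative operator-name buckets; real frameworks expose many more ops.
CATEGORIES = {
    "conv": {"CONV_2D", "Convolution"},
    "depth_conv": {"DEPTHWISE_CONV_2D", "ConvolutionDepthwise"},
    "dense": {"FULLY_CONNECTED", "InnerProduct"},
    "activation": {"RELU", "RELU6", "LOGISTIC", "TANH", "SOFTMAX"},
    "helper": {"RESHAPE", "QUANTIZE", "DEQUANTIZE", "RESIZE_BILINEAR", "SLICE"},
}

def layer_distribution(ops):
    """Return the share of each coarse category among a model's operators."""
    lookup = {op: cat for cat, names in CATEGORIES.items() for op in names}
    counts = Counter(lookup.get(op, "other") for op in ops)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}
```

Aggregating such per-model distributions over all models of a given input modality yields breakdowns like those in Fig. 6.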

Figure 7. FLOPs and parameters per DNN task.
Figure 8. Observed relationship between latency and FLOPs across six different devices.

DNN #operations and #parameters. Next, we estimate the number of operations (in FLOPs) and parameters that each model contains by going through the graph in a trace-based manner. Concretely, we generate a random input with the DNN-specified input dimensions and perform a DNN inference. During the forward propagation step, we analytically measure the number of operations performed per layer (dependent on the kind of layer) and the number of trainable parameters associated with it. Fig. 7 shows the result of this analysis per DNN task. We see that, among the traced models, the heaviest deployed vision models on average belong to classification, hair reconstruction, segmentation and beauty tasks. For NLP, the heaviest deployed task is text auto-completion, whereas for audio it is sound recognition. At this point, we note that these numbers only refer to the traced deployed models and do not constitute a generic commentary on the overhead of models per task. In fact, in many cases the opposite holds if we only take the task into consideration (e.g. classification vs. segmentation or speech vs. sound recognition). Also, we note that the number of models found for each task category varies significantly.
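The trace-based counting described above reduces to simple analytic formulas per layer type. A minimal sketch for the two dominant layer types follows, using the common (but not universal) convention of counting one multiply-accumulate as two FLOPs, which is an assumption on our part:

```python
def conv2d_cost(h_out, w_out, k_h, k_w, c_in, c_out, bias=True):
    """Analytic cost of a standard 2D convolution.

    FLOPs count each multiply-accumulate as two operations;
    parameters are the kernel weights plus an optional bias.
    """
    macs = h_out * w_out * k_h * k_w * c_in * c_out
    params = k_h * k_w * c_in * c_out + (c_out if bias else 0)
    return 2 * macs, params

def dense_cost(n_in, n_out, bias=True):
    """Analytic cost of a fully-connected (dense) layer."""
    params = n_in * n_out + (n_out if bias else 0)
    return 2 * n_in * n_out, params
```

Summing these per-layer costs over a traced forward pass gives the per-model totals plotted in Fig. 7.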

Observations: We find that convolutions dominate the mobile DNN landscape due to their wide use in vision models, as well as the fact that they can map well on mobile hardware for efficient execution, compared to e.g. recurrent layers (Zhang et al., 2018). While depth-wise convolutions can significantly improve performance, their deployments are scarcer as they can impact the quality of the model. Furthermore, we find that there is huge variance in terms of FLOPs and parameters (four orders of magnitude) in the traced models. This might be attributed to the granularity of the task corresponding to a single inference. For example, in image recognition the input is typically an RGB image, while in next-word prediction the input can be a couple of words.

5. Runtime analysis of Mobile DNNs

Up until now, we have focused our efforts on analysing the DNN models in an offline manner. In this section, we turn to on-device benchmarking and report on performance and energy when running the encountered models across the devices presented in Table 1. This analysis provides important insights about how real-world AI applications are performing on a heterogeneous set of devices, thus answering RQ#2.

5.1. On-device DNN latency

Figure 9. Latency per device ECDF.
(a) Inference energy (b) Inference power (c) Inference efficiency
Figure 10. Distributions of inference energy, power and efficiency of the collected models when run across 3 generations of Qualcomm SoCs. The lines represent kernel density estimations.

Prior work (Almeida et al., 2019; Ignatov et al., 2019) has shown that FLOPs is not necessarily a good proxy for estimating a model’s on-device performance. Reasons for such discrepancies include hardware underutilisation (e.g. due to memory-bound operations), thermal throttling under continuous inference, or scheduling on cores of different dynamics due to energy-saving scheduler policies on Heterogeneous Multi-Processors (Kim et al., 2017). To further corroborate this, in Fig. 8 we depict the FLOPs and the actual measured inference latency across devices for different models. Our analysis of real-world models on different devices reinforces this non-linear (line-fit) relationship, as it not only varies across model architectures, but also differs from one device to another.

To investigate this further, in Fig. 9 we show the ECDF of model runtime across all available devices. From the graph it is evident that the computing gap between a low-end device (A20) and a mid-tier device (A70) is considerably larger than the difference between mid-tier and high-end (S21). Specifically, the low-end and mid-tier devices (A20 and A70) are and slower compared to the S21. Across generations of high-end SoCs of the same manufacturer (Q845, Q855, Q888), we see incremental performance gains (i.e., average latencies of , and ms), noticeable to the point that a next-generation mid-tier phone may outperform the high-end SoC of a prior generation, despite claims of significant boosts in AI acceleration between generations. Last, we note that for the two devices integrating the same SoC (Q888 and S21), the open-deck design of the development board, along with the vanilla variant of the OS, leads to incrementally better results and faster inference overall. Heat dissipation of the open design, cross-manufacturer configurations and the low-level configuration of the Android scheduler can all be contributing factors.

Observations: We observe a wide variability of inference latency across devices even for models that have similar FLOP counts, which reaffirms the need for on-device benchmarking. Devices of different tiers and generations offer variable dynamics, with the lower-tier falling significantly behind in performance. Even devices integrating the same SoC can offer variable performance due to vendor-specific configurations, the installed apps and drivers or even due to different thermal characteristics. Therefore, given this heterogeneity, it is hard for developers to accurately predict the users’ experience without testing their models on a large sample of devices.

5.2. Energy consumption

In mobile settings, one cannot simply optimise for performance without taking energy consumption into consideration. While smartphone capabilities are growing larger every year, the same developments have not been witnessed in battery technology. Therefore, quantifying the cost of being smart in terms of energy is an important component in the mobile world. In this section, we report on the energy, power and efficiency of doing inference on device, across frameworks for the three Snapdragon boards representing different generations of devices.

5.2.1. Energy and power consumption per device

Fig. 10(a) shows the distribution of models with respect to the energy required per inference across our three devices. Expectedly, we see from the kernel density estimation lines that all three devices follow a similar trajectory, indicating that a similar amount of energy is required for similar workloads regardless of the device. On the other hand, this is not the case for power consumption (Fig. 10(b)), where we see newer generations of devices consistently drawing more power to run models. This is a direct implication of the fact that newer generations of devices execute models faster, as shown in Fig. 9, while the energy required remains similar.

Following these observations, we calculate the inference efficiency of each model as the number of floating-point operations that can be executed in one second per Watt (effectively the same as calculating FLOPs per Joule). As can be seen in Fig. 10(c), trends in efficiency stay mostly the same across devices, following energy consumption. Unlike energy, however, we see a minor improvement of the newer devices over the Q845 in the middle of the distribution, suggesting that relatively more models can run more efficiently (median efficiency of 730, 765 and 873 MFLOP/sW, after removing outliers) on the newer hardware.
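The efficiency metric follows directly from its definition; a minimal sketch, where the example numbers in the test are hypothetical but in the range of the reported medians:

```python
def inference_efficiency(flops, latency_s, power_w):
    """FLOPs executed per second per Watt.

    Since energy (J) = power (W) * latency (s), this is
    equivalently FLOPs per Joule.
    """
    energy_j = power_w * latency_s
    return flops / energy_j  # == (flops / latency_s) / power_w
```

For instance, a hypothetical 300 MFLOP model running in 100 ms at 4 W yields 750 MFLOP/sW.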

5.2.2. Use-case driven energy consumption

Up to here, we have seen performance and energy consumption for single inferences. However, the quanta of data associated with each inference may vary considerably between tasks or modalities as noted before in Sec. 4.7. Thus, we dive deeper into three selected tasks representative of each modality, namely i) sound recognition for audio, ii) auto-completion for text and iii) semantic segmentation for vision.

Use-case    Battery discharge (mAh)
            Avg. (±std)        Median    Min       Max
Sound R.    0.6350 (±2.0226)   0.0652    0.0351    2.5277
Typing      0.0752 (±0.1637)   0.0292    0.0245    0.1993
Segm.       1221.7 (±2761.0)   619.62    271.93    3835.2

Sound R.    1.0311 (±3.3438)   0.1821    0.0262    5.0327
Typing      0.1192 (±0.2835)   0.0387    0.0279    0.3404
Segm.       1133.4 (±2468.1)   489.10    262.85    3239.7

Sound R.    0.7950 (±2.8060)   0.1009    0.0316    4.4132
Typing      0.1001 (±0.2484)   0.0315    0.0300    0.3403
Segm.       1062.7 (±2416.6)   455.71    272.44    3290.8
Table 4. Scenario-driven energy consumption for three devices and use-cases in audio, text and vision.

We make certain realistic assumptions on the data sizes, input granularity and frequency of results, and then assess all relevant models belonging to each category. Specifically, for sound recognition, we assumed each model is run in order to recognise 1 hour of audio input. To derive how long a model would need to run, we manually investigated the models and assumed the most likely amount of audio input per inference, considering the model’s input dimensions and common practices in speech ML (Chan et al., 2015; Pratap et al., 2020; Mehrotra et al., 2021). For text auto-completion, we assumed each model is run once per new word typed by a user, and further assumed a workload of 275 words, derived from WhatsApp’s statistics about the average daily number and length of messages (Facebook, 2020; Whatsapp, 2021; Rosenfeld et al., 2015). Last, for semantic segmentation, we assumed each model is used to segment a human at 15 FPS during a 1-hour-long video call in order to apply background effects; we further assumed that the model processes one frame per inference, which is the usual approach (Long et al., 2015; Zhao et al., 2018; Chen et al., 2020).

Results across the development boards are depicted in Table 4 and indicate that different tasks and use cases have very different impact on battery life. At the high end of energy consumption, we see that one hour of segmentation results in a significant average reduction of 26.6% to 30.54% of a common 4000mAh battery capacity (e.g. A20 and S21). Moreover, the most energy-hungry segmentation models can almost deplete the full battery capacity within an hour, with an 80.9% to 95.9% reduction. On the other end, models like auto-completion are ubiquitous across messaging apps and deliver both in terms of performance and efficiency, allowing their frequent use without a significant impact on battery.
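The workload assumptions above translate into a back-of-the-envelope battery cost as follows. The 3.85 V nominal Li-ion cell voltage used for the mAh conversion is our own assumption, not a measured vendor figure:

```python
def discharge_mah(energy_per_inf_j, num_inferences, battery_voltage=3.85):
    """Convert per-inference energy (J) into battery discharge (mAh).

    1 mAh at V volts stores V * 3.6 joules; 3.85 V is a typical
    nominal Li-ion cell voltage (an assumption, not vendor data).
    """
    total_j = energy_per_inf_j * num_inferences
    return total_j / (battery_voltage * 3.6)

# Workload sizes from the text:
SEGMENTATION_RUNS = 15 * 3600   # 15 FPS over a 1-hour video call
TYPING_RUNS = 275               # one inference per typed word

def battery_drain_pct(mah, capacity_mah=4000):
    """Share of a common 4000 mAh battery consumed by a workload."""
    return 100 * mah / capacity_mah
```

With the average segmentation discharge of roughly 1221.7 mAh from Table 4, this yields the ~30.5% battery reduction quoted above.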

Observations. Energy consumption is a major component in mobile, and intelligence comes at a cost to battery life. Unlike latency, which is visibly improved with new generations of devices, energy consumption seems to be predominantly dependent on the model architecture. Even though newer hardware might improve in power-efficiency, differences are much less pronounced compared to performance improvements, which are even less observable across different model architectures. This suggests that it is the AI developers who can optimise battery life the most, unlike plain latency which can be improved at multiple levels, including manufacturers.

6. Available Optimisations

After examining how real-world DNNs run on a heterogenous set of devices, we now look into RQ#3 by means of DNN-specific as well as system-level optimisations aiming to improve inference and deployment performance.

Figure 11. Inference throughput vs. batch size.

6.1. Model-level Optimisations

In this section, we focus on the adoption of three model-level optimisations, namely i) weight clustering, ii) pruning and iii) quantisation, for the identified TFLite models.

Clustering: Clustering refers to the technique of reducing the number of distinct weight values by representing them through their clusters’ centroids (Han et al., 2016). We identify clusters of shared weights by searching for layers with a “cluster_” prefix on TFLite models. Despite the advertised potential for significant model size reductions (Google, 2021c), we report that none of the models in-the-wild seem to use weight clustering. This may be a result of either accuracy drops or the fact that the current clustering implementation does not reduce runtime memory and targets model compression only (Google, 2021c).
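A minimal sketch of this prefix-based detection, assuming the layer names have already been extracted from a parsed model (the actual gaugeNN implementation is not reproduced here):

```python
def uses_optimisation(layer_names, prefix):
    """Flag models whose layer names carry a framework-added prefix,
    e.g. 'cluster_' for TFLite weight clustering.

    Note: such prefixes can be stripped at export time, so absence is
    only weak evidence that the optimisation was never applied.
    """
    return any(name.startswith(prefix) for name in layer_names)
```

Running this check with the "cluster_" prefix over the traced TFLite models is how the zero-adoption result above is obtained.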

Pruning: Pruning refers to the technique of zeroing out specific weights/channels of the network that have minimal impact on the output, exploiting the representational redundancy in DNNs. Weight pruning applied during training can be detected by searching for layers with a “prune_” prefix in TFLite models; nonetheless, this prefix is often removed for inference (Google, 2021d). We report that we did not find any occurrence of such layers either. While this approach has the potential to skip zero-weight computations during inference, the current implementation benefits only from increased sparsity which, like clustering, results only in model compressibility. To determine whether there is potential for adopting magnitude-based weight pruning, we measured the weight sparsity of the tracked TFLite models. We find that, overall, 3.15% of weights are near zero (within ), which might indicate limited prospects for weight magnitude-based pruning.
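The sparsity measurement can be sketched as follows; the near-zero threshold here is purely illustrative, standing in for the cutoff used in our analysis:

```python
def near_zero_fraction(weights, eps=1e-3):
    """Fraction of weights with magnitude below eps.

    This is a rough proxy for how much magnitude-based pruning could
    remove without retraining; eps is an illustrative threshold.
    """
    flat = list(weights)
    return sum(abs(w) < eps for w in flat) / len(flat)
```

Averaging this fraction over all weight tensors of the tracked TFLite models gives the overall sparsity figure reported above.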

Figure 12. TFLite’s model throughput for different devices and compute targets.

Quantisation: Finally, quantisation constitutes a prominent method for minimising the computational and memory demands of DNNs by means of reducing their representation precision (Wu et al., 2016; Jacob et al., 2018). To study its adoption, we analysed the layer types and their weight and input bitwidth representations. We report that 10.3% of the models make use of the dequantize layer, which indicates the deployment of lower-precision models as a way to perform model compression. Furthermore, by examining each model’s weights, we found that 20.27% of the models use int8 for the weight tensors, whereas 10.31% of the models work with int8 activations.

Recent hardware advances have led to NPUs that support multiple arithmetic precisions (Qualcomm, 2021; Arm, 2021; Liao et al., 2019). Such examples are the Hexagon 698 processor on Qualcomm Snapdragon 865 (SDM865) (Qualcomm, 2021) and the Arm Ethos processor (Arm, 2021), which support 16-bit for activations and 8-bit for weights (A16W8). These schemes enable a better compromise between faster low-precision compute and having enough representational power to achieve good accuracy. In spite of the new opportunities of these hardware architectures, not only do existing deployment methodologies fail to exploit them but we also found no evidence of their adoption. We revisit the issue of quantisation with hardware-specific optimisations in Sec. 6.3, where we use the Google’s NNAPI and Qualcomm’s SNPE to target specific processors in the SoC.

Observations: While the research community has developed numerous ways to optimise DNNs for mobile execution, out-of-the-box support for such optimisations in modern frameworks can be primitive and might not translate to runtime gains, while still coming at the expense of accuracy. Furthermore, most optimisations typically require model re-training and access to large-scale datasets. As such, we find that such optimisations are not widely adopted by mobile AI developers. Quantisation, which can also be used to target different SoC accelerators, is the most widely used optimisation. However, more advanced hybrid quantisation schemes remain unsupported.

6.2. System-level optimisations

Upon deploying a model, developers have different setup choices that can affect the model’s performance. In this section, we discuss the impact of different tuneable model and system parameters on model performance.

Impact of batch size. One common way of increasing a model’s throughput is batching input samples together. By taking advantage of the SIMD instructions of SoCs and accelerators, this technique increases a DNN’s throughput by producing multiple inference results in one forward pass.

In Fig. 11, we show the batched throughput across devices when processing and samples at a time with 4 threads. We only consider TFLite models that successfully ran at all batch sizes across all devices (149 in total). As expected, we see that throughput increases with batch size. In fact, throughput scales almost linearly, which indicates that no bottleneck is hit up to that point. Moving the comparison across devices, we see that the S21 offers significantly faster inference, with throughput being and higher compared to the A70 and A20, respectively, at the highest batch size. This result is in line with our conclusions from Sec. 5.1. We anticipate that when scaling to higher batch sizes, devices with lower core counts and memory will hit memory-bandwidth bottlenecks or out-of-memory errors, but we defer this to future work.
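Batched throughput and its scaling behaviour reduce to simple ratios; a minimal helper, with hypothetical latencies in the test:

```python
def throughput(batch_size, batch_latency_s):
    """Samples processed per second by one batched forward pass."""
    return batch_size / batch_latency_s

def speedup_vs_unbatched(unbatched_latency_s, batch_size, batch_latency_s):
    """Throughput gain of batching relative to single-sample inference.

    Values close to batch_size indicate near-linear scaling, i.e. the
    per-batch latency barely grew as the batch got larger.
    """
    return throughput(batch_size, batch_latency_s) / throughput(1, unbatched_latency_s)
```

In the ideal SIMD case, a batch of 8 processed in the same wall-clock time as a single sample gives a speedup of 8.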

Figure 13. ECDF of TFLite models latency and energy per CPU runtime.

Impact of thread count. Another tuneable parameter during mobile execution is the number of threads allocated for execution on CPU. By default, all cores of the device can be simultaneously used during execution (ARM DynamIQ). However, in Heterogeneous Multi-core Processors (HMP) there usually exist multiple islands of cores, offering different dynamics and computational power. In Fig. 12 we show how the models’ throughput varies when executed with different thread counts (2,4,8) and affinities (2,4). For the latter, we use process pinning to select which cores to target from the heterogeneous core sets. We observe that the optimal thread count can vary across devices, with A20, A70 and S21 performing better with 4, 2 and 4 threads, respectively. We also see that the 8-threaded performance drops significantly across devices, indicating bottlenecked execution.

Digging deeper into thread performance, we further plot four additional setups where we set the CPU affinity to run over a varying number of the largest cores. For example, 4a2 means 4 threads with affinity 2, i.e. 4 threads running over the top 2 cores of the mobile SoC. As expected, we observe that any setup that sets the number of threads higher than the number of affinity cores (4a2 and 8a4) results in significant performance degradation. This happens due to time-sharing, with the extra threads pinned on the same cores left waiting. Nonetheless, we also witness some less expected findings, such as the fact that setting the affinity to the same number of top cores does not yield any significant gain, contrary to our initial hypothesis that it would reduce process migration between cores. In fact, 4a4 performs worse than 4 threads on the A70, and similarly for 2a2 versus 2 threads on the A20.
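The threads/affinity shorthand used above (e.g. 4a2) can be captured by a small helper; this parser is our own illustration of the notation, not part of gaugeNN:

```python
def parse_setup(setup):
    """Parse the shorthand: '4a2' -> 4 threads pinned to the top 2
    cores; a bare '4' -> 4 threads with no explicit affinity."""
    if "a" in setup:
        threads, cores = setup.split("a")
        return int(threads), int(cores)
    return int(setup), None

def is_oversubscribed(setup):
    """True when there are more threads than pinned cores (e.g. 4a2,
    8a4) - the configurations that degraded performance above via
    time-sharing on the pinned cores."""
    threads, cores = parse_setup(setup)
    return cores is not None and threads > cores
```

On Linux, the pinning itself could be applied with e.g. os.sched_setaffinity, though the benchmarks here rely on process pinning as described in the text.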

Predicting the optimal number of threads for mobile inference can be challenging as mobile devices have different CPU architectures with varying core frequencies as well as DVFS-enabled schedulers implementing energy-preserving policies (Kim et al., 2017). Moreover, most mobile devices, nowadays, incorporate HMP SoCs (i.e. ARM big.LITTLE, DynamIQ) with varying number of cores per island (e.g. Q888 has X1, A78, A55 ARM Cortex cores, whereas Q675 has A76 and A55 cores). Therefore, scheduling across core islands can bring sub-optimal results to DNN execution. However, when selecting the optimal thread count and affinity for each device, we see up to throughput gains overall. This suggests that tuning scheduling and thread count of DNN execution on heterogeneous devices and processors can yield significant improvements.

Observations: Our results indicate that, beyond model-level optimisations, there are alternative parameters for boosting inference throughput, but these should be tweaked in tandem with system-level factors, including the SoC topology and memory hierarchy, to make efficient use of the underlying hardware.

Figure 14. ECDF of TFLite and caffe models latency and energy per hardware target with SNPE.

6.3. Target generality vs. hardware-specific optimisations

In the previous section, we visited certain setup “hyperparameters”, namely batch size and process affinity, that, depending on the use case, can enhance inference performance. In this section, we investigate framework-specific optimisations that can enhance performance, either by means of optimised operator kernel implementations or by moving computation to a different processor altogether, i.e. targeting the GPU/NPU/DSP of the SoC. To this end, we run experiments measuring the performance and energy of framework-specific optimisations on TFLite and caffe models across three alternative backends, namely NNAPI, XNNPACK and SNPE, on the Q845 board. We refer the reader to the Appendix for more information on these frameworks.

Traces of hardware-specific acceleration. In our latest snapshot, we found some traces of hardware-specific acceleration. Specifically, we found apps using NNAPI, a single application using XNNPACK and three using SNPE. It is interesting to note that in the last case these models get blindly distributed to all devices, irrespective of whether they have a Qualcomm-based SoC or not. In fact, they deploy both TFLite and dlc variants of the same model. Overall, we see that many app models are missing out on the efficiency promises of targeting specialised hardware or using target-optimised kernel operations.

Optimisation opportunities. To measure the potential benefit of using each of the aforementioned framework optimisations on different processing elements, we run two experiments, one on TFLite models for NNAPI and XNNPACK (Fig. 13) and another on TFLite and caffe models for SNPE (Fig. 14). In each case, we compare the performance of the framework-specific optimisations to baseline CPU and GPU runs. The reason we do not compare across frameworks is that the number of commonly compatible models is low. This highlights one salient characteristic of such optimisations: the rudimentary support for operators across heterogeneous targets, which in turn can hinder their widespread adoption.

Results from our evaluation indicate that for CPU execution (Fig. 13), one is better off using the XNNPACK delegate for executing DNN inference faster and more efficiently on average. NNAPI did not prove its potential in our experiments, with its performance lagging behind the default CPU execution ( slower and less efficient on average). This could potentially be attributed to unoptimised NN drivers from the vendor. On the other hand, when one deploys with a vendor-specific platform, SNPE in our case, performance is better for the DSP and GPU (Fig. 14) compared to vanilla CPU and GPU runs. Specifically, these are and faster and and more efficient on average, compared to CPU runs. In comparison to GPU runs, these are and faster and and more efficient on average. In the case of CPU, however, the picture is similar to our previous experiment, further corroborating the hypothesis of non-optimised CPU drivers from the vendor.

Note that CPU and GPU runs are executed at full-precision (float32), while the DSP runs in int8. Depending on the task this can result in accuracy variations, but we do not have access to model-specific data and labels to assess that.

Observations. Results from our experiments tell a mixed story about hardware- and framework-specific optimisations. While they can yield noticeably better performance across models, this is not always the case, due to driver implementations or other low-level confounding factors. The dilemma of target generality vs. hardware-specific optimisation ultimately lies in the hands of developers and the resources they have at their disposal to extract every bit of performance from the hardware.

6.4. Cloud-based DNN models

Another approach to accelerating inference and bringing intelligence to mobile apps, without the need to specialise per target device, is offloading to the cloud. We envision this approach being popular amongst developers who do not implement or train their own models, or for models that are too computationally intensive to run locally on a mobile device, or too expensive to optimise for each available target while offering a similar QoE.

As mentioned in Sec. 3.2, gaugeNN tracks app invocations of known cloud-based machine learning APIs in their code. This includes calls to Google (Google Cloud and Firebase ML) and Amazon services. Fig. 15 shows the number of applications invoking each of the cloud-based ML APIs across our dataset. Overall, we find 524 distinct applications that use cloud AI APIs, a considerable increase of from our 2020 dataset; more specifically, 452 and 72 apps use Google AI services and Amazon, respectively. This increase is in line with the increase in models deployed within the apps (Sec. 4.6). Furthermore, we observe that developers primarily use cloud-based image and video analytics to perform face identification, bar/QR code recognition, video analytics and chatbots.
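A minimal sketch of such invocation tracking over decompiled app code; the package-name patterns below are illustrative assumptions rather than gaugeNN's actual matcher:

```python
import re

# Illustrative signatures of cloud ML SDK usage in decompiled code;
# real SDK package names and the full pattern set are assumptions.
CLOUD_ML_PATTERNS = {
    "google": re.compile(r"com\.google\.(cloud|firebase)\.ml"),
    "amazon": re.compile(r"com\.amazonaws\..*(rekognition|lex|polly)"),
}

def cloud_apis_used(source):
    """Return which cloud ML providers a (decompiled) app references."""
    return {name for name, pat in CLOUD_ML_PATTERNS.items() if pat.search(source)}
```

Aggregating these per-app sets over the dataset yields the per-provider counts plotted in Fig. 15.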

Observations: Our results indicate that cloud APIs from Google and Amazon are gaining in popularity as they allow developers to quickly deploy AI capabilities without the need for specialised ML expertise and costly infrastructure for training. Moreover, developers do not need to maintain training data on-premise and the resulting apps can be supported by heterogeneous devices with similar QoE.

Figure 15. Number of apps that invoke cloud-based ML APIs. Categories with less than 10 apps are excluded.

7. Related Work

In the past, there have been numerous studies that performed large scale analysis of the Google Play Store but with different aims, such as characterising mobile apps (Viennot et al., 2014) and their API usage (Onwuzurike et al., 2018; Almeida et al., 2016). Closer to the ML community, there has been an increasing effort to benchmark state-of-the-art models across different devices and frameworks (Hanhirova et al., 2018; Guo et al., 2019; Wu and others, 2019; Hadidi et al., 2019; Almeida et al., 2019; Ignatov et al., 2019). Although these studies have done a great job at extensively benchmarking state-of-the-art models, we still lack the knowledge as to whether these models are representative of the ones deployed today in mobile apps. Moreover, there is a lack of understanding on how the latest trends on DNN optimisation affect the latest DNN-based mobile apps.

To the best of our knowledge, there are mainly two works that have investigated DNN usage in the wild. One is from Xu et al. (Xu et al., 2019) and focuses on investigating who the early adopters of DNNs are and what the use-cases of Deep Learning in mobile apps are. While they do conduct a lightweight analysis of DNN operations, they only measured model footprint and performance in an offline and device-agnostic manner, by means of measuring the FLOPs of DNN layers. However, it has been shown that FLOPs is not a good proxy for a model’s run time (Almeida et al., 2019; Ignatov et al., 2019), especially across different hardware configurations. Therefore, there is still limited understanding of the actual performance of DNN models in the wild, across a heterogeneous ecosystem of more and less capable devices. A more privacy-centric work has been presented in (Sun et al., 2021), which investigates DNN model protection on mobile devices and illustrates succinctly that many Android apps do not protect their DNN models, meaning these can be easily leaked or extracted for analysis. Nevertheless, it does not perform any performance analysis.

These two works serve as a starting point for our study, which aims to answer the question of how DNNs found in the most popular Android apps actually perform on widely deployed devices, essentially capturing the state of mobile Deep Learning deployment in the wild. To this end, we conduct an in-depth benchmarking of models used in the latest, most trending mobile apps. This includes analyses of latency, energy, and system- and model-level parameters and optimisations, providing a better comprehension of the current limitations of deploying DNNs on mobile phones of different tiers and generations.

8. Discussion & Future work

8.1. Implications & Trends

Proliferation of mobile AI. Our results indicate that both on-device and cloud-supported DNN applications are increasing rapidly (doubled within a year). This is mostly driven by the availability of pre-trained models and easy-to-use cloud-based APIs, focusing mostly on vision tasks such as image detection and recognition.

Model reuse. While there is much research on bespoke model architectures, customisation and fine-tuning (Pan and Yang, 2009; Laskaridis et al., 2021), we observe that most developers use off-the-shelf DNN architectures. In fact, close to 80.9% of the models are shared across two or more applications, and a further 9.02% of the remaining models share some layers (i.e., derived from a common model after fine-tuning). Simultaneously, there is a parallel trend of resorting to cloud-powered inference, further demonstrating a preference of developers towards turnkey solutions instead of bespoke, customised ones. With the current trajectory of AI, we expect more developers to specialise in ML-based app development, at least until middleware (e.g. NNAPI) which abstracts away ML-specific parameters becomes more prevalent.

DNNs and mobile hardware resources. We witness that most applications do not take advantage of SoC-specific accelerators to speed up their inference, but rather target generality of their solutions, either by shipping vanilla CPU-only execution or by integrating framework-specific middleware options (e.g. NNAPI). Last, offloading inference to the cloud offers a consistent QoE that is not dependent on the target device, at the expense of privacy (Laskaridis et al., 2020a; Almeida et al., 2021) and monetary cost. This behaviour comes as a consequence of the fragmentation of the Android ecosystem in terms of hardware capabilities and software support (e.g. vendor-specific NNAPI drivers). Consequently, we anticipate the need for automated solutions for the optimised development and deployment of ML in mobile apps, which abstract away the complexity of efficiently targeting a heterogeneous ecosystem.

Energy as a bottleneck. While Deep Learning adoption is undisputed, with an accelerating trajectory, manufacturers are turning to specialised hardware for faster and more efficient ML (e.g. NPUs). The same cannot be said for battery technology and capacity, which remain relatively stagnant. Given what we observed for the segmentation scenario in Sec. 5.2.2, we anticipate energy sooner or later becoming a bottleneck in DNN deployment, requiring novel solutions to support mobile intelligence on the go.
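A back-of-envelope calculation shows why energy becomes the binding constraint for continuous workloads such as segmentation. The numbers below are hypothetical (they are not measurements from our study): a 4000 mAh battery at 3.85 V holds roughly 55 kJ, so a model drawing 4 W for 200 ms per frame can run at most ~69k inferences on a full charge, i.e. under 40 minutes of 30 fps operation even with everything else on the device switched off.

```python
def inferences_per_charge(battery_mah, voltage_v, power_w, latency_s):
    """Upper bound on inferences a full battery could sustain if the
    device did nothing but run the model."""
    battery_j = battery_mah / 1000 * voltage_v * 3600  # mAh -> Joules
    energy_per_inference_j = power_w * latency_s       # E = P * t
    return battery_j / energy_per_inference_j

# Hypothetical figures: 4000 mAh battery at 3.85 V; a segmentation
# model drawing 4 W for 200 ms per frame.
n = inferences_per_charge(4000, 3.85, 4.0, 0.2)
minutes_at_30fps = n / 30 / 60
print(round(n), round(minutes_at_30fps, 1))  # 69300 inferences, ~38.5 min
```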

DNN co-habitation. With more and more applications shipping DNN-powered features, we also anticipate the co-existence and parallel execution of multiple DNNs on the same device. Researchers will need to tackle this emerging problem and efficiently support such workloads, by means of OS- or hardware-level solutions.

On-device learning and personalisation. Lastly, this paper has only considered the task of mobile inference, where the model's weights are pretrained on some centralised dataset and the device only performs forward propagation. However, with users becoming increasingly privacy-aware and legislation discouraging the storage of user data without legitimate interest, on-device training and federated learning (McMahan et al., 2017; Horvath et al., 2021) are becoming increasingly prevalent (Paulik et al., 2021; Bonawitz et al., 2019). Moreover, with the proliferation of on-device data, on-device personalisation (Leontiadis et al., 2021) is also gaining traction. These tasks create a different workload to be optimised for on-device execution, for which current and future tools will need to provide support.
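The core aggregation step of federated learning (McMahan et al., 2017) can be sketched in a few lines: the server averages client model updates weighted by local dataset size, so raw data never leaves the device. This is an illustrative sketch on flattened weight vectors, not a production implementation.

```python
def fedavg(client_weights, client_sizes):
    """Federated averaging: combine per-client weight vectors into a
    global model, weighting each client by its local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * s for w, s in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two clients with flattened weight vectors and unequal dataset sizes:
# the client holding 3x more data pulls the average towards its update.
w_global = fedavg([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])
print(w_global)  # [2.5, 3.5]
```

From a workload perspective, the point is that each round adds backward passes and weight uploads on the device, a cost profile quite different from the inference-only deployments we benchmark in this paper.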

8.2. Limitations

In this work we have shed light on the use and performance of DNNs in real-world applications. However, we focused only on the Android smartphone landscape, due to its larger market share and wide device fragmentation. These findings might only partially hold for other mobile ecosystems.

Furthermore, we have analysed only the models that could be identified as DNN models. Obfuscated and encrypted models, or models downloaded outside of the Google Play Store, were not benchmarked, although we still tracked the respective applications as ML-powered. While the distribution of obfuscated models in the wild might differ, the results of (Sun et al., 2021) indicate otherwise.

Our analysis included both offline introspection and dynamic benchmarking of the models. However, we did not investigate particular invocation paths or the frequency of inference per app. We expect that some of these models are rarely used (e.g. credit card scanning) while others are invoked frequently (e.g. activity detection). Measuring the real-world usage of these models requires device instrumentation and collecting telemetry data over a large user base. While previous works (Almeida et al., 2018; Onwuzurike et al., 2018) have proposed large-scale crowd-testing of virtualised mobile apps with real user interaction, these generally preclude testing sensor-input-dependent functionality, on which DNNs depend. We leave this as future work.

Lastly, while we characterise DNN cloud offloading, we acknowledge that we miss developers who use their own custom (e.g. REST-based) APIs for remote execution.

9. Conclusion

In this work, we have carried out a comprehensive empirical study of the most popular DNN-powered mobile apps. Using gaugeNN, we analyse thousands of mobile apps in the wild and identify a significant chasm between the deployed models and the state-of-the-art architectures and optimisation techniques. This is the first work to dig deeper into these aspects so as to provide guidelines for both the mobile application and the DNN-framework developer communities.


  • M. Almeida, M. Bilal, J. Blackburn, and K. Papagiannaki (2016) An empirical study of android alarm usage for application scheduling. In Passive and Active Measurement, T. Karagiannis and X. Dimitropoulos (Eds.), Cham, pp. 373–384. External Links: ISBN 978-3-319-30505-9 Cited by: §7.
  • M. Almeida, M. Bilal, A. Finamore, I. Leontiadis, Y. Grunenberger, M. Varvello, and J. Blackburn (2018) Chimp: crowdsourcing human inputs for mobile phones. In Proceedings of the 2018 World Wide Web Conference, pp. 45–54. Cited by: §8.2.
  • M. Almeida, S. Laskaridis, I. Leontiadis, S. I. Venieris, and N. D. Lane (2019) EmBench: Quantifying Performance Variations of Deep Neural Networks across Modern Commodity Devices. In The 3rd International Workshop on Deep Learning for Mobile Systems and Applications (EMDL), pp. 1–6. Cited by: §1, §5.1, §7, §7.
  • M. Almeida, S. Laskaridis, S. I. Venieris, I. Leontiadis, and N. D. Lane (2021) DynO: dynamic onloading of deep neural networks from cloud to device. External Links: 2104.09949 Cited by: §8.1.
  • Amazon (2020) AWS Android SDK. Note: Cited by: §3.2.
  • AppBrain (2020) Number of Android apps on Google Play. Note: Cited by: footnote 4.
  • Arm (2021) Ethos npu. Note: September 30, 2021 Cited by: §6.1.
  • V. Bazarevsky, Y. Kartynnik, A. Vakunov, K. Raveendran, and M. Grundmann (2019) Blazeface: sub-millisecond neural face detection on mobile gpus. arXiv preprint arXiv:1907.05047. Cited by: §4.5.
  • K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, B. McMahan, T. Van Overveldt, D. Petrou, D. Ramage, and J. Roselander (2019) Towards federated learning at scale: system design. pp. 374–388. External Links: Link Cited by: §8.1.
  • W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals (2015) Listen, attend and spell. arXiv preprint arXiv:1508.01211. Cited by: §1, §5.2.2.
  • W. Chen, X. Gong, X. Liu, Q. Zhang, Y. Li, and Z. Wang (2020) FasterSeg: searching for faster real-time semantic segmentation. Cited by: §5.2.2.
  • Facebook (2020) Two Billion Users — Connecting the World Privately. Note: Cited by: §5.2.2.
  • J. Fernandez-Marques, P. Whatmough, A. Mundy, and M. Mattina (2020) Searching for winograd-aware quantized networks. pp. 14–29. External Links: Link Cited by: §1.
  • J. Freke (2020) Smali assembler. Note: Cited by: §3.2.
  • GlobalStats (2020) Mobile operating systems’ market share worldwide from April 2020 to April 2021. Note: Cited by: §4.1.
  • Google (2020a) Android Runtime and Dalvik. Note: Cited by: §3.2.
  • Google (2020b) Google Cloud APIs. Note: Cited by: §3.2.
  • Google (2020c) Google Cloud APIs. Note: Cited by: §3.2.
  • Google (2020d) Optimize for Doze and App Standby. Note: Cited by: §3.3.
  • Google (2021a) About Android App Bundles. Note: Cited by: §3.1.
  • Google (2021b) APK Expansion Files. Note: Cited by: §3.1.
  • Google (2021c) Tensorflow: Clustering. Note: Cited by: §6.1.
  • Google (2021d) Tensorflow: pruning with keras. Note: Cited by: §6.1.
  • Q. Guo, S. Chen, X. Xie, L. Ma, Q. Hu, H. Liu, Y. Liu, J. Zhao, and X. Li (2019) An Empirical Study towards Characterizing Deep Learning Development and Deployment across Different Frameworks and Platforms. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 810–822. Cited by: §7.
  • R. Hadidi, J. Cao, Y. Xie, B. Asgari, T. Krishna, and H. Kim (2019) Characterizing the Deployment of Deep Neural Networks on Commercial Edge Devices. In 2019 IEEE International Symposium on Workload Characterization (IISWC), Vol. , pp. 35–48. Cited by: §7.
  • S. Han, H. Mao, and W. J. Dally (2016) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations (ICLR). Cited by: §1, §6.1.
  • J. Hanhirova, T. Kämäräinen, S. Seppälä, M. Siekkinen, V. Hirvisalo, and A. Ylä-Jääski (2018) Latency and Throughput Characterization of Convolutional Neural Networks for Mobile Computer Vision. In Proceedings of the 9th ACM Multimedia Systems Conference (MMSys), pp. 204–215. External Links: ISBN 9781450351928 Cited by: §7.
  • K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, J. Law, K. Lee, J. Lu, P. Noordhuis, M. Smelyanskiy, L. Xiong, and X. Wang (2018) Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective. pp. 620–629. External Links: ISSN Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. pp. 770–778. Cited by: §1, §1.
  • S. Horvath, S. Laskaridis, M. Almeida, I. Leontiadis, S. I. Venieris, and N. D. Lane (2021) FjORD: fair and accurate federated learning under heterogeneous targets with ordered dropout. arXiv preprint arXiv:2102.13451. Cited by: §8.1.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1, §4.5, §4.7.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. External Links: Document, 1608.06993, ISBN 9781538604571, Link Cited by: §1.
  • A. Ignatov, R. Timofte, A. Kulik, S. Yang, K. Wang, F. Baum, M. Wu, L. Xu, and L. Van Gool (2019) AI benchmark: all about deep learning on smartphones in 2019. Cited by: §1, §4.2, §5.1, §7, §7.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. pp. 2704–2713. Cited by: §6.1.
  • Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang (2017) Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge. pp. 615–629. Cited by: §1.
  • Y. G. Kim, M. Kim, and S. W. Chung (2017) Enhancing energy efficiency of multimedia applications in heterogeneous mobile multi-core processors. IEEE Transactions on Computers 66 (11), pp. 1878–1889. Cited by: §5.1, §6.2.
  • S. Laskaridis, A. Kouris, and N. D. Lane (2021) Adaptive inference through early-exit networks: design, challenges and directions. In Proceedings of the 5th International Workshop on Embedded and Mobile Deep Learning, EMDL’21, New York, NY, USA, pp. 1–6. External Links: ISBN 9781450385978, Link, Document Cited by: §8.1.
  • S. Laskaridis, S. I. Venieris, M. Almeida, I. Leontiadis, and N. D. Lane (2020a) SPINN: Synergistic Progressive Inference of Neural Networks over Device and Cloud. Cited by: §1, §8.1.
  • S. Laskaridis, S. I. Venieris, H. Kim, and N. D. Lane (2020b) HAPI: Hardware-Aware Progressive Inference. In IEEE/ACM International Conference on Computer-Aided Design (ICCAD), Cited by: §1.
  • J. Lee, N. Chirkov, E. Ignasheva, Y. Pisarchyk, M. Shieh, F. Riccardi, R. Sarokin, A. Kulik, and M. Grundmann (2019a) On-device neural net inference with mobile gpus. Cited by: §1.
  • N. Lee, T. Ajanthan, and P. Torr (2019b) SNIP: Single-Shot Network Pruning based on Connection Sensitivity. In International Conference on Learning Representations (ICLR), Cited by: §1.
  • I. Leontiadis, S. Laskaridis, S. I. Venieris, and N. D. Lane (2021) It’s always personal: using early exits for efficient on-device cnn personalisation. New York, NY, USA. External Links: ISBN 9781450383233, Link, Document Cited by: §8.1.
  • Z. Li and F. Zhou (2017) FSSD: feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960. Cited by: §4.5.
  • H. Liao, J. Tu, J. Xia, and X. Zhou (2019) DaVinci: A Scalable Architecture for Neural Network Computing. pp. 1–44. Cited by: §6.1.
  • J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. pp. 3431–3440. External Links: Document Cited by: §5.2.2.
  • B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas (2017) Communication-efficient learning of deep networks from decentralized data. pp. 1273–1282. Cited by: §8.1.
  • A. Mehrotra, A. G. C. P. Ramos, S. Bhattacharya, Ł. Dudziak, R. Vipperla, T. Chau, M. S. Abdelfattah, S. Ishtiaq, and N. D. Lane (2021) NAS-bench-ASR: reproducible neural architecture search for speech recognition. Cited by: §5.2.2.
  • L. Onwuzurike, M. Almeida, E. Mariconti, J. Blackburn, G. Stringhini, and E. De Cristofaro (2018) A Family of Droids-Android Malware Detection via Behavioral Modeling: Static vs Dynamic Analysis. In 2018 16th Annual Conference on Privacy, Security and Trust (PST), pp. 1–10. Cited by: §7, §8.2.
  • S. J. Pan and Q. Yang (2009) A survey on transfer learning. IEEE Transactions on knowledge and data engineering 22 (10), pp. 1345–1359. Cited by: §4.5, §8.1.
  • M. Paulik, M. Seigel, H. Mason, D. Telaar, J. Kluivers, R. van Dalen, C. W. Lau, L. Carlson, F. Granqvist, C. Vandevelde, et al. (2021) Federated evaluation and tuning for on-device personalization: system design & applications. arXiv preprint arXiv:2102.08503. Cited by: §8.1.
  • V. Pratap, Q. Xu, J. Kahn, G. Avidov, T. Likhomanenko, A. Hannun, V. Liptchinsky, G. Synnaeve, and R. Collobert (2020) Scaling Up Online Speech Recognition Using ConvNets. pp. 3376–3380. External Links: Document, Link Cited by: §5.2.2.
  • Qualcomm (2021) Snapdragon neural processing engine. Note: September 30, 2021 Cited by: §6.1.
  • L. Roeder (2020) Netron. Note: Cited by: §3.1.
  • A. Rosenfeld, S. Sina, D. Sarne, O. Avidov, and S. Kraus (2015) A study of whatsapp usage patterns and prediction models without message content. arXiv preprint arXiv:1802.03393. Cited by: §5.2.2.
  • K. Simonyan and A. Zisserman (2015) Very Deep Convolutional Networks for Large-Scale Image Recognition. Cited by: §1.
  • Statista (2020) Mobile operating systems’ market share worldwide from January 2012 to July 2020. Note: Cited by: §4.1.
  • Z. Sun, R. Sun, L. Lu, and A. Mislove (2021) Mind your weight(s): a large-scale study on insufficient machine learning model protection in mobile apps. External Links: Link Cited by: §4.6, §7, §8.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215. Cited by: §1.
  • M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) MnasNet: platform-aware neural architecture search for mobile. Cited by: §1.
  • F. Team (2020) Pay Cards Recognizer. Note: Cited by: §4.5.
  • (2019) Example on-device model personalization with TensorFlow Lite. Note: Cited by: §4.5.
  • (2021) Trim insignificant weights. Note: Cited by: §6.1.
  • C. Tumbleson (2020) apktool. Note: Cited by: §3.2.
  • N. Viennot, E. Garcia, and J. Nieh (2014) A Measurement Study of Google Play. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), pp. 221–233. Cited by: §4.1, §7.
  • S. Wang, A. Pathania, and T. Mitra (2020) Neural Network Inference on Mobile SoCs. IEEE Design Test (). Cited by: §1.
  • Whatsapp (2021) Whatsapp daily messages. Note: Cited by: §5.2.2.
  • C. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury, M. Dukhan, K. Hazelwood, E. Isaac, Y. Jia, B. Jia, T. Leyvand, H. Lu, Y. Lu, L. Qiao, B. Reagen, J. Spisak, F. Sun, A. Tulloch, P. Vajda, X. Wang, Y. Wang, B. Wasti, Y. Wu, R. Xian, S. Yoo, and P. Zhang (2019) Machine Learning at Facebook: Understanding Inference at the Edge. pp. 331–344. Cited by: §1, §4.2.
  • C. Wu et al. (2019) Machine Learning at Facebook: Understanding Inference at the Edge. In 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vol. , pp. 331–344. Cited by: §7.
  • J. Wu, C. Leng, Y. Wang, Q. Hu, and J. Cheng (2016) Quantized Convolutional Neural Networks for Mobile Devices. pp. 4820–4828. Cited by: §6.1.
  • M. Xu, J. Liu, Y. Liu, F. X. Lin, Y. Liu, and X. Liu (2019) A first look at deep learning apps on smartphones. In The World Wide Web Conference, pp. 2125–2136. Cited by: §3.1, §4.6, §4.6, §4.6, §7, footnote 7.
  • Yepkit (2020) Yepkit YKUSH 3 USB 3.1 Switchable Hub. Note: Cited by: §3.3.
  • X. Zhang, C. Xie, J. Wang, W. Zhang, and X. Fu (2018) Towards memory friendly long-short term memory networks (LSTMs) on mobile GPUs. pp. 162–174. External Links: ISBN 9781538662403, Link, Document Cited by: §1, §4.7.
  • H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia (2018) ICNet for real-time semantic segmentation on high-resolution images. Cham, pp. 418–434. External Links: ISBN 978-3-030-01219-9 Cited by: §5.2.2.

Appendix A Additional platform information

DNN Model extraction

In Sec. 3.1 of the paper, we stated that gaugeNN supports file extraction from i) the base apk, ii) expansion files (OBBs) and iii) Android App Bundles. The extracted files are matched against a compiled list of known DNN framework formats and validation rules to identify potential DNN models. The complete list of formats is shown in Table 5.

Framework Extensions
ONNX .onnx, .pb, .pbtxt, .prototxt
MXNet .mar, .model, .json, .params
Keras .h5, .hd5, .hdf5, .keras, .json, .model, .pb, .pth
Caffe .caffemodel, .pbtxt, .prototxt, .pt
Caffe2 .pb, .pbtxt, .prototxt
PyTorch .pt, .pth, .pt1, .pkl, .h5, .t7, .model, .dms, .pth.tar, .ckpt, .bin, .pb, .tar
Torch .t7, .dat
SNPE .dlc
FeatherCNN .feathermodel
TFLite .tflite, .lite, .tfl, .bin, .pb
TF .pb, .meta, .pbtxt, .prototxt, .json, .index, .ckpt
Sklearn .pkl, .joblib, .model
armNN .armnn
Mnn .mnn
Ncnn .param, .bin, .cfg.ncnn, .weights.ncnn, .ncnn
Tengine .tmfile
Flux .bson
Chainer .npz, .h5, .hd5, .hdf5, .chainermodel
Table 5. Frameworks and formats validated by gaugeNN
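The matching logic described above can be sketched as a simple lookup from file extension to candidate frameworks, using a subset of Table 5. This is an illustrative sketch, not the actual gaugeNN code; note how ambiguous extensions (e.g. .pb, .bin, .json appear under several frameworks) can yield multiple candidates, which is why gaugeNN applies additional validation rules beyond the extension match.

```python
from pathlib import Path

# Subset of Table 5: known extensions per framework. Ambiguous
# extensions map to several frameworks and need further validation
# (e.g. inspecting the file header) before a file counts as a model.
FORMATS = {
    "TFLite": {".tflite", ".lite", ".tfl", ".bin", ".pb"},
    "ONNX": {".onnx", ".pb", ".pbtxt", ".prototxt"},
    "SNPE": {".dlc"},
    "Ncnn": {".param", ".bin"},
}

def candidate_frameworks(path):
    """Return the frameworks whose known extensions match this file."""
    ext = Path(path).suffix.lower()
    return sorted(fw for fw, exts in FORMATS.items() if ext in exts)

print(candidate_frameworks("assets/detector.tflite"))  # ['TFLite']
print(candidate_frameworks("assets/weights.bin"))      # ['Ncnn', 'TFLite']
```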

Appendix B Additional experiment information

Hardware-specific acceleration frameworks

As per Sec. 6.3, we run our TFLite models against alternative backends, namely NNAPI, XNNPACK and SNPE. Below we provide additional information on each:


  • NNAPI. The Neural Networks API (NNAPI) is a middleware-level library in Android that sits between the machine learning framework used by an application (e.g. TFLite) and the Android Hardware Abstraction Layer (HAL). It essentially provides an abstraction layer, handling hardware acceleration through vendor- and hardware-specific NN drivers, which provide efficient operator implementations for CPUs, GPUs, DSPs, NPUs or other kinds of specialised hardware. Execution falls back to the CPU in the absence of such drivers or for unsupported operators. TFLite is at the forefront of NNAPI delegation, and PyTorch Mobile has announced support for it. Nonetheless, NNAPI is still in its infancy and comes with shortcomings, mainly in OS version support (Android P and above), NN driver availability and heterogeneity in performance gains.

  • XNNPACK. XNNPACK is a low-level, highly optimised library of NN inference operators across platforms. Specifically for ARM, it provides efficient operator implementations through NEON instructions, as well as inference on sparse networks, which offers a practical solution to the problem described in Sec. 6.1. Despite the claimed performance benefits, operator support is limited and, if used carelessly, can lead to performance penalties rather than gains compared to the baseline CPU delegates.

  • SNPE. The Snapdragon Neural Processing Engine (SNPE) is a vendor-specific runtime for the execution of DNNs on Qualcomm SoCs, targeting the CPU, Adreno GPU or Hexagon DSP of the SoC, and internally handling quantisation to the proper precision. It uses its own representation for NNs (the .dlc format) and supports conversion from different frameworks, including Caffe and TFLite. However, while SNPE can potentially take advantage of hardware-specific optimisations, it can only target Qualcomm SoCs, trading off generality for performance. Operator support can also be an issue in SNPE, which falls back to the CPU for unsupported hardware-specific operations.
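The common pattern across these backends — prefer the most specialised runtime whose drivers exist on the device and whose operator coverage includes all of the model's ops, otherwise fall back to the CPU — can be sketched as a selection function. This is an illustrative sketch of the decision logic, not an API of any of the above frameworks; the preference order and operator sets are assumptions for the example.

```python
def pick_backend(available, supported_ops, model_ops):
    """Choose the most specialised backend that both exists on the
    device and supports every operator in the model; otherwise fall
    back to the plain CPU interpreter. Preference order mirrors the
    text: vendor runtime, then NNAPI, then XNNPACK, then CPU."""
    for backend in ("SNPE", "NNAPI", "XNNPACK"):
        if backend in available and model_ops <= supported_ops.get(backend, set()):
            return backend
    return "CPU"

backend = pick_backend(
    available={"NNAPI", "XNNPACK"},
    supported_ops={"NNAPI": {"CONV_2D"}, "XNNPACK": {"CONV_2D", "SOFTMAX"}},
    model_ops={"CONV_2D", "SOFTMAX"},
)
print(backend)  # NNAPI lacks SOFTMAX here, so XNNPACK is selected
```

Real delegates complicate this picture by supporting per-operator fallback (splitting the graph between accelerator and CPU), which is precisely where the performance penalties mentioned above tend to appear.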