1 Introduction
Deep learning has become the de facto method for artificial intelligence in various tasks, including computer vision, user intention recognition Guo et al. (2019), and autopilot LeCun et al. (2015). As edge devices (e.g., smartphones, IoT devices, wearable devices) are now ubiquitous, deep learning on edge devices, especially mobile devices, attracts growing attention Shi et al. (2016). Deep learning on mobile devices offers many advantages, for example, low latency, privacy protection, and personalized service. To make full use of on-device deep learning technology, inference engines tailored to mobile devices have been developed and extensively used in mobile applications, for example, TFLite Google (2017a), NCNN Tencent (2017), and CoreML Apple (2017).

The major challenges for mobile inference engines can be categorized into three aspects: model compatibility, device diversity, and resource limitation.
(1) Model compatibility. Most deep learning models deployed on mobile devices are trained with well-known deep learning frameworks such as TensorFlow Abadi et al. (2016), PyTorch Paszke et al. (2017), Caffe Jia et al. (2014), CNTK Yu et al. (2014), and MXNet Chen et al. (2015). As a result, it is a basic requirement that an inference engine have model compatibility across different formats and different operators. More importantly, the engine should also allow for proper scalability to support new operators emerging in the future.

(2) Device diversity. Almost every well-known mobile application is used extensively on various devices, ranging from low-end equipment with a single-core CPU to high-end equipment with co-processors such as the Apple Neural Engine (ANE). To achieve great performance on various devices, mobile inference engines have to take hardware architectures or even device vendors (like ARM Mali GPU or Qualcomm Adreno GPU) into consideration. In addition, the engine is also expected to handle the software diversity problem well, such as different operating systems (Android/iOS/embedded OS) and different solution standards (OpenCL Khronos (2009)/OpenGL Khronos (1992)/Vulkan Khronos (2015) for Android GPU).
(3) Resource limitation. Despite the rapid hardware development, memory and computation power are still constrained on mobile devices and are orders of magnitude lower than their desktop and server counterparts.
Given the challenges above, a good mobile inference engine should have the following two properties: (1) Universality, to address both model compatibility and device diversity; (2) Efficiency, to infer models on devices with great performance and with as little memory and energy consumption as possible.
To satisfy the above properties, we introduce a new mobile inference engine named Mobile Neural Network (MNN). Our contributions can be summarized as follows:

We present a mechanism called pre-inference, which performs runtime optimization through online cost evaluation and optimal scheme selection.

We deliver in-depth kernel optimization by utilizing improved algorithms and data layouts to further boost the performance of some widely-used operations.

We propose a backend abstraction module to enable hybrid scheduling and keep the engine itself as lightweight as possible. Integrating MNN into applications adds only a small binary size overhead.
It should be noted that MNN has been extensively adopted in many mobile applications. Meanwhile, we open-source the entire project to enrich the community and hope to involve more people to make it better.
2 Related Work
With the present rise of demand for on-device deep learning, much attention has been paid to mobile inference solutions, especially from several major companies.
CoreML is a framework from Apple to integrate machine learning models into iOS software applications, which can leverage multiple kinds of hardware like the CPU, GPU, and ANE. For Android smartphones, Google also provides its own solutions for on-device inference, i.e., ML Kit and the Neural Networks API (NNAPI) Google (2016). However, the main drawback of these solutions lies in their limited universality. For example, CoreML requires iOS 11+ and NNAPI requires Android 8.1+, which excludes many existing mobile phones and embedded devices.

In late 2017, Google released TensorFlow Lite (TFLite) Google (2017a), an efficient mobile deep learning framework. TFLite is optimized for less powerful devices such as mobile phones and embedded devices. Almost at the same time, Facebook released Caffe2 Paszke et al. (2017) to help developers deploy deep learning models on mobile devices. Both of them support a wide range of devices, and many applications have already been developed based on them. However, on-device inference with TFLite or Caffe2 may sometimes go against the goal of being lightweight. For example, TFLite utilizes the Accelerate, Eigen, and OpenBLAS libraries for floating-point acceleration, while Caffe2 depends on Eigen to accelerate matrix operations. Integrating mobile deep learning frameworks with these dependencies bloats the binary size of mobile applications and brings in unnecessary overhead.
Many efforts have been devoted to solving the problem above. NCNN Tencent (2017), MACE Xiaomi (2018), and Anakin Baidu (2018) are the representatives. They follow a paradigm of what we call manual search or non-automated search. In this paradigm, operators are optimized case by case through carefully-designed assembly instructions and do not rely on any external libraries. For example, a unique program function is implemented for a convolution of one stride, and another function has to be implemented separately for a convolution of a different stride. This kind of implementation allows the mobile inference engine to be lightweight and efficient. However, the case-by-case optimization is time-consuming and can hardly cover all newly emerging operators.

In stark contrast to manual search, there is another philosophy on the opposite side, which we call automated search, pioneered by TVM DMLC (2016). TVM is an open deep learning compiler stack that can compile various deep learning models into libraries in an end-to-end manner. Not only does it resolve the redundant dependency problem, but it also provides graph-level and operator-level optimization customized for the model and backend. As a result, the performance of TVM is quite encouraging and scalable in terms of both model and device diversity. However, these advantages come at a price. The runtime library generated by TVM is model-specific, which means that if we want to update the model (which can be very common and frequent for many AI-driven software applications), we need to regenerate the code and release a new version. This mechanism bears a burden of cost and is sometimes impracticable for mobile applications. In this paper, we are motivated to develop a semi-automated search architecture featured by enhanced universality and better performance in mobile deployment.
It is worth mentioning that there are some other works related to on-device deep learning, e.g., computational graph DSLs (Domain-Specific Languages) Abadi et al. (2016); Bastien et al. (2012) to perform optimization at the graph level, or operator fusion and replacement Google (2017b); Wei et al. (2017). These works are orthogonal to our contributions in this paper and are partially referenced by MNN.
3 Mobile Neural Network (MNN)
3.1 Overview of MNN
The basic workflow of MNN is composed of two parts, offline conversion and on-device inference. Figure 2 summarizes the entire architecture of MNN, and this section briefly shows how MNN optimizes the workflow by walking through its components.
For the offline conversion part, the converter first takes as input models from different deep learning frameworks and transforms them into our own model format (.mnn). Meanwhile, some basic graph optimizations are performed, such as operator fusion Ashari et al. (2015), replacement, and model quantization Rastegari et al. (2016).
For on-device inference, three modules are involved: pre-inference, operator-level optimization, and backend abstraction. For each operator, the pre-inference module offers a cost evaluation mechanism, which combines the operator information (e.g., input size, kernel shape) with backend properties (e.g., number of cores, hardware availability) to dynamically determine the optimal computation solution from a solution pool. Then the operator-level optimization module utilizes advanced algorithms together with techniques like SIMD (Single Instruction, Multiple Data) and pipelining to further boost the performance.
Moreover, MNN supports various hardware architectures as backends. Since no single standard fits all hardware specifications, MNN supports different software solutions such as OpenCL, OpenGL, Vulkan, and Metal. All backends are implemented as independent components, and a set of uniform interfaces is provided by the backend abstraction module to hide raw details (e.g., memory management on heterogeneous backends).
With the proposed architecture, not only do we achieve high performance for on-device inference, but we also make it easy to extend MNN to more ongoing backends (such as TPU, FPGA, etc.). In the rest of this section, we present more details of the architecture of MNN.
3.2 Pre-inference
Pre-inference is the fundamental part of the proposed semi-automated search architecture. It takes advantage of a common phenomenon: the input size is typically fixed (or can be preprocessed into a target size) in many deep learning applications. Based on this, memory usage and computational cost can be determined ahead of the formal inferences. Thus, optimization techniques like memory pre-allocation and reuse can be conducted to further improve the performance. The details of pre-inference can be specified in two parts: computation scheme selection and preparation-execution decoupling.
Computation scheme selection. We propose a cost evaluation mechanism to select the optimal scheme from the scheme pool, which takes into consideration both the algorithm implementation and the backend characteristics,

(1)   $C = C_{\text{algorithm}} + C_{\text{backend}},$

where $C$ stands for the cost.
(1) Take the convolution scheme pool as an example, in which there are generally two fast implementation algorithms to choose from: sliding window and Winograd Lavin & Gray (2016). Our general idea is to dynamically choose the algorithm that minimizes the computational cost under different convolution settings. Therefore, the optimal computation scheme to minimize the $C_{\text{algorithm}}$ term can be determined as follows:

If the kernel size $k = 1$ (i.e., a 1×1 convolution), the convolution is just a matrix multiplication, and the Strassen algorithm Strassen (1969) will be applied.

If the kernel size $k > 1$, we employ Winograd to transform the convolution into matrix multiplication. Winograd convolution has many sub-options for different output tile sizes $n$; for each candidate $n$, the arithmetic cost $C(n)$ consists of the multiplications in the Hadamard product plus the cost of the input/output transforms, which grows with the tile size. Based on this cost, we choose the optimal output tile size $\hat{n}$ that minimizes the cost $C(n)$,

(2)   $\hat{n} = \arg\min_{n} C(n).$

Then, if $\hat{n}$ equals 1, sliding window is picked out; otherwise, we adopt Winograd convolution with tile size $\hat{n}$.
This cost evaluation mechanism for convolution is summarized as

(3)   $C_{\text{algorithm}} = \begin{cases} C_{\text{Strassen}}, & k = 1, \\ \min\big(C_{\text{sliding}},\; C_{\text{Winograd}}(\hat{n})\big), & k > 1. \end{cases}$
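The scheme selection described above can be sketched in a few lines. The cost model below is a simplified, illustrative stand-in (not MNN's exact formula): Winograd F(n×n, k×k) needs $(n+k-1)^2$ multiplications per $n \times n$ output tile, plus transform work assumed here to grow linearly with the tile edge; the `transform_weight` constant is hypothetical.

```python
# Illustrative sketch of convolution scheme selection; the cost model is
# a simplified stand-in for the paper's C(n), not MNN's actual formula.

def winograd_cost(n: int, k: int, transform_weight: float = 0.75) -> float:
    """Approximate cost per output element for output tile size n.
    Note n = 1 reduces to the sliding-window cost of k^2 per output."""
    tile = n + k - 1
    hadamard = tile * tile / (n * n)      # Hadamard-product multiplications
    transform = transform_weight * tile   # transform overhead grows with n
    return hadamard + transform

def select_conv_scheme(k: int, max_tile: int = 6):
    """Pick sliding window (n = 1) or Winograd with the best tile size."""
    if k == 1:
        return ("strassen_matmul", None)  # 1x1 conv is a matrix multiply
    costs = {n: winograd_cost(n, k) for n in range(1, max_tile + 1)}
    best_n = min(costs, key=costs.get)
    if best_n == 1:
        return ("sliding_window", None)
    return ("winograd", best_n)
```

With this toy cost model, a 3×3 convolution selects a small Winograd tile, while a 1×1 convolution is routed to matrix multiplication, mirroring the decision rule above.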
(2) The next question is how to determine the second term, $C_{\text{backend}}$, in Equation 1. Generally, the idea is to sum up the time costs of all the operators on each backend and then choose the backend with the minimal cost,

(4)   $C_{\text{backend}} = \min_{\text{backend}} \sum_{\text{op}} C_{\text{op}},$
where $C$ denotes the cost as before and op represents the operator. The key to backend cost evaluation is apparently obtaining the $C_{\text{op}}$ term, which depends on the backend in question. If an operator is not supported on a GPU backend of interest (e.g., OpenCL/OpenGL/Vulkan), it will be scheduled to run on the CPU. The cost of an operator on the CPU or GPU can be formulated as

(5)   $C_{\text{op}} = \begin{cases} \dfrac{\text{MUL}}{\text{FLOPS}}, & \text{on CPU}, \\[4pt] \dfrac{\text{MUL}}{\text{FLOPS}} + t_{\text{schedule}}, & \text{on GPU}, \end{cases}$
where MUL stands for the number of multiplications, indicating the computational complexity of an operator. FLOPS (FLoating-point Operations Per Second) is a common index to quantify the computation power of a CPU or GPU. As we can see, the main difference of the GPU from the CPU as a backend is the additional term $t_{\text{schedule}}$, which accounts for the cost of preparing the command buffer and command descriptions for the GPU. Note that FLOPS and $t_{\text{schedule}}$ are known constants for a specific backend (please refer to the Appendix for how to determine them in detail).
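The backend cost evaluation described above can be sketched as follows. All names, the dict-based backend description, and the numeric values are illustrative; in practice FLOPS and the scheduling overhead would be measured constants per device.

```python
# Sketch of the backend cost evaluation: computation term per Equation 5,
# plus a scheduling overhead on GPU; unsupported ops fall back to CPU.

def op_cost(muls: float, flops: float, t_schedule: float = 0.0) -> float:
    """Estimated time of one operator: MUL / FLOPS (+ GPU scheduling)."""
    return muls / flops + t_schedule

def choose_backend(ops, backends):
    """Sum per-operator costs and return the backend with minimal total.
    ops: [{"name": ..., "muls": ...}]; backends: name -> spec dict."""
    totals = {}
    for name, spec in backends.items():
        total = 0.0
        for op in ops:
            if name != "cpu" and op["name"] not in spec["supported"]:
                # op not supported on this GPU backend -> runs on CPU
                total += op_cost(op["muls"], backends["cpu"]["flops"])
            else:
                total += op_cost(op["muls"], spec["flops"],
                                 spec.get("t_schedule", 0.0))
        totals[name] = total
    return min(totals, key=totals.get)
```

For a compute-heavy graph, a GPU backend wins despite its per-op scheduling overhead; for tiny graphs the overhead term can tip the choice back to CPU, which is exactly what the cost model is meant to capture.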
Table 1 shows the performance comparison between fixed schemes and our scheme selection under different convolution settings. We can see that each fixed scheme, sliding window or Winograd, is especially fit for certain cases but not universally good. Our method achieves the best (or comparable to the best) performance in different cases, partly showing the claimed universal efficiency.
Scheme  Convolution argument setting  

Sliding  
WinoMin  
WinoMax  
Ours 
Preparation-execution decoupling. During program execution, computation is normally interleaved with memory allocation and freeing. In mobile applications, however, the time spent on memory management cannot be ignored. Since the input size is determined or can be preprocessed into a target size, MNN can infer the exact required memory for the entire graph by virtually walking through all operations and summing up all allocations and freeings. In this sense, MNN pre-allocates the required memory as a memory pool during the pre-inference stage and reuses it in the following inference sessions. The whole process is illustrated in Figure 3.
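The virtual walk described above can be sketched as a one-pass peak-memory computation. The graph representation here is illustrative, not MNN's actual intermediate representation: each operator is reduced to the bytes it allocates for its outputs and the bytes it releases once its inputs are no longer needed.

```python
# Sketch of preparation-execution decoupling: because all tensor sizes
# are known before inference, one virtual pass over the graph suffices
# to size a memory pool that is pre-allocated once and then reused.

def plan_memory(ops):
    """ops: list of (alloc_bytes, free_bytes) per operator in execution
    order. Returns the peak footprint, i.e. the pool size to reserve."""
    current = peak = 0
    for alloc, free in ops:
        current += alloc              # outputs allocated before the op runs
        peak = max(peak, current)
        current -= free               # inputs no longer needed are recycled
    return peak
```

At inference time, every "allocation" then becomes a cheap offset assignment into the pre-allocated pool instead of a system allocator call, which is where the savings reported in Table 2 come from.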
It should also be noted that, by decoupling preparation from execution, MNN achieves better performance when the selected scheme is GPU-related. This is because setting up the command buffer and its related command descriptions is, to some extent, time-consuming and thus has a negative impact on the inference performance.
This simple idea can be quite effective in practice: the inference time with this technique drops notably on both CPU and GPU. More details can be found in Table 2.
Scheme  CPU ( threads)  GPU (Vulkan) 

MI6 (w/o)  
MI6 (w/)  ()  () 
P10 (w/o)  
P10 (w/)  ()  () 
3.3 Kernel Optimization
A kernel is the detailed implementation of an operator, whose optimization can be specified in two ways, i.e., the algorithm and the schedule Jiang et al. (2018). In other words, we need to choose the optimal algorithm with the lowest arithmetic complexity and make the most of the available hardware resources for the fastest execution.
3.3.1 Winograd optimization
The Winograd algorithm is a well-known minimal filtering algorithm proposed by Shmuel Winograd Winograd (1980) and has been used to accelerate convolution in DNNs Lavin & Gray (2016).
Given an input tile $X$ of size $(n+k-1) \times (n+k-1)$ and its output tile $Y$ of size $n \times n$, Winograd convolution can be formulated as

(6)   $Y = A^{T} \Big[ \sum_{c} \big( G W_{c} G^{T} \big) \odot \big( B^{T} X_{c} B \big) \Big] A,$

where $G$, $B$, and $A$ are the transformation matrices for the kernel $W$ (spatial size $k \times k$), the input $X$ (spatial size $(n+k-1) \times (n+k-1)$), and the output $Y$ (spatial size $n \times n$), respectively, and $c$ indexes the input channels. These three matrices depend only on the shapes of $W$ and $X$.
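The Winograd transform above can be checked numerically for the classic F(2×2, 3×3) case (single channel, for simplicity), using the standard transformation matrices from Lavin & Gray (2016); this is an illustrative verification, not MNN's implementation.

```python
import numpy as np

# Standard F(2x2, 3x3) Winograd transformation matrices (Lavin & Gray).
A_T = np.array([[1, 1, 1, 0],
                [0, 1, -1, -1]], dtype=float)
B_T = np.array([[1, 0, -1, 0],
                [0, 1, 1, 0],
                [0, -1, 1, 0],
                [0, 1, 0, -1]], dtype=float)
G = np.array([[1, 0, 0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0, 0, 1]], dtype=float)

def winograd_2x2_3x3(d, g):
    """One 4x4 input tile d and one 3x3 kernel g -> 2x2 output tile."""
    U = G @ g @ G.T                 # transformed kernel, 4x4
    V = B_T @ d @ B_T.T             # transformed input tile, 4x4
    return A_T @ (U * V) @ A_T.T    # Hadamard product, then output transform

def direct_conv(d, g):
    """Reference: valid cross-correlation of a 4x4 tile with a 3x3 kernel."""
    out = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            out[i, j] = np.sum(d[i:i + 3, j:j + 3] * g)
    return out
```

The two functions agree to floating-point precision, while the Winograd path spends only 16 multiplications in the Hadamard product for 4 outputs instead of 36 for direct convolution.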
We optimize Winograd convolution through a proposed Winograd generator together with popular parallel computing techniques such as pipelining and SIMD.
(1) Block division and pipelining. In Winograd convolution Lavin & Gray (2016), $X$ is a small tile instead of the whole input feature map, which leaves us with the first issue: block division, i.e., how to determine the blocks.
To resolve it, we divide the blocks from the output perspective. For an output feature map of size $o_w \times o_h$, let $T$ be the multiplier for parallel computing (namely, we compute $T$ output blocks at a time). Then there are

(7)   $\big\lceil o_w / \hat{n} \big\rceil \cdot \big\lceil o_h / \hat{n} \big\rceil$

output blocks in total, where $\hat{n}$ is the aforementioned optimal output tile size decided at the pre-inference stage (Equation 2).
When $T$ blocks are computed together, we must try our best to avoid pipeline stalls in order to hide latency. The trick is well known as avoiding data dependence in the pipeline Kennedy & Allen (2001). This is achieved by careful assembly instruction rearrangement in MNN.
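A minimal sketch of the output-side block division follows. All parameter names (`o_w`, `o_h` for the output size, `n` for the tile size, `T` for the parallel multiplier) are just illustrative labels for the quantities discussed above.

```python
# Illustrative output-side tiling: the o_w x o_h output is covered by
# n-by-n tiles, and tiles are grouped T at a time so that T blocks can
# be computed together in a pipelined, parallel fashion.

def divide_output(o_w: int, o_h: int, n: int, T: int):
    """Return groups of output-tile origins (y, x), at most T per group."""
    tiles = [(y, x) for y in range(0, o_h, n) for x in range(0, o_w, n)]
    return [tiles[i:i + T] for i in range(0, len(tiles), T)]
```

Each group is then a natural unit for instruction scheduling: the T tiles in a group have no data dependence on one another, which is what makes the stall-avoiding rearrangement possible.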
(2) Hadamard product optimization. The Hadamard product is an essential step in Winograd convolution (see Equation 6). However, it has the problem that memory access takes much time, dragging down the whole acceleration.
From Equation 6, it can be found that the summation plus the Hadamard product can be transformed into a dot product. Combining many dot products gives a matrix multiplication, which is well suited for parallelism and for amortizing memory access overhead. In this spirit, we propose to transform the Hadamard product into matrix multiplication, building upon data layout reordering. The resulting data layout is called NC4HW4 DMLC (2016); Apple (2014). Briefly, NC4HW4 is a data layout reordering method that splits out a fixed number of data elements (4 in this paper) as a unit to make a new dimension for a tensor. The 4 elements are placed contiguously in memory so as to leverage the vector registers in CPUs to compute 4 data elements in a single instruction (i.e., SIMD). After this reordering, the Winograd convolution is illustrated in Figure 4.

(3) Winograd generator. Most existing inference frameworks using Winograd Google (2017a); Tencent (2017); Xiaomi (2018) hardcode the transformation matrices for common kernel and input sizes in their source code Google; Xiaomi, which has relatively poor scalability in the face of new cases. In contrast, MNN keeps a universal interface through a proposed Winograd generator, allowing for Winograd convolution of any kernel and input size. We adopt the following formula to generate the matrices,
(8)  
where $c$ is a scalar used to minimize the numerical errors; its value is fixed in this paper.
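The NC4HW4 reordering described earlier can be sketched with NumPy. The function names and the pad-to-multiple-of-4 precondition are illustrative; the essential point is that the 4 channel values of each (h, w) position become contiguous, matching 4-lane SIMD registers.

```python
import numpy as np

# Sketch of the NC4HW4 layout: split the channel dimension into groups
# of 4 and make the group the innermost (contiguous) dimension.

def nchw_to_nc4hw4(x):
    """x: (N, C, H, W) with C divisible by 4 -> (N, C//4, H, W, 4)."""
    n, c, h, w = x.shape
    assert c % 4 == 0, "pad channels to a multiple of 4 first"
    return x.reshape(n, c // 4, 4, h, w).transpose(0, 1, 3, 4, 2).copy()

def nc4hw4_to_nchw(x):
    """Inverse reordering back to (N, C, H, W)."""
    n, c4, h, w, _ = x.shape
    return x.transpose(0, 1, 4, 2, 3).reshape(n, c4 * 4, h, w).copy()
```

With this layout, a single 4-lane SIMD load fetches the 4 channel values of one spatial position, so the channel-wise dot products behind the Hadamard-to-matmul transformation hit contiguous memory.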
3.3.2 Large matrix multiplication optimization
As aforementioned (Section 3.2), convolution operations with kernel size 1 are converted into a large matrix multiplication in MNN, where the Strassen algorithm Strassen (1969) comes in for acceleration. To the best of our knowledge, MNN is the first mobile inference engine to adopt the Strassen algorithm to accelerate large matrix multiplication.
Strassen's is a fast algorithm that trades expensive multiplications for cheap additions, and its acceleration effect is maximized when it is applied recursively Blahut (2010). In practice, we need to decide when the recursion should stop. On modern processors, a multiplication costs nearly the same as an addition, so we can compare their costs simply by counting operations. In MNN, for a matrix multiplication of size $(n, k) \times (k, m)$, one level of Strassen recursion replaces eight half-size sub-multiplications with seven; the extra cost is 4 matrix additions of size $(\frac{n}{2}, \frac{k}{2})$, 4 matrix additions of size $(\frac{k}{2}, \frac{m}{2})$, and 7 matrix additions of size $(\frac{n}{2}, \frac{m}{2})$. Therefore, the recursion goes on only when the benefit exceeds the cost, formulated as

(9)   $\frac{n}{2} \cdot \frac{k}{2} \cdot \frac{m}{2} \;>\; 4 \cdot \frac{n}{2} \cdot \frac{k}{2} + 4 \cdot \frac{k}{2} \cdot \frac{m}{2} + 7 \cdot \frac{n}{2} \cdot \frac{m}{2}.$

Once this inequality no longer holds, the recursion of Strassen stops.
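The recursion-stop test can be sketched as follows, assuming the Strassen-Winograd variant with 15 half-size additions (4 + 4 + 7) per recursion level; the exact addition counts are an assumption of this sketch. The benefit is the cost of the one half-size sub-multiplication that is saved.

```python
# Sketch of the Strassen recursion-stop rule: recurse on an (n x k)
# times (k x m) product only while the saved half-size multiplication
# outweighs the assumed 4 + 4 + 7 extra half-size matrix additions.

def should_recurse(n: int, k: int, m: int) -> bool:
    saved_muls = (n // 2) * (k // 2) * (m // 2)   # one sub-product saved
    extra_adds = (4 * (n // 2) * (k // 2)         # additions on A blocks
                  + 4 * (k // 2) * (m // 2)       # additions on B blocks
                  + 7 * (n // 2) * (m // 2))      # additions on C blocks
    return saved_muls > extra_adds

def strassen_depth(n: int, k: int, m: int) -> int:
    """How many recursion levels the rule above permits."""
    depth = 0
    while n % 2 == k % 2 == m % 2 == 0 and should_recurse(n, k, m):
        n, k, m = n // 2, k // 2, m // 2
        depth += 1
    return depth
```

The rule stops recursing once sub-matrices get small: for a 256-cube product several levels pay off, while a 16-cube product is already cheaper to multiply directly.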
Table 3 demonstrates the advantage of Strassen over direct matrix multiplication for different matrix sizes. We can see that the Strassen method outperforms the direct one in the tested cases.
Matrix size  w/o Strassen  w/ Strassen 

()  
()  
() 
3.4 Backend Abstraction
The backend abstraction module is introduced to encapsulate all hardware platforms (e.g., GPU, CPU, TPU) and software solutions (e.g., OpenCL, OpenGL, Vulkan) into a uniform Backend class. Through the Backend class, resource management, memory allocation, and scheduling are disentangled from the concrete operator implementations.
The Backend class consists of several abstract functions, as shown in Figure 5. For memory management, onAcquireBuffer is responsible for allocating new memory for tensors and onReleaseBuffer for releasing it. For operator implementation, onCreate is designed to create an execution instance for each operator.
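The interface described above can be sketched as an abstract base class. The three method names follow the paper (onAcquireBuffer, onReleaseBuffer, onCreate, written in snake_case here); the signatures, the dict-based tensor/op descriptions, and the toy CPU backend are all illustrative assumptions, not MNN's actual C++ API.

```python
from abc import ABC, abstractmethod

# Sketch of the Backend abstraction: memory management and operator
# execution creation are defined once, hiding per-backend raw details.

class Backend(ABC):
    @abstractmethod
    def on_acquire_buffer(self, tensor):
        """Allocate backend memory for a tensor."""

    @abstractmethod
    def on_release_buffer(self, tensor):
        """Release (or recycle) a tensor's backend memory."""

    @abstractmethod
    def on_create(self, op):
        """Create an execution instance for an operator."""

class CPUBackend(Backend):
    """Toy concrete backend: buffers live in a plain dict-based pool."""
    def __init__(self):
        self.pool = {}                        # tensor id -> buffer

    def on_acquire_buffer(self, tensor):
        self.pool[id(tensor)] = bytearray(tensor["bytes"])

    def on_release_buffer(self, tensor):
        self.pool.pop(id(tensor), None)       # buffer returns to the pool

    def on_create(self, op):
        return lambda inputs: op["kernel"](inputs)
```

A GPU backend would implement the same three methods with, e.g., device buffer allocation and command-buffer setup, while operator code written against `Backend` stays untouched.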
The advantages of this specific module are threefold.
[Figure: (a) Manual Search; (b) Semi-automated Search (MNN); (c) Automated Search]
(1) Reduced complexity. A large number of operators and innumerable devices make operator optimization a non-trivial task. The major challenge lies in the fact that heterogeneous backends typically have different ways of managing resources, e.g., allocating/freeing memory and scheduling data. Dealing with these issues makes implementations error-prone and cumbersome. The Backend class uniformly manages resource loading, such as GPU shaders, and completes the optimal memory allocation according to the declarations of operators. With backend abstraction, MNN manages to divide the task into two independent parts. The "front-end operator" developers can focus on efficient implementation for fast operator execution, with all unrelated backend details hidden. The "backend" developers can be devoted to exploiting different backend specifications and offering more convenient APIs. This separation of tasks is rather meaningful in practice, since lowering contribution barriers is highly appreciated in open-source projects.
(2) Hybrid scheduling. Heterogeneous computing mainly involves backend selection and data transmission among different backends. When creating an inference session in MNN, one can configure the targeted backends. If multiple backends are available, MNN can decide the optimal backend for each operator according to the aforementioned backend evaluation (Section 3.2). As a result, MNN supports the flexible combination of operator execution on different backends even within a single inference. For example, a convolution may run on the CPU while the following ReLU activation runs on the GPU. With the help of the Backend class, developers do not need to worry about the scheduling and transmission, which proceed automatically under the hood.
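Hybrid scheduling can be sketched as a simple placement pass. The backend capability sets and the transfer-counting model are illustrative; in a real engine, cross-backend data movement (e.g., CPU-GPU uploads) is what this pass tries to keep cheap.

```python
# Sketch of hybrid scheduling: each operator runs on the configured
# backend when supported, otherwise falls back to CPU, and data moves
# between backends only when consecutive ops land on different ones.

def schedule(ops, preferred, supported):
    """ops: operator names in execution order; preferred: target backend;
    supported: dict backend -> set of op names it implements.
    Returns (placement, number of cross-backend transfers)."""
    placement, transfers, prev = [], 0, None
    for op in ops:
        backend = preferred if op in supported[preferred] else "cpu"
        if prev is not None and backend != prev:
            transfers += 1            # e.g. upload/download of tensors
        placement.append((op, backend))
        prev = backend
    return placement, transfers
```

An unsupported operator in the middle of a GPU-scheduled graph thus costs two transfers (download before, upload after), which the cost evaluation of Section 3.2 can weigh against keeping the whole graph on CPU.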
(3) More lightweight. The implementation of each backend can work as an independent component while maintaining the uniform interface in MNN. Whenever some backends are unavailable on a certain device, the corresponding concrete implementation can easily be peeled off from the whole framework. For example, Metal Apple (2014) is not supported on Android platforms, and thus we can just take away the Metal module without touching the specific operator implementations. Through the decoupling of operators and backends, MNN can be competitively lightweight. This is of critical importance since mobile applications have rigorous restrictions on binary size.
Through these uniform backend interfaces, MNN is, to the best of our knowledge, the inference engine that supports the most comprehensive set of backends (see Table 4). Moreover, it is scalable enough for users to integrate new backends such as NPU, FPGA, etc.
3.5 Method Summary
Compared with some other typical design paradigms, shown in Figure 6, we illustrate the design philosophy behind MNN and highlight several MNN-specific advantages in this part.
For an inference engine, high performance is the vital factor deciding whether it will be adopted by developers or not. Out of this motivation, different inference engines (e.g., TFLite, NCNN, TVM) put continuous effort into optimization to achieve better performance in different ways.
For MNN, we make many improvements to meet the fundamental requirement of high performance. Taking the overwhelming operator diversity into consideration, we are not satisfied with the case-by-case optimization solution. Although this solution can be rather simple and effective, it often encounters the problem that some operators are left out of optimization and become the performance bottleneck (as shown by an example in Section 4.2). Instead, MNN first spots the compute-intensive unit of a smaller granularity (i.e., the basic matrix multiplication), which is highly optimized by means of fast algorithms and parallel computing techniques. Consequently, operators built upon this basic unit naturally benefit from acceleration without being specially optimized.
Besides, maintainability, scalability, and deployment cost all have a great impact on the long-term growth of an inference engine. Compared with auto-tuning in TVM, MNN is able to select the optimal computation scheme in less time and realizes runtime optimization through the proposed pre-inference mechanism. Note that, by transferring the search stage from offline compilation to online pre-inference, we also avoid the restriction of binary validation (e.g., the iOS code signature). Meanwhile, by providing a set of uniform interfaces to hide raw backend details, MNN is naturally equipped with modularity. Not only does this property make MNN lightweight, but it also makes it more convenient for contributors to extend MNN to more backends, as partly shown by the number of operators supported on different backends in Table 4.
4 Benchmark Experiments
In this section, we comprehensively evaluate the performance of MNN. We first explain our experiment settings, then present the experimental results on different hardware platforms and networks, compared with other mobile inference engines. Finally, we share an online case to show the production application of MNN.
4.1 Experiment Settings

Devices. For iOS, we adopt iPhone8 and iPhoneX (processor: Apple A11 Bionic), as they are popularly adopted in benchmarks (https://www.tensorflow.org/lite/performance/benchmarks). For Android, MI6 (processor: Snapdragon 835) and Mate20 (processor: Kirin 980) are adopted.

CPU and GPU. (1) For CPU, multi-thread inference is evaluated, considering that modern devices usually have two or four processors and multi-threading is a common acceleration technique. CPU affinity is set to use all available cores, in line with the NCNN benchmark Tencent (2017). (2) For GPU, we evaluate the Metal Performance Shaders on iPhones. On Android devices, three standard backends (i.e., OpenCL, OpenGL, and Vulkan) are evaluated, since MNN supports them all (see Table 4).
4.2 Experimental Results
Performance on different smartphones and networks. Figure 7 shows the performance of MNN compared with four other inference engines. We have the following observations.
(1) Generally, MNN outperforms the other inference engines under almost all settings, regardless of smartphone, backend, and network.
(2) For CPU, on average, 4-thread inference with MNN is faster than the others on iOS platforms, as well as on Android platforms (e.g., Mate20).
(3) For the Metal GPU backend on iPhones, MNN is much faster than TFLite and a little slower than CoreML but still comparable, which is reasonable since CoreML is Apple's exclusive GPU solution tailored to iOS, while MNN is meant to support backends across different operating systems. For Android GPU backends, other engines usually have their performance blind spots. For example, NCNN with the Vulkan backend is not very fast on MI6; TFLite with OpenGL still has much room for improvement on the ResNet18 network. In contrast, MNN obtains favorable results on all the different hardware platforms and networks. Note that we achieve this comprehensive performance advantage by means of the proposed semi-automated search architecture rather than case-by-case heavy optimization.
(4) The multi-thread CPU inference using MNN on high-end devices (e.g., iPhone8 and iPhoneX) is highly competitive compared with that using GPU backends, which demonstrates the effectiveness of the proposed in-depth kernel optimization of MNN.
Bottleneck of case-by-case optimization. Figure 8 shows an under-performing example of case-by-case optimization on Inception-v3 Szegedy et al. (2015). One can clearly see that NCNN requires abnormally more inference time than the others. This is because some special operators in this network (e.g., the asymmetric 1×7 and 7×1 convolutions) are not optimized by NCNN (for now). Consequently, they become bottlenecks during execution and severely drag down the overall performance. This specific case shows the limited scalability of case-by-case optimization. MNN is free from this problem because our computation solution is a general one applicable to various convolution cases.
Comparison with TVM. We compare the performance of MNN with TVM on various networks, as shown in Figure 9. Although MNN does not apply model-specific tuning and optimization, MNN still has encouraging performance and is even slightly faster than TVM. In addition, generating model-specific code using auto-tuning and compiling in TVM usually brings some deployment overhead. In Table 5, we show the time cost of auto-tuning and compiling for ResNet18 with TVM. (One may wonder why the compiling results are faster than auto-tuning. This is because TVM provides developers with a mechanism, i.e., tensorize, to replace the auto-tuned code with a hand-optimized implementation. Thus, it is reasonable for the optimized direct compilation to outperform auto-tuning.) Even with a small number of trials for tuning on a single device, TVM still takes much time to generate code. Since most mobile applications cover lots of different device types, the process of code generation will take a much longer time and demand more resources (e.g., servers), which is hardly affordable for many developers. MNN is free from these issues because all optimizations are performed at runtime without performance loss.
#Trial  Auto-tuning  Compiling 

Device  CPU type  GPU type  AIT (average inference time, ms) 

EMLAL00  Kirin 970  MaliG72 MP12  
PBEM00  SDM670  Adreno 615  
PACM00  CortexA73  MaliG72 MP3  
COLAL10  CortexA73  MaliG72 MP12  
OPPO R11  Kryo 260  Adreno 512 
4.3 Online Case Study
Searching for commodities is an indispensable part of shopping on E-commerce platforms. However, due to the complexity of commodity categories, the traditional way of text search cannot meet the expectations of users these days. Therefore, searching for commodities by image has become a must-have feature for E-commerce platforms. This part shows a real application case of MNN in this scenario.
In the E-commerce application, MNN is employed to run deep learning models on mobile devices for main object detection, and the detected results are then used in commodity search. This service covers a wide variety of mobile devices and has over a million daily users. Table 6 shows the most popular devices used in this service and the average inference time, where MNN achieves a stable and smooth search experience across all the different devices, regardless of their broad diversity. This shows the encouraging universality of MNN.
5 Conclusion and Future Work
A mobile inference engine is critical to deploying deep learning models in mobile applications. To deal with the challenges of model compatibility and device diversity, we introduce Mobile Neural Network (MNN), which proposes a novel mobile engine design paradigm (semi-automated search) to get the best of both universality and efficiency. The pre-inference mechanism, together with thorough kernel optimization and backend abstraction, endows MNN with favorable universality and state-of-the-art on-device inference performance.
MNN is still evolving fast and is being improved in many aspects, for example: (1) applying auto-tuning during backend evaluation, (2) integrating model compression tools (e.g., pruning) to slim models on the fly, (3) providing more tools for user convenience, and (4) providing support for more languages, including JavaScript and Python.
Acknowledgements
We thank Chaoyue Niu for helpful discussions and the anonymous reviewers for their valuable comments to improve our work.
References
 Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for largescale machine learning. In IEEE Symposium on Operating Systems Design and Implementation, 2016.
 Apple (2014) Apple. Metal: Accelerating graphics and much more. https://developer.apple.com/metal/, 2014. Accessed: 20190901.
 Apple (2017) Apple. CoreML. https://developer.apple.com/documentation/coreml, 2017. Accessed: 20190901.
 Ashari et al. (2015) Ashari, A., Tatikonda, S., Boehm, M., Reinwald, B., Campbell, K., Keenleyside, J., and Sadayappan, P. On optimizing machine learning workloads via kernel fusion. In ACM SIGPLAN Notices, 2015.
 Baidu (2018) Baidu. Anakin. https://github.com/PaddlePaddle/Anakin, 2018. Accessed: 20190901.
 Bastien et al. (2012) Bastien, F., Lamblin, P., Pascanu, R., Bergstra, J., Goodfellow, I., Bergeron, A., Bouchard, N., Warde-Farley, D., and Bengio, Y. Theano: new features and speed improvements. In NeurIPS Workshop, 2012.
 Blahut (2010) Blahut, R. E. Fast algorithms for signal processing. Cambridge University Press, 2010.
 Chen et al. (2015) Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 DMLC (2016) DMLC. TVM: Tensor virtual machine, open deep learning compiler stack. https://github.com/dmlc/tvm, 2016. Accessed: 2019-09-01.
 Google. TensorFlow Winograd. https://github.com/tensorflow/tensorflow/blob/9590c4c32dd4346ea5c35673336f5912c6072bf2/tensorflow/core/kernels/winograd_transform.h. Accessed: 2019-09-01.
 Google (2016) Google. Neural Networks API. https://developer.android.google.cn/ndk/guides/neuralnetworks, 2016. Accessed: 2019-09-01.
 Google (2017a) Google. TensorFlow Lite. https://tensorflow.google.cn/lite, 2017a. Accessed: 2019-09-01.
 Google (2017b) Google. XLA: Accelerated linear algebra. https://developers.googleblog.com/2017/03/xlatensorflowcompiled.html, 2017b. Accessed: 2019-09-01.
 Guo et al. (2019) Guo, L., Hua, L., Jia, R., Zhao, B., Wang, X., and Cui, B. Buying or browsing?: Predicting real-time purchasing intent using attention-based deep network with multiple behavior. In SIGKDD, 2019.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Iandola et al. (2016) Iandola, F., Moskewicz, M., and Ashraf, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
 Jia et al. (2014) Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
 Jiang et al. (2018) Jiang, Z., Chen, T., and Li, M. Efficient deep learning inference on edge devices. In MLSys, 2018.
 Kennedy & Allen (2001) Kennedy, K. and Allen, J. R. Optimizing compilers for modern architectures: a dependencebased approach. Morgan Kaufmann, 2001.
 Khronos (1992) Khronos Group. OpenGL: Open Graphics Library. https://opengl.org/, 1992. Accessed: 2019-09-01.
 Khronos (2009) Khronos Group. OpenCL: Open Computing Language. https://www.khronos.org/opencl/, 2009. Accessed: 2019-09-01.
 Khronos (2015) Khronos Group. Vulkan. https://www.khronos.org/vulkan, 2015. Accessed: 2019-09-01.

 Lavin & Gray (2016) Lavin, A. and Gray, S. Fast algorithms for convolutional neural networks. In CVPR, 2016.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NeurIPS Workshop, 2017.
 Rastegari et al. (2016) Rastegari, M., Ordonez, V., Redmon, J., and Farhadi, A. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, 2016.
 Reddi et al. (2019) Reddi, V. J., Cheng, C., Kanter, D., Mattson, P., Schmuelling, G., Wu, C.-J., Anderson, B., Breughe, M., Charlebois, M., Chou, W., et al. MLPerf inference benchmark. arXiv preprint arXiv:1911.02549, 2019.
 Shi et al. (2016) Shi, W., Cao, J., Zhang, Q., Li, Y., and Xu, L. Edge computing: Vision and challenges. IEEE Internet of Things Journal, 3(5):637–646, 2016.
 Strassen (1969) Strassen, V. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969.
 Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In CVPR, 2015.
 Tencent (2017) Tencent. NCNN. https://github.com/Tencent/ncnn, 2017. Accessed: 2019-09-01.
 Wei et al. (2017) Wei, R., Schwartz, L., and Adve, V. DLVM: A modern compiler infrastructure for deep learning systems. arXiv preprint arXiv:1711.03016, 2017.
 Winograd (1980) Winograd, S. Arithmetic complexity of computations, volume 33. SIAM, 1980.
 Xiaomi. MACE Winograd. https://github.com/XiaoMi/mace/blob/9b0b03c99cf73cd019050c6b9ee80a4753265da0/mace/ops/arm/fp32/conv_2d_3x3_winograd.cc. Accessed: 2019-09-01.
 Xiaomi (2018) Xiaomi. MACE: Mobile AI Compute Engine. https://github.com/XiaoMi/mace, 2018. Accessed: 2019-09-01.
 Yu et al. (2014) Yu, D., Eversole, A., Seltzer, M., Yao, K., Huang, Z., Guenter, B., Kuchaiev, O., Zhang, Y., Seide, F., Wang, H., et al. An introduction to computational networks and the computational network toolkit. Microsoft Technical Report MSR-TR-2014-112, 2014.
Appendix A MLPerf Evaluation
We also benchmark the MobileNet-v2 model with MNN using the MLPerf benchmark tool Reddi et al. (2019) on the CPU threads of a Pixel 3. Results are shown in Table 7.
Item of evaluation  Value
min_query_count
max_query_count
QPS w/ loadgen overhead
QPS w/o loadgen overhead
Min latency (ns)
Max latency (ns)
Mean latency (ns)
percentile latency (ns)
percentile latency (ns)
Appendix B More comparison on Pixel phones
To further compare with TFLite, we also evaluate the Inception-v3 float model on the CPUs of Pixel 2 and Pixel 3. Results are shown in Table 8. As shown, MNN is consistently faster than TFLite with either a single thread or multiple threads, in line with the results in the main paper.
Phone type  #Threads  TFLite  MNN 

Pixel 2  
Pixel 2  
Pixel 3  
Pixel 3 
Appendix C Backend cost evaluation
Both the CPU and GPU backends use FLOPS to measure the capability of the processor; only the GPU backend has the additional schedule-cost term. Their values are determined as follows.

FLOPS. For the CPU, if the OS is Linux or Android, we can access the maximal frequency of each CPU core. We then choose the largest of these frequencies and add them together as the FLOPS term, where the number of frequencies summed equals the pre-specified number of threads (such as two or four). For other CPU systems, FLOPS is set to a default value. For the GPU, we estimate the FLOPS through practical running: we run the MobileNet-v1 network a number of times and obtain the FLOPS values for a set of common mobile GPUs, shown in the list below. For GPUs not in this list, we set the FLOPS to a default larger than that of the CPU, i.e., the GPU is assumed to be faster than the CPU, as is the normal case.

The list of GPU FLOPS (): Mali-T860: ; Mali-T880: ; Mali-G51: ; Mali-G52: ; Mali-G71: ; Mali-G72: ; Mali-G76: ; Adreno (TM) 505: ; Adreno (TM) 506: ; Adreno (TM) 512: ; Adreno (TM) 530: ; Adreno (TM) 540: ; Adreno (TM) 615: ; Adreno (TM) 616: ; Adreno (TM) 618: ; Adreno (TM) 630: ; Adreno (TM) 640: .

Schedule cost. This term depends on the adopted graphics API. For OpenCL and OpenGL, it is empirically set to the normal average time (in ms) of calling an API such as clEnqueueNDRangeKernel. For Vulkan, since it only needs to submit its commandBuffer, which is less time-consuming, the schedule cost can be estimated as a smaller value (in ms).
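Putting the two terms together, backend selection can be sketched as below. The additive combination of compute time and schedule cost, and all the numbers in the example, are illustrative assumptions rather than measured values from the paper:

```python
def backend_cost(macs, flops, t_schedule_ms):
    """Estimated time (ms) to run `macs` multiply-accumulates on a backend:
    compute time derived from the FLOPS term, plus the per-call schedule
    cost (zero for CPU, API-dependent for GPU)."""
    return macs / flops * 1000.0 + t_schedule_ms

def choose_backend(macs, candidates):
    """candidates: {name: (flops, t_schedule_ms)}; pick the cheapest backend."""
    return min(candidates, key=lambda name: backend_cost(macs, *candidates[name]))

# Hypothetical numbers: a large layer amortizes the GPU's schedule overhead,
# while a tiny layer is cheaper to keep on the CPU.
backends = {"CPU": (8e9, 0.0), "GPU": (64e9, 0.05)}
```

With these hypothetical figures, `choose_backend(3e8, backends)` picks the GPU, while `choose_backend(1e5, backends)` picks the CPU, illustrating why the schedule term matters for small workloads.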