Recently, there has been an increasing trend to carry out convolutional neural network (CNN) inference directly on mobile devices because of both privacy and real-time latency (user experience) requirements [13, 9, 11, 18]. However, since mobile devices are subject to both computational and energy constraints, recent research has put effort into designing more lightweight “mobile models” that are composed of fewer layers and/or use less computationally expensive operations.
When optimizing the performance of a program for a particular type of processor, developers often use the roofline model to guide their implementation. Figure 1 shows the roofline model of a quad-core ARM Cortex-A57. The roofline (the dashed line) indicates the maximum achievable performance of any program on that processor.
Given the roofline model of a processor, one can check whether an implementation has fully utilized that processor. In Figure 1, the point ‘Unoptimized’ represents a naive C implementation of MobileNetV1 written by us. The point ‘TF-Lite’ represents the popular TensorFlow Lite binary compiled with math optimization and auto vectorization, and linked to the Eigen BLAS library. Since TF-Lite is open source, it is known that it has already been optimized with all the optimization tricks suggested in the roofline article (e.g., using SIMD intrinsics). Unfortunately, even the popular TensorFlow Lite (TF-Lite) does not fully utilize the processor. So, what is missing?
ARM processors get the lion’s share of the mobile device processor industry; and depthwise convolution (DWConv) and pointwise convolution (PWConv) are the two most dominant operations in state-of-the-art mobile models, taking up more than 90% of total inference time [11, 18, 22]. Therefore, the goal of this paper is to optimize DWConv and PWConv on ARM processors. We observe two major issues that hurt the performance of DWConv and PWConv on ARM processors.
First, we point out that the existing DWConv and PWConv implementations have poor core scalability, which goes against the trend of ARM processors getting more cores (e.g., Huawei’s latest mobile phone SoC chipset, Kirin 980, has eight ARM cores). Second, we point out that the optimization tricks suggested in the roofline article are necessary but insufficient for ARM processors. Specifically, while both ARM and x86 processors can carry out 2 FMA (fused-multiply-add) instructions per cycle, ARM processors can only load 1 register (from the cache) per cycle, whereas x86 processors can load 4 registers per cycle. In other words, while reducing cache misses and increasing parallelism could eliminate the major bottleneck on x86 processors, on ARM processors those tricks merely shift the bottleneck to the traffic between the registers and the cache. Based on the above observations, we develop high performance versions of DWConv and PWConv for mobile devices. Using techniques like loop rescheduling and register tiling, our implementations are able to reduce the traffic between the cache and the memory as well as the traffic between the registers and the cache. Experimental results show that our implementations achieve speedups of up to 5.5× and 2.1× over TVM on DWConv and PWConv respectively, which leads to 46 GFLOPS on ARM Cortex-A57 in terms of overall MobileNetV1 inference.
ARM processors dominate the mobile device market. The latest ARM processors all support a 64-bit architecture named “AArch64”. AArch64 is a load-store architecture where data has to be loaded into the registers before operations take place. AArch64 supports SIMD instructions and each core has 32 SIMD registers. Each SIMD register is 128-bit, which means each SIMD instruction can operate on 4 single-precision numbers simultaneously. The predominant instruction used in model inference is the FMA (fused-multiply-add) SIMD instruction. An FMA instruction requires 3 SIMD registers to fully operate. Each FMA instruction carries out a 4-way SIMD multiplication, followed by a 4-way SIMD addition.
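As an illustration, the per-lane semantics of one such SIMD FMA can be sketched in portable scalar C (the function name fma4 and the scalar emulation are our own illustration; real AArch64 code would use the vfmaq_f32 NEON intrinsic on float32x4_t vectors):

```c
/* Scalar sketch of one 4-lane SIMD FMA: d[i] += a[i] * b[i] for the
 * 4 single-precision lanes of a 128-bit register. The three arrays
 * stand in for the 3 SIMD registers an FMA instruction requires. */
void fma4(float d[4], const float a[4], const float b[4]) {
    for (int i = 0; i < 4; i++)
        d[i] += a[i] * b[i];   /* one multiply and one add per lane */
}
```

A single call updates all four lanes at once, which is why one FMA instruction counts as 8 floating-point operations (4 multiplies plus 4 adds).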
Depthwise convolution (DWConv) is a key operation in mobile models. It takes three inputs: (i) a 3d array (the input feature map) of size H_i × W_i × C, (ii) a 3d array (the filter) of size H_f × W_f × C, and (iii) the stride. It produces a 3d array (the output feature map) of size H_o × W_o × C. In the above, H and W are the spatial height and width, and C is the number of channels. The subscripts i, f, and o refer to the input feature map, the filter, and the output feature map, respectively.
Figure 2 illustrates the concept of depthwise convolution. Algorithm 1 is its plain implementation, which consists of 5 tightly-nested loops around a multiply-accumulate (MAC) statement (Line 6). Referring to Figure 2, the implementation iteratively applies the filter (lines 4 and 5) per channel (Line 3), and then repeats the task by moving the filter from left to right (Line 2) and then from top to bottom (Line 1).
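As a concrete sketch, a plain C rendering of Algorithm 1 might look as follows (the HWC memory layout and all identifiers are our assumptions for illustration, not TF-Lite’s code):

```c
/* Plain DWConv following Algorithm 1: five tightly-nested loops
 * around one multiply-accumulate (MAC). Arrays use HWC layout
 * (height, width, channel), an assumption of this sketch. */
void dwconv_plain(const float *in, int Hi, int Wi, int C,
                  const float *flt, int Hf, int Wf,
                  float *out, int stride) {
    int Ho = (Hi - Hf) / stride + 1;
    int Wo = (Wi - Wf) / stride + 1;
    for (int ho = 0; ho < Ho; ho++)            /* Line 1: top to bottom  */
      for (int wo = 0; wo < Wo; wo++)          /* Line 2: left to right  */
        for (int c = 0; c < C; c++) {          /* Line 3: per channel    */
          float acc = 0.0f;
          for (int hf = 0; hf < Hf; hf++)      /* Line 4: filter rows    */
            for (int wf = 0; wf < Wf; wf++)    /* Line 5: filter cols    */
              acc += in[((ho*stride+hf)*Wi + (wo*stride+wf))*C + c]
                   * flt[(hf*Wf + wf)*C + c];  /* Line 6: the MAC        */
          out[(ho*Wo + wo)*C + c] = acc;
        }
}
```

Each channel is convolved with its own filter slice independently, which is what makes the cross-channel SIMD rescheduling discussed next possible.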
Algorithm 2 shows the implementation of DWConv in TF-Lite. It mainly applies 4 tricks to improve its efficiency.
Loop rescheduling and SIMD. Any permutation of the ordering (scheduling) of the loops yields the same correct result but with different efficiency. Furthermore, each channel of the filter can be applied to the corresponding channel of the input independently and thus in parallel. Consequently, Algorithm 2 reschedules the innermost loop to process the MAC across 4 channels using SIMD (lines 6–12).
Loop Unrolling. The innermost loop possesses loop independence, meaning one iteration does not depend on the previous one. In other words, the loop can be run in parallel. Consequently, the actual implementation of the innermost loop is unrolled (also called flattened). Loop unrolling not only improves ILP (instruction-level parallelism), but also reduces the branch mis-predictions incurred by the test condition of each iteration. For brevity, Algorithm 2 does not explicitly show the unrolled loop.
Loop Blocking. When matrices/tensors are involved, loop blocking is often used to reduce cache misses. In TF-Lite, loop blocking is applied to the loop in Algorithm 1 (Line 2), which becomes the two loops in Algorithm 2 (Lines 2 and 4). By doing so, the data loaded in the outer loop (Algorithm 2; Line 2) can stay in the cache and be re-used again and again by the inner loop.
Multi-threading. As real-time inference is getting more important, TF-Lite also uses multiple cores to parallelize the outermost loop (Line 1). In other words, the blocks along that dimension in Figure 2 are generated by multiple cores.
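The loop unrolling trick above can be made concrete with a minimal C sketch of a 4-way unrolled MAC loop (a generic dot product of our own construction, not TF-Lite’s actual code):

```c
/* 4-way unrolled MAC loop. The four independent accumulators expose
 * ILP, and the loop-test branch executes a quarter as often as in
 * the rolled version. */
float dot_unrolled(const float *a, const float *b, int n) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {      /* unrolled by a factor of 4 */
        s0 += a[i]   * b[i];
        s1 += a[i+1] * b[i+1];
        s2 += a[i+2] * b[i+2];
        s3 += a[i+3] * b[i+3];
    }
    for (; i < n; i++)                /* remainder iterations */
        s0 += a[i] * b[i];
    return s0 + s1 + s2 + s3;
}
```

Because the four accumulations are independent, an out-of-order core can issue them back to back instead of waiting on a single accumulator’s dependency chain.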
Another key component in mobile models is the pointwise convolution (PWConv). PWConv is a simple 1×1 convolution. It takes as inputs: (1) a 3d input feature map of size (H_i × W_i × C_i), and (2) a 4d filter of size (1 × 1 × C_i × C_o), and produces a 3d output feature map of size (H_o × W_o × C_o), where H_o = H_i and W_o = W_i.
Algorithm 3 shows the implementation of PWConv in TF-Lite. It essentially transforms the problem into a matrix-matrix (MM) multiplication problem, where the 2d matrix A is flattened from the 3d input so that A is an (H_o · W_o) × C_i matrix (Line 1), and B is a C_i × C_o matrix flattened from the filter (Line 2), since the first two dimensions of the filter are of size 1.
Since MM multiplication is a classic problem that has been well studied, TF-Lite simply calls the high performance MM routine in a BLAS library. MM multiplication implementations in BLAS are highly optimized with all the tricks (e.g., SIMD, loop rescheduling) mentioned above. Recently, Google released an experimental matrix multiplication library named Ruy. Ruy achieves good performance on small matrices (e.g., 100×100), but its performance on large matrices is poorer than BLAS. Since Ruy’s code is still immature and in flux, we do not analyze it here but include it in our experiments.
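The flattening described above can be sketched as follows (a naive triple loop stands in for the optimized BLAS routine; all identifiers are our own):

```c
/* PWConv reduced to MM multiplication: A is the input flattened to
 * (Ho*Wo) x Ci, B is the filter flattened to Ci x Co, and the output
 * is their (Ho*Wo) x Co product. A real implementation would call a
 * tuned BLAS SGEMM instead of this naive loop nest. */
void pwconv_as_mm(const float *A, const float *B, float *Cmat,
                  int HoWo, int Ci, int Co) {
    for (int p = 0; p < HoWo; p++)          /* one spatial position per row */
        for (int co = 0; co < Co; co++) {
            float acc = 0.0f;
            for (int ci = 0; ci < Ci; ci++)
                acc += A[p*Ci + ci] * B[ci*Co + co];
            Cmat[p*Co + co] = acc;
        }
}
```

No data movement is needed for the flattening itself when the feature map is already stored with channels innermost; only the index arithmetic changes.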
High Performance DWConv and PWConv
In this section, we present techniques to optimize the implementations of DWConv and PWConv on ARM processors. We explain in detail why the existing “well-optimized” implementations are not efficient on ARM processors and propose our solutions. Key to the discussion are the notion of operational intensity in the roofline model and the notion of arithmetic intensity.
The roofline model is often used to estimate the performance of a given compute kernel running on a given type of processor by exposing the inherent hardware limitations as well as the potential benefit and priority of optimizations (e.g., locality, bandwidth, and different parallelization tricks). The roofline model, however, focuses on cache misses. In other words, it focuses on the traffic between the cache and the memory and assumes that if the program is well optimized with few cache misses, the program can fully utilize the hardware. The key metric in the roofline model is “operational intensity” (OI), which measures the average number of floating-point operations that can be carried out per byte loaded from the memory.
“Arithmetic Intensity” (AI) measures the average number of floating-point operations that can be carried out per byte transferred between the cache and the registers.¹ This is exactly the metric to go after once the memory bottleneck has been removed. Let W be the number of arithmetic operations carried out and Q be the number of bytes transferred between the cache and the registers; the arithmetic intensity is then AI = W/Q.

¹We remark that there is a misconception online (e.g., on Wikipedia) that OI is equivalent to AI. That misconception comes from the fact that cache misses are the major bottleneck on x86 processors, and thus the traffic between the registers and the cache is immaterial once the cache bottleneck is resolved. For ARM processors, however, that is not the case.
Given a particular convolution layer (e.g., DWConv), the number of arithmetic operations is a constant dictated by the problem definition and the algorithm. A larger AI therefore means the implementation is more efficient, because fewer bytes are transferred between the cache and the registers, which implies the implementation is doing a good job of keeping data in the registers for as long as necessary.
Existing implementations of DWConv have poor scalability with respect to the number of cores. Take the TF-Lite implementation as an example (Algorithm 2): it picks a spatial dimension as the outer-most loop to apply thread parallelism (Line 1). In other words, each core is assigned a chunk of the output feature map to compute.
Since the chunk spans all the output channels, each core has to copy the whole filter into its tiny L1 cache. In other words, when the input feature map, the filter, and the output feature map cannot all fit into the L1 cache, the number of L1 cache misses soars. Furthermore, the situation is exacerbated in deeper layers, because filters get larger the deeper they appear in the model.
Although the implementation of DWConv in TF-Lite has good performance from the perspective of OI (and thus in terms of cache misses, when no extra cores are used), its performance is next limited by its poor arithmetic intensity. This is not an issue on x86 processors. However, it is a big issue on ARM processors, because an ARM processor can only load 1 register per cycle while it can process 2 SIMD FMA instructions per cycle. In other words, if the pipeline is not optimized well, the FMA instructions are always waiting for data to be loaded into the registers.
To be specific, we first analyze the AI of the TF-Lite implementation (Algorithm 2). Its inner-most loop processes 4 output elements in parallel using SIMD (Line 10). In order to do so, however, it has to carry out 3 SIMD load instructions (Lines 7–9) to retrieve the filter, input, and output respectively from the cache into registers, and 1 SIMD store instruction to write the updated output elements back to the L1 cache (Line 11). Each iteration thus performs one 4-way SIMD FMA (8 floating-point operations) against four 16-byte SIMD transfers, so the arithmetic intensity of this implementation is 8/(4 × 16) = 1/8. If the width of the filter and the number of channels are small, compilers may keep the elements of the filter in registers across the loop (Line 4). To give TF-Lite such benefit of the doubt, we assume this happens, and its arithmetic intensity then becomes 8/(3 × 16) = 1/6. Nonetheless, it is still a very poor number.
Algorithm 4 is our proposed implementation. To address the core-inscalability problem, we re-schedule the loop order and pick the channel dimension as the outer-most loop to apply thread parallelism (Line 1). This way, each core is assigned a chunk of the output feature map to compute. Under such parallelism, since a chunk spans only a fraction of the output channels, each core needs to retrieve only the corresponding slice of the filter into its L1 cache. Compared with the TF-Lite implementation, which retrieves the whole filter into each core’s L1 cache, this significantly reduces cache misses and improves core scalability.
To improve the arithmetic intensity, we exploit different techniques to increase the reuse of the data in the registers as much as we can. The first technique we apply is register tiling (Lines 2 and 3), which splits the filter into tiles so that a tile can be kept in the registers for as long as possible. The kernel computes the convolution results of a small output block, whose dimensions are chosen to ensure the block stays in the registers throughout the kernel. The kernel is carefully tuned to increase its AI by reducing the traffic between the registers and the cache. Specifically, lines 7–11 of the kernel load the filter into the registers, but this load is performed only in the first iteration of the nested loops in lines 2 and 3 (the condition in Line 7), meaning the filter is loaded once and then stays in the registers. Lines 14–19 of the kernel load a specific output block into the registers; notice that this output block is loaded only once and never gets re-loaded. Similarly, lines 29–34 of the kernel store the updated output block back to the cache. Again, this output block is stored only once, as it is never needed again after the FMAs in lines 20–27.
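A drastically simplified sketch of the register-tiling idea (a 1-D convolution of our own construction, not Algorithm 4 itself) is:

```c
/* Register-tiling sketch on a 1-D convolution with a 3-tap filter.
 * The filter tile is loaded into locals once (the compiler keeps
 * them in registers) and reused across the whole output sweep, and
 * each output element is accumulated in a register and stored
 * exactly once, so it is never re-loaded from the cache. */
void conv1d_tiled(const float *in, int n, const float f[3], float *out) {
    float f0 = f[0], f1 = f[1], f2 = f[2];  /* filter stays in registers */
    for (int i = 0; i + 2 < n; i++)
        out[i] = in[i]*f0 + in[i+1]*f1 + in[i+2]*f2;  /* one store each */
}
```

The same principle, applied per-channel with SIMD registers and a 2-D output block, is what drives the AI improvement of Algorithm 4.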
We now analyze the AI of our implementation, which helps explain why it outperforms the existing implementations. The arithmetic operations are all inlined within the kernel: the FMA operations all lie in lines 18–25, inside 4 nested for loops. The total number of floating-point operations they carry out will be the numerator of the AI.
The denominator of AI captures the number of bytes transferred between cache and registers. For our implementation, it involves:
Loading the filter block once (Lines 7-11) across the two nested loops (Lines 2 and 3), after which it is reused throughout. Thus, the kernel incurs only an amortized amount of register-cache traffic for the filter.
Loading the output block once (Lines 14-19) and storing it once (Lines 29-34) in the kernel; this one load and one store constitute the output block’s entire register-cache traffic.
Loading one SIMD register of input data in the inner-most loop (Lines 20-27), which constitutes the traffic for the input.
Putting it all together, the AI of our implementation is:
Since the filter is small and the block sizes are empirically set to 1 or 2 (they are set with the objective of saving registers, because we also apply loop unrolling to the 4 tightly-nested loops in Lines 18-25), the corresponding term is negligible. Therefore, equation (1) simplifies to an AI that is far larger than that of the TF-Lite implementation.
Pointwise Convolution Implementation
Among the BLAS libraries we evaluated, OpenBLAS has the best performance, and thus we set TF-Lite to use OpenBLAS instead. Nonetheless, it is known that current matrix-multiplication implementations, including OpenBLAS, cannot scale well on multiple cores for deep learning workloads [25, 26, 17].
Algorithm 5 is the implementation of a BLAS MM routine (e.g., SGEMM in OpenBLAS). It applies loop blocking to increase data reuse in the memory hierarchy. Its kernel is the function RTRA (Line 4), which stands for Register Tiling Reuse block A. The logical view of RTRA is depicted in Figure 3 (left). It first SIMD-loads a block of matrix A into the registers (Line 2). The elements of this block stay in the registers across the j’ loop (Line 3 in Algorithm 5) and are reused in every iteration.
Inside the function RTRA (Figure 3), Line 3 streams a block of matrix B and a block of the output matrix into the registers. A small matrix multiplication between the A block and the B block is performed to update the output block (Line 4), which costs a number of FMA operations determined by the block sizes. Finally, the updated output block has to be stored back to the cache.
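The RTRA data movement can be sketched in scalar C as follows (the tile sizes MR and KR and all identifiers are illustrative choices of ours, not OpenBLAS’s):

```c
/* Scalar sketch of the RTRA idea for C = A*B: the A block is loaded
 * once and reused across every j' iteration, while B columns and
 * output elements are streamed through the registers. Each output
 * element is both loaded and stored once per visit, which is what
 * makes RTRA's AI poor. */
enum { MR = 2, KR = 2 };                  /* illustrative tile sizes */
void rtra(const float *A, const float *B, float *C,
          int K, int N, int i0, int k0) {
    float a[MR][KR];                      /* A block, reused across j' */
    for (int i = 0; i < MR; i++)
        for (int k = 0; k < KR; k++)
            a[i][k] = A[(i0+i)*K + (k0+k)];
    for (int j = 0; j < N; j++)           /* the j' loop */
        for (int i = 0; i < MR; i++) {
            float c = C[(i0+i)*N + j];    /* load output element */
            for (int k = 0; k < KR; k++)
                c += a[i][k] * B[(k0+k)*N + j];
            C[(i0+i)*N + j] = c;          /* store output element */
        }
}
```

Note the load and store of the output on every visit; this double traffic is exactly the weakness analyzed next.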
The AI of the BLAS MM implementation is derived as follows. The arithmetic operations are all inlined in the kernel RTRA, whose AI is:
Since AArch64 has 32 128-bit SIMD registers, the block sizes are usually chosen in the BLAS libraries (e.g., OpenBLAS) so as to fully allocate the registers, from which the resulting AI follows. Note that the RTRA kernel has a poor AI because the output block has to be transferred twice between the cache and the registers (one load and one store).
We propose another loop blocking method with better AI (Algorithm 6). It calls another kernel, RTRD (Register Tiling Reuse block D), whose concept is illustrated in Figure 3 (right). RTRD first loads the output block D into the registers. The elements of D stay in the registers across the i’ loop (Line 3; Algorithm 6) and are reused. After that, it streams blocks of A and B into the registers and evaluates a small matrix multiplication to update D (Line 4). Unlike RTRA, RTRD stores the block D to the cache only in the last iteration of the loop. Though this approach is inefficient on x86 processors, it is very efficient on ARM processors, because ARM processors are sensitive to AI. The arithmetic intensity of RTRD for MM multiplication is:
To fully allocate the registers, we set the block sizes accordingly. The resulting AI is about 1.5× larger than that of RTRA, since the matrix dimensions are often much larger than 8. Of course, our actual implementation also includes all the optimization tricks mentioned earlier, such as software prefetching and loop unrolling, but we do not repeat them here.
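For contrast with RTRA, the RTRD data movement can be sketched similarly (again with our own illustrative tile sizes MB and NB, not the actual register allocation):

```c
/* Scalar sketch of the RTRD idea: the output block D stays in
 * registers across the whole k loop and is stored to the cache
 * exactly once at the end, while A and B blocks are streamed
 * through. Halving the output traffic is what raises the AI. */
enum { MB = 2, NB = 2 };                  /* illustrative tile sizes */
void rtrd(const float *A, const float *B, float *C,
          int K, int N, int i0, int j0) {
    float d[MB][NB] = {{0}};              /* output block in registers */
    for (int k = 0; k < K; k++)           /* stream A and B blocks */
        for (int i = 0; i < MB; i++)
            for (int j = 0; j < NB; j++)
                d[i][j] += A[(i0+i)*K + k] * B[k*N + (j0+j)];
    for (int i = 0; i < MB; i++)          /* store output exactly once */
        for (int j = 0; j < NB; j++)
            C[(i0+i)*N + (j0+j)] = d[i][j];
}
```

Relative to the RTRA sketch, the output block is written once instead of being loaded and stored on every visit, at the cost of re-streaming A, a trade-off that pays off on load-limited ARM cores.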
In this section, we present the performance results of our high performance depthwise and pointwise convolutions on mobile devices. We run our experiments on a 2.0GHz quad-core ARM Cortex-A57. Each core has a 48KB L1 instruction cache and a 32KB L1 data cache. All cores share a 2MB unified L2 cache. We compare the performance of our DWConv and PWConv implementations with two versions of TF-Lite, one linked to OpenBLAS and the other linked to Ruy. In addition, we compare with TVM. TVM-generated implementations are supposed to deliver performance as good as that of implementations manually optimized for a specific hardware. The DWConv and PWConv operations in this study are extracted from MobileNetV1, MobileNetV2, and MnasNet. They differ in input size, output size, and filter size.
Figures 4 to 6 show the speedups of our implementations (and TVM) with respect to TF-Lite on the different DWConv and PWConv operations extracted from MobileNetV1, MobileNetV2, and MnasNet-A1, respectively. For example, Figure 4 covers nine different DWConvs found in MobileNetV1. Results show that our DWConv implementation outperforms TF-Lite by at least 2.9× and up to 9.0×. In addition, our DWConv implementation outperforms TVM-generated binaries by at least 1.4× and up to 5.5×, showing that TVM is not able to reach the level of optimization that we achieve.
Our PWConv implementation achieves a speedup of at least 1.3× over TF-Lite(OpenBLAS), which essentially calls the OpenBLAS library for MM multiplication. Our PWConv implementation also achieves up to 2.1× speedup over TF-Lite(Ruy), which uses the aggressively tuned library Ruy to implement PWConv. In addition, our PWConv implementation achieves 1.05× to 2.11× speedup over TVM, which once again shows that TVM is not able to reach the level of optimization that we achieve.
In Figure 7, we compare the scalability of our DWConv and PWConv implementations with respect to the number of cores. We include TF-Lite (which uses OpenBLAS to implement PWConv) for comparison.² For space reasons, we only include the results from MobileNetV1, as the results from MobileNetV2 and MnasNet-A1 are largely similar.

²We do not include TVM here because TVM generates different binaries for different numbers of threads, making the results incomparable.
From Figure 7, we see that our implementations scale better than TF-Lite. We achieve almost perfect speedup when using 2 threads, which is very promising because, by Amdahl’s law, every parallel program has a serial part. When using 4 threads, the core inscalability of TF-Lite immediately manifests itself: TF-Lite achieves only around 2× speedup on DWConv and 1.8× to 2.7× on PWConv. In contrast, our implementations achieve 2.2× to 3.9× speedup on DWConv and 3.2× to 3.9× speedup on PWConv.
Most works on optimizing deep learning operations focus only on conventional convolutions [25, 2, 6, 17] and not the depthwise and pointwise convolutions that appear in mobile models. To the best of our knowledge, this paper is the first to discuss the optimization of depthwise and pointwise convolutions on mobile processors. There are treatments to improve the performance of DWConv, but they focus on training and GPUs, whereas our focus is on inference and ARM. TVM is a compiler stack for generating highly efficient binaries for deep networks. It supports CPU, GPU, ARM, and specialized accelerators. Our experimental results show that binaries optimized by TVM do not yet fully utilize the power of mobile processors. BLAS libraries [19, 15, 8] offer highly efficient implementations of PWConv. However, we show that they are still lacking on mobile devices.
Conclusions and Future Work
In this paper, we show that existing implementations of depthwise convolution and pointwise convolution are not efficient enough on mobile devices. The major reason is that those implementations have not taken into account the fact that ARM processors are getting more cores, nor the latency gap between load and FMA instructions on ARM processors.
To this end, we re-optimize the implementations of DWConv and PWConv specifically for ARM, because ARM processors dominate the mobile device market and there is an increasing demand to carry out inference directly on mobile devices. Experimental results show that our implementations outperform industry-strength implementations from TF-Lite as well as optimized binaries generated by TVM. Using MobileNetV1 as an example, our optimized implementation can carry out inference at 46 GFLOPS, a performance that almost hits the roofline of ARM processors. The encouraging results also reveal an important piece of future work: since TVM is a compiler framework for deep learning models, our results indicate that we can incorporate our techniques (e.g., register tiling) into TVM so as to make it generate highly efficient binaries for mobile models on mobile devices.
This work is supported by Hong Kong General Research Fund (14200817, 15200715, 15204116), Hong Kong AoE/P-404/18, Innovation and Technology Fund ITS/310/18.
- (2018) TVM: an automated end-to-end optimizing compiler for deep learning. In OSDI.
- (2017) MEC: memory-efficient convolution for deep neural network. In ICML.
- (2016) Xception: deep learning with depthwise separable convolutions. CoRR.
- (1990) A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw.
- (1979) Unrolling loops in FORTRAN. Softw., Pract. Exper.
- (2018) Anatomy of high-performance deep learning convolutions on SIMD architectures. In SC.
- (2019) Ruy. https://github.com/tensorflow/tensorflow
- (2010) Eigen v3. http://eigen.tuxfamily.org
- (2016) MCDNN: an approximation-based execution framework for deep stream processing under resource constraints. In MobiSys.
- (2005) Mapping computational concepts to GPUs. In ACM SIGGRAPH Courses.
- (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR.
- (2002) Register tiling in nonrectangular iteration spaces. TOPLAS.
- (2017) DeepMon: mobile GPU-based deep learning framework for continuous vision applications. In MobiSys.
- (1992) Using processor affinity in loop scheduling on shared-memory multiprocessors. In SC.
- (2015) OpenBLAS. http://www.openblas.net
- (2018) Diagonalwise refactorization: an efficient training method for depthwise convolutions. In IJCNN.
- (2017) Optimizing CNNs on multicores for scalability, performance and goodput. In ASPLOS.
- (2018) Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. CoRR.
- (2014) Anatomy of high-performance many-threaded matrix multiplication. In IPDPS.
- (2017) ARM business strategy. https://group.softbank/en/corp/d/annual-reports/2017/future-forward/segars-interview/
- (2018) MnasNet: platform-aware neural architecture search for mobile. CoRR.
- (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML.
- (2009) Roofline: an insightful visual performance model for multicore architectures. Commun. ACM.
- (2000) Loop tiling for parallelism. Kluwer Academic Publishers, Norwell, MA, USA.
- (2018) High performance zero-memory overhead direct convolutions. In ICML.
- (2018) DeepCPU: serving RNN-based deep learning models 10x faster. In ATC.