MobileDets: Searching for Object Detection Architectures for Mobile Accelerators

by   Yunyang Xiong, et al.

Inverted bottleneck layers, which are built upon depthwise convolutions, have been the predominant building blocks in state-of-the-art object detection models on mobile devices. In this work, we question the optimality of this design pattern over a broad range of mobile accelerators by revisiting the usefulness of regular convolutions. We achieve substantial improvements in the latency-accuracy trade-off by incorporating regular convolutions in the search space, and effectively placing them in the network via neural architecture search. We obtain a family of object detection models, MobileDets, that achieve state-of-the-art results across mobile accelerators. On the COCO object detection task, MobileDets outperform MobileNetV3+SSDLite by 1.7 mAP at comparable mobile CPU inference latencies. MobileDets also outperform MobileNetV2+SSDLite by 1.9 mAP on mobile CPUs, 3.7 mAP on EdgeTPUs and 3.4 mAP on DSPs while running equally fast. Moreover, MobileDets are comparable with the state-of-the-art MnasFPN on mobile CPUs even without using the feature pyramid, and achieve better mAP scores on both EdgeTPUs and DSPs with up to 2X speedup.



1 Introduction

In many computer vision applications, higher-capacity networks lead to superior performance [30, 43, 16, 25]. However, they are often more resource-hungry, which makes it challenging to find models with the right quality-compute trade-off for deployment on edge devices with limited inference budgets.

Figure 1: Platform-aware architecture search and the proposed TDB search space work synergistically to boost object detection performance on accelerators. Shown: SSDLite object detection performance on Pixel-4 DSPs with different backbone designs: manually designed MobileNetV2, a backbone searched in the IBN-only space, and a backbone searched in the proposed TDB space (with both IBNs and full-conv-based building blocks). Layers are visualized as vertical bars, where color indicates layer type and length indicates expansion ratio; C4 and C5 mark the feature inputs to the SSDLite head. While platform-aware architecture search in an IBN-only search space achieves a mAP boost over the handcrafted baseline, searching within the proposed TDB space brings a further mAP gain.

A lot of effort has been devoted to the manual design of lightweight neural architectures for edge devices [15, 28, 41]. Unfortunately, relying on human expertise is time-consuming and can be sub-optimal. This problem is made worse by the speed at which new hardware platforms are released. In many cases, these newer platforms have differing performance characteristics which make a previously developed model sub-optimal.

To address the need for automated tuning of neural network architectures, many methods have been proposed. In particular, neural architecture search (NAS) methods [5, 33, 31, 14, 10] have demonstrated a superior ability to find models that are not only accurate but also efficient on a specific hardware platform.

Despite many advancements in NAS algorithms [42, 25, 2, 21, 5, 33, 31, 14, 10], it is remarkable that inverted bottlenecks (IBNs) [28] remain the predominant building block in state-of-the-art mobile models. IBN-only search spaces have also been the go-to setup in the majority of related NAS publications [31, 5, 33, 14]. Thanks to depthwise-separable convolutions [29], IBN layers are very effective at reducing parameter count and FLOPs, and depthwise-separable convolutions are well optimized on mobile CPUs. However, they are less well optimized on several modern mobile accelerators.

For example, EdgeTPU accelerators and Qualcomm DSPs, which are becoming increasingly prevalent among mobile devices, are designed to accelerate regular convolutions [11]. For certain tensor shapes and kernel dimensions, a regular convolution can utilize the hardware up to 3× more efficiently than its depthwise variant on an EdgeTPU, despite a much larger theoretical computation cost (7× more FLOPs). This observation leads us to question the exclusive use of IBNs in most current state-of-the-art mobile architectures.
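As a rough sanity check on why FLOPs can mislead, the multiply-add (MAC) counts of the two convolution types can be compared directly. The helpers below are an illustrative sketch, not from the paper; the shapes used in the usage note are arbitrary.

```python
# Illustrative sketch (not from the paper): theoretical MAC counts of a regular
# convolution vs. its depthwise-separable counterpart. FLOPs alone make the
# separable version look far cheaper, yet on EdgeTPUs/DSPs the regular conv
# can still run faster because it utilizes the hardware better.

def conv_flops(h, w, c_in, c_out, k):
    """Multiply-adds of a regular k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k

def separable_flops(h, w, c_in, c_out, k):
    """k x k depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise
```

For example, on a 28×28 feature map with 96 input and 96 output channels and a 3×3 kernel, the regular convolution costs roughly 8× more MACs than the separable one, yet may still execute faster on an accelerator tuned for dense convolutions.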

We investigate this question on object detection, one of the driving applications for mobile accelerators. Object detection is widely used in applications where tracking objects is essential, such as self-driving cars, video surveillance and face detection. Traditionally, object detectors reuse backbone designs from classification. This simple approach has been shown to be sub-optimal by detection-specific NAS [8, 35, 7]. Motivated by their success, we also conduct architecture search on the object detection task directly.

We propose an enlarged search space family called the Tensor-Decomposition-Based (TDB) search space, which includes both IBNs and full-convolution sequences motivated by the structure of tensor decompositions [34, 6], and which works well across a wide range of mobile accelerators. Our search space augments the IBN-only search space with new building blocks that capture both expansion and compression, and that execute efficiently on mobile accelerators once placed at the right positions in the network.

To effectively allocate the newly proposed building blocks, we perform latency-aware architecture search on the object detection task, targeting a diverse set of mobile platforms including CPUs, EdgeTPUs and DSPs. We first show that detection-specific NAS consistently improves performance across all hardware platforms. We further show that conducting architecture search in the TDB search space delivers state-of-the-art models. By learning to leverage full convolutions at selected positions in the network, our method outperforms IBN-only models by a significant margin, especially on EdgeTPUs and DSPs.

With a simple SSDLite head, our searched models, MobileDets, outperform MobileNetV2 by 1.9 mAP on mobile CPUs, 3.7 mAP on EdgeTPUs and 3.4 mAP on DSPs at comparable inference latencies. MobileDets also outperform the state-of-the-art MobileNetV3 classification backbone by 1.7 mAP at similar CPU inference efficiency. In addition, the searched models achieve comparable performance with the state-of-the-art mobile CPU detector, MnasFPN [7], without leveraging the NAS-FPN head. On both EdgeTPUs and DSPs, MobileDets are more accurate than MnasFPN while running up to 2× faster. Interestingly, our most performant models are observed to use regular convolutions extensively on mobile accelerators like EdgeTPUs and DSPs, especially in the initial part of the network where depthwise convolutions tend to be less efficient. This supports our questioning of the IBN-only search space as a universal solution, and suggests that trade-offs made with respect to FLOPs or general-purpose CPUs are not necessarily optimal for many other modern mobile accelerators.

Our contributions can be summarized as follows:

  • We reveal that the widely used IBN-only search spaces can be sub-optimal for modern mobile accelerators, such as EdgeTPUs and DSPs.

  • We propose a novel search space, TDB, that works across a broad range of mobile accelerators by revisiting the usefulness of regular convolutions.

  • We demonstrate how NAS can be used as a tool to efficiently discover high-performance architectures for new accelerators, by learning to leverage regular convolutions at selected positions in the network.

  • We obtain MobileDets, a family of models that achieve state-of-the-art quality-latency trade-off on object detection across multiple mobile accelerators.

2 Related Work

2.1 Mobile Object Detection

Object detection is a classic computer vision challenge in which the goal is to identify objects of interest in images. Existing object detectors can be divided into two categories: two-stage detectors and one-stage (single-shot) detectors. For two-stage detectors, including Faster R-CNN [27], R-FCN [9] and ThunderNet [24], region proposals must be generated before the detector can make any subsequent predictions; this multi-stage nature makes them less efficient at inference time. One-stage detectors, such as SSD [22], SSDLite [28], YOLO [26], SqueezeDet [39] and Pelee [36], require only a single pass through the network to predict all the bounding boxes, making them ideal candidates for efficient inference on edge devices. We therefore focus on one-stage detectors in this work.

SSDLite [28] is an efficient variant of SSD that has become one of the most popular lightweight detection heads, which is well suited for use cases on mobile devices. Efficient backbones, such as MobileNetV2 [28] and MobileNetV3 [14], are paired with SSDLite to achieve state-of-the-art mobile detection results. Both models will be used as baselines to demonstrate the effectiveness of our proposed search spaces over different mobile accelerators.

2.2 Mobile Neural Architecture Search (NAS)

Designing high-performance on-device neural architectures requires substantial human expertise. Recently, there has been a growing interest in leveraging neural architecture search (NAS) methods to automate the manual design process of edge models [5, 31, 14, 10]. The generic recipe behind these algorithms is to learn to explore a given search space under the guidance of a predefined signal of interest (e.g., accuracies, inference latencies, or a combination of both). In the context of searching for mobile-efficient models, one critical factor of the signal is usually the on-device latency over a specific hardware platform.

NetAdapt [40] and AMC [13] were among the first attempts to utilize latency-aware search to finetune the number of channels of a pre-trained model. MnasNet [32] and MobileNetV3 [14] extended this idea to find resource-efficient architectures within the NAS framework. With a combination of techniques, MobileNetV3 delivered state-of-the-art architectures on mobile CPU.

As a complementary direction, many recent efforts aim to improve the efficiency of NAS itself [3, 2, 23, 21, 5, 38, 4]. Despite the rapid progress in NAS algorithms, existing works focus heavily on IBN-only search spaces optimized for FLOPs or mobile CPUs. Less attention has been paid to designing useful search spaces for modern mobile accelerators, such as DSPs and EdgeTPUs, which are becoming increasingly popular in the real world.

2.3 NAS for Mobile Object Detection

A great majority of the NAS literature [33, 31, 14] focuses on classification and only re-purposes the learned feature extractor as the backbone for object detection without further search. Recently, multiple papers [8, 35, 7] have shown that better latency-accuracy trade-offs are obtained by searching directly for object detection models. Our work follows this methodology.

One strong detection-specific NAS baseline for mobile detection models is MnasFPN [7], which searches for the feature pyramid head with a mobile-friendly search space that heavily utilizes depthwise separable convolutions. While MnasFPN is successful on mobile CPU, several factors limit its generalization towards mobile accelerators: (1) so far both depthwise convolutions and feature pyramids are less optimized on these platforms, and (2) MnasFPN does not search for the backbone, which is a bottleneck for latency.

By comparison, our work relies on SSD heads and proposes a new search space for the backbone based on full-convolutions, which are more amenable to mobile acceleration. While it is challenging to develop a generic search space family that spans a set of diverse and dynamically-evolving mobile platforms, we take a first step towards this goal, starting from the most common platforms such as mobile CPUs, DSPs and EdgeTPUs.

3 Revisiting Full Convs in Mobile Search Spaces

In the following, we first explain why IBN layers may not be sufficient to handle mobile accelerators beyond mobile CPUs. We then propose new building blocks based on regular convolutions to enrich our search space. Finally, we discuss the relationship between the layout of these building blocks and the linear structure of Tucker/CP decomposition [34, 6].

Are IBNs all we need?

IBNs are designed to reduce the number of parameters and FLOPs, leveraging depthwise-separable kernels to achieve high efficiency on mobile CPUs. However, not all FLOPs are equal, especially on modern mobile accelerators such as EdgeTPUs and DSPs. For example, a regular convolution may run faster on EdgeTPUs than its depthwise variant even with more FLOPs. This indicates that the widely used IBN-only search space can be sub-optimal for modern mobile accelerators, which motivates us to propose new building blocks that revisit regular (full) convolutions to enrich IBN-only search spaces. Specifically, we propose two flexible layers to perform channel expansion and compression, respectively, detailed below.

3.1 Fused Inverted Bottleneck Layers (Expansion)

The depthwise-separable convolution [29] is a critical element of an inverted bottleneck [15]. The idea behind the depthwise-separable convolution is to replace an “expensive” full convolution with a combination of a depthwise convolution (for spatial dimension) and a pointwise convolution (for channel dimension). However, the notion of expensiveness was largely defined based on FLOPs or the number of parameters, which are not necessarily correlated with the inference efficiency on modern mobile accelerators.

To incorporate regular convolutions, we propose to modify an IBN layer by fusing together its first convolution (which usually comes with an expansion ratio) and its subsequent depthwise convolution as a single regular convolution. We keep the notion of expansion by allowing this full convolution to expand the channel size. We therefore refer to the layer as a fused inverted bottleneck, or simply Fused convolution layers, in the rest of the paper. The expansion ratio of this layer will be determined by the algorithm.
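As a sketch of what fusion changes, the hypothetical MAC counters below contrast a standard IBN (1×1 expansion, k×k depthwise, 1×1 projection) with its fused variant (one regular k×k convolution that also performs the expansion, followed by a 1×1 projection). The function names and the stride-1 assumption are ours, not the paper's.

```python
# Illustrative sketch (not the paper's implementation): op sequences and MAC
# counts of a standard inverted bottleneck vs. its "fused" variant, where the
# 1x1 expansion conv and the depthwise conv are merged into a single regular
# k x k conv that also performs the channel expansion. Stride 1 is assumed.

def ibn_macs(h, w, c_in, c_out, k, s):
    """IBN: 1x1 expand (c_in -> s*c_in), k x k depthwise, 1x1 project."""
    mid = s * c_in
    expand = h * w * c_in * mid          # 1x1 pointwise expansion
    depthwise = h * w * mid * k * k      # k x k depthwise conv
    project = h * w * mid * c_out        # 1x1 pointwise projection
    return expand + depthwise + project

def fused_macs(h, w, c_in, c_out, k, s):
    """Fused IBN: one regular k x k conv (c_in -> s*c_in), then 1x1 project."""
    mid = s * c_in
    fused = h * w * c_in * mid * k * k   # regular conv: spatial + expansion
    project = h * w * mid * c_out        # 1x1 pointwise projection
    return fused + project
```

The fused layer always costs more MACs than the IBN it replaces; the point of the TDB space is that, on some accelerators, those extra MACs buy lower latency.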

3.2 Generalized Bottleneck Layers (Compression)

Bottlenecks were introduced in ResNet [12] to reduce the cost of large convolutions over high-dimensional feature maps. A bottleneck layer performs compression: the feature maps are first projected to fewer channels and then projected back at the end. Both projections are implemented as 1×1 convolutions.

It is generally useful to allow fine-grained control over the channel sizes in each layer because of their strong influence on inference latency. To define a flexible compression layer, we generalize the traditional bottleneck to have both a searchable input compression ratio (for the first conv) and a searchable output compression ratio (for the last conv), and let the NAS algorithm decide the best configurations. We refer to these new building blocks as Tucker convolution layers for their connection with the Tucker decomposition (next subsection).
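The channel schedule of such a generalized bottleneck can be sketched as follows; the helper name, the rounding choice, and the example ratios are our illustrative assumptions.

```python
# Hypothetical sketch of the channel schedule in a generalized (Tucker)
# bottleneck: a 1x1 conv compresses c_in by the input compression ratio, a
# regular k x k conv maps to c_out scaled by the output compression ratio,
# and a final 1x1 conv restores c_out. Both ratios are searchable.

def tucker_channels(c_in, c_out, in_ratio, out_ratio):
    """Return (input, output) channel pairs for the three convolutions."""
    mid_in = max(1, int(c_in * in_ratio))    # after 1x1 input compression
    mid_out = max(1, int(c_out * out_ratio)) # after the regular k x k conv
    return [(c_in, mid_in), (mid_in, mid_out), (mid_out, c_out)]
```

For instance, with 64 input channels, 128 output channels, and compression ratios 0.25 and 0.75, the three convolutions would map 64→16, 16→96, and 96→128 channels.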

3.3 Connections with Tucker/CP decomposition

All of the layers above can be linked to Tucker/CP decompositions. Fig. 2 shows the graphical structure of an inverted bottleneck with a given input expansion ratio, modulo non-linearities. This structure is equivalent to the sequential structure obtained by approximating a regular convolution with a CP decomposition [19]. The generalized bottleneck layer with input and output compression ratios, denoted as the Tucker layer and shown in Fig. 3, has the same structure (modulo non-linearities) as the Tucker-decomposition approximation of a regular convolution [18]. The fused convolution layer with an input expansion ratio, shown in Fig. 4, can also be considered a variant of the Tucker-decomposition approximation.

Figure 2: Inverted bottleneck layer: a 1×1 pointwise convolution expands the input channels by the expansion ratio, a depthwise convolution operates on the expanded channels, and a final 1×1 pointwise convolution projects to the output channel size. The expansion ratio and kernel size of the IBN layer are searchable.
Figure 3: Tucker layer (generalized bottleneck layer): a 1×1 pointwise convolution compresses the input channels by the input compression ratio, a regular convolution maps them to the output channels scaled by the output compression ratio, and a final 1×1 pointwise convolution restores the output channel size. The compression ratios and kernel size of the Tucker layer are searchable.
Figure 4: Fused convolution layer: a regular convolution expands the input channels by the expansion ratio, and a final 1×1 pointwise convolution projects to the output channel size. The expansion ratio and kernel size of the fused convolution layer are searchable.

Details of the approximation are as follows. The CP decomposition approximates a convolution with a set of sequential linear mappings: a pointwise convolution, separable convolutions in the spatial dimensions via a depthwise convolution, and finally another pointwise convolution. Without performing the spatially separable convolution, this sequential graphical structure is equivalent to the inverted bottleneck in Fig. 2. Similarly, a mode-2 Tucker decomposition of a convolution along the input and output channels involves a sequence of three operations: a pointwise convolution, then a regular convolution, followed by another pointwise convolution. This sequential structure of the Tucker layer is shown in Fig. 3. Combining the first pointwise convolution and the subsequent regular convolution into one regular convolution gives the fused inverted bottleneck layer in Fig. 4.

We therefore refer to the expansion operation as fused convolution layer, the compression operation as Tucker layer, and our proposed search space with a mix of both layers and IBNs as the Tucker Decomposition Based search space.

4 Architecture Search Method

4.1 Search Algorithm

Our proposed search spaces are complementary to any neural architecture search algorithms. In our experiments, we employ TuNAS [1] for its scalability and its reliable improvement over random baselines. We briefly recap TuNAS below.

TuNAS constructs a one-shot model that encompasses all architectural choices in a given search space, together with a controller whose goal is to pick an architecture that optimizes a platform-aware reward function. The one-shot model and the controller are trained together during the search. In each step, the controller samples a random architecture from a multinomial distribution that spans the choices; the portion of the one-shot model's weights associated with the sampled architecture is updated; and finally a reward is computed for the sampled architecture and used to update the controller. The update applies the standard REINFORCE algorithm [37] to the following reward function:

R(a) = m(a) + τ · |c(a)/c0 − 1|,

where m(a) denotes the detection mAP of an architecture a, c(a) is its inference cost (in this case, latency), c0 is the given cost budget, and τ < 0 is a hyper-parameter that balances the importance of accuracy and inference cost. The search quality tends to be insensitive to τ, as shown in [1] and in our experiments.


Since the reward R(a) is computed at every update step, efficiency is key. We estimate m(a) based on a small mini-batch for efficiency, and describe how to estimate c(a) in the next section.
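The TuNAS-style absolute reward R(a) = m(a) + τ·|c(a)/c0 − 1| can be sketched in a few lines; the default τ value below is an illustrative assumption, not a value reported by the paper.

```python
# Sketch of the platform-aware reward used during search:
# R(a) = m(a) + tau * |c(a)/c0 - 1|, with tau < 0, so deviating from the
# latency budget c0 in either direction is penalized. m(a) is the detection
# mAP of the sampled architecture and c(a) its estimated latency.

def reward(map_score, latency, budget, tau=-0.3):
    """Platform-aware reward; tau < 0 (default is illustrative only)."""
    return map_score + tau * abs(latency / budget - 1.0)
```

An architecture that hits the budget exactly keeps its raw mAP as reward; one that is 10% over (or under) budget is docked 0.1·|τ| mAP, which pushes the controller toward architectures near the budget.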

4.2 Cost Models

In an ideal world, we would benchmark each candidate architecture proposed during the search on our target hardware. In practice, however, it is difficult infrastructure-wise to directly synchronize between the mobile phones used for benchmarking and the server-class ML hardware used to train the shared model weights, and our search space is far too large to benchmark every possible model ahead of time.

To get around these challenges, we train a cost model: a linear regression model whose features are composed, for each layer, of an indicator of the cross-product between the input/output channel sizes and the layer type. This model has high fidelity across platforms and outperforms additive models [40, 13, 31], especially on DSPs, because we do not assume the cost to be additive at the level of individual operations.
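One way to realize such a featurization is sketched below; the function names, the feature-vocabulary layout, and the toy network are our assumptions, since the paper only describes the features at a high level.

```python
# Illustrative sketch: featurize a network as counts of indicator features
# over the cross-product (layer_type, c_in, c_out). A linear regression over
# these features assigns each distinct per-layer configuration its own learned
# latency contribution, rather than assuming additivity at the op level.

from collections import Counter

def featurize(layers, vocab):
    """layers: list of (layer_type, c_in, c_out); vocab: feature -> index."""
    counts = Counter(layers)
    vec = [0.0] * len(vocab)
    for key, n in counts.items():
        vec[vocab[key]] = float(n)
    return vec

# Tiny usage example with a hypothetical three-layer network:
vocab = {("ibn", 32, 64): 0, ("fused", 64, 64): 1, ("tucker", 64, 96): 2}
net = [("ibn", 32, 64), ("fused", 64, 64), ("fused", 64, 64)]
print(featurize(net, vocab))  # -> [1.0, 2.0, 0.0]
```

Fitting the regression weights against a few thousand benchmarked architectures (as described below) then gives a fast latency surrogate for use inside the reward.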

During search, we use the regression model as a surrogate for on-device latency. To collect training data for the cost model, we randomly sample several thousand network architectures from our search space and benchmark each one on device. This is done only once per hardware and prior to search, eliminating the need for direct communication between server-class ML hardware and mobile phones. For final evaluation, the found architectures are benchmarked on the actual hardware instead of the cost model.

5 Experiments

5.1 Experimental Setup

5.1.1 Standalone Training

We use a 320×320 image size for both training and evaluation. Training is carried out over 32 synchronized replicas on a 4x4 TPU-v2 pod. For fair comparison with existing models, we use the standard preprocessing in the TensorFlow Object Detection API without additional enhancements such as DropBlock or AutoAugment. We use SGD with momentum 0.9 and weight decay. The learning rate is warmed up over the first 2000 steps and then follows cosine decay. All models are trained from scratch without any ImageNet pre-trained checkpoint. We consider two different training schedules:

  • Short-schedule: Each model is trained for 50K steps with a batch size of 1024 and an initial learning rate of 4.0.

  • Long-schedule: Each model is trained for 400K steps with a batch size of 512 and an initial learning rate of 0.8.

The short schedule processes about 4× fewer examples than the long schedule (50K × 1024 vs. 400K × 512) and is correspondingly faster, at the cost of slightly inferior quality. Unless otherwise specified, we use the short schedule for ablation studies and the long schedule for the final results in Table 1.
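The warmup-then-cosine schedule described above can be sketched as follows; the exact functional form (linear warmup, decay to zero) is an assumption consistent with common practice, not spelled out in the text.

```python
# Sketch of the learning-rate schedule described above (assumed form):
# linear warmup over the first 2000 steps, then cosine decay to zero over
# the remaining steps. peak_lr would be 4.0 (short) or 0.8 (long schedule).

import math

def learning_rate(step, peak_lr, total_steps, warmup_steps=2000):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay
```

Under this form, the rate ramps from 0 to the peak over the warmup steps, then decays smoothly to 0 at the final step.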

5.1.2 Architecture Search

To avoid overfitting the true validation dataset, we split out 10% of the COCO training data to evaluate models and compute rewards during the search. Hyperparameters for training the shared weights follow those used in standalone training. For reinforcement learning, we use the Adam optimizer. We search for 50K steps to obtain the architectures in the ablation studies, and for 100K steps to obtain the best candidates in the main results table.

5.2 Latency Benchmarking

We benchmark using TF-Lite, which relies on NNAPI to delegate computations to accelerators. All benchmarks use a single thread and a batch size of 1. On the Pixel-1 CPU, we use only a single large core. For the Pixel-4 EdgeTPU and DSP, the models are fake-quantized [17] as required.

5.3 Search Space Definitions

The overall layout of our search space resembles that of ProxylessNAS and TuNAS. We consider three variants with increasing sizes:

  • IBN: The smallest search space, containing IBN layers only. Each layer may choose from a set of searchable kernel sizes and expansion factors.

  • IBN+Fused: An enlarged search space that contains all the IBN variants above plus Fused convolution layers, with searchable kernel sizes and expansion factors.

  • IBN+Fused+Tucker: A further enlarged search space that additionally contains Tucker (compression) layers. Each Tucker layer allows searchable input and output compression ratios.

For all search space variants above, we also search for the output channel size of each layer, chosen from a set of multipliers applied to a base channel size (rounded to multiples of 8 to be more hardware-friendly). Layers in the same block share the same base channel size, though they can end up with different channel multipliers. The base channel sizes for all the blocks (from stem to head) are 32-16-32-48-96-96-160-192-192. The multipliers and base channel sizes are designed to approximately subsume several representative architectures in the literature, such as MobileNetV2 and MnasNet.
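The channel-size selection above can be sketched in one helper; the rounding convention (round to nearest, with a floor of 8) is our assumption, since the text only says channel sizes are rounded to multiples of 8.

```python
# Sketch of the searchable channel-size computation: apply a searched
# multiplier to the block's base channel size and round to a multiple of 8
# for hardware friendliness. The nearest-multiple convention and the floor
# of 8 are assumptions; the paper only states "rounded to multiples of 8".

def searched_channels(base, multiplier):
    return max(8, int(round(base * multiplier / 8.0)) * 8)
```

For example, a block with base size 96 and multiplier 1.25 yields 120 output channels, while base 160 with multiplier 0.5 yields 80.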

5.3.1 Hardware-Specific Adaptations

The aforementioned search spaces are slightly adapted depending on the target hardware. Specifically:

  • All building blocks are augmented with Squeeze-and-Excitation blocks and h-Swish activation functions (replacing ReLU-6) when targeting CPUs. This is necessary for a fair comparison with the MobileNetV3+SSDLite baseline. Neither primitive is well supported on EdgeTPUs or DSPs.

  • We exclude 5×5 convolutions from the search space when targeting DSPs, as they are known to be highly inefficient on this platform due to hardware constraints.

5.4 Search Space Ablation

For each hardware platform (CPU, EdgeTPU and DSP), we conduct architecture search over different search space variants and evaluate the discovered models by training them from scratch. The goal is to verify the usefulness of each search space when paired with a (potentially imperfect) NAS algorithm.

With a perfect architecture search algorithm, the largest search space is guaranteed to outperform the smaller ones because it subsumes the solutions of the latter. This is not necessarily the case in practice, however, as the algorithm may end up with sub-optimal solutions especially when the search space is large [7]. In the following, a search space is considered useful if it enables NAS methods to identify sufficiently good architectures even if they are not optimal.

5.4.1 Cpu

Figure 5 shows the architecture search results when targeting Pixel-1 CPUs. As expected, MobileNetV3+SSDLite is a strong baseline, as the efficiency of its backbone has been heavily optimized for the same hardware platform on the ImageNet classification task. We also note that the presence of regular convolutions does not offer a clear advantage in this particular case, as IBN-only layers are already strong under FLOPs/CPU latency. Nevertheless, conducting domain-specific architecture search with respect to the object detection task offers non-trivial gains on COCO (+1 mAP in the 150-200 ms range).

Figure 5: NAS results on Pixel-1 CPU using different search space variants.
Figure 6: NAS results on Pixel-4 EdgeTPU using different search space variants.
Figure 7: NAS results on Pixel-4 DSP using different search space variants.

5.4.2 EdgeTPU

Figure 6 shows the architecture search results when targeting Pixel-4 EdgeTPUs. Conducting hardware-aware architecture search with any of the three search spaces significantly improves the overall quality. This is largely because the baseline architecture, MobileNetV2 (MobileNetV3 is not well supported on EdgeTPUs due to h-Swish and Squeeze-and-Excite blocks), is heavily optimized for CPU latency, which is strongly correlated with FLOPs/MAdds but not well calibrated with EdgeTPU latency. Notably, while IBN-only still offers the best accuracy-MAdds trade-off (middle plot), having regular convolutions in the search space (either IBN+Fused or IBN+Fused+Tucker) offers clear further advantages in the accuracy-latency trade-off. These results demonstrate the usefulness of full convolutions on EdgeTPUs.

5.4.3 Dsp

Figure 7 shows the search results when targeting Pixel-4 DSPs. As with EdgeTPUs, detection-specific search significantly outperforms the baseline, and the inclusion of regular convolutions leads to a substantial mAP improvement at comparable inference latency.

Model/Search Space | Target hardware | mAP (%): valid / test | Latency (ms): CPU / EdgeTPU / DSP | MAdds (B) | Params (M)
MobileNetV2 22.1 162 8.4 11.3 0.80 4.3
MobileNetV2 (ours) 22.2 21.8 129 6.5 9.2 0.62 3.17
MobileNetV2 1.5 (ours) 25.7 25.3 225 9.0 12.1 1.37 6.35
MobileNetV3 22.0 119 0.51 3.22
MobileNetV3 (ours) 22.2 21.8 108 0.46 4.03
MobileNetV3 1.2 (ours) 23.6 23.1 138 0.65 5.58
MnasFPN (ours) 25.6 26.1 185 18.5 25.1 0.92 2.50
MnasFPN (ours) 0.7 24.3 23.8 120 16.4 23.4 0.53 1.29
IBN+Fused+Tucker CPU 24.2 23.7 122 0.51 3.85
IBN+Fused 23.0 22.7 107 0.39 3.57
IBN 24.1 23.4 113 0.45 4.21
IBN+Fused+Tucker  EdgeTPU 25.7 25.5 248 6.9 10.8 1.53 4.20
IBN+Fused 26.0 25.4 272 6.8 9.9 1.76 4.79
IBN 25.1 24.7 185 7.4 10.4 0.97 4.17
IBN+Fused+Tucker DSP 28.9 28.5 420 8.6 12.3 2.82 7.16
IBN+Fused 29.1 28.5 469 8.6 11.9 3.22 9.15
IBN 27.3 26.9 259 8.7 12.2 1.43 4.85
  • Models/search spaces targeting CPU are augmented with Squeeze-Excite and h-Swish (CPU-only).

  • Missing latency entries indicate the model is not well supported by that hardware platform.

  • Models/search spaces targeting DSP use 3×3 kernel sizes only (DSP-friendly).

  • MnasFPN models are augmented with a NAS-FPN head.

  • For some models, endpoint C4 is located after (rather than before) the 1×1 expansion in the IBN.

Table 1: Main results. Test AP scores are based on COCO test-dev.

5.5 Main Results

We compare our architectures obtained via latency-aware NAS against state-of-the-art mobile detection models on COCO [20]. For each target hardware platform (CPU, EdgeTPU and DSP), we report results obtained by searching over each of the three search space variants. Results are presented in Table 1.

Target: Pixel-1 CPU (23.7 mAP @ 122 ms) Target: Pixel-4 EdgeTPU (25.5 mAP @ 6.8 ms) Target: Pixel-4 DSP (28.5 mAP @ 12.3 ms)
Figure 8: Best architectures searched in the IBN+Fused+Tucker space wrt different mobile accelerators. Endpoints C4 and C5 are consumed by the SSD head.

On mobile CPUs, MobileDet outperforms MobileNetV3+SSDLite, a strong baseline based on the state-of-the-art image classification backbone, by 1.7 mAP at comparable latency. The result demonstrates the effectiveness of detection-specific NAS. The models also achieved competitive results with MnasFPN, the state-of-the-art detector for mobile CPUs, without leveraging the NAS-FPN head. It is also interesting to note that the incorporation of full convolutions is quality-neutral over mobile CPUs, indicating that IBNs are indeed promising building blocks for this particular hardware platform.

On EdgeTPUs, MobileDet outperforms MobileNetV2+SSDLite by 3.7 mAP on COCO test-dev at comparable latency. We attribute the gains to both task-specific search (with respect to COCO and the EdgeTPU) and the presence of full convolutions. Specifically, IBN+Fused+Tucker yields a 0.8 mAP improvement together with a latency reduction compared to the IBN-only search space.

On DSPs, MobileDet achieves 28.5 mAP on COCO with 12.3 ms latency, outperforming MobileNetV2+SSDLite (×1.5) by 3.2 mAP at comparable latencies. The same model also outperforms MnasFPN by 2.4 mAP with more than a 2× speed-up. Again it is worth noting that including full convolutions in the search space clearly improved the architecture search results, from 26.9 mAP @ 12.2 ms to 28.5 mAP @ 11.9 ms. Fig. 8 illustrates the object detection architectures, MobileDets, searched for different mobile hardware platforms using our largest TDB search space. One interesting observation is that MobileDets use regular convolutions extensively on the EdgeTPU and DSP, especially in the early stages of the network where depthwise convolutions tend to be less efficient. These results demonstrate that the IBN-only search space is not optimal for these accelerators.

Figure 9: Transferability of architectures (searched wrt different target platforms) across hardware platforms. For each given architecture, we report both the original model and its scaled version with channel multiplier 1.5.

Finally, we investigate the transferability of the architectures across hardware platforms. Fig. 9 compares MobileDets (obtained by targeting different accelerators) across hardware platforms. Our results indicate that architectures searched on EdgeTPUs and DSPs are mutually transferable; in fact, both searched architectures leverage regular convolutions extensively. On the other hand, architectures specialized for EdgeTPUs or DSPs (which tend to be FLOPs-intensive) do not transfer well to mobile CPUs.

6 Conclusion

In this work, we question the predominant design pattern of using depthwise inverted bottlenecks as the only building block for computer vision models on the edge. Using the object detection task as a case study, we revisit the usefulness of regular convolutions over a variety of mobile accelerators, including mobile CPUs, EdgeTPUs and DSPs. Our results reveal that full convolutions can substantially improve the accuracy-latency trade-off on several accelerators when placed at the right positions in the network, and that these positions can be efficiently identified via neural architecture search. The resulting architectures, MobileDets, achieve superior detection results over a variety of hardware platforms, outperforming the prior art by a large margin.