
MCUNet: Tiny Deep Learning on IoT Devices

by   Ji Lin, et al.

Machine learning on tiny IoT devices based on microcontroller units (MCUs) is appealing but challenging: the memory of microcontrollers is 2-3 orders of magnitude smaller even than that of mobile phones. We propose MCUNet, a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on microcontrollers. TinyNAS adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints, then specializes the network architecture in the optimized search space. TinyNAS can automatically handle diverse constraints (i.e., device, latency, energy, memory) at low search cost. TinyNAS is co-designed with TinyEngine, a memory-efficient inference library, to expand the design space and fit a larger model. TinyEngine adapts the memory scheduling to the overall network topology rather than optimizing layer by layer, reducing memory usage by 2.7x and accelerating inference by 1.7-3.3x compared to TF-Lite Micro and CMSIS-NN. MCUNet is the first to achieve >70% ImageNet top-1 accuracy on an off-the-shelf commercial microcontroller, using 3.6x less SRAM and 6.6x less Flash compared to quantized MobileNetV2 and ResNet-18. On visual and audio wake words tasks, MCUNet achieves state-of-the-art accuracy and runs 2.4-3.4x faster than MobileNetV2 and ProxylessNAS-based solutions with 2.2-2.6x smaller peak SRAM. Our study suggests that the era of always-on tiny machine learning on IoT devices has arrived.



1 Introduction

The number of IoT devices based on always-on microcontrollers is increasing at an unprecedented rate, reaching 250B [44], enabling numerous applications including smart manufacturing, personalized healthcare, precision agriculture, automated retail, etc. These low-cost, low-energy microcontrollers give rise to a brand new opportunity of tiny machine learning (TinyML). By running deep learning models on these tiny devices, we can perform data analytics directly near the sensor, and thus dramatically expand the scope of AI applications. However, microcontrollers have a very limited resource budget, especially memory (SRAM) and storage (Flash). The on-chip memory is 3 orders of magnitude smaller than that of mobile devices, and 5-6 orders of magnitude smaller than that of cloud GPUs, making deep learning deployment extremely difficult. As shown in Table 1, a state-of-the-art ARM Cortex-M7 MCU has only 320kB SRAM and 1MB Flash storage, which makes it impossible to run off-the-shelf deep learning models: ResNet-50 He et al. (2016) exceeds the storage limit, and MobileNetV2 Sandler et al. (2018) exceeds the peak memory limit. Even the int8 quantized version of MobileNetV2 still exceeds the memory limit (not including the runtime buffer overhead, e.g., the Im2Col buffer; the actual memory consumption is larger), showing a big gap between the desired and the available hardware capacity.

Table 1: Left: Microcontrollers have 3 orders of magnitude less memory and storage than mobile phones, and 5-6 orders of magnitude less than cloud GPUs. The extremely limited memory makes deep learning deployment difficult. Right: The peak memory and storage usage of widely used deep learning models. ResNet-50 and MobileNetV2 both exceed the resource limits of microcontrollers. Even the int8 quantized MobileNetV2 requires more memory than is available and cannot fit on a microcontroller.

Unlike cloud and mobile devices, microcontrollers are bare-metal devices that do not have an operating system. Therefore, we need to jointly design the deep learning model and the inference library to efficiently manage the tiny resources and fit the tight memory and storage budget. Existing efficient network design Howard et al. (2017); Sandler et al. (2018); Zhang et al. (2018) and neural architecture search methods Tan et al. (2019); Cai et al. (2019); Wu et al. (2019); Cai et al. (2020) focus on GPUs or smartphones, where both memory and storage are abundant. Therefore, they only optimize to reduce FLOPs or latency, and the resulting models cannot fit on microcontrollers. There is limited literature Fedorov et al. (2019); Liberis and Lane (2019); Rusci et al. (2020); Lawrence and Zhang (2019) that studies machine learning on microcontrollers. However, due to the lack of system-algorithm co-design, these works either study tiny-scale datasets (e.g., CIFAR or sub-CIFAR level), which are far from real-life use cases, or use weak neural networks that cannot achieve decent performance.

In this paper, we propose MCUNet, a system-model co-design framework that enables ImageNet-scale deep learning on off-the-shelf microcontrollers. To handle the scarce on-chip memory on microcontrollers, we jointly optimize the deep learning model design (TinyNAS) and the inference library (TinyEngine) to reduce the memory usage.

TinyNAS is a two-stage neural architecture search (NAS) method that can handle the tiny and diverse memory constraints on various microcontrollers. The performance of NAS highly depends on the search space Radosavovic et al. (2020), yet there is little literature on search space design heuristics at the tiny scale. TinyNAS addresses the problem by first optimizing the search space automatically to fit the tiny resource constraints, then performing neural architecture search in the optimized space. Specifically, TinyNAS generates different search spaces by scaling the input resolution and the model width, then collects the FLOPs distribution of the satisfying networks within each search space to evaluate its priority. TinyNAS relies on the insight that a search space that can accommodate higher FLOPs under the memory constraint can produce a better model. Experiments show that the optimized space leads to better accuracy of the NAS-searched model.

To handle the extremely tight resource constraints on microcontrollers, we also need a memory-efficient inference library that eliminates unnecessary memory overhead, so that we can expand the search space to fit a larger model capacity with higher accuracy. TinyNAS is co-designed with TinyEngine to lift the ceiling for hosting deep learning models. TinyEngine improves over existing inference libraries with a code generator-based compilation method that eliminates memory overhead, which reduces memory usage and also improves the inference speed by 22%. It also supports model-adaptive memory scheduling: instead of layer-wise optimization, TinyEngine optimizes the memory scheduling according to the overall network topology to get a better strategy. Finally, it performs specialized computation kernel optimization (e.g., loop tiling, loop unrolling, op fusion, etc.) for different layers, which further accelerates the inference.

MCUNet dramatically pushes the limit of deep network performance on microcontrollers. TinyEngine reduces the peak memory usage by 2.7x and accelerates the inference by 1.7-3.3x compared to TF-Lite and CMSIS-NN, allowing us to run a larger model. With system-algorithm co-design, MCUNet (TinyNAS+TinyEngine) achieves a record ImageNet top-1 accuracy of 70.2% on an off-the-shelf commercial microcontroller. On visual and audio wake words tasks, MCUNet achieves state-of-the-art accuracy and runs 2.4-3.4x faster than existing solutions at 2.2-2.6x smaller peak SRAM. For interactive applications, our solution achieves 10 FPS with 91% top-1 accuracy on the Speech Commands dataset. Our study suggests that the era of tiny machine learning on IoT devices has arrived.

2 Background

Microcontrollers have tight memory: for example, only 320kB SRAM and 1MB Flash for a popular ARM Cortex-M7 MCU STM32F746. Therefore, we have to carefully design the inference library and the deep learning models to fit the tight memory constraints. In deep learning scenarios, SRAM (read&write) constrains the activation size; Flash (read-only) constrains the model size.
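The two constraints above can be illustrated with a minimal sketch. It assumes a purely sequential network in which a layer needs only its input and output activation buffers alive at the same time, and it ignores runtime buffers and code size; the `Layer` type and the helper names are hypothetical and not part of MCUNet. The default limits are the STM32F746 figures quoted in the text.

```python
from dataclasses import dataclass

KB = 1024


@dataclass
class Layer:
    in_bytes: int      # input activation size in bytes
    out_bytes: int     # output activation size in bytes
    weight_bytes: int  # parameter size in bytes


def peak_sram(layers):
    # Simplifying assumption: each layer keeps only its own input and
    # output buffers alive, so the peak is the largest such pair.
    return max(layer.in_bytes + layer.out_bytes for layer in layers)


def flash_usage(layers):
    # Weights are read-only, so they live in Flash and simply add up.
    return sum(layer.weight_bytes for layer in layers)


def fits(layers, sram_limit=320 * KB, flash_limit=1024 * KB):
    # SRAM bounds the activations, Flash bounds the model size.
    return peak_sram(layers) <= sram_limit and flash_usage(layers) <= flash_limit
```

The sketch makes the asymmetry explicit: SRAM usage is a maximum over layers, while Flash usage is a sum, so a single layer with large activations can rule out an otherwise small model.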

Deep Learning Inference on Microcontrollers.

Deep learning inference on microcontrollers is a fast-growing area. Existing frameworks such as TensorFlow Lite Micro Abadi et al. (2016), CMSIS-NN Lai et al. (2018), CMix-NN Capotondi et al. (2020), and MicroTVM Chen et al. (2018a) have several limitations: 1. Most frameworks rely on an interpreter to interpret the network graph at runtime, which consumes a significant share of SRAM and Flash and increases latency by 22%. 2. The optimization is performed at the layer level, which fails to utilize the overall network architecture information to further reduce memory usage.

Efficient Neural Network Design.

Network efficiency is very important for the overall performance of a deep learning system. One way to improve it is to compress off-the-shelf networks by pruning Han et al. (2015); He et al. (2017); Lin et al. (2017); Liu et al. (2017); He et al. (2018); Liu et al. (2019b) and quantization Han et al. (2016); Zhu et al. (2016); Rastegari et al. (2016); Zhou et al. (2016); Courbariaux and Bengio (2016); Choi et al. (2018); Wang et al. (2019) to remove redundancy and reduce complexity. Tensor decomposition Lebedev et al. (2014); Gong et al. (2014); Kim et al. (2015) also serves as an effective compression method. Another way is to directly design an efficient and mobile-friendly network Howard et al. (2017); Sandler et al. (2018); Ma et al. (2018); Zhang et al. (2018). Recently, neural architecture search (NAS) Zoph and Le (2017); Zoph et al. (2018); Liu et al. (2019a); Cai et al. (2019); Tan et al. (2019); Wu et al. (2019) has come to dominate efficient network design. The performance of NAS highly depends on the quality of the search space Radosavovic et al. (2020). Traditionally, people follow manual design heuristics for NAS search space design. For example, the widely used mobile-setting search space Tan et al. (2019); Cai et al. (2019); Wu et al. (2019) originates from MobileNetV2 Sandler et al. (2018): both use 224 input resolution and a similar base channel number configuration, while searching for kernel sizes, block depths, and expansion ratios. However, there is a lack of standard model designs for microcontrollers with limited memory, and likewise of search space designs. One possible way is to manually tweak the search space for each microcontroller. But manual tuning through trial and error is labor-intensive, making it prohibitive for a large number of deployment constraints (e.g., STM32F746 has 320kB SRAM/1MB Flash, STM32H743 has 512kB SRAM/2MB Flash, latency requirement 5FPS/10FPS). Therefore, we need a way to automatically optimize the search space for tiny and diverse deployment scenarios.

3 MCUNet: System-Algorithm Co-Design

Figure 1: MCUNet jointly designs the neural architecture and the inference scheduling to fit the tight memory resource on microcontrollers. TinyEngine makes full use of the limited resources on MCU, allowing a larger design space for architecture search. With a larger degree of design freedom, TinyNAS is more likely to find a high accuracy model compared to using existing frameworks.

We propose MCUNet, a system-algorithm co-design framework that jointly optimizes the NN architecture (TinyNAS) and the inference scheduling (TinyEngine) in the same loop (Figure 1). Compared to traditional methods that either (a) optimize the neural network using neural architecture search based on a given deep learning library (e.g., TensorFlow, PyTorch) Tan et al. (2019); Cai et al. (2019); Wu et al. (2019), or (b) tune the library to maximize the inference speed for a given network Chen et al. (2018a, b), MCUNet can better utilize the resources by system-algorithm co-design.

3.1 TinyNAS: Two-Stage NAS for Tiny Memory Constraints

TinyNAS is a two-stage neural architecture search method that first optimizes the search space to fit the tiny and diverse resource constraints, and then performs neural architecture search within the optimized space. With an optimized space, it significantly improves the accuracy of the final model.

Automated search space optimization.

We propose to optimize the search space automatically at low cost by analyzing the computation distribution of the satisfying models. To fit the tiny and diverse resource constraints of different microcontrollers, we scale the input resolution and the width multiplier of the mobile search space Tan et al. (2019). We choose from a range of input resolutions and a range of width multipliers to cover a wide spectrum of resource constraints. Each (width multiplier, resolution) pair defines one search space configuration, and each configuration in turn contains its own set of possible sub-networks. Our goal is to find the best search space configuration, i.e., the one that contains the model with the highest accuracy while satisfying the resource constraints. Finding it is non-trivial. One way is to perform neural architecture search on each of the search spaces and compare the final results, but the computation would be astronomical. Instead, we evaluate the quality of a search space by randomly sampling networks from it and comparing the distribution of satisfying networks. Instead of collecting the Cumulative Distribution Function (CDF) of each satisfying network's accuracy Radosavovic et al. (2019), which is computationally heavy because every sampled network must be trained, we only collect the CDF of FLOPs (see Figure 2). The intuition is that, within the same model family, the accuracy is usually positively related to the computation Canziani et al. (2016); He et al. (2018). A model with larger computation has a larger capacity and is more likely to achieve higher accuracy. We further verify this assumption in Section 4.4.

Figure 2: TinyNAS selects the best search space by analyzing the FLOPs CDF of different search spaces. Each curve represents a design space. Our insight is that the design space that is more likely to produce high FLOPs models under the memory constraint gives higher model capacity, thus more likely to achieve high accuracy. For the solid red space, the top 20% of the models have 50.3M FLOPs, while for the solid black space, the top 20% of the models only have 32.3M FLOPs. Using the solid red space for neural architecture search achieves 78.7% final accuracy, which is 4.5% higher compared to using the black space. The legend is in format: w{width}-r{resolution}|{mean FLOPs}.

As an example, we study the best search space for ImageNet-100 (a 100 class classification task taken from the original ImageNet) on STM32F746. We show the FLOPs distribution CDF of the top-10 search space configurations in Figure 2. We sample networks from each space and use TinyEngine to optimize the memory scheduling for each model. We only keep the models that satisfy the memory requirement at the best scheduling. To get a quantitative evaluation of each space, we calculate the average FLOPs for each configuration and choose the search space with the largest average FLOPs. For example, according to the experimental results on ImageNet-100, using the solid red space (average FLOPs 52.0M) achieves 2.3% better accuracy compared to using the solid green space (average FLOPs 46.9M), showing the effectiveness of automated search space optimization. We will elaborate more on the ablations in Section 4.4.
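The selection rule described above (sample networks, keep those that satisfy the memory limit, rank spaces by mean FLOPs) can be sketched as follows. This is only an illustration: `sample_network` is a hypothetical toy proxy with arbitrary constants standing in for building a real sub-network and measuring its FLOPs and peak memory with the inference library, and the sample count of 200 is likewise an arbitrary choice.

```python
import random


def sample_network(width, resolution, rng):
    """Return (flops, peak_mem_bytes) of one random sub-network (toy proxy)."""
    # `arch` stands in for the random kernel/depth/expansion choices.
    arch = rng.uniform(0.6, 1.4)
    scale = (resolution / 224) ** 2
    flops = 300e6 * width ** 2 * scale * arch   # arbitrary constant
    peak_mem = 1.5e6 * width * scale * arch     # arbitrary constant
    return flops, peak_mem


def mean_satisfying_flops(width, resolution, sram_limit, n_samples=200, seed=0):
    """Mean FLOPs over the sampled networks that fit in `sram_limit`."""
    rng = random.Random(seed)
    samples = (sample_network(width, resolution, rng) for _ in range(n_samples))
    kept = [flops for flops, mem in samples if mem <= sram_limit]
    return sum(kept) / len(kept) if kept else 0.0


def best_search_space(widths, resolutions, sram_limit):
    """Pick the (width, resolution) pair with the largest mean FLOPs."""
    spaces = [(w, r) for w in widths for r in resolutions]
    return max(spaces, key=lambda s: mean_satisfying_flops(s[0], s[1], sram_limit))
```

A call such as `best_search_space([0.3, 0.5, 0.7], [96, 128, 160], 320 * 1024)` returns one configuration from the grid. No training is involved at this stage, which is what keeps the search space optimization cheap.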

Resource-constrained model specialization.

To specialize the network architecture for various microcontrollers, we need to keep the neural architecture search cost low. After search space optimization for each memory constraint, we perform one-shot neural architecture search Bender et al. (2018); Guo et al. (2019) to efficiently find a good model, reducing the search cost by 200x Cai et al. (2019). We train one super network that contains all the possible sub-networks through weight sharing and use it to estimate the performance of each sub-network. We then perform evolution search to find the best model within the search space that meets the on-board resource constraints while achieving the highest accuracy. For each sampled network, we use TinyEngine to optimize the memory scheduling to measure the optimal memory usage. With this kind of co-design, we can efficiently fit the tiny memory budget. The details of super network training and evolution search can be found in the supplementary.
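A constrained evolution search of the kind described above can be sketched in a few lines. The sketch is generic and does not reproduce the paper's settings (those are in its supplementary): the architecture encoding, `memory_cost`, and `proxy_accuracy` are hypothetical toy stand-ins for the memory reported by the inference library and the accuracy estimated with super-network weights.

```python
import random

KERNELS, EXPANDS, N_BLOCKS = (3, 5, 7), (3, 4, 6), 6


def random_arch(rng):
    # One (kernel size, expansion ratio) pair per block.
    return [(rng.choice(KERNELS), rng.choice(EXPANDS)) for _ in range(N_BLOCKS)]


def memory_cost(arch):
    # Toy proxy for the peak memory measured by the inference library.
    return max(k * k * e for k, e in arch)


def proxy_accuracy(arch):
    # Toy proxy for the accuracy estimated with shared super-network weights.
    return sum(k + 2 * e for k, e in arch)


def mutate(arch, rng, prob=0.3):
    # Resample each block independently with probability `prob`.
    return [(rng.choice(KERNELS), rng.choice(EXPANDS)) if rng.random() < prob else gene
            for gene in arch]


def sample_valid(make, limit, max_tries=1000):
    # Rejection sampling: only candidates within the memory limit survive.
    for _ in range(max_tries):
        arch = make()
        if memory_cost(arch) <= limit:
            return arch
    raise RuntimeError("no architecture satisfies the memory limit")


def evolution_search(limit, population=20, parents=5, generations=10, seed=0):
    rng = random.Random(seed)
    pop = [sample_valid(lambda: random_arch(rng), limit) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=proxy_accuracy, reverse=True)
        top = pop[:parents]  # elitism: the best candidates are kept unchanged
        children = [sample_valid(lambda: mutate(rng.choice(top), rng), limit)
                    for _ in range(population - parents)]
        pop = top + children
    return max(pop, key=proxy_accuracy)
```

Because every candidate passes through `sample_valid`, the returned architecture always respects the memory limit; a more efficient inference library effectively raises that limit and therefore enlarges the set of candidates the search can explore.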

3.2 TinyEngine: A Memory-Efficient Inference Library

Researchers used to assume that using different deep learning frameworks (libraries) would only affect the inference speed but not the accuracy. However, this is not the case for TinyML: the efficiency of the inference library matters a lot to both the latency and the accuracy of the searched model. Specifically, a good inference framework makes full use of the limited resources of the MCU, avoids wasting memory, and allows a larger design space for architecture search. With a larger degree of design freedom, TinyNAS is more likely to find a high-accuracy model. Thus, TinyNAS is co-designed with a memory-efficient inference library, TinyEngine.

Code generator-based compilation.

Most existing inference libraries (e.g., TF-Lite Micro, CMSIS-NN) are interpreter-based. Though an interpreter makes it easy to support cross-platform development, it requires extra memory, the most expensive resource in an MCU, to store meta-information (such as model structure parameters). Instead, TinyEngine focuses only on MCU devices and adopts code generator-based compilation. This not only avoids the time spent on runtime interpretation, but also frees up memory to allow the design and inference of larger models. Compared to CMSIS-NN, TinyEngine reduces memory usage by 2.7x and improves inference efficiency by 22% via code generation, as shown in Figures 3 and 4, respectively.

Figure 3: TinyEngine achieves higher inference efficiency than existing inference frameworks while reducing the memory usage. Left: TinyEngine is 3x and 1.6x faster than TF-Lite Micro and CMSIS-NN, respectively. Note that if the required memory exceeds the memory constraint, it is marked with "OOM" (out of memory). Right: By reducing the memory usage, TinyEngine can run various model designs with tiny memory, enlarging the design space for TinyNAS under the limited memory of MCU. Model details in Section B.
Figure 4: TinyEngine outperforms existing libraries by eliminating runtime overheads and specializing each optimization technique. This effectively enlarges design space for TinyNAS under a given latency constraint.
Figure 5: Binary size.

The binary size of TinyEngine is lightweight, making it very memory-efficient for MCUs. Interpreter-based TF-Lite Micro prepares the code for every operation (e.g., conv, softmax) to support cross-model inference even if the operation is not used, which leads to high redundancy. In contrast, TinyEngine only compiles the operations that are used by a given model into the binary. As shown in Figure 5, such model-adaptive compilation reduces code size by up to 4.5x and 5.0x compared to TF-Lite Micro and CMSIS-NN, respectively.
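The idea of compiling only the operators a model actually uses can be illustrated with a small sketch. The operator names and per-operator code sizes below are hypothetical, illustrative numbers and are not measurements of TinyEngine or TF-Lite Micro.

```python
# Hypothetical per-operator code sizes in bytes (illustrative numbers only).
KERNEL_CODE_SIZE = {
    "conv2d": 6000,
    "depthwise_conv2d": 5000,
    "add": 800,
    "avg_pool": 1200,
    "softmax": 2500,
    "fully_connected": 2000,
}


def ops_to_compile(model_ops):
    # Model-adaptive: keep each operator type the model uses, exactly once.
    return sorted(set(model_ops))


def binary_size(ops):
    # Code size of a binary that links in the given operator kernels.
    return sum(KERNEL_CODE_SIZE[op] for op in ops)
```

For a model whose layer list never calls `softmax`, `binary_size(ops_to_compile(model_ops))` is smaller than `binary_size(KERNEL_CODE_SIZE)`, the size of a binary that ships every kernel.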

Model-adaptive memory scheduling.

Existing inference libraries schedule the memory for each layer solely based on the layer itself: at the very beginning, a large buffer is designated to store the input activations after im2col; when executing each layer, only one column of the transformed inputs takes up this buffer. This leads to poor input activation reuse. Instead, TinyEngine adapts the memory scheduling to model-level statistics: the maximum memory $M$ required to fit exactly one column of transformed inputs over all the layers $L$,

$$M = \max_{L_i \in L} \left( \text{kernel size}_{L_i}^2 \cdot \text{in channels}_{L_i} \right).$$

For each layer $L_j$, TinyEngine tries to tile the computation loop nests so that as many columns as possible fit in that memory,

$$\text{tiling size of feature map width}_{L_j} = \left\lfloor \frac{M}{\text{kernel size}_{L_j}^2 \cdot \text{in channels}_{L_j}} \right\rfloor.$$

Therefore, even for layers with the same configuration (e.g., kernel size, number of input/output channels) in two different models, TinyEngine will provide different strategies. Such adaptation fully uses the available memory and increases input data reuse, reducing runtime overheads including memory fragmentation and data movement. As shown in Figure 4, the model-adaptive im2col operation improves inference efficiency by 13%.
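The model-level scheduling rule can be made concrete with a short sketch. It assumes that one im2col column of a convolution layer holds kernel_size² × in_channels elements, so the buffer is sized for the largest column in the model and each layer tiles as many of its own columns into it as fit. The function names are hypothetical and sizes are counted in elements rather than bytes.

```python
def column_size(kernel_size, in_channels):
    # Elements in one im2col column of a convolution layer (assumption).
    return kernel_size * kernel_size * in_channels


def schedule(layers):
    """layers: list of (kernel_size, in_channels) pairs.

    Returns (M, tiles): the shared buffer size M and, for each layer,
    the number of im2col columns that fit in that buffer.
    """
    M = max(column_size(k, c) for k, c in layers)
    tiles = [M // column_size(k, c) for k, c in layers]
    return M, tiles
```

The same layer receives a different tiling in different models: `schedule([(3, 16), (5, 32)])` gives the 3x3, 16-channel layer 5 columns per tile (M = 800), while `schedule([(3, 16), (3, 32)])` gives it only 2 (M = 288).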

Computation kernel specialization.

TinyEngine specializes the kernel optimizations for different layers: loop tiling is based on the kernel size and the available memory, which is different for each layer; and the inner loop unrolling is also specialized for different kernel sizes (e.g., 9 repeated code segments for a 3x3 kernel, and 25 for 5x5) to eliminate branch instruction overheads. Operation fusion is performed for Conv+Padding+ReLU+BN layers. These specialized optimizations of the computation kernel further increase the inference efficiency by 22%, as shown in Figure 4.
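Kernel-size-specific unrolling lends itself to code generation. The sketch below emits one C-like multiply-accumulate statement per kernel tap, which yields 9 statements for a 3x3 kernel and 25 for 5x5; the emitted identifiers (`acc`, `input`, `w`, `W`, `x`, `y`) are hypothetical and the output is not TinyEngine's actual generated code.

```python
def emit_unrolled_mac(kernel_size):
    """Return C-like source lines, one multiply-accumulate per kernel tap."""
    lines = []
    for ky in range(kernel_size):
        for kx in range(kernel_size):
            idx = ky * kernel_size + kx  # flat index into the weight array
            lines.append(
                f"acc += input[(y + {ky}) * W + (x + {kx})] * w[{idx}];"
            )
    return lines
```

Since the offsets are baked in as constants when the code is generated, the emitted code has no inner loop counters or branches over the kernel window.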


4 Experiments

4.1 Setups


We used 3 datasets as benchmarks: ImageNet Deng et al. (2009), Visual Wake Words (VWW) Chowdhery et al. (2019), and Speech Commands (V2) Warden (2018). ImageNet is a standard large-scale benchmark for image classification. VWW and Speech Commands represent popular microcontroller use cases: VWW is a vision-based dataset for identifying whether a person is present in the image or not; Speech Commands is an audio dataset for keyword spotting (e.g., "Hey Siri"), which requires classifying a spoken word from a vocabulary of size 35. Both datasets reflect the always-on characteristic of microcontroller workloads. We did not use datasets like CIFAR Krizhevsky et al. (2009), since it is a small dataset with a limited image resolution (32x32), which cannot accurately represent the benchmark model size or accuracy in real-life cases.

Table 2: System-algorithm co-design (TinyEngine + TinyNAS) achieves the highest ImageNet accuracy of models runnable on a microcontroller.

| Library \ Model | S-MbV2 | S-Proxyless | TinyNAS |
|---|---|---|---|
| CMSIS-NN Lai et al. (2018) | 35.2% | 49.5% | 55.5% |
| TinyEngine | 43.8% | 54.4% | 60.1% |

Table 3: MCUNet outperforms the baselines at various latency requirements. Both TinyEngine and TinyNAS bring significant improvement on ImageNet.

| Latency Constraint | N/A | 5FPS | 10FPS |
|---|---|---|---|
| S-MbV2+CMSIS | 39.7% | 39.7% | 28.7% |
| S-MbV2+TinyEngine | 43.8% | 41.6% | 34.1% |
| MCUNet | 60.1% | 49.9% | 40.5% |

Table 4: MCUNet outperforms Rusci et al. (2020) without using an advanced mixed-bit quantization (8/4/2-bit) policy under different scales of resource constraints, achieving a record ImageNet accuracy (>70%) on microcontrollers. Columns give the (SRAM, Flash) constraint.

| Method | Quantization | (256kB, 1MB) | (320kB, 1MB) | (512kB, 1MB) | (512kB, 2MB) |
|---|---|---|---|---|---|
| Rusci et al. (2020) | Mixed | 60.2% | - | 62.9% | 68.0% |
| MCUNet | 4-bit | 60.7% | 62.3% | 65.9% | 70.2% |

Figure 6: MCUNet reduces the SRAM usage by 3.6x and the Flash usage by 6.6x compared to MobileNetV2 and ResNet-18 (8-bit), while achieving better accuracy (70.2% vs. 69.8% ImageNet top-1).

During neural architecture search, in order not to touch the validation set, we perform validation on a small subset of the training set (we split 10,000 samples from the training set of ImageNet, and 5,000 from VWW). Speech Commands has a separate validation&test set, so we use the validation set for search and use the test set to report accuracy. The training details are in the supplementary material.

Model deployment.

We perform int8 linear quantization to deploy the model. All the MCU results are reported on STM32F746 MCU (320kB SRAM/1MB Flash), except for the OOM results that are measured on a larger MCU: STM32H743 (512kB SRAM/2MB Flash). All the latency is normalized to STM32F746 with 216MHz CPU.

4.2 Large-Scale Image Recognition on Tiny Devices

With our system-algorithm co-design, we achieve record high accuracy (70.2%) on large-scale ImageNet recognition on microcontrollers. We co-optimize TinyNAS and TinyEngine to find the best runnable network. We compare our results to several baselines. We generate the best scaling of MobileNetV2 Sandler et al. (2018) (denoted as S-MbV2) and ProxylessNAS Mobile Cai et al. (2019) (denoted as S-Proxyless) by compound scaling down the width multiplier and the input resolution until they meet the memory requirement. We train and evaluate all the satisfying scaled-down models on the Pareto front, and then report the highest accuracy as the baseline. (For example, if two models (w0.5, r128) and (w0.5, r144) meet the constraints, we only train and evaluate (w0.5, r144), since it is strictly better than the other; if two models (w0.5, r128) and (w0.4, r144) fit the requirement, we train both networks and report the higher accuracy.) The former baseline is an efficient manually designed model; the latter is a state-of-the-art NAS model. We did not use MobileNetV3 Howard et al. (2019)-like models because the hard-swish activation is not efficiently supported on microcontrollers.
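The Pareto-front rule used to pick which scaled-down baselines to train can be sketched as a small filter over (width, resolution) pairs. It assumes that a configuration is dominated when another satisfying configuration is at least as large in both width and resolution; the function name is hypothetical.

```python
def pareto_front(configs):
    """Keep the (width, resolution) pairs not dominated by another pair."""

    def dominated(a, b):
        # b dominates a if it differs and is at least as large in both axes.
        return b != a and b[0] >= a[0] and b[1] >= a[1]

    return [a for a in configs if not any(dominated(a, b) for b in configs)]
```

On the two examples from the text, `pareto_front([(0.5, 128), (0.5, 144)])` keeps only `(0.5, 144)`, while `pareto_front([(0.5, 128), (0.4, 144)])` keeps both pairs, since neither dominates the other.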

Co-design brings better performance.

Both the inference library and the model design help to fit the resource constraints of microcontrollers. As shown in Table 2, when running on a tight budget of 320kB SRAM and 1MB Flash, the optimal scalings of the MobileNetV2 and ProxylessNAS models only achieve 35.2% and 49.5% top-1 accuracy on ImageNet. With TinyEngine, we can reduce memory consumption, allowing a larger runnable model that achieves 54.4% top-1 accuracy. Finally, with system-algorithm co-design, MCUNet further advances the accuracy to 60.1%, showing the advantage of joint optimization. Co-design also improves the performance at various latency constraints (Table 3). TinyEngine accelerates inference to achieve higher accuracy at the same latency constraint. For the optimal scaling of MobileNetV2, TinyEngine improves the accuracy by 1.9% at the 5 FPS setting and by 5.4% at 10 FPS. With MCUNet co-design, we can further improve the accuracy by 8.3% and 6.4%, respectively.

Lower bit precision.

We used int8 linear quantization for both weights and activations. As the microcontroller only supports 16-bit instructions, the multiplication operands are converted to 16-bit and accumulated in 32-bit. We also performed 4-bit linear quantization on ImageNet, which can fit a larger number of parameters. The results are shown in Table 4. Under the same memory constraints, 4-bit MCUNet outperforms 8-bit by 2.2% by fitting a larger model in the memory. Without mixed precision, we can already outperform the existing state of the art Rusci et al. (2020) on microcontrollers, showing the effectiveness of system-algorithm co-design. We believe that we can further advance the Pareto curve in the future with mixed-precision quantization. Notably, our model achieves a record ImageNet top-1 accuracy of 70.2% on the STM32H743 MCU. To the best of our knowledge, we are the first to achieve >70% ImageNet accuracy on off-the-shelf commercial microcontrollers. Compared to ResNet-18 and MobileNetV2-0.75 (both in 8-bit), which achieve a similar ImageNet accuracy (69.8%), MCUNet reduces the memory usage by 3.6x and the Flash usage by 6.6x (Figure 6) to fit the tiny memory size on microcontrollers.
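Linear quantization with integer accumulation can be sketched in plain Python. The sketch assumes a symmetric scheme (a single scale, no zero point), which the text does not specify, and the function names are hypothetical. The accumulator check mirrors the 32-bit accumulation described above.

```python
def quantize(values, num_bits=8):
    """Symmetric linear quantization: returns (integer codes, scale)."""
    qmax = 2 ** (num_bits - 1) - 1          # 127 for int8, 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def int_dot(a_codes, b_codes):
    """Integer dot product; the sum must fit a 32-bit accumulator."""
    acc = 0
    for a, b in zip(a_codes, b_codes):
        acc += a * b                        # int8 x int8 product fits in 16 bits
    assert -2 ** 31 <= acc < 2 ** 31
    return acc


def quantized_dot(a, b, num_bits=8):
    """Approximate the float dot product of a and b with integer arithmetic."""
    qa, sa = quantize(a, num_bits)
    qb, sb = quantize(b, num_bits)
    return int_dot(qa, qb) * sa * sb        # rescale back to real values
```

Passing `num_bits=4` halves the storage per weight relative to int8, which is why a larger model fits in the same Flash budget, at the cost of a coarser grid of representable values.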

4.3 Visual&Audio Wake Words

Figure 7: Accuracy vs. latency/SRAM memory trade-off on the VWW (top) and Speech Commands (bottom) datasets. MCUNet achieves better accuracy while being 2.4-3.4x faster at 2.2-2.6x smaller peak SRAM.

We benchmarked the performance on two wake words datasets: Visual Wake Words Chowdhery et al. (2019) (VWW) and Google Speech Commands (denoted as GSC) to compare the accuracy-latency and accuracy-peak memory trade-offs. We compared to the optimally scaled MobileNetV2 and ProxylessNAS running on TF-Lite Micro. The results are shown in Figure 7. MCUNet significantly advances the Pareto curve. On the VWW dataset, we can achieve higher accuracy at 3.4x faster inference speed. We also compare our results to the previous first-place solution in the VWW challenge [40] (denoted as Han et al.). We scaled the input resolution to tightly fit the memory constraint of 320kB and re-trained it under the same setting as ours. We find that MCUNet achieves 2.4x faster inference speed compared to the previous state of the art. Interestingly, the model from [40] has a much smaller peak memory usage compared to the biggest MobileNetV2 and ProxylessNAS models, while having a higher computation and latency. This also shows that a smaller peak memory is the key to success on microcontrollers. On the Speech Commands dataset, MCUNet achieves higher accuracy at 2.8x faster inference speed and smaller peak memory. It achieves 2% higher accuracy compared to the largest MobileNetV2, and a 3.3% improvement compared to the largest runnable ProxylessNAS under the 320kB SRAM constraint.

4.4 Analysis

Search space optimization matters.

Search space optimization significantly improves the NAS accuracy. We performed an ablation study on ImageNet-100, a subset of ImageNet with 100 randomly sampled categories. The distribution of the top-10 search spaces is shown in Figure 2. We sample several search spaces from the top-10 search spaces and perform the whole neural architecture search process to find the best model inside the space that can fit 320kB SRAM/1MB Flash.

| | R-18@224 | Rand Space | Huge Space | Our Space |
|---|---|---|---|---|
| Acc. | 80.3% | 74.7±1.9% | 77.0% | 78.7% |

Table 5: Our search space achieves the best accuracy, closer to that of ResNet-18 at 224 resolution (OOM). Randomly sampled spaces and a huge space (containing many configurations) lead to worse accuracy.
Figure 8: Search space with higher mean FLOPs leads to higher final accuracy.

We compare the accuracy of the searched model using different search spaces in Table 5. Using the search space configuration found by our algorithm, we can achieve 78.7% top-1 accuracy, closer to ResNet-18 on 224 resolution input (which runs out of memory). We evaluate several randomly sampled search spaces from the top-10 spaces; they perform significantly worse. Another baseline is to use a very large search space supporting variable resolution (96-176) and variable width multipliers (0.3-0.7). Note that this “huge space” contains the best space. However, it fails to get good performance. We hypothesize that using a super large space increases the difficulty of training super network and evolution search. We plot the relationship between the accuracy of the final searched model and the mean FLOPs of the search space configuration in Figure 8. We can see a clear positive relationship, which backs our algorithm.

Sensitivity analysis on search space optimization.

(a) Best width setting.
(b) Best resolution setting.
Figure 9: Best search space configurations under different SRAM and Flash constraints.

We inspect the results of search space optimization and find some interesting patterns. The results are shown in Figure 9. We vary the SRAM limit from 192kB to 512kB and the Flash limit from 512kB to 2MB, and show the chosen width multiplier and resolution. Generally, with a larger SRAM to store a larger activation map, we can use a higher input resolution; with a larger Flash to store a larger model, we can use a larger width multiplier. When we increase the SRAM and keep the Flash fixed, going from point 1 to point 2 (red rectangles), the width does not increase because the Flash is small; the resolution increases because the larger SRAM can host a larger activation. From point 1 to point 3, the width increases and the resolution actually decreases. This is because a larger Flash hosts a wider model, but we need to scale down the resolution to fit the small SRAM. Such patterns are non-trivial and hard to discover manually.

Evolution search.

Figure 10: Evolution progress.

The curves of evolution search on different inference libraries are shown in Figure 10. The solid line represents the average value, while the shaded area shows the range of (min, max) accuracy. On TinyEngine, evolution clearly outperforms random search, with 1% higher best accuracy. The evolution on CMSIS-NN leads to much worse results due to memory inefficiency: the library can only host a smaller model compared to TinyEngine, which leads to lower accuracy.

5 Conclusion

We propose MCUNet to jointly design the neural network architecture (TinyNAS) and the inference library (TinyEngine), enabling deep learning on tiny hardware resources. We achieved a record ImageNet accuracy (70.2%) on off-the-shelf microcontrollers, and accelerated the inference of wake word applications by 2.4-3.4×. Our study suggests that the era of always-on tiny machine learning on IoT devices has arrived.

Statement of Broader Impacts

Our work is expected to enable tiny-scale deep learning on microcontrollers and further democratize deep learning applications. Over the years, people have brought down the cost of deep learning inference from $5,000 workstation GPUs to $500 mobile phones. We now bring deep learning to microcontrollers costing $5 or even less, which greatly expands the scope of AI applications, making AI much more accessible. Thanks to the low cost and large quantity (250B) of commercial microcontrollers, we can bring AI applications to every aspect of our daily life, including personalized healthcare, smart retail, precision agriculture, smart factories, etc. People from rural and under-developed areas without Internet or high-end hardware can also enjoy the benefits of AI. With these always-on low-power microcontrollers, we can process raw sensor data right at the source. This helps to protect privacy, since data no longer has to be transmitted to the cloud but can be processed locally.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) Tensorflow: a system for large-scale machine learning. In OSDI, Cited by: Appendix A, MCUNet: Tiny Deep Learning on IoT Devices, §2.
  • [2] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan, and Q. Le (2018) Understanding and simplifying one-shot architecture search. In ICML, Cited by: §3.1.
  • [3] H. Cai, C. Gan, T. Wang, Z. Zhang, and S. Han (2020) Once for All: Train One Network and Specialize it for Efficient Deployment. In ICLR, Cited by: Appendix D, §1.
  • [4] H. Cai, L. Zhu, and S. Han (2019) ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In ICLR, Cited by: Appendix B, Appendix D, §1, §2, §3.1, §3, §4.2.
  • [5] A. Canziani, A. Paszke, and E. Culurciello (2016) An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678. Cited by: §3.1.
  • [6] A. Capotondi, M. Rusci, M. Fariselli, and L. Benini (2020) CMix-nn: mixed low-precision cnn library for memory-constrained edge devices. IEEE Transactions on Circuits and Systems II: Express Briefs 67 (5), pp. 871–875. Cited by: §2.
  • [7] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al. (2018) tvm: An automated end-to-end optimizing compiler for deep learning. In OSDI, Cited by: §2, §3.
  • [8] T. Chen, L. Zheng, E. Yan, Z. Jiang, T. Moreau, L. Ceze, C. Guestrin, and A. Krishnamurthy (2018) Learning to optimize tensor programs. In NeurIPS, Cited by: §3.
  • [9] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan (2018) Pact: parameterized clipping activation for quantized neural networks. arXiv preprint arXiv:1805.06085. Cited by: §2.
  • [10] A. Chowdhery, P. Warden, J. Shlens, A. Howard, and R. Rhodes (2019) Visual wake words dataset. arXiv preprint arXiv:1906.05721. Cited by: Appendix A, Appendix E, §4.1, §4.3.
  • [11] M. Courbariaux and Y. Bengio (2016) Binarynet: training deep neural networks with weights and activations constrained to+ 1 or-1. arXiv preprint arXiv:1602.02830. Cited by: §2.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, Cited by: Appendix B, Appendix E, §4.1.
  • [13] I. Fedorov, R. P. Adams, M. Mattina, and P. Whatmough (2019) Sparse: sparse architecture search for cnns on resource-constrained microcontrollers. In NeurIPS, Cited by: §1.
  • [14] Y. Gong, L. Liu, M. Yang, and L. Bourdev (2014) Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115. Cited by: §2.
  • [15] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2019) Single Path One-Shot Neural Architecture Search with Uniform Sampling. arXiv. Cited by: §3.1.
  • [16] S. Han, H. Mao, and W. J. Dally (2016) Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, Cited by: §2.
  • [17] S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both Weights and Connections for Efficient Neural Networks. In NeurIPS, Cited by: Appendix D, §2.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep Residual Learning for Image Recognition. In CVPR, Cited by: §1.
  • [19] Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) AMC: AutoML for Model Compression and Acceleration on Mobile Devices. In ECCV, Cited by: §2, §3.1.
  • [20] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In ICCV, Cited by: §2.
  • [21] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv. Cited by: §1, §2.
  • [22] A. Howard, M. Sandler, G. Chu, L. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam (2019) Searching for MobileNetV3. In ICCV, Cited by: §4.2.
  • [23] Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin (2015) Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530. Cited by: §2.
  • [24] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: Appendix B, §4.1.
  • [25] L. Lai, N. Suda, and V. Chandra (2018) Cmsis-nn: efficient neural network kernels for arm cortex-m cpus. arXiv preprint arXiv:1801.06601. Cited by: MCUNet: Tiny Deep Learning on IoT Devices, §2, §4.1.
  • [26] T. Lawrence and L. Zhang (2019) IoTNet: an efficient and accurate convolutional neural network for iot devices. Sensors 19 (24), pp. 5541. Cited by: §1.
  • [27] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky (2014) Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553. Cited by: §2.
  • [28] E. Liberis and N. D. Lane (2019) Neural networks on microcontrollers: saving memory at inference via operator reordering. arXiv preprint arXiv:1910.05110. Cited by: §1.
  • [29] J. Lin, Y. Rao, J. Lu, and J. Zhou (2017) Runtime neural pruning. In NeurIPS, Cited by: §2.
  • [30] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: Differentiable Architecture Search. In ICLR, Cited by: §2.
  • [31] Z. Liu, H. Mu, X. Zhang, Z. Guo, X. Yang, K. Cheng, and J. Sun (2019) MetaPruning: Meta Learning for Automatic Neural Network Channel Pruning. In ICCV, Cited by: §2.
  • [32] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang (2017) Learning efficient convolutional networks through network slimming. In ICCV, Cited by: §2.
  • [33] I. Loshchilov and F. Hutter (2016) Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983. Cited by: Appendix E.
  • [34] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design. In ECCV, Cited by: §2.
  • [35] I. Radosavovic, J. Johnson, S. Xie, W. Lo, and P. Dollár (2019) On network design spaces for visual recognition. In ICCV, Cited by: §3.1.
  • [36] I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He, and P. Dollár (2020) Designing network design spaces. arXiv preprint arXiv:2003.13678. Cited by: §1, §2.
  • [37] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In ECCV, Cited by: §2.
  • [38] M. Rusci, A. Capotondi, and L. Benini (2020) Memory-driven mixed low precision quantization for enabling deep network inference on microcontrollers. In MLSys, Cited by: §1, Figure 6, §4.2, Table 4.
  • [39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: Inverted Residuals and Linear Bottlenecks. In CVPR, Cited by: Appendix B, §1, §1, §2, §4.2.
  • [40] Solution to visual wakeup words challenge’19 (first place).. Note: Cited by: §4.3.
  • [41] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le (2019) MnasNet: Platform-Aware Neural Architecture Search for Mobile. In CVPR, Cited by: Appendix B, Appendix C, Appendix D, §1, §2, §3.1, §3.
  • [42] K. Wang, Z. Liu, Y. Lin, J. Lin, and S. Han (2019) HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In CVPR, Cited by: §2.
  • [43] P. Warden (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209. Cited by: Appendix E, §4.1.
  • [44] Why tinyml is a giant opportunity. Note: Cited by: §1.
  • [45] B. Wu, X. Dai, P. Zhang, Y. Wang, F. Sun, Y. Wu, Y. Tian, P. Vajda, Y. Jia, and K. Keutzer (2019) FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search. In CVPR, Cited by: Appendix D, §1, §2, §3.
  • [46] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In CVPR, Cited by: §1, §2.
  • [47] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou (2016) Dorefa-net: training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv preprint arXiv:1606.06160. Cited by: §2.
  • [48] C. Zhu, S. Han, H. Mao, and W. J. Dally (2016) Trained ternary quantization. arXiv preprint arXiv:1612.01064. Cited by: §2.
  • [49] B. Zoph and Q. V. Le (2017) Neural Architecture Search with Reinforcement Learning. In ICLR, Cited by: §2.
  • [50] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning Transferable Architectures for Scalable Image Recognition. In CVPR, Cited by: §2.

Appendix A Demo Video

We release a demo video of MCUNet running the visual wake words dataset [10] in this link. MCUNet with TinyNAS and TinyEngine achieves 12% higher accuracy and 2.5× faster speed compared to MobileNetV1 on TF-Lite Micro [1]. Note that we show the actual frame rate in the demo video, which includes frame capture latency overhead from the camera (around 30ms per frame). Such camera latency slows down the inference from 10 FPS to 7.3 FPS.

Appendix B Profiled Model Architecture Details

We provide the details of the models profiled in Figure 4.


SmallCifar is a small network for the CIFAR [24] dataset used in the MicroTVM/TVM post. It takes an image of size 32×32 as input. The input image is passed through 3× {convolution (kernel size 5×5), max pooling}. The output channels are 32, 32, 64 respectively. The final feature map is flattened and passed through a linear layer of weight 1024×10 to get the logits. The model is quite small. We mainly use it to compare with MicroTVM, since most ImageNet models run out of memory (OOM) with MicroTVM.

ImageNet Models.

All other models are for ImageNet [12] to reflect a real-life use case. The input resolution and model width multiplier are scaled down so that they can run with most of the libraries profiled. We used input resolution of for MobileNetV2 [39] and ProxylessNAS [4], and for MnasNet [41]. The width multipliers are 0.35 for MobileNetV2, 0.3 for ProxylessNAS and 0.2 for MnasNet.

Appendix C Design Cost

There are billions of IoT devices with drastically different constraints, which require different search spaces and model specialization. Therefore, keeping a low design cost is important. MCUNet is efficient in terms of neural architecture design cost. The search space optimization process has negligible cost since no training or testing is required (it takes around 2 CPU hours to collect all the FLOPs statistics). The process needs to be done only once and can be reused for different constraints (e.g., we covered two MCU devices and 4 memory constraints in Table 4). TinyNAS is a one-shot neural architecture search method without a meta controller, which is far more efficient than traditional neural architecture search methods: it takes 40,000 GPU hours for MnasNet [41] to design a model, while MCUNet only takes 300 GPU hours, reducing the search cost by 133×. With MCUNet, we reduce the CO2 emission from 11,345 lbs to 85 lbs per model (Figure 11).
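The two reductions quoted above are consistent with each other if the emission is taken to scale in proportion to GPU hours. That proportionality is our reading, not something stated in the text. Under this assumption the arithmetic is:

```latex
\frac{40{,}000\ \text{GPU hours}}{300\ \text{GPU hours}} \approx 133,
\qquad
\frac{11{,}345\ \text{lbs}}{133} \approx 85\ \text{lbs}.
```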

Figure 11: Total CO2 emission (klbs) for model design. MCUNet saves the design cost by orders of magnitude, allowing model specialization for different deployment scenarios.

Appendix D Resource-Constrained Model Specialization Details

For all the experiments in our paper, we used the same training recipe for neural architecture search to keep a fair comparison.

Super network training.

We first train a super network to contain all the sub-networks in the search space through weight sharing. Our search space is based on the widely-used mobile search space [41, 4, 45, 3] and supports variable kernel sizes for depth-wise convolution (3/5/7), variable expansion ratios for inverted bottleneck (3/4/6) and variable stage depths (2/3/4). The input resolution and width multiplier are chosen by the search space optimization technique proposed in Section 3.1. The number of possible sub-networks that TinyNAS can cover in the search space is large. To speed up the convergence, we first train the largest sub-network inside the search space (all kernel size 7, all expansion ratio 6, all stage depth 4). We then use the trained weights to initialize the super network. Following [3], we sort the channel weights according to their importance (we used the L-1 norm to measure the importance [17]), so that the most important channels are ranked higher. Then we train the super network to support different sub-networks. For each batch of data, we randomly sample 4 sub-networks, calculate the loss, backpropagate the gradients for each sub-network, and update the corresponding weights. For weight sharing, when selecting a smaller kernel, e.g., kernel size 3, we index the central 3×3 window from the 7×7 kernel; when selecting a smaller expansion ratio, e.g., 3, we index the first 3n channels from the 6n channels (n is #block input channels), as the weights are already sorted according to importance; when using a smaller stage depth, e.g., 2, we calculate the first 2 blocks inside the stage and skip the rest. Since we use a fixed order when sampling sub-networks, we keep the same sampling manner when evaluating their performance.
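The three weight-sharing rules described above (central kernel window, first channels after importance sorting, first blocks of a stage) can be sketched in plain Python. This is a minimal illustration on nested lists rather than the actual training code; the helper names and the toy data are hypothetical.

```python
def central_window(kernel, size):
    """Crop the central size x size window from a square 2-D kernel (list of rows)."""
    full = len(kernel)
    start = (full - size) // 2
    return [row[start:start + size] for row in kernel[start:start + size]]


def sort_channels_by_l1(channels):
    """Order channels by descending L1 norm so the most important come first."""
    return sorted(channels, key=lambda weights: sum(abs(w) for w in weights),
                  reverse=True)


def first_channels(sorted_channels, n_input, expand_ratio):
    """Keep the first expand_ratio * n_input channels of an importance-sorted list."""
    return sorted_channels[:expand_ratio * n_input]


def active_blocks(stage_blocks, depth):
    """Run only the first `depth` blocks of a stage and skip the rest."""
    return stage_blocks[:depth]


# Hypothetical toy data: a 7x7 kernel whose entries encode their (row, col) position.
kernel7 = [[10 * r + c for c in range(7)] for r in range(7)]
kernel3 = central_window(kernel7, 3)

# Toy expanded layer with n = 2 input channels and the largest expansion ratio 6,
# i.e. 12 channels; channel i has L1 norm 2 * i.
channels = [[float(i), -float(i)] for i in range(12)]
ranked = sort_channels_by_l1(channels)
shared = first_channels(ranked, n_input=2, expand_ratio=3)

stage = ["block0", "block1", "block2", "block3"]
used = active_blocks(stage, depth=2)
```

Because every smaller choice is a fixed slice of the largest one, a sub-network is fully determined by its kernel sizes, expansion ratios and depths, which is what allows the sub-networks to share one set of weights.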

Evolution search.

After super-network training, we use evolution to find the best sub-network architecture. We use a population size of 100. To get the first generation of the population, we randomly sample sub-networks and keep 100 networks that satisfy the resource constraints. We measure the accuracy of each candidate on the independent validation set split from the training set. Then, for each iteration, we keep the top-20 candidates in the population with the highest accuracy. We use crossover to generate 50 new candidates, and use mutation with probability 0.1 to generate another 50 new candidates, which form a new generation of size 100. We measure the accuracy of each candidate in the new generation. The process is repeated for 30 iterations, and we choose the sub-network with the highest validation accuracy.
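The loop above can be sketched as follows. This is a hedged illustration, not the released search code: the architecture encoding, `NUM_BLOCKS`, `toy_accuracy` and `toy_fits` are hypothetical stand-ins for the real sub-network encoding, the validation accuracy with inherited weights, and the SRAM/Flash/latency check. We also assume that the 100 children replace the previous generation and that infeasible children are re-sampled.

```python
import random

# Hypothetical search space mirroring the per-block choices listed in Appendix D.
KERNEL_SIZES = [3, 5, 7]
EXPAND_RATIOS = [3, 4, 6]
NUM_BLOCKS = 8  # assumed number of searchable blocks, for illustration only


def random_arch(rng):
    return [(rng.choice(KERNEL_SIZES), rng.choice(EXPAND_RATIOS))
            for _ in range(NUM_BLOCKS)]


def mutate(arch, rng, prob=0.1):
    """Resample each block's choice with probability `prob`."""
    return [(rng.choice(KERNEL_SIZES), rng.choice(EXPAND_RATIOS))
            if rng.random() < prob else gene for gene in arch]


def crossover(a, b, rng):
    """Take each block's choice from one of the two parents."""
    return [rng.choice(pair) for pair in zip(a, b)]


def sample_feasible(make, fits, max_tries=10000):
    """Rejection-sample a candidate that satisfies the resource constraints."""
    for _ in range(max_tries):
        cand = make()
        if fits(cand):
            return cand
    raise RuntimeError("no feasible candidate found")


def evolution_search(accuracy, fits, rng, population_size=100,
                     num_parents=20, iterations=30):
    population = [sample_feasible(lambda: random_arch(rng), fits)
                  for _ in range(population_size)]
    best = max(population, key=accuracy)
    half = population_size // 2
    for _ in range(iterations):
        parents = sorted(population, key=accuracy, reverse=True)[:num_parents]
        children = [
            sample_feasible(
                lambda: crossover(rng.choice(parents), rng.choice(parents), rng),
                fits)
            for _ in range(half)]
        children += [
            sample_feasible(lambda: mutate(rng.choice(parents), rng), fits)
            for _ in range(population_size - half)]
        population = children
        best = max(population + [best], key=accuracy)
    return best


def toy_accuracy(arch):   # stand-in for measured validation accuracy
    return sum(k * e for k, e in arch)


def toy_fits(arch):       # stand-in for the memory/latency constraint check
    return sum(k + e for k, e in arch) <= 72


rng = random.Random(0)
best = evolution_search(toy_accuracy, toy_fits, rng)
```

In a real search, `accuracy` would be by far the most expensive call, so caching the measured accuracy per architecture would be a natural refinement of this sketch.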

Appendix E Training & Testing Details


The super network is trained on the training set excluding the split validation set. We trained the network using the standard SGD optimizer with momentum 0.9 and weight decay 5e-5. For super network training, we used a cosine annealing learning rate schedule [33] with a starting learning rate of 0.05 for every 256 samples. The largest sub-network is trained for 150 epochs on ImageNet [12], 100 epochs on Speech Commands [43] and 30 epochs on Visual Wake Words [10] due to different dataset sizes. Then we train the super network for twice as many epochs by randomly sampling sub-networks.
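The learning rate schedule can be sketched as below. We read "0.05 for every 256 samples" as a linear scaling of the starting rate with the batch size, and we assume a single cosine cycle that decays to zero without warm restarts; both are assumptions, as are the batch size of 1024 and the per-epoch granularity used in the example.

```python
import math


def base_lr(batch_size, lr_per_256=0.05):
    """Scale the starting learning rate linearly with the batch size."""
    return lr_per_256 * batch_size / 256


def cosine_lr(step, total_steps, lr0):
    """Single-cycle cosine annealing from lr0 down to 0 (no warm restarts assumed)."""
    return 0.5 * lr0 * (1 + math.cos(math.pi * step / total_steps))


lr0 = base_lr(batch_size=1024)  # hypothetical batch size
# One value per epoch for a 150-epoch run, plus the final value at the end.
schedule = [cosine_lr(epoch, 150, lr0) for epoch in range(151)]
```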


We evaluate the performance of each sub-network on the independent validation set split from the training set in order not to over-fit the real validation set. To evaluate each sub-network’s performance during evolution search, we index and inherit the partial weights from the super network. We re-calibrate the batch normalization statistics (moving mean and variance) using 20 batches of data with a batch size of 64. To evaluate the final performance on the real validation set, we also fine-tuned the best sub-network for 100 epochs on ImageNet.
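Re-calibration is needed because the statistics stored in the super network do not match the activations of an individual sub-network. The sketch below shows the idea for a single channel, assuming the statistics are reset and pooled over all calibration samples; this is a simplification of the moving average used by typical batch normalization layers, and the synthetic activations are hypothetical.

```python
import random


def recalibrate_bn(batches):
    """Recompute one channel's mean and variance from scratch over calibration batches.

    Each batch is a list of activations for that channel. Statistics are pooled over
    all samples, a simplification of the per-batch moving average used in training.
    """
    count = 0
    total = 0.0
    total_sq = 0.0
    for batch in batches:
        for x in batch:
            count += 1
            total += x
            total_sq += x * x
    mean = total / count
    var = total_sq / count - mean * mean
    return mean, max(var, 0.0)


def batch_norm(x, mean, var, eps=1e-5):
    """Normalize one activation with the calibrated statistics."""
    return (x - mean) / (var + eps) ** 0.5


rng = random.Random(0)
# Hypothetical activations of one channel in a sampled sub-network: 20 batches of 64.
batches = [[rng.gauss(3.0, 2.0) for _ in range(64)] for _ in range(20)]
mean, var = recalibrate_bn(batches)
normalized = [batch_norm(x, mean, var) for batch in batches for x in batch]
```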


For most of the experiments (except Table 4), we used TensorFlow’s int8 quantization (both activations and weights are quantized to int8). We used post-training quantization without fine-tuning, which already achieves negligible accuracy loss. We also reported the results of 4-bit integer quantization (weights and activations) on ImageNet (Table 4). In this case, we used quantization-aware fine-tuning for 25 epochs to recover the accuracy.
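For readers unfamiliar with integer quantization, the sketch below shows a generic per-tensor affine scheme with a scale and a zero point. It is not TensorFlow's implementation, which differs in details such as per-channel weight scales; the function names and the example weights are hypothetical. Setting `num_bits=4` gives the 4-bit variant of the same scheme.

```python
def quant_params(values, num_bits=8):
    """Derive scale and zero point for signed affine quantization of a tensor."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The represented range must include 0 so that 0.0 maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0   # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point, qmin, qmax


def quantize(values, scale, zero_point, qmin, qmax):
    """Map floats to clamped integers."""
    return [min(qmax, max(qmin, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 1.55]   # hypothetical float weights
scale, zp, qmin, qmax = quant_params(weights)
q = quantize(weights, scale, zp, qmin, qmax)
restored = dequantize(q, scale, zp)
```

The round-trip error of each value is bounded by roughly one quantization step (`scale`), which is why 8-bit post-training quantization is often nearly lossless while 4-bit quantization usually needs fine-tuning.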