Enabling Large Neural Networks on Tiny Microcontrollers with Swapping

01/14/2021 ∙ by Hongyu Miao, et al. ∙ 0

Running neural networks (NNs) on microcontroller units (MCUs) is becoming increasingly important, but is very difficult due to the tiny SRAM of MCUs. Prior work proposes many algorithm-level techniques to reduce NN memory footprints, but all at the cost of accuracy and generality, which disqualifies MCUs for many important use cases. We investigate a system solution for MCUs to execute NNs out of core: dynamically swapping NN data chunks between an MCU's tiny SRAM and its large, low-cost external flash. Out-of-core NNs on MCUs raise multiple concerns: execution slowdown, storage wear-out, energy consumption, and data security. We present a study showing that none is a showstopper; the key benefit, MCUs being able to run large NNs with full accuracy and generality, outweighs the overheads. Our findings suggest that MCUs can play a much greater role in edge intelligence.




1 Introduction

With low cost and energy, MCUs are becoming ubiquitous platforms for neural networks (NNs), a paradigm dubbed tinyML [17]. Running NNs on MCUs, rather than sending raw data off, offers multiple advantages, notably tolerating poor networks and preserving data privacy. Use cases include detecting crop disease by classifying photos of leaves on farms [18]; monitoring traffic patterns by analyzing city images; and recognizing human voice commands at home [48, 12].

A top difficulty in tinyML is the MCU's memory limit. On one hand, an MCU often has tens to hundreds of KB of SRAM as its readable/writeable main memory; it has no more than a few MBs of byte-addressable on-chip flash for read-only code and data (note: on-chip flash is different from external, non-byte-addressable flash such as SD cards, which hold GBs) [20]. On the other hand, state-of-the-art NNs achieve high accuracy and generality with their large memory footprints [42, 41]. An NN's memory footprint comprises its parameters, which are loaded during NN execution, as well as its feature maps, the intermediate and final results that are loaded and stored.

Figure 1: Many popular NNs exceed the MCU memory size [25].

The parameters of a state-of-the-art NN range from several MBs to more than 100 MBs (even when quantized) [14]; its feature maps can be as large as tens of MBs [14]. Although an MCU can process each NN layer in memory before loading the next layer, the per-layer parameters and feature maps can take up to 100 MB (e.g. VGG16 [40]). This exceeds the MCU memory size by up to two orders of magnitude. This memory gap is widening as recent NNs grow larger [43] while commodity MCUs see slow growth, if any, in memory sizes due to cost and resource constraints [7].

A popular approach to reducing NN memory footprints is to engineer the NNs themselves. Common techniques include model compression [49, 28, 35], parameter quantization [27], designing tiny NNs from scratch [46], and automating these procedures [36]. As a common tradeoff, however, these techniques give away model accuracy or generality to varying degrees. Unfortunately, in order to fit into the MCU memory, an NN either becomes substantially inaccurate (e.g. < 60% top-1 accuracy as shown in Figure 1) or too specialized (e.g. able to detect only a few object classes [45]).

This disqualifies MCUs from key uses where high accuracy or generality is desired while delays can be tolerated: (1) NN inference on slowly changing signals, e.g., monitoring crop health by analyzing hourly photos [18] and traffic patterns by analyzing video frames every 20-30 minutes [45]; (2) profiling NNs on device: occasionally running a full-blown NN to estimate the accuracy of long-running smaller NNs [39]; (3) transfer learning: re-training NNs on MCUs with new data collected from deployment every hour or day [44].

A case for out-of-core NNs

Can an MCU execute NNs that far exceed its physical memory size? A proven approach is to swap tiles of NN layers between memory tiers as the NN executes [23]. For tinyML, this means splitting one NN layer's working set into a series of data chunks, i.e. tiles, each small enough to fit in the MCU memory; loading tiles from external storage (a micro SD card) to memory; computing on them; and writing results back to the storage for subsequent processing. While prior systems have swapped NN tiles between a server's CPU/GPU memories for training [31], applying the idea to MCUs for inference, in particular swapping between small SRAM and a wimpy SD card, raises multiple concerns: loss of SD card durability, slowdown in NN execution, energy increase, and safety/security of out-of-core NN data.

Key observations

This paper investigates the feasibility of out-of-core NN execution on MCUs.


  • Swapping overhead is only pronounced in certain layers.    The swapping delay due to IO exceeds the computation delay only on layers with low arithmetic intensity, notably fully connected (FC) layers; on layers with higher arithmetic intensity, e.g. convolution (Conv), the swapping delay is dwarfed by that of computation. The swapping overhead is further diminished by the MCU's relatively low CPU speed as compared to its IO speed.

  • Swapping rate is throttled by computation, which limits the wear rate of SD cards. As a common NN structure, IO-bound layers such as FC are spaced by compute-bound layers such as Conv. As a result, even with continuous NN executions, IO is only exercised intermittently.

  • Read-most swapping IO.    While writes of NN feature maps wear SD cards, reads of weights and input feature maps do not [2]. Fortunately, writes constitute only a small fraction of all the swapping IO traffic, as the paper will show.

  • Hide swapping delays with parallelism at various granularities.    Within a layer, the MCU can exploit tile parallelism, by computing on a tile while transferring others to/from the storage. Between consecutive NN executions such as on a sequence of video frames, the MCU can further exploit pipeline parallelism, by overlapping the swapping IO for an earlier frame with the computation of a later frame.

  • Modern MCU hardware.    Recent SD cards already over-provision durability at low cost, e.g., a 64 GB SD card can last more than 10 years with 100 GB of daily writes (Section 3.4). As such, an MCU can trade the surplus durability as a system resource for accommodating large NNs. Modern MCUs incorporate rich specialized hardware, e.g., for DMA, hash, and crypto, which facilitates fast and secure data swapping.

  • IO adds marginal energy to an already busy MCU.    With an MCU already busy on computation, most of its hardware components are in high power states. Further activating the SD card for swapping increases the system energy only moderately.

Quantitative findings

We studied a diverse set of NNs, MobileNets [30], AlexNet [33], and VGG16 [40], on a Cortex-M7 MCU with 128 KB of SRAM. Our findings are:


  • Low to modest speed overhead.    NNs dominated by compute-bound layers see negligible swapping overhead, both in per-frame delay and throughput. Compared to running AlexNet on an “ideal” MCU with infinite memory, running it out-of-core with 128 KB memory sees only 3.3% longer delay and almost identical throughput. NNs with more IO-bound layers such as MobileNet see a moderate delay increase (24.2%) but an insignificant loss in throughput (2.5%) thanks to tile and pipeline parallelism.

  • Low durability loss.    Even with an MCU executing NNs continuously, the write traffic due to swapping is no more than a few hundred GBs per day, comparable to SD card writes on a commodity surveillance camera. A 64 GB SD card can sustain such a write rate for 7.5 years before half of its cells are worn out.

  • Modest increase in energy consumption.    Our worst-case estimation shows swapping increases system energy by less than 42% compared to running NNs with infinite memory.

  • Out-of-core data can be secured with known mechanisms such as encryption and hash-based integrity protection. Specialized hardware on MCUs further reduces their overhead.


This paper contributes:


  • the first study of applying swapping to NN on MCUs;

  • an analysis of IO behaviors of NN layers under swapping, characterizing performance, storage durability, energy, and data security. It further presents new insights on extracting parallelism for hiding swapping delays;

  • a finding that an MCU of less than ten dollars with hundreds of KB SRAM can execute large NNs such as VGG16, expanding the scope of tinyML significantly.

2 The System Model

MCU hardware

We assume the following hardware components: (1) a CPU with clockrate from tens of MHz to a few hundred MHz, as exemplified by Arm Cortex M3 and M7; (2) on-chip SRAM: from tens of KBs to several hundreds of KBs; (3) on-chip NOR flash: byte-addressable, read-only memory of no more than a few MBs; (4) cheap external storage, e.g. a micro SD card ranging from tens of GBs to a few hundred GBs; (5) a DMA engine, for moving data between SRAM and external storage without CPU involvement; (6) optionally, on-chip accelerators for computing crypto and hash functions.

Major vendors ship numerous MCU models meeting the above conditions. Examples include the STM32 MCU family from STMicroelectronics [10] and the LPC series from NXP Semiconductors [6]. They are priced at $1-$20 per unit.

NN workloads & metrics

We motivate our study by considering periodic NN inference on video/audio data as a sequence of frames captured by MCUs at run time. To characterize inference speed, we consider both the inference delay of each frame and the throughput, i.e. the number of frames processed per second. MCU applications may be sensitive to either metric or both. For instance, keyword spotting is sensitive to inference delays [48] and car counting benefits from high throughput [45].

Figure 2: An example of out-of-core NN execution, showing Conv (compute-bound) and FC (IO-bound) layers.

Out-of-core NN executions

We consider the following swapping strategy. An NN's parameters are pre-stored on the external flash. Given an input frame, the MCU executes the NN's layers in sequence. If a layer's memory footprint exceeds the MCU's main memory, the MCU processes the layer in tiles: it loads a tile of parameters and a tile of input feature maps into main memory, computes a tile of output feature maps in memory, and writes the output back to the external flash. The input and output tiles must fit in main memory simultaneously.
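The tiling strategy above can be sketched for a fully connected layer as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `fc_layer_out_of_core`, `load_weight_rows`, `store_output`, and `budget_values` are hypothetical, the "flash" is a plain Python list, and for simplicity the input vector stays resident while only the weights and outputs are tiled.

```python
from typing import Callable, List

Matrix = List[List[float]]


def fc_layer_out_of_core(
    load_weight_rows: Callable[[int, int], Matrix],
    store_output: Callable[[int, List[float]], None],
    x: List[float],
    n_out: int,
    budget_values: int,
) -> None:
    """Compute y = W @ x while holding only one tile of W in memory.

    budget_values is the (hypothetical) number of values that fit in SRAM.
    """
    n_in = len(x)
    # Each output row needs one weight row plus one output value resident;
    # the input vector x is assumed to stay resident for the whole layer.
    rows_per_tile = (budget_values - n_in) // (n_in + 1)
    if rows_per_tile < 1:
        raise ValueError("budget too small for even one weight row")
    for start in range(0, n_out, rows_per_tile):
        stop = min(start + rows_per_tile, n_out)
        w_tile = load_weight_rows(start, stop)  # swap in a tile of parameters
        y_tile = [sum(w * v for w, v in zip(row, x)) for row in w_tile]
        store_output(start, y_tile)  # swap out the output tile


# Toy "flash": a 3x2 weight matrix and an output buffer.
flash_weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
flash_output = [0.0] * 3
loads = []


def load_rows(start: int, stop: int) -> Matrix:
    loads.append((start, stop))
    return flash_weights[start:stop]


def store(start: int, y_tile: List[float]) -> None:
    flash_output[start:start + len(y_tile)] = y_tile


# A budget of 5 values forces one weight row per tile, i.e. three tiles.
fc_layer_out_of_core(load_rows, store, x=[1.0, 1.0], n_out=3, budget_values=5)
```

A real deployment would also tile the input feature maps when they do not fit, as the text describes; the sketch omits that to keep the control flow visible.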

As shown in Figure 2, the MCU extracts CPU/IO parallelism for hiding IO delays. (1) Tile parallelism within an NN layer: while computing an output tile Tile0, the MCU can pre-load from flash the input tiles for computing the next output tile Tile1; while writing the completed Tile0 back to flash, the MCU can compute Tile1 simultaneously. (2) Layer parallelism: in a similar fashion, the MCU can overlap an earlier layer's computation with a later layer's IO. (3) Pipeline parallelism across data frames: the MCU can execute compute-bound and IO-bound layers for different frames in parallel, as these layers exercise complementary resources, namely CPU and IO bandwidth. As shown in Figure 2, the MCU swaps frame 0's FC layer while computing on frame 1's Conv layer.
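To make tile parallelism concrete, the following sketch models a layer's delay with and without overlap. It assumes one CPU and one serial IO channel, with the load of the next tile issued while the current tile is computed (double buffering); the function names and the example delays are hypothetical and the time unit is arbitrary.

```python
from typing import List, Tuple

Tile = Tuple[float, float, float]  # (load, compute, store) delays of one tile


def sequential_delay(tiles: List[Tile]) -> float:
    """Delay when every load, compute, and store runs back to back."""
    return sum(load + compute + store for load, compute, store in tiles)


def overlapped_delay(tiles: List[Tile]) -> float:
    """Delay when IO for neighboring tiles overlaps with computation."""
    if not tiles:
        return 0.0
    io_free = tiles[0][0]  # the first load cannot be hidden
    load_done = io_free    # time at which the current tile's input is ready
    cpu_free = 0.0
    for i, (_, compute, store) in enumerate(tiles):
        compute_done = max(cpu_free, load_done) + compute
        cpu_free = compute_done
        if i + 1 < len(tiles):
            # Prefetch the next tile on the IO channel during this compute.
            io_free += tiles[i + 1][0]
            load_done = io_free
        # The store must wait for both the IO channel and the computation.
        io_free = max(io_free, compute_done) + store
    return max(cpu_free, io_free)


compute_bound = [(1.0, 10.0, 1.0)] * 3  # Conv-like: compute dominates
io_bound = [(10.0, 1.0, 0.0)] * 3       # FC-like: loading weights dominates
```

In this model the compute-bound layer pays only for the first load and the last store on top of its computation, while the IO-bound layer is limited by its IO time and overlap hides only the computation.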

3 The Feasibility Study

Figure 3: Compute and IO delay of NN layers for (a) AlexNet (input shape: 227), (b) VGG16 (input shape: 224), and (c) MobileNet (input shape: 224, alpha: 1). MCU: ARM Cortex-M7 @ 216 MHz, memory buffer for each layer: 128 KB.

Table 1: A set of three NNs studied in this paper.

We study three representative NNs, whose memory footprints range from several MBs to hundreds of MBs (with quantization). As shown in Table 1, MobileNet has large feature maps but small weight parameters, AlexNet has small feature maps but large weight parameters, and VGG16 has a memory footprint 1000× larger than an MCU's SRAM size.

3.1 A taxonomy of NN layers

Table 2: Normalized arithmetic intensity ($I_N$) of NN layers across the MCU's common speed range (64–480 MOPS) and IO bandwidth range (10–40 MB/s). NN: VGG16

To study the swapping overhead, we focus on a layer’s swapping delay relative to its computation delay on typical MCUs. The rationale is that as MCU can perform swapping and computation in parallel, the longer of the two delays will be the layer’s bottleneck.

In general, arithmetic intensity, as commonly used in HPC [8], characterizes a workload's compute/IO ratio. It is defined as $I = W/Q$, where $Q$ is the amount of data to move in the memory hierarchy and $W$ is the amount of arithmetic operations on the data. By factoring in an MCU's CPU speed ($S_{cpu}$) and IO bandwidth ($B_{io}$), we define $I_N = (W/S_{cpu}) / (Q/B_{io})$ as the normalized arithmetic intensity on MCU. For a given layer, $I_N > 1$ means swapping incurs less delay than computation, i.e. a compute-bound layer; $I_N < 1$ means swapping incurs longer delay, i.e. an IO-bound layer.

On modern MCUs with simple CPU cores, $S_{cpu}$ is primarily determined by the CPU clockrate; it ranges from 64 MOPS to 480 MOPS [16, 13]. $B_{io}$ is jointly determined by the MCU's DMA bandwidth and the SD card bandwidth, ranging from 10 MB/s to 40 MB/s as reported in the literature [5]. With these values, common NN layers fall into three distinct categories per their normalized arithmetic intensity ($I_N$).
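As a rough illustration, the normalized arithmetic intensity can be computed as the ratio of a layer's compute delay to its IO delay. The sketch below applies it to two VGG16 layers using their standard shapes. The counting conventions are assumptions made here (one multiply-accumulate counted as one operation, one byte per value, each value moved once), so the results show the trend but are not expected to match Table 2.

```python
def normalized_intensity(
    ops: float, bytes_moved: float, cpu_ops_per_s: float, io_bytes_per_s: float
) -> float:
    """Compute delay divided by IO delay; > 1 means compute-bound."""
    compute_delay = ops / cpu_ops_per_s
    io_delay = bytes_moved / io_bytes_per_s
    return compute_delay / io_delay


# VGG16 second Conv layer: 224x224x64 input and output, 3x3x64x64 kernel.
conv_ops = 224 * 224 * 64 * (3 * 3 * 64)
conv_bytes = (3 * 3 * 64 * 64) + 2 * (224 * 224 * 64)

# VGG16 first FC layer: 25088 inputs, 4096 outputs.
fc_ops = 25088 * 4096
fc_bytes = 25088 * 4096 + 25088 + 4096

# Least favorable corner for Conv: fastest CPU (480 MOPS), slowest IO (10 MB/s).
conv_worst = normalized_intensity(conv_ops, conv_bytes, 480e6, 10e6)

# Most favorable corner for FC: slowest CPU (64 MOPS), fastest IO (40 MB/s).
fc_best = normalized_intensity(fc_ops, fc_bytes, 64e6, 40e6)

print(f"conv: {conv_worst:.2f}, fc: {fc_best:.2f}")
```

Under these assumptions the Conv layer stays compute-bound even in its least favorable corner and the FC layer stays IO-bound even in its most favorable one, which is the qualitative split the taxonomy relies on.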

(1) A majority of compute-bound layers ($I_N > 1$).    Notable examples are Conv layers, known for their high complexity. In the example of VGG16 (Table 2), the normalized arithmetic intensity $I_N$ of the Conv layers far exceeds 1 even with a high CPU clockrate and slow IO. They often dominate an NN's execution time (51% – 90%), as exemplified by the three NNs in Figure 3. On these layers, the computation delay overshadows the IO delay.

(2) Some IO-bound layers ($I_N < 1$).    Examples include fully connected (FC) and depth-wise convolutional (DW) layers. These layers perform light computation over large volumes of feature maps and weight parameters. Of all layers in an NN, they are often a minority (e.g. 2 out of 21 in VGG16). With out-of-core execution, the IO delay exceeds the computation delay by up to 10× (e.g. fc1 in Table 2 and Figure 3(b)).

(3) Other layers with insignificant overheads, e.g., Relu and Maxpooling. These layers have low complexity and contribute a tiny fraction of data to move and to compute (0.3%-0.9%) for an NN. As such, their swapping overhead is insignificant.

3.2 Impact on per-frame delays

Most NNs, with a small fraction of IO-bound layers, see negligible delay increase; NNs with more IO-bound layers see modest delay increase.

Within a compute-bound layer, the MCU can execute IO and computation for consecutive tiles simultaneously (as these tiles are independent), completely hiding the IO delay behind the much longer computation delay. Within an IO-bound layer, IO and compute for consecutive tiles can happen simultaneously as well, but the long IO delay cannot be totally hidden by the relatively shorter compute delay. For other layers, e.g. relu/pooling, the IO/compute delay is insignificant.

As such, the increased delay of an NN due to swapping is mainly determined by the ratio of the IO-bound layers' IO delay to all layers' total compute delay. The increased delay for NNs with fewer IO-bound layers is negligible. For VGG and AlexNet (Table 1), only 2 out of 13 and 3 of 5 layers are IO-bound, leading to 3.3% and 3.6% increased delay. The increased delay for NNs with more IO-bound layers is modest. For MobileNet (Table 1), 13 of 28 layers are IO-bound, leading to 24.2% increased delay. Overall, the increased delay due to swapping is negligible for most NNs and modest for some special NNs.

3.3 Impact on NN throughput

NNs see negligible throughput loss.

NNs with negligible delay increase will also see negligible throughput loss when processing a stream of frames. For those NNs seeing higher delay increase, fortunately, MCU can reduce throughput loss by exploiting pipeline parallelism across frames.

A common pattern in an NN is that one or more compute-bound layers are followed by one or more IO-bound layers, i.e. a pipeline with interleaved compute-bound and IO-bound stages. For instance, in the AlexNet in Figure 3(a), conv1-5 (compute-bound stage) is followed by fc6-8 (IO-bound stage). When executing an NN on a sequence of frames, the MCU can overlap IO-bound and compute-bound stages of adjacent frames, hence hiding the IO delays that cannot be hidden at the layer/tile levels within each frame. As shown in Figure 4, the MCU can swap for frame 0's FC layers while computing frame 1's Conv layers, leading to high MCU/IO utilization and throughput.

The throughput loss for a pair of compute-bound and IO-bound stages is zero if their compute delay is longer than their IO delay. As shown in Figure 3: (1) both AlexNet and VGG have one compute-bound stage followed by one IO-bound stage, and their compute delay is much longer than their IO delay (AlexNet: 20 vs. 12. VGG: 602 vs. 55), so swapping incurs no throughput loss for them. (2) MobileNet has 13 pairs of compute/IO-bound layers. Only 2 of 13 pairs (dw/pw-1/2) and two layers (conv1 and preds) have throughput loss (1.4% – 93%) because their IO delay is longer than their compute delay, leading to 2.5% overall throughput loss for MobileNet. Overall, throughput loss due to swapping is negligible for NNs.
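The difference between the delay and throughput results can be seen in a small model, sketched below under idealized assumptions: within a frame each layer costs the longer of its compute and IO delays, while across frames the CPU and the IO channel are each kept busy, so the steady-state period per frame is the larger of the total compute time and the total IO time. The function names and the two-layer example are hypothetical.

```python
from typing import List, Tuple

Layer = Tuple[float, float]  # (compute delay, IO delay) of one layer


def delay_increase(layers: List[Layer]) -> float:
    """Relative per-frame delay increase versus an infinite-memory MCU."""
    in_core = sum(compute for compute, _ in layers)
    out_of_core = sum(max(compute, io) for compute, io in layers)
    return out_of_core / in_core - 1.0


def throughput_loss(layers: List[Layer]) -> float:
    """Relative throughput loss with pipelining across frames."""
    cpu_total = sum(compute for compute, _ in layers)
    io_total = sum(io for _, io in layers)
    period = max(cpu_total, io_total)  # the busier resource sets the pace
    return 1.0 - cpu_total / period


# A compute-bound stage followed by an IO-bound stage (arbitrary time units).
example = [(10.0, 1.0), (1.0, 4.0)]
print(f"delay +{delay_increase(example):.1%}, "
      f"throughput -{throughput_loss(example):.1%}")
```

In the example the IO-bound stage lengthens each frame's delay, yet the throughput is unaffected because the total IO time still fits under the total compute time once frames are pipelined.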

3.4 Impact on flash durability

The SD card sees negligible durability loss, and its lifetime could be years or tens of years with swapping.

The amount of data written to the SD card per frame is not large because NN layers are read-most, and the write frequency is low due to the long execution time on a slow MCU.

Modest write rate

For a given NN and SRAM size, the amount of data written to the SD card is determined by the frame rate (the reciprocal of the delay per frame) and the amount of data to write per frame (bounded above by the sum of the output feature maps of all layers). The two are negatively correlated: (1) for large NNs, the frame rate is low but the amount of data to write per frame is large; (2) for small NNs, the frame rate is high but the amount of data to write per frame is small. Therefore, whether an NN is large or small, the amount of data written per day stays modest. For instance, swapping writes only 2.0/2.8 GB per day for VGG16/AlexNet. Even in the extreme case of MobileNet, which has a high frame rate and relatively large feature maps to write, swapping writes 123 GB per day.

Figure 4: AlexNet: tile parallelism for low delay and pipeline parallelism for high throughput.

SD card has long lifetime even with swapping

An SD card is built from many cells, each with a limited number of write cycles [4]. As card capacity grows [3], the durability budget keeps increasing. The study in [9] wrote 24/7 as fast as possible to 40 SD cards of 4 GB each; 1, 20, and 40 of the 40 cards had seen their first failures after 6.5 TB, 9 TB, and 12.5 TB of data had been written to them, respectively. Based on these results, the first cell on a 64 GB SD card is expected to fail only after running MobileNet, AlexNet, and VGG16 for 2.4 – 4.5, 104 – 200, and 145 – 280 years, respectively, and 50% of cells fail (10K cycles per cell [1, 21]) only after running for 7.5, 328, and 460 years.
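A back-of-envelope version of this estimate is sketched below for MobileNet. It rests on assumptions made here rather than stated in the text: endurance scales linearly with card capacity, writes are spread evenly over all cells, 1 TB is 1000 GB, and a year has 365 days. The function names are hypothetical, and the results land close to the reported figures without reproducing them exactly.

```python
def years_to_first_failure(
    tb_on_4gb_card: float, card_gb: float, gb_written_per_day: float
) -> float:
    """Scale the data written before first failure on a 4 GB card to a larger card."""
    endurance_gb = tb_on_4gb_card * 1000.0 * (card_gb / 4.0)
    return endurance_gb / gb_written_per_day / 365.0


def years_to_half_worn(
    card_gb: float, cycles_per_cell: float, gb_written_per_day: float
) -> float:
    """Years until half of the card's total write budget is consumed."""
    endurance_gb = 0.5 * card_gb * cycles_per_cell
    return endurance_gb / gb_written_per_day / 365.0


MOBILENET_GB_PER_DAY = 123.0  # write rate reported above

first_low = years_to_first_failure(6.5, 64.0, MOBILENET_GB_PER_DAY)
first_high = years_to_first_failure(12.5, 64.0, MOBILENET_GB_PER_DAY)
half_worn = years_to_half_worn(64.0, 10_000.0, MOBILENET_GB_PER_DAY)

print(f"first failure: {first_low:.1f} to {first_high:.1f} years")
print(f"half worn: {half_worn:.1f} years")
```

Substituting the 2.0 GB and 2.8 GB daily write rates of VGG16 and AlexNet gives lifetimes in the hundreds of years under the same assumptions.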

3.5 Impact on system energy

Swapping adds modest energy consumption to an already busy MCU.

We estimate the worst-case energy overhead due to swapping. Our test platform is an STM32F746NG-Discovery board (ARM Cortex-M7 at 216 MHz; 340 KB SRAM) with an external power meter [22]. We run two benchmarks. (1) in-core emulates NN executions with an infinite amount of memory: it runs NN compute [34] for 1000 iterations. (2) out-of-core emulates NN executions with the most intensive IO traffic in parallel to the compute: it executes the same amount of compute with an IO thread repeatedly flushing data blocks to SD card. Each data block is 100 KB (close to tile size); the flush is asynchronous using the MCU’s DMA engine.

Our measurement shows that the additional IO workload increases the system energy by 42%, from 0.07 Wh (in-core) to 0.10 Wh (out-of-core); the total execution time goes from 178 sec to 213 sec. Our observations are: (1) The actual energy overhead in out-of-core NNs is likely much less: while the out-of-core benchmark keeps IO always busy, actual out-of-core NNs exercise IO intermittently (§3.1) because most NN layers are likely compute-bound. (2) We attribute the modest energy overhead to the incremental nature of system energy: when an MCU-based device is already busy executing compute, its most power-hungry hardware (cores, interconnect, SRAM, and regulators) is already activated; executing IO, which additionally activates an SD card and the MMC controller, adds to the energy but not much.
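The reported numbers can be decomposed into a runtime effect and a power effect, as the short calculation below shows. It uses only the rounded values quoted above, so the energy ratio it produces (about 43%) differs slightly from the reported 42%, which presumably comes from unrounded measurements.

```python
energy_in_wh, energy_out_wh = 0.07, 0.10  # in-core vs. out-of-core energy
time_in_s, time_out_s = 178.0, 213.0      # in-core vs. out-of-core runtime

# Average power in watts: convert Wh to joules, then divide by seconds.
power_in_w = energy_in_wh * 3600.0 / time_in_s
power_out_w = energy_out_wh * 3600.0 / time_out_s

energy_increase = energy_out_wh / energy_in_wh - 1.0
time_increase = time_out_s / time_in_s - 1.0
power_increase = power_out_w / power_in_w - 1.0

print(f"energy +{energy_increase:.1%}, time +{time_increase:.1%}, "
      f"average power +{power_increase:.1%}")
```

On these figures, roughly half of the worst-case energy increase comes from the longer runtime and half from the higher average power while the SD card is active.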

3.6 Out-of-core data security and safety

Compared to storing NN data in on-chip SRAM, (temporarily) storing it off-chip is more vulnerable to physical attacks [15]: adversaries may learn or corrupt the data by tapping into the IO bus between the MCU and the SD card, or the SD card itself. Fortunately, by encrypting NN data before swapping it out, the MCU can keep the data confidential and protect its integrity; the overhead is linear in the amount of data. Hardware crypto, such as for AES [19, 38], is already common on modern MCUs. Its computation overhead is comparable to (or even less than) the least intensive NN compute (e.g. FC layers).

Compared to SRAM, SD cards are less durable. Yet, it is known that an SD card rarely fails as a whole; instead, it sees a gradual increase in the number of corrupted cells over time [11]. Cell corruption is often silent, i.e. a read value simply differs from what was written last time. Fortunately, the MCU can detect such failures with hash-based integrity checking. With specialized hardware on MCUs, computing a hash is no more expensive than the least intensive NN compute [19]. Upon detecting bad cells, the MCU can recompute the most recent NN layer and recover the corrupted out-of-core data.
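The detect-and-recompute idea can be sketched as follows. This is an illustration under assumptions, not the paper's mechanism: the class `CheckedTileStore` and its methods are hypothetical, a dictionary stands in for the SD card, SHA-256 from Python's `hashlib` stands in for the MCU's hash hardware, and digests are assumed to be kept on-chip.

```python
import hashlib
from typing import Callable, Dict


class CheckedTileStore:
    """Out-of-core tile storage with hash-based integrity checking."""

    def __init__(self) -> None:
        self._flash: Dict[int, bytes] = {}    # stands in for the SD card
        self._digests: Dict[int, bytes] = {}  # assumed to live in on-chip memory

    def write(self, tile_id: int, data: bytes) -> None:
        self._flash[tile_id] = bytes(data)
        self._digests[tile_id] = hashlib.sha256(data).digest()

    def read(self, tile_id: int, recompute: Callable[[int], bytes]) -> bytes:
        data = self._flash[tile_id]
        if hashlib.sha256(data).digest() != self._digests[tile_id]:
            # Silent corruption detected: regenerate the tile and rewrite it.
            data = recompute(tile_id)
            self.write(tile_id, data)
        return data


store = CheckedTileStore()
store.write(0, b"feature-map-tile")

# Simulate a silently corrupted cell on the card.
store._flash[0] = b"feature-map-tilf"

recovered = store.read(0, recompute=lambda tile_id: b"feature-map-tile")
```

The digest is far smaller than the tile it protects, so keeping digests on-chip costs little memory; where they would actually be stored on a given MCU is a design choice the text leaves open.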

4 Concluding remarks

Implications on model compression

Our solution boosts design freedom in tinyML, where the memory limit was considered the primary motivation for model compression. With such a limit removed, developers now have the choice of running large NNs without compression, retaining full model accuracy. Even when model compression is warranted, e.g. for faster NN execution, developers now have a wider selection of baseline NNs, including ones whose memory footprints are orders of magnitude larger than MCU memory.

Relation to prior work

Prior work enables out-of-core NN training with large batches on GPU/CPU memory systems [29, 31, 37, 47, 32], but it cannot address the unique challenge on MCUs that even a single layer exceeds main memory during NN inference, and no prior work has studied how swapping affects SD card lifetime, execution slowdown, energy consumption, and data security. Our study answers these questions and shows that swapping is feasible without much overhead.

Tensorflow Lite Micro [24] is a framework for running NN inference on embedded devices. CMSIS-NN [34] provides optimized NN kernels for ARM Cortex-M MCUs. SONIC [26] supports intermittent computing for NN inference on MCUs. However, none of them supports out-of-core NN inference on MCUs, as the swapping solution does. Swapping is complementary and can be integrated into existing systems.


This paper advocates enabling large NNs on tiny MCUs without losing accuracy by swapping data to an SD card. Our study shows that none of SD card durability loss, execution slowdown, energy consumption, or data security is a showstopper. We find that an MCU with hundreds of KBs of SRAM can execute NNs with a few hundred MBs of memory footprint (a 1000× gap). Out-of-core execution expands the scope of NN applications on MCUs.