Machine learning (ML) inference on the edge is an increasingly attractive prospect due to its potential for increasing the energy efficiency Fedorov et al. (2019), privacy, responsiveness Zhang et al. (2017), and autonomy of edge devices. Thus far, the field of edge ML has predominantly focused on mobile inference, which has led to numerous advancements in machine learning models, such as exploiting pruning, sparsity, and quantization. But in recent years, there have been major strides in expanding the scope of edge systems. Interest is brewing in both academia Fedorov et al. (2019); Zhang et al. (2017) and industry Flamand et al. (2018); Warden (2018a) towards expanding the scope of edge ML to microcontroller-class devices.
The goal of “TinyML” tinyML Foundation (2019) is to bring ML inference to ultra-low-power devices, typically under a milliwatt, and thereby break the traditional power barrier preventing widely distributed machine intelligence. By performing inference on-device and near-sensor, TinyML enables greater responsiveness and privacy while avoiding the energy cost associated with wireless communication, which at this scale is far higher than that of compute Warden (2018b). Furthermore, the efficiency of TinyML enables a class of smart, battery-powered, always-on applications that can revolutionize the real-time collection and processing of data. This emerging field, which is the culmination of many innovations, is poised to further accelerate its growth in the coming years.
To unlock the full potential of the field, hardware-software co-design is required. Specifically, TinyML models must be small enough to fit within the tight constraints of MCU-class devices (e.g., a few hundred kB of memory and limited onboard compute horsepower, on the order of MHz processor clock speeds), thus limiting the size of the input and the number of layers Zhang et al. (2017) or necessitating the use of lightweight, non-neural-network-based techniques Kumar et al. (2017). TinyML tools are broadly defined as anything that enables the design, mapping, and deployment of TinyML algorithms, including aggressive quantization techniques Wang et al. (2019), memory-aware neural architecture searches Fedorov et al. (2019), frameworks TensorFlow, and efficient inference libraries Lai et al. (2018); Garofalo et al. (2019). Efforts in TinyML hardware include improving inference on the next generation of general-purpose MCUs 16; Flamand et al. (2018), developing hardware specialized for low-power inference, and creating novel architectures intended only as inference engines for specific tasks Moons et al. (2018).
The complexity and dynamism of the field obscure the measurement of progress and make design decisions intractable. To enable continued innovation, a fair and reliable method of comparison is needed. Since progress is often the result of increased hardware capability, a reliable TinyML hardware benchmark is required.
Table 1: TinyML use cases, model types, and datasets by input type.
|Input Type|Use Cases|Model Types|Datasets|
| |Audio Wake Words| |Speech Commands Warden (2018a); Audioset Gemmeke et al. (2017); ExtraSensory Vaizman et al. (2017)|
| |Visual Wake Words| |Visual Wake Words Chowdhery et al. (2019); CIFAR10 Krizhevsky et al. (2009); MNIST LeCun and Cortes (2010); ImageNet Deng et al. (2009); DVS128 Gesture Amir et al. (2017)|
|Physiological / Behavioral Metrics| | |Physionet Goldberger et al. (2000); HAR Cramariuc (2019); DSA Altun et al. (2010); Opportunity Roggen et al. (2010)|
| |Sensing (light, temp, etc)| |UCI Air Quality De Vito et al. (2008); UCI Gas Vergara et al. (2012); UCI EMG Lobov et al. (2018); NASA’s PCoE Saxena and Goebel (2008)|
In this paper, we discuss the challenges and opportunities associated with the development of a TinyML hardware benchmark. Our short paper is a call to action for establishing a common benchmark for TinyML workloads on emerging TinyML hardware to foster the development of TinyML applications. The points presented here reflect the ongoing effort of the TinyMLPerf working group, which currently comprises over 30 organizations and 75 members.
The rest of the paper is organized as follows. In Section 2, we discuss the application landscape of TinyML, including the existing use cases, models, and datasets. In Section 3, we describe the existing TinyML hardware solutions, including outlining improvements to general-purpose MCUs and the development of novel architectures. In Section 4, we discuss the inherent challenges of the field and how they complicate the development of a benchmark. In Section 5, we describe the existing benchmarks that relate to TinyML and identify the deficiencies that still need to be filled. In Section 6, we discuss the progress of the TinyMLPerf working group thus far and describe the next steps. In Section 7, we conclude the paper and discuss future work.
2 Tiny Use Cases, Models & Datasets
In this section we attempt to summarize the field of TinyML by describing a set of representative use cases (Section 2.1), their relevant datasets (Section 2.2), and the model architectures commonly applied to these specific use cases (Section 2.3).
2.1 Use Cases
Despite the general lack of maturity within the field, there are a number of well-established TinyML use cases. We categorize the application landscape of TinyML by input type in Table 1; in the context of TinyML systems, the input type plays a crucial role in the use case definition.
Audio wake words is already a fairly ubiquitous example of always-on ML inference. Audio wake words is generally a speech classification problem that achieves very low power inference by limiting the label space, often to two labels: “wake word” and “not wake word” Zhang et al. (2017).
Other deployed TinyML applications, like activity recognition from IMU data Hassan et al. (2018), rely on low feature dimensionality to fit within the tight constraints of the platforms. Some use cases have been proven viable, but have yet to reach end users because they are too new, like visual wake words Chowdhery et al. (2019).
Many traditional ML use cases can be considered futuristic TinyML tasks. As ultra-low-power inference hardware continues to improve, the threshold of viability expands. Tasks like large-label-space image classification or object counting are well suited for low-power always-on applications but are currently too compute- and memory-hungry for today’s TinyML hardware. Furthermore, TinyML has a significant role to play in future technology. For example, many of the fundamental features of augmented reality (AR) glasses are always-on and battery-powered. Due to tight real-time constraints, these devices cannot afford the latency of offloading computation to the cloud, an edge server, or even an accompanying mobile device. Thus, due to shared constraints, AR applications can benefit significantly from progress in the field of TinyML.
2.2 Datasets
There are a number of open-source datasets that are relevant to TinyML use cases. Table 1 breaks them down by the type of data. Despite the availability of these datasets, the majority of deployed TinyML models are trained on much larger, proprietary datasets. The open-source datasets that are competitively large are not TinyML-specific. The lack of large, TinyML-focused, open-source datasets slows the progress of academic research and limits the ability of a benchmark to represent real workloads accurately.
2.3 Models
Table 1 lists common model types for TinyML use cases. Although neural networks (NNs) are a dominant force in traditional ML, it is common to use non-NN-based solutions, like decision trees Kumar et al. (2017), for some TinyML use cases due to their low compute and memory requirements.
Machine learning on MCU-class devices has only recently become feasible; therefore, the community has yet to produce models that are as widely accepted as MobileNets have become for mobile devices. This makes the task of selecting representative models challenging. However, immaturity also brings opportunity, as our decisions can help direct future progress. Selecting a subset of the currently available models, outlining the rules for quality versus accuracy trade-offs, and prescribing a measurement methodology that can be faithfully reproduced will encourage the community to develop new models, runtimes, and hardware that progressively outperform one another.
3 Tiny Hardware Constraints
TinyML hardware is defined by its ultra-low power consumption, which is often in the range of 1 mW and below. At the top of this range are efficient 32-bit MCUs, like those based on the Arm Cortex-M7 or RISC-V PULP processors, and at the bottom are novel ultra-low-power inference engines. Even the largest TinyML devices consume drastically less power than the smallest traditional ML devices. Figure 1 shows the logarithmic comparison of the active power consumption between TinyML devices and those currently supported by MLPerf (v0.5 inference results from the open and closed divisions). TinyML devices can have power budgets up to four orders of magnitude smaller than those of state-of-the-art MLPerf systems.
The advent of low-power, cheap 32-bit MCUs has revolutionized the compute capability at the very edge. Cortex-M-based platforms now regularly perform tasks that were previously infeasible at this scale, mostly due to support for single instruction, multiple data (SIMD) and digital signal processing (DSP) instructions. This fast vector math supports NN and highly efficient SVM implementations; it also accelerates many feature computations using 8-bit fixed-point arithmetic.
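To make the idea of 8-bit fixed-point arithmetic concrete, the following is a minimal sketch of a quantized dot product. It is a scalar illustration in Python, not how a SIMD-capable MCU kernel would be written, and the function names, the symmetric scaling scheme, and the scale values are assumptions chosen for the example.

```python
def quantize(values, scale):
    """Map floats to int8 with a symmetric scale, so that value ~= q * scale."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(qa, qb):
    """Integer multiply-accumulate; the accumulator is wider than 8 bits."""
    acc = 0
    for a, b in zip(qa, qb):
        acc += a * b
    return acc


# Hypothetical weights, inputs, and scales chosen for illustration only.
weights = [0.5, -0.25, 0.75]
inputs = [1.0, 0.5, -1.0]
w_scale = 1 / 128
x_scale = 1 / 64

qw = quantize(weights, w_scale)   # [64, -32, 96]
qx = quantize(inputs, x_scale)    # [64, 32, -64]

# Rescale the integer accumulator back to a real-valued result.
approx = int8_dot(qw, qx) * w_scale * x_scale
exact = sum(w * x for w, x in zip(weights, inputs))
print(approx, exact)  # both are -0.375 for these values
```

All multiplications happen on small integers, and only a single rescaling step is needed at the end, which is what makes this style of arithmetic a good fit for hardware with fast integer vector instructions.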
A feature of MCUs is the prevalence of on-chip SRAM and embedded Flash. Thus, when models can fit within the tight on-chip memory constraints, they are free of the costly DRAM accesses that hamper traditional ML. Widespread adoption and dispersion of TinyML are reliant on the capability of these platforms.
Although general-purpose MCUs provide flexibility, the highest TinyML performance efficiency comes from specialized hardware. Novel architectures can achieve performance in the range of one microjoule per inference Holleman (2019). These specialized devices expand the boundaries of ML to the ultra-low-power end of TinyML processors.
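As a rough illustration of what one microjoule per inference means for an always-on device, the sketch below converts energy per inference into average power and an idealized battery lifetime. The inference rate and the battery capacity are hypothetical values, and the calculation ignores sensors, sleep current, and battery self-discharge.

```python
SECONDS_PER_DAY = 86_400


def average_power_watts(energy_per_inference_j, inferences_per_second):
    """Average power drawn by inference alone (energy per inference times rate)."""
    return energy_per_inference_j * inferences_per_second


def battery_life_days(battery_energy_j, average_power_w):
    """Idealized lifetime: stored energy divided by average power."""
    return battery_energy_j / average_power_w / SECONDS_PER_DAY


energy_per_inference = 1e-6           # one microjoule, the figure quoted in the text
rate = 10.0                           # hypothetical: 10 inferences per second
battery_energy = 0.225 * 3.0 * 3600   # hypothetical 225 mAh cell at 3 V, in joules

power = average_power_watts(energy_per_inference, rate)
days = battery_life_days(battery_energy, power)
print(f"average power: {power * 1e6:.1f} uW, idealized lifetime: {days:.0f} days")
```

Under these assumptions the inference workload averages about 10 microwatts, which is well below the 1 mW range discussed above.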
TinyML systems present a number of unique challenges to the design of a performance benchmark that can be used to measure and quantify performance differences between various systems systematically. We discuss the three primary obstacles and postulate how they might be overcome.
4.1 Low Power
Low power consumption is one of the defining features of TinyML systems. Therefore, a useful benchmark should ostensibly profile the energy efficiency of each device. However, there are many challenges in fairly measuring energy consumption. Firstly, as illustrated in Figure 1, TinyML devices can consume drastically different amounts of power, which makes maintaining accuracy across the range of devices difficult. Secondly, the scope of the power measurement is difficult to determine when data paths and pre-processing steps can vary significantly between devices. Other factors, like chip peripherals and underlying firmware, can also impact the measurements. Unlike traditional high-power ML systems, TinyML systems do not have spare cores to load the System-Under-Test (SUT) with minimal overheads.
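One simple way to turn a sampled power trace into an energy figure is numerical integration. The sketch below uses the trapezoidal rule; it is an illustration only, not a measurement methodology prescribed by the working group, and the trace values and function names are hypothetical.

```python
def trace_energy_joules(timestamps_s, power_w):
    """Integrate a sampled power trace (watts over seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    energy = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        energy += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy


def energy_per_inference_joules(timestamps_s, power_w, num_inferences):
    """Average energy per inference over the measured window."""
    return trace_energy_joules(timestamps_s, power_w) / num_inferences


# Hypothetical trace: four samples taken 1 ms apart, in watts.
timestamps = [0.000, 0.001, 0.002, 0.003]
power = [0.0005, 0.0010, 0.0010, 0.0005]

total = trace_energy_joules(timestamps, power)
per_inference = energy_per_inference_joules(timestamps, power, num_inferences=2)
print(f"total: {total:.2e} J, per inference: {per_inference:.2e} J")
```

The result depends directly on which samples fall inside the measured window, which is why the scope of the measurement (for example, whether pre-processing is included) matters so much for fairness.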
4.2 Limited Memory
Due to their small size, TinyML systems often have tight memory constraints. While traditional ML systems like smartphones cope with resource constraints on the order of a few GBs, TinyML systems typically cope with resources that are two orders of magnitude smaller.
Memory is one of the primary motivating factors for the creation of a TinyML-specific benchmark. Traditional ML benchmarks use inference models that have drastically higher peak memory requirements (on the order of gigabytes) than TinyML devices can provide. This also complicates the deployment of a benchmarking suite, as any overhead can significantly impact power consumption or even make the benchmark too big to fit. Individual benchmarks must also cover a wide range of devices; therefore, multiple levels of quantization and precision should be represented in the benchmarking suite. Finally, a variety of benchmarks should be chosen such that the diversity of the field is supported.
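The interaction between precision and memory budget can be illustrated with a back-of-the-envelope calculation. The sketch below estimates the storage needed for model weights at several bit widths and checks it against a memory budget. The parameter count and the 256 kB budget are hypothetical, and the estimate covers weights only, not activations, code, or runtime overhead.

```python
def weight_footprint_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights alone, rounded up to whole bytes."""
    return (num_params * bits_per_weight + 7) // 8


def fits_on_device(num_params, bits_per_weight, budget_bytes):
    """True if the weights alone fit within the given memory budget."""
    return weight_footprint_bytes(num_params, bits_per_weight) <= budget_bytes


budget = 256 * 1024   # hypothetical 256 kB budget
params = 100_000      # hypothetical model with 100k parameters

for bits in (32, 8, 4):
    size = weight_footprint_bytes(params, bits)
    verdict = "fits" if fits_on_device(params, bits, budget) else "does not fit"
    print(f"{bits:>2}-bit weights: {size} bytes, {verdict}")
```

In this example, the 32-bit version of the model exceeds the budget while the 8-bit and 4-bit versions fit, which is one reason a benchmark that targets a wide range of devices would need to represent more than one precision.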
4.3 Hardware Heterogeneity
Despite their nascency, TinyML systems are already diverse in their performance, power, and capabilities. Devices range from general-purpose MCUs to novel architectures, like event-based neural processors Brainchip or in-memory compute Kim et al. (2019). This heterogeneity poses a number of challenges, as the system under test (SUT) will not necessarily include otherwise standard features, like a system clock or debug interface. Furthermore, normalizing performance results across heterogeneous implementations is a key challenge.
Today’s state-of-the-art benchmarks are not designed to readily handle these challenges. They need careful re-engineering to be flexible enough to handle the extent of hardware heterogeneity that is commonplace in the TinyML ecosystem.
5 Related Work
There are a number of ML-related hardware benchmarks; however, none accurately represents the performance of TinyML workloads on tiny hardware. Table 2 shows a sampling of the widely accepted industry benchmarks that are directly applicable to the discussion on TinyML systems.
EEMBC CoreMark Gal-On and Levy (Technical report) has become the standard performance benchmark for MCU-class devices due to its ease of implementation and use of real algorithms. Yet, CoreMark does not profile full programs, nor does it accurately represent machine learning inference workloads.
EEMBC MLMark Torelli and Bangale addresses these issues by using actual ML inference workloads. However, the supported models are far too large for MCU-class devices and are not representative of TinyML workloads. They require far too much memory (GBs) and have significant run times. Additionally, while CoreMark supports power measurements with ULPMark-CM EEMBC, MLMark does not support power measurement, which is critical for a TinyML benchmark.
MLPerf, a community-driven benchmarking effort, has recently introduced a benchmarking suite for ML inference Reddi et al. (2019) and has plans to add power measurements. However, much like MLMark, the current MLPerf inference benchmark precludes MCUs and other resource-constrained platforms due to a lack of small benchmarks and compatible implementations.
As Table 2 summarizes, there is a clear and distinct need for a TinyML benchmark that caters to the unique needs of ML workloads, makes power a first-class citizen and prescribes a methodology that suits TinyML.
To overcome these challenges, we adopt a set of principles for the development of a robust TinyML benchmarking suite and select a set of preliminary use cases.
6.1 Open and Closed Divisions
As previously stated, TinyML is a diverse field; therefore, not all systems can be accommodated under strict rules. Without strict rules, however, direct comparison of the hardware becomes more difficult. To address this issue, we adopt MLPerf’s open and closed structure. More traditional TinyML solutions can submit to the closed division, where submissions must use a model that is considered equivalent to the reference model. TinyML systems that fall outside the bounds of the “closed” benchmark can submit results to the open division, which allows submissions to deviate as necessary from the closed reference. We believe this structure increases the inclusivity of the benchmarking suite while maintaining the comparability of the results.
6.2 Preliminary Use Cases
The group has selected three preliminary use cases to target: visual wake words, audio wake words, and anomaly detection. Visual wake words is a binary image classification task that indicates whether a person is visible in the image. Audio wake words refers to the common keyword-spotting task (e.g., “Alexa”, “Ok Google”, and “Hey Siri”). Anomaly detection is a broader use case that classifies time series data as “normal” or “abnormal”.
These use cases have been selected to represent the broad range of TinyML. They encompass three distinct input data types and range from relatively resource-hungry (visual wake words) to lightweight (anomaly detection). Furthermore, the models traditionally used for these use cases are varied; therefore, the benchmarking suite can support a diverse set of ML techniques.
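To give a sense of how lightweight the anomaly detection end of this range can be, the following is a minimal sketch of a threshold-based detector. It is not the working group's reference model; the z-score rule, the threshold of three standard deviations, and the sample readings are all assumptions made for illustration, and it labels individual samples rather than whole time series.

```python
from statistics import mean, stdev


def fit_baseline(normal_samples):
    """Summarize known-normal readings by their mean and sample standard deviation."""
    return mean(normal_samples), stdev(normal_samples)


def classify(sample, baseline, threshold=3.0):
    """Label a reading 'abnormal' if it lies more than `threshold` deviations from the mean."""
    mu, sigma = baseline
    return "abnormal" if abs(sample - mu) > threshold * sigma else "normal"


# Hypothetical sensor readings collected during normal operation.
normal_readings = [20.0, 20.5, 19.5, 20.2, 19.8]
baseline = fit_baseline(normal_readings)

print(classify(20.3, baseline))  # normal
print(classify(25.0, baseline))  # abnormal
```

A detector of this kind stores only two numbers and performs a handful of arithmetic operations per sample, in contrast with the image models typically associated with visual wake words.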
6.3 Future Work
Perfection is often the enemy of good; therefore, to fill the community’s need for comparability, our priority is to quickly establish a set of minimum viable benchmarks and iteratively address deficiencies. The benchmarking suite will continue to evolve to meet the needs of the community.
The next step is to select representative reference models for the three preliminary use cases and develop a reference implementation of each benchmark. We plan to finish development and accept result submissions before the end of 2020.
In conclusion, TinyML is an important and rapidly evolving field that requires comparability amongst hardware innovations to enable continued progress and stability. In this paper, we reviewed the current landscape of TinyML and highlighted the need for a hardware benchmark. Additionally, we analyzed the challenges associated with developing such a benchmark and discussed a path forward. We hope this work can act as a call to action to establish a community-driven, fair, and useful TinyML benchmark.
If you would like to contribute to the effort, join the working group here: https://groups.google.com/forum/#!forum/mlperf-tiny
References
- Altun et al. (2010). Comparative study on classifying human activities with miniature inertial and magnetic sensors. Pattern Recognition 43, pp. 3605–3620.
- Amir et al. (2017). A low power, fully event-based gesture recognition system. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7388–7397.
- Brainchip. Akida neuromorphic system on chip.
- Chowdhery et al. (2019). Visual wake words dataset. CoRR abs/1906.05721.
- Cramariuc (2019). PRECIS HAR. IEEE Dataport.
- De Vito et al. (2008). On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sensors and Actuators B: Chemical 129, pp. 750–757.
- Deng et al. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
- EEMBC. ULPMark - an EEMBC benchmark. Embedded Microprocessor Benchmark Consortium.
- Fedorov et al. (2019). SpArSe: sparse architecture search for CNNs on resource-constrained microcontrollers. In Advances in Neural Information Processing Systems 32, pp. 4978–4990.
- Flamand et al. (2018). GAP-8: a RISC-V SoC for AI at the edge of the IoT. In 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pp. 1–4.
- Gal-On and Levy. Exploring CoreMark - a benchmark maximizing simplicity and efficacy. Technical report, Embedded Microprocessor Benchmark Consortium.
- Garofalo et al. (2019). PULP-NN: accelerating quantized neural networks on parallel ultra-low-power RISC-V processors. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 378 (2164), pp. 20190155.
- Gemmeke et al. (2017). Audio set: an ontology and human-labeled dataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA.
- Goldberger et al. (2000). PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. e215–e220. Note: Circulation Electronic Pages: http://circ.ahajournals.org/content/101/23/e215.full PMID:1085218; doi: 10.1161/01.CIR.101.23.e215
- Hassan et al. (2018). A robust human activity recognition system using smartphone sensors and deep learning. Future Generation Computer Systems 81, pp. 307–313.
- 16. Helium: enhancing the capabilities of the smallest devices.
- Holleman (2019). The speed and power advantage of a purpose-built neural compute engine.
- Kim et al. (2019). A 1-16b precision reconfigurable digital in-memory computing macro featuring column-MAC architecture and bit-serial computation. In ESSCIRC 2019-IEEE 45th European Solid State Circuits Conference (ESSCIRC), pp. 345–348.
- Krizhevsky et al. (2009). CIFAR-10 (Canadian Institute for Advanced Research).
- Kumar et al. (2017). Resource-efficient machine learning in 2 KB RAM for the internet of things. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 1935–1944.
- Lai et al. (2018). CMSIS-NN: efficient neural network kernels for Arm Cortex-M CPUs.
- LeCun and Cortes (2010). MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/
- Lobov et al. (2018). Latent factors limiting the performance of sEMG-interfaces. Sensors 18, pp. 1122.
- Moons et al. (2018). BinarEye: an always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS. In 2018 IEEE Custom Integrated Circuits Conference (CICC), pp. 1–4.
- Reddi et al. (2019). MLPerf inference benchmark.
- Roggen et al. (2010). Collecting complex activity datasets in highly rich networked sensor environments. In 2010 Seventh International Conference on Networked Sensing Systems (INSS), pp. 233–240.
- Saxena and Goebel (2008). Turbofan engine degradation simulation data set.
- TensorFlow. TensorFlow Lite for microcontrollers.
- tinyML Foundation (2019). TinyML summit.
- Torelli and Bangale. Measuring inference performance of machine-learning frameworks on edge-class devices with the MLMark benchmark. EEMBC.
- Vaizman et al. (2017). Recognizing detailed human context in the wild from smartphones and smartwatches. IEEE Pervasive Computing 16 (4), pp. 62–74.
- Vergara et al. (2012). Chemical gas sensor drift compensation using classifier ensembles. Sensors and Actuators B: Chemical 166–167, pp. 320–329.
- Wang et al. (2019). HAQ: hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8612–8620.
- Warden (2018a). Speech commands: a dataset for limited-vocabulary speech recognition.
- Warden (2018b). Why the future of machine learning is tiny.
- Zhang et al. (2017). Hello edge: keyword spotting on microcontrollers.