The Internet-of-Things ecosystem is made possible by miniaturized, smart end-node devices that sense the surrounding environment and make decisions based on the information inferred from sensor data. Because of their tiny form factor, low cost, and battery-operated nature, these smart networked devices are severely constrained in memory capacity and maximum performance, and they use small Microcontroller Units (MCUs) as their main on-board computing device. At the same time, there is an ever-growing interest in deploying more accurate and sophisticated data analytics pipelines, such as Deep Learning (DL) inference models, directly on IoT end-nodes. These competing needs have given rise in the last few years to a specific branch of machine learning (ML) and DL research called TinyML, focused on shrinking and compressing top-accurate DL models with respect to the target device characteristics.
The primary limitation of the current generation of TinyML hardware and software is that it is mostly focused on inference. The inference task can be strongly optimized by quantizing or pruning the trained model. Many vendors of AI-oriented system-on-chips (SoCs) provide deployment frameworks to automatically translate DL inference graphs into human-readable or machine code. This train-then-deploy design process rigidly separates the learning phase from runtime inference. The result is a static design flow for the intelligence model that cannot adapt to phenomena such as data distribution shift: a change in the statistical properties of real incoming data with respect to the training set, which often makes smart sensor platforms unreliable when deployed in the field.
Even if the algorithms themselves are technically capable of learning and adapting to new incoming data, the update process can only be handled by a centralized service running on the cloud or on host servers. In this setting, the original training dataset has to be enriched with the newly collected data, and the model has to be retrained from scratch on the enlarged dataset, adapting to the new data without forgetting the original information. Such an adaptive mechanism belongs to the rehearsal category and requires storing the full training set, often amounting to gigabytes of data. Additionally, large amounts of data have to be collected centrally through network communication, raising potential security and privacy concerns, as well as issues of radio power consumption and network reliability in non-urban areas.
We argue that a robust and privacy-aware solution to these challenges is to enable Lifelong Learning, also known as Continual Learning (CL), on future smart IoT end-nodes: the capability to autonomously adapt to the ever-changing surrounding environment by learning continually (only) from incoming data without forgetting the original knowledge; the loss of such knowledge is known as catastrophic forgetting. Although many approaches exist to learn from data, the focus has recently moved to improving the recognition accuracy of DL models because of their superior capabilities, accounting for new data belonging to known classes (domain-incremental CL) or to new classes (class-incremental CL) [22, 43]. The recently proposed CL techniques are grouped into three categories: architectural, regularization, and memory (or rehearsal) strategies. Architectural approaches specialize a subset of parameters for every (new and old) task but require the task-ID information at inference time, indicating the nature of the current task in a multi-head network; they are therefore not suitable for class- or domain-incremental continual learning. In these latter scenarios, memory-based approaches, which preserve samples from previous tasks for replaying, perform better than regularization techniques, which address catastrophic forgetting at low memory cost simply by imposing constraints on the network parameter update [43, 59, 12]. This finding was confirmed during the recent CL competition at CVPR 2020, where the best entry leveraged rehearsal-based strategies.
The main drawback of memory-based CL approaches is the high memory overhead for storing previous samples: the memory requirement can potentially grow over time, preventing the applicability of these methods at the tiny scale. To address this problem, Pellegrini et al. have recently introduced Continual Learning based on Latent Replays (LRs). The idea is to combine the new data for the incremental learning tasks with a few old data points taken from the original training set, but encoded into a low-dimensional latent space to reduce the memory cost. Hence, the previous knowledge is retained by means of Latent Replay samples, i.e. the intermediate feature maps of the DL model inference, selected so that they require less space than the input data (up to 48× smaller compared to raw images). This strategy also reduces the computational cost: the Latent intermediate layer splits the network into a frozen stage at the front and an adaptive stage at the back, and only layers in the latter need to be updated. So far, LR-based Continual Learning has been successfully prototyped on high-performance embedded devices such as smartphones, including a Snapdragon-845 CPU running Android OS in the power envelope of a few Watts (https://hothardware.com/reviews/qualcomm-snapdragon-845-performance-benchmarks). In contrast, in this work we focus on IoT applications and TinyML devices, with 100× tighter power constraints and 1000× smaller memories available.
In our preliminary work, we proposed the early design concept of a HW/SW platform for Continual Learning based on the Parallel Ultra Low Power (PULP) paradigm, and assessed the computational and memory costs of deploying Latent Replay-based CL algorithms.
In this paper, we complete and extend that effort by introducing several novel contributions from the software stack, system integration, and algorithm viewpoints. To the best of our knowledge, we present the first TinyML processing platform and framework capable of on-device CL, together with the design flow required to sustain learning tasks within a power envelope of a few tens of mW (lower than state-of-the-art solutions). The proposed platform is based on VEGA, a recently introduced end-node System-on-Chip prototype fabricated in 22nm technology. Unlike traditional low-power and flexible MCU designs, VEGA exploits explicit data parallelism by featuring a multi-core SW-programmable RISC-V cluster with shared Floating Point Units (FPUs), a DSP-oriented ISA, and optimized memory management to enable the learning paradigm on low-end IoT devices. Additionally, to achieve minimum-cost on-device retention of Latent Replays and to better enable deployment on an ultra-low-power platform, we extend the LR algorithm proposed by Pellegrini et al. to work with a fully quantized frozen front-end and to compress Latent Replays using quantization down to 7 bits, with a small accuracy drop (almost lossless for 8-bit) when compared to the single-precision floating-point datatype (FP32) on the Core50 CL classification benchmark.
In summary, the contributions of this work are:
We extend the LR algorithm to work with an 8-bit quantized and frozen front-end without impact on the CL process, and to support LR compression with quantization, reducing the memory needed for rehearsing by up to 4.5×. We call this extension Quantized Latent Replay-based Continual Learning, or QLR-CL.
We propose a set of CL primitives, including forward and backward propagation of common layers such as convolution, depthwise convolution, and fully connected layers, fine-tuned for optimized execution on VEGA, a TinyML platform for Deep Learning based on PULP and fabricated in 22nm technology. We also introduce a tiling scheme to manage data movement for the CL primitives.
We compare the performance of our CL primitives on VEGA with that on other devices that could in the future target on-chip at-edge learning, such as a state-of-the-art low-power STM32L4 microcontroller.
Our results show that Quantized Latent Replay-based Continual Learning leads to a minimal accuracy loss on the Core50 dataset compared to the FP32 baseline when the Latent Replay memory is compressed by means of 8-bit quantization. Compression to 7 bits can also be exploited, but at the cost of a slightly lower accuracy, up to 5% with respect to the baseline when retraining one of the intermediate layers. When testing the QLR-CL pipeline on the proposed VEGA platform, our CL primitives ran faster than the MCUs for TinyML currently available on the market. Compared against edge devices with a power envelope of 4W, our solution is more energy-efficient, enough to operate for 317h with a typical battery for embedded devices.
The rest of the paper is organized as follows: Section II discusses related work in CL, inference and learning at the edge, and hardware architectures targeted at edge learning. Section III introduces the proposed methodology for Quantized Continual Learning. Section IV describes the HW/SW architecture of the proposed TinyML platform. Section V evaluates and discusses experimental results. Section VI concludes the paper.
II Related Work
Table I: Edge solutions featuring on-device learning capabilities.
| Learning | layer's weights | Classification | Edge TPU |
| Retraining Biases | Image | EPYC AMD | ✓ | MEDIUM | LOW / |
| Add layer for transfer-learning based on streaming data | Detection | 33 BLE |
| Minicar | CNN backprop. from scratch on increasing dataset | Linear Camera, Class. 7 actions | GAP8 | ✓ | - | - | ✓ |
| | Class. 2 classes | (unbounded) |
| Hyperdimensional | EMG 10 gestures | Mr. Wolf | ✓ | ✓ | MEDIUM | LOW | ✓ |
| CNN backprop. w/ LRs | Image Class. 50 classes | Qualcomm Snapdragon | ✓ | HIGH | HIGH / MEDIUM | ✓ |
| [This Work] | w/ Quantized LRs | Class. 50 classes |
In this section, we first review the recent memory-efficient Continual Learning approaches before discussing the main solutions and methods for the TinyML ecosystem, including the first attempts for on-device learning on embedded systems.
II-A Memory-efficient Continual Learning
Unlike Transfer Learning [9, 7], which by design does not retain the knowledge of the originally learned task when learning a new one, Continual Learning (CL) has recently emerged as a technique to acquire new or extended capabilities without losing the original ones; the loss of previously acquired capabilities is known as catastrophic forgetting [22, 43]. One of the main causes of this phenomenon is that the newly acquired set breaks one of the main assumptions underlying supervised learning, i.e., that training data are statistically independent and identically distributed (IID). Instead, CL deals with training data that is organized in non-IID learning events. Maltoni et al. sort the main CL techniques into three groups: rehearsal, which includes a periodic replay of the past information; architectural, relying on a specialized architecture, layers, and activation functions to mitigate forgetting; and regularization-based, where the loss term is extended to encourage retaining memory of pre-learned tasks.
Among these groups, rehearsal CL strategies have emerged as the most effective at dealing with catastrophic forgetting, at the cost of an additional replay memory [10, 50, 48]. In the recent CL challenge at CVPR 2020 on the Core50 image dataset, 90% of the competitors used rehearsal strategies. The best entry of the more challenging New Instances and Classes track (the same scenario considered in our work), which is evaluated in terms of test accuracy but also memory and computation requirements, scores 91% by replaying image data. Unfortunately, this strategy is intractable for an IoT platform because of the expanding replay memory (up to 78k images) and the usage of a large DenseNet-161 model. Conversely, the Latent Replay-based approach relies on a fixed, and relatively small, amount of compressed latent activations as replay data; it scores 71% when retraining only the last layer, with a lower peak of (compressed) data points than the winning solution. Additionally, the Jodelet entry, also employing LR-based CL, achieves 83% thanks to 3 more replays and a more accurate pre-trained model (ResNet50). In our work, we focus on the Latent Replay-based approach because of its tunable accuracy-memory setting. Nevertheless, our proposed platform and compression methodology can be applied to any replay-based CL approach.
Other work uses discrete autoencoders to compress the input data for rehearsing. In contrast, we propose low-bitwidth quantization to compress the Latent Replay memory by 4× and, at the same time, to reduce the inference latency and the memory requirement of the frozen stage compared to a full-precision FP32 implementation.
II-B Deep Learning at the Extreme Edge
Two main trends can be identified for TinyML platforms targeting the extreme edge. On the one hand, Deep Learning applications are dominated by linear algebra, which is an ideal target for application-specific HW acceleration [45, 58]. Most efforts in this direction employ a variety of inference-only acceleration techniques, such as pruning and byte and sub-byte integer quantization, the use of large arrays of simple MAC units, or even mixed-signal techniques such as in-memory computing.
On the other hand, there are also many reasons for the alternative approach: running TinyML applications as software on top of commercial off-the-shelf (COTS) extreme-edge platforms, such as MCUs. Extreme-edge TinyML devices need to be very cheap; they have to be flexible due both to economy of scale and to their need for integration within larger applications, composed of both neural and non-neural tasks. For these reasons, there is a strong push towards squeezing the maximal performance out of platforms based on COTS ARM Cortex-M class microcontrollers and DSPs, such as STMicroelectronics STM32 microcontrollers (https://www.st.com/content/st_com/en/ecosystems/stm32-ann.html), or on multi-core parallel ultra-low-power (PULP) end-nodes, like GreenWaves Technologies GAP-8 (https://greenwaves-technologies.com/gap8_gap9/). To cope with the severe constraints in terms of memory and maximum compute throughput of these platforms, a large number of deployment tools have been recently proposed. Examples of this trend include non-vendor-locked tools such as Google TFLite Micro, ARM CMSIS-NN, and Apache TVM, as well as frameworks that only support specific families of devices, such as STMicroelectronics X-CUBE-AI (https://www.st.com/en/embedded-software/x-cube-ai.html), GreenWaves Technologies NNTOOL (https://greenwaves-technologies.com/sdk-manuals/nn_quick_start_guide), and DORY. Internally, these tools employ hardware-independent techniques, such as post-training compression and quantization [8, 16, 25], as well as hardware-dependent ones, such as data tiling and loop unrolling to boost data reuse, coupled with automated generation of optimized backend code.
As previously discussed, all of these efforts are mostly targeted at extreme-edge inference, with little hardware and/or software dedicated to training. Most of the techniques used to boost inference efficiency are not as effective for learning. For example, the vast majority of training is done in full-precision floating-point (FP32) or, with some restrictions, using half-precision floats (FP16), whereas inference is commonly pushed to INT8 or even below [8, 17]. IBM has recently proposed a specialized 8-bit format for training called HFP8, but its effectiveness is still under investigation.
Hardware-accelerated on-device learning has so far been either limited to high-performance embedded platforms, e.g., NVIDIA TensorCores on Tegra Xavier (https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-xavier-nx) and mobile platforms such as Qualcomm Snapdragon 845, or very narrow in scope. For example, Shin et al. claim to implement an online adaptable architecture, but this is done using a simple LUT to selectively activate parameters, and does not support more powerful mechanisms based on gradient descent. A few recently proposed hardware accelerators for low-power training platforms [14, 55, 26, 31] enable partial gradient back-propagation by using selective and compressed weight updates, but they do not address the large memory footprint required by training. Finally, several online-learning devices using bio-inspired algorithms such as Spiking Neural Networks and High-Dimensional Computing have been proposed [20, 47, 32]. Most of these approaches, however, have only been demonstrated on simple MNIST-like tasks.
In this work, we propose what is, to the best of our knowledge, the first MCU-class hardware-software system capable of continual learning based on gradient back-propagation with an LR approach. We achieve these results by leveraging a few key ideas from the state of the art: INT8 inference, FP32 continual learning, and the exploitation of linear algebra kernels, back-propagation, and aggressive parallelization by deploying them on a multi-core FPU-enhanced PULP cluster.
II-C On-Device Learning on low-end platforms
Table I lists the main edge solutions featuring on-device learning capabilities. Every approach is evaluated by considering the memory and computational costs for the continual learning task and the suitability for deployment on highly resource-constrained (tiny) devices.
A first group of works deals with on-device transfer learning. The Coral Edge TPU, which has a power budget of several Watts, features SW support for on-device fine-tuning of the parameters of the last fully-connected layer. TinyTL demonstrated on a high-end CPU that the transfer learning task is more effective (+32% on the target Image Classification task) when retraining the bias terms and adding lite residual modules. TinyOL brought the transfer learning task to a tiny device, i.e. an Arduino Nano platform featuring a 64MHz ARM Cortex-M4, by adding a trainable layer on top of a frozen inference model. Because only the coefficients of the last layer are updated during the online training process, no backpropagation of error gradients applies. Compared to these works, we address a continual learning scenario and therefore provide a more capable and optimized HW/SW solution to match the memory and computational requirements of the adopted CL method.
Unlike the above works, de Prado et al. proposed a Continual Learning framework for self-driving mini-cars. The embedded PULP-based MCU engine streams new data to a remote server, where the inference model is retrained from scratch on the enhanced dataset to improve the accuracy over time. This full-rehearsal methodology cannot be migrated to low-end devices because of the unconstrained increase of the memory footprint. In contrast, Disabato et al. presented an online adaptive scheme based on a kNN classifier placed on top of a frozen feature-extraction CNN model. The final stage is updated by incrementally adding the labeled samples to the knowledge memory of the kNN classifier. This approach has been evaluated on a tiny STM32F76ZI device, but it has proven effective only on limited 2-class problems and presents an unbounded memory requirement, which scales linearly with the number of training samples. PULP-HD showed few-shot continual learning capabilities on an ultra-low-power prototype using Hyperdimensional Computing. During the training phase, the new data are mapped into a limited hyperdimensional space by means of a complex encoding procedure; at inference time, the incoming samples are compared to the computed class prototypes. The method has been demonstrated on a 10-gesture classification scenario based on EMG data, but lacks experimental evidence of effectiveness on complex image classification problems. In contrast to these works, we demonstrate superior learning capabilities for a TinyML platform by i) running backpropagation on-device to update intermediate layers, and ii) supporting a memory-efficient Latent Replay-based strategy to address catastrophic forgetting in a more complex Continual Learning scenario. An initial CNN-based prototype of a Continual Learning system using Latent Replays has also been presented. The authors demonstrated the on-device learning capabilities using a Qualcomm Snapdragon processor, which features a power envelope 100× higher than our target and is therefore not suitable for battery-operated tiny devices. In contrast to them, we also extend the LR algorithm by leveraging quantization to compress the LR memory requirements.
In this section, we analyze the memory requirements of the Latent Replay-based Continual Learning method and present QLR-CL, our strategy to reduce the memory footprint of the LR vectors based on a quantization process.
III-A Background: Continual Learning with Latent Replays
In general, supervised learning aims at fitting an unknown function by using a set of known examples, the training dataset. In the case of Deep Neural Networks, the training procedure returns the values of the network parameters, such as weights and biases, that minimize a loss function. Among the available optimization strategies, mini-batch Stochastic Gradient Descent (SGD), an iterative method applied over multiple learning steps (i.e. the epochs), is widely adopted. In particular, the SGD algorithm computes the gradient of the loss function with respect to the parameters by back-propagating the error value through the network. The loss function compares the model prediction, i.e. the output of the forward pass, with the expected outcome (the data label). Parameter gradients obtained after the backward pass are averaged over a mini-batch of data before updating the model coefficients.
As introduced at the beginning of this work, the Latent Replay CL method is a viable solution to obtain adaptive TinyML systems with on-device learning capabilities based on the availability of new labeled data. In Fig. 1 we illustrate the CL process with Latent Replays. The new data are injected into the model to obtain the latent embeddings, which are the feature maps of a specific intermediate layer. We indicate such a layer with the index $l$, where $l$ ranges over the $L$ stacked layers of the targeted model. At runtime, the new latent vectors are combined with the precomputed Latent Replay vectors to execute the learning algorithm on the last layers. More specifically, the coefficient parameters of the adaptive stage are updated by using a mini-batch gradient descent algorithm. Every mini-batch includes both new data (in the latent embedding form) and LR vectors. The typical ratio of new data over the full mini-batch is 1/6. The coefficient gradients are computed through forward and backward passes over the adaptive (learned) layers. Multiple iterations, i.e. the epochs, of the learning algorithm take place within the training procedure.
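The mini-batch composition described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_minibatch`, the default batch size, and the uniform random sampling of replays are assumptions made here for clarity; only the 1/6 ratio of new data comes from the text.

```python
import random

def build_minibatch(new_latents, replay_latents, batch_size=12, new_ratio=1 / 6, rng=None):
    """Mix latent embeddings of new data with stored Latent Replay vectors.

    new_ratio is the fraction of the mini-batch taken from new data
    (1/6 in the text); the remaining entries are drawn from the replay memory.
    Sampling uniformly at random is an assumption of this sketch.
    """
    rng = rng or random.Random(0)
    n_new = max(1, round(batch_size * new_ratio))
    n_replay = batch_size - n_new
    batch = rng.sample(new_latents, n_new) + rng.sample(replay_latents, n_replay)
    rng.shuffle(batch)  # avoid a fixed new/old ordering inside the batch
    return batch
```

With a batch of 12 latent vectors, this yields 2 vectors from new data and 10 Latent Replays, which are then fed together to the adaptive stage.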
III-B Memory Requirements
We model the Latent Replay-based Continual Learning task as operating on a set of new data coming from a sensor (e.g., a camera), which is interfaced with an embedded digital processing engine, namely the TinyML Platform, and its memory subsystem. Given the limited memory capacity of IoT end-nodes, quantifying the memory requirements of the learning algorithm is essential. We distinguish between two different memory requirements: the additional memory necessary for CL, e.g., the LR memory, and the memory required to save intermediate tensors during forward-prop to be used for back-prop. The latter requirement is common to all algorithms based on gradient descent and is not specific to CL.
Concerning the LR memory, the system has to save a set of LRs, each one of the size of the feature map computed at the Latent Replay layer (the $l$-th layer of the network). In our scenario, LR vectors are represented using the floating-point (FP32) datatype and typically account for the majority of the memory requirement. Since LRs are part of the static long-term memory of the CL system, we store them in non-volatile memory, e.g., external Flash.
On the other hand, forward- and back-prop of the network model require statically allocated space for the network parameters. In addition, forward-prop requires dynamically allocated buffers to store the activation feature maps of all layers. Up to the Latent Replay layer $l$, these buffers are temporary and can be released after their usage. Conversely, the system must keep in memory the feature maps of the layers after $l$ in order to compute the gradients during back-prop. They can only be released after the corresponding layer has been back-propagated. Lastly, the system must also keep in memory the coefficients' gradients, demanding a second array with one element per trainable coefficient. To preserve the accuracy of the learning process, every tensor, i.e. coefficients, gradients, and activations, employs an FP32 format in our baseline scenario. Unlike LRs, these tensors are kept in volatile memories, except for the frozen weights, which are stored in non-volatile memory.
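The two memory contributions discussed above can be estimated with a few lines of arithmetic. The sketch below is only illustrative: the function names, the dense bit-packing of sub-byte replays, and all the numbers in the example (1500 replays of 512 elements each) are hypothetical and not taken from the paper.

```python
def lr_memory_bytes(n_replays, elems_per_replay, bits=32):
    """Bytes needed to store the replay set.

    Assumes values are densely packed at `bits` bits per element
    (an assumption of this sketch), rounded up to whole bytes.
    """
    total_bits = n_replays * elems_per_replay * bits
    return (total_bits + 7) // 8

def training_memory_bytes(n_trainable_coeffs, adaptive_activation_elems, bytes_per_elem=4):
    """Volatile memory for the adaptive stage with FP32 tensors.

    Counts the coefficients, a gradient array of the same length,
    and the activations that must be kept for back-prop.
    """
    coeffs = n_trainable_coeffs * bytes_per_elem
    grads = n_trainable_coeffs * bytes_per_elem
    acts = sum(adaptive_activation_elems) * bytes_per_elem
    return coeffs + grads + acts

# Hypothetical replay memory: 1500 replays, 512 elements each.
fp32_bytes = lr_memory_bytes(1500, 512, bits=32)
int8_bytes = lr_memory_bytes(1500, 512, bits=8)
int7_bytes = lr_memory_bytes(1500, 512, bits=7)
```

Under these assumptions, moving from FP32 to 8-bit replays shrinks the LR memory by 32/8 = 4×, and 7-bit replays by 32/7 ≈ 4.57×, which is in line with the compression factors quoted in the paper.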
III-C Quantized Latent Replay-based Continual Learning
Quantization techniques have been extensively used to reduce the data size of model parameters and activation feature maps for the inference task, i.e. the forward pass. An effective quantization strategy reduces the data bitwidth from 32-bit (FP32) to a low bit-precision of 8 bits or less (Q bits, in general), while paying an almost negligible accuracy loss.
In this paper, we introduce the Quantized Latent Replay-based Continual Learning method (QLR-CL), which relies on low-bitwidth quantization to speed up the execution of the network up to the Latent Replay layer $l$ and, at the same time, to reduce the memory requirement of the LR vectors with respect to the baseline FP32 arrays. To do so, we split the deep model into two sub-networks, namely the frozen stage and the adaptive stage. The frozen stage includes the lower layers of the network, up to the Latent Replay layer. The coefficients of this sub-network, including batch normalization statistics, are frozen during the incremental learning process. In contrast, the parameters of the adaptive stage are updated based on the new data samples.
In QLR-CL, the Latent Replay vectors are generated by feeding the frozen stage sub-network with a random subset of training samples from the CL dataset, which we denote as $X_{train}$. The frozen stage is initialized using pre-trained weights from a related problem; in the case of Core50, we use a network pre-trained on the ImageNet-1k dataset. Post-Training Quantization of the frozen stage is based on the training samples $X_{train}$. We apply a standard Post-Training Quantization process that works by i) determining the dynamic range of coefficient and activation tensors, and ii) dividing the range into $2^Q - 1$ equal steps, using a uniform affine quantization scheme. While the statistics of the parameters can be drawn without relying on data, the dynamic range of the activation feature maps is estimated using $X_{train}$ as a calibration set. If we denote the dynamic range of the weights at the $i$-th layer of the network as $[w_{i,\min}, w_{i,\max}]$, we can define the INT-Q representation of parameters as

$$\hat{w}_i = \left\lfloor \frac{w_i}{S_{w,i}} \right\rfloor, \qquad S_{w,i} = \frac{w_{i,\max} - w_{i,\min}}{2^Q - 1} \tag{1}$$

where $Q$ is the number of bits and $w_i$ is the full-precision weight tensor. The representation of activations is similar, but we further restrict (1) for activations $a_i$ by considering the effect of ReLUs: the $a_i$ are always non-negative and can be represented using an unsigned UINT-Q format:

$$\hat{a}_i = \left\lfloor \frac{a_i}{S_{a,i}} \right\rfloor, \qquad S_{a,i} = \frac{a_{i,\max}}{2^Q - 1} \tag{2}$$

where $a_{i,\max}$ is obtained through calibration on $X_{train}$.
Quantized Latent Replays (QLRs) are represented similarly to other quantized activations, setting the layer index $i$ to the LR layer $l$. Their value is initialized during the initial setup of the QLR-CL process using the latent quantized activations over the $X_{train}$ set.
During the QLR-CL process, the adaptive stage is fed by dequantized vectors obtained as $S_{a,l} \cdot \hat{a}_l$, along with the dequantized latent representation of the new data sample. Hence, the single FP32 parameter $S_{a,l}$ is also stored in memory as part of the frozen stage. In our experiments, we set the bitwidth of all activations and coefficients to 8 bits, while the output of the frozen stage is compressed to 8 bits or less, as further explored in Section V.
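A compact way to see how Latent Replays are quantized, stored, and dequantized is the sketch below. It is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, the scale is computed from the maximum activation seen on a calibration set, and rounding uses the floor function with clamping to the unsigned Q-bit range, which is one common choice for uniform quantization.

```python
import math

def calibrate_scale(calib_activations, q_bits=8):
    """Scale factor from the largest activation seen on the calibration set.

    Activations are assumed non-negative (post-ReLU), so the range starts at 0.
    """
    a_max = max(max(vec) for vec in calib_activations)
    return a_max / (2 ** q_bits - 1)

def quantize(activations, scale, q_bits=8):
    """Map FP32 activations to unsigned Q-bit integers (floor, then clamp)."""
    top = 2 ** q_bits - 1
    return [min(top, max(0, math.floor(a / scale))) for a in activations]

def dequantize(q_values, scale):
    """Recover approximate FP32 values to feed the adaptive stage."""
    return [scale * q for q in q_values]
```

Only the integer codes and one FP32 scale need to be stored per replay set, which is where the memory saving comes from; for in-range values, the error introduced by this floor-based scheme stays below one quantization step.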
IV Hardware/Software Platform
In this section, we describe the hardware architecture of the proposed platform for TinyML learning and the related software stack.
IV-A Hardware architecture
The CL platform we propose is inspired by and extends our previous work. We build it upon an advanced PULP-based SoC, called VEGA, which combines parallel programming for high performance with ultra-low-power features. An advanced prototype of this platform has been taped out in GlobalFoundries 22nm technology. The system architecture, which is outlined in Fig. 2, is based on an I/O-rich MCU platform coupled with a multi-core cluster of RISC-V ISA digital signal processing cores, which are used to accelerate data-parallel machine learning and linear algebra code. The MCU side features a single RISC-V core, namely the Fabric-Controller (FC), and a large set of peripherals. Besides the FC core, the MCU side of the platform includes a large L2 SRAM, organized in an FC-private section of 64kB and a larger interleaved section of 1.5MB. The interleaved L2 is shared between the FC core and an autonomous I/O DMA controller, connected to a broad set of peripherals such as OctaSPI/HyperBus to access an external Flash or DRAM of up to 64MB, as well as camera interfaces (CPI, MIPI) and standard MCU interfaces (SPI, UART, I2C, I2S, and GPIO). The I/O DMA controller is connected to an on-chip magnetoresistive RAM (MRAM) of 4MB, which resides in its own power and clock domain and can be accessed through the I/O DMA to move data to/from the L2 SRAM.
The multi-core cluster features nine processing elements (PEs) that share data on a 128kB multi-banked L1 tightly coupled data memory (TCDM) through a 1-cycle latency logarithmic interconnect. All cores are identical, using an in-order 4-stage architecture implementing the RISC-V RV32IMCFXpulpv2 ISA. The cluster includes a set of four highly flexible FPUs shared between all nine cores, capable of FP32 and FP16 computation. Eight cores are meant to execute primarily data-parallel code, and therefore they use a hierarchical Instruction cache (I$) with a small private part (512B) plus 4kB of shared I$. The ninth core is meant to be used as a cluster controller for control-heavy data tiling and marshaling operations; it has a private I$ of 1kB. The cluster also features a multi-channel DMA engine that autonomously handles data transfers between the shared L1 and the external memories through a 64-bit AXI4 cluster bus. The DMA can transfer up to 8B/cycle between L2 and L1 TCDM in both directions simultaneously and perform 2D strided access on the L2 side by generating multiple AXI4 bursts. The cluster can be switched on and off at runtime by the FC core by means of clock-gating; it also resides on a separate power domain from the MCU, making it possible to turn it off completely and to tune its Vdd using an embedded DC-DC regulator.
IV-B Software stack
The workload of the CL algorithm is largely dominated by the execution of convolutional layers, such as pointwise and depthwise layers, or fully connected layers (98% of operations in MobileNet-V1). Consequently, the main computational load is due to variants of matrix multiplications during the forward and backward steps, which can be efficiently parallelized on the 8 compute PEs of the cluster, leaving one core out to manage tiling and program data transfers. Thus, to enable the learning paradigm on the PULP platform, we propose a SW stack composed of parallel layer-wise primitives that realize the forward step and the back-propagation. The latter involves both the computation of the activation gradients (backward error step) and of the coefficient gradients (backward gradient step). Fig. 3 depicts the dataflow of the forward and backward steps for commonly used convolutional kernels such as pointwise (PW), depthwise (DW), and linear (L) layers. To reshape all convolution operations into matrix multiplications, the im2col transformation is applied to the activation tensors to reshape them into 2D matrix operands. The FP32 matrix multiplication kernel is parallelized over the eight cores of the cluster according to a data-parallelism strategy, making use of fmadd.s (floating multiply-add) instructions made available by the shared FPU engines.
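To make the im2col reshaping concrete, the sketch below expresses a small convolution and its weight-gradient step as plain matrix multiplications. It is a sequential illustration in pure Python, not the parallel FP32 kernels of the paper: the function names, the stride-1/no-padding setting, and the channel-major layout are assumptions of this example.

```python
def im2col(x, k):
    """Unfold a [C][H][W] tensor into a (C*k*k) x (Ho*Wo) matrix (stride 1, no padding)."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    ho, wo = h - k + 1, w - k + 1
    cols = []
    for ch in range(c):
        for dy in range(k):
            for dx in range(k):
                cols.append([x[ch][r + dy][q + dx] for r in range(ho) for q in range(wo)])
    return cols

def matmul(a, b):
    """Naive (m x p) by (p x n) matrix product."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def conv2d_forward(x, weights, k):
    """Forward step: weights is Cout x (C*k*k); result is Cout x (Ho*Wo)."""
    return matmul(weights, im2col(x, k))

def weight_gradient(x, grad_out, k):
    """Backward gradient step: dW = dY * im2col(x)^T, same shape as weights."""
    return matmul(grad_out, transpose(im2col(x, k)))
```

The backward error step (activation gradients) follows the same pattern with the transposed weight matrix, so all three steps reduce to matrix products, which is the property that makes them easy to split across cores.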
The cores must operate on data from arrays located in the low-latency L1 TCDM to maximize throughput and computational efficiency (i.e., IPC). However, the operands of a layer function may not entirely fit into the lower memory level because of the limited space (128kB). For instance, the tensors of the PW layer #22 of the used MobileNet-V1 occupy 1.25MB. Hence, the operands have to be sliced into reduced-size blocks that can fit into the available L1 memory, and convolutional functions are applied on L1 tensor slices to increase the computational efficiency.
This approach is generally referred to as tiling, and is schematized in Fig. 4. With layer-wise data located in the larger L2 memory (1.5MB), the DMA first copies individual slices of operand data, also referred to as tiles, into L1 buffers, to be later fetched by the cores. Since the cluster DMA engine is capable of 2D-strided access on the L2 side, this operation can also be designed to perform im2col, without any manual data marshaling overhead on L1.
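The 2D-strided access pattern can be illustrated with a short sketch that models L2 as a flat row-major buffer and gathers a rectangular tile into a contiguous L1 buffer. The function `dma_copy_2d` and its parameters are hypothetical names chosen for this illustration; they do not correspond to the actual DMA driver interface.

```python
def dma_copy_2d(src, src_offset, row_len, num_rows, src_stride):
    """Gather `num_rows` runs of `row_len` elements from a flat source buffer.

    Consecutive runs start `src_stride` elements apart in the source and are
    packed back to back in the destination, as a 2D-strided copy would do.
    """
    dst = []
    for r in range(num_rows):
        start = src_offset + r * src_stride
        dst.extend(src[start:start + row_len])
    return dst


# Hypothetical 4x6 row-major tensor resident in "L2".
ROWS, COLS = 4, 6
l2_tensor = list(range(ROWS * COLS))

# Extract the 2x3 tile whose top-left corner is at row 1, column 2.
l1_tile = dma_copy_2d(l2_tensor, src_offset=1 * COLS + 2,
                      row_len=3, num_rows=2, src_stride=COLS)
```

Because the destination is contiguous, the cores can consume the tile as a dense matrix operand without any further reshaping in L1.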
To increase the computation efficiency, we implement a software flow that interleaves DMA transfers between L2 and L1 with calls to parallel primitives, e.g. forward, backward error, or backward gradient steps, which operate on individual tiles of data. Hence, every layer is expected to load and process all the tiles of each operand tensor. To reduce the overhead due to the data copy, the DMA transfers take place in the background of the multi-core computation: the copy of the next tile is launched before invoking the computation on the tiles already loaded. On the other hand, this optimization doubles the L1 memory requirement: while one L1 buffer is used for computation, an equally-sized buffer is used by the data movement task. Put differently, the maximum tile size must not exceed half of the available L1 memory. At runtime, layer-wise tiled kernels are invoked sequentially to run the learning algorithm on the input data. To this end, LRs are loaded from external embedded memory, if they do not fit in the internal memory, and copied to the on-chip L2 memory by the I/O DMA.
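The double-buffering scheme described above can be sketched as follows. This is a sequential model under stated assumptions: the "DMA" is a plain list copy, whereas on the real platform the transfer would be asynchronous and would overlap with the kernel call. The names `run_double_buffered` and `kernel` are illustrative only.

```python
def split_into_tiles(data, tile_len):
    return [data[i:i + tile_len] for i in range(0, len(data), tile_len)]


def run_double_buffered(l2_data, l1_bytes, elem_bytes, kernel):
    """Process `l2_data` tile by tile using two ping-pong L1 buffers."""
    # Two equally sized buffers must coexist in L1, so a tile can use
    # at most half of the available memory.
    tile_len = (l1_bytes // 2) // elem_bytes
    tiles = split_into_tiles(l2_data, tile_len)
    if not tiles:
        return [], tile_len

    buffers = [None, None]
    buffers[0] = list(tiles[0])  # prologue: the first copy cannot be hidden
    results = []
    for i in range(len(tiles)):
        cur = i % 2
        if i + 1 < len(tiles):
            # Launch the copy of the next tile into the idle buffer
            # before computing on the current one.
            buffers[1 - cur] = list(tiles[i + 1])
        results.append(kernel(buffers[cur]))
    return results, tile_len


# Hypothetical example: 10 FP32 elements, a 32-byte "L1", kernel = sum.
partial_sums, tile_len = run_double_buffered(list(range(10)), 32, 4, sum)
```

With 32 bytes of L1 and 4-byte elements, each buffer holds 4 elements, so the 10-element operand is processed as three tiles.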
V Experimental Results
In this section, we provide the experimental evidence for our proposed TinyML platform for on-device Continual Learning. First, we evaluate the impact of quantization of the frozen stage and the LR vectors upon the overall accuracy, and we analyze the memory-accuracy trade-off.
Secondly, we study the efficiency of the proposed SW architecture with respect to multiple HW configurations, namely #cores, L1 size and DMA bandwidth, introducing the tiling requirements and evaluating the latency for each computation kernel. Then, we measure performance on an advanced PULP prototype, VEGA, fabricated in GlobalFoundries 22nm technology with 4 FPUs shared among all cores. We analyze the forward and backward latency of individual layers and estimate the overall energy consumption to perform a CL task on our platform. Finally, we compare the efficiency of our TinyML platform to other devices used for on-device learning.
V-A Experimental Setup
We benchmark the compression technique for the Latent Replay memory on the image-classification Core50 dataset, which includes 120k 128×128 RGB images of 50 objects for training and about 40k images for testing. On the Core50 dataset, the CL setting is regulated by the NICv2-391 protocol. According to this protocol, 3000 images belonging to ten classes are made available during the initial phase to fine-tune the targeted deep model on the Core50 problem. Afterward, the remaining 40 classes are introduced at training time in 390 learning events. Each event, as described in more detail in Section III-A, comprises iterations over mini-batches of 128 samples each: 21 coming from actual images, all from the same class and typically not independent (e.g., coming from a video), and 107 latent replays. After each learning event, the accuracy is measured on the test set, which includes samples from the complete set of classes.
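The mini-batch composition stated above (21 new samples plus 107 latent replays) can be sketched in a few lines. The uniform random sampling of replays and the helper name `make_minibatches` are assumptions made for illustration; the exact sampling policy is the one defined in Section III-A.

```python
import random

BATCH_SIZE = 128
NEW_PER_BATCH = 21
REPLAYS_PER_BATCH = BATCH_SIZE - NEW_PER_BATCH  # 107


def make_minibatches(new_samples, replay_memory, rng):
    """Pair each group of new samples with replays drawn from the LR memory."""
    batches = []
    for start in range(0, len(new_samples), NEW_PER_BATCH):
        new_part = new_samples[start:start + NEW_PER_BATCH]
        # Assumption: replays are drawn uniformly at random, without repetition
        # within a single mini-batch.
        replay_part = rng.sample(replay_memory, REPLAYS_PER_BATCH)
        batches.append(new_part + replay_part)
    return batches


# Hypothetical placeholders standing in for image and latent tensors.
new_samples = [("new", i) for i in range(42)]
replay_memory = [("lr", i) for i in range(1500)]
batches = make_minibatches(new_samples, replay_memory, random.Random(0))
```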
Following Pellegrini et al., we use a MobileNet-V1 model with an input resolution of 128×128 and width multiplier 1, pre-trained on ImageNet; we start from their publicly released code (available at https://github.com/vlomonaco/ar1-pytorch/) and use PyTorch 1.5. While Pellegrini et al. report lower accuracies in their paper, our FP32 baseline results are aligned with their released code. In our experiments, we replace BatchReNormalization with BatchNormalization layers and we freeze the statistics of the frozen stage after fine-tuning.
V-B QLR-CL memory usage and accuracy
To evaluate the proposed QLR-CL setting, we quantize the frozen stage of the model using the PyTorch-based NEMO library after fine-tuning the MobileNet-V1 model with the initially available 3000 images. We set the activation and parameter bitwidth of the frozen stage to 8 bit while we vary the bitwidth of the latent replay layer. The quantized frozen stage is used to generate a set of Latent Replays, as sampled from the initial images.
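To make the effect of the LR bitwidth concrete, the sketch below applies a plain uniform quantizer to non-negative activations and computes the resulting memory footprint. It is an illustrative assumption, not the NEMO implementation: the clipping value, the round-to-nearest policy and the vector length of 1000 elements are placeholders.

```python
def quantize_lr(values, bits, clip_max):
    """Map activations in [0, clip_max] to unsigned integer codes on `bits` bits."""
    levels = (1 << bits) - 1   # 255 for UINT-8, 127 for UINT-7, 63 for UINT-6
    step = clip_max / levels   # width of one quantization level
    codes = []
    for v in values:
        clipped = min(max(v, 0.0), clip_max)
        codes.append(round(clipped / step))  # assumption: round to nearest
    return codes, step


def dequantize_lr(codes, step):
    """Recover approximate FP32 values before feeding the adaptive stage."""
    return [c * step for c in codes]


def lr_memory_bytes(num_lrs, lr_elements, bits):
    """Bytes needed for the LR memory, assuming tightly packed codes."""
    return (num_lrs * lr_elements * bits + 7) // 8


codes, step = quantize_lr([-1.0, 0.0, 6.0, 9.0], bits=8, clip_max=6.0)
restored = dequantize_lr(codes, step)

# Hypothetical LR vector of 1000 elements, 1500 LRs.
size_uint8 = lr_memory_bytes(1500, 1000, 8)
size_uint7 = lr_memory_bytes(1500, 1000, 7)
```

Under these assumptions, moving from 8 to 7 bits saves one eighth of the LR memory, at the cost of halving the number of representable levels.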
The plots in Fig. 5 show the test accuracy on Core50 that is achieved at the end of the NICv2-391 training protocol for a varying number of LRs while sweeping the LR layer. Depending on the selected layer type, the size of the LR vector varies as reported in Table III.
Each subplot of Fig. 5 compares the baseline FP32 version with our 8-bit fully-quantized solutions with a varying LR bitwidth, denoted in the figures, respectively, as UINT-8, UINT-7 and UINT-6. For lower LR bitwidths, we observe that the Continual Learning process does not converge on the Core50 dataset.
From the obtained results, we can observe that the UINT-8 compressed solution features a small accuracy drop with respect to the full-precision FP32 baseline. When increasing the number of latent replays to 3000, the UINT-8 quantized version is almost lossless (-0.26%), if LR. On the contrary, if the LR layer is moved towards the last layer (LR=), the accuracy drop increases up to 3.4%. The same effect is observed when reducing to 1500, 750 or 375. In particular, when , the UINT-8 quantized version presents an accuracy drop from 1.2% (LR) to 2.9% (LR). On the other hand, when lowering the bit precision to UINT-7, the accuracy reduces on average by up to , compared to the FP32 baseline. Bringing this further down to UINT-6 degrades the accuracy by more than 10%.
To investigate more deeply the impact of the quantization process on the overall accuracy, we perform an ablation study to distinguish the individual effects of i) the quantization of the front-end and ii) the quantization of the LRs. In case of , Table II compares the accuracy on the Core50 dataset for different LR layers, when applying quantization to both the LR memory and the frozen stage or only to the LR memory. The accuracy statistics are averaged over 5 experiments; we report in the table the mean and the standard deviation of the obtained results. In particular, we see that quantizing the LRs has a larger effect on the accuracy than quantizing the frozen graph. By quantizing only the LR memory to UINT-8, the accuracy drops by 1.2-2.6% (higher in case of smaller adaptive stages) with respect to the FP32 baseline. In contrast, the UINT-8 quantized frozen graph brings only an additional 0.5-1% of accuracy drop. With UINT-7 LRs, the accuracy drop is mainly due to the LR quantization: when also compressing the frozen stage to 8-bit, the additional accuracy drop is up to 1%, which is small compared to the total 4-7% of accuracy degradation.
|LR layer||FP32 (baseline)||FP32 frozen stage + UINT-8 LR||UINT-8 frozen stage + UINT-8 LR||FP32 frozen stage + UINT-7 LR||UINT-8 frozen stage + UINT-7 LR|
|27||72.7 ± 0.34||70.1 ± 0.54||69.2 ± 0.48||68.0 ± 0.63||67.8 ± 1.14|
|25||73.3 ± 0.58||70.9 ± 0.65||70.2 ± 0.67||66.2 ± 0.75||66.1 ± 0.94|
|23||75.0 ± 0.83||73.2 ± 0.46||73.4 ± 0.66||71.1 ± 0.63||69.9 ± 1.25|
|21||76.5 ± 0.63||74.9 ± 0.51||73.9 ± 1.67||72.7 ± 0.74||72.6 ± 1.30|
|19||77.7 ± 0.73||76.5 ± 0.48||76.0 ± 0.80||74.0 ± 0.57||75.2 ± 1.10|
To facilitate the interpretation of the results, Fig. 6 reports the test accuracy for multiple quantization settings compared to the size (in MB) of the Latent Replay Memory. In red, we highlight a Pareto frontier of non-dominated points, which offers a range of options to maximize accuracy and minimize the memory footprint. Among the best solutions, we detect two clusters of points on the frontier. The first cluster (A), corresponding to the low-memory side of the frontier, is constituted by experiments that use with 1500 or 3000 LRs and UINT-7 or UINT-8 representation. On the other hand, if we aim at the highest accuracy possible for our QLR-CL classification algorithm, we can follow the Pareto frontier to the right towards higher accuracies at a steeper memory cost, reaching cluster B. All points in cluster B feature as Latent Replay layer, which is a bottleneck layer of the network and allows storing more compact tensors as LRs (refer to Table III). Adopting LR layers within B leads accuracy to an average of , gaining 5% on average with respect to the layers within cluster A. A single point C1 is shown further to the right, but still below 128MB.
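The notion of a non-dominated point can be made precise with a small sketch: a configuration is on the frontier if no other configuration uses at most as much memory while reaching at least the same accuracy, with one of the two strictly better. The data points below are made-up placeholders, not the measurements of Fig. 6, and the function name is illustrative.

```python
def pareto_frontier(points):
    """Return non-dominated (name, memory_mb, accuracy) points, sorted by memory.

    Lower memory is better, higher accuracy is better.
    """
    frontier = []
    for name, mem, acc in points:
        dominated = any(
            m2 <= mem and a2 >= acc and (m2 < mem or a2 > acc)
            for _, m2, a2 in points
        )
        if not dominated:
            frontier.append((name, mem, acc))
    return sorted(frontier, key=lambda p: p[1])


# Placeholder configurations: (label, LR memory in MB, test accuracy in %).
candidates = [
    ("cfg-a", 1.0, 60.0),
    ("cfg-b", 2.0, 65.0),
    ("cfg-c", 2.5, 64.0),  # dominated by cfg-b: more memory, lower accuracy
    ("cfg-d", 8.0, 72.0),
    ("cfg-e", 9.0, 71.0),  # dominated by cfg-d
]
frontier = pareto_frontier(candidates)
```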
For a deeper analysis of the Pareto frontier, in Fig. 7, we detail the memory requirements of the points in the two clusters A and B, as well as C1. We make two observations. First, all A points could fit entirely within the on-chip memory available on VEGA, exploiting the 4MB of non-volatile MRAM. This would avoid any external memory access, increasing the energy efficiency of the algorithm by a factor of up to 3×. Second, considering that the maximization of accuracy is often the primary objective in CL, we observe that accumulating features at with 1500 UINT-8 LRs (point C1) enables accuracy to grow above , almost more than the compact solutions in A (Fig. 7). This analysis also allows us to speculate about possible future architectural explorations to design optimized bottleneck layers that could enable a better memory-accuracy trade-off for QLR-CL.
|LR||Layer||LR Dim.||LR Size|
V-C Hardware/Software Efficiency
To assess the performance of the proposed solution, we study the efficiency of the CL Software primitives on the target platform and the sensitivity to some of the HW architectural parameters, namely the #cores, the L1 memory size and the cluster DMA Bandwidth.
Single-tile performance on L1 TCDM
Based on the tiling strategy described in Section IV-B, we run experiments concerning the CL primitives of the software stack that operate on individual tiles of data placed in the L1 memory. Figure 8 shows the latency performance, expressed as MAC/cyc, i.e. the ratio between Multiply-Accumulate operations (MAC) and elapsed clock cycles (cyc), for each of the main FP32 computation kernels in case of single-core (1-CORE) or multi-core (2-4-8-CORES) execution. We highlight that a higher value of MAC/cyc denotes a more efficient processing scheme, leading to lower latency for a given computation workload, i.e. a fixed number of MACs. More specifically, in this plot, we evaluate the forward (FW), backward error (BW ERR), and backward gradient (BW GRAD) steps for each of the considered layers for a varying size of the L1 TCDM memory, i.e. 128, 256 or 512kB. The shapes of the tiles for PointWise (PW), DepthWise (DW), and Linear (Lin) layers used for the experiments are reported in the tables on the left of the figure. Such dimensions are defined to fit three different sizes of the TCDM, considering buffers of size 64kB, 128kB and 256kB.
Focusing first on the PW layers (histograms at the top of the figure), we observe a peak performance in the 8-core FW step, achieving up to 1.91 MAC/cyc for an L1 memory size of 512kB. We also observe a performance improvement of up to 11% when increasing the L1 size from 128kB to 512kB, which is explained by the higher computational density of the kernel: with 512kB of L1, the inner loop features 4× more iterations than in a scenario with 128kB of L1. Moreover, the parallel speedup scales almost linearly with the number of cores and achieves 7.2× with 8 cores. With respect to the theoretical maximum of 8×, the parallel implementation presents some overheads, mainly due to increased L1 TCDM contention and cluster cache misses.
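The MAC/cyc metric and the parallel speedup quoted above relate to latency as sketched below. The helper names and the example tile shape are assumptions introduced for illustration; only the 1.91 MAC/cyc and 7.2× figures come from the text.

```python
def pointwise_macs(height, width, c_in, c_out):
    """Multiply-accumulate count of a pointwise (1x1) convolution."""
    return height * width * c_in * c_out


def cycles_for(macs, mac_per_cycle):
    """Cycles needed for a fixed workload at a given MAC/cyc."""
    return macs / mac_per_cycle


def parallel_efficiency(speedup, cores):
    """Fraction of the ideal linear speedup actually achieved."""
    return speedup / cores


# Hypothetical tile: 4x4 spatial positions, 8 input and 16 output channels.
macs = pointwise_macs(4, 4, 8, 16)

# Reported peak for the 8-core PW forward step with 512kB of L1.
cycles_peak = cycles_for(macs, 1.91)

# Reported 7.2x speedup on 8 cores.
efficiency = parallel_efficiency(7.2, 8)
```

A speedup of 7.2× on 8 cores corresponds to 90% of the ideal linear scaling.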
DW convolutions show lower performance than the other layers. The main reason is that they require a software-based im2col data layout transformation, which increases the amount of data marshaling operations and adds an extra L1 buffer, thus reducing the size of the matrices in the matrix multiplication and leading to increased overheads. Specifically, we measure the im2col workload to account for up to 70% of the FW kernel's latency. As mentioned in Section IV, the primitives we introduce also support performing the im2col directly when moving the data tile from L2 via DMA transfer; in that case, this source of performance loss is not present, and the performance of depthwise forward-prop increases up to 1 MAC/cyc, depending also on the L1 size selected. The remaining overhead with respect to pointwise convolutions is explained by the fact that depthwise convolutions can only exploit filter reuse (of size 3×3, for example, in MobileNet-V1 DW layers) and no input channel data reuse, resulting in much shorter inner loops and a more visible effect of overheads. This latter effect cannot be counteracted by efficient DMA usage; on the other hand, since depthwise convolutions account for less than of the computation, their impact on the overall latency is limited, as we further explore in the following section.
Moving our analysis to the performance difference between forward- and backward-prop layers (particularly BW GRAD), we observe that this effect is again due to different data reuse between the matrix multiplication kernels. The reduction in reuse in the backward-prop arises because, with the tiling strategy adopted (see Fig. 3), the grad_output vector is shorter than the input of the forward matrix multiplication. Specifically, the input to the matrix multiplication has size 8x1x1 in backward, while the input shape in forward changes with the L1 memory: 512x1x1 for 128kB L1, 1024x1x1 for 256kB L1 and 2048x1x1 for 512kB L1. In this scenario, the inner loop of the matrix multiplication of a forward computation is 64×, 128× or 256× longer than in the backward kernels. This explains the lower MAC/cyc of the BW ERR step (-22%) and BW GRAD step (-46%) compared to the FW kernel.
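The inner-loop ratios quoted above follow directly from the stated input sizes, as this short worked example shows. The dictionary layout is only an illustrative convention.

```python
# Inner-loop length of the backward matrix multiplication (8x1x1 input).
BACKWARD_INNER = 8

# Inner-loop length of the forward matrix multiplication per L1 size (in kB).
FORWARD_INNER = {128: 512, 256: 1024, 512: 2048}

# How many times longer the forward inner loop is for each L1 size.
inner_loop_ratio = {l1_kb: n // BACKWARD_INNER
                    for l1_kb, n in FORWARD_INNER.items()}
```

A longer inner loop amortizes loop setup and operand loading over more fmadd.s instructions, which is consistent with the higher MAC/cyc observed for the forward kernel.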
L2-L1 DMA Bandwidth effects on performance
Next we analyze the impact of L2-L1 DMA Bandwidth variations, due to the Cluster DMA, on the overall performance of the learning task. In particular, we monitor the latency and the MAC/cyc for multiple values of L2-L1 bandwidth ranging from 8 to 128 bits per clock cycle (bit/cyc) and different configurations of #cores and L1 size. We remark that a higher value of MAC/cyc indicates a better performing HW configuration. Our analysis assumes a single half-duplex DMA channel, hence the bandwidth value accounts for either read or write transfers.
Fig. 9 reports the average MAC/cyc when running the forward and backward steps with respect to the L2-L1 cluster DMA bandwidth. As a benchmark, we consider the adaptive stage of the MobileNet-V1 model when the LR layer is set to the 19th layer, and we adopt our tiling strategy and double-buffering scheme to realize the training. When increasing the L1 size, the tensor tiles become larger, therefore demanding a longer transfer time to copy data between the L1 memory (used for computation) and the L2 memory (used for storage). Thanks to the adopted double-buffering technique, such transfer time can be hidden by the computation time because the DMA works in the background of CPU operation (compute-bound). On the contrary, if the transfer time dominates, the computation becomes DMA transfer-bound, with lower benefits from the multi-core acceleration.
In case of single-core execution, the measured MAC/cyc does not vary with the L1 size (128kB, 256kB or 512kB), as can be seen from the plot. In this scenario, the CPU time is the dominant contribution with respect to the transfer time: the execution is compute-bound and a higher L2-L1 bandwidth does not impact the overall performance. In a multi-core execution (2, 4 or 8 cores), instead, the average MAC/cyc increases and the computation time therefore shrinks relative to the transfer time: from the plot we can observe higher performance if the DMA bandwidth is increased. With an L1 size of 128kB, the sweet spots between the DMA-bound and compute-bound regimes are observed when the L2-L1 DMA bandwidth is 16 (2 cores), 32 (4 cores) and 64 (8 cores) bit/cyc, respectively, as highlighted by the red circles in the plot. These configurations indicate how to tune the DMA requirements with respect to the chosen L1 memory size and #cores.
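A first-order model helps to see when a tile is compute-bound or transfer-bound under double buffering: since the copy of the next tile overlaps with the computation, the time per tile is roughly the maximum of the two. This is a simplified sketch under that assumption, and the tile size and MAC counts are placeholders rather than measured values.

```python
def tile_cycles(tile_bytes, tile_macs, dma_bits_per_cycle, mac_per_cycle):
    """Estimate cycles per tile when transfer and compute fully overlap."""
    transfer = tile_bytes * 8 / dma_bits_per_cycle
    compute = tile_macs / mac_per_cycle
    regime = "transfer-bound" if transfer > compute else "compute-bound"
    return max(transfer, compute), regime


# Hypothetical tile: 64kB of data, 100000 MACs.
TILE_BYTES, TILE_MACS = 64 * 1024, 100000

# Slow compute (1 MAC/cyc) hides an 8 bit/cyc DMA ...
single_core = tile_cycles(TILE_BYTES, TILE_MACS, 8, 1.0)
# ... but 4x faster compute exposes the transfer time at the same bandwidth,
multi_core_slow_dma = tile_cycles(TILE_BYTES, TILE_MACS, 8, 4.0)
# ... until the bandwidth is raised to 64 bit/cyc.
multi_core_fast_dma = tile_cycles(TILE_BYTES, TILE_MACS, 64, 4.0)
```

The model reproduces the qualitative trend described above: faster computation makes the DMA bandwidth matter, and raising the bandwidth restores the compute-bound regime.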
Focusing on the impact of the L1 memory size on the multi-core performance, we observe up to a 2× efficiency gain with 8 cores and a larger L1 memory, with performance increasing from 0.25 MAC/cyc for a 128kB L1 memory to 0.4 MAC/cyc at L1=256kB and to 0.53 MAC/cyc for 512kB of L1. At 64 bit/cyc of L2-L1 DMA bandwidth, the execution, which is dominated by the computation, reaches 0.52 MAC/cyc, 2.12× faster than the low-bandwidth configuration.
From this analysis we can conclude that the best design point for the learning task on a low-end multi-core architecture can be pinpointed by tuning the L2-L1 DMA bandwidth and the L1 memory size: when using 8 cores, 128kB of L1 memory, which is typically the most expensive resource of the system, is already sufficient to reach the highest performance as long as the DMA features a bandwidth of 64 bit/cyc. On the contrary, if the DMA bandwidth is as low as 8 bit/cyc, a 512kB L1 memory is needed to reach maximum performance. The target chip VEGA includes an L1 memory of 128kB; the DMA follows a full-duplex scheme and can provide up to 64 bit/cyc for read transactions and 64 bit/cyc for write transactions. Therefore, the VEGA HW architecture can fully exploit the presented SW architecture and optimization schemes to reach the optimal utilization and performance for the learning task.
|LR||VEGA @ 375 MHz||STM32L4 @ 80 MHz||Snapdragon|
|[s]||[s]||En. [J]||[s]||En. [J]||[s]|
V-D Latency Evaluation on VEGA SoC
We run experiments on the VEGA SoC to assess the on-device learning performance, in terms of latency and energy consumption, of the proposed QLR-CL framework. Specifically, we report the computation time, i.e. the latency, at the running frequency of 375MHz and the power consumption obtained by measuring the current absorbed by the chip when powered at 1.8V. To measure the full layer latency, we profile forward and backward tiled kernels, which include DMA transfers of data, initially stored in L2, and calls to the low-level kernel primitives introduced above. On average, we observe a 7% tiling overhead with respect to the single-tile execution on L1. This is not surprising, given the large bandwidth available between L1 and L2 and the presence of compute-bound matrix multiplication operations.
Based on the implemented tiled functions, we report the layer-wise performance in Table IV for any of the layers of the MobileNet-V1 model. We consider as complete time for the execution of a layer the cumulated time for frozen stage and adaptive stage. The latency of the frozen stage is obtained using DORY  to deploy the network, as this operation is performed as pure 8-bit quantized inference. We compute the full latency of the adaptive stage as the time needed to execute the forward and backward phases of each layer. Since we have multiple configurations, latencies for retraining start growing from the last layer (#27) up to layer #20, where retraining comprises a total of eight layers.
First of all, we note that frozen stage latencies are utterly dominated by the adaptive stage. Apart from the faster inference backend, which can rely on 8-bit SIMD vectorization, this is because only 21 images per mini-batch pass through the frozen stage, while the adaptive stage has to be executed on 128 latent inputs (107 LRs and the 21 dequantized outputs from the frozen stage), and it has to run for multiple epochs (by default, 4) in order to work.
When the LR layer is the last one (#27), the adaptive stage is very fast thanks to its very small number of parameters (it produces just the 50 output classes). This is the only case in which the frozen stage is non-negligible (about 1/6 of the overall time). Progressing upward in the table, the frozen stage becomes negligible. The cumulative impact of forward and backward passes through all the other layers we take into account (from #20 to #26) is in the range between 0.3h and 1.5h. In particular, corresponds to 14 min per learning event; this LR layer corresponds to high accuracy (75% in Core50, see Fig. 6), which means that in this time the proposed system is capable of acquiring a very significant new capability (e.g., a new task/object to classify) while retaining previous knowledge to a high degree.
Having the basic mini-batch measurements, we can estimate any scenario. For instance, to train with 1500 LRs and the LR layer set to #27, we need 300 new images and thus 14 mini-batches (300/21 ≈ 14), which leads to 3.30 seconds to learn a new set of images, with an accuracy of 69.2%. If we push back the LR layer , this leads to an increase of accuracy up to 76.5%, at the expense of a much larger latency, up to 42 minutes for layer #20 (see Table IV).
V-E Energy Evaluation on CL Use-Cases and Comparison with other Solutions
To understand the performance of our system and its real-world applicability, we study two use-cases: a single mini-batch of the Core50 training we used, and the simplified scenario presented by Pellegrini et al. in their demonstration video. We compare our results with another MCU targeting ultra-low power consumption: a NUCLEO-64 board based on the STM32L476RG MCU, on which we ran a direct port of the same code we use on the PULP-based platforms. It has two on-chip SRAMs with 1-cycle access time and an overall capacity of 96kB. Performance results, in terms of latency, are reported in Table IV, where we take into account the cumulative latency values for both the VEGA and STM32 implementations, along with the cumulative energy consumption. Cumulative latency is computed by starting from the linear layer of the network and adding the latencies of the preceding layers.
On average, execution on VEGA's 8 cores is 65× faster than the STM32 solution thanks to three main factors. Firstly, the clock frequency of VEGA is 4.7× higher than the maximum clock frequency of the STM32L4 (375MHz vs 80MHz), also thanks to the superior technology node. Secondly, VEGA presents a parallel speed-up of up to 7.2×. Lastly, thanks to the more optimized ISA and core microarchitecture, VEGA performs fewer operations while executing the same learning task. For example, the inner loop of the matrix multiplication on VEGA requires only 4 instructions while the STM32L4 takes 9 instructions, making it 2.25× faster, mainly thanks to the HW loop extension and the fmadd.s instruction.
The latency speed-up leads to an energy gain of around 37×, because the average power consumption of VEGA is 2× higher than that of the STM32L4 at full load.
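The relation between the three speed-up factors, the measured average and the energy gain can be checked with a few lines of arithmetic. Treating the three factors as independent and multiplicative is an assumption of this sketch; it gives an upper bound rather than the measured figure.

```python
# Factors stated in the text.
freq_ratio = 375 / 80         # clock frequency ratio, about 4.7x
parallel_speedup = 7.2        # best-case speed-up on 8 cores
instr_ratio = 9 / 4           # inner-loop instruction count ratio, 2.25x

# Assumption: the factors combine multiplicatively in the best case.
best_case_speedup = freq_ratio * parallel_speedup * instr_ratio

# Energy = power x time, so the energy gain is the latency speed-up
# divided by the power ratio between the two platforms.
measured_speedup = 65
energy_gain = 37
implied_power_ratio = measured_speedup / energy_gain
```

The best-case product is about 76×, above the 65× measured on average, and the reported speed-up and energy gain imply a power ratio of roughly 1.8, in line with the approximate factor of 2 quoted above.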
Notice that the latency measurement of the STM32L4 does not account for possible overheads due to the tiling of data between the small on-chip SRAM banks and off-chip memory. Even then, our results show that fine-tuning from any layer above the last one results in too large a latency to be realistic on the STM32L4, in the order of a day per learning event with . On the contrary, CL on VEGA can be completed in minutes if selecting or in as little as 3.3 seconds if retraining only the last layer.
Given the reported energy consumption, we estimated the battery lifetime of our device when adapting the inference model by means of multiple learning events per hour; we assumed no extra energy consumption for the remaining time. In particular, Fig. 10 shows the battery lifetime (in hours) depending on the selected Latent Replay layer and the adaptation rate, expressed as the number of learning events per hour. We considered a 3300 mAh battery as the only energy source for the device. By retraining only the last layer (LR), an intelligent node featuring our device can perform more than 1080 continual learning events per hour, leading to a lifetime of about 175h. On the contrary, if retraining larger portions of the network, the training time increases and the maximum rate of learning events reduces to less than 10/hour, with a lifetime in the range 200-1000h. In comparison, on an STM32L4, if retraining the coefficients of the last layer, the maximum number of learning events per hour is limited to 750, with a lifetime of about 10h. The latter can be increased up to 10000h, but only by retraining once per hour. At the same learning event rate, the battery lifetime of VEGA is 20× higher.
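A battery lifetime estimate of this kind can be sketched as the battery energy divided by the energy spent per hour on learning events. The battery voltage of 3.3V is an assumption of this sketch (the text only gives the capacity), and the energy per event is approximated as the 62mW average power of VEGA multiplied by the 3.3s needed to retrain the last layer.

```python
def battery_lifetime_hours(capacity_mah, voltage_v, energy_per_event_j,
                           events_per_hour):
    """Lifetime when learning events are the only energy consumer."""
    battery_j = capacity_mah / 1000 * voltage_v * 3600  # Ah * V * s/h = J
    return battery_j / (energy_per_event_j * events_per_hour)


BATTERY_MAH = 3300
ASSUMED_VOLTAGE_V = 3.3          # assumption, not stated in the text

# Approximate energy of one last-layer learning event: 62 mW for 3.3 s.
energy_per_event_j = 0.062 * 3.3

lifetime_h = battery_lifetime_hours(BATTERY_MAH, ASSUMED_VOLTAGE_V,
                                    energy_per_event_j, events_per_hour=1080)
```

Under these assumptions the estimate is about 177h, close to the lifetime of about 175h quoted above for 1080 learning events per hour.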
Lastly, we compare with the use-case presented by Pellegrini et al., where they developed a mobile phone application that performs CL with LRs on a OnePlus6 with Snapdragon845. For this scenario, they consider only 500 LRs before the linear layer, which are shuffled with 100 new images. By construction, the mini-batch is composed of 100 LRs and 20 new images; thus, for each of the 8 training epochs, the network processes 5 mini-batches of 20 new images and 100 LRs each. This scenario leads them to obtain an average latency of 502 ms for a single learning event. On the other hand, considering our measurements on VEGA, we obtain a forward latency of 1.25s and a training time of 2.07s for a whole learning event.
Considering the power envelope of a Snapdragon845 of about 4W and the average power consumption of VEGA of 62mW, our solution is 9.7× more efficient in terms of energy. We additionally assess the energy consumption and the battery duration in the mobile application scenario, given the energy measurements on VEGA, when using a 3300mAh battery. If we perform learning over a mini-batch of images once every minute in the ultra-fast scenario (just retraining the linear layer) and an inference each second, we obtain an energy consumption of 0.25J per minute. In this setting, the model achieves an average accuracy of 69.2%, with an overall lifetime of about 108 days.
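Both figures in this paragraph can be reproduced with simple arithmetic from the values stated in the text. The only quantity not given in the text is the battery voltage, which is assumed here to be 3.3V.

```python
# Energy of one learning event on each platform (power x latency).
snapdragon_energy_j = 4.0 * 0.502            # 4 W for 502 ms
vega_energy_j = 0.062 * (1.25 + 2.07)        # 62 mW for forward + training
energy_ratio = snapdragon_energy_j / vega_energy_j

# Battery lifetime at 0.25 J per minute.
ASSUMED_VOLTAGE_V = 3.3                      # assumption, not in the text
battery_j = 3.3 * ASSUMED_VOLTAGE_V * 3600   # 3300 mAh = 3.3 Ah
energy_per_day_j = 0.25 * 60 * 24
lifetime_days = battery_j / energy_per_day_j
```

The ratio evaluates to roughly 9.75, matching the 9.7× quoted above, and the lifetime to about 109 days under the assumed voltage.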
In this work, we presented what, to the best of our knowledge, is the first HW/SW platform for TinyML Continual Learning, together with the novel Quantized Latent Replay-based Continual Learning (QLR-CL) methodology. More specifically, we propose to use low-bitwidth quantization to reduce the high memory requirements of a Continual Learning strategy based on Latent Replay rehearsing. We show an accuracy drop as small as 0.26% when using an 8-bit quantized LR memory compared to floating-point vectors, and an average degradation of 5% when lowering the bit precision to 7-bit, depending on the LR layer selected. Our results demonstrate that sophisticated adaptive behavior based on CL is within reach for next-generation TinyML devices, such as PULP devices; we show the capability to learn a new Core50 class with accuracy up to 77%, using less than 64MB of memory, a typical constraint to fit Flash memories. We show that our QLR-CL library based on VEGA achieves up to 65× better performance than a conventional STM32 microcontroller.
These results constitute an initial step towards moving TinyML from a strict train-then-deploy approach to a more flexible and adaptive scenario, where low-power devices are capable of learning and adapting to changing tasks and conditions directly in the field.
Although this work focused on a single CL method, we remark that, thanks to the flexibility of the proposed platform, other adaptation methods or models can also be supported, especially if they rely on the back-propagation algorithm and CNN primitives, such as convolution operations.
We thank Vincenzo Lomonaco and Lorenzo Pellegrini for the insightful discussions.
-  (2016) Concrete Problems in AI Safety. arXiv e-prints, pp. arXiv–1606. Cited by: §I.
-  (2020) Benchmarking tinyml systems: challenges and direction. arXiv e-prints, pp. arXiv–2003. Cited by: §I.
-  (2019) Online learning and classification of emg-based gestures on a parallel ultra-low power platform using hyperdimensional computing. IEEE transactions on biomedical circuits and systems 13 (3), pp. 516–528. Cited by: §II-B, §II-C, TABLE I.
-  (2020) What is the state of neural network pruning?. In Proceedings of Machine Learning and Systems, I. Dhillon, D. Papailiopoulos, and V. Sze (Eds.), Vol. 2, pp. 129–146. Cited by: §I.
-  (2021) DORY: automatic end-to-end deployment of real-world dnns on low-cost iot mcus. IEEE Transactions on Computers, pp. 1–1. Cited by: §II-B, §IV-B, §V-D.
-  (2020) Online learned continual compression with adaptive quantization modules. In International Conference on Machine Learning, pp. 1240–1250. Cited by: §II-A.
-  (2020) TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning. Advances in Neural Information Processing Systems 33. Cited by: §II-A, §II-C, TABLE I.
-  (2020) CMix-NN: Mixed low-precision CNN library for memory-constrained edge devices. IEEE Transactions on Circuits and Systems II: Express Briefs 67 (5), pp. 871–875. Cited by: §II-B, §II-B.
-  (2019) Taking ai to the edge: google’s tpu now comes in a maker-friendly package. IEEE Spectrum 56 (5), pp. 16–17. Cited by: §II-A, §II-C, TABLE I.
-  (2018) End-to-end incremental learning. In Proceedings of the European conference on computer vision (ECCV), pp. 233–248. Cited by: §II-A.
-  (2017) Optimal Tiling Strategy for Memory Bandwidth Reduction for CNNs. In Advanced Concepts for Intelligent Vision Systems, J. Blanc-Talon, R. Penne, W. Philips, D. Popescu, and P. Scheunders (Eds.), Lecture Notes in Computer Science, pp. 89–100. Cited by: §II-B.
-  (2019) On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486. Cited by: §I.
-  (2018-10) TVM: an automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, pp. 578–594. Cited by: §II-B.
-  (2014) Diannao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. ACM SIGARCH Computer Architecture News 42 (1), pp. 269–284. Cited by: §II-B.
-  (2017) Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE Journal of Solid-State Circuits 52 (1), pp. 127–138. Cited by: §II-B.
-  (2017) Quantized CNN: A unified approach to accelerate and compress convolutional networks. IEEE transactions on neural networks and learning systems 29 (10), pp. 4730–4743. Cited by: §II-B.
-  (2018) PACT: Parameterized Clipping Activation for Quantized Neural Networks. arXiv e-prints, pp. arXiv–1805. Cited by: §I, §II-B, §II-B.
-  (2020) Technical Report: NEMO DNN Quantization for Deployment Model. arXiv preprint arXiv:2004.05930. Cited by: §V-B.
-  (2020) TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems. arXiv e-prints, pp. arXiv–2010. Cited by: §I, §II-B.
-  (2018-01) Loihi: A Neuromorphic Manycore Processor with On-Chip Learning. IEEE Micro 38 (1), pp. 82–99. Cited by: §II-B.
-  (2021) Robustifying the Deployment of tinyML Models for Autonomous mini-vehicles. Sensors 21 (4), pp. 1339. Cited by: §I, §II-C, TABLE I.
-  (2021) A continual learning survey: defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §I, §II-A.
-  (2021) A survey of on-device machine learning: an algorithms and learning theory perspective. ACM Transactions on Internet of Things 2 (3), pp. 1–49. Cited by: §I.
-  (2020) Incremental On-Device Tiny Machine Learning. In Proceedings of the 2nd International Workshop on Challenges in Artificial Intelligence and Machine Learning for Internet of Things, pp. 7–13. Cited by: §II-C, TABLE I.
-  (2021) Quantization-Guided Training for Compact TinyML Models. arXiv e-prints, pp. arXiv–2103. Cited by: §II-B.
-  (2021) HNPU: An Adaptive DNN Training Processor Utilizing Stochastic Dynamic Fixed-Point and Active Bit-Precision Searching. IEEE Journal of Solid-State Circuits, pp. 1–1. Cited by: §II-B.
-  (2016-02) Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149 [cs]. Cited by: §II-B.
-  (2019) Memory efficient experience replay for streaming learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9769–9776. Cited by: §II-A.
-  Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §III-C.
-  (2019) A Study of BFLOAT16 for Deep Learning Training. arXiv e-prints, pp. arXiv–1905. Cited by: §II-B.
-  (2020) 7.4 GANPU: A 135TFLOPS/W Multi-DNN Training Processor for GANs with Speculative Dual-Sparsity Exploitation. In 2020 IEEE International Solid-State Circuits Conference (ISSCC), pp. 140–142. Cited by: §II-B.
-  (2021-04) Robust high-dimensional memory-augmented neural networks. Nature Communications 12 (1), pp. 2468. Cited by: §II-B.
-  (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114, pp. 3521–3526. Cited by: §I.
-  (2017) Resource-efficient machine learning in 2 KB RAM for the Internet of Things. In International Conference on Machine Learning, pp. 1935–1944. Cited by: §I.
-  (2018-01) CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs. arXiv e-prints, pp. arXiv:1801.06601. Cited by: §II-B, §IV-B.
-  (2018-04) Mixed-precision in-memory computing. Nature Electronics 1 (4), pp. 246–253. Cited by: §II-B.
-  (2020-01) Spiking Neural Networks and online learning: An overview and perspectives. Neural Networks 121, pp. 88–100. Cited by: §II-B.
-  (2017) The quest for energy-efficient I$ design in ultra-low-power clustered many-cores. IEEE Transactions on Multi-Scale Computing Systems 4 (2), pp. 99–112. Cited by: §IV-A.
-  (2020) Rehearsal-free continual learning over small non-iid batches. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 989–998. Cited by: §V-A.
-  (2020) CVPR 2020 Continual Learning in Computer Vision Competition: Approaches, Results, Current Challenges and Future Directions. arXiv preprint arXiv:2009.09929. Cited by: §I, §II-A.
-  (2018) A transprecision floating-point architecture for energy-efficient embedded computing. In 2018 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. Cited by: §IV-A.
-  (2020) Batch-level experience replay with review for continual learning. arXiv preprint arXiv:2007.05683. Cited by: §I, §II-A.
-  (2021) Online continual learning in image classification: an empirical survey. arXiv preprint arXiv:2101.10423. Cited by: §I, §II-A.
-  (2019) Continuous learning in single-incremental-task scenarios. Neural Networks 116, pp. 56–73. Cited by: §II-A.
-  (2017-02) Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable Convolutional Neural Network processor in 28nm FDSOI. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 246–247. Cited by: §II-B.
-  (2019) A hardware–software blueprint for flexible deep learning specialization. IEEE Micro 39 (5), pp. 8–16. Cited by: §II-B.
-  (2019-08) Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature 572 (7767), pp. 106–111. Cited by: §II-B.
-  (2020) Latent Replay for Real-Time Continual Learning. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10203–10209. Cited by: A TinyML Platform for On-Device Continual Learning with Quantized Latent Replays, §I, §I, §II-A, §II-B, §II-C, TABLE I, §III-A, §V-A, §V-E, §V-E, footnote 7.
-  (2020) Memory-Latency-Accuracy Trade-offs for Continual Learning on a RISC-V Extreme-Edge Node. In 2020 IEEE Workshop on Signal Processing Systems (SiPS), pp. 1–6. Cited by: §I, §III-B, §IV-A.
-  (2017) iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §II-A.
-  (2021) TinyOL: TinyML with Online-Learning on Microcontrollers. arXiv e-prints, pp. arXiv–2103. Cited by: §II-C, TABLE I.
-  (2021) 4.4 A 1.3TOPS/W @ 32GOPS Fully Integrated 10-Core SoC for IoT End-Nodes with 1.7μW Cognitive Wake-Up from MRAM-Based State-Retentive Sleep Mode. In 2021 IEEE International Solid-State Circuits Conference (ISSCC), Vol. 64, pp. 60–62. Cited by: §I, §IV-A, §V-B.
-  (2015) PULP: A parallel ultra low power platform for next generation IoT applications. In 2015 IEEE Hot Chips 27 Symposium (HCS), pp. 1–39. Cited by: item 2, §I.
-  (2017) 14.2 DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks. In 2017 IEEE International Solid-State Circuits Conference (ISSCC), pp. 240–241. Cited by: §II-B.
-  (2020) A pragmatic approach to on-device incremental learning system with selective weight updates. In 2020 57th ACM/IEEE Design Automation Conference (DAC), pp. 1–6. Cited by: §II-B.
-  (2018) In-situ AI: Towards autonomous and incremental deep learning for IoT systems. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 92–103. Cited by: §I.
-  (2019) Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. Advances in Neural Information Processing Systems 32, pp. 4900–4909. Cited by: §II-B.
-  (2017-03) Efficient Processing of Deep Neural Networks: A Tutorial and Survey. arXiv:1703.09039 [cs]. External Links: Cited by: §II-B.
-  (2019) Three scenarios for continual learning. arXiv preprint arXiv:1904.07734. Cited by: §I.
-  512KiB RAM Is Enough! Live Camera Face Recognition DNN on MCU. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §II-B.