Latent Replay for Real-Time Continual Learning

by Lorenzo Pellegrini, et al.
University of Bologna

Training deep networks on light computational devices is nowadays very challenging. Continual learning techniques, where complex models are incrementally trained on small batches of new data, can make the learning problem tractable even for CPU-only edge devices. However, a number of practical problems need to be solved, catastrophic forgetting before anything else. In this paper we introduce an original technique named “Latent Replay” where, instead of storing a portion of past data in the input space, we store activation volumes at some intermediate layer. This can significantly reduce the computation and storage required by native rehearsal. To keep the representation stable and the stored activations valid we propose to slow down learning at all the layers below the latent replay one, leaving the layers above free to learn at full pace. In our experiments we show that Latent Replay, combined with existing continual learning techniques, achieves state-of-the-art accuracy on a difficult benchmark such as CORe50 NICv2 with nearly 400 small and highly non-i.i.d. batches. Finally, we demonstrate the feasibility of nearly real-time continual learning on the edge by porting the proposed technique to a smartphone device.






1 Introduction

Training on the edge (e.g., on light computing devices such as smartphones, smart cameras, embedded systems, etc.) is highly desirable in several applications where privacy, lack of network connection and fast adaptation are real concerns. While some steps in this direction have recently been taken [13], training on the edge often remains unfeasible. In fact, given the high demand in terms of memory and computation, most machine learning models nowadays are trained on powerful multi-GPU servers, and only frozen models are deployed to edge devices for inference.

Figure 1: Architectural diagram of Latent Replay.
Figure 2: Continual learning accuracy along the incremental training batches over NICv2 – 79, NICv2 – 196 and NICv2 – 391 as presented in [17]. None of the compared methods uses rehearsal. Naive refers to a simple approach where the model is tuned along the batches and the only protection against forgetting is early stopping. LWF and EWC are well-known methods for CL (see [15, 10]). CWR* and AR1* are discussed in the main text. DSLDA is a recently proposed streaming continual learning approach [5]. The black dashed line denotes the “upper bound” accuracy achieved by the Cumulative approach, that is, a single full training run on the entire dataset.

Furthermore, in some applications (e.g., robotic vision), training a deep model from scratch as soon as new data becomes available is prohibitive in terms of storage and computation even if performed server side. Continual Learning (CL), that is, the ability to continually train existing models using only new data, is gaining a lot of attention and several solutions have recently been proposed to deal with the frustrating problem of forgetting (i.e., as the model learns new concepts and skills it tends to forget the old ones) [1, 3, 20]. The recent surveys [21, 12] provide an overview of CL. In principle, CL approaches could be exploited not only to control forgetting but also to reduce the training complexity. In this paper we focus on real-time CL and prove that continual training with small batches can be compatible with the limited computing power made available by CPU-only edge devices.

In [17] it was shown that some CL approaches can effectively learn to recognize objects (on the CORe50 dataset [16]) even when fed with fine-grained incremental batches. CORe50 NICv2 [17] is a continual learning benchmark where objects from 50 different classes have to be learned incrementally. What makes this benchmark challenging is that classes are discovered a little at a time and the training batches are small and non i.i.d. In particular, in NICv2 – 391 each training batch includes only 300 frames extracted from a short video (15 seconds at 20 fps) of a single object slowly moving in front of the camera: hence, patterns within each batch are highly correlated. Despite these nuisances, in [17] the approaches denoted as CWR* and AR1* proved to be able to learn continually even in absence of replay mechanisms, that is, the periodic refresh of old examples maintained in an external memory (see Figure 2). While these results are encouraging:

  • the accuracy gap w.r.t. the cumulative approach (a sort of upper bound obtained by training the model on the entire training set) remains quite relevant (about 20%).

  • in NICv2 – 391, the most challenging setup, AR1* was not able to effectively adapt the representation layers during continual learning.

The aim of this work is to reduce as much as possible the gap w.r.t. the cumulative upper bound and, at the same time, to provide an efficient implementation strategy of CL approaches to enable nearly real-time training on the edge.

To this purpose we first show that a small amount of pattern replay is sufficient to significantly improve accuracy on NICv2 – 391. However, even if in the CORe50 setting the extra memory required by replay is not an issue (we store only 30 patterns for each of the 50 classes), a constant refresh significantly increases the required amount of computation because of the extra forward and backward steps and this makes the resulting training too resource demanding for real-time applications. Therefore, we propose a “Latent Replay” approach (see Figure 1) where old data are injected at some intermediate layer selected according to the desired accuracy-efficiency trade-off.

The rest of this paper is organized as follows: Section 2 introduces different CL scenarios and discusses related literature; in Section 3 we show that extending existing approaches with a limited amount of replay can significantly improve their performance at the expense of a substantial increase in the required amount of computation. In Section 4 we introduce latent replay, the main contribution of this work, and in Section 5 we apply it to the CWR* and AR1* approaches and point out its advantages in terms of the computation-storage-accuracy trade-off. Finally, in order to demonstrate the practical applicability of the proposed approach, in Section 5.4 we discuss the implementation of a continual learning application for Android smartphones that, starting from a pre-trained MobileNetV1 model with 10 classes, can incrementally learn (in near real-time) new classes and/or new objects of existing classes.

2 Related Works

While several works in the continual learning literature focus on the Multiple independent Tasks (MT) scenario, in many practical applications such as robotic vision, a Single Incremental Task (SIT) scenario is more appropriate [19]. In particular, a robot should be able to incrementally improve its object recognition capabilities while being exposed to new instances of both known and completely new classes (denoted as NIC setting – New Instances and Classes). The CORe50 NICv2 benchmark specifically addresses this problem [17]. Other datasets have been released to study continual learning for robotic vision (e.g., iCub-transformation [22], OpenLORIS [27]) but no NIC benchmarks have yet been defined for them. ImageNet-1K [14] and iCifar-100 [11] have also been used to evaluate continual learning techniques, but these datasets do not fit the object recognition task well because of the lack of multiple instances of the same objects taken under different poses, lighting and backgrounds.

In [17], two approaches denoted as CWR* and AR1* have been evaluated on CORe50 NICv2 (see Figure 2): in CWR* the last fully connected layer is implemented as a double memory, and simple initialization and fusion steps are performed before and after each training batch to synchronize the two memories. However, after the first training batch, CWR* freezes all the layers except the last one, thus losing the benefits of a continual adaptation of the underlying representation. AR1* extends CWR* by enabling end-to-end continual training throughout the entire network; to this purpose the Synaptic Intelligence [28] regularization approach (similar to Online-EWC [26]) is adopted to constrain the change of critical weights. CWR* and AR1* are adopted as baseline techniques in this study.

Figure 3: Comparison of CWR* and AR1* on CORe50 NICv2 – 391 with and without rehearsal (rehearsal memory of 1500 patterns). Each experiment was averaged on 5 runs with different batch ordering: colored areas represent the standard deviation of each curve. The black dashed line denotes the reference accuracy of the cumulative upper bound.

Pattern replay, which is central in the proposed model, proved to be an effective approach to counter forgetting in continual learning scenarios [23, 18, 25]. In fact, periodically replaying some representative patterns from old data helps the model to retain important information of past tasks / classes while learning new concepts. iCaRL [23] uses well-designed entry / exit criteria to maintain a class-balanced set of exemplars that maximize representativeness. A comparison between the proposed technique and iCaRL is reported in Section 5.3. Generative Replay (also known as “Pseudo-rehearsal” [24]), where surrogates of past data are generated without explicitly storing native patterns, looks very appealing because of the storage saving; however, most of the approaches proposed to date do not allow on-line generation of effective replay patterns.

Figure 4: Comparison of CWR*, AR1* and AR1*free on CORe50 NICv2 – 391 with different external memory sizes (500, 1000, 1500 and 3000 patterns).

Another class of relevant techniques for our study are the so-called streaming continual learning approaches [4], where a model can be incrementally trained with a single pattern at a time. Even if in a robotic vision scenario learning from single frames does not appear necessary (in fact, using short videos of single objects can be more efficient and looks more biologically plausible), efficient streaming learning techniques can be effortlessly applied to the NIC setting. Deep Streaming Linear Discriminant Analysis (DSLDA) was recently proposed [5], where an online extension of the LDA classifier works on top of a fixed deep learning feature extractor. This approach, which achieved state-of-the-art accuracy on (partitioned) ImageNet-1K and CORe50 (10 classes version), was run on NICv2 and compared with other techniques in Figure 2.


Finally, training on the edge was recently addressed in [13], where an object detection model was incrementally trained based on LWF and pattern replay. While training is not actually real-time (it requires a few minutes on an Nvidia Jetson TX2 board) and only a few large continual training batches are presented to the model, the detection problem approached in [13] is more difficult than the classification problem considered here and therefore we cannot make a direct comparison.

3 Training CWR* and AR1* with Rehearsal

In [19] it was shown that a very simple rehearsal implementation, where for every training batch a random subset of the batch patterns is added to the external storage to replace an (equally random) subset of the external memory, is not less effective than more sophisticated approaches such as iCaRL. Therefore, in this study we opted for simplicity and started by expanding CWR* and AR1* with the trivial rehearsal approach summarized in Algorithm 1. In Figure 3 we compare the learning trend of CWR* and AR1* of a MobileNetV1 (the network was pre-trained on ImageNet-1K) trained with and without rehearsal on CORe50 NICv2 – 391. We use the same protocol and hyper-parameters introduced in [17] and a rehearsal memory of 1500 patterns. It is evident that even a moderate external memory (about 1.27% of the total training set) is very effective in improving the accuracy of both approaches and in reducing the gap with the cumulative upper bound, which for this model is 85%.

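  initialize the external memory as empty
  set the memory size, i.e., the number of patterns to be stored in the external memory
  for each training batch:
      train the model on the shuffled union of the batch and the external memory
      select the patterns to add by random sampling patterns from the batch
      replace an equally random subset of the external memory with the selected patterns
Algorithm 1: Pseudocode explaining how the external memory is populated across the training batches. Note that the number of patterns to add progressively decreases to maintain a nearly balanced contribution from the different training batches, but no constraints are enforced to achieve class balancing.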
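The memory update of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the rule h = memory_size // i for the number of patterns swapped at batch i is an assumption chosen to match the note that the amount added progressively decreases, the names `update_memory` and `memory_size` are hypothetical, and patterns are represented by (batch id, frame id) pairs instead of images.

```python
import random

def update_memory(memory, batch, batch_index, memory_size, rng):
    """Return the external memory after processing one training batch.

    Assumption: at batch i (1-based) h = memory_size // i patterns are swapped,
    so that every batch seen so far keeps a roughly equal share of the memory.
    """
    h = min(memory_size // batch_index, len(batch))
    added = rng.sample(batch, h)
    if len(memory) + h <= memory_size:
        # Memory not yet full: simply append the new patterns.
        return memory + added
    # Memory full: drop a random subset of stored patterns to make room.
    n_drop = len(memory) + h - memory_size
    drop_idx = set(rng.sample(range(len(memory)), n_drop))
    kept = [p for k, p in enumerate(memory) if k not in drop_idx]
    return kept + added

rng = random.Random(0)
memory = []
for i in range(1, 11):
    batch = [(i, k) for k in range(300)]  # 300 frames per batch, as in NICv2 – 391
    # Here the model would be trained on the shuffled union of batch and memory.
    memory = update_memory(memory, batch, i, 1500, rng)
```

With these settings the memory fills up during the first five batches and then stays at 1500 patterns, while later batches contribute progressively fewer patterns.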

To understand the influence of the external memory size we repeated the experiment with different values: 500, 1000, 1500, 3000. Since rehearsal itself protects the model from forgetting we also ran AR1* (where important weights of lower layers are protected from forgetting by using Synaptic Intelligence [28] regularization) without Synaptic Intelligence protection, that is, the weights of the lower layers are left totally unconstrained; in the following we denote this approach as AR1*free. The results are shown in Figure 4: it is worth noting that increasing the rehearsal memory leads to better accuracy for all the algorithms, but the gap between 1500 and 3000 is not large and we believe 1500 is a good trade-off for this dataset. AR1*free works slightly better than AR1* when a sufficient number of rehearsal patterns are provided but, as expected, accuracy is worse with a small rehearsal memory or no rehearsal.

It is worth noting that the best combination in Figure 4 (AR1*free with 3000 patterns) is only 5% worse than the cumulative upper bound and a better parametrization and exploitation of the rehearsal memory could further reduce this gap.

4 Latent Replay

In deep neural networks the layers close to the input (often denoted as representation layers) usually perform low-level feature extraction and, after a proper pre-training on a large dataset (e.g., ImageNet), their weights are quite stable and reusable across applications. On the other hand, higher layers tend to extract class-specific discriminant features and their tuning is often important to maximize accuracy.

With latent replay (see Figure 1) we denote an approach where, instead of maintaining in the external memory copies of input patterns in the form of raw data, we store the activation volumes at a given layer (denoted as Latent Replay layer). To keep the representation stable and the stored activations valid we propose to slow down the learning at all the layers below the latent replay one and to leave the layers above free to learn at full pace. In the limit case where low layers are completely frozen (i.e., slow-down to 0) latent replay is functionally equivalent to rehearsal from the input (hereafter denoted as native rehearsal), but achieves a computational and storage saving thanks to the smaller fraction of patterns that need to flow forward and backward across the entire network and the typical information compression that networks perform at higher layers.

In the general case where the representation layers are not completely frozen the activations stored in the external memory suffer from an aging effect (i.e., as the time passes they tend to increasingly deviate from the activations that the same pattern would produce if feed-forwarded from the input layer). However, if the training of these layers is sufficiently slow the aging effect is not disruptive since the external memory has enough time to be rejuvenated with fresh patterns. When latent replay is implemented with mini-batch SGD training: (i) in the forward step, a concatenation is performed at the replay layer (on the mini-batch dimension) to join patterns coming from the input layer with activations coming from the external storage; (ii) the backward step is stopped just before the replay layer for the replay patterns.
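The two mini-batch SGD steps described above can be illustrated with a toy model. The following sketch assumes a hypothetical two-stage linear network (a scalar weight `w_low` below the replay layer, a weight vector `w_up` above it) and a squared-error loss; these choices are made only to keep the gradients easy to check by hand and are not taken from the paper.

```python
def latent_replay_step(w_low, w_up, new_batch, replay_batch):
    """One forward/backward pass of a toy two-stage linear model.

    new_batch:    list of (x, target), x being a list of raw input features
    replay_batch: list of (z, target), z being a stored latent activation vector
    Returns the gradients (g_low, g_up) of the summed loss 0.5 * (out - target)^2.
    """
    # (i) Forward: only new patterns cross the lower stage, then the two
    # sources are concatenated on the mini-batch dimension.
    fresh = [([w_low * v for v in x], t, x) for x, t in new_batch]
    stored = [(z, t, None) for z, t in replay_batch]
    joined = fresh + stored

    g_low = 0.0
    g_up = [0.0] * len(w_up)
    for z, t, x in joined:
        out = sum(w * v for w, v in zip(w_up, z))
        err = out - t
        for j, v in enumerate(z):
            g_up[j] += err * v  # the upper stage learns from every pattern
        if x is not None:
            # (ii) Backward: the gradient reaches the lower stage only for
            # patterns that came from the input, not for replay patterns.
            g_low += sum(err * w * v for w, v in zip(w_up, x))
    return g_low, g_up

g_low, g_up = latent_replay_step(
    2.0, [1.0, 0.0],
    new_batch=[([1.0, 3.0], 0.0)],
    replay_batch=[([4.0, 5.0], 1.0)],
)
```

In this example the replay pattern changes the gradient of the upper weights but leaves the gradient of the lower weight untouched, which is the saving latent replay relies on.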

5 Training CWR* and AR1* with Latent Replay

While the proposed latent replay is architecture agnostic, hereafter we discuss a specific design with the CWR*, AR1* and AR1*free approaches over a MobileNetV1 CNN [6] pre-trained on ImageNet-1K:

  • for all the methods the output layer (fc7) must be implemented as a double memory with proper (pre)initialization and (post)fusion for each training batch (for details see CWR* pseudocode in Algorithm 2 of [17]);

  • for CWR* the latent replay layer is the second-last layer (i.e., pool6);

  • for AR1* and AR1*free the latent replay layer can be pushed down and selected according to the accuracy-efficiency trade-off discussed below;

  • for AR1*free the Synaptic Intelligence regularization is switched off.

Figure 5: AR1*free with latent replay (rehearsal memory of 1500 patterns) for different choices of the latent replay layer. Setting the replay layer at the pool6 layer makes AR1*free equivalent to CWR*. Setting the replay layer at the “images” layer corresponds to native rehearsal (same curve of Figure 4 for AR1*free and 1500 patterns). The saturation effect which characterizes the last training batches is due to the data distribution in NICv2 – 391 (see [17]): in particular, the lack of new instances for some classes (that already introduced all their data) slows down the accuracy trend and intensifies the effect of activation aging.

To simplify the network design and training we keep the proportion of original and replay patterns fixed: for example, if the training batches contain 300 patterns and the external memory 1500 patterns, in a mini-batch of size 128 we concatenate 21 (128 × 300 / 1800) original patterns with 107 (128 × 1500 / 1800) replay patterns. In this case only 21 patterns (out of 128) need to travel across the blue layers in Figure 1.
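The fixed proportion can be computed with a few lines of Python. The function name `split_mini_batch` is hypothetical, and rounding to the nearest integer is an assumption that reproduces the figures quoted in the text.

```python
def split_mini_batch(mini_batch_size, batch_patterns, memory_patterns):
    """Split a mini-batch between original and replay patterns.

    The share of original patterns equals the share of the current training
    batch in the union of training batch and external memory.
    """
    total = batch_patterns + memory_patterns
    n_original = round(mini_batch_size * batch_patterns / total)
    return n_original, mini_batch_size - n_original

# NICv2 – 391 setting: 300 new patterns, 1500 in memory, mini-batch of 128.
print(split_mini_batch(128, 300, 1500))
```

The same rule applied to the smartphone setting of Section 5.4 (100 frames, 500 stored patterns, mini-batch of 120) gives 20 original frames and 100 replay patterns.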

Layer      | Computation % vs Native Rehearsal | Pattern Size | Final Accuracy % | Accuracy % vs Native Rehearsal
Images     | 100.00%                           | 49152        | 77.30%           | 0.00%
conv5_1/dw | 59.261%                           | 32768        | 72.82%           | -4.49%
conv5_2/dw | 50.101%                           | 32768        | 73.21%           | -4.10%
conv5_3/dw | 40.941%                           | 32768        | 73.22%           | -4.09%
conv5_4/dw | 31.781%                           | 32768        | 72.24%           | -5.07%
conv5_5/dw | 22.621%                           | 32768        | 68.59%           | -8.71%
conv5_6/dw | 13.592%                           | 8192         | 65.24%           | -12.06%
conv6/dw   | 9.012%                            | 16384        | 59.89%           | -17.42%
pool6      | 0.027%                            | 1024         | 59.76%           | -17.55%
Table 1: Computation, storage, and accuracy trade-off with Latent Replay at different layers of a MobileNetV1 ConvNet trained continually on NICv2 – 391 with a rehearsal memory of 1500 patterns. Computation and pattern size can be easily extrapolated from Table 5 in the appendix, where the network architecture is detailed by reporting neurons, connections and weights at each layer.

Concerning the learning slow-down in the representation layers, we found that an effective (and efficient) strategy is blocking the weight changes after the first batch (i.e., learning rate set to 0), while leaving the batch normalization moments free to adapt to the statistics of the input patterns across all the batches. Batch Normalization (BN) [8] is widely used in modern deep neural networks (including MobileNets) to control internal covariate shift, thus making learning faster and more robust. Replacing BN with Batch Renormalization (BRN) [7] proved to be a very important step for effective continual learning with fine-grained non-i.i.d. batches [17], so in the MobileNetV1 adopted here BN layers have been replaced with BRN layers. In the context of latent replay, if we leave the BRN moments free to adapt, the activations stored in the external memory suffer the aging effect described in Section 4. However, we experimentally verified that, upon proper setting of the global moment mobile windows (more details are provided in the additional material), the accuracy drop due to the aging effect is quite limited and in any case the final accuracy is higher w.r.t. the case where BRN moments in the representation layers are frozen. On the computational side, blocking the weight changes in the representation layers allows skipping the backward pass in the lower part of the network also for native patterns, since updating the BRN moments only relies on the forward pass.
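To make the last point concrete, the sketch below shows a normalization layer whose learned weights are frozen while its global moments keep adapting during the forward pass. It is a deliberately simplified stand-in: it uses plain moving averages over scalar activations rather than the full Batch Renormalization of [7], and the class name `FrozenNormLayer` and the parameter `update_rate` are hypothetical.

```python
class FrozenNormLayer:
    """Normalization layer with frozen weights and adaptive global moments."""

    def __init__(self, scale, shift, update_rate):
        self.scale = scale              # learned weight, never updated here
        self.shift = shift              # learned weight, never updated here
        self.mean = 0.0                 # global moments, updated at every forward
        self.var = 1.0
        self.update_rate = update_rate  # width of the moving-average window

    def forward(self, values):
        batch_mean = sum(values) / len(values)
        batch_var = sum((v - batch_mean) ** 2 for v in values) / len(values)
        r = self.update_rate
        # Moment update: needs only forward statistics, no gradients.
        self.mean = (1 - r) * self.mean + r * batch_mean
        self.var = (1 - r) * self.var + r * batch_var
        std = (self.var + 1e-5) ** 0.5
        return [self.scale * (v - self.mean) / std + self.shift for v in values]

layer = FrozenNormLayer(scale=1.0, shift=0.0, update_rate=0.5)
out = layer.forward([2.0, 4.0])
```

Because the update touches only `mean` and `var`, no backward pass through this layer is required, and a smaller `update_rate` makes the moments (and hence the stored activations) age more slowly.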

In Figure 5 we report the accuracy of AR1*free with latent replay (rehearsal memory of 1500 patterns) for different choices of the rehearsal layer (reported in parentheses). As expected, when the replay layer is pushed down the corresponding accuracy increases, proving that a continual tuning of the representation layers is important. However, after conv5_4/dw there is a sort of saturation and the model accuracy no longer improves. The residual gap (4%) with respect to native rehearsal is not due to the freezing of the weights in the lower part of the network but to the aging effect introduced above. This can be simply proved by implementing an “intermediate” approach that always feeds the replay patterns from the input and stops the backward pass at conv5_4: such an intermediate approach achieved an accuracy at the end of the training very close to that of native rehearsal. We believe that the accuracy drop due to the aging effect can be further reduced with better tuning of BRN hyper-parameters and/or with the introduction of a scheduling policy making the global moment mobile windows wider as the continual learning progresses (i.e., more plasticity in the early stages and more stability later); however, such fine optimization is application specific and beyond the scope of this study.

Strategy              | Run Time (Minutes) | Mem. Overhead (Data + Params, MB) | Final Accuracy % | Acc. % vs Cumulative
CWR*                  | 21.4               | 0 + 0.2                           | 56.99%           | -28.27%
AR1*free (pool6)      | 23.7               | 5.8 + 12.4                        | 59.75%           | -25.51%
AR1*                  | 39.9               | 0 + 12.4                          | 56.32%           | -28.94%
AR1*free (conv5_4/dw) | 41.2               | 48 + 0                            | 72.23%           | -13.03%
DSLDA                 | 79.1               | 0 + 0.2                           | 48.02%           | -37.24%
iCaRL                 | 20185.0            | 375 + 0                           | 15.65%           | -69.61%
Table 2: Summary of the computation, memory, and accuracy trade-off for each strategy. Memory overhead includes both the data used for replay purposes and the additional trainable parameters needed for continual learning. Each metric is averaged across 10 runs.

5.1 On the Computation, Storage and Accuracy Trade-off

To better evaluate latent replay w.r.t. native rehearsal we report in Table 1 the relevant dimensions: (i) computation refers to the percentage cost in terms of ops of a partial forward (from the latent replay layer on) relative to a full forward step from the input layer; (ii) pattern size is the dimensionality of the pattern to be stored in the external memory (considering that we are using a MobileNetV1 with 128 × 128 × 3 inputs to match the CORe50 image size); (iii) final accuracy and accuracy vs native rehearsal quantify the absolute accuracy at the end of the training and the gap with respect to native rehearsal, respectively.

For example, conv5_4/dw exhibits an interesting trade-off because the computation is about 32% of the native rehearsal one, the storage is reduced to 66% (more on this point in subsection 5.2) and the accuracy drop is mild (5.07%). CWR* (i.e. AR1* with latent replay layer pool6) has a really negligible computational cost (0.027%) with respect to native rehearsal and still provides an accuracy improvement of 4% w.r.t. the non-rehearsal case (60% vs 56%, as can be seen from Figure 5 and Figure 4, respectively).

Figure 6: Sparsification of conv5_4/dw activations for different values of the relative weight of the L1 term and the corresponding accuracy after the first training batch.

5.2 Reducing Activations Storage in Latent Replay

Even if in our CORe50 case study the external storage is quite limited (e.g., 1500 × 32 KB = 48 MB for latent replay at conv5_4/dw), scaling up to applications with thousands of classes could require storing many more activations and the external memory could become an issue. Fortunately, high-layer activations can be sparsified, quantized and encoded with almost no accuracy reduction. The authors of [2] show that MobileNetV1 activations can be compressed up to 10 times upon proper sparsification, encoding and lossless entropy compression. In their experiments a moderate compression even leads to slightly improved accuracy because of the regularization introduced.

In the case of latent replay, we only need to sparsify the activations of the latent replay layer (and not of the entire network), potentially introducing a sort of information bottleneck. This can be easily achieved by adding an L1 term (with a relative weight) to the loss function, attracting toward zero the activations of the latent replay layer (see [2]). We performed some preliminary experiments to sparsify activations of layer conv5_4/dw during the first training batch starting from a non-sparsified ImageNet pre-trained model. Note that the weights of the latent replay layer and previous layers are frozen after the first training batch and no further sparsification can take place. The results are shown in Figure 6: with a zero weight (i.e., no induced sparsification) 52% of activations are non-zero due to the natural sparsification effect of the ReLU activation function and the accuracy is about 14%; as we increase the weight the amount of non-zero activations starts decreasing. Interestingly, with a suitable weight we can reduce the non-zero activations from 52% to 37% while also achieving a slight accuracy improvement (0.22%). By adding quantization and entropy encoding (out of the scope of this work) we believe that, analogously to [2], a 10× compression is within reach with almost no accuracy loss.
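The L1 term and the sparsity measure discussed above can be sketched as follows. The function names and the parameter `l1_weight` are hypothetical, and averaging the absolute activations (rather than summing them) is an assumption made only so that the weight does not depend on the layer size.

```python
def loss_with_l1(task_loss, latent_activations, l1_weight):
    """Add an L1 penalty on the latent replay layer activations to the task loss.

    latent_activations: one list of activation values per pattern in the mini-batch.
    """
    n_values = sum(len(a) for a in latent_activations)
    l1_term = sum(abs(v) for a in latent_activations for v in a) / n_values
    return task_loss + l1_weight * l1_term

def nonzero_fraction(latent_activations):
    """Fraction of activations that are non-zero (the quantity tracked in Figure 6)."""
    n_values = sum(len(a) for a in latent_activations)
    n_nonzero = sum(1 for a in latent_activations for v in a if v != 0.0)
    return n_nonzero / n_values

activations = [[0.0, 2.0], [4.0, 0.0]]  # toy post-ReLU activations
total = loss_with_l1(1.0, activations, l1_weight=0.5)
```

Setting `l1_weight` to zero recovers the plain task loss, which corresponds to the case with no induced sparsification.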

5.3 Comparison with Other Approaches

Figure 7: Accuracy results on the CORe50 NICv2 – 391 benchmark of CWR*, AR1*, DSLDA, iCaRL, AR1*free (conv5_4), AR1*free (pool6). Results are averaged across 10 runs in which the batches order is randomly shuffled. Colored areas indicate the standard deviation of each curve. As an exception, iCaRL was trained only on a single run given its extensive run time (14 days).

While the accuracy improvement of the proposed approach w.r.t. state-of-the-art rehearsal-free techniques has already been discussed in the previous sections, a further comparison with other state-of-the-art continual learning techniques may be beneficial for better appreciating its practical impact and advantages. In particular, a comparison with iCaRL, one of the best-known rehearsal-based techniques, is worth considering.

Unfortunately, iCaRL was conceived for the incremental class learning scenario and its porting to NIC (whose batches also include patterns of known classes) is not trivial. To avoid subjective modifications, we started from the code shared by the authors and emulated a NIC setting by: (i) always creating new virtual classes from patterns in the incoming batches; (ii) fusing virtual classes together when evaluating accuracies. For example, let us suppose we encounter 300 patterns of class 5 in batch 2 and another 300 patterns of the same class in batch 7; while two virtual classes are created by iCaRL during training, when evaluating accuracy both classes point to the real class 5. The iCaRL implementation modified in this way, with an external memory of 8000 patterns (much more than the 1500 used by the proposed latent replay, but in line with the settings proposed in the original paper [23]), was run on NICv2 – 391, but we were not able to obtain satisfactory results. In Figure 7 we report the iCaRL accuracy over time and compare it with AR1*free (conv5_4/dw), AR1*free (pool6) as well as the top three performing rehearsal-free strategies introduced before: CWR*, AR1* and DSLDA. While iCaRL exhibits better performance than LWF and EWC (refer to Figure 2, right), it is far from DSLDA, CWR* and AR1*.
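The virtual-class emulation described above amounts to a simple bookkeeping scheme, sketched below. This is an illustration and not the modified iCaRL code: the helper names `register_batch` and `fused_accuracy` and the list-based mapping are hypothetical.

```python
def register_batch(virtual_to_real, real_labels_in_batch):
    """Create one new virtual class for each real class present in a batch.

    virtual_to_real is a list whose index is the virtual class id and whose
    value is the real class it points to. Returns {real class: virtual id}.
    """
    new_ids = {}
    for real in sorted(set(real_labels_in_batch)):
        new_ids[real] = len(virtual_to_real)
        virtual_to_real.append(real)
    return new_ids

def fused_accuracy(virtual_predictions, real_targets, virtual_to_real):
    """Accuracy after fusing virtual classes back into their real class."""
    hits = sum(virtual_to_real[v] == t
               for v, t in zip(virtual_predictions, real_targets))
    return hits / len(real_targets)

mapping = []
first = register_batch(mapping, [5] * 300)   # class 5 seen in an early batch
second = register_batch(mapping, [5] * 300)  # class 5 seen again later
accuracy = fused_accuracy([0, 1, 1], [5, 5, 3], mapping)
```

Both virtual classes 0 and 1 point to real class 5, so predicting either of them for a class-5 test pattern counts as correct.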

Furthermore, when the algorithm has to deal with such a large number of classes (including virtual ones) and training batches its efficiency becomes very low. In Table 2 we also report the total run time (training and testing), memory overhead and accuracy difference with respect to the cumulative upper bound. We believe AR1*free (conv5_4/dw) represents a good trade-off in terms of efficiency-efficacy with a limited computational-memory overhead and only a 13% distance from the cumulative upper bound. For iCaRL the total training time was 14 days compared to a training time of less than 1 hour for the other techniques. Finally, a slight variant of the proposed approach was submitted to the recent IROS 2019 Competition on “Lifelong Robotic Vision” challenge based on the OpenLORIS dataset (partially undisclosed) [27] and scored among the best performing techniques (more details will be provided in the camera-ready version to avoid disclosing the authors' identity).

5.4 Real-World Deployment on Smartphone Devices

The feasibility of continual learning on the edge is demonstrated through the development of an app (called CORe) for Android smartphones (see Figure 8). The app will be open-sourced and uploaded to the Google Play store upon publication of this manuscript, along with a short video showcasing its main functionalities.

Figure 8: The user interface of the CORe app. The camera field of view is partially grayed to highlight the central area where the image is cropped, resized to 128 × 128 and passed to the CNN. The top three categories are returned for each image (closed set classification) and a green frame is placed around the icon of the most likely class. A training session is triggered by tapping the icon of one of the 10 existing classes or one of the (initially) five empty classes.

The app comes pre-trained with 10 classes (corresponding to the 10 CORe50 categories) and allows the user to: (i) continually train existing classes (by learning new objects/poses) and (ii) learn up to 5 brand new classes. As the app is launched it switches to inference mode and classifies the framed objects with an inference efficiency of about 5 fps (CPU-only with no hardware acceleration). When learning is triggered a short video of 20 seconds (at 5 fps) is acquired and the resulting 100 frames are used for continual learning, which completes in less than 1 second after the end of the acquisition.

Behind the scenes the app is running a customized Caffe version cross-compiled for Android and using the same MobileNetV1 architecture introduced in Section 5, here initialized to work with 15 classes. Latent replay in this case is implemented at the pool6 layer (placing the latent replay layer at pool6 corresponds to extending CWR* with latent replay, and leads to maximum efficiency on a CPU-only edge device) with an external memory of 500 patterns. Low level code is written in C++ and the app interface in Java. A training session consists of 8 epochs, 5 iterations per epoch with a mini-batch size of 120 patterns: each mini-batch includes 20 original frames and 100 replay patterns. In order to speed up training during the video acquisition, a second thread immediately moves available frames forward in the CNN and caches activations at the latent replay layer so that, when the acquisition is concluded, we can directly train the class specific discriminative layers. Further details and precise timing of the different phases are provided in the additional materials.
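The caching thread can be illustrated with a small producer/consumer sketch. The app itself is written in C++ and Java, so the Python below only mirrors the idea: `lower_layers` is a hypothetical stand-in for the layers below the latent replay layer, and frames are single-value lists instead of images.

```python
import queue
import threading

def lower_layers(frame):
    # Hypothetical stand-in for the forward pass up to the latent replay layer.
    return [2.0 * v for v in frame]

def cache_worker(frames, cache):
    """Second thread: forward each frame as soon as it is available."""
    while True:
        frame = frames.get()
        if frame is None:  # sentinel: the acquisition is over
            break
        cache.append(lower_layers(frame))

frames = queue.Queue()
cache = []
worker = threading.Thread(target=cache_worker, args=(frames, cache))
worker.start()

for k in range(100):  # 20 seconds of video at 5 fps
    frames.put([float(k)])
frames.put(None)
worker.join()

# Training can now start directly from the cached latent activations.
iterations_per_epoch = len(cache) // 20  # 20 original frames per mini-batch
total_iterations = 8 * iterations_per_epoch
```

With 100 cached frames and 20 original frames per mini-batch the schedule gives the 5 iterations per epoch reported above, that is 40 iterations over the 8 epochs.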

6 Conclusions

In this paper we showed that latent replay is an efficient technique to continually learn new classes and new instances of known classes even from small and non i.i.d. batches. State-of-the-art CL approaches such as AR1*, extended with latent replay, are able to learn efficiently and, at the same time, the achieved accuracy is not far from the cumulative upper bound (about 5% in some cases). The computation-storage-accuracy trade-off can be defined according to the application and the available resources so that even edge devices with no GPUs can learn continually from short videos, as we proved through the development of an Android application. In the future we intend to investigate: (i) the design of more sophisticated pattern replacement strategies for the external memory to counteract the aging effect; (ii) replacing the external memory with a generative model trained in the loop and capable of providing pseudo activations on demand.


Appendix A Implementation and Experiments Details

For each of the proposed strategies a test accuracy curve was obtained by averaging over 5 different runs. We followed the same experimental setup as in [17] so that each run differs from the others by the order of the encountered batches.

Our experiments were executed in an "Ubuntu 16.04" Docker environment with a customized version of the Caffe [9] framework using a single GPU. See Table 4 for more details of the host setup.

Appendix B Hyperparameters

The hyperparameters used in our experiments are described in Tables 3 and 6. Please note that:

  • We used the same naming scheme used in [19].

  • For AR1* and AR1* free we used a higher learning rate for the CWR layer, as described in [17].

  • In order to optimize the results for the two different rehearsal types (native rehearsal and latent replay) we chose two different values for the moving average update rate of the BatchReNorm layers [7]. We found that a higher value of the update rate was better suited to the latent version.

  • Excluding the aforementioned update rate, we used the same hyperparameters for both the native and latent rehearsal-based experiments.
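To give a feeling for the two update rates reported in Table 6, the sketch below treats the rate as the weight kept on the previous running statistic in an exponential moving average. This convention, and the function names, are assumptions made for illustration: the exact update rule of the custom BatchReNorm layer is not spelled out here.

```python
def ema_update(running, batch_stat, rate):
    """One moving-average step, assuming `rate` weights the old estimate."""
    return rate * running + (1.0 - rate) * batch_stat


def effective_window(rate):
    """Approximate number of updates the moving average looks back over."""
    return 1.0 / (1.0 - rate)


# Values from Table 6.
native_window = effective_window(0.9999)    # about 10,000 updates
latent_window = effective_window(0.99995)   # about 20,000 updates
```

Under this reading, the higher rate used with latent replay roughly doubles the averaging horizon, so the normalization statistics drift more slowly.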

Appendix C Model Architecture and Memory Trade-off

In order to assess the trade-off between accuracy, computation and memory usage we ran AR1* free using different latent replay layers. Table 5 shows the details of the model we used, which is based on MobileNetV1 [6], the only difference being that Batch Norm layers have been replaced with Batch ReNorm ones. Here we report the network architecture as well as the pattern size and the number of weights per layer.

Appendix D Android Application Setup and Performance

The CORe Android application has been tested on a OnePlus 6 smartphone without additional accelerators. Table 7 shows the details about this hardware platform.

In Table 9 we report the overall time taken for the different inference and training steps as well as the detected peak RAM usage. Note that step times, CPU usage and memory consumption may vary greatly depending on the hardware, the operating system and other processes running in the background.

In our experiments we used a customized version of Caffe compiled for the arm64-v8a platform using OpenBLAS as the BLAS library. Additional information about the libraries used can be found in Table 8.

Parameters                         MobileNet V1
Head                               Maximal
: epochs, (learn. rate)            4, 0.001
: epochs, (learn. rate)            4, 0.003

Parameters                         MobileNet V1
Head                               Maximal
                                   0.5, 0.5
: epochs, (learn. rate)            4, 0.001
: epochs                           4
      (learn. rate, CWR layer)     0.003
      (learn. rate, other layers)  0.0003

AR1* free
Parameters                         MobileNet V1
Head                               Maximal
: epochs, (learn. rate)            4, 0.001
: epochs                           4
      (learn. rate, CWR layer)     0.003
      (learn. rate, other layers)  0.0003

Parameters                         MobileNet V1
Shrinkage                          1e-4

Parameters                         MobileNet V1
: epochs, (learn. rate)            40, 0.01
: epochs, (learn. rate)            4, 0.001
Table 3: Hyperparameter values used in our experiments. The selection was performed on run 0, and the hyperparameters were then fixed for the remaining runs. As an exception, owing to its long running time (around 14 days), iCaRL was trained only on run 0.
Component          Model/Version
Operating System   Debian 8.3
Docker             18.06.1
Nvidia Driver      430.40 (CUDA 9.0, CuDNN 7)
CPU                Intel(R) Xeon(R) CPU E5-2650
GPU                GTX 1080 Ti (11 GB VRAM)
RAM                64 GB DDR3 (1600 MHz)
Table 4: Experimental setup
Layer         Neurons   Ops        Weights
Images        49152     -          -
conv1         131072    3670016    896
conv2_1/dw    131072    1310720    320
conv2_1/sep   262144    8650752    2112
conv2_2/dw    65536     655360     640
conv2_2/sep   131072    8519680    8320
conv3_1/dw    131072    1310720    1280
conv3_1/sep   131072    16908288   16512
conv3_2/dw    32768     327680     1280
conv3_2/sep   65536     8454144    33024
conv4_1/dw    65536     655360     2560
conv4_1/sep   65536     16842752   65792
conv4_2/dw    16384     163840     2560
conv4_2/sep   32768     8421376    131584
conv5_1/dw    32768     327680     5120
conv5_1/sep   32768     16809984   262656
conv5_2/dw    32768     327680     5120
conv5_2/sep   32768     16809984   262656
conv5_3/dw    32768     327680     5120
conv5_3/sep   32768     16809984   262656
conv5_4/dw    32768     327680     5120
conv5_4/sep   32768     16809984   262656
conv5_5/dw    32768     327680     5120
conv5_5/sep   32768     16809984   262656
conv5_6/dw    8192      81920      5120
conv5_6/sep   16384     8404992    525312
conv6/dw      16384     163840     10240
conv6/sep     16384     16793600   1049600
pool6         1024      16384      0
fc7           50        51250      51250
Total         -         187.09M    3.35M
Table 5: The architecture of the model used in our experiments, with neurons, weights and ops for each layer. This information, along with the results reported in Table 1, can be used to identify the most appropriate trade-off between accuracy, computation and memory usage depending on the problem context.
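As an example of how the figures in Table 5 can be used, the Python sketch below estimates the size of a replay memory stored at a given layer and the forward operations still required for a replay pattern injected at that layer's output. The 500-pattern memory is the size used in the Android application; storing each value as a 4-byte float, reading the Ops column as forward-pass cost, and the helper names are assumptions made for the example.

```python
# (layer, pattern size in values, forward ops), copied from Table 5.
LAYERS = [
    ("Images", 49152, 0),
    ("conv1", 131072, 3670016),
    ("conv2_1/dw", 131072, 1310720),
    ("conv2_1/sep", 262144, 8650752),
    ("conv2_2/dw", 65536, 655360),
    ("conv2_2/sep", 131072, 8519680),
    ("conv3_1/dw", 131072, 1310720),
    ("conv3_1/sep", 131072, 16908288),
    ("conv3_2/dw", 32768, 327680),
    ("conv3_2/sep", 65536, 8454144),
    ("conv4_1/dw", 65536, 655360),
    ("conv4_1/sep", 65536, 16842752),
    ("conv4_2/dw", 16384, 163840),
    ("conv4_2/sep", 32768, 8421376),
    ("conv5_1/dw", 32768, 327680),
    ("conv5_1/sep", 32768, 16809984),
    ("conv5_2/dw", 32768, 327680),
    ("conv5_2/sep", 32768, 16809984),
    ("conv5_3/dw", 32768, 327680),
    ("conv5_3/sep", 32768, 16809984),
    ("conv5_4/dw", 32768, 327680),
    ("conv5_4/sep", 32768, 16809984),
    ("conv5_5/dw", 32768, 327680),
    ("conv5_5/sep", 32768, 16809984),
    ("conv5_6/dw", 8192, 81920),
    ("conv5_6/sep", 16384, 8404992),
    ("conv6/dw", 16384, 163840),
    ("conv6/sep", 16384, 16793600),
    ("pool6", 1024, 16384),
    ("fc7", 50, 51250),
]


def replay_memory_bytes(layer, num_patterns, bytes_per_value=4):
    """Storage needed to keep `num_patterns` activation volumes of `layer`."""
    sizes = {name: size for name, size, _ in LAYERS}
    return num_patterns * sizes[layer] * bytes_per_value


def ops_above(layer):
    """Forward ops of all layers after `layer` (cost of one replay pattern)."""
    names = [name for name, _, _ in LAYERS]
    start = names.index(layer) + 1
    return sum(ops for _, _, ops in LAYERS[start:])


total_ops = ops_above("Images")                   # 187090994, i.e. 187.09M
pool6_bytes = replay_memory_bytes("pool6", 500)   # 2048000 bytes
image_bytes = replay_memory_bytes("Images", 500)  # 98304000 bytes
pool6_replay_ops = ops_above("pool6")             # 51250 (fc7 only)
```

Under these assumptions a 500-pattern memory at pool6 takes about 2 MB, 48 times less than the same number of input images stored as floats, and a replay pattern only has to traverse fc7.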
Parameters                Latent Replay   Native Rehearsal
                          1.25            1.25
                          0.5             0.5
Moving Avg. update rate   0.99995         0.9999
Table 6: Batch ReNormalization parameters. The reported parameters were used in our experiments on the NICv2-391 scenario involving the CWR*, AR1* and AR1* free algorithms.
Component          Model/Version
Model              OnePlus 6 (A6000)
Release date       2018, May
Operating System   Android 9 (OxygenOS 9.0.9)
Chipset            Qualcomm SDM845 Snapdragon 845
CPU                Octa-core (4x2.8 GHz)
RAM                8 GB LPDDR4X, 1866 MHz
Table 7: The reference platform for the CORe App. In our tests we used a OnePlus 6 smartphone.
Used libraries
Library                 Version
Caffe (BVLC)            Mar, 2
OpenBLAS                0.3.6
OpenMP                  5.0.20140926
OpenCV                  4.1.1
Boost                   1.56.0
Gflags                  2.2.0
Glog                    0.3.5
LevelDB                 1.21.0
Protobuf                3.6.1
Snappy                  1.1.7

Build tools
Tool                    Version
Android Studio          3.5.2
Gradle Android Plugin   3.5.0
Android Ndk             20.0.5594570
Android Clang           8.0.7
Cmake                   3.10.2
Android Build tools     28.0.3
Table 8: The libraries used in our Android application. In our experiments and in the Android application we use the BVLC Caffe distribution with a custom BatchReNorm layer and extended pyCaffe bindings. We also report the tools used in the build process. Note that OpenBLAS was compiled with OpenMP support (provided in the Android Ndk).
Inference and training times (per step)
Step name                        Average time (ms)
Inference                        255.1
Features pre-extraction          202.3 (for each pattern)
Misc. training preparation       1.6 (overall)
Data feeding (at latent layer)   64.4 (8.05 per epoch)
Forward                          292.4 (36.55 per epoch)
Backward                         43.6 (5.45 per epoch)
Update                           1269.0 (158.63 per epoch)
Consolidation (CWR*)             1.0 (overall)

CPU and RAM usage
Phase                                          CPU usage   RAM usage
Inference                                      17%         225 MB
Image gathering (and feature pre-extraction)   25%         240 MB
Training                                       18%         260 MB
Table 9: The profiling information obtained while running the CORe App on the reference platform. The training times reported here were obtained by averaging over 5 incremental training sessions.
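The timings in Table 9 can be combined into a rough per-session budget. The Python sketch below assumes that the listed training steps run sequentially and account for the whole post-acquisition training phase, and that 100 frames are pre-extracted per session (5 iterations of 20 original frames); both are assumptions made for the example, as are the variable names.

```python
EPOCHS = 8  # epochs per training session

# Average time per training step over the whole session, in ms (Table 9).
TRAINING_STEPS_MS = {
    "Misc. training preparation": 1.6,
    "Data feeding (at latent layer)": 64.4,
    "Forward": 292.4,
    "Backward": 43.6,
    "Update": 1269.0,
    "Consolidation (CWR*)": 1.0,
}
PRE_EXTRACTION_MS_PER_PATTERN = 202.3  # done by the second thread

# Total training time once the video acquisition has finished.
training_total_ms = sum(TRAINING_STEPS_MS.values())

# Fraction of the training time spent in the weight update step.
update_share = TRAINING_STEPS_MS["Update"] / training_total_ms

# Per-epoch cost of the steps that are repeated at every epoch.
per_epoch_ms = {
    name: TRAINING_STEPS_MS[name] / EPOCHS
    for name in ("Data feeding (at latent layer)", "Forward", "Backward", "Update")
}


def pre_extraction_ms(num_frames):
    """Time to forward `num_frames` frames up to the latent replay layer."""
    return num_frames * PRE_EXTRACTION_MS_PER_PATTERN
```

Under these assumptions the training phase takes about 1.67 s, roughly three quarters of which is spent in the update step, while pre-extracting 100 frames takes about 20 s and overlaps with the video acquisition.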