Weakly Supervised Continual Learning

Matteo Boschini et al., 14 August 2021

Continual Learning (CL) investigates how to train Deep Networks on a stream of tasks without incurring catastrophic forgetting. The CL settings proposed in the literature assume that every incoming example is paired with ground-truth annotations. However, this clashes with many real-world applications: gathering labeled data, which is in itself tedious and expensive, becomes infeasible when data flow as a stream and must be consumed in real time. This work explores Weakly Supervised Continual Learning (WSCL): here, only a small fraction of the input examples are shown to the learner together with their labels. We assess how current CL methods (e.g. EWC, LwF, iCaRL, ER, GDumb, DER) perform in this novel and challenging scenario, in which overfitting entangles with forgetting. Subsequently, we design two novel WSCL methods that exploit metric learning and consistency regularization to leverage unsupervised data while learning. In doing so, we show not only that our proposals exhibit higher flexibility when supervised information is scarce, but also that less than 25% of labels is enough to reach or even outperform SOTA methods trained under full supervision.


I Introduction

Perceptual information flows as a continuous stream, in which a certain data distribution may occur once and not recur for a long time. Unfortunately, this violates the i.i.d. assumption at the foundation of most Deep Learning algorithms and leads to the catastrophic forgetting problem [mccloskey1989catastrophic], where previously acquired knowledge is rapidly overwritten while incorporating new knowledge. In practical scenarios, we would rather need a system that learns incrementally from the raw and non-i.i.d. stream of data, possibly ready to provide answers at any moment. The study and design of such lifelong-learning algorithms is the main concern of Continual Learning (CL) [parisi2019continual, de2019continual].

Works in this field typically test the proposed methods on a series of image-classification tasks that are presented sequentially. The latter are built on top of conventional image classification datasets (e.g. MNIST, CIFAR, etc.) by allowing the learner to see just a subset of classes at once. While these experimental protocols validly highlight the effects of forgetting, they assume that all incoming data are labeled.

In some scenarios, this condition does not represent an issue and can be easily met. This may be the case when ground-truth annotations can be directly and automatically collected (e.g. a robot learning to avoid collisions while navigating by receiving direct feedback from the environment [aljundi2019task]). However, when the labeling stage involves human intervention (as holds for a number of computer vision tasks such as classification, object detection [shmelkov2017incremental, zhou2020lifelong], etc.), relying only on full supervision clashes with the pursuit of lifelong learning. Indeed, the adaptability of the learner to incoming tasks would be bottlenecked by the speed of the human annotator: updating the model continually would lose its appeal w.r.t. the trivial solution of re-training from scratch. Therefore, we advocate taking into account the rate at which annotations become available on the stream.

To address this point, the adjustment of the prediction model can simply be limited to the fraction of examples that can be labeled in real time. Our experiments show that this results in an expected degradation in performance. Fortunately, the efforts recently made in semi-supervised learning [olivier2006semi, bachman2014learning, tarvainen2017mean] come to the rescue: by revising these techniques for an incremental scenario, we can still benefit from the remaining part of the data, represented by unlabeled observations. We argue that this is true to the lifelong nature of the application and also allows for exploiting the abundant source of information given by unlabeled data.

Fig. 1: An overview of the Weakly Supervised Continual Learning (WSCL) setting introduced in this paper. Input batches include both labeled (green) and unlabeled (red) examples (best seen in color).

To sum up, our work incorporates the features described above in a new setting called Weakly Supervised Continual Learning (WSCL): a scenario where only a fraction of the incoming examples is associated with its ground-truth label. At training time, this corresponds to providing a ground-truth label for any given example with a fixed uniform probability (as exemplified in Fig. 1).

Taking one more step, we propose two techniques that aim at filling the gap induced by partial annotations: Continual Interpolation Consistency (CIC), which imposes consistency among augmented and interpolated examples, and Contrastive Continual Interpolation Consistency (CCIC), which exploits secondhand information peculiar to the Class-Incremental setting. In doing so, we attain performance that matches and even surpasses that of the fully supervised setting. Our contributions can be summarized as follows:

  • We propose WSCL, which is, to the best of our knowledge, the first scenario in which the learner must learn continually by exploiting both supervised and unsupervised data at the same time;

  • We empirically review the performance of SOTA CL models at varying label-per-example rates, highlighting the subtle differences between CL and WSCL;

  • Exploiting semi-supervised techniques, we introduce novel WSCL methods that successfully address the new setting and keep on learning even when labels are scarce;

  • Surprisingly, our evaluations show that full supervision does not necessarily upper-bound partial supervision in CL: a limited amount of labels can be enough to outperform SOTA methods that use all the ground truth.

The rest of the paper is organized as follows: in Sec. II, we review the relevant literature and outline the main features of CL protocols, CL methods and Semi-Supervised Learning approaches; we formalize WSCL and describe our proposed approaches in Sec. III; we report our experimental results and comment on them in Sec. IV; further ablative experiments are conducted in Sec. V; in Sec. VI, we wrap up our findings and present our conclusions.

II Related Work

II-A Continual Learning Protocols

Continual Learning is an umbrella term encompassing several slightly yet meaningfully different experimental settings [farquhar2018towards, van2019three]. Van de Ven et al. produced a taxonomy [van2019three] describing the following three well-known scenarios. Task-Incremental Learning (Task-IL) organizes the dataset into tasks comprising disjoint sets of classes. The model must only learn (and remember) how to correctly classify examples within their original tasks. Domain-Incremental Learning (Domain-IL) presents all classes from the first task, instead of introducing different ones at different times: distinct tasks are obtained by processing the examples with distinct transformations (e.g. pixel permutations or image rotations) which change the input distribution. Class-Incremental Learning (Class-IL) operates on the same assumptions as Task-IL, but requires the learner to classify an example from any of the previously seen classes with no hints about its original task. Unlike Task-IL, this means that the model must learn the joint dataset distribution from partial observations, making this the hardest of the three scenarios [van2019three]. In this work, we focus our investigations on limited labels within the Class-IL formulation and briefly address the Task-IL setting in Appendix C.

Towards realistic setups. Several recent works point out that these classic settings lack realism [aljundi2019gradient, kj2020meta] and consequently define new scenarios by imposing restrictions on what models are allowed to do while learning. Online Continual Learning forbids multiple epochs on the training data on the grounds that real-world CL systems would never see the same input twice [lopez2017gradient, riemer2018learning, chaudhry2019tiny, chaudhry2020using, kj2020meta]. Task-Free Learning does not provide task identities either at inference or at training time [he2019task, aljundi2018memory, aljundi2019gradient]. This is in contrast with the classic settings that signal task boundaries to the learner while training, thus allowing it to prepare for the beginning of a new task.

In proposing WSCL, we, too, aim at providing a more realistic setup. However – instead of focusing on model limitations – we acknowledge that providing labels in real-time is problematic and hinders the extension of CL algorithms to in-the-wild scenarios.

Continual Learning with Unsupervised Data. Some attempts have been recently made at improving CL methods by exploiting unlabeled data. Zhang et al. proposed the Deep Model Consolidation framework [zhang2020class]; in it, a new model is first specialized on each new encountered task, then a unified learner is produced by distilling knowledge from both the new specialist and the previous incremental model. For this second pass, the authors employ auxiliary unlabeled data, which must be generic and diversified to facilitate transfer. Alternatively, Lechat et al. introduced Semi-Supervised Incremental Learning [lechat2021semi], which alternates unsupervised feature learning on both input and auxiliary data with supervised classification. For this purpose, label-agnostic representations are clustered and assigned ground-truth labels according to similarity.

We remark that both these settings are significantly different from our proposed WSCL as we do not separate the supervised and unsupervised training phases. On the contrary, we intertwine both kinds of data in all drawn batches in varying proportions and require that the model learns from both at the same time. Additionally, we do not exploit auxiliary unsupervised external data to supplement the training set; instead, we reduce the original supervised data available to a fraction, thus modeling supervision becoming available at a much slower rate on the input stream of information.

II-B Continual Learning Methods

Continual Learning methods have been chiefly categorized into three families according to how they counter catastrophic forgetting [farquhar2018towards, de2019continual].

Architectural methods employ architectures that are designed to avoid forgetting, e.g. by dynamically increasing the number of parameters [rusu2016progressive, serra2018overcoming, fernando2017pathnet] or devoting a part of them to each task [mallya2018packnet]. While usually very effective, they depend on the availability of task labels at prediction time to prepare the model for inference, which limits their applicability to Task-IL.

Regularization methods condition the evolution of the model to prevent it from forgetting previous tasks. This is attained either by identifying important weights for each task and preventing them from changing in later ones [kirkpatrick2017overcoming, zenke2017continual] or by distilling the knowledge from previous model snapshots to preserve the past responses [li2017learning, schwarz2018progress].

Finally, rehearsal methods maintain a fixed-size working memory of previously encountered exemplars and recall them to prevent forgetting [ratcliff1990connectionist]. This simple solution has been expanded upon in many ways, e.g. by adopting advanced memory management policies [aljundi2019online, aljundi2019gradient, buzzega2020rethinking], exploiting meta-learning algorithms [riemer2018learning], combining replay with knowledge distillation [rebuffi2017icarl, benjamin2018measuring, buzzega2020dark], or using the memory to train the model in an offline fashion [prabhu2020gdumb].

II-C Semi-Supervised Learning

Semi-Supervised Learning studies how to improve supervised learning methods by leveraging additional unlabeled data. We exploit the latter in light of specific assumptions on how input and labels interact [olivier2006semi]. By assuming that close input data-points should correspond to similar outputs (smoothness assumption), consistency regularization encourages the model to produce consistent predictions for the same data-point. This principle can be applied either by comparing the predictions on the same exemplar by different learners [laine2016temporal, tarvainen2017mean] or the predictions on different augmentations of the same data-point by the same learner [bachman2014learning, sajjadi2016regularization, berthelot2019mixmatch, sohn2020fixmatch, zbontar2021barlow].

III Weakly Supervised Continual Learning

A supervised Continual Learning classification problem can be defined as a sequence composed of T tasks. During each of the latter (t \in \{1, \dots, T\}), input samples x and their corresponding ground-truth labels y are drawn from an i.i.d. distribution \mathcal{D}_t. Considering a function f with parameters \theta, we indicate its responses (logits) with h_\theta(x) and the corresponding probability distribution over the classes with f_\theta(x) \triangleq \operatorname{softmax}(h_\theta(x)). The goal is to find the optimal value of the parameters such that f_\theta performs best on average on all tasks without incurring catastrophic forgetting; formally, we need to minimize the empirical risk over all tasks:

\operatorname*{argmin}_\theta \; \sum_{t=1}^{T} \mathbb{E}_{(x, y) \sim \mathcal{D}_t} \big[ \ell\big(y, f_\theta(x)\big) \big], (1)

where \ell denotes the classification loss (cross-entropy in the following).

Experience Replay. To retain knowledge from past tasks, we build our proposals on top of a rehearsal strategy: Experience Replay (ER) [ratcliff1990connectionist, riemer2018learning]. In practice, we equip the learner with a small memory buffer (based on reservoir sampling) and interleave a batch of examples drawn from it with current training data. Among all possible approaches, we opt for ER due to its lightweight design and effectiveness [chaudhry2019tiny, buzzega2020rethinking]. In the following, we explain how we extend ER to take unlabeled examples into account.
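As a concrete illustration of the rehearsal component, a reservoir-sampling buffer can be sketched as follows; this is a minimal Python example, and the class and method names are ours rather than the authors' implementation.

```python
import random

class ReservoirBuffer:
    """Fixed-size memory filled via reservoir sampling: after n stream examples,
    every one of them has the same probability buffer_size / n of being stored."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.examples = []   # stored (x, y) pairs
        self.num_seen = 0    # total number of stream examples observed so far

    def add(self, x, y):
        self.num_seen += 1
        if len(self.examples) < self.buffer_size:
            self.examples.append((x, y))
        else:
            idx = random.randrange(self.num_seen)
            if idx < self.buffer_size:
                self.examples[idx] = (x, y)

    def sample(self, batch_size):
        k = min(batch_size, len(self.examples))
        return random.sample(self.examples, k)
```

At every step, ER would call add on the incoming labeled examples and train on the union of the current input batch and buffer.sample(batch_size).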

In Weakly Supervised Continual Learning, we propose to distribute the samples coming from \mathcal{D}_t into two sets: \mathcal{D}_t^S, which contains a limited amount of labeled samples paired with their ground-truth labels (x_s, y_s), and \mathcal{D}_t^U, containing the rest of the samples x_u without any annotation. We define this split according to a given proportion p_s that remains fixed across all tasks:

p_s = \frac{|\mathcal{D}_t^S|}{|\mathcal{D}_t^S| + |\mathcal{D}_t^U|}, (2)

where |\cdot| indicates set cardinality. The objective of WSCL is optimizing Eq. 1 without having access to the ground-truth supervision signal for \mathcal{D}_t^U.
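As an illustration, such a split can be emulated on a standard dataset by retaining the labels of a fixed fraction of examples; the sketch below (function name, seed handling, and per-class rounding are our own illustrative choices) also preserves the per-class balance, in line with the protocol described later in Sec. IV-A.

```python
import numpy as np

def split_supervision(labels, labeled_fraction, seed=0):
    """Partition example indices into a labeled pool (D_t^S) and an unlabeled
    pool (D_t^U), sampling the same fraction within each class so that the
    class balance of the supervised subset mirrors the full dataset."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    labeled_idx = []
    for c in np.unique(labels):
        class_idx = np.flatnonzero(labels == c)
        n_labeled = max(1, int(round(labeled_fraction * len(class_idx))))
        labeled_idx.append(rng.choice(class_idx, size=n_labeled, replace=False))
    labeled_idx = np.concatenate(labeled_idx)
    unlabeled_idx = np.setdiff1d(np.arange(len(labels)), labeled_idx)
    return labeled_idx, unlabeled_idx
```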

We are interested in shedding further light on CL models by understanding i) how they perform under a partial lack of supervision and ii) how Semi-Supervised Learning approaches can be combined with them to exploit unsupervised data. Question i) is investigated experimentally in Sec. IV-B and IV-C by evaluating methods that simply drop the unlabeled examples. On the other hand, question ii) opens up many possible solutions, which we address by proposing two WSCL techniques: Continual Interpolation Consistency (CIC) and Contrastive Continual Interpolation Consistency (CCIC).

III-A Continual Interpolation Consistency

Fig. 2: In the proposed CIC, an unlabeled example undergoes different augmentations, which are in turn fed to the network. A surrogate of the target is then created by averaging these predictions and sharpening.
0:  input batch (supervised samples X_s, labels Y_s, unsupervised items X_u), memory buffer \mathcal{M}, scalars K, \tau, \lambda, weights \theta
1:  (X_\mathcal{M}, Y_\mathcal{M}) \leftarrow sample a mini-batch from \mathcal{M}
2:  X_S \leftarrow concat(X_s, X_\mathcal{M})
3:  Y_S \leftarrow concat(Y_s, Y_\mathcal{M});  X_U, Q_U \leftarrow (\,), (\,)
4:  for x_u in X_u do
5:     for k \leftarrow 1 to K do
6:        \hat{x}_{u,k} \leftarrow augment(x_u)
7:        X_U \leftarrow append(X_U, \hat{x}_{u,k})
8:     end for
9:     q_u \leftarrow sharpen\big(\operatorname{softmax}\big(\tfrac{1}{K}\sum_{k} h_\theta(\hat{x}_{u,k})\big), \tau\big)   (Eq. 4)
10:    Q_U \leftarrow append(Q_U, q_u), repeated K times
11: end for
12: \tilde{X}_S \leftarrow mixUp of X_S with random items from X_S \cup X_U   (Eqq. 5-6)
13: \tilde{X}_U \leftarrow mixUp of X_U with random items from X_S \cup X_U   (Eqq. 5-6)
14: \tilde{\mathcal{D}}_S \leftarrow (\tilde{X}_S, Y_S);  \tilde{\mathcal{D}}_U \leftarrow (\tilde{X}_U, Q_U)
15: Compute \mathcal{L}_S, \mathcal{L}_U according to Eqq. 7, 8
16: \theta \leftarrow \theta - \eta \nabla_\theta (\mathcal{L}_S + \lambda\, \mathcal{L}_U)
Algorithm 1 - Continual Interpolation Consistency

Self Training. To make CL methods able to work under a partial lack of supervision, we start off with self-training techniques; here, the model itself produces the targets (pseudo-labels) for unlabeled examples [yarowsky1995unsupervised, lee2013pseudo]. While this is a simple and viable technique, it tends to become unstable when only a few annotations are at disposal, forcing the model to overfit the limited supervised data available [oliver2018realistic].

Interpolation Consistency. We take one more step and mitigate the drawbacks of self-training by revisiting the use of pseudo-labels in an incremental scenario. Specifically, we use the predictions of the network not as training targets, but rather as a means for applying consistency regularization [tarvainen2017mean, miyato2018virtual, berthelot2019mixmatch]. In other words, we require the model to yield the same prediction in response to distinct realistic perturbations of the same input data-point. Following this idea, we introduce our first proposal: Continual Interpolation Consistency (CIC). Formally, its objective is a weighted sum of two loss terms: \mathcal{L}_S, computed on supervised examples, and \mathcal{L}_U, computed on unsupervised elements:

\mathcal{L}_{\mathrm{CIC}} = \mathcal{L}_S + \lambda\, \mathcal{L}_U. (3)

To deal with catastrophic forgetting, we also optimize the supervised loss on a mini-batch of samples drawn from the memory buffer \mathcal{M}.

A complete formulation of CIC is provided in Alg. 1. Initially, we aggregate the supervised data-points from the input stream with a batch drawn from the replay buffer \mathcal{M} into a single vector of labeled examples X_S (lines 1-3). We construct a larger vector X_U containing K repetitions of each unsupervised example from the input stream, subject to distinct augmentations (lines 4-8). As depicted in Fig. 2, each element is assigned a soft-label that is obtained as the mean of the pre-softmax responses of the model to its augmentations. The obtained labels are then further processed through a sharpening step (line 9) with temperature \tau to reduce their entropy:

\operatorname{sharpen}(q, \tau)_i \triangleq q_i^{1/\tau} \Big/ \sum_{j} q_j^{1/\tau}. (4)

This is in line with the low-density assumption typically made in semi-supervised learning, which implies that the decision boundary of the classifier should cross low-density regions of the input space [lee2013pseudo]. The sharpened labels resulting from this procedure are gathered in a vector of responses Q_U (line 10).
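For reference, Eq. 4 amounts to a few lines of code; a minimal PyTorch sketch is given below, where the default temperature value is illustrative and not necessarily the one used in the paper.

```python
import torch

def sharpen(q, temperature=0.5):
    """Reduce the entropy of a batch of probability vectors q (Eq. 4):
    raise every entry to 1/temperature and re-normalise row-wise."""
    q = q ** (1.0 / temperature)
    return q / q.sum(dim=1, keepdim=True)
```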

To promote consistent responses to considerable variations of the data-points, we combine pairs of examples from both X_S and X_U through the mixUp procedure [zhang2017mixup]:

\lambda \sim \operatorname{Beta}(\alpha, \alpha), \qquad \lambda' = \max(\lambda, 1 - \lambda), (5)

\tilde{x} = \lambda'\, x_1 + (1 - \lambda')\, x_2. (6)

As explained in [berthelot2019mixmatch], we apply mixUp asymmetrically (Eq. 5). This means that the mixUp of x_1 and x_2 always remains semantically closer to x_1. We then obtain \tilde{X}_S and \tilde{X}_U via mixUp of the elements of X_S and X_U respectively with other random items (lines 12-13).
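A minimal sketch of the asymmetric interpolation of Eqq. 5-6 follows; the Beta parameter value is an assumption on our part, and, as described above, only the inputs are mixed while the first element keeps its own target.

```python
import numpy as np

def asymmetric_mixup(x1, x2, alpha=0.75):
    """Interpolate two inputs (Eqq. 5-6) with a Beta-sampled coefficient that is
    forced to be at least 0.5, so the mixture stays semantically closer to x1;
    x1 retains its original (hard or soft) target."""
    lam = np.random.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)   # asymmetry: favour the first argument
    return lam * x1 + (1.0 - lam) * x2
```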

Finally, the items contained in \tilde{X}_S are gathered with their original ground-truth labels in \tilde{\mathcal{D}}_S (line 14) and used to compute the mean supervised cross-entropy loss as follows:

\mathcal{L}_S = \frac{1}{|\tilde{\mathcal{D}}_S|} \sum_{(\tilde{x}, y) \in \tilde{\mathcal{D}}_S} \mathrm{CE}\big(y, f_\theta(\tilde{x})\big), (7)

where CE denotes the Cross-Entropy. On the other hand, the examples in \tilde{X}_U are paired with their soft-labels in Q_U, forming \tilde{\mathcal{D}}_U (line 14), which is then used to compute the unsupervised loss as follows:

\mathcal{L}_U = \frac{1}{|\tilde{\mathcal{D}}_U|} \sum_{(\tilde{x}, q) \in \tilde{\mathcal{D}}_U} \big\| q - f_\theta(\tilde{x}) \big\|_2^2. (8)
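Putting the two terms together, a possible sketch of the loss computation is shown below; the L2 form of the consistency term follows the MixMatch-style assumption made in Eq. 8 above, and the function and argument names are ours.

```python
import torch
import torch.nn.functional as F

def cic_losses(model, x_sup_mix, y_sup, x_unsup_mix, q_unsup, lambda_u=1.0):
    """Compute the CIC objective of Eq. 3.
    x_sup_mix:   mixed labeled inputs,   y_sup:   their original hard labels
    x_unsup_mix: mixed unlabeled inputs, q_unsup: their sharpened soft-labels."""
    loss_sup = F.cross_entropy(model(x_sup_mix), y_sup)            # Eq. 7
    probs_unsup = torch.softmax(model(x_unsup_mix), dim=1)
    loss_unsup = F.mse_loss(probs_unsup, q_unsup)                  # Eq. 8 (assumed L2 consistency)
    return loss_sup + lambda_u * loss_unsup                        # Eq. 3
```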

CIC is inspired by MixMatch [berthelot2019mixmatch], a Semi-Supervised Learning technique that also interpolates labeled and unlabeled exemplars; however, MixMatch applies mixUp to both examples and targets, whereas CIC interpolates the inputs only and retains the original (hard or soft) targets.

As a final remark, CIC does not depend on the explicit knowledge of task boundaries: it could be effortlessly extended to a Task-Free Learning-like scenario (see Sec. II). We leave the evaluation in such a setting for future studies.

III-B Contrastive Continual Interpolation Consistency

Fig. 3: Contrastive-CIC exploits task identifiers to encourage semantic constraints: namely, it requires the network to push away the representations of different tasks and to move closer representations of the same one.

Supposing that boundaries between tasks are provided, we can easily infer the task label for both supervised and unsupervised examples. Although we do not know the class of the latter exactly, their membership to a particular task provides an additional weak form of supervision. In this respect, the formulation of our second proposal (Contrastive Continual Interpolation Consistency, CCIC) follows a strategy based on metric learning (Fig. 3): during training, it couples each incoming example (anchor) with a negative one, requiring that their representations be separated. If the class of the anchor is known, a positive point is additionally considered.

Unsupervised mining. Even without ground-truth information available, we know in advance that tasks are disjoint; consequently, examples of different tasks belong to different classes. We propose to account for that by adding a contrastive loss term, which pushes their representations away from each other in feature space. We aim at maximizing the Euclidean distance between embeddings of examples associated with different tasks. Formally, we minimize:

\mathcal{L}_{UM} = \sum_{x \in X_s \cup X_u} \; \sum_{x' \in \mathcal{M}_{<t}} \max\big(0,\; m - \| e_\theta(x) - e_\theta(x') \|_2 \big), (9)

where t is the index of the current task, e_\theta(\cdot) denotes the feature extractor, \mathcal{M}_{<t} indicates past examples from the memory buffer, and m is a constant margin beyond which no more effort should be put into enlarging the distance between negative pairs.

Supervised mining. For each incoming labeled example, we also encourage the network to move its representation close to those belonging to the same class. We look for positive candidates within both the current batch and the memory buffer. In formal terms:

\mathcal{L}_{SM} = \sum_{(x, y)} \; \sum_{x^{+} \in P(y)} \| e_\theta(x) - e_\theta(x^{+}) \|_2, (10)

where (x, y) ranges over the labeled examples and P(y) is the set of positive candidates of class y drawn from the current batch and the memory buffer.

Overall objective. To sum up, the objective of CCIC combines the consistency regularization term delivered by CIC (Eq. 3) with the two additional ones (Eq. 9 and Eq. 10) applied in feature space; the overall optimization problem can be formalized as follows:

\min_\theta \; \mathcal{L}_{\mathrm{CIC}} + \lambda_{UM}\, \mathcal{L}_{UM} + \lambda_{SM}\, \mathcal{L}_{SM}, (11)

where \lambda_{UM} and \lambda_{SM} are hyperparameters setting the importance of the two mining terms.
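For concreteness, the two mining terms could be sketched as follows in PyTorch; the pairing of each anchor with an aligned positive, the mean reduction, and the margin value are our own illustrative choices rather than the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def across_task_push(emb_current, emb_past, margin=1.0):
    """Eq. 9: push embeddings of current-task examples away from embeddings of
    buffered examples belonging to previous tasks, up to a fixed margin."""
    dist = torch.cdist(emb_current, emb_past)     # pairwise Euclidean distances
    return F.relu(margin - dist).mean()

def same_class_pull(emb_anchor, emb_positive):
    """Eq. 10: pull each labeled anchor towards an aligned positive of the same
    class, mined from the current batch or from the memory buffer."""
    return F.pairwise_distance(emb_anchor, emb_positive).mean()
```

The overall CCIC loss of Eq. 11 would then be obtained as cic_loss + lambda_um * across_task_push(...) + lambda_sm * same_class_pull(...).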

Exploiting distance metric learning during inference. Once we have introduced constraints in feature space (Eqq. 9-10), we can also exploit them by devising a different inference schema, which further contributes to relieving catastrophic forgetting. Similar to what has been done in [rebuffi2017icarl], we employ the k-Nearest Neighbors algorithm as the final classifier, thus decoupling classification from feature extraction. This has been shown to be beneficial in Continual Learning, as it saves the final fully-connected layer from continuously keeping up with the changing features (and vice versa). As kNN is non-parametric and builds upon the feature space solely, it fits in harmony with the rest of the model, limiting the damage caused by catastrophic forgetting. We fit it on the memory buffer at the end of each task (as past examples are assumed unavailable).
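As an illustration, fitting the kNN classifier on the buffer at a task boundary could look as follows; this is a sketch assuming a scikit-learn classifier and an arbitrary value of k, not the authors' exact implementation.

```python
import torch
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def fit_knn_on_buffer(feature_extractor, buffer_x, buffer_y, k=5):
    """Fit a kNN classifier on the embeddings of the memory buffer at the end of
    a task; at test time, predictions are made in feature space rather than
    through the final fully-connected layer."""
    feats = feature_extractor(buffer_x).cpu().numpy()
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(feats, buffer_y.cpu().numpy())
    return knn
```

At inference, test images would be embedded with the same feature extractor and classified via knn.predict.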

IV Experiments

IV-A Evaluation protocol

We conduct our experiments on three standard datasets. Split SVHN: five subsequent tasks built on top of the Street View House Numbers (SVHN) dataset [netzer2011reading]; Split CIFAR-10: equivalent to the previous one, but using the CIFAR-10 dataset [krizhevsky2009learning]; Split CIFAR-100: a longer and more challenging evaluation in which the model is presented with ten subsequent tasks, each comprising ten classes from the CIFAR-100 dataset [krizhevsky2009learning].

In our experiments, we vary the fraction of labeled examples shown to the model (, , , and ), thus encompassing evaluations at different degrees of supervision. To guarantee fairness in the evaluation, we preserve the original balance between classes in both the training and test sets. Even in the presence of low label rates, we make sure that each class is represented by a proportional amount of labeled examples. For CIFAR-10, the above-mentioned percentages would correspond to , and examples per class respectively. Further details are reported in Appendix A.

Architectures. We adopt distinct architectures according to the dataset complexity. As in [abati2020conditional], experiments on Split SVHN are conducted on a small Convolutional Neural Network, comprising three ReLU layers interleaved with max-pooling and followed by a final average-pooling layer. Instead, we rely on ResNet18 [he2016deep] for CIFAR-10 and CIFAR-100, as done in [buzzega2020dark].

Evaluation Metrics. In this work, we analyze the performance of the evaluated models in terms of their average final accuracy A_F, as commonly done in [lopez2017gradient, chaudhry2018efficient, aljundi2019gradient]. Let a_{i,j} be the accuracy on the classes of the j-th classification task after completing training on the i-th task; we define A_F as:

A_F \triangleq \frac{1}{T} \sum_{j=1}^{T} a_{T, j}, (12)

where T is the total number of tasks in the given dataset.
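In code, Eq. 12 amounts to averaging the last row of the accuracy matrix; a minimal sketch, assuming the matrix layout defined above:

```python
def final_average_accuracy(acc):
    """acc[i][j]: accuracy on task j after training on task i (0-indexed).
    Eq. 12 averages the last row, i.e. the accuracy on all tasks at the end."""
    return sum(acc[-1]) / len(acc[-1])
```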

Accuracies are averaged across 5 runs and reported along with their standard deviations. Additional results, expressed in terms of the forgetting measure [chaudhry2018riemannian], can be found in Appendix D.

Implementation details. As discussed in Sec. III-B, our proposals rely on data augmentation to promote consistency regularization. In more detail, we apply random cropping and horizontal flipping (except for Split SVHN); the same choice is applied to the competitors to ensure fairness. To perform hyperparameter selection (learning rate, batch size, optimization algorithm, and regularization coefficients), we carried out a grid search on top of a validation set (a held-out portion of the training set), as done in [riemer2018learning, buzzega2020dark, rebuffi2017icarl]. To guarantee an equal sample efficiency during evaluation, we fix the batch size and the memory minibatch size to the same value for all models. We train on each task for a fixed number of epochs, which differs among SVHN, CIFAR-10, and CIFAR-100. All methods use SGD as optimizer, with the only exception of CCIC, which employs Adam [22].

IV-B Experimental Comparisons

TABLE I: Average Accuracy of CL Methods (SGD, LwF, oEWC, SI, ER, iCaRL, DER, GDumb) and of Our Proposals (CIC, CCIC) on the WSCL Benchmarks (Split SVHN, Split CIFAR-10, Split CIFAR-100, each with its joint-training upper bound), at varying label percentages and for different memory buffer sizes.
Fig. 4: Visualization of some of the experiments with memory buffer reported in Tab. I (best seen in color).

Lower and Upper Bounds. We bound the performance for our experiments by including two reference measures. As a lower bound, we evaluate the performance of a model trained through Stochastic Gradient Descent (SGD) exclusively on the supervised examples, without any countermeasure to catastrophic forgetting. Additionally, we provide an upper-bound value (UB) for each dataset, given by the accuracy of a model trained jointly on it, i.e. without dividing it into tasks or discarding any ground-truth annotation.

Drop-the-unlabeled. For adapting current CL methods to our setting, the most straightforward approach consists of simply discarding unlabeled examples from the current batch. In this regard, we compare our proposals with Learning Without Forgetting (LwF) [li2017learning], online Elastic Weight Consolidation (oEWC) [schwarz2018progress], Synaptic Intelligence (SI) [zenke2017continual], Experience Replay (ER) [riemer2018learning], iCaRL [rebuffi2017icarl], Dark Experience Replay (DER) [buzzega2020dark] and GDumb [prabhu2020gdumb]. By doing so, we aim to verify whether our proposals are able to better sustain a training regime with reduced supervision.

Pseudo-Labeling. Some Semi-Supervised Learning works [yarowsky1995unsupervised, lee2013pseudo] pre-train the model on the initially available labeled data to produce pseudo-labels for the unlabeled examples. To establish a simple WSCL baseline, we introduce a strategy that allows Experience Replay to profit from unlabeled examples by simply assigning them a pseudo-label \bar{y} [lee2013pseudo] generated by the model as:

\bar{y} = \operatorname*{argmax}_{c \,\in\, C_t} f_\theta(x_u)_c, (13)

where C_t indicates the set of classes which constitute the current task. As discussed in Sec. III-A, self-training is likely to cause model instability, especially at task boundaries, when the model still needs to learn to classify the new data. By constraining label guessing within the labels of the current task, we mitigate this effect. Since targets might be incorrect, a threshold hyperparameter is applied to discard low-confidence outputs and their relative examples. Specifically, we estimate the confidence as the difference between the two highest values of f_\theta(x_u). After this step, a pair (x_u, \bar{y}) can be considered on a par with any supervised pair (x_s, y_s), and can therefore be inserted into the memory buffer for later replay. We indicate this baseline as PseudoER (see Alg. 2 in Appendix B) in the comparisons.
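A possible sketch of this confidence-thresholded pseudo-labeling step follows; the threshold value and the function name are illustrative, and we assume the confidence is measured on the softmax outputs as described above.

```python
import torch

@torch.no_grad()
def pseudo_label(model, x_unsup, current_classes, threshold=0.2):
    """Assign pseudo-labels restricted to the classes of the current task (Eq. 13),
    keeping only the examples whose confidence, i.e. the gap between the two
    highest probabilities within the current task, exceeds a threshold."""
    probs = torch.softmax(model(x_unsup), dim=1)[:, current_classes]
    top2 = probs.topk(2, dim=1).values
    confident = (top2[:, 0] - top2[:, 1]) > threshold
    class_ids = torch.tensor(current_classes, device=probs.device)
    labels = class_ids[probs.argmax(dim=1)]
    return x_unsup[confident], labels[confident]
```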

IV-C Experimental Results

Tab. I and Fig. 4 report the average accuracy across all tasks obtained in the WSCL setting. As can be seen, the latter proves to be a challenging scenario, whose difficulty unsurprisingly increases when lower amounts of labels are provided to the learner. Experiments were conducted on a desktop computer equipped with an NVIDIA RTX 2080Ti GPU.

Regularization Methods. Regularization-based CL methods have been highlighted as generally weak in the Class-IL scenario [farquhar2018towards]. This is in line with our empirical observations, as LwF, oEWC and SI underperform across all datasets. Indeed, these methods rarely outperform our lower bound (SGD), indicating that they are not effective outside of the Task-IL and Domain-IL settings. We validate this by conducting an experiment on WSCL in the Task-IL setting in Appendix C.

Rehearsal Methods. Rehearsal methods overall show an expected decrease in performance as supervision diminishes. This is especially severe for DER and iCaRL, as their accuracy drops on average by more than between and labels. As the model underfits the task when less supervision is provided, it produces less reliable targets that cannot be successfully used for replay by these methods. On the contrary, ER is able to replay information successfully as it exploits hard targets; thus, it keeps learning effectively even if it initially underfits the task. Indeed, its accuracy with labels and buffer is always higher than its fully-supervised accuracy with a smaller buffer. Remarkably, knowledge-distillation-based approaches, which are widespread and successful in CL, encounter a major hindrance in the lack of supervision; this is not true for plain ER, which is able to overcome the lack of supervision when provided with a big enough reservoir.

We attribute the failure of iCaRL on SVHN to the low complexity of the backbone network. Indeed, a shallow backbone provides for a latent space that is less suitable for its nearest-mean-of-exemplars classifier. Conversely, this method proves quite effective even with a reduced memory buffer on CIFAR-100. As the latter features a high number of classes, iCaRL ensures that all are fairly represented even in a small memory buffer thanks to herding sampling.

Finally, GDumb is insensitive to decreasing supervision as long as its buffer can be filled completely: its operation is not disrupted by unlabeled examples on the stream, as it ignores the latter entirely. While it outperforms other CL methods when few labels are available, WSCL methods surpass it consistently. This indicates that the stream provides potential for further learning and should not be dismissed lightly.

WSCL Methods. The PseudoER baseline performs overall on par with ER, showing that pseudo-labelling does not necessarily scale well in an online setting. Its highest gain in performance w.r.t. ER can be found on CIFAR-10, a complex benchmark that, however, only features two classes for each task. As random guessing between them already results in a 50% accuracy, pseudo-labelling can easily produce sensible responses. The same does not hold for CIFAR-100, where PseudoER struggles to produce valid targets when supervision is low.

On the contrary, CIC and CCIC successfully blend supervised information and semi-supervised regularization. While ER, on which they are based, encounters an average performance drop of , going from to labels on CIFAR-10, CIC loses on average and CCIC only loses . Surprisingly, we observe that, most of the time, a reduced degree of supervision is enough to approach the results of fully-supervised methods, even outperforming the state of the art in some circumstances (CIFAR-10 with buffer size , SVHN with buffer size and ). This indicates that the positive effect of their regularization combined with partial supervision can be even more beneficial than directly relying on more supervision.

V Ablation Studies

V-A Model ablation

TABLE II: Performance Contribution from each Component of CCIC (w/o kNN at inference, w/o sharpening, w/o MixUp, w/o the individual loss terms) on Split CIFAR-10, with a Reduced Fraction of Labels and a Fixed Memory Buffer.

We further examine the quality of CCIC by means of an ablative study conducted on Split CIFAR-10. Specifically, we employ a fixed memory buffer in a setting with a reduced fraction of labels available and summarize the results in Tab. II (results are averaged across multiple runs). For starters, our experiments show a clear benefit from the additional consistency regularization provided by MixUp, in accordance with previous works on semi-supervised learning. In addition, however, we find a similar advantage from the unsupervised mining procedure discussed in Sec. III-B, which highlights the benefit of separating the knowledge learned during different tasks. Finally, our results show that each component of our proposal provides an essential contribution to its final performance in a WSCL setting.

V-B Importance of Unsupervised Mining in CCIC

TABLE III: Evaluation of Distinct Unsupervised Mining Techniques for CCIC (Across-Task Mining of Eq. 9, Within-Task Mining, Task-Agnostic Mining) on Split CIFAR-100.

In its unsupervised mining loss term (Eq. 9), CCIC takes examples of previous tasks in the memory buffer as negatives (Across-Task Mining) and requires their representations to be pushed away from current data. Alternatively, we could choose the negatives from the current task (Within-Task Mining), with the aim of using this term to further reinforce the learned classification boundaries. Another possible choice is not to inject any task-specific prior in the mining process and let it freely choose a negative example from either the memory or the current batch (Task-Agnostic Mining). We compare these three approaches in Tab. III. As can be observed, Task-Agnostic Mining leads to a small but consistent decrease in performance, while mining across tasks proves to be the most rewarding strategy.

V-C Pre-training in WSCL

TABLE IV: Results Delivered in Presence/Absence of ImageNet Pre-Training on Split CIFAR-10 for ER, CIC, and CCIC.

When the number of labeled examples is limited, the risk of overfitting them can be mitigated through the injection of prior knowledge over parameters learned during training. This can be done in various ways; the experiments shown above reveal that consistency regularization and contrastive learning are both valid techniques. However, such prior knowledge can also derive from a set of preliminary data. Indeed, several works [li2017learning, zenke2017continual, hou2019learning, douillard2020podnet] allow the model to be warmed-up on a task that, although potentially different from the incoming ones, can enable positive transfer.

In this scenario, does the regularization induced by pre-training provide additional aid to self-supervision? To investigate this matter, we pre-train the network on ImageNet and test the performance of several methods in WSCL on Split CIFAR-10. From the results in Tab. IV, we see that pre-training is beneficial for all methods, delivering especially large gains when labels are scarce. CIC and CCIC also improve when pre-trained, but only the latter performs slightly better than ER. As pre-training seems to reduce the differences among the evaluated methods, the slight advantage of CCIC might be attributed to its kNN-based classifier, which takes advantage of the pre-trained features as they are.

VI Conclusion

Catastrophic forgetting prevents most current state-of-the-art models from sequentially learning multiple tasks, forcing practitioners to resort to heavy, resource-demanding training processes. Moreover, many of the applications that might benefit from CL algorithms are often characterized by label scarcity. For this reason, we investigate the possibility of leveraging unlabeled data-points to enhance the performance of Continual Learning models. In a scenario that we name Weakly Supervised Continual Learning (WSCL), we propose two incremental approaches. On the one hand, Continual Interpolation Consistency (CIC) combines the benefits of rehearsal with consistency regularization. On the other hand, Contrastive Continual Interpolation Consistency (CCIC) improves over CIC by imposing distance-based constraints in feature space. Our experiments reveal that both techniques yield far superior results w.r.t. baseline methods, even surpassing fully supervised models. This suggests their applicability to supervised CL settings in future works.

Appendix A

In Tab. V, we include further details on the per-class split of all datasets used in this study. While Split CIFAR-10 and Split CIFAR-100 have balanced classes, Split SVHN is significantly unbalanced, and this is reflected in WSCL when using a portion of the initial labels. In our experiments, we obtain all datasets from the torchvision.datasets submodule of torchvision (https://pytorch.org/vision/stable/index.html). All three datasets are composed of 32x32 RGB images.

TABLE V: Amount of Labeled Examples for Each Class in the Datasets (Split SVHN, Split CIFAR-10, Split CIFAR-100).

Appendix B

In Alg. 2, we present a detailed algorithmic formulation of the PseudoER baseline used in the experiments.

Algorithm 2 - PseudoER
0:  input batch (supervised samples X_s, labels Y_s, unsupervised items X_u), memory buffer \mathcal{M}, weights \theta, confidence threshold c
1:  // Compute pseudo-targets for unlabeled examples:
2:  X_p, Y_p \leftarrow (\,), (\,)
3:  for x_u in X_u do
4:     \bar{y}, conf \leftarrow argmax and top-2 gap of f_\theta(x_u) restricted to C_t   (Eq. 13)
5:     if conf > c then
6:        X_p \leftarrow append(X_p, x_u)
7:        Y_p \leftarrow append(Y_p, \bar{y})
8:     end if
9:  end for
10: (X_\mathcal{M}, Y_\mathcal{M}) \leftarrow sample a mini-batch from \mathcal{M}
11: X \leftarrow concat(X_s, X_p, X_\mathcal{M});  Y \leftarrow concat(Y_s, Y_p, Y_\mathcal{M})
12: \theta \leftarrow \theta - \eta \nabla_\theta\, \mathrm{CE}\big(Y, f_\theta(X)\big)
13: \mathcal{M} \leftarrow reservoir insertion of (X_s, Y_s) and (X_p, Y_p)
14: return \theta

Appendix C

In this work, we chose to propose WSCL for the Class-IL setting for the reasons highlighted in Sec. II. We observe in Sec. IV that this leads to very poor performance for the tested regularization methods (LwF, oEWC, SI). We maintain that our choice is fair and in line with the recent tendency to disregard Task-IL, which has been criticized as trivial and unrealistic [farquhar2018towards], in CL experiments [aljundi2019gradient, kj2020meta].

Nevertheless, for the sake of completeness, we report the accuracy of the regularization methods in the Task-IL setting in Tab. VI, along with ER. These results show that LwF and SI are usually more accurate than oEWC in the fully supervised setting, but this difference is reduced when limited supervision is provided. Notably, the objective of oEWC seems to induce excessive regularization, as its performance occasionally increases when it is trained on fewer data. At any rate, ER dramatically outperforms all the regularization methods we studied even in this simpler setting. On CIFAR-100, it outperforms the other methods on average and even doubles their performance when only a small fraction of labels is available.

These experiments extend to WSCL the findings of [farquhar2018towards], which credit rehearsal methods with superior performance w.r.t. regularization methods, and further justify our choice to build our proposed WSCL methods on top of ER in Sec. III.

TABLE VI: Average Accuracy of LwF, oEWC, SI, and ER on the WSCL Benchmarks (Split CIFAR-10 and Split CIFAR-100) in the Task-IL Setting.

Appendix D

In Tab. VII, we report Forgetting measure results [chaudhry2018riemannian] for the experiments in Tab. I. Forgetting is computed as the average difference between the peak accuracy reached by the model on a given task and its final accuracy:

F \triangleq \frac{1}{T - 1} \sum_{j=1}^{T-1} \Big( \max_{i \in \{j, \dots, T\}} a_{i,j} \,-\, a_{T,j} \Big). (14)

Note that the last task is not accounted for by this metric, as it incurs no forgetting, and that a lower value indicates that previous tasks are forgotten less by the model. However, this metric is intrinsically higher for models that learn the current task better, while it remains lower for those that underfit all presented tasks.
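For reference, Eq. 14 can be computed from the same accuracy matrix used for Eq. 12; a minimal sketch:

```python
def forgetting(acc):
    """acc[i][j]: accuracy on task j after training on task i (0-indexed).
    Eq. 14: average, over all tasks but the last, of the gap between the best
    accuracy ever reached on a task and its accuracy at the end of training."""
    num_tasks = len(acc)
    gaps = [max(acc[i][j] for i in range(j, num_tasks)) - acc[-1][j]
            for j in range(num_tasks - 1)]
    return sum(gaps) / len(gaps)
```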

TABLE VII: Forgetting measure [chaudhry2018riemannian] for the experiments in Tab. I (SGD, ER, iCaRL, DER, PseudoER, CIC, and CCIC on Split SVHN, Split CIFAR-10, and Split CIFAR-100, for different memory buffer sizes).

Acknowledgment

This project has been funded by the InSecTT (www.insectt.eu) project. InSecTT has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 876038.

References