Knowledge Distillation for Incremental Learning in Semantic Segmentation

11/08/2019
by   Umberto Michieli, et al.
Università di Padova

Although deep learning architectures have shown remarkable results in scene understanding problems, they exhibit a critical drop of overall performance due to catastrophic forgetting when they are required to incrementally learn to recognize new classes without forgetting the old ones. This phenomenon impacts the deployment of artificial intelligence in real world scenarios where systems need to learn new and different representations over time. Current approaches for incremental learning deal only with image classification and object detection tasks. In this work we formally introduce the incremental learning problem for semantic segmentation. To avoid catastrophic forgetting we propose to distill the knowledge of the previous model to retain the information about previously learned classes, whilst updating the current model to learn the new ones. We developed three main knowledge distillation methodologies working both on the output layer and on the internal feature representations. Furthermore, differently from other recent frameworks, we do not store any image belonging to previous training stages and only the last model is used to preserve high accuracy on previously learned classes. Extensive experiments were conducted on the Pascal VOC2012 dataset and show the effectiveness of the proposed approaches in different incremental learning scenarios.


I Introduction

Deep neural networks are nowadays gaining huge popularity and they are one of the key driving elements for the widespread diffusion of artificial intelligence. Despite their success on many visual recognition tasks, neural networks struggle with the incremental learning problem, i.e., improving the learned model to accomplish new tasks without losing previous knowledge. Traditional models typically require that all the samples corresponding to the old and new tasks are available at training time and are not designed to handle new streams of data relative to the new tasks only. A system that is going to be deployed in a real world environment, instead, should be able to update its knowledge by incorporating new tasks while leaving the performance on the previous ones unaltered. This behavior is inherently present in the human brain, which is incremental in the sense that new tasks are continuously incorporated while existing knowledge is largely preserved. The incremental learning problem is gaining wide relevance in the scientific community and it has been considered in many works dealing with image classification and object detection [1, 2, 3, 4, 5].

On the other hand, incremental learning in dense labeling tasks, such as semantic segmentation, has never been extensively studied. This paper starts from our previous work [6], which is the first investigation of the problem.

Semantic segmentation is a key task that artificial intelligence systems must face frequently in various applications, e.g., autonomous driving or robotics [7, 8, 9]. In this work we start by introducing the problem and presenting the key aspects of this challenging task. In particular, differently from image classification, in semantic segmentation each image contains pixels belonging to multiple classes at the same time: new images may contain exemplars of the new classes as well as of previously learned ones, making the problem conceptually different from incremental learning in image classification. Furthermore, contrary to many existing methodologies, we consider the most challenging setting where images from old tasks are not stored and cannot be used to help the incremental learning process. This is particularly important for the vast majority of real world applications, where old images cannot be stored due to privacy concerns or storage requirements.

Then, we introduce a novel framework to perform incremental learning in semantic segmentation. Specifically, we re-frame the distillation loss concept used in image classification and we propose three novel approaches where knowledge is distilled from the output layer, from the intermediate feature level and from the intermediate layers of the decoding phase, respectively. Experimental results on the Pascal VOC2012 dataset demonstrate that the proposed framework is able to obtain good performance in many settings, even without storing any of the previous examples. The proposed schemes distill knowledge from the old model and allow not only to retain the learned information but also to achieve higher accuracy on the new tasks, leading to substantial improvements in all the scenarios w.r.t. the standard approach without distillation.

Compared to our conference work [6], we improve the distillation loss on the output layer to help retain information about the prediction uncertainty of the previous model. We add a novel distillation loss enforcing the similarity of different decoding stages. We propose a new strategy based on the idea of freezing only the first part of the encoder. Finally, we present much more extensive experiments on different scenarios, discussing the obtained results in detail.

The remainder of this paper is organized as follows. Section II discusses contemporary incremental learning methodologies applied to different problems. In Section III a precise formulation of the incremental learning task for semantic segmentation is introduced. Section IV outlines the proposed methodologies while the results on Pascal VOC2012 are shown in Section V in more extensive settings than in [6]. Conclusion and future developments are presented in Section VI.

II Related Work

Incremental learning is strictly related to other research fields such as continual learning, lifelong learning, transfer learning, multi-task learning and never ending learning. All these fields require designing an algorithm able to learn new tasks over time without forgetting the previously learned ones. The approaches for these tasks face a critical issue, which in the literature is referred to as catastrophic forgetting [10, 11, 12] and still represents one of the main limitations of deep neural networks. The human brain, on the other hand, can efficiently learn new tasks without forgetting the old ones, and this ability is essential for the deployment of artificial intelligence systems in challenging scenarios where new tasks or classes appear over time.

Catastrophic forgetting has been faced even before the rise in popularity of neural networks [13, 14, 15] and has more recently been rediscovered and tackled in different ways. Focusing on deep neural networks, some methods [16, 17] exploit architectures which grow over time as a tree structure in a hierarchical manner as new classes are observed. Istrate et al. [18] proposed a method that partitions the original network into sub-networks which are then gradually incorporated into the main one during training. In [19] the network incrementally grows over time while sharing portions of the base module.

A different strategy consists in freezing or slowing down the learning process in some parts of the network. Kirkpatrick et al. [20] developed Elastic Weight Consolidation (EWC) to remember old tasks by slowing down the learning process on the weights that are important for those tasks. In [21] the learned knowledge is preserved by freezing the earlier and mid-level layers of the models. Similar ideas have been used in recent studies [5, 18].

Another way of retaining high performance on old tasks which has recently gained wide success is knowledge distillation. This technique was originally proposed in [22, 23] to preserve the output of a complex ensemble of networks when adopting a simpler network for more efficient deployment. This idea was adapted in different ways in recent studies [1, 4, 5, 2, 3, 24, 25] to maintain the responses of the network on the old tasks whilst updating it with new training samples. We investigated this concept and we further adapted it to the semantic segmentation task, while previous works focused on object detection or classification problems.

Regarding object detection and classification problems, some studies keep a small portion of data belonging to previous tasks and use it to preserve the accuracy on old tasks when dealing with new problems [4, 26, 27, 28, 3, 29]. In those works the exemplar set to store is chosen according to different criteria. In [4, 26] the authors use an episodic memory which stores a subset of the observed examples from previous tasks while incrementally learning new classes. In [27] a fraction of previous classes is kept as a way to alleviate the intransigence of a model, i.e., its inability to update its knowledge. Hou et al. [28] tried to balance between preservation and adaptation of the model via distillation and retrospection by caching a small subset of randomly picked data for old tasks. In [3] the classifier and the features for selecting the samples for the representative memory are learned jointly in an end-to-end fashion and herding selection [30] is used to pick them.

Another example of this family is the first work on incremental learning for semantic segmentation [29], which however focused on a very specific setting related to satellite images and has several limitations when applied to generic semantic segmentation problems. Indeed, it considers the segmentation task as a multi-task learning problem, where a binary classification for each class replaces the multi-class labeling. In [29] the authors store some patches chosen according to an importance value, determined by a weight assigned to each class, and some other patches chosen at random. Furthermore, the capabilities on the old classes are preserved by storing a subset of the old images. However, for a large number of classes and for different applications the methodology does not scale properly. Moreover, storing previously seen data could represent a serious limitation for applications where privacy issues or limited storage budgets are present.

For this reason, some recent methods [2, 31] do not store data of the old classes but compensate for this by training Generative Adversarial Networks (GANs) to generate images containing previous classes when new classes have to be learned, thus retaining high accuracy on old tasks. Some other approaches do not make use of exemplar sets at all [1, 20, 5, 32, 33, 24]. In [1] an end-to-end learning framework is proposed where the representation and the classifier are learned jointly without storing any of the original training samples. In [5] previous knowledge is distilled directly from the last trained model. In [32] an attention distillation loss is introduced as an information preserving penalty for the classifiers' attention maps. Aljundi et al. [33] introduced the idea that, when learning a new task, changes to important parameters can be penalized, effectively preventing meaningful knowledge related to previous tasks from being overwritten. In [24] the current model distills knowledge from all previous model snapshots, of which a pruned version is saved.

Our aim is then to propose a more general framework where we do not store any previous image. To the best of our knowledge this work and our conference paper [6] are the first investigations on incremental learning for semantic segmentation which do not retain previously seen images and evaluate on standard datasets, like Pascal VOC2012 [34].

III Problem Formulation

In this section we introduce the task of incremental learning in semantic segmentation and we present different possible settings in which it can be explored. The incremental learning task, when referring to semantic segmentation, is defined as the ability of a learning system (e.g., a neural network) to learn the segmentation and the labeling of new classes without forgetting or deteriorating too much the performance on previously learned ones. Typically, in semantic segmentation old and new classes coexist in the same image, thus the precision of an algorithm needs to account for the accuracy on the new classes as well as the accuracy on the old ones. The former should be as large as possible in order to learn the new classes, while the latter should be as close as possible to the accuracy experienced before the addition of the new classes, thus avoiding catastrophic forgetting. The key aspect is how to set the trade-off between the preservation of previous knowledge and the capability of learning new tasks.
The considered problem is even harder when no data from previous tasks can be preserved, which is the scenario of interest in the majority of the applications, where privacy concerns or storage limitations subsist. Here the most general incremental framework is addressed, in which:

  • previously seen images are not used;

  • new images may contain examples of unseen classes combined together with pixels belonging to the old ones;

  • the approach must scale well w.r.t. the number of classes.

Let us assume that the provided dataset $\mathcal{D}$ contains $N$ images. As usual, part of the data is exploited for training and part for testing, and we will refer to the training split of $\mathcal{D}$ with the notation $\mathcal{D}^{tr}$. Each pixel in each image of $\mathcal{D}^{tr}$ is associated to a unique element of the set $\mathcal{C}$ of possible classes. In case a background class is present, we associate it to the first class because it has a special and non-conventional behavior, being present in almost all the images and having by far the largest occurrence among the classes.

Moving to the incremental learning steps, we assume that we have trained our network to recognize an initial subset $\mathcal{S}_0 \subset \mathcal{C}$ of seen classes using a labeled subset $\mathcal{D}_0^{tr} \subset \mathcal{D}^{tr}$, whose images contain only pixels belonging to the classes in $\mathcal{S}_0$. We then perform a sequence of incremental steps, in each of which we want to recognize a new subset of unseen classes. During the $k$-th incremental step the set of all previously learned classes is denoted with $\mathcal{S}_{k-1}$ and the set of classes being added with $\mathcal{U}_k$; after the current step the new set $\mathcal{S}_k$ will contain also the last added classes. Formally, $\mathcal{S}_k = \mathcal{S}_{k-1} \cup \mathcal{U}_k$ and $\mathcal{S}_{k-1} \cap \mathcal{U}_k = \emptyset$. Each step of training involves a new set of samples $\mathcal{D}_k^{tr}$, whose images contain only pixels belonging to classes in $\mathcal{S}_k$. Notice that this set is disjoint from previously used samples, i.e., $\mathcal{D}_k^{tr} \cap \mathcal{D}_j^{tr} = \emptyset$ for every $j < k$. It is important to notice that images in $\mathcal{D}_k^{tr}$ could also contain classes belonging to $\mathcal{S}_{k-1}$; however, their occurrence will be limited since $\mathcal{D}_k^{tr}$ is restricted to consider only images which contain pixels from at least one class belonging to $\mathcal{U}_k$. Furthermore, the specific occurrence of a particular class belonging to $\mathcal{S}_{k-1}$ is highly correlated to which set of classes is being added (i.e., to $\mathcal{U}_k$). For example, if the classes being added typically appear in road scenes, it is reasonable to expect that $\mathcal{D}_k^{tr}$ contains images with other classes commonly found in such scenes, while classes from unrelated contexts are extremely unlikely to occur.
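To make the construction of $\mathcal{D}_k^{tr}$ concrete, the following sketch filters a generic dataset keeping only the images that contain at least one pixel of a newly added class; the iterable of (image, label map) pairs is an assumption made for the example, not a prescribed data format.

```python
import numpy as np

def build_incremental_split(dataset, new_classes):
    """Sketch of the selection rule for the k-th incremental step: keep only
    the images whose label map contains at least one pixel of a class in
    `new_classes`. No previously used image is stored or reused."""
    split = []
    for image, label_map in dataset:
        present = set(np.unique(label_map).tolist())
        if present & set(new_classes):
            split.append((image, label_map))
    return split
```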

Given this setting, there exist many different ways of sampling the set of unseen classes. Previous works [1, 6] sort the classes using the order of the exploited dataset (e.g., alphabetical order) and the first set of results in this paper sticks to this assumption to replicate the same scenarios. However, we also present and discuss an ordering based on the pixel occurrence of each class inside the dataset, since in real-world applications it is more likely to start from common classes and then introduce rarer ones.
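The occurrence-based ordering can be obtained, for instance, by counting the labeled pixels of each class over the training split, as in the sketch below (label maps are assumed to store class indices, with indices outside the valid range treated as void).

```python
import numpy as np

def class_order_by_occurrence(dataset, num_classes):
    """Order the classes from the most to the least frequent in terms of
    labeled pixels, as used for the occurrence-based scenario."""
    counts = np.zeros(num_classes, dtype=np.int64)
    for _, label_map in dataset:
        ids, freq = np.unique(label_map, return_counts=True)
        valid = ids < num_classes        # discard the void/ignore label
        counts[ids[valid]] += freq[valid]
    return list(np.argsort(-counts))     # descending pixel occurrence
```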

Additionally, there are many ways of selecting the cardinality of the sets $\mathcal{U}_k$, leading to different incremental scenarios. Starting from the choices considered in [1] for object detection on the Pascal VOC2007 dataset, we considered a wide range of scenarios and evaluated them on the Pascal VOC2012 dataset for semantic segmentation. Namely, as in [1, 6], we deeply analyze the behavior of various algorithms when adding a single class, a batch of classes and multiple classes sequentially one after the other.

IV Knowledge Distillation Techniques for Semantic Segmentation

This work aims at investigating incremental learning techniques for semantic segmentation. Starting from our previous work [6], we propose a set of methodologies based on different types of knowledge distillation strategies.

IV-A Network Architecture

The methods proposed in this paper can fit into any deep network architecture; however, since most recent architectures for semantic segmentation are based on the auto-encoder scheme, we focused on this representation. In particular, for the experimental evaluation we used the Deeplab v2 network [35], which is a widely used approach with state-of-the-art performance. More in detail, we exploited the Deeplab v2 network with ResNet-101 as the backbone, whose weights were pre-trained [36] on the MSCOCO dataset [37]. The pre-training of the feature extractor (as done also in other incremental learning works such as [5]) is needed since Pascal VOC2012 is too small to be used for training a complex network like Deeplab v2 from scratch. Notice, however, that the MSCOCO data are used only for the initialization of the feature extractor and that the labeling information in this dataset, even if there are some overlapping classes, is related to a different task (i.e., image classification). As previously introduced, the Deeplab v2 model is based on an auto-encoder structure (i.e., it consists of an encoder followed by a decoder) where the decoder is composed of Atrous Spatial Pyramid Pooling (ASPP) layers, in which multiple atrous convolutions with different rates are applied in parallel on the input feature map and then merged together to enhance the accuracy at multiple scales. The original work also exploited a post-processing step based on Conditional Random Fields, but we removed this module in order to train the network end-to-end and to measure the performance of the incremental approaches without the contribution of post-processing steps not related to the network training.
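As a rough sketch of the decoding stage described above, the snippet below builds an ASPP-style block with four parallel atrous convolutions whose outputs are summed; the dilation rates and the input feature shape are illustrative assumptions and do not necessarily reproduce the exact Deeplab v2 configuration.

```python
import tensorflow as tf

def aspp_block(features, num_classes, rates=(6, 12, 18, 24)):
    """Parallel atrous (dilated) convolutions merged by summation, in the
    spirit of the ASPP decoder used by Deeplab v2."""
    branches = []
    for r in rates:
        x = tf.keras.layers.Conv2D(
            num_classes, kernel_size=3, padding="same",
            dilation_rate=r, name=f"aspp_rate_{r}")(features)
        branches.append(x)
    return tf.keras.layers.Add()(branches)   # merge multi-scale responses

# Example usage with an assumed backbone feature-map shape.
feat = tf.keras.Input(shape=(41, 41, 2048))
logits = aspp_block(feat, num_classes=21)
decoder = tf.keras.Model(feat, logits)
```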

IV-B Incremental Learning Steps

The procedures to achieve incremental learning in semantic segmentation are now introduced: a general overview of the proposed approach is shown in Fig. 1. We start by training the chosen network architecture to recognize the classes in $\mathcal{S}_0$ with the corresponding training data $\mathcal{D}_0^{tr}$. Recall that $\mathcal{D}_0^{tr}$ only contains images with pixels belonging to classes in $\mathcal{S}_0$, as detailed in Section III. The network is trained in a supervised way with a standard cross-entropy loss. After training, we save the obtained model as $M_0$.

Fig. 1: Overview of the $k$-th incremental step of our learning framework for semantic segmentation of RGB images. The scenario in which the current model $M_k$ is completely trainable, i.e., not frozen, is reported. The previous model $M_{k-1}$, instead, is frozen and is not updated during the current step.

Then, we perform a set of incremental steps, indexed by $k$, to make the model learn a new set of classes $\mathcal{U}_k$ each time. At the $k$-th incremental step, the current training set $\mathcal{D}_k^{tr}$ is built with images that contain at least one of the new classes (but they can possibly contain also pixels belonging to previously seen classes; in particular, they typically have a background region). During step $k$, the model $M_{k-1}$ is loaded and trained exploiting a linear combination of two losses: a cross-entropy loss $\mathcal{L}_{CE}$, which learns how to identify and label the classes, and a distillation loss $\mathcal{L}_D$, which helps to retain knowledge of previously seen classes and will be detailed in the following. Notice that, since typically at least the background is present inside each image, we assume that all the images produce gradients for both the $\mathcal{L}_{CE}$ and the $\mathcal{L}_D$ losses. After the $k$-th incremental step, we save the current model as $M_k$ and we repeat the described procedure every time a new set of classes to be learned is taken into account.

The complete loss used to train the model is defined as:

$$\mathcal{L} = \mathcal{L}_{CE} + \lambda_D \, \mathcal{L}_D \qquad (1)$$

where $\lambda_D$ balances the two terms. Setting $\lambda_D = 0$ corresponds to the simplest scenario in which no distillation is applied and the cross-entropy loss is applied to both unseen and seen classes. We expect this case to exhibit some sort of catastrophic forgetting, as already pointed out in the literature.
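In code, the objective of Eq. (1) can be sketched as follows; the distillation term is computed by one of the strategies described below, and the value of $\lambda_D$ as well as the exact reduction over pixels are assumptions of the example.

```python
import tensorflow as tf

def total_loss(labels_onehot, logits_new, distillation_term, lambda_d=1.0):
    """Sketch of Eq. (1): cross-entropy on the current ground truth plus a
    distillation term weighted by lambda_d. Setting lambda_d = 0 recovers
    plain fine-tuning without distillation."""
    ce = tf.keras.losses.categorical_crossentropy(
        labels_onehot, logits_new, from_logits=True)   # per-pixel CE
    return tf.reduce_mean(ce) + lambda_d * distillation_term
```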

During the $k$-th incremental step the cross-entropy loss is applied to all the classes and it is defined as:

$$\mathcal{L}_{CE} = -\frac{1}{|\mathcal{D}_k^{tr}|} \sum_{X_n \in \mathcal{D}_k^{tr}} \sum_{c \in \mathcal{S}_k} Y_n[c] \cdot \log\left(\hat{Y}_n[c]\right) \qquad (2)$$

where $Y_n[c]$ and $\hat{Y}_n[c]$ are respectively the one-hot encoded ground truth and the output of the segmentation network corresponding to the estimated score for class $c$. Note that, since $\mathcal{S}_k = \mathcal{S}_{k-1} \cup \mathcal{U}_k$, the sum is computed on both old and newly added classes, but since the new ones are much more likely in $\mathcal{D}_k^{tr}$, there is a clear imbalance toward them, leading to catastrophic forgetting [38].

We considered various possible definitions for the distillation loss $\mathcal{L}_D$. We focused on losses that only depend on the previous model $M_{k-1}$ to avoid the need for large storage requirements. Finally, we experimentally evaluated the effectiveness of the proposed methods in real world settings.

IV-C Distillation on the Output Layer ($\mathcal{L}_D'$)

The first considered distillation term $\mathcal{L}_D'$ for semantic segmentation is the cross-entropy loss computed on the already seen classes between the probabilities produced by the output of the softmax layer of the previous model $M_{k-1}$ and the output of the softmax layer of the current model $M_k$ (we assume to be at the $k$-th incremental step). Notice that the cross-entropy is masked in order to consider only already seen classes, i.e., classes in $\mathcal{S}_{k-1}$, since we want to guide the learning process to preserve the behavior on these classes. The distillation loss in this case is defined as:

$$\mathcal{L}_D' = -\frac{1}{|\mathcal{D}_k^{tr}|} \sum_{X_n \in \mathcal{D}_k^{tr}} \sum_{c \in \mathcal{S}_{k-1}} \hat{Y}_n^{(k-1)}[c] \cdot \log\left(\hat{Y}_n^{(k)}[c]\right) \qquad (3)$$

where $\hat{Y}_n^{(k-1)}[c]$ and $\hat{Y}_n^{(k)}[c]$ denote the softmax outputs of the previous and of the current model for class $c$.

We improved the model by rescaling the logits using a softmax function with temperature $T$, i.e.,

$$\hat{Y}[c] = \frac{e^{z_c / T}}{\sum_{j} e^{z_j / T}} \qquad (4)$$

where $z_c$ is the logit value corresponding to class $c$. By denoting with $\tilde{Y}_n^{(k)}[c]$ the output of the segmentation network for the estimated score of class $c$ after this rescaling, we can rewrite Eq. (3) as:

$$\mathcal{L}_D' = -\frac{1}{|\mathcal{D}_k^{tr}|} \sum_{X_n \in \mathcal{D}_k^{tr}} \sum_{c \in \mathcal{S}_{k-1}} \tilde{Y}_n^{(k-1)}[c] \cdot \log\left(\tilde{Y}_n^{(k)}[c]\right) \qquad (5)$$

Intuitively, when $T > 1$ the model produces a softer probability distribution over the classes, thus helping to retain information about the uncertainty of the classification scores [22, 39]. In the experiments the value of $T$ was set empirically. Notice that the temperature scaling was not present in the conference version of the work [6] and it proved to be useful especially when one class is added at a time.
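A possible implementation of the masked, temperature-scaled distillation of Eq. (5) is sketched below; the channel-gathering step, the default temperature and the mean reduction are illustrative choices, not necessarily the exact ones used in the paper.

```python
import tensorflow as tf

def output_distillation_loss(logits_old, logits_new, old_class_ids,
                             temperature=2.0):
    """Cross-entropy between the softened softmax outputs of the previous
    and the current model, restricted to already seen classes (Eq. 5)."""
    # Keep only the channels of the previously seen classes.
    z_old = tf.gather(logits_old, old_class_ids, axis=-1)
    z_new = tf.gather(logits_new, old_class_ids, axis=-1)
    # A temperature larger than one softens the distributions (Eq. 4).
    p_old = tf.stop_gradient(tf.nn.softmax(z_old / temperature, axis=-1))
    log_p_new = tf.nn.log_softmax(z_new / temperature, axis=-1)
    return -tf.reduce_mean(tf.reduce_sum(p_old * log_p_new, axis=-1))
```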

When the new task is quite similar to the previous ones, the encoder $E$, which aims at extracting an intermediate feature representation from the input information, can be frozen to the status it reached after the initial training phase. In this way, the network is constrained to learn the new classes only through the decoder, while preserving the feature extraction capabilities unchanged from the training performed on $\mathcal{D}_0^{tr}$. We evaluated this approach both with and without the application of the distillation loss in Eq. (3) and Eq. (5). Since freezing the whole encoder could appear too restrictive, we also tried to freeze only the first couple of convolutional layers of the encoder (this variant was not present in [6]). Freezing only the first layers allows us to preserve the lower-level description produced by the first layers of the encoder, while updating the weights of the more task-specific layers of the encoder and of the whole decoder. A comparison of the different encoder freezing schemes is shown in Fig. 5, and a sketch of how the two schemes can be realized is reported after the figure.

Fig. 5: Comparison of the different freezing schemes of the encoder at the $k$-th incremental step: (a) encoder trainable, (b) encoder frozen, (c) first two layers of the encoder frozen. The whole model at the previous step, i.e., $M_{k-1}$, is always completely frozen and it is employed only for knowledge distillation purposes.
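The two freezing schemes can be realized by toggling the trainable flag of the encoder layers, as in the sketch below; identifying encoder layers by a name prefix and freezing the first two of them are assumptions made for the example.

```python
import tensorflow as tf

def freeze_encoder(model, mode="full", first_n=2):
    """Freeze the feature extractor before an incremental step.

    mode="full" : freeze every encoder layer (scheme (b) of Fig. 5).
    mode="first": freeze only the first `first_n` encoder layers, leaving
                  the rest of the encoder and the decoder trainable
                  (scheme (c) of Fig. 5)."""
    encoder_layers = [l for l in model.layers if l.name.startswith("encoder")]
    to_freeze = encoder_layers if mode == "full" else encoder_layers[:first_n]
    for layer in to_freeze:
        layer.trainable = False
    return model
```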

IV-D Distillation on the Intermediate Feature Space ($\mathcal{L}_D''$)

Another approach to preserve the feature extraction capabilities of the encoder is to apply a distillation loss at the intermediate level corresponding to the output of the encoder $E$, i.e., on the feature space before the decoding phase. The distillation function working on the feature space in this case is no longer the cross-entropy but the L2 loss. At that level, indeed, the considered layer is no longer a classification layer but just an internal stage, whose output should be kept close to the previous one in, e.g., L2-norm. We also considered the L1 loss, but we verified empirically that both the L1 loss and the cross-entropy lead to worse results. Considering that the network corresponding to model $M_k$ can be decomposed into an encoder $E_k$ and a decoder $D_k$, the distillation term becomes:

$$\mathcal{L}_D'' = \frac{1}{|\mathcal{D}_k^{tr}|} \sum_{X_n \in \mathcal{D}_k^{tr}} \left\| E_{k-1}(X_n) - E_k(X_n) \right\|_2^2 \qquad (6)$$

where $E_k(X_n)$ denotes the features computed by $E_k$ when a generic image $X_n$ is fed as input.
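A minimal sketch of the feature-space penalty of Eq. (6) follows; the previous encoder acts as a frozen teacher, and the mean squared-error reduction over the feature map is an assumption of the example.

```python
import tensorflow as tf

def feature_distillation_loss(encoder_old, encoder_new, images):
    """L2 penalty between the feature maps of the previous and the current
    encoder (Eq. 6); the previous encoder only provides fixed targets."""
    f_old = tf.stop_gradient(encoder_old(images, training=False))
    f_new = encoder_new(images, training=True)
    return tf.reduce_mean(tf.square(f_old - f_new))
```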

IV-E Distillation on the Dilation Layers ($\mathcal{L}_D'''$)

Finally, we also tried to improve the approach by applying an L2 penalty at different points inside the network. In particular, we found that a reliable strategy is to apply the distillation on the four dilation layers contained in the ASPP block of the decoder [35]. Hence the distillation term becomes:

$$\mathcal{L}_D''' = \frac{1}{|\mathcal{D}_k^{tr}|} \sum_{X_n \in \mathcal{D}_k^{tr}} \sum_{i=1}^{4} \left\| D_{k-1}^{(i)}(X_n) - D_k^{(i)}(X_n) \right\|_2^2 \qquad (7)$$

where $D_k^{(i)}(X_n)$ is the output of the $i$-th dilation layer, $i = 1, \dots, 4$, when $X_n$ is fed as input. This strategy was not considered in [6] and proved to be effective in preserving the learned knowledge.
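The penalty of Eq. (7) can be sketched analogously by comparing the four ASPP branch outputs of the two models; how these intermediate tensors are exposed by the network is an implementation detail assumed here.

```python
import tensorflow as tf

def aspp_distillation_loss(aspp_old, aspp_new):
    """L2 penalty on each of the four dilation (ASPP) branches of the
    decoder (Eq. 7). `aspp_old` and `aspp_new` are lists with one tensor
    per branch, from the previous and the current model respectively."""
    loss = 0.0
    for d_old, d_new in zip(aspp_old, aspp_new):
        loss += tf.reduce_mean(tf.square(tf.stop_gradient(d_old) - d_new))
    return loss
```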

A summary of the proposed strategies for the incremental step procedure is shown in Fig. 1, which points out the three losses. As a final remark, we also tried a combination of the described distillation losses without achieving significant enhancements.

V Experimental Results

The proposed incremental learning strategies are independent of the backbone architecture and generalize well to different scenarios where new tasks should be learned over time. For the experimental evaluation we selected the architecture presented in Section IV-A and we evaluated our method on the Pascal VOC2012 [34] semantic segmentation benchmark. This dataset has been widely used to evaluate semantic segmentation schemes and consists of 1464 images in the training split and 1449 in the validation split. The semantic labeling assigns the pixels to 21 different classes (20 object classes plus the background). Since the labeled test set has not been made available, all the results have been computed on the images belonging to the Pascal VOC2012 validation split (i.e., using it as a test set), as done by all competing approaches.

V-A Implementation Details

We optimized the network weights with Stochastic Gradient Descent (SGD) as done in [35]. The initial stage of training of the network on the set $\mathcal{D}_0^{tr}$ follows a polynomial decay rule for the learning rate, which starts from its initial value and is progressively decreased during the training steps. Notice that the number of training steps is linearly proportional to the number of classes in $\mathcal{S}_0$. We also employed weight decay regularization and mini-batches of multiple images.

The incremental training steps have been performed employing a lower learning rate to better preserve the previous weights; also in this case the learning rate follows a polynomial decay. Notice that, again, we train the network for a number of steps proportional to the number of classes contained in the considered incremental step, thus automatically adapting the training length to the number of new classes being learned. The considered metrics are the most widely used for semantic segmentation problems: namely, per-class pixel accuracy (PA), per-class Intersection over Union (IoU), mean PA (mPA), mean class accuracy (mCA) and mean IoU (mIoU) [40].

We used TensorFlow [41] to develop and train the network: the overall training of the considered architecture takes around 5 hours on an NVIDIA 2080 Ti GPU. The code will be made available online.
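For reference, a polynomial learning-rate schedule and a mean IoU metric can be instantiated in TensorFlow as sketched below; the numeric values are placeholders and do not reproduce the exact schedule described above.

```python
import tensorflow as tf

# Polynomial learning-rate decay (illustrative values, not the paper's).
schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1e-3,
    decay_steps=20000,
    end_learning_rate=1e-5,
    power=0.9)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

# Mean IoU over the 21 Pascal VOC2012 classes (20 plus background).
miou = tf.keras.metrics.MeanIoU(num_classes=21)
```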

V-B Addition of One Class

Columns (left to right): backgr., aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, din. table, dog, horse, mbike, person, plant, sheep, sofa, train, mIoU old, tv, mIoU, mPA, mCA.

Fine-tuning 90.2 80.8 33.3 83.1 53.7 68.2 84.6 78.0 83.2 32.1 73.4 52.6 76.6 72.7 68.8 79.8 43.8 76.5 46.5 68.4 67.3 20.1 65.1 90.7 76.5
([6]) 92.0 83.9 37.0 84.0 58.8 70.9 90.9 82.5 86.1 32.1 72.5 51.0 79.9 72.3 77.3 80.9 45.1 78.1 45.7 79.9 70.0 35.3 68.4 92.5 79.5
92.6 85.7 33.4 85.3 63.1 74.0 92.6 83.0 86.4 30.4 78.1 55.0 79.1 77.8 76.4 81.7 49.7 80.2 48.5 80.4 71.7 44.4 70.4 93.2 80.1
92.7 86.2 32.6 82.9 61.7 74.6 92.9 83.1 87.7 27.4 79.4 59.0 79.4 76.9 77.2 81.2 49.6 80.8 49.3 83.4 71.9 43.3 70.5 93.2 81.4
, 93.1 85.9 37.3 85.5 63.1 77.5 93.2 82.2 88.8 29.4 80.1 57.1 80.6 79.4 76.9 82.5 50.0 81.8 51.1 85.0 73.0 51.9 72.0 93.6 82.3
, 92.7 84.7 35.3 86.0 60.7 73.3 92.8 82.6 87.6 29.9 78.6 54.4 80.3 78.0 76.3 81.5 50.0 80.9 49.5 82.8 71.9 47.4 70.7 93.2 80.7
92.9 84.8 36.4 82.6 63.5 75.0 92.2 83.6 88.3 29.5 80.3 59.6 79.7 80.2 78.9 81.2 49.7 78.9 51.0 84.1 72.6 50.6 71.6 93.4 83.4
92.2 85.4 34.3 82.4 61.6 73.4 91.7 82.7 86.4 32.4 77.2 57.4 76.3 72.6 76.1 81.1 53.7 79.2 46.1 81.5 71.2 35.6 69.5 92.6 81.5
, 92.5 84.7 33.8 80.4 60.8 76.1 91.5 82.9 87.1 29.5 78.4 58.7 76.1 73.7 78.8 81.0 51.1 78.3 48.3 84.9 71.4 42.7 70.1 93.0 82.6
93.4 85.5 37.1 86.2 62.2 77.9 93.4 83.5 89.3 32.6 80.7 57.3 81.5 81.2 77.7 83.0 51.5 81.6 48.2 85.0 73.4 - 73.4 93.9 84.3
93.4 85.4 36.7 85.7 63.3 78.7 92.7 82.4 89.7 35.4 80.9 52.9 82.4 82.0 76.8 83.6 52.3 82.4 51.1 86.4 73.7 70.5 73.6 93.9 84.2
TABLE I: Per-class IoU on the Pascal VOC2012 under different settings when the last class, i.e., the tv/monitor class, is added.

Columns (left to right): backgr., aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, din. table, dog, horse, mbike, person, plant, sheep, sofa, train, mCA old, tv, mIoU, mPA, mCA.

Fine-tuning 94.7 83.9 66.9 90.2 63.9 80.2 86.6 81.6 93.6 46.1 80.9 57.4 87.8 90.4 72.4 87.8 48.5 81.1 56.2 71.8 76.1 84.3 65.1 90.7 76.5
([6]) 96.2 89.8 77.0 93.2 73.1 83.7 93.7 91.0 94.4 44.5 79.8 55.4 86.6 91.7 85.3 88.2 50.3 81.8 52.9 85.5 79.7 76.4 68.4 92.5 79.5
96.6 92.5 56.2 93.7 76.3 84.5 95.4 90.1 95.9 41.4 86.1 61.3 90.4 91.3 83.8 88.7 60.8 83.7 66.1 83.7 80.9 64.6 70.4 93.2 80.1
96.5 91.4 56.6 94.6 79.8 83.0 97.1 87.4 95.5 33.8 90.1 66.8 92.6 94.1 89.2 86.1 64.0 87.9 62.5 86.6 81.8 74.3 70.5 93.2 81.4
, 96.7 92.3 73.8 93.5 84.6 87.3 97.5 91.8 94.9 38.7 88.4 63.2 92.0 92.5 91.6 88.5 62.7 86.5 64.0 91.0 83.6 57.6 72.0 93.6 82.3
, 96.6 89.9 64.9 93.1 76.5 85.0 95.3 91.4 95.7 41.8 86.5 60.3 90.9 90.6 83.5 88.1 59.7 84.7 65.3 87.9 81.4 67.4 70.7 93.2 80.7
96.2 93.3 69.6 95.4 84.5 87.0 97.6 91.0 96.1 41.2 91.4 71.6 93.4 91.5 90.9 85.8 63.2 85.1 69.4 88.6 84.1 68.0 71.6 93.4 83.4
95.6 91.0 64.7 93.3 76.7 87.3 95.8 90.2 95.1 53.1 86.0 65.0 92.5 87.9 82.1 88.6 66.2 83.6 57.9 90.8 82.2 67.5 69.5 92.6 81.5
, 95.8 94.8 58.8 95.0 86.8 88.7 96.4 89.9 95.6 41.1 90.9 68.7 94.2 91.7 89.9 86.3 70.5 88.7 61.6 88.7 83.7 60.9 70.1 93.0 82.6
96.8 94.6 82.1 92.4 85.6 88.4 97.1 92.2 94.1 48.5 89.2 64.0 90.6 90.4 90.3 88.8 65.5 86.0 56.0 92.6 84.3 - 73.4 93.9 84.3
96.7 94.8 76.6 92.6 86.3 87.4 97.7 93.0 94.6 52.2 90.1 55.8 90.9 90.5 89.6 89.8 65.1 86.2 63.8 93.6 84.4 81.0 73.6 93.9 84.2
TABLE II: Per-class Pixel Accuracy on VOC2012 under different settings when the last class, i.e., the tv/monitor class, is added.

Columns (left to right): backgr., person, cat, dog, car, train, chair, bus, sofa, mbike, din. table, aero, horse, bird, bike, tv, boat, plant, sheep, cow, mIoU old, bottle, mIoU, mPA, mCA.

Fine-tuning 91.9 80.7 82.2 72.3 81.7 77.9 27.2 90.2 46.9 74.5 56.1 82.4 71.8 77.9 34.9 55.8 58.7 31.0 71.9 66.9 66.6 63.8 66.5 92.4 75.8
92.9 82.7 87.9 80.0 82.3 82.5 31.7 90.5 49.3 75.7 57.0 85.2 77.9 85.5 37.3 65.2 63.7 48.3 79.3 77.4 71.6 68.2 71.5 93.4 81.2
92.6 81.8 87.8 81.5 83.5 84.1 26.4 92.3 50.6 68.5 54.6 86.1 79.3 85.9 36.6 66.6 62.3 49.6 79.2 80.0 71.5 61.9 71.0 93.3 81.4
, 92.9 82.1 89.3 82.2 83.5 85.0 28.6 92.5 50.2 74.2 55.4 86.1 79.2 85.4 36.9 66.7 62.6 52.1 80.1 79.6 72.2 64.2 71.8 93.6 81.0
, 92.9 82.6 88.2 81.3 82.4 85.3 31.4 91.5 50.1 76.0 57.0 84.8 78.0 85.7 36.9 64.9 61.8 49.3 79.9 76.8 71.8 69.0 71.7 93.5 81.8
92.9 81.7 88.5 81.8 83.8 85.0 27.2 92.4 51.8 73.0 56.0 85.9 79.9 85.7 37.0 65.7 61.7 48.7 80.1 80.0 71.9 62.3 71.5 93.5 81.8
92.5 82.7 86.5 79.7 83.4 83.1 28.4 91.9 46.6 68.7 54.7 83.3 75.4 83.8 32.8 65.8 62.8 48.2 78.6 73.8 70.1 67.5 70.0 93.1 78.9
93.5 80.9 89.7 82.8 84.4 85.5 33.1 92.5 47.6 79.3 57.0 85.9 79.9 85.9 37.2 67.8 62.5 53.4 80.5 79.7 73.0 - 73.0 94.0 83.3
93.4 83.6 89.7 82.4 82.4 86.4 35.4 92.7 51.1 76.8 52.9 85.4 82.0 85.7 36.7 70.5 63.3 52.3 82.4 80.9 73.7 78.7 73.6 93.9 84.2
TABLE III: Per-class IoU on VOC2012 when the last class according to the occurrence in the dataset, i.e. the bottle class, is added.

Following the experimental scenarios presented in [1, 6], we first analyze the addition of the last class, in alphabetical order, to our network. Specifically, we consider an initial set $\mathcal{S}_0$ made of the background and the first 19 object classes and a single new class $\mathcal{U}_1 = \{\text{tv/monitor}\}$. The network is firstly optimized on the training split containing samples belonging to any of the classes in $\mathcal{S}_0$, i.e., $\mathcal{D}_0^{tr}$. The evaluation of the proposed methodologies in this setting on the VOC2012 validation split is reported in Tables I and II. Here we indicate with $M_0$ the initial training of the network using $\mathcal{D}_0^{tr}$ as training dataset. The network is then updated exploiting the dataset $\mathcal{D}_1^{tr}$, and the resulting model is denoted with a notation specifying both the index of the training step and the indexes of the classes added in the considered step. First of all, it is possible to notice that, even adding just one single class, the mIoU is affected by the catastrophic forgetting issue. Indeed, the reference model where all the classes are learned at once achieves a mIoU of 73.6%, higher than all methods first trained on 20 classes and then adapted to learn the last class.
From the first row of Table I we can appreciate that adapting the network in the standard way, i.e., without additional provisions, leads to an evident degradation of the performance with a final mIoU of 65.1%. This is a clear confirmation of the catastrophic forgetting phenomenon in the semantic segmentation scenario, even with the addition of just one single class. The main issue of the naïve approach (called "fine-tuning" in the tables) is that it tends to predict the last class too frequently, even when it is not present, as proved by the fact that the model has a very high pixel accuracy of 84.3% on the tv/monitor class but a very poor IoU of 20.1% on the same class. This is due to the high number of false positive detections of the considered class, which are not taken into account by the pixel accuracy measure. For this reason semantic segmentation frameworks are commonly ranked by mIoU score instead of mean pixel accuracy and we adopt the same criterion here. On the same class, the proposed methods are all able to outperform the naïve approach in terms of IoU by a large margin: the best method achieves 51.9%.

Knowledge distillation strategies and the procedure of freezing the encoder provide better results because they act as regularization constraints. Interestingly, those procedures allow to achieve higher accuracy not only on the previously learned classes but also on the newly added ones, which might be unexpected if we do not consider the regularization behavior of those terms. Hence all the proposed strategies alleviate the catastrophic forgetting phenomenon and, noticeably, all of them overcome the standard approach (without knowledge distillation) in all the considered metrics, as can be verified by looking at Tables I and II. We can appreciate that the output-layer distillation $\mathcal{L}_D'$ alone is able to improve the average mIoU with respect to the standard case. Notice how the improved version of $\mathcal{L}_D'$ with temperature scaling introduced in this work achieves a significant mIoU improvement w.r.t. our previous work [6], whose result is reported in the second row of Table I. Furthermore, it leads to a much better IoU on the new class, greatly reducing the aforementioned false positives issue. Completely freezing the encoder without applying knowledge distillation also improves the mIoU. If we combine the two mentioned approaches, i.e., we freeze the encoder and we apply $\mathcal{L}_D'$, the mIoU reaches 72.0%, a larger improvement than the two methods achieve alone (also the performance on the new class is higher). If we just freeze the first two layers of the encoder and we apply knowledge distillation, a slightly lower mIoU is achieved. Instead, if we apply an L2 loss at the intermediate feature space ($\mathcal{L}_D''$), the model achieves a mIoU considerably higher than the standard approach. It is noticeable that two completely different approaches to preserve knowledge, namely the cross-entropy between the outputs with the encoder frozen and the L2 loss between feature spaces, achieve similar and high results both on the new class and on the old ones. Notice that if the encoder is frozen then it does not make sense to enable the $\mathcal{L}_D''$ loss.
Finally, if we apply an L2 loss on the dilation filters of the decoder, i.e., the $\mathcal{L}_D'''$ loss, we obtain a mIoU which is higher than the standard approach but lower than the other strategies. Freezing the encoder yields a small further improvement in this setting.
From these results we can appreciate that the changes in performance on previously seen classes are correlated with the class being added. Some classes have even higher results in terms of IoU than before because their prediction has been reinforced through the new training set: in semantic segmentation, differently from image classification, a scene usually contains multiple classes. For example, objects that typically appear in scenes containing a tv/monitor (such as indoor furniture) almost always achieve higher accuracy than after the first stage of training. Some other classes, instead, get more easily lost because they represent uncorrelated objects that are not present inside the new set of samples (e.g., objects that do not appear in the indoor scenes typically associated with the class being added).

Fig. 8: Qualitative results on sample scenes for the addition of one class (best viewed in colors). Color legend: background, bottle, cat, chair, cow, dining table, person, plant, tv, unlabeled. In the first two columns the tv/monitor class is added, in the last column the bottle class is added.

Some visual examples in this scenario are shown in the first two columns of Fig. 8, where the naïve approach is compared against the baseline and two of the best proposed strategies. We can visually appreciate that knowledge distillation and encoder freezing help in preserving previous classes (e.g., some of the previously learned objects in the first and second columns) whilst not compromising the learning of the new class (e.g., the tv/monitor in the first column). In the second column, however, a challenging example is reported where neither the proposed methodologies nor the baseline approach are able to accurately detect the new class.

In Table III the IoU results in the same scenario are shown, but this time ordering the classes according to the pixel occurrence of each class, thus the bottle class is added last. Results are similar to the previous case: the baseline approach exhibits a large drop in performance with respect to all the proposed approaches. Knowledge distillation always helps in every scenario and, as in the previous case, the best performing strategy is the combination of encoder freezing and output-layer distillation. Qualitative results are shown in the last column of Fig. 8: we can verify that knowledge distillation and encoder freezing help not only to retain previously seen classes, but also to better detect and localize the new class, i.e., the bottle, thus acting as a regularization term.

V-C Addition of Multiple Classes

In this section we consider a more challenging scenario where the initial training is followed by one step of incremental learning with multiple classes to learn. The results presented here do not consider all the strategies shown in Section V-B but only those that led to substantial improvements.

First, the addition of the last 5 classes at once is discussed and the results are shown in Tables IV and V. In this setting the results are much lower than in the previous cases where a single class was added at a time, since there is a larger amount of information to be learned. In particular, the baseline exhibits an even larger drop in accuracy because it tends to overestimate the presence of the new classes, assigning pixels to them more often than needed. We can confirm this by looking at the IoU scores of the newly added classes, which are often lower in the baseline by a large margin (see Table IV), while on the other side the pixel accuracy of the new classes is much higher than the one obtained with the various distillation strategies (see Table V). As before, a strategy to preserve previous knowledge is needed and this is what the proposed strategies aim at obtaining. In this case the distillation on the output layer, i.e., $\mathcal{L}_D'$, achieves the highest accuracy. In general, here the approaches based on $\mathcal{L}_D'$ outperform the other ones (also on the new classes); however, in this scenario as well, all the proposals outperform the standard approach on both old and new classes. Interestingly, some previously seen classes exhibit a clear catastrophic forgetting phenomenon because the updated models confuse them with visually similar classes of the set of new classes. For example, the cow and chair classes are often confused (low IoU and low PA for these classes) with the newly added classes sheep and sofa, which have similar shapes (low IoU but high PA for them).
Qualitative results are shown in Fig. 11: we can appreciate that the naïve approach tends to overestimate the presence of the new classes at the expense of previously learned ones or of the background. This can be seen from the first two columns, where samples of the newly added classes are erroneously predicted inside regions belonging to previously learned classes or to the background, while these spurious predictions are correctly removed when applying distillation and freezing the encoder. Additionally, in the third column the baseline predicts one of the new classes in place of a previously learned one, while this issue is not present in the output of the best proposed model.

Fig. 11: Qualitative results on sample scenes for the addition of five classes at once (best viewed in colors). Color legend: background, cat, chair, dog, person, plant, sofa, tv, unlabeled.

Columns (left to right): backgr., aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, din. table, dog, horse, mbike, person, mIoU old, plant, sheep, sofa, train, tv, mIoU new, mIoU, mPA, mCA.

Fine-tuning 89.7 59.5 34.6 68.2 58.1 58.8 59.2 79.2 80.2 30.0 12.7 51.0 72.5 61.7 74.4 79.4 60.6 36.4 32.4 27.2 55.2 42.4 38.7 55.4 88.4 70.6
91.4 85.0 35.6 84.8 61.8 70.5 85.6 77.9 83.7 30.7 72.3 45.3 76.2 76.9 77.0 81.3 71.0 33.8 55.2 30.9 73.9 51.6 49.1 65.8 91.6 78.1
, 91.7 83.4 35.6 78.7 60.9 73.0 65.8 82.2 87.0 30.2 58.0 55.3 80.0 78.3 78.5 81.4 70.0 36.0 45.9 32.2 62.5 53.0 45.9 64.3 91.5 76.1
, 91.0 80.3 35.8 82.9 60.9 66.4 80.9 80.1 84.3 32.8 59.4 47.7 75.9 76.0 76.4 81.6 69.5 37.7 47.2 29.9 69.8 48.0 46.5 64.0 91.0 77.1
90.9 81.4 33.9 80.3 61.9 67.4 73.1 81.8 84.8 31.3 0.4 55.8 76.1 72.2 77.7 81.2 65.6 39.4 31.8 31.3 64.1 52.9 43.9 60.5 90.0 74.9
91.1 85.1 31.7 80.3 62.6 72.1 82.6 79.5 84.4 31.1 34.9 56.6 77.2 75.7 77.5 81.7 69.0 40.6 43.4 30.3 70.7 52.2 47.4 63.9 91.0 77.4
94.0 83.5 36.1 85.5 61.0 77.7 94.1 82.8 90.0 40.0 82.8 54.9 83.4 81.2 78.3 83.2 75.5 - - - - - - 75.5 94.6 86.4
93.4 85.4 36.7 85.7 63.3 78.7 92.7 82.4 89.7 35.4 80.9 52.9 82.4 82.0 76.8 83.6 75.1 52.3 82.4 51.1 86.4 70.5 68.5 73.6 93.9 84.2
TABLE IV: Per-class IoU on the Pascal VOC2012 under some settings when the last 5 classes are added at once.

Columns (left to right): backgr., aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, din. table, dog, horse, mbike, person, mCA old, plant, sheep, sofa, train, tv, mCA new, mIoU, mPA, mCA.

Fine-tuning 93.5 59.7 69.5 69.3 66.9 64.7 59.9 87.2 86.6 36.5 12.7 53.9 79.7 63.8 83.1 89.7 67.3 61.0 93.0 79.6 92.9 80.0 81.3 55.4 88.4 70.6
95.5 88.7 77.0 90.6 83.1 80.2 88.6 93.2 90.0 47.9 75.6 47.3 81.5 85.7 85.9 90.4 81.3 39.7 88.0 55.6 86.2 68.1 67.5 65.8 91.6 78.1
, 96.0 86.6 71.3 80.3 79.1 81.2 66.7 91.3 92.4 43.9 60.0 63.2 89.9 83.3 89.6 88.2 78.9 44.7 89.4 57.0 83.0 61.0 67.0 64.3 91.5 76.1
, 94.9 81.8 75.3 86.4 78.2 74.5 83.2 92.5 90.4 45.2 61.2 49.7 81.4 84.0 85.1 89.9 78.4 47.5 90.8 67.2 88.1 71.0 72.9 64.0 91.0 77.1
94.4 84.3 67.1 82.6 74.8 77.0 74.0 89.8 92.0 39.7 0.4 62.7 86.5 75.1 87.0 89.1 73.5 58.7 90.9 78.9 90.9 76.7 79.2 60.5 90.0 74.9
94.5 91.9 62.4 85.1 78.5 83.0 85.4 92.2 93.2 42.0 35.2 62.3 91.4 86.5 87.9 90.5 78.9 51.0 89.3 69.4 87.6 66.4 72.7 63.9 91.0 77.4
96.9 94.8 77.8 93.4 87.1 86.7 97.1 92.8 94.8 53.5 91.3 56.6 90.2 89.6 90.9 89.4 86.4 - - - - - - 75.5 94.6 86.4
96.7 94.8 76.6 92.6 86.3 87.4 97.7 93.0 94.6 52.2 90.1 55.8 90.9 90.5 89.6 89.8 86.2 65.1 86.2 63.8 93.6 81.0 78.0 73.6 93.9 84.2
TABLE V: Per-class pixel accuracy on the Pascal VOC2012 under some settings when the last 5 classes are added at once.

Columns (left to right): backgr., aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, mIoU old, din. table, dog, horse, mbike, person, plant, sheep, sofa, train, tv, mIoU new, mIoU, mPA, mCA.

Fine-tuning 91.9 82.4 32.0 70.8 61.7 67.7 91.1 79.8 72.7 30.5 61.6 67.5 49.1 70.6 63.4 72.9 79.4 43.5 72.0 44.8 79.1 60.7 63.5 65.6 91.9 78.2
91.7 83.2 33.4 80.9 62.3 72.6 89.2 76.8 77.6 28.0 64.1 69.1 48.6 73.5 65.7 72.9 76.6 41.3 74.2 39.5 79.0 62.1 63.3 66.3 91.9 77.3
, 91.4 85.2 33.3 82.5 62.7 75.1 89.7 76.4 75.3 25.9 67.9 69.6 42.2 64.7 66.4 68.0 67.9 39.4 70.4 32.9 72.5 60.5 58.5 64.3 91.2 75.2
, 91.8 84.0 33.6 83.2 62.7 72.4 90.9 77.0 79.9 28.2 65.4 69.9 46.8 72.7 66.8 71.5 75.3 41.1 74.2 38.2 80.0 59.7 62.6 66.5 91.9 77.7
92.1 83.5 34.0 79.5 61.7 69.1 90.9 78.5 72.5 29.3 61.2 68.4 46.2 66.1 65.3 74.3 79.1 43.0 70.0 47.1 78.3 63.5 63.3 66.0 91.8 79.4
92.0 84.5 33.5 74.7 61.2 71.5 89.7 77.9 73.5 28.6 61.8 68.1 51.9 67.1 64.9 70.8 77.3 42.8 70.5 45.3 78.9 61.6 63.1 65.7 91.9 77.3
95.3 86.4 34.4 85.6 69.7 79.3 94.6 87.6 93.1 44.2 91.9 78.4 - - - - - - - - - - - 78.4 96.1 90.4
93.4 85.4 36.7 85.7 63.3 78.7 92.7 82.4 89.7 35.4 80.9 74.9 52.9 82.4 82.0 76.8 83.6 52.3 82.4 51.1 86.4 70.5 72.1 73.6 93.9 84.2
TABLE VI: Per-class IoU on the Pascal VOC2012 under some settings when the last 10 classes are added at once.

The next experiment regards the addition of the last 10 classes at once and the results are shown in Table VI. Here knowledge distillation is less effective: even though it enhances the results, the improvement is smaller with respect to the one in the other scenarios. In particular, the idea of freezing only the first two layers of the encoder, introduced in this version of the work, together with knowledge distillation leads to the best results in this setting. The gap is reduced also because the fine-tuning approach already achieves quite high results, preventing other methods from largely overcoming it. We argue that the critical aspect is that the cardinality of the set of classes being added is comparable to the one of the set of previously learned classes.

V-D Sequential Addition of Multiple Classes

The last set of experiments is the one in which one or more classes are added more than once (i.e., new classes are progressively added instead of all in one shot).

Columns (left to right): backgr., aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, mIoU old, din. table, dog, horse, mbike, person, plant, sheep, sofa, train, tv, mIoU new, mIoU, mPA, mCA.

Fine-tuning 89.3 56.8 32.0 60.6 56.0 55.8 36.8 75.7 76.4 28.2 4.1 52.0 47.3 67.8 50.8 69.2 78.3 34.8 27.6 25.7 44.8 37.0 48.3 50.2 86.9 66.2
90.3 80.7 33.5 74.9 62.6 62.3 74.2 77.2 78.5 27.4 23.6 62.3 44.3 70.6 61.2 72.5 78.9 38.1 37.5 29.7 62.4 46.9 54.2 58.4 89.5 73.0
, 90.3 85.0 32.6 73.2 61.3 72.5 79.2 79.7 81.0 27.1 31.3 64.8 42.5 68.5 58.5 68.5 73.1 36.5 33.9 30.5 68.0 54.8 53.5 59.4 89.8 71.4
89.8 64.7 33.3 73.7 58.3 63.8 48.7 77.9 79.8 28.4 11.4 57.2 50.1 68.2 53.0 70.8 79.2 39.0 28.9 26.8 49.4 44.2 51.0 54.3 88.0 69.7
90.6 81.8 32.9 77.7 62.5 66.7 78.8 78.7 79.2 27.7 25.1 63.8 49.7 69.1 56.6 72.1 79.5 40.1 34.2 28.5 65.5 50.7 54.6 59.4 89.6 73.9
95.3 86.4 34.4 85.6 69.7 79.3 94.6 87.6 93.1 44.2 91.9 78.4 - - - - - - - - - - - 78.4 96.1 90.4
93.4 85.4 36.7 85.7 63.3 78.7 92.7 82.4 89.7 35.4 80.9 74.9 52.9 82.4 82.0 76.8 83.6 52.3 82.4 51.1 86.4 70.5 72.1 73.6 93.9 84.2
TABLE VII: Per-class IoU on the Pascal VOC2012 under some settings when the last 10 classes are added in two incremental steps of 5 classes each.

Let us start from the case in which two sets of 5 classes are added in two incremental steps after an initial training stage on the first 10 classes. The mIoU results are reported in Table VII, where we can appreciate a more severe drop in performance if compared with the introduction of all the classes in a single shot. In particular, the standard approach without distillation leads to a very poor mIoU of 50.2%. Differently from the one-shot case, catastrophic forgetting is largely mitigated by knowledge distillation, which in this case proved to be much more effective. In the best settings, that in this case are the output distillation with the encoder fixed and the newly introduced distillation applied to the dilation layers ($\mathcal{L}_D'''$), the mIoU improves by more than 9% with respect to the standard approach. The method using $\mathcal{L}_D'''$ is also the one obtaining the best mIoU on the new classes.

Columns (left to right): backgr., person, cat, dog, car, train, chair, bus, sofa, mbike, din. table, mIoU old, aero, horse, bird, bike, tv, boat, plant, sheep, cow, bottle, mIoU new, mIoU, mPA, mCA.

Fine-tuning 91.3 80.5 75.7 67.8 80.2 73.4 29.7 84.4 42.1 70.4 55.6 68.3 19.7 41.1 5.7 29.8 63.9 41.2 38.4 45.7 55.1 63.3 40.4 55.0 90.3 69.1
92.5 81.0 78.6 69.9 80.5 80.3 31.0 88.8 45.1 73.3 50.2 70.1 79.7 53.4 71.5 32.7 61.6 53.1 40.1 58.8 59.9 71.0 58.2 64.4 92.2 76.0
, 92.5 81.7 82.1 76.1 83.5 83.1 29.0 92.7 46.7 71.2 55.4 72.2 70.9 29.5 59.2 32.0 59.6 46.3 38.5 49.0 52.4 61.6 49.9 61.6 91.9 72.6
92.5 82.2 82.7 74.2 81.8 78.7 31.8 88.0 46.2 73.8 58.3 71.8 66.0 39.8 56.9 31.0 63.5 42.6 45.3 54.5 60.2 69.3 52.9 62.8 92.0 74.7
92.0 81.2 82.6 68.2 78.3 81.4 29.1 91.3 45.1 71.6 56.9 70.7 0.3 23.4 0.1 23.6 61.6 46.8 44.1 49.6 59.4 70.0 37.9 55.1 91.0 66.8
92.5 80.6 89.2 85.5 86.3 86.8 30.7 93.3 46.2 80.7 59.6 75.6 - - - - - - - - - - - 75.6 93.5 82.8
93.4 83.6 89.7 82.4 82.4 86.4 35.4 92.7 51.1 76.8 52.9 75.2 85.4 82.0 85.7 36.7 70.5 63.3 52.3 82.4 80.9 78.7 71.8 73.6 93.9 84.2
TABLE VIII: Per-class IoU on VOC2012 when the last 10 classes are added in two incremental steps, with classes ordered based on the occurrence in the dataset.

In Table VIII the same scenario is evaluated when classes are sorted on the basis of their occurrence inside the dataset. Also in this case a large improvement (more than 9% of mIoU in the best case) can be obtained with knowledge distillation if compared with the standard approach. As expected, we can notice that the old classes are better preserved in this case, being also the most frequent inside the dataset. Additionally, some methods struggle in learning the new classes, needing more samples to detect them.

Columns (left to right): backgr., aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, din. table, dog, horse, mbike, person, mIoU old, plant, sheep, sofa, train, tv, mIoU new, mIoU, mPA, mCA.

Fine-tuning 87.9 25.6 29.0 51.2 1.7 57.8 10.5 64.8 80.5 30.8 22.9 52.7 66.8 52.1 51.9 78.1 47.8 36.5 44.7 31.8 35.1 17.1 33.0 44.2 86.1 55.7
89.8 51.2 29.9 77.2 15.6 62.0 29.2 78.5 75.7 24.4 55.6 44.8 76.2 62.5 65.6 80.1 57.4 27.0 35.2 30.6 42.3 39.7 35.0 52.3 88.6 63.2
, 91.1 73.9 31.9 81.4 59.5 71.9 73.1 82.1 87.1 27.2 77.4 56.4 79.1 79.9 76.1 80.7 70.5 31.8 55.8 30.1 62.3 41.4 44.3 64.6 91.3 75.2
90.3 54.2 28.2 78.4 52.5 69.8 59.5 78.5 86.3 28.8 72.3 57.4 76.3 77.1 65.8 79.3 65.9 36.3 65.5 31.6 54.7 38.9 45.4 61.0 90.4 71.0
90.2 69.1 31.0 78.4 32.1 61.8 41.9 73.7 83.7 30.0 54.8 52.5 69.5 62.8 61.2 81.0 60.8 30.0 46.5 32.5 43.5 30.0 36.5 55.1 89.2 66.5
94.0 83.5 36.1 85.5 61.0 77.7 94.1 82.8 90.0 40.0 82.8 54.9 83.4 81.2 78.3 83.2 75.5 - - - - - - 75.5 94.6 86.4
93.4 85.4 36.7 85.7 63.3 78.7 92.7 82.4 89.7 35.4 80.9 52.9 82.4 82.0 76.8 83.6 75.1 52.3 82.4 51.1 86.4 70.5 68.5 73.6 93.9 84.2
TABLE IX: Per-class IoU on the Pascal VOC2012 under some settings when the last 5 classes are added sequentially one at a time.

Then we move to consider the sequential addition of the last 5 classes one by one. The results are reported in Table IX, where we can appreciate a large gain of more than 20% of mIoU between the best proposed method (i.e., output-layer distillation with the encoder frozen) and the standard approach. In this case freezing the encoder and distilling the knowledge is found to be very reliable because the addition of one single class should not alter too much the responses of the whole network. Distilling the knowledge from the previous model when the encoder is fixed guides the decoder to modify only the responses for the new class: in this way the best result is obtained.

Columns: mIoU, mPA and mCA for Fine-tuning and for four of the proposed approaches (one group of three columns per method). Each row corresponds to one incremental step.
71.2 93.7 82.5 72.4 94.2 83.0 72.5 94.1 83.5 72.9 94.2 84.5 72.2 93.9 84.3
53.8 90.0 61.8 68.1 93.4 78.5 68.4 93.3 79.5 68.0 93.4 78.6 60.0 91.6 69.4
57.7 87.7 68.7 63.3 90.8 74.5 66.5 91.5 79.4 64.6 90.2 76.9 65.5 90.7 76.8
39.3 85.9 47.4 54.1 89.2 64.3 61.3 90.6 72.5 57.9 89.7 69.0 52.1 89.0 60.6
44.2 86.1 55.7 52.3 88.6 63.2 64.6 91.3 75.2 55.1 89.2 66.5 61.0 90.4 71.0
TABLE X: mIoU, mPA and mCA on the Pascal VOC2012 under some settings when the last 5 classes are added sequentially.

The evolution of the models' mean performance during the various steps is reported in Table X, where we can appreciate how the drop of performance is distributed across the different steps. In particular, we can notice how the accuracy drop is affected by the specific class being added. As expected, the larger drops are experienced when the sheep and train classes are added, because such classes are only sparsely correlated with a few other classes (they mainly appear alone or together with the person class). The opposite is true when the classes being added show a high assortativity coefficient with other classes; for example, the presence of the sofa and tv/monitor classes is highly correlated with the presence of classes like chair, person or dining table.

Fig. 13: Qualitative results on sample scenes for the sequential addition of five classes (best viewed in colors). Color legend: background, chair, person, train, tv, unlabeled.

Some visual results for this scenario are reported in Fig. 13, where a huge gap in performance between the naïve approach and two of the best performing proposals can be appreciated. In particular, the standard approach without knowledge distillation tends to overestimate the presence of the last seen class, i.e., the tv/monitor, at the expense of previously learned classes.

Fig. 16: Qualitative comparison on sample scenes (RGB, ground truth, prediction before and after) of the best model of Table XI before and after the addition of a highly correlated class (best viewed in colors). Color legend: background, bus, cow, horse, person, sheep, train, unlabeled. The first two columns show the results after the addition of one of the new classes, while the last two deal with the addition of another.

Columns (left to right): backgr., aero, bike, bird, boat, bottle, bus, car, cat, chair, cow, din. table, dog, horse, mbike, person, plant, sheep, sofa, train, tv, mIoU, mPA, mCA.

94.0 83.5 36.1 85.5 61.0 77.7 94.1 82.8 90.0 40.0 82.8 54.9 83.4 81.2 78.3 83.2 - - - - - 75.5 94.6 86.4
93.5 84.0 36.1 84.8 60.5 72.5 93.4 84.2 89.7 40.0 83.0 55.7 81.9 81.6 79.4 83.2 29.0 - - - - 72.5 94.1 83.5
93.5 84.9 35.6 72.5 61.2 73.7 93.7 83.7 79.6 39.9 73.2 57.1 78.4 74.7 79.1 83.2 29.4 37.3 - - - 68.4 93.3 79.5
91.3 83.5 34.4 76.2 61.7 72.6 93.8 83.9 85.6 26.2 77.3 57.4 78.0 77.8 78.8 81.8 30.0 46.7 26.7 - - 66.5 91.5 79.4
91.2 67.8 31.7 63.9 60.5 73.1 43.2 83.5 86.4 25.1 77.7 56.7 79.1 77.9 74.3 81.7 27.0 49.2 28.0 48.7 - 61.3 90.6 72.5
91.1 73.9 31.9 81.4 59.5 71.9 73.1 82.1 87.1 27.2 77.4 56.4 79.1 79.9 76.1 80.7 31.8 55.8 30.1 62.3 41.4 64.6 91.3 75.2
93.4 85.4 36.7 85.7 63.3 78.7 92.7 82.4 89.7 35.4 80.9 52.9 82.4 82.0 76.8 83.6 52.3 82.4 51.1 86.4 70.5 73.6 93.9 84.2
TABLE XI: Per-class IoU on VOC2012 when classes are added sequentially. Only the best method of Table IX (“ and