On the Exploration of Incremental Learning for Fine-grained Image Retrieval

10/15/2020 · Wei Chen, et al.

In this paper, we consider the problem of fine-grained image retrieval in an incremental setting, when new categories are added over time. On the one hand, repeatedly re-training the representation on the extended dataset is time-consuming. On the other hand, fine-tuning the learned representation only on the new classes leads to catastrophic forgetting. To this end, we propose an incremental learning method to mitigate the retrieval performance degradation caused by forgetting. Without accessing any samples of the original classes, the classifier of the original network provides soft "labels" to transfer knowledge to the adaptive network, so as to preserve the previous classification capability. More importantly, a regularization function based on Maximum Mean Discrepancy is devised to minimize the discrepancy between the new-class features produced by the original network and by the adaptive network. Extensive experiments on two datasets show that our method effectively mitigates catastrophic forgetting on the original classes while achieving high performance on the new classes.


1 Introduction

As the number of images keeps growing, deep models for fine-grained image retrieval (FGIR) must be adaptable to new incoming classes. However, current image retrieval approaches focus mainly on static datasets and are not suited for incremental learning scenarios. Specifically, deep networks well-trained on the original classes under-perform on new incoming classes.

When new classes are added to an existing dataset, joint training on all classes guarantees performance. However, as the number of new classes increases, repeatedly re-training becomes time-consuming. Alternatively, fine-tuning adapts the network to the new classes and achieves good performance on them. However, when the original classes are inaccessible during fine-tuning, performance on the original classes degrades dramatically because of catastrophic forgetting, a phenomenon that occurs when a network is trained sequentially on a series of new tasks and learning these tasks interferes with performance on previous tasks, as shown in Figure 1(a).

Most incremental learning methods are designed for image classification, which is robust and forgiving as long as features remain within the classification boundaries. In contrast, image retrieval depends more on discrimination in the feature space than on classification decisions. Especially for FGIR, small changes in visual features may have a large impact on retrieval performance. Additionally, we find that standard methods such as Learning without Forgetting (LwF [Li and Hoiem(2017)]) and Elastic Weight Consolidation (EWC [Kirkpatrick et al.(2017)Kirkpatrick, Pascanu, Rabinowitz, Veness, Desjardins, Rusu, Milan, Quan, Ramalho, Grabska-Barwinska, et al.]) are insufficient for this problem because their distillation does not operate on the actual feature space (see Sections 4.2 and 4.3).

Considering the above limitations, we propose a deep learning model to tackle the problem of incremental fine-grained image retrieval. We regularize the updates of the model to simultaneously preserve performance on the original classes and adapt to the new classes. Importantly, to avoid repeated training, samples of the original classes are not used when learning the new classes. The classifier of the original network provides soft "labels" to transfer knowledge to the adaptive network via the distillation loss function [Hinton et al.(2015)Hinton, Vinyals, and Dean][Wu et al.(2019b)Wu, Chen, Wang, Ye, Liu, Guo, and Fu]. This loss focuses on pair-wise similarity but cannot adequately quantify the distance between two feature distributions. This limitation inspires us to adopt a regularization term based on Maximum Mean Discrepancy (MMD) [Gretton et al.(2012)Gretton, Sejdinovic, Strathmann, Balakrishnan, Pontil, Fukumizu, and Sriperumbudur] to minimize the discrepancy between the features derived from the original network and the adaptive network, respectively. Moreover, the cross-entropy loss and triplet loss are utilized to identify subtle differences among sub-categories.

In summary, our contributions are two-fold. First, our work extends FGIR to the context of incremental learning; to the best of our knowledge, this is the first work to study this problem. Second, we propose a deep network that includes a knowledge distillation loss and an MMD loss for incremental learning without using any samples from the original classes. It achieves significant improvements over previous incremental learning methods.

Figure 1: (a) Illustration of catastrophic forgetting for FGIR. Our method aims to maintain good performance on the original classes; inaccurate returned images are in red boxes and correct results are in blue boxes. (b) Framework of our method. The only inputs for the adaptive net are the images and labels of the new classes. The frozen net is first trained on the original classes and then copied to initialize the adaptive net.

2 Related Work

Incremental learning is the process of transferring learned knowledge from an original model to an incremental model. It has been studied in applications such as image classification [Li and Hoiem(2017)][Yao et al.(2019)Yao, Huang, Wu, Zhang, and Sun][Li et al.(2018)Li, Grandvalet, and Davoine][Zhou et al.(2019)Zhou, Mai, Zhang, Xu, Wu, and Davis], image generation [Zhai et al.(2019)Zhai, Chen, Tung, He, Nawhal, and Mori][Xiang et al.(2019)Xiang, Fu, Ji, and Huang], object detection [Shmelkov et al.(2017)Shmelkov, Schmid, and Alahari], hashing-based image retrieval [Wu et al.(2019a)Wu, Dai, Liu, Li, and Wang], and semantic segmentation [Michieli and Zanuttigh(2019)]. To overcome the so-called catastrophic forgetting, numerous methods have been proposed. For example, a subset of data (exemplars) of the original classes can be stored in an external memory, and forgetting is then avoided by replaying these exemplars [Hou et al.(2018)Hou, Pan, Change Loy, Wang, and Lin][Lopez-Paz and Ranzato(2017)][Wu et al.(2019b)Wu, Chen, Wang, Ye, Liu, Guo, and Fu]. Recently, GANs [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio] have been used to synthesize samples that follow the previous data distributions [Shin et al.(2017)Shin, Lee, Kim, and Kim][van de Ven and Tolias(2018)], which avoids the drawbacks of memory consumption and exemplar selection, but generating realistic images with complex semantics remains challenging. Alternatively, regularization methods constrain the objective functions or parameters of deep networks to preserve previously learned knowledge. The distillation loss function [Hinton et al.(2015)Hinton, Vinyals, and Dean] is used to transfer knowledge of old classes [Li and Hoiem(2017)][Wu et al.(2019b)Wu, Chen, Wang, Ye, Liu, Guo, and Fu]. An importance weight per parameter can also be estimated on the old classes and then used as a regularizer to penalize changes to essential parameters when training on new incoming classes [Kirkpatrick et al.(2017)Kirkpatrick, Pascanu, Rabinowitz, Veness, Desjardins, Rusu, Milan, Quan, Ramalho, Grabska-Barwinska, et al.].

3 Proposed Approach

Problem Formulation. Given a fine-grained dataset in which each sub-category contains a different number of images with ground-truth class labels, a deep network is trained to perform the retrieval task over these classes. In the incremental learning scenario, images from new classes are added sequentially or all at once. We take as input only the images from the new incoming classes to incrementally train the deep network. In this way, the network is updated efficiently, with no need to re-train on the original classes. Besides, image instances from the original classes are not always accessible due to privacy issues or memory limits. The aim is to continually train the network so that it maintains strong performance on all seen classes.

Overall Idea. As shown in Figure 1(b), our method includes two training stages. First, we train a network on the original classes using cross-entropy and triplet losses on the output logits and feature representations. After this network is well-trained, we make two copies: one freezes its parameters during incremental training (the frozen net), and the other adapts its parameters to the incremental classes (the adaptive net). The adaptive net is initialized with the parameters of the frozen net, including the feature-extraction layers and the classifier, but the number of neurons in its classifier is extended to cover the new classes, while the neurons for the previous classes are copied from the frozen net. To overcome catastrophic forgetting, we propose to integrate two regularization strategies based on knowledge distillation and maximum mean discrepancy, respectively. Given a query image from either the original classes or the newly added classes, we extract the features from the fully-connected layer for image retrieval. We introduce the details of our method below.
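To make the two-network setup concrete, the following is a minimal PyTorch-style sketch of how the adaptive net could be initialized from the frozen net. The attribute name `classifier`, the function name, and the sizes are illustrative assumptions, not the authors' actual implementation.

```python
import copy

import torch
import torch.nn as nn

def build_adaptive_net(frozen_net: nn.Module, feat_dim: int,
                       n_old: int, n_new: int) -> nn.Module:
    """Copy the frozen net and extend its classifier for the new classes."""
    adaptive_net = copy.deepcopy(frozen_net)      # same feature-extraction weights
    old_fc = adaptive_net.classifier              # assumed attribute name
    new_fc = nn.Linear(feat_dim, n_old + n_new)
    with torch.no_grad():
        new_fc.weight[:n_old] = old_fc.weight     # copy old-class neurons
        new_fc.bias[:n_old] = old_fc.bias
    adaptive_net.classifier = new_fc
    for p in frozen_net.parameters():             # the frozen copy is never updated
        p.requires_grad_(False)
    return adaptive_net
```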

3.1 Semantic Preserving Loss

First, we train the model with the standard cross-entropy loss on the output logits and the class labels. Note that we only use images from the new classes during incremental training, so classification is performed over the extended label space. The categorical cross-entropy loss is

$\mathcal{L}_{ce} = -\frac{1}{n_b}\sum_{i=1}^{n_b} \log \frac{\exp(o_{i,y_i})}{\sum_{c}\exp(o_{i,c})}$ (1)

where $o_i$ denotes the logits of image $i$ and $y_i$ its class label. To identify subtle differences among categories, we adopt the triplet loss by mining the hardest positive pair and hardest negative pair for each anchor based on the feature vectors $R$:

$\mathcal{L}_{tri} = \frac{1}{n_b}\sum_{i=1}^{n_b}\big[m + S_i^{an} - S_i^{ap}\big]_+$ (2)

where $S_i^{an}$ and $S_i^{ap}$, computed by matrix multiplication of the feature vectors (i.e. $S = RR^{\top}$), indicate the similarity of the hardest negative and hardest positive pairs, respectively, and $m$ is the margin parameter.
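A hedged PyTorch sketch of this semantic preserving loss is given below: cross-entropy on the logits plus a batch-hard triplet term on similarities computed by matrix multiplication of L2-normalized features. The margin value, the normalization, and the mining details are assumptions rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def semantic_preserving_loss(logits, feats, labels, margin=0.5):
    """Cross-entropy + batch-hard triplet loss on feature similarities."""
    ce = F.cross_entropy(logits, labels)

    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()                                  # S = R R^T
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)

    # hardest positive: least similar sample sharing the anchor's class
    pos_sim = sim.masked_fill(~same | eye, float('inf')).min(dim=1).values
    # hardest negative: most similar sample of a different class
    neg_sim = sim.masked_fill(same, float('-inf')).max(dim=1).values

    triplet = F.relu(margin + neg_sim - pos_sim).mean()
    return ce + triplet
```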

3.2 Knowledge Distillation Loss

For simplicity, we denote the logits of the adaptive network by $o$ and those of the frozen network by $\hat{o}$. The knowledge distillation loss [Hinton et al.(2015)Hinton, Vinyals, and Dean] is defined to regularize the activations of the output layer in both the old and the new model. To be specific, we constrain the first $n$ values of $o$ (those corresponding to the original classes) to be as close as possible to the logits $\hat{o}$ from the frozen network. Following the method in [Wu et al.(2019b)Wu, Chen, Wang, Ye, Liu, Guo, and Fu][Li and Hoiem(2017)], when new classes are added at once, we compute the knowledge distillation loss by

$\mathcal{L}_{kd} = -\frac{1}{n_b}\sum_{i=1}^{n_b}\sum_{k=1}^{n} \hat{p}_{i,k}\,\log p_{i,k}$ (3)

where $\hat{p}_{i,k} = \frac{\exp(\hat{o}_{i,k}/T)}{\sum_{j=1}^{n}\exp(\hat{o}_{i,j}/T)}$ and $p_{i,k} = \frac{\exp(o_{i,k}/T)}{\sum_{j=1}^{n}\exp(o_{i,j}/T)}$, and $T$ is a temperature factor that is normally set to 2 [Li and Hoiem(2017)]. $\hat{p}$ and $p$ refer to the probabilities produced by the modified softmax function in [Hinton et al.(2015)Hinton, Vinyals, and Dean], computed with the parameters of the frozen network and of the adaptive network, respectively, as shown in Figure 1(b). $n_b$ indicates the number of images from the new classes in a mini-batch, and $n$ denotes the number of original classes. Note that $n$ will be extended accordingly when more new classes are added.
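A rough sketch of Eq. 3, assuming the adaptive network's first n logits are matched against the frozen network's logits with temperature T=2; the function name and tensor layout are illustrative rather than taken from the authors' code.

```python
import torch.nn.functional as F

def distillation_loss(adaptive_logits, frozen_logits, n_old, T=2.0):
    """Cross-entropy between temperature-softened old-class probabilities."""
    # adaptive_logits: [n_b, n_old + n_new] from the adaptive net
    # frozen_logits:   [n_b, n_old] from the frozen net (treated as soft targets)
    p_hat = F.softmax(frozen_logits.detach() / T, dim=1)
    log_p = F.log_softmax(adaptive_logits[:, :n_old] / T, dim=1)
    return -(p_hat * log_p).sum(dim=1).mean()
```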

Figure 2: (a) The red and blue colors depict the feature distributions of two categories. Dashed lines indicate the distributions from the frozen network, and solid lines those from the adaptive network. Since the frozen network is copied as the initial model for the adaptive network, MMD=0 at the beginning. As training progresses, the adaptive network updates its output features and the MMD is expected to increase. (b) MMD for instance-to-instance similarity.

3.3 Maximum Mean Discrepancy Loss

The knowledge distillation loss focuses on constraining classification boundaries to mitigate the forgetting issue. However, for FGIR it is more important to reduce the difference between feature distributions. To this end, we adopt maximum mean discrepancy (MMD) [Gretton et al.(2012)Gretton, Sejdinovic, Strathmann, Balakrishnan, Pontil, Fukumizu, and Sriperumbudur] to capture the correlation between the feature distributions of the frozen and adaptive networks. MMD has been used to bridge source and target distributions, for example in domain adaptation [Long et al.(2016)Long, Zhu, Wang, and Jordan][Yan et al.(2017)Yan, Ding, Li, Wang, Xu, and Zuo]. However, our work is the first to impose MMD to regularize against forgetting for FGIR.

Given the $d$-dimensional features produced by the frozen network and the adaptive network, MMD measures the distance between the means of the two feature distributions after mapping them into a reproducing kernel Hilbert space (RKHS). In Figure 2, we illustrate how MMD mitigates the catastrophic forgetting issue. Note that, in the Hilbert space $\mathcal{H}$, the norm can be expressed through inner products [Chwialkowski et al.(2016)Chwialkowski, Strathmann, and Gretton][Gretton et al.(2012)Gretton, Sejdinovic, Strathmann, Balakrishnan, Pontil, Fukumizu, and Sriperumbudur]. The squared MMD distance is:

$\mathrm{MMD}^2(R, \hat{R}) = \Big\|\frac{1}{n_b}\sum_{i=1}^{n_b}\phi(r_i) - \frac{1}{n_b}\sum_{j=1}^{n_b}\phi(\hat{r}_j)\Big\|_{\mathcal{H}}^2$ (4)

where $n_b$ is the batch size and $\phi(\cdot)$ denotes the mapping function. However, it is hard to determine $\phi(\cdot)$ explicitly. In the RKHS, the kernel trick is used to replace the inner products in Eq. 4, i.e. $k(r,\hat{r}) = \langle\phi(r),\phi(\hat{r})\rangle$. Considering all the features in a mini-batch, $R = \{r_i\}_{i=1}^{n_b}$ from the adaptive network and $\hat{R} = \{\hat{r}_j\}_{j=1}^{n_b}$ from the frozen network, we define the MMD loss as:

$\mathcal{L}_{mmd} = \frac{1}{n_b^2}\sum_{i=1}^{n_b}\sum_{j=1}^{n_b} k(r_i, r_j) + \frac{1}{n_b^2}\sum_{i=1}^{n_b}\sum_{j=1}^{n_b} k(\hat{r}_i, \hat{r}_j) - \frac{2}{n_b^2}\sum_{i=1}^{n_b}\sum_{j=1}^{n_b} k(r_i, \hat{r}_j)$ (5)

where $k(r,\hat{r}) = \exp\big(-\|r-\hat{r}\|^2/2\sigma^2\big)$ is a Gaussian kernel and $\sigma$ denotes its variance.
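The MMD loss of Eq. 5 can be sketched as follows with a single Gaussian kernel; the bandwidth sigma (and whether several bandwidths are summed) is an assumption rather than the paper's exact choice.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) for all pairs of rows."""
    sq_dist = torch.cdist(x, y, p=2) ** 2
    return torch.exp(-sq_dist / (2 * sigma ** 2))

def mmd_loss(feats_adaptive, feats_frozen, sigma=1.0):
    """Biased estimate of squared MMD between the two feature batches."""
    feats_frozen = feats_frozen.detach()   # gradients only flow through the adaptive net
    k_rr = gaussian_kernel(feats_adaptive, feats_adaptive, sigma).mean()
    k_ff = gaussian_kernel(feats_frozen, feats_frozen, sigma).mean()
    k_rf = gaussian_kernel(feats_adaptive, feats_frozen, sigma).mean()
    return k_rr + k_ff - 2 * k_rf
```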

Discussion. The knowledge distillation loss focuses on constraining pair-wise similarity. In contrast, the MMD loss measures the distance between the full sets of feature vectors, as depicted in Figure 2, and thereby captures the distance between the feature distributions of the frozen net and the adaptive net. Thus, the MMD loss is better suited to quantify the correlation between the two models.

Overall, the objective function in our method for incremental FGIR learning is:

$\mathcal{L} = \mathcal{L}_{ce} + \mathcal{L}_{tri} + \lambda_1\,\mathcal{L}_{kd} + \lambda_2\,\mathcal{L}_{mmd}$ (6)

where $\lambda_1$ and $\lambda_2$ are trade-off hyper-parameters.

4 Experiments

4.1 Datasets and Experimental Settings

Datasets. We demonstrate our method on the Stanford-Dogs [Khosla et al.(2011)Khosla, Jayadevaprakash, Yao, and Li] and CUB-Birds [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie] datasets. For the former, we use the official train/test splits. When training incrementally, we split the first 60 sub-categories (in the order of official classes) as the original classes and images from the remaining 60 sub-categories are added at once or sequentially. For the latter, we choose 60% of images from each sub-category as training set and 40% as testing set. Afterwards, we split the first 100 sub-categories (in the order of official classes) as the original classes and the remaining 100 sub-categories as new classes. The details are shown in Table 1.

Datasets | Training set: Original cls. | New cls. | Total | Testing set: Original cls. | New cls. | Total
Stanford-Dogs | 6000/60 | 6000/60 | 12000/120 | 4651/60 | 3929/60 | 8580/120
CUB-Birds | 3504/100 | 3544/100 | 7048/200 | 2360/100 | 2380/100 | 4740/200
Table 1: Statistics (#Image/#Class) of the datasets used in our experiments.

Experimental Settings. We use Recall@K [Jegou et al.(2010)Jegou, Douze, and Schmid][Oh Song et al.(2016)Oh Song, Xiang, Jegelka, and Savarese] (K is the number of retrieved samples), mean Average Precision (mAP), precision-recall (PR) curves, and feature distribution visualizations for evaluation. We adopt the Google Inception network [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] to extract image features. During training, the parameters of Inception are updated using the Adam optimizer [Kingma and Ba(2014)], while the parameters of the fully-connected layers and the classifier are updated with a separate learning rate. We follow the sampling strategy in [Wang et al.(2019)Wang, Han, Huang, Dong, and Scott], and each incremental process is trained for 800 epochs. Following the practice in [Oh Song et al.(2016)Oh Song, Xiang, Jegelka, and Savarese][Wang et al.(2019)Wang, Han, Huang, Dong, and Scott], the 512-D outputs of the fully-connected layers are used for retrieval. The hyper-parameters in Eq. 6 (see Section 4.6 for a sensitivity analysis) and the margin in Eq. 2 are set empirically. Note that we mainly report results on the CUB-Birds dataset in the main paper; results on the Stanford-Dogs dataset are provided in the supplementary material.
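For reference, below is a minimal sketch of the Recall@K metric on the 512-D retrieval features, assuming cosine similarity on L2-normalized features and a gallery that is the test set itself with self-matches excluded; this mirrors common practice but is not necessarily the authors' exact evaluation code.

```python
import torch
import torch.nn.functional as F

def recall_at_k(feats, labels, ks=(1, 2, 4)):
    """Fraction of queries whose top-K neighbours contain a same-class image."""
    feats = F.normalize(feats, dim=1)
    sim = feats @ feats.t()
    sim.fill_diagonal_(float('-inf'))             # exclude the query itself
    knn_labels = labels[sim.topk(max(ks), dim=1).indices]
    hits = knn_labels == labels.unsqueeze(1)
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}
```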

4.2 One-step Incremental Learning for FGIR

We report the results when multiple classes are added at once. The process includes two stages. First, we use the cross-entropy and triplet losses to train the network on the original classes (100 classes for the CUB-Birds dataset), which serves as the initial model. Second, only images of the new classes are added at once to train the adaptive network. DIHN [Wu et al.(2019a)Wu, Dai, Liu, Li, and Wang] has explored incremental learning for hashing-based image retrieval; however, unlike our setting, it relies on access to old data as the query set to avoid forgetting. Since there is no previous work on incremental fine-grained image retrieval, we apply Learning without Forgetting (LwF) [Li and Hoiem(2017)], Elastic Weight Consolidation (EWC) [Kirkpatrick et al.(2017)Kirkpatrick, Pascanu, Rabinowitz, Veness, Desjardins, Rusu, Milan, Quan, Ramalho, Grabska-Barwinska, et al.], ALASSO [Park et al.(2019)Park, Hong, Han, and Lee], and the incremental learning method for semantic segmentation (dubbed L2 loss) [Michieli and Zanuttigh(2019)] for comparison. LwF, EWC, and ALASSO distill knowledge on the classifier and network parameters, which is insufficient for incremental FGIR. The L2 loss in [Michieli and Zanuttigh(2019)] is more similar to ours, as knowledge is distilled on both the classifier and the intermediate feature space. Note that the cross-entropy and triplet losses (i.e. the semantic preserving loss) are combined with these algorithms for fair comparison. Recall@K results are reported in Table 2.

Configurations Original classes New classes
Recall@K(%) K=1 K=2 K=4 K=1 K=2 K=4
(1-100) (initial model) 79.41 85.64 89.63 - - -
+(101-200) w feature extraction - - - 47.02 57.44 67.86
+(101-200) w fine-tuning 53.90 64.56 73.56 76.18 82.56 87.39
+(101-200) w LwF () 54.92 66.40 75.42 75.76 82.69 86.93
+(101-200) w ALASSO 56.91 66.65 76.57 72.48 79.50 85.67
+(101-200) w EWC 62.03 72.16 80.08 73.32 80.92 86.01
+(101-200) w L2 loss 66.48 75.68 82.67 77.44 83.78 88.07
+(101-200) w Our method 74.41 82.57 88.52 73.11 80.84 86.64
(1-200) (reference model) 77.33 85.08 89.03 76.64 83.53 89.12
Table 2: Recall@K (%) of incremental FGIR on the CUB-Birds dataset when new classes are added at once. The best performance on the original classes and on the new classes is in bold.

The "w feature extraction" row denotes directly extracting features on the new classes with the initial model, without re-training. The "w fine-tuning" row denotes training on the new classes with the cross-entropy and triplet losses but without the proposed regularization terms. Overall, the compared networks suffer from catastrophic forgetting and have lower performance on the original classes, whereas our method outperforms the others. On the new classes, three of the other algorithms outperform ours. For example, the "w L2 loss" method achieves a Recall@1 that is 4.33% higher than ours (77.44% vs. 73.11%). However, it suffers from significant performance degradation on the original classes, with Recall@1 dropping by 12.93% compared to the initial model (79.41% vs. 66.48%). For our method, the Recall@1 on the original classes is 74.41% (a drop of 5.00% from the initial model's 79.41%); the Recall@1 on the new classes is 73.11%, compared to 76.64% for the reference model.

Figure 3: (a)-(b) PR curves tested on the original classes and the new classes. (c) mAP results for different methods as training proceeds; we only show results tested on the original classes. (d) Training time comparison per epoch.

We report the PR curves and mAP results in Figures 3(a), 3(b), and 3(c), respectively. Overall, when tested on the new classes, all methods share similar trends. When tested on the original classes, our method performs better, although a gap to the reference performance remains. For the mAP results, the reference results are the same as in Table 2. We use the well-trained network at epoch 700 as the initial model, train it on the new classes until convergence, and test the mAP on the original classes. As the curves show, the network tends to degrade in accuracy on the original classes during incremental training.

Furthermore, we explore the influence of the number of new classes. Specifically, we choose 100 classes and 25 classes as the new categories. The results are reported in Table 4; we observe that adding more new classes leads to heavier forgetting. For example, when only 25 new classes are used, Recall@1 drops from 79.41% to 76.65%, compared with a drop from 79.41% to 74.41% when 100 new classes are added. Note that the reference models are trained jointly on all classes and tested on the original and new classes separately.

4.3 Multi-step Incremental Learning for FGIR

We split the new classes into 4 groups and add each group sequentially. The training procedure is as follows: the initial model is pre-trained on the original classes (1-100) and used to initialize training on the newly-added classes (101-125) until convergence, producing a new model (101-125). Afterwards, the newly-trained model (101-125) is used as the initial model to train on the next group of new classes (126-150), producing (101-125)(126-150). This process is repeated until all 4 groups of classes have been added.

We compare to three representative methods (we choose EWC rather than ALASSO since EWC obtains higher performance on the CUB-Birds dataset) and report the results in Table 3. The reference performances are obtained by jointly training on all the classes and then testing on each group (including the original classes). Overall, the model suffers from catastrophic forgetting during sequential training. However, our method shows minimal performance degradation. For instance, when all 4 groups have been added, the model (101-125)(126-150)(151-175)(176-200) is tested on the original classes (1-100). For the "L2 loss" algorithm, Recall@1 drops 79.41% → 67.37% → 58.14% → 53.86% → 45.85%, an average degradation of 8.39% per step. For our method, Recall@1 drops 79.41% → 76.65% → 73.77% → 70.47% → 66.40%, an average degradation of 3.25% per step, which indicates that our method significantly mitigates the forgetting problem. Furthermore, our method performs well on the new classes, with results closer to the reference performance. When the model (101-125)(126-150)(151-175)(176-200) is tested on the new classes (176-200), it achieves Recall@1=85.21%, Recall@2=89.92%, and Recall@4=93.28%, while the reference results are Recall@1=83.70%, Recall@2=90.25%, and Recall@4=93.78%.

Configurations Original (1-100) Added new (101-125) Added new (126-150) Added new (151-175) Added new (176-200)
Recall@K(%) K=1 K=2 K=4 K=1 K=2 K=4 K=1 K=2 K=4 K=1 K=2 K=4 K=1 K=2 K=4
(1-100) (initial model) 79.41 85.64 89.63 - - - - - - - - - - - -
LwF algorithm [Li and Hoiem(2017)] +(101-125) 57.50 68.05 75.68 79.59 85.88 88.95 - - - - - - - - -
+(101-125)(126-150) 42.46 54.03 64.66 62.59 74.83 82.31 70.17 79.67 86.00 - - - - - -
+(101-125)(126-150)(151-175) 40.21 51.57 61.27 47.79 63.10 75.68 56.83 67.33 78.17 81.57 87.27 90.79 - - -
+(101-125)(126-150)(151-175)(176-200) 33.31 44.75 55.38 49.83 63.78 75.85 48.00 60.33 72.33 67.17 75.88 82.91 83.70 89.41 92.94
EWC algorithm [Kirkpatrick et al.(2017)Kirkpatrick, Pascanu, Rabinowitz, Veness, Desjardins, Rusu, Milan, Quan, Ramalho, Grabska-Barwinska, et al.] +(101-125) 61.23 70.85 80.04 80.95 86.39 90.82 - - - - - - - - -
+(101-125)(126-150) 46.65 56.40 67.54 65.48 77.72 84.01 72.33 80.67 86.67 - - - - - -
+(101-125)(126-150)(151-175) 43.60 54.79 64.70 61.50 72.45 80.44 66.50 75.50 82.67 81.08 85.26 87.77 - - -
+(101-125)(126-150)(151-175)(176-200) 36.82 47.54 59.66 57.99 67.01 76.87 50.67 64.67 77.67 64.15 74.87 81.24 82.02 86.39 90.42
L2 loss algorithm [Michieli and Zanuttigh(2019)] +(101-125) 67.37 76.27 83.31 80.61 85.54 89.46 - - - - - - - - -
+(101-125)(126-150) 58.14 68.31 76.78 72.11 80.44 87.41 73.33 82.17 88.67 - - - - - -
+(101-125)(126-150)(151-175) 53.86 62.03 71.91 60.37 71.43 80.27 66.33 76.67 84.67 81.24 87.27 90.95 - - -
+(101-125)(126-150)(151-175)(176-200) 45.85 56.61 67.75 57.65 71.77 80.95 59.33 70.50 79.13 73.70 83.08 88.94 84.20 89.24 92.10
Our method +(101-125) 76.65 83.47 88.86 73.13 82.31 88.44 - - - - - - - - -
+(101-125)(126-150) 73.77 81.36 87.80 74.32 83.33 89.29 74.50 83.00 87.83 - - - - - -
+(101-125)(126-150)(151-175) 70.47 78.77 85.97 70.41 80.78 88.78 72.00 79.17 86.83 78.89 86.77 90.26 - - -
+(101-125)(126-150)(151-175)(176-200) 66.40 75.93 83.14 70.07 80.27 86.22 69.00 78.33 85.50 73.87 83.92 88.78 85.21 89.92 93.28
(1-200) (reference model) 77.33 85.08 89.03 76.87 84.86 90.48 73.00 80.00 87.67 83.25 88.94 92.29 83.70 90.25 93.78
Table 3: Recall@K (%) results on the CUB-Birds dataset when new classes are added sequentially. “Added new (101-125)” indicates the first 25 classes (101-125) are used as the first part to train the network.

4.4 Validation with Image Classification

We evaluate the effectiveness of our method on the CIFAR-100 dataset [Krizhevsky et al.(2009)Krizhevsky, Hinton, et al.], a popular benchmark for class-incremental learning in image classification. We split the 100 classes into a sequence of 5 tasks, each including 20 classes. In Table 5, the results indicate the average top-1 accuracy over the classes of the tasks seen so far; in the last column, the test set covers the classes from all five tasks. Note that the 20 classes in the first task (the second column) achieve the same performance for all methods, as no incremental learning has happened yet. We observe that our method outperforms the other methods across the tasks, which suggests it generalizes well to various applications. Notably, our improvement for image retrieval is more significant than that for image classification. The reason is that the proposed MMD loss is imposed on the feature representation, which largely benefits the metric learning underlying image retrieval. This also explains why our method focuses mainly on image retrieval.

Configurations Original classes New classes
Recall@K(%) K=1 K=2 K=4 K=1 K=2 K=4
(1-100) (initial model) 79.41 85.64 89.63 - - -
+(101-125) w Our method 76.65 83.47 88.86 73.13 82.31 88.44
+(101-200) w Our method 74.41 82.57 88.52 73.11 80.84 86.64
(1-125) (reference model) 77.84 83.94 87.80 79.25 85.54 91.96
(1-200) (reference model) 77.33 85.08 89.03 76.64 83.53 89.12
Table 4: Recall@K (%) on the CUB-Birds dataset when 25 or 100 new classes are added at once. Correspondingly, the new-class results are tested on the respective sets of newly added classes.
Method Number of new classes
20 40 60 80 100
L2 loss 77.3 47.5 40.5 36.6 32.8
EWC 77.3 60.5 50.9 43.3 39.5
LwF 77.3 62.5 52.9 46.2 41.0
Ours 77.3 64.6 55.8 49.2 43.3
Table 5: Average top-1 accuracy of incremental learning for image classification on CIFAR-100 dataset.

4.5 Training Time Comparison

We compare the average training time on the CUB-Birds dataset when 100 new classes are added at once. The results are shown in Figure 3(d). Note that all five methods start from the same initial model trained on the original 100 classes. The reference time is from joint training, where the initial model is trained on all classes; the other four methods incrementally learn the new classes only. We observe that our method reduces training time by about 50% compared to joint training, as expected. The EWC and ALASSO algorithms take more time than the reference because their additional gradient computations during back-propagation are time-consuming.

4.6 Components Analysis

Ablation Study. We conduct an ablation study on the CUB-Birds dataset when multiple classes are added at once. Note that the semantic preserving loss (cross-entropy and triplet loss) constitutes our baseline; we then analyze the remaining loss terms in Eq. 6 and observe the influence of the different components on the original and new classes. The results are shown in Table 6.

Configurations Original classes New classes
Recall@K(%) K=1 K=2 K=4 K=1 K=2 K=4
(1-100) (initial model) 79.41 85.64 89.63 - - -
+(101-200) w L_ce + L_tri 53.90 64.56 73.56 76.18 82.56 87.39
+(101-200) w L_ce + L_tri + L_kd 54.92 66.40 75.42 75.76 82.69 86.93
+(101-200) w L_ce + L_tri + L_mmd 73.36 81.25 87.43 73.40 81.60 86.64
+(101-200) w L_ce + L_tri + L_kd + L_mmd (full) 74.41 82.57 88.52 73.11 80.84 86.64
(1-200) (reference model) 77.33 85.08 89.03 76.64 83.53 89.12
Table 6: Ablation study of the different components of the loss function.

Hyper-parameter Sensitivity Analysis. We explore the sensitivity of the hyper-parameters in Eq. 6, which significantly affect the trade-off performance, on the CUB-Birds dataset. As shown in Table 7, the incrementally-trained model is more sensitive to one of the two hyper-parameters than to the other: when the less sensitive weight is fixed at 0.1 and the other increases from 0.1 to 1, the model performs better on the new classes while significantly retaining its previous performance, whereas this trend is not observed in the opposite case, where the model performs almost the same on the original and new classes. With a suitable setting of both hyper-parameters (underlined in Table 7), the incrementally-trained model achieves a better trade-off between the original and the new classes.

Configurations Original classes New classes
Recall@K (%) K=1 K=2 K=4 K=1 K=2 K=4
(1-100) (initial model) 79.41 85.64 89.63 - - -
+(101-200) () 56.53 66.31 75.59 77.52 83.82 88.15
+(101-200) () 73.31 82.00 87.14 72.77 80.92 87.14
+(101-200) () 79.58 85.76 90.47 49.50 61.51 70.59
+(101-200) () 55.81 67.25 75.59 77.02 83.91 87.90
+(101-200) () 74.41 82.57 88.52 73.11 80.84 86.64
+(101-200) () 79.41 86.31 90.51 48.82 61.09 71.05
(1-200) (reference model) 77.33 85.08 89.03 76.64 83.53 89.12
Table 7: Sensitivity analysis of the hyper-parameters in Eq. 6. The settings achieving a better trade-off are underlined.

5 Conclusion

In this paper, for the first time, we have explored incremental learning for fine-grained image retrieval in scenarios where the number of image categories increases and only images of the new classes are available. To overcome catastrophic forgetting, we adopted a distillation loss function to constrain the classifier of the original network and the extended classifier of the adaptive network. Moreover, we introduced a regularization function based on Maximum Mean Discrepancy (MMD) to minimize the discrepancy between the features of the newly added classes produced by the original and the adaptive network. Comprehensive experiments on two fine-grained datasets show that our method is effective and superior to existing methods. In the future, it would be promising to investigate incremental learning across different fine-grained datasets for image retrieval.

6 Acknowledgment

This work is supported by LIACS MediaLab at Leiden University, China Scholarship Council (CSC No. 201703170183), and the FWO project “Structure from Semantic” (grant No. G086617N). We would like to thank NVIDIA for the donation of GPU cards.

References

  • [Chwialkowski et al.(2016)Chwialkowski, Strathmann, and Gretton] Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. JMLR: Workshop and Conference Proceedings, 2016.
  • [Goodfellow et al.(2014)Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, and Bengio] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
  • [Gretton et al.(2012)Gretton, Sejdinovic, Strathmann, Balakrishnan, Pontil, Fukumizu, and Sriperumbudur] Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In NIPS, pages 1205–1213, 2012.
  • [Hinton et al.(2015)Hinton, Vinyals, and Dean] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [Hou et al.(2018)Hou, Pan, Change Loy, Wang, and Lin] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Lifelong learning via progressive distillation and retrospection. In ECCV, pages 437–452, 2018.
  • [Jegou et al.(2010)Jegou, Douze, and Schmid] Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. TPAMI, 33(1):117–128, 2010.
  • [Khosla et al.(2011)Khosla, Jayadevaprakash, Yao, and Li] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for fine-grained image categorization: Stanford dogs. In CVPR Workshop, volume 2, 2011.
  • [Kingma and Ba(2014)] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [Kirkpatrick et al.(2017)Kirkpatrick, Pascanu, Rabinowitz, Veness, Desjardins, Rusu, Milan, Quan, Ramalho, Grabska-Barwinska, et al.] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.
  • [Krizhevsky et al.(2009)Krizhevsky, Hinton, et al.] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • [Li et al.(2018)Li, Grandvalet, and Davoine] Xuhong Li, Yves Grandvalet, and Franck Davoine. Explicit inductive bias for transfer learning with convolutional networks. arXiv preprint arXiv:1802.01483, 2018.
  • [Li and Hoiem(2017)] Zhizhong Li and Derek Hoiem. Learning without forgetting. TPAMI, 40(12):2935–2947, 2017.
  • [Long et al.(2016)Long, Zhu, Wang, and Jordan] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Unsupervised domain adaptation with residual transfer networks. In NIPS, pages 136–144, 2016.
  • [Lopez-Paz and Ranzato(2017)] David Lopez-Paz and Marc’Aurelio Ranzato. Gradient episodic memory for continual learning. In NIPS, pages 6467–6476, 2017.
  • [Michieli and Zanuttigh(2019)] Umberto Michieli and Pietro Zanuttigh. Incremental learning techniques for semantic segmentation. In ICCV Workshops, 2019.
  • [Oh Song et al.(2016)Oh Song, Xiang, Jegelka, and Savarese] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, pages 4004–4012, 2016.
  • [Park et al.(2019)Park, Hong, Han, and Lee] Dongmin Park, Seokil Hong, Bohyung Han, and Kyoung Mu Lee. Continual learning by asymmetric loss approximation with single-side overestimation. In ICCV, pages 3335–3344, 2019.
  • [Shin et al.(2017)Shin, Lee, Kim, and Kim] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In NIPS, pages 2990–2999, 2017.
  • [Shmelkov et al.(2017)Shmelkov, Schmid, and Alahari] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In ICCV, pages 3400–3409, 2017.
  • [Szegedy et al.(2015)Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, and Rabinovich] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1–9, 2015.
  • [van de Ven and Tolias(2018)] Gido M van de Ven and Andreas S Tolias. Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635, 2018.
  • [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [Wang et al.(2019)Wang, Han, Huang, Dong, and Scott] Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. Multi-similarity loss with general pair weighting for deep metric learning. In CVPR, pages 5022–5030, 2019.
  • [Wu et al.(2019a)Wu, Dai, Liu, Li, and Wang] Dayan Wu, Qi Dai, Jing Liu, Bo Li, and Weiping Wang. Deep incremental hashing network for efficient image retrieval. In CVPR, pages 9069–9077, 2019a.
  • [Wu et al.(2019b)Wu, Chen, Wang, Ye, Liu, Guo, and Fu] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In CVPR, pages 374–382, 2019b.
  • [Xiang et al.(2019)Xiang, Fu, Ji, and Huang] Ye Xiang, Ying Fu, Pan Ji, and Hua Huang. Incremental learning using conditional adversarial networks. In ICCV, pages 6619–6628, 2019.
  • [Yan et al.(2017)Yan, Ding, Li, Wang, Xu, and Zuo] Hongliang Yan, Yukang Ding, Peihua Li, Qilong Wang, Yong Xu, and Wangmeng Zuo. Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation. In CVPR, pages 2272–2281, 2017.
  • [Yao et al.(2019)Yao, Huang, Wu, Zhang, and Sun] Xin Yao, Tianchi Huang, Chenglei Wu, Rui-Xiao Zhang, and Lifeng Sun. Adversarial feature alignment: Avoid catastrophic forgetting in incremental task lifelong learning. Neural computation, 31(11):2266–2291, 2019.
  • [Zhai et al.(2019)Zhai, Chen, Tung, He, Nawhal, and Mori] Mengyao Zhai, Lei Chen, Frederick Tung, Jiawei He, Megha Nawhal, and Greg Mori. Lifelong gan: Continual learning for conditional image generation. In ICCV, pages 2759–2768, 2019.
  • [Zhou et al.(2019)Zhou, Mai, Zhang, Xu, Wu, and Davis] Peng Zhou, Long Mai, Jianming Zhang, Ning Xu, Zuxuan Wu, and Larry S Davis. M2kd: Multi-model and multi-level knowledge distillation for incremental learning. arXiv preprint arXiv:1904.01769, 2019.

1 One-step Incremental Learning for FGIR

(1) Recall@K Evaluation on the Stanford-Dogs Dataset

The process includes two stages. First, we use the cross-entropy and triplet losses to train the network on the original classes (1-60), which serves as the initial model. Second, only images of the new classes are added at once to train the adaptive network. We observe similar trends as in the main paper: our method achieves good performance on the original and new classes with Recall@1=76.67% and Recall@1=81.88%, respectively. Compared to the initial model on the original classes, our method drops Recall@1 by 4.00% (80.67% → 76.67%).

Configurations Original classes New classes
Recall@K(%) K=1 K=2 K=4 K=1 K=2 K=4
(1-60) (initial model) 80.67 87.27 92.20 - - -
+(61-120) w feature extraction - - - 75.64 83.91 90.48
+(61-120) w fine-tuning 61.43 72.80 81.70 78.93 86.99 91.55
+(61-120) w LwF () 61.77 72.72 81.70 78.52 86.38 91.12
+(61-120) w EWC 62.24 73.30 82.82 78.90 86.59 91.19
+(61-120) w ALASSO 62.61 74.49 82.98 78.14 85.98 91.02
+(61-120) w L2 loss 72.07 81.44 87.47 82.21 88.75 92.52
+(61-120) w Our method 76.67 85.10 91.14 81.88 88.98 93.36
(1-120) (reference model) 79.29 86.86 91.61 82.57 88.75 93.13
Table 8: Recall@K (%) of incremental FGIR on the Stanford-Dogs dataset when new classes are added at once. The best performance is in bold.

(2) Precision-Recall Curves and mAP Results

We report the precision-recall curves and mAP results in Figure 4. These curves share similar trends with those from the CUB-Birds dataset. Overall, our method effectively mitigates the catastrophic forgetting issue on the original classes while achieving good performance on the new classes.

Figure 4: (a)-(b) Precision-recall curves tested on the original classes and the new classes on the Stanford-Dogs dataset. The larger the area under a curve, the better the method. (c) mAP results for different methods as training proceeds; we only show results tested on the original classes. Being closer to the reference curve (red) indicates less performance degradation, i.e., the method maintains its previous performance on the original classes of the Stanford-Dogs dataset.

(3) t-SNE Visualization for Feature Distribution

We visualize the feature distributions with and without the MMD loss in Figure 5, which demonstrates that the MMD loss reduces the distance between the distributions and is effective in mitigating the forgetting issue.

Figure 5: t-SNE visualization of the feature distributions of 6 categories. Circles indicate features from the reference model, which has the same distribution in both cases. Triangles denote features from the models trained with/without the MMD loss. (a): model trained without the MMD loss; (b): model trained with the MMD loss.

2 Influence of Added Multiple Classes

In the previous experiments, we add multiple classes (i.e. 100 new classes for the CUB-Birds dataset) at once for one-step incremental training. Here, we further explore the influence of the number of new classes on the Stanford-Dogs dataset, where we choose 60 new classes and 5 new classes for incremental learning.

The results are reported in Table 9. We observe that the two datasets share similar trends: adding more new classes leads to heavier forgetting. For the Stanford-Dogs dataset, when only 5 new classes are added, Recall@1 drops from 80.67% to 79.75%, compared with a drop from 80.67% to 76.67% when 60 new classes are added.

Configurations Original classes New classes
Recall@K(%) K=1 K=2 K=4 K=1 K=2 K=4
(1-60) (initial model) 80.67 87.27 92.20 - - -
 +(61-65) w Our full method 79.75 87.23 91.92 97.45 98.55 99.27
 +(61-120) w Our full method 76.67 85.10 91.14 81.88 88.98 93.36
(1-65) (reference model) 79.62 86.15 90.91 96.73 97.82 98.55
(1-120) (reference model) 79.29 86.86 91.61 82.57 88.75 93.13
Table 9: Recall@K (%) on the Stanford-Dogs dataset when 5 or 60 new classes are added at once. Correspondingly, the new-class results are tested on the respective sets of newly added classes.

3 Multi-step Incremental Learning for FGIR

We report the results on the Stanford-Dogs dataset in Table 10 when new classes are added sequentially. We observe similar trends as those for the CUB-Birds dataset. Compared to the other methods, the proposed method achieves strong retrieval performance on both the newly added classes and the original classes.

Configurations Original (1-60) Added new (61-75) Added new (76-90) Added new (91-105) Added new (106-120)
Recall@K(%) K=1 K=2 K=4 K=1 K=2 K=4 K=1 K=2 K=4 K=1 K=2 K=4 K=1 K=2 K=4
(1-60) (initial model) 80.67 87.27 92.20 - - - - - - - - - - - -
LwF algorithm [Li and Hoiem(2017)] +(61-75) 50.87 62.76 73.40 88.35 92.48 94.36 - - - - - - - - -
+(61-75)(76-90) 42.06 53.62 65.10 71.18 82.46 88.10 77.33 87.19 92.44 - - - - - -
+(61-75)(76-90)(91-105) 37.58 50.48 63.00 60.65 73.68 82.33 70.10 81.38 87.40 80.62 87.17 92.48 - - -
+(61-75)(76-90)(91-105)(106-120) 38.46 50.63 62.59 59.90 72.68 81.20 63.86 77.22 85.54 68.41 77.70 85.66 81.34 88.69 92.74
EWC algorithm [Kirkpatrick et al.(2017)Kirkpatrick, Pascanu, Rabinowitz, Veness, Desjardins, Rusu, Milan, Quan, Ramalho, Grabska-Barwinska, et al.] +(61-75) 55.84 67.64 77.57 89.10 92.61 94.36 - - - - - - - - -
+(61-75)(76-90) 45.32 58.29 68.85 79.82 85.21 90.35 81.38 88.72 93.32 - - - - - -
+(61-75)(76-90)(91-105) 37.60 49.71 61.88 67.04 79.04 85.71 67.47 79.52 88.39 81.33 86.99 91.15 - - -
+(61-75)(76-90)(91-105)(106-120) 34.08 45.60 58.40 63.53 75.19 83.71 63.42 77.66 86.31 70.00 79.12 85.84 81.99 87.96 92.10
L2 loss algorithm [Michieli and Zanuttigh(2019)] +(61-75) 65.30 75.83 83.51 90.85 94.74 95.61 - - - - - - - - -
+(61-75)(76-90) 55.97 67.36 77.04 84.46 90.73 92.73 80.94 89.38 93.54 - - - - - -
+(61-75)(76-90)(91-105) 50.38 62.87 73.64 72.68 82.21 88.85 75.68 84.67 91.57 83.72 90.00 93.81 - - -
+(61-75)(76-90)(91-105)(106-120) 46.01 58.74 69.64 67.79 78.07 85.71 72.51 84.45 90.03 74.87 83.98 89.82 86.21 91.36 94.39
Our method +(61-75) 76.07 84.88 90.11 91.85 95.36 96.87 - - - - - - - - -
+(61-75)(76-90) 70.67 80.48 87.87 89.10 93.11 95.99 84.23 89.92 93.43 - - - - - -
+(61-75)(76-90)(91-105) 67.75 79.17 86.45 86.09 91.98 95.49 81.60 90.25 93.76 84.25 89.03 93.45 - - -
+(61-75)(76-90)(91-105)(106-120) 65.47 76.52 85.08 83.21 89.35 93.73 79.19 87.84 93.32 82.83 89.20 94.42 87.13 92.10 94.39
(1-120) (reference model) 79.29 86.86 91.61 92.61 94.99 96.37 82.48 90.80 93.76 83.72 91.33 95.58 86.12 93.11 95.96
Table 10: Recall@K (%) results on the Stanford-Dogs dataset when new classes are added sequentially. "Added new (61-75)" indicates that the first 15 new classes (61-75) are used as the first incremental group to train the network.