Online Lifelong Generalized Zero-Shot Learning

03/19/2021 · Chandan Gautam, et al. · Indian Institute of Technology Madras; Indian Institute of Science

Methods proposed in the literature for zero-shot learning (ZSL) are typically suitable for offline learning and cannot continually learn from sequentially streaming data, which arrives in the form of tasks during training. Recently, a few attempts have been made to handle this issue and develop continual ZSL (CZSL) methods. However, these CZSL methods require clear task-boundary information between tasks during training, which is not always practically available. This paper proposes a task-free (i.e., task-agnostic) CZSL method that does not require any task information during continual learning. The proposed method employs a variational autoencoder (VAE) for performing ZSL and combines experience replay with knowledge distillation and regularization. Knowledge distillation is performed using the training samples' dark knowledge, which helps overcome the catastrophic forgetting issue. Task-free learning is enabled through a short-term memory. Finally, a classifier is trained on synthetic features generated in the latent space of the VAE. The experiments are conducted in a challenging and practical ZSL setup, generalized ZSL (GZSL), under two kinds of single-head continual learning settings: (i) a mild setting, where the task boundary is known during training but not during testing; and (ii) a strict setting, where the task boundary is known neither during training nor during testing. Experimental results on five benchmark datasets demonstrate the validity of the approach for CZSL.


1 Introduction

Conventional supervised machine learning (ML) and, more recently, deep learning algorithms have shown remarkable performance on various tasks (e.g., classification/recognition) in domains such as Computer Vision and Natural Language Processing [14, 20]. Despite this recent success, supervised ML/deep learning algorithms have two crucial limitations.

  1. Conventional machine learning models are restricted to the classes seen during training. When an example from a novel/unseen class appears in the test set, such models cannot categorize it appropriately.

  2. Conventional machine learning models cannot continually learn over time by accommodating new knowledge while retaining previously learned experiences, due to catastrophic forgetting.

The first limitation is addressed by zero-shot learning (ZSL), which classifies objects from classes that are not available at training time [47, 43, 23, 6]. The second limitation is addressed by continual/lifelong learning [19, 31, 7]. Traditional ZSL approaches struggle with sequential training, while continual learning approaches cannot handle unseen-class objects. A more desirable approach must therefore tackle sequential training and unseen-object problems simultaneously. This paper aims to leverage the advantages of both zero-shot learning and continual learning in a single framework.

Zero-shot learning classifies objects from classes that are not available at training time, and ZSL methods have attracted considerable attention in recent years owing to this ability. Earlier approaches for zero-shot learning are based on an embedding function between the visual and semantic spaces and are therefore biased towards the seen classes. Generative models address this issue by synthesizing visual features directly from semantic class descriptors. Such feature-generating methods provide a shortcut that casts the zero-shot learning problem into a conventional classification problem [41, 42, 34, 37, 36, 44].

The conventional training approach for ZSL trains the model on different classes (of the same dataset) under the assumption that all of the concerned data is available together during training. However, this is not always the case: new data can arrive in a stream for training. The samples in newly arrived streaming data might belong to existing or newly discovered classes, which need to be added to the dataset to update the ZSL model's knowledge. If we do not update the ZSL model on newly arrived data, its predictions may be wrong because its knowledge is stale. One way to handle this issue is to retrain the model from scratch, adding each batch of incoming samples to the dataset; however, this is tedious and computationally expensive. Moreover, it requires storing both previous and current data, which is not feasible due to memory constraints. Continual/incremental/lifelong learning addresses this concern by enabling the ZSL model to train sequentially, preserving accumulated (previous) knowledge while acquiring new knowledge [31, 24, 9]. Consequently, we propose a single model that can predict unseen-class objects and adapt to a new task without forgetting knowledge of previous tasks. This setting is known as continual zero-shot learning (CZSL): in contrast to conventional ZSL approaches, the model updates its knowledge continuously without forgetting previous information, making CZSL a broad generalization of zero-shot learning. Only a few CZSL methods have been proposed for this setup [39, 32], and both existing works require task-boundary information during training. We are the first to propose CZSL for a task-free setting, which is a strict single-head setting. LZSL [39] considers one whole dataset as a task and trains a separate attribute encoder-decoder for each task (dataset); this is a rather trivial setting that does not support class-incremental continual learning. More importantly, it requires task-boundary information during testing, so it is not suitable for a single-head setting.

The contributions of our proposed approach are summarized as follows:

  1. To the best of our knowledge, this is the first work to propose continual zero-shot learning for the task-free setting. The existing approaches [39, 32] are only compatible when the task boundary is available either during training or during both training and testing.

  2. This paper also provides a novel evaluation setting for CZSL, as the existing settings [39, 32] are not suitable for task-free learning.

  3. To enable a generative model for CZSL, the proposed approach employs experience replay with knowledge distillation. We do not use the student-teacher network strategy for knowledge distillation (KD); instead, we store the information required to perform KD in memory alongside the corresponding sample. This stored information is generally known as dark knowledge [15].

  4. To enable the model for task-free learning, this paper proposes two different short-term-memory-based task-free learning strategies, which are compatible with any ZSL method.

  5. Extensive experimental results validate the effectiveness of the proposed task-free continual ZSL method.

2 Related Work

2.1 Zero-shot learning

Recently, ZSL has attracted considerable attention due to its ability to handle unknown objects during testing. It transfers knowledge from seen classes to unseen classes via class attributes. Earlier approaches for ZSL were primarily discriminative or non-generative (i.e., embedding-based) in nature [1, 25, 40, 22, 46, 33, 45, 28, 17, 12, 2]. Non-generative methods learn an embedding from visual space to semantic space, or vice versa, via a linear compatibility function [1, 25, 40, 22]. These approaches assume that the class attributes of seen and unseen classes share many similarities. Because embedding-based approaches represent an image class as a single point, they are unable to capture intra-class variability.

The generalized ZSL (GZSL) problem is potentially more practical and challenging: the training and test classes are not disjoint. As ZSL models train only on seen-class examples, most embedding-based approaches show a strong bias towards the seen classes in GZSL. VAEs and GANs are the backbone generative models used to synthesize examples for several applications, and the ability to synthesize seen/unseen class examples from class attributes using a VAE/GAN is the basis of generative-model-based ZSL. Generative models have recently shown promising results for both ZSL and GZSL setups by synthesizing examples of unseen classes to train a supervised classifier [35, 11, 41, 16, 29, 47, 23, 42, 30]. A particular advantage of generative models is that they transform a ZSL problem into a typical supervised learning problem.

2.2 Continual Learning

Continual learning learns from streaming data with two objectives: avoiding catastrophic forgetting (preserving experience while learning new tasks) and avoiding intransigence (updating with new knowledge while transferring previous knowledge). Work on continual learning can be broadly categorized into two parts: (i) regularization-based methods [19, 27, 7] and (ii) replay-based methods [24, 31, 8, 13, 9]. Most continual learning work focuses on the multi-head setting [19, 7, 27]. In recent years, task-free learning has received a surge of interest [3, 4, 5, 18], as it is a more practical continual learning setting than the multi-head setting. This paper focuses on experience replay for task-free learning.

2.3 Continual Zero-shot Learning

In a traditional continual learning setting, training and testing data contain the same set of classes for classification. In the CZSL setting, however, training data also contains some unseen classes described only in textual form, and a classifier should be able to classify these unseen classes during testing. Most recently, CZSL [39, 32] has drawn increasing interest; to the best of our knowledge, only a handful of works address this problem. Chaudhry et al. [8] develop an average gradient episodic memory (A-GEM)-based CZSL method for a multi-head setting. A generative-model-based CZSL method [39] has also been developed for the multi-head setting. Most recently, Skorokhodov and Elhoseiny [32] develop an A-GEM-based CZSL method for a single-head setting; however, it is not a strict single-head setting, as task identity is required during training. This paper develops a CZSL method for a strict single-head setting where task identity is known neither during training nor during testing.

3 Problem Formulation

Formally, CZSL is divided among $T$ tasks, where each task consists of a training and a testing data stream. The training stream $D^{tr}_t$ for the $t^{th}$ task contains only information about seen classes; it consists of a feature vector $x_i^t$, a task identity $\tau_i^t$ (which provides the task boundary), a class label $y_i^t$, and class-attribute information $a_i^t$, where $i$ indexes the $n_t$ training samples of the $t^{th}$ task. In addition, the training stream contains class-attribute information $\{a_j^t\}_{j=1}^{u_t}$ for unseen classes, where $u_t$ denotes the number of unseen classes; this is the key information that enables the model to perform CZSL. Similarly, the testing stream consists of $\{(x_i^t, y_i^t)\}_{i=1}^{m_t}$, where $m_t$ is the total number of test samples for the $t^{th}$ task; the testing class label is used only for evaluation purposes. In this paper, we address the single-head setting for two possible situations: (i) task-agnostic prediction, where the task boundary is available during training but not during testing (i.e., $\tau$ is given in $D^{tr}$ but not in $D^{ts}$); and (ii) task-free learning, where the task boundary is available neither during training nor during testing (i.e., $\tau$ is given in neither $D^{tr}$ nor $D^{ts}$).

4 Task-Free Generalized Continual Zero-shot Learning: Tf-GCZSL

Figure 1: Proposed Tf-GCZSL framework. Tasks are shown only for ease of understanding; task information is not required during training or testing, as the reservoir sampling technique does not depend on it. For simplicity, the short-term memory is not shown in the diagram; it is described in Algorithms 1 and 2. In the proposed method, incoming samples enter the short-term memory before training. The framework remains task-free without this short-term memory, but performance improves significantly when it is used.

Earlier methods for CZSL use task boundaries either during training only (task-agnostic prediction setting) [32] or during both training and testing (multi-head setting) [8, 39]. In this section, a task-free continual learning method is proposed for the GZSL framework: task-free generalized continual zero-shot learning (Tf-GCZSL). Tf-GCZSL is developed based on the concept of experience replay (ER) with knowledge distillation (KD) and regularization. Here, knowledge distillation is performed using dark knowledge [15] rather than a teacher network (in continual learning, the teacher would be the immediately preceding network). Dark knowledge consists of the soft labels of training samples from previous tasks [15]; training on these soft labels helps alleviate catastrophic forgetting and regularizes the model for better performance. The proposed Tf-GCZSL method (shown in Figure 1) uses the latent-space information (i.e., the output of the encoder) as dark knowledge for performing KD. In Tf-GCZSL, a generative ZSL method, CADA [30], is used as the base method due to its strong ZSL performance in the current literature; however, any ZSL method can be used instead, as the proposed task-free GCZSL framework is generic. It minimizes three kinds of losses:

VAE loss: It minimizes the two standard VAE losses simultaneously for the feature and attribute encoder-decoder networks: the Kullback-Leibler (KL) divergence loss ($L_{KL}$) [21] and the reconstruction loss ($L_R$).

Distribution-alignment loss (DA): It minimizes the distance between the latent distributions of the feature and attribute encoders:

$$L_{DA} = \Big( \|\mu_x - \mu_a\|_2^2 + \big\|\Sigma_x^{1/2} - \Sigma_a^{1/2}\big\|_F^2 \Big)^{\frac{1}{2}} \qquad (1)$$

where $\mu_x$ and $\Sigma_x$ are the mean and variance estimated by the visual encoder $E_x$, respectively, $\mu_a$ and $\Sigma_a$ are the mean and variance estimated by the attribute encoder $E_a$, respectively, and $\|\cdot\|_F$ represents the Frobenius norm.

Cross-alignment loss (CA): It is the cross-reconstruction loss between the outputs of the feature and attribute decoders:

$$L_{CA} = |x - D_x(E_a(a))| + |a - D_a(E_x(x))| \qquad (2)$$

where $x$, $a$, $D_a$, and $D_x$ denote the visual feature vector, class-attribute vector, attribute decoder, and visual decoder, respectively.

The overall loss ($L_{CADA}$) of the generative method (i.e., CADA) for performing ZSL is as follows:

$$L_{CADA} = L_R + \beta L_{KL} + \gamma L_{CA} + \delta L_{DA} \qquad (3)$$

where $\beta$, $\gamma$, and $\delta$ are weighting factors.
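The three losses can be written compactly in code. The following PyTorch sketch follows Eqs. (1)-(3) under the usual diagonal-Gaussian VAE assumptions; the function names, tensor shapes (batch x latent dim, with log-variances), and the L1 reconstruction choice are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_rec, mu, logvar):
    # Reconstruction loss L_R plus KL divergence L_KL to the unit-Gaussian prior
    l_rec = F.l1_loss(x_rec, x, reduction="sum")
    l_kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return l_rec, l_kl

def da_loss(mu_x, logvar_x, mu_a, logvar_a):
    # Eq. (1): 2-Wasserstein distance between the two diagonal latent Gaussians
    mean_term = (mu_x - mu_a).pow(2).sum(dim=1)
    std_term = (logvar_x.mul(0.5).exp() - logvar_a.mul(0.5).exp()).pow(2).sum(dim=1)
    return (mean_term + std_term).sqrt().sum()

def ca_loss(x, a, x_from_a, a_from_x):
    # Eq. (2): cross-reconstruction through the opposite modality's decoder,
    # where x_from_a = D_x(E_a(a)) and a_from_x = D_a(E_x(x))
    return F.l1_loss(x_from_a, x, reduction="sum") + F.l1_loss(a_from_x, a, reduction="sum")

# Overall CADA-style objective, Eq. (3), with beta, gamma, delta as weights:
#   loss = l_rec + beta * l_kl + gamma * ca + delta * da
```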

4.1 Experience Replay and Task-free Strategies for CZSL

Experience replay (ER) is a well-known method for alleviating catastrophic forgetting in continual learning frameworks for general classification tasks. In this paper, we combine ER with knowledge distillation for task-free generalized continual zero-shot learning (Tf-GCZSL). In Tf-GCZSL, as in standard experience replay, previously seen samples are stored in a small replay memory $M$ and replayed later when training the model: the model is jointly trained on samples from the replay memory and samples from the current data stream, and this joint training helps it retain past knowledge. Two important issues must be addressed: (i) what to do when the replay memory is full, and (ii) how to train in the task-free setting. To handle a full memory, we employ a task-independent sampling technique, reservoir sampling, which does not require task boundaries in the streaming data. Reservoir sampling keeps each of the $n$ samples seen so far in a memory of capacity $|M|$ with probability $|M|/n$, where $n$ need not be known in advance (a minimal sketch is given below).

Further, the CZSL model needs to train in the task-free setting, where samples arrive one by one; training the model on a single sample heavily overfits it, and since the task boundary is unknown, it is difficult to optimize the model parameters and determine a stopping criterion. To handle this issue, we propose two task-free learning strategies based on a short-term memory $M_b$, which is distinct from the replay memory $M$ used in experience replay.

(i) Task-free CZSL strategy 1: When the replay memory reaches its maximum capacity for the first time, we pause the incoming data stream and optimize the model once. After this one-time optimization, the data stream resumes, and a very small short-term batch memory $M_b$ stores the incoming samples. This short-term memory is simply a very small batch that is passed to the model only once, without multiple epochs. After training on $M_b$, the memory is cleared to store samples from the incoming stream, and the process repeats until all samples from the training stream have been presented to the model. Since this strategy does not require multiple epochs, it learns from the samples very quickly. We refer to it as Tf-GCZSL$_{tf1}$. The pseudocode of this procedure is provided in Algorithm 1.

(ii) Task-free CZSL strategy 2: This strategy employs a larger short-term memory, i.e., $M_b$ is larger than in strategy 1. Incoming samples are stored in $M_b$ until it becomes full; the incoming stream is then paused and the model is trained for multiple epochs for better generalization. After training on $M_b$, the memory is cleared to store samples from the incoming stream, and the process repeats until the data stream is exhausted. Tf-GCZSL with this strategy is referred to as Tf-GCZSL$_{tf2}$. The pseudocode of this procedure is provided in Algorithm 2, and a Python sketch of both strategies follows Algorithm 2.

Input: Data stream $D$, short-term batch memory $M_b$, replay memory $M$
Output: Trained model

1: warmed_up ← False
2: for each sample in $D$ do
3:     Store the incoming sample in replay memory $M$ using the reservoir sampling strategy
4:     if $M$ is full and warmed_up is False then
5:         Pause the incoming data stream
6:         Train the CZSL model for multiple epochs on the data available in $M$
7:         warmed_up ← True
8:     else
9:         if warmed_up is True then
10:             Store the sample in the short-term batch memory $M_b$
11:             if $M_b$ is full then
12:                 Train the CZSL model on the incoming batch of samples in $M_b$ together with samples drawn from $M$, in a single pass without multiple epochs
13:                 Clear $M_b$
14:             else
15:                 Keep the model idle for a very short duration
Algorithm 1: Task-free Learning Strategy 1

Input: Data stream $D$, short-term memory $M_b$, replay memory $M$
Output: Trained model

1: for each sample in $D$ do
2:     Store the incoming sample in replay memory $M$ using the reservoir sampling strategy
3:     Store the incoming sample in the short-term memory $M_b$
4:     if $M_b$ is full then
5:         Train the CZSL model for multiple epochs on samples taken from $M$ and $M_b$ to optimize the parameters
6:         Clear $M_b$
7:     else
8:         Keep the model idle until $M_b$ is full
Algorithm 2: Task-free Learning Strategy 2
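The following Python sketch mirrors Algorithms 1 and 2, reusing reservoir_update from the sketch above. The memory sizes, epoch counts, and the injected train_fn(samples, epochs) callback are illustrative assumptions, not values from the paper.

```python
import random

def run_strategy1(stream, train_fn, replay_capacity=2500,
                  batch_capacity=64, warmup_epochs=25):
    # Algorithm 1: one-time optimization when the replay memory first fills,
    # then single-pass training on small short-term batches mixed with replay.
    replay, batch, n_seen, warmed_up = [], [], 0, False
    for sample in stream:
        n_seen += 1
        reservoir_update(replay, replay_capacity, sample, n_seen)
        if not warmed_up:
            if len(replay) == replay_capacity:
                train_fn(replay, epochs=warmup_epochs)   # one-time optimization
                warmed_up = True
        else:
            batch.append(sample)
            if len(batch) == batch_capacity:
                replayed = random.sample(replay, k=batch_capacity)
                train_fn(batch + replayed, epochs=1)     # single pass, no epochs
                batch.clear()

def run_strategy2(stream, train_fn, replay_capacity=2500,
                  st_capacity=512, epochs=25):
    # Algorithm 2: accumulate a larger short-term memory M_b, then train for
    # multiple epochs on M_b together with the replay memory M.
    replay, short_term, n_seen = [], [], 0
    for sample in stream:
        n_seen += 1
        reservoir_update(replay, replay_capacity, sample, n_seen)
        short_term.append(sample)
        if len(short_term) == st_capacity:
            train_fn(short_term + replay, epochs=epochs)
            short_term.clear()
```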

4.2 Knowledge Distillation (KD) Using Dark Knowledge for CZSL

In addition to ER, Tf-GCZSL performs KD with dark knowledge to further mitigate catastrophic forgetting. For this purpose, in addition to storing the training sample in $M$, the class-attribute information and the latent-space information corresponding to the sample (i.e., the mean $\mu$ and variance $\Sigma$ estimated by the encoder) are also stored. This latent-space information is the dark knowledge used to perform knowledge distillation, with loss $L_{KD}$:

$$L_{KD} = \|\mu - \hat{\mu}\|_2^2 + \big\|\Sigma - \hat{\Sigma}\big\|_F^2 \qquad (4)$$

where $\hat{\mu}$ and $\hat{\Sigma}$ are retrieved from the latent information stored for the corresponding sample in $M$. These values were estimated by the encoder at some point in the past along the learning trajectory of Tf-GCZSL. Note that the approach does not store or use any previously trained network as a teacher for performing knowledge distillation; instead, the knowledge required for distillation is stored in $M$ along with the sample information.
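A hedged PyTorch sketch of Eq. (4), assuming the stored and current latent statistics are batched tensors of log-variances; the exact penalty (squared error on means and variances) follows the reconstruction above.

```python
def kd_loss(mu_now, logvar_now, mu_stored, logvar_stored):
    # Eq. (4): penalize drift between the encoder's current latent statistics
    # (mu, Sigma) for a replayed sample and the statistics stored in M when
    # the sample was first seen (the dark knowledge); no teacher network used.
    var_now, var_old = logvar_now.exp(), logvar_stored.exp()
    return (mu_now - mu_stored).pow(2).sum() + (var_now - var_old).pow(2).sum()
```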

4.3 Overall Training Procedure of Tf-GCZSL

Overall, Tf-GCZSL minimizes the following loss during training:

$$L_{Tf\text{-}GCZSL} = L_R + \beta L_{KL} + \gamma L_{CA} + \delta L_{DA} + \lambda L_{KD} \qquad (5)$$

where $\beta$, $\gamma$, $\delta$, and $\lambda$ are weighting factors. For task-free CZSL, this loss is minimized while following one of the two task-free training strategies discussed above, i.e., either Tf-GCZSL$_{tf1}$ or Tf-GCZSL$_{tf2}$.

After training is completed, latent features are generated by sampling based on the means and variances estimated by the visual/attribute encoders: the visual encoder generates latent features for seen classes, and the attribute encoder generates them for unseen classes. Since these latent features are highly discriminative, a simple linear softmax classifier is trained on them. The proposed Tf-GCZSL method can also be used for task-agnostic prediction, where the task boundary is known at training time but not at testing time; in this case, Tf-GCZSL minimizes the same loss function without the task-free learning strategies.
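A minimal sketch of this final classification stage, assuming each encoder returns (mu, logvar) per input; the sampling count, latent size, and class count are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

def sample_latent(mu, logvar, n_per_input=50):
    # Reparameterization trick: draw n_per_input latent features per input
    # from N(mu, sigma^2), where sigma = exp(logvar / 2).
    mu = mu.repeat_interleave(n_per_input, dim=0)
    std = logvar.mul(0.5).exp().repeat_interleave(n_per_input, dim=0)
    return mu + std * torch.randn_like(std)

# Illustrative sizes (not from the paper): a CUB-like 200-class problem.
latent_dim, num_classes = 64, 200

# Seen classes: latents sampled via the visual encoder on real features.
# Unseen classes: latents sampled via the attribute encoder on class attributes.
classifier = nn.Linear(latent_dim, num_classes)  # softmax applied via cross-entropy
criterion = nn.CrossEntropyLoss()
```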

5 Performance Evaluation

CZSL methods are evaluated on five benchmark ZSL datasets: Caltech-UCSD Birds-200-2011 (CUB) [38], Attribute Pascal and Yahoo (aPY) [10], Animals with Attributes (AWA1 and AWA2) [10], and SUN [26]. These datasets are split and prepared for two kinds of CZSL settings, discussed in the next subsection.

5.1 Settings and Evaluation Metrics

In this section, we discuss the two settings used to evaluate the proposed model for continual zero-shot learning (CZSL): task-agnostic prediction and task-free learning.

| Method | CUB mSA | CUB mUA | CUB mH | aPY mSA | aPY mUA | aPY mH | AWA1 mSA | AWA1 mUA | AWA1 mH | AWA2 mSA | AWA2 mUA | AWA2 mH | SUN mSA | SUN mUA | SUN mH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Seq-Tf-GCZSL | 40.82 | 14.37 | 21.14 | 47.00 | 7.83 | 13.13 | 50.81 | 16.68 | 25.45 | 52.24 | 13.98 | 22.33 | 25.94 | 16.22 | 20.10 |
| AGEM+CZSL [8] | – | – | 17.30 | – | – | – | – | – | – | – | – | – | – | – | 9.60 |
| AGEM+CZSL+CN [32] | – | – | 23.80 | – | – | – | – | – | – | – | – | – | – | – | 14.20 |
| EWC+CZSL [schwarz2018progress] | – | – | 18.00 | – | – | – | – | – | – | – | – | – | – | – | 9.60 |
| EWC+CZSL+CN [32] | – | – | 23.30 | – | – | – | – | – | – | – | – | – | – | – | 14.30 |
| MAS+CZSL [aljundi2018memory] | – | – | 17.70 | – | – | – | – | – | – | – | – | – | – | – | 9.40 |
| MAS+CZSL+CN [32] | – | – | 23.80 | – | – | – | – | – | – | – | – | – | – | – | 14.20 |
| Tf-GCZSL (w/o KD) | 45.00 | 30.50 | 34.57 | 58.41 | 18.74 | 26.85 | 61.67 | 37.38 | 44.90 | 65.46 | 36.40 | 45.75 | 27.07 | 23.35 | 23.84 |
| Tf-GCZSL | 46.63 | 32.42 | 36.31 | 57.92 | 21.22 | 29.55 | 64.00 | 38.34 | 46.14 | 64.89 | 40.23 | 48.33 | 28.09 | 24.70 | 24.79 |

Table 1: CZSL results for task-agnostic prediction in terms of mean seen-class accuracy (mSA), mean unseen-class accuracy (mUA), and their mean harmonic mean (mH). The baselines from [32] report only mH on CUB and SUN.

| Method | CUB SA | CUB UA | CUB H | aPY SA | aPY UA | aPY H | AWA1 SA | AWA1 UA | AWA1 H | AWA2 SA | AWA2 UA | AWA2 H | SUN SA | SUN UA | SUN H |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Offline (upper bound) | 53.5 | 51.6 | 52.4 | 59.36 | 30.36 | 40.18 | 72.8 | 57.3 | 64.1 | 75.0 | 55.8 | 63.9 | 35.7 | 47.2 | 40.6 |
| Seq-Tf-GCZSL$_{tf1}$ | 38.73 | 21.42 | 27.58 | 23.51 | 1.68 | 3.14 | 20.37 | 2.12 | 3.85 | 14.64 | 9.43 | 11.47 | 27.89 | 18.88 | 22.52 |
| Seq-Tf-GCZSL$_{tf2}$ | 42.48 | 20.46 | 27.61 | 57.23 | 3.90 | 7.30 | 55.00 | 13.99 | 22.31 | 59.67 | 18.37 | 28.09 | 26.42 | 17.84 | 21.30 |
| Tf-GCZSL$_{tf1}$ (w/o KD) | 45.69 | 31.90 | 37.57 | 73.13 | 15.51 | 25.60 | 67.65 | 44.01 | 53.33 | 70.08 | 46.59 | 55.97 | 31.04 | 29.30 | 30.15 |
| Tf-GCZSL$_{tf2}$ (w/o KD) | 46.70 | 43.09 | 44.82 | 77.68 | 16.67 | 27.46 | 66.24 | 53.28 | 59.06 | 69.38 | 55.07 | 61.40 | 26.86 | 37.56 | 31.32 |
| Tf-GCZSL$_{tf1}$ | 45.08 | 34.02 | 38.78 | 72.55 | 14.33 | 23.94 | 65.64 | 51.46 | 57.69 | 68.42 | 42.74 | 52.62 | 31.00 | 29.37 | 30.16 |
| Tf-GCZSL$_{tf2}$ | 44.52 | 43.21 | 43.85 | 72.12 | 19.66 | 30.90 | 61.79 | 57.77 | 59.72 | 67.42 | 58.08 | 62.41 | 27.76 | 39.09 | 32.46 |

Table 2: Results for task-free CZSL in terms of seen-class accuracy (SA), unseen-class accuracy (UA), and their harmonic mean (H).

CZSL setting for task-agnostic prediction: For task-agnostic prediction, we use the setting of [32] for CZSL. In this setting, the data is first divided among $T$ tasks. When the model is training on the $t^{th}$ task, all classes up to the $t^{th}$ task are treated as seen classes, and all classes from the $(t+1)^{th}$ to the $T^{th}$ task are treated as unseen classes. The following evaluation metrics are used to evaluate the model after the $t^{th}$ task [32]:

  • Mean Seen-class Accuracy (mSA):

    $$\text{mSA} = \frac{1}{T} \sum_{t=1}^{T} \text{CAcc}\big(D^{ts}_{\le t},\, A_{\le t}\big) \qquad (6)$$

    where CAcc stands for per-class accuracy.

  • Mean Unseen-class Accuracy (mUA):

    $$\text{mUA} = \frac{1}{T-1} \sum_{t=1}^{T-1} \text{CAcc}\big(D^{ts}_{> t},\, A_{> t}\big) \qquad (7)$$

  • Mean Harmonic Accuracy (mH):

    $$\text{mH} = \frac{1}{T-1} \sum_{t=1}^{T-1} \text{HM}\big(\text{CAcc}(D^{ts}_{\le t}, A_{\le t}),\, \text{CAcc}(D^{ts}_{> t}, A_{> t})\big) \qquad (8)$$

    where HM stands for the harmonic mean.

Here, $D_{\le t}$ (with attributes $A_{\le t}$) denotes all train/test samples from the $1^{st}$ to the $t^{th}$ task, and $D_{> t}$ (with $A_{> t}$) denotes all train/test samples from the $(t+1)^{th}$ to the last task. A sketch of computing these metrics is given below.
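The following Python sketch computes mSA, mUA, and mH from per-task test sets, following Eqs. (6)-(8); per_class_acc is an assumed helper returning per-class accuracy (CAcc) of the model on a list of samples.

```python
def czsl_metrics(model, test_tasks, per_class_acc):
    # test_tasks[t] holds the test samples of task t+1.
    T = len(test_tasks)
    msa_terms, mua_terms, mh_terms = [], [], []
    for t in range(1, T + 1):
        seen = [s for task in test_tasks[:t] for s in task]     # tasks 1..t
        unseen = [s for task in test_tasks[t:] for s in task]   # tasks t+1..T
        s_acc = per_class_acc(model, seen)
        msa_terms.append(s_acc)
        if unseen:                                              # skipped at t = T
            u_acc = per_class_acc(model, unseen)
            mua_terms.append(u_acc)
            mh_terms.append(2 * s_acc * u_acc / (s_acc + u_acc + 1e-12))
    msa = sum(msa_terms) / T
    mua = sum(mua_terms) / (T - 1)
    mh = sum(mh_terms) / (T - 1)
    return msa, mua, mh
```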

CZSL setting for task-free learning: The setting of [32] described above is not suitable for task-free GCZSL, as seen and unseen classes are decided based on task boundaries; in task-free learning, task-boundary information is available neither during training nor during testing. We therefore propose a more challenging CZSL setting for task-free learning. Here, the data is split into multiple blocks based on the standard split of the ZSL benchmark datasets, where each block contains samples from distinct classes. We first train the model by streaming samples from these blocks one by one, then test it on the standard test data from the ZSL benchmark splits. Performance is evaluated using top-1 seen-class accuracy (SA), top-1 unseen-class accuracy (UA), and their harmonic mean (H).

5.2 Baseline Methods

Only a handful of works are available for CZSL. It has been developed for the multi-head setting [39] and for task-agnostic prediction [32]; however, no existing work addresses task-free learning. For task-agnostic prediction, the results are compared with the following methods:

  • Sequential training of the proposed method without any continual learning strategy: Seq-Tf-GCZSL.

  • The methods developed by Skorokhodov and Elhoseiny for CZSL, with and without class normalization [32]:

    1. Without class normalization: AGEM+CZSL [8], EWC+CZSL [schwarz2018progress], MAS+CZSL [aljundi2018memory].

    2. With class normalization: AGEM+CZSL+CN, EWC+CZSL+CN, MAS+CZSL+CN [32].

For task-free learning, the results are compared with offline training of the proposed method, in which all data is assumed to be available at once; this serves as an upper bound for the proposed method. We also perform sequential training of the proposed methods: Seq-Tf-GCZSL$_{tf1}$ and Seq-Tf-GCZSL$_{tf2}$.

Figure 2: Ablation study for task-agnostic prediction: (a) task-wise analysis; (b) impact of the size of the replay memory $M$, analyzed in terms of the number of samples per class; (c) impact of the latent dimensions.

Figure 3: Ablation study for task-free prediction: (a) impact of the size of the replay memory $M$, analyzed in terms of the number of samples per class; (b) impact of the latent dimensions; (c) impact of the size of $M_b$.

5.3 Results

In this section, results are presented for both cases of the single-head setting, i.e., task-agnostic prediction and task-free learning.

Task-agnostic prediction: Results for this setting are presented in Table 1 for the five CZSL datasets, including results for Tf-GCZSL without KD using dark knowledge, denoted Tf-GCZSL (w/o KD). Tf-GCZSL outperforms all existing CZSL methods from [32] by a clear margin on the CUB and SUN datasets in terms of mH, and, as expected, it also significantly outperforms Seq-Tf-GCZSL. Moreover, comparing Tf-GCZSL with Tf-GCZSL (w/o KD) shows that KD using dark knowledge together with experience replay improves performance and further alleviates catastrophic forgetting.

Task-free learning: Results for this setting are presented in Table 2 for all five CZSL datasets. We also provide results for the proposed methods in the purely sequential setting (Seq-Tf-GCZSL$_{tf1}$ and Seq-Tf-GCZSL$_{tf2}$) and without KD using dark knowledge (Tf-GCZSL$_{tf1}$ (w/o KD) and Tf-GCZSL$_{tf2}$ (w/o KD)). The proposed methods outperform the sequential variants on all datasets. Further, dark knowledge improves performance in most cases when the second strategy is used. Overall, the second task-free learning strategy (Tf-GCZSL$_{tf2}$) significantly outperforms the first (Tf-GCZSL$_{tf1}$) on CUB, aPY, AWA1, AWA2, and SUN in terms of H. Moreover, compared with the offline upper bound, Tf-GCZSL$_{tf2}$ performs well, trailing by only a few points of H on each dataset; this remaining gap is due to catastrophic forgetting.

Tf-GCZSL achieves this performance in both settings due to the joint training of current samples with samples from the replay memory $M$. Although replay repeatedly trains on already-seen samples, it does not lead to overfitting or other adverse effects; on the contrary, this joint training regularizes the model for better generalization.

5.4 Ablation Study on the CUB Dataset

In this section, an ablation study is presented for the CUB dataset.

For task-agnostic prediction: The ablation study is presented in terms of the following three factors:

  • Task-wise analysis: Task-wise analysis is depicted in Figure 2(a). The performance of Tf-GCZSL improves as the number of tasks increases, because the task-relatedness between seen- and unseen-class samples grows: with more tasks, more seen samples become available, which enriches the model's knowledge. While Tf-GCZSL improves as tasks accumulate, the performance of Seq-Tf-GCZSL decreases, as it does not use any continual learning strategy.

  • Analysis of replay memory: The performance of Tf-GCZSL is sensitive to the size of the replay memory, which we express as the number of samples per class: with $c$ classes and $s$ samples per class, the memory size is $c \times s$. As shown in Figure 2(b), the performance of Tf-GCZSL improves as the memory size increases, since a larger memory can keep more samples from past experience.

  • Analysis of latent dimensions: The latent dimensionality is another important factor for CZSL. The performance of Tf-GCZSL across different latent dimensions is depicted in Figure 2(c). The figure suggests that the latent dimensionality should be neither very small nor very large: if it is too small, the model cannot create sufficiently discriminative features, and if it is too large, the added degrees of freedom prevent compact features.

For task-free learning: As for task-agnostic prediction, the ablation study covers memory size and latent dimensions. Tf-GCZSL uses two kinds of memories, the replay memory $M$ and the short-term memory $M_b$; analyses of both are presented in Figures 3(a) and 3(c), respectively. For $M$, performance increases with memory size for the same reason discussed above. For $M_b$, the size has little impact, since these samples are jointly trained with the larger replay memory in the task-free setting, so performance is similar across sizes. The ablation over latent dimensions in Figure 3(b) shows a trend similar to Figure 2(c). For all three plots in Figure 3, we also plot the sequential results for comparison; as expected, the proposed methods outperform the sequential methods in all cases.

6 Conclusion

To the best of our knowledge, this is the first work to tackle continual zero-shot learning in the task-free setup. This paper has proposed general task-free continual zero-shot learning strategies using experience replay, knowledge distillation with dark knowledge, and short-term memory. Performance is evaluated on five benchmark datasets, and the results indicate that Tf-GCZSL achieves results close to the upper bound with minimal catastrophic forgetting. The framework is generic; therefore, other ZSL approaches can be used within it for task-free CZSL.

References

  • [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid (2016) Label-embedding for image classification. IEEE transactions on pattern analysis and machine intelligence 38 (7), pp. 1425–1438. Cited by: §2.1.
  • [2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele (2015) Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2927–2936. Cited by: §2.1.
  • [3] R. Aljundi, K. Kelchtermans, and T. Tuytelaars (2019) Task-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11254–11263. Cited by: §2.2.
  • [4] R. Aljundi, M. Lin, B. Goujaud, and Y. Bengio (2019) Gradient based sample selection for online continual learning. In Advances in neural information processing systems, pp. 11816–11825. Cited by: §2.2.
  • [5] P. Buzzega, M. Boschini, A. Porrello, D. Abati, and S. CALDERARA (2020) Dark experience for general continual learning: a strong, simple baseline. In Advances in Neural Information Processing Systems, Vol. 33, pp. 15920–15930. Cited by: §2.2.
  • [6] W. Chao, S. Changpinyo, B. Gong, and F. Sha (2016) An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, pp. 52–68. Cited by: §1.
  • [7] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr (2018) Riemannian walk for incremental learning: understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547. Cited by: §1, §2.2.
  • [8] A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny (2018) Efficient lifelong learning with a-gem. In International Conference on Learning Representations, Cited by: §2.2, §2.3, §4, item i, Table 1.
  • [9] A. Chaudhry, M. Rohrbach, M. Elhoseiny, T. Ajanthan, P. K. Dokania, P. H. Torr, and M. Ranzato (2019) On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486. Cited by: §1, §2.2.
  • [10] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth (2009) Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1785. Cited by: §5.
  • [11] R. Felix, V. B. Kumar, I. Reid, and G. Carneiro (2018) Multi-modal cycle-consistent generalized zero-shot learning. In ECCV, pp. 21–37. Cited by: §2.1.
  • [12] Y. Fu, T. M. Hospedales, T. Xiang, Z. Fu, and S. Gong (2014) Transductive multi-view embedding for zero-shot recognition and annotation. In European Conference on Computer Vision, pp. 584–599. Cited by: §2.1.
  • [13] T. L. Hayes, N. D. Cahill, and C. Kanan (2019) Memory efficient experience replay for streaming learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 9769–9776. Cited by: §2.2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385. Cited by: §1.
  • [15] G. Hinton, O. Vinyals, and J. Dean (2014) Dark knowledge. Presented as the keynote in BayLearn 2. Cited by: item 3, §4.
  • [16] H. Huang, C. Wang, P. S. Yu, and C. Wang (2019) Generative dual adversarial network for generalized zero-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 801–810. Cited by: §2.1.
  • [17] S. J. Hwang and L. Sigal (2014) A unified semantic embedding: relating taxonomies and attributes. In Advances in Neural Information Processing Systems, pp. 271–279. Cited by: §2.1.
  • [18] X. Jin, J. Du, and X. Ren (2020) Gradient based memory editing for task-free continual learning. In International Conference on Machine Learning Workshops, Cited by: §2.2.
  • [19] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §1, §2.2.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [21] S. Kullback and R. A. Leibler (1951) On information and sufficiency. The annals of mathematical statistics 22 (1), pp. 79–86. Cited by: §4.
  • [22] C. H. Lampert, H. Nickisch, and S. Harmeling (2013) Attribute-based classification for zero-shot visual object categorization. IEEE transactions on pattern analysis and machine intelligence 36 (3), pp. 453–465. Cited by: §2.1.
  • [23] J. Li, M. Jing, K. Lu, Z. Ding, L. Zhu, and Z. Huang (2019) Leveraging the invariant side of generative zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7402–7411. Cited by: §1, §2.1.
  • [24] D. Lopez-Paz and M. Ranzato (2017) Gradient episodic memory for continual learning. In Advances in neural information processing systems, pp. 6467–6476. Cited by: §1, §2.2.
  • [25] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean (2013) Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650. Cited by: §2.1.
  • [26] G. Patterson and J. Hays (2012) Sun attribute database: discovering, annotating, and recognizing scene attributes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2751–2758. Cited by: §5.
  • [27] S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) Icarl: incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001–2010. Cited by: §2.2.
  • [28] B. Romera-Paredes and P. Torr (2015) An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pp. 2152–2161. Cited by: §2.1.
  • [29] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8247–8255. Cited by: §2.1.
  • [30] E. Schonfeld, S. Ebrahimi, S. Sinha, T. Darrell, and Z. Akata (2019) Generalized zero-and few-shot learning via aligned variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8247–8255. Cited by: §2.1, §4.
  • [31] H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999. Cited by: §1, §1, §2.2.
  • [32] I. Skorokhodov and M. Elhoseiny (2021) Normalization matters in zero-shot learning. In International Conference on Learning Representations, Cited by: item 1, item 2, §1, §2.3, §4, 2nd item, §5.1, §5.1, §5.2, §5.3, Table 1.
  • [33] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng (2013) Zero-shot learning through cross-modal transfer. In Advances in neural information processing systems, pp. 935–943. Cited by: §2.1.
  • [34] K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. In Advances in neural information processing systems, pp. 3483–3491. Cited by: §1.
  • [35] V. K. Verma, G. Arora, A. Mishra, and P. Rai (2018) Generalized zero-shot learning via synthesized examples. CVPR. Cited by: §2.1.
  • [36] V. K. Verma, D. Brahma, and P. Rai (2020) Meta-learning for generalized zero-shot learning.. In AAAI, pp. 6062–6069. Cited by: §1.
  • [37] V. K. Verma and P. Rai (2017) A simple exponential family framework for zero-shot learning. In ECML-PKDD, pp. 792–808. Cited by: §1.
  • [38] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §5.
  • [39] K. Wei, C. Deng, and X. Yang (2020) Lifelong zero-shot learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, pp. 551–557. Cited by: item 1, item 2, §1, §2.3, §4, §5.2.
  • [40] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele (2016) Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 69–77. Cited by: §2.1.
  • [41] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata (2018) Feature generating networks for zero-shot learning. In CVPR, pp. 5542–5551. Cited by: §1, §2.1.
  • [42] Y. Xian, S. Sharma, B. Schiele, and Z. Akata (2019) F-vaegan-d2: a feature generating framework for any-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10275–10284. Cited by: §1, §2.1.
  • [43] G. Xie, L. Liu, X. Jin, F. Zhu, Z. Zhang, J. Qin, Y. Yao, and L. Shao (2019) Attentive region embedding network for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9384–9393. Cited by: §1.
  • [44] Y. Yu, Z. Ji, J. Han, and Z. Zhang (2020) Episode-based prototype generating network for zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14035–14044. Cited by: §1.
  • [45] L. Zhang, T. Xiang, and S. Gong (2017) Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030. Cited by: §2.1.
  • [46] Z. Zhang and V. Saligrama (2015) Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4166–4174. Cited by: §2.1.
  • [47] Y. Zhu, J. Xie, B. Liu, and A. Elgammal (2019) Learning feature-to-feature translator by alternating back-propagation for generative zero-shot learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9844–9854. Cited by: §1, §2.1.