
A Meta-Learning Framework for Generalized Zero-Shot Learning

by Vinay Kumar Verma, et al.

Learning to classify unseen class samples at test time is popularly referred to as zero-shot learning (ZSL). If test samples can be from training (seen) as well as unseen classes, the problem becomes more challenging due to the strong bias towards seen classes; this setting is generally known as generalized zero-shot learning (GZSL). Thanks to recent advances in generative models such as VAEs and GANs, sample-synthesis based approaches have gained considerable attention for solving this problem, since they can handle the class bias by synthesizing unseen class samples. However, these ZSL/GZSL models suffer from the following key limitations: (i) their training stage learns a class-conditioned generator using only seen class data and does not explicitly learn to generate unseen class samples; (ii) they do not learn a generic optimal parameter that easily generalizes to both seen and unseen class generation; and (iii) if only very few samples per seen class are available, these models tend to perform poorly. In this paper, we propose a meta-learning based generative model that naturally handles these limitations. The proposed model integrates model-agnostic meta-learning with a Wasserstein GAN (WGAN) to handle (i) and (iii), and uses a novel task distribution to handle (ii). Our proposed model yields significant improvements in the standard ZSL as well as the more challenging GZSL setting. In the ZSL setting, our model yields 4.5%, 6.0%, 9.8%, and 27.9% relative improvements over the current state-of-the-art on the CUB, AWA1, AWA2, and aPY datasets, respectively.





1 Introduction

With the ever-growing quantities, diversity, and complexity of real-world data, machine learning algorithms are increasingly faced with challenges that are not adequately addressed by traditional learning paradigms. For classification problems, one such challenging setting is where test time requires correctly labeling objects that could be from classes not present at training time. This setting is popularly known as Zero-Shot Learning (ZSL) and has drawn considerable interest recently (cmt; conse; verma2017simple; changpinyo2016synthesized; ESZSL2015; xian2018feature; vermageneralized; calibration_generalized; romera2015embarrassingly; Chen_2018_CVPR; mishra2017generative; bucher2017generating; Zero-ShotTaskTransfer; cycle-consistancy; saligram_cvpr19; cada-vae; kumar2019generative). ZSL algorithms typically rely on class-descriptions (e.g., human-provided class-attribute vectors, textual descriptions, or word2vec embeddings of class names). These class-descriptions/class-attributes are leveraged to transfer knowledge from seen classes (i.e., classes that were present at training time) to unseen classes (i.e., classes only encountered in test data).

Driven by the recent advances in generative modeling wgan; VAE; progressivekarras, there is a growing interest in generative models for ZSL. Broadly, these models learn to generate/synthesize “artificial” examples from unseen classes vermageneralized; cycle-consistancy; xian2018feature; mishra2017generative; lisgan; cada-vae; khare2019generative, conditioning on their class attributes, and learn a classifier using these synthesized examples. Despite the recent progress on such approaches, they still have some key limitations. Firstly, while the goal of these approaches is to generate unseen/novel class examples given the respective class attributes, the models are trained using data (inputs and the respective class attributes) from the seen classes only vermageneralized; xian2018feature; cycle-consistancy; lisgan and do not explicitly learn to generate unseen class samples during training. Consequently, these generative ZSL models show a large quality gap between the synthesized unseen class inputs and the actual unseen class inputs. To mimic the ZSL setting explicitly, we propose a novel variant of the standard meta-learning approach finn2017model; notably, in our variant, the meta-train and meta-validation classes are disjoint.

The second limitation of existing ZSL/GZSL models is that they do not learn an optimal parameter that easily generalizes to both seen and unseen class generation. Our meta-learning framework learns such an optimal parameter, which can quickly adapt to the novel (meta-test) classes with only a few gradient steps. prototypical; matching show that even with zero gradient steps (i.e., without fine-tuning), meta-learning generalizes to novel class samples/tasks. We build on this idea to train a class-conditioned WGAN for sample generation.

The third key limitation is that all existing ZSL methods rely on the availability of a significant number of labeled samples from each of the seen classes. This is itself a severe requirement and may not be met in practice (e.g., we may only have a handful, say 5 or 10, examples from each seen class). Note that this setting is somewhat similar to few-shot learning or meta-learning finn2017model, where the goal is to learn a classifier using very few examples per class, but there all test/unseen classes are assumed to have a few labeled samples at test time. In contrast, in ZSL we do not have any labeled training data from the unseen classes. Our meta-learning based formulation is naturally suited to this setting where only a few samples per class are available.

Our approach is primarily based on learning a generative model that can synthesize inputs from any class (seen/unseen), given the respective class-attributes/description. However, unlike recent works on synthesis based ZSL models lisgan; noisy_text; vermageneralized; xian2018feature; cycle-consistancy, we endow the generator with the capability to meta-learn using very few examples per seen class. To this end, we develop a meta-learning based conditional Wasserstein GAN wgan (conditioned on the class-attributes) which has generator and discriminator modules augmented with a classifier. Each module is associated with a meta-learning agent, to facilitate learning with a very small number of seen class inputs. Moreover, the novel task distribution helps mimic the ZSL behavior, i.e., the generative model learns to generate not only seen class samples but unseen class samples as well. We would also like to highlight that, although we develop this model with a focus on ZSL and generalized ZSL, our ideas can also be used for supervised few-shot generation few-shotimage, i.e., the problem of learning to generate data given very few examples of the data distribution. Our main contributions are summarized below:


  • We develop a novel meta-learning framework for ZSL and generalized ZSL by learning to synthesize examples from unseen classes, given the respective class-attributes. Notably, our framework is based on model-agnostic meta-learning finn2017model, which enables the synthesis of high-quality examples. This helps overcome the above-mentioned second and third limitations.

  • We propose a novel episodic training for meta-learning based ZSL where, in each episode, the training-set and validation-set classes are disjoint. This helps the model learn to generate novel class examples during training itself, which overcomes the above-mentioned first limitation.

2 Notation, Preliminaries, Problem Setup

A typical ZSL setting is as follows: we have $S$ seen classes with labelled training data and $U$ unseen classes with no labelled data present at training time. The test data can be either exclusively from the unseen classes (standard ZSL setting), or from both unseen and seen classes (generalized ZSL setting). We further assume that we are provided class-attribute vectors $\{a_c\}_{c=1}^{S+U}$ for the seen as well as unseen classes, where $a_c$ is the class-attribute vector of class $c$. These class-attribute vectors are leveraged by ZSL algorithms to transfer knowledge from seen to unseen classes.

Existing ZSL algorithms assume that we have access to a significant number of examples from each of the seen classes. This may, however, not be the case; in practice, we may have very few examples from each of the seen classes. We train our model in the $N$-way $K$-shot setting so that it can handle the ZSL problem even when only very few samples are available per seen class. We choose model-agnostic meta-learning (MAML) finn2017model as our meta-learner due to its generic nature; it only requires a differentiable model and can work with any loss function.

Figure 1: Left: Task episode for zero-shot meta-learning. For each task $\mathcal{T}_i$, the training-set $D_{tr}^{\mathcal{T}_i}$ and validation-set $D_{val}^{\mathcal{T}_i}$ classes are disjoint. In the ZSL setup, we have zero training examples from the meta-test set. Right: The proposed model architecture. $x$: ResNet-101 feature vector.

2.1 Model-Agnostic Meta-Learning (MAML)

MAML finn2017model is an optimization based meta-learning framework designed for few-shot learning. The model is designed in such a way that it can quickly adapt to a new task with the help of only a few training examples. MAML assumes that the model $f_\theta$ is parameterized by learnable parameters $\theta$, and that the loss function $\mathcal{L}$ is smooth in $\theta$ so that it can be used for gradient-descent based updates.

Let $p(\mathcal{T})$ be the distribution of tasks over the meta-train set. MAML defines the notion of a “task” such that a task $\mathcal{T}_i$ represents a set of labeled examples, and MAML splits this set further into a training set $D_{tr}^{\mathcal{T}_i}$ and a validation set $D_{val}^{\mathcal{T}_i}$, i.e., $\mathcal{T}_i = D_{tr}^{\mathcal{T}_i} \cup D_{val}^{\mathcal{T}_i}$. The split is done such that $D_{tr}^{\mathcal{T}_i}$ has very few examples per class. We follow the general notion of the $N$-way $K$-shot problem matching, i.e., $D_{tr}^{\mathcal{T}_i}$ contains $N$ classes with $K$ examples from each class. The model is trained using an episodic formulation where each round samples a batch of tasks and uses gradient-descent based updates (inner loop) for the parameters $\theta_i'$ specific to each task $\mathcal{T}_i$. The meta-update step (outer loop) then aggregates the information from all these “local” updates to update the overall model parameters $\theta$, using a gradient-descent update.

For task $\mathcal{T}_i$, its local parameters $\theta_i'$ are updated by starting with the global model parameters $\theta$ and using a few gradient-based updates computed on $D_{tr}^{\mathcal{T}_i}$ from task $\mathcal{T}_i$. Assuming a single step of update, this can be written as:

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta)$$

Here, $\alpha$ is the step-size hyper-parameter and $\mathcal{L}$ denotes the loss function being used. The overall global/meta objective defined over the multiple tasks sampled from the task distribution $p(\mathcal{T})$ can be defined as:

$$\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}) \qquad (1)$$

Assuming a gradient-descent based optimization of the global objective in Eq. 1, a single-step gradient-descent update for the global parameter $\theta$ can be written as: $\theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'})$.
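The inner/outer updates above can be sketched on a toy 1-D regression task family. This is a hedged, first-order sketch on a task family of our own choosing (not the paper's model), and it omits the second-order terms of the exact MAML meta-gradient:

```python
import numpy as np

def loss_and_grad(theta, x, y):
    """Squared error and its gradient for the scalar model f(x) = theta * x."""
    pred = theta * x
    loss = np.mean((pred - y) ** 2)
    grad = np.mean(2 * (pred - y) * x)
    return loss, grad

def maml_step(theta, tasks, alpha=0.05, beta=0.05):
    """One meta-update over a batch of tasks ((x_tr, y_tr), (x_val, y_val))."""
    meta_grad = 0.0
    for (x_tr, y_tr), (x_val, y_val) in tasks:
        _, g_tr = loss_and_grad(theta, x_tr, y_tr)
        theta_i = theta - alpha * g_tr                 # inner (task-specific) update
        # First-order approximation: use the validation gradient at theta_i
        # directly instead of differentiating through the inner update.
        _, g_val = loss_and_grad(theta_i, x_val, y_val)
        meta_grad += g_val
    return theta - beta * meta_grad / len(tasks)       # outer (meta) update

rng = np.random.default_rng(0)

def make_task(slope):
    """Toy task: fit y = slope * x from 5 train and 5 validation points."""
    x = rng.normal(size=10)
    return (x[:5], slope * x[:5]), (x[5:], slope * x[5:])

theta = 0.0
for _ in range(100):
    theta = maml_step(theta, [make_task(s) for s in rng.uniform(1.0, 3.0, size=4)])
# theta drifts toward the centre of the task family (slopes in [1, 3]),
# i.e., an initialization that adapts quickly to any sampled task.
```

The design point this illustrates is exactly the one MAML exploits: the meta-parameter is not optimal for any single task but sits where one inner gradient step reaches each task's optimum.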

2.2 Zero-Shot Meta-Learning (ZSML)

The meta-learning framework finn2017model; ravi2016optimization; matching; prototypical can quickly adapt to a new task with the help of only a few gradient steps. This quick adaptation is only possible if the model learns an optimal parameter in the parameter space that is unbiased towards the meta-train data; the learned parameters are close to the optimal parameters for both the meta-train and the meta-test data (as shown in Figure 2). It has already been demonstrated in matching; prototypical that, even without fine-tuning on the meta-test set (i.e., with zero gradient steps), meta-learning models show better or similar performance. Our ZSML approach is primarily motivated by this high-quality generalization ability of meta-learning towards seen/unseen class samples. We use the meta-learning framework to train a generative adversarial network, conditioned on class attributes, that can generate novel class samples. A key difference with MAML, introduced to mimic the ZSL behaviour, is that for each task $\mathcal{T}_i$ the classes of $D_{tr}^{\mathcal{T}_i}$ and $D_{val}^{\mathcal{T}_i}$ are disjoint, whereas in MAML both sets contain the same classes. The training is therefore done in such a way that $D_{tr}^{\mathcal{T}_i}$ acts as seen classes and $D_{val}^{\mathcal{T}_i}$ acts as unseen classes. The inner loop of the meta-learning optimizes the parameters using $D_{tr}^{\mathcal{T}_i}$, and the final parameters are updated using the loss on $D_{val}^{\mathcal{T}_i}$ (containing a disjoint set of classes). Hence, the model learns to generate novel classes during training itself. In the next section, we describe our complete model (shown in Figure 1 (right)).
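The proposed disjoint-class episodes can be sketched as follows; function and variable names here are our own, not the paper's:

```python
import random

def sample_zsml_task(data_by_class, n_way, k_shot, rng=random):
    """Sample one ZSML task: N-way K-shot train and validation splits whose
    class sets are disjoint (unlike a standard MAML episode).
    data_by_class: dict mapping class label -> list of examples (seen classes)."""
    classes = rng.sample(sorted(data_by_class), 2 * n_way)
    tr_classes, val_classes = classes[:n_way], classes[n_way:]
    d_tr = [(c, x) for c in tr_classes
            for x in rng.sample(data_by_class[c], k_shot)]
    d_val = [(c, x) for c in val_classes
             for x in rng.sample(data_by_class[c], k_shot)]
    return d_tr, d_val

# Toy data: 8 seen classes with 10 examples each.
data = {c: list(range(10)) for c in "ABCDEFGH"}
d_tr, d_val = sample_zsml_task(data, n_way=3, k_shot=2)
# The validation split plays the role of "unseen" classes within the episode.
assert {c for c, _ in d_tr}.isdisjoint({c for c, _ in d_val})
```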

Figure 2: Our proposed ZSML learns a generic optimal parameter such that it can generate the seen/unseen class samples with zero-gradient step update conditioned on the class attribute (at test time).

3 Meta-Learning based Adversarial Generation

The core of our ZSL model (Figure 1, right) is a generative adversarial network GAN, coupled with (1) an additional classifier module trained to correctly classify the examples produced by the generator module; and (2) meta-learners in each of the three modules (Generator ($G$), Discriminator ($D$), and Classifier ($C$)). We use the Wasserstein GAN wgan architecture due to its nice stability properties. We assume $\theta_d$, $\theta_g$, and $\theta_c$ to be the parameters of the Discriminator, Generator, and Classifier, respectively.

Our model follows episode-wise training akin to MAML (however, the $D_{tr}^{\mathcal{T}_i}$ and $D_{val}^{\mathcal{T}_i}$ classes are disjoint in our ZSL setting). There are three meta-learners in the model, one for each of $D$, $G$, and $C$, but $G$ and $C$ are optimized jointly. From now on, we denote the parameters of $G$ and $C$ as a joint set of parameters $\theta_{gc}$.

For each task $\mathcal{T}_i$ sampled from the task distribution $p(\mathcal{T})$, $D_{tr}^{\mathcal{T}_i}$ is used by the meta-learners (in the inner loop) of $D$, $G$, and $C$; $D_{val}^{\mathcal{T}_i}$ is used to calculate the loss over the most recent parameters of the meta-learners. In our model, the generator network $G$ takes as input a random noise vector $z$ concatenated with the class-attribute vector $a_c$ of a class $c$, and produces a sample that is similar to a real sample from that class. The discriminator network $D$ tries to distinguish such generated samples (concatenated with attributes) from actual samples (the real data distribution). In addition, the goal of the classifier network $C$ is to take the generated sample from $G$ and classify it into the original class $c \in \mathcal{Y}$, where $\mathcal{Y}$ is the set of both seen and unseen classes. The presence of the classifier module ensures that a generated sample has the same characteristics as real samples from its class.

We now describe the objective function of our model. Let $\mathcal{L}_D^{\mathcal{T}_i}$ denote the meta-learner objective of the discriminator $D$, and $\mathcal{L}_{GC}^{\mathcal{T}_i}$ denote the joint meta-learner objective of the generator $G$ and the classifier $C$, on task $\mathcal{T}_i$. The meta-learner objective for the discriminator can be defined as:

$$\mathcal{L}_D^{\mathcal{T}_i}(\theta_d) = \mathbb{E}_{x \sim D_{tr}^{\mathcal{T}_i}}\big[D(x, a;\, \theta_d)\big] - \mathbb{E}_{z \sim p(z)}\big[D(G(z, a;\, \theta_{gc}), a;\, \theta_d)\big] \qquad (2)$$
Here, $a$ is the attribute vector of the samples belonging to $\mathcal{T}_i$. The objective in Eq. 2 (to be maximized) essentially says that the discriminator should output large scores for real examples and small scores for generated examples. The meta-learner objective for the generator and classifier is given as:

$$\mathcal{L}_{GC}^{\mathcal{T}_i}(\theta_{gc}) = -\,\mathbb{E}_{z \sim p(z)}\big[D(G(z, a;\, \theta_{gc}), a;\, \theta_d)\big] + \mathbb{E}_{z \sim p(z)}\big[\ell_{cls}\big(C(G(z, a;\, \theta_{gc});\, \theta_c),\, y\big)\big] \qquad (3)$$
This objective (to be minimized) says that the generator's output $\hat{x} = G(z, a;\, \theta_{gc})$ should be such that $D(\hat{x}, a;\, \theta_d)$ is large, and the classifier's loss should be small (i.e., the classifier should predict the correct class $y$ for the generated example $\hat{x}$). Having defined the individual objectives, the overall objective for the meta-learner (inner loop) update for task $\mathcal{T}_i$ is:

$$\max_{\theta_d}\; \mathcal{L}_D^{\mathcal{T}_i}(\theta_d), \qquad \min_{\theta_{gc}}\; \mathcal{L}_{GC}^{\mathcal{T}_i}(\theta_{gc}) \qquad (4)$$
The meta-learner gradient-ascent update for the discriminator over a task $\mathcal{T}_i$ is:

$$\theta_d' = \theta_d + \alpha\, \nabla_{\theta_d}\, \mathcal{L}_D^{\mathcal{T}_i}(\theta_d) \qquad (5)$$
Similarly, the meta-learner gradient-descent update for the generator and classifier over $\mathcal{T}_i$ is:

$$\theta_{gc}' = \theta_{gc} - \alpha\, \nabla_{\theta_{gc}}\, \mathcal{L}_{GC}^{\mathcal{T}_i}(\theta_{gc}) \qquad (6)$$
The model parameters are learned by optimizing Eqs. 2 and 3 over a batch $\mathcal{B}$ of tasks sampled from the task distribution $p(\mathcal{T})$. The overall meta-objective for the discriminator and the generator is:

$$\theta_d' = \theta_d + \alpha\, \nabla_{\theta_d} \sum_{\mathcal{T}_i \in \mathcal{B}} \mathcal{L}_D^{\mathcal{T}_i}(\theta_d) \qquad (7)$$

$$\theta_{gc}' = \theta_{gc} - \alpha\, \nabla_{\theta_{gc}} \sum_{\mathcal{T}_i \in \mathcal{B}} \mathcal{L}_{GC}^{\mathcal{T}_i}(\theta_{gc}) \qquad (8)$$
Unlike standard MAML, the inner-loop updates (Eqs. 7 and 8) are optimized over the whole set of tasks in the batch instead of per task; we observe that this increases the stability of the WGAN training. Having meta-learned the discriminator parameters in the meta-training phase (performed using the seen class examples), the discriminator's objective function w.r.t. the unseen class examples in the validation meta-set is given by:

$$\mathcal{L}_D^{val}(\theta_d') = \sum_{\mathcal{T}_i \in \mathcal{B}} \mathbb{E}_{x \sim D_{val}^{\mathcal{T}_i}}\big[D(x, a;\, \theta_d')\big] - \mathbb{E}_{z \sim p(z)}\big[D(G(z, a;\, \theta_{gc}'), a;\, \theta_d')\big] \qquad (9)$$
Therefore, the final update of the discriminator for the batch is:

$$\theta_d \leftarrow \theta_d + \beta\, \nabla_{\theta_d}\, \mathcal{L}_D^{val}(\theta_d') \qquad (10)$$
Here, $\beta$ is the learning rate for the meta-step and $\theta_d'$ is the parameter provided by the inner loop of the discriminator's meta-learner. Likewise, the generator's and classifier's update w.r.t. the unseen class examples in $D_{val}^{\mathcal{T}_i}$ is given by:

$$\theta_{gc} \leftarrow \theta_{gc} - \beta\, \nabla_{\theta_{gc}} \sum_{\mathcal{T}_i \in \mathcal{B}} \mathcal{L}_{GC}^{\mathcal{T}_i, val}(\theta_{gc}') \qquad (11)$$
Eq. 11 performs the meta-optimization across the batch of tasks for the generator and classifier. Again, note that each task is partitioned into a training set $D_{tr}^{\mathcal{T}_i}$ and a validation set $D_{val}^{\mathcal{T}_i}$ such that their classes are disjoint. In contrast, traditional meta-learning finn2017model designed for few-shot learning assumes that the set of classes in $D_{tr}^{\mathcal{T}_i}$ is the same as the set of classes in $D_{val}^{\mathcal{T}_i}$. This disjoint setup is designed for zero-shot learning in order to mimic the problem setting, which requires predicting labels for examples from unseen classes not present at training time.
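As a concrete illustration, the per-task objectives in Eqs. 2 and 3 can be sketched with toy linear modules; in the actual model, $G$, $D$, and $C$ are MLPs over ResNet-101 features, and every name and dimension below is our own illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def G(z, a, Wg):
    """Generator: fake feature from noise z concatenated with attribute a."""
    return np.concatenate([z, a]) @ Wg

def D(x, a, Wd):
    """Critic: scalar score for a (feature, attribute) pair."""
    return float(np.concatenate([x, a]) @ Wd)

def C_logits(x, Wc):
    """Classifier: class logits for a feature vector."""
    return x @ Wc

def softmax_ce(logits, label):
    """Cross-entropy of a softmax over the logits against an integer label."""
    logits = logits - logits.max()
    p = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p[label] + 1e-12)

# Toy dimensions: noise, attribute, feature, number of classes.
dz, da, dx, n_cls = 4, 3, 5, 6
Wg = rng.normal(size=(dz + da, dx))
Wd = rng.normal(size=(dx + da,))
Wc = rng.normal(size=(dx, n_cls))

x_real, a, y = rng.normal(size=dx), rng.normal(size=da), 2
x_fake = G(rng.normal(size=dz), a, Wg)

# Eq. 2 (maximized w.r.t. the critic): separate real from generated samples.
L_D = D(x_real, a, Wd) - D(x_fake, a, Wd)
# Eq. 3 (minimized w.r.t. G and C): fool the critic and classify the fake.
L_GC = -D(x_fake, a, Wd) + softmax_ce(C_logits(x_fake, Wc), y)
```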

3.1 Example Generation and Zero-Shot Classification

After training the model, we can generate unseen class examples given the respective class-attribute vectors. The generation of novel class examples is done as:

$$\hat{x}_u = G(z, a_u;\, \theta_{gc})$$

Here, $a_u$ is the attribute vector of unseen class $u$ and $z \sim \mathcal{N}(0, I)$. Once we have generated samples from the unseen classes, we can train any classifier (e.g., SVM or softmax classifier) with these samples as labeled training data. In the generalized ZSL setting, we synthesize samples from both seen and unseen classes. We use the unseen class generated samples and actual/generated examples from seen classes to train a classifier whose label space is the union of seen and unseen classes. In practice, we found that using generated samples from seen classes (as opposed to actual samples) tends to perform better in the generalized ZSL setting. A justification for this is that the generated sample quality is then uniform across seen and unseen class examples.
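The generate-then-classify recipe above can be sketched as follows; the stand-in generator and the nearest-class-mean rule are illustrative assumptions of ours, not the paper's actual $G$ or its SVM/softmax classifier:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate(attr, n, noise_scale=0.1):
    """Stand-in for G(z, a_u): attribute plus small noise, for illustration."""
    z = rng.normal(scale=noise_scale, size=(n, len(attr)))
    return attr + z

# Two hypothetical unseen classes described only by attribute vectors.
unseen_attrs = {0: np.array([0.0, 0.0]), 1: np.array([5.0, 5.0])}

# Synthesize a labelled training set for the unseen classes.
X, y = [], []
for cls, a in unseen_attrs.items():
    X.append(generate(a, 50))
    y += [cls] * 50
X = np.vstack(X)
y = np.array(y)

# Any supervised classifier works here; a nearest-class-mean rule keeps
# the sketch dependency-free.
means = {c: X[y == c].mean(axis=0) for c in unseen_attrs}

def predict(x):
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

assert predict(np.array([0.1, -0.1])) == 0
assert predict(np.array([4.9, 5.2])) == 1
```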

Method | SUN | CUB | AWA1 | AWA2 | aPY
LATEM latem | 55.3 | 49.3 | 55.1 | 55.8 | 35.2
SJE SJE | 53.7 | 53.9 | 65.6 | 61.9 | 32.9
ESZSL ESZSL2015 | 54.5 | 53.9 | 58.2 | 58.6 | 38.3
SYNC changpinyo2016synthesized | 56.3 | 55.6 | 54.0 | 46.6 | 23.9
SAE SAE2017 | 40.3 | 33.3 | 53.0 | 54.1 | 8.3
DEM dem | 61.9 | 51.7 | 68.4 | 67.1 | 35.0
DCN calibration_generalized | 61.8 | 56.2 | 65.2 | – | 43.6
ZSKL zskl | 61.7 | 51.7 | 70.1 | 70.5 | 45.3
GFZSL verma2017simple | 62.6 | 49.2 | 69.4 | 67.0 | 38.4
SP-AEN Chen_2018_CVPR | – | 55.4 | – | 58.5 | 24.1
CVAE-ZSL mishra2017generative | 61.7 | 52.1 | 71.4 | 65.8 | –
cycle-UWGAN cycle-consistancy | 59.9 | 58.6 | 66.8 | – | –
f-CLSWGAN xian2018feature | 60.8 | 57.3 | 68.2 | – | –
SE-ZSL vermageneralized | 63.4 | 59.6 | 69.5 | 69.2 | –
VSE-S saligram_cvpr19 | – | 66.7 | – | 69.1 | 50.1
LisGAN lisgan | 61.7 | 58.8 | 70.6 | – | 43.1
ZSML Softmax (Ours) | 60.2 | 69.6 | 73.5 | 76.1 | 64.1
ZSML SVM (Ours) | 60.1 | 69.7 | 74.3 | 77.5 | 64.0
Table 1: ZSL results using the per-class mean metric xian2018zero. Non-generative models are listed at the top and generative models at the bottom; “–” denotes a result not reported. All compared methods use CNN-RNN features for the CUB dataset.

4 Related Work

Some of the earliest works on ZSL were based on directly or indirectly mapping the inputs to the class-attributes IAP; conse; cmt. At inference time, the learned mapping first projects the unseen data to the class-attribute space and then uses nearest-neighbour search to predict the class. In a similar vein, other approaches ESZSL2015; changpinyo2016synthesized also exploit the relationship between seen and unseen classes: they represent the parameters of each unseen class as a similarity-weighted combination of the parameters of seen classes. All of these models require plenty of data from the seen classes, and also do not work well in the GZSL setting vermageneralized; xian2018zero.

Because of its wide applicability and more realistic setting, the ZSL framework has also been applied in other domains, such as zero-shot task transfer zero-shottask_cvpr19, zero-shot sketch-based image retrieval shen2018zero; kumar2019generative, zero-shot knowledge distillation nayak2019zero, and zero-shot action recognition xu2015semantic; gan2015exploring; mishra2018generative; mandal2019out. These fields are outside the scope of this paper; we focus on zero-shot image classification, and in the rest of this section we therefore discuss the ZSL framework for image classification. Note, however, that our approach is generic and can easily be applied to the other ZSL domains as well.

Another prominent line of ZSL work focuses on learning a bilinear compatibility between the visual space and the semantic space of classes: akata2013label; frome2013devise; SJE; ESZSL2015; SAE2017 compute a linear/bilinear compatibility function, while sse embeds the inputs based on semantic similarity. Some ZSL methods assume that all the unseen class inputs are also present at training time, without class labels. These transductive methods have extra information about the unlabelled unseen class data, which leads to improved predictions compared to the inductive setting song2018transductive; xu2017transductive. Note that the transductive assumption is not very realistic, since test data is often not available at training time.

The generalized ZSL (GZSL) vermageneralized; chao2016empirical; xian2018zero; xian2018feature problem is arguably a more realistic and challenging problem wherein, unlike standard ZSL, test examples may come from the seen as well as the unseen classes. Most of the previous models that perform well on standard ZSL fail to handle the bias towards predicting seen classes. Recently, generative models Chen_2018_CVPR; xian2018feature; verma2017simple; guo2017synthesizing; wang2017zero have shown promising results in both the ZSL and GZSL setups. verma2017simple used a simple generative model based on the exponential-family framework, while guo2017synthesizing synthesized the classifier weights using class attributes. Recent generative approaches for ZSL are mostly based on the VAE VAE and the GAN GAN. Among these, vermageneralized; bucher2017generating; fvaegan are based on VAE architectures, while xian2018feature; Chen_2018_CVPR; lisgan; cycle-consistancy use adversarial sample generation conditioned on the class attributes. These VAE and GAN based approaches show very competitive results. A particular advantage of the generative approaches is that, using synthesized samples, the ZSL problem can be converted into a conventional supervised learning problem, which also handles the bias towards the seen classes. A meta-learning approach has previously been applied to ZSL hu2018correction, to correct the learned network. To the best of our knowledge, a MAML finn2017model based approach over a GAN has not been investigated yet. Our meta-learning based adversarial generation model shows significant performance improvements, whereas recent generative ZSL models have saturated.

5 Experiments and Results

We perform a comprehensive evaluation of our approach, ZSML (Zero-Shot Meta-Learning), by applying it to both the standard ZSL and generalized ZSL problems, and compare it with several state-of-the-art methods. We also perform several ablation studies to demonstrate/disentangle the benefits of the various aspects of our proposed approach. (We will provide the code and data upon publication.) We evaluate our approach on the following benchmark ZSL datasets: SUN xiao2010sun and CUB welinder2010caltech, which are fine-grained and considered very challenging; AWA1 lampert2009learning and AWA2 xian2018zero; and aPY farhadi2009describing, whose diverse classes make it very challenging. For the CUB dataset, we use CNN-RNN textual features reed2016learning as class attributes, similar to the approaches mentioned in Tables 1 and 2. Due to lack of space, the complete algorithm and details about the datasets are provided in the Supplementary Material. The generator and discriminator are 2-hidden-layer networks with hidden-layer sizes 2048 and 512, respectively. More details of the model architecture, experimental setup, and various hyperparameters are provided in the Supplementary Material.

Method | AWA1 (U / S / H) | CUB (U / S / H) | aPY (U / S / H) | AWA2 (U / S / H)
SJE SJE | 11.3 / 74.6 / 19.6 | 23.5 / 59.2 / 33.6 | 3.7 / 55.7 / 6.9 | 8.0 / 73.9 / 14.4
ESZSL ESZSL2015 | 6.6 / 75.6 / 12.1 | 12.6 / 63.8 / 21.0 | 2.4 / 70.1 / 4.6 | 5.9 / 77.8 / 11.0
SYNC changpinyo2016synthesized | 8.9 / 87.3 / 16.2 | 11.5 / 70.9 / 19.8 | 7.4 / 66.3 / 13.3 | 10.0 / 90.5 / 18.0
SAE SAE2017 | 8.8 / 18.0 / 11.8 | 7.8 / 54.0 / 13.6 | 0.4 / 80.9 / 0.9 | 1.1 / 82.2 / 2.2
LATEM latem | 7.3 / 71.7 / 13.3 | 15.2 / 57.3 / 24.0 | 0.1 / 73.0 / 0.2 | 11.5 / 77.3 / 20.0
DEVISE frome2013devise | 13.4 / 68.7 / 22.4 | 23.8 / 53.0 / 32.8 | 4.9 / 76.9 / 9.2 | 17.1 / 74.7 / 27.8
DEM dem | 32.8 / 84.7 / 47.3 | 19.6 / 57.9 / 29.2 | 11.1 / 75.1 / 19.4 | 30.5 / 86.4 / 45.1
ZSKL zskl | 18.3 / 79.3 / 29.8 | 21.6 / 52.8 / 30.6 | 10.5 / 76.2 / 18.5 | 18.9 / 82.7 / 30.8
DCN calibration_generalized | 25.5 / 84.2 / 39.1 | 28.4 / 60.7 / 38.7 | 14.2 / 75.0 / 23.9 | –
CVAE-ZSL mishra2017generative | – / – / 47.2 | – / – / 34.5 | – | – / – / 51.2
f-CLSWGAN xian2018feature | 61.4 / 57.9 / 59.6 | 43.7 / 57.7 / 49.7 | – | 57.9 / 61.4 / 59.6
SP-AEN Chen_2018_CVPR | – | 34.7 / 70.6 / 46.6 | 13.7 / 63.4 / 22.6 | 23.3 / 90.9 / 37.1
cycle-UWGAN cycle-consistancy | – | 47.9 / 59.3 / 53.0 | – | 59.6 / 63.4 / 59.8
SE-GZSL vermageneralized | 56.3 / 67.8 / 61.5 | 41.5 / 53.3 / 46.7 | – | 58.3 / 68.1 / 62.8
F-VAEGAN-D2 fvaegan | – | 48.4 / 60.1 / 53.6 | – | 57.6 / 70.6 / 63.5
VSE-S saligram_cvpr19 | – | 33.4 / 87.5 / 48.4 | 24.5 / 72.0 / 36.6 | 41.6 / 91.3 / 57.2
ZSML Softmax (Ours) | 57.4 / 71.1 / 63.5 | 60.0 / 52.1 / 55.7 | 36.3 / 46.6 / 40.9 | 58.9 / 74.6 / 65.8
Table 2: Accuracy for GZSL on the novel proposed split (PS). U and S denote the top-1 accuracy on unseen and seen classes, respectively, over all the classes; H stands for their harmonic mean. “–” denotes a result not reported (CVAE-ZSL reports H only). All compared methods use CNN-RNN features for the CUB dataset.

5.1 Zero-Shot Learning

For the ZSL setting, we first train our model on the seen class examples and then synthesize samples from the unseen classes. These synthesized samples are then used to train either a multi-class linear SVM or a softmax classifier, which is used to predict the classes of the test examples. We report results with both the softmax classifier and the linear SVM, but in principle any supervised classifier can be trained once we have generated the data. The average per-class accuracy is used as the standard evaluation metric xian2018zero, shown in Table 1, as it overcomes the bias towards particular classes that have more data. In the ZSL setting, our model yields 4.5%, 6.0%, 9.8%, and 27.9% relative improvements over the current state-of-the-art on the CUB, AWA1, AWA2, and aPY datasets, respectively, while on the SUN dataset it is very competitive with the previous state-of-the-art methods. The SUN dataset contains 717 fine-grained classes; GAN based generation is therefore highly prone to mode collapse, which we believe is the likely reason for the lower performance on SUN. We use the same network architecture and hyper-parameters for all the datasets; since SUN is fairly different from the other datasets, better hyper-parameter tuning for SUN may improve the result.
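For reference, the per-class mean metric used above averages the accuracy computed within each class, so large classes cannot dominate the score. A small sketch (function name ours):

```python
from collections import defaultdict

def per_class_mean_accuracy(y_true, y_pred):
    """Mean over classes of the within-class accuracy."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Class 0 has 8 samples, class 1 has 2: plain accuracy would be 0.9 here,
# but the per-class mean weights both classes equally.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]
print(per_class_mean_accuracy(y_true, y_pred))  # (1.0 + 0.5) / 2 = 0.75
```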

5.2 Generalized Zero-Shot Learning

Standard ZSL assumes that all test inputs are from the unseen classes. The more challenging generalized zero-shot learning (GZSL) setting relaxes this assumption: the test set can contain examples from the seen classes along with the unseen classes. We use the harmonic mean (HM) of the seen and unseen average per-class accuracies as the evaluation metric. HM xian2018zero is considered a better evaluation metric for GZSL since it penalizes the bias of predictions towards the seen classes.
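The harmonic-mean metric can be sketched as follows (function name ours); it stays low unless both the seen and unseen accuracies are high, which is exactly why it exposes seen-class bias:

```python
def harmonic_mean(acc_unseen, acc_seen):
    """Harmonic mean H of unseen (U) and seen (S) per-class accuracies."""
    if acc_unseen + acc_seen == 0:
        return 0.0
    return 2 * acc_unseen * acc_seen / (acc_unseen + acc_seen)

# A model biased towards seen classes scores poorly despite high seen accuracy:
print(round(harmonic_mean(0.10, 0.90), 3))  # 0.18
# Consistency check against the ZSML AWA1 row of Table 2 (U=57.4, S=71.1):
print(round(harmonic_mean(57.4, 71.1), 1))  # 63.5
```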

For the GZSL task, we evaluate our model on the popular benchmark datasets CUB, aPY, AWA1, and AWA2. The results for GZSL are shown in Table 2 and demonstrate that ZSML achieves significant improvements in the harmonic mean. In terms of HM based accuracies, ZSML yields 3.9%, 11.8%, 3.3%, and 3.6% relative improvements over the current state-of-the-art on the CUB, aPY, AWA1, and AWA2 datasets, respectively. Thus, ZSML works well not only in the standard ZSL setting but also in the GZSL setting. From Tables 1 and 2, it is clear that models that show good results in the ZSL setup often fail badly in the GZSL setup, whereas our model ZSML has consistently strong performance in both settings.

Method | N | AWA2 (U / S / H) | CUB (U / S / H)
cycle-UWGAN cycle-consistancy | 5 | 40.4 / 43.3 / 41.8 | 22.6 / 40.5 / 29.0
cycle-UWGAN cycle-consistancy | 10 | 45.5 / 50.9 / 48.0 | 25.5 / 42.1 / 32.5
f-CLSWGAN xian2018feature | 5 | 37.8 / 44.2 / 40.7 | 30.4 / 28.5 / 29.4
f-CLSWGAN xian2018feature | 10 | 40.5 / 55.9 / 46.9 | 34.7 / 38.9 / 36.6
GF-ZSL vermageneralized | 5 | 38.2 / 44.3 / 41.0 | 29.4 / 33.0 / 31.0
GF-ZSL vermageneralized | 10 | 41.4 / 45.1 / 43.1 | 35.6 / 43.5 / 39.1
Ours (ZSML) | 5 | 38.4 / 61.3 / 47.3 | 32.9 / 38.2 / 35.3
Ours (ZSML) | 10 | 47.8 / 59.6 / 53.1 | 42.7 / 45.1 / 43.9
Table 3: GZSL results using only five and ten examples per seen class to train the model.
Figure 3: Our ZSL results on the AWA2 and CUB datasets with the proposed zero-shot task distribution.

5.3 Ablation Study

In this section, we perform various ablation studies to assess the different aspects of our ZSML model on the CUB, aPY, and AWA2 datasets. We find that the proposed zero-shot meta-learning protocol (i.e., how we split the data of each task into meta-train and meta-validation sets) and the meta-learning based adversarial generation are the key contributors to the model's performance. We also conduct experiments where only a few examples (say 5 or 10) are available from each seen class.

Meta-learner vs Plain-learner: We found that meta-learning based training is the key component for boosting the model performance. A meta-learned model in the adversarial setting generates high-quality samples that are close to the real samples. In Figure 4, we compare our results with recent approaches Chen_2018_CVPR; cycle-consistancy; xian2018feature that use the Improved-WGAN improved-wgan for the same problem.

To show the effectiveness of the proposed model, we do not use any advanced GAN architecture; we simply rely on the WGAN architecture, with the plain WGAN associated with meta-learning agents. We found that the meta-learning framework is the key component for improving the performance: it improved the ZSL results on the CUB and AWA2 datasets over the current state-of-the-art, as shown in Figure 4 (Top). In the same setting, our approach without meta-learning performs notably worse on both the AWA2 and CUB datasets.

Few-Shot ZSL and Few-Shot GZSL: This is another significant result of the proposed approach. The meta-learning framework is specially designed for few-shot learning, so it is natural to ask how ZSL/GZSL performs when only a few examples are present from the seen classes. This is the most extreme case for any classification algorithm (i.e., only a few examples are present from each seen class, and at test time we encounter unseen/novel data). We perform experiments on the AWA2, CUB, and aPY datasets assuming that only 5 or 10 examples per seen class are available and that the unseen classes have no data at training time. In the 5-examples-per-class experiment, we create a new dataset (by sampling from the original dataset) that contains 5 examples per seen class (e.g., for the 40 seen classes of the AWA2 dataset, the new dataset contains only 200 samples). The model learns to generate unseen samples while seeing only 5 examples per seen class. Once the model is trained, we perform classification following the procedure in Subsection 3.1. We follow the same process for 10 examples per seen class.
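The few-shot dataset construction described above can be sketched as follows (function and variable names are ours):

```python
import random

def subsample_per_class(data_by_class, k, seed=0):
    """Keep only k randomly chosen examples per seen class; unseen classes
    simply do not appear in data_by_class (zero training examples)."""
    rng = random.Random(seed)
    return {c: rng.sample(xs, min(k, len(xs)))
            for c, xs in data_by_class.items()}

# Toy stand-in for AWA2's 40 seen classes: 5 examples each -> 200 samples.
data = {c: list(range(100)) for c in range(40)}
small = subsample_per_class(data, k=5)
assert sum(len(v) for v in small.values()) == 200
```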

Figure 4: Left: Comparison of ZSL results on the AWA2 and CUB datasets between recently proposed GAN-based models and our meta-learned GAN. Right: Our ZSL results when only a few samples (5 and 10) per seen class are used, while the competitors use all training samples.

As shown in Figure 4 (Bottom), with as few as 10 examples per class our approach outperforms other state-of-the-art methods on the CUB, aPY, and AWA2 datasets in the ZSL setting; even with only 5 examples per class our results are very competitive (while the competitor models use all examples in training). Also, as shown in Table 3, in the most challenging GZSL setting, using only 5 or 10 samples per class our results outperform the recent approaches by a significant margin.

Zero-Shot MAML Split vs Traditional MAML Split: We propose a novel task distribution for ZSML in which each task is partitioned into two sets whose classes are disjoint, whereas in the standard MAML setup both sets share the same classes. This disjoint class partition helps the model learn to generate novel classes during training itself. The ablation over the MAML and ZSML task distributions is shown in Figure 3. The proposed per-episode training/validation split performs significantly better than the traditional MAML split: using the novel ZSML split, the ZSL results improve by 1.7% and 2.4% on the AWA2 and CUB datasets, respectively.
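The disjoint-class episode construction can be sketched as below. This is an illustrative sampler, not the paper's code; the set sizes `n_support`/`n_query` are hypothetical parameters, not the paper's actual values.

```python
import numpy as np

def sample_zsml_task(classes, n_support, n_query, rng):
    """Sample one ZSML episode: the support (inner-loop) classes and the
    query (outer-loop) classes are forced to be disjoint, unlike the
    standard MAML split where both splits share the same classes."""
    chosen = rng.choice(np.asarray(classes), size=n_support + n_query,
                        replace=False)
    return set(chosen[:n_support].tolist()), set(chosen[n_support:].tolist())

rng = np.random.default_rng(0)
seen_classes = list(range(40))   # e.g. the 40 seen classes of AWA2
support, query = sample_zsml_task(seen_classes, n_support=5, n_query=5, rng=rng)
```

Because the query classes never appear in the support split, the outer-loop loss is always computed on classes that were "novel" for the inner loop, mimicking the ZSL test condition.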

Which Aspects Benefit More from Meta-Adversarial Learning? In adversarial learning, the sample quality depends on how powerful the discriminator and generator are. The optimal discriminator minimizes the JS-divergence between the generated and the original samples GAN. The meta-learner associated with the discriminator or generator makes each more powerful by enhancing its learning capability: the optimal discriminator provides strong feedback to the generator, and the generator continuously improves its generation capability. We observe that if we remove the meta-learner from the discriminator, accuracy drops by 5.8% and 8.6% compared to our full model on the CUB and AWA2 datasets, respectively. This significant drop occurs because the discriminator is no longer optimal and provides poor feedback to the generator: even with a much more powerful generator, the generator is unable to learn from the weak discriminator's feedback. Similarly, if we remove the meta-learner from the generator, we again observe a significant accuracy drop (2.2% and 7.9% on CUB and AWA2, respectively); even though the discriminator provides strong feedback, the generator is not powerful enough to counter it. If we remove the meta-learning agents from both the generator and the discriminator, the model reduces to a plain adversarial network. The ablation results are shown in Figure 4. More ablations are provided in the supplementary material.

6 Conclusion

In this work, we identify and address three key limitations that restrict the performance of recent generative models for ZSL/GZSL. We observe that a meta-learning based approach can naturally overcome these limitations in a principled manner. We propose a novel framework for ZSL and GZSL based on meta-learning over a conditional generative model (WGAN). We also propose a novel zero-shot task distribution for the meta-learning model to mimic ZSL behaviour. We have conducted extensive experiments on benchmark ZSL datasets. In the few-shot as well as the standard GZSL setting, the proposed model outperforms the state-of-the-art methods by a significant margin. Our ablation study shows that the proposed meta-learning framework and zero-shot task distribution are the key components behind the performance improvement. Finally, although our focus here has been on ZSL and generalized ZSL, our meta-learning based adversarial generation model can also be useful for distribution learning and generation tasks reed2017few; hewitt2018variational. For GZSL, we achieve state-of-the-art results on all the standard datasets, and for ZSL, we surpass the state-of-the-art by a significant margin on the aPY, CUB, AWA1, and AWA2 datasets.



Appendix A Datasets

This section describes the benchmark datasets used to evaluate the model in the ZSL and GZSL setups. We evaluate the proposed method on five benchmark datasets. SUN and CUB are fine-grained datasets in which each class has limited data, which makes them very challenging. AWA1 and AWA2 are animal datasets with diverse backgrounds. aPY is a small-scale dataset, but the domain gap between its seen and unseen classes makes it very challenging. The objective of the proposed approach is not to generate seen/unseen images but ResNet-101 feature vectors; like other recent models [vermageneralized, xian2018feature, xian2018zero, cycle-consistancy], we aim to synthesize high-quality image features. We use ResNet-101 image features, as do the other competitive approaches. The ResNet-101 model is pretrained on the ImageNet [imagenet2015] dataset, and the features for all datasets are extracted with this pretrained model without any further fine-tuning. The seen and unseen class split is done such that no test/unseen classes are present in the ImageNet dataset; otherwise, the ZSL setting is violated [xian2018zero]. The complete datasets with train, validation, and test splits are provided by [xian2018zero]. We use the same setup as the other approaches (listed in Table 1 and Table 3 of the main paper). Table 4 below summarizes the statistics of all the datasets.

Dataset Attribute/Dim #Image Seen/Unseen Class
AWA1 A/85 30475 40/10
AWA2 A/85 37322 40/10
CUB CR/1024 11788 150/50
SUN A/102 14340 645/72
aPY A/64 15339 20/12
Table 4: Datasets used in our experiments and their statistics. CR: CNN-RNN [reed2016learning]

a.1 Animals with Attributes (AWA)

The AWA1 dataset [lampert2009learning] contains 30,475 images in total, spanning 50 classes of animals captured against diverse backgrounds, which makes the dataset very challenging. In the ZSL setting, 40 classes are used for training and validation, and the remaining 10 classes are used for testing. The dataset also provides an 85-dimensional, human-annotated attribute vector per class. Two types of attribute vectors are available, binary and continuous; the continuous attributes are more informative and are the ones used by other models as well. The raw images of AWA1 are not distributed, only the features; therefore an updated version, AWA2, was released with the raw images included. In our experiments, we evaluate the model on both datasets and perform the ablations on AWA2. For both datasets we use ResNet-101 features pretrained on the ImageNet dataset. As in other approaches, no fine-tuning is performed on the seen classes, and the split ensures that no unseen class belongs to the ImageNet classes.

a.2 Caltech UCSD Birds 200 (CUB)

The CUB [welinder2010caltech] dataset comprises 11,788 images of birds belonging to 200 classes. In the ZSL setting, 150 classes are used for training and validation, while 50 classes are used for testing. CUB is a fine-grained dataset in which each class has roughly 60 samples; some classes are very similar even for humans, making it very challenging to identify the birds correctly. Since collecting many samples per class is difficult, each class contains only a limited number of samples, and training a deep model with only about 60 samples per class is hard; in this regime, meta-learning models have an advantage and show significant improvement. Each CUB class is also provided with a 312-dimensional human-annotated attribute vector. In addition, [reed2016learning] provides a textual description for each image and, using a character-based CNN-RNN, a 1024-dimensional embedding of that description. Recent works use the CNN-RNN feature as the class attribute, since it gives superior performance compared to the 312-dimensional attribute vector. Without any fine-tuning on CUB, the ResNet-101 model pretrained on ImageNet is used for feature extraction of the seen and unseen classes.

a.3 SUN Scene Recognition (SUN)

The SUN dataset [xiao2010sun] consists of 717 scenes or classes. We use the split proposed by [xian2018zero], where 645 classes are used for training and validation, and the remaining 72 unseen classes are used for testing; the split ensures that no test class comes from the ImageNet classes. The dataset contains 14,340 fine-grained images, each associated with a human-annotated attribute vector. The attribute vectors of the images in a class are averaged and used as the 102-dimensional class attribute. Again, pretrained ResNet-101 features are used as the image features without any fine-tuning.

a.4 a-Pascal a-Yahoo (aPY)

The aPY dataset [farhadi2009describing] contains 15,339 images in total, belonging to 32 classes. The training and validation set contains 20 classes, and 12 classes are used as unseen/test classes [xian2018zero]. Each class is associated with a 64-dimensional human-labeled attribute vector. Unlike the other datasets, aPY contains very diverse objects, which makes it a very challenging dataset for ZSL. The same ResNet-101 features are used without any fine-tuning on aPY.

In the next section, we describe the model architecture and the experimental setup for ZSL and GZSL; there, the seen-class examples and the unseen-class examples are treated as separate sets.

Appendix B Model Architecture Details

The proposed model is composed of a Generator, a Discriminator, and a Classifier network, each associated with a meta-learning agent. In the inner loop, the meta-learner adapts its parameters on the training split of each task; once the inner loop is optimized, the loss of the adapted parameters is computed on the validation split. Since the classes of the two splits are disjoint, the outer loop is optimized over novel classes, i.e., the model learns to minimize the loss on novel-class data in the outer loop.

Each task is set up in an N-way K-shot fashion. For all datasets, the inner-loop data follow one N-way K-shot configuration, while the outer-loop loss is computed on a second N-way K-shot split with disjoint classes. At test time, the meta-learned model performs N-way classification over the test classes: for ZSL, N is the number of unseen classes, and for GZSL, N is the number of seen plus unseen classes. We sample 10 tasks per batch to train the model. Algorithm 1 uses two step-size hyperparameters, one for the inner loop and one for the outer loop. For the CUB dataset we train the model for 5000 iterations, while for the aPY dataset 500 iterations are sufficient and the performance saturates; the AWA1 and AWA2 datasets take 20000 iterations to converge. The architecture details for all the components of the model are as follows:


b.1 Generator (G)

The generator network contains two hidden layers, with BatchNorm applied to each hidden layer. For the non-linearity, we use Leaky-ReLU with slope 0.2. The output layer is 2048-dimensional (the size of the ResNet-101 feature). Dropout with probability 0.5 is used on all the layers.
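As a minimal sketch of this forward pass (assumptions: hidden sizes of 512, a 32-dimensional noise input alongside the 85-dimensional attribute, and BatchNorm omitted for brevity; none of these specifics are stated in this copy of the paper):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """Leaky-ReLU with slope 0.2, as used in the generator."""
    return np.where(x > 0, x, slope * x)

def generator_forward(inp, params, rng, train=True):
    """Two hidden layers with Leaky-ReLU(0.2) and inverted dropout
    (p = 0.5), followed by a linear 2048-d output layer matching the
    ResNet-101 feature space. BatchNorm is omitted here for simplicity."""
    h = inp
    for W, b in params[:-1]:
        h = leaky_relu(h @ W + b)
        if train:  # inverted dropout keeps the expected activation scale
            h *= (rng.random(h.shape) < 0.5) / 0.5
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(0)
dims = [117, 512, 512, 2048]   # 85-d attribute + 32-d noise -> 2048-d feature
params = [(0.01 * rng.standard_normal((a, b)), np.zeros(b))
          for a, b in zip(dims[:-1], dims[1:])]
out = generator_forward(rng.standard_normal((4, 117)), params, rng)
```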

b.2 Discriminator (D)

The discriminator network contains two hidden layers with the same non-linearity as the generator, but no BatchNorm is used. The non-linearity is applied on all the layers.

b.3 Classifier (C)

The classifier network contains a single hidden layer with a Leaky-ReLU non-linearity with slope 0.2. Dropout with probability 0.5 is used on each layer.

Appendix C Sample generation and Classification

Once the model is trained, we generate samples for the unseen classes using their class attributes. The samples are produced by the generator network conditioned on the class attributes: the input is a concatenation of the class-attribute vector with a noise vector. During training the noise is drawn from a fixed distribution, and at generation time we empirically found that a noise standard deviation of 0.25 gives stable results. We claim that once the samples are synthesized, any supervised classifier can be used; to support this claim, we report results using the two most popular classifiers. We generate 200 samples per class for the AWA1, aPY, and AWA2 datasets, while for the CUB and SUN datasets 100 samples per class are sufficient for stable results. The classifier is trained on the synthesized samples, and the real unseen-class samples are then tested on the trained model.
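The generation step above can be sketched as follows. The `generator` here is a stand-in callable (the real one is the trained network), and the 32-dimensional noise and random projection are illustrative assumptions; the 0.25 noise standard deviation and 200 samples per class follow the text.

```python
import numpy as np

def synthesize_features(generator, attributes, per_class=200,
                        noise_std=0.25, noise_dim=32, rng=None):
    """For each class-attribute vector, concatenate it with noise
    (std 0.25) and run the generator to synthesize `per_class` features."""
    rng = rng or np.random.default_rng(0)
    X, y = [], []
    for label, a in enumerate(attributes):
        z = rng.normal(0.0, noise_std, size=(per_class, noise_dim))
        inp = np.concatenate([np.tile(a, (per_class, 1)), z], axis=1)
        X.append(generator(inp))
        y.append(np.full(per_class, label))
    return np.vstack(X), np.concatenate(y)

# Stand-ins: 10 unseen classes with 85-d attributes (as in AWA2), and a
# random linear map playing the role of the trained generator.
attrs = np.random.randn(10, 85)
fake_gen = lambda inp: inp @ (0.01 * np.random.randn(inp.shape[1], 2048))
Xsyn, ysyn = synthesize_features(fake_gen, attrs)
```

A softmax classifier or linear-SVM is then trained on `(Xsyn, ysyn)` and evaluated on the real unseen-class features.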

Method                               CUB   AWA1  AWA2  aPY
DCN [calibration_generalized]        56.2  65.2  –     43.6
ZSKL [zskl]                          51.7  70.1  70.5  45.3
GFZSL [verma2017simple]              49.2  69.4  67.0  38.4
cycle-UWGAN [cycle-consistancy]      58.6  66.8  –     –
f-CLSWGAN [xian2018feature]          57.3  68.2  –     –
SE-ZSL [vermageneralized]            59.6  69.5  69.2  –
ZSML (Ours) 5 examples per class     56.0  65.1  65.5  62.4
ZSML (Ours) 10 examples per class    63.1  66.3  67.7  62.9
ZSML Softmax (Ours) all examples     69.6  73.5  76.1  64.1
ZSML SVM (Ours) all examples         69.7  74.3  77.5  64.0
Table 5: Zero-Shot Learning results in the setup proposed by [xian2018zero]. Non-generative models are listed at the top and generative models at the bottom; "–" denotes results not reported. All results are in the inductive setting.

c.1 SoftMax Classifier 2 (C2)

The classifier contains a single-layer neural network without any non-linearity. We use dropout with probability 0.5, and the output layer uses softmax. The number of classes on the output layer is the number of unseen classes for ZSL, and the number of seen plus unseen classes for GZSL.

c.2 Linear-SVM

We also use a Linear-SVM for classification of the unseen-class data. The model is trained on the synthesized samples described above. The Linear-SVM consistently performs slightly better than softmax, but training it is very time-consuming for a large number of classes and a large dataset; therefore, for GZSL we report results with softmax only. For the Linear-SVM, we use a soft-margin penalty, and class weights are balanced based on the class frequencies in the data.
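The "balanced" class-weight heuristic referred to here is the standard one (weight inversely proportional to class frequency), which can be computed directly; this is a sketch of that formula rather than the paper's code, and it matches what scikit-learn's `class_weight='balanced'` option does.

```python
import numpy as np

def balanced_class_weights(y):
    """Per-class weights w_c = n_samples / (n_classes * count_c), i.e.
    classes with fewer samples receive proportionally larger weights."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y = np.array([0, 0, 0, 1])        # imbalanced toy labels
w = balanced_class_weights(y)      # class 1 is rarer, so it gets more weight
```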

Appendix D Algorithm

The algorithm for the complete approach is given below (we will provide code and data upon publication):

1: Require: p(T): distribution over tasks
2: Require: α, β: step-size hyperparameters
3: Randomly initialize generator parameters θ_G and discriminator parameters θ_D
4: while not done do
5:     Sample a batch of tasks T_i ∼ p(T), with disjoint sets of classes between
6:     the training and validation splits of each task
7:     for all T_i do
8:         Evaluate ∇θ_D L_D(θ_D) on the training split of T_i
9:         Evaluate ∇θ_G L_G(θ_G) on the training split of T_i
10:        Compute adapted parameters: θ′_D = θ_D − α ∇θ_D L_D(θ_D)
11:        Compute adapted parameters: θ′_G = θ_G − α ∇θ_G L_G(θ_G)
12:    end for
13:    Update θ_D ← θ_D − β ∇θ_D Σ_{T_i} L_D(θ′_D), using the validation splits
14:    Update θ_G ← θ_G − β ∇θ_G Σ_{T_i} L_G(θ′_G), using the validation splits
15: end while
Algorithm 1 Generative Adversarial MAML for ZSL
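One outer iteration of this MAML-style update can be sketched numerically. This is a first-order toy sketch, with a quadratic loss per task standing in for the WGAN losses; the names `fo_maml_step`, `t_train`, and `t_val` are illustrative, not from the paper.

```python
import numpy as np

def fo_maml_step(theta, tasks, alpha=1e-2, beta=1e-3):
    """One first-order MAML outer update on the toy loss
    L_i(theta) = ||theta - t_i||^2. Inner loop: adapt theta with step
    alpha toward the task's training target; outer loop: update theta
    with step beta using the adapted parameters' gradient on the task's
    (disjoint-class) validation target."""
    outer_grad = np.zeros_like(theta)
    for t_train, t_val in tasks:
        g_inner = 2.0 * (theta - t_train)          # grad of training loss
        theta_adapt = theta - alpha * g_inner       # inner-loop adaptation
        outer_grad += 2.0 * (theta_adapt - t_val)   # first-order outer grad
    return theta - beta * outer_grad / len(tasks)

theta = np.zeros(4)
tasks = [(np.ones(4), 2 * np.ones(4))]   # one task: train/val targets
theta = fo_maml_step(theta, tasks)       # theta moves toward the targets
```

In the full model the same pattern is applied to θ_G and θ_D with the WGAN generator and critic losses, and with second-order gradients where applicable.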

Appendix E Ablation

In this section, we perform ablation studies under different setups. The CUB-200 and AWA2 datasets are used for the ablation analysis of the different components. In the ablation study, we find that the proposed zero-shot task distribution and generative meta-learning are the key components for improving the model's performance.

e.1 Softmax and SVM

The proposed approach is generative, so we can generate samples of any class/distribution given the novel class attribute/description. Once we have synthetic data for the novel classes, we can train any traditional classifier on the unseen-class data. Here we show the ZSL results of two standard classifiers, softmax and linear-SVM, trained on the synthesized novel/unseen-class samples. We find that both classifiers give very competitive results, and in some cases linear-SVM performs slightly better than softmax. We suggest that this difference arises because linear-SVM learns a max-margin classifier, whereas a linear softmax classifier does not explicitly enforce a margin. Refer to Table 5 for the results of the softmax and linear-SVM classifiers.