Generalized Zero- and Few-Shot Learning via Aligned Variational Autoencoders

12/05/2018 ∙ by Edgar Schönfeld, et al. ∙ University of Amsterdam ∙ berkeley college

Many approaches in generalized zero-shot learning rely on cross-modal mapping between the image feature space and the class embedding space. As labeled images are rare, one direction is to augment the dataset by generating either images or image features. However, the former misses fine-grained details and the latter requires learning a mapping associated with class embeddings. In this work, we take feature generation one step further and propose a model where a shared latent space of image features and class embeddings is learned by modality-specific aligned variational autoencoders. This leaves us with the required discriminative information about the image and classes in the latent features, on which we train a softmax classifier. The key to our approach is that we align the distributions learned from images and from side-information to construct latent features that contain the essential multi-modal information associated with unseen classes. We evaluate our learned latent features on several benchmark datasets, i.e. CUB, SUN, AWA1 and AWA2, and establish a new state-of-the-art on generalized zero-shot as well as on few-shot learning. Moreover, our results on ImageNet with various zero-shot splits show that our latent features generalize well in large-scale settings.




1 Introduction

Most approaches to generalized zero-shot learning (GZSL), where no labeled training examples are available for some of the classes, learn a mapping between images and their class embeddings [5, 1, 18, 33, 2]. Among these, ALE [1] maps CNN features of images to a per-class attribute space, whereas DeViSE [5] uses a word embedding space of the class labels learned from the English-language Wikipedia. Although [2] combines attributes and word embeddings, a separate mapping needs to be learned for each modality.

Figure 1: Our CADA-VAE model learns a latent embedding ($z$) of image features ($x$) and class embeddings ($c(y)$ of labels $y$) via aligned VAEs optimized with cross-alignment ($\mathcal{L}_{CA}$) and distribution-alignment ($\mathcal{L}_{DA}$) objectives, and subsequently trains a classifier on sampled latent features of seen and unseen classes.

An orthogonal approach to GZSL is to augment data by generating artificial images [21]. However, due to the level of detail missing in the synthetic images, CNN features extracted from them do not improve classification accuracy. To alleviate this issue, [35] proposed to generate image features via a conditional WGAN, which simplifies the task of the generative model and directly optimizes the loss on image features. Similarly, [16, 30] use conditional variational autoencoders (VAE) for this purpose. A complementary idea, proposed by [27], is to learn a cross-modal embedding between image features and class embeddings in a latent embedding space. For instance, [27] proposed to transform both modalities to the latent spaces of autoencoders and match the corresponding distributions by minimizing the Maximum Mean Discrepancy (MMD). Learning such cross-modal embeddings is beneficial for potential downstream tasks that require multimodal fusion, e.g. visual question answering. In this domain, [20] recently used a cross-modal autoencoder to extend visual question answering to previously unseen objects.

Although recent cross-modal autoencoder architectures represent class prototypes in a latent space [17, 27], better generalization can be achieved if the shared representation space is amenable to interpolation between different classes. Variational Autoencoders (VAEs) are known for their capability of accurate interpolation between representations in their latent space, as demonstrated for sentence interpolation [3] and image interpolation [10]. Hence, in this work, we train VAEs to encode and decode features from different modalities, e.g. images and class attributes, and use the learned latent features to train a generalized zero-shot learning classifier. Our latent representations are aligned by matching their parametrized distributions and by enforcing a cross-modal reconstruction criterion. Consequently, by explicitly enforcing alignment both in the latent features and in the distributions of latent features learned using different modalities, the VAEs learn to adapt to unseen classes without forgetting the previously seen ones.

Our contributions in this work are as follows. (1) We propose a model that learns shared cross-modal latent representations of multiple data modalities using simple VAEs via distribution alignment and cross alignment objectives. (2) We extensively evaluate our model using conventional benchmark datasets, i.e. CUB, SUN, AWA1 and AWA2, developed for zero-shot learning and extended to few-shot learning. Our model establishes the new state-of-the-art performance on generalized zero-shot and few-shot learning settings on all these datasets. Furthermore, we show that our model can be extended easily to more than two modalities trained simultaneously. (3) Finally, we show that the latent features learned by our model improve the state of the art in the truly large-scale ImageNet dataset in all splits for the generalized zero-shot learning task.

2 Related Work

In this section, we present related work on generalized zero-shot learning and few-shot learning, generative models for learning representations and cross-modal reconstruction.

Generalized Zero- and Few-Shot Learning. In the classic zero-shot learning setting, training and test classes are disjoint with attributes available at train time, and the performance of the method is judged solely on its classification accuracy on the novel or unseen classes. Generalized zero-shot learning is a more realistic variant of zero-shot learning: the same information is available at training time, but the performance of the model is judged on the harmonic mean of the classification accuracy on seen and unseen classes.

In few-shot learning, the training setup is similar to that of (generalized) zero-shot learning, with the exception that a few labeled examples are provided at training time for the previously unseen classes [28, 24, 9, 31]. Using auxiliary information for few-shot learning was introduced in [28], where attributes related to images were used to improve the performance of the model. The use of auxiliary information was also explored in ReViSE [27], in which a common image-label semantic space for transductive few-shot learning is learned. Analogous to the relation between zero-shot learning and generalized zero-shot learning, we extend few-shot learning to the generalized few-shot learning (GFSL) setting, in which we evaluate the model on both seen and unseen classes.

Generative Models for Learning Representations. In the context of generalized zero-shot learning, generative models are used for data-augmentation [35, 16, 30] and latent representation learning [17, 27]. Approaches using data augmentation treat generalized zero-shot learning as a missing data problem and train conditional GANs [35] or conditional VAEs to generate image features or images for unseen classes from semantic side-information. Generative methods for representation learning for GZSL are based on autoencoders, such as ReViSE [27] and DMAE [17], which learn to jointly represent features from different modalities in their latent space. By making use of autoencoders, it is possible to learn representations of visual and semantic information in a semi-supervised fashion. Learning a joint representation for visual and semantic data is achieved by aligning the latent distributions between different data types. ReViSE implements this distribution alignment by minimizing the maximum mean discrepancy between the two latent distributions [8]. DMAE aligns distributions by means of minimizing the Squared-Loss Mutual Information [26]. In this work, we use Variational Autoencoders instead, and align the latent distributions by minimizing their Wasserstein distance. In contrast to [17] and [27], we also enforce a cross-reconstruction loss, by decoding every encoded feature into every other modality.

Figure 2: Our Model: Cross- and Distribution Aligned VAE Model. Different ways of aligning the latent representations learned by our architecture. (a) The DA-VAE aligns latent spaces by minimizing their Wasserstein distance. (b) The CA-VAE aligns latent representations via cross-reconstruction. (c) The CADA-VAE is a fusion of the first two models.

Cross-Reconstruction in Generative Models. Reconstructing data across domains, referred to as cross-alignment, is commonly used in the field of domain adaptation. While models like CycleGAN [36] learn to generate data across domains directly, latent space models use cross-reconstruction to capture the common information contained in both domains in their intermediate latent representations [29]. In this regard, cross-aligned VAEs have been used previously for text-style transfer [23] and image-to-image translation [14]. In [23] a cross-aligned VAE ensures that the latent representations of texts from different input domains are similar, while in [14] a comparable approach was developed to match the latent representation of images from different domains. Both methods have in common that they use a different variant of VAEs with an adversarial loss. Additionally, [23] makes use of conditional encoders and decoders, while [14] enforces cycle consistency and weight sharing. In this paper, on the other hand, our building blocks are unconditional VAEs and we achieve multi-modal alignment via cross-reconstruction and latent distribution alignment in a highly reduced space.

3 Model

Of the existing generalized zero-shot learning models, recent data generating approaches [35, 30, 16] achieve superior performance over other methods on disjoint datasets. For some approaches [35], the edge in performance comes at the cost of an unstable training procedure. On the other hand, classifying generated images or image features from conditional VAEs [30, 16] runs the risk of being compromised by the curse of dimensionality. The main insight of our proposed model is that instead of generating images or image features, we generate low-dimensional latent features and achieve both stable training and state-of-the-art performance. Hence, the key to our approach is the choice of a VAE latent space, a reconstruction and cross-reconstruction criterion to preserve class-discriminative information in lower dimensions, as well as explicit distribution alignment to encourage domain-agnostic representations.

3.1 Background

We first provide background on the task, i.e. generalized zero-shot learning, and on the building block of our model, i.e. the variational autoencoder.

Generalized Zero-shot Learning. The task definition of generalized zero-shot learning is as follows. Let $S = \{(x, y, c(y)) \mid x \in X, y \in Y^S, c(y) \in C\}$ be a set of training examples, consisting of image features $x$, e.g. extracted by a CNN, class labels $y$ available during training and class embeddings $c(y)$. Typical class embeddings are vectors of hand-annotated continuous attributes or Word2Vec features [15]. In addition, an auxiliary training set $U = \{(u, c(u)) \mid u \in Y^U, c(u) \in C\}$ is used, where $u$ denotes unseen classes from a set $Y^U$, which is disjoint from $Y^S$. Here, $C(U) = \{c(u_1), \dots, c(u_L)\}$ is the set of class embeddings of unseen classes. In the legacy challenge of ZSL, the task is to learn a classifier $f_{ZSL}: X \rightarrow Y^U$. However, in this work, we focus on the more realistic and challenging setup of generalized zero-shot learning (GZSL) where the aim is to learn a classifier $f_{GZSL}: X \rightarrow Y^U \cup Y^S$.
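The difference between the two label spaces can be made concrete with a toy sketch. The scores, class names and the `predict` helper below are hypothetical illustrations, not part of the described model; the sketch only shows that the same classifier scores can yield different predictions depending on whether seen classes are part of the search space.

```python
def predict(scores, label_space):
    """Return the label from label_space with the highest score."""
    return max(label_space, key=lambda label: scores[label])


# Hypothetical classifier scores for one test image of an unseen class.
scores = {"horse": 0.5, "zebra": 0.4, "okapi": 0.1}
seen_classes = {"horse"}
unseen_classes = {"zebra", "okapi"}

# ZSL: the search space contains unseen classes only.
zsl_prediction = predict(scores, unseen_classes)
# GZSL: the search space contains seen and unseen classes.
gzsl_prediction = predict(scores, seen_classes | unseen_classes)
```

In this toy case the seen class wins as soon as it is allowed into the search space, which is the bias towards seen classes that makes the generalized setting harder.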

Variational Autoencoder (VAE). The basic building block of our model is the variational autoencoder (VAE) [12]. Variational inference aims at finding the true conditional probability distribution over the latent variables, $p_\theta(z|x)$. Due to the intractability of this distribution, it can be approximated by finding its closest proxy posterior, $q_\phi(z|x)$, through minimizing their distance using a variational lower bound. The objective function of a VAE is the variational lower bound on the marginal likelihood of a given datapoint and can be formulated as:

$$\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z)),$$

where the first term is the reconstruction error and the second term is the unpacked Kullback-Leibler divergence between the inference model $q_\phi(z|x)$ and the prior $p_\theta(z)$. A common choice for the prior is a multivariate standard Gaussian distribution. The encoder predicts $\mu$ and $\Sigma$ such that $q_\phi(z|x) = \mathcal{N}(\mu, \Sigma)$, from which a latent vector $z$ is generated via the reparametrization trick [12].
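As a minimal sketch of the two ingredients described above, the snippet below draws a latent vector with the reparametrization trick and evaluates the closed-form KL divergence between a diagonal Gaussian posterior and a standard Gaussian prior. It uses plain Python to stay dependency-free; the function names and the log-variance parameterization are assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
# Hypothetical encoder outputs for a 2-dimensional latent space.
z = reparameterize([0.3, -1.2], [0.0, -2.0], rng)
kl = kl_to_standard_normal([0.3, -1.2], [0.0, -2.0])
```

Because the noise enters only through `eps`, the sample stays differentiable with respect to the predicted mean and variance in a framework with automatic differentiation.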

3.2 Cross and Distribution Aligned VAE

The goal of our model is to learn representations within a common space parameterized by a combination of $M$ data modalities. Hence, our model includes $M$ encoders, one for every modality, to map into this representation space. To minimize the loss of information, the original data must be reconstructed via the decoder networks. In effect, the basic VAE loss of our model is the sum of the VAE losses over the $M$ VAEs:

$$\mathcal{L}_{VAE} = \sum_{i=1}^{M} \mathbb{E}_{q_\phi(z|x^{(i)})}[\log p_\theta(x^{(i)}|z)] - \beta D_{KL}(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)),$$

where $\beta$ weights the KL-Divergence [10]. In the case of matching image features with class embeddings, $M = 2$, $x^{(1)} = x$ and $x^{(2)} = c(y)$. However, in order to ensure that the modality-specific autoencoders learn similar representations for features of the same classes, additional regularization terms are necessary. For this reason, our model aligns the latent distributions explicitly and enforces a cross-reconstruction criterion. In Figure 2 we show an overview of our model, depicting these two forms of latent distribution matching, which we refer to as Cross-Alignment (CA) and Distribution-Alignment (DA).

Cross-Alignment (CA) Loss. In cross-alignment, the reconstruction is obtained by decoding the latent encoding of a sample from another modality, but the same class. In essence, every modality-specific decoder is trained on the latent vectors derived from the other modalities. The loss for these cross-reconstructions is computed as:

$$\mathcal{L}_{CA} = \sum_{i=1}^{M} \sum_{j \neq i}^{M} \left| x^{(j)} - D_j\left(E_i\left(x^{(i)}\right)\right) \right|,$$

where $E_i$ is the encoder of a feature of modality $i$ and $D_j$ is the decoder of a feature belonging to the same class but the $j$-th modality.
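The cross-reconstruction idea can be sketched as follows: each modality is encoded, and the resulting latent vector is decoded by the decoders of all other modalities and compared to the class-matched feature of that modality. The encoders and decoders here are hypothetical stand-in callables, and the L1 distance is used as reconstruction error; this is an illustrative sketch, not the authors' implementation.

```python
def l1_distance(a, b):
    """Sum of absolute differences between two equally long vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cross_alignment_loss(samples, encoders, decoders):
    """Cross-reconstruction loss for one class-matched tuple of features.

    samples[i] is the feature of modality i; encoders[i] maps it to a
    latent vector and decoders[j] maps a latent vector to modality j.
    """
    total = 0.0
    for i, encode in enumerate(encoders):
        z = encode(samples[i])
        for j, decode in enumerate(decoders):
            if j != i:  # only decode into the *other* modalities
                total += l1_distance(samples[j], decode(z))
    return total


# Toy stand-ins: identity encoders and decoders on 2-d features.
identity = lambda v: list(v)
image_feature, class_embedding = [1.0, 2.0], [2.0, 2.0]
loss = cross_alignment_loss([image_feature, class_embedding],
                            [identity, identity], [identity, identity])
```

With real networks the encoders would typically return a sampled latent vector, so the loss also depends on the sampling noise.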

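Distribution-Alignment (DA) Loss. Another way of matching latent representations across modalities is to minimize the distance between two distributions. Here, we minimize the Wasserstein distance between the predicted latent multivariate Gaussian distributions. In the case of multivariate Gaussians, a closed-form solution of the 2-Wasserstein distance [7] between two distributions $i$ and $j$ is given as:

$$W_{ij} = \left[ \|\mu_i - \mu_j\|_2^2 + \mathrm{Tr}(\Sigma_i) + \mathrm{Tr}(\Sigma_j) - 2\,\mathrm{Tr}\left( \left( \Sigma_i^{1/2} \Sigma_j \Sigma_i^{1/2} \right)^{1/2} \right) \right]^{1/2}.$$

Since our network predicts diagonal covariance matrices, which commute, the Wasserstein distance simplifies to the following:

$$W_{ij} = \left( \|\mu_i - \mu_j\|_2^2 + \left\| \Sigma_i^{1/2} - \Sigma_j^{1/2} \right\|_F^2 \right)^{1/2},$$

and the Distribution Alignment (DA) loss for an $M$-tuple is written as:

$$\mathcal{L}_{DA} = \sum_{i=1}^{M} \sum_{j \neq i}^{M} W_{ij}.$$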
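For diagonal covariances, the 2-Wasserstein distance between two Gaussians reduces to the Euclidean distance between the means combined with the Euclidean distance between the per-dimension standard deviations. A minimal sketch of this simplified distance and of the pairwise sum over modalities is given below; the function names and the use of standard deviations as inputs are assumptions for illustration.

```python
import math


def wasserstein_diag(mu_i, sigma_i, mu_j, sigma_j):
    """2-Wasserstein distance between two diagonal Gaussians.

    sigma_i and sigma_j hold per-dimension standard deviations,
    i.e. the diagonal entries of the square-rooted covariance.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_i, mu_j))
    std_term = sum((a - b) ** 2 for a, b in zip(sigma_i, sigma_j))
    return math.sqrt(mean_term + std_term)


def distribution_alignment_loss(params):
    """Sum the distance over all ordered pairs (i, j) with i != j.

    params is a list of (mu, sigma) tuples, one per modality.
    """
    total = 0.0
    for i, (mu_i, sigma_i) in enumerate(params):
        for j, (mu_j, sigma_j) in enumerate(params):
            if i != j:
                total += wasserstein_diag(mu_i, sigma_i, mu_j, sigma_j)
    return total


# Hypothetical encoder outputs for an image feature and a class embedding.
image_params = ([0.0, 0.0], [1.0, 1.0])
attribute_params = ([3.0, 4.0], [1.0, 1.0])
da_loss = distribution_alignment_loss([image_params, attribute_params])
```

With two modalities the sum visits both orderings of the pair, so the loss is twice the single pairwise distance.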


Cross- and Distribution Alignment (CADA-VAE) Loss. The basic VAE loss $\mathcal{L}_{VAE}$ can be combined with $\mathcal{L}_{CA}$ alone (CA-VAE) or with $\mathcal{L}_{DA}$ alone (DA-VAE). Our final model (CADA-VAE) combines all three terms, leading to the following objective:

$$\mathcal{L}_{CADA\text{-}VAE} = \mathcal{L}_{VAE} + \gamma \mathcal{L}_{CA} + \delta \mathcal{L}_{DA},$$

where $\gamma$ and $\delta$ are the weighting factors of the cross-alignment and the distribution-alignment loss, respectively. We show in section 4.1 that our model can learn shared multimodal embeddings of more than two modalities, without requiring examples of all modalities to be available for all classes.

Implementation Details. Here we explain the details of our experimental setup used across both (G)ZSL and (G)FSL experiments. All our encoders and decoders are multilayer perceptrons with one hidden layer. For our proposed CADA-VAE, we use a hidden layer of size for the image feature encoder and a layer of size for the decoder. The attribute net consists of an encoder layer of size and a decoder layer of dimensions. Finally, we set the latent embedding size to . Note that for ImageNet, we chose a larger latent size of , and use two hidden layers of the same size for the encoder, using the number of hidden units specified above. Furthermore, the image feature decoder for ImageNet has two hidden layers of size and , while the attribute decoder uses and units. The model is trained for epochs by stochastic gradient descent using the Adam optimizer [11] and a batch size of . A batch size of is used for the ImageNet experiments.

After individual VAEs learn to encode features of their specific datatype, we compute cross- and distribution alignment losses. is increased from epoch to epoch by a rate of per epoch, while is increased from epoch to by per epoch. For the KL-divergence we use an annealing scheme [3], in which we increase the weight of the KL-divergence by a rate of per epoch until epoch . A KL-annealing scheme serves the purpose of first letting the VAE learn “useful” representations before they are “smoothed” out, since the KL-divergence would be otherwise a very strong regularizer [3]. We empirically found that it is useful to use a variant of the reparametrization trick [12], in which all dimensions of the noise vector are sampled from a single unimodal Gaussian. Also, using the L1 distance as reconstruction error appeared to yield slightly better results than L2. After training, VAE encoders are used to transform the training and test set of the final classifier into the latent space. Finally, a linear softmax classifier is trained and tested on the latent features.
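The warm-up of the alignment weights can be sketched as a linear ramp. In the snippet below, the names `gamma` (cross-alignment weight) and `delta` (distribution-alignment weight), the behaviour outside the ramp (zero before it starts, constant after it ends) and all numeric values are assumptions chosen for illustration; they are not the schedule values used in the paper.

```python
def linear_warmup(epoch, start_epoch, end_epoch, rate):
    """Weight that grows by `rate` per epoch between start and end.

    Assumed behaviour: 0 up to start_epoch, constant after end_epoch.
    """
    if epoch <= start_epoch:
        return 0.0
    return rate * (min(epoch, end_epoch) - start_epoch)


def cada_vae_objective(l_vae, l_ca, l_da, gamma, delta):
    """Weighted sum of the VAE, cross-alignment and distribution-alignment terms."""
    return l_vae + gamma * l_ca + delta * l_da


# Hypothetical schedule and loss values, for illustration only.
epoch = 12
gamma = linear_warmup(epoch, start_epoch=10, end_epoch=20, rate=0.1)
delta = linear_warmup(epoch, start_epoch=5, end_epoch=15, rate=0.5)
total = cada_vae_objective(l_vae=1.0, l_ca=2.0, l_da=3.0,
                           gamma=gamma, delta=delta)
```

Starting both weights at zero lets each VAE first learn to reconstruct its own modality before the alignment terms begin to shape the latent space, which matches the order described above.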

4 Experiments

We evaluate our framework on the zero-shot learning benchmark datasets CUB-200-2011 [32] (CUB), SUN attribute (SUN) [19], and Animals with Attributes 1 and 2 (AWA1 [13], AWA2 [34]) in the generalized zero-shot learning (GZSL) and generalized few-shot learning (GFSL) settings. All image features used for training the VAEs are extracted from the 2048-dimensional final pooling layer of a ResNet-101. To avoid violating the zero-shot assumption, i.e. test classes need to be disjoint from the classes that ResNet-101 was trained with, we use the training splits proposed in [34]. As class embeddings, attribute vectors were used if available. For CUB, we also used sentence embeddings provided by [21], and for ImageNet we used Word2Vec [15] embeddings provided by [4]. All hyperparameters were chosen on a validation set provided by [34]. We report the harmonic mean (H) between seen (S) and unseen (U) average per-class accuracy, i.e. the Top-1 accuracy is averaged on a per-class basis.
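The evaluation metric can be sketched in a few lines: Top-1 accuracy is first averaged per class, separately for seen and unseen classes, and the two averages are then combined as H = 2·S·U / (S + U). The helper names and the toy labels below are hypothetical.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred, classes):
    """Average of the per-class Top-1 accuracies over the given classes."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        if truth in classes:
            total[truth] += 1
            correct[truth] += int(truth == pred)
    accuracies = [correct[c] / total[c] for c in classes if total[c] > 0]
    return sum(accuracies) / len(accuracies) if accuracies else 0.0


def harmonic_mean(seen_acc, unseen_acc):
    """Harmonic mean H of seen (S) and unseen (U) accuracy."""
    if seen_acc + unseen_acc == 0:
        return 0.0
    return 2.0 * seen_acc * unseen_acc / (seen_acc + unseen_acc)


# Toy labels: classes 0 and 1 are seen, class 2 is unseen.
y_true = [0, 0, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 2, 0, 0, 0]
s = per_class_accuracy(y_true, y_pred, [0, 1])
u = per_class_accuracy(y_true, y_pred, [2])
h = harmonic_mean(s, u)
```

The harmonic mean is dominated by the smaller of the two accuracies, so a model that ignores unseen classes scores close to zero even when its seen-class accuracy is high.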

Figure 3: In this experiment we analyse the effect of different class embeddings. (Left) Seen, unseen and harmonic mean accuracy for CUB using different class embeddings as side information. (Right) Using both attributes and sentences as side information, i.e. : the percentage of seen classes with sentences, : the percentage of unseen classes with sentences. Attributes were used as class embeddings for the remaining classes.

Model S U H

Table 1: In this ablation study, we compare GZSL accuracy on CUB for different multi-modal alignment objective functions, i.e. DA-VAE (distribution aligned VAE), CA-VAE (cross-aligned VAE) and CADA-VAE (cross and distribution aligned VAE).

4.1 Analyzing CADA-VAE in Detail on CUB

In this section, we analyze several building blocks of our proposed framework such as the model, the choice of class embeddings as well as the size and the number of latent embeddings generated by our model in the generalized zero-shot learning setting.

Analyzing Model Variants. In this ablation study for GZSL, we present the results for different objective functions and the corresponding VAE variants, CA-VAE (cross-aligned VAE), DA-VAE (distribution-aligned VAE) and CADA-VAE (cross- and distribution-aligned VAE) on the CUB dataset. As shown in Table 1, the cross-alignment objective offers a noticeable improvement in performance from with distribution alignment only, to . However, combining the distribution-alignment and cross-alignment objectives, i.e. CADA-VAE, raises the harmonic mean accuracy to . This experiment shows that aligning the latent representations and aligning the latent spaces are complementary, since their combination leads to the highest result on seen classes, on unseen classes and on their harmonic mean in the GZSL setting.

Analyzing Side Information. In sparse data regimes, semantic representations of the classes, i.e. class embeddings, are as important as the image embeddings. We compare the results obtained with per-class attributes, per-class sentences and class-based Word2Vec representations. Our results in Figure 3 (left) show that per-class sentence embeddings result in the best performance among all three, i.e. , attributes follow closely, i.e. . With Word2Vec, the difference between the seen and the unseen class accuracy is significant, indicating that the latent representations learned from Word2Vec generalize less well to unseen classes. This is expected, given that Word2Vec features do not explicitly or exclusively represent visual characteristics. In summary, these results demonstrate that our model is able to learn to generate transferable latent features from various sources of side information. The results also show that latent features learned with more discriminative class embeddings lead to better overall accuracy.

Next, we investigate one of the most prominent aspects of our model, i.e. the ability to handle missing side information. We train CADA-VAE such that of seen class image features are paired with sentence embeddings while the other of seen classes are paired with attributes. The setup is evaluated for . We also vary the fraction of unseen classes that are learned from sentence features (whereas denotes the fraction of unseen classes for which image features are only paired with attributes).

Figure 3 (right) shows the results using different fractions of sentence embeddings and attribute embeddings for and . When is held stable at 50%, i.e. both seen and unseen classes have access to sentences and attributes equally half the time, we reach the highest accuracy for an - ratio of -. Interestingly, at , i.e. no sentences for seen classes but 50% attributes and 50% sentences for unseen classes, the accuracy is , while at , i.e. no sentences for unseen classes but 50% attributes and 50% sentences for seen classes, the accuracy is . On the other hand, at , i.e. 50% attributes and 50% sentences for seen classes but 100% sentences for unseen classes, the accuracy is , whereas at , i.e. 50% attributes and 50% sentences for unseen classes but 100% sentences for seen classes, the accuracy is . These results indicate that sentences have the edge over attributes. However, when either sentences or attributes are not available, our model can recover the missing information from the other modality and still learn discriminative representations.

Figure 4: The influence of different latent dimensionalities on the harmonic mean accuracy for the CUB dataset.

Figure 5: Analyzing the effect of the number of latent features per class on the harmonic mean accuracy in GZSL. An unseen-seen ratio of 2 means that twice as many samples are generated for unseen classes as for seen classes. The dynamic dataset (light blue) does not rely on a fixed number of sampled latent features.

Increasing Number of Latent Dimensions. In this analysis, we have explored the robustness of our method to the dimensionality of the latent space. Without loss of generality, we report the harmonic mean accuracy of CADA-VAE with respect to different latent dimensions on CUB, ranging from and . We observe in Figure 4 that the accuracy initially increases with increasing dimensionality until it achieves its peak accuracy of at and flattens until after which it declines upon further increase of the latent dimension. We conclude from these experiments that the most discriminative properties of two modalities are captured when the latent space has around dimensions. For efficiency reasons, we use dimensional latent features for the rest of the paper.

Increasing Number of Latent Features. Our model can be used to generate an arbitrarily large number of latent features. In this experiment, we vary the number of latent features per class from to on CUB in the GZSL setting and reach the optimum performance with or more latent features per seen class (Figure 5, left). In principle, seen and unseen classes do not need to have the same number of samples. Hence, we also vary the number of features per seen and unseen classes. Indeed, the best accuracy is achieved when there are approximately twice as many features per unseen than seen classes which improves the accuracy from to . While latent features per every class, i.e. , gives accuracy, having latent features per seen classes and latent features per unseen class, i.e. , leads to accuracy. Hence, generating more features of under-represented classes is important for better accuracy.

As for our results in Figure 5 on the right, we build a dynamic training set by generating latent features continuously at every iteration and do not use any sample more than once. Hence, we eliminate one of the tunable parameters, i.e. the number of latent features to generate. Because of the non-deterministic mapping of the VAE encoder, every latent feature of a different class is unique. Our results indicate that the best accuracy is achieved when unseen and seen class samples are equally balanced. In CUB, using a dynamic training set reaches the same performance as using a fixed dataset with unseen examples and seen examples. On the other hand, using a fixed dataset leads to a faster training procedure. Hence, we use a fixed dataset with examples per seen class and examples per unseen class in every benchmark reported in this paper.
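Building a fixed classifier training set from sampled latent features can be sketched as follows. The sketch assumes that seen-class latents are sampled from the encoded image features and unseen-class latents from the encoded class embeddings, each encoder output being a mean and log-variance pair; the data layout, function names and toy values are hypothetical.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw one latent vector from N(mu, diag(exp(log_var)))."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def build_classifier_set(seen_stats, unseen_stats, n_seen, n_unseen, seed=0):
    """Return (features, labels) with a fixed number of samples per class.

    seen_stats: label -> list of (mu, log_var), one per encoded image.
    unseen_stats: label -> one (mu, log_var) from the encoded class embedding.
    """
    rng = random.Random(seed)
    features, labels = [], []
    for label, stats in seen_stats.items():
        for _ in range(n_seen):
            mu, log_var = rng.choice(stats)  # pick a random encoded image
            features.append(sample_latent(mu, log_var, rng))
            labels.append(label)
    for label, (mu, log_var) in unseen_stats.items():
        for _ in range(n_unseen):
            features.append(sample_latent(mu, log_var, rng))
            labels.append(label)
    return features, labels


# Toy encoder outputs; twice as many samples for the unseen class.
seen_stats = {"horse": [([0.0, 0.0], [0.0, 0.0]), ([0.1, 0.2], [0.0, 0.0])]}
unseen_stats = {"zebra": ([1.0, 1.0], [0.0, 0.0])}
features, labels = build_classifier_set(seen_stats, unseen_stats,
                                        n_seen=2, n_unseen=4)
```

A dynamic training set would call the sampling step anew at every iteration instead of storing the result once.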

Model Feature Size S U H S U H S U H S U H
CMT [25]
SJE [2]
ALE [1]
LATEM [33]
EZSL [22]
SYNC [4]
DeViSE [5]
f-CLSWGAN [35]
CVAE [16]
SE [30]
ReViSE [27]
ours (CADA-VAE)
Table 2: Comparing CADA-VAE with the state of the art. We report per-class accuracy for seen (S) and unseen (U) classes and their harmonic mean (H). All reported numbers for our method are averaged over ten runs.

Figure 6: Comparing CA-VAE, DA-VAE and CADA-VAE with ReViSE [27] for an increasing number of training samples from unseen classes, i.e. in the generalized few-shot setting.

4.2 Comparing CADA-VAE on Benchmark Datasets

In this section, we compare our CADA-VAE on four benchmark datasets in the generalized zero-shot learning and generalized few-shot learning settings.

Generalized Zero-Shot Learning. We compare our model with 11 state-of-the-art models. Among those, CVAE [16], SE [30], and f-CLSWGAN [35] learn to generate artificial visual data and thereby treat the zero-shot problem as a data-augmentation problem. On the other hand, the classic ZSL methods DeViSE [5], SJE [2], ALE [1], EZSL [22] and LATEM [33] use a linear compatibility function or other similarity metrics to compare embedded visual and semantic features; CMT [25] and LATEM [33] utilize multiple neural networks to learn a non-linear embedding; and SYNC [4] learns by aligning a class embedding space and a weighted bipartite graph. ReViSE [27] proposes a shared latent manifold learning using an autoencoder between the image features and class attributes. We report results on CUB, SUN, AWA1 and AWA2 datasets.

The results in Table 2 show that our CADA-VAE outperforms all other methods on all datasets. The accuracy difference between our model and the closest baseline, ReViSE [27], is as follows: vs on CUB, vs on SUN, vs on AWA1 and vs on AWA2. Moreover, our model achieves significant improvements over feature generating models, most notably on CUB. Compared to the classic ZSL methods, our method leads to at least improvement in harmonic mean accuracies. In the legacy ZSL setting, which is hardly realistic, our CADA-VAE provides competitive performance, i.e. on CUB, on SUN, on AWA1, on AWA2. However, in this work, we focus on the more practical and challenging GZSL setting. We believe the obtained increase in performance by our model can be explained as follows:

Since our model does not use any CNN features, i.e. we generate -dimensional latent features for all classes, it achieves a better balance between seen and unseen class accuracies than feature-generating approaches, especially on CUB.

In addition, in CADA-VAE a shared representation is learned in a semi-supervised fashion, through a cross-reconstruction objective. Since the latent features have to be decoded into every involved modality, and since every modality encodes complementary information, the model is encouraged to learn an encoding that retains the information contained in all used modalities. In doing so, our method is less biased towards learning the distribution of the seen class image features, which is known as the projection domain shift problem [6]. As we generate a certain number of latent features per class using non-deterministic encoders, our method is also akin to data-generating approaches. However, the learned representations lie in a lower dimensional space, i.e. only , and therefore, are less prone to bias towards the training set of image features. In effect, our training is more stable than the adversarial training schemes used for data generation [35]. In fact, we did not conduct any dataset specific parameter tuning, and all the parameters are the same for all datasets.

Generalized Few-Shot Learning. We evaluate our models by using zero, one, two, five and ten shots in the generalized few-shot learning (GFSL) setting on four datasets. We compare our results with the most similar published work in this domain, i.e. ReViSE [27]. Figure 6 reports the performance of all variants of our model (CA-VAE, DA-VAE, CADA-VAE) compared to ReViSE. Our overall observation is as follows. Our latent representations learned from the side information and a small amount of visual information improve significantly over the generalized zero-shot setting, even when only a few labeled samples are included.

Specifically, we observe that adding a single latent feature from unseen classes to the training set improves the accuracy by -, depending on the dataset. While on CUB the accuracy improvement from -shot to -shots is , on AWA1 and AWA2 this improvement reaches . Moreover, while the harmonic mean accuracy increases with the number of shots in both methods, all variants of our method outperform the baseline by a large margin across all the datasets indicating the generalization capability of our method to the GFSL setting.

Furthermore, similar to the GZSL scenario, on the fine-grained CUB and SUN datasets, CADA-VAE reaches the highest performance where it is followed by CA-VAE and DA-VAE, respectively. However, on AWA1 and AWA2 the difference between different models is not significant. We associate this with the fact that as AWA1 and AWA2 datasets are coarse-grained datasets, the image features are already discriminative. Hence, aligning the latent space with attributes does not lead to a significant difference.

Figure 7: ImageNet results on GZSL. We report the top-1 accuracy for unseen classes. Both f-CLSWGAN and CADA-VAE use a linear softmax classifier.

4.3 ImageNet Experiments

The ImageNet dataset serves as a challenging testbed for generalized zero-shot learning research. In [34] several evaluation splits were proposed with increasing granularity and size both in terms of the number of classes and the number of images. Note that since all images of the seen training classes are used to train the ResNet feature extractor, measuring seen class accuracies would be biased. However, we can still evaluate the accuracy of unseen class images in the GZSL search space that contains both seen and unseen classes. Hence, at test time the seen classes act as distractors. This way, we can measure the transferability of our latent representations to completely unseen classes, i.e. classes that are seen neither during ResNet training nor during CADA-VAE training. For ImageNet, as attributes are not available, we use Word2Vec features as class embeddings provided by [4]. We compare our model with f-CLSWGAN [35], i.e. an image feature generating framework which currently achieves the state of the art on ImageNet. We use the same evaluation protocol on all the splits. Among the splits, 2H and 3H are the classes 2 or 3 hops away from the seen training classes of ImageNet according to the ImageNet hierarchy. M500, M1K and M5K are the 500, 1K and 5K most populated classes, while L500, L1K and L5K are the 500, 1K and 5K least populated classes that come from the rest of the classes. Finally, 'All' denotes the remaining classes of ImageNet.

As shown in Figure 7, our model significantly improves the state of the art in all the available splits. The accuracy improvement is significant especially on the M500 and M1K splits, i.e. for M500 the search space is classes, for M1K, the search space consists of classes. For the , and splits, there are on average only , and images per class available [34]. Note that the test time search space in the 'All' split is dimensional. Hence even a small improvement in accuracy on this split is considered to be compelling. The achieved substantial increase in performance by CADA-VAE shows that our -dim latent feature space constitutes a robust generalizable representation, surpassing the current state-of-the-art image feature generating framework f-CLSWGAN.

5 Conclusion

In this work, we propose CADA-VAE, a cross-modal embedding framework for generalized zero- and few-shot learning. In CADA-VAE, we train a VAE for both visual and semantic modalities. The VAE of each modality has to jointly represent the information embodied by all modalities in its latent space. The corresponding latent distributions are aligned by minimizing their Wasserstein distance. This procedure leaves us with encoders that can encode features from different modalities into one cross-modal embedding space, in which a linear softmax classifier can be trained. We present different variants of cross-aligned and distribution aligned VAEs and establish new state of the art results in generalized zero-shot learning for four medium-scale benchmark datasets as well as the large-scale ImageNet. We further show that a cross-modal embedding model for generalized zero-shot learning achieves better performance than data-generating methods, establishing the new state of the art.