1 Introduction
Most approaches to generalized zero-shot learning (GZSL), where no labeled training examples are available for some of the classes, learn a mapping between images and their class embeddings [5, 1, 18, 33, 2]. Among these, ALE [1] maps CNN features of images to a per-class attribute space, whereas DeViSE [5] uses a word embedding space of the class labels learned from the English-language Wikipedia. Although [2] combines attributes and word embeddings, a separate mapping needs to be learned for each modality.
An orthogonal approach to GZSL is to augment data by generating artificial images [21]. However, due to the level of detail missing in synthetic images, CNN features extracted from them do not improve classification accuracy. To alleviate this issue,
[35] proposed to generate image features via a conditional WGAN, which simplifies the task of the generative model and directly optimizes the loss on image features. Similarly, [16, 30] use conditional variational autoencoders (VAEs) for this purpose. A complementary idea, proposed by [27], is to learn a cross-modal embedding between image features and class embeddings in a latent embedding space. For instance, [27] proposed to transform both modalities to the latent spaces of autoencoders and match the corresponding distributions by minimizing the Maximum Mean Discrepancy (MMD). Learning such cross-modal embeddings is beneficial for potential downstream tasks that require multimodal fusion, e.g. visual question answering. In this domain, [20] recently used a cross-modal autoencoder to extend visual question answering to previously unseen objects.

Although recent cross-modal autoencoder architectures represent class prototypes in a latent space [17, 27], better generalization can be achieved if the shared representation space is amenable to interpolation between different classes. Variational Autoencoders (VAEs) are known for accurate interpolation between representations in their latent space, e.g. as demonstrated for sentence interpolation [3] and image interpolation [10]. Hence, in this work, we train VAEs to encode and decode features from different modalities, e.g. images and class attributes, and use the learned latent features to train a generalized zero-shot learning classifier. Our latent representations are aligned by matching their parametrized distributions and by enforcing a cross-modal reconstruction criterion. Consequently, by explicitly enforcing alignment both in the latent features and in the distributions of latent features learned from different modalities, the VAEs learn to adapt to unseen classes without forgetting the previously seen ones.

Our contributions in this work are as follows. (1) We propose a model that learns shared cross-modal latent representations of multiple data modalities using simple VAEs via distribution-alignment and cross-alignment objectives. (2) We extensively evaluate our model on conventional benchmark datasets, i.e. CUB, SUN, AWA1 and AWA2, developed for zero-shot learning and extended to few-shot learning. Our model establishes the new state-of-the-art performance in generalized zero-shot and few-shot learning settings on all these datasets. Furthermore, we show that our model can easily be extended to more than two modalities trained simultaneously. (3) Finally, we show that the latent features learned by our model improve the state of the art on the truly large-scale ImageNet dataset in all splits for the generalized zero-shot learning task.
2 Related Work
In this section, we present related work on generalized zero-shot and few-shot learning, generative models for learning representations, and cross-modal reconstruction.
Generalized Zero- and Few-Shot Learning.
In the classic zero-shot learning setting, training and test classes are disjoint, with attributes available at training time, and the performance of a method is judged solely on its classification accuracy on the novel, i.e. unseen, classes. Generalized zero-shot learning is a more realistic variant: the same information is available at training time, but the performance of the model is judged on the harmonic mean of the classification accuracies on seen and unseen classes.
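The harmonic mean criterion can be made concrete with a small sketch (the function and the accuracy values below are illustrative, not taken from the paper):

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """Harmonic mean H of seen (S) and unseen (U) per-class accuracies."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# The harmonic mean penalizes imbalance: a model that ignores unseen
# classes entirely scores H = 0, unlike the arithmetic mean.
print(harmonic_mean(0.8, 0.4))   # around 0.53
print(harmonic_mean(0.9, 0.0))   # 0.0
```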
In few-shot learning, the training setup is similar to that of (generalized) zero-shot learning, with the exception that a few examples of the previously unseen classes are provided at training time [28, 24, 9, 31]. Using auxiliary information for few-shot learning was introduced in [28], where attributes related to images were used to improve the performance of the model. The use of auxiliary information was also explored in ReViSE [27], which learns a common image-label semantic space for transductive few-shot learning. Analogous to the relation between zero-shot and generalized zero-shot learning, we extend few-shot learning to the generalized few-shot learning (GFSL) setting, in which we evaluate the model on both seen and unseen classes.
Generative Models for Learning Representations. In the context of generalized zero-shot learning, generative models are used for data augmentation [35, 16, 30] and latent representation learning [17, 27]. Approaches using data augmentation treat generalized zero-shot learning as a missing data problem and train conditional GANs [35] or conditional VAEs to generate image features or images for unseen classes from semantic side-information. Generative methods for representation learning for GZSL are based on autoencoders, such as ReViSE [27] and DMAE [17], which learn to jointly represent features from different modalities in their latent space. By making use of autoencoders, it is possible to learn representations of visual and semantic information in a semi-supervised fashion. Learning a joint representation for visual and semantic data is achieved by aligning the latent distributions between different data types. ReViSE implements this distribution alignment by minimizing the maximum mean discrepancy between the two latent distributions [8]. DMAE aligns distributions by minimizing the Squared-Loss Mutual Information [26]. In this work, we use Variational Autoencoders instead, and align the latent distributions by minimizing their Wasserstein distance. In contrast to [17] and [27], we also enforce a cross-reconstruction loss by decoding every encoded feature into every other modality.
Cross-Reconstruction in Generative Models. Reconstructing data across domains, referred to as cross-alignment, is commonly used in the field of domain adaptation. While models like CycleGAN [36] learn to generate data across domains directly, latent space models use cross-reconstruction to capture the common information contained in both domains in their intermediate latent representations [29]. In this regard, cross-aligned VAEs have previously been used for text-style transfer [23] and image-to-image translation [14]. In [23], a cross-aligned VAE ensures that the latent representations of texts from different input domains are similar, while in [14] a comparable approach was developed to match the latent representations of images from different domains. Both methods have in common that they use a variant of VAEs with an adversarial loss. Additionally, [23] makes use of conditional encoders and decoders, while [14] enforces cycle consistency and weight sharing. In this paper, on the other hand, our building blocks are unconditional VAEs, and we achieve multimodal alignment via cross-reconstruction and latent distribution alignment in a highly reduced space.

3 Model
Among existing generalized zero-shot learning models, recent data-generating approaches [35, 30, 16] achieve superior performance over other methods on disjoint datasets. For some approaches [35], the edge in performance comes at the cost of an unstable training procedure. On the other hand, classifying generated images or image features from conditional VAEs [30, 16] runs the risk of being compromised by the curse of dimensionality. The main insight of our proposed model is that, instead of generating images or image features, we generate low-dimensional latent features, achieving both stable training and state-of-the-art performance. Hence, the key to our approach is the choice of a VAE latent space, a reconstruction and cross-reconstruction criterion to preserve class-discriminative information in lower dimensions, as well as explicit distribution alignment to encourage domain-agnostic representations.
3.1 Background
We first provide background on the task, i.e. generalized zero-shot learning, and on our model's building block, i.e. the variational autoencoder.
Generalized Zero-shot Learning. The task definition of generalized zero-shot learning is as follows. Let $S = \{(x, y, c(y)) \mid x \in X, y \in Y^s, c(y) \in C\}$ be a set of training examples, consisting of image features $x$, e.g. extracted by a CNN, class labels $y$ available during training, and class embeddings $c(y)$. Typical class embeddings are vectors of hand-annotated continuous attributes or Word2Vec features [15]. In addition, an auxiliary training set $\{(u, c(u)) \mid u \in Y^u, c(u) \in C\}$ is used, where $u$ denotes unseen classes from a set $Y^u$ which is disjoint from $Y^s$. Here, $c(Y^u)$ is the set of class embeddings of unseen classes. In the legacy challenge of ZSL, the task is to learn a classifier $f_{ZSL}: X \rightarrow Y^u$. However, in this work, we focus on the more realistic and challenging setup of generalized zero-shot learning (GZSL), where the aim is to learn a classifier $f_{GZSL}: X \rightarrow Y^s \cup Y^u$.

Variational Autoencoder (VAE). The basic building block of our model is the variational autoencoder (VAE) [12]. Variational inference aims at finding the true conditional probability distribution over the latent variables, $p_\theta(z|x)$. Due to the intractability of this distribution, it is approximated by its closest proxy posterior, $q_\phi(z|x)$, by minimizing their distance using a variational lower bound. The objective function of a VAE is the variational lower bound on the marginal likelihood of a given datapoint and can be formulated as:

$$\mathcal{L} = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z)\big) \quad (1)$$

where the first term is the reconstruction error and the second term is the Kullback-Leibler divergence between the inference model $q_\phi(z|x)$ and the prior $p_\theta(z)$. A common choice for the prior is a multivariate standard Gaussian distribution. The encoder predicts $\mu$ and $\Sigma$ such that $q_\phi(z|x) = \mathcal{N}(\mu, \Sigma)$, from which a latent vector $z$ is generated via the reparametrization trick [12].

3.2 Cross- and Distribution-Aligned VAE
The goal of our model is to learn representations within a common space parameterized by a combination of data modalities. Hence, our model includes $M$ encoders, one for every modality, to map into this representation space. To minimize the loss of information, the original data must be reconstructed via the decoder networks. In effect, the basic VAE loss of our model is the sum of the VAE losses over the $M$ VAEs:
$$\mathcal{L}_{VAE} = \sum_{i=1}^{M} \mathbb{E}_{q_\phi(z|x^{(i)})}\big[\log p_\theta(x^{(i)}|z)\big] - \beta\, D_{KL}\big(q_\phi(z|x^{(i)}) \,\|\, p_\theta(z)\big) \quad (2)$$

where $\beta$ weights the KL-divergence [10]. In the case of matching image features with class embeddings, $M = 2$, with $x^{(1)}$ the image feature and $x^{(2)}$ the class embedding. However, in order to ensure that the modality-specific autoencoders learn similar representations for features of the same classes, additional regularization terms are necessary. For this reason, our model aligns the latent distributions explicitly and enforces a cross-reconstruction criterion. In Figure 2 we show an overview of our model, depicting these two forms of latent distribution matching, which we refer to as Cross-Alignment (CA) and Distribution-Alignment (DA).
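As a concrete illustration of the two ingredients of the per-modality VAE loss, the following NumPy sketch shows the reparametrization trick and the closed-form KL term for a diagonal Gaussian posterior (the toy encoder outputs and the latent dimensionality are our own placeholders, not the paper's):

```python
import numpy as np

def reparametrize(mu, log_sigma, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the
    sampling step differentiable w.r.t. the encoder outputs [12]."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian q,
    i.e. the term weighted by beta in Eq. (2)."""
    return 0.5 * np.sum(mu ** 2 + np.exp(2 * log_sigma) - 1.0 - 2 * log_sigma)

rng = np.random.default_rng(0)
mu, log_sigma = np.zeros(64), np.zeros(64)        # toy encoder outputs
z = reparametrize(mu, log_sigma, rng)             # one latent feature
assert kl_to_standard_normal(mu, log_sigma) == 0.0  # q is already N(0, I)
```

In training, the reconstruction term of Eq. (2) would be computed from the decoder output for this sampled $z$.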
Cross-Alignment (CA) Loss. In cross-alignment, the reconstruction is obtained by decoding the latent encoding of a sample from another modality but the same class. In essence, every modality-specific decoder is trained on the latent vectors derived from the other modalities. The loss for these cross-reconstructions is computed as:
$$\mathcal{L}_{CA} = \sum_{i}^{M} \sum_{j \neq i}^{M} \big| x^{(j)} - D_j\big(E_i(x^{(i)})\big) \big| \quad (3)$$

where $E_i$ is the encoder of a feature of modality $i$ and $D_j$ is the decoder of a feature belonging to the same class but modality $j$.
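A minimal sketch of this cross-reconstruction loop, with plain callables standing in for the encoder and decoder networks (the identity functions in the check are purely illustrative, not the actual networks):

```python
import numpy as np

def cross_alignment_loss(feats, encoders, decoders):
    """L1 cross-reconstruction loss of Eq. (3): the latent code of each
    modality is decoded by every *other* modality's decoder.
    `feats[i]` is a same-class feature from modality i; `encoders` and
    `decoders` are callables standing in for the actual networks."""
    total, M = 0.0, len(feats)
    for i in range(M):
        z = encoders[i](feats[i])           # E_i(x^(i))
        for j in range(M):
            if j != i:
                total += np.abs(feats[j] - decoders[j](z)).sum()
    return total

# Toy check with identity encoders/decoders and identical features:
ident = lambda x: x
x = np.ones(8)
assert cross_alignment_loss([x, x], [ident, ident], [ident, ident]) == 0.0
```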
Distribution-Alignment (DA) Loss. Another way of matching latent representations across modalities is to minimize the distance between the two distributions. Here, we minimize the Wasserstein distance between the predicted latent multivariate Gaussian distributions. In the case of multivariate Gaussians, a closed-form solution of the Wasserstein distance [7] between two distributions $\mathcal{N}(\mu_i, \Sigma_i)$ and $\mathcal{N}(\mu_j, \Sigma_j)$ is given as:
$$W_{ij} = \Big[\, \|\mu_i - \mu_j\|_2^2 + \mathrm{Tr}\big(\Sigma_i + \Sigma_j - 2\,(\Sigma_j^{1/2}\, \Sigma_i\, \Sigma_j^{1/2})^{1/2}\big) \Big]^{1/2} \quad (4)$$
Since our network predicts diagonal covariance matrices, which are commutative, the Wasserstein distance simplifies to the following:
$$W_{ij} = \Big( \|\mu_i - \mu_j\|_2^2 + \big\| \Sigma_i^{1/2} - \Sigma_j^{1/2} \big\|_F^2 \Big)^{1/2} \quad (5)$$
and the Distribution-Alignment (DA) loss for an $M$-tuple is written as:
$$\mathcal{L}_{DA} = \sum_{i=1}^{M} \sum_{j=i+1}^{M} W_{ij} \quad (6)$$
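The simplified distance of Eq. (5) and the pairwise sum of Eq. (6) can be sketched as follows (the test vectors are our own illustrative values):

```python
import numpy as np

def w2_diag(mu_i, std_i, mu_j, std_j):
    """Closed-form 2-Wasserstein distance between Gaussians with
    diagonal (hence commuting) covariances, as in Eq. (5)."""
    return np.sqrt(np.sum((mu_i - mu_j) ** 2) + np.sum((std_i - std_j) ** 2))

def da_loss(mus, stds):
    """Distribution-Alignment loss of Eq. (6): sum of pairwise
    Wasserstein distances over the M predicted latent distributions."""
    M = len(mus)
    return sum(w2_diag(mus[i], stds[i], mus[j], stds[j])
               for i in range(M) for j in range(i + 1, M))

# Identical distributions incur no penalty; a pure mean shift of (3, 4)
# with equal covariances reduces to the Euclidean distance 5:
mu_a, mu_b = np.array([0., 0.]), np.array([3., 4.])
std = np.ones(2)
assert da_loss([mu_a, mu_a], [std, std]) == 0.0
assert abs(w2_diag(mu_a, std, mu_b, std) - 5.0) < 1e-12
```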
Cross- and Distribution-Alignment (CADA-VAE) Loss. In our cross- and distribution-aligned VAE model, we combine the basic VAE loss with $\mathcal{L}_{CA}$ (CA-VAE) and $\mathcal{L}_{DA}$ (DA-VAE). Our final model combines them all, i.e. $\mathcal{L}_{CA}$ and $\mathcal{L}_{DA}$ (CADA-VAE), leading to the following objective:
$$\mathcal{L}_{CADA\text{-}VAE} = \mathcal{L}_{VAE} + \gamma\, \mathcal{L}_{CA} + \delta\, \mathcal{L}_{DA} \quad (7)$$
where $\gamma$ and $\delta$ are the weighting factors of the cross-alignment and the distribution-alignment loss, respectively. We show in Section 4.1 that our model can learn shared multimodal embeddings of more than two modalities, without requiring that examples of all modalities be available for all classes.
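Combining the three terms is then a weighted sum; a trivial sketch (the loss values and weights below are placeholders):

```python
def cada_vae_objective(l_vae, l_ca, l_da, gamma, delta):
    """Eq. (7): the basic VAE loss plus the weighted cross-alignment
    and distribution-alignment terms."""
    return l_vae + gamma * l_ca + delta * l_da

# gamma = delta = 0 recovers the plain (unaligned) VAE objective:
assert cada_vae_objective(1.5, 2.0, 3.0, 0.0, 0.0) == 1.5
```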
Implementation Details.
Here we explain the details of our experimental setup used across both (G)ZSL and (G)FSL experiments. All our encoders and decoders are multilayer perceptrons with one hidden layer. For our proposed CADA-VAE, we use a hidden layer of size for the image feature encoder and a layer of size for the decoder. The attribute net consists of an encoder layer of size and a decoder layer of dimensions. Finally, we set the latent embedding size to . Note that for ImageNet, we chose a larger latent size of , and use two hidden layers of the same size for the encoder, using the number of hidden units specified above. Furthermore, the image feature decoder for ImageNet has two hidden layers of size and , while the attribute decoder uses and units. The model is trained for epochs by stochastic gradient descent using the Adam optimizer [11] and a batch size of . A batch size of is used for the ImageNet experiments.

After the individual VAEs learn to encode features of their specific datatype, we compute the cross- and distribution-alignment losses. $\gamma$ is increased from epoch to epoch by a rate of per epoch, while $\delta$ is increased from epoch to by per epoch. For the KL-divergence we use an annealing scheme [3], in which we increase the weight of the KL-divergence by a rate of per epoch until epoch . A KL-annealing scheme serves the purpose of first letting the VAE learn "useful" representations before they are "smoothed" out, since the KL-divergence would otherwise be a very strong regularizer [3]. We empirically found it useful to use a variant of the reparametrization trick [12] in which all dimensions of the noise vector are sampled from a single unimodal Gaussian. Also, using the L1 distance as the reconstruction error appeared to yield slightly better results than L2. After training, the VAE encoders are used to transform the training and test sets of the final classifier into the latent space. Finally, a linear softmax classifier is trained and tested on the latent features.
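The linear warm-up schedules described above can be sketched as a single helper; the concrete epochs and rates below are illustrative placeholders, not the tuned values used in the experiments:

```python
def warmup_weight(epoch, start, stop, rate):
    """Linear warm-up for a loss weight (e.g. gamma, delta or the KL
    weight): zero before `start`, growing by `rate` per epoch until
    `stop`, then held constant."""
    return rate * max(0, min(epoch, stop) - start)

# A weight ramping up between epoch 5 and epoch 25 at 0.05 per epoch:
assert warmup_weight(0, 5, 25, 0.05) == 0.0    # before the ramp
assert warmup_weight(15, 5, 25, 0.05) == 0.5   # halfway up
assert warmup_weight(40, 5, 25, 0.05) == 1.0   # saturated
```

Deferring and slowly ramping the alignment and KL weights lets each VAE first learn useful per-modality representations before the regularizers take hold.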
4 Experiments
We evaluate our framework on the zero-shot learning benchmark datasets CUB-200-2011 [32] (CUB), SUN Attribute (SUN) [19], and Animals with Attributes 1 and 2 (AWA1 [13], AWA2 [34]) in the generalized zero-shot learning (GZSL) and generalized few-shot learning (GFSL) settings. All image features used for training the VAEs are extracted from the final pooling layer of a ResNet. To avoid violating the zero-shot assumption, i.e. test classes need to be disjoint from the classes the ResNet was trained with, we use the training splits proposed in [34]. As class embeddings, attribute vectors are utilized where available. For CUB, we also use the sentence embeddings provided by [21], and for ImageNet we use the Word2Vec [15] embeddings provided by [4]. All hyperparameters were chosen on a validation set provided by
[34]. We report the harmonic mean (H) between seen (S) and unseen (U) average per-class accuracy, i.e. the Top-1 accuracy is averaged on a per-class basis.

Table 1: Comparing model variants on CUB.
Model  S  U  H
DA-VAE
CA-VAE
CADA-VAE
4.1 Analyzing CADA-VAE in Detail on CUB
In this section, we analyze several building blocks of our proposed framework, such as the model variants, the choice of class embeddings, as well as the size and number of latent embeddings generated by our model in the generalized zero-shot learning setting.
Analyzing Model Variants. In this ablation study for GZSL, we present results for the different objective functions and the corresponding VAE variants, CA-VAE (cross-aligned VAE), DA-VAE (distribution-aligned VAE) and CADA-VAE (cross- and distribution-aligned VAE), on the CUB dataset. As shown in Table 1, the cross-alignment objective offers a noticeable improvement in performance over distribution alignment alone. Combining the distribution-alignment and cross-alignment objectives, i.e. CADA-VAE, increases the harmonic mean accuracy further. This experiment shows that aligning both the latent representations and the latent distributions is complementary, since their combination leads to the highest result on seen classes, on unseen classes, and on their harmonic mean in the GZSL setting.
Analyzing Side Information. In sparse data regimes, semantic representations of the classes, i.e. class embeddings, are as important as the image embeddings. We compare the results obtained with per-class attributes, per-class sentences and class-based Word2Vec representations. Our results in Figure 3 (left) show that per-class sentence embeddings result in the best performance among all three, with attributes following closely. With Word2Vec, the difference between the seen and the unseen class accuracy is significant, indicating that the latent representations learned from Word2Vec generalize less well to unseen classes. This is expected, given that Word2Vec features do not explicitly or exclusively represent visual characteristics. In summary, these results demonstrate that our model is able to learn transferable latent features from various sources of side information. The results also show that latent features learned with more discriminative class embeddings lead to better overall accuracy.
Next, we investigate one of the most prominent aspects of our model, i.e. its ability to handle missing side information. We train CADA-VAE such that a fraction of seen-class image features are paired with sentence embeddings, while the remaining seen classes are paired with attributes. We also vary the fraction of unseen classes that are learned from sentence features (the rest of the unseen classes having image features paired only with attributes).

Figure 3 (right) shows the results using different fractions of sentence embeddings and attribute embeddings for seen and unseen classes. When both seen and unseen classes have access to sentences and attributes equally half the time, we reach the highest accuracy. Interestingly, with no sentences for seen classes but 50% attributes and 50% sentences for unseen classes, the accuracy is , while with no sentences for unseen classes but 50% attributes and 50% sentences for seen classes, the accuracy is . On the other hand, with 50% attributes and 50% sentences for seen classes but 100% sentences for unseen classes, the accuracy is , whereas with 50% attributes and 50% sentences for unseen classes but 100% sentences for seen classes, the accuracy is . These results indicate that sentences have the edge over attributes. However, when either sentences or attributes are not available, our model can recover the missing information from the other modality and still learn discriminative representations.
Increasing the Number of Latent Dimensions. In this analysis, we explore the robustness of our method to the dimensionality of the latent space. We report the harmonic mean accuracy of CADA-VAE with respect to different latent dimensionalities on CUB. We observe in Figure 4 that the accuracy initially increases with the dimensionality until it reaches its peak, flattens, and then declines upon further increase of the latent dimension. We conclude from these experiments that the most discriminative properties of the two modalities are captured at an intermediate latent dimensionality, which, for efficiency reasons, we use for the rest of the paper.
Increasing the Number of Latent Features. Our model can be used to generate an arbitrarily large number of latent features. In this experiment, we vary the number of latent features per class on CUB in the GZSL setting and reach the optimum performance beyond a certain number of latent features per seen class (Figure 5, left). In principle, seen and unseen classes do not need to have the same number of samples. Hence, we also vary the number of features per seen and unseen class. Indeed, the best accuracy is achieved when there are approximately twice as many features per unseen class as per seen class. Hence, generating more features for underrepresented classes is important for better accuracy.
For our results in Figure 5 (right), we build a dynamic training set by generating latent features continuously at every iteration and do not use any sample more than once. Hence, we eliminate one of the tunable parameters, i.e. the number of latent features to generate. Because of the non-deterministic mapping of the VAE encoder, every generated latent feature is unique. Our results indicate that the best accuracy is achieved when unseen and seen class samples are equally balanced. On CUB, using a dynamic training set reaches the same performance as using a fixed dataset. On the other hand, using a fixed dataset leads to a faster training procedure. Hence, we use a fixed dataset with a fixed number of examples per seen class and per unseen class in every benchmark reported in this paper.
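A sketch of how a fixed classifier training set could be assembled from the non-deterministic encoders, oversampling the unseen classes; all dimensions, sample counts and encoder outputs here are hypothetical stand-ins:

```python
import numpy as np

def sample_latent_features(mu, log_sigma, n, rng):
    """Draw n latent features from one encoded input: because the
    encoder is non-deterministic, every draw is unique, so
    underrepresented (unseen) classes can simply be oversampled."""
    eps = rng.standard_normal((n, mu.shape[0]))
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
# Hypothetical encoder outputs for one seen-class image feature and one
# unseen-class attribute vector (dimensions and counts are ours):
seen = sample_latent_features(np.zeros(64), np.zeros(64), 200, rng)
unseen = sample_latent_features(np.ones(64), np.full(64, -1.0), 400, rng)
train_X = np.vstack([seen, unseen])   # feeds the linear softmax classifier
assert train_X.shape == (600, 64)
```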
Table 2: Comparing CADA-VAE with the state of the art on CUB, SUN, AWA1 and AWA2.
CUB  SUN  AWA1  AWA2
Model  Feature Size  S  U  H  S  U  H  S  U  H  S  U  H
CMT [25]  
SJE [2]  
ALE [1]  
LATEM [33]  
EZSL [22]  
SYNC [4]  
DeViSE [5]  
f-CLSWGAN [35]  
CVAE [16]  –  –  –  –  –  –  –  –  
SE [30]  
ReViSE [27]  
ours (CADA-VAE)
4.2 Comparing CADA-VAE on Benchmark Datasets
In this section, we compare our CADA-VAE on four benchmark datasets in the generalized zero-shot learning and generalized few-shot learning settings.
Generalized Zero-Shot Learning. We compare our model with 11 state-of-the-art models. Among those, CVAE [16], SE [30], and f-CLSWGAN [35] learn to generate artificial visual data and thereby treat the zero-shot problem as a data-augmentation problem. On the other hand, the classic ZSL methods DeViSE [5], SJE [2], ALE [1], EZSL [22] and LATEM [33] use a linear compatibility function or other similarity metrics to compare embedded visual and semantic features; CMT [25] and LATEM [33] utilize multiple neural networks to learn a non-linear embedding; and SYNC [4] learns by aligning a class embedding space and a weighted bipartite graph. ReViSE [27] proposes shared latent manifold learning using an autoencoder between image features and class attributes. We report results on the CUB, SUN, AWA1 and AWA2 datasets.

The results in Table 2 show that our CADA-VAE outperforms all other methods on all datasets, including the closest baseline, ReViSE [27]. Moreover, our model achieves significant improvements over feature-generating models, most notably on CUB. Compared to the classic ZSL methods, our method leads to a clear improvement in harmonic mean accuracies. In the legacy challenge of the ZSL setting, which is hardly realistic, our CADA-VAE provides competitive performance on CUB, SUN, AWA1 and AWA2. However, in this work, we focus on the more practical and challenging GZSL setting.
We believe the performance increase obtained by our model can be explained as follows. Since our model does not use any CNN features for classification, i.e. we generate low-dimensional latent features for all classes, it achieves a better balance between seen and unseen class accuracies than feature-generating approaches, especially on CUB.

In addition, in CADA-VAE a shared representation is learned in a semi-supervised fashion, through a cross-reconstruction objective. Since the latent features have to be decoded into every involved modality, and since every modality encodes complementary information, the model is encouraged to learn an encoding that retains the information contained in all used modalities. In doing so, our method is less biased towards learning the distribution of the seen-class image features, a problem known as the projection domain shift [6]. As we generate a certain number of latent features per class using non-deterministic encoders, our method is also akin to data-generating approaches. However, the learned representations lie in a lower-dimensional space and are therefore less prone to bias towards the training set of image features. In effect, our training is more stable than the adversarial training schemes used for data generation [35]. In fact, we did not conduct any dataset-specific parameter tuning, and all parameters are the same for all datasets.
Generalized Few-Shot Learning. We evaluate our models using zero, one, two, five and ten shots in the generalized few-shot learning (GFSL) setting on four datasets. We compare our results with the most similar published work in this domain, i.e. ReViSE [27]. Figure 6 reports the performance of all variants of our model (CA-VAE, DA-VAE, CADA-VAE) compared to ReViSE. Our overall observation is that our latent representations, learned from the side information and a small amount of visual information, improve significantly over the generalized zero-shot setting even when only a few labeled samples are included.
Specifically, we observe that adding a single latent feature from unseen classes to the training set already improves the accuracy, depending on the dataset, with the improvement from one to two shots differing between CUB and AWA1/AWA2. Moreover, while the harmonic mean accuracy increases with the number of shots for both methods, all variants of our method outperform the baseline by a large margin across all datasets, indicating the generalization capability of our method in the GFSL setting.
Furthermore, similar to the GZSL scenario, on the fine-grained CUB and SUN datasets, CADA-VAE reaches the highest performance, followed by CA-VAE and DA-VAE, respectively. However, on AWA1 and AWA2 the difference between the models is not significant. We attribute this to the fact that AWA1 and AWA2 are coarse-grained datasets, whose image features are already discriminative. Hence, aligning the latent space with attributes does not lead to a significant difference.
4.3 ImageNet Experiments
The ImageNet dataset serves as a challenging testbed for generalized zero-shot learning research. In [34], several evaluation splits were proposed with increasing granularity and size, both in terms of the number of classes and the number of images. Note that since all images of the seen classes are used to train the ResNet, measuring seen-class accuracies would be biased. However, we can still evaluate the accuracy of unseen-class images in the GZSL search space that contains both seen and unseen classes. Hence, at test time the seen classes act as distractors. This way, we can measure the transferability of our latent representations to completely unseen classes, i.e. classes that are seen neither during ResNet training nor during CADA-VAE training. For ImageNet, as attributes are not available, we use the Word2Vec features provided by [4] as class embeddings. We compare our model with f-CLSWGAN [35], an image feature generating framework which currently achieves the state of the art on ImageNet. We use the same evaluation protocol on all splits. Among the splits, 2H and 3H contain the classes 2 or 3 hops away from the seen training classes of ImageNet according to the ImageNet hierarchy. M500, M1K and M5K are the 500, 1K and 5K most populated classes, while L500, L1K and L5K are the 500, 1K and 5K least populated classes from the rest of the classes. Finally, 'All' denotes the remaining classes of ImageNet.
As shown in Figure 7, our model significantly improves the state of the art on all available splits. The accuracy improvement is especially significant on the M500 and M1K splits. For the least populated splits, only few images per class are available on average [34]. Note that the test-time search space in the 'All' split is very large, hence even a small improvement in accuracy on this split is compelling. The substantial increase in performance achieved by CADA-VAE shows that our low-dimensional latent feature space constitutes a robust, generalizable representation, surpassing the current state-of-the-art image feature generating framework f-CLSWGAN.
5 Conclusion
In this work, we propose CADA-VAE, a cross-modal embedding framework for generalized zero- and few-shot learning. In CADA-VAE, we train a VAE for both visual and semantic modalities. The VAE of each modality has to jointly represent the information embodied by all modalities in its latent space. The corresponding latent distributions are aligned by minimizing their Wasserstein distance. This procedure leaves us with encoders that can encode features from different modalities into one cross-modal embedding space, in which a linear softmax classifier can be trained. We present different variants of cross-aligned and distribution-aligned VAEs and establish new state-of-the-art results in generalized zero-shot learning on four medium-scale benchmark datasets as well as on the large-scale ImageNet. We further show that a cross-modal embedding model for generalized zero-shot learning achieves better performance than data-generating methods.
References
[1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for image classification. IEEE TPAMI, 38(7):1425–1438, 2016.
[2] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In IEEE CVPR, pages 2927–2936, 2015.
[3] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
[4] S. Changpinyo, W.-L. Chao, B. Gong, and F. Sha. Synthesized classifiers for zero-shot learning. In IEEE CVPR, pages 5327–5336, 2016.
[5] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121–2129, 2013.
[6] Y. Fu, T. M. Hospedales, T. Xiang, Z. Fu, and S. Gong. Transductive multi-view embedding for zero-shot recognition and annotation. In ECCV, pages 584–599. Springer, 2014.
[7] C. R. Givens, R. M. Shortt, et al. A class of Wasserstein metrics for probability distributions. The Michigan Mathematical Journal, 31(2):231–240, 1984.
[8] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
[9] B. Hariharan and R. B. Girshick. Low-shot visual recognition by shrinking and hallucinating features. In ICCV, pages 3037–3046, 2017.
[10] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.
[11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[13] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In IEEE CVPR, pages 951–958. IEEE, 2009.
[14] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In NIPS, pages 700–708, 2017.
[15] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[16] A. Mishra, M. Reddy, A. Mittal, and H. A. Murthy. A generative model for zero shot learning using conditional variational autoencoders. arXiv preprint arXiv:1709.00663, 2017.
[17] T. Mukherjee, M. Yamada, and T. M. Hospedales. Deep matching autoencoders. arXiv preprint arXiv:1711.06047, 2017.
[18] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
[19] G. Patterson and J. Hays. SUN attribute database: Discovering, annotating, and recognizing scene attributes. In IEEE CVPR, pages 2751–2758. IEEE, 2012.
[20] S. K. Ramakrishnan, A. Pal, G. Sharma, and A. Mittal. An empirical evaluation of visual question answering for novel objects. arXiv preprint arXiv:1704.02516, 2017.
[21] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In IEEE CVPR, pages 49–58, 2016.
[22] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In ICML, pages 2152–2161, 2015.
[23] T. Shen, T. Lei, R. Barzilay, and T. Jaakkola. Style transfer from non-parallel text by cross-alignment. In NIPS, pages 6830–6841, 2017.
[24] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In NIPS, pages 4077–4087, 2017.
[25] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In NIPS, pages 935–943, 2013.
[26] T. Suzuki and M. Sugiyama. Sufficient dimension reduction via squared-loss mutual information estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 804–811, 2010.
[27] Y.-H. H. Tsai, L.-K. Huang, and R. Salakhutdinov. Learning robust visual-semantic embeddings. arXiv preprint arXiv:1703.05908, 2017.
[28] Y.-H. H. Tsai and R. Salakhutdinov. Improving one-shot learning through fusing side information. arXiv preprint arXiv:1710.08347, 2017.
[29] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In IEEE CVPR, volume 1, page 4, 2017.
[30] V. K. Verma, G. Arora, A. Mishra, and P. Rai. Generalized zero-shot learning via synthesized examples. arXiv preprint arXiv:1712.03878, 2017.
[31] Y.-X. Wang, R. Girshick, M. Hebert, and B. Hariharan. Low-shot learning from imaginary data.
[32] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. 2010.
[33] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In IEEE CVPR, pages 69–77, 2016.
[34] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata. Zero-shot learning: A comprehensive evaluation of the good, the bad and the ugly. IEEE TPAMI, 2018.
[35] Y. Xian, T. Lorenz, B. Schiele, and Z. Akata. Feature generating networks for zero-shot learning. In IEEE CVPR, 2018.
[36] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint, 2017.