Zero-shot learning (ZSL) aims to recognize objects of categories that are not available at the training stage. Its basic idea is to transfer visual knowledge learned from seen categories to unseen categories through the connection made by the semantic embeddings of classes. Attributes [Farhadi et al.(2009)Farhadi, Endres, Hoiem, and Forsyth] were the first kind of semantic embedding utilized for ZSL and remain the best choice for achieving state-of-the-art ZSL performance [Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele, Zhang and Saligrama(2015)]. This good performance, however, is obtained at the cost of extensive human labour to label the attributes.
Recently, several works have explored using distributed word embeddings (DWE) [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean, Pennington et al.(2014)Pennington, Socher, and Manning] as an alternative to attributes in zero-shot learning [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov, Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean]. In contrast to human-annotated attributes, DWEs are learned from a large-scale text corpus in an unsupervised fashion, which requires little or no human labour. However, the training process of DWEs does not involve visual information, so they capture only the semantic relationships between classes. In practice, semantic similarity does not necessarily correspond to visual similarity, and this visual-semantic discrepancy may lead to inferior ZSL performance. In fact, it has been shown that, when applied to the same ZSL approach, DWEs are consistently outperformed by attributes [Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele, Changpinyo et al.(2016)Changpinyo, Chao, Gong, and Sha, Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele]. To reduce the visual-semantic discrepancy, a popular strategy in ZSL is to map the semantic embeddings and visual features into a shared space [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov, Romera-Paredes and Torr(2015), Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele, Zhang and Saligrama(2016), Long et al.(2016)Long, Liu, and Shao] to make the two domains comparable. However, when a large visual-semantic discrepancy exists, finding such a mapping can be difficult.
Different from existing work, the method proposed in this paper directly learns a neural network that maps semantic embeddings to a space in which the mapped embeddings preserve a neighbourhood structure similar to that of their visual counterparts. In other words, we do not require the mapped semantic embeddings to be comparable to visual features; we only impose constraints on their structure. This gives more freedom in learning the mapping function, which could potentially enhance its generalizability. Moreover, since our approach is not tied to a particular zero-shot learning method, the learned mapping can be applied to any zero-shot learning algorithm.
Three contributions are made in this work. First, we experimentally demonstrate that the inferior ZSL performance of DWEs is caused by the discrepancy between visual features and semantic embeddings. Second, to overcome this issue, we propose visually aligned word embeddings (VAWE), which preserve a neighbourhood structure similar to that in the visual domain. Third, we show that VAWE improves word-embedding-based ZSL methods to state-of-the-art performance and is potentially generalizable to any type of ZSL method.
2 Related works
Zero-shot learning and semantic embedding: Zero-shot learning was first made possible by attributes [Lampert et al.(2009)Lampert, Nickisch, and Harmeling, Farhadi et al.(2009)Farhadi, Endres, Hoiem, and Forsyth], which describe the visual appearance of a concept or instance by assigning labelled visual properties to it and are easily transferable from seen to unseen classes. Distributed word embeddings, most notably word2vec [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean] and GloVe [Pennington et al.(2014)Pennington, Socher, and Manning], have recently been explored [Socher et al.(2013)Socher, Ganjoo, Manning, and Ng, Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov, Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean] as a promising alternative semantic embedding towards fully automatic zero-shot learning, since their unsupervised training process does not involve any human intervention. ZSL approaches learn a connection between the visual and semantic domains either by directly mapping visual features to the semantic space [Socher et al.(2013)Socher, Ganjoo, Manning, and Ng, Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean, Fu and Sigal(2016)] or by projecting both visual and semantic embeddings into a common space [Akata et al.(2013)Akata, Perronnin, Harchaoui, and Schmid, Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov, Romera-Paredes and Torr(2015), Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele, Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele, Zhang and Saligrama(2016), Long et al.(2016)Long, Liu, and Shao, Jetley et al.(2015)Jetley, Romera-Paredes, Jayasumana, and Torr, Li et al.(2015)Li, Guo, and Schuurmans].
It should be noted that, specifically in [Long et al.(2016)Long, Liu, and Shao, Long et al.(2017)Long, Liu, Shao, Shen, Ding, and Han], similar issues such as visual-semantic ambiguity or visual-semantic structure preservation are raised, and attribute-based ZSL methods are designed to deal with them. Although our work shares a common goal with [Long et al.(2016)Long, Liu, and Shao] and [Long et al.(2017)Long, Liu, Shao, Shen, Ding, and Han], VAWE is learned in the semantic domain only and serves as a general tool for any word-embedding-based ZSL method. In other words, we are NOT proposing a particular ZSL method; VAWE can be regarded as a meta-method for improving existing ZSL methods.
Word embedding with visual information: As distributed word embeddings are limited to purely textual representations, a few works have proposed to improve them with visual cues. Visual word2vec [Kottur et al.(2016)Kottur, Vedantam, Moura, and Parikh] is trained by adding abstract scenes to the context. In [Lazaridou et al.(2015)Lazaridou, Pham, and Baroni], the language model learns to predict visual representations jointly with the linguistic features. Our work differs from these two works in two aspects: 1) our target is to learn a mapping function that generalizes to the words of unseen classes, while the above works learn embeddings for the words in the training set; 2) the objective of our method is to encourage a certain neighbourhood structure in the mapped word embeddings rather than to apply a context-prediction objective across the visual and semantic domains as in [Kottur et al.(2016)Kottur, Vedantam, Moura, and Parikh, Lazaridou et al.(2015)Lazaridou, Pham, and Baroni].
3 Background and motivation
Assume a set of class labels $\mathcal{Y}_s$ for images from seen classes and $\mathcal{Y}_u$ for images from unseen classes, where $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$. Most zero-shot learning approaches can be summarised by the general form

$\hat{y} = \arg\max_{y \in \mathcal{Y}} F(\mathbf{x}, \psi(y)), \quad (1)$

where $F(\mathbf{x}, \psi(y))$ measures the compatibility score of the visual feature $\mathbf{x}$ and the semantic embedding $\psi(y)$ of class $y$. During the training phase, where $y \in \mathcal{Y}_s$, $F$ is learned to measure the compatibility between $\mathbf{x}$ and $\psi(y)$. During the testing phase, the learned $F$ is applied to measure the compatibility between the novel classes in $\mathcal{Y}_u$ and the testing visual samples for zero-shot prediction.
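As an illustration of this general form, the sketch below scores novel classes with a bilinear compatibility function. The toy dimensions, the random stand-in for the learned matrix, and the bilinear choice itself are illustrative assumptions, not the specific model of any method discussed here:

```python
import numpy as np

rng = np.random.default_rng(0)
d_v, d_s = 6, 4                          # toy visual / semantic dimensions (assumed)
W = rng.normal(size=(d_v, d_s))          # stand-in for a compatibility matrix learned on seen classes

def compatibility(x, s, W):
    """Bilinear compatibility score F(x, psi(y)) = x^T W psi(y)."""
    return float(x @ W @ s)

def zsl_predict(x, class_embeddings, W):
    """Zero-shot prediction: pick the class whose semantic embedding
    is most compatible with the visual feature x."""
    scores = [compatibility(x, s, W) for s in class_embeddings]
    return int(np.argmax(scores))

# At test time only the semantic embeddings psi(y) of the novel classes are needed.
unseen_embeddings = [rng.normal(size=d_s) for _ in range(3)]
pred = zsl_predict(rng.normal(size=d_v), unseen_embeddings, W)
```

The key point the sketch makes concrete is that the visual samples of unseen classes never enter training; only their semantic embeddings do.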
This formulation indicates that $\psi(y)$ is an important factor in ZSL. It is desirable that the relationship among the $\psi(y)$ retains consistency with the relationship among their visual features, that is, the $\psi(y)$ of visually similar classes remain similar and vice versa. Human-defined attributes are empirically proven to be qualified because the annotators implicitly infuse the attribute vectors with visual information based on their knowledge and experience of the concepts (in this work, we use “concept” and “word” interchangeably to denote category names). However, this is not always the case for semantic embeddings learned from pure text sources, which are trained to maintain the semantic relations of concepts in large text corpora. For example, the concepts “violin” and “piano” are strongly related in the semantic sense even though their appearances are completely different.
To investigate how visual-semantic consistency affects ZSL performance, we conduct a preliminary experiment on the AwA dataset (see Section 5 and the supplementary material for the details of this dataset and the visual features). We use the state-of-the-art ESZSL [Romera-Paredes and Torr(2015)] method to measure ZSL performance and the average neighbourhood overlap to measure visual-semantic consistency. To calculate the latter, we measure the visual distance between two classes as the average distance between all pairs of visual features within those two classes, which is equivalent to calculating the distance between their mean feature vectors. That is to say, the visual distance between two classes $i$ and $j$ is

$d(i, j) = \|\bar{\mathbf{x}}_i - \bar{\mathbf{x}}_j\|_2, \quad (2)$

where $\bar{\mathbf{x}}_i$ is the mean feature vector of class $i$ and $\|\cdot\|_2$ is the L2 norm. Likewise, the semantic distance between two classes can be calculated in the same manner by replacing $\bar{\mathbf{x}}_i$ and $\bar{\mathbf{x}}_j$ with the semantic embeddings of classes $i$ and $j$.
We define $\mathcal{N}^c_v$ and $\mathcal{N}^c_s$ as the sets that include the most similar classes to class $c$ in the visual and semantic domains respectively. For each class $c$, we calculate its top-$K$ nearest classes in the visual domain using (2) and put them in $\mathcal{N}^c_v$. Similarly, we calculate the top-$K$ nearest classes of $c$ in the semantic domain and put them in $\mathcal{N}^c_s$.
Four types of semantic embeddings are tested: word2vec, GloVe (see Section 5 and the supplementary material for the training details of the two word embeddings), binary attributes (presence/absence of an attribute for a class) and continuous attributes (strength of association to a class). The average neighbourhood overlap is defined as

$C = \frac{1}{|\mathcal{Y}|} \sum_{c \in \mathcal{Y}} |\mathcal{N}^c_v \cap \mathcal{N}^c_s|, \quad (3)$

i.e., the average number of shared neighbours (out of $K = 10$ nearest neighbours in this case) over all classes in the semantic and visual domains. A value closer to $K$ indicates that the embedding is more consistent with the visual domain.
Table 1 (excerpt): average neighbourhood overlap and ZSL accuracy on AwA.

| Method | Semantic embedding | Neighbourhood overlap | Accuracy (%) |
|---|---|---|---|
| ESZSL | visual feature mean | 10.00 | 86.34 |
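The average neighbourhood overlap of (3) can be computed directly from per-class mean features. A minimal NumPy sketch (toy data; the helper names are ours, not the paper's):

```python
import numpy as np

def class_distances(embeddings):
    """Pairwise L2 distances between per-class vectors; with the mean
    feature vector of each class as a row, this is eq. (2)."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def top_k_neighbours(dist, k):
    """For each class, the set of its k nearest other classes."""
    neigh = []
    for c in range(dist.shape[0]):
        order = [i for i in np.argsort(dist[c]) if i != c]  # exclude the class itself
        neigh.append(set(order[:k]))
    return neigh

def avg_overlap(visual_signatures, semantic_embeddings, k=10):
    """Average neighbourhood overlap of eq. (3): mean |N_v^c ∩ N_s^c|."""
    nv = top_k_neighbours(class_distances(visual_signatures), k)
    ns = top_k_neighbours(class_distances(semantic_embeddings), k)
    return float(np.mean([len(nv[c] & ns[c]) for c in range(len(nv))]))
```

Identical inputs yield the maximum overlap $K$; the gap below $K$ quantifies the visual-semantic discrepancy of a given embedding.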
The results in Table 1 demonstrate that semantic embeddings with a more consistent visual-semantic neighbourhood structure clearly produce better ZSL performance. Motivated by this, we propose to map the semantic word embeddings into a new space in which the neighbourhood structure of the mapped embeddings becomes consistent with their visual-domain counterparts. Hereafter, we call the mapped word embedding a visually aligned word embedding (VAWE), since it is re-aligned with visual information in comparison with its unmapped counterpart.
Notations: Before elaborating our approach, we formally define the notations as follows. For each visual category $c$, we denote its semantic word embedding as $\mathbf{w}_c \in \mathbb{R}^{d_s}$ and its visual signature as $\mathbf{v}_c \in \mathbb{R}^{d_v}$, where $d_v$ and $d_s$ are the dimensionality of the visual and semantic space, respectively. The visual signature will be used to define the neighbourhood structure in the visual domain. In the main body of this paper, we use the mean vector of the visual features in the $c$-th category as its visual signature. Certainly, this is merely one way to define the visual neighbourhood structure, and our method also applies to other alternative definitions. The to-be-learned mapping function (neural network) is represented by $f(\cdot; \theta)$, where $\theta$ denotes the model parameters (for simplicity, we omit $\theta$ in later notations). This function will be learned on the seen classes and is expected to generalize to unseen classes. In this way, we can apply the VAWE to any zero-shot learning method. We use the notations $u$ and $\mathbf{w}_u$ to denote an unseen class and its semantic embedding respectively.
4.1 Visually aligned word embedding
To learn $f$, we design an objective function that encourages $f(\mathbf{w}_c)$ and $\mathbf{v}_c$ to share similar neighbours. Specifically, we consider a triplet of classes $(c_a, c_p, c_n)$, where $c_p$ is more visually similar to $c_a$ than $c_n$ is. By examining the consistency of the neighbourhood of class $c_a$ in the view of its visual signature, the VAWEs of classes $c_a$ and $c_p$ should be pulled closer while the VAWEs of classes $c_a$ and $c_n$ should be pushed far apart. Hereafter, we call $c_a$, $c_p$ and $c_n$ the anchor class, positive class and negative class respectively. The training objective is to ensure that the distance between $f(\mathbf{w}_{c_a})$ and $f(\mathbf{w}_{c_p})$ is smaller than the distance between $f(\mathbf{w}_{c_a})$ and $f(\mathbf{w}_{c_n})$. Therefore we employ a triplet hinge loss function:

$L = \sum_{(c_a, c_p, c_n)} \left[ m + \|f(\mathbf{w}_{c_a}) - f(\mathbf{w}_{c_p})\|_2^2 - \|f(\mathbf{w}_{c_a}) - f(\mathbf{w}_{c_n})\|_2^2 \right]_+, \quad (5)$

where $[\cdot]_+$ denotes the hinge loss and $m$ is an enforced margin between the distances from the anchor class to the positive and negative classes. Note that our method does not map the semantic word embedding into a space shared with visual features as in many ZSL methods such as DeViSE [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov]. The mapping function applies only in the semantic domain. We fix the margin $m$ to the same value in all experiments.
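The per-triplet hinge term of the loss can be sketched as follows (the default margin value here is illustrative, not the paper's setting):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, margin=1.0):
    """Hinge loss for one (anchor, positive, negative) triplet of mapped
    embeddings: [m + d(a,p)^2 - d(a,n)^2]_+ with squared L2 distances."""
    d_pos = float(np.sum((f_a - f_p) ** 2))   # anchor-positive distance
    d_neg = float(np.sum((f_a - f_n) ** 2))   # anchor-negative distance
    return max(0.0, margin + d_pos - d_neg)
```

The loss is zero once the negative is at least `margin` further (in squared distance) from the anchor than the positive, so already well-separated triplets contribute no gradient.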
4.2 Triplet selection
The choice of triplets plays a crucial role in our method. Our method encourages the output embeddings to share neighbourhoods with the visual signatures. Therefore, if two classes are close in the visual domain but distant in the semantic domain, their semantic embeddings should be pulled closer, and vice versa. Specifically, for an anchor class $c_a$, if another class $c$ is within the top-$K_v$ neighbours in the view of the visual domain but not within the top-$K_s$ ($K_s > K_v$) neighbours in the view of the semantic domain, then $f(\mathbf{w}_c)$ should be pulled closer to $f(\mathbf{w}_{c_a})$ and we include $c$ as a positive class. On the other hand, if $c$ is within the top-$K_v$ neighbours of $c_a$ in the semantic view but not within the top-$K_s$ neighbours in the visual view, $f(\mathbf{w}_c)$ should be pushed away from $f(\mathbf{w}_{c_a})$ and we include $c$ as a negative class. Note that using $K_s > K_v$ avoids over-sensitive decisions at the neighbourhood boundary: if $c$ is within the top-$K_v$ neighbourhood of $c_a$ it is deemed “close” to $c_a$, and only if $c$ is not within the top-$K_s$ neighbourhood of $c_a$ is it considered “distant” from $c_a$.
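The selection rule above can be sketched as follows. The helper is hypothetical and assumes the neighbour sets are precomputed, with the smaller-K sets defining "close" and the larger-K sets defining "distant" in each domain:

```python
def select_triplets(close_v, far_v, close_s, far_s):
    """Select (anchor, positive, negative) class triplets.

    close_*[c]: smaller top-K neighbour set of class c ("close").
    far_*[c]:   larger top-K neighbour set; a class outside it is "distant".
    """
    triplets = []
    classes = range(len(close_v))
    for a in classes:
        # positive: visually close to a, but semantically distant
        positives = [c for c in classes
                     if c != a and c in close_v[a] and c not in far_s[a]]
        # negative: semantically close to a, but visually distant
        negatives = [c for c in classes
                     if c != a and c in close_s[a] and c not in far_v[a]]
        triplets += [(a, p, n) for p in positives for n in negatives]
    return triplets
```

Only disagreements between the two domains generate triplets, so classes whose semantic neighbourhood already matches the visual one contribute nothing to the loss.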
As noted by [Dinu and Baroni(2015)], nearest neighbour approaches may suffer from the hubness problem in high dimensions: some items are similar to all other items and thus become hubs. In our experiments, if a positive concept appears in the neighbourhoods of many words during training, the VAWEs would concentrate around this hub, which could be harmful for learning a proper $f$. We design a simple-but-effective hubness correction mechanism as a necessary regularizer for training by removing such hub vectors from the positive class candidates as the training progresses. We calculate the “hubness level” of each concept before each epoch: concretely, we count each concept's appearances in the neighbourhoods of other concepts in the mapped semantic domain. We mark the concepts that appear too often in the neighbourhoods of other concepts as hubs and remove them from the positive classes in the next training epoch. In our experiments the hubness correction usually brings 2-3% of improvement over the ordinary triplet selection. We summarize the triplet selection process and hub vector generation in Algorithm 1 and Algorithm 2, respectively.
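The hubness-correction bookkeeping can be sketched as below; the appearance-count threshold is a hypothetical hyper-parameter, not a value from the paper:

```python
from collections import Counter

def find_hubs(neighbour_sets, max_appearances):
    """Mark concepts appearing in too many other concepts' neighbourhoods.

    neighbour_sets[c] is the neighbour set of concept c in the mapped
    semantic space, recomputed before each epoch."""
    counts = Counter(n for neigh in neighbour_sets for n in neigh)
    return {c for c, k in counts.items() if k > max_appearances}

def prune_positives(positive_candidates, hubs):
    """Drop hub concepts from the positive-class candidates for the next epoch."""
    return [c for c in positive_candidates if c not in hubs]
```

Because the mapped embeddings change every epoch, the hub set is recomputed each time rather than fixed once before training.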
4.3 Learning the neural network
We formulate $f$ as a neural network that takes the pre-trained word embeddings as input and outputs new visually aligned word embeddings. During the training stage, the training triplets are selected from the seen classes according to Algorithm 1, and the parameters of $f$ are adjusted by SGD to minimize the triplet loss (5). Note that although the number of training classes $S$ is limited, $f$ is trained with triplets of classes, which can amount to $\mathcal{O}(S^3)$. The inference structure of $f$ contains two fully-connected hidden layers with ReLU non-linearity, and the output embedding is L2-normalized onto the $d$-dimensional unit hypersphere before being propagated to the triplet loss layer. For a detailed description of the neural network and the training parameters, please refer to the supplementary material.
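A NumPy sketch of such a forward pass is given below. The layer widths, initialization scale and function names are illustrative assumptions; the actual architecture details are in the supplementary material:

```python
import numpy as np

def init_mlp(d_in, d_hidden, d_out, rng):
    """Weights for two fully-connected hidden layers plus an output layer."""
    return {
        "W1": rng.normal(scale=0.1, size=(d_in, d_hidden)), "b1": np.zeros(d_hidden),
        "W2": rng.normal(scale=0.1, size=(d_hidden, d_hidden)), "b2": np.zeros(d_hidden),
        "W3": rng.normal(scale=0.1, size=(d_hidden, d_out)), "b3": np.zeros(d_out),
    }

def vawe_forward(w, p):
    """Map a word embedding w through two ReLU hidden layers and
    L2-normalize the output onto the unit hypersphere."""
    h = np.maximum(0.0, w @ p["W1"] + p["b1"])
    h = np.maximum(0.0, h @ p["W2"] + p["b2"])
    out = h @ p["W3"] + p["b3"]
    return out / (np.linalg.norm(out) + 1e-12)
```

The final L2 normalization makes the squared Euclidean distances in the triplet loss equivalent (up to a constant) to cosine distances between the mapped embeddings.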
Here $\mathcal{T}_t$ denotes the set of all selected triplets at epoch $t$.
During the inference stage, $f$ is applied to the word embeddings of both seen and unseen classes. The output VAWEs can then be used off-the-shelf in any zero-shot learning task.
In order to conduct a comprehensive evaluation, we train the VAWE from two popular kinds of distributed word embeddings: word2vec [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean] and GloVe [Pennington et al.(2014)Pennington, Socher, and Manning]. We apply the trained VAWEs to four state-of-the-art ZSL methods and compare the performance against the original word embeddings and against other ZSL methods using various semantic embeddings.
Datasets: We test the methods on four widely used benchmark dataset for zero-shot learning: aPascal/aYahoo object dataset [Farhadi et al.(2009)Farhadi, Endres, Hoiem, and Forsyth] (aPY), Animals with Attributes [Lampert et al.(2009)Lampert, Nickisch, and Harmeling] (AwA), Caltech-UCSD birds-200-2011 [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie] (CUB), and the SUN scene attribute dataset [Xiao et al.(2014)Xiao, Ehinger, Hays, Torralba, and Oliva] (SUN).
Distributed word embeddings: We train the VAWE from two pre-trained distributed word embedding models: word2vec and GloVe. We pre-train the word2vec model from scratch on a large combined text corpus. The resulting model generates 1000-D real-valued vectors for each concept. As for GloVe, we use the pre-trained 300-D word embeddings provided by [Pennington et al.(2014)Pennington, Socher, and Manning]. We only test GloVe on the aPY and AwA datasets because the pre-trained GloVe model does not contain the associated word embeddings for many of the fine-grained categories in CUB and SUN.
Image features and visual signatures: For all four test methods in our experiments, we extract image features from the fully-connected layer activations of the deep CNN VGG-19 [Simonyan and Zisserman(2014)]. As aforementioned, we use the average VGG-19 feature of each seen category as its visual signature.
Test ZSL methods: We apply trained VAWEs on four state-of-the-art methods denoted as ConSE [Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean], SynC [Changpinyo et al.(2016)Changpinyo, Chao, Gong, and Sha], LatEm [Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele] and ESZSL [Romera-Paredes and Torr(2015)] in the following sections.
Implementation details: We stop the training when the triplet loss stops decreasing. This usually takes 150 epochs for aPY, 250 epochs for AwA, 50 epochs for CUB and 20 epochs for SUN. The number of nearest neighbours $K_v$ in the visual space is the same for all datasets. The number of nearest neighbours $K_s$ in the semantic space is set to half the number of seen classes for each dataset (except for SUN, which has many seen classes), that is, 10, 20, 75 and 200 for aPY, AwA, CUB and SUN, respectively. The output dimension is set to $d = 128$. More detailed experimental settings and results are provided in the supplementary material.
5.1 Performance improvement and discussion
In this section, we test the effect of using VAWE trained from word2vec and GloVe in various ZSL methods. The main results of VAWE compared against the original word embeddings are listed in the bottom part of Table 2. Except for the fine-grained dataset CUB, the VAWEs trained from both word embeddings gain overall performance improvement on all test methods. Most notably on the coarse-grained datasets, i.e., aPY and AwA, the VAWEs outperform their original counterparts by a very large margin.
For the ZSL methods, we find that the performance improvement is most significant for ConSE and ESZSL, partly because these two methods directly learn a linear mapping between visual and semantic embeddings. A set of semantic embeddings that is inconsistent with the visual domain hurts their performance the most. By using the VAWEs, these methods learn a much better aligned visual-semantic mapping and gain a large performance improvement.
The performance improvement is limited on the fine-grained datasets CUB and SUN. Compared to the coarse-grained datasets, the differences between categories in CUB and SUN are subtle in both the visual and semantic domains. This makes their visual signatures and semantic embeddings more entangled and results in a higher hubness level. Therefore it is more challenging to re-align the word embeddings of fine-grained categories with our method.
Overall, the VAWEs exhibit consistent performance gain for various methods on various datasets (improved performance on 22 out of 24 experiments). This observation suggests that VAWE is able to serve as a general tool for improving the performance of ZSL approaches.
5.2 Comparison against the state-of-the-art
We also compare the improved results of VAWE against the results of recently published state-of-the-art ZSL methods using various sources of semantic embeddings in the upper part of Table 2. Methods using VAWE beat all other methods using non-attribute embeddings. Even compared against the best-performing attribute-based methods, our results are still very competitive on the coarse-grained datasets: only a small margin lower than [Zhang and Saligrama(2016)], which uses continuous attributes. The results indicate that VAWE is a potential substitute for human-labelled attributes: it is not only free of human labour but also provides performance comparable to attributes.
5.3 The effect of visual features
Table 3: ZSL accuracy (%) on AwA when the VAWEs are learned from different visual signature sources.

| Visual signature source | Low-level | DeCAF | VGG-19 |
|---|---|---|---|
| ConSE + Ours | 55.57 | 60.08 | 61.24 |
| SynC + Ours | 59.06 | 67.30 | 66.10 |
| LatEm + Ours | 58.43 | 63.33 | 61.46 |
| ESZSL + Ours | 67.28 | 73.23 | 76.16 |
The learning process of the mapping function relies on the choice of visual features, which implicitly affects the neighbourhood structure in the visual domain. In this section, we investigate the impact of the choice of visual features on the quality of the mapped VAWE, where quality is again measured by ZSL performance. In previous sections, we extracted the visual signature as the mean of the VGG-19 features of each class. Here we replace it with the low-level features or DeCAF features provided by [Lampert et al.(2009)Lampert, Nickisch, and Harmeling] and use them to obtain the VAWEs of word2vec. Once the VAWEs are learned, we apply ZSL with VGG-19 features; the experiment is conducted on the AwA dataset. Note that both DeCAF features and low-level features are weaker image features than VGG-19. The results are shown in Table 3. We find that the performance of all four ZSL methods does not change much when we replace VGG-19 with DeCAF to learn the mapping function. Using low-level features degrades the performance, but the learned VAWE still outperforms the original word2vec. These observations suggest that we may train the VAWE with one type of visual feature, apply it to ZSL methods trained with another kind of visual feature, and still obtain good results.
In this paper, we show that the discrepancy of visual features and semantic embeddings negatively impacts the performance of ZSL approaches. Motivated by that, we propose to learn a neural network with triplet loss to map the word embeddings into a new space in which the neighbourhood structure of the mapped word embedding becomes similar to that in the visual domain. The visually aligned word embeddings boost the ZSL performance to a level that is competitive to human defined attributes. Besides that, our approach is independent of any particular ZSL method. This gives it much more flexibility to generalize to more potential applications of vision-language tasks.
L. Liu was in part supported by ARC DECRA Fellowship DE170101259. C. Shen was in part supported by ARC Future Fellowship FT120100969.
- [Akata et al.(2013)Akata, Perronnin, Harchaoui, and Schmid] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for attribute-based classification. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2013.
- [Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2015.
- [Ba et al.(2015)Ba, Swersky, Fidler, and Salakhutdinov] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, and Ruslan Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
- [Changpinyo et al.(2016)Changpinyo, Chao, Gong, and Sha] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
- [Deng et al.(2014)Deng, Ding, Jia, Frome, Murphy, Bengio, Li, Neven, and Adam] Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. Large-scale object classification using label relation graphs. In Proc. Eur. Conf. Comp. Vis., 2014.
- [Dinu and Baroni(2015)] Georgiana Dinu and Marco Baroni. Improving zero-shot learning by mitigating the hubness problem. In Proc. Int. Conf. Learn. Representations, 2015.
- [Farhadi et al.(2009)Farhadi, Endres, Hoiem, and Forsyth] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009.
- [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov] Andrea Frome, Greg Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In Proc. Advances in Neural Inf. Process. Syst., 2013.
- [Fu and Sigal(2016)] Yanwei Fu and Leonid Sigal. Semi-supervised vocabulary-informed learning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
- [Jetley et al.(2015)Jetley, Romera-Paredes, Jayasumana, and Torr] Saumya Jetley, Bernardino Romera-Paredes, Sadeep Jayasumana, and Philip H. S. Torr. Prototypical priors: From improving classification to zero-shot learning. In Proc. British Machine Vis. Conf., 2015.
- [Kottur et al.(2016)Kottur, Vedantam, Moura, and Parikh] Satwik Kottur, Ramakrishna Vedantam, Jose M. F. Moura, and Devi Parikh. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2016.
- [Lampert et al.(2009)Lampert, Nickisch, and Harmeling] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009.
- [Lampert et al.(2014)Lampert, Nickisch, and Harmeling] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell., 2014.
- [Lazaridou et al.(2015)Lazaridou, Pham, and Baroni] Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. Combining language and vision with a multimodal skip-gram model. In NAACL HLT, 2015.
- [Li et al.(2015)Li, Guo, and Schuurmans] X. Li, Y. Guo, and D. Schuurmans. Semi-supervised zero-shot classification with label representation learning. In Proc. IEEE Int. Conf. Comp. Vis., pages 4211–4219, 2015.
- [Long et al.(2016)Long, Liu, and Shao] Yang Long, Li Liu, and Ling Shao. Attribute embedding with visual-semantic ambiguity removal for zero-shot learning. In Proc. British Machine Vis. Conf., 2016.
- [Long et al.(2017)Long, Liu, Shao, Shen, Ding, and Han] Yang Long, Li Liu, Ling Shao, Fumin Shen, Guiguang Ding, and Jungong Han. From zero-shot learning to conventional supervised classification: Unseen visual data synthesis. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
- [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Inf. Process. Syst., 2013.
- [Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In Proc. Int. Conf. Learn. Representations, 2014.
- [Pennington et al.(2014)Pennington, Socher, and Manning] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proc. Conf. Empirical Methods in Natural Language Processing, 2014.
- [Qiao et al.(2016)Qiao, Liu, Shen, and van den Hengel] Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. Less is more: Zero-shot learning from online textual documents with noise suppression. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2016.
- [Romera-Paredes and Torr(2015)] Bernardino Romera-Paredes and Philip H.S. Torr. An embarrassingly simple approach to zero-shot learning. Proc. Int. Conf. Mach. Learn., 2015.
- [Simonyan and Zisserman(2014)] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. Int. Conf. Learn. Representations, 2014.
- [Socher et al.(2013)Socher, Ganjoo, Manning, and Ng] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Proc. Advances in Neural Inf. Process. Syst., 2013.
- [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
- [Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. Latent embeddings for zero-shot classification. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
- [Xiao et al.(2014)Xiao, Ehinger, Hays, Torralba, and Oliva] Jianxiong Xiao, Krista A. Ehinger, James Hays, Antonio Torralba, and Aude Oliva. SUN database: Exploring a large collection of scene categories. Int. J. Comput. Vision, 2014.
- [Zhang and Saligrama(2015)] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via semantic similarity embedding. In Proc. IEEE Int. Conf. Comp. Vis. IEEE, 2015.
- [Zhang and Saligrama(2016)] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via joint latent similarity embedding. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. IEEE, 2016.