Visually Aligned Word Embeddings for Improving Zero-shot Learning

07/18/2017, by Ruizhi Qiao et al.

Zero-shot learning (ZSL) highly depends on a good semantic embedding to connect the seen and unseen classes. Recently, distributed word embeddings (DWE) pre-trained on large text corpora have become a popular choice to draw such a connection. Compared with human-defined attributes, DWEs are more scalable and easier to obtain. However, they are designed to reflect semantic similarity rather than visual similarity, and using them in ZSL thus often leads to inferior performance. To overcome this visual-semantic discrepancy, this work proposes an objective function that re-aligns the distributed word embeddings with visual information by learning a neural network that maps them into a new representation, called visually aligned word embedding (VAWE), such that the neighbourhood structure of VAWEs becomes similar to that in the visual domain. Note that in this work we do not design a ZSL method that projects the visual features and semantic embeddings onto a shared space, but merely impose a requirement on the structure of the mapped word embeddings. This strategy allows the learned VAWE to generalize to various ZSL methods and visual features. As evaluated with four state-of-the-art ZSL methods on four benchmark datasets, VAWE exhibits consistent performance improvements.


1 Introduction

Zero-shot learning (ZSL) aims at recognizing objects of categories that are not available at the training stage. Its basic idea is to transfer visual knowledge learned from seen categories to unseen categories through the connection made by the semantic embeddings of classes. Attributes [Farhadi et al.(2009)Farhadi, Endres, Hoiem, and Forsyth] were the first kind of semantic embedding utilized for ZSL and remain the best choice for achieving state-of-the-art ZSL performance [Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele, Zhang and Saligrama(2015)]. This good performance, however, comes at the cost of the extensive human labour required to annotate attributes.

Figure 1: The key idea of our approach. Given the class names and visual features of the seen classes, we extract word embeddings from a pre-trained language model and obtain visual signatures that summarize the appearance of each seen class. The word embeddings are mapped to a new space where the neighbourhood structure of the mapped embeddings is enforced to be consistent with their visual-domain counterparts. During the inference stage, the VAWEs and visual features of the seen classes are used to train the ZSL model; the VAWEs of unseen classes are then fed to the trained ZSL model for zero-shot prediction.

Recently, several works have explored using distributed word embeddings (DWE) [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean, Pennington et al.(2014)Pennington, Socher, and Manning] as an alternative to attributes in zero-shot learning [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov, Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean]. In contrast to human-annotated attributes, DWEs are learned from a large-scale text corpus in an unsupervised fashion, which requires little or no human labour. However, the training process of DWEs does not involve visual information, so they only capture the semantic relationship between different classes. In practice, semantic similarity does not necessarily correspond to visual similarity, and this visual-semantic discrepancy may lead to inferior ZSL performance. In fact, it has been shown that when applied to the same ZSL approach, DWEs are consistently outperformed by attributes [Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele, Changpinyo et al.(2016)Changpinyo, Chao, Gong, and Sha, Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele]. To reduce the visual-semantic discrepancy, a popular strategy in ZSL is to map the semantic embeddings and visual features into a shared space [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov, Romera-Paredes and Torr(2015), Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele, Zhang and Saligrama(2016), Long et al.(2016)Long, Liu, and Shao] to make the two domains comparable. However, when a large visual-semantic discrepancy exists, finding such a mapping can be difficult.

Different from existing work, the method proposed in this paper directly learns a neural network that maps the semantic embeddings to a space in which they preserve a neighbourhood structure similar to that of their visual counterparts. In other words, we do not require the mapped semantic embeddings to be comparable to visual features but only impose constraints on their structure. This gives more freedom in learning the mapping function, which could potentially enhance its generalizability. Moreover, since our approach is not tied to a particular zero-shot learning method, the learned mapping can be applied to any zero-shot learning algorithm.

Three contributions are made in this work. First, we experimentally demonstrate that the inferior ZSL performance of DWEs is caused by the discrepancy between visual features and semantic embeddings. Second, to overcome this issue, we propose the visually aligned word embeddings (VAWE), which preserve a neighbourhood structure similar to that in the visual domain. Third, we show that VAWE improves word-embedding-based ZSL methods to state-of-the-art performance and is potentially generalizable to any type of ZSL method.

2 Related works

Zero-shot learning and semantic embedding: Zero-shot learning was first made possible by attributes [Lampert et al.(2009)Lampert, Nickisch, and Harmeling, Farhadi et al.(2009)Farhadi, Endres, Hoiem, and Forsyth], which describe the visual appearance of a concept or instance by assigning labelled visual properties to it and are easily transferable from seen to unseen classes. Distributed word embeddings, most notably word2vec [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean] and GloVe [Pennington et al.(2014)Pennington, Socher, and Manning], have recently been explored [Socher et al.(2013)Socher, Ganjoo, Manning, and Ng, Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov, Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean] as a promising alternative semantic embedding towards fully automatic zero-shot learning, since their unsupervised training process does not involve any human intervention. ZSL approaches learn a connection between the visual and semantic domains either by directly mapping visual features to the semantic space [Socher et al.(2013)Socher, Ganjoo, Manning, and Ng, Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean, Fu and Sigal(2016)] or by projecting both visual and semantic embeddings into a common space [Akata et al.(2013)Akata, Perronnin, Harchaoui, and Schmid, Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov, Romera-Paredes and Torr(2015), Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele, Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele, Zhang and Saligrama(2016), Long et al.(2016)Long, Liu, and Shao, Jetley et al.(2015)Jetley, Romera-Paredes, Jayasumana, and Torr, Li et al.(2015)Li, Guo, and Schuurmans]. It should be noted that, specifically in [Long et al.(2016)Long, Liu, and Shao, Long et al.(2017)Long, Liu, Shao, Shen, Ding, and Han], similar issues such as visual-semantic ambiguity and visual-semantic structure preservation are raised, and attribute-based ZSL methods are designed to deal with them. Although our work shares a common goal with [Long et al.(2016)Long, Liu, and Shao] and [Long et al.(2017)Long, Liu, Shao, Shen, Ding, and Han], VAWE is learned in the semantic domain only and serves as a general tool for any word-embedding-based ZSL method. In other words, we are NOT proposing a particular ZSL method, and VAWE can be regarded as a meta-method for improving existing ZSL methods.

Word embedding with visual information: As distributed word embeddings are limited to purely textual information, a few works have proposed to improve them with visual cues. Visual word2vec [Kottur et al.(2016)Kottur, Vedantam, Moura, and Parikh] is trained by adding abstract scenes to the context. In [Lazaridou et al.(2015)Lazaridou, Pham, and Baroni], the language model learns to predict visual representations jointly with the linguistic features. Our work differs from these two works in two aspects: 1) our target is to learn a mapping function that generalizes to the words of unseen classes, whereas the above works learn embeddings only for the words in their training set; 2) the objective of our method is to encourage a certain neighbourhood structure of the mapped word embeddings, rather than applying a context-prediction objective across the visual and semantic domains as in [Kottur et al.(2016)Kottur, Vedantam, Moura, and Parikh, Lazaridou et al.(2015)Lazaridou, Pham, and Baroni].

3 Background and motivation

Assume a set of seen class labels $\mathcal{S}$ and a set of unseen class labels $\mathcal{U}$ for images from seen and unseen classes, where $\mathcal{S} \cap \mathcal{U} = \emptyset$. Most zero-shot learning approaches can be summarised by the general form

$$ c^{*} = \operatorname*{argmax}_{c} \; F(\mathbf{x}, \mathbf{s}_c), \qquad (1) $$

where $F(\mathbf{x}, \mathbf{s}_c)$ measures the compatibility score of the visual feature $\mathbf{x}$ and the semantic embedding $\mathbf{s}_c$ of class $c$. During the training phase, where $c \in \mathcal{S}$, $F$ is learned to measure the compatibility between $\mathbf{x}$ and $\mathbf{s}_c$. During the testing phase, the learned $F$ is applied to measure the compatibility between the novel classes $c \in \mathcal{U}$ and the testing visual samples for zero-shot prediction.
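To make this formulation concrete, here is a small illustrative Python sketch of the prediction rule in (1). The bilinear form of the compatibility function and all names are our assumptions, not details taken from this paper; individual ZSL methods define $F$ differently.

```python
import numpy as np

def zsl_predict(x, class_embeddings, compatibility):
    """Generic zero-shot prediction (Eq. 1): pick the class whose semantic
    embedding is most compatible with the visual feature x."""
    scores = {c: compatibility(x, s_c) for c, s_c in class_embeddings.items()}
    return max(scores, key=scores.get)

# One common choice is a bilinear compatibility F(x, s) = x^T W s,
# with W learned on the seen classes; the exact form depends on the ZSL method.
def bilinear_compatibility(W):
    return lambda x, s: float(x @ W @ s)

# Usage with random placeholder data (4096-D visual features, 300-D embeddings):
W = np.random.randn(4096, 300)
classes = {"zebra": np.random.randn(300), "whale": np.random.randn(300)}
print(zsl_predict(np.random.randn(4096), classes, bilinear_compatibility(W)))
```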

This formulation indicates that $\mathbf{s}_c$ is an important factor in ZSL. It is desirable that the relationship among the $\mathbf{s}_c$ remains consistent with the relationship among the corresponding visual features, that is, the $\mathbf{s}_c$ of visually similar classes should remain similar and vice versa. Human-defined attributes are empirically proven to satisfy this requirement because the annotators implicitly infuse the attribute vectors with visual information based on their knowledge and experience of the concepts (in this work, we use "concept" and "word" interchangeably to denote category names). However, this is not always the case for semantic embeddings learned from pure text sources, which are trained to maintain the semantic relation of concepts in large text corpora. For example, the concepts "violin" and "piano" are strongly related in the semantic sense even though their appearances are completely different.

To investigate how visual-semantic consistency affects ZSL performance, we conduct a preliminary experiment on the AwA dataset (see Section 5 and the supplementary material for the details of this dataset and the visual features). We use the state-of-the-art ESZSL [Romera-Paredes and Torr(2015)] method to measure the ZSL performance and the average neighbourhood overlap to measure visual-semantic consistency. To calculate the latter, we measure the visual distance between two classes as the average distance between all pairs of visual features within those two classes, which is also equivalent to calculating the distance between their mean feature vectors. That is to say, the visual distance between two classes $i$ and $j$ is

$$ d_v(i, j) = \left\| \bar{\mathbf{x}}_i - \bar{\mathbf{x}}_j \right\|_2, \qquad (2) $$

where $\bar{\mathbf{x}}_i$ is the mean feature vector of class $i$ and $\|\cdot\|_2$ is the L2 norm. Likewise, the semantic distance between two classes can be calculated in the same manner by replacing $\bar{\mathbf{x}}_i$ and $\bar{\mathbf{x}}_j$ with the semantic embeddings $\mathbf{s}_i$ and $\mathbf{s}_j$ of classes $i$ and $j$.

We define $N_v(c)$ and $N_s(c)$ as the sets that include the classes most similar to class $c$ in the visual and semantic domains, respectively. For each class $c$, we calculate its top-$K$ nearest classes in the visual domain using (2) and put them in $N_v(c)$. Similarly, we calculate the top-$K$ nearest classes of $c$ in the semantic domain and put them in $N_s(c)$.

Four types of semantic embeddings are tested: word2vec, GloVe (see Section 5 and the supplementary material for the training details of the two word embeddings), binary attributes (presence/absence of each attribute for a class) and continuous attributes (strength of association to a class). The average neighbourhood overlap defined in (3) is the average number of shared neighbours between the semantic and visual domains over all classes (out of $K=10$ nearest neighbours in this case). A value closer to $K$ indicates that the embedding is more consistent with the visual domain.

$$ \text{Consistency} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \left| N_v(c) \cap N_s(c) \right|, \qquad (3) $$

where $\mathcal{C}$ is the set of all classes.

Method Embedding Consistency Accuracy
ESZSL word2vec 2.88 58.12
ESZSL GloVe 2.84 59.72
ESZSL binary attribute 4.80 62.85
ESZSL continuous attribute 5.66 75.12
ESZSL visual feature mean 10.00 86.34

Table 1: Preliminary experiment: ZSL accuracies of ESZSL on the AwA dataset with different semantic embeddings. The visual feature mean summarizes the visual appearance of each seen or unseen class.

The results in Table 1 demonstrate that semantic embeddings with a more consistent visual-semantic neighbourhood structure clearly produce better ZSL performance. Motivated by this, in this paper we propose to map the semantic word embeddings into a new space in which the neighbourhood structure of the mapped embeddings becomes consistent with their visual-domain counterparts. Hereafter, we call the mapped word embeddings visually aligned word embeddings (VAWE), since they are re-aligned with visual information in comparison with their unmapped counterparts.
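For concreteness, the following is a minimal sketch of how the distances in (2) and the average neighbourhood overlap in (3) could be computed, assuming the per-class visual signatures (class-mean features) and semantic embeddings are stored as NumPy arrays; the function and variable names are ours, not the paper's.

```python
import numpy as np

def top_k_neighbours(vectors, k):
    """For each row, indices of its k nearest rows under L2 distance (Eq. 2), self excluded."""
    dists = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # a class is not its own neighbour
    return np.argsort(dists, axis=1)[:, :k]  # shape: (num_classes, k)

def neighbourhood_consistency(visual_signatures, semantic_embeddings, k=10):
    """Average number of shared top-k neighbours between the two domains (Eq. 3)."""
    nv = top_k_neighbours(visual_signatures, k)    # visual neighbourhoods N_v(c)
    ns = top_k_neighbours(semantic_embeddings, k)  # semantic neighbourhoods N_s(c)
    overlaps = [len(set(a) & set(b)) for a, b in zip(nv, ns)]
    return float(np.mean(overlaps))

# Usage with random placeholders (50 classes, as in AwA):
visual = np.random.randn(50, 4096)     # class-mean VGG-19 features
semantic = np.random.randn(50, 1000)   # word2vec / GloVe / attribute vectors
print(neighbourhood_consistency(visual, semantic, k=10))
```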

4 Approach

Notations: Before elaborating our approach, we formally define the notations as follows. For each visual category $c$, we denote its semantic word embedding as $\mathbf{s}_c \in \mathbb{R}^{d_s}$ and its visual signature as $\mathbf{v}_c \in \mathbb{R}^{d_v}$, where $d_v$ and $d_s$ are the dimensionality of the visual and semantic space, respectively. The visual signature will be used to define the neighbourhood structure in the visual domain. In the main body of this paper, we use the mean vector of the visual features in the $c$-th category as its visual signature. Certainly, this is merely one way to define the visual neighbourhood structure, and our method also applies to other alternative definitions. The to-be-learned mapping function (neural network) is represented by $\phi(\cdot\,;\theta)$, where $\theta$ denotes the model parameters (for simplicity, we omit $\theta$ in later notations). This function will be learned on the seen classes and is expected to generalize to unseen classes. In this way, we can apply the VAWE to any zero-shot learning method. We use the notation $u$ and $\mathbf{s}_u$ to denote an unseen class and its semantic embedding, respectively.

4.1 Visually aligned word embedding

To learn $\phi$, we design an objective function that encourages the mapped embeddings to share similar neighbours with the visual signatures. Specifically, we consider a triplet of classes $(a, p, n)$, where class $p$ is more visually similar to $a$ than $n$ is. By examining the consistency of the neighbourhood of class $a$ in the view of its visual signature, the VAWEs of classes $a$ and $p$ should be pulled closer while the VAWEs of classes $a$ and $n$ should be pushed far apart. Hereafter, we call classes $a$, $p$ and $n$ the anchor class, positive class and negative class, respectively. The training objective is to ensure that the distance between $\phi(\mathbf{s}_a)$ and $\phi(\mathbf{s}_p)$ is smaller than the distance between $\phi(\mathbf{s}_a)$ and $\phi(\mathbf{s}_n)$. Therefore we employ a triplet hinge loss function:

$$ \ell(a, p, n) = \left[\, \left\| \phi(\mathbf{s}_a) - \phi(\mathbf{s}_p) \right\|_2^2 - \left\| \phi(\mathbf{s}_a) - \phi(\mathbf{s}_n) \right\|_2^2 + m \,\right]_+, \qquad (4) $$

where $[\cdot]_+$ denotes the hinge loss and $m$ is an enforced margin between the distances from the anchor class to the positive and negative classes. Note that our method does not map the semantic word embedding into a shared space with visual features as in many ZSL methods such as DeViSE [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov]. The mapping function is applied only in the semantic domain. We use the same fixed value of $m$ in all experiments.
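As an illustration, a minimal PyTorch-style sketch of this triplet hinge loss might look as follows; the function name, the margin value, and the use of squared Euclidean distances are our assumptions rather than details specified in the paper.

```python
import torch

def triplet_hinge_loss(anchor, positive, negative, margin=1.0):
    """Triplet hinge loss of Eq. (4) on mapped (and L2-normalized) embeddings.

    anchor, positive, negative: (batch, d) tensors holding phi(s_a), phi(s_p), phi(s_n).
    margin: the enforced margin m (its value here is a placeholder, not the paper's).
    """
    d_pos = (anchor - positive).pow(2).sum(dim=1)  # distance to the positive class
    d_neg = (anchor - negative).pow(2).sum(dim=1)  # distance to the negative class
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```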

4.2 Triplet selection

The choice of the triplets plays a crucial role in our method. Our method encourages the output embeddings to share neighbourhoods with the visual signatures. Therefore, if two classes are close in the visual domain but distant in the semantic domain, their semantic embeddings should be pulled closer, and vice versa. Specifically, for an anchor class $a$, if another class $p$ is within the top-$K_v$ neighbours of $a$ in the visual domain but not within the top-$K_s$ ($K_s > K_v$) neighbours in the semantic domain, then $\phi(\mathbf{s}_p)$ should be pulled closer to $\phi(\mathbf{s}_a)$ and we include $p$ as a positive class. On the other hand, if another class $n$ is within the top-$K_v$ neighbours of $a$ in the semantic domain but not within the top-$K_s$ neighbours in the visual domain, $\phi(\mathbf{s}_n)$ should be pushed away from $\phi(\mathbf{s}_a)$ and we include $n$ as a negative class. Note that using $K_s > K_v$ avoids over-sensitive decisions on the neighbourhood boundary. In other words, if a class is within the top-$K_v$ neighbourhood of $a$, it is deemed "close" to $a$, and only if it is not within the top-$K_s$ neighbourhood of $a$ is it considered "distant" from $a$.

As noted by [Dinu and Baroni(2015)], nearest neighbour approaches may suffer from the hubness problem in high dimensions: some items are similar to all other items and thus become hubs. In our experiments, if a positive concept appears in the neighbourhood of many words during training, the VAWEs would concentrate around this hub, which could be harmful for learning a proper $\phi$. We design a simple but effective hubness correction mechanism as a necessary regularizer for training by removing such hub vectors from the positive class candidates as the training progresses. We calculate the "hubness level" of each concept before each epoch. Concretely, we accumulate each concept's number of appearances in the neighbourhoods of other concepts in the mapped semantic domain. We mark the concepts that appear too often in the neighbourhoods of other concepts as hubs and remove them from the positive classes in the next training epoch. In our experiments, the hubness correction usually brings a 2-3% improvement over ordinary triplet selection. We summarize the triplet selection process and the hub vector generation in Algorithm 1 and Algorithm 2, respectively.

Input: Nearest neighbourhood structure sets $N_v(c)$ and $N_s(c)$ in the visual and semantic domains for each seen class $c$, computed from the visual signatures and the mapped semantic embeddings; $K_v$ and $K_s$; hub vector set $H_t$ at epoch $t$.

Initialize the triplet set $T_t = \emptyset$ at epoch $t$.

for each seen class $a$ do
        $P \leftarrow N_v^{K_v}(a) \setminus H_t$ //candidate positive classes, hubs removed
        for each $p \in P$ do
               if $p \notin N_s^{K_s}(a)$ then //visually close but semantically distant
                      for each $n \in N_s^{K_v}(a)$ do
                             if $n \notin N_v^{K_s}(a)$ then //semantically close but visually distant
                                    $T_t \leftarrow T_t \cup \{(a, p, n)\}$
                             end if
                      end for
               end if
        end for
end for
Randomly shuffle the order of the triplets in $T_t$. Output: $T_t$.
Algorithm 1: Dynamic triplet selection at epoch $t$

Input: mapped output embeddings $\{\phi(\mathbf{s}_c)\}$ at epoch $t$; number of neighbours in the semantic domain $K_s$.

Initialize $\mathbf{h}$ as a zero-valued vector, with each of its elements counting the hubness level of one concept; hub vector set $H_t = \emptyset$ at epoch $t$.

for each seen class $c$ do
        Get $N_s^{K_s}(c)$ from $\{\phi(\mathbf{s}_c)\}$
        for each seen class $c'$ do
               if $c' \in N_s^{K_s}(c)$ then
                      $h_{c'} \leftarrow h_{c'} + 1$
               end if
        end for
end for
for each seen class $c$ do
        if $h_c$ exceeds the hubness threshold then
               $H_t \leftarrow H_t \cup \{c\}$
        end if
end for
Output: $H_t$.
Algorithm 2: Generating the hub vector set before epoch $t$
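To make the two procedures above concrete, here is a rough NumPy sketch of the hubness counting and the dynamic triplet selection. The data layout (per-class arrays), the hubness threshold, and all names are illustrative assumptions of ours, not taken from the paper.

```python
import numpy as np

def knn_sets(vectors, k):
    """Index set of the k nearest rows for every row (L2 distance, self excluded)."""
    d = np.linalg.norm(vectors[:, None, :] - vectors[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return [set(np.argsort(row)[:k]) for row in d]

def hub_set(mapped_embeddings, k_s, threshold):
    """Algorithm 2 (sketch): concepts that appear too often in others' semantic neighbourhoods."""
    neighbourhoods = knn_sets(mapped_embeddings, k_s)
    hubness = np.zeros(len(mapped_embeddings), dtype=int)
    for ns in neighbourhoods:
        for c in ns:
            hubness[c] += 1                       # accumulate hubness level
    return {c for c, h in enumerate(hubness) if h > threshold}

def select_triplets(visual_signatures, mapped_embeddings, k_v, k_s, hubs):
    """Algorithm 1 (sketch): dynamic triplet selection for one epoch."""
    nv_small = knn_sets(visual_signatures, k_v)   # top-K_v visual neighbours
    nv_large = knn_sets(visual_signatures, k_s)   # top-K_s visual neighbours
    ns_small = knn_sets(mapped_embeddings, k_v)   # top-K_v semantic neighbours
    ns_large = knn_sets(mapped_embeddings, k_s)   # top-K_s semantic neighbours
    triplets = []
    for a in range(len(visual_signatures)):
        for p in nv_small[a] - hubs:              # visually close candidates, hubs removed
            if p in ns_large[a]:                  # already close enough in the semantic domain
                continue
            for n in ns_small[a] - nv_large[a]:   # semantically close but visually distant
                triplets.append((a, p, n))
    np.random.shuffle(triplets)
    return triplets
```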

4.3 Learning the neural network

We formulate $\phi$ as a neural network that takes the pre-trained word embeddings as input and outputs new visually aligned word embeddings. During the training stage, the training triplets are selected from the seen classes according to Algorithm 1, and the parameters of $\phi$ are adjusted by SGD to minimize the triplet loss (5). Note that although the number of training classes is limited, $\phi$ is trained on triplets of classes, whose number grows cubically with the number of seen classes. The inference structure of $\phi$ contains two fully-connected hidden layers with ReLU non-linearity, and the output embedding is L2-normalized onto the unit hypersphere before being propagated to the triplet loss layer. For a detailed description of the neural network and the training parameters, please refer to the supplementary material.

$$ L_t(\theta) = \sum_{(a, p, n) \in T_t} \ell(a, p, n), \qquad (5) $$

where $T_t$ is the set of all selected triplets at epoch $t$.

During the inference stage, $\phi$ is applied to the word embeddings of both seen and unseen classes. The output VAWEs are then available off-the-shelf for any zero-shot learning task.
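Putting the pieces together, the mapping network and one SGD update could be sketched as below. The two ReLU hidden layers, the L2-normalized output, and the 128-D output dimension follow the descriptions in Sections 4.3 and 5; the hidden-layer width, margin value, and all names are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAWENet(nn.Module):
    """Sketch of the mapping phi: two ReLU hidden layers, L2-normalized 128-D output."""
    def __init__(self, input_dim=1000, hidden_dim=512, output_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, s):
        return F.normalize(self.net(s), p=2, dim=-1)  # project onto the unit hypersphere

def train_step(model, optimizer, word_embeddings, triplets, margin=1.0):
    """One SGD step over a list of selected (anchor, positive, negative) class-index triplets."""
    emb = torch.as_tensor(word_embeddings, dtype=torch.float32)
    a, p, n = (torch.as_tensor(idx) for idx in zip(*triplets))
    phi_a, phi_p, phi_n = model(emb[a]), model(emb[p]), model(emb[n])
    d_pos = (phi_a - phi_p).pow(2).sum(dim=1)
    d_neg = (phi_a - phi_n).pow(2).sum(dim=1)
    loss = torch.clamp(d_pos - d_neg + margin, min=0.0).mean()  # Eqs. (4)-(5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```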

5 Experiments

In order to conduct a comprehensive evaluation, we train the VAWE from two kinds of popular distributed word embeddings: word2vec [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean] and GloVe [Pennington et al.(2014)Pennington, Socher, and Manning]. We apply the trained VAWEs to four state-of-the-art ZSL methods and compare the performance against the original word embeddings and against other ZSL methods using various semantic embeddings.

Method Feature Embedding aPY AwA CUB SUN
Lampert [Lampert et al.(2014)Lampert, Nickisch, and Harmeling] V continuous attribute 38.16 57.23 72.00
Deng [Deng et al.(2014)Deng, Ding, Jia, Frome, Murphy, Bengio, Li, Neven, and Adam] D class hierarchy 44.2
Ba [Ba et al.(2015)Ba, Swersky, Fidler, and Salakhutdinov] V web documents 12.0
Akata [Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele] V word2vec 51.2 28.4
Akata [Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele] V GloVe 58.8 24.2
Akata [Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele] V continuous attribute 66.7 50.1
Qiao [Qiao et al.(2016)Qiao, Liu, Shen, and van den Hengel] V web documents 66.46 29.00
Zhang [Zhang and Saligrama(2015)] V continuous attribute 46.23 76.33 30.41 82.50
Zhang [Zhang and Saligrama(2016)] V continuous attribute 50.35 79.12 41.78 83.83
VSAR [Long et al.(2016)Long, Liu, and Shao] L continuous attribute 39.42 51.75
SynC [Changpinyo et al.(2016)Changpinyo, Chao, Gong, and Sha] G continuous attribute 69.7 53.4 62.8
LatEm [Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele] G continuous attribute 72.5 45.6
ESZSL [Romera-Paredes and Torr(2015)] V continuous attribute 24.22 75.32 82.10

ConSE [Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean] V word2vec 21.82 46.80 23.12 43.00
ConSE [Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean] + Ours V VAWE word2vec 35.29 61.24 27.44 63.10
ConSE [Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean] V GloVe 35.17 51.21
ConSE [Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean] + Ours V VAWE GloVe 42.21 59.26
SynC [Changpinyo et al.(2016)Changpinyo, Chao, Gong, and Sha] V word2vec 28.53 56.71 21.54 68.00
SynC [Changpinyo et al.(2016)Changpinyo, Chao, Gong, and Sha] + Ours V VAWE word2vec 33.23 66.10 21.21 70.80
SynC [Changpinyo et al.(2016)Changpinyo, Chao, Gong, and Sha] V GloVe 29.92 60.74
SynC [Changpinyo et al.(2016)Changpinyo, Chao, Gong, and Sha] + Ours V VAWE GloVe 31.88 64.51
LatEm [Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele] V word2vec 19.64 50.84 16.52 52.50
LatEm [Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele] + Ours V VAWE word2vec 35.64 61.46 19.12 61.30
LatEm [Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele] V GloVe 27.72 46.12
LatEm [Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele] + Ours V VAWE GloVe 37.29 55.51
ESZSL [Romera-Paredes and Torr(2015)] V word2vec 28.32 58.12 24.82 64.50
ESZSL [Romera-Paredes and Torr(2015)] + Ours V VAWE word2vec 43.23 76.16 24.10 71.20
ESZSL [Romera-Paredes and Torr(2015)] V GloVe 34.53 59.72
ESZSL [Romera-Paredes and Torr(2015)] + Ours V VAWE GloVe 44.25 75.10
Table 2: ZSL classification results on 4 datasets. Blank spaces indicate these methods are not tested on the corresponding datasets. Bottom part: methods using VAWE and the original word embeddings as semantic embeddings. Upper part: state-of-the-art methods using various sources of semantic embeddings. Visual features include V:VGG-19; G:GoogLeNet; D:DECAF; L:low-level features.

Datasets: We test the methods on four widely used benchmark datasets for zero-shot learning: the aPascal/aYahoo object dataset [Farhadi et al.(2009)Farhadi, Endres, Hoiem, and Forsyth] (aPY), Animals with Attributes [Lampert et al.(2009)Lampert, Nickisch, and Harmeling] (AwA), Caltech-UCSD Birds-200-2011 [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie] (CUB), and the SUN scene attribute dataset [Xiao et al.(2014)Xiao, Ehinger, Hays, Torralba, and Oliva] (SUN).

Distributed word embeddings: We train the VAWE from two pre-trained distributed word embedding models: word2vec and GloVe. We pre-train the word2vec model from scratch on a large combination of text corpora. The resulting model generates a 1000-D real-valued vector for each concept. As for GloVe, we use the pre-trained 300-D word embeddings provided by [Pennington et al.(2014)Pennington, Socher, and Manning]. We only test GloVe on the aPY and AwA datasets because the pre-trained GloVe model lacks word embeddings for many of the fine-grained categories in CUB and SUN.

Image features and visual signatures: For all four test methods in our experiments, we extract the image features from the fully-connected layer activations of the deep CNN VGG-19 [Simonyan and Zisserman(2014)]. As mentioned above, we use the average VGG-19 feature of each seen category as its visual signature.
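For reference, computing these visual signatures amounts to a per-class average of the extracted features; a tiny sketch (array and function names are ours):

```python
import numpy as np

def class_mean_signatures(features, labels):
    """Visual signature of each class = mean of its image features (e.g. VGG-19 fc activations)."""
    classes = np.unique(labels)
    signatures = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return signatures, classes
```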

Implementation details: We stop training when the triplet loss stops decreasing. This usually takes 150 epochs for aPY, 250 epochs for AwA, 50 epochs for CUB and 20 epochs for SUN. The number of nearest neighbours in the visual domain, $K_v$, is fixed across all datasets. The number of nearest neighbours in the semantic domain, $K_s$, is set to half the number of seen classes for each dataset (except for SUN, which has many seen classes), that is, 10, 20, 75 and 200 for aPY, AwA, CUB and SUN, respectively. The output dimension of $\phi$ is set to 128. More detailed experimental settings and results are provided in the supplementary material.

5.1 Performance improvement and discussion

In this section, we test the effect of using VAWE trained from word2vec and GloVe in various ZSL methods. The main results of VAWE compared against the original word embeddings are listed in the bottom part of Table 2. Except for the fine-grained dataset CUB, the VAWEs trained from both word embeddings gain overall performance improvement on all test methods. Most notably on the coarse-grained datasets, i.e., aPY and AwA, the VAWEs outperform their original counterparts by a very large margin.

For the ZSL methods, we find that the performance improvement is most significant for ConSE and ESZSL, partly because these two methods directly learn a linear mapping between visual and semantic embeddings. A set of semantic embeddings that is inconsistent with the visual domain hurts their performance the most. Using the VAWEs, these methods learn a much better aligned visual-semantic mapping and gain a large performance improvement.

The performance improvement is limited on the fine-grained datasets CUB and SUN. Compared to the coarse-grained datasets, the differences between categories in CUB and SUN are subtle in both the visual and semantic domains. This makes their visual signatures and semantic embeddings more entangled and results in a higher hubness level. Therefore it is more challenging to re-align the word embeddings of fine-grained categories with our method.

Overall, the VAWEs exhibit consistent performance gain for various methods on various datasets (improved performance on 22 out of 24 experiments). This observation suggests that VAWE is able to serve as a general tool for improving the performance of ZSL approaches.

5.2 Comparison against the state-of-the-art

We also compare the improved results of VAWE against recently published state-of-the-art ZSL methods using various sources of semantic embeddings in the upper part of Table 2. It can be observed that the methods using VAWE beat all other methods that use non-attribute embeddings. Even compared against the best-performing attribute-based methods, our results remain very competitive on the coarse-grained datasets: only a small margin lower than [Zhang and Saligrama(2016)], which uses continuous attributes. The results indicate that VAWE is a potential substitute for human-labelled attributes: it requires no human labelling effort yet provides performance comparable to attributes.

5.3 The effect of visual features

Visual signature source Low-level DeCAF VGG-19
ConSE + Ours 55.57 60.08 61.24
SynC + Ours 59.06 67.30 66.10
LatEm + Ours 58.43 63.33 61.46
ESZSL + Ours 67.28 73.23 76.16


Table 3: ZSL accuracies on the AwA dataset of VAWE trained with visual signatures from different feature sources. For the ZSL methods, the VGG-19 features are still used for training and testing.

The learning process of the mapping function relies on the choice of visual features, which implicitly affects the neighbourhood structure in the visual domain. In this section, we investigate the impact of the choice of visual features on the quality of the learned VAWE, where quality is again measured by ZSL performance. In previous sections, we extracted the visual signature as the mean of the VGG-19 features of each class. Here we replace it with the low-level features or DeCAF features provided by [Lampert et al.(2009)Lampert, Nickisch, and Harmeling] and use them to obtain the VAWEs of word2vec. Once the VAWEs are learned, we run ZSL with VGG-19 features; the experiment is conducted on the AwA dataset. Note that both DeCAF features and low-level features are weaker image features than VGG-19. The results are shown in Table 3. We find that the performance of all four ZSL methods does not change much when we replace VGG-19 with DeCAF to learn the mapping function. Using low-level features degrades the performance, but the learned VAWE still outperforms the original word2vec. These observations suggest that we may train the VAWE with one type of visual features and apply it to ZSL methods trained with another type of visual features, and still obtain good results.

6 Conclusion

In this paper, we show that the discrepancy between visual features and semantic embeddings negatively impacts the performance of ZSL approaches. Motivated by this, we propose to learn a neural network with a triplet loss that maps word embeddings into a new space in which the neighbourhood structure of the mapped embeddings becomes similar to that in the visual domain. The visually aligned word embeddings boost ZSL performance to a level that is competitive with human-defined attributes. Moreover, our approach is independent of any particular ZSL method, which gives it much more flexibility to generalize to other potential vision-language tasks.

Acknowledgement

L. Liu was in part supported by ARC DECRA Fellowship DE170101259. C. Shen was in part supported by ARC Future Fellowship FT120100969.

References

  • [Akata et al.(2013)Akata, Perronnin, Harchaoui, and Schmid] Zeynep Akata, Florent Perronnin, Zaid Harchaoui, and Cordelia Schmid. Label-embedding for attribute-based classification. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2013.
  • [Akata et al.(2015)Akata, Reed, Walter, Lee, and Schiele] Zeynep Akata, Scott Reed, Daniel Walter, Honglak Lee, and Bernt Schiele. Evaluation of output embeddings for fine-grained image classification. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2015.
  • [Ba et al.(2015)Ba, Swersky, Fidler, and Salakhutdinov] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, and Ruslan Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proc. IEEE Int. Conf. Comp. Vis., 2015.
  • [Changpinyo et al.(2016)Changpinyo, Chao, Gong, and Sha] Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. Synthesized classifiers for zero-shot learning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [Deng et al.(2014)Deng, Ding, Jia, Frome, Murphy, Bengio, Li, Neven, and Adam] Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. Large-scale object classification using label relation graphs. In Proc. Eur. Conf. Comp. Vis., 2014.
  • [Dinu and Baroni(2015)] Georgiana Dinu and Marco Baroni. Improving zero-shot learning by mitigating the hubness problem. In Proc. Int. Conf. Learn. Representations, 2015.
  • [Farhadi et al.(2009)Farhadi, Endres, Hoiem, and Forsyth] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009.
  • [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov] Andrea Frome, Greg Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In Proc. Advances in Neural Inf. Process. Syst., 2013.
  • [Fu and Sigal(2016)] Yanwei Fu and Leonid Sigal. Semi-supervised vocabulary-informed learning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [Jetley et al.(2015)Jetley, Romera-Paredes, Jayasumana, and Torr] Saumya Jetley, Bernardino Romera-Paredes, Sadeep Jayasumana, and Philip H. S. Torr. Prototypical priors: From improving classification to zero-shot learning. In Proc. British Machine Vis. Conf., 2015.
  • [Kottur et al.(2016)Kottur, Vedantam, Moura, and Parikh] Satwik Kottur, Ramakrishna Vedantam, Jose M. F. Moura, and Devi Parikh. Visual word2vec (vis-w2v): Learning visually grounded word embeddings using abstract scenes. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2016.
  • [Lampert et al.(2009)Lampert, Nickisch, and Harmeling] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2009.
  • [Lampert et al.(2014)Lampert, Nickisch, and Harmeling] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell., 2014.
  • [Lazaridou et al.(2015)Lazaridou, Pham, and Baroni] Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. Combining language and vision with a multimodal skip-gram model. In NAACL HLT, 2015.
  • [Li et al.(2015)Li, Guo, and Schuurmans] X. Li, Y. Guo, and D. Schuurmans. Semi-supervised zero-shot classification with label representation learning. In Proc. IEEE Int. Conf. Comp. Vis., pages 4211–4219, 2015.
  • [Long et al.(2016)Long, Liu, and Shao] Yang Long, Li Liu, and Ling Shao. Attribute embedding with visual-semantic ambiguity removal for zero-shot learning. In Proc. British Machine Vis. Conf., 2016.
  • [Long et al.(2017)Long, Liu, Shao, Shen, Ding, and Han] Yang Long, Li Liu, Ling Shao, Fumin Shen, Guiguang Ding, and Jungong Han. From zero-shot learning to conventional supervised classification: Unseen visual data synthesis. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
  • [Mikolov et al.(2013)Mikolov, Sutskever, Chen, Corrado, and Dean] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In Proc. Advances in Neural Inf. Process. Syst., 2013.
  • [Norouzi et al.(2014)Norouzi, Mikolov, Bengio, Singer, Shlens, Frome, Corrado, and Dean] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. In Proc. Int. Conf. Learn. Representations, 2014.
  • [Pennington et al.(2014)Pennington, Socher, and Manning] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Conf. Empirical Methods in Natural Language Processing: EMNLP, 2014.
  • [Qiao et al.(2016)Qiao, Liu, Shen, and van den Hengel] Ruizhi Qiao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. Less is more: Zero-shot learning from online textual documents with noise suppression. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2016.
  • [Romera-Paredes and Torr(2015)] Bernardino Romera-Paredes and Philip H.S. Torr. An embarrassingly simple approach to zero-shot learning. Proc. Int. Conf. Mach. Learn., 2015.
  • [Simonyan and Zisserman(2014)] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. Int. Conf. Learn. Representations, 2014.
  • [Socher et al.(2013)Socher, Ganjoo, Manning, and Ng] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Proc. Advances in Neural Inf. Process. Syst., 2013.
  • [Wah et al.(2011)Wah, Branson, Welinder, Perona, and Belongie] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • [Xian et al.(2016)Xian, Akata, Sharma, Nguyen, Hein, and Schiele] Yongqin Xian, Zeynep Akata, Gaurav Sharma, Quynh Nguyen, Matthias Hein, and Bernt Schiele. Latent embeddings for zero-shot classification. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2016.
  • [Xiao et al.(2014)Xiao, Ehinger, Hays, Torralba, and Oliva] Jianxiong Xiao, Krista A. Ehinger, James Hays, Antonio Torralba, and Aude Oliva. SUN database: Exploring a large collection of scene categories. Int. J. Comput. Vision, 2014.
  • [Zhang and Saligrama(2015)] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via semantic similarity embedding. In Proc. IEEE Int. Conf. Comp. Vis. IEEE, 2015.
  • [Zhang and Saligrama(2016)] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via joint latent similarity embedding. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. IEEE, 2016.