Zero-shot learning (ZSL) for visual classification has received increasing attention recently [10, 17, 16, 8, 13]. This is because, although virtually unlimited images are available via social media sharing websites such as Flickr, there are still not enough annotated images for building a visual classification model for a large number of visual classes. ZSL aims to imitate the human ability to recognize a new class without ever seeing an instance of it. A human has that ability because he or she is able to make connections between an unseen class and the seen classes based on its semantic description. Similarly, a zero-shot learning method for visual classification relies on the existence of a labeled training set of seen classes and on knowledge about how each unseen class is semantically related to the seen classes.
An unseen class can be related to a seen class by representing both in a semantic embedding space. Existing ZSL methods can be categorized by the different embedding spaces deployed. Early works are dominated by semantic attribute based approaches. Visual classes are embedded into an attribute space by defining an attribute ontology and annotating a binary attribute vector for each class. The similarity between different classes can thus be measured by how many attributes are shared. However, both the ontology and the attribute vector for each class need to be manually defined, and the latter may have to be annotated at the instance level due to large intra-class variations. This gives these attribute-based approaches poor scalability. Alternatively, embeddings based on a semantic word space have recently started to gain popularity [8, 13]. Learned from a large language corpus, this embedding space is ‘free’ and applicable to any visual class [12, 11]. It thus has much better scalability and is the embedding space adopted in this paper.
After choosing an embedding space, the remaining problem for a ZSL approach is to measure the similarity between a test sample and each unseen class so that (zero-shot) classification can be performed. Since there is no training data for the unseen classes, such a similarity cannot be computed directly, and the training data from the seen classes need to be exploited to compute the similarity indirectly. Again, two options are available. In the first option, the seen class data are used to learn a mapping function that maps a low-level feature representation of a training image to the semantic space. Such a mapping function is then employed to map a test image belonging to an unseen class to the same space, where the similarity between the image and a class embedding vector can be computed for classification. However, this approach has an intrinsic limitation: the mapping function learned from the seen classes may not be suitable for the unseen classes due to the domain shift problem. Rectifying this problem by adapting the mapping function to the unseen classes is also hard, as no labeled data is available for those classes. The second option is to avoid the need for mapping a test image into the semantic embedding space. The training data is used in a different way: instead of learning a mapping function from the low-level feature space to the semantic embedding space, an n-way probabilistic classifier is learned in the visual feature space. The embedding space is used purely for computing the semantic relatedness between the seen and unseen classes. This semantic relatedness based approach alleviates the domain shift problem and has been empirically shown to be superior to the direct mapping based approach [9, 13]. It is thus the focus of this paper.
In this paper, a novel semantic graph based approach is proposed to model the relatedness between seen and unseen classes. In previous work, the relatedness between seen and unseen classes is modeled with a bipartite graph. As shown in Fig. 1(a), in such a graph the relatedness between each unseen class and each seen class is modeled directly in a flat structure, while the relatedness between the seen classes is ignored. This can be viewed as a ‘one-step’ exploration in the bipartite graph. In contrast, in this paper, we extend the modeling of semantic relationships from the flat structure to a hierarchical structure and perform a multiple-step exploration. As shown in Fig. 1(b), in our approach, seen and unseen classes form a semantic graph, in which each seen or unseen class corresponds to a graph node. The semantic graph is constructed as a $k$-nearest-neighbor ($k$-nn) graph. It should be noted that on a semantic graph, the relatedness between seen classes is modeled explicitly; in addition, each unseen class can only connect with seen classes and there is no direct connection among unseen classes. In this way the relatedness between different seen classes is also exploited, making the similarity measure between a test image and each unseen class more robust. Furthermore, compared to the bipartite graph, the $k$-nn semantic graph can be computed more efficiently. For example, the bipartite graph needs to store one parameter (the weight on a graph edge) for every pair of a seen class and an unseen class, while the $k$-nn semantic graph only needs to store on the order of $k$ parameters per class node.
More specifically, to perform zero-shot learning for a test image, we connect it to the seen class nodes, that is, we incorporate it into the semantic graph. Unlike in the bipartite graph-based method, it is possible that there is no direct connection between the true target unseen class and the seen classes connected to the test image on the semantic graph. Consequently, we have to design a new approach so that if the test image and an unseen class are connected by shorter paths on the semantic graph, the test image has a higher probability of being labeled as that unseen class. For example, in Fig. 1(b), the test image should be more likely to be assigned to the two unseen classes it reaches through shorter paths than to the third. To this end, we define a special absorbing Markov chain process on the semantic graph. We view each unseen class node as an absorbing state. Thus, each path that starts from the test image and terminates at one unseen class will not include other unseen classes. The inner nodes of such paths only include seen class nodes. The seen class nodes can thus be viewed as bridge nodes that connect the test image and the unseen classes. The absorbing probabilities from the test image to each unseen class can be computed efficiently. Given the predicted absorbing probabilities, we perform zero-shot learning by finding the class label with the highest absorbing probability. Moreover, we show that the proposed method has a closed-form solution whose cost is linear with respect to the number of test images.
The main contributions of this work are as follows. First, we propose to use a $k$-nearest-neighbor semantic graph to model the relatedness among seen and unseen classes. This makes the similarity measure between a test image and the unseen classes more robust and, as the number of visual categories increases, makes our $k$-nn semantic graph more efficient than a bipartite graph. Second, we design a special absorbing Markov chain process on the semantic graph and show how to efficiently compute the absorbing probabilities from one test image to each of the unseen classes. Third, after stacking the absorbing probabilities for each test image together, we provide a zero-shot learning algorithm that has a closed-form solution and is linear with respect to the number of test images.
2 Previous Work
Semantic embedding for ZSL. In most earlier works on zero-shot learning, semantic attributes are employed as the embedding space for knowledge transfer [10, 15, 6, 5, 1]. Most existing studies assume that an exhaustive ontology of attributes has been manually specified at either the class or instance level [10, 18]. However, annotating attributes scales poorly, as ontologies tend to be domain specific. For example, birds and trees have very different sets of attributes. Some works proposed to automatically learn discriminative visual attributes from data [7, 6]. But this sacrifices the name-ability of the embedding space, as the discovered attributes may not be semantically meaningful. To overcome this problem, semantic representations that do not rely on an explicit attribute ontology have been proposed [17, 16]. In particular, the semantic word space has recently been investigated [14, 19, 8]. A word space is extracted from linguistic knowledge bases, e.g. WordNet or Wikipedia, by natural language processing models. Instead of manually defining an attribute prototype, the textual name of a novel target class can be projected into this space and then used as the prototype for zero-shot learning. Typically learned from a large corpus covering all English words and bi-grams, this word space can be used for any visual class without the need for any manual annotation. It is thus much more scalable than an attribute embedding space for ZSL. In this work, we choose the word space for its scalability, but our method differs significantly from [14, 19, 8] in how the embedding space is used for knowledge transfer, and we show superior performance experimentally (see Section 4).
Knowledge transfer via an embedding space. Given an embedding space, existing approaches differ significantly in how the knowledge is transferred from a labeled training set containing seen classes. Most existing approaches, such as direct attribute prediction (DAP) or its variants, take a direct mapping based strategy. Specifically, the training data set is used to learn a mapping function from the low-level feature space to the semantic embedding space. Once learned, the same mapping function is used to map a test image into the same space, where the similarity between the test image and each unseen class semantic vector or prototype can be measured. This strategy, however, suffers from the domain shift problem mentioned earlier. Alternatively, a semantic relatedness based strategy can be adopted. This involves learning an n-way probabilistic classifier in the low-level feature space for the seen training classes. Given a test image, the probabilities produced by this classifier for each seen class indicate the visual similarity or relatedness between the test image and the seen classes. This relatedness is then compared with the semantic relatedness between each unseen class and the same seen classes. The test image is then classified according to how well the visual similarity and the semantic similarity agree. One representative approach following this strategy is Indirect Attribute Prediction (IAP). It has also been shown that the semantic relatedness does not necessarily have to come from a semantic embedding space, e.g. it can be computed from hit counts from an image search engine. This indirect semantic relatedness based strategy can be potentially advantageous over the direct mapping based one, as verified by the results in [9, 13]. However, as we analyzed earlier, the existing approaches based on semantic relatedness employ a flat bipartite graph and ignore the important inter-seen-class relatedness. In this work we develop a novel semantic graph based zero-shot learning method and show its advantages over the bipartite graph based methods in both classification performance and computational efficiency.
3.1 Problem Definition
Consider a set of seen classes and a set of unseen classes. Given a training dataset labeled only with the seen classes, the goal of zero-shot learning is to learn a classifier for the unseen classes even though there is no training data labeled with them.
Taking a semantic relatedness strategy for knowledge transfer, we first utilize the training dataset to learn a classifier for the seen classes. In this paper, we use the support vector machine (SVM) as the classifier for the seen classes. For a test image, the classifier outputs the posterior probability of the image belonging to each seen class. These probabilities are collected in a row vector with one element per seen class. For a whole test dataset, stacking these row vectors gives a matrix in which each row corresponds to one test image. This matrix stores the relationship between the test images and the seen classes. It should be noted that although in this work this relationship is measured by the posterior probability, other ways of computing the relationship between test images and the seen classes can also be adopted.
Our objective is to perform zero-shot learning by modeling the relationship between seen classes and unseen classes. In this paper, we propose to use a semantic graph to model the relationship among classes.
3.2 Semantic Graph
For measuring the relationship between two classes, we employ the word vector representation from linguistic research [11, 12] and use the similarity of their word vectors as the similarity measurement of the two classes.
Furthermore, a semantic graph is constructed as a $k$-nearest-neighbor graph. In the semantic graph, each class (regardless of whether it is a seen or an unseen class) has a corresponding graph node, which is connected with its $k$ most similar (semantically related) other classes. The weight of an edge in the semantic graph is the similarity between the two end nodes of that edge. More details about the semantic graph construction can be found in Section 4.1. After the semantic graph has been constructed, its structure is fixed in the subsequent steps of the pipeline.
We then define a special absorbing Markov chain process on the semantic graph, in which each unseen class node is viewed as an absorbing state and each seen class node is viewed as a transient state. The transition probability from one class node to another is the similarity between them, normalized so that the outgoing probabilities of each node sum to one. Being an absorbing state means that each unseen class node transitions to itself with probability one and to every other node with probability zero. It should be noted that since all of the unseen class nodes are absorbing states, there is no direct connection between two unseen class nodes. In other words, the unseen classes are connected only through the seen classes.
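As a rough illustration of the construction described above, the following NumPy sketch builds a $k$-nn semantic graph from class word vectors and row-normalizes the edge weights of the seen classes into transition probabilities. The function names, the clipping of negative cosine similarities to zero, and the symmetrization of the seen subgraph by an element-wise maximum are assumptions made for this sketch and are not taken from the method itself.

```python
import numpy as np


def cosine_similarity(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T


def build_semantic_graph(seen_vecs, unseen_vecs, k_seen, k_unseen):
    """Return edge weights (W_ss, W_su) of a k-nn semantic graph.

    W_ss[i, j] is the weight between seen classes i and j;
    W_su[i, j] is the weight between seen class i and unseen class j.
    Unseen classes are never linked to each other.
    """
    n, m = len(seen_vecs), len(unseen_vecs)

    # Assumption: negative similarities are clipped so weights are non-negative.
    S_ss = np.clip(cosine_similarity(seen_vecs, seen_vecs), 0.0, None)
    S_us = np.clip(cosine_similarity(unseen_vecs, seen_vecs), 0.0, None)
    np.fill_diagonal(S_ss, 0.0)  # no self-loops among seen classes

    # Each seen class keeps its k_seen most similar seen classes.
    W_ss = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(-S_ss[i])[:k_seen]
        W_ss[i, nbrs] = S_ss[i, nbrs]
    # Assumption: symmetrize by keeping an edge if either end selected it.
    W_ss = np.maximum(W_ss, W_ss.T)

    # Each unseen class is attached to its k_unseen most similar seen classes.
    W_su = np.zeros((n, m))
    for j in range(m):
        nbrs = np.argsort(-S_us[j])[:k_unseen]
        W_su[nbrs, j] = S_us[j, nbrs]
    return W_ss, W_su


def transition_blocks(W_ss, W_su):
    """Row-normalize the outgoing weights of each seen class into (Q, R)."""
    totals = W_ss.sum(axis=1, keepdims=True) + W_su.sum(axis=1, keepdims=True)
    if np.any(totals == 0):
        raise ValueError("every seen class needs at least one weighted edge")
    return W_ss / totals, W_su / totals
```

Here `Q` holds the seen-to-seen transition probabilities and `R` the seen-to-unseen ones, so each row of `Q` and `R` taken together sums to one.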
We re-number the class nodes (states in Markov process) so that the seen class nodes (transient states) come first. Then, the transition matrix of the above absorbing Markov chain process will have the following canonical form:
In Eq. 1, the upper-left block describes the probability of transitioning from one transient state (seen class) to another, and the upper-right block describes the probability of transitioning from a transient state (seen class) to an absorbing state (unseen class). In addition, the zero block and the identity matrix in the bottom row mean that the absorbing Markov chain process cannot leave an absorbing state once it arrives there.
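Written out explicitly, the canonical form of an absorbing chain with the transient states ordered first takes the standard block shape below. The symbol names are assumptions made here: $Q$ and $R$ follow the usual textbook convention, and $n$ and $m$ stand for the numbers of seen and unseen classes.

```latex
P =
\begin{pmatrix}
Q & R \\
\mathbf{0} & I
\end{pmatrix},
\qquad
Q \in \mathbb{R}^{n \times n},\quad
R \in \mathbb{R}^{n \times m},\quad
\mathbf{0} \in \mathbb{R}^{m \times n},\quad
I \in \mathbb{R}^{m \times m}.
\tag{1}
```

Each row of $P$ sums to one, so for every seen class the entries of its row in $Q$ and $R$ together form a probability distribution over its graph neighbors.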
3.3 Zero-shot Learning
For zero-shot learning, i.e. predicting the label of a test image from an unseen class, we first need to incorporate the image into the semantic graph. We then apply an extended absorbing Markov chain process, in which the test image is involved, to perform the zero-shot learning.
In order to introduce a test image into the semantic graph, it is connected with some seen class nodes (obviously it cannot be connected to the unseen class nodes directly, as we are not mapping it into the same semantic space). The nodes selected for connection are determined by the posterior probability of the image belonging to each seen class. Specifically, the node representing the image is connected to a fixed number of seen classes with the highest posterior probabilities, i.e. the most visually similar ones. Note that the test image node has no stepping-in probabilities, so the Markov process can only step out from it to seen class nodes. The stepping-out probabilities from the test image node to the seen class nodes are the posterior probabilities computed using the seen class classifier as described in Section 3.1. The test image is thus incorporated into the semantic graph as a transient state. The transition matrix of the extended absorbing Markov chain process has the following canonical form:
Meanwhile, the extended transition matrix among all transient states, comprising all seen class nodes and the one extra test image node, is written as
and the extended transition matrix between transient states and absorbing states is
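The extended blocks follow from the description above: the test image node is appended as the last transient state, nothing steps into it, and it steps out only to seen classes. A sketch in assumed notation, where $Q$ and $R$ are the transient-to-transient and transient-to-absorbing blocks of the original chain and $\mathbf{y}$ is the row vector of stepping-out probabilities of the test image over the seen classes (all symbol names are chosen here for illustration):

```latex
\tilde{P} =
\begin{pmatrix}
\tilde{Q} & \tilde{R} \\
\mathbf{0} & I
\end{pmatrix},
\tag{2}
\qquad
\tilde{Q} =
\begin{pmatrix}
Q & \mathbf{0} \\
\mathbf{y} & 0
\end{pmatrix},
\qquad
\tilde{R} =
\begin{pmatrix}
R \\
\mathbf{0}^{\top}
\end{pmatrix}.
```

The zero column in $\tilde{Q}$ encodes that no seen class transitions into the test image node, and the zero row in $\tilde{R}$ encodes that the test image has no direct edge to any unseen class.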
In the extended semantic graph, if there are many short paths that connect the test image node and a particular unseen class node, the absorbing Markov chain process that starts from the test image node will have a high probability of being absorbed at that unseen class node. Thus, the probability that the test image is labeled as that unseen class should be high. This is a cumulative process and is reflected by the absorbing probabilities from the test image node to all unseen class nodes.
The absorbing probability is the probability that the absorbing Markov chain will be absorbed in a given absorbing state if it starts from a given transient state. The absorbing probability matrix can be computed as follows:
in which the fundamental matrix of the extended absorbing Markov chain process is defined as follows:
We use the following block matrix inversion formula to compute the fundamental matrix.
Since we only care about the absorbing probabilities when the absorbing chain process starts from the test image node, we only need to compute the last row of the absorbing probability matrix (the test image node corresponds to the last transient state in the extended canonical form in Eq. 2). In particular, we can apply the above block matrix inversion formula to compute the last row of the fundamental matrix as
and then further compute the absorbing probabilities of the test image as
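The chain of steps above can be written out as a short derivation. The notation is assumed here for illustration: $Q$ and $R$ are the transient-to-transient and transient-to-absorbing blocks of the class graph, $\mathbf{y}$ is the row vector of stepping-out probabilities of the test image, $\tilde{Q}$ and $\tilde{R}$ are the corresponding extended blocks, $\tilde{N}$ is the fundamental matrix, and $\mathbf{b}$ is the row of absorbing probabilities of the test image. The inversion uses the standard formula for a block lower-triangular matrix.

```latex
\begin{aligned}
\tilde{B} &= \tilde{N}\,\tilde{R},
\qquad
\tilde{N} = (I - \tilde{Q})^{-1}, \\[4pt]
\begin{pmatrix} A & \mathbf{0} \\ C & D \end{pmatrix}^{-1}
&=
\begin{pmatrix} A^{-1} & \mathbf{0} \\ -D^{-1} C A^{-1} & D^{-1} \end{pmatrix}, \\[4pt]
\tilde{N}
&=
\begin{pmatrix} I - Q & \mathbf{0} \\ -\mathbf{y} & 1 \end{pmatrix}^{-1}
=
\begin{pmatrix} (I - Q)^{-1} & \mathbf{0} \\ \mathbf{y}\,(I - Q)^{-1} & 1 \end{pmatrix}, \\[4pt]
\mathbf{b}
&=
\begin{pmatrix} \mathbf{y}\,(I - Q)^{-1} & 1 \end{pmatrix}
\begin{pmatrix} R \\ \mathbf{0}^{\top} \end{pmatrix}
= \mathbf{y}\,(I - Q)^{-1} R .
\end{aligned}
```

The factor $(I - Q)^{-1} R$ involves only the class graph, which is why it can be computed once and reused for every test image.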
For the whole test dataset, we use a matrix to store the computed absorbing probabilities, in which each row holds the absorbing probabilities of one test image. If we stack the results of all test images together, we get the final matrix as follows:
In Eq. 9, the first factor is the matrix of posterior probabilities of the test images over the seen classes, and the second factor is a matrix that depends only on the semantic graph structure and can be pre-computed. The only variable dimension in Eq. 9 is the number of test images. Therefore, the cost of our method is linear in the number of test images.
Finally, we assign each test image to the unseen label that has the maximum absorbing probability when the absorbing chain starts from that test image. That is,
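The closed-form prediction can be sketched in a few lines of NumPy. The sketch assumes the notation $B = Y (I - Q)^{-1} R$, where each row of `Y` holds the stepping-out probabilities of one test image over the seen classes and `Q`, `R` are the seen-to-seen and seen-to-unseen transition blocks. The function names are hypothetical, and renormalizing the retained top-k posteriors so that they sum to one is an assumption of this sketch. It also assumes that every seen class can reach some unseen class, so that $I - Q$ is invertible.

```python
import numpy as np


def keep_top_k(Y, k):
    """Keep the k largest posteriors in each row, zero the rest, renormalize."""
    Y = np.asarray(Y, dtype=float)
    idx = np.argsort(-Y, axis=1)[:, :k]
    rows = np.arange(Y.shape[0])[:, None]
    out = np.zeros_like(Y)
    out[rows, idx] = Y[rows, idx]
    # Assumption: renormalize so the stepping-out probabilities sum to one.
    return out / out.sum(axis=1, keepdims=True)


def absorbing_probabilities(Y, Q, R):
    """Return B = Y (I - Q)^{-1} R, one row per test image."""
    n = Q.shape[0]
    # Solve (I - Q) X = R rather than forming the inverse explicitly.
    # X depends only on the class graph and can be cached across test sets.
    X = np.linalg.solve(np.eye(n) - Q, R)
    return np.asarray(Y, dtype=float) @ X


def predict_unseen(Y, Q, R):
    """Index of the unseen class with the highest absorbing probability."""
    return np.argmax(absorbing_probabilities(Y, Q, R), axis=1)


# Toy graph: seen classes 0 and 1 link to each other and to unseen class 0;
# seen class 2 links only to unseen class 1.
Q = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
R = np.array([[0.5, 0.0],
              [0.5, 0.0],
              [0.0, 1.0]])
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]])
labels = predict_unseen(Y, Q, R)
```

Once the graph-only factor is cached, each additional test image costs a single vector-matrix product, which matches the linear dependence on the number of test images.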
It should be noted that in our formulation, we consider all the paths in the semantic graph, i.e. the whole structure of the semantic graph. Therefore, our method is more stable than direct similarity-based zero-shot learning, in the sense of being less sensitive both to the number of seen classes connected to each test image and to the noise in the posterior probabilities caused by an imperfect seen class classifier. This is verified by the experimental results in Section 4.2.
4.1 Experimental Setup
Table 1. Area under ROC curve (AUC, in %) for each test class, mean AUC (in %), and mean accuracy (in %).
Dataset. We utilize the AwA (Animals with Attributes) dataset to evaluate the performance of the proposed zero-shot learning method. AwA provides 50 classes of animals (30475 images) and 85 associated class-level attributes (such as furry and hasClaws). In this work, attributes are not used unless otherwise stated. AwA also provides a defined source/target split for zero-shot learning, with 10 classes and 6180 images held out.
Competitors. Our method is compared against three alternatives. The first two are the most closely related, namely Rohrbach et al.’s direct similarity-based ZSL (DS-based) and Norouzi et al.’s convex semantic embedding ZSL (ConSE). Both methods take a semantic relatedness strategy and learn an n-way probabilistic classifier for the seen classes. In DS-based zero-shot learning, the semantic relatedness among categories is modeled as a bipartite graph. ConSE chooses the top similar seen classes for a test image using the trained classifier, and then uses the prototypes of these seen classes in the word space to form a new word vector for the test image. Zero-shot learning is performed by finding the most similar unseen prototype in the word space. In addition, we apply support vector regression to train a mapping from the visual space to the word space; after mapping each test image into the word space, a nearest-neighbor classifier is used to perform zero-shot learning. We call this direct mapping based method SVR+NN. This method differs from the other two and from ours in that it uses the training data of the seen classes to learn a mapping function rather than a classifier. Apart from these three, we also compare with published results that use an attribute space rather than the semantic word space.
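For concreteness, the ConSE-style baseline described above could be sketched as follows. This is an illustrative reading rather than the reference implementation: the function name is hypothetical, and weighting the selected seen prototypes by their renormalized classifier probabilities is an assumption suggested by the "convex combination" in the method's name.

```python
import numpy as np


def conse_predict(Y, seen_vecs, unseen_vecs, top_t):
    """Classify each test image by a convex combination of seen prototypes.

    Y[i, s] is the classifier probability of test image i for seen class s;
    seen_vecs and unseen_vecs hold one word vector (prototype) per row.
    """
    Y = np.asarray(Y, dtype=float)
    # Indices of the top_t most probable seen classes for each test image.
    idx = np.argsort(-Y, axis=1)[:, :top_t]
    rows = np.arange(Y.shape[0])[:, None]
    # Assumption: convex weights are the renormalized top_t probabilities.
    w = Y[rows, idx]
    w = w / w.sum(axis=1, keepdims=True)
    # Weighted sum of the selected prototypes: shape (num_images, dim).
    emb = np.einsum("it,itd->id", w, seen_vecs[idx])
    # Nearest unseen prototype under cosine similarity.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    U = unseen_vecs / np.linalg.norm(unseen_vecs, axis=1, keepdims=True)
    return np.argmax(emb @ U.T, axis=1)
```

Unlike the graph-based prediction, this baseline places the test image in the word space through the seen prototypes and never uses the relatedness between seen classes.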
Settings. We first exploit the word space representation [12, 11] to transform each AwA seen or unseen class name into a vector in the word space. For the word space, we train the skip-gram text model on a corpus of 4.6M Wikipedia documents to form a 1000-D word space. Since the name of the unseen AwA class seal has many meanings in English, not just the animal, we choose seven concrete seal species from the ‘seals-world’ website (http://www.seals-world.com/seal-species/), namely leopard seal, harp seal, harbour seal, gray seal, elephant seal, weddell seal and monk seal, to generate the word vector for the unseen seal class. We use the decaf feature that is provided at the AwA website (http://attributes.kyb.tuebingen.mpg.de/) and apply libsvm to train a linear kernel SVM with probability estimate outputs. All other parameters in libsvm are set to their default values. For training the SVR mapping, we apply the liblinear toolbox. For semantic graph construction, we choose different values of $k$ for seen classes and unseen classes when searching for the $k$ nearest neighbors. That is, we first construct a subgraph over the seen classes with its own value of $k$, and we symmetrize the similarity matrix of this seen subgraph. We then connect each unseen class with its most similar seen classes according to the cosine similarity in the word space. This ensures that each unseen class is connected to the seen subgraph and that there is no isolated unseen class node in the semantic graph. The code of our method can be found at https://sites.google.com/site/zhenyongfu10/.
Table 1 compares the zero-shot classification performance, measured by area under ROC curve (AUC) scores, for the ten individual test classes and their average. The last column in Table 1 gives the corresponding average multi-class classification accuracies. In DS-ZSL, ConSE and our method, each test image is connected with the same number of seen classes. From Table 1, we can see that the proposed semantic graph based method achieves the best AUC results on six individual test classes and the best average multi-class classification accuracy. As for the average AUC over the ten test classes, the results of the direct similarity-based method, SVR+NN and our method are almost the same. SVR+NN achieves the best average AUC result, but its average multi-class classification accuracy is the lowest.
Comparison with attribute-based ZSL. We also compare our result with the state-of-the-art results of attribute-based ZSL methods on the AwA dataset, including Lampert et al.’s DAP and IAP and Akata et al.’s label-embedding method. We list the average multi-class classification accuracies in Table 2. Overall, compared to the state-of-the-art attribute-based ZSL methods, our proposed method achieves better or comparable performance, especially compared to DAP and IAP. It should be noted that all the attribute-based ZSL methods rely on well-defined visual attributes and category-attribute relationships. In contrast, our method does not depend on manually defined visual attributes; instead, we only exploit the ‘free’ semantic word space learned from linguistic knowledge bases, without the need for any manual annotation of the AwA classes. This is thus a very encouraging result. If we use the given visual attributes on AwA for the similarity computation, we obtain an accuracy of 49.5%, which is much higher than that of the existing attribute-based methods.
| Approach | Semantic space | Mean accuracy (in %) |
|---|---|---|
| DAP | attribute | 40.5 / 41.4 |
| IAP | attribute | 27.8 / 42.2 |
| ALE/HLE/AHLE | attribute | 37.4 / 39.0 / 43.5 |
| Our method | word vector / attribute | 43.1 / 49.5 |
Parameter sensitivity. Since DS-ZSL, ConSE and our method share one parameter, i.e. the number of top similar seen classes that a test image is connected to, we analyze the effect of setting different values of this parameter for the three methods. From Fig. 4, we can see that DS-ZSL is heavily affected by the number of seen classes connected to the test image, while ConSE and our method are more stable. In particular, our method is almost unaffected by this parameter. That is because, through the more robust semantic graph, our method can reduce the influence of the noisy seen classes that are inevitably included as the value of the parameter increases.
Running time comparison. We also test the running time of DS-ZSL, ConSE and our method with respect to different numbers of test images. There are 6180 test images in total on AwA. They are divided into 10 folds, and we test an increasing number of folds of test images, i.e. from 618 to 6180 images, and show the results in Fig. 4. We run each algorithm 100 times on a PC with a 3.9GHz CPU and 16GB of memory and report the average result. From Fig. 4, we can see that the running times of all three methods grow linearly and that our method is significantly faster than the other two, especially for large numbers of test images.
In this work, we have introduced a novel zero-shot learning framework based on a semantic graph. The proposed method models the relationship among visual categories using the semantic graph and then performs zero-shot learning through an absorbing Markov chain process on the semantic graph. We have shown experimentally that our method is more effective and more stable than the alternative bipartite graph based methods.
- [1] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 819–826. IEEE, 2013.
- [2] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
- [3] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531, 2013.
- [4] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
- [5] A. Farhadi, I. Endres, and D. Hoiem. Attribute-centric recognition for cross-category generalization. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2352–2359. IEEE, 2010.
- [6] A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. Describing objects by their attributes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1778–1785. IEEE, 2009.
- [7] V. Ferrari and A. Zisserman. Learning visual attributes. In NIPS, 2007.
- [8] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
- [9] C. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot learning of object categories. 2013.
- [10] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 951–958. IEEE, 2009.
- [11] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- [12] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
- [13] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. In ICLR, 2014.
- [14] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image representations using convolutional neural networks. In CVPR, 2014.
- [15] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, volume 3, pages 5–2, 2009.
- [16] M. Rohrbach, M. Stark, and B. Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1641–1648. IEEE, 2011.
- [17] M. Rohrbach, M. Stark, G. Szarvas, I. Gurevych, and B. Schiele. What helps where–and why? Semantic relatedness for knowledge transfer. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 910–917. IEEE, 2010.
- [18] W. J. Scheirer, N. Kumar, P. N. Belhumeur, and T. E. Boult. Multi-attribute spaces: Calibration for attribute fusion and similarity search. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2933–2940. IEEE, 2012.
- [19] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943, 2013.