1 Introduction
There are around 30,000 human-distinguishable basic object classes [Biederman(1987)] and many more subordinate ones. A major barrier to progress in visual recognition is thus collecting training data for many classes. Zero-shot learning (ZSL) strategies have therefore gained increasing interest as a route to sidestep this prohibitive cost, and as a way to represent and recognise new categories that emerge over time. To classify instances from a class with no examples, ZSL exploits knowledge transferred from a set of seen (auxiliary) classes to unseen (test) classes, typically via an intermediate semantic representation such as attributes. This has recently been explored at large scale on ImageNet [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov, Rohrbach et al.(2012)Rohrbach, Stark, and Schiele].

Prior zero-shot learning methods have assumed that class labels on each instance are mutually exclusive, i.e., multi-class single-label classification. Nevertheless, much real-world data is intrinsically multi-label. For example, an image on Flickr often contains multiple objects against a cluttered background, thus requiring more than one label to describe its content. The need for zero-shot learning is even more acute in the multi-label case. This is because different labels are often correlated (e.g. cows often appear on grass), and to better predict the labels of an image this correlation must be modelled. However, for $c$ labels there are $2^c$ possible multi-label combinations, and collecting sufficient training samples for each combination to learn the label correlations is infeasible. It is thus surprising that there is little if any existing work on multi-label zero-shot learning. Is this because there is a trivial extension of existing single-label ZSL approaches to the new problem? By assuming the labels are independent of one another, it is indeed possible to decompose a multi-label ZSL problem into multiple single-label ZSL problems and solve them with existing single-label ZSL methods. However, this does not exploit label correlation, and we demonstrate in this work that this naive extension leads to very poor label prediction for unseen classes. Any attempt to model this correlation, in particular for unseen classes in the zero-shot setting, is extremely challenging.
In this paper, a novel framework for multi-label zero-shot learning is proposed. Our framework is based on transfer learning: given a training/auxiliary dataset containing labelled images, and a test/target dataset with a set of unseen labels/classes (i.e. none of the labels appear in the training set), we aim to learn a multi-label classification model from the training set and generalise/transfer it to the test set with unseen labels. This knowledge transfer is achieved using an intermediate semantic representation in the form of skip-gram word vectors [Mikolov et al.(2013a)Mikolov, Chen, Corrado, and Dean, Mikolov et al.(2013b)Mikolov, Sutskever, Chen, Corrado, and Dean] learned from linguistic knowledge bases. This representation is shared between the training and test classes, thus making the transfer possible.

More specifically, our framework has two main components: multi-output deep regression (MulDR) and zero-shot multi-label prediction (ZSMLP). MulDR is a nine-layer neural network that exploits the widely used convolutional neural network (CNN) layers [Razavian et al.(2014)Razavian, Sullivan, and Carlsson], and includes two multi-output regression layers as the final layers. It learns from auxiliary data an explicit and direct mapping from raw image pixels to the linguistic representation defined by the skip-gram language model [Mikolov et al.(2013a)Mikolov, Chen, Corrado, and Dean, Mikolov et al.(2013b)Mikolov, Sutskever, Chen, Corrado, and Dean]. With MulDR, each test image is projected into the semantic word space, where the unseen labels and their combinations can be represented as data points without the need to collect any visual data. ZSMLP addresses the multi-label ZSL problem in this semantic word space. Specifically, we note that in this space any label combination can be synthesised. We thus exhaustively synthesise the power set of all possible prototypes (i.e., multi-label combinations), which are treated as if they were a set of labelled instances in the space. With this synthetic dataset, we extend conventional multi-label algorithms [Kong et al.(2013)Kong, Ng, and Zhou, Zhang and Zhou(2013), Wu and Zhang(2013), Hariharan et al.(2012)Hariharan, Vishwanathan, and Varma] to propose two new multi-label algorithms: direct multi-label zero-shot prediction (DMP) and transductive multi-label zero-shot prediction (TraMP). However, since MulDR is learned using the auxiliary classes/labels, it may not generalise well to the unseen classes/labels. To overcome this problem, we further exploit self-training to adapt MulDR to the test classes and improve its generalisation capability.

2 Related Work
Multi-label classification Multi-label classification has been widely studied – for a review of the field please see [Zhang and Zhou(2013), Wu and Zhang(2013)]. Most previous studies assume plentiful training data. Recently, efforts have been made to relax this assumption. Kong et al. [Kong et al.(2013)Kong, Ng, and Zhou] studied transductive multi-label learning with a small set of training instances. Hariharan et al. [Hariharan et al.(2012)Hariharan, Vishwanathan, and Varma] explored the label correlations of auxiliary data via a multi-label max-margin formulation and incorporated such label correlations as a prior for the multi-class zero-shot learning problem. However, none of these addresses the multi-label zero-shot learning problem tackled in this work.
Zero-shot learning Multi-class single-label zero-shot learning has now been widely studied using attribute-based intermediate semantic layers [Ferrari and Zisserman(2007), Palatucci et al.(2009)Palatucci, Hinton, Pomerleau, and Mitchell, Lampert et al.(2009)Lampert, Nickisch, and Harmeling, Fu et al.(2014a)Fu, Hospedales, Xiang, Fu, and Gong, Fu et al.(2014b)Fu, Hospedales, Xiang, Gong, and Yao, Chen et al.(2013)Chen, Gong, Xiang, and Loy] or data-driven representations [Fu et al.(2013)Fu, Hospedales, Xiang, and Gong, Fu et al.(2012)Fu, Hospedales, Xiang, and Gong, Sharmanska et al.(2012)Sharmanska, Quadrianto, and Lampert, Layne et al.(2014)Layne, Hospedales, and Gong]. However, attribute-based strategies have limited ability to scale to many classes because the attribute ontology has to be manually defined. To address this limitation, Socher et al. [Socher et al.(2013)Socher, Ganjoo, Sridhar, Bastani, Manning, and Ng] first employed a linguistic model [Huang et al.(2012)Huang, Socher, Manning, and Ng] as the intermediate semantic representation. However, this does not model the syntactic and semantic regularities in language [Mikolov et al.(2013b)Mikolov, Sutskever, Chen, Corrado, and Dean] which allow vector-oriented reasoning. Such reasoning is critical for our ZSMLP to synthesise label-combination prototypes in the semantic word space: for example, the projection of an instance labelled with both 'sea' and 'sky' should be much closer to the sum of the word vectors for 'sea' and 'sky' than to either word vector alone. For this purpose, we employ the skip-gram language model to learn the word space, which has been shown to capture such syntactic regularities [Mikolov et al.(2013a)Mikolov, Chen, Corrado, and Dean, Mikolov et al.(2013b)Mikolov, Sutskever, Chen, Corrado, and Dean]. Frome et al. [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov] also used the skip-gram language model. They learned a visual-semantic embedding model (DeViSE) for single-label zero-shot learning by projecting both visual and semantic information of auxiliary data into a common space.
However, there are a number of fundamental differences between their work and ours: (1) Compared with the DeViSE model, the mapping between images and the semantic word space learned by our MulDR is more explicit and direct. We show in our experiments that this leads to better projections and thus better classification performance. (2) Our MulDR can generalise better to the unseen test classes thanks to our self-training based transductive learning strategy. (3) Most critically, we address the multi-label ZSL problem whilst they only focused on the single-label ZSL problem. Additionally, zero-shot learning can be seen as a generalisation of class-incremental learning (CIL) [Zhou and Chen(2002), Da et al.(2014)Da, Yu, and Zhou] or lifelong learning [Pentina and Lampert(2014)].
Our Contributions Overall, we make the following contributions: (1) As far as we know, this is the first work that addresses the multi-label zero-shot learning problem. (2) Our multi-output deep regression framework exploits correlations across dimensions while learning a direct mapping from images to the intermediate skip-gram linguistic word space. (3) Within the linguistic space, two algorithms are proposed for multi-label ZSL. (4) We propose a simple self-training strategy to make the deep regression model generalise better to the unseen test classes. (5) Experimental results on benchmark multi-label datasets show the efficacy of our framework for multi-label ZSL over a variety of baselines.
3 Methodology
3.1 Problem setup
Suppose we have two datasets – source/auxiliary and target/test. The auxiliary dataset has $n_s$ training instances and the test dataset has $n_t$ test instances. We use $S$ and $T$ to denote the index sets for instances in the auxiliary and test datasets. $X_s$ and $X_t$ are the raw image data of all auxiliary and test instances respectively. $Z_s$ and $Z_t$ are the intermediate semantic representations of the auxiliary and test instances – in our case $z_i$ is a $d$-dimensional continuous word vector for instance $i$ in the skip-gram language model [Mikolov et al.(2013b)Mikolov, Sutskever, Chen, Corrado, and Dean] space. $Y_s$ and $Y_t$ are the label vectors for the auxiliary and test datasets to be predicted respectively.

The possible textual labels for each instance in $S$ and $T$ are denoted $\mathcal{W}_s$ and $\mathcal{W}_t$ respectively, where $c_s$ and $c_t$ are the total numbers of classes/labels in each dataset. Given a label space of $c$ binary labels, an instance $i$ can be tagged with any of the $2^c$ possible label subsets via $y_i \in \{0,1\}^c$, where $y_{ij} = 1$ means instance $i$ has label $j$, and $y_{ij} = 0$ means otherwise. Denoting the power sets of the textual labels $\mathcal{W}_s$ and $\mathcal{W}_t$ as $\mathcal{P}_s$ and $\mathcal{P}_t$, for multi-label classification we need to find the optimal class label vector $y_i^*$ for each test instance $i$ in the power set space $\mathcal{P}_t$. At training time $X_s$, $Z_s$ and $Y_s$ are all observed. At test time only the new class names $\mathcal{W}_t$ and the images $X_t$ are given; the representations $Z_t$ and multi-label vectors $Y_t$ are to be predicted.
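To make the size of the prediction space concrete, the label power set can be enumerated directly. The sketch below (Python, with hypothetical unseen label names used purely for illustration) lists the $2^c$ candidate label vectors that the prediction step later searches over.

```python
from itertools import product

def label_power_set(labels):
    """Enumerate all 2^c binary label vectors for c labels.

    Each element is a tuple y with y[j] = 1 meaning label j is present."""
    return list(product((0, 1), repeat=len(labels)))

# Hypothetical unseen test labels (illustration only)
labels_t = ["grass", "sky", "mountain"]
Y_power = label_power_set(labels_t)  # 2^3 = 8 candidate label sets
```

Exhaustive enumeration is only feasible for small label vocabularies, which is precisely why pruning unlikely prototypes is raised as future work in the conclusion.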
3.2 Learning a semantic word space
The semantic representations $Z_s$ and $Z_t$ are the projections of each instance into a linguistic word vector space $\mathcal{Z}$. The semantic word vector space is learned using the state-of-the-art skip-gram language model [Mikolov et al.(2013a)Mikolov, Chen, Corrado, and Dean, Mikolov et al.(2013b)Mikolov, Sutskever, Chen, Corrado, and Dean] on all English Wikipedia articles (only articles are used, without any user talk/discussion pages; as of 13 Feb. 2014, this comprises 2.9 billion words and a vocabulary of 4.33 million single- and bi/tri-gram terms). The space represents almost all available English vocabulary and is thus potentially much more effective than human annotators at measuring subtle similarities and differences between any two textual labels. Furthermore, $\mathcal{Z}$ encodes the syntactic and semantic regularities in language [Mikolov et al.(2013b)Mikolov, Sutskever, Chen, Corrado, and Dean] which allow vector-oriented reasoning via its 'compositionality' property. This property enables the critical capability of synthesising the exhaustive set of test label combinations $\mathcal{P}_t$. Note that cosine distance is used in $\mathcal{Z}$ because of its robustness against noise [Mikolov et al.(2013a)Mikolov, Chen, Corrado, and Dean, Mikolov et al.(2013b)Mikolov, Sutskever, Chen, Corrado, and Dean]. We use $g(\cdot)$ to represent the skip-gram projection from textual concepts (words) in $\mathcal{W}$ to vectors in $\mathcal{Z}$. Such a semantic space thus captures the correlations between labels without any need to collect visual examples – the meaning of multiple labels for one instance can be inferred from the sum of the word vector projections of its individual labels. Formally, we have

$Z_s = Y_s V_s, \qquad Z_t = Y_t V_t, \qquad (1)$

where $V_s = [g(w_1); \dots; g(w_{c_s})]$ and $V_t = [g(w_1); \dots; g(w_{c_t})]$ are the word vector projections of the label class sets in the auxiliary and test datasets respectively. The next section discusses how to learn a predictive model for $Z$ given visual data $X$.
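The compositionality behind Eq (1) can be illustrated with a small sketch. The three-dimensional vectors below are made up for illustration; in the paper the projections $g(\cdot)$ are 100-dimensional skip-gram vectors trained on Wikipedia.

```python
import numpy as np

# Toy stand-ins for the skip-gram projection g(.) of three labels
g = {
    "sea":  np.array([0.9, 0.1, 0.0]),
    "sky":  np.array([0.1, 0.9, 0.0]),
    "tree": np.array([0.0, 0.1, 0.9]),
}

def compose(label_set):
    """Prototype for a multi-label combination: sum of member word vectors."""
    return np.sum([g[w] for w in label_set], axis=0)

def cosine_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

z_sea_sky = compose(["sea", "sky"])  # prototype for the {sea, sky} label set
```

Under cosine distance the composed prototype stays close to its member labels and far from unrelated ones, which is what makes synthesised prototypes usable as stand-in training points.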
3.3 Multi-output deep regression

We design a multi-output deep regression (MulDR) model to predict the semantic representation $z \in \mathcal{Z}$ from images $x \in \mathcal{X}$, where $\mathcal{X}$ is the space of raw image pixel intensity values. MulDR is inspired by the recent success of deep convolutional neural network (CNN) features [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Sermanet et al.(2014)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun] as well as the importance of modelling correlations within the semantic representation. The MulDR model is a neural network composed of nine layers: layers 1–5 are convolutional layers; layers 6–8 are fully connected layers; layer 9 is a linear mapping layer of least squares regressors.

Two key components contribute to the effectiveness of MulDR. The first component (layers 1–7) provides state-of-the-art feature extraction for many computer vision tasks [Razavian et al.(2014)Razavian, Sullivan, and Carlsson]. It directly maps the raw image to powerful CNN features (this component has more than 148.3 million parameters, so to prevent overfitting on a small auxiliary dataset it is trained on ImageNet with its 1.2 million labelled instances [Sermanet et al.(2014)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun]), avoiding the pitfall of poor performance due to a wrong choice of features for a given dataset. The second component (layers 8–9) provides the multi-output neural network (NN) regressors. Different from [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton, Sermanet et al.(2014)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun], where the 8th layer is an output layer for classification, the 8th layer in our model is a fully connected layer of 1024 neurons with Rectified Linear Unit (ReLU) activation functions. This soft-thresholding nonlinearity has better generalisation properties than the widely used tanh activation units. Such a fully connected layer helps exploit correlations among the different dimensions of the semantic word space. The final (9th) layer of least squares regressors provides an estimate of the 100-dimensional semantic representation in the space $\mathcal{Z}$.

To apply this neural network, we resize all images in $X_s$ and $X_t$ to a fixed pixel size. The parameters of the first component are pre-trained using ImageNet [Sermanet et al.(2014)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun], while the parameters of the second component are trained by gradient descent with the auxiliary data $X_s$ and $Z_s$. At test time, MulDR predicts the semantic word vector $\hat{z}_i$ for each unseen image $i$. Here the hat operator indicates that the variable is estimated.
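The regression head (layers 8–9) can be sketched as follows. All sizes here are toy stand-ins (the paper uses 4096-d CNN features, 1024 hidden units and a 100-d word space), the CNN features and targets are random, and for brevity the final least-squares layer is solved in closed form on a fixed random hidden layer rather than trained jointly by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for CNN features (layers 1-7 output) of n auxiliary images
n, d_cnn, d_hidden, d_word = 200, 512, 64, 10
F = rng.standard_normal((n, d_cnn))
Z = rng.standard_normal((n, d_word))      # target skip-gram vectors

# Layer 8: fully connected + ReLU (randomly initialised here)
W8 = rng.standard_normal((d_cnn, d_hidden)) * 0.05
H = np.maximum(F @ W8, 0.0)

# Layer 9: least-squares regressors onto the word space
W9, *_ = np.linalg.lstsq(H, Z, rcond=None)
Z_hat = H @ W9                            # estimated semantic representations
```

The point of the shared hidden layer is that all word-space dimensions are regressed through common features, so correlations across output dimensions are exploited, unlike per-dimension regressors such as SVR.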
3.4 Zero-shot multi-label prediction
Given the estimated semantic representations $\hat{Z}_t$, we need to infer the labels of the test set. A straightforward solution is to decompose the multi-label classification problem into multiple independent binary classification problems, which is equivalent [Hastie et al.(2009)Hastie, Tibshirani, and Friedman] to directly solving Eq (1) by

$\hat{Y}_t = \hat{Z}_t V_t^{\dagger}, \qquad (2)$

where $V_t^{\dagger}$ is the Moore–Penrose pseudo-inverse of $V_t$. Eq (2) directly predicts the labels of each instance by a linear transformation of the intermediate representation $\hat{z}_i$. In a way, this can be considered an extension of 'Direct Attribute Prediction (DAP)' [Lampert et al.(2009)Lampert, Nickisch, and Harmeling] to the case of multiple labels and a continuous representation. We thus term this method exDAP. However, it does not exploit the multi-label correlations and thus has very limited expressive power [Zhang and Zhou(2007), Elisseeff and Weston(2001)]. Hence we propose two more principled multi-label zero-shot algorithms – Direct Multi-label zero-shot Prediction (DMP) and Transductive Multi-label zero-shot Prediction (TraMP).

Direct Multi-label zero-shot Prediction (DMP) Thanks to the compositionality property of $\mathcal{Z}$, label correlation can be explored by synthesising the representation of every possible multi-label annotation in $\mathcal{P}_t$: that is, the prototypes $Z_p = Y_p V_t$, where $Y_p$ is the power set label matrix. Thus Eq (2) is replaced by a nearest neighbour (NN) classifier using all the synthesised instances as training data. The label set of instance $i$ with representation $\hat{z}_i$ is then assigned as $\hat{y}_i = y_{j^*}$, where $j^*$ is the index computed by

$j^* = \arg\min_{j \in \mathcal{P}_t} D(\hat{z}_i, z_j), \qquad (3)$

where $D(\cdot, \cdot)$ refers to the cosine distance.
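The contrast between exDAP (Eq (2)) and DMP (Eq (3)) can be sketched with made-up word vectors and toy dimensions; the prototype synthesis uses the same compositional sum as Eq (1), and the 0.5 threshold for exDAP is an assumption for illustration.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
c, d = 4, 10                       # toy: 4 unseen labels, 10-d word space
V_t = rng.standard_normal((c, d))  # stand-in word vectors of unseen labels

# Synthesise one prototype per label combination (power set minus empty set)
Y_p = np.array(list(product((0, 1), repeat=c))[1:])  # (2^c - 1, c)
Z_p = Y_p @ V_t

def cosine_dist(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Pretend the regressor projected a test image near the 'labels 0 and 2' combo
y_true = np.array([1, 0, 1, 0])
z_hat = y_true @ V_t + 0.05 * rng.standard_normal(d)

# exDAP (Eq 2): independent recovery via the pseudo-inverse, then threshold
y_exdap = (z_hat @ np.linalg.pinv(V_t) > 0.5).astype(int)

# DMP (Eq 3): nearest synthesised prototype under cosine distance
j_star = int(np.argmin([cosine_dist(z_hat, z_p) for z_p in Z_p]))
y_dmp = Y_p[j_star]
```

With correlated labels and imperfect projections the two behave differently: exDAP thresholds each label in isolation, while DMP can only return label combinations that actually exist in the power set.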
Transductive Multi-label zero-shot Prediction (TraMP) DMP can explore label correlations, but only insofar as they are encoded by the compositionality of the prototypes in $\mathcal{Z}$. It would be more desirable if the manifold structure of the given test instances could also be used to improve multi-label zero-shot learning, i.e. via transductive learning. We therefore propose TraMP, which can be viewed as an extension of the TRAM model in [Kong et al.(2013)Kong, Ng, and Zhou] to zero-shot learning, or as a semi-supervised generalisation of Eq (3). The key idea is to use the power set of prototypes $Z_p$ as a known label set and to perform transductive label propagation from $Z_p$ to the inferred semantic representations $\hat{Z}_t$. We denote the index set of the power set prototypes as $P$ and its corresponding class label set as $Y_P$. Specifically, we define a k-nearest-neighbour (kNN) graph among the test instances $\{\hat{z}_i\}_{i \in T}$ and prototypes $\{z_j\}_{j \in P}$. For any two instances $i$ and $j$, where $i \in T$ and $j \in T \cup P$,

$w_{ij} = \begin{cases} \exp(-D(\hat{z}_i, z_j)) / \lambda_i & \text{if } j \in N_k(i), \\ 0 & \text{otherwise,} \end{cases} \qquad (4)$

where $N_k(i)$ indicates the index set of the $k$ nearest neighbours of $i$ from $T \cup P$, and $\lambda_i$ is the normalisation term ensuring $\sum_j w_{ij} = 1$. We define $W = [w_{ij}]$ and partition the matrix into the blocks $W = [W_{TT}, W_{TP}]$; the label sets of the test instances can then be inferred by the following closed-form solution [Kong et al.(2013)Kong, Ng, and Zhou]:

$Y_T = (I - W_{TT})^{-1} W_{TP} Y_P. \qquad (5)$
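A simplified sketch of the propagation step (Eqs (4)–(5)) follows. The exponential kernel over cosine distances and the row normalisation are assumptions consistent with the description above; a full implementation would follow the TRAM formulation of [Kong et al.(2013)Kong, Ng, and Zhou].

```python
import numpy as np

def tramp(Z_test, Z_proto, Y_proto, k=2):
    """Label propagation sketch: build a kNN transition matrix from each test
    point to all points (test + prototypes), then solve
    Y_T = (I - W_TT)^{-1} W_TP Y_P in closed form (Eq 5)."""
    nt = len(Z_test)
    Z_all = np.vstack([Z_test, Z_proto])
    # cosine distance from every test point to every point
    sims = Z_test @ Z_all.T
    norms = (np.linalg.norm(Z_test, axis=1)[:, None]
             * np.linalg.norm(Z_all, axis=1)[None, :])
    D = 1.0 - sims / norms
    W = np.zeros((nt, len(Z_all)))
    for i in range(nt):
        D[i, i] = np.inf                 # exclude self-edges
        nbrs = np.argsort(D[i])[:k]      # N_k(i) as in Eq 4
        w = np.exp(-D[i, nbrs])
        W[i, nbrs] = w / w.sum()         # each row sums to one
    W_TT, W_TP = W[:, :nt], W[:, nt:]
    return np.linalg.solve(np.eye(nt) - W_TT, W_TP @ Y_proto)

# Two prototypes (labels A, B) and two test points near prototype A
Z_proto = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_proto = np.eye(2)
Z_test = np.array([[0.9, 0.1], [1.0, 0.05]])
Y_soft = tramp(Z_test, Z_proto, Y_proto, k=2)
```

Because test points also link to each other, a point far from every prototype can still receive labels through nearby test points, which is exactly the manifold information DMP ignores.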
3.5 Generalisation of multi-output deep regression
As described above, our framework consists of two key steps: applying the multi-output deep regression (MulDR) model to obtain the estimated semantic representations $\hat{Z}_t$, followed by applying either DMP or TraMP to predict $Y_t$. There is, however, an unresolved issue: our MulDR is learned from auxiliary data with a different set of labels from the target/test data. This projection model is thus not guaranteed to project a test image accurately to be near its ground truth label vector in the semantic word space. For example, if MulDR is learned to project images of cats and dogs to the word vector representations of "cat" and "dog", it may not accurately project an image containing a person and a chair to the combined word vector of "person" and "chair" when neither label was available for learning the MulDR model. Any regression model will have such a generalisation problem, especially when the test data are distributed differently from the auxiliary data. To make MulDR generalise better to the target domain, we transductively exploit the predicted semantic representations $\hat{Z}_t$ to update the power set prototype matrix $Z_p$. In this way the target data are better aligned with the synthesised label combination vectors in the semantic word space, thus helping generalise MulDR to the target domain. This can be viewed as a semi-supervised learning (SSL) method starting from one instance per label combination, if the synthesised prototypes themselves are treated as instances. We therefore take a simple SSL strategy and perform one step of self-training [Fu et al.(2013)Fu, Hospedales, Xiang, and Gong] to refine each prototype $z_j$ of $Z_p$:

$z_j' = \frac{1}{k'} \sum_{i \in N_{k'}(z_j)} \hat{z}_i, \qquad (6)$

where $Z_p'$ is the updated prototype matrix and $k'$ is the number of nearest neighbours selected (note that $k'$ does not necessarily take the same value as $k$ in Eq (4)). We use the updated prototype matrix $Z_p'$ to compute DMP (Eq (3)) and TraMP (Eqs (4) and (5)) in our framework.
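The self-training step of Eq (6) can be sketched as below: each synthesised prototype moves to the mean of its $k'$ nearest projected test instances. Euclidean nearest neighbours are used here for simplicity, whereas the paper's word space uses cosine distance.

```python
import numpy as np

def self_train_prototypes(Z_proto, Z_test_hat, k=2):
    """One self-training step: replace each prototype with the mean of its
    k nearest projected test instances (sketch of Eq 6)."""
    Z_new = np.empty_like(Z_proto)
    for j, p in enumerate(Z_proto):
        dists = np.linalg.norm(Z_test_hat - p, axis=1)
        nbrs = np.argsort(dists)[:k]
        Z_new[j] = Z_test_hat[nbrs].mean(axis=0)
    return Z_new

# A prototype whose nearby test projections are systematically shifted
Z_proto = np.array([[1.0, 0.0]])
Z_test_hat = np.array([[2.0, 0.0], [2.1, 0.0], [5.0, 5.0]])
Z_updated = self_train_prototypes(Z_proto, Z_test_hat, k=2)
```

The updated prototypes then simply replace the originals in Eqs (3)–(5), which compensates for a systematic offset between the regressor's projections and the synthesised prototypes.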
4 Experiments
Datasets Two popular multi-label datasets – Natural Scene [Zhang and Zhou(2007)] and IAPR TC-12 [Grubinger(2007)] – are used to evaluate our framework. Natural Scene consists of natural scene images where each image can be labelled with any combination of desert, mountains, sea, sunset and trees, and a substantial proportion of the dataset is multi-labelled. For multi-label zero-shot learning on Natural Scene, we use a multi-class single-label dataset – the Scene dataset [Oliva and Torralba(2001)] – as the auxiliary dataset, which is labelled with a non-overlapping set of labels such as street, coast and highway. IAPR TC-12 consists of images with labels hierarchically organised into six main branches: humans, animals, food, landscape-nature, man-made and other. Our experiments consider the subset of the landscape-nature branch and use the most frequent labels from this branch, for which a large proportion of the test images are multi-labelled. For zero-shot classification on this dataset, we employ both Scene and Natural Scene as the auxiliary dataset.
4.1 Experimental setup
Evaluation metrics (a) Hamming Loss: measures the fraction of mismatches between estimated and ground-truth labels; (b) Micro-F1 [Kang et al.(2006)Kang, Jin, and Sukthankar]: combines the micro-average of Precision (Micro-Precision) and the micro-average of Recall (Micro-Recall) with equal importance; (c) Ranking Loss: given the ranked list of predicted labels, measures the number of label pairs that are incorrectly ordered, by comparing their confidence scores with the ground-truth labels; (d) Average Precision (AP): given a ranked list of classes, measures the area under the precision–recall curve. These four criteria evaluate very different aspects of multi-label classification performance, and very few algorithms achieve the best performance on all of them. High values are preferred for Micro-F1 and AP, and vice versa for Ranking and Hamming Loss; for ease of interpretation we therefore report 1 − Micro-F1 and 1 − AP, so that smaller values are preferred for all metrics.
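For reference, Hamming loss and Micro-F1 can be computed as in the minimal NumPy sketch below (libraries such as scikit-learn provide equivalent implementations); the toy matrices assume at least one positive prediction so the precision/recall denominators are non-zero.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of label slots predicted incorrectly."""
    return float(np.mean(Y_true != Y_pred))

def micro_f1(Y_true, Y_pred):
    """Micro-averaged F1: pool TP/FP/FN over all labels before averaging."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1))
    fp = np.sum((Y_true == 0) & (Y_pred == 1))
    fn = np.sum((Y_true == 1) & (Y_pred == 0))
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Toy example: 2 instances, 3 labels, one missed positive
Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
```

Because micro-averaging pools counts over all labels, frequent labels dominate the score, which is why it complements the per-slot Hamming loss on unbalanced label distributions.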
Competitors Our full framework includes two main novel components: MulDR and DMP/TraMP. To evaluate the effectiveness of these two components, we define several competitors by replacing each component with possible alternatives. (1) SVR+exDAP: Support Vector Regression (SVR) [Chang and Lin(2001)] is used to learn and infer the representation of each test instance (for fair comparison, we use the CNN features output by the first component, layers 1–7, of our MulDR framework as the low-level features for linear SVR, with the cost parameter set to 10). Combined with exDAP (Eq (2)), this is a straightforward generalisation of [Lampert et al.(2009)Lampert, Nickisch, and Harmeling, Lampert et al.(2013)Lampert, Nickisch, and Harmeling] to multi-label zero-shot learning. (2) SVR+DMP: SVR replaces MulDR and we further use DMP (Eq (3)) for classification; it thus serves as a reference for comparing DMP with exDAP. (3) DeViSE+DMP: We use DeViSE [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov] to learn the visual-semantic embedding into which the power set is projected, and then use Eq (3) (i.e., DMP) for final labelling in the embedding space. It thus corresponds to the extension of [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov] to multi-label zero-shot learning. (4) MulDR+exDAP: Our MulDR is used to learn the visual-semantic embedding, with exDAP for multi-label classification; it thus allows MulDR to be compared with SVR. (5) MulDR+DMP/TraMP: Our method with either of the two proposed ZSL algorithms. For fair comparison, all results use the self-training strategy in Eq (6) to update the prototypes.
4.2 Results
Our MulDR model vs. alternatives The results obtained by the various competitors on Natural Scene and IAPR TC-12 are shown in Fig. 1. We first compare our MulDR with the alternative SVR and DeViSE models for learning the projection from raw images to the semantic word space. It is evident that our MulDR significantly improves over the conventional SVR regression model [Lampert et al.(2009)Lampert, Nickisch, and Harmeling, Lampert et al.(2013)Lampert, Nickisch, and Harmeling] (MulDR+DMP > SVR+DMP, MulDR+exDAP > SVR+exDAP). This is because SVR treats each of the 100 semantic word space dimensions independently, whilst our multi-output regression model, as well as the DeViSE model [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov], captures the correlations between the different dimensions. Compared to the DeViSE model (MulDR+DMP vs. DeViSE+DMP), our regression model is also clearly better on three of the four evaluation metrics, suggesting that a direct and explicit mapping between the image space and the semantic word space is a better strategy. The only case where a better result is obtained by DeViSE+DMP is on the IAPR TC-12 dataset with Hamming Loss, but this result is worth further discussion. In particular, we note that Hamming Loss treats false alarm and missed prediction errors equally. However, for the multi-label classification problem, the distribution of labels is very unbalanced and each image usually has only a small portion of labels compared to the whole label set. This is particularly the case for IAPR TC-12. The good result of DeViSE on IAPR TC-12 – a better Hamming Loss but worse Micro-F1 and Ranking Loss – indicates that it mostly predicts no labels at all, being biased against making any prediction. This also explains the qualitative results of DeViSE shown in Table 1.
Our DMP/TraMP vs. exDAP Given the same regression model, we compare our DMP against the alternative exDAP. The results (SVR+DMP > SVR+exDAP, MulDR+DMP > MulDR+exDAP) show that our algorithm, which synthesises label combinations in order to encode multi-label correlations, is superior to exDAP, which treats each label independently and decomposes the multi-label classification problem into multiple single-label classification problems. Comparing the two proposed algorithms – DMP and TraMP – the main difference is that TraMP transductively exploits the manifold structure of the test data for label prediction. Figure 1 shows that this transductive label prediction algorithm is better overall: TraMP has much better Micro-F1, Ranking Loss and AP than DMP. The NN classifier (Eq (3)) used in DMP directly minimises the Hamming Loss, which explains why TraMP is slightly worse than DMP on IAPR TC-12 on this metric.
Effectiveness of the self-training step In this experiment we compare the results of our DMP and TraMP with and without the self-training step in Eq (6). We use '−' and '+' to indicate algorithms without and with self-training respectively. Both DMP and TraMP use MulDR to infer the word vectors $\hat{Z}_t$. As shown in Fig. 2, the self-training step clearly has a positive influence on multi-label prediction performance. This result suggests that this simple step helps the MulDR model learned from the auxiliary data generalise better to the target data.
Qualitative results Table 1 gives a qualitative comparison of the multi-label annotations produced by our DMP and TraMP with those of DeViSE on IAPR TC-12. As discussed, DeViSE is too conservative on this dataset and assigns no label to most instances.
Table 1: Example multi-label annotations on IAPR TC-12 (four test images).

              Image 1                    Image 2                          Image 3                          Image 4
Ground truth  sand-beach, mountain, sky  landscape-nature, mountain, sky  grass                            sand-beach, sky
MulDR+DMP     sand-beach, sky            landscape-nature, mountain, sky  grass                            sand-beach, sky
MulDR+TraMP   sand-beach, mountain, sky  landscape-nature, mountain, sky  grass, ground, landscape-nature  ground, sky, sand-beach
DeViSE+DMP    sky                        –                                –                                sky
5 Conclusion and future work
We have for the first time generalised zero-shot learning from the single-label to the multi-label setting. It is somewhat surprising that it turns out to be possible to exploit label correlation at test time in the zero-shot case, since there is no dataset of examples from which to learn co-occurrence statistics in the conventional way. We achieve this by introducing novel strategies to exploit the compositionality of the semantic word space, and by transductively exploiting the unlabelled test data.
Besides the proposed tailor-made multi-label algorithms – DMP and TraMP – our strategy could potentially help other existing multi-label algorithms generalise to the multi-label zero-shot learning problem. Finally, we note that many prototypes in the power set actually have an extremely low chance of occurring in the test dataset, and should not be treated in the same way as the more likely prototypes. Another line of ongoing research is therefore to investigate how to prune low-probability prototypes from the power set.

References
 [Biederman(1987)] I. Biederman. Recognition-by-components: a theory of human image understanding. Psychological Review, 1987.

 [Chang and Lin(2001)] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
 [Chen et al.(2013)Chen, Gong, Xiang, and Loy] Ke Chen, Shaogang Gong, Tao Xiang, and Chen Change Loy. Cumulative attribute space for age and crowd density estimation. In CVPR, 2013.
 [Da et al.(2014)Da, Yu, and Zhou] Qing Da, Yang Yu, and Zhi-Hua Zhou. Learning with augmented class by exploiting unlabeled data. In AAAI, 2014.
 [Elisseeff and Weston(2001)] André Elisseeff and Jason Weston. A kernel method for multi-labelled classification. In NIPS, 2001.
 [Ferrari and Zisserman(2007)] V. Ferrari and A. Zisserman. Learning visual attributes. In NIPS, December 2007.
 [Frome et al.(2013)Frome, Corrado, Shlens, Bengio, Dean, Ranzato, and Mikolov] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeffrey Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
 [Fu et al.(2012)Fu, Hospedales, Xiang, and Gong] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, and Shaogang Gong. Attribute learning for understanding unstructured social activity. In ECCV, 2012.
 [Fu et al.(2013)Fu, Hospedales, Xiang, and Gong] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, and Shaogang Gong. Learning multimodal latent attributes. TPAMI, 2013.
 [Fu et al.(2014a)Fu, Hospedales, Xiang, Fu, and Gong] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, Zhengyong Fu, and Shaogang Gong. Transductive multi-view embedding for zero-shot recognition and annotation. In ECCV, 2014a.
 [Fu et al.(2014b)Fu, Hospedales, Xiang, Gong, and Yao] Yanwei Fu, Timothy M. Hospedales, Tao Xiang, Shaogang Gong, and Yuan Yao. Interestingness prediction by robust learning to rank. In ECCV, 2014b.
 [Grubinger(2007)] Michael Grubinger. Analysis and Evaluation of Visual Information Systems Performance. PhD thesis, School of Computer Science and Mathematics, Faculty of Health, Engineering and Science, Victoria University, 2007.
 [Hariharan et al.(2012)Hariharan, Vishwanathan, and Varma] Bharath Hariharan, S. V. Vishwanathan, and Manik Varma. Efficient max-margin multi-label classification with applications to zero-shot learning. Machine Learning, 2012.
 [Hastie et al.(2009)Hastie, Tibshirani, and Friedman] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer New York Inc., 2009.
 [Huang et al.(2012)Huang, Socher, Manning, and Ng] Eric H. Huang, Richard Socher, Christopher D. Manning, and Andrew Y. Ng. Improving word representations via global context and multiple word prototypes. In ACL, 2012.
 [Kang et al.(2006)Kang, Jin, and Sukthankar] Feng Kang, Rong Jin, and Rahul Sukthankar. Correlated label propagation with application to multi-label learning. In CVPR, 2006.
 [Kong et al.(2013)Kong, Ng, and Zhou] Xiangnan Kong, M. K. Ng, and Zhi-Hua Zhou. Transductive multi-label learning via label set propagation. IEEE Transactions on Knowledge and Data Engineering, 2013.
 [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
 [Lampert et al.(2009)Lampert, Nickisch, and Harmeling] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, 2009.
 [Lampert et al.(2013)Lampert, Nickisch, and Harmeling] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE TPAMI, 2013.
 [Layne et al.(2014)Layne, Hospedales, and Gong] Ryan Layne, Timothy M. Hospedales, and Shaogang Gong. Re-id: Hunting attributes in the wild. In BMVC, 2014.
 [Mikolov et al.(2013a)Mikolov, Chen, Corrado, and Dean] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In Proceedings of Workshop at ICLR, 2013a.
 [Mikolov et al.(2013b)Mikolov, Sutskever, Chen, Corrado, and Dean] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NIPS, 2013b.
 [Oliva and Torralba(2001)] Aude Oliva and Antonio Torralba. Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV, 42, 2001.
 [Palatucci et al.(2009)Palatucci, Hinton, Pomerleau, and Mitchell] Mark Palatucci, Geoffrey Hinton, Dean Pomerleau, and Tom M. Mitchell. Zero-shot learning with semantic output codes. In NIPS, 2009.
 [Pentina and Lampert(2014)] Anastasia Pentina and Christoph H. Lampert. A PAC-Bayesian bound for lifelong learning. In ICML, 2014.
 [Razavian et al.(2014)Razavian, Sullivan, and Carlsson] Ali Sharif Razavian, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. arXiv:1403.6382v1, 2014.
 [Rohrbach et al.(2012)Rohrbach, Stark, and Schiele] Marcus Rohrbach, Michael Stark, and Bernt Schiele. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, 2012.
 [Sermanet et al.(2014)Sermanet, Eigen, Zhang, Mathieu, Fergus, and LeCun] Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
 [Sharmanska et al.(2012)Sharmanska, Quadrianto, and Lampert] Viktoriia Sharmanska, Novi Quadrianto, and Christoph H. Lampert. Augmented attribute representations. In ECCV, 2012.
 [Socher et al.(2013)Socher, Ganjoo, Sridhar, Bastani, Manning, and Ng] Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D. Manning, and Andrew Y. Ng. Zero-shot learning through cross-modal transfer. In NIPS, 2013.
 [Wu and Zhang(2013)] Le Wu and Min-Ling Zhang. Multi-label classification with unlabeled data: An inductive approach. In ACML, pages 197–212, 2013.
 [Zhang and Zhou(2007)] Min-Ling Zhang and Zhi-Hua Zhou. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 2007.
 [Zhang and Zhou(2013)] Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 2013.

 [Zhou and Chen(2002)] Zhi-Hua Zhou and Zhao-Qian Chen. Hybrid decision tree. Knowledge-Based Systems, 15(8):515–528, 2002.