1 Introduction
In standard multiclass classification settings, classes are treated as a categorical set without any extra structure. When we have side information on the structure of classes, such as semantic relatedness, we can use it to improve the classification itself, or to transfer knowledge learned from the training domain to solve problems in a new domain.
Consider a classification problem with the following 10 visual objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are many sources from which semantic information about these objects can be obtained. WordNet is a knowledge base of semantic hierarchies developed manually by linguistic experts (Miller, 1995). In WordNet, objects form a hierarchical tree (Figure 1, Left), where a child object is ‘a kind of’ its parent object. Several similarity metrics can be defined from the hierarchy (see http://maraca.d.umn.edu/similarity/measures.html), one of which is shown in Figure 1 (Middle) as a two-dimensional classical multidimensional scaling (MDS) embedding. Semantic relatedness can also be mined automatically from existing corpora, such as Wikipedia or the Google N-gram corpus, or by using search engines, where the cosine of the angle between co-occurrence vectors can serve as the similarity of two words. More recently, elaborate methods for learning vectorial representations of words have also been proposed (Huang et al., 2012; Mikolov et al., 2013; Pennington et al., 2014). Figure 1 (Right) is an example MDS embedding of the representation from (Huang et al., 2012). As can be seen from the figure, the similarity of the same objects can look very different depending on which semantic source and measure are used.
Nonmetric representation of similarity. Multiple sources of semantic information have the potential to complement each other for an improved classification result. Still, how to best aggregate similarity from inhomogeneous sources remains an open problem. Similarity measures from different corpora or methods are not directly comparable, and therefore a simple averaging of the measures will not be optimal. The first key idea of our paper is to use a nonmetric, ranking-based representation of semantic similarity instead of a numerical representation.
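To illustrate our approach, consider the problem of distinguishing cat from truck. In Figure 1 (Middle), cat is closer to dog than to automobile:
$d(\text{cat}, \text{dog}) < d(\text{cat}, \text{automobile}),$
and truck is closer to automobile than to dog:
$d(\text{truck}, \text{automobile}) < d(\text{truck}, \text{dog}).$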
In other words, we may be able to distinguish cat and truck from their closeness to other reference objects without using any numerical similarity. As a special case, we can use the similarity of all the other objects to cat to form a semantic ranking of cat. For example, cat has a semantic ranking
(1) 
and truck has a semantic ranking
(2) 
according to the distance in Figure 1 (Middle). Not only may ordinal similarity be sufficient for distinguishing cat from truck, but it also seems a more natural representation: it is invariant under scaling and monotonic transformation of the numerical values and therefore has a better chance of being consistent across heterogeneous sources. Moreover, ordinal information can be obtained directly from nonnumerical comparisons. In particular, when we ask human subjects to judge the similarity of objects, it is easier for them to rank objects than to assign numerical similarity scores.
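The invariance claimed above can be checked with a minimal sketch. The similarity values and the exponential rescaling below are hypothetical and only illustrate that two sources on different numerical scales can induce the same ranking.

```python
import math

def ranking_from_similarity(sim):
    """Return the reference classes sorted from most to least similar."""
    return sorted(sim, key=lambda name: sim[name], reverse=True)

# Hypothetical similarity scores of cat to four reference classes.
source_a = {"dog": 0.9, "deer": 0.6, "automobile": 0.2, "ship": 0.1}
# A second hypothetical source on a different scale (a monotone transform).
source_b = {name: math.exp(5.0 * s) for name, s in source_a.items()}

rank_a = ranking_from_similarity(source_a)
rank_b = ranking_from_similarity(source_b)
print(rank_a == rank_b)
```

Averaging the raw values of the two sources would be dominated by the source with the larger scale, whereas the rankings agree exactly.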
Zero-shot classification without retraining. In this paper, we apply nonmetric ranking-based representations of semantic similarity to zero-shot classification problems (Palatucci et al., 2009; Lampert et al., 2009; Rohrbach et al., 2010, 2011; Qi et al., 2011; Mensink et al., 2012; Frome et al., 2013; Socher et al., 2013). In zero-shot learning we have samples from the training domain (e.g., the set of the other 8 objects), but no samples from the test domain (e.g., cat and truck). The goal is to construct a classifier for the test domain using only the training data and semantic knowledge of the two domains.
A standard approach to multiclass classification is to use binary classifiers in a one-vs-rest or one-vs-one setting, or to use multiclass losses directly. If we already have pretrained classifiers for the training-domain classes in one of those settings, can we use those classifiers ‘for free’ to distinguish the unseen classes cat and truck, without retraining on training-domain samples? Figure 2 provides an intuition for the problem. Consider multiple decision hyperplanes learned in the one-vs-one setting (the others are discussed in Section 2). The hyperplanes partition the feature space into ‘cells’, each of which assigns a ranking of objects to the points in its interior. To see this, note that all pairs of objects are compared in each cell (each pair is ordered one way or the other), and transitivity (see Section 2) follows from the metric triangle inequality. The ranking of an unseen test sample assigned by the pretrained classifiers can be compared with the semantic rankings of cat or truck for zero-shot classification, assuming feature and semantic similarities are strongly correlated (see (Deselaers & Ferrari, 2011) for a discussion).
Building on this idea, we present novel zero-shot classification methods that require no retraining and can aggregate semantic information from multiple sources. We start by proposing a simple deterministic ranking-based method, and further improve it by introducing probability models of rankings. In the probabilistic approach, real-valued classification scores are mapped to posterior probabilities of rankings and combined with prior probabilities of rankings learned from (multiple) semantic sources. The advantages of the probabilistic approach are explained further in the method and experiment sections. For both the posterior and the prior probabilities of rankings, we use classic probabilistic models of rankings, including the Plackett-Luce, the Mallows, and the Babington-Smith models.
To summarize the contributions of this paper, we present:

a nonmetric ranking-based representation of semantic structure, as an alternative to numerical similarity representation;

methods for aggregating multiple semantic sources using probability models of rankings;

deterministic and probabilistic zero-shot classifiers built from pretrained classifiers without retraining.
In the experiment section we demonstrate the advantages of our approach over a numerically based approach and a deterministic approach using two well-known image databases, Animals with Attributes (Lampert et al., 2009) and CIFAR-10/100 (Krizhevsky, 2009). In particular, we demonstrate that aggregating different semantic sources, including crowdsourcing, leads to more accurate zero-shot classification.
The remainder of the paper is organized as follows. In Section 2, we present deterministic and probabilistic ranking-based algorithms for zero-shot classification. In Section 3, we relate our work to others in the literature. In Section 4, we test our methods on real-world image databases, and we conclude the paper in Section 5.
2 Zero-shot learning with rankings
Notations. Let $\mathcal{S}_n$ denote the set of all rankings of $n$ items/classes, and let $\pi \in \mathcal{S}_n$ denote a ranking: $\pi(i)$ is the position of item $i$, and $\pi^{-1}(j)$ is the item whose position is $j$. We write $i \prec j$ (‘$i$ precedes $j$’) when $\pi(i) < \pi(j)$ (‘item $i$ is ranked higher than item $j$’). A top-K ranking is a straightforward generalization of a ranking in which only the order of the first K items matters and the order of the remaining items is ignored. With an abuse of notation, we also use $\pi$ and $\pi^{-1}$ for a top-K ranking and $\mathcal{S}_n$ for the set of all top-K rankings, since a full ranking is a special case ($K = n$). A partial order is a further generalization of a ranking and a top-K ranking. In a (full) ranking, a pair of items $i$, $j$ has to satisfy either $i \prec j$ or $j \prec i$, whereas it can be neither in a partial order. In addition, a partial order has to satisfy transitivity: for any triple $(i, j, k)$, $i \prec j$ and $j \prec k$ imply $i \prec k$. Item positions are in general undefined for a partial order.
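As a small illustration of this notation, the sketch below stores a ranking both as an ordered list of items (the role of the inverse map, position to item) and as a dictionary from items to positions (the role of the ranking itself). The class names are hypothetical, and positions are 0-based here purely for programming convenience.

```python
def positions(order):
    """Map each item to its 0-based position, given items listed best first."""
    return {item: pos for pos, item in enumerate(order)}

def precedes(pi, i, j):
    """True when item i is ranked higher than item j under positions pi."""
    return pi[i] < pi[j]

def top_k(order, k):
    """Keep the first k items; the order of the rest is ignored."""
    return tuple(order[:k])

order = ["dog", "deer", "horse", "ship"]   # hypothetical ranking, best first
pi = positions(order)
```

Two full rankings that differ only beyond position K map to the same top-K ranking, for example `top_k(order, 2)` and `top_k(["dog", "deer", "ship", "horse"], 2)`.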
2.1 Deterministic approach
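A simple deterministic approach to zero-shot learning using semantic rankings was already outlined in the Introduction. In the one-vs-one setting, the pretrained classifiers assign a ranking $\pi$ to a test point directly. In the one-vs-rest setting, the binary classifiers assign real-valued scores to a test point according to the point’s distances to the decision hyperplanes, and the scores can be sorted to provide a ranking $\pi$. Given this ranking $\pi$ of a test sample $x$ and prior knowledge of the semantic rankings $\pi_y$ of the test-domain classes $y$, we predict
(3) $\hat{y} = \arg\min_{y} d(\pi, \pi_y)$,
where $d(\cdot, \cdot)$ is a distance between two rankings. For example, consider the test-domain classes cat and truck, whose semantic rankings are (1) and (2), respectively. If the ranking $\pi$ induced by the classification scores of an unseen image satisfies $d(\pi, \pi_{\text{cat}}) < d(\pi, \pi_{\text{truck}})$ for the chosen distance $d$, then we classify the image as a cat rather than a truck. We use Kendall’s ranking distance, which is the number of pairs of items that the two rankings order differently:
(4) $d(\pi, \sigma) = \sum_{i < j} \mathbb{1}\left[ (\pi(i) - \pi(j))(\sigma(i) - \sigma(j)) < 0 \right]$,
where $\pi(i)$ and $\sigma(i)$ are the positions of item $i$ in the two rankings.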
Sometimes it makes more sense to compare only the closest items than to compare all the items. A top-K version of Kendall’s distance was proposed in (Critchlow, 1985), which can be computed as follows. Let , , and be the sets
Then Kendall’s top-K distance can be computed by
(5)  
Zero-shot classification using the rule (3) will be called the deterministic ranking-based (DR) method.
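The DR rule can be sketched in a few lines for full rankings, using Kendall’s distance (4) as the number of discordant pairs. The scores and the two semantic rankings below are hypothetical stand-ins, not the rankings (1) and (2), and the top-K distance (5) is not implemented here.

```python
from itertools import combinations

def kendall_distance(order_a, order_b):
    """Number of item pairs that the two full rankings order differently."""
    pos_a = {item: p for p, item in enumerate(order_a)}
    pos_b = {item: p for p, item in enumerate(order_b)}
    return sum(
        1
        for i, j in combinations(order_a, 2)
        if (pos_a[i] - pos_a[j]) * (pos_b[i] - pos_b[j]) < 0
    )

def rank_from_scores(scores):
    """Sort training-domain classes by classifier score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)

def predict_dr(scores, semantic_rankings):
    """Pick the test-domain class whose semantic ranking is closest."""
    pi = rank_from_scores(scores)
    return min(semantic_rankings,
               key=lambda y: kendall_distance(pi, semantic_rankings[y]))

# Hypothetical semantic rankings of four training-domain classes.
semantic_rankings = {
    "cat": ["dog", "deer", "horse", "automobile"],
    "truck": ["automobile", "horse", "deer", "dog"],
}
# Hypothetical one-vs-rest scores for one unseen image.
scores = {"dog": 2.1, "deer": 0.3, "horse": 0.5, "automobile": -1.0}
prediction = predict_dr(scores, semantic_rankings)
```

In this toy case the score-induced ranking differs from the cat ranking in one pair and from the truck ranking in five pairs, so the image is assigned to cat.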
2.2 Probabilistic approach
We can further refine ranking-based algorithms by considering a probabilistic approach. There are several causes of uncertainty in a ranking-based representation. First, classifier outputs for a test-domain sample can have low confidence, since the classifiers are trained only with training-domain samples. Second, prior knowledge of semantic rankings from multiple semantic sources may not be unanimous. Third, feature and semantic similarities do not always coincide. For these reasons, we consider probability models of (top-K) rankings. We discuss three models: the Mallows (Mallows, 1957), the Plackett-Luce (Plackett, 1975; Luce, 1959; Marden, 1995), and the Babington-Smith (Joe & Verducci, 1993; Smith, 1950) models, which we introduce where they are needed (see (Critchlow, 1985; Critchlow et al., 1991; Marden, 1995) for reviews).
In our probabilistic zero-shot learning approach, we assume the following dependence:
(6) $x \rightarrow \pi \rightarrow y$,
that is, the label $y$ of a sample depends only on the predicted ranking $\pi$, which in turn depends only on the sample $x$. The probability of a test-domain label $y$ given the sample $x$ is obtained by marginalizing out the latent ranking variable $\pi$:
(7) $P(y \mid x) = \sum_{\pi} P(y \mid \pi)\, P(\pi \mid x)$,
and the final prediction of the label for a test sample $x$ is made by $\hat{y} = \arg\max_{y} P(y \mid x)$.
There are two terms in (7): a probabilistic ranker and a prior for semantic ranking. First, we describe the prior for semantic ranking learned from one or more semantic sources (e.g., different corpora or crowdsourcing) in Section 2.3. Second, we describe probabilistic rankers based on standard classifiers trained with training-domain data in Section 2.4. The final zero-shot classifier for unseen samples, which combines these two learned components, is described in Section 2.5.
2.3 Prior for semantic ranking
We encode the semantic similarity between training-domain and test-domain classes by probabilistic models of rankings of the training-domain classes, one for each test-domain class. To learn these models, we use multiple instances of rankings for each test-domain class. These rankings can come from multiple linguistic corpora or directly from human-rated rankings. Below we outline three popular models of rankings: the Plackett-Luce, the Mallows, and the Babington-Smith models.
Plackett-Luce. The Plackett-Luce model for the probability of observing a top-K ranking $\pi$ is
(8) $P(\pi; v) = \prod_{i=1}^{K} \frac{v_{\pi^{-1}(i)}}{\sum_{j=i}^{n} v_{\pi^{-1}(j)}}$,
where $\pi^{-1}(i)$ is the item at position $i$ and the denominator sums over the items not yet placed in positions $1, \ldots, i-1$. The nonnegative parameters $v_i$ indicate the relative chances of item $i$ being ranked higher than the rest of the items, and the model is invariant under constant scaling of the $v_i$’s. One interpretation of the generative procedure of the Plackett-Luce model is the vase interpretation from (Silberberg, 1980). Suppose there is a vase with an infinite number of balls marked 1 to $n$, in proportions given by the $v_i$’s. At the first stage, a ball is drawn and recorded as $\pi^{-1}(1)$. At the second stage, another ball is drawn and recorded as $\pi^{-1}(2)$ unless it carries the mark already selected ($\pi^{-1}(1)$), in which case the drawing is tried again. The procedure continues until $K$ distinct balls are drawn and recorded. This generative probability is captured by (8).
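The vase procedure can be simulated directly, which gives a quick sanity check of the Plackett-Luce probability. The sketch below is illustrative only: the three weights are hypothetical, and the probability function uses the standard sequential form of the model (each chosen item's weight divided by the total weight of the items still available).

```python
import random

def sample_top_k(weights, k, rng):
    """Draw a top-k ranking by the vase procedure: redraw balls already seen."""
    items = list(weights)
    probs = [weights[i] for i in items]
    ranking = []
    while len(ranking) < k:
        ball = rng.choices(items, weights=probs, k=1)[0]
        if ball not in ranking:          # repeated mark: try again
            ranking.append(ball)
    return ranking

def top_k_probability(ranking, weights):
    """Plackett-Luce probability of a top-k ranking (sequential form)."""
    remaining = sum(weights.values())
    prob = 1.0
    for item in ranking:
        prob *= weights[item] / remaining
        remaining -= weights[item]
    return prob

weights = {"a": 3.0, "b": 2.0, "c": 1.0}    # hypothetical parameters
rng = random.Random(0)
draws = [tuple(sample_top_k(weights, 2, rng)) for _ in range(20000)]
freq = draws.count(("a", "b")) / len(draws)
exact = top_k_probability(["a", "b"], weights)
```

Here `exact` is (3/6)(2/3) = 1/3, and the simulated frequency `freq` should be close to it. The sampler assumes at least k items have positive weight; otherwise the loop would not terminate.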
Given $N$ samples (= semantic sources) of rankings $\pi_1, \ldots, \pi_N$ for a class, the parameters can be estimated by MLE. The log-likelihood of (8) is
(9) $\ell(v) = \sum_{m=1}^{N} \sum_{i=1}^{K} \left[ \log v_{\pi_m^{-1}(i)} - \log \sum_{j=i}^{n} v_{\pi_m^{-1}(j)} \right]$,
where the $v_i$ are the Plackett-Luce parameters and $\pi_m^{-1}(i)$ is the item at position $i$ of the $m$-th ranking, with possibly an additional regularization term. Hunter (2004) proposed an iterative estimation method using the Minorization-Maximization procedure, which generalizes the Expectation-Maximization procedure and converges to a global maximum under a certain condition on the data. In our experience, simple gradient-based or Newton-Raphson methods work well with an appropriate choice of the regularization parameter.
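Mallows. The Mallows model (Mallows, 1957) for full rankings is defined as $P(\pi; \pi_0, \theta) = \frac{1}{Z(\theta)} \exp(-\theta\, d(\pi, \pi_0))$, where $\pi_0$ is the mode, $\theta$ is the spread parameter, $Z(\theta)$ is the normalization constant, and $d(\pi, \pi_0)$ is Kendall’s distance between two rankings. It may be viewed as a discrete analog of the Gaussian distribution for rankings. The distance can further be written as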
where is the identity ranking and the ’s are defined as(10) 
Fligner and Verducci (1986) proposed the Mallows model for top-K lists by marginalizing the Mallows model:
(11) 
where the ’s are defined in (10) and is the normalization constant which can be computed in closed form:
Given $N$ samples of rankings $\pi_1, \ldots, \pi_N$, the parameters of the Mallows model for total rankings can also be found by MLE (Feigin & Cohen, 1978). When the mode $\pi_0$ is known, the MLE of the spread parameter $\theta$ can be found by convex optimization, since the log-likelihood is a concave function of $\theta$. The MLE of the mode $\pi_0$ maximizes the likelihood, which is equivalent to
(12) $\hat{\pi}_0 = \arg\min_{\pi_0} \sum_{m=1}^{N} d(\pi_m, \pi_0)$.
The minimization (12) is also known as the Kemeny optimal consensus or aggregation problem (Kemeny, 1959) and is known to be NP-hard (Bartholdi et al., 1989). However, there are known heuristic methods such as sequential transposition of adjacent items (Critchlow, 1985) or other admissible heuristics (Meila et al., 2007). We use the former method. Starting from the average ranking as the initial value of the mode, we search for the pair of adjacent items whose swap lowers the sum of distances the most. We stop if there is no such pair or if the maximum number of iterations (1000 in our case) is exceeded. The MLE with (11) can be solved by using in place of .
Babington-Smith. The Babington-Smith model (Joe & Verducci, 1993; Smith, 1950) is another probabilistic ranking model based on pairwise comparisons.
Given two items $i$ and $j$, let $p_{ij}$ be the probability that item $i$ is ranked higher than item $j$. Given these preferences $p_{ij}$, the probability of a ranking $\pi$ is
$P(\pi) = \frac{1}{Z} \prod_{(i,j):\, \pi(i) < \pi(j)} p_{ij}$,
where $Z$ is the normalizing constant. After introducing new parameters (Joe & Verducci, 1993), the probability of a top-K ranking can be written as follows (we present a slightly modified form):
(13) 
The Babington-Smith model is similar to the Plackett-Luce model in that the probability is a product of pairwise parameters. The larger the parameter for a pair of items $(i, j)$ is, the more likely it is that item $i$ is ranked higher than item $j$. However, unlike the Plackett-Luce model, the normalizing constant does not have a known closed form. We do not use it for modeling the semantic prior, but use it for the probabilistic ranker in the next section.
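The sequential transposition heuristic used above for the Mallows mode (the consensus problem (12)) can be sketched as follows for full rankings. This is an illustrative reading of the procedure described in the text, not the authors' code: the average ranking is taken to mean sorting items by their mean position, ties are broken by input order, and the sample rankings are hypothetical.

```python
from itertools import combinations

def kendall_distance(order_a, order_b):
    """Number of item pairs that the two full rankings order differently."""
    pos_a = {item: p for p, item in enumerate(order_a)}
    pos_b = {item: p for p, item in enumerate(order_b)}
    return sum(1 for i, j in combinations(order_a, 2)
               if (pos_a[i] - pos_a[j]) * (pos_b[i] - pos_b[j]) < 0)

def total_distance(candidate, rankings):
    return sum(kendall_distance(candidate, r) for r in rankings)

def consensus_by_transposition(rankings, max_iter=1000):
    """Greedy adjacent swaps, starting from the average ranking."""
    items = rankings[0]
    mean_pos = {i: sum(r.index(i) for r in rankings) / len(rankings)
                for i in items}
    current = sorted(items, key=lambda i: mean_pos[i])
    best = total_distance(current, rankings)
    for _ in range(max_iter):
        best_swap, best_cost = None, best
        for k in range(len(current) - 1):
            cand = current[:]
            cand[k], cand[k + 1] = cand[k + 1], cand[k]
            cost = total_distance(cand, rankings)
            if cost < best_cost:         # keep the swap that helps the most
                best_swap, best_cost = cand, cost
        if best_swap is None:            # no adjacent swap lowers the cost
            break
        current, best = best_swap, best_cost
    return current, best

samples = [["b", "a", "c", "d"], ["b", "a", "c", "d"], ["a", "c", "d", "b"]]
consensus, cost = consensus_by_transposition(samples)
```

In this example the average ranking is a, b, c, d with total distance 4, and a single swap of the first two items reaches b, a, c, d with total distance 3, after which no adjacent swap helps. Being a local search, the heuristic is not guaranteed to find the global optimum in general.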
2.4 Probabilistic ranker from classifiers
The probabilistic ranker takes a sample $x$ as input and probabilistically ranks the similarity of $x$ to the training-domain classes. We propose to build rankers from standard settings of multiclass classifiers: one-vs-rest, one-vs-one, or multiclass-loss as in (Crammer & Singer, 2002). Any classifier that outputs a real-valued confidence or score can be used for this purpose.
One-vs-rest binary classifiers. In this setting, there are $n$ scores $f_1(x), \ldots, f_n(x)$, one for each training-domain class. We relate the real-valued scores to the nonnegative parameters $v_i$ of the Plackett-Luce model (8) by setting $v_i = \exp(f_i(x))$, to get
(14) $P(\pi \mid x) = \prod_{i=1}^{K} \frac{\exp(f_{\pi^{-1}(i)}(x))}{\sum_{j=i}^{n} \exp(f_{\pi^{-1}(j)}(x))}$.
Instead of producing a single ranking as in the deterministic approach (3), this ranker evaluates the probability of any ranking $\pi$ given $x$, taking into account the confidence ($f_i(x)$) of the one-vs-rest classifiers.
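A minimal sketch of such a ranker is given below. It assumes the exponential link between scores and Plackett-Luce parameters, which is the link under which the top-1 case coincides with the softmax of multinomial logistic regression; the class names and scores are hypothetical.

```python
import math

def pl_ranker(ranking, scores):
    """P(ranking | x) for a top-K ranking, with PL parameters exp(score)."""
    m = max(scores.values())             # shift scores for numerical stability
    w = {c: math.exp(s - m) for c, s in scores.items()}
    remaining = sum(w.values())
    prob = 1.0
    for c in ranking:
        prob *= w[c] / remaining         # choose c among the classes left
        remaining -= w[c]
    return prob

# Hypothetical classifier scores for one test sample.
scores = {"dog": 2.0, "deer": 0.5, "ship": -1.0}
p_top1 = pl_ranker(["dog"], scores)
softmax_dog = math.exp(2.0) / (math.exp(2.0) + math.exp(0.5) + math.exp(-1.0))
p_full = pl_ranker(["dog", "deer", "ship"], scores)
```

Subtracting the maximum score does not change the result because the Plackett-Luce probability is invariant under a common rescaling of its parameters. With K = 1 the ranker returns the softmax probability, and larger K additionally scores how the remaining classes are ordered.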
One-vs-one binary classifiers. In this setting, there is a score for each pair of training-domain classes. We relate these scores to the parameters of the Babington-Smith model (13) by , to get
(15) 
Similar to (14), this ranker evaluates the probability of any ranking given the sample, taking into account the confidence of the one-vs-one classifiers. Note that if the pretrained classifiers are linear, that is, , then this ranker is quite similar to (14), since , with defined as . However, it has a different normalization term from (14).
Multiclass-loss classifiers. Other types of classifiers can be accommodated. When the pretrained classifiers are multinomial logistic regression (= softmax) or SVMs with a multiclass loss (Crammer & Singer, 2002), we again have per-class scores computed from parameter vectors. Similar to the one-vs-rest case, we can use the same relation with the Plackett-Luce model, which gives us the same ranker as (14). Note that if the original classifier is a multinomial logistic regression, (14) is in fact a direct generalization of logistic regression (the case $K = 1$), which is also observed in (Cheng et al., 2010). In this case, the trained parameters coincide with the optimal maximum likelihood parameters for (14) trained with top-1 rankings, which are the ground-truth labels of the training domain.
To summarize, there exist natural interpretations of the Plackett-Luce and the Babington-Smith models that allow us to relate classification scores to their parameters and use them to produce posterior probabilities of rankings without any further training. (It is not immediately clear whether the Mallows model can be adapted to this setting; this is left for future work.)
2.5 Zero-shot prediction
The probabilistic rankers constructed from pretrained classifiers and the priors for semantic rankings learned from semantic sources are plugged into (7), and the final prediction of the label for a test sample $x$ is made by $\hat{y} = \arg\max_{y} P(y \mid x)$. The sum over (top-K) rankings in (7) cannot be computed analytically for either the Plackett-Luce or the Mallows model and requires approximations, e.g., by Markov chain Monte Carlo (MCMC) sampling. Alternatively, we use
and a uniform prior , somewhat similar to (Rohrbach et al., 2010). In our preliminary experiments, MCMC-based summation showed inferior results to this simple version and is therefore omitted from the report. The final zero-shot classifier is the MAP classifier
(16) 
for pretrained one-vs-rest/multiclass-loss classifiers, and
(17) 
for pretrained one-vs-one classifiers. We summarize the overall training and testing procedures below.
Training Step 1. Obtain pretrained classifiers
Input: training-domain sample and label pairs, regularization hyperparameter
Output: score functions (per class, or per pair of classes)
Method: one-vs-rest/one-vs-one/multiclass with any classifier
Training Step 2. Learn priors for semantic rankings
Input: ranking and test-domain label pairs collected from corpora or crowdsourcing
Output: consensus rankings for each test-domain class from either the Plackett-Luce model (8) or the Mallows model (11)
Method: MLE of (9) by BFGS or MLE of (12) by sequential transposition
Testing. Zero-shot classification
Input: data, parameter for top-K list size
Output: prediction of the test-domain label
Method: MAP estimation (16) or (17), using the score functions from Training Step 1 and the consensus rankings from Training Step 2
3 Related work
There are two major approaches to zero-shot learning explored in the literature: attribute-based and similarity-based. In attribute-based knowledge transfer (e.g., (Palatucci et al., 2009; Lampert et al., 2009; Akata et al., 2013)), the classes from the training and test domains are assumed to be distinguishable by a common list of attributes. Attribute-based approaches often show excellent empirical performance (Palatucci et al., 2009; Rohrbach et al., 2010). However, designing attributes that are at the same time discriminative, common to multiple classes, and correlated with the original features can be a nontrivial task that typically requires human supervision. Similar arguments can be found in (Rohrbach et al., 2010) or (Mensink et al., 2014).
By contrast, similarity-based zero-shot approaches use semantic similarity between training-domain classes and test-domain classes directly. The advantage of this approach is that similarity information can be mined automatically from the web, linguistic corpora, or other sources. Similarity information has been used to build a probabilistic zero-shot classifier called the direct similarity-based method (DS) (Rohrbach et al., 2010, 2011), which parallels the attribute-based approach from (Lampert et al., 2009). Like our method, the direct similarity-based method uses classification scores and probabilistic inference, but it uses numerical similarity instead of the nonmetric ranking representation. More recently, similarity-based approaches using semantic embedding have been proposed (Frome et al., 2013; Socher et al., 2013). In these algorithms, training-domain and test-domain objects are simultaneously embedded into a semantic space using multilayer neural networks. While these two methods produce interesting results, they use specific metric similarity models and, unlike our method, require retraining when the semantic model changes. Mensink et al. use a linear combination of pretrained classifiers for classifying unseen data (Mensink et al., 2014). They use co-occurrence statistics as semantic information, whereas we do not assume a specific type of similarity information. Lastly, our method provides a means to aggregate multiple semantic sources, which has not been addressed in the literature.
4 Experiments
4.1 Datasets
We use two datasets: 1) the Animals with Attributes dataset (Lampert et al., 2009) and 2) CIFAR-100/10 (Torralba et al., 2008), collected by (Krizhevsky, 2009). Semantic similarity is obtained from WordNet distance, web searches (Rohrbach et al., 2010), word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and Amazon Mechanical Turk. Table 1 summarizes the characteristics of the datasets and the types of semantic information used in the experiments. More details on data processing are provided in the Appendix.
Table 1. Characteristics of the datasets and the available semantic information.

| | Animals | CIFAR |
|---|---|---|
| Feature dimension | 8941 | 4000 |
| Number of training/validation/test samples | 21847 / 2448 / 6180 | 50000 / 50000 / 10000 |
| Number of training/test classes | 40 / 10 | 100 / 10 |
| Linguistic sources | WordNet, Wikipedia, Yahoo, Yahoo Image, Flickr | WordNet, word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) |
| Number of surveys from crowdsourcing | 500 | 500 |
4.2 Methods
We perform comprehensive tests of the probabilistic ranking-based (PR) zero-shot model under 1) three learning settings (one-vs-rest, one-vs-one, multiclass), 2) two types of semantic sources (linguistic, crowdsourcing), and 3) different prior models for semantic rankings (the Plackett-Luce and the Mallows models). We compare the probabilistic ranking-based method (PR, Sec. 2.5) to the deterministic ranking-based method (DR, Sec. 2.1) and the direct similarity-based method (DS, (Rohrbach et al., 2010)), which is the closest state-of-the-art method to ours that uses classifier scores. We also refer to other results in the literature for comparison.
Regularization parameters for the classifiers are determined from the validation set, and partly by hand to avoid exhaustive cross-validation. We test with different values of the hyperparameter (the top-K list size) and report the results with . For one-vs-rest and one-vs-one, we trained SVMs followed by Platt’s probabilistic scaling (Platt, 1999). For multiclass, we used multinomial logistic regression.
4.3 Result 1 – Discriminability of semantic rankings
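Table 2. Zero-shot classification accuracy. Columns are grouped by method: Direct Similarity (DS) and Deterministic Ranking (DR) with linguistic sources, and Probabilistic Ranking (PR) with linguistic sources or crowdsourcing. Geom: accuracy using geometric mean of similarities. The best result for each method is highlighted in boldface.

Animals dataset

| Classifier | DS Indiv | DS Arithm | DS Geom | DR Indiv | DR Arithm | DR Geom | PR linguistic P.L | PR linguistic Mallows | PR crowdsource P.L | PR crowdsource Mallows |
|---|---|---|---|---|---|---|---|---|---|---|
| one-vs-rest | 0.320 | 0.334 | **0.354** | 0.329 | 0.330 | 0.347 | 0.320 | 0.312 | 0.351 | 0.351 |
| one-vs-one | n/a | n/a | n/a | 0.341 | 0.343 | **0.359** | 0.358 | 0.320 | 0.374 | 0.374 |
| multiclass | n/a | n/a | n/a | 0.331 | 0.345 | 0.355 | 0.370 | 0.345 | **0.395** | 0.392 |

CIFAR dataset

| Classifier | DS Indiv | DS Arithm | DS Geom | DR Indiv | DR Arithm | DR Geom | PR linguistic P.L | PR linguistic Mallows | PR crowdsource P.L | PR crowdsource Mallows |
|---|---|---|---|---|---|---|---|---|---|---|
| one-vs-rest | 0.273 | 0.300 | **0.316** | 0.224 | 0.258 | 0.260 | 0.314 | 0.288 | 0.258 | 0.282 |
| one-vs-one | n/a | n/a | n/a | 0.244 | **0.281** | 0.278 | 0.335 | 0.297 | 0.244 | 0.261 |
| multiclass | n/a | n/a | n/a | 0.251 | 0.272 | 0.276 | **0.339** | 0.320 | 0.260 | 0.292 |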
We first compare the discriminability of classes with ranking vs. numerical representations of similarity, without using image data. Using all five linguistic sources for Animals, we compute pairwise distances between the 5×10=50 similarity vectors. Two types of distances are computed: the Euclidean distance of numerical similarity, with or without normalization, and the Hausdorff distance for top-K lists using Kendall’s ranking distance (5). Note that the rankings are obtained by sorting the numerical similarity. For these different representations, the average accuracy of leave-one-out 1-nearest-neighbor classification was 0.44 (Euclidean), 0.62 (Euclidean with normalization), 0.72 (Kendall’s, K=2), 0.70 (Kendall’s, K=5), and 0.64 (Kendall’s, K=10), which shows that the ranking distances are better than the Euclidean distances for discriminating test-domain classes when there are multiple heterogeneous sources. Figure 3 shows the embeddings from classical multidimensional scaling (MDS) using these distances. It shows qualitative differences between numerical similarity (a and b) and ranking (c). The embedding of rankings has better between-class separation and within-class clustering than the embeddings of numerical similarity, suggesting that the nonmetric order information is more consistent than numerical similarity across different sources.
We also compute the leave-one-out accuracy of Bayesian classification with the rankings collected directly from crowdsourcing for the Animals and CIFAR datasets. Out of 500 rankings, one ranking is held out and the 499 remaining rankings are used to build 10 semantic ranking probabilities using both the Mallows and the Plackett-Luce models. The class of the held-out ranking is predicted as the one whose model assigns it the maximum probability over all 10 classes. For Animals, the average accuracy was 0.91/0.99/0.99 (K=2/5/10) using the Mallows model, and 0.79/0.84/0.96 (K=2/5/10) using the Plackett-Luce model. For CIFAR, the average accuracy was 0.73/0.79/0.84 (K=2/5/10) using the Mallows model, and 0.72/0.77/0.74 (K=2/5/10) using the Plackett-Luce model. These numbers show that the rankings obtained from crowdsourcing carry enough information to discriminate the test-domain classes with up to 0.99 accuracy.
4.4 Result 2 – Comparison of PR, DR, and DS
We compare the probabilistic ranking (PR), deterministic ranking (DR), and direct similarity (DS) methods in terms of zero-shot classification accuracy. All three methods share the same image features and the same linguistic sources of semantic information (except for the crowdsourcing used by PR), but use them in different ways. PR uses probabilistic models to combine multiple sources of semantic similarity. DR and DS inherently use a single source of semantic similarity, and therefore multiple sources have to be combined heuristically. We first normalize the individual similarity sources to the range from 0 to 1, and then compute arithmetic and geometric means over the sources. Note that the main difference between DR and DS is that DR uses rankings whereas DS uses numerical values.
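The heuristic aggregation used for DR and DS can be sketched as below. The text only states that each source is normalized to the range from 0 to 1, so the min-max rescaling here is an assumption, and the similarity values are hypothetical.

```python
import math

def normalize(sim):
    """Rescale one source's similarities linearly to the range [0, 1]."""
    lo, hi = min(sim.values()), max(sim.values())
    return {k: (v - lo) / (hi - lo) for k, v in sim.items()}

def aggregate(sources):
    """Arithmetic and geometric means of the normalized sources, per class."""
    normed = [normalize(s) for s in sources]
    keys = normed[0].keys()
    arithm = {k: sum(s[k] for s in normed) / len(normed) for k in keys}
    geom = {k: math.prod(s[k] for s in normed) ** (1.0 / len(normed))
            for k in keys}
    return arithm, geom

# Two hypothetical sources on different scales.
sources = [
    {"dog": 10.0, "deer": 5.0, "ship": 0.0},
    {"dog": 0.8, "deer": 0.2, "ship": 0.0},
]
arithm, geom = aggregate(sources)
```

For deer, the normalized values are 0.5 and 0.25, giving an arithmetic mean of 0.375 and a geometric mean of about 0.354. The sketch assumes each source has at least two distinct values (otherwise the rescaling divides by zero) and requires Python 3.8 or later for `math.prod`.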
DS vs DR. The results are shown in Table 2. For both DR and DS, using averaged semantic similarity (“Arithm” and “Geom”) is better than using individual similarity (“Indiv”), for both the Animals and CIFAR datasets. A plausible interpretation is that the aggregate similarity is more reliable than the individual similarities despite the heuristic methods of aggregation. The highest accuracy from DS is 0.354 (for Animals) and 0.316 (for CIFAR), whereas the highest accuracy from DR is 0.359 (for Animals) and 0.281 (for CIFAR). DR performs slightly better than DS on Animals, but worse on CIFAR. Within DR, accuracy is not affected much by the pretrained classifier type (one-vs-rest, one-vs-one, multiclass).
PR vs others. Using the same linguistic sources, the highest accuracy from PR is 0.370 (for Animals) and 0.339 (for CIFAR), which is higher than the best DS and DR results (0.359 for Animals and 0.316 for CIFAR) regardless of whether a single (Indiv) or multiple (Arithm and Geom) sources are used. This suggests the advantage of using probabilistic models to aggregate multiple semantic sources. Within the PR results, one-vs-rest and one-vs-one classifiers perform comparably, and multiclass logistic regression performs the best. PR performs even better with crowdsourced semantic information (0.395) than with linguistic sources (0.370) on Animals, but the opposite is true on CIFAR, probably due to the lower reliability of human subject ratings with CIFAR (sorting 100 categories correctly, compared to 40 in Animals).
In the literature, the accuracy of attribute-based methods on Animals ranges from 0.36 to 0.44 (Tables 3 and 4, (Akata et al., 2013)), compared to 0.395 from our method, which does not use attributes. We remind the reader that finding ‘good’ attributes is itself a nontrivial task. When both similarity and attributes are mined automatically from corpora, similarity-based methods perform much better than attribute-based methods (individual average of 0.22 from Table 1, (Rohrbach et al., 2010)).
Lastly, Figure 4 shows the two-class classification accuracy of PR (PL with linguistic sources), DS, and an embedding-based method on select pairs of classes from CIFAR (Figure 3, (Socher et al., 2013)). Although the numbers may not be directly comparable due to different settings (Socher et al. used the rest of the classes from CIFAR-10 instead of CIFAR-100 for training, and also used different semantic information), PR performs noticeably better than the two state-of-the-art methods. In fact, we can distinguish auto vs deer, deer vs ship, or cat vs truck with accuracy, without a single training image of these categories.
5 Conclusion
In this paper, we propose a ranking-based representation of semantic similarity as an alternative to metric representation of similarity. Using rankings, semantic information from multiple sources can be aggregated naturally to produce a better representation than individual sources. Using this representation and probability models of rankings, we present new zero-shot classifiers that can be constructed from pretrained classifiers without retraining, and demonstrate their potential for exploiting semantic structures of real-world visual objects.
References
 Akata et al. (2013) Akata, Zeynep, Perronnin, Florent, Harchaoui, Zaid, and Schmid, Cordelia. Labelembedding for attributebased classification. In CVPR, 2013.
 Bartholdi et al. (1989) Bartholdi, J., Tovey, C. A., and Trick, M. Voting schemes for which it can be difficult to tell who won. Social Choice and Welfare, 6(2):157–165, 1989.
 Cheng et al. (2010) Cheng, Weiwei, Dembczynski, Krzysztof, and Hüllermeier, Eyke. Label ranking methods based on the PlackettLuce model. In ICML, pp. 215–222, 2010.

Crammer & Singer (2002)
Crammer, Koby and Singer, Yoram.
On the algorithmic implementation of multiclass kernelbased vector
machines.
The Journal of Machine Learning Research
, 2:265–292, 2002.  Critchlow et al. (1991) Critchlow, Douglas E, Fligner, Michael A, and Verducci, Joseph S. Probability models on rankings. Journal of Mathematical Psychology, 35(3):294–318, 1991.
 Critchlow (1985) Critchlow, Douglas Edward. Metric methods for analyzing partially ranked data. Number 34 in Lecture notes in statistics. Springer, Berlin [u.a.], 1985.

Deselaers & Ferrari (2011)
Deselaers, T. and Ferrari, V.
Visual and semantic similarity in imagenet.
In CVPR, pp. 1777–1784, 2011.  Feigin & Cohen (1978) Feigin, Paul D and Cohen, Ayala. On a model for concordance between judges. Journal of the Royal Statistical Society. Series B (Methodological), pp. 203–213, 1978.
 Fligner & Verducci (1986) Fligner, Michael A and Verducci, Joseph S. Distance based ranking models. Journal of the Royal Statistical Society. Series B (Methodological), pp. 359–369, 1986.
 Frome et al. (2013) Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeff, Mikolov, Tomas, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, pp. 2121–2129, 2013.
 Huang et al. (2012) Huang, Eric H, Socher, Richard, Manning, Christopher D, and Ng, Andrew Y. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL, pp. 873–882. Association for Computational Linguistics, 2012.
 Hunter (2004) Hunter, David R. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004.
 Joe & Verducci (1993) Joe, Harry and Verducci, Joseph S. On the Babington Smith class of models for rankings. In Probability models and statistical analyses for ranking data, pp. 37–52. Springer, 1993.
 Kemeny (1959) Kemeny, John G. Mathematics without numbers. Daedalus, pp. 577–591, 1959.
 Krizhevsky (2009) Krizhevsky, Alex. Learning multiple layers of features from tiny images. Technical report, 2009.
 Lampert et al. (2009) Lampert, C.H., Nickisch, H., and Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In CVPR, pp. 951–958, 2009.
 Luce (1959) Luce, R. D. Individual Choice Behavior. Wiley, New York, 1959.
 Mallows (1957) Mallows, Colin L. Non-null ranking models. I. Biometrika, 44(1/2):114–130, 1957.
 Marden (1995) Marden, John I. Analyzing and modeling rank data, volume 64. CRC Press, 1995.
 Meila et al. (2007) Meila, Marina, Phadnis, Kapil, Patterson, Arthur, and Bilmes, Jeff A. Consensus ranking under the exponential model. In Parr, Ronald and van der Gaag, Linda C. (eds.), UAI, pp. 285–294. AUAI Press, 2007.
 Mensink et al. (2012) Mensink, Thomas, Verbeek, Jakob, Perronnin, Florent, and Csurka, Gabriela. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, 2012.
 Mensink et al. (2014) Mensink, Thomas, Gavves, Efstratios, and Snoek, Cees GM. COSTA: Co-occurrence statistics for zero-shot classification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 2441–2448. IEEE, 2014.
 Mikolov et al. (2013) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
 Miller (1995) Miller, George A. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41, November 1995.
 Palatucci et al. (2009) Palatucci, Mark, Pomerleau, Dean, Hinton, Geoffrey, and Mitchell, Tom. Zero-shot learning with semantic output codes. In NIPS, pp. 1410–1418, 2009.

 Pennington et al. (2014) Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12, 2014.
 Plackett (1975) Plackett, Robin L. The analysis of permutations. Applied Statistics, pp. 193–202, 1975.

 Platt (1999) Platt, John C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74. MIT Press, 1999.
 Qi et al. (2011) Qi, Guo-Jun, Aggarwal, C., Rui, Yong, Tian, Qi, Chang, Shiyu, and Huang, T. Towards cross-category knowledge propagation for learning visual concepts. In CVPR, pp. 897–904, 2011.
 Rohrbach et al. (2011) Rohrbach, M., Stark, M., and Schiele, B. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, pp. 1641–1648, 2011.
 Rohrbach et al. (2010) Rohrbach, Marcus, Stark, Michael, Szarvas, Gyorgy, Gurevych, Iryna, and Schiele, Bernt. What helps where and why? Semantic relatedness for knowledge transfer. In CVPR, pp. 910–917, 2010.
 Silberberg (1980) Silberberg, A.R. Ph.D. thesis, 1980.
 Smith (1950) Smith, B Babington. Discussion of professor Ross’s paper. Journal of the Royal Statistical Society B, 12(1):41–59, 1950.
 Socher et al. (2013) Socher, Richard, Ganjoo, Milind, Manning, Christopher D, and Ng, Andrew. Zero-shot learning through cross-modal transfer. In NIPS, pp. 935–943, 2013.

 Torralba et al. (2008) Torralba, Antonio, Fergus, Rob, and Freeman, William T. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
Appendix A Datasets
The Animals with Attributes dataset (Animals) was collected and processed by Lampert et al. (2009). The training domain consists of images of 40 types of animals, from which 21,847 and 2,448 images were used as training and validation sets, respectively. From each image, 10,942-dimensional features are extracted (Lampert et al., 2009). The test domain consists of 6,180 images of 10 types of animals that do not overlap with the training-domain classes. Semantic similarities of the animals are provided by Rohrbach et al. (2010) (http://www.d2.mpiinf.mpg.de/nlp4vision) and are computed from five different linguistic sources: path distance from WordNet, and co-occurrence from Wikipedia, Yahoo web search, Yahoo image search, and Flickr image search.
The CIFAR-100 and CIFAR-10 datasets were collected by Krizhevsky (2009). The training domain (CIFAR-100) consists of 60,000 images of 100 types of objects, including animals, plants, household objects, and scenery. We use 50,000 and 10,000 images from CIFAR-100 as training and validation sets, respectively. The test domain (CIFAR-10) consists of 60,000 images of 10 types of objects similar to those in CIFAR-100, without any overlap with the CIFAR-100 classes. We use 10,000 of these images as test data. To compute features, we use a deep neural network (https://github.com/jetpacapp/DeepBeliefSDK) trained on the ImageNet ILSVRC2010 dataset (http://www.imagenet.org/challenges/LSVRC), which consists of 1.2 million images of 1000 categories. We apply CIFAR-100 and CIFAR-10 training images to the network and use the 4096-dimensional output of its last hidden layer as features. For the semantic similarity of CIFAR-100 and CIFAR-10 classes, we compute the WordNet path distance, and also use the word2vec tools (Mikolov et al., 2013) (https://code.google.com/p/word2vec/) and the GloVe tools (Pennington et al., 2014) (http://nlp.stanford.edu/projects/glove/).
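As an illustration of how word vectors such as those from word2vec or GloVe can be turned into a similarity ranking, the following minimal sketch assumes cosine similarity between vectors. The function names and the three-dimensional toy vectors are hypothetical and are not taken from the tools above.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(target_vec, train_vecs):
    """Order training-class names from most to least similar to the target."""
    scores = {name: cosine_similarity(target_vec, vec)
              for name, vec in train_vecs.items()}
    return sorted(scores, key=lambda name: -scores[name])

# Toy word vectors (hypothetical values, for illustration only).
toy_train = {
    "wolf": [0.9, 0.1, 0.0],
    "bus": [0.0, 0.2, 0.9],
    "fox": [0.7, 0.3, 0.1],
}
toy_dog = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(toy_dog, toy_train)
print(ranking)
```

Only the resulting order is kept, so sources whose raw scores live on different scales can still be compared.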
In addition to using linguistic sources, we use Amazon Mechanical Turk to collect word similarity data by crowdsourcing. Each survey participant is shown a word from the test-domain classes and is asked to sort 10 words from the training domain according to their perceived similarity to the given word. The initial order of the 10 words is randomized for each survey. We preselect the 10 closest words for each test-domain word, because we found in preliminary trials that ordering all words (40 for Animals and 100 for CIFAR) is too demanding and time-consuming for participants. For Animals, the 10 closest words are selected based on the average ranking of the words w.r.t. the test-domain word over the five linguistic sources. For CIFAR, we use the path distance from WordNet. Fifty surveys are collected for each test-domain class.
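The preselection step for Animals can be sketched as follows: average each training word's rank position over the sources, keep the k words with the smallest average, and shuffle them before they are shown to a participant. This is a minimal sketch, assuming every source ranks the same set of words; the function names, tie-breaking rule, and toy data are hypothetical.

```python
import random

def preselect_by_average_rank(rankings, k=10):
    """rankings: one list per source, ordered from most to least similar.

    Returns the k words with the smallest average rank position
    (ties broken alphabetically, an assumption of this sketch).
    """
    totals = {}
    for ranking in rankings:
        for position, word in enumerate(ranking, start=1):
            totals[word] = totals.get(word, 0) + position
    n_sources = len(rankings)
    average = {word: total / n_sources for word, total in totals.items()}
    return sorted(average, key=lambda word: (average[word], word))[:k]

def survey_order(shortlist, seed=None):
    """Return the shortlist in a random initial order for one survey."""
    shuffled = list(shortlist)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Toy rankings from three hypothetical sources over four training words.
toy_sources = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["a", "c", "b", "d"],
]
shortlist = preselect_by_average_rank(toy_sources, k=2)
shown = survey_order(shortlist, seed=0)
print(shortlist, shown)
```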
Appendix B Implementation
The Direct Similarity-based method (DS) (Rohrbach et al., 2010) is implemented as follows. The probability is modeled by a one-vs-rest binary SVM classifier followed by Platt's probabilistic scaling (Platt, 1999), trained with training-domain feature and label pairs. In testing, the probability is evaluated for a test image, and the prediction of the test-domain class is made by MAP estimation using
(18) 
where is the similarity score. The sum above is limited to the five most similar training-domain classes. We have tested different values of the prior , which did not have a visible effect on the result.
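A DS-style prediction rule can be sketched as below. The scoring form used here, a similarity-weighted average of training-class posteriors over the top-k most similar training classes multiplied by an optional prior, is an assumption made for illustration and is not claimed to match Eq. (18) term by term. The function name `ds_predict`, the class names, and all numbers are hypothetical; the posteriors stand in for Platt-scaled SVM outputs.

```python
def ds_predict(posteriors, similarity, prior=None, top_k=5):
    """Pick the test-domain class with the highest assumed DS-style score.

    posteriors: {training_class: P(training_class | image)}
    similarity: {test_class: {training_class: similarity score}}
    prior:      optional {test_class: prior probability}
    """
    best_class, best_score = None, float("-inf")
    for test_class, sims in similarity.items():
        # Keep only the top_k most similar training classes.
        top = sorted(sims, key=lambda y: -sims[y])[:top_k]
        weight = sum(sims[y] for y in top)
        score = sum(sims[y] * posteriors[y] for y in top) / weight
        if prior is not None:
            score *= prior[test_class]
        if score > best_score:
            best_class, best_score = test_class, score
    return best_class, best_score

# Toy posteriors for one test image (hypothetical values).
toy_posteriors = {"wolf": 0.6, "fox": 0.3, "bus": 0.05, "van": 0.05}
toy_similarity = {
    "dog": {"wolf": 0.9, "fox": 0.8, "bus": 0.1, "van": 0.1},
    "truck": {"wolf": 0.1, "fox": 0.1, "bus": 0.9, "van": 0.8},
}
predicted, score = ds_predict(toy_posteriors, toy_similarity, top_k=2)
print(predicted, round(score, 4))
```

With a uniform prior the argmax is unchanged, since every score is multiplied by the same constant.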