In standard multiclass classification settings, classes are treated as a categorical set without any extra structure. When we have side-information on the structure of classes, such as semantic relatedness, we can use this information to improve the classification itself, or transfer any knowledge learned from the training domain to solve problems in a new domain.
Consider a classification problem over the following 10 visual objects: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. There are many sources from which semantic information about those objects can be obtained. WordNet is a knowledge base of semantic hierarchies developed manually by linguistic experts (Miller, 1995). In WordNet, objects form a hierarchical tree (Figure 1, Left), where a child object is ‘a kind of’ its parent object. Several similarity metrics can be defined from the hierarchy (see, e.g., http://maraca.d.umn.edu/similarity/measures.html), one of which is shown in Figure 1
(Middle) as a two-dimensional classical multidimensional scaling (MDS) embedding. Semantic relatedness can also be mined automatically from existing corpora, such as Wikipedia or the Google N-Gram corpus, or using search engines, where cosine angles of co-occurrence vectors can be used as the similarity of two words. More recently, elaborate methods for learning vectorial representations of words have also been proposed (Huang et al., 2012; Mikolov et al., 2013; Pennington et al., 2014). Figure 1 (Right) is an example MDS embedding of the representation from (Huang et al., 2012). As can be seen from the figure, the similarity of the same objects can look very different depending on which semantic source and measure is used.
Non-metric representation of similarity. Multiple sources of semantic information have the potential to complement each other for an improved classification result. Still, how to best aggregate similarity from inhomogeneous sources remains an open problem. Similarity measures from different corpora or methods are not directly comparable, and therefore a simple averaging of the measures will not be optimal. The first key idea of our paper is that we use non-metric, ranking-based representation of semantic similarity, instead of numerical representation.
To illustrate our approach, consider the problem of distinguishing cat and truck. In Figure 1 (Middle), cat is closer to dog than to automobile:

d(cat, dog) < d(cat, automobile),

and truck is closer to automobile than to dog:

d(truck, automobile) < d(truck, dog).
In other words, we may be able to distinguish cat and truck by their closeness to other reference objects without using any numerical similarity. As a special case, we can use the similarity of all the other objects to cat to form a semantic ranking of cat. For example, cat has a semantic ranking (1) and truck has a semantic ranking (2), each ordering the remaining objects by their distance in Figure 1 (Middle). Not only may ordinal similarity be sufficient for distinguishing cat and truck, it also seems a more natural representation: ordinal similarity is invariant under scaling and monotonic transformations of numerical values, and therefore has a better chance of being consistent across heterogeneous sources. Moreover, ordinal information can be obtained directly from non-numerical comparisons. In particular, when we ask human subjects to judge the similarity of objects, it is easier for subjects to rank objects than to assign numerical similarity scores.
Zero-shot classification without retraining. In this paper, we apply non-metric, ranking-based representations of semantic similarity to zero-shot classification problems (Palatucci et al., 2009; Lampert et al., 2009; Rohrbach et al., 2010, 2011; Qi et al., 2011; Mensink et al., 2012; Frome et al., 2013; Socher et al., 2013). In zero-shot learning we have samples from the training domain Y (e.g., Y is the set of the 8 objects other than cat and truck), but no samples from the test domain Z (e.g., Z = {cat, truck}). The goal is to construct a classifier using only the training data and semantic knowledge of the two domains Y and Z.
A standard approach to multiclass classification is to use binary classifiers in a one-vs-rest or one-vs-one setting, or to use a multiclass loss directly. If we already have classifiers of the training-domain classes pre-trained in one of those settings, can we use those classifiers ‘for free’ to distinguish the unseen classes cat and truck, without re-training with training-domain samples? Figure 2
provides an intuition on the problem. Consider the multiple decision hyperplanes learned in the one-vs-one setting (other settings are discussed in Section 2). The hyperplanes partition the feature space into ‘cells’, each of which assigns a ranking of objects to the points in its interior. To see this, note that all pairs of objects are compared in each cell (either i ≺ j or j ≺ i), and transitivity (see Section 2) follows from the metric triangle inequality. The ranking of an unseen test sample assigned by the pre-trained classifiers can be compared with the semantic rankings of cat or truck for zero-shot classification, assuming feature and semantic similarities are strongly correlated (see (Deselaers & Ferrari, 2011) for a discussion).
Building on this idea, we present novel zero-shot classification methods that are free of re-training and can aggregate semantic information from multiple sources. We start by proposing a simple deterministic ranking-based method, and then improve it by introducing probability models of rankings. In the probabilistic approach, real-valued classification scores are mapped to posterior probabilities of rankings, and combined with prior probabilities of rankings learned from (multiple) semantic sources. The advantages of the probabilistic approach are explained further in the method and experiment sections. For both the posterior and the prior probabilities of rankings, we use classic probabilistic models of rankings, including the Plackett-Luce, the Mallows, and the Babington-Smith models.
To summarize, the contributions of this paper are:
- a non-metric, ranking-based representation of semantic structure, as an alternative to numerical similarity representation;
- methods for aggregating multiple semantic sources using probability models of rankings;
- deterministic and probabilistic zero-shot classifiers built from pre-trained classifiers without retraining.
In the experiment section we demonstrate the advantages of our approach over a numerically-based approach and a deterministic approach using two well-known image databases, Animals-with-Attributes (Lampert et al., 2009) and CIFAR-10/100 (Krizhevsky, 2009). In particular, we demonstrate that aggregating different semantic sources, including crowd-sourcing, leads to more accurate zero-shot classification.
The remainder of the paper is organized as follows: In Section 2, we present deterministic and probabilistic ranking-based algorithms for zero-shot classification. In Section 3, we relate our work to others in the literature. In Section 4, we test our methods with real-world image databases, and conclude the paper in Section 5.
2 Zero-shot learning with rankings
Notations. Let S_L denote the set of all rankings of L items/classes, and let π ∈ S_L denote a ranking: π(i) is the position of item i, and π⁻¹(j) is the item whose position is j. We write i ≺ j (‘i precedes j’) when π(i) < π(j) (‘item i is ranked higher than item j’). A top-K ranking is a straightforward generalization of a ranking, in which only the order of the first K items matters and the order of the remaining items is ignored. With an abuse of notation, we use π as a top-K ranking and S_L as the set of all top-K rankings as well, since a full ranking is the special case K = L. A partial order is a further generalization of a ranking and a top-K ranking. In a (full) ranking, a pair of items i, j has to satisfy either i ≺ j or j ≺ i, whereas in a partial order it may satisfy neither. In addition, a partial order has to satisfy transitivity: for any triple i, j, k, i ≺ j and j ≺ k implies i ≺ k. Item positions are in general undefined for a partial order.
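As a concrete illustration of this notation, the sketch below stores a ranking as a position table (0-indexed for convenience, whereas the paper's positions are 1-indexed) and shows the inverse map and the precedence relation:

```python
def inverse(pi):
    """pi_inv[j] = the item whose position is j."""
    pi_inv = [None] * len(pi)
    for item, pos in enumerate(pi):
        pi_inv[pos] = item
    return pi_inv

def precedes(pi, i, j):
    """True iff item i is ranked higher than item j (i 'precedes' j)."""
    return pi[i] < pi[j]

# pi[i] = position of item i: item 1 is first, item 2 second, item 0 last.
pi = [2, 0, 1]
assert inverse(pi) == [1, 2, 0]
assert precedes(pi, 1, 0)
```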
2.1 Deterministic approach
A simple deterministic approach to zero-shot learning using semantic rankings was already outlined in the Introduction. In the one-vs-one setting, the L(L−1)/2 pre-trained classifiers assign a ranking π(x) to a test point x directly. In the one-vs-rest setting, the L binary classifiers assign real-valued scores f_i(x) to a test point x according to the point’s distances to the decision hyperplanes; the scores can be sorted to provide a ranking π(x). Given this ranking of a test sample x, and prior knowledge of the semantic rankings σ_z of the test-domain classes z ∈ Z, we predict

ẑ = argmin_{z ∈ Z} d(π(x), σ_z),   (3)
where d(·,·) is a distance between two rankings. For example, let Z = {cat, truck}, whose semantic rankings σ_cat and σ_truck are (1) and (2), respectively. If an unseen image x has classification scores ordered such that d(π(x), σ_cat) < d(π(x), σ_truck), then we classify x as a cat rather than a truck. We use Kendall’s ranking distance, which is the number of discordantly ordered pairs:

d(π, σ) = Σ_{i<j} 1[ (π(i) − π(j)) (σ(i) − σ(j)) < 0 ].   (5)
Sometimes it may make more sense to compare only the closest items than to compare all the items. A top-K version of Kendall’s distance was proposed in (Critchlow, 1985): a top-K list is identified with the set of full rankings consistent with it, and the distance between two top-K lists is the induced Hausdorff distance, which can be computed in closed form from the sets of items appearing in both, only one, or neither of the two top-K lists.
Zero-shot classification using the rule (3) will be called the deterministic ranking-based (DR) method.
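A minimal sketch of the DR method follows; the class names, scores, and semantic rankings are illustrative placeholders, not the paper's actual data:

```python
from itertools import combinations

def kendall_distance(pi, sigma):
    """Number of item pairs ordered differently by two rankings;
    pi[i] is the (0-indexed) position of item i."""
    return sum(
        1 for i, j in combinations(range(len(pi)), 2)
        if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0
    )

def dr_predict(scores, semantic_rankings):
    """Deterministic ranking-based prediction: sort classifier scores into
    a ranking, then return the test class whose semantic ranking is nearest."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    pi = [0] * len(scores)
    for pos, item in enumerate(order):
        pi[item] = pos
    return min(semantic_rankings,
               key=lambda z: kendall_distance(pi, semantic_rankings[z]))

# Hypothetical example: 4 training classes (dog, automobile, deer, ship),
# semantic rankings of cat and truck over those classes.
rankings = {"cat": [0, 3, 1, 2], "truck": [3, 0, 2, 1]}
assert dr_predict([0.9, 0.1, 0.6, 0.2], rankings) == "cat"
```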
2.2 Probabilistic approach
We can further refine ranking-based algorithms by taking a probabilistic approach. There are several causes of uncertainty in the ranking-based representation. First, classifier outputs for a test-domain sample can have low confidence, since the classifiers are trained only with training-domain samples. Second, prior knowledge of semantic rankings from multiple semantic sources may not be unanimous. Third, feature and semantic similarities do not always coincide. For these reasons, we consider probability models of (top-K) rankings. We discuss three models: the Mallows (Mallows, 1957), the Plackett-Luce (Plackett, 1975; Luce, 1959; Marden, 1995), and the Babington-Smith (Joe & Verducci, 1993; Smith, 1950) models, introducing each where it is needed (see (Critchlow, 1985; Critchlow et al., 1991; Marden, 1995) for reviews).
In our probabilistic zero-shot learning approach, we assume the dependence

x → π → z,

that is, the label z of a sample is dependent only on the predicted ranking π, which in turn is dependent only on the sample x. The probability of a test-domain label z given the sample x is obtained by marginalizing out the latent ranking variable π:

P(z | x) = Σ_{π ∈ S_L} P(z | π) P(π | x),   (7)

and the final prediction of the label for a test sample x is made by ẑ = argmax_{z ∈ Z} P(z | x).
There are two terms in (7): a probabilistic ranker P(π | x), and a term derived from the prior for semantic rankings, P(z | π) ∝ P(π | z) P(z). First, we describe the prior for semantic rankings P(π | z), learned from one or more semantic sources (e.g., different corpora or crowd-sourcing), in Section 2.3. Second, we describe probabilistic rankers based on standard classifiers trained with training-domain data in Section 2.4. The final zero-shot classifier for unseen samples, which brings these two learned components together, is described in Section 2.5.
2.3 Prior for semantic ranking
We encode the semantic similarity between training- and test-domain classes by probabilistic ranking models P(π | z) over rankings π of the training-domain classes, one for each test-domain class z ∈ Z. To learn P(π | z), we use multiple instances of rankings for each test-domain class. These rankings can come from multiple linguistic corpora or directly from human-rated rankings. Below we outline three popular models of rankings: the Plackett-Luce, the Mallows, and the Babington-Smith models.
Plackett-Luce. The Plackett-Luce model for the probability of observing a top-K ranking π is

P(π) = Π_{j=1}^{K} [ w_{π⁻¹(j)} / Σ_{l=j}^{L} w_{π⁻¹(l)} ].   (8)
The non-negative parameters w_1, …, w_L indicate the relative chances of each item being ranked higher than the rest of the items, and are invariant under constant scaling of the w_i’s. One interpretation of the generative procedure of the Plackett-Luce model is the vase interpretation from (Silberberg, 1980). Suppose there is a vase with an infinite number of balls marked 1 to L, in proportions given by the w_i’s. At the first stage, a ball is drawn and its mark is recorded as π⁻¹(1). At the second stage, another ball is drawn and its mark is recorded as π⁻¹(2), unless that mark has already been recorded, in which case the draw is repeated. The procedure continues until K distinct marks are drawn and recorded. The probability of this generative procedure is exactly (8).
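The vase scheme can be simulated directly. A sketch, with illustrative weights (sampling without replacement is used in place of redrawing duplicate balls, which is equivalent):

```python
import random

def sample_plackett_luce(w, K, rng=random):
    """Draw a top-K ranking from the Plackett-Luce model with weights w,
    via the vase interpretation (sequential sampling without replacement)."""
    items, weights, ranking = list(range(len(w))), list(w), []
    for _ in range(K):
        total = sum(weights)
        r, acc = rng.random() * total, 0.0
        for idx, wt in enumerate(weights):
            acc += wt
            if r <= acc:
                ranking.append(items.pop(idx))
                weights.pop(idx)
                break
        else:  # floating-point edge case: assign the last remaining item
            ranking.append(items.pop())
            weights.pop()
    return ranking  # ranking[j] = the item drawn at stage j+1
```

With weights [10, 1, 1], item 0 is drawn first with probability 10/12, so it should top most sampled rankings.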
Given N samples (= semantic sources) of rankings π_1, …, π_N for a class, the parameters can be estimated by maximum likelihood. The log-likelihood of (8) is

ℓ(w) = Σ_{n=1}^{N} Σ_{j=1}^{K} [ log w_{π_n⁻¹(j)} − log Σ_{l=j}^{L} w_{π_n⁻¹(l)} ],   (9)

with possibly an additional regularization term. Hunter (Hunter, 2004)
proposed an iterative estimation method using the minorization-maximization (MM) procedure, which generalizes the expectation-maximization procedure and converges to a global maximum under a certain condition on the data. In our experience, simple gradient-based or Newton-Raphson methods also work well with an appropriate choice of the regularization parameter.
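A plain gradient-ascent sketch of the MLE of (9), in the log-parameterization θ_i = log w_i; the learning rate, iteration count, and Gaussian regularizer are illustrative choices, not the paper's (and this is not Hunter's MM algorithm):

```python
import math

def pl_log_likelihood(theta, rankings, L):
    """Log-likelihood (9); theta[i] = log w_i, rankings are top-K item lists."""
    ll = 0.0
    for pi in rankings:
        remaining = set(range(L))
        for item in pi:
            denom = sum(math.exp(theta[r]) for r in remaining)
            ll += theta[item] - math.log(denom)
            remaining.remove(item)
    return ll

def pl_mle(rankings, L, lr=0.05, iters=500, reg=0.01):
    """Regularized gradient ascent on (9): each stage contributes +1 to the
    chosen item's gradient and -softmax weight to every remaining item."""
    theta = [0.0] * L
    for _ in range(iters):
        grad = [-reg * t for t in theta]
        for pi in rankings:
            remaining = set(range(L))
            for item in pi:
                denom = sum(math.exp(theta[r]) for r in remaining)
                for r in remaining:
                    grad[r] -= math.exp(theta[r]) / denom
                grad[item] += 1.0
                remaining.remove(item)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta
```

Fitting to twenty copies of the top-2 list [0, 1] over three items recovers the expected ordering of the weights.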
Mallows. The Mallows model (Mallows, 1957) for full rankings is defined as

P(π) = exp( −λ d(π, σ) ) / ψ(λ),

where σ is the mode, λ ≥ 0 is the spread parameter, ψ(λ) is the normalization constant, and d(·,·) is Kendall’s distance between two rankings. It may be viewed as a discrete analog of the Gaussian distribution for rankings. The distance to the identity ranking e can further be written as d(π, e) = Σ_{j=1}^{L−1} V_j(π), where the V_j’s are defined as

V_j(π) = #{ items i placed after position j : i < π⁻¹(j) },  j = 1, …, L−1.   (10)
Fligner et al. (Fligner & Verducci, 1986) proposed a Mallows model for top-K lists by marginalizing the full Mallows model:

P(π) = exp( −λ Σ_{j=1}^{K} V_j(π) ) / ψ_K(λ),   (11)

where the V_j’s are defined in (10) (with the mode taken to be the identity, without loss of generality) and ψ_K(λ) is the normalization constant, which can be computed in closed form:

ψ_K(λ) = Π_{j=1}^{K} (1 − e^{−(L−j+1)λ}) / (1 − e^{−λ}).
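The closed form can be checked by brute-force enumeration for small L. A sketch, computing the V_j's of (10) via the sequential "remaining items" view with the identity as the reference ranking:

```python
import math
from itertools import permutations

def topk_V_sum(prefix, L):
    """Sum of V_j over the top-K positions: V_j counts the remaining items
    with a smaller index than the j-th pick (identity reference ranking)."""
    total, remaining = 0, set(range(L))
    for item in prefix:
        total += sum(1 for r in remaining if r < item)
        remaining.remove(item)
    return total

def psi_topk(lam, L, K):
    """Closed-form top-K Mallows normalization constant."""
    return math.prod((1 - math.exp(-(L - j + 1) * lam)) / (1 - math.exp(-lam))
                     for j in range(1, K + 1))

# Brute-force sum over all ordered top-K lists matches the closed form.
lam, L, K = 0.7, 5, 2
brute = sum(math.exp(-lam * topk_V_sum(p, L)) for p in permutations(range(L), K))
assert abs(brute - psi_topk(lam, L, K)) < 1e-9
```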
Given N samples of rankings π_1, …, π_N, the parameters of the Mallows model for full rankings can also be found by MLE (Feigin & Cohen, 1978). When the mode σ is known, the MLE of the spread parameter λ can be found by convex optimization, owing to the fact that the log-likelihood is a concave function of λ. The MLE of the mode σ is the minimizer of the sum of distances Σ_n d(σ, π_n), which is equivalent to the Kemeny consensus problem (Kemeny, 1959) and is NP-hard in general (Bartholdi et al., 1989). However, there are known heuristic methods, such as sequential transposition of adjacent items (Critchlow, 1985) or other admissible heuristics (Meila et al., 2007); we use the former. Starting from the average ranking as the initial value of σ, we search for the pair of adjacent items whose swapping lowers the sum of distances the most, and stop if there is no such pair or if the maximum number of iterations (1000 in our case) is exceeded. The MLE with (11) can be solved in the same way, using Σ_{j=1}^{K} V_j in place of the full Kendall distance.
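A sketch of the transposition heuristic; for simplicity it takes any improving adjacent swap per pass rather than the single best one, a slight variation on the procedure described above:

```python
from itertools import combinations

def kendall(pi, sigma):
    """Kendall distance; pi[i], sigma[i] = position of item i."""
    return sum(1 for i, j in combinations(range(len(pi)), 2)
               if (pi[i] - pi[j]) * (sigma[i] - sigma[j]) < 0)

def consensus(rankings, max_iter=1000):
    """Greedy adjacent-transposition search for a ranking minimizing the sum
    of Kendall distances, initialized from the average-position ranking."""
    L = len(rankings[0])
    order = sorted(range(L), key=lambda i: sum(r[i] for r in rankings))
    sigma = [0] * L
    for pos, item in enumerate(order):
        sigma[item] = pos
    cost = lambda s: sum(kendall(s, r) for r in rankings)
    best = cost(sigma)
    for _ in range(max_iter):
        improved = False
        for p in range(L - 1):
            cand = list(sigma)
            a, b = cand.index(p), cand.index(p + 1)  # items at positions p, p+1
            cand[a], cand[b] = cand[b], cand[a]
            c = cost(cand)
            if c < best:
                sigma, best, improved = cand, c, True
        if not improved:
            break
    return sigma
```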
The Babington-Smith model (Joe & Verducci, 1993; Smith, 1950) is another probabilistic ranking model, based on pairwise comparisons. Given two items i and j, let p_{ij} be the probability that item i is ranked higher than item j. Given these preferences p_{ij}, the probability of a ranking π is proportional to the product of p_{ij} over all pairs with i ≺ j. After introducing new parameters θ_{ij} (Joe & Verducci, 1993), the probability of a top-K ranking can be written as

P(π) ∝ Π_{j=1}^{K} Π_{l=j+1}^{L} θ_{π⁻¹(j) π⁻¹(l)}   (13)

(we present a slightly modified form).
The Babington-Smith model is similar to the Plackett-Luce model in that the probability is a product of the θ_{ij}’s: the larger θ_{ij} is, the more likely item i is to be ranked higher than item j. However, unlike the Plackett-Luce model, the normalizing constant has no known closed form. We therefore do not use it for modeling the semantic prior, but we use it for the probabilistic ranker in the next section.
2.4 Probabilistic ranker from classifiers
The probabilistic ranker takes a sample x as input and probabilistically ranks the similarity of x to the training-domain classes. We propose to build rankers from standard settings of multiclass classifiers: one-vs-rest, one-vs-one, or a multiclass loss as in (Crammer & Singer, 2002). Any classifier that outputs a real-valued confidence or score can be used for this purpose.
One-vs-rest binary classifiers. In this setting, there is one score f_i(x) for each training-domain class i. We relate the real-valued scores to the non-negative parameters of the Plackett-Luce model (8) by setting w_i = exp(f_i(x)), to get

P(π | x) = Π_{j=1}^{K} [ exp(f_{π⁻¹(j)}(x)) / Σ_{l=j}^{L} exp(f_{π⁻¹(l)}(x)) ].   (14)
Instead of producing a single ranking as in the deterministic approach (3), this ranker evaluates the probability of any ranking given x, taking into account the confidences f_i(x) of the one-vs-rest classifiers.
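A sketch of this ranker, evaluating the Plackett-Luce probability of a top-K list from hypothetical scores f_i(x) (the relation w_i = exp(f_i(x)) is the one assumed in the text):

```python
import math

def pl_ranking_prob(scores, topk):
    """Probability of the top-K list `topk` (item indices, most similar
    first) under the Plackett-Luce ranker with weights w_i = exp(f_i(x));
    scores[i] = f_i(x)."""
    w = [math.exp(s) for s in scores]
    prob, remaining = 1.0, set(range(len(scores)))
    for item in topk:
        prob *= w[item] / sum(w[r] for r in remaining)
        remaining.remove(item)
    return prob

# Probabilities of all full rankings sum to one, and rankings that agree
# with the score order are the most likely.
perms = [[0, 1, 2], [0, 2, 1], [1, 0, 2], [1, 2, 0], [2, 0, 1], [2, 1, 0]]
total = sum(pl_ranking_prob([2.0, 1.0, 0.0], p) for p in perms)
assert abs(total - 1.0) < 1e-9
```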
One-vs-one binary classifiers. In this setting, there are L(L−1)/2 scores f_{ij}(x), one for each pair of training-domain classes. We relate these scores to the parameters of the Babington-Smith model (13) by setting θ_{ij} = exp(f_{ij}(x)), to get

P(π | x) ∝ Π_{j=1}^{K} Π_{l=j+1}^{L} exp( f_{π⁻¹(j) π⁻¹(l)}(x) ).   (15)
Similar to (14), this ranker evaluates the probability of any ranking given x, taking into account the confidences f_{ij}(x) of the one-vs-one classifiers. Note that if the pre-trained classifiers are linear, that is, f_{ij}(x) = w_{ij}ᵀx, then this ranker is quite similar to (14), since w_{ij}ᵀx = w_iᵀx − w_jᵀx when w_{ij} is defined as w_i − w_j. However, it has a different normalization term from (14).
Multiclass classifiers. Other types of classifiers can be accommodated. When the pre-trained classifiers are multinomial logistic regression (= softmax) or SVMs with a multiclass loss (Crammer & Singer, 2002), we again have L scores f_i(x) = w_iᵀx computed from the parameter vectors w_i. Similar to the one-vs-rest case, we can use the same relation with the Plackett-Luce model, which gives us the same ranker as (14). Note that if the original classifier is a multinomial logistic regression, then (14) is in fact a direct generalization of logistic regression to (top-K) rankings, which is also observed in (Cheng et al., 2010). In this case, the trained parameters coincide with the optimal maximum-likelihood parameters of (14) trained with top-1 rankings, which are the ground-truth labels of the training domain.
To summarize, there exist natural interpretations of the Plackett-Luce and the Babington-Smith models that allow us to relate classification scores to their parameters and to produce posterior probabilities of rankings without any further training. (It is not immediately clear whether the Mallows model can be adapted to this setting; we leave this for future work.)
2.5 Zero-shot prediction
The probabilistic rankers constructed from pre-trained classifiers and the priors for semantic rankings learned from semantic sources are plugged into (7), and the final prediction of the label for a test sample x is made by ẑ = argmax_{z ∈ Z} P(z | x). The sum over (top-K) rankings in (7) cannot be computed analytically for either the Plackett-Luce or the Mallows model and requires approximations, e.g., by Markov chain Monte Carlo (MCMC) sampling. Alternatively, we use the single most probable ranking π̂(x) = argmax_π P(π | x) and a uniform prior P(z), somewhat similar to (Rohrbach et al., 2010). In our preliminary experiments, MCMC-based summation showed inferior results to this simple version and is therefore omitted from the report. The final zero-shot classifier is the MAP classifier
ẑ = argmax_{z ∈ Z} P(π̂(x) | z), with π̂(x) the most probable ranking under the one-vs-rest ranker,   (16)

for pre-trained one-vs-rest/multiclass-loss classifiers, and
ẑ = argmax_{z ∈ Z} P(π̂(x) | z), with π̂(x) the most probable ranking under the one-vs-one ranker,   (17)

for pre-trained one-vs-one classifiers. We summarize the overall training and testing procedures below.
Training Step 1. Obtain pre-trained classifiers
Input: training-domain sample and label pairs, regularization hyperparameter
Output: score functions f_i or f_{ij}
Method: one-vs-rest/one-vs-one/multiclass with any classifier

Training Step 2. Learn priors for semantic rankings
Input: ranking and test-domain label pairs collected from corpora or crowdsourcing
Output: consensus rankings for each test-domain class from either the Plackett-Luce model (8) or the Mallows model (11)
Method: MLE of (9) by BFGS, or MLE of (12) by sequential transposition

Testing. Zero-shot classification
Input: data x, parameter K for the top-K list size
Output: prediction of the test-domain label
Method: MAP estimation (16) or (17), using the scores from Training Step 1 and the consensus rankings from Training Step 2
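Putting the pieces together, a minimal end-to-end sketch of the MAP rule with a uniform class prior and a Plackett-Luce semantic model; the class names, semantic weights, and scores are all hypothetical placeholders:

```python
def topk_from_scores(scores, K):
    """Most probable top-K list under the ranker: sort one-vs-rest scores."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:K]

def pl_prob(weights, topk):
    """Plackett-Luce probability of a top-K list under semantic weights."""
    prob, remaining = 1.0, set(range(len(weights)))
    for item in topk:
        prob *= weights[item] / sum(weights[r] for r in remaining)
        remaining.remove(item)
    return prob

def zero_shot_predict(scores, semantic_weights, K=2):
    """MAP zero-shot rule with a uniform prior over test classes:
    semantic_weights[z] = learned Plackett-Luce weights over training classes."""
    pi_hat = topk_from_scores(scores, K)
    return max(semantic_weights,
               key=lambda z: pl_prob(semantic_weights[z], pi_hat))

# Hypothetical setup: training classes (dog, automobile, deer) and learned
# semantic weights for the unseen classes cat and truck.
weights = {"cat": [5.0, 0.5, 2.0], "truck": [0.5, 5.0, 1.0]}
assert zero_shot_predict([1.5, -0.5, 0.3], weights, K=2) == "cat"
```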
3 Related work
There are two major approaches to zero-shot learning explored in the literature: attribute-based and similarity-based. In attribute-based knowledge transfer (e.g., (Palatucci et al., 2009; Lampert et al., 2009; Akata et al., 2013)), the classes from the training and test domains are assumed to be distinguishable by a common list of attributes. Attribute-based approaches often show excellent empirical performance (Palatucci et al., 2009; Rohrbach et al., 2010). However, designing attributes that are discriminative, common to multiple classes, and correlated with the original features all at once can be a non-trivial task that typically requires human supervision. Similar arguments can be found in (Rohrbach et al., 2010) and (Mensink et al., 2014).
By contrast, similarity-based zero-shot approaches use the semantic similarity between training-domain and test-domain classes directly. The advantage of this approach is that similarity information can be mined automatically from the web, linguistic corpora, or other sources. Similarity information has been used to build a probabilistic zero-shot classifier called the direct similarity-based method (DS) (Rohrbach et al., 2010, 2011), which parallels the attribute-based approach of (Lampert et al., 2009). The direct similarity-based method also uses classification scores and probabilistic inference, as ours does, but it uses numerical similarity instead of the non-metric ranking representation of our method. More recently, similarity-based approaches using semantic embeddings have been proposed (Frome et al., 2013; Socher et al., 2013). In these algorithms, training- and test-domain objects are simultaneously embedded into a semantic space using multilayer neural networks. While these two methods produce interesting results, they use specific metric similarity models and require retraining when the semantic model changes, unlike our method. Mensink et al. use a linear combination of pre-trained classifiers for classifying unseen data (Mensink et al., 2014); they use co-occurrence statistics as semantic information, whereas we do not assume a specific type of similarity information. Lastly, our method provides a means of aggregating multiple semantic sources, which has not been addressed in the literature.
4 Experiments

We use two datasets: 1) the Animals with Attributes dataset (Lampert et al., 2009) and 2) CIFAR-100/10 (Krizhevsky, 2009), built from the tiny images collection (Torralba et al., 2008). Semantic similarity is obtained from WordNet distance, web searches (Rohrbach et al., 2010), word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and from Amazon Mechanical Turk. Table 1 summarizes the characteristics of the datasets and the types of semantic information used in the experiments. More details on data processing are provided in the Appendix.
| |Animals|CIFAR|
|Number of training/validation/test samples|21847 / 2448 / 6180|50000 / 50000 / 10000|
|Number of training/test classes|40 / 10|100 / 10|
|Linguistic sources|WordNet, Wikipedia, Yahoo, YahooImage, Flickr|WordNet, word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014)|
|Number of surveys from crowd-sourcing|500|500|
We perform comprehensive tests of the probabilistic ranking-based (PR) zero-shot model under 1) three learning settings (one-vs-rest, one-vs-one, multiclass), 2) two types of semantic sources (linguistic, crowd-sourcing), and 3) different prior models for semantic rankings (the Plackett-Luce and the Mallows models). We compare the probabilistic ranking-based method (PR, Sec. 2.5) to the deterministic ranking-based method (DR, Sec. 2.1) and the direct similarity-based method (DS, (Rohrbach et al., 2010)), which is the closest state-of-the-art method to ours that uses classifier scores. We also refer to other results in the literature for comparison.
Regularization parameters for the classifiers are determined from the validation set, and partially by hand to avoid exhaustive cross-validation. We test with different values of the hyperparameter K (the top-K list size) and report the corresponding results. For one-vs-rest and one-vs-one, we trained SVMs followed by Platt’s probabilistic scaling (Platt, 1999). For multiclass, we used multinomial logistic regression.
4.3 Result 1 – Discriminability of semantic rankings
[Table 2: zero-shot classification accuracy for Animals and CIFAR, comparing Direct Similarity (DS), Deterministic Ranking (DR), and Probabilistic Ranking (PR) under linguistic and crowd-sourced semantic sources; numeric entries not recovered. Caption: accuracy using the geometric mean of similarities. The best result for each method is highlighted in boldface.]
We first compare the discriminability of classes under ranking vs. numerical representations of similarity, without using image data. Using all five linguistic sources for Animals, we compute pairwise distances of the 5×10 = 50 similarity vectors. Two types of distances are computed: the Euclidean distance of numerical similarity, with or without normalization, and the Hausdorff distance for top-K lists using Kendall’s ranking distance (5). Note that the rankings are obtained by sorting the numerical similarity. For these different representations, the average accuracy of leave-one-out 1-nearest-neighbor classification was 0.44 (Euclidean), 0.62 (Euclidean with normalization), 0.72 (Kendall’s, K=2), 0.70 (Kendall’s, K=5), and 0.64 (Kendall’s, K=10), which shows that the ranking distances are better than the Euclidean distances for discriminating test-domain classes when there are multiple heterogeneous sources. Figure 3 shows the embeddings from classical multidimensional scaling (MDS) using these distances. It shows qualitative differences between numerical similarity (a and b) and ranking (c). The embedding of rankings has better between-class separation and within-class clustering than the embeddings of numerical similarity, suggesting that the non-metric order information is more consistent than numerical similarity across different sources.
We also compute the leave-one-out accuracy of Bayesian classification with the rankings collected directly from crowd-sourcing for the Animals and CIFAR datasets. Out of 500 rankings, one ranking is held out and the 499 remaining rankings are used to build 10 semantic ranking probabilities using both the Mallows and the Plackett-Luce models. The class of the held-out ranking is predicted by the maximum of P(π | z) over all 10 classes. For Animals, the average accuracy was 0.91/0.99/0.99 (K=2/5/10) using the Mallows model, and 0.79/0.84/0.96 (K=2/5/10) using the Plackett-Luce model. For CIFAR, the average accuracy was 0.73/0.79/0.84 (K=2/5/10) using the Mallows model, and 0.72/0.77/0.74 (K=2/5/10) using the Plackett-Luce model. These numbers show that the rankings obtained from crowd-sourcing carry enough information to discriminate the test-domain classes with up to 0.99 accuracy.
4.4 Result 2 - Comparison of PR, DR, and DS
We compare the probabilistic ranking (PR), deterministic ranking (DR), and direct similarity (DS) methods on zero-shot classification accuracy. All three methods share the same image features and the same linguistic sources of semantic information (except for the crowd-sourcing for PR), but use them in different ways. PR uses probabilistic models to combine multiple sources of semantic similarity. DR and DS inherently use a single source of semantic similarity, and therefore the multiple sources have to be combined heuristically: we first normalize individual similarity sources to the range from 0 to 1, and then compute arithmetic and geometric means over the multiple sources. Note that the main difference between DR and DS is that DR uses rankings whereas DS uses numerical values.
DS vs DR. The results are shown in Table 2. For both DR and DS, using averaged semantic similarity (“Arithm” and “Geom”) is better than using individual similarity (“Indiv”), for both the Animals and CIFAR datasets. A plausible interpretation is that the aggregate similarity is more reliable than the individual similarities, despite the heuristic methods of aggregation. The highest accuracy from DS is 0.354 (for Animals) and 0.316 (for CIFAR), whereas the highest accuracy from DR is 0.359 (for Animals) and 0.281 (for CIFAR). DR performs slightly better than DS on Animals, but worse on CIFAR. Within DR, accuracy is not affected much by the pre-trained classifier type (one-vs-rest, one-vs-one, multiclass).
PR vs others. Using the same linguistic sources, the highest accuracy from PR is 0.370 (for Animals) and 0.339 (for CIFAR), which is much higher than DS and DR regardless of whether a single (Indiv) or multiple (Arithm and Geom) sources are used. This suggests the advantage of using probabilistic models to aggregate multiple semantic sources. Within the PR results, one-vs-rest and one-vs-one classifiers perform comparably, and multiclass logistic regression performs best. PR performs even better with crowd-sourced semantic information (0.395) than with linguistic sources (0.370) on Animals, but the opposite is true on CIFAR, probably due to the lower reliability of human subject ratings for CIFAR (ranking 100 categories correctly, compared to 40 for Animals).
In the literature, the accuracy of attribute-based methods on Animals ranges from 0.36 to 0.44 (Tables 3 and 4, (Akata et al., 2013)), compared to 0.395 from our method, which does not use attributes. We remind the reader that finding ‘good’ attributes is itself a non-trivial task. When both similarity and attributes are mined automatically from corpora, similarity-based methods perform much better than attribute-based methods (individual average of 0.22 from Table 1, (Rohrbach et al., 2010)).
Lastly, Figure 4 shows the two-class classification accuracy of PR (Plackett-Luce, linguistic sources), DS, and an embedding-based method on selected pairs of classes from CIFAR (Figure 3, (Socher et al., 2013)). Although the numbers may not be directly comparable due to different settings (Socher et al. used the remaining classes of CIFAR-10 instead of CIFAR-100 for training, and also used different semantic information), PR performs noticeably better than the two state-of-the-art methods. In fact, we can distinguish auto vs deer, deer vs ship, or cat vs truck with high accuracy, without a single training image of these categories.
5 Conclusion

In this paper, we propose a ranking-based representation of semantic similarity as an alternative to metric representations of similarity. Using rankings, semantic information from multiple sources can be aggregated naturally to produce a better representation than individual sources. Using this representation and probability models of rankings, we present new zero-shot classifiers that can be constructed from pre-trained classifiers without retraining, and demonstrate their potential for exploiting the semantic structure of real-world visual objects.
- Akata et al. (2013) Akata, Zeynep, Perronnin, Florent, Harchaoui, Zaid, and Schmid, Cordelia. Label-embedding for attribute-based classification. In CVPR, 2013.
- Bartholdi et al. (1989) Bartholdi, J., Tovey, C. A., and Trick, M. Voting schemes for which it can be difficult to tell who won. Social Choice and Welfare, 6(2):157–165, 1989.
- Cheng et al. (2010) Cheng, Weiwei, Dembczynski, Krzysztof, and Hüllermeier, Eyke. Label ranking methods based on the Plackett-Luce model. In ICML, pp. 215–222, 2010.
- Crammer & Singer (2002) Crammer, Koby and Singer, Yoram. On the algorithmic implementation of multiclass kernel-based vector machines. The Journal of Machine Learning Research, 2:265–292, 2002.
- Critchlow et al. (1991) Critchlow, Douglas E, Fligner, Michael A, and Verducci, Joseph S. Probability models on rankings. Journal of Mathematical Psychology, 35(3):294–318, 1991.
- Critchlow (1985) Critchlow, Douglas Edward. Metric methods for analyzing partially ranked data. Number 34 in Lecture notes in statistics. Springer, Berlin [u.a.], 1985.
- Deselaers & Ferrari (2011) Deselaers, T. and Ferrari, V. Visual and semantic similarity in ImageNet. In CVPR, pp. 1777–1784, 2011.
- Feigin & Cohen (1978) Feigin, Paul D and Cohen, Ayala. On a model for concordance between judges. Journal of the Royal Statistical Society. Series B (Methodological), pp. 203–213, 1978.
- Fligner & Verducci (1986) Fligner, Michael A and Verducci, Joseph S. Distance based ranking models. Journal of the Royal Statistical Society. Series B (Methodological), pp. 359–369, 1986.
- Frome et al. (2013) Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeff, Mikolov, Tomas, et al. Devise: A deep visual-semantic embedding model. In NIPS, pp. 2121–2129, 2013.
- Huang et al. (2012) Huang, Eric H, Socher, Richard, Manning, Christopher D, and Ng, Andrew Y. Improving word representations via global context and multiple word prototypes. In Proceedings of ACL, pp. 873–882. Association for Computational Linguistics, 2012.
- Hunter (2004) Hunter, David R. MM algorithms for generalized Bradley-Terry models. The Annals of Statistics, 32(1):384–406, 2004.
- Joe & Verducci (1993) Joe, Harry and Verducci, Joseph S. On the Babington Smith class of models for rankings. In Probability models and statistical analyses for ranking data, pp. 37–52. Springer, 1993.
- Kemeny (1959) Kemeny, John G. Mathematics without numbers. Daedalus, pp. 577–591, 1959.
- Krizhevsky (2009) Krizhevsky, Alex. Learning multiple layers of features from tiny images. Technical report, 2009.
- Lampert et al. (2009) Lampert, C.H., Nickisch, H., and Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. CVPR, pp. 951–958, 2009.
- Luce (1959) Luce, R. D. Individual Choice Behavior. Wiley, New York., 1959.
- Mallows (1957) Mallows, Colin L. Non-null ranking models. I. Biometrika, 44(1/2):114–130, 1957.
- Marden (1995) Marden, John I. Analyzing and modeling rank data, volume 64. CRC Press, 1995.
- Meila et al. (2007) Meila, Marina, Phadnis, Kapil, Patterson, Arthur, and Bilmes, Jeff A. Consensus ranking under the exponential model. In Parr, Ronald and van der Gaag, Linda C. (eds.), UAI, pp. 285–294. AUAI Press, 2007.
- Mensink et al. (2012) Mensink, Thomas, Verbeek, Jakob, Perronnin, Florent, and Csurka, Gabriela. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV, 2012.
- Mensink et al. (2014) Mensink, Thomas, Gavves, Efstratios, and Snoek, Cees GM. Costa: Co-occurrence statistics for zero-shot classification. In CVPR, pp. 2441–2448, 2014.
- Mikolov et al. (2013) Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
- Miller (1995) Miller, George A. WordNet: A lexical database for English. Commun. ACM, 38(11):39–41, November 1995.
- Palatucci et al. (2009) Palatucci, Mark, Pomerleau, Dean, Hinton, Geoffrey, and Mitchell, Tom. Zero-shot learning with semantic output codes. In NIPS, pp. 1410–1418, 2009.
- Pennington et al. (2014) Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing (EMNLP), 2014.
- Plackett (1975) Plackett, Robin L. The analysis of permutations. Applied Statistics, pp. 193–202, 1975.
- Platt (1999) Platt, John C. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pp. 61–74. MIT Press, 1999.
- Qi et al. (2011) Qi, Guo-Jun, Aggarwal, C., Rui, Yong, Tian, Qi, Chang, Shiyu, and Huang, T. Towards cross-category knowledge propagation for learning visual concepts. CVPR, pp. 897–904, 2011.
- Rohrbach et al. (2011) Rohrbach, M., Stark, M., and Schiele, B. Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR, pp. 1641–1648, 2011.
- Rohrbach et al. (2010) Rohrbach, Marcus, Stark, Michael, Szarvas, Gyorgy, Gurevych, Iryna, and Schiele, Bernt. What helps where and why? semantic relatedness for knowledge transfer. CVPR, pp. 910–917, 2010.
- Silberberg (1980) Silberberg, A.R. Ph.D. thesis, 1980.
- Smith (1950) Smith, B Babington. Discussion of professor Ross’s paper. Journal of the Royal Statistical Society B, 12(1):41–59, 1950.
- Socher et al. (2013) Socher, Richard, Ganjoo, Milind, Manning, Christopher D, and Ng, Andrew. Zero-shot learning through cross-modal transfer. In NIPS, pp. 935–943, 2013.
- Torralba et al. (2008) Torralba, Antonio, Fergus, Rob, and Freeman, William T. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
Appendix A Datasets
The Animals with Attributes dataset (Animals) was collected and processed by Lampert et al. (2009). The training domain consists of images of 40 types of animals, of which 21,847 and 2,448 images were used as training and validation sets. From each image, 10,942-dimensional features are extracted (Lampert et al., 2009). The test domain consists of 6,180 images of 10 types of animals that do not overlap with the training-domain classes. Semantic similarity of the animals is provided by Rohrbach et al. (2010) (http://www.d2.mpi-inf.mpg.de/nlp4vision), computed from five different linguistic sources: path distance from WordNet, co-occurrence from Wikipedia, Yahoo web search, Yahoo image search, and Flickr image search.
The CIFAR-100 and CIFAR-10 datasets were collected by Krizhevsky (2009). The training domain (CIFAR-100) consists of 60,000 images of 100 types of objects including animals, plants, household objects, and scenery. We use 50,000 and 10,000 images from CIFAR-100 as training and validation sets. The test domain (CIFAR-10) consists of 60,000 images of 10 types of objects similar to those of CIFAR-100, without any overlap with the CIFAR-100 classes. We use 10,000 of these images as test data. To compute features, we use a deep-trained neural network (https://github.com/jetpacapp/DeepBeliefSDK) trained on the ImageNet ILSVRC2010 dataset (http://www.image-net.org/challenges/LSVRC), which consists of 1.2 million images of 1,000 categories. We feed the CIFAR-100 and CIFAR-10 images through the network and use the 4,096-dimensional output of its last hidden layer as features. For semantic similarity of the CIFAR-100 and CIFAR-10 classes, we compute the WordNet path distance, and also use the word2vec tools (Mikolov et al., 2013; https://code.google.com/p/word2vec/) and the GloVe tools (Pennington et al., 2014; http://nlp.stanford.edu/projects/glove/).
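The similarity of two class names under an embedding-based source such as word2vec or GloVe is the cosine of the angle between their vectors. A minimal sketch follows; the three-dimensional toy vectors are illustrative stand-ins for real learned embeddings, not actual word2vec or GloVe output:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for learned embeddings of class names.
vectors = {
    "cat":   [0.9, 0.1, 0.3],
    "dog":   [0.8, 0.2, 0.4],
    "truck": [0.1, 0.9, 0.0],
}

sim_cat_dog = cosine_similarity(vectors["cat"], vectors["dog"])
sim_cat_truck = cosine_similarity(vectors["cat"], vectors["truck"])
```

With real embeddings, the numerical scale of these scores differs across corpora and methods, which is exactly why the scores are later converted to rankings rather than averaged directly.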
In addition to linguistic sources, we use Amazon Mechanical Turk to collect word-similarity data by crowd-sourcing. Each participant of the survey is shown a word from the test-domain classes and is asked to sort 10 words from the training domain according to their perceived similarity to the given word. The initial order of the 10 words is randomized for each survey. We pre-select the 10 closest words for each test-domain word, because preliminary trials showed that ordering all words (40 for Animals and 100 for CIFAR) is too demanding and time-consuming for participants. For Animals, the 10 closest words are selected based on the average ranking of the training-domain words w.r.t. the test-domain word across the five linguistic sources. For CIFAR, we use the WordNet path distance. Fifty surveys are collected for each test-domain class.
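The average-ranking pre-selection used for Animals can be sketched as follows; this is a minimal illustration with hypothetical class names and two sources in place of the five linguistic sources:

```python
def closest_by_average_rank(rankings, k):
    """Select the k training-domain classes with the lowest mean rank.

    rankings: list of lists, one per source, each a full ordering of the
    training-domain class names from most to least similar to a fixed
    test-domain class. Ranks (0 = most similar) are averaged across
    sources and classes are returned sorted by that mean rank.
    """
    classes = rankings[0]
    mean_rank = {
        c: sum(r.index(c) for r in rankings) / len(rankings)
        for c in classes
    }
    return sorted(classes, key=mean_rank.get)[:k]

# Two hypothetical sources ranking four training classes w.r.t. "zebra".
source_a = ["horse", "deer", "dog", "truck"]
source_b = ["deer", "horse", "dog", "truck"]
top2 = closest_by_average_rank([source_a, source_b], k=2)
```

Averaging ranks rather than raw similarity scores keeps the aggregation well-defined even though the underlying scores from different sources are not on comparable scales.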
Appendix B Implementation
The Direct Similarity-based method (DS) (Rohrbach et al., 2010) is implemented as follows. The probability $p(y|x)$ of training-domain class $y$ given image $x$ is modeled by a one-vs-rest binary SVM classifier followed by Platt's probabilistic scaling (Platt, 1999), trained with training-domain feature and label pairs. In testing, $p(y|x)$ is evaluated for a test image, and the prediction of the test-domain class is made by MAP estimation using

$\hat{z} = \arg\max_{z} \; p(z) \sum_{y} s(z,y)\, p(y|x),$

where $s(z,y)$ is the similarity score between test-domain class $z$ and training-domain class $y$. The sum above is limited to the five most similar training-domain classes. We have tested different values of the prior $p(z)$, which did not have visible effects on the result.
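The DS prediction rule can be sketched as follows under a flat prior over test-domain classes; the class names and probabilities are hypothetical, standing in for the outputs of the one-vs-rest SVMs with Platt scaling:

```python
def ds_predict(p_train, similarity, top_k=5):
    """Direct Similarity-based prediction (sketch, flat prior).

    p_train: dict mapping each training-domain class y to p(y|x) for the
             test image (here pre-computed stand-in values).
    similarity: dict mapping each test-domain class z to a dict of
             similarity scores s(z, y) over training-domain classes.
    Each test class z is scored by summing s(z, y) * p(y|x) over its
    top_k most similar training classes; the argmax is returned.
    """
    scores = {}
    for z, sims in similarity.items():
        top = sorted(sims, key=sims.get, reverse=True)[:top_k]
        scores[z] = sum(sims[y] * p_train[y] for y in top)
    return max(scores, key=scores.get)

# Hypothetical classifier outputs and similarity scores.
p_train = {"horse": 0.7, "dog": 0.2, "truck": 0.1}
similarity = {
    "zebra": {"horse": 0.9, "dog": 0.3, "truck": 0.1},
    "bus":   {"truck": 0.95, "dog": 0.1, "horse": 0.05},
}
prediction = ds_predict(p_train, similarity, top_k=2)
```

Restricting the sum to the most similar training classes keeps the score from being dominated by the accumulated probability mass of many weakly related classes.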