Mind the Nuisance: Gaussian Process Classification using Privileged Noise

07/01/2014 · Daniel Hernández-Lobato et al. · University of Sussex

The learning with privileged information setting has recently attracted a lot of attention within the machine learning community, as it allows the integration of additional knowledge into the training process of a classifier, even when this comes in the form of a data modality that is not available at test time. Here, we show that privileged information can naturally be treated as noise in the latent function of a Gaussian Process classifier (GPC). That is, in contrast to the standard GPC setting, the latent function is not just a nuisance but a feature: it becomes a natural measure of confidence about the training data by modulating the slope of the GPC sigmoid likelihood function. Extensive experiments on public datasets show that the proposed GPC method using privileged noise, called GPC+, improves over a standard GPC without privileged knowledge, and also over the current state-of-the-art SVM-based method, SVM+. Moreover, we show that advanced neural networks and deep learning methods can be compressed as privileged information.


1 Introduction

Prior knowledge is a crucial component of any learning system, as without a form of prior knowledge, learning is provably impossible [1]. Many forms of integrating prior knowledge into machine learning algorithms have been developed: as a preference of certain prediction functions over others, as a Bayesian prior over parameters, or as additional information about the samples in the training set used for learning a prediction function. In this work, we rely on the last of these setups, adopting Vapnik and Vashist’s learning using privileged information (LUPI), see e.g. [2, 3]: we want to learn a prediction function, e.g. a classifier, and in addition to the main data modality that is to be used for prediction, the learning system has access to additional information about each training example.

This scenario has recently attracted considerable interest within the machine learning community because it reflects the increasingly relevant situation of learning as a service: an expert trains a machine learning system for a specific task on request from a customer. Clearly, in order to achieve the best result, the expert will use all the information available to him or her, not necessarily just the information that the system itself will have access to during its operation after deployment. Typical scenarios for learning as a service include visual inspection tasks, in which a classifier makes real-time decisions based on the input from its sensor, but at training time additional sensors can be used and the processing time per training example matters less. Similarly, a classifier built into a robot or mobile device operates under strong energy constraints, while at training time energy is less of a problem, so additional data can be generated and exploited. A third, and increasingly important, scenario is when the additional data is confidential, as, e.g., in health care applications. One can expect that a diagnosis system can be improved when more information is available at training time. One might, e.g., perform specific blood tests, genetic sequencing, or drug trials for the subjects that form the training set. However, the same data will not be available at test time, as obtaining it would be impractical, unethical, or outright illegal.

In this work, we propose a novel method for using privileged information based on the framework of Gaussian process classifiers (GPCs). The privileged data enters the model in the form of a latent variable, which modulates the noise term of the GPC. Because the noise is integrated out before obtaining the final predictive model, the privileged information is indeed only required at training time, not at prediction time. The most interesting aspect of the proposed model is that this procedure makes the influence of the privileged information very interpretable: its role is to model the confidence that the Gaussian process has about any training example, which can be directly read off from the slope of the sigmoid-shaped GPC likelihood. Training examples that are easy to classify by means of their privileged data cause a steeper sigmoid, which means the GP trusts the training example and tries to fit it well. Examples that are hard to classify result in a more slowly increasing slope, so the GPC considers the training example less reliable and puts less effort into fitting its label. Our experiments on multiple datasets show that this procedure leads not just to interpretable models, but also to significantly higher classification accuracy.

Related Work The LUPI framework was originally proposed by Vapnik and Vashist [2], inspired by a thought experiment: when training a soft-margin SVM, what if an oracle provided us with the optimal values of the slack variables? As it turns out, this would provably reduce the amount of training data needed, and consequently Vapnik and Vashist proposed the SVM+ classifier, which uses privileged data to predict values for the slack variables; this led to improved performance on several categorization tasks and found applications, e.g., in finance [4]. The setup was subsequently improved by a faster training algorithm [5] and a better theoretical characterization [3], and it was generalized, e.g., to the learning to rank setting [6], clustering [7], and metric learning [8]. Recently, however, it was shown that the main effect of the SVM+ procedure is to assign a data-dependent weight to each training example in the SVM objective [9]. In contrast, GPC+ constitutes the first Bayesian treatment of classification using privileged information. The resulting privileged noise approach is related to input-modulated noise, which is common in regression tasks, and several Bayesian treatments of such heteroscedastic regression using Gaussian processes have been proposed. Since the predictive density and marginal likelihood are no longer analytically tractable, most work on heteroscedastic GPs deals with approximate inference, using techniques such as Markov chain Monte Carlo [10], maximum a posteriori estimation [11], and, recently, a variational Bayes method [12]. To our knowledge, however, there is no prior work on heteroscedastic classification using GPs (we elaborate on the reasons in Section 2.1), and consequently this work develops the first approximate inference scheme, based on expectation propagation, for heteroscedastic noise in the context of classification.

2 GPC+: Gaussian Process Classification with Privileged Noise

For self-consistency of the paper, we first review the GPC model [13] with a particular emphasis on the noise-corrupted latent Gaussian process view. Then, we show how to treat privileged information as heteroscedastic noise in this latent process. The elegant aspect of this view is the intuition it provides as to how the privileged noise distinguishes between easy and hard samples and, in turn, re-calibrates our uncertainty in the original space.

2.1 Gaussian process classifier with noisy latent process

We are given a set of input-output data points or samples $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ with $\mathbf{x}_i \in \mathcal{X}$ and $y_i \in \{-1, +1\}$. Furthermore, we assume that the class label $y_i$ of sample $\mathbf{x}_i$ has been generated as $y_i = 2\,\mathbb{I}[\tilde{f}(\mathbf{x}_i) \geq 0] - 1$, where $\tilde{f}$ is a noisy latent function and $\mathbb{I}[\cdot]$ is the Iverson bracket notation: $\mathbb{I}[P] = 1$ when the condition $P$ is true, and $0$ otherwise. Induced by the label generation process, we adopt the following form of likelihood function for $\tilde{f}$:

$$p(y_i \mid \tilde{f}, \mathbf{x}_i) = \mathbb{I}[\,y_i\, \tilde{f}(\mathbf{x}_i) \geq 0\,], \qquad (1)$$

where the noisy latent function at sample $\mathbf{x}_i$ is given by $\tilde{f}(\mathbf{x}_i) = f(\mathbf{x}_i) + \epsilon_i$, with $f$ being the noise-free latent function. The noise term $\epsilon_i$ is assumed to be independent and normally distributed with zero mean and variance $\sigma^2$, that is $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. To make inference about $\tilde{f}$, we need to specify a prior over this function. We proceed by imposing a zero mean Gaussian process prior [13] on the noise-free latent function, that is $f(\mathbf{x}) \sim \mathcal{GP}\big(0, k(\mathbf{x}, \mathbf{x}')\big)$, where $k(\cdot,\cdot)$ is a positive-definite kernel function [14] that specifies prior properties of $f$. A typical kernel function that allows for non-linear smooth functions is the squared exponential kernel $k(\mathbf{x}, \mathbf{x}') = \theta \exp\!\left(-\tfrac{1}{2l^2}\lVert \mathbf{x} - \mathbf{x}' \rVert^2\right)$. In this kernel function, the parameter $\theta$ controls the amplitude of the function $f$ while $l$ controls its smoothness. Given the prior and the likelihood, Bayes' rule is used to compute the posterior of $f$, that is $p(f \mid \mathcal{D}) = p(\mathcal{D} \mid f)\, p(f) / p(\mathcal{D})$.
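As a point of reference, the following minimal sketch (our illustration, not part of the original formulation; it assumes NumPy, and the parameter names `amplitude` and `lengthscale` stand for $\theta$ and $l$) evaluates the squared exponential kernel matrix that serves as the GP prior covariance.

```python
import numpy as np

def squared_exponential_kernel(X1, X2, amplitude=1.0, lengthscale=1.0):
    """k(x, x') = amplitude * exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    # Pairwise squared Euclidean distances between the rows of X1 and X2.
    sq_dists = (
        np.sum(X1 ** 2, axis=1)[:, None]
        + np.sum(X2 ** 2, axis=1)[None, :]
        - 2.0 * X1 @ X2.T
    )
    return amplitude * np.exp(-0.5 * sq_dists / lengthscale ** 2)

# Prior covariance of the noise-free latent function f on five random inputs.
X = np.random.randn(5, 3)
K = squared_exponential_kernel(X, X, amplitude=1.0, lengthscale=2.0)
```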

We can simplify the above noisy latent process view by integrating out the noise term $\epsilon_i$ and writing down the individual likelihood at sample $\mathbf{x}_i$ in terms of the noise-free latent function $f$ as follows

$$p(y_i \mid f, \mathbf{x}_i) = \int \mathbb{I}[\,y_i (f(\mathbf{x}_i) + \epsilon_i) \geq 0\,]\; \mathcal{N}(\epsilon_i \mid 0, \sigma^2)\, d\epsilon_i = \Phi_{0,\sigma^2}\big(y_i\, f(\mathbf{x}_i)\big), \qquad (2)$$

where $\Phi_{\mu,\sigma^2}(\cdot)$ is a Gaussian cumulative distribution function (CDF) with mean $\mu$ and variance $\sigma^2$. Typically the standard Gaussian CDF is used, that is $\sigma^2 = 1$, in the likelihood of (2). Coupled with a Gaussian process prior on the latent function $f$, this results in the widely adopted noise-free latent Gaussian process view with probit likelihood. The equivalence between a noise-free latent process with probit likelihood and a noisy latent process with step-function likelihood is widely known [13]. It is also widely accepted that the noisy latent function $\tilde{f}$ (or the noise-free latent function $f$) is a nuisance function: we do not observe the value of this function itself, and its sole purpose is a convenient formulation of the classification model [13]. However, in this paper, we show that by using privileged information as the noise term, the latent function plays a crucial role. The latent function with privileged noise adjusts the slope transition of the Gaussian CDF to be faster or slower, corresponding to more or less certainty about the samples in the original space. This is described in detail in the next section.
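The equivalence between the two views can be checked numerically. The sketch below (an illustrative check, assuming NumPy and SciPy) estimates the step-function likelihood of (1) by Monte Carlo integration over the Gaussian noise and compares it to the Gaussian-CDF form of (2).

```python
import numpy as np
from scipy.stats import norm

def step_likelihood_mc(y, f_value, noise_std, n_samples=200000):
    """Monte Carlo estimate of E_eps[ I[y * (f(x) + eps) >= 0] ] with eps ~ N(0, noise_std^2)."""
    eps = np.random.randn(n_samples) * noise_std
    return np.mean(y * (f_value + eps) >= 0)

y, f_value, noise_std = 1, 0.7, 1.0
mc_estimate = step_likelihood_mc(y, f_value, noise_std)   # noisy latent process view
probit_form = norm.cdf(y * f_value / noise_std)            # Gaussian-CDF form of (2)
print(mc_estimate, probit_form)                            # the two values agree closely
```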

2.2 Privileged information is in the Nuisance Function

In the learning using privileged information (LUPI) paradigm [2], besides input data points $\mathbf{x}_i \in \mathcal{X}$ and associated outputs $y_i$, we are given additional information $\mathbf{x}_i^* \in \mathcal{X}^*$ about each training instance $\mathbf{x}_i$. However, this privileged information will not be available for unseen test instances. Our goal is to exploit the additional data $\mathbf{x}_i^*$ to influence our choice of the latent function $f$. This needs to be done while making sure that the function $f$ does not directly use the privileged data as input, as it is simply not available at test time. We achieve this naturally by treating the privileged information as a heteroscedastic (input-dependent) noise in the latent process.

Our classification model with privileged noise is then as follows:

$$p(y_i \mid \tilde{f}, \mathbf{x}_i) = \mathbb{I}[\,y_i\, \tilde{f}(\mathbf{x}_i) \geq 0\,] \quad \text{with} \quad \tilde{f}(\mathbf{x}_i) = f(\mathbf{x}_i) + \epsilon_i \qquad (3)$$
$$\epsilon_i \sim \mathcal{N}\big(0, \exp(g(\mathbf{x}_i^*))\big) \qquad (4)$$
$$f(\mathbf{x}) \sim \mathcal{GP}\big(0, k(\mathbf{x}, \mathbf{x}')\big) \qquad (5)$$
$$g(\mathbf{x}^*) \sim \mathcal{GP}\big(0, k^*(\mathbf{x}^*, \mathbf{x}^{*\prime})\big) \qquad (6)$$

In the above, the function $\exp(\cdot)$ is needed to ensure positivity of the noise variance. The term $k^*(\cdot,\cdot)$ is a positive-definite kernel function that specifies the prior properties of another latent function $g$, which is evaluated in the privileged space $\mathcal{X}^*$. Crucially, the noise term $\epsilon_i$ is now heteroscedastic, that is, it has a different variance $\exp(g(\mathbf{x}_i^*))$ at each input point $\mathbf{x}_i$. This is in contrast to the standard GPC approach discussed in Section 2.1, where the noise term is assumed to be homoscedastic, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Indeed, an input-dependent noise term is very common in tasks with continuous output values (regression tasks), resulting in the so-called heteroscedastic regression models, which have proven to be more flexible in numerous applications, as already touched upon in the related work section. However, to our knowledge, there is no prior work on heteroscedastic classification models. This is not surprising, as the nuisance view of the latent function renders a flexible input-dependent noise pointless.
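To make the generative reading of (3)-(6) concrete, the following sketch (illustrative only; the synthetic inputs, unit-amplitude kernel, and jitter terms are our assumptions) samples the two latent functions from their GP priors and generates labels under privileged, input-dependent noise.

```python
import numpy as np
from scipy.spatial.distance import cdist

def se_kernel(A, B, lengthscale=1.0):
    """Unit-amplitude squared exponential kernel matrix."""
    return np.exp(-0.5 * cdist(A, B, "sqeuclidean") / lengthscale ** 2)

n = 50
X = np.random.randn(n, 2)          # original features, available at train and test time
X_star = np.random.randn(n, 4)     # privileged features, available at training time only

K_f = se_kernel(X, X) + 1e-8 * np.eye(n)            # jitter for numerical stability
K_g = se_kernel(X_star, X_star) + 1e-8 * np.eye(n)

f = np.random.multivariate_normal(np.zeros(n), K_f)   # noise-free latent function, cf. (5)
g = np.random.multivariate_normal(np.zeros(n), K_g)   # privileged latent function, cf. (6)

eps = np.random.randn(n) * np.sqrt(np.exp(g))          # heteroscedastic noise, cf. (4)
y = np.where(f + eps >= 0, 1, -1)                      # label generation, cf. (3)
```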

Figure 1: Effects of privileged noise on the nuisance function. (Left) On synthetic data. Suppose that for an input $\mathbf{x}_i$ the latent function value is $f(\mathbf{x}_i) > 0$. If the associated privileged information $\mathbf{x}_i^*$ deems the $i$-th data point difficult, i.e. $g(\mathbf{x}_i^*)$ is large, then the likelihood $p(y_i = 1 \mid f, g, \mathbf{x}_i, \mathbf{x}_i^*)$ reflects this uncertainty and stays close to $1/2$. In contrast, if the associated privileged information considers the sample easy, i.e. $g(\mathbf{x}_i^*)$ is small, the likelihood is very certain and close to $1$. (Right) On real data taken from our experiments in Sec. 4. The posterior means of the function $g$ (solid) and their standard-deviation confidence intervals (dash-dot) for easy (blue) and difficult (black) instances of the Chimpanzee v. Giant Panda binary task on the Animals with Attributes (AwA) dataset. (Best viewed in color)

In the context of learning with privileged information, however, heteroscedastic classification is actually a very sensible idea. This is best illustrated by investigating the effect of privileged information in the equivalent formulation of a noise-free latent process, i.e., by integrating out the privileged input-dependent noise term:

$$p(y_i \mid f, g, \mathbf{x}_i, \mathbf{x}_i^*) = \int \mathbb{I}[\,y_i (f(\mathbf{x}_i) + \epsilon_i) \geq 0\,]\; \mathcal{N}\big(\epsilon_i \mid 0, \exp(g(\mathbf{x}_i^*))\big)\, d\epsilon_i \qquad (7)$$
$$= \Phi_{0,\exp(g(\mathbf{x}_i^*))}\big(y_i\, f(\mathbf{x}_i)\big) \qquad (8)$$
$$= \Phi_{0,1}\big(y_i\, f(\mathbf{x}_i)\, \exp(-g(\mathbf{x}_i^*)/2)\big). \qquad (9)$$

From (9), it is clear that privileged information adjusts the slope transition of the Gaussian CDF. For difficult (a.k.a. noisy) samples, the latent function $g(\mathbf{x}_i^*)$ will be high, the slope transition will be slower, and thus there is more uncertainty in the likelihood term $p(y_i \mid f, g, \mathbf{x}_i, \mathbf{x}_i^*)$. For easy samples, however, $g(\mathbf{x}_i^*)$ will be low, the slope transition will be faster, and thus there is less uncertainty in the likelihood term. This is illustrated in Fig. 1.
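The slope effect of (9) can be made explicit with a few lines of code (an illustrative sketch assuming NumPy and SciPy; the values of $g$ are arbitrary), evaluating the likelihood $\Phi_{0,1}\big(f(\mathbf{x}_i)\exp(-g(\mathbf{x}_i^*)/2)\big)$ for an easy and a difficult privileged-noise level.

```python
import numpy as np
from scipy.stats import norm

f_grid = np.linspace(-3, 3, 7)          # candidate values of the noise-free latent function
for g in (-2.0, 2.0):                    # low g: easy sample; high g: difficult sample
    likelihood = norm.cdf(f_grid * np.exp(-g / 2.0))   # p(y = +1 | f, g), cf. (9)
    print(f"g = {g:+.1f}:", np.round(likelihood, 3))
# For g = -2 the CDF rises sharply around f = 0 (a confident, steep likelihood);
# for g = +2 it rises slowly and stays near 0.5 (an uncertain, flat likelihood).
```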

2.3 Posterior and Prediction on Test Data

Define $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^\top$ and $\mathbf{g} = (g(\mathbf{x}_1^*), \ldots, g(\mathbf{x}_n^*))^\top$. Given the conditional i.i.d. likelihood $\prod_{i=1}^{n} p(y_i \mid f, g, \mathbf{x}_i, \mathbf{x}_i^*)$, with the per-observation likelihood term given in (9), and the Gaussian process priors on the functions, the posterior for $\mathbf{f}$ and $\mathbf{g}$ is:

$$p(\mathbf{f}, \mathbf{g} \mid \mathcal{D}) = \frac{\prod_{i=1}^{n} p(y_i \mid f, g, \mathbf{x}_i, \mathbf{x}_i^*)\; p(\mathbf{f} \mid \mathbf{X})\; p(\mathbf{g} \mid \mathbf{X}^*)}{p(\mathbf{y} \mid \mathbf{X}, \mathbf{X}^*)}, \qquad (10)$$

where the marginal likelihood $p(\mathbf{y} \mid \mathbf{X}, \mathbf{X}^*)$ can be maximised with respect to a set of hyper-parameter values, such as the amplitude and smoothness parameters of the kernel functions [13]. For a previously unseen test point $\mathbf{x}_\star$, the predictive distribution for its label $y_\star$ is given as:

$$p(y_\star \mid \mathbf{x}_\star, \mathcal{D}) = \int \Phi_{0,1}\big(y_\star\, f(\mathbf{x}_\star)\big)\; p\big(f(\mathbf{x}_\star) \mid \mathbf{f}\big)\; p(\mathbf{f}, \mathbf{g} \mid \mathcal{D})\; df(\mathbf{x}_\star)\, d\mathbf{f}\, d\mathbf{g}, \qquad (11)$$

where $p(f(\mathbf{x}_\star) \mid \mathbf{f})$ is a Gaussian conditional distribution. We note that in (11) we do not consider the privileged information associated with $\mathbf{x}_\star$; the interpretation is that we assume homoscedastic noise at test time. This is a reasonable approach, as there is no additional information for increasing or decreasing our confidence in the newly observed data $\mathbf{x}_\star$. Finally, we predict the label for a test point via Bayesian decision theory: the predicted label is the one with the largest probability.
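When the posterior over $f(\mathbf{x}_\star)$ is approximated by a Gaussian, as in the EP scheme of Section 3, the integral in (11) reduces to the well-known closed form $\Phi_{0,1}\big(m_\star / \sqrt{1 + v_\star}\big)$ for unit test-time noise variance. The sketch below (our illustration; the posterior mean and variance values are placeholders) turns this into a label prediction.

```python
import numpy as np
from scipy.stats import norm

def predict_label(post_mean, post_var, test_noise_var=1.0):
    """Predictive probability of y = +1 and the Bayes-optimal label at a test point,
    given a Gaussian posterior N(post_mean, post_var) over the latent value f(x*)."""
    p_pos = norm.cdf(post_mean / np.sqrt(test_noise_var + post_var))
    return p_pos, (1 if p_pos >= 0.5 else -1)

# Placeholder posterior moments for one test point.
print(predict_label(post_mean=0.8, post_var=0.5))
```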

3 Expectation Propagation with Numerical Quadrature

Unfortunately, as for most interesting Bayesian models, inference in the GPC+ model is very challenging. Already in the homoscedastic case, the predictive density and marginal likelihood are not analytically tractable. In this work, we therefore adapt Minka’s expectation propagation (EP) [15] with numerical quadrature for approximate inference. Please note that EP is the preferred method for approximate inference with GPCs in terms of accuracy and computational cost [16, 17].

Consider the joint distribution of $\mathbf{y}$, $\mathbf{f}$, and $\mathbf{g}$, namely $p(\mathbf{y}, \mathbf{f}, \mathbf{g}) = p(\mathbf{y} \mid \mathbf{f}, \mathbf{g})\, p(\mathbf{f} \mid \mathbf{X})\, p(\mathbf{g} \mid \mathbf{X}^*)$, where $p(\mathbf{f} \mid \mathbf{X})$ and $p(\mathbf{g} \mid \mathbf{X}^*)$ are Gaussian process priors and the likelihood is equal to $p(\mathbf{y} \mid \mathbf{f}, \mathbf{g}) = \prod_{i=1}^{n} p(y_i \mid f_i, g_i)$, with $p(y_i \mid f_i, g_i)$ given by (9). EP approximates each non-normal factor in this joint distribution by an un-normalised bi-variate normal distribution of $f_i$ and $g_i$ (we assume independence between $f_i$ and $g_i$). The only non-normal factors correspond to those of the likelihood. These are approximated as:

$$\tilde{t}_i(f_i, g_i) = \tilde{z}_i\; \mathcal{N}(f_i \mid \tilde{m}_i, \tilde{v}_i)\; \mathcal{N}(g_i \mid \tilde{\mu}_i, \tilde{\nu}_i), \qquad (12)$$

where the parameters with the tilde super-script are to be found by EP. The posterior approximation $q(\mathbf{f}, \mathbf{g})$ computed by EP results from normalising, with respect to $\mathbf{f}$ and $\mathbf{g}$, the EP approximate joint distribution. This distribution is obtained by replacing each likelihood factor $p(y_i \mid f_i, g_i)$ by the corresponding approximate factor $\tilde{t}_i(f_i, g_i)$. In particular,

$$q(\mathbf{f}, \mathbf{g}) = \frac{1}{Z_q} \prod_{i=1}^{n} \tilde{t}_i(f_i, g_i)\; p(\mathbf{f} \mid \mathbf{X})\; p(\mathbf{g} \mid \mathbf{X}^*), \qquad (13)$$

where $Z_q$ is a normalisation constant that approximates the model evidence $p(\mathbf{y} \mid \mathbf{X}, \mathbf{X}^*)$. The normal distribution belongs to the exponential family of probability distributions and is closed under product and division. It is hence possible to show that $q(\mathbf{f}, \mathbf{g})$ is the product of two multi-variate normals [18]: the first normal approximates the posterior for $\mathbf{f}$ and the second the posterior for $\mathbf{g}$.

EP tries to fix the parameters of each approximate factor $\tilde{t}_i$ so that it is similar to the exact factor $p(y_i \mid f_i, g_i)$ in regions of high posterior probability [15]. For this, EP iteratively updates each $\tilde{t}_i$ until convergence so as to minimise $\mathrm{KL}\big(Z_i^{-1}\, p(y_i \mid f_i, g_i)\, q^{\setminus i}(f_i, g_i) \,\big\|\, q(f_i, g_i)\big)$, where $q^{\setminus i}$ is a normal distribution proportional to $q / \tilde{t}_i$ with all variables other than $f_i$ and $g_i$ marginalised out, $Z_i$ is simply a normalisation constant, and $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler divergence between probability distributions. Assume $q^{\text{new}}$ is the distribution minimising the previous divergence. Then $\tilde{t}_i \propto q^{\text{new}} / q^{\setminus i}$, and the parameter $\tilde{z}_i$ of $\tilde{t}_i$ is fixed to guarantee that $\tilde{t}_i\, q^{\setminus i}$ integrates to the same value as the exact factor $p(y_i \mid f_i, g_i)\, q^{\setminus i}$ with respect to $f_i$ and $g_i$. The minimisation of the KL divergence involves matching expected sufficient statistics (mean and variance) between $q^{\text{new}}$ and $Z_i^{-1}\, p(y_i \mid f_i, g_i)\, q^{\setminus i}$. These expectations can be obtained from the derivatives of $\log Z_i$ with respect to the (natural) parameters of $q^{\setminus i}$ [18]. Unfortunately, the computation of $Z_i$ in closed form is intractable. We show here that it can be approximated by a one-dimensional quadrature. Denote by $m_f$, $v_f$, $m_g$, and $v_g$ the means and variances of $q^{\setminus i}$ for $f_i$ and $g_i$, respectively. Then,

$$Z_i = \int \Phi_{0,1}\big(y_i\, f_i\, \exp(-g_i/2)\big)\, \mathcal{N}(f_i \mid m_f, v_f)\, \mathcal{N}(g_i \mid m_g, v_g)\, df_i\, dg_i = \int \Phi_{0,1}\!\left(\frac{y_i\, m_f}{\sqrt{v_f + \exp(g_i)}}\right) \mathcal{N}(g_i \mid m_g, v_g)\, dg_i. \qquad (14)$$

Thus, the EP algorithm only requires five quadratures to update each $\tilde{t}_i$: a first one to compute $Z_i$ and four extra ones to compute its derivatives with respect to $m_f$, $v_f$, $m_g$, and $v_g$. After convergence, $q(\mathbf{f}, \mathbf{g})$ can be used to approximate predictive distributions, and the normalisation constant $Z_q$ can be maximised to find good values for the model's hyper-parameters. In particular, it is possible to compute the gradient of $Z_q$ with respect to the parameters of the Gaussian process priors for $f$ and $g$ [18].
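As an illustration of the quadrature involved (a sketch under our own variable naming, not the authors' implementation), $Z_i$ can be computed by first integrating $f_i$ analytically, which leaves a one-dimensional integral over $g_i$ that Gauss-Hermite quadrature handles well:

```python
import numpy as np
from scipy.stats import norm

def log_z_i(y_i, m_f, v_f, m_g, v_g, n_nodes=30):
    """Gauss-Hermite estimate of log Z_i, where
    Z_i = E_{f ~ N(m_f, v_f), g ~ N(m_g, v_g)}[ Phi(y_i * f * exp(-g / 2)) ].
    The expectation over f is available in closed form, leaving a 1-D integral over g."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    g = m_g + np.sqrt(2.0 * v_g) * nodes                      # change of variables for N(m_g, v_g)
    inner = norm.cdf(y_i * m_f / np.sqrt(np.exp(g) + v_f))    # E_f[ Phi(y_i f exp(-g/2)) ]
    return np.log(np.sum(weights * inner) / np.sqrt(np.pi))

print(log_z_i(y_i=1, m_f=0.5, v_f=1.0, m_g=0.0, v_g=0.4))
```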

4 Experiments

Our intention here is to investigate the performance of the GP with privileged noise approach. To this aim, we considered three types of binary classification tasks corresponding to different privileged information using two real-world datasets: Attribute Discovery and Animals with Attributes. We detail those experiments in turn in the following sections.

Methods We compared our proposed GPC+ method with the well-established LUPI method based on SVM, SVM+ [5]. As a reference, we also fit standard GP and SVM classifiers that learn on the original space only (GPC and SVM baselines). For all four methods, we used a squared exponential kernel with an amplitude parameter and a smoothness parameter. For simplicity, we fixed the amplitude parameter to the same value in all cases. For GPC and GPC+, we used type-II maximum likelihood for estimating the hyper-parameters: there are two hyper-parameters in GPC (the smoothness parameter and the noise variance $\sigma^2$) and also two in GPC+ (the smoothness parameters of the kernels $k$ and $k^*$). For SVM and SVM+, we used cross-validation to set the hyper-parameters. SVM has two knobs (smoothness and regularisation) and SVM+ has four knobs (two smoothness and two regularisation parameters). Since a grid search via cross-validation turned out to be too expensive for finding the best SVM+ parameters, we instead used the performance on a separate validation set to guide the search. None of the other three methods used this separate validation set, which means that we give SVM+ a competitive advantage over the other methods.

Evaluation metric To evaluate the performance of the methods, we used the classification error on an independent test set. We performed 100 repeats of all experiments to obtain reliable statistics and report the mean and the standard error.

4.1 Attribute Discovery Dataset [19]

The dataset was collected from a shopping website that aggregates product data from a variety of e-commerce sources and includes both images and associated textual descriptions. The images and associated texts are grouped into four broad shopping categories: bags, earrings, ties, and shoes. We used a subset of this dataset and generated a binary classification task for each pair of classes, splitting the samples into a training set, a validation set, and a test set used to measure predictive performance.

Neural networks on texts as privileged information We used images as the original domain and texts as the privileged domain. This setting was also explored in [6]; however, we used a different dataset, as the textual descriptions of the images used in [6] are sparse and contain duplicates, and we extracted more advanced text features instead of simple term frequency (TF) features. As image representation, we extracted SURF descriptors [20] and constructed a codebook of visual words using k-means clustering. As text representation, we extracted a continuous word-vector representation using a neural network skip-gram architecture [21] (https://code.google.com/p/word2vec/). To convert this word representation into a fixed-length sentence representation, we constructed a codebook of word vectors, again using k-means clustering. We note that a more elaborate approach to transform word into sentence or document features has recently been developed [22], and we plan to explore this in the future. We performed PCA for dimensionality reduction in the original and privileged domains, keeping only the top principal components. Finally, we standardised the data so that each feature has zero mean and unit standard deviation.
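A minimal sketch of this preprocessing pipeline is given below (illustrative only, assuming scikit-learn; the function name and the codebook and PCA sizes are placeholders, and in practice the codebook, PCA, and scaler would be fit on the training split only).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def codebook_features(descriptor_sets, n_words=100, n_components=50):
    """descriptor_sets: list of (n_i, d) arrays of local descriptors
    (e.g. SURF descriptors per image, or word vectors per text).
    Returns PCA-reduced, standardised bag-of-words histograms, one row per item."""
    codebook = KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(descriptor_sets))
    histograms = np.array([
        np.bincount(codebook.predict(d), minlength=n_words).astype(float) / len(d)
        for d in descriptor_sets
    ])
    reduced = PCA(n_components=n_components).fit_transform(histograms)
    return StandardScaler().fit_transform(reduced)   # zero mean, unit standard deviation
```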

The experimental results are summarised in Tab. 1. On average over the six tasks, SVM with hinge loss outperforms GPC with probit likelihood. However, GPC+ significantly improves over GPC, providing the best results on average. This clearly shows that GPC+ is able to utilise the neural-network textual representation as privileged information. In contrast, SVM+ produced virtually the same results as SVM. We suspect this is because SVM already shows strong performance on the original image space, coupled with the difficulty of finding the best values of the four SVM+ hyper-parameters. Keep in mind that, for SVM+, we discretised the hyper-parameter search space into a finite grid of combinations and used a separate validation set to guide the search.

GPC GPC+ (Ours) SVM SVM+
bags v. earrings 9.79 ± 0.12 9.50 ± 0.11 9.89 ± 0.14 9.89 ± 0.13
bags v. ties 10.36 ± 0.16 10.03 ± 0.15 9.44 ± 0.16 9.47 ± 0.13
bags v. shoes 9.66 ± 0.13 9.22 ± 0.11 9.31 ± 0.12 9.29 ± 0.14
earrings v. ties 10.84 ± 0.14 10.56 ± 0.13 11.15 ± 0.16 11.11 ± 0.16
earrings v. shoes 7.74 ± 0.11 7.33 ± 0.10 7.75 ± 0.13 7.63 ± 0.13
ties v. shoes 15.51 ± 0.16 15.54 ± 0.16 14.90 ± 0.21 15.10 ± 0.18
average error 10.65 ± 0.11 10.36 ± 0.12 10.41 ± 0.11 10.42 ± 0.11
average ranking 3.0 1.8 2.7 2.5
Table 1: Error rate performance (the lower the better) on the Attribute Discovery dataset over 100 repeated experiments. We used images as the original domain and neural networks word-vector representation on texts as the privileged domain. The best method for each binary task is highlighted in boldface. An average rank equal to one means that the corresponding method has the smallest error on the tasks.
Figure 2: Pairwise comparison of the proposed GPC+ method and the main baselines via the relative difference in error rate (top: GPC+ versus GPC, bottom: GPC+ versus SVM+). The length of each bar corresponds to the relative difference in error rate on one binary task. Full results of the average error rate performance on the AwA dataset across all 45 tasks over 100 repeated experiments are in the appendix. (Best viewed in color)

4.2 Animals with Attributes (AwA) Dataset [23]

The dataset was collected by querying image search engines for each of the animal categories, which have complementary high-level descriptions of semantic properties such as shape, colour, or habitat, among others. The semantic attributes per animal class were retrieved from a prior psychological study. We focused on the 10 categories corresponding to the test set of this dataset, for which predicted attributes are provided based on the probabilistic DAP model [23]. The classes are: chimpanzee, giant panda, leopard, persian cat, pig, hippopotamus, humpback whale, raccoon, rat, and seal. As in Section 4.1, and also following [6], we generated a binary classification task for each pair of classes (45 tasks in total), splitting the samples into training, validation, and test sets.

Neural networks on images as privileged information Deep learning methods have gained increased attention within the machine learning and computer vision communities over recent years, due to their capability to extract informative features and deliver strong predictive performance in many classification tasks. As such, we are interested in exploring the use of deep-learning-based features as privileged information, so that their predictive power can be exploited even though we do not have access to them at prediction time. We used the standard SURF features [20] with a visual-word codebook as the original domain, and the recently proposed DeCAF features [24], extracted from the activations of a deep convolutional network trained in a fully supervised fashion, as the privileged domain. All features are provided with the AwA dataset (http://attributes.kyb.tuebingen.mpg.de). We again performed PCA for dimensionality reduction in the original and privileged domains, kept only the top principal components, and standardised the data.

Figure 3: Average rank (the lower the better) of the four methods and the critical distance for significant differences [25] on the AwA dataset, with DeCAF features (left) and attributes (right) as privileged information. An average rank equal to one means that the particular method has the smallest error on the tasks. Whenever the average ranks differ by more than the critical distance, there is statistical evidence to support a difference in the average ranks and thus in the performance. We also link two methods with a solid line if they are not statistically different from each other. With DeCAF features, there is statistical evidence that GPC+ performs best among the four methods considered, while with attributes GPC+ still performs best, but there is not enough evidence to reject that GPC+ performs comparably to GPC.

Attributes as privileged information Following the experimental setting of [6], we also used images as the original domain and attributes as the privileged domain. Images were represented by visual words based on SURF descriptors, and attributes were in the form of 85-dimensional predicted attribute scores based on probabilistic binary classifiers [23]. This time, we performed PCA and kept the top principal components in the original domain only. Finally, we again standardised the data.

The results are summarised in Fig. 2 in terms of pairwise comparisons over the 45 binary tasks between GPC+ and the main baselines, GPC and SVM+. The full results with the error of each method (GPC, GPC+, SVM, and SVM+) on each problem are relegated to the appendix. In contrast to the results on the Attribute Discovery dataset, on the AwA dataset it is clear that GPC outperforms SVM in almost all of the binary classification tasks (see the appendix); the average error of GPC over the 4500 experiments (45 tasks, 100 repeats per task) is much lower than that of SVM. On the AwA dataset, SVM+ can take advantage of privileged information, be it deep convolutional DeCAF features or semantic attributes, and shows a significant performance improvement over SVM. However, GPC+ still shows the best overall results and further improves the already strong performance of GPC. As illustrated in Fig. 1 (right), the privileged information modulates the slope of the sigmoid likelihood function differently for easy and difficult examples: easy examples gain slope, and hence importance, whereas difficult ones lose importance in the classification. We analysed our experimental results using the multiple-dataset statistical comparison method described in [25] (we are not able to use this method for the attribute discovery results in Tab. 1, as the number of methods being compared (4) is almost equal to the number of tasks (6)). The statistical tests are summarised in Fig. 3. When DeCAF is used as privileged information, there is statistical evidence that GPC+ performs best among the four methods, while in the semantic-attributes setting GPC+ still performs best but there is not enough evidence to reject that GPC+ performs comparably to GPC.
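The comparison follows Demšar's procedure [25]: rank the four methods on every task, average the ranks, and compare rank differences against the Nemenyi critical distance. A minimal sketch is given below (assuming NumPy and SciPy; the value 2.569 is, to the best of our knowledge, the alpha = 0.05 critical value for four methods from the table in [25]).

```python
import numpy as np
from scipy.stats import rankdata

def average_ranks_and_cd(errors, q_alpha=2.569):
    """errors: (n_tasks, n_methods) array of error rates.
    Returns the average rank of each method and the Nemenyi critical distance."""
    ranks = np.apply_along_axis(rankdata, 1, errors)   # rank 1 = lowest error on a task
    avg_ranks = ranks.mean(axis=0)
    n_tasks, n_methods = errors.shape
    cd = q_alpha * np.sqrt(n_methods * (n_methods + 1) / (6.0 * n_tasks))
    return avg_ranks, cd

# Two methods differ significantly if their average ranks differ by more than cd.
```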

5 Conclusions

We presented the first treatment of the learning with privileged information setting in the Gaussian process classification (GPC) framework, called GPC+. The privileged information enters the latent noise layer of GPC+, resulting in a data-dependent modulation of the sigmoid slope of the GP likelihood. As our experimental results demonstrate, this is an effective way to make use of privileged information, which manifests itself in significantly improved classification accuracies. To our knowledge, this is also the first time that a heteroscedastic noise term has been used to improve GPC. Furthermore, we showed that recent advances in continuous word-vector neural network representations [22] and deep convolutional networks for image representations [24] can serve as privileged information. For future work, we plan to extend GPC+ to the multiclass setting and to speed up computation by devising a quadrature-free expectation propagation method, similar to [26].

References

  • [1] D.H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural computation, 8(7):1341–1390, 1996.
  • [2] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5–6):544 – 557, 2009.
  • [3] D. Pechyony and V. Vapnik. On the theory of learning with privileged information. In Advances in Neural Information Processing Systems (NIPS), 2010.
  • [4] B. Ribeiro, C. Silva, A. Vieira, A. Gaspar-Cunha, and J.C. das Neves. Financial distress model prediction using SVM+. In International Joint Conference on Neural Networks (IJCNN), 2010.
  • [5] D. Pechyony and V. Vapnik. Fast optimization algorithms for solving SVM+. In Statistical Learning and Data Science, 2011.
  • [6] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Learning to rank using privileged information. In International Conference on Computer Vision (ICCV), 2013.
  • [7] J. Feyereisl and U. Aickelin. Privileged information for data clustering. Information Sciences, 194:4–23, 2012.
  • [8] S. Fouad, P. Tino, S. Raychaudhury, and P. Schneider. Incorporating privileged information through metric learning. IEEE Transactions on Neural Networks and Learning Systems, 2013.
  • [9] M. Lapin, M. Hein, and B. Schiele. Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53:95–108, 2014.
  • [10] P. W. Goldberg, C. K. I. Williams, and C. M. Bishop. Regression with input-dependent noise: A gaussian process treatment. In Advances in Neural Information Processing Systems (NIPS), 1998.
  • [11] N. Quadrianto, K. Kersting, M. D. Reid, T. S. Caetano, and W. L. Buntine. Kernel conditional quantile estimation via reduction revisited. In International Conference on Data Mining (ICDM), 2009.
  • [12] M. Lázaro-Gredilla and M. K. Titsias. Variational heteroscedastic gaussian process regression. In International Conference on Machine Learning (ICML), 2011.
  • [13] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.
  • [14] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
  • [15] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.
  • [16] H. Nickisch and C. E. Rasmussen. Approximations for Binary Gaussian Process Classification. Journal of Machine Learning Research, 9:2035–2078, 2008.
  • [17] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
  • [18] M. Seeger. Expectation propagation for exponential families. Technical report, Department of EECS, University of California, Berkeley, 2006.
  • [19] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. In European Conference on Computer Vision (ECCV), 2010.
  • [20] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer Vision and Image Understanding, 110(3):346–359, June 2008.
  • [21] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), 2013.
  • [22] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning (ICML), 2014.
  • [23] C. H. Lampert, H. Nickisch, and S. Harmeling. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
  • [24] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
  • [25] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
  • [26] J. Riihimäki, P. Jylänki, and A. Vehtari. Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood. Journal of Machine Learning Research, 14:75–109, 2013.

Appendix

Error rate performance on the AwA dataset over 100 repeated experiments, with SURF image features as the original domain and DeCAF deep neural network image features as the privileged domain. The best method for each binary task is highlighted in boldface.

GPC GPC+ (Ours) SVM SVM+
Chimp. v. Panda
Chimp. v. Leopard
Chimp. v. Cat
Chimp. v. Pig
Chimp. v. Hippo.
Chimp. v. Whale
Chimp. v. Raccoon
Chimp. v. Rat
Chimp. v. Seal
Panda v. Leopard
Panda v. Cat
Panda v. Pig
Panda v. Hippo.
Panda v. Whale
Panda v. Raccoon
Panda v. Rat
Panda v. Seal
Leopard v. Cat
Leopard v. Pig
Leopard v. Hippo.
Leopard v. Whale
Leopard v. Raccoon
Leopard v. Rat
Leopard v. Seal
Cat v. Pig
Cat v. Hippo.
Cat v. Whale
Cat v. Raccoon
Cat v. Rat
Cat v. Seal
Pig v. Hippo.
Pig v. Whale
Pig v. Raccoon
Pig v. Rat
Pig v. Seal
Hippo. v. Whale
Hippo. v. Raccoon
Hippo. v. Rat
Hippo. v. Seal
Whale v. Raccoon
Whale v. Rat
Whale v. Seal
Raccoon v. Rat
Raccoon v. Seal
Rat v. Seal
average ranking 2.09 1.40 3.71 2.80
average error 17.60 ± 0.10 17.47 ± 0.10 18.21 ± 0.11 17.80 ± 0.10

Error rate performance on the AwA dataset over 100 repeated experiments, with SURF image features as the original domain and attributes as the privileged domain. The best method for each binary task is highlighted in boldface.

GPC GPC+ (Ours) SVM SVM+
Chimp. v. Panda
Chimp. v. Leopard
Chimp. v. Cat
Chimp. v. Pig
Chimp. v. Hippo.
Chimp. v. Whale
Chimp. v. Raccoon
Chimp. v. Rat
Chimp. v. Seal
Panda v. Leopard
Panda v. Cat
Panda v. Pig
Panda v. Hippo.
Panda v. Whale
Panda v. Raccoon
Panda v. Rat
Panda v. Seal
Leopard v. Cat
Leopard v. Pig
Leopard v. Hippo.
Leopard v. Whale
Leopard v. Raccoon
Leopard v. Rat
Leopard v. Seal
Cat v. Pig
Cat v. Hippo.
Cat v. Whale
Cat v. Raccoon
Cat v. Rat
Cat v. Seal
Pig v. Hippo.
Pig v. Whale
Pig v. Raccoon
Pig v. Rat
Pig v. Seal
Hippo. v. Whale
Hippo. v. Raccoon
Hippo. v. Rat
Hippo. v. Seal
Whale v. Raccoon
Whale v. Rat
Whale v. Seal
Raccoon v. Rat
Raccoon v. Seal
Rat v. Seal
average ranking 1.98 1.40 3.44 3.18
average error 17.60 ± 0.10 17.48 ± 0.10 18.21 ± 0.11 18.06 ± 0.11