1 Introduction
Prior knowledge is a crucial component of any learning system, as without a form of prior knowledge, learning is provably impossible [1]. Many forms of integrating prior knowledge into machine learning algorithms have been developed: as a preference of certain prediction functions over others, as a Bayesian prior over parameters, or as additional information about the samples in the training set used for learning a prediction function. In this work, we rely on the last of these setups, adopting Vapnik and Vashist’s learning using privileged information (LUPI), see e.g. [2, 3]: we want to learn a prediction function, e.g. a classifier, and in addition to the main data modality that is to be used for prediction, the learning system has access to additional information about each training example.
This scenario has recently attracted considerable interest within the machine learning community, because it reflects well the increasingly relevant situation of learning as a service: an expert trains a machine learning system for a specific task on request from a customer. Clearly, in order to achieve the best result, the expert will use all the information available to him or her, not necessarily just the information that the system itself will have access to during its operation after deployment. Typical scenarios for learning as a service include visual inspection tasks, in which a classifier makes real-time decisions based on the input from its sensor; at training time, however, additional sensors can be used, and the processing time per training example plays less of a role. Similarly, a classifier built into a robot or mobile device operates under strong energy constraints, while at training time energy is less of a problem, so additional data can be generated and made use of. A third, and increasingly important, scenario is when the additional data is confidential, as, e.g., in health care applications. One can expect that a diagnosis system can be improved when more information is available at training time. One might, e.g., perform specific blood tests, genetic sequencing, or drug trials for the subjects that form the training set. However, the same data will not be available at test time, as obtaining it would be impractical, unethical, or outright illegal.
In this work, we propose a novel method for using privileged information based on the framework of Gaussian process classifiers (GPCs). The privileged data enters the model in the form of a latent variable, which modulates the noise term of the GPC. Because the noise is integrated out before obtaining the final predictive model, the privileged information is indeed only required at training time, not at prediction time. The most interesting aspect of the proposed model is that, by this procedure, the influence of the privileged information becomes very interpretable: its role is to model the confidence that the Gaussian process has about any training example, which can be directly read off from the slope of the sigmoid-shaped GPC likelihood. Training examples that are easy to classify by means of their privileged data cause a faster-increasing sigmoid, which means the GP trusts the training example and tries to fit it well. Examples that are hard to classify result in a slowly increasing slope, so the GPC considers the training example less reliable and puts less effort into fitting its label well. Our experiments on multiple datasets show that this procedure leads not just to interpretable models, but also to significantly higher classification accuracy.
Related Work The LUPI framework was originally proposed by Vapnik and Vashist [2], inspired by a thought experiment: when training a soft-margin SVM, what if an oracle provided us with the optimal values of the slack variables? As it turns out, this would provably reduce the amount of training data needed, and consequently Vapnik and Vashist proposed the SVM+ classifier, which uses privileged data to predict values for the slack variables; this led to improved performance on several categorization tasks and found applications, e.g., in finance [4]. This setup was subsequently improved by a faster training algorithm [5] and a better theoretical characterization [3], and it was generalized, e.g., to the learning-to-rank setting [6], clustering [7], and metric learning [8]. Recently, however, it was shown that the main effect of the SVM+ procedure is to assign a data-dependent weight to each training example in the SVM objective [9]. In contrast, GPC+ constitutes the first Bayesian treatment of classification using privileged information. Indeed, the resulting privileged noise approach is related to the input-modulated noise commonly used in regression tasks, and several Bayesian treatments of such heteroscedastic regression using Gaussian processes have been proposed. Since the predictive density and marginal likelihood are no longer analytically tractable, most works on heteroscedastic GPs deal with approximate inference, using techniques such as Markov chain Monte Carlo [10], maximum a posteriori estimation [11], and, more recently, a variational Bayes method [12]. To our knowledge, however, there is no prior work on heteroscedastic classification using GPs (we will elaborate on the reasons in Section 2.1), and consequently this work develops the first approximate inference method based on expectation propagation for heteroscedastic noise in the context of classification.

2 GPC+: Gaussian Process Classification with Privileged Noise
For self-consistency of the paper, we first review the GPC model [13] with a particular emphasis on the noise-corrupted latent Gaussian process view. Then, we show how to treat privileged information as heteroscedastic noise in this latent process. The elegant aspect of this view is the intuition of how the privileged noise is able to distinguish between easy and hard samples and, in turn, to recalibrate our uncertainty in the original space.
2.1 Gaussian process classifier with noisy latent process
We are given a set of input-output data points or samples $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n} \subset \mathbb{R}^d \times \{-1, +1\}$. Furthermore, we assume that the class label $y_i$ of sample $x_i$ has been generated as $y_i = \mathbb{I}[\tilde{f}(x_i) \geq 0] - \mathbb{I}[\tilde{f}(x_i) < 0]$, where $\tilde{f}$ is a noisy latent function and $\mathbb{I}[\cdot]$ is the Iverson bracket: $\mathbb{I}[P] = 1$ when the condition $P$ is true, and $\mathbb{I}[P] = 0$ otherwise. Induced by the label generation process, we adopt the following form of likelihood function for $\tilde{f}$:
$$p(y_i \mid \tilde{f}(x_i)) = \mathbb{I}[y_i \tilde{f}(x_i) \geq 0] \qquad (1)$$
where the noisy latent function at sample $x_i$ is given by $\tilde{f}(x_i) = f(x_i) + \epsilon_i$, with $f$ being the noise-free latent function. The noise term $\epsilon_i$ is assumed to be independent and normally distributed with zero mean and variance $\sigma^2$, that is $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. To make inference about $\tilde{f}$, we need to specify a prior over this function. We proceed by imposing a zero-mean Gaussian process prior [13] on the noise-free latent function, that is $f(x) \sim \mathcal{GP}(0, k(x, x'))$, where $k(\cdot, \cdot)$ is a positive-definite kernel function [14] that specifies prior properties of $f$. A typical kernel function that allows for smooth non-linear functions is the squared exponential kernel $k(x, x') = \theta \exp(-\frac{1}{2\ell^2}\|x - x'\|^2)$. In this kernel function, the parameter $\theta$ controls the amplitude of the function $f$ while $\ell$ controls its smoothness. Given the prior and the likelihood, Bayes' rule is used to compute the posterior of $f$, that is $p(f \mid \mathcal{D}) = p(\mathcal{D} \mid f)\, p(f) / p(\mathcal{D})$. We can simplify the above noisy latent process view by integrating out the noise term and writing the individual likelihood at sample $x_i$ in terms of the noise-free latent function as follows:
$$p(y_i \mid f(x_i)) = \int \mathbb{I}[y_i (f(x_i) + \epsilon_i) \geq 0]\, \mathcal{N}(\epsilon_i \mid 0, \sigma^2)\, d\epsilon_i = \Phi_{0, \sigma^2}(y_i f(x_i)) \qquad (2)$$
where $\Phi_{\mu, \sigma^2}$ is a Gaussian cumulative distribution function (CDF) with mean $\mu$ and variance $\sigma^2$. Typically the standard Gaussian CDF $\Phi_{0,1}$ is used in the likelihood of (2), that is, the noise variance is fixed to $\sigma^2 = 1$. Coupled with a Gaussian process prior on the latent function $f$, this results in the widely adopted noise-free latent Gaussian process view with probit likelihood. The equivalence between a noise-free latent process with probit likelihood and a noisy latent process with step-function likelihood is widely known [13]. It is also widely accepted that the noisy latent function $\tilde{f}$ (or the noise-free latent function $f$) is a nuisance function: we do not observe its value, and its sole purpose is a convenient formulation of the classification model [13]. However, in this paper, we show that by using privileged information in the noise term, the latent function plays a crucial role. The latent function with privileged noise adjusts the slope transition of the Gaussian CDF to be faster or slower, corresponding to more or less certainty about the samples in the original space. This is described in detail in the next section.

2.2 Privileged Information is in the Nuisance Function
In the learning using privileged information (LUPI) paradigm [2], besides the input data points $x_i \in \mathcal{X}$ and associated outputs $y_i$, we are given additional information $z_i \in \mathcal{Z}$ about each training instance $x_i$. However, this privileged information will not be available for unseen test instances. Our goal is to exploit the additional data to influence our choice of the latent function $f$. This needs to be done while making sure that the function $f$ does not directly use the privileged data as input, as it is simply not available at test time. We achieve this naturally by treating the privileged information as a heteroscedastic (input-dependent) noise in the latent process.
Our classification model with privileged noise is then as follows:
$$y_i = \operatorname{sign}(f(x_i) + \epsilon_i) \qquad (3)$$
$$f(x) \sim \mathcal{GP}(0, k(x, x')) \qquad (4)$$
$$\epsilon_i \sim \mathcal{N}(0, \exp(g(z_i))) \qquad (5)$$
$$g(z) \sim \mathcal{GP}(0, \tilde{k}(z, z')) \qquad (6)$$
In the above, the function $\exp(\cdot)$ is needed to ensure positivity of the noise variance. The term $\tilde{k}(\cdot, \cdot)$ is a positive-definite kernel function that specifies the prior properties of another latent function $g$, which is evaluated in the privileged space $\mathcal{Z}$. Crucially, the noise term $\epsilon_i$ is now heteroscedastic, that is, it has a different variance $\exp(g(z_i))$ at each input point $x_i$. This is in contrast to the standard GPC approach discussed in Section 2.1, where the noise term is assumed to be homoscedastic, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Indeed, an input-dependent noise term is very common in tasks with continuous output values (regression tasks), resulting in so-called heteroscedastic regression models, which have proven to be more flexible in numerous applications, as already touched upon in the related work section. However, to our knowledge, there is no prior work on heteroscedastic classification models. This is not surprising, as the nuisance view of the latent function renders a flexible input-dependent noise pointless.
In the context of learning with privileged information, however, heteroscedastic classification is actually a very sensible idea. This is best illustrated by investigating the effect of privileged information in the equivalent formulation of a noise-free latent process, i.e., when one integrates out the privileged input-dependent noise term:
$$p(y_i \mid f(x_i), g(z_i)) = \int \mathbb{I}[y_i (f(x_i) + \epsilon_i) \geq 0]\, \mathcal{N}(\epsilon_i \mid 0, \exp(g(z_i)))\, d\epsilon_i \qquad (7)$$
$$= \Phi_{0, \exp(g(z_i))}(y_i f(x_i)) \qquad (8)$$
$$= \Phi_{0,1}\big(y_i f(x_i) / \sqrt{\exp(g(z_i))}\big) \qquad (9)$$
From (9), it is clear that the privileged information adjusts the slope transition of the Gaussian CDF. For difficult (a.k.a. noisy) samples, the latent function value $g(z_i)$ will be high, the slope transition will be slower, and thus there is more uncertainty in the likelihood term $p(y_i \mid f(x_i), g(z_i))$. For easy samples, however, $g(z_i)$ will be low, the slope transition will be faster, and thus there is less uncertainty in the likelihood term. This is illustrated in Fig. 1.
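To make the effect concrete, here is a minimal numerical sketch (not the paper's implementation; the function names are our own) of the likelihood in (9), evaluated with the standard Gaussian CDF:

```python
import math

def std_norm_cdf(x):
    # Standard Gaussian CDF, written via the error function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def privileged_likelihood(y, f, g):
    # Eq. (9): the privileged-noise value g = g(z_i) rescales the latent
    # value f = f(x_i); a large g flattens the sigmoid (more uncertainty),
    # a small g steepens it (less uncertainty).
    return std_norm_cdf(y * f / math.sqrt(math.exp(g)))

# Same latent value, different privileged-noise levels:
f_val = 1.0
easy = privileged_likelihood(+1, f_val, g=-2.0)  # low noise: steep slope
hard = privileged_likelihood(+1, f_val, g=+2.0)  # high noise: flat slope
# easy is close to 1, while hard stays close to 0.5
```

Setting $g$ to the same constant for every sample recovers the homoscedastic probit likelihood of Section 2.1 with $\sigma^2 = \exp(g)$.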
2.3 Posterior and Prediction on Test Data
Define $\mathbf{f} = (f(x_1), \ldots, f(x_n))^{\top}$ and $\mathbf{g} = (g(z_1), \ldots, g(z_n))^{\top}$. Given the conditional i.i.d. likelihood $\prod_{i=1}^{n} p(y_i \mid f(x_i), g(z_i))$, with the per-observation likelihood term given in (9), and the Gaussian process priors on the functions, the posterior for $\mathbf{f}$ and $\mathbf{g}$ is:
$$p(\mathbf{f}, \mathbf{g} \mid \mathcal{D}) = \frac{\prod_{i=1}^{n} p(y_i \mid f(x_i), g(z_i))\; p(\mathbf{f})\; p(\mathbf{g})}{p(\mathbf{y} \mid \{x_i, z_i\}_{i=1}^{n})} \qquad (10)$$
where the denominator, the marginal likelihood $p(\mathbf{y} \mid \{x_i, z_i\}_{i=1}^{n})$, can be maximised with respect to a set of hyperparameter values, such as the amplitude and smoothness parameters of the kernel functions [13]. For a previously unseen test point $x_\star$, the predictive distribution for its label $y_\star$ is given as:
$$p(y_\star \mid x_\star, \mathcal{D}) = \int p(y_\star \mid f(x_\star))\, p(f(x_\star) \mid \mathbf{f})\, p(\mathbf{f} \mid \mathcal{D})\, df(x_\star)\, d\mathbf{f} \qquad (11)$$
where $p(f(x_\star) \mid \mathbf{f})$ is a Gaussian conditional distribution. We note that in (11) we do not consider the privileged information associated with $x_\star$. The interpretation is that we assume homoscedastic noise at test time. This is a reasonable approach, as there is no additional information for increasing or decreasing our confidence in the newly observed data $x_\star$. Finally, we predict the label for a test point via Bayesian decision theory: the predicted label is the one with the largest predictive probability.
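For intuition, when the posterior over the test latent value is approximated by a Gaussian, the integral of a probit likelihood against that Gaussian has a well-known closed form (see [13], Ch. 3). The sketch below uses our own function names and assumes the posterior mean and variance of the test latent value have already been computed:

```python
import math

def std_norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def predictive_class_prob(mean_f, var_f, noise_var=1.0):
    # Gaussian integral of the probit likelihood: with a posterior
    # N(mean_f, var_f) over the test latent value and homoscedastic
    # Gaussian noise of variance noise_var at test time, the class
    # probability is the probit of the rescaled posterior mean.
    return std_norm_cdf(mean_f / math.sqrt(noise_var + var_f))

p_plus = predictive_class_prob(mean_f=0.8, var_f=0.25)
label = +1 if p_plus >= 0.5 else -1  # Bayesian decision: largest probability
```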
3 Expectation Propagation with Numerical Quadrature
Unfortunately, as for most interesting Bayesian models, inference in the GPC+ model is very challenging. Already in the homoscedastic case, the predictive density and marginal likelihood are not analytically tractable. In this work, we therefore adapt Minka’s expectation propagation (EP) [15] with numerical quadrature for approximate inference. Please note that EP is the preferred method for approximate inference with GPCs in terms of accuracy and computational cost [16, 17].
Consider the joint distribution of $\mathbf{y}$, $\mathbf{f}$, and $\mathbf{g}$. Namely, $p(\mathbf{y}, \mathbf{f}, \mathbf{g}) = p(\mathbf{y} \mid \mathbf{f}, \mathbf{g})\, p(\mathbf{f})\, p(\mathbf{g})$, where $p(\mathbf{f})$ and $p(\mathbf{g})$ are Gaussian process priors and the likelihood is equal to $\prod_{i=1}^{n} \ell_i(f(x_i), g(z_i))$, with $\ell_i(f(x_i), g(z_i)) = p(y_i \mid f(x_i), g(z_i))$ given by (9). EP approximates each non-normal factor in this joint distribution by an unnormalised bivariate normal distribution of $f(x_i)$ and $g(z_i)$ (we assume independence between $f(x_i)$ and $g(z_i)$). The only non-normal factors correspond to those of the likelihood. These are approximated as:

$$\tilde{\ell}_i(f(x_i), g(z_i)) = \tilde{z}_i\, \mathcal{N}(f(x_i) \mid \tilde{m}_i^f, \tilde{v}_i^f)\, \mathcal{N}(g(z_i) \mid \tilde{m}_i^g, \tilde{v}_i^g) \qquad (12)$$
where the parameters with a tilde are to be found by EP. The posterior approximation $q(\mathbf{f}, \mathbf{g})$ computed by EP results from normalising, with respect to $\mathbf{f}$ and $\mathbf{g}$, the EP approximate joint distribution. This distribution is obtained by replacing each likelihood factor $\ell_i$ by the corresponding approximate factor $\tilde{\ell}_i$. In particular,
$$q(\mathbf{f}, \mathbf{g}) = \frac{1}{Z_q} \left[ \prod_{i=1}^{n} \tilde{\ell}_i(f(x_i), g(z_i)) \right] p(\mathbf{f})\, p(\mathbf{g}) \qquad (13)$$
where $Z_q$ is a normalisation constant that approximates the model evidence $p(\mathbf{y})$. The normal distribution belongs to the exponential family of probability distributions and is closed under product and division. It is hence possible to show that $q(\mathbf{f}, \mathbf{g})$ is the product of two multivariate normals [18]. The first normal approximates the posterior for $\mathbf{f}$ and the second the posterior for $\mathbf{g}$.

EP tries to fix the parameters of $\tilde{\ell}_i$ so that it is similar to the exact factor $\ell_i$ in regions of high posterior probability [15]. For this, EP iteratively updates each $\tilde{\ell}_i$ until convergence to minimise $\mathrm{KL}\big(Z_i^{-1} \ell_i\, q^{\setminus i} \,\big\|\, \tilde{\ell}_i\, q^{\setminus i}\big)$, where $q^{\setminus i}$ is a normal distribution proportional to $q / \tilde{\ell}_i$ with all variables different from $f(x_i)$ and $g(z_i)$ marginalised out, $Z_i$ is simply a normalisation constant, and $\mathrm{KL}$ denotes the Kullback–Leibler divergence between probability distributions. Assume $q^{\text{new}}$ is the distribution minimising the previous divergence. Then $\tilde{\ell}_i \propto q^{\text{new}} / q^{\setminus i}$, and the parameter $\tilde{z}_i$ of $\tilde{\ell}_i$ is fixed to guarantee that $\tilde{\ell}_i\, q^{\setminus i}$ integrates to the same value as the exact factor $\ell_i\, q^{\setminus i}$ with respect to $f(x_i)$ and $g(z_i)$. The minimisation of the KL divergence involves matching expected sufficient statistics (mean and variance) between $q^{\text{new}}$ and $Z_i^{-1} \ell_i\, q^{\setminus i}$. These expectations can be obtained from the derivatives of $\log Z_i$ with respect to the (natural) parameters of $q^{\setminus i}$ [18]. Unfortunately, the computation of $Z_i$ in closed form is intractable. We show here that it can be approximated by a one-dimensional quadrature. Denote by $m_f$, $v_f$ and $m_g$, $v_g$ the means and variances of $q^{\setminus i}$ for $f(x_i)$ and $g(z_i)$, respectively. Then,

$$Z_i = \int \Phi_{0,1}\!\left(\frac{y_i\, m_f}{\sqrt{\exp(g_i) + v_f}}\right) \mathcal{N}(g_i \mid m_g, v_g)\, dg_i \qquad (14)$$

where the Gaussian integral over $f(x_i)$ has been carried out analytically.
Thus, the EP algorithm only requires five quadratures to update each $\tilde{\ell}_i$: a first one to compute $Z_i$ and four extra ones to compute its derivatives with respect to $m_f$, $v_f$, $m_g$, and $v_g$. After convergence, $q$ can be used to approximate predictive distributions, and the normalisation constant $Z_q$ can be maximised to find good values for the model's hyperparameters. In particular, it is possible to compute the gradient of $Z_q$ with respect to the parameters of the Gaussian process priors for $f$ and $g$ [18].
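The one-dimensional integral for $Z_i$ can be approximated with any one-dimensional quadrature rule. The sketch below uses a simple trapezoidal rule on a grid of plus or minus six standard deviations; a real implementation would use a more efficient rule, and the function names here are ours:

```python
import math

def std_norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def ep_site_normaliser(y, m_f, v_f, m_g, v_g, n_points=401):
    # Z_i from Eq. (14): the Gaussian integral over f(x_i) is done
    # analytically, leaving a 1-D integral over g(z_i) that we
    # approximate with the trapezoidal rule on [m_g - 6 sd, m_g + 6 sd].
    lo = m_g - 6.0 * math.sqrt(v_g)
    hi = m_g + 6.0 * math.sqrt(v_g)
    h = (hi - lo) / (n_points - 1)
    total = 0.0
    for j in range(n_points):
        g = lo + j * h
        weight = 0.5 if j in (0, n_points - 1) else 1.0
        inner = std_norm_cdf(y * m_f / math.sqrt(math.exp(g) + v_f))
        total += weight * inner * norm_pdf(g, m_g, v_g)
    return total * h

Z = ep_site_normaliser(y=+1, m_f=0.5, v_f=1.0, m_g=0.0, v_g=0.5)
```

Derivatives of $Z_i$ with respect to $m_f$, $v_f$, $m_g$, and $v_g$ can be approximated in the same fashion, giving the five quadratures per site mentioned above.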
4 Experiments
Our intention here is to investigate the performance of the GP with privileged noise approach. To this end, we considered three types of binary classification tasks, corresponding to different kinds of privileged information, using two real-world datasets: Attribute Discovery and Animals with Attributes. We detail those experiments in turn in the following sections.
Methods We compared our proposed GPC+ method with the well-established LUPI method based on SVM, SVM+ [5]. As a reference, we also fit standard GP and SVM classifiers learning on the original space only (GPC and SVM baselines). For all four methods, we used a squared exponential kernel with an amplitude parameter and a smoothness parameter. For simplicity, we fixed the amplitude parameter in all cases. For GPC and GPC+, we used type II maximum likelihood to estimate the hyperparameters. There are two hyperparameters in GPC (the smoothness parameter and the noise variance $\sigma^2$) and also two in GPC+ (the smoothness parameters of the kernel $k$ on the original space and of the kernel $\tilde{k}$ on the privileged space). For SVM and SVM+, we used cross-validation to set the hyperparameters. SVM has two knobs, namely smoothness and regularisation, and SVM+ has four knobs, two smoothness and two regularisation parameters. Since a grid search via cross-validation turned out to be too expensive for finding the best parameters of SVM+, we instead used the performance on a separate validation set to guide the search. None of the other three methods used this separate validation set, which means we give a competitive advantage to SVM+ over the other methods.

Evaluation metric To evaluate the performance of the methods, we used the classification error on an independent test set. We performed repeated runs of all the experiments to obtain better statistics of the performance, and report the mean and the standard error.
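Concretely, the reported numbers are the mean error over the repeats and the standard error of that mean (a generic helper, not the paper's code):

```python
import math

def mean_and_standard_error(errors):
    # Mean over repeated runs, and the standard error of that mean:
    # sample standard deviation divided by sqrt(number of repeats).
    n = len(errors)
    mean = sum(errors) / n
    var = sum((e - mean) ** 2 for e in errors) / (n - 1)
    return mean, math.sqrt(var / n)

# Toy example with five repeats of a classification-error measurement:
m, se = mean_and_standard_error([10.2, 9.8, 10.5, 9.9, 10.1])
```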
4.1 Attribute Discovery Dataset [19]
The dataset was collected from a shopping website that aggregates product data from a variety of e-commerce sources and includes both images and associated textual descriptions. The images and associated texts are grouped into broad shopping categories: bags, earrings, ties, and shoes. We used samples from this dataset and generated a binary classification task for each pair of classes, with a fixed number of samples for training, a fixed number for validation, and the rest of the samples for testing the predictive performance.
Neural networks on texts as privileged information We used images as the original domain and texts as the privileged domain. This setting was also explored in [6]. However, we used a different dataset, as the textual descriptions of the images used in [6] are sparse and contain duplicates. Furthermore, we extracted more advanced text features instead of simple term frequency (TF) features. As the image representation, we extracted SURF descriptors [20] and constructed a codebook of visual words using $k$-means clustering. As the text representation, we extracted a continuous word-vector representation using a neural network skip-gram architecture [21] (https://code.google.com/p/word2vec/). To convert this word representation into a fixed-length sentence representation, we constructed a codebook of word-vectors, again using $k$-means clustering. We note that a more elaborate approach for transforming word features into sentence or document features has recently been developed [22], and we are planning to explore this in the future. We performed PCA for dimensionality reduction in the original and privileged domains and only kept the top principal components. Finally, we standardised the data so that each feature has zero mean and unit standard deviation.

The experimental results are summarised in Tab. 1. On average over the tasks, SVM with hinge loss outperforms GPC with probit likelihood. However, GPC+ significantly improves over GPC, providing the best results on average. This clearly shows that GPC+ is able to utilise the neural network textual representation as privileged information. In contrast, SVM+ produced the same results as SVM. We suspect this is because SVM already shows strong performance on the original image space, coupled with the difficulty of finding the best values of four hyperparameters. Keep in mind that, in SVM+, we discretised the hyperparameter search space into a grid of possible combination values and used a separate validation set.
Table 1: Classification error (mean ± standard error) on the attribute discovery dataset.

                    GPC            GPC+ (Ours)    SVM            SVM+
bags v. earrings    9.79 ± 0.12    9.50 ± 0.11    9.89 ± 0.14    9.89 ± 0.13
bags v. ties       10.36 ± 0.16   10.03 ± 0.15    9.44 ± 0.16    9.47 ± 0.13
bags v. shoes       9.66 ± 0.13    9.22 ± 0.11    9.31 ± 0.12    9.29 ± 0.14
earrings v. ties   10.84 ± 0.14   10.56 ± 0.13   11.15 ± 0.16   11.11 ± 0.16
earrings v. shoes   7.74 ± 0.11    7.33 ± 0.10    7.75 ± 0.13    7.63 ± 0.13
ties v. shoes      15.51 ± 0.16   15.54 ± 0.16   14.90 ± 0.21   15.10 ± 0.18
average error      10.65 ± 0.11   10.36 ± 0.12   10.41 ± 0.11   10.42 ± 0.11
average ranking     3.0            1.8            2.7            2.5
4.2 Animals with Attributes (AwA) Dataset [23]
The dataset was collected by querying image search engines for each of the animal categories, which have complementary high-level descriptions of semantic properties such as shape, colour, or habitat. The semantic attributes per animal class were retrieved from a prior psychological study. We focused on the categories corresponding to the test set of this dataset, for which predicted attributes are provided based on the probabilistic DAP model [23]. The ten classes are: chimpanzee, giant panda, leopard, Persian cat, pig, hippopotamus, humpback whale, raccoon, rat, and seal. As in Section 4.1, and also in [6], we generated a binary classification task for each pair of classes, with a fixed number of samples for training, a fixed number for validation, and the rest of the samples for testing the predictive performance.
Neural networks on images as privileged information
Deep learning methods have gained increasing attention within the machine learning and computer vision communities over recent years. This is due to their capability of extracting informative features and delivering strong predictive performance in many classification tasks. As such, we are interested in exploring the use of deep learning based features as privileged information, so that their predictive power can be used even if we do not have access to them at prediction time. We used the standard SURF features [20] with a visual-word codebook as the original domain, and used the recently proposed DeCAF features [24], extracted from the activations of a deep convolutional network trained in a fully supervised fashion, as the privileged domain. The DeCAF features are high-dimensional. All features are provided with the AwA dataset (http://attributes.kyb.tuebingen.mpg.de). We again performed PCA for dimensionality reduction in the original and privileged domains, only kept the top principal components, and standardised the data.

[Figure 2 panels: DeCAF as privileged; attributes as privileged.]
Attributes as privileged information Following the experimental setting of [6], we also used images as the original domain and attributes as the privileged domain. Images were represented by visual words based on SURF descriptors, and the attributes were in the form of 85-dimensional predicted attributes based on probabilistic binary classifiers [23]. This time, we only performed PCA, keeping the top principal components, in the original domain. Finally, we again standardised the data.
The results are summarised in Fig. 2 in terms of pairwise comparisons, over the binary tasks, between GPC+ and the main baselines, GPC and SVM+. The full results, with the error of each method GPC, GPC+, SVM, and SVM+ on each problem, are relegated to the appendix. In contrast to the results on the attribute discovery dataset, on the AwA dataset it is clear that GPC outperforms SVM in almost all of the binary classification tasks (see the appendix). The average error of GPC over all tasks and repeats is much lower than that of SVM. On the AwA dataset, SVM+ can take advantage of privileged information (be it deep convolutional DeCAF features or semantic attributes) and shows a significant performance improvement over SVM. However, GPC+ still shows the best overall results and further improves the already strong performance of GPC. As illustrated in Fig. 1 (right), the privileged information modulates the slope of the sigmoid likelihood function differently for easy and difficult examples: easy examples gain slope, and hence importance, whereas difficult ones lose importance in the classification. We analysed our experimental results using the multiple-dataset statistical comparison method described in [25]. (We are not able to use this method for our attribute discovery results in Tab. 1, as the number of methods being compared (4) is almost equal to the number of tasks (6).) The statistical tests are summarised in Fig. 3. When DeCAF features are used as privileged information, there is statistical evidence that GPC+ performs best among the four methods; when semantic attributes are used as privileged information, GPC+ still performs best, but there is not enough evidence to reject that GPC+ performs comparably to GPC.
5 Conclusions
We presented the first treatment of learning with privileged information in the Gaussian process classification (GPC) framework, called GPC+. The privileged information enters the latent noise layer of GPC+, resulting in a data-dependent modulation of the sigmoid slope of the GP likelihood. As our experimental results demonstrate, this is an effective way to make use of privileged information, which manifests itself in significantly improved classification accuracies. To our knowledge, this is also the first time that a heteroscedastic noise term has been used to improve GPC. Furthermore, we showed that recent advances in continuous word-vector neural network representations [22] and deep convolutional networks for image representations [24] provide effective privileged information. For future work, we plan to extend GPC+ to the multi-class setting and to speed up computation by devising a quadrature-free expectation propagation method, similar to [26].
References
 [1] D.H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural computation, 8(7):1341–1390, 1996.
 [2] V. Vapnik and A. Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5–6):544 – 557, 2009.
 [3] D. Pechyony and V. Vapnik. On the theory of learning with privileged information. In Advances in Neural Information Processing Systems (NIPS), 2010.
 [4] B. Ribeiro, C. Silva, A. Vieira, A. GasparCunha, and J.C. das Neves. Financial distress model prediction using SVM+. In International Joint Conference on Neural Networks (IJCNN), 2010.

 [5] D. Pechyony and V. Vapnik. Fast optimization algorithms for solving SVM+. In Statistical Learning and Data Science, 2011.
 [6] V. Sharmanska, N. Quadrianto, and C. H. Lampert. Learning to rank using privileged information. In International Conference on Computer Vision (ICCV), 2013.
 [7] J. Feyereisl and U. Aickelin. Privileged information for data clustering. Information Sciences, 194:4–23, 2012.
 [8] S. Fouad, P. Tino, S. Raychaudhury, and P. Schneider. Incorporating privileged information through metric learning. IEEE Transactions on Neural Networks and Learning Systems, 2013.
 [9] M. Lapin, M. Hein, and B. Schiele. Learning using privileged information: SVM+ and weighted SVM. Neural Networks, 53:95–108, 2014.
 [10] P. W. Goldberg, C. K. I. Williams, and C. M. Bishop. Regression with input-dependent noise: A Gaussian process treatment. In Advances in Neural Information Processing Systems (NIPS), 1998.

 [11] N. Quadrianto, K. Kersting, M. D. Reid, T. S. Caetano, and W. L. Buntine. Kernel conditional quantile estimation via reduction revisited. In International Conference on Data Mining (ICDM), 2009.
 [12] M. Lázaro-Gredilla and M. K. Titsias. Variational heteroscedastic Gaussian process regression. In International Conference on Machine Learning (ICML), 2011.
 [13] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.

 [14] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.
 [15] T. P. Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, Massachusetts Institute of Technology, 2001.
 [16] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9:2035–2078, 2008.
 [17] M. Kuss and C. E. Rasmussen. Assessing approximate inference for binary Gaussian process classification. Journal of Machine Learning Research, 6:1679–1704, 2005.
 [18] M. Seeger. Expectation propagation for exponential families. Technical report, Department of EECS, University of California, Berkeley, 2006.
 [19] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. In European Conference on Computer Vision (ECCV), 2010.
 [20] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speededup robust features (surf). Computer Vision and Image Understanding, 110(3):346–359, June 2008.
 [21] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS), 2013.
 [22] Q. V. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning (ICML), 2014.
 [23] C. H. Lampert, H. Nickisch, and S. Harmeling. Attributebased classification for zeroshot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465, 2014.
 [24] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning (ICML), 2014.
 [25] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.
 [26] J. Riihimäki, P. Jylänki, and A. Vehtari. Nested Expectation Propagation for Gaussian Process Classification with a Multinomial Probit Likelihood. Journal of Machine Learning Research, 14:75–109, 2013.
Appendix
Error rate performance on the AwA dataset over 100 repeated experiments. SURF image features as the original domain and DeCAF deep neural network image features as the privileged domain. The best method for each binary task is highlighted in boldface.
GPC  GPC+ (Ours)  SVM  SVM+  
Chimp. v. Panda  
Chimp. v. Leopard  
Chimp. v. Cat  
Chimp. v. Pig  
Chimp. v. Hippo.  
Chimp. v. Whale  
Chimp. v. Raccoon  
Chimp. v. Rat  
Chimp. v. Seal  
Panda v. Leopard  
Panda v. Cat  
Panda v. Pig  
Panda v. Hippo.  
Panda v. Whale  
Panda v. Raccoon  
Panda v. Rat  
Panda v. Seal  
Leopard v. Cat  
Leopard v. Pig  
Leopard v. Hippo.  
Leopard v. Whale  
Leopard v. Raccoon  
Leopard v. Rat  
Leopard v. Seal  
Cat v. Pig  
Cat v. Hippo.  
Cat v. Whale  
Cat v. Raccoon  
Cat v. Rat  
Cat v. Seal  
Pig v. Hippo.  
Pig v. Whale  
Pig v. Raccoon  
Pig v. Rat  
Pig v. Seal  
Hippo. v. Whale  
Hippo. v. Raccoon  
Hippo. v. Rat  
Hippo. v. Seal  
Whale v. Raccoon  
Whale v. Rat  
Whale v. Seal  
Raccoon v. Rat  
Raccoon v. Seal  
Rat v. Seal  
average ranking  2.09  1.40  3.71  2.80
average error  17.60 ± 0.10  17.47 ± 0.10  18.21 ± 0.11  17.80 ± 0.10
Error rate performance on the AwA dataset over 100 repeated experiments. SURF image features as the original domain and attributes as the privileged domain. The best method for each binary task is highlighted in boldface.
GPC  GPC+ (Ours)  SVM  SVM+  
Chimp. v. Panda  
Chimp. v. Leopard  
Chimp. v. Cat  
Chimp. v. Pig  
Chimp. v. Hippo.  
Chimp. v. Whale  
Chimp. v. Raccoon  
Chimp. v. Rat  
Chimp. v. Seal  
Panda v. Leopard  
Panda v. Cat  
Panda v. Pig  
Panda v. Hippo.  
Panda v. Whale  
Panda v. Raccoon  
Panda v. Rat  
Panda v. Seal  
Leopard v. Cat  
Leopard v. Pig  
Leopard v. Hippo.  
Leopard v. Whale  
Leopard v. Raccoon  
Leopard v. Rat  
Leopard v. Seal  
Cat v. Pig  
Cat v. Hippo.  
Cat v. Whale  
Cat v. Raccoon  
Cat v. Rat  
Cat v. Seal  
Pig v. Hippo.  
Pig v. Whale  
Pig v. Raccoon  
Pig v. Rat  
Pig v. Seal  
Hippo. v. Whale  
Hippo. v. Raccoon  
Hippo. v. Rat  
Hippo. v. Seal  
Whale v. Raccoon  
Whale v. Rat  
Whale v. Seal  
Raccoon v. Rat  
Raccoon v. Seal  
Rat v. Seal  
average ranking  1.98  1.40  3.44  3.18
average error  17.60 ± 0.10  17.48 ± 0.10  18.21 ± 0.11  18.06 ± 0.11