1 Introduction
Different from single-label classification, multi-label learning (MLL) allows each example to own multiple, non-exclusive labels. For instance, when posting a photo taken at the Rio Olympics on Instagram, Twitter or Facebook, we may simultaneously include hashtags such as #RioOlympics, #athletes, #medals and #flags; similarly, a related news article can be simultaneously annotated as “Sports”, “Politics” and “Brazil”. Multi-label learning aims to accurately assign a group of labels to unseen examples using the knowledge harvested from the training data, and it has been widely used in many applications, such as document categorization Yang et al. (2009); Li et al. (2015), image/video classification and annotation Yang et al. (2016); Wang et al. (2016); Bappy et al. (2016), gene function classification Cesa-Bianchi et al. (2012) and image retrieval Ranjan et al. (2015).
The most straightforward approach is 1-vs-all, or Binary Relevance (BR) Tsoumakas et al. (2010), which decomposes multi-label learning into a set of independent binary classification tasks. However, because it neglects label relationships, it achieves only passable performance. A number of methods have thus been developed to further improve the performance by taking label relationships into consideration, such as label ranking Fürnkranz et al. (2008), chains of binary classifiers Read et al. (2011), ensembles of multi-class classifiers Tsoumakas et al. (2011) and label-specific features Zhang & Wu (2015).
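As a concrete illustration, the Binary Relevance decomposition just described can be sketched in a few lines (a toy example with hypothetical data, using scikit-learn's LinearSVC as the per-label binary classifier):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy data (hypothetical): 8 examples, 4 features, 3 non-exclusive labels.
X = np.array([[1, 0, 0, 1], [1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1],
              [1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 1, 0], [0, 0, 0, 1]], dtype=float)
Y = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1],
              [1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 1]])

# Binary Relevance: one independent binary SVM per label; any relationship
# between the labels is simply ignored.
classifiers = [LinearSVC().fit(X, Y[:, j]) for j in range(Y.shape[1])]
pred = np.column_stack([clf.predict(X) for clf in classifiers])
print(pred.shape)  # one binary decision per (example, label) pair
```

Each classifier sees only its own label column, which is exactly the independence that the methods below try to overcome.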
Recently, embedding-based methods have emerged as a mainstream solution to the multi-label learning problem. These approaches assume that the label matrix is low-rank and embed the original label vectors via different manipulations, such as compressed sensing Hsu et al. (2009), principal component analysis Tai & Lin (2012), canonical correlation analysis Zhang & Schneider (2011), landmark selection Balasubramanian & Lebanon (2012) and manifold deduction Bhatia et al. (2015); Hou et al. (2016).
Most low-rank based multi-label learning algorithms exploit label relationships in the hypothesis space. The hypotheses of different labels interact with each other under the low-rank constraint, which is an implicit use of label relationships. By contrast, multiple labels can help each other in a more explicit way, where the hypothesis of a label is not only evaluated on the label itself, but can also be assessed by the other labels. More specifically, in multi-label learning, for the label hypothesis at hand, the other labels can together act as an Oracle teacher that provides comments on its performance, which is then beneficial for updating the learner. Since the multiple labels of an example can only be accessed in the training stage, not the testing stage, these Oracle teachers exist only during training. This privileged setting has been studied in the LUPI (learning using privileged information) paradigm Vapnik et al. (2009); Vapnik & Vashist (2009); Vapnik & Izmailov (2015), and it has been reported that appropriate privileged information can boost the performance in ranking Sharmanska et al. (2013), metric learning Fouad et al. (2013), classification Pechyony & Vapnik (2010) and visual recognition Motiian et al. (2016).
In this paper, we build connections between labels through privileged label information and formulate an effective privileged multi-label learning (PrML) method. For each label, each example’s privileged label feature is generated from the other labels; given the underlying connections between labels, it provides additional guidance for the learning of this label. By integrating the privileged information into low-rank based multi-label learning, each label predictor learned from the resulting model not only interacts with the other labels via their predictors, but also receives explicit comments from these labels. An iterative optimization strategy is employed to solve PrML, and we theoretically show that each subproblem can be solved by a dual coordinate descent algorithm with a guarantee of solution uniqueness. Experimental results demonstrate the significance of exploiting the privileged label features and the effectiveness of the proposed algorithm.
2 Problem Formulation
In this section we elaborate the intrinsic privileged information in multi-label learning and formulate the corresponding privileged multi-label learning (PrML) model.
We first introduce the multi-label learning (MLL) problem and its frequently used notation. Given $n$ training points, we denote the whole data set as $\mathcal{D}=\{(\mathbf{x}_1,\mathbf{y}_1),\ldots,(\mathbf{x}_n,\mathbf{y}_n)\}$, where $\mathbf{x}_i\in\mathbb{R}^d$ is the input feature vector and $\mathbf{y}_i\in\{-1,1\}^L$ is the corresponding label vector with the label size $L$. Let $X=[\mathbf{x}_1,\ldots,\mathbf{x}_n]$ be the data matrix and $Y=[\mathbf{y}_1,\ldots,\mathbf{y}_n]$ be the label matrix. Specifically, $y_{ij}=1$ if and only if the $j$-th label is assigned to the example $\mathbf{x}_i$, and $y_{ij}=-1$ otherwise. Given the dataset $\mathcal{D}$, multi-label learning is formulated as learning a mapping function $f:\mathbb{R}^d\rightarrow\{-1,1\}^L$ that can accurately predict labels for unseen test points.
2.1 Low-rank multi-label embedding
A straightforward manner to parameterize the decision function is using linear classifiers,
i.e. $f(\mathbf{x})=W^T\mathbf{x}$, where $W=[\mathbf{w}_1,\ldots,\mathbf{w}_L]\in\mathbb{R}^{d\times L}$. Note that the linear form actually incorporates the bias term by augmenting an additional 1 to the feature vector $\mathbf{x}$. The Binary Relevance (BR) method Tsoumakas et al. (2010) decomposes multi-label learning into a set of single-label learning problems. The binary classifier for each label can be obtained by the widely used SVM method:
$$\min_{\mathbf{w}_j,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}_j\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\ \ y_{ij}\langle\mathbf{w}_j,\mathbf{x}_i\rangle \ge 1-\xi_i,\ \ \xi_i\ge 0,\ \ i=1,\ldots,n, \tag{1}$$
where $\xi_i$ is a slack variable and $\langle\cdot,\cdot\rangle$ is the inner product between two vectors or matrices. Predictors of different labels are thus solved independently, without considering relationships between labels, which limits the classification performance of the BR method.
Some labels can be closely connected and tend to occur together on examples, and thus the label matrix is often supposed to be low-rank, which implies a low-rank label predictor matrix $W$ as a result. Considering the rank of $W$ as $k$, which is smaller than $d$ and $L$, we are able to employ two smaller matrices to approximate $W$, i.e. $W=D^TZ$ with $D\in\mathbb{R}^{k\times d}$ and $Z=[\mathbf{z}_1,\ldots,\mathbf{z}_L]\in\mathbb{R}^{k\times L}$. $D$ can be seen as a dictionary of $k$ hypotheses in the latent space, while each $\mathbf{z}_j$ in $Z$ is the coefficient vector generating the predictor of the $j$-th label from the hypothesis dictionary $D$. Each classifier is thus represented as $\mathbf{w}_j=D^T\mathbf{z}_j$, and Problem (1) can be extended into:
$$\min_{D,Z,\,\boldsymbol{\xi}}\ \frac{1}{2}\Big(\|D\|_F^2+\sum_{j=1}^{L}\|\mathbf{z}_j\|^2\Big) + C\sum_{i=1}^{n}\sum_{j=1}^{L}\xi_{ij} \quad \text{s.t.}\ \ y_{ij}\langle\mathbf{z}_j, D\mathbf{x}_i\rangle \ge 1-\xi_{ij},\ \ \xi_{ij}\ge 0, \tag{2}$$
where $\boldsymbol{\xi}=\{\xi_{ij}\}_{n\times L}$. Thus in Eq. (2), the classifiers of all labels are drawn from an identical low-dimensional subspace, i.e. the row space of $D$. Then, using block coordinate descent, either $D$ or $Z$ can be solved within the empirical risk minimization (ERM) framework by turning the problem into a hinge loss minimization.
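The low-rank factorization $W=D^TZ$ can be illustrated numerically (a minimal sketch with arbitrary random matrices; the dimensions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, k = 20, 10, 3                   # features, labels, latent dim (k << min(d, L))

D = rng.standard_normal((k, d))       # dictionary of k hypotheses in the latent space
Z = rng.standard_normal((k, L))       # column z_j combines the dictionary rows

W = D.T @ Z                           # all L predictors at once: w_j = D^T z_j
assert np.linalg.matrix_rank(W) <= k  # low rank by construction

x = rng.standard_normal(d)
scores = W.T @ x                      # one real-valued score per label
print(scores.shape)
```

Every column of $W$ lives in the row space of $D$, which is the implicit coupling between labels mentioned above.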
2.2 Privileged information in multi-label learning
The slack variable $\xi_{ij}$ in Eq. (2) indicates the prediction error of the $i$-th example on the $j$-th label. In fact, it depicts the error-tolerance of a model, and is directly related to the optimal classifier and its classification performance. From a different point of view, slack variables can be regarded as comments of some Oracle teacher on the performance of predictors on each example. In the multi-label context, for each label, its hypothesis is not only evaluated by the label itself, but also assessed by the other labels. The other labels can thus be seen as its Oracle teacher, providing comments during this label’s learning. Note that these label values are known a priori only during training: when we get down to learning the $j$-th label’s predictor, we actually know the values of the other labels for each training point $\mathbf{x}_i$. Therefore, we can formulate the other label values as privileged information (or hidden information) of each example. Let
$$\tilde{\mathbf{x}}_i^j=[y_{i1},\ldots,y_{i,j-1},\,0,\,y_{i,j+1},\ldots,y_{iL}]^T. \tag{3}$$
We call $\tilde{\mathbf{x}}_i^j$ the training point $\mathbf{x}_i$’s privileged label feature on the $j$-th label. It can be seen that the privileged label space is constructed straightforwardly from the original label space. These privileged label features can thus be regarded as an explicit way to connect all labels. In addition, note that the valid dimension (removing the 0) of $\tilde{\mathbf{x}}_i^j$ is $L-1$, since we take the other label values as the privileged label features. Moreover, not all the other labels have a positive impact on the learning of a given label Sun et al. (2014), and thus it is appropriate to strategically select some key labels to formulate the privileged label features. We will discuss this in the Experiments section.
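A minimal sketch of the construction in Eq. (3), with a hypothetical toy label matrix in {-1, +1}:

```python
import numpy as np

# Hypothetical label matrix: n=3 examples, L=4 labels, entries in {-1, +1}.
Y = np.array([[ 1, -1,  1, -1],
              [-1,  1,  1,  1],
              [ 1,  1, -1, -1]])

def privileged_feature(Y, i, j):
    """Privileged label feature of example i for label j, as in Eq. (3):
    the example's own label vector with the j-th entry masked out (set to 0)."""
    x_tilde = Y[i].astype(float).copy()
    x_tilde[j] = 0.0
    return x_tilde

print(privileged_feature(Y, 0, 2))  # [ 1. -1.  0. -1.]
```

No extra data is needed: the privileged features are read directly off the training label matrix, which is why they are available only at training time.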
Since, for each label, the other labels serve as the Oracle teacher via the privileged label feature on each example, the comments on slack variables can be modelled as a linear function Vapnik & Vashist (2009),
$$\xi_{ij}=\langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle. \tag{4}$$
The function $\langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle$ is thus called the correcting function with respect to the $j$-th label, where $\tilde{\mathbf{w}}_j$ is its parameter vector. As shown in Eq. (4), the privileged comments directly correct the values of the slack variables, acting as prior knowledge or additional information. Integrating privileged features as in Eq. (4) into the SVM yields the popular SVM+ method Vapnik & Vashist (2009), which has been shown to improve both the convergence rate and the performance.
Integrating the proposed privileged label features into the low-rank parameter structure as in Eqs. (2) and (4), we formulate a new multi-label learning model, privileged multi-label learning (PrML), by casting it into the SVM+-based LUPI paradigm,
$$\min_{D,Z,\tilde{W}}\ \frac{1}{2}\Big(\|D\|_F^2+\sum_{j=1}^{L}\|\mathbf{z}_j\|^2+\gamma\sum_{j=1}^{L}\|\tilde{\mathbf{w}}_j\|^2\Big) + C\sum_{i=1}^{n}\sum_{j=1}^{L}\langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle$$
$$\text{s.t.}\ \ y_{ij}\langle\mathbf{z}_j, D\mathbf{x}_i\rangle \ge 1-\langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle,\ \ \langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle \ge 0,\ \ \forall\, i,j, \tag{5}$$
where $\tilde{W}=[\tilde{\mathbf{w}}_1,\ldots,\tilde{\mathbf{w}}_L]$. In particular, we absorb the bias term to obtain a compact variant of the original SVM+, because it turns out to have a simpler form in the dual space and can be solved more efficiently. In this way, the training data for multi-label learning actually come in triplets, i.e. $(\mathbf{x}_i,\tilde{\mathbf{x}}_i^j,y_{ij})$, where $\tilde{X}^j=[\tilde{\mathbf{x}}_1^j,\ldots,\tilde{\mathbf{x}}_n^j]$ is the privileged label feature matrix for the $j$-th label.
Remark. When $D=I$ (i.e. $k=d$ and the low-dimensional projection is the identity), the proposed PrML degenerates into a simpler BR-style model (we call it privileged Binary Relevance, PrBR), where the whole model decomposes into $L$ independent binary models. However, every single model is still combined with the comments from privileged information, so it may still be superior to BR.
3 Optimization
In this section, we present how to solve the proposed privileged multi-label learning model, Eq. (5). The whole model of Eq. (5) is not convex due to the multiplication of $\mathbf{z}_j$ and $D$ in the constraints. However, each subproblem with $D$ or $Z$ fixed is convex, and thus can be solved by various efficient convex solvers. Note that the prediction term $\langle D^T\mathbf{z}_j,\mathbf{x}_i\rangle$ has two equivalent forms, i.e. $\langle\mathbf{z}_j, D\mathbf{x}_i\rangle$ and $\langle D, \mathbf{z}_j\mathbf{x}_i^T\rangle_F$, and thus the correcting function can be coupled with $\mathbf{z}_j$ or $D$, without damaging the convexity of either subproblem. In this way, Eq. (5) can be solved using an alternating iteration strategy, i.e. by iteratively conducting the following two steps: optimizing $Z$ and the privileged variables $\tilde{W}$ with $D$ fixed, and updating $D$ and the privileged variables $\tilde{W}$ with $Z$ fixed. Both subproblems are related to SVM+, so their dual problems are quadratic programs (QPs). In the following, we elaborate the solving process as used in real implementations.
3.1 Optimizing $Z$ with fixed $D$
Fixing $D$, Eq. (5) can be decomposed into $L$ independent binary classification problems, each of which regards the variable pair $(\mathbf{z}_j,\tilde{\mathbf{w}}_j)$. Parallel techniques or multi-core computation can thus be employed to speed up the training process. Specifically, the optimization problem with respect to $(\mathbf{z}_j,\tilde{\mathbf{w}}_j)$ is
$$\min_{\mathbf{z}_j,\tilde{\mathbf{w}}_j}\ \frac{1}{2}\Big(\|\mathbf{z}_j\|^2+\gamma\|\tilde{\mathbf{w}}_j\|^2\Big) + C\sum_{i=1}^{n}\langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle \quad \text{s.t.}\ \ y_{ij}\langle\mathbf{z}_j, D\mathbf{x}_i\rangle \ge 1-\langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle,\ \ \langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle \ge 0, \tag{6}$$
and its dual form is (see supplementary materials)
$$\max_{\boldsymbol{\alpha},\boldsymbol{\beta}}\ \mathbf{1}^T\boldsymbol{\alpha} - \frac{1}{2}(\boldsymbol{\alpha}\circ\mathbf{y}^j)^T K\, (\boldsymbol{\alpha}\circ\mathbf{y}^j) - \frac{1}{2\gamma}(\boldsymbol{\alpha}+\boldsymbol{\beta}-C\mathbf{1})^T \tilde{K}^j (\boldsymbol{\alpha}+\boldsymbol{\beta}-C\mathbf{1}) \tag{7}$$
with the parameter update $\mathbf{z}_j=\sum_{i=1}^{n}\alpha_i y_{ij} D\mathbf{x}_i$ and the constraints $\boldsymbol{\alpha}\ge\mathbf{0}$ and $\boldsymbol{\beta}\ge\mathbf{0}$, i.e. $\alpha_i\ge 0,\ \beta_i\ge 0$ for $i=1,\ldots,n$. Moreover, $\mathbf{y}^j=[y_{1j},\ldots,y_{nj}]^T$ is the label-wise vector for the $j$-th label; $\circ$ is the Hadamard (element-wise) product of two vectors or matrices; $K$ is the $D$-based features’ inner product (kernel) matrix with $K_{il}=\langle D\mathbf{x}_i, D\mathbf{x}_l\rangle$; $\tilde{K}^j$ is the privileged label features’ inner product (kernel) matrix with respect to the $j$-th label, where $\tilde{K}^j_{il}=\langle\tilde{\mathbf{x}}_i^j,\tilde{\mathbf{x}}_l^j\rangle$; and $\mathbf{1}$ is the vector of all ones.
Pechyony et al. (2010) proposed an SMO-style algorithm (gSMO) for the SVM+ problem. However, because of the bias term, the Lagrange multipliers are tangled together in the dual problem, which leads to a more complicated constraint set
$$\Big\{(\boldsymbol{\alpha},\boldsymbol{\beta})\ \Big|\ \textstyle\sum_{i}\alpha_i y_{ij}=0,\ \sum_{i}(\alpha_i+\beta_i-C)=0,\ \boldsymbol{\alpha}\ge\mathbf{0},\ \boldsymbol{\beta}\ge\mathbf{0}\Big\}$$
than the $\{(\boldsymbol{\alpha},\boldsymbol{\beta})\mid\boldsymbol{\alpha}\ge\mathbf{0},\ \boldsymbol{\beta}\ge\mathbf{0}\}$ in our PrML. Hence, by absorbing the bias term, Eq. (6) produces a more compact dual problem with only non-negativity constraints. A coordinate descent (CD) algorithm (we optimize an equivalent “min” problem instead of the original “max” one) can be applied to solve the dual problem, and a closed-form solution can be obtained at each iteration step Li et al. (2016). After solving Eq. (7), according to the Karush-Kuhn-Tucker (KKT) conditions, the optimal solution of the primal problem (6) can be expressed via the Lagrange multipliers:
$$\mathbf{z}_j=\sum_{i=1}^{n}\alpha_i y_{ij} D\mathbf{x}_i, \tag{8}$$
$$\tilde{\mathbf{w}}_j=\frac{1}{\gamma}\sum_{i=1}^{n}(\alpha_i+\beta_i-C)\,\tilde{\mathbf{x}}_i^j. \tag{9}$$
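The flavour of such a dual solver can be conveyed with a generic sketch for a QP with only non-negativity constraints (a simplified stand-in, not the exact update rule of Li et al. (2016)): each coordinate is minimized in closed form and then projected back onto the non-negative orthant.

```python
import numpy as np

def cd_qp_nonneg(Q, p, iters=200):
    """Coordinate descent for  min_a 0.5 * a^T Q a + p^T a  s.t.  a >= 0.
    Each coordinate update minimizes over a single a_i in closed form and
    then projects the result onto the non-negative orthant."""
    a = np.zeros(len(p))
    for _ in range(iters):
        for i in range(len(p)):
            grad_i = Q[i] @ a + p[i]                 # partial gradient w.r.t. a_i
            a[i] = max(0.0, a[i] - grad_i / Q[i, i]) # closed form + projection
    return a

# Tiny positive-definite toy problem (hypothetical numbers).
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
p = np.array([-1.0, -1.0])
a = cd_qp_nonneg(Q, p)
print(np.round(a, 3))
```

With only the constraint $a\ge 0$, each one-dimensional subproblem needs no joint projection step, which is exactly the simplification that absorbing the bias term buys.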
3.2 Optimizing $D$ with fixed $Z$
Given a fixed coefficient matrix $Z$, we update and learn the linear transformation $D$ with the help of the comments provided by privileged information. The problem (5) for $D$ thus reduces to
$$\min_{D,\tilde{W}}\ \frac{1}{2}\Big(\|D\|_F^2+\gamma\sum_{j=1}^{L}\|\tilde{\mathbf{w}}_j\|^2\Big) + C\sum_{i=1}^{n}\sum_{j=1}^{L}\langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle \quad \text{s.t.}\ \ y_{ij}\langle\mathbf{z}_j, D\mathbf{x}_i\rangle \ge 1-\langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle,\ \ \langle\tilde{\mathbf{w}}_j,\tilde{\mathbf{x}}_i^j\rangle \ge 0,\ \ \forall\, i,j. \tag{10}$$
Eq. (10) has $nL$ constraints, each of which can be indexed with a two-dimensional subscript $(i,j)$. The Lagrange multipliers of Eq. (10) are thus two-dimensional as well. To make the dual problem of Eq. (10) consistent with Eq. (7), we define a bijection $\pi(i,j)=(i-1)L+j$ as the row-based vectorization index mapping. In a nutshell, we arrange the constraints (and hence the multipliers) according to the order of row-based vectorization. In this way, the corresponding dual problem of Eq. (10) is formulated as (see details in the supplementary materials)
$$\max_{\boldsymbol{\alpha},\boldsymbol{\beta}}\ \mathbf{1}^T\boldsymbol{\alpha} - \frac{1}{2}(\boldsymbol{\alpha}\circ\mathbf{y})^T Q\, (\boldsymbol{\alpha}\circ\mathbf{y}) - \frac{1}{2\gamma}(\boldsymbol{\alpha}+\boldsymbol{\beta}-C\mathbf{1})^T \tilde{Q}\, (\boldsymbol{\alpha}+\boldsymbol{\beta}-C\mathbf{1}) \quad \text{s.t.}\ \boldsymbol{\alpha}\ge\mathbf{0},\ \boldsymbol{\beta}\ge\mathbf{0}, \tag{11}$$
where $\mathbf{y}$ is the row-based vectorization of the label matrix $Y$ and $\tilde{Q}$ is a block-diagonal matrix, which corresponds to the kernel matrices of the privileged label features. $Q$ is the kernel matrix of the input features, with every element $Q_{\pi(i,j),\pi(l,m)}=\langle\mathbf{z}_j,\mathbf{z}_m\rangle\langle\mathbf{x}_i,\mathbf{x}_l\rangle$, where $1\le i,l\le n$ and $1\le j,m\le L$. Based on the KKT conditions, $D$ and $\tilde{W}$ can be constructed using $(\boldsymbol{\alpha},\boldsymbol{\beta})$:
$$D=\sum_{i=1}^{n}\sum_{j=1}^{L}\alpha_{\pi(i,j)}\, y_{ij}\, \mathbf{z}_j\mathbf{x}_i^T, \tag{12}$$
$$\tilde{\mathbf{w}}_j=\frac{1}{\gamma}\sum_{i=1}^{n}\big(\alpha_{\pi(i,j)}+\beta_{\pi(i,j)}-C\big)\,\tilde{\mathbf{x}}_i^j. \tag{13}$$
In this way, Eq. (11) has an optimization form identical to Eq. (7), so we can again turn to the fast CD method Li et al. (2016). However, due to the subscript index mapping, directly using the method proposed in Li et al. (2016) is very expensive. Considering that the privileged kernel matrix $\tilde{Q}$ is block-sparse, we can further speed up the calculation. Details of the modified dual CD algorithm for solving Eq. (11) are presented in Algorithm 1. Also note that one primary merit of this algorithm is that it avoids computing the whole kernel matrix; instead, we only need to calculate its diagonal elements, as in line 2 of Algorithm 1.
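The row-based index mapping $\pi$ is easy to sanity-check in code (0-indexed here, whereas the text uses 1-indexing):

```python
import numpy as np

n, L = 3, 4                     # examples and labels (toy sizes)

def pi(i, j):
    """Row-based vectorization index (0-indexed): (i, j) -> i * L + j."""
    return i * L + j

A = np.arange(n * L).reshape(n, L)           # any n-by-L array of multipliers
assert all(A[i, j] == A.ravel()[pi(i, j)]    # ravel() is row-major, matching pi
           for i in range(n) for j in range(L))
print(A.ravel()[pi(1, 2)])                   # A[1, 2] == 6
```

In other words, the two-dimensional multipliers are simply laid out example by example, label by label, so vectorized dual updates can address them with ordinary strided indexing.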
3.3 Framework of PrML
Our proposed privileged multi-label learning method is summarized in Algorithm 2. As indicated in Algorithm 2, both $Z$ and $D$ are updated with the help of comments from the privileged information. Note that the primal variables and dual variables are connected by the KKT conditions, so in real applications lines 5-6 and 8-9 in Algorithm 2 can be implemented iteratively. Since each subproblem is actually a linear SVM+ optimization solved by the CD method, its convergence is consistent with that of the dual CD algorithm for linear SVM Hsieh et al. (2008). Due to the cheap updates, Hsieh et al. (2008); Li et al. (2016) empirically showed it can be much faster than SMO-style methods and many other convex solvers when $d$ (the number of features) is large. Moreover, the independence of the labels in Problem (6) allows parallel techniques and multi-core computation to accommodate a large $L$ (number of labels). As for a large $n$ (number of examples), which also implies a large number of multipliers for Problem (10), we can use the mini-batch CD method Takac et al. (2015), where each time a batch of examples is selected and CD updates are applied to them in parallel, i.e. lines 5-17 can be implemented in parallel. Recently, Chiang et al. (2016) also designed a framework for parallel CD that achieves significant speed-ups even when $n$ and $d$ are very large. Thus, our model can scale with $n$, $d$ and $L$. In addition, the solution of each subproblem is unique, as Theorem 1 states.
Theorem 1. The optimal solution of each subproblem, i.e. Eq. (6) or Eq. (10), is unique.
Proof skeleton.
Both Eq. (6) and Eq. (10) can be cast into an identical SVM+ optimization whose objective function has the form $F(\theta)=\frac{1}{2}\theta^T H\theta+\mathbf{c}^T\theta$ with $H\succ 0$, over a closed convex feasible set $\mathcal{S}$. Assume two optimal solutions $\theta_1,\theta_2\in\mathcal{S}$; then $F(\theta_1)=F(\theta_2)=F^\ast$. Let $\theta_t=t\theta_1+(1-t)\theta_2$; by the convexity of $\mathcal{S}$ we have $\theta_t\in\mathcal{S}$, and thus $F(\theta_t)\ge F^\ast$ for all $t\in[0,1]$, which together with the convexity of $F$ implies $F(\theta_t)=F^\ast$ on the whole segment.
Since $H\succ 0$, $F$ is strictly convex; for $\theta_1\neq\theta_2$ and $t\in(0,1)$ we would have $F(\theta_t)<F^\ast$, a contradiction, and hence $\theta_1=\theta_2$. ∎
The proof of Theorem 1 mainly rests on the strict convexity of the objective function of either Eq. (6) or Eq. (10); concrete details are referred to the supplementary materials. In this way, the correcting function serves as a bridge channelling $Z$ and $D$, and the convergence of $\tilde{W}$ implies the convergence of $Z$ and $D$. We can thus take $\tilde{W}$ as the barometer of the whole algorithm’s convergence.
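To convey the shape of the alternating scheme without the SVM+ machinery, here is a schematic sketch on synthetic data in which both subproblems are replaced by plain least-squares stand-ins (an illustration of the alternation only, not of Algorithm 2's actual updates):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L, k = 50, 8, 6, 3
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal((d, k)) @ rng.standard_normal((k, L))  # low-rank targets

D = rng.standard_normal((k, d))                  # random initial dictionary
for _ in range(10):
    # Step 1 (fix D, update Z): stand-in for the per-label subproblems, Eq. (6).
    Z, *_ = np.linalg.lstsq(X @ D.T, Y, rcond=None)                  # (k, L)
    # Step 2 (fix Z, update D): stand-in for the dictionary subproblem, Eq. (10).
    Dt, *_ = np.linalg.lstsq(X, Y @ np.linalg.pinv(Z), rcond=None)   # (d, k)
    D = Dt.T
print(np.linalg.norm(X @ D.T @ Z - Y) / np.linalg.norm(Y) < 1e-6)    # alternation fits
```

Each step solves a convex problem with the other block fixed, mirroring how PrML alternates between the $(Z,\tilde{W})$ and $(D,\tilde{W})$ subproblems.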
Dataset  n  d  L  Card(L)  Den(L)  type

enron  1702  1001  53  3.378  0.064  nominal 
yeast  2417  103  14  4.237  0.303  numeric 
corel5k  5000  499  374  3.522  0.009  nominal 
bibtex  7395  1836  159  2.402  0.015  nominal 
eurlex  19348  5000  3993  5.310  0.001  nominal 
mediamill  43907  120  101  4.376  0.043  numeric 
4 Experimental Results
In this section, we conduct various experiments on benchmark datasets to validate the effectiveness of using the intrinsic privileged information for multi-label learning. In addition, we investigate the performance and superiority of the proposed PrML model in comparison with recent competing multi-label methods.
4.1 Experiment configuration
Datasets. We select six benchmark multi-label datasets: enron, yeast, corel5k, bibtex, eurlex and mediamill. In particular, we consider the cases where $d$ (eurlex), $L$ (eurlex) and $n$ (corel5k, bibtex, eurlex & mediamill) are respectively large. Also note that enron, corel5k, bibtex and eurlex have sparse features. See Table 1 for the details of these datasets.
Comparison approaches.
1). BR (Binary Relevance) Tsoumakas et al. (2010). An SVM is trained with respect to each label.
2). ECC (ensembles of classifier chains) Read et al. (2011). It turns multi-label learning into a series of binary classification problems.
3). RAKEL (random k-labelsets) Tsoumakas et al. (2011). It transforms multi-label learning into an ensemble of multi-class classification problems.
4). LEML (low-rank empirical risk minimization for multi-label learning) Yu et al. (2014). It is a low-rank embedding approach cast into the ERM framework.
5). ML (multi-label manifold learning) Hou et al. (2016). It is a recent multi-label learning method based on the manifold assumption in the label space.
Evaluation Metrics.
We use six prevalent metrics to evaluate the performance of all methods: Hamming loss, One-error, Coverage, Ranking loss, Average precision (Aver precision) and Macro-averaging AUC (Mac AUC). Note that all evaluation metrics have the value range [0, 1]. For the first four metrics, smaller values indicate better classification performance, which we index with ↓; on the contrary, for the last two metrics larger values represent better performance, indexed with ↑.
4.2 Algorithm analysis
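Several of the metrics listed in Section 4.1 are available in scikit-learn; as a quick reference sketch (hypothetical toy numbers; note that scikit-learn's coverage_error is an unnormalized count rather than the [0, 1]-normalized version used here):

```python
import numpy as np
from sklearn.metrics import hamming_loss, coverage_error, label_ranking_loss

# Hypothetical toy predictions for n=2 examples and L=3 labels.
Y_true = np.array([[1, 0, 1], [0, 1, 0]])
scores = np.array([[0.9, 0.2, 0.8], [0.1, 0.7, 0.3]])  # real-valued label scores
Y_pred = (scores > 0.5).astype(int)                    # thresholded decisions

print(hamming_loss(Y_true, Y_pred))        # fraction of wrongly assigned labels
print(label_ranking_loss(Y_true, scores))  # smaller is better (the "down-arrow" logic)
print(coverage_error(Y_true, scores))      # an unnormalized count in scikit-learn
```

Here the scores rank every relevant label above every irrelevant one, so both loss values are 0 and the coverage count averages 1.5 labels per example.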
Performance visualization. First, to analyze the proposed PrML in a global sense, we select the benchmark image dataset corel5k and visualize the image annotation results to directly examine how PrML functions. For the learning process, we randomly selected 50% of the examples without repetition as the training set and used the rest as the testing set. In our experiment, one trade-off parameter is fixed to 1, and the remaining two are selected from candidate ranges by cross-validation on a part of the training points; the embedding dimension is set by rounding up a fixed fraction of the label size. Some of the annotation results are presented in Figure 1, where the left tags are the ground truth and the right ones are the top five predicted tags.
As shown in Figure 1, we can safely conclude that the proposed PrML performs well on the image annotation task, predicting the semantic labels correctly in most cases. Note that although in some cases the predicted labels are not in the ground truth, they are still semantically related; for example, the “swimmers” in image (b) feel much more natural once the “pool” appears. Moreover, PrML can make supplementary predictions that enrich the description of an image; for example, the “grass” in image (c) and the “buildings” in image (d) are objects missing from the ground-truth labels.
Validation of privileged label features. We then validate the effectiveness of the proposed privileged information for multi-label learning. As discussed previously, the privileged label features serve as guidance, or comments from an Oracle teacher, connecting the learning of all labels. For the sake of fairness, we simply compare LEML (without privileged label features) against PrML (with privileged label features). Note that our proposed privileged label features are composed of label values; however, not all labels have prominent connections in multi-label learning Sun et al. (2014). Thus we selectively construct the privileged label features with respect to each label.
In particular, we use a K-nearest-neighbour rule to form the label pool for each label. For each label, only the labels in its pool, instead of the whole label set, are reckoned to provide mutual guidance during its learning. In our implementation, we simply utilize the Hamming distance to accomplish the K-nearest-neighbour search on the corel5k dataset. The experimental setting is the same as before, and both algorithms share the same embedding dimension. Moreover, we carry out independent tests ten times and report the average results in Figure 2.
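The label-pool construction can be sketched as follows (a simplified stand-in for our implementation, with hypothetical random labels):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.integers(0, 2, size=(100, 6))      # hypothetical: n=100 examples, L=6 labels

def label_pool(Y, j, K):
    """Return the K labels closest to label j, measured by the Hamming
    distance between the corresponding columns of the label matrix."""
    dist = np.array([np.mean(Y[:, j] != Y[:, l]) for l in range(Y.shape[1])])
    dist[j] = np.inf                        # never select the label itself
    return np.argsort(dist)[:K]

pool = label_pool(Y, 0, 3)
print(pool)                                 # indices of the 3 nearest labels
```

Only the labels in this pool then contribute non-zero entries to the privileged label features of the corresponding label.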
As shown in Figure 2, we have the following two observations. (a) PrML is clearly superior to LEML when enough labels are selected as privileged label features, e.g. more than 350 labels on the corel5k dataset. Since their only difference lies in the usage of the privileged information, we can conclude that the guidance from the privileged information, i.e. the proposed privileged label features, can significantly improve the performance of multi-label learning. (b) With more labels involved in the privileged label features, the performance of PrML keeps improving at a steady rate, and when the dimension of the privileged label features is large enough, the performance tends to stabilize on the whole.
The number of selected labels is directly related to the complexity of the correcting function, which is defined as a linear function. Too few labels might thus induce a low function complexity, so that the correcting function cannot determine the optimal slack variables. In this case the fault-tolerance capability would be crippled, and the performance can even be worse than that of LEML. For example, when the dimension of the privileged labels is less than 250 on corel5k, the Hamming loss, One-error, Coverage and Ranking loss of PrML are much larger than those of LEML. In contrast, too many labels might introduce unnecessary guidance, and the extra labels thus make no contribution to further improving the classification performance. For instance, the performance with 365 labels involved in the privileged label features is on par with that using all the other (373) labels in Hamming loss, One-error, Ranking loss and Average precision. Moreover, in real applications it is still a safe choice to involve all the other labels in the privileged information.
Dataset  Method  Hamming loss (↓)  One-error (↓)  Coverage (↓)  Ranking loss (↓)  Aver precision (↑)  Mac AUC (↑)

enron  BR  0.060±0.001  0.498±0.012  0.595±0.010  0.308±0.007  0.449±0.011  0.579±0.007
  ECC  0.056±0.001  0.293±0.008  0.349±0.014  0.133±0.004  0.651±0.006  0.646±0.008
  RAKEL  0.058±0.001  0.412±0.016  0.523±0.008  0.241±0.005  0.539±0.006  0.596±0.007
  LEML  0.049±0.002  0.320±0.004  0.276±0.005  0.117±0.006  0.661±0.004  0.625±0.007
  ML  0.051±0.001  0.258±0.090  0.256±0.017  0.090±0.012  0.681±0.053  0.714±0.021
  PrBR  0.053±0.001  0.342±0.010  0.238±0.006  0.088±0.003  0.618±0.004  0.638±0.005
  PrML  0.050±0.001  0.288±0.005  0.221±0.005  0.088±0.006  0.685±0.005  0.674±0.004
yeast  BR  0.201±0.003  0.256±0.008  0.641±0.005  0.315±0.005  0.672±0.005  0.565±0.003
  ECC  0.207±0.003  0.244±0.009  0.464±0.005  0.186±0.003  0.752±0.006  0.646±0.003
  RAKEL  0.202±0.003  0.251±0.008  0.558±0.006  0.245±0.004  0.720±0.005  0.614±0.003
  LEML  0.201±0.004  0.224±0.003  0.480±0.005  0.174±0.004  0.751±0.006  0.642±0.004
  ML  0.196±0.003  0.228±0.009  0.454±0.004  0.168±0.003  0.765±0.005  0.702±0.007
  PrBR  0.227±0.004  0.237±0.006  0.487±0.005  0.204±0.003  0.719±0.005  0.623±0.004
  PrML  0.201±0.003  0.214±0.005  0.459±0.004  0.165±0.003  0.771±0.003  0.685±0.003
corel5k  BR  0.012±0.001  0.849±0.008  0.898±0.003  0.655±0.004  0.101±0.003  0.518±0.001
  ECC  0.015±0.001  0.699±0.006  0.562±0.007  0.292±0.003  0.264±0.003  0.568±0.003
  RAKEL  0.012±0.001  0.819±0.010  0.886±0.004  0.627±0.004  0.122±0.004  0.521±0.001
  LEML  0.010±0.001  0.683±0.006  0.273±0.008  0.125±0.003  0.268±0.005  0.622±0.006
  ML  0.010±0.001  0.647±0.007  0.372±0.006  0.163±0.003  0.297±0.002  0.667±0.007
  PrBR  0.010±0.001  0.740±0.007  0.367±0.005  0.165±0.004  0.227±0.004  0.560±0.005
  PrML  0.010±0.001  0.675±0.003  0.266±0.007  0.118±0.003  0.282±0.005  0.651±0.004
bibtex  BR  0.015±0.001  0.559±0.004  0.461±0.006  0.303±0.004  0.363±0.004  0.624±0.002
  ECC  0.017±0.001  0.404±0.003  0.327±0.008  0.192±0.003  0.515±0.004  0.763±0.003
  RAKEL  0.015±0.001  0.506±0.005  0.443±0.006  0.286±0.003  0.399±0.004  0.641±0.002
  LEML  0.013±0.001  0.394±0.004  0.144±0.002  0.082±0.003  0.534±0.002  0.757±0.003
  ML  0.013±0.001  0.365±0.004  0.128±0.003  0.067±0.002  0.596±0.004  0.911±0.002
  PrBR  0.014±0.001  0.426±0.004  0.178±0.010  0.096±0.005  0.529±0.009  0.702±0.003
  PrML  0.012±0.001  0.367±0.003  0.131±0.007  0.066±0.003  0.571±0.004  0.819±0.005
eurlex  BR  0.018±0.004  0.537±0.002  0.322±0.008  0.186±0.009  0.388±0.005  0.689±0.007
  ECC  0.011±0.003  0.492±0.003  0.298±0.004  0.155±0.006  0.458±0.004  0.787±0.009
  RAKEL  0.009±0.004  0.496±0.007  0.277±0.009  0.161±0.001  0.417±0.010  0.822±0.005
  LEML  0.003±0.002  0.447±0.005  0.233±0.003  0.103±0.010  0.488±0.006  0.821±0.014
  ML  0.001±0.001  0.320±0.001  0.171±0.003  0.045±0.007  0.497±0.003  0.885±0.003
  PrBR  0.007±0.008  0.484±0.003  0.229±0.009  0.108±0.009  0.455±0.003  0.793±0.008
  PrML  0.001±0.002  0.299±0.003  0.192±0.008  0.057±0.002  0.526±0.009  0.892±0.004
mediamill  BR  0.031±0.001  0.200±0.003  0.575±0.003  0.230±0.001  0.502±0.002  0.510±0.001
  ECC  0.035±0.001  0.150±0.005  0.467±0.009  0.179±0.008  0.597±0.014  0.524±0.001
  RAKEL  0.031±0.001  0.181±0.002  0.560±0.002  0.222±0.001  0.521±0.001  0.513±0.001
  LEML  0.030±0.001  0.126±0.003  0.184±0.007  0.084±0.004  0.720±0.007  0.699±0.010
  ML  0.035±0.002  0.231±0.004  0.278±0.003  0.121±0.003  0.647±0.002  0.847±0.003
  PrBR  0.031±0.001  0.147±0.005  0.255±0.003  0.092±0.002  0.648±0.003  0.641±0.004
  PrML  0.029±0.002  0.130±0.002  0.172±0.004  0.055±0.006  0.726±0.002  0.727±0.008
4.3 Performance comparison
Now we formally analyze the performance of the proposed privileged multi-label learning (PrML) in comparison with popular state-of-the-art methods. For each dataset, we randomly selected 50% of the examples without repetition as the training set and used the rest for testing. For the results’ credibility, the dataset division process was implemented ten times independently, and we recorded the corresponding results of each trial. The trade-off parameters are determined in the same manner as before. As for the low embedding dimension $k$, following the wisdom of Yu et al. (2014), we choose it from a candidate range by cross-validation using a part of the training points. In particular, we also include PrBR (privileged information + BR) to further investigate the proposed privileged information. The detailed results are reported in Table 2.
From Table 2, we can see that the proposed PrML is comparable to the state-of-the-art ML method and significantly surpasses the other competing multi-label methods. Concretely, across all evaluation metrics and datasets, PrML ranks first in 52.8% of the cases and in the top two in all cases; even in second place, PrML’s performance is close to the top one. Comparing BR with PrBR, and LEML with PrML, we can safely infer that the privileged information plays an important role in enhancing the classification performance of multi-label predictors. Besides, of all 36 cases, PrML wins 34 against PrBR and ties twice, on Ranking loss on enron and Hamming loss on corel5k respectively, which implies that the low-rank structure in PrML has a positive impact in further improving the multi-label performance. Therefore, we can see that PrML inherits the merits of both the low-rank parameter structure and the privileged label information. In addition, PrML and LEML tend to perform better on datasets with more labels (≥ 100). This might be because the low-rank assumption is more sensible when the number of labels is considerably large.
5 Conclusion
In this paper, we investigate the intrinsic privileged information available for connecting labels in multi-label learning. Tactfully, we regard the label values themselves as privileged label features. This strategy indicates that, for each label, the other labels of each example may serve as its Oracle comments on the learning of this label. Without requiring any additional data, we propose to actively construct privileged label features directly from the label space. We then integrate this privileged information with the low-rank hypotheses in multi-label learning, and formulate privileged multi-label learning (PrML) as a result. During the optimization, both the dictionary and the coefficient matrix receive comments from the privileged information. Experimental results show that with this very privileged information, the classification performance can be significantly improved. We can thus also take the privileged label features as a way to boost the classification performance of low-rank based models.
As for future work, our proposed PrML can easily be extended to a kernel version to cope with the non-linearity in the parameter space. Besides, using an SVM-style hinge loss might further improve the training efficiency Xu et al. (2016). Theoretical guarantees will also be investigated.
References

Balasubramanian & Lebanon (2012) Krishnakumar Balasubramanian and Guy Lebanon. The landmark selection method for multiple output prediction. In Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 983–990, 2012.
Bappy et al. (2016) Jawadul H. Bappy, Sujoy Paul, and Amit K. Roy-Chowdhury. Online adaptation for joint scene and object classification. In European Conference on Computer Vision, pp. 227–243. Springer, 2016.
Bhatia et al. (2015) Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, pp. 730–738, 2015.
Cesa-Bianchi et al. (2012) Nicolò Cesa-Bianchi, Matteo Re, and Giorgio Valentini. Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference. Machine Learning, 88(1-2):209–241, 2012.
Chiang et al. (2016) Wei-Lin Chiang, Mu-Chu Lee, and Chih-Jen Lin. Parallel dual coordinate descent method for large-scale linear classification in multi-core environments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.

Fouad et al. (2013) Shereen Fouad, Peter Tino, Somak Raychaudhury, and Petra Schneider. Incorporating privileged information through metric learning. IEEE Transactions on Neural Networks and Learning Systems, 24(7):1086–1098, 2013.
Fürnkranz et al. (2008) Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker. Multilabel classification via calibrated label ranking. Machine Learning, 73(2):133–153, 2008.
Hou et al. (2016) Peng Hou, Xin Geng, and Min-Ling Zhang. Multi-label manifold learning. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
Hsieh et al. (2008) Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pp. 408–415. ACM, 2008.
Hsu et al. (2009) Daniel Hsu, Sham Kakade, John Langford, and Tong Zhang. Multi-label prediction via compressed sensing. In NIPS, volume 22, pp. 772–780, 2009.

Li et al. (2016) Wen Li, Dengxin Dai, Mingkui Tan, Dong Xu, and Luc Van Gool. Fast algorithms for linear and kernel SVM+. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2258–2266, 2016.
Li et al. (2015) Ximing Li, Jihong Ouyang, and Xiaotang Zhou. Supervised topic models for multi-label classification. Neurocomputing, 149:811–819, 2015.
 Motiian et al. (2016) Saeid Motiian, Marco Piccirilli, Donald A. Adjeroh, and Gianfranco Doretto. Information bottleneck learning using privileged information for visual recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
Pechyony & Vapnik (2010) Dmitry Pechyony and Vladimir Vapnik. On the theory of learning with privileged information. In Advances in Neural Information Processing Systems, pp. 1894–1902, 2010.
Pechyony et al. (2010) Dmitry Pechyony, Rauf Izmailov, Akshay Vashist, and Vladimir Vapnik. SMO-style algorithms for learning using privileged information. In DMIN, pp. 235–241, 2010.
Ranjan et al. (2015) Viresh Ranjan, Nikhil Rasiwasia, and C. V. Jawahar. Multi-label cross-modal retrieval. In The IEEE International Conference on Computer Vision (ICCV), December 2015.

Read et al. (2011) Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
 Sharmanska et al. (2013) Viktoriia Sharmanska, Novi Quadrianto, and Christoph H Lampert. Learning to rank using privileged information. In Proceedings of the IEEE International Conference on Computer Vision, pp. 825–832, 2013.
Sun et al. (2014) Fuming Sun, Jinhui Tang, Haojie Li, Guo-Jun Qi, and Thomas S. Huang. Multi-label image categorization with sparse factor representation. IEEE Transactions on Image Processing, 23(3):1028–1037, 2014.
 Tai & Lin (2012) Farbound Tai and HsuanTien Lin. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508–2542, 2012.
Takac et al. (2015) Martin Takac, Peter Richtarik, and Nathan Srebro. Distributed mini-batch SDCA. arXiv preprint arXiv:1507.08322, 2015.

Tsoumakas et al. (2010) Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pp. 667–685. Springer, 2010.

Tsoumakas et al. (2011) Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079–1089, 2011.
 Vapnik & Izmailov (2015) Vladimir Vapnik and Rauf Izmailov. Learning using privileged information: Similarity control and knowledge transfer. Journal of Machine Learning Research, 16:2023–2049, 2015.
 Vapnik & Vashist (2009) Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22(5):544–557, 2009.
 Vapnik et al. (2009) Vladimir Vapnik, Akshay Vashist, and Natalya Pavlovitch. Learning using hidden information (learning with teacher). In 2009 International Joint Conference on Neural Networks, pp. 3188–3195. IEEE, 2009.
Wang et al. (2016) Jiang Wang, Yi Yang, Junhua Mao, Zhiheng Huang, Chang Huang, and Wei Xu. CNN-RNN: A unified framework for multi-label image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

Xu et al. (2016) Xinxing Xu, Joey Tianyi Zhou, Ivor W. Tsang, Zheng Qin, Rick Siow Mong Goh, and Yong Liu. Simple and efficient learning using privileged information. arXiv preprint arXiv:1604.01518, 2016.
Yang et al. (2009) Bishan Yang, Jian-Tao Sun, Tengjiao Wang, and Zheng Chen. Effective multi-label active learning for text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 917–926. ACM, 2009.

Yang et al. (2016) Hao Yang, Joey Tianyi Zhou, Yu Zhang, Bin-Bin Gao, Jianxin Wu, and Jianfei Cai. Exploit bounding box annotations for multi-label object recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
Yu et al. (2014) Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning, pp. 593–601, 2014.

Zhang & Wu (2015) Min-Ling Zhang and Lei Wu. Lift: Multi-label learning with label-specific features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(1):107–120, 2015.

Zhang & Schneider (2011) Yi Zhang and Jeff G. Schneider. Multi-label output codes using canonical correlation analysis. In International Conference on Artificial Intelligence and Statistics, pp. 873–882, 2011.
Appendix A Derivation of Eq.(6)'s dual problem Eq.(7)
Without loss of generality, the objective of Eq.(6) can be rewritten as with . Then its Lagrangian function is defined as
where are the Lagrange multipliers. Setting the derivatives of with respect to and to zero, we have
Plugging these back into the Lagrangian function, we obtain
Denote , and . Besides, let with the th element being , and with . Thus the dual problem of Eq.(6) is formulated as
with the constraints , i.e. where denotes the Hadamard (element-wise) product of two vectors or matrices, and denotes the all-ones vector.
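The steps above follow the standard Lagrangian-duality recipe: form the Lagrangian, eliminate the primal variables via the stationarity conditions, and substitute back. Since the displayed equations of Eq.(6) are not reproduced here, the recipe can be illustrated on a generic soft-margin SVM analogue (an assumption for illustration, not the paper's exact problem):

```latex
% Illustrative primal (soft-margin SVM analogue):
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0.

% Lagrangian with multipliers \alpha_i \ge 0, \mu_i \ge 0:
L = \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i
    - \sum_i \alpha_i\bigl(y_i(w^\top x_i + b) - 1 + \xi_i\bigr)
    - \sum_i \mu_i \xi_i.

% Stationarity conditions:
% \partial_w L = 0 \Rightarrow w = \sum_i \alpha_i y_i x_i,
% \partial_b L = 0 \Rightarrow \sum_i \alpha_i y_i = 0,
% \partial_{\xi_i} L = 0 \Rightarrow \alpha_i + \mu_i = C.

% Substituting back yields the dual:
\max_{\alpha}\ \sum_i \alpha_i
  - \tfrac{1}{2}\sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, x_i^\top x_j
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0.
```

The box constraint arises exactly as in Eq.(7): eliminating the multiplier of the non-negativity constraint via the stationarity condition caps the remaining dual variable.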
Appendix B Derivation of Eq.(10)'s dual problem Eq.(11)
Similarly, the Lagrangian function of Eq.(10) is defined as
where are the Lagrange multipliers. Setting the derivatives of with respect to and to zero, we have
Plugging these back into the Lagrangian function, we obtain
where and is the indicator function. To make the dual problem of Eq.(10) consistent with Eq.(7), we define a bijection as the row-based vectorization index mapping, i.e. . In a nutshell, we order the constraints (and the corresponding multipliers) according to the "first label, then training points" principle, i.e. "first , then ". Let , and , where denotes the row-based vectorization operation. Moreover, let
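The row-based vectorization bijection can be sketched as follows; the function names and the 0-based indexing are illustrative assumptions, with `num_labels` playing the role of the number of labels per example:

```python
def iota(i, l, num_labels):
    """Row-based vectorization index: (example i, label l) -> flat index.

    Indices are 0-based. The labels of example 0 come first, then the
    labels of example 1, and so on ("first label, then training points"):
    within each example, the label index varies fastest.
    """
    return i * num_labels + l


def iota_inv(k, num_labels):
    """Inverse bijection: flat index -> (example i, label l)."""
    return divmod(k, num_labels)
```

For instance, with 4 labels per example, the multiplier for example 1 and label 2 sits at flat position `1 * 4 + 2 = 6`, and `iota_inv` recovers the pair, so the mapping is a genuine bijection between matrix entries and vector positions.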