1 Introduction
In multilabel learning, an example can be assigned more than one label. This differs from conventional single-label learning, in which each example corresponds to one, and only one, label. Over the past few decades, multilabel learning (Zhang and Zhou, 2013) has been successfully applied to many real-world applications such as text categorization (Gao et al., 2004), image annotation (Wang et al., 2009), and gene function analysis (Barutcuoglu et al., 2006).
A straightforward approach to multilabel learning is to decompose it into a series of binary classification problems for different labels (Tsoumakas et al., 2010). However, this approach can result in poor performance when strong label correlations exist. To improve prediction, a large number of algorithms have been developed that approach the multilabel learning problem from different perspectives, such as the classifier chains algorithm (Read et al., 2011), the max-margin multilabel classifier (Hariharan et al., 2010), probabilistic multilabel learning algorithms (Zhang and Zhang, 2010; Guo and Xue, 2013), correlation learning algorithms (Huang et al., 2012; Bi and Kwok, 2014), and label dependency removal algorithms (Chen and Lin, 2012; Tai and Lin, 2012).
Vapnik's learning theory (Vapnik, 1998) can be used to justify the successful development of multilabel learning (Xu et al., 2013b; Li and Guo, 2013; Doppa et al., 2014). One of the most useful data-dependent complexity measures is Rademacher complexity (Bartlett and Mendelson, 2003), which leads to tighter generalization error bounds than those derived using the VC dimension and covering number. Recently, Yu et al. (2014) proved that the Rademacher complexity of empirical risk minimization (ERM)-based multilabel learning algorithms can be bounded by the trace norm of the multilabel predictors, which provides a theoretical explanation for the effectiveness of using the trace norm for regularization in multilabel learning. On the other hand, minimizing the trace norm over the predictor implicitly exploits the correlations between different labels in multilabel learning.
One shortcoming of the general Rademacher complexity is that it ignores the fact that the hypotheses selected by a learning algorithm usually belong to a more favorable subset of all the hypotheses, and they therefore have better performance than in the worst case. To overcome this drawback, the local Rademacher complexity considers the Rademacher averages of smaller subsets of the hypothesis set. This results in a sharper generalization error bound than that derived using the global Rademacher complexity. Specifically, the generalization error bound derived via the global Rademacher complexity converges at a rate of at most $O(1/\sqrt{n})$, while the bound obtained using the local Rademacher complexity usually converges as fast as $O(1/n)$. We therefore seek to use local Rademacher complexity in the multilabel learning problem and design a new algorithm.
In this paper, we show that the local Rademacher complexities for ERM-based multilabel learning algorithms can be upper-bounded in terms of the tail sum of the singular values of the multilabel predictor. As a result, we are motivated to penalize the tail sum of the singular values of the multilabel predictor in multilabel learning, rather than the sum of all the singular values (i.e., the trace norm). As well as the advantage of producing a sharper generalization error bound, this new constraint over the multilabel predictor achieves better recovery of the low-rank structure of the predictor and effectively exploits the correlations between labels in multilabel learning. The resulting objective function can be efficiently solved using a newly proposed conditional singular value thresholding algorithm. Extensive experiments on real-world datasets validate our theoretical analysis and demonstrate the effectiveness of the new multilabel learning algorithm.
2 Global and Local Rademacher Complexities
In a standard supervised learning setting, a set of training examples $\{(x_i, y_i)\}_{i=1}^{n}$ is sampled i.i.d. from a distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. Let $\mathcal{F}$ be a set of functions mapping $\mathcal{X}$ to $\mathcal{Y}$. The learning problem is to select a function $f \in \mathcal{F}$ such that the expected loss $E[\ell(f(x), y)]$ is small, where $\ell$ is a loss function. Defining $\mathcal{L} = \{\ell_f : (x, y) \mapsto \ell(f(x), y) \mid f \in \mathcal{F}\}$ as the loss class, the learning problem is then equivalent to finding a function $\ell_f \in \mathcal{L}$ with small $E[\ell_f]$. The global Rademacher complexity (Bartlett and Mendelson, 2003) is an effective approach for measuring the richness (complexity) of the function class $\mathcal{F}$. Let $\sigma_1, \dots, \sigma_n$ be independent uniform $\{-1, +1\}$-valued random variables. The global Rademacher complexity of $\mathcal{F}$ is then defined as
$$R_n(\mathcal{F}) = E\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i)\right]. \quad (1)$$
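To make the definition concrete, the following sketch estimates the empirical global Rademacher complexity of a bounded linear class by Monte Carlo. The choice of class, the norm bound `B`, and the closed-form supremum are illustrative assumptions made here, not part of the paper's analysis.

```python
import numpy as np

def rademacher_complexity_linear(X, B=1.0, n_trials=200, seed=0):
    """Monte Carlo estimate of the empirical global Rademacher complexity
    of the linear class {x -> <w, x> : ||w||_2 <= B} on the sample X.
    For this class the supremum over w has the closed form
    (B / n) * || sum_i sigma_i x_i ||_2."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = []
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher variables
        vals.append(B / n * np.linalg.norm(sigma @ X))
    return float(np.mean(vals))

X = np.random.default_rng(1).normal(size=(200, 5))
r_200 = rademacher_complexity_linear(X)        # estimate with n = 200
r_50 = rademacher_complexity_linear(X[:50])    # fewer samples: larger complexity
```

On such a sample the estimate shrinks roughly like $1/\sqrt{n}$, which is the behavior one expects for this class.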
Based on the notion of global Rademacher complexity, the algorithm has a standard generalization error bound, as shown in the following theorem (Bartlett and Mendelson, 2003). Given $\delta \in (0, 1)$, suppose the function $f$ is learned over $n$ training points. Then, with probability at least $1 - \delta$, we have
$$E[\ell_f] \le \hat{E}_n[\ell_f] + 2 R_n(\mathcal{L}) + \sqrt{\frac{\ln(1/\delta)}{2n}}. \quad (2)$$
Since the global Rademacher complexity is of order $O(1/\sqrt{n})$ for various classes used in practice, the generalization error bound in Theorem 2 converges at rate $O(1/\sqrt{n})$. The global Rademacher complexity is a global estimation of the complexity of the function class, and thus it ignores the fact that the algorithm is likely to pick functions with a small error; in particular, only a small subset of the function class will be used.
Instead of using the global Rademacher averages of the entire class as the complexity measure, it is more reasonable to consider the Rademacher complexity of a small subset of the class, e.g., the intersection of the class with a ball centered on the function of interest. Clearly, this local Rademacher complexity (Bartlett et al., 2005) is always smaller than the corresponding global Rademacher complexity, and its formal definition is as follows. For any $r > 0$, the local Rademacher complexity of $\mathcal{F}$ is defined as
$$R_n(\mathcal{F}; r) = R_n\left(\left\{f \in \mathcal{F} : E[f^2] \le r\right\}\right). \quad (3)$$
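As a toy illustration of this definition, the sketch below estimates the local Rademacher complexity of the same kind of bounded linear class. It assumes whitened features (empirical second moment close to the identity), under which the variance constraint reduces to a norm ball on the weight vector; that reduction is a simplifying assumption of this example, not a claim from the paper.

```python
import numpy as np

def local_rademacher_linear(X, r, B=1.0, n_trials=200, seed=0):
    """Monte Carlo estimate of the local Rademacher complexity of the
    linear class {x -> <w, x> : ||w||_2 <= B} intersected with the set
    {f : E_n[f^2] <= r}.  For whitened features the variance constraint
    becomes ||w||_2^2 <= r, so the supremum has the closed form
    min(B, sqrt(r)) * ||(1/n) * sum_i sigma_i x_i||_2."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    radius = min(B, float(np.sqrt(r)))
    vals = [radius / n * np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ X)
            for _ in range(n_trials)]
    return float(np.mean(vals))

X = np.random.default_rng(1).normal(size=(500, 5))
global_rc = local_rademacher_linear(X, r=np.inf)  # large r recovers the global complexity
local_rc = local_rademacher_linear(X, r=0.01)     # small ball: much smaller complexity
```

Shrinking $r$ shrinks the complexity, which is exactly why localization tightens the bound.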
The following theorem describes the generalization error bound based on local Rademacher complexity. Given $\delta \in (0, 1)$, suppose we learn the function $f$ over $n$ training points. Assume that there is some $B > 0$ such that for every $f \in \mathcal{F}$, $E[f^2] \le B\, E[f]$. Then with probability at least $1 - \delta$, we have
$$E[f] \le c_1 \hat{E}_n[f] + c_2 r^* + \frac{c_3 \ln(1/\delta)}{n}, \quad (4)$$
where $r^*$ is the fixed point of the local Rademacher complexity of $\mathcal{F}$ and $c_1, c_2, c_3$ are constants depending on $B$.
By choosing a much smaller class $\mathcal{F}_r \subseteq \mathcal{F}$ with as small a variance as possible, while requiring that the function of interest still lies in $\mathcal{F}_r$, the generalization error bound based on the local Rademacher complexity converges faster than the one based on the global Rademacher complexity, at a rate of up to $O(1/n)$. Once the local Rademacher complexity is known, the generalization error can be bounded in terms of the fixed point of the local Rademacher complexity of $\mathcal{F}$.
3 Local Rademacher Complexity for Multilabel Learning
In this section, we analyze the local Rademacher complexity for multilabel learning and illustrate our motivation for developing a new multilabel learning algorithm.
The multilabel learning model is described by a distribution $P$ on the space of data points $\mathcal{X}$ and labels $\mathcal{Y}$. We receive $n$ training points $\{(x_i, y_i)\}_{i=1}^{n}$ sampled i.i.d. from the distribution $P$, where the $y_i$ are the ground-truth label vectors. Given these training data, we learn a multilabel predictor $f$ by performing ERM as follows:
$$\min_{f \in \mathcal{F}} \hat{E}_n[f] \quad \text{s.t.} \quad \Omega(f) \le \tau, \quad (5)$$
where $\hat{E}_n[f] = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$ is the empirical risk of a multilabel learner $f$, and $\Omega(f) \le \tau$ is some constraint on $f$.
Yu et al. (2014) propose to solve the multilabel learning problem of Eq. (5) by setting the constraint to a trace-norm constraint on the predictor, and provide the corresponding global Rademacher complexity bound
(6) 
This global Rademacher complexity for multilabel learning is of order $O(1/\sqrt{n})$, which is exactly consistent with the general analysis shown in the previous section. Hence, the generalization error bound based on the global Rademacher complexity in (Yu et al., 2014) converges at a rate of $O(1/\sqrt{n})$.
In practice, the hypotheses selected by a learning algorithm usually perform better than the worst case and belong to a more favorable subset of all the hypotheses. Based on this idea, we employ the local Rademacher complexity to measure the complexity of smaller subsets of the hypothesis set, which results in sharper learning bounds and guarantees faster convergence rates. The local Rademacher complexity of the multilabel learning algorithm using Eq. (5) is shown in Theorem 3.
Suppose we learn $W$ over $n$ training points. Let $W = U \Sigma V^\top$ be the singular value decomposition (SVD) of $W$, where $U$ and $V$ are unitary matrices, and $\Sigma$ is the diagonal matrix with the singular values in descending order. Assume that there is some $B > 0$ such that for every $f$, $E[f^2] \le B\, E[f]$. Then, the local Rademacher complexity of the multilabel learning algorithm is bounded by the tail sum of the singular values of $W$.
Using the SVD, $W$ can be rewritten as
(7) 
where $u_j$ and $v_j$ are the column vectors of $U$ and $V$, respectively. Based on the orthogonality of $U$ and $V$, we have the following decomposition
Considering
(8) 
and
(9) 
we have
(10) 
which completes the proof.
According to Theorem 3, the local Rademacher complexity for ERM-based multilabel learning algorithms is determined by the tail sum of the singular values. When $W$ has rank at most $k$, the tail sum $\sum_{j > k} \sigma_j(W)$ vanishes, which leads to a sharper generalization error bound than that based on the global Rademacher complexity.
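A small numerical sketch of why the tail sum matters: for a predictor that is close to low rank, the tail sum from Theorem 3 is far smaller than the trace norm, so a bound driven by the tail sum is much tighter than one driven by the trace norm. The dimensions, rank, and noise level below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, k = 100, 50, 5
# A multilabel predictor with approximate rank k: a rank-k signal plus small noise.
W = rng.normal(size=(d, k)) @ rng.normal(size=(k, L)) + 0.01 * rng.normal(size=(d, L))

sv = np.linalg.svd(W, compute_uv=False)  # singular values, descending
trace_norm = float(sv.sum())             # what the trace-norm bound charges
tail_sum = float(sv[k:].sum())           # what the local (tail-sum) bound charges
```

Here `tail_sum` collects only the small, noise-level singular values, a tiny fraction of `trace_norm`.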
4 Algorithm
In this section, the properties of the local Rademacher complexity discussed above are used to devise a new multilabel learning algorithm.
Each training point has a feature vector $x_i$ and a corresponding label vector $y_i \in \{0, 1\}^L$. If $y_{ij} = 1$, example $x_i$ has the $j$-th label; otherwise, it does not. The multilabel predictor is parameterized as $f(x) = W^\top x$, where $W$ is the predictor matrix. $\ell(\cdot, \cdot)$ is the loss function that computes the discrepancy between the true label vector and the predicted label vector.
The trace norm is an effective approach for modeling and capturing correlations between labels associated with examples, and it has been widely adopted in many multilabel algorithms (Amit et al., 2007; Loeff and Farhadi, 2008; Cabral et al., 2011; Xu et al., 2013a). Within the ERM framework, the objective functions of these methods usually take the form
$$\min_W \sum_{i=1}^{n} \ell(W^\top x_i, y_i) + \lambda \|W\|_{\mathrm{tr}}, \quad (11)$$
where $\|W\|_{\mathrm{tr}}$ is the trace norm of $W$ and $\lambda$ is a constant. In particular, for Problem (11), Yu et al. (2014) proved that the global Rademacher complexity is upper-bounded in terms of the trace norm of the predictor.
As shown in the previous section, however, the tail sum of the singular values of $W$, rather than its trace norm, determines the local Rademacher complexity. Since the local Rademacher complexity can lead to tighter generalization bounds than the global Rademacher complexity, this motivates us to consider the following objective function for the multilabel learning problem:
$$\min_W \sum_{i=1}^{n} \ell(W^\top x_i, y_i) + \lambda \sum_{j > k} \sigma_j(W), \quad (12)$$
where $\sigma_j(W)$ is the $j$-th largest singular value of $W$, and $k$ is a parameter that controls the tail sum. If we use the squared L2 loss function, we get
$$\min_W \|XW - Y\|_F^2 + \lambda \sum_{j > k} \sigma_j(W). \quad (13)$$
In multilabel learning, the multilabel predictor usually has a low-rank structure due to the correlations between multiple labels. The trace norm is regarded as an effective surrogate for rank minimization, since it simultaneously penalizes all the singular values of $W$. However, it may incorrectly keep small singular values that should be zero, or shrink to zero large singular values that should be nonzero. In contrast, our new algorithm directly minimizes over the small singular values, which encourages the low-rank structure.
To understand why the trace norm may fail in rank minimization, we consider a matrix
(14) 
where some entries of the matrix are unknown. Figure 1 plots the trace norm and the proposed new norm of the matrix for all possible completions in a range around the values that minimize its rank. We find that the trace norm attains its optimum at a completion whose singular values do not recover the low-rank ground truth. In contrast, when we constrain only the tail singular values, the optimum is attained at the completion with the correct low-rank structure. Hence, the new norm can successfully discover the low-rank structure, while the trace norm fails in this case.
4.1 Optimization
Starting with Eq. (13) without the norm regularization, we get the following problem,
$$\min_W g(W) := \|XW - Y\|_F^2, \quad (15)$$
where $X$ is the data matrix and $Y$ is the label matrix. The gradient method is a natural approach for solving this problem and generates a sequence of approximate solutions:
$$W_t = W_{t-1} - \frac{1}{\mu_t} \nabla g(W_{t-1}), \quad (16)$$
which can be further reformulated as a proximal regularization of the linearized function at $W_{t-1}$ as
$$W_t = \arg\min_W \; g(W_{t-1}) + \langle \nabla g(W_{t-1}), W - W_{t-1} \rangle + \frac{\mu_t}{2} \|W - W_{t-1}\|_F^2. \quad (17)$$
Since Eq. (17) can be regarded as a linear approximation of the function $g$ at point $W_{t-1}$ regularized by a quadratic proximal term, Problem (13) can be solved by the following iterative step:
$$W_t = \arg\min_W \; \frac{\mu_t}{2} \left\| W - \left( W_{t-1} - \frac{1}{\mu_t} \nabla g(W_{t-1}) \right) \right\|_F^2 + \lambda \sum_{j > k} \sigma_j(W), \quad (18)$$
where the terms in Eq. (17) that do not depend on $W$ are ignored.
Recall that when Problem (18) is instead regularized with the trace norm, Cai et al. (2010) showed that it can be efficiently solved using the singular value thresholding algorithm. Hence, we propose a new conditional singular value thresholding algorithm to handle the newly proposed norm regularization. The solution is summarized in the following theorem.
Let $Z = W_{t-1} - \frac{1}{\mu_t} \nabla g(W_{t-1})$ and let its SVD be $Z = U \Sigma V^\top$, where $U$ and $V$ have orthonormal columns and $\Sigma$ is diagonal. Then, the solution to
$$\min_W \; \frac{1}{2} \|W - Z\|_F^2 + \frac{\lambda}{\mu_t} \sum_{j > k} \sigma_j(W) \quad (19)$$
is given by $W^* = U \Sigma_\lambda V^\top$, where $\Sigma_\lambda$ is diagonal with $(\Sigma_\lambda)_{jj} = \Sigma_{jj}$ for $j \le k$ and $(\Sigma_\lambda)_{jj} = \max(\Sigma_{jj} - \lambda/\mu_t, 0)$ for $j > k$.
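The operator in the theorem above can be sketched directly. This is an illustrative implementation, assuming the standard soft-thresholding form for the tail, applied to a small diagonal matrix so the effect is easy to see.

```python
import numpy as np

def conditional_svt(Z, lam, k):
    """Conditional singular value thresholding: keep the k largest singular
    values of Z intact and soft-threshold the remaining ones by lam.
    This minimizes 0.5 * ||W - Z||_F^2 + lam * sum_{j>k} sigma_j(W)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s_thr = s.copy()
    s_thr[k:] = np.maximum(s_thr[k:] - lam, 0.0)  # shrink only the tail
    return U @ np.diag(s_thr) @ Vt

Z = np.diag([5.0, 3.0, 1.0, 0.4])
W = conditional_svt(Z, lam=0.5, k=2)
sv = np.linalg.svd(W, compute_uv=False)
# The top two singular values stay at 5 and 3, the third shrinks from
# 1.0 to 0.5, and the fourth (0.4 < 0.5) is driven exactly to zero.
```

Unlike plain singular value thresholding, the head singular values are not biased downward at all.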
Assuming that $W^*$ is the optimal solution, zero should be a subgradient of the objective function at the point $W^*$,
(20) 
where $\partial \Omega(W^*)$ denotes the set of subgradients of the new norm regularization at $W^*$. Letting $W^* = U \Sigma_\lambda V^\top$, we have
(21) 
where $I_k$ is obtained by setting to zero the diagonal entries of the identity matrix with indices greater than $k$. Write the SVD of $Z$ as
(22)
where $U_1$ and $V_1$ are the singular vectors associated with singular values greater than the threshold, while $U_2$ and $V_2$ correspond to those smaller than or equal to it. With these definitions, we have
(23) 
and thus,
(24) 
where the thresholded diagonal matrix is defined as in the statement of the theorem. This completes the proof. It turns out that Problem (18) can be solved by first computing the SVD of $Z$, and then applying the conditional thresholding to the singular values. By exploiting the structure of the newly proposed norm regularization, the convergence rate of the resulting algorithm is expected to be the same as that of the gradient method. The whole optimization procedure is shown in Algorithm 1.
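Putting the pieces together, the following is a minimal proximal-gradient sketch of the overall procedure for the squared-loss objective in Eq. (13). The fixed step size, iteration count, and synthetic data are illustrative assumptions; this is not the paper's Algorithm 1 verbatim.

```python
import numpy as np

def conditional_svt(Z, lam, k):
    """Keep the top-k singular values of Z; soft-threshold the tail by lam."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s[k:] = np.maximum(s[k:] - lam, 0.0)
    return (U * s) @ Vt

def lrml_fit(X, Y, lam=0.1, k=3, n_iter=500):
    """Proximal gradient for 0.5*||XW - Y||_F^2 + lam * sum_{j>k} sigma_j(W):
    each iteration takes a gradient step on the smooth loss, then applies
    conditional singular value thresholding to the result."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y)             # gradient of the squared loss
        W = conditional_svt(W - step * grad, lam * step, k)
    return W

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
W_true = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 10))  # rank-3 ground truth
Y = X @ W_true
W_hat = lrml_fit(X, Y)
rel_err = float(np.linalg.norm(X @ W_hat - Y) / np.linalg.norm(Y))
```

On this synthetic rank-3 problem the iterate recovers the low-rank predictor to small relative error, since a rank-3 solution incurs zero loss and zero tail penalty.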
5 Experiments
In this section, we evaluate our proposed algorithm on four datasets from diverse applications: bibtex and delicious for tag recommendation, yeast for gene function prediction, and corel5k for image classification. All these datasets were obtained from Mulan's website (http://mulan.sourceforge.net/datasets.html) and were pre-separated into training and test sets. Detailed information about these datasets is shown in Table 1.
Dataset  domain  # instances  # attributes  # labels  cardinality  density  # distinct 

bibtex  text  7395  1836  159  2.402  0.015  2856 
delicious  web  16105  500  983  19.020  0.019  15806 
yeast  biology  2417  103  14  4.237  0.303  198 
corel5k  images  5000  499  374  3.522  0.009  3175 
The details of the competing methods are summarized as follows:

LRML (Local Rademacher complexity Multilabel Learning): our proposed method. The optimal value of the regularization parameter was chosen on a validation set.

MLtrace: solves the multilabel learning problem based on Eq. (5) with the squared loss and trace norm.

MLFro: solves the multilabel learning problem based on Eq. (5) with the squared loss and Frobenius norm.

LEML: the method proposed in (Yu et al., 2014), which decomposes the trace norm into the Frobenius norms of two lowrank matrices.
Given a test set, we used four criteria to evaluate the performance of the multilabel predictor:

Average precision: evaluates the average fraction of relevant labels ranked higher than a particular label.

Top-K accuracy: for each example, the K labels with the largest decision values were selected as predictions. The average accuracy over all the examples is reported.

Hamming loss: measures the overall classification error as the fraction of misclassified example-label pairs.

Average AUC: the area under the ROC curve for each example was measured, and the average AUC of all the test examples is reported.
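For concreteness, here is a small sketch of two of these criteria on a toy label matrix. The exact conventions used in the paper's tables (e.g., tie-breaking for equal scores) are not specified, so these helpers are illustrative.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of example-label pairs that are predicted incorrectly."""
    return float(np.mean(Y_true != Y_pred))

def top_k_accuracy(Y_true, scores, k):
    """For each example, select the k labels with the largest decision
    values and report the mean fraction of selected labels that are relevant."""
    top = np.argsort(-scores, axis=1)[:, :k]
    hits = np.take_along_axis(Y_true, top, axis=1)
    return float(hits.mean())

Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 1, 0]])
scores = np.array([[0.9, 0.1, 0.8, 0.2],
                   [0.3, 0.7, 0.6, 0.1]])
hl = hamming_loss(Y_true, (scores > 0.5).astype(int))  # thresholded predictions
top2 = top_k_accuracy(Y_true, scores, k=2)
```

On this toy input every thresholded prediction is correct and both top-2 picks per row are relevant labels.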
Top-1 Accuracy  

bibtex  delicious  
LRML  LEML  CPLST  LRML  LEML  CPLST  
0.6002  0.5833  0.5555  0.6713  0.6716  0.6653  
0.6123  0.6099  0.5463  0.6708  0.6666  0.6625  
0.6330  0.6199  0.5753  0.6701  0.6628  0.6622  
0.6412  0.6394  0.5976  0.6688  0.6625  0.6622  
Top-3 Accuracy  
bibtex  delicious  
LRML  LEML  CPLST  LRML  LEML  CPLST  
0.3520  0.3416  0.3199  0.6297  0.6120  0.6113  
0.3713  0.3653  0.3453  0.6288  0.6123  0.6108  
0.3892  0.3800  0.3601  0.6300  0.6115  0.6109  
0.3912  0.3858  0.3675  0.6290  0.6113  0.6109  
Top-5 Accuracy  
bibtex  delicious  
LRML  LEML  CPLST  LRML  LEML  CPLST  
0.2511  0.2449  0.2311  0.5690  0.5646  0.5630  
0.2718  0.2684  0.2496  0.5688  0.5639  0.5628  
0.2812  0.2766  0.2607  0.5701  0.5628  0.5623  
0.2856  0.2820  0.2647  0.5703  0.5627  0.5623  
Hamming Loss  
bibtex  delicious  
LRML  LEML  CPLST  LRML  LEML  CPLST  
0.0129  0.0126  0.0127  0.0175  0.0181  0.0182  
0.0119  0.0124  0.0126  0.0178  0.0181  0.0182  
0.0118  0.0123  0.0125  0.0178  0.0182  0.0182  
0.0118  0.0123  0.0125  0.0178  0.0182  0.0182  
Average AUC  
bibtex  delicious  
LRML  LEML  CPLST  LRML  LEML  CPLST  
0.9023  0.8910  0.8657  0.8812  0.8854  0.8833  
0.9056  0.9015  0.8802  0.8897  0.8827  0.8814  
0.9055  0.9040  0.8854  0.8901  0.8814  0.8834  
0.9049  0.9035  0.8882  0.8895  0.8814  0.8834 
5.1 Evaluations over Different Norms
We first compared the proposed LRML algorithm with the MLtrace and MLFro algorithms on the four datasets; the results are reported in Figure 2. For LRML, we varied $k$ to examine the influence of the number of constrained singular values. As shown in Figure 2, when $k$ is small, the performance of LRML is limited, because the rank of the learned multilabel predictor is too low to cover all the information from the different labels. With increasing $k$, LRML discovers the optimal multilabel predictor with an appropriate rank and achieves stable performance. Compared to MLtrace, which constrains all the singular values, the best LRML performance usually offers a further improvement by tightly approximating the rank of the multilabel predictor. Moreover, it is important to note that MLFro achieves performance comparable to LRML and MLtrace on the yeast dataset. This is not surprising, since this dataset has a relatively small number of labels, and thus the influence of the low-rank structure is limited.
5.2 Evaluations over Lowrank Structures
We next compared the LRML algorithm with the current state-of-the-art LEML and CPLST algorithms. Since all three of these algorithms are implicitly or explicitly designed to exploit the low-rank structure, we assessed their performance using varying dimensionality reduction ratios. The results are summarized in Table 2.
LRML either improves on, or has comparable performance with, the other methods in nearly all settings. Although these algorithms all focus on the low-rank structure of the predictor in multilabel learning, they study and discover it from different perspectives. LEML factorizes the trace norm by introducing two low-rank matrices, while CPLST learns the multilabel predictor by first learning an encoder to encode the multiple labels. Compared to the proposed LRML algorithm, which directly constrains the tail singular values of the predictor to obtain a low-rank structure, both LEML and CPLST increase the number of free parameters in pursuing the low rank; as a result, their optimization complexity increases substantially with large numbers of labels. In addition, since the objective function of LRML explicitly minimizes the local Rademacher complexity, it leads to a tight generalization error bound and guarantees stable performance on unseen examples.
To investigate the convergence behavior of LRML, we plot the objective values of LRML on the yeast dataset in Figure 3. We observe that LRML converges quickly in all cases, which confirms that the proposed conditional singular value thresholding algorithm can efficiently solve LRML.
6 Conclusion
In this paper, we use the principle of local Rademacher complexity to guide the design of a new multilabel learning algorithm. We analyze the local Rademacher complexity of ERM-based multilabel learning algorithms and discover that it is upper-bounded by the tail sum of the singular values of the multilabel predictor. Inspired by this local Rademacher complexity bound, we propose a new multilabel learning algorithm that concentrates solely on the tail singular values of the predictor, rather than on all the singular values as with the trace norm. This use of the local Rademacher complexity results in a sharper generalization error bound; moreover, the new constraint over the tail singular values provides a tighter approximation of the low-rank structure than the trace norm. The experimental results demonstrate the effectiveness of the proposed algorithm in discovering the low-rank structure and its generalizability for multilabel prediction.
References

Amit et al. (2007) Yonatan Amit, Michael Fink, Nathan Srebro, and Shimon Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, pages 17–24. ACM, 2007.
Bartlett and Mendelson (2003) Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.
Bartlett et al. (2005) Peter L. Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, pages 1497–1537, 2005.
Barutcuoglu et al. (2006) Zafer Barutcuoglu, Robert E. Schapire, and Olga G. Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830–836, 2006.

Bi and Kwok (2014) Wei Bi and James T. Kwok. Multilabel classification with label correlations and missing labels. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2014.
Cabral et al. (2011) Ricardo S. Cabral, Fernando De la Torre, João P. Costeira, and Alexandre Bernardino. Matrix completion for multi-label image classification. In Advances in Neural Information Processing Systems, pages 190–198, 2011.
Cai et al. (2010) Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
Chen and Lin (2012) Yao-Nan Chen and Hsuan-Tien Lin. Feature-aware label space dimension reduction for multi-label classification. In Advances in Neural Information Processing Systems, pages 1529–1537, 2012.
Doppa et al. (2014) Janardhan Rao Doppa, Jun Yu, Chao Ma, Alan Fern, and Prasad Tadepalli. HC-Search for multi-label prediction: An empirical study. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2014.
Gao et al. (2004) Sheng Gao, Wen Wu, Chin-Hui Lee, and Tat-Seng Chua. A MFoM learning approach to robust multiclass multi-label text categorization. In Proceedings of the Twenty-First International Conference on Machine Learning, page 42. ACM, 2004.
Guo and Xue (2013) Yuhong Guo and Wei Xue. Probabilistic multi-label classification with sparse feature learning. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 1373–1379. AAAI Press, 2013.
Hariharan et al. (2010) Bharath Hariharan, Lihi Zelnik-Manor, Manik Varma, and S. V. N. Vishwanathan. Large scale max-margin multi-label classification with priors. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 423–430, 2010.
Huang et al. (2012) Sheng-Jun Huang and Zhi-Hua Zhou. Multi-label learning by exploiting label correlations locally. In AAAI, 2012.
Li and Guo (2013) Xin Li and Yuhong Guo. Active learning with multi-label SVM classification. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 1479–1485. AAAI Press, 2013.
Loeff and Farhadi (2008) Nicolas Loeff and Ali Farhadi. Scene discovery by matrix factorization. In Computer Vision–ECCV 2008, pages 451–464. Springer, 2008.
Read et al. (2011) Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333–359, 2011.
Tai and Lin (2012) Farbound Tai and Hsuan-Tien Lin. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508–2542, 2012.
Tsoumakas et al. (2010) Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer, 2010.
Vapnik (1998) Vladimir N. Vapnik. Statistical Learning Theory, volume 2. Wiley, New York, 1998.

Wang et al. (2009) Changhu Wang, Shuicheng Yan, Lei Zhang, and Hong-Jiang Zhang. Multi-label sparse coding for automatic image annotation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1643–1650. IEEE, 2009.
Xu et al. (2013a) Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems, pages 2301–2309, 2013a.
Xu et al. (2013b) Miao Xu, Yu-Feng Li, and Zhi-Hua Zhou. Multi-label learning with PRO loss. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2013b.
Yu et al. (2014) Hsiang-Fu Yu, Prateek Jain, and Inderjit S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning, 2014.
Zhang and Zhou (2013) Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 2013.
Zhang and Zhang (2010) Min-Ling Zhang and Kun Zhang. Multi-label learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 999–1008. ACM, 2010.