In multi-label learning, an example can be assigned more than one label. This differs from conventional single-label learning, in which each example corresponds to one, and only one, label. Over the past few decades, multi-label learning (Zhang and Zhou, 2013) has been successfully applied to many real-world applications such as text categorization (Gao et al., 2004), image annotation (Wang et al., 2009), and gene function analysis (Barutcuoglu et al., 2006).
A straightforward approach to multi-label learning is to decompose it into a series of binary classification problems, one for each label (Tsoumakas et al., 2010). However, this approach can result in poor performance when strong label correlations exist. To improve prediction, a large number of algorithms have been developed that approach the multi-label learning problem from different perspectives, such as the classifier chains algorithm (Read et al., 2011), the max-margin multi-label classifier (Hariharan et al., 2010), probabilistic multi-label learning algorithms (Zhang and Zhang, 2010; Guo and Xue, 2013), correlation learning algorithms (Huang et al., 2012; Bi and Kwok, 2014), and label dependency removal algorithms (Chen and Lin, 2012; Tai and Lin, 2012).
Vapnik’s learning theory (Vapnik, 1998) can be used to justify the successful development of multi-label learning (Xu et al., 2013b; Li and Guo, 2013; Doppa et al., 2014). One of the most useful data-dependent complexity measures is Rademacher complexity (Bartlett and Mendelson, 2003), which leads to tighter generalization error bounds than those derived using the VC dimension and covering number. Recently, Yu et al. (2014) proved that the Rademacher complexity of empirical risk minimization (ERM)-based multi-label learning algorithms can be bounded by the trace norm of the multi-label predictor, which provides a theoretical explanation for the effectiveness of trace-norm regularization in multi-label learning. Moreover, minimizing the trace norm of the predictor implicitly exploits the correlations between different labels.
One shortcoming of the global Rademacher complexity is that it ignores the fact that the hypotheses selected by a learning algorithm usually belong to a more favorable subset of all hypotheses, and therefore perform better than in the worst case. To overcome this drawback, the local Rademacher complexity considers the Rademacher averages of smaller subsets of the hypothesis set. This results in a sharper generalization error bound than that derived using the global Rademacher complexity. Specifically, the generalization error bound derived from the global Rademacher complexity converges at a rate of at most $O(1/\sqrt{n})$, while the bound obtained using the local Rademacher complexity can converge as fast as $O(1/n)$. We therefore seek to use the local Rademacher complexity in the multi-label learning problem and to design a new algorithm.
In this paper, we show that the local Rademacher complexities of ERM-based multi-label learning algorithms can be upper-bounded in terms of the tail sum of the singular values of the multi-label predictor. We are therefore motivated to penalize the tail sum of the singular values of the predictor, rather than the sum of all the singular values (i.e., the trace norm). In addition to producing a sharper generalization error bound, this new constraint achieves better recovery of the low-rank structure of the predictor and effectively exploits the correlations between labels. The resulting objective function can be efficiently solved using a newly proposed conditional singular value thresholding algorithm. Extensive experiments on real-world datasets validate our theoretical analysis and demonstrate the effectiveness of the new multi-label learning algorithm.
2 Global and Local Rademacher Complexities
In a standard supervised learning setting, a set of training examples $\{(x_i, y_i)\}_{i=1}^{n}$ is sampled i.i.d. from a distribution $P$ over $\mathcal{X} \times \mathcal{Y}$. Let $\mathcal{H}$ be a set of functions mapping $\mathcal{X}$ to $\mathcal{Y}$. The learning problem is to select a function $h \in \mathcal{H}$ such that the expected loss $\mathbb{E}_{(x,y) \sim P}[\ell(h(x), y)]$ is small, where $\ell$ is a loss function. Defining $\mathcal{F} = \{(x, y) \mapsto \ell(h(x), y) : h \in \mathcal{H}\}$ as the loss class, the learning problem is then equivalent to finding a function $f \in \mathcal{F}$ with small expected value $Pf = \mathbb{E}[f(x, y)]$.
Global Rademacher complexity (Bartlett and Mendelson, 2003) is an effective approach for measuring the richness (complexity) of the function class $\mathcal{F}$. Let $\sigma_1, \ldots, \sigma_n$ be independent uniform $\{\pm 1\}$-valued (Rademacher) random variables. The global Rademacher complexity of $\mathcal{F}$ is then defined as
$$\mathcal{R}(\mathcal{F}) = \mathbb{E}\left[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i, y_i)\right].$$
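For intuition, the supremum inside this expectation can be estimated by Monte Carlo when the function class is finite. The sketch below is illustrative only (the function name and setup are ours, not the paper's): each row of `preds` holds one function's values on the sample, and we average the supremum over random sign draws.

```python
import numpy as np

def empirical_rademacher(preds, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    E_sigma[ sup_f (1/n) sum_i sigma_i f(x_i) ] for a finite class.

    preds: (m, n) array; row j holds the j-th function's values on the
    n sample points.
    """
    rng = np.random.default_rng(seed)
    m, n = preds.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher variables
        total += np.max(preds @ sigma) / n       # sup over the finite class
    return total / n_draws
```

For the two-function class $\{f, -f\}$ with $f \equiv 1$, this estimate behaves like $\mathbb{E}|\sum_i \sigma_i|/n \approx \sqrt{2/(\pi n)}$, illustrating the $O(1/\sqrt{n})$ scaling discussed below.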
Based on the notion of global Rademacher complexity, the algorithm has a standard generalization error bound, as shown in the following theorem (Bartlett and Mendelson, 2003). Given $\delta \in (0, 1)$, suppose the function $f \in \mathcal{F}$, taking values in $[0, 1]$, is learned over $n$ training points. Then, with probability at least $1 - \delta$, we have
$$Pf \le P_n f + 2\mathcal{R}(\mathcal{F}) + \sqrt{\frac{\ln(1/\delta)}{2n}},$$
where $P_n f = \frac{1}{n}\sum_{i=1}^{n} f(x_i, y_i)$ denotes the empirical average of $f$.
Since the global Rademacher complexity is of order $O(1/\sqrt{n})$ for various classes used in practice, the generalization error bound above converges at rate $O(1/\sqrt{n})$. The global Rademacher complexity is a global estimate of the complexity of the function class, and thus it ignores the fact that the algorithm is likely to pick functions with small error; in particular, only a small subset of the function class will actually be used.
Instead of using the global Rademacher averages of the entire class as the complexity measure, it is more reasonable to consider the Rademacher complexity of a small subset of the class, e.g., the intersection of the class with a ball centered on the function of interest. Clearly, this local Rademacher complexity (Bartlett et al., 2005) is always smaller than the corresponding global Rademacher complexity. Its formal definition is as follows: for any $r > 0$, the local Rademacher complexity of $\mathcal{F}$ is defined as
$$\mathcal{R}(\mathcal{F}; r) = \mathbb{E}\left[\sup_{f \in \mathcal{F}:\, Pf^2 \le r} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(x_i, y_i)\right].$$
The following theorem describes the generalization error bound based on the local Rademacher complexity. Given $\delta \in (0, 1)$, suppose we learn the function $f \in \mathcal{F}$ over $n$ training points. Assume that there is some $r > 0$ such that for every $f \in \mathcal{F}$, $Pf^2 \le r$. Then, with probability at least $1 - \delta$, we have
$$Pf \le P_n f + c_1\, \mathcal{R}(\mathcal{F}; r) + \sqrt{\frac{2 r \ln(1/\delta)}{n}} + \frac{c_2 \ln(1/\delta)}{n}$$
for universal constants $c_1, c_2 > 0$.
By choosing a much smaller class $\{f \in \mathcal{F} : Pf^2 \le r\}$ with as small a variance as possible, while requiring that the function of interest still lies in it, the generalization error bound based on the local Rademacher complexity converges faster than the global one, at a rate of up to $O(1/n)$. Once the local Rademacher complexity is known, the excess risk can be bounded in terms of the fixed point of the local Rademacher complexity of $\mathcal{F}$.
3 Local Rademacher Complexity for Multi-label Learning
In this section, we analyze the local Rademacher complexity for multi-label learning and illustrate our motivation for developing a new multi-label learning algorithm.
The multi-label learning model is described by a distribution $P$ on the space of data points $\mathcal{X} \subseteq \mathbb{R}^d$ and label vectors $\mathcal{Y} \subseteq \{0, 1\}^L$. We receive $n$ training points $\{(x_i, y_i)\}_{i=1}^{n}$ sampled i.i.d. from the distribution $P$, where the $y_i \in \{0, 1\}^L$ are the ground-truth label vectors. Given these training data, we learn a multi-label predictor $W$ by performing ERM as follows:
$$\min_{W} \hat{L}(W) \quad \text{s.t.} \quad \Omega(W) \le \tau,$$
where $\hat{L}(W)$ is the empirical risk of a multi-label learner $W$, and $\Omega(W)$ is some constraint on $W$.
The global Rademacher complexity for multi-label learning is of order $O(1/\sqrt{n})$, which is exactly consistent with the general analysis in the previous section. Hence, the generalization error bound based on the global Rademacher complexity in (Yu et al., 2014) converges at rate $O(1/\sqrt{n})$.
In practice, the hypotheses selected by a learning algorithm usually have better performance than the worst case and belong to a more favorable subset of all the hypotheses. Based on this idea, we employ the local Rademacher complexities to measure the complexity of smaller subsets of the hypothesis set, which results in sharper learning bounds and guarantees faster convergence rates. The local Rademacher complexity of the multi-label learning algorithm using Eq. (5) is shown in Theorem 3.
Suppose we learn $W$ over $n$ training points. Let $W = U \Sigma V^\top$ be the SVD of $W$, where $U$ and $V$ are unitary matrices and $\Sigma$ is the diagonal matrix with singular values $\sigma_1 \ge \sigma_2 \ge \cdots$ in descending order. Assume that there is some $r > 0$ such that for every $f$ in the loss class, $Pf^2 \le r$. Then, the local Rademacher complexity of the class is upper-bounded in terms of the tail sum of the singular values, $\sum_{j > k} \sigma_j$.
Considering the SVD $W = U \Sigma V^\top$, $W$ can be rewritten as
$$W = \sum_{j} \sigma_j u_j v_j^\top,$$
where $u_j$ and $v_j$ are the column vectors of $U$ and $V$, respectively. Based on the orthogonality of $U$ and $V$, we have the decomposition
$$W = \sum_{j \le k} \sigma_j u_j v_j^\top + \sum_{j > k} \sigma_j u_j v_j^\top,$$
in which the second term is controlled by the tail sum $\sum_{j > k} \sigma_j$, which completes the proof.
According to Theorem 3, the local Rademacher complexity of ERM-based multi-label learning algorithms is determined by the tail sum of the singular values. When the tail sum $\sum_{j > k} \sigma_j$ is small, i.e., when the predictor is approximately low-rank, the local Rademacher complexity is small, which leads to a sharper generalization error bound than that based on the global Rademacher complexity.
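The tail sum is easy to compute and makes the low-rank intuition concrete. The short sketch below (our illustration, not from the paper) shows that a predictor whose columns are driven by a few shared factors has an essentially zero tail sum, while an unstructured predictor does not; taking the tail from index 0 recovers the trace norm.

```python
import numpy as np

def tail_sum(W, k):
    """Sum of the singular values of W beyond the k largest."""
    s = np.linalg.svd(W, compute_uv=False)  # returned in descending order
    return float(s[k:].sum())

rng = np.random.default_rng(0)
# Strongly correlated labels -> (numerically) rank-3 predictor ...
low_rank = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))
# ... versus an unstructured, full-rank predictor.
full_rank = rng.standard_normal((50, 20))
```

Here `tail_sum(low_rank, 3)` is numerically zero, whereas `tail_sum(full_rank, 3)` remains large, and `tail_sum(W, 0)` equals the trace norm of `W`.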
In this section, the properties of the local Rademacher complexity discussed above are used to devise a new multi-label learning algorithm.
Each training point has a feature vector $x_i \in \mathbb{R}^d$ and a corresponding label vector $y_i \in \{0, 1\}^L$. If $y_{ij} = 1$, example $i$ has label $j$; otherwise, example $i$ does not have label $j$. The multi-label predictor is parameterized as $h(x) = W^\top x$, where $W \in \mathbb{R}^{d \times L}$. The loss function $\ell$ computes the discrepancy between the true label vector and the predicted label vector.
The trace norm is an effective approach for modeling and capturing correlations between the labels associated with examples, and it has been widely adopted in many multi-label algorithms (Amit et al., 2007; Loeff and Farhadi, 2008; Cabral et al., 2011; Xu et al., 2013a). Within the ERM framework, the objective functions of these algorithms usually take the form
$$\min_{W} \sum_{i=1}^{n} \ell(W^\top x_i, y_i) + \lambda \|W\|_{\mathrm{tr}},$$
where $\|W\|_{\mathrm{tr}}$ denotes the trace norm, i.e., the sum of the singular values of $W$.
As shown in the previous section, however, the tail sum of the singular values of $W$, rather than its trace norm, determines the local Rademacher complexity. Since the local Rademacher complexity leads to tighter generalization bounds than the global Rademacher complexity, we are motivated to consider the following objective function for the multi-label learning problem:
$$\min_{W} \sum_{i=1}^{n} \ell(W^\top x_i, y_i) + \lambda \sum_{j > k} \sigma_j(W),$$
where $\sigma_j(W)$ is the $j$-th largest singular value of $W$, and $k$ is a parameter that controls the tail sum. If we use the squared L2 loss, we get
$$\min_{W} \sum_{i=1}^{n} \|W^\top x_i - y_i\|_2^2 + \lambda \sum_{j > k} \sigma_j(W).$$
In multi-label learning, the multi-label predictor usually has a low-rank structure due to the correlations between multiple labels. The trace norm is regarded as an effective surrogate for rank minimization because it simultaneously penalizes all the singular values of $W$. However, it may incorrectly retain small singular values that should be zero, or shrink toward zero large singular values that should be non-zero. In contrast, our new algorithm directly minimizes only the small singular values, which encourages the low-rank structure.
To understand why the trace norm may fail in rank minimization, we consider a matrix completion example in which two entries are unknown. Figure 1 plots the trace norm and the proposed new norm of the matrix for all possible completions in a range around the values that minimize its rank. We find that the solution minimizing the trace norm retains singular values that should be zero and thus fails to achieve the minimum rank. In contrast, constraining only the tail singular values yields an optimal solution whose small singular values vanish. Hence, the new norm can successfully discover the low-rank structure, while the trace norm fails in this case.
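The same failure mode can be reproduced on an even smaller hypothetical example (ours, not the paper's Figure 1): completing the symmetric off-diagonal entry $x$ of $\begin{pmatrix}1 & x\\ x & 1\end{pmatrix}$, whose singular values are $1+x$ and $1-x$ for $x \in [0, 1]$. The trace norm is flat (equal to 2 everywhere), so it cannot identify the rank-1 completion, while the tail sum with $k = 1$ is uniquely minimized at the rank-1 solution $x = 1$.

```python
import numpy as np

def nuclear_norm(M):
    """Trace (nuclear) norm: sum of all singular values."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

def tail_norm(M, k):
    """Sum of the singular values beyond the k largest."""
    return float(np.linalg.svd(M, compute_uv=False)[k:].sum())

# Sweep the unknown off-diagonal entry x over [0, 1].
xs = np.linspace(0.0, 1.0, 101)
mats = [np.array([[1.0, x], [x, 1.0]]) for x in xs]
trace_vals = np.array([nuclear_norm(M) for M in mats])  # constant: 2
tail_vals = np.array([tail_norm(M, 1) for M in mats])   # equals 1 - x
```

The trace norm gives no signal over this sweep, whereas the tail norm decreases monotonically to its minimum at the rank-1 completion.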
Starting with Eq. (13) without the norm regularization, we get the following problem:
$$\min_{W} g(W) = \|X W - Y\|_F^2,$$
where $X \in \mathbb{R}^{n \times d}$ is the data matrix and $Y \in \{0, 1\}^{n \times L}$ is the label matrix. The gradient method is a natural approach for solving this problem and generates a sequence of approximate solutions:
$$W_t = W_{t-1} - \eta_t \nabla g(W_{t-1}),$$
where $\eta_t$ is the step size. This update can be further reformulated as a proximal regularization of the linearized function $g$ at $W_{t-1}$:
$$W_t = \arg\min_{W}\; \langle W - W_{t-1}, \nabla g(W_{t-1}) \rangle + \frac{1}{2\eta_t} \|W - W_{t-1}\|_F^2,$$
where the terms in Eq. (17) that do not depend on $W$ are ignored.
Recall that when problem (18) is constrained with the trace norm, Cai et al. (2010) showed that it can be efficiently solved using the singular value thresholding algorithm. We therefore propose a new conditional singular value thresholding algorithm to handle the newly proposed norm regularization. The solution is summarized in the following theorem.
Let $C = W_{t-1} - \eta_t \nabla g(W_{t-1})$, with SVD $C = U \Sigma V^\top$, where $U$ and $V$ have orthonormal columns and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots)$ is diagonal. Then, the solution is given by $W_t = U \hat{\Sigma} V^\top$, where $\hat{\Sigma}$ is diagonal with $\hat{\Sigma}_{jj} = \sigma_j$ for $j \le k$ and $\hat{\Sigma}_{jj} = \max(\sigma_j - \lambda, 0)$ for $j > k$.
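In code, this thresholding rule amounts to keeping the $k$ leading singular values untouched and soft-thresholding only the tail. The sketch below is a minimal NumPy rendering of that rule (the function name and the exact placement of the threshold parameter are our reading, not the paper's notation).

```python
import numpy as np

def conditional_svt(C, k, lam):
    """Conditional singular value thresholding (sketch): the k largest
    singular values of C are kept intact; the remaining singular values
    are soft-thresholded by lam."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    s_new = s.copy()
    s_new[k:] = np.maximum(s[k:] - lam, 0.0)  # shrink only the tail
    return U @ np.diag(s_new) @ Vt
```

For example, applying the rule with `k=1` to a matrix with singular values (5, 3, 1) and threshold 2 leaves the leading value at 5 and shrinks the tail to (1, 0); a sufficiently large threshold drives the rank down to `k`.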
Assume that $\hat{W}$ is the optimal solution; then zero should be a subgradient of the objective function at the point $\hat{W}$, where the subdifferential of the new norm regularization is involved. Let $I_k$ be the matrix obtained by setting to zero the diagonal values with indices greater than $k$ in the identity matrix, and split the SVD of $C$ into the parts associated with singular values greater than $\lambda$ and those smaller than or equal to $\lambda$, respectively. With these definitions, one can verify that the conditional thresholding rule satisfies the optimality condition, which completes the proof. It follows that the minimization of problem (18) can be solved by first computing the SVD of $C$ and then applying conditional thresholding to the singular values. By exploiting the structure of the newly proposed norm regularization, the convergence rate of the resulting algorithm is expected to be the same as that of the gradient method. The whole optimization procedure is shown in Algorithm 1.
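Putting the pieces together, the overall procedure alternates a gradient step on the squared loss with the conditional thresholding step. The following is a self-contained sketch under our own assumptions (fixed step size taken from the Lipschitz constant of the smooth part, squared loss), not a definitive rendering of Algorithm 1.

```python
import numpy as np

def conditional_svt(C, k, lam):
    """Keep the k largest singular values; soft-threshold the tail."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    s[k:] = np.maximum(s[k:] - lam, 0.0)
    return U @ np.diag(s) @ Vt

def lrml_fit(X, Y, k, lam, n_iter=500):
    """Proximal gradient sketch for
    min_W (1/n) ||X W - Y||_F^2 + lam * (tail sum of singular values)."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    eta = n / (2.0 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ W - Y) / n       # gradient of the smooth part
        W = conditional_svt(W - eta * grad, k, eta * lam)
    return W
```

On synthetic data generated by a rank-2 predictor, a few hundred iterations with a small regularization weight recover a fit with near-zero residual.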
In this section, we evaluate our proposed algorithm on four datasets from diverse applications: bibtex and delicious for tag recommendation, yeast for gene function prediction, and corel5k for image classification. All these datasets were obtained from the Mulan website (http://mulan.sourceforge.net/datasets.html) and were pre-separated into training and test sets. Detailed information about these datasets is shown in Table 1.
Table 1 reports, for each dataset, the domain, number of instances, number of attributes, number of labels, label cardinality, label density, and number of distinct label sets.
The details of the competing methods are summarized as follows:
LRML (Local Rademacher complexity Multi-label Learning): our proposed method. The optimal value of the regularization parameter was chosen on a validation set.
ML-trace: solves the multi-label learning problem based on Eq. (5) with the squared loss and trace norm.
ML-Fro: solves the multi-label learning problem based on Eq. (5) with the squared loss and Frobenius norm.
LEML: the method proposed in (Yu et al., 2014), which decomposes the trace norm into the Frobenius norms of two low-rank matrices.
Given a test set, we used four criteria to evaluate the performance of the multi-label predictor:
Average precision: evaluates the average fraction of relevant labels ranked higher than a particular label.
Top-K accuracy: for each example, the K labels with the largest decision values were selected for prediction. The average accuracy over all examples is reported.
Hamming loss: measures the overall classification error as the fraction of example-label pairs that are misclassified.
Average AUC: the area under the ROC curve for each example was measured, and the average AUC of all the test examples is reported.
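Two of these criteria are straightforward to implement from label matrices and decision values. The sketch below shows one common reading of each (an assumption on our part; the paper's exact top-K convention may differ, e.g., it may average per-label precision instead of a hit indicator).

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of example-label entries that are misclassified."""
    return float(np.mean(Y_true != Y_pred))

def top_k_accuracy(Y_true, scores, k):
    """One common top-K variant: an example counts as correct if any of
    its k highest-scoring labels is a true label; return the average."""
    top = np.argsort(-scores, axis=1)[:, :k]
    hits = [bool(Y_true[i, top[i]].any()) for i in range(Y_true.shape[0])]
    return float(np.mean(hits))
```

For instance, with two examples and three labels, predicting one of six entries wrong in each row gives a Hamming loss of 1/3, while top-1 accuracy only checks the single highest-scoring label per example.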
5.1 Evaluations over Different Norms
We first compared the proposed LRML algorithm with the ML-trace and ML-Fro algorithms on the four datasets; the results are reported in Figure 2. For LRML, we varied the tail-sum parameter $k$ to examine the influence of the number of constrained singular values. As shown in Figure 2, when $k$ is small, the performance of LRML is limited, because the rank of the learned multi-label predictor is too low to cover all the information from the different labels. As $k$ increases, LRML discovers the optimal multi-label predictor with an appropriate rank and achieves stable performance. Compared to ML-trace, which constrains all the singular values, the best LRML configuration usually offers further improvements by tightly approximating the rank of the multi-label predictor. Moreover, it is worth noting that ML-Fro performs comparably to LRML and ML-trace on the yeast dataset. This is not surprising, since this dataset has a relatively small number of labels, and thus the influence of the low-rank structure is limited.
5.2 Evaluations over Low-rank Structures
We next compared the LRML algorithm with the current state-of-the-art LEML and CPLST algorithms. Since all three algorithms are implicitly or explicitly designed to exploit the low-rank structure, we assessed their performance under varying dimensionality reduction ratios. The results are summarized in Table 2.
LRML either improves on, or is comparable to, the other methods in nearly all settings. Although these algorithms all focus on the low-rank structure of the predictor in multi-label learning, they study and discover it from different perspectives. LEML factorizes the trace norm by introducing two low-rank matrices, while CPLST learns the multi-label predictor by first learning an encoder for the multiple labels. Compared to the proposed LRML algorithm, which directly constrains the tail singular values of the predictor to obtain a low-rank structure, both LEML and CPLST increase the number of free parameters in pursuing the low rank; as a result, their optimization complexity increases substantially with large numbers of labels. In addition, since the objective function of LRML explicitly minimizes the local Rademacher complexity bound, it leads to a tight generalization error bound and guarantees stable performance on unseen examples.
To investigate the convergence behavior of LRML, we plot its objective values on the yeast dataset in Figure 3. We observe that LRML converges quickly in all cases, which confirms that the proposed conditional singular value thresholding algorithm can efficiently solve LRML.
In this paper, we use the principle of local Rademacher complexity to guide the design of a new multi-label learning algorithm. We analyze the local Rademacher complexity of ERM-based multi-label learning algorithms and discover that it is upper-bounded by the tail sum of the singular values of the multi-label predictor. Inspired by this local Rademacher complexity bound, we propose a new multi-label learning algorithm that concentrates solely on the tail singular values of the predictor, rather than on all the singular values as with the trace norm. This use of the local Rademacher complexity results in a sharper generalization error bound; moreover, the new constraint over the tail singular values provides a tighter approximation of the low-rank structure than the trace norm. The experimental results demonstrate the effectiveness of the proposed algorithm in discovering the low-rank structure and its generalizability for multi-label prediction.
- Amit et al. (2007) Yonatan Amit, Michael Fink, Nathan Srebro, and Shimon Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, pages 17–24. ACM, 2007.
- Bartlett and Mendelson (2003) Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463–482, 2003.
- Bartlett et al. (2005) Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local Rademacher complexities. Annals of Statistics, pages 1497–1537, 2005.
- Barutcuoglu et al. (2006) Zafer Barutcuoglu, Robert E Schapire, and Olga G Troyanskaya. Hierarchical multi-label prediction of gene function. Bioinformatics, 22(7):830–836, 2006.
- Bi and Kwok (2014) Wei Bi and James T Kwok. Multilabel classification with label correlations and missing labels. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2014.
- Cabral et al. (2011) Ricardo S Cabral, Fernando Torre, João P Costeira, and Alexandre Bernardino. Matrix completion for multi-label image classification. In Advances in Neural Information Processing Systems, pages 190–198, 2011.
- Cai et al. (2010) Jian-Feng Cai, Emmanuel J Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
- Chen and Lin (2012) Yao-Nan Chen and Hsuan-Tien Lin. Feature-aware label space dimension reduction for multi-label classification. In Advances in Neural Information Processing Systems, pages 1529–1537, 2012.
- Doppa et al. (2014) Janardhan Rao Doppa, Jun Yu, Chao Ma, Alan Fern, and Prasad Tadepalli. Hc-search for multi-label prediction: An empirical study. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI), 2014.
- Gao et al. (2004) Sheng Gao, Wen Wu, Chin-Hui Lee, and Tat-Seng Chua. A MFoM learning approach to robust multiclass multi-label text categorization. In Proceedings of the Twenty-First International Conference on Machine Learning, page 42. ACM, 2004.
- Guo and Xue (2013) Yuhong Guo and Wei Xue. Probabilistic multi-label classification with sparse feature learning. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pages 1373–1379. AAAI Press, 2013.
- Hariharan et al. (2010) Bharath Hariharan, Lihi Zelnik-Manor, Manik Varma, and Svn Vishwanathan. Large scale max-margin multi-label classification with priors. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 423–430, 2010.
- Huang et al. (2012) Sheng-Jun Huang and Zhi-Hua Zhou. Multi-label learning by exploiting label correlations locally. In AAAI, 2012.
- Li and Guo (2013) Xin Li and Yuhong Guo. Active learning with multi-label svm classification. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence, pages 1479–1485. AAAI Press, 2013.
- Loeff and Farhadi (2008) Nicolas Loeff and Ali Farhadi. Scene discovery by matrix factorization. In Computer Vision–ECCV 2008, pages 451–464. Springer, 2008.
- Read et al. (2011) Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. Classifier chains for multi-label classification. Machine learning, 85(3):333–359, 2011.
- Tai and Lin (2012) Farbound Tai and Hsuan-Tien Lin. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508–2542, 2012.
- Tsoumakas et al. (2010) Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data mining and knowledge discovery handbook, pages 667–685. Springer, 2010.
- Vapnik (1998) Vladimir N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
- Wang et al. (2009) Changhu Wang, Shuicheng Yan, Lei Zhang, and Hong-Jiang Zhang. Multi-label sparse coding for automatic image annotation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1643–1650. IEEE, 2009.
- Xu et al. (2013a) Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems, pages 2301–2309, 2013a.
- Xu et al. (2013b) Miao Xu, Li Yu-Feng, and Zhou Zhi-Hua. Multi-label learning with pro loss. In Proceedings of AAAI Conference on Artificial Intelligence (AAAI), 2013b.
- Yu et al. (2014) Hsiang-Fu Yu, Prateek Jain, and Inderjit S Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning, 2014.
- Zhang and Zhou (2013) Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 2013.
- Zhang and Zhang (2010) Min-Ling Zhang and Kun Zhang. Multi-label learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 999–1008. ACM, 2010.