1 Introduction
Multi-label learning [1], which allows an instance to be associated with multiple labels simultaneously, has been applied successfully to various real-world problems, including images [2], texts [3] and biological data [4]. An important issue in multi-label learning is that collecting all labels requires investigating a large number of candidate labels one by one, and thus labels are usually missing in practice due to limited resources.
Multi-label learning with such missing labels, often called weakly supervised multi-label learning (WSML), has been investigated thoroughly [5, 6, 7]. Among these, the most popular methods are based on matrix completion (MC) [5, 8, 9], a technique for completing an (approximately) low-rank matrix whose entries are missing uniformly at random [10, 11].
The MC-based WSML methods mentioned above assume that only the label matrix has missing entries, while the feature matrix is complete. However, in reality, features can also be missing [12]. To deal with such missing features, a naive solution is to first complete the feature matrix using a classical MC technique, and then employ a WSML method to fill the label matrix. However, such a two-step approach may not work well, since recovery of the feature matrix is performed in an unsupervised way (i.e., label information is completely ignored). Thus, when both labels and features are missing, it is desirable to exploit the label information to complete the feature matrix in a supervised way. Following this spirit, [8] proposed concatenating the feature matrix and label matrix into a single big matrix, and employed an MC algorithm to recover both features and labels simultaneously.
In the WSML methods reviewed above, the label matrix is commonly assumed to be of (approximately) low rank, based on the natural observation that labels are correlated in the multi-label setting. However, such a low-rank assumption may not hold in reality, since entries of the label matrix take binary values, which makes the matrix unlikely to be low-rank. Indeed, as we observe in Figure 1, the singular values of the label matrix of the CAL500 data [13] have a heavy tail, so a low-rank assumption on the label matrix may not be reasonable. Another assumption of [8] is that there is a linear relationship between the feature and label matrices. Such an assumption may not hold well on some datasets, where classical multi-label methods learn a nonlinear classifier [14, 15].

In this paper, we propose a method to deal with WSML when both features and annotations are incomplete. Motivated by [16], which learns a low-dimensional shared subspace between labels and features, we assume that there is a latent matrix generating the annotation matrix. More specifically, each entry of the latent matrix is mapped by a nonlinear function to a value in [0, 1], which corresponds to the probability of setting the corresponding entry of the annotation matrix to 1. To capture label correlation, we assume the latent matrix is low-rank. Motivated by the elastic net [17] and one-bit MC [18, 19, 20], we propose a subgradient-based MC method named co-completion (COCO), which recovers the feature matrix and the latent matrix simultaneously. Furthermore, we give a theoretical bound on the recovery quality of COCO. In the experiments, we demonstrate that COCO not only achieves better recovery performance than the baselines, but also yields lower test error when the recovered data are used to train new classifiers.
2 Algorithm
In this section, we first give the formal definition of the studied problem, and then present our learning objective as well as the optimization algorithm.
2.1 Formulation
We assume $X \in \mathbb{R}^{n \times d}$ is the feature matrix, in which $n$ is the number of instances and $d$ is the number of features. There is a label matrix $Y \in \{0, 1\}^{n \times l}$, where $l$ is the number of labels in multi-label learning. We assume that a latent matrix $Z \in \mathbb{R}^{n \times l}$ generates the label matrix $Y$, that is, $\Pr(Y_{ij} = 1) = \sigma(Z_{ij})$. In the following, we assume $\sigma$ is the sigmoid function, i.e., $\sigma(z) = 1 / (1 + e^{-z})$.

Our basic assumption is that $X$ and $Z$ are concatenated into a big matrix $M = [X, Z] \in \mathbb{R}^{n \times (d + l)}$, and $M$ is low-rank. Note that previous work [8] makes two assumptions when using matrix completion to solve a multi-label problem. One is that the label matrix $Y$ has a linear relationship with the feature matrix $X$. However, such an assumption may not hold well on real data; otherwise, there would not exist so many algorithms learning a nonlinear mapping between features and labels. The other assumption is that the label matrix $Y$ is low-rank, motivated by the fact that labels in multi-label learning are correlated and thus only a few factors determine their values. However, we argue that a sparse binary matrix need not be low-rank; instead, we assume that the latent matrix $Z$ generating the labels is low-rank. Thus, in our problem, we assume the concatenation of $X$ and $Z$ forms a low-rank matrix, and we will recover this matrix when entries in both $X$ and $Y$ are missing.
Similarly to previous works [5, 9], we assume that entries are missing uniformly at random. Let $\Omega_X$ and $\Omega_Y$ be the subsets containing the indices of observed entries in $X$ and $Y$, respectively. Based on these notations, we give our learning objective in the following.
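As a hedged illustration of this setup (all sizes, variable names, and the observation probability `rho` below are our own toy choices, not the paper's), the generative model and observation masks can be simulated as:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, l, r = 100, 20, 10, 5          # instances, features, labels, latent rank

# Low-rank concatenated matrix M = [X, Z]: a rank-r product of two factors.
A = rng.uniform(-1, 1, (n, r))
B = rng.uniform(-1, 1, (r, d + l))
M = A @ B
X, Z = M[:, :d], M[:, d:]

# Labels are generated entrywise: P(Y_ij = 1) = sigmoid(Z_ij).
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
Y = (rng.uniform(size=Z.shape) < sigmoid(Z)).astype(int)

# Entries are observed independently with probability rho.
rho = 0.5
omega_X = rng.uniform(size=X.shape) < rho   # observed-entry mask for X
omega_Y = rng.uniform(size=Y.shape) < rho   # observed-entry mask for Y
```

Only the entries of `X` and `Y` flagged by the masks are available to the learner; `Z` itself is never observed.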
2.2 Learning Objective
In our learning objective, we need to consider three factors. The first focuses on the feature matrix. To recover the feature matrix, a classical choice is the Frobenius norm on the observed entries, i.e.,

$$\ell_F(\hat{X}) = \| P_{\Omega_X}(\hat{X} - X) \|_F^2,$$

where $P_\Omega$ denotes the projection that keeps the entries indexed by $\Omega$ and sets all other entries to zero.

Note that the Frobenius norm on matrices corresponds to the L2 norm on vectors, while the trace norm on matrix singular values is similar to the L1 norm on vectors. Motivated by the advantage of the elastic net [17], which uses both the L1 norm and the L2 norm for regularization, we additionally optimize the trace norm of the difference between the recovered feature matrix and the observed feature matrix, i.e., $\| P_{\Omega_X}(\hat{X} - X) \|_{\mathrm{tr}}$.

For the label matrix $Y$, motivated by previous work on one-bit matrix completion [18], we consider the log-likelihood of the observed entries, i.e.,

$$\ell_Y(\hat{Z}) = -\sum_{(i,j) \in \Omega_Y} \Big[ \mathbb{1}(Y_{ij} = 1) \log \sigma(\hat{Z}_{ij}) + \mathbb{1}(Y_{ij} = 0) \log \big(1 - \sigma(\hat{Z}_{ij})\big) \Big],$$

where $\mathbb{1}(\cdot)$ is the indicator function. Note that we minimize the negative log-likelihood instead of maximizing the log-likelihood, so that this term agrees in direction with the other components of the objective.
Taking all the above considerations together, our final learning objective is

$$\min_{\hat{M}} \; \ell_Y(\hat{Z}) + \lambda_1 \| P_{\Omega_X}(\hat{X} - X) \|_F^2 + \lambda_2 \| P_{\Omega_X}(\hat{X} - X) \|_{\mathrm{tr}} + \lambda_3 \| \hat{M} \|_{\mathrm{tr}}, \quad (1)$$

where $\ell_Y(\hat{Z})$ is the negative log-likelihood of the observed labels, $P_{\Omega_X}$ keeps the entries indexed by $\Omega_X$ and zeroes the rest, and $\hat{M}$ is the concatenation of $\hat{X}$ and $\hat{Z}$, i.e., $\hat{M} = [\hat{X}, \hat{Z}]$.
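To make the objective concrete, here is a minimal NumPy sketch that evaluates an objective of this form (negative log-likelihood on observed labels plus Frobenius and trace-norm penalties on observed features and the whole matrix); the function name `coco_objective` and the weight names `lam1`, `lam2`, `lam3` are our own, not from the paper:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def coco_objective(M_hat, X, Y, omega_X, omega_Y, d, lam1=1.0, lam2=1.0, lam3=1.0):
    """Evaluate an Eq (1)-style objective: negative log-likelihood on observed
    labels, plus Frobenius and trace-norm terms on the observed feature
    difference, plus the trace norm of the whole concatenated matrix."""
    X_hat, Z_hat = M_hat[:, :d], M_hat[:, d:]
    diff = np.where(omega_X, X_hat - X, 0.0)      # P_{Omega_X}(X_hat - X)
    frob_term = np.sum(diff ** 2)
    trace_diff = np.linalg.norm(diff, ord="nuc")  # trace (nuclear) norm
    p = sigmoid(Z_hat)
    eps = 1e-12                                   # numerical floor for logs
    nll = -np.sum(np.where(omega_Y,
                           Y * np.log(p + eps) + (1 - Y) * np.log(1 - p + eps),
                           0.0))
    trace_M = np.linalg.norm(M_hat, ord="nuc")
    return nll + lam1 * frob_term + lam2 * trace_diff + lam3 * trace_M
```

The masks `omega_X` and `omega_Y` are boolean arrays marking observed entries, matching the sampling model above.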
2.3 Optimization
Previous deterministic algorithms for trace norm minimization [21, 22] usually assume that the loss function is composed of two parts: a differentiable convex function, and the trace norm on the whole matrix, which is convex but not differentiable. In that case it is easy to find a closed-form solution using algorithms such as proximal gradient descent, because minimizing the trace norm on the whole matrix plus a simple loss function admits a closed-form solution [21]. However, in our problem Eq (1), besides the trace norm on the whole matrix, we also have the trace norm on the submatrix $\hat{X}$, which is a linear transformation of the whole matrix $\hat{M}$. Thus classical methods based on proximal gradient descent cannot be employed directly. We divide the learning objective into two parts and consider each part separately. One part is
$$g(\hat{M}) = \ell_Y(\hat{Z}) + \lambda_1 \| P_{\Omega_X}(\hat{X} - X) \|_F^2 + \lambda_2 \| P_{\Omega_X}(\hat{X} - X) \|_{\mathrm{tr}}, \quad (2)$$

where $\ell_Y(\hat{Z})$ is the negative log-likelihood of the observed labels, $\hat{X} = \hat{M} E$, and $E \in \mathbb{R}^{(d+l) \times d}$ is a selection matrix whose diagonal entries are 1 and whose other entries are 0.
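As a small illustration (a sketch with toy sizes; the name `E` follows the description above), the selection matrix that extracts the feature block from the concatenated matrix can be built as:

```python
import numpy as np

d, l = 4, 3                      # feature and label dimensions (toy sizes)
# E selects the first d columns of the concatenated matrix: X_hat = M_hat @ E.
E = np.zeros((d + l, d))
E[:d, :d] = np.eye(d)            # diagonal entries 1, all other entries 0

M_hat = np.arange(2 * (d + l), dtype=float).reshape(2, d + l)
X_hat = M_hat @ E                # same as slicing the first d columns
```

Writing the slice as a matrix product makes explicit that the submatrix is a linear transformation of the whole matrix, which is what blocks a direct proximal-gradient treatment.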
The other part contains only the trace norm on the whole matrix, $\lambda_3 \| \hat{M} \|_{\mathrm{tr}}$. Note that previous works on stochastic L1-regularized loss minimization [23, 24] first perform gradient descent on the loss function without the L1 part, and then derive a closed-form solution for the L1 part. Motivated by this, we first perform a gradient step on Eq (2) and then obtain a closed-form solution taking the whole-matrix trace norm into consideration.
The first two terms of Eq (2) are convex and differentiable, and their gradients are easy to calculate. To calculate a subgradient of the trace-norm term, we need the following result:

Lemma 1.
(Subgradient of the trace norm [25]) Let $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, and let $A = U \Sigma V^\top$ be a thin singular value decomposition (SVD) of $A$. Then

$$\partial \|A\|_{\mathrm{tr}} = \left\{ U V^\top + W : W \in \mathbb{R}^{m \times n}, \; U^\top W = 0, \; W V = 0, \; \|W\|_2 \le 1 \right\}.$$

In this way, a subgradient of the trace-norm term of Eq (2) with respect to $\hat{M}$ is given by $\lambda_2 P_{\Omega_X}(U_1 V_1^\top) E^\top$, where $U_1 \Sigma_1 V_1^\top$ is a thin SVD of $P_{\Omega_X}(\hat{X} - X)$.
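A minimal NumPy sketch of Lemma 1, taking the $W$ term to be zero (the helper name is ours); the returned matrix $U V^\top$ satisfies $\langle G, A \rangle = \|A\|_{\mathrm{tr}}$ and has spectral norm at most 1:

```python
import numpy as np

def trace_norm_subgradient(A, tol=1e-10):
    """A subgradient of the trace norm at A (Lemma 1 with W = 0):
    U V^T from a thin SVD restricted to the nonzero singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = int(np.sum(s > tol))                 # numerical rank of A
    return U[:, :k] @ Vt[:k, :]
```

Any matrix of this form, plus a suitably constrained `W`, is a valid subgradient; the `W = 0` choice is the standard one in subgradient methods.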
We perform iterative optimization. In the $t$-th iteration, given a subgradient $G_t$ of Eq (2), $\hat{M}$ is first updated by the subgradient step

$$\hat{M}_{t + 1/2} = \hat{M}_t - \eta_t G_t, \quad (3)$$

where $\eta_t$ is the step size. We then have a closed-form solution taking the whole-matrix trace norm into consideration:

$$\hat{M}_{t + 1} = U \max(\Sigma - \eta_t \lambda_3 I, \, 0) V^\top, \quad (4)$$

where $U \Sigma V^\top$ is the SVD of $\hat{M}_{t + 1/2}$ and the maximum is taken entrywise.
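Eq (4) is the standard singular value thresholding (shrinkage) operator; a minimal sketch (function name ours, with `tau` standing for the product of step size and trace-norm weight):

```python
import numpy as np

def svt(M_half, tau):
    """Closed-form solution of an Eq (4)-style step: shrink each singular
    value of the intermediate iterate by tau and floor at zero."""
    U, s, Vt = np.linalg.svd(M_half, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return U @ np.diag(s_shrunk) @ Vt
```

Because the shrinkage zeroes small singular values, each iterate is pushed toward a low-rank matrix, which is exactly the effect the trace-norm term is meant to have.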
We call our proposed method co-completion (COCO) and summarize the whole process in Algorithm 1.
Note that this solution coincides with works on stochastic trace norm minimization [26, 27]. In both works, a random probe matrix is constructed, and the gradient is multiplied by the probe matrix in each iteration to generate a stochastic gradient. In this way, the expectation of the stochastic gradient calculated in each iteration equals the exact gradient, which agrees with the principle of stochastic gradients in ordinary stochastic gradient descent (SGD). [27] provided a theoretical guarantee on the convergence rate for this kind of problem. As their objective is to save space for trace norm minimization, whereas we do not consider the space limitation problem here, we use plain (sub)gradient descent instead of its stochastic counterpart. Nevertheless, their convergence results for SGD can serve as a weak guarantee for the convergence of our algorithm.

3 Theory
In this section, we give a bound for the following optimization problem:

$$\max_{\hat{M}} \; \sum_{(i,j) \in \Omega_Y} \log \Pr\big(Y_{ij} \mid \hat{Z}_{ij}\big) - \lambda_1 \| P_{\Omega_X}(\hat{X} - X) \|_F^2 \quad \text{s.t.} \quad \| \hat{M} \|_{\mathrm{tr}} \le \alpha, \;\; \| P_{\Omega_X}(\hat{X} - X) \|_{\mathrm{tr}} \le \beta. \quad (5)$$

Note that if we change the max operator in Eq (5) to min and change the objective to its additive inverse, we obtain an equivalent problem. We can then use Lagrange multipliers to bring the two inequality constraints into the objective, after which the problem has a form similar to Eq (1). Thus, by appropriately setting the parameters $\lambda_2$ and $\lambda_3$ in Eq (1), the maximization problem Eq (5) and the minimization problem Eq (1) become equivalent.
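As a hedged sketch of this reduction (the multipliers $\mu_1, \mu_2 \ge 0$ are our own labels, and $\alpha, \beta$ denote the bounds on the two trace-norm constraints), the Lagrangian of the minimization form reads:

```latex
\min_{\hat M}\;
  -\sum_{(i,j)\in\Omega_Y} \log \Pr\!\big(Y_{ij}\mid \hat Z_{ij}\big)
  + \lambda_1 \bigl\| P_{\Omega_X}(\hat X - X) \bigr\|_F^2
  + \mu_1 \bigl( \|\hat M\|_{\mathrm{tr}} - \alpha \bigr)
  + \mu_2 \bigl( \bigl\| P_{\Omega_X}(\hat X - X) \bigr\|_{\mathrm{tr}} - \beta \bigr)
```

Identifying $\mu_1$ with $\lambda_3$ and $\mu_2$ with $\lambda_2$, and dropping the constant $-\mu_1 \alpha - \mu_2 \beta$, recovers an objective of the same form as Eq (1).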
We further note that subtracting a constant matrix from the objective does not affect its optimization: maximizing the objective with and without the constant shift under the same constraints results in the same solution $\hat{M}$.
In the following, we derive our theoretical results.
Lemma 2.
Let be
for some , and . Then
where the two factors are universal constants, and the expectations are over both the random choice of the observed entries and the draw of $Y$.
With Lemma 2, we can obtain the following result:
Theorem 1.
Assume that the largest entry of the underlying matrix is bounded. Suppose that the set of observed entries is chosen independently at random following a binomial model, and that $Y$ is generated using the sigmoid model. Let $\hat{M}$ be the solution to the optimization problem Eq (5). Then, with high probability, the stated recovery bound holds, where KL denotes the Kullback-Leibler divergence between two matrices of Bernoulli parameters; for entries $p$ and $q$ it is defined as $p \log(p/q) + (1 - p) \log\big((1 - p)/(1 - q)\big)$, averaged over all entries.
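The averaged Bernoulli KL divergence used in the bound can be computed as follows (a sketch; the clipping by `eps` is our numerical safeguard, not part of the definition):

```python
import numpy as np

def bernoulli_kl_matrix(P, Q, eps=1e-12):
    """Average KL divergence between two matrices of Bernoulli parameters:
    mean over entries of p*log(p/q) + (1-p)*log((1-p)/(1-q))."""
    P = np.clip(P, eps, 1 - eps)
    Q = np.clip(Q, eps, 1 - eps)
    kl = P * np.log(P / Q) + (1 - P) * np.log((1 - P) / (1 - Q))
    return kl.mean()
```

It is zero exactly when the two parameter matrices coincide, and strictly positive otherwise.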
By enforcing the constraints of Eq (5) and using a standard property of the sigmoid function [18], we obtain our main result:
Theorem 2.
Assume that the boundedness condition of Theorem 1 holds. Suppose that the set of observed entries is chosen independently at random following a binomial model, and that $Y$ is generated using the sigmoid model. Let $\hat{M}$ be the solution to the optimization problem Eq (5). Then, with high probability, we have
Furthermore, given sufficiently many observed entries, and under an additional mild boundedness assumption, we have
Remarks
Theorem 2 tells us that the average KL divergence between the recovered and true label-probability matrices, together with the weighted average Frobenius error on the features, is bounded above by a quantity that shrinks as the number of observed entries grows. When the feature term dominates the bound, it implies that we can achieve nearly perfect feature recovery with a sample complexity matching previous perfect-recovery results, although the confidence degrades slightly compared with [28]. Otherwise, our bound agrees with previous bounds for one-bit matrix completion [18].
4 Experiments
We evaluate the proposed algorithm COCO on both synthetic and real datasets. Our implementation is in Matlab, except for the neural network, which is implemented in Python and used to evaluate the generalization performance of classifiers trained on the recovered data.
4.1 Experimental Results on Synthetic Data
Our goal is to show the recovery performance of the proposed algorithm on both the feature matrix and the label matrix. We also show how adding the trace-norm term on the feature difference enhances recovery.
Settings and Baselines
To create synthetic data, following previous works that generate a low-rank matrix [29], we first generate two random factor matrices with each entry drawn uniformly and independently at random, and construct $M$ as their product. The first $d$ columns of $M$ are regarded as the feature matrix $X$, and the rest is regarded as the latent matrix $Z$. We then set each entry $Y_{ij}$ to 1 with probability $\sigma(Z_{ij})$ and to 0 otherwise, where $\sigma$ is the sigmoid function. Finally, each entry of both $X$ and $Y$ is observed with a fixed probability. We test a variety of settings for the matrix sizes and rank. In the experiments, we weight the trace-norm term on the feature difference and set all other weight parameters to 1. The step size starts at an initial value and decays geometrically until it falls below a threshold. We compare two cases: COCO0, which drops the trace-norm term on the feature difference from the optimization, and COCO1, which keeps it, motivated by the elastic net. The Maxide method first completes the features using proximal gradient descent [22] and then performs weakly supervised multi-label learning [9]. The Mc method completes the concatenation of $X$ and $Y$, as proposed in [8]. We repeat each experiment five times and report the average results.
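The geometric step-size decay described above can be sketched as a generator; the constants shown are placeholders, since the paper's actual values are not recoverable from this extraction:

```python
def step_sizes(eta0=1.0, decay=0.95, floor=1e-4):
    """Yield decaying step sizes eta_t = eta0 * decay**t, stopping once
    the value drops below the floor (constants are illustrative only)."""
    eta = eta0
    while eta >= floor:
        yield eta
        eta *= decay
```

Iterating over `step_sizes()` then drives the subgradient steps of Eq (3) until the schedule is exhausted.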
Results
We measure the recovery performance on the feature matrix by the relative error $\| \hat{X} - X \|_F / \| X \|_F$. The classification performance is measured by the Hamming loss: after obtaining $\hat{Z}$, we set $\hat{Y}_{ij} = 1$ if $\sigma(\hat{Z}_{ij}) > 1/2$ and $\hat{Y}_{ij} = 0$ otherwise, and measure the average zero-one loss between $\hat{Y}$ and $Y$. The results are shown in Table 1; we present a subset of the results here and put the rest in the appendix. From the results we can see that, when the data satisfy our assumption, the proposed COCO with the trace-norm term on the feature difference is always better at feature recovery. For label recovery, our proposal is always better than the two baselines, Maxide and Mc, and only occasionally comparable to COCO0. This is reasonable, since the trace-norm term on the feature difference puts more emphasis on feature recovery and does not aid label recovery much. Comparing Maxide and Mc, we find that the two algorithms have the same recovery results on $X$, but Maxide performs much worse on $Y$ than Mc. This may be due to the fact that, when recovering $Y$, Mc uses additional information about the structure of the concatenated matrix instead of relying only on the imperfectly recovered feature data.
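The two evaluation metrics can be sketched as follows (the relative-error form $\|\hat{X} - X\|_F / \|X\|_F$ is our assumption for the elided formula):

```python
import numpy as np

def relative_error(X_hat, X):
    """Relative feature recovery error ||X_hat - X||_F / ||X||_F."""
    return np.linalg.norm(X_hat - X) / np.linalg.norm(X)

def hamming_loss(Z_hat, Y):
    """Threshold sigmoid(Z_hat) at 1/2 (equivalently, Z_hat at 0) to
    predict labels, then average the zero-one loss over all entries."""
    Y_hat = (Z_hat > 0).astype(int)
    return np.mean(Y_hat != Y)
```

Thresholding `Z_hat` at 0 is equivalent to thresholding the predicted probability at 1/2, since the sigmoid is monotone with $\sigma(0) = 1/2$.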
To further study the impact of the trace-norm term on the final performance, Figure 2 illustrates how the recovery errors of $X$ and $Y$ decrease as the iterations proceed. We can see that the feature recovery error of COCO1 decreases to a lower point at convergence, giving slightly better recovery than COCO0. The label recovery error also decreases, although the difference there is not obvious. We conclude that adding the trace-norm term on the feature difference to the optimization objective benefits feature recovery.
Table 1: Recovery errors on $X$ and $Y$ for COCO1, COCO0, Maxide, and Mc. The best result and comparable ones (pairwise single-tailed t-tests) in each row are in bold.

4.2 Experimental Results on Real Data
We evaluate the proposed algorithm on real data, using the CAL500 music annotation dataset [13]. As shown previously in Figure 1, CAL500's annotation matrix does not have the low-rank or approximately low-rank property. In this experiment, we not only report the recovery performance of COCO, but also use the recovered data to train new classifiers and report their test error.
Settings and Baselines
We first divide the dataset into a training part and a testing part. For the training data, we randomly sample a subset of entries as observed and make all other entries unobserved. We use the same parameter setting as in Section 4.1, except that the step size keeps decaying without stopping. We again compare with Maxide and Mc, using the default parameter settings in their original code. After the data are recovered, we use the state-of-the-art multi-label classification method LIMO (label-wise and instance-wise margins optimization) [30] and a single-hidden-layer neural network to test the generalization performance of classifiers trained on the recovered data. For a fair comparison, we also train a classifier on the clean data and record its test error, which can be regarded as the best achievable baseline for the corresponding model; we call this method the oracle. All experiments are repeated twenty times and we report the average results.
Results
The results are reported in Table 2. Our proposed COCO achieves the best recovery results among the three methods. For generalization performance, our method also achieves the best results among all compared methods, and is closest to the oracle baseline trained on clean data.
Table 2: Recovery error (X-error, Y-error) and test error (LIMO, NN) on CAL500 for COCO, Maxide, Mc, and the Oracle.
5 Conclusion
In this paper, we considered the problem in weakly supervised multi-label learning where both features and labels have missing values. Observing that previous methods either recover the features while ignoring supervised information or make unrealistic assumptions, we proposed a new method for this problem. More specifically, we considered a latent matrix that generates the label matrix; since labels are correlated, this latent matrix together with the features forms a big low-rank matrix. We then gave our optimization objective and an algorithm motivated by the elastic net. Experimental results on both simulated and real-world data validated the effectiveness of the proposed method.
Acknowledgments
We thank Bo-Jian Hou for discussions and for help polishing the paper.
References

[1] Zhi-Hua Zhou and Min-Ling Zhang. Multi-label learning. In Claude Sammut and Geoffrey I. Webb, editors, Encyclopedia of Machine Learning and Data Mining, pages 875–881. Springer US, 2017.
[2] Minmin Chen, Alice X. Zheng, and Kilian Q. Weinberger. Fast image tagging. In Proceedings of the 30th International Conference on Machine Learning, pages 1274–1282, 2013.
[3] Viet-An Nguyen, Jordan L. Boyd-Graber, Philip Resnik, and Jonathan Chang. Learning a concept hierarchy from multi-labeled documents. In Advances in Neural Information Processing Systems 27, pages 3671–3679, 2014.
[4] Zheng Chen, Minmin Chen, Kilian Q. Weinberger, and Weixiong Zhang. Marginalized denoising for link prediction and multi-label learning. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 1707–1713, 2015.
[5] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning, pages 593–601, 2014.
[6] Yu-Yin Sun, Yin Zhang, and Zhi-Hua Zhou. Multi-label learning with weak label. In Proceedings of the 24th AAAI Conference on Artificial Intelligence, 2010.

[7] Serhat Selcuk Bucak, Rong Jin, and Anil K. Jain. Multi-label learning with incomplete class assignments. In Proceedings of the 24th IEEE Conference on Computer Vision and Pattern Recognition, pages 2801–2808, 2011.
[8] Andrew B. Goldberg, Xiaojin Zhu, Ben Recht, Jun-Ming Xu, and Robert D. Nowak. Transduction with matrix completion: Three birds with one stone. In Advances in Neural Information Processing Systems 23, pages 757–765, 2010.
[9] Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems 26, pages 2301–2309, 2013.
 [10] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
 [11] Emmanuel J. Candès and Yaniv Plan. Matrix completion with noise. Proceedings of the IEEE, 98(6):925–936, 2010.
 [12] Ofer Dekel and Ohad Shamir. Learning to classify with missing and corrupted features. In Proceedings of the 25th International Conference on Machine Learning, pages 216–223, 2008.
 [13] Douglas Turnbull, Luke Barrington, David A. Torres, and Gert R. G. Lanckriet. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech & Language Processing, 16(2):467–476, 2008.
 [14] André Elisseeff and Jason Weston. A kernel method for multilabelled classification. In Advances in Neural Information Processing Systems 14, pages 681–687, 2001.
[15] Min-Ling Zhang and Zhi-Hua Zhou. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering, 18(10):1338–1351, 2006.
[16] Sheng-Jun Huang, Wei Gao, and Zhi-Hua Zhou. Fast multi-instance multi-label learning. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 1868–1874, 2014.
 [17] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.
[18] Mark A. Davenport, Yaniv Plan, Ewout van den Berg, and Mary Wootters. 1-bit matrix completion. arXiv, abs/1209.3672, 2012.
 [19] Mark Herbster, Stephen Pasteris, and Massimiliano Pontil. Mistake bounds for binary matrix completion. In Advances in Neural Information Processing Systems 29, pages 3954–3962, 2016.
 [20] Renkun Ni and Quanquan Gu. Optimal statistical and computational rates for one bit matrix completion. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 426–434, 2016.
 [21] Yurii Nesterov. Gradient methods for minimizing composite objective function. Technical report, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE), 2007.
 [22] Paul Tseng. On accelerated proximal gradient methods for convexconcave optimization. Technical report, University of Washington, WA, 2008.
[23] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1 regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, pages 929–936, 2009.
 [24] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. In Advances in Neural Information Processing Systems 21, pages 905–912, 2008.
[25] G. A. Watson. Characterization of the subdifferential of some matrix norms. Linear Algebra and its Applications, 170:33–45, 1992.
 [26] Haim Avron, Satyen Kale, Shiva Prasad Kasiviswanathan, and Vikas Sindhwani. Efficient and practical stochastic subgradient descent for nuclear norm regularization. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[27] Lijun Zhang, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. Stochastic proximal gradient descent for nuclear norm regularization. arXiv, abs/1511.01664, 2015.
 [28] Benjamin Recht. A simpler approach to matrix completion. Journal of Machine Learning Research, 12:3413–3430, 2011.
 [29] JianFeng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
[30] Xi-Zhu Wu and Zhi-Hua Zhou. A unified view of multi-label performance measures. In Proceedings of the 34th International Conference on Machine Learning, pages 3780–3788, 2017.