1 Introduction
In many realworld applications, samples are often represented by several feature subsets, and meanwhile associated with multiple labels [Xu et al.2013a]. For example, a natural scene image can be annotated with multiple tags {, , }, and described by various visual features, such as histogram of oriented gradients, color features and scale invariant feature transform. As an effective way to deal with such data, multiview multilabel learning has attracted a lot of attention in various realworld applications [Wu et al.2019, Zhang et al.2020]. Though these approaches have achieved much success, there still exists two problems. The first one is that it is difficult to collect all the relevant labels of every sample. For example, in image annotations, an annotator may annotate an image with a partial label set from the large number of groundtruth labels. To address such problem, weaklabel learning methods [Yu et al.2014, Dong et al.2018, Wu et al.2018, Tan et al.2018b] are proposed based on the assumption that similar instances have similar labels. Though these methods have shown promising results in real applications, they do not consider the second problem, i.e., samples may miss their representations on some views, which possibly leads to the performance degradation [Xu et al.2015]. Many incomplete multiview learning methods [Zhang et al.2013, Liu et al.2015, Xu et al.2015, Yin et al.2017] are then proposed to improve the performance by exploiting the complementary information from multiple incomplete views.
The coexistence of incomplete views and weak labels poses a severe challenge. To the best of our knowledge, only few studies [Tan et al.2018a, Zhu et al.2019, Li and Chen2021] take both two issues into consideration. However, iMVWL [Tan et al.2018a] and IMVLIV [Zhu et al.2019] impose the lowrank constraint on the label matrix, which is usually violated in practice due to the presence of tail labels in multilabel learning [Li and Chen2021]. NAIML [Li and Chen2021]
assumes that the label matrix is highrank, but treats all views equally just as iMVWL and IMVLIV do, which probably suffers from the problem of noisy views.
To cope with the aforementioned challenges, we propose a novel method for inCompletE Multiview wEaklabel learNing with longTailed labels (CEMENT) in this paper. Specifically, CEMENT first embeds both incomplete views and weak labels into lowdimensional subspaces with adaptive weights, which automatically detects noisy views by assigning relatively lower weights to them. It then adaptively correlates embedded views and labels via HilbertSchmidt Independence Criterion (HSIC) in Reproducing Kernel Hilbert Spaces (RKHSs). To capture tail labels, it separates an additional sparse component from weak labels, that makes the lowrankness valid in the multilabel setting. The framework of CEMENT is shown in Fig. 1. An alternating algorithm is developed to optimize the proposed problem, and its effectiveness is demonstrated on seven realworld datasets. The contributions of this work are summarized into threefolds:

A novel method CEMENT is proposed to handle the incomplete multiview weaklabel issue. It jointly embeds incomplete views and weak labels into lowdimensional subspaces with adaptive weights, and adaptively correlates the embeddings via HSIC in RKHSs.

CEMENT enables to capture noisy views and tail labels in realworld datasets by learning adaptive embedding weights and exploring an additional sparse component from weak labels, respectively.

Experimental results on seven widely used realworld datasets show the effectiveness of CEMENT.
2 Related Work
In this section, we discuss the related works with this paper, and focus on the works from three research fields: Incomplete Multiview Learning, Weaklabel Learning and Incomplete Multiview Weaklabel Learning.
2.1 Incomplete MultiView Learning
Multiview learning handles the data represented by multiple views and aims to improve learning performance by discovering view correlations [Yin et al.2017]
. Under the incomplete multiview setting, many algorithms have been proposed to handle the problem of missing views in recent years. Previous approaches have shown promising results in conjunction with semisupervised learning
[Xu et al.2015, Yin et al.2017], or with contrastive learning [Lin et al.2021]. And some tried to seek shared information by projecting original multiview data into a single lowdimensional subspace [Zhang et al.2013, Liu et al.2015].2.2 WeakLabel Learning
Previous weaklabel learning studies focus mainly on the singleview setting. MAXIDE [Xu et al.2013b]
uses the input feature data as side information to recover the label matrix, based on the assumption that the label matrix is lowrank. COCO
[Xu et al.2018] leverages a latent possibility matrix to generate the label matrix, and can recover the feature matrix and the label matrix simultaneously without the lowrank assumption. lrMMC [Liu et al.2013] and McWL [Tan et al.2018b] are multiview weaklabel learning methods, but they both need all the views to be complete.2.3 Incomplete MultiView WeakLabel Learning
As far as we know, there are only few studies focused on the incomplete multiview weaklabel learning. iMVWL [Tan et al.2018a] learns a shared subspace from incomplete views with weak labels, and leverages both crossview relationships and local label correlations. IMVLIV [Zhu et al.2019] designs a multiview multilabel learning method with incomplete views and weak labels by learning labelspecific features, label correlations, and complementary information of multiple views. These methods assume that the label matrix is lowrank, which is typically unsuitable in practice. NAIML [Li and Chen2021] explicitly exploits the highrank structure of the multilabel matrix, and jointly takes incompleteness of views and missing of labels into account. However, the existing three methods treat all views equally, limiting the real applications in presence of noisy views.
3 Methodology
3.1 Preliminaries
For the
th instance, we denote its feature vector of
th view by , and its corresponding label vector by , where is the feature dimension of the th view, and is the number of distinct labels. Let denote the input data with samples and views, where indicates the feature matrix in the th view. Let denote the label matrix, where means that the th label is assigned to the th instance, while otherwise. In the incomplete multiview weaklabel scenario, partial views and labels of some samples may be missing. Thus, we introduce and to denote indexes of the missing entries in the feature matrix and the label matrix , respectively. or if is an observed entry in or , and or otherwise.3.2 Formulation
Given a multiview dataset, we can optimize the following problem to find a shared latent subspace (), that integrates complementary information from different views [Gao et al.2015]:
(1) 
where represents the Frobenius norm, and is the coefficient matrix of the th view. Eq. (1) treats each view equally, whose objective actually equals to with and . Therefore, it might deviate from the true latent subspace, due to the existence of noisy views. Moreover, structured missing views in many applications also make Eq. (1) unreliable. A naive way to solve this problem is to fill the missing entries with average feature values, but it may introduce errors. To overcome the limitations, we propose the incomplete multiview model as follows:
(2) 
where weights the embedding importance of the th view, and is the Hadamard product. According to Eq. (2), the th view data is mapped to the viewspecific latent representation (), with the viewspecific adaptive weight . In addition, Eq. (2) minimizes the reconstruction error between and based only on the observed entries, which is indexed by . In this way, we successfully overcome the two limitations of Eq. (1).
Similarly, we map the label matrix to its latent representation () by , where is the coefficient matrix. However, the presence of longtailed labels makes the lowrank assumption invalid in practice [Li and Chen2021]
. Thus, it is desired to separate tail labels from the entire labels. To this end, we treat tail labels as outliers and decompose the label matrix
by(3) 
In Eq. (3), models nontail labels under the lowrank assumption, and captures tail labels with a sparse constraint. Besides, in the weaklabel setting, the label matrix is often incomplete and contains many missing entries. Thus, we propose to solve the following problem:
(4) 
where
is a tradeoff hyperparameter. Therefore, we successfully capture tail labels in the weaklabel setting, and thus make the lowrankness valid.
Next, we adopt the HilbertSchmidt Independence Criterion (HSIC) [Gretton et al.2005] to build the correlations among the embedded views and the embedded labels in an adaptive manner. HSIC computes the squared norm of the crosscovariance operator over and
, in Reproducing Kernel Hilbert Spaces (RKHSs) to estimate the dependency, which is empirically defined by:
(5) 
where and are two Gram matrices, measuring the kernel induced similarity between row vectors of and . is the centering matrix, where
is an identity matrix, and
is an allone vector. In theory, the larger the value of HSIC, the higher the dependence between and . Thus, we promote the dependence between and by maximizing the value of HSIC:(6) 
where weights the importance of the th view embedding and the label embedding .
By incorporating Eq. (2), Eq. (4) and Eq. (6), we now have the optimization problem for the proposed CEMENT method:
(7) 
In fact, Eq. (7) treats the label matrix as the th view, and uses an additional nonnegative parameter to weight its embedding. It is worth noting that weights the reconstruction between and , while balances the correlation between and , . In other words, will be assigned to a large value once is well recovered by , and will take a large value if is highly correlated to . In this way, CEMENT adaptively embeds incomplete views and weak labels into lowdimensional subspaces, and correlates them with adaptive weights, enabling to handle the real problems in presence of both noisy views and tail labels.
4 Optimization
The objective function in Eq. (7) is convex w.r.t , , , , , and , respectively, that motivates us to develop an alternating optimization algorithm^{2}^{2}2We provide the algorithm and the MATLAB code of CEMENT in the supplementary materials.. For simplicity, the linear kernel is used in HSIC, and it is easily extended to apply the other kernels. The algorithm repeats following steps until convergence.
Update with fixed others.
When , , , , , are fixed, each can be updated individually, and the objective function becomes
(8) 
We then optimize Eq. (8) with Project Gradient Descent algorithm (PGD) [Calamai and Moré1987] by:
(9) 
where is a learning rate, and is the partial derivative of w.r.t. . The projection function if ; otherwise.
Update with fixed others.
When , , , , , are fixed, the objective function becomes:
(10) 
Similar with updating , we use PGD to update by:
(11) 
where is the partial derivative of w.r.t. .
Update with fixed others.
With the others fixed, the computation of each is independent. The objective function w.r.t. is
(12) 
Under the KarushKuhnTucker (KKT) condition [Boyd and Vandenberghe2004], we can derive the following updating rule:
(13) 
Update with fixed others.
With the others fixed, the objective function w.r.t. becomes
(14) 
By using the KKT condition, we can derive the following updating rule:
(15) 
Update with fixed others.
We solve the following problem to update the longtailed label matrix :
(16) 
Eq. (16) can be easily optimized by softthresholding [Donoho1995], and the updating rule is
(17) 
where is the shrinkage operator, and it is defined as .
Update with fixed others.
When , , , , and are fixed, updating is to solve the following problem
(18) 
Based on [Li et al.2021], each is updated independently according to the following equation
(19) 
Eq. (19) actually is an inverse distance weighting. Obviously, the larger the distance, the smaller the value of , .
Update with fixed others.
When , , , , , are fixed, the objective function w.r.t. becomes
(20) 
Given , we have the following derivations according to CauchySchwarz inequality [Steele2004]:
(21) 
The inequality in Eq. (21) holds when
(22) 
which is the closed solution of Eq. (20).
4.1 Complexity Analysis
In terms of computational complexity, updating needs a cost of , updating and cost , and updating costs , where and represent the largest dimensionality of the subspaces and feature matrices from all views, respectively. Thus, the total computational complexity of the algorithm at each iteration is .
5 Experiments
5.1 Experimental Settings
Datasets  #Samples  #Views  #Features  #Labels  #Average  Domain 

Corel5k  4999  6  100/512/1000/4096/4096/4096  260  3.397  image 
ESPGame  20770  6  100/512/1000/4096/4096/4096  268  4.686  image 
IAPRTC12  19627  6  100/512/1000/4096/4096/4096  291  5.719  image 
Mirflickr  25000  6  100/512/1000/4096/4096/457  38  4.716  image 
Pascal07  9963  6  1000/1000/512/4096/4096/804  20  1.465  image 
Yeast  2417  2  79/24  14  4.237  biology 
Emotions  593  2  64/8  6  1.869  music 
Datasets.
We conduct a comprehensive experimental study to evaluate the performance of the proposed CEMENT on seven widely used multiview multilabel datasets. The statistics of the used datasets are summarized in Table 1. The first five datasets (Corel5k, ESPGame, IAPRTC12, Mirflickr, and Pascal07)^{3}^{3}3http://lear.inrialpes.fr/people/guillaumin/data.php are all image datasets, and obtained from [Guillaumin et al.2010]. Each sample of these datasets is represented by six feature views. In the Yeast dataset^{4}^{4}4http://vlado.fmf.unilj.si/pub/networks/data/ [Bu et al.2003], each gene is represented by a genetic expression and a phylogenetic profile. In the Emotions dataset^{5}^{5}5http://www.uco.es/kdis/mllresources [Tsoumakas et al.2008]
, each music is represented by rhythmic and timbre feature views, and classified into emotions that it evokes.
Comparing Methods.
We compare the proposed method CEMENT with four stateoftheart methods: lrMMC [Liu et al.2013], McWL [Tan et al.2018b], iMVWL [Tan et al.2018a], NAIML [Li and Chen2021]. lrMMC and McWL are two multiview weaklabel learning methods, but they all assume that the views of features are complete. Thus, we adapt lrMMC and McWL by filling missing features with zero. iMVWL and NAIML are two incomplete multiview weaklabel learning methods, which can be seen as the baselines. The implementations of the above algorithms are publicly available in corresponding papers.
Configurations.
On the five image datasets, the hyperparameters of McWL, iMVWL and NAIML are selected as recommended in the original papers. We tune the hyperparameters of lrMMC and CEMENT on all datasets, and the other three methods on the Yeast and Emotions datasets by grid search to produce the best possible results. We select the value of the hyperparameter from , and the ratio of and from {0.2, 0.5, 0.8} for our method. We set the values of the hyperparameters of the other methods from the ranges recommended in the original paper. The prediction performance of all algorithms is evaluated by three widely used metrics: Hamming Score (HS), Ranking Score (RS) [Zhang and Zhou2013], and Area Under RocCurve (AUC) [Bucak et al.2011]. We randomly sample 2000 samples of each image dataset, and use all samples from the Yeast and Emotions datasets in the experiment. Furthermore, we follow the protocol given in [Tan et al.2018a] to create incomplete multiview weaklabel scenarios: we randomly remove sampled positive and negative samples for each label, and
samples from each view by ensuring that each sample appears in at least one view. For all comparing algorithms, we repeat the experiment by ten times and report the average values and the standard deviations.
5.2 Experimental Results
Dataset  Metric  lrMMC  McWL  iMVWL  NAIML  CEMENT 

Corel5k  HS  
RS  
AUC  
ESPGame  HS  
RS  
AUC  
Mirflickr  HS  
RS  
AUC  
Pascal07  HS  
RS  
AUC  
IAPRTC12  HS  
RS  
AUC  
Yeast  HS  
RS  
AUC  
Emotions  HS  
RS  
AUC 
Evaluations of Comparing Methods.
Table 2 shows the experimental results of all comparing methods on seven realworld datasets with , and . From Table 2, we can see that CEMENT outperforms compared methods in most of the cases. The performance superiority probably comes from the ability of CEMENT on capturing noisy views and tail labels. The incompleteness of multiview data causes the degradation of results on lrMMC and McWL. iMVWL and NAIML are able to handle the incomplete multiview and weaklabel datasets, but perform worse than CEMENT. There are two possible reasons. One is that iMVWL assumes that the label matrix is lowrank, and the other is that both iMVWL and NAIML treat every view equally. In contrast, CEMENT measures the importance of each view by adaptively choosing the appropriate values of and .
Ablation Study.
We first introduce three variants of CEMENT, namely CEMENT1, CEMENT2 and CEMENT3, to investigate the effects of the components of CEMENT^{6}^{6}6The formulations of the three variants of CEMENT and more results of the study are provided in the supplementary materials.. CEMENT1 only learns shared information from all feature views, and ignores individual information, i.e. , , and . CEMENT2 assumes that the label matrix is lowrank by ignoring the tail label matrix . CEMENT3 only learns a single shared subspace among all views and labels, i.e., (), which does not need HSIC. Fig. 2 shows the ablation study of CEMENT on the Yeast dataset with and different values of . As shown in Fig. 2, we can see that CEMENT2 performs the worst, while CEMENT has the best performance on almost all metrics. This demonstrates that capturing tail labels is beneficial to recover the missing labels.
Parameter Analysis.
In this section, we analyze the sensitivity of CEMENT w.r.t. and . The value of is selected from , and the value of is selected from . The results in terms of HS and AUC on the Yeast dataset are reported in Fig. 3, and similar results are obtained on the other datasets. From Fig. 3, we can see that CEMENT achieves relatively stable and good performance when and . And we can also observe that when , HS and AUC decrease sharply. The possible reason is that CEMENT may not successfully capture longtailed labels, given a large penalty on . It again confirms the contribution of capturing longtailed labels in improving the performance of CEMENT.
Convergence Analysis
We plot the convergence curve of the optimization algorithm on the Yeast and Emotions datasets, as shown in Fig. 4. We terminate the optimization algorithm of CEMENT once the relative change of its objective value is below . To show the convergence curve clearly, we omit the objective value of the first iteration in Fig. 4. We observe that the objective value monotonically decreases as the number of iterations increases, and it usually converges within 200 iterations. Similar results are obtained on the other datasets.
6 Conclusion
In this paper, we propose a novel model named CEMENT to deal with incomplete multiview weaklabel data. CEMENT jointly embeds incomplete views and weak labels into lowdimensional subspaces with adaptive weights, and adaptively correlates them via HSIC. Moreover, CEMEMT explores an additional sparse component to model tail labels, making the lowrankness available in the multilabel setting. An alternating algorithm is developed to solve the proposed optimization problem. Empirical evidence verified that CEMENT is flexible enough to handle the incomplete multiview weaklabel learning problems in presence of missing views and tail labels, leading to improved performance.
References
 [Boyd and Vandenberghe2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
 [Bu et al.2003] Dongbo Bu, Yi Zhao, Lun Cai, Hong Xue, Xiaopeng Zhu, Hongchao Lu, Jingfen Zhang, Shiwei Sun, Lunjiang Ling, Nan Zhang, et al. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic acids research, 31(9):2443–2450, 2003.
 [Bucak et al.2011] Serhat Selcuk Bucak, Rong Jin, and Anil K Jain. Multilabel learning with incomplete class assignments. In CVPR, pages 2801–2808. IEEE, 2011.
 [Calamai and Moré1987] Paul H Calamai and Jorge J Moré. Projected gradient methods for linearly constrained problems. Mathematical programming, 39(1):93–116, 1987.
 [Dong et al.2018] HaoChen Dong, YuFeng Li, and ZhiHua Zhou. Learning from semisupervised weaklabel data. In AAAI, volume 32, 2018.
 [Donoho1995] David L Donoho. Denoising by softthresholding. TIT, 41(3):613–627, 1995.
 [Gao et al.2015] Hongchang Gao, Feiping Nie, Xuelong Li, and Heng Huang. Multiview subspace clustering. In ICCV, pages 4238–4246, 2015.
 [Gretton et al.2005] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbertschmidt norms. In ALT, pages 63–77. Springer, 2005.

[Guillaumin et al.2010]
Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid.
Multimodal semisupervised learning for image classification.
In2010 IEEE Computer society conference on computer vision and pattern recognition
, pages 902–909. IEEE, 2010.  [Li and Chen2021] Xiang Li and Songcan Chen. A concise yet effective model for nonaligned incomplete multiview and missing multilabel learning. TPAMI, 2021.
 [Li et al.2021] Lusi Li, Zhiqiang Wan, and Haibo He. Incomplete multiview clustering with joint partition and graph learning. TKDE, 2021.
 [Lin et al.2021] Yijie Lin, Yuanbiao Gou, Zitao Liu, Boyun Li, Jiancheng Lv, and Xi Peng. Completer: Incomplete multiview clustering via contrastive prediction. In CVPR, pages 11174–11183, 2021.
 [Liu et al.2013] Jialu Liu, Chi Wang, Jing Gao, and Jiawei Han. Multiview clustering via joint nonnegative matrix factorization. In SDM, pages 252–260. SIAM, 2013.
 [Liu et al.2015] Meng Liu, Yong Luo, Dacheng Tao, Chao Xu, and Yonggang Wen. Lowrank multiview learning in matrix completion for multilabel image classification. In AAAI, 2015.
 [Steele2004] J Michael Steele. The CauchySchwarz master class: an introduction to the art of mathematical inequalities. Cambridge University Press, 2004.
 [Tan et al.2018a] Qiaoyu Tan, Guoxian Yu, Carlotta Domeniconi, Jun Wang, and Zili Zhang. Incomplete multiview weaklabel learning. In IJCAI, pages 2703–2709, 2018.
 [Tan et al.2018b] Qiaoyu Tan, Guoxian Yu, Carlotta Domeniconi, Jun Wang, and Zili Zhang. Multiview weaklabel learning based on matrix completion. In SDM, pages 450–458. SIAM, 2018.
 [Tsoumakas et al.2008] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In MMD’08, volume 21, pages 53–59, 2008.
 [Wu et al.2018] Baoyuan Wu, Fan Jia, Wei Liu, Bernard Ghanem, and Siwei Lyu. Multilabel learning with missing labels using mixed dependency graphs. IJCV, 126(8):875–896, 2018.
 [Wu et al.2019] Xuan Wu, QingGuo Chen, Yao Hu, Dengbao Wang, Xiaodong Chang, Xiaobo Wang, and MinLing Zhang. Multiview multilabel learning with viewspecific information extraction. In IJCAI, pages 3884–3890, 2019.
 [Xu et al.2013a] Chang Xu, Dacheng Tao, and Chao Xu. A survey on multiview learning. arXiv preprint arXiv:1304.5634, 2013.
 [Xu et al.2013b] Miao Xu, Rong Jin, and ZhiHua Zhou. Speedup matrix completion with side information: Application to multilabel learning. In NIPS, pages 2301–2309, 2013.
 [Xu et al.2015] Chang Xu, Dacheng Tao, and Chao Xu. Multiview learning with incomplete views. TIP, 24(12):5812–5825, 2015.
 [Xu et al.2018] Miao Xu, Gang Niu, Bo Han, Ivor W Tsang, ZhiHua Zhou, and Masashi Sugiyama. Matrix cocompletion for multilabel classification with missing features and labels. arXiv preprint arXiv:1805.09156, 2018.
 [Yin et al.2017] Qiyue Yin, Shu Wu, and Liang Wang. Unified subspace learning for incomplete and unlabeled multiview data. Pattern Recognition, 67:313–327, 2017.
 [Yu et al.2014] HsiangFu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Largescale multilabel learning with missing labels. In ICML, pages 593–601. PMLR, 2014.
 [Zhang and Zhou2013] MinLing Zhang and ZhiHua Zhou. A review on multilabel learning algorithms. TKDE, 26(8):1819–1837, 2013.
 [Zhang et al.2013] Wei Zhang, Ke Zhang, Pan Gu, and Xiangyang Xue. Multiview embedding learning for incompletely labeled data. In IJCAI, 2013.

[Zhang et al.2020]
Yongshan Zhang, Jia Wu, Zhihua Cai, and S Yu Philip.
Multiview multilabel learning with sparse feature selection for image annotation.
TMM, 22(11):2844–2857, 2020.  [Zhu et al.2019] Changming Zhu, Duoqian Miao, Rigui Zhou, and Lai Wei. Improved multiview multilabel learning with incomplete views and labels. In ICDMW, pages 689–696. IEEE, 2019.