1 Introduction
Supervised machine learning crucially relies on the accuracy of the observed labels associated with the training samples Nettleton2010 ; 6685834 ; Zhu2004 ; natarajan2013learning ; peche ; ASLAM1996189 ; xiao2015learning ; lachenbruch1966discriminant ; BI20101622 ; Angluin1988 . Observed labels may be corrupted and, therefore, do not necessarily coincide with the true class of the samples. Such inaccurate labels are also referred to as noisy 6685834 ; liu2016classification ; natarajan2013learning . Label noise can occur for various reasons in real-world data, e.g. because of imperfect evidence, insufficient information, label subjectivity or fatigue on the part of the labeler. In other cases, noisy labels may result from the use of frameworks such as anchor learning Halpernocw011 ; MIKALSEN2017105 or silver standard learning agarwal2016learning , which have received interest for instance in healthcare analytics CALLAHAN2017279 ; doi:10.1146/annurevbiodatasci080917013315 . A review of various sources of label noise can be found in 6685834 .
In standard supervised machine learning settings, the challenge posed by noisy labels has been studied extensively. For example, many noise-tolerant versions of well-known classifiers have been proposed, including discriminant analysis lachenbruch1966discriminant ; Lawrence:2001:EKF:645530.655665 BOOTKRAJANG20143641 , the k-nearest neighbor classifier wilson2000reduction , boosting algorithms long2010random ; mcdonald2003empirical Bylander:1994:LLT:180139.181176 ; crammer2009adaptive biggio2011support, deep neural networks xiao2015learning ; NIPS2017_7143 ; patrini2016making . Others have proposed more general classification frameworks that are not restricted to particular classifiers natarajan2013learning ; liu2016classification .
However, very little research has been conducted on solving the challenge posed by noisy labels in nonstandard settings, where the magnitude of the noisy label problem is increased considerably. Examples of such nonstandard settings occur, for instance, within image analysis WU2018212 , document analysis krithara2008semi Ekbal2016 , crowdsourcing nowak2010reliable , or in the healthcare domain, used here as an illustrative case study. Nonstandard settings include (i) semisupervised learning 4787647 CHEN2017361 , referring to a situation where only a few (noisy) labeled data points are available, making the impact of noise in those few labels more pronounced, and where information must also jointly be inferred from unlabeled data points. In healthcare, it may be realistic to obtain some labels through an (imperfect) manual labeling process, but the vast amount of data remains unlabeled; (ii) multilabel learning tahir2012multilabel ; madjarov2012extensive ; XU2013885 ; CHEN201661 ; LIU2018307 ; WANG20143405 ; trajdos2015extension ; TRAJDOS201860 ; ZHUANG2018225 , wherein objects may not belong exclusively to one category. This situation occurs frequently in a number of domains, including healthcare, where for instance a patient could suffer from multiple chronic diseases; (iii) high-dimensional data, where the abundance of features and the limited (noisy) labeled data lead to a curse of dimensionality problem. In such situations, dimensionality reduction (DR) Theodoridis:2008:PRF:1457541 is useful, either as a preprocessing step or as an integral part of the learning procedure. This is a well-known challenge in health, where the number of patients in the populations under study frequently is small, but heterogeneous potential sources of data features from electronic health records for each patient may be enormous lee2017medical ; jensen2012mining ; 7154395 ; Miotto .
In this paper, we propose what is, to the best of our knowledge, the first noisy-label, semisupervised and multilabel DR machine learning method, which we call the Noisy multilabel semisupervised dimensionality reduction (NMLSDR) method. Towards that end, we propose a label propagation method that can deal with noisy multilabel data. Label propagation zhu2002learning ; zhu2003semi ; yang2016revisiting ; belkin2003using ; 6409473 ; zhou2004learning ; fan2017semi , wang2016dynamic , wherein one propagates the labels to the unlabeled data in order to obtain a fully labeled dataset, is one of the most successful and fundamental frameworks within semisupervised learning. However, in contrast to many of these methods, which clamp the labeled data, our multilabel propagation method allows the labeled part of the data to change labels during the propagation to account for noisy labels. In the second part of our algorithm we aim at learning a lower dimensional representation of the data by maximizing the feature-label dependence. Towards that end, similarly to other DR methods zhang2010multilabel ; XU2016172 , we employ the Hilbert-Schmidt independence criterion (HSIC) gretton2005measuring , which is a nonparametric measure of dependence.
The NMLSDR method is a general DR method that can be used in many different settings, e.g. for visualization or as a preprocessing step before doing classification. However, in order to test the quality of the NMLSDR embeddings, we (preferably) have to use some quantitative measures. For this purpose, a common baseline classifier such as the multilabel k-nearest neighbor (MLkNN) classifier ZHANG20072038 has been applied to the low-dimensional representations of the data guo2016semi ; 8059761 . Even though this is a valid way to measure the quality of the embeddings, applying a supervised classifier in a semisupervised learning setting is not a realistic setup, since one suddenly assumes that all labels are known (and correct). Therefore, as an additional contribution, we introduce a novel framework for semisupervised classification of noisy multilabel data.
In our experiments, we compare NMLSDR to baseline methods on synthetic data, benchmark datasets, as well as a real-world case study, where we use it to identify the health status of patients suffering from potentially multiple chronic diseases. The experiments demonstrate that for partially and noisily labeled multilabel data, NMLSDR is superior to existing DR methods according to seven different multilabel evaluation metrics and the Wilcoxon statistical test.
In summary, the contributions of the paper are as follows.

A new label noise-tolerant semisupervised multilabel dimensionality reduction method based on dependence maximization.

A novel framework for semisupervised classification of noisy multilabel data.

A comprehensive experimental section that illustrates the effectiveness of NMLSDR on synthetic data, benchmark datasets and a real-world case study.
The remainder of the paper is organized as follows. Related work is reviewed in Sec. 2. In Sec. 3, we describe our proposed NMLSDR method and the novel framework for semisupervised classification of noisy multilabel data. Sec. 4 describes experiments on synthetic and benchmark datasets, whereas Sec. 5 is devoted to the case study where we study chronically ill patients. We conclude the paper in Sec. 6.
2 Related work
In this section we review related unsupervised, semisupervised and supervised DR methods. (DR may be obtained both by feature extraction, i.e. by a data transformation, and by feature selection guyon2003introduction . Here, we refer to DR in the sense of feature extraction.)
Unsupervised DR methods do not exploit label information and can therefore straightforwardly be applied to multilabel data by simply ignoring the labels. For example, principal component analysis (PCA) aims to find the projection such that the variance of the input space is maximally preserved jolliffe2011principal . Other methods aim to find a lower dimensional embedding that preserves the manifold structure of the data, and examples of these include Locally linear embedding roweis2000nonlinear , Laplacian eigenmaps belkin2002laplacian and ISOMAP tenenbaum2000global .
One of the most well-known supervised DR methods is linear discriminant analysis (LDA) fisher1936use , which aims at finding the linear projection that maximizes the within-class similarity and at the same time minimizes the between-class similarity. LDA has been extended to multilabel LDA (MLDA) in several different ways park2008applying ; chen2007document ; wang2010multi ; lin2010mr ; XU2018107 . The difference between these methods basically consists in the way the labels are weighted in the algorithm. Following the notation in XU2018107 , wMLDAb park2008applying uses binary weights, wMLDAe chen2007document uses entropy-based weights, wMLDAc wang2010multi uses correlation-based weights, wMLDAf lin2010mr uses fuzzy-based weights, whereas wMLDAd XU2018107 uses dependence-based weights.
Canonical correlation analysis (CCA) hardoon2004canonical is a method that maximizes the linear correlation between two sets of variables, which in the case of DR are the set of labels and the set of features derived from the projected space. CCA can be applied directly to multilabel data without any modifications. Multilabel informed latent semantic indexing (MLSI) yu2005multi is a DR method that aims at both preserving the information of inputs and capturing the correlations between the labels. In the Multilabel least square (MLLS) method one extracts a common subspace that is assumed to be shared among multiple labels by solving a generalized eigenvalue decomposition problem ji2010shared .
In zhang2010multilabel , a supervised method for doing DR based on dependence maximization gretton2005measuring called Multilabel dimensionality reduction via dependence maximization (MDDM) was introduced. MDDM attempts to maximize the feature-label dependence using the Hilbert-Schmidt independence criterion and was originally formulated in two different ways. MDDMp is based on orthonormal projection directions, whereas MDDMf makes the projected features orthonormal. Xu et al. showed that MDDMp can be formulated using least squares and added a PCA term to the cost function in a new method called Multilabel feature extraction via maximizing feature variance and feature-label dependence simultaneously (MVMD) XU2016172 .
The most closely related existing DR methods to NMLSDR are the semisupervised multilabel methods. The Semisupervised dimension reduction for multilabel classification method (SSDRMC) qian2010semi , Coupled dimensionality reduction and classification for supervised and semisupervised multilabel learning GONEN2014132 , and Semisupervised multilabel learning with joint dimensionality reduction yu2016semisupervised are semisupervised multilabel methods that simultaneously learn a classifier and a low dimensional embedding.
Other semisupervised multilabel DR methods are semisupervised formulations of the corresponding supervised multilabel DR method. Blaschko et al. introduced semisupervised CCA based on Laplacian regularization blaschko2011semi . Several different semisupervised formulations of MLDA have also been proposed. Multilabel dimensionality reduction based on semisupervised discriminant analysis (MSDA) adds two regularization terms computed from an adjacency matrix and a similarity correlation matrix, respectively, to the MLDA objective function li2010multi . In the Semisupervised multilabel dimensionality reduction (SSMLDR) guo2016semi method one does label propagation to obtain soft labels for the unlabeled data. Thereafter the soft labels of all data are used to compute the MLDA scatter matrices. Another extension of MLDA is Semisupervised multilabel linear discriminant analysis (SMLDA) yu2017semi , which later was modified and renamed Semisupervised multilabel dimensionality reduction based on dependence maximization (SMDRdm) 8059761 . In SMDRdm the scatter matrices are computed based on only labeled data. However, an HSIC term, computed from soft labels for both labeled and unlabeled data obtained in a similar way as in SSMLDR, is also added to the familiar Rayleigh quotient containing the two scatter matrices.
Common to all these methods is that none of them explicitly assumes that the labels can be noisy. In SSMLDR and SMDRdm, the labeled data are clamped during the label propagation and hence cannot change. Moreover, these two methods are both based on LDA, which is known to be heavily affected by outliers, and consequently also by wrongly labeled data HUBERT2004301 ; croux2001robust ; hubert2008high .
3 The NMLSDR method
We start this section by introducing notation and the setting for noisy multilabel semisupervised linear feature extraction, and thereafter elaborate on our proposed NMLSDR method.
3.1 Problem statement
Let $\{x_i\}_{i=1}^{n}$ be a set of $D$-dimensional data points, $x_i \in \mathbb{R}^{D}$. Assume that the data are ordered such that the first $l$ of the data points are labeled and the remaining $u$ are unlabeled, $l + u = n$. Let $X$ be an $n \times D$ matrix with the data points as row vectors.
Assume that the number of classes is $C$ and let $Y_i^{L} \in \{0,1\}^{C}$ be the label vector of data point $x_i$, $i = 1, \dots, l$. The elements are given by $Y_{ic}^{L} = 1$ if data point $x_i$ belongs to the $c$-th class and $Y_{ic}^{L} = 0$ otherwise. Define the label matrix $Y^{L} \in \{0,1\}^{l \times C}$ as the matrix with the known label vectors $Y_i^{L}$, $i = 1, \dots, l$, as row vectors and let $Y^{U} \in \{0,1\}^{u \times C}$ be the corresponding label matrix of the unknown labels.
The objective of linear feature extraction is to learn a projection matrix $P \in \mathbb{R}^{D \times d}$ that maps a data point in the original feature space, $x \in \mathbb{R}^{D}$, to a lower dimensional representation $z \in \mathbb{R}^{d}$,
$$z = P^{T} x, \tag{1}$$
where $d < D$ and $P^{T}$ denotes the transpose of the matrix $P$.
In our setting, we assume that the label matrix $Y^{L}$ is potentially noisy and that $Y^{U}$ is unknown. The first part of our proposed NMLSDR method consists of doing label propagation in order to learn the labels $Y^{U}$ and update the estimate of $Y^{L}$. We do this by introducing soft labels $F \in \mathbb{R}^{n \times C}$ for the label matrix $Y = [Y^{L}; Y^{U}]$, where $F_{ic}$ represents the probability that data point $x_i$ belongs to class $c$. We obtain $F$ with label propagation and thereafter use $F$ to learn the projection matrix $P$. However, we start by explaining our label propagation method.
3.2 Label propagation using a neighborhood graph
The underlying idea of label propagation is that similar data points should have similar labels. Typically, the labels are propagated using a neighborhood graph zhu2002learning . Here, inspired by nie2010general , we formulate a label propagation method for multilabels that is robust to noise. The method is as follows.
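Step 1. First, a neighborhood graph is constructed. The graph is described by its adjacency matrix $W$, which can be designed e.g. by setting the entries to
$$W_{ij} = \exp\left(-\sigma^{-2} \|x_i - x_j\|^{2}\right), \tag{2}$$
where $\|x_i - x_j\|$ is the Euclidean distance between the data points $x_i$ and $x_j$, and $\sigma$ is a hyperparameter. Alternatively, one can use the Euclidean distance to compute a k-nearest neighbors (kNN) graph where the entries of $W$ are given by
$$W_{ij} = \begin{cases} 1, & \text{if } x_i \text{ is among the kNN of } x_j, \\ 0, & \text{otherwise.} \end{cases} \tag{3}$$
Step 2. Symmetrically normalize the adjacency matrix by letting
$$\tilde{W} = \mathcal{D}^{-1/2} W \mathcal{D}^{-1/2}, \tag{4}$$
where $\mathcal{D}$ is a diagonal matrix with entries given by $\mathcal{D}_{ii} = \sum_{k} W_{ik}$.
Step 3. Calculate the stochastic matrix
$$T = \tilde{\mathcal{D}}^{-1} \tilde{W}, \tag{5}$$
where $\tilde{\mathcal{D}}_{ii} = \sum_{k} \tilde{W}_{ik}$. The entry $T_{ij}$ can now be considered as the probability of a transition from node $i$ to node $j$ along the edge between them.
Step 4. Compute soft labels $F \in \mathbb{R}^{n \times C}$ by iteratively using the following update rule
$$F(t+1) = I_{\alpha} T F(t) + (I - I_{\alpha}) Y, \tag{6}$$
where $I_{\alpha}$ is an $n \times n$ diagonal matrix with the hyperparameters $\alpha_i$, $0 \le \alpha_i < 1$, on the diagonal. To initialize $F$, we let $F(0) = Y$, where the unlabeled data are set to $Y_{ic}^{U} = 0$, $c = 1, \dots, C$.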
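Steps 1 to 4 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation (their experiments were run in Matlab): the function names are hypothetical, the kNN graph is symmetrized with an elementwise maximum, a fixed number of iterations replaces a convergence check, and the values in `alpha` are placeholders rather than the settings used in the paper.

```python
import numpy as np

def knn_adjacency(X, k):
    """Binary kNN adjacency matrix from Euclidean distances."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances
    np.fill_diagonal(dist2, np.inf)            # exclude self-neighbors
    nn = np.argsort(dist2, axis=1)[:, :k]      # indices of the k nearest points
    W = np.zeros(dist2.shape)
    W[np.repeat(np.arange(len(X)), k), nn.ravel()] = 1.0
    return np.maximum(W, W.T)                  # symmetrize (assumption of this sketch)

def transition_matrix(W):
    """Symmetric normalization (Step 2) followed by row normalization (Step 3)."""
    deg = W.sum(axis=1)
    W_tilde = W / np.sqrt(np.outer(deg, deg))            # D^{-1/2} W D^{-1/2}
    return W_tilde / W_tilde.sum(axis=1, keepdims=True)  # each row sums to 1

def propagate_labels(T, Y, alpha, n_iter=500):
    """Iterate F <- I_alpha T F + (I - I_alpha) Y, starting from F = Y (Step 4)."""
    a = alpha[:, None]
    F = Y.astype(float)
    for _ in range(n_iter):
        F = a * (T @ F) + (1.0 - a) * Y
    return F

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = np.zeros((30, 3))
Y[:10] = rng.integers(0, 2, size=(10, 3))        # only the first 10 points carry labels
alpha = np.where(np.arange(30) < 10, 0.5, 0.99)  # placeholder values, not from the paper
F = propagate_labels(transition_matrix(knn_adjacency(X, k=5)), Y, alpha)
```

Because `alpha` is nonzero for the labeled rows, their soft labels can drift away from the initial 0/1 values, which is the behavior the method relies on to absorb label noise.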
3.2.1 Discussion
Setting $\alpha_i = 0$ for the labeled part of the data corresponds to clamping of the labels. However, this is not what we aim for in the presence of noisy labels. Therefore, a crucial property of the proposed framework is to set $\alpha_i > 0$ such that the labeled data can change labels during the propagation.
Moreover, we note that our extension of label propagation to multilabels is very similar to the single-label variant introduced in nie2010general , with the exception that we do not add the outlier class, which is not needed in our case. In other extensions to the multilabel label propagation guo2016semi ; 8059761 , the label matrix $Y$ is normalized such that the rows sum to 1, which ensures that the output of the algorithm, $F$, also has rows that sum to 1. In the single-label case this makes sense in order to maintain the interpretability of probabilities. However, in the multilabel case the data points do not necessarily belong exclusively to a single class. Hence, the requirement $\sum_{c} F_{ic} = 1$ does not make sense, since then $x_i$ can belong to at most one class if one thinks of $F_{ic}$ as a probability and requires the probability to be 0.5 or higher in order to belong to a class.
On the other hand, in our case, a simple calculation shows that $0 \le F_{ic}(t+1) \le 1$:
$$F_{ic}(t+1) = \alpha_i \sum_{m=1}^{n} T_{im} F_{mc}(t) + (1 - \alpha_i) Y_{ic} \le \alpha_i \sum_{m=1}^{n} T_{im} + (1 - \alpha_i) = 1, \tag{7}$$
since $F_{mc}(t) \le 1$ and $\sum_{m} T_{im} = 1$. However, we do not necessarily have that $\sum_{c} F_{ic} = 1$.
From matrix theory it is known that, given that $I - I_{\alpha} T$ is nonsingular, the solution of the linear iterative process (6) converges to the solution of
$$(I - I_{\alpha} T) F = (I - I_{\alpha}) Y \tag{8}$$
for any initialization $F(0)$ if and only if $I_{\alpha} T$ is a convergent matrix meyer1977convergent (spectral radius $\rho(I_{\alpha} T) < 1$). $I_{\alpha} T$ is obviously convergent if $0 \le \alpha_i < 1$ for all $i$. Hence, we can find the soft labels $F$ by solving the linear system given by Eq. (8).
Moreover, $F_{ic}$ can be interpreted as the probability that data point $x_i$ belongs to class $c$, and therefore, if one is interested in hard label assignments, $\tilde{Y}$, these can be found by letting $\tilde{Y}_{ic} = 1$ if $F_{ic} > 0.5$ and $\tilde{Y}_{ic} = 0$ otherwise.
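The fixed point of the propagation can also be computed directly rather than by iteration. The sketch below assumes a row-stochastic transition matrix `T`, an initial label matrix `Y` with zeros for unlabeled rows, and a vector `alpha` with entries in [0, 1); the function names are illustrative, and the strict threshold at 0.5 is an assumption about how ties are handled.

```python
import numpy as np

def soft_labels_closed_form(T, Y, alpha):
    """Solve (I - I_alpha T) F = (I - I_alpha) Y for the soft labels F."""
    n = T.shape[0]
    A = np.eye(n) - alpha[:, None] * T   # I - I_alpha T
    B = (1.0 - alpha)[:, None] * Y       # (I - I_alpha) Y
    return np.linalg.solve(A, B)

def hard_labels(F, threshold=0.5):
    """Threshold soft labels to obtain 0/1 label assignments."""
    return (F > threshold).astype(int)
```

Solving the system once costs a single dense factorization of an n-by-n matrix, so for large graphs the iterative update with a sparse transition matrix may still be the more practical choice.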
3.3 Dimensionality reduction via dependence maximization
In this section we explain how we use the labels obtained using label propagation to learn the projection matrix $P$.
The motivation behind dependence maximization is that there should be a relation between the features and the label of an object. This should be the case also in the projected space. Hence, one should try to maximize the dependence between the feature similarity in the projected space and the label similarity. A common measure of such dependence is the Hilbert-Schmidt independence criterion (HSIC) gretton2005measuring , defined by
$$\mathrm{HSIC}(X, Y) = (n-1)^{-2}\, \mathrm{tr}(K H L H), \tag{9}$$
where $\mathrm{tr}$ denotes the trace of a matrix. $H \in \mathbb{R}^{n \times n}$ is given by $H_{ij} = \delta_{ij} - n^{-1}$, where $\delta_{ij} = 1$ if $i = j$, and $\delta_{ij} = 0$ otherwise. $K$ is a kernel matrix over the feature space, whereas $L$ is a kernel computed over the label space.
Let the projection of $x$ be given by the projection matrix $P$ and function $\Phi$, $\Phi(x) = P^{T} x$. We select a linear kernel over the feature space, and therefore the kernel function is given by
$$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle = \langle P^{T} x_i, P^{T} x_j \rangle = x_i^{T} P P^{T} x_j. \tag{10}$$
Hence, given data $X$, the kernel matrix can be approximated by $K = X P P^{T} X^{T}$.
The kernel over the label space, $L$, is given via the labels $Y$. One possible such kernel is the linear kernel
$$L(Y_i, Y_j) = \langle Y_i, Y_j \rangle. \tag{11}$$
However, in our semisupervised setting, some of the labels are unknown and some are noisy. Hence, the kernel $L$ cannot be computed. In order to enable DR in our nonstandard problem, we propose to estimate the kernel using the labels obtained via our label propagation method. For the part of the data that was labeled from the beginning we use the hard labels, $\tilde{Y}^{L}$, obtained from the label propagation, whereas for the unlabeled part we use the soft labels, $F^{U}$. Hence, the kernel is approximated via $L = \tilde{F} \tilde{F}^{T}$, where $\tilde{F} = [\tilde{Y}^{L}; F^{U}]$. The reason for using the hard labels obtained from label propagation for the labeled part is that we want some degree of certainty for those labels that change during the propagation (if the soft label changes by less than 0.5 from its initial value 0 or 1 during the propagation, the hard label does not change).
The constant term, $(n-1)^{-2}$, in Eq. (9) is irrelevant in an optimization setting. Hence, by inserting the estimates of the kernels into Eq. (9), the following objective function is obtained,
$$\Psi(P) = \mathrm{tr}(X P P^{T} X^{T} H \tilde{F} \tilde{F}^{T} H) = \mathrm{tr}(P^{T} X^{T} H \tilde{F} \tilde{F}^{T} H X P). \tag{12}$$
Note that the matrix $X^{T} H \tilde{F} \tilde{F}^{T} H X$ is symmetric. Hence, by requiring that the projection directions are orthogonal and that the new dimensionality is $d$, the following optimization problem is obtained
$$\arg\max_{P}\ \mathrm{tr}(P^{T} X^{T} H \tilde{F} \tilde{F}^{T} H X P) \quad \text{s.t.} \quad P^{T} P = I. \tag{13}$$
As a consequence of the Courant-Fischer characterization saad1992numerical , it follows that the maximum is achieved when $P$ is an orthonormal basis corresponding to the $d$ largest eigenvalues. Hence, $P$ can be found by solving the eigenvalue problem
$$X^{T} H \tilde{F} \tilde{F}^{T} H X\, p = \lambda p. \tag{14}$$
The dimensionality of the projected space, $d$, is upper bounded by the rank of $\tilde{F} \tilde{F}^{T}$, which in turn is upper bounded by the number of classes $C$. Hence, $d$ cannot be set larger than $C$. The pseudocode of the NMLSDR method is shown in Alg. 1.
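The eigenvalue step can be illustrated with a short NumPy sketch. It assumes a data matrix `X` with samples as rows and a label matrix `F_tilde` that stacks the hard labels of the originally labeled points on top of the soft labels of the unlabeled points; the function name and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def hsic_projection(X, F_tilde, d):
    """Return a D x d matrix whose orthonormal columns maximize
    tr(P^T X^T H F F^T H X P) subject to P^T P = I."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix, H_ij = delta_ij - 1/n
    A = X.T @ H @ F_tilde                  # D x C factor
    M = A @ A.T                            # equals X^T H F F^T H X since H is symmetric
    eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:d]    # indices of the d largest eigenvalues
    return eigvecs[:, top]

# Usage: project the rows of X onto the learned directions.
# P = hsic_projection(X, F_tilde, d=3); Z = X @ P
```

Forming the D-by-C factor first avoids building the n-by-n label kernel explicitly, and it makes the rank bound visible: the matrix being decomposed has rank at most the number of classes.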
3.4 Semisupervised classification for noisy multilabel data
The multilabel k-nearest neighbor (MLkNN) classifier ZHANG20072038 is a widely adopted classifier for multilabel classification. However, similarly to many other classifiers, its performance can be hampered if the dimensionality of the data is too high. Moreover, the MLkNN classifier only works in a completely supervised setting. To resolve these problems, as an additional contribution of this work, we introduce a novel framework for semisupervised classification of noisy multilabel data, consisting of two steps. In the first step, we compute a low-dimensional embedding using NMLSDR. The second step consists of applying a semisupervised MLkNN classifier. For this classifier we use our label propagation method on the learned embedding to obtain a fully labeled dataset, and thereafter apply the MLkNN classifier.
4 Experiments
In this paper, we have proposed a method for computing a low-dimensional embedding of noisy, partially labeled multilabel data. However, it is not a straightforward task to measure how well the method works. Even though the method is definitely relevant to real-world problems (illustrated in the case study in Sec. 5), the framework cannot be directly applied to most multilabel benchmark datasets since most of them are completely labeled, and the labels are assumed to be clean. Moreover, NMLSDR provides a low-dimensional embedding of the data, and we need a way to measure how good the embedding is. If the dimensionality is 2 or 3, this can to some degree be done visually by plotting the embedding. However, in order to quantitatively measure the quality and simultaneously maintain a realistic setup, we will apply our proposed end-to-end framework for semisupervised classification and dimensionality reduction. In our experiments, this realistic semisupervised setup will be applied in an illustrative example on synthetic data and in the case study.
A potential disadvantage of using a semisupervised classifier is that it does not necessarily isolate the effect of the DR method that is used to compute the embedding. For this reason, we will also test our method on some benchmark datasets. In order to keep everything except the embedding method fixed, we compute the embedding using NMLSDR and the baseline DR methods based on only the noisy and partially labeled multilabel training data. Thereafter, we assume that the true multilabels are available when we train the MLkNN classifier on the embeddings.
The remainder of this section is organized as follows. First we describe the performance measures we employed, baseline DR methods, and how we select hyperparameters. Thereafter we provide an illustrative example on synthetic data, and secondly experiments on the benchmark data. The case study is described in the next section.
4.1 Evaluation metrics
Evaluation of performance is more complicated in a multilabel setting than for traditional single labels. In this work, we decide to use the seven different evaluation criteria that were employed in zhang2010multilabel , namely Hamming loss (HL), Macro F1-score (MaF1), Micro F1 (MiF1), Ranking loss (RL), Average precision (AP), One-error (OE) and Coverage (Cov).
HL simply evaluates the fraction of times there is a mismatch between the predicted label and the true label, i.e.
$$\mathrm{HL} = \frac{1}{nC} \sum_{i=1}^{n} \| \hat{Y}_i \oplus Y_i \|_1, \tag{15}$$
where $\hat{Y}_i$ denotes the predicted label vector of data point $x_i$ and $\oplus$ is the XOR operator. MaF1 is obtained by first computing the F1-score for each label, and then averaging over all labels,
$$\mathrm{MaF1} = \frac{1}{C} \sum_{c=1}^{C} \frac{2 \sum_{i=1}^{n} Y_{ic} \hat{Y}_{ic}}{\sum_{i=1}^{n} Y_{ic} + \sum_{i=1}^{n} \hat{Y}_{ic}}. \tag{16}$$
MiF1 calculates the F1 score on the predictions of different labels as a whole,
$$\mathrm{MiF1} = \frac{2 \sum_{c=1}^{C} \sum_{i=1}^{n} Y_{ic} \hat{Y}_{ic}}{\sum_{c=1}^{C} \sum_{i=1}^{n} Y_{ic} + \sum_{c=1}^{C} \sum_{i=1}^{n} \hat{Y}_{ic}}. \tag{17}$$
We note that HL, MiF1 and MaF1 are computed based on hard label assignments, whereas the four other measures are computed based on soft labels. In all of our experiments, we obtain the hard labels by putting a threshold at 0.5.
RL computes the average ratio of reversely ordered label pairs of each data point. AP evaluates the average fraction of relevant labels ranked higher than a particular relevant label. OE gives the ratio of data points where the most confident predicted label is wrong. Cov gives an average of how far one needs to go down on the list of ranked labels to cover all the relevant labels of the data point. For a more detailed description of these measures, we point the interested reader to wu2016unified .
In this work, we modify four of the evaluation metrics such that all of them take values in the interval $[0, 1]$ and “higher always is better”. Hence, we define
$$\mathrm{HL}' = 1 - \mathrm{HL}, \tag{18}$$
$$\mathrm{RL}' = 1 - \mathrm{RL}, \tag{19}$$
$$\mathrm{OE}' = 1 - \mathrm{OE}, \tag{20}$$
and normalized coverage (Cov’) by
$$\mathrm{Cov}' = 1 - \frac{\mathrm{Cov}}{C - 1}. \tag{21}$$
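The three metrics that operate on hard labels can be sketched as follows. This is a minimal illustration using the standard definitions of Hamming loss and macro/micro F1 on 0/1 label matrices of shape (samples, classes); the function names are illustrative, and assigning an F1 of 0 to a label with no true and no predicted positives is an assumption of this sketch.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of label entries where prediction and truth disagree."""
    return float(np.mean(Y_true != Y_pred))

def macro_f1(Y_true, Y_pred):
    """F1 score per label, averaged over labels."""
    tp = np.sum(Y_true * Y_pred, axis=0)
    denom = Y_true.sum(axis=0) + Y_pred.sum(axis=0)
    f1 = np.where(denom > 0, 2.0 * tp / np.maximum(denom, 1), 0.0)
    return float(f1.mean())

def micro_f1(Y_true, Y_pred):
    """F1 score computed over all label entries pooled together."""
    tp = np.sum(Y_true * Y_pred)
    denom = Y_true.sum() + Y_pred.sum()
    return float(2.0 * tp / denom) if denom > 0 else 0.0

Y_true = np.array([[1, 0], [1, 1], [0, 1]])
Y_pred = np.array([[1, 0], [0, 1], [0, 0]])
hl_prime = 1.0 - hamming_loss(Y_true, Y_pred)   # "higher is better" variant
```

In this small example two of the six label entries are wrong, so the Hamming loss is 1/3, and both F1 variants evaluate to 2/3.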
4.2 Baseline dimensionality reduction methods
In this work, we consider the following other DR methods: CCA, MVMD, MDDMp, MDDMf and four variants of MLDA, namely wMLDAb, wMLDAe, wMLDAc and wMLDAd. These methods are supervised and require labeled data, and are therefore trained only on the labeled part of the training data. In addition, we compare to a semisupervised method, SSMLDR, which we adapt to noisy multilabels by using the label propagation algorithm we propose in this paper instead of the label propagation method that was originally proposed in SSMLDR. We note that the computational complexity of NMLSDR and all the baselines is of the same order, as all of them require a step involving eigendecomposition.
4.3 Hyperparameter selection and implementation settings
For the MLkNN classifier we set . The effect of varying the number of neighbors will be left for further work. In order to learn the NMLSDR embedding we use a kNN graph with and binary weights. Moreover, we set for labeled data and for unlabeled data. By doing so, one ensures that an unlabeled data point is not affected by its initial value, but gets all contribution from the neighbors during the propagation. All experiments are run in Matlab using an Ubuntu 16.04 64-bit system with 16 GB RAM and an Intel Core i7-7500U processor.
4.4 Illustrative example on synthetic toy data
Dataset description
To test the framework in a controlled experiment, a synthetic dataset is created as follows.
A dataset of size 8000 samples is created, where each of the data points has dimensionality 320. The number of classes is set to 4, and we generate 2000 samples from each class. 30% from class 1 also belong to class 2, and vice versa. 20% from class 2 also belong to class 3 and vice versa, whereas 25% from class 3 also belong to class 4 and vice versa.
A sample from class $c$ is generated by randomly letting 10% of the features in the interval $[20(c-1)+1,\ 20c]$ take a random integer value between 1 and 10. Since there are 4 classes, this means that the first 80 features are directly dependent on the class membership.
For the remaining 240 features we consider 20 of them at a time. We randomly select 50% of the 8000 samples and randomly let 20% of the 20 features take a random integer value between 1 and 10. We repeat this procedure for the 12 different sets of 20 features $\{80 + 20(j-1) + 1, \dots, 80 + 20j\}$, $j = 1, \dots, 12$.
All features that are not given a value using the procedure described above are set to 0. Noise is injected into the labels by randomly flipping a fraction of the labels and we make the data partially labeled by removing 50 % of the labels. 2000 of the samples are kept aside as an independent test set. We note that noisy labels are often easier and cheaper to obtain than true labels and it is therefore not unreasonable that the fraction of labeled examples is larger than what it commonly is in traditional semisupervised learning settings.
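The corruption of a clean label matrix can be sketched as below. This is an illustration under assumptions the text does not pin down: flips are applied to an exact fraction of the individual 0/1 label entries, "removing labels" is read as hiding the whole label vector of a randomly chosen subset of data points, and hidden rows are set to zero. The function names are hypothetical.

```python
import numpy as np

def flip_labels(Y, fraction, rng):
    """Flip a given fraction of the 0/1 entries of Y, chosen at random."""
    Y_noisy = Y.copy()
    flat = Y_noisy.reshape(-1)                   # view onto the copy
    n_flip = int(round(fraction * flat.size))
    idx = rng.choice(flat.size, size=n_flip, replace=False)
    flat[idx] = 1 - flat[idx]
    return Y_noisy

def hide_labels(Y, labeled_fraction, rng):
    """Keep the labels of a random subset of points and zero out the rest."""
    n = Y.shape[0]
    n_labeled = int(round(labeled_fraction * n))
    is_labeled = np.zeros(n, dtype=bool)
    is_labeled[rng.permutation(n)[:n_labeled]] = True
    Y_partial = Y.copy()
    Y_partial[~is_labeled] = 0                   # unlabeled rows start at zero
    return Y_partial, is_labeled
```

Returning the boolean mask alongside the partially labeled matrix makes it easy to assign different propagation hyperparameters to labeled and unlabeled points later on.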
Results
We apply the NMLSDR method in combination with the semisupervised MLkNN classifier as explained above and compare to SSMLDR. We create two baselines by, for both of these methods, using a different value for the hyperparameter $\alpha_i$ for the labeled part of the data, namely 0, which corresponds to clamping. We denote these two baselines by SSMLDR* and NMLSDR*. In addition, we compare to baselines that only utilize the labeled part of the data, namely the supervised DR methods explained above in combination with an MLkNN classifier. The data is standardized to zero mean and unit standard deviation and we let the dimensionality of the embedding be 3.
Fig. 1(a) and 1(b) show the embeddings obtained using SSMLDR and NMLSDR, respectively. For visualization purposes, we have only plotted those data points that exclusively belong to one class. In Fig. 1(c), we have added two of the multiclasses for the NMLSDR embedding. For comparison, we also added the embedding obtained using PCA in Fig. 1(d). As we can see, in the PCA embedding the classes are not separated from each other, whereas in the NMLSDR and SSMLDR embeddings the classes are aligned along different axes. It can be seen that the classes are better separated and more compact in the NMLSDR embedding than in the SSMLDR embedding. Fig. 1(c) shows that the data points that belong to multiple classes are placed where they naturally belong, namely between the axes corresponding to both of the classes they are members of.
Table 1: Results obtained using the different methods on the synthetic dataset.
Method  HL’  RL’  AP  OE’  Cov’  MaF1  MiF1 

CCA  0.863  0.884  0.898  0.852  0.816  0.787  0.785 
MVMD  0.906  0.912  0.924  0.897  0.836  0.850  0.849 
MDDMp  0.906  0.911  0.924  0.897  0.836  0.851  0.850 
MDDMf  0.859  0.888  0.900  0.855  0.819  0.785  0.783 
wMLDAb  0.844  0.871  0.885  0.831  0.807  0.754  0.750 
wMLDAe  0.864  0.885  0.899  0.855  0.818  0.790  0.788 
wMLDAc  0.865  0.887  0.900  0.857  0.818  0.787  0.785 
wMLDAd  0.869  0.891  0.907  0.869  0.822  0.788  0.786 
SSMLDR*  0.863  0.883  0.899  0.859  0.814  0.796  0.793 
SSMLDR  0.879  0.898  0.910  0.871  0.827  0.817  0.814 
NMLSDR*  0.907  0.919  0.929  0.903  0.842  0.861  0.859 
NMLSDR  0.913  0.925  0.935  0.912  0.846  0.868  0.866 
Table 2: Benchmark datasets and their characteristics.
Dataset  Domain  Train instances  Test instances  Attributes  Labels  Cardinality 

Birds  audio  322  323  260  19  1.06 
Corel  scene  5188  1744  500  153  2.87 
Emotions  music  391  202  72  6  1.81 
Enron  text  1123  579  1001  52  3.38 
Genbase  biology  463  199  99  25  1.26 
Medical  text  645  333  1161  39  1.24 
Scene  scene  1211  1196  294  6  1.06 
Tmc2007  text  3000  7077  493  22  2.25 
Toy  synthetic  6000  2000  320  4  1.38 
Yeast  biology  1500  917  103  14  4.23 
Tab. 1 shows the results obtained using the different methods on the synthetic dataset. As we can see, our proposed method gives the best performance for all metrics. Moreover, NMLSDR with $\alpha_i = 0$ for the labeled data, which corresponds to clamping of the labeled data during label propagation, gives the second best results but cannot compete with our proposed method, in which the labels are allowed to change during the propagation to account for noisy labels. We also note that, even though SSMLDR improves on the MLDA approaches that are based on only the labeled part of the data, it gives results that are considerably worse than NMLSDR.
4.5 Benchmark datasets
Experimental setup
We consider the following benchmark datasets ^{2}^{2}2Downloaded from mulan.sourceforge.net/datasetsmlc.html: Birds, Corel, Emotions, Enron, Genbase, Medical, Scene, Tmc2007 and Yeast. We also add our synthetic toy dataset as a one of our benchmark datasets (described in Sec. 4.4). These datasets are shown in Tab. 2, along with some useful characteristics. In order to be able to apply our framework to the benchmark datasets, we randomly flip 10 % of the labels to generate noisy labels and let 30 % of the data points training sets be labeled. All datasets are standardized to zero mean and standard deviation one.
We apply the DR methods to the partially and noisy labeled multilabel training sets in order to learn the projection matrix , which in turn is used to map the Ddimensional training and test sets to a dimensional representation. is set as large as possible, i.e. to for the MLDAbased methods and for the other methods. Then we train a MLkNN classifier using the lowdimensional training sets, assuming that the true multilabels are known and validate the performance on the lowdimensional test sets.
In total, we evaluate the performance over 10 different datasets and across 7 different performance measures for all the feature extraction methods we use. Hence, to investigate which method performs better according to the different metrics, we also report the number of times each method gets the highest value of each metric. In addition, we compare all pairs of methods using a Wilcoxon signed rank test with a 5 % significance level demvsar2006statistical . Similarly to XU2018107 , if method A performs better than B according to the test, A is assigned the score 1 and B the score 0. If the null hypothesis (methods A and B perform equally) is not rejected, both A and B are assigned an equal score of 0.5. With 10 methods, each method takes part in nine comparisons, so the maximal score is 9.0.
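The pairwise scoring scheme can be sketched in pure Python as below. The sketch assumes an exact two-sided signed rank test (enumerating all sign patterns, which is feasible for 10 datasets), zero differences dropped, and higher metric values counted as better; the function names are hypothetical and the authors' actual test implementation is not specified in the text.

```python
from itertools import combinations, product

def signed_rank_test(a, b):
    """Exact two-sided Wilcoxon signed rank test for paired samples.

    Returns (p_value, direction); direction > 0 means a tends to exceed b.
    """
    diffs = [round(x - y, 9) for x, y in zip(a, b)]
    diffs = [d for d in diffs if d != 0]  # zero differences are dropped
    if not diffs:
        return 1.0, 0.0
    # Rank |d| from smallest to largest; tied values share their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    center = sum(ranks) / 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = abs(w_plus - center)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = sum(
        abs(sum(r for r, s in zip(ranks, signs) if s) - center) >= observed - 1e-12
        for signs in product((0, 1), repeat=len(ranks))
    )
    return extreme / 2 ** len(ranks), w_plus - center

def pairwise_scores(results, alpha=0.05):
    """results: {method: list of metric values, one per dataset}."""
    scores = {m: 0.0 for m in results}
    for m1, m2 in combinations(results, 2):
        p, direction = signed_rank_test(results[m1], results[m2])
        if p < alpha and direction != 0:
            scores[m1 if direction > 0 else m2] += 1.0  # winner gets 1, loser 0
        else:
            scores[m1] += 0.5  # null hypothesis not rejected
            scores[m2] += 0.5
    return scores
```

Because every comparison distributes exactly one point, the scores of all methods always sum to the number of method pairs.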
Results
Tab. 3: HL’ for each DR method on the benchmark datasets.
Dataset  CCA  MVMD  MDDMp  MDDMf  wMLDAb  wMLDAe  wMLDAc  wMLDAd  SSMLDR  NMLSDR
Birds  0.947  0.950  0.950  0.947  0.948  0.949  0.949  0.949  0.949  0.951 
Corel  0.980  0.980  0.980  0.980  0.980  0.980  0.980  0.980  0.980  0.980 
Emotions  0.715  0.771  0.778  0.711  0.696  0.714  0.709  0.717  0.786  0.787 
Enron  0.941  0.950  0.950  0.942  0.941  0.941  0.941  0.940  0.938  0.950 
Genbase  0.989  0.996  0.996  0.988  0.990  0.991  0.988  0.989  0.994  0.997 
Medical  0.976  0.974  0.974  0.976  0.974  0.975  0.975  0.976  0.966  0.975 
Scene  0.810  0.899  0.900  0.809  0.810  0.814  0.817  0.810  0.873  0.897 
Tmc2007  0.914  0.928  0.928  0.912  0.911  0.911  0.911  0.916  0.922  0.929 
Toy  0.836  0.894  0.894  0.839  0.821  0.831  0.831  0.854  0.861  0.903 
Yeast  0.780  0.791  0.790  0.782  0.785  0.783  0.781  0.781  0.793  0.793 
# Best values  2  2  3  2  1  1  1  2  2  8 
Wilcoxon  2.0  7.0  7.5  2.5  2.0  3.0  2.5  3.5  6.0  9.0 
Tab. 4: RL’ for each DR method on the benchmark datasets.
Dataset  CCA  MVMD  MDDMp  MDDMf  wMLDAb  wMLDAe  wMLDAc  wMLDAd  SSMLDR  NMLSDR
Birds  0.715  0.766  0.767  0.734  0.709  0.718  0.719  0.725  0.681  0.771 
Corel  0.800  0.808  0.808  0.800  0.799  0.799  0.800  0.800  0.801  0.814 
Emotions  0.695  0.824  0.824  0.709  0.693  0.700  0.676  0.714  0.829  0.845 
Enron  0.894  0.911  0.911  0.893  0.893  0.892  0.891  0.893  0.883  0.914 
Genbase  0.993  0.995  0.995  0.993  0.994  0.992  0.992  0.991  0.995  1.000 
Medical  0.925  0.952  0.949  0.925  0.916  0.921  0.919  0.945  0.856  0.946 
Scene  0.585  0.900  0.898  0.629  0.574  0.583  0.572  0.616  0.853  0.898 
Tmc2007  0.831  0.906  0.906  0.830  0.830  0.830  0.831  0.847  0.872  0.910 
Toy  0.871  0.909  0.909  0.870  0.849  0.865  0.861  0.888  0.887  0.926 
Yeast  0.806  0.820  0.819  0.811  0.810  0.809  0.806  0.803  0.818  0.816 
# Best values  0  3  0  0  0  0  0  0  0  7 
Wilcoxon  3.0  8.0  7.5  4.5  1.5  2.0  2.0  5.0  3.0  8.5 
Tab. 5: AP for each DR method on the benchmark datasets.
Dataset  CCA  MVMD  MDDMp  MDDMf  wMLDAb  wMLDAe  wMLDAc  wMLDAd  SSMLDR  NMLSDR
Birds  0.389  0.499  0.500  0.426  0.374  0.392  0.379  0.424  0.357  0.502 
Corel  0.260  0.277  0.277  0.261  0.265  0.263  0.263  0.268  0.266  0.288 
Emotions  0.669  0.781  0.773  0.686  0.672  0.687  0.666  0.704  0.799  0.808 
Enron  0.592  0.669  0.670  0.583  0.584  0.582  0.580  0.578  0.526  0.675 
Genbase  0.963  0.990  0.993  0.964  0.960  0.968  0.963  0.969  0.984  0.997 
Medical  0.673  0.722  0.716  0.666  0.644  0.674  0.669  0.723  0.446  0.725 
Scene  0.491  0.836  0.835  0.534  0.481  0.488  0.475  0.521  0.781  0.834 
Tmc2007  0.584  0.714  0.713  0.587  0.579  0.576  0.577  0.623  0.662  0.721 
Toy  0.882  0.921  0.921  0.880  0.862  0.880  0.875  0.900  0.897  0.933 
Yeast  0.732  0.748  0.747  0.731  0.733  0.733  0.729  0.725  0.745  0.741 
# Best values  0  2  0  0  0  0  0  0  0  8 
Wilcoxon  3.5  7.5  7.5  4.0  1.0  3.5  1.0  5.0  3.0  9.0 
Tab. 6: OE’ for each DR method on the benchmark datasets.
Dataset  CCA  MVMD  MDDMp  MDDMf  wMLDAb  wMLDAe  wMLDAc  wMLDAd  SSMLDR  NMLSDR
Birds  0.273  0.419  0.407  0.314  0.250  0.273  0.250  0.297  0.203  0.419 
Corel  0.250  0.261  0.262  0.252  0.255  0.254  0.253  0.267  0.260  0.283 
Emotions  0.535  0.673  0.644  0.564  0.535  0.589  0.550  0.589  0.718  0.728 
Enron  0.620  0.762  0.762  0.610  0.587  0.604  0.606  0.579  0.544  0.765 
Genbase  0.950  0.990  0.995  0.955  0.935  0.960  0.950  0.965  0.980  0.995 
Medical  0.583  0.607  0.592  0.589  0.538  0.583  0.577  0.628  0.323  0.619 
Scene  0.265  0.732  0.729  0.319  0.258  0.264  0.247  0.303  0.656  0.727 
Tmc2007  0.527  0.650  0.648  0.531  0.523  0.519  0.516  0.578  0.604  0.656 
Toy  0.821  0.888  0.887  0.819  0.785  0.821  0.811  0.850  0.849  0.903 
Yeast  0.760  0.755  0.749  0.740  0.747  0.751  0.748  0.744  0.751  0.739 
# Best values  1  2  1  0  0  0  0  1  0  7 
Wilcoxon  3.5  8.0  7.0  4.0  1.0  3.5  1.0  5.0  3.0  9.0 
Tab. 7: Cov’ for each DR method on the benchmark datasets.
Dataset  CCA  MVMD  MDDMp  MDDMf  wMLDAb  wMLDAe  wMLDAc  wMLDAd  SSMLDR  NMLSDR
Birds  0.821  0.851  0.852  0.830  0.818  0.824  0.824  0.831  0.808  0.860 
Corel  0.601  0.617  0.617  0.603  0.600  0.599  0.601  0.603  0.603  0.628 
Emotions  0.563  0.684  0.679  0.579  0.567  0.565  0.554  0.587  0.679  0.696 
Enron  0.738  0.762  0.763  0.736  0.737  0.736  0.734  0.736  0.724  0.768 
Genbase  0.983  0.984  0.984  0.983  0.985  0.981  0.981  0.980  0.985  0.991 
Medical  0.918  0.941  0.939  0.917  0.909  0.913  0.911  0.936  0.859  0.939 
Scene  0.637  0.899  0.898  0.672  0.625  0.633  0.624  0.663  0.860  0.898 
Tmc2007  0.740  0.835  0.835  0.741  0.740  0.739  0.741  0.762  0.790  0.840 
Toy  0.809  0.837  0.837  0.807  0.794  0.805  0.802  0.822  0.820  0.849 
Yeast  0.513  0.533  0.532  0.526  0.526  0.523  0.519  0.518  0.530  0.528 
# Best values  0  3  0  0  0  0  0  0  0  7 
Wilcoxon  2.5  7.5  7.5  4.5  2.0  2.5  1.5  5.0  3.0  9.0 
Tab. 8: MaF1 for each DR method on the benchmark datasets.
Dataset  CCA  MVMD  MDDMp  MDDMf  wMLDAb  wMLDAe  wMLDAc  wMLDAd  SSMLDR  NMLSDR
Birds  0.011  0.079  0.076  0.027  0.002  0.000  0.000  0.039  0.006  0.104 
Corel  0.012  0.023  0.022  0.014  0.010  0.010  0.010  0.019  0.010  0.021 
Emotions  0.381  0.599  0.604  0.419  0.366  0.385  0.371  0.415  0.623  0.649 
Enron  0.044  0.102  0.105  0.048  0.043  0.049  0.044  0.065  0.063  0.101 
Genbase  0.520  0.561  0.603  0.514  0.497  0.515  0.497  0.442  0.558  0.630 
Medical  0.153  0.168  0.164  0.159  0.135  0.126  0.133  0.197  0.038  0.175 
Scene  0.059  0.705  0.707  0.132  0.084  0.055  0.041  0.098  0.569  0.700 
Tmc2007  0.183  0.419  0.418  0.189  0.171  0.177  0.175  0.212  0.349  0.434 
Toy  0.732  0.830  0.828  0.741  0.709  0.722  0.724  0.758  0.776  0.845 
Yeast  0.266  0.318  0.323  0.276  0.281  0.279  0.248  0.233  0.321  0.342 
# Best values  0  1  2  0  0  0  0  1  0  6 
Wilcoxon  2.5  7.5  7.5  5.0  2.0  2.0  1.0  3.5  5.0  9.0 
Tab. 9: MiF1 for each DR method on the benchmark datasets.
Dataset  CCA  MVMD  MDDMp  MDDMf  wMLDAb  wMLDAe  wMLDAc  wMLDAd  SSMLDR  NMLSDR
Birds  0.036  0.178  0.172  0.063  0.006  0.000  0.000  0.065  0.019  0.197 
Corel  0.017  0.033  0.031  0.019  0.013  0.013  0.013  0.031  0.015  0.033 
Emotions  0.459  0.630  0.639  0.450  0.404  0.448  0.430  0.460  0.652  0.666 
Enron  0.351  0.523  0.530  0.413  0.340  0.378  0.369  0.310  0.346  0.518 
Genbase  0.882  0.953  0.959  0.872  0.885  0.902  0.873  0.881  0.932  0.968 
Medical  0.459  0.501  0.495  0.505  0.400  0.440  0.455  0.498  0.212  0.496 
Scene  0.066  0.700  0.702  0.142  0.086  0.058  0.041  0.102  0.584  0.698 
Tmc2007  0.421  0.589  0.586  0.443  0.440  0.438  0.438  0.485  0.540  0.590 
Toy  0.729  0.828  0.826  0.739  0.706  0.719  0.721  0.756  0.774  0.843 
Yeast  0.573  0.605  0.607  0.577  0.582  0.584  0.555  0.548  0.609  0.626 
# Best values  0  1  2  1  0  0  0  0  0  7 
Wilcoxon  2.5  8.0  7.5  5.0  1.5  2.5  2.0  4.0  3.5  8.5 
Tab. 3 shows results in terms of HL’. NMLSDR gets the best HL’ score for eight of the datasets and achieves the maximal Wilcoxon score of 9.0, i.e., it performs statistically better than all nine other methods according to the test at a 5 % significance level. The second best method, MDDMp, gets the highest HL’ score for three datasets and a Wilcoxon score of 7.5. From Tab. 4 we see that NMLSDR achieves the highest RL’ score seven times and a Wilcoxon score of 8.5. The second best method is MVMD, which obtains three of the highest RL’ values and a Wilcoxon score of 8.0.
Tab. 5 shows performance in terms of AP. NMLSDR achieves the highest AP score for eight datasets and gets the maximal Wilcoxon score of 9.0. According to the Wilcoxon score, second place is tied between MVMD and MDDMp. However, MVMD gets the highest AP score for two datasets, whereas MDDMp does not get the highest score for any of them. OE’ is presented in Tab. 6. We can see that NMLSDR gets the maximal Wilcoxon score and the highest OE’ score for seven datasets. MVMD is second, with a Wilcoxon score of 8.0 and two best values.
Tab. 7 shows Cov’. NMLSDR gets the maximal Wilcoxon score and the highest Cov’ value for seven datasets. Although MVMD gets the highest Cov’ for three datasets and MDDMp for none, the two methods tie for the second best Wilcoxon score of 7.5. MaF1 is shown in Tab. 8. Our proposed method is again the best, with the maximal Wilcoxon score and the highest MaF1 value for six datasets. Tab. 9 shows MiF1. NMLSDR achieves a Wilcoxon score of 8.5 and has the highest MiF1 score for seven datasets.
In total, NMLSDR consistently gives the best performance for all seven evaluation metrics. To summarize our findings, we compute the mean Wilcoxon score across all seven performance metrics and plot the result in Fig. 2. Sorting these results gives NMLSDR (8.86), MVMD (7.64), MDDMp (7.43), wMLDAd (4.43), MDDMf (4.21), SSMLDR (3.79), CCA (2.79), wMLDAe (2.71) and wMLDAb/wMLDAc (1.57). Our proposed method gets a mean value that is 1.22 higher than that of the second best method, MVMD, which is slightly better than MDDMp. The best MLDA-based method is wMLDAd, which is ranked 4th, but with a much lower mean value than the three best methods. The semi-supervised extension of MLDA (SSMLDR) is ranked 6th and actually performs worse than wMLDAd, which is a bit surprising. However, SSMLDR uses a binary weighting scheme and should therefore be considered a semi-supervised variant of wMLDAb, which it outperforms considerably. wMLDAb and wMLDAc give the worst performance of all 10 methods.
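The mean Wilcoxon scores quoted above can be recomputed directly from the Wilcoxon rows of Tab. 3 to Tab. 9. The short script below is only a sanity check of that arithmetic; the variable names are our own.

```python
# Wilcoxon scores per method, copied from Tab. 3-9 in the order
# HL', RL', AP, OE', Cov', MaF1, MiF1.
wilcoxon = {
    "CCA":    [2.0, 3.0, 3.5, 3.5, 2.5, 2.5, 2.5],
    "MVMD":   [7.0, 8.0, 7.5, 8.0, 7.5, 7.5, 8.0],
    "MDDMp":  [7.5, 7.5, 7.5, 7.0, 7.5, 7.5, 7.5],
    "MDDMf":  [2.5, 4.5, 4.0, 4.0, 4.5, 5.0, 5.0],
    "wMLDAb": [2.0, 1.5, 1.0, 1.0, 2.0, 2.0, 1.5],
    "wMLDAe": [3.0, 2.0, 3.5, 3.5, 2.5, 2.0, 2.5],
    "wMLDAc": [2.5, 2.0, 1.0, 1.0, 1.5, 1.0, 2.0],
    "wMLDAd": [3.5, 5.0, 5.0, 5.0, 5.0, 3.5, 4.0],
    "SSMLDR": [6.0, 3.0, 3.0, 3.0, 3.0, 5.0, 3.5],
    "NMLSDR": [9.0, 8.5, 9.0, 9.0, 9.0, 9.0, 8.5],
}

# Mean over the seven metrics, then sort from best to worst.
mean_score = {method: sum(s) / len(s) for method, s in wilcoxon.items()}
ranking = sorted(mean_score, key=mean_score.get, reverse=True)

for method in ranking:
    print(f"{method:7s} {mean_score[method]:.2f}")
```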
The main reason why the MLDA-based approaches in general perform worse than the other DR methods is probably related to what we discussed in Sec. 2, namely that LDA-based approaches are heavily affected by outliers and wrongly labeled data. More concretely, the labeled data points are relatively few and their labels are noisy, which leads to errors in the scatter matrices; these errors may even be amplified, since one has to invert a matrix to solve the generalized eigenvalue problem. The semi-supervised extension of MLDA, SSMLDR, improves considerably on wMLDAb, but the starting point is so poor that it still cannot compete with the best methods. On the other hand, the MDDM-based methods (MVMD and MDDMp) are less sensitive to label noise and to the scarcity of labels, and therefore perform quite well even though they are trained only on the labeled subset. Hence, the good performance of NMLSDR is probably explained by the fact that MDDMp is the basis of NMLSDR, and that NMLSDR in addition uses our label propagation method to improve on it.
5 Case study
In this section, we describe a case study of patients potentially suffering from multiple chronic diseases. This healthcare case study reflects the need for label noise-tolerant methods in a non-standard situation (semi-supervised learning, multiple labels, high dimensionality). The objective is to identify patients with certain chronic diseases, more specifically hypertension and/or diabetes mellitus. To do so, we use clinical expertise to create a partially and noisy labeled dataset, and thereafter apply our proposed end-to-end framework, namely NMLSDR for dimensionality reduction in combination with semi-supervised MLkNN, to classify these patients. An overview of the framework employed in the case study is shown in Fig. 3.
Chronic diseases
According to the World Health Organization, a disease is defined as chronic if one or several of the following criteria are satisfied: the disease is permanent, requires special training of the patient for rehabilitation, is caused by non-reversible pathological alterations, or requires a long period of supervision, observation, or care. The two most prevalent chronic diseases for people over 64 years are those that we study in this paper, namely hypertension and diabetes mellitus soni2001 . These types of diseases represent an increasing problem in modern societies all over the world, which to a large degree is due to a general increase in life expectancy, along with an increased prevalence of chronic diseases in an aging population calderon2016assessing . Moreover, the economic burden associated with these chronic conditions is high. For example, in 2017, treatment of diabetic patients accounted for 1 out of 4 healthcare dollars in the United States american2018economic . Hence, in the future, a significant amount of resources must be devoted to the care of chronic patients, and it will be important not only to improve patient care, but also to allocate the resources spent on treatment of these diseases more efficiently.
5.1 Data
In this case study, we study a dataset consisting of patients that potentially have one or more chronic diseases. All of these patients received some type of treatment at University Hospital of Fuenlabrada, Madrid (Spain) in the year 2012. The patients are described by diagnosis codes following the International Classification of Diseases 9th revision, Clinical Modification (ICD9CM) ICD9CM , and pharmacological dispensing codes according to the Anatomical Therapeutic Chemical (ATC) classification system world2016guidelines . We apply some preprocessing steps. Similarly to soguero2018use ; sanchez2018scaled , the ICD9CM and ATC codes are represented using frequencies, i.e., for each patient, we consider all encounters with the health system in 2012 and count how many times each ICD9CM and ATC code appears in the electronic health record. In total there are 1517 ICD9CM codes and 746 ATC codes. However, all codes that appear for fewer than 10 patients across the training set are removed. After this feature selection, the dimensionality of the data is 455, of which 267 features represent ICD9CM codes and 188 represent ATC codes.
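The frequency representation and the rare-code filter can be sketched as follows. The input format (a dictionary mapping a patient identifier to the list of codes recorded over all encounters) and the function name are assumptions made for illustration; the threshold of 10 patients is the one stated above.

```python
from collections import Counter

def build_count_features(records, min_patients=10):
    """records: {patient_id: list of ICD9CM/ATC codes over all encounters}.

    Returns (kept_codes, X), where X[patient_id] is the vector of code counts
    in the order given by kept_codes.
    """
    counts = {pid: Counter(codes) for pid, codes in records.items()}
    # Number of distinct patients for which each code occurs at least once.
    patients_per_code = Counter(code for c in counts.values() for code in c)
    kept = sorted(code for code, n in patients_per_code.items() if n >= min_patients)
    # A Counter returns 0 for codes the patient never received.
    X = {pid: [c[code] for code in kept] for pid, c in counts.items()}
    return kept, X
```

In practice the set of kept codes would be determined on the training set only and then reused to build the test features.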
We do have access to ground truth labels that indicate what type of chronic disease(s) the patients have. These are provided by a patient classification system developed by the company 3M Averill1999 . This classification system stratifies patients into so-called Clinical Risk Groups (CRG) that indicate what type(s) of chronic disease the patient has and the severity, based on the patient's encounters with the health system during a period of time, typically one year. A five-digit classification code is used to assign each patient to a severity risk group. The first digit of the CRG is the core health status group, ranging from healthy (1) to catastrophic (9); the second to fourth digits represent the base 3M CRG; and the fifth digit is used for characterizing the severity-of-illness levels.
For the purpose of this work, the ground truth labels are only used for cohort selection and final evaluation of our models. For the remaining parts they are considered unknown. To select a cohort, we consider the first four digits of the CRGs to analyze the following chronic conditions: CRG1000 (healthy), which contains 46835 individuals; CRG5192 (hypertension) with 12447 patients; CRG5424 (diabetes), which has 2166 patients; and CRG6144 (hypertension and diabetes), with a total of 3179 patients. We employ an undersampling strategy and randomly select 2166 patients from each of the four categories, and thereby obtain balanced classes. An independent test set is created by randomly selecting 20 % of these patients. Hence, the training set contains 6932 patients and the test set 1732 patients.
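The cohort sizes follow from simple arithmetic, reproduced below as a check. That the 20 % test fraction is rounded down is our assumption; it is the rounding that matches the reported sizes.

```python
# Group sizes as reported above.
crg_sizes = {"CRG1000": 46835, "CRG5192": 12447, "CRG5424": 2166, "CRG6144": 3179}

per_class = min(crg_sizes.values())   # undersample every group to the smallest one
n_total = per_class * len(crg_sizes)  # balanced cohort size
n_test = n_total * 20 // 100          # 20 % test set, rounded down (assumption)
n_train = n_total - n_test

print(per_class, n_total, n_train, n_test)
```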
5.2 Rule-based creation of noisy labeled training data using clinical knowledge
There are some important ICD9CM codes and ATC drugs that are strongly correlated with hypertension and diabetes, respectively. These are verified by our clinical experts and described in Tab. 10. In particular, the ICD9CM code 250 is important for diabetes because it is the code for diabetes mellitus. Similarly, the ICD9CM codes 401-405 are important for hypertension because they describe different types of hypertension.
Tab. 10: ATC and ICD9CM codes that the clinical experts associate with hypertension and diabetes.
Chronicity  ATC codes  ICD9CM codes
Hypertension  C01AA, C01BA, C01BA, C01BC, C01BD, C01CA, C01CB, C01CX, C01DA, C01DX, C01EB, C02AB, C02AC, C02CA, C02DB, C02DC, C02DD, C02K, C02LC, C03AA, C03AX, C03BA, C03CA, C03DA, C03EA, C03EB, C04AD, C04AE, C04AX, C05AA, C05AD, C05AE, C05AX, C05BA, C05BB, C05BX, C05CA, C05CX, C07AA, C07AB, C07AG, C07B, C07G, C07D, C07E, C07X, C08CA, C08DA, C08DB, C08GA, C09AA, C09BA, C09BB, C09CA, C09DA, C09DB, C09XA, C10AA, C10AB, C10AC, C10AD, C10AX, C10BA, C10BX  362, 401, 402, 403, 404, 405, 760
Diabetes  A10AB, A10AC, A10AD, A10AE, A10AF, A10BA, A10BB, A10BD, A10BFM, A10BGM, A10BH, A10BX  250, 588, 648, 775
In this case study we are interested in four groups, namely those that have hypertension, those that have diabetes, those that have both, and those that do not have either of these two chronic diseases. Thanks to the clinical experts and the information they provided, which is summarized in Tab. 10, we can create a partially and noisy labeled dataset using the following set of rules.

Those that have the ICD9CM code 250 and any of the ICD9CM codes 401-405 are assigned to both the hypertension and the diabetes class.

Those that have the ICD9CM code 250, but none of the 7 ICD9CM codes and 64 ATC drugs listed by the clinicians as indicators for hypertension, are labeled with diabetes.

Those that have any of the ICD9CM codes 401-405, but none of the 4 ICD9CM codes or 12 ATC drugs for diabetes, are labeled with hypertension.

Those that do not have any of the ICD9CM codes or ATC drugs listed in Tab. 10 are labeled as healthy.

The remaining patients do not get a label.
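The five rules above can be written as a small labeling function. The sketch below is illustrative only: the function name, the encoding of a label as a (hypertension, diabetes) pair, and the assumption that patient codes are already given at the granularity used in Tab. 10 (exact string matches) are ours. The code lists are copied from Tab. 10; since that table lists C01BA twice, the hypertension ATC set holds 63 distinct codes.

```python
HYPERTENSION_ICD = {"362", "401", "402", "403", "404", "405", "760"}
DIABETES_ICD = {"250", "588", "648", "775"}
CORE_HYPERTENSION_ICD = {"401", "402", "403", "404", "405"}

HYPERTENSION_ATC = set("""
C01AA C01BA C01BA C01BC C01BD C01CA C01CB C01CX
C01DA C01DX C01EB C02AB C02AC C02CA C02DB C02DC
C02DD C02K C02LC C03AA C03AX C03BA C03CA C03DA
C03EA C03EB C04AD C04AE C04AX C05AA C05AD C05AE
C05AX C05BA C05BB C05BX C05CA C05CX C07AA C07AB
C07AG C07B C07G C07D C07E C07X C08CA C08DA C08DB
C08GA C09AA C09BA C09BB C09CA C09DA C09DB C09XA
C10AA C10AB C10AC C10AD C10AX C10BA C10BX
""".split())
DIABETES_ATC = set("""
A10AB A10AC A10AD A10AE A10AF A10BA A10BB
A10BD A10BFM A10BGM A10BH A10BX
""".split())

def rule_based_label(codes):
    """Return (hypertension, diabetes) as 0/1, or None if no rule applies."""
    codes = set(codes)
    hyp_indicators = HYPERTENSION_ICD | HYPERTENSION_ATC
    dia_indicators = DIABETES_ICD | DIABETES_ATC
    has_250 = "250" in codes
    has_core_hyp = bool(codes & CORE_HYPERTENSION_ICD)

    if has_250 and has_core_hyp:
        return (1, 1)                    # rule 1: both diseases
    if has_250 and not codes & hyp_indicators:
        return (0, 1)                    # rule 2: diabetes only
    if has_core_hyp and not codes & dia_indicators:
        return (1, 0)                    # rule 3: hypertension only
    if not codes & (hyp_indicators | dia_indicators):
        return (0, 0)                    # rule 4: healthy
    return None                          # rule 5: unlabeled
```

For example, a patient with code 250 and a hypertension drug such as C09AA, but none of the codes 401-405, matches none of the first four rules and stays unlabeled.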
In total, this leads to 1734 patients in the healthy class, 2547 in the hypertension class and 1971 in the diabetes class; 1302 of the patients in the hypertension class also belong to the diabetes class. The remaining 1982 patients do not get a label using the routine described above. To be able to test for statistical significance, we randomly select 1000 of the noisy labeled patients and 1000 of the unlabeled patients. By doing so, we can repeat the experiments several times and test for significance using a pairwise t-test. We repeat the experiment 10 times and use a confidence level of 95 % (i.e., a significance level of 5 %).
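A significance check of this kind can be sketched with the standard library alone. The sketch assumes that the pairwise t-test is a paired, two-sided t-test on the per-repetition scores of two methods; the function names are hypothetical, and the critical value 2.262 applies only to 10 repetitions (9 degrees of freedom) at the 5 % level.

```python
from math import sqrt
from statistics import mean, stdev

T_CRIT_DF9 = 2.262  # two-sided 5 % critical value of Student's t, 9 degrees of freedom

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i].

    Raises ZeroDivisionError if all differences are identical.
    """
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

def is_significant(a, b, t_crit=T_CRIT_DF9):
    """True if the two methods differ significantly over the repetitions."""
    return abs(paired_t_statistic(a, b)) > t_crit
```

For other numbers of repetitions, the critical value would have to be replaced by the corresponding quantile of the t distribution.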
5.2.1 Performing feature extraction and classification
After having obtained the partially and noisy labeled multilabel dataset, we do feature extraction using NMLSDR, followed by semisupervised multilabel classification, exactly in the same manner as we did it for the synthetic toy data in Section 4.4. In this case study, we use the same evaluation metrics, hyperparameters and baseline feature extraction methods as explained in Sec. 4.1. The dimensionality of the embedding is set to for all embedding methods.
5.3 Results
Tab. 11 shows the performance of the different DR methods on the task of classifying patients with chronic diseases in terms of seven different evaluation metrics. According to the pairwise t-test, our method achieves the best performance for all metrics. Second place is tied between MDDMp and MVMD. The semi-supervised variant of MLDA, namely SSMLDR, performs better than the supervised counterparts (wMLDAb, wMLDAc, wMLDAd, wMLDAe) and is consistently ranked 4th according to all metrics. Interestingly, the more advanced weighting schemes in wMLDAc and wMLDAd actually lead to worse results than the simple weights in wMLDAb and wMLDAe. CCA gives the worst performance according to 4 of the evaluation measures; for the 3 other measures, the difference between CCA and wMLDAd is not significant.
Tab. 11: Performance of the DR methods on the chronic disease classification task.
Method  HL’  RL’  AP  OE’  Cov’  MaF1  MiF1
CCA  0.782 ± 0.009  0.823 ± 0.008  0.866 ± 0.006  0.755 ± 0.011  0.798 ± 0.004  0.712 ± 0.012  0.741 ± 0.011
MVMD  0.875 ± 0.006  0.930 ± 0.006  0.942 ± 0.004  0.894 ± 0.006  0.861 ± 0.005  0.853 ± 0.008  0.858 ± 0.006
MDDMp  0.875 ± 0.006  0.930 ± 0.005  0.942 ± 0.003  0.895 ± 0.006  0.861 ± 0.005  0.853 ± 0.008  0.858 ± 0.006
MDDMf  0.811 ± 0.010  0.853 ± 0.012  0.888 ± 0.009  0.798 ± 0.017  0.815 ± 0.006  0.750 ± 0.015  0.774 ± 0.013
wMLDAb  0.794 ± 0.007  0.844 ± 0.012  0.883 ± 0.008  0.788 ± 0.017  0.810 ± 0.008  0.731 ± 0.012  0.744 ± 0.011
wMLDAe  0.805 ± 0.008  0.856 ± 0.009  0.891 ± 0.006  0.801 ± 0.014  0.818 ± 0.005  0.749 ± 0.013  0.763 ± 0.012
wMLDAc  0.790 ± 0.007  0.842 ± 0.008  0.882 ± 0.004  0.783 ± 0.009  0.810 ± 0.005  0.729 ± 0.012  0.745 ± 0.011
wMLDAd  0.779 ± 0.013  0.838 ± 0.012  0.874 ± 0.008  0.770 ± 0.016  0.805 ± 0.008  0.720 ± 0.017  0.729 ± 0.018
SSMLDR  0.839 ± 0.005  0.889 ± 0.009  0.911 ± 0.006  0.839 ± 0.012  0.835 ± 0.008  0.799 ± 0.007  0.811 ± 0.005
NMLSDR  0.882 ± 0.005  0.939 ± 0.004  0.950 ± 0.003  0.909 ± 0.006  0.867 ± 0.005  0.864 ± 0.007  0.865 ± 0.005
Fig. 4 shows plots of the two-dimensional embeddings of the chronic patients obtained using four different DR methods, namely MDDMp, wMLDAb, NMLSDR and SSMLDR. The different colors and markers represent the true CRG labels of the patients. As we can see, the MDDMp and NMLSDR embeddings look quite similar. The healthy patients are squeezed together in a small area (purple dots), and the yellow dots, which represent patients that have both diabetes and hypertension, are placed between the blue dots, which are those that have only hypertension, and the red dots, which represent the patients that only have diabetes. Intuitively, this placement makes sense. On the other hand, the embedding obtained using SSMLDR does not look similar to its counterpart obtained using wMLDAb, and it is easy to see why the performance of wMLDAb is worse.
6 Conclusions and future work
In this paper we have introduced the NMLSDR method, a dimensionality reduction method for partially and noisy labeled multilabel data. To our knowledge, NMLSDR is the only method that can explicitly deal with this type of data. Key components in the method are a label propagation algorithm that can deal with noisy data and maximization of feature-label dependence using the Hilbert-Schmidt independence criterion. Our extensive experimental sections show that NMLSDR is a good dimensionality reduction method in settings where one has access to partially and noisy labeled multilabel data.
A potential limitation of NMLSDR is that it is a linear dimensionality reduction method. The method can, however, be extended within the framework of kernel methods 10.1007/BFb0020217 ; MIKALSEN2018569 ; HOOSHMANDMOGHADDAM2016921 to deal with nonlinear data. In fact, NMLSDR is already a kernel method in the current formulation, in which we put a linear kernel over the feature space. The linear kernel can, however, straightforwardly be replaced with a nonlinear kernel. The effect of doing this will be investigated in future work. In the future, we will also investigate more thoroughly the effect of using different weighting schemes in NMLSDR, similarly to how it is done in MLDA with wMLDAb, wMLDAc, wMLDAd and wMLDAe.
It should be noted that in our experiments, in addition to evaluating the proposed method visually for a couple of the datasets, we combined NMLSDR with a popular multilabel classifier, namely the multilabel k-nearest neighbor classifier. By doing so, we could quantitatively evaluate the quality of the embeddings learned by NMLSDR and compare them to alternative dimensionality reduction methods. However, many other multilabel classifiers exist tahir2012multilabel ; madjarov2012extensive ; XU2013885 ; CHEN201661 ; LIU2018307 ; WANG20143405 ; trajdos2015extension ; TRAJDOS201860 ; ZHUANG2018225 . As future work, it would be interesting to investigate if the proposed method outperforms alternative dimensionality reduction methods in conjunction with other classifiers as well.
Further, we recognize that the outcome of label propagation using a graph is influenced by several factors. More precisely, there are two main components that affect how the labels propagate, namely the particular method chosen and how the graph is constructed. Both components are important, as discussed in zhu2005semi ; zhu2006semi . In our experiments, we chose a neighborhood graph with binary weights. However, in future work it would be interesting to investigate more thoroughly the sensitivity of NMLSDR with respect to the particular choices made for constructing the graph.
Acknowledgments
This work was partially funded by the Norwegian Research Council FRIPRO grant no. 239844 on developing the Next Generation Learning Machines. Cristina Soguero-Ruiz is partially supported by project TEC201675361R from the Spanish Government and by Project DTS17/00158 from the Institute of Health Carlos III (Spain). The authors would like to thank clinicians at University Hospital of Fuenlabrada for their helpful advice and comments on various issues examined in the case study.
References
 (1) D. F. Nettleton, A. Orriols-Puig, A. Fornells, A study of the effect of different types of noise on the precision of supervised learning techniques, Artificial Intelligence Review 33 (4) (2010) 275–306. doi:10.1007/s104620109156z.
 (2) B. Frenay, M. Verleysen, Classification in the presence of label noise: A survey, IEEE Transactions on Neural Networks and Learning Systems 25 (5) (2014) 845–869. doi:10.1109/TNNLS.2013.2292894.
 (3) X. Zhu, X. Wu, Class noise vs. attribute noise: A quantitative study, Artificial Intelligence Review 22 (3) (2004) 177–210. doi:10.1007/s1046200407518.
 (4) N. Natarajan, I. S. Dhillon, P. K. Ravikumar, A. Tewari, Learning with noisy labels, in: Advances in neural information processing systems, 2013, pp. 1196–1204.
 (5) M. Pechenizkiy, S. Puuronen, A. Tsymbal, O. Pechenizkiy, Class noise and supervised learning in medical domains: The effect of feature extraction, in: 19th IEEE Symposium on ComputerBased Medical Systems (CBMS’06)(CBMS), Vol. 00, 2006, pp. 708–713. doi:10.1109/CBMS.2006.65.
 (6) J. A. Aslam, S. E. Decatur, On the sample complexity of noisetolerant learning, Information Processing Letters 57 (4) (1996) 189–195. doi:10.1016/00200190(96)000063.

 (7) T. Xiao, T. Xia, Y. Yang, C. Huang, X. Wang, Learning from massive noisy labeled data for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2691–2699.
 (8) P. A. Lachenbruch, Discriminant analysis when the initial samples are misclassified, Technometrics 8 (4) (1966) 657–662.

 (9) Y. Bi, D. R. Jeske, The efficiency of logistic regression compared to normal discriminant analysis under classconditional classification noise, Journal of Multivariate Analysis 101 (7) (2010) 1622–1637. doi:10.1016/j.jmva.2010.03.001.
 (10) D. Angluin, P. Laird, Learning from noisy examples, Machine Learning 2 (4) (1988) 343–370. doi:10.1023/A:1022873112823.
 (11) T. Liu, D. Tao, Classification with noisy labels by importance reweighting, IEEE Transactions on pattern analysis and machine intelligence 38 (3) (2016) 447–461.
 (12) Y. Halpern, S. Horng, Y. Choi, D. Sontag, Electronic medical record phenotyping using the anchor and learn framework, Journal of the American Medical Informatics Association 23 (4) (2016) 731–740. doi:10.1093/jamia/ocw011.
 (13) K. Ø. Mikalsen, C. SogueroRuiz, K. Jensen, K. Hindberg, M. Gran, A. Revhaug, R.O. Lindsetmo, S. O. Skrøvseth, F. Godtliebsen, R. Jenssen, Using anchors from free text in electronic health records to diagnose postoperative delirium, Computer Methods and Programs in Biomedicine 152 (2017) 105–114. doi:10.1016/j.cmpb.2017.09.014.
 (14) V. Agarwal, T. Podchiyska, J. M. Banda, V. Goel, T. I. Leung, E. P. Minty, T. E. Sweeney, E. Gyang, N. H. Shah, Learning statistical models of phenotypes using noisy labeled training data, Journal of the American Medical Informatics Association 23 (6) (2016) 1166–1173. doi:10.1093/jamia/ocw028.
 (15) A. Callahan, N. H. Shah, Chapter 19  Machine learning in healthcare, in: A. Sheikh, K. M. Cresswell, A. Wright, D. W. Bates (Eds.), Key Advances in Clinical Informatics, Academic Press, 2017, pp. 279 – 291. doi:10.1016/B9780128095232.000194.

 (16) J. M. Banda, M. Seneviratne, T. Hernandez-Boussard, N. H. Shah, Advances in electronic phenotyping: From rulebased definitions to machine learning models, Annual Review of Biomedical Data Science 1 (1) (2018) 53–68. doi:10.1146/annurevbiodatasci080917013315.
 (17) N. D. Lawrence, B. Schölkopf, Estimating a kernel fisher discriminant in the presence of label noise, in: Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001, pp. 306–313.
 (18) J. Bootkrajang, A. Kabán, Learning kernel logistic regression in the presence of class label noise, Pattern Recognition 47 (11) (2014) 3641 – 3655.
 (19) D. R. Wilson, T. R. Martinez, Reduction techniques for instancebased learning algorithms, Machine learning 38 (3) (2000) 257–286.
 (20) P. M. Long, R. A. Servedio, Random classification noise defeats all convex potential boosters, Machine learning 78 (3) (2010) 287–304.
 (21) R. A. McDonald, D. J. Hand, I. A. Eckley, An empirical comparison of three boosting algorithms on real data sets with artificial class noise, in: International Workshop on Multiple Classifier Systems, Springer, 2003, pp. 35–44.

 (22) T. Bylander, Learning linear threshold functions in the presence of classification noise, in: Proceedings of the Seventh Annual Conference on Computational Learning Theory, COLT ’94, ACM, New York, NY, USA, 1994, pp. 340–347. doi:10.1145/180139.181176.
 (23) K. Crammer, A. Kulesza, M. Dredze, Adaptive regularization of weight vectors, in: Advances in neural information processing systems, 2009, pp. 414–422.
 (24) B. Biggio, B. Nelson, P. Laskov, Support vector machines under adversarial label noise, in: Asian Conference on Machine Learning, 2011, pp. 97–112.
 (25) A. Vahdat, Toward robustness against label noise in training deep discriminative neural networks, in: Advances in Neural Information Processing Systems 30, Curran Associates, Inc., 2017, pp. 5596–5605.
 (26) G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, L. Qu, Making deep neural networks robust to label noise: A loss correction approach, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 (27) H. Wu, S. Prasad, Semisupervised dimensionality reduction of hyperspectral imagery using pseudolabels, Pattern Recognition 74 (2018) 212 – 224.
 (28) A. Krithara, M. R. Amini, J.M. Renders, C. Goutte, Semisupervised document classification with a mislabeling error model, in: European Conference on Information Retrieval, Springer, 2008, pp. 370–381.
 (29) A. Ekbal, S. Saha, U. K. Sikdar, On active annotation for named entity recognition, International Journal of Machine Learning and Cybernetics 7 (4) (2016) 623–640.
 (30) S. Nowak, S. Rüger, How reliable are annotations via crowdsourcing: a study about interannotator agreement for multilabel image annotation, in: Proceedings of the international conference on Multimedia information retrieval, ACM, 2010, pp. 557–566.
 (31) O. Chapelle, B. Schölkopf, A. Zien, SemiSupervised Learning, 1st Edition, The MIT Press, 2010.
 (32) P. Chen, L. Jiao, F. Liu, J. Zhao, Z. Zhao, S. Liu, Semisupervised double sparse graphs based discriminant analysis for dimensionality reduction, Pattern Recognition 61 (2017) 361 – 378.
 (33) M. A. Tahir, J. Kittler, A. Bouridane, Multilabel classification using heterogeneous ensemble of multilabel classifiers, Pattern Recognition Letters 33 (5) (2012) 513–523.
 (34) G. Madjarov, D. Kocev, D. Gjorgjevikj, S. Džeroski, An extensive experimental comparison of methods for multilabel learning, Pattern recognition 45 (9) (2012) 3084–3104.
 (35) J. Xu, Fast multilabel core vector machine, Pattern Recognition 46 (3) (2013) 885 – 898.
 (36) W.J. Chen, Y.H. Shao, C.N. Li, N.Y. Deng, Mltsvm: A novel twin support vector machine to multilabel learning, Pattern Recognition 52 (2016) 61 – 74.
 (37) Y. Liu, K. Wen, Q. Gao, X. Gao, F. Nie, Svm based multilabel learning with missing labels for image annotation, Pattern Recognition 78 (2018) 307 – 317.
 (38) S. Wang, J. Wang, Z. Wang, Q. Ji, Enhancing multilabel classification by modeling dependencies among labels, Pattern Recognition 47 (10) (2014) 3405 – 3413.
 (39) P. Trajdos, M. Kurzynski, An extension of multi-label binary relevance models based on randomized reference classifier and local fuzzy confusion matrix, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2015, pp. 69–76.
 (40) P. Trajdos, M. Kurzynski, Weighting scheme for a pairwise multi-label classifier based on the fuzzy confusion matrix, Pattern Recognition Letters 103 (2018) 60–67.
 (41) N. Zhuang, Y. Yan, S. Chen, H. Wang, C. Shen, Multi-label learning based deep transfer neural network for facial attribute classification, Pattern Recognition 80 (2018) 225–240.
 (42) S. Theodoridis, K. Koutroumbas, Pattern Recognition, 4th Edition, Academic Press, Inc., Orlando, FL, USA, 2008.
 (43) C. H. Lee, H.-J. Yoon, Medical big data: promise and challenges, Kidney Research and Clinical Practice 36 (1) (2017) 3.
 (44) P. B. Jensen, L. J. Jensen, S. Brunak, Mining electronic health records: towards better research applications and clinical care, Nature Reviews Genetics 13 (6) (2012) 395–405.
 (45) J. Andreu-Perez, C. C. Y. Poon, R. D. Merrifield, S. T. C. Wong, G. Yang, Big data for health, IEEE Journal of Biomedical and Health Informatics 19 (4) (2015) 1193–1208. doi:10.1109/JBHI.2015.2450362.
 (46) R. Miotto, F. Wang, S. Wang, X. Jiang, J. T. Dudley, Deep learning for healthcare: review, opportunities and challenges, Briefings in Bioinformatics. doi:10.1093/bib/bbx044.
 (47) X. Zhu, Z. Ghahramani, Learning from labeled and unlabeled data with label propagation, Technical Report CMU-CALD-02-107, Carnegie Mellon University.
 (48) X. Zhu, Z. Ghahramani, J. D. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 912–919.
 (49) Z. Yang, W. W. Cohen, R. Salakhutdinov, Revisiting semi-supervised learning with graph embeddings, in: Proceedings of the 33rd International Conference on Machine Learning, Volume 48, JMLR.org, 2016, pp. 40–48.
 (50) M. Belkin, P. Niyogi, Using manifold stucture for partially labeled classification, in: Advances in neural information processing systems, 2003, pp. 953–960.
 (51) A. Sandryhaila, J. M. F. Moura, Discrete signal processing on graphs, IEEE Transactions on Signal Processing 61 (7) (2013) 1644–1656. doi:10.1109/TSP.2013.2238935.
 (52) D. Zhou, O. Bousquet, T. N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Advances in neural information processing systems, 2004, pp. 321–328.
 (53) M. Fan, X. Zhang, L. Du, L. Chen, D. Tao, Semi-supervised learning through label propagation on geodesics, IEEE Transactions on Cybernetics.
 (54) B. Wang, J. Tsotsos, Dynamic label propagation for semi-supervised multi-class multi-label classification, Pattern Recognition 52 (2016) 75–84.
 (55) Y. Zhang, Z.-H. Zhou, Multilabel dimensionality reduction via dependence maximization, ACM Transactions on Knowledge Discovery from Data 4 (3) (2010) 14:1–14:21.
 (56) J. Xu, J. Liu, J. Yin, C. Sun, A multi-label feature extraction algorithm via maximizing feature variance and feature-label dependence simultaneously, Knowledge-Based Systems 98 (2016) 172–184. doi:10.1016/j.knosys.2016.01.032.
 (57) A. Gretton, O. Bousquet, A. Smola, B. Schölkopf, Measuring statistical dependence with Hilbert-Schmidt norms, in: International Conference on Algorithmic Learning Theory, Springer, 2005, pp. 63–77.
 (58) M.-L. Zhang, Z.-H. Zhou, ML-KNN: A lazy learning approach to multi-label learning, Pattern Recognition 40 (7) (2007) 2038–2048. doi:10.1016/j.patcog.2006.12.019.
 (59) B. Guo, C. Hou, F. Nie, D. Yi, Semi-supervised multi-label dimensionality reduction, in: Data Mining (ICDM), 2016 IEEE 16th International Conference on, IEEE, 2016, pp. 919–924.
 (60) Y. Yu, J. Wang, Q. Tan, L. Jia, G. Yu, Semi-supervised multi-label dimensionality reduction based on dependence maximization, IEEE Access 5 (2017) 21927–21940. doi:10.1109/ACCESS.2017.2760141.
 (61) I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of machine learning research 3 (Mar) (2003) 1157–1182.
 (62) I. Jolliffe, Principal component analysis, in: International encyclopedia of statistical science, Springer, 2011, pp. 1094–1096.
 (63) S. T. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, science 290 (5500) (2000) 2323–2326.
 (64) M. Belkin, P. Niyogi, Laplacian eigenmaps and spectral techniques for embedding and clustering, in: Advances in neural information processing systems, 2002, pp. 585–591.
 (65) J. B. Tenenbaum, V. De Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, science 290 (5500) (2000) 2319–2323.
 (66) R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of eugenics 7 (2) (1936) 179–188.
 (67) C. H. Park, M. Lee, On applying linear discriminant analysis for multi-labeled problems, Pattern Recognition Letters 29 (7) (2008) 878–887.
 (68) W. Chen, J. Yan, B. Zhang, Z. Chen, Q. Yang, Document transformation for multi-label feature selection in text categorization, in: 7th IEEE International Conference on Data Mining, 2007, pp. 451–456.
 (69) H. Wang, C. Ding, H. Huang, Multi-label linear discriminant analysis, in: European Conference on Computer Vision, Springer, 2010, pp. 126–139.
 (70) X. Lin, X.-W. Chen, Mr. kNN: soft relevance for multi-label classification, in: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, ACM, 2010, pp. 349–358.
 (71) J. Xu, A weighted linear discriminant analysis framework for multi-label feature extraction, Neurocomputing 275 (2018) 107–120. doi:10.1016/j.neucom.2017.05.008.
 (72) D. R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: An overview with application to learning methods, Neural Computation 16 (12) (2004) 2639–2664.
 (73) K. Yu, S. Yu, V. Tresp, Multi-label informed latent semantic indexing, in: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2005, pp. 258–265.
 (74) S. Ji, L. Tang, S. Yu, J. Ye, A shared-subspace learning framework for multi-label classification, ACM Transactions on Knowledge Discovery from Data (TKDD) 4 (2) (2010) 8.
 (75) B. Qian, I. Davidson, Semi-supervised dimension reduction for multi-label classification, in: Proc. AAAI Conf. Artif. Intell., Vol. 10, 2010, pp. 569–574.
 (76) M. Gönen, Coupled dimensionality reduction and classification for supervised and semi-supervised multilabel learning, Pattern Recognition Letters 38 (2014) 132–141. doi:10.1016/j.patrec.2013.11.021.
 (77) T. Yu, W. Zhang, Semisupervised multilabel learning with joint dimensionality reduction, IEEE Signal Process. Lett. 23 (6) (2016) 795–799.
 (78) M. B. Blaschko, J. A. Shelton, A. Bartels, C. H. Lampert, A. Gretton, Semi-supervised kernel canonical correlation analysis with application to human fMRI, Pattern Recognition Letters 32 (11) (2011) 1572–1583.
 (79) H. Li, P. Li, Y.-j. Guo, M. Wu, Multi-label dimensionality reduction based on semi-supervised discriminant analysis, Journal of Central South University of Technology 17 (6) (2010) 1310–1319.
 (80) Y. Yu, G. Yu, X. Chen, Y. Ren, Semi-supervised multi-label linear discriminant analysis, in: International Conference on Neural Information Processing, Springer, 2017, pp. 688–698.
 (81) M. Hubert, K. V. Driessen, Fast and robust discriminant analysis, Computational Statistics & Data Analysis 45 (2) (2004) 301–320. doi:10.1016/S0167-9473(02)00299-2.
 (82) C. Croux, C. Dehon, Robust linear discriminant analysis using S-estimators, Canadian Journal of Statistics 29 (3) (2001) 473–493.
 (83) M. Hubert, P. J. Rousseeuw, S. Van Aelst, High-breakdown robust multivariate methods, Statistical Science (2008) 92–119.
 (84) F. Nie, S. Xiang, Y. Liu, C. Zhang, A general graph-based semi-supervised learning with novel class discovery, Neural Computing and Applications 19 (4) (2010) 549–555.
 (85) C. D. Meyer, Jr, R. J. Plemmons, Convergent powers of a matrix with applications to iterative methods for singular linear systems, SIAM Journal on Numerical Analysis 14 (4) (1977) 699–705.
 (86) Y. Saad, Chapter 1: Background in matrix theory and linear algebra, in: Numerical Methods for Large Eigenvalue Problems, Manchester University Press, 1992, pp. 1–27. doi:10.1137/1.9781611970739.ch1.
 (87) X.-Z. Wu, Z.-H. Zhou, A unified view of multi-label performance measures, arXiv preprint arXiv:1609.00288.
 (88) J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine learning research 7 (Jan) (2006) 1–30.
 (89) A. Soni, E. Mitchell, Expenditures for commonly treated conditions among adults age 18 and older in the U.S. civilian noninstitutionalized population, 2013, Statistical Brief.
 (90) A. Calderón-Larrañaga, D. L. Vetrano, G. Onder, L. A. Gimeno-Feliu, C. Coscollar-Santaliestra, A. Carfí, M. S. Pisciotta, S. Angleman, R. J. Melis, G. Santoni, et al., Assessing and measuring chronic multimorbidity in the older population: a proposal for its operationalization, Journals of Gerontology Series A: Biomedical Sciences and Medical Sciences 72 (10) (2016) 1417–1423.
 (91) American Diabetes Association, Economic costs of diabetes in the US in 2017, Diabetes Care 41 (5) (2018) 917–928.
 (92) Centers for Disease Control and Prevention, International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) (2011).
 (93) WHO, Collaborating Centre for Drug Statistics Methodology, Guidelines for ATC classification and DDD assignment, 2016.
 (94) C. Soguero-Ruiz, A. A. Díaz-Plaza, P. de Miguel Bohoyo, J. Ramos-López, M. Rubio-Sánchez, A. Sánchez, I. Mora-Jiménez, On the use of decision trees based on diagnosis and drug codes for analyzing chronic patients, in: International Conference on Bioinformatics and Biomedical Engineering, Springer, 2018, pp. 135–148.
 (95) A. Sanchez, C. Soguero-Ruiz, I. Mora-Jiménez, F. Rivas-Flores, D. Lehmann, M. Rubio-Sánchez, Scaled radial axes for interactive visual feature selection: A case study for analyzing chronic conditions, Expert Systems with Applications 100 (2018) 182–196.
 (96) R. F. Averill, N. Goldfield, J. Eisenhandler, J. Hughes, B. Shafir, D. Gannon, L. Gregg, F. Bagadia, B. Steinbeck, N. Ranade, et al., Development and evaluation of Clinical Risk Groups (CRGs), Wallingford, CT: 3M Health Information Systems, 1999.
 (97) B. Schölkopf, A. Smola, K.-R. Müller, Kernel principal component analysis, in: W. Gerstner, A. Germond, M. Hasler, J.-D. Nicoud (Eds.), Artificial Neural Networks - ICANN'97, Springer Berlin Heidelberg, Berlin, Heidelberg, 1997, pp. 583–588.
 (98) K. Ø. Mikalsen, F. M. Bianchi, C. Soguero-Ruiz, R. Jenssen, Time series cluster kernel for learning similarities between multivariate time series with missing data, Pattern Recognition 76 (2018) 569–581.
 (99) V. H. Moghaddam, J. Hamidzadeh, New hermite orthogonal polynomial kernel and combined kernels in support vector machine classifier, Pattern Recognition 60 (2016) 921–935.
 (100) X. Zhu, Semi-supervised learning with graphs, Ph.D. thesis, Carnegie Mellon University, Pittsburgh, PA, USA, AAI3179046 (2005).
 (101) X. Zhu, Semi-supervised learning literature survey, Computer Science, University of Wisconsin-Madison 2 (3) (2006) 4.