I Introduction
Multi-label classification tasks have become increasingly common in machine learning, e.g., in text categorization [Li2015], image and video annotation [Qi2007], sequential data prediction [Read2017], and music information retrieval [Trohidis2011]. Multi-label databases exist for various real applications, such as the Yeast database for predicting protein localization sites [Nakai1992], the CAL500 database for music retrieval [Turnbull2008], and the Medical database for text classification [Pestian2007].
Compared to single-label problems, multi-label problems have more complicated characteristics. In a single-label problem, each instance belongs to exactly one class in a mutually exclusive manner [Wang2010]. Classes in a multi-label problem are not mutually exclusive, which means that each data item can belong to one or several classes. Moreover, different classes contain varying numbers of data items, leading to class imbalance problems [Lu2019]. Hence, in order to solve a multi-label classification problem efficiently and effectively, we need not only to consider the correlation of class labels and the features of each data item, but also to take into account the different cardinalities of the classes.
As described in [Zhang2014], multi-label classification methods follow either a problem transformation (PT) approach or an algorithm adaptation (AA) approach. Methods following the PT approach utilize single-label classification algorithms to tackle multi-label classification tasks using decomposition strategies, such as the binary relevance (BR) algorithm [Tanaka2015], [ZhangM.L.LiY.K.LiuX.Y.Geng2018] or the label powerset (LP) algorithm [Abdallah2016], [Tsoumakas2010]. The weighted multi-label linear discriminant analysis algorithm (wMLDA) [Xu2018b] combines the decomposition approach with different label and/or feature information to build a multi-label classification method. Methods following the AA approach directly utilize the information of class labels and data items to explore their correlation, e.g., in an extension of the AdaBoost algorithm [Schapire2000] or a deconvolution-based method in [Streich2008]. Linear Discriminant Analysis (LDA) and its variants have been widely used to extract discriminant data representations for various problems involving supervised dimensionality reduction, e.g., in human action recognition [Iosifidis2012b, Iosifidis2014b], [Wu2017], biological data classification [Wang2017, Huang2009a], and facial image analysis [Gao2009]. However, LDA cannot be directly used to tackle multi-label problems due to two factors: a) the contribution of each data item to the scatter matrices involved in the optimization problem of single-label LDA and its variants cannot be appropriately determined, and b) the cardinalities of the classes forming the multi-label problem can be quite imbalanced.
In this paper, we propose a novel method for multi-label data classification based on a probabilistic approach that estimates the contribution of each data item to the classes it belongs to by taking into account prior information encoded using various types of metrics. The proposed calculation of the contribution of each data item to its classes can not only weight its importance, but also address problems related to imbalanced classes. To this end, we exploit the concept of class saliency introduced in [Xu2018c]. Hence, the proposed method is named Saliency-based Weighted Multi-label Linear Discriminant Analysis (SwMLDA). As a kind of PT approach, SwMLDA exploits both label and feature information with various prior weighting factors, i.e., the binary-based weight form [Park2008a], misclassification-based weight form [Xu2018c], entropy-based weight form [Chen2007], fuzzy-based weight form [Lin2010], dependence-based weight form [Xu2018b], and correlation-based weight form [Wang2010]. The proposed method leads to improved results on 10 publicly available multi-label databases.

We make the following contributions to multi-label classification with our novel SwMLDA approach: (1) we propose using probabilistic saliency estimation in multi-label classification to weight the importance of each item for its classes; (2) we formulate a novel SwMLDA method that uses the saliency-based weights and can alleviate the problems related to imbalanced datasets; (3) we integrate label and feature information into SwMLDA by using various types of weighting factors as prior information; (4) we compare our proposed approach to related methods on 10 diverse multi-label data sets, and the results show considerable improvements in multi-label classification tasks using our approach.
The remainder of this paper is structured as follows. In Section II, we briefly review the related work, including a precise explanation of LDA and weighted MLDA with the mathematical notations needed to support the derivation of the probabilistic saliency estimation. In Section III, we describe our proposed methods in detail. Section IV presents the experimental setup and results on 10 multi-label databases. In Section V, we conclude the paper and discuss potential future work.
II Related Work
In this section, we first briefly present several standard approaches for multi-label classification in Subsection II-A. In Subsection II-B, we provide a detailed description of standard LDA, weighted LDA, Multi-label LDA (MLDA), and weighted Multi-label LDA (wMLDA), since they form the theoretical foundation of the proposed work. Subsequently, we introduce the general concepts of saliency estimation and the probabilistic saliency estimation approach needed to develop the proposed method.
II-A General methods for multi-label classification tasks
Various methods have been proposed for solving multi-label classification tasks, such as variants of the Support Vector Machine (SVM) [Godbole2004] and various feature extraction methods [Wang2010, Xu2018b, Zhang2008]. As PT algorithms, Binary Relevance-based methods [Tanaka2015, ZhangM.L.LiY.K.LiuX.Y.Geng2018, Read2009] decompose a multi-label classification problem into several single-label classification problems in a one-versus-all manner. Another standard PT method is the Label Powerset (LP) algorithm [Abdallah2016], [Pushpa2017], which exploits the dependencies or correlations of class labels to rebuild a labeled subset for a single-label classifier. The traditional SVM algorithm can also act as a PT approach: in [Boutell2004], a multi-label scene classification problem is decomposed into several single-label problems following a cross-training strategy.
As an AA approach, the alternating decision tree (ADTree) was proposed to enhance the performance of boosting methods [Yoav1999, DeComite2003]. In [Yoav1999], the alternating decision tree strategy is based on an option tree combined with boosting. Another decision-tree-related algorithm, ADTboost.MH, was proposed in [DeComite2003] to solve multi-label text and data classification problems by combining the ADTboost algorithm [Yoav1999] with the AdaBoost.MH algorithm [Schapire1999].
II-B Dimensionality reduction algorithms for multi-label classification tasks
Standard LDA and its variants have been applied to various multi-label classification problems [Wang2010, Park2008a, Oikonomou2013, Yuan2014, Nie2009, Siblini2019]. Generally, dimensionality-reduction-based methods tackling multi-label classification problems are categorized as unsupervised or supervised, depending on whether class label information is involved [Xu2018b]. The objective of dimensionality-reduction-based methods is to determine a projection matrix mapping the data from the original $D$-dimensional feature space to a $d$-dimensional discriminant subspace, where $d < D$.
II-B1 Linear Discriminant Analysis
LDA is an effective technique to reduce the dimensionality of the original data as a preprocessing step for single-label classification problems. In the following, we assume that a training set formed by $N$ data points and their class labels is presented as

(1) $\mathcal{S} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$,

where $\mathbf{x}_i \in \mathbb{R}^{D}$ and $\mathbf{y}_i \in \{0,1\}^{C}$ are the data points and the corresponding label vectors, respectively. The instance matrix $\mathbf{X}$ is defined as

(2) $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$.

The label matrix $\mathbf{Y}$ is defined as

(3) $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N] \in \{0,1\}^{C \times N}$.

The label information of element $\mathbf{x}_i$ is represented as $Y_{ci}$: if $\mathbf{x}_i$ belongs to the class $c$, $Y_{ci} = 1$, otherwise $Y_{ci} = 0$. Note that in single-label classification tasks there is a single 1 in each column. Later, we will use the same notation in multi-label classification, where the number of 1s per column is not constrained.
The within-class, between-class, and total scatter matrices $\mathbf{S}_w$, $\mathbf{S}_b$, and $\mathbf{S}_t$, respectively, are defined as follows:

(4) $\mathbf{S}_w = \sum_{c=1}^{C} \sum_{i: Y_{ci}=1} (\mathbf{x}_i - \mathbf{m}_c)(\mathbf{x}_i - \mathbf{m}_c)^T$

(5) $\mathbf{S}_b = \sum_{c=1}^{C} N_c (\mathbf{m}_c - \mathbf{m})(\mathbf{m}_c - \mathbf{m})^T$

(6) $\mathbf{S}_t = \sum_{i=1}^{N} (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T$

$\mathbf{m}_c$ denotes the mean vector of class $c$:

(7) $\mathbf{m}_c = \frac{1}{N_c} \sum_{i: Y_{ci}=1} \mathbf{x}_i$,

where $N_c$ is the cardinality of class $c$. The total mean vector $\mathbf{m}$ is computed as

(8) $\mathbf{m} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$.
The optimal projection matrix $\mathbf{W}$ is learned by maximizing Fisher's discriminant criterion [R.A.Fisher1936], compacting the within-class scatter and maximizing the between-class scatter simultaneously:

(9) $\mathbf{W}^{*} = \arg\max_{\mathbf{W}} \operatorname{tr}\!\left( (\mathbf{W}^T \mathbf{S}_w \mathbf{W})^{-1} (\mathbf{W}^T \mathbf{S}_b \mathbf{W}) \right)$,

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. Usually, the optimal projection matrix is calculated by solving the eigenvalue decomposition of the matrix $\mathbf{S}_w^{-1}\mathbf{S}_b$ and then using the eigenvectors corresponding to the largest eigenvalues as the projection matrix $\mathbf{W}$. The rank of $\mathbf{S}_b$ is equal to $C-1$, which is the maximal dimensionality of the resulting subspace. Since $\mathbf{S}_t = \mathbf{S}_w + \mathbf{S}_b$, an alternative approach is to use $\mathbf{S}_t$ instead of $\mathbf{S}_w$ and maximize Fisher's discriminant criterion as

(10) $\mathbf{W}^{*} = \arg\max_{\mathbf{W}} \operatorname{tr}\!\left( (\mathbf{W}^T \mathbf{S}_t \mathbf{W})^{-1} (\mathbf{W}^T \mathbf{S}_b \mathbf{W}) \right)$.
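As a concrete illustration, the single-label pipeline of Eqs. (4)-(9) can be sketched in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function name, the toy dimensions in the usage below, and the small ridge term added to keep the within-class scatter invertible are our own choices.

```python
import numpy as np

def lda_projection(X, y, dim, ridge=1e-6):
    """Single-label LDA sketch: X is D x N, y holds integer class labels.
    Returns a D x dim projection matrix maximizing the Fisher criterion."""
    D = X.shape[0]
    m = X.mean(axis=1, keepdims=True)                # total mean, cf. Eq. (8)
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mc = Xc.mean(axis=1, keepdims=True)          # class mean, cf. Eq. (7)
        Sw += (Xc - mc) @ (Xc - mc).T                # within-class scatter, cf. Eq. (4)
        Sb += Xc.shape[1] * (mc - m) @ (mc - m).T    # between-class scatter, cf. Eq. (5)
    # Maximizing the trace ratio leads to the eigenvectors of Sw^{-1} Sb;
    # a small ridge keeps Sw invertible on degenerate data.
    evals, evecs = np.linalg.eig(np.linalg.inv(Sw + ridge * np.eye(D)) @ Sb)
    order = np.argsort(-evals.real)
    return evecs[:, order[:dim]].real
```

For $C$ classes the useful subspace dimensionality is at most $C-1$, matching the rank argument above.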
Although the traditional LDA technique has gained popularity in various single-label classification tasks, its performance varies with the type of input data set. Usually, the data sets used in traditional LDA classification tasks are assumed to follow a homoscedastic Gaussian model [Petridis2004a], in which the covariance matrices of all classes are identical [Tang2005]. Furthermore, the performance is severely affected by class imbalance in the input data sets [Tang2005a].
II-B2 Weighted Linear Discriminant Analysis
In order to enhance the robustness of traditional LDA on different kinds of data sets, various weight factors have been introduced into the definitions of the scatter matrices to balance the contribution of each class according to class statistics [Tang2005], [Li2009a], e.g., class cardinality or prior probability. Weighted LDA approaches have diminished the influence of outlier classes on the scatter matrices of imbalanced data sets to some extent; however, they still neglect the varying importance of individual samples in the class description. Saliency-based weighted Linear Discriminant Analysis (SwLDA) [Xu2018c], a graph-based approach, was proposed to explore the contribution of each instance based on probabilistic saliency estimation [Aytekin2018]. Our work uses the same idea for multi-label classification.

Generally, weight factors are calculated using various metrics to reallocate the contribution of each class, which can alleviate the influence of outlier classes on the projection matrix. An example of a weighted between-class scatter matrix based on the Bayes error rate was proposed in [Loog2001b]:
(11) $\mathbf{S}_b = \sum_{c=1}^{C-1} \sum_{l=c+1}^{C} p_c\, p_l\, \omega(\Delta_{cl})\, (\mathbf{m}_c - \mathbf{m}_l)(\mathbf{m}_c - \mathbf{m}_l)^T$,

where $p_c$, $p_l$ denote the a priori probabilities of class $c$ and class $l$, respectively, and $\Delta_{cl}$ expresses the dissimilarity between class $c$ and class $l$, mapped to a weight by the function $\omega(\cdot)$. The within-class scatter matrix can be modified with prior information as in [Tang2005]:

(12) $\mathbf{S}_w = \sum_{c=1}^{C} r_c \sum_{i: Y_{ci}=1} (\mathbf{x}_i - \mathbf{m}_c)(\mathbf{x}_i - \mathbf{m}_c)^T$,

where $r_c$ is a relevance weight factor that has a low value if class $c$ is estimated to be an outlier class. Thus, both definitions of the scatter matrices decrease the influence of outlier classes. After computing the weighted scatter matrices, they can be used to obtain the optimal projection matrix from Eq. (9).
II-B3 Multi-label Linear Discriminant Analysis
Although weighted LDA algorithms have enhanced the performance on single-label classification tasks [Jarchi2006a], [Ahmed2012] compared to traditional LDA, such variants are still not directly applicable to multi-label classification tasks [Wang2010]. In a multi-label data set, label information contains certain correlations or dependencies [Wu2016]; for example, an image instance labeled as 'car' highly correlates with the label 'road' [Wang2010]. Besides, it is quite common that the class cardinalities in a multi-label data set are imbalanced. For example, the largest class size is 1128 and the smallest 21 in the widely used Yeast database [Nakai1992], as shown in Fig. 1. Due to these characteristics of multi-label databases, it is imperative to take into account the correlation of class labels and/or the discriminative feature information of each instance to avoid suboptimal classification results on imbalanced data sets.
When traditional LDA and its variants are applied to multi-label classification tasks by simply using Eqs. (4)-(6) with the multi-label label matrix $\mathbf{Y}$, a significant problem is that the contribution of one instance can be counted repeatedly in computing the scatter matrices. Hence, weight factors are used to express redundancy and/or correlation information so that LDA-related algorithms can calculate the scatter matrices without redundancy on multi-label databases. In [Wang2010], a multi-label linear discriminant analysis (MLDA) approach based on the exploration of label correlation information was proposed to tackle multi-label image and video classification tasks. The MLDA approach embeds the correlation information of class labels as weight factors in the definition of the scatter matrices as
(13) $\mathbf{S}_w = \sum_{c=1}^{C} \sum_{i=1}^{N} W_{ci} (\mathbf{x}_i - \mathbf{m}_c)(\mathbf{x}_i - \mathbf{m}_c)^T$

(14) $\mathbf{S}_b = \sum_{c=1}^{C} \left( \sum_{i=1}^{N} W_{ci} \right) (\mathbf{m}_c - \mathbf{m})(\mathbf{m}_c - \mathbf{m})^T$

(15) $\mathbf{S}_t = \sum_{c=1}^{C} \sum_{i=1}^{N} W_{ci} (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T$

where $W_{ci}$ describes the weight factor of the $i$th instance for the $c$th class, $\mathbf{m}$ is the total mean vector of all training instances, and $\mathbf{m}_c$ is the mean vector of class $c$:

(16) $\mathbf{m}_c = \frac{\sum_{i=1}^{N} W_{ci}\, \mathbf{x}_i}{\sum_{i=1}^{N} W_{ci}}$

(17) $\mathbf{m} = \frac{\sum_{c=1}^{C} \sum_{i=1}^{N} W_{ci}\, \mathbf{x}_i}{\sum_{c=1}^{C} \sum_{i=1}^{N} W_{ci}}$
A correlation matrix $\mathbf{C} \in \mathbb{R}^{C \times C}$ is computed using the class labels of each pair of classes:

(18) $C_{kl} = \frac{\langle \mathbf{y}^{(k)}, \mathbf{y}^{(l)} \rangle}{\|\mathbf{y}^{(k)}\| \, \|\mathbf{y}^{(l)}\|}$,

where $\mathbf{y}^{(k)}, \mathbf{y}^{(l)} \in \{0,1\}^{N}$ are the label vectors of classes $k$ and $l$, i.e., the rows of $\mathbf{Y}$. The label correlation information can reveal whether two classes are closely related or not. The correlation matrix is then used to compute the weight factors in Eqs. (13)-(17). To tackle the over-counting problem [Wang2010], the weight factors are normalized with the $\ell_1$ norm:

(19) $\mathbf{w}_i = \frac{\mathbf{C} \mathbf{y}_i}{\|\mathbf{C} \mathbf{y}_i\|_1}$,

where $\mathbf{y}_i$ is the label vector of the $i$th sample. The resulting weights were used directly in [Wang2010], while we exploit them in a different manner in our work.
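Under this reading of Eqs. (18)-(19), the correlation-based weights can be sketched as follows; the helper name and the small guards against empty classes are our own additions, not part of the original method:

```python
import numpy as np

def correlation_weights(Y):
    """Correlation-based weights: Y is a C x N binary label matrix whose
    rows are class label vectors. Builds the cosine-similarity class-pair
    matrix and returns a C x N weight matrix whose columns are
    l1-normalized to counter over-counting."""
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    Yn = Y / np.maximum(norms, 1e-12)                # guard against empty classes
    C = Yn @ Yn.T                                    # class-pair correlations
    W = C @ Y                                        # propagate correlations to instances
    return W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)
```

Each column of the result sums to one, so an instance contributes a total weight of one across all classes.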
Various weight matrices have been introduced to improve the performance of LDA on multi-label classification tasks [Wang2010], [Xu2018b], [Oikonomou2013]. Such strategies yield a more suitable projection subspace compared to other dimensionality reduction algorithms [Wang2010], such as principal component analysis (PCA), multi-label dimensionality reduction via dependence maximization (MDDM), or multi-label least squares (MLLS).
In [Oikonomou2013], MLDA was extended to Direct MLDA by changing the definition of $\mathbf{S}_b$ in a way that allows obtaining a higher-dimensional subspace than the original MLDA, where the subspace dimensionality is limited by the rank of $\mathbf{S}_b$. This extension further improved the results in multi-label video classification tasks. Another extension, multi-label discriminant analysis with locality consistency (MLDA-LC) [Yuan2014], not only preserves the global class label information as MLDA does, but also incorporates a graph-regularized term to utilize local geometric information. MLDA-LC reveals the similarity among nearby instances in the projection space by incorporating the graph Laplacian matrix into the MLDA approach, which further improves the classification performance on multi-label data sets compared to the MLDA and/or MLLS algorithms.
II-B4 Weighted multi-label linear discriminant analysis
A weighted multi-label LDA (wMLDA) approach focusing on linear feature extraction for multi-label classification was proposed in [Xu2018b]. In wMLDA, a multi-label classifier is composed of several single-label classifiers according to the number of classes, and a weight matrix is simultaneously calculated based on various metrics to embody the contribution of each instance in the scatter matrix calculation. Various metrics can be used to measure the relationships among instances based on their labels and/or features. The wMLDA approach employs the correlation-based weight form [Wang2010], entropy-based weight form [Chen2007], binary-based weight form [Park2008a], fuzzy-based weight form [Lin2010], and dependence-based weight form [Xu2018b]. In this work, we exploit the same metrics, while using this information in a novel way. We provide a detailed explanation of the metrics in Section III-B.
In [Xu2018b], the scatter matrices $\mathbf{S}_w$, $\mathbf{S}_b$, and $\mathbf{S}_t$ are redefined to exploit the prior information for weighting. Firstly, a non-negative weight matrix of the same size as the label matrix is defined to describe the weight of each instance for its corresponding classes:

(20) $\mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_N] \in \mathbb{R}_{+}^{C \times N}$,

where $\mathbf{w}_i$ (the $i$th column) represents the weight vector of the $i$th instance and the $c$th row $\mathbf{w}^{(c)}$ is the weight vector of the $c$th class. The weight matrix is calculated based on one of the prior information matrices described in Section III-B. Then, $w_c$ and $w$ are defined as the summations of weights for the $c$th class and over all classes:

(21) $w_c = \sum_{i=1}^{N} W_{ci}$

(22) $w = \sum_{c=1}^{C} w_c$

In order to simplify the notation, the row vectors

(23) $\mathbf{w} = [w_1, w_2, \ldots, w_C]$

(24) $\mathbf{f} = [f_1, f_2, \ldots, f_N]$

are defined, where $f_i = \sum_{c=1}^{C} W_{ci}$ is the summation of the weights of the $i$th instance over all classes. Then, the scatter matrices can be redefined as

(25) $\mathbf{S}_w = \sum_{c=1}^{C} \sum_{i=1}^{N} W_{ci} (\mathbf{x}_i - \mathbf{m}_c)(\mathbf{x}_i - \mathbf{m}_c)^T$

(26) $\mathbf{S}_b = \sum_{c=1}^{C} w_c (\mathbf{m}_c - \mathbf{m})(\mathbf{m}_c - \mathbf{m})^T$

(27) $\mathbf{S}_t = \sum_{i=1}^{N} f_i (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^T$

where $\mathbf{m}_c = \frac{1}{w_c}\sum_{i=1}^{N} W_{ci}\,\mathbf{x}_i$ and $\mathbf{m} = \frac{1}{w}\sum_{i=1}^{N} f_i\,\mathbf{x}_i$ are the weighted class and total mean vectors. Under this approach, the optimal projection matrix can still be obtained by solving the generalized eigenproblem corresponding to Eq. (10), as discussed in Section II-B1.
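The weighted scatter computation described above can be sketched as follows. This is a minimal weighted-mean reading of Eqs. (20)-(26), not the reference implementation of [Xu2018b]; the function name and toy layout are our assumptions:

```python
import numpy as np

def weighted_scatter(X, W):
    """Weighted-mean scatter sketch: X is D x N data, W is a C x N
    non-negative weight matrix. Returns (Sw, Sb) computed around the
    weighted class means and the weighted total mean."""
    D, _ = X.shape
    wc = W.sum(axis=1)                               # per-class weight mass
    f = W.sum(axis=0)                                # per-instance mass over classes
    m = (X @ f) / f.sum()                            # weighted total mean
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for c in range(W.shape[0]):
        mc = (X @ W[c]) / max(wc[c], 1e-12)          # weighted class mean
        Xc = X - mc[:, None]
        Sw += (Xc * W[c]) @ Xc.T                     # weighted within-class scatter
        d = (mc - m)[:, None]
        Sb += wc[c] * (d @ d.T)                      # weighted between-class scatter
    return Sw, Sb
```

With a binary single-label weight matrix this reduces exactly to the classical scatter matrices.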
II-C Saliency estimation
Saliency estimation is a standard computer vision task inspired by neurobiological studies [Itti1998] and cognitive psychology [Treisman1980]. Generally, saliency estimation is a preprocessing step for various high-level computer vision tasks, such as object detection [Aytekin2018], [WangSalientSurvey], saliency estimation in omnidirectional images [Battisti2018], and human attention estimation [Choi2016HumanApproach]. Saliency in physiological science is defined as a special kind of perception of the human visual system, by which humans can perceive particular parts of a scene in detail due to colors, textures, or other prominent information contained in these parts [Cheng2011]. These particular parts can be distinguished as foreground from non-salient background parts.

Computational saliency estimation approaches can be categorized into local and global approaches based on the way they process saliency information [Cheng2011]. Local saliency approaches explore the prominent information around the neighborhood of specific pixels/regions, whilst global approaches exploit the rarity of a pixel/patch/region in the whole scene. Since the emergence of the computational saliency estimation field in [Koch1985], various probabilistic approaches have been explored in this topic. In [Jian2018], a saliency map is estimated based on three kinds of prior information on images at the superpixel level. Saumya et al. utilize a generalized Bernoulli distribution to estimate a saliency map in their work [Jetley2016]. Another saliency estimation approach was proposed by Aytekin et al. [Aytekin2018] for segmenting salient objects in an image using probabilistic estimation, where a probability mass function $p$ depicts whether a region (pixel, superpixel, or patch) in an image is considered a distinct region. The higher the value of $p$ for a region, the more prominent the region is. $p$ is solved by optimizing two terms working simultaneously to allocate not only lower probabilities to non-salient regions but also similar probabilities to similar regions as follows:
(28) $\min_{\mathbf{p}} \; \sum_{i} p_i^2 v_i + \frac{1}{2} \sum_{i,j} (p_i - p_j)^2 a_{ij}$,

where the first term suppresses the probability of a non-prominent region $i$ using its prior information $v_i$. In the second term, a high similarity of regions $i$ and $j$, given as a high similarity value $a_{ij}$, forces the regions to have similar probabilities.

The optimization task in Eq. (28) can be expressed using matrix notation as

(29) $\mathbf{p}^{*} = \arg\min_{\mathbf{p}} \; \mathbf{p}^T \mathbf{H} \mathbf{p}, \quad \mathbf{H} = \mathbf{V} + \mathbf{D} - \mathbf{A}$

(30) $\text{s.t.} \;\; \mathbf{1}^T \mathbf{p} = 1$,

where $\mathbf{p}$ is a probability vector that depicts the probability of each element or region to be salient, i.e., $p_i = p(i)$. $\mathbf{A}$ is an affinity matrix, which denotes the similarity of each pair of regions $i$ and $j$ as $A_{ij} = a_{ij}$. $\mathbf{D}$ is a diagonal matrix having elements $D_{ii} = \sum_{j} a_{ij}$. $\mathbf{V}$ is a diagonal matrix having elements $V_{ii} = v_i$. Then, the Lagrange multiplier method is employed:

(31) $L(\mathbf{p}, \lambda) = \mathbf{p}^T \mathbf{H} \mathbf{p} - \lambda (\mathbf{1}^T \mathbf{p} - 1)$

A global optimum is obtained by setting the partial derivative of Eq. (31) with respect to $\mathbf{p}$ to zero. The final optimized probability vector is

(32) $\mathbf{p}^{*} = \frac{\mathbf{H}^{-1} \mathbf{1}}{\mathbf{1}^T \mathbf{H}^{-1} \mathbf{1}}$
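The closed-form solution of Eqs. (29)-(32) amounts to a single linear solve. A minimal sketch (the ridge term is our own numerical safeguard, not part of the original formulation):

```python
import numpy as np

def pse(A, v, ridge=1e-8):
    """Probabilistic saliency estimation sketch: A is an n x n affinity
    matrix, v holds the per-region non-saliency priors. Minimizing
    p^T H p subject to sum(p) = 1 with H = diag(v) + D - A has the
    closed-form solution p = H^{-1} 1 / (1^T H^{-1} 1)."""
    n = A.shape[0]
    H = np.diag(v) + np.diag(A.sum(axis=1)) - A + ridge * np.eye(n)
    h = np.linalg.solve(H, np.ones(n))               # H^{-1} 1
    return h / h.sum()                               # probabilities sum to 1
```

A region with a large non-saliency prior receives a lower probability, while strongly connected regions receive similar probabilities.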
III Proposed Method
We propose a novel saliency-based weighted linear discriminant analysis method for multi-label classification tasks, where the saliency-based weight factors are calculated based on the probabilistic saliency estimation approach and specific prior information of the input data. In this section, we describe our novel Saliency-based Weighted Multi-label Linear Discriminant Analysis (SwMLDA) approach in detail.
We calculate a saliency-based weight matrix based on probabilistic saliency estimation with various kinds of prior information: binary [Park2008a], correlation [Wang2010], entropy [Chen2007], fuzzy [Lin2010], dependence [Xu2018b], and misclassification [Xu2018c]. The weight matrix is denoted as

(33) $\mathbf{Q} = [\mathbf{q}_1, \mathbf{q}_2, \ldots, \mathbf{q}_N] \in \mathbb{R}_{+}^{C \times N}$,

where $\mathbf{q}_i$ (the $i$th column) represents the optimal weight vector of the $i$th instance and the $c$th row $\mathbf{q}^{(c)}$ is the weight vector of the $c$th class. The details for computing the probabilistic weight matrix are given in the next subsections. After forming $\mathbf{Q}$, the proposed method proceeds as wMLDA by using the weights in the scatter matrices $\mathbf{S}_w$ and $\mathbf{S}_b$:

(34) $\mathbf{S}_w = \sum_{c=1}^{C} \sum_{i=1}^{N} Q_{ci} (\mathbf{x}_i - \mathbf{m}_c)(\mathbf{x}_i - \mathbf{m}_c)^T$

(35) $\mathbf{S}_b = \sum_{c=1}^{C} (\mathbf{m}_c - \mathbf{m})(\mathbf{m}_c - \mathbf{m})^T$

where $\mathbf{m}_c = \sum_{i=1}^{N} Q_{ci}\,\mathbf{x}_i$. Note that our method normalizes the weight vectors so that the sum of the weights for each class is always 1. Therefore, the vector $\mathbf{w}$ in Eq. (23) is an all-ones vector and $w = C$. Finally, the optimal projection matrix is obtained from Eq. (10) by solving the corresponding generalized eigenvalue problem.
III-A Saliency-based weight factors
We extend the probabilistic saliency estimation approach [Aytekin2018] described in Section II-C to express the saliency of each instance for its class(es). To this end, we formulate the prior information in $\mathbf{V}$ and the affinities in $\mathbf{A}$ so that the resulting probabilities describe the saliency of each item for class $c$. In an initial work [Xu2018c], we used saliency-based weight factors to tackle suboptimal classification results caused by imbalanced data sets and/or outliers in single-label classification using LDA-based algorithms. Here, we exploit the saliency-based weight factors to tackle multi-label classification tasks.
We calculate the saliency-based weight factors separately for each class in the spirit of PT approaches. For each class $c$, we consider only the samples belonging to the class; thus, the probability vector $\mathbf{p}_c$ has $N_c$ elements. $\mathbf{p}_c$ is computed following Eq. (32) as

(36) $\mathbf{p}_c = \frac{\mathbf{H}_c^{-1} \mathbf{1}}{\mathbf{1}^T \mathbf{H}_c^{-1} \mathbf{1}}$,

where $\mathbf{H}_c$ constitutes three terms as $\mathbf{H}_c = \mathbf{V}_c + \mathbf{D}_c - \mathbf{A}_c$. To form $\mathbf{q}^{(c)}$ from $\mathbf{p}_c$, the elements of $\mathbf{p}_c$ are placed at their corresponding positions in $\mathbf{q}^{(c)}$ and the values for items not belonging to class $c$ are set to zero. We then form the weight matrix $\mathbf{Q}$ by placing the weight vectors $\mathbf{q}^{(c)}$ as its rows.
$\mathbf{A}_c$ is an affinity matrix obtained through a graph representation. That is, for each class $c$, we form its corresponding graph $\mathcal{G}_c = \{\mathbf{X}_c, \mathbf{A}_c\}$, where $\mathbf{X}_c$ is a matrix formed by the instances belonging to class $c$ and $\mathbf{A}_c$ is a graph weight matrix expressing the similarity between each pair of instances in class $c$. In our experiments, we use a fully connected graph and obtain $\mathbf{A}_c$ with a heat kernel function formulated as

(37) $[\mathbf{A}_c]_{ij} = \exp\!\left( -\frac{\|\mathbf{x}_i^{(c)} - \mathbf{x}_j^{(c)}\|^2}{2\sigma^2} \right)$,

where $\mathbf{x}_i^{(c)}$ and $\mathbf{x}_j^{(c)}$ are the $i$th and $j$th instances of class $c$, and $\|\cdot\|$ denotes the Euclidean norm. $\sigma$ is set to a constant value. $\mathbf{D}_c$ is a diagonal matrix and each of its elements is calculated based on $\mathbf{A}_c$ as $[\mathbf{D}_c]_{ii} = \sum_{j} [\mathbf{A}_c]_{ij}$.
$\mathbf{V}_c$ is a diagonal matrix which carries the prior information on each instance of class $c$ being salient for its class, based on the metrics presented in the next subsection. Its diagonal values, ranging from 0 to 1, relate inversely to the resulting weight factors: the lower a value $[\mathbf{V}_c]_{ii}$, the more prominent the corresponding instance is expected to be based on the prior knowledge. We introduce six different prior information matrices exploiting the label and/or feature information of each class, which produce six different variants of the proposed approach.
After computing the prior information matrix $\mathbf{V}_c$ and the affinity matrix $\mathbf{A}_c$ for class $c$, we follow the PSE approach in Eq. (36) to calculate the saliency score vector for class $c$. In order to avoid singularity during this process, a regularized version of $\mathbf{H}_c$ with a small value $\epsilon$ added to the diagonal elements is used. The summation of the values in the saliency-based weight vector of each class is one, which is expected to alleviate the over-counting problem.
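Putting Eqs. (36)-(37) together, the per-class weight computation can be sketched as below. The layout of the prior array and the parameter names are our assumptions; only class members enter each linear solve, and each resulting row sums to one as required:

```python
import numpy as np

def class_saliency_weights(X, Y, priors, sigma=1.0, eps=1e-6):
    """Per-class saliency weights: X is D x N, Y is C x N binary labels,
    priors is a C x N array whose entry (c, i) is the non-saliency prior
    of instance i for class c (only members of class c are used).
    Each returned row sums to one over the members of its class."""
    C, N = Y.shape
    Q = np.zeros((C, N))
    for c in range(C):
        idx = np.flatnonzero(Y[c])
        Xc = X[:, idx]
        sq = ((Xc[:, :, None] - Xc[:, None, :]) ** 2).sum(axis=0)
        A = np.exp(-sq / (2.0 * sigma ** 2))         # heat-kernel affinity
        H = (np.diag(priors[c, idx]) + np.diag(A.sum(axis=1)) - A
             + eps * np.eye(len(idx)))               # regularized H_c
        h = np.linalg.solve(H, np.ones(len(idx)))
        Q[c, idx] = h / h.sum()                      # row-normalized weights
    return Q
```

Instances outside a class receive a zero weight for that class, matching the placement of $\mathbf{p}_c$ into $\mathbf{q}^{(c)}$ described above.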
III-B Prior information matrices
III-B1 Misclassification-based prior information matrix (SwMLDA-m)
This approach was defined in [Xu2018c] to alleviate the suboptimal results of LDA arising from outlier instances in imbalanced data sets. We utilize the misclassification-based prior information to generate a diagonal matrix $\mathbf{V}_c$ based on the probability of each instance belonging to class $c$ being more salient for another class:

(38)

where $\mathbf{x}_i^{(c)}$ is the $i$th instance of class $c$ and $\mathbf{m}_k$ is the mean vector of class $k \neq c$. In this approach, a sample which is closer to another class is considered less salient for class $c$, even if it is relatively close to the center of class $c$.
III-B2 Correlation-based prior information matrix (SwMLDA-c)
As in [Wang2010], label correlation information is represented by the class-pair matrix defined in Eq. (18). For each instance $\mathbf{x}_i$, the normalized weight vector is calculated by Eq. (19). We compute the weight factors separately for each class and, after obtaining them for all instances, we select the elements corresponding to the instances of class $c$ and formulate the correlation-based prior information matrix $\mathbf{V}_c$ of the class from them.
Label correlation information is widely exploited to tackle the redundancy of label information in various multi-label tasks [Wang2010], [Zhu2018]. However, it can lead to suboptimal results due to non-zero values in the correlation weight factor matrix for irrelevant labels [Xu2018b]. Because we calculate the correlation-based prior information matrix for each class separately, the non-zero values of unrelated label pairs can be avoided.
III-B3 Binary-based prior information matrix (SwMLDA-b)
The binary-based approach directly utilizes the label information as in [Park2008a]. In our formulation, this approach reduces to having an equal value in $\mathbf{V}_c$ for all instances, as only instances belonging to class $c$ are considered in $\mathbf{V}_c$. For wMLDA, such direct use of the class labels leads to an over-counting problem in the scatter matrices. In our formulation, this problem is avoided because $\mathbf{V}_c$ merely represents the prior information of non-salient instances and the final weight matrix is normalized for each class.
III-B4 Entropy-based prior information matrix (SwMLDA-e)
We utilize an entropy metric on the label information to form the prior information matrix of each class $c$, as in [Xu2018b], [Chen2007]. For each instance $\mathbf{x}_i$, its number of relevant labels is calculated as

(39) $l_i = \sum_{c=1}^{C} Y_{ci}$

and its entropy is given as

(40)

Thus, the entropy is higher when there are fewer relevant labels. The probability for an instance to be relevant to class $c$ is

(41)

The entropy-based prior information of each instance for its class(es) is defined as follows:

(42)

Finally, the diagonal matrix $\mathbf{V}_c$ has these values as its diagonal elements.
III-B5 Fuzzy-based prior information matrix (SwMLDA-f)
The fuzzy $c$-means clustering algorithm (FCM) [Bezdek1981] is an extension of $k$-means, where an instance can belong to multiple clusters with different degrees. The membership degree of instance $i$ in class $c$ is indicated with a weight factor $u_{ci}$ [Dembczynski2012]. In our work, a supervised version of the fuzzy $c$-means clustering algorithm (SFCM) [Xu2018b], [Lin2010] is exploited to obtain the prior information matrix.
As in [Xu2018b], we optimize the following:

(43) $\min_{\mathbf{U}, \{\mathbf{m}_c\}} \; \sum_{c=1}^{C} \sum_{i=1}^{N} Y_{ci}\, u_{ci}^2\, \|\mathbf{x}_i - \mathbf{m}_c\|^2, \quad \text{s.t.} \;\; \sum_{c: Y_{ci}=1} u_{ci} = 1, \; \forall i,$

where $\mathbf{m}_c$ represents the fuzzy centroid of class $c$ and $u_{ci}$ denotes the membership of instance $i$ to class $c$. The constraint forces the weights of each instance to sum to one over its relevant classes. The constrained optimization problem in Eq. (43) can be solved by Lagrangian optimization with multipliers $\lambda_i$, where

(44) $L = \sum_{c=1}^{C} \sum_{i=1}^{N} Y_{ci}\, u_{ci}^2\, \|\mathbf{x}_i - \mathbf{m}_c\|^2 - \sum_{i=1}^{N} \lambda_i \left( \sum_{c: Y_{ci}=1} u_{ci} - 1 \right)$

After taking the partial derivatives of $L$ with respect to $u_{ci}$, $\mathbf{m}_c$, and $\lambda_i$ and setting them to zero, we get

(45) $u_{ci} = \frac{\|\mathbf{x}_i - \mathbf{m}_c\|^{-2}}{\sum_{k: Y_{ki}=1} \|\mathbf{x}_i - \mathbf{m}_k\|^{-2}}$

(46) $\mathbf{m}_c = \frac{\sum_{i=1}^{N} Y_{ci}\, u_{ci}^2\, \mathbf{x}_i}{\sum_{i=1}^{N} Y_{ci}\, u_{ci}^2}$
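Assuming the standard fuzzifier-2 membership update (our reading of the SFCM solution), a sketch of the supervised membership computation restricted to relevant labels could look like:

```python
import numpy as np

def fuzzy_memberships(X, Y, centers):
    """Supervised fuzzy membership sketch (fuzzifier 2): memberships are
    inverse-squared distances to the class centroids, normalized over
    each instance's relevant classes only. X is D x N, Y is C x N binary
    labels, centers is D x C with one centroid per class (hypothetical
    layout chosen for this sketch)."""
    C, N = Y.shape
    U = np.zeros((C, N))
    for i in range(N):
        rel = np.flatnonzero(Y[:, i])                # relevant classes of instance i
        d2 = ((X[:, i, None] - centers[:, rel]) ** 2).sum(axis=0)
        inv = 1.0 / np.maximum(d2, 1e-12)
        U[rel, i] = inv / inv.sum()                  # memberships sum to one
    return U
```

In the full SFCM, this membership update would alternate with a centroid update until convergence; here only the membership step is shown.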
III-B6 Dependence-based prior information matrix (SwMLDA-d)
Dependence-based weights were proposed in [Xu2018b]. They are based on the Hilbert-Schmidt independence criterion (HSIC) [Gretton2005], which describes the statistical dependence between features and labels based on the estimation of Hilbert-Schmidt norms. We follow the definition of HSIC in [Xu2018b] as

(47) $\mathrm{HSIC} = \frac{1}{(N-1)^2} \operatorname{tr}\!\left( \mathbf{K}\, \bar{\mathbf{H}}\, \mathbf{L}\, \bar{\mathbf{H}} \right)$,

where $\mathbf{K}$ is a kernel matrix over the features and $\mathbf{L}$ is a kernel matrix over the weighted labels $\mathbf{W} \circ \mathbf{Y}$. $\bar{\mathbf{H}}$ is a centering matrix, which is represented as $\bar{\mathbf{H}} = \mathbf{I} - \frac{1}{N} \mathbf{1}\mathbf{1}^T$, where $\mathbf{I}$ denotes an identity matrix and $\mathbf{1}$ denotes an all-one vector. $\circ$ denotes the Hadamard, i.e., element-wise, product of two matrices or vectors. To find the weight matrix $\mathbf{W}$ that maximizes the HSIC, we solve the following optimization problem using the iterative approach described in [Xu2018b]:

(48)

This approach transforms a multi-label task into several single-label tasks [Xu2018b]. It allocates a weight of 1 to only one prominent class for each instance after the final iteration. In our probabilistic formulation, the diagonal matrix $\mathbf{V}_c$ takes its elements from these weights.
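The empirical HSIC estimator with a centering matrix can be sketched as below. Note this only illustrates the criterion itself, not the iterative weight optimization of [Xu2018b], and the linear kernels used for testing are our own choice:

```python
import numpy as np

def hsic(K, L):
    """Empirical HSIC sketch: K and L are n x n kernel matrices over the
    features and the labels; Hc is the centering matrix I - (1/n) 1 1^T."""
    n = K.shape[0]
    Hc = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ Hc @ L @ Hc) / (n - 1) ** 2)
```

A constant label kernel yields an HSIC of zero (no dependence), while a label kernel identical to the feature kernel yields a positive value.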
IV Experiments
In our work, we test our approach on ten multi-label databases and compare the final results with six competing methods using five evaluation metrics. We use the Matlab codes provided for [Xu2018b] in the comparative experiments and exploit the relevant parts also in the implementation of our proposed method. In the following subsections, we present the ten databases, the implementation details, the evaluation metrics, and the classification results.
IV-A Databases
We perform our experiments on 10 publicly available multi-label databases: Yeast [Nakai1992], Scene [Boutell2004], Cal500 [Turnbull2008], Medical [Pestian2007], TMC2007-500 [Srivastava2005], Corel16k001 [Barnard2003], PlantGO, Image, HumanGO, and Enron. The contents of these databases include text, images, and acoustic clips. The numbers of classes and features of these databases are shown in Table I. 'Cardinality' gives the mean number of class labels per instance in the database.
Database  Contents  Train Instances  Test Instances  Classes  Attributes  Cardinality 

Yeast  Biology  1500  917  14  103  4.24 
PlantGO  Biology  588  390  12  3091  1.08 
Image  Scene  1200  800  5  294  1.24 
Scene  Scene  1211  1196  6  294  1.07 
Enron  Text  1123  579  53  1001  3.38 
Cal500  Music  300  202  174  68  26.04 
HumanGO  Biology  1862  1244  14  9845  1.19 
Medical  Text  645  333  45  1449  1.25 
TMC2007-500  Text  21519  7077  22  500  2.16 
Corel16k001  Scene  5188  1744  153  103  4.24 
IV-B Experimental setup
After the eigendecomposition, we retained the eigenvectors corresponding to the largest eigenvalues preserving 99.9% of the cumulative eigenvalue sum. The classifier used in the experiments is a multi-label k-nearest neighbor classifier (ML-KNN) [Zhang2007] with the same neighborhood size as in [Xu2018b]. ML-KNN utilizes the k-nearest neighbor algorithm and the maximum a posteriori (MAP) principle to tackle the multi-label categorization task. ML-KNN first estimates the prior and posterior probabilities of each instance for each class from a training dataset based on frequency counting [Zhang2007]. Then, the predicted probabilities on a test dataset are calculated based on the prior and posterior probabilities on the training dataset using the Bayesian rule. The predicted labels are obtained by setting a threshold for the predicted probabilities.
IV-C Performance evaluation
We adopt five different evaluation metrics [Zhang2010], [Park2019] to evaluate the performance of our proposed algorithm: one error, normalized coverage, ranking loss, hamming loss, and macro-F1. We introduce them in the following. Here, we denote the ground-truth label matrix of the $M$ test samples as $\mathbf{Y} \in \{0,1\}^{C \times M}$, where the $i$th column $\mathbf{y}_i$ represents the label vector of test sample $i$.
The predicted label matrix is denoted as $\hat{\mathbf{Y}}$ and $\hat{\mathbf{y}}_i$ is the predicted label vector of test sample $i$. We use $\mathbf{P}$ for the predicted probabilities, where $P_{ci}$ denotes the membership of instance $i$ in class $c$. For each instance $i$, the classes can be ranked in descending order of the probabilities in $\mathbf{P}$, and $\mathrm{rank}_i(c)$ denotes the position of class $c$ in this ordered list. $\mathcal{Y}_i$ is used to denote the indices of the relevant (positive) classes of instance $i$ and $\bar{\mathcal{Y}}_i$ denotes the indices of the negative classes.

One error shows how often the top ranked class is not among the positive ground truth labels. Lower values of this metric indicate better performance.
(49) where denotes the first class in the sorted list .
(50) 
Normalized coverage demonstrates how far, on average, in the predicted label ranking one needs to go to cover all the ground-truth labels of an instance. A smaller coverage value indicates better performance:
$$\text{Coverage} = \frac{1}{C}\left(\frac{1}{M}\sum_{i=1}^{M}\max_{c \in \mathcal{Y}_i^{+}} r_i(c) - 1\right), \qquad (51)$$
where $r_i(c)$ gives the position of the relevant class $c$ in the ordered list $\sigma_i$.

Ranking loss evaluates, for each item, all relevant vs. irrelevant class pairs and gives the fraction of pairs where the irrelevant class is ranked above the relevant one. Smaller values of this metric indicate a better performance. Here, we use $|\mathcal{Y}_i^{+}|$ and $|\mathcal{Y}_i^{-}|$ to denote the numbers of relevant and negative classes in $\mathbf{y}_i$:
$$\text{RankLoss} = \frac{1}{M}\sum_{i=1}^{M} \frac{n_i}{|\mathcal{Y}_i^{+}|\,|\mathcal{Y}_i^{-}|}, \qquad (52)$$
$$n_i = \Big|\big\{(c^{+}, c^{-}) : p_{c^{+}i} \le p_{c^{-}i},\ c^{+} \in \mathcal{Y}_i^{+},\ c^{-} \in \mathcal{Y}_i^{-}\big\}\Big|, \qquad (53)$$
where $n_i$ is used to denote the count of wrong rankings for item $i$.

Hamming loss shows the rate of misclassified predicted values using an XOR comparison between predicted labels and ground-truth labels. Smaller values of this metric indicate a better performance:
$$\text{HammingLoss} = \frac{1}{MC}\sum_{i=1}^{M}\sum_{c=1}^{C} \hat{y}_{ci} \oplus y_{ci}, \qquad (54)$$
where $\oplus$ denotes the XOR operation.
Macro-F1 is the average of the per-class F1 scores, which reflects how reliably the positive labels of each class are predicted. Higher values of this metric indicate a better performance.
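To make the five metrics concrete, the following is a minimal illustrative NumPy/scikit-learn sketch (not the evaluation code used in our experiments); `one_error` and `normalized_coverage` are our own helper names, while the remaining three metrics are taken directly from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import coverage_error, label_ranking_loss, hamming_loss, f1_score

def one_error(Y_true, P):
    """Fraction of test samples whose top-ranked class is not a true label.
    Y_true: (M, C) binary ground truth; P: (M, C) predicted probabilities."""
    top = np.argmax(P, axis=1)
    return float(np.mean(Y_true[np.arange(len(P)), top] == 0))

def normalized_coverage(Y_true, P):
    """Average ranking depth needed to cover all true labels, minus one,
    normalized by the number of classes (coverage_error ranks from 1)."""
    return (coverage_error(Y_true, P) - 1) / Y_true.shape[1]

# Toy example: 3 test samples, 4 classes
Y = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 1]])
P = np.array([[0.9, 0.1, 0.8, 0.2],
              [0.3, 0.7, 0.1, 0.2],
              [0.6, 0.8, 0.7, 0.4]])
Y_pred = (P >= 0.5).astype(int)  # thresholded predicted labels

print(one_error(Y, P))                       # top-ranked class vs. true labels
print(normalized_coverage(Y, P))
print(label_ranking_loss(Y, P))              # fraction of wrongly ordered pairs
print(hamming_loss(Y, Y_pred))               # XOR-based misclassification rate
print(f1_score(Y, Y_pred, average='macro'))  # average per-class F1
```

Note that the first three metrics consume the raw probabilities, while hamming loss and macro-F1 only see the thresholded labels; this distinction matters for the discussion of the results below.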
Table II: One-error results (lower is better) for the reference methods and the variants of the proposed saliency-based methods.
Dataset  DMLDA  wMLDA_{c}  wMLDA_{b}  wMLDA_{e}  wMLDA_{f}  wMLDA_{d}  SwMLDA_{m}  SwMLDA_{c}  SwMLDA_{b}  SwMLDA_{e}  SwMLDA_{f}  SwMLDA_{d}
Yeast  0.2399  0.2486  0.2410  0.2475  0.2530  0.2497  0.2474  0.2530  0.2530  0.2421  0.2432  
Plant  0.7564  0.7359  0.6069  0.7436  0.7359  0.7410  0.6692  0.6615  0.6667  0.6564  0.6590  
Image  0.4975  0.3413  0.3613  0.3400  0.3463  0.3463  0.3325  0.3150  0.3163  0.3263  0.3213  
Scene  0.4983  0.3286  0.3202  0.3269  0.3286  0.3202  0.2542  0.2408  0.2425  0.2416  
Enron  0.7636  0.8061  0.7242  0.7000  0.6909  0.5924  0.5348  0.5833  0.5833  0.5455  0.5576  
Cal500  0.1089  0.1386  0.1139  0.1089  0.1139  0.1139  
Human  0.6849  0.6174  0.6069  0.6174  0.6094  0.5997  0.6109  0.6109  0.6045  0.6029  0.5916  
Medical  0.3964  0.2613  0.2252  0.2342  0.2222  0.2312  0.2162  0.2012  0.2042  0.1922  0.1922  
TMC2007  0.2021  0.1498  0.1492  0.1495  0.1499  0.1584  0.1561  0.1557  0.1553  0.1537  0.1501  
Corel16k001  0.7414  0.7259  0.7242  0.7288  0.7208  0.7104  0.7299  0.7185  0.7150  0.7150  0.7225 
Table III: Normalized coverage results (lower is better) for the reference methods and the variants of the proposed saliency-based methods.
Dataset  DMLDA  wMLDA_{c}  wMLDA_{b}  wMLDA_{e}  wMLDA_{f}  wMLDA_{d}  SwMLDA_{m}  SwMLDA_{c}  SwMLDA_{b}  SwMLDA_{e}  SwMLDA_{f}  SwMLDA_{d}
Yeast  0.5187  0.5119  0.5072  0.5012  0.5003  0.5097  0.4991  0.4987  0.4962  0.4975  0.4964  
Plant  0.2646  0.2846  0.2355  0.2900  0.2984  0.2797  0.2303  0.2387  0.2282  0.2359  0.2408  
Image  0.3528  0.2619  0.2641  0.2656  0.2659  0.2656  0.2416  0.2313  0.2284  0.2300  0.2269  
Scene  0.2547  0.1584  0.1574  0.1547  0.1567  0.1515  0.1112  0.1110  0.1110  0.1109  0.1139  
Enron  0.3457  0.3862  0.3650  0.3479  0.3545  0.3361  0.3043  0.3058  0.3095  0.3073  0.3032  
Cal500  0.7533  0.7477  0.7511  0.7517  0.7486  0.7472  0.7462  0.7467  0.7468  0.7469  0.7469  
Human  0.2127  0.1969  0.1945  0.1964  0.1971  0.1945  0.1855  0.1855  0.1834  0.1845  0.1835  
Medical  0.0819  0.0779  0.0819  0.0704  0.0716  0.0762  0.0678  0.0659  0.0678  0.0665  0.0634  
TMC2007  0.1148  0.0983  0.0974  0.0979  0.0973  0.1024  0.0994  0.0970  0.0972  0.0973  0.0975  
Corel16k001  0.3956  0.3771  0.3698  0.3740  0.3731  0.3779  0.3698  0.3677  0.3687  0.3698  0.3639 
Table IV: Ranking loss results (lower is better) for the reference methods and the variants of the proposed saliency-based methods.
Dataset  DMLDA  wMLDA_{c}  wMLDA_{b}  wMLDA_{e}  wMLDA_{f}  wMLDA_{d}  SwMLDA_{m}  SwMLDA_{c}  SwMLDA_{b}  SwMLDA_{e}  SwMLDA_{f}  SwMLDA_{d}
Yeast  0.1900  0.1827  0.1823  0.1799  0.1808  0.1813  0.1744  0.1786  0.1777  0.1761  0.1748  
Plant  0.2577  0.2763  0.2817  0.2878  0.2713  0.2196  0.2254  0.2300  0.2199  0.2274  0.2315  
Image  0.2878  0.1948  0.1978  0.1986  0.2000  0.1992  0.1771  0.1599  0.1653  0.1667  0.1652  
Scene  0.2321  0.1380  0.1367  0.1338  0.1360  0.1318  0.0909  0.0892  0.0896  0.0900  0.0929  
Enron  0.1739  0.2012  0.1742  0.1639  0.1682  0.1537  0.1306  0.1330  0.1330  0.1336  0.1329  
Cal500  0.1882  0.1900  0.1882  0.1865  0.1863  0.1854  0.1860  0.1854  0.1854  0.1855  0.1865  
Human  0.1907  0.1712  0.1702  0.1712  0.1721  0.1702  0.1602  0.1612  0.1604  0.1609  0.1603  
Medical  0.0682  0.0571  0.0648  0.0527  0.0498  0.0570  0.0462  0.0480  0.0489  0.0482  0.0461  
TMC2007  0.0375  0.0269  0.0266  0.0268  0.0264  0.0289  0.0279  0.0264  0.0264  0.0263  
Corel16k001  0.1962  0.1894  0.1863  0.1872  0.1864  0.1890  0.1866  0.1857  0.1863  0.1868  0.1825 
Table V: Hamming loss results (lower is better) for the reference methods and the variants of the proposed saliency-based methods.
Dataset  DMLDA  wMLDA_{c}  wMLDA_{b}  wMLDA_{e}  wMLDA_{f}  wMLDA_{d}  SwMLDA_{m}  SwMLDA_{c}  SwMLDA_{b}  SwMLDA_{e}  SwMLDA_{f}  SwMLDA_{d}
Yeast  0.2077  0.2046  0.2028  0.2035  0.2049  0.2091  0.2038  0.2059  0.2047  0.2046  0.2049  
Plant  0.1171  0.0924  0.1184  0.1201  0.1081  0.0947  0.1017  0.1010  0.1021  0.1068  0.0987  
Image  0.2310  0.1893  0.1898  0.1860  0.1883  0.1828  0.1738  0.1703  0.1723  0.1698  0.1713  
Scene  0.1683  0.1182  0.1185  0.1198  0.1256  0.1172  0.0975  0.0917  0.0917  0.0949  0.0943  
Enron  0.0669  0.0721  0.0668  0.0645  0.0664  0.0565  0.0585  0.0585  0.0563  0.0565  0.0549  
Cal500  0.1392  0.1394  0.1393  0.1386  0.1383  0.1391  0.1390  0.1398  0.1388  0.1386  0.1383  
Human  0.0943  0.0908  0.0924  0.0923  0.0908  0.0845  0.0891  0.0887  0.0868  0.0880  0.0874  
Medical  0.0225  0.0172  0.0225  0.0167  0.0161  0.0165  0.0159  0.0153  0.0149  0.0153  0.0155  
TMC2007  0.0608  0.0539  0.0529  0.0535  0.0531  0.0571  0.0544  0.0537  0.0531  0.0535  0.0535  
Corel16k001  0.0200  0.0200  0.0200  0.0200  0.0201  0.0200  0.0200  0.0200  0.0200  0.0200 
Table VI: Macro-F1 results (higher is better) for the reference methods and the variants of the proposed saliency-based methods.
Dataset  DMLDA  wMLDA_{c}  wMLDA_{b}  wMLDA_{e}  wMLDA_{f}  wMLDA_{d}  SwMLDA_{m}  SwMLDA_{c}  SwMLDA_{b}  SwMLDA_{e}  SwMLDA_{f}  SwMLDA_{d}
Yeast  0.3174  0.3516  0.3596  0.3532  0.2988  0.3519  0.3342  0.3486  0.3483  0.3475  0.3647  
Plant  0.0185  0.1259  0.1574  0.1216  0.1543  0.1331  0.1461  0.1488  0.1503  0.1583  0.1393  
Image  0.3002  0.5908  0.5738  0.5852  0.5875  0.5774  0.5610  0.5854  0.5956  0.5864  0.5686  
Scene  0.3456  0.6488  0.6523  0.6412  0.6406  0.6489  0.7106  0.7306  0.7269  0.7304  0.7294  
Enron  0.0198  0.0372  0.0483  0.0600  0.0557  0.0331  0.0637  0.0567  0.0524  0.0595  0.0595  
Cal500  0.0526  0.0465  0.0504  0.0501  0.0525  0.0490  0.0520  0.0542  0.0527  0.0522  0.0511  
Human  0.0016  0.1460  0.1493  0.1455  0.1460  0.1380  0.1303  0.1371  0.1300  0.1429  0.1431  
Medical  0.1302  0.1911  0.1302  0.1959  0.1898  0.1916  0.1921  0.2253  0.2043  0.2210  0.2222  
TMC2007  0.4748  0.5917  0.5994  0.5921  0.5928  0.5394  0.6120  0.6125  0.6022  0.6074  0.6147  
Corel16k001  0.0184  0.0353  0.0373  0.0305  0.0304  0.0315  0.0361  0.0379  0.0386  0.0366  0.0447 
IV-D Classification results
Tables II-VI show the experimental results of our approach and the competing methods with the one-error, normalized coverage, ranking loss, hamming loss, and macro-F1 metrics. One error, normalized coverage, and ranking loss directly utilize the probabilities produced by the ML-KNN algorithm in various ways. All versions of our proposed method achieved significant improvements on most databases compared to the reference methods for these first three probability-based metrics. Our method achieved the best result in eight cases out of ten in Tables II and IV, and in nine cases out of ten in Table III.
The remaining two metrics utilize, in different ways, the predicted labels obtained by applying a threshold to the probabilities. We did not adopt a cross-validation strategy to select an optimal threshold value in our experiments, which may lead to suboptimal results. With these last two metrics, the reference methods fared somewhat better than with the former three, but even in the worst case, with hamming loss, our method achieved the best result in six cases out of ten.
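As an illustration of the threshold selection left open above, one could pick the threshold by cross-validation, e.g., sweeping candidate values and keeping the one that maximizes the mean macro-F1 over validation folds. The sketch below is our own hypothetical `select_threshold` helper, not a procedure used in the experiments; in a full pipeline the probabilities themselves would also be re-estimated per fold.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold

def select_threshold(probs, labels, candidates=np.linspace(0.1, 0.9, 17), n_splits=5):
    """Pick the probability threshold maximizing the summed macro-F1 over CV folds.
    probs: (M, C) predicted probabilities on training data; labels: (M, C) binary."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = np.zeros(len(candidates))
    for _, val_idx in kf.split(probs):
        for j, t in enumerate(candidates):
            pred = (probs[val_idx] >= t).astype(int)
            scores[j] += f1_score(labels[val_idx], pred, average='macro', zero_division=0)
    return candidates[int(np.argmax(scores))]

# Toy usage: probabilities concentrated around the true labels, so that any
# threshold between the two clusters separates them perfectly
rng = np.random.default_rng(0)
Y = (rng.random((100, 5)) < 0.3).astype(int)
P = 0.7 * Y + 0.3 * rng.random((100, 5))
t = select_threshold(P, Y)
print(t)  # chosen threshold from the candidate grid
```

A per-class threshold (one value per column) would be a natural refinement for imbalanced label distributions.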
According to the results over all metrics, our misclassification-based prior-information variant is the most efficient and precise one, achieving 15 best results in total among all the test cases (highlighted values in the tables). The second-best variant achieved 11 best results among all the test cases with the different metrics. Moreover, each variant of our algorithm achieved better performance than the corresponding reference method in most cases: for instance, in at least eight cases out of ten with the one-error metric, in eight cases with hamming loss, and in seven cases with macro-F1. This shows that the proposed approach of using the prior information for class saliency estimation generally outperforms using it directly for weighting the items as in [Xu2018b].
V Conclusion
In this paper, we proposed a novel multilabel classification method to tackle the data imbalance and information redundancy problems encountered in multilabel classification tasks. Our method is an extension of MLDA, where the weights are generated with a probabilistic approach that evaluates the saliency of each instance for the different classes. The probabilistic approach uses an affinity matrix to ensure similar results for similar instances and a prior-information matrix to integrate prior knowledge on the prominence of each instance for each class. Our solution can alleviate the data imbalance problem, which is commonly encountered in multilabel databases, as the weight factor vectors are calculated separately for each class. Our method can also alleviate the common overcounting problem. We proposed variants of our method using different prior-information matrices based on both labels and features.
We used five metrics to evaluate the performance of our method against competing methods on ten multilabel datasets. The experimental results show that our method enhanced the classification performance compared to the competing algorithms.
Our algorithm is still based on a linear subspace learning technique. In the future, we will derive a nonlinear extension using the kernel trick. We will also explore the prominence of each feature channel across all instances to calculate the weight factor vector.