Saliency-based Weighted Multi-label Linear Discriminant Analysis

04/08/2020 ∙ by Lei Xu, et al. ∙ Aarhus University ∙ Tampere University

In this paper, we propose a new variant of Linear Discriminant Analysis (LDA) to solve multi-label classification tasks. The proposed method is based on a probabilistic model for defining the weights of individual samples in a weighted multi-label LDA approach. Linear Discriminant Analysis is a classical statistical machine learning method, which aims to find a linear data transformation increasing class discrimination in an optimal discriminant subspace. Traditional LDA makes assumptions of Gaussian class distributions and single-label data annotations. To employ the LDA technique in multi-label classification problems, we exploit intuitions coming from a probabilistic interpretation of class saliency to redefine the between-class and within-class scatter matrices. The saliency-based weights, obtained from various types of prior information encoded as affinities, reveal the probability of each instance to be salient for each of its classes in the multi-label problem at hand. The proposed Saliency-based weighted Multi-label LDA approach is shown to lead to performance improvements in various multi-label classification problems.




I Introduction

Multi-label classification tasks have recently become increasingly common in machine learning, e.g., in text categorization [Li2015], image and video annotation [Qi2007], sequential data prediction [Read2017], or music information retrieval [Trohidis2011]. Multi-label databases exist for various real applications, such as the Yeast database for the prediction of protein localization sites [Nakai1992], the CAL500 database for music retrieval [Turnbull2008], or the Medical database for text classification [Pestian2007].

Compared to single-label problems, the characteristics of multi-label problems are more complicated and unpredictable. In a single-label problem, each instance belongs to exactly one class in a mutually exclusive manner [Wang2010]. Classes in a multi-label problem are not mutually exclusive, which means that each data item can belong to one or several classes. Moreover, different classes contain varying numbers of data items, leading to class imbalance problems [Lu2019]. Hence, in order to solve a multi-label classification problem efficiently and effectively, we need not only to consider the correlation between the class labels and the features of each data item, but also to take into account the different cardinalities of the classes.

As described in [Zhang2014], multi-label classification methods follow either a problem transformation (PT) approach or an algorithm adaptation (AA) approach. Methods following the PT approach utilize single-label classification algorithms to tackle multi-label classification tasks through decomposition, such as the binary relevance (BR) algorithm [Tanaka2015], [ZhangM.L.LiY.K.LiuX.Y.Geng2018] or the label powerset (LP) algorithm [Abdallah2016], [Tsoumakas2010]. The weighted multi-label linear discriminant analysis algorithm (wMLDA) [Xu2018b] combines the decomposition approach with different label and/or feature information to build a multi-label classification method. Methods following the AA approach directly utilize the information of class labels and data items to explore their correlation, e.g., in an extension of the AdaBoost algorithm [Schapire2000] or in the deconvolution-based method of [Streich2008]. Linear Discriminant Analysis (LDA) and its variants have been widely used to extract discriminant data representations for various problems involving supervised dimensionality reduction, e.g., human action recognition [Iosifidis2012b, Iosifidis2014b], [Wu2017], biological data classification [Wang2017, Huang2009a], and facial image analysis [Gao2009]. However, LDA cannot be directly applied to multi-label problems for two reasons: a) the contribution of each data item to the scatter matrices involved in the optimization problem of single-label LDA and its variants cannot be appropriately determined, and b) the cardinalities of the various classes forming the multi-label problem can be quite imbalanced.

In this paper, we propose a novel method for multi-label data classification based on a probabilistic approach that estimates the contribution of each data item to the classes it belongs to by taking into account prior information encoded using various types of metrics. The proposed calculation of each data item's contribution to its classes can not only weight its importance, but also address problems related to imbalanced classes. To this end, we exploit the concept of class saliency introduced in [Xu2018c]. Hence, the proposed method is named Saliency-based Weighted Multi-label Linear Discriminant Analysis (SwMLDA). As a PT approach, SwMLDA exploits both label and feature information with various prior weighting factors, i.e., the binary-based weight form [Park2008a], misclassification-based weight form [Xu2018c], entropy-based weight form [Chen2007], fuzzy-based weight form [Lin2010], dependence-based weight form [Xu2018b], and correlation-based weight form [Wang2010]. The proposed method leads to improved results on 10 publicly available multi-label databases.

We make the following contributions to multi-label classification with our novel SwMLDA approach: (1) we propose using probabilistic saliency estimation in multi-label classification to weight the importance of each item for its classes; (2) we formulate a novel SwMLDA method that uses the saliency-based weights and can alleviate problems related to imbalanced datasets; (3) we integrate label and feature information into SwMLDA by using various types of weighting factors as prior information; (4) we compare our proposed approach to related methods on 10 diverse multi-label data sets, and the results show considerable improvements in multi-label classification tasks.

The remainder of this paper is structured as follows. In Section II, we briefly review the related works, including a precise formulation of LDA and weighted MLDA with the mathematical notation needed to support the derivation of the probabilistic saliency estimation. In Section III, we describe our proposed method in detail. Section IV presents the experimental setup and results on 10 multi-label databases. In Section V, we conclude the paper and discuss potential future work.

II Related Works

In this section, we first briefly present several standard approaches for multi-label classification in subsection II-A. In subsection II-B, we provide a detailed description of the standard LDA, weighted LDA, Multi-label LDA (MLDA), and weighted Multi-label LDA (wMLDA), since they form the theoretical foundation for the proposed work. Subsequently, we introduce the general concepts of saliency estimation and the probabilistic saliency estimation approach needed to develop the proposed method.

II-A General methods for multi-label classification tasks

Various methods have been proposed for solving multi-label classification tasks, such as variants of the Support Vector Machine (SVM) and various feature extraction methods [Wang2010, Xu2018b, Zhang2008]. As PT algorithms, Binary Relevance-based methods [Tanaka2015, ZhangM.L.LiY.K.LiuX.Y.Geng2018, Read2009] decompose a multi-label classification problem into several single-label classification problems in a one-versus-all manner. Another standard PT method is the Label Powerset (LP) algorithm [Abdallah2016], [Pushpa2017], which exploits the dependencies or correlations of class labels to rebuild a labeled subset for a single-label classifier. The traditional SVM algorithm can also act as a PT approach: a multi-label scene classification problem has been decomposed into several single-label problems by following a cross-training strategy.

As an AA approach, the alternating decision tree (ADTree) was proposed to enhance the performance of boosting methods [Yoav1999, DeComite2003]. In [Yoav1999], the alternating decision tree strategy is based on an option tree using boosting. Another decision-tree-related algorithm, ADTboost.MH, was proposed in [DeComite2003] to solve multi-label text and data classification problems by combining the ADTboost algorithm [Yoav1999] and the AdaBoost.MH algorithm [Schapire1999].

II-B Dimensionality reduction algorithms for multi-label classification tasks

Standard LDA and its variants have been applied to tackle various multi-label classification problems [Wang2010, Park2008a, Oikonomou2013, Yuan2014, Nie2009, Siblini2019]. Generally, dimensionality reduction-based methods tackling multi-label classification problems are categorized as unsupervised or supervised, depending on whether class label information is involved [Xu2018b]. The objective of dimensionality reduction-based methods is to determine a projection matrix $\mathbf{W} \in \mathbb{R}^{D \times d}$ mapping the data from the original feature space $\mathbb{R}^{D}$ to a discriminant subspace $\mathbb{R}^{d}$, where $d < D$.

II-B1 Linear Discriminant Analysis

LDA is an effective technique to reduce the dimensionality of the original data as a preprocessing step for single-label classification problems. In the following, we assume that a training set formed by $N$ data points and their class labels is presented as

$$\mathcal{S} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}, \qquad (1)$$

where $\mathbf{x}_i \in \mathbb{R}^{D}$ and $\mathbf{y}_i \in \{0,1\}^{C}$ are the data points and the corresponding label vectors, respectively. The instance matrix is defined as

$$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}. \qquad (2)$$

The label matrix is depicted as

$$\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_N] \in \{0,1\}^{C \times N}. \qquad (3)$$

The label information of element $\mathbf{x}_i$ is represented as $\mathbf{y}_i$: if $\mathbf{x}_i$ belongs to the class $c$, $Y_{ci} = 1$, otherwise $Y_{ci} = 0$. Note that in single-label classification tasks there is a single 1 in each column. Later, we will use the same notation in multi-label classification, where the number of 1s per column is not constrained.

The within-class, between-class, and total scatter matrices $\mathbf{S}_w$, $\mathbf{S}_b$, and $\mathbf{S}_t$, respectively, are defined as follows:

$$\mathbf{S}_w = \sum_{c=1}^{C} \sum_{i:\,Y_{ci}=1} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{T}, \qquad (4)$$

$$\mathbf{S}_b = \sum_{c=1}^{C} N_c\, (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{T}, \qquad (5)$$

$$\mathbf{S}_t = \sum_{i=1}^{N} (\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^{T} = \mathbf{S}_w + \mathbf{S}_b. \qquad (6)$$

$\boldsymbol{\mu}_c$ denotes the mean vector of class $c$:

$$\boldsymbol{\mu}_c = \frac{1}{N_c} \sum_{i:\,Y_{ci}=1} \mathbf{x}_i, \qquad (7)$$

where $N_c$ is the cardinality of class $c$. The total mean vector is computed as

$$\boldsymbol{\mu} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i. \qquad (8)$$
The optimal projection matrix $\mathbf{W}$ is learned by maximizing Fisher's discriminant criterion [R.A.Fisher1936], i.e., by compacting the within-class scatter and maximizing the between-class scatter simultaneously:

$$\mathbf{W}^{*} = \arg\max_{\mathbf{W}} \frac{\operatorname{tr}\!\left(\mathbf{W}^{T} \mathbf{S}_b \mathbf{W}\right)}{\operatorname{tr}\!\left(\mathbf{W}^{T} \mathbf{S}_w \mathbf{W}\right)}, \qquad (9)$$

where $\operatorname{tr}(\cdot)$ denotes the trace of a matrix. Usually, the optimal projection matrix $\mathbf{W}^{*}$ is calculated by solving the eigenvalue decomposition of the matrix $\mathbf{S}_w^{-1}\mathbf{S}_b$ and then using the eigenvectors corresponding to the largest eigenvalues as the projection matrix. The rank of $\mathbf{S}_b$ is equal to $C-1$, which is the maximal dimensionality of the resulting subspace. Since $\mathbf{S}_t = \mathbf{S}_w + \mathbf{S}_b$, an alternative approach is to use $\mathbf{S}_t$ instead of $\mathbf{S}_w$ and maximize Fisher's discriminant criterion as

$$\mathbf{W}^{*} = \arg\max_{\mathbf{W}} \frac{\operatorname{tr}\!\left(\mathbf{W}^{T} \mathbf{S}_b \mathbf{W}\right)}{\operatorname{tr}\!\left(\mathbf{W}^{T} \mathbf{S}_t \mathbf{W}\right)}. \qquad (10)$$
Although the traditional LDA technique has gained popularity in various single-label classification tasks, its performance varies with the type of input data. Traditional LDA assumes a homoscedastic Gaussian model [Petridis2004a], in which the covariance matrices of the classes are identical [Tang2005]. Furthermore, its performance is severely affected by class imbalance in the input data sets [Tang2005a].
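Before moving to the weighted variants, the standard single-label LDA pipeline above can be sketched in NumPy (an illustrative sketch; the function and variable names are ours, and a small ridge term is added to keep the within-class scatter invertible):

```python
import numpy as np

def lda_projection(X, y, dim):
    """Single-label LDA: X is a D x N data matrix, y holds integer class labels.

    Returns the D x dim projection matrix formed by the leading
    eigenvectors of S_w^{-1} S_b, as in Fisher's criterion."""
    D, N = X.shape
    mu = X.mean(axis=1, keepdims=True)                 # total mean, Eq. (8)
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mu_c = Xc.mean(axis=1, keepdims=True)          # class mean, Eq. (7)
        S_w += (Xc - mu_c) @ (Xc - mu_c).T             # within-class scatter
        S_b += Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T  # between-class scatter
    # Solve the (ridge-regularized) eigenproblem S_w^{-1} S_b w = lambda w
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w + 1e-6 * np.eye(D), S_b))
    order = np.argsort(-evals.real)
    return evecs[:, order[:dim]].real
```

Since the rank of $\mathbf{S}_b$ is at most $C-1$, only $C-1$ meaningful discriminant directions can be obtained.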

II-B2 Weighted Linear Discriminant Analysis

In order to enhance the robustness of traditional LDA on different kinds of data sets, various weight factors have been introduced into the definitions of the scatter matrices to balance the contribution of each class according to class statistics [Tang2005], [Li2009a], e.g., class cardinality or a priori probability. Weighted LDA approaches have diminished the influence of outlier classes on the scatter matrices of imbalanced data sets to some extent; however, they still neglect the varying importance of individual samples in the class description. Saliency-based weighted Linear Discriminant Analysis (SwLDA) [Xu2018c], a graph-based formulation, was proposed to explore the contribution of each instance based on probabilistic saliency estimation [Aytekin2018]. Our work uses the same idea for multi-label classification.

Generally, weight factors are calculated using various metrics to reallocate the contribution of each class, which can alleviate the influence of outlier classes on the projection matrix. An example of a weighted between-class matrix definition based on the Bayes error rate was proposed in [Loog2001b]:

$$\mathbf{S}_b = \sum_{c=1}^{C-1} \sum_{l=c+1}^{C} p_c\, p_l\, \omega(\Delta_{cl})\, (\boldsymbol{\mu}_c - \boldsymbol{\mu}_l)(\boldsymbol{\mu}_c - \boldsymbol{\mu}_l)^{T}, \qquad (11)$$

where $p_c$, $p_l$ denote the a priori probabilities of class $c$ and class $l$, respectively, and $\Delta_{cl}$ expresses the dissimilarity between class $c$ and class $l$. The within-class scatter matrix can be weighted with prior information as in [Tang2005]:

$$\mathbf{S}_w = \sum_{c=1}^{C} r_c \sum_{i:\,Y_{ci}=1} (\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{T}, \qquad (12)$$

where $r_c$ is a relevance weight factor that has a low value if class $c$ is estimated to be an outlier class. Thus, both definitions of the scatter matrices decrease the influence of outlier classes. After computing the weighted scatter matrices, they can be used to obtain the optimal projection matrix from Eq. (9).
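The class-weighted scatter computation can be sketched as follows (a simplified sketch with a generic per-class relevance weight; the exact weighting functions of [Loog2001b] and [Tang2005] are not reproduced here):

```python
import numpy as np

def weighted_scatter(X, y, class_weights):
    """Weighted within/between-class scatter matrices.

    Each class's contribution is scaled by a relevance weight
    (a low weight downplays an outlier class).  class_weights maps
    class label -> weight r_c."""
    D, N = X.shape
    mu = X.mean(axis=1, keepdims=True)
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[:, y == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        r = class_weights[c]
        S_w += r * (Xc - mu_c) @ (Xc - mu_c).T
        S_b += r * Xc.shape[1] * (mu_c - mu) @ (mu_c - mu).T
    return S_w, S_b
```

With all weights equal to one, this reduces to the unweighted definitions in Eqs. (4)-(5).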

II-B3 Multi-label Linear Discriminant Analysis

Although weighted LDA algorithms have enhanced the performance on single-label classification tasks [Jarchi2006a], [Ahmed2012] compared to traditional LDA, such variants are still not directly applicable to multi-label classification tasks [Wang2010]. In a multi-label data set, the label information contains certain correlations or dependencies [Wu2016]; for example, an image instance labeled as 'car' highly correlates with the label 'road' [Wang2010]. Besides, it is quite common that the number of samples per class in a multi-label data set is imbalanced. For example, in the widely used Yeast database [Nakai1992] the largest class contains 1128 instances and the smallest 21, as shown in Fig. 1. Due to these characteristics of multi-label databases, it is imperative to take into account the correlation of class labels and/or the discriminative feature information of each instance to avoid sub-optimal classification results on imbalanced data sets.

Fig. 1: The number of instances for each class in the Yeast database.

When traditional LDA and its variants are applied to multi-label classification tasks by simply using Eqs. (4)-(6) with the multi-label label matrix $\mathbf{Y}$, a significant problem is that the contribution of one instance can be counted repeatedly when computing the scatter matrices. Hence, weight factors are used to express redundancy and/or correlation information, so that LDA-related algorithms can calculate the scatter matrices without redundancy on multi-label databases. In [Wang2010], a multi-label linear discriminant analysis (MLDA) approach based on the exploration of label correlation information was proposed to tackle multi-label image and video classification tasks. MLDA embeds the correlation information of class labels as weight factors $W_{ci}$ in the definition of the scatter matrices:

$$\mathbf{S}_w = \sum_{c=1}^{C} \sum_{i=1}^{N} W_{ci}\,(\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{T}, \qquad
\mathbf{S}_b = \sum_{c=1}^{C} \Big(\sum_{i=1}^{N} W_{ci}\Big) (\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{T},$$

where $W_{ci}$ describes the weight factor of the $i$th instance for the $c$th class, $\boldsymbol{\mu}$ is the total mean vector of all training instances, and $\boldsymbol{\mu}_c$ is the mean vector of class $c$:

$$\boldsymbol{\mu}_c = \frac{\sum_{i=1}^{N} W_{ci}\,\mathbf{x}_i}{\sum_{i=1}^{N} W_{ci}}.$$
A correlation matrix $\mathbf{C} \in \mathbb{R}^{C \times C}$ is computed using the class labels of each pair of classes:

$$C_{kl} = \frac{\mathbf{y}^{(k)} \cdot \mathbf{y}^{(l)}}{\|\mathbf{y}^{(k)}\|\,\|\mathbf{y}^{(l)}\|}, \qquad (18)$$

where $\mathbf{y}^{(k)}, \mathbf{y}^{(l)}$ are the label vectors (rows of $\mathbf{Y}$) for classes $k$ and $l$. The label correlation information can reveal whether two classes are closely related or not. The correlation matrix is then used to compute the weight factors in Eqs. (13)-(17). To tackle the over-counting problem [Wang2010], the weight factors are normalized with the $\ell_1$-norm:

$$\mathbf{w}_i = \frac{\mathbf{C}\mathbf{y}_i}{\|\mathbf{C}\mathbf{y}_i\|_1}, \qquad (19)$$

where $\mathbf{y}_i$ is the label vector of the $i$th sample. $\mathbf{w}_i$ was used directly as the weight vector in [Wang2010], while we exploit it in a different manner in our work.
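A minimal sketch of this correlation-based weighting, assuming cosine similarity for the class pair matrix and $\ell_1$ normalization of each instance's weight vector (the function name is ours):

```python
import numpy as np

def mlda_weights(Y):
    """Label-correlation weights in the spirit of MLDA.

    Y is the C x N binary label matrix.  The class pair matrix uses
    cosine similarity of the label rows; each instance's weight vector
    is C @ y_i, l1-normalized to tackle over-counting."""
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    Cm = (Y @ Y.T) / (norms @ norms.T)             # class-pair correlation
    W = Cm @ Y                                      # weight of instance i for class c
    W = W / np.abs(W).sum(axis=0, keepdims=True)    # l1-normalize per instance
    return Cm, W
```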

Various weight matrices have been introduced to improve the performance of LDA on multi-label classification tasks [Wang2010], [Xu2018b], [Oikonomou2013]. Such strategies yield a more suitable projection subspace compared to other dimensionality reduction algorithms [Wang2010], such as principal component analysis (PCA), multi-label dimensionality reduction via dependence maximization (MDDM), or multi-label least squares (MLLS).

In [Oikonomou2013], MLDA was extended to Direct MLDA by changing the definition of $\mathbf{S}_b$ in a way that allows obtaining a higher-dimensional subspace than the original MLDA, in which the subspace dimensionality is limited to $C-1$ by the rank of $\mathbf{S}_b$. This extension further improved the results in multi-label video classification tasks. Another extension, multi-label discriminant analysis with locality consistency (MLDA-LC) [Yuan2014], not only preserves the global class label information as MLDA does, but also incorporates a graph-regularized term to utilize the local geometric information. By incorporating the graph Laplacian matrix into the MLDA approach, MLDA-LC enforces similar representations for nearby instances in the projection space, which further enhances the classification performance on multi-label data sets compared to the MLDA and MLLS algorithms.

II-B4 Weighted multi-label linear discriminant analysis

A weighted multi-label LDA (wMLDA) approach was proposed in [Xu2018b], focusing on linear feature extraction for multi-label classification. In wMLDA, a multi-label classifier is composed of several single-label classifiers according to the number of classes, and a weight matrix is simultaneously calculated based on various metrics to embody the contribution of each instance in the calculation of the scatter matrices. Various metrics can be used to measure the relationships among instances based on the labels and/or features. The wMLDA approach employs the correlation-based weight form [Wang2010], entropy-based weight form [Chen2007], binary-based weight form [Park2008a], fuzzy-based weight form [Lin2010], and dependence-based weight form [Xu2018b]. In this work, we exploit the same metrics, while using this information in a novel way. We provide a detailed explanation of the metrics in Section III-B.

In [Xu2018b], the scatter matrices $\mathbf{S}_w$, $\mathbf{S}_b$, and $\mathbf{S}_t$ are redefined to exploit the prior information for weighting. Firstly, a non-negative weight matrix $\mathbf{Q} \in \mathbb{R}^{C \times N}$ of the same size as the label matrix is defined to describe the weight of each instance for its corresponding classes:

$$\mathbf{Q} = [\mathbf{q}_1, \mathbf{q}_2, \dots, \mathbf{q}_N],$$

where the column $\mathbf{q}_i$ represents the weight vector of the $i$th instance and the row $\mathbf{q}^{(c)}$ is the weight vector of the $c$th class. The weight matrix is calculated based on one of the prior information matrices described in Section III-B. Then, $w_c$ and $w$ are defined as the summations of the weights for the $c$th class and over all classes:

$$w_c = \sum_{i=1}^{N} Q_{ci}, \qquad w = \sum_{c=1}^{C} w_c.$$

In order to simplify notation, a row vector $\mathbf{z} = [z_1, \dots, z_N]$ is defined as

$$z_i = \sum_{c=1}^{C} Q_{ci},$$

where $z_i$ is the summation of the weights of the $i$th instance over all classes. Then, the scatter matrices can be redefined as

$$\mathbf{S}_w = \sum_{c=1}^{C}\sum_{i=1}^{N} Q_{ci}\,(\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{T}, \qquad
\mathbf{S}_b = \sum_{c=1}^{C} w_c\,(\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{T}, \qquad
\mathbf{S}_t = \sum_{i=1}^{N} z_i\,(\mathbf{x}_i - \boldsymbol{\mu})(\mathbf{x}_i - \boldsymbol{\mu})^{T},$$

where $\boldsymbol{\mu}_c = \frac{1}{w_c}\sum_{i=1}^{N} Q_{ci}\,\mathbf{x}_i$ and $\boldsymbol{\mu} = \frac{1}{w}\sum_{i=1}^{N} z_i\,\mathbf{x}_i$ are the weighted class and total mean vectors. Under this approach, the optimal projection matrix can still be obtained by solving the generalized eigenproblem corresponding to Eq. (10), as discussed in Section II-B1.
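One consistent instantiation of these weighted scatter definitions can be sketched as (names and the handling of empty classes are ours):

```python
import numpy as np

def wmlda_scatter(X, Q):
    """Weighted multi-label scatter matrices.

    X is a D x N data matrix; Q is a C x N non-negative weight matrix
    giving the contribution of instance i to class c."""
    D, N = X.shape
    C = Q.shape[0]
    z = Q.sum(axis=0)                                        # per-instance weight sums
    mu = (X * z).sum(axis=1, keepdims=True) / z.sum()        # weighted total mean
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for c in range(C):
        w_c = Q[c].sum()
        if w_c == 0:
            continue                                         # skip empty classes
        mu_c = (X * Q[c]).sum(axis=1, keepdims=True) / w_c   # weighted class mean
        Xd = X - mu_c
        S_w += (Xd * Q[c]) @ Xd.T
        S_b += w_c * (mu_c - mu) @ (mu_c - mu).T
    return S_w, S_b
```

With a binary single-label $\mathbf{Q}$ this reduces to the standard LDA scatter matrices of Eqs. (4)-(5).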

II-C Saliency estimation

Saliency estimation is a standard computer vision task inspired by neurobiological studies [Ltti1998] and cognitive psychology [Treisman1980]. Generally, saliency estimation serves as a pre-processing step for various high-level computer vision tasks, such as object detection [Aytekin2018], [WangSalientSurvey], the analysis of omni-directional images [Battisti2018], and human attention estimation [Choi2016HumanApproach]. Saliency in physiological science is defined as a special kind of perception of the human visual system, by which humans perceive particular parts of a scene in detail due to colors, textures, or other prominent information contained in these parts [Cheng2011]. These particular parts can be distinguished as foreground from the non-salient background parts.

Computational saliency estimation approaches can be categorized as local or global based on the way they process the saliency information [Cheng2011]. Local saliency approaches explore the prominent information around the neighborhood of specific pixels/regions, whilst global approaches exploit the rarity of a pixel/patch/region in the whole scene. Since the emergence of the computational saliency estimation field in [Koch1985], various probabilistic approaches have been explored. In [Jian2018], a saliency map is estimated based on three kinds of prior information on images at the super-pixel level. Saumya et al. utilize a generalized Bernoulli distribution to estimate a saliency map in their work.
Another saliency estimation approach was proposed by Aytekin et al. [Aytekin2018] for segmenting salient objects in an image using probabilistic estimation, where a probability mass function $p(\cdot)$ depicts whether a region $\mathbf{x}_i$ (pixel, super-pixel, or patch) in an image is considered a distinct region. The higher the value of $p(\mathbf{x}_i)$ for a region, the more prominent the region is. $p(\cdot)$ is obtained by optimizing two terms working simultaneously to allocate not only lower probabilities to non-salient regions but also similar probabilities to similar regions as follows:

$$\min_{\mathbf{p}} \;\sum_{i} v_i\, p(\mathbf{x}_i)^{2} + \sum_{i,j} a_{ij}\,\big(p(\mathbf{x}_i) - p(\mathbf{x}_j)\big)^{2} \quad \text{s.t.} \;\; \sum_{i} p(\mathbf{x}_i) = 1, \qquad (28)$$

where the first term suppresses the probability of a non-prominent region using its prior information $v_i$. In the second term, a high similarity of regions $i$ and $j$, given as a high similarity value $a_{ij}$, forces the regions to have similar probabilities.

This optimization task in Eq. (28) can be expressed using matrix notation as

$$\min_{\mathbf{p}} \;\mathbf{p}^{T}(\mathbf{V} + \mathbf{D} - \mathbf{A})\,\mathbf{p} \quad \text{s.t.} \;\; \mathbf{1}^{T}\mathbf{p} = 1,$$

where $\mathbf{p}$ is a probability vector that depicts the probability of each element or region to be salient, i.e., $p_i = p(\mathbf{x}_i)$. $\mathbf{A}$ is an affinity matrix, which denotes the similarity of each pair of regions $i$ and $j$ as $A_{ij} = a_{ij}$. $\mathbf{D}$ is a diagonal matrix having elements $D_{ii} = \sum_j A_{ij}$. $\mathbf{V}$ is a diagonal matrix having elements $V_{ii} = v_i$. Then, the Lagrange multiplier method is employed:

$$\mathcal{L}(\mathbf{p}, \lambda) = \mathbf{p}^{T}\mathbf{H}\mathbf{p} - \lambda\,(\mathbf{1}^{T}\mathbf{p} - 1), \qquad (31)$$

where $\mathbf{H} = \mathbf{V} + \mathbf{D} - \mathbf{A}$. A global optimum is obtained by setting the partial derivative of Eq. (31) with respect to $\mathbf{p}$ to zero. The final optimized probability vector is

$$\mathbf{p}^{*} = \frac{\mathbf{H}^{-1}\mathbf{1}}{\mathbf{1}^{T}\mathbf{H}^{-1}\mathbf{1}}. \qquad (32)$$
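The closed-form solution can be sketched directly from the matrices defined above (an illustrative sketch; names are ours):

```python
import numpy as np

def saliency_probabilities(A, v):
    """Probabilistic saliency estimation in closed form.

    A: N x N symmetric affinity matrix; v: length-N non-negative prior
    (a high v_i marks region i as likely background).  Minimizing
    p^T (V + D - A) p subject to sum(p) = 1 gives p* ~ H^{-1} 1."""
    Dm = np.diag(A.sum(axis=1))                   # degree matrix D
    H = np.diag(v) + Dm - A                       # H = V + D - A
    p = np.linalg.solve(H + 1e-9 * np.eye(len(v)), np.ones(len(v)))
    return p / p.sum()                            # normalize to a pmf
```

Since $\mathbf{H}$ has a positive dominant diagonal and non-positive off-diagonal entries, $\mathbf{H}^{-1}\mathbf{1}$ is non-negative, so the normalized result is a valid probability mass function.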
III Proposed Method

We propose a novel saliency-based weighted linear discriminant analysis method for multi-label classification tasks, where the saliency-based weight factors are calculated based on the probabilistic saliency estimation approach and the specific prior information of the input data. In this section, we describe our novel Saliency-based weighted Multi-label Linear Discriminant Analysis (SwMLDA) approach in detail.

We calculate a saliency-based weight matrix based on probabilistic saliency estimation using various types of prior information: binary [Park2008a], correlation [Wang2010], entropy [Chen2007], fuzzy [Lin2010], dependence [Xu2018b], and misclassification [Xu2018c]. The weight matrix is denoted as

$$\mathbf{Q} = [\mathbf{q}_1, \mathbf{q}_2, \dots, \mathbf{q}_N],$$

where the column $\mathbf{q}_i$ represents the optimal weight vector of the $i$th instance and the row $\mathbf{q}^{(c)}$ is the weight vector of the $c$th class. The details for computing the probabilistic weight matrix are given in the next subsections. After forming $\mathbf{Q}$, the proposed method proceeds as wMLDA by using the weights in the scatter matrices $\mathbf{S}_b$ and $\mathbf{S}_w$:

$$\mathbf{S}_w = \sum_{c=1}^{C}\sum_{i=1}^{N} Q_{ci}\,(\mathbf{x}_i - \boldsymbol{\mu}_c)(\mathbf{x}_i - \boldsymbol{\mu}_c)^{T}, \qquad
\mathbf{S}_b = \sum_{c=1}^{C} w_c\,(\boldsymbol{\mu}_c - \boldsymbol{\mu})(\boldsymbol{\mu}_c - \boldsymbol{\mu})^{T},$$

where $\mathbf{S}_t = \mathbf{S}_w + \mathbf{S}_b$. Note that our method normalizes the weight vectors so that the sum of the weights for a class is always 1. Therefore, the vector of class-wise weight sums in Eq. (23) is an all-ones vector and $w = C$. Finally, the optimal projection matrix is obtained from Eq. (10) by solving the corresponding generalized eigenvalue problem.

III-A Saliency-based weight factors

We extend the probabilistic saliency estimation approach [Aytekin2018] described in Section II-C to express the saliency of each instance for its class(es). To this end, we formulate the prior information in $\mathbf{V}$ and the affinities in $\mathbf{A}$ so that the resulting probability describes the saliency of each item $\mathbf{x}_i$ for class $c$. In an initial work [Xu2018c], we used saliency-based weight factors to tackle sub-optimal classification results caused by imbalanced data sets and/or outliers in single-label classification using LDA-based algorithms. Here, we exploit the saliency-based weight factors to tackle multi-label classification tasks.

We calculate the saliency-based weight factors separately for each class in the spirit of PT approaches. For each class $c$, we consider only the $N_c$ samples belonging to the class; thus, the saliency score vector $\boldsymbol{\rho}_c$ has $N_c$ elements. $\boldsymbol{\rho}_c$ is computed following Eq. (32) as

$$\boldsymbol{\rho}_c = \frac{\mathbf{H}_c^{-1}\mathbf{1}}{\mathbf{1}^{T}\mathbf{H}_c^{-1}\mathbf{1}}, \qquad (36)$$

where $\mathbf{H}_c$ constitutes three terms as $\mathbf{H}_c = \mathbf{V}_c + \mathbf{D}_c - \mathbf{A}_c$. To form the class weight vector $\mathbf{q}^{(c)}$ from $\boldsymbol{\rho}_c$, the elements of $\boldsymbol{\rho}_c$ are placed at their corresponding positions in $\mathbf{q}^{(c)}$ and the values for items not belonging to class $c$ are set to zero. We then form the weight matrix $\mathbf{Q}$ by placing the weight vectors $\mathbf{q}^{(c)}$ as its rows.

$\mathbf{A}_c$ is an affinity matrix obtained using a graph notation. That is, for each class $c$, we form its corresponding graph $\mathcal{G}_c = \{\mathbf{X}_c, \mathbf{A}_c\}$, where $\mathbf{X}_c$ is a matrix formed by the instances belonging to class $c$ and $\mathbf{A}_c$ is a graph weighting matrix expressing the similarity between each pair of instances in class $c$. In our experiments, we use a fully connected graph to obtain $\mathbf{A}_c$ with a heat kernel function formulated as

$$A_c(i,j) = \exp\!\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|_2^{2}}{2\sigma^{2}}\right),$$

where $\mathbf{x}_i$ and $\mathbf{x}_j$ are the $i$th and $j$th instance in class $c$. $\sigma$ is set as a constant value. $\mathbf{D}_c$ is a diagonal matrix and each of its elements is calculated based on $\mathbf{A}_c$ as $D_c(i,i) = \sum_j A_c(i,j)$.
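The per-class graph construction can be sketched as follows (a sketch; we zero the self-affinity, which does not change the Laplacian $\mathbf{D}_c - \mathbf{A}_c$ since the diagonal enters $\mathbf{D}_c$ and $\mathbf{A}_c$ equally):

```python
import numpy as np

def heat_kernel_affinity(Xc, sigma=1.0):
    """Fully connected affinity graph for the instances of one class.

    Xc is D x Nc; A[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    Also returns the diagonal degree matrix D_ii = sum_j A[i, j]."""
    sq = (Xc ** 2).sum(axis=0)
    d2 = sq[:, None] + sq[None, :] - 2 * Xc.T @ Xc   # pairwise squared distances
    A = np.exp(-np.maximum(d2, 0) / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    return A, np.diag(A.sum(axis=1))
```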

$\mathbf{V}_c$ is a diagonal matrix, which carries the prior information on each instance in class $c$ being salient for its class, based on the metrics presented in the next subsection. The values of $\mathbf{V}_c$ relate inversely to the values of the weight factor vector, which range from 0 to 1: the lower a value $V_c(i,i)$, the more prominent the corresponding instance is expected to be based on the prior knowledge. We introduce six different prior information matrices to exploit the label and/or feature information of each class, which produce six different variants of the proposed approach.

After computing the prior information matrix $\mathbf{V}_c$ and the affinity matrix $\mathbf{A}_c$ for class $c$, we follow the PSE approach in Eq. (36) to calculate the saliency score vector $\boldsymbol{\rho}_c$ for class $c$. In order to avoid singularity in this process, a regularized version of $\mathbf{H}_c$, with a small value $\epsilon$ added to the diagonal elements, is used. The summation of the values in the saliency-based weight vector of each class is one, which is expected to alleviate the over-counting problem.
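Putting the pieces of this subsection together, the per-class weight computation can be sketched as (a self-contained sketch; names and default values are ours):

```python
import numpy as np

def class_weight_row(Xc, v, sigma=1.0, eps=1e-6):
    """Saliency-based weights for the members of one class.

    Xc: D x Nc instances of the class; v: length-Nc prior values
    (a high value marks a less salient instance).  Returns weights that
    sum to one, from the regularized solution rho ~ (H + eps I)^{-1} 1."""
    sq = (Xc ** 2).sum(axis=0)
    d2 = sq[:, None] + sq[None, :] - 2 * Xc.T @ Xc
    A = np.exp(-np.maximum(d2, 0) / (2 * sigma ** 2))   # heat-kernel affinity
    np.fill_diagonal(A, 0.0)
    H = np.diag(v) + np.diag(A.sum(axis=1)) - A         # H_c = V_c + D_c - A_c
    rho = np.linalg.solve(H + eps * np.eye(len(v)), np.ones(len(v)))
    return rho / rho.sum()
```

The returned vector is then scattered into the corresponding row of the full C x N weight matrix, with zeros for the instances not belonging to the class.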

III-B Prior information matrices

III-B1 Misclassification-based prior information matrix (SwMLDAm)

This approach was defined in [Xu2018c] to alleviate the sub-optimal results in LDA arising from outlier instances on imbalanced data sets. We utilize the misclassification-based prior information to generate the diagonal matrix $\mathbf{V}_c$ based on the probability of each instance belonging to class $c$ to be more salient for another class $k \neq c$: the prior value of instance $\mathbf{x}_i$ grows as $\mathbf{x}_i$ approaches the mean vector $\boldsymbol{\mu}_k$ of another class $k$. In this approach, a sample which is closer to another class is considered less salient for class $c$, even if it is relatively close to the center of class $c$.

III-B2 Correlation-based prior information matrix (SwMLDAc)

As in [Wang2010], the label correlation information is represented by the class pair matrix $\mathbf{C}$ defined in Eq. (18). For each instance $\mathbf{x}_i$, the normalized weight vector $\mathbf{w}_i$ is calculated by Eq. (19). We compute the weight factors separately for each class and, after obtaining them for all instances, we select the elements corresponding to class $c$ and use them to formulate the diagonal correlation-based prior information matrix $\mathbf{V}_c$ of the class.

Label correlation information is widely exploited to tackle the redundancy of label information in various multi-label tasks [Wang2010], [Zhu2018]. However, it can lead to sub-optimal results due to non-zero values in the correlation weight factor matrix for irrelevant labels [Xu2018b]. Because we calculate the correlation-based prior information matrix for each class separately, the non-zero values of unrelated label pairs are avoided.

III-B3 Binary-based prior information matrix (SwMLDAb)

The binary-based approach directly utilizes the label information, as in [Park2008a]. In our formulation, this approach reduces to having an equal value in $\mathbf{V}_c$ for all instances, as only the instances belonging to class $c$ are considered. For wMLDA, such direct use of the class labels leads to the over-counting problem in the scatter matrices. In our formulation, this problem is avoided because $\mathbf{V}_c$ merely represents the prior information of non-salient instances and the final weight matrix is normalized for each class.

III-B4 Entropy-based prior information matrix (SwMLDAe)

We utilize an entropy metric on the label information to form the prior information matrix of each class $c$, as in [Xu2018b], [Chen2007]. For each instance $\mathbf{x}_i$, its number of relevant labels is calculated as

$$l_i = \sum_{c=1}^{C} Y_{ci},$$

and its entropy is computed from this label count; the entropy is higher when there are fewer relevant labels. The probability for instance $\mathbf{x}_i$ to be relevant to class $c$ is

$$p_{ci} = \frac{Y_{ci}}{l_i}.$$

The entropy-based prior information of each instance for its class(es) is derived from these probabilities. Finally, the elements of the diagonal matrix $\mathbf{V}_c$ are set accordingly.
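The determined quantities above (label counts and relevance probabilities; the final mapping onto the diagonal prior follows the cited works) can be sketched as:

```python
import numpy as np

def label_relevance(Y):
    """Per-instance label counts and relevance probabilities.

    Y is the C x N binary label matrix; l[i] is the number of relevant
    labels of instance i and P[c, i] = Y[c, i] / l[i] is the probability
    mass instance i assigns to class c."""
    l = Y.sum(axis=0)
    P = Y / np.maximum(l, 1)      # guard against instances with no labels
    return l, P
```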

III-B5 Fuzzy-based prior information matrix (SwMLDAf)

The fuzzy $c$-means clustering algorithm (FCM) [Bezdek1981] is an extension of $k$-means in which an instance can belong to multiple clusters with different degrees. The membership degree of instance $\mathbf{x}_i$ in class $c$ is indicated with a weight factor $u_{ci}$ [Dembczynski2012]. In our work, a supervised version of the fuzzy $c$-means clustering algorithm (SFCM) [Xu2018b], [Lin2010] is exploited to obtain the prior information matrix.

As in [Xu2018b], we optimize the following objective:

$$\min_{u, \mathbf{v}} \;\sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}^{m}\,\|\mathbf{x}_i - \mathbf{v}_c\|_2^{2} \quad \text{s.t.} \;\; \sum_{c=1}^{C} u_{ci} = 1, \;\; i = 1, \dots, N,$$

where $\mathbf{v}_c$ presents the fuzzy centroid of class $c$ and $u_{ci}$ denotes the membership of instance $\mathbf{x}_i$ to class $c$. The constraint forces the weights of each instance to sum to one. The constrained optimization problem in Eq. (43) can be solved by Lagrangian optimization with multipliers $\lambda_i$, where

$$\mathcal{L} = \sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}^{m}\,\|\mathbf{x}_i - \mathbf{v}_c\|_2^{2} - \sum_{i=1}^{N} \lambda_i \Big(\sum_{c=1}^{C} u_{ci} - 1\Big).$$

After taking the partial derivatives of $\mathcal{L}$ with respect to $u_{ci}$, $\mathbf{v}_c$, and $\lambda_i$ and setting them to zero, we get

$$u_{ci} = \frac{\|\mathbf{x}_i - \mathbf{v}_c\|_2^{-2/(m-1)}}{\sum_{k=1}^{C} \|\mathbf{x}_i - \mathbf{v}_k\|_2^{-2/(m-1)}}, \qquad
\mathbf{v}_c = \frac{\sum_{i=1}^{N} u_{ci}^{m}\,\mathbf{x}_i}{\sum_{i=1}^{N} u_{ci}^{m}}.$$

As the optimal value of $u_{ci}$ depends on $\mathbf{v}_c$ and vice versa, Eq. (45) and Eq. (46) are solved iteratively until the solution converges. Finally, the values of the diagonal matrix $\mathbf{V}_c$ are set based on the converged memberships $u_{ci}$.
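The membership update of Eq. (45) can be sketched as follows (the unsupervised update; the supervised variant of the paper additionally constrains each instance's memberships to its relevant labels):

```python
import numpy as np

def fcm_memberships(X, centroids, m=2.0):
    """One fuzzy c-means membership update.

    X: D x N data; centroids: C x D matrix of cluster centers.
    Returns U with U[c, i] = d_ci^{-2/(m-1)} / sum_k d_ki^{-2/(m-1)},
    where d_ci is the distance from instance i to centroid c."""
    # C x N matrix of squared distances (small constant avoids division by zero)
    d2 = ((X.T[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).T + 1e-12
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)     # memberships sum to 1 per instance
```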

III-B6 Dependence-based prior information matrix (SwMLDAd)

Dependence-based weights were proposed in [Xu2018b]. They are based on the Hilbert-Schmidt independence criterion (HSIC) [Gretton2005], which describes the statistical dependence between features and labels through an estimate of a Hilbert-Schmidt norm. We follow the definition of HSIC in [Xu2018b] as

$$\mathrm{HSIC} = \operatorname{tr}\!\left(\mathbf{K}\,\mathbf{H}\,\mathbf{L}\,\mathbf{H}\right),$$

where $\mathbf{K}$ is a kernel matrix computed on the features, $\mathbf{L}$ is a kernel matrix computed on the weighted labels $\mathbf{Q} \odot \mathbf{Y}$, and $\mathbf{H}$ is a centering matrix represented as $\mathbf{H} = \mathbf{I} - \frac{1}{N}\mathbf{1}\mathbf{1}^{T}$, where $\mathbf{I}$ denotes an identity matrix and $\mathbf{1}$ denotes an all-one vector. $\odot$ denotes the Hadamard, i.e., element-wise, product of two matrices or vectors. To find the weight matrix $\mathbf{Q}$ that maximizes the HSIC, we solve the following optimization problem using the iterative approach described in [Xu2018b]:

$$\max_{\mathbf{Q}} \;\operatorname{tr}\!\left(\mathbf{K}\,\mathbf{H}\,\mathbf{L}(\mathbf{Q})\,\mathbf{H}\right).$$

This approach transforms a multi-label task into several single-label tasks [Xu2018b]: after the final iteration, it allocates a weight of 1 to only one prominent class for each instance. In our probabilistic formulation, the diagonal matrix $\mathbf{V}_c$ has its elements derived from these dependence-based weights.
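An empirical HSIC estimate with linear kernels can be sketched as follows (a sketch with the common $(N-1)^{-2}$ scaling; the kernel choices are assumptions for illustration):

```python
import numpy as np

def hsic(X, Y):
    """Empirical HSIC with linear kernels.

    X: D x N features, Y: C x N labels.  K = X^T X and L = Y^T Y are
    Gram matrices, H = I - (1/N) 1 1^T centers them, and the estimate
    is tr(K H L H) / (N - 1)^2 (non-negative, larger = more dependent)."""
    N = X.shape[1]
    H = np.eye(N) - np.ones((N, N)) / N
    K = X.T @ X
    L = Y.T @ Y
    return np.trace(K @ H @ L @ H) / (N - 1) ** 2
```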

IV Experiments

In our work, we tested our approach on ten multi-label databases and compared the final results with six competing methods using five evaluation metrics. We use the Matlab codes provided for [Xu2018b] in the comparative experiments and exploit the relevant parts also in the implementation of our proposed method. In the following subsections, we present the ten databases, implementation details, evaluation metrics, and classification results.

IV-A Databases

We perform our experiments on 10 publicly available multi-label databases: Yeast [Nakai1992], Scene [Boutell2004], Cal500 [Turnbull2008], Medical [Pestian2007], TMC2007-500 [Srivastava2005], Corel16k001 [Barnard2003], PlantGO, Image, HumanGO, and Enron. The contents of these databases include text, images, and acoustic clips. The numbers of classes and features of these databases are shown in Table I; 'Cardinality' gives the mean number of class labels per instance.

Database Contents Train Instances Test Instances Classes Attributes Cardinality
Yeast Biology 1500 917 14 103 4.24
PlantGO Biology 588 390 12 3091 1.08
Image Scene 1200 800 5 294 1.24
Scene Scene 1211 1196 6 294 1.07
Enron Text 1123 579 53 1001 3.38
Cal500 Music 300 202 174 68 26.04
HumanGO Biology 1862 1244 14 9845 1.19
Medical Text 645 333 45 1449 1.25
TMC2007-500 Text 21519 7077 22 500 2.16
Corel16k001 Scene 5188 1744 153 103 4.24
TABLE I: Characteristics of datasets used for experiments

IV-B Experimental setup

After eigendecomposition of , we retained the eigenvectors corresponding to the most informative eigenvalues covering 99.9% of the total variance. The classifier used in the experiments is a multi-label k-nearest neighbor classifier (ML-KNN) [Zhang2007] with the neighborhood size set as in [Xu2018b]. ML-KNN utilizes the k-nearest neighbor algorithm and the maximum a posteriori (MAP) principle to tackle the multi-label categorization task. ML-KNN first estimates the prior and posterior probabilities of each instance for each class from a training dataset based on frequency counting [Zhang2007]. Then, the predicted probabilities on a test dataset are calculated from the prior and posterior probabilities on the training dataset using Bayes' rule. The predicted labels are obtained by setting a threshold on the predicted probabilities.
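The prior/posterior estimation and Bayes-rule prediction of ML-KNN can be sketched as below. This is a minimal brute-force version, not the Matlab code of [Xu2018b]: the function name, Euclidean distances, the Laplace smoothing constant `s=1`, and the default `k` are our illustrative assumptions.

```python
import numpy as np

def mlknn_fit_predict(Xtr, Ytr, Xte, k=10, s=1.0):
    """Minimal ML-KNN [Zhang2007] sketch.

    Xtr : (m, d) training features, Ytr : (m, q) binary labels,
    Xte : (n, d) test features.  Returns (n, q) probabilities P(H_l = 1 | c),
    where c is the number of the k nearest neighbors carrying label l.
    """
    m, q = Ytr.shape
    # Priors estimated by frequency counting with Laplace smoothing s.
    prior1 = (s + Ytr.sum(0)) / (2 * s + m)          # P(H_l = 1)
    prior0 = 1.0 - prior1

    def knn(X, ref, exclude_self=False):
        d = ((X[:, None, :] - ref[None, :, :]) ** 2).sum(-1)
        if exclude_self:
            np.fill_diagonal(d, np.inf)
        return np.argsort(d, axis=1)[:, :k]

    # Per-label neighbor counts on the training set (leave-one-out).
    Ctr = Ytr[knn(Xtr, Xtr, exclude_self=True)].sum(1)      # (m, q), values 0..k
    # c1[l, j]: #training items WITH label l whose neighbor count equals j;
    # c0[l, j]: the same for items WITHOUT label l.
    c1 = np.zeros((q, k + 1))
    c0 = np.zeros((q, k + 1))
    for l in range(q):
        for j in range(k + 1):
            hit = Ctr[:, l] == j
            c1[l, j] = (hit & (Ytr[:, l] == 1)).sum()
            c0[l, j] = (hit & (Ytr[:, l] == 0)).sum()
    post1 = (s + c1) / (s * (k + 1) + c1.sum(1, keepdims=True))  # P(c | H_l = 1)
    post0 = (s + c0) / (s * (k + 1) + c0.sum(1, keepdims=True))  # P(c | H_l = 0)

    # Bayes' rule on the test set.
    Cte = Ytr[knn(Xte, Xtr)].sum(1)                  # (n, q)
    p1 = prior1 * post1[np.arange(q), Cte]
    p0 = prior0 * post0[np.arange(q), Cte]
    return p1 / (p1 + p0)
```

Thresholding the returned probabilities (e.g. at 0.5) then yields the predicted label matrix used by the label-based metrics below.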

IV-C Performance evaluation

We adopt five different evaluation metrics [Zhang2010], [Park2019] to evaluate the performance of our proposed algorithm: one error, normalized coverage, ranking loss, Hamming loss, and macro-F1. We introduce them in the following. Here, we denote the ground truth label matrix for the test samples as , where the column represents the label vector of test sample .

The predicted label matrix is denoted as , and is the predicted label vector of test sample . We use for the predicted probabilities, where denotes the membership of instance in class . denotes an ordered list of classes ranked in descending order of probability in . We use to denote the indices of relevant classes in and to denote the indices of negative classes in .

  1. One error shows how often the top ranked class is not among the positive ground truth labels. Lower values of this metric indicate better performance.


    where denotes the first class in the sorted list .

  2. Normalized coverage demonstrates how far on average in the predicted label ranking one needs to go to cover all the ground-truth labels of an instance. A smaller coverage value indicates better performance.


    where gives the positions of relevant classes in the ordered list .

  3. Ranking loss evaluates, for each item, all relevant vs. irrelevant class pairs and gives the fraction of pairs in which the irrelevant class is ranked above the relevant one. Smaller values of this metric indicate a better performance. Here, we use to denote the number of relevant classes in and :


    where is used to denote the count of wrong rankings for item .

  4. Hamming loss shows the rate of misclassified labels using an XOR comparison between the predicted labels and the ground truth labels. Smaller values of this metric indicate a better performance.

  5. Macro-F1 is the average F1 score over all classes, which reflects the balance of precision and recall of the predicted labels on each class. Higher values of this metric indicate a better performance.


    where and are the precision and recall for class , respectively.
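The five metrics above can be sketched as follows. Tie handling in the ranking loss and the coverage normalization (here: dividing by the number of classes) are our assumptions; the paper follows the definitions of [Zhang2010], [Park2019]. The sketch assumes every instance has at least one relevant and one irrelevant class.

```python
import numpy as np

def multilabel_metrics(Y, Yhat, P):
    """Y, Yhat : (n, q) binary ground-truth / predicted label matrices.
    P : (n, q) predicted class probabilities (e.g. from ML-KNN).
    Returns (one_error, normalized coverage, ranking loss,
             Hamming loss, macro-F1)."""
    n, q = Y.shape
    order = np.argsort(-P, axis=1)                   # classes, best first
    rank = np.empty_like(order)                      # 0 = top-ranked class
    rank[np.arange(n)[:, None], order] = np.arange(q)

    # 1. One error: top-ranked class is not a relevant class.
    one_error = np.mean([Y[i, order[i, 0]] == 0 for i in range(n)])
    # 2. Coverage: depth needed to cover all relevant classes, divided by q.
    coverage = np.mean([rank[i][Y[i] == 1].max() for i in range(n)]) / q
    # 3. Ranking loss: fraction of (relevant, irrelevant) pairs ordered wrongly.
    rloss = np.mean([
        (P[i, Y[i] == 0][None, :] >= P[i, Y[i] == 1][:, None]).mean()
        for i in range(n)])
    # 4. Hamming loss: rate of label disagreements (XOR).
    hloss = np.mean(Y != Yhat)
    # 5. Macro-F1: per-class F1 averaged over classes.
    tp = ((Y == 1) & (Yhat == 1)).sum(0)
    prec = tp / np.maximum((Yhat == 1).sum(0), 1)
    rec = tp / np.maximum((Y == 1).sum(0), 1)
    f1 = np.where(prec + rec > 0,
                  2 * prec * rec / np.maximum(prec + rec, 1e-12), 0.0)
    return one_error, coverage, rloss, hloss, f1.mean()
```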


Reference methods Variants of the proposed saliency-based methods
Yeast 0.2399 0.2486 0.2410 0.2475 0.2530 0.2497 0.2474 0.2530 0.2530 0.2421 0.2432
Plant 0.7564 0.7359 0.6069 0.7436 0.7359 0.7410 0.6692 0.6615 0.6667 0.6564 0.6590
Image 0.4975 0.3413 0.3613 0.3400 0.3463 0.3463 0.3325 0.3150 0.3163 0.3263 0.3213
Scene 0.4983 0.3286 0.3202 0.3269 0.3286 0.3202 0.2542 0.2408 0.2425 0.2416
Enron 0.7636 0.8061 0.7242 0.7000 0.6909 0.5924 0.5348 0.5833 0.5833 0.5455 0.5576
Cal500 0.1089 0.1386 0.1139 0.1089 0.1139 0.1139
Human 0.6849 0.6174 0.6069 0.6174 0.6094 0.5997 0.6109 0.6109 0.6045 0.6029 0.5916
Medical 0.3964 0.2613 0.2252 0.2342 0.2222 0.2312 0.2162 0.2012 0.2042 0.1922 0.1922
TMC2007 0.2021 0.1498 0.1492 0.1495 0.1499 0.1584 0.1561 0.1557 0.1553 0.1537 0.1501
Corel16k001 0.7414 0.7259 0.7242 0.7288 0.7208 0.7104 0.7299 0.7185 0.7150 0.7150 0.7225
TABLE II: One error ()
Reference methods Variants of the proposed saliency-based methods
Yeast 0.5187 0.5119 0.5072 0.5012 0.5003 0.5097 0.4991 0.4987 0.4962 0.4975 0.4964
Plant 0.2646 0.2846 0.2355 0.2900 0.2984 0.2797 0.2303 0.2387 0.2282 0.2359 0.2408
Image 0.3528 0.2619 0.2641 0.2656 0.2659 0.2656 0.2416 0.2313 0.2284 0.2300 0.2269
Scene 0.2547 0.1584 0.1574 0.1547 0.1567 0.1515 0.1112 0.1110 0.1110 0.1109 0.1139
Enron 0.3457 0.3862 0.3650 0.3479 0.3545 0.3361 0.3043 0.3058 0.3095 0.3073 0.3032
Cal500 0.7533 0.7477 0.7511 0.7517 0.7486 0.7472 0.7462 0.7467 0.7468 0.7469 0.7469
Human 0.2127 0.1969 0.1945 0.1964 0.1971 0.1945 0.1855 0.1855 0.1834 0.1845 0.1835
Medical 0.0819 0.0779 0.0819 0.0704 0.0716 0.0762 0.0678 0.0659 0.0678 0.0665 0.0634
TMC2007 0.1148 0.0983 0.0974 0.0979 0.0973 0.1024 0.0994 0.0970 0.0972 0.0973 0.0975
Corel16k001 0.3956 0.3771 0.3698 0.3740 0.3731 0.3779 0.3698 0.3677 0.3687 0.3698 0.3639
TABLE III: Normalized coverage ()
Reference methods Variants of the proposed saliency-based methods
Yeast 0.1900 0.1827 0.1823 0.1799 0.1808 0.1813 0.1744 0.1786 0.1777 0.1761 0.1748
Plant 0.2577 0.2763 0.2817 0.2878 0.2713 0.2196 0.2254 0.2300 0.2199 0.2274 0.2315
Image 0.2878 0.1948 0.1978 0.1986 0.2000 0.1992 0.1771 0.1599 0.1653 0.1667 0.1652
Scene 0.2321 0.1380 0.1367 0.1338 0.1360 0.1318 0.0909 0.0892 0.0896 0.0900 0.0929
Enron 0.1739 0.2012 0.1742 0.1639 0.1682 0.1537 0.1306 0.1330 0.1330 0.1336 0.1329
Cal500 0.1882 0.1900 0.1882 0.1865 0.1863 0.1854 0.1860 0.1854 0.1854 0.1855 0.1865
Human 0.1907 0.1712 0.1702 0.1712 0.1721 0.1702 0.1602 0.1612 0.1604 0.1609 0.1603
Medical 0.0682 0.0571 0.0648 0.0527 0.0498 0.0570 0.0462 0.0480 0.0489 0.0482 0.0461
TMC2007 0.0375 0.0269 0.0266 0.0268 0.0264 0.0289 0.0279 0.0264 0.0264 0.0263
Corel16k001 0.1962 0.1894 0.1863 0.1872 0.1864 0.1890 0.1866 0.1857 0.1863 0.1868 0.1825
TABLE IV: Ranking loss ()
Reference methods Variants of the proposed saliency-based methods
Yeast 0.2077 0.2046 0.2028 0.2035 0.2049 0.2091 0.2038 0.2059 0.2047 0.2046 0.2049
Plant 0.1171 0.0924 0.1184 0.1201 0.1081 0.0947 0.1017 0.1010 0.1021 0.1068 0.0987
Image 0.2310 0.1893 0.1898 0.1860 0.1883 0.1828 0.1738 0.1703 0.1723 0.1698 0.1713
Scene 0.1683 0.1182 0.1185 0.1198 0.1256 0.1172 0.0975 0.0917 0.0917 0.0949 0.0943
Enron 0.0669 0.0721 0.0668 0.0645 0.0664 0.0565 0.0585 0.0585 0.0563 0.0565 0.0549
Cal500 0.1392 0.1394 0.1393 0.1386 0.1383 0.1391 0.1390 0.1398 0.1388 0.1386 0.1383
Human 0.0943 0.0908 0.0924 0.0923 0.0908 0.0845 0.0891 0.0887 0.0868 0.0880 0.0874
Medical 0.0225 0.0172 0.0225 0.0167 0.0161 0.0165 0.0159 0.0153 0.0149 0.0153 0.0155
TMC2007 0.0608 0.0539 0.0529 0.0535 0.0531 0.0571 0.0544 0.0537 0.0531 0.0535 0.0535
Corel16k001 0.0200 0.0200 0.0200 0.0200 0.0201 0.0200 0.0200 0.0200 0.0200 0.0200
TABLE V: Hamming loss ()
Reference methods Variants of the proposed saliency-based methods
Yeast 0.3174 0.3516 0.3596 0.3532 0.2988 0.3519 0.3342 0.3486 0.3483 0.3475 0.3647
Plant 0.0185 0.1259 0.1574 0.1216 0.1543 0.1331 0.1461 0.1488 0.1503 0.1583 0.1393
Image 0.3002 0.5908 0.5738 0.5852 0.5875 0.5774 0.5610 0.5854 0.5956 0.5864 0.5686
Scene 0.3456 0.6488 0.6523 0.6412 0.6406 0.6489 0.7106 0.7306 0.7269 0.7304 0.7294
Enron 0.0198 0.0372 0.0483 0.0600 0.0557 0.0331 0.0637 0.0567 0.0524 0.0595 0.0595
Cal500 0.0526 0.0465 0.0504 0.0501 0.0525 0.0490 0.0520 0.0542 0.0527 0.0522 0.0511
Human 0.0016 0.1460 0.1493 0.1455 0.1460 0.1380 0.1303 0.1371 0.1300 0.1429 0.1431
Medical 0.1302 0.1911 0.1302 0.1959 0.1898 0.1916 0.1921 0.2253 0.2043 0.2210 0.2222
TMC2007 0.4748 0.5917 0.5994 0.5921 0.5928 0.5394 0.6120 0.6125 0.6022 0.6074 0.6147
Corel16k001 0.0184 0.0353 0.0373 0.0305 0.0304 0.0315 0.0361 0.0379 0.0386 0.0366 0.0447
TABLE VI: Macro-F1 ()

IV-D Classification results

Tables (II)-(VI) show the experimental results of our approach and the competing methods in terms of one error, normalized coverage, ranking loss, Hamming loss, and macro-F1. One error, normalized coverage, and ranking loss directly utilize the probabilities from the ML-KNN algorithm in various ways. All versions of our proposed method achieved significant improvements on most databases compared to the reference methods with these three probability-based metrics. Our method achieved the best result in eight cases out of ten in Tables (II) and (IV), and in nine cases out of ten in Table (III).

The remaining two metrics utilize, in different ways, the predicted labels obtained with a threshold value as well as the probabilities. We did not adopt a cross-validation strategy to select an optimal threshold value in our experiments, which may lead to suboptimal results. With these last two metrics, the reference methods fared somewhat better than with the former three, but even in the worst case, Hamming loss, our method achieved the best results in six cases out of ten.

According to the results with all metrics, our misclassification-based prior information variant is the most efficient and precise one, achieving a total of 15 best results among all the test cases (highlighted values in the tables). achieved 11 best results among all the test cases with different metrics. Moreover, each variant of our algorithm achieved better performance than the corresponding reference method in most cases. For instance, achieved better results on at least eight cases out of ten for any metric than did with the one-error metric, and enhanced the performance in eight cases with Hamming loss and seven cases with macro-F1 compared to . This shows that the proposed approach of using the prior information for class saliency estimation generally outperforms using it directly for weighting the items as in [Xu2018b].

V Conclusion

In this paper, we proposed a novel multi-label classification method to tackle the data imbalance and information redundancy problems encountered in multi-label classification tasks. Our method is an extension of Multi-label LDA (MLDA), where the weights are generated with a probabilistic approach evaluating the saliency of each instance for its different classes. The probabilistic approach uses an affinity matrix to ensure similar results for similar instances, and a prior information matrix to integrate prior knowledge on the prominence of each instance for each class. Our solution can alleviate the data imbalance problem, which is commonly encountered in multi-label databases, as the weight factor vectors are calculated separately for each class. Our method also alleviates the common over-counting problem. We proposed variants of our method using different prior information matrices based on both labels and features.

We used five metrics to evaluate the performance of our method against the competing methods on ten multi-label datasets. The experimental results show that our method enhanced the classification performance compared to the competing algorithms.

Our algorithm is still based on linear subspace learning. In the future, we will derive a non-linear extension using the kernel trick. We will also explore the prominence of each feature channel across all instances for calculating the weight factor vectors.