Generalized Label Enhancement with Sample Correlations

04/07/2020 · Qinghai Zheng, et al. · Xi'an Jiaotong University

Recently, label distribution learning (LDL) has drawn much attention in machine learning, where an LDL model is learned from labeled instances. Different from single-label and multi-label annotations, label distributions describe an instance by multiple labels with different intensities and accommodate more general conditions. As most existing machine learning datasets merely provide logical labels, label distributions are unavailable in many real-world applications. To handle this problem, we propose two novel label enhancement methods, i.e., Label Enhancement with Sample Correlations (LESC) and generalized Label Enhancement with Sample Correlations (gLESC). More specifically, LESC employs a low-rank representation of samples in the feature space, and gLESC leverages a tensor multi-rank minimization to further investigate sample correlations in both the feature space and the label space. Benefiting from the sample correlations, the proposed methods can boost the performance of LE. Extensive experiments on 14 benchmark datasets demonstrate that LESC and gLESC achieve state-of-the-art results compared with previous label enhancement baselines.


1. Introduction

Recently, a growing number of studies have focused on challenging label ambiguity learning problems. Since the single-label learning paradigm, in which an instance is simply mapped to one logical label, has limitations in practice (Zhang and Zhou, 2013), multi-label learning (MLL) has been highlighted to address this issue. Over the past years, MLL, which simultaneously assigns multiple logical labels to each instance, has been applied in a wide range of scenarios (Tsoumakas and Katakis, 2007; Huang and Zhou, 2012; Zhang et al., 2017; Chen et al., 2019; Zhang and Zhang, 2010). For example, in supervised MLL, each sample is described by a label vector whose elements are either 1 or 0, indicating whether the instance belongs to the corresponding label or not. Since multiple logical labels with the same value contribute equally in MLL, the relative importance among the associated labels, which differs under most circumstances, is ignored and cannot be well investigated.

Therefore, despite MLL's success, in some sophisticated scenarios, such as facial age estimation (Geng et al., 2013; Yang et al., 2018) and facial expression recognition (Jia et al., 2019; Kim and Provost, 2015; Zhou et al., 2015), the performance of primitive MLL is hindered, since these tasks require a model that precisely maps an instance to a real-valued label vector with quantitative description degrees, i.e., a label distribution. To meet this demand, the learning process for such a model, called "label distribution learning" (LDL) (Geng, 2016), has attracted significant attention. In LDL, an instance is annotated by a label vector, i.e., the label distribution, in which each element, ranging from 0 to 1, is the description degree of the corresponding label, and all values add up to 1. As much literature has demonstrated (Geng, 2016; Gao et al., 2017; Zheng et al., 2018), label distributions can describe the attributes of samples more precisely, because the relative importance of multiple labels differs greatly in many real-world applications, and the implicit cues within label distributions can be effectively leveraged through LDL to reinforce supervised training.

Figure 1. An example of label enhancement

Nevertheless, since manually annotating each instance with a label distribution is labor-intensive and time-consuming, label distributions are unavailable in most training sets in practice (Xu et al., 2019b). The demand for label distributions across different datasets has spurred progress in label enhancement (LE), which was proposed by (Xu et al., 2018, 2019a). Specifically, LE can serve as a pre-processing step of LDL; in other words, label distributions can be recovered from the off-the-shelf logical labels and the implicit information of the given features by LE, as shown in Fig. 1.

Obviously, according to the definition (Xu et al., 2018, 2019a), the essence of the recovery process is to utilize information from two aspects: 1) the underlying topological structure in the feature space, and 2) the existing logical labels. Accordingly, several approaches have been proposed in recent years. To leverage the knowledge in the feature space, some prior efforts assign the membership degree of each instance to different labels via the fuzzy clustering method (FCM) (Melin and Castillo, 2005; El Gayar et al., 2006). Besides, some approaches construct graph structures in the feature space to improve the recovery process (Zhu and Goldberg, 2009; Xu et al., 2018). However, most existing LE methods do not fully investigate and utilize both the underlying structure in the feature space and the implicit information of the existing logical labels. For example, the graphs and similarity matrices used in the aforementioned LE methods cannot fully explore the intrinsic information of data samples, since edges in graphs or elements of similarity matrices are calculated by a pair-wise method (Li et al., 2015) or a k-nearest neighbors (KNN) strategy (Hou et al., 2016; Xu et al., 2018). The downside of these partially-based graph construction processes is that only local topological features are utilized, while the holistic information of the feature space is largely untapped. In addition, these approaches always require some prior knowledge for graph construction; that is to say, if the parameters of KNN are tuned slightly, the recovery performance of these algorithms may vary on a large scale, which is not desirable in practice.

Here, we aim to employ the intrinsic global sample correlations to obtain an exact label distribution recovery. Since the low-rank representation (LRR) (Liu et al., 2012) can unearth the global structure of the whole feature space, it is expected to achieve a promising LE performance by employing LRR to supervise the label distribution recovery process. To this end, a novel Label Enhancement with Sample Correlations, termed LESC, is proposed in this paper. The proposed method imposes a low-rank constraint on the data subspace representation to capture the global relationships among all instances. Clearly, LRR is employed to benefit the LE performance by exploiting the intrinsic sample correlations in the feature space from a global perspective (Liu and Yan, 2011; Yin et al., 2015; Zheng et al., 2020a, b). Since labels are also semantic features of data samples, it is natural and intuitive to transfer the low-rank structure constructed in the feature space to the label space smoothly. More importantly, by extending the investigation of sample correlations employed in our previous work (Tang et al., 2020), this paper also proposes a generalized Label Enhancement with Sample Correlations, dubbed gLESC for short. This method can jointly explore the implicit information in both the feature space and the label space by employing a tensor-Singular Value Decomposition (t-SVD) based low-rank tensor constraint (Kilmer et al., 2013). Actually, the sample correlations obtained from the feature space alone are not the optimal choice for label distribution recovery, since the feature space also contains excessive information that is useless for LE. For example, regarding facial emotion labels, the gender and identity information carried by feature-space sample correlations may hinder the recovery process of LE. To address this problem, the existing logical labels are also leveraged to attain the desired intrinsic sample correlations, which are more suitable for LE. It is clear that samples with similar label distributions have similar logical labels, but not vice versa. Figuratively speaking, by imposing a t-SVD based low-rank tensor constraint on both the feature space and the label space jointly, the logical labels play a role in removing unwanted information. Once the desired sample correlations are attained, they are leveraged to supervise the recovery process of LE, and optimal recovered label distributions can be achieved. Extensive experiments conducted on 14 benchmark datasets illustrate that our proposed methods are stable and obtain remarkable performance.

Our contributions can be summarized as follows:

  • By incorporating the sample correlations into the recovery process of LE, a novel Label Enhancement with Sample Correlations, named LESC, is proposed in this paper. It uses the low-rank representation of the feature space to explore the global relationships among instances for improving LE.

  • To further investigate the intrinsic sample correlations for LE, a novel generalized LESC (gLESC) is also proposed. By imposing a t-SVD based low-rank tensor constraint on both the feature space and the label space, the proper sample correlations for LE can be achieved effectively.

  • Comprehensive experiments conducted on 14 datasets, including an artificial dataset and 13 real-world datasets, show the effectiveness and generalization ability of our methods compared with several state-of-the-art methods.

The remainder of this paper is organized as follows. The next section reviews related works of LE. Section 3 elaborates our proposed approaches, including LESC and gLESC. Comprehensive experimental results and corresponding discussions are provided in Section 4. Finally, conclusions of this paper are drawn in Section 5.

2. Related Work

For the convenience of describing related works, we declare the fundamental notations in advance. The set of labels is $\mathcal{Y} = \{y_1, y_2, \ldots, y_c\}$, where $c$ is the size of the label set. For an instance $x_i$, the logical label is denoted as $\mathbf{l}_i = (l_{x_i}^{y_1}, l_{x_i}^{y_2}, \ldots, l_{x_i}^{y_c})$ with $l_{x_i}^{y_j} \in \{0, 1\}$, while the corresponding label distribution is denoted as:

(1)   $\mathbf{d}_i = (d_{x_i}^{y_1}, d_{x_i}^{y_2}, \ldots, d_{x_i}^{y_c}),$

where $d_{x_i}^{y_j} \in [0, 1]$ depicts the degree to which $x_i$ belongs to label $y_j$ and $\sum_{j=1}^{c} d_{x_i}^{y_j} = 1$. The goal of the LE process is to recover the associated label distribution of every instance from the existing logical labels in a given training set.

This definition is formally presented by (Xu et al., 2018, 2019a), in which an LE method termed GLLE is also proposed. It is worth noting that some studies concentrated on the same issue earlier. For instance, the fuzzy clustering method (Melin and Castillo, 2005) is applied in (El Gayar et al., 2006), which intends to allocate the description values of each instance over diverse clusters. Specifically, features are clustered into $p$ clusters via fuzzy C-means clustering, where $\mu_k$ denotes the $k$-th cluster center. The cluster membership of each instance $x_i$ with respect to the center $\mu_k$ is obtained by calculating the description value as follows:

(2)   $m_{ik} = \Big( \sum_{j=1}^{p} \big( \|x_i - \mu_k\|_2 \,/\, \|x_i - \mu_j\|_2 \big)^{\frac{2}{\gamma - 1}} \Big)^{-1},$

where $\gamma$ is larger than 1. Afterward, a zero matrix $\mathbf{A}$ is initialized and continuously updated by:

(3)

where $\mathbf{A}_j$ denotes the $j$-th row of $\mathbf{A}$. A prototype label matrix is thus constructed, through which classes and clusters are softly associated. After normalizing the columns and rows of $\mathbf{A}$ to sum to 1, the label distribution is computed for each instance using fuzzy composition.

In addition, other recent studies have focused on graph-based approaches to tackle the LE problem. They construct a similarity matrix over the feature space via various strategies. (Hou et al., 2016) recovers the label distributions according to manifold learning (ML), which enables the local structure of the feature space to be gradually transferred into the label space. In particular, to represent this structure, the similarity matrix is established based on the assumption that each feature vector can be represented by a linear combination of its KNNs, which means minimizing:

(4)   $\min \ \sum_{i} \Big\| x_i - \sum_{j} w_{ij}\, x_j \Big\|^2,$

where $w_{ij}$ is nonzero only if $x_j$ belongs to the KNNs of $x_i$; otherwise, $w_{ij} = 0$. They further constrain that $\sum_{j} w_{ij} = 1$ for translation invariance. The constructed graph is transferred into the label space to minimize the distance between each target label distribution and the identical linear combination of its KNN label distributions (Roweis and Saul, 2000), which infers the optimization of:

(5)   $\min \ \sum_{i} \Big\| \mathbf{d}_i - \sum_{j} w_{ij}\, \mathbf{d}_j \Big\|^2,$

by adding a constraint that keeps the recovered description degrees consistent with the existing logical labels. This formula is minimized with respect to the target label distributions through a constrained quadratic programming process.
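For concreteness, the following is a minimal NumPy sketch of the KNN-based reconstruction weights in (4), written in the standard locally-linear-embedding style (Roweis and Saul, 2000); the function name, the neighborhood size k, and the regularization term are illustrative choices rather than the exact settings of (Hou et al., 2016).

```python
import numpy as np

def local_reconstruction_weights(X, k=5, reg=1e-3):
    """Approximate each instance by an affine combination of its k nearest
    neighbors, with weights summing to 1 (cf. Eq. (4)).
    X: (n, m) feature matrix with rows as instances; returns an (n, n) weight matrix."""
    n = X.shape[0]
    W = np.zeros((n, n))
    # pairwise distances used to find the k nearest neighbors of every instance
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    for i in range(n):
        idx = np.argsort(dist[i])[1:k + 1]              # skip the point itself
        Z = X[idx] - X[i]                               # shift neighbors to the origin
        G = Z @ Z.T                                     # local Gram matrix (k, k)
        G += reg * (np.trace(G) + 1e-12) * np.eye(k)    # regularize for stability
        w = np.linalg.solve(G, np.ones(k))              # solve G w = 1
        W[i, idx] = w / w.sum()                         # enforce the sum-to-one constraint
    return W
```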

(Li et al., 2015) regards LE as a label propagation (LP) process (Zhu and Goldberg, 2009). The pairwise similarity is calculated over the complete feature space, and a fully-connected graph is established as:

(6)   $a_{ij} = \exp\!\Big( -\dfrac{\|x_i - x_j\|_2^2}{2\sigma^2} \Big),$

where $i \neq j$ and $\sigma$ is fixed to be 1. The required LP matrix is built as $\mathbf{P} = \mathbf{\Lambda}^{-\frac{1}{2}} \mathbf{A} \mathbf{\Lambda}^{-\frac{1}{2}}$, with $\mathbf{\Lambda}$ denoting a diagonal matrix whose $i$-th diagonal element equals the sum of the $i$-th row of $\mathbf{A}$. Thus far, the LP is iteratively implemented, and it is proved that the recovered label distribution matrix converges to:

(7)   $\mathbf{D}^{*} = (1-\alpha)\,(\mathbf{I} - \alpha\mathbf{P})^{-1}\,\mathbf{L},$

with $\alpha \in (0, 1)$ denoting the trade-off parameter that controls the contribution between the label propagation and the initial logical label matrix $\mathbf{L}$, whose rows are the logical label vectors.
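A small sketch of this LP-style recovery is given below, assuming the symmetric graph normalization and the closed form in (7); the default sigma, alpha, and the omission of a final normalization step are placeholder choices, not the exact configuration of (Li et al., 2015).

```python
import numpy as np

def lp_recover(X, L, alpha=0.5, sigma=1.0):
    """Label-propagation-style recovery: build a fully-connected Gaussian graph,
    normalize it, and apply the closed form (7).
    X: (n, m) features (rows are instances), L: (n, c) logical labels."""
    n = X.shape[0]
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    A = np.exp(-dist2 / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)                          # no self-similarity
    d = A.sum(axis=1)
    P = A / np.sqrt(np.outer(d, d))                   # symmetric normalization (assumed)
    D = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * P, L)
    return D                                          # softmax-normalize rows afterwards
```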

For the GLLE algorithm, the similarity matrix is also constructed in the feature space from partial topological structure. Different from LP, which calculates pair-wise distances within the whole feature space, GLLE computes the distance between a specific instance and its KNNs to define the relevant element of the similarity matrix as follows:

(8)   $a_{ij} = \begin{cases} \exp\!\Big( -\dfrac{\|x_i - x_j\|_2^2}{2\sigma^2} \Big), & x_j \in N(i) \\ 0, & \text{otherwise,} \end{cases}$

where $N(i)$ is the set of $x_i$'s KNNs. Based on the same intuition that these relationships can be transferred into the label distribution space, the constructed graph is incorporated into the label space to attain a matrix that linearly transforms the logical labels to the label distributions, obtaining the previous state-of-the-art results. Since we normalize each recovered $\mathbf{d}_i$ by the softmax normalization for the above-mentioned algorithms, the condition $\sum_{j=1}^{c} d_{x_i}^{y_j} = 1$ can be satisfied.
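As a small illustration, the softmax normalization mentioned above can be sketched as follows; this is a generic row-wise softmax, not tied to any particular method's implementation.

```python
import numpy as np

def softmax_rows(D):
    """Row-wise softmax: turns raw recovered scores into valid label
    distributions (non-negative entries, each row summing to 1)."""
    E = np.exp(D - D.max(axis=1, keepdims=True))      # subtract the max for stability
    return E / E.sum(axis=1, keepdims=True)
```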

Because it is fully recognized that establishing the similarity matrix based on pair-wise or local feature structures can hinder these approaches' performance, the LRR and the t-SVD based low-rank tensor constraint are introduced here to excavate the global information and to leverage the attained proper sample correlations, overcoming the aforementioned drawbacks in the label distribution recovery process.

3. Our Proposed Approaches

In this section, our methods, i.e., LESC and gLESC, are introduced in detail. In a training set $\{(x_i, \mathbf{l}_i)\}_{i=1}^{n}$, all instances are concatenated along the columns to attain the feature matrix $\mathbf{X} = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{m \times n}$, where $m$ is the feature dimensionality and $n$ is the number of instances. After the LE process, a new LDL training set can be obtained to implement the LDL process. Here we use $\mathbf{L} \in \mathbb{R}^{c \times n}$ and $\mathbf{D} \in \mathbb{R}^{c \times n}$ to denote the logical label matrix and the objective label distribution matrix, respectively.

3.1. LESC Approach

Figure 2. The flow chart of the proposed LESC.

For a given instance $x_i$, it is necessary to find an effective model to recover the best label distribution, and the mapping model employed in this paper can be written as follows:

(9)   $\mathbf{d}_i = \mathbf{W}^{\top} \varphi(x_i),$

where $\mathbf{W}$ parameterizes a linear transformation, and $\varphi(\cdot)$ embeds $x_i$ into a high-dimensional space, where the Gaussian kernel function is employed.

To get an optimal $\mathbf{W}$, the following objective function can be formulated:

(10)   $\min_{\mathbf{W}} \ L(\mathbf{W}) + \lambda\,\Omega(\mathbf{W}),$

where $L(\mathbf{W})$ denotes a loss function, $\Omega(\mathbf{W})$ is used to excavate the underlying information of sample correlations, and $\lambda$ is a trade-off parameter. To be specific, we elaborate $L(\mathbf{W})$ and $\Omega(\mathbf{W})$ in detail in this section.

Since the prior knowledge of the ground-truth label distribution is unavailable, we establish the loss function between the recovered label distributions and the logical labels. The least-squares (LS) loss function is adopted as the first term in (10):

(11)   $L(\mathbf{W}) = \sum_{i=1}^{n} \big\| \mathbf{W}^{\top}\varphi(x_i) - \mathbf{l}_i \big\|_2^2 = \big\| \mathbf{W}^{\top}\varphi(\mathbf{X}) - \mathbf{L} \big\|_F^2.$

As for $\Omega(\mathbf{W})$, the sample correlations are employed here. It is noteworthy that the LRR is imposed on the feature space in our proposed LESC. Global sample correlations in the feature space can be achieved by LRR, since each sample and its global relationships are expressed by a linear combination of the other related samples. Accordingly, this property can be transferred to the label space under general conditions. Therefore, it is expected that a low-rank recovery of the label distributions can be expressed, which means discovering a proper $\mathbf{D}$ that minimizes the distance between $\mathbf{D}$ and $\mathbf{D}\mathbf{Z}$, where $\mathbf{Z}$ is the learned LRR of the feature space. This leads to the second term of the optimization formula (10) as follows:

(12)   $\Omega(\mathbf{W}) = \big\| \mathbf{D} - \mathbf{D}\mathbf{Z} \big\|_F^2, \quad \text{with} \ \ \mathbf{D} = \mathbf{W}^{\top}\varphi(\mathbf{X}).$

To be clear, the flow chart of our LESC is presented in Fig. 2. As can be observed, the sample correlations are obtained by applying the low-rank representation to the feature space. In other words, the proposed LESC aims at seeking the LRR of the feature matrix to excavate the global structure of the feature space. Consequently, by assuming that $\mathbf{X} \approx \mathbf{X}\mathbf{C}$, it is natural and necessary to solve the following rank minimization problem:

(13)   $\min_{\mathbf{C},\,\mathbf{E}} \ \mathrm{rank}(\mathbf{C}) + \beta\,\|\mathbf{E}\|_{2,1}, \quad \mathrm{s.t.} \ \ \mathbf{X} = \mathbf{X}\mathbf{C} + \mathbf{E},$

where $\mathbf{E}$ indicates the sample-specific corruptions, and $\beta$ is the low-rank coefficient which balances the effects between the two parts. $\mathbf{X}\mathbf{C}$ is used to denote the desired low-rank representation of the feature matrix $\mathbf{X}$ with respect to the variable $\mathbf{C}$. Practically, the rank function can be replaced by the nuclear norm to transfer (13) into a convex optimization problem. As a result, we have the following problem:

(14)   $\min_{\mathbf{C},\,\mathbf{E}} \ \|\mathbf{C}\|_{*} + \beta\,\|\mathbf{E}\|_{2,1}, \quad \mathrm{s.t.} \ \ \mathbf{X} = \mathbf{X}\mathbf{C} + \mathbf{E}.$

To get the optimal solution, the augmented Lagrange multiplier (ALM) method with the alternating direction minimization strategy (Lin et al., 2011) is employed in this paper. Specifically, an auxiliary variable $\mathbf{J}$ is introduced so as to make the objective function separable and convenient for optimization. Therefore, (14) can be rewritten as follows:

(15)   $\min_{\mathbf{C},\,\mathbf{E},\,\mathbf{J}} \ \|\mathbf{J}\|_{*} + \beta\,\|\mathbf{E}\|_{2,1}, \quad \mathrm{s.t.} \ \ \mathbf{X} = \mathbf{X}\mathbf{C} + \mathbf{E}, \ \ \mathbf{C} = \mathbf{J},$

and the corresponding ALM problem can be solved by minimizing the following function:

(16)   $\mathcal{L}(\mathbf{J}, \mathbf{C}, \mathbf{E}) = \|\mathbf{J}\|_{*} + \beta\,\|\mathbf{E}\|_{2,1} + \langle \mathbf{Y}_1, \mathbf{X} - \mathbf{X}\mathbf{C} - \mathbf{E} \rangle + \langle \mathbf{Y}_2, \mathbf{C} - \mathbf{J} \rangle + \dfrac{\mu}{2}\big( \|\mathbf{X} - \mathbf{X}\mathbf{C} - \mathbf{E}\|_F^2 + \|\mathbf{C} - \mathbf{J}\|_F^2 \big),$

which can be decomposed into the following subproblems:

3.1.1. J-subproblem

The subproblem of updating J can be written as follows:

(17)   $\mathbf{J}^{*} = \arg\min_{\mathbf{J}} \ \|\mathbf{J}\|_{*} + \dfrac{\mu}{2}\Big\| \mathbf{J} - \Big( \mathbf{C} + \dfrac{\mathbf{Y}_2}{\mu} \Big) \Big\|_F^2.$

By leveraging the singular value thresholding (SVT) method (Lin et al., 2011), the optimal $\mathbf{J}^{*}$ can be achieved. To be specific, we impose the singular value decomposition (SVD) on $\mathbf{C} + \mathbf{Y}_2/\mu$, i.e., $\mathbf{C} + \mathbf{Y}_2/\mu = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top}$, and the following solution can be achieved:

(18)   $\mathbf{J}^{*} = \mathbf{U}\,\mathcal{S}_{\frac{1}{\mu}}(\boldsymbol{\Sigma})\,\mathbf{V}^{\top},$

in which $\mathcal{S}_{\tau}(\cdot)$ is a soft-thresholding operator with the following formulation:

(19)   $\mathcal{S}_{\tau}(x) = \mathrm{sign}(x)\,\max(|x| - \tau,\, 0).$
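For reference, a minimal NumPy sketch of this SVT step is shown below, assuming the threshold 1/μ is applied to the singular values of C + Y2/μ as in (17)-(19).

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: soft-threshold the singular values of M by tau,
    which solves  min_J  tau * ||J||_*  +  0.5 * ||J - M||_F^2."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt        # rebuild with shrunk singular values
```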

3.1.2. C-subproblem

By fixing other variables, the subproblem with respect to C can be formulated as follows:

(20)   $\mathbf{C}^{*} = \arg\min_{\mathbf{C}} \ \dfrac{\mu}{2}\Big\| \mathbf{X} - \mathbf{X}\mathbf{C} - \mathbf{E} + \dfrac{\mathbf{Y}_1}{\mu} \Big\|_F^2 + \dfrac{\mu}{2}\Big\| \mathbf{C} - \mathbf{J} + \dfrac{\mathbf{Y}_2}{\mu} \Big\|_F^2,$

and the optimal solution is $\mathbf{C}^{*} = \big( \mathbf{I} + \mathbf{X}^{\top}\mathbf{X} \big)^{-1} \mathbf{M}$, with

(21)   $\mathbf{M} = \mathbf{X}^{\top}\Big( \mathbf{X} - \mathbf{E} + \dfrac{\mathbf{Y}_1}{\mu} \Big) + \mathbf{J} - \dfrac{\mathbf{Y}_2}{\mu},$

where $\mathbf{I}$ denotes an identity matrix of the proper size.

3.1.3. E-subproblem

For updating E, we solve the following problem:

(22)   $\mathbf{E}^{*} = \arg\min_{\mathbf{E}} \ \beta\,\|\mathbf{E}\|_{2,1} + \dfrac{\mu}{2}\Big\| \mathbf{E} - \Big( \mathbf{X} - \mathbf{X}\mathbf{C} + \dfrac{\mathbf{Y}_1}{\mu} \Big) \Big\|_F^2,$

the closed-form solution of which can be written column-wise as follows:

(23)   $\mathbf{E}^{*}(:, i) = \begin{cases} \dfrac{\|\mathbf{q}_i\|_2 - \beta/\mu}{\|\mathbf{q}_i\|_2}\,\mathbf{q}_i, & \|\mathbf{q}_i\|_2 > \beta/\mu \\ \mathbf{0}, & \text{otherwise,} \end{cases}$

where $\mathbf{q}_i$ denotes the $i$-th column of $\mathbf{Q} = \mathbf{X} - \mathbf{X}\mathbf{C} + \mathbf{Y}_1/\mu$.

3.1.4. Updating Lagrange multipliers $\mathbf{Y}_1$ and $\mathbf{Y}_2$

When other variables are fixed, the Lagrange multipliers $\mathbf{Y}_1$ and $\mathbf{Y}_2$, together with the penalty parameter $\mu$, can be updated as follows:

(24)   $\mathbf{Y}_1 = \mathbf{Y}_1 + \mu\,(\mathbf{X} - \mathbf{X}\mathbf{C} - \mathbf{E}), \quad \mathbf{Y}_2 = \mathbf{Y}_2 + \mu\,(\mathbf{C} - \mathbf{J}), \quad \mu = \min(\rho\mu,\ \mu_{\max}),$

in which $\rho > 1$. Obviously, $\mu$ is increased monotonically by the factor $\rho$ until reaching its maximum, i.e., $\mu_{\max}$.
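Putting the subproblems together, a compact sketch of the whole ALM loop for problem (14) might look as follows. The variable names mirror the derivation above, while the initialization, the penalty schedule, and the fixed iteration count (no convergence check) are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def lrr_alm(X, beta=0.1, rho=1.1, mu=1e-2, mu_max=1e6, n_iter=100):
    """ALM/ADMM-style sketch for  min ||C||_* + beta*||E||_{2,1}  s.t.  X = XC + E.
    X: (m, n) feature matrix with samples as columns; returns the LRR coefficient C."""
    m, n = X.shape
    C = np.zeros((n, n)); J = np.zeros((n, n)); E = np.zeros((m, n))
    Y1 = np.zeros((m, n)); Y2 = np.zeros((n, n))
    XtX = X.T @ X
    for _ in range(n_iter):
        # J-subproblem: singular value thresholding of C + Y2/mu (Eqs. (17)-(19))
        U, s, Vt = np.linalg.svd(C + Y2 / mu, full_matrices=False)
        J = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # C-subproblem: closed-form least-squares update (Eqs. (20)-(21))
        C = np.linalg.solve(np.eye(n) + XtX,
                            X.T @ (X - E + Y1 / mu) + J - Y2 / mu)
        # E-subproblem: column-wise l2,1 shrinkage of Q = X - XC + Y1/mu (Eqs. (22)-(23))
        Q = X - X @ C + Y1 / mu
        norms = np.linalg.norm(Q, axis=0)
        scale = np.maximum(norms - beta / mu, 0.0) / np.maximum(norms, 1e-12)
        E = Q * scale
        # multiplier and penalty updates (Eq. (24))
        Y1 += mu * (X - X @ C - E)
        Y2 += mu * (C - J)
        mu = min(rho * mu, mu_max)
    return C
```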

Once the above problem is optimized, the desired sample correlations can be achieved, i.e., $\mathbf{Z} = \mathbf{C}^{*}$. Consequently, (10) can be rewritten as follows:

(25)   $\min_{\mathbf{W}} \ \big\| \mathbf{W}^{\top}\varphi(\mathbf{X}) - \mathbf{L} \big\|_F^2 + \lambda\,\big\| \mathbf{D} - \mathbf{D}\mathbf{Z} \big\|_F^2,$

where $\mathbf{D} = \mathbf{W}^{\top}\varphi(\mathbf{X})$.

Aiming to achieve the optimal solution $\mathbf{W}^{*}$, the minimization of this objective function is solved by an effective quasi-Newton method, the limited-memory BFGS (L-BFGS) (Nocedal and Wright, 2006), whose optimization process relies on the first-order gradient. Once the formula converges, we feed the optimal $\mathbf{W}^{*}$ into (9) to form the label distribution $\mathbf{d}_i$. Furthermore, since the defined label distribution needs to meet the requirement $\sum_{j=1}^{c} d_{x_i}^{y_j} = 1$, $\mathbf{d}_i$ is normalized by the softmax normalization.
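A hedged sketch of this final step is given below: it assumes the objective in (25) with D = W^T φ(X), computes the analytic gradient, and calls SciPy's L-BFGS-B routine. The kernelized features Phi are taken as a precomputed input, and all names are illustrative rather than the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def lesc_recover(Phi, L, Z, lam=0.01):
    """Minimize ||W^T Phi - L||_F^2 + lam * ||D - D Z||_F^2 with D = W^T Phi,
    then softmax-normalize the recovered distributions.
    Phi: (h, n) kernelized features, L: (c, n) logical labels, Z: (n, n) LRR."""
    h, n = Phi.shape
    c = L.shape[0]
    M = np.eye(n) - Z                                 # so that D - D Z = D M

    def obj_and_grad(w):
        W = w.reshape(h, c)
        D = W.T @ Phi                                 # (c, n) raw scores
        R1, R2 = D - L, D @ M
        f = np.sum(R1 ** 2) + lam * np.sum(R2 ** 2)
        G = 2 * Phi @ (R1 + lam * R2 @ M.T).T         # gradient w.r.t. W, shape (h, c)
        return f, G.ravel()

    res = minimize(obj_and_grad, np.zeros(h * c), jac=True, method="L-BFGS-B")
    D = res.x.reshape(h, c).T @ Phi
    E = np.exp(D - D.max(axis=0, keepdims=True))      # column-wise softmax over labels
    return E / E.sum(axis=0, keepdims=True)
```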

3.2. gLESC Approach

Figure 3. The flow chart of the proposed gLESC.

LESC employs a low-rank subspace representation of the feature space to get the sample correlations, which are utilized to supervise the recovery process of label distributions. Under the assumption that both the features and the labels carry the semantic information of samples, it is natural and reasonable to impose the aforementioned constraint on the desired label distributions. However, only the sample correlations in the feature space are investigated in the proposed LESC, and the corresponding information hidden in the existing logical labels is ignored. Actually, the sample correlations obtained by LRR in the feature space are influenced by some interfering information. For example, in a facial emotion dataset, gender and identity information may be contained in the sample correlations attained by LRR in the feature space. Obviously, this obstructs the exact recovery of label distributions.

Since the existing logical labels do not contain the unwanted information, it is a good choice to incorporate the underlying information of these logical labels into the formation of the desired sample correlations. To this end, a generalized label enhancement with sample correlations (gLESC) is also proposed in this paper, and the corresponding flow chart is shown in Fig. 3. As can be observed in Fig. 3, the underlying sample correlations of both the sample features and the existing logical labels can be attained to supervise the whole recovery process of label distributions, so that the refinement of LE can be achieved and the implicit information of data samples can be fully leveraged. To achieve this goal, the tensor-Singular Value Decomposition (t-SVD) based low-rank tensor constraint (Kilmer et al., 2013) is introduced in this section. It should be noted that the difference between LESC and gLESC lies in the construction of the sample correlations. In the proposed gLESC, we have the following formulation:

(26)   $\min_{\mathbf{C}^{(v)},\,\mathbf{E}^{(v)}} \ \|\boldsymbol{\mathcal{C}}\|_{\circledast} + \beta\,\|\boldsymbol{\mathcal{E}}\|_{2,1}, \quad \mathrm{s.t.} \ \ \mathbf{X} = \mathbf{X}\mathbf{C}^{(1)} + \mathbf{E}^{(1)}, \ \ \mathbf{L} = \mathbf{L}\mathbf{C}^{(2)} + \mathbf{E}^{(2)},$

where $\boldsymbol{\mathcal{C}}$ and $\boldsymbol{\mathcal{E}}$ are 3-order tensors constructed by stacking $\{\mathbf{C}^{(1)}, \mathbf{C}^{(2)}\}$ and $\{\mathbf{E}^{(1)}, \mathbf{E}^{(2)}\}$, respectively. $\|\cdot\|_{\circledast}$ denotes the t-SVD based tensor nuclear norm (Kilmer et al., 2013), which can be calculated as follows:

(27)   $\|\boldsymbol{\mathcal{C}}\|_{\circledast} = \sum_{k=1}^{n_3} \sum_{i=1}^{\min(n_1, n_2)} \big| \boldsymbol{\mathcal{S}}_f(i, i, k) \big|,$

in which the subscript $f$ denotes a fast Fourier transformation (FFT) along the 3-rd dimension of the tensor, i.e., over the frontal slices of $\boldsymbol{\mathcal{C}}$ (Zhang et al., 2014), and $\boldsymbol{\mathcal{S}}_f(i, i, k)$ indicates the $i$-th diagonal element of $\boldsymbol{\mathcal{S}}_f(:, :, k)$, which can be calculated as follows:

(28)   $\boldsymbol{\mathcal{C}}_f(:, :, k) = \boldsymbol{\mathcal{U}}_f(:, :, k)\,\boldsymbol{\mathcal{S}}_f(:, :, k)\,\boldsymbol{\mathcal{V}}_f(:, :, k)^{\top}.$
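As an illustration, the t-SVD based tensor nuclear norm of (27)-(28) can be sketched as below, assuming the common convention of summing all singular values of the Fourier-domain frontal slices (some definitions additionally divide by the length of the 3rd mode, which is omitted here).

```python
import numpy as np

def tsvd_nuclear_norm(T):
    """t-SVD based tensor nuclear norm: FFT along the 3rd mode, SVD of every
    frontal slice in the Fourier domain, and sum of all singular values.
    T: (n1, n2, n3) real 3-order tensor."""
    Tf = np.fft.fft(T, axis=2)                        # FFT along the 3rd dimension
    norm = 0.0
    for k in range(T.shape[2]):
        s = np.linalg.svd(Tf[:, :, k], compute_uv=False)   # diagonal of S_f(:, :, k)
        norm += s.sum()
    return norm
```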

To solve the minimization problem of (26), we can construct the following ALM equation:

(29)

where $\boldsymbol{\mathcal{J}}$ is an auxiliary tensor variable, and the remaining introduced variables are the Lagrange multipliers. Consequently, we have the following subproblems:

3.2.1. $\boldsymbol{\mathcal{J}}$-subproblem

By fixing other variables, the optimal solution can be written as follows:

(30)

which is a standard t-SVD based tensor nuclear norm minimization problem with the following optimization (Hu et al., 2016):

(31)

in which the involved quantities can be calculated as follows:

(32)

and a tensor tubal-shrinkage operator with the following definition is applied:

(33)

where the introduced tensor is an f-diagonal tensor. Specifically, its elements can be formulated as follows:

(34)

3.2.2. $\mathbf{C}^{(v)}$-subproblem

With other variables fixed, the subproblem of updating $\mathbf{C}^{(1)}$ and $\mathbf{C}^{(2)}$ can be written as follows:

(35)

which has closed-form solutions. To be specific, we take the derivatives of the above function with respect to $\mathbf{C}^{(1)}$ and $\mathbf{C}^{(2)}$, respectively, and set the corresponding derivatives to 0; the optimal solutions can then be attained as follows:

(36)

where the introduced auxiliary matrices have the following formulations:

(37)

where I indicates an identity matrix with the proper size.

3.2.3. $\boldsymbol{\mathcal{E}}$-subproblem

When other variables are fixed, the $\boldsymbol{\mathcal{E}}$-subproblem can be formulated as follows:

(38)

Since the $\ell_{2,1}$-norm of a tensor is defined as the total sum of the $\ell_{2}$-norms of the fibers along its 3-rd dimension, it is obvious that the tensor norm equals the $\ell_{2,1}$-norm of the corresponding mode-3 matricization. So here we can reformulate (38) as follows:

(39)

where $\mathbf{E}_{(3)}$ denotes the matricization of the tensor along the 3-rd direction, with

(40)

Accordingly, the closed-form solution of (38) can be obtained as follows:

(41)

in which the two vectors indicate the $i$-th columns of $\mathbf{E}_{(3)}$ and of the corresponding input matrix in (39), respectively.
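For illustration, a sketch of the column-wise l2,1 shrinkage used for this subproblem is given below; it operates on an already-matricized input, so the mode-3 unfolding convention of (39)-(40) is assumed to be handled by the caller.

```python
import numpy as np

def tensor_l21_shrink(E_unfold, tau):
    """Column-wise l2,1 proximal step: each column whose l2 norm exceeds tau is
    shrunk toward zero, otherwise it is set to zero (cf. the closed form in (41))."""
    norms = np.linalg.norm(E_unfold, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return E_unfold * scale
```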

3.2.4. Updating Lagrange multipliers

We update the Lagrange multipliers as follows:

(42)

Once problem (26) is optimized, we can obtain the desired sample correlations as follows, so as to achieve an exact recovery of label distributions:

(43)

The subsequent operation of our gLESC is similar to (25); for the compactness of this paper, we omit it in this section.

Figure 4. The flowchart of the label recovery experiments conducted in this section. To be specific, the true label distributions are first binarized to attain the logical labels; then an LE method is performed to obtain the recovered label distributions. Accordingly, we evaluate the recovery performance based on six frequently-used LDL evaluation measures (Geng, 2016).

4. Experiments

To validate the effectiveness and superiority of our methods, extensive experiments are conducted, and the experimental results together with the corresponding analyses are reported in this section. Label recovery experiments are performed on 14 benchmark datasets (http://palm.seu.edu.cn/xgeng/LDL/index.htm), and the corresponding flowchart is shown in Fig. 4.

4.1. Datasets

The fundamental statistics of the 14 datasets employed for evaluation, including 13 real-world datasets and a toy dataset, can be observed in Table 1. To be specific, the first 3 real-world datasets are created from movies and facial expression images, and the remaining 10 real-world datasets, from Yeast-alpha to Yeast-spoem, are collected from the records of biological experiments on the budding yeast genes (Eisen et al., 1998). As for the artificial dataset, which is also adopted in (Xu et al., 2018) to intuitively exhibit a model's ability of label enhancement, each instance is chosen following the rule that the first two dimensions $x_{i1}$ and $x_{i2}$ form a grid with an interval of 0.04 in the range [-1, 1], while the third dimension $x_{i3}$ is computed by:

(44)

The corresponding label distribution is collected through the following equations:

(45)
(46)

and label distributions can be obtained as follows:

(47)

where the involved constants follow the settings of (Xu et al., 2018).

Dataset        # Instances   # Features   # Labels
Artificial     2601          3            3
Movie          7755          1869         5
SBU_3DFE       2500          243          6
SJAFFE         213           243          6
Yeast-alpha    2465          24           18
Yeast-cdc      2465          24           15
Yeast-cold     2465          24           4
Yeast-diau     2465          24           7
Yeast-dtt      2465          24           4
Yeast-elu      2465          24           14
Yeast-heat     2465          24           6
Yeast-spo      2465          24           6
Yeast-spo5     2465          24           3
Yeast-spoem    2465          24           2

Table 1. Some Information about the 14 Datasets.

It is noteworthy that, due to the lack of datasets with both logical labels and label distributions, the logical labels had to be binarized from the ground-truth label distributions in the original datasets, so as to implement the LE algorithms and measure the similarity between the recovered label distributions and the ground truths. To ensure the consistency of evaluation, we binarize the logical labels in the way used in (Xu et al., 2018).
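As an illustration of this pre-processing, the sketch below derives logical labels from ground-truth distributions with a simple per-instance thresholding rule; it is a stand-in for, not a reproduction of, the binarization procedure of (Xu et al., 2018).

```python
import numpy as np

def binarize_distributions(D, threshold=None):
    """Turn ground-truth label distributions into logical labels.
    D: (n, c) label distributions. A label is set to 1 when its description
    degree reaches a per-instance threshold (default: the instance's mean degree);
    every instance is guaranteed to keep at least one positive label."""
    D = np.asarray(D, dtype=float)
    thr = D.mean(axis=1, keepdims=True) if threshold is None else threshold
    L = (D >= thr).astype(int)
    empty = L.sum(axis=1) == 0
    L[empty, np.argmax(D[empty], axis=1)] = 1         # keep the dominant label if none survives
    return L
```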

4.2. Experimental Settings

To fully investigate the performance of our algorithms, i.e., LESC and gLESC, five state-of-the-art algorithms, including FCM (El Gayar et al., 2006), KM (Jiang et al., 2006), LP (Li et al., 2015), ML (Hou et al., 2016), and GLLE (Xu et al., 2019a), are employed for comparison. We list the parameter settings here. The parameters $\lambda$ and $\beta$ of our LESC and gLESC are selected from candidate values ranging from 0.0001 to 10. Consistent with the parameters used in (Xu et al., 2019a), the parameter $\alpha$ in LP is fixed to be 0.5, the Gaussian kernel is employed in KM, the number of neighbors for ML is set as in (Xu et al., 2019a), and the parameter $\gamma$ in FCM is fixed to be 2. Regarding GLLE, the number of neighbors and the optimal value of its trade-off parameter are also chosen following (Xu et al., 2019a).

Since both the recovered and ground-truth label distributions are label vectors, the average distance or similarity between them is calculated to evaluate the LE algorithms thoroughly. For a fair comparison, six measures are selected, where the first four are distance-based measures and the last two are similarity-based measures, reflecting the performance of LE algorithms from different semantic aspects. The measures are listed in Table 2, where $\mathbf{d}$ denotes the real label distribution and $\hat{\mathbf{d}}$ the recovered one; for the metrics Chebyshev distance (Cheb), Canberra metric (Canber), Clark distance (Clark), Kullback-Leibler divergence (KL), cosine coefficient (Cosine), and intersection similarity (Intersec), "$\downarrow$" states "the smaller the better", and "$\uparrow$" states "the larger the better".

Measure      Formula
Cheb $\downarrow$       $\max_{j} \big| d_j - \hat{d}_j \big|$
Canber $\downarrow$     $\sum_{j} \dfrac{|d_j - \hat{d}_j|}{d_j + \hat{d}_j}$
Clark $\downarrow$      $\sqrt{ \sum_{j} \dfrac{(d_j - \hat{d}_j)^2}{(d_j + \hat{d}_j)^2} }$
KL $\downarrow$         $\sum_{j} d_j \ln \dfrac{d_j}{\hat{d}_j}$
Cosine $\uparrow$       $\dfrac{\sum_{j} d_j \hat{d}_j}{\sqrt{\sum_{j} d_j^2}\,\sqrt{\sum_{j} \hat{d}_j^2}}$
Intersec $\uparrow$     $\sum_{j} \min\big( d_j, \hat{d}_j \big)$

Table 2. Introduction to the Evaluation Measures.
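For completeness, a compact NumPy sketch of the six measures is given below, following their standard definitions in (Geng, 2016); a small epsilon is added for numerical safety, which the original definitions do not include.

```python
import numpy as np

def ldl_measures(D_true, D_rec, eps=1e-12):
    """Average the six LDL evaluation measures of Table 2 over all instances.
    D_true, D_rec: (n, c) ground-truth and recovered label distributions."""
    P, Q = np.asarray(D_true) + eps, np.asarray(D_rec) + eps
    cheb = np.abs(P - Q).max(axis=1).mean()
    canber = (np.abs(P - Q) / (P + Q)).sum(axis=1).mean()
    clark = np.sqrt(((P - Q) ** 2 / (P + Q) ** 2).sum(axis=1)).mean()
    kl = (P * np.log(P / Q)).sum(axis=1).mean()
    cosine = ((P * Q).sum(axis=1) /
              (np.linalg.norm(P, axis=1) * np.linalg.norm(Q, axis=1))).mean()
    intersec = np.minimum(P, Q).sum(axis=1).mean()
    return dict(Cheb=cheb, Canber=canber, Clark=clark,
                KL=kl, Cosine=cosine, Intersec=intersec)
```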
Figure 5. Visualization of the ground-truth and recovered label distributions on the artificial dataset (regarded as RGB colors, best viewed in color).
Figure 6. CD diagrams of different LE methods on six measures, including Cheb, Canber, Clark, KL, Cosine, and Intersec. CD diagrams are calculated based on the Wilcoxon-Holm method (Ismail Fawaz et al., 2019). Specifically, a method located on the right side is better than a method on the left side, and a line connecting two methods denotes that their recovery results differ by less than one critical difference.

4.3. Analysis of Recovery Performance

First, we evaluate the recovery performance on the artificial dataset. To illustrate the recovery performance visually, the three-dimensional label distributions are separately converted into the RGB color channels, which are reinforced by a decorrelation stretch process for easier observation. In other words, the label distribution of each point in the feature space can be represented by its color. Thus, the color patterns can be directly observed to compare the ground truth and the recovered label distributions. As shown in Fig. 5, in contrast to the ground-truth color patterns, our algorithms, both LESC and gLESC, recover these patterns nearly identically, and GLLE obtains visually similar results. In addition, the color patterns of the other four algorithms, i.e., FCM, KM, LP, and ML, are barely satisfactory, which exposes the limits of only excavating the feature-space structure locally. Clearly, our methods achieve good recovery performance on the artificial dataset.

Measure Results by Cheb
Dataset       FCM        KM         LP         ML         GLLE       LESC       gLESC
Artificial    0.188(5)   0.260(7)   0.130(4)   0.227(6)   0.108(3)   0.057(2)   0.055(1)
Movie         0.230(6)   0.234(7)   0.161(4)   0.164(5)   0.122(3)   0.121(2)   0.120(1)
SBU_3DFE      0.135(5)   0.238(7)   0.123(2)   0.233(6)   0.126(4)   0.122(1)   0.125(3)
SJAFFE        0.132(5)   0.214(7)   0.107(4)   0.186(6)   0.087(3)   0.069(2)   0.067(1)
Yeast-alpha   0.044(5)   0.063(7)   0.040(4)   0.057(6)   0.020(3)   0.015(2)   0.014(1)
Yeast-cdc     0.051(5)   0.076(7)   0.042(4)   0.071(6)   0.022(3)   0.019(2)   0.017(1)
Yeast-cold    0.141(5)   0.252(7)   0.137(4)   0.242(6)   0.066(3)   0.056(2)   0.052(1)
Yeast-diau    0.124(5)   0.152(7)   0.099(4)   0.148(6)   0.053(3)   0.042(2)   0.039(1)
Yeast-dtt     0.097(4)   0.257(7)   0.128(5)   0.244(6)   0.052(3)   0.043(2)   0.037(1)
Yeast-elu     0.052(5)   0.078(7)   0.044(4)   0.072(6)   0.023(3)   0.019(2)   0.017(1)
Yeast-heat    0.169(6)   0.175(7)   0.086(4)   0.165(5)   0.049(3)   0.046(2)   0.043(1)
Yeast-spo     0.130(5)   0.175(7)   0.090(4)   0.171(6)   0.062(3)   0.060(2)   0.059(1)
Yeast-spo5    0.162(5)   0.277(7)   0.114(4)   0.273(6)   0.099(3)   0.092(1)   0.092(1)
Yeast-spoem   0.233(5)   0.408(7)   0.163(4)   0.403(6)   0.088(3)   0.087(2)   0.084(1)
Avg. Rank     5.07       7.00       3.93       5.86       3.07       1.86       1.14

Measure Results by Canber
Dataset       FCM        KM         LP         ML         GLLE       LESC       gLESC
Artificial    0.797(5)   1.779(7)   0.668(4)   1.413(6)   0.617(3)   0.213(2)   0.193(1)
Movie         1.664(4)   3.444(7)   1.720(5)   1.934(6)   1.045(3)   1.034(1)   1.034(1)
SBU_3DFE      1.020(4)   4.121(7)   1.245(5)   4.001(6)   0.820(3)   0.799(1)   0.803(2)
SJAFFE        1.081(5)   4.010(7)   1.064(4)   3.138(6)   0.781(3)   0.561(2)   0.550(1)
Yeast-alpha   2.883(4)   11.809(7)  4.544(5)   11.603(6)  1.134(3)   0.846(2)   0.761(1)
Yeast-cdc     2.415(4)   9.875(7)   3.644(5)   9.695(6)   0.959(3)   0.765(2)   0.695(1)
Yeast-cold    0.734(4)   2.566(7)   0.924(5)   2.519(6)   0.305(3)   0.263(2)   0.242(1)
Yeast-diau    1.895(4)   4.261(7)   1.748(5)   4.180(6)   0.671(3)   0.480(2)   0.452(1)
Yeast-dtt     0.501(4)   2.594(7)   0.941(5)   2.549(6)   0.248(3)   0.206(2)   0.175(1)
Yeast-elu     1.689(4)   9.110(7)   3.381(5)   8.949(6)   0.902(3)   0.727(2)   0.628(1)
Yeast-heat    1.157(4)   3.849(7)   1.293(5)   3.779(6)   0.430(3)   0.401(2)   0.372(1)
Yeast-spo     0.998(4)   3.854(7)   1.231(5)   3.772(6)   0.548(3)   0.533(2)   0.521(1)
Yeast-spo5    0.563(5)   1.382(7)   0.401(4)   1.355(6)   0.305(3)   0.284(2)   0.283(1)
Yeast-spoem   0.534(5)   1.253(7)   0.365(4)   1.226(6)   0.183(3)   0.180(2)   0.175(1)
Avg. Rank     4.27       7.00       4.71       6.00       3.00       1.86       1.07

Measure Results by Clark
Dataset       FCM        KM         LP         ML         GLLE       LESC       gLESC
Artificial    0.561(5)   1.251(7)   0.487(4)   1.041(6)   0.452(3)   0.148(2)   0.130(1)
Movie         0.859(4)   1.766(7)   0.913(5)   1.140(6)   0.569(3)   0.564(2)   0.563(1)
SBU_3DFE      0.482(4)   1.907(7)   0.580(5)   1.848(6)   0.391(3)   0.378(2)   0.376(1)
SJAFFE        0.522(5)   1.874(7)   0.502(4)   1.519(6)   0.377(3)   0.276(2)   0.270(1)
Yeast-alpha   0.821(4)   3.153(7)   1.185(5)   3.088(6)   0.337(3)   0.253(2)   0.231(1)
Yeast-cdc     0.739(4)   2.885(7)   1.014(5)   2.825(6)   0.306(3)   0.251(2)   0.231(1)
Yeast-cold    0.433(4)   1.472(7)   0.503(5)   1.440(6)   0.176(3)   0.152(2)   0.141(1)
Yeast-diau    0.838(5)   1.886(7)   0.788(4)   1.844(6)   0.296(3)   0.224(2)   0.211(1)
Yeast-dtt     0.329(4)   1.477(7)   0.499(5)   1.446(6)   0.143(3)   0.119(2)   0.102(1)
Yeast-elu     0.579(4)   2.768(7)   0.973(5)   2.711(6)   0.295(3)   0.241(2)   0.213(1)
Yeast-heat    0.580(5)   1.802(7)   0.568(4)   1.764(6)   0.213(3)   0.199(2)   0.186(1)
Yeast-spo     0.520(4)   1.811(7)   0.558(5)   1.768(6)   0.266(3)   0.258(2)   0.253(1)
Yeast-spo5    0.395(5)   1.059(7)   0.274(4)   1.036(6)   0.197(3)   0.185(2)   0.184(1)
Yeast-spoem   0.401(5)   1.028(7)   0.272(4)   1.004(6)   0.132(3)   0.129(2)   0.126(1)
Avg. Rank     4.43       7.00       4.57       6.00       3.00       2.00       1.00

Measure Results by KL
Dataset       FCM        KM         LP         ML         GLLE       LESC       gLESC
Artificial    0.267(5)   0.309(7)   0.160(4)   0.274(6)   0.131(3)   0.013(2)   0.012(1)
Movie         0.381(6)   0.452(7)   0.177(5)   0.218(4)   0.123(3)   0.120(1)   0.120(1)
SBU_3DFE      0.094(3)   0.603(6)   0.105(4)   0.565(5)   0.069(2)   0.064(1)   0.064(1)
SJAFFE        0.107(5)   0.558(7)   0.077(4)   0.391(6)   0.050(3)   0.029(2)   0.027(1)
Yeast-alpha   0.100(4)   0.630(7)   0.121(5)   0.602(6)   0.013(3)   0.008(2)   0.007(1)
Yeast-cdc     0.091(4)   0.630(7)   0.111(5)   0.601(6)   0.014(3)   0.010(2)   0.008(1)
Yeast-cold    0.113(5)   0.586(7)   0.103(4)   0.556(6)   0.019(3)   0.015(2)   0.013(1)
Yeast-diau    0.159(5)   0.538(7)   0.127(4)   0.509(6)   0.027(3)   0.017(2)   0.015(1)
Yeast-dtt     0.065(4)   0.617(7)   0.103(5)   0.586(6)   0.013(3)   0.010(2)   0.007(1)
Yeast-elu     0.059(4)   0.617(7)   0.109(5)   0.589(6)   0.013(3)   0.009(2)   0.007(1)
Yeast-heat    0.147(5)   0.586(7)   0.089(4)   0.556(6)   0.017(3)   0.015(2)   0.014(1)
Yeast-spo     0.110(5)   0.562(7)   0.084(4)   0.532(6)   0.029(3)   0.028(2)   0.027(1)
Yeast-spo5    0.123(5)   0.334(7)   0.042(4)   0.317(6)   0.034(3)   0.031(1)   0.031(1)
Yeast-spoem   0.208(5)   0.531(7)   0.067(4)   0.503(6)   0.027(2)   0.027(2)   0.026(1)
Avg. Rank     4.64       7.00       4.36       5.79       2.86       1.86       1.00

Measure Results by Cosine
Dataset       FCM        KM         LP         ML         GLLE       LESC       gLESC
Artificial    0.933(5)   0.918(7)   0.974(4)   0.925(6)   0.980(3)   0.992(1)   0.991(2)
Movie         0.773(7)   0.880(6)   0.929(4)   0.919(5)   0.936(3)   0.937(2)   0.938(1)
SBU_3DFE      0.912(5)   0.812(7)   0.922(4)   0.815(6)   0.927(3)   0.932(1)   0.931(2)
SJAFFE        0.906(5)   0.827(7)   0.941(4)   0.857(6)   0.958(3)   0.973(2)   0.975(1)
Yeast-alpha   0.922(4)   0.751(7)   0.911(5)   0.756(6)   0.987(3)   0.992(2)   0.994(1)
Yeast-cdc     0.929(4)   0.754(7)   0.916(5)   0.759(6)   0.987(3)   0.991(2)   0.992(1)
Yeast-cold    0.922(5)   0.779(7)   0.925(4)   0.784(6)   0.982(3)   0.986(2)   0.988(1)
Yeast-diau    0.882(5)   0.799(7)   0.915(4)   0.803(6)   0.975(3)   0.985(2)   0.987(1)
Yeast-dtt     0.959(4)   0.759(7)   0.921(5)   0.763(6)   0.988(3)   0.991(2)   0.994(1)
Yeast-elu     0.950(4)   0.758(7)   0.918(5)   0.763(6)   0.987(3)   0.991(2)   0.993(1)
Yeast-heat    0.883(5)   0.779(7)   0.932(4)   0.783(6)   0.984(3)   0.986(2)   0.987(1)
Yeast-spo     0.909(5)   0.800(7)   0.939(4)   0.803(6)   0.974(3)   0.975(2)   0.976(1)
Yeast-spo5    0.922(5)   0.882(7)   0.969(4)   0.884(6)   0.971(3)   0.974(1)   0.974(1)
Yeast-spoem   0.878(5)   0.812(7)   0.950(4)   0.815(6)   0.978(2)   0.978(2)   0.979(1)
Avg. Rank     4.86       6.93       4.29       5.93       2.93       1.79       1.14

Measure Results by Intersec
Dataset       FCM        KM         LP         ML         GLLE       LESC       gLESC
Artificial    0.812(5)   0.740(7)   0.870(4)   0.773(6)   0.892(3)   0.943(2)   0.945(1)
Movie         0.677(6)   0.649(7)   0.778(5)   0.779(4)   0.831(3)   0.833(1)   0.833(1)
SBU_3DFE      0.827(4)   0.579(7)   0.810(5)   0.587(6)   0.850(3)   0.855(1)   0.854(2)
SJAFFE        0.821(5)   0.593(7)   0.837(4)   0.661(6)   0.872(3)   0.905(2)   0.908(1)
Yeast-alpha   0.844(4)   0.532(7)   0.774(5)   0.537(6)   0.938(3)   0.953(2)   0.958(1)
Yeast-cdc     0.847(4)   0.533(7)   0.779(5)   0.538(6)   0.937(3)   0.950(2)   0.954(1)
Yeast-cold    0.833(4)   0.559(7)   0.794(5)   0.565(6)   0.924(3)   0.935(2)   0.940(1)
Yeast-diau    0.760(5)   0.588(7)   0.788(4)   0.593(6)   0.906(3)   0.933(2)   0.937(1)
Yeast-dtt     0.894(4)   0.541(7)   0.786(5)   0.546(6)   0.939(3)   0.949(2)   0.957(1)
Yeast-elu     0.883(4)   0.539(7)   0.782(5)   0.544(6)   0.936(3)   0.949(2)   0.956(1)
Yeast-heat    0.807(4)   0.559(7)   0.805(5)   0.564(6)   0.929(3)   0.934(2)   0.939(1)
Yeast-spo     0.836(4)   0.575(7)   0.819(5)   0.580(6)   0.909(3)   0.912(2)   0.914(1)
Yeast-spo5    0.838(5)   0.724(7)   0.886(4)   0.727(6)   0.901(3)   0.908(1)   0.908(1)
Yeast-spoem   0.767(5)   0.592(7)   0.837(4)   0.597(6)   0.912(3)   0.913(2)   0.916(1)
Avg. Rank     4.50       7.00       4.64       5.86       3.00       2.13       1.07

Table 3. Recovery Results (value(rank)).

To further investigate the recovery performance, we present the quantitative results of the aforementioned algorithms in the metrics of Cheb, Canber, Clark, KL, Cosine, and Intersec (as shown in Table 3). To exhibit the mean accuracy of the recovered label distributions, the average rank of each method over all datasets is also listed. Overall, the proposed LESC achieves the second-best recovery performance, and the proposed gLESC obtains the best results. For example, the average ranks of LESC and gLESC in the metric of Clark are 2.00 and 1.00, respectively. Regarding the artificial dataset, the corresponding quantitative results are consistent with the recovered color patterns in Fig. 5. For the 13 real-world datasets, the results in Table 3 also demonstrate the superiority of our LESC and gLESC. For example, from the Yeast-alpha dataset to the Yeast-spoem dataset, LESC and gLESC attain the best recovery performance, and gLESC always ranks first. Additionally, we also report the critical difference (CD) of the average ranks in this section. As can be seen in Fig. 6, the CD diagrams show that our gLESC achieves the optimal recovery results on all metrics, and the proposed LESC also attains the sub-optimal performance. In general, the recovery performance can be ranked as gLESC>LESC>GLLE>LP>FCM>ML>KM.

Are sample correlations obtained by the low-rank representation suitable for LE? As can be observed from Table 3 and Fig. 6, LESC and gLESC, which leverage the low-rank representation to attain global sample correlations for LE, outperform GLLE, which uses distance-based similarity for label recovery, by a large margin. Consequently, it is clear that sample correlations obtained by the low-rank representation are suitable for LE.

Are sample correlations captured from both the feature space and the label space better for LE? Compared with LESC, gLESC leverages a tensor multi-rank minimization to obtain the sample correlations from both the feature space and the label space. Since the sample correlations investigated in gLESC are more suitable than those in LESC, it is expected that gLESC attains better recovery performance. From the quantitative experimental results in Table 3 and Fig. 6, we can conclude that sample correlations captured from both the feature space and the label space are better for LE.

Figure 7. Label recovery performance of LESC on SBU_3DFE, Yeast-alpha, and Yeast-cold in the metrics of Cheb and Cosine. Specifically, different rows and columns denote different values of the two trade-off hyperparameters.
Figure 8. Label recovery performance of gLESC on SBU_3DFE, Yeast-alpha, and Yeast-cold in the metrics of Cheb and Cosine. Specifically, different rows and columns denote different values of the two trade-off hyperparameters.

4.4. Parameters Sensitivity

Two trade-off hyperparameters, $\lambda$ and $\beta$, are involved in our proposed methods. Their influence is analyzed separately by fixing one parameter and tuning the other over values ranging from 0.0001 to 10. In this section, we take the experimental results on SBU_3DFE, Yeast-alpha, and Yeast-cold in the metrics of Cheb and Cosine as examples, which can be seen in Fig. 7 and Fig. 8. Although only the cases of three datasets are illustrated here, the same observations can be obtained on the other datasets.

For LESC, when the low-rank coefficient varies with the trade-off parameter fixed, the two shown measures of the recovery performance fluctuate within a very tiny range that can hardly be distinguished. As we increase the other parameter from 0.0001 to 0.1, the recovery performance also changes within a small scope, and when it is geared to 1 or 10, the results even climb to a higher level. Particularly, taking the Yeast-alpha dataset for reference, it is found that over this range our worst measure result still far exceeds that of the previous state-of-the-art baseline, i.e., 0.987 versus 0.973 (the best result attained by GLLE) in the metric of Cosine. Regarding gLESC, similar observations can be reached as well, and we omit them here for the compactness of this paper. As discussed above, these phenomena indicate that our algorithms, both LESC and gLESC, are robust when the values of $\lambda$ and $\beta$ in the objective function vary over a large scope. This allows us to generalize our algorithms to different datasets without much effort in adjusting the hyperparameters.

5. Conclusion

In this paper, two novel LE methods, i.e., LESC and gLESC, are proposed to boost the LE performance by exploiting the underlying sample correlations. LESC explores the low-rank representation from the feature space, and gLESC further investigates the sample correlations by utilizing a tensor multi-rank minimization to obtain more suitable sample correlations from both the feature space and label space during the label distribution recovery process. Extensive experimental results on 14 datasets show that LE can really benefit from the sample correlations. They demonstrate the remarkable superiority of the proposed LESC and gLESC over several state-of-the-art algorithms in recovering the label distributions. Further analysis on the influence of hyperparameters verifies the robustness of our methods.

Acknowledgements.
This work is supported in part by the Fundamental Research Funds for the Central Universities under Grant No. xzy012019045.

References

  • Z. Chen, X. Wei, P. Wang, and Y. Guo (2019) Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5177–5186. External Links: Link Cited by: §1.
  • M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95 (25), pp. 14863–14868. External Links: Link Cited by: §4.1.
  • N. El Gayar, F. Schwenker, and G. Palm (2006) A study of the robustness of knn classifiers trained using soft labels. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, pp. 67–80. External Links: Link Cited by: §1, §2, §4.2.
  • B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng (2017) Deep label distribution learning with label ambiguity. IEEE Transactions on Image Processing 26 (6), pp. 2825–2838. External Links: Link Cited by: §1.
  • X. Geng, C. Yin, and Z. Zhou (2013) Facial age estimation by learning from label distributions. IEEE transactions on pattern analysis and machine intelligence 35 (10), pp. 2401–2412. External Links: Link Cited by: §1.
  • X. Geng (2016) Label distribution learning. IEEE Transactions on Knowledge and Data Engineering 28 (7), pp. 1734–1748. External Links: Link Cited by: §1, Figure 4.
  • P. Hou, X. Geng, and M. Zhang (2016) Multi-label manifold learning. In Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §1, §2, §4.2.
  • W. Hu, D. Tao, W. Zhang, Y. Xie, and Y. Yang (2016) The twist tensor nuclear norm for video completion. IEEE transactions on neural networks and learning systems 28 (12), pp. 2961–2973. External Links: Link Cited by: §3.2.1.
  • S. Huang and Z. Zhou (2012) Multi-label learning by exploiting label correlations locally. In Twenty-sixth AAAI conference on artificial intelligence, Cited by: §1.
  • H. Ismail Fawaz, G. Forestier, J. Weber, L. Idoumghar, and P. Muller (2019) Deep learning for time series classification: a review. Data Mining and Knowledge Discovery 33 (4), pp. 917–963. External Links: Link Cited by: Figure 6.
  • X. Jia, X. Zheng, W. Li, C. Zhang, and Z. Li (2019) Facial emotion distribution learning by exploiting low-rank label correlations locally. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9841–9850. External Links: Link Cited by: §1.
  • X. Jiang, Z. Yi, and J. C. Lv (2006) Fuzzy svm with a new fuzzy membership function. Neural Computing & Applications 15 (3-4), pp. 268–276. External Links: Link Cited by: §4.2.
  • M. E. Kilmer, K. Braman, N. Hao, and R. C. Hoover (2013) Third-order tensors as operators on matrices: a theoretical and computational framework with applications in imaging. SIAM Journal on Matrix Analysis and Applications 34 (1), pp. 148–172. External Links: Link Cited by: §1, §3.2.
  • Y. Kim and E. M. Provost (2015) Emotion recognition during speech using dynamics of multiple regions of the face. ACM Trans. Multimedia Comput. Commun. Appl. 12 (1s). External Links: ISSN 1551-6857, Link, Document Cited by: §1.
  • Y. Li, M. Zhang, and X. Geng (2015) Leveraging implicit relative labeling-importance information for effective multi-label learning. In 2015 IEEE International Conference on Data Mining, pp. 251–260. External Links: Link Cited by: §1, §2, §4.2.
  • Z. Lin, R. Liu, and Z. Su (2011) Linearized alternating direction method with adaptive penalty for low-rank representation. In Advances in neural information processing systems, pp. 612–620. Cited by: §3.1.1, §3.1.
  • G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma (2012) Robust recovery of subspace structures by low-rank representation. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 171–184. External Links: Link Cited by: §1.
  • G. Liu and S. Yan (2011) Latent low-rank representation for subspace segmentation and feature extraction. In 2011 International Conference on Computer Vision, pp. 1615–1622. External Links: Link Cited by: §1.
  • P. Melin and O. Castillo (2005) Hybrid intelligent systems for pattern recognition using soft computing: an evolutionary approach for neural networks and fuzzy systems. Vol. 172, Springer Science & Business Media. Cited by: §1, §2.
  • J. Nocedal and S. Wright (2006) Numerical optimization. Springer Science & Business Media. Cited by: §3.1.4.
  • S. T. Roweis and L. K. Saul (2000) Nonlinear dimensionality reduction by locally linear embedding. science 290 (5500), pp. 2323–2326. External Links: Link Cited by: §2.
  • H. Tang, J. Zhu, Q. Zheng, S. Pang, Z. Li, and W. Jun (2020) Label enhancement with sample correlations via low-rank representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §1.
  • G. Tsoumakas and I. Katakis (2007) Multi-label classification: an overview. International Journal of Data Warehousing and Mining (IJDWM) 3 (3), pp. 1–13. External Links: Link Cited by: §1.
  • N. Xu, Y. Liu, and X. Geng (2019a) Label enhancement for label distribution learning. IEEE Transactions on Knowledge and Data Engineering. External Links: Link Cited by: §1, §1, §2, §4.2.
  • N. Xu, J. Lv, and X. Geng (2019b) Partial label learning via label enhancement. In AAAI Conference on Artificial Intelligence, External Links: Link Cited by: §1.
  • N. Xu, A. Tao, and X. Geng (2018) Label enhancement for label distribution learning.. In IJCAI, pp. 2926–2932. External Links: Link Cited by: §1, §1, §2, §4.1, §4.1.
  • H. Yang, B. Lin, K. Chang, and C. Chen (2018) Joint estimation of age and expression by combining scattering and convolutional networks. ACM Trans. Multimedia Comput. Commun. Appl. 14 (1). External Links: ISSN 1551-6857, Link, Document Cited by: §1.
  • M. Yin, J. Gao, and Z. Lin (2015) Laplacian regularized low-rank representation and its applications. IEEE transactions on pattern analysis and machine intelligence 38 (3), pp. 504–517. External Links: Link Cited by: §1.
  • H. Zhang, B. Zhong, Q. Lei, J. Du, J. Peng, D. Chen, and X. Ke (2017) Sparse representation-based semi-supervised regression for people counting. ACM Trans. Multimedia Comput. Commun. Appl. 13 (4). External Links: ISSN 1551-6857, Link, Document Cited by: §1.
  • M. Zhang and K. Zhang (2010) Multi-label learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’10, New York, NY, USA, pp. 999–1008. External Links: ISBN 9781450300551, Link, Document Cited by: §1.
  • M. Zhang and Z. Zhou (2013) A review on multi-label learning algorithms. IEEE transactions on knowledge and data engineering 26 (8), pp. 1819–1837. External Links: Link Cited by: §1.
  • Z. Zhang, G. Ely, S. Aeron, N. Hao, and M. Kilmer (2014) Novel methods for multilinear data completion and de-noising based on tensor-svd. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3842–3849. External Links: Link Cited by: §3.2.
  • Q. Zheng, J. Zhu, Z. Li, S. Pang, J. Wang, and Y. Li (2020a) Feature concatenation multi-view subspace clustering. Neurocomputing 379C, pp. 89–102. External Links: Link Cited by: §1.
  • Q. Zheng, J. Zhu, Z. Tian, Z. Li, S. Pang, and X. Jia (2020b) Constrained bilinear factorization multi-view subspace clustering. Knowledge Based Systems, pp. 105514. External Links: Link Cited by: §1.
  • X. Zheng, X. Jia, and W. Li (2018) Label distribution learning by exploiting sample correlations locally. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
  • Y. Zhou, H. Xue, and X. Geng (2015) Emotion distribution recognition from facial expressions. In Proceedings of the 23rd ACM International Conference on Multimedia, MM ’15, New York, NY, USA, pp. 1247–1250. External Links: ISBN 9781450334594, Link, Document Cited by: §1.
  • X. Zhu and A. B. Goldberg (2009) Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 3 (1), pp. 1–130. Cited by: §1, §2.