Bidirectional Loss Function for Label Enhancement and Distribution Learning

07/07/2020 ∙ by Xinyuan Liu, et al. ∙ Xi'an Jiaotong University

Label distribution learning (LDL) is an interpretable and general learning paradigm that has been applied in many real-world applications. In contrast to the simple logical vector in single-label learning (SLL) and multi-label learning (MLL), LDL assigns labels with a description degree to each instance. In practice, two challenges exist in LDL, namely, how to address the dimensional gap problem during the learning process of LDL and how to exactly recover label distributions from existing logical labels, i.e., Label Enhancement (LE). Most existing LDL and LE algorithms ignore the fact that the dimension of the input matrix is much higher than that of the output one, so their unidirectional projection typically acts as a dimensionality reduction, and valuable information hidden in the feature space is lost during the mapping process. To this end, this study considers a bidirectional projection function that can be applied in LE and LDL problems simultaneously. More specifically, this novel loss function not only considers the mapping errors generated from the projection of the input space into the output one but also accounts for the reconstruction errors generated from the projection of the output space back to the input one. Because this loss function aims to reconstruct the input data from the output data, it is expected to obtain more accurate results. Finally, experiments on several real-world datasets are carried out to demonstrate the superiority of the proposed method for both LE and LDL.




1 Introduction

Learning with ambiguity has become one of the most prevalent research topics. The traditional way to solve machine learning problems is based on single-label learning (SLL) and multi-label learning (MLL) Tsoumakas and Katakis (2007); Xu et al. (2016). In the SLL framework, an instance is always assigned a single label, whereas in MLL an instance may be associated with several labels. The existing learning paradigms of SLL and MLL are mostly based on the so-called problem transformation. However, neither SLL nor MLL addresses the question “to what degree does a label describe its corresponding instance,” i.e., the labels differ in their importance for describing the instance. It is more appropriate for the importance of candidate labels to differ rather than be exactly equal. Taking this problem into account, a novel learning paradigm called label distribution learning (LDL) Geng and Ji (2013) was proposed. Compared with SLL and MLL, LDL labels an instance with a real-valued vector that consists of the description degree of every possible label for the current instance. A detailed comparison is visualized in Fig. 1. LDL can thus be regarded as a more comprehensive form of MLL and SLL. However, the annotated training sets required by LDL are extremely scarce owing to the heavy burden of manual annotation. Because it is difficult to obtain annotated label distributions directly, a process called label enhancement (LE) Xu et al. (2018) was also proposed to recover label distributions from logical labels. With an LE algorithm, the logical labels of a conventional MLL dataset can be recovered into label distribution vectors by mining the topological information of the input space and the label correlation He et al. (2019).

Figure 1: Visualized comparison among SLL, MLL and LDL

Many relevant algorithms of LDL and LE have been proposed in recent years. These algorithms have progressively boosted the performance of specific tasks. For instance, LDL is widely applied in facial age estimation. Geng et al. Geng et al. (2014) proposed a specialized LDL framework that combines the maximum entropy model Berger et al. (1996) with IIS optimization, namely IIS-LDL. This approach not only achieves better performance than other traditional machine learning algorithms but has also become the foundation of the LDL framework. In other work, Yang et al. Fan et al. (2017) attempted to take into account both facial geometric and convolutional features, which remarkably improved efficiency and accuracy. As mentioned above, the difficulty of acquiring labeled datasets restricts the development of LDL algorithms. After presenting several LE algorithms, Xu et al. Xu et al. (2019b) adapted LDL to partial label learning (PLL) with label distributions recovered via LE. Although these methods have achieved significant performance, one problem yet to be solved is that they suffer from a loss of discriminative information, which is caused by the dimensional gap between the input data matrix and the output one. These existing methods may therefore miss essential information that should be inherited from the original input space, thereby degrading the performance.

As discussed above, the critical point of previous works on LDL and LE is to establish a suitable loss function to fit label distribution data. In previous works, only a unidirectional projection between input and output space is learned. In this paper, we present a bi-directional loss function with a comprehensive reconstruction constraint. Such function can be applied in both LDL and LE to maintain the latent information. Inspired by the auto-encoder paradigm Kodirov and Gong (2017); Cheng et al. (2019), our proposed method builds the reconstruction projection with the mapping projection to preserve the otherwise lost information. More precisely, optimizing the original loss is the mapping step, while minimizing the reconstruction error is the reconstruction step. In contrast to previous loss functions, the proposed loss function aims to potentially reconstruct the input data from the output data. Therefore, it is expected to obtain more accurate results than other related loss functions for both LE and LDL problems. Adequate experiments on several well-known datasets demonstrate that the proposed loss function achieves superior performance.

The main contributions of this work are as follows:

  1. the reconstruction projection from label space to instance space is considered for the first time in the LDL and LE paradigms;

  2. a bi-directional loss function that combines mapping error and reconstruction error is proposed;

  3. the proposed method can be used not only in LDL but also for LE.

We organize the rest of this paper as follows. First, related work on LDL and LE methods is reviewed in Section 2. Second, the formulation of LE as well as LDL and the proposed methods, i.e., BD-LE and BD-LDL, are introduced in Section 3. After that, the results of the comparison and ablation experiments are shown in Section 4, where the influence of parameters is also discussed. Finally, conclusions and future work are summarized in Section 5.

2 Related work

In this section, we briefly summarize the related work about LDL and LE methods.

2.1 Label Distribution Learning

The proposed LDL methods mainly focus on three aspects, namely the model assumption, the loss function, and the optimization algorithm. The maximum entropy model Berger et al. (1996) is widely used to represent the label distribution in the LDL paradigm Xu et al. (2019c); Ren et al. (2019b). The maximum entropy model naturally agrees with the character of the description degree in the LDL model. However, such an exponential model is sometimes not expressive enough to fit a complex distribution. To overcome this issue, Geng et al. Xing et al. (2016) proposed an LDL family based on a boosting algorithm to extend the traditional LDL model. Inspired by the M-SVR algorithm, LDSVR Geng and Hou (2015) is designed for the movie opinion prediction task. Furthermore, CPNN Geng et al. (2013) combines a neural network with the LDL paradigm to improve the effectiveness of facial age estimation applications. Moreover, recent work Ren et al. (2019a); Xu and Zhou (2017) has shown that a linear model is also able to achieve relatively strong representation ability and satisfying results. As reviewed above, most existing methods build the mapping from feature space to label space in a unidirectional way, so it is appropriate to take the bi-directional constraint into consideration.

Concerning the loss function, LDL aims at learning a model that predicts distributions for unseen instances that are similar to the true ones. A criterion that measures the distance between two distributions, such as the Kullback–Leibler (K-L) divergence, is commonly chosen as the loss function Jia et al. (2018); Geng et al. (2010). Owing to the asymmetry of the K-L divergence, Jeffrey’s divergence is used in Zhou et al. (2015) to build an LDL model for facial emotion recognition. For the sake of easier computation, it is reasonable to adopt the Euclidean distance in a variety of tasks, e.g., facial emotion recognition Jia et al. (2019b).
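The asymmetry of the K-L divergence is easy to check numerically. The following is a minimal sketch, assuming strictly positive distributions and taking Jeffrey's divergence to be the symmetrized sum KL(p‖q) + KL(q‖p); the function names are illustrative and are not taken from any of the cited implementations.

```python
import numpy as np


def kl_divergence(p, q):
    """KL(p || q) for strictly positive distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))


def jeffreys_divergence(p, q):
    """Symmetrized divergence: KL(p || q) + KL(q || p) = sum (p - q) ln(p / q)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((p - q) * np.log(p / q)))


p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])
forward = kl_divergence(p, q)    # differs from the reverse direction
backward = kl_divergence(q, p)
symmetric = jeffreys_divergence(p, q)
```

Swapping the arguments changes the K-L value but leaves the symmetrized value unchanged, which is the property that motivates the symmetric alternative.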

Regarding the optimization method, SA-IIS Geng (2016) utilizes the improved iterative scaling (IIS) method, whose performance is often worse Malouf (2002) than that of other optimization methods. By leveraging the L-BFGS Nocedal and Wright (2006) optimization method, SA-BFGS Geng (2016) and EDL Zhou et al. (2015) maintain a balance between efficiency and accuracy. As the proposed models become more complex, more than one group of parameters has to be optimized. Therefore, it is more appropriate to introduce the alternating direction method of multipliers (ADMM) Boyd et al. (2011) when the loss function incorporates additional inequality and equality constraints. In addition, exploiting the correlation among labels or samples can further boost the performance of LDL models. Jia et al. Jia et al. (2018) proposed LDLLC, which takes the global label correlation into account by introducing Pearson’s correlation between labels. It is pointed out in LDL-SCL Zheng et al. (2018) and EDL-LRL Jia et al. (2019b) that some correlations among labels (or samples) exist only within a subset of instances, the so-called local correlation. Intuitively, the instances in the same group after clustering share the same local correlation.

Moreover, it is common that the labeled data are incomplete and contaminated Ma et al. (2017). For the former condition, Xu et al. Xu and Zhou (2017) put forward IncomLDL-a and IncomLDL-p on the assumption that the recovered complete label distribution matrix is low-rank. Proximal Gradient Descent (PGD) and ADMM are used for the optimization of the two methods, respectively. The time complexity of the first one is , and the last one is but good at the accuracy. Jia et al. Jia et al. (2019a) proposed WSLDL-MCSC, which is based on matrix completion and on exploring the relevance among samples in a transductive way when the data are weakly supervised.

2.2 Label Enhancement Learning

To the best of our knowledge, only a few studies focus on label enhancement learning Xu et al. (2018). Five effective strategies have been devised so far, four of which are adaptive algorithms. As discussed in Geng (2016), the concept of membership used in fuzzy clustering Jiang et al. (2006) is similar to label distribution. Although the two carry different semantics, both are in numerical format. Thus, FCM El Gayar et al. (2006) extends the calculation of membership used in fuzzy C-means clustering Melin and Castillo (2005) to recover the label distribution. The LE algorithm based on the kernel method (KM) Jiang et al. (2006) utilizes a kernel function to project the instances from the original space into a high-dimensional one. For every candidate label, the instances are separated into two groups according to whether the corresponding logical label is 1 or not. Then the label distribution term, i.e., the description degree, can be calculated based on the distance between the instances and the centers of the groups. The label propagation technique Wang and Zhang (2007) is used in the LP method to update the label distribution matrix iteratively over a fully-connected graph. Since information is shared and passed between samples on the basis of the connection graph, the logical labels can be enhanced into distribution-level labels. The LE method adapted from manifold learning (ML) Hou et al. (2016) takes the topological consistency between feature space and label space into consideration to obtain the recovered label distribution. The last strategy, GLLE Xu et al. (2019a), is a specialized algorithm that leverages the topological information of the input space and the correlation among labels. Meanwhile, the local label correlation is captured via clustering Zhou et al. (2012).

3 Proposed Method

Let $\mathcal{X} = \mathbb{R}^{m}$ denote the $m$-dimensional input space and $\mathcal{Y} = \{y_1, y_2, \ldots, y_c\}$ represent the complete set of labels, where $c$ is the number of all possible labels. For each instance $x_i \in \mathcal{X}$, a simple logical label vector $l_i \in \{0,1\}^{c}$ is leveraged to represent which labels can describe the instance correctly. Specifically, for the LDL paradigm, instance $x_i$ is assigned a distribution-level vector $d_i$.

Notations Description
$n$ the number of instances
$c$ number of labels
$m$ dimension of samples
$X$ instance feature matrix
$L$ logical label matrix
$D$ label distribution matrix
$W$ Mapping parameter of BD-LE
$\tilde{W}$ Reconstruction parameter of BD-LE
$\Theta$ Mapping parameter of BD-LDL
$\tilde{\Theta}$ Reconstruction parameter of BD-LDL
Table 1: Summary of some notations

3.1 Bi-directional for Label Enhancement

Given a dataset $S = \{(x_i, l_i)\}_{i=1}^{n}$, $X$ and $L = [l_1, \ldots, l_n]$ are defined as the input matrix and the logical label matrix, respectively. According to the previous discussion, the goal of LE is to transform $L$ into the label distribution matrix $D = [d_1, \ldots, d_n]$.

First, a nonlinear function $\varphi(\cdot)$, i.e., a kernel function, is defined to transform each instance $x_i$ into a higher-dimensional feature $\varphi(x_i)$, which can be utilized to construct the vector $\phi_i$ of the corresponding instance. For each instance, an appropriate mapping parameter $W$ is required to transform the input feature $\phi_i$ into the label distribution $d_i = W\phi_i$. As there is a large dimensional gap between the input space and the output space, a lot of information may be lost during the mapping process. To address this issue, it is reasonable to introduce a parameter $\tilde{W}$ for the reconstruction of the input data from the output data. Accordingly, the objective function of LE is formulated as follows:

$$\min_{W, \tilde{W}} \; \mathcal{L}(W) + \alpha\, \mathcal{R}(\tilde{W}) + \lambda\, \Omega(W) \tag{1}$$

where $\mathcal{L}$ denotes the loss function of data mapping, $\mathcal{R}$ indicates the loss function of data reconstruction, $\Omega$ is the regularization term, and $\alpha$ and $\lambda$ are two trade-off parameters. It should be noted that the LE algorithm is regarded as a pre-processing step of LDL methods and does not suffer from the over-fitting problem. Accordingly, it is not necessary to add the norms of the parameters $W$ and $\tilde{W}$ as regularizers.

The first term $\mathcal{L}$ is the mapping loss function, which measures the distance between the logical labels and the recovered label distributions. According to Xu et al. (2018), it is reasonable to select the least squares (LS) function:

$$\mathcal{L}(W) = \sum_{i=1}^{n} \|W\phi_i - l_i\|^2 = \mathrm{tr}\left[(W\Phi - L)^\top (W\Phi - L)\right] \tag{2}$$

where $\Phi = [\phi_1, \ldots, \phi_n]$ and $\mathrm{tr}(\cdot)$ is the trace of a matrix, defined as the sum of its diagonal elements. The second term $\mathcal{R}$ is the reconstruction loss function, which measures the similarity between the input feature data and the data reconstructed from the output of LE. Similar to the mapping loss function, the reconstruction loss function is defined as follows:

$$\mathcal{R}(\tilde{W}) = \sum_{i=1}^{n} \|\phi_i - \tilde{W} l_i\|^2 = \mathrm{tr}\left[(\Phi - \tilde{W}L)^\top (\Phi - \tilde{W}L)\right] \tag{3}$$

To further simplify the model, it is reasonable to consider tied weights Boureau et al. (2008) as follows:

$$\tilde{W}^{*} = W^\top \tag{4}$$

where $\tilde{W}^{*}$ is the best reconstruction parameter to be obtained. Then Eq. (1) is rewritten as:

$$\min_{W} \; \mathcal{L}(W) + \alpha\, \mathcal{R}(W^\top) + \lambda\, \Omega(W) \tag{5}$$


To obtain the desired results, a manifold regularization term is designed to capture the topological consistency between the feature space and the label space, which can fully exploit the hidden label importance from the input instances. Before presenting this term, it is required to introduce the similarity matrix $A$, whose element is defined as:

$$a_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right), & x_j \in N(i) \\ 0, & \text{otherwise} \end{cases} \tag{6}$$

where $N(i)$ denotes the set of $K$-nearest neighbors of the instance $x_i$, and $\sigma$ is a hyper-parameter fixed to 1 in this paper. Inspired by the smoothness assumption Zhu et al. (2005), the more correlated two instances are, the closer their recovered label distributions are, and vice versa. Accordingly, it is reasonable to design the following manifold regularization:

$$\Omega(W) = \frac{1}{2}\sum_{i,j} a_{ij}\,\|d_i - d_j\|^2 = \mathrm{tr}\left(D G D^\top\right) \tag{7}$$

where $d_i = W\phi_i$ indicates the recovered label distribution, $D = [d_1, \ldots, d_n]$, and $G = \hat{A} - A$ is the Laplacian matrix. Note that the similarity matrix $A$ is asymmetric, so the elements of the diagonal matrix $\hat{A}$ are defined as $\hat{a}_{ii} = \sum_{j=1}^{n} (a_{ij} + a_{ji})/2$.
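The graph construction described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: instances are stored as rows, the similarity is a Gaussian weight exp(-‖x_i − x_j‖² / (2σ²)) restricted to the K nearest neighbors, and the degree of node i is taken as the average of its row and column sums because the K-nearest-neighbor relation is not symmetric. The function names are hypothetical.

```python
import numpy as np


def knn_similarity(X, k, sigma=1.0):
    """Gaussian K-nearest-neighbor similarity; X has shape (n, m), rows are instances."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared Euclidean distances
    np.maximum(dist2, 0.0, out=dist2)                    # clip tiny negative round-off
    np.fill_diagonal(dist2, np.inf)                      # an instance is not its own neighbor
    neighbors = np.argsort(dist2, axis=1)[:, :k]         # indices of the k closest instances
    rows = np.repeat(np.arange(n), k)
    cols = neighbors.ravel()
    A = np.zeros((n, n))
    A[rows, cols] = np.exp(-dist2[rows, cols] / (2.0 * sigma ** 2))
    return A                                             # generally asymmetric


def graph_laplacian(A):
    """Laplacian G = A_hat - A with degree (row sum + column sum) / 2."""
    degree = 0.5 * (A.sum(axis=1) + A.sum(axis=0))
    return np.diag(degree) - A
```

With label distributions stored column-wise in a matrix `D` of shape (c, n), `np.trace(D @ G @ D.T)` then equals half of the weighted sum of squared differences between all pairs of columns, which is the smoothness penalty.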

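By substituting Eqs. (2), (3) and (7) into Eq. (5), the mapping and reconstruction loss function is defined on the parameter $W$ as follows:

$$T(W) = \mathrm{tr}\left[(W\Phi - L)^\top (W\Phi - L)\right] + \alpha\, \mathrm{tr}\left[(\Phi - W^\top L)^\top (\Phi - W^\top L)\right] + \lambda\, \mathrm{tr}\left(W\Phi G \Phi^\top W^\top\right) \tag{8}$$

Eq. (8) can be easily optimized by the well-known limited-memory quasi-Newton method (L-BFGS) Yuan (1991). This method achieves the optimization by calculating the first-order gradient of $T(W)$:

$$\frac{\partial T}{\partial W} = 2W\Phi\Phi^\top - 2L\Phi^\top + 2\alpha\left(LL^\top W - L\Phi^\top\right) + \lambda\, W\Phi\left(G + G^\top\right)\Phi^\top \tag{9}$$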
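To make the optimization step concrete, here is a minimal sketch of a tied-weight bi-directional LE objective and its gradient, minimized with SciPy's L-BFGS-B routine. It rests on assumptions that should be checked against the original formulas: the mapping loss is ‖WΦ − L‖²_F, the reconstruction loss is ‖Φ − WᵀL‖²_F, the manifold term is tr(WΦGΦᵀWᵀ), and Φ (p × n) and L (c × n) store instances column-wise. The names `bd_le_objective` and `fit_bd_le` are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize


def bd_le_objective(w_flat, Phi, L, G, alpha, lam):
    """Return (value, gradient) of the assumed bi-directional LE objective."""
    c, p = L.shape[0], Phi.shape[0]
    W = w_flat.reshape(c, p)
    D = W @ Phi                       # recovered label distributions, shape (c, n)
    M = D - L                         # mapping residual
    R = Phi - W.T @ L                 # reconstruction residual with tied weights
    value = np.sum(M ** 2) + alpha * np.sum(R ** 2) + lam * np.trace(D @ G @ D.T)
    grad = (2.0 * M @ Phi.T           # gradient of the mapping term
            - 2.0 * alpha * L @ R.T   # gradient of the reconstruction term
            + lam * D @ (G + G.T) @ Phi.T)  # gradient of the manifold term
    return value, grad.ravel()


def fit_bd_le(Phi, L, G, alpha=1.0, lam=0.01):
    """Minimize the objective from a zero start; jac=True means fun returns (f, g)."""
    c, p = L.shape[0], Phi.shape[0]
    result = minimize(bd_le_objective, np.zeros(c * p),
                      args=(Phi, L, G, alpha, lam), jac=True, method="L-BFGS-B")
    return result.x.reshape(c, p)
```

The sketch uses the raw features as Φ, i.e., a linear feature map, purely for brevity; a kernel feature map would replace `Phi` without changing the rest. The default values of `alpha` and `lam` are placeholders, not the settings used in the experiments.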


3.2 Bi-directional for Label Distribution Learning

Given a dataset $S = \{(x_i, d_i)\}_{i=1}^{n}$ whose labels are in real-valued format, LDL aims to build a mapping function from the instances to the label distributions, where $x_i$ denotes the $i$-th instance and $d_i = (d_{x_i}^{y_1}, \ldots, d_{x_i}^{y_c})$ indicates the label distribution of the $i$-th instance. Note that $d_{x_i}^{y_j}$ accounts for the description degree of $y_j$ to $x_i$ rather than the probability that label $y_j$ tags $x_i$ correctly. All the labels together describe each instance completely, so it is reasonable that $d_{x_i}^{y_j} \in [0, 1]$ and $\sum_{j=1}^{c} d_{x_i}^{y_j} = 1$.

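As mentioned before, most LDL methods suffer from mapping information loss due to the unidirectional projection of the loss function. Bidirectional projections, in contrast, can largely preserve the information of the input matrix. Accordingly, the goal of our BD-LDL algorithm is to determine a mapping parameter $\Theta$ and a reconstruction parameter $\tilde{\Theta}$ from the training set so as to make the predicted label distribution and the true one as similar as possible. Therefore, the new loss function integrates the mapping error with the reconstruction error as follows:

$$\min_{\Theta, \tilde{\Theta}} \; \mathcal{L}(\Theta) + \lambda_1\, \Omega(\Theta) + \lambda_2\, \mathcal{R}(\tilde{\Theta}) \tag{10}$$

where $\Theta$ denotes the mapping parameter, $\tilde{\Theta}$ indicates the reconstruction parameter, $\Omega$ is a regularization term that controls the complexity of the output model to avoid over-fitting, and $\lambda_1$ and $\lambda_2$ are two parameters that balance these terms.

There are various candidate functions to measure the difference between two distributions, such as the Euclidean distance, the Kullback-Leibler (K-L) divergence and the Clark distance. Here, we choose the Euclidean distance:

$$\mathcal{L}(\Theta) = \|X\Theta - D\|_F^2 \tag{11}$$

where $X \in \mathbb{R}^{n \times m}$ and $D \in \mathbb{R}^{n \times c}$ stack the instances and their label distributions row-wise, $\Theta \in \mathbb{R}^{m \times c}$ is the mapping parameter to be optimized, and $\|\cdot\|_F$ is the Frobenius norm of a matrix. For simplification, it is reasonable to consider tied weights Boureau et al. (2008) as follows:

$$\tilde{\Theta}^{*} = \Theta^\top \tag{12}$$

Similarly, the objective function is simplified as follows:

$$\min_{\Theta} \; \mathcal{L}(\Theta) + \lambda_1\, \Omega(\Theta) + \lambda_2\, \|X - D\Theta^\top\|_F^2 \tag{13}$$

where the term $\|X - D\Theta^\top\|_F^2$ denotes the simplified reconstruction error. As for the second term in the objective function, we adopt the F-norm to implement it:

$$\Omega(\Theta) = \|\Theta\|_F^2 \tag{14}$$

Substituting Eqs. (11) and (14) into Eq. (13) yields the objective function:

$$\min_{\Theta} \; \|X\Theta - D\|_F^2 + \lambda_1 \|\Theta\|_F^2 + \lambda_2 \|X - D\Theta^\top\|_F^2 \tag{15}$$

Before optimization, the trace properties $\mathrm{tr}(M) = \mathrm{tr}(M^\top)$ and $\mathrm{tr}(M^\top N) = \mathrm{tr}(N^\top M)$ are applied to re-organize the objective function:

$$\min_{\Theta} \; \|X\Theta - D\|_F^2 + \lambda_1 \|\Theta\|_F^2 + \lambda_2 \|X^\top - \Theta D^\top\|_F^2 \tag{16}$$

Then, for optimization, we can simply take the derivative of Eq. (16) with respect to the parameter $\Theta$ and set it to zero:

$$X^\top (X\Theta - D) + \lambda_1 \Theta - \lambda_2 (X^\top - \Theta D^\top) D = 0 \tag{17}$$

Eq. (17) can be transformed into the following equivalent formulation:

$$(X^\top X + \lambda_1 I)\,\Theta + \lambda_2\, \Theta D^\top D = (1 + \lambda_2)\, X^\top D \tag{18}$$

Denoting $A = X^\top X + \lambda_1 I$, $B = \lambda_2 D^\top D$, and $C = (1 + \lambda_2) X^\top D$, Eq. (18) can be rewritten as the following formulation:

$$A\Theta + \Theta B = C \tag{19}$$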

Input: $X$: training feature matrix; $D$: labeled distribution matrix;
Output: $\Theta$: projection parameter
Initialize $\Theta$, $\lambda_1$, and $\lambda_2$;
repeat
     Compute $A$, $B$ and $C$ in Eq. (19)
     Perform Cholesky factorization to gain $P$ and $Q$
     Perform SVD on $P$ and $Q$
     Update $\hat{\Theta}$ via Eqs. (24) and (25)
     Update $\Theta$ via Eq. (26)
until Stopping criterion is satisfied
Algorithm 1 BD-LDL Algorithm

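Although Eq. (19) is the well-known Sylvester equation, which can be solved by existing routines in MATLAB, the computational cost of the corresponding solution is not ideal. Thus, following Zhu et al. (2017), we solve Eq. (19) efficiently with the Cholesky factorization Golub and Loan (1996) as well as the singular value decomposition (SVD). First, the two positive semi-definite matrices $A$ and $B$ can be factorized as:

$$A = P^\top P, \qquad B = Q Q^\top \tag{20}$$

where $P$ and $Q$ are triangular matrices, which can be further decomposed via SVD as:

$$P = U_1 \Sigma_1 V_1^\top, \qquad Q = U_2 \Sigma_2 V_2^\top \tag{21}$$

Substituting Eqs. (20) and (21) into Eq. (19) yields:

$$V_1 \Sigma_1^\top U_1^\top U_1 \Sigma_1 V_1^\top \Theta + \Theta\, U_2 \Sigma_2 V_2^\top V_2 \Sigma_2^\top U_2^\top = C \tag{22}$$

Since $U_1$, $V_1$, $U_2$, and $V_2$ are unitary matrices, Eq. (22) can be rewritten as:

$$V_1 \Sigma_1^\top \Sigma_1 V_1^\top \Theta + \Theta\, U_2 \Sigma_2 \Sigma_2^\top U_2^\top = C \tag{23}$$

Multiplying both sides of Eq. (23) by $V_1^\top$ on the left and by $U_2$ on the right gives the following equation:

$$\tilde{\Sigma}_1 \hat{\Theta} + \hat{\Theta} \tilde{\Sigma}_2 = E \tag{24}$$

where $\tilde{\Sigma}_1 = \Sigma_1^\top \Sigma_1$ and $\tilde{\Sigma}_2 = \Sigma_2 \Sigma_2^\top$, $\hat{\Theta} = V_1^\top \Theta U_2$, and $E = V_1^\top C U_2$.

Since both $\tilde{\Sigma}_1$ and $\tilde{\Sigma}_2$ are diagonal matrices, we can directly obtain $\hat{\Theta}$, whose element is defined as:

$$\hat{\theta}_{ij} = \frac{e_{ij}}{\tilde{\sigma}^{(1)}_{i} + \tilde{\sigma}^{(2)}_{j}} \tag{25}$$

where $\tilde{\sigma}^{(1)}_{i}$ and $\tilde{\sigma}^{(2)}_{j}$, the diagonal elements of $\tilde{\Sigma}_1$ and $\tilde{\Sigma}_2$, can be calculated from the eigenvalues of $A$ and $B$, respectively, and $e_{ij}$ is the $(i,j)$-th element of the matrix $E$. Accordingly, $\Theta$ can be obtained by:

$$\Theta = V_1 \hat{\Theta}\, U_2^\top \tag{26}$$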


We briefly summarize the procedure of the proposed BD-LDL in Algorithm 1.
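The closed-form step can be sketched as follows. This is a sketch under assumptions rather than the authors' implementation: the objective is taken to be ‖XΘ − D‖²_F + λ₁‖Θ‖²_F + λ₂‖X − DΘᵀ‖²_F with X of shape (n, m) and D of shape (n, c), which leads to a Sylvester equation AΘ + ΘB = C with A = XᵀX + λ₁I, B = λ₂DᵀD and C = (1 + λ₂)XᵀD. Instead of Cholesky factorization followed by SVD, the sketch diagonalizes the symmetric matrices A and B directly with `numpy.linalg.eigh`, which produces the same diagonalized system; `scipy.linalg.solve_sylvester` is included as a reference solver. The function names and default parameter values are hypothetical.

```python
import numpy as np
from scipy.linalg import solve_sylvester


def sylvester_terms(X, D, lam1, lam2):
    """Build A, B, C of the assumed Sylvester equation A @ Theta + Theta @ B = C."""
    m = X.shape[1]
    A = X.T @ X + lam1 * np.eye(m)
    B = lam2 * (D.T @ D)
    C = (1.0 + lam2) * (X.T @ D)
    return A, B, C


def bd_ldl_fit(X, D, lam1=1e-3, lam2=1e-2):
    """Solve the Sylvester equation by diagonalizing A and B (requires lam1 > 0)."""
    A, B, C = sylvester_terms(X, D, lam1, lam2)
    sa, Va = np.linalg.eigh(A)                     # A = Va diag(sa) Va^T, sa > 0
    sb, Vb = np.linalg.eigh(B)                     # B = Vb diag(sb) Vb^T, sb >= 0
    E = Va.T @ C @ Vb                              # right-hand side in the rotated basis
    Theta_hat = E / (sa[:, None] + sb[None, :])    # element-wise solve
    return Va @ Theta_hat @ Vb.T                   # rotate back


def bd_ldl_fit_reference(X, D, lam1=1e-3, lam2=1e-2):
    """Same solution via SciPy's general Sylvester solver (solves AX + XB = Q)."""
    A, B, C = sylvester_terms(X, D, lam1, lam2)
    return solve_sylvester(A, B, C)
```

Predictions for new instances would then be `X_new @ Theta`; any post-processing that forces the rows to be valid distributions is outside the scope of this sketch.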

Index Data Set # Examples # Features # Labels
1 Yeast-alpha 2,465 24 18
2 Yeast-cdc 2,465 24 15
3 Yeast-cold 2,465 24 4
4 Yeast-diau 2,465 24 7
5 Yeast-dtt 2,465 24 4
6 Yeast-elu 2,465 24 14
7 Yeast-heat 2,465 24 6
8 Yeast-spo 2,465 24 6
9 Yeast-spo5 2,465 24 3
10 Yeast-spoem 2,465 24 2
11 Natural Scene 2,000 294 9
12 Movie 7,755 1,869 5
13 SBU_3DFE 2,500 243 6
Table 2: Statistics of 13 datasets used in comparison experiment

4 Experiments

4.1 Datasets and Measurement

We conducted extensive experiments on 13 real-world datasets collected from biological experiments Eisen et al. (1998), facial expression images Lyons et al. (1998), natural scene images, and movies. The outputs of both LE and LDL are in the format of label distribution vectors. In contrast to the results of SLL and MLL, label distribution vectors should be evaluated with diverse measurements. We select six criteria that are most commonly used, i.e., Chebyshev distance (Chebyshev), Clark distance (Clark), Canberra metric (Canberra), Kullback–Leibler divergence (K-L), Cosine coefficient (Cosine), and Intersection similarity (Intersec). The first four functions measure the distance between the ground-truth label distribution $d$ and the predicted one $\hat{d}$, whereas the last two are similarity measurements. The specifications of the criteria and the data sets used can be found in Tables 3 and 2, respectively.
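For reference, the six criteria can be computed as in the sketch below, which uses their standard definitions from the LDL literature. The small constant `eps` that guards against division by zero and log of zero is an implementation assumption, and the function name is illustrative.

```python
import numpy as np


def ldl_metrics(d_true, d_pred, eps=1e-12):
    """Four distances (smaller is better) and two similarities (larger is better)."""
    d = np.asarray(d_true, dtype=float)
    p = np.asarray(d_pred, dtype=float)
    diff = np.abs(d - p)
    total = d + p + eps
    return {
        "chebyshev": diff.max(),
        "clark": np.sqrt(np.sum(diff ** 2 / total ** 2)),
        "canberra": np.sum(diff / total),
        "kl": np.sum(d * np.log((d + eps) / (p + eps))),
        "cosine": np.dot(d, p) / (np.linalg.norm(d) * np.linalg.norm(p) + eps),
        "intersection": np.sum(np.minimum(d, p)),
    }


scores = ldl_metrics([0.2, 0.3, 0.5], [0.3, 0.3, 0.4])
```

For this toy pair, the Chebyshev distance is the largest per-label gap (0.1) and the Intersection similarity is the sum of the per-label minima (0.9).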

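Type Name Definition
Distance Chebyshev ↓ $\max_j |d_j - \hat{d}_j|$
Distance Clark ↓ $\sqrt{\sum_{j=1}^{c} (d_j - \hat{d}_j)^2 / (d_j + \hat{d}_j)^2}$
Distance Canberra ↓ $\sum_{j=1}^{c} |d_j - \hat{d}_j| / (d_j + \hat{d}_j)$
Distance K-L ↓ $\sum_{j=1}^{c} d_j \ln (d_j / \hat{d}_j)$
Similarity Cosine ↑ $\sum_{j=1}^{c} d_j \hat{d}_j \big/ \left(\sqrt{\sum_{j=1}^{c} d_j^2}\, \sqrt{\sum_{j=1}^{c} \hat{d}_j^2}\right)$
Similarity Intersection ↑ $\sum_{j=1}^{c} \min(d_j, \hat{d}_j)$
Table 3: Evaluation Measurements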
Datasets Ours FCM KM LP ML GLLE
Yeast-alpha 0.0208(1) 0.0426(4) 0.0588(6) 0.0401(3) 0.0553(5) 0.0310(2)
Yeast-cdc 0.0231(1) 0.0513(4) 0.0729(6) 0.0421(3) 0.0673(5) 0.0325(2)
Yeast-cold 0.0690(1) 0.1325(4) 0.2522(6) 0.1129(3) 0.2480(5) 0.0903(2)
Yeast-diau 0.0580(1) 0.1248(4) 0.2500(6) 0.0904(3) 0.1330(5) 0.0789(2)
Yeast-dtt 0.0592(1) 0.0932(3) 0.2568(5) 0.1184(4) 0.2731(6) 0.0651(2)
Yeast-elu 0.0256(1) 0.0512(4) 0.0788(6) 0.0441(3) 0.0701(5) 0.0287(2)
Yeast-heat 0.0532(1) 0.1603(4) 0.1742(5) 0.0803(3) 0.1776(6) 0.0563(2)
Yeast-spo 0.0641(1) 0.1300(4) 0.1753(6) 0.0834(3) 0.1722(5) 0.0670(2)
Yeast-spo5 0.1017(2) 0.1622(4) 0.2773(6) 0.1142(3) 0.2730(5) 0.0980(1)
Yeast-spoem 0.0921(1) 0.2333(4) 0.4006(6) 0.1632(3) 0.3974(5) 0.1071(2)
Natural_Scene 0.3355(5) 0.3681(6) 0.3060(3) 0.2753(1) 0.2952(2) 0.3349(4)
Movie 0.1254(1) 0.2302(4) 0.2340(6) 0.1617(3) 0.2335(5) 0.1601(2)
SBU_3DFE 0.1285(1) 0.1356(3) 0.2348(6) 0.1293(2) 0.2331(5) 0.1412(4)
Avg. Rank 1.38 4.00 5.62 2.84 4.92 2.23
Table 4: Comparison Performance (rank) of Different LE Algorithms Measured by Chebyshev ↓
Datasets Ours FCM KM LP ML GLLE
Yeast-alpha 0.9852(1) 0.9221(3) 0.8115(5) 0.9220(4) 0.7519(6) 0.9731(2)
Yeast-cdc 0.9857(1) 0.9236(3) 0.7541(6) 0.9162(4) 0.7591(5) 0.9597(2)
Yeast-cold 0.9804(1) 0.9220(4) 0.7789(6) 0.9251(3) 0.7836(5) 0.9690(2)
Yeast-diau 0.9710(1) 0.8901(4) 0.7990(6) 0.9153(3) 0.8032(5) 0.9397(2)
Yeast-dtt 0.9847(1) 0.9599(3) 0.7602(6) 0.9210(4) 0.7631(5) 0.9832(2)
Yeast-elu 0.9841(1) 0.9502(3) 0.7588(5) 0.9110(4) 0.7562(6) 0.9813(2)
Yeast-heat 0.9803(1) 0.8831(4) 0.7805(6) 0.9320(3) 0.7845(5) 0.9800(2)
Yeast-spo 0.9719(1) 0.9092(4) 0.8001(6) 0.9390(3) 0.8033(5) 0.9681(2)
Yeast-spo5 0.9697(2) 0.9216(4) 0.8820(6) 0.9694(3) 0.8841(5) 0.9713(1)
Yeast-spoem 0.9761(1) 0.8789(4) 0.8122(6) 0.9500(3) 0.8149(5) 0.9681(2)
Natural_Scene 0.7797(4) 0.5966(6) 0.7488(5) 0.8602(2) 0.8231(1) 0.7822(3)
Movie 0.9321(1) 0.7732(6) 0.8902(4) 0.9215(2) 0.8153(5) 0.9000(3)
SBU_3DFE 0.9233(1) 0.9117(3) 0.8126(6) 0.9203(2) 0.8150(5) 0.9000(4)
Avg. Rank 1.31 3.92 5.62 3.08 4.85 2.23
Table 5: Comparison Performance (rank) of Different LE Algorithms Measured by Cosine ↑
Yeast-alpha 0.2097±0.003(1) 1.1541±0.034(8) 0.7236±0.060(7) 0.3053±0.006(6) 0.2689±0.008(5) 0.2098±0.002(2) 0.2126±0.000(4) 0.2098±0.006(3)
Yeast-cdc 0.2017±0.004(1) 1.0601±0.066(8) 0.5728±0.030(7) 0.2932±0.004(6) 0.2477±0.007(5) 0.2137±0.004(3) 0.2046±2.080(2) 0.2163±0.004(4)
Yeast-cold 0.1355±0.004(1) 0.5149±0.024(8) 0.1552±0.005(7) 0.1643±0.004(8) 0.1471±0.004(5) 0.1388±0.003(2) 0.1442±2.100(4) 0.1415±0.004(3)
Yeast-diau 0.1960±0.006(1) 0.7487±0.042(8) 0.2677±0.010(7) 0.2409±0.006(6) 0.2201±0.002(5) 0.1986±0.002(2) 0.2011±0.003(4) 0.2010±0.006(3)
Yeast-dtt 0.0964±0.004(2) 0.4807±0.040(8) 0.1206±0.008(7) 0.1332±0.003(8) 0.1084±0.003(6) 0.0989±0.001(4) 0.0980±1.600(3) 0.0962±0.006(1)
Yeast-elu 0.1964±0.004(1) 1.0050±0.041(8) 0.5246±0.028(7) 0.2751±0.006(6) 0.2438±0.008(5) 0.2015±0.002(3) 0.2029±0.023(4) 0.1994±0.006(2)
Yeast-heat 0.1788±0.005(1) 0.6829±0.026(8) 0.2261±0.010(7) 0.2260±0.005(6) 0.1998±0.003(5) 0.1826±0.003(2) 0.1826±0.003(2) 0.1854±0.004(4)
Yeast-spo 0.2456±0.008(1) 0.6686±0.040(8) 0.2950±0.010(7) 0.2759±0.006(6) 0.2639±0.003(5) 0.2503±0.002(4) 0.2480±0.685(2) 0.2500±0.008(3)
Yeast-spo5 0.1785±0.007(1) 0.4220±0.020(8) 0.1870±0.005(3) 0.1944±0.009(6) 0.1962±0.001(7) 0.1881±0.004(4) 0.1915±0.020(5) 0.1837±0.007(2)
Yeast-spoem 0.1232±0.005(1) 0.3065±0.030(8) 0.1890±0.012(7) 0.1367±0.007(6) 0.1312±0.001(3) 0.1316±0.005(4) 0.1273±0.054(2) 0.1320±0.008(5)
Natural_Scene 2.3612±0.541(1) 2.5259±0.015(8) 2.4534±0.018(4) 2.4703±0.019(6) 2.4754±0.013(7) 2.4580±0.012(5) 2.4519±0.005(2) 2.4456±0.019(3)
Movie 0.5211±0.606(1) 0.8044±0.010(8) 0.6533±0.010(6) 0.5783±0.007(5) 0.5750±0.011(4) 0.5543±0.007(3) 0.6956±0.041(7) 0.5289±0.008(2)
SBU_3DFE 0.3540±0.010(2) 0.4137±0.010(5) 0.4454±0.020(8) 0.4156±0.012(7) 0.3465±0.006(1) 0.3546±0.002(3) 0.3556±0.006(4) 0.4145±0.006(6)
Avg. Rank 1.15 7.77 6.46 6.31 4.85 3.15 3.46 3.15
Table 6: Comparison Results (mean±std. (rank)) of Different LDL Algorithms Measured by Clark ↓
Yeast-alpha 0.9947±0.000(1) 0.8527±0.005(8) 0.9482±0.007(7) 0.9879±0.000(6) 0.9914±0.000(5) 0.9945±0.000(3) 0.9945±0.000(4) 0.9946±0.000(2)
Yeast-cdc 0.9955±0.000(1) 0.8544±0.012(8) 0.9590±0.003(7) 0.9871±0.000(6) 0.9913±0.000(4) 0.9904±0.000(5) 0.9939±8.070(2) 0.9932±0.000(3)
Yeast-cold 0.9893±0.001(1) 0.8884±0.008(8) 0.9859±0.001(6) 0.9838±0.000(7) 0.9871±0.000(5) 0.9886±0.000(3) 0.9892±0.034(2) 0.9883±0.001(4)
Yeast-diau 0.9884±0.001(1) 0.8644±0.007(8) 0.9860±0.000(5) 0.9821±0.000(7) 0.9853±0.000(6) 0.9880±0.000(2) 0.9876±0.063(4) 0.9878±0.001(3)
Yeast-dtt 0.9943±0.000(1) 0.8976±0.012(8) 0.9909±0.001(6) 0.9889±0.000(7) 0.9928±0.000(5) 0.9939±0.000(3) 0.9940±0.021(2) 0.9939±0.001(4)
Yeast-elu 0.9942±0.000(1) 0.8600±0.008(8) 0.9623±0.003(7) 0.9876±0.000(6) 0.9912±0.000(5) 0.9939±0.000(3) 0.9938±0.001(4) 0.9940±0.000(2)
Yeast-heat 0.9884±0.001(1) 0.8655±0.008(8) 0.9814±0.001(6) 0.9810±0.000(7) 0.9857±0.000(5) 0.9880±0.000(2) 0.9880±0.029(3) 0.9876±0.001(4)
Yeast-spo 0.9776±0.001(1) 0.8672±0.010(8) 0.9686±0.003(7) 0.9718±0.001(6) 0.9745±0.000(5) 0.9768±0.000(4) 0.9772±0.010(2) 0.9770±0.001(3)
Yeast-spo5 0.9753±0.002(1) 0.8968±0.010(8) 0.9731±0.001(4) 0.9706±0.002(7) 0.9710±0.000(6) 0.9732±0.001(3) 0.9723±0.007(5) 0.9743±0.002(2)
Yeast-spoem 0.9803±0.001(1) 0.9187±0.010(8) 0.9728±0.003(7) 0.9764±0.001(6) 0.9786±0.000(3) 0.9784±0.001(4) 0.9796±0.008(2) 0.9784±0.002(5)
Natural_Scene 0.7637±0.015(1) 0.5583±0.006(8) 0.6954±0.014(7) 0.6986±0.008(6) 0.7144±0.008(5) 0.7442±0.007(3) 0.7624±0.003(2) 0.7486±0.014(4)
Movie 0.9385±0.002(1) 0.8495±0.003(8) 0.8767±0.006(7) 0.9089±0.002(4) 0.8780±0.004(5) 0.9205±0.002(3) 0.8780±0.005(6) 0.9381±0.003(2)
SBU_3DFE 0.9644±0.004(1) 0.9167±0.004(8) 0.9181±0.005(7) 0.9202±0.004(5) 0.9482±0.001(3) 0.9436±0.000(4) 0.9636±0.002(2) 0.9198±0.002(6)
Avg. Rank 1.00 8.00 6.38 6.15 4.77 3.54 3.08 3.38
Table 7: Comparison Results (mean±std. (rank)) of Different LDL Algorithms Measured by Cosine ↑

4.2 Methodology

To show the effectiveness of the proposed methods, we conducted comprehensive experiments on the aforementioned datasets. For LE, the proposed BD-LE method is compared with five classical LE approaches presented in Xu et al. (2018), i.e., FCM, KM, LP, ML, and GLLE. The hyper-parameter in the FCM method is set to 2. We select the Gaussian Kernel as the kernel function in the KM algorithm. For GLLE, the parameter is set to 0.01. Moreover, the number of neighbors is set to in both GLLE and ML.

For the LDL paradigm, the proposed BD-LDL method is compared with eight existing algorithms, including PT-Bayes Geng and Ji (2013), PT-SVM Geng et al. (2014), AA-kNN Geng et al. (2010), AA-BP Geng et al. (2013), SA-IIS Geng et al. (2013), SA-BFGS Geng (2016), LDL-SCL Zheng et al. (2018), and EDL-LRL Jia et al. (2019b), to demonstrate its superiority. The first two algorithms are implemented by the strategy of problem transformation. The next two are carried out by means of the adaptive method. The remaining ones, from the fifth algorithm to the last, are specialized algorithms. In particular, LDL-SCL and EDL-LRL are recently proposed state-of-the-art methods. We utilized the “C-SVC” type in LIBSVM to implement PT-SVM using the RBF kernel with parameters and . We set the hyper-parameter in AA-kNN to 5. The number of hidden-layer neurons for AA-BP was set to 60. The parameters , and in LDL-SCL were all set to . Regarding the EDL-LRL algorithm, we set the regularization parameters and to and , respectively. For the intermediate algorithm K-means, the number of clusters was set to 5 according to Jia’s suggestion Jia et al. (2018). For the BFGS optimization used in SA-BFGS and BD-LDL, parameters and were set to and 0.9, respectively. Regarding the two bi-directional algorithms, parameters are tuned from the range using the grid-search method. The two parameters in BD-LE and are both set to . As for BD-LDL, the parameters and are set to and , respectively. Finally, we train the LDL model with the recovered label distributions for further evaluation of BD-LE. The details of parameter selection are shown in the parameter analysis section. The experiments for the LDL algorithms on every dataset are conducted with ten-fold cross-validation.

4.3 Results

4.3.1 BD-LE performance

Tables 4 and 5 present the results of six LE methods on all the datasets. Constrained by the page limit, we show only two representative results, measured by Chebyshev and Cosine, in this paper.

For each dataset, the results obtained by a specific algorithm are listed as a column in the tables in accordance with the metric used. Note that there is always an entry highlighted in boldface. This entry indicates that the algorithm evaluated by the corresponding measurement achieves the best performance. The experimental results are presented in the form of “score (rank)”; “score” denotes the difference between a predicted distribution and the real one measured by the corresponding metric; “rank” is a direct value to evaluate the effectiveness of the compared algorithms. Moreover, the symbol “↓” means “the smaller the better”, whereas “↑” indicates “the larger the better.”
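The “score (rank)” entries and the “Avg. Rank” rows can be reproduced with a few lines of NumPy. The sketch below ranks one row of scores and averages a column of ranks, using values copied from the Chebyshev and Cosine tables for the LE methods; the helper name is illustrative and ties are not handled.

```python
import numpy as np


def rank_scores(scores, smaller_is_better=True):
    """Rank 1 goes to the best score; ties are broken by position (not handled specially)."""
    arr = np.asarray(scores, dtype=float)
    order = np.argsort(arr if smaller_is_better else -arr)
    ranks = np.empty(len(arr), dtype=int)
    ranks[order] = np.arange(1, len(arr) + 1)
    return ranks


# Yeast-alpha row, column order: Ours, FCM, KM, LP, ML, GLLE
chebyshev_row = [0.0208, 0.0426, 0.0588, 0.0401, 0.0553, 0.0310]
cosine_row = [0.9852, 0.9221, 0.8115, 0.9220, 0.7519, 0.9731]
chebyshev_ranks = rank_scores(chebyshev_row, smaller_is_better=True)
cosine_ranks = rank_scores(cosine_row, smaller_is_better=False)

# Ranks of the proposed method over the 13 datasets under Chebyshev
ours_ranks = [1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 5, 1, 1]
ours_avg_rank = round(sum(ours_ranks) / len(ours_ranks), 2)
```

The averaged value agrees with the 1.38 reported for the proposed method in the “Avg. Rank” row of the Chebyshev table.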

Figure 2: Comparison results of different LE algorithms against the ‘Ground-truth’ used in BD-LDL measured by Chebyshev

It is worth noting that, because the LE method is regarded as a pre-processing step, there is no need to run it several times and record the mean and the standard deviation. The results show that the proposed BD-LE clearly outperforms the other LE algorithms in most cases and renders sub-optimal performance in only about 4.7% of cases according to the statistics. In addition, BD-LE achieves better prediction results than GLLE in most cases, especially on the Movie dataset. From Table 2 we can see that the largest dimensional gap between the input space and the output one occurs exactly in the Movie dataset. This indicates that adding the reconstruction projection to an LE algorithm is reasonable. The two specialized algorithms, namely BD-LE and GLLE, rank first in 91.1% of cases. By contrast, the label distributions are hardly recovered by the other four algorithms. This indicates the superiority of utilizing direct similarity or distance as the loss function in LDL and LE problems. In summary, the performance of the six LE algorithms is ranked from best to worst as follows: BD-LE, GLLE, LP, FCM, ML and KM. This supports the effectiveness of our proposed bi-directional loss function for the LE method.

Figure 3: Comparison results of different LE algorithms against the ‘Ground-truth’ used in SA-BFGS measured by Chebyshev

4.3.2 BD-LDL performance

As for the performance of BD-LDL, we report the numerical results on the 13 real-world datasets under the Clark and Cosine measures in Tables 3 and 4, in the format “mean±std (rank)”; as before, the entry in bold in every row marks the best performance. BD-LDL outperforms the other classical LDL algorithms in most cases. When measured by Cosine, BD-LDL achieves the best performance on every dataset, which strongly supports the effectiveness of the proposed method. Besides, Table 4 shows that although LDLLC and SA-BFGS obtain the best results on datasets Yeast-dtt and SBU_3DFE, respectively, when measured with Chebyshev, BD-LDL still ranks second. The results also show that the PT and AP algorithms perform poorly in most cases, which verifies the advantage of using the direct similarity or distance between the predicted label distribution and the true one as the loss function. Moreover, our proposed method performs better than the existing specialized algorithms, which ignore the reconstruction error. This indicates that such a bi-directional loss function can indeed boost the performance of LDL algorithms.
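For readers unfamiliar with the measures named above, the sketch below gives their standard definitions for a single instance, where `d` is the true label distribution and `p` the predicted one. The function names and the choice to skip zero-denominator terms in Clark are assumptions made for illustration, not details confirmed by this paper.

```python
import math


def chebyshev(d, p):
    """Chebyshev distance: the largest absolute gap over all labels (smaller is better)."""
    return max(abs(a - b) for a, b in zip(d, p))


def clark(d, p):
    """Clark distance (smaller is better).

    Terms with a + b == 0 are skipped to avoid division by zero (an assumption).
    """
    return math.sqrt(
        sum((a - b) ** 2 / (a + b) ** 2 for a, b in zip(d, p) if a + b > 0)
    )


def cosine(d, p):
    """Cosine coefficient: a similarity in [0, 1] for distributions (larger is better)."""
    dot = sum(a * b for a, b in zip(d, p))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in p))
    return dot / norm


# Made-up distributions over three labels.
true_dist = [0.5, 0.3, 0.2]
pred_dist = [0.4, 0.4, 0.2]
print(chebyshev(true_dist, pred_dist), clark(true_dist, pred_dist), cosine(true_dist, pred_dist))
```

A dataset-level score would then be the average of the per-instance values over the test set.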

                 Canberra                       Intersection
Dataset          UD-LDL         BD-LDL          UD-LDL         BD-LDL
Yeast-alpha      0.7980±0.013   0.6013±0.011    0.8915±0.001   0.9624±0.001
Yeast-cdc        0.9542±0.012   0.6078±0.012    0.8869±0.001   0.9580±0.001
Yeast-cold       0.4515±0.010   0.2103±0.007    0.8779±0.002   0.9430±0.002
Yeast-diau       0.6751±0.010   0.4220±0.013    0.8338±0.001   0.9414±0.002
Yeast-dtt        0.3724±0.010   0.1659±0.006    0.8775±0.002   0.9590±0.001
Yeast-elu        0.8005±0.007   0.5789±0.011    0.8776±0.001   0.9591±0.001
Yeast-heat       0.5706±0.010   0.3577±0.010    0.8591±0.002   0.9412±0.002
Yeast-spo        0.7075±0.016   0.5036±0.019    0.8301±0.003   0.9171±0.003
Yeast-spo5       0.4834±0.010   0.2745±0.011    0.8184±0.003   0.9112±0.003
Yeast-spoem      0.2998±0.010   0.1716±0.007    0.8129±0.005   0.9169±0.003
Natural_Scene    6.9653±0.095   0.7319±0.040    0.3822±0.010   0.5395±0.011
Movie            1.4259±0.024   0.0218±0.001    0.7429±0.004   0.8298±0.002
SBU_3DFE         0.8562±0.021   0.0119±0.001    0.8070±0.004   0.8590±0.004
Table 8: Ablation experiment results of the UD-LDL and BD-LDL algorithms measured by Canberra and Intersection (mean±std)
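The two measures reported in Table 8 can be sketched as follows, together with a small helper that formats repeated-run scores as “mean±std”. The function names are illustrative, and the use of the population standard deviation is an assumption, since the paper does not state which estimator is used.

```python
import statistics


def canberra(d, p):
    """Canberra distance between true d and predicted p (smaller is better).

    Terms with a + b == 0 are skipped to avoid division by zero (an assumption).
    """
    return sum(abs(a - b) / (a + b) for a, b in zip(d, p) if a + b > 0)


def intersection(d, p):
    """Intersection similarity: shared probability mass (larger is better)."""
    return sum(min(a, b) for a, b in zip(d, p))


def mean_std(values):
    """Format scores from repeated runs as "mean±std" (population std assumed)."""
    return f"{statistics.mean(values):.4f}±{statistics.pstdev(values):.3f}"


# Made-up distributions and made-up scores from three hypothetical runs.
print(canberra([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
print(intersection([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
print(mean_std([0.60, 0.62, 0.58]))
```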

4.3.3 LDL algorithm Predictive Performance

The reason to use an LE algorithm is that we need to recover the label distributions for LDL training. To verify the correctness and effectiveness of our proposed LE algorithm, we conducted an experiment that compares predictions made by an LDL model trained on the recovered label distributions with those made by the same model trained on the real label distributions. Moreover, for further evaluation of the proposed BD-LDL, we selected BD-LDL and SA-BFGS as the LDL models in this experiment. Owing to the page limit, we present only the experimental results measured with Chebyshev. The prediction results achieved with BD-LDL and SA-BFGS are visualized as histograms in Figs. 2 and 3, respectively. Note that ‘Ground-truth’ in the figures represents the results obtained with the real label distributions; we treat these results as a benchmark rather than as a competitor in the evaluation. Meanwhile, we use ‘FCM’, ‘KM’, ‘LP’, ‘ML’, ‘GLLE’ and ‘BD-LE’ to represent the performance of the corresponding LE algorithm in this experiment. As illustrated in Figs. 2 and 3, although the predictions on datasets movie and SBU_3DFE are worse than on the other datasets, ‘BD-LE’ is still relatively close to ‘Ground-truth’ in most cases, especially on the first eleven datasets. We must mention that ‘BD-LE’ combined with BD-LDL is closer to ‘Ground-truth’ than ‘BD-LE’ combined with SA-BFGS in all cases. This indicates that the reconstruction constraint is general enough to improve both LE and LDL algorithms simultaneously.
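The protocol described above can be outlined in a few lines. This is only a sketch of the experimental loop: `compare_le_methods`, `normalize_logical` and `train_mean_model` are hypothetical names, and the two toy stand-ins are deliberately trivial placeholders for a real LE algorithm (such as BD-LE) and a real LDL model (such as BD-LDL or SA-BFGS).

```python
def chebyshev(d, p):
    """Chebyshev distance between a true and a predicted distribution."""
    return max(abs(a - b) for a, b in zip(d, p))


def average_score(model, X_test, D_test, metric):
    """Mean of metric(true, predicted) over the test instances."""
    return sum(metric(d, model(x)) for x, d in zip(X_test, D_test)) / len(X_test)


def compare_le_methods(le_methods, train_ldl, train, test, metric):
    """Train one LDL model per LE method and score it against real distributions."""
    X_train, logical_train, D_train = train
    X_test, D_test = test
    # Benchmark: the LDL model trained on the real label distributions.
    results = {"Ground-truth": average_score(train_ldl(X_train, D_train), X_test, D_test, metric)}
    for name, enhance in le_methods.items():
        recovered = enhance(X_train, logical_train)  # LE step
        results[name] = average_score(train_ldl(X_train, recovered), X_test, D_test, metric)
    return results


def normalize_logical(X, logical):
    """Toy LE stand-in: spread mass uniformly over the relevant labels.

    Assumes every instance has at least one relevant label.
    """
    return [[v / sum(row) for v in row] for row in logical]


def train_mean_model(X, D):
    """Toy LDL stand-in: always predict the average training distribution."""
    mean = [sum(col) / len(D) for col in zip(*D)]
    return lambda x: mean


# Made-up data: (features, logical labels, real distributions) and a test split.
train = ([[0.0], [1.0]], [[1, 0], [1, 1]], [[0.8, 0.2], [0.5, 0.5]])
test = ([[0.5]], [[0.6, 0.4]])
results = compare_le_methods({"toy-LE": normalize_logical}, train_mean_model, train, test, chebyshev)
print(results)
```

The closer an LE method's score is to the ‘Ground-truth’ entry, the better the recovered distributions serve as a substitute for the real ones.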

Figure 4: Influence of the two trade-off parameters on dataset Yeast-cold in BD-LE
Figure 5: Influence of the two trade-off parameters on dataset Yeast-cold in BD-LDL

4.4 Ablation Experiment

The bi-directional loss for either LE or LDL consists of two parts, i.e., the naive mapping loss and the reconstruction loss. To further demonstrate the effectiveness of the additional reconstruction term, we conduct an ablation experiment measured with Canberra and Intersection on all 13 datasets. We refer to the unidirectional algorithms without the reconstruction term as UD-LE and UD-LDL, respectively; their objectives are obtained from those of BD-LE and BD-LDL by dropping the reconstruction term.

Since the objective function of UD-LE is identical to that of GLLE, the corresponding comparison can be found in Tables 2 and 3. The LDL prediction results under the Canberra and Intersection metrics are tabulated in Table 8.

From Table 8 we can see that BD-LDL achieves better results than UD-LDL on all benchmark datasets, i.e., introducing the reconstruction term can indeed boost the performance of the LDL algorithm. As expected, the top-3 improvements are achieved on datasets Natural_Scene, Movie and SBU_3DFE, which have a relatively large dimensional gap between the feature space and the label space.
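To make the difference between a unidirectional and a bi-directional loss concrete, here is a minimal sketch. It assumes a linear mapping `W`, a reconstruction that reuses the transposed mapping (tied weights, in the spirit of a semantic autoencoder), squared Frobenius errors, and a hypothetical trade-off weight `lam`. The actual BD-LE and BD-LDL objectives may use different error terms and additional regularizers, so this is an illustration of the idea rather than the paper's formulation.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def sq_error(A, B):
    """Squared Frobenius norm of A - B."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def unidirectional_loss(X, D, W):
    """Mapping error only: ||XW - D||_F^2 (features -> label distributions)."""
    return sq_error(matmul(X, W), D)


def bidirectional_loss(X, D, W, lam):
    """Mapping error plus lam times the reconstruction error ||X - D W^T||_F^2."""
    reconstruction = sq_error(X, matmul(D, transpose(W)))
    return unidirectional_loss(X, D, W) + lam * reconstruction


# Made-up example: one instance, three features, two labels.
X = [[1.0, 0.0, 2.0]]
D = [[1.0, 0.5]]
W = [[0.5, 0.0], [0.0, 0.5], [0.25, 0.25]]
print(unidirectional_loss(X, D, W), bidirectional_loss(X, D, W, lam=0.5))
```

In this toy case the mapping error is zero while the reconstruction error is not, which shows how the extra term can still penalize a mapping that fits the labels but cannot reproduce the features.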

4.5 Influence of Parameters

To examine the robustness of the proposed algorithms, we also analyze the influence of the trade-off parameters in the experiments, namely the two parameters of BD-LDL and the two parameters of BD-LE. We run BD-LE with both of its parameters varied over a common range of candidate values, and the two parameters of BD-LDL use the same candidate set. Owing to the page limit, we show only the experimental results on the Yeast-cold dataset, measured with Chebyshev and Cosine. The results are visualized with different colors in Figs. 4 and 5. When measured with Chebyshev, a smaller value means better performance and is shown closer to blue; by contrast, with Cosine, a larger value indicates better performance and is shown closer to red.

Fig. 4 shows that when one of the BD-LE parameters falls within a certain range, relatively good recovery results are achieved for any value of the other. After conducting several experiments, we conclude that BD-LE performs best when both parameters are set to about the same value. Concerning BD-LDL, when one of its parameters is selected within a certain range, the color varies very steadily, which means that the performance is not sensitive to this hyper-parameter in that range. In addition, Fig. 5 shows that, within a certain range, one of the two BD-LDL parameters has a stronger influence on the performance than the other.

5 Conclusion

Previous studies have shown that LDL can effectively solve label ambiguity problems, whereas LE is able to recover label distributions from logical labels. To improve the performance of LDL and LE methods, we propose a new loss function that combines the mapping error with the reconstruction error in order to retain the information that would otherwise be lost because of the dimensional gap between the input space and the output space. Extensive experiments show that the proposed loss function is general enough to be applied to both LDL and LE, with improved results in both. In the future, we will explore whether there exists an end-to-end way to recover the label distributions under the supervision of the LDL training process.

