1 Introduction
Learning with ambiguity has become one of the most prevalent research topics. The traditional way to solve machine learning problems is based on single-label learning (SLL) and multi-label learning (MLL) Tsoumakas and Katakis (2007); Xu et al. (2016). In the SLL framework, an instance is always assigned a single label, whereas in MLL an instance may be associated with several labels. The existing learning paradigms of SLL and MLL are mostly based on so-called problem transformation. However, neither SLL nor MLL addresses the question of the degree to which a label describes its corresponding instance, i.e., the labels differ in their importance to the description of the instance. It is more appropriate for the importance of the candidate labels to differ rather than be exactly equal. To address this problem, a novel learning paradigm called label distribution learning (LDL) Geng and Ji (2013) was proposed. Compared with SLL and MLL, LDL labels an instance with a real-valued vector that consists of the description degree of every possible label for the current instance. A detailed comparison is visualized in Fig. 1. In fact, LDL can be regarded as a more general form of MLL and SLL. However, the annotated training sets required by LDL are extremely scarce owing to the heavy burden of manual annotation. Because it is difficult to obtain annotated label distributions directly, a process called label enhancement (LE) Xu et al. (2018) was also proposed to recover label distributions from logical labels. With an LE algorithm, the logical labels of a conventional MLL dataset can be recovered into label distribution vectors by mining the topological information of the input space and the label correlation He et al. (2019).

Many relevant LDL and LE algorithms have been proposed in recent years. These algorithms have progressively boosted the performance of specific tasks. For instance, LDL is widely applied in facial age estimation.
Geng et al. (2014) proposed a specialized LDL framework that combines the maximum entropy model Berger et al. (1996) with IIS optimization, namely IISLDL. This approach not only achieves better performance than other traditional machine learning algorithms but has also become the foundation of the LDL framework. In other work, Fan et al. (2017) took both facial geometric and convolutional features into account, which remarkably improved efficiency and accuracy. As mentioned above, the difficulty of acquiring labeled datasets restricts the development of LDL algorithms. After presenting several LE algorithms, Xu et al. (2019b) adapted LDL to partial label learning (PLL) with label distributions recovered via LE. Although these methods have achieved significant performance, one problem yet to be solved is the loss of discriminative information, which is caused by the dimensional gap between the input data matrix and the output one. It is entirely possible that these existing methods miss essential information that should be inherited from the original input space, thereby degrading performance.

As discussed above, the critical point of previous works on LDL and LE is to establish a suitable loss function to fit label distribution data. In previous works, only a unidirectional projection between the input and output spaces is learned. In this paper, we present a bidirectional loss function with a comprehensive reconstruction constraint. Such a function can be applied in both LDL and LE to maintain the latent information. Inspired by the autoencoder paradigm Kodirov and Gong (2017); Cheng et al. (2019), our proposed method builds the reconstruction projection together with the mapping projection to preserve the otherwise lost information. More precisely, optimizing the original loss is the mapping step, while minimizing the reconstruction error is the reconstruction step. In contrast to previous loss functions, the proposed loss function aims to reconstruct the input data from the output data. Therefore, it is expected to obtain more accurate results than other related loss functions for both LE and LDL problems. Extensive experiments on several well-known datasets demonstrate that the proposed loss function achieves superior performance.
The main contributions of this work are as follows:

1) the reconstruction projection from label space to instance space is considered for the first time in the LDL and LE paradigms;
2) a bidirectional loss function that combines mapping error and reconstruction error is proposed;
3) the proposed method can be used not only in LDL but also for LE.

The rest of this paper is organized as follows. First, related work on LDL and LE methods is reviewed in Section 2. Second, the formulations of LE and LDL and the proposed methods, i.e., BDLE and BDLDL, are introduced in Section 3. After that, the results of the comparison and ablation experiments are shown in Section 4, where the influence of the parameters is also discussed. Finally, conclusions and future work are summarized in Section 5.
2 Related work
In this section, we briefly summarize the related work about LDL and LE methods.
2.1 Label Distribution Learning
The proposed LDL methods mainly focus on three aspects, namely the model assumption, the loss function, and the optimization algorithm. The maximum entropy model Berger et al. (1996) is widely used to represent the label distribution in the LDL paradigm Xu et al. (2019c); Ren et al. (2019b). The maximum entropy model naturally agrees with the character of the description degree in the LDL model. However, such an exponential model is sometimes not expressive enough to fit a complex distribution. To overcome this issue, Xing et al. (2016) proposed an LDL family based on a boosting algorithm to extend the traditional LDL model. Inspired by the MSVR algorithm, LDSVR Geng and Hou (2015) is designed for the movie opinion prediction task. Furthermore, CPNN Geng et al. (2013) combines a neural network with the LDL paradigm to improve the effectiveness of facial age estimation applications. Moreover, recent work Ren et al. (2019a); Xu and Zhou (2017) has shown that a linear model is also able to achieve relatively strong representation ability and satisfying results. As reviewed above, most existing methods build the mapping from feature space to label space in a unidirectional way, so it is appropriate to take a bidirectional constraint into consideration.

Concerning the loss function, LDL aims at learning a model whose predicted distributions for unseen instances are similar to the true ones. A criterion that measures the distance between two distributions, such as the Kullback–Leibler (KL) divergence, is commonly chosen as the loss function Jia et al. (2018); Geng et al. (2010). Owing to the asymmetry of the KL divergence, Jeffrey’s divergence is used in EDL Zhou et al. (2015) to build an LDL model for facial emotion recognition. For the sake of easier computation, it is reasonable to adopt the Euclidean distance in a variety of tasks, e.g., facial emotion recognition Jia et al. (2019b).
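The asymmetry of the KL divergence mentioned above, and the way Jeffrey's divergence removes it, can be illustrated with a minimal sketch. The function names, the `eps` guard, and the two toy distributions are assumptions made for this illustration and are not taken from the cited works.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_j p_j * ln(p_j / q_j); eps guards against log(0)."""
    return sum(pj * math.log((pj + eps) / (qj + eps)) for pj, qj in zip(p, q))

def jeffreys_divergence(p, q, eps=1e-12):
    """Symmetrized form: sum_j (p_j - q_j) * ln(p_j / q_j)."""
    return sum((pj - qj) * math.log((pj + eps) / (qj + eps)) for pj, qj in zip(p, q))

def squared_euclidean(p, q):
    """Squared Euclidean distance, the cheaper alternative mentioned in the text."""
    return sum((pj - qj) ** 2 for pj, qj in zip(p, q))

# Toy label distributions over three labels (illustrative values only).
true_dist = [0.6, 0.3, 0.1]
pred_dist = [0.4, 0.4, 0.2]

print("KL(true || pred):", kl_divergence(true_dist, pred_dist))
print("KL(pred || true):", kl_divergence(pred_dist, true_dist))
print("Jeffrey's       :", jeffreys_divergence(true_dist, pred_dist))
print("Sq. Euclidean   :", squared_euclidean(true_dist, pred_dist))
```

Swapping the arguments changes the KL value but leaves Jeffrey's divergence and the Euclidean distance unchanged, which is why the latter two are convenient when neither distribution should be privileged.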
Regarding the optimization method, SAIIS Geng (2016) utilizes the improved iterative scaling (IIS) method, which often performs worse than other optimization methods Malouf (2002). Fortunately, the L-BFGS Nocedal and Wright (2006) optimization method maintains a balance between efficiency and accuracy, especially in SABFGS Geng (2016) and EDL Zhou et al. (2015). As the proposed models become more complex, more than one set of parameters must be optimized. Therefore, it is more appropriate to introduce the alternating direction method of multipliers (ADMM) Boyd et al. (2011) when the loss function incorporates additional inequality and equality constraints. In addition, exploiting the correlation among labels or samples can further boost the performance of an LDL model. Jia et al. (2018) proposed LDLLC, which takes the global label correlation into account by introducing Pearson’s correlation between labels. It is pointed out in LDLSCL Zheng et al. (2018) and EDLLRL Jia et al. (2019b) that some correlations among labels (or samples) exist only within a subset of instances; these are the so-called local correlations. Intuitively, the instances in the same group after clustering share the same local correlation.
Moreover, labeled data are commonly incomplete and contaminated Ma et al. (2017). For the former condition, Xu and Zhou (2017) put forward IncomLDLa and IncomLDLp under the assumption that the recovered complete label distribution matrix is low-rank. Proximal gradient descent (PGD) and ADMM are used to optimize the two methods, respectively. The time complexity of the first one is , and that of the last one is , although the latter achieves better accuracy. Jia et al. (2019a) proposed WSLDLMCSC, which is based on matrix completion and on exploring the relevance among samples in a transductive way when the data are weakly supervised.
2.2 Label Enhancement Learning
To the best of our knowledge, only a few studies focus on label enhancement learning Xu et al. (2018). Five effective strategies have been devised so far, four of which are adaptive algorithms. As discussed in Geng (2016), the concept of membership used in fuzzy clustering Jiang et al. (2006) is similar to label distribution. Although they carry two distinct semantics, both are in numerical format. Thus, FCM El Gayar et al. (2006) extends the calculation of membership used in fuzzy C-means clustering Melin and Castillo (2005) to recover the label distribution. The LE algorithm based on the kernel method (KM) Jiang et al. (2006) utilizes a kernel function to project the instances from the original space into a high-dimensional one. For every candidate label, the instances are separated into two groups according to whether the corresponding logical label is 1. Then the label distribution term, i.e., the description degree, can be calculated based on the distance between the instances and the centers of the groups. The label propagation technique Wang and Zhang (2007) is used in the LP method to update the label distribution matrix iteratively over a fully-connected graph. Since messages between samples are shared and passed along the connection graph, the logical labels can be enhanced into distribution-level labels. The LE method adapted from manifold learning (ML) Hou et al. (2016) takes the topological consistency between feature space and label space into consideration to obtain the recovered label distribution. The last strategy, called GLLE Xu et al. (2019a), is a specialized algorithm that leverages the topological information of the input space and the correlation among labels. Meanwhile, the local label correlation is captured via clustering Zhou et al. (2012).
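To make the label propagation idea concrete, the following is a minimal sketch of one common formulation: a fully-connected Gaussian affinity graph, symmetric normalization, and the iterative update F ← αPF + (1 − α)L, followed by a row-wise softmax so that each row sums to one. The exact affinity, the balancing parameter `alpha`, and the softmax normalization are assumptions made for illustration; the LP baseline in the cited work may differ in these details.

```python
import numpy as np

def label_propagation_enhance(X, L, alpha=0.5, n_iter=200):
    """Sketch of label-propagation-based label enhancement.

    X: (n, m) feature matrix; L: (n, c) logical (0/1) label matrix.
    Returns the propagated scores F and row-normalized label distributions.
    """
    # Fully-connected graph with Gaussian affinities and no self-loops.
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    A = np.exp(-sq_dist / 2.0)
    np.fill_diagonal(A, 0.0)
    # Symmetric normalization P = D^{-1/2} A D^{-1/2}
    # (assumes every instance has a non-zero total affinity).
    inv_sqrt_deg = 1.0 / np.sqrt(A.sum(axis=1))
    P = A * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]
    # Iterative propagation of label information along the graph.
    F = L.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (P @ F) + (1.0 - alpha) * L
    # Row-wise softmax turns the scores into description degrees.
    E = np.exp(F - F.max(axis=1, keepdims=True))
    return F, E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X_demo = 0.5 * rng.normal(size=(8, 3))
L_demo = (rng.random((8, 4)) > 0.5).astype(float)
F_demo, D_demo = label_propagation_enhance(X_demo, L_demo)
print(D_demo.round(3))
```

Because the normalized affinity matrix has spectral radius at most one and `alpha` is below one, the iteration converges to the closed form (1 − α)(I − αP)⁻¹L, so the loop could be replaced by a linear solve on small datasets.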
3 Proposed Method
Let denote the dimensional input space and represent the complete set of labels, where is the number of all possible labels. For each instance , a simple logical label vector is leveraged to represent which labels can describe the instance correctly. Specifically, for the LDL paradigm, instance is assigned a distribution-level vector
Notations  Description 
the number of instances  
number of labels  
dimension of samples  
instance feature matrix  
logical label matrix  
label distribution matrix  
Mapping parameter of BDLE  
Reconstruction parameter of BDLE  
Mapping parameter of BDLDL  
Reconstruction parameter of BDLDL 
3.1 Bidirectional for Label Enhancement
Given a dataset , and are defined as the input matrix and the logical label matrix, respectively. According to the previous discussion, the goal of LE is to transform into the label distribution matrix .
First, a nonlinear function , i.e., a kernel function, is defined to transform each instance into a higher-dimensional feature , which can be utilized to construct the vector of the corresponding instance. For each instance, an appropriate mapping parameter is required to transform the input feature into the label distribution . As there is a large dimensional gap between the input space and the output space, a lot of information may be lost during the mapping process. To address this issue, it is reasonable to introduce the parameter for the reconstruction of the input data from the output data. Accordingly, the objective function of LE is formulated as follows:
(1) 
where denotes the loss function of data mapping, indicates the loss function of data reconstruction, is the regularization term, and are two trade-off parameters. It should be noted that the LE algorithm is regarded as a preprocessing step for LDL methods and does not suffer from the overfitting problem. Accordingly, it is not necessary to add the norms of the parameters and as regularizers.
The first term is the mapping loss function, which measures the distance between the logical labels and the recovered label distributions. According to Xu et al. (2018), it is reasonable to select the least squares (LS) function:
(2)  
where and is the trace of a matrix defined by the sum of diagonal elements. The second term is the reconstruction loss function to measure the similarity between the input feature data and the reconstructed one from the output data of LE. Similar to the mapping loss function, the reconstruction loss function is defined as follows:
(3)  
To further simplify the model, it is reasonable to consider the tied weights Boureau et al. (2008) as follows:
(4) 
where is the best reconstruction parameter to be obtained. Then the Eq. (1) is rewritten as:
(5) 
To obtain desired results, the manifold regularization is designed to capture the topological consistency between feature space and label space, which can fully exploit the hidden label importance from the input instances. Before presenting this term, it is required to introduce the similarity matrix , whose element is defined as:
(6) 
where denotes the set of nearest neighbors for the instance , and is the hyperparameter, fixed to 1 in this paper. Inspired by the smoothness assumption Zhu et al. (2005), the more correlated two instances are, the closer the corresponding recovered label distributions are, and vice versa. Accordingly, it is reasonable to design the following manifold regularization:
(7)  
where indicates the recovered label distribution, and is the Laplacian matrix. Note that the similarity matrix is asymmetric, so the elements of the diagonal matrix are defined as ,
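The construction of the neighborhood similarity matrix and the graph Laplacian can be sketched as follows. This is a minimal illustration under stated assumptions: the Gaussian form of the similarity, the names `knn_similarity`, `graph_laplacian`, and `manifold_penalty`, and the choice to symmetrize the similarity before forming the Laplacian are all made here for the example, since the exact definitions of Eqs. (6) and (7) are not reproduced in the text above.

```python
import numpy as np

def knn_similarity(X, k=3, sigma=1.0):
    """Gaussian similarity restricted to each instance's k nearest neighbors.

    The result is generally asymmetric: j may be a neighbor of i
    without i being a neighbor of j.
    """
    n = X.shape[0]
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    A = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(sq_dist[i])
        neighbours = [j for j in order if j != i][:k]
        A[i, neighbours] = np.exp(-sq_dist[i, neighbours] / (2.0 * sigma ** 2))
    return A

def graph_laplacian(A):
    """Laplacian of the symmetrized similarity (one way to handle asymmetry)."""
    S = 0.5 * (A + A.T)
    degree = np.diag(S.sum(axis=1))
    return degree - S, S

def manifold_penalty(D_hat, G):
    """tr(D_hat^T G D_hat), small when similar instances have similar rows."""
    return np.trace(D_hat.T @ G @ D_hat)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))
D_demo = rng.random((6, 3))
G_demo, S_demo = graph_laplacian(knn_similarity(X_demo, k=2))
print(manifold_penalty(D_demo, G_demo))
```

With a symmetric similarity, the trace form equals one half of the similarity-weighted sum of squared differences between the rows of the recovered label distribution matrix, which is the smoothness assumption stated in the text.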
By substituting Eqs. (2), (3) and (7) into Eq. (5), the mapping and reconstruction loss function is defined on parameter as follows:
(8)  
Eq. (8) can be easily optimized by the well-known limited-memory quasi-Newton method (L-BFGS) Yuan (1991). This method performs the optimization using the first-order gradient of :
(9)  
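A hedged sketch of how such an objective and its first-order gradient could be evaluated is given below. It assumes one plausible instantiation of the three terms described above: a least-squares mapping loss between the transformed features and the logical labels, a tied-weight reconstruction of the features from the logical labels, and the trace-form manifold regularizer. The symbols `Phi`, `L`, `G`, `alpha`, and `lam`, and the choice of reconstructing from the logical labels, are assumptions made for this illustration and may differ from Eqs. (8) and (9).

```python
import numpy as np

def bdle_objective(W_flat, Phi, L, G, alpha=1.0, lam=0.1):
    """Value and gradient of an assumed bidirectional LE objective.

    J(W) = ||Phi W - L||_F^2 + alpha * ||Phi - L W^T||_F^2
           + lam * tr((Phi W)^T G (Phi W))
    Phi: (n, q) transformed features, L: (n, c) logical labels,
    G: (n, n) graph Laplacian, W: (q, c) tied mapping parameter.
    """
    q, c = Phi.shape[1], L.shape[1]
    W = W_flat.reshape(q, c)
    D_hat = Phi @ W              # recovered label distributions
    map_res = D_hat - L          # mapping residual
    rec_res = Phi - L @ W.T      # reconstruction residual (tied weights)
    value = (np.sum(map_res ** 2)
             + alpha * np.sum(rec_res ** 2)
             + lam * np.trace(D_hat.T @ G @ D_hat))
    grad = (2.0 * Phi.T @ map_res
            - 2.0 * alpha * rec_res.T @ L
            + lam * Phi.T @ (G + G.T) @ D_hat)
    return value, grad.ravel()

rng = np.random.default_rng(0)
Phi_demo = rng.normal(size=(10, 5))
L_demo = (rng.random((10, 3)) > 0.5).astype(float)
M = rng.random((10, 10))
S_demo = 0.5 * (M + M.T)
G_demo = np.diag(S_demo.sum(axis=1)) - S_demo
value, grad = bdle_objective(np.zeros(15), Phi_demo, L_demo, G_demo)
print(value, np.linalg.norm(grad))
```

Because the function returns the value and the flattened gradient together, it could be handed to a quasi-Newton routine such as SciPy's `scipy.optimize.minimize` with `jac=True` and `method="L-BFGS-B"`, if SciPy is available.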
3.2 Bidirectional for Label Distribution Learning
Given a dataset whose labels are in real-valued format, LDL aims to build a mapping function from the instances to the label distributions, where denotes the th instance and indicates the th label distribution of instance. Note that accounts for the description degree of to rather than the probability that label tags correctly. All the labels together describe each instance completely, so it is reasonable that and .

As mentioned before, most LDL methods suffer from mapping information loss due to the unidirectional projection of the loss function. Fortunately, bidirectional projections can largely preserve the information of the input matrix. Accordingly, the goal of our BDLDL algorithm is to determine a mapping parameter and a reconstruction parameter from the training set so as to make the predicted label distribution and the true one as similar as possible. Therefore, the new loss function integrates the mapping error with the reconstruction error as follows:
(10) 
where denotes the mapping parameter, indicates the reconstruction parameter, is a regularization to control the complexity of the output model to avoid overfitting, and are two parameters to balance these four terms.
There are various candidate functions to measure the difference between two distributions, such as the Euclidean distance, the Kullback–Leibler (KL) divergence, and the Clark distance. Here, we choose the Euclidean distance:
(11) 
where is the mapping parameter to be optimized, and is the Frobenius norm of a matrix. For simplification, it is reasonable to consider tied weights Boureau et al. (2008) as follows:
(12) 
Similarly, the objective function is simplified as follows:
(13) 
where the term denotes the simplified reconstruction error. As for the second term in the objective function, we adopt the Frobenius norm (F-norm) to implement it:
(14) 
Substituting Eqs. (11) and (14) into Eq. (13) yields the objective function:
(15) 
Before optimization, the trace properties and are applied for the reorganization of objective function:
(16) 
Then, for optimization, we can simply take the derivative of Eq. (16) with respect to the parameter and set it to zero:
(17) 
Obviously, Eq. (17) can be transformed into the following equivalent formulation:
(18) 
Denote , and , Eq. (18) can be rewritten as the following formulation:
(19) 
Although Eq. (19) is the well-known Sylvester equation, which can be solved by existing algorithms in MATLAB, the computational cost of the corresponding solution is not ideal. Thus, following Zhu et al. (2017), we solve Eq. (19) efficiently with the Cholesky factorization Golub and Loan (1996) as well as the singular value decomposition (SVD). First, the two positive semidefinite matrices and can be factorized as:
(20)
where and are triangular matrices, which can be further decomposed via SVD as:
(21)  
(22) 
Since , , and are unitary matrices, Eq. (22) can be rewritten as:
(23)
Multiplying both sides of Eq. (23) by and yields the following equation:
(24)
where , , and .
Since both and are diagonal matrices, we can directly obtain , whose element is defined as:
(25)
where and can be calculated from the eigenvalues of and , respectively, and is the (i, j)-th element of matrix . Accordingly, can be obtained by:
(26)
We briefly summarize the procedure of the proposed BDLDL in Algorithm 1.
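The closed-form solution of the Sylvester equation can be sketched as follows. This is a minimal illustration under stated assumptions: the objective is taken to be ||XW − D||² + λ₁||X − DWᵀ||² + λ₂||W||² in the Frobenius norm, which gives AW + WB = C with A = XᵀX + λ₂I, B = λ₁DᵀD, and C = (1 + λ₁)XᵀD. The placement of the two trade-off parameters is an assumption, and the sketch swaps the Cholesky-plus-SVD route described above for a direct symmetric eigendecomposition, which diagonalizes the same two matrices.

```python
import numpy as np

def solve_bdldl(X, D, lam1=1.0, lam2=1e-3):
    """Solve A W + W B = C for the assumed BDLDL objective.

    X: (n, m) feature matrix, D: (n, c) label distribution matrix.
    Returns W of shape (m, c).
    """
    A = X.T @ X + lam2 * np.eye(X.shape[1])  # symmetric positive definite
    B = lam1 * (D.T @ D)                     # symmetric positive semidefinite
    C = (1.0 + lam1) * (X.T @ D)
    # Diagonalize both coefficient matrices with orthogonal eigenvectors.
    eig_a, U = np.linalg.eigh(A)
    eig_b, V = np.linalg.eigh(B)
    # In the rotated basis the equation decouples element by element:
    # W_tilde[i, j] = C_tilde[i, j] / (eig_a[i] + eig_b[j]).
    C_tilde = U.T @ C @ V
    W_tilde = C_tilde / (eig_a[:, None] + eig_b[None, :])
    return U @ W_tilde @ V.T

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 5))
D_demo = rng.random((20, 3))
D_demo /= D_demo.sum(axis=1, keepdims=True)  # rows sum to one
W_demo = solve_bdldl(X_demo, D_demo, lam1=0.5, lam2=0.1)
print(W_demo.shape)
```

The element-wise division mirrors Eq. (25): once both coefficient matrices are diagonal in their own bases, each entry of the rotated solution depends only on one eigenvalue from each side, so no iterative solver is needed.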
Index  Data Set  # Examples  # Features  # Labels 
1  Yeastalpha  2,465  24  18 
2  Yeastcdc  2,465  24  15 
3  Yeastcold  2,465  24  4 
4  Yeastdiau  2,465  24  7 
5  Yeastdtt  2,465  24  4 
6  Yeastelu  2,465  24  14 
7  Yeastheat  2,465  24  6 
8  Yeastspo  2,465  24  6 
9  Yeastspo5  2,465  24  3 
10  Yeastspoem  2,465  24  2 
11  Natural Scene  2,000  294  9 
12  Movie  7,755  1,869  5 
13  SBU_3DFE  2,500  243  6 
4 Experiments
4.1 Datasets and Measurement
We conducted extensive experiments on 13 real-world datasets collected from biological experiments Eisen et al. (1998), facial expression images Lyons et al. (1998), natural scene images, and movies. The outputs of both LE and LDL are in the format of label distribution vectors. In contrast to the results of SLL and MLL, the label distribution vectors should be evaluated with diverse measurements. We select six commonly used criteria, i.e., Chebyshev distance (Chebyshev), Clark distance (Clark), Canberra metric (Canberra), Kullback–Leibler divergence (KL), Cosine coefficient (Cosine), and Intersection similarity (Intersec). The first four measure the distance between the ground-truth label distribution and the predicted one , whereas the last two are similarity measurements. The specifications of the criteria and the datasets used can be found in Tables 3 and 2.
Name  Definition  
Distance  Chebyshev  
Clark  
Canberra  
Similarity  Intersection  
Cosine 
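The six criteria named above have standard definitions in the LDL literature, which can be sketched as follows. The argument names `d` (ground-truth distribution) and `d_hat` (predicted distribution) are chosen here for illustration, and the sketch assumes strictly positive entries so that the ratios and logarithms are defined.

```python
import math

def chebyshev(d, d_hat):
    """Largest absolute difference over the labels (smaller is better)."""
    return max(abs(a - b) for a, b in zip(d, d_hat))

def clark(d, d_hat):
    """Square root of the summed squared relative differences."""
    return math.sqrt(sum((a - b) ** 2 / (a + b) ** 2 for a, b in zip(d, d_hat)))

def canberra(d, d_hat):
    """Sum of absolute relative differences."""
    return sum(abs(a - b) / (a + b) for a, b in zip(d, d_hat))

def kullback_leibler(d, d_hat):
    """KL divergence from the prediction to the ground truth."""
    return sum(a * math.log(a / b) for a, b in zip(d, d_hat))

def cosine(d, d_hat):
    """Cosine of the angle between the two vectors (larger is better)."""
    dot = sum(a * b for a, b in zip(d, d_hat))
    norm = math.sqrt(sum(a * a for a in d)) * math.sqrt(sum(b * b for b in d_hat))
    return dot / norm

def intersection(d, d_hat):
    """Sum of the label-wise minima (larger is better)."""
    return sum(min(a, b) for a, b in zip(d, d_hat))

truth = [0.5, 0.3, 0.2]
prediction = [0.4, 0.4, 0.2]
for metric in (chebyshev, clark, canberra, kullback_leibler, cosine, intersection):
    print(metric.__name__, metric(truth, prediction))
```

For two identical distributions the four distances evaluate to zero and the two similarities evaluate to one, which gives a quick sanity check for any implementation.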
Datasets  Ours  FCM  KM  LP  ML  GLLE 
Yeastalpha  0.0208(1)  0.0426(4)  0.0588(6)  0.0401(3)  0.0553(5)  0.0310(2) 
Yeastcdc  0.0231(1)  0.0513(4)  0.0729(6)  0.0421(3)  0.0673(5)  0.0325(2) 
Yeastcold  0.0690(1)  0.1325(4)  0.2522(6)  0.1129(3)  0.2480(5)  0.0903(2) 
Yeastdiau  0.0580(1)  0.1248(4)  0.2500(6)  0.0904(3)  0.1330(5)  0.0789(2) 
Yeastdtt  0.0592(1)  0.0932(3)  0.2568(5)  0.1184(4)  0.2731(6)  0.0651(2) 
Yeastelu  0.0256(1)  0.0512(4)  0.0788(6)  0.0441(3)  0.0701(5)  0.0287(2) 
Yeastheat  0.0532(1)  0.1603(4)  0.1742(5)  0.0803(3)  0.1776(6)  0.0563(2) 
Yeastspo  0.0641(1)  0.1300(4)  0.1753(6)  0.0834(3)  0.1722(5)  0.0670(2) 
Yeastspo5  0.1017(2)  0.1622(4)  0.2773(6)  0.1142(3)  0.2730(5)  0.0980(1) 
Yeastspoem  0.0921(1)  0.2333(4)  0.4006(6)  0.1632(3)  0.3974(5)  0.1071(2) 
Natural_Scene  0.3355(5)  0.3681(6)  0.3060(3)  0.2753(1)  0.2952(2)  0.3349(4) 
Movie  0.1254(1)  0.2302(4)  0.2340(6)  0.1617(3)  0.2335(5)  0.1601(2) 
SBU_3DFE  0.1285(1)  0.1356(3)  0.2348(6)  0.1293(2)  0.2331(5)  0.1412(4) 
Avg. Rank  1.38  4.00  5.62  2.84  4.92  2.23 
Datasets  Ours  FCM  KM  LP  ML  GLLE 
Yeastalpha  0.9852(1)  0.9221(3)  0.8115(5)  0.9220(4)  0.7519(6)  0.9731(2) 
Yeastcdc  0.9857(1)  0.9236(3)  0.7541(6)  0.9162(4)  0.7591(5)  0.9597(2) 
Yeastcold  0.9804(1)  0.9220(4)  0.7789(6)  0.9251(3)  0.7836(5)  0.9690(2) 
Yeastdiau  0.9710(1)  0.8901(4)  0.7990(6)  0.9153(3)  0.8032(5)  0.9397(2) 
Yeastdtt  0.9847(1)  0.9599(3)  0.7602(6)  0.9210(4)  0.7631(5)  0.9832(2) 
Yeastelu  0.9841(1)  0.9502(3)  0.7588(5)  0.9110(4)  0.7562(6)  0.9813(2) 
Yeastheat  0.9803(1)  0.8831(4)  0.7805(6)  0.9320(3)  0.7845(5)  0.9800(2) 
Yeastspo  0.9719(1)  0.9092(4)  0.8001(6)  0.9390(3)  0.8033(5)  0.9681(2) 
Yeastspo5  0.9697(2)  0.9216(4)  0.8820(6)  0.9694(3)  0.8841(5)  0.9713(1) 
Yeastspoem  0.9761(1)  0.8789(4)  0.8122(6)  0.9500(3)  0.8149(5)  0.9681(2) 
Natural_Scene  0.7797(4)  0.5966(6)  0.7488(5)  0.8602(2)  0.8231(1)  0.7822(3) 
Movie  0.9321(1)  0.7732(6)  0.8902(4)  0.9215(2)  0.8153(5)  0.9000(3) 
SBU_3DFE  0.9233(1)  0.9117(3)  0.8126(6)  0.9203(2)  0.8150(5)  0.9000(4) 
Avg. Rank  1.31  3.92  5.62  3.08  4.85  2.23 
Datasets  Ours  PTBayes  AABP  SAIIS  SABFGS  LDLSCL  EDLLRL  LDLLC 
Yeastalpha  0.2097±0.003(1)  1.1541±0.034(8)  0.7236±0.060(7)  0.3053±0.006(6)  0.2689±0.008(5)  0.2098±0.002(2)  0.2126±0.000(4)  0.2098±0.006(3) 
Yeastcdc  0.2017±0.004(1)  1.0601±0.066(8)  0.5728±0.030(7)  0.2932±0.004(6)  0.2477±0.007(5)  0.2137±0.004(3)  0.2046±2.080(2)  0.2163±0.004(4) 
Yeastcold  0.1355±0.004(1)  0.5149±0.024(8)  0.1552±0.005(7)  0.1643±0.004(8)  0.1471±0.004(5)  0.1388±0.003(2)  0.1442±2.100(4)  0.1415±0.004(3) 
Yeastdiau  0.1960±0.006(1)  0.7487±0.042(8)  0.2677±0.010(7)  0.2409±0.006(6)  0.2201±0.002(5)  0.1986±0.002(2)  0.2011±0.003(4)  0.2010±0.006(3) 
Yeastdtt  0.0964±0.004(2)  0.4807±0.040(8)  0.1206±0.008(7)  0.1332±0.003(8)  0.1084±0.003(6)  0.0989±0.001(4)  0.0980±1.600(3)  0.0962±0.006(1) 
Yeastelu  0.1964±0.004(1)  1.0050±0.041(8)  0.5246±0.028(7)  0.2751±0.006(6)  0.2438±0.008(5)  0.2015±0.002(3)  0.2029±0.023(4)  0.1994±0.006(2) 
Yeastheat  0.1788±0.005(1)  0.6829±0.026(8)  0.2261±0.010(7)  0.2260±0.005(6)  0.1998±0.003(5)  0.1826±0.003(2)  0.1826±0.003(2)  0.1854±0.004(4) 
Yeastspo  0.2456±0.008(1)  0.6686±0.040(8)  0.2950±0.010(7)  0.2759±0.006(6)  0.2639±0.003(5)  0.2503±0.002(4)  0.2480±0.685(2)  0.2500±0.008(3) 
Yeastspo5  0.1785±0.007(1)  0.4220±0.020(8)  0.1870±0.005(3)  0.1944±0.009(6)  0.1962±0.001(7)  0.1881±0.004(4)  0.1915±0.020(5)  0.1837±0.007(2) 
Yeastspoem  0.1232±0.005(1)  0.3065±0.030(8)  0.1890±0.012(7)  0.1367±0.007(6)  0.1312±0.001(3)  0.1316±0.005(4)  0.1273±0.054(2)  0.1320±0.008(5) 
Natural_Scene  2.3612±0.541(1)  2.5259±0.015(8)  2.4534±0.018(4)  2.4703±0.019(6)  2.4754±0.013(7)  2.4580±0.012(5)  2.4519±0.005(2)  2.4456±0.019(3) 
Movie  0.5211±0.606(1)  0.8044±0.010(8)  0.6533±0.010(6)  0.5783±0.007(5)  0.5750±0.011(4)  0.5543±0.007(3)  0.6956±0.041(7)  0.5289±0.008(2) 
SBU_3DFE  0.3540±0.010(2)  0.4137±0.010(5)  0.4454±0.020(8)  0.4156±0.012(7)  0.3465±0.006(1)  0.3546±0.002(3)  0.3556±0.006(4)  0.4145±0.006(6) 
Avg. Rank  1.15  7.77  6.46  6.31  4.85  3.15  3.46  3.15 
Datasets  Ours  PTBayes  AABP  SAIIS  SABFGS  LDLSCL  EDLLRL  LDLLC 
Yeastalpha  0.9947±0.000(1)  0.8527±0.005(8)  0.9482±0.007(7)  0.9879±0.000(6)  0.9914±0.000(5)  0.9945±0.000(3)  0.9945±0.000(4)  0.9946±0.000(2) 
Yeastcdc  0.9955±0.000(1)  0.8544±0.012(8)  0.9590±0.003(7)  0.9871±0.000(6)  0.9913±0.000(4)  0.9904±0.000(5)  0.9939±8.070(2)  0.9932±0.000(3) 
Yeastcold  0.9893±0.001(1)  0.8884±0.008(8)  0.9859±0.001(6)  0.9838±0.000(7)  0.9871±0.000(5)  0.9886±0.000(3)  0.9892±0.034(2)  0.9883±0.001(4) 
Yeastdiau  0.9884±0.001(1)  0.8644±0.007(8)  0.9860±0.000(5)  0.9821±0.000(7)  0.9853±0.000(6)  0.9880±0.000(2)  0.9876±0.063(4)  0.9878±0.001(3) 
Yeastdtt  0.9943±0.000(1)  0.8976±0.012(8)  0.9909±0.001(6)  0.9889±0.000(7)  0.9928±0.000(5)  0.9939±0.000(3)  0.9940±0.021(2)  0.9939±0.001(4) 
Yeastelu  0.9942±0.000(1)  0.8600±0.008(8)  0.9623±0.003(7)  0.9876±0.000(6)  0.9912±0.000(5)  0.9939±0.000(3)  0.9938±0.001(4)  0.9940±0.000(2) 
Yeastheat  0.9884±0.001(1)  0.8655±0.008(8)  0.9814±0.001(6)  0.9810±0.000(7)  0.9857±0.000(5)  0.9880±0.000(2)  0.9880±0.029(3)  0.9876±0.001(4) 
Yeastspo  0.9776±0.001(1)  0.8672±0.010(8)  0.9686±0.003(7)  0.9718±0.001(6)  0.9745±0.000(5)  0.9768±0.000(4)  0.9772±0.010(2)  0.9770±0.001(3) 
Yeastspo5  0.9753±0.002(1)  0.8968±0.010(8)  0.9731±0.001(4)  0.9706±0.002(7)  0.9710±0.000(6)  0.9732±0.001(3)  0.9723±0.007(5)  0.9743±0.002(2) 
Yeastspoem  0.9803±0.001(1)  0.9187±0.010(8)  0.9728±0.003(7)  0.9764±0.001(6)  0.9786±0.000(3)  0.9784±0.001(4)  0.9796±0.008(2)  0.9784±0.002(5) 
Natural_Scene  0.7637±0.015(1)  0.5583±0.006(8)  0.6954±0.014(7)  0.6986±0.008(6)  0.7144±0.008(5)  0.7442±0.007(3)  0.7624±0.003(2)  0.7486±0.014(4) 
Movie  0.9385±0.002(1)  0.8495±0.003(8)  0.8767±0.006(7)  0.9089±0.002(4)  0.8780±0.004(5)  0.9205±0.002(3)  0.8780±0.005(6)  0.9381±0.003(2) 
SBU_3DFE  0.9644±0.004(1)  0.9167±0.004(8)  0.9181±0.005(7)  0.9202±0.004(5)  0.9482±0.001(3)  0.9436±0.000(4)  0.9636±0.002(2)  0.9198±0.002(6) 
Avg. Rank  1.00  8.00  6.38  6.15  4.77  3.54  3.08  3.38 
4.2 Methodology
To show the effectiveness of the proposed methods, we conducted comprehensive experiments on the aforementioned datasets. For LE, the proposed BDLE method is compared with five classical LE approaches presented in Xu et al. (2018), i.e., FCM, KM, LP, ML, and GLLE. The hyperparameter in the FCM method is set to 2. We select the Gaussian kernel as the kernel function in the KM algorithm. For GLLE, the parameter is set to 0.01. Moreover, the number of neighbors is set to in both GLLE and ML.
For the LDL paradigm, the proposed BDLDL method is compared with eight existing algorithms, including PTBayes Geng and Ji (2013), PTSVM Geng et al. (2014), AAkNN Geng et al. (2010), AABP Geng et al. (2013), SAIIS Geng et al. (2013), SABFGS Geng (2016), LDLSCL Zheng et al. (2018), and EDLLRL Jia et al. (2019b), to demonstrate its superiority. The first two algorithms are implemented with the strategy of problem transformation. The next two are carried out by means of algorithm adaptation. Finally, the fifth through the last are specialized algorithms. In particular, LDLSCL and EDLLRL are recently proposed state-of-the-art methods. We utilized the “C-SVC” type in LIBSVM to implement PTSVM using the RBF kernel with parameters and . We set the hyperparameter in AAkNN to 5. The number of hidden-layer neurons for AABP was set to 60. The parameters , and in LDLSCL were all set to . Regarding the EDLLRL algorithm, we set the regularization parameters and to and , respectively. For the intermediate algorithm K-means, the number of clusters was set to 5 according to the suggestion of Jia et al. (2018). For the BFGS optimization used in SABFGS and BDLDL, parameters and were set to and 0.9, respectively. Regarding the two bidirectional algorithms, parameters are tuned from the range using the grid-search method. The two parameters in BDLE and are both set to . As for BDLDL, the parameters and are set to and , respectively. Finally, we train the LDL model with the recovered label distributions for further evaluation of BDLE. The details of parameter selection are shown in the parameter analysis section. The experiments for the LDL algorithms on every dataset are conducted with ten-fold cross-validation.
4.3 Results
4.3.1 BDLE performance
Tables 1 and 2 present the results of six LE methods on all the datasets. Constrained by the page limit, we have only shown two representative results measured on Chebyshev and Cosine in this paper.
For each dataset, the results of a specific algorithm are listed as a column in the tables in accordance with the metric used. Note that there is always an entry highlighted in boldface. This entry indicates that the algorithm evaluated by the corresponding measurement achieves the best performance. The experimental results are presented in the form of “score (rank)”; “score” denotes the difference between a predicted distribution and the real one measured by the corresponding metric; “rank” is a direct value to evaluate the effectiveness of the compared algorithms. Moreover, the symbol “↓” means “the smaller the better”, whereas “↑” indicates “the larger the better.”
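The “score (rank)” convention and the average-rank rows can be reproduced with a short sketch. The helper names are hypothetical; the two rows of scores are the Chebyshev values reported for Yeastalpha and Natural_Scene in the LE comparison, and ties, which do not occur in these rows, are broken by order of appearance.

```python
def rank_scores(scores, smaller_is_better=True):
    """Return 1-based ranks per method for one dataset."""
    order = sorted(scores, key=scores.get, reverse=not smaller_is_better)
    return {method: position + 1 for position, method in enumerate(order)}

def average_ranks(table, smaller_is_better=True):
    """Average each method's rank over all datasets in the table."""
    totals = {}
    for row in table.values():
        for method, rank in rank_scores(row, smaller_is_better).items():
            totals[method] = totals.get(method, 0) + rank
    return {method: total / len(table) for method, total in totals.items()}

# Chebyshev scores (smaller is better) copied from two rows of the LE table.
chebyshev_rows = {
    "Yeastalpha": {"Ours": 0.0208, "FCM": 0.0426, "KM": 0.0588,
                   "LP": 0.0401, "ML": 0.0553, "GLLE": 0.0310},
    "Natural_Scene": {"Ours": 0.3355, "FCM": 0.3681, "KM": 0.3060,
                      "LP": 0.2753, "ML": 0.2952, "GLLE": 0.3349},
}

for dataset, row in chebyshev_rows.items():
    ranks = rank_scores(row)
    print(dataset, {m: f"{row[m]:.4f}({ranks[m]})" for m in row})
print(average_ranks(chebyshev_rows))
```

For a similarity measure such as Cosine, passing `smaller_is_better=False` reverses the ordering so that the largest score receives rank 1.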
It is worth noting that, since the LE method is regarded as a preprocessing method, there is no need to run it several times and record the mean and standard deviation. According to the results, the proposed BDLE clearly outperforms the other LE algorithms in most cases and renders suboptimal performance in only about 4.7% of cases. In addition, BDLE achieves better prediction results than GLLE in most cases, especially on the Movie dataset. From Table 1 we can see that the largest dimensional gap between the input space and the output one is exactly in the Movie dataset. This indicates that the reconstruction projection is a reasonable addition to the LE algorithm. The two specialized algorithms, namely BDLE and GLLE, rank first in 91.1% of cases. By contrast, the label distributions are hardly recovered by the other four algorithms. This indicates the superiority of utilizing direct similarity or distance as the loss function in LDL and LE problems. In summary, the performance of the six LE algorithms is ranked from best to worst as follows: BDLE, GLLE, LP, FCM, ML, and KM. This demonstrates the effectiveness of our proposed bidirectional loss function for the LE method.
4.3.2 BDLDL performance
As for the performance of the BDLDL method, we show the numerical results on 13 real-world datasets under the Clark and Cosine measurements in Tables 3 and 4 in the format “mean±std (rank)”; the item in bold in every row represents the best performance. One may observe that our algorithm BDLDL outperforms the other classical LDL algorithms in most cases. When measured by Cosine, BDLDL achieves the best performance on every dataset, which strongly demonstrates the effectiveness of our proposed method. Besides, it can be found from Table 3 that although LDLLC and SABFGS obtain the best results on Yeastdtt and SBU_3DFE, respectively, when measured with Clark, BDLDL still ranks second. It can also be seen from the results that the PT and AA algorithms perform poorly in most cases. This verifies the superiority of utilizing the direct similarity or distance between the predicted label distribution and the true one as the loss function. Moreover, our proposed method achieves superior performance over other existing specialized algorithms, which ignore the reconstruction error. This indicates that such a bidirectional loss function can truly boost the performance of LDL algorithms.
Dataset  Canberra (UDLDL)  Canberra (BDLDL)  Intersection (UDLDL)  Intersection (BDLDL) 
Yeastalpha  0.7980±0.013  0.6013±0.011  0.8915±0.001  0.9624±0.001 
Yeastcdc  0.9542±0.012  0.6078±0.012  0.8869±0.001  0.9580±0.001 
Yeastcold  0.4515±0.010  0.2103±0.007  0.8779±0.002  0.9430±0.002 
Yeastdiau  0.6751±0.010  0.4220±0.013  0.8338±0.001  0.9414±0.002 
Yeastdtt  0.3724±0.010  0.1659±0.006  0.8775±0.002  0.9590±0.001 
Yeastelu  0.8005±0.007  0.5789±0.011  0.8776±0.001  0.9591±0.001 
Yeastheat  0.5706±0.010  0.3577±0.010  0.8591±0.002  0.9412±0.002 
Yeastspo  0.7075±0.016  0.5036±0.019  0.8301±0.003  0.9171±0.003 
Yeastspo5  0.4834±0.010  0.2745±0.011  0.8184±0.003  0.9112±0.003 
Yeastspoem  0.2998±0.010  0.1716±0.007  0.8129±0.005  0.9169±0.003 
Natural_Scene  6.9653±0.095  0.7319±0.040  0.3822±0.010  0.5395±0.011 
Movie  1.4259±0.024  0.0218±0.001  0.7429±0.004  0.8298±0.002 
SBU_3DFE  0.8562±0.021  0.0119±0.001  0.8070±0.004  0.8590±0.004 
4.3.3 LDL algorithm Predictive Performance
The reason to use the LE algorithm is that we need to recover the label distributions for LDL training. To verify the correctness and effectiveness of our proposed LE algorithm, we conducted an experiment that compares predictions made by an LDL model trained on the recovered label distributions with those made by the same LDL model trained on real label distributions. Moreover, for further evaluation of the proposed BDLDL, we selected BDLDL and SABFGS as the LDL models in this experiment. Owing to the page limit, we present only the experimental results measured with Chebyshev. The prediction results achieved by SABFGS and BDLDL are visualized as histograms in Figs. 1 and 2. Note that ‘Groundtruth’ in the figures represents the results obtained with the real label distributions. We regard these results as a benchmark rather than as a competitor in the evaluation. Meanwhile, we use ‘FCM’, ‘KM’, ‘LP’, ‘ML’, ‘GLLE’ and ‘BDLE’ to represent the performance of the corresponding LE algorithm in this experiment. As illustrated in Figs. 1 and 2, although the prediction on datasets movie and SBU_3DFE is worse than on the other datasets, ‘BDLE’ is still relatively close to ‘Groundtruth’ in most cases, especially on the first eleven datasets. We must mention that ‘BDLE’ combined with BDLDL is closer to ‘Groundtruth’ than ‘BDLE’ combined with SABFGS in all cases. This indicates that such a reconstruction constraint is general enough to improve both LE and LDL algorithms simultaneously.
4.4 Ablation Experiment
The bidirectional loss for either LE or LDL consists of two parts, i.e., the naive mapping loss and the reconstruction loss. To further demonstrate the effectiveness of the additional reconstruction term, we conduct an ablation experiment measured with Canberra and Intersection on all 13 datasets. We call the unidirectional algorithms without the reconstruction term UDLE and UDLDL, respectively, which are formulated as:
(27) 
(28) 
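The split of the bidirectional loss into a mapping term and a reconstruction term, and the unidirectional variant obtained by dropping the latter, can be illustrated with a small sketch. It is not a reproduction of Eqs. (27) and (28): the linear mappings `W` and `W_rec`, the squared Frobenius norm, the trade-off weight `lam` and the random toy data are all assumptions made here purely for illustration.

```python
import numpy as np

def mapping_loss(X, D, W):
    """Forward term: error of mapping features X to label distributions D.
    A linear map W with a squared Frobenius norm is assumed for illustration."""
    return float(np.linalg.norm(X @ W - D, "fro") ** 2)

def reconstruction_loss(X, D, W_rec):
    """Backward term: error of reconstructing features X from D.
    The separate linear map W_rec is a hypothetical choice."""
    return float(np.linalg.norm(D @ W_rec - X, "fro") ** 2)

def bidirectional_loss(X, D, W, W_rec, lam=1.0):
    """Mapping loss plus lam times reconstruction loss.
    Setting lam = 0 gives the unidirectional (ablated) objective."""
    return mapping_loss(X, D, W) + lam * reconstruction_loss(X, D, W_rec)

# Toy data (hypothetical): 6 instances, 4 features, 3 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
D = rng.dirichlet(np.ones(3), size=6)   # each row is a label distribution
W = rng.normal(size=(4, 3))
W_rec = rng.normal(size=(3, 4))

uni = bidirectional_loss(X, D, W, W_rec, lam=0.0)   # mapping term only
bi = bidirectional_loss(X, D, W, W_rec, lam=0.5)    # with reconstruction term
print(f"unidirectional: {uni:.4f}  bidirectional: {bi:.4f}")
```

The point of the sketch is only the structure of the objective: the ablation compares a model trained with the reconstruction term against the same model trained without it.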
Since the objective function of UDLE is identical to that of GLLE, the corresponding comparison can be found in Tables 2 and 3. The LDL prediction results under the Canberra and Intersection measures are tabulated in Table 6.
From Table 6 we can see that BDLDL obtains superior results on all benchmark datasets, i.e., introducing the reconstruction term can truly boost the performance of an LDL algorithm. As expected, the three largest improvements in Canberra distance are achieved on datasets Natural_Scene, Movie and SBU_3DFE, which have a relatively large dimensional gap between the feature space and the label space.
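The ranking of improvements can be checked directly from the mean Canberra values in Table 6. The short script below is only an arithmetic check on those tabulated means (standard deviations are ignored), and the use of relative reduction as the yardstick is an assumption made here, since the text does not state how improvement is measured.

```python
# Mean Canberra distances from Table 6 as (UDLDL, BDLDL); lower is better.
canberra_means = {
    "Yeastalpha": (0.7980, 0.6013),
    "Yeastcdc": (0.9542, 0.6078),
    "Yeastcold": (0.4515, 0.2103),
    "Yeastdiau": (0.6751, 0.4220),
    "Yeastdtt": (0.3724, 0.1659),
    "Yeastelu": (0.8005, 0.5789),
    "Yeastheat": (0.5706, 0.3577),
    "Yeastspo": (0.7075, 0.5036),
    "Yeastspo5": (0.4834, 0.2745),
    "Yeastspoem": (0.2998, 0.1716),
    "Natural_Scene": (6.9653, 0.7319),
    "Movie": (1.4259, 0.0218),
    "SBU_3DFE": (0.8562, 0.0119),
}

# Relative reduction of the Canberra distance when the reconstruction term is added.
relative_reduction = {
    name: (ud - bd) / ud for name, (ud, bd) in canberra_means.items()
}

# Datasets sorted from largest to smallest relative reduction.
ranking = sorted(relative_reduction, key=relative_reduction.get, reverse=True)
top3 = ranking[:3]

for name in ranking:
    print(f"{name:14s} {100 * relative_reduction[name]:5.1f}%")
```

Under this yardstick the three datasets named above occupy the first three places; ranking by absolute reduction instead of relative reduction selects the same three datasets.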
4.5 Influence of Parameters
To examine the robustness of the proposed algorithms, we also analyze the influence of the trade-off parameters in the experiments, including , in BDLDL as well as and in BDLE. We run BDLE with and whose value range is [,], and parameters and involved in BDLDL use the same candidate label set as well. Owing to the page limit, we show only the experimental results on the Yeastcold dataset, measured with Chebyshev and Cosine. For further evaluation, the results are visualized with different colors in Figs. 5 and 6. When measured with Chebyshev, a smaller value means better performance and is shown closer to blue; by contrast, with Cosine, a larger value indicates better performance and is shown closer to red.
It is clear from Fig. 5 that when falls in a certain range [,], we can achieve relatively good recovery results with any value of . After conducting several experiments, we conclude that BDLE obtains its best performance when both and are about . Concerning the parameters of BDLDL, when the value of is selected within the range [,], the color varies very steadily, which means that the performance is not sensitive to this hyperparameter in that particular range. In addition, we can also see from Fig. 6 that has a stronger influence on the performance than when the value of is within the range [,].
5 Conclusion
Previous studies have shown that the LDL method can effectively solve label ambiguity problems, whereas the LE method is able to recover label distributions from logical labels. To improve the performance of LDL and LE methods, we propose a new loss function that combines the mapping error with the reconstruction error to compensate for the information lost through the dimensional gap between the input space and the output space. Extensive experiments show that the proposed loss function is general enough to improve both LDL and LE. In the future, we will explore whether there exists an end-to-end way to recover the label distributions under the supervision of the LDL training process.