Improving evidential deep learning via multi-task learning

12/17/2021
by Dongpin Oh, et al.

The evidential regression network (ENet) estimates a continuous target and its predictive uncertainty without costly Bayesian model averaging. However, the target may be predicted inaccurately due to the gradient shrinkage problem of the ENet's original loss function, the negative log marginal likelihood (NLL) loss. In this paper, the objective is to improve the prediction accuracy of the ENet while maintaining its efficient uncertainty estimation by resolving the gradient shrinkage problem. A multi-task learning (MTL) framework, referred to as MT-ENet, is proposed to accomplish this aim. In the MTL, we define the Lipschitz modified mean squared error (MSE) loss function as a second loss and add it to the existing NLL loss. The Lipschitz modified MSE loss is designed to mitigate gradient conflict with the NLL loss by dynamically adjusting its Lipschitz constant, so that it does not disturb the uncertainty estimation of the NLL loss. The MT-ENet enhances the predictive accuracy of the ENet without losing uncertainty estimation capability on a synthetic dataset and real-world benchmarks, including drug-target affinity (DTA) regression. Furthermore, the MT-ENet shows remarkable calibration and out-of-distribution detection capability on the DTA benchmarks.



1 Introduction

The essential task in safety-critical deep learning systems is to quantify uncertainty (Gal, 2016; Kendall and Gal, 2017). The factors contributing to uncertainty can be classified into two types: irreducible observation noise (aleatoric uncertainty) and the uncertainty of model parameters (epistemic uncertainty) (Gal, 2016; Guo et al., 2017). In particular, the difficulty and expense involved in representing the uncertainty of the model parameters make it challenging to quantify epistemic uncertainty.

Common approaches to estimating epistemic uncertainty are ensemble-based methods and Bayesian neural networks (BNNs) (Wilson and Izmailov, 2020). Both have been shown to produce impressive results in terms of accuracy and the robustness of uncertainty estimation (Welling and Teh, 2011; Neal, 2012; Blundell et al., 2015; Lakshminarayanan et al., 2017; Gal and Ghahramani, 2016; Maddox et al., 2019). However, since ensembles require multiple models, and BNNs require expensive approximations of the intractable posterior, neither methodology is cost-effective in real-world applications (Amini et al., 2020).

In contrast, the evidential neural network (ENet) is a cost-effective deterministic neural network designed to accomplish uncertainty estimation by generating conjugate prior parameters as its outputs (Malinin and Gales, 2018; Sensoy et al., 2018; Gurevich and Stuke, 2020; Malinin et al., 2020). The ENet can model both epistemic and aleatoric uncertainty based on the uncertainty of the generated conjugate prior. The ability of the ENet to estimate uncertainty without an ensemble or an expensive posterior approximation is remarkable.

On the other hand, the fundamental goal of deep learning is to accomplish state-of-the-art predictive accuracy, not only to estimate uncertainty. Although the ENet architecture can achieve outstanding and practical uncertainty estimates, the NLL loss (negative log marginal likelihood), the original loss function of the ENet, may result in a high mean squared error (MSE) of the ENet's prediction under certain conditions. Intuitively, this problem arises because the NLL loss can finish training by declaring the target values unknown instead of correcting the ENet's prediction. We show that this phenomenon occurs when the estimated epistemic uncertainty is high. High epistemic uncertainty implies that the given training samples are sparse or biased (Gal, 2016), which is a common situation in real-world applications such as drug-target affinity prediction (Ezzat et al., 2016; Yang et al., 2021).

One possible solution is to reformulate the training objective of the ENet as a multi-task learning (MTL) loss with an additional loss function that optimizes only predictive accuracy, such as the MSE loss. However, in MTL, the conflicting-gradients problem can occur, which negatively impacts performance (Sener and Koltun, 2018; Lin et al., 2019; Yu et al., 2020). In particular, for safety-critical systems, it can be harmful if the uncertainty estimation capability is degraded, despite improved accuracy, due to gradient conflict between the losses. Therefore, to identify the cause of the gradient conflict in our MTL optimization, the gradients of the loss functions were analyzed. Our analysis reveals the condition under which this gradient conflict can occur.

Based on this analysis, we define the Lipschitz modified MSE loss, which is designed to mitigate the gradient conflict with the original NLL loss function of the ENet. We also propose an MTL framework using the Lipschitz MSE, named Multi-task-based evidential network (MT-ENet). In particular, our contributions are as follows:

  • We thoroughly show that (1) the NLL loss alone is not sufficient to optimize the prediction accuracy of the ENet, and (2) adding an MSE-based loss to the training objective can resolve this issue.

  • We establish the condition under which an MSE-based loss function can conflict with the NLL loss. To avoid this conflict, the Lipschitz modified MSE loss function is designed. Based on this novel Lipschitz loss function, we propose an MTL framework, the MT-ENet, that improves the prediction accuracy of the ENet while maintaining its uncertainty estimation capability.

  • Our experiments show that the MT-ENet outperforms other strong baselines on real-world regression benchmark datasets, and shows remarkable performance in uncertainty estimation and out-of-distribution detection on the drug-target affinity datasets.

2 Background

2.1 Problem setup

Assume we have a dataset for a regression task, $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{d}$ are the i.i.d. input data points, $d$ is the dimension of an input vector, $y_i \in \mathbb{R}$ are the real-valued targets, and $N$ is the number of data samples. We consider resolving the regression task by modeling a predictive distribution, $p(y \mid \mathbf{x}, \theta)$, where $f_\theta$ is a neural network and $\theta$ is its parameters.

Figure 1: A scheme of the architecture of the MT-ENet and ENet and simplified examples of the roles of the outputs $(\gamma, \nu, \alpha, \beta)$. (Left) An overview of the ENet architecture. The likelihood and marginal likelihood distribution can be calculated via the outputs of the ENet. (A) $\gamma$ determines the model prediction (point estimation); (B) $\nu$, $\alpha$, and $\beta$ determine the model uncertainty (uncertainty estimation); (C) The predictive distribution (marginal likelihood) of the ENet can be derived by marginalizing the likelihood over the NIG prior.

2.2 Evidential regression network

An evidential regression network (ENet) (Amini et al., 2020) considers a target value, $y$, as a sample drawn from a normal distribution with unknown parameters, $\mathcal{N}(\mu, \sigma^2)$. These parameters, $\mu$ and $\sigma^2$, are drawn from a Normal-Inverse-Gamma (NIG) distribution, which is the conjugate prior to the normal distribution:

$$p(\mu, \sigma^2 \mid \gamma, \nu, \alpha, \beta) = \mathcal{N}(\mu;\, \gamma, \sigma^2/\nu)\,\Gamma^{-1}(\sigma^2;\, \alpha, \beta), \tag{1}$$

where $\gamma \in \mathbb{R}$, $\nu > 0$, $\alpha > 1$, $\beta > 0$, and $\Gamma^{-1}$ is the inverse-gamma distribution. The NIG distribution in Eq. 1 is parameterized by $\mathbf{m} = (\gamma, \nu, \alpha, \beta)$, which is the output of the ENet, $f_\theta(\mathbf{x}) = \mathbf{m}$, where $\theta$ is a trainable parameter of the ENet (Fig 1).

With the NIG distribution in Eq. 1, the model prediction ($\mathbb{E}[\mu]$), aleatoric ($\mathbb{E}[\sigma^2]$), and epistemic ($\mathrm{Var}[\mu]$) uncertainty of the ENet can be calculated as follows:

$$\mathbb{E}[\mu] = \gamma, \qquad \mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1}, \qquad \mathrm{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}. \tag{2}$$

With these equations, we define point estimation as estimating the model prediction of the ENet, and uncertainty estimation as estimating the uncertainties, $\mathbb{E}[\sigma^2]$ and $\mathrm{Var}[\mu]$.

In addition to the uncertainty estimates in Eq. 2, pseudo observations can be used as an alternative interpretation of the predictive uncertainty (Murphy, 2007, 2012). This interpretation is widely used in the Bayesian statistics literature (Lee, 1989).

Definition 1.

We define $\nu$ and $\alpha$, the outputs of the ENet, as the pseudo observations.
Note that the aleatoric and the epistemic uncertainty increase as the pseudo observations decrease because they are inversely proportional ($\mathbb{E}[\sigma^2] \propto 1/(\alpha - 1)$, $\mathrm{Var}[\mu] \propto 1/(\nu(\alpha - 1))$).
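As a concrete illustration, the NIG moments in Eq 2 can be computed directly from the four network outputs. The following is a minimal sketch (the function name is ours, not the authors'), assuming $\alpha > 1$ so that the inverse-gamma mean exists:

```python
def nig_moments(gamma, nu, alpha, beta):
    """Point estimate and uncertainties from NIG parameters (requires alpha > 1)."""
    prediction = gamma                       # E[mu]: model prediction
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: observation noise
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: model uncertainty
    return prediction, aleatoric, epistemic
```

Consistent with Definition 1, increasing the pseudo observation $\nu$ with everything else fixed shrinks only the epistemic term.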

2.3 Training the ENet with the marginal likelihood

The ENet learns its trainable parameters by maximizing the marginal likelihood with respect to the unknown Gaussian parameters, $\mu$ and $\sigma^2$. By analytically marginalizing the NIG distribution (Eq. 1) over these parameters $\mu$ and $\sigma^2$, we can derive the following marginal likelihood:

$$p(y \mid \mathbf{m}) = \mathrm{St}\!\left(y;\, \gamma,\; \frac{\beta(1+\nu)}{\nu\alpha},\; 2\alpha\right), \tag{3}$$

where $\mathrm{St}$ is the Student-t distribution with the location parameter ($\gamma$), the scale parameter ($\beta(1+\nu)/(\nu\alpha)$), and the degrees of freedom ($2\alpha$).

The training procedure for the ENet is to minimize the negative log marginal likelihood (NLL) loss function, $\mathcal{L}^{\mathrm{NLL}}$, as summarized in the following equation:

$$\mathcal{L}^{\mathrm{NLL}} = \frac{1}{2}\log\frac{\pi}{\nu} - \alpha\log\Omega + \left(\alpha + \frac{1}{2}\right)\log\!\left(\nu(y-\gamma)^2 + \Omega\right) + \log\frac{\Gamma(\alpha)}{\Gamma(\alpha + 1/2)}, \tag{4}$$

where $\Gamma$ is the gamma function and $\Omega = 2\beta(1+\nu)$. Besides allowing the ENet to predict a proper point estimate $\gamma$, the NLL loss function allows the ENet to quantify the aleatoric and epistemic uncertainty via Eq. 2.
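The NLL loss of Eq 4 takes only a few lines. The sketch below follows the standard closed form from Amini et al. (2020) with $\Omega = 2\beta(1+\nu)$; the function name is ours, and this is not the authors' exact implementation:

```python
import math

def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log marginal (Student-t) likelihood of the ENet, Eq 4."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))
```

With $(\nu, \alpha, \beta)$ fixed, the loss is minimized exactly when $\gamma = y$, since only the third term depends on $\gamma$.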

3 Gradient shrinking of the NLL loss

In this section we show that the NLL loss alone is not sufficient to optimize the prediction accuracy of the ENet. Even though a model trained on the NLL loss can predict a proper point estimate, $\gamma$, the model can also achieve a lower NLL loss by increasing the predictive uncertainty instead of achieving higher predictive accuracy. This limitation can be resolved by using an additional loss function. To formally state this, we first define the gradient vector of the loss function:

Definition 2.

Consider a loss function $\mathcal{L}$ and a set of outputs of a model, $O$, with parameters $\theta$. Let $s$ be a subset of the model outputs, $s \subseteq O$; then $\nabla_{s}\mathcal{L}$ denotes the gradient vector of $\mathcal{L}$ with respect to the subset of the model outputs, $s$. We omit the dependence on $\theta$ to simplify notation.

In order to increase the accuracy of the ENet, the model prediction ($\gamma$) should be trained to get closer to the true value $y$ via the corresponding gradient $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$. For an ideal loss function, the gradient magnitude becomes zero only when the model prediction and the true value are identical. However, if the predictive uncertainty is severely increased by the NLL loss, the gradient for the model prediction ($\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$) can be vanishingly small despite an inaccurate prediction. Thus, we aim to verify that the gradient magnitude for the model prediction, $\|\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}\|$, is insignificant despite an incorrect model prediction $\gamma$. Specifically, we prove that $\|\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}\|$ converges to zero as the pseudo observation ($\nu$) goes to zero, as shown in Theorem 1:

Theorem 1.

(Shrinking NLL gradient)¹ Let $(\gamma, \nu, \alpha, \beta)$ be the output of the ENet, and assume that $\gamma$, $\alpha$, and $\beta$ are fixed. Then for every real $\epsilon > 0$, there exists a real $\delta > 0$ such that for every $\nu$, $0 < \nu < \delta$ implies $\|\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}\| < \epsilon$. Therefore, $\lim_{\nu \to 0} \|\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}\| = 0$.

¹Proofs and details of all the theorems and propositions of this paper are given in Appendix A.

Theorem 1 states that the NLL loss itself is insufficient to correct the point estimate ($\gamma$) because it cannot fully utilize the error value, $y - \gamma$, when the pseudo observation ($\nu$) is very low. This vanishing pseudo observation signifies infinitely high epistemic uncertainty of the ENet ($\mathrm{Var}[\mu] \to \infty$), since $\mathrm{Var}[\mu] \propto 1/\nu$. This statement is empirically justified in the experiment on the synthetic dataset, as shown in Fig 4.
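The shrinkage in Theorem 1 can be checked numerically: only the $(\alpha + 1/2)\log(\nu(y-\gamma)^2 + \Omega)$ term of Eq 4 depends on $\gamma$, which gives the closed-form derivative below (a sketch; the function name is ours):

```python
def dnll_dgamma(y, gamma, nu, alpha, beta):
    """Derivative of the NLL (Eq 4) w.r.t. the prediction gamma.
    As nu -> 0 the numerator vanishes while the denominator stays
    bounded away from zero, so the gradient shrinks to zero."""
    omega = 2.0 * beta * (1.0 + nu)
    return (alpha + 0.5) * (-2.0 * nu * (y - gamma)) / (nu * (y - gamma) ** 2 + omega)
```

Even with a large error $y - \gamma$, a tiny $\nu$ makes this gradient negligible, so training can stall without correcting the prediction.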

On the other hand, the simple MSE loss, $\mathcal{L}^{\mathrm{MSE}} = (y - \gamma)^2$, consistently trains the ENet to correctly estimate $\gamma$, since $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$ is independent of $\nu$ and $\alpha$. Thus, MSE-based loss functions, which update only the model prediction ($\gamma$), could improve the point estimation capability of the ENet.

Figure 2: Simplified examples of the gradient vectors. The red arrows represent the gradient of the MSE; the blue arrows represent the point estimation gradient of the NLL; the green arrows represent the uncertainty estimation gradient of the NLL. (a) Non-conflicting gradients. (b) Slightly conflicting gradients. (c) Conflicting gradients caused by the uncertainty estimation gradient. As Proposition 1 states, the direction of $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$ and $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$ is always the same.

4 Gradient conflict between the NLL and MSE

The previous section established that the simple MSE loss could resolve the gradient shrinkage problem. This finding motivates us to integrate the MSE loss with the NLL loss to form an MTL framework in order to improve predictive accuracy. When forming this MTL framework, it is important to avoid gradient conflict between tasks, because it can lead to performance degradation. In the MTL literature, this undesirable case is defined as gradient vectors pointing in different directions (Sener and Koltun, 2018; Yu et al., 2020).
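The conflict criterion used in this literature is easy to operationalize: two gradients conflict when their cosine similarity is negative. A minimal sketch (helper names are ours):

```python
import math

def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradient vectors (given as lists)."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    return dot / (n1 * n2)

def gradients_conflict(g1, g2):
    # Conflicting when the gradients point in different directions
    # (angle greater than 90 degrees).
    return cosine_similarity(g1, g2) < 0.0
```

This is the same quantity tracked in Fig 5 to compare the Lipschitz MSE against the simple MSE.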

To avoid this degradation, it is necessary to identify the cause of the gradient conflict. In this section, the gradient conflict between the simple MSE and the NLL loss function is considered. To do this, the gradient of the NLL is divided into two separate gradients: the uncertainty estimation gradient and the point estimation gradient, clarified in the following definition:

Definition 3.

We define the point estimation gradient of the NLL as $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$, since $\gamma$ performs the point estimation. The uncertainty estimation gradient of the NLL is defined as $\nabla_{(\nu,\alpha,\beta)}\mathcal{L}^{\mathrm{NLL}}$, since $\nu$, $\alpha$, and $\beta$ perform the uncertainty estimation. Finally, we define the total gradient of the NLL as $\nabla\mathcal{L}^{\mathrm{NLL}} = (\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}},\, \nabla_{(\nu,\alpha,\beta)}\mathcal{L}^{\mathrm{NLL}})$.

Then, we show that the point estimation gradient of the NLL from Definition 3 never conflicts with the gradient of the MSE, as shown in the following proposition:

Proposition 1.

(Non-conflicting point estimation)¹ If the gradient magnitudes of $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$ and $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$ are not zero, then the cosine similarity between $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$ and $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$ is always one: $\cos(\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}, \nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}) = 1$.

Proposition 1 states that, provided the model updates its parameters only through $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$ and $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$, the two gradients will never conflict. This is indicated by the red and blue arrows in Fig 2. A trivial consequence of Proposition 1 is the following corollary:

Corollary 1.

If the total gradient of the NLL and the gradient of the MSE are not in the same direction, then the uncertainty estimation gradient of the NLL and the gradient of the MSE are also not in the same direction.

Since Proposition 1 states that there is no gradient conflict between $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$ and $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$, Corollary 1 implies that the only cause of gradient conflict between the MSE and NLL losses is the uncertainty estimation gradient of the NLL. Therefore, if the gradient conflict between the uncertainty estimation gradient and the gradient of the MSE loss can be avoided, the MTL can be performed more effectively.

4.1 Conjecture of the gradient conflict

Figure 3: Illustration of the predictive distribution trained via the MSE and NLL losses. The dashed lines represent the model predictions ($\gamma$); the stars are the ground truth. (A) The NLL decreases the uncertainty; (B) the NLL increases the uncertainty. The optimization proceeds (1) → (2) → (3). The NLL loss of (B) is higher than that of (A) due to the gradient conflict between the MSE and NLL.

In the previous section, it was shown that the uncertainty estimation gradient of the NLL is the only cause of the gradient conflict. In this section, we illustrate how predictive uncertainty increased by the uncertainty estimation gradient can result in large gradient conflict. Fig 3 shows examples of the predictive distributions of the ENet trained on the MSE and NLL losses during training. In the multi-task learning literature, it is a common premise that the more severe the gradient conflict, the greater the loss (Chen et al., 2018; Lin et al., 2019; Yu et al., 2020; Chen et al., 2020; Javaloy and Valera, 2021). In this sense, since (B) has a higher loss than (A) even as the model prediction approaches its true value, we can consider the gradient conflict of (B) to be more significant.

The fundamental difference between the two cases is that the NLL increases uncertainty in (B) and decreases it in (A). Therefore, we conjecture that gradient conflicts between the MSE and NLL are more significant when the NLL increases the predictive uncertainty of the ENet, as in case (B). To alleviate this conflict in practice, the Lipschitz modified MSE loss function is designed. We empirically demonstrate that the Lipschitz modified MSE loss function successfully reduces the gradient conflict, as shown in Fig 5.

5 Multi-task based evidential neural network

A new multi-task loss function for the ENet is introduced in this section. The ultimate goal of this multi-task-based evidential neural network (MT-ENet) is to improve the predictive accuracy of the ENet without interrupting the uncertainty estimation training of the NLL loss. To accomplish this, the MT-ENet utilizes a novel Lipschitz loss function as an additional loss that mitigates gradient conflicts with the NLL loss.

5.1 Preventing the conflict via the Lipschitz MSE loss

In the previous section, we established that two gradients from the MSE and NLL could conflict when the NLL increases predictive uncertainty. In this section, this gradient conflict condition is specified to design the loss function that can resolve this conflict.

Specifically, we reinterpret this conflict situation as a change in the pseudo observations of the ENet, since an increase in predictive uncertainty is associated with a decrease in the pseudo observations ($\nu$, $\alpha$). This decrease in the pseudo observations by the NLL loss occurs when the difference between the model prediction and the true value exceeds specific thresholds, as shown in Proposition 2:

Proposition 2.

Let $e^2 = (y - \gamma)^2$, the squared error value of the ENet. If $e^2$ is larger than certain thresholds, $U_\nu$ and $U_\alpha$, then the derivative signs of $\mathcal{L}^{\mathrm{NLL}}$ w.r.t. $\nu$ and $\alpha$ are positive:

$$\frac{\partial \mathcal{L}^{\mathrm{NLL}}}{\partial \nu} > 0 \;\;\text{if}\;\; e^2 > U_\nu, \qquad \frac{\partial \mathcal{L}^{\mathrm{NLL}}}{\partial \alpha} > 0 \;\;\text{if}\;\; e^2 > U_\alpha, \tag{5}$$

where the thresholds depend on $(\nu, \alpha, \beta)$ and $\psi$, the digamma function.
Since both $\partial \mathcal{L}^{\mathrm{NLL}}/\partial \nu$ and $\partial \mathcal{L}^{\mathrm{NLL}}/\partial \alpha$ contain $e^2$, the thresholds ($U_\nu$ and $U_\alpha$) can be obtained by rearranging $\partial \mathcal{L}^{\mathrm{NLL}}/\partial \nu = 0$ and $\partial \mathcal{L}^{\mathrm{NLL}}/\partial \alpha = 0$ to solve for $e^2$ in each case (see Appendix A).

Proposition 2 states that when the difference between the predicted and true values is large, the NLL loss trains the ENet to increase the predictive uncertainty. This increase in predictive uncertainty results in the gradient conflict, which can lead to performance degradation, as explained in the previous section. Thus, our strategy is to design an additional loss function that regulates its own gradient when the uncertainty increases.

Lipschitz MSE loss

We propose a modified Lipschitz MSE loss based on Proposition 2 to mitigate the gradient conflict that occurs when the squared error is excessively large. Unlike the commonly used MSE loss, which is not Lipschitz continuous (Qi et al., 2020), we define a Lipschitz continuous loss function whose Lipschitz constant is dynamically determined by $U_\nu$ and $U_\alpha$ from Proposition 2.

Consider a minibatch, $B = \{(\mathbf{x}_i, y_i)\}$, extracted from our dataset and the corresponding outputs of the MT-ENet, $(\gamma_i, \nu_i, \alpha_i, \beta_i)$. For each $y_i$ and $\gamma_i$, the Lipschitz modified MSE is defined as follows:

$$\mathcal{L}^{\mathrm{Lip}}(\gamma_i) = \begin{cases} (y_i - \gamma_i)^2 & \text{if } (y_i - \gamma_i)^2 < U_i, \\ 2\sqrt{U_i}\,|y_i - \gamma_i| - U_i & \text{otherwise}, \end{cases} \tag{6}$$

where $U_i = \min(U_\nu, U_\alpha)$. The $\mathcal{L}^{\mathrm{Lip}}$ restricts its gradient magnitude by adjusting its Lipschitz constant, $2\sqrt{U_i}$, when the pseudo observations would otherwise be decreased (and thus the uncertainty increased) by the NLL loss. Note that $\mathcal{L}^{\mathrm{Lip}}$ does not propagate gradients to the uncertainty outputs ($\nu_i, \alpha_i, \beta_i$). We empirically demonstrate that the designed Lipschitz modified loss mitigates the gradient conflict with the NLL by illustrating the high cosine similarity between our Lipschitz loss and the NLL loss, as shown in Fig 5.
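For a given threshold $U$, the loss above behaves like a Huber loss: quadratic below the threshold and linear (hence $2\sqrt{U}$-Lipschitz in the error) above it. A sketch with the threshold passed in directly, since the exact $U_\nu$ and $U_\alpha$ formulas are derived in the paper's Appendix A:

```python
import math

def lipschitz_mse(y, gamma, u):
    """Quadratic inside the threshold u, linear with slope 2*sqrt(u) outside,
    so the loss is 2*sqrt(u)-Lipschitz in the error |y - gamma| (Eq 6)."""
    e2 = (y - gamma) ** 2
    if e2 < u:
        return e2
    return 2.0 * math.sqrt(u) * abs(y - gamma) - u
```

The two branches meet continuously at $e^2 = U$, so the loss stays smooth enough for gradient descent while its gradient magnitude is capped at $2\sqrt{U}$.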

The final objective of the MT-ENet

The final MTL objective of the MT-ENet, $\mathcal{L}^{\mathrm{MT}}$, is defined as a simple combination of the two losses and the regularization:

$$\mathcal{L}^{\mathrm{MT}} = \mathcal{L}^{\mathrm{NLL}} + \mathcal{L}^{\mathrm{Lip}} + \lambda \mathcal{L}^{R}, \tag{7}$$

where $\mathcal{L}^{R} = |y - \gamma|\,(2\nu + \alpha)$ is the evidence regularizer (Amini et al., 2020) and $\lambda$ is its coefficient.

Remark 1.

The evidential regularization, $\mathcal{L}^{R}$, could also train the model prediction, $\gamma$, like the NLL and MSE. However, the main goal of $\mathcal{L}^{R}$ is not training $\gamma$ but regulating the NLL loss by increasing the predictive uncertainty for incorrect predictions of the ENet (Sensoy et al., 2018; Amini et al., 2020). Therefore, $\mathcal{L}^{R}$ cannot play the role of the MSE-based loss function, which improves predictive accuracy. Further discussion is available in Appendix B.
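The evidence regularizer can be sketched as below; the form $|y - \gamma|(2\nu + \alpha)$ follows Amini et al. (2020). Note that its gradient w.r.t. $\gamma$ has magnitude $2\nu + \alpha$ regardless of how large the error is, which illustrates why it cannot substitute for an MSE-based accuracy loss:

```python
def evidence_regularizer(y, gamma, nu, alpha):
    """Evidence penalty of Amini et al. (2020): scales the total
    evidence (2*nu + alpha) by the absolute prediction error."""
    return abs(y - gamma) * (2.0 * nu + alpha)
```

Large evidence on a wrong prediction is penalized, pushing the model toward higher uncertainty there rather than toward a better point estimate.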

6 Related work

Related works on multi-task learning

Multi-task learning (MTL) is a paradigm for training several tasks with a single model (Ruder, 2017). A naïve strategy for MTL is training a model with a linear combination of the given loss functions: $\mathcal{L} = \sum_i w_i \mathcal{L}_i$. However, this linearly combined loss can be ineffective if the loss functions have trade-off relationships with one another (Sener and Koltun, 2018; Lin et al., 2019). Several studies have resolved this issue, in an architecture-agnostic manner, by dynamically adjusting the weights of the losses (Sener and Koltun, 2018; Guo et al., 2018; Kendall et al., 2018; Liu et al., 2019; Lin et al., 2019) or by modifying gradients (Chen et al., 2018, 2020; Yu et al., 2020). Like these previous works, we target the gradient conflict issue. In particular, we focus on applying MTL to the ENet while keeping the NLL loss to maintain its uncertainty estimation capability. To accomplish this, we defined the Lipschitz modified MSE, which reconciles with the uncertainty estimation by implicitly alleviating the gradient conflict.

7 Experiments

Figure 4: Synthetic dataset results of various models. Red lines represent the predictive mean; Red shades represent the predictive uncertainty; Blue highlighted regions represent our observations. The green lines separate the data-rich and data-sparse regions.

7.1 Synthetic dataset evaluation

Datasets RMSE
MC-Dropout Ensemble ENet MT-ENet
Boston 2.97(0.19) 3.28(1.00) 3.06(0.16) 3.04(0.21)
Concrete 5.23(0.12) 6.03(0.58) 5.85(0.15) 5.60(0.17)
Energy 1.66(0.04) 2.09(0.29) 2.06(0.10) 2.04(0.07)
Kin8nm 0.10(0.00) 0.09(0.00) 0.09(0.00) 0.08(0.00)
Naval 0.01(0.00) 0.00(0.00) 0.00(0.00) 0.00(0.00)
Power 4.02(0.04) 4.11(0.17) 4.23(0.09) 4.03(0.04)
Protein 4.36(0.01) 4.71(0.06) 4.64(0.03) 4.73(0.07)
Wine 0.62(0.01) 0.64(0.04) 0.61(0.02) 0.63(0.01)
Yacht 1.11(0.09) 1.58(0.48) 1.57(0.56) 1.03(0.08)
Datasets NLL
MC-Dropout Ensemble ENet MT-ENet
Boston 2.46(0.06) 2.41(0.25) 2.35(0.06) 2.31(0.04)
Concrete 3.04(0.02) 3.06(0.18) 3.01(0.02) 2.97(0.02)
Energy 1.99(0.02) 1.38(0.22) 1.39(0.06) 1.17(0.05)
Kin8nm -0.95(0.01) -1.20(0.02) -1.24(0.01) -1.19(0.01)
Naval -3.80(0.01) -5.63(0.05) -5.73(0.07) -5.96(0.03)
Power 2.80(0.01) 2.79(0.04) 2.81(0.07) 2.75(0.01)
Protein 2.89(0.00) 2.83(0.02) 2.63(0.00) 2.64(0.01)
Wine 0.93(0.01) 0.94(0.12) 0.89(0.05) 0.86(0.02)
Yacht 1.55(0.03) 1.18(0.21) 1.03(0.19) 0.78(0.06)
Table 1: RMSE and NLL on the UCI regression benchmark datasets. The best and second-best scores are highlighted in bold. Standard errors are reported in parentheses.

We first qualitatively evaluate the MT-ENet using a synthetic regression dataset. Fig 4 shows our synthetic data and the model predictions. Experiments were carried out on an imbalanced regression dataset for two reasons: 1) real-world regression tasks often involve an imbalanced target (Yang et al., 2021); 2) the ENet could fail on an imbalanced dataset. We observe that the predictions of the ENet for sparse samples (to the right of the green dotted lines in Fig 4) do not fit the data despite the well-calibrated uncertainty. This trend of the ENet (Fig 4 (c)) is in line with Theorem 1, which states that the point estimate of the ENet is likely to be inaccurate when the epistemic uncertainty is high. Conversely, the results obtained for the other models, including the MT-ENet, are acceptable in the sparse-sample regions. Training and experimental details are available in Appendix C.

7.2 Performance evaluation on real-world benchmarks

We evaluate the performance of the MT-ENet in comparison with strong baselines on the UCI regression benchmark datasets. The experimental settings and model architecture used in this study are identical to those used by Hernández-Lobato and Adams (2015). We report the average root mean squared error (RMSE) and the negative log-likelihood (NLL) of the models in Table 1. The MT-ENet generally provides the best or comparable RMSE and NLL performance. This is evidence that the MT-ENet and the Lipschitz modified MSE loss improve the predictive accuracy of the ENet without disturbing the original NLL loss function. Details of the experiment are given in Appendix C.

7.3 Drug-target affinity regression

Figure 5: Cosine similarity trends between gradients from the additional loss and NLL. (Blue) the MT-ENet with the modified MSE; (Red) the ENet with the simple MSE; (Purple) the ENet with the weighted (0.1) MSE.

Next, we evaluate the MT-ENet on a high dimensional complex drug-target affinity (DTA) regression, which is an essential task for early-stage drug discovery Shin et al. (2019). Our experiments use two well-known benchmark datasets in the DTA literature: Davis Davis et al. (2011) and Kiba Tang et al. (2014). The Davis and Kiba datasets consist of protein sequence data and simplified molecular line-entry system (SMILES) sequences. The target value of the data is the drug-target affinity, which is a single real value.

The model architecture of the baselines and the MT-ENet, referred to as DeepDTA (Öztürk et al., 2018), is the same except for the number of outputs: four outputs represent $(\gamma, \nu, \alpha, \beta)$ for the ENet and MT-ENet, and a single output represents the target value for the MC-Dropout. The inputs of the models are the SMILES and protein sequences. The DeepDTA architecture consists of one-dimensional convolution layers for embedding the input sequences and fully connected layers to predict the target values from the embedded features. The details of training and the model architectures are available in Appendix C.

Metrics

Our evaluation metrics are the MSE, NLL, ECE (expected calibration error), and CI (concordance index). The MSE and NLL are the typical losses in the optimizer. The role of the CI is to quantify the quality of the rankings and the predictive accuracy of the model (Steck et al., 2008; Yu et al., 2011). The CI is defined as:

$$\mathrm{CI} = \frac{1}{Z} \sum_{y_i > y_j} h(\hat{y}_i - \hat{y}_j), \qquad h(u) = \begin{cases} 1 & u > 0, \\ 0.5 & u = 0, \\ 0 & u < 0, \end{cases}$$

where $Z$ is the total number of data pairs with $y_i > y_j$; $y$ is a ground truth, and $\hat{y}$ is a predicted value. The ECE indicates how accurately the estimated uncertainty reflects the real accuracy (Guo et al., 2017; Kuleshov et al., 2018). The ECE can be calculated as:

$$\mathrm{ECE} = \frac{1}{m} \sum_{j=1}^{m} \left| \mathrm{acc}(c_j) - c_j \right|,$$

where $\mathrm{acc}(c_j)$ is the empirical accuracy at the confidence interval $c_j$ (e.g., $c_j = 0.95$) and $m$ is the number of intervals.
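The CI can be computed with a direct $O(n^2)$ pairwise loop; the sketch below is ours, not the authors' implementation:

```python
def concordance_index(y_true, y_pred):
    """Fraction of correctly ordered pairs; prediction ties count 0.5."""
    num, z = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:   # only pairs with a strict ground-truth order
                z += 1
                d = y_pred[i] - y_pred[j]
                num += 1.0 if d > 0 else (0.5 if d == 0 else 0.0)
    return num / z
```

A CI of 1.0 means the predicted affinities rank every pair correctly; 0.5 corresponds to random ordering.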

Alleviated gradient conflict by the Modified MSE

We first demonstrate that the Lipschitz modified MSE can alleviate the gradient conflict by measuring the cosine similarities of the gradient vectors. Note that, in this part of the study, we use the ENet with the simple MSE as the additional loss (MSE ENet) as the baseline, to determine whether our proposed Lipschitz loss alleviates the conflict.

The moving average of the cosine similarity between the gradients, averaged over 500 iterations, is shown in Fig 5. The trends of the MT-ENet indicate high cosine similarity, which is evidence that the Lipschitz modified MSE successfully alleviates the gradient conflict. The results also show that simply down-weighting the MSE loss (the purple lines) fails to resolve the conflict. Since the Lipschitz MSE is designed to avoid the conflict when the NLL loss results in high predictive uncertainty ((B) in Fig 3), this experimental result is in good agreement with our conjecture regarding the gradient conflict: when the NLL increases the predictive uncertainty of the model, gradient conflict is highly likely.

Performance evaluation

Davis
ENet MT ENet MSE ENet Dropout
CI 0.856(0.02) 0.864(0.01) 0.863(0.02) 0.884(0.00)
MSE 0.275(0.00) 0.273(0.01) 0.266(0.01) 0.248(0.01)
NLL -2.344(0.42) -2.424(0.07) -2.430(0.10) 0.633(0.02)
ECE 0.184(0.02) 0.156(0.03) 0.179(0.02) 0.217(0.01)
Kiba
ENet MT ENet MSE ENet Dropout
CI 0.885(0.00) 0.887(0.00) 0.888(0.00) 0.872(0.00)
MSE 0.190(0.00) 0.181(0.00) 0.176(0.00) 0.178(0.00)
NLL -1.544(0.05) -1.433(0.07) -1.276(0.04) 0.465(0.01)
ECE 0.077(0.03) 0.066(0.01) 0.081(0.02) 0.162(0.01)
BindingDB
ENet MT ENet MSE ENet Dropout
CI 0.822(0.00) 0.824(0.00) 0.831(0.00) 0.830(0.00)
MSE 0.704(0.01) 0.694(0.00) 0.637(0.01) 0.641(0.01)
NLL 0.944(0.02) 0.937(0.01) 0.972(0.01) 1.90(0.05)
ECE 0.018(0.01) 0.015(0.00) 0.027(0.01) 0.134(0.01)
Table 2: The performance evaluation results on the DTA benchmark datasets. MSE ENet represents the ENet using the simple MSE as the additional loss; Dropout represents the MC-Dropout. Standard deviations are reported in parentheses.

Table 2 presents the performance evaluation results on the Kiba and Davis datasets. The MT-ENet and MSE ENet successfully improve the predictive accuracy metrics (CI, MSE) on both datasets. This result confirms that additional MSE-based loss functions can improve the point estimation capability of the ENet, as described in Sec 3. Note also that the MT-ENet enhances accuracy without any significant degradation of the uncertainty estimation capability, as indicated by the NLL and ECE. Notably, on the calibration metric (ECE), the MT-ENet shows the best results among the examined models. Conversely, the NLL and ECE of the MSE ENet on the Kiba dataset are degraded to some extent. This sacrifice of uncertainty estimation ability can likely be attributed to the gradient conflict with the NLL loss function.

Out-of-distribution testing on the BindingDB dataset
Figure 6: Density plots of the log epistemic uncertainty, with the AUC-ROC also reported. The red distributions represent the ID data; the blue distributions represent the OOD data. MSE ENet represents the ENet with the simple MSE as the additional loss.

We examine the out-of-distribution (OOD) detection capability of the MT-ENet on the curated BindingDB dataset (Liu et al., 2007). The BindingDB dataset includes various drug-target (molecule-protein) sequence pairs. Because we excluded kinase protein samples, which are the target of the Kiba dataset, pairs of kinase proteins from the Kiba dataset and detergent molecules from PubChem (Kim et al., 2019) are considered the OOD data. The test split of the BindingDB dataset is considered the in-distribution (ID) data.

We randomly split the BindingDB dataset into training (80%), validation (10%), and test (10%) sets three times, and all examined models were trained on each of the three training sets. The details of the experiments and these datasets are available in Appendix B.

Fig 6 shows the distributions of log epistemic uncertainty and the AUC-ROC scores of the examined models. The MT-ENet and ENet attain high AUC-ROC scores, which implies excellent detection capability for OOD samples. In addition, Table 2 reports the performance on the BindingDB test set. Table 2 and Fig 6 show that the MT-ENet improves on the ENet in all metrics on the BindingDB dataset, including calibration and OOD detection. In contrast, the AUC-ROC and ECE of the MSE-ENet considerably underperform the other methods, even though the MSE-ENet shows the best MSE and CI on the BindingDB dataset; the exception is that its ECE and NLL are still better than those of the MC-dropout. This result is evidence that an additional loss function (the simple MSE) that does not avoid the gradient conflict can cause an overconfidence problem, since it disturbs the uncertainty estimation of the NLL loss.
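Using epistemic uncertainty as an OOD score, as above, reduces AUC-ROC to a rank statistic over the two uncertainty distributions. The sketch below is a minimal stand-in, not the authors' pipeline: the uncertainty values are synthetic, and `auc_roc` is a plain Mann-Whitney implementation.

```python
import numpy as np

def auc_roc(id_scores, ood_scores):
    """AUC-ROC via the Mann-Whitney U statistic: the probability that a
    randomly drawn OOD score exceeds a randomly drawn ID score
    (ties counted as 1/2)."""
    id_s = np.asarray(id_scores)[None, :]
    ood_s = np.asarray(ood_scores)[:, None]
    wins = (ood_s > id_s).sum() + 0.5 * (ood_s == id_s).sum()
    return float(wins) / (id_s.size * ood_s.size)

# Hypothetical epistemic-uncertainty scores: OOD inputs should receive
# systematically higher epistemic uncertainty than ID inputs.
rng = np.random.default_rng(0)
id_unc = rng.lognormal(mean=-1.0, sigma=0.5, size=1000)
ood_unc = rng.lognormal(mean=0.5, sigma=0.5, size=1000)
auc = auc_roc(id_unc, ood_unc)
```

When the OOD uncertainty distribution sits clearly above the ID one, as in the density plots of Fig 6, the resulting AUC-ROC approaches 1.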

Furthermore, density plots of the aleatoric uncertainty of the MSE-ENet, ENet, and MT-ENet are provided in Appendix D. They show that the aleatoric uncertainties remain at a similar level for the OOD and ID data. These results indicate that the MT-ENet is capable of disentangling the aleatoric and epistemic uncertainties.

8 Conclusion

We have theoretically and empirically demonstrated that an MSE-based loss improves the point estimation capability of the ENet by resolving the gradient shrinkage problem. We proposed the MT-ENet, which uses the Lipschitz-modified MSE as an additional training objective to ensure that the uncertainty estimation of the NLL loss is not disturbed. Our experiments show that the MT-ENet improves predictive accuracy while not only maintaining but sometimes even enhancing the uncertainty estimation of the ENet. Successful real-world deep learning systems must be accurate, reliable, and efficient. We look forward to further improving evidential deep learning, which is efficient and reliable, toward better accuracy through the MTL framework.

Acknowledgments

This research is a result of a study on the HPC Support Project, supported by the Ministry of Science and ICT and NIPA (National IT Industry Promotion Agency, Republic of Korea).

References

  • M. Abdel-Basset, H. Hawash, M. Elhoseny, R. K. Chakrabortty, and M. Ryan (2020) DeepH-dta: deep learning for predicting drug-target interactions: a case study of covid-19 drug repurposing. IEEE Access 8, pp. 170433–170451. Cited by: §C.3.
  • B. Agyemang, W. Wu, M. Y. Kpiebaareh, Z. Lei, E. Nanor, and L. Chen (2020) Multi-view self-attention for interpretable drug–target interaction prediction. Journal of Biomedical Informatics 110, pp. 103547. Cited by: §C.3.
  • A. Amini, W. Schwarting, A. Soleimany, and D. Rus (2020) Deep evidential regression. Advances in Neural Information Processing Systems 33. Cited by: Appendix B, §C.2, Appendix D, Appendix D, §1, §2.2, §5.1, §5.1.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §1.
  • Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich (2018) Gradnorm: gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pp. 794–803. Cited by: §4.1, §6.
  • Z. Chen, J. Ngiam, Y. Huang, T. Luong, H. Kretzschmar, Y. Chai, and D. Anguelov (2020) Just pick a sign: optimizing deep multitask models with gradient sign dropout. arXiv preprint arXiv:2010.06808. Cited by: §4.1, §6.
  • M. I. Davis, J. P. Hunt, S. Herrgard, P. Ciceri, L. M. Wodicka, G. Pallares, M. Hocker, D. K. Treiber, and P. P. Zarrinkar (2011) Comprehensive analysis of kinase inhibitor selectivity. Nature biotechnology 29 (11), pp. 1046–1051. Cited by: §7.3.
  • A. Ezzat, M. Wu, X. Li, and C. Kwoh (2016) Drug-target interaction prediction via class imbalance-aware ensemble learning. BMC bioinformatics 17 (19), pp. 267–276. Cited by: §1.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §C.2, §C.2, §1.
  • Y. Gal (2016) Uncertainty in deep learning. University of Cambridge 1 (3), pp. 4. Cited by: Figure A10, Appendix D, §1, §1.
  • P. Goel and L. Chen (2021) On the robustness of monte carlo dropout trained with noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2219–2228. Cited by: §C.3.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. Cited by: §1, §7.3.
  • M. Guo, A. Haque, D. Huang, S. Yeung, and L. Fei-Fei (2018) Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–287. Cited by: §6.
  • P. Gurevich and H. Stuke (2020) Gradient conjugate priors and multi-layer neural networks. Artificial Intelligence 278, pp. 103184. Cited by: Appendix D, §1.
  • T. He, M. Heidemeyer, F. Ban, A. Cherkasov, and M. Ester (2017) SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines. Journal of cheminformatics 9 (1), pp. 1–14. Cited by: §C.3.
  • J. M. Hernández-Lobato and R. Adams (2015) Probabilistic backpropagation for scalable learning of bayesian neural networks. In International Conference on Machine Learning, pp. 1861–1869. Cited by: §C.2, §C.2, §7.2.
  • A. Javaloy and I. Valera (2021) Rotograd: dynamic gradient homogenization for multi-task learning. arXiv preprint arXiv:2103.02631. Cited by: §4.1.
  • A. Kendall, Y. Gal, and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7482–7491. Cited by: §6.
  • A. Kendall and Y. Gal (2017) What uncertainties do we need in bayesian deep learning for computer vision?. Advances in Neural Information Processing Systems 30, pp. 5574–5584. Cited by: Figure A10, Appendix D, §1.
  • S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B. A. Shoemaker, P. A. Thiessen, B. Yu, et al. (2019) PubChem 2019 update: improved access to chemical data. Nucleic acids research 47 (D1), pp. D1102–D1109. Cited by: §C.4, §7.3.
  • V. Kuleshov, N. Fenner, and S. Ermon (2018) Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, pp. 2796–2804. Cited by: §7.3.
  • B. Lakshminarayanan, A. Pritzel, and C. Blundell (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems 30. Cited by: §C.2, §C.2, §1.
  • P. M. Lee (1989) Bayesian statistics. Oxford University Press, London. Cited by: §2.2.
  • X. Lin, H. Zhen, Z. Li, Q. Zhang, and S. Kwong (2019) Pareto multi-task learning. In Thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019), Cited by: §1, §4.1, §6.
  • S. Liu, Y. Liang, and A. Gitter (2019) Loss-balanced task weighting to reduce negative transfer in multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33-01, pp. 9977–9978. Cited by: §6.
  • T. Liu, Y. Lin, X. Wen, R. N. Jorissen, and M. K. Gilson (2007) BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic acids research 35 (suppl_1), pp. D198–D201. Cited by: §C.4, §7.3.
  • W. J. Maddox, P. Izmailov, T. Garipov, D. P. Vetrov, and A. G. Wilson (2019) A simple baseline for bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems 32, pp. 13153–13164. Cited by: §1.
  • A. Malinin and M. Gales (2018) Predictive uncertainty estimation via prior networks. In NIPS’18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, Vol. 31, pp. 7047–7058. Cited by: Appendix D, §1.
  • A. Malinin, S. Chervontsev, I. Provilkov, and M. Gales (2020) Regression prior networks. arXiv preprint arXiv:2006.11590. Cited by: Appendix D, §1.
  • K. P. Murphy (2007) Conjugate bayesian analysis of the gaussian distribution. Technical report. Cited by: §2.2.
  • K. P. Murphy (2012) Machine learning: a probabilistic perspective. MIT press. Cited by: §2.2.
  • R. M. Neal (2012) Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media. Cited by: §1.
  • J. Nixon, M. W. Dusenberry, L. Zhang, G. Jerfel, and D. Tran (2019) Measuring calibration in deep learning.. In CVPR Workshops, Vol. 2. Cited by: §C.3.
  • H. Öztürk, A. Özgür, and E. Ozkirimli (2018) DeepDTA: deep drug–target binding affinity prediction. Bioinformatics 34 (17), pp. i821–i829. Cited by: §C.3, §7.3.
  • T. Pahikkala, A. Airola, S. Pietilä, S. Shakyawar, A. Szwajda, J. Tang, and T. Aittokallio (2015) Toward more realistic drug–target interaction predictions. Briefings in bioinformatics 16 (2), pp. 325–337. Cited by: §C.3.
  • J. Qi, J. Du, S. M. Siniscalchi, X. Ma, and C. Lee (2020) On mean absolute error for deep neural network based vector-to-vector regression. IEEE Signal Processing Letters 27, pp. 1485–1489. Cited by: §5.1.
  • S. Ruder (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §6.
  • O. Sener and V. Koltun (2018) Multi-task learning as multi-objective optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 525–536. Cited by: §1, §4, §6.
  • M. Sensoy, L. Kaplan, and M. Kandemir (2018) Evidential deep learning to quantify classification uncertainty. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3183–3193. Cited by: Appendix D, §1, §5.1.
  • B. Shin, S. Park, K. Kang, and J. C. Ho (2019) Self-attention based molecule representation for predicting drug-target interaction. In Machine Learning for Healthcare Conference, pp. 230–248. Cited by: §C.3, §7.3.
  • H. Steck, B. Krishnapuram, C. Dehing-Oberije, P. Lambin, and V. C. Raykar (2008) On ranking in survival analysis: bounds on the concordance index. In Advances in neural information processing systems, pp. 1209–1216. Cited by: §7.3.
  • J. Tang, A. Szwajda, S. Shakyawar, T. Xu, P. Hintsanen, K. Wennerberg, and T. Aittokallio (2014) Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. Journal of Chemical Information and Modeling 54 (3), pp. 735–743. Cited by: §7.3.
  • M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 681–688. Cited by: §1.
  • A. G. Wilson and P. Izmailov (2020) Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791. Cited by: §1.
  • Y. Yang, K. Zha, Y. Chen, H. Wang, and D. Katabi (2021) Delving into deep imbalanced regression. arXiv preprint arXiv:2102.09554. Cited by: §1, §7.1.
  • C. Yu, R. Greiner, H. Lin, and V. Baracos (2011) Learning patient-specific cancer survival distributions as a sequence of dependent regressors. Advances in Neural Information Processing Systems 24, pp. 1845–1853. Cited by: §7.3.
  • T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn (2020) Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems 33. Cited by: §1, §4.1, §4, §6.
  • Y. Zeng, X. Chen, Y. Luo, X. Li, and D. Peng (2021) Deep drug-target binding affinity prediction with multiple attention blocks. Briefings in bioinformatics. Cited by: §C.3.
  • L. Zhao, J. Wang, L. Pang, Y. Liu, and J. Zhang (2020) GANsDTA: predicting drug-target binding affinity using gans. Frontiers in genetics 10, pp. 1243. Cited by: §C.3.

Appendix A Derivations and proofs

In this section, we provide the proofs of the theorems and propositions in the main manuscript. Before we give the proofs, we restate the evidential neural network, its outputs, and the definitions from the main manuscript.

Outputs of the evidential neural network

Let $f_{\theta}$ be the ENet, $\theta$ be the trainable parameters of the ENet, and $\mathbf{x} \in \mathbb{R}^{d}$ be the input data, where $d$ is its dimension. The outputs of the ENet are $f_{\theta}(\mathbf{x}) = (\gamma, \nu, \alpha, \beta)$, where $\nu > 0$, $\alpha > 1$, and $\beta > 0$. Specifically, the output of the ENet parameterizes the NIG distribution:

$$p(\mu, \sigma^{2} \mid \gamma, \nu, \alpha, \beta) = \mathcal{N}\!\left(\mu \,\middle|\, \gamma, \tfrac{\sigma^{2}}{\nu}\right)\,\Gamma^{-1}\!\left(\sigma^{2} \,\middle|\, \alpha, \beta\right).$$

The model prediction ($\mathbb{E}[\mu]$), aleatoric ($\mathbb{E}[\sigma^{2}]$), and epistemic ($\mathrm{Var}[\mu]$) uncertainty of the ENet can be calculated as follows:

$$\mathbb{E}[\mu] = \gamma, \qquad \mathbb{E}[\sigma^{2}] = \frac{\beta}{\alpha - 1}, \qquad \mathrm{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}.$$
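These summaries follow directly from the four outputs. The sketch below assumes the standard NIG moments of Amini et al. (2020); the helper name `nig_summaries` is ours, not part of any released code.

```python
def nig_summaries(gamma, nu, alpha, beta):
    """Prediction and uncertainties implied by NIG outputs
    (gamma, nu, alpha, beta) with nu > 0, alpha > 1, beta > 0:
        prediction  E[mu]      = gamma
        aleatoric   E[sigma^2] = beta / (alpha - 1)
        epistemic   Var[mu]    = beta / (nu * (alpha - 1))
    """
    assert nu > 0 and alpha > 1 and beta > 0
    return gamma, beta / (alpha - 1), beta / (nu * (alpha - 1))

pred, aleatoric, epistemic = nig_summaries(gamma=0.3, nu=2.0, alpha=3.0, beta=4.0)
# aleatoric = 4 / (3 - 1) = 2.0; epistemic = 4 / (2 * (3 - 1)) = 1.0
```

Note that epistemic uncertainty shrinks as the pseudo observation $\nu$ grows, which is the mechanism the evidential regularizer in Appendix B exploits.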

Definition 1.

We define $\nu$ and $\alpha$, the outputs of the ENet, as the pseudo observations.

Definition 2.

Consider a loss function $\mathcal{L}$ and the set of outputs of a model, $O = \{\gamma, \nu, \alpha, \beta\}$, with parameters $\theta$. Let $O'$ be a subset of the model outputs, $O' \subseteq O$; then $\nabla_{O'}\mathcal{L}$ denotes the gradient vector of $\mathcal{L}$ with respect to the subset of the model outputs, $O'$. We omit the dependence on $\theta$ to simplify notation.

Definition 3.

We define the point estimation gradient of the NLL as $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$, since $\gamma$ performs the point estimation. The uncertainty estimation gradient of the NLL is defined as $\nabla_{\{\nu,\alpha,\beta\}}\mathcal{L}^{\mathrm{NLL}}$, since $\nu$, $\alpha$, and $\beta$ perform the uncertainty estimation. Finally, we define the total gradient of the NLL as $\nabla_{O}\mathcal{L}^{\mathrm{NLL}}$.

a.1 Proof of theorem 1

We prove Theorem 1. Here, we first rewrite the NLL loss:

$$\mathcal{L}^{\mathrm{NLL}} = \frac{1}{2}\log\frac{\pi}{\nu} - \alpha\log\Omega + \left(\alpha + \frac{1}{2}\right)\log\!\left(\nu(y-\gamma)^{2} + \Omega\right) + \log\frac{\Gamma(\alpha)}{\Gamma\!\left(\alpha + \frac{1}{2}\right)}, \quad \text{where } \Omega = 2\beta(1+\nu).$$

Theorem 1.

(Shrinking NLL gradient). Let $\gamma$ be the output of the ENet, and assume that $\nu > 0$, $\alpha > 1$, and $\beta > 0$; then for every real $\epsilon > 0$, there exists a real $N > 0$ such that, for every $\beta$, $\beta > N$ implies $\left|\partial\mathcal{L}^{\mathrm{NLL}}/\partial\gamma\right| < \epsilon$. Therefore, $\lim_{\beta \to \infty} \partial\mathcal{L}^{\mathrm{NLL}}/\partial\gamma = 0$.

(Proof.) Where

$$\frac{\partial\mathcal{L}^{\mathrm{NLL}}}{\partial\gamma} = \frac{(2\alpha+1)\,\nu(\gamma - y)}{\nu(\gamma - y)^{2} + \Omega}, \qquad \Omega = 2\beta(1+\nu),$$

we show that for every $\epsilon > 0$, there exists an $N$ such that, for all $\beta$, if $\beta > N$, then $\left|\partial\mathcal{L}^{\mathrm{NLL}}/\partial\gamma\right| < \epsilon$.

If $\gamma = y$, the proposition is always true regardless of $\beta$, since $\partial\mathcal{L}^{\mathrm{NLL}}/\partial\gamma = 0$. Else:

$$\left|\frac{\partial\mathcal{L}^{\mathrm{NLL}}}{\partial\gamma}\right| = \frac{(2\alpha+1)\,\nu\,|\gamma - y|}{\nu(\gamma - y)^{2} + 2\beta(1+\nu)} \le \frac{(2\alpha+1)\,\nu\,|\gamma - y|}{2\beta(1+\nu)}.$$

Therefore, if we set $N = \frac{(2\alpha+1)\,\nu\,|\gamma - y|}{2\epsilon(1+\nu)}$, we have that $\beta > N$ implies $\left|\partial\mathcal{L}^{\mathrm{NLL}}/\partial\gamma\right| < \epsilon$. Consequently, the point estimation gradient in parameter space also vanishes, $\lim_{\beta\to\infty}\frac{\partial\mathcal{L}^{\mathrm{NLL}}}{\partial\gamma}\nabla_{\theta}\gamma = 0$, where $\theta$ is the set of trainable parameters of the ENet.
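The shrinkage can be illustrated numerically. The sketch below evaluates the closed-form $\gamma$-derivative of the standard NIG NLL (after Amini et al. 2020); the helper name `dnll_dgamma` and the chosen parameter values are our own.

```python
def dnll_dgamma(y, gamma, nu, alpha, beta):
    """gamma-derivative of the NIG negative log-likelihood, in closed form
    (derived from the standard NLL of Amini et al. 2020)."""
    omega = 2.0 * beta * (1.0 + nu)
    return (2.0 * alpha + 1.0) * nu * (gamma - y) / (nu * (gamma - y) ** 2 + omega)

# For a fixed prediction error, growing beta inflates the denominator,
# so the point-estimation gradient shrinks toward zero.
grads = [abs(dnll_dgamma(y=0.0, gamma=1.0, nu=1.0, alpha=2.0, beta=b))
         for b in (1.0, 10.0, 100.0, 1000.0)]
```

Even though the prediction error stays fixed at 1, the gradient magnitude decays roughly like $1/\beta$, which is exactly the gradient shrinkage that stalls the point estimation.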

a.2 Proof of Proposition 1

We show that the gradients of the NLL loss and the MSE loss with respect to the model prediction ($\gamma$) never conflict; specifically, the cosine similarity between the two gradients never becomes negative and is, in fact, always one. Then we provide the proof of Proposition 1:

Proposition 1.

(Non-conflicting point estimation). If the L2 norms of $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$ and $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$ are not zero, then the cosine similarity between $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$ and $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$ is always one. Hence, $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$ is positively proportional to $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$:

$$\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}} = c\,\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}, \qquad c > 0, \tag{A8}$$

where $c$ is a positive scalar.

(Proof.) We can express $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$ and $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$ as follows:

$$\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}} = \frac{(2\alpha+1)\,\nu}{\nu(\gamma - y)^{2} + \Omega}\,(\gamma - y), \qquad \nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}} = 2(\gamma - y),$$

where $\Omega = 2\beta(1+\nu) > 0$. We can calculate the cosine similarity, $\cos\phi$, between the two gradient vectors $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}}$ and $\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$. Since both gradients are strictly positive multiples of $(\gamma - y)$, their signs are always identical.

Therefore, we conclude: $\cos\phi = 1$ and $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}} = c\,\nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}}$ with $c = \frac{(2\alpha+1)\,\nu}{2\left(\nu(\gamma - y)^{2} + \Omega\right)} > 0$.
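The sign agreement can be checked numerically. In the sketch below (both helper names are ours, and the NIG-NLL derivative is the closed form implied by the standard NLL of Amini et al. 2020), every sample's NLL gradient and MSE gradient with respect to $\gamma$ share the same sign, as Proposition 1 asserts.

```python
import numpy as np

def grad_nll_gamma(y, gamma, nu, alpha, beta):
    # gamma-gradient of the NIG NLL (closed form, after Amini et al. 2020)
    omega = 2.0 * beta * (1.0 + nu)
    return (2.0 * alpha + 1.0) * nu * (gamma - y) / (nu * (gamma - y) ** 2 + omega)

def grad_mse_gamma(y, gamma):
    return 2.0 * (gamma - y)

rng = np.random.default_rng(0)
y = rng.normal(size=100)
gamma = y + rng.normal(size=100)       # predictions with nonzero errors
g_nll = grad_nll_gamma(y, gamma, nu=1.5, alpha=2.0, beta=1.0)
g_mse = grad_mse_gamma(y, gamma)
```

Per sample, both gradients are positive multiples of $(\gamma - y)$, so no choice of $(\nu, \alpha, \beta)$ can flip one of them against the other.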

a.3 Proof of Corollary 1

We prove Corollary 1, using Proposition 1 and the following property of the gradient vector $\mathbf{g}$:

  • Let $O$ be the set of outputs of a model and $\mathcal{L}$ be a loss function. If there are two disjoint subsets of $O$, $O_1$ and $O_2$, then $\nabla_{O_1 \cup O_2}\mathcal{L} = (\nabla_{O_1}\mathcal{L}, \nabla_{O_2}\mathcal{L})$.

Corollary 1.

(Source of gradient conflict). If the total gradient of the NLL and the gradient of the MSE are not in the same direction, then the uncertainty estimation gradient of the NLL and the gradient of the MSE are not in the same direction:

$$\nabla_{O}\mathcal{L}^{\mathrm{NLL}} \cdot \nabla_{O}\mathcal{L}^{\mathrm{MSE}} < 0 \;\implies\; \nabla_{\{\nu,\alpha,\beta\}}\mathcal{L}^{\mathrm{NLL}} \cdot \nabla_{\{\nu,\alpha,\beta\}}\mathcal{L}^{\mathrm{MSE}} < 0.$$

(Proof.) We consider the contrapositive of Corollary 1:

(Contrapositive of Corollary 1.)

If the gradients of the NLL loss from the uncertainty estimation outputs ($\nabla_{\{\nu,\alpha,\beta\}}\mathcal{L}^{\mathrm{NLL}}$) and the gradients of the MSE loss do not conflict, $\nabla_{\{\nu,\alpha,\beta\}}\mathcal{L}^{\mathrm{NLL}} \cdot \nabla_{\{\nu,\alpha,\beta\}}\mathcal{L}^{\mathrm{MSE}} \ge 0$, then the gradients of the two losses never conflict: $\nabla_{O}\mathcal{L}^{\mathrm{NLL}} \cdot \nabla_{O}\mathcal{L}^{\mathrm{MSE}} \ge 0$.

On account of the assumption, we have $\nabla_{\{\nu,\alpha,\beta\}}\mathcal{L}^{\mathrm{NLL}} \cdot \nabla_{\{\nu,\alpha,\beta\}}\mathcal{L}^{\mathrm{MSE}} \ge 0$. From Theorem 2, we get $\nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}} \cdot \nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}} \ge 0$. Since the total gradient decomposes over the disjoint subsets $\{\gamma\}$ and $\{\nu,\alpha,\beta\}$, we have:

$$\nabla_{O}\mathcal{L}^{\mathrm{NLL}} \cdot \nabla_{O}\mathcal{L}^{\mathrm{MSE}} = \nabla_{\gamma}\mathcal{L}^{\mathrm{NLL}} \cdot \nabla_{\gamma}\mathcal{L}^{\mathrm{MSE}} + \nabla_{\{\nu,\alpha,\beta\}}\mathcal{L}^{\mathrm{NLL}} \cdot \nabla_{\{\nu,\alpha,\beta\}}\mathcal{L}^{\mathrm{MSE}} \ge 0.$$

Since the contrapositive of Corollary 1 is true, we conclude that Corollary 1 is true.

a.4 Proof of Proposition 2

In this section, we prove Proposition 2. Let $e = (y - \gamma)^{2}$, which represents the squared error value. Then the gradients of the NLL loss w.r.t. the pseudo observations, $\nu$ and $\alpha$, of the NIG distribution are positive when $e$ is larger than certain thresholds, $e_{\nu}$ or $e_{\alpha}$:

Proposition 2.

Let $e = (y - \gamma)^{2}$, which is the squared error value of the ENet. Then, if $e$ is larger than certain thresholds, $e_{\nu}$ and $e_{\alpha}$, the derivative signs of the NLL w.r.t. $\nu$ and $\alpha$ are positive:

$$\frac{\partial\mathcal{L}^{\mathrm{NLL}}}{\partial\nu} > 0 \;\text{ if } e > e_{\nu}, \qquad \frac{\partial\mathcal{L}^{\mathrm{NLL}}}{\partial\alpha} > 0 \;\text{ if } e > e_{\alpha} = \frac{\Omega}{\nu}\left(e^{\psi\left(\alpha+\frac{1}{2}\right) - \psi(\alpha)} - 1\right), \tag{A9}$$

where $\psi$ is the digamma function and $\Omega = 2\beta(1+\nu)$.

(Proof.)

$$\frac{\partial\mathcal{L}^{\mathrm{NLL}}}{\partial\alpha} = \log\frac{\nu e + \Omega}{\Omega} + \psi(\alpha) - \psi\!\left(\alpha + \frac{1}{2}\right) > 0 \iff e > \frac{\Omega}{\nu}\left(e^{\psi\left(\alpha+\frac{1}{2}\right) - \psi(\alpha)} - 1\right),$$

and,

$$\frac{\partial\mathcal{L}^{\mathrm{NLL}}}{\partial\nu} = -\frac{1}{2\nu} - \frac{2\alpha\beta}{\Omega} + \frac{\left(\alpha + \frac{1}{2}\right)(e + 2\beta)}{\nu e + \Omega},$$

which is positive for sufficiently large $e$, since $\lim_{e\to\infty}\frac{\partial\mathcal{L}^{\mathrm{NLL}}}{\partial\nu} = \frac{\alpha}{\nu(1+\nu)} > 0$.
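As a numerical sanity check, the sketch below evaluates the $\alpha$-derivative of the standard NIG NLL (after Amini et al. 2020) on either side of the digamma threshold; `digamma` is our own numerical approximation via a central difference of `lgamma`, and the parameter values are illustrative.

```python
from math import lgamma, exp, log

def digamma(x, h=1e-6):
    # numerical digamma: central difference of log-gamma
    return (lgamma(x + h) - lgamma(x - h)) / (2.0 * h)

nu, alpha, beta = 1.0, 2.0, 1.0
omega = 2.0 * beta * (1.0 + nu)
# squared-error threshold above which dNLL/dalpha turns positive
e_thresh = (omega / nu) * (exp(digamma(alpha + 0.5) - digamma(alpha)) - 1.0)

def dnll_dalpha(e):
    """alpha-derivative of the NIG NLL at squared error e (sketch)."""
    return log((nu * e + omega) / omega) + digamma(alpha) - digamma(alpha + 0.5)
```

Below the threshold the derivative is negative (gradient descent increases $\alpha$, reducing uncertainty); above it the sign flips, so large errors push the pseudo observations down and the uncertainty up.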

Appendix B Evidential regularization

The evidential regularization, $\mathcal{L}^{\mathrm{R}}$, of the ENet was originally proposed in Amini et al. (2020):

$$\mathcal{L}^{\mathrm{R}} = |y - \gamma| \cdot (2\nu + \alpha). \tag{A10}$$

The role of $\mathcal{L}^{\mathrm{R}}$ is to regularize the NLL loss, $\mathcal{L}^{\mathrm{NLL}}$, by increasing the predictive uncertainty of incorrect predictions. In particular, if the difference between the model prediction, $\gamma$, and the true value, $y$, is large, it tends to decrease the pseudo observations, ($\nu, \alpha$):

$$\frac{\partial\mathcal{L}^{\mathrm{R}}}{\partial\nu} = 2\,|y - \gamma| \ge 0, \qquad \frac{\partial\mathcal{L}^{\mathrm{R}}}{\partial\alpha} = |y - \gamma| \ge 0. \tag{A11}$$

As we mentioned in Sec 2, a decrease in the pseudo observations leads to an increase in the predictive uncertainty because they are inversely proportional: $\mathbb{E}[\sigma^{2}] = \frac{\beta}{\alpha - 1}$ and $\mathrm{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}$. Therefore, the evidential regularizer increases the uncertainty of incorrect predictions, which have a large $|y - \gamma|$.
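The regularizer's behavior can be sketched directly, assuming the form $\mathcal{L}^{\mathrm{R}} = |y - \gamma|(2\nu + \alpha)$ of Amini et al. (2020); the helper name `evid_reg` is ours.

```python
def evid_reg(y, gamma, nu, alpha):
    """Evidential regularizer L^R = |y - gamma| * (2*nu + alpha).
    Its nu- and alpha-derivatives are 2|y-gamma| and |y-gamma|, both
    nonnegative, so gradient descent shrinks the pseudo observations
    (nu, alpha) whenever the prediction error is large."""
    return abs(y - gamma) * (2.0 * nu + alpha)

# Larger prediction error -> stronger pressure to shrink (nu, alpha),
# i.e., to raise the predictive uncertainty.
small_err = evid_reg(y=1.0, gamma=0.9, nu=2.0, alpha=3.0)   # error 0.1
large_err = evid_reg(y=1.0, gamma=-1.0, nu=2.0, alpha=3.0)  # error 2.0
```

Because the penalty scales linearly with $|y - \gamma|$, confident wrong predictions are penalized hardest, which is precisely the calibration mechanism described above.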

Moreover, $\mathcal{L}^{\mathrm{R}}$ also depends on $\gamma$, which represents the model prediction, as we can notice from eq A10. The derivative of $\mathcal{L}^{\mathrm{R}}$ with respect to $\gamma$ is the following: