1 Introduction
The essential task in safety-critical deep learning systems is to quantify uncertainty (Gal, 2016; Kendall and Gal, 2017). The factors contributing to uncertainty can be classified into two types: irreducible observation noise (aleatoric uncertainty) and uncertainty in the model parameters (epistemic uncertainty) (Gal, 2016; Guo et al., 2017). Epistemic uncertainty is particularly challenging to quantify because representing the uncertainty of the model parameters is difficult and expensive. Common approaches to estimating epistemic uncertainty are ensemble-based methods and Bayesian neural networks (BNNs) (Wilson and Izmailov, 2020). Both have been shown to produce impressive results in terms of accuracy and the robustness of uncertainty estimation (Welling and Teh, 2011; Neal, 2012; Blundell et al., 2015; Lakshminarayanan et al., 2017; Gal and Ghahramani, 2016; Maddox et al., 2019). However, since ensembles require multiple models and BNNs require expensive approximations of the intractable posterior, neither methodology is cost-effective in real-world applications (Amini et al., 2020).
In contrast, the evidential neural network (ENet) is a cost-effective deterministic neural network designed to accomplish uncertainty estimation by generating conjugate prior parameters as its outputs (Malinin and Gales, 2018; Sensoy et al., 2018; Gurevich and Stuke, 2020; Malinin et al., 2020). The ENet can model both epistemic and aleatoric uncertainty based on the uncertainty of the generated conjugate prior. Its ability to estimate uncertainty without an ensemble or an expensive posterior approximation is remarkable.
On the other hand, the fundamental goal of deep learning is to achieve state-of-the-art predictive accuracy, not only to estimate uncertainty. Although the ENet architecture can achieve outstanding and practical uncertainty estimation, the negative log marginal likelihood (NLL) loss, the original loss function of the ENet, may result in a high mean squared error (MSE) in the ENet's predictions under certain conditions. Intuitively, this problem arises because the NLL loss can finish training by determining that the target values are unknown instead of correcting the ENet's prediction. We show that this phenomenon occurs when the estimated epistemic uncertainty is high. High epistemic uncertainty implies that the given training samples are sparse or biased (Gal, 2016), which is a common situation in real-world applications such as drug-target affinity prediction (Ezzat et al., 2016; Yang et al., 2021).
One possible solution to this issue is to reformulate the training objective of the ENet as a multi-task learning (MTL) loss with an additional loss function that optimizes only the predictive accuracy, such as the MSE loss. In MTL, however, the conflicting-gradients problem can occur, which negatively impacts performance (Sener and Koltun, 2018; Lin et al., 2019; Yu et al., 2020). In safety-critical systems in particular, it can be harmful if the uncertainty estimation capability is degraded, despite improved accuracy, due to gradient conflict between the losses. Therefore, to identify the cause of the gradient conflict in our MTL optimization, we analyze the gradients of the loss functions. Our analysis reveals the condition under which this gradient conflict can occur.
Based on this analysis, we define the Lipschitz-modified MSE loss, which is designed to mitigate the gradient conflict with the original NLL loss function of the ENet. We also propose an MTL framework using the Lipschitz MSE, named the multi-task evidential neural network (MT-ENet). In particular, our contributions are as follows:

We thoroughly show that (1) the NLL loss alone is not sufficient to optimize the prediction accuracy of the ENet, and (2) adding an MSE-based loss to the training objective can resolve this issue.

We establish the condition under which an MSE-based loss function can conflict with the NLL loss. To avoid this gradient conflict, we design the Lipschitz-modified MSE loss function. Based on this novel Lipschitz loss function, we propose an MTL framework, MT-ENet, that improves the prediction accuracy of the ENet while maintaining its uncertainty estimation capability.

Our experiments show that the MT-ENet outperforms other strong baselines on real-world regression benchmark datasets. In addition, the MT-ENet shows remarkable performance in uncertainty estimation and out-of-distribution detection on drug-target affinity datasets.
2 Background
2.1 Problem setup
Assume we have a regression dataset, \(\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}\), where \(\mathbf{x}_i \in \mathbb{R}^{d}\) are the i.i.d. input data points, \(d\) is the dimension of an input vector, \(y_i \in \mathbb{R}\) are the real-valued targets, and \(N\) is the number of data samples. We consider resolving a regression task by modeling a predictive distribution, \(p(y_i \mid f(\mathbf{x}_i; \theta))\), where \(f\) is a neural network and \(\theta\) denotes its parameters.
2.2 Evidential regression network
An evidential regression network (ENet) (Amini et al., 2020) considers a target value, \(y_i\), as a sample drawn from a normal distribution with unknown parameters, \((\mu, \sigma^2)\). These parameters, \(\mu\) and \(\sigma^2\), are drawn from a Normal-Inverse-Gamma (NIG) distribution, which is the conjugate prior of the normal distribution:
\[ y_i \sim \mathcal{N}(\mu, \sigma^2), \qquad \mu \sim \mathcal{N}\!\left(\gamma, \sigma^2/\nu\right), \qquad \sigma^2 \sim \Gamma^{-1}(\alpha, \beta), \tag{1} \]
where \(\gamma \in \mathbb{R}\), \(\nu > 0\), \(\alpha > 1\), \(\beta > 0\), and \(\Gamma^{-1}(\cdot)\) is the inverse-gamma distribution. The NIG distribution in Eq. 1 is parameterized by \(\mathbf{m}_i = (\gamma_i, \nu_i, \alpha_i, \beta_i)\), which is the output of the ENet, \(\mathbf{m}_i = f(\mathbf{x}_i; \theta)\), where \(\theta\) is a trainable parameter of the ENet (Fig. 1). With the NIG distribution in Eq. 1, the model prediction (\(\mathbb{E}[\mu]\)), aleatoric (\(\mathbb{E}[\sigma^2]\)), and epistemic (\(\operatorname{Var}[\mu]\)) uncertainty of the ENet can be calculated as follows:
\[ \mathbb{E}[\mu] = \gamma, \qquad \mathbb{E}[\sigma^2] = \frac{\beta}{\alpha - 1}, \qquad \operatorname{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}. \tag{2} \]
With these equations, we define point estimation as estimating the model prediction \(\mathbb{E}[\mu]\) of the ENet, and uncertainty estimation as estimating the uncertainties \(\mathbb{E}[\sigma^2]\) and \(\operatorname{Var}[\mu]\).
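As a concrete illustration of Eq. 2, the NIG moments can be computed directly from the four ENet outputs. The following minimal sketch (plain Python; the function name and example values are illustrative) assumes the standard NIG moment formulas stated above:

```python
def nig_predict(gamma, nu, alpha, beta):
    """Point prediction and uncertainty decomposition from NIG parameters (Eq. 2).

    Assumes the standard NIG moments; requires alpha > 1.
    """
    prediction = gamma                       # E[mu]: model prediction
    aleatoric = beta / (alpha - 1.0)         # E[sigma^2]: irreducible noise
    epistemic = beta / (nu * (alpha - 1.0))  # Var[mu]: model uncertainty
    return prediction, aleatoric, epistemic

# Example: gamma=1, nu=2, alpha=3, beta=4 gives aleatoric 2.0 and epistemic 1.0
pred, aleatoric, epistemic = nig_predict(1.0, 2.0, 3.0, 4.0)
```

Note that the epistemic term is the aleatoric term divided by \(\nu\), so larger pseudo observations shrink the epistemic uncertainty, as discussed next.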
In addition to the uncertainty estimates in Eq. 2, pseudo observations can be used as an alternative interpretation of the predictive uncertainty (Murphy, 2007, 2012). This interpretation is widely used in the Bayesian statistics literature (Lee, 1989).
Definition 1.
We define \(\nu\) and \(2\alpha\), the outputs of the ENet, as the pseudo observations.
Note that the aleatoric and the epistemic uncertainty increase as the pseudo observations decrease because they are inversely proportional (\(\mathbb{E}[\sigma^2] \propto (\alpha - 1)^{-1}\), \(\operatorname{Var}[\mu] \propto \nu^{-1}(\alpha - 1)^{-1}\)).
2.3 Training the ENet with the marginal likelihood
The ENet learns its trainable parameters by maximizing a marginal likelihood with respect to the unknown Gaussian parameters, \(\mu\) and \(\sigma^2\). By analytically marginalizing the NIG distribution (Eq. 1) over these parameters, we can derive the following marginal likelihood:
\[ p(y_i \mid \mathbf{m}_i) = \operatorname{St}\!\left(y_i;\, \gamma_i,\, \frac{\beta_i(1 + \nu_i)}{\nu_i \alpha_i},\, 2\alpha_i\right), \tag{3} \]
where \(\operatorname{St}\) is the Student-t distribution with the location parameter \(\gamma_i\), the scale parameter \(\beta_i(1 + \nu_i)/(\nu_i \alpha_i)\), and the degrees of freedom \(2\alpha_i\). The training procedure for the ENet is to minimize the negative log marginal likelihood (NLL) loss function, \(\mathcal{L}^{\mathrm{NLL}}_i\), as summarized in the following equation:
\[ \mathcal{L}^{\mathrm{NLL}}_i = \frac{1}{2}\log\frac{\pi}{\nu_i} - \alpha_i \log \Omega_i + \left(\alpha_i + \frac{1}{2}\right)\log\!\left((y_i - \gamma_i)^2 \nu_i + \Omega_i\right) + \log\frac{\Gamma(\alpha_i)}{\Gamma(\alpha_i + 1/2)}, \tag{4} \]
where \(\Gamma(\cdot)\) is the gamma function and \(\Omega_i = 2\beta_i(1 + \nu_i)\). Besides allowing the ENet to predict a proper point estimate \(\gamma_i\), the NLL loss function allows the ENet to quantify the aleatoric and epistemic uncertainty via Eq. 4.
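The NLL of Eq. 4 can be written as a short function. The sketch below follows the form of the loss in Amini et al. (2020) with \(\Omega = 2\beta(1+\nu)\); the function name and test values are illustrative:

```python
import math

def nll_loss(y, gamma, nu, alpha, beta):
    """Negative log marginal likelihood of the NIG evidential model (Eq. 4)."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log((y - gamma) ** 2 * nu + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))
```

As a sanity check, for \(\nu = 1\), \(\alpha = 2\), \(\beta = 1\) the marginal is a Student-t with scale 1 and 4 degrees of freedom, and the loss grows with the squared error, consistent with Eq. 3.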
3 Gradient shrinking of the NLL loss
In this section, we show that the NLL loss alone is not sufficient to optimize the prediction accuracy of the ENet. Even though a model trained on the NLL loss can predict a proper point estimate, \(\gamma\), the model can reach a lower NLL loss by increasing the predictive uncertainty instead of achieving higher predictive accuracy. This limitation can be resolved by using an additional loss function. To state this formally, we first define the gradient vector of a loss function:
Definition 2.
Consider a loss function \(\mathcal{L}\) and a set of outputs of a model, \(\mathbf{m} = (\gamma, \nu, \alpha, \beta)\), with parameters \(\theta\). Let \(\mathbf{s}\) be a subset of the model outputs, \(\mathbf{s} \subseteq \mathbf{m}\); then \(\nabla_{\mathbf{s}} \mathcal{L}\) denotes the gradient vector of \(\mathcal{L}\) with respect to the subset \(\mathbf{s}\). We omit the dependence on \(\theta\) to simplify notation.
In order to increase the accuracy of the ENet, the model prediction (\(\gamma\)) should be trained to approach the true value \(y\) via the corresponding gradient \(\nabla_\gamma \mathcal{L}\). For an ideal loss function, the gradient magnitude becomes zero only when the model prediction and the true value are identical. However, if the predictive uncertainty is severely increased by the NLL loss, the gradient for the model prediction (\(\nabla_\gamma \mathcal{L}^{\mathrm{NLL}}\)) can be insignificantly small despite an inaccurate prediction. Thus, we aim to verify that the gradient magnitude for the model prediction, \(\lVert \nabla_\gamma \mathcal{L}^{\mathrm{NLL}} \rVert\), can be insignificant despite an incorrect model prediction \(\gamma\). Specifically, we prove that \(\lVert \nabla_\gamma \mathcal{L}^{\mathrm{NLL}} \rVert\) converges to zero as the pseudo observation (\(\nu\)) approaches zero, as shown in Theorem 1:
Theorem 1.
(Shrinking NLL gradient)¹. Let \(\mathbf{m} = (\gamma, \nu, \alpha, \beta)\) be the output of the ENet, and assume that \(y \neq \gamma\). Then for every real \(\epsilon > 0\), there exists a real \(\delta > 0\) such that for every \(\nu\), \(0 < \nu < \delta\) implies \(\lVert \nabla_\gamma \mathcal{L}^{\mathrm{NLL}} \rVert < \epsilon\). Therefore, \(\lim_{\nu \to 0} \lVert \nabla_\gamma \mathcal{L}^{\mathrm{NLL}} \rVert = 0\). (¹Proofs and details of all the theorems and propositions of this paper are given in Appendix A.)
Theorem 1 states that the NLL loss by itself is insufficient to correct the point estimate (\(\gamma\)) because it cannot fully utilize the error value, \(y - \gamma\), when the pseudo observation (\(\nu\)) is very low. An infinitesimally low pseudo observation signifies infinitely high epistemic uncertainty of the ENet, since \(\operatorname{Var}[\mu] = \beta/(\nu(\alpha - 1))\). This statement is empirically justified by the experiment on the synthetic dataset shown in Fig. 4.
On the other hand, the simple MSE loss, \((y - \gamma)^2\), consistently trains the ENet to correctly estimate \(\gamma\), since \(\nabla_\gamma \mathcal{L}^{\mathrm{MSE}}\) is independent of \(\nu\). Thus, MSE-based loss functions, which update only the point estimate (\(\gamma\)), can improve the point estimation capability of the ENet.
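The gradient-shrinking behavior can be checked numerically. The sketch below uses the derivative of Eq. 4 with respect to \(\gamma\) (hand-derived here, so treat it as an illustration rather than the paper's code) and shows the gradient magnitude collapsing as \(\nu \to 0\) while the error \(y - \gamma\) is held fixed:

```python
def dnll_dgamma(y, gamma, nu, alpha, beta):
    """Derivative of the NLL (Eq. 4) w.r.t. gamma (hand-derived illustration)."""
    omega = 2.0 * beta * (1.0 + nu)
    return -2.0 * nu * (alpha + 0.5) * (y - gamma) / (nu * (y - gamma) ** 2 + omega)

# Hold a large error y - gamma = 5 fixed while the pseudo observation nu shrinks:
grads = [abs(dnll_dgamma(5.0, 0.0, nu, 2.0, 1.0)) for nu in (1.0, 1e-2, 1e-4)]
# The gradient magnitude collapses even though the prediction is badly wrong,
# so the NLL stops correcting gamma once the epistemic uncertainty is high.
```

In contrast, \(\nabla_\gamma \mathcal{L}^{\mathrm{MSE}} = -2(y - \gamma)\) stays at magnitude 10 regardless of \(\nu\).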
4 Gradient conflict between the NLL and MSE
The previous section established that the simple MSE loss can resolve the gradient shrinkage problem. This finding motivates us to integrate the MSE loss with the NLL loss to form an MTL framework in order to improve predictive accuracy. When forming this MTL framework, it is important to avoid gradient conflict between tasks because it can lead to performance degradation. In the MTL literature, this undesirable case is defined as gradient vectors pointing in different directions (Sener and Koltun, 2018; Yu et al., 2020).
To avoid this degradation, it is necessary to identify the cause of the gradient conflict. In this section, we consider the gradient conflict between the simple MSE and the NLL loss function. To do so, we divide the gradient of the NLL into two separate gradients, the uncertainty estimation gradient and the point estimation gradient, which are clarified in the following definition:
Definition 3.
We define the point estimation gradient of the NLL as \(\nabla_\gamma \mathcal{L}^{\mathrm{NLL}}\), since \(\gamma\) performs the point estimation. The uncertainty estimation gradient of the NLL is defined as \(\nabla_{(\nu, \alpha, \beta)} \mathcal{L}^{\mathrm{NLL}}\), since \(\nu\), \(\alpha\), and \(\beta\) perform the uncertainty estimation. Finally, we define the total gradient of the NLL as \(\nabla_{\mathbf{m}} \mathcal{L}^{\mathrm{NLL}}\).
Then, we show that the point estimation gradient of the NLL from Definition 3 never conflicts with the gradient of the MSE, as stated in the following proposition:
Proposition 1.
(Non-conflicting point estimation)¹. If the gradient magnitudes of \(\nabla_\gamma \mathcal{L}^{\mathrm{NLL}}\) and \(\nabla_\gamma \mathcal{L}^{\mathrm{MSE}}\) are not zero, then the cosine similarity between \(\nabla_\gamma \mathcal{L}^{\mathrm{NLL}}\) and \(\nabla_\gamma \mathcal{L}^{\mathrm{MSE}}\) is always one: \(\cos\!\left(\nabla_\gamma \mathcal{L}^{\mathrm{NLL}}, \nabla_\gamma \mathcal{L}^{\mathrm{MSE}}\right) = 1\).
Proposition 1 states that, provided the model updates its parameters only through \(\nabla_\gamma \mathcal{L}^{\mathrm{NLL}}\) and \(\nabla_\gamma \mathcal{L}^{\mathrm{MSE}}\), the two gradients will never conflict. This is indicated by the red and blue arrows in Fig. 2. A trivial consequence of Proposition 1 is the following corollary:
Corollary 1.
If the total gradient of the NLL and the gradient of the MSE are not in the same direction, then the uncertainty estimation gradient of the NLL and the gradient of the MSE are also not in the same direction.
Since Proposition 1 states that there is no gradient conflict between \(\nabla_\gamma \mathcal{L}^{\mathrm{NLL}}\) and \(\nabla_\gamma \mathcal{L}^{\mathrm{MSE}}\), Corollary 1 implies that the only cause of gradient conflict between the MSE and NLL losses is the uncertainty estimation gradient of the NLL. Therefore, if the gradient conflict between the uncertainty estimation gradient and the gradient of the MSE loss can be avoided, the MTL can be performed more effectively.
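A quick numerical sanity check of Proposition 1: the point-estimation gradients of the NLL and the MSE with respect to \(\gamma\) are both negative multiples of \(y - \gamma\), so they always agree in sign. The derivative expressions below are hand-derived illustrations, not the paper's code:

```python
def dnll_dgamma(y, gamma, nu, alpha, beta):
    # Point-estimation gradient of the NLL w.r.t. gamma (hand-derived from Eq. 4).
    omega = 2.0 * beta * (1.0 + nu)
    return -2.0 * nu * (alpha + 0.5) * (y - gamma) / (nu * (y - gamma) ** 2 + omega)

def dmse_dgamma(y, gamma):
    # Gradient of the simple MSE (y - gamma)^2 w.r.t. gamma.
    return -2.0 * (y - gamma)

# Both gradients are negative multiples of (y - gamma): for any y != gamma
# they share a sign, i.e. cosine similarity one, so the point-estimation
# directions of the NLL and the MSE never conflict.
```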
4.1 Conjecture of the gradient conflict
In the previous section, it was shown that the uncertainty estimation gradient of the NLL is the only cause of the gradient conflict. In this section, we illustrate how the predictive uncertainty increased by the uncertainty estimation gradient can result in a large gradient conflict. Fig. 3 shows examples of the predictive distributions of the ENet trained on the MSE and NLL losses during training. In the multi-task learning literature, it is a common premise that the more severe the gradient conflict, the greater the loss (Chen et al., 2018; Lin et al., 2019; Yu et al., 2020; Chen et al., 2020; Javaloy and Valera, 2021). In this sense, since (B) has a higher loss than (A) while the model prediction approaches its true value, we can consider the gradient conflict of (B) to be more significant.
The fundamental difference between (B) and (A) is that the NLL increases uncertainty in (B) and decreases uncertainty in (A). Therefore, we assume that gradient conflicts between the MSE and NLL are more significant when the NLL increases the predictive uncertainty of the ENet, as in case (B). To alleviate the gradient conflict of (B), we design the Lipschitz-modified MSE loss function, and we empirically demonstrate that it successfully reduces the gradient conflict, as shown in Fig. 5.
5 Multi-task-based evidential neural network
A new multi-task loss function for the ENet is introduced in this section. The ultimate goal of the multi-task-based evidential neural network (MT-ENet) is to improve the predictive accuracy of the ENet without interrupting the uncertainty estimation training of the NLL loss. To accomplish this, the MT-ENet utilizes a novel Lipschitz loss function as an additional loss that mitigates gradient conflicts with the NLL loss.
5.1 Preventing the conflict via the Lipschitz MSE loss
In the previous section, we established that the gradients of the MSE and NLL can conflict when the NLL increases the predictive uncertainty. In this section, we specify this gradient conflict condition in order to design a loss function that resolves the conflict.
Specifically, we reinterpret this conflict situation as a change in the pseudo observations of the ENet, since an increase in predictive uncertainty is associated with a decrease in the pseudo observations (\(\nu\) and \(2\alpha\)). This decrease in the pseudo observations under the NLL loss occurs when the difference between the model prediction and the true value exceeds specific thresholds, as shown in Proposition 2:
Proposition 2.
Let \(\Delta_i = (y_i - \gamma_i)^2\), the squared error of the ENet. If \(\Delta_i\) is larger than certain thresholds, \(U_{\nu,i}\) and \(U_{\alpha,i}\), then the derivative signs of \(\mathcal{L}^{\mathrm{NLL}}_i\) w.r.t. \(\nu_i\) and \(\alpha_i\) are positive:
\[ U_{\nu,i} = \frac{\beta_i(1 + \nu_i)}{\alpha_i \nu_i}, \qquad U_{\alpha,i} = \frac{2\beta_i(1 + \nu_i)}{\nu_i}\left[\exp\!\big(\psi(\alpha_i + \tfrac{1}{2}) - \psi(\alpha_i)\big) - 1\right], \tag{5} \]
where \(\psi(\cdot)\) is the digamma function.
Since both \(\partial \mathcal{L}^{\mathrm{NLL}}_i / \partial \nu_i\) and \(\partial \mathcal{L}^{\mathrm{NLL}}_i / \partial \alpha_i\) contain \(\Delta_i\), the thresholds (\(U_{\nu,i}\) and \(U_{\alpha,i}\)) can be obtained by rearranging each derivative and solving for \(\Delta_i\) (see Appendix A).
Proposition 2 states that when the difference between the predicted and true values is large, the NLL loss trains the ENet to increase the predictive uncertainty. This increase in predictive uncertainty results in the gradient conflict, which can lead to performance degradation, as explained in the previous section. Thus, our strategy is to design an additional loss function that regulates its own gradient whenever there is an increase in uncertainty.
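The thresholds of Eq. 5 can be evaluated numerically. The sketch below approximates the digamma function by a central difference of `math.lgamma`, an assumption made to stay within the standard library (SciPy's `scipy.special.digamma` would be the exact choice):

```python
import math

def digamma(x, h=1e-5):
    # Numerical digamma via central difference of log-gamma
    # (demo-level approximation, accurate to ~h^2 for moderate x).
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def conflict_thresholds(nu, alpha, beta):
    """Thresholds on the squared error (Eq. 5) above which the NLL loss
    starts inflating the predictive uncertainty (decreasing nu and alpha)."""
    u_nu = beta * (1.0 + nu) / (alpha * nu)
    u_alpha = (2.0 * beta * (1.0 + nu) / nu) * (
        math.exp(digamma(alpha + 0.5) - digamma(alpha)) - 1.0)
    return u_nu, u_alpha
```

For \(\nu = 1\), \(\alpha = 2\), \(\beta = 1\), the first threshold is exactly \(\beta(1+\nu)/(\alpha\nu) = 1\), which coincides with the scale of the Student-t marginal in Eq. 3.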
Lipschitz MSE loss
We propose the Lipschitz-modified MSE loss based on Proposition 2 to mitigate the gradient conflict that occurs when the squared error is excessively large. Unlike the commonly used MSE loss, which is not Lipschitz continuous (Qi et al., 2020), we define a Lipschitz continuous loss function whose Lipschitz constant is dynamically determined by \(U_{\nu}\) and \(U_{\alpha}\) from Proposition 2.
Consider a mini-batch, \(\{(\mathbf{x}_i, y_i)\}_{i=1}^{B}\), extracted from our dataset and the corresponding outputs of the MT-ENet, \(\mathbf{m}_i = (\gamma_i, \nu_i, \alpha_i, \beta_i)\). For each \(y_i\) and \(\mathbf{m}_i\), the Lipschitz-modified MSE is defined as follows:
\[ \mathcal{L}^{\Lambda}_i = \begin{cases} (y_i - \gamma_i)^2 & \text{if } (y_i - \gamma_i)^2 < U_{\Lambda,i}, \\ 2\sqrt{U_{\Lambda,i}}\,\lvert y_i - \gamma_i \rvert - U_{\Lambda,i} & \text{otherwise}, \end{cases} \tag{6} \]
where \(U_{\Lambda,i} = \min(U_{\nu,i}, U_{\alpha,i})\). The \(\mathcal{L}^{\Lambda}\) restricts its gradient magnitude by adjusting its Lipschitz constant, \(2\sqrt{U_{\Lambda,i}}\), when the pseudo observations tend to be decreased (and thus the uncertainty increased) by the NLL loss. Note that \(\mathcal{L}^{\Lambda}\) does not propagate gradients to the uncertainty outputs (\(\nu_i\), \(\alpha_i\), \(\beta_i\)). We empirically demonstrate that the designed Lipschitz-modified loss mitigates the gradient conflict with the NLL by illustrating the high cosine similarity between our Lipschitz loss and the NLL loss, as shown in Fig. 5.
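The Lipschitz-modified MSE behaves like a Huber-style switch at the threshold: quadratic inside, linear outside. The sketch below treats `u_lambda` as a precomputed constant, mirroring the fact that \(\mathcal{L}^{\Lambda}\) propagates no gradient to the uncertainty outputs; names are illustrative:

```python
import math

def lipschitz_mse(y, gamma, u_lambda):
    """Lipschitz-modified MSE (Eq. 6): quadratic when the squared error is
    below u_lambda = min(U_nu, U_alpha), linear with slope 2*sqrt(u_lambda)
    otherwise, so the gradient w.r.t. gamma is bounded by 2*sqrt(u_lambda)."""
    sq_err = (y - gamma) ** 2
    if sq_err < u_lambda:
        return sq_err
    return 2.0 * math.sqrt(u_lambda) * abs(y - gamma) - u_lambda
```

The two branches meet continuously at \((y - \gamma)^2 = U_\Lambda\), and outside the threshold the slope stays at \(2\sqrt{U_\Lambda}\) rather than growing with the error, which is exactly the regime where the NLL would otherwise inflate the uncertainty.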
The final objective of the MT-ENet
The final MTL objective of the MT-ENet, \(\mathcal{L}^{\mathrm{MT}}\), is defined as a simple combination of the two losses and the regularization:
\[ \mathcal{L}^{\mathrm{MT}}_i = \mathcal{L}^{\mathrm{NLL}}_i + \mathcal{L}^{\Lambda}_i + \lambda \mathcal{L}^{R}_i, \tag{7} \]
where \(\mathcal{L}^{R}_i\) is the evidence regularizer (Amini et al., 2020) and \(\lambda\) is its coefficient.
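The final objective is then a plain sum. The sketch below assumes the evidence regularizer has the form \(|y - \gamma|(2\nu + \alpha)\) from Amini et al. (2020); the coefficient value in the example is a placeholder, not a recommended setting:

```python
def evidence_regularizer(y, gamma, nu, alpha):
    # |y - gamma| * (2*nu + alpha): penalizes high evidence (large pseudo
    # observations) on erroneous predictions. Form assumed from Amini et al. (2020).
    return abs(y - gamma) * (2.0 * nu + alpha)

def mt_objective(nll, lip_mse, reg, lam):
    # Eq. 7: NLL + Lipschitz-modified MSE + lambda * evidence regularizer.
    return nll + lip_mse + lam * reg
```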
Remark 1.
The evidential regularization, \(\mathcal{L}^{R}\), could also train the model prediction, \(\gamma\), like the NLL and MSE. However, the main goal of \(\mathcal{L}^{R}\) is not training \(\gamma\) but regulating the NLL loss by increasing the predictive uncertainty for incorrect predictions of the ENet (Sensoy et al., 2018; Amini et al., 2020). Therefore, \(\mathcal{L}^{R}\) cannot play the role of the MSE-based loss function, which improves predictive accuracy. Further discussion is available in Appendix B.
6 Related work
Related work on multi-task learning
Multi-task learning (MTL) is a paradigm for training several tasks with a single model (Ruder, 2017). A naïve MTL strategy is to train a model with a linear combination of the given loss functions, \(\sum_k w_k \mathcal{L}_k\). However, this linear combination can be ineffective if the loss functions have trade-off relationships with one another (Sener and Koltun, 2018; Lin et al., 2019). Several studies have resolved this issue, in an architecture-agnostic manner, by dynamically adjusting the weights of the losses (Sener and Koltun, 2018; Guo et al., 2018; Kendall et al., 2018; Liu et al., 2019; Lin et al., 2019) or by modifying the gradients (Chen et al., 2018, 2020; Yu et al., 2020). Like these previous works, we target the gradient conflict issue. In particular, we focus on applying MTL to the ENet while keeping the NLL loss so as to maintain its uncertainty estimation capability. To accomplish this, we define the Lipschitz-modified MSE, which reconciles with the uncertainty estimation by implicitly alleviating the gradient conflict.
7 Experiments
7.1 Synthetic dataset evaluation
Table 1: RMSE and NLL on the UCI regression benchmark datasets (standard deviations in parentheses).
RMSE
Datasets  MC-Dropout  Ensemble  ENet  MT-ENet
Boston  2.97(0.19)  3.28(1.00)  3.06(0.16)  3.04(0.21)
Concrete  5.23(0.12)  6.03(0.58)  5.85(0.15)  5.60(0.17)
Energy  1.66(0.04)  2.09(0.29)  2.06(0.10)  2.04(0.07)
Kin8nm  0.10(0.00)  0.09(0.00)  0.09(0.00)  0.08(0.00)
Naval  0.01(0.00)  0.00(0.00)  0.00(0.00)  0.00(0.00)
Power  4.02(0.04)  4.11(0.17)  4.23(0.09)  4.03(0.04)
Protein  4.36(0.01)  4.71(0.06)  4.64(0.03)  4.73(0.07)
Wine  0.62(0.01)  0.64(0.04)  0.61(0.02)  0.63(0.01)
Yacht  1.11(0.09)  1.58(0.48)  1.57(0.56)  1.03(0.08)
NLL
Datasets  MC-Dropout  Ensemble  ENet  MT-ENet
Boston  2.46(0.06)  2.41(0.25)  2.35(0.06)  2.31(0.04)
Concrete  3.04(0.02)  3.06(0.18)  3.01(0.02)  2.97(0.02)
Energy  1.99(0.02)  1.38(0.22)  1.39(0.06)  1.17(0.05)
Kin8nm  0.95(0.01)  1.20(0.02)  1.24(0.01)  1.19(0.01)
Naval  3.80(0.01)  5.63(0.05)  5.73(0.07)  5.96(0.03)
Power  2.80(0.01)  2.79(0.04)  2.81(0.07)  2.75(0.01)
Protein  2.89(0.00)  2.83(0.02)  2.63(0.00)  2.64(0.01)
Wine  0.93(0.01)  0.94(0.12)  0.89(0.05)  0.86(0.02)
Yacht  1.55(0.03)  1.18(0.21)  1.03(0.19)  0.78(0.06)
We first qualitatively evaluate the MT-ENet using a synthetic regression dataset. Fig. 4 shows our synthetic data and the model predictions. Experiments were carried out on an imbalanced regression dataset for two reasons: (1) real-world regression tasks often involve an imbalanced target (Yang et al., 2021); and (2) the ENet can fail on an imbalanced dataset. We observe that the predictions of the ENet for sparse samples (to the right of the green dotted lines in Fig. 4) do not fit the data despite the well-calibrated uncertainty. This trend of the ENet (Fig. 4(c)) is in line with Theorem 1, which implies that the MSE of the ENet is likely to be high when the epistemic uncertainty is high. Conversely, the results obtained for the other models, including the MT-ENet, are acceptable in the sparse sample regions. Training and experimental details are available in Appendix C.
7.2 Performance evaluation on real-world benchmarks
We evaluate the performance of the MT-ENet in comparison with strong baselines on the UCI regression benchmark datasets. The experimental settings and model architecture used in this study are identical to those used by Hernández-Lobato and Adams (2015). We report the average root mean squared error (RMSE) and the negative log-likelihood (NLL) of the models in Table 1. The MT-ENet generally provides the best or comparable RMSE and NLL performance. This is evidence that the MT-ENet and the Lipschitz-modified MSE loss improve the predictive accuracy of the ENet without disturbing the original NLL loss function. Details of the experiment are given in Appendix C.
7.3 Drug-target affinity regression
Next, we evaluate the MT-ENet on a high-dimensional, complex drug-target affinity (DTA) regression task, which is essential for early-stage drug discovery (Shin et al., 2019). Our experiments use two well-known benchmark datasets from the DTA literature: Davis (Davis et al., 2011) and Kiba (Tang et al., 2014). The Davis and Kiba datasets consist of protein sequences and simplified molecular-input line-entry system (SMILES) sequences. The target value is the drug-target affinity, which is a single real value.
The model architecture of the baselines and the MT-ENet, referred to as DeepDTA (Öztürk et al., 2018), is identical except for the number of outputs: four outputs represent \((\gamma, \nu, \alpha, \beta)\) for the ENet and MT-ENet, and a single output represents the target value for the MC-Dropout. The inputs of the models are the SMILES and protein sequences. The DeepDTA architecture consists of one-dimensional convolution layers for embedding the input sequences and fully connected layers to predict the target values from the embedded features. The details of training and the model architectures are available in Appendix C.
Metrics
Our evaluation metrics are the MSE, NLL, expected calibration error (ECE), and concordance index (CI). The MSE and NLL are the typical losses in the optimizer. The role of the CI is to quantify the quality of the rankings and the predictive accuracy of the model (Steck et al., 2008; Yu et al., 2011). The CI is defined as
\[ \mathrm{CI} = \frac{1}{Z} \sum_{\delta_i > \delta_j} h(b_i - b_j), \]
where \(Z\) is the total number of data pairs, \(\delta\) is a ground truth, \(b\) is a predicted value, and \(h(x)\) equals 1 for \(x > 0\), 0.5 for \(x = 0\), and 0 for \(x < 0\). The ECE indicates how accurately the estimated uncertainty reflects the real accuracy (Guo et al., 2017; Kuleshov et al., 2018). The ECE can be calculated as
\[ \mathrm{ECE} = \frac{1}{m} \sum_{k=1}^{m} \lvert c_k - \operatorname{acc}(c_k) \rvert, \]
where \(\operatorname{acc}(c_k)\) is the empirical accuracy within the confidence interval \(c_k\) and \(m\) is the number of intervals.
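For reference, the CI can be implemented with an O(n²) pairwise loop; a minimal sketch (prediction ties counted as 0.5, function name illustrative):

```python
def concordance_index(y_true, y_pred):
    """Fraction of pairs with y_true[i] > y_true[j] whose predictions are
    ordered the same way; ties in the prediction count as 0.5."""
    num, den = 0.0, 0
    n = len(y_true)
    for i in range(n):
        for j in range(n):
            if y_true[i] > y_true[j]:
                den += 1
                if y_pred[i] > y_pred[j]:
                    num += 1.0
                elif y_pred[i] == y_pred[j]:
                    num += 0.5
    return num / den
```

A perfectly ordered prediction yields CI = 1, a reversed ordering yields 0, and random predictions hover around 0.5.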
Alleviated gradient conflict via the Lipschitz-modified MSE
We first demonstrate that the Lipschitz-modified MSE can alleviate the gradient conflict by measuring the cosine similarities of the gradient vectors. Note that, in this part of the study, we use the ENet with the simple MSE as the additional loss (MSE-ENet) as the baseline to determine whether our proposed Lipschitz loss alleviates the conflict.
The moving average of the cosine similarity between the gradients, averaged over 500 iterations, is shown in Fig. 5. The trends of the MT-ENet indicate high cosine similarities, which is evidence that the Lipschitz-modified MSE successfully alleviates the gradient conflict. The results also show that simply down-weighting the MSE loss (the purple lines, small MSE) fails to resolve the conflict. Since the Lipschitz MSE is designed to avoid the conflict when the NLL loss yields high predictive uncertainty (case (B) in Fig. 3), this result agrees well with our conjecture about the gradient conflict: when the NLL increases the predictive uncertainty of the model, gradient conflict is highly likely.
Performance evaluation
Table 2: Performance on the Davis, Kiba, and BindingDB datasets. Dropout denotes MC-Dropout; standard deviations are reported in parentheses.
Davis
  ENet  MT-ENet  MSE-ENet  Dropout
CI  0.856(0.02)  0.864(0.01)  0.863(0.02)  0.884(0.00)
MSE  0.275(0.00)  0.273(0.01)  0.266(0.01)  0.248(0.01)
NLL  2.344(0.42)  2.424(0.07)  2.430(0.10)  0.633(0.02)
ECE  0.184(0.02)  0.156(0.03)  0.179(0.02)  0.217(0.01)
Kiba
  ENet  MT-ENet  MSE-ENet  Dropout
CI  0.885(0.00)  0.887(0.00)  0.888(0.00)  0.872(0.00)
MSE  0.190(0.00)  0.181(0.00)  0.176(0.00)  0.178(0.00)
NLL  1.544(0.05)  1.433(0.07)  1.276(0.04)  0.465(0.01)
ECE  0.077(0.03)  0.066(0.01)  0.081(0.02)  0.162(0.01)
BindingDB
  ENet  MT-ENet  MSE-ENet  Dropout
CI  0.822(0.00)  0.824(0.00)  0.831(0.00)  0.830(0.00)
MSE  0.704(0.01)  0.694(0.00)  0.637(0.01)  0.641(0.01)
NLL  0.944(0.02)  0.937(0.01)  0.972(0.01)  1.90(0.05)
ECE  0.018(0.01)  0.015(0.00)  0.027(0.01)  0.134(0.01)
Table 2 presents the performance evaluation results on the Kiba and Davis datasets. The MT-ENet and MSE-ENet successfully improve the predictive accuracy metrics (CI, MSE) on both datasets. This result confirms that additional MSE-based loss functions can improve the point-estimation capability of the ENet, as described in Sec. 3. Note also that the MT-ENet enhances accuracy without any significant degradation of the uncertainty estimation capability, as indicated by the NLL and ECE. Notably, for the calibration metric (ECE), the MT-ENet shows the best results among the examined models. Conversely, the NLL and ECE of the MSE-ENet on the Kiba dataset are degraded to some extent. This sacrifice of the uncertainty estimation ability of the MSE-ENet can likely be attributed to the gradient conflict with the NLL loss function.
Out-of-distribution testing on the BindingDB dataset
We examine the out-of-distribution (OOD) detection capability of the MT-ENet on the curated BindingDB dataset (Liu et al., 2007). The BindingDB dataset includes various drug-target (molecule-protein) sequence pairs. Because we excluded kinase protein samples, which are the targets of the Kiba dataset, pairs of kinase proteins from the Kiba dataset and detergent molecules from PubChem (Kim et al., 2019) are considered the OOD data. The test split of BindingDB is considered the in-distribution (ID) data.
We randomly split the BindingDB dataset into training (80%), validation (10%), and test (10%) sets a total of three times, and all examined models are trained on each of the three training sets. The details of the experiments and these datasets are available in Appendix B.
Fig. 6 shows the distributions of the log epistemic uncertainty and the AUC-ROC scores of the examined models. The MT-ENet and ENet show high AUC-ROC scores, which implies excellent detection capability for OOD samples. In addition, Table 2 reports the performance on the test split of BindingDB. Table 2 and Fig. 6 show that the MT-ENet improves on the ENet in all metrics on the BindingDB dataset, including calibration and OOD detection. In contrast, the AUC-ROC and ECE of the MSE-ENet considerably underperform the other methods, although the MSE-ENet shows the best MSE and CI on the BindingDB dataset. The exception is that the ECE and NLL of the MSE-ENet are better than those of the MC-Dropout. This result is evidence that an additional loss function (the simple MSE) that does not avoid the gradient conflict can cause an overconfidence problem, since it can disturb the uncertainty estimation of the NLL loss.
Furthermore, density plots of the aleatoric uncertainty of the MSE-ENet, ENet, and MT-ENet are provided in Appendix D. They show that the aleatoric uncertainties are at a similar level for the OOD and ID data. These results are evidence that the MT-ENet is capable of disentangling the aleatoric and epistemic uncertainties.
8 Conclusion
We have theoretically and empirically demonstrated that an MSE-based loss improves the point estimation capability of the ENet by resolving the gradient shrinkage problem. We proposed the MT-ENet, which uses the Lipschitz-modified MSE as an additional training objective to ensure that the uncertainty estimation of the NLL is not disturbed. Our experiments show that the MT-ENet improves predictive accuracy while not only maintaining but sometimes enhancing the uncertainty estimation of the ENet. Successful real-world deep learning systems must be accurate, reliable, and efficient. We look forward to further improving evidential deep learning, which is efficient and reliable, with better accuracy through the MTL framework.
Acknowledgments
This research was the result of a study on the HPC Support Project, supported by the Ministry of Science and ICT and NIPA (National IT Industry Promotion Agency, Republic of Korea).
References
 DeepHdta: deep learning for predicting drugtarget interactions: a case study of covid19 drug repurposing. IEEE Access 8, pp. 170433–170451. Cited by: §C.3.
 Multiview selfattention for interpretable drug–target interaction prediction. Journal of Biomedical Informatics 110, pp. 103547. Cited by: §C.3.
 Deep evidential regression. Advances in Neural Information Processing Systems 33. Cited by: Appendix B, §C.2, Appendix D, Appendix D, §1, §2.2, §5.1, §5.1.

Weight uncertainty in neural network.
In
International Conference on Machine Learning
, pp. 1613–1622. Cited by: §1.  Gradnorm: gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pp. 794–803. Cited by: §4.1, §6.
 Just pick a sign: optimizing deep multitask models with gradient sign dropout. arXiv preprint arXiv:2010.06808. Cited by: §4.1, §6.
 Comprehensive analysis of kinase inhibitor selectivity. Nature biotechnology 29 (11), pp. 1046–1051. Cited by: §7.3.
 Drugtarget interaction prediction via class imbalanceaware ensemble learning. BMC bioinformatics 17 (19), pp. 267–276. Cited by: §1.
 Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §C.2, §C.2, §1.
 Uncertainty in deep learning. University of Cambridge 1 (3), pp. 4. Cited by: Figure A10, Appendix D, §1, §1.

On the robustness of monte carlo dropout trained with noisy labels.
In
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pp. 2219–2228. Cited by: §C.3.  On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. Cited by: §1, §7.3.
 Dynamic task prioritization for multitask learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–287. Cited by: §6.
 Gradient conjugate priors and multilayer neural networks. Artificial Intelligence 278, pp. 103184. Cited by: Appendix D, §1.

SimBoost: a readacross approach for predicting drug–target binding affinities using gradient boosting machines
. Journal of cheminformatics 9 (1), pp. 1–14. Cited by: §C.3. 
Probabilistic backpropagation for scalable learning of bayesian neural networks
. In International Conference on Machine Learning, pp. 1861–1869. Cited by: §C.2, §C.2, §7.2.  Rotograd: dynamic gradient homogenization for multitask learning. arXiv preprint arXiv:2103.02631. Cited by: §4.1.
 Multitask learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7482–7491. Cited by: §6.
 What uncertainties do we need in bayesian deep learning for computer vision?. Advances in Neural Information Processing Systems 30, pp. 5574–5584. Cited by: Figure A10, Appendix D, §1.
 PubChem 2019 update: improved access to chemical data. Nucleic acids research 47 (D1), pp. D1102–D1109. Cited by: §C.4, §7.3.
 Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, pp. 2796–2804. Cited by: §7.3.
 Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems 30. Cited by: §C.2, §C.2, §1.
 Bayesian statistics. Oxford University Press, London. Cited by: §2.2.
 Pareto multi-task learning. In Thirty-third Conference on Neural Information Processing Systems (NeurIPS 2019). Cited by: §1, §4.1, §6.
 Loss-balanced task weighting to reduce negative transfer in multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33 (01), pp. 9977–9978. Cited by: §6.
 BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Research 35 (suppl_1), pp. D198–D201. Cited by: §C.4, §7.3.
 A simple baseline for bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems 32, pp. 13153–13164. Cited by: §1.
 Predictive uncertainty estimation via prior networks. In NIPS’18: Proceedings of the 32nd International Conference on Neural Information Processing Systems, Vol. 31, pp. 7047–7058. Cited by: Appendix D, §1.
 Regression prior networks. arXiv preprint arXiv:2006.11590. Cited by: Appendix D, §1.

 Conjugate Bayesian analysis of the Gaussian distribution. Technical report. Cited by: §2.2.
 Machine learning: a probabilistic perspective. MIT Press. Cited by: §2.2.
 Bayesian learning for neural networks. Vol. 118, Springer Science & Business Media. Cited by: §1.
 Measuring calibration in deep learning. In CVPR Workshops, Vol. 2. Cited by: §C.3.
 DeepDTA: deep drug–target binding affinity prediction. Bioinformatics 34 (17), pp. i821–i829. Cited by: §C.3, §7.3.
 Toward more realistic drug–target interaction predictions. Briefings in bioinformatics 16 (2), pp. 325–337. Cited by: §C.3.
 On mean absolute error for deep neural network based vector-to-vector regression. IEEE Signal Processing Letters 27, pp. 1485–1489. Cited by: §5.1.
 An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098. Cited by: §6.
 Multi-task learning as multi-objective optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 525–536. Cited by: §1, §4, §6.
 Evidential deep learning to quantify classification uncertainty. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3183–3193. Cited by: Appendix D, §1, §5.1.
 Self-attention based molecule representation for predicting drug–target interaction. In Machine Learning for Healthcare Conference, pp. 230–248. Cited by: §C.3, §7.3.
 On ranking in survival analysis: bounds on the concordance index. In Advances in neural information processing systems, pp. 1209–1216. Cited by: §7.3.
 Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. Journal of Chemical Information and Modeling 54 (3), pp. 735–743. Cited by: §7.3.
 Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML11), pp. 681–688. Cited by: §1.
 Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791. Cited by: §1.
 Delving into deep imbalanced regression. arXiv preprint arXiv:2102.09554. Cited by: §1, §7.1.
 Learning patient-specific cancer survival distributions as a sequence of dependent regressors. Advances in Neural Information Processing Systems 24, pp. 1845–1853. Cited by: §7.3.
 Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems 33. Cited by: §1, §4.1, §4, §6.
 Deep drug–target binding affinity prediction with multiple attention blocks. Briefings in Bioinformatics. Cited by: §C.3.
 GANsDTA: predicting drug–target binding affinity using GANs. Frontiers in Genetics 10, pp. 1243. Cited by: §C.3.
Appendix A Derivations and proofs
In this section, we provide the proofs of the theorems and propositions in the main manuscript. Before giving the proofs, we restate the evidential neural network, its outputs, and the definitions from the main manuscript.
Outputs of the evidential neural network
Let $f$ be the ENet, $\theta$ be the trainable parameters of the ENet, and $x \in \mathbb{R}^{d}$ be the input data, where $d$ is its dimension. The outputs of the ENet are $f(x; \theta) = (\gamma, \nu, \alpha, \beta)$, where $\nu > 0$, $\alpha > 1$, $\beta > 0$. Specifically, the output of the ENet represents the NIG distribution:

$$p(\mu, \sigma^{2} \mid \gamma, \nu, \alpha, \beta) = \mathcal{N}\!\left(\mu \mid \gamma, \sigma^{2}/\nu\right)\,\Gamma^{-1}\!\left(\sigma^{2} \mid \alpha, \beta\right).$$

The model prediction ($\mathbb{E}[\mu]$), aleatoric ($u_{al}$), and epistemic ($u_{ep}$) uncertainty of the ENet can be calculated by the following:

$$\mathbb{E}[\mu] = \gamma, \qquad u_{al} = \mathbb{E}[\sigma^{2}] = \frac{\beta}{\alpha - 1}, \qquad u_{ep} = \mathrm{Var}[\mu] = \frac{\beta}{\nu(\alpha - 1)}.$$
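As a quick sanity check, the closed-form moments above can be computed directly from the four NIG outputs. The sketch below is our own illustration (the helper name `nig_moments` is not from the manuscript):

```python
def nig_moments(gamma, nu, alpha, beta):
    """Return (prediction, aleatoric, epistemic) for NIG parameters.

    prediction = E[mu]      = gamma
    aleatoric  = E[sigma^2] = beta / (alpha - 1)
    epistemic  = Var[mu]    = beta / (nu * (alpha - 1))
    """
    assert nu > 0 and alpha > 1 and beta > 0, "moments require nu>0, alpha>1, beta>0"
    return gamma, beta / (alpha - 1), beta / (nu * (alpha - 1))

# Fewer pseudo observations (smaller nu) -> larger epistemic uncertainty.
pred, aleatoric, epistemic = nig_moments(gamma=0.0, nu=0.5, alpha=2.0, beta=1.0)
```

Note that epistemic uncertainty scales as $1/\nu$ at fixed $\alpha, \beta$, which is the regime Theorem 1 below analyzes.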
Definition 1.
We define $\nu$ and $\alpha$, the outputs of the ENet, as the pseudo observations.
Definition 2.
Consider a loss function $\mathcal{L}$ and a set of outputs of a model, $O$, with parameters $\theta$. Let $O' \subseteq O$ be the subset of the model outputs; then $\nabla_{O'}\mathcal{L}$ denotes the gradient vector with respect to the subset of the model outputs, $O'$. We exclude the reliance of $\mathcal{L}$ on $\theta$ to simplify notation.
Definition 3.
We define the point estimation gradient of the NLL as $\nabla_{\gamma}\mathcal{L}^{NLL}$, since $\gamma$ performs the point estimation. The uncertainty estimation gradient of the NLL is defined as $\nabla_{(\nu, \alpha, \beta)}\mathcal{L}^{NLL}$, since $(\nu, \alpha, \beta)$ perform the uncertainty estimation. Finally, we define the total gradient of the NLL as $\nabla_{(\gamma, \nu, \alpha, \beta)}\mathcal{L}^{NLL}$.
A.1 Proof of Theorem 1
We prove Theorem 1. Here, we first rewrite the NLL loss:

$$\mathcal{L}^{NLL} = \frac{1}{2}\log\frac{\pi}{\nu} - \alpha\log\Omega + \left(\alpha + \frac{1}{2}\right)\log\!\left((y - \gamma)^{2}\nu + \Omega\right) + \log\frac{\Gamma(\alpha)}{\Gamma\!\left(\alpha + \frac{1}{2}\right)}, \qquad \Omega = 2\beta(1 + \nu).$$
Theorem 1.
(Shrinking NLL gradient). Let $(\gamma, \nu, \alpha, \beta)$ be the output of the ENet, and assume that $\gamma$, $\alpha$, and $\beta$ are fixed; then for every real $\epsilon > 0$, there exists a real $\delta > 0$ such that for every $\nu > 0$, $\nu < \delta$ implies $\left|\frac{\partial \mathcal{L}^{NLL}}{\partial \gamma}\right| < \epsilon$. Therefore, $\lim_{\nu \to 0^{+}} \frac{\partial \mathcal{L}^{NLL}}{\partial \gamma} = 0$.
(Proof.) Where

$$\frac{\partial \mathcal{L}^{NLL}}{\partial \gamma} = -\frac{(2\alpha + 1)\,\nu\,(y - \gamma)}{(y - \gamma)^{2}\nu + \Omega}, \qquad \Omega = 2\beta(1 + \nu),$$

we show that for every $\epsilon > 0$, there exists a $\delta > 0$ such that, for all $\nu > 0$, if $\nu < \delta$, then $\left|\frac{\partial \mathcal{L}^{NLL}}{\partial \gamma}\right| < \epsilon$.

$$\left|\frac{\partial \mathcal{L}^{NLL}}{\partial \gamma}\right| = \frac{(2\alpha + 1)\,\nu\,|y - \gamma|}{(y - \gamma)^{2}\nu + \Omega} \leq \frac{(2\alpha + 1)\,\nu\,|y - \gamma|}{2\beta} \quad (\text{since } \Omega = 2\beta(1 + \nu) \geq 2\beta).$$

If $y - \gamma = 0$, the proposition is always true regardless of $\delta$, since $\frac{\partial \mathcal{L}^{NLL}}{\partial \gamma} = 0$. Else:

$$\frac{(2\alpha + 1)\,\nu\,|y - \gamma|}{2\beta} < \epsilon \iff \nu < \frac{2\beta\epsilon}{(2\alpha + 1)\,|y - \gamma|}.$$

Therefore, if we set $\delta = \frac{2\beta\epsilon}{(2\alpha + 1)\,|y - \gamma|}$, we say that

$$\forall \nu > 0: \; \nu < \delta \implies \left|\frac{\partial \mathcal{L}^{NLL}}{\partial \gamma}\right| < \epsilon,$$

where $\theta$ is a set of trainable parameters of the ENet and $\gamma = f_{\gamma}(x; \theta)$, and thus $\lim_{\nu \to 0^{+}} \frac{\partial \mathcal{L}^{NLL}}{\partial \gamma} = 0$. $\square$
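The shrinking-gradient behaviour can also be checked numerically. The sketch below is our own illustration (helper names `nig_nll` and `nll_grad_gamma` are ours): it evaluates the closed-form gradient, verifies it against a central finite difference of the NLL, and confirms that its magnitude vanishes as the pseudo observation count shrinks:

```python
import math

def nig_nll(y, gamma, nu, alpha, beta):
    """NLL of y under the NIG evidential marginal likelihood (Omega = 2*beta*(1+nu))."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log((y - gamma) ** 2 * nu + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))

def nll_grad_gamma(y, gamma, nu, alpha, beta):
    """Closed-form d NLL / d gamma used in the proof above."""
    omega = 2.0 * beta * (1.0 + nu)
    return -(2.0 * alpha + 1.0) * nu * (y - gamma) / ((y - gamma) ** 2 * nu + omega)

y, gamma, alpha, beta = 1.0, 0.0, 2.0, 1.0
h = 1e-6
# 1) The closed form matches a central finite difference of the loss.
for nu in (1.0, 0.1, 0.01):
    fd = (nig_nll(y, gamma + h, nu, alpha, beta)
          - nig_nll(y, gamma - h, nu, alpha, beta)) / (2.0 * h)
    assert abs(fd - nll_grad_gamma(y, gamma, nu, alpha, beta)) < 1e-5

# 2) The gradient magnitude shrinks as nu -> 0 (high epistemic uncertainty),
#    even though the prediction error |y - gamma| stays fixed.
grads = [abs(nll_grad_gamma(y, gamma, nu, alpha, beta))
         for nu in (1.0, 0.1, 0.01, 0.001)]
```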
A.2 Proof of Proposition 1
We show that the gradients of the NLL loss and the MSE loss with respect to the model prediction ($\gamma$) never conflict. Specifically, the cosine similarity between the two gradients is never negative. Then we provide the proof of Proposition 1:
Proposition 1.
(Non-conflicting point estimation). If the L2 norms of $\nabla_{\gamma}\mathcal{L}^{NLL}$ and $\nabla_{\gamma}\mathcal{L}^{MSE}$ are not zero, then the cosine similarity between $\nabla_{\gamma}\mathcal{L}^{NLL}$ and $\nabla_{\gamma}\mathcal{L}^{MSE}$ is always one. Hence, $\nabla_{\gamma}\mathcal{L}^{NLL}$ is positively proportional to $\nabla_{\gamma}\mathcal{L}^{MSE}$:
(A8) $S_{c}\!\left(\nabla_{\gamma}\mathcal{L}^{NLL}, \nabla_{\gamma}\mathcal{L}^{MSE}\right) = 1$,
where $S_{c}(\cdot, \cdot)$ denotes the cosine similarity.
(Proof.) We can express $\nabla_{\gamma}\mathcal{L}^{NLL}$ and $\nabla_{\gamma}\mathcal{L}^{MSE}$ as follows:

$$\nabla_{\gamma}\mathcal{L}^{MSE} = -2(y - \gamma), \qquad \nabla_{\gamma}\mathcal{L}^{NLL} = -\frac{(2\alpha + 1)\,\nu\,(y - \gamma)}{(y - \gamma)^{2}\nu + \Omega} = c\,\nabla_{\gamma}\mathcal{L}^{MSE},$$

where $c = \frac{(2\alpha + 1)\,\nu}{2\left((y - \gamma)^{2}\nu + \Omega\right)} > 0$. We can calculate the cosine similarity, $S_{c}$, between the two gradient vectors $\nabla_{\gamma}\mathcal{L}^{NLL}$ and $\nabla_{\gamma}\mathcal{L}^{MSE}$:

$$S_{c} = \frac{\nabla_{\gamma}\mathcal{L}^{NLL} \cdot \nabla_{\gamma}\mathcal{L}^{MSE}}{\left\lVert\nabla_{\gamma}\mathcal{L}^{NLL}\right\rVert \left\lVert\nabla_{\gamma}\mathcal{L}^{MSE}\right\rVert} = \frac{c\left\lVert\nabla_{\gamma}\mathcal{L}^{MSE}\right\rVert^{2}}{c\left\lVert\nabla_{\gamma}\mathcal{L}^{MSE}\right\rVert^{2}} = 1.$$

The signs of the gradients are always identical:

$$\mathrm{sign}\!\left(\nabla_{\gamma}\mathcal{L}^{NLL}\right) = \mathrm{sign}\!\left(\nabla_{\gamma}\mathcal{L}^{MSE}\right) = -\,\mathrm{sign}(y - \gamma).$$

Therefore, we conclude: $S_{c} = 1$ and $\nabla_{\gamma}\mathcal{L}^{NLL} \propto \nabla_{\gamma}\mathcal{L}^{MSE}$. $\square$
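The non-conflicting property is easy to confirm numerically. Below is our own sketch (the helper name `grads_gamma` is ours): the NLL gradient is the MSE gradient scaled by a strictly positive coefficient, so the two never disagree in sign, whether the prediction over- or under-shoots the target:

```python
def grads_gamma(y, gamma, nu, alpha, beta):
    """Return (d L_NLL/d gamma, d L_MSE/d gamma) in closed form."""
    omega = 2.0 * beta * (1.0 + nu)
    c = (2.0 * alpha + 1.0) * nu / (2.0 * ((y - gamma) ** 2 * nu + omega))  # c > 0
    g_mse = -2.0 * (y - gamma)
    return c * g_mse, g_mse

# Signs always match, for both over- and under-estimation.
for y, gamma in [(1.0, -0.5), (-2.0, 3.0), (0.5, 0.49)]:
    g_nll, g_mse = grads_gamma(y, gamma, nu=0.5, alpha=2.0, beta=1.0)
    assert g_nll * g_mse > 0  # never conflicting
```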
A.3 Proof of Corollary 1
We prove Corollary 1, using Proposition 1 and the following property of the gradient vector $g$:

Let $O$ be the set of outputs of a model and $\mathcal{L}$ be a loss function. If there are two disjoint subsets of $O$, $O_{1}$ and $O_{2}$, with $O_{1} \cup O_{2} = O$, then $\nabla_{O}\mathcal{L} = \nabla_{O_{1}}\mathcal{L} \oplus \nabla_{O_{2}}\mathcal{L}$, the concatenation of the two gradient vectors.
Corollary 1.
(Source of gradient conflict). If the total gradient of the NLL and the gradient of the MSE are not in the same direction, then the uncertainty estimation gradient of the NLL and the gradient of the MSE are not in the same direction:

$$\nabla\mathcal{L}^{NLL} \cdot \nabla\mathcal{L}^{MSE} < 0 \implies \nabla_{(\nu, \alpha, \beta)}\mathcal{L}^{NLL} \cdot \nabla_{(\nu, \alpha, \beta)}\mathcal{L}^{MSE} < 0.$$

(Proof.) We consider the contrapositive of Corollary 1:
(Contrapositive of Corollary 1.)
If the uncertainty estimation gradients of the NLL loss (from $(\nu, \alpha, \beta)$) and the gradients of the MSE loss do not conflict, $\nabla_{(\nu, \alpha, \beta)}\mathcal{L}^{NLL} \cdot \nabla_{(\nu, \alpha, \beta)}\mathcal{L}^{MSE} \geq 0$, then the gradients of the two losses never conflict: $\nabla\mathcal{L}^{NLL} \cdot \nabla\mathcal{L}^{MSE} \geq 0$.
On account of the assumption, we have $a \geq 0$, where $a = \nabla_{(\nu, \alpha, \beta)}\mathcal{L}^{NLL} \cdot \nabla_{(\nu, \alpha, \beta)}\mathcal{L}^{MSE}$. From Proposition 1, we get $b \geq 0$, where $b = \nabla_{\gamma}\mathcal{L}^{NLL} \cdot \nabla_{\gamma}\mathcal{L}^{MSE}$. Since $\nabla\mathcal{L} = \nabla_{\gamma}\mathcal{L} \oplus \nabla_{(\nu, \alpha, \beta)}\mathcal{L}$, we have:

$$\nabla\mathcal{L}^{NLL} \cdot \nabla\mathcal{L}^{MSE} = b + a \geq 0.$$

Since the contrapositive of Corollary 1 is true, we conclude that Corollary 1 is true. $\square$
A.4 Proof of Proposition 2
In this section, we prove Proposition 2. Let $SE = (y - \gamma)^{2}$, which represents the squared error value. Then the gradients of the NLL loss w.r.t. the pseudo observations, $(\nu, \alpha)$, of the NIG distribution are positive when the $SE$ is larger than $\frac{\beta(1 + \nu)}{\alpha\nu}$ or $\frac{2\beta(1 + \nu)}{\nu}\left(e^{\psi(\alpha + \frac{1}{2}) - \psi(\alpha)} - 1\right)$, respectively:
Proposition 2.
Let $SE = (y - \gamma)^{2}$, which is the squared error value of the ENet. Then if $SE$ is larger than certain thresholds, $\frac{\beta(1 + \nu)}{\alpha\nu}$ and $\frac{2\beta(1 + \nu)}{\nu}\left(e^{\psi(\alpha + \frac{1}{2}) - \psi(\alpha)} - 1\right)$, then the derivative signs of the $\mathcal{L}^{NLL}$ w.r.t. $\nu$ and $\alpha$ are positive:
(A9) $SE > \frac{\beta(1 + \nu)}{\alpha\nu} \implies \frac{\partial \mathcal{L}^{NLL}}{\partial \nu} > 0, \qquad SE > \frac{2\beta(1 + \nu)}{\nu}\left(e^{\psi(\alpha + \frac{1}{2}) - \psi(\alpha)} - 1\right) \implies \frac{\partial \mathcal{L}^{NLL}}{\partial \alpha} > 0,$
where $\psi$ is the digamma function.
(Proof.)

$$\frac{\partial \mathcal{L}^{NLL}}{\partial \nu} = -\frac{1}{2\nu} - \frac{\alpha}{1 + \nu} + \frac{\left(\alpha + \frac{1}{2}\right)(SE + 2\beta)}{SE\,\nu + 2\beta(1 + \nu)},$$

which is positive if and only if $\frac{\alpha\,SE}{1 + \nu} > \frac{\beta}{\nu}$, i.e., $SE > \frac{\beta(1 + \nu)}{\alpha\nu}$; and,

$$\frac{\partial \mathcal{L}^{NLL}}{\partial \alpha} = \log\!\left(\frac{SE\,\nu + 2\beta(1 + \nu)}{2\beta(1 + \nu)}\right) + \psi(\alpha) - \psi\!\left(\alpha + \tfrac{1}{2}\right),$$

which is positive if and only if $SE > \frac{2\beta(1 + \nu)}{\nu}\left(e^{\psi(\alpha + \frac{1}{2}) - \psi(\alpha)} - 1\right)$. $\square$
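The $\nu$ threshold can be verified numerically against the closed-form derivative. The sketch below is our own check (the helper name `nll_grad_nu` is ours); the derivative changes sign exactly at $SE = \beta(1+\nu)/(\alpha\nu)$:

```python
def nll_grad_nu(se, nu, alpha, beta):
    """Closed-form d L_NLL / d nu, with se = (y - gamma)^2.

    Positive when se exceeds beta*(1+nu)/(alpha*nu), negative below it.
    """
    omega = 2.0 * beta * (1.0 + nu)
    return (-1.0 / (2.0 * nu)
            - alpha / (1.0 + nu)
            + (alpha + 0.5) * (se + 2.0 * beta) / (se * nu + omega))

nu, alpha, beta = 0.5, 2.0, 1.0
threshold = beta * (1.0 + nu) / (alpha * nu)  # = 1.5 for these parameters
above = nll_grad_nu(se=threshold * 2.0, nu=nu, alpha=alpha, beta=beta)
below = nll_grad_nu(se=threshold * 0.5, nu=nu, alpha=alpha, beta=beta)
```

A positive derivative means gradient descent shrinks $\nu$ (raising epistemic uncertainty) whenever the squared error is large enough.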
Appendix B Evidential regularization
The evidential regularization, $\mathcal{L}^{R}$, of the ENet was originally proposed in Amini et al. (2020):
(A10) $\mathcal{L}^{R} = |y - \gamma|\,(2\nu + \alpha)$
The role of the $\mathcal{L}^{R}$ is to regularize the NLL loss, $\mathcal{L}^{NLL}$, by increasing the predictive uncertainty of incorrect predictions. In particular, if the difference between the model prediction, $\gamma$, and the true value, $y$, is large, it tends to decrease the pseudo observations, $(\nu, \alpha)$:
(A11) $\frac{\partial \mathcal{L}^{R}}{\partial \nu} = 2\,|y - \gamma|, \qquad \frac{\partial \mathcal{L}^{R}}{\partial \alpha} = |y - \gamma|$
As we mentioned in Sec. 2, the decrease in the pseudo observations leads to an increase in the predictive uncertainty because they are inversely proportional: $u_{ep} = \frac{\beta}{\nu(\alpha - 1)}$ and $u_{al} = \frac{\beta}{\alpha - 1}$. Therefore, the evidential regularizer increases the uncertainty of the incorrect predictions, which have large $|y - \gamma|$.
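To make the regularizer's behaviour concrete, here is a small sketch of the regularizer and its gradients with respect to the pseudo observations (the helper names are ours): the worse the prediction, the harder gradient descent pushes $\nu$ and $\alpha$ down, i.e., the more uncertainty it adds.

```python
def evidential_reg(y, gamma, nu, alpha):
    """Evidential regularizer: |y - gamma| * (2*nu + alpha)."""
    return abs(y - gamma) * (2.0 * nu + alpha)

def evidential_reg_grads(y, gamma):
    """Closed-form gradients: (d/d nu, d/d alpha) = (2|y - gamma|, |y - gamma|)."""
    err = abs(y - gamma)
    return 2.0 * err, err

# Larger prediction error -> larger positive gradient on (nu, alpha),
# so gradient descent shrinks the pseudo observations more strongly.
g_nu_small, g_alpha_small = evidential_reg_grads(1.0, 0.9)
g_nu_large, g_alpha_large = evidential_reg_grads(1.0, -2.0)
```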
Moreover, $\mathcal{L}^{R}$ also depends on $\gamma$, which represents the model prediction, as we can notice from eq. (A10). The derivative of $\mathcal{L}^{R}$ with respect to $\gamma$ is the following:

$$\frac{\partial \mathcal{L}^{R}}{\partial \gamma} = -\,\mathrm{sign}(y - \gamma)\,(2\nu + \alpha).$$