Uncertainty Aware Semi-Supervised Learning on Graph Data

10/24/2020 · Xujiang Zhao, et al. · Virginia Polytechnic Institute and State University · The University of Texas at Dallas · University at Buffalo

Thanks to graph neural networks (GNNs), semi-supervised node classification has achieved state-of-the-art performance on graph data. However, existing GNNs do not consider the different types of uncertainty associated with their class probabilities, increasing the risk of misclassification under uncertainty in real-world settings. In this work, we propose a multi-source uncertainty framework using a GNN that reflects various types of predictive uncertainty, from both the deep learning and the belief/evidence theory domains, for node classification predictions. By collecting evidence from the given labels of the training nodes, the Graph-based Kernel Dirichlet distribution Estimation (GKDE) method is designed to accurately predict node-level Dirichlet distributions and detect out-of-distribution (OOD) nodes. We validated that our proposed model outperforms state-of-the-art counterparts in both misclassification detection and OOD detection on six real network datasets. We found that dissonance-based detection yielded the best results on misclassification detection, while vacuity-based detection was the best for OOD detection. To clarify the reasons behind these results, we provide a theoretical proof that explains the relationships between the different types of uncertainty considered in this work.


1 Introduction

Inherent uncertainties derived from different root causes have emerged as serious hurdles to finding effective solutions for real-world problems. Failing to consider the diverse causes of uncertainty raises critical safety concerns, as misinterpreting uncertainty carries high risk (e.g., the misdetection or misclassification of an object by an autonomous vehicle). Graph neural networks (GNNs) Kipf and Welling (2017); Veličković et al. (2018) have received tremendous attention in the data science community. Despite their superior performance in semi-supervised node classification and regression, they do not consider the various types of uncertainty in their decision process. Predictive uncertainty estimation Kendall and Gal (2017) using Bayesian NNs (BNNs) has been explored for classification prediction and regression in computer vision applications, based on aleatoric uncertainty (AU) and epistemic uncertainty (EU). AU refers to data uncertainty from statistical randomness (e.g., inherent noise in observations), while EU indicates model uncertainty due to limited knowledge (e.g., ignorance) of the collected data. In the belief or evidence theory domain, Subjective Logic (SL) Josang et al. (2018) considers vacuity (i.e., a lack of evidence, or ignorance) as the uncertainty in a subjective opinion. Recently, other uncertainty types, such as dissonance, consonance, vagueness, and monosonance Josang et al. (2018), have been discussed in SL and measured according to their different root causes.

In this work, we first consider multidimensional uncertainty types in both the deep learning (DL) and the belief/evidence theory domains for node-level classification, misclassification detection, and out-of-distribution (OOD) detection tasks. By leveraging the learning capability of GNNs and considering multidimensional uncertainties, we propose an uncertainty-aware estimation framework that quantifies the different uncertainty types associated with the predicted class probabilities. We make the following key contributions:


  • A multi-source uncertainty framework for GNNs. Our proposed framework first provides the estimation of various types of uncertainty from both DL and evidence/belief theory domains, such as dissonance (derived from conflicting evidence) and vacuity (derived from lack of evidence). In addition, we designed a Graph-based Kernel Dirichlet distribution Estimation (GKDE) method to reduce errors in quantifying predictive uncertainties.

  • Theoretical analysis: Our work is the first that provides a theoretical analysis about the relationships between different types of uncertainties considered in this work. We demonstrate via a theoretical analysis that an OOD node may have a high predictive uncertainty under GKDE.

  • Comprehensive experiments for validating the performance of our proposed framework: Based on six real graph datasets, we compared the performance of our proposed framework with that of other competitive counterparts. We found that dissonance-based detection yielded the best results in misclassification detection, while vacuity-based detection performed best in OOD detection.

Note that we use the term ‘predictive uncertainty’ to mean the uncertainty estimated in solving prediction problems.

2 Related Work

DL research has mainly considered aleatoric uncertainty (AU) and epistemic uncertainty (EU) using BNNs for computer vision applications. AU consists of homoscedastic uncertainty (i.e., constant errors for different inputs) and heteroscedastic uncertainty (i.e., different errors for different inputs) Gal (2016). A Bayesian DL framework was presented to simultaneously estimate both AU and EU in regression (e.g., depth regression) and classification (e.g., semantic segmentation) tasks Kendall and Gal (2017). Later, distributional uncertainty was defined based on the distributional mismatch between the testing and training data distributions Malinin and Gales (2018). Dropout variational inference Gal and Ghahramani (2016) was used for approximate inference in BNNs using epistemic uncertainty, similar to DropEdge Rong et al. (2019). Other algorithms have considered overall uncertainty in node classification Eswaran et al. (2017); Liu et al. (2020); Zhang et al. (2019). However, no prior work has considered uncertainty decomposition in GNNs.

In the belief (or evidence) theory domain, uncertainty reasoning has been substantially explored, including Fuzzy Logic De Silva (1995), Dempster-Shafer Theory (DST) Sentz et al. (2002), and Subjective Logic (SL) Josang (2016). Belief theory focuses on reasoning about the inherent uncertainty in information caused by unreliable, incomplete, deceptive, or conflicting evidence. SL considered predictive uncertainty in subjective opinions in terms of vacuity (i.e., a lack of evidence) and vagueness (i.e., failure to discriminate a belief state) Josang (2016). Recently, other uncertainty types have been studied, such as dissonance caused by conflicting evidence Josang et al. (2018). For deep NNs, Sensoy et al. (2018) proposed the evidential deep learning (EDL) model, which uses SL to train a deterministic NN for supervised classification in computer vision based on the sum-of-squares loss. However, EDL did not consider a general method for estimating multidimensional uncertainty, nor did it consider graph structure.

3 Multidimensional Uncertainty and Subjective Logic

This section provides an overview of SL and discusses the multiple types of SL-based uncertainty, called evidential uncertainty, with the measures of vacuity and dissonance. In addition, we give a brief overview of probabilistic uncertainty, discussing the measures of aleatoric and epistemic uncertainty.

3.1 Subjective Logic

SL offers the formulation of a subjective opinion based on both probabilistic logic (PL) Nilsson (1986) and belief theory (BT) Shafer (1976), with two unique extensions. First, SL explicitly represents uncertainty by introducing vacuity of evidence (or uncertainty mass) in its opinion representation. This addresses the limitations of PL by modeling a lack of confidence in probabilities. Second, SL extends traditional BT by incorporating base rates as the prior probabilities in Bayesian theory. The Bayesian nature of SL allows it to use second-order uncertainty to express and reason about the uncertainty mass, where second-order uncertainty is represented by a probability density function (PDF) over first-order probabilities Josang (2016). For multi-class problems, we use a multinomial distribution (i.e., first-order uncertainty) to model class probabilities and a Dirichlet PDF (i.e., second-order uncertainty) to model the distribution of class probabilities. Second-order uncertainty enriches the uncertainty representation with evidence information, which plays a key role in distinguishing OOD samples from conflicting predictions, as detailed later.

Opinions are the arguments in SL. In the multi-class setting, the multinomial opinion of a random variable $y$ in the domain $\mathbb{Y} = \{1, \dots, K\}$ is given by a triplet:

$$\omega = (\boldsymbol{b}, u, \boldsymbol{a}), \qquad (1)$$

where $\boldsymbol{b} = (b_1, \dots, b_K)$, $u$, and $\boldsymbol{a} = (a_1, \dots, a_K)$ denote the belief mass distribution over $\mathbb{Y}$, the uncertainty mass representing vacuity of evidence, and the base rate distribution over $\mathbb{Y}$, respectively, with $u + \sum_{k=1}^{K} b_k = 1$. The probability that $y$ is assigned to the $k$-th class is given by $P(y = k) = b_k + a_k u$, which combines the belief mass with the uncertainty mass using the base rates. In the multi-class setting, $a_k$ can be regarded as the prior preference over the $k$-th class. When no specific preference is given, we assign all the base rates as $a_k = 1/K$.
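To make the opinion-to-probability mapping concrete, here is a minimal numpy sketch of the projected probability $P(y = k) = b_k + a_k u$; the function name and the consistency check are our own illustration, not code from the paper's repository.

```python
import numpy as np

def projected_probability(b, u, a):
    """Projected class probabilities P(y=k) = b_k + a_k * u of a multinomial opinion.

    b: belief mass distribution over the K classes
    u: uncertainty mass (vacuity)
    a: base rate distribution over the K classes
    """
    b, a = np.asarray(b, dtype=float), np.asarray(a, dtype=float)
    # An opinion must satisfy sum(b) + u = 1.
    assert np.isclose(b.sum() + u, 1.0), "invalid opinion: sum(b) + u != 1"
    return b + a * u

# A fully vacuous 3-class opinion falls back on the base rates:
print(projected_probability(b=[0.0, 0.0, 0.0], u=1.0, a=[1/3, 1/3, 1/3]))
```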

3.2 Evidential Uncertainty

In this section, we explain how the second-order uncertainty (evidential uncertainty) is derived from the first-order uncertainty via a Dirichlet PDF. Given the class probabilities $\mathbf{p} = (p_1, \dots, p_K)$ distributed on a simplex of dimensionality $K - 1$, the conditional distribution $P(y \mid \mathbf{p})$ can be related to the marginal distribution $P(y)$ by integrating out $\mathbf{p}$. We take $\text{Dir}(\mathbf{p} \mid \boldsymbol{\alpha})$ as a Dirichlet PDF over $\mathbf{p}$, where $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K)$ is a $K$-dimensional strength vector, with $\alpha_k$ denoting the effective number of observations of the $k$-th class. SL explicitly introduces uncertain evidence through a weight $W$ representing non-informative evidence and redefines the strength parameter as $\alpha_k = e_k + a_k W$, where $e_k \ge 0$ is the amount of evidence (or the number of observations) supporting the $k$-th class, and $W$ is usually set to $K$, the number of classes. Given this definition of the strength parameter, the expectation of the class probabilities is given by:

$$\mathbb{E}[p_k] = \frac{\alpha_k}{S} = \frac{e_k + a_k W}{\sum_{k=1}^{K} e_k + W}, \qquad (2)$$

where $S = \sum_{k=1}^{K} \alpha_k$ is the total Dirichlet strength. By marginalizing out $\mathbf{p}$, we can derive an evidence-based expression of the belief mass and uncertainty mass:

$$b_k = \frac{e_k}{S}, \qquad u = \frac{W}{S}. \qquad (3)$$

SL categorizes uncertainty into two primary sources Josang (2016): (1) basic belief uncertainty derived from single belief masses, and (2) intra-belief uncertainty based on the relationships between different belief masses. These two sources can be boiled down to vacuity and dissonance, respectively, corresponding to a vacuous belief and contradicting beliefs. In particular, the vacuity of an opinion is captured by the uncertainty mass $u$ in (3), while the dissonance of an opinion Josang et al. (2018) is formulated by:

$$u_{diss} = \sum_{k=1}^{K} \frac{b_k \sum_{j \neq k} b_j \, \text{Bal}(b_j, b_k)}{\sum_{j \neq k} b_j}, \qquad (4)$$

where $\text{Bal}(b_j, b_k) = 1 - |b_j - b_k| / (b_j + b_k)$ is the relative mass balance function between two belief masses. The dissonance of an opinion measures how evenly the belief supports the individual classes. Consider a binary classification example with a binomial opinion $(b_1, b_2, u) = (0.49, 0.49, 0.02)$. Based on (4), it has a dissonance value of $0.98$. In this case, although the vacuity is close to zero, the high dissonance indicates that one cannot make a clear decision because both classes have the same amount of supporting belief, revealing strong conflict within the opinion.
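The following sketch computes vacuity and dissonance directly from Dirichlet parameters, assuming uniform base rates and $W = K$ (so $a_k W = 1$ and $b_k = (\alpha_k - 1)/S$); it reproduces the binary example above. The function names are ours.

```python
import numpy as np

def vacuity(alpha):
    """Vacuity u = W / S with W = K (Eq. (3)), for Dirichlet parameters alpha."""
    alpha = np.asarray(alpha, dtype=float)
    return len(alpha) / alpha.sum()

def dissonance(alpha):
    """Dissonance of the opinion induced by Dir(alpha) (Eq. (4)), assuming
    uniform base rates so that the belief masses are b_k = (alpha_k - 1) / S."""
    alpha = np.asarray(alpha, dtype=float)
    S, K = alpha.sum(), len(alpha)
    b = (alpha - 1.0) / S
    total = 0.0
    for k in range(K):
        others = np.delete(b, k)
        denom = others.sum()
        if denom == 0.0:
            continue  # no other belief mass: this class contributes no dissonance
        bal = 1.0 - np.abs(others - b[k]) / (others + b[k] + 1e-12)  # Bal(b_j, b_k)
        total += b[k] * (others * bal).sum() / denom
    return total

# Binary example from the text: b = (0.49, 0.49), u = 0.02  <=>  alpha = (50, 50)
print(vacuity([50, 50]), dissonance([50, 50]))  # -> 0.02, ~0.98
```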

3.3 Probabilistic Uncertainty

For classification, the estimation of probabilistic uncertainty relies on the design of an appropriate Bayesian DL model with parameters $\boldsymbol{\theta}$. Given an input $x$ and a dataset $\mathcal{D}$, we estimate the class probability by $P(y \mid x; \mathcal{D}) = \int P(y \mid x; \boldsymbol{\theta}) \, p(\boldsymbol{\theta} \mid \mathcal{D}) \, d\boldsymbol{\theta}$, and obtain the epistemic uncertainty via the mutual information Depeweg et al. (2018); Malinin and Gales (2018):

$$\underbrace{I[y, \boldsymbol{\theta} \mid x, \mathcal{D}]}_{\text{Epistemic}} = \underbrace{\mathcal{H}\big[\mathbb{E}_{p(\boldsymbol{\theta} \mid \mathcal{D})}[P(y \mid x; \boldsymbol{\theta})]\big]}_{\text{Entropy}} - \underbrace{\mathbb{E}_{p(\boldsymbol{\theta} \mid \mathcal{D})}\big[\mathcal{H}[P(y \mid x; \boldsymbol{\theta})]\big]}_{\text{Aleatoric}}, \qquad (5)$$

where $\mathcal{H}[\cdot]$ is Shannon's entropy of a probability distribution. The first term is the entropy of the expected prediction, which represents the total uncertainty, while the second term is the aleatoric uncertainty, which captures data uncertainty. The difference between the entropy and the aleatoric uncertainty yields the epistemic uncertainty, i.e., the uncertainty stemming from the model parameters.
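In practice, the expectations in Eq. (5) are approximated with Monte-Carlo samples, e.g., stochastic forward passes under dropout. A minimal sketch of this decomposition (the function name is ours; the sampled class distributions are assumed to come from whatever Bayesian approximation is in use):

```python
import numpy as np

def decompose_uncertainty(probs):
    """probs: array of shape (M, K) -- M stochastic forward passes, each row a
    predicted class distribution. Returns (entropy, aleatoric, epistemic)
    following Eq. (5): epistemic = H[E[p]] - E[H[p]] (mutual information)."""
    probs = np.asarray(probs, dtype=float)
    eps = 1e-12
    mean_p = probs.mean(axis=0)
    total = -(mean_p * np.log(mean_p + eps)).sum()                  # H of the mean
    aleatoric = -(probs * np.log(probs + eps)).sum(axis=1).mean()   # mean of the H's
    return total, aleatoric, total - aleatoric

# Disagreeing confident samples -> high epistemic; identical flat samples -> aleatoric only.
print(decompose_uncertainty([[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]]))
print(decompose_uncertainty([[1/3, 1/3, 1/3]] * 10))
```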

4 Relationships Between Multiple Uncertainties

Figure 1: Multiple uncertainties of different predictions.

We use the shorthand notations $u_v$, $u_{diss}$, $u_{alea}$, $u_{epis}$, and $u_{en}$ to represent vacuity, dissonance, aleatoric uncertainty, epistemic uncertainty, and entropy, respectively.

To interpret the multiple types of uncertainty, we show three prediction scenarios of 3-class classification in Figure 1, in each of which the Dirichlet strength parameters are known. To make a prediction with high confidence, the subjective multinomial opinion, following a Dirichlet distribution, will yield a sharp distribution on one corner of the simplex (see Figure 1 (a)). For a prediction with conflicting evidence, called a conflicting prediction (CP), the multinomial opinion should yield a central distribution, representing confidence in predicting a flat categorical distribution over class labels (see Figure 1 (b)). For an OOD scenario with $\boldsymbol{\alpha} = [1, 1, 1]$, the multinomial opinion would yield a flat distribution over the simplex (Figure 1 (c)), indicating high uncertainty due to the lack of evidence. The first technical contribution of this work is as follows.

Theorem 1.

We consider a simplified scenario, where a multinomial random variable $y$ follows a $K$-class categorical distribution, $y \sim \text{Cal}(\mathbf{p})$, the class probabilities $\mathbf{p}$ follow a Dirichlet distribution, $\mathbf{p} \sim \text{Dir}(\boldsymbol{\alpha})$, and $\boldsymbol{\alpha}$ refers to the Dirichlet parameters. Given a total Dirichlet strength $S = \sum_{k=1}^{K} \alpha_k$, for any opinion $\omega$ on the multinomial random variable $y$, we have

  1. General relations on all prediction scenarios.

    (a) $u_v + u_{diss} \le 1$; (b) $u_v > u_{epis}$.

  2. Special relations on the OOD and the CP.

    (a) For an OOD sample with a uniform prediction (i.e., $\boldsymbol{\alpha} = [1, \dots, 1]$), we have

      $1 = u_v = u_{en} > u_{alea} > u_{epis} > u_{diss} = 0$.

    (b) For an in-distribution sample with a conflicting prediction (i.e., $\alpha_1 = \cdots = \alpha_K$, so that $\mathbb{E}[\mathbf{p}] = [1/K, \dots, 1/K]$), if $S \to \infty$, we have

      $u_{en} = 1$ and $\lim_{S \to \infty} u_{diss} = \lim_{S \to \infty} u_{alea} = 1$, with $\lim_{S \to \infty} u_v = \lim_{S \to \infty} u_{epis} = 0$.

The proof of Theorem 1 can be found in Appendix A.1. As demonstrated in Theorem 1 and Figure 1, entropy cannot distinguish OOD (see Figure 1 (c)) from conflicting predictions (see Figure 1 (b)) because entropy is high in both cases. Similarly, neither aleatoric nor epistemic uncertainty can distinguish OOD from conflicting predictions: in both cases, aleatoric uncertainty is high while epistemic uncertainty is low. On the other hand, vacuity and dissonance can clearly distinguish OOD from a conflicting prediction. For example, OOD objects typically show high vacuity with low dissonance, while conflicting predictions exhibit low vacuity with high dissonance. This observation is confirmed empirically through our extensive experiments on the misclassification and OOD detection tasks.
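The orderings in Theorem 1 can be checked numerically. The sketch below evaluates all five measures for a Dirichlet, using the closed form $\mathbb{E}[\mathcal{H}[\text{Cal}(\mathbf{p})]] = -\sum_k \frac{\alpha_k}{S}(\psi(\alpha_k + 1) - \psi(S + 1))$ for the expected entropy and normalizing the entropy-based measures by $\log K$; the uniform-base-rate assumption from Section 3.2 is carried over, and the helper name is ours.

```python
import numpy as np
from scipy.special import digamma

def uncertainties(alpha):
    """All five measures for Dir(alpha); entropy-based ones normalized by log K.
    Assumes uniform base rates, so b_k = (alpha_k - 1) / S and u_v = K / S."""
    alpha = np.asarray(alpha, dtype=float)
    K, S = len(alpha), alpha.sum()
    p = alpha / S                                # expected class probabilities
    b = (alpha - 1.0) / S                        # belief masses
    diss = 0.0
    for k in range(K):
        others = np.delete(b, k)
        if others.sum() > 0:
            bal = 1.0 - np.abs(others - b[k]) / (others + b[k] + 1e-12)
            diss += b[k] * (others * bal).sum() / others.sum()
    entropy = -(p * np.log(p)).sum() / np.log(K)
    aleatoric = -(p * (digamma(alpha + 1) - digamma(S + 1))).sum() / np.log(K)
    return dict(vacuity=K / S, dissonance=diss, entropy=entropy,
                aleatoric=aleatoric, epistemic=entropy - aleatoric)

print(uncertainties(np.ones(3)))        # OOD: 1 = vacuity = entropy > ... > dissonance = 0
print(uncertainties(100 * np.ones(3)))  # CP:  dissonance = 0.99, vacuity = 0.01
```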

5 Uncertainty-Aware Semi-Supervised Learning

In this section, we describe our proposed uncertainty framework for the semi-supervised node classification problem. The overall architecture of the framework is shown in Figure 2.

5.1 Problem Definition

Given an input graph $\mathcal{G} = (\mathbb{V}, \mathbb{E}, \mathbf{r}, \mathbf{y}_{\mathbb{L}})$, where $\mathbb{V} = \{1, \dots, N\}$ is a ground set of nodes, $\mathbb{E} \subseteq \mathbb{V} \times \mathbb{V}$ is a ground set of edges, $\mathbf{r} = [\mathbf{r}_1, \dots, \mathbf{r}_N]^{\top} \in \mathbb{R}^{N \times d}$ is a node-level feature matrix with $\mathbf{r}_i$ the feature vector of node $i$, $\mathbf{y}_{\mathbb{L}} = \{y_i \mid i \in \mathbb{L}\}$ are the labels of the training nodes $\mathbb{L} \subset \mathbb{V}$, and $y_i \in \{1, \dots, K\}$ is the class label of node $i$, we aim to predict: (1) the class probabilities of the testing nodes, $\mathbf{p}_i \in [0, 1]^K$ for $i \in \mathbb{V} \setminus \mathbb{L}$, where $p_{ik}$ is the probability that node $i$ belongs to class $k$; and (2) the associated multidimensional uncertainty estimates introduced by different root causes, $\mathbf{u}_i \in [0, 1]^m$, where $m$ is the total number of uncertainty types.

Figure 2: Uncertainty Framework Overview. Subjective Bayesian GNN (a) designed for estimating the different types of uncertainties (b).

5.2 Proposed Uncertainty Framework

Learning evidential uncertainty. As discussed in Section 3.1, evidential uncertainty can be derived from multinomial opinions, or equivalently from Dirichlet distributions modeling the distribution of class probabilities. Therefore, we design a Subjective GNN (S-GNN) to form the multinomial opinion, i.e., the node-level Dirichlet distribution, for a given node $i$. The conditional probability of $\mathbf{p}_i$ can be obtained by:

$$P(\mathbf{p}_i \mid \mathbf{A}, \mathbf{r}; \boldsymbol{\theta}) = \text{Dir}(\mathbf{p}_i \mid \boldsymbol{\alpha}_i), \quad \boldsymbol{\alpha}_i = f_i(\mathbf{A}, \mathbf{r}; \boldsymbol{\theta}), \qquad (6)$$

where $f_i$ is the output of the S-GNN for node $i$, $\boldsymbol{\theta}$ are the model parameters, and $\mathbf{A}$ is the adjacency matrix. The Dirichlet probability density function is defined by:

$$\text{Dir}(\mathbf{p}_i \mid \boldsymbol{\alpha}_i) = \frac{1}{B(\boldsymbol{\alpha}_i)} \prod_{k=1}^{K} p_{ik}^{\alpha_{ik} - 1}, \qquad (7)$$

where $B(\boldsymbol{\alpha}_i)$ is the multivariate beta function.

Note that S-GNN is similar to a classical GNN, except that we replace the softmax layer (which only outputs class probabilities) with an activation layer (e.g., ReLU). This ensures that the S-GNN outputs non-negative values, which are taken as the parameters of the predicted Dirichlet distribution.
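A minimal PyTorch sketch of this idea, using a dense two-layer GCN for brevity (the paper builds on GCN; the layer sizes, the choice of ReLU for the evidence head, and the `alpha = evidence + 1` parameterization under uniform base rates are our illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubjectiveGCN(nn.Module):
    """Sketch of an S-GNN: a dense two-layer GCN whose head outputs non-negative
    evidence via ReLU instead of softmax probabilities, so that alpha = e + 1
    parameterizes a per-node Dirichlet distribution."""

    def __init__(self, in_dim, hid_dim, num_classes, p_drop=0.5):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hid_dim)
        self.lin2 = nn.Linear(hid_dim, num_classes)
        self.p_drop = p_drop  # kept active at test time for MC-dropout sampling

    def forward(self, x, a_hat):
        # a_hat: normalized adjacency D^{-1/2}(A + I)D^{-1/2}, shape (N, N)
        h = F.relu(a_hat @ self.lin1(x))
        h = F.dropout(h, self.p_drop, training=True)
        evidence = F.relu(a_hat @ self.lin2(h))  # non-negative evidence per class
        alpha = evidence + 1.0                   # Dirichlet parameters (uniform base rates)
        return alpha
```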

Learning probabilistic uncertainty. Since probabilistic uncertainty relies on a Bayesian framework, we propose a Subjective Bayesian GNN (S-BGNN) that adapts the S-GNN to a Bayesian framework, with the model parameters following a prior distribution. The joint class probability of $\mathbf{y}$ can be estimated by:

$$P(\mathbf{y} \mid \mathbf{A}, \mathbf{r}; \mathcal{D}) = \int \int P(\mathbf{y} \mid \mathbf{p}) \, P(\mathbf{p} \mid \mathbf{A}, \mathbf{r}; \boldsymbol{\theta}) \, q(\boldsymbol{\theta}) \, d\mathbf{p} \, d\boldsymbol{\theta}, \qquad (8)$$

where $q(\boldsymbol{\theta})$ is the approximate posterior, estimated via dropout inference, which provides an approximate solution to the posterior by taking samples from the posterior distribution of models Gal and Ghahramani (2016). Thanks to dropout inference, training a DL model directly by minimizing the cross-entropy (or squared-error) loss function can effectively minimize the KL-divergence between the approximating distribution and the full posterior (i.e., $\text{KL}[q(\boldsymbol{\theta}) \,\|\, p(\boldsymbol{\theta} \mid \mathcal{D})]$) in variational inference Gal and Ghahramani (2016); Kendall et al. (2015). Interested readers may refer to Appendix B.8 for more detail.

Therefore, training the S-GNN with stochastic gradient descent enables learning an approximate distribution over the weights, which provides good explainability of the data and prevents overfitting. We use a loss function that computes the Bayes risk with respect to the sum-of-squares loss:

$$\mathcal{L}(\boldsymbol{\alpha}_i, \mathbf{y}_i) = \sum_{k=1}^{K} \Big[ \big(y_{ik} - \mathbb{E}[p_{ik}]\big)^2 + \text{Var}(p_{ik}) \Big], \qquad (9)$$

where $\mathbf{y}_i$ is a one-hot vector encoding the ground-truth class of node $i$, with $y_{ik} = 1$ for the true class $k$ and $y_{ij} = 0$ for all $j \neq k$. Eq. (9) aims to minimize both the prediction error and the variance, maximizing the classification accuracy of each training node by removing excessive misleading evidence.
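Eq. (9) has a closed form, since $\mathbb{E}[p_{ik}] = \alpha_{ik}/S_i$ and $\text{Var}(p_{ik}) = \alpha_{ik}(S_i - \alpha_{ik}) / (S_i^2 (S_i + 1))$ under a Dirichlet. A compact PyTorch sketch (the function name is ours):

```python
import torch

def edl_squared_loss(alpha, y_onehot):
    """Bayes risk of the sum-of-squares loss under Dir(alpha) (Eq. (9)):
    sum_k (y_k - p_k)^2 + Var[p_k], with p_k = alpha_k / S.
    alpha: (N, K) Dirichlet parameters; y_onehot: (N, K) one-hot labels."""
    S = alpha.sum(dim=1, keepdim=True)
    p = alpha / S
    err = (y_onehot - p) ** 2        # squared prediction error
    var = p * (1 - p) / (S + 1)      # Dirichlet variance of each p_k
    return (err + var).sum(dim=1).mean()
```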

5.3 Graph-based Kernel Dirichlet distribution Estimation (GKDE)

Figure 3: Illustration of GKDE. The prior Dirichlet distribution of a target node (red) is estimated based on the training nodes (blue) and the graph structure information.

The loss function in Eq. (9) measures the sum-of-squares loss based on the class labels of the training nodes. However, it does not directly measure the quality of the predicted node-level Dirichlet distributions. To address this limitation, we propose Graph-based Kernel Dirichlet distribution Estimation (GKDE) to better estimate the node-level Dirichlet distributions by using graph structure information. The key idea of GKDE is to estimate a prior Dirichlet distribution for each node based on the class labels of the training nodes (see Figure 3). We then use the estimated prior Dirichlet distributions in the training process so that the model learns the following patterns: (i) nodes far from the training nodes have high vacuity; and (ii) nodes near class boundaries have high dissonance.

Based on SL, let each training node represent one piece of evidence for its class label. Denote the contribution of evidence from training node $j$ to the evidence estimate of node $i$ by $\mathbf{h}(y_j, d_{ij}) = [h_1, \dots, h_K]$, where $h_k$ is obtained by:

$$h_k(y_j, d_{ij}) = \begin{cases} 0, & k \neq y_j \\ g(d_{ij}), & k = y_j \end{cases} \qquad (10)$$

Here $g(d) = \frac{1}{\sigma \sqrt{2\pi}} \exp\big({-\frac{d^2}{2\sigma^2}}\big)$ is the Gaussian kernel function used to estimate the distributional effect between nodes $i$ and $j$, $d_{ij}$ is the node-level distance (the shortest path between nodes $i$ and $j$), and $\sigma$ is the bandwidth parameter. The prior evidence is estimated based on GKDE as $\hat{\mathbf{e}}_i = \sum_{j \in \mathbb{L}} \mathbf{h}(y_j, d_{ij})$, where $\mathbb{L}$ is the set of training nodes, and the prior Dirichlet distribution is $\hat{\boldsymbol{\alpha}}_i = \hat{\mathbf{e}}_i + 1$. During the training process, we minimize the KL-divergence between the model's predicted Dirichlet distribution and the prior: $\min \text{KL}[\text{Dir}(\boldsymbol{\alpha}_i) \,\|\, \text{Dir}(\hat{\boldsymbol{\alpha}}_i)]$. This process prioritizes the extent of data relevance based on the estimated evidential uncertainty, which is proven effective based on the proposition below.
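A sketch of the prior-evidence computation with networkx, assuming nodes are indexed $0, \dots, N-1$ (the helper name and the toy usage are hypothetical):

```python
import numpy as np
import networkx as nx

def gkde_prior(graph, train_nodes, train_labels, num_classes, sigma=1.0):
    """Sketch of GKDE: each training node j contributes Gaussian-kernel evidence
    g(d) = exp(-d^2 / (2 sigma^2)) / (sigma * sqrt(2 pi)) to its own class for
    every node it can reach, with d the shortest-path distance (Eq. (10)).
    Returns the prior Dirichlet parameters alpha_hat = e_hat + 1."""
    e_hat = np.zeros((graph.number_of_nodes(), num_classes))
    norm = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    for j, y_j in zip(train_nodes, train_labels):
        # BFS shortest-path distances from training node j to all reachable nodes
        for i, d in nx.single_source_shortest_path_length(graph, j).items():
            e_hat[i, y_j] += norm * np.exp(-d * d / (2.0 * sigma ** 2))
    return e_hat + 1.0

# Hypothetical usage on a toy graph with two labeled nodes:
alpha_hat = gkde_prior(nx.karate_club_graph(), [0, 33], [0, 1], num_classes=2)
```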

Proposition 1.

Given $L$ training nodes, for any testing nodes $i$ and $j$, let $\mathbf{d}_i = [d_{i1}, \dots, d_{iL}]$ be the vector of graph distances from node $i$ to the training nodes and $\mathbf{d}_j = [d_{j1}, \dots, d_{jL}]$ be the vector of graph distances from node $j$ to the training nodes, where $d_{il}$ is the node-level distance between node $i$ and training node $l$. If $d_{il} \ge d_{jl}$ for all $l \in \{1, \dots, L\}$, then we have

$$\hat{u}_{v_i} \ge \hat{u}_{v_j},$$

where $\hat{u}_{v_i}$ and $\hat{u}_{v_j}$ refer to the vacuity uncertainties of nodes $i$ and $j$ estimated based on GKDE.

The proof for this proposition can be found in Appendix A.2. The above proposition shows that if a testing node is too far from training nodes, the vacuity will increase, implying that an OOD node is expected to have a high vacuity.

In addition, we designed a simple iterative knowledge distillation method Hinton et al. (2015) (i.e., a Teacher Network) to refine the node-level classification probabilities. The key idea is to train our proposed model (the Student) to imitate the outputs of a pre-trained vanilla GNN (the Teacher) by adding a KL-divergence regularization term. This leads to solving the following optimization problem:

$$\min_{\boldsymbol{\theta}} \; \sum_{i \in \mathbb{L}} \mathcal{L}(\boldsymbol{\alpha}_i, \mathbf{y}_i) \;+\; \lambda_1 \sum_{i \in \mathbb{V}} \text{KL}\big[\text{Dir}(\boldsymbol{\alpha}_i) \,\|\, \text{Dir}(\hat{\boldsymbol{\alpha}}_i)\big] \;+\; \lambda_2 \sum_{i \in \mathbb{V}} \text{KL}\big[\hat{P}(y_i \mid \mathbf{A}, \mathbf{r}; \boldsymbol{\phi}) \,\|\, P(y_i \mid \mathbf{A}, \mathbf{r}; \boldsymbol{\theta})\big], \qquad (11)$$

where $\hat{P}(y_i \mid \mathbf{A}, \mathbf{r}; \boldsymbol{\phi})$ is the output of the vanilla GNN (the Teacher), and $\lambda_1$ and $\lambda_2$ are trade-off parameters.
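Putting the pieces together, a sketch of the resulting training objective (the KL directions and the use of the student's expected probabilities in the distillation term are our assumptions; the exact form of Eq. (11) may differ):

```python
import torch
import torch.distributions as D

def joint_loss(alpha, y_onehot, alpha_prior, teacher_probs, lam1=0.1, lam2=0.1):
    """Sketch of the joint objective in Eq. (11): EDL squared loss (Eq. (9))
    + lam1 * KL(Dir(alpha) || Dir(alpha_hat))   -- GKDE prior term
    + lam2 * KL(teacher || student)             -- distillation term.
    All inputs have shape (N, K); alpha_prior comes from GKDE."""
    S = alpha.sum(dim=1, keepdim=True)
    p = alpha / S  # student's expected class probabilities
    edl = ((y_onehot - p) ** 2 + p * (1 - p) / (S + 1)).sum(dim=1).mean()
    kl_gkde = D.kl_divergence(D.Dirichlet(alpha), D.Dirichlet(alpha_prior)).mean()
    kl_teach = (teacher_probs *
                (teacher_probs.clamp_min(1e-12).log() - p.clamp_min(1e-12).log())
                ).sum(dim=1).mean()
    return edl + lam1 * kl_gkde + lam2 * kl_teach
```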

6 Experiments

In this section, we conduct experiments on the tasks of misclassification and OOD detections to answer the following questions for semi-supervised node classification:

Q1. Misclassification Detection: What type of uncertainty is the most promising indicator of high confidence in node classification predictions?

Q2. OOD Detection: What type of uncertainty is a key indicator of accurate detection of OOD nodes?

Q3. GKDE with Uncertainty Estimates: How can GKDE help enhance prediction tasks with what types of uncertainty estimates?

Through extensive experiments, we found the following answers for the above questions:

A1. Dissonance (i.e., uncertainty due to conflicting evidence) is more effective than other uncertainty estimates in misclassification detection.

A2. Vacuity (i.e., uncertainty due to a lack of evidence) is more effective than other uncertainty estimates in OOD detection.

A3. GKDE can indeed help improve the estimation quality of the node-level Dirichlet distributions, resulting in better OOD detection.

6.1 Experiment Setup

Datasets: We used six datasets, including three citation network datasets Sen et al. (2008) (i.e., Cora, Citeseer, and Pubmed) and three new datasets Shchur et al. (2018) (i.e., Coauthor Physics, Amazon Computer, and Amazon Photo). We summarize the description and experimental setup of these datasets in Appendix B.2. The source code and datasets are accessible at https://github.com/zxj32/uncertainty-GNN.

Comparing Schemes: We conducted an extensive comparative performance analysis of our proposed models and several state-of-the-art competitive counterparts. We implemented all models based on the most popular GNN model, GCN Kipf and Welling (2017). We compared our model (S-BGCN-T-K) against: (1) Softmax-based GCN Kipf and Welling (2017) with uncertainty measured based on entropy; (2) Drop-GCN, which adapts Monte-Carlo Dropout Gal and Ghahramani (2016); Ryu et al. (2019) into the GCN model to learn probabilistic uncertainty; (3) EDL-GCN, which adapts the EDL model Sensoy et al. (2018) to GCN to estimate evidential uncertainty; and (4) DPN-GCN, which adapts the DPN method Malinin and Gales (2018) to GCN to estimate probabilistic uncertainty. We evaluated the performance of all models using the area under the ROC curve (AUROC) and the area under the Precision-Recall curve (AUPR) in both experiments Hendrycks and Gimpel (2016).

6.2 Results

Misclassification Detection. The misclassification detection experiment involves detecting whether a given prediction is incorrect using an uncertainty estimate. Table 1 shows that S-BGCN-T-K outperforms all baseline models under both AUROC and AUPR for misclassification detection. The performance of dissonance-based detection is particularly impressive. This confirms that low dissonance (a small amount of conflicting evidence) is the key to maximizing the accuracy of node classification predictions. We observe the following performance order: Dissonance > Entropy > Aleatoric > Vacuity > Epistemic, which is aligned with our conjecture that higher dissonance under a conflicting prediction leads to better misclassification detection. We also conducted experiments on the three additional datasets and observed similar trends, as demonstrated in Appendix C.

Data Model AUROC AUPR Acc
Va.* Dis. Al. Ep. En. Va. Dis. Al. Ep. En.
Cora S-BGCN-T-K 70.6 82.4 75.3 68.8 77.7 90.3 95.4 92.4 87.8 93.4 82.0
EDL-GCN 70.2 81.5 - - 76.9 90.0 94.6 - - 93.6 81.5
DPN-GCN - - 78.3 75.5 77.3 - - 92.4 92.0 92.4 80.8
Drop-GCN - - 73.9 66.7 76.9 - - 92.7 90.0 93.6 81.3
GCN - - - - 79.6 - - - - 94.1 81.5
Citeseer S-BGCN-T-K 65.4 74.0 67.2 60.7 70.0 79.8 85.6 82.2 75.2 83.5 71.0
EDL-GCN 64.9 73.6 - - 69.6 79.2 84.6 - - 82.9 70.2
DPN-GCN - - 66.0 64.9 65.5 - - 78.7 77.6 78.1 68.1
Drop-GCN - - 66.4 60.8 69.8 - - 82.3 77.8 83.7 70.9
GCN - - - - 71.4 - - - - 83.2 70.3
Pubmed S-BGCN-T-K 64.1 73.3 69.3 64.2 70.7 85.6 90.8 88.8 86.1 89.2 79.3
EDL-GCN 62.6 69.0 - - 67.2 84.6 88.9 - - 81.7 79.0
DPN-GCN - - 72.7 69.2 72.5 - - 87.8 86.8 87.7 77.1
Drop-GCN - - 67.3 66.1 67.2 - - 88.6 85.6 89.0 79.0
GCN - - - - 68.5 - - - - 89.2 79.0
  • Va.: Vacuity, Dis.: Dissonance, Al.: Aleatoric, Ep.: Epistemic, En.: Entropy

Table 1: AUROC and AUPR for the Misclassification Detection.
Data Model AUROC AUPR
Va.* Dis. Al. Ep. En. Va. Dis. Al. Ep. En.
Cora S-BGCN-T-K 87.6 75.5 85.5 70.8 84.8 78.4 49.0 75.3 44.5 73.1
EDL-GCN 84.5 81.0 - - 83.3 74.2 53.2 - - 71.4
DPN-GCN - - 77.3 78.9 78.3 - - 58.5 62.8 63.0
Drop-GCN - - 81.9 70.5 80.9 - - 69.7 44.2 67.2
GCN - - - - 80.7 - - - - 66.9
Citeseer S-BGCN-T-K 84.8 55.2 78.4 55.1 74.0 86.8 54.1 80.8 55.8 74.0
EDL-GCN 78.4 59.4 - - 69.1 79.8 57.3 - - 69.0
DPN-GCN - - 68.3 72.2 69.5 - - 68.5 72.1 70.3
Drop-GCN - - 72.3 61.4 70.6 - - 73.5 60.8 70.0
GCN - - - - 70.8 - - - - 70.2
Pubmed S-BGCN-T-K 74.6 67.9 71.8 59.2 72.2 69.6 52.9 63.6 44.0 56.5
EDL-GCN 71.5 68.2 - - 70.5 65.3 53.1 - - 55.0
DPN-GCN - - 63.5 63.7 63.5 - - 50.7 53.9 51.1
Drop-GCN - - 68.7 60.8 66.7 - - 59.7 46.7 54.8
GCN - - - - 68.3 - - - - 55.3
Amazon Photo S-BGCN-T-K 93.4 76.4 91.4 32.2 91.4 94.8 68.0 92.3 42.3 92.5
EDL-GCN 63.4 78.1 - - 79.2 66.2 74.8 - - 81.2
DPN-GCN - - 83.6 83.6 83.6 - - 82.6 82.4 82.5
Drop-GCN - - 84.5 58.7 84.3 - - 87.0 57.7 86.9
GCN - - - - 84.4 - - - - 87.0
Amazon Computer S-BGCN-T-K 82.3 76.6 80.9 55.4 80.9 70.5 52.8 60.9 35.9 60.6
EDL-GCN 53.2 70.1 - - 70.0 33.2 43.9 - - 45.7
DPN-GCN - - 77.6 77.7 77.7 - - 50.8 51.2 51.0
Drop-GCN - - 74.4 70.5 74.3 - - 50.0 46.7 49.8
GCN - - - - 74.0 - - - - 48.7
Coauthor Physics S-BGCN-T-K 91.3 87.6 89.7 61.8 89.8 72.2 56.6 68.1 25.9 67.9
EDL-GCN 88.2 85.8 - - 87.6 67.1 51.2 - - 62.1
DPN-GCN - - 85.5 85.6 85.5 - - 59.8 60.2 59.8
Drop-GCN - - 89.2 78.4 89.3 - - 66.6 37.1 66.5
GCN - - - - 89.1 - - - - 64.0
  • Va.: Vacuity, Dis.: Dissonance, Al.: Aleatoric, Ep.: Epistemic, En.: Entropy

Table 2: AUROC and AUPR for the OOD Detection.

OOD Detection. This experiment involves detecting whether an input example is out-of-distribution (OOD) given an estimate of uncertainty. For semi-supervised node classification, we randomly selected one to four categories as OOD categories and trained the models based on training nodes of the other categories. Due to the space constraint, the experimental setup for the OOD detection is detailed in Appendix B.3.

In Table 2, across the six network datasets, our vacuity-based detection significantly outperformed the other competitive methods, exceeding the performance of epistemic uncertainty and the other types of uncertainty. This demonstrates that the vacuity-based model is more effective than counterparts based on other uncertainty estimates for OOD detection. We observed the following performance order: Vacuity > Aleatoric ≈ Entropy > Dissonance > Epistemic, which is consistent with the theoretical results shown in Theorem 1.

Ablation Study. We conducted additional experiments (see Table 3) in order to demonstrate the contributions of the key technical components, including GKDE, Teacher Network, and subjective Bayesian framework. The key findings obtained from this experiment are: (1) GKDE can enhance the OOD detection (i.e., 30% increase with vacuity), which is consistent with our theoretical proof about the outperformance of GKDE in uncertainty estimation, i.e., OOD nodes have a higher vacuity than other nodes; and (2) the Teacher Network can further improve the node classification accuracy.

6.3 Why is Epistemic Uncertainty Less Effective than Vacuity?

Although epistemic uncertainty is known to be effective for OOD detection in computer vision applications Gal and Ghahramani (2016); Kendall and Gal (2017), our results demonstrate that it is less effective than our vacuity-based approach. The first potential reason is that epistemic uncertainty is always smaller than vacuity (from Theorem 1), which suggests that epistemic uncertainty may capture less information related to OOD. Another potential reason is that the previous success of epistemic uncertainty for OOD detection was limited to supervised learning in computer vision applications; its effectiveness for OOD detection has not been sufficiently validated in semi-supervised learning tasks. Recall that epistemic uncertainty (i.e., model uncertainty) is calculated via the mutual information (see Eq. (5)). In a semi-supervised setting, the features of the unlabeled nodes are also fed to the model during training, giving the model high confidence in its outputs on those nodes. For example, the model output changes little even with differently sampled parameters, i.e., $P(y \mid \mathbf{A}, \mathbf{r}; \boldsymbol{\theta}^{(1)}) \approx P(y \mid \mathbf{A}, \mathbf{r}; \boldsymbol{\theta}^{(2)})$ for $\boldsymbol{\theta}^{(1)}, \boldsymbol{\theta}^{(2)} \sim q(\boldsymbol{\theta})$, which results in low epistemic uncertainty. We also designed a semi-supervised learning experiment for image classification and observed a consistent pattern, as demonstrated in Appendix C.6.

Data Model AUROC (Misclassification Detection) AUPR (Misclassification Detection) Acc
Va.* Dis. Al. Ep. En. Va. Dis. Al. Ep. En.
Cora S-BGCN-T-K 70.6 82.4 75.3 68.8 77.7 90.3 95.4 92.4 87.8 93.4 82.0
S-BGCN-T 70.8 82.5 75.3 68.9 77.8 90.4 95.4 92.6 88.0 93.4 82.2
S-BGCN 69.8 81.4 73.9 66.7 76.9 89.4 94.3 92.3 88.0 93.1 81.2
S-GCN 70.2 81.5 - - 76.9 90.0 94.6 - - 93.6 81.5
AUROC (OOD Detection) AUPR (OOD Detection)
Amazon Photo S-BGCN-T-K 93.4 76.4 91.4 32.2 91.4 94.8 68.0 92.3 42.3 92.5 -
S-BGCN-T 64.0 77.5 79.9 52.6 79.8 67.0 75.3 82.0 53.7 81.9 -
S-BGCN 63.0 76.6 79.8 52.7 79.7 66.5 75.1 82.1 53.9 81.7 -
S-GCN 64.0 77.1 - - 79.6 67.0 74.9 - - 81.6 -
  • Va.: Vacuity, Dis.: Dissonance, Al.: Aleatoric, Ep.: Epistemic, En.: Entropy

Table 3: Ablation study of our proposed models: (1) S-GCN: Subjective GCN with vacuity and dissonance estimation; (2) S-BGCN: S-GCN with Bayesian framework; (3) S-BGCN-T: S-BGCN with a Teacher Network; (4) S-BGCN-T-K: S-BGCN-T with GKDE to improve uncertainty estimation.

7 Conclusion

In this work, we proposed a multi-source uncertainty framework of GNNs for semi-supervised node classification. Our proposed framework provides an effective approach to node classification and out-of-distribution detection that considers multiple types of uncertainty. We leveraged various types of uncertainty estimates from both the DL and evidence/belief theory domains. Through our extensive experiments, we found that dissonance-based detection yielded the best performance on misclassification detection, while vacuity-based detection performed best for OOD detection, compared to other competitive counterparts. In particular, it was noticeable that applying GKDE and the Teacher Network further enhanced the accuracy of node classification and the quality of the uncertainty estimates.

Acknowledgments

We would like to thank Yuzhe Ou for suggestions on the proofs. This work is supported by the National Science Foundation (NSF) under Grants #1815696 and #1750911.

Broader Impact

In this paper, we propose an uncertainty-aware semi-supervised learning framework of GNNs that predicts multidimensional uncertainties for the task of semi-supervised node classification. Our proposed framework can be applied to a wide range of applications, including computer vision, natural language processing, recommendation systems, traffic prediction, generative models, and many more Zhou et al. (2018). It can be used to predict multiple uncertainties of different roots for GNNs in these applications, improving the understanding of individual decisions as well as of the underlying models. While there will be important impacts resulting from the use of GNNs in general, our focus in this work is on investigating the impact of using our method to predict multi-source uncertainties for such systems. The additional benefits of this method include improved safety and transparency in decision-critical applications, helping to avoid overconfident predictions, which can easily lead to misclassification.

We see promising research opportunities that can adopt our uncertainty framework, such as investigating whether this uncertainty framework can further enhance misclassification detection or OOD detection. To mitigate the risk from different types of uncertainties, we encourage future research to understand the impacts of this proposed uncertainty framework to solve other real world problems.

References

  • [1] C. W. De Silva (1995) Intelligent control: fuzzy logic applications. CRC press. Cited by: §2.
  • [2] S. Depeweg, J. Hernandez-Lobato, F. Doshi-Velez, and S. Udluft (2018) Decomposition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning, pp. 1184–1193. Cited by: §3.3.
  • [3] D. Eswaran, S. Günnemann, and C. Faloutsos (2017) The power of certainty: a dirichlet-multinomial model for belief propagation. In Proceedings of the 2017 SIAM International Conference on Data Mining, pp. 144–152. Cited by: §2.
  • [4] T. Fawcett (2006) An introduction to roc analysis. Pattern recognition letters, pp. 861–874. Cited by: 1st item.
  • [5] Y. Gal and Z. Ghahramani (2015) Bayesian convolutional neural networks with bernoulli approximate variational inference. arXiv preprint arXiv:1506.02158. Cited by: §B.8, §B.8.
  • [6] Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §B.8, §C.6, §C.6, §2, §5.2, §6.1, §6.3.
  • [7] Y. Gal (2016) Uncertainty in deep learning. University of Cambridge. Cited by: §2.
  • [8] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §B.6.
  • [9] D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §6.1.
  • [10] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §5.3.
  • [11] A. Josang, J. Cho, and F. Chen (2018) Uncertainty characteristics of subjective opinions. In 2018 21st International Conference on Information Fusion (FUSION), pp. 1998–2005. Cited by: §1, §2, §3.2.
  • [12] A. Josang (2016) Subjective logic. Springer. Cited by: §A.1, §2, §3.1, §3.2.
  • [13] A. Kendall, V. Badrinarayanan, and R. Cipolla (2015) Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680. Cited by: §B.8, §C.6, §5.2.
  • [14] A. Kendall and Y. Gal (2017) What uncertainties do we need in bayesian deep learning for computer vision?. In Advances in neural information processing systems, pp. 5574–5584. Cited by: §C.6, §C.6, §1, §2, §6.3.
  • [15] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §B.6.
  • [16] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §B.4, §1, §6.1.
  • [17] D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, Cited by: §C.6.
  • [18] Z. Liu, S. Li, S. Chen, Y. Hu, and S. Huang (2020) Uncertainty aware graph gaussian process for semi-supervised learning.. In AAAI, pp. 4957–4964. Cited by: §2.
  • [19] L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research, pp. 2579–2605. Cited by: §C.4.
  • [20] A. Malinin and M. Gales (2018) Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, pp. 7047–7058. Cited by: §B.4, §2, §3.3, §6.1.
  • [21] J. McAuley, C. Targett, Q. Shi, and A. Van Den Hengel (2015) Image-based recommendations on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pp. 43–52. Cited by: §B.2.
  • [22] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, pp. 1979–1993. Cited by: §C.6.
  • [23] N. J. Nilsson (1986) Probabilistic logic. Artificial intelligence, pp. 71–87. Cited by: §3.1.
  • [24] Y. Rong, W. Huang, T. Xu, and J. Huang (2019) Dropedge: towards deep graph convolutional networks on node classification. In International Conference on Learning Representations, Cited by: §C.7, §2.
  • [25] S. Ryu, Y. Kwon, and W. Y. Kim (2019) Uncertainty quantification of molecular property prediction with bayesian neural networks. arXiv preprint arXiv:1903.08375. Cited by: §6.1.
  • [26] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad (2008) Collective classification in network data. AI magazine, pp. 93–93. Cited by: §B.2, §6.1.
  • [27] M. Sensoy, L. Kaplan, and M. Kandemir (2018) Evidential deep learning to quantify classification uncertainty. In Advances in Neural Information Processing Systems, pp. 3179–3189. Cited by: §A.1, §B.4, §2, §6.1.
  • [28] K. Sentz, S. Ferson, et al. (2002) Combination of evidence in dempster-shafer theory. Vol. 4015, Sandia National Laboratories Albuquerque. Cited by: §2.
  • [29] G. Shafer (1976) A mathematical theory of evidence. Vol. 42, Princeton university press. Cited by: §3.1.
  • [30] O. Shchur, M. Mumme, A. Bojchevski, and S. Günnemann (2018) Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868. Cited by: §B.2, §B.2, §B.6, §6.1.
  • [31] A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, pp. 1195–1204. Cited by: §C.6.
  • [32] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph Attention Networks. International Conference on Learning Representations. External Links: Link Cited by: §C.2, §1.
  • [33] Z. Yang, W. Cohen, and R. Salakhudinov (2016) Revisiting semi-supervised learning with graph embeddings. In International conference on machine learning, pp. 40–48. Cited by: §B.2.
  • [34] Y. Zhang, S. Pal, M. Coates, and D. Ustebay (2019) Bayesian graph convolutional neural networks for semi-supervised classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5829–5836. Cited by: §2.
  • [35] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2018) Graph neural networks: a review of methods and applications. arXiv preprint arXiv:1812.08434. Cited by: Broader Impact.

Appendix A Proofs

A.1 Theorem 1’s Proof

Restatement of Theorem 1.

Interpretation. Theorem 1.1 (a) implies that increases in both uncertainty types cannot happen at the same time: a higher vacuity leads to a lower dissonance, and vice versa. This indicates that high dissonance occurs only when a large amount of evidence is available and the vacuity is low. Theorem 1.1 (b) shows the relationship between vacuity and epistemic uncertainty: vacuity is an upper bound of epistemic uncertainty. Although some existing approaches [12, 27] treat epistemic uncertainty the same as vacuity, this is not true except in the extreme case where a sufficiently large amount of evidence is available, making vacuity close to zero. Theorem 1.2 (a) and (b) explain how entropy differs from vacuity and/or dissonance. We observe that entropy is 1 when either vacuity or dissonance is 0. This implies that entropy cannot distinguish between types of uncertainty with different root causes. For example, a high entropy is observed both when an example is OOD and when it is misclassified. Similarly, a high aleatoric uncertainty and a low epistemic uncertainty are observed in both cases. However, vacuity and dissonance can capture the different causes of uncertainty, namely a lack of information and knowledge, and conflicting evidence, respectively. For example, an OOD object typically shows high vacuity and low dissonance, while a conflicting prediction exhibits low vacuity and high dissonance.

Proof.

1. (a) Let the opinion be $\omega = (\boldsymbol{b}, u, \boldsymbol{a})$, where $K$ is the number of classes, $b_k$ is the belief for class $k$, $u$ is the uncertainty mass (vacuity), and $u + \sum_{k=1}^{K} b_k = 1$. Dissonance has an upper bound:

$$u_{diss} = \sum_{k=1}^{K} \frac{b_k \sum_{j \neq k} b_j \, \text{Bal}(b_j, b_k)}{\sum_{j \neq k} b_j} \le \sum_{k=1}^{K} \frac{b_k \sum_{j \neq k} b_j}{\sum_{j \neq k} b_j} = \sum_{k=1}^{K} b_k,$$

since $\text{Bal}(b_j, b_k) \le 1$, where $\text{Bal}$ is the relative mass balance. Then we have

$$u_v + u_{diss} \le u + \sum_{k=1}^{K} b_k = 1. \qquad (13)$$

1. (b) For the multinomial random variable $y$, we have

$$y \sim \text{Cal}(\mathbf{p}), \quad \mathbf{p} \sim \text{Dir}(\boldsymbol{\alpha}), \qquad (14)$$

where $\text{Cal}(\mathbf{p})$ is the categorical distribution and $\text{Dir}(\boldsymbol{\alpha})$ is the Dirichlet distribution. Then we have

$$P(y = k) = \mathbb{E}_{\text{Dir}(\boldsymbol{\alpha})}[p_k] = \frac{\alpha_k}{S}, \qquad (15)$$

and the epistemic uncertainty is estimated by the mutual information,

$$u_{epis} = I[y, \mathbf{p}] = \mathcal{H}\big[\mathbb{E}_{\text{Dir}(\boldsymbol{\alpha})}[P(y \mid \mathbf{p})]\big] - \mathbb{E}_{\text{Dir}(\boldsymbol{\alpha})}\big[\mathcal{H}[P(y \mid \mathbf{p})]\big]. \qquad (16)$$

Now we consider another measure of ensemble diversity: the expected pairwise KL-divergence between models in the ensemble. The expected pairwise KL-divergence between two independent distributions $\text{Cal}(\mathbf{p}^{(1)})$ and $\text{Cal}(\mathbf{p}^{(2)})$, where $\mathbf{p}^{(1)}$ and $\mathbf{p}^{(2)}$ are two independent samples from $\text{Dir}(\boldsymbol{\alpha})$, can be computed as $\mathcal{K}[y, \mathbf{p}] = \mathbb{E}_{\mathbf{p}^{(1)}, \mathbf{p}^{(2)}}\big[\text{KL}[\text{Cal}(\mathbf{p}^{(1)}) \,\|\, \text{Cal}(\mathbf{p}^{(2)})]\big]$, which upper-bounds the mutual information, $I[y, \mathbf{p}] \le \mathcal{K}[y, \mathbf{p}]$. For the Dirichlet ensemble, the expected pairwise KL-divergence is

$$\mathcal{K}[y, \mathbf{p}] = \sum_{k=1}^{K} \frac{\alpha_k}{S}\Big[\psi(\alpha_k + 1) - \psi(S + 1) - \psi(\alpha_k) + \psi(S)\Big] = \sum_{k=1}^{K} \frac{\alpha_k}{S}\Big(\frac{1}{\alpha_k} - \frac{1}{S}\Big) = \frac{K - 1}{S}, \qquad (18)$$

where $\psi$ is the digamma function, the derivative of the natural logarithm of the gamma function. Now we obtain the relation between vacuity and epistemic uncertainty,

$$u_{epis} \le \mathcal{K}[y, \mathbf{p}] = \frac{K - 1}{S} < \frac{K}{S} = u_v. \qquad (19)$$

2. (a) For an out-of-distribution sample, $\boldsymbol{\alpha} = [1, \dots, 1]$, so $S = K$ and the vacuity can be calculated as

$$u_v = \frac{K}{S} = 1, \qquad (20)$$

and, since the belief masses are $b_k = (\alpha_k - 1)/S = 0$, we estimate the dissonance as

$$u_{diss} = 0. \qquad (21)$$

Given the expected probability $\bar{p}_k = \alpha_k / S = 1/K$, the entropy (normalized by $\log K$) is calculated based on $\bar{\mathbf{p}}$,

$$u_{en} = \frac{\mathcal{H}[\bar{\mathbf{p}}]}{\log K} = 1, \qquad (22)$$

where $\mathcal{H}$ is the entropy. Based on the Dirichlet distribution, the aleatoric uncertainty refers to the expected entropy,

$$u_{alea} = \frac{\mathbb{E}_{\text{Dir}(\boldsymbol{\alpha})}\big[\mathcal{H}[\text{Cal}(\mathbf{p})]\big]}{\log K} = \frac{-\sum_{k=1}^{K} \frac{\alpha_k}{S}\big(\psi(\alpha_k + 1) - \psi(S + 1)\big)}{\log K},$$

where $\psi$ is the digamma function and $K$ is the number of categories. The epistemic uncertainty can be calculated via the mutual information, $u_{epis} = u_{en} - u_{alea}$.

To compare aleatoric uncertainty with epistemic uncertainty, we first prove that the aleatoric uncertainty (Eq. (A.1)) is monotonically increasing and converges to 1 as $S$ increases. Based on Lemma 1, we have

(25)

Based on Eq. (25) and Eq. (A.1), aleatoric uncertainty is monotonically increasing with respect to $S$, so the minimum aleatoric uncertainty is attained at $S = K$, i.e., at $\boldsymbol{\alpha} = [1, \dots, 1]$.

Similarly, epistemic uncertainty is monotonically decreasing as $S$ increases based on Lemma 1, so the maximum epistemic uncertainty is attained at $S = K$. Then we have

(26)

Therefore, we prove that $u_{alea} > u_{epis}$.

2. (b) For a conflicting prediction, i.e., $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_K]$ with $\alpha_1 = \cdots = \alpha_K = \frac{S}{K}$, the expected probability is $\bar{p}_k = 1/K$, the belief mass is $b_k = \frac{\alpha_k - 1}{S} = \frac{1}{K} - \frac{1}{S}$, and the vacuity can be calculated as

$$u_v = \frac{K}{S}, \qquad (27)$$

and, since all belief masses are equal (so that $\text{Bal}(b_j, b_k) = 1$), the dissonance can be calculated as

$$u_{diss} = \sum_{k=1}^{K} b_k = 1 - \frac{K}{S}.$$

Given the expected probability $\bar{\mathbf{p}} = [1/K, \dots, 1/K]$, the entropy can be calculated based on the Dirichlet distribution,

$$u_{en} = 1, \qquad (29)$$

and the aleatoric uncertainty is estimated as the expected entropy,