I Introduction
Distilling knowledge from graphs is an important task found ubiquitously in applications, such as fraud detection [34], user interest modeling in social networks [43, 44, 29], and bioinformatics [24, 14, 50]. The knowledge helps humans make high-stakes decisions, such as whether to investigate a business or account for fraud detection on Yelp, or whether to conduct expensive experiments on a promising protein in drug discovery. The state-of-the-art approaches model the graphs as directed or undirected graphical models, such as Bayesian networks, Markov Random Fields (MRF)
[18], and Conditional Random Fields (CRF) [42], and distill knowledge from the graphs using inferences based on optimization [48, 12] and message passing [15]. Unlike predictive models on i.i.d. vectors, graphical models capture the dependencies among random variables and carry more insights for decision making. However, the inferences involve iterative and recursive computations, making the inference outcomes cognitively difficult to understand, verify, and ratify, locking away further applications of graphical models (EU law requires algorithmic transparency
[10]) and more accurate models through debugging [46]. We focus on MRFs and belief propagation (BP) inferences [20] that compute marginal distributions, aiming to make the inference outcomes more interpretable and thus cognitively easier for humans. Fig. 1 depicts the problem definition and the proposed solution. Several challenges arise. First, simple but faithful explanations are desired [25] but have not been defined for inferences on MRFs. Prior work [13, 8] approximates a high-dimensional Gaussian through a sparse covariance matrix, which does not explain belief propagation. To explain the marginal distribution on MRFs, using sensitivity analysis, the authors in [5, 7]
proposed to find influential parameters inherent in the model rather than in any particular inference algorithm and its computations. Explanations of parametric linear models and deep networks using surrogate models, differentiation, and feature selection
[35, 19, 11, 26, 4] cannot be applied to graphical model inferences, although our proposal can explain inferences on deep graphical models [16]. Explainable RNNs [23] handle linear chains but not general MRFs. In terms of usability, previous works have studied how to visualize explanations of other models and utilize the explanations in end tasks such as model debugging [40]. It is less known what the best trade-off between graph explanation complexity and faithfulness is for end-users, how to effectively communicate the probabilistic explanations, and how to utilize the explanations. Second, algorithmically, an MRF can be large, densely connected, and cyclic, while simple and faithful explanations need to be found efficiently. BP computes each message using other messages iteratively and recursively until convergence (Eq. (2)). As a result, a message is a function of other messages, and a complete explanation requires the entire history of message computations, possibly from multiple iterations. Furthermore, on cyclic graphs, a converged message can be defined recursively by itself, generating explanations such as "a user is fraudulent because it is fraudulent". A desirable explanation should be free of recursion, but cutting out the cycles may result in a different model and affect faithfulness.
We propose a new approach called "GraphExp" to address the above challenges. Given the full graphical model G and any target variable (the "explanandum") on G, GraphExp finds an "explanans" (or the "explaining") graphical model G' consisting of a smaller number of random variables and dependencies. The goal of GraphExp is to minimize the loss in faithfulness, measured by the symmetric KL-divergence between the marginals of the target inferred on G and G' [41]. Starting from the graph consisting of the target variable only, GraphExp greedily includes the next best variable into the previous subgraph so that the enlarged subgraph has the lowest loss. Theoretically, we prove that: (1) an exhaustive search for the optimal G' with the highest faithfulness (lowest loss) is NP-hard, and furthermore, the objective function is neither monotonic nor submodular, leading to the lack of a performance guarantee for any greedy approximation (Theorem II.1); (2) GraphExp only generates acyclic graphs that are more explainable (Theorem III.1).
There can exist multiple sensible explanations for the same inference outcome [37, 36], and an end-user can find the one that best fits her mental model. We equip GraphExp with beam search [3] to discover a set of distinct, simple, and faithful explanations for a target variable. Regarding scalability, when searching for explanations on densely connected graphs, the branching factor in the search tree can be too large for fast search. While the search is trivially parallelizable, we further propose a safe pruning strategy that retains the desirable candidates while cutting the search space down significantly (Fig. 4). Regarding usability, GraphExp does not commit to a single explanation but allows the users to select one or multiple most sensible explanations for further investigation (Section IV-G). We highlight the contributions as follows:
We define the problem of explaining belief propagation to end-users and study the communication and utility of the explanations.
We propose an optimization problem for finding simple, faithful, and diverse explanations. We prove the hardness of the problem and then propose GraphExp as a greedy approximation algorithm to solve the problem. We analyze the time complexity and the properties of the output graphs. Both parallel search and the proposed pruning strategy deliver at least linear speedup on large datasets.
Empirically, on 10 networks with up to millions of nodes and edges from 4 domains, GraphExp explains BP faithfully and significantly outperforms variants of GraphExp and other explanation methods not designed for graphs. We propose a visualization that allows flexible, intuitive inspection of the found explanations. To demonstrate the utility of the found explanations, we identify a security issue in Yelp spam detection using the found subgraphs.
TABLE I: Comparison of explanation methods (FS [11], LIME [35], GSparse [8], GDiff [5], and GraphExp) in terms of cycle handling, completeness, interpretability, diversity, scalability, and flexibility.
II Problem definition
Notation definitions are summarized in Table II.
Notation  Definition 
G = (V, E)  Undirected graphical model (MRF) 
V, E  Random variables and their connections 
x_i (x_i = c)  Random variables (and their values) 
φ_i (or φ_i(x_i))  Prior probability distribution of x_i (or of x_i = c) 
ψ_{ij}  Compatibility matrix between x_i and x_j 
m_{i→j}  Message passed from x_i to x_j 
b_i  Marginal distribution (belief) of x_i 
KL(p ‖ q)  KL Divergence between p and q 
cut(G')  Variables in G connected to subgraph G' 
N(i)  Neighbors of x_i on G
Given a set of random variables X = {x_1, …, x_n}, each taking values in {1, …, k}, where k is the number of classes, an MRF factorizes the joint distribution P(x_1, …, x_n) as

P(x_1, …, x_n) = (1/Z) ∏_{i} φ_i(x_i) ∏_{(i,j) ∈ E} ψ_{ij}(x_i, x_j),   (1)

where Z normalizes the product to a probability distribution and φ_i is the prior distribution of x_i without considering other variables. The compatibility ψ_{ij} encodes how likely the pair (x_i, x_j) will take a value jointly and captures the dependencies between variables. The factorization can be represented by a graph G consisting of X as nodes and edges E, as shown in Fig. 1. BP inference computes the marginal distributions (beliefs) b_i, i = 1, …, n, based on which human decisions can be made. The inference computes messages from x_i to x_j:
m_{i→j}(x_j) = (1/Z_{ij}) Σ_{x_i} φ_i(x_i) ψ_{ij}(x_i, x_j) ∏_{k ∈ N(i)\{j}} m_{k→i}(x_i),   (2)

where Z_{ij} is a normalization factor so that m_{i→j} is a probability distribution of x_j. The messages in both directions on all edges are updated until convergence (guaranteed when G is acyclic [32]). The marginal of x_i is the belief

b_i(x_i) = (1/Z_i) φ_i(x_i) ∏_{j ∈ N(i)} m_{j→i}(x_i),   (3)

where N(i) is the set of neighbors of x_i on G. We aim to explain how the marginal b_i is inferred by BP. For any x_i, b_i depends on messages over all edges reachable from x_i, and to completely explain how b_i is computed, one has to trace down each message in Eq. (3) and Eq. (2). Such a complete explanation is hardly interpretable due to two factors: 1) on large graphs with long-range dependencies, messages and variables far away from x_i contribute to b_i indirectly through many steps; 2) when there is a cycle, BP needs multiple iterations to converge and a message can be recursively defined by itself [17, 15]. A complete explanation of BP can keep track of all these computations on a call graph [38]. However, the call graph easily becomes too large for humans to interpret or analyze intuitively. Rather, b_i should be approximated using short-range dependencies without iterative and recursive computations. The question is, without completely following the original BP computations, how will the approximation be affected? To answer this question, we formulate the following optimization problem:
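To make the message and belief computations concrete, below is a minimal synchronous sum-product BP sketch in Python. This is our own illustrative code, not the paper's implementation; the dictionary-based interface (`phi` for priors, `psi` for compatibility matrices keyed by edge) is an assumption.

```python
import numpy as np

def bp_messages(nodes, edges, phi, psi, n_iter=10):
    """Synchronous sum-product BP on an MRF (illustrative sketch).
    nodes: iterable of node ids; edges: list of (i, j) pairs;
    phi[i]: prior vector (np.ndarray) of node i;
    psi[(i, j)]: compatibility matrix indexed as psi[(i, j)][x_i, x_j].
    Returns the directed messages and the beliefs."""
    k = len(next(iter(phi.values())))
    nbrs = {i: set() for i in nodes}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    # initialise every directed message to the uniform distribution
    m = {}
    for i, j in edges:
        m[(i, j)] = np.full(k, 1.0 / k)
        m[(j, i)] = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        new_m = {}
        for (i, j) in m:
            # Eq. (2): prior times all incoming messages except the one from j
            prod = phi[i].copy()
            for u in nbrs[i] - {j}:
                prod = prod * m[(u, i)]
            psi_ij = psi[(i, j)] if (i, j) in psi else psi[(j, i)].T
            msg = psi_ij.T @ prod          # sum over x_i
            new_m[(i, j)] = msg / msg.sum()
        m = new_m
    # Eq. (3): belief = normalised prior times all incoming messages
    b = {}
    for i in nodes:
        prod = phi[i].copy()
        for u in nbrs[i]:
            prod = prod * m[(u, i)]
        b[i] = prod / prod.sum()
    return m, b
```

On an acyclic graph a few iterations suffice for convergence; on cyclic graphs this sketch performs loopy BP with no convergence guarantee, matching the difficulty discussed above.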
Definition 1.
Given an MRF G and a target node x_i, extract another MRF G' with x_i on G', containing no more than C variables and no cycle, so that BP computes similar marginals b_i and b'_i on G and G', respectively. Formally, solve the following:

min_{G'} d(b_i, b'_i)   s.t.   x_i ∈ G', |G'| ≤ C, G' has no cycle.   (4)

The objective captures the faithfulness of G', measured by the symmetric KL-divergence between the marginal distributions of x_i on G and G', where

d(b_i, b'_i) = KL(b_i ‖ b'_i) + KL(b'_i ‖ b_i).   (5)

The choice of d as a faithfulness measure can be justified: KL(b_i ‖ b'_i) measures the loss when the "true" distribution is b_i while the explaining distribution is b'_i [41]. Symmetrically, a user can regard b'_i as the "true" marginal, which can be explained by b_i. The simplicity of G' can be measured by the number of variables on G', and for G' to be interpretable, we control the size of G' to be less than C. Since a graphical model encodes a joint distribution of a set of variables, the above problem is equivalent to searching for a joint distribution of a smaller number of variables with fewer variable dependencies, so that the two joint distributions lead to similar marginals of the variable to be explained.
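The symmetric KL-divergence used as the faithfulness measure can be sketched in a few lines (our own illustrative code; the small epsilon guarding against zero probabilities is an assumption, not part of the paper's definition):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence d(p, q) = KL(p||q) + KL(q||p).
    eps guards against zeros in either distribution."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

By construction the measure is symmetric in its arguments and zero exactly when the two marginals coincide.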
If the negation of the above objective function were submodular and monotonically increasing, then a greedy algorithm that iteratively builds G' by adding one variable at a time to increase the negation of the objective function (namely, to decrease d(b_i, b'_i)) could generate a solution whose value is within a factor (1 − 1/e) of the optimum [30].
Definition II.1 (Submodularity).
Let V be a set and 2^V be the power set of V. A set function f: 2^V → ℝ is submodular if for any S ⊆ T ⊆ V and any v ∈ V \ T, f(S ∪ {v}) − f(S) ≥ f(T ∪ {v}) − f(T).
Definition II.2.
A set function f is monotonically increasing if S ⊆ T implies f(S) ≤ f(T).
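For small ground sets, both definitions can be verified by brute force. The sketch below (our own illustration, exponential in |V| and only practical for tiny examples) checks a set function against Definitions II.1 and II.2:

```python
from itertools import combinations

def _subsets(ground):
    """All subsets of a (sortable) ground set, as frozensets."""
    return [frozenset(c) for r in range(len(ground) + 1)
            for c in combinations(sorted(ground), r)]

def is_submodular(f, ground, tol=1e-12):
    """Check f(S ∪ {v}) - f(S) >= f(T ∪ {v}) - f(T) for all S ⊆ T, v ∉ T."""
    for S in _subsets(ground):
        for T in _subsets(ground):
            if not S <= T:
                continue
            for v in set(ground) - T:
                if f(S | {v}) - f(S) < f(T | {v}) - f(T) - tol:
                    return False
    return True

def is_monotone(f, ground, tol=1e-12):
    """Check f(S) <= f(T) whenever S ⊆ T."""
    subs = _subsets(ground)
    return all(f(S) <= f(T) + tol for S in subs for T in subs if S <= T)
```

For instance, cardinality f(S) = |S| passes both checks, while f(S) = |S|^2 is monotone but not submodular, since marginal gains grow with the set size.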
However, we prove that neither property holds, so the objective cannot be efficiently approximated with a guarantee of this kind.
Theorem II.1.
The objective function in Eq. (4) is not submodular nor monotonically increasing.
Proof.
We find a counterexample that violates submodularity and monotonicity. As shown in Fig. 2, the full graph has 3 variables, with the target connected to each of the other two; the priors and the shared homophily-encouraging potential on both edges are given in the figure.
Let the subgraphs be as shown in the figure. One can run BP on each subgraph to compute the objective, with subscripts indicating on which subgraph the belief is computed. The gain from adding the remaining variable to the larger subgraph is greater than the gain from adding it to the smaller one, violating submodularity. On the same example, we can see that adding a variable can increase the objective, violating monotonicity. ∎
III Methodologies
The optimization problem Eq. (4) can be solved by exhaustive search in the space of all possible trees under the specified constraints; it is thus a combinatorial subset maximization problem and NP-hard [11, 1], similar to exhaustive feature selection. Greedy algorithms are a common way to approximately solve NP-hard problems. Since finding multiple alternative sensible explanations is one of our goals, we adopt beam search in the greedy search [3], maintaining in a beam several top candidates ranked by faithfulness and succinctness throughout the search.
A general greedy beam-search framework is presented in Algorithm 1. The algorithm finds multiple explaining subgraphs of size at most C in C − 1 iterations, where C is the maximum subgraph size (which roughly equals the human working memory capacity [28], or the number of cognitive chunks [22]). Starting from the initial subgraph containing only the target, at each step the current subgraph is extended by adding one more explaining node and edge to optimize a certain objective function without forming a loop. After the desired subgraphs are found and right before the algorithm exits, BP is run again on G' to compute b'_i, so that we can use the converged messages on G' to explain to an end-user how b_i is approximated on G'. Since G' is small and contains no cycle, the explanation is significantly simpler than the original computations on G. We substantiate the general framework in the following two sections, with two alternative ways to rank candidate extensions of the subgraphs during the beam search. Before we discuss the two concrete algorithms, a theoretical aspect related to interpretability is characterized below.
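The skeleton of this framework can be sketched as follows. The interface is our own illustration, not the paper's exact Algorithm 1: `neighbors(u)` is assumed to list nodes adjacent to u on the full graph G, and `score(nodes)` stands in for the faithfulness objective of Eq. (4) (lower is better).

```python
def beam_search_explanation(target, neighbors, score, max_size=5, beam_size=3):
    """Greedy beam-search sketch. Each candidate subgraph is a
    (score, node set, edge tuple) triple grown by one node and one edge
    per step, so it stays a tree."""
    beam = [(score({target}), frozenset([target]), ())]
    for _ in range(max_size - 1):
        candidates = []
        for _, nodes, edges in beam:
            for u in nodes:
                for v in neighbors(u):
                    if v in nodes:
                        continue                  # a single new edge: no cycle
                    grown = nodes | {v}
                    candidates.append((score(grown), grown, edges + ((u, v),)))
        if not candidates:
            break
        candidates.sort(key=lambda c: c[0])       # keep the most faithful ones
        beam = candidates[:beam_size]
    return beam
```

Because each extension attaches a new node through exactly one edge, every subgraph in the beam is acyclic, mirroring the tree property established below.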
Theorem III.1.
The output from Algorithm 1 is a tree.
Proof.
We prove this by induction. Base case: the initial subgraph is a single node, so it is a tree. Inductive step: assume the subgraph at step t is a tree. Adding one more variable and edge to it creates no cycle, since the new variable is attached to the subgraph through a single edge. ∎
III-A GraphExp-Global (GEG): search explanations via evaluating entire subgraphs
We propose GEG, an instantiation of Algorithm 1. At iteration t, the algorithm evaluates the candidate extensions of the current subgraph using the objective Eq. (4). Define cut(G'_t) to be the set of nodes in G that are connected to G'_t. A candidate is generated by adding a variable x_v through an edge to G'_t, where x_v is a random variable in cut(G'_t). A new BP procedure is run on the candidate to infer the marginal of x_i, and the distance to b_i is calculated as the quality of the candidate. After exhausting all possible candidates, GEG adds the candidates with the smallest distances to Beam[t].
One search step in Fig. 3 demonstrates the above ideas. The search attempts to extend the middle subgraph in Beam[2] by adding a new variable from the cut to the subgraph, so that the new belief on the larger subgraph is closer to b_i. In this example, GEG keeps the top 3 extensions (beam size is 3), but only two options are legitimate, and these are the only two candidates included in Beam[3]. The middle subgraph is generated by the algorithm GEL to be discussed later (see Section III-C). When the search extends the bottom-right subgraph, the new variable can be connected in two ways, through either of two edges, but both GraphExp variants include only one link to avoid cycles in the explaining subgraphs.
At a high level, GEG is similar to forward wrapper-style feature selection algorithms, where each feature is evaluated by including it in the set of selected features and running a target classification model on the new feature set. The key difference here is that GEG cannot select an arbitrary variable on G, but has to restrict itself to those that will result in an acyclic graph (which is guaranteed by Theorem III.1).
One advantage of GEG is that the objective function in the optimization problem Eq. (4) is minimized directly at each greedy step. However, as each candidate at each step requires a separate BP run, it can be time-consuming. We analyze the time complexity below. To generate a subgraph of maximum size C for a variable, C − 1 iterations are needed. At each iteration, one has to run BP as many times as the number of neighboring nodes of the current explanation: the number of candidates evaluated for one of the candidates in Beam[t−1] equals the number of neighboring nodes of that subgraph on G. On graphs with a small diameter, this number grows quickly toward the number of nodes in G. On the other extreme, if G is a linear chain, this number is no more than 2. For each BP run, it is known that BP converges in a number of iterations equal to the diameter of the graph that BP operates on, which is upper-bounded by the size of the candidate subgraph. During each BP iteration, messages on all edges of the candidate have to be computed. Combining these factors gives the overall time complexity of GEG, which also scales linearly with the beam size. Since the number of classes on the variables is fixed and usually small (relative to the graph size), we ignore the cost of computing one message.
III-B Speeding up GEG on large graphs
Graphical models in real-world applications are usually gigantic, containing tens or hundreds of thousands of nodes. GraphExp can take a long time to finish on such large graphs, especially when the graph diameter is small. Slowness can harm the usability of GraphExp in applications requiring interpretability, for example, when a user wants to inspect multiple explanations of BP for counterfactual analysis, or when statistics of the errors need to be computed over explanations of many nodes [46]. We propose parallelized search and a pruning strategy to speed up GEG.
Parallel search The general GraphExp algorithm can be parallelized on two levels. First, the generation of explanations for multiple target variables can be executed on multiple cores. Second, in the evaluation of the next extensions during beam search, multiple candidates can be tried out at the same time on multiple cores. Particularly for GEG, during the BP inference over each candidate, there are existing parallel algorithms that compute the messages asynchronously [9]. As the subgraphs are bounded by the human cognitive capacity and are small, parallel inference can be overkill. We evaluate the reduction in search time using the first level of parallelism (Section IV-E).
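The first level of parallelism can be sketched as below. This is our own illustration: `explain_one` is an assumed callable mapping a target node to its explaining subgraph, and we use a thread pool for brevity, whereas true CPU parallelism in CPython would use a process-based pool.

```python
from concurrent.futures import ThreadPoolExecutor

def explain_many(targets, explain_one, workers=4):
    """First-level parallelism sketch: explanations of different target
    variables are independent, so they can be computed concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as ex:
        return list(ex.map(explain_one, targets))
```

Since each target's search touches only its own beam and candidate subgraphs, no synchronization between workers is needed.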
Pruning candidate variables In Algorithm 1, all candidates have to be evaluated, and we have to run BP as many times as the size of the cut. When the cut is large, this can be costly. As we aim at explaining how BP infers the marginal of the target x_i, adding any variable whose distribution deviates much from the distribution of x_i is not helpful but confusing. Therefore, we run BP on the candidates at the first step and abandon the bottom p percent of them, based on Eq. (4), in the following steps.
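The pruning step amounts to ranking candidates once by the objective and discarding the worst fraction for the remaining search. A minimal sketch (illustrative interface; `score` stands in for one BP run plus the Eq. (4) objective):

```python
def prune_candidates(candidates, score, keep_ratio=0.5):
    """Score each candidate extension once, then keep only the best
    `keep_ratio` fraction for all later beam-search steps.
    A lower objective value means a more faithful candidate."""
    ranked = sorted(candidates, key=score)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]
```

With a keep ratio of, say, 0.5, half of the BP runs in every subsequent iteration are avoided while the highest-quality candidates survive.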
III-C GraphExp-Local (GEL): search explanations via local message backtracing
Sometimes one may want to trade explanation faithfulness for speed during the subgraph search. For example, in an exploratory phase, a user may want a quick sense of how the inferences are made, or to identify mistakes caused by glitches on the graph before digging deeper into finer-grained explanations. We propose GEL for this purpose, to complement GEG, which generates more faithful and detailed explanations (at the expense of more search time). GEL is based on message backtracing and follows the general beam search, but with more constraints on the search space. At iteration t, the search adds an additional edge that best explains a message or a belief already in the subgraph, using information that is not currently in it. There are two cases.

For a message on an edge already included in the subgraph, the search attempts to find a message from the cut that contributes most to it. We use the distance defined in Eq. (4) to measure the contribution: the smaller the distance, the more similar the two messages are and thus the larger the contribution.

For the belief of the target node x_i, the search attempts to find a message that best explains b_i, using the distance between the message and b_i.
In both cases, we define the endpoints of the subgraph as those nodes that are in the subgraph but can still be connected to nodes outside it; in the example in Fig. 1, two nodes are endpoints of the subgraph in the middle. In GEL, if the prior of an endpoint best explains the message emitted from the endpoint or the belief of the endpoint, the prior is added to the subgraph and no extension can be made to that endpoint: the prior is a parameter of the graphical model and is not computed by BP, so no other BP distribution can further explain it. The search of GEL stops at this branch, although the other branches in the beam can be extended further. In the same example in Fig. 1, when a prior best explains the message it emits, that branch is considered finished and requires no more extension.
Analysis of GEL We analyze what the above search is optimizing. We unify the two cases where the search explains how a message and a belief are computed. Assume that the potential functions for all edges are identity matrices, which is the case when a graph exhibits strong homophily. Then the message going from x_i to x_j in Eq. (2) is proportional to
m_{i→j}(x) ∝ φ_i(x) ∏_{k ∈ N(i)\{j}} m_{k→i}(x),   (6)
which is of the same form as Eq. (3), where a belief is computed. Therefore, both messages and beliefs can be written as a product of factors. An explaining distribution uses a subset of these factors at an endpoint to be explained; if all factors are included, the distance is 0, but many edges (factors) are then included in the subgraph. Starting from a uniform distribution, the search each time finds an approximating distribution by including one additional factor, minimizing the distance over all distributions representing the so-far included messages and beliefs computed by BP on G, and over all factors (messages or priors) that have not been included but contribute to them. Therefore, the algorithm does not directly attempt to optimize the objective for the target x_i, but does so locally: it keeps adding factors to best explain one of the next endpoints, which can already explain the target variable to some extent.

Variants of GEL To further speed up GEL (see Fig. 4), especially on graphs where a user has prior knowledge about the topology of the explaining subgraph, its search space can be further constrained. On the one hand, we can extend a candidate only at the endpoint added most recently, creating a chain of variables in which each one explains the next. This aligns with the conclusion that causal explanations are easier for end-users to understand [25], and our explanations of the inference are indeed causal: how b_i is computed by a smaller subgraph is presented to the end-users. On the other hand, when a target variable has many incoming messages (which is the case on social and scientific networks), it is best to spend the explaining capacity on direct neighbors. In the experiments, we adopt these two constraints for GEL on the review networks and the remaining networks, respectively.
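The local step of GEL reduces to picking, among candidate factors (incoming messages or priors), the one closest to the distribution being explained. A minimal sketch, with function and key names of our own invention:

```python
import numpy as np

def _sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence, as in Eq. (5)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def best_explaining_factor(target_dist, candidate_factors):
    """Among candidate factors (dict mapping names to distributions),
    return the name of the one closest to the distribution being
    explained; smaller distance means larger contribution."""
    return min(candidate_factors,
               key=lambda name: _sym_kl(target_dist, candidate_factors[name]))
```

Unlike GEG, this step needs no fresh BP run per candidate, which is where GEL's speed advantage comes from.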
IV Experiments
In this section, we examine the explanation faithfulness and interpretability of GEL, GEG, and state-of-the-art baselines, including LIME, on ten networks from four domains (spam detection, citation analysis, social networks, and bioinformatics). We also evaluate the scalability of these methods and conduct sensitivity analyses. Moreover, we conduct a user study to demonstrate the usability of GEG.
IV-A Datasets
We drew datasets from four applications. First, we adopt the same three Yelp review networks (YelpChi, YelpNYC, and YelpZip) from [34] for spam detection tasks. We represent reviewers, reviews, and products and their relationships (reviewer-review and review-product connections) by an MRF. BP can infer the labels (suspicious or normal) of reviews on the MRF given no labeled data but just prior suspiciousness probabilities computed from metadata [34]. Second, in collective classification, we construct an MRF for each of three citation networks (Cora, Citeseer, and PubMed) that contain papers as nodes and undirected edges as paper citation relationships [29]. As a transductive learning algorithm, BP can infer the distribution of paper topics for an unlabeled paper, given labeled nodes in the same network. Third, we represent blogs (BlogCatalog), videos (Youtube), and users (Flickr) as nodes and behaviors including subscription and tagging as edges [43]
. BP infers the preferences of users. The goal is to classify social network nodes into multiple classes. Lastly, in biological networks, we adopt the networks analyzed in [50], where nodes denote protein-protein pairs and the subordination relation of a protein pair is the class. Explaining BP inference is important in all these applications: the MRFs are in general too large and cyclic for a user to thoroughly inspect why a paper belongs to a certain area, why a review is suspicious, why a blog is under a specific topic, or why two proteins connect to each other. It is much easier for the user to verify the inference outcome on much smaller explaining graphs. The statistics of the datasets are shown in Table III.

Datasets  Classes  Nodes  Edges  edge/node 
YelpChi  2  105,659  269,580  2.55 
YelpNYC  2  520,200  1,436,208  2.76 
YelpZip  2  873,919  2,434,392  2.79 
Cora  7  2,708  10,556  3.90 
Citeseer  6  3,321  9,196  2.78 
PubMed  3  19,717  44,324  2.25 
Youtube  47  1,138,499  2,990,443  2.63 
BlogCatalog  39  10,312  333,983  32.39 
Flickr  195  80,513  5,899,882  73.28 
Bioinformatics  144  13,682  287,916  21.04 
IV-B Experimental setting
On Yelp review networks, a review has two and only two neighbors (the reviewer who posts the review and the product that receives the review), while a reviewer and a product can be connected to multiple reviews. On the remaining networks, nodes are connected to other nodes without constraints on the number of neighbors or the type of nodes. We apply the two variants of GEL on Yelp and the other networks, respectively. Psychology studies show that human beings can process about seven items at a time [28]. To balance faithfulness and interpretability, both GEL and GEG search for subgraphs of maximum size five starting from the target node. In the demo, we explore larger explaining subgraphs to allow users to select the smallest subgraph that makes the most sense.
On all ten networks, we assume homophily between any pair of connected nodes. For example, a paper is more likely to be on the same topic as a neighboring paper, and two connected nodes are more likely to be suspicious or normal at the same time. On Yelp, we set node priors and compatibility matrices according to [34]. On the other networks, we assign 0.9 to the diagonal and (1 − 0.9)/(k − 1) to the rest of the compatibility matrices, where k is the number of classes in the data. As for priors, we assign 0.9 to the true class of a labeled node and (1 − 0.9)/(k − 1) to the remaining classes. For unlabeled nodes, we set a uniform distribution over the classes. On Yelp, there are no labeled nodes, and on the Youtube network, 5% of the nodes are labeled. For the remaining networks, we set the ratio of labeled data to 50%. Considering the size of the large networks, we sample 1% of the unlabeled nodes as target nodes on the Youtube and Flickr datasets, and 20% of the unlabeled nodes as target nodes on the BlogCatalog and Bioinformatics networks.
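The prior and compatibility construction described above can be sketched as follows. The spreading of the remaining probability mass evenly over the other classes is our reading of the setting, and the function names are our own:

```python
import numpy as np

def make_compatibility(k, diag=0.9):
    """Homophily-encouraging compatibility matrix: `diag` on the diagonal
    and the remaining mass spread evenly over the other k - 1 classes."""
    off = (1.0 - diag) / (k - 1)
    psi = np.full((k, k), off)
    np.fill_diagonal(psi, diag)
    return psi

def make_prior(k, label=None, conf=0.9):
    """Node prior: uniform if unlabeled; otherwise `conf` on the true
    class and (1 - conf)/(k - 1) on each remaining class."""
    if label is None:
        return np.full(k, 1.0 / k)
    p = np.full(k, (1.0 - conf) / (k - 1))
    p[label] = conf
    return p
```

Each row of the compatibility matrix and each prior sums to one, so they plug directly into the BP message and belief computations of Section II.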
Method  Embedding  LIME  Random  GEL  Random  GEG (=1)  GEG (=3)  Comb 
Chi  0.058[5.0]  5.300  0.053[3.9]  0.036[3.9]  0.022[5.0]  0.0012[5.0]  0.0012[5.0]  0.0006[6.5] 
NYC  0.084[5.0]  5.955  0.043[4.1]  0.028[4.1]  0.017[5.0]  0.0012[5.0]  0.0011[5.0]  0.0006[6.2] 
Zip  0.084[5.0]  6.036  0.040[4.2]  0.025[4.2]  0.010[5.0]  0.0014[5.0]  0.0013[5.0]  0.0008[6.1] 
Cora  0.527[4.8]  1.321  0.362[3.6]  0.181[3.6]  0.594[4.8]  0.137[4.8]  0.132[4.9]  0.084[6.4] 
Citeseer  0.305[4.4]  1.221  0.243[2.9]  0.108[2.9]  0.340[4.4]  0.077[4.4]  0.075[4.4]  0.048[5.7] 
PubMed  0.842[5.0]  0.910  0.718[3.1]  0.577[3.1]  0.893[5.0]  0.188[5.0]  0.185[5.0]  0.098[7.1] 
Youtube  0.340[5.0]    0.376[2.8]  0.321[2.8]  0.343[5.0]  0.263[5.0]  0.264[5.0]  0.225[6.7] 
Flickr  5.903[5.0]    6.259[4.7]  6.232[4.7]  6.018[5.0]  4.654[5.0]  4.652[5.0]  4.111[7.4] 
BlogCatalog  7.887[5.0]    7.899[4.8]  8.054[4.8]  7.867[5.0]  6.621[5.0]  6.702[5.0]  6.343[7.9] 
Bioinformatics  2.065[5.0]    2.085[4.9]  1.893[4.9]  2.116[5.0]  1.423[5.0]  1.508[5.0]  1.356[5.6] 
significantly (pairwise t-test at 5% significance level) outperforms Random, and whether Comb outperforms GEG (=3), respectively.

IV-C Baselines
Random It ignores the messages computed by BP and selects a node in the cut randomly when extending the subgraph. For a fair comparison, Random searches subgraphs of the same structure as those found by GEL and GEG, respectively.
Embedding It constructs subgraphs with the same size as those found by GEG. However, it utilizes DeepWalk [33] to obtain node embeddings, based on which the top candidate nodes most similar to the target are included to explain the target variable.
LIME [35]
It is a state-of-the-art black-box explanation method that works for classification models whose input data are vectors rather than graphs. We randomly select 200 neighbors of each target node in the node feature vector space, with sampling probability weighted by the cosine similarity between the neighbors and the target. The feature vector is defined either as in [34] (on Yelp review networks) or as bag-of-words (on the citation networks). A binary/multiclass logistic regression model is then fitted on the sample and used to approximate the decision boundary around the target.
LIME is less efficient than the subgraph-search-based approaches, since a new classification model has to be fitted for each target variable. LIME explains 30% of all review nodes, randomly sampled, on the Yelp review datasets, and explains all unlabeled variables on the citation networks. It cannot explain nodes in the remaining four networks due to the lack of feature vectors, which is one of the drawbacks of LIME.

Comb It aggregates all candidate subgraphs from Beam[C] into a single graph as an explanation. This method can aggregate at most beam-size subgraphs. Here, we set the beam size to 3 and report the performance of the combined subgraph. The performance of the top candidate in Beam[C] with beam sizes 1 and 3 is reported under GEG (=1) and GEG (=3), respectively.
IV-D Explanation Accuracy
Overall Performance
Quantitative evaluation metrics For each method, we run BP on the extracted subgraph
for each target variable to obtain a new belief (except for LIME, which does not construct subgraphs). Explanation faithfulness is measured by the objective function in Eq. (4). In Table IV, we report the mean of the performance metric over all target variables, and the best methods are boldfaced. We also report the average size of the explaining subgraphs in square brackets after the individual means.

From Table IV, we can conclude that: 1) Comb always constructs the largest subgraph and performs best, due to multiple alternative high-quality explanations from the beam branches. 2) GEG (=3) is the runner-up and better than GEG (=1), because the search space of GEG (=3) is larger than that of GEG (=1). 3) The performance of Embedding is not very good, but still better than LIME in all cases. LIME has the worst performance, as it is not designed to explain BP and cannot take the network connections into account. 4) Faithfulness is positively correlated with subgraph size.
Spam detection explanations On Yelp review networks, GEL generates chainlike subgraphs. The average subgraph size is around four, even though the maximum capacity is five. This is because GEL focuses on local information only and stops early when the prior of the last added node best explains the previously added message. Both GEG versions extend the subgraph to the maximum size to produce a better explanation. Notice that Random performs better when imitating GEG (=1) than when imitating GEL. The reason is that there are only two types of neighboring nodes of the target node and Random imitating GEG has a higher chance to include the better neighbor and also generates larger subgraphs.
Collective classification tasks In these tasks, GEL constructs star-like subgraphs centered at the target node. On Cora and Citeseer, the performance of GEL is closer to (but still inferior to) GEG with both beam sizes, compared to their performance difference on Yelp. This is because the Cora and Citeseer networks consist of many small connected components, most of which contain fewer than five nodes, essentially capping GEG's ability to add more explaining nodes to further increase faithfulness (evidenced by the average size of the subgraphs found by GEG being smaller than 5). Compared with the Yelp review networks, interestingly, Random imitating GEG generates larger subgraphs but performs worse than Random imitating GEL. The reason is that GEG can add nodes far away from the target node, and the random imitation will do the same. However, without the principled guidance of GEG, Random can add nodes that are likely in classes other than the class of the target node.
IV-E Scalability
We run GEG (beam size 1) on the Yelp review networks, which have the most nodes to be explained. The scalability of the algorithm is demonstrated in two aspects in Fig. 4. First, the search can be done in parallel on multiple cores: the running time goes down superlinearly as the number of cores increases from 1 to 14. Second, candidate pruning speeds up the search. Using 14 cores, we increase the pruning ratio from 0 to 99% and observe a speedup that grows with the pruning ratio. Importantly, the explanation faithfulness is not affected by the pruning, as shown by the three lines at the bottom of the right figure.
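Both levers can be sketched briefly: the per-target searches are independent, so they parallelize trivially, and pruning drops a fixed fraction of candidates by a cheap score before the expensive search. All function names and the pruning criterion below are illustrative, not the paper's implementation:

```python
from multiprocessing import Pool

def prune(candidates, cheap_score, ratio):
    """Keep only the best (1 - ratio) fraction of candidates, ranked by a
    cheap precomputed score (lower is better). Illustrative criterion."""
    keep = max(1, int(len(candidates) * (1 - ratio)))
    ranked = sorted(candidates, key=lambda c: cheap_score[c])
    return ranked[:keep]

def explain_all(targets, explain_one, n_cores=14):
    """Target variables are independent, so per-target explanation
    searches can run on separate cores."""
    with Pool(n_cores) as pool:
        return pool.map(explain_one, targets)
```

Because pruning only shrinks the candidate pool fed to the search, it can preserve faithfulness as long as the cheap score rarely discards the nodes the full search would have chosen, which is consistent with the flat faithfulness curves in Fig. 4.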
IV-F Sensitivity analysis
There are two hyperparameters for the subgraph search, and we study the sensitivity of the explanation faithfulness with respect to them. First, when there are labeled nodes on the network, the explaining subgraphs may include a labeled node, and the more labeled nodes there are, the more likely the subgraphs will include them. The questions here are whether including such labeled nodes improves explanation faithfulness, and whether our method requires a large number of labeled nodes. To answer them, on the four networks (Cora, Citeseer, PubMed, and Bioinformatics), we vary the ratio of labeled nodes over {5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 95%}. The explanation performances are shown in Fig. 5. Over all ratios on all four networks, Comb is the best and GEG is the runner-up. On the Cora and Citeseer networks, the performance of all methods is not very sensitive to the ratio of labeled nodes. On the PubMed network, most baselines improve, except LIME, which becomes worse. One can also conclude that LIME, designed for parametric models including deep models, is outperformed by methods designed specifically for MRFs. On the Bioinformatics network, LIME is not applicable, and GEG and Comb perform better.

Second, subgraphs containing more nodes perform better, since they carry more context about the target. To evaluate the sensitivity of faithfulness with respect to subgraph size, we downsize the subgraphs found by the other methods (Embedding, GEG (beam size 1), GEG (beam size 3), and Random imitating GEL and GEG) to the same size as those found by GEL, which may stop growing subgraphs before reaching the cap. The results obtained on the four networks are shown in Fig. 6. On Yelp, GEL, GEG, and Random perform the same when the size is two since, with this size, the imitating Random has to adopt the same subgraph topology and node type as the subgraphs found by GEL and GEG.
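The size-matched comparison can be sketched as trimming every method's subgraph to a common budget before re-scoring. The trimming rule here (keep the target plus its nearest nodes by hop distance) is a hypothetical choice; any fixed rule yields a fair, size-controlled comparison:

```python
def downsize(subgraph, target, hops, budget):
    """Trim `subgraph` to `budget` nodes, keeping the target first and
    then the nodes nearest to it. `hops[n]` is the hop distance from the
    target; the ordering rule is illustrative."""
    ranked = sorted(subgraph, key=lambda n: (n != target, hops.get(n, 0)))
    return ranked[:budget]
```

Re-evaluating faithfulness on each trimmed subgraph then isolates the effect of subgraph size from the effect of which nodes a method selects.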
As we downsize the best subgraphs found by GEG, performance can degrade: a subgraph that is better at a larger size may not be optimal when trimmed to size two, since the problem lacks the optimal substructure that would facilitate dynamic programming.
IV-G Explanation visualization and utility
Explanation for Ratification. One of the goals of explanations is to help end-users understand why the probabilistic beliefs are appropriate given the MRF, so that they can develop confidence in the beliefs and in BP. This process is called "ratification" [41]. To attain this goal, Fig. 7 displays multiple explanations (trees) of increasing size generated by Comb on the Karate Club network, along with the faithfulness measured by the distance from Eq. (4). One can see that the distance decreases exponentially fast as more nodes are added to the explanation. On each explaining subgraph (tree), we display the beliefs found by BP on the full network and on the subgraph, so that the user is aware of the gap. Since insight is personal [41], the interface is endowed with flexibility for a user to navigate through alternatives and to use multiple metrics (distributional distance, subgraph size, and topology) to select the more sensible ones. This design also allows the user to see how adding or removing a variable can alter the target belief, leading to more insight into, and trust in, BP.
Explanation for Verification. MRF designers can use the explanations to verify their designs, including network parameters and topology. We demonstrate how to use the generated subgraphs to identify a security issue in spam detection. Specifically, we desire that a spam review be detected with equal probability by the same spam detector, regardless of how many reviews its author has posted. On the one hand, it has been found that well-camouflaged elite spammer accounts can deliver better rating promotion [27], a fact that has been exploited by dedicated spammers [49]. On the other hand, a spam detector can be accurate in detecting fake reviews for the wrong reason, such as the prolificacy of the reviewers [36].
We gather all reviews from the YelpChi dataset and, from the generated subgraph explanations, create four features, including whether a review is connected to its reviewer and target product in the explanation, and the degrees of its two potential neighbors. We then train a logistic regression model on these features to predict the probability of a review being a false positive (4,501), a false negative (6,681), or misclassified (11,182), using predictions based on SpEagle [34]. The point is that the above mistakes should not depend on the prolificacy of the connected reviewers and products. However, based on our explanations, we found that a sizable subset of the false negatives (1,153 out of 6,681) are due to the influence from the review node only. Moreover, the logistic model shows that the degree of the neighboring reviewer contributes the most to the probability of a false negative. This is a serious security issue: a spam review from a prolific reviewer is more likely to be missed by SpEagle.
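The feature construction above can be sketched as follows; the dictionary keys and function name are hypothetical, not the paper's code, and the resulting vectors would then be fed to an off-the-shelf logistic regression (e.g. `sklearn.linear_model.LogisticRegression`):

```python
import numpy as np

def review_features(review, explanation_edges, degree):
    """Four features for one review, derived from its subgraph explanation:
    whether the explanation connects the review to its reviewer / product,
    and the degrees of those two potential neighbors in the full network."""
    reviewer, product = review["reviewer"], review["product"]
    return np.array([
        float((review["id"], reviewer) in explanation_edges),
        float((review["id"], product) in explanation_edges),
        degree.get(reviewer, 0),   # reviewer prolificacy
        degree.get(product, 0),    # product popularity
    ])
```

If the detector were behaving as desired, the fitted coefficients on the two degree features would be near zero; a large coefficient on the reviewer-degree feature is exactly the dependence on prolificacy flagged above.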
V Related work
To explain differentiable parametric predictive models, such as deep networks [19] and linear models [26, 47], the gradients of the output with respect to the parameters and input data [39] can signify key factors that explain the output. However, graphical models aim to model long-range and more complicated types of interaction among variables. If a model is too large or complex to be explained, an approximating model can provide a certain amount of transparency into the full model. In [35, 2], parametric or nonparametric models are fitted to approximate a more complex model locally. The idea of approximation is similar to that in GraphExp, with different approximation loss functions. We have seen in the experiments that a parametric model does not deliver good explanations of the inference outcomes on a graphical model. In [21], an HMM is trained to approximate an RNN model, resembling the idea of using a simpler graphical model to approximate a more complex graphical model. However, both HMMs and RNNs are chain-structured, while GraphExp focuses on graphs with more general topology, including cycles.

Explainable Bayesian networks were studied in the 1980s and 1990s [31, 41], driven by the need to verify and communicate the inference outcomes of Bayesian networks in expert systems. It has long been recognized that human users are less likely to adopt expert systems without interpretable and probable explanations [45]. More recently, Bayesian networks were formulated as a multilinear function so that explanations can be facilitated by differentiation [7]. The fundamental difference between GraphExp and these prior works is that we handle MRFs with cycles while they handled Bayesian networks without cycles. The differentiation-based explanation of MRFs in [5] finds a set of important network parameters (potentials) to explain changes in the marginal distribution of a target variable, without explaining any inference procedure. GraphExp generates graphical models consisting of prominent variables for reproducing the inference of a BP procedure on a larger graph. Interpretable graphical models are also studied under the hood of topic models [6], where the focus is to communicate the meaning of the inference outcomes through measures or prototypes (which words belong to a topic), rather than explaining how the outcomes were arrived at.
References
[1] Edoardo Amaldi and Viggo Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1-2):237–260, 1998.

[2] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to explain individual classification decisions. Journal of Machine Learning Research, 11(Jun):1803–1831, 2010.
[3] Dhruv Batra, Payman Yadollahpour, Abner Guzman-Rivera, and Gregory Shakhnarovich. Diverse M-best solutions in Markov random fields. In European Conference on Computer Vision, pages 1–16. Springer, 2012.
[4] Rich Caruana, Hooshang Kangarloo, JD Dionisio, Usha Sinha, and David Johnson. Case-based explanation of non-case-based learning methods. In Proceedings of the AMIA Symposium, page 212. American Medical Informatics Association, 1999.

[5] Hei Chan and Adnan Darwiche. Sensitivity analysis in Markov networks. In International Joint Conference on Artificial Intelligence, volume 19, page 1300. Lawrence Erlbaum Associates, 2005.
[6] Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L Boyd-Graber, and David M Blei. Reading tea leaves: How humans interpret topic models. In Advances in Neural Information Processing Systems, pages 288–296, 2009.
[7] Adnan Darwiche. A differential approach to inference in Bayesian networks. Journal of the ACM (JACM), 50(3):280–305, 2003.

[8] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
[9] Joseph Gonzalez, Yucheng Low, and Carlos Guestrin. Residual splash for optimally parallelizing belief propagation. In Artificial Intelligence and Statistics, pages 177–184, 2009.
[10] Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation". AI Magazine, 38(3):50–57, 2017.
[11] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3(Mar):1157–1182, 2003.
[12] Tamir Hazan and Amnon Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Transactions on Information Theory, 56(12):6294–6316, 2010.
[13] Jean Honorio, Dimitris Samaras, Irina Rish, and Guillermo Cecchi. Variable selection for Gaussian graphical models. In Artificial Intelligence and Statistics, pages 538–546, 2012.
[14] Qiang Huang, Ling-Yun Wu, and Xiang-Sun Zhang. An efficient network querying method based on conditional random fields. Bioinformatics, 27(22):3173–3178, 2011.
[15] Saehan Jo, Jaemin Yoo, and U Kang. Fast and scalable distributed loopy belief propagation on real-world graphs. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 297–305. ACM, 2018.

[16] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pages 2946–2954, 2016.
[17] U Kang, Duen Horng Chau, et al. Inference of beliefs on billion-scale graphs. 2010.
[18] Ross Kindermann. Markov random fields and their applications. American Mathematical Society, 1980.
[19] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1885–1894. JMLR.org, 2017.
[20] Daphne Koller and Nir Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.
[21] Viktoriya Krakovna and Finale Doshi-Velez. Increasing the interpretability of recurrent neural networks using hidden Markov models. arXiv preprint arXiv:1606.05320, 2016.
[22] Isaac Lage, Emily Chen, Jeffrey He, Menaka Narayanan, Been Kim, Sam Gershman, and Finale Doshi-Velez. An evaluation of the human-interpretability of explanation. arXiv preprint arXiv:1902.00006, 2019.
 [23] Tao Lei, Regina Barzilay, and Tommi Jaakkola. Rationalizing neural predictions. arXiv preprint arXiv:1606.04155, 2016.
[24] Ming-Hui Li, Lei Lin, Xiao-Long Wang, and Tao Liu. Protein–protein interaction site prediction based on conditional random fields. Bioinformatics, 23(5):597–604, 2007.
 [25] Tania Lombrozo. The structure and function of explanations. Trends in cognitive sciences, 10(10):464–470, 2006.
 [26] Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible models for classification and regression. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 150–158. ACM, 2012.
[27] Michael Luca. Reviews, reputation, and revenue: The case of Yelp.com (March 15, 2016). Harvard Business School NOM Unit Working Paper, (12-016), 2016.
 [28] George A Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological review, 63(2):81, 1956.
 [29] Galileo Mark Namata, Stanley Kok, and Lise Getoor. Collective graph identification. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 87–95. ACM, 2011.
 [30] George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maximizing submodular set functions. Mathematical programming, 14(1):265–294, 1978.

[31] Steven W Norton. An explanation mechanism for Bayesian inferencing systems. In Machine Intelligence and Pattern Recognition, volume 5, pages 165–173. Elsevier, 1988.
[32] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Elsevier, 2014.
[33] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710. ACM, 2014.
[34] Shebuti Rayana and Leman Akoglu. Collective opinion spam detection: Bridging review networks and metadata. In Proceedings of the 21st ACM SIGKDD international conference on Knowledge discovery and data mining, pages 985–994. ACM, 2015.
[35] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1135–1144. ACM, 2016.
[36] Andrew Slavin Ross, Michael C Hughes, and Finale Doshi-Velez. Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717, 2017.
 [37] Chris Russell. Efficient search for diverse coherent explanations. arXiv preprint arXiv:1901.04909, 2019.
 [38] Barbara G Ryder. Constructing the call graph of a program. IEEE Transactions on Software Engineering, (3):216–226, 1979.
 [39] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[40] Simone Stumpf, Erin Sullivan, Erin Fitzhenry, Ian Oberst, Weng-Keen Wong, and Margaret Burnett. Integrating rich user feedback into intelligent user interfaces. In Proceedings of the 13th international conference on Intelligent user interfaces, pages 50–59. ACM, 2008.
 [41] Henri J Suermondt. Explanation in bayesian belief networks. 1993.
 [42] Charles Sutton, Andrew McCallum, et al. An introduction to conditional random fields. Foundations and Trends® in Machine Learning, 4(4):267–373, 2012.
 [43] Lei Tang and Huan Liu. Relational learning via latent social dimensions. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 817–826. ACM, 2009.
 [44] Wenbin Tang, Honglei Zhuang, and Jie Tang. Learning to infer social ties in large networks. In Joint european conference on machine learning and knowledge discovery in databases, pages 381–397. Springer, 2011.
 [45] Randy L Teach and Edward H Shortliffe. An analysis of physician attitudes regarding computerbased clinical consultation systems. Computers and Biomedical Research, 14(6):542–558, 1981.
[46] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.
[47] Mike Wojnowicz, Ben Cruz, Xuan Zhao, Brian Wallace, Matt Wolff, Jay Luan, and Caleb Crable. "Influence sketching": Finding influential samples in large-scale regressions. In 2016 IEEE International Conference on Big Data (Big Data), pages 3601–3612. IEEE, 2016.
 [48] Chen Yanover, Talya Meltzer, and Yair Weiss. Linear programming relaxations and belief propagation–an empirical study. Journal of Machine Learning Research, 7(Sep):1887–1907, 2006.
[49] Haizhong Zheng, Minhui Xue, Hao Lu, Shuang Hao, Haojin Zhu, Xiaohui Liang, and Keith Ross. Smoke screener or straight shooter: Detecting elite sybil attacks in user-review social networks. arXiv preprint arXiv:1709.06916, 2017.
 [50] Marinka Zitnik and Jure Leskovec. Predicting multicellular function through multilayer tissue networks. Bioinformatics, 33(14):i190–i198, 2017.