Graphs are often used to represent real-world networks, such as social networks (nodes represent users and edges represent social ties) Carrington et al. (2005), transportation networks (nodes represent cities and edges represent roads) Bell and Iida (1997), protein-protein interaction networks (nodes represent proteins and edges represent interactions) Brohee and Van Helden (2006), and so on. In many realistic applications, uncertainty is intrinsic to graph data for practical reasons, including noisy measurements Aggarwal (2010), inference and prediction errors Adar and Re (2007), explicit manipulation, and so on. For example, in protein-protein interaction (PPI) networks, high-throughput interaction detection methods are often erroneous, and hence the interaction between two proteins is probabilistic in nature. Such graphs are called uncertain graphs or probabilistic graphs, where each edge of the network is associated with a probability value. The interpretation of this probability varies from context to context. In a PPI network, the probability assigned to an edge is its existential probability. In a social network, on the other hand, it may signify the probability with which one user can influence another Chen et al. (2013). In a generic sense, an uncertain graph is a graph whose edges are marked with a probability value. Owing to the wide applicability of uncertain graphs in domains such as social network analysis Kempe et al. (2003), computational biology Zhao et al. (2014), crowdsourcing Ke et al. (2018); Yalavarthi et al. (2017b, a), recommender systems Taranto et al. (2012), and wireless networks Coon et al. (2018), the analysis and mining of such networks has become a key research area in recent times.
Extensive studies on uncertain graphs have led to many interesting problems Khan et al. (2018b); Kassiano et al. (2016), including frequent pattern matching Yuan et al. (2016); Chen et al. (2018), subgraph extraction Chen and Wang (2010); Yuan et al. (2011a), clustering of uncertain graphs Ceccarello et al. (2017a); Han et al. (2019), motif counting Ma et al. (2020), reliability computation Khan et al. (2018a), and many more. Over the years, several solution methodologies have also been developed, so there is a need to organize the existing results in a self-contained manner. In this paper, we serve this purpose by surveying the existing literature. First, we report the goals of this survey.
1.1 Focus and Goal of the Survey
In this survey, the focus is threefold, as listed below:
Different problems studied in uncertain graph mining and analysis,
Major challenges for solving these problems,
Proposed solution methodologies.
The goals of the survey are as follows:
to provide a comprehensive background on uncertain graph mining and the different problems introduced and studied in this domain,
to propose a taxonomy for classifying the existing literature and to present it concisely,
to summarize the existing literature and point out future research directions.
1.2 Proposed Taxonomy
Broadly, the problems that have been studied in the uncertain graph mining domain can be classified into three main categories: (i) Computational Problems (e.g., computation of reliability, clustering, etc.), (ii) Querying Problems (e.g., reachability queries, queries regarding the existence of a particular combinatorial structure, etc.), and (iii) Graph Algorithmic Problems (e.g., construction of a spanning tree, link prediction, information flow maximization, etc.). There are also some problems, such as sparsification and node classification, which do not fall under any of the three heads; we put them under miscellaneous problems. Figure 1 gives a diagrammatic view of the proposed classification of the existing literature on uncertain graph mining.
1.3 Organization of the Survey
The rest of the paper is organized as follows: Section 2 lists the required preliminary definitions. Section 3 describes the different problems studied in uncertain graph mining. In Section 4, we report the existing challenges in solving these problems. Existing solution methodologies for these problems are described in Section 5. In Section 6, after analyzing the existing literature, we describe the current research trends and point out existing research gaps. Section 7 lists future research directions. Finally, Section 8 concludes the survey.
In this section, we describe the required preliminary concepts, starting with the uncertain graph.
Definition 1 (Uncertain Graph)
An uncertain graph is denoted as $\mathcal{G}(V, E, P)$, where $V(\mathcal{G})$ denotes the set of vertices, $E(\mathcal{G})$ denotes the set of edges, and $P$ is the edge weight function that assigns each edge to its probability, i.e., $P: E(\mathcal{G}) \rightarrow (0, 1]$.
Footnote 1: An uncertain graph should not be confused with a random graph Bollobás and Béla (2001). Uncertain graph and probabilistic graph mean the same thing; a random graph, however, is a completely different object and plays no role in this paper. Hence, we do not define it here.
If the uncertain graph is weighted, then along with $V$, $E$, and $P$ there is also an edge weight function $W$ that assigns each edge to a real number, i.e., $W: E(\mathcal{G}) \rightarrow \mathbb{R}$. We denote the number of vertices and edges of $\mathcal{G}$ as $n$ and $m$, respectively. Depending upon the situation, an uncertain graph may be directed or undirected. As most of the existing studies consider undirected uncertain graphs, in this paper an uncertain graph is also undirected unless otherwise stated.
In some situations, instead of a single probability value, there may be multiple values associated with an edge. Take the example of a social network, where the probability associated with each edge is the influence probability between two users. The influence probability may vary from context to context Chen et al. (2016b). For instance, a sportsman can influence his friends and followers regarding any sports-related news with higher probability compared to others. Let $C$ be the set of different contexts. In this case, the social network can be modeled as an uncertain graph in which each edge is associated with $|C|$ probability values, and the probability function can be defined as $P: E(\mathcal{G}) \times C \rightarrow (0, 1]$.
Standard graph-theoretic terminology such as degree, neighborhood, path, subgraph, subgraph isomorphism, spanning tree, etc., with definitions and notation, has been adopted from Diestel (2012) and is not described here. For any arbitrary edge $e \in E(\mathcal{G})$, $P(e)$ denotes the probability associated with the edge $e$. The Possible World Semantic is widely used to represent an uncertain graph as a probability distribution over a set of deterministic graphs, and it is defined next.
Definition 2 (Possible World Semantic)
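Under this model, an uncertain graph $\mathcal{G}(V, E, P)$ is represented as a probability distribution over $2^{m}$ deterministic graphs, obtained by keeping each edge $e$ with probability $P(e)$ and removing it with probability $1 - P(e)$. Hence, given an uncertain graph $\mathcal{G}(V, E, P)$, the probability that the deterministic graph $G(V', E')$ (here, $V' = V(\mathcal{G})$ and $E' \subseteq E(\mathcal{G})$) is generated can be computed by the following equation:
$$\Pr(G) = \prod_{e \in E'} P(e) \prod_{e \in E(\mathcal{G}) \setminus E'} \left(1 - P(e)\right)$$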
Figure 2 shows an example of an uncertain graph with its possible worlds. An uncertain graph with $m$ edges has $2^{m}$ deterministic graphs in its possible world.
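To make the possible world semantic concrete, the following minimal sketch enumerates every possible world of a small uncertain graph and attaches its generation probability. It assumes independent edges; the toy graph and the names `possible_worlds` and `toy_graph` are illustrative and are not taken from Figure 2.

```python
from itertools import product

def possible_worlds(edge_probs):
    """Yield (present_edges, probability) for every possible world.

    edge_probs maps an edge (u, v) to its existence probability.
    Edges are assumed to exist independently of each other.
    """
    edges = list(edge_probs)
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        present = []
        for edge, keep in zip(edges, mask):
            if keep:
                prob *= edge_probs[edge]        # edge is kept
                present.append(edge)
            else:
                prob *= 1.0 - edge_probs[edge]  # edge is removed
        yield frozenset(present), prob

# Hypothetical toy graph with m = 3 edges, hence 2^3 = 8 possible worlds.
toy_graph = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
worlds = list(possible_worlds(toy_graph))
total_probability = sum(p for _, p in worlds)
```

The generation probabilities form a probability distribution, so they sum to one; the enumeration itself is only feasible for very small $m$.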
Given a deterministic graph and two of its vertices, the distance between them is defined as the length of the shortest path between them. In uncertain graphs, however, the distance between two vertices can be defined in several ways, as stated next.
Definition 3 (Distance in Uncertain Graphs)
Potamias et al. (2009) Given an uncertain graph $\mathcal{G}(V, E, P)$ and two of its vertices $u, v \in V(\mathcal{G})$, there are four different distance measures available in the literature:
Majority Distance: It is defined as the most probable shortest path distance between $u$ and $v$.
Expected Distance: It is defined as the expected shortest path distance between $u$ and $v$ among all the possible world graphs.
Expected Reliable Distance: It is defined as the expected shortest path distance between $u$ and $v$ among all the possible world graphs in which $u$ and $v$ are connected.
Median Distance: It is defined as the median shortest path distance between $u$ and $v$ among all the possible world graphs.
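Mathematical expressions for computing these distance measures are given in Table 1. Here, $p_{u,v}(d)$ denotes the probability that the distance between $u$ and $v$ is $d$.
|Distance Metric|Mathematical Expression|
|Majority Distance|$d_{J}(u,v) = \arg\max_{d} \, p_{u,v}(d)$|
|Expected Distance|$d_{E}(u,v) = \sum_{d} d \cdot p_{u,v}(d)$|
|Expected Reliable Distance|$d_{ER}(u,v) = \sum_{d < \infty} d \cdot \frac{p_{u,v}(d)}{1 - p_{u,v}(\infty)}$|
|Median Distance|$d_{M}(u,v) = \arg\max_{D} \left\{ \sum_{d=0}^{D} p_{u,v}(d) \leq \frac{1}{2} \right\}$|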
In a deterministic graph, two vertices are always reachable from each other if they belong to the same connected component. In uncertain graphs, however, reachability between two vertices is probabilistic; it is called reliability and is defined next.
Definition 4 ($s$-$t$ Reliability)
Given an uncertain graph $\mathcal{G}(V, E, P)$ and two vertices $s, t \in V(\mathcal{G})$, let $I_{G}(s,t)$ be the Boolean indicator variable that takes the value $1$ if $s$ and $t$ are reachable from each other in the deterministic graph $G$, and $0$ otherwise. The reliability between $s$ and $t$ is the expected value of reachability, which is given by Equation 2, where $\mathcal{L}(\mathcal{G})$ denotes the set of possible worlds of $\mathcal{G}$ and $\Pr(G)$ the generation probability of $G$:
$$R(s,t) = \sum_{G \in \mathcal{L}(\mathcal{G})} I_{G}(s,t) \cdot \Pr(G) \tag{2}$$
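As an illustration of this definition, the sketch below computes the reliability between two vertices exactly by summing the generation probabilities of the possible worlds in which they are connected. It assumes an undirected graph with independent edges; `exact_reliability` and the toy graph are hypothetical names and data introduced only for this example.

```python
from itertools import product

def reachable(present_edges, s, t):
    """Depth-first search on an undirected deterministic graph."""
    adj = {}
    for u, v in present_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    stack, seen = [s], {s}
    while stack:
        x = stack.pop()
        if x == t:
            return True
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

def exact_reliability(edge_probs, s, t):
    """Sum Pr(G) over all possible worlds G in which s and t are connected."""
    edges = list(edge_probs)
    total = 0.0
    for mask in product([0, 1], repeat=len(edges)):
        prob = 1.0
        present = []
        for e, keep in zip(edges, mask):
            prob *= edge_probs[e] if keep else 1.0 - edge_probs[e]
            if keep:
                present.append(e)
        if reachable(present, s, t):  # indicator variable equals 1
            total += prob
    return total

# Hypothetical toy graph: a-c are connected directly (0.2) or via b (0.9 * 0.5).
toy_graph = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
r_ac = exact_reliability(toy_graph, "a", "c")
```

For this toy graph the two routes are edge-disjoint, so the result can be checked by hand: $1 - (1 - 0.2)(1 - 0.45) = 0.56$. The running time is exponential in $m$, which is what motivates the sampling techniques discussed later.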
The notion of reliability for an uncertain graph can be extended to a subset of the vertices (called terminal vertices), as stated in Definition 5.
Definition 5 (Network Reliability)
For a given uncertain graph $\mathcal{G}(V, E, P)$ and a set of terminal vertices $T \subseteq V(\mathcal{G})$, the network reliability is defined as the probability that the vertices of $T$ are connected in a possible world $G$ of $\mathcal{G}$. This can be computed by Equation 3. Here, $I(G, T)$ is an indicator variable whose value is $1$ when the vertices of $T$ are connected in $G$, and $0$ otherwise.
$$R[\mathcal{G}, T] = \sum_{G \in \mathcal{L}(\mathcal{G})} I(G, T) \cdot \Pr(G) \tag{3}$$
Subsequently, Khan et al. (2018a) generalized the notion of reliability to the case where every edge of the uncertain graph carries more than one probability value; this is defined next.
Definition 6 (Conditional Reliability)
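Khan et al. (2018a) Let $\mathcal{G}(V, E, P)$ be an uncertain graph and $C$ be the set of different contexts. Here, $P$ is defined as $P: E(\mathcal{G}) \times C \rightarrow (0, 1]$, and $P(e \mid c)$ denotes the probability of the edge $e$ under the context $c$. For any given $C_1 \subseteq C$, and with the assumption that the contexts are independently aggregated, the effective probability of any edge $e \in E(\mathcal{G})$ can be computed by Equation 4.
$$P(e \mid C_1) = 1 - \prod_{c \in C_1} \left(1 - P(e \mid c)\right) \tag{4}$$
Similarly, by considering the contexts $C_1$, the probability that the deterministic graph $G$ is generated can be computed by Equation 5.
$$\Pr(G \mid C_1) = \prod_{e \in E(G)} P(e \mid C_1) \prod_{e \in E(\mathcal{G}) \setminus E(G)} \left(1 - P(e \mid C_1)\right) \tag{5}$$
Given two vertices $s, t \in V(\mathcal{G})$ and the set of contexts $C_1$, the conditional reliability between $s$ and $t$ can be defined by Equation 6.
$$R(s, t \mid C_1) = \sum_{G \in \mathcal{L}(\mathcal{G})} I_{G}(s,t) \cdot \Pr(G \mid C_1) \tag{6}$$
Here, $I_{G}(s,t)$ is an indicator variable whose value is $1$ if $s$ and $t$ are connected in $G$, and $0$ otherwise.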
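The aggregation of contexts can be illustrated on a single edge. The sketch below assumes that independent aggregation takes the noisy-OR form, i.e., the edge is absent only if it is absent under every selected context; the context names and probability values are hypothetical.

```python
from math import prod

def effective_probability(context_probs, chosen):
    """Noisy-OR aggregation: 1 - prod over chosen contexts of (1 - P(e | c)).

    context_probs maps a context to the edge probability under that context.
    With no chosen context the effective probability is 0.
    """
    return 1.0 - prod(1.0 - context_probs[c] for c in chosen)

# Hypothetical per-context influence probabilities of one edge.
edge_context_probs = {"sports": 0.6, "politics": 0.1, "music": 0.3}
p_sports_music = effective_probability(edge_context_probs, {"sports", "music"})
```

Here the aggregated probability is $1 - (0.4)(0.7) = 0.72$. Once every edge has been aggregated in this way, the conditional reliability is simply the ordinary reliability of the resulting uncertain graph.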
Given two deterministic graphs $G$ and $H$, a well-known computational task is to determine whether $G$ contains a subgraph isomorphic to $H$. This is the popular subgraph isomorphism problem Lee et al. (2012), which is known to be NP-Complete Lewis (1983). This problem carries over to uncertain graphs through the concept of support, which is defined next.
Definition 7 (Support)
Given an uncertain graph $\mathcal{G}(V, E, P)$ and a subgraph $S$, the support of $S$ is defined as the fraction of the possible worlds that contain $S$ as a subgraph. Mathematically, this can be expressed as follows:
$$sup_{\mathcal{G}}(S) = \frac{\left|\{G \in \mathcal{L}(\mathcal{G}) : S \text{ is a subgraph of } G\}\right|}{\left|\mathcal{L}(\mathcal{G})\right|}$$
Now, given a support value $\sigma$, an uncertain graph $\mathcal{G}$, and a deterministic graph $S$, one immediate question is: "Does the uncertain graph $\mathcal{G}$ contain $S$ as a subgraph with support greater than or equal to $\sigma$?" Querying problems of this kind are called structural pattern-based queries. Section 3.1 discusses them in more detail.
For a given graph, computing node similarity (how similar are two nodes?) remains a central question in graph mining. Among many measures, one of the most widely adopted is Simrank, initially proposed by Jeh and Widom (2002). The intuition behind this similarity measure is that two vertices are similar if they are referenced by similar vertices.
Definition 8 (Simrank)
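Given a deterministic graph $G(V, E)$ and two vertices $u, v \in V(G)$, the Simrank score between these two vertices is given by the following equation:
$$s(u, v) = \begin{cases} 1 & \text{if } u = v, \\ \dfrac{c}{|I(u)|\,|I(v)|} \displaystyle\sum_{u' \in I(u)} \sum_{v' \in I(v)} s(u', v') & \text{otherwise,} \end{cases}$$
where $I(u)$ and $I(v)$ denote the sets of incoming neighbors of the nodes $u$ and $v$, respectively, and $c$ is the decay factor, whose value is chosen as 0.8 Li et al. (2010a). Section 5.1.2 summarizes the existing literature on Simrank computation on uncertain graphs.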
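The recursive definition is usually evaluated by fixed-point iteration. The following sketch is a naive implementation on a deterministic graph, under the common convention that the score is $0$ when either vertex has no incoming neighbor; the function name, the iteration count, and the toy graph are assumptions made for illustration.

```python
def simrank(in_neighbors, c=0.8, iterations=10):
    """Naive fixed-point iteration of the Simrank recurrence.

    in_neighbors maps every vertex to the list of its incoming neighbors.
    """
    nodes = list(in_neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new_sim = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new_sim[(u, v)] = 1.0
                    continue
                iu, iv = in_neighbors[u], in_neighbors[v]
                if not iu or not iv:
                    new_sim[(u, v)] = 0.0  # convention for empty in-neighborhoods
                    continue
                total = sum(sim[(a, b)] for a in iu for b in iv)
                new_sim[(u, v)] = c * total / (len(iu) * len(iv))
        sim = new_sim
    return sim

# Hypothetical toy graph: x -> a and x -> b, so a and b are referenced by the same vertex.
toy_in_neighbors = {"x": [], "a": ["x"], "b": ["x"]}
scores = simrank(toy_in_neighbors)
```

In the toy graph, $a$ and $b$ share their only incoming neighbor, so their score equals the decay factor $c$. Each iteration of this naive version touches every pair of vertices and every pair of their in-neighbors, so it is only suitable for small graphs.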
Definition 9 (Random Walk on Graphs)
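Göbel and Jagers (1974) This is a discrete-time process defined on graphs, and it can also be viewed as a graph search technique. Suppose that at time $t = 0$ an object is placed on a vertex $v_i$ of an undirected graph $G(V, E)$. At every discrete time step, the object must move from its current vertex to one of the neighboring vertices. So, if the object is on the vertex $v_i$ at time $t$, then the probability that the object will be on the vertex $v_j$ at time $t+1$ is given by the following equation, where $deg(v_i)$ denotes the degree of $v_i$:
$$\Pr(X_{t+1} = v_j \mid X_{t} = v_i) = \begin{cases} \dfrac{1}{deg(v_i)} & \text{if } (v_i v_j) \in E(G), \\ 0 & \text{otherwise.} \end{cases}$$
These probability values are collected in a matrix, the transition matrix of the walk.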
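A minimal sketch of this matrix for a simple random walk is given below, assuming that the walker chooses uniformly among the neighbors of its current vertex. The adjacency-list representation and the path graph are illustrative choices.

```python
def transition_matrix(adjacency):
    """Row-stochastic matrix of the simple random walk on an undirected graph.

    adjacency maps every vertex to the list of its neighbors.
    Entry [i][j] is the probability of moving from vertex i to vertex j.
    """
    nodes = sorted(adjacency)
    index = {v: i for i, v in enumerate(nodes)}
    matrix = [[0.0] * len(nodes) for _ in nodes]
    for v in nodes:
        degree = len(adjacency[v])
        for w in adjacency[v]:
            matrix[index[v]][index[w]] = 1.0 / degree
    return nodes, matrix

# Hypothetical path graph a - b - c.
toy_adjacency = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
nodes, walk_matrix = transition_matrix(toy_adjacency)
```

Every row that belongs to a vertex with at least one neighbor sums to one; from the middle vertex $b$ the walker moves to $a$ or $c$ with probability $0.5$ each.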
Existing solution methodologies for Simrank computation on uncertain graphs use the random walk concept, as described in Section 5.1.2. Table 2 lists the notations and symbols used in this paper along with their interpretations. Symbols that have not been introduced yet are defined as and when needed.
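|Symbol|Interpretation|
|$\mathcal{G}(V, E, P)$|An uncertain graph|
|$\mathcal{G}(V, E, P, W)$|A weighted uncertain graph|
|$V(\mathcal{G})$, $E(\mathcal{G})$|Vertex set and edge set of $\mathcal{G}$|
|$n$, $m$|The size of the vertex set and edge set of $\mathcal{G}$|
|$P$|Edge probability function, i.e., $P: E(\mathcal{G}) \rightarrow (0, 1]$|
|$W$|Edge weight function, $W: E(\mathcal{G}) \rightarrow \mathbb{R}$|
|$\mathbf{A}$|Adjacency matrix of $\mathcal{G}$|
|$u$, $v$|Any two arbitrary vertices of $\mathcal{G}$|
|$I(u)$|Set of incoming neighbors of $u$|
|$e$|An arbitrary edge from $E(\mathcal{G})$|
|$P(e)$|Edge probability of the edge $e$|
|$\mathcal{L}(\mathcal{G})$|Possible world set of $\mathcal{G}$|
|$\Pr(G)$|Generation probability of $G$ from $\mathcal{G}$|
|$G$|A deterministic graph from $\mathcal{L}(\mathcal{G})$|
|$s(u, v)$|Simrank between $u$ and $v$ in $\mathcal{G}$|
|$T$|Set of terminal vertices from $V(\mathcal{G})$|
|$\eta$|Reachability / Reliability threshold|
|$C$|Set of different contexts / tags|
|$P(e \mid c)$|The edge probability of the edge $e$ for the tag $c$|
|$C_1$|A subset of contexts|
|$P(e \mid C_1)$|Edge probability of $e$ after aggregating the tags in $C_1$|
|$R(s, t \mid C_1)$|Reliability between $s$ and $t$ for the tags in $C_1$|
|$d(u, v)$|Distance between $u$ and $v$ in $\mathcal{G}$ under distance measure $d$|
|$sup_{\mathcal{G}}(S)$|Support of $S$ on $\mathcal{G}$|
|$\mathbf{S}$|The Simrank matrix|
|$\mathcal{G}_{q}$|Uncertain graph corresponding to the query $q$|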
3 Problems Studied on Uncertain Graphs
In this section, we describe the problems that have been studied in the domain of uncertain graph mining. We devote one subsection to each category of problems shown in Figure 1.
3.1 Querying Uncertain Graphs
Different kinds of querying problems have been studied in the context of uncertain graphs, such as $k$-Nearest Neighbor Queries ($k$NN Queries), Reliability Queries, Queries for Structural Patterns, and so on. We describe them one by one.
3.1.1 Reliability-Based Queries
Analysis of network reliability is an old yet still active research area Ball (1986); Brecht and Colbourn (1988); Ball et al. (1995); Khan et al. (2014); Guo and Jerrum (2019). Given an uncertain graph $\mathcal{G}(V, E, P)$, a probability threshold $\eta \in (0, 1)$, and a set of source nodes $S \subset V(\mathcal{G})$, the Reliability-Based Query Problem asks to return the set $T \subseteq V(\mathcal{G}) \setminus S$ such that the probability of reachability from the set $S$ to each $t \in T$, denoted $R(S, t)$, is more than $\eta$ Khan et al. (2014). From the computational point of view, the problem can be posed as follows:
Reliability-Based Query Problem
Input: Uncertain Graph $\mathcal{G}(V, E, P)$, a vertex subset $S \subset V(\mathcal{G})$, and a probability threshold $\eta$.
Question: Return the set $T \subseteq V(\mathcal{G}) \setminus S$ such that $\forall t \in T$, $R(S, t) > \eta$.
3.1.2 Conditional Reliability Maximization Queries
Recently, Khan et al. (2018a) studied the conditional reliability maximization problem. In the single-source single-terminal variant of this problem, we are given an uncertain graph $\mathcal{G}(V, E, P)$, a set of different contexts $C$, two specified vertices $s, t \in V(\mathcal{G})$, and a positive integer $k$. The problem asks to choose a subset of the contexts $C_1 \subseteq C$ with $|C_1| = k$ such that the conditional reliability described in Definition 6 is maximized. Mathematically, this problem can be expressed by the following equation.
$$C_1^{*} = \underset{C_1 \subseteq C,\ |C_1| = k}{\arg\max}\ R(s, t \mid C_1)$$
From the computational perspective, this problem can be posed as follows:
Conditional Reliability Maximization Queries
Input: Uncertain Graph $\mathcal{G}(V, E, P)$, Set of Contexts $C$, Two Vertices $s, t \in V(\mathcal{G})$, and a Positive Integer $k$.
Question: Return $C_1 \subseteq C$ such that $|C_1| = k$ and the quantity $R(s, t \mid C_1)$ is maximized.
3.1.3 $k$-Nearest Neighbor ($k$NN) Queries
This is a very generic data mining task: given a set of data points and a specific one of them (say $x$), the question is to return the $k$ data points most similar to $x$ from the remaining ones Roussopoulos et al. (1995). This problem has been studied extensively on graph data Malkov et al. (2014). Recently, the $k$-NN Query Problem has been studied on uncertain graphs as well Potamias et al. (2010). Given a weighted uncertain graph $\mathcal{G}(V, E, P, W)$, a specific vertex $s \in V(\mathcal{G})$, a probabilistic distance function $d$ (any one of the four shown in Table 1), and a positive integer $k$, the $k$NN Query on Uncertain Graph Problem asks to return the set $T_k \subseteq V(\mathcal{G}) \setminus \{s\}$ with $|T_k| = k$ such that for any vertex $t \in T_k$, $d(s, t) \leq d(s, t')$ for any $t' \in V(\mathcal{G}) \setminus (T_k \cup \{s\})$. From the computational perspective, the problem can be posed as follows:
$k$-NN Query on Uncertain Graph Problem
Input: Weighted Uncertain Graph $\mathcal{G}(V, E, P, W)$, a specific vertex $s \in V(\mathcal{G})$, a probabilistic distance function $d$, and $k \in \mathbb{Z}^{+}$.
Question: Return the vertex set $T_k \subseteq V(\mathcal{G}) \setminus \{s\}$ with $|T_k| = k$ such that for any vertex $t \in T_k$, $d(s, t) \leq d(s, t')$ for any $t' \in V(\mathcal{G}) \setminus (T_k \cup \{s\})$.
This problem can also be defined on unweighted uncertain graphs. In that case, $W(e) = 1$, $\forall e \in E(\mathcal{G})$. Here, we have described the weighted version only.
3.1.4 Structural Pattern-Based Queries
Finding or counting a given structure (also known as a query graph) in a graph database is a fundamental graph mining task Yan et al. (2005); Kuramochi and Karypis (2005). This problem has also been studied in the context of uncertain graphs. In particular, the structures that have been studied extensively are frequent subgraphs Zou et al. (2010a); Li et al. (2012), densest subgraphs Zou (2013), cliques Mukherjee et al. (2015, 2017); Zou et al. (2010b), etc.
Structural Pattern-Based Query Problem
Input: Uncertain Graph $\mathcal{G}(V, E, P)$, a structure $S$, and a probability threshold $\eta$.
Question: Is there a pattern $S$ in $\mathcal{G}$ whose support value is greater than or equal to $\eta$?
Here, $S$ may be a $k$-clique, a $k$-truss, and many more.
3.2 Computation on Uncertain Graphs
Different computational problems have been studied in the context of uncertain graphs. Here, we list them one by one.
3.2.1 Reliability Computation:
Given an uncertain graph $\mathcal{G}(V, E, P)$ and two vertices $s$ and $t$, this problem asks to compute the probability that $t$ is reachable from $s$. It can be considered the fundamental reachability problem in the uncertain graph context. From the computational perspective, the problem can be stated as follows:
Input: An Uncertain Graph $\mathcal{G}(V, E, P)$, two specific vertices $s, t \in V(\mathcal{G})$.
Task: Compute the reliability between $s$ and $t$, i.e., $R(s, t)$, in $\mathcal{G}$.
A variant of this problem has been introduced by Jin et al. (2011b), where along with $\mathcal{G}$, $s$, and $t$, we are also given a distance $d$, and the question is: what is the probability that $s$ is connected with $t$ within the distance $d$?
3.2.2 $k$-Terminal Reliability Computation:
This is a more general version of the reliability computation problem. In this problem, we are given an uncertain graph $\mathcal{G}(V, E, P)$ and a set of terminal vertices $T \subseteq V(\mathcal{G})$; the network reliability is given by the following equation:
$$R[\mathcal{G}, T] = \sum_{G \in \mathcal{L}(\mathcal{G})} I(G, T) \cdot \Pr(G)$$
where $I(G, T)$ is an indicator variable whose value is $1$ if the vertices in $T$ are connected in $G$, and $0$ otherwise. This problem asks to compute the quantity $R[\mathcal{G}, T]$ Sasaki et al. (2019). From the computational point of view, this problem can be posed as follows:
Network Reliability Computation:
Input: An Uncertain Graph $\mathcal{G}(V, E, P)$, a subset of vertices $T \subseteq V(\mathcal{G})$.
Task: Compute the network reliability $R[\mathcal{G}, T]$ in $\mathcal{G}$.
3.2.3 Simrank Computation:
Simrank is a popular similarity measure between any two vertices of a graph due to its applications in different problems, including entity resolution Li et al. (2010b); Yin et al. (2007), similar protein identification Whalen et al. (2015), and so on. Recently, the Simrank computation problem has been studied for uncertain graphs as well Du et al. (2015); Zhu et al. (2016a, 2017). For a given uncertain graph $\mathcal{G}(V, E, P)$, this problem asks to compute an $n \times n$ similarity matrix $\mathbf{S}$, where the $(i, j)$-th entry contains the Simrank similarity value between the vertices $v_i$ and $v_j$. Computationally, the problem can be posed as follows:
Simrank Computation on Uncertain Graphs:
Input: An Uncertain Graph $\mathcal{G}(V, E, P)$.
Task: Compute the Simrank matrix $\mathbf{S}$ of $\mathcal{G}$.
3.2.4 Clustering of Uncertain Graphs
Clustering is a very generic data mining task. Given a set of data points, this problem asks to partition them into a number (which may be given or may be dataset dependent) of partitions (which may be overlapping) such that data points belonging to the same partition are similar in some sense and data points belonging to different partitions are dissimilar Jain et al. (1999). Clustering has been studied extensively on graph data Aggarwal and Wang (2010), and also in the context of uncertain graphs Liu et al. (2012); Han et al. (2019). The problem can be summarized as follows:
Clustering of Uncertain Graph
Input: An Uncertain Graph $\mathcal{G}(V, E, P)$, and the number of clusters $k$.
Task: Partition $V(\mathcal{G})$ into $V_1$, $V_2$, $\ldots$, $V_k$ based on a certain similarity measure.
3.3 Graph Algorithmic Problems
3.3.1 Spanning Tree Problem:
Given a weighted graph, computing the minimum cost spanning tree is a classic problem in graph algorithms Cormen et al. (2009). This problem has been studied when the edge weights are uncertain Megow et al. (2017); Erlebach et al. (2008). Recently, it has also been studied in the context of uncertain graphs, where the weight of each edge is fixed but there is a probability of existence associated with every edge Zhang et al. (2016). From the computational perspective, this problem can be posed as follows:
Most Reliable Spanning Tree Problem on Uncertain Graphs
Input: A Weighted Uncertain Graph $\mathcal{G}(V, E, P, W)$.
Task: Compute the minimum cost spanning tree having the highest probability.
3.3.2 Link Prediction Problem:
Link prediction is a classic graph mining task, where snapshots of the network at different times, say $1, 2, \ldots, t$, i.e., $G_1$, $G_2$, $\ldots$, $G_t$, are given and the task is to predict the network snapshot at time $t+1$, i.e., $G_{t+1}$ Zhang and Zaïane (2019). This problem has many applications in social network analysis Liben-Nowell and Kleinberg (2007), recommender systems Li and Chen (2009), and so on. Recently, this problem has been studied considering edge uncertainty Ahmed and Chen (2016); Martínez et al. (2017). From the computational point of view, this problem can be posed as follows:
Link Prediction Problem on Uncertain Graphs
Input: Snapshots of an Uncertain Graph $\mathcal{G}_1$, $\mathcal{G}_2$, $\ldots$, $\mathcal{G}_t$.
Task: Output a similarity matrix whose $(i, j)$-th entry denotes the likelihood of the edge $(v_i v_j)$.
3.3.3 Information Flow Maximization Problem:
Network flow is a classic problem in network algorithms: given a directed graph in which each edge is marked with its capacity, together with a source and a target vertex, the goal is to decide how much flow needs to pass through each of the links such that the total flow from the source to the target vertex is maximized Ahuja et al. (1988). Recently, this problem has been studied on uncertain graphs as well, although in a different setting Frey et al. (2017, 2018). Here the problem is defined as follows: given a vertex-weighted uncertain graph $\mathcal{G}(V, E, P, W)$, a query vertex $q \in V(\mathcal{G})$, and a positive integer $k$, find the subgraph containing at most $k$ edges that maximizes the expected information flow towards $q$. From the computational point of view, this problem can be posed as follows:
Information Flow Maximization Problem on Uncertain Graph
Input: A Vertex Weighted Uncertain Graph $\mathcal{G}(V, E, P, W)$, a query vertex $q \in V(\mathcal{G})$, a positive integer $k$.
Task: Output a subgraph with at most $k$ edges that maximizes the expected information flow towards $q$.
3.4 Other Problems
3.4.1 Uncertain Graph Sparsification
Due to the gigantic size of real-world networks, storing an entire network incurs a huge storage cost and, more importantly, querying the entire graph requires a huge amount of computational time. One remedy is not to store the entire graph: instead of storing all the edges, store a subset of them. This leads to the problem of graph sparsification. Here, the problem is to decide which edges to store such that some specific property of the graph does not change much in the sparsified graph Fung et al. (2019); Spielman and Srivastava (2011). Several graph sparsification techniques have been proposed in the literature, such as spanner-based sparsifiers Peleg and Schäffer (1989), cut-based sparsifiers Fung et al. (2019), and many more. Recently, the cut-based uncertain graph sparsification problem has been studied by Parchas et al. (2018). This problem has been formalized as follows: given an uncertain graph $\mathcal{G}(V, E, P)$ and a vertex subset $S \subseteq V(\mathcal{G})$, the expected cut size is defined by Equation 10.
$$C_{\mathcal{G}}(S) = \sum_{(uv) \in E(\mathcal{G}),\ u \in S,\ v \in V(\mathcal{G}) \setminus S} P(uv) \tag{10}$$
For a sparsified uncertain graph $\mathcal{G}'$, the discrepancy of a vertex subset $S$ is $\delta(S) = C_{\mathcal{G}}(S) - C_{\mathcal{G}'}(S)$. For a positive integer $k$, the discrepancy of the sparsified uncertain graph is defined as the sum of the absolute discrepancy values over all $k$-sized subsets.
$$\Delta_{k} = \sum_{S \subseteq V(\mathcal{G}),\ |S| = k} \left|\delta(S)\right|$$
Now, for a given uncertain graph $\mathcal{G}(V, E, P)$, a positive integer $k$, and a sparsification ratio $\alpha \in (0, 1)$, the uncertain graph sparsification problem asks to construct another uncertain graph $\mathcal{G}'(V, E', P')$ with $E' \subset E$ and $|E'| = \alpha |E|$ that minimizes the sum of the discrepancies $\Delta_{k}$. From the computational point of view, this problem can be posed as follows:
Uncertain Graph Sparsification Problem
Input: An Uncertain Graph $\mathcal{G}(V, E, P)$, a positive integer $k$, a sparsification ratio $\alpha \in (0, 1)$.
Task: Construct another uncertain graph $\mathcal{G}'(V, E', P')$ such that $E' \subset E$ and $|E'| = \alpha |E|$ to minimize $\Delta_{k}$.
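The objective of the sparsification problem can be evaluated directly on small instances. The sketch below assumes that the expected cut size of a vertex subset is the sum of the probabilities of the edges with exactly one endpoint in the subset; the function names and both toy graphs are hypothetical.

```python
from itertools import combinations

def expected_cut_size(edge_probs, subset):
    """Sum of probabilities of edges with exactly one endpoint in subset."""
    subset = set(subset)
    return sum(p for (u, v), p in edge_probs.items()
               if (u in subset) != (v in subset))

def discrepancy(original, sparsified, subset):
    """Difference between the expected cut sizes before and after sparsification."""
    return expected_cut_size(original, subset) - expected_cut_size(sparsified, subset)

def total_discrepancy(original, sparsified, vertices, k):
    """Sum of absolute discrepancies over all k-sized vertex subsets."""
    return sum(abs(discrepancy(original, sparsified, s))
               for s in combinations(vertices, k))

# Hypothetical example: the sparsified graph drops the edge (a, c).
original = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
sparsified = {("a", "b"): 0.9, ("b", "c"): 0.5}
cut_a = expected_cut_size(original, {"a"})
delta_1 = total_discrepancy(original, sparsified, ["a", "b", "c"], k=1)
```

Dropping the edge $(a, c)$ changes the expected cut size of $\{a\}$ and of $\{c\}$ by $0.2$ each and leaves that of $\{b\}$ unchanged, so the total discrepancy for $k = 1$ is $0.4$. A sparsifier may also adjust the probabilities of the retained edges to reduce this value.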
3.4.2 Node Classification in Uncertain Graphs
The problem of classifying the nodes of a network has been studied extensively Bhagat et al. (2011). Recently, this problem has also been studied in the context of uncertain graphs Dallachiesa et al. (2014); Kong et al. (2013); Han et al. (2015). It can be stated as follows: given an uncertain graph $\mathcal{G}(V, E, P)$, a set of labels $L$, and a subset of its vertices $V_L \subset V(\mathcal{G})$ labeled by a labeling function $f: V_L \rightarrow L$, predict the labels of the nodes in $V(\mathcal{G}) \setminus V_L$. Computationally, this problem can be posed as follows:
Node Classification in Uncertain Graphs
Input: An uncertain graph $\mathcal{G}(V, E, P)$, a label set $L$, a subset of vertices $V_L \subset V(\mathcal{G})$ with labeling function $f: V_L \rightarrow L$.
Task: Predict the labels of the nodes in $V(\mathcal{G}) \setminus V_L$.
There are several other problems that have been studied, such as uncertain graph visualization Schulz et al. (2016); Sharara et al. (2011). As the intended audience of this paper is researchers from the data mining and data management communities, we do not focus on these problems. It is also important to note that the problems studied on uncertain graphs may be classified in other ways. In our classification, the goal was to put problems of a similar kind under the same umbrella. Next, we proceed to describe the major challenges in solving these problems.
4 Challenges for Solving these Problems
As mentioned in the literature, there are mainly two major challenges described in the following two subsections.
4.1 Exponential Number of Possible Worlds
As mentioned in Section 2, an uncertain graph with $m$ edges has $2^{m}$ possible worlds. Even for modest values of $m$ (a few hundred edges), the number of possible worlds is excessively large (more than the number of atoms in the universe). Almost all the problems described in Section 3 require enumerating all the possible worlds to output the exact answer. However, due to bounded computational resources, it is not possible to consider all the possible worlds. The challenge is therefore to decide how many possible worlds to sample so that, with high probability, the error of the result is bounded by at most $\epsilon$. Consider the Reliability Computation Problem described in Section 3. Let $R(s,t)$ and $\hat{R}(s,t)$ be the reliability values when all the possible worlds are considered and when only the sampled ones are considered, respectively. The question is: for given $\epsilon$ and $\delta$, how many sample graphs should be considered such that the following inequality holds: $\Pr\left(|\hat{R}(s,t) - R(s,t)| \geq \epsilon\right) \leq \delta$. To address this issue, several effective sampling techniques have been developed, such as recursive sampling Jin et al. (2011b), recursive stratified sampling Li et al. (2014, 2015), lazy propagation sampling Li et al. (2015), and many more. Also, Parchas et al. (2014, 2015) proposed generating deterministic representative instances such that the underlying graph properties are preserved.
4.2 Gigantic Size of Real-World Networks
As described, the large number of possible worlds can be handled by answering queries approximately with high probability. However, if the network itself is very large, then processing the graph becomes even more difficult. Recently, a framework called the simultaneous processing approach has been developed to address this issue Zou et al. (2017). This method samples a number of possible worlds independently at random, stores them efficiently with compact data structures (as many of the sampled possible worlds share common substructures), and processes the query on them simultaneously to generate the results. However, the literature in this category is very limited.
In the next section, we report the existing solution methodologies for the problems reported in Section 3.
5 Existing Solution Methodologies
In this section, we briefly describe the existing solution methodologies of the problems described in Section 3. First we start with the computational problems on uncertain graphs.
5.1 Computational Problems
5.1.1 Reliability Computation
As mentioned previously, the reliability computation problem has been studied in different variants, such as two-vertex reliability computation ($s$-$t$ reliability) and $k$-terminal reliability computation. For a given uncertain graph, the trivial way to compute reliability is to enumerate all the possible worlds and process each of them sequentially. However, this approach leads to a huge computational burden. Recently, Sasaki et al. (2019) proposed an efficient technique to compute the network reliability, which reduces the number of required possible worlds and computes bounds on the $k$-terminal reliability by efficiently constructing a binary decision diagram, which is a directed acyclic graph. Binary decision diagrams have previously been used to compute the network reliability on deterministic graphs as well Hardy et al. (2007). Reported experimental results show that their proposed methodology leads to lower variance and a lower error rate. Jin et al. (2011b) studied the distance-constrained reliability problem; their main contribution is the use of a Horvitz-Thompson type estimator and a recursive sampling estimator that efficiently combines a deterministic computational procedure with sampling to boost the estimation accuracy. Reported results show the superior performance of these two estimators. Recently, Ke et al. (2019) performed an in-depth benchmarking study of the reliability problem with different sampling strategies, i.e., Monte Carlo (MC) Sampling, BFS with Indexing Zhu et al. (2015), Recursive Sampling Jin et al. (2011b), Recursive Stratified Sampling Li et al. (2014, 2015), Lazy Propagation Sampling Li et al. (2015), and Indexing via Probabilistic Trees Maniu et al. (2017). Their goal was to make a comparative study of the different sampling strategies to understand their estimator variance, memory usage, etc.
The simplest sampling technique is MC sampling. In this approach, $K$ deterministic graphs $G_1$, $G_2$, $\ldots$, $G_K$ are sampled from the possible worlds. For all $i \in \{1, 2, \ldots, K\}$, let $I_{G_i}(s,t)$ denote the Boolean variable whose value is $1$ if $s$ and $t$ are reachable from each other in $G_i$, and $0$ otherwise. By this method, we have the following estimated reliability:
$$\hat{R}(s,t) = \frac{1}{K} \sum_{i=1}^{K} I_{G_i}(s,t)$$
Here, $\hat{R}(s,t)$ is an unbiased estimator of the $s$-$t$ reliability, i.e., $\mathbb{E}[\hat{R}(s,t)] = R(s,t)$. It has been shown in Potamias et al. (2009) that the number of Monte Carlo samples $K$ required to reach a given accuracy $\epsilon$ with confidence $1 - \delta$ grows on the order of $\frac{1}{\epsilon^{2}} \ln \frac{1}{\delta}$. Traversing each deterministic graph to check $s$-$t$ reachability requires $\mathcal{O}(n + m)$ time. Hence, the time requirement for reliability estimation using this technique is $\mathcal{O}(K(n + m))$.
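The MC estimator can be sketched in a few lines. The sample size below uses the standard Hoeffding bound for an additive error, which is an assumption made for this example and is not necessarily the exact bound stated in the cited work; the toy graph and function names are likewise illustrative.

```python
import math
import random
from collections import deque

def hoeffding_sample_size(epsilon, delta):
    """Samples sufficient for Pr(|estimate - truth| >= epsilon) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def is_reachable(edges, s, t):
    """Breadth-first search on an undirected deterministic graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    queue, seen = deque([s]), {s}
    while queue:
        x = queue.popleft()
        if x == t:
            return True
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

def mc_reliability(edge_probs, s, t, num_samples, seed=0):
    """Average the reachability indicator over independently sampled worlds."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        # Sample one possible world by keeping each edge with its probability.
        world = [e for e, p in edge_probs.items() if rng.random() < p]
        hits += is_reachable(world, s, t)
    return hits / num_samples

# Hypothetical toy graph whose exact a-c reliability is 0.56.
toy_graph = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
K = hoeffding_sample_size(epsilon=0.05, delta=0.05)
estimate = mc_reliability(toy_graph, "a", "c", K)
```

Each sample costs one graph traversal, which matches the overall cost of $K$ traversals discussed above. The fixed seed only makes the sketch reproducible; in practice the samples should be drawn with fresh randomness.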
Later, Zhu et al. (2015) proposed an offline sampling scheme which is also space efficient. In this scheme, the given uncertain graph is stored in its entirety, without the existence probabilities of the edges. Instead, each edge is associated with a bit vector of length $K$. For any arbitrary edge $e$, the $i$-th entry of its bit vector is $1$ if the edge is present in the sampled graph $G_i$, and $0$ otherwise. It can be observed that traversing this compact graph structure is equivalent to traversing each of the sampled graphs in parallel. It is easy to see that building the index requires $\mathcal{O}(Km)$ time and that the space requirement is $\mathcal{O}(Km)$.
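The idea of traversing all sampled graphs at once can be sketched with integer bit masks. This is an illustration in the spirit of the scheme described above, not the authors' implementation; `build_index`, `shared_reachability`, and the toy graph are hypothetical.

```python
import random

def build_index(edge_probs, num_samples, seed=0):
    """Attach to every edge a bit mask; bit i is set if the edge exists in sample i."""
    rng = random.Random(seed)
    index = {}
    for edge, p in edge_probs.items():
        bits = 0
        for i in range(num_samples):
            if rng.random() < p:
                bits |= 1 << i
        index[edge] = bits
    return index

def shared_reachability(index, s, t, num_samples):
    """Estimate s-t reliability by propagating 'reachable in sample i' bit masks."""
    adj = {}
    for (u, v), bits in index.items():
        adj.setdefault(u, []).append((v, bits))
        adj.setdefault(v, []).append((u, bits))
    reach = {s: (1 << num_samples) - 1}  # s is reachable in every sample
    worklist = [s]
    while worklist:
        u = worklist.pop()
        for v, bits in adj.get(u, ()):
            # v is reachable in sample i if u is and the edge exists in sample i.
            new = reach.get(v, 0) | (reach[u] & bits)
            if new != reach.get(v, 0):
                reach[v] = new
                worklist.append(v)
    return bin(reach.get(t, 0)).count("1") / num_samples

# Hypothetical toy graph and sample count.
toy_graph = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
toy_index = build_index(toy_graph, num_samples=64)
estimate = shared_reachability(toy_index, "a", "c", num_samples=64)
```

The index stores $K$ bits for each of the $m$ edges, and one bitwise operation advances the search in all $K$ samples simultaneously. A vertex is revisited whenever its mask gains new bits, so the traversal terminates once no mask changes.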
Jin et al. (2011b) proposed the ‘recursive sampling’ approach, which improves over MC sampling for two reasons. First, for a given query, if some edges are already missing in a possible world, then it may be largely irrelevant whether certain other edges are present or not. Second, many possible worlds share a significant number of existing or missing edges. The working principle of this approach is as follows. Starting with the source vertex $s$, an expandable edge $e$ incident on $s$ is randomly sampled $K$ times. The generated samples are divided into two groups: one in which $e$ is present and one in which it is absent. Suppose that, in the first group, the node reached via the edge $e$ is $w$, so that more edges become expandable. This process is repeated for both groups by picking an expandable edge at random and subdividing the groups into smaller ones. A very similar approach, called the Dynamic MC Sampling technique, was developed by Zhu et al. (2011).
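Now, let $E_1$ and $E_2$ be the sets of present and absent edges in one group, and let $\mathcal{G}(E_1, E_2)$ denote the set of possible worlds in which the edges of $E_1$ are present and the edges of $E_2$ are absent. With $p(e)$ denoting the existence probability of edge $e$, the generation probability of the deterministic graphs in the group is given by the following equation:
$$\Pr[\mathcal{G}(E_1, E_2)] = \prod_{e \in E_1} p(e) \prod_{e \in E_2} \big(1 - p(e)\big)$$
The reliability of the group is the probability that $t$ is reachable from $s$ given that the deterministic graph is drawn from $\mathcal{G}(E_1, E_2)$. With $I_G(s,t)$ denoting the reachability indicator in the deterministic graph $G$, the following equation computes it:
$$R_{\mathcal{G}(E_1, E_2)}(s,t) = \sum_{G \in \mathcal{G}(E_1, E_2)} I_G(s,t) \cdot \frac{\Pr[G]}{\Pr[\mathcal{G}(E_1, E_2)]}$$
It is easy to verify that, for any edge $e \notin E_1 \cup E_2$, the following holds:
$$R_{\mathcal{G}(E_1, E_2)}(s,t) = p(e)\, R_{\mathcal{G}(E_1 \cup \{e\}, E_2)}(s,t) + \big(1 - p(e)\big)\, R_{\mathcal{G}(E_1, E_2 \cup \{e\})}(s,t)$$
This process continues until $E_1$ contains an $s$-$t$ path, in which case $R_{\mathcal{G}(E_1, E_2)}(s,t) = 1$, or $E_2$ contains an $s$-$t$ cut, in which case $R_{\mathcal{G}(E_1, E_2)}(s,t) = 0$.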
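The conditioning step behind this scheme (fix one undetermined edge as present or absent, then recurse on both groups) can be made concrete with an exact computation on a tiny graph. The sketch below is not the sampling estimator of the cited work: it enumerates both branches exactly, so it is exponential in the worst case and useful only for illustration. It assumes directed edges stored in a dictionary `prob` mapping `(u, v)` to an existence probability; the function names are hypothetical.

```python
def reachable_set(present, s):
    """Nodes reachable from s using only the edges already fixed as present."""
    adj = {}
    for u, v in present:
        adj.setdefault(u, []).append(v)
    seen, stack = {s}, [s]
    while stack:
        x = stack.pop()
        for y in adj.get(x, []):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def reliability(prob, s, t, present=frozenset(), absent=frozenset()):
    """Exact s-t reliability of the group with fixed present and absent edge sets."""
    reached = reachable_set(present, s)
    if t in reached:
        return 1.0  # the present edges already contain an s-t path
    expandable = [e for e in prob
                  if e[0] in reached and e not in present and e not in absent]
    if not expandable:
        return 0.0  # every edge leaving the reached set is absent: an s-t cut
    e = expandable[0]
    p = prob[e]
    return (p * reliability(prob, s, t, present | {e}, absent)
            + (1.0 - p) * reliability(prob, s, t, present, absent | {e}))
```

For the graph with edges `s -> a`, `a -> t`, and `s -> t`, each with probability 0.5, the recursion returns `1 - (1 - 0.5)(1 - 0.25) = 0.625`.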
Later, Li et al. (2015) proposed the recursive stratified sampling technique, which is based on the divide-and-conquer paradigm. In this method, the entire probability space is divided into non-overlapping subspaces by selecting $r$ edges. Each such subspace is called a ‘stratum’. Let $T$ be the set of $r$ edges chosen via breadth-first search from the source node $s$, and let $X$ be a Boolean matrix that stores the status of the selected edges in the different strata: $X_{i,j}$ equals $1$ if the $j$-th ($1 \le j \le r$) edge $e_j$ is present in the $i$-th stratum and $0$ if it is absent. The probability that a deterministic graph from the possible worlds belongs to the $i$-th stratum is given by the following equation:
$$\pi_i = \prod_{e_j \in T,\; X_{i,j} = 1} p(e_j) \prod_{e_j \in T,\; X_{i,j} = 0} \big(1 - p(e_j)\big)$$
The sample size of each stratum is fixed as $K_i = \pi_i K$, where $K$ is the total number of samples. The reliability is computed in each stratum, and the expected value is returned as the reliability. It has been shown in Li et al. (2015) that the time requirement of recursive stratified sampling is the same as that of MC sampling, i.e., $\mathcal{O}(K(n+m))$, where $n$ and $m$ denote the number of nodes and edges.
Li et al. (2017) proposed the ‘lazy propagation sampling’ scheme. The basic working principle of this method is as follows: if the existence probability of an edge is very low, then the edge will naturally be absent from many possible worlds, and hence it is not necessary to probe such an edge in every sample. More formally, this method associates a geometric distribution (https://en.wikipedia.org/wiki/Geometric_distribution) with each edge and probes an edge only when it is activated. It has been shown in Li et al. (2017) that the variance of lazy propagation sampling is the same as that of MC sampling, though it improves the efficiency.
Maniu et al. (2017) proposed the ‘Indexing via Probabilistic Trees’ methodology for efficiently computing the reliability in uncertain graphs. This method constructs a tree structure called the ‘ProbTree’ from the given uncertain graph $\mathcal{G}$. When a reliability query arrives, an equivalent graph $\mathcal{G}'$ is created from the ‘ProbTree’ and MC sampling is performed on $\mathcal{G}'$ itself. If $\mathcal{G}'$ is smaller than $\mathcal{G}$, then the query can be evaluated very efficiently. They developed three different indexing schemes, namely SPQR trees, FWD, and LIN ProbTrees. As per their analysis, FWD (fixed-width tree decomposition) is the best of the three, because both the build time and the query time of this data structure are linear and the query quality is ‘lossless’ for $w \le 2$, where $w$ is the width of the tree decomposition Alber and Niedermeier (2002). As per the analysis in Maniu et al. (2017), the time complexity of building the FWD ProbTree is linear in the number of nodes of the graph. The query execution time and the space complexity of this scheme are expressed in terms of $n$ and $m$, the number of nodes and edges of the graph $\mathcal{G}$. Table 3 shows the time and space complexity of different sampling strategies for the reliability problem. Table 4 summarizes the pros and cons of different sampling techniques.
|Sampling Technique||Time Complexity||Space Complexity|
|B.F.S. with Indexing|
|Recursive Stratified Sampling|
|Lazy Propagation Sampling|
|Indexing via Probabilistic Trees|
|Monte Carlo Sampling||
|B.F.S. with Indexing Zhu et al. (2015)||
|Recursive Sampling Jin et al. (2011b)||
|Recursive Stratified Sampling Li et al. (2014, 2015)||
|Lazy Propagation Sampling Li et al. (2015)||
|Indexing via Probabilistic Trees Maniu et al. (2017)||
5.1.2 Simrank Computation
To the best of our knowledge, Du et al. (2015) were the first to study the problem of Simrank computation on uncertain graphs. They proposed a dynamic programming approach to compute the entries of the probabilistic transition matrix, which works in linear time. To improve the efficiency further, they proposed an incremental dynamic programming (IDP) approach. Reported results show that the IDP approach converges much faster than the naive dynamic programming approach.
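Subsequently, Zhu et al. (2016a) also studied the same problem. They formalized the notion of a random walk in uncertain graphs and, based on this notion, showed how Simrank can be defined on uncertain graphs. It is important to note that, in the case of uncertain graphs, the $k$-step transition probability matrix (i.e., $W^{(k)}$) is not equal to the $k$-th power of the one-step transition probability matrix ($W^{(1)}$). They defined the random walk on uncertain graphs by exploiting all of the possible worlds. For a given uncertain graph $\mathcal{G}$ with set of possible worlds $\Omega(\mathcal{G})$, let $X_0, X_1, X_2, \ldots$ be the random variables associated with a random walk in $\mathcal{G}$, where $X_i$ is the vertex visited at step $i$. The probability of the walk $v_0, v_1, \ldots, v_k$ can be computed as follows:
$$\Pr(X_0 = v_0, \ldots, X_k = v_k) = \sum_{G \in \Omega(\mathcal{G})} \Pr(G)\, \Pr\nolimits_G(X_0 = v_0, \ldots, X_k = v_k)$$
Now, applying the Markovian property within each possible world $G$,
$$\Pr\nolimits_G(X_0 = v_0, \ldots, X_k = v_k) = \Pr\nolimits_G(X_0 = v_0) \prod_{i=1}^{k} \Pr\nolimits_G(X_i = v_i \mid X_{i-1} = v_{i-1})$$
The $k$-step transition probability from the vertex $u$ to $v$ can then be computed as follows:
$$\Pr(X_k = v \mid X_0 = u) = \sum_{G \in \Omega(\mathcal{G})} \Pr(G)\, \Pr\nolimits_G(X_k = v \mid X_0 = u)$$
Let $W^{(k)}$ be the $k$-step transition matrix of the uncertain graph $\mathcal{G}$. $W^{(k)}$ can be computed by the following equation:
$$W^{(k)} = \sum_{G \in \Omega(\mathcal{G})} \Pr(G)\, W_G^{(k)}$$
where $W_G^{(k)}$ denotes the $k$-step transition probability matrix of the possible world $G$. Based on this random walk on uncertain graphs, they proposed four different approaches to compute Simrank, namely (i) a baseline method, (ii) a sampling technique, (iii) a two-stage method, and (iv) speeding-up techniques. Experimental results show that the methods are highly scalable.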
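The observation that the $k$-step transition matrix of an uncertain graph differs from the $k$-th power of its one-step matrix can be checked by brute force on a toy graph. The sketch below enumerates every possible world, which is feasible only for a handful of edges. It makes several assumptions that are not taken from the cited work: the walk moves uniformly over the out-neighbours present in a world, a node with no outgoing edge has an all-zero row, and the names `world_transition` and `k_step_matrix` are hypothetical.

```python
from itertools import product


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][x] * b[x][j] for x in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(a, k):
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, a)
    return result


def world_transition(n, present_edges):
    """One-step transition matrix of a single deterministic possible world."""
    w = [[0.0] * n for _ in range(n)]
    for u in range(n):
        out = [v for (x, v) in present_edges if x == u]
        for v in out:
            w[u][v] = 1.0 / len(out)
    return w


def k_step_matrix(n, prob, k):
    """Sum over all possible worlds of Pr(world) times its k-step matrix."""
    edges = list(prob)
    total = [[0.0] * n for _ in range(n)]
    for mask in product([False, True], repeat=len(edges)):
        pr = 1.0
        for e, keep in zip(edges, mask):
            pr *= prob[e] if keep else 1.0 - prob[e]
        if pr == 0.0:
            continue  # impossible world, contributes nothing
        present = [e for e, keep in zip(edges, mask) if keep]
        wk = mat_pow(world_transition(n, present), k)
        for i in range(n):
            for j in range(n):
                total[i][j] += pr * wk[i][j]
    return total
```

Take two nodes with edge `0 -> 1` of probability 0.5 and edge `1 -> 0` of probability 1. The three-step probability of going from node 0 to node 1 is 0.5, because the walk reuses the same uncertain edge, whereas cubing the one-step matrix gives 0.25.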
5.1.3 Clustering of Uncertain Graphs
To the best of our knowledge, Liu et al. (2012) were the first to study the clustering problem on uncertain graphs. Suppose that, given an uncertain graph $\mathcal{G}$, we want to partition it into $k$ clusters $C_1, C_2, \ldots, C_k$. Any possible world of $\mathcal{G}$ consists of several connected components, and the vertices of each connected component are in turn divided into groups based on their cluster levels. First, they introduced the notion of purity via the cluster level entropy as follows:
Here, the entropy of the cluster levels is computed for each fragment of each possible world of $\mathcal{G}$. As mentioned in Liu et al. (2012), if purity is the only criterion, then the clustering process may lead to a single cluster containing the maximum number of nodes. To address this issue, Liu et al. (2012) considered the notion of size balance, which requires that no two clusters be too imbalanced in terms of their sizes. To make the clusters size balanced, the following function can be maximized.
Liu et al. (2012) formulated the reliable clustering of an uncertain graph as the minimization of the following function:
They developed a novel $k$-means clustering algorithm to optimize the function given in Equation 22. Experimental results demonstrate the scalability of this method.
Later, Ceccarello et al. (2017b) also studied the uncertain graph clustering problem from a different perspective. Here, the goal is to partition the nodes of the uncertain graph into $k$ clusters such that each cluster features a node as its cluster center, so as to maximize the minimum or the average reliability from the cluster center to the other nodes of the cluster (minimum connection probability (MCP) and average connection probability (ACP), respectively). They showed that this problem is #P-Hard and proposed approximation algorithms for both the MCP and the ACP variants. They showed that their methods generate a $k$-clustering with lower-bound approximation guarantees for both variants, expressed in terms of the maximum and the average connection probability of any $k$-clustering. They compared their results with deterministic weighted graph clustering algorithms to show the efficiency of their proposed methodology.
Kollios et al. (2011) extended the edit distance-based graph clustering technique to uncertain graphs. Given two deterministic graphs $G = (V, E_G)$ and $Q = (V, E_Q)$, their graph edit distance is defined by the following equation.
$$D(G, Q) = |E_G \setminus E_Q| + |E_Q \setminus E_G| \tag{23}$$
Now, given an uncertain graph $\mathcal{G}$ and a deterministic graph $Q$, their graph edit distance is the expected edit distance over the possible worlds $G$ of $\mathcal{G}$, given by the following equation:
$$D(\mathcal{G}, Q) = \mathbb{E}[D(G, Q)] = \sum_{G} \Pr[G] \cdot D(G, Q)$$
Here, $D(G, Q)$ can be computed using Equation 23. They introduced the notion of a cluster graph, which is a set of vertex-disjoint, mutually disconnected cliques, denoted as $C = (V, E_C)$. That is, $V$ is partitioned into $k$ disjoint sets $V_1, V_2, \ldots, V_k$ such that $V = \bigcup_{i=1}^{k} V_i$, $V_i \cap V_j = \emptyset$ for $i \neq j$, $(u, v) \in E_C$ for all $u, v \in V_i$, and $(u, v) \notin E_C$ for all $u \in V_i$, $v \in V_j$ with $i \neq j$. Given an uncertain graph $\mathcal{G}$, its clustering problem asks for the cluster graph $C$ such that the edit distance $D(\mathcal{G}, C)$ is minimized. They exploited the connection between this problem and correlation clustering, and showed that the randomized expected $5$-approximation algorithm proposed by Ailon et al. (2008) for weighted correlation clustering can be used to solve the uncertain graph clustering problem. Experimental evaluations show that this algorithm generates statistically significant clusters of an uncertain graph and also scales well on real-world networks.
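The expected edit distance between an uncertain graph and a cluster graph can be evaluated without enumerating possible worlds. By linearity of expectation, a vertex pair placed in the same cluster costs one minus its edge probability, and a pair placed in different clusters costs its edge probability. The sketch below checks this closed form against brute-force enumeration on a toy graph. It is an illustration under stated assumptions (undirected edges stored as `frozenset` pairs, independent edge probabilities, hypothetical function names), not the clustering algorithm of the cited work.

```python
from itertools import combinations, product


def edit_distance(edges_g, edges_q):
    """Number of edges in exactly one of the two deterministic graphs."""
    return len(edges_g - edges_q) + len(edges_q - edges_g)


def cluster_graph_edges(clusters):
    """Edge set of the cluster graph: a clique inside every cluster."""
    return {frozenset(pair) for c in clusters for pair in combinations(c, 2)}


def expected_distance_bruteforce(prob, clusters):
    """Sum over all possible worlds of Pr(world) times its edit distance."""
    target = cluster_graph_edges(clusters)
    edges = list(prob)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        pr = 1.0
        for e, keep in zip(edges, mask):
            pr *= prob[e] if keep else 1.0 - prob[e]
        world = {e for e, keep in zip(edges, mask) if keep}
        total += pr * edit_distance(world, target)
    return total


def expected_distance_closed_form(prob, clusters):
    """Same quantity via linearity of expectation, pair by pair."""
    target = cluster_graph_edges(clusters)
    cost = sum(1.0 - prob.get(e, 0.0) for e in target)         # missing intra-cluster edges
    cost += sum(p for e, p in prob.items() if e not in target)  # surplus inter-cluster edges
    return cost
```

With edge probabilities 0.9 for `a-b`, 0.8 for `b-c`, and 0.1 for `c-d`, the clustering `{a, b, c}, {d}` has expected edit distance 1.4, while `{a, b}, {c, d}` has 1.8, so the first clustering is preferred under this objective.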
Halim et al. (2017) proposed a solution methodology for the uncertain graph clustering problem that exploits neighborhood information. In this method, the input uncertain graph is first converted into a deterministic graph using a classification technique for edge prediction, and a deterministic graph clustering technique is then applied to obtain the clusters. They also performed an experimental study with different classification techniques for edge prediction.
Subsequently, Han et al. (2019) studied the uncertain graph clustering problem with two different goals: divide the uncertain graph into $k$ clusters such that (i) the average reliability from the cluster center to the other nodes is maximized (similar to the $k$-median problem), and (ii) the minimum reliability between any node of a cluster and its cluster center is maximized (similar to the $k$-center problem). Both the $k$-center and the $k$-median problems have been studied extensively by theoretical computer scientists Thorup (2005); Mettu and Plaxton (2003). For the $k$-median problem, they proposed an approximation algorithm with a provable approximation factor, and for the $k$-center problem they proposed an approximation algorithm (holding with high probability) whose factor depends on the optimal value of the $k$-center objective function. Their experimental evaluation shows that the proposed approaches significantly outperform the methods proposed by Ceccarello et al. (2017b).
It is important to note that, although there are several clustering techniques for uncertain graphs, the clustering criterion differs across the methodologies. For example, in the study of Liu et al. (2012) the goal is to generate size-balanced clusters, whereas in the study of Ceccarello et al. (2017b) the goal is to cluster the uncertain graph so as to maximize the average or minimum reliability within each cluster. Table 5 briefly summarizes the uncertain graph clustering techniques.
|Reference||Criteria for Clustering||Main Contributions|
|Liu et al. Liu et al. (2012)||To generate size balanced clustering of an uncertain graph.||
|Ceccarello et al. Ceccarello et al. (2017b)||To cluster an uncertain graph such that the average/minimum reliability within each cluster is maximized.||
|Kollios et al. Kollios et al. (2011)||To minimize the edit distance between cluster graph and input uncertain graph||
|Halim et al. Halim et al. (2017)||
|Han et al. Han et al. (2019)||
Next, we describe solution methodologies for various graph algorithmic problems that have been studied in the context of uncertain graphs.
5.2 Querying Uncertain Graphs
5.2.1 Querying for Subgraph
Subgraph searching and querying in graph databases in the deterministic setting is a very well-studied topic in the data management community. However, in the probabilistic setting the literature is very limited. Here, we briefly summarize the existing literature.
Pattern Matching Queries
To the best of our knowledge (and as claimed by the authors themselves), Zou et al. (2009) were the first to study subgraph pattern matching queries on uncertain graphs. As the Subgraph Isomorphism Problem is NP-Hard even in deterministic graphs, the same hardness result follows in the context of uncertain graphs as well. Moreover, as the reachability problem is #P-Hard in uncertain graphs, subgraph pattern matching in uncertain graphs is also #P-Hard. So, the goal here was to find an approximate solution to this problem. They proposed an efficient approach to check whether a subgraph should be returned as a solution or not. They also derived a bound on the sample size under which the estimator of the target probability meets the desired accuracy. This is the foundational work in this direction, and many researchers subsequently investigated it in depth.
Later, Chen and Wang (2010) studied the approximate subgraph search problem in uncertain graph streams. Here, given an uncertain graph stream, a set of query graphs, and a probability threshold, the goal is to report all joinable pairs at each timestamp, that is, all pairs of a query graph and a stream graph such that the query graph is contained in the stream graph with probability exceeding the threshold. They proposed two efficient pruning techniques, namely ‘structural pruning’ and ‘probability pruning’, which make their methodology efficient. They also analyzed the running time of their methodology.
Later, Yuan et al. (2011b) studied the same problem and proposed the ‘filtering and verifying’ strategy to speed up the search process. In the filtering phase, a probabilistic inverted index is maintained, which is used for probabilistic pruning. In the verification phase, the remaining candidates are verified by an exact algorithm. This method was tested on both synthetic and real-world datasets. In follow-up work, Yuan et al. (2014, 2016) developed probabilistic match trees (PM Trees) based on match cuts and a cut selection process. Using this index structure, they developed an effective pruning strategy to prune the unqualified matches. This makes the proposed methodology much more efficient, and hence the graphs used in this study are much larger than those in previous works.
Chen et al. (2018) proposed the ‘enumeration-evaluation’ framework for this problem, in which they first enumerate all the candidate subgraphs and then, for each subgraph, compute its support and decide whether to output it as a result. They also showed that, under the probabilistic semantics, the computation of ‘Support’ is #P-Complete. Hence, their solutions are approximate in nature, with an accuracy guarantee. Experimental results show the practical usability of the algorithms.
Recently, Ma et al. (2019) studied the problem of counting the occurrences of a given motif in an uncertain graph. Given a motif, their goal was to evaluate its occurrence statistics, such as the probability mass function, the mean, and the variance. Based on their sample size analysis, they showed that, once the number of samples exceeds a derived threshold, the absolute error is bounded with a prescribed probability. The running time of their methodology depends on the number of vertices in the motif and on the maximum degree of the graph. In the experimental study of this work, the datasets are much larger than those used in previous studies.
Subgraph Similarity Search
Yuan et al. (2012, 2015) studied the subgraph similarity search problem over uncertain graph databases. They showed that the problem is #P-Complete. They used their previously developed ‘filter and verify’ technique to speed up the search process. In the filtering phase, they develop lower and upper bounds on the subgraph similarity probability based on a probabilistic matrix index. For the verification phase, they developed an efficient sampling approach to validate the remaining candidates. Experimental results show that their methods are scalable.
Gu et al. Gu et al. (2016) studied the problem of similarity maximal all matches in an uncertain graph database. Given an uncertain graph , a query graph , distance threshold , and probability threshold , a deterministic graph is a similarity maximal match of in under the vertex mapping , if there exist no other deterministic graph such that under the same vertex mapping , . Here, is the edit distance between and Chen et al. (2019). They proposed different speed up techniques such as partial graph evaluation, vertex pruning, probability upper bound-based filtering, and the incremental evaluation method. Experimental results show that they outperform baseline methods in orders of magnitude.
Table 6 summarizes the literature for pattern matching and subgraph similarity search queries on uncertain graphs.
|Type of Subgraph||Author||Major Findings|
|Any subgraph||Zou et al. Zou et al. (2009)||
|Chen et al. Chen and Wang (2010)||
|Yuan et al. Yuan et al. (2011b) Yuan et al. (2014) Yuan et al. (2016)||
|Chen et al. Chen et al. (2018)||
|Ma et al. Ma et al. (2019)||