DySky: Dynamic Skyline Queries on Uncertain Graphs

04/06/2020 ∙ by Suman Banerjee, et al. ∙ IIT Kharagpur ∙ IIT Gandhinagar

Given a graph and a set of query vertices (a subset of the vertices), the dynamic skyline query problem returns the data vertices (vertices other than the query vertices) that are not dominated by any other data vertex with respect to a given distance measure. In this paper, we study the dynamic skyline query problem on uncertain graphs (DySky). The input is an uncertain graph together with a subset of its nodes as query vertices, and the goal is to return all data vertices that are not dominated by others. We employ two distance measures for uncertain graphs, namely Majority Distance and Expected Distance. Our approach is divided into three steps: Pruning, Distance Computation, and Skyline Vertex Set Generation. We implement the proposed methodology on three publicly available datasets and observe that, when the expected distance is used, it finds the skyline vertex set quickly even for million-sized graphs. In particular, the pruning strategy reduces the computational time significantly.

1. Introduction

‘Skyline’ has emerged as an effective multi-criteria decision-making operator and has therefore been researched extensively in the data management community for almost two decades (Chomicki et al., 2013). Borzsony et al. (Borzsony et al., 2001) first introduced this operator. Given a set of data points, the skyline operator returns the subset of points that are not dominated by any other point in the dataset. For any two points $x$ and $y$, we say that $x$ dominates $y$ if $x$ is not worse than $y$ in every dimension and strictly better in at least one dimension. Without loss of generality, in this study we assume that a lower value is better in all dimensions. This problem has been studied in the context of graph data as well (Zou et al., 2010; Zheng et al., 2014). In real-world scenarios, the relationships among agents are uncertain, and this uncertainty arises for several reasons: noisy measurements, unknown values, explicit manipulations, etc. Hence, such situations are modeled as an uncertain graph, where edges are marked with existence probabilities. In social networks, these probabilities signify the influence probability between two users; in computer networks, they signify the probability of successful packet transfer between two systems. We now review some recent literature on skyline query processing and on the analysis of uncertain graphs.

1.1. Related Work

Since being introduced by Borzsony et al. (Borzsony et al., 2001), skyline queries have been studied on different kinds of data, for different purposes, and with different system architectures: on road networks (Miao et al., 2018; Fu et al., 2017; Zhu et al., 2018), bi-criteria networks (Jiang et al., 2016; Ouyang et al., 2018), uncertain data (Zhou et al., 2015; Nguyen and Cao, 2015), incomplete data (Lee et al., 2016; Miao et al., 2016), streaming data (De Matteis et al., 2015; Liu et al., 2019), spatial data (Son et al., 2017), encrypted data (Liu et al., 2018a), knowledge graphs (Keles and Hose, 2019), and wireless sensor networks (Wang et al., 2016); for route recommendations (Yawalkar and Ranu, 2019; Jiang and Du, 2018), finding prospective customers (Yin et al., 2018), resisting outliers (Jaudoin et al., 2017), and favorite product queries (Zhou et al., 2016); and with the MapReduce architecture (Park et al., 2017, 2015), multi-core architectures (De Matteis et al., 2016), cloud computing frameworks (Huang et al., 2018), and so on. Given the topic of this paper, we elaborate here on skyline query processing over probabilistic and uncertain data. He et al. (He et al., 2016) studied the skyline query on uncertain time series data and developed a two-step methodology to answer it probabilistically. Park et al. (Park et al., 2015) studied skyline query processing on uncertain data and proposed parallel algorithms for computing it using the MapReduce framework. Zhou et al. (Zhou et al., 2015) studied skyline query processing over uncertain data in distributed environments. Le et al. (Le et al., 2016) studied skyline queries on uncertain data to return user-specific relevant results without enumerating all possible worlds. Recently, there have been several studies in this direction (Zeng et al., 2019; Liu et al., 2018b, 2019). However, to the best of our knowledge, the skyline query has not yet been studied in the context of uncertain graphs.

Owing to its many practical applications, the analysis of uncertain graphs has recently emerged as an important research topic (Khan et al., 2018; Khan and Chen, 2015). Several problems have been studied, such as clustering (Halim et al., 2017; Chen et al., 2015), embedding (Hu et al., 2017), subgraph search (Jin et al., 2011; Chen et al., 2015), structural pattern finding (Bonchi et al., 2014), and so on. Ke et al. (Ke et al., 2019a) studied the ‘budgeted reliability maximization problem’, where the goal is to add a small number of edges to increase the reliability between a given pair of nodes. Ke et al. (Ke et al., 2019b) recently studied the reliability problem, which asks with what probability a target node is reachable from a source node in a given uncertain graph. Chen et al. (Chen et al., 2018) studied frequent pattern finding in uncertain graphs and proposed an enumeration-evaluation algorithm for this problem. See (Kassiano et al., 2016) for a survey.

1.2. Our Contribution

In this paper, we propose the novel problem “dynamic skyline query on uncertain graphs”. Given an uncertain graph with a subset of vertices as query vertices, the goal is to obtain the subset of data vertices that are not dominated by any other data vertex with respect to some distance measure from the query vertices. In particular, we make the following contributions:

  • We introduce the novel problem “Dynamic Skyline Queries on Uncertain Graph Problem” (DySky).

  • We propose a solution methodology for this problem, which is broadly divided into three steps, namely pruning, distance computation, and skyline vertex set generation.

  • The proposed methodology has been implemented on three benchmark datasets, and the results show that the pruning strategy leads to a smaller number of candidate nodes.

1.3. Organization

The rest of the paper is organized as follows. Section 2 presents the required preliminary definitions and defines the problem formally. Section 3 describes the proposed methodology. Section 4 reports the experimental evaluation of the proposed methodology. Section 5 concludes the paper and gives future directions.

2. Preliminaries and Problem Definition

In this section, we present required preliminary concepts and then define the dynamic skyline queries on uncertain graph problem formally. Initially, we start with a few basic definitions.

Definition 1 (Uncertain Graph).

We denote an uncertain graph by $\mathcal{G}(V, E, W, P)$, where $V$ is the set of vertices, $E \subseteq V \times V$ is the set of edges, $W$ is the distance function that assigns each edge a positive real number, i.e., $W: E \rightarrow \mathbb{R}^{+}$, and $P$ is the existence function that assigns each edge a probability value, i.e., $P: E \rightarrow (0, 1]$.

In our study, we consider only simple, finite, undirected, and weighted graphs. The numbers of nodes and edges of the graph are denoted by $n$ and $m$, respectively. For an edge $(uv) \in E$, its weight and existence probability are denoted by $W(uv)$ and $P(uv)$, respectively. In the literature, an uncertain graph is conceptualized and analyzed through the possible world model, which we define next.

Definition 2 (Possible World Semantics).

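An uncertain graph $\mathcal{G}(V, E, W, P)$ can be conceptualized as a probability distribution over a set of deterministic graphs, which are called the possible worlds of the uncertain graph and denoted by $\mathcal{L}(\mathcal{G})$. Each $G_i(V, E_i, W) \in \mathcal{L}(\mathcal{G})$ is obtained from $\mathcal{G}$ by keeping all its vertices and keeping each edge independently with its existence probability; if an edge $e$ of $\mathcal{G}$ is also present in $G_i$, then it keeps its weight $W(e)$. The probability that the deterministic graph $G_i$ is generated can be computed by Equation 1.

$$Pr(G_i) = \prod_{e \in E_i} P(e) \prod_{e \in E \setminus E_i} \left(1 - P(e)\right) \qquad (1)$$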
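As a rough illustration of Equation 1, the following Python sketch enumerates every possible world of a tiny uncertain graph and computes each world's generation probability as the product of the existence probabilities of the kept edges and the complements of the dropped ones, assuming edges exist independently. The toy edge set, its probabilities, and the function names are hypothetical and not taken from the paper.

```python
from itertools import combinations


def world_probability(edge_prob, kept_edges):
    """Probability of the possible world that keeps exactly `kept_edges`."""
    kept = set(kept_edges)
    prob = 1.0
    for edge, p in edge_prob.items():
        # A kept edge contributes p, a dropped edge contributes (1 - p).
        prob *= p if edge in kept else (1.0 - p)
    return prob


def all_worlds(edge_prob):
    """Yield (kept_edges, probability) for each of the 2^m possible worlds."""
    edges = list(edge_prob)
    for k in range(len(edges) + 1):
        for kept in combinations(edges, k):
            yield frozenset(kept), world_probability(edge_prob, kept)


# Hypothetical toy graph with three uncertain edges.
edge_prob = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
worlds = list(all_worlds(edge_prob))
total = sum(p for _, p in worlds)
```

Summing the probabilities over all possible worlds should give 1, which is a quick sanity check on the distribution. Full enumeration is only feasible for very small graphs, since the number of worlds doubles with every edge.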

In any deterministic graph $G$, two vertices $u$ and $v$ are said to be reachable from each other if there exists a path between $u$ and $v$. In uncertain graphs, however, reachability between two given vertices is defined probabilistically, which we call reliability.

Definition 3 (Reliability).

Given an undirected uncertain graph $\mathcal{G}(V, E, W, P)$, the reliability between any two of its vertices $u$ and $v$ is defined as the probability that $u$ and $v$ are reachable from each other. We denote the reliability between $u$ and $v$ by $R(u, v)$, defined by the following equation:

$$R(u, v) = \sum_{G \in \mathcal{L}(\mathcal{G})} I_G(u, v) \, Pr(G) \qquad (2)$$

Here, $I_G(u, v)$ is the Boolean variable whose value is $1$ if $u$ and $v$ are connected in $G$ and $0$ otherwise.
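To make the reliability definition concrete, here is a small Python sketch that sums the probabilities of all possible worlds in which two vertices are connected. It enumerates worlds exhaustively, so it is meant only for toy graphs; the graph, probabilities, and function names are illustrative assumptions rather than anything from the paper.

```python
from itertools import product


def connected(kept_edges, u, v):
    """Return True if u and v are joined by a path that uses only kept_edges."""
    adj = {}
    for a, b in kept_edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        if x == v:
            return True
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def reliability(edge_prob, u, v):
    """Exact reliability: total probability of the worlds where u reaches v."""
    edges = list(edge_prob)
    total = 0.0
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        for e, keep in zip(edges, mask):
            prob *= edge_prob[e] if keep else 1.0 - edge_prob[e]
        kept = [e for e, keep in zip(edges, mask) if keep]
        if connected(kept, u, v):  # indicator variable of the definition
            total += prob
    return total


# Hypothetical triangle: a and c are joined directly or through b.
edge_prob = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.2}
r_ac = reliability(edge_prob, "a", "c")
```

For this triangle the result can be checked by hand: the two vertices are disconnected only when the direct edge is absent and the two-edge route is broken, so the reliability is 1 - 0.8 × 0.55 = 0.56.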

In a deterministic weighted graph, the distance between two vertices is defined as the sum of the edge weights along the shortest path between them. In uncertain graphs, however, the distance between two vertices can be defined in several ways. Here, we state the two that we use in our study.

Definition 4 (Majority Distance).

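(Potamias et al., 2009) Given an uncertain graph $\mathcal{G}(V, E, W, P)$ and two of its vertices $u, v \in V$, their majority distance is denoted by $d_M(u, v)$ and defined as the most probable shortest path distance. Mathematically, it is given by the following equation.

$$d_M(u, v) = \underset{d}{\arg\max} \; p_{u,v}(d) \qquad (3)$$

where $p_{u,v}(d)$ is the shortest path distance distribution between the vertices $u$ and $v$, which gives the probability of every distance $d$, and $d_G(u, v)$ denotes the shortest path distance between $u$ and $v$ in the possible world $G$:

$$p_{u,v}(d) = \sum_{G \in \mathcal{L}(\mathcal{G}) \,:\, d_G(u, v) = d} Pr(G) \qquad (4)$$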
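The majority distance can be illustrated on a toy graph by building the shortest path distance distribution exactly and taking its mode. The Python sketch below does this by enumerating all possible worlds and running Dijkstra's algorithm in each. Treating a disconnected pair as distance infinity is an assumption of this sketch, as are the example graph and all function names.

```python
import heapq
from collections import defaultdict
from itertools import product


def dijkstra(adj, source):
    """Shortest path distances from source; adj maps vertex -> [(neighbor, weight)]."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue  # stale heap entry
        for y, w in adj.get(x, ()):
            nd = d + w
            if nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(heap, (nd, y))
    return dist


def distance_distribution(edges, u, v):
    """edges: {(a, b): (weight, probability)}. Returns {distance: probability}."""
    items = list(edges.items())
    pmf = defaultdict(float)
    for mask in product([False, True], repeat=len(items)):
        prob, adj = 1.0, defaultdict(list)
        for ((a, b), (w, p)), keep in zip(items, mask):
            prob *= p if keep else 1.0 - p
            if keep:
                adj[a].append((b, w))
                adj[b].append((a, w))
        # Assumption: unreachable pairs are recorded at distance infinity.
        pmf[dijkstra(adj, u).get(v, float("inf"))] += prob
    return dict(pmf)


def majority_distance(edges, u, v):
    """Most probable shortest path distance between u and v."""
    pmf = distance_distribution(edges, u, v)
    return max(pmf, key=pmf.get)


# Hypothetical triangle: a cheap but uncertain two-hop route and a costly direct edge.
edges = {("a", "b"): (1.0, 0.9), ("b", "c"): (1.0, 0.7), ("a", "c"): (5.0, 0.2)}
pmf = distance_distribution(edges, "a", "c")
md = majority_distance(edges, "a", "c")
```

In this example the two-hop route of length 2 exists with probability 0.63, which exceeds the probability of every other outcome, so it is the majority distance.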
Definition 5 (Expected Distance).

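Given an uncertain graph $\mathcal{G}(V, E, W, P)$ and two of its vertices $u, v \in V$, let $\mathcal{P}_{l}(u, v)$ denote the set of paths between $u$ and $v$ of length up to $l$. For each path $\pi \in \mathcal{P}_{l}(u, v)$, the path probability is defined as

$$Pr(\pi) = \prod_{e \in \pi} P(e) \qquad (5)$$

The expected distance between $u$ and $v$ is defined as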

(6)
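The ingredients of this definition, namely the set of paths up to a given length and the probability of each path, can be sketched in Python as follows. The sketch stops at the per-path quantities and does not attempt the final aggregation into an expected distance. Interpreting "length" as the number of edges and restricting to simple paths are assumptions of this sketch, and the toy graph and function names are hypothetical.

```python
def paths_up_to(adj, u, v, max_hops):
    """Yield simple paths (vertex lists) from u to v with at most max_hops edges."""
    stack = [(u, [u])]
    while stack:
        x, path = stack.pop()
        if x == v:
            yield path
            continue
        if len(path) - 1 == max_hops:
            continue  # hop budget exhausted
        for y in adj.get(x, ()):
            if y not in path:  # keep the path simple
                stack.append((y, path + [y]))


def path_probability(path, prob):
    """Product of the existence probabilities of the edges on the path."""
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= prob[frozenset((a, b))]
    return p


def path_weight(path, weight):
    """Sum of the edge weights on the path."""
    return sum(weight[frozenset((a, b))] for a, b in zip(path, path[1:]))


# Hypothetical triangle, stored with undirected edge keys.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
prob = {frozenset(("a", "b")): 0.9, frozenset(("b", "c")): 0.7, frozenset(("a", "c")): 0.2}
weight = {frozenset(("a", "b")): 1.0, frozenset(("b", "c")): 1.0, frozenset(("a", "c")): 5.0}

paths = sorted(paths_up_to(adj, "a", "c", max_hops=2))
summary = [(p, path_weight(p, weight), path_probability(p, prob)) for p in paths]
```

The number of such paths can grow quickly with the hop limit and the vertex degrees, which is why the hop limit matters in practice.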

For any positive integer $t$, $[t]$ denotes the set $\{1, 2, \ldots, t\}$. Given a set of multi-dimensional data points $\mathcal{D}$, the problem of skyline query computation asks to find the data points that are not dominated by any other data point in $\mathcal{D}$, which is formally defined in Definition 6.

Definition 6 (Skyline Query).

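Given a set of $t$-dimensional data points $\mathcal{D}$, we say that $x \in \mathcal{D}$ dominates $y \in \mathcal{D}$ if $x[i] \leq y[i]$ for all $i \in [t]$, and there exists at least one $j \in [t]$ such that $x[j] < y[j]$. The skyline of the dataset $\mathcal{D}$ is the subset of the data points that are not dominated by any of the data points in $\mathcal{D}$.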
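The dominance test and the resulting skyline can be written directly from this definition. The following Python sketch uses a quadratic scan and assumes, as the paper does, that lower values are better in every dimension; the sample points and function names are made up for illustration.

```python
def dominates(x, y):
    """x dominates y: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def skyline(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical two-dimensional points (lower is better in both dimensions).
points = [(1, 4), (2, 2), (3, 3), (4, 1), (2, 5)]
result = skyline(points)
```

Here (3, 3) and (2, 5) are both dominated by (2, 2), while the remaining three points are mutually incomparable and form the skyline.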

Over the past decade or so, skyline queries have also been studied extensively in graphs (Zou et al., 2010; Khan et al., 2012), which we define next.

Definition 7 (Skyline Query in Graphs).

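Given a graph $G(V, E, W)$ and a subset of vertices $Q \subset V$ (called query vertices), for any two data vertices (vertices that are not query vertices, i.e., vertices in $V \setminus Q$) $u$ and $v$, we say $u$ dominates $v$ if $dist(u, q) \leq dist(v, q)$ for all $q \in Q$, and there exists $q' \in Q$ such that $dist(u, q') < dist(v, q')$. The skyline query asks to return the data vertices that are not dominated by other data vertices.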
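For a deterministic graph, this definition amounts to computing one distance vector per data vertex (one entry per query vertex) and then taking the skyline of those vectors. A minimal Python sketch follows, using Dijkstra's algorithm for the distances; the example graph, vertex names, and function names are hypothetical.

```python
import heapq


def shortest_distances(adj, source):
    """Dijkstra's algorithm; adj maps a vertex to a list of (neighbor, weight)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist[x]:
            continue  # stale heap entry
        for y, w in adj.get(x, ()):
            nd = d + w
            if nd < dist.get(y, float("inf")):
                dist[y] = nd
                heapq.heappush(heap, (nd, y))
    return dist


def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def graph_skyline(adj, queries):
    """Data vertices whose distance vectors to the queries are not dominated."""
    query_set = set(queries)
    per_query = [shortest_distances(adj, q) for q in queries]
    data = [v for v in adj if v not in query_set]
    vec = {v: tuple(d.get(v, float("inf")) for d in per_query) for v in data}
    return [v for v in data if not any(dominates(vec[u], vec[v]) for u in data)]


def build_adj(edges):
    """Build an undirected adjacency list from (u, v, weight) triples."""
    adj = {}
    for a, b, w in edges:
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    return adj


# Hypothetical graph with two query vertices (q1, q2) and three data vertices.
edges = [("q1", "a", 1.0), ("a", "q2", 1.0), ("q1", "b", 2.0),
         ("b", "q2", 2.0), ("q1", "c", 0.5), ("c", "q2", 3.0)]
adj = build_adj(edges)
result = graph_skyline(adj, ["q1", "q2"])
```

In this example vertex b is farther than a from both query vertices and is therefore dominated, while c is closer to q1 but farther from q2 than a, so a and c both stay in the skyline.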

Although the skyline query problem has been studied in the context of probabilistic data (Atallah and Qi, 2009; Le et al., 2016; Zhang et al., 2019), to the best of our knowledge it has not been studied in the context of uncertain graphs. In this paper, we introduce the problem of dynamic skyline queries on uncertain graphs (DySky), which is defined next.

Definition 8 (Dynamic Skyline Queries on Uncertain Graphs).

Given an uncertain graph $\mathcal{G}(V, E, W, P)$ and a subset of vertices $Q \subset V$ (called query vertices), the problem of dynamic skyline queries on uncertain graphs asks to find the subset of the data vertices $V \setminus Q$ such that none of them is dominated by any other data vertex.

Figure 1 shows a toy example of an uncertain graph with its majority distance, expected distance, and shortest path distance (for the deterministic version) tables, where the skyline vertices are marked in orange. Observe that the skyline vertex set changes as the distance measure changes. This motivates us to study the DySky Problem under two different distance measures.

Figure 1. (a) An uncertain graph, with its query vertices and data vertices. (b) Distance tables between the query and data vertices under Majority Distance (MD), Expected Distance, and Shortest Path Distance (in the deterministic version). Skyline vertices in each case are marked in orange.

3. Proposed Methodology

Now, we describe the proposed methodology for solving the DySky Problem. Initially, we start by describing an overview of it.

3.1. Overview

The proposed methodology is broadly divided into three steps:

  • Step 1 (Pruning): In this step, a subset of the data vertices is returned as the candidate skyline vertices. This step comprises two sub-steps. First, pruning is done by performing Breadth First Search (B.F.S., henceforth) from the query vertices; subsequently, pruning is done based on distance computation.

  • Step 2 (Distance Computation): In this step, distance computation is done between the candidate skyline vertices and query vertices. As mentioned previously, in our study we have used majority distance and expected distance.

  • Step 3 (Skyline Vertex Set Generation): Based on the previously computed distances, any existing skyline algorithm can be used to find the actual skyline vertices. In our study, we have used the Block Nested Loop (BNL) Algorithm proposed by Borzsony et al. (Borzsony et al., 2001).
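Step 3 relies on the Block Nested Loop idea: keep a window of currently undominated entries and compare each incoming entry against it. The Python sketch below is a simplified in-memory variant with an unbounded window, so it omits the temporary-file passes of the original BNL algorithm. It operates on a hypothetical table that maps each candidate vertex to its distance vector; all names and values are illustrative.

```python
def dominates(x, y):
    """x dominates y: no worse everywhere, strictly better somewhere (lower is better)."""
    return all(a <= b for a, b in zip(x, y)) and any(a < b for a, b in zip(x, y))


def bnl_skyline(distance_table):
    """Simplified BNL: distance_table maps vertex -> tuple of distances to the queries."""
    window = []  # (vertex, vector) pairs that are undominated so far
    for vertex, vec in distance_table.items():
        if any(dominates(w_vec, vec) for _, w_vec in window):
            continue  # the incoming vertex is dominated, discard it
        # Evict window entries that the incoming vertex dominates, then admit it.
        window = [(w, w_vec) for w, w_vec in window if not dominates(vec, w_vec)]
        window.append((vertex, vec))
    return [vertex for vertex, _ in window]


# Hypothetical distance vectors of three candidates to two query vertices.
table = {"a": (1.0, 1.0), "b": (2.0, 2.0), "c": (0.5, 2.5)}
result = bnl_skyline(table)
```

Because every entry is compared only against the current window rather than the whole input, the window never holds a dominated vertex once the scan finishes.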

Next, we proceed towards representing the proposed methodology in an algorithmic form and its detailed analysis.

3.2. The Algorithm

Algorithms 1, 2, and 3 together constitute the proposed methodology for the DySky Problem. We describe the entire procedure in two subsections, starting with the pruning step.

3.2.1. The Pruning Step

Algorithm 1 describes the B.F.S.-based and distance-based pruning strategies. It takes the uncertain graph, the set of query vertices, and a distance threshold as inputs and outputs the candidate skyline vertices. In B.F.S. pruning, a B.F.S. tree is constructed from each query vertex to check connectivity. First, we create an empty dictionary whose keys are data vertices. If a query vertex and a data vertex are connected and the data vertex already has an entry in the dictionary, the query vertex is added to the values of that key. Otherwise, a key for the data vertex is created and the query vertex is added as its first value. The data vertices that are reachable from all the query vertices are then kept as the candidate skyline vertices. This ends the B.F.S. pruning.

In practice, even when two vertices are connected, a path whose distance exceeds a certain threshold makes reaching one from the other costly. To eliminate such vertices, we perform distance-based pruning. For this purpose, the distance between each candidate skyline vertex and each query vertex is computed. If, for a candidate skyline vertex, there exists at least one query vertex whose computed distance exceeds the user-defined threshold, the candidate is removed from the candidate skyline vertex set.

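Data: Uncertain Graph $\mathcal{G}(V, E, W, P)$, The Set of Query Vertices $Q$, Distance Threshold $\theta$.
Result: Candidate Skyline Vertices $C$.
1   $Dict \leftarrow \{\}$;
2   for $q \in Q$ do
3       for $u \in V \setminus Q$ do
4           if Isconnected($uq$) then
5               if $u \in Dict.keys()$ then
6                   $Dict[u] \leftarrow Dict[u] \cup \{q\}$;
7               else
8                   create the key $u$ in $Dict$;
9                   $Dict[u] \leftarrow \{q\}$;
10              end if
12          end if
14      end for
16  end for
17  $C \leftarrow \emptyset$;
18  for $u \in Dict.keys()$ do
19      if $|Dict[u]| = |Q|$ then
20          $C \leftarrow C \cup \{u\}$;
22      end if
24  end for
25  for $u \in C$ do
26      for $q \in Q$ do
27          if $dist(u, q) > \theta$ then
28              $C \leftarrow C \setminus \{u\}$;
30          end if
32      end for
34  end for
Algorithm 1 Step 1 (B.F.S. and Distance-based Pruning)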
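The two pruning stages described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that the distance used for pruning is the weighted shortest path distance on the underlying graph (ignoring edge probabilities), and it runs Dijkstra's algorithm once per query vertex rather than once per vertex pair. The example graph, the threshold values, and all function names are hypothetical.

```python
import heapq
from collections import deque


def bfs_reachable(adj, source):
    """Set of vertices reachable from source (edge weights are ignored)."""
    seen, queue = {source}, deque([source])
    while queue:
        x = queue.popleft()
        for y, _w in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return seen


def dijkstra(adj, source):
    dist, heap = {source: 0.0}, [(0.0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist[x]:
            continue
        for y, w in adj.get(x, ()):
            if d + w < dist.get(y, float("inf")):
                dist[y] = d + w
                heapq.heappush(heap, (d + w, y))
    return dist


def prune(adj, queries, threshold):
    """Return the candidate skyline vertices after B.F.S. and distance pruning."""
    query_set = set(queries)
    reached_by = {}  # data vertex -> set of query vertices that reach it
    for q in query_set:
        for u in bfs_reachable(adj, q):
            if u not in query_set:
                reached_by.setdefault(u, set()).add(q)
    # B.F.S. pruning: keep vertices reachable from every query vertex.
    candidates = {u for u, qs in reached_by.items() if len(qs) == len(query_set)}
    # Distance pruning: drop vertices farther than the threshold from any query.
    for q in query_set:
        dist = dijkstra(adj, q)
        candidates = {u for u in candidates if dist.get(u, float("inf")) <= threshold}
    return candidates


def build_adj(edges):
    adj = {}
    for a, b, w in edges:
        adj.setdefault(a, []).append((b, w))
        adj.setdefault(b, []).append((a, w))
    return adj


# Hypothetical graph: x and y form a separate component, b is far from q1.
adj = build_adj([("q1", "a", 1.0), ("a", "q2", 1.0), ("q2", "b", 10.0), ("x", "y", 1.0)])
tight = prune(adj, ["q1", "q2"], threshold=5.0)
loose = prune(adj, ["q1", "q2"], threshold=100.0)
```

With the tight threshold only vertex a survives, because b is at distance 12 from q1; with the loose threshold b is kept as well, while x and y are always removed by the B.F.S. stage.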

To work correctly, a pruning strategy should guarantee that it does not remove any skyline vertex. We state that Algorithm 1 is a correct pruning strategy in Lemma 1.

Lemma 1.

The proposed pruning strategy (Algorithm 1) is correct.

Proof.

Follows from the description. ∎

Now, we do an analysis for time and space requirement of Algorithm 1. Let be the number of query vertices, i.e., . For creating the B.F.S. trees rooted at the query vertices requires time. The maximum number of values associated with a ‘key’ in the dictionary is of . Execution time from Line No. to and to requires and . Now, in distance-based pruning, the number of distance computations is . Computing shortest path between two vertices in a weighted graph with positive edge weights requires time. Hence, time requirement for distance-based pruning requires time. Total time requirement for Algorithm 1 is of . Extra space requirement of Algorithm 1 is to store the dictionary , which is of , to store the candidate skyline vertices, which is of , and to perform the B.F.S., which is of . Hence, total space requirement of Algorithm 1 is of . Lemma 2 describes the formal statement.

Lemma 2.

Time and space requirement of Algorithm 1 is of and , respectively.

3.2.2. Distance Computation and Skyline Vertex Set Generation

Now, we describe Steps 2 and 3 of our proposed methodology. Observe that Step 2 differs depending on which distance measure is used (i.e., majority distance or expected distance). Algorithms 2 and 3 describe the last two steps for the majority distance and the expected distance, respectively.

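Data: Candidate Skyline Vertices $C$.
Result: The Skyline Vertex Set $S$.
1   generate the sample graphs $G_1, \ldots, G_N$ from $\mathcal{G}$ as in Definition 2;
2   store their generation probabilities in the array $Prob$;
3   initialize the distance matrix $M$;
4   for $u \in C$ do
5       for $q \in Q$ do
6           Create dictionary $Dist$;
7           for $i \in [N]$ do
8               $d \leftarrow$ shortest path distance between $u$ and $q$ in $G_i$;
9               $Dist[d] \leftarrow Dist[d] + Prob[i]$;
11          end for
12          $M[u][q] \leftarrow \arg\max_{d} Dist[d]$;
14      end for
16  end for
17  $S \leftarrow$ BNL($M$);
18  return $S$;
Algorithm 2 Steps 2 and 3 (Distance Computation and Skyline Vertex Set Generation) for Majority Distance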

For the majority distance case, we first generate a number of sample subgraphs as described in Definition 2 and store the corresponding generation probabilities in an array. Next, the majority distance is computed between each candidate skyline vertex and each query vertex. Finally, the BNL Algorithm is applied to the distance matrix to obtain the skyline vertex set.
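A sampling-based estimate of the majority distance can be sketched in Python as below. Note the assumption: this sketch simply counts how often each shortest path distance occurs across the sampled worlds (a plain Monte Carlo estimate), whereas the paper stores each sample's generation probability, and how those probabilities enter the tally is not reproduced here. Unreachable pairs are recorded at distance infinity, and the sample count, seed, graph, and function names are all hypothetical.

```python
import heapq
import random
from collections import Counter


def dijkstra(adj, source):
    dist, heap = {source: 0.0}, [(0.0, source)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist[x]:
            continue
        for y, w in adj.get(x, ()):
            if d + w < dist.get(y, float("inf")):
                dist[y] = d + w
                heapq.heappush(heap, (d + w, y))
    return dist


def sample_world(edges, rng):
    """Keep each edge independently with its existence probability."""
    adj = {}
    for (a, b), (w, p) in edges.items():
        if rng.random() < p:
            adj.setdefault(a, []).append((b, w))
            adj.setdefault(b, []).append((a, w))
    return adj


def majority_distances(edges, candidates, queries, n_samples, seed=0):
    """Estimate the majority distance for every (candidate, query) pair."""
    rng = random.Random(seed)
    tallies = {(u, q): Counter() for u in candidates for q in queries}
    for _ in range(n_samples):
        adj = sample_world(edges, rng)
        for q in queries:
            dist = dijkstra(adj, q)  # one Dijkstra run per query and sample
            for u in candidates:
                tallies[(u, q)][dist.get(u, float("inf"))] += 1
    # The most frequent distance in the samples estimates the majority distance.
    return {pair: c.most_common(1)[0][0] for pair, c in tallies.items()}


# Hypothetical chain q - a - b whose edges always exist.
certain = {("q", "a"): (2.0, 1.0), ("a", "b"): (3.0, 1.0)}
md_certain = majority_distances(certain, ["a", "b"], ["q"], n_samples=20)

# Hypothetical single edge that exists in about 90% of the worlds.
likely = {("q", "a"): (2.0, 0.9)}
md_likely = majority_distances(likely, ["a"], ["q"], n_samples=500, seed=1)
```

The resulting table of estimated distances is exactly the kind of distance matrix that a skyline algorithm such as BNL can then consume.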

Now, we analyze Algorithm 2 for time and space requirement. As mentioned in Definition 2, generation of number of subgraphs require time. Using Dijkstra’s algorithm computing the shortest path between a pair of vertices requires time. Hence, execution time from Line to requires . Now, BNL algorithm requires time. Extra space consumed by Algorithm 2 is to store the array , , and the matrix which requires , , and space, respectively. The formal statement is mentioned in Lemma 3.

Lemma 3.

Time and space requirement of Algorithm 2 is of and , respectively.

Lemmas 2 and 3 together imply the statement in Theorem 1.

Theorem 1.

If majority distance is concerned, the proposed methodology returns the skyline vertex set in time and space.

It is trivial to observe that Algorithm 3 just implements the expected distance, and hence, without explanation we move to analyze the algorithm. Assume that maximum degree of the input uncertain graph is . Hence, the maximum number paths upto length between any pair of vertices is of . Hence, running time from Line to is of . Hence, total running time of Algorithm 3 is of . Extra space consumed by the Algorithm 3 is to store the matrix , array , and which requires . Hence, Lemma 4 holds.

Lemma 4.

The running time and space requirement of Algorithm 3 is of and , respectively.

Lemmas 2 and 4 together imply the statement in Theorem 2.

Theorem 2.

If expected distance is concerned, the proposed methodology returns the skyline vertex set in time and space.

Data: Candidate Skyline Vertices
. Result: The Skyline Vertex Set .
1 ;
2 for  do
3       for  do
4             = Compute all paths from to upto length ;
5             ; ;
6             for  do
7                   for  do
8                         ;
9                         ;
10                        
11                   end for
12                  
13             end for
14            ;
15            
16       end for
17      
18 end for
19;
20 return ;
Algorithm 3 Steps 2 and 3 (Distance Computation and Skyline Vertex Set Generation) for Expected Distance

4. Experimental Evaluations

Figure 2. Box plot of the candidate skyline size with respect to the query size for the (a) Minnesota Road Network, (b) P2P Network, and (c) USA Road Network datasets.
Figure 3. Query size vs. skyline size for the (a) Minnesota Road Network, (b) P2P Network, and (c) USA Road Network datasets.

In this section we describe the experimental validations of our proposed approach. Initially, we start by describing the datasets.

4.1. Datasets

In our study, we have used three datasets from three different contexts, described below.

  • Minnesota Road Network (MRN) (Rossi and Ahmed, 2015): This is a road network dataset of Minnesota. Here, the junctions are represented by nodes, and if two junctions are connected by a road, then the corresponding two vertices are connected by an edge.

  • P2P Network (Leskovec et al., 2007; Ripeanu and Foster, 2002): This dataset contains a sequence of snapshots of the Gnutella peer-to-peer file sharing network. There are a total of 9 snapshots of the Gnutella network, collected in August 2002. Nodes represent hosts in the Gnutella network topology and edges represent connections between the Gnutella hosts.

  • USA Road Network (URN) (Rossi and Ahmed, 2015): This dataset describes a road network from the United States. Here, vertices represent junctions, and an edge between two vertices signifies that the corresponding two junctions are connected by a road.

Please refer to Table 1 for basic statistics of the datasets. All the datasets are undirected and unweighted. Probability of existence and weight of each edge is chosen from the intervals and uniformly at random.
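Turning an unweighted, deterministic edge list into an uncertain graph as described above can be sketched in Python as follows. The sampling intervals passed in the example are placeholders chosen for illustration only, since the intervals used in the paper are not reproduced here; the function name and edge list are likewise hypothetical.

```python
import random


def make_uncertain(edge_list, weight_range, prob_range, seed=None):
    """Attach a uniformly random weight and existence probability to each edge."""
    rng = random.Random(seed)
    return {
        edge: (rng.uniform(*weight_range), rng.uniform(*prob_range))
        for edge in edge_list
    }


# Hypothetical edge list and placeholder intervals (not the paper's values).
edge_list = [(0, 1), (1, 2), (2, 3), (0, 3)]
uncertain = make_uncertain(edge_list, weight_range=(1.0, 100.0),
                           prob_range=(0.05, 1.0), seed=42)
```

Fixing the seed keeps the generated weights and probabilities reproducible across runs, which helps when repeating an experiment several times on the same uncertain graph.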

Dataset       n        m        Density   Avg. Degree
MRN           2642     3300               2
P2P Network   8114     26013              6.41
URN           129164   165435             2.56
Table 1. Basic Statistics of the Datasets

4.2. Experimental Setup

In our study the following three different query vertex selection strategies have been adopted:

  • RAND: In this method, the first query vertex is chosen uniformly at random, and the remaining query vertices are chosen uniformly at random from the two-hop neighbors of the initially selected vertex.

  • HDEG: In this method, the nodes whose degree exceeds a threshold value are marked first, and one of them is chosen uniformly at random as the first query vertex. The remaining query vertices are chosen uniformly at random from the two-hop neighbors of the initially selected vertex.

  • HCLUS: This method is the same as HDEG, except that the subset of vertices from which the first query vertex is chosen is determined by the clustering coefficient of the nodes.
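The three strategies share one pattern: pick a first query vertex from an eligible pool, then draw the rest from its two-hop neighborhood. A minimal Python sketch of that pattern follows. Reading "two-hop neighbors" as all vertices within two hops (including direct neighbors) is an assumption of this sketch, and the function names, the `eligible` parameter, and the example graph are hypothetical.

```python
import random


def two_hop_neighbors(adj, v):
    """Vertices within two hops of v, excluding v itself (an assumed reading)."""
    one_hop = set(adj.get(v, ()))
    within_two = set(one_hop)
    for x in one_hop:
        within_two.update(adj.get(x, ()))
    within_two.discard(v)
    return within_two


def select_queries(adj, k, rng, eligible=None):
    """Pick a first vertex (optionally filtered), then k-1 of its two-hop neighbors."""
    pool = sorted(v for v in adj if eligible is None or eligible(v))
    first = rng.choice(pool)
    neighbors = sorted(two_hop_neighbors(adj, first))
    rest = rng.sample(neighbors, min(k - 1, len(neighbors)))
    return [first] + rest


# Hypothetical path graph 0 - 1 - 2 - 3 - 4 - 5 (adjacency without weights).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

rand_queries = select_queries(adj, 3, random.Random(7))                 # RAND-like
hdeg_queries = select_queries(adj, 3, random.Random(7),
                              eligible=lambda v: len(adj[v]) > 1)      # HDEG-like
```

An HCLUS-like variant would pass a predicate on the clustering coefficient instead of the degree; the rest of the selection stays the same.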

For each selection strategy, we choose the query size from the set {2, 3, 5, 8, 10, 15, 20}. Each experiment is repeated 10 times. All the algorithms have been implemented in a Python 3.5 + NetworkX 2.1 environment on an HPC cluster with 5 nodes, each having 64 cores and 160 GB of memory. The implementations are available at https://github.com/BITHIKA1992/Skyline_Uncertain_Graph/

4.3. Goals of the Experiments

The experiments that have been conducted here aim to address the following questions:

  • Efficiency of the Pruning Strategies: As the number of query vertices increases, what fraction of the data vertices is removed before distance computation?

  • Query Size Vs. Skyline Vertices: Under different query vertex selection strategies, how does the cardinality of the skyline vertex set change with respect to the query size?

  • Distance Metric Vs. Skyline Vertices: For a fixed query selection strategy and query size, how does the cardinality of the skyline vertex set change with respect to the distance metric?

  • Query Selection Strategy Vs. Skyline Vertices: For a fixed query size and distance metric, how does the cardinality of the skyline vertex set change with respect to the query selection strategy?

  • Query Size Vs. Computational Time: For a fixed distance metric, how does the computational time grow with respect to the query size?

4.4. Results and Discussion

Here, we address the research questions that we have raised.

4.4.1. Efficiency of the Pruning Strategies

As we apply BFS pruning in each dataset, it returns the vertices from the largest component. This reduces , and number of vertices for URN and P2P network dataset. For distance-based pruning, we set the threshold to 400, considering a 4-hop path with the maximum edge weight of 100. In Figure 2, we show the box plot of the candidate size for each query size and query selection strategy. The candidate size for the RAND selection strategy is smaller than for the other two in all the datasets, which is easy to see. For the P2P network, the interquartile range is very high compared to the other datasets, owing to the high average degree of the network. Also, for the RAND selection strategy, this range is the highest for small query sizes, owing to the existence of various small components in the network. Both road networks are very sparse, and for large query sizes such as 15 and 20 the candidate size becomes very small and the variance also reduces. Given this sparsity, for the small road network MRN it is impossible to find vertices connected to all the query vertices within a distance of 400. So, we omit the results for query sizes 15 and 20 on the MRN dataset.

Dataset      Query  Step 1                                          Step 2                        Step 3              Total Time
             Size   Sample Gen    B.F.S. Pruning  Distance Pruning  MD Comp. Time  ED Comp. Time  Skyline Comp. Time  MD            ED
MRN          2      28.6028       0.0028          0.5797            0.2195         0.0328         0.0019              29.4070       0.6174
             3      29.3554       0.0029          0.6314            0.2232         0.0320         0.0021              30.2152       0.6685
             5      28.9936       0.0026          0.6565            0.2544         0.0365         0.0025              29.9098       0.6982
             8      30.3126       0.0029          0.7480            0.2443         0.0382         0.0029              31.3109       0.7922
             10     29.3217       0.0029          0.8090            0.3156         0.0487         0.0036              30.4530       0.8643
P2P Network  2      1585.5587     0.0135          494.5580          16676.0924     156.4454       0.0026              18756.2254    651.0197
             3      1596.0541     0.0142          737.8904          21824.5948     211.6984       0.0030              24158.5568    949.6062
             5      1594.3928     0.0149          1122.0655         27233.3476     299.2759       0.0159              29949.8369    1421.3723
             8      1583.5686     0.0148          1256.8362         26993.8274     406.0604       0.3052              29834.5524    1663.2167
             10     1619.4677     0.0140          1654.2035         39048.2783     572.3251       2.3940              42324.3577    2228.9368
             15     1567.8540     0.0153          2003.2736         44549.2078     707.3537       38.9008             48159.2516    2749.5436
             20     1611.6262     0.0168          2826.1580         60793.5038     954.3398       661.4532            65892.7582    4441.9679
URN          2      137954.2686   0.1908          52.9266           0.7025         0.0650         0.0039              138008.0926   53.1864
             3      139426.9728   0.1735          48.0920           0.7844         0.0730         0.0041              139476.0271   48.3427
             5      138948.9628   0.1743          54.1615           1.3545         0.1053         0.0063              139004.6596   54.4476
             8      163983.4069   0.1838          51.6949           1.2328         0.0928         0.0043              164036.5230   51.9759
             10     162627.5420   0.1947          57.9749           1.4499         0.1180         0.0070              162687.1688   58.2948
             15     115441.9538   0.1738          67.8932           0.8683         0.0847         0.0167              115510.9059   68.1685
             20     114801.2682   0.1805          79.6447           0.1371         0.0241         0.0107              114881.2415   79.8602
Table 2. Computational time (in seconds) for skyline vertex set generation on the Minnesota Road Network (MRN), P2P Network, and USA Road Network (URN) datasets. MD: majority distance; ED: expected distance.

4.4.2. Query Size Vs. Skyline Vertices

In Figure 3, we show the plot of query size vs. skyline size for the two distance metrics and the three query selection strategies. In this part, we compare the sizes. We report the mean skyline size over all 10 executions. As the query size increases, the skyline size increases. However, for the URN dataset in Figure 3(c), the skyline size decreases for large query sizes. The reason is the small size of the candidate skyline, which can be verified from Figure 2(c). Also, for both road network datasets the maximum skyline size reaches approximately 15, whereas for the P2P network it reaches around 1500. This is due to its candidate size. In both cases, for large query sizes the ratio of candidate size to skyline size is very small. As the number of query vertices increases, the chance of domination decreases.

4.4.3. Distance Metric Vs. Skyline Vertices

In this part, referring to Figure 3, we describe the behavior of the skyline size with respect to the two distance metrics. For the road networks in Figure 3(a) and (c), the skyline size is similar under both distance metrics. However, for the P2P network in Figure 3(b), the skyline size under the expected distance (maximum around 1500) is much larger than under the majority distance (maximum around 300). The reason lies in the network's high average degree and density. As the number of paths between a query vertex and a data vertex increases, a data vertex is less likely to dominate other data vertices under the expected distance. This results in a large skyline vertex set. This can be verified from Figure 3(b) by looking at the HDEG and HCLUS selection strategies, where the majority distance results differ from the expected distance results. For RAND, however, the size is similar under both distances. From the experiments, we also observe that for a particular query vertex set, the skyline vertices may not be the same under the two distance metrics.

4.4.4. Query Selection Strategy Vs. Skyline Vertices

In this part, referring to Figure 3, we describe the behavior of the skyline size with respect to the query selection strategies. First, we describe the threshold values selected for HDEG and HCLUS on the different datasets. As the P2P network dataset contains high-degree nodes, we set the degree threshold to 15, which returns 440 nodes. For both road networks, the maximum degree is around 5. Hence, for the MRN and URN datasets, this threshold is set to 2 and 3, respectively. The clustering coefficient threshold is taken as 0, as the clustering coefficients in all the networks are very small. From Figure 3, the main observation is that, for all the selection strategies, the skyline size does not vary much for smaller query sizes, whereas for large query sizes HCLUS gives the largest number of skyline vertices.

4.4.5. Computational Time

Table 2 contains the stepwise computational time required to find the skyline vertices for the different datasets. From the table, it can be observed that for all the datasets, as the query size increases, the time required to find the skyline vertex set also increases. As the query size changes, the time required for distance-based pruning, distance computation, and skyline computation (using BNL) increases. Also, for all the datasets, the main time requirement is due to the sample graph generation. Since sample generation is not required for the expected distance, the time requirement in this setting is much smaller than for the majority distance. In particular, for query size , the ratio between the computational time requirement for majority distance to expected distance for MRN, P2P, and URN are , , and , respectively.
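The relationship between the stepwise times and the total times in Table 2 can be checked with a few lines of Python. The decomposition used below (the MD total as the sum of sample generation, both pruning times, MD computation, and skyline computation; the ED total without sample generation and with ED computation instead) is inferred from the table values rather than stated in the text, and it agrees with the reported totals to within rounding. Query size 2 is chosen here purely for illustration.

```python
# Rows of Table 2 for query size 2 (times in seconds), copied from the table.
rows = {
    "MRN": dict(sample=28.6028, bfs=0.0028, dist_prune=0.5797, md=0.2195,
                ed=0.0328, skyline=0.0019, total_md=29.4070, total_ed=0.6174),
    "P2P": dict(sample=1585.5587, bfs=0.0135, dist_prune=494.5580, md=16676.0924,
                ed=156.4454, skyline=0.0026, total_md=18756.2254, total_ed=651.0197),
    "URN": dict(sample=137954.2686, bfs=0.1908, dist_prune=52.9266, md=0.7025,
                ed=0.0650, skyline=0.0039, total_md=138008.0926, total_ed=53.1864),
}


def recomputed_totals(r):
    """Assumed decomposition of the two total-time columns."""
    md_total = r["sample"] + r["bfs"] + r["dist_prune"] + r["md"] + r["skyline"]
    ed_total = r["bfs"] + r["dist_prune"] + r["ed"] + r["skyline"]
    return md_total, ed_total


# Ratio of the reported total time for majority distance to expected distance.
ratios = {name: r["total_md"] / r["total_ed"] for name, r in rows.items()}

for name, r in rows.items():
    md_total, ed_total = recomputed_totals(r)
    print(f"{name}: MD {md_total:.4f} vs {r['total_md']:.4f}, "
          f"ED {ed_total:.4f} vs {r['total_ed']:.4f}, ratio {ratios[name]:.1f}")
```

The ratio is largest on the URN dataset, where sample generation accounts for almost all of the majority distance time.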

Now, we turn to the dataset-specific observations. For the P2P Network dataset, when the query size increases beyond , there is a sharp increase in the skyline computation time. This is due to two reasons: as Figures 2(b) and 3(b) show, both the candidate size and the skyline size are larger than for the smaller query sizes.

5. Conclusion and Future Directions

In this paper, we introduce the problem of dynamic skyline queries on uncertain graphs for two different distance measures, namely, majority distance and expected distance. For this problem, we have proposed a methodology having three main steps: pruning, distance computation, and skyline vertex set generation. The proposed methodology has been analyzed to understand its time and space requirement. The experimental results demonstrate that it can find out the skyline vertex set with reasonable computation time, particularly for the expected distance.

This study can be extended in several directions. An interesting direction is to develop an efficient methodology that reduces the computational time even for the majority distance. One possible way could be to parallelize the sample graph generation. Another important direction is to provide a sample bound for the majority distance case, that is, the minimum number of samples from the possible worlds that one should draw to answer the skyline query with probability above a given threshold.

Acknowledgements.
The authors thank the Ministry of Human Resource and Development (MHRD), Government of India, for sponsoring the project E-business Center of Excellence under the scheme of Center for Training and Research in Frontier Areas of Science and Technology (FAST), Grant No. F.No.5-5/2014-TS.VII.

References

  • M. J. Atallah and Y. Qi (2009) Computing all skyline probabilities for uncertain data. In Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 279–287. Cited by: §2.
  • F. Bonchi, F. Gullo, A. Kaltenbrunner, and Y. Volkovich (2014) Core decomposition of uncertain graphs. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1316–1325. Cited by: §1.1.
  • S. Borzsony, D. Kossmann, and K. Stocker (2001) The skyline operator. In Proceedings 17th international conference on data engineering, pp. 421–430. Cited by: §1.1, §1, 3rd item.
  • Y. Chen, X. Zhao, X. Lin, Y. Wang, and D. Guo (2018) Efficient mining of frequent patterns on uncertain graphs. IEEE Transactions on Knowledge and Data Engineering 31 (2), pp. 287–300. Cited by: §1.1.
  • Y. Chen, X. Zhao, X. Lin, and Y. Wang (2015) Towards frequent subgraph mining on single large uncertain graphs. In 2015 IEEE International Conference on Data Mining, pp. 41–50. Cited by: §1.1.
  • J. Chomicki, P. Ciaccia, and N. Meneghetti (2013) Skyline queries, front and back. ACM SIGMOD Record 42 (3), pp. 6–18. Cited by: §1.
  • T. De Matteis, S. Di Girolamo, and G. Mencagli (2015) A multicore parallelization of continuous skyline queries on data streams. In European Conference on Parallel Processing, pp. 402–413. Cited by: §1.1.
  • T. De Matteis, S. Di Girolamo, and G. Mencagli (2016) Continuous skyline queries on multicore architectures. Concurrency and Computation: Practice and Experience 28 (12), pp. 3503–3522. Cited by: §1.1.
  • X. Fu, X. Miao, J. Xu, and Y. Gao (2017) Continuous range-based skyline queries in road networks. World Wide Web 20 (6), pp. 1443–1467. Cited by: §1.1.
  • Z. Halim, M. Waqas, A. R. Baig, and A. Rashid (2017) Efficient clustering of large uncertain graphs using neighborhood information. International Journal of Approximate Reasoning 90, pp. 274–291. Cited by: §1.1.
  • G. He, L. Chen, C. Zeng, Q. Zheng, and G. Zhou (2016) Probabilistic skyline queries on uncertain time series. Neurocomputing 191, pp. 224–237. Cited by: §1.1.
  • J. Hu, R. Cheng, Z. Huang, Y. Fang, and S. Luo (2017) On embedding uncertain graphs. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 157–166. Cited by: §1.1.
  • Z. Huang, W. Xu, J. Cheng, and J. Ni (2018) An efficient algorithm for skyline queries in cloud computing environments. China Communications 15 (10), pp. 182–193. Cited by: §1.1.
  • H. Jaudoin, P. Nerzic, O. Pivert, and D. Rocacher (2017) On making skyline queries resistant to outliers. In Advances in Knowledge Discovery and Management, pp. 19–38. Cited by: §1.1.
  • B. Jiang and X. Du (2018) Personalized travel route recommendation with skyline query. In 2018 IEEE 9th International Conference on Dependable Systems, Services and Technologies (DESSERT), pp. 549–554. Cited by: §1.1.
  • S. Jiang, J. Zheng, J. Chen, and W. Yu (2016) K-th order skyline queries in bicriteria networks. In Asia-Pacific Web Conference, pp. 488–491. Cited by: §1.1.
  • R. Jin, L. Liu, and C. C. Aggarwal (2011) Discovering highly reliable subgraphs in uncertain graphs. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 992–1000. Cited by: §1.1.
  • V. Kassiano, A. Gounaris, A. N. Papadopoulos, and K. Tsichlas (2016) Mining uncertain graphs: an overview. In International Workshop of Algorithmic Aspects of Cloud Computing, pp. 87–116. Cited by: §1.1.
  • X. Ke, A. Khan, M. A. Hasan, and R. Rezvansangsari (2019a) Budgeted reliability maximization in uncertain graphs. arXiv preprint arXiv:1903.08587. Cited by: §1.1.
  • X. Ke, A. Khan, and L. L. H. Quan (2019b) An in-depth comparison of st reliability algorithms over uncertain graphs. Proceedings of the VLDB Endowment 12 (8), pp. 864–876. Cited by: §1.1.
  • I. Keles and K. Hose (2019) Skyline queries over knowledge graphs. In The 18th International Semantic Web Conference, ISWC 2019. Cited by: §1.1.
  • A. Khan and L. Chen (2015) On uncertain graphs modeling and queries. Proceedings of the VLDB Endowment 8 (12), pp. 2042–2043. Cited by: §1.1.
  • A. Khan, V. Singh, and J. Wu (2012) Finding skyline nodes in large networks. In 2012 IEEE 28th International Conference on Data Engineering Workshops, pp. 198–204. Cited by: §2.
  • A. Khan, Y. Ye, and L. Chen (2018) On uncertain graphs. Synthesis Lectures on Data Management 10 (1), pp. 1–94. Cited by: §1.1.
  • T. M. N. Le, J. Cao, and Z. He (2016) Answering skyline queries on probabilistic data using the dominance of probabilistic skyline tuples. Information Sciences 340, pp. 58–85. Cited by: §1.1, §2.
  • J. Lee, H. Im, and G. You (2016) Optimizing skyline queries over incomplete data. Information Sciences 361, pp. 14–28. Cited by: §1.1.
  • J. Leskovec, J. Kleinberg, and C. Faloutsos (2007) Graph evolution: densification and shrinking diameters. ACM Transactions on Knowledge Discovery from Data (TKDD) 1 (1), pp. 2. Cited by: 2nd item.
  • J. Liu, J. Yang, L. Xiong, and J. Pei (2018a) Secure and efficient skyline queries on encrypted data. IEEE Transactions on Knowledge and Data Engineering 31 (7), pp. 1397–1411. Cited by: §1.1.
  • J. Liu, X. Li, K. Ren, J. Song, and Z. Zhang (2018b) Parallel n-of-n skyline queries over uncertain data streams. In International Conference on Database and Expert Systems Applications, pp. 176–184. Cited by: §1.1.
  • J. Liu, X. Li, K. Ren, and J. Song (2019) Parallelizing uncertain skyline computation against n-of-n data streaming model. Concurrency and Computation: Practice and Experience 31 (4), pp. e4848. Cited by: §1.1.
  • X. Miao, Y. Gao, G. Chen, and T. Zhang (2016) K-dominant skyline queries on incomplete data. Information Sciences 367, pp. 990–1011. Cited by: §1.1.
  • X. Miao, Y. Gao, S. Guo, and G. Chen (2018) On efficiently answering why-not range-based skyline queries in road networks. IEEE Transactions on Knowledge and Data Engineering 30 (9), pp. 1697–1711. Cited by: §1.1.
  • H. T. H. Nguyen and J. Cao (2015) Preference-based top-k representative skyline queries on uncertain databases. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 280–292. Cited by: §1.1.
  • D. Ouyang, L. Yuan, F. Zhang, L. Qin, and X. Lin (2018) Towards efficient path skyline computation in bicriteria networks. In International Conference on Database Systems for Advanced Applications, pp. 239–254. Cited by: §1.1.
  • Y. Park, J. Min, and K. Shim (2015) Processing of probabilistic skyline queries using mapreduce. Proceedings of the VLDB Endowment 8 (12), pp. 1406–1417. Cited by: §1.1.
  • Y. Park, J. Min, and K. Shim (2017) Efficient processing of skyline queries using mapreduce. IEEE Transactions on Knowledge and Data Engineering 29 (5), pp. 1031–1044. Cited by: §1.1.
  • M. Potamias, F. Bonchi, A. Gionis, and G. Kollios (2009) Nearest-neighbor queries in probabilistic graphs. Technical report Boston University Computer Science Department. Cited by: Definition 4.
  • M. Ripeanu and I. Foster (2002) Mapping the gnutella network: macroscopic properties of large-scale peer-to-peer systems. In international workshop on peer-to-peer systems, pp. 85–93. Cited by: 2nd item.
  • R. A. Rossi and N. K. Ahmed (2015) The network data repository with interactive graph analytics and visualization. In AAAI. Cited by: 1st item, 3rd item.
  • W. Son, F. Stehn, C. Knauer, and H. Ahn (2017) Top-k manhattan spatial skyline queries. Information Processing Letters 123, pp. 27–35. Cited by: §1.1.
  • Y. Wang, B. Song, J. Wang, L. Zhang, and L. Wang (2016) Geometry-based distributed spatial skyline queries in wireless sensor networks. Sensors 16 (4), pp. 454. Cited by: §1.1.
  • P. Yawalkar and S. Ranu (2019) Route recommendations on road networks for arbitrary user preference functions. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 602–613. Cited by: §1.1.
  • B. Yin, K. Gu, X. Wei, S. Zhou, and Y. Liu (2018) A cost-efficient framework for finding prospective customers based on reverse skyline queries. Knowledge-Based Systems 152, pp. 117–135. Cited by: §1.1.
  • Y. Zeng, G. Chen, K. Li, Y. Zhou, X. Zhou, and K. Li (2019) M-skyline: taking sunk cost and alternative recommendation in consideration for skyline query on uncertain data. Knowledge-Based Systems 163, pp. 204–213. Cited by: §1.1.
  • K. Zhang, H. Gao, X. Han, Z. Cai, and J. Li (2019) Modeling and computing probabilistic skyline on incomplete data. IEEE Transactions on Knowledge and Data Engineering. Cited by: §2.
  • W. Zheng, L. Zou, X. Lian, L. Hong, and D. Zhao (2014) Efficient subgraph skyline search over large graphs. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 1529–1538. Cited by: §1.
  • X. Zhou, K. Li, G. Xiao, Y. Zhou, and K. Li (2016) Top favorite probabilistic products queries. IEEE Transactions on Knowledge and Data Engineering 28 (10), pp. 2808–2821. Cited by: §1.1.
  • X. Zhou, K. Li, Y. Zhou, and K. Li (2015) Adaptive processing for distributed skyline queries over uncertain data. IEEE Transactions on Knowledge and Data Engineering 28 (2), pp. 371–384. Cited by: §1.1.
  • X. Zhu, J. Wu, W. Chang, G. Wang, and Q. Liu (2018) Authentication of skyline query over road networks. In International Conference on Security, Privacy and Anonymity in Computation, Communication and Storage, pp. 72–83. Cited by: §1.1.
  • L. Zou, L. Chen, M. T. Özsu, and D. Zhao (2010) Dynamic skyline queries in large graphs. In International Conference on Database Systems for Advanced Applications, pp. 62–78. Cited by: §1, §2.