Distributed Computation of Top-k Degrees in Hidden Bipartite Graphs

Hidden graphs are flexible abstractions composed of a set of known vertices (nodes), whereas the set of edges is not known in advance. To uncover the set of edges, multiple edge probing queries must be executed by evaluating a function f(u,v) that returns true if nodes u and v are connected and false otherwise. Evidently, the graph can be revealed completely if all possible n(n-1)/2 probes are executed for a graph containing n nodes. However, the function f() is usually computationally intensive, and therefore executing all possible probing queries results in high execution costs. The target is to provide answers to useful queries by executing as few probing queries as possible. In this work, we study the problem of discovering the top-k nodes of a hidden bipartite graph with the highest degrees, by using distributed algorithms. In particular, we use Apache Spark and provide experimental results showing that significant performance improvements are achieved in comparison to existing centralized approaches.

1 Introduction

Nowadays, graphs are frequently used to model real-life problems in many diverse application domains, such as social network analysis, searching and mining the Web, pattern mining in bioinformatics and neuroscience. Graph mining is an active and growing research area, aiming at knowledge discovery from graph data [4, 1]. In its simplest form, a graph G is defined by a set of nodes (or vertices) V and a set of edges (or links) E. Each edge connects a pair of nodes. In this work, we focus on simple bipartite graphs, where the set of nodes is composed of two subsets B (the black nodes) and W (the white nodes), such that B ∪ W = V and B ∩ W = ∅. Moreover, all edges of G connect nodes from B and nodes of W, meaning that the two endpoints of an edge belong to different node subsets.

Bipartite graphs have many interesting applications in diverse fields. For example, a bipartite graph may be used to represent product purchases by customers. In this case, an edge exists between a product and a customer when the product was purchased by the customer. As another example, in an Information Retrieval or Text Mining application a bipartite graph may be used to associate different types of tokens that exist in a document. Thus, an edge between a document and a token represents the fact that the token appears in the document. Moreover, there are many applications of bipartite graphs in systems biology and medicine [7].

Figure 1: An example of an undirected bipartite graph, with two bipartitions B and W.

Figure 1 depicts a simple bipartite graph containing seven nodes and five edges. Among the black nodes, one has the maximum number of neighbors (i.e., three). One of the nodes is isolated, i.e., it does not have any neighbors.

Motivation and Contributions. Conventional bipartite graphs are characterized by the fact that the sets of vertices B and W as well as the set of edges E are known in advance. Nodes and edges are organized in a way that enables efficient execution of fundamental graph-oriented computational tasks, such as finding nodes with large degree, computing shortest paths, clustering and community detection. Usually, the adjacency list representation is used, which is a good compromise between space requirements and computational efficiency. However, a concept that has started to gain significant interest recently is that of hidden graphs. In contrast to a conventional graph, a hidden bipartite graph is defined as G = (B, W, f()), where B and W are the subsets of nodes and f() is a function which takes as input two vertex identifiers and returns true if the edge exists and false otherwise. Therefore, in a hidden graph the edge set E is not given explicitly and is inferred by using the function f().
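To make the definition concrete, the following minimal Python sketch models a hidden bipartite graph as two known vertex lists plus a probe function. The class name `HiddenBipartiteGraph`, the probe counter, and the toy divisibility rule used as the probe function are assumptions made only for this illustration; they are not part of the original work.

```python
from typing import Callable, Hashable, Sequence


class HiddenBipartiteGraph:
    """Hidden bipartite graph: vertices are known, edges are revealed only by probing."""

    def __init__(self, B: Sequence[Hashable], W: Sequence[Hashable],
                 f: Callable[[Hashable, Hashable], bool]) -> None:
        self.B = list(B)      # black vertices
        self.W = list(W)      # white vertices
        self._f = f           # (possibly expensive) edge-probing function
        self.probes = 0       # number of edge-probing queries issued so far

    def probe(self, b: Hashable, w: Hashable) -> bool:
        """Execute one edge-probing query and count it."""
        self.probes += 1
        return self._f(b, w)

    def degree_by_exhaustion(self, b: Hashable) -> int:
        """Exact degree of b obtained the expensive way, with |W| probes."""
        return sum(1 for w in self.W if self.probe(b, w))


# Toy rule (an assumption for the example): b and w are connected when b divides w.
g = HiddenBipartiteGraph(B=[2, 3, 5], W=[4, 6, 9, 10, 15],
                         f=lambda b, w: w % b == 0)
degrees = {b: g.degree_by_exhaustion(b) for b in g.B}
print(degrees, g.probes)
```

Revealing every degree this way costs |B|·|W| probes, which is exactly the cost that the algorithms studied in this paper try to avoid.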

Hidden graphs are very powerful because they are able to represent any type of relationship among a set of entities. Formally, for n nodes, there is an exponential number of different graphs that may be generated (the exact number is 2^(n(n-1)/2)). Materializing all these different graphs requires significant space. Moreover, it is unlikely that all these graphs will be used eventually. In contrast, the use of a hidden graph enables the representation of any possible graph by the specification of the appropriate function f(). It is noted that the function f() may require the execution of a complex algorithm in order to decide if two nodes are connected by an edge. Therefore, interesting what-if scenarios may be tested with respect to the satisfaction of meaningful properties. It is evident that the complete graph structure may be revealed if all possible |B|·|W| edge probing queries f(b, w) are executed, where b ∈ B and w ∈ W. However, this solution involves the execution of a quadratic number of probes, which is expected to be highly inefficient, taking into account that the function f() is computationally intensive and that real-world graphs are usually very large. Therefore, the target is to provide a solution to a graph-oriented problem by using as few edge probing queries as possible. The problem we are targeting is the discovery of the k nodes with the highest degrees. The degree of a node is defined as the number of its neighbors. This problem has been addressed previously by [9] and [11]. In this work, we are interested in solving the problem by using distributed algorithms over Big Data architectures. This enables the analysis of massive hidden graphs and provides a baseline for executing more complex graph mining tasks, thus overcoming the limitations imposed by centralized approaches.
In particular, we study distributed algorithms for the discovery of the top-k degree nodes in Apache Spark over YARN and HDFS [10] and we offer experimental results based on a cluster of 32 physical machines. Performance evaluation results demonstrate that the proposed techniques are scalable and can be used for the analysis of massive hidden networks. We note that this is the first work addressing the problem of hidden graph analysis in a distributed setting.

Roadmap. The rest of the paper is organized as follows. The next section contains the most representative related work in the area. Section 3 presents fundamental concepts and defines the problem. Section 4 studies our main contribution in detail. Performance evaluation results are offered in Section 5, whereas Section 6 concludes our work and briefly presents interesting directions for future work in the area.

2 Related Research

A hidden graph is able to represent arbitrary relationship types among graph nodes. A research direction that is strongly related to hidden graphs is graph property testing [5]. In this case, one is interested in detecting if a graph satisfies a property or not by using fast approximate algorithms.

Evidently, detecting graph properties efficiently is extremely important. However, an even more useful and more challenging task is to detect specific subgraphs or nodes that satisfy specific properties, by using as few edge probing queries as possible.

Another research direction related to hidden graphs focuses on learning a graph or a subgraph through edge probing queries on pairs or sets of nodes (group testing) [2]. A similar topic is the reconstruction of subgraphs that satisfy certain structural properties [3].

An important difference between graph property testing and analyzing hidden graphs is that in the first case the graph is known whereas in the second case the set of edges is unknown and must be revealed gradually. Moreover, property testing focuses on checking if a specific property holds or not, whereas in hidden graph analysis we are interested in answering specific queries exactly. The problem we attack is the discovery of the top-k nodes with the highest degrees. This problem was solved by [9] and [11] assuming a centralized environment.

The basic algorithm used in [9] was extended in [8] for unipartite graphs, towards detecting if a hidden graph contains a k-core or not. Again, the algorithm proposed in [8] is centralized and its main objective is to minimize the number of edge probing queries performed.

In this paper, we take one step forward towards the design and performance evaluation of distributed algorithms for detecting the top-k nodes with the highest degrees in hidden bipartite graphs, aiming at the reduction of the overall computational cost, which depends on both the number of edge probing queries and the level of parallelism used. We note that this is the first work in the literature to attack the specific problem in a distributed setting.

3 Fundamental Concepts and Problem Definition

In this section, we present some fundamental concepts related to our research and state the problem formally. Recall that the detection of the top-k nodes with the highest degrees in a bipartite graph has been addressed in [9] from a centralized perspective. In that paper, the authors present the Switch-On-Empty (SOE) algorithm, which shows the best performance and provides an optimal solution with respect to the number of edge probing queries that are required. Table 1 presents the most frequently used symbols.

Symbol Interpretation
a hidden graph
set of black vertices of the graph
set of white vertices of the graph
true if the edge exists, false otherwise
, number of black and white vertices
number of highest degree vertices requested
set of neighbors of vertex
degree of vertex
(unknown) set of edges
(unknown) number of edges
number of known existing neighbors of
number of known non-existing neighbors of
total number of edge probing queries issued
Table 1: Frequently used symbols.

Before diving into the details of our proposal, we briefly describe the way the SOE algorithm works. SOE receives as input a hidden bipartite graph with bipartitions B and W. The output of SOE is composed of the k vertices from B or W with the highest degrees. Without loss of generality, we focus on vertices of B. Edge probing queries are executed as follows:

  • Initially, SOE starts from a vertex b1 ∈ B, selects a vertex w1 ∈ W and executes f(b1, w1). If the edge is solid, it continues to perform probes between b1 and another vertex w2 ∈ W.

  • Upon a failure, i.e., when the probe returns an empty result, the algorithm applies the same process for another vertex b2 ∈ B. Vertices for which all the probes have been applied do not participate in future edge probes.

  • A round is complete when all vertices of B have been considered. After each round, some vertices can be safely included in the result set R and they are removed from B. When a vertex must be considered again, we continue the execution of probes remembering the location of the last failure.

  • SOE keeps on performing rounds until the upper bound of vertex degrees in B is less than the current k-th highest degree determined so far. In that case, R contains the required answer and the algorithm terminates. Note that all equal-degree vertices are included in R.
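The steps above can be sketched in a few lines of Python. This is a simplified, centralized illustration written for this text, not the implementation of [9]; the function name `soe_top_k`, the fixed probing order over W, and the toy graph are assumptions.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple


def soe_top_k(B: Sequence[Hashable], W: Sequence[Hashable],
              f: Callable[[Hashable, Hashable], bool],
              k: int) -> Tuple[Dict[Hashable, int], int]:
    """Switch-On-Empty sketch: returns (top-k vertices with degrees, probes used)."""
    pos = {b: 0 for b in B}      # index of the next white vertex to probe
    solid = {b: 0 for b in B}    # solid edges found so far
    empty = {b: 0 for b in B}    # empty edges (failures) found so far
    done: Dict[Hashable, int] = {}   # vertices with exactly known degree
    active = list(B)
    probes = 0
    while active:
        survivors = []
        for b in active:                       # one round over the remaining vertices
            while pos[b] < len(W):
                hit = f(b, W[pos[b]])
                pos[b] += 1
                probes += 1
                if hit:
                    solid[b] += 1
                else:
                    empty[b] += 1
                    break                      # switch to the next vertex on an empty edge
            if pos[b] == len(W):
                done[b] = solid[b]             # all probes executed: degree is exact
            else:
                survivors.append(b)
        active = survivors
        if len(done) >= k:
            kth = sorted(done.values(), reverse=True)[k - 1]
            upper = max((len(W) - empty[b] for b in active), default=-1)
            if upper < kth:                    # no remaining vertex can enter the answer
                break
    if not done:
        return {}, probes
    kth = sorted(done.values(), reverse=True)[min(k, len(done)) - 1]
    return {b: d for b, d in done.items() if d >= kth}, probes


# Toy hidden graph (assumption): neighbors of each black vertex among W = [0, 1, 2, 3].
edges = {0: {0, 1, 2}, 1: {0}, 2: set()}
result, used = soe_top_k([0, 1, 2], [0, 1, 2, 3], lambda b, w: w in edges[b], k=1)
print(result, used)
```

On this toy input the sketch answers with 9 probes instead of the 12 needed to reveal the whole graph; ties with the k-th degree are kept in the answer, as in the description above.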

In this work, we proceed one step further in order to detect high-degree nodes in larger graphs using multiple resources. In this respect, our algorithms are implemented in the Apache Spark engine [12, 13] using the Scala programming language [6]. Spark offers a powerful environment that enables the execution of distributed applications on massive amounts of data. Apache Spark is a unified distributed engine with a rich and powerful API for Scala, Python, Java and R [13]. One of its main characteristics is that (in contrast to Hadoop MapReduce) it exploits main memory as much as possible, being able to persist data across rounds to avoid unnecessary I/O operations. Spark jobs are executed based on a master-slave model in cooperation with a cluster manager such as YARN (http://hadoop.apache.org/) or MESOS (http://mesos.apache.org/).

4 Proposed Approach

In the following sections, we focus on the algorithms developed to solve the distributed detection of the top-k nodes in a hidden bipartite graph. Two algorithms are presented, both inspired by the SOE (Switch-On-Empty) algorithm [9]. The first algorithm, Distributed Switch-On-Empty (DSOE), is a distributed version of the SOE algorithm. The second algorithm, DSOE*, is an improved and more flexible version of DSOE.

4.1 The DSOE Algorithm

Our first solution is very simple but at the same time very effective. The main mechanism has the following rationale. For a vertex we execute edge-probing queries until we get a certain number of negative results f (the parameter f of the routine given below). The value of f depends on the size of the graph and is reduced exponentially from one batch to the next. The idea behind this is that, because the degree distribution of most real-life graphs follows a power law, we do not expect to find many vertices with a high degree. The closer we get to the point of exhausting all probes, the smaller we set f, because we want to avoid as many unnecessary queries as possible. The pseudocode of the routine is given below.

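Function routine(v: vertex, f: Int) : void is
       negatives ← 0
       while negatives < f and v has unexecuted probes do
             execute the next edge-probing query for v
             if edge-probing query == False then
                   negatives ← negatives + 1
       end while
end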
 
The routine can be implemented easily in a distributed setting as a map operation that applies it to every vertex of B. After each batch of routines, we check for the vertices that belong to the answer set R. Vertices that belong to R can be recognized easily, since vertices that have completed all possible probes should be in the answer. If |R| ≥ k, then we need one last batch of routines to finalize R.

Result: Return a set R of vertices that have the highest degree
initialize f and set R ← ∅;
while |R| < k do
      map(routine(b, f));
      check which vertices b have probed all the vertices of W; each such vertex is removed from B and added to R;
      reduce f for the next batch;
end while
map(routine(b, f));
check which vertices b have probed all the vertices of W; each such vertex is removed from B and added to R;
return R;
Algorithm 1 DSOE
In this part we show that our conditions for the loop and for the vertices that belong in R are correct. For a vertex b, if b completes all possible probes from the first routine, then there are only two options for its degree: either or , because there is a possibility that the last query could be negative and b could still have exhausted all possible probes. Assume that while we execute the loop we are in the repeat. Also we assume that after we get a vertex that has exhausted all possible edge-probing queries. For this vertex we again have two possibilities, either or , for the same reason as previously. From the previous observations we can safely conclude that for a vertex that has completed all possible queries in , the best-case scenario for the degree is and the worst-case scenario is . It is evident that and the equality is possible only if . From this conclusion we are sure that if then a vertex that exhausts all the queries can be inserted into R. Taking the equality into account, we have to run the batch of routines one last time, because we may find a vertex with degree equal to the minimum degree in R. Algorithm 1 contains the outline of DSOE.
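The batch structure of DSOE can be illustrated with the following single-machine Python sketch. Several details are assumptions made for the example, because the exact values are not recoverable from the text above: the initial value of f (half of |W|), the halving schedule, the loop condition |R| < k, and the final trimming of R to the vertices that tie with the k-th degree. Python's built-in `map` stands in for the distributed Spark map.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def dsoe(B: Sequence[Hashable], W: Sequence[Hashable],
         probe: Callable[[Hashable, Hashable], bool], k: int) -> Dict[Hashable, int]:
    """Batch-oriented sketch of DSOE; returns top-k vertices (with ties) and degrees."""
    n_w = len(W)
    pos = {b: 0 for b in B}   # next white vertex to probe, per black vertex
    deg = {b: 0 for b in B}   # solid edges found so far

    def routine(b: Hashable, f: int) -> Hashable:
        """Probe for b until f negative answers are seen or all probes are done."""
        negatives = 0
        while negatives < f and pos[b] < n_w:
            if probe(b, W[pos[b]]):
                deg[b] += 1
            else:
                negatives += 1
            pos[b] += 1
        return b

    def run_batch(active: List[Hashable], f: int, R: Dict[Hashable, int]) -> List[Hashable]:
        """One batch: map the routine over all active vertices, then collect finished ones."""
        processed = list(map(lambda b: routine(b, f), active))  # Spark map in the paper
        for b in processed:
            if pos[b] == n_w:
                R[b] = deg[b]
        return [b for b in processed if pos[b] < n_w]

    active: List[Hashable] = list(B)
    R: Dict[Hashable, int] = {}
    f = max(1, n_w // 2)              # assumption: initial number of allowed negatives
    while len(R) < k and active:
        active = run_batch(active, f, R)
        f = max(1, f // 2)            # assumption: exponential reduction by halving
    run_batch(active, f, R)           # one last batch to catch vertices tying with R
    if not R:
        return {}
    kth = sorted(R.values(), reverse=True)[min(k, len(R)) - 1]
    return {b: d for b, d in R.items() if d >= kth}


# Toy hidden graph (assumption): neighbors of each black vertex among W = [0, 1, 2, 3].
edges = {0: {0, 1, 2}, 1: {0}, 2: set()}
top = dsoe([0, 1, 2], [0, 1, 2, 3], lambda b, w: w in edges[b], k=1)
print(top)
```

Because every active vertex has accumulated the same number of negative answers at the end of a batch, a vertex that finishes in some batch cannot have a smaller degree than any vertex still active, which is why finished vertices can be moved to R safely.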

4.2 The DSOE* Algorithm

DSOE is the natural extension of the centralized SOE algorithm. However, we advance one step further in order to improve runtime as much as possible, even if the number of probes increases. For very large graphs, a very large number of probes is inevitable. For that reason, adding a very small number of probes to this already large number may be worthwhile if it improves the execution time and the overall performance of our algorithm. The updated algorithm is named DSOE*.

DSOE treats all vertices equally. In DSOE*, we want to make an initial prediction about the degree of the vertices and then, during the routines, to favor the vertices that we predict to have a large degree by applying a looser stopping condition to them than to those that we predict to have a small degree. The result of this process is that all possible edge-probing queries are executed for the vertices we predict to have a large degree. Knowing the exact degree of those vertices, we are able to use a better ending condition for the rest of the vertices. Therefore, we can stop processing the vertices with low degrees sooner and thus achieve better execution times. Moreover, for real-life graphs, the degrees of the vertices usually follow a power law distribution, so we expect that most vertices of the graph will have a much smaller degree than the top-k vertices.

More specifically, DSOE* initially performs some random edge-probing queries for every vertex. The number of these queries is set to . After this step, the course of DSOE* is quite similar to that of DSOE. We repeatedly execute a batch of routines, with the difference that this time a single routine for a vertex is complete after a number of negative edge-probing queries that is an outcome of the prediction performed by the sampling process, which quickly detects nodes that potentially have a large degree.

When a vertex exhausts all possible queries, it is added to a temporary set M. The vertices in the set M may not be the final answer, so we have to continue with the probing queries. However, the contents of M provide a threshold, which is set to the minimum degree among the vertices in M. Probing queries are then executed for the remaining vertices by the exhaust function, which uses this threshold as its stopping criterion. The vertices that have completed all possible probes are added to M. Finally, the best k vertices from M with respect to the degree are returned as the final answer. Algorithm 2 contains the pseudocode of DSOE*.

Function prediction(v: vertex, probes: int) : int is
       while counter < probes do
             execute a random edge-probing query for v
             if edge-probing query == True then
                   prediction ← prediction + 1
             counter ← counter + 1
       end while
       return prediction
end
Function routine(v: vertex) : void is
       while negatives is below the limit predicted for v and v has unexecuted probes do
             execute the next edge-probing query
             if edge-probing query == False then
                   negatives ← negatives + 1
       end while
end
Function exhaust(v: vertex, T: int) : void is
       while v can still reach degree T and v has unexecuted probes do
             execute the next edge-probing query
       end while
end

Result: Return a set R of vertices that have the highest degree
map(prediction(b, probes));
while |M| < k do
      map(routine(b));
      check which vertices b have probed all the vertices of W; each such vertex is removed from B and added to M;
end while
threshold = min(M);
map(exhaust(b, threshold));
check which vertices b have probed all the vertices of W; each such vertex is removed from B and added to M;
R = top-k from M;
return R;
Algorithm 2 DSOE*
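The following single-machine Python sketch illustrates the three phases described above: sampling-based prediction, batches of routines with a per-vertex limit, and the final exhaust step. It is a sketch under stated assumptions, not the authors' Spark implementation: the sample size, the rule that maps a prediction to a limit of negative answers (one plus the number of sampled hits), the loop condition |M| < k, and the stopping rule of `exhaust` (stop once the vertex can no longer reach the threshold) are all hypothetical choices made for the example.

```python
import random
from typing import Callable, Dict, Hashable, List, Sequence


def dsoe_star(B: Sequence[Hashable], W: Sequence[Hashable],
              probe: Callable[[Hashable, Hashable], bool], k: int,
              sample_size: int = 2, seed: int = 0) -> Dict[Hashable, int]:
    """Sketch of the second algorithm: predict, run batches, exhaust, return top-k of M."""
    rng = random.Random(seed)
    n_w = len(W)
    pos = {b: 0 for b in B}   # next white vertex to probe, per black vertex
    deg = {b: 0 for b in B}   # solid edges found by the sequential scan

    # Phase 1: prediction by random sampling (sampled probes are not reused later).
    limit = {}
    for b in B:
        picks = rng.sample(range(n_w), min(sample_size, n_w))
        hits = sum(1 for i in picks if probe(b, W[i]))
        limit[b] = 1 + hits   # assumption: looser limit for predicted high degrees

    def routine(b: Hashable) -> None:
        negatives = 0
        while negatives < limit[b] and pos[b] < n_w:
            if probe(b, W[pos[b]]):
                deg[b] += 1
            else:
                negatives += 1
            pos[b] += 1

    def exhaust(b: Hashable, threshold: int) -> None:
        # Keep probing while b's degree upper bound can still reach the threshold.
        while pos[b] < n_w and deg[b] + (n_w - pos[b]) >= threshold:
            if probe(b, W[pos[b]]):
                deg[b] += 1
            pos[b] += 1

    def collect(vertices: List[Hashable], M: Dict[Hashable, int]) -> List[Hashable]:
        for b in vertices:
            if pos[b] == n_w:
                M[b] = deg[b]
        return [b for b in vertices if pos[b] < n_w]

    # Phase 2: batches of routines until M holds at least k vertices.
    active: List[Hashable] = list(B)
    M: Dict[Hashable, int] = {}
    while len(M) < k and active:
        for b in active:          # a distributed map in the paper
            routine(b)
        active = collect(active, M)

    # Phase 3: exhaust the remaining vertices against the threshold min(M).
    if M:
        threshold = min(M.values())
        for b in active:
            exhaust(b, threshold)
        collect(active, M)

    ranked = sorted(M.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


# Toy hidden graph (assumption): neighbors of each black vertex among W = [0, 1, 2, 3].
edges = {0: {0, 1, 2}, 1: {0}, 2: set()}
best = dsoe_star([0, 1, 2], [0, 1, 2, 3], lambda b, w: w in edges[b], k=1)
print(best)
```

A vertex whose true degree is at least the threshold never fails the test in `exhaust`, so it always completes its probes and reaches M; this is why taking the top-k of M at the end is safe in this sketch.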

5 Performance Evaluation

In this section, we present performance evaluation results depicting the efficiency of the proposed distributed algorithms. All experiments have been conducted on a cluster of 32 physical nodes (machines) running Hadoop 2.7 and Spark 2.1.0. One node is used as the master, whereas the remaining 31 nodes are used as workers. The data resides in HDFS and YARN is used as the resource manager.

5.1 Datasets

All datasets used in performance evaluation correspond to real-world networks. The networks used have different numbers of black and white nodes as well as different numbers of edges. More specifically, we have used three networks: DBLP, YOUTUBE, and WIKI. All networks are publicly available at the Koblenz Network Collection, which is a project to collect large network datasets of different types, in order to assist research in network science and related fields. The collection is maintained by the Institute of Web Science and Technologies at the University of Koblenz–Landau and it is accessible at the following URL: http://konect.uni-koblenz.de/.

  • DBLP. The DBLP network is the authorship network from the DBLP computer science bibliography. The network is bipartite and a node is either an author or a publication. Each edge connects an author to one of the publications. The characteristics of this network are as follows: , , the highest degree for set is 114, the highest degree for set is 951 and the average degree for is 6.0660 and for is 2.1622.

  • YOUTUBE1. This is the bipartite network of YouTube users and their group memberships. Nodes correspond to users and groups, and an edge between a user and a group denotes a group membership. The properties of this network are as follows: and , the highest degree of set is , the highest degree of set is , the average degree for is and the average degree of is .

  • YOUTUBE2. This network represents friendship relationships between YouTube users. Nodes are users and an undirected edge between two nodes indicates a friendship. Although this graph is unipartite, we modified it in order to make it bipartite. We duplicate its vertices, so that one set is one clone of the nodes whereas the other set is the second clone. Two nodes from the two different sets are connected if their corresponding nodes are connected in the initial graph. Evidently, for this graph the statistics for the two sets and are exactly the same: , the maximum degree is for both sets and the average degree for both sets is

  • WIKIPEDIA. This is the bipartite network of English Wikipedia articles and the corresponding categories they are contained in. The first set of nodes corresponds to articles and the second corresponds to categories. For this graph we have , , the highest degree for is and for is , the average degree for is while for set is .
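The vertex-duplication step used for YOUTUBE2 can be illustrated with a short Python sketch. The function name `to_bipartite` and the tagging of the two clones with the labels "B" and "W" are assumptions made for this example.

```python
from collections import Counter
from typing import Hashable, Iterable, Set, Tuple

BipartiteEdge = Tuple[Tuple[str, Hashable], Tuple[str, Hashable]]


def to_bipartite(edges: Iterable[Tuple[Hashable, Hashable]]) -> Set[BipartiteEdge]:
    """Clone every node into a black and a white copy and mirror each undirected edge."""
    bipartite: Set[BipartiteEdge] = set()
    for u, v in edges:
        bipartite.add((("B", u), ("W", v)))   # black copy of u to white copy of v
        bipartite.add((("B", v), ("W", u)))   # black copy of v to white copy of u
    return bipartite


# Toy friendship graph (assumption): a path 1 - 2 - 3.
friendships = [(1, 2), (2, 3)]
bip = to_bipartite(friendships)
black_degree = Counter(b for b, _ in bip)
white_degree = Counter(w for _, w in bip)
print(sorted(black_degree.items()), sorted(white_degree.items()))
```

Each clone keeps the degree of the original node, which is why the two sets end up with identical statistics.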

5.2 Experimental Results

In the sequel, we present some representative experimental results demonstrating the performance of the proposed techniques. First, we compare DSOE and DSOE* with respect to runtime (i.e., the time needed to complete the computation in the cluster) by varying the number of running Spark executors. For this experiment, we start with 8 Spark executors and gradually increase their number, keeping k fixed. The DBLP graph has been used in this case. Figure 2(a) shows the runtime of both algorithms as the number of executors increases. It is evident that both algorithms are scalable, since there is a significant speedup as we double the number of executors. Moreover, DSOE shows better performance than DSOE.

(a) runtime vs number of executors (b) number of probes vs k
Figure 2: Comparison of DSOE and DSOE* with respect to runtime and number of probes.

Another important measure that we want to examine is the number of probes that both algorithms execute. For this experiment, the number of executors and the execution time are irrelevant, so we focus only on different values of k. Moreover, we run our algorithms on a significantly smaller dataset (YOUTUBE1) for different values of k (i.e., 1, 10, 100 and 1000). The corresponding results are given in Figure 2(b). The first observation is that both algorithms perform a significant number of probes. However, this was expected, taking into account that if the average node degree is very small or very large, then many probes are required before the algorithms can provide the answer, as has been shown in [9]. Also, in general DSOE requires fewer probes in comparison to DSOE for small values of k.

One more very interesting aspect that we examine is the performance of our approaches with respect to the size of the graph. Our motivation is to compare the impact of the two node sets on the execution time. For this reason we run DSOE on the DBLP graph twice: for the first execution we have and , and for the second execution we reverse the direction of the queries from to . This way, the graphs that we compare have exactly the same number of edges and vertices, but they differ significantly in the cardinality of the source and target sets. Figure 3(a) presents the corresponding results. It is observed that in the reverse case the runtime is significantly higher, since for every node in the source set there are more options to perform probes on the target set. In general, the cost of the algorithm drops if the cardinality of the source set is larger than that of the target set.

(a) runtime vs source set (b) runtime vs cardinality of B and W
Figure 3: Performance of DSOE with respect to the source nodes and the size of the node sets B and W.

The goal of the last experiment is to test the scalability of the algorithms by increasing the size of the data. First we focus on and then on . For this experiment the DBLP, YOUTUBE2 and WIKIPEDIA datasets have been used. These graphs have almost the same cardinality in one set and they differ in the cardinality of the other. For all tests we have used 64 Spark executors and the implementation of the DSOE algorithm with . The corresponding results are given in Figure 3(b). It is evident that although the cardinalities of both B and W have an impact on performance, execution time is more sensitive to the cardinality of when used as the source set.

6 Conclusions

In this work, we study for the first time the distributed detection of the top-k nodes with the highest degrees in a hidden bipartite graph. Since the set of edges is not available a priori, edge probing queries must be applied in order to explore the graph. We have designed two algorithms to attack the problem (DSOE and DSOE*) and evaluated their performance on real-world networks. In general the algorithms are scalable, showing good speedup factors as the number of Spark executors increases.

There are many interesting directions for future work. By studying the experimental results, one important observation is that the number of probes is in general large. Therefore, more research is required to be able to significantly reduce the number of probes preserving the good quality of the answer. Also, it is interesting to evaluate the performance of the algorithms in even larger networks containing significantly larger number of nodes and edges.

Acknowledgments

The authors would like to thank the DaSciM group of the LiX laboratory of Ecole Polytechnique, and specifically Prof. Michalis Vazirgiannis and Dr. Christos Giatsidis for sharing the cluster to conduct the experimental evaluation.

References

  • [1] C. C. Aggarwal and H. Wang (2010) Managing and mining graph data. Springer.
  • [2] N. Alon and V. Asodi (2005) Learning a hidden subgraph. SIAM Journal on Discrete Mathematics 18 (4), pp. 697–712.
  • [3] M. Bouvel, V. Grebinski, and G. Kucherov (2005) Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Revised Selected Papers, 31st International Workshop on Graph-Theoretic Concepts in Computer Science (WG), Metz, France, pp. 16–27.
  • [4] D. J. Cook and L. B. Holder (2006) Mining graph data. John Wiley & Sons, Inc., USA. ISBN 0471731900.
  • [5] O. Goldreich, S. Goldwasser, and D. Ron (1998) Property testing and its connection to learning and approximation. Journal of the ACM 45 (4), pp. 653–750.
  • [6] M. Odersky, L. Spoon, and B. Venners (2016) Programming in Scala: updated for Scala 2.12. 3rd edition, Artima Incorporation, USA. ISBN 0981531687, 9780981531687.
  • [7] G. Pavlopoulos, P. Kontou, A. Pavlopoulou, C. Bouyioukos, E. Markou, and P. Bagos (2018) Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience 7.
  • [8] P. Strouthopoulos and A. N. Papadopoulos (2017) Core discovery in hidden graphs. CoRR abs/1712.02827 (to appear in Data and Knowledge Engineering).
  • [9] Y. Tao, C. Sheng, and J. Li (2010) Finding maximum degrees in hidden bipartite graphs. In Proceedings ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, pp. 891–902.
  • [10] T. White (2015) Hadoop: the definitive guide. 4th edition, O’Reilly Media, Inc. ISBN 1491901632, 9781491901632.
  • [11] M. L. Yiu, E. Lo, and J. Wang (2013) Identifying the most connected vertices in hidden bipartite graphs using group testing. IEEE Transactions on Knowledge & Data Engineering 25 (10), pp. 2245–2256.
  • [12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2.
  • [13] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica (2016) Apache Spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65. ISSN 0001-0782.

5 Performance Evaluation

In this section, we present performance evaluation results depicting the efficiency of the proposed distributed algorithms. All experiments have been conducted in a cluster of 32 physical nodes (machines) running Hadoop 2.7 and Spark 2.1.0. One node is used as the master whereas the rest 31 nodes are used as workers. The data resides in HDFS and YARN is being used as the resource manager.

5.1 Datasets

All datasets used in performance evaluation correspond to real-world networks. The networks used have different number of black and white nodes as well as different number of edges. More specifically, we have used three networks: DBLP, YOUTUBE, and WIKI. All networks are publicly available at the Koblenz Network Collection, which is a project to collect large network datasets of different types, in order to assist research in network science and related fields. The collection is maintained by the Institute of Web Science and Technologies at the University of Koblenz–Landau and it accessible by the following URL: http://konect.uni-koblenz.de/.

  • DBLP. The DBLP network is the authorship network from the DBLP computer science bibliography. The network is bipartite and a node is either an author or a publication. Each edge connects an author to one of the publications. The characteristics of this network are as follows: , , the highest degree for set is 114, the highest degree for set is 951 and the average degree for is 6.0660 and for is 2.1622.

  • YOUTUBE1. This is the bipartite network of YouTube users and their group memberships. Nodes correspond to users and groups, and an edge between a user and a group denotes a group membership. The properties of this network are as follows: and , the highest degree of set is , the highest degree of set is , the average degree for is and the average degree of is .

  • YOUTUBE2. This network represents friendship relationships between YouTube users. Nodes are users and an undirected edge between two nodes indicates a friendship. Although this graph is unipartite we modified it in order to become bipartite. We duplicate it’s vertices and the set is one clone of the nodes whereas is the other. Two nodes between and are connected if their corresponding nodes are connected in the initial graph. Evidently, for this graph the statistics for the two sets and are exactly the same: , the maximum degree is for both sets and the average degree for both sets is

  • WIKIPEDIA. This is the bipartite network of English Wikipedia articles and the corresponding categories they are contained in. The first set of nodes corresponds to articles and the second corresponds to categories. For this graph we have , , the highest degree for is and for is , the average degree for is while for set is .
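The duplication step described for YOUTUBE2 can be illustrated with a minimal sketch. The labels "B" and "W", the function names and the toy friendship list are hypothetical and chosen only for illustration; this is not the preprocessing code used in the paper. The sketch also computes the maximum and average degree per side, which shows why the two sets end up with identical statistics.

```python
from collections import defaultdict

def to_bipartite(edges):
    """Turn an undirected unipartite edge list into a bipartite one.

    Each original node x gets two copies, ("B", x) and ("W", x).
    An original edge {x, y} becomes the edges (B:x, W:y) and (B:y, W:x).
    The side labels "B" and "W" are illustrative assumptions.
    """
    bip = set()
    for x, y in edges:
        bip.add((("B", x), ("W", y)))
        bip.add((("B", y), ("W", x)))
    return bip

def degree_stats(bip_edges, side):
    """Return (max degree, average degree) over the nodes of one side.

    Only nodes that appear in at least one edge are counted.
    """
    deg = defaultdict(int)
    for b, w in bip_edges:
        deg[b if side == "B" else w] += 1
    return max(deg.values()), sum(deg.values()) / len(deg)

# Toy friendship graph (hypothetical data).
friendships = [(1, 2), (1, 3), (1, 4), (2, 3)]
bip = to_bipartite(friendships)
stats_b = degree_stats(bip, "B")
stats_w = degree_stats(bip, "W")
print(len(bip), stats_b, stats_w)
```

Every undirected edge contributes one edge in each direction, so the bipartite graph has twice as many edges as the original and both sides inherit the degree sequence of the original graph.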

5.2 Experimental Results

In the sequel, we present some representative experimental results demonstrating the performance of the proposed techniques. First, we compare DSOE and DSOE with respect to runtime (i.e., the time needed to complete the computation in the cluster) by varying the number of running Spark executors. For this experiment, we start with 8 Spark executors and gradually increase their number, keeping k fixed. The DBLP graph has been used in this case. Figure 2(a) shows the runtime of both algorithms as the number of executors increases. It is evident that both algorithms are scalable, since there is a significant speedup each time we double the number of executors. Moreover, DSOE shows better performance than DSOE.

Figure 2: Comparison of DSOE and DSOE with respect to runtime and number of probes. (a) Runtime vs. number of executors; (b) number of probes vs. k.
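To make the notion of speedup used in this comparison concrete, the following minimal sketch computes speedup and parallel efficiency relative to the smallest executor configuration. The function name and the runtimes are hypothetical placeholders, not measurements from our experiments.

```python
def speedup_table(runtimes):
    """Map executor count -> (speedup, efficiency).

    Both values are relative to the configuration with the fewest executors.
    Efficiency is the speedup divided by the factor by which the
    executor count grew.
    """
    base_n = min(runtimes)
    base_t = runtimes[base_n]
    table = {}
    for n, t in sorted(runtimes.items()):
        speedup = base_t / t
        efficiency = speedup / (n / base_n)
        table[n] = (speedup, efficiency)
    return table

# Placeholder runtimes in seconds (hypothetical, NOT results from the paper).
example = {8: 400.0, 16: 200.0, 32: 125.0}
table = speedup_table(example)
for n, (s, e) in table.items():
    print(n, round(s, 2), round(e, 2))
```

An efficiency close to 1 when the number of executors doubles corresponds to near-linear scaling; lower values indicate overhead from coordination or data movement.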

Another important measure that we want to examine is the number of probes that both algorithms execute. For this experiment, the number of executors and the execution time are irrelevant, so we focus only on different values of k. Moreover, we run our algorithms on a significantly smaller dataset (YOUTUBE1) for different values of k (i.e., 1, 10, 100 and 1000). The corresponding results are given in Figure 2(b). The first observation is that both algorithms perform a large number of probes. However, this was expected, taking into account that if the average node degree is very small or very large, then many probes are required before the algorithms can provide the answer, as has been shown in [9]. Also, in general DSOE requires fewer probes in comparison to DSOE for small values of k.
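The way probes are counted can be sketched with a small wrapper around the probing function f(u, v). The example below uses a naive exhaustive baseline that probes every pair, not DSOE or its variant. The class, the function names and the toy edge set are hypothetical.

```python
import heapq

class CountingProbe:
    """Wraps an edge-probing function f(u, v) and counts its invocations."""

    def __init__(self, f):
        self.f = f
        self.count = 0

    def __call__(self, u, v):
        self.count += 1
        return self.f(u, v)

def exhaustive_top_k(sources, targets, probe, k):
    """Naive baseline: probe every (source, target) pair, keep the k highest degrees."""
    degrees = {u: sum(1 for v in targets if probe(u, v)) for u in sources}
    return heapq.nlargest(k, degrees.items(), key=lambda item: item[1])

# Toy hidden graph (hypothetical): the edge set is visible only through the probe.
hidden_edges = {("a", 1), ("a", 2), ("a", 3), ("b", 2), ("c", 1), ("c", 3)}
probe = CountingProbe(lambda u, v: (u, v) in hidden_edges)
top = exhaustive_top_k(["a", "b", "c"], [1, 2, 3, 4], probe, k=2)
print(top, probe.count)
```

The exhaustive baseline always issues one probe per pair of source and target nodes, so its probe count is the product of the two set sizes. This is the upper bound against which the probe counts of more selective algorithms can be compared.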

Another interesting aspect that we examine is the performance of our approaches with respect to the size of the graph. Our motivation is to compare the impact of the two node sets on the execution time. For this reason we run DSOE on the DBLP graph twice: for the first execution we have and and for the second execution we reverse the direction of the queries from to . This way, the graphs that we compare have exactly the same number of edges and vertices, but they differ significantly in the cardinality of and . Figure 3(a) presents the corresponding results. It is observed that in the reverse case the runtime is significantly higher, since for every node in there are more options to perform probes on . In general, the cost of the algorithm drops if the cardinality of is larger than that of .

Figure 3: Performance of DSOE with respect to the source nodes and the size of the node sets. (a) Runtime vs. source set; (b) runtime vs. cardinality of the node sets.

The goal of the last experiment is to test the scalability of the algorithms by increasing the size of the data. First we focus on and then on . For this experiment the DBLP, YOUTUBE2 and WIKIPEDIA datasets have been used. These graphs have almost the same cardinality in one set and differ in the cardinality of the other. For all tests we have used 64 Spark executors and the implementation of the DSOE algorithm with . The corresponding results are given in Figure 3(b). It is evident that although the cardinalities of both and have an impact on performance, execution time is more sensitive to the cardinality of when used as the source set.

6 Conclusions

In this work, we study for the first time the distributed detection of the top-k nodes with the highest degrees in a hidden bipartite graph. Since the set of edges is not available a priori, edge probing queries must be applied in order to explore the graph. We have designed two algorithms to attack the problem (DSOE and DSOE) and evaluated their performance on real-world networks. In general, the algorithms are scalable, showing good speedup factors as the number of Spark executors increases.

There are many interesting directions for future work. One important observation from the experimental results is that the number of probes is in general large. Therefore, more research is required to significantly reduce the number of probes while preserving the quality of the answer. Also, it is interesting to evaluate the performance of the algorithms on even larger networks containing a significantly larger number of nodes and edges.

Acknowledgments

The authors would like to thank the DaSciM group of the LiX laboratory of Ecole Polytechnique, and specifically Prof. Michalis Vazirgiannis and Dr. Christos Giatsidis for sharing the cluster to conduct the experimental evaluation.

References

  • [1] C. C. Aggarwal and H. Wang (2010) Managing and mining graph data. Springer. Cited by: §1.
  • [2] N. Alon and V. Asodi (2005) Learning a hidden subgraph. SIAM Journal on Discrete Mathematics 18 (4), pp. 697–712. Cited by: §2.
  • [3] M. Bouvel, V. Grebinski, and G. Kucherov (2005) Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Revised Selected Papers, 31st International Workshop on Graph-Theoretic Concepts in Computer Science (WG), Metz, France, pp. 16–27. Cited by: §2.
  • [4] D. J. Cook and L. B. Holder (2006) Mining graph data. John Wiley & Sons, Inc., USA. External Links: ISBN 0471731900 Cited by: §1.
  • [5] O. Goldreich, S. Goldwasser, and D. Ron (1998) Property testing and its connection to learning and approximation. Journal of the ACM 45 (4), pp. 653–750. Cited by: §2.
  • [6] M. Odersky, L. Spoon, and B. Venners (2016) Programming in scala: updated for scala 2.12. 3rd edition, Artima Incorporation, USA. External Links: ISBN 0981531687, 9780981531687 Cited by: §3.
  • [7] G. Pavlopoulos, P. Kontou, A. Pavlopoulou, C. Bouyioukos, E. Markou, and P. Bagos (2018-02) Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience 7, pp. . External Links: Document Cited by: §1.
  • [8] P. Strouthopoulos and A. N. Papadopoulos (2017) Core discovery in hidden graphs. CoRR abs/1712.02827 (to appear in Data and Knowledge Engineering). Cited by: §2.
  • [9] Y. Tao, C. Sheng, and J. Li (2010) Finding maximum degrees in hidden bipartite graphs. In Proceedings ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, pp. 891–902. Cited by: §1, §2, §2, §3, §4, §5.2.
  • [10] T. White (2015) Hadoop: the definitive guide. 4th edition, O’Reilly Media, Inc.. External Links: ISBN 1491901632, 9781491901632 Cited by: §1.
  • [11] M. L. Yiu, E. Lo, and J. Wang (2013) Identifying the most connected vertices in hidden bipartite graphs using group testing. IEEE Transactions on Knowledge & Data Engineering 25 (10), pp. 2245–2256. Cited by: §1, §2.
  • [12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2. External Links: Link Cited by: §3.
  • [13] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica (2016-10) Apache spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65. External Links: ISSN 0001-0782, Link, Document Cited by: §3.

6 Conclusions

In this work, we study for the first time the distributed detection of the top- nodes with the highest degrees in a hidden bipartite graph. Since the set of edges is not available a-priori, edge probing queries must be applied in order to be able to explore the graph. We have designed two algorithms to attack the problem (DSOE and DSOE) and evaluate their performance based on real-world networks. In general the algorithms are scalable, showing good speedup factors by increasing the number of Spark executors.

There are many interesting directions for future work. By studying the experimental results, one important observation is that the number of probes is in general large. Therefore, more research is required to be able to significantly reduce the number of probes preserving the good quality of the answer. Also, it is interesting to evaluate the performance of the algorithms in even larger networks containing significantly larger number of nodes and edges.

Acknowledgments

The authors would like to thank the DaSciM group of the LiX laboratory of Ecole Polytechnique, and specifically Prof. Michalis Vazirgiannis and Dr. Christos Giatsidis for sharing the cluster to conduct the experimental evaluation.

References

  • [1] C. C. Aggarwal and H. Wang (2010) Managing and mining graph data. Springer. Cited by: §1.
  • [2] N. Alon and V. Asodi (2005) Learning a hidden subgraph. SIAM Journal on Discrete Mathematics 18 (4), pp. 697–712. Cited by: §2.
  • [3] M. Bouvel, V. Grebinski, and G. Kucherov (2005) Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Revised Selected Papers, 31st International Workshop on Graph-Theoretic Concepts in Computer Science (WG), Metz, France, pp. 16–27. Cited by: §2.
  • [4] D. J. Cook and L. B. Holder (2006) Mining graph data. John Wiley & Sons, Inc., USA. External Links: ISBN 0471731900 Cited by: §1.
  • [5] O. Goldreich, S. Goldwasser, and D. Ron (1998) Property testing and its connection to learning and approximation. Journal of the ACM 45 (4), pp. 653–750. Cited by: §2.
  • [6] M. Odersky, L. Spoon, and B. Venners (2016) Programming in scala: updated for scala 2.12. 3rd edition, Artima Incorporation, USA. External Links: ISBN 0981531687, 9780981531687 Cited by: §3.
  • [7] G. Pavlopoulos, P. Kontou, A. Pavlopoulou, C. Bouyioukos, E. Markou, and P. Bagos (2018-02) Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience 7, pp. . External Links: Document Cited by: §1.
  • [8] P. Strouthopoulos and A. N. Papadopoulos (2017) Core discovery in hidden graphs.

    CoRR (to appear in Data and Knowledge Engineering)

    abs/1712.02827.
    Cited by: §2.
  • [9] Y. Tao, C. Sheng, and J. Li (2010) Finding maximum degrees in hidden bipartite graphs. In Proceedings ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, pp. 891–902. Cited by: §1, §2, §2, §3, §4, §5.2.
  • [10] T. White (2015) Hadoop: the definitive guide. 4th edition, O’Reilly Media, Inc.. External Links: ISBN 1491901632, 9781491901632 Cited by: §1.
  • [11] M. L. Yiu, E. Lo, and J. Wang (2013) Identifying the most connected vertices in hidden bipartite graphs using group testing. IEEE Transactions on Knowledge & Data Engineering 25 (10), pp. 2245–2256. Cited by: §1, §2.
  • [12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2. External Links: Link Cited by: §3.
  • [13] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica (2016-10) Apache spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65. External Links: ISSN 0001-0782, Link, Document Cited by: §3.

Acknowledgments

The authors would like to thank the DaSciM group of the LiX laboratory of Ecole Polytechnique, and specifically Prof. Michalis Vazirgiannis and Dr. Christos Giatsidis for sharing the cluster to conduct the experimental evaluation.

References

  • [1] C. C. Aggarwal and H. Wang (2010) Managing and mining graph data. Springer. Cited by: §1.
  • [2] N. Alon and V. Asodi (2005) Learning a hidden subgraph. SIAM Journal on Discrete Mathematics 18 (4), pp. 697–712. Cited by: §2.
  • [3] M. Bouvel, V. Grebinski, and G. Kucherov (2005) Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Revised Selected Papers, 31st International Workshop on Graph-Theoretic Concepts in Computer Science (WG), Metz, France, pp. 16–27. Cited by: §2.
  • [4] D. J. Cook and L. B. Holder (2006) Mining graph data. John Wiley & Sons, Inc., USA. External Links: ISBN 0471731900 Cited by: §1.
  • [5] O. Goldreich, S. Goldwasser, and D. Ron (1998) Property testing and its connection to learning and approximation. Journal of the ACM 45 (4), pp. 653–750. Cited by: §2.
  • [6] M. Odersky, L. Spoon, and B. Venners (2016) Programming in scala: updated for scala 2.12. 3rd edition, Artima Incorporation, USA. External Links: ISBN 0981531687, 9780981531687 Cited by: §3.
  • [7] G. Pavlopoulos, P. Kontou, A. Pavlopoulou, C. Bouyioukos, E. Markou, and P. Bagos (2018-02) Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience 7, pp. . External Links: Document Cited by: §1.
  • [8] P. Strouthopoulos and A. N. Papadopoulos (2017) Core discovery in hidden graphs.

    CoRR (to appear in Data and Knowledge Engineering)

    abs/1712.02827.
    Cited by: §2.
  • [9] Y. Tao, C. Sheng, and J. Li (2010) Finding maximum degrees in hidden bipartite graphs. In Proceedings ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, pp. 891–902. Cited by: §1, §2, §2, §3, §4, §5.2.
  • [10] T. White (2015) Hadoop: the definitive guide. 4th edition, O’Reilly Media, Inc.. External Links: ISBN 1491901632, 9781491901632 Cited by: §1.
  • [11] M. L. Yiu, E. Lo, and J. Wang (2013) Identifying the most connected vertices in hidden bipartite graphs using group testing. IEEE Transactions on Knowledge & Data Engineering 25 (10), pp. 2245–2256. Cited by: §1, §2.
  • [12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2. External Links: Link Cited by: §3.
  • [13] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica (2016-10) Apache spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65. External Links: ISSN 0001-0782, Link, Document Cited by: §3.

References

  • [1] C. C. Aggarwal and H. Wang (2010) Managing and mining graph data. Springer. Cited by: §1.
  • [2] N. Alon and V. Asodi (2005) Learning a hidden subgraph. SIAM Journal on Discrete Mathematics 18 (4), pp. 697–712. Cited by: §2.
  • [3] M. Bouvel, V. Grebinski, and G. Kucherov (2005) Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Revised Selected Papers, 31st International Workshop on Graph-Theoretic Concepts in Computer Science (WG), Metz, France, pp. 16–27. Cited by: §2.
  • [4] D. J. Cook and L. B. Holder (2006) Mining graph data. John Wiley & Sons, Inc., USA. External Links: ISBN 0471731900 Cited by: §1.
  • [5] O. Goldreich, S. Goldwasser, and D. Ron (1998) Property testing and its connection to learning and approximation. Journal of the ACM 45 (4), pp. 653–750. Cited by: §2.
  • [6] M. Odersky, L. Spoon, and B. Venners (2016) Programming in scala: updated for scala 2.12. 3rd edition, Artima Incorporation, USA. External Links: ISBN 0981531687, 9780981531687 Cited by: §3.
  • [7] G. Pavlopoulos, P. Kontou, A. Pavlopoulou, C. Bouyioukos, E. Markou, and P. Bagos (2018-02) Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience 7, pp. . External Links: Document Cited by: §1.
  • [8] P. Strouthopoulos and A. N. Papadopoulos (2017) Core discovery in hidden graphs.

    CoRR (to appear in Data and Knowledge Engineering)

    abs/1712.02827.
    Cited by: §2.
  • [9] Y. Tao, C. Sheng, and J. Li (2010) Finding maximum degrees in hidden bipartite graphs. In Proceedings ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, pp. 891–902. Cited by: §1, §2, §2, §3, §4, §5.2.
  • [10] T. White (2015) Hadoop: the definitive guide. 4th edition, O’Reilly Media, Inc.. External Links: ISBN 1491901632, 9781491901632 Cited by: §1.
  • [11] M. L. Yiu, E. Lo, and J. Wang (2013) Identifying the most connected vertices in hidden bipartite graphs using group testing. IEEE Transactions on Knowledge & Data Engineering 25 (10), pp. 2245–2256. Cited by: §1, §2.
  • [12] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2. External Links: Link Cited by: §3.
  • [13] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica (2016-10) Apache spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65. External Links: ISSN 0001-0782, Link, Document Cited by: §3.