1 Introduction
Nowadays, graphs are frequently used to model real-life problems in many diverse application domains, such as social network analysis, searching and mining the Web, and pattern mining in bioinformatics and neuroscience. Graph mining is an active and growing research area, aiming at knowledge discovery from graph data [4, 1]. In its simplest form, a graph $G$ is defined by a set of nodes (or vertices) $V$ and a set of edges (or links) $E$. Each edge connects a pair of nodes. In this work, we focus on simple bipartite graphs, where the set of nodes is composed of two subsets $B$ (the black nodes) and $W$ (the white nodes), such that $B \cup W = V$ and $B \cap W = \emptyset$. Moreover, all edges of $G$ connect nodes from $B$ with nodes of $W$, meaning that the two endpoints of an edge belong to different node subsets.
Bipartite graphs have many interesting applications in diverse fields. For example, a bipartite graph may be used to represent product purchases by customers. In this case, an edge exists between a product $p$ and a customer $c$ when $p$ was purchased by $c$. As another example, in an Information Retrieval or Text Mining application a bipartite graph may be used to associate different types of tokens that exist in a document. Thus, an edge between a document $d$ and a token $t$ represents the fact that the token $t$ appears in document $d$. Moreover, there are many applications of bipartite graphs in systems biology and medicine [7].
Figure 1 depicts a simple bipartite graph containing seven nodes and five edges. The black node with the largest number of neighbors has three of them. One of the nodes is isolated, i.e., it does not have any neighbors.
Motivation and Contributions. Conventional bipartite graphs are characterized by the fact that the sets of vertices $B$ and $W$ as well as the set of edges $E$ are known in advance. Nodes and edges are organized in a way that enables efficient execution of fundamental graph-oriented computational tasks, such as finding nodes with large degree, computing shortest paths, clustering and community detection. Usually, the adjacency list representation is used, which is a good compromise between space requirements and computational efficiency. However, a concept that has started to gain significant interest recently is that of hidden graphs. In contrast to a conventional graph, a hidden bipartite graph is defined as $G = (B, W, f())$, where $B$ and $W$ are the subsets of nodes and $f()$ is a function which takes as input two vertex identifiers and returns true if the edge exists and false otherwise. Therefore, in a hidden graph the edge set $E$ is not given explicitly and it is inferred by using the function $f()$.
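To make the definition concrete, the following minimal Python sketch models a hidden bipartite graph as two known vertex lists plus a probe function. The class name `HiddenBipartiteGraph`, the `probes` counter and the toy divisibility rule are assumptions made for illustration only; they are not taken from the implementation evaluated in this paper.

```python
from typing import Callable, Iterable


class HiddenBipartiteGraph:
    """Vertex sets are known; edges are only reachable through probe()."""

    def __init__(self, black: Iterable, white: Iterable,
                 f: Callable[[object, object], bool]) -> None:
        self.black = list(black)
        self.white = list(white)
        self._f = f          # possibly expensive edge predicate
        self.probes = 0      # total edge-probing queries issued so far

    def probe(self, b, w) -> bool:
        """Issue one edge-probing query f(b, w) and count it."""
        self.probes += 1
        return bool(self._f(b, w))


# Hypothetical toy rule: black vertex b is linked to white vertex w
# when w divides b. Any other predicate could be plugged in instead.
g = HiddenBipartiteGraph(range(1, 7), [2, 3], lambda b, w: b % w == 0)
```

Keeping the counter inside the graph object makes it easy to compare strategies by the number of probes they issue, which is the cost measure used throughout the paper.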
Hidden graphs are very powerful because they are able to represent any type of relationship among a set of entities. Formally, for $n$ nodes, there is an exponential number of different graphs that may be generated (the exact number is $2^{\binom{n}{2}}$). Materializing all these different graphs would demand significant space. Moreover, it is unlikely that all these graphs will be used eventually. In contrast, the use of a hidden graph enables the representation of any possible graph by the specification of the appropriate function $f()$. It is noted that the function $f()$ may require the execution of a complex algorithm in order to decide if two nodes are connected by an edge. Therefore, interesting what-if scenarios may be tested with respect to the satisfaction of meaningful properties.
It is evident that the complete graph structure may be revealed if all possible edge probing queries $f(u, v)$ are executed, where $u \in B$ and $v \in W$. However, this solution involves the execution of a quadratic number of probes, which is expected to be highly inefficient, taking into account that the function $f()$ is computationally intensive and that real-world graphs are usually very large. Therefore, the target is to provide a solution to a graph-oriented problem by using as few edge probing queries as possible.
The problem we are targeting is the discovery of the $k$ nodes with the highest degrees. The degree of a node $v$ is defined as the number of neighbors of $v$. This problem has been addressed previously by [9] and [11].
In this work, we are interested in solving the problem by using distributed algorithms over Big Data architectures. This enables the analysis of massive hidden graphs and provides a baseline for executing more complex graph mining tasks, thus overcoming the limitations imposed by centralized approaches. In particular, we study distributed algorithms for the discovery of the top-$k$ degree nodes in Apache Spark over YARN and HDFS [10] and we offer experimental results based on a cluster of 32 physical machines. Performance evaluation results demonstrate that the proposed techniques are scalable and can be used for the analysis of massive hidden networks. We note that this is the first work addressing the problem of hidden graph analysis in a distributed setting.
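As a point of reference, the exhaustive strategy mentioned above can be sketched in a few lines of Python. The function name `brute_force_top_k` and the toy divisibility predicate are hypothetical; the sketch only illustrates why the probe count is the product of the two set sizes.

```python
import heapq


def brute_force_top_k(black, white, f, k):
    """Probe every (b, w) pair, then keep the k largest degrees."""
    degrees = {b: sum(1 for w in white if f(b, w)) for b in black}
    probes = len(black) * len(white)   # one probe per pair
    top = heapq.nlargest(k, degrees.items(), key=lambda item: item[1])
    return top, probes


def divides(b, w):
    """Toy edge predicate (an assumption): w divides b."""
    return b % w == 0


top, probes = brute_force_top_k(list(range(1, 7)), [2, 3], divides, k=1)
```

On this toy input every one of the 6 x 2 pairs is probed, which is exactly the cost that the algorithms discussed later try to avoid.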
Roadmap. The rest of the paper is organized as follows. The next section contains the most representative related work in the area. Section 3 presents fundamental concepts and the problem definition. Section 4 studies our main contribution in detail. Performance evaluation results are offered in Section 5, whereas Section 6 concludes our work and briefly presents interesting directions for future work in the area.
2 Related Research
A hidden graph is able to represent arbitrary relationship types among graph nodes. A research direction that is strongly related to hidden graphs is graph property testing [5]. In this case, one is interested in detecting if a graph satisfies a property or not by using fast approximate algorithms.
Evidently, detecting graph properties efficiently is extremely important. However, an even more useful and more challenging task is to detect specific subgraphs or nodes that satisfy specific properties, by using as few edge probing queries as possible.
Another research direction related to hidden graphs focuses on learning a graph or a subgraph by using edge probing queries that involve pairs or sets of nodes (group testing) [2]. A similar topic is the reconstruction of subgraphs that satisfy certain structural properties [3].
An important difference between graph property testing and analyzing hidden graphs is that in the first case the graph is known, whereas in the second case the set of edges is unknown and must be revealed gradually. Moreover, property testing focuses on checking if a specific property holds or not, whereas in hidden graph analysis we are interested in answering specific queries exactly. The problem we attack is the discovery of the top-$k$ nodes with the highest degrees. This problem was solved by [9] and [11] assuming a centralized environment.
The basic algorithm used in [9] was extended in [8] for unipartite graphs, towards detecting if a hidden graph contains a core or not. Again, the algorithm proposed in [8] is centralized and its main objective is to minimize the number of edge probing queries performed.
In this paper, we take one step forward towards the design and performance evaluation of distributed algorithms for detecting the top-$k$ nodes with the highest degrees in hidden bipartite graphs, aiming at the reduction of the overall computational cost, which depends on both the number of edge probing queries and the level of parallelism used. We note that this is the first work in the literature to attack the specific problem in a distributed setting.
3 Fundamental Concepts and Problem Definition
In this section, we present some fundamental concepts related to our research and state the problem formally. Recall that the detection of the top-$k$ nodes with the highest degrees in a bipartite graph has been addressed in [9] from a centralized perspective. In that paper, the authors present the SwitchOnEmpty (SOE) algorithm, which shows the best performance and provides an optimal solution with respect to the number of edge probing queries that are required. Table 1 presents the most frequently used symbols.
Table 1: Frequently used symbols.
Symbol  Interpretation
$G$  a hidden graph
$B$  set of black vertices of $G$
$W$  set of white vertices of $G$
$f(u, v)$  true if the edge $(u, v)$ exists, false otherwise
$|B|$, $|W|$  number of black and white vertices
$k$  number of highest degree vertices requested
  set of neighbors of a vertex
  degree of a vertex
$E$  (unknown) set of edges
  (unknown) number of edges
  number of known existing neighbors of a vertex
  number of known nonexisting neighbors of a vertex
  total number of edge probing queries issued
Before diving into the details of our proposal, we briefly describe the way the SOE algorithm works. SOE receives as input a hidden bipartite graph, with bipartitions $B$ and $W$. The output of SOE is composed of the $k$ vertices from $B$ or $W$ with the highest degrees. Without loss of generality, we focus on vertices of $B$. Edge probing queries are executed as follows:

Initially, SOE starts from a vertex $b \in B$, selects a vertex $w \in W$ and executes $f(b, w)$. If the edge is solid (i.e., it exists), it continues to perform probes between $b$ and another vertex of $W$.

Upon a failure, i.e., when the probe returns an empty result, the algorithm applies the same process to another vertex of $B$. Vertices for which all the probes have been applied do not participate in future edge probes.

A round is complete when all vertices of $B$ have been considered. After each round, some vertices can be safely included in the result set and they are removed from $B$. When a vertex must be considered again, we continue the execution of probes, remembering the location of the last failure.

SOE keeps on performing rounds until the upper bound of vertex degrees in $B$ is less than the current $k$-th highest degree determined so far. In that case, the result set contains the required answer and the algorithm terminates. Note that all equal-degree vertices are included in the result set.
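The steps above can be sketched in Python as follows. This is a simplified, single-machine reading of the description, not the implementation of [9]: the function name `soe_top_k`, the bookkeeping dictionaries and the toy predicate are assumptions. The upper bound of an unfinished vertex is taken as the number of white vertices minus its known empty probes.

```python
def soe_top_k(black, white, f, k):
    """Round-based probing that switches vertex on every empty probe."""
    n_w = len(white)
    pos = {b: 0 for b in black}     # next white index to probe (last failure + 1)
    solid = {b: 0 for b in black}   # known existing neighbors
    empty = {b: 0 for b in black}   # known non-existing neighbors
    active, done, probes = list(black), {}, 0
    while active:
        still_active = []
        for b in active:            # one round over the remaining vertices
            while pos[b] < n_w:
                hit = f(b, white[pos[b]])
                probes += 1
                pos[b] += 1
                if hit:
                    solid[b] += 1
                else:
                    empty[b] += 1
                    break           # switch on empty
            if pos[b] == n_w:
                done[b] = solid[b]  # all probes applied: degree is exact
            else:
                still_active.append(b)
        active = still_active
        if len(done) >= k:
            kth = sorted(done.values(), reverse=True)[k - 1]
            # stop once no unfinished vertex can still reach the k-th degree
            if all(n_w - empty[b] < kth for b in active):
                break
    if not done:
        return [], probes
    kth = sorted(done.values(), reverse=True)[min(k, len(done)) - 1]
    result = sorted(((b, d) for b, d in done.items() if d >= kth),
                    key=lambda item: (-item[1], item[0]))
    return result, probes


def divides(b, w):
    """Toy edge predicate (an assumption): w divides b."""
    return b % w == 0


result, probes = soe_top_k(list(range(1, 7)), [2, 3], divides, k=1)
```

On this toy input the sketch stops after a single round with 9 probes, whereas probing all pairs would need 12. Vertices tied with the $k$-th degree are all returned, matching the note on equal-degree vertices.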
In this work, we proceed one step further in order to detect high-degree nodes in larger graphs using multiple resources. In this respect, our algorithms are implemented in the Apache Spark engine [12, 13] using the Scala programming language [6]. Spark offers a powerful environment that enables the execution of distributed applications on massive amounts of data. Apache Spark is a unified distributed engine with a rich and powerful API for Scala, Python, Java and R [13]. One of its main characteristics is that (in contrast to Hadoop MapReduce) it exploits main memory as much as possible, being able to persist data across rounds to avoid unnecessary I/O operations. Spark jobs are executed based on a master-slave model in cooperation with a cluster manager such as YARN (http://hadoop.apache.org/) or Mesos (http://mesos.apache.org/).
4 Proposed Approach
In the following sections, we focus on the algorithms developed for the distributed detection of the top-$k$ nodes in a hidden bipartite graph. Two algorithms are presented that are both inspired by the SOE (SwitchOnEmpty) algorithm [9]. The first algorithm, Distributed SwitchOnEmpty (DSOE), is a distributed version of the SOE algorithm. The second algorithm, DSOE*, is an improved and more flexible version of DSOE.
4.1 The DSOE Algorithm
Our first solution is very simple but at the same time very effective. The main mechanism has the following rationale. For a vertex $v$ we execute edge-probing queries until we get a certain number $f$ of negative results. The value of $f$ depends on the size of the graph and is reduced exponentially from one batch of routines to the next. The idea behind this is that, because the degree distribution of most real-life graphs follows a power law, we do not expect to find many vertices with a high degree. The closer we get to the point of exhausting all probes, the smaller we set $f$, because we want to avoid as many unnecessary queries as possible. The pseudocode of the routine is given below.
Function routine(v: vertex, f: Int) : void is
The routine can be implemented easily to work in a distributed setting. After each batch of routines, we check for the vertices that belong to the answer set. Vertices that belong to the answer set can be recognized easily, since vertices that have completed all possible probes should be in the answer. If , then we need one last batch of routines to finalize the answer set. In this part we will prove that our conditions for the loop and for the vertices that belong in the answer set are correct. For a vertex $v$, if $v$ completes all possible probes from the first routine, then there are only two options for its degree: either or , because there is a possibility that the last query could be negative and $v$ could still have exhausted all possible probes. Assume that while we execute the loop we are in the repeat. Also we assume that after we get a vertex that has exhausted all possible edge-probing queries. For again we have two possibilities, either or , for the same reason as previously. From the previous observations we can safely conclude that for a vertex that has completed all possible queries in , the best-case scenario for the degree is and the worst-case scenario is . It is evident that and the equality is possible only if . From this conclusion we are sure that if then if a vertex exhausts all the queries, it can be inserted into the answer set. Taking the equality into account, we have to run the batch of routines one last time, for the possibility that we may find a vertex with degree equal to the minimum degree in the answer set. Algorithm 1 contains the outline of DSOE.
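The batch mechanism can be illustrated with the following Python sketch, which runs on a single machine and uses a plain list comprehension where a Spark job would map over a distributed collection of vertex states. Several parts are assumptions of this sketch and not the paper's Algorithm 1: the names `VertexState`, `routine` and `dsoe_sketch`, the starting value of $f$ (half the number of white vertices), the halving schedule, and the stopping test, which is the conservative SOE-style upper-bound check rather than the loop condition proved above.

```python
from dataclasses import dataclass


@dataclass
class VertexState:
    vertex: int
    pos: int = 0      # next white index to probe
    solid: int = 0    # known existing neighbors
    empty: int = 0    # known non-existing neighbors


def routine(state, white, probe, f):
    """Probe for one vertex until f new negative answers or exhaustion."""
    negatives = 0
    while state.pos < len(white) and negatives < f:
        if probe(state.vertex, white[state.pos]):
            state.solid += 1
        else:
            state.empty += 1
            negatives += 1
        state.pos += 1
    return state


def dsoe_sketch(black, white, probe, k):
    f = max(1, len(white) // 2)          # assumed initial threshold
    active = [VertexState(b) for b in black]
    done = {}
    while active:
        # One batch of routines; in Spark this would be a distributed map.
        active = [routine(s, white, probe, f) for s in active]
        for s in active:
            if s.pos == len(white):
                done[s.vertex] = s.solid
        active = [s for s in active if s.pos < len(white)]
        if len(done) >= k:
            kth = sorted(done.values(), reverse=True)[k - 1]
            # Assumed stopping test: no unfinished vertex can reach kth.
            if all(len(white) - s.empty < kth for s in active):
                break
        f = max(1, f // 2)               # assumed halving schedule
    if not done:
        return []
    kth = sorted(done.values(), reverse=True)[min(k, len(done)) - 1]
    return sorted(((v, d) for v, d in done.items() if d >= kth),
                  key=lambda item: (-item[1], item[0]))


def divides(b, w):
    """Toy edge predicate (an assumption): w divides b."""
    return b % w == 0


top1 = dsoe_sketch(list(range(1, 13)), [2, 3, 4, 5], divides, k=1)
```

Because each routine touches only the state of its own vertex, the batch is embarrassingly parallel; the only coordination point is the check performed between batches.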
4.2 The DSOE* Algorithm
DSOE is the natural extension of the centralized SOE algorithm. However, we advance one step further in order to improve runtime as much as possible, even if the number of probes increases. For very large graphs we already expect a very large and inevitable number of probes. For that reason, adding a small number of probes to the already large number of inevitable probes may be worthwhile if it improves the execution time and the overall performance of our algorithm. The updated algorithm is named DSOE*.
DSOE treats all vertices equally, so in DSOE* we want to make an initial prediction about the degree of the vertices and, after that, to emphasize during the routines the vertices that we predict to have a large degree, by applying a looser condition to them than to those that we predict to have a small degree. The result of this process is that all possible edge-probing queries are executed for the vertices we predict to have a large degree. With the information of the exact degree of those vertices we are able to have a better ending condition for the rest of the vertices. Therefore, we are able to stop processing the vertices with low degrees faster, and this way we obtain better execution times. Moreover, for real-life graphs, the degrees of the vertices usually follow a power law distribution, so we expect that most vertices of the graph will have a much smaller degree than the top vertices.
More specifically, DSOE* initially performs some random edge-probing queries. The number of these queries is set to . After this step, the course of DSOE* is quite similar to that of DSOE. We repeatedly execute a batch of routines, with the difference that this time a single routine for a vertex is complete after a number of negative edge-probing queries that is an outcome of the prediction performed by the sampling process, which quickly detects nodes that potentially have large degree.
When a vertex exhausts all possible queries, it is added to a temporary set. The vertices in this set may not be completely correct, so we have to continue with the probing queries. However, the contents of the temporary set provide a threshold, which is set to the minimum degree among its vertices. Probing queries are executed , until . The vertices that have completed all possible probes are collected and, finally, the best $k$ of them with respect to the degree are returned as the final answer. Algorithm 2 contains the pseudocode of DSOE*.
Function prediction(v: vertex,probes: int) : int is
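The sampling step can be illustrated with the following Python sketch. It only shows one plausible estimator: the name `predict_degree`, the scaling of the hit ratio to the number of white vertices, and the use of sampling without replacement are assumptions of this sketch, and the way the estimate is turned into a per-vertex stopping threshold is deliberately left out.

```python
import random


def predict_degree(v, white, probe, samples, rng):
    """Estimate the degree of v from a few random edge-probing queries."""
    pool = list(white)
    sample = rng.sample(pool, min(samples, len(pool)))
    if not sample:
        return 0
    hits = sum(1 for w in sample if probe(v, w))
    # Scale the observed hit ratio to the whole white set (assumed rule).
    return round(len(pool) * hits / len(sample))


rng = random.Random(7)                       # fixed seed for repeatability
white = list(range(100))
dense = predict_degree(0, white, lambda v, w: True, samples=10, rng=rng)
sparse = predict_degree(0, white, lambda v, w: False, samples=10, rng=rng)
```

A vertex whose sampled probes all succeed is predicted to be connected to every white vertex, while one with no hits is predicted to be isolated; intermediate cases scale linearly with the hit ratio.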
5 Performance Evaluation
In this section, we present performance evaluation results depicting the efficiency of the proposed distributed algorithms. All experiments have been conducted in a cluster of 32 physical nodes (machines) running Hadoop 2.7 and Spark 2.1.0. One node is used as the master, whereas the remaining 31 nodes are used as workers. The data resides in HDFS and YARN is used as the resource manager.
5.1 Datasets
All datasets used in performance evaluation correspond to real-world networks. The networks used have different numbers of black and white nodes as well as different numbers of edges. More specifically, we have used three networks: DBLP, YOUTUBE, and WIKI. All networks are publicly available at the Koblenz Network Collection, which is a project to collect large network datasets of different types, in order to assist research in network science and related fields. The collection is maintained by the Institute of Web Science and Technologies at the University of Koblenz–Landau and it is accessible at the following URL: http://konect.uni-koblenz.de/.

DBLP. The DBLP network is the authorship network from the DBLP computer science bibliography. The network is bipartite and a node is either an author or a publication. Each edge connects an author to one of the publications. The characteristics of this network are as follows: , , the highest degree for set is 114, the highest degree for set is 951 and the average degree for is 6.0660 and for is 2.1622.

YOUTUBE1. This is the bipartite network of YouTube users and their group memberships. Nodes correspond to users and groups, and an edge between a user and a group denotes a group membership. The properties of this network are as follows: and , the highest degree of set is , the highest degree of set is , the average degree for is and the average degree of is .

YOUTUBE2. This network represents friendship relationships between YouTube users. Nodes are users and an undirected edge between two nodes indicates a friendship. Although this graph is unipartite, we modified it in order to become bipartite. We duplicate its vertices, so that the set $B$ is one clone of the nodes whereas $W$ is the other. A node of $B$ and a node of $W$ are connected if their corresponding nodes are connected in the initial graph. Evidently, for this graph the statistics for the two sets $B$ and $W$ are exactly the same: , the maximum degree is for both sets and the average degree for both sets is

WIKIPEDIA. This is the bipartite network of English Wikipedia articles and the corresponding categories they are contained in. The first set of nodes corresponds to articles and the second corresponds to categories. For this graph we have , , the highest degree for is and for is , the average degree for is while for set is .
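The unipartite-to-bipartite conversion used for YOUTUBE2 can be sketched in Python as follows. The function names `to_bipartite` and `degrees`, the `("B", u)` / `("W", u)` tagging of the two clones and the three-node friendship list are hypothetical and serve only to illustrate the construction.

```python
def to_bipartite(edges):
    """Clone every node into a black and a white copy.

    Each undirected edge {u, v} yields the bipartite edges
    (u_B, v_W) and (v_B, u_W).
    """
    bipartite = set()
    for u, v in edges:
        bipartite.add((("B", u), ("W", v)))
        bipartite.add((("B", v), ("W", u)))
    return bipartite


def degrees(bipartite, side):
    """Degree of every node on the given side ('B' or 'W')."""
    index = 0 if side == "B" else 1
    counts = {}
    for edge in bipartite:
        node = edge[index][1]
        counts[node] = counts.get(node, 0) + 1
    return counts


friendships = [(1, 2), (2, 3)]      # toy unipartite graph (an assumption)
clone = to_bipartite(friendships)
```

Each clone keeps the degree its node had in the original graph, which is why the statistics of the two sets coincide.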
5.2 Experimental Results
In the sequel, we present some representative experimental results demonstrating the performance of the proposed techniques. First, we compare DSOE and DSOE* with respect to runtime (i.e., the time needed to complete the computation in the cluster) while varying the number of Spark executors. For this experiment, we start with 8 Spark executors and gradually increase their number, keeping $k$ fixed. The DBLP graph has been used in this case. Figure 2(a) shows the runtime of both algorithms as the number of executors increases. It is evident that both algorithms are scalable, since there is a significant speedup as we double the number of executors. Moreover, DSOE shows better performance than DSOE.
Another important measure that we want to examine is the number of probes that both algorithms execute. For this experiment, the number of executors and the execution time are irrelevant, so we focus only on different values of $k$. Moreover, we run our algorithms on a significantly smaller dataset (YOUTUBE1) for different values of $k$ (i.e., 1, 10, 100 and 1000). The corresponding results are given in Figure 2(b). The first observation is that both algorithms perform a significant number of probes. However, this was expected, taking into account that if the average node degree is very small or very large, then many probes are required before the algorithms can provide the answer, as has been shown in [9]. Also, in general DSOE requires fewer probes in comparison to DSOE for small values of $k$.
One more interesting aspect that we examine is the performance of our approaches with respect to the size of the graph. Our motivation is to compare the impact of the sets $B$ and $W$ on the execution time. For this reason we run DSOE on the DBLP graph twice: the first execution uses the two vertex sets as given, and for the second execution we reverse the direction of the queries, so that the two sets swap roles. This way the graphs that we compare have exactly the same number of edges and vertices, but they differ significantly in the cardinalities of the source and target sets. Figure 3(a) presents the corresponding results. It is observed that in the reverse case the runtime is significantly higher, since every source node has more target nodes on which to perform probes. In general, the cost of the algorithm drops if the cardinality of the source set is larger than that of the target set.
The goal of the last experiment is to test the scalability of the algorithms by increasing the size of the data. First we focus on one vertex set and then on the other. For this experiment the DBLP, YOUTUBE2 and WIKIPEDIA datasets have been used. These graphs have almost the same cardinality in one set and they differ in the cardinality of the other. For all tests we have used 64 Spark executors and the implementation of the DSOE algorithm with a fixed value of $k$. The corresponding results are given in Figure 3(b). It is evident that although the cardinalities of both $B$ and $W$ have an impact on performance, execution time is more sensitive to the cardinality of the set that is used as the source set.
6 Conclusions
In this work, we study for the first time the distributed detection of the top-$k$ nodes with the highest degrees in a hidden bipartite graph. Since the set of edges is not available a priori, edge probing queries must be applied in order to explore the graph. We have designed two algorithms to attack the problem (DSOE and DSOE*) and evaluated their performance on real-world networks. In general the algorithms are scalable, showing good speedup factors as the number of Spark executors increases.
There are many interesting directions for future work. By studying the experimental results, one important observation is that the number of probes is in general large. Therefore, more research is required to significantly reduce the number of probes while preserving the good quality of the answer. Also, it is interesting to evaluate the performance of the algorithms on even larger networks containing a significantly larger number of nodes and edges.
Acknowledgments
The authors would like to thank the DaSciM group of the LiX laboratory of Ecole Polytechnique, and specifically Prof. Michalis Vazirgiannis and Dr. Christos Giatsidis for sharing the cluster to conduct the experimental evaluation.
References
 [1] (2010) Managing and mining graph data. Springer.
 [2] (2005) Learning a hidden subgraph. SIAM Journal on Discrete Mathematics 18 (4), pp. 697–712.
 [3] (2005) Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Revised Selected Papers, 31st International Workshop on Graph-Theoretic Concepts in Computer Science (WG), Metz, France, pp. 16–27.
 [4] (2006) Mining graph data. John Wiley & Sons, Inc., USA. ISBN 0471731900.
 [5] (1998) Property testing and its connection to learning and approximation. Journal of the ACM 45 (4), pp. 653–750.
 [6] (2016) Programming in Scala: updated for Scala 2.12. 3rd edition, Artima Incorporation, USA. ISBN 0981531687, 9780981531687.
 [7] (2018) Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience 7.
 [8] (2017) Core discovery in hidden graphs. CoRR abs/1712.02827 (to appear in Data and Knowledge Engineering).
 [9] (2010) Finding maximum degrees in hidden bipartite graphs. In Proceedings ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, pp. 891–902.
 [10] (2015) Hadoop: the definitive guide. 4th edition, O’Reilly Media, Inc. ISBN 1491901632, 9781491901632.
 [11] (2013) Identifying the most connected vertices in hidden bipartite graphs using group testing. IEEE Transactions on Knowledge & Data Engineering 25 (10), pp. 2245–2256.
 [12] (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2.
 [13] (2016) Apache Spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65.
5 Performance Evaluation
In this section, we present performance evaluation results depicting the efficiency of the proposed distributed algorithms. All experiments have been conducted in a cluster of 32 physical nodes (machines) running Hadoop 2.7 and Spark 2.1.0. One node is used as the master whereas the rest 31 nodes are used as workers. The data resides in HDFS and YARN is being used as the resource manager.
5.1 Datasets
All datasets used in performance evaluation correspond to realworld networks. The networks used have different number of black and white nodes as well as different number of edges. More specifically, we have used three networks: DBLP, YOUTUBE, and WIKI. All networks are publicly available at the Koblenz Network Collection, which is a project to collect large network datasets of different types, in order to assist research in network science and related fields. The collection is maintained by the Institute of Web Science and Technologies at the University of Koblenz–Landau and it accessible by the following URL: http://konect.unikoblenz.de/.

DBLP. The DBLP network is the authorship network from the DBLP computer science bibliography. The network is bipartite and a node is either an author or a publication. Each edge connects an author to one of the publications. The characteristics of this network are as follows: , , the highest degree for set is 114, the highest degree for set is 951 and the average degree for is 6.0660 and for is 2.1622.

YOUTUBE1. This is the bipartite network of YouTube users and their group memberships. Nodes correspond to users and groups, and an edge between a user and a group denotes a group membership. The properties of this network are as follows: and , the highest degree of set is , the highest degree of set is , the average degree for is and the average degree of is .

YOUTUBE2. This network represents friendship relationships between YouTube users. Nodes are users and an undirected edge between two nodes indicates a friendship. Although this graph is unipartite we modified it in order to become bipartite. We duplicate it’s vertices and the set is one clone of the nodes whereas is the other. Two nodes between and are connected if their corresponding nodes are connected in the initial graph. Evidently, for this graph the statistics for the two sets and are exactly the same: , the maximum degree is for both sets and the average degree for both sets is

WIKIPEDIA. This is the bipartite network of English Wikipedia articles and the corresponding categories they are contained in. The first set of nodes corresponds to articles and the second corresponds to categories. For this graph we have , , the highest degree for is and for is , the average degree for is while for set is .
5.2 Experimental Results
In the sequel, we present some representative experimental results demonstrating the performance of the proposed techniques. First, we perform a comparison of DSOE and DSOE with respect to runtime (i.e., time duration to complete the computation in the cluster) by modifying the number of Spark executors running. For this experiment, we start with 8 Spark executors and gradually we keep on increasing their number keeping . The DBLP graph has been used in this case. Figure 2(a) shows the runtime of both algorithms by increasing the number of executors. It is evident that both algorithms are scalable, since there is a significant speedup as we double the number of executors. Moreover, DSOE shows better performance than DSOE.
Another important measure that we want to examine is the number of probes that both algorithms execute. For this experiment, the number of executors and the execution time are irrelevant, so we focus only on different values of . Moreover we are running our algorithms in a significantly smaller dataset (YOUTUBE1) for different values of (i.e., 1, 10, 100 and 1000). The corresponding results are given in Figure 2(b). The first observation is that both algorithms perform a significant amount of probes. However, this was expected taking into account that if the average node degree is very small or very large, then many probes are required before the algorithms can provide the answer, as it has been shown in [9]. Also, in general DSOE requires less probes in comparison to DSOE for small values of .
One more very interesting aspect that we examine is the performance of our approaches with respect to the size of the graph. Our motivation is to compare the impact of sets and to the execution time. For this reason we will use DSOE on the DBLP graph twice: for the first execution we have and and the for the second time we reverse the direction of the queries from to . This way the graphs that we compare have the exact same number of edges and vertices but the differ significantly in the cardinality of and . Figure 3(a) presents the corresponding results. It is observed that in the reverse case the runtime is significantly higher, since for every node in there are more options to perform probes on . In general, the cost of the algorithm drops if the cardinality if is larger than that of .
The goal of the last experiment is to test the scalability of the algorithms by increasing the size of the data. First we focus on and then on . For this experiment DBLP, YOUTUBE2 and WIKIPEDIA datasets have been used. These graphs have almost the same cardinality in one set and they differ on the cardinality of the other. For all tests we have used 64 Spark executors and the implementation of the DSOE algorithm with . The corresponding results are given in Figure 3(b). It is evident that although the cardinalities of both and have an impact on peformance, execution time is more sensitive on the cardinality of when used as the source set.
6 Conclusions
In this work, we study for the first time the distributed detection of the top nodes with the highest degrees in a hidden bipartite graph. Since the set of edges is not available apriori, edge probing queries must be applied in order to be able to explore the graph. We have designed two algorithms to attack the problem (DSOE and DSOE) and evaluate their performance based on realworld networks. In general the algorithms are scalable, showing good speedup factors by increasing the number of Spark executors.
There are many interesting directions for future work. By studying the experimental results, one important observation is that the number of probes is in general large. Therefore, more research is required to be able to significantly reduce the number of probes preserving the good quality of the answer. Also, it is interesting to evaluate the performance of the algorithms in even larger networks containing significantly larger number of nodes and edges.
Acknowledgments
The authors would like to thank the DaSciM group of the LiX laboratory of Ecole Polytechnique, and specifically Prof. Michalis Vazirgiannis and Dr. Christos Giatsidis for sharing the cluster to conduct the experimental evaluation.
References
[1] (2010) Managing and mining graph data. Springer. Cited by: §1.
[2] (2005) Learning a hidden subgraph. SIAM Journal on Discrete Mathematics 18 (4), pp. 697–712. Cited by: §2.
[3] (2005) Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Revised Selected Papers, 31st International Workshop on Graph-Theoretic Concepts in Computer Science (WG), Metz, France, pp. 16–27. Cited by: §2.
[4] (2006) Mining graph data. John Wiley & Sons, Inc., USA. ISBN 0471731900. Cited by: §1.
[5] (1998) Property testing and its connection to learning and approximation. Journal of the ACM 45 (4), pp. 653–750. Cited by: §2.
[6] (2016) Programming in Scala: updated for Scala 2.12. 3rd edition, Artima Incorporation, USA. ISBN 0981531687, 9780981531687. Cited by: §3.
[7] (2018-02) Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience 7. Cited by: §1.
[8] (2017) Core discovery in hidden graphs. CoRR abs/1712.02827 (to appear in Data and Knowledge Engineering). Cited by: §2.
[9] (2010) Finding maximum degrees in hidden bipartite graphs. In Proceedings ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, pp. 891–902. Cited by: §1, §2, §2, §3, §4, §5.2.
[10] (2015) Hadoop: the definitive guide. 4th edition, O’Reilly Media, Inc. ISBN 1491901632, 9781491901632. Cited by: §1.
[11] (2013) Identifying the most connected vertices in hidden bipartite graphs using group testing. IEEE Transactions on Knowledge & Data Engineering 25 (10), pp. 2245–2256. Cited by: §1, §2.
[12] (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2. Cited by: §3.
[13] (2016-10) Apache Spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65. ISSN 0001-0782. Cited by: §3.
6 Conclusions
In this work, we study for the first time the distributed detection of the top nodes with the highest degrees in a hidden bipartite graph. Since the set of edges is not available apriori, edge probing queries must be applied in order to be able to explore the graph. We have designed two algorithms to attack the problem (DSOE and DSOE) and evaluate their performance based on realworld networks. In general the algorithms are scalable, showing good speedup factors by increasing the number of Spark executors.
There are many interesting directions for future work. By studying the experimental results, one important observation is that the number of probes is in general large. Therefore, more research is required to be able to significantly reduce the number of probes preserving the good quality of the answer. Also, it is interesting to evaluate the performance of the algorithms in even larger networks containing significantly larger number of nodes and edges.
Acknowledgments
The authors would like to thank the DaSciM group of the LiX laboratory of Ecole Polytechnique, and specifically Prof. Michalis Vazirgiannis and Dr. Christos Giatsidis for sharing the cluster to conduct the experimental evaluation.
References
 [1] (2010) Managing and mining graph data. Springer. Cited by: §1.
 [2] (2005) Learning a hidden subgraph. SIAM Journal on Discrete Mathematics 18 (4), pp. 697–712. Cited by: §2.
 [3] (2005) Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Revised Selected Papers, 31st International Workshop on GraphTheoretic Concepts in Computer Science (WG), Metz, France, pp. 16–27. Cited by: §2.
 [4] (2006) Mining graph data. John Wiley & Sons, Inc., USA. External Links: ISBN 0471731900 Cited by: §1.
 [5] (1998) Property testing and its connection to learning and approximation. Journal of the ACM 45 (4), pp. 653–750. Cited by: §2.
 [6] (2016) Programming in scala: updated for scala 2.12. 3rd edition, Artima Incorporation, USA. External Links: ISBN 0981531687, 9780981531687 Cited by: §3.
 [7] (201802) Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience 7, pp. . External Links: Document Cited by: §1.

[8]
(2017)
Core discovery in hidden graphs.
CoRR (to appear in Data and Knowledge Engineering)
abs/1712.02827. Cited by: §2.  [9] (2010) Finding maximum degrees in hidden bipartite graphs. In Proceedings ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, pp. 891–902. Cited by: §1, §2, §2, §3, §4, §5.2.
 [10] (2015) Hadoop: the definitive guide. 4th edition, O’Reilly Media, Inc.. External Links: ISBN 1491901632, 9781491901632 Cited by: §1.
 [11] (2013) Identifying the most connected vertices in hidden bipartite graphs using group testing. IEEE Transactions on Knowledge & Data Engineering 25 (10), pp. 2245–2256. Cited by: §1, §2.
 [12] (2012) Resilient distributed datasets: a faulttolerant abstraction for inmemory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2. External Links: Link Cited by: §3.
 [13] (201610) Apache spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65. External Links: ISSN 00010782, Link, Document Cited by: §3.
Acknowledgments
The authors would like to thank the DaSciM group of the LiX laboratory of Ecole Polytechnique, and specifically Prof. Michalis Vazirgiannis and Dr. Christos Giatsidis for sharing the cluster to conduct the experimental evaluation.
References
 [1] (2010) Managing and mining graph data. Springer. Cited by: §1.
 [2] (2005) Learning a hidden subgraph. SIAM Journal on Discrete Mathematics 18 (4), pp. 697–712. Cited by: §2.
 [3] (2005) Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Revised Selected Papers, 31st International Workshop on GraphTheoretic Concepts in Computer Science (WG), Metz, France, pp. 16–27. Cited by: §2.
 [4] (2006) Mining graph data. John Wiley & Sons, Inc., USA. External Links: ISBN 0471731900 Cited by: §1.
 [5] (1998) Property testing and its connection to learning and approximation. Journal of the ACM 45 (4), pp. 653–750. Cited by: §2.
 [6] (2016) Programming in scala: updated for scala 2.12. 3rd edition, Artima Incorporation, USA. External Links: ISBN 0981531687, 9780981531687 Cited by: §3.
 [7] (201802) Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience 7, pp. . External Links: Document Cited by: §1.

[8]
(2017)
Core discovery in hidden graphs.
CoRR (to appear in Data and Knowledge Engineering)
abs/1712.02827. Cited by: §2.  [9] (2010) Finding maximum degrees in hidden bipartite graphs. In Proceedings ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, pp. 891–902. Cited by: §1, §2, §2, §3, §4, §5.2.
 [10] (2015) Hadoop: the definitive guide. 4th edition, O’Reilly Media, Inc.. External Links: ISBN 1491901632, 9781491901632 Cited by: §1.
 [11] (2013) Identifying the most connected vertices in hidden bipartite graphs using group testing. IEEE Transactions on Knowledge & Data Engineering 25 (10), pp. 2245–2256. Cited by: §1, §2.
 [12] (2012) Resilient distributed datasets: a faulttolerant abstraction for inmemory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2. External Links: Link Cited by: §3.
 [13] (201610) Apache spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65. External Links: ISSN 00010782, Link, Document Cited by: §3.
References
 [1] (2010) Managing and mining graph data. Springer. Cited by: §1.
 [2] (2005) Learning a hidden subgraph. SIAM Journal on Discrete Mathematics 18 (4), pp. 697–712. Cited by: §2.
 [3] (2005) Combinatorial search on graphs motivated by bioinformatics applications: A brief survey. In Revised Selected Papers, 31st International Workshop on GraphTheoretic Concepts in Computer Science (WG), Metz, France, pp. 16–27. Cited by: §2.
 [4] (2006) Mining graph data. John Wiley & Sons, Inc., USA. External Links: ISBN 0471731900 Cited by: §1.
 [5] (1998) Property testing and its connection to learning and approximation. Journal of the ACM 45 (4), pp. 653–750. Cited by: §2.
 [6] (2016) Programming in scala: updated for scala 2.12. 3rd edition, Artima Incorporation, USA. External Links: ISBN 0981531687, 9780981531687 Cited by: §3.
 [7] (201802) Bipartite graphs in systems biology and medicine: a survey of methods and applications. GigaScience 7, pp. . External Links: Document Cited by: §1.

[8]
(2017)
Core discovery in hidden graphs.
CoRR (to appear in Data and Knowledge Engineering)
abs/1712.02827. Cited by: §2.  [9] (2010) Finding maximum degrees in hidden bipartite graphs. In Proceedings ACM International Conference on Management of Data (SIGMOD), Indianapolis, IN, pp. 891–902. Cited by: §1, §2, §2, §3, §4, §5.2.
 [10] (2015) Hadoop: the definitive guide. 4th edition, O’Reilly Media, Inc.. External Links: ISBN 1491901632, 9781491901632 Cited by: §1.
 [11] (2013) Identifying the most connected vertices in hidden bipartite graphs using group testing. IEEE Transactions on Knowledge & Data Engineering 25 (10), pp. 2245–2256. Cited by: §1, §2.
 [12] (2012) Resilient distributed datasets: a faulttolerant abstraction for inmemory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, pp. 2–2. External Links: Link Cited by: §3.
 [13] (201610) Apache spark: a unified engine for big data processing. Communications of the ACM 59 (11), pp. 56–65. External Links: ISSN 00010782, Link, Document Cited by: §3.