Sampling is used to determine a subset of a given dataset that retains certain properties but allows more efficient data analysis. For graph sampling it is necessary to retain not only general characteristics of the original data but also the structural information. Graph sampling is especially important for the efficient processing and analysis of large graphs such as social networks (Leskovec and Faloutsos, 2006; Wang et al., 2011). Furthermore, sampling is often needed to allow the effective visualization of large graphs.
Our contribution in this paper is to outline the distributed implementation of known graph sampling algorithms for improved scalability to large graphs as well as their evaluation. The sampling approaches are added as operators to the open-source distributed graph analysis platform Gradoop (http://www.gradoop.com) (Junghanns et al., 2016, 2018) and used for interactive graph visualization (Rostami et al., 2019). Our distributed sampling algorithms are, like Gradoop, based on the dataflow execution framework Apache Flink, but the implementation would be similar for Apache Spark. The evaluation for different graphs considers the runtime scalability as well as the quality of sampling regarding retained graph properties and the similarity of graph visualizations.
This paper is structured as follows: We briefly discuss related work in Section 2 and provide background information on graph sampling in Section 3. In Section 4, we explain the distributed implementation of four sampling algorithms with Apache Flink. Section 5 describes the evaluation results before we conclude in Section 6.
2. Related Work
Several previous publications address graph sampling algorithms but mostly without considering their distributed implementation. Hu and Lau (Hu and Lau, 2013) survey different graph sampling algorithms and their evaluations. However, many of these algorithms cannot be applied to large graphs due to their complexity. Leskovec and Faloutsos (Leskovec and Faloutsos, 2006) analyze sampling algorithms for large graphs but do not discuss distributed or parallel approaches. Wang et al. (Wang et al., 2011) focus on sampling algorithms for social networks but again without considering distributed approaches.
The only work about distributed graph sampling we are aware of is a recent paper by Zhang et al. (Zhang et al., 2018) for implementations based on Apache Spark. In contrast to our work, they do not evaluate the speedup behavior for different cluster sizes and the scalability to different data volumes. Our study also includes a distributed implementation and evaluation of random walk sampling.
We first introduce basic definitions of a graph sample and a graph sample algorithm. Afterwards, we specify some basic sampling algorithms and outline important graph metrics for both the visual and the metric-based comparison in the evaluation section.
3.1. Graph Sampling
A directed graph G = (V, E) can be used to express the interactions of users of a social network. A user can be denoted as a vertex v ∈ V, and a relationship between two users u and v can be denoted as a directed edge (u, v) ∈ E.
Since popular social networks such as Facebook and Twitter contain billions of users and trillions of relationships, the resulting graph is too big for both visualization and analytical tasks. A common approach to reduce the size of the graph is to use graph sampling to scale down the information contained in the original graph.
Definition 1 (Graph Sample) A graph G' = (V', E') is a sampled graph (or graph sample) of graph G = (V, E) iff the following three constraints are met: V' ⊆ V, E' ⊆ E, and ∀(u, v) ∈ E' : u, v ∈ V'. (In existing publications there are different approaches toward vertices with zero degree in the sampled graph. Within this work we choose the approach to remove all zero-degree vertices from the sampled graph.)
Definition 2 (Graph Sample Algorithm) A graph sample algorithm is a function sample : G → G' from a graph G to a sampled graph G', in which the set of vertices and edges is reduced until a given threshold s ∈ [0, 1] is reached. s is called the sample size and defines the ratio of vertices (or edges) the graph sample contains compared to the original graph.
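For intuition, the constraints of Definition 1 can be checked mechanically. The following sketch (plain Python with names of our own choosing; Gradoop itself is implemented in Java on Apache Flink) verifies that a candidate sample satisfies V' ⊆ V, E' ⊆ E, edge closure, and our convention of removing zero-degree vertices:

```python
def is_graph_sample(graph, sample):
    """Check Definition 1: V' ⊆ V, E' ⊆ E, every sampled edge connects
    sampled vertices, and (our convention) no zero-degree vertices."""
    v, e = graph
    vs, es = sample
    endpoints = {x for a, b in es for x in (a, b)}
    return (vs <= v and es <= e
            and all(a in vs and b in vs for a, b in es)
            and vs <= endpoints)          # no vertex without an edge

graph = ({1, 2, 3, 4}, {(1, 2), (2, 3), (3, 4)})
print(is_graph_sample(graph, ({1, 2}, {(1, 2)})))     # True
print(is_graph_sample(graph, ({1, 2, 4}, {(1, 2)})))  # False: vertex 4 has degree zero
```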
A graph sample is considered to be fruitful if it can represent many properties of the original graph. For example, if we want to find dense communities, like a group of friends in a social network, the sampled graph is only worthwhile if it preserves these relations as much as possible. We evaluate this concept by comparing some predefined graph properties on both original and sampled graphs.
3.2. Basic Graph Sampling Algorithms
Many graph sampling algorithms have already been investigated but we will limit ourselves to four basic approaches in this paper: random vertex sampling, random edge sampling, neighborhood sampling, and random walk sampling.
Random vertex sampling is the most straightforward sampling approach: it uniformly samples the graph by selecting a subset of vertices and their corresponding edges based on the selected sample size s. For the distributed implementation in a shared-nothing approach, the information of the whole graph is not always available in every node. Therefore, we approximate the sample size by selecting each vertex with probability s. This approach is also applied to the edges in random edge sampling.
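A minimal sketch of this probabilistic selection (illustrative Python with hypothetical names, not Gradoop's actual Flink code): each vertex decides locally with probability s, so no worker needs global knowledge, and the realized sample size is only an estimate of s·|V|:

```python
import random

def random_vertex_sample(vertices, edges, s, seed=0):
    """Keep each vertex independently with probability s (a purely local
    decision), then keep an edge iff both of its endpoints survived."""
    rng = random.Random(seed)
    vs = {v for v in vertices if rng.random() <= s}
    es = {(a, b) for a, b in edges if a in vs and b in vs}
    return vs, es

vs, _ = random_vertex_sample(range(100_000), [], s=0.1)
print(len(vs) / 100_000)  # close to 0.1, but not exactly 0.1
```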
The idea of random neighborhood sampling is to improve topological locality over the simple random vertex approach: whenever a vertex is chosen for the sampled graph, all its neighbors are added to the sampled graph as well. Optionally, only incoming or outgoing edges can be taken into account to select the neighbors of a vertex.
For random walk sampling, one or more vertices are randomly selected as start vertices. From each start vertex, we follow a randomly selected outgoing edge to its neighbor. If a vertex has no outgoing edges or if all of its edges were already followed, we jump to another randomly chosen vertex in the graph and continue the walk there. To avoid getting stuck in dense areas of the graph, we also added a probability to jump to another random vertex instead of following an outgoing edge. This process continues until the desired number of vertices has been visited and the sample size has been met. All visited vertices and all edges whose source and target vertices were visited will be part of the graph sample result.
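The walk logic just described can be sketched as follows (a single-walker, in-memory Python illustration with hypothetical names; the distributed Pregel-based version is described in Section 4):

```python
import random

def random_walk_sample(vertices, edges, n, jump_p=0.1, seed=0):
    """Single-walker sketch: follow a random untraversed outgoing edge,
    or jump (with probability jump_p, or when stuck) to a random vertex,
    until n vertices have been visited."""
    rng = random.Random(seed)
    vertices = list(vertices)
    out = {}
    for a, b in edges:
        out.setdefault(a, []).append(b)
    current = rng.choice(vertices)
    visited = {current}
    traversed = set()
    while len(visited) < n:
        candidates = [(current, b) for b in out.get(current, [])
                      if (current, b) not in traversed]
        if not candidates or rng.random() < jump_p:
            current = rng.choice(vertices)          # jump
        else:
            edge = rng.choice(candidates)           # walk
            traversed.add(edge)
            current = edge[1]
        visited.add(current)
    # keep only edges between visited vertices
    es = {(a, b) for a, b in edges if a in visited and b in visited}
    return visited, es

ring = [(i, (i + 1) % 10) for i in range(10)]
sampled_v, sampled_e = random_walk_sample(range(10), ring, n=5)
print(len(sampled_v))  # 5
```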
3.3. Important Graph Metrics
As mentioned before, we present an evaluation based on a visual and a metric-based comparison of the original graph and the sampled one. The following set of graph metrics will be evaluated in Section 5.
Cardinality of the vertex and edge set, denoted as |V| and |E|.
Graph density δ: The ratio of all actually existing edges to all possible edges in the directed graph, defined by δ = |E| / (|V| · (|V| − 1)).
Number of triangles T: The number of three-element subsets of vertices in the graph which are fully connected (triangle or closed triple).
Global clustering coefficient C: The ratio of three times the number of triangles to the number of all connected triples in the graph (see (Boccaletti et al., 2006)), defined by C = 3T / #triples.
Average local clustering coefficient C_l: The local clustering coefficient of a vertex is the ratio of the edges that actually connect its neighbors to the maximum possible number of edges between those neighbors. With a value between 0 and 1, it describes how close the vertex and its neighborhood are to a clique (see (Watts and Strogatz, 1998)). We compute the average local clustering coefficient over all vertices in the graph.
Number of weakly connected components W: A maximal subgraph of a graph in which every two vertices are connected by a path (ignoring edge directions) is called a weakly connected component. The number of such components in a graph is the target metric.
The average, minimum and maximum vertex degree in the graph, denoted as d_avg, d_min and d_max.
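To make the metrics above concrete, the following sketch computes density, triangle count, global clustering coefficient and degree bounds on a small undirected example graph (illustrative only: the evaluated graphs are directed and far larger, and the undirected density formula 2|E| / (|V|(|V| − 1)) is used here):

```python
from itertools import combinations

def metrics(vertices, edges):
    """Compute density, triangles, global clustering coefficient and
    min/max degree for a small undirected graph given as edge pairs."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n, m = len(vertices), len(edges)
    density = 2 * m / (n * (n - 1))          # undirected density
    triangles = sum(1 for a, b, c in combinations(sorted(vertices), 3)
                    if b in adj[a] and c in adj[a] and c in adj[b])
    # one connected triple per pair of neighbors of a center vertex
    triples = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    c_global = 3 * triangles / triples if triples else 0.0
    degrees = [len(adj[v]) for v in vertices]
    return density, triangles, c_global, min(degrees), max(degrees)

# A triangle {1, 2, 3} plus a pendant vertex 4:
print(metrics({1, 2, 3, 4}, {(1, 2), (2, 3), (1, 3), (3, 4)}))
# density 0.667, 1 triangle, C = 0.6, degrees between 1 and 3
```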
The goals of the distributed implementation of graph sampling are to achieve fast execution and good scalability for large graphs with up to billions of vertices and edges. We therefore want to utilize the parallel processing capabilities of shared-nothing clusters and, specifically, distributed dataflow systems such as Apache Spark (Zaharia et al., 2012) and Apache Flink (Carbone et al., 2015). In contrast to the older MapReduce approach, these frameworks offer a wider range of transformations and keep data in main memory between the execution of operations. Our implementations are based on Apache Flink but can be easily transferred to Apache Spark. We first give a brief introduction to the programming concepts of the distributed dataflow model. We then outline the implementation of our sampling operators.
4.1. Distributed Dataflow Model
The processing of data that exceeds the computing power or storage of a single computer can be handled through the use of distributed dataflow systems. Therein the data is processed simultaneously on shared-nothing commodity cluster nodes. Although details vary for different frameworks, they are designed to implement parallel data-centric workflows, with datasets and primitive transformations as two fundamental programming abstractions. A dataset represents a typed collection partitioned over a cluster. A transformation is a deterministic operator that transforms the elements of one or two datasets into a new dataset. A typical distributed program consists of chained transformations that form a dataflow. A scheduler breaks each dataflow job into a directed acyclic execution graph, where the nodes are working threads and edges are input and output dependencies between them. Each thread can be executed concurrently on an associated dataset partition in the cluster without sharing memory.
Transformations can be distinguished into unary and binary operators, depending on the number of input datasets. Table 1 shows some common transformations of both types that are relevant for this work. The filter transformation applies a user-defined predicate function to each element of the input dataset; if the predicate evaluates to true, the element becomes part of the output. Another simple transformation is map. It applies a user-defined map function to each element of the input dataset and returns exactly one element, guaranteeing a one-to-one relation to the output dataset. A transformation processing a group instead of a single element is reduce, where the input as well as the output are key-value pairs. All elements inside a group share the same key. The transformation applies a user-defined function to each group of elements and aggregates them into a single output pair. A common binary transformation is join. It creates pairs of elements from two input datasets which have equal values on defined keys. A user-defined join function is applied to each pair and produces exactly one output element.
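The semantics of these four transformations can be illustrated with plain-Python stand-ins (lists instead of partitioned datasets; the names are ours and do not match the Flink API):

```python
from collections import defaultdict

def filter_t(dataset, pred):        # keep elements where pred is true
    return [x for x in dataset if pred(x)]

def map_t(dataset, fn):             # exactly one output per input element
    return [fn(x) for x in dataset]

def reduce_t(dataset, fn):          # dataset of (key, value) pairs
    groups = defaultdict(list)
    for k, v in dataset:
        groups[k].append(v)
    return [(k, fn(vs)) for k, vs in groups.items()]

def join_t(left, right, fn):        # pair up elements with equal keys
    index = defaultdict(list)
    for k, v in right:
        index[k].append(v)
    return [fn(lv, rv) for k, lv in left for rv in index[k]]

# Example: out-degrees of a tiny edge dataset via map + reduce.
edges = [("a", "b"), ("a", "c"), ("b", "c")]
out_degrees = reduce_t(map_t(edges, lambda e: (e[0], 1)), sum)
print(sorted(out_degrees))  # [('a', 2), ('b', 1)]
```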
(I/O : input/output datasets, A/B : domains)
4.2. Sampling Operators
The operators for graph sampling compute a subgraph by randomly selecting either a subset of vertices or a subset of edges. In addition, neighborhood information or graph traversal can be used. The computation uses a series of transformations on the input graph, which is stored in two datasets, one for vertices and one for edges. For each sampling operator, a filter is applied to the output graph's vertex dataset to remove all zero-degree vertices, following the definition of a graph sample in Section 3.
4.2.1. Random Vertex (Rv) and Random Edge (Re) Sampling
The input for the RV and RE operator is an input graph G and a sample size s. For RV, a filter operator is applied to the vertex dataset of the input graph. A vertex will be kept if a generated random value is less than or equal to s. An edge will be kept if its source and target vertices occur in the dataset of the remaining vertices.
RE works the other way around: a filter transformation is applied to the edge dataset of the input graph. An edge will be kept, again, if the generated random value is less than or equal to s. A vertex will be kept if it is either the source or the target vertex of a remaining edge. The dataflows of the RV and RE operators can be seen in Figure 2.
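A compact sketch of the RE dataflow (illustrative Python with hypothetical names; the actual operator chains Flink transformations over distributed datasets):

```python
import random

def random_edge_sample(vertices, edges, s, seed=0):
    """RE sketch: filter the edge dataset with probability s, then keep a
    vertex iff it is the source or target of a surviving edge, so
    zero-degree vertices drop out automatically."""
    rng = random.Random(seed)
    es = [e for e in edges if rng.random() <= s]        # filter on edges
    vs = {x for a, b in es for x in (a, b)}             # endpoints only
    return vs & set(vertices), es

vs, es = random_edge_sample([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4)], s=1.0)
print(vs, len(es))  # {1, 2, 3, 4} 3
```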
4.2.2. Random Vertex Neighborhood (Rvn) Sampling
This approach is similar to the RV operator but also adds the direct neighbors of a sampled vertex to the sample. The selection of the neighbors can be restricted according to the direction of the connecting edge (incoming, outgoing or both). In the implementation, randomly selected vertices of the input vertex dataset are marked as sampled with a boolean flag. As for RV, we select a vertex by setting the flag to true if a generated random value is less than or equal to the given sample size, and to false otherwise. In a second step, the marked vertices are joined with the input edge dataset, transforming each edge into a tuple containing the edge itself and the boolean flags of its source and target vertex. A filter operator is applied to the edge tuples, retaining only those edges whose source or target vertices were sampled and which match the given neighborhood relation. This relation is either a neighbor on an incoming edge of a sampled vertex, a neighbor on an outgoing edge, or both. The dataflow from an input logical graph to a sampled graph is illustrated in Figure 4.
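The three steps (flagging, edge annotation, direction-aware filtering) can be sketched as follows (plain Python with hypothetical names, not the actual Flink implementation):

```python
import random

def random_vertex_neighborhood_sample(vertices, edges, s,
                                      direction="both", seed=0):
    """RVN sketch: flag vertices with probability s (map), annotate each
    edge with its endpoints' flags (join), then filter by the requested
    neighborhood direction."""
    rng = random.Random(seed)
    flag = {v: rng.random() <= s for v in vertices}
    annotated = [((a, b), flag[a], flag[b]) for a, b in edges]
    keep = {"out": lambda src, tgt: src,     # neighbor via outgoing edge
            "in": lambda src, tgt: tgt,      # neighbor via incoming edge
            "both": lambda src, tgt: src or tgt}[direction]
    es = [e for e, src, tgt in annotated if keep(src, tgt)]
    vs = {x for a, b in es for x in (a, b)}
    return vs, es

print(random_vertex_neighborhood_sample([1, 2, 3], [(1, 2), (2, 3)], s=1.0))
# ({1, 2, 3}, [(1, 2), (2, 3)])
```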
4.2.3. Random Walk (Rw) Sampling
This approach uses a random walk algorithm to walk over the vertices and edges of the input graph. Each visited vertex and each edge connecting two visited vertices will then be returned as the sampled graph. Figure 4 shows the dataflow from an input graph to a sampled graph for this operator. At the beginning we transform the input graph into a specific Gelly format. We use Gelly, the graph API of Apache Flink, which provides an implementation of Google's Pregel model (Malewicz et al., 2010), to implement the random walk algorithm.
Pregel utilizes the bulk-synchronous-parallel (Valiant, 1990) paradigm to realize the vertex-centric programming model. An iteration in a vertex-centric program is called a superstep. During a superstep, each vertex of the graph can compute a new state in a compute function. In a message function, each vertex is able to prepare messages for other vertices. At the end of each superstep, the workers of the cluster exchange the prepared messages during a synchronization barrier. In our operator we consider a message from one vertex to one of its neighbors a 'walk'. A message to any other vertex is considered a 'jump'.
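The superstep mechanics can be illustrated with a minimal sequential simulation (plain Python of our own devising; real Pregel/Gelly distributes the vertices across workers and runs the compute calls in parallel):

```python
# A minimal bulk-synchronous-parallel loop: in each superstep every
# vertex computes a new state from its inbox; outgoing messages are
# only delivered at the synchronization barrier between supersteps.
def run_supersteps(graph, compute, initial, max_supersteps):
    state = dict(initial)
    inbox = {v: [] for v in graph}
    for step in range(max_supersteps):
        outbox = {v: [] for v in graph}
        for v in graph:                      # conceptually parallel
            state[v], msgs = compute(v, state[v], inbox[v], step)
            for target, payload in msgs:
                outbox[target].append(payload)
        if not any(outbox.values()):         # no messages: converged
            break
        inbox = outbox                       # synchronization barrier
    return state

# Example: propagate the minimum vertex id through a directed ring.
graph = {0: [1], 1: [2], 2: [0]}

def compute(v, value, messages, step):
    new = min([value] + messages)
    if step == 0 or new < value:             # changed: notify neighbors
        return new, [(n, new) for n in graph[v]]
    return new, []

print(run_supersteps(graph, compute, {0: 0, 1: 1, 2: 2}, 10))
# {0: 0, 1: 0, 2: 0}
```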
At the beginning of the random walk algorithm a single start vertex is randomly selected and marked as visited. The marked vertex will be referred to as the walker. In each superstep the walker either randomly picks one of its outgoing, not yet traversed edges, walks to this neighbor and marks the edge as traversed, or, with the jump probability, or if there are no such outgoing edges left, jumps to another randomly selected vertex in the graph. Either the neighbor or the randomly selected vertex becomes the new walker and the computation starts again. For a multi-walk, more than one start vertex can be selected, which allows us to execute multiple walks in parallel.
After each completed superstep the already visited vertices are counted. If this number exceeds the desired number of sampled vertices, the iteration is terminated and the algorithm converges. Once the desired number of vertices is marked as visited, the graph is transformed back and a filter operator is applied to its vertex dataset. A vertex will be kept if it is marked as visited. An edge will be kept if its source and target vertices occur in the dataset of the remaining vertices.
One key feature of distributed shared-nothing systems is their ability to respond to growing data sizes or problem complexity by adding additional machines. Therefore, we evaluate the scalability of our implementations with respect to increasing data volume and computing resources in the first part of this section. The second part contains a metric-based and visual comparison of our sampling algorithms. We will show that our implementations compute expressive, structure-preserving graph samples based on the graph properties introduced in Section 3.
Setup. The evaluations were executed on a shared-nothing cluster with 16 workers connected via 1 GBit Ethernet. Each worker consists of an Intel Xeon E5-2430 CPU (6 × 2.5 GHz), 48 GB RAM and two 4 TB SATA disks, and runs openSUSE 13.2. We use Hadoop 2.6.0 and Flink 1.7.0. We run Flink with 6 threads and 40 GB memory per worker.
We use two types of datasets for our evaluation: synthetic graphs to measure scalability of the algorithms and real-world graphs to metrically and visually compare the sampled and original graph.
To evaluate the scalability of our implementations we use the LDBC-SNB data set generator (Erling et al., 2015). It creates heterogeneous social network graphs with a fixed schema. The synthetic graphs mimic structural characteristics of real-world graphs, e.g., node degree distribution based on power laws and skewed property value distributions. Table 2 shows the three datasets used throughout the benchmark. In addition to the scaling factor (SF) used, the cardinality of the vertex and edge sets as well as the dataset size on hard disk are specified. Each dataset is stored in the Hadoop distributed file system (HDFS). The execution times mentioned later include loading the graph from HDFS, computing the graph sample and writing the sampled graph back to HDFS. We run three executions per setup and report the average runtimes.
In addition, we use three real-world graphs from the SNAP Datasets (Leskovec and Krevl, 2014), ego-Facebook, ca-AstroPh and web-Google, to evaluate the impact of a sampling algorithm on the graph metrics and thus on the graph structure.
In many real-world use cases data analysts are limited in graph size for visual or analytical tasks. Therefore, we run each sampling algorithm with the intention to create a sampled graph with roughly 100K vertices. The sample size used for each graph is contained in Table 2. For the RW operator, 3,000 walkers and a fixed jump probability were used.
We first evaluate the absolute runtime and relative speedup of our implementations. Figure 7 shows the runtimes of the four algorithms for up to 16 workers using the LDBC.10 dataset; Figure 7 shows the corresponding speedup values. While all algorithms benefit from more resources, RVN and RW gain the most. For RVN, the runtime is reduced from 42 minutes on a single worker to 4 minutes on 16 workers (speedup 10.5). For RW, a speedup of 7.45 is reached (a reduction from 67 to 9 minutes). The simpler algorithms RV and RE already execute fast on a single machine for LDBC.10; hence, their potential for improvement is limited, which explains the lower speedup values.
Table 2: LDBC datasets used for the scalability evaluation (SF: scaling factor; s: sample size).

| SF  | |V|     | |E|     | Size on disk | s      |
| 1   | 3.3 M   | 17.9 M  | 2.8 GB       | 0.03   |
| 10  | 30.4 M  | 180.4 M | 23.9 GB      | 0.003  |
| 100 | 282.6 M | 1.77 B  | 236.0 GB     | 0.0003 |
We also evaluate scalability with increasing data volume and a fixed number of 16 workers. The results in Figure 7 show that the runtime of each algorithm increases almost linearly with growing data volume. For example, the execution of the RVN algorithm required about 34 seconds on LDBC.1 and 2,907 seconds on LDBC.100.
5.2. Metric-Based and Visual Comparison
An ideal sampling algorithm reduces the number of vertices and edges evenly by a desired amount. At the same time, the structural properties, as described by the calculated metrics, should be preserved in the best possible way. For example, a community of the graph can be thinned out, while the remaining vertices should stay equally connected to each other and thus hardly change their value for the local clustering coefficient.
In order to evaluate a sampling algorithm's impact on the graph metrics and thus on the graph structure, the metrics of the original and the sampled graph are compared. As mentioned before, we use three real-world graphs from the SNAP Datasets (Leskovec and Krevl, 2014). Each sampling algorithm is applied three times to these graphs using a sample size chosen to reduce the number of vertices or edges by about 60%. Due to the included neighborhood of each selected vertex, RVN requires a much lower value for s than the other sampling algorithms. For RW, the number of walkers is scaled according to the number of vertices, starting with 5 walkers for the ego-Facebook graph, 20 walkers for the ca-AstroPh graph, and 1,000 walkers for the web-Google graph. The jump probability remained fixed throughout the experiment. We computed the proposed metrics for the original graphs and each resulting sampled graph and added the average results to Table 3.
Since easier visualization is a main use case for graph sampling, we visually compare the original and the sampled graph structures for the ego-Facebook graph. Figure 7(a) shows the original graph with a force-directed layout (Hu, 2005). The vertex size represents the degree, i.e., bigger vertices imply a higher degree. The vertex color stands for its local clustering coefficient, where a darker color represents a higher value. Figures 7(b) to 7(e) show the sampled graphs for the different sampling algorithms. The positions of the vertices are kept consistent with the original graph in all sampled graphs.
RV manages to predictably reduce the number of vertices as well as the edges. According to Table 3, the number of triangles in the graph is reduced dramatically, while the value of the global clustering coefficient is almost completely preserved. Depending on the original structure, RV decomposes a graph into many new weakly connected components. As seen in Figure 7(b), RV visibly thins out the graph but also destroys many of the existing communities and removes inter-community edges as well. RE decreases the number of edges by the desired amount while hardly reducing the number of vertices. The value of the local clustering coefficient is reduced by a similar amount as the number of edges. All other structural properties of the original graph change unpredictably. The visualization in Figure 7(c) shows that most of the vertices are kept. The deleted edges reduce the connectivity within the communities and thus the local clustering coefficient of many vertices. RVN reduces the number of vertices as desired and keeps about 5% to 15% of the edges. Figure 7(d) shows the well-preserved neighborhoods of sampled vertices, but the sample lacks edges connecting the individual communities. RW reduces the number of vertices as expected. Figure 7(e) shows that edges within the individual communities and edges connecting those communities tend to be preserved.
We outlined distributed implementations of four graph sampling approaches using Apache Flink. Our first experimental results are promising as they showed good speedups when using multiple workers and near-perfect scalability for increasing dataset sizes. The metric-based and visual comparisons with the original graphs confirmed that the implementations provide the expected, useful results, thereby enabling the analyst and Gradoop user to select the most suitable sampling method. For example, both random vertex and random edge sampling are useful for obtaining an overview of a graph. Random vertex neighborhood (RVN) sampling is useful to analyze neighborhood relationships, while random walk sampling is beneficial to study the inter- and intra-connectivity of communities. In our ongoing work we provide distributed implementations of further sampling algorithms such as frontier sampling and forest fire sampling.
This work is partially funded by the Sächsische Aufbaubank (SAB) and the European Regional Development Fund (EFRE) under grant no. 100302179.
- Boccaletti, S., et al. (2006). Complex networks: structure and dynamics. Physics Reports 424(4), pp. 175–308.
- Carbone, P., et al. (2015). Apache Flink: stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 36(4).
- Erling, O., et al. (2015). The LDBC Social Network Benchmark: interactive workload. In Proc. SIGMOD.
- Hu, P. and Lau, W. C. (2013). A survey and taxonomy of graph sampling. CoRR abs/1308.5865.
- Hu, Y. (2005). Efficient, high-quality force-directed graph drawing. Mathematica Journal 10(1), pp. 37–71.
- Junghanns, M., et al. (2018). Declarative and distributed graph analytics with GRADOOP. PVLDB 11, pp. 2006–2009.
- Junghanns, M., et al. (2016). Analyzing extended property graphs with Apache Flink. In Proc. ACM SIGMOD Workshop on Network Data Analytics (NDA).
- Leskovec, J. and Faloutsos, C. (2006). Sampling from large graphs. In Proc. 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), pp. 631–636.
- Leskovec, J. and Krevl, A. (2014). SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data.
- Malewicz, G., et al. (2010). Pregel: a system for large-scale graph processing. In Proc. 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD '10), pp. 135–146.
- Rostami, M. A., et al. (2019). BIGGR: bringing GRADOOP to applications. Datenbank-Spektrum 19(1).
- Valiant, L. G. (1990). A bridging model for parallel computation. Commun. ACM 33(8), pp. 103–111.
- Wang, T., et al. (2011). Understanding graph sampling algorithms for social network analysis. In Proc. 31st International Conference on Distributed Computing Systems Workshops, pp. 123–128.
- Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature 393(6684), p. 440.
- Zaharia, M., et al. (2012). Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proc. 9th USENIX Conference on Networked Systems Design and Implementation.
- Zhang, F., et al. (2018). Implementation and evaluation of distributed graph sampling methods with Spark. Electronic Imaging 2018(1), pp. 379-1–379-9.