1 Introduction
With the growing interest in analyzing large real-world graphs such as online social networks, web graphs and semantic web graphs, many distributed graph computing systems [1, 5, 10, 11, 13, 18, 21, 23] have emerged. These systems are deployed in a shared-nothing distributed computing infrastructure usually built on top of a cluster of low-cost commodity PCs. Pioneered by Google’s Pregel [13], these systems adopt a vertex-centric computing paradigm, where programmers think naturally like a vertex when designing distributed graph algorithms. A Pregel-like system also takes care of fault recovery and scales to arbitrary cluster size without the need of changing the program code, both of which are indispensable properties for programs running in a cloud environment.
MapReduce [3], and its open-source implementation Hadoop, are also popularly used for large-scale graph processing. However, many graph algorithms are intrinsically iterative, such as the computation of PageRank, connected components, and shortest paths. For iterative graph computation, a Pregel program is much more efficient than its MapReduce counterpart [13].
Weaknesses of Pregel. Although Pregel’s vertex-centric computing model has been widely adopted in most of the recent distributed graph computing systems [1, 11, 10, 18] (and also inspired the edge-centric model [5]), Pregel’s vertex-to-vertex message passing mechanism often causes bottlenecks in communication when processing real-world graphs.
To clarify this point, we first briefly review how Pregel performs message passing. In Pregel, a vertex u can send messages to another vertex v if u knows v’s vertex ID. In most cases, a vertex u only sends messages to its neighbors, whose IDs are available from u’s adjacency list. But there also exist Pregel algorithms in which a vertex u may send messages to another vertex v that is not a neighbor of u [24, 19]. These algorithms usually adopt pointer jumping (or doubling), a technique that is widely used in designing PRAM algorithms [22], to bound the number of iterations by O(log n), where n refers to the number of vertices in the graph.
The problem with Pregel’s message passing mechanism is that a small number of vertices, which we call bottleneck vertices, may send/receive many more messages than other vertices. A bottleneck vertex not only generates heavy communication, but also significantly increases the workload of the machine in which the vertex resides, causing highly imbalanced workload among different machines. Bottleneck vertices are common when using Pregel to process real-world graphs, mainly due to either (1) high vertex degree or (2) algorithm logic, which we elaborate on as follows.
We first consider the problem caused by high vertex degree. When a high-degree vertex sends messages to all its neighbors, it becomes a bottleneck vertex. Unfortunately, real-world graphs usually have highly skewed degree distributions, with some vertices having very high degrees. For example, in the
Twitter who-follows-who graph (http://law.di.unimi.it/webdata/twitter2010/), the maximum degree is over 2.99M while the average degree is only 35. Similarly, in the BTC dataset used in our experiments, the maximum degree is over 1.6M while the average degree is only 4.69. We ran HashMin [17, 24], a distributed algorithm for computing connected components (CCs), on the degree-skewed BTC dataset in a cluster with 1 master (Worker 0) and 120 slaves (Workers 1–120), and observed highly imbalanced workload among the workers, which we describe next. Pregel assigns each vertex to a worker by hashing the vertex ID, regardless of the degree of the vertex. As a result, each worker holds approximately the same number of vertices, but the total number of neighbors in the adjacency lists (i.e., the number of edges) varies greatly among workers. In the computation of HashMin on BTC, we observed an uneven distribution of edge numbers among workers, as some workers contain more high-degree vertices than others. Since messages are sent along edges, the uneven distribution of edges also leads to an uneven distribution of the amount of communication among the workers. In Figure 1, the taller blue bars indicate the total number of messages sent by each worker during the entire computation of HashMin, where we observe highly uneven communication workload among the workers.
Bottleneck vertices may also be generated by program logic. An example is the SV algorithm proposed in [24, 22] for computing CCs, which we will describe in detail in Section 3.4. In SV, each vertex v maintains a field D[v], which records the vertex that v is to communicate with. The field may be updated at each iteration as the algorithm proceeds; and when the algorithm terminates, vertices u and v are in the same CC iff D[u] = D[v]. Thus, during the computation, some vertex w may communicate with many vertices in its CC if D[v] = w for many vertices v. In this case, w becomes a bottleneck vertex.
We ran SV on the USA road network in a cluster with 1 master (Worker 0) and 60 slaves (Workers 1–60), and observed highly imbalanced communication workload among the workers. In Figure 2, the taller blue bars indicate the total number of messages sent by each worker during the entire computation of SV, where we can see that the communication workload is highly biased (especially at Worker 0). We remark that the imbalanced communication workload is not caused by a skewed vertex degree distribution, since the largest vertex degree in the USA road network is merely 9. Rather, it is caused by the algorithm logic of SV. Specifically, since the USA road network is connected, in the last round of SV all vertices v have D[v] equal to Vertex 0, indicating that they all belong to the same CC. Since Vertex 0 is hashed to Worker 0, Worker 0 sends many more messages than the other workers, as can be observed from Figure 2.
In addition to the two problems mentioned above, Pregel’s message passing mechanism is also not efficient for processing graphs with (relatively) high average degree due to the high overall communication cost. However, many realworld graphs such as social networks and mobile phone networks have relatively high average degree, as a person is often connected to at least dozens of people.
Our Solution. In this paper, we solve the problems caused by Pregel’s message passing mechanism with two effective message reduction techniques. The goals are to (1) mitigate the problem of imbalanced workload by eliminating bottleneck vertices, and to (2) reduce the overall number of messages exchanged through the network.
The first technique is called mirroring, which is designed to eliminate bottleneck vertices caused by high vertex degree. The main idea is to construct mirrors of each high-degree vertex in different machines, so that messages from a high-degree vertex are forwarded to its neighbors by its mirrors in the local machines. Let d(v) be the degree of a vertex v and M be the number of machines in the cluster; mirroring bounds the number of messages sent by v each time to at most M − 1. If v is a high-degree vertex, d(v) can be up to millions, but M is normally only from tens to a few hundred. We remark that ideas similar to mirroring have been adopted by existing systems [11, 18], but we find that mirroring a vertex does not always reduce the number of messages, due to Pregel’s use of message combiners [13]. Hence, we provide a theoretical analysis on which vertices should be selected for mirroring in Section 5.
In Figure 1, the short red bars indicate the total number of messages sent by each worker when mirroring is applied to all vertices with degree at least 100. We can clearly see the big difference between the uneven blue bars (without mirroring) and the even-height short red bars (with mirroring). Furthermore, the number of messages is also significantly reduced by mirroring. We remark that the algorithm is still the same and mirroring is completely transparent to users. Mirroring reduces the running time of HashMin on BTC from 26.97 seconds to 9.55 seconds.
The second technique is a new request-respond paradigm. We extend the basic Pregel framework with an additional request-respond functionality. A vertex u may request another vertex v for its attribute a(v), and the requested value becomes available in the next iteration. The request-respond programming paradigm simplifies the coding of many Pregel algorithms, as otherwise at least three supersteps are required to explicitly code each request and response process. More importantly, the request-respond paradigm effectively eliminates the bottleneck vertices resulting from algorithm logic, by bounding the number of response messages sent by any vertex to at most M, the number of machines. Consider the SV algorithm mentioned earlier, where a set of n′ vertices v with D[v] = w require the value of D[w] from w (thus there are n′ requests and n′ responses). Under the request-respond paradigm, all the requests from a machine to the same target vertex are merged into one request. Therefore, at most M requests are sent for the n′ vertices and at most M responses are sent by w. For large real-world graphs, n′ is often orders of magnitude greater than M.
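As a small illustration of request merging under the request-respond paradigm, the following C++ sketch (our own helper, not part of any system’s API; workers are assigned by an assumed hash function id mod num_workers) counts how many requests actually cross the network once all requests from one worker to the same target vertex are merged:

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Counts the network requests remaining after merging: at most one request
// per (requesting worker, target vertex) pair. Workers are assigned by the
// assumed hash function id % num_workers.
inline std::size_t merged_request_count(
    const std::vector<std::pair<int, int>>& requests,  // (requester ID, target ID)
    int num_workers) {
    std::set<std::pair<int, int>> unique;  // (requesting worker, target vertex)
    for (const auto& r : requests)
        unique.insert({r.first % num_workers, r.second});
    return unique.size();
}
```

For instance, if six vertices spread over two workers all request the same target, only two merged requests are sent, one per worker.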
In Figure 2, the short red bars indicate the total number of messages sent by each worker when the request-respond paradigm is applied. Again, the skewed message passing represented by the blue bars is now replaced by the even-height short red bars. In particular, Vertex 0 now only responds to the requesting workers instead of all the requesting vertices in the last round, and hence the highly imbalanced workload caused by Vertex 0 on Worker 0 is now evened out. The request-respond paradigm reduces the running time of SV on the USA road network from 261.9 seconds to 137.7 seconds.
Finally, we remark that our experiments were run in a cluster without any resource contention, and our optimization techniques are expected to improve the overall performance of Pregel algorithms more significantly if they were run in a public data center, where the network bandwidth is lower and reducing communication overhead becomes more important.
The rest of the paper is organized as follows. We review existing parallel graph computing systems, and highlight the differences of our work from theirs, in Section 2. In Section 3, we describe some Pregel algorithms for problems that are common in social network analysis and web analysis. In Section 4, we introduce the basic communication framework. We present the mirroring technique and the requestrespond functionality in Sections 5 and 6. Finally, we report the experimental results in Section 7 and conclude the paper in Section 8.
2 Background and Related Work
We first review Pregel’s framework, and then discuss other related distributed graph computing systems.
2.1 Pregel
Pregel [13] is designed based on the bulk synchronous parallel (BSP) model. It distributes vertices to different machines in a cluster, where each vertex v is associated with its adjacency list (i.e., the set of v’s neighbors). A program in Pregel implements a user-defined compute() function and proceeds in iterations (called supersteps). In each superstep, the program calls compute() for each active vertex. The compute() function performs the user-specified task for a vertex v, such as processing v’s incoming messages (sent in the previous superstep), sending messages to other vertices (to be received in the next superstep), and making v vote to halt. A halted vertex is reactivated if it receives a message in a subsequent superstep. The program terminates when all vertices vote to halt and there is no pending message for the next superstep.
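For illustration only, the following C++ sketch simulates this superstep loop on a single machine (the Vertex type and run_supersteps() helper are our own, not Pregel’s actual API): a halted vertex with pending messages is reactivated, compute() may send messages and vote to halt, and the loop ends when all vertices have halted with no pending messages.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-machine model of a Pregel vertex (not Pregel's API).
struct Vertex {
    int id = 0;
    double value = 0.0;
    bool halted = false;
    std::vector<double> inbox;  // messages sent to this vertex in the previous superstep
};

using SendFn = std::function<void(int, double)>;  // (target vertex ID, message)
using ComputeFn = std::function<void(Vertex&, int, const SendFn&)>;

// Runs supersteps until every vertex has voted to halt and no message is
// pending. Returns the number of supersteps executed.
inline int run_supersteps(std::vector<Vertex>& vertices, const ComputeFn& compute) {
    int superstep = 0;
    bool pending = true;
    while (pending) {
        ++superstep;
        std::vector<std::vector<double>> next(vertices.size());
        SendFn send = [&](int target, double msg) { next[target].push_back(msg); };
        for (Vertex& v : vertices) {
            if (v.halted && v.inbox.empty()) continue;  // skip inactive vertices
            v.halted = false;                           // a message reactivates v
            compute(v, superstep, send);
        }
        pending = false;
        for (std::size_t i = 0; i < vertices.size(); ++i) {
            vertices[i].inbox = std::move(next[i]);
            if (!vertices[i].inbox.empty() || !vertices[i].halted) pending = true;
        }
    }
    return superstep;
}
```

A user would supply only the compute() callback, which mirrors how Pregel separates the system’s synchronization from the per-vertex algorithm logic.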
Pregel numbers the supersteps so that a user may use the current superstep number when implementing the algorithm logic in the compute() function. As a result, a Pregel algorithm can perform different operations in different supersteps by branching on the current superstep number.
Message Combiner. Pregel allows users to implement a combine() function, which specifies how to combine messages that are sent from a machine M_i to the same vertex v in a machine M_j. These messages are combined into a single message, which is then sent from M_i to v in M_j. However, a combiner is applied only when commutative and associative operations are to be applied to the messages. For example, in PageRank computation, the messages sent to a vertex v are summed up to compute v’s PageRank value; in this case, we can combine all messages sent from a machine M_i to the same target vertex v in a machine M_j into a single message that equals their sum. Figure 3 illustrates the idea of the combiner, where the messages sent by vertices in machine M_i to the same target vertex v in machine M_j are combined into their sum before sending.
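To make the idea concrete, the following C++ sketch (a hypothetical helper, not Pregel’s combiner interface) applies sum-combining to a worker’s outgoing messages: all messages destined for the same target vertex are merged into one message holding their sum.

```cpp
#include <map>
#include <vector>

struct Msg {
    int target;    // target vertex ID
    double value;  // message payload, to be summed at the receiver
};

// Combines all messages with the same target into one message carrying their
// sum; valid only because addition is commutative and associative.
inline std::vector<Msg> combine_by_sum(const std::vector<Msg>& outgoing) {
    std::map<int, double> sums;  // target vertex ID -> combined value
    for (const Msg& m : outgoing) sums[m.target] += m.value;
    std::vector<Msg> combined;
    for (const auto& entry : sums) combined.push_back({entry.first, entry.second});
    return combined;
}
```

Three messages to two distinct targets thus shrink to two messages on the wire, regardless of how many senders contributed.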
Aggregator. Pregel also supports aggregators, which are useful for global communication. Each vertex can provide a value to an aggregator in compute() in a superstep. The system aggregates these values and makes the aggregated result available to all vertices in the next superstep.
2.2 Pregel-Like Systems in JAVA
Since Google’s Pregel is proprietary, many open-source Pregel counterparts have been developed. Most of these systems are implemented in JAVA, e.g., Giraph [1] and GPS [18]. They read the graph data from Hadoop’s DFS (HDFS) and write the results to HDFS. However, since object deletion is handled by JAVA’s Garbage Collector (GC), if a machine maintains a huge number of vertex/edge objects in main memory, GC needs to track many objects and the overhead can severely degrade the system performance. To decrease the number of objects being maintained, JAVA-based systems maintain vertices in main memory in their binary representation. For example, Giraph organizes vertices as main memory pages, where each page is simply a byte array object that holds the binary representation of many vertices. As a result, a vertex needs to be deserialized from the page holding it before calling compute(); and after compute() completes, the updated vertex needs to be serialized back to its page. The serialization cost can be high, especially if the adjacency list is long. To avoid unnecessary serialization cost, a Pregel-like system should be implemented in a language such as C/C++, where programmers (who are system developers, not end users) manage main memory objects themselves. We implemented our Pregel+ system in C/C++.
GPS [18] supports an optimization called large adjacency list partitioning (LALP) to handle high-degree vertices, whose idea is similar to vertex mirroring. However, GPS does not explore the performance tradeoff between vertex mirroring and message combining. Instead, it is claimed in [18] that only a very small performance difference is observed whether a combiner is used or not, and thus GPS simply does not perform sender-side message combining. Our experiments in Section 7 show that sender-side message combining significantly reduces the overall running time of Pregel algorithms, and therefore, both vertex mirroring and message combining should be used to achieve better performance. As we shall see in Section 5, vertex mirroring and message combining are two conflicting message reduction techniques, and a theoretical analysis of their performance tradeoff is needed in order to devise a cost model for automatically choosing vertices for mirroring.
2.3 GraphLab and PowerGraph
GraphLab [11] is another parallel graph computing system that follows a design different from Pregel. GraphLab supports asynchronous execution, and adopts a data pulling programming paradigm. Specifically, each vertex actively pulls data from its neighbors, rather than passively receiving messages sent/pushed by its neighbors. This feature is somewhat similar to our request-respond paradigm, but in GraphLab, requests can only be sent to neighbors. As a result, GraphLab cannot support parallel graph algorithms where a vertex needs to communicate with a non-neighbor. Such algorithms are, however, quite popular in Pregel, as they make use of the pointer jumping (or doubling) technique of PRAM algorithms to bound the number of iterations by O(log n). Examples include the SV algorithm for computing CCs [24] and the Pregel algorithm for computing minimum spanning forests [19]. These algorithms can benefit significantly from our request-respond technique. Recently, several studies [8, 12] reported that GraphLab’s asynchronous execution is generally slower than its synchronous mode (which simulates Pregel’s model) due to the high locking/unlocking overhead. Thus, we mainly focus on Pregel’s computing model in this paper.
GraphLab also builds mirrors for vertices, which are called ghosts. However, GraphLab creates mirrors for every vertex regardless of its degree, which leads to excessive space consumption. A more recent version of GraphLab, called PowerGraph [5], partitions the graph by edges rather than by vertices. Edge partitioning mitigates the problem of imbalanced workload, as the edges of a high-degree vertex are handled by multiple workers. Accordingly, a new edge-centric Gather-Apply-Scatter (GAS) computing model is used instead of the traditional vertex-centric computing model.
3 Pregel Algorithms
In this section, we describe some Pregel algorithms for problems that are common in social network analysis and web analysis, which will be used for illustrating important concepts and for performance evaluation.
We consider fundamental problems such as (1) computing connected components (or biconnected components), which is a common preprocessing step for social network analysis [14, 15]; (2) computing minimum spanning tree (or forest), which is useful in mining social relationships [15]; and (3) computing PageRank, which is widely used in ranking web pages [16, 9] and spam detection [7].
For ease of presentation, we first define the graph notations used in the paper. Given an undirected graph G = (V, E), we denote the neighbors of a vertex v by Γ(v), and the degree of v by d(v); if G is directed, we denote the in-neighbors (out-neighbors) of a vertex v by Γ_in(v) (Γ_out(v)), and the in-degree (out-degree) of v by d_in(v) (d_out(v)). Each vertex v has a unique integer ID, denoted by id(v). The diameter of G is denoted by δ.
3.1 Attribute Broadcast
We first introduce a Pregel algorithm for attribute broadcast. Given a directed graph G = (V, E), where each vertex v is associated with an attribute a(v) and an adjacency list that contains the set of v’s out-neighbors Γ_out(v), attribute broadcast constructs a new adjacency list for each vertex v in G, which is defined as Γ_new(v) = {(u, a(u)) : u ∈ Γ_out(v)}.
Put simply, attribute broadcast associates each neighbor u in the adjacency list of a vertex v with u’s attribute a(u). Attribute broadcast is very useful in distributed graph computation, and it is a frequently performed key operation in many Pregel algorithms. For example, the Pregel algorithm for computing biconnected components [24] requires relabeling the ID of each vertex by its preorder number in the spanning tree, denoted by preorder(v). Attribute broadcast is used in this case, where a(v) refers to preorder(v).
The Pregel algorithm for attribute broadcast consists of 3 supersteps: in superstep 1, each vertex v sends a message ⟨id(v)⟩ to each out-neighbor u ∈ Γ_out(v) to request a(u); then in superstep 2, each vertex u obtains the requesters v from the incoming messages, and sends the response message ⟨id(u), a(u)⟩ to each requester v; finally in superstep 3, each vertex v collects the incoming messages ⟨id(u), a(u)⟩ to construct Γ_new(v).
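The three supersteps above can be simulated on one machine as follows (a C++ sketch with integer attributes; attribute_broadcast() is our own illustrative helper, not part of any system’s API):

```cpp
#include <utility>
#include <vector>

// Simulates the three supersteps of attribute broadcast: request, respond,
// and collect. out[v] is v's out-neighbor list and attr[u] is a(u); the
// result holds each vertex's new adjacency list of (neighbor, attribute)
// pairs.
inline std::vector<std::vector<std::pair<int, int>>> attribute_broadcast(
    const std::vector<std::vector<int>>& out, const std::vector<int>& attr) {
    const int n = static_cast<int>(out.size());
    std::vector<std::vector<int>> requesters(n);  // superstep 1: v requests a(u)
    for (int v = 0; v < n; ++v)
        for (int u : out[v]) requesters[u].push_back(v);
    std::vector<std::vector<std::pair<int, int>>> result(n);
    for (int u = 0; u < n; ++u)                   // superstep 2: u responds (u, a(u))
        for (int v : requesters[u]) result[v].push_back({u, attr[u]});
    return result;                                // superstep 3: v collects responses
}
```

In a real Pregel run the request and response lists would travel as messages between workers; here they are just in-memory vectors.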
3.2 PageRank
Next we present a Pregel algorithm for PageRank computation. Given a directed web graph G = (V, E), where each vertex (page) v links to a list of pages Γ_out(v), the problem is to compute the PageRank, pr(v), of each vertex v.
Pregel’s PageRank algorithm [13] works as follows. In superstep 1, each vertex v initializes pr(v) = 1/|V| and distributes the value pr(v)/d_out(v) to each out-neighbor of v. In superstep i (i > 1), each vertex v sums up the values received from its in-neighbors, denoted by sum, and computes pr(v) = 0.15/|V| + 0.85 · sum. It then distributes pr(v)/d_out(v) to each of its out-neighbors.
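For illustration, a single-machine C++ sketch of these updates (pagerank() is our own helper; every vertex is assumed to have at least one out-neighbor):

```cpp
#include <vector>

// Runs the PageRank updates for a fixed number of supersteps on one machine.
// out[v] lists v's out-neighbors; every vertex is assumed to have d_out > 0.
inline std::vector<double> pagerank(const std::vector<std::vector<int>>& out,
                                    int supersteps) {
    const int n = static_cast<int>(out.size());
    std::vector<double> pr(n, 1.0 / n);           // superstep 1: pr(v) = 1/|V|
    for (int i = 2; i <= supersteps; ++i) {
        std::vector<double> sum(n, 0.0);
        for (int v = 0; v < n; ++v)               // distribute pr(v)/d_out(v)
            for (int u : out[v]) sum[u] += pr[v] / out[v].size();
        for (int v = 0; v < n; ++v)               // pr(v) = 0.15/|V| + 0.85*sum
            pr[v] = 0.15 / n + 0.85 * sum[v];
    }
    return pr;
}
```

On a symmetric graph such as a 2-cycle, both ranks stay at 1/2 in every superstep, which makes the update easy to check by hand.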
3.3 HashMin
We next present a Pregel algorithm for computing connected components (CCs) in an undirected graph. We adopt the HashMin algorithm [17, 24]. Given a CC C, let us denote the set of vertices of C by V(C), and define the ID of C to be cc(C) = min{id(v) : v ∈ V(C)}. We further define the color of a vertex v as cc(v) = cc(C), where C is the CC containing v. HashMin computes cc(v) for each vertex v, and the idea is to broadcast the smallest vertex ID seen so far by each vertex v, denoted by min(v). When the algorithm terminates, min(v) = cc(v) for each vertex v.
We now describe the HashMin algorithm in the Pregel framework. In superstep 1, each vertex v sets min(v) to be min{id(u) : u ∈ Γ(v) ∪ {v}}, broadcasts min(v) to all its neighbors, and votes to halt. In each subsequent superstep, each vertex v receives messages from its neighbors; let min* be the smallest ID received; if min* < min(v), v sets min(v) = min* and broadcasts min* to its neighbors. All vertices vote to halt at the end of a superstep. When the process converges, all vertices have voted to halt and for each vertex v, we have min(v) = cc(v).
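A single-machine C++ sketch of this computation (hashmin() is our own helper; a full synchronous pass over all vertices stands in for one superstep):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Computes min(v) for every vertex by iterating HashMin until no value
// changes; nbrs[v] is the undirected adjacency list of v, and vertex IDs
// are the indices 0..n-1.
inline std::vector<int> hashmin(const std::vector<std::vector<int>>& nbrs) {
    const int n = static_cast<int>(nbrs.size());
    std::vector<int> min_id(n);
    for (int v = 0; v < n; ++v) {                 // superstep 1
        min_id[v] = v;
        for (int u : nbrs[v]) min_id[v] = std::min(min_id[v], u);
    }
    bool changed = true;
    while (changed) {                             // subsequent supersteps
        changed = false;
        std::vector<int> next = min_id;
        for (int v = 0; v < n; ++v)
            for (int u : nbrs[v])                 // v "receives" min_id[u] from u
                if (min_id[u] < next[v]) { next[v] = min_id[u]; changed = true; }
        min_id = std::move(next);
    }
    return min_id;
}
```

Vertices in the same component converge to the same smallest ID, which is exactly cc(v).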
3.4 The SV Algorithm
The HashMin algorithm described in Section 3.3 requires O(δ) supersteps [24], which can be slow for computing CCs in large-diameter graphs. Another Pregel algorithm proposed in [24] computes CCs in O(log n) supersteps, by adapting Shiloach-Vishkin’s (SV) algorithm for the PRAM model [22]. We use this algorithm to demonstrate how algorithm logic generates a bottleneck vertex even when the maximum vertex degree is small.
In the SV algorithm, each vertex u maintains a pointer D[u], which is initialized as D[u] = u, forming a self-loop as shown in Figure 4(a). During the computation, vertices are organized into a forest such that all vertices in a tree belong to the same CC. The tree definition is relaxed a bit here to allow the tree root w to have a self-loop, i.e., D[w] = w (see Figures 4(b) and 4(c)); while D[u] of any other vertex u in the tree points to u’s parent.
The SV algorithm proceeds in rounds, and in each round, the pointers are updated in three steps (illustrated in Figure 5): (1) tree hooking: for each edge (u, v), if u’s parent w = D[u] is a tree root, hook w as a child of v’s parent D[v], i.e., set D[w] = D[v]; (2) star hooking: for each edge (u, v), if u is in a star (see Figure 4(c) for an example of a star), hook the star to v’s tree as in Step (1), i.e., set D[D[u]] = D[v]; (3) shortcutting: for each vertex v, move vertex v and its descendants closer to the tree root, by hooking v to the parent of v’s parent, i.e., setting D[v] = D[D[v]]. The above three steps execute in O(log n) rounds, and the algorithm ends when every vertex is in a star.
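To see why O(log n) rounds suffice, the shortcutting step alone can be simulated in C++ (shortcut_rounds() is our own helper): repeatedly setting D[v] = D[D[v]] flattens a chain hanging off a root into a star in about log2 of its length rounds.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Repeatedly applies D[v] = D[D[v]] to every vertex until the forest no
// longer changes, and returns the number of rounds that made progress.
inline int shortcut_rounds(std::vector<int>& D) {
    int rounds = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        std::vector<int> next(D.size());
        for (std::size_t v = 0; v < D.size(); ++v) {
            next[v] = D[D[v]];                 // hook v to its grandparent
            if (next[v] != D[v]) changed = true;
        }
        D = std::move(next);
        if (changed) ++rounds;
    }
    return rounds;
}
```

Each round roughly halves every vertex’s distance to the root, which is the source of the logarithmic bound.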
Due to the shortcutting operation, the SV algorithm creates flattened trees (e.g., stars) with large fanout towards the end of the execution. As a result, a vertex w may have many children v (i.e., with D[v] = w), and each of these children requests the value of D[w]. This renders w a bottleneck vertex. In particular, in the last round of the SV algorithm, all vertices v in a CC have D[v] = w for a single vertex w, and they all send requests to w for D[w]. In the basic Pregel framework, w receives one request and sends one response for every vertex in its CC, which leads to skewed workload when the CC is large.
3.5 Minimum Spanning Forest
The Pregel algorithm proposed in [19] for minimum spanning forest (MSF) computation is another example that shows how algorithm logic can generate bottleneck vertices. This algorithm proceeds in iterations, where each iteration consists of three steps, which we describe below.
In Step (1), each vertex v picks an incident edge with the minimum weight. The vertices and their picked edges form disjoint subgraphs, each of which is a conjoined-tree: two trees with their roots joined by a cycle. Figure 6 illustrates the concept of a conjoined-tree, where the edges shown are those picked in Step (1). The vertex with the smaller ID in the cycle of a conjoined-tree is called the supervertex of the tree (e.g., vertex 5 is the supervertex in Figure 6), and the other vertices are called the subvertices.
In Step (2), each vertex finds the supervertex of the conjoined-tree it belongs to, which is accomplished by pointer jumping. Specifically, each vertex v maintains a pointer D[v]; suppose that v picks edge (v, u) in Step (1), then the value of D[v] is initialized as u. Each vertex v then sends a request to D[v] for D[D[v]]. Initially, the actual supervertex s (e.g., vertex 5 in Figure 6) and its neighbor t in the cycle (e.g., vertex 6 in Figure 6) see that they have sent each other messages and detect that they are in the cycle. Vertex s then sets itself as the supervertex (i.e., sets D[s] = s) due to id(s) < id(t), before responding to the requesters (while D[t] remains s for t since id(t) > id(s)). Any other vertex v receives the response D[D[v]] from D[v] and updates D[v] to be D[D[v]]. This process is repeated until convergence, upon which D[v] records the supervertex for every vertex v.
In Step (3), each vertex u sends a request to each neighbor v for its supervertex D[v], and removes edge (u, v) if D[u] = D[v] (i.e., u and v are in the same conjoined-tree); u then sends the remaining edges (to vertices in other conjoined-trees) to its supervertex D[u]. After this step, all subvertices are condensed into their supervertex, which constructs an adjacency list of edges to the other supervertices from those edges sent by its subvertices.
We consider an improved version of the above algorithm that applies the Storing-Edges-At-Subvertices (SEAS) optimization of [19]. Specifically, instead of having the supervertex merge and store all cross-tree edges, the SEAS optimization stores the edges of a supervertex in a distributed fashion among all of its subvertices. As a result, if a supervertex is merged into another supervertex, it has to notify its subvertices of the new supervertex they belong to. This is accomplished by having each vertex v send a request to its supervertex s for D[s]. Since smaller conjoined-trees are merged into larger ones, a supervertex s may have many subvertices towards the end of the execution, and they all request D[s] from s, rendering s a bottleneck vertex.
4 Basic Communication Framework
When considering on which system we should implement our message reduction techniques, we decided to implement a new open-source Pregel system in C/C++, called Pregel+, to avoid the pitfalls of a JAVA-based system described in Section 2.2. Other reasons for a new Pregel implementation include: (1) GPS does not perform sender-side message combining, while our work studies effective message reduction techniques in a system that adheres to Pregel’s framework, where message combining is supported; (2) Giraph has been shown to have inferior performance in recent performance evaluations of graph-parallel systems [2, 4, 6, 8, 20] and also in our experiments; (3) other existing graph computing systems are also not suitable, as discussed in Section 2.
We first introduce the basic communication framework of Pregel+. Our two new message reduction techniques to be introduced in Sections 5 and 6 further extend the basic communication framework.
We use the term “worker” to represent a computing unit, which can be a machine or a thread/process in a machine. For ease of discussion, we assume that each machine runs only one worker but the concepts can be straightforwardly generalized.
In Pregel+, each worker is simply an MPI (Message Passing Interface) process, and communication among different processes is implemented using MPI’s communication primitives. Each worker maintains a message channel, C_msg, for exchanging the vertex-to-vertex messages. In the compute() function, if a vertex u sends a message m to a target vertex v, the message is simply added to C_msg. As in Google’s Pregel, messages in C_msg are sent to the target workers in batches before the next superstep begins. Note that if a message m is sent from worker W_i to a vertex v in worker W_j, the ID of the target vertex v should be sent along with m, so that when W_j receives m, it knows which vertex m should be directed to.
The operation of the message channel is directly related to the communication cost and hence affects the overall performance of the system. We tested different ways of implementing C_msg, and the most efficient one is presented in Figure 7. We assume that a worker maintains k vertices, v_1, …, v_k. The message channel C_msg associates each vertex v_i with an incoming message buffer IB_i. When an incoming message m directed to vertex v_i arrives, C_msg looks up a hash table T for the incoming message buffer IB_i using v_i’s ID. It then appends m to the end of IB_i. The lookup table T is static unless graph mutation occurs, in which case updates to T may be required. Once all incoming messages are processed, compute() is called for each active vertex v_i with the messages in IB_i as the input.
A worker also maintains N outgoing message buffers, where N is the number of workers, one for each worker in the cluster, denoted by OB_1, …, OB_N. In compute(), a vertex u may send a message m to another vertex v with ID id(v). Let hash(·) be the hash function that computes the worker ID of a vertex from its vertex ID; then the target vertex v is in worker W_j with j = hash(id(v)). Thus, m (along with id(v)) is appended to the end of the buffer OB_j. Messages in each buffer OB_j are sent to worker W_j in batch. If a combiner is used, the messages in a buffer OB_j are first grouped (sorted) by target vertex IDs, and the messages in each group are combined into one message using the combiner logic before sending.
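The outgoing side can be sketched in C++ as follows (OutBuffers is our own illustrative type, with id mod N standing in for the hash function):

```cpp
#include <vector>

struct OutMsg {
    int target_id;   // id(v) travels with the message
    double payload;
};

// One outgoing buffer per worker; a message to vertex v is appended to the
// buffer of worker hash(id(v)), here approximated by id % num_workers.
struct OutBuffers {
    int num_workers;
    std::vector<std::vector<OutMsg>> buf;
    explicit OutBuffers(int n) : num_workers(n), buf(n) {}
    int worker_of(int vertex_id) const { return vertex_id % num_workers; }
    void send(int target_id, double payload) {
        buf[worker_of(target_id)].push_back({target_id, payload});
    }
};
```

Before a buffer is flushed, a combiner (such as the sum combiner of Section 2.1) can be run over it to merge messages with the same target.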
5 The Mirroring Technique
The mirroring technique is designed to eliminate bottleneck vertices caused by high vertex degree.
Given a high-degree vertex v, we construct a mirror for v in every worker in which some of v’s neighbors reside. When v needs to send a message, e.g., the value of its attribute a(v), to its neighbors, v sends a(v) to its mirrors. Then, each mirror forwards a(v) to the neighbors of v that reside in the same local worker as the mirror, without any message passing.
Figure 8 illustrates the idea of mirroring. Assume that v is a high-degree vertex residing in worker machine M_1, and v has k_2 neighbors residing in machine M_2 and k_3 neighbors residing in machine M_3. Suppose that v needs to send a message a(v) to its neighbors in M_2 and M_3. Figure 8(a) shows how v sends a(v) to its neighbors in M_2 and M_3 using Pregel’s vertex-to-vertex message passing. In total, k_2 + k_3 messages are sent, one for each neighbor. To apply mirroring, we construct a mirror for v in M_2 and M_3, as shown by the two squares (with label v) in Figure 8(b). In this way, as illustrated in Figure 8(b), v only needs to send a(v) to the two mirrors in M_2 and M_3. Then, each mirror forwards a(v) to v’s neighbors locally in M_2 and M_3 without any network communication. In total, only two messages are sent through the network, which not only tremendously reduces the communication cost, but also eliminates the imbalanced communication load caused by v.
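The saving can be quantified with a small C++ helper (our own, with machines assigned by the assumed hash id mod num_machines): without mirroring, v sends one message per neighbor; with mirroring, one message per remote machine holding at least one of v’s neighbors.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Returns (messages without mirroring, messages with mirroring) for a vertex
// whose neighbors all reside on other machines, as in Figure 8.
inline std::pair<std::size_t, std::size_t> message_counts(
    const std::vector<int>& neighbors, int home_machine, int num_machines) {
    std::set<int> remote;  // remote machines holding at least one neighbor
    for (int u : neighbors) {
        int m = u % num_machines;
        if (m != home_machine) remote.insert(m);
    }
    return {neighbors.size(), remote.size()};
}
```

For a millions-degree vertex in a cluster of a hundred machines, the second count is smaller by four orders of magnitude.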
We formalize the effectiveness of mirroring for message reduction by the following theorem.
Theorem 1
Let d(v) be the degree of a vertex v and M be the number of machines. Suppose that v is to deliver a message to all its neighbors in one superstep. If mirroring is applied on v, then the total number of messages sent by v in order to deliver the message to all its neighbors is bounded by M − 1.
The proof follows directly from the fact that v only needs to send one message to each of its mirrors in other machines, and there are at most M − 1 mirrors of v.
Mirroring Threshold. The mirroring technique is transparent to programmers. But we can allow users to specify a mirroring threshold τ, such that mirroring is applied to a vertex v only if d(v) ≥ τ (we will see shortly that τ can be automatically set by a cost model following the result of Theorem 2). If a vertex has degree less than τ, it sends messages through the normal message channel C_msg as usual. Otherwise, the vertex only sends messages to its mirrors, and we call this message channel the mirroring message channel, or C_mir in short. In a nutshell, a message is sent either through C_msg or C_mir, depending on the degree of the sending vertex.
Figure 9 illustrates the concepts of C_msg and C_mir, where we only consider the message passing between two machines M_1 and M_2. The adjacency lists of vertices u_1, u_2, u_3 and u_4 in M_1 are shown in Figure 9(a), and we consider how they send messages to their common neighbor v residing in machine M_2. Assume that d(u_4) ≥ τ while the degrees of u_1, u_2 and u_3 are below τ; then as Figure 9(b) shows, u_1, u_2 and u_3 send their messages m_1, m_2 and m_3 through C_msg, while u_4 sends its message m_4 through C_mir.
Mirroring vs. Message Combining. Now let us assume that the messages are to be applied with commutative and associative operations at the receivers’ side, e.g., the message values are to be summed up as in PageRank computation. In this case, a combiner can be applied on the message channel C_msg. However, the receiver-centric message combining is not applicable to the sender-centric channel C_mir. For example, in Figure 9(b), when u_4 in M_1 sends m_4 to its mirror in M_2, u_4 does not need to know the receivers; thus, its message to v cannot be combined with those messages from u_1, u_2 and u_3 that are also to be sent to v. In fact, u_4 only holds a list of the machines that contain u_4’s neighbors (i.e., M_2 in this example), and u_4’s neighbors that are local to M_2 are connected to u_4’s mirror in M_2.
It may appear that v's message to its mirror is wasted, because if we combine v's message with those messages from u1, u2 and u3, then we do not need to send it through Ch_mir. However, we note that a high-degree vertex like v often has many neighbors in another worker machine, e.g., w1 and w2 in addition to w in this example, and the message is not wasted since it is also forwarded to w1 and w2, which are not the neighbors of any other vertex in M1.
Choice of Mirroring Threshold. The above discussion shows that there are cases where mirroring is useful, but it does not give any formal guideline as to when exactly mirroring should be applied. To this end, we conduct a theoretical analysis below on the interplay between mirroring and message combining. Our result shows that mirroring is effective even when a message combiner is used.
Theorem 2
Given a graph G = (V, E) with n vertices and m edges, we assume that the vertex set is evenly partitioned among M machines (e.g., by hashing as in Pregel), so that each machine holds n/M vertices. We further assume that the neighbors of a vertex in G are randomly chosen among V, and that the average degree, denoted by deg_avg, is a constant. Then, mirroring should be applied to a vertex v if v's degree is at least M · e^{deg_avg/M}.
Consider a machine W that contains a set of vertices {v1, v2, …, v_{n/M}}, where each vertex v_i has deg(v_i) neighbors, for 1 ≤ i ≤ n/M. Let us focus on a specific vertex v_k in W, and infer how large deg(v_k) should be so that applying mirroring on v_k can reduce the overall communication even when a combiner is used.
Consider an application where all vertices send messages to all their neighbors in each superstep, such as in PageRank computation. Further consider a neighbor u of vertex v_k. If another vertex v_j in W sends messages through Ch_msg and also has u as its neighbor, then v_k's message to u is wasted, since it can be combined with v_j's message to u. We assume the worst case where all vertices in W send messages through Ch_msg. Since the neighbors of a vertex in G are randomly chosen among V, we have

Pr[u is a neighbor of v_j] = deg(v_j) / n,

and therefore,

Pr[u is a neighbor of no v_j (j ≠ k) in W] = ∏_{j ≠ k} (1 − deg(v_j)/n).

We regard each deg(v_j) as a random variable whose value is chosen independently from a degree distribution (e.g., a power-law degree distribution) with expectation deg_avg. Then, the expectation of the above equation is given by

E[ ∏_{j ≠ k} (1 − deg(v_j)/n) ] = (1 − deg_avg/n)^{n/M − 1}.

For large graphs, we have

(1 − deg_avg/n)^{n/M − 1} ≈ ((1 − deg_avg/n)^{n})^{1/M} ≈ e^{−deg_avg/M},

where the last step is derived from the fact that lim_{n→∞} (1 − deg_avg/n)^{n} = e^{−deg_avg}.
According to the above discussion, the expected number of v_k's neighbors that are not the neighbors of any other vertex in W is equal to deg(v_k) · e^{−deg_avg/M}. In other words, if mirroring is not used, v_k needs to send at least deg(v_k) · e^{−deg_avg/M} messages that are not wasted. On the other hand, if mirroring is used, v_k sends at most M − 1 messages, one to each mirror. Therefore, mirroring reduces the number of messages if deg(v_k) · e^{−deg_avg/M} ≥ M > M − 1, or more simply, if deg(v_k) ≥ M · e^{deg_avg/M}. To conclude, choosing τ = M · e^{deg_avg/M} as the degree threshold reduces the communication cost.
Theorem 2 states that the choice of τ depends on the number of workers, M, and the average vertex degree, deg_avg. A cluster usually involves tens to hundreds of workers, while the average degree of a large real-world graph is mostly below 50. Consider the scenario where M = 100 and deg_avg = 50; then τ = 100 · e^{0.5} ≈ 165. This shows that mirroring is effective even for vertices whose degree is not very high. We remark that Theorem 2 makes some simplifying assumptions (e.g., G being a random graph) for ease of analysis, which may not be accurate for a real graph. However, our experiments in Section 7.1 show that Theorem 2 is effective on real graphs.
Mirror Construction. Pregel+ constructs mirrors for every vertex v with deg(v) ≥ τ after the input graph is loaded and before the iterative computation, although mirror construction can also be precomputed offline like GraphLab's ghost construction. Specifically, the neighbors in v's adjacency list Γ(v) are grouped by the workers in which they reside, where the group for worker W is defined as Γ_W(v) = {u ∈ Γ(v) : u resides in W}. Then, for each group Γ_W(v), v sends ⟨v, Γ_W(v)⟩ to worker W, which constructs a mirror of v with the adjacency list Γ_W(v) locally in W. The mirror also stores, for each vertex u ∈ Γ_W(v), the address of u's incoming message buffer, so that messages can be directly forwarded to u by v's mirror in W.
During graph computation, a high-degree vertex v sends each message to its mirror in a worker W. On receiving the message, W looks up v's mirror from a hash table using v's ID (similar to the lookup described in Section 4). The message value is then forwarded to the incoming message buffers of v's neighbors locally in W.
Handling Edge Fields. There are some minor changes to Pregel's programming interface for applying mirroring. In Pregel's interface, a vertex u calls send_msg(v, msg) to send an arbitrary message msg to a target vertex v. With mirroring, a vertex v sends a message containing the value of its attribute a(v) to all its neighbors by calling broadcast(a(v)), instead of calling send_msg(u, a(v)) for each neighbor u.
Consider the algorithms described in Section 3. For PageRank, a vertex v simply calls broadcast(pr(v)/deg(v)); while for HashMin, v calls broadcast(min(v)), where min(v) is the smallest vertex ID that v has seen.
However, there are applications where the message value is not only decided by the sender vertex v's state, but also by the edge that the message is sent along. For example, in Pregel's algorithm for single-source shortest path (SSSP) computation [13], a vertex v sends d(v) + ℓ(v, u) to each neighbor u, where d(v) is an attribute of v estimating its distance from the source, and ℓ(v, u) is an attribute of v's out-edge (v, u) indicating the edge length.
To support applications like SSSP, Pregel+ requires that each edge object supports a function relay(msg), which specifies how to update the value of msg before msg is added to the incoming message buffer of the target vertex u. If msg is sent through Ch_msg, relay is called on the sender side before sending. If msg is sent through Ch_mir, relay is called on the receiver side when the mirror forwards msg to each local neighbor (as the edge field is maintained by the mirror). For example, in Figure 9, relay is called when a message is passed along a dashed arrow.
By default, relay does not change the value of msg. To support SSSP, a vertex v calls broadcast(d(v)) in compute(), and meanwhile, the function relay is overloaded to add the edge length ℓ(v, u) to msg, which updates the value of msg to the required value d(v) + ℓ(v, u).
Summary of Contributions. GPS does not use message combining, and therefore its LALP technique is not as effective as our mirroring technique, which is reinforced with a message combiner. GraphLab's ghost vertex technique creates mirrors for all vertices regardless of vertex degree, and is thus also not as effective as our mirroring technique. As far as we know, this is the first work that considers the integration of vertex mirroring and message combining in Pregel's computing model. In addition, we identified the tradeoff between vertex mirroring and message combining in message reduction, and provided a cost model to automatically select vertices for mirroring so as to minimize the number of messages. As we shall see in our experiments in Section 7.1, the mirroring threshold computed by our cost model in Theorem 2 achieves near-optimal performance. We also cope with the case where the message value depends on the edge field, which is not supported by GPS's LALP technique.
6 The Request-Respond Paradigm
In Sections 1, 3.4 and 3.5, we have shown that bottleneck vertices can be generated by the algorithm logic even if the input graph has no high-degree vertices. For handling such bottleneck vertices, the mirroring technique of Section 5 is not effective. To this end, we design our second message reduction technique, which extends the basic Pregel framework with a new request-respond functionality.
We illustrate the concept using the algorithms described in Section 3. Using the request-respond API, attribute broadcast in Section 3.1 is straightforward to implement: in superstep 1, each vertex u sends a request to each neighbor v for its attribute a(v); in superstep 2, u simply obtains a(v) responded by each neighbor v, and constructs its neighbor-attribute list. Similarly, for the SV algorithm in Section 3.4, when a vertex u needs to obtain D[D[u]] from the vertex D[u], it simply sends a request to D[u] so that D[D[u]] can be used in the next superstep; for the MSF algorithm in Section 3.5, a vertex u simply sends a request to its supervertex so that the supervertex's state can be used to update u's state in the next superstep.
Request-Respond Message Channel. We now explain in detail how Pregel+ supports the request-respond API. The request-respond paradigm supports all the functionality of Pregel. In addition, it supplements the vertex-to-vertex message channel Ch_msg with a request-respond message channel, denoted by Ch_req.
Figure 10 illustrates how requests and responses are exchanged between two machines, M1 and M2, through Ch_req. Specifically, each machine maintains M request sets, where M is the number of machines and the i-th request set S_i stores the requests to vertices in machine M_i. In a superstep, a vertex u in machine M1 may call request(v) in its compute() function to send a request to vertex v for its attribute value a(v) (which will be used in the next superstep). Let i = hash(v); then the requested vertex v is in machine M_i, and hence v is added to the request set S_i of M1. Although many vertices in M1 may send requests to v, only one request to v will be sent from M1 to M_i, since S_i is a (hash) set that eliminates redundant elements.
After compute() is called for all active vertices, the vertex-to-vertex messages are first exchanged through Ch_msg. Then, each machine sends each request set S_i to machine M_i. After the requests are exchanged, each machine receives M request sets, where the j-th set stores the requests sent from machine M_j. In the example shown in Figure 10, v is contained in the request set received by machine M2, since vertex u in machine M1 sent a request to v.
Then, a response set is constructed for each request set received from a machine M_j, which is to be sent back to M_j. In our example, the requested vertex, v, calls a user-specified function respond() to return its specified attribute a(v), and the entry ⟨v, a(v)⟩ is added to the response set for M1.
Once the response sets are exchanged, each machine constructs a hash table from the received entries. In the example shown in Figure 10, the entry ⟨v, a(v)⟩ is received by machine M1, since it is in the response set for M1 constructed in machine M2. The hash table is available in the next superstep, where vertices can access their requested values in their compute() function. In our example, vertex u in machine M1 may call get_resp(v) in the next superstep, which looks up v's attribute a(v) from the hash table.
The following theorem shows the effectiveness of the request-respond paradigm for message reduction.
Theorem 3
Let R(v) be the set of requesters that request the attribute a(v) from a vertex v. Then, the request-respond paradigm reduces the total number of messages concerning these requests from 2|R(v)| in Pregel's vertex-to-vertex message passing framework to at most 2 · min{|R(v)|, M}, where M is the number of machines.
The proof follows directly from the facts that each machine sends at most 1 request to v even though there may be more than 1 requester in that machine, that at most 1 response from v is sent to each machine that makes a request to v, and that there are at most min{|R(v)|, M} machines that contain a requester.
In the worst case, the request-respond paradigm uses the same number of messages as Pregel's vertex-to-vertex message passing. But in practice, many Pregel algorithms (e.g., those described in Sections 3.4 and 3.5) have bottleneck vertices with a large number of requesters, leading to imbalanced workload and long elapsed running time. In such cases, our request-respond paradigm effectively bounds the number of messages by the number of machines containing the requesters and eliminates the imbalanced workload.
Explicit Responding. In the above discussion, a vertex u simply calls request(v) in one superstep, and it can then call get_resp(v) in the next superstep to get a(v). All the operations, including request exchange, response set construction, response exchange, and response table construction, are performed by Pregel+ automatically and are thus transparent to users. We name the above process implicit responding, where a responder does not know the requester until a request is received.
When a responder v knows its requesters, v can explicitly call respond(u) for each requester u in compute(), which adds the entry ⟨v, a(v)⟩ to the response set for the machine that contains u. This process is also illustrated in Figure 10. Explicit responding is more cost-efficient since there is no need for request exchange and response set construction.
Explicit responding is useful in many applications. For example, to compute PageRank on an undirected graph, a vertex v can simply call respond(u) for each neighbor u to push pr(v)/deg(v) to v's neighbors; this is because in the next superstep, each vertex u knows its neighbors, and can thus collect their responses. Similarly, in attribute broadcast, if the input graph is undirected, each vertex can simply push its attribute to its neighbors. Note that data pushing by explicit responding requires fewer messages than Pregel's vertex-to-vertex message passing, since responses are sent to machines (more precisely, their response tables) rather than to individual vertices.
Programming Interface. Pregel+ extends the vertex class in Pregel's interface [13] by requiring users to specify an additional template argument R, which indicates the type of the attribute value that a vertex responds with.
In compute(), a vertex u can either pull data from another vertex v by calling request(v), or push data to v by calling respond(v). The attribute value that a vertex returns is defined by a user-specified abstract function respond(), which returns a value of type R. Like compute(), one may program respond() to return different attributes of a vertex in different supersteps, according to the algorithm logic of the specific application. Finally, a vertex may call get_resp(v) in compute() to get the attribute of v, if it was pushed into the response table in the previous superstep.
7 Experimental Results
We now evaluate the effectiveness of our message reduction techniques. We ran our experiments on a cluster of 16 machines, each with 24 processors (two Intel Xeon E5-2620 CPUs) and 48GB RAM. One machine is used as the master, while the other 15 machines act as slaves. The connectivity between any pair of nodes in the cluster is 1Gbps.
We used five real-world datasets, as shown in Figure 11: (1) WebUK (http://law.di.unimi.it/webdata/uk-union-2006-06-2007-05): a web graph generated by combining twelve monthly snapshots of the .uk domain collected for the DELIS project; (2) LiveJournal (LJ) (http://konect.uni-koblenz.de/networks/livejournal-groupmemberships): a bipartite network of LiveJournal users and their group memberships; (3) Twitter (http://konect.uni-koblenz.de/networks/twitter_mpi): the Twitter who-follows-who network based on a snapshot taken in 2009; (4) BTC (http://km.aifb.kit.edu/projects/btc-2009/): a semantic graph converted from the Billion Triple Challenge 2009 RDF dataset; (5) USA (http://www.dis.uniroma1.it/challenge9/download.shtml): the USA road network.
LJ, Twitter and BTC have skewed degree distributions; WebUK, LJ and Twitter have relatively high average degrees; USA and WebUK have large diameters.
Pregel+ Implementation. Pregel+ is implemented in C/C++ as a group of header files, and users only need to include the necessary base classes and implement the application logic in their subclasses. Pregel+ communicates with HDFS through libhdfs, a JNI-based C API for HDFS. Each worker is simply an MPI process, and communication is implemented using MPI communication primitives. While one may deploy Pregel+ with any Hadoop and MPI version, we use Hadoop 1.2.1 and MPICH 3.0.4 in our experiments. All programs are compiled using GCC 4.4.7 with the -O2 option enabled.
All the system source code, as well as the source code of the algorithms discussed in this paper, can be found at http://www.cse.cuhk.edu.hk/pregelplus.
7.1 Effectiveness of Mirroring
Figure 12 reports the performance gain by mirroring. We measure the gain by comparing with (1) Pregel+ without both mirroring and combiner, denoted by Pregel-noMC; (2) Pregel+ with combiner but without mirroring, denoted by Pregel-noM; and (3) GPS [18] with and without LALP. The request-respond technique is not applied in Pregel+ for this set of experiments. As a reference, we also report the performance of Giraph 1.0.0 [1] (with combiner) and GraphLab 2.2 (which includes PowerGraph [5]).
We test the mirroring thresholds 1, 10, 100, 1000, and the one automatically set by the cost model given by Theorem 2 (which is 199, 165, 62 and 126 for WebUK, Twitter, LJ and BTC, respectively). But for the USA road network, the maximum vertex degree is only 9 and thus we do not apply mirroring with large thresholds. For GPS, we follow [8] and fix the threshold of LALP at 100. This is a reasonable choice, since [8] reports that this threshold achieves good performance in general, and we find that the best performance after tuning the threshold is very close to the performance when the threshold is 100. We report the preprocessing time of constructing mirrors for Pregel+ and that of LALP for GPS in rows marked by "Preproc Time". We also report the number of messages sent by Pregel+ and GPS (note that Giraph does not report the number of messages, but the number should be the same as that of Pregel-noMC and Pregel-noM; GraphLab does not employ message passing).
We ran PageRank on the three directed graphs, and HashMin on the two undirected graphs in Figure 11. For PageRank computation, we use an aggregator to check whether every vertex changes its PageRank value by less than 0.01 after each superstep, and terminate if so. The computation takes 89, 89 and 96 supersteps on WebUK, Twitter and LJ, respectively, before convergence. We do not run GraphLab in asynchronous mode for PageRank, since its convergence condition is different from that of the synchronous version and hence leads to different PageRank results.
Mirroring in Pregel+. As Figure 12 shows, mirroring significantly improves the performance of Pregel-noM, in terms of the reduction in both running time and message number. The improvement is particularly obvious for Twitter, LJ and BTC, which have highly skewed degree distributions. Thus, the result also demonstrates the effectiveness of mirroring in workload balancing.
Mirroring is not so effective for PageRank on WebUK, for which Pregel-noM has the best performance. The number of messages is only slightly decreased when mirroring is applied, and yet Pregel+ with mirroring is still slower than Pregel-noM. This is because messages sent through Ch_mir are intercepted by mirrors, which incurs additional cost. Since the degree of the majority of the vertices in WebUK is not very high, mirroring does not significantly reduce the number of messages, and thus the additional cost of Ch_mir does not pay off.
The results also show that the mirroring threshold given by our cost model achieves either the best performance, or performance close to that of the best threshold tested. The one-off preprocessing time required to construct the mirrors is also short compared with the computation time.
Comparison with Other Systems. Figure 12 shows that Pregel+ without mirroring (i.e., Pregel-noM) is already faster than both Giraph and GraphLab, which verifies that our implementation is efficient, and thus the performance gain by mirroring is not an overclaimed improvement over a slow implementation.
Compared with GPS, the reduction in both message number and running time achieved by the integration of mirroring and combiner in Pregel+ is significantly greater than that achieved by LALP alone in GPS, which can be observed by comparing (1) Pregel+ with mirroring vs. Pregel-noMC, and (2) GPS with LALP vs. GPS without LALP. In contrast to the claim in [18] that message combining is not effective, our result clearly demonstrates the benefits of integrating mirroring and combiner, and hence highlights the importance of our theoretical analysis on the tradeoff between mirroring and message combining (i.e., Theorem 2).
However, we notice that GPS is sometimes faster than Pregel+ even though many more messages are exchanged. We found this hard to explain, and so we studied the code of GPS to explore the reason, which we explain below. GPS requires that vertex IDs be integers that are contiguous starting from 0, while other systems allow vertex IDs to be of any user-specified type as long as a hash function is provided (for calculating the ID of the worker that a vertex resides in). As a result of the dense ID representation, each worker in GPS simply maintains the incoming message buffers of its vertices in an array, and when a worker receives a message targeted at a vertex v, the message is put into v's incoming message buffer, whose position in the array can be directly computed from v's ID. On the contrary, systems like Pregel+ and Giraph need to look up the buffer from a hash table using v's ID as the key, which incurs extra cost for each message exchanged.
We remark that there are good reasons to allow vertex IDs to take an arbitrary type, rather than to hard-code them as contiguous integers. For example, the Pregel algorithm in [24] for computing bi-connected components constructs an auxiliary graph from the input graph, where each vertex of the auxiliary graph corresponds to an edge of the input graph. While we can simply use an integer pair as the vertex ID in Pregel+, using GPS requires extra effort from programmers to relabel the vertices of the auxiliary graph with contiguous integer IDs, which can be costly for a large graph. We note that, if desired, one can easily implement GPS's dense vertex ID representation in Pregel+ to further improve the performance of certain algorithms, but this is not the focus of our work, which studies message reduction techniques.
7.2 Effectiveness of the Request-Respond Technique
Figure 13 reports the performance gained by the request-respond technique. We test the three algorithms in Section 3 to which the request-respond technique is applicable: attribute broadcast, SV and minimum spanning forest. We also include Giraph and GPS as references. We do not include GraphLab since these algorithms cannot be easily implemented in GraphLab (e.g., it is not clear how a vertex can communicate with a non-neighbor, as in SV and minimum spanning forest).
The results show that Pregel+ with request-respond, denoted by ReqResp, uses significantly fewer messages. For example, for attribute broadcast on WebUK, ReqResp reduces the message number from 11,015 million to only 2,699 million. ReqResp also records the shortest running time except in a few cases where GPS is faster, due to the same reason given in Section 7.1. Another exception is when computing minimum spanning forest on USA, where Pregel+ is faster without request-respond. This is because vertices in USA have very low degrees, rendering the request-respond technique ineffective, and the additional computational overhead is not paid off by the reduction in message number.
8 Conclusions
We presented two techniques to reduce the amount of communication and to eliminate skewed communication workload. The first technique, mirroring, eliminates communication bottlenecks caused by high vertex degrees, and is transparent to programmers. The second technique is a new request-respond paradigm, which eliminates bottlenecks caused by program logic, and also simplifies the programming of many Pregel algorithms. Our experiments on large real-world graphs verified that our techniques are effective in reducing both the communication cost and the overall computation time.
References
[1] Apache Giraph. http://giraph.apache.org/.
[2] Z. Cai, Z. J. Gao, S. Luo, L. L. Perez, Z. Vagena, and C. M. Jermaine. A comparison of platforms for implementing and running very large scale machine learning algorithms. In SIGMOD, pages 1371–1382, 2014.
[3] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.
[4] B. Elser and A. Montresor. An evaluation study of BigData frameworks for graph processing. In BigData Conference, pages 60–67, 2013.
[5] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, pages 17–30, 2012.
[6] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How well do graph-processing platforms perform? An empirical performance evaluation and analysis. In IPDPS, 2013.
[7] Z. Gyöngyi, H. Garcia-Molina, and J. O. Pedersen. Combating web spam with TrustRank. In VLDB, pages 576–587, 2004.
[8] M. Han, K. Daudjee, K. Ammar, M. T. Özsu, X. Wang, and T. Jin. An experimental comparison of Pregel-like graph processing systems. PVLDB, 7(12):1047–1058, 2014.
[9] G. Jeh and J. Widom. Scaling personalized web search. In WWW, pages 271–279, 2003.
[10] Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: A system for dynamic load balancing in large-scale graph processing. In EuroSys, pages 169–182, 2013.
[11] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, 5(8):716–727, 2012.
[12] Y. Lu, J. Cheng, D. Yan, and H. Wu. Large-scale distributed graph computing systems: An experimental evaluation. PVLDB, 8(3), 2015.
[13] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD Conference, pages 135–146, 2010.
[14] A. Mislove, M. Marcon, P. K. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In SIGCOMM Conference on Internet Measurement, pages 29–42, 2007.
[15] J. Niu, J. Peng, C. Tong, and W. Liao. Evolution of disconnected components in social networks: Patterns and a generative model. In IPCCC, pages 305–313, 2012.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[17] V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. D. Sarma. Finding connected components in MapReduce in logarithmic rounds. In ICDE, pages 50–61, 2013.
[18] S. Salihoglu and J. Widom. GPS: A graph processing system. In SSDBM, page 22, 2013.
[19] S. Salihoglu and J. Widom. Optimizing graph algorithms on Pregel-like systems. PVLDB, 7(7):577–588, 2014.
[20] N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In SIGMOD Conference, pages 979–990, 2014.
[21] Z. Shang and J. X. Yu. Catch the wind: Graph workload balancing on cloud. In ICDE, pages 553–564, 2013.
[22] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. J. Algorithms, 3(1):57–67, 1982.
[23] D. Yan, J. Cheng, Y. Lu, and W. Ng. Blogel: A block-centric framework for distributed computation on real-world graphs. PVLDB, 7(14):1981–1992, 2014.
[24] D. Yan, J. Cheng, K. Xing, Y. Lu, W. Ng, and Y. Bu. Pregel algorithms for graph connectivity problems with performance guarantees. PVLDB, 7(14):1821–1832, 2014.