Quegel: A General-Purpose Query-Centric Framework for Querying Big Graphs

01/25/2016 · by Da Yan et al. · University of Waterloo and The Hong Kong University of Science and Technology

Pioneered by Google's Pregel, many distributed systems have been developed for large-scale graph analytics. These systems expose the user-friendly "think like a vertex" programming interface to users, and exhibit good horizontal scalability. However, these systems are designed for tasks where the majority of graph vertices participate in computation, but are not suitable for processing light-workload graph queries where only a small fraction of vertices need to be accessed. The programming paradigm adopted by these systems can seriously under-utilize the resources in a cluster for graph query processing. In this work, we develop a new open-source system, called Quegel, for querying big graphs, which treats queries as first-class citizens in the design of its computing model. Users only need to specify the Pregel-like algorithm for a generic query, and Quegel processes light-workload graph queries on demand using a novel superstep-sharing execution model to effectively utilize the cluster resources. Quegel further provides a convenient interface for constructing graph indexes, which significantly improve query performance but are not supported by existing graph-parallel systems. Our experiments verified that Quegel is highly efficient in answering various types of graph queries and is up to orders of magnitude faster than existing systems.

1 Introduction

Big graphs are common in real-life applications today; for example, online social networks and mobile communication networks have billions of users, and web graphs and the Semantic Web can be even bigger. Processing such big graphs typically requires a special infrastructure, and the most popular ones are Pregel [24] and Pregel-like systems [1, 9, 10, 22, 29, 36]. In a Pregel-like system, a programmer "thinks like a vertex" and only needs to specify the behavior of one vertex, and the system automatically schedules the execution of the specified computing logic on all vertices. The system also handles fault tolerance and scales out without extra effort from programmers.

Existing Pregel-like systems, however, are designed for heavy-weight graph computation (i.e., analytic workloads), where most or all of a graph is accessed. For example, Pregel's PageRank algorithm [24] accesses the whole graph in each iteration. However, many real-world applications involve various types of graph querying, whose computation is light-weight in the sense that only a small portion of the input graph needs to be accessed. For example, in our collaboration with researchers from one of the world's largest online shopping platforms, we have seen huge demand for querying different aspects of big graphs for all sorts of analysis to boost sales and improve customer experience. In particular, they need to frequently examine the shortest-path distance between users in a large network extracted from their online shopping data. While Pregel's single-source shortest-path (SSSP) algorithm [24] can be applied here, much of its computation would be wasted because only the paths between the queried users are of interest. Instead, it is much more efficient to apply point-to-point shortest-path (PPSP) queries, which traverse only a small part of the input graph. We also worked with a large telecom operator, and our experience is that graph queries (with light-weight workloads) are an integral part of analyzing massive mobile phone and SMS networks.

The importance of querying big graphs has also been recognized in some recent work [18], where two kinds of systems are identified: (1) systems for offline graph analytics (such as Pregel and GraphLab), and (2) systems for online graph querying, including Horton [30], G-SPARQL [28] and Trinity [32]. However, Horton and G-SPARQL are tailor-made only for specific types of queries. Trinity supports graph query processing, but compared with Pregel, its main advantage is that it keeps the input graph in main memory so that the graph does not have to be re-loaded for each query. The Trinity paper [32] also argues that indexing is too expensive for big graphs, and thus Trinity does not support indexing. The VLDB 2015 conference also hosted a workshop, "Big-O(Q): Big Graphs Online Querying", but the work presented there only studies algorithms for specific types of queries. So far, there has been no general-purpose framework that allows users to easily design distributed algorithms for answering various types of queries on big graphs.

One may, of course, use existing vertex-centric systems to process queries on big graphs, but these systems are not suitable for processing light-weight graph queries. To illustrate, consider processing PPSP queries on a 1.96-billion-edge Twitter graph used in our experiments. To answer one query by bidirectional breadth-first search (BiBFS) in our cluster, Giraph takes over 100 seconds, which is intolerable for a data analyst who wants to examine the distance between users in an online social network with short response time. To process queries on demand using an existing vertex-centric system, a user has the following two options: (1) to process queries one after another, which leads to a low throughput since the communication workload of each query is usually too light to fully utilize the network bandwidth and many synchronization barriers are incurred; or (2) to write a program to explicitly process a batch of queries in parallel, which is not easy for users and may not fully utilize the network bandwidth towards the end of the processing, since most queries may have finished their processing and only a small number of queries are still being processed. It is also not clear how to use graph indexing for query processing in existing vertex-centric systems.

To address the limitations of existing systems in querying big graphs, we developed a distributed system, called Quegel, for large-scale graph querying. We implemented the Hub-Labeling approach [15] in Quegel, and it can achieve interactive speeds for PPSP querying on the same Twitter graph mentioned above. Quegel treats queries as first-class citizens: users only need to write a Pregel-like algorithm for processing a generic query, and the system automatically schedules the processing of multiple incoming queries on demand. As a result, Quegel has a wide application scope, since any query that can be processed by a Pregel-style vertex-centric algorithm can be answered by Quegel, and much more efficiently. Under this query-centric design, Quegel adopts a novel superstep-sharing execution model to effectively utilize the cluster resources, and an efficient mechanism for managing vertex states that significantly reduces memory consumption. Quegel further provides a convenient interface for constructing indexes to improve query performance. To our knowledge, Quegel is the first general-purpose programming framework for querying big graphs at interactive speeds on a distributed cluster. We have successfully applied Quegel to process five important types of graph queries (to be presented in Section 5), and Quegel achieves performance up to orders of magnitude faster than existing systems.

The rest of this paper is organized as follows. We review related work in Section 2. In Section 3, we highlight important concepts in the design of Quegel, and key implementation issues. We introduce the programming model of Quegel in Section 4, and describe some graph querying problems as well as their Quegel algorithms in Section 5. Finally, we evaluate the performance of Quegel in Section 6 and conclude the paper in Section 7.

2 Related Work

We first review existing vertex-centric graph-parallel systems. We consider an input graph G = (V, E) stored on the Hadoop distributed file system (HDFS), where each vertex v is associated with its adjacency list (i.e., v's neighbors). If G is undirected, we denote v's neighbors by Γ(v), while if G is directed, we denote v's in-neighbors and out-neighbors by Γ_in(v) and Γ_out(v), respectively. Each vertex v also has a value a(v) storing v's vertex value. Graph computation is run on a cluster of workers, where each worker is a computing thread/process, and a machine may run multiple workers.

Pregel [24]. Pregel adopts the bulk synchronous parallel (BSP) model. It distributes the vertices to workers in a cluster, where each vertex is associated with its adjacency list. A Pregel program computes in iterations, where each iteration is called a superstep. Pregel requires users to specify a user-defined function (UDF) compute(.). In each superstep, each active vertex v calls v.compute(msgs), where msgs is the set of incoming messages sent from other vertices in the previous superstep. In v.compute(msgs), v may process msgs and update a(v), send new messages to other vertices, and vote to halt (i.e., deactivate itself). A halted vertex is reactivated if it receives a message in a subsequent superstep. The program terminates when all vertices are deactivated and no new message is generated. Finally, the results (e.g., a(v)) are dumped to HDFS.

Pregel also allows users to implement an aggregator for global communication. Each vertex can provide a value to an aggregator in compute(.) in a superstep. The system aggregates those values and makes the aggregated result available to all vertices in the next superstep.

Distributed Vertex-Centric Systems. Many Pregel-like systems have been developed, including Giraph [1], GPS [29], GraphX [10], and Pregel+ [36]. These systems introduce new features; for example, GPS proposed mirroring high-degree vertices on other machines, and Pregel+ proposed integrating mirroring with message combining, as well as a request-respond mechanism, to reduce the communication workload. While these systems strictly follow the synchronous data-pushing model of Pregel, GraphLab [22] adopts an asynchronous data-pulling model, where each vertex actively pulls data from its neighbors rather than passively receiving messages. A subsequent version of GraphLab, called PowerGraph [9], partitions the graph by edges rather than by vertices to achieve more balanced workloads. While the asynchronous model leads to faster convergence for some tasks such as random walks, [23] and [11] reported that GraphLab's asynchronous mode is generally slower than synchronous execution, mainly due to the expensive cost of locking/unlocking.

Single-PC Vertex-Centric Systems. There are also other vertex-centric systems, such as GraphChi [19] and X-Stream [27], designed to run on a single PC by manipulating a big graph on disk. However, these systems need to scan the whole graph on disk once for each iteration of computation even if only a small fraction of vertices need to perform computation, which is inefficient for light-weight querying workloads.

Weaknesses of Existing Systems for Graph Querying. In our experience of working with researchers in e-commerce companies and telecom operators, we found that existing vertex-centric systems cannot support query processing efficiently, nor do they provide a user-friendly programming interface to do so. If we write a vertex-centric algorithm for a generic query, we have to run a job for every incoming query. As a result, each superstep transmits only the few messages of one light-weight query, which cannot fully utilize the network bandwidth. Moreover, there are a lot of synchronization barriers, one for each superstep of each query, which is costly. In addition, some systems such as Giraph bind graph loading with graph computation (i.e., processing a query in our context) for each job, and the loading time can significantly degrade the performance.

An alternative to the one-query-at-a-time approach is to hard-code a vertex-centric algorithm to process a batch of k queries, where k can be an input argument. However, in the compute(.) function, one has to differentiate the incoming messages and/or aggregators of different queries and update the vertex values accordingly. In addition, existing vertex-centric frameworks check the stop condition for the whole job, and users need to take care of additional details such as when a vertex can be deactivated (e.g., when it should be halted for all the queries), which should really be handled by the system itself. More critically, the one-batch-at-a-time approach does not solve the problem of low network bandwidth utilization, since in the later stage, when most queries have finished their processing, only a small number of queries (or stragglers) are still being processed, and hence the number of messages generated is too small to sufficiently utilize the network bandwidth.

The single-PC systems are clearly not suitable for light-weight querying workloads since they need to scan the whole graph on disk once for each iteration. Other existing graph databases such as Neo4j [25] and HyperGraphDB [14] support basic graph operations and simple graph queries, but they are not designed to handle big graphs. Our experiments also verified the inefficiency of single-PC systems and graph databases in querying big graphs (see Section 6). There are other systems, e.g., the block-centric system Blogel [35] and a recent general-purpose system Husky [40], which achieve remarkable performance on offline graph analytics, but are not designed for graph querying.

The above discussion motivates the need for a general-purpose graph processing system that treats queries as first-class citizens, provides a user-friendly interface so that users can easily write their program for a generic query, and processes queries on demand efficiently. Our Quegel system, to be presented in the following sections, fulfils this need.

3 The Quegel System

A Quegel program starts by loading the input graph G, i.e., distributing the vertices into the main memory of different workers in a cluster. If users enable indexing, a local index is built from the vertices on each worker. After G is loaded (and the index is constructed), Quegel receives and processes incoming queries using the computing logic specified by a vertex UDF compute(.), as in Pregel. Users may type their queries from a client console, or submit a batch of queries in a file. After a query is evaluated, users may have Quegel print the answer to the console, or dump the answer to HDFS if its size is large (e.g., when the answer contains many subgraphs).

3.1 Execution Model: Superstep-Sharing

To address the weaknesses of existing systems presented in Section 2, we need to consider a new computation model. We first discuss the inherent hardness of querying a big graph in general, which influences the design of our model.

Hardness of Big Graph Querying and Our Design Objective. We consider the processing of a large graph that is stored at distributed sites, so that the processing of each query requires network communication. Since the message transmission of each superstep incurs a round-trip delay, it is difficult (if not unrealistic) for distributed vertex-centric computation on n machines to achieve response times comparable to those of single-machine algorithms on a smaller graph (e.g., one that is n times smaller). Therefore, our goal is to answer a query at interactive speed, e.g., in a second to at most a few seconds depending on the complexity of processing the given query. We remark that even in CANDS [39], a specialized distributed system dedicated to shortest-path querying on big graphs, a query can take many seconds to answer, while as we shall see in Section 6, our general-purpose Quegel system can process multiple PPSP queries per second on a graph with billions of edges.

Moreover, due to the sheer size of a big graph, the total workload of a batch of queries can be huge even if each query accesses just a fraction of the graph. We remark that the workload of distributed graph computation is significantly different from that of traditional database applications. For example, to query the balance of a bank account, the balance value can be quickly accessed from a centralized account table using a B-tree index on the account number, and it is possible to achieve both high throughput and low latency. However, in distributed graph computation, the complicated topology of connections among vertices (which is not present among bank accounts) results in higher-complexity algorithms and heavier workloads. Specifically, due to the poor locality of graph data, each query usually accesses vertices spread across the whole big graph over distributed sites, and these vertices need to communicate with each other through the network.

The above discussion shows that there is a latency-throughput tradeoff where one can only expect either interactive speed or high throughput but not both. As a result, our design objective focuses on developing a model for the following two scenarios of querying big graphs, both of which are common in real life applications.

Scenario (i): Interactive Querying, where a user interacts with Quegel by submitting a query, checking the query results, refining the query based on the results and re-submitting the refined query, until the desired results are obtained. As an example, a data analyst may use interactive PPSP queries to examine the distance between two users of interest in a social network. Another example is given by the XML keyword querying application (to be presented in Section 5.2). In such applications, there are only one or several users (e.g., a data scientist) analyzing a big graph by posing interactive queries, but each query should be answered in a second or several seconds. No existing vertex-centric system can achieve such query latency on a big graph.

Scenario (ii): Batch Querying, where batches of queries are submitted to Quegel, and they need to be answered within a reasonable amount of time. An example of batch querying is given by the vertex-pair sampling application mentioned in Section 1 for estimating graph metrics, where a large number of PPSP queries need to be answered. Quegel achieves throughput 186 and 38.6 times higher than Giraph and GraphLab, respectively, for processing PPSP queries, and thus allows the graph metrics to be estimated more accurately.

Superstep-Sharing Model. We propose a superstep-sharing execution model to meet the requirements of both interactive querying and batch querying. Specifically, Quegel processes graph queries in iterations called super-rounds. In a super-round, every query that is currently being processed proceeds in its computation by one superstep; from the perspective of an individual query, Quegel processes it superstep by superstep as in Pregel. Intuitively, a super-round in Quegel is like many queries sharing the same superstep. For a query q whose computation takes n supersteps, Quegel processes it in (n + 1) super-rounds, where the last super-round reports or dumps the results of q.
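To make the super-round scheduling concrete, the following self-contained C++ sketch simulates the loop just described: up to C queries are in flight, every in-flight query advances by exactly one superstep per super-round, and finished queries leave the pool. The Query struct and its fixed step counts are illustrative stand-ins for real query state; this is a sketch of the scheduling idea, not the actual Quegel scheduler.

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <deque>
    #include <vector>

    // Hypothetical simplified model of a query: it just needs a fixed number of
    // supersteps to finish (a real query runs compute(.) on its active vertices).
    struct Query { int id; int steps_left; };

    int main() {
        std::deque<Query> pending = {{1, 3}, {2, 5}, {3, 2}, {4, 4}, {5, 3}};
        std::vector<Query> active;
        const std::size_t C = 2;   // capacity: at most C queries per super-round
        int super_round = 0;

        while (!pending.empty() || !active.empty()) {
            // Fetch new queries at the start of the super-round, up to the capacity.
            while (active.size() < C && !pending.empty()) {
                active.push_back(pending.front());
                pending.pop_front();
            }
            ++super_round;
            // Every in-flight query advances by exactly one superstep; a single
            // shared synchronization barrier would follow in the real system.
            for (auto& q : active) {
                --q.steps_left;
                std::printf("super-round %d: query %d runs one superstep\n",
                            super_round, q.id);
            }
            // Finished queries report/dump their results and release resources.
            active.erase(std::remove_if(active.begin(), active.end(),
                                        [](const Query& q) { return q.steps_left == 0; }),
                         active.end());
        }
    }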

Quegel allows users to specify a capacity parameter C, so that in any super-round, there are at most C queries being processed. New incoming queries are appended to a query queue, and at the beginning of a super-round, Quegel fetches as many queries from the queue as possible to start their processing, as long as the capacity constraint permits. During the computation of a super-round, different workers run in parallel, while each worker processes (its part of) the evaluation of the queries serially. For each query q, if q has not finished its computation, the worker serially calls compute(.) on each of its vertices that are activated by q; while if q has already finished its computation, the worker reports or dumps the query results and releases the resources consumed by q.

For the processing of each query, the supersteps are numbered. Different queries may have different superstep numbers in the same super-round, for example, if the queries entered the system in different super-rounds. Messages (and aggregators) of all queries are synchronized together at the end of a super-round, to be used in the next super-round.

For interactive querying where queries are posed and processed in sequence, the superstep-sharing model processes each individual query with all the cluster resources just as in Pregel. However, since Quegel decouples the costly graph loading and dumping from query processing, and supports convenient construction and adoption of graph indexes, the query latency is significantly reduced.

Figure 1: Load balancing
Figure 2: Illustration of context objects

For batch querying, while the workload of each individual query is light, superstep-sharing combines the workloads of up to C queries into one batch in each super-round to achieve higher resource utilization. Compared with answering each query independently as in existing graph-parallel systems, Quegel's superstep-sharing model supports much more efficient query processing, since only one message (and/or aggregator) synchronization barrier is required in each super-round instead of up to C synchronization barriers. We remark that the synchronization cost is relatively significant compared with the light workload of processing each single query. In addition, by sending the messages of many queries in one batch, superstep-sharing also better utilizes the network bandwidth.

Superstep-sharing also leads to more balanced workload. As an illustration, Figure 1 shows the execution of two queries for one superstep in a cluster of two workers. The first query (darker shading) takes 2 time units on Worker 1 and 4 time units on Worker 2, while the second query (lighter shading) takes 4 time units on Worker 1 and 2 time units on Worker 2. When the queries are processed individually, the first query needs to be synchronized before the second query starts to be processed. Thus, 8 time units are required in total. Using superstep-sharing, only one synchronization is needed at the end of the super-round, thus requiring only 6 time units.

One issue that remains is how to set the capacity parameter C. Obviously, the larger the number of queries being simultaneously processed, the more fully the network bandwidth is utilized. But the value of C should be limited by the available RAM space. The input graph G consumes O(|V| + |E|) RAM space, while each query q consumes additional space proportional to |V_q|, where V_q denotes the set of vertices accessed by q. Thus, the total memory footprint should not exceed the available RAM space, though in most cases this is not a concern since |V_q| is much smaller than |V|. While setting a larger C tends to improve the throughput, the throughput converges once the network bandwidth is saturated. In a cluster such as ours, which is connected by Gigabit Ethernet, we found that the throughput usually converges when C is increased to 8 (for the graph queries we tested), which indicates that Quegel has already fully utilized the network bandwidth and shows the high complexity of querying a big graph.

3.2 System Design

Quegel manages three kinds of data: (i) V-data, whose value only depends on a vertex v, such as v's adjacency list. (ii) VQ-data, whose value depends on both a vertex v and a query q. For example, the vertex value is query-dependent: in a PPSP query q = (s, t), the vertex value of v keeps the estimated shortest distance from s to v, denoted by d(s, v), which depends on the source vertex s. Since the vertex value is defined w.r.t. a query q, we denote it by a_q(v). Other examples of VQ-data include the active/halted state of a vertex v for q, and the incoming message buffer of v (i.e., the input to compute(.)). (iii) Q-data, whose value only depends on a query q. For example, at any moment, each query q has a unique superstep number. Other examples of Q-data include the query content (e.g., (s, t) for a PPSP query), the outgoing message buffers, aggregated values, and control information that decides whether the computation should terminate.

Let Q be the set of queries currently being processed by Quegel, and let qid(q) be the query ID of each query q ∈ Q.

In Quegel, each worker maintains a hash table, which we call q_table, to keep the Q-data of each query in Q. The Q-data of a query q can be obtained from q_table by providing the query ID qid(q), and we denote it by q_table[qid(q)]. When a new query q is fetched from the query queue to start its processing at the beginning of a super-round, the Q-data of q is inserted into the q_table of every worker; and after q reports or dumps its results in its final super-round, the Q-data of q is removed from the q_table of every worker.

Each worker also maintains an array of vertices, varray, each element of which maintains the V-data and VQ-data of a vertex v that is distributed to that worker. The VQ-data of a vertex v is organized by a look-up table vq_table(v), where the VQ-data related to a query q can be obtained by providing the query ID qid(q), and we denote it by vq_table(v)[qid(q)]. Since every vertex v needs to maintain a table vq_table(v), we implement it using a space-efficient balanced binary search tree rather than a hash table. The data kept by each table entry include the vertex value a_q(v), the active/halted state of v (in the processing of q), and the incoming message buffer of v (for q).

Unlike the one-batch-at-a-time approach of applying existing vertex-centric systems, where each vertex needs to maintain k vertex values no matter whether it is accessed by a query, we design Quegel to be more space-efficient. We require that a vertex v is allocated a state for a query q only if q accesses v during its processing, which is achieved by the following design. When vertex v is activated for the first time during the processing of q, the VQ-data of v for q is initialized and inserted into vq_table(v). After a query q reports or dumps its results in its final super-round, the VQ-data of q (i.e., vq_table(v)[qid(q)]) is removed from vq_table(v) of every vertex v accessed by q.

Each worker also maintains a hash table, which we call id2pos, such that the position pos of a vertex element in varray can be obtained by providing the vertex ID of v. We denote the obtained vertex element by varray[pos]. The table id2pos is useful in two places: (1) when a message targeted at a vertex v is received, the system obtains the incoming message buffer of v from varray[pos], where pos is computed as id2pos[v], and then appends the message to the buffer; (2) when an initial vertex v is activated using its vertex ID at the beginning of a query q, the system initializes the VQ-data of v for q, and inserts it into vq_table(v), which is obtained from varray[pos] where pos is computed as id2pos[v]. We shall see in Section 4 how users can activate the (usually small) initial set of vertices for processing a query without scanning all vertices.
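A minimal sketch of the per-worker storage just described, with our own illustrative names (VertexEntry, WorkerStore) rather than Quegel's internal ones: varray holds the vertex elements and id2pos maps a vertex ID to its position, which supports both message delivery and initial activation.

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    struct VertexEntry {
        long long id;
        std::vector<std::string> msg_buffer;  // incoming messages for the current query
        bool active = false;
    };

    struct WorkerStore {
        std::vector<VertexEntry> varray;
        std::unordered_map<long long, std::size_t> id2pos;

        // (1) A message targeted at vertex `dst` arrives over the network.
        void deliver(long long dst, const std::string& msg) {
            auto it = id2pos.find(dst);
            if (it != id2pos.end())
                varray[it->second].msg_buffer.push_back(msg);
        }

        // (2) Activate an initial vertex by ID at the start of a query; returns
        // false if the vertex is not on this worker (cf. get_vpos returning -1).
        bool activate_initial(long long vid) {
            auto it = id2pos.find(vid);
            if (it == id2pos.end()) return false;
            varray[it->second].active = true;
            return true;
        }
    };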

An important feature of Quegel is that it only requires a user to specify the computing logic for a generic vertex and a generic query; the processing of concrete queries is handled by Quegel and is totally transparent to users. For this purpose, each worker maintains two global context objects: (i) a query context q_ctx, which keeps the Q-data of the query that the worker is currently processing; and (ii) a vertex context v_ctx, which keeps the VQ-data of the current vertex that the worker is processing for the current query. In a super-round, when a worker starts to process a query q, it first obtains q_table[qid(q)] and assigns it to q_ctx, so that when a user accesses the Q-data of the current query in the UDF compute(.) (e.g., to get the superstep number or to append messages to outgoing message buffers), the system accesses q_ctx directly without looking it up from q_table. Moreover, during the processing of q, before the worker calls compute(.) on a vertex v, it first obtains vq_table(v)[qid(q)] and assigns it to v_ctx, so that any access or update to the VQ-data of v in compute(.) (e.g., obtaining a_q(v) or letting v vote to halt) operates on v_ctx directly without looking it up from vq_table(v).

As an illustration, consider the example shown in Figure 2, where there are 3 queries being evaluated and the computation proceeds for 3 supersteps. Moreover, we assume that 4 vertices call compute(.) in each superstep of each query. For example, when a worker processes a superstep of query q1, q_ctx is set to q_table[qid(q1)] before any vertex is evaluated for q1; and when the evaluation arrives at a vertex v, v_ctx is set to vq_table(v)[qid(q1)] before v.compute(.) is called. Figure 2 also shows a simplified code of compute(.) for shortest-path computation: inside compute(.), every access to a_q(v) uses the value stored in v_ctx directly, while the superstep number is obtained from q_ctx directly.

One benefit of using the context objects q_ctx and v_ctx is that, due to the access-pattern locality of superstep-sharing, repetitive lookups of the tables q_table and vq_table(v) are avoided. Another benefit is that users can write their programs exactly as in Pregel (e.g., to access a_q(v) and the superstep number), and the processing of concrete queries remains transparent to users.
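The following self-contained sketch illustrates how the two context objects cache the results of the q_table and vq_table(v) lookups once per query and once per vertex; the field layouts and names are our own simplification, not Quegel's actual classes.

    #include <map>
    #include <unordered_map>
    #include <vector>

    struct QData  { int superstep = 0;  /* query content, aggregator, ... */ };
    struct VQData { int value = 0;      /* a_q(v), active flag, message buffer, ... */ };

    struct Worker {
        std::unordered_map<int, QData> q_table;        // Q-data keyed by query ID
        // One VQ-data lookup table per vertex (a balanced BST, as in the text).
        std::vector<std::map<int, VQData>> vq_tables;

        QData*  q_ctx = nullptr;  // Q-data of the query currently being processed
        VQData* v_ctx = nullptr;  // VQ-data of the vertex currently being processed

        void process_query_in_super_round(int qid) {
            q_ctx = &q_table[qid];                     // looked up once per (worker, query)
            for (auto& vq_table : vq_tables) {
                auto it = vq_table.find(qid);
                if (it == vq_table.end()) continue;    // vertex not touched by this query
                v_ctx = &it->second;                   // looked up once per (vertex, query)
                // compute(.) would now read and write *v_ctx and *q_ctx directly.
            }
        }
    };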

4 Programming Interface

The programming interface of Quegel incorporates many unique features designed for querying workloads. For example, the interface allows users to construct distributed graph indexes during graph loading. The interface also allows users to activate only an initial (usually small) set of vertices, denoted by V_init(q), for processing a query q without checking all vertices. Note that we cannot activate V_init(q) during graph loading because V_init(q) depends on each incoming query q.

Quegel defines a set of base classes, each of which is associated with some template arguments. To write an application program, a user only needs to (1) subclass the base classes with the template arguments properly specified, and to (2) implement the UDFs according to the application logic. We now describe these base classes.

Vertex Class. As in Pregel, the Vertex class has a UDF compute(.) for users to specify the computing logic. In compute(.), a user may call get_query() to obtain the content of the current query q. A user may also access other Q-data in compute(.), such as getting q's superstep number, sending messages (which appends messages to q's outgoing message buffers), and getting q's aggregated value from the previous superstep. Quegel also allows a vertex to call force_terminate() to terminate the computation of q at the end of the current superstep. All these operations access the Q-data fields from q_ctx directly.

The vertex class of Quegel is defined as a template class Vertex with five template arguments: (1) the first specifies the type (e.g., int) of the ID of a vertex v (which is V-data); (2) the second specifies the type of the query-dependent attribute of a vertex v, i.e., a_q(v) (which is VQ-data); (3) the third specifies the type of the query-independent attribute of a vertex v, denoted by a(v) (which is V-data). We do not hard-code the adjacency list structure, in order to provide more flexibility. For example, a user may define a(v) to include two adjacency lists, one for in-neighbors and the other for out-neighbors, which is useful for algorithms such as bidirectional BFS. Other V-data can also be included in a(v), such as vertex labels used for search-space pruning in some query processing algorithms. (4) The fourth specifies the type of the messages that are exchanged between vertices. (5) The fifth specifies the type of the content of a query. For example, for a PPSP query, the query content is a pair of vertex IDs indicating the source and target vertices. In compute(.), a user may access a(v) by calling value() and access a_q(v) by calling qvalue().

Suppose that a set of queries {q_1, q_2, ..., q_k} is being processed; then each vertex v conceptually has k query-dependent attributes a_{q_1}(v), ..., a_{q_k}(v), one for each query. Since a query normally accesses only a small fraction of all the vertices, to be space-efficient, Quegel allocates space to a_q(v) as well as other VQ-data only at the time when the vertex v is first accessed during the processing of q. Accordingly, Quegel provides a UDF init_value() for users to specify how to initialize a_q(v) when v is first accessed by q. For example, for a PPSP query q = (s, t), where a_q(v) keeps the estimated value of d(s, v), one may implement init_value() as follows: if v = s, set a_q(v) = 0; else, set a_q(v) = ∞. The state of v is always initialized to be active by the system, since when the space of the state is allocated, v is activated for the first time and should participate in the processing of q in the current superstep. Function init_value() is the only UDF of the Vertex class in addition to compute(.).

Worker Class. The Vertex class presented above is mainly for users to specify the graph computation logic. Quegel provides another base class, Worker, for specifying the input/output format and for executing the computation of each worker. Its first template argument specifies the user-defined subclass of Vertex. The second template argument is optional: if distributed indexing (to be introduced shortly) is enabled, it specifies the user-defined index class.

The Worker class has a function run(param), which implements the execution procedure of Quegel as described at the beginning of Section 3. After users define their subclasses to implement the computing logic, they call run(param) to start a Quegel job. Here, param specifies job parameters such as the HDFS path of the input graph G. During the execution, we allow each query to change the query-independent attribute a(v) of a vertex v, and when a user closes the Quegel program from the console, he/she may direct Quegel to save the changed graph (V-data only) to HDFS before freeing the memory space consumed by G.

The Worker class has four formatting UDFs, which are used (1) to specify how to parse a line of the input file into a vertex of G in main memory, (2) to specify how to parse a query string (input by a user from the console or a file) into the query content, (3) to specify how to write the information of a vertex v (e.g., a_q(v)) to HDFS after a query is answered, and (4) to specify how to write the changed V-data of a vertex to HDFS when a Quegel job terminates. The last UDF is optional, and is only useful if users enable end-of-job graph dumping.

Quegel allows each worker to construct a local index from its loaded vertices before query processing begins. We illustrate this process by considering a vertex-labeled graph where each vertex v contains text text(v), and show how to construct an inverted index on each worker W, so that given a keyword k, it returns the list of vertices on W whose text contains k. This kind of index is useful in XML keyword search [21, 45], subgraph pattern matching [7, 8], and graph keyword search [13, 26]. Specifically, recall that each worker in Quegel maintains its vertices in an array varray. If indexing is enabled, a UDF load2Idx(v, pos) will be called to process each vertex v in varray immediately after graph loading, where pos is v's position in varray. To construct inverted indexes in Quegel, a user may specify the index class as a user-defined inverted index class, and implement load2Idx(v, pos) to add pos to the inverted list of each keyword in text(v). There are also indices that cannot be constructed simply from local vertices, and we shall see how to handle such an application in Quegel in Section 5.1.

When a query q is first scheduled for processing, each worker calls a UDF init_activate() to activate only the relevant vertices specified by users. For example, in a PPSP query q = (s, t), only s and t are activated initially; while for querying a vertex-labeled graph, only those vertices whose text contains at least one keyword in the query are activated. Inside init_activate(), one may call get_vpos(vertexID) to get the position of a vertex in varray (which actually looks up the hash table id2pos of each worker), and then call activate(pos) to activate the vertex. For example, to activate s in a PPSP query q = (s, t), a user may specify init_activate() to first call get_vpos(s) to return s's position pos. If s is on the current worker, pos will be returned and one may then call activate(pos) to activate s in init_activate(). If s is not on the current worker, get_vpos(s) returns -1 and no action needs to be performed in init_activate(). For querying a vertex-labeled graph, a user may specify init_activate() to first get the positions of the keyword-matched vertices from the inverted index, and then activate them using activate(pos).

Other Base Classes. Quegel also provides other base classes such as Combiner and Aggregator, for which users can subclass them to specify the logic of message combiner [24] and aggregator [24].

5 Applications

To demonstrate the generality of Quegel’s computing model for querying big graphs, we have implemented distributed algorithms for five important types of graph queries in Quegel, including (1) PPSP queries, (2) XML keyword queries, (3) terrain shortest path queries, (4) point-to-point (P2P) reachability queries, and (5) graph keyword queries. Among them, (1), (3) and (4) only care about the graph topology, while (2) and (5) also care about the text information on vertices and edges. We now present the five applications and their Quegel solutions.

5.1 PPSP Queries

We consider a PPSP query defined as follows. Given two vertices s and t in an unweighted graph G, find the minimum number of hops from s to t in G, denoted by d(s, t). We focus on unweighted graphs since most large real graphs (e.g., social networks and web graphs) are unweighted. Moreover, we are only interested in reporting d(s, t), although our algorithms can be easily modified to output the actual shortest path(s).

5.1.1 Algorithms without Indexing

Breadth-First Search (BFS). The simplest way of answering a PPSP query q = (s, t) is to perform BFS from s until the search reaches t. In this algorithm, a_q(v) is specified to be the current estimate of d(s, v), which we denote by d̂(s, v) in our discussion for simplicity. The UDF init_activate() of the user-defined Worker subclass should activate s at the beginning of processing q. The vertex UDF init_value() should set d̂(s, v) to 0 if v = s, and to ∞ otherwise. Note that v calls init_value() when it is first activated during the processing of q, either by init_activate() or because some vertex sends v a message.

The vertex UDF v.compute(.) is implemented as follows. Let i be the superstep number of q. If i = 1, then v must be s since only s is activated by init_activate(); s broadcasts messages to its out-neighbors to activate them, and then votes to halt. If i > 1, one of the following is performed: (i) if d̂(s, v) = ∞, then v is visited by the BFS for the first time; in this case, v sets d̂(s, v) = i − 1, broadcasts messages to activate v's out-neighbors, and votes to halt; if v = t, v also calls force_terminate() to terminate query processing, as d(s, t) has been computed; (ii) if d̂(s, v) < ∞, then v has been activated before, and hence it votes to halt directly. Finally, only d(s, t) is reported on the console and nothing is dumped to HDFS.
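For reference, the same per-superstep logic can be written as a standalone function that simulates the supersteps with an explicit frontier; it is independent of the Quegel API and only mirrors the description above.

    #include <cstdio>
    #include <limits>
    #include <vector>

    // Superstep-style BFS for a PPSP query (s, t) on an unweighted directed graph
    // given as adjacency lists; dist plays the role of \hat{d}(s, v).
    int ppsp_bfs(const std::vector<std::vector<int>>& out_nbrs, int s, int t) {
        const int INF = std::numeric_limits<int>::max();
        std::vector<int> dist(out_nbrs.size(), INF);  // init_value: 0 for s, INF otherwise
        std::vector<int> frontier = {s};              // vertices activated in this superstep
        dist[s] = 0;
        for (int step = 1; !frontier.empty(); ++step) {
            std::vector<int> next;
            for (int v : frontier) {
                if (v == t) return dist[v];           // force_terminate(): d(s,t) found
                for (int u : out_nbrs[v])
                    if (dist[u] == INF) {             // first visit: set distance, activate
                        dist[u] = step;
                        next.push_back(u);
                    }
            }
            frontier.swap(next);                      // barrier between supersteps
        }
        return -1;                                    // t is unreachable from s
    }

    int main() {
        std::vector<std::vector<int>> g = {{1, 2}, {3}, {3}, {4}, {}};
        std::printf("d(0,4) = %d\n", ppsp_bfs(g, 0, 4));   // prints d(0,4) = 3
    }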

Bidirectional BFS (BiBFS). A more efficient algorithm is to perform forward BFS from s and backward BFS from t until some vertex is visited in both directions, and we say that such a vertex v is bi-reached. Let V_b be the set of bi-reached vertices when BiBFS stops; then d(s, t) is given by the minimum of d(s, v) + d(v, t) over all v ∈ V_b. We take the minimum since, when BiBFS stops, the value of d(s, v) + d(v, t) may differ (by one hop) among the bi-reached vertices v, and only the smallest equals d(s, t).

The Quegel algorithm for BiBFS is similar to that for BFS, with the following changes. The query-dependent vertex attribute a_q(v) now keeps a pair (d̂(s, v), d̂(v, t)). The vertex UDF init_value() sets d̂(s, v) to 0 if v = s, and to ∞ otherwise; and it sets d̂(v, t) to 0 if v = t, and to ∞ otherwise. Both s and t are activated by init_activate() initially, and two types of messages are used in order to perform the forward BFS and the backward BFS in parallel without interfering with each other. In v.compute(.), if both d̂(s, v) < ∞ and d̂(v, t) < ∞, v should call force_terminate() since v is bi-reached. Then, an aggregator is used to collect d̂(s, v) + d̂(v, t) of each bi-reached vertex v, and to obtain the smallest one as d(s, t) for reporting.

BiBFS may be inferior to BFS in the following situation. Suppose that G is undirected, and s is in a small connected component (CC) while t is in another, giant CC. BFS will terminate quickly after all vertices in the small CC are visited, while BiBFS continues computation until all vertices in the giant CC are also visited. To solve this problem, we use the aggregator to compute the numbers of messages sent by the forward BFS and the backward BFS in each superstep, respectively. If the number of messages sent in either direction is 0, the aggregator calls force_terminate() and d(s, t) = ∞ is reported.

5.1.2 Hub: An Algorithm with Indexing

Many big graphs exhibit skewed degree distributions, where some vertices (e.g., celebrities in a social network) connect to a large number of other vertices. We call such vertices hubs. During BFS, visiting a hub results in visiting a large number of vertices in the next step, rendering BFS or BiBFS inefficient. Hub-Labeling (abbreviated as Hub) [15] was proposed to address this problem. We present a distributed implementation of Hub in Quegel for answering PPSP queries. We first consider undirected graphs and then extend the method to directed graphs.

Hub picks the vertices with the highest degrees as the hubs. Let us denote the set of hubs by H; Hub pre-computes the pairwise distance d(h, h') between every pair of hubs h, h' ∈ H. Hub also associates each vertex v with a list of hubs, core(v) ⊆ H, called the core-hubs of v, and pre-computes d(v, h) for each core-hub h ∈ core(v). Here, a hub h is a core-hub of v iff no other hub exists on any shortest path between v and h. Formally, each vertex v maintains a list of hub-distance labels L(v) defined as follows: (i) if v ∈ H, L(v) = {(h, d(v, h)) : h ∈ H}; (ii) if v ∉ H, L(v) = {(h, d(v, h)) : h ∈ core(v)}.

Given a PPSP query q = (s, t), an upper bound of d(s, t) can be derived from the vertex labels. For ease of presentation, we only present the algorithm for the case where neither s nor t is a hub, while algorithms for the other cases can be similarly derived. Specifically, d(s, t) is upper-bounded by d_H(s, t) = min{d(s, h_1) + d(h_1, h_2) + d(h_2, t) : h_1 ∈ core(s), h_2 ∈ core(t)}. Obviously, if there exists a shortest path from s to t that passes through at least one hub (note that we allow h_1 = h_2), then d(s, t) is exactly d_H(s, t). However, the shortest path from s to t may not contain any hub, and thus we still need to perform BiBFS from s and t. Note that any edge on such a hub-free shortest path connects two non-hub vertices, and thus we need not continue the BFS from any hub. In other words, BiBFS is performed on the subgraph of G induced by the non-hub vertices, which does not include the high-degree hubs.

Algorithm for Querying. We now present the UDF compute(.), which applies Hub to process PPSP queries. We first assume that L(v) of each vertex v is already computed (we will see how to compute L(v) shortly), and that L(v) is kept in the query-independent attribute a(v). The algorithm for BiBFS is similar to the one discussed before, with the following changes: (i) whenever the forward or backward BFS visits a hub h, h votes to halt directly without broadcasting messages; and (ii) once a vertex v is bi-reached, v calls force_terminate() to terminate the computation, and the smaller of d_H(s, t) and the distance found through v is reported. Moreover, the BiBFS should terminate early once the superstep number guarantees that any vertex bi-reached from then on yields a distance of at least d_H(s, t) (even if no vertex is bi-reached), and d_H(s, t) is reported. This is because a non-hub vertex that is bi-reached at such a superstep or later would report a distance that cannot be smaller than d_H(s, t).

We obtain d_H(s, t) in the first two supersteps: in superstep 1, only s and t have been activated by init_activate(); s sends each core-hub h_1 ∈ core(s) a message containing d(s, h_1) (obtained from L(s)), while t provides L(t) to the aggregator. In superstep 2, each such hub h_1 receives the message d(s, h_1) from s, and obtains L(t) from the aggregator. Then, h_1 evaluates min{d(s, h_1) + d(h_1, h_2) + d(h_2, t) : h_2 ∈ core(t)}, where d(h_1, h_2) is obtained from L(h_1) and d(h_2, t) is obtained from L(t), and provides the result to the aggregator. The aggregator takes the minimum of the values provided by all such hubs h_1, which gives d_H(s, t).
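As a sanity check of the arithmetic, the label-based upper bound can be computed centrally by the following self-contained function. The hash-map layout, and keeping hub-to-hub distances in a separate map, are our own illustration; in Quegel the same value is obtained in two supersteps with an aggregator as just described.

    #include <algorithm>
    #include <climits>
    #include <unordered_map>

    // Label maps a hub ID to a pre-computed distance; hub_dist[h1][h2] holds the
    // pre-computed hub-to-hub distances (h1 == h2 maps to 0).
    using Label = std::unordered_map<int, int>;

    int upper_bound_dH(const Label& label_s, const Label& label_t,
                       const std::unordered_map<int, Label>& hub_dist) {
        int best = INT_MAX;
        for (const auto& [h1, d_s_h1] : label_s) {
            auto row = hub_dist.find(h1);
            if (row == hub_dist.end()) continue;
            for (const auto& [h2, d_h2_t] : label_t) {
                auto it = row->second.find(h2);            // d(h1, h2)
                if (it == row->second.end()) continue;
                best = std::min(best, d_s_h1 + it->second + d_h2_t);
            }
        }
        return best;   // INT_MAX means no s-t path through a hub was found
    }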

Algorithm for Indexing. The above algorithm requires that each vertex v stores L(v) in a(v). We now consider how to pre-compute L(v) in Quegel. This indexing procedure can be accomplished by performing |H| BFS operations, each starting from a hub h ∈ H. Interestingly, if we regard each BFS operation from a hub h as a BFS query in Quegel, then the entire procedure can be formulated as an independent Quegel job with the query set H.

We process a BFS query h ∈ H in Quegel as follows. The query-dependent attribute a_h(v) of a vertex v is defined as a pair (d̂(h, v), pruned(v)), where pruned(v) is a flag indicating whether some shortest path from h to v passes through another hub h' (h' ≠ h and h' ≠ v). Quegel starts processing h by calling init_activate() to activate h. The UDF init_value() is specified to set pruned(v) = FALSE, and to set d̂(h, v) = 0 if v = h, or d̂(h, v) = ∞ otherwise.

The UDF compute(.) is implemented as follows. In this algorithm, a message sent by a vertex u indicates whether there exists a shortest path from h to u that contains another hub (here, that hub can be u itself); if so, for any vertex v newly activated by that message, it holds that pruned(v) = TRUE. Based on this idea, the algorithm is given as follows. In superstep 1, h broadcasts the message FALSE to its neighbors. In superstep i (i > 1), if d̂(h, v) < ∞, then v has already been visited by the BFS, and it votes to halt directly; otherwise, v is activated for the first time, sets d̂(h, v) = i − 1, and receives and processes incoming messages as follows. If v receives TRUE from a neighbor u, then a shortest path from h to v via u passes through another hub (other than h and v), and thus v sets pruned(v) = TRUE. Then, if v ∈ H or pruned(v) = TRUE, v broadcasts the message TRUE to each neighbor; otherwise, v broadcasts the message FALSE to all its neighbors. Finally, v votes to halt.
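The following standalone sketch simulates this indexing BFS for an undirected graph outside of any framework; it mirrors the message/flag logic above, with the simplification that messages sent within the same superstep are applied directly to the receiver's pruned flag.

    #include <limits>
    #include <vector>

    // One indexing BFS from hub h: computes d(h,v) for every reached vertex and a
    // pruned(v) flag that becomes true iff some shortest path from h to v passes
    // through another hub.
    struct HubBfs { std::vector<int> dist; std::vector<bool> pruned; };

    HubBfs index_bfs(const std::vector<std::vector<int>>& nbrs,
                     const std::vector<bool>& is_hub, int h) {
        const int INF = std::numeric_limits<int>::max();
        HubBfs r{std::vector<int>(nbrs.size(), INF),
                 std::vector<bool>(nbrs.size(), false)};
        r.dist[h] = 0;
        std::vector<int> frontier = {h};          // vertices first visited last superstep
        for (int step = 1; !frontier.empty(); ++step) {
            std::vector<int> next;
            for (int v : frontier) {
                // v broadcasts TRUE iff it is another hub or lies behind one.
                bool flag = r.pruned[v] || (v != h && is_hub[v]);
                for (int u : nbrs[v]) {
                    if (r.dist[u] == INF) {       // first visit in this superstep
                        r.dist[u] = step;
                        next.push_back(u);
                    }
                    // All same-superstep messages reach u before u itself expands.
                    if (r.dist[u] == step && flag) r.pruned[u] = true;
                }
            }
            frontier.swap(next);
        }
        return r;  // add (h, dist[v]) to L(v) only if !pruned[v], or always if v is a hub
    }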

To compute L(v) using the above algorithm, we specify the query-independent attribute a(v) of a vertex v to keep both the adjacency list and L(v), where L(v) is initially empty. After a query h is processed, we perform the following operation in the query dumping UDF: (i) if v ∉ H, v adds (h, d(h, v)) to L(v) only if pruned(v) = FALSE; (ii) if v ∈ H, v always adds (h, d(h, v)) to L(v).

After all the |H| queries are processed, L(v) is fully computed for each vertex v. Then, each vertex v saves L(v) along with its other V-data to HDFS, which is to be loaded later by the Quegel program for processing PPSP queries described previously.

Extension to Directed Graphs. If G is directed, we make the following changes. First, each vertex now has both an in-degree and an out-degree, and thus we consider three different ways of picking hubs, i.e., picking those vertices with the highest (i) in-degree, (ii) out-degree, or (iii) sum of in-degree and out-degree. Second, each vertex v now maintains two core-hub sets: an entry-hub set core_in(v) and an exit-hub set core_out(v). A hub h is an entry-hub (exit-hub) of v iff no other hub exists on any shortest path from h to v (from v to h). Accordingly, we obtain two lists of hub-distance labels, L_in(v) and L_out(v). During indexing, we construct L_out(v) (resp. L_in(v)) by backward (resp. forward) BFS, i.e., by sending messages to in-neighbors (resp. out-neighbors). When answering PPSP queries, we compute d_H(s, t) similarly, but core(s) (and core(t)) is now replaced by core_out(s) (and core_in(t)).

5.2 XML Keyword Search

Section 5.1 illustrated how graph indexing itself can be formulated as an individual Quegel program. We now present another application of Quegel, i.e., keyword search on XML documents, which directly uses the distributed indexing interface of Quegel described in Section 4. Compared with traditional algorithms that rely on disk-based indexes [21, 45], our Quegel algorithms are much easier to program, and they avoid the expensive cost of constructing any disk-based index. Although a simple MapReduce solution has also been developed, it takes around 15 seconds to process each keyword query on an XML document whose size is merely 200MB [41]. The low efficiency is because MapReduce is not designed for querying workloads. In contrast, our Quegel program answers the same kind of keyword queries on much larger XML documents in less than a second. Let us first review the query semantics of XML keyword search, and then discuss XML keyword query processing in Quegel, followed by applications of the query in an online shopping platform.

Figure 3: A fragment of an XML document

5.2.1 Query Semantics

An XML document can be regarded as a rooted tree, where internal vertices are XML tags and leaf vertices are texts. To illustrate, Figure 3 shows the tree structure of an XML document describing the information of a research lab. We denote the set of words contained in the tag or text of a vertex v by words(v), and if a keyword k ∈ words(v), we call v a matching vertex of k (or say that v matches k). Given an XML document modeled by a tree T, an XML keyword query q = {k_1, k_2, ..., k_n} finds a set of trees, each of which is a fragment of T, denoted by T', such that for each keyword k_i ∈ q, there exists a vertex in T' matching k_i. We call each result tree T' a matching tree of q.

Different semantics have been proposed to define what a meaningful matching tree could be. Most semantics require that the root of T' be the Lowest Common Ancestor (LCA) of vertices v_1, v_2, ..., v_n, where each vertex v_i matches keyword k_i. For example, given the XML tree in Figure 3 and a query q = {Tom, Graph}, vertex 9 is the LCA of the matching vertices 11 and 13, while vertex 1 is the LCA of the matching vertices 3 and 5.

We consider two popular semantics for the root of T': Smallest LCA (SLCA) and Exclusive LCA (ELCA) [45]. For simplicity, we use "LCA/SLCA/ELCA of q" to denote "LCA/SLCA/ELCA of matching vertices v_1, v_2, ..., v_n". An SLCA of q is defined as an LCA of q that is not an ancestor of any other LCA of q. For example, in Figure 3, vertex 9 is an SLCA of {Tom, Graph}, while vertex 1 is not, since it is an ancestor of another LCA, i.e., vertex 9. Let us denote the subtree of T rooted at vertex v by T_v; then a vertex v is an ELCA of q if T_v contains at least one occurrence of all keywords in q after pruning every subtree T_u (where u is a child of v) that already contains all keywords in q. Referring to Figure 3 again, both vertices 1 and 9 are ELCAs of {Tom, Graph}. Vertex 1 is an ELCA since, after pruning the subtree under it that already contains all keywords (the one containing vertex 9), there still exist vertices 3 and 5 matching the keywords in q. In contrast, if q = {Peter, Graph}, then vertex 9 is an ELCA of q, while vertex 1 is not an ELCA of q since, after pruning the subtree under it that contains all keywords, there is no vertex matching "Peter".

Once the root r of a matching tree is determined, we may return the whole subtree T_r as the result tree T'. However, if r is at a top level of the input XML tree, T_r can be large (e.g., the subtree rooted at vertex 1) and may contain much irrelevant information. For an SLCA r, MaxMatch [21] was proposed to prune irrelevant parts from T_r to form T'. Let K(v) be the set of keywords matched by the vertices in T_v. If a vertex v has a sibling u with K(v) ⊊ K(u), then T_v is pruned. For example, let q = {Tom, Graph} and consider the subtree rooted at vertex 1 in Figure 3. Since vertex 9 contains {Tom, Graph} in its subtree while its sibling vertex 14 does not contain any keyword in its subtree, the subtree rooted at vertex 14 is pruned.

5.2.2 Query Algorithms

We now present the Quegel algorithms for computing SLCA, ELCA and MaxMatch. The Quegel program first loads the graph that represents the XML document (the graph is obtained by parsing the XML document with a SAX parser), where each vertex is associated with its parent and its children (V-data). Then, each worker constructs an inverted index from the loaded vertices using the indexing interface described in Section 4.

To process a query q = {k_1, ..., k_n}, the UDF init_activate() activates only those vertices v with words(v) ∩ q ≠ ∅. The query-independent attribute a(v) of each vertex v maintains its parent, its children, and words(v), and the query-dependent attribute a_q(v) maintains a bitmap B(v), where bit i (denoted by B_i(v)) equals 1 if keyword k_i exists in the subtree T_v and 0 otherwise. The UDF init_value() sets each bit B_i(v) to 1 if k_i ∈ words(v), and to 0 otherwise. For simplicity, if all the bits of B(v) are 1, we call B(v) all-one. We now describe the query processing logic of compute(.) for the SLCA, ELCA and MaxMatch semantics as follows.

Computing SLCA in Quegel. In superstep 1, all matching vertices have been activated by init_activate(), and each matching vertex v sends B(v) to its parent and votes to halt. In superstep i (i > 1), there are two cases in processing a vertex v. Case (a): if some bit of B(v) is 0, v computes the bitwise-OR of B(v) and the bitmaps received from its children, which is denoted by B'. If B' ≠ B(v), then some new bit of B(v) should be set due to a newly matched keyword; thus, v sets B(v) = B', and sends the updated B(v) to its parent. In addition, if B(v) is now all-one, then (1) if v received an all-one bitmap from a child, v is labeled as a non-SLCA (the label is also maintained in a_q(v)); (2) otherwise, v is labeled as an SLCA. Case (b): if B(v) is already all-one, then v has been labeled either as an SLCA or as a non-SLCA (because a descendant is an SLCA) in an earlier superstep. (1) If v is labeled as a non-SLCA, v votes to halt directly; while (2) if v is labeled as an SLCA and receives an all-one bitmap from a child, then v relabels itself as a non-SLCA. Finally, v votes to halt.

In the above algorithm, a vertex may send messages to its parent multiple times. To make sure that each vertex sends at most one message to its parent, we design another, level-aligned algorithm as follows. Specifically, we pre-compute the level of each vertex v in the XML tree, denoted by ℓ(v), by performing BFS from the tree root (with a traditional Pregel job). Then, our Quegel program loads the preprocessed data, where each vertex v also maintains ℓ(v) in a(v). The UDF compute(.) is designed as follows. Initially, we use an aggregator to collect the maximum level of all the matching vertices, denoted by ℓ_max. The aggregator maintains ℓ_max and decrements it by one after each superstep. In a superstep, a vertex v at level ℓ_max computes the bitwise-OR of B(v) and all the bitmaps received from its children at level ℓ_max + 1; the bitwise-OR is then assigned to B(v) and sent to v's parent. Moreover, if an all-one bitmap is received, v labels itself as a non-SLCA directly; otherwise, if B(v) becomes all-one, v labels itself as an SLCA. Finally, v votes to halt. Note that those matching vertices v with ℓ(v) < ℓ_max remain active until they are processed.
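A compact standalone sketch of this level-aligned bottom-up pass on a small hypothetical tree with two keywords (it omits the corner case where a single vertex matches all keywords by itself):

    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical tree: vertex -> parent (-1 = root) and vertex -> level;
        // vertices 3 and 4 match keywords 0 and 1, respectively (init_value).
        std::vector<int> parent = {-1, 0, 0, 1, 1, 2};
        std::vector<int> level  = { 0, 1, 1, 2, 2, 2};
        const int n = 6, nkw = 2;
        const unsigned ALL = (1u << nkw) - 1;
        std::vector<unsigned> B = {0, 0, 0, 1u, 2u, 0};   // keyword bitmaps
        std::vector<bool> slca(n, false), non_slca(n, false);

        for (int lmax = 1; lmax >= 0; --lmax) {           // deepest matches sit at level 2
            for (int v = 0; v < n; ++v) {                 // children at lmax+1 send B(v) up
                if (level[v] != lmax + 1 || B[v] == 0) continue;
                int p = parent[v];
                if (B[v] == ALL) non_slca[p] = true;      // p is an ancestor of an LCA
                B[p] |= B[v];
            }
            for (int p = 0; p < n; ++p)                   // receiving side at level lmax
                if (level[p] == lmax && B[p] == ALL && !non_slca[p]) slca[p] = true;
        }
        for (int v = 0; v < n; ++v)
            if (slca[v]) std::printf("vertex %d is an SLCA\n", v);   // prints vertex 1
    }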

Computing ELCA in Quegel. We use a level-aligned algorithm to compute ELCAs as follows. In a superstep, an active vertex v at level ℓ_max updates B(v) and sends it to its parent as in the SLCA computation. Meanwhile, v also computes another bitmap B_e(v) (in addition to B(v)), which is the bitwise-OR of B(v) (before its update) and all the non-all-one bitmaps received from its children at level ℓ_max + 1. Then v labels itself as an ELCA if B_e(v) is all-one.

In our SLCA and ELCA algorithms, each vertex v also maintains in a(v) its start and end positions in the XML document, denoted by start(v) and end(v), which are also obtained during the SAX parsing. After a query is processed, each vertex v that is labeled as an SLCA or ELCA dumps (start(v), end(v)) to HDFS, so that users can obtain T_v by reading the corresponding part of the XML document.

Computing MaxMatch in Quegel. Our Quegel algorithm for computing MaxMatch prunes irrelevant parts from the subtree rooted at each SLCA, and all vertices in the result matching trees dump themselves to HDFS after a query is processed, which can then be sent to the client and assembled as trees for display.

The algorithm is also level-aligned, and consists of two phases. In Phase 1, we run a variant of the level-aligned SLCA algorithm, where each vertex v sends the message (v, B(v)) to its parent. When a vertex receives such a message from a child u, it keeps (u, B(u)) in its query-dependent attribute. To prevent the algorithm from terminating after Phase 1, we keep the SLCA vertices active (i.e., they do not vote to halt) during the computation of Phase 1. Phase 1 ends after the superstep in which ℓ_max has been decremented down to the level of the tree root, and then the aggregator sets the phase number to 2 to start Phase 2.

Phase 2 performs downward propagation from those SLCAs found in Phase 1. In a superstep, each active vertex v labels itself to indicate that v is in a matching tree (the label is also kept in a_q(v)). Then, v sends messages to those of its children that are not dominated by any of their siblings. Here, a child u is dominated by a sibling u' if K(u) ⊊ K(u'), and we check the condition using their bitmaps as follows: B(u) ≠ B(u') and (B(u) Bit-OR B(u')) = B(u'). In this way, dominated subtrees are pruned from the result trees, and Quegel dumps only the labeled vertices to HDFS.
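The domination test itself reduces to a two-line bitmap check; a minimal sketch, assuming 32-bit keyword bitmaps:

    #include <cstdint>

    // Child u is dominated by sibling u2 iff K(u) is a proper subset of K(u2),
    // i.e., the bitmaps differ and OR-ing them adds nothing to B(u2).
    inline bool dominated(std::uint32_t B_u, std::uint32_t B_u2) {
        return B_u != B_u2 && (B_u | B_u2) == B_u2;
    }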

Applications of XML Keyword Search. Though originally proposed for querying a single XML document [21, 45], our algorithms can also be used to query a large corpus of many XML documents. We illustrate this with one application in e-commerce. During online shopping, a customer may pose a keyword query (in the form of an AJAX request) from a web browser to search for products of interest. The web server obtains the matched products from the database, organizes them as an XML document, and sends it back to the client side. The browser of the client then parses the XML document with a Javascript script to display the results. The server may log the various AJAX responses to disk, so that data scientists and sellers can pose XML keyword queries on the logged XML corpus to study customers' search behavior for specific products, helping them make better business decisions.

5.3 Terrain Shortest Path Queries

Technological advances in remote sensing have made high-resolution terrain data of the entire Earth surface available. Terrain data are usually represented in the Digital Elevation Model (DEM), which is an elevation mesh of ground positions sampled at regularly spaced intervals. Since terrain data are usually collected at high resolution (e.g., 10m sampling intervals), the data size is usually huge.

Figure 4: Terrain data model. (a) TIN; (b) network distance.

Many recent studies propose algorithms for processing various spatial queries over terrain data, including P2P shortest path queries [20], nearest neighbor (NN) queries [31, 17] and reverse NN queries [38, 17]. Applications of terrain queries include disaster response, outdoor activities, and military operations [38]. Existing works adopt the Triangulated Irregular Network (TIN) terrain representation as illustrated in Figure 4(a), which is derived from the DEM data. Since the terrain surface is composed of triangular faces, existing works use Chen and Han's algorithm [16], a polyhedron shortest-path algorithm, to compute the terrain shortest path between two terrain locations. This approach has very poor performance and scalability, since the time complexity of Chen and Han's algorithm is quadratic in the number of triangular faces. For example, even with surface simplification (with precision loss), the algorithm of [20] can only process terrain shortest paths whose lengths are merely several hundred meters, and it takes hundreds to thousands of seconds to compute one such shortest path. We propose an efficient approximate solution with a much lower cost.

Let d_N(u,v) be the network (which is the TIN here) distance between two vertices (i.e., locations) u and v. Here, d_N(s,t) upper-bounds the actual terrain distance between s and t, since the TIN shortest path is also a path on the terrain surface. However, the TIN shortest path can be very different from the actual terrain shortest path [31]. We further show that the difference cannot be effectively reduced simply by increasing the sampling rate. Consider the mesh fragment shown in Figure 4(b), and suppose that all vertices have the same elevation. If only horizontal and vertical edges are considered, then no matter what the sampling interval is, d_N(s,t) is lower bounded by the Manhattan distance between s and t, even though the terrain shortest path is given by a straight line between s and t. Now consider a TIN where faces are diagonally triangulated; we can show that d_N(s,t) is lower bounded by l + (√2 − 1) h, where l (and h) refers to the larger (and smaller) one of the horizontal and vertical offsets between s and t. Thus, a better solution is needed.
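To see why refining the sampling interval does not close this gap, here is a short derivation of our own under the flat-grid assumption above (axis-parallel plus diagonal moves only, with l and h denoting the larger and smaller of the horizontal and vertical offsets between s and t):

\[
d_N(s,t) \;\ge\; \sqrt{2}\,h + (l-h) \;=\; l + (\sqrt{2}-1)\,h,
\qquad
d_E(s,t) \;=\; \sqrt{l^{2}+h^{2}},
\]
\[
\bigl(l+(\sqrt{2}-1)\,h\bigr)^{2} - \bigl(l^{2}+h^{2}\bigr) \;=\; 2(\sqrt{2}-1)\,h\,(l-h) \;\ge\; 0 .
\]

Hence the grid distance is never smaller than the straight-line distance d_E(s,t), and is strictly larger whenever 0 < h < l, regardless of how fine the sampling interval is.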

The above discussion motivates us to propose a new transformation from the terrain data to a network that gives a more accurate terrain shortest path distance, and we can use Quegel to achieve efficient computation on the network. The idea is to add shortcut edges as illustrated by the last grid cell in Figure 4(b). Specifically, we split each edge of a cell by adding vertices such that the distance between two neighboring vertices is no more than a given threshold, as shown in Figure 4(b). Then, in each cell, we add a straight line between every pair of vertices that are not on the same horizontal or vertical edge. We then compute the shortest path on the new network to approximate the terrain shortest path. Since the cell shortcuts are in different directions, the network shortest path can be close to the actual terrain shortest path. Note that even the TIN just interpolates the elevation of an arbitrary location from the sampled elevation data [20], as the actual elevation is not known. For example, in Figure 4(b), the elevation of an added vertex is linearly interpolated from its neighboring samples. Therefore, the shortest paths computed on the TIN [20] and on our graph model both just approximate the actual shortest path.

Now the problem of computing the P2P shortest path over the terrain is transformed to the problem of computing the P2P shortest path in the transformed network. Since terrain data can be of planetary scale and the transformed network is even larger, we employ Quegel for distributed shortest path querying in the transformed network. The logic of the compute(.) function can simply be the distributed single-source shortest-path (SSSP) algorithm of [24], where each active vertex updates its current distance from the source s using that of its neighbors, and propagates the updated distance to its neighbors to activate them for further distance computation, until the process converges. We further devise a mechanism to terminate the SSSP computation earlier (without traversing all the vertices) when s and t are close to each other, which is described as follows.
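As a concrete illustration of this compute(.) logic, the following self-contained C++ sketch (our own superstep simulation, not Quegel's API) runs vertex-centric SSSP on a small in-memory graph; a vertex becomes active only when its distance improves.

#include <cstdio>
#include <limits>
#include <vector>

struct Edge { int to; double w; };

// Simulated vertex-centric SSSP: in each superstep, every active vertex
// sends (its distance + edge weight) to its neighbors; a vertex that
// receives a smaller distance updates itself and becomes active again.
std::vector<double> sssp(const std::vector<std::vector<Edge>>& g, int src) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> dist(g.size(), INF);
    std::vector<bool> active(g.size(), false);
    dist[src] = 0.0;
    active[src] = true;
    bool any = true;
    while (any) {                                   // one iteration = one superstep
        any = false;
        std::vector<double> best(g.size(), INF);    // smallest incoming message
        for (size_t v = 0; v < g.size(); ++v) {
            if (!active[v]) continue;
            for (const Edge& e : g[v])
                if (dist[v] + e.w < best[e.to]) best[e.to] = dist[v] + e.w;
            active[v] = false;                      // vote to halt
        }
        for (size_t v = 0; v < g.size(); ++v)
            if (best[v] < dist[v]) { dist[v] = best[v]; active[v] = true; any = true; }
    }
    return dist;
}

int main() {
    std::vector<std::vector<Edge>> g(4);
    g[0] = {{1, 1.0}, {2, 4.0}};
    g[1] = {{2, 2.0}, {3, 6.0}};
    g[2] = {{3, 3.0}};
    std::vector<double> d = sssp(g, 0);
    for (size_t v = 0; v < d.size(); ++v) std::printf("dist(%zu) = %.1f\n", v, d[v]);
    return 0;
}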

Let d_E(u,v) be the Euclidean distance between vertices u and v. In any superstep, when a vertex v is currently active (i.e., v is at the distance propagation wavefront), its distance d(v) from s is updated based on the distances sent by the neighbors of v in the previous superstep. Meanwhile, we compute d_E(v,t) using the coordinates of v and t. Note that d_E(v,t) lower-bounds the network distance from v to t. We use the aggregator to compute the minimum value of d(v) + d_E(v,t), denoted by d_min, among all vertices v at the distance propagation wavefront. If d(t) ≤ d_min, vertex t calls force_terminate() to end the computation. This is because for any vertex w that will be at the distance propagation wavefront in any following superstep, we have d(w) + d_E(w,t) ≥ d_min for the current d_min. However, we already have d(t) ≤ d_min, and thus no d(t) computed in any following superstep (including one relayed through w) can be smaller than the current d(t).
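The following small C++ helper (ours; the struct and parameter names are illustrative) expresses the aggregator-side check: it computes d_min over the current wavefront and reports whether the current d(t) can already be finalized.

#include <cmath>
#include <limits>
#include <vector>

struct Frontier { double dist; double x, y, z; };   // d(v) and the coordinates of v

// Returns true if the current d(t) can no longer be improved, i.e.,
// d(t) <= min over the wavefront of ( d(v) + Euclidean(v, t) ).
bool can_terminate(const std::vector<Frontier>& wavefront,
                   double tx, double ty, double tz, double dist_t) {
    double d_min = std::numeric_limits<double>::infinity();
    for (const Frontier& f : wavefront) {
        double de = std::sqrt((f.x - tx) * (f.x - tx) +
                              (f.y - ty) * (f.y - ty) +
                              (f.z - tz) * (f.z - tz));   // d_E(v, t)
        if (f.dist + de < d_min) d_min = f.dist + de;     // aggregator: take the minimum
    }
    return dist_t <= d_min;   // if true, t may call force_terminate()
}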

In our actual implementation, to avoid a large number of supersteps caused by a large graph diameter, we adopt the idea of [35], which first partitions the graph into subgraphs that group spatially close vertices, and then propagates the distance updates from s in units of subgraphs (instead of individual vertices). Experiments in Section 6 verify that our new method computes high-quality terrain shortest paths very efficiently for any path length (in contrast to only several hundred meters as in [20]).

5.4 P2P Reachability

In this section, we consider the P2P reachability query q = (s, t), which determines whether there exists a path from s to t in a directed graph G. The Quegel algorithms for BFS and BiBFS as described in Section 5.1.1 are also applicable to this problem. We now consider the Quegel solution that makes use of indexing.

A P2P reachability query on a directed graph G can be reduced to one on a directed acyclic graph (DAG) G'. Each vertex of G' represents a strongly connected component (SCC) of G, and each edge of G' represents the fact that one component can reach another. To answer whether s can reach t in G, we simply look up their corresponding SCCs, s' and t' respectively, which are nodes in G'. Vertex s can reach t in G iff s' = t' or s' can reach t' in G'. Note that the SCCs of G can be computed in Pregel using the algorithm of [36], which associates each vertex in G with its corresponding SCC vertex in G'. This G-to-G' mapping relation can be pre-computed as an independent Pregel job, and stored on HDFS to be loaded later by Quegel workers into their local index field Worker::index. When a query q = (s, t) arrives, each worker may look up s' and t' from the index and activate them (if they reside in the worker) in init_activate(). For ease of discussion, we assume G is a DAG hereafter.
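A minimal C++ sketch of this reduction (ours; the mapping and the DAG-reachability routine are placeholders for the precomputed index and the bidirectional BFS described in this section):

#include <unordered_map>

// scc_of: the precomputed G-to-G' mapping (vertex -> ID of its SCC vertex in G').
// reach_in_dag: whether one SCC vertex reaches another in the DAG G'.
bool reachable(int s, int t,
               const std::unordered_map<int, int>& scc_of,
               bool (*reach_in_dag)(int, int)) {
    int s_scc = scc_of.at(s);
    int t_scc = scc_of.at(t);
    return s_scc == t_scc || reach_in_dag(s_scc, t_scc);
}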

Existing work on P2P reachability indexing combines graph traversal with vertex-label based pruning in order to be scalable, such as [43, 34]. Due to the requirement of graph traversal, the graph and the vertex labels have to reside in main memory, and for massive graphs, one has to resort to a distributed main-memory system. In this section, we demonstrate how the index of [43] can be used in Quegel. We assume that a depth-first search forest of G is given (which is required by the no-label to be introduced shortly), so that each vertex v knows its parent in the forest, its pre-order number pre(v) and its post-order number post(v). This forest can be computed in memory, or using the IO-efficient algorithm of [42].
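For completeness, the following in-memory C++ sketch (ours) assigns the pre-order and post-order numbers by DFS from the roots of a DAG; in practice the IO-efficient algorithm of [42] may be used instead.

#include <functional>
#include <vector>

// Assigns pre(v) and post(v) by DFS from every root (zero in-degree vertex).
// g[v] lists the out-neighbors of v.
void dfs_number(const std::vector<std::vector<int>>& g,
                std::vector<int>& pre, std::vector<int>& post) {
    int n = (int)g.size();
    pre.assign(n, -1);
    post.assign(n, -1);
    std::vector<int> indeg(n, 0);
    for (int u = 0; u < n; ++u)
        for (int v : g[u]) ++indeg[v];
    int pre_ctr = 0, post_ctr = 0;
    // A vertex is numbered only on its first visit, so the edges followed
    // here form a DFS forest of the graph.
    std::function<void(int)> dfs = [&](int u) {
        pre[u] = pre_ctr++;
        for (int v : g[u])
            if (pre[v] == -1) dfs(v);
        post[u] = post_ctr++;
    };
    for (int r = 0; r < n; ++r)
        if (indeg[r] == 0 && pre[r] == -1) dfs(r);
}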

During the indexing phase, we compute three labels for each vertex v using three cascaded Pregel jobs: (1) the level lev(v), (2) the yes-label yes(v) and (3) the no-label no(v). These labels are then used in our Quegel algorithm to prune vertices from further expansion during the bidirectional BFS from s and t.

Level Label. We first define lev(v) and discuss its computation. Let us call a vertex with zero in-degree a root; then lev(v) is defined as the largest number of hops from a root to v. For example, consider the DAG shown in Figure 5. Vertex 9 has level 3 although it is just two hops away from root 10, since the longest path from root 10 to vertex 9 has three hops.

According to the definition of the level label, if u can reach v, then lev(u) < lev(v). Therefore, in our Quegel algorithm, if the forward BFS from s activates a vertex v with lev(v) ≥ lev(t), v votes to halt directly as it cannot reach t; similarly, if the backward BFS from t activates a vertex v with lev(v) ≤ lev(s), v votes to halt directly as s cannot reach v. Note that the labels of s and t can be obtained using the aggregator at the beginning of a query q = (s, t), so that any vertex can get them from the aggregator in compute(.).

The Pregel algorithm for level computation is as follows. Initially, only roots are active with lev(v) = 0, while lev(v) is initialized as −∞ for all other vertices v. In superstep 1, each root v broadcasts lev(v) + 1 to its out-neighbors before voting to halt. In superstep i (i > 1), each active vertex v gets the largest incoming message ℓ (sent as lev(u) + 1 from an in-neighbor u); here, we know that v's level should be at least ℓ, and thus we check whether ℓ > lev(v). If so, v updates lev(v) = ℓ, and broadcasts lev(v) + 1 to all its out-neighbors. Finally, v votes to halt. Upon convergence, for each vertex v, lev(v) equals the level of v.
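The following C++ sketch (our own superstep simulation) mirrors this computation: roots start at level 0, and a vertex re-broadcasts only when an incoming candidate level exceeds its current one.

#include <cstdio>
#include <vector>

// Longest-hop level computation over a DAG, simulated superstep by superstep.
// g[v] lists the out-neighbors of v.
std::vector<int> compute_levels(const std::vector<std::vector<int>>& g) {
    int n = (int)g.size();
    std::vector<int> indeg(n, 0), lev(n, -1);
    for (int u = 0; u < n; ++u)
        for (int v : g[u]) ++indeg[v];
    std::vector<int> active;
    for (int r = 0; r < n; ++r)
        if (indeg[r] == 0) { lev[r] = 0; active.push_back(r); }   // roots
    while (!active.empty()) {                  // one iteration = one superstep
        std::vector<int> best(n, -1);          // largest incoming message per vertex
        for (int u : active)
            for (int v : g[u])
                if (lev[u] + 1 > best[v]) best[v] = lev[u] + 1;
        active.clear();
        for (int v = 0; v < n; ++v)
            if (best[v] > lev[v]) { lev[v] = best[v]; active.push_back(v); }
    }
    return lev;
}

int main() {
    // 0 -> 1 -> 3 and 0 -> 2 -> 3: vertex 3 gets level 2.
    std::vector<std::vector<int>> g = {{1, 2}, {3}, {3}, {}};
    std::vector<int> lev = compute_levels(g);
    for (int v = 0; v < (int)g.size(); ++v) std::printf("lev(%d) = %d\n", v, lev[v]);
    return 0;
}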

Yes-Label. We now define yes(v) and discuss its computation. Recall that the pre-order number pre(v) is available for each vertex v. Let us define reach(v) to be the set of all vertices reachable from v (including v itself); then yes(v) is defined as the maximum pre-order number among the vertices in reach(v). As an illustration, consider the graph shown in Figure 5, where the bold edges belong to the DFS forest, and the vertices are marked with their pre-order numbers. Vertex 5 has yes-label 5 since the largest vertex that it can reach is itself, while vertex 7 has yes-label 9 as the largest vertex that it can reach has ID 9.

Figure 5: Illustration of yes-labels

The yes-label has the following property: if pre(u) ≤ pre(v) ≤ yes(u), then u can reach v [43]. To illustrate, in Figure 5, we can conclude that vertex 0 can reach vertex 2 since pre(0) ≤ pre(2) ≤ yes(0). In fact, this property holds as long as the pre-order numbers are computed from a spanning forest of G (including a DFS forest). Intuitively, the condition holds iff u is an ancestor of v in the forest. Therefore, in our Quegel algorithm, if the forward BFS from s activates a vertex v with pre(v) ≤ pre(t) ≤ yes(v), v calls force_terminate() and indicates that s can reach t. This is because v is obviously reachable from s, and v can reach t according to the yes-labels. Similarly, if the backward BFS from t activates a vertex v with pre(s) ≤ pre(v) ≤ yes(s), v calls force_terminate() and indicates that s can reach t.

To compute the yes-labels, we only need to compute yes(v) for each vertex v as follows. Initially, for each vertex v, yes(v) is initialized as pre(v), and only those vertices with zero out-degree are active; each active vertex v sends yes(v) to its in-neighbors in superstep 1 and votes to halt. In superstep i (i > 1), each vertex v receives the incoming messages, and let the largest one be ℓ; if ℓ > yes(v), v sets yes(v) = ℓ and broadcasts yes(v) to its in-neighbors; finally, v votes to halt.

A weakness of this algorithm is that a vertex v may update yes(v) and broadcast it to its in-neighbors more than once. We design a more efficient level-aligned algorithm that makes use of lev(v) to ensure that each vertex only updates and broadcasts once, which works as follows. Initially, only those vertices with zero out-degree are active, and we use the aggregator to collect their maximum level L. Then, the aggregator maintains L and decrements it by one after each superstep; all vertices v with lev(v) > L are already processed, while all vertices with lev(v) = L are being processed. In a superstep, a vertex v receives messages, and let the largest one be ℓ; if ℓ > yes(v), v sets yes(v) = ℓ. Then, each vertex v with lev(v) = L broadcasts yes(v) to its in-neighbors and votes to halt.
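A sketch of this level-aligned schedule in C++ (ours; one loop iteration plays the role of one superstep): vertices are processed in decreasing level order, and each one pushes its final yes-label to its in-neighbors exactly once.

#include <algorithm>
#include <vector>

// Level-aligned yes-label computation over a DAG.
// g[v]: out-neighbors of v; rg[v]: in-neighbors of v;
// pre[v], lev[v]: precomputed pre-order numbers and levels.
std::vector<int> yes_labels(const std::vector<std::vector<int>>& g,
                            const std::vector<std::vector<int>>& rg,
                            const std::vector<int>& pre,
                            const std::vector<int>& lev) {
    int n = (int)g.size();
    std::vector<int> yes(pre);                      // yes(v) initialized to pre(v)
    int L = -1;
    for (int v = 0; v < n; ++v)
        if (g[v].empty()) L = std::max(L, lev[v]);  // aggregator: max level of sinks
    for (; L >= 0; --L) {                           // one superstep per level
        for (int v = 0; v < n; ++v) {
            if (lev[v] != L) continue;
            // yes(v) is final here, since all out-neighbors of v have a larger
            // level and have already pushed their labels into yes(v);
            // v now broadcasts to its in-neighbors exactly once.
            for (int u : rg[v]) yes[u] = std::max(yes[u], yes[v]);
        }
    }
    return yes;
}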

Figure 6: Illustration of no-labels

No-Label. Finally, we define no(v) and discuss its computation. For each vertex v, no(v) is defined as the minimum post-order number among the vertices in reach(v). As an illustration, consider the graph shown in Figure 6, which is the same one as in Figure 5 except that the vertices are marked with their post-order numbers. Vertex 4 has no-label 0 since the smallest vertex that it can reach has ID 0, while vertex 8 has no-label 1 as the smallest vertex that it can reach has ID 1.

The no-label has the following property: if u can reach v, then no(u) ≤ no(v) [43]. The property can be easily observed from Figure 6. We actually use its contrapositive: if no(u) > no(v), then u cannot reach v. To illustrate, in Figure 6, we can conclude that vertex 11 cannot reach vertex 0 since no(11) > no(0). Therefore, in our Quegel algorithm, if the forward BFS from s activates a vertex v with no(v) > no(t), v votes to halt directly as v cannot reach t; similarly, if the backward BFS from t activates a vertex v with no(s) > no(v), v votes to halt directly as s cannot reach v.
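Putting the three labels together, the following compact C++ helper (ours; the label fields follow the notation used in this section) summarizes the per-vertex decisions made during the bidirectional BFS: a definite positive answer via the yes-labels, or pruning via the level labels and no-labels.

struct Labels { int lev; int pre; int yes; int no; };

// Yes-label shortcut for the forward BFS: if pre(v) <= pre(t) <= yes(v), then
// v reaches t, so v may call force_terminate() with answer "reachable".
// (The backward-BFS shortcut pre(s) <= pre(v) <= yes(s) is symmetric.)
bool yes_shortcut(const Labels& v, const Labels& t) {
    return v.pre <= t.pre && t.pre <= v.yes;
}

// Pruning for the forward BFS from s: v cannot reach t, so v votes to halt.
bool prune_forward(const Labels& v, const Labels& t) {
    return v.lev >= t.lev || v.no > t.no;
}

// Pruning for the backward BFS from t: s cannot reach v, so v votes to halt.
bool prune_backward(const Labels& v, const Labels& s) {
    return v.lev <= s.lev || s.no > v.no;
}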

The Pregel algorithm for no-label computation is symmetric to that for yes-label computation, and is thus omitted.

5.5 Graph Keyword Search

In this section, we consider a simplified version of the graph keyword search problem [13], which was recently studied by [26] on MapReduce: given a keyword query q = {k_1, k_2, ..., k_n} over a graph G where each vertex is associated with a text, a keyword search finds a set of rooted trees of the form ⟨r, (k_1, v_1), (k_2, v_2), ..., (k_n, v_n)⟩, where r is the root vertex, and v_i is the closest vertex to r whose text contains keyword k_i. Moreover, the maximum distance allowed from a root r to a matched vertex v_i is constrained to be a given bound d_max. Note that a root vertex determines a unique answer, since we pick the matching vertex closest to r for each keyword.

A simple vertex-centric algorithm for graph keyword search is described as follows. Each vertex v maintains, for each keyword k_i, a field (v_i, d_i) indicating its closest matching vertex v_i and the distance d_i to it. Initially, if the text of v contains k_i, we set (v_i, d_i) = (v, 0); otherwise, we set (v_i, d_i) = (null, ∞). Only vertices whose text contains at least one keyword are active. In superstep 1, each matching vertex v finds its fields with d_i = 0 (i.e., v_i = v), sends ⟨k_i, v⟩ to all its in-neighbors, and votes to halt. In superstep i (i > 1), a vertex v receives messages from its out-neighbors u. Here, a message ⟨k_j, w⟩ indicates that vertex w matches k_j, and it is (i − 2) hops from u (and u is one hop from v), i.e., (i − 1) hops from v. Therefore, for each keyword k_j reported in the received messages, let u be the out-neighbor of v reporting k_j (ties broken by, e.g., the smallest vertex ID) and let the reported matching vertex be w; then if i − 1 < d_j, v updates (v_j, d_j) = (w, i − 1) and sends ⟨k_j, w⟩ to all its in-neighbors, before voting to halt. If the computation proceeds beyond d_max + 1 supersteps, all vertices vote to halt directly and the algorithm stops; by then, any vertex r whose d_i ≤ d_max for all keywords k_i corresponds to a result.
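The per-keyword update rule can be sketched in C++ as follows (our own simulation of the message flow, not Quegel code; the parameter names rg, matches and d_max are ours, and tie-breaking among equal-distance matches is omitted).

#include <vector>

const int INF = 1 << 29;

struct Match { int vertex; int dist; };   // closest matching vertex and its distance

// rg[v]: in-neighbors of v; matches[v]: bit i set iff the text of v contains keyword k_i.
// Returns state[v][i] = closest vertex matching k_i reachable via out-edges of v.
std::vector<std::vector<Match>> keyword_search(
        const std::vector<std::vector<int>>& rg,
        const std::vector<unsigned>& matches, int num_kw, int d_max) {
    int n = (int)rg.size();
    std::vector<std::vector<Match>> state(n, std::vector<Match>(num_kw, Match{-1, INF}));
    std::vector<int> active;
    for (int v = 0; v < n; ++v) {
        for (int i = 0; i < num_kw; ++i)
            if (matches[v] >> i & 1) state[v][i] = {v, 0};
        if (matches[v]) active.push_back(v);       // only matching vertices start active
    }
    for (int step = 1; step <= d_max && !active.empty(); ++step) {
        std::vector<int> next;
        std::vector<bool> updated(n, false);
        for (int v : active)
            for (int i = 0; i < num_kw; ++i) {
                if (state[v][i].dist != step - 1) continue;   // newly settled match
                for (int u : rg[v])                           // "send" to in-neighbors
                    if (step < state[u][i].dist) {
                        state[u][i] = {state[v][i].vertex, step};
                        if (!updated[u]) { updated[u] = true; next.push_back(u); }
                    }
            }
        active.swap(next);                         // vertices updated become active
    }
    return state;
}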

Figure 7: RDF Data Fragment

A typical application of graph keyword search is over RDF data. An RDF dataset consists of triples of the form (s, p, o), where s, p and o are called the subject, predicate and object, respectively. Conceptually, each triple can be regarded as a directed edge from vertex s to vertex o with edge label p, and thus the whole RDF dataset can be regarded as a labeled graph. As an illustration, consider the RDF graph shown in Figure 7, which contains triples like (Tom, supervises, Peter) and (Peter, age, "25"). Here, the text of some vertices uniquely determines the vertex identity, such as the vertex labeled "Peter". The text of such a vertex is called a resource, which is usually a URI. In contrast, for a vertex like "25", the text is just a literal that indicates the value of its predicate, and the text of another vertex can also be this literal.

To perform keyword search over an RDF graph, we first need to convert the set of triples into an adjacency list representation. For a literal vertex o in a triple (s, p, o), we store it as an attribute of the resource vertex s, with attribute p having value o. For each resource vertex v, two lists are stored: one that contains v's in-neighbors (which are resource vertices), and one that contains v's literal out-neighbors. The lists can be easily obtained by MapReduce. For example, to get the in-neighbor lists for all vertices, each mapper splits a triple (where