A Case for Stale Synchronous Distributed Model for Declarative Recursive Computation

07/24/2019
by   Ariyam Das, et al.

A large class of traditional graph and data mining algorithms can be concisely expressed in Datalog, and other Logic-based languages, once aggregates are allowed in recursion. In fact, for most BigData algorithms, the difficult semantic issues raised by the use of non-monotonic aggregates in recursion are solved by Pre-Mappability (PreM), a property that assures that for a program with aggregates in recursion there is an equivalent aggregate-stratified program. In this paper we show that, by bringing together the formal abstract semantics of stratified programs with the efficient operational one of unstratified programs, PreM can also facilitate and improve their parallel execution. We prove that PreM-optimized lock-free and decomposable parallel semi-naive evaluations produce the same results as the single executor programs. Therefore, PreM can be assimilated into the data-parallel computation plans of different distributed systems, irrespective of whether these follow bulk synchronous parallel (BSP) or asynchronous computing models. In addition, we show that non-linear recursive queries can be evaluated using a hybrid stale synchronous parallel (SSP) model on distributed environments. After providing a formal correctness proof for the recursive query evaluation with PreM under this relaxed synchronization model, we present experimental evidence of its benefits. This paper is under consideration for acceptance in Theory and Practice of Logic Programming (TPLP).


1 Introduction

The growing interest in Datalog-based declarative systems like LogicBlox logicblox, BigDatalog bigdatalog, SociaLite socialite, BigDatalog-MC bigdatalog-mc and Myria myria has brought together important advances on two fronts: (i) first, Datalog, with support for aggregates in recursion monotonic_agg, has sufficient power to succinctly express declarative applications ranging from complex graph queries to advanced data mining tasks, such as frequent pattern mining and decision tree induction scalingup-tplp2018; (ii) second, modern architectures supporting in-memory parallel and distributed computing can deliver scalability and performance for this new generation of Datalog systems.

For example, BigDatalog (bulk synchronous parallel processing on a shared-nothing architecture), BigDatalog-MC (lock-free parallel processing on a shared-memory multicore architecture) and Myria (asynchronous processing on a shared-nothing architecture) spearheaded the system-level scheduling, planning and optimization for different parallel computing models. This line of work was quite successful for Datalog, and also for recursive SQL queries that have borrowed this technology rasql. Indeed, our recent general-purpose Datalog systems surpassed commercial graph systems like GraphX on many classical graph queries in terms of performance and scalability bigdatalog.

Much of the theoretical groundwork contributing to the success of these parallel Datalog systems was laid out in the 90s. For example, in their foundational work, GangulyParallel investigated parallel coordination-free (asynchronous) bottom-up evaluations of simple linear recursive programs (without any aggregates). Many recent works have pushed this idea forward under the broader umbrella of the CALM conjecture (Consistency And Logical Monotonicity) CalmConjecture, which establishes that monotonic Datalog programs (Datalog without negation or aggregates) can be computed in an eventually consistent, coordination-free manner AmelootCalm2, AmelootCalmRefined. This line of work led to the asynchronous data-parallel (for Myria) and lock-free evaluation plans of many of the aforementioned systems (e.g., BigDatalog-MC). Simultaneously, another branch of research on 'parallel correctness' for simple non-recursive conjunctive queries ParallelCorrectness focused on optimal data distribution policies for re-partitioning the initial data under the Massively Parallel Communication (MPC) model. Notably, however, this theoretical groundwork left out programs using aggregates in recursion, for which the existence of a formal semantics could not be guaranteed. This situation has changed recently with the introduction of the notion of Pre-Mappability (PreM) zaniolo-tplp2017, which has made it possible to use aggregates in recursion to efficiently express a large range of applications scalingup-tplp2018. (In our initial work zaniolo-tplp2017 we interchangeably used the term Pre-Applicability; in our follow-up works scalingup-tplp2018, zaniolo-amw2018 we consistently used the term Pre-Mappability, deemed more appropriate in the context of 'pre-mapping' aggregates and constraints to recursive rules.)
A key aspect of this line of work has been the use of non-monotonic aggregates and pre-mappable constraints inside recursion while preserving the formal declarative semantics of aggregate-stratified programs, thanks to the notion of PreM, which guarantees their equivalence. Unlike more complex non-monotonic semantics, stratification is a syntactic condition that is easily checked by users (and compilers), who know that the presence of a formal declarative semantics guarantees the portability of their applications over multiple platforms. Furthermore, evidence is mounting that a higher potential for parallelism is also gained under PreM. Naturally, we would like to examine the applicability of PreM in a parallel and distributed setting and analyze its potential gains using the rich models of parallelism previously proposed for Datalog and other logic systems.

In this paper, therefore, we begin by examining how PreM behaves in a parallel setting, and address the question of whether it can be incorporated into the parallel evaluation plans on shared-memory and shared-nothing architectures. Furthermore, the current crop of Datalog systems supporting aggregates in recursion has only explored Bulk Synchronous Parallel (BSP) and asynchronous distributed computing models. However, the emerging paradigm of the Stale Synchronous Parallel (SSP) processing model SSPCui has been shown to speed up big data analytics and machine learning algorithms on distributed environments SSPPetuum, SSPMuli with bounded staleness.

SSP processing allows each worker in a distributed setting to see and use another worker's obsolete (stale) intermediate solution, which is out-of-date by only a limited (bounded) number of epochs. In contrast, in a BSP model every worker coordinates at the end of each round of computation and sees the others' current intermediate results. Relaxing the synchronization barrier in an SSP model can reduce the idle waiting of the workers (time spent waiting to synchronize), particularly when one or more workers (stragglers) lag behind the others in terms of computation. Thus, in this paper, we also explore whether declarative recursive computation can be executed under the loose consistency model of SSP processing and whether it converges to the same result as under a BSP processing framework.

To our surprise, we find that PreM dovetails excellently with the SSP model for a class of non-linear recursive queries with aggregates, which are not embarrassingly parallel and still require some coordination between the workers to reach eventual consistency interlandi_tanca_2018. The contributions of this paper can be summarized as follows:

  • We show that PreM is applicable to parallel bottom-up semi-naive evaluation plans, which terminate at the same minimal fixpoint as the corresponding sequential execution on a single executor.

  • We further show how recursive query evaluation with PreM can operate effectively under an SSP distributed model.

  • Finally, we discuss the merits and demerits of an SSP model with initial empirical results on some recursive query examples, thus opening up an interesting direction for future research.

2 An Overview of PreM

This section provides a brief overview of PreM and some of its properties zaniolo-amw2016, zaniolo-amw2018. Consider the Datalog query in Example 1, which computes the shortest path between all pairs of vertices in a graph given by the relation arc(X, Y, D), where D is the distance between source node X and destination node Y. The min(D) syntax in our example indicates a min aggregate on the cost variable D, while (X, Y) are the group-by arguments. This head notation for aggregates directly follows SQL-2 syntax, where the cost argument of the aggregate consists of one variable and the group-by arguments can have zero or more variables. The rules in the example show that the aggregate is computed at a stratum higher than that of the recursive rule.

Example 1.

All Pairs Shortest Path

Incidentally, the min aggregate can also be expressed with stratified negation. This rewriting guarantees that the program has a perfect-model semantics, although an iterated fixpoint computation of it can be very inefficient and even non-terminating in the presence of cycles.

Application. The aforementioned inefficiency can be mitigated with PreM if the aggregate can be pushed inside the fixpoint computation. The program so obtained has a stable model semantics, and scalingup-tplp2018 showed that this transformation is indeed equivalence-preserving, with assured convergence to a minimal fixpoint within a finite number of iterations. In other words, without PreM, the shortest path in our example is given by the subset of the minimal model obtained after removing the path atoms that do not satisfy the cost constraint for a given source-destination pair. With PreM, the transfer of the cost constraint inside recursion instead yields an optimized program, where the fixpoint computation is performed more efficiently, eventually achieving the same shortest path values (as those produced in the perfect model of the earlier program) by simply copying the atoms from path under the name shortestpath after the least fixpoint computation terminates.
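The effect of this optimization can be sketched in a few lines of Python (an illustrative stand-in for the Datalog evaluation; the graph, function names and data layout are ours, not the paper's). The first routine mirrors the stratified program, deriving every path cost and aggregating afterwards; the second applies the min constraint inside every iteration, as PreM licenses, and reaches the same result while keeping only one cost per pair:

```python
def all_paths_then_min(arc):
    # Stratified evaluation: derive every path cost by fixpoint, then
    # aggregate min per (source, destination) pair at a higher stratum.
    facts = set(arc)
    changed = True
    while changed:
        changed = False
        for (x, z, d1) in list(facts):
            for (z2, y, d2) in arc:
                if z2 == z and (x, y, d1 + d2) not in facts:
                    facts.add((x, y, d1 + d2))
                    changed = True
    best = {}
    for (x, y, d) in facts:
        if (x, y) not in best or d < best[(x, y)]:
            best[(x, y)] = d
    return best

def prem_fixpoint(arc):
    # PreM-optimized evaluation: the min constraint is applied inside each
    # iteration, so only the cheapest known cost per pair is ever kept.
    best = {}
    for (x, y, d) in arc:
        if (x, y) not in best or d < best[(x, y)]:
            best[(x, y)] = d
    changed = True
    while changed:
        changed = False
        for (x, z), d1 in list(best.items()):
            for (z2, y, d2) in arc:
                if z2 == z:
                    d = d1 + d2
                    if (x, y) not in best or d < best[(x, y)]:
                        best[(x, y)] = d
                        changed = True
    return best
```

On a small acyclic graph such as arc = [("a","b",10), ("a","c",3), ("c","b",4), ("b","d",2)], both routines return the same shortest-distance map; note that the stratified version terminates here only because the example graph is acyclic, echoing the caveat about cycles above.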

Formal Definition of PreM. For a given Datalog program, let P be the rules defining a (set of mutually) recursive predicate(s) and T be the corresponding Immediate Consequence Operator (ICO) defined over P. Then, a constraint γ is said to be PreM to T (and to P) when, for every interpretation I of the program, we have γ(T(I)) = γ(T(γ(I))).

In Example 1, the final rule imposes the min constraint on path (representing all possible paths) to eventually yield the shortest path between all pairs of nodes. Thus, the aggregate-stratified program corresponds to γ applied after the least fixpoint of T in the definition of PreM. On the other hand, with the aggregate pushed inside recursion, the recursive rules represent the constrained ICO.

Properties. We now discuss some important results about PreM from zaniolo-tplp2017, and refer interested readers to that paper for the detailed proofs. Let T_γ denote the constrained immediate consequence operator, where the constraint γ is applied after the ICO T, i.e., T_γ(I) = γ(T(I)). The following results hold when γ is PreM to a positive program P with ICO T:

  1. If I is a fixpoint for T, then γ(I) is a fixpoint for T_γ, i.e., T_γ(γ(I)) = γ(I).

  2. For some integer n, if T_γ↑n(∅) = T_γ↑(n+1)(∅), then T_γ↑n(∅) is a minimal fixpoint for T_γ and T_γ↑n(∅) = γ(T↑ω(∅)), where T↑ω(∅) denotes the least fixpoint of T.

Provability. We can verify whether PreM holds for a recursive rule by explicitly validating γ(T(I)) = γ(T(γ(I))), i.e., by checking the condition at every iteration of the fixpoint computation. Put simply, we can verify whether the constraint can be pushed inside recursion by inserting an additional goal in the body of the rule as follows:

This additional goal in the body pre-applies the constraint, followed by the application of the ICO, i.e., it expresses γ(T(γ(I))). Note that the constraint is satisfied by Dxz if it is the minimum value seen so far in the fixpoint computation for the source-destination pair (X, Z). It is also evident that any other distance value between (X, Z) that violates the constraint cannot satisfy the aggregate at the head of the rule either, since a non-minimal Dxz cannot minimize the sum D for any given Dzy. Thus, this new goal in the body does not alter the ICO mapping defined by the original recursive rule, thereby proving that the constraint is PreM in this example program. More broadly, such additional goals can be formally characterized as "half functional dependencies", borrowing the terminology from the classical database theory of Functional and Multi-Valued Dependencies (FDs and MVDs). We next present the formal definition of a half FD from zaniolo-amw2018, which will be used later in our proofs.

Definition 1.

(Half Functional Dependency). Let R be a relation on a set of attributes V, and let X ⊂ V and A ∈ V − X. Considering the domain of A to be totally ordered, a tuple t ∈ R is said to satisfy the min-constraint (denoted as X →min A) when R contains no tuple with the same X-value and a smaller A-value. Similarly, a tuple t satisfies a max-constraint (denoted as X →max A) if R has no tuple with the same X-value and a larger A-value.

For any min or max constraint to be PreM to a positive program P, the corresponding half FD should hold for the relational view of the relevant recursive predicate across every interpretation of P, where the relational view of a predicate is the set of tuples formed by its ground atoms in a given interpretation. zaniolo-amw2018 provides generic templates, based on Functional and Multi-Valued Dependencies, for identifying constraints that satisfy PreM.

PreM with Semi-Naive Evaluation. A naive fixpoint computation trivially generates new atoms from the entire set of atoms available at the end of the last fixpoint iteration. Semi-naive evaluation improves over this naive fixpoint computation with the aid of the following enhancements:

  1. At every iteration, track only the new atoms produced.

  2. Rules are re-written into their differential versions, so that only new atoms are produced and old atoms are never generated redundantly.

  3. Ensure step (2) does not generate any duplicate atoms.

For programs where PreM can be applied, steps (1) and (2) remain identical. However, step (3) is extended so that (i) new atoms may not be retained if they do not satisfy the constraint, and (ii) existing atoms may be updated and thereafter tracked for the next iteration. For example, new atoms produced by the recursive rule are added to the working set and tracked only if a new source-destination (X, Y) path is discovered. On the other hand, if the newly produced path atom has a smaller distance than the one in the working set, then the distance of the existing path atom is updated to satisfy the min-constraint. However, new path atoms with larger distances are simply ignored. This understanding of PreM for semi-naive evaluation leads to a case for the SSP model, where significant communication can be saved by condensing multiple updates into one. This is discussed in detail later in Section 5.
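These extended semi-naive steps can be sketched in Python (a hedged illustration with our own names and data layout, not the paper's implementation): atoms enter the delta only when a pair is new or its cost improves, improved atoms overwrite the working set in place, and dominated atoms are dropped.

```python
def seminaive_prem(arc):
    # Semi-naive evaluation with the min constraint pre-mapped into recursion.
    best = {}   # working set: (source, dest) -> current minimum distance
    delta = {}  # atoms produced (or improved) in the previous round
    for (x, y, d) in arc:  # exit rule
        if (x, y) not in best or d < best[(x, y)]:
            best[(x, y)] = d
            delta[(x, y)] = d
    rounds = 0
    while delta:
        rounds += 1
        new_delta = {}
        # Differential rule: join only atoms from the last round with arc.
        for (x, z), d1 in delta.items():
            for (z2, y, d2) in arc:
                if z2 != z:
                    continue
                d = d1 + d2
                old = best.get((x, y))
                if old is None or d < old:
                    best[(x, y)] = d       # new atom, or update in place
                    new_delta[(x, y)] = d  # track it for the next round
                # atoms with larger distances are simply ignored
        delta = new_delta
    return best, rounds
```

Because dominated atoms never enter the delta, the evaluation terminates even on cyclic graphs, unlike the unconstrained all-paths computation.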

3 An Overview of Parallel Bottom-Up Evaluation

One of the early foundational works that established a standard technique to parallelize the bottom-up evaluation of linear recursive queries was presented in GangulyParallel. The authors proposed a substitution-partitioned parallelization scheme, where the set of possible ground substitutions, i.e., the base (extensional database) and derived relation (intensional database) atoms in the Datalog program, is disjointly partitioned using a hash-based discriminating function, so that each partition of possible ground substitutions is mapped to exactly one of the parallel workers. The entire computation is then divided among all the workers, operating in parallel, where each worker only processes the partition of ground substitutions mapped to it during the bottom-up semi-naive evaluation. Since each worker operates on a distinct, non-overlapping partition of ground substitutions, no two workers perform the same or redundant computation, i.e., the scheme is non-redundant. Formally, given a non-repetitive sequence of the variables appearing in the body of a rule and a finite set of parallel workers, a discriminating hash function over that sequence divides the workload by assigning each ground substitution, and the corresponding processing, to exactly one worker. The workers can send and receive information (ground instances from partially computed derived relations) to and from other workers to finish the assigned computation tasks. Ganguly et al. summarized the correctness of this parallelization scheme with the following result:

Correctness of Partitioned Parallelization Scheme. Let P be a recursive Datalog program to be executed over k workers. Under the partitioned parallelization scheme, let P_i be the program to be executed at worker i, and let P' = P_1 ∪ … ∪ P_k. Then, for every input, the least model of the recursive relation in P' is identical to the least model obtained from the sequential execution of P.

Note that the above parallelization strategy did not involve aggregates in recursion. Nevertheless, it was of significant consequence, since the scheme has been extended to derive lock-free parallel plans for shared-memory architectures, as well as sharded data-parallel decomposable plans for shared-nothing distributed architectures, to parallelize the bottom-up semi-naive evaluation of Datalog programs. We discuss them next with examples.

Shared-Memory Architecture. A trivial hash-based partitioning, as described above, can often lead to conflicts between different workers on a shared-memory architecture (for example, two distinct workers may update a path atom for the same (X,Y) pair if the hashing is done on the ground instances of the sequence {X,Z,Dxz,Z,Y,Dzy}, or even on the sequence {X,Z,Y}). This can be prevented by implementing classical locks to resolve read-write conflicts. Recently, however, yang-iclp2015 proposed a hash partitioning strategy based on discriminating sets that allows lock-free parallel evaluation of a broad class of generic queries, including non-linear queries. We illustrate this with our running all pairs shortest path example.

Assume the relations arc, path and shortestpath from Example 1 are partitioned by their first column (i.e., the source vertex), which serves as the discriminating set used for partitioning, using a hash function that maps the source vertex to an integer between 1 and k, the latter denoting the number of workers. Now, each worker can execute the following program in parallel:

  1. Each worker executes the exit rule by reading from its own partition of arc.

  2. Once all the workers finish step (1), each worker begins semi-naive evaluation with the recursive rule: it reads from its partition of path, joins with the corresponding atoms from the arc relation, which is shared across all the workers, and writes new atoms into the same partition of path.

  3. Once all the workers finish step (2), the semi-naive evaluation proceeds to the next iteration and repeats step (2) until the least fixpoint is reached.

  4. In the final step, each worker computes the shortestpath atoms for its own partition.

  5. The shortestpath data pooled across all the workers produce the final query result.

It is easy to observe that the above parallel execution does not require any locks, since each worker writes to exactly one partition and no two workers write to the same partition. We formally define the lock-free parallel bottom-up evaluation scheme next.
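The five steps above can be simulated sequentially in Python (a sketch under our own naming; a real system would run the workers concurrently). Each simulated worker owns the path partition selected by a hash of the source vertex and writes only there, while arc is shared read-only, so the merged result must match a single-executor evaluation:

```python
def sequential_apsp(arc):
    # Reference: single-executor fixpoint with the min constraint pre-mapped.
    best = {}
    def relax(key, d):
        if key not in best or d < best[key]:
            best[key] = d
            return True
        return False
    for (x, y, d) in arc:
        relax((x, y), d)
    changed = True
    while changed:
        changed = False
        for (x, z), d1 in list(best.items()):
            for (z2, y, d2) in arc:
                if z2 == z and relax((x, y), d1 + d2):
                    changed = True
    return best

def partitioned_apsp(arc, k=3):
    # Lock-free plan: worker w owns every path atom whose source hashes to w.
    part = lambda x: hash(x) % k
    paths = [dict() for _ in range(k)]
    def relax(w, key, d):
        if key not in paths[w] or d < paths[w][key]:
            paths[w][key] = d
            return True
        return False
    for (x, y, d) in arc:                  # step 1: exit rule per partition
        relax(part(x), (x, y), d)
    changed = True
    while changed:                         # steps 2-3: synchronized rounds
        changed = False
        for w in range(k):                 # each worker writes only paths[w]
            for (x, z), d1 in list(paths[w].items()):
                for (z2, y, d2) in arc:    # arc is shared, read-only
                    if z2 == z and relax(w, (x, y), d1 + d2):
                        changed = True
    merged = {}                            # steps 4-5: pool the partitions
    for p in paths:
        merged.update(p)
    return merged
```

No locks are needed because a derived atom path(X, Y, D) inherits the source X of the path atom it was derived from, so it always lands in the deriving worker's own partition.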

Definition 2.

(Lock-free Parallel Bottom-up Evaluation). Let P be a recursive Datalog program to be executed over k workers, and let T be the corresponding ICO for the sequential execution of P. Under the lock-free parallel plan executed over the k workers, let P_w be the program to be executed at worker w, producing an interpretation I_w of the recursive predicate with the corresponding ICO T_w. Then, for every input of base relations, we have, for w = 1, …, k, I_w = T_w↑ω(∅). It also follows from the correctness of the partitioned parallelization scheme that T↑ω(∅) = I_1 ∪ … ∪ I_k.

The underlying strategy of lock-free parallel plans, namely the use of disjoint data partitions, has also been adopted to execute data-parallel distributed bottom-up evaluations, as explained next.

Shared-Nothing Architecture. Distributed systems like BigDatalog bigdatalog also divide the entire dataset into disjoint data shards in a manner identical to the lock-free partitioning technique described above. Each data shard resides in the memory of one worker, and this partitioning scheme reduces the data shuffling required across different workers bigdatalog. In the context of shared-nothing architectures, this sharding scheme and the subsequent distributed bottom-up evaluation are termed a decomposable plan bigdatalog, rasql. In the rest of this paper, we use the term 'lock-free parallel plan' in the context of shared-memory architectures and 'parallel decomposable plan' in the context of distributed environments, for clarity.

Distributed systems like BigDatalog and SociaLite socialite perform the fixpoint computation under the BSP model with synchronized iterations. Note, however, that if each node caches the arc relation, then each node can operate independently, without any coordination or synchronization with the other nodes (i.e., step (3) listed before in the lock-free evaluation plan becomes unnecessary). The Myria system follows this asynchronous computing model for query evaluation. Interestingly, GangulyParallel showed that only a subclass of linear recursive queries (those whose corresponding dataflow graph has a cycle) can be executed in a coordination-free manner, i.e., asynchronously. Thus, for the large class of non-linear and even many linear recursive queries (e.g., the same generation query GangulyParallel), the BSP computing model has been the only viable option.

4 Parallel Evaluation with PreM

In this section, we examine whether PreM can be easily integrated into the lock-free parallel and parallel decomposable bottom-up evaluation plans that have been widely adopted across shared-memory and shared-nothing architectures for a broad range of generic queries. We next provide some interesting theoretical results.

Lemma 1. Let R be a relation defined over a set of attributes V, where X ⊂ V and A ∈ V − X. For a subset W of X (W ⊆ X), if R is divided into k disjoint subsets R_1, …, R_k using a hash function h applied to the W-values of the tuples, then a tuple in R_j satisfying X →min A (or X →max A) over R_j will also satisfy X →min A (or, respectively, X →max A) over R, and vice versa, where 1 ≤ j ≤ k.

This follows directly from the fact that, since W ⊆ X, for any two tuples t1, t2 ∈ R, if t1[X] = t2[X] then h(t1[W]) = h(t2[W]), i.e., any two tuples with the same X-value will be mapped into the same partition, determined by their common W-value. Since all tuples with the same X-value belong to a single partition, any tuple will satisfy X →min A (or X →max A) over both R_j and R.
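Lemma 1 can be checked mechanically on a toy relation (our own illustrative data and hash choice, not the paper's). Tuples here are pairs (x, a), with x the group-by attribute and a the cost; partitioning on x keeps each group whole, so per-partition and global min-constraint satisfaction coincide for every tuple:

```python
def satisfies_min(t, rel):
    # t satisfies the min-constraint iff no tuple in rel shares its
    # x-value while carrying a strictly smaller a-value.
    x, a = t
    return not any(x2 == x and a2 < a for (x2, a2) in rel)

rel = [("p", 3), ("p", 5), ("q", 2), ("r", 7), ("q", 4)]
k = 2
# Partition on the group-by attribute x: every group lands in one partition.
parts = [[t for t in rel if hash(t[0]) % k == w] for w in range(k)]
for t in rel:
    local = next(p for p in parts if t in p)
    # Satisfaction over the local partition equals satisfaction over all of rel.
    assert satisfies_min(t, local) == satisfies_min(t, rel)
```

Had we instead partitioned on the cost attribute a, a group could be split across partitions and a locally minimal tuple might not be globally minimal, which is exactly the situation the lemma's W ⊆ X condition rules out.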

Theorem 1.

Let P be a recursive Datalog program, let T be its corresponding ICO, and let the constraint γ be PreM to T and P, resulting in the constrained ICO T_γ. Let P be executed over k workers under a lock-free parallel (or parallel decomposable) bottom-up evaluation plan, where P_w is the program executed at worker w and T_w is the corresponding ICO defined over P_w. If the group-by arguments used for the constraint γ contain the discriminating set used for partitioning in the lock-free parallel (or parallel decomposable) plan, then:

  1. γ is also PreM to T_w and P_w, for w = 1, …, k.

  2. The minimal fixpoint of T_γ is the union of the minimal fixpoints of the per-worker constrained ICOs, i.e., T_γ↑ω(∅) = T_{γ,1}↑ω(∅) ∪ … ∪ T_{γ,k}↑ω(∅), where T_{γ,w} denotes the constrained ICO with respect to T_w.

The proof of (1) follows directly from Lemma 1 and the provability technique discussed earlier in Section 2.

Since γ is PreM to T and P, we have γ(T↑ω(∅)) = T_γ↑ω(∅) by the properties of PreM. Similarly, since γ is PreM to T_w and P_w for w = 1, …, k (from (1) of Theorem 1), we have γ(T_w↑ω(∅)) = T_{γ,w}↑ω(∅). Moreover, min and max constraints are trivially PreM to a union over disjoint sets zaniolo-tplp2017, i.e., γ(I_1 ∪ … ∪ I_k) = γ(I_1) ∪ … ∪ γ(I_k) when, as assumed, the partitioning respects the group-by arguments. Finally, recall from the definition of the lock-free parallel (or parallel decomposable) plan that T↑ω(∅) = T_1↑ω(∅) ∪ … ∪ T_k↑ω(∅). Combining these equalities, we get T_γ↑ω(∅) = γ(T↑ω(∅)) = γ(T_1↑ω(∅)) ∪ … ∪ γ(T_k↑ω(∅)) = T_{γ,1}↑ω(∅) ∪ … ∪ T_{γ,k}↑ω(∅), which is the claimed minimal fixpoint.

Thus, following Theorem 1, we can push the constraint inside the parallel recursive plan and rewrite the recursive rules accordingly for each worker:

We thus observe that pre-mappable constraints can also be easily pushed inside the lock-free parallel (or parallel decomposable) evaluation plans of recursive queries to yield the same minimal fixpoint, while making them computationally more efficient and safe. Hence, PreM can be easily incorporated into the parallel computation plans of systems like BigDatalog-MC, BigDatalog and Myria, irrespective of whether (1) they use a shared-memory or shared-nothing architecture, or (2) they follow a BSP or asynchronous computing model.

5 A Case for Relaxed Synchronization

We now consider a non-linear query that, with the application of PreM, is equivalent to the linear all pairs shortest path program. Because the query is non-linear, the program cannot be executed in a coordination-free manner or asynchronously following the technique described in GangulyParallel.

However, as shown in yang-iclp2015, a simple query rewriting technique can produce an equivalent parallel decomposable evaluation plan for this non-linear query, which can be executed by each worker on a distributed system following a bulk synchronous parallel model. In this decomposable evaluation plan, there is a mandatory synchronization step, where each worker (operating on its own partition) copies the new atoms or updates in path produced during the semi-naive evaluation of the recursive rule into its partition of path, which is then sent to the other workers so that they can use it in the evaluation of the next iteration.

In a bulk synchronous distributed computing model, the communication between the workers in each iteration can be considerably more expensive than the local computation performed by each worker, owing to the bottleneck of network bandwidth. We now investigate whether this synchronization constraint can be relaxed at every iteration.

Under a stale synchronous parallel (SSP) model, a worker can use, for its local computation, an obsolete or stale version of path that omits some recent updates produced by other workers. In particular, a worker at iteration t is guaranteed to see all the atoms and updates generated up to iteration t − s, where s is a user-specified threshold controlling the staleness. In addition, the worker's stale path may contain atoms or updates from iterations beyond t − s, i.e., from iterations t − s + 1 through t, although this is not guaranteed. The intuition is that in an SSP model a worker should always be able to see and use its own updates at every iteration, in addition to seeing and using as many updates from other workers as possible, under the constraint that no update older than a given age is missed. This is the bounded staleness constraint sspdef. This leads to two advantages:

  1. Workers spend more time performing actual computation, rather than idle waiting for other workers to finish. This is very helpful when straggling workers are present, which lag behind the others within an iteration. In fact, stragglers are an acute problem in distributed computing, since they can arise for several reasons, such as hardware differences disk1, system failures hardfail, skewed data distribution, or even software management issues and program interruptions caused by garbage collection or operating system noise oseffect.

  2. Workers can end up communicating less than under a BSP model. This is primarily because, under SSP, each worker can condense several updates computed over different local iterations into a single update before eventually sending it to the other workers.

We illustrate the above advantages through an example. Figure 1 shows a toy graph that is distributed across two workers: (i) all edges incident on nodes 1-4 are available on worker 0, and (ii) the rest of the edges reside on worker 1. Now consider the shortest path between nodes 4 and 8, given by the path 4-3-2-1-5-6-7-8, which spans 7 hops. The parallel program with BSP processing would require at least three synchronized iterations to reach the least fixpoint by semi-naive evaluation. Now consider worker 1 to be a straggling node that lags behind worker 0 during the computation because of hardware differences. Worker 0 then spends significant time idle waiting for worker 1 to complete, as shown in Figure 2a. But in this example, the shortest path between nodes 4 and 8 changes because of two aspects: (1) the shortest path between nodes 4 and 1 changes, and (2) the shortest path between nodes 5 and 8 changes. These two computations can be carried out independently on the two workers: worker 0 only needs to know the eventual shortest path between nodes 5 and 8 calculated by worker 1, and vice versa. It is important to note that this works only if each worker can use the most recent local updates (newest atoms) generated by itself. In other words, worker 0 should be able to see the changes to the shortest path between node 4 and node 1 in every iteration (since they are generated locally), while using stale (obsolete) knowledge about the shortest path between nodes 5 and 8 (as sent earlier by worker 1). This stale synchronization model is summarized in Figure 2b.

Figure 1: A toy graph distributed across two workers.
Figure 2: BSP vs. SSP model for evaluating all pairs shortest path query on two workers.

In this same example, note how the minimum cost of the path between node 1 and node 4 (computed by worker 0) changes in every iteration: (i) in the first iteration, the minimum cost is 10, given by the edge between node 1 and node 4; (ii) in the next iteration, the minimum cost drops to 7, given by the path 1-3-4; and (iii) in the third iteration, the final minimum cost of 5 is given by the sequence 1-2-3-4. In a BSP model, each of these updates needs to be communicated to all the remaining workers in the iteration that generates it. In an SSP model, by contrast, the staleness allowance lets these multiple updates from different local iterations be condensed into the single most recent update, which is then sent to the other workers. In other words, SSP with PreM may skip sending some updates to remote workers, thus saving communication time.
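This message accounting can be made concrete in a few lines of Python (a toy bookkeeping model of our own; the update sequence 10, 7, 5 is the one traced above). Under BSP every per-iteration minimum is broadcast, whereas under SSP later updates overwrite pending ones and only the condensed value is flushed:

```python
local_updates = [10, 7, 5]          # worker 0's per-iteration minima for (1, 4)

# BSP: one broadcast per synchronized iteration.
bsp_messages = [("path", 1, 4, d) for d in local_updates]

# SSP: flush at most once every `staleness` local iterations; a later update
# overwrites a pending one, so intermediate values are never sent.
staleness = 3
ssp_messages = []
pending = None
for i, d in enumerate(local_updates, start=1):
    pending = d                      # condense: keep only the newest value
    if i % staleness == 0:
        ssp_messages.append(("path", 1, 4, pending))
        pending = None
if pending is not None:              # final flush before synchronizing
    ssp_messages.append(("path", 1, 4, pending))
```

Here BSP sends three messages while SSP sends one, and the other workers still learn the final cost of 5; correctness is unaffected because, under the min constraint, only the last (smallest) value matters.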

Figure 3 formally presents the SSP-processing-based bottom-up evaluation plan for the non-linear all pairs shortest path example. If the evaluation is executed over a distributed system of multiple workers, Figure 3 depicts the execution plan for one worker. A coordinator marks the completion of the overall evaluation process by individually tracking the termination of each worker's task. For simplicity and clarity, we describe the evaluation plan using the naive fixpoint computation instead of the optimized differential fixpoint algorithm. The γ used in Figure 3 denotes the min constraint. Step (3) in this evaluation plan shows how a worker uses stale knowledge from other workers during the recursive rule evaluation, shown in step (4). It is also important to note that in step (4) each worker also uses the most recent atoms generated by itself. The condition in step (6) allows each local computation to run until it reaches a local fixpoint or has advanced by at least a bounded number of iterations; each worker can thus condense the multiple updates generated within these iterations, due to PreM, into a single update. Finally, step (9) ensures that if any worker falls behind the user-defined staleness bound, the other workers wait for it to catch up to within the desired staleness level before starting their local computations again. We next present some theoretical and empirical results about SSP-based bottom-up evaluation.

1:  initialize the local recursive relation with the base (EDB) atoms assigned to worker i; set the local iteration counter to 0
2:  repeat
3:     incorporate the last updates received by worker i from the other workers
4:     apply the constrained ICO (the recursive rules with the min constraint γ pushed inside), joining the stale remote atoms with the most recent atoms generated locally
5:     increment the local iteration counter; record the atoms that changed
6:  until a local fixpoint is reached or s local iterations have been performed
7:  condense the updates produced since the last communication into a single update
8:  Send the condensed update to the other workers.
9:  if worker i is ahead of any worker j by more than the staleness bound s then
10:     Wait for a new update from worker j before continuing.
11:  end if
12:  if a local fixpoint has not been reached or a new update has been received from some worker j then
13:     repeat from Step (2)
14:  else
15:     Send a completion message to the coordinator.
16:     if any new update is received from some worker j then
17:        Send a resumption message to the coordinator.
18:        Repeat from step (2).
19:     end if
20:  end if
Figure 3: SSP-based bottom-up evaluation plan executed by worker i for computing all-pairs shortest paths.
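To make the plan of Figure 3 concrete, the following is a minimal single-process sketch of SSP-style evaluation for all-pairs shortest paths, under our own simplifying assumptions: vertices are hash-partitioned over the workers, communication rounds are simulated sequentially, and the staleness bound caps the local iterations per round. Class and variable names are illustrative, not the paper's implementation.

```python
# Toy edge list: (source, dest, cost); pair (X, Y) is owned by worker X % K.
EDGES = {(1, 2, 1), (2, 3, 2), (3, 4, 2), (1, 3, 5), (1, 4, 10)}
K = 2        # number of workers
SLACK = 2    # staleness bound s: max local iterations per communication round

class Worker:
    def __init__(self, wid):
        self.wid = wid
        # Current minimum cost per (X, Y) pair owned by this worker.
        self.tc = {(x, y): c for x, y, c in EDGES if x % K == wid}
        self.inbox = {}  # possibly stale atoms received from other workers

    def local_round(self):
        """Steps (2)-(6) of the plan: iterate locally up to SLACK times,
        joining the (stale) inbox atoms with the freshest local atoms,
        and condense all improvements into one outgoing update."""
        delta = {}
        for _ in range(SLACK):
            known = {**self.inbox, **self.tc}
            changed = False
            for (x, y), c1 in list(known.items()):
                for (y2, z), c2 in list(known.items()):
                    # Only update pairs this worker owns (x % K == wid).
                    if y == y2 and x % K == self.wid:
                        new = c1 + c2
                        if new < self.tc.get((x, z), float('inf')):
                            self.tc[(x, z)] = new  # min constraint, eagerly
                            delta[(x, z)] = new    # newer value overwrites older
                            changed = True
            if not changed:
                break  # local fixpoint reached before exhausting the slack
        return delta

workers = [Worker(i) for i in range(K)]
# Round 0: every worker broadcasts its base atoms (the initial delta).
for w in workers:
    for other in workers:
        if other is not w:
            w.inbox.update(other.tc)

while True:
    deltas = [w.local_round() for w in workers]
    if not any(deltas):
        break  # all workers at a fixpoint with no pending updates
    for w, d in zip(workers, deltas):
        for other in workers:
            if other is not w:
                other.inbox.update(d)

shortest = {}
for w in workers:
    shortest.update(w.tc)
print(shortest[(1, 4)])  # 5, via 1-2-3-4
```

Each call to `local_round` returns one condensed update per round, mirroring how a real SSP worker would batch the improvements found across its slack iterations.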

6 Bottom-up Evaluation with SSP Processing

Under the SSP model, a recursive query evaluation with γ constraints satisfying PreM has the following theoretical guarantees:

Theorem 2.

Let P be a recursive Datalog program with ICO T, and let the constraint γ be PreM to T and P. Let P have a parallel decomposable evaluation plan that can be executed over k workers, where P_i is the program executed at worker i and T_i is the corresponding ICO defined over P_i. If γ is also PreM to T_i and P_i for each 1 ≤ i ≤ k, then:

  1. The SSP processing yields the same minimal fixpoint of P as would have been obtained with BSP processing.

  2. If any worker i under BSP processing requires n_i rounds of synchronization, then worker i under SSP processing requires at most n_i rounds to reach the minimal fixpoint, where r rounds of synchronization in the SSP model means that every worker has sent at least r updates.

The proof is provided in Appendix A.
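As a quick empirical illustration of part (1), the following sketch (our own, on an assumed toy graph) checks that a min-cost fixpoint is insensitive to the order in which derived atoms become visible; this order-independence is the essence of why stale reads do not change the final result.

```python
import random

EDGES = [(1, 2, 1), (2, 3, 2), (3, 4, 2), (1, 3, 5), (1, 4, 10)]

def min_cost_fixpoint(edges, shuffle=False):
    """Naive fixpoint of min-cost transitive closure; optionally visit
    atoms in a random order to emulate arbitrary arrival of updates."""
    tc = {}
    for x, y, c in edges:
        tc[(x, y)] = min(c, tc.get((x, y), float('inf')))
    changed = True
    while changed:
        changed = False
        pairs = list(tc.items())
        if shuffle:
            random.shuffle(pairs)  # emulate arbitrary/stale visibility order
        for (x, y), c1 in pairs:
            for (y2, z), c2 in pairs:
                if y == y2 and c1 + c2 < tc.get((x, z), float('inf')):
                    tc[(x, z)] = c1 + c2
                    changed = True
    return tc

reference = min_cost_fixpoint(EDGES)
assert all(min_cost_fixpoint(EDGES, shuffle=True) == reference
           for _ in range(20))
print(reference[(1, 4)])  # 5, regardless of evaluation order
```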

6.1 SSP Evaluation of Queries without a γ Constraint

We now consider the parallel decomposable plan of a transitive closure query, which does not contain any aggregates in recursion. We use the same non-linear recursive example from bigdatalog-mc, which gives the program executed by worker i. Note that in this example every worker i eventually has to compute and send to the other workers all tc atoms of the form tc(X, Y), where X hashes to worker i. Without a γ constraint, a worker never updates its existing tc atoms. In fact, during semi-naive evaluation of this query, at any time, only new unique atoms are appended to tc. Thus, an SSP evaluation of the transitive closure query (which has no γ constraint) does not save any communication cost compared to a BSP model. However, as shown in our experimental results next, the SSP model can still mitigate the influence of stragglers.
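The append-only behavior can be seen in a small semi-naive sketch (ours; the predicate name follows the tc example): without aggregates, each iteration only contributes brand-new atoms, so there is nothing for an SSP worker to condense.

```python
def transitive_closure(edges):
    """Semi-naive evaluation of plain transitive closure: join only the
    last iteration's delta, and append genuinely new atoms to tc."""
    tc = set(edges)
    delta = set(edges)           # atoms discovered in the last iteration
    while delta:
        new = {(x, z) for (x, y) in delta for (y2, z) in tc if y == y2}
        new |= {(x, z) for (x, y) in tc for (y2, z) in delta if y == y2}
        delta = new - tc         # keep only genuinely new atoms
        tc |= delta              # append; existing atoms are never updated
    return tc

edges = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(edges)))
# [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```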

6.2 Experimental Results

Setup. We conduct our experiments on a 12-node cluster, where each node, running Ubuntu 14.04 LTS, has an Intel i7-4770 CPU (3.40 GHz, 4 cores) with 32 GB memory and a 1 TB 7200 RPM hard drive. The compute nodes are connected with a 1 Gbit network. Following the standard practices established in bigdatalog, bigdatalog-mc, we execute the distributed bottom-up semi-naive evaluation using an AND/OR tree based implementation in Java on each node. Each node executes one application thread per core. We evaluate both the non-linear all-pairs shortest path and transitive closure queries on a subset of the real-world Orkut social network data (http://snap.stanford.edu/data/com-Orkut.html).

Inducing Stragglers.

In order to study the influence of straggling nodes on a declarative recursive computation, we induce stragglers in our implementation following the strategy described in SSPCui. In particular, each of the nodes in our setup can be disrupted independently by a CPU-intensive background process that kicks in at instants following a Poisson distribution and consumes at least half of the CPU resources.
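A hypothetical version of this injection strategy can be sketched as follows; the function names, the thread-based CPU hog, and all parameters are our own illustrative choices, not those of the cited setup (which uses a background process consuming at least half the CPU).

```python
import random
import threading
import time

def cpu_hog(seconds, n_threads=2):
    """Spawn busy-loop threads that steal CPU time for `seconds`.
    (A process-based hog would contend for whole cores; threads keep
    this sketch portable and easy to run.)"""
    stop_at = time.time() + seconds
    def burn():
        while time.time() < stop_at:
            pass  # burn cycles until the deadline
    threads = [threading.Thread(target=burn) for _ in range(n_threads)]
    for t in threads:
        t.start()
    return threads

def straggler_daemon(rate_per_min=1.0, hog_seconds=5.0, horizon=60.0):
    """Disrupt this node at Poisson-distributed instants until `horizon`:
    the inter-arrival gaps of a Poisson process are exponentially
    distributed, so we sample them with random.expovariate."""
    elapsed = 0.0
    while True:
        gap = random.expovariate(rate_per_min / 60.0)
        if elapsed + gap >= horizon:
            break
        elapsed += gap
        time.sleep(gap)
        for t in cpu_hog(hog_seconds):
            t.join()
```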

Analysis. In this section, we empirically analyze the merits and demerits of an SSP model over a BSP model by examining the following questions: (1) How does an SSP model compare to a BSP model when queries contain constraints and aggregates in recursion? (2) How do the two processing paradigms compare when PreM cannot be applied? (3) And how do the overall performances in the above scenarios change in the presence and absence of stragglers? Table 1 captures the first case with the all-pairs shortest path query (where PreM is applicable), while Table 2 presents the second case with the transitive closure query, which does not contain any aggregates or constraints in recursion. For each of the two cases, as shown in the tables, we experimented with two different staleness values for the SSP model, both in the presence and absence of induced stragglers. Notably, an SSP model with bounded staleness (alternatively called 'slack' and indicated by s in the tables) set to zero reduces to a BSP model. Tables 1 and 2 report the average execution time for the query at hand under different configurations over five runs. This run time can be divided into two components: (1) the average computation time, which is the average time spent by the workers performing semi-naive evaluation for the recursive computation, and (2) the average waiting time, which is the average time spent by the workers waiting to receive a new update to resume computation. Tables 1 and 2 show the run-time breakdown for the two aforementioned cases (with and without PreM, respectively).

From Tables 1 and 2, it is evident that BSP processing requires the least compute time irrespective of straggling nodes. This is intuitively true because the total recursive computation involved in a BSP-based distributed semi-naive evaluation is similar to that of a single-executor sequential execution, and as such a BSP model should require the least computational effort to reach the minimal fixpoint. On the other hand, an SSP model may optimistically perform many local computations with obsolete data under relaxed synchronization barriers, and these computations can later turn out to be redundant. As shown in the tables, the average compute time indeed increases with higher slack, indicating that a substantial amount of the work becomes unnecessary. However, as seen from both tables, SSP plays a major role in reducing the average wait time. This is trivially true, since in SSP processing any worker can move ahead with local computations using stale knowledge, instead of waiting for global synchronization as required in BSP. Note, however, that the reduction in average wait time under the SSP model in Table 1 (with PreM) is more significant than in Table 2 (without PreM). This can be attributed to the fact that, with PreM, semi-naive evaluation (Section 2) under the SSP model can batch multiple updates together before sending them, thereby saving communication cost. For the transitive closure query (without PreM), on the other hand, the overall updates sent in the BSP and SSP models are similar (since no aggregates are used, semi-naive evaluation only produces new atoms and never updates existing ones). Thus, in the latter case (Table 2), the wait times of the BSP and SSP models are comparable when there are no induced stragglers, whereas the wait time in SSP is marginally better than BSP when stragglers are present. Naturally, inducing stragglers increases the average wait time across the board compared to the no-straggler situation. The compute time also increases marginally in the presence of stragglers, primarily because the straggling nodes take longer to finish their computations.

Thus, to summarize based on the run times in the two tables: in the absence of stragglers, the SSP model can reduce the run time of the shortest path query (with the γ constraint) by nearly 30%. The same is not true for the transitive closure query, which does not have any such constraint. Hence, a BSP model suffices if there are no stragglers and the query does not contain any γ constraint. However, in the presence of stragglers or γ constraints, the SSP model turns out to be a better alternative to the BSP model, as it can reduce execution time by as much as 40% for the shortest path query and nearly 7% for the transitive closure query. Finally, it is also worth noting from the results that too much slack can increase query latency. Thus, a moderate amount of slack should be used in practice.

Time consumption     No stragglers (time in sec)        With stragglers (time in sec)
                     BSP (s=0)  SSP (s=3)  SSP (s=6)    BSP (s=0)  SSP (s=3)  SSP (s=6)
Avg. compute time    2224       2443       3038         2664       2749       3435
Avg. wait time       1679        302        408         2786        485        704
Run time             3903       2745       3446         5450       3234       4139
Table 1: Comparing the BSP vs. SSP model for the all-pairs shortest path query containing aggregates in recursion (with PreM).
Time consumption     No stragglers (time in sec)        With stragglers (time in sec)
                     BSP (s=0)  SSP (s=3)  SSP (s=6)    BSP (s=0)  SSP (s=3)  SSP (s=6)
Avg. compute time     682        762        879          754        827        921
Avg. wait time        367        345        334          618        456        412
Run time             1049       1107       1213         1372       1283       1431
Table 2: Comparing the BSP vs. SSP model for the transitive closure query containing no aggregates in recursion (without PreM).

7 Conclusion

PreM facilitates and extends the use of aggregates in recursion, and this enables a wide spectrum of graph and data mining algorithms to be expressed efficiently in declarative languages. In this paper, we explored various improvements to scalability via parallel execution with PreM. In fact, PreM can be easily integrated with most of the current generation of Datalog engines, like BigDatalog, Myria, BigDatalog-MC, SociaLite, and LogicBlox, irrespective of their architectural differences and varying synchronization constraints. Moreover, in this paper we have shown that PreM brings additional benefits to the parallel evaluation of recursive queries. To that end, we established the necessary theoretical framework that allows bottom-up recursive computations to be carried out over the stale synchronous parallel model, in addition to the synchronous or completely asynchronous computing models studied in the past. These theoretical developments lead us to the conclusion, confirmed by initial experiments, that the parallel execution of non-linear queries with constraints can be expedited with a stale synchronous parallel (SSP) model. This model is also useful in the absence of such constraints, where bounded staleness may not reduce communication, but it nevertheless mitigates the impact of stragglers. Initial experiments performed on a real-world dataset confirm the theoretical results and are quite promising, paving the way toward future research in many interesting areas where declarative recursive computation under SSP processing can be quite advantageous. For example, declarative advanced stream reasoning systems astro-cikm, supporting aggregates in recursion, can adopt the distributed SSP model to query evolving graph data, especially when one portion of the network changes more rapidly than others. Under such scenarios, SSP models offer the flexibility to batch multiple network updates together, thereby reducing communication costs effectively.

Finally, it is important to note that the methodologies developed here can also be applied to declarative logic-based systems beyond Datalog, such as SQL-based query engines rasql, which also use semi-naive evaluation for recursive computation. In addition, the SSP processing paradigm can be adopted in many state-of-the-art graph-centric platforms such as Pregel pregel and GraphLab graphlab. These modern graph engines use a vertex-centric computing model vertexProgramming, which enforces a strong consistency requirement among its model variables under the "Gather-Apply-Scatter" abstraction. Consequently, the synchronization cost of these graph frameworks is similar to that of standard BSP systems. Thus, for many distributed graph computation problems involving aggregators (like shortest path queries), the SSP model, as demonstrated in this paper, can be quite useful for these graph-based platforms.

References

  • Ameloot (2014) Ameloot, T. J. 2014. Declarative networking: Recent theoretical work on coordination, correctness, and declarative semantics. SIGMOD Rec. 43, 2 (Dec.), 5–16.
  • Ameloot et al. (2017) Ameloot, T. J., Geck, G., Ketsman, B., Neven, F., and Schwentick, T. 2017. Parallel-correctness and transferability for conjunctive queries. J. ACM 64, 5 (Sept.), 36:1–36:38.
  • Ameloot et al. (2015) Ameloot, T. J., Ketsman, B., Neven, F., and Zinn, D. 2015. Weaker forms of monotonicity for declarative networking: A more fine-grained answer to the calm-conjecture. ACM Trans. Database Syst. 40, 4 (Dec.), 21:1–21:45.
  • Ameloot et al. (2013) Ameloot, T. J., Neven, F., and Van Den Bussche, J. 2013. Relational transducers for declarative networking. J. ACM 60, 2 (May), 15:1–15:38.
  • Ananthanarayanan et al. (2010) Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B., and Harris, E. 2010. Reining in the outliers in map-reduce clusters using Mantri. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation. OSDI'10. 265–278.
  • Aref et al. (2015) Aref, M., ten Cate, B., Green, T. J., Kimelfeld, B., Olteanu, D., Pasalic, E., Veldhuizen, T. L., and Washburn, G. 2015. Design and implementation of the logicblox system. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1371–1382.
  • Beckman et al. (2006) Beckman, P., Iskra, K., Yoshii, K., and Coghlan, S. 2006. The influence of operating systems on the performance of collective operations at extreme scale. In 2006 IEEE International Conference on Cluster Computing. 1–12.
  • Cipar et al. (2013) Cipar, J., Ho, Q., Kim, J. K., Lee, S., Ganger, G. R., Gibson, G., Keeton, K., and Xing, E. 2013. Solving the straggler problem with bounded staleness. In Proceedings of the 14th USENIX Conference on Hot Topics in Operating Systems. HotOS’13. 22–22.
  • Condie et al. (2018) Condie, T., Das, A., Interlandi, M., Shkapsky, A., Yang, M., and Zaniolo, C. 2018. Scaling-up reasoning and advanced analytics on bigdata. TPLP 18, 5-6, 806–845.
  • Cui et al. (2014) Cui, H., Cipar, J., Ho, Q., Kim, J. K., Lee, S., Kumar, A., Wei, J., Dai, W., Ganger, G. R., Gibbons, P. B., Gibson, G. A., and Xing, E. P. 2014. Exploiting bounded staleness to speed up big data analytics. In USENIX ATC. 37–48.
  • Das et al. (2018) Das, A., Gandhi, S. M., and Zaniolo, C. 2018. Astro: A datalog system for advanced stream reasoning. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. CIKM ’18. 1863–1866.
  • Ganguly et al. (1992) Ganguly, S., Silberschatz, A., and Tsur, S. 1992. Parallel bottom-up processing of datalog queries. J. Log. Program. 14, 1-2 (Oct.), 101–126.
  • Gu et al. (2019) Gu, J., Watanabe, Y., Mazza, W., Shkapsky, A., Yang, M., Ding, L., and Zaniolo, C. 2019. Rasql: Greater power and performance for big data analytics with recursive-aggregate-sql on spark. In SIGMOD’19.
  • Ho et al. (2013) Ho, Q., Cipar, J., Cui, H., Kim, J. K., Lee, S., Gibbons, P. B., Gibson, G. A., Ganger, G. R., and Xing, E. P. 2013. More effective distributed ml via a stale synchronous parallel parameter server. In NIPS. 1223–1231.
  • Interlandi and Tanca (2018) Interlandi, M. and Tanca, L. 2018. A datalog-based computational model for coordination-free, data-parallel systems. Theory and Practice of Logic Programming 18, 5-6, 874–927.
  • Krevat et al. (2011) Krevat, E., Tucek, J., and Ganger, G. R. 2011. Disks are like snowflakes: No two are alike. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems. HotOS’13. 14–14.
  • Lee et al. (2014) Lee, S., Kim, J. K., Zheng, X., Ho, Q., Gibson, G. A., and Xing, E. P. 2014. On model parallelization and scheduling strategies for distributed machine learning. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. NIPS’14. 2834–2842.
  • Low et al. (2012) Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., and Hellerstein, J. M. 2012. Distributed graphlab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5, 8, 716–727.
  • Malewicz et al. (2010) Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. 2010. Pregel: A system for large-scale graph processing. In SIGMOD’10. 135–146.
  • Mazuran et al. (2013) Mazuran, M., Serra, E., and Zaniolo, C. 2013. Extending the power of datalog recursion. The VLDB Journal 22, 4 (Aug.), 471–493.
  • Seo et al. (2013) Seo, J., Park, J., Shin, J., and Lam, M. S. 2013. Distributed socialite: A datalog-based language for large-scale graph analysis. Proc. VLDB Endow. 6, 14 (Sept.), 1906–1917.
  • Shkapsky et al. (2016) Shkapsky, A., Yang, M., Interlandi, M., Chiu, H., Condie, T., and Zaniolo, C. 2016. Big data analytics with datalog queries on spark. In SIGMOD. ACM, New York, NY, USA, 1135–1149.
  • Wang et al. (2015) Wang, J., Balazinska, M., and Halperin, D. 2015. Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines. Proc. VLDB Endow. 8, 12 (Aug.), 1542–1553.
  • Yan et al. (2015) Yan, D., Cheng, J., Lu, Y., and Ng, W. 2015. Effective techniques for message reduction and load balancing in distributed graph computation. In WWW. 1307–1317.
  • Yang et al. (2015) Yang, M., Shkapsky, A., and Zaniolo, C. 2015. Parallel bottom-up evaluation of logic programs: DeALS on shared-memory multicore machines. In Technical Communications of ICLP.
  • Yang et al. (2017) Yang, M., Shkapsky, A., and Zaniolo, C. 2017. Scaling up the performance of more powerful datalog systems on multicore machines. VLDB J. 26, 2, 229–248.
  • Zaniolo et al. (2016) Zaniolo, C., Yang, M., Das, A., and Interlandi, M. 2016. The magic of pushing extrema into recursion: Simple, powerful datalog programs. In AMW.
  • Zaniolo et al. (2017) Zaniolo, C., Yang, M., Interlandi, M., Das, A., Shkapsky, A., and Condie, T. 2017. Fixpoint semantics and optimization of recursive Datalog programs with aggregates. TPLP 17, 5-6, 1048–1065.
  • Zaniolo et al. (2018) Zaniolo, C., Yang, M., Interlandi, M., Das, A., Shkapsky, A., and Condie, T. 2018. Declarative bigdata algorithms via aggregates and relational database dependencies. In AMW.

Appendix A SSP processing based recursive computation with PreM

Definition 3.

(γ-Cover). Let P be a positive recursive Datalog program with T as its corresponding ICO. Let a constraint γ be defined over the recursive predicate on a set of group-by arguments, with the cost argument denoted as c. Let γ also be PreM to T and P. Let there be two sets S and S', both of which contain tuples of the form (g, c), where g stands for the group-by arguments and c is drawn from the set of real numbers. Now, S' is defined as the γ-cover for S if for every tuple (g, c) in S, there exists exactly one tuple (g', c') in S' such that (i) g' = g and (ii) c' is at least as good as c under γ (i.e., c' ≤ c for a min constraint).

It is important to note from the above definition that if S' is the γ-cover for S, then there can exist a tuple (g', c') in S' such that no tuple in S shares its group-by arguments, but the converse is never true.

Lemma 2. Let P be a recursive Datalog program, let T be its corresponding ICO, and let the constraint γ be PreM to T and P, resulting in the constrained ICO T_γ. Now, for any pair of positive integers (m, n), where m ≥ n, T_γ^m is a γ-cover for T_γ^n.

This directly follows from the fact that any atom in T_γ^n with cost c can only exist in T_γ^m with an updated cost c' if c' < c or c' = c. Note that if m = n, then the lemma is trivially true.
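Under our reconstructed notation, the cover relation and the containment claimed by Lemma 2 can be checked mechanically; the sets below are illustrative, standing in for one and two applications of the constrained ICO.

```python
def is_cover(s_prime, s):
    """True iff s_prime is a cover for s in the min-constraint case:
    every group in s reappears in s_prime with a cost no larger.
    Atoms are (group_by, cost) pairs."""
    best = {}
    for group, cost in s_prime:
        best[group] = min(cost, best.get(group, float('inf')))
    return all(group in best and best[group] <= cost
               for group, cost in s)

one_step = {((1, 4), 10), ((1, 3), 5)}               # fewer ICO applications
two_steps = {((1, 4), 7), ((1, 3), 3), ((2, 4), 4)}  # more ICO applications
assert is_cover(two_steps, one_step)      # more steps cover fewer steps
assert not is_cover(one_step, two_steps)  # the converse need not hold
print("cover check passed")
```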

Lemma 3. Let P be a recursive Datalog program with ICO T and let the constraint γ be PreM to T and P. Let P also have a parallel decomposable evaluation plan that can be executed over k workers, where P_i is the program executed at worker i and T_i is the corresponding ICO defined over P_i. Let γ also be PreM to T_i and P_i, for each 1 ≤ i ≤ k. After r rounds of synchronization (where r rounds of synchronization in the SSP model means every worker has sent at least r updates), if I_i^BSP and I_i^SSP denote the interpretations of the recursive predicate under the BSP and SSP models, respectively, for any worker i, then I_i^SSP is a γ-cover for I_i^BSP.

In an SSP-based fixpoint computation, any worker i can produce an atom in three ways:

  1. From local computation not involving any of the updates sent by other workers.

  2. From a join with a new atom or an update sent by another worker j.

  3. From both cases (1) and (2) together.

Now, consider the base case, where before the first round of synchronization (i.e., at round r = 0) each worker performs only local computation, since it has not received/sent any update from/to any other worker. Since, in an SSP model, each local computation may involve multiple iterations (as shown in step (6) in Figure 3), I_i^SSP is trivially a γ-cover for I_i^BSP (from Lemma 2).

We next assume this hypothesis (Lemma 3) to be true for some r = t. Under this assumption, each worker i in the SSP model bases its fixpoint computation on its own interpretation and on the interpretations sent by the other workers after the t-th round of synchronization. And since each of the interpretations involved is a γ-cover for the corresponding BSP interpretation, the aforementioned cases (1)-(3) will also produce a γ-cover for the (t+1)-th synchronization round.

Hence, by the principle of mathematical induction, the lemma holds for all r.

Theorem 2.

Let P be a recursive Datalog program with ICO T, and let the constraint γ be PreM to T and P. Let P have a parallel decomposable evaluation plan that can be executed over k workers, where P_i is the program executed at worker i and T_i is the corresponding ICO defined over P_i. If γ is also PreM to T_i and P_i for each 1 ≤ i ≤ k, then:

  1. The SSP processing yields the same minimal fixpoint of P as would have been obtained with BSP processing.

  2. If any worker i under BSP processing requires n_i rounds of synchronization, then worker i under SSP processing requires at most n_i rounds to reach the minimal fixpoint, where r rounds of synchronization in the SSP model means that every worker has sent at least r updates.

Theorem 1 guarantees that the BSP evaluation of the Datalog program with PreM yields the minimal fixpoint of P. Note that in the SSP evaluation, every tuple produced by a worker i from the program P_i satisfies the γ constraint. In other words, if I^SSP represents the final interpretation of the recursive predicate under SSP evaluation, then I^SSP is bounded by γ. It also follows from Lemma 3 that I^SSP is a γ-cover for the final interpretation of the recursive predicate under BSP evaluation, i.e., I^SSP is a γ-cover for I^BSP. Since I^BSP is the least fixpoint under the γ constraint, the covering atoms cannot have lower costs, i.e., the atoms of I^BSP must appear in I^SSP with identical cost.

Thus, based on the above discussion, we can write:

I^BSP ⊆ I^SSP.     (1)

Also recall that, since γ is PreM to each T_i and P_i, under the SSP evaluation each worker also applies γ in every iteration of its fixpoint computation (step (4) in Figure 3), so no atom violating the constraint survives. Thus, we have:

I^SSP ⊆ I^BSP.     (2)

Combining equations (1) and (2), we get I^SSP = I^BSP. Thus, the SSP evaluation yields the same minimal fixpoint as the BSP model.

Since the interpretation of the recursive predicate in the least model obtained from the BSP evaluation is identical to that in the least model obtained from SSP processing, it directly follows from Lemma 3 that the number of synchronization rounds required by worker i in the SSP evaluation is at most n_i, where n_i is the number of rounds worker i takes under the BSP model.