1 Introduction
The growing interest in Datalog-based declarative systems like LogicBlox logicblox, BigDatalog bigdatalog, SociaLite socialite, BigDatalogMC bigdatalogmc and Myria has brought together important advances on two fronts: (i) Datalog, with support for aggregates in recursion monotonic_agg, has sufficient power to succinctly express declarative applications ranging from complex graph queries to advanced data mining tasks, such as frequent pattern mining and decision tree induction scalinguptplp2018. (ii) Modern architectures supporting in-memory parallel and distributed computing can deliver scalability and performance for this new generation of Datalog systems.
For example, BigDatalog (bulk synchronous parallel processing on a shared-nothing architecture), BigDatalogMC (lock-free parallel processing on a shared-memory multicore architecture) and Myria (asynchronous processing on a shared-nothing architecture) spearheaded the system-level scheduling, planning and optimization for different parallel computing models. This line of work was quite successful for Datalog, and also for recursive SQL queries that have borrowed this technology rasql. Indeed, our recent general-purpose Datalog systems surpassed commercial graph systems like GraphX on many classical graph queries in terms of performance and scalability bigdatalog.
Much of the theoretical groundwork contributing to the success of these parallel Datalog systems was laid out in the 90s. For example, in their foundational work, GangulyParallel investigated parallel coordination-free (asynchronous) bottom-up evaluations of simple linear recursive programs (without any aggregates). In fact, many recent works have pushed this idea forward under the broader umbrella of the CALM conjecture (Consistency And Logical Monotonicity) CalmConjecture, which establishes that monotonic Datalog programs (Datalog without negation or aggregates) can be computed in an eventually consistent, coordination-free manner AmelootCalm2, AmelootCalmRefined. This line of work led to the asynchronous data-parallel (for Myria) and lock-free evaluation plans of many of the aforementioned systems (e.g., BigDatalogMC). Simultaneously, another branch of research on ‘parallel correctness’ for simple non-recursive conjunctive queries ParallelCorrectness focused on optimal data distribution policies for repartitioning the initial data under the Massively Parallel Communication (MPC) model. Notably, however, this theoretical groundwork left out programs using aggregates in recursion, for which the existence of a formal semantics could not be guaranteed. This situation has changed recently with the introduction of the notion of PreMappability (PreM) zaniolotplp2017, which has made it possible to use aggregates in recursion to efficiently express a large range of applications scalinguptplp2018. (In our initial work zaniolotplp2017, we interchangeably used the term PreApplicability; in our follow-up works scalinguptplp2018, zanioloamw2018, we consistently used the term PreMappability, since the latter was deemed more appropriate in the context of ‘pre-mapping’ aggregates and constraints to recursive rules.)
A key aspect of this line of work has been the use of non-monotonic aggregates and premappable constraints inside recursion, while preserving the formal declarative semantics of aggregate-stratified programs, thanks to the notion of PreM that guarantees their equivalence. Unlike more complex non-monotonic semantics, stratification is a syntactic condition that is easily checked by users (and compilers), who know that the presence of a formal declarative semantics guarantees the portability of their applications over multiple platforms. Furthermore, evidence is mounting that a higher potential for parallelism is also gained under PreM. Naturally, we would like to examine the applicability of PreM in a parallel and distributed setting and analyze its potential gains using the rich models of parallelism previously proposed for Datalog and other logic systems.
In this paper, therefore, we begin by examining how PreM interacts with a parallel setting, and address the question of whether it can be incorporated into the parallel evaluation plans on shared-memory and shared-nothing architectures. Furthermore, the current crop of Datalog systems supporting aggregates in recursion has only explored the Bulk Synchronous Parallel (BSP) and asynchronous distributed computing models. However, the emerging paradigm of the Stale Synchronous Parallel (SSP) processing model SSPCui has been shown to speed up big data analytics and the execution of machine learning algorithms in distributed environments SSPPetuum, SSPMuli with bounded staleness. SSP processing allows each worker in a distributed setting to see and use another worker's obsolete (stale) intermediate solution, which is out-of-date only by a limited (bounded) number of epochs. By contrast, in a BSP model every worker coordinates at the end of each round of computation and sees the others' current intermediate results. This relaxation of the synchronization barrier in an SSP model can reduce the idle waiting of workers (time spent waiting to synchronize), particularly when one or more workers (stragglers) lag behind the others in terms of computation. Thus, in this paper, we also explore whether declarative recursive computation can be executed under the loose consistency model of SSP processing and whether it converges to the same result as under a BSP processing framework.
To our surprise, we find that PreM dovetails excellently with the SSP model for a class of nonlinear recursive queries with aggregates, which are not embarrassingly parallel and still require some coordination between the workers to reach eventual consistency interlandi_tanca_2018. Thus, the contributions of this paper can be summarized as follows:
We show that PreM is applicable to the parallel bottom-up semi-naive evaluation plan, terminating at the same minimal fixpoint as the corresponding sequential execution on a single executor.

We further show how recursive query evaluation with PreM can operate effectively under an SSP distributed model.

Finally, we discuss the merits and demerits of an SSP model with initial empirical results on some recursive query examples, thus opening up an interesting direction for future research.
2 An Overview of PreM
This section provides a brief overview of PreM and some of its properties zanioloamw2016, zanioloamw2018. Consider the Datalog query in Example 1 that computes the shortest path between all pairs of vertices in a graph, given by the relation arc(X, Y, D), where D is the distance between source node X and destination node Y. The minD syntax in our example indicates a min aggregate on the cost variable D, while (X, Y) refers to the group-by arguments. This head notation for aggregates directly follows SQL-2 syntax, where the cost argument for the aggregate consists of one variable and the group-by arguments can have zero or more variables. The rules in the example show that the aggregate is computed at a stratum higher than that of the recursive rule.
Example 1.
All Pairs Shortest Path
Incidentally, the min aggregate can also be expressed with stratified negation. This guarantees that the program has a perfect-model semantics, although an iterated fixpoint computation of it can be very inefficient and even non-terminating in the presence of cycles.
PreM Application. The aforementioned inefficiency can be mitigated with PreM, if the aggregate can be pushed inside the fixpoint computation. The resulting program under PreM has a stable-model semantics, and scalinguptplp2018 showed that this transformation is indeed equivalence-preserving, with assured convergence to a minimal fixpoint within a finite number of iterations. In other words, without PreM, the shortest path in our example is given by the subset of the minimal model obtained after removing the path atoms that do not satisfy the cost constraint for a given source-destination pair. With PreM, however, the transfer of the cost constraint inside recursion results in an optimized program, where the fixpoint computation is performed more efficiently, eventually achieving the same shortest-path values (as those produced in the perfect model of the earlier program) by simply copying the atoms from path under the name shortestpath after the least-fixpoint computation terminates.
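To make the effect of the transferred constraint concrete, the following is a minimal Python sketch (our illustration, not any system's implementation) of the constrained fixpoint: the min constraint is applied at every iteration instead of after the full fixpoint, which mirrors the PreM-optimized program. We assume positive arc weights, so every update strictly decreases a bounded cost and the loop terminates.

```python
def shortest_paths(arcs):
    """arcs: iterable of (x, y, d) with d > 0. Returns {(x, y): min_d}.

    Mimics the PreM-transformed program: the exit rule seeds path from arc,
    and the min constraint is applied inside every fixpoint iteration.
    """
    # Keep only the cheapest arc per (x, y) pair.
    arc = {}
    for x, y, d in arcs:
        if (x, y) not in arc or d < arc[(x, y)]:
            arc[(x, y)] = d
    best = dict(arc)            # exit rule: path(X, Y, D) <- arc(X, Y, D)
    changed = True
    while changed:              # iterate the constrained ICO until fixpoint
        changed = False
        for (x, z), dxz in list(best.items()):
            for (z2, y), dzy in arc.items():
                if z == z2:
                    d = dxz + dzy
                    # min constraint: retain a derived path only if it improves
                    if (x, y) not in best or d < best[(x, y)]:
                        best[(x, y)] = d
                        changed = True
    return best
```

Running this on a small graph yields the same shortest-path values the aggregate-stratified program would produce, without ever materializing all (possibly infinitely many, in cyclic graphs) path costs.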
Formal Definition of PreM. For a given Datalog program, let P be the rules defining a (set of mutually) recursive predicate(s) and T be the corresponding Immediate Consequence Operator (ICO) defined over P. Then, a constraint γ is said to be PreM to T (and to P) when, for every interpretation I of the program, we have γ(T(I)) = γ(T(γ(I))).
In Example 1, the final rule imposes the min constraint γ on path (representing all possible paths) to eventually yield the shortest path between all pairs of nodes. Thus, the aggregate-stratified program computes γ(T↑ω(∅)), i.e., γ applied after the least fixpoint of T in the definition of PreM.
On the other hand, with the aggregate pushed inside recursion, the recursive rules represent the constrained operator T_γ.
Properties. We now discuss some important results about PreM from zaniolotplp2017; we refer interested readers to that paper for the detailed proofs. Let T_γ denote the constrained immediate consequence operator, where the constraint γ is applied after the ICO T, i.e., T_γ(I) = γ(T(I)). The following results hold when γ is PreM to a positive program P with ICO T:

If I is a fixpoint for T, then γ(I) is a fixpoint for T_γ, i.e., T_γ(γ(I)) = γ(I).

For some integer n, if T_γ↑n(∅) = T_γ↑(n+1)(∅), then T_γ↑n(∅) is a minimal fixpoint for T_γ and γ(T↑ω(∅)) = T_γ↑n(∅), where T↑ω(∅) denotes the least fixpoint of T.
Provability. We can verify whether PreM holds for a recursive rule by explicitly validating γ(T(I)) = γ(T(γ(I))) at every iteration of the fixpoint computation. To simplify, this indicates that we can verify whether the constraint can be pushed inside recursion by inserting an additional goal in the body of the rule as follows:
This additional goal in the body pre-applies the constraint γ on I, followed by the application of the T operator, i.e., it expresses T(γ(I)). Note that the constraint is satisfied by Dxz if it is the minimum value seen so far in the fixpoint computation for the source-destination pair (X, Z). It is also evident that any other distance value between (X, Z) that violates the constraint will also not satisfy the aggregate at the head of the rule, since the additional goal minimizes the sum D for each Dzy. Thus, this new goal in the body does not alter the ICO mapping defined by the original recursive rule, thereby proving that γ is PreM in this example program. More broadly, these additional goals can be formally defined as “half functional dependencies”, borrowing the terminology from the classical database theory of Functional and Multi-Valued Dependencies (FDs and MVDs). We next present the formal definition of half FDs from zanioloamw2018, which will be used later for our proofs.
Definition 1.
(Half Functional Dependency). Let R be a relation on a set of attributes V, with X ⊂ V and B ∈ V − X. Considering the domain of B to be totally ordered, a tuple t ∈ R is said to satisfy the constraint X →< B (denoted γ<(X,B)(R)) when R contains no tuple with the same X value and a smaller B value. Similarly, a tuple satisfies a constraint X →> B (denoted γ>(X,B)(R)) if R has no tuple with the same X value and a larger B value.
For any γ< or γ> constraint to be PreM to a positive program P, the corresponding half FD should hold for the relational view of the relevant recursive predicate across every interpretation of P, where the relational view of a predicate is the relation formed by its ground atoms in a given interpretation.
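The γ< condition of Definition 1 can be checked mechanically. The following Python helper (our illustration, with hypothetical names) filters a relation down to the tuples satisfying γ<(X,B), i.e., the tuples for which no other tuple shares their X value with a strictly smaller B value:

```python
def gamma_min(rows, x_key, b_key):
    """Return the rows satisfying gamma-<(X, B): keep a row only if no other
    row has the same X projection (x_key) and a strictly smaller B (b_key)."""
    best = {}
    for r in rows:
        k = x_key(r)
        if k not in best or b_key(r) < best[k]:
            best[k] = b_key(r)
    # A row survives iff its B value equals the minimum for its X group.
    return [r for r in rows if b_key(r) == best[x_key(r)]]
```

The symmetric γ> filter is obtained by replacing the `<` comparison with `>`.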
zanioloamw2018 provides generic templates, based on Functional and Multivalued Dependencies, for identifying constraints that satisfy PreM.
PreM with Semi-Naive Evaluation. A naive fixpoint computation trivially generates new atoms from the entire set of atoms available at the end of the last fixpoint iteration. Semi-naive evaluation improves over this naive fixpoint computation with the aid of the following enhancements:

(1) At every iteration, track only the new atoms produced.

(2) Rules are rewritten into their differential versions, so that only new atoms are produced and old atoms are never generated redundantly.

(3) Ensure step (2) does not generate any duplicate atoms.
For programs where PreM can be applied, steps (1) and (2) remain identical. However, step (3) is extended so that (i) new atoms produced may not be retained if they do not satisfy the constraint, and (ii) existing atoms may get updated and thereafter tracked for the next iteration. For example, new atoms produced by the recursive rule are added to the working set and tracked only if a new source-destination (X, Y) path is discovered. On the other hand, if the new path atom thus produced has a smaller distance than the one in the working set, then the distance of the existing path atom is updated to satisfy the constraint. However, new path atoms that have larger distances are simply ignored. This understanding of PreM for semi-naive evaluation leads to a case for the SSP model, where significant communication can be saved by condensing multiple updates into one. This is discussed in detail in Section 5.
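The extended step (3) can be sketched as a small merge routine (illustrative Python, with hypothetical names): each produced atom either creates a new entry, improves an existing one (and is then tracked in the delta for the next iteration), or is discarded.

```python
def apply_delta(working, produced):
    """Merge freshly produced (key, cost) atoms into the working set under the
    min constraint; return the delta of atoms to track for the next iteration."""
    delta = {}
    for key, cost in produced:
        old = working.get(key)
        if old is None or cost < old:   # new pair, or improved distance
            working[key] = cost
            delta[key] = cost           # track only retained/updated atoms
        # atoms with larger cost violate the constraint and are ignored
    return delta
```

Note that, unlike plain semi-naive evaluation, the delta here may contain updates to existing atoms, not only brand-new ones.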
3 An Overview of Parallel Bottom-Up Evaluation
One of the early foundational works that established a standard technique to parallelize the bottom-up evaluation of linear recursive queries was presented in GangulyParallel. The authors proposed a substitution-partitioned parallelization scheme, where the set of possible ground substitutions, i.e., the base (extensional database) and derived relation (intensional database) atoms in the Datalog program, are disjointly partitioned using a hash-based discriminating function, so that each partition of possible ground substitutions is mapped to exactly one of the parallel workers. The entire computation is then divided among all the workers, operating in parallel, where each worker only processes the partition of ground substitutions mapped to it during the bottom-up semi-naive evaluation. Since each worker operates on a distinct, non-overlapping partition of ground substitutions, no two workers perform the same or redundant computation, i.e., this scheme is non-redundant. Formally, if V is a non-repetitive sequence of variables appearing in the body of a rule and W denotes a finite set of parallel workers, then a discriminating hash function h divides the workload by assigning each ground substitution of V, and the corresponding processing, to exactly one worker in W. The workers can send and receive information (ground instances of partially computed derived relations) to and from other workers to finish the assigned computation tasks. Ganguly et al. summarized the correctness of this parallelization scheme with the following result:
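A hash-based discriminating function of this kind can be sketched as follows (Python, illustrative; the choice of an atom's first argument as the discriminating attribute is our assumption for the example):

```python
def partition(atoms, k, key=lambda a: a[0]):
    """Assign each ground atom to exactly one of k workers by hashing a
    discriminating attribute (here, the first argument by default).
    The resulting shards are disjoint, so no work is duplicated."""
    shards = [[] for _ in range(k)]
    for a in atoms:
        shards[hash(key(a)) % k].append(a)
    return shards
```

Because the shards partition the atoms, every atom is processed by exactly one worker, which is what makes the scheme non-redundant.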
Correctness of Partitioned Parallelization Scheme. Let P be a recursive Datalog program to be executed over k workers. Under the partitioned parallelization scheme, let P_i be the program to be executed at worker i and let P' = P_1 ∪ … ∪ P_k. Then, for every interpretation, the least model of the recursive relation in P' is identical to the least model obtained from the sequential execution of P.
Note that the above parallelization strategy did not involve aggregates in recursion. Nevertheless, it was of significant consequence, since the scheme has been extended to derive lock-free parallel plans for shared-memory architectures, as well as sharded data-parallel decomposable plans for shared-nothing distributed architectures, to parallelize the bottom-up semi-naive evaluation of Datalog programs. We discuss them next with examples.
Shared-Memory Architecture. A trivial hash-based partitioning, as described above, can often lead to conflicts between different workers on a shared-memory architecture (for example, two distinct workers may update a path atom for the same (X, Y) pair, if the hashing is done on the ground instances of the sequence {X, Z, Dxz, Z, Y, Dzy} or even on the sequence {X, Z, Y}). This can be prevented with the implementation of classical locks to resolve read-write conflicts. Recently, however, yangiclp2015 proposed a hash-partitioning strategy based on discriminating sets that allows lock-free parallel evaluation of a broad class of generic queries, including nonlinear queries. We illustrate this with our running all-pairs shortest-path example.
Assume the relations arc, path and shortestpath from Example 1 are partitioned by their first column, i.e., the source vertex (the first attribute forms a discriminating set that is used for partitioning), using a hash function that maps the source vertex to an integer between 1 and k, the latter denoting the number of workers. Now, the i-th worker can execute the following program in parallel:

(1) The i-th worker executes the non-recursive rule by reading from the i-th partition of arc.

(2) Once all the workers finish step (1), the i-th worker begins semi-naive evaluation with the recursive rule, where it reads from the i-th partition of path, joins it with the corresponding atoms from the arc relation, which is shared across all the workers, and then writes the new atoms into the same i-th partition of path.

(3) Once all the workers finish step (2), the semi-naive evaluation proceeds to the next iteration and repeats step (2) until the least fixpoint is reached.

(4) In the final step, the i-th worker computes the shortestpath atoms for the i-th partition.

(5) The shortestpath data pooled across all the workers produce the final query result.
It is easy to observe that the above parallel execution does not require any locks, since each worker writes to exactly one partition and no two workers write to the same partition. We formally define the lock-free parallel bottom-up evaluation scheme next.
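The five steps above can be simulated single-threadedly as follows (a Python sketch under our own naming; real systems run the workers concurrently, but the shard-local writes shown here are exactly what makes locks unnecessary):

```python
def parallel_shortest_paths(arcs, k):
    """Simulate the lock-free plan: path is sharded by the source vertex; each
    'worker' joins its own shard with the shared arc relation and writes back
    only to its own shard (the source vertex, hence the shard, is unchanged)."""
    h = lambda v: hash(v) % k
    arc = {}                                  # shared, read-only after loading
    for x, y, d in arcs:
        if (x, y) not in arc or d < arc[(x, y)]:
            arc[(x, y)] = d
    # step (1): worker i seeds its shard of path from its shard of arc
    path = [dict() for _ in range(k)]
    for (x, y), d in arc.items():
        path[h(x)][(x, y)] = d
    changed = True
    while changed:                            # steps (2)-(3): synchronized rounds
        changed = False
        for i in range(k):                    # each worker (in parallel in reality)
            for (x, z), dxz in list(path[i].items()):
                for (z2, y), dzy in arc.items():
                    if z == z2:
                        d = dxz + dzy
                        cur = path[i].get((x, y))
                        if cur is None or d < cur:
                            path[i][(x, y)] = d   # h(x) == i: the write is local
                            changed = True
    # steps (4)-(5): pool the per-shard results
    return {p: d for shard in path for p, d in shard.items()}
```

For simplicity this sketch already applies the min constraint shard-locally; Theorem 1 below justifies why that is safe when the partitioning key is contained in the group-by arguments.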
Definition 2.
(Lock-free Parallel Bottom-up Evaluation). Let P be a recursive Datalog program to be executed over k workers and let T be the corresponding ICO for the sequential execution of P. Under the lock-free parallel plan executed over k workers, let P_i be the program to be executed at worker i, producing an interpretation I_i of the recursive predicate with the corresponding ICO T_i. Then, for every input of base relations, we have, for i ≠ j, I_i ∩ I_j = ∅. It also follows from the correctness of the partitioned parallelization scheme that T↑ω(∅) = T_1↑ω(∅) ∪ … ∪ T_k↑ω(∅).
The underlying strategy of a lock-free parallel plan of using disjoint data partitions has also been adopted to execute data-parallel distributed bottom-up evaluations, as explained next.
Shared-Nothing Architecture. Distributed systems like BigDatalog bigdatalog also divide the entire dataset into disjoint data shards in a manner identical to the lock-free partitioning technique described above. Each data shard resides in the memory of a worker, and this partitioning scheme reduces the data shuffling required across different workers bigdatalog. In the context of a shared-nothing architecture, this sharding scheme and the subsequent distributed bottom-up evaluation are termed a decomposable plan bigdatalog, rasql. In the rest of this paper, we will use the term ‘lock-free parallel plan’ in the context of shared-memory architectures and ‘parallel decomposable plan’ in the context of distributed environments, for clarity.
Distributed systems like BigDatalog and SociaLite socialite perform the fixpoint computation under the BSP model with synchronized iterations. However, note that if each node caches the arc relation, then each node can operate independently without any coordination or synchronization with other nodes (i.e., step (3) listed before in the lock-free evaluation plan becomes unnecessary). The Myria system follows this asynchronous computing model for query evaluation. Interestingly, GangulyParallel showed that only a subclass of linear recursive queries (those whose corresponding dataflow graph has a cycle) can be executed in a coordination-free manner, i.e., asynchronously. Thus, for a large class of nonlinear and even many linear recursive queries (e.g., the same-generation query GangulyParallel), the BSP computing model has been the only viable option.
4 Parallel Evaluation with PreM
In this section, we examine whether PreM can be easily integrated into the lock-free parallel and parallel decomposable bottom-up evaluation plans that have been widely adopted across shared-memory and shared-nothing architectures for a broad range of generic queries. We next provide some interesting theoretical results.
Lemma 1. Let R be a relation defined over a set of attributes V, where X ⊂ V and B ∈ V − X. For a subset K of X (K ⊆ X), if R is divided into n disjoint subsets R_1, …, R_n using a hash function h defined over K, then a tuple t ∈ R_i satisfying γ<(X,B) (or γ>(X,B)) over R will also satisfy γ<(X,B) (or γ>(X,B), respectively) over R_i, and vice versa.
This follows directly from the fact that, since h is defined over K ⊆ X, for any two tuples t_1, t_2 ∈ R, if t_1[X] = t_2[X] then h(t_1[K]) = h(t_2[K]), i.e., any two tuples with the same X value will be mapped into the same partition, decided by their common K value. Since all tuples with the same X value belong to a single partition, any tuple t ∈ R_i will satisfy γ<(X,B) (or γ>(X,B)) over both R_i and R.
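Lemma 1 can be spot-checked empirically: partitioning by a key contained in the group-by attributes and applying γ< per partition gives the same result as applying it globally. A Python sketch (illustrative names; here X is the first field, B the second, and the partitioning key is X itself):

```python
def gamma_lt(rows, x, b):
    """Rows of `rows` satisfying gamma-<(X, B)."""
    best = {}
    for r in rows:
        if x(r) not in best or b(r) < best[x(r)]:
            best[x(r)] = b(r)
    return {r for r in rows if b(r) == best[x(r)]}

def partitioned_gamma_lt(rows, k, x, b):
    """Hash-partition rows on (a key contained in) X, apply gamma-< locally
    to each shard, and union the results."""
    shards = [[] for _ in range(k)]
    for r in rows:
        shards[hash(x(r)) % k].append(r)   # h is a function of X, as in Lemma 1
    out = set()
    for s in shards:
        out |= gamma_lt(s, x, b)           # constraint applied per partition
    return out
```

All tuples sharing an X value land in one shard, so the local minima coincide with the global ones, regardless of how the hash scatters the groups.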
Theorem 1.
Let P be a recursive Datalog program, T be its corresponding ICO, and let the constraint γ be PreM to P and T, resulting in the constrained ICO T_γ. Let P be executed over k workers under a lock-free parallel (or parallel decomposable) bottom-up evaluation plan, where P_i is the program executed at worker i and T_i is the corresponding ICO defined over P_i. If the group-by arguments used for the constraint γ also contain the discriminating set used for partitioning in the lock-free parallel (or parallel decomposable) plan, then:

γ is also PreM to P_i and T_i, for i = 1, …, k.

For some integer n, if T_γ↑n(∅) is the minimal fixpoint for T_γ, then T_γ↑n(∅) = T_1γ↑n_1(∅) ∪ … ∪ T_kγ↑n_k(∅) for some integers n_1, …, n_k, where T_iγ denotes the constrained ICO with respect to T_i.
The proof for (i) follows trivially from Lemma 1 and the provability technique discussed earlier in Section 2.
For (ii), since γ is PreM to P and T, we have γ(T↑ω(∅)) = T_γ↑n(∅) for some integer n, according to the properties of PreM.
Similarly, since γ is PreM to P_i and T_i for i = 1, …, k (from (i) of Theorem 1), we have γ(T_i↑ω(∅)) = T_iγ↑n_i(∅) for some integer n_i, where T_iγ↑n_i(∅) is the minimal fixpoint for T_iγ.
Now, γ constraints are also trivially PreM to the union over disjoint sets zaniolotplp2017, i.e.,
γ(T_1↑ω(∅) ∪ … ∪ T_k↑ω(∅)) = γ(T_1↑ω(∅)) ∪ … ∪ γ(T_k↑ω(∅)).
Also recall from the definition of the lock-free parallel (or parallel decomposable) plan that
T↑ω(∅) = T_1↑ω(∅) ∪ … ∪ T_k↑ω(∅).
Combining the aforementioned equalities, we get
γ(T↑ω(∅)) = T_1γ↑n_1(∅) ∪ … ∪ T_kγ↑n_k(∅).
Since T_γ↑n(∅) is the minimal fixpoint with respect to T_γ, we also have γ(T↑ω(∅)) = T_γ↑n(∅).
Therefore, T_γ↑n(∅) = T_1γ↑n_1(∅) ∪ … ∪ T_kγ↑n_k(∅).
Thus, following Theorem 1, we can push the γ constraint within the parallel recursive plan and rewrite the recursive rules for worker i as follows:
We thus observe that premappable constraints can also be easily pushed inside parallel lock-free (or parallel decomposable) evaluation plans of recursive queries to yield the same minimal fixpoint, while making the computation more efficient and safe. Hence, PreM can be easily incorporated into the parallel computation plans of different systems like BigDatalogMC, BigDatalog and Myria, irrespective of whether (1) they use a shared-memory or shared-nothing architecture, or (2) they follow a BSP or asynchronous computing model.
5 A Case for Relaxed Synchronization
We now consider a nonlinear query that is equivalent, with the application of PreM, to the linear all-pairs shortest-path program. Since the query is nonlinear, the program cannot be executed in a coordination-free manner or asynchronously following the technique described in GangulyParallel.
However, as shown in yangiclp2015, a simple query rewriting technique can produce an equivalent parallel decomposable evaluation plan for this nonlinear query. The resulting decomposable program can be executed by the i-th worker on a distributed system following a bulk synchronous parallel model. In this decomposable evaluation plan, there is a mandatory synchronization step, where each worker (operating on the i-th partition) copies the new atoms or updates in path produced during the semi-naive evaluation of the recursive rule to path, and the new path is then sent to the other workers so that they can use it in the next iteration.
In a bulk synchronous distributed computing model, the communication between the workers in each iteration can be considerably more expensive than the local computation performed by each worker, due to the bottleneck of network bandwidth. We now investigate whether we can relax this synchronization constraint at every iteration.
Under a stale synchronous parallel (SSP) model, a worker can use, for its local computation, an obsolete or stale version of path that omits some recent updates produced by other workers. In particular, a worker using path at iteration c is guaranteed to see all the atoms and updates generated from iteration 0 to c − s − 1, where s is a user-specified threshold for controlling the staleness. In addition, the worker's stale path may have atoms or updates from iterations beyond c − s − 1, i.e., from iteration c − s to c − 1 (although this is not guaranteed). The intuition behind this is that, in an SSP model, a worker should be able to see and use its own updates at every iteration of its local computation, in addition to seeing and using as many updates as possible from other workers, with the constraint that any updates older than a given age are not missed. This is the bounded staleness constraint sspdef. This leads to two advantages:

Workers spend more time performing actual computation, rather than idly waiting for other workers to finish. This can be very helpful when straggling workers are present, which lag behind the others in an iteration. In fact, in distributed computing, stragglers present an acute problem, since they can occur for several reasons, like hardware differences disk1, system failures hardfail, skewed data distribution, or even software management issues and program interruptions caused by garbage collection or operating system noise oseffect.

Secondly, workers can end up communicating less than under a BSP model. This is primarily because, under SSP, each worker can condense several updates computed in different local iterations into a single update before eventually sending it to the other workers.
We illustrate the above advantages through an example. Figure 1 shows a toy graph that is distributed across two workers: (i) all edges incident on nodes 1–4 are available on worker 0, and (ii) the rest of the edges reside on worker 1. Now consider the shortest path between nodes 4 and 8, given by the path 4→3→2→1→5→6→7→8, which spans 7 hops. The parallel program with BSP processing would require at least three synchronized iterations to reach the least fixpoint by semi-naive evaluation. Now consider worker 1 to be a straggling node that lags behind worker 0 during the computation because of hardware differences. Worker 0 thus spends significant time idly waiting for worker 1 to complete, as shown in Figure 2a. But in this example, the shortest path between nodes 4 and 8 changes because of two aspects: (1) the shortest path between nodes 4 and 1 changes, and (2) the shortest path between nodes 5 and 8 changes. Both of these computations can be done independently on the two workers: worker 0 only needs to know the eventual shortest path between nodes 5 and 8 calculated by worker 1, and vice versa. It is important to note that this only works if each worker can use the most recent local updates (newest atoms) generated by itself. In other words, worker 0 should be able to see the changes of the shortest path between nodes 4 and 1 in every iteration (since they are generated locally), while using stale (obsolete) knowledge about the shortest path between nodes 5 and 8 (as sent by worker 1 earlier). This stale synchronization model is summarized in Figure 2b.
In this same example, note how the minimum cost of the path between node 1 and node 4 (computed by worker 0) changes in every iteration: (i) in the first iteration, the minimum cost is 10, given by the edge between nodes 1 and 4; (ii) in the next iteration, the minimum cost drops to 7, given by the path 1→3→4; and (iii) in the third iteration, the final minimum cost of 5 is given by the path 1→2→3→4. In a BSP model, each of these updates, generated in every iteration, needs to be communicated to all the remaining workers. However, in an SSP model, thanks to staleness, these multiple updates from different local iterations can be condensed into the single most recent update, which is then sent to the other workers. In other words, SSP with PreM may skip sending some updates to remote workers, thus saving communication time.
Figure 3 formally presents the SSP-based bottom-up evaluation plan for the nonlinear all-pairs shortest-path example. If the evaluation is executed over a distributed system of k workers, Figure 3 depicts the execution plan for worker i. A coordinator marks the completion of the overall evaluation process by individually tracking the termination of each worker's task. For simplicity and clarity, we use the naive fixpoint computation to describe the evaluation plan instead of the optimized differential fixpoint algorithm. The γ used in Figure 3 denotes the min constraint. Step (3) in this evaluation plan shows how worker i uses stale knowledge from other workers during the recursive-rule evaluation shown in step (4). It is also important to note that in step (4), each worker also uses the most recent atoms generated by itself. The condition in step (6) allows each local computation on worker i to reach a local fixpoint or move ahead by at least s iterations. Thus, each worker can condense the multiple updates generated within these iterations due to PreM into a single update. Finally, step (9) ensures that if any worker falls beyond the user-defined staleness bound, then the other workers wait for it to catch up within the desired staleness level before starting their local computations again. We next present some theoretical and empirical results about SSP-based bottom-up evaluation.
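The staleness gate of step (9) can be sketched as a simple predicate over per-worker iteration counters (a hedged Python illustration, not the paper's actual implementation; with slack s = 0 it degenerates to BSP):

```python
def may_proceed(clocks, i, slack):
    """Bounded-staleness gate: worker i may start its next local iteration only
    if the slowest worker is at most `slack` iterations behind it.
    clocks[j] is the number of iterations worker j has completed."""
    return clocks[i] - min(clocks) <= slack
```

When `slack` is zero, a worker may proceed only if no worker is behind it at all, i.e., all workers advance in lockstep exactly as under a synchronization barrier.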
6 Bottom-Up Evaluation with SSP Processing
Under the SSP model, recursive query evaluation with PreM constraints has the following theoretical guarantees:
Theorem 2.
Let P be a recursive Datalog program with ICO T and let the constraint γ be PreM to P and T. Let P have a parallel decomposable evaluation plan that can be executed over k workers, where P_i is the program executed at worker i and T_i is the corresponding ICO defined over P_i. If γ is also PreM to P_i and T_i for i = 1, …, k, then:

SSP processing yields the same minimal fixpoint of P as would have been obtained with BSP processing.

If any worker under BSP processing requires n rounds of synchronization, then under SSP processing it would require at most n rounds to reach the minimal fixpoint, where n rounds of synchronization in the SSP model means that every worker has sent at least n updates.
The proof is provided in Appendix A.
6.1 SSP Evaluation of Queries without PreM Constraints
We now consider the parallel decomposable plan for a transitive-closure query, which does not contain any aggregates in recursion. We use the same nonlinear recursive example from bigdatalogmc, executed by the i-th worker. Note that, in this example, every worker eventually has to compute and send to the other workers all tc atoms of the form (X, Y), where h(X) = i. Without PreM, a worker does not update its existing tc atoms. In fact, during the semi-naive evaluation of this query, at any time, only new unique atoms are appended to tc. Thus, an SSP evaluation for the transitive-closure query (without PreM) does not save any communication cost as compared to a BSP model. However, as shown in our experimental results next, the SSP model can still mitigate the influence of stragglers.
6.2 Experimental Results
Setup. We conduct our experiments on a 12-node cluster, where each node, running Ubuntu 14.04 LTS, has an Intel i7-4770 CPU (3.40 GHz, 4 cores) with 32GB memory and a 1TB 7200 RPM hard drive. The compute nodes are connected with a 1Gbit network. Following the standard practices established in bigdatalog, bigdatalogmc, we execute the distributed bottom-up semi-naive evaluation using an AND/OR tree based implementation in Java on each node. Each node runs one application thread per core. We evaluate both the nonlinear all-pairs shortest-path and transitive-closure queries on a subset of the real-world Orkut social network data (http://snap.stanford.edu/data/com-Orkut.html).
Inducing Stragglers.
In order to study the influence of straggling nodes in a declarative recursive computation, we induce stragglers in our implementation following the strategy described in SSPCui. In particular, each of the nodes in our setup can be disrupted independently by a CPU-intensive background process that kicks in following a Poisson distribution and consumes at least half of the CPU resources.
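The disruption strategy can be sketched as follows. This is an illustrative reconstruction rather than the actual scripts used in the experiments, and the rate, burst length, and process count are made-up parameters:

```python
# Hypothetical straggler inducer: bursts of CPU load arrive as a Poisson
# process (exponential inter-arrival times) and each burst occupies roughly
# half of a 4-core node. All names and parameters here are illustrative.
import random
import time
from multiprocessing import Process

def cpu_burn(seconds):
    """Keep one core busy for roughly `seconds` seconds."""
    end = time.time() + seconds
    x = 0
    while time.time() < end:
        x += 1  # busy loop occupies the core
    return x

def disruptor(rate_per_sec=0.1, burn_secs=2.0, duration_secs=10.0):
    """Fire CPU-intensive bursts with exponential inter-arrival times."""
    deadline = time.time() + duration_secs
    while time.time() < deadline:
        time.sleep(random.expovariate(rate_per_sec))  # Poisson arrivals
        # two burn processes grab about half of a 4-core node
        procs = [Process(target=cpu_burn, args=(burn_secs,)) for _ in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
```

Calling `disruptor()` in the background of a worker node is enough to make that node intermittently fall behind its peers.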
Analysis. In this section, we empirically analyze the merits and demerits of a SSP model over a BSP model by examining the following questions: (1) How does a SSP model compare to a BSP model when queries contain constraints and aggregates in recursion? (2) How do these two processing paradigms compare when PreM cannot be applied? (3) And, how do the overall performances in the above scenarios change in the presence and absence of stragglers? Table 1 captures the first case with the all-pairs shortest path query (where PreM is applicable), while Table 2 presents the second case with the transitive closure query, which does not contain any aggregates or constraints in recursion. For each of these two cases, as shown in the tables, we experimented with two different staleness values for a SSP model, both under the presence and absence of induced stragglers. Notably, a SSP model with bounded staleness (alternatively called ‘slack’ and indicated by $s$ in the tables) set to zero reduces to a BSP model. Tables 1 and 2 capture the average execution time for the query at hand under different configurations over five runs. This run time can be divided into two components: (1) average computation time, which is the average time spent by the workers performing seminaive evaluation for the recursive computation, and (2) average waiting time, which is the average time spent by the workers waiting to receive a new update to resume computation. Tables 1 and 2 show the run time breakdown for the two aforementioned cases (with and without $\gamma$, respectively).
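The slack rule itself is simple enough to state in a few lines. The sketch below is illustrative (the function and variable names are ours, not the implementation's) and shows how a slack of zero collapses SSP into BSP:

```python
# Minimal sketch of the bounded-staleness rule used by SSP with slack s:
# a worker at logical clock c may begin iteration c+1 only if the slowest
# worker has reached at least clock c - s. With s = 0 this degenerates
# into lock-step BSP. Names are illustrative, not from the paper's system.
def may_proceed(worker_clock, all_clocks, slack):
    """Return True iff this worker is allowed to advance one more iteration."""
    slowest = min(all_clocks)
    return worker_clock - slowest <= slack

clocks = [5, 3, 4, 5]                   # current logical clocks of 4 workers
print(may_proceed(5, clocks, slack=3))  # True: fastest is only 2 ahead
print(may_proceed(5, clocks, slack=0))  # False: s=0 forces BSP lock step
```
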
From Tables 1 and 2, it is evident that BSP processing requires the least compute time irrespective of straggling nodes. This is intuitively true because the total recursive computation involved in a BSP based distributed seminaive evaluation is similar to that of a single-executor sequential execution, and as such a BSP model should require the least computational effort to reach the minimal fixpoint. On the other hand, a SSP model may perform many local computations optimistically with obsolete data using relaxed synchronization barriers, and these computations can later turn out to be redundant. As shown in the tables, average compute time indeed increases with higher slack, indicating that a substantial amount of the work becomes unnecessary. However, as seen from both tables, SSP plays a major role in reducing the average wait time. This is trivially true, since in SSP processing any worker can move ahead with local computations using stale knowledge, instead of waiting for global synchronization as required in BSP. However, note that the reduction in average wait time under the SSP model in Table 1 (with $\gamma$) is more significant than in Table 2 (without $\gamma$). This can be attributed to the fact that with $\gamma$, seminaive evaluation (Section 2) under the SSP model can batch multiple updates together before sending them, thereby saving communication cost. However, for the transitive closure query (without $\gamma$), the overall updates sent in the BSP and SSP models are similar (since no aggregates are used, seminaive evaluation only produces new atoms, never updates existing ones). Thus, in the latter case (Table 2), the wait times of the BSP and SSP models are comparable when there are no induced stragglers, whereas the wait time in SSP is marginally better than BSP when stragglers are present. Notably, inducing stragglers increases the average wait time throughout, as compared to a no-straggler situation.
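The batching effect of the constraint can be illustrated with a small hypothetical sketch: under a min constraint, several provisional costs for the same group-by key collapse into a single outgoing update, whereas without aggregates every new atom must eventually be shipped.

```python
# Illustrative sketch (names are ours): before a worker flushes its pending
# outgoing updates, a min constraint lets it keep only the cheapest cost per
# group-by key, shrinking the message it must send to its peers.
def batch_min_updates(pending):
    """Collapse a stream of (key, cost) updates to one min-cost update per key."""
    best = {}
    for key, cost in pending:
        if key not in best or cost < best[key]:
            best[key] = cost
    return best

updates = [("a", 9), ("a", 7), ("b", 4), ("a", 5), ("b", 6)]
print(batch_min_updates(updates))  # {'a': 5, 'b': 4} -- 5 updates shrink to 2
```

For transitive closure there is no cost argument to collapse on, so the pending set is shipped essentially as-is, which is why SSP's wait-time advantage is smaller in Table 2.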
The compute time also increases marginally in the presence of stragglers, primarily because the straggling nodes take longer to finish their computations.
Thus, to summarize based on the run times in the two tables, we see that in the absence of stragglers, the SSP model can reduce the run time of the shortest path query (with constraint) by nearly 30%. However, the same is not true for the transitive closure query, which does not have any constraint. Hence, a BSP model would suffice if there are no stragglers and the query does not contain any constraint. However, in the presence of stragglers or constraints, the SSP model turns out to be a better alternative than the BSP model, as it can reduce execution time by as much as 40% for the shortest path query and nearly 7% for the transitive closure query. Finally, it is also worth noting from the results that too much slack can also increase the query latency. Thus, a moderate amount of slack should be used in practice.
Table 1: All-pairs shortest path query (with $\gamma$), averaged over five runs.

Time consumption   | No stragglers (time in sec)       | With stragglers (time in sec)
                   | BSP (s=0) | SSP (s=3) | SSP (s=6) | BSP (s=0) | SSP (s=3) | SSP (s=6)
Avg. compute time  | 2224      | 2443      | 3038      | 2664      | 2749      | 3435
Avg. wait time     | 1679      | 302       | 408       | 2786      | 485       | 704
Run time           | 3903      | 2745      | 3446      | 5450      | 3234      | 4139
Table 2: Transitive closure query (without $\gamma$), averaged over five runs.

Time consumption   | No stragglers (time in sec)       | With stragglers (time in sec)
                   | BSP (s=0) | SSP (s=3) | SSP (s=6) | BSP (s=0) | SSP (s=3) | SSP (s=6)
Avg. compute time  | 682       | 762       | 879       | 754       | 827       | 921
Avg. wait time     | 367       | 345       | 334       | 618       | 456       | 412
Run time           | 1049      | 1107      | 1213      | 1372      | 1283      | 1431
7 Conclusion
PreM facilitates and extends the use of aggregates in recursion, and this enables a wide spectrum of graph and data mining algorithms to be expressed efficiently in declarative languages. In this paper, we explored various improvements to scalability via parallel execution with PreM. In fact, PreM can be easily integrated with most of the current generation of Datalog engines, like BigDatalog, Myria, BigDatalogMC, SociaLite and LogicBlox, irrespective of their architectural differences and varying synchronization constraints. Moreover, in this paper, we have shown that PreM brings additional benefits to the parallel evaluation of recursive queries. For that, we established the necessary theoretical framework that allows bottom-up recursive computations to be carried out over the stale synchronous parallel model, in addition to the synchronous or completely asynchronous computing models studied in the past. These theoretical developments lead us to the conclusion, confirmed by initial experiments, that the parallel execution of nonlinear queries with constraints can be expedited with a stale synchronous parallel (SSP) model. This model is also useful in the absence of constraints, where bounded staleness may not reduce communication, but it nevertheless mitigates the impact of stragglers. Initial experiments performed on a real-world dataset confirm the theoretical results and are quite promising, paving the way toward future research in many interesting areas where declarative recursive computation under SSP processing can be quite advantageous. For example, declarative advanced stream reasoning systems astrocikm, supporting aggregates in recursion, can adopt the distributed SSP model to query evolving graph data, especially when one portion of the network changes more rapidly than others. SSP models under such scenarios offer the flexibility to batch multiple network updates together, thereby reducing communication costs effectively.
Finally, it is important to note that the methodologies developed here can also be applied to declarative logic-based systems beyond Datalog, such as SQL-based query engines rasql, which also use seminaive evaluation for recursive computation. In addition, the SSP processing paradigm can also be adopted in many state-of-the-art graph-centric platforms such as Pregel pregel and GraphLab graphlab. These modern graph engines use a vertex-centric computing model vertexProgramming, which enforces a strong consistency requirement among its model variables under the “Gather-Apply-Scatter” abstraction. Consequently, this makes the synchronization cost for these graph frameworks similar to that of standard BSP systems. Thus, for many distributed graph computation problems involving aggregators (like shortest path queries), the SSP model, as demonstrated in this paper, can be quite useful for these graph-based platforms.
References
 Ameloot (2014) Ameloot, T. J. 2014. Declarative networking: Recent theoretical work on coordination, correctness, and declarative semantics. SIGMOD Rec. 43, 2 (Dec.), 5–16.
 Ameloot et al. (2017) Ameloot, T. J., Geck, G., Ketsman, B., Neven, F., and Schwentick, T. 2017. Parallelcorrectness and transferability for conjunctive queries. J. ACM 64, 5 (Sept.), 36:1–36:38.
 Ameloot et al. (2015) Ameloot, T. J., Ketsman, B., Neven, F., and Zinn, D. 2015. Weaker forms of monotonicity for declarative networking: A more fine-grained answer to the CALM-conjecture. ACM Trans. Database Syst. 40, 4 (Dec.), 21:1–21:45.
 Ameloot et al. (2013) Ameloot, T. J., Neven, F., and Van Den Bussche, J. 2013. Relational transducers for declarative networking. J. ACM 60, 2 (May), 15:1–15:38.

 Ananthanarayanan et al. (2010) Ananthanarayanan, G., Kandula, S., Greenberg, A., Stoica, I., Lu, Y., Saha, B., and Harris, E. 2010. Reining in the outliers in map-reduce clusters using Mantri. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation. OSDI’10. 265–278.
 Aref et al. (2015) Aref, M., ten Cate, B., Green, T. J., Kimelfeld, B., Olteanu, D., Pasalic, E., Veldhuizen, T. L., and Washburn, G. 2015. Design and implementation of the LogicBlox system. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. 1371–1382.
 Beckman et al. (2006) Beckman, P., Iskra, K., Yoshii, K., and Coghlan, S. 2006. The influence of operating systems on the performance of collective operations at extreme scale. In 2006 IEEE International Conference on Cluster Computing. 1–12.
 Cipar et al. (2013) Cipar, J., Ho, Q., Kim, J. K., Lee, S., Ganger, G. R., Gibson, G., Keeton, K., and Xing, E. 2013. Solving the straggler problem with bounded staleness. In Proceedings of the 14th USENIX Conference on Hot Topics in Operating Systems. HotOS’13. 22–22.
 Condie et al. (2018) Condie, T., Das, A., Interlandi, M., Shkapsky, A., Yang, M., and Zaniolo, C. 2018. Scaling-up reasoning and advanced analytics on big data. TPLP 18, 5-6, 806–845.
 Cui et al. (2014) Cui, H., Cipar, J., Ho, Q., Kim, J. K., Lee, S., Kumar, A., Wei, J., Dai, W., Ganger, G. R., Gibbons, P. B., Gibson, G. A., and Xing, E. P. 2014. Exploiting bounded staleness to speed up big data analytics. In USENIX ATC. 37–48.
 Das et al. (2018) Das, A., Gandhi, S. M., and Zaniolo, C. 2018. Astro: A datalog system for advanced stream reasoning. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. CIKM ’18. 1863–1866.
 Ganguly et al. (1992) Ganguly, S., Silberschatz, A., and Tsur, S. 1992. Parallel bottom-up processing of datalog queries. J. Log. Program. 14, 1-2 (Oct.), 101–126.
 Gu et al. (2019) Gu, J., Watanabe, Y., Mazza, W., Shkapsky, A., Yang, M., Ding, L., and Zaniolo, C. 2019. RaSQL: Greater power and performance for big data analytics with recursive-aggregate-SQL on Spark. In SIGMOD’19.
 Ho et al. (2013) Ho, Q., Cipar, J., Cui, H., Kim, J. K., Lee, S., Gibbons, P. B., Gibson, G. A., Ganger, G. R., and Xing, E. P. 2013. More effective distributed ml via a stale synchronous parallel parameter server. In NIPS. 1223–1231.
 Interlandi and Tanca (2018) Interlandi, M. and Tanca, L. 2018. A Datalog-based computational model for coordination-free, data-parallel systems. Theory and Practice of Logic Programming 18, 5-6, 874–927.
 Krevat et al. (2011) Krevat, E., Tucek, J., and Ganger, G. R. 2011. Disks are like snowflakes: No two are alike. In Proceedings of the 13th USENIX Conference on Hot Topics in Operating Systems. HotOS’13. 14–14.
 Lee et al. (2014) Lee, S., Kim, J. K., Zheng, X., Ho, Q., Gibson, G. A., and Xing, E. P. 2014. On model parallelization and scheduling strategies for distributed machine learning. In Proceedings of the 27th International Conference on Neural Information Processing Systems  Volume 2. NIPS’14. 2834–2842.
 Low et al. (2012) Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., and Hellerstein, J. M. 2012. Distributed graphlab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow. 5, 8, 716–727.
 Malewicz et al. (2010) Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. 2010. Pregel: A system for large-scale graph processing. In SIGMOD’10. 135–146.
 Mazuran et al. (2013) Mazuran, M., Serra, E., and Zaniolo, C. 2013. Extending the power of datalog recursion. The VLDB Journal 22, 4 (Aug.), 471–493.
 Seo et al. (2013) Seo, J., Park, J., Shin, J., and Lam, M. S. 2013. Distributed SociaLite: A Datalog-based language for large-scale graph analysis. Proc. VLDB Endow. 6, 14 (Sept.), 1906–1917.
 Shkapsky et al. (2016) Shkapsky, A., Yang, M., Interlandi, M., Chiu, H., Condie, T., and Zaniolo, C. 2016. Big data analytics with datalog queries on spark. In SIGMOD. ACM, New York, NY, USA, 1135–1149.
 Wang et al. (2015) Wang, J., Balazinska, M., and Halperin, D. 2015. Asynchronous and fault-tolerant recursive datalog evaluation in shared-nothing engines. Proc. VLDB Endow. 8, 12 (Aug.), 1542–1553.
 Yan et al. (2015) Yan, D., Cheng, J., Lu, Y., and Ng, W. 2015. Effective techniques for message reduction and load balancing in distributed graph computation. In WWW. 1307–1317.
 Yang et al. (2015) Yang, M., Shkapsky, A., and Zaniolo, C. 2015. Parallel bottom-up evaluation of logic programs: DeALS on shared-memory multicore machines. In Technical Communications of ICLP.
 Yang et al. (2017) Yang, M., Shkapsky, A., and Zaniolo, C. 2017. Scaling up the performance of more powerful datalog systems on multicore machines. VLDB J. 26, 2, 229–248.
 Zaniolo et al. (2016) Zaniolo, C., Yang, M., Das, A., and Interlandi, M. 2016. The magic of pushing extrema into recursion: Simple, powerful datalog programs. In AMW.
 Zaniolo et al. (2017) Zaniolo, C., Yang, M., Interlandi, M., Das, A., Shkapsky, A., and Condie, T. 2017. Fixpoint semantics and optimization of recursive Datalog programs with aggregates. TPLP 17, 5-6, 1048–1065.
 Zaniolo et al. (2018) Zaniolo, C., Yang, M., Interlandi, M., Das, A., Shkapsky, A., and Condie, T. 2018. Declarative bigdata algorithms via aggregates and relational database dependencies. In AMW.
Appendix A SSP processing based recursive computation with PreM
Definition 3.
(Cover). Let $P$ be a positive recursive Datalog program with $T$ as its corresponding ICO. Let a constraint $\gamma$ be defined over the recursive predicate on a set of group-by arguments, denoted by $\bar{k}$, with the cost argument denoted as $c$. Let $\gamma$ also be PreM to $T$ and $P$. Let there be two sets $S$ and $\bar{S}$, both of which contain tuples of the form $(\bar{k}, c)$, where $c \in \mathbb{R}$ and $\mathbb{R}$ represents the set of real numbers. Now, $S$ is defined as the cover for $\bar{S}$ if, for every tuple $(\bar{k}', c') \in \bar{S}$, there exists only one tuple $(\bar{k}, c) \in S$ such that (i) $\bar{k} = \bar{k}'$ and (ii) $c \leq c'$ (assuming, without loss of generality, that $\gamma$ is a min constraint).
It is important to note from the above definition that if $S$ is the cover for $\bar{S}$, then there can exist a tuple $(\bar{k}, c) \in S$ whose group-by arguments $\bar{k}$ do not appear in any tuple of $\bar{S}$, but the converse is never true.
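For intuition, the cover definition can be transcribed directly into a few lines of Python. This is an illustrative sketch assuming a min constraint; `is_cover` and the sample sets are our own names:

```python
# Direct transcription of the cover definition (min-constraint case):
# S covers S_bar iff every tuple of S_bar has exactly one tuple in S with
# the same group-by key and a cost no larger than its own.
def is_cover(S, S_bar):
    """S, S_bar: sets of (key, cost) tuples with at most one cost per key."""
    for key_b, cost_b in S_bar:
        matches = [(k, c) for (k, c) in S if k == key_b]
        if len(matches) != 1 or matches[0][1] > cost_b:
            return False
    return True

S     = {("x", 3), ("y", 2), ("z", 8)}   # may hold keys absent from S_bar
S_bar = {("x", 5), ("y", 2)}
print(is_cover(S, S_bar))   # True: every S_bar tuple is dominated by one in S
print(is_cover(S_bar, S))   # False: S_bar has no tuple at all for key "z"
```

The asymmetry in the printed results is exactly the remark above: the cover may contain extra keys, but never the other way around.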
Lemma 2. Let $P$ be a recursive Datalog program, $T$ be its corresponding ICO and let the constraint $\gamma$ be PreM to $T$ and $P$, resulting in the constrained ICO $T_\gamma$. Now, for any pair of positive integers $(m, n)$, where $m \geq n$, $T_\gamma^m(\emptyset)$ is a cover for $T_\gamma^n(\emptyset)$.
This directly follows from the fact that any atom $(\bar{k}, c)$ in $T_\gamma^n(\emptyset)$ with cost $c$ can only exist in $T_\gamma^m(\emptyset)$ with updated cost $c'$ if $c' = c$ or $c' < c$. Note that if $m = n$, then the lemma is trivially true.
Lemma 3. Let $P$ be a recursive Datalog program with ICO $T$ and let the constraint $\gamma$ be PreM to $T$ and $P$. Let $P$ also have a parallel decomposable evaluation plan that can be executed over $k$ workers, where $P_i$ is the program executed at worker $i$ and $T_i$ is the corresponding ICO defined over $P_i$. Let $\gamma$ also be PreM to $T_i$ and $P_i$, for $i = 1, \dots, k$. After $n$ rounds of synchronization ($n$ rounds of synchronization in the SSP model means every worker has sent at least $n$ updates), if $I_i^{B,n}$ and $I_i^{S,n}$ denote the interpretations of the recursive predicate under the BSP and SSP models respectively for any worker $i$, then $I_i^{S,n}$ is a cover for $I_i^{B,n}$.
In a SSP based fixpoint computation, any worker $i$ can produce an atom in three ways:

1. From local computation not involving any of the updates sent by other workers.

2. From a join with a new atom or an update sent by another worker $j$.

3. From both cases (1) and (2) together.
Now, consider the base case, where before the first round of synchronization (i.e., at the $0^{th}$ round) each worker performs only local computation, since it has not received/sent any update from/to any other worker. Since, in a SSP model, each local computation may involve multiple iterations (as shown in step (6) in Figure 3), $I_i^{S,0}$ is trivially a cover for $I_i^{B,0}$ (from Lemma 2).
We next assume this hypothesis (Lemma 3) to be true for some $n = t$. Under this assumption, each worker $i$ in the SSP model carries out its fixpoint computation based on the information from its own $I_i^{S,t}$ and from the $I_j^{S,t}$ sent by the other workers after the $t^{th}$ round of synchronization. And since each of the involved $I_j^{S,t}$ is a cover for the corresponding $I_j^{B,t}$ (when compared against the BSP model), the aforementioned cases (1)-(3) will also produce a cover for the $(t+1)^{th}$ synchronization round.
Hence, by the principle of mathematical induction, the lemma holds for all $n$.
Theorem 2.
Let $P$ be a recursive Datalog program with ICO $T$ and let the constraint $\gamma$ be PreM to $T$ and $P$. Let $P$ have a parallel decomposable evaluation plan that can be executed over $k$ workers, where $P_i$ is the program executed at worker $i$ and $T_i$ is the corresponding ICO defined over $P_i$. If $\gamma$ is also PreM to $T_i$ and $P_i$ for $i = 1, \dots, k$, then:

1. The SSP processing yields the same minimal fixpoint of $T_\gamma$ as would have been obtained with BSP processing.

2. If any worker $i$ under BSP processing requires $n_i$ rounds of synchronization, then $i$ under SSP processing would require at most $n_i$ rounds to reach the minimal fixpoint, where $n$ rounds of synchronization in the SSP model means every worker has sent at least $n$ updates.
Theorem 1 guarantees that the BSP evaluation of the Datalog program with $\gamma$ will yield the minimal fixpoint of $T_\gamma$. Note that in the SSP evaluation, the cost $c$ of every tuple $(\bar{k}, c)$ produced by a worker $i$ from the program $P_i$ can never fall below the cost of the corresponding atom in the minimal fixpoint, since every SSP derivation is also a valid derivation of $P$. In other words, if $I^{S}$ represents the final interpretation of the recursive predicate under SSP evaluation, then $I^{S}$ is bounded. It also follows from Lemma 3 that $I^{S}$ is a cover for the final interpretation of the recursive predicate under BSP evaluation, i.e., $I^{S}$ is a cover for $I^{B}$. Since $I^{B}$ is the least fixpoint under the constraint, we also get that atoms in $I^{B}$ must have identical cost in $I^{S}$.
Thus, we can write the following equation based on the above discussion,
$I^{B} \subseteq I^{S}$.   (1)
Also recall that, since $\gamma$ is PreM to each $T_i$ and $P_i$, under the SSP evaluation each worker also applies $\gamma$ in every iteration of its fixpoint computation (step (4) in Figure 3), so every atom of $I^{S}$ carries the minimal cost for its group-by key and must also belong to the minimal fixpoint. Thus, we have,
$I^{S} \subseteq I^{B}$.   (2)
Combining equations (1) and (2), we get $I^{S} = I^{B}$. Thus, the SSP evaluation also yields the same minimal fixpoint as the BSP model.
Since the interpretation of the recursive predicate in the least model obtained from BSP evaluation is identical to that in the least model obtained from SSP processing, it directly follows from Lemma 3 that the number of synchronization rounds required by a worker $i$ in SSP evaluation will be at most $n_i$, where $n_i$ is the number of rounds worker $i$ takes under the BSP model.
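The first claim of Theorem 2 can be observed on a toy example. The following self-contained simulation (our own illustrative sketch, not the paper's system) evaluates single-source shortest paths, a min aggregate in recursion, once with lock-step rounds over a global snapshot (BSP-style) and once with relaxations that read only a periodically refreshed, stale snapshot (SSP-style); both reach the same minimal fixpoint.

```python
# Illustrative simulation of Theorem 2's first claim on a toy graph.
# EDGES, bsp_sssp and stale_sssp are our own names, not the paper's.
INF = float("inf")
EDGES = [(0, 1, 4), (0, 2, 1), (2, 1, 1), (1, 3, 2), (2, 3, 6)]

def bsp_sssp(edges, n, source=0):
    """Lock-step (BSP-style) rounds: every relaxation reads a global snapshot."""
    dist = [INF] * n
    dist[source] = 0
    changed = True
    while changed:
        changed = False
        snap = list(dist)                 # barrier: all workers see this state
        for u, v, w in edges:
            if snap[u] + w < dist[v]:
                dist[v] = snap[u] + w
                changed = True
    return dist

def stale_sssp(edges, n, source=0, slack=3):
    """SSP-style rounds: several iterations run on a stale snapshot before it
    is refreshed; the min aggregate makes the stale reads harmless."""
    dist = [INF] * n
    dist[source] = 0
    while True:
        snap = list(dist)                 # infrequent synchronization point
        for _ in range(slack + 1):        # keep computing on stale state
            for u, v, w in edges:
                if snap[u] + w < dist[v]: # source cost read from stale copy
                    dist[v] = snap[u] + w
        if dist == snap:                  # no progress since last refresh
            return dist

print(bsp_sssp(EDGES, 4))    # [0, 2, 1, 4]
print(stale_sssp(EDGES, 4))  # [0, 2, 1, 4] -- same minimal fixpoint
```

Stale reads can only over-estimate a source cost, so every derivation remains sound and the min constraint eventually drives both evaluations to the identical least fixpoint, mirroring equations (1) and (2) of the proof.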