The growth of Big Data applications adds new vigor to the quest by Datalog researchers to combine the expressive power found in recursive Prolog programs with the performance and scalability of relational DBMSs. Their research led to the delivery of a first commercial Datalog system and significantly impacted query languages and systems. In particular, many DBMS vendors introduced support for recursive queries into their systems and the SQL:2003 standard by adopting Datalog's key notions and techniques, including (a) stratified semantics for negation and aggregates, and (b) optimization techniques that include (i) semi-naive fixpoint computation, (ii) constraint pushing for left/right-linear rules, and (iii) magic sets for linear rules. However, the impact of these extensions in the marketplace was limited, particularly when compared with the OLAP and Data Cube extensions for descriptive analytics which, by providing simple extensions of SQL aggregates, allowed DBMS vendors to make very successful inroads into the brave new world of Big Data analytics. As Big Data analytics grew in diversity and complexity, however, DBMSs showed their limitations through their inability to support KDD, graph, and ML applications. For instance, (I) Agrawal et al. showed that KDD applications are very difficult to support in SQL DBMSs, while (II) Stonebraker et al. pointed out that MapReduce owed its extraordinary popularity to its successful extension to recursive applications, such as Page Rank, but they never called for extensions of the data parallelism of SQL DBMSs that could support such new applications. Finally, (III) more than a dozen graph database systems have been developed for this important application domain, which is problematic for SQL DBMSs. If we now examine the many factors that led to this non-optimal situation, we see that unresolved research issues played a role of paramount importance.
For instance, since the early days of Datalog, researchers have been aware that many algorithms can be expressed quite naturally in Datalog once aggregates and non-monotonic constructs, such as choice, are allowed in recursion, as per the following incomplete citation list: [13, 11, 6, 19] and [9, 14, 20, 7]. But as we will discuss later in the 'related work' section, these early proposals suffered from various limitations, and the fact that non-monotonic reasoning research was still evolving discouraged premature commitments to a particular solution. In the years since then, we have seen major progress on the semantic front, with the stable-model semantics gaining widespread acceptance. This is due to its great power and generality, which extends beyond the original focus of Datalog to cover disjunctive programs and answer-set semantics. Unfortunately, in its general form, answer-set semantics requires computational complexity levels that are unsuitable for the Big Data applications that have become the computer science cynosure. As a result, approaches to non-monotonic semantics that enable efficient implementations for a very wide range of applications have now become a pressing need. This is because, in recent years, Big Data applications for graphs, KDD and ML have grown by leaps and bounds, laying bare the inadequacy of stratified Datalog and SQL in such crucial domains.
Many systems have addressed this surge of critical Big Data applications by providing specialized libraries of functions that are written in various PLs and either operate externally on data extracted from the DBMS, or internally as an extension of the DBMS. While this approach can be effective for some applications and systems, a growing number of researchers have been investigating whether it is possible to extend to these new applications the old DBMS paradigm, in which usability, portability, and scalability were achieved by writing high-level queries that the system then executes efficiently using sophisticated query optimization and data parallelism techniques [16, 12, 21, 17, 22, 10, 4, 5]. All these research projects share important common points, including (i) the use of parallel, multi-processor or multi-node, architectures to achieve performance and scalability and (ii) the use of aggregates in recursion to express more advanced applications. However, the UCLA projects also address the difficult semantic issues raised by aggregates in recursion, which were left untouched by the other projects. Indeed, UCLA's work shows that formal semantics can be combined with generality and superior performance, and can even enhance it in unexpected ways. In fact, the Pre-Mappability property that was introduced to achieve formal semantics can often deliver better performance by compensating for skewness across workers in a parallel execution [5, 3].
Previous works have shown various ways in which aggregates can be used in recursive logic programs while retaining formal semantics. Thus, using aggregates that are monotonic in the lattice of set containment was discussed in [12, 17, 22]. Among the aggregates, using min and max in recursive programs that are equivalent to stratified programs was discussed in [23, 10]. In this paper, we explore a third important situation where programs using non-monotonic aggregates nevertheless define monotonic mappings because those aggregates are applied to sets of known cardinalities. In fact, the computation of an aggregate such as sum is performed in two phases. In the initial phase, we progressively add to the current continuous sum each item in the set. In the final phase, we detect the end of the input and return the last value produced in the initial phase. The desirability of clearly distinguishing between the two phases is well recognized when dealing with continuous queries on data streams: in fact, aggregates returning only the results from the initial-phase computation are non-blocking, whereas those returning results produced in the second phase are blocking. Likewise, the need to provide users with continuous aggregates that only compute in their initial phase was recognized in SQL:2003 with the introduction of OLAP Functions that support continuous aggregates. Drawing a clear distinction between the continuous and final versions of the aggregate produced in the two phases is also very important when using them in recursion. Indeed, the continuous initial aggregates are monotonic, whereas the final ones are non-monotonic and cannot be used as such in recursion. However, in many situations, including those where the cardinality of the set is known, the final-phase computation can be recast in monotonic terms, whereby the whole aggregate becomes monotonic and can be used to express powerful recursive queries concisely and efficiently, as discussed in the rest of the paper, which is organized as follows. In the next section we formally define the initial and final versions of aggregates and the notion of pre-computability for the latter. In Section 3, we extend these notions to group-by aggregates, and in Section 4 we show how they make possible the simple expression and efficient computation of Markov Chains and Lloyd's clustering algorithm. Then, the conclusion, in Section 5, points out that many KDD and ML algorithms can be expressed in a similar way.
2 Defining Set Aggregates
We begin by defining a general template to compute aggregates using Horn Clauses. More specifically, let denote a set of atoms (no duplicates), i.e., . Now, the continuous count aggregate on returns the set of positive integers that do not exceed the cardinality of the set and can be computed as follows using Horn clauses:
Example 1 (Defining continuous count).
Thus the goal progressively returns integers up to the actual cardinality of the above set . The name monotonic count is also used for continuous count, since it is defined by Horn clauses that always generate monotonic mappings in the lattice of set-containment.
In terms of implementation, the above formal definition of monotonic count is quite inefficient since it constructs all possible permutations of the values, while only one such permutation needs to be considered. Thus, actual realizations of continuous count in systems visit each atom in in some efficient way, typically in the sequential order in which the atoms are stored.
The traditional final count used in SQL-2, i.e., the cardinality of set , can be derived as the maximum of the continuous count . But rather than using this approach that defines one aggregate using another, we can define it by the rules , , and of Example 1 and the following final rule:
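The sequential realization described above can be sketched in Python as follows. This is only an illustrative sketch, not the paper's Horn-clause definition: the function name `continuous_count` and the use of a Python iterable to stand for the stored atoms are assumptions made for the example.

```python
from typing import Hashable, Iterable, Iterator


def continuous_count(atoms: Iterable[Hashable]) -> Iterator[int]:
    """Yield 1, 2, ..., n while visiting the atoms once, in storage order.

    Only one permutation of the set is considered. Duplicates are skipped
    so that the input is treated as a set (hypothetical helper).
    """
    seen = set()
    for atom in atoms:
        if atom in seen:
            continue  # set semantics: an atom is counted at most once
        seen.add(atom)
        yield len(seen)


# Monotonicity in the lattice of set containment: a larger input set
# can only add integers to the result, never retract one.
small = set(continuous_count(["a", "b"]))
large = set(continuous_count(["a", "b", "c"]))
assert small <= large
```

The generator never needs to know that the input has ended, which is why this phase is non-blocking.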
Example 2 (Defining Final Count from continuous count in Example 1).
Thus the final count is defined using negation, and therefore it is non-monotonic, unlike continuous count. But as in the case of continuous count, its implementation will be expedited by considering only one of the possible permutations of the values in . Furthermore, the rule condition can be implemented by any test that determines that this is the last atom in the set. For instance, if the facts are stored in a file, then the logical condition is realized by detecting that the next datum is the end-of-file () mark. This is just one way in which the realization of condition is detected in a DBMS. Indeed, if our table is indexed using a B+ tree, then there might not be any mark in the data blocks, and the termination condition is realized by the last bottom-level index block having a null pointer to the next block. Moreover, if is the result of a join or other relational algebra expression, it is the responsibility of the DBMS to signal to the function implementing the aggregate that the condition is satisfied because all the data satisfying the expression have been generated. To determine the semantic properties of aggregates, we still need to consider both (i) their explicit logic-based definitions, such as the one just described for count, and (ii) the fact that the completion condition might not be part of the logic program expressing the application at hand since it is implicitly tested by the underlying system.
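As a minimal sketch of the file-based completion test described above, the following Python function returns the final count only when the end of the input is detected. It assumes, purely for illustration, that each fact occupies one line of a text stream and that the facts are distinct; the name `final_count` is hypothetical.

```python
import io
from typing import TextIO


def final_count(stream: TextIO) -> int:
    """Return the number of facts in the stream, one fact per line.

    The running value is the continuous count. It becomes the final
    count only when the completion condition (end of file) is detected.
    """
    count = 0
    while True:
        line = stream.readline()
        if line == "":  # readline() returns "" only at end of file
            return count  # completion detected: emit the final value
        count += 1  # continuous phase: one more fact seen


facts = io.StringIO("p(a)\np(b)\np(c)\n")
print(final_count(facts))
```

Unlike the continuous count, this value is blocking and non-monotonic: adding one more fact to the stream replaces the result rather than extending it.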
2.1 From Counts to Sums and Averages
The definition of other aggregates such as sum or average will use the template established for count, consisting of an initial phase where their continuous version is computed, and then of a second phase where the final result is returned. Moreover, the final count can be used as the completion test that brings about the final phase in the computation of these aggregates.
For instance, the sum of the -values that satisfy can be defined as shown in Example 3 where rules and compute both the continuous sum and the continuous count as the first and the second arguments of . The value of the final sum is actually the value of the continuous sum when the continuous count value reaches a value that is equal to the cardinality of our set , i.e. a value that is equal to the final count. Rule expresses this completion condition using the predicate that was defined in Example 2 for the computation of .
Example 3 (Defining continuous and final sum).
Thus, the sum aggregate is basically defined by a monotonic computation, except for a final rule that calls on the final-count predicate, which is non-monotonic. However, in many situations final-count is known before we enter the recursive computation of . For instance, this is true when represents the atoms of a vector of known length. Moreover, in many situations where the cardinality of is not known, it can actually be computed using the program in Example 1, to produce in a lower stratum. Then, the rules and in Example 3 will still be used to compute m, but instead of we will use the following rule to compute the final-sum:
Therefore, the final sum aggregate can be implemented by a stratified program where the lower stratum performs the non-monotonic computation of final count, and the next stratum derives sum by a monotonic computation since rules , and do not use negation.
Thus, in our definition of sum we have combined the computation of sum and count into one stratified program, where rules , and occupy a lower stratum, the rule containing negation is at a higher stratum, and rules , and occupy a still higher stratum. Only rule , which determines the actual count, is non-monotonic.
In every program, recursive or not, whenever we have to compute sum on a set whose cardinality is already known, rule is no longer needed and the computation of our aggregate becomes yet another monotonic predicate defined using Horn clauses. In practice, the situation is even more dramatic since the detection of the or other termination condition can be used to replace rule provided that the actual count computation could have been completed and saved away before the computation of sum, whereby or other completion condition will simply trigger the retrieval of this value to execute rule .
Similar observations also hold true for other aggregates such as average, and extrema aggregates. In fact, the average aggregate can be computed by replacing rule with the following rule:
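The two-phase computation of sum on a set of known cardinality can be sketched as follows. This is an illustrative Python rendering under stated assumptions, not the rules of Example 3: the names `continuous_sum` and `final_sum` are hypothetical, and the cardinality is simply passed in as a parameter that is assumed to have been computed beforehand.

```python
from typing import Iterable, Iterator, Optional, Tuple


def continuous_sum(values: Iterable[float]) -> Iterator[Tuple[float, int]]:
    """Yield (continuous sum, continuous count) after each distinct value."""
    total, count = 0, 0
    seen = set()
    for value in values:
        if value in seen:
            continue  # set semantics: each value contributes once
        seen.add(value)
        total += value
        count += 1
        yield total, count


def final_sum(values: Iterable[float], cardinality: int) -> Optional[float]:
    """Return the sum once the continuous count reaches the known cardinality.

    No negation or end-of-input test is needed: the completion condition
    is the positive test count == cardinality.
    """
    for total, count in continuous_sum(values):
        if count == cardinality:
            return total
    return None  # the set is not yet complete, so no final sum is derived


print(final_sum([3, 5, 7], cardinality=3))
```

Under the same assumptions, dividing the returned total by `cardinality` would give the average.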
For Max, we can instead write the following rules where we use the predicate to return the larger of the two values and (they cannot be equal since we are using set semantics).
Example 4 (Defining the max on a set where final_count is known.).
A dual definition holds for min, where instead of larger we will use a predicate that returns the smaller of the two values.
3 Group-By Aggregates
The logical definition of aggregates specified with a group-by clause can be derived as an extension of that provided in the previous section. Take for instance the following rule:
Then the joint computation of sum and count can be performed as follows:
Example 5 (Defining continuous sum and final sum in the presence of group-by).
Thus, the computation starts with that sets the values of sum and count to zero. Then, after checking that the pair is in fact new, we increase the values of and for the group-by value matching , but we only increase the value for the rest. Thus at the end of the fixpoint computation, for each value we will have the sum of the -values associated with it. The values will be the same for every since it is equal to the cardinality of the set containing the facts.
The final rule returns the final value of continuous sum when the continuous count has reached its final value. This is the only rule using negation, and if we can replace it by conditions expressed without using negation, the whole computation of final sum becomes monotonic. This is, for instance, the situation when the value of can be pre-computed before we enter into the computation of sum, as in the case in which is computed on a set of fixed or Pre-Countable Cardinality (PCC). This logic-based definition of group-by sum on sets of PCC is easily extended to other aggregates. In fact, if in Example 5 we replace sum by count, avg, min and max, we obtain a well-defined and efficiently computable semantics.
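A group-by sum on a set of pre-countable cardinality can be sketched along the same lines. The Python below is an assumption-laden illustration rather than the program of Example 5: facts are modeled as (group key, value) pairs, the name `groupby_sum` is hypothetical, and the single shared count plays the role of the count that is the same for every group.

```python
from typing import Dict, Hashable, Iterable, Optional, Tuple


def groupby_sum(
    facts: Iterable[Tuple[Hashable, float]], cardinality: int
) -> Optional[Dict[Hashable, float]]:
    """Sum the values per group key over a set of known cardinality.

    One shared count tracks how many distinct facts have been seen.
    The per-group sums become final when it reaches the cardinality.
    """
    sums: Dict[Hashable, float] = {}
    seen = set()
    count = 0
    for key, value in facts:
        if (key, value) in seen:
            continue  # check that the pair is in fact new
        seen.add((key, value))
        sums[key] = sums.get(key, 0) + value  # only the matching group grows
        count += 1  # the count grows for the whole set
        if count == cardinality:
            return sums  # positive completion test, no negation involved
    return None


print(groupby_sum([("x", 1), ("y", 10), ("x", 2)], cardinality=3))
```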
4 Pre-Countable Cardinalities in Recursion
Many examples of great practical interest belong to the PCC category. For instance, the Markov-Chain application is of great interest because it is closely related to the Page Rank algorithm that is at the root of the MapReduce developments. We assume that we are given as DB facts which respectively describe the names of the cities of interest and the fraction of the population that will (most likely) move from to in the course of a year. For each city there is also a non-zero arc from the city back to the same city showing people that will not move away. Therefore, the sum of for the arcs leaving the city (i.e., a node) is always equal to one.
Thus, assuming that initially every city has a population of, say, , we need to find how the population evolves over the years. For that we can use the following program:
Example 6 (The kernel of the Markov Chain algorithm).
Now observe that, in the course of the fixpoint computation, the atoms with index value are generated after those with index value . Thus, the computation of will return the current value of the continuous sum as its final-sum as soon as the computation of completes reaching the final count. This final count in fact is equal to the size of the result obtained by joining with , which is actually independent of . Therefore, this count value can be computed at a stratum lower than the stratum of by rules defining a predicate named, say, , which will then be passed to the rules of our program via an additional goal added to the rules defining . The stratified program so obtained defines the formal semantics of our logic program. Now, this formal semantics can be realized by an operational semantics that dispenses with the computation of , and simply detects the completion of the natural join of with . In fact, if we count the number of tuples being produced by the natural join, this will return upon its completion, whereby the test specified by can be replaced by a test that the computation of the join is completed, a test that is already built into the implementation provided by the system.
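One step of this computation can be sketched in Python as follows. The sketch is illustrative only: the dictionaries `pop` and `move`, the function name `markov_step`, the two example cities, and the initial population of 100 are all assumptions introduced here, not names or values taken from Example 6.

```python
from typing import Dict, Tuple


def markov_step(
    pop: Dict[str, float], move: Dict[Tuple[str, str], float]
) -> Dict[str, float]:
    """Compute the populations at step j+1 from those at step j.

    The join of the step-j populations with the arcs produces exactly
    len(move) tuples at every step, so its size is pre-countable. The
    end of the loop is the join-completion test that makes the sums final.
    """
    new_pop = {city: 0.0 for city in pop}
    for (src, dst), fraction in move.items():
        new_pop[dst] += pop[src] * fraction  # continuous sum per destination
    return new_pop  # join completed: the continuous sums are now final


# Hypothetical data: the fractions leaving each city add up to one.
move = {("a", "a"): 0.5, ("a", "b"): 0.5, ("b", "a"): 0.25, ("b", "b"): 0.75}
pop0 = {"a": 100.0, "b": 100.0}
pop1 = markov_step(pop0, move)
print(pop1)
```

Because every city's outgoing fractions add up to one, the total population is preserved from one step to the next.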
Termination and Optimization
In addition to issues of formal semantics, the approach discussed in the previous section also allows us to deal effectively with termination and optimization issues. For instance, in its current form, our Markov chain example is non-terminating. A simple solution is to introduce a condition such as if we want to ensure that the fixpoint iteration terminates in steps. Typically, however, steps are not needed for the computation to converge to a state in which each successive step returns the same and results as the previous step. To stop as soon as convergence occurs, an additional goal can be added to check that the population has increased in some city (and therefore decreased in others). Finally, we might want to specify that we are only interested in the final, i.e., the max value for the index . We then obtain the following program:
Example 7 (The actual Markov chain algorithm).
The last two rules specify post-conditions that must be applied at the end of the fixpoint computation; however, it is quite straightforward for the compiler to integrate them into the semi-naive fixpoint computation to achieve a significant optimization. In fact, during the semi-naive fixpoint computation, the compiler identifies new facts (i.e., the delta) obtained at each point in the computation. Moreover, the latest delta atoms are identified by the latest value of , which is also the max value of the index i.e., the values returned by upon termination. Thus and can be implemented by simply returning the latest delta atoms upon termination. (An alternative approach using chain max is also available.)
Now, the fact that only the atoms for the max value of are needed implies that all others can be dropped to achieve a much more efficient usage of memory.
Therefore, we now have a formal semantics defined by a stratified program consisting of (i) a bottom stratum where count is defined, (ii) a middle stratum of Horn clauses, i.e., monotonic rules , and (iii) a top stratum used to post-select the final results of interest. Now, this formal semantics is realized via a very efficient operational semantics that only requires the semi-naive computation in the middle stratum, inasmuch as the completion of the join in (i) replaces the completion of the final count, and the extraction of the final results in (iii) is realized by the selection of the final delta in the semi-naive fixpoint.
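The operational behavior described above (a bound on the number of steps, a stop at convergence, and retention of only the latest delta) can be sketched in Python as follows. This is a hedged illustration and not the program of Example 7: the names `markov_chain`, `max_steps` and `tol` are hypothetical, and the numeric tolerance is an assumption added because floating-point values rarely become exactly equal.

```python
from typing import Dict, Tuple


def markov_chain(
    pop0: Dict[str, float],
    move: Dict[Tuple[str, str], float],
    max_steps: int = 1000,
    tol: float = 1e-9,
) -> Tuple[int, Dict[str, float]]:
    """Iterate until no city changes by more than tol, or max_steps is hit.

    Only the populations for the latest index are kept in memory, which
    mirrors returning the last delta of the semi-naive fixpoint.
    """
    pop = dict(pop0)
    step = 0
    for step in range(1, max_steps + 1):
        new_pop = {city: 0.0 for city in pop}
        for (src, dst), fraction in move.items():
            new_pop[dst] += pop[src] * fraction
        changed = any(abs(new_pop[c] - pop[c]) > tol for c in pop)
        pop = new_pop  # older index values are dropped
        if not changed:
            break  # convergence: no city gained population
    return step, pop


move = {("a", "a"): 0.5, ("a", "b"): 0.5, ("b", "a"): 0.25, ("b", "b"): 0.75}
steps, final_pop = markov_chain({"a": 100.0, "b": 100.0}, move)
print(steps, final_pop)
```

For this hypothetical two-city input the populations approach 200/3 and 400/3, the stationary split implied by the chosen fractions.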
Similar conclusions and optimizations hold for the popular clustering technique known as Lloyd’s algorithm discussed next.
Lloyd’s Clustering Algorithm
We are given a large set of K-dimensional points. Each point is described by a unique and the coordinate values in each of its dimensions, i.e., by . We also have a small set of centroids, e.g., say that we have 10 such points. Then to generate the initial assignment , we use the predicate that implements one of the many techniques described in the literature. Therefore, we have the algorithm shown in Example 8. At each step , the algorithm finds the closest center for each point. Then a new set can be generated by averaging their coordinates. (In our rules we use the and to represent two short integers by one long integer and to reconstruct the original numbers.)
Example 8 (Clustering à la Lloyd).
Let , and denote the respective cardinalities of our sets of points, centers, and dimensions. Then, for each rule, the aggregate computation involves a number of elements that is independent of : specifies a computation taking place over , specifies a computation over elements, and the computation of is over elements. These are counts that can be easily determined before the recursive computation and remain the same for every value of . These explicit values could be passed to the recursive rules for computing the monotonic versions of the aggregates used in these rules. But a much simpler and more efficient solution consists in letting the system detect the completion of execution of the body operators at each step , which is already implemented as part of the optimized semi-naive fixpoint computation, as in the case of the Markov Chain computation. Indeed, issues similar to the termination and optimization that we discussed for Markov Chain also hold for Lloyd's algorithm.
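The iteration described above can be sketched in plain Python as follows. This is an illustrative sketch and not the rules of Example 8: the function names, the squared Euclidean distance, the treatment of empty clusters, and the four sample points are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def lloyd(
    points: List[Point], centers: List[Point], max_steps: int = 100
) -> List[Point]:
    """Alternate assignment and averaging until the centers stop changing.

    At every step the min aggregate ranges over the same number of
    centers per point and the averages range over the same set of
    points, so all cardinalities are known before the iteration starts.
    """
    centers = [tuple(float(x) for x in c) for c in centers]
    for _ in range(max_steps):
        clusters: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(xs) / len(members) for xs in zip(*members))
            if members
            else centers[k]  # assumption: an empty cluster keeps its center
            for k, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # fixpoint reached
        centers = new_centers  # only the latest centers are retained
    return centers


pts = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(lloyd(pts, [(0.0, 0.0), (10.0, 0.0)]))
```

The end of each inner loop plays the role of the completion test that the system detects at every step, so no explicit count needs to be passed to the aggregates.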
5 Conclusion
Recursive computations on datasets of fixed cardinality represent an area of great theoretical and practical interest for Datalog and other logic-based languages. Indeed, we have shown that important applications such as Markov Chains and Lloyd's clustering algorithm can be expressed very efficiently using aggregates in recursion, while avoiding the difficult semantic issues besetting the use of non-monotonic constructs in recursive programs. In all these examples, we dealt with facts describing a given set of entities, such as cities, by their attributes (such as population). We have found that, when the number of such entities in the world remains unchanged, aggregates on the attributes of these entities can be used in recursive logic rules while preserving the desirable properties of fixpoint computations. This and other recent results using the Pre-Mappability property of extrema [10, 4, 5] suggest that aggregates can provide the long-sought bridge between formal non-monotonic semantics and efficient scalable big data computations that, over many years of work, could not be built by non-monotonic reasoning researchers using only negation.
References
[1] M. Aref, B. ten Cate, T. J. Green, B. Kimelfeld, et al. Design and implementation of the LogicBlox system. In SIGMOD, pages 1371–1382. ACM, 2015.
[2] S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65–74, 1997.
[3] A. Das, S. M. Gandhi, and C. Zaniolo. ASTRO: A Datalog system for advanced stream reasoning. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM 2018, Torino, Italy, October 22-26, 2018, pages 1863–1866, 2018.
[4] A. Das, Y. Li, J. Wang, M. Li, and C. Zaniolo. Bigdata applications from graph analytics to machine learning by aggregates in recursion. In ICLP'19, 2019.
[5] A. Das and C. Zaniolo. A case for stale synchronous distributed model for declarative recursive computation. In 35th International Conference on Logic Programming, ICLP'19, 2019.
[6] S. Ganguly, S. Greco, and C. Zaniolo. Minimum and maximum predicates in logic programming. In PODS, pages 154–163, 1991.
[7] S. Ganguly, S. Greco, and C. Zaniolo. Extrema predicates in deductive databases. Journal of Computer and System Sciences, 51(2):244–259, 1995.
[8] M. Gelfond and V. Lifschitz. The stable model semantics for logic programming. In ICLP, pages 1070–1080, 1988.
[9] S. Greco, C. Zaniolo, and S. Ganguly. Greedy by choice. In PODS, pages 105–113. ACM, 1992.
[10] J. Gu, Y. Watanabe, W. Mazza, A. Shkapsky, M. Yang, L. Ding, and C. Zaniolo. RaSQL: Greater power and performance for big data analytics with recursive-aggregate-SQL on Spark. In ACM SIGMOD Int. Conference on Management of Data, Amsterdam, NL, June 30–July 5, 2019.
[11] D. B. Kemp and P. J. Stuckey. Semantics of logic programs with aggregates. In ISLP, pages 387–401, 1991.
[12] M. Mazuran, E. Serra, and C. Zaniolo. A declarative extension of Horn clauses, and its significance for Datalog and its applications. TPLP, 13(4-5):609–623, 2013.
[13] I. S. Mumick, H. Pirahesh, and R. Ramakrishnan. The magic of duplicates and aggregates. In VLDB, pages 264–277. Morgan Kaufmann Publishers Inc., 1990.
[14] K. A. Ross and Y. Sagiv. Monotonic aggregation in deductive databases. In PODS, pages 114–126, 1992.
[15] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating mining with relational database systems: Alternatives and implications. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, pages 343–354, 1998.
[16] J. Seo, J. Park, J. Shin, and M. S. Lam. Distributed SociaLite: a Datalog-based language for large-scale graph analysis. PVLDB, 6(14):1906–1917, 2013.
[17] A. Shkapsky, M. Yang, M. Interlandi, H. Chiu, T. Condie, and C. Zaniolo. Big data analytics with Datalog queries on Spark. In SIGMOD, pages 1135–1149. ACM, 2016.
[18] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: Friends or foes? Commun. ACM, 53(1):64–71, Jan. 2010.
[19] S. Sudarshan and R. Ramakrishnan. Aggregation and relevance in deductive databases. In VLDB, pages 501–511, 1991.
[20] A. Van Gelder. Foundations of aggregation in deductive databases. In Deductive and Object-Oriented Databases, pages 13–34. Springer, 1993.
[21] J. Wang, M. Balazinska, and D. Halperin. Asynchronous and fault-tolerant recursive Datalog evaluation in shared-nothing engines. PVLDB, 8(12):1542–1553, 2015.
[22] M. Yang, A. Shkapsky, and C. Zaniolo. Scaling up the performance of more powerful Datalog systems on multicore machines. VLDB J., 26(2):229–248, 2017.
[23] C. Zaniolo, M. Yang, A. Das, A. Shkapsky, T. Condie, and M. Interlandi. Fixpoint semantics and optimization of recursive Datalog programs with aggregates. TPLP, 17(5-6):1048–1065, 2017.