Relational E-Matching

08/04/2021 ∙ by Yihong Zhang, et al. ∙ University of Washington

We present a new approach to e-matching based on relational join; in particular, we apply recent database query execution techniques to guarantee worst-case optimal run time. Compared to the conventional backtracking approach that always searches the e-graph "top down", our new relational e-matching approach can better exploit pattern structure by searching the e-graph according to an optimized query plan. We also establish the first data complexity result for e-matching, bounding run time as a function of the e-graph size and output size. We prototyped and evaluated our technique in the state-of-the-art egg e-graph framework. Compared to a conventional baseline, relational e-matching is simpler to implement and orders of magnitude faster in practice.

1. Introduction

The congruence closure data structure, also known as the e-graph, is a central component of SMT-solvers (simplify; z3; moskal; cvc4) and equality saturation-based optimizers (eqsat; egg). An e-graph compactly represents a set of terms and an equivalence relation over the terms. An important operation on e-graphs is e-matching, which finds the set of terms in an e-graph matching a given pattern. In SMT-solvers, e-matching is used to instantiate quantified formulas over ground terms. In equality saturation, e-matching is used to match rewrite rules on an e-graph to discover new equivalent programs. The efficiency of e-matching greatly affects the overall performance of the SMT-solver (z3; cvc4), and slow e-matching is a major bottleneck in equality saturation (egg; tensat; szalinski). In a typical application of equality saturation, e-matching is responsible for 60–90% of the overall run time (egg).

Several algorithms have been proposed for e-matching (efficient-ematching; moskal; simplify). However, due to the NP-completeness of e-matching (ematching-nph), most algorithms implement some form of backtracking search, which is inefficient in many cases. In particular, backtracking search only exploits structural constraints, which are constraints on the shape of a pattern, but defers checking equality constraints, which require that each variable be mapped consistently. This leads to suboptimal run time when the equality constraints dominate the structural constraints.

To improve the performance of backtracking-based e-matching, existing systems implement various optimizations. Some of these optimizations only deal with patterns of certain simple shapes and are therefore ad hoc in nature (moskal). Others attempt to incrementalize e-matching upon changes to the e-graph, or match multiple similar patterns together to eliminate duplicated work (efficient-ematching). However, these optimizations are complex to implement and fail to generalize to workloads where the e-graph changes rapidly or when the patterns are complex.

To tackle the inefficiency and complexity involved in e-matching, we propose a systematic approach to e-matching called relational e-matching. Our approach is based on the observation that e-matching is an instance of a well-studied problem in the databases community, namely answering conjunctive queries. We therefore propose to solve e-matching on an e-graph by reducing it to answering conjunctive queries on a relational database. This approach has several benefits. First, by reducing e-matching to conjunctive queries, we simplify e-matching by taking advantage of decades of study by the databases community. Second, the relational representation provides a unified way to express both the structural constraints and equality constraints in patterns, allowing query optimizers to leverage both kinds of constraints to generate asymptotically faster query plans. Finally, by leveraging the generic join algorithm, a novel algorithm developed in the databases community, our technique achieves the first worst-case optimal bound for e-matching.

Relational e-matching is provably optimal despite the NP-hardness of e-matching. The databases community makes a clear distinction between query complexity, the complexity dependent on the size of the query, and data complexity, the complexity dependent on the size of the database. The NP-hardness result (ematching-nph) is stated over the size of the pattern, yet in practice only small patterns are matched on a large e-graph. When we hold the size of each pattern constant, relational e-matching runs in time polynomial over the size of the e-graph.

Our approach is widely applicable. For example, multi-patterns are typically framed as an extension to e-matching that allows the user to find matches satisfying multiple patterns simultaneously. Efficient support for multi-patterns requires modifying the basic backtracking algorithm (efficient-ematching). In contrast, relational e-matching inherently supports multi-patterns for free. The relational model also opens the door to entirely new kinds of optimizations, such as persistent or incremental e-graphs.

To evaluate our approach, we implemented relational e-matching for egg, a state-of-the-art implementation of e-graphs. Relational e-matching is simpler, more modular, and orders of magnitude faster than egg’s e-matching implementation.

In summary, we make the following contributions in this paper:

  • We propose relational e-matching, a systematic approach to e-matching that is simple, fast, and optimal.

  • We adapt generic join to implement relational e-matching, and provide the first data complexity results for e-matching.

  • We prototyped relational e-matching in egg, a state-of-the-art e-graph implementation, and we show that relational e-matching can be orders of magnitude faster. [Footnote 1: We will open source our implementation.]

The rest of the paper is organized as follows: Section 2 reviews relevant background on the e-graph data structure, the e-matching problem, conjunctive queries and join algorithms. Section 3 presents our relational view of e-graphs, our e-matching algorithm, and the complexity results. Section 4 discusses optimizations on our core algorithm and addresses various practical concerns. Section 5 evaluates our algorithm and implementation with a set of experiments in the context of equality saturation. Section 6 discusses how the relational model opens up many avenues for future work in e-graphs and e-matching, and Section 7 concludes.

2. Background

Throughout the paper we follow the notation in Figure 1. We define the e-graph data structure and the e-matching problem, and review background on relational queries and join algorithms that form the foundation of our e-matching algorithm.

2.1. E-Graphs and E-Matching

function symbols ::=
variables ::=
e-class ids ::=
ground terms ::=
patterns ::=
e-nodes ::=
e-classes ::=
Figure 1. Syntax and metavariables used in this paper.

Terms.

Let $\Sigma$ be a set of function symbols with associated arities. A function symbol is called a constant if it has zero arity. Let $V$ be a set of variables. We define $T(\Sigma, V)$ to be the set of terms constructed using function symbols from $\Sigma$ and variables from $V$. More formally, $T(\Sigma, V)$ is the smallest set such that (1) all variables and constants are in $T(\Sigma, V)$ and (2) $t_1, \ldots, t_k \in T(\Sigma, V)$ implies $f(t_1, \ldots, t_k) \in T(\Sigma, V)$, where $f \in \Sigma$ has arity $k$. A ground term is a term in $T(\Sigma, V)$ that contains no variables. All terms in $T(\Sigma, \emptyset)$ are ground terms. A non-ground term is also called a pattern. We call a term of the form $f(t_1, \ldots, t_k)$ an $f$-application term.

Congruence relation.

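An equivalence relation $\equiv_\Sigma$ is a binary relation over $T(\Sigma, \emptyset)$ that is reflexive, symmetric, and transitive. A congruence relation $\cong_\Sigma$ is an equivalence relation satisfying:

$$\forall i \in \{1, \ldots, k\}.\; t_i \cong_\Sigma t_i' \implies f(t_1, \ldots, t_k) \cong_\Sigma f(t_1', \ldots, t_k')$$

We write $\equiv$ and $\cong$ when $\Sigma$ is clear from the context.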

E-graph.

An e-graph  is a set of e-classes, where each e-class is a set of e-nodes. Each e-node consists of a function symbol and a list of children e-class ids. Similar to terms, we call an e-node of the form an -application e-node.

Given the definitions and syntax in Figure 1, we can more formally define an e-graph as a tuple where:

  • A union-find (unionfind) data structure stores an equivalence relation (denoted with ) over e-class ids. The union-find provides a function find that canonicalizes e-class ids such that . An e-class id is canonical if .

  • The e-class map maps e-class ids to e-classes. All equivalent e-class ids map to the same e-class, i.e., iff is the same set as . An e-class id is said to refer to the e-class .

  • A function lookup that maps e-node to the id of e-class that contains it: .

No two e-nodes have the same symbol and children, i.e., an e-node’s symbol and children together uniquely identify the e-class that contains it. This property is necessary for lookup to be a function. Section 4.3 explains how this property also translates to a functional dependency in the e-graph’s relational representation, which could be leveraged to further optimize relational e-matching.
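To make the pieces of this definition concrete, here is a minimal Python sketch of an e-graph with a union-find, a `find` function, and a `lookup` function. It is an illustration only and not the egg implementation: the class name `EGraph`, the `hashcons` dictionary, the tuple encoding of e-nodes as `(symbol, child_id, ...)`, and the naive `rebuild` loop are all assumptions made for this example.

```python
class EGraph:
    """Toy e-graph: a union-find over e-class ids plus a map from e-nodes to ids."""

    def __init__(self):
        self.parent = []    # union-find parent pointers, indexed by e-class id
        self.hashcons = {}  # canonical e-node -> id of the e-class containing it

    def find(self, i):
        # Follow parent pointers to the canonical id (with path halving).
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]
            i = self.parent[i]
        return i

    def canonicalize(self, enode):
        symbol, *children = enode
        return (symbol, *(self.find(c) for c in children))

    def lookup(self, enode):
        # Symbol and children uniquely identify the containing e-class.
        return self.hashcons.get(self.canonicalize(enode))

    def add(self, symbol, *children):
        enode = self.canonicalize((symbol, *children))
        if enode in self.hashcons:
            return self.find(self.hashcons[enode])
        new_id = len(self.parent)
        self.parent.append(new_id)
        self.hashcons[enode] = new_id
        return new_id

    def union(self, a, b):
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
            self.rebuild()
        return a

    def rebuild(self):
        # Naive congruence closure: re-canonicalize every e-node and merge
        # classes whose e-nodes became identical, until nothing changes.
        while True:
            new, merged = {}, False
            for enode, cid in self.hashcons.items():
                enode, cid = self.canonicalize(enode), self.find(cid)
                other = self.find(new.setdefault(enode, cid))
                if other != cid:
                    self.parent[cid] = other
                    merged = True
            self.hashcons = new
            if not merged:
                return


g = EGraph()
a, b = g.add("a"), g.add("b")
fa, fb = g.add("f", a), g.add("f", b)
g.union(a, b)  # by congruence, f(a) and f(b) now share an e-class
```

After the `union`, `g.find(fa) == g.find(fb)`, and `g.lookup(("f", b))` returns that shared canonical id, which is the uniqueness property described above.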

An e-graph efficiently represents sets of ground terms in a congruence relation. An e-graph, e-class, or e-node is said to represent a term if can be “found” within it. Representation is defined recursively:

  • An e-graph represents a term if any of its e-classes does.

  • An e-class  represents a term if any e-node does.

  • An e-node represents a term if they have the same function symbol and e-class represents term for .
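The recursive definition of representation translates almost directly into code. The sketch below assumes a simplified encoding that is not from the paper: an e-graph is a dictionary from e-class id to a set of e-nodes, an e-node is a tuple `(symbol, child_id, ...)`, and a ground term is a nested tuple `(symbol, subterm, ...)`.

```python
def eclass_represents(egraph, cid, term):
    """True if the e-class with id `cid` represents the ground term `term`."""
    symbol, *args = term
    for node_symbol, *children in egraph[cid]:
        # An e-node represents the term if the symbols agree and every
        # child e-class represents the corresponding subterm.
        if node_symbol == symbol and len(children) == len(args):
            if all(eclass_represents(egraph, c, a) for c, a in zip(children, args)):
                return True
    return False


def egraph_represents(egraph, term):
    """True if any e-class of the e-graph represents `term`."""
    return any(eclass_represents(egraph, cid, term) for cid in egraph)


# Hypothetical cyclic e-graph: one e-class containing `a` and `f` applied to itself,
# so it represents a, f(a), f(f(a)), and so on.
cyclic = {0: {("a",), ("f", 0)}}
```

The recursion terminates even on cyclic e-graphs because each recursive call descends into a strictly smaller subterm.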

Figure 2. An example e-graph.

Figure 2 shows an e-graph representing the set of terms (where ):

In addition, all -terms are equivalent, and all -terms are equivalent. Note that the e-graph has size , yet it represents many terms. In general, an e-graph is capable of representing exponentially many terms in polynomial space. If the e-graph has cycles, it can even represent an infinite set of terms. For example, the e-graph with a single e-class represents the infinite set of terms .

E-matching.

E-matching finds the set of terms in an e-graph matching a given pattern. Specifically, e-matching finds the set of e-matching substitutions and a root class. An e-matching substitution is a function that maps every variable in a pattern to an e-class. For convenience, we use to denote the set of terms obtained by replacing every occurrence of variable in with terms represented by .

Given an e-graph and a pattern , e-matching finds the set of all possible pairs such that every term in is represented in the e-class . Terms in are said to be matched by pattern , and is said to be the root of matched terms. For example, matching the pattern against the e-graph  in Figure 2 produces the following substitutions, each with the same root :

Existing e-matching algorithms perform backtracking search directly on the e-graph (efficient-ematching; simplify; egg). Figure 3 shows an abstract backtracking-based e-matching algorithm. Most e-matching algorithms using backtracking search can be viewed as optimizations based on this abstract algorithm. Specifically, it will perform a top-down search following the shape of the pattern and prune the result set of substitutions when necessary. To match pattern against , backtracking search visits terms in the following order (each marks a backtrack step):

For each term visited, whenever the algorithm yields a match . Despite there being only matches, backtracking search runs in time .

This inefficiency is due to the fact that naïve backtracking does not use the equality constraints to prune the search space globally. Specifically, the above e-matching pattern corresponds to three constraints for a potential matching term $t$:

  1. $t$ should have function symbol $f$.

  2. $t$’s second child should have function symbol $g$.

  3. $t$’s first child should be equivalent to the child of $t$’s second child.

We can categorize these constraints into two kinds:

  • Structural constraints are derived from the structure of the pattern. The structure of the pattern constrains the root symbol and the symbol of the second child to be $f$ and $g$ respectively (i.e., constraints 1 and 2).

  • Equality constraints are implied by multiple occurrences of the same variable. Here, the two occurrences of the pattern variable imply that the terms at these positions must be equivalent to each other in every match (i.e., constraint 3). Following moskal, we define patterns without equality constraints to be linear patterns.

Backtracking search exploits the structural constraints first and defers checking the equality constraints to the end. In our example pattern, backtracking search enumerates every term that satisfies constraints 1 and 2, regardless of whether the two positions of the repeated variable hold equivalent terms, only to discard inequivalent matches later. Complex query patterns may involve many variables that occur in several places, which makes naïve backtracking search enumerate a very large number of candidates, even though the result size is small.
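The behavior described above can be reproduced with a small top-down matcher. This is a simplified sketch and not the algorithm of Figure 3 verbatim. It assumes the dictionary encoding of e-graphs used for illustration here (e-class id to set of e-node tuples, with canonical child ids), writes pattern variables as strings such as `"?a"`, and uses a hypothetical e-graph and a pattern of the shape f(?a, g(?a)), which satisfies the three constraints listed above.

```python
def match(egraph, pattern, cid, subst):
    """Yield every extension of `subst` under which `pattern` matches e-class `cid`."""
    if isinstance(pattern, str):       # pattern variable
        if pattern not in subst:
            yield {**subst, pattern: cid}
        elif subst[pattern] == cid:    # equality constraint, checked only when reached
            yield subst
        return
    symbol, *subpatterns = pattern
    for node_symbol, *children in egraph[cid]:
        if node_symbol != symbol or len(children) != len(subpatterns):
            continue                   # structural constraint
        partial = [subst]
        for sub, child in zip(subpatterns, children):
            partial = [s2 for s1 in partial for s2 in match(egraph, sub, child, s1)]
        yield from partial


def ematch(egraph, pattern):
    """Return all (root e-class id, substitution) pairs for `pattern`."""
    return [(root, s) for root in egraph for s in match(egraph, pattern, root, {})]


# Hypothetical e-graph: constants c1..cN, one e-class holding every g(ci),
# and one e-class holding every f(ci, <g-class>).
N = 3
egraph = {i: {(f"c{i}",)} for i in range(1, N + 1)}
egraph[N + 1] = {("g", i) for i in range(1, N + 1)}
egraph[N + 2] = {("f", i, N + 1) for i in range(1, N + 1)}

matches = ematch(egraph, ("f", "?a", ("g", "?a")))
```

On this e-graph the matcher returns only `N` matches, yet for each of the `N` f-nodes it scans all `N` g-nodes before the equality check on `?a` rejects most of them.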

Figure 3. A declarative backtracking-based e-matching algorithm (reproduced from efficient-ematching). The set of substitutions for pattern on e-graph  can be obtained by computing .

2.2. Conjunctive Queries

Conjunctive queries are a subset of queries in relational algebra that use only select, project, and join operators (as opposed to union, difference, or aggregation). Conjunctive queries have many desirable theoretical properties (like computable equivalence checking), and they enjoy efficient execution thanks to decades of research from the databases community.

Relational databases.

A relational schema over domain is a set of relation symbols with associated arities. A relation under a schema is a set of tuples; for each tuple , is the arity of in and is an element in . A database instance (or simply database) of is a set of relations under .

We use the notation to denote projection, i.e., .

Conjunctive queries.

A conjunctive query $Q$ over a schema is a formula of the form:

$$Q(x_1, \ldots, x_k) \leftarrow R_1(x_{1,1}, \ldots, x_{1,k_1}), \ldots, R_n(x_{n,1}, \ldots, x_{n,k_n})$$

where $R_1, \ldots, R_n$ are relation symbols in the schema with arities $k_1, \ldots, k_n$ and the $x$ are variables. [Footnote 2: Some definitions of conjunctive queries allow both variables and constants. We only allow variables without loss of generality: any constant can be specified with a distinguished relation that contains only that constant.] We call the $Q(x_1, \ldots, x_k)$ part the head of the query; the remainder is the body. Each $R_i(x_{i,1}, \ldots, x_{i,k_i})$ is called an atom. Variables that appear in the head are called free variables, and they must appear in the body. Variables that appear in the body but not the head are called bound variables, since they are implicitly existentially quantified.

Semantics of conjunctive queries.

Similar to e-matching, evaluating a conjunctive query yields substitutions. Specifically, evaluation yields substitutions that map the free variables of the query to elements in the domain such that there exists a mapping of the bound variables that causes every substituted atom to be present in the database. Bound variables are projected out and are not present in the resulting substitutions.

More formally, let be a database of schema and let be a conjunctive query over the same schema with variables in its head. Let the atoms in the body of be where has arity . Evaluating over yields a substitution iff there exists a mapping all variables in such that:

In practice, conjunctive queries are often evaluated according to a query plan, which dictates each step of execution. For example, many industrial database systems construct tree-like query plans, where each node describes an operation like scanning a relation or joining two intermediate relations. Industrial database systems typically construct query plans based on binary join algorithms, such as hash join and merge-sort join, which process two relations at a time. The quality of a query plan critically determines the performance of evaluating a conjunctive query.

We observe that conjunctive queries and e-matching are structurally similar: both are defined as finding substitutions whose instantiations are present in a database. Therefore, it is tempting to reduce e-matching to a conjunctive query over a relational database, thereby benefiting from well-studied techniques from the databases community, including join algorithms and query optimization. We achieve exactly this in Section 3.
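The semantics of conjunctive queries described in this subsection can be sketched as a naive nested-loop evaluator. The encoding is an assumption made for this example: a database is a dictionary from relation name to a set of tuples, a body is a list of `(relation_name, variables)` atoms, and the head is a tuple of free variables. The sketch makes no attempt at a good query plan.

```python
def evaluate(head, body, db):
    """Evaluate a conjunctive query by extending substitutions one atom at a time."""
    substs = [{}]
    for relation, variables in body:
        extended = []
        for s in substs:
            for tup in db[relation]:
                s2 = dict(s)
                # Bind unbound variables; reject the tuple if a bound variable disagrees.
                if all(s2.setdefault(v, val) == val for v, val in zip(variables, tup)):
                    extended.append(s2)
        substs = extended
    # Project out the bound variables: keep only the free variables in the head.
    return {tuple(s[v] for v in head) for s in substs}


db = {"R": {(1, 2), (2, 3)}, "S": {(2, 3), (3, 4)}}
# Q(x, z) <- R(x, y), S(y, z); y is a bound variable.
result = evaluate(("x", "z"), [("R", ("x", "y")), ("S", ("y", "z"))], db)
```

Here `result` is `{(1, 3), (2, 4)}`: the shared variable `y` acts as the join condition and does not appear in the output.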

2.3. Worst-Case Optimal Join Algorithms

The run time of any algorithm for answering conjunctive queries is lower-bounded by the output size, assuming the output must be materialized. How large can the output of a conjunctive query be on a particular database? A naïve bound is simply the product of the size of each relation, which is the size of their cartesian product. Such a naïve bound fails to consider the query’s structure. The AGM bound (agm) gives us a bound for the worst-case output size. In fact, the AGM bound is tight; there always exists a database where the query output size is the AGM bound.

The AGM bound and worst-case optimal joins are recent developments in databases research. We do not attempt to provide a comprehensive background on these topics here; familiarity with the existence of the AGM bound and the generic join algorithm is sufficient for this paper.

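Consider $Q_\triangle(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$, also known as the “triangle query”, since output tuples are triangles between the edge relations $R$, $S$, $T$. We calculate a trivial bound $|Q_\triangle| \le |R| \times |S| \times |T|$. If $|R| = |S| = |T| = N$, then $|Q_\triangle| \le N^3$. We can derive a tighter bound from $Q'(x, y, z) \leftarrow R(x, y), S(y, z)$: $|Q_\triangle| \le |Q'| \le |R| \times |S| = N^2$. That is because $Q_\triangle$ contains fewer tuples than the query $Q'$, as $Q_\triangle$ further requires $(z, x) \in T$. The AGM bound for $Q_\triangle$ is even smaller: $N^{3/2}$. It is computed from the fractional edge cover of the query hypergraph.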

Query Hypergraph

triangle

Figure 4. Query hypergraph of .

The hypergraph of a query is simply the hypergraph with a vertex for each variable and a (hyper)edge for each atom. The edge for an atom connects the vertices corresponding to the variables of that atom. Figure 4 illustrates the hypergraph of the triangle query.

Cyclic and Acyclic Queries

Certain queries can be represented by a tree, called the join tree, where each node corresponds to an atom in the query. Furthermore, for each variable, the nodes corresponding to the atoms that contain it must form a connected component. Queries that admit such a join tree are said to be acyclic; otherwise, the query is cyclic. [Footnote 3: A cycle in the hypergraph does not necessarily entail a cyclic query, since the hypergraph may still admit a join tree.] The triangle query is cyclic because it cannot be represented by a join tree. Acyclic queries can be answered more efficiently than cyclic ones.

Fractional Edge Cover

A set of edges covers a hypergraph if the edges touch all vertices. For the triangle query's hypergraph, any two edges form a cover. A fractional edge cover assigns a weight in the interval $[0, 1]$ to each edge such that, for each vertex $v$, the weights of the edges containing $v$ sum to at least 1. Every edge cover is a fractional cover, where every edge is assigned a weight of 1 if it is in the cover, and 0 otherwise. For the triangle query's hypergraph, assigning weight $1/2$ to every edge is the fractional edge cover with lowest total weight.

The AGM Bound

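The AGM bound (agm) for a query $Q$ with body atoms $R_1, \ldots, R_n$ is defined as $\min_{w} \prod_{j=1}^{n} |R_j|^{w_j}$, where $w = (w_1, \ldots, w_n)$ forms a fractional edge cover. For example, the AGM bound for the triangle query is $N^{3/2}$ when each of its relations has $N$ tuples. This is the upper bound of the query's output size; i.e., in the worst case the query outputs this many tuples.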
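As a worked instance, write the triangle query as $Q_\triangle(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$ (the relation names are assumed here for illustration) and let each relation have $N$ tuples. Weight $1/2$ on every edge is a fractional edge cover because each variable occurs in exactly two atoms, and it gives a smaller bound than the integral cover that picks two edges:

```latex
\begin{align*}
\text{cover check:}\quad
  & x:\; w_R + w_T = \tfrac12 + \tfrac12 = 1, \qquad
    y:\; w_R + w_S = 1, \qquad
    z:\; w_S + w_T = 1 \\[4pt]
\text{fractional cover } (\tfrac12, \tfrac12, \tfrac12):\quad
  & |R|^{1/2}\,|S|^{1/2}\,|T|^{1/2} = N^{1/2} \cdot N^{1/2} \cdot N^{1/2} = N^{3/2} \\[4pt]
\text{integral cover } (1, 1, 0):\quad
  & |R|^{1}\,|S|^{1}\,|T|^{0} = N^{2}
\end{align*}
```

Since $N^{3/2} \le N^2$ for $N \ge 1$, the fractional cover yields the tighter bound on the output size.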

Generic Join

A desirable algorithm for answering conjunctive queries should run in time linear in the worst-case output size. Recent developments in the databases community have led to such algorithms (wcoj), one of which is generic join (gj). Generic join has one parameter: an ordering of the variables in the query. Any ordering guarantees a run time linear in the worst-case output size, but different orderings can lead to dramatically different run times in practice (emptyheaded).

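Result: $\mathrm{GJ}(Q, \sigma)$ computes the output of query $Q$
Input: query $Q(\mathbf{x}) \leftarrow R_1(\mathbf{x}_1), \ldots, R_n(\mathbf{x}_n)$, partial substitution $\sigma$
/* $|\mathbf{x}|$ indicates how many variables remain in query $Q$ */
1 if $|\mathbf{x}| = 0$ then /* there are no more variables, so $\sigma$ is complete */
2       output $\sigma$
3 else
4       choose a variable $x \in \mathbf{x}$;
       /* Compute $D_x$, which is all possible values of $x$, by intersecting the attributes of the relations where $x$ occurs. Intersection must be computed in $O(\min_j |R_j.x|)$ time. */
5       $J \leftarrow \{ j \mid x \in \mathbf{x}_j \}$;
6       $D_x \leftarrow \bigcap_{j \in J} R_j.x$;
7       for $v_x \in D_x$ do
             /* compute residual query by replacing variable $x$ with constant $v_x$ */
8             $Q' \leftarrow Q[v_x / x]$;
9             $\mathrm{GJ}(Q', \sigma \cup \{x \mapsto v_x\})$
10       end for
11 end if
Algorithm 1 Generic join for the general query $Q(\mathbf{x}) \leftarrow R_1(\mathbf{x}_1), \ldots, R_n(\mathbf{x}_n)$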

Algorithm 1 shows the generic join algorithm. Generic join is recursive, proceeding in two steps. First, it chooses a variable from the query and collects all possible values for that variable in the query. Then, for each of those values, it builds a residual query by replacing occurrences of the variable in the query with a possible value for that variable. These residual queries are solved recursively, and when there are no more variables in the residual query, the algorithm yields the substitution it has accumulated so far.
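The two recursive steps described above can be written as a short Python generator. This is a sketch for illustration, not the paper's implementation. It assumes atoms are encoded as `(variables, tuples)` pairs, that no variable repeats within a single atom, and that every variable in `order` occurs in at least one atom. It recomputes columns and residual relations by scanning, so it shows the control flow of generic join but does not meet its run-time bounds.

```python
def generic_join(atoms, order, subst=None):
    """Yield one substitution (dict) per output tuple of the query."""
    subst = {} if subst is None else subst
    if not order:                      # no variables left: the substitution is complete
        yield subst
        return
    x, rest = order[0], order[1:]
    # Candidate values for x: intersect column x of every atom that mentions x.
    columns = [{t[vs.index(x)] for t in tuples} for vs, tuples in atoms if x in vs]
    for value in set.intersection(*columns):
        # Residual query: restrict each atom mentioning x to tuples with x = value.
        residual = []
        for vs, tuples in atoms:
            if x in vs:
                i = vs.index(x)
                tuples = [t for t in tuples if t[i] == value]
            residual.append((vs, tuples))
        yield from generic_join(residual, rest, {**subst, x: value})


# Triangle query Q(x, y, z) <- R(x, y), S(y, z), T(z, x) on a small database.
R, S, T = {(1, 2), (1, 3)}, {(2, 3)}, {(3, 1)}
atoms = [(("x", "y"), R), (("y", "z"), S), (("z", "x"), T)]
triangles = list(generic_join(atoms, ["x", "y", "z"]))
```

There is no separate filtering step: a value is only chosen for a variable if every atom that mentions the variable still has a compatible tuple.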

Generic join requires two important performance bounds to be met in order for its own run time to meet the AGM bound. First, the intersection that computes the candidate values in Algorithm 1 must run in $O(\min_j |R_j.x|)$ time, that is, in time proportional to the smallest of the sets being intersected. Second, the residual relations should be computed in constant time, i.e., computing the residual relation $R(v_x, y)$ from the relation $R(x, y)$ for some value $v_x$ must take constant time. Both requirements can be satisfied by using tries (sometimes called prefix or suffix trees) as an indexing data structure. Figure 5 shows an example trie. Tries allow fast intersection because each node is a map, which can be intersected in time linear in the size of the smaller map. Tries also allow constant-time access to residual relations according to a compatible variable ordering.

x y z
1 2 4
1 2 6
1 3 7
8 2 4
(a) Table for relation .
(b) Trie for relation using ordering .
Figure 5. A trie is a tree where every node is a map (typically a hashmap or sorted map) from a value to a trie. Every path from the root of a trie to a leaf represents a tuple in the relation. Tries allow efficient computation of residual relations. For example, can be computed quickly by following the edge from the root.
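The table in Figure 5(a) can be indexed as a trie of nested dictionaries in a few lines of Python. The function names `build_trie` and `intersect` are hypothetical helpers introduced for this sketch, and the trie assumes the column order of the table as its variable ordering.

```python
def build_trie(tuples):
    """Index tuples as nested dicts; each path from the root to a leaf is one tuple."""
    trie = {}
    for tup in tuples:
        node = trie
        for value in tup:
            node = node.setdefault(value, {})
    return trie


def intersect(a, b):
    """Keys common to two trie nodes, iterating only over the smaller node."""
    if len(a) > len(b):
        a, b = b, a
    return [key for key in a if key in b]


rows = [(1, 2, 4), (1, 2, 6), (1, 3, 7), (8, 2, 4)]
trie = build_trie(rows)

# Residual relation with the first column fixed to 1: a single dictionary lookup.
residual = trie[1]
```

Here `residual` is `{2: {4: {}, 6: {}}, 3: {7: {}}}`, i.e. the tuples `(2, 4)`, `(2, 6)`, and `(3, 7)` that remain once the first column is fixed to 1.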

A useful way to understand generic join is to observe the loops and intersections it performs on a specific query. Algorithm 2 shows generic join computing the triangle query. Given a variable ordering ($[x, y, z]$ in this case), generic join assumes the input relations are stored in tries according to the ordering, so $R(x, y)$ is stored in a trie with $x$s on the first level and $y$s on the second. This makes accessing the residual relations (e.g., $R(v_x, y)$) fast, since the replacement of variables with values is done according to the given variable ordering. Note how the algorithm is essentially just nested for loops. There is no explicit filtering step; the intersection of residual queries guarantees that once a complete tuple of values is selected, it can be immediately output without additional checking.

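Result: compute $Q_\triangle(x, y, z) \leftarrow R(x, y), S(y, z), T(z, x)$
$X \leftarrow R(x, y).x \cap T(z, x).x$;
for $v_x \in X$ do /* compute the residual query with $x = v_x$ */
      $Y \leftarrow R(v_x, y).y \cap S(y, z).y$;
      for $v_y \in Y$ do /* compute the residual query with $x = v_x$ and $y = v_y$ */
            $Z \leftarrow S(v_y, z).z \cap T(z, v_x).z$;
            for $v_z \in Z$ do /* yield join results */
                  output $(v_x, v_y, v_z)$
            end for
      end for
end for
Algorithm 2 Generic join for the triangle query, with ordering $[x, y, z]$.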
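The nested-loop structure of generic join on the triangle query can be sketched in Python with two-level indexes standing in for tries. The query shape Q(x, y, z) <- R(x, y), S(y, z), T(z, x), the ordering x, y, z, and the helper name `index_pairs` are assumptions of this example.

```python
def index_pairs(pairs, first, second):
    """Two-level index: {pair[first]: {pair[second], ...}}."""
    index = {}
    for pair in pairs:
        index.setdefault(pair[first], set()).add(pair[second])
    return index


def triangle_query(R, S, T):
    r = index_pairs(R, 0, 1)  # R(x, y): x on the first level, y on the second
    s = index_pairs(S, 0, 1)  # S(y, z): y on the first level, z on the second
    t = index_pairs(T, 1, 0)  # T(z, x): x first, then z, to follow the ordering x, y, z
    out = []
    for x in r.keys() & t.keys():        # values of x present in both R and T
        for y in s.keys() & r[x]:        # values of y given x
            for z in s[y] & t[x]:        # values of z given x and y
                out.append((x, y, z))    # no filtering needed at this point
    return out


result = triangle_query({(1, 2), (1, 3)}, {(2, 3)}, {(3, 1)})
```

Each loop iterates over an intersection, so every tuple reaching the innermost body is already a valid output; here `result` is `[(1, 2, 3)]`.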

3. Relational E-Matching

E-matching via backtracking search is inefficient because it handles equality constraints suboptimally. Backtracking search follows edges in the e-graph and only visits concrete terms that satisfy the structural constraints. However, equality constraints are checked a posteriori, only after the search visits a (partial) term. [Footnote 4: Backtracking e-matching can perform the check as soon as it has traversed enough of the pattern to encounter a variable more than once.] Whenever there are many terms that satisfy the structural constraints but not the equality constraints, as in our example pattern, backtracking will waste time visiting terms that do not yield a match.

By reducing e-matching to evaluating conjunctive queries, we can use join algorithms that take advantage of both structural and equality constraints. Figure 6 conveys this intuition using the pattern and the example e-graph and database from Figure 7. The backtracking approach considers every possible assignment to the variables, even those where the two occurrences of do not agree.

We can instead formulate a conjunctive query that is equivalent to the following pattern:

Later subsections will detail how this conversion is done, but note how the auxiliary variable captures the structural constraint from the pattern. Evaluating with a simple hash join strategy exemplifies the benefits of the relational approach: it considers structural and equality constraints (in this case by doing a hash join keyed on ); indeed, the relational perspective sees no difference between the two kinds of constraints.

(a) Backtracking takes time

(b) Hash join takes time .
Figure 6. E-matching with backtracking search and a simple hash join on the e-graph/database in Figure 7.

This observation leads us to a very simple algorithm for relational e-matching, shown in Algorithm 3. Relational e-matching takes an e-graph and a set of patterns ps. It first transforms the e-graph to a relational database . Then, it reduces every pattern to a conjunctive query . Finally, it evaluates the conjunctive queries over . These intermediate steps will be detailed in the following subsections.

Input: An e-graph  and a list of e-matching patterns ps
Output: The result of running ps on
1 ;
2 ;
return
Algorithm 3 RelationalEMatching

3.1. From the E-Graph to a Relational Database

(a) An example e-graph, reproduced from Figure 2.
id
1
2
(b) Relation of .
id
1
2
(c) Relation of .
Figure 7. An e-graph and its relational representation. Each e-class (dotted box) is labeled with its id.

The first step of relational e-matching is to transform the e-graph into a relational database . The domain of the database is e-class ids, and its schema is determined by the function symbols in . Every e-node with symbol in the e-graph corresponds to a tuple in the relation in the database. If has arity , then will have arity ; its first attribute is the e-class id that contains the corresponding e-node, and the remaining attributes are the children of the e-node. Figure 7 shows an example e-graph and part of its corresponding database. In particular, only the relations of function symbols and are presented in this figure. There are other relations; each relation represents a constant and has exactly one tuple (i.e., singleton ).

We construct the database by simply looping over every e-node in every e-class in and making a tuple in the corresponding relation:

Note that the tuples in the database contain only canonical e-class ids returned from the find function.
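The construction described in this subsection is a single pass over the e-nodes. The sketch below assumes the dictionary encoding used for illustration here (e-class id to e-node tuples) and takes the canonicalizing function as a parameter named `find`; keying relations by symbol alone further assumes that each function symbol has a single arity.

```python
from collections import defaultdict


def egraph_to_database(eclasses, find=lambda i: i):
    """Build one relation per function symbol.

    Each e-node f(c1, ..., ck) in e-class i becomes the tuple
    (find(i), find(c1), ..., find(ck)) in the relation for f.
    """
    db = defaultdict(set)
    for cid, enodes in eclasses.items():
        for symbol, *children in enodes:
            db[symbol].add((find(cid), *(find(c) for c in children)))
    return dict(db)


# Hypothetical e-graph: class 1 = {a}, class 2 = {g(1)}, class 3 = {f(1, 2)}.
eclasses = {1: {("a",)}, 2: {("g", 1)}, 3: {("f", 1, 2)}}
db = egraph_to_database(eclasses)
```

A relation for a symbol of arity k has arity k + 1, and a constant yields a relation with a single unary tuple, as in `db["a"] == {(1,)}`.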

Our presentation in this paper specifically targets e-matching use cases like equality saturation, where the building of the database can be amortized. In this setting, e-matching is done in large batches, and expensive work like congruence closure can be amortized between these batches using a technique called “rebuilding” (egg). The time complexity of building this database is always linear in the size of the e-graph, which is subsumed by the time complexity of most non-trivial e-matching patterns. In Section 6.3, we discuss how this technique could be generalized to the non-amortized setting of frequently updated e-graphs as future work.

3.2. From Patterns to Conjunctive Queries

 where are variables in
 and
Figure 8. Compiling a pattern to a conjunctive query.

Once we have a database that corresponds to the e-graph, we must convert each pattern we wish to e-match to a conjunctive query. We use the algorithm in Figure 8 to “unnest” a pattern to a conjunctive query by connecting nested patterns with auxiliary variables.

The Aux function returns a variable and a list of conjunctive query atoms. In particular, for a non-variable pattern $f(p_1, \ldots, p_k)$, Aux produces a fresh variable $v$ and the concatenation of the atom $R_f(v, v_1, \ldots, v_k)$ with the atoms $A_1, \ldots, A_k$, where $(v_i, A_i)$ is the result of calling $\mathrm{Aux}(p_i)$. For a variable pattern $x$, Aux simply returns $x$ and an empty list. Note that the auxiliary variables introduced by Aux are not included in the head of the query, and thus are not part of the output.

Given a pattern $p$, the Compile function returns a conjunctive query whose body atoms come from $\mathrm{Aux}(p)$ and whose head atom consists of the root variable and the variables in $p$. The compiled conjunctive query and the original e-matching query are equivalent because there is a one-to-one correspondence between their outputs. Specifically, each e-matching output, a root e-class paired with a substitution, corresponds to a query output tuple holding the root followed by the e-class assigned to each pattern variable. The only difference is that returning the root e-class id is a special consideration for e-matching, but it is just another variable in the conjunctive query.

The Compile function (specifically the Aux subroutine) relies on the fact that the database contains only canonical e-class ids. Without this fact, nested patterns would require an additional join on the equivalence relation over e-class ids. But since two canonical e-class ids are equivalent exactly when they are equal, we can omit the additional join and instead join nested patterns directly on the auxiliary variable.
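The unnesting performed by Compile and Aux can be sketched as follows. The encoding is an assumption of this example rather than the paper's: patterns are nested tuples `(symbol, subpattern, ...)`, pattern variables are strings such as `"?a"`, fresh auxiliary variables are named `v0`, `v1`, and so on, and an atom is a `(symbol, variables)` pair standing for the relation of that symbol.

```python
import itertools


def compile_pattern(pattern):
    """Return (head, body) of the conjunctive query for `pattern`.

    head: [root variable, pattern variables...]
    body: list of atoms (symbol, (class variable, child variables...))
    """
    counter = itertools.count()
    atoms, pattern_vars = [], []

    def aux(p):
        if isinstance(p, str):                 # pattern variable: contributes no atom
            if p not in pattern_vars:
                pattern_vars.append(p)
            return p
        symbol, *subpatterns = p
        v = f"v{next(counter)}"                # fresh auxiliary variable for this node
        atom = [symbol, v]
        atoms.append(atom)                     # parent atom comes before its children's
        atom.extend(aux(sub) for sub in subpatterns)
        return v

    root = aux(pattern)
    head = [root] + [v for v in pattern_vars if v != root]
    return head, [(atom[0], tuple(atom[1:])) for atom in atoms]


head, body = compile_pattern(("f", "?a", ("g", "?a")))
```

For this pattern, `head` is `['v0', '?a']` and `body` is `[('f', ('v0', '?a', 'v1')), ('g', ('v1', '?a'))]`: the auxiliary variable `v1` links the nested pattern to its parent, and the repeated `?a` expresses the equality constraint as an ordinary shared variable.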

Using this algorithm, the example pattern is compiled to the following conjunctive query:

(1)

Compared to the original e-matching pattern, this flattened representation enables relational e-matching to utilize both the structural and the equality constraints. For example, a reasonable query plan that database optimizers will synthesize is a hash join on both join variables (i.e., and ), which takes time. In contrast, backtracking-based e-matching takes time.

Figure 6 shows the traces for running a direct backtracking search on the e-graph and running hash join on the relational representation. Every term enumerated by hash join will simultaneously satisfy all the constraints. Conceptually, backtracking-based e-matching can be seen as a hash join that only builds and looks up on a single variable (the auxiliary variable) and then filters the outputs using the equality predicate on the repeated pattern variable. In other words, existing e-matching algorithms will consider all candidate terms regardless of whether the two positions of the repeated variable are congruent, while the generated conjunctive query gives the query optimizer the freedom to synthesize query plans that consider only tuples where they agree.
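The contrast can be made concrete with a hash join keyed on both join variables. The sketch assumes that the example pattern compiles to a query of the shape Q(root, a) <- R_f(root, a, x), R_g(x, a), and it uses a hypothetical database in which one e-class (id 4) holds three g-nodes and one e-class (id 5) holds three f-nodes.

```python
def hash_join_f_g(r_f, r_g):
    """Answer Q(root, a) <- R_f(root, a, x), R_g(x, a) with one hash join."""
    # Build phase: index R_g on the composite key (x, a).
    index = {}
    for x, a in r_g:
        index.setdefault((x, a), []).append((x, a))
    # Probe phase: one lookup per R_f tuple, keyed on both join variables.
    out = []
    for root, a, x in r_f:
        for _ in index.get((x, a), []):
            out.append((root, a))
    return out


r_g = [(4, 1), (4, 2), (4, 3)]            # g(1), g(2), g(3), all in e-class 4
r_f = [(5, 1, 4), (5, 2, 4), (5, 3, 4)]   # f(1, 4), f(2, 4), f(3, 4), all in e-class 5
answers = hash_join_f_g(r_f, r_g)
```

The join performs one probe per tuple of `r_f`, whereas a search keyed only on `x` would pair every f-tuple with every g-tuple in e-class 4 before filtering on `a`.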

3.3. Answering CQs with Generic Join

Finally, we consider the problem of efficiently solving the compiled conjunctive queries. We propose to use the generic join algorithm to solve the generated conjunctive queries. Although traditional query plans, which are based on two-way joins such as hash joins and merge-sort joins, are extensively used in industrial relational database engines, they may suffer on certain queries compiled from patterns. For example, consider the pattern . The compiled conjunctive query is:

(2)

Like the classic triangle query, this is a cyclic conjunctive query (Section 2.3). We call e-matching patterns that generate cyclic conjunctive queries cyclic patterns. For such cyclic queries, wcoj show there exist databases on which any two-way join plan is suboptimal. In contrast, generic join is guaranteed to run in time linear to the worst case output size. Moreover, generic join can have comparable performance on acyclic queries with two-way join plans. These properties make generic join our ideal solver for conjunctive queries generated from e-matching patterns.

Using the generic join algorithm, suppose we fix the variable ordering to be on the generated conjunctive query 1. The algorithm below shows generic join instantiated on this particular CQ:

Result: compute
// compute all possible values of
1 ;
2 for  do
       // compute all possible values of given
3       ;
4       for  do
             // compute all possible values of given and
5             ;
6             for  do
7                   output
8             end for
9            
10       end for
11      
12 end for
Algorithm 4 Relational e-matching using GJ for , with ordering .

3.4. Complexity of Relational E-matching

Generic join guarantees worst-case optimality with respect to the output size, and relational e-matching preserves this optimality. In particular, we have the following theorem:

Theorem 1 ().

Relational e-matching is worst-case optimal; that is, fix a pattern , let be the set of substitutions yielded by e-matching on an e-graph with e-nodes, relational e-matching runs in time .

Proof.

Notice that there is a one-to-one correspondence between the output tuples of the generated conjunctive query and the matches of the e-matching pattern. Therefore, the worst-case bound is the same for an e-matching pattern and for the conjunctive query generated from it. Because generic join is worst-case optimal, relational e-matching also runs in worst-case optimal time with respect to the output size. ∎

The structure of e-matching patterns allows us to derive an additional bound dependent on the actual output size rather than the worst-case output size.

Theorem 2 ().

Fix an e-graph with e-nodes that compiles to a database , and fix a pattern that compiles to a conjunctive query . Relational e-matching on runs in time .

Proof.

Let be the set of isolated variables, those that occur in only one atom. Note that , since is precisely the pattern variables and the root, and auxiliary variables must occur in at least two atoms. Using these, define two new queries:

Since , is the same query as but with zero or more variables projected out. Therefore, every tuple in corresponds to one in the output , so and .

Now we can compute the AGM bound for . Our new atom includes all those variables that only appear in one atom of . Therefore, every variable in occurs in at least two atoms, so assigning a weight of 1/2 to each edge is a fractional edge cover. Thus:

 since
 since

Let denote the running time of generic join with query on database . We know that . Because , we also know , and we can use to bound . Now we show that .

The query is just with an additional atom that covers the variables that only appeared in one atom from . Fix a variable ordering for generic join that puts those variables in at the end. So loops of both GJ instantiations are the same, except that, in , each loop corresponding to a variable in performs an intersection with , but not in . But these intersections are in the innermost loops, at which point all intersections with atoms from have already been done. So the intersections with do nothing, since is precisely projected down to the variables in ! Since those intersections are not helpful and simply does not do them, .

Putting the inequalities together, we get:

. ∎

Example 3 (Complexity of relational e-matching).

Consider the pattern , which compiles to the query . Following the proof, we define and . The AGM bound for is . This also bounds the run time of generic join on .

The above bound is tight for linear patterns, in which case each variable occurs exactly twice in . In the case of nonlinear patterns, we may find tighter covers than assigning 1/2 to each atom, thereby improving the bound.
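To make the edge-cover argument concrete, the following Python sketch checks a fractional edge cover and evaluates the resulting AGM bound. It is an illustration under assumptions: the linear pattern f(g(a), h(b)), its query atoms, and the sizes are hypothetical, and the fourth atom stands for the extra atom over the isolated variables, whose size is taken to be the output size.

```python
import math
from typing import List, Sequence


def is_fractional_edge_cover(atoms: List[Sequence[str]], weights: List[float]) -> bool:
    """True if, for every variable, the weights of the atoms containing it sum to >= 1."""
    variables = {v for atom in atoms for v in atom}
    return all(
        sum(w for atom, w in zip(atoms, weights) if v in atom) >= 1
        for v in variables
    )


def agm_bound(atoms: List[Sequence[str]], sizes: List[int], weights: List[float]) -> float:
    """AGM bound: the product of |R_i| ** w_i for a valid fractional edge cover."""
    if not is_fractional_edge_cover(atoms, weights):
        raise ValueError("weights do not cover every variable")
    return math.prod(size ** w for size, w in zip(sizes, weights))


# Assumed query: R_f(root, x, y), R_g(x, a), R_h(y, b), plus an extra atom
# over the isolated variables (root, a, b) whose size is the output size.
n, out = 100, 16
atoms = [("root", "x", "y"), ("x", "a"), ("y", "b"), ("root", "a", "b")]
bound = agm_bound(atoms, [n, n, n, out], [0.5] * 4)
```

With three relations of size 100 and 16 output tuples, the cover that assigns 1/2 to each atom gives a bound of sqrt(100^3 × 16) = 4000. Without the extra atom, the same weights are not a valid cover, because the isolated variables are covered only once.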

3.5. Supporting Multi-patterns

Multi-patterns are an extension to e-matching used in both SMT solvers (efficient-ematching) and program optimization (tensat). A multi-pattern is a list of patterns of the form that are to be matched simultaneously (i.e., the instantiation of each contained pattern should use the same substitution ). For example, e-matching the multi-pattern searches for pairs of -applications whose first arguments are equivalent. Efficient support for multi-patterns on top of backtracking search requires complicated additions to state-of-the-art e-matching algorithms (efficient-ematching). Relational e-matching supports multi-patterns “for free”: a multi-pattern is compiled to a single conjunctive query just like a single pattern. For example, the conjunctive query for the multi-pattern is

(3)

This is one example that shows the wide applicability of the relational model adopted in relational e-matching.
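The compilation of a multi-pattern into one conjunctive query can be sketched as follows. This is a hedged illustration, not egg's code: patterns are represented as nested tuples, pattern variables as strings starting with `?`, and each atom is assumed to have the form (relation name, e-class variable, child variables...). The function names and the `v0`, `v1` naming of fresh variables are hypothetical.

```python
import itertools
from typing import Iterator, List, Tuple, Union

Pattern = Union[str, tuple]  # "?a" is a pattern variable; ("f", p1, p2) is an application


def compile_pattern(pattern: Pattern, fresh: Iterator[int], atoms: List[tuple]) -> str:
    """Return the variable standing for `pattern`, appending its atoms to `atoms`."""
    if isinstance(pattern, str):
        return pattern  # pattern variables are shared across the whole query
    symbol, *children = pattern
    child_vars = [compile_pattern(child, fresh, atoms) for child in children]
    out = f"v{next(fresh)}"  # fresh variable for the e-class of this sub-pattern
    atoms.append((f"R_{symbol}", out, *child_vars))
    return out


def compile_multi_pattern(patterns: List[Pattern]) -> Tuple[List[str], List[tuple]]:
    """Compile all patterns into one list of atoms that share pattern variables."""
    fresh = itertools.count()
    atoms: List[tuple] = []
    roots = [compile_pattern(p, fresh, atoms) for p in patterns]
    return roots, atoms


# Two f-applications that share their first argument.
roots, atoms = compile_multi_pattern([("f", "?a", "?b"), ("f", "?a", "?c")])
```

Because both patterns mention `?a`, the two resulting atoms share that variable, so the join enforces the common substitution without any special multi-pattern machinery.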

4. Optimizations

Our implementation of relational e-matching using generic join is simple (under 500 lines), but it still admits several optimizations that are important for practical performance.

4.1. Degenerate Patterns

Not all patterns correspond to conjunctive queries that involve relational joins. Non-nested patterns (whether linear or non-linear) will produce relational queries without any joins:

The corresponding query plan is simply a scan of a relation with possible filtering. For these queries, generic join (or any other join plan) offers no benefit, and building the indices for generic join incurs unnecessary overhead. A relational e-matching implementation (or any other kind) should have a “fast path” for these relatively common types of queries that simply scans the e-graph/database for matching e-nodes/tuples. For this reason, we exclude these kinds of patterns from our evaluation in Section 5.
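Such a fast path can be sketched as a single scan with an equality filter. The sketch below assumes a hypothetical relation whose rows have the form (e-class, child 1, ..., child k); the function name and data are illustrative.

```python
from typing import Dict, Iterable, Iterator, Sequence, Tuple


def scan_non_nested(
    rows: Iterable[Tuple[int, ...]], child_vars: Sequence[str]
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Match a non-nested pattern by scanning one relation.

    `child_vars[i]` names the pattern variable in argument position i; a
    repeated name (a non-linear pattern) requires equal values.
    """
    for eclass, *children in rows:
        subst: Dict[str, int] = {}
        for var, value in zip(child_vars, children):
            # setdefault binds `var` on first sight and returns the existing
            # binding otherwise, so a mismatch means the row is filtered out.
            if subst.setdefault(var, value) != value:
                break
        else:
            yield eclass, subst


r_f = [(1, 5, 5), (2, 5, 6)]  # made-up f-application e-nodes
linear = list(scan_non_nested(r_f, ["?a", "?b"]))
nonlinear = list(scan_non_nested(r_f, ["?a", "?a"]))
```

The linear pattern matches both rows, while the non-linear pattern keeps only the row whose two children are equal. No index is built in either case.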

4.2. Variable Ordering

Different variable orderings may result in dramatically different performance for generic join (eval-wcoj; emptyheaded), so choosing a good variable ordering is important. Compared to join plans for binary joins, query plans for generic join are much less studied. In relational e-matching we choose a variable ordering using two simple heuristics. First, we prioritize variables that occur in many relations, because the intersection of many relations is likely to be smaller. Second, we prioritize variables that occur in small relations, because intersections involving a small relation are also likely to be smaller. Performing smaller intersections first prunes the search space early.

Using these two heuristics, the optimizer is able to find more efficient query plans than the top-down search of backtracking-based e-matching. This is true even for linear patterns, where relational e-matching has no more information than backtracking e-matching but does have more flexibility. Consider the linear pattern compiled to the query . When there are very few -application e-nodes in the e-graph, will be small. The variable ordering takes advantage of this by intersecting first, resulting in an intersection no larger than . This “bottom-up” traversal is not possible in conventional e-matching.
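One way to combine the two heuristics is a lexicographic sort, sketched below. How the heuristics are weighted against each other is an assumption of this sketch (occurrence count first, then smallest relation size, then name as a tie-breaker), and the query and sizes are hypothetical.

```python
from typing import Dict, List, Sequence


def order_variables(atoms: List[Sequence[str]], sizes: List[int]) -> List[str]:
    """Order variables by (most occurrences, smallest containing relation)."""
    occurrences: Dict[str, int] = {}
    smallest: Dict[str, int] = {}
    for variables, size in zip(atoms, sizes):
        for v in set(variables):
            occurrences[v] = occurrences.get(v, 0) + 1
            smallest[v] = min(smallest.get(v, size), size)
    return sorted(occurrences, key=lambda v: (-occurrences[v], smallest[v], v))


# Assumed chain query R_f(root, x), R_g(x, y), R_h(y, a) in which R_h is tiny.
atoms = [("root", "x"), ("x", "y"), ("y", "a")]
order = order_variables(atoms, sizes=[1000, 1000, 3])
```

Here the ordering starts with `y`, the join variable that touches the small relation, so the first intersection has at most 3 values. This is the kind of bottom-up plan that a top-down traversal from the root cannot choose.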

4.3. Functional Dependencies

Functional dependencies describe the dependencies between columns. For example, a functional dependency on relation of the form indicates that for each tuple of , the values of , , and uniquely determine the value . Functional dependencies are ubiquitous in relational e-matching. In fact, every transformed schema of the form has a functional dependency from to e-class. When the variable graph formed by functional dependencies is acyclic (cyclicity of functional dependencies is unrelated to cyclicity of the query), we can speed up generic join by ordering the variables to follow the topological order of the functional dependency graph. Every conjunctive query compiled from an e-matching pattern has acyclic functional dependencies, because each dependency goes from the e-node’s children to the e-node’s parent e-class. Relational e-matching can therefore always choose a variable ordering that is consistent with the functional dependencies. Our implementation tries to respect functional dependencies, but prioritizes larger intersections more.

As an example, consider conjunctive query 2 again. It is compiled from the pattern and, assuming each relation has size , has an AGM bound of . Suppose however that we pick the variable ordering to be . For every possible value of chosen, there will be at most one possible value for and root by functional dependency, which can be immediately determined. This reduces the run time from to .

4.4. Batching

Generic join always processes one variable at a time, even if multiple consecutive variables are from the same atom. We find this strategy to be inefficient in practice, as it results in deeper recursion that does little useful work.

Consider the query . The right variable ordering places at the front, since it is the only intersection. We observe that variables that only appear in one atom can be “batched” with others that only appear in the same atom. Batched variables are treated as a single variable in the trie and intersections. So instead of variable ordering , we can use . This lowers the recursion depth of generic join (from 5 to 3) and improves data locality by reducing pointer-chasing.
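The grouping step can be sketched as follows. This is a minimal illustration under assumptions: the query R_f(root, a, x), R_g(x, b, c) and its ordering are hypothetical, and only consecutive variables that occur in exactly one atom, and in the same atom, are merged.

```python
from typing import List, Optional, Sequence


def batch_variables(atoms: List[Sequence[str]], order: List[str]) -> List[List[str]]:
    """Group consecutive variables that occur only in one and the same atom."""
    owners = {v: [i for i, atom in enumerate(atoms) if v in atom] for v in order}
    batches: List[List[str]] = []
    last_owner: Optional[int] = None  # atom owning the current batch, if any
    for v in order:
        only = owners[v][0] if len(owners[v]) == 1 else None
        if only is not None and only == last_owner:
            batches[-1].append(v)  # same single atom as the previous variable
        else:
            batches.append([v])
        last_owner = only
    return batches


atoms = [("root", "a", "x"), ("x", "b", "c")]
batches = batch_variables(atoms, ["x", "root", "a", "b", "c"])
```

The five variables collapse into three batches, `[x]`, `[root, a]`, and `[b, c]`, so the recursion depth drops from 5 to 3.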

5. Evaluation

To empirically evaluate relational e-matching, we implemented it inside the egg equality saturation toolkit (egg). Our implementation consists of about 80 lines of Rust inside egg itself to convert patterns into conjunctive queries, paired with a separate, e-graph-agnostic Rust library that implements generic join in fewer than lines.

egg’s existing e-matching infrastructure is also about lines of Rust, and it is interconnected with various other parts of egg. Qualitatively, we claim that the relational approach is simpler to implement, especially since the CQ solver is completely modular. We could plug in a different generic join implementation (although no reusable generic join implementation existed at the time of writing), or even a more conventional binary join implementation.

In this section, we refer to egg’s existing e-matching implementation as “EM” and our relational approach as “GJ.”

5.1. Benchmarking setup

We use egg’s two largest benchmark suites as the basis for our two benchmark suites. The math suite implements a simple computer algebra system, including limited support for symbolic differentiation and integration. The lambda suite implements a partial evaluator for the lambda calculus. Each egg benchmark suite provides a set of rewrite rules, each with a left and right side pattern, and a set of starting terms.

To construct the e-graphs used in our benchmarks, we ran equality saturation on a set of terms selected from egg’s test suite, stopping before the e-graph reached 1e5, 1e6, 2e6, and 3e6 e-nodes. The result is four increasingly large e-graphs for each benchmark suite, filled with terms generated by the suite’s rewrite rules. For each benchmark suite and each of the four e-graph sizes, we then ran e-matching on the e-graph using both EM and GJ. We ran each approach 10 times and took the minimum run time.

For our GJ approach, we ran each trial twice. The first time builds the index tries necessary for generic join just-in-time, and the run time includes that. On the second trial, GJ uses the pre-built index tries from the first run, so the time to build them is excluded. In both Figure 9 and Table 1, orange bars/rows show the first runs (including indexing), and blue bars/rows show the second runs (excluding indexing).

All benchmarks are single-threaded, and they were executed on a 4.6GHz CPU with 32GB of memory.

Figure 9. Relational e-matching can be up to 6 orders of magnitude faster than traditional e-matching on complex patterns. Speedup tends to be greater when the output size is smaller. Bars to the right of the “” line indicate that relational e-matching is faster. The plots show two benchmark suites, lambda and math, taken from the egg test suite. Each group of bars shows the benchmarking results of e-matching a single pattern on 4 increasingly large e-graphs (top to bottom), comparing egg’s built-in e-matching (EM) with our relational e-matching approach using generic join (GJ). The orange bar shows the multiplicative speedup of our approach: . The blue bar shows the same, but excluding the time spent building the indices needed for generic join: . The text above each group of bars shows the pattern itself and the number of substitutions found on the largest e-graph; the patterns are sorted by this quantity.
Idx  Suite   EG Size  GJ  EM   Total  HMean  GMean          Best   Medn  Worst
+    lambda    4,142  15   3    1.69   0.84   1.71         13.62   1.60   0.12
–    lambda    4,142  18   0    2.58   2.99   4.23         39.17   3.68   1.10
+    lambda   57,454  16   2    2.60   0.95   2.66        136.54   2.65   0.12
–    lambda   57,454  18   0    2.87   3.33   9.11        406.70   4.05   1.03
+    lambda  109,493  15   3    1.66   1.75   3.11        148.96   2.03   0.65
–    lambda  109,493  18   0    1.70   3.32   7.46        291.18   4.10   1.05
+    lambda  213,345  15   3    2.20   1.55   3.40        304.33   1.72   0.43
–    lambda  213,345  18   0    2.21   2.96   8.23        501.12   5.04   1.04
+    math      8,205  30   2    5.49   0.64   4.61         66.54   2.79   0.03
–    math      8,205  30   2    5.21   2.93   8.62      1,630.00   5.48   0.62
+    math     53,286  29   3  311.23   2.61  13.50     50,030.29   3.62   0.74
–    math     53,286  30   2  318.95   3.39  29.60  1,325,802.56  30.72   0.74
+    math    132,080  29   3   96.55   2.66  15.18     61,488.73   4.02   0.60
–    math    132,080  30   2   97.84   3.46  34.16  2,447,939.38  68.71   0.75
+    math    217,396  30   2  119.82   2.83  18.34    101,023.37   3.91   0.72
–    math    217,396  31   1  119.73   3.45  41.35  8,575,830.58  80.84   0.76
Table 1. Summary statistics across patterns for each benchmark suite. The “Idx” column shows whether the time to build indices in GJ is included (+) or not (–); the row color corresponds to the colors in Figure 9. The “Suite” column shows the benchmark suite, and the “EG Size” column shows the number of e-nodes in the e-graph used to benchmark. The “GJ” and “EM” columns show how many patterns that algorithm was fastest on in this configuration. “Total” shows the cumulative speedup across all patterns in that configuration. The remaining columns show statistics about the EM/GJ ratios for each pattern: harmonic mean, geometric mean, max, median, and min.
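The summary columns can be computed from per-pattern run times roughly as follows. This sketch makes two assumptions that the caption does not spell out: “Total” is taken to be total EM time divided by total GJ time, and exact ties count for neither algorithm. The timings are made up.

```python
import statistics
from typing import Dict, List


def summarize(em_times: List[float], gj_times: List[float]) -> Dict[str, float]:
    """Summarize per-pattern EM/GJ run-time ratios (values above 1 favor GJ)."""
    ratios = [em / gj for em, gj in zip(em_times, gj_times)]
    return {
        "gj_fastest": sum(r > 1 for r in ratios),
        "em_fastest": sum(r < 1 for r in ratios),
        "total": sum(em_times) / sum(gj_times),  # assumed definition of "Total"
        "hmean": statistics.harmonic_mean(ratios),
        "gmean": statistics.geometric_mean(ratios),
        "best": max(ratios),
        "median": statistics.median(ratios),
        "worst": min(ratios),
    }


stats = summarize(em_times=[8.0, 3.0, 2.0], gj_times=[1.0, 1.0, 4.0])
```

The harmonic mean is pulled toward the slow-downs and the geometric mean treats a 2x speedup and a 2x slow-down symmetrically, which is why both are reported next to the best and worst cases.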

5.2. Results

Figure 9 shows the results of our benchmarking experiments. GJ can be over 6 orders of magnitude faster than traditional e-matching on complex patterns. Speedup tends to be greater when the output size is smaller, and when the pattern is larger and non-linear. A large output indicates that the e-graph is densely populated with terms matching the given pattern, so backtracking search wastes little time on unmatched terms and relational e-matching contributes little or no speedup. Large and complex (non-linear) patterns require careful query planning to be processed efficiently. For example, the pattern experiencing the largest speedup in Figure 9 is 4 e-nodes deep with 4 occurrences of the variable . Relational e-matching using generic join can devise a variable ordering that puts smaller relations with fewer children on the outer loop, thereby pruning a large search space early. In contrast, backtracking search must traverse the e-graph top-down.

In some cases index building takes a significant proportion of the run time in relational e-matching, sometimes offsetting the gains. Overall, relational e-matching remains competitive even when the index-building overhead is included. In Section 6.3 we discuss potential remedies to alleviate this overhead.

Table 1 shows summary statistics across all patterns for each benchmark configuration. Notably, GJ is faster across patterns in every benchmarking configuration (the “Total” column). Much of the total benchmarking run time is dominated by simple, linear patterns (e.g. (+ (+ a b) c)) that return many results. Terms matching such patterns come to dominate the e-graph over time, due to the explosive expansion of associativity and commutativity. As a result, the total speedup does not necessarily increase as the e-graph grows, whereas the best speedup as well as different average statistics steadily increase.

In summary, relational e-matching is almost always faster than backtracking search, and is especially effective in speeding up complex patterns.

6. Discussion

The relational model for e-matching is not only simple and fast, but it opens the door to a wide range of future work to further improve e-graphs and e-matching.

6.1. Pushdown optimization

An e-matching pattern may have an additional filtering condition associated with it. For example, a rewrite rule with left hand side (* (/ x y) y) may additionally require that . When the variables involved in conditions all occur in a single relation (e.g., and ), this relation can be effectively filtered even before being joined (e.g., using predicate ), which could immediately prune a large search space.

We call this pushdown optimization. It can be seen as the e-graph version of the relational query optimization that pushes filter operations down to the bottom of the join tree. Note that the ability to do pushdown optimization stems from relational e-matching’s ability to consider the constraints in any order; backtracking e-matching could not support this technique. We currently do not implement this, because it requires breaking changes to egg’s interface.

Conditions that involve multiple variables can be “pushed down” as well. In generic join, the filter can occur immediately after the variables appear in the variable ordering. Thus, an implementation that supports these conditional filters should take this into account when determining variable ordering.
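Filtering a single relation before the join can be sketched as below. Everything concrete here is an assumption for illustration: the relation layout (e-class, numerator, denominator) for division e-nodes, the set `known_zero` of e-class ids, and the condition that the denominator is not known to be zero.

```python
from typing import Callable, List, Sequence, Tuple


def push_down(
    rows: List[Tuple[int, ...]],
    columns: Sequence[str],
    variables: Sequence[str],
    predicate: Callable[..., bool],
) -> List[Tuple[int, ...]]:
    """Keep only the rows whose values for `variables` satisfy `predicate`."""
    idx = [columns.index(v) for v in variables]
    return [row for row in rows if predicate(*(row[i] for i in idx))]


# Hypothetical relation for division e-nodes: (eclass, x, y) means x / y.
r_div = [(7, 1, 2), (8, 1, 3), (9, 4, 3)]
known_zero = {3}  # made-up e-class ids whose value is known to be zero
filtered = push_down(
    r_div, ("eclass", "x", "y"), ("y",), lambda y: y not in known_zero
)
```

The join then runs over `filtered` instead of `r_div`, so rows that can never satisfy the condition are discarded before any intersection is computed.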

6.2. Join Algorithms

Research in databases has proposed a myriad of different join algorithms. For example, state-of-the-art database systems implement two-way joins like hash join and merge-sort join. They have a longer history than generic join, and benefit from various constant factor optimizations. Extensive research has focused on generating highly efficient query plans using two-way joins. On the other hand, Yannakakis’ algorithm (yannakakis) is proven to be optimal on a class of queries called full acyclic queries, running in time linear in the total size of the input and the output. All linear patterns correspond to acyclic queries, but some nonlinear patterns correspond to cyclic ones. Recent research (mhedhbi19; freitag) has also experimented with combining traditional join algorithms with generic join, achieving good performance. In this paper we choose generic join for its simplicity, and future work may consider other join algorithms for relational e-matching.

6.3. Incremental Processing

We have focused on improving the core e-matching algorithm in this paper, yet prior work has successfully sped up e-matching by making it incremental (efficient-ematching). When the changes to the e-graph are small and the results of e-matching patterns are frequently queried, maintaining the already-discovered matches becomes crucial for efficiency. From our relational perspective, incremental e-matching is captured precisely by the classic problem of incremental view maintenance (IVM) (DBLP:conf/vldb/CeriW91; DBLP:conf/sigmod/SalemBCL00; DBLP:conf/sigmod/ZhugeGHW95) in databases. IVM aims to efficiently update an already-computed query result upon changes to the database, without recomputing the query from scratch.

There is an opportunity to improve relational e-matching even without a full-fledged IVM solution. For example, we have shown in Figure 9 that index building can take up a significant portion of the run time. Our index implementation is based on a hash trie, which is simple but difficult to update efficiently. We are experimenting with an alternative index design based on sort tries, in the hope that it can make updates as simple as inserting into a sorted array.

6.4. Building on Existing Database Systems

Given our reduction from e-matching to conjunctive query answering, one may wonder if other e-graph operations could be reduced to relational operations so that a fully functioning e-graph engine can be implemented purely on top of an off-the-shelf database system. Doing so has many benefits. For example, we could enjoy an industrial-strength query optimization and execution engine for free (although most industrial databases do not use worst-case optimal join algorithms), and eliminate the cost of transforming an e-graph to a relational database. Moreover, this approach would enjoy any properties of the host database system, including persistence, incremental maintenance, concurrency, and fault-tolerance.

As a proof of concept, we implemented a prototype e-graph implementation on top of SQLite, an embedded relational database system, with 160 lines of Racket code. E-graph operations like insertion and merging are translated into high-level SQL queries and executed using SQLite. This naïve prototype is not competitive with highly optimized implementations like egg, especially given our relational e-matching approach. However, with appropriate indices and query plans, it could have similar asymptotic performance. Specialized data structures to represent equivalence relations (nappa2019fast) could also help performance. Therefore, not only e-matching but also other e-graph operations can be expressed as relational queries, which hints at the possibility of developing real-world e-graph engines on top of existing relational database systems.
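To give a flavor of e-matching as a SQL query, here is a small sketch using Python's built-in `sqlite3` module. It is not the Racket prototype described above: the table names, column names, and data are hypothetical, and only e-matching is shown, for an assumed pattern f(a, g(a)) with one table per function symbol.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per function symbol: the e-class id plus one column per child.
    CREATE TABLE r_f (eclass INTEGER, arg0 INTEGER, arg1 INTEGER);
    CREATE TABLE r_g (eclass INTEGER, arg0 INTEGER);
    INSERT INTO r_f VALUES (10, 1, 20), (11, 2, 21), (12, 1, 21);
    INSERT INTO r_g VALUES (20, 1), (21, 3);
    """
)

# f's second child must be the e-class of a g-application whose own child
# equals f's first child.
matches = conn.execute(
    """
    SELECT f.eclass AS root, f.arg0 AS a
    FROM r_f AS f
    JOIN r_g AS g ON f.arg1 = g.eclass AND f.arg0 = g.arg0
    ORDER BY root
    """
).fetchall()
conn.close()
```

SQLite plans this as a binary join, so the sketch inherits the database's optimizer and indices but not the worst-case optimality of generic join.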

7. Conclusion

In this paper, we present relational e-matching, a novel e-matching algorithm that is conceptually simpler, asymptotically faster, and worst-case optimal. We reduce e-matching to conjunctive query answering, a well-studied problem in database research. This relational presentation provides a unified way to exploit in query planning not only structural constraints, but also equality constraints, which traditional e-matching algorithms cannot effectively leverage. We implement relational e-matching with the worst-case optimal generic join algorithm, using which we derive the first data complexity result for e-matching. We integrate our implementation into the state-of-the-art equality saturation engine egg, and show relational e-matching to be flexible (it readily supports multi-patterns) and efficient (it achieves orders of magnitude speedup).

References