Term rewriting has been effective in optimizing compilers for decades (dershowitz1993-rewrite-systems). However, deciding when to apply each rewrite rule is hard and has a huge impact on the performance of the rewritten program: the so-called phase ordering problem. The challenge is that the global benefit of applying a rewrite rule depends on future rewrites. Maximizing local benefit in a greedy fashion is not sufficient in the absence of a convergence property, i.e. confluence and termination, as local optima may be far away from the global optimum.
Equality saturation (tate2009-equality-saturation; willsey2021-egg) mitigates the phase ordering problem by exploring many ways to apply rewrite rules. Starting from an input program, an equality graph (e-graph) is grown iteratively until reaching a fixed point (saturation), achieving a goal, or timing out. An e-graph efficiently represents a large set of equivalent programs, and is grown by repeatedly applying all possible rewrite rules in a purely additive way. After growing the e-graph, the best program found is extracted from it using a cost model, e.g. one that selects the fastest program.
The applicability of equality saturation has recently been broadened by introducing an amortized invariant restoration technique called rebuilding and a mechanism called e-class analyses (willsey2021-egg). Nevertheless, the application of equality saturation for complex optimizations of realistic programs is limited by the following two issues.
Languages with name bindings. Previous equality saturation work either explicitly avoids the use of name binding for efficiency reasons (smith2021-access-patterns), or uses a simple but inefficient implementation (willsey2021-egg). As almost all programming languages use variables, and hence name binding, this paper explores practical ways of efficiently implementing equality saturation for languages with name bindings. We study equality saturation for the lambda calculus as it is the standard formalism for functional languages. We show that using De Bruijn indices avoids overloading the e-graph with $\alpha$-equivalent terms, and that an approximate substitution enables searches performed in milliseconds where searches with naive explicit substitution quickly run out of memory.
Complex optimizations, i.e. those requiring long rewrite sequences. On each equality saturation iteration, the e-graph tends to grow bigger since every possible rewrite rule is applied in a purely additive way. The growth rate is extremely rapid for some combinations of rewrite rules, such as associativity and commutativity, which generate an exponential number of equivalent permutations (wang2020-spores; nandi2020-synthesizing-CAD; willsey2021-egg). In such cases, discovering long rewrite sequences that require many iterations is infeasible. One way to address this issue is to limit the number of rules applied (wang2020-spores; willsey2021-egg), but this risks not finding optimizations that require rules that have been omitted. A second way is to use an external solver to speculatively add equivalences (nandi2020-synthesizing-CAD), but this requires the identification of sub-tasks that can benefit from being delegated.
This paper proposes sketch-guided equality saturation as a technique to break down complex optimizations into smaller ones. The programmer specifies rewrite goals by writing sketches: program patterns that leave details unspecified. While sketches have previously been used as a starting point for program synthesis (solar2008-synthesis-sketching), our work uses sketches to end an equality saturation search once the sketch is satisfied. Guiding the rewriting using a sequence of sketches decomposes it into a sequence of relatively small equality saturation searches.
We demonstrate that sketch-guiding enables complex optimizations in the Rise functional language. We start by showing that existing equality saturation techniques are not sufficient for applying complex optimizations as the e-graph grows too large. Previous work on Rise produced highly optimized code at the cost of the programmer orchestrating sequences of thousands of rewrite rules (hagedorn2020-elevate). Our evaluation demonstrates that by combining our efficient name binding techniques with sketch-guiding, complex optimizations are discovered by equality saturation with little programmer guidance and in a matter of seconds.
To summarize, the contributions of this paper include:
The development of new techniques to support efficient equality saturation for a typed lambda calculus. The techniques are realized in the Risegg implementation for the Rise data-parallel functional language. We demonstrate the effectiveness of Risegg by optimizing a binomial filter application (section 3).
Proposing sketch-guided equality saturation as a new semi-automatic technique to perform complex optimizations that require long rewrite sequences not discoverable by equality saturation alone. We demonstrate the practicality of sketches for guiding realistic optimizations of Harris corner detection (section 4).
A systematic comparative evaluation of sketch-guided equality saturation for optimizing matrix multiplication. Seven complex optimizations are demonstrated, including loop blocking, vectorization, and parallelization. We show that the complex optimizations are not feasible with fully automated equality saturation due to excessive runtime and memory consumption. In contrast, sketch-guided equality saturation performs the optimizations in seconds and with low memory consumption. At most four sketches are required to guide the search, i.e. significantly less effort than purely manual techniques (section 5).
This section gives a technical overview of equality saturation and its application to express compiler optimizations. We also introduce the functional language Rise (hagedorn2020-elevate) in which the programs we optimize in this paper are expressed.
2.1. Equality saturation
Equality saturation (tate2009-equality-saturation; willsey2021-egg) is a technique for efficiently implementing rewrite-driven compiler optimizations without committing to a single rewrite choice. We demonstrate how equality saturation mitigates the phase ordering problem by using a rewriting example where greedily reducing a cost function is not sufficient to find the optimal program.
Rewriting is often used to fuse operators so that not every operator writes its result to memory, for example:
The initial term (a) applies a function to each element of a matrix (using two nested maps), transposes the result, and then applies a second function to each element. The optimized term (b) avoids storing an intermediate matrix in memory: it transposes the input before applying both functions to each element. The following rewrite rules are sufficient to perform this optimization, if applied in the correct order:
Rule (1) states that transposing a two-dimensional array before or after applying a function to the elements is equivalent. Rule (2) states that function composition is associative. Finally, rule (3) is the rewrite rule for map fusion. In this example, minimizing the term size results in maximizing fused maps and, therefore, is a good cost model.
If we greedily apply rewrite rules that lower term size, we will only apply rule (3) as this is the only rule that reduces term size. However, rule (3) cannot be directly applied to term (a): we are in a local optimum. The only way to reduce term size further is to first apply the other rewrite rules, which may or may not pay off depending on future rewrites.
We now investigate step-by-step how equality saturation makes it possible to minimize term size by exploring many possible ways to apply rewrites without getting stuck in local minima.
First, an equality graph (e-graph) representing the initial term is constructed (fig. 1(a)). An e-graph is a set of equivalence classes (e-classes). An e-class is a set of equivalent nodes (e-nodes). An e-node is an $n$-ary function symbol ($F$) from the term language, associated with $n$ child e-classes ($e_1, \dots, e_n$). Examples of symbols are $map$, $transpose$, and $\circ$. The e-graph data structure is used during equality saturation to efficiently represent and rewrite a set of equivalent programs.
Second, the e-graph is iteratively grown by applying rules non-destructively (figs. 1(b), 1(c) and 1(d)). While standard term rewriting picks a single possible rewrite in a depth-first manner, equality saturation explores all possible rewrites in a breadth-first manner. Within an equality saturation iteration, rewrites are applied independently: they may only depend on rewrites from previous iterations. For the sake of simplicity, we only apply a handful of rewrite rules in fig. 1. When applying a rewrite rule, the equality between its left-hand side and its right-hand side is recorded in the e-graph. Rewrite rules stop being applied when a fixed point is reached (saturation), or when another stopping criterion is met (e.g. a timeout). If saturation is reached, all possible rewrites have been explored.
At that point, the e-graph represents many terms that are equivalent according to the applied rules. An e-graph is much more compact than a regular set of terms, as equivalent sub-terms are shared. E-graphs are capable of representing exponentially many terms in polynomial space, and even infinitely many terms in the presence of cycles (willsey2021-egg). To maximize sharing, a congruence invariant is maintained: intuitively, identical e-nodes should not be in different e-classes (fig. 2). However, we will see later that extensive sharing does not necessarily prevent the e-graph size from exploding.
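To make the data structure concrete, the following is a minimal sketch of an e-graph, assuming a union-find over e-class identifiers and a hashcons table from e-nodes to e-classes. The class `EGraph` and its methods are hypothetical names chosen for illustration; they are not the egg or Risegg API, and the congruence invariant is restored eagerly here rather than with the amortized rebuilding of (willsey2021-egg).

```python
class EGraph:
    """Toy e-graph: a union-find over e-class ids plus a hashcons of e-nodes."""

    def __init__(self):
        self.parent = []    # union-find parent pointer for each e-class id
        self.hashcons = {}  # canonical e-node (symbol, child ids) -> e-class id

    def find(self, eclass):
        # Follow parent pointers up to the representative e-class id.
        while self.parent[eclass] != eclass:
            eclass = self.parent[eclass]
        return eclass

    def canonical(self, node):
        symbol, children = node
        return (symbol, tuple(self.find(c) for c in children))

    def add(self, symbol, *children):
        # Adding an e-node that already exists returns its existing e-class.
        node = self.canonical((symbol, children))
        if node not in self.hashcons:
            self.parent.append(len(self.parent))
            self.hashcons[node] = len(self.parent) - 1
        return self.find(self.hashcons[node])

    def union(self, a, b):
        # Record the equality a = b, then restore the congruence invariant.
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
            self.rebuild()
        return a

    def rebuild(self):
        # Re-canonicalize all e-nodes; identical e-nodes found in different
        # e-classes force those e-classes to be merged. Repeat until stable.
        changed = True
        while changed:
            changed = False
            fresh = {}
            for node, eclass in self.hashcons.items():
                node, eclass = self.canonical(node), self.find(eclass)
                other = self.find(fresh.setdefault(node, eclass))
                if other != eclass:
                    self.parent[eclass] = other
                    changed = True
            self.hashcons = fresh


g = EGraph()
x, y = g.add("x"), g.add("y")
map_x, map_y = g.add("map", x), g.add("map", y)
shared = g.add("map", x) == map_x               # identical terms are stored once
distinct_before = g.find(map_x) != g.find(map_y)
g.union(x, y)                                   # record the equality x = y
congruent_after = g.find(map_x) == g.find(map_y)
```

After `x` and `y` are merged, the two `map` e-nodes become identical, so the congruence invariant places them in the same e-class without any rewrite rule mentioning `map`.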
Finally, an extraction procedure selects the best term from the e-graph according to a cost function. For a local cost function, i.e. one that computes the cost of a term from its root function symbol and the costs of its children, a relatively simple bottom-up e-graph traversal can be used (panchekha2015-herbie). More complex cost functions require more complex extraction procedures (wang2020-spores; wu2019-carpentry).
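An e-class analysis (willsey2021-egg) enables propagating analysis data of type $D$ in a bottom-up fashion, and can be used for extraction when the cost function is local. An e-class analysis is defined by providing two functions:
A function $make$ constructing the analysis data from an $n$-ary symbol combined with the data of its child e-classes: $make(F(d_1, \dots, d_n)) : D$
A function $merge$ merging the analysis data of e-nodes that are in the same e-class: $merge(d_1, d_2) : D$
To compute the smallest term for each e-class, we define an e-class analysis whose data pairs a term size with a term of that size, $D = (\mathit{size}, \mathit{term})$: $make(F((s_1, t_1), \dots, (s_n, t_n))) = (1 + \sum_i s_i,\ F(t_1, \dots, t_n))$, and $merge(d_1, d_2)$ returns whichever of $d_1$ and $d_2$ has the smaller size.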
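As an illustration of such an analysis, the sketch below computes the smallest term of every e-class by iterating to a fixed point. It assumes the e-graph is given as a plain dictionary from e-class identifiers to lists of e-nodes; the names `make`, `merge` and `smallest_terms`, and this dictionary encoding, are assumptions made for the example rather than the interface of egg or Risegg.

```python
def make(symbol, child_data):
    # Analysis data for one e-node: (size, term) built from the children's data.
    size = 1 + sum(s for s, _ in child_data)
    term = (symbol, *[t for _, t in child_data])
    return (size, term)


def merge(d1, d2):
    # Keep the smaller of two candidate terms for the same e-class.
    if d1 is None:
        return d2
    return d1 if d1[0] <= d2[0] else d2


def smallest_terms(eclasses):
    """eclasses: dict id -> list of (symbol, child ids). Returns id -> (size, term)."""
    data = {}
    changed = True
    while changed:  # iterate until no e-class improves (handles cycles)
        changed = False
        for eclass, nodes in eclasses.items():
            for symbol, children in nodes:
                if all(c in data for c in children):
                    candidate = make(symbol, [data[c] for c in children])
                    merged = merge(data.get(eclass), candidate)
                    if merged != data.get(eclass):
                        data[eclass] = merged
                        changed = True
    return data


# E-class 0 represents both `a` and `(a * 2) / 2`, which makes the e-graph cyclic.
eclasses = {
    0: [("a", ()), ("/", (2, 1))],
    1: [("2", ())],
    2: [("*", (0, 1)), ("<<", (0, 3))],
    3: [("1", ())],
}
best = smallest_terms(eclasses)
```

Here `best[0]` is `(1, ("a",))`: the cycle through e-class 2 never produces a smaller candidate, so the fixed-point iteration terminates.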
2.2. Rewriting the Rise functional language
In this paper we study the lambda calculus because it formalizes functional languages. To demonstrate the impact of our approach in practice, we use Rise (hagedorn2020-elevate), which implements a typed lambda calculus, and many of the examples in this paper are Rise programs. Rise is a spiritual successor of Lift (lift-rewrite-2015; lift-ir-2017) that demonstrated performance portability across hardware by automatically applying semantics-preserving rewrite rules to optimize programs from domains including scientific code (lift-stencil-2018) and convolutions (mogers2020-convolution).
Rise provides standard lambda abstraction ($\lambda x.\ b$), function application (f x), identifiers and literals. Rise also offers an extensible set of higher-order functions describing well-known data-parallel computational patterns. While Rise provides many computational patterns, we focus in this paper on two important patterns. map applies a function to each element of an array. reduce combines all elements of an array to a single value given a binary reduction operator. To make Rise accessible to equality saturation, Rise programs are easily encoded as terms of shape $F(t_1, \dots, t_n)$ as shown in table 1.
To control optimization, Rise is complemented by a second programming language, Elevate (hagedorn2020-elevate), that allows programmers to describe complex compiler optimizations as compositions of rewrite rules, called strategies. The performance of the code generated by Rise and Elevate has been shown to be on par with the state-of-the-art deep learning compiler TVM (tvm-2018) for matrix multiplication (hagedorn2020-elevate), and competitive with, or even up to 1.4$\times$ better than, the state-of-the-art image processing compiler Halide (halide-2012) for the Harris corner detection (koehler2021-elevate-imgproc). This makes Rise an interesting base language for exploring rewrite-based compiler optimizations.
Unfortunately, writing Elevate strategies manually is low-level and time-consuming. A strategy describes precisely the rewrite sequence required for a particular optimization. Even though Elevate provides combinators and abstractions to help express complex optimizations, the authors of (hagedorn2020-elevate) and (koehler2021-elevate-imgproc) report that expressing complex optimizations required between 2 and 4 person-weeks of effort. The fundamental problem is that Elevate strategies express optimizations in an imperative style and require the programmer to orchestrate all rewrite steps deterministically. Besides being costly to develop, this also significantly limits the applicability of optimizations to many different programs: small program differences require adjustments to the rewrite sequence. So it is highly desirable to find some automatic, or at least semi-automatic, rewriting technique to reduce the programmer effort required to optimize Rise programs.
3. Efficient Equality Saturation for the Lambda Calculus
This section addresses the first issue with prior equality saturation techniques: the lack of effective support for languages with name bindings. We explore the engineering design choices required to efficiently implement equality saturation for a typed lambda calculus. A set of design choices is realized for the Rise language in the new Risegg implementation that is heavily inspired by the egg library (willsey2021-egg). The performance numbers in this section are from a prototype of Risegg (https://github.com/Bastacyclop/egg-rise) for an untyped subset of Rise implemented in Rust on top of the egg library.
To assess the efficiency of equality saturation in this section, our aim is to be able to discover certain rewrite goals in reasonable time on a laptop machine, i.e. with an AMD Ryzen 5 PRO 2500U processor and using no more than 4GB of RAM. Discovering a rewrite goal means that it is feasible to grow an e-graph starting from the initial program until the goal program is represented in the e-class of the initial program.
Applying equality saturation to lambda calculus terms requires the efficient support of standard operations and rewrites. Figure 4 shows the standard rules of $\beta$-reduction and $\eta$-reduction. The other two rules encode standard map-fusion and map-fission, and are interesting because they introduce new name bindings on their right-hand side.
In equality saturation, standard term substitution cannot be used to directly compute $b[e/x]$ from the $\beta$-reduction rule in an e-graph, because $b$ and $e$ are not terms but e-classes. A simple way to address this is to use explicit substitution as in egg (willsey2021-egg). A syntactic constructor is added to represent substitution, as well as rewrite rules to encode its small-step behavior:
Explicit substitution adds all intermediate substitution steps to the e-graph, quickly exploding its size. We can demonstrate the e-graph size problem by attempting to discover a trivial rewrite goal using equality saturation:
The rewrite in (4) merely requires a sequence of two map-fission rules, one map-fusion rule, plus a couple of the $\beta$-reduction and $\eta$-reduction rules that are pervasive when rewriting functional programs:
Despite this simple rewrite sequence, rewrite goal (4) cannot reasonably be discovered using explicit substitution. After more than 40 seconds of equality saturation (10 iterations) the available 4GB memory is exhausted and the goal has not been discovered. The e-graph contains 13M e-nodes and 3M e-classes.
So intermediate substitution steps cannot be added to the e-graph, otherwise it grows uncontrollably. To avoid intermediate substitutions we propose extraction-based substitution, which works as follows:
extract a term for each e-class involved in the substitution (i.e. $b$ and $e$);
perform standard term substitution;
add the resulting term to the e-graph.
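The three steps above can be sketched as follows. This is an illustration under simplifying assumptions, not the Risegg implementation: an e-class is modelled as a plain list of equivalent named terms, the extracted term is the smallest one, and substitution assumes that no binder in the body captures a free variable of the argument. All function names are hypothetical.

```python
def size(term):
    # Terms are ("var", x), ("lam", x, body) or ("app", f, a).
    if term[0] == "var":
        return 1
    if term[0] == "lam":
        return 1 + size(term[2])
    return 1 + size(term[1]) + size(term[2])


def substitute(body, x, arg):
    """Standard substitution body[arg/x]; assumes no variable capture can occur."""
    tag = body[0]
    if tag == "var":
        return arg if body[1] == x else body
    if tag == "lam":
        if body[1] == x:  # x is shadowed below this binder
            return body
        return ("lam", body[1], substitute(body[2], x, arg))
    return ("app", substitute(body[1], x, arg), substitute(body[2], x, arg))


def extraction_based_substitution(b_class, x, e_class):
    b = min(b_class, key=size)  # step 1: extract one term per e-class
    e = min(e_class, key=size)
    return substitute(b, x, e)  # step 2: substitute; step 3 is left to the caller


# Each e-class holds two equivalent terms, so four substitution results exist.
b_class = [
    ("app", ("var", "f"), ("var", "x")),
    ("app", ("lam", "y", ("app", ("var", "f"), ("var", "y"))), ("var", "x")),
]
e_class = [
    ("var", "a"),
    ("app", ("lam", "z", ("var", "z")), ("var", "a")),
]
result = extraction_based_substitution(b_class, "x", e_class)
```

Only one of the four possible results, `f a`, is produced and added back, which is what keeps the e-graph small and also what makes the technique an approximation.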
Extraction-based substitution is far more efficient than explicit substitution. For example, we can discover the rewrite goal (4) in less than a millisecond, with 5 iterations, and the e-graph contains only 364 e-nodes and 277 e-classes.
Extraction-based substitution is, however, an approximation as it computes the substitution for a subset of the terms represented by $b$ and $e$, and ignores the rest. Figure 5 shows an example where the initial e-graph is in the middle, and the e-graph after extraction-based substitution, for one particular choice of extracted terms, is on the right. This particular choice results in an e-graph lacking a program that is included in the e-graph without approximation (left in fig. 5).
In practice, we have not observed this approximation to be an issue, and believe this is for two reasons. First, the substitution is computed on each equality saturation iteration, where different terms may be extracted, increasing coverage of the set of terms represented by $b$ and $e$. Second, many of the ignored equivalences are recovered either by e-graph congruence, or by applying further rewrite rules.
Future work may investigate alternative substitution implementations to balance efficiency with non-approximation. For efficiency, extraction-based substitution is used in Risegg.
3.2. Name Bindings
In equality saturation, inappropriate handling of name bindings easily leads to serious efficiency issues. Consider rewrite rules like map-fusion that create a new lambda abstraction on their right-hand side. Which name should they introduce when they are applied? In standard term rewriting, generating a fresh name using a global counter (a.k.a. gensym) is a common solution. However, if a new name is generated each time the rewrite rule is applied, the e-graph will quickly be overloaded with many $\alpha$-equivalent terms (two terms are $\alpha$-equivalent if one term can be made equivalent to the other simply by renaming variables).
Fewer $\alpha$-equivalent terms are introduced if fresh names are generated as a function of the matched e-class identifiers. However, as the e-graph grows and e-classes are merged, e-class identifiers change, and $\alpha$-equivalent terms are still generated and duplicated in the e-graph.
Figure 6 shows an example that demonstrates the practical issues when rewriting with -equivalent terms. This Rise program computes a binomial filter – a 2D convolution – that is expressed using the slide pattern creating a sliding window to group neighboring elements that are then multiplied with the convolution kernel and summed. The purpose of the rewrite shown is to separate the 2D filter into two 1D filters according to a well-known convolution kernel equation:
This separation optimization reduces both memory accesses and arithmetic complexity. An Elevate rewriting strategy achieves the optimization by orchestrating 30 rewrite rules including 17 $\beta$/$\eta$-reductions (koehler2021-elevate-imgproc).
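The effect of the separation can be checked on a small example. The sketch below assumes the usual unnormalized binomial weights, where the 3x3 kernel is the outer product of the vector [1, 2, 1] with itself; the helper names are hypothetical and the code is plain Python rather than Rise.

```python
W1D = [1, 2, 1]
W2D = [[a * b for b in W1D] for a in W1D]  # outer product of W1D with itself


def slide3(xs):
    # Sliding window of size 3 and step 1.
    return [xs[i:i + 3] for i in range(len(xs) - 2)]


def dot(ws, xs):
    return sum(w * x for w, x in zip(ws, xs))


def binomial_2d(img):
    # One 3x3 weighted sum per output pixel: 9 multiplications each.
    out = []
    for rows in slide3(img):
        columns = list(zip(*rows))
        out.append([sum(dot(w, col) for w, col in zip(W2D, window))
                    for window in slide3(columns)])
    return out


def binomial_separated(img):
    # A vertical 1D pass followed by a horizontal 1D pass: 3 + 3 multiplications.
    vertical = [[dot(W1D, col) for col in zip(*rows)] for rows in slide3(img)]
    return [[dot(W1D, window) for window in slide3(row)] for row in vertical]


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
same_result = binomial_2d(image) == binomial_separated(image)
```

Both functions compute the same output, but the separated version performs six multiplications per output pixel instead of nine.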
The binomial filter optimization goal cannot be discovered by equality saturation if fresh names are generated for each rewrite rule application. After two minutes and 9 iterations the 4GB memory is exhausted, and the goal has not been discovered. The e-graph contains 2.9M e-nodes and 1.4M e-classes, emphasizing the need for more efficient name handling.
De Bruijn indices (DEBRUIJN1972381) are a standard technique for representing lambda calculus terms without naming the bound variables, and avoid the need for $\alpha$-conversions. If De Bruijn indices enable two $\alpha$-equivalent terms to become structurally equivalent, the regular e-graph congruence invariant (reminder: the congruence invariant ensures that identical e-nodes will not end up in different e-classes) is enough to prevent the duplication of $\alpha$-equivalent terms. Therefore, we translate our terms and rewrite rules to use De Bruijn indices instead of names, and observe a significant change in efficiency. With De Bruijn indices, 100ms is enough to discover the binomial filter rewrite goal. After 11 iterations, the e-graph contains 3K e-nodes and 1K e-classes. Hence De Bruijn indices are used in Risegg.
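A small sketch of the translation from names to De Bruijn indices shows why $\alpha$-equivalent terms collapse into one representation. The tuple encoding and the function `to_de_bruijn` are assumptions for this example; in particular, free variables simply keep their names here, which may differ from the encoding used in Risegg.

```python
def to_de_bruijn(term, bound=()):
    """Translate ("var", x) / ("lam", x, body) / ("app", f, a) to index form.

    `bound` lists the enclosing binder names, innermost first.
    """
    tag = term[0]
    if tag == "var":
        name = term[1]
        if name in bound:
            # The index counts the lambdas between the variable and its binder.
            return ("idx", bound.index(name))
        return ("free", name)
    if tag == "lam":
        return ("lam", to_de_bruijn(term[2], (term[1],) + bound))
    return ("app", to_de_bruijn(term[1], bound), to_de_bruijn(term[2], bound))


# Two alpha-equivalent terms: \x. \y. x y  and  \a. \b. a b
left = ("lam", "x", ("lam", "y", ("app", ("var", "x"), ("var", "y"))))
right = ("lam", "a", ("lam", "b", ("app", ("var", "a"), ("var", "b"))))
same_encoding = to_de_bruijn(left) == to_de_bruijn(right)
```

Because both terms translate to the same structure, hashconsing in the e-graph stores them once, and no dedicated $\alpha$-equivalence check is needed.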
True equality modulo $\alpha$-renaming
While De Bruijn indices give a significant performance improvement, they do not provide equality modulo $\alpha$-renaming for sub-terms. Consider a term in which the same variable occurs both outside and inside a lambda abstraction: the two occurrences are encoded with different De Bruijn indices. Although these indices are structurally different, they both correspond to the same variable. Recent work has shown how to implement efficient hashing modulo $\alpha$-renaming (maziarz2021-hash-mod-alpha-eq), and could be used to investigate an even more efficient e-graph representation. Another possibility would be to investigate the effectiveness of nominal rewriting techniques (fernandez2007-nominal-rewriting).
Translating name-based rules into index-based rules
Using De Bruijn indices means that rewrite rules also need to manipulate terms with De Bruijn indices. Thankfully, more user-friendly name-based rewrite rules can be automatically translated to the index-based rules used internally (bonelli2000-bruijn-rewriting).
Shifting De Bruijn indices
De Bruijn indices must be shifted when a term is used with different surrounding lambdas. In Risegg, extraction-based index shifting works like substitution, in three steps:
extract a term from the e-class whose indices need shifting;
perform standard index shifting;
add the resulting term to the e-graph.
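The standard index shifting used in the second step can be sketched as below, on an extracted term. The tuple encoding (`"idx"`, `"lam"`, `"app"`) and the function name `shift` are assumptions for illustration.

```python
def shift(term, delta, cutoff=0):
    """Add `delta` to every index that refers outside the term.

    Indices below `cutoff` are bound inside the term and stay unchanged.
    """
    tag = term[0]
    if tag == "idx":
        return ("idx", term[1] + delta) if term[1] >= cutoff else term
    if tag == "lam":
        # One more enclosing binder: indices below cutoff + 1 are now local.
        return ("lam", shift(term[1], delta, cutoff + 1))
    if tag == "app":
        return ("app", shift(term[1], delta, cutoff), shift(term[2], delta, cutoff))
    return term  # constants and other leaves carry no indices


# \. (%0 %1): %0 is bound by the lambda, %1 refers to an outer binder.
term = ("lam", ("app", ("idx", 0), ("idx", 1)))
shifted = shift(term, 1)
```

Shifting by one is what a rule such as map-fusion needs when it wraps existing sub-terms in a newly introduced lambda: only the index that escapes the term is incremented.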
Avoiding Name Bindings using Combinators
It is also possible to avoid name bindings entirely (smith2021-access-patterns). For example, it is possible to introduce a function composition combinator ‘$\circ$’ (also used in section 2.1), that greatly simplifies map-fusion and map-fission rules:
However, this approach has its own downsides. Associativity rules are required, which we know increase the growth rate of the e-graph (willsey2021-egg). Only using a left-/right-most associativity rule avoids generating too many equivalent ways to parenthesize terms. But other rewrite rules now have to take this associativity convention into account, making their definition more difficult and their matching more expensive. In general, matching modulo associativity or commutativity is an algorithmically hard problem (benanav1987-complexity-matching).
The $\circ$ combinator on its own is also not sufficient to remove the need for name bindings. At one extreme, combinatory logic could be used, as any lambda calculus term can be represented in combinatory logic, replacing function abstraction by a limited set of combinators. However, translating a lambda calculus term into combinatory logic results in a term of size $O(n^3)$ in the worst case, where $n$ is the size of the initial term (lachowski2018-complexity). Translating existing rewrite systems to combinatory logic would be challenging in itself.
3.3. Freshness Predicates
Handling predicates is not trivial in equality saturation. The $\eta$-reduction rule has the side condition "$x$ not free in $f$", but in an e-graph $f$ is an e-class and not a term.
The predicate could be handled precisely by filtering the e-class $f$ into $f' = \{t \mid t \in f,\ x \text{ not free in } t\}$, and using $f'$ on the right-hand side of the rule. However, this requires splitting an e-class in two: one that satisfies the predicate, and one that does not. This would reduce sharing and increase the e-graph size, be difficult to reason about in the presence of cycles, and interfere with the e-graph amortized invariant restoration optimization (willsey2021-egg).
The design of Risegg makes an engineering trade-off, as for substitution, and following egg (willsey2021-egg). In Risegg, the $\eta$-reduction rewrite rule is only applied if $\forall t \in f.\ x \text{ not free in } t$. Advantages are that this predicate is efficient to compute using an e-class analysis, and that there is no need to split the e-class. The disadvantage is that it is an approximation that ignores some valid terms. Figure 7 shows an example where $\eta$-reduction is not applied by Risegg. In practice, we have not observed this approximation to be an issue, e.g. for the results presented in section 5.
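The approximate predicate can be illustrated with a free-variable analysis. In this sketch an e-class is modelled as a list of equivalent named terms, and its analysis data is the union of the free variables of all of them; a real e-class analysis would compute the same set bottom-up from e-nodes. The function names are hypothetical.

```python
def free_vars(term):
    # Terms are ("var", x), ("lam", x, body) or ("app", f, a).
    if term[0] == "var":
        return {term[1]}
    if term[0] == "lam":
        return free_vars(term[2]) - {term[1]}
    return free_vars(term[1]) | free_vars(term[2])


def eclass_free_vars(eclass):
    # Merging analysis data within an e-class: union over all represented terms.
    return set().union(*(free_vars(t) for t in eclass))


def eta_applicable(x, f_class):
    # Approximation: apply eta-reduction only if x is free in NO term of f.
    return x not in eclass_free_vars(f_class)


# `g` and `(\y. g) x` are equivalent, but only the second mentions x.
f_class = [
    ("var", "g"),
    ("app", ("lam", "y", ("var", "g")), ("var", "x")),
]
applied = eta_applicable("x", f_class)
```

Here `applied` is false even though the e-class contains `g`, for which the side condition holds: this is the kind of valid term that the approximation ignores.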
3.4. Adding Types
Typed lambda calculi are pervasive, providing the foundation for almost all functional languages, and a key consideration is how to add types into the e-graph. Considering two uses of the same polymorphic term at different types, there are broadly two alternatives:
Keep types polymorphic (one e-class):
Instantiate the types (two e-classes):
While keeping types polymorphic enables more sharing, instantiating types enables more precise type-based rewriting. While polymorphic types can be computed using an e-class analysis, instantiated types must be embedded in the e-graph. There is no obvious best solution; instead, we observe a trade-off between the amount of sharing and the amount of information. Since Rise rewrite rules often match precise types, types are instantiated in Risegg.
Rewrite rule type inference
To avoid having to explicitly type rewrite rules by hand, we infer their types. After inferring the types on the left-hand-side, we check that the right-hand-side is well-typed for any well-typed left-hand-side matching program. When applied, typed rewrite rules match (deconstruct) types with their left-hand-side, and construct types on their right-hand-side. Type annotations can be used in Risegg to constrain the inferred types.
Since types are duplicated many times in the e-graph, and since structural type equality is often required, we hash-cons types for efficiency (filliatre2006-hashconsing).
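Hash-consing can be sketched with an interning table that maps each distinct type constructor application to a unique identifier. The class `TypeTable` and the type constructors used below are hypothetical and only illustrate the idea; they are not the Risegg data structures.

```python
class TypeTable:
    """Interning table: structurally equal types receive the same identifier."""

    def __init__(self):
        self._ids = {}     # (constructor, arguments) -> identifier
        self._nodes = []   # identifier -> (constructor, arguments)

    def intern(self, constructor, *args):
        key = (constructor, args)
        if key not in self._ids:
            self._ids[key] = len(self._nodes)
            self._nodes.append(key)
        return self._ids[key]

    def node(self, type_id):
        return self._nodes[type_id]


types = TypeTable()
f32 = types.intern("f32")
row = types.intern("array", "n", f32)        # array of n floats
matrix_a = types.intern("array", "m", row)
matrix_b = types.intern("array", "m", types.intern("array", "n", types.intern("f32")))
stored_once = matrix_a == matrix_b
```

Since arguments are themselves identifiers, structural equality of two types reduces to comparing two integers, and each distinct type is stored only once however often it occurs.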
This section has shown important aspects to consider when using equality saturation for languages with name bindings. Specifically, we have seen that De Bruijn indices and extraction-based substitution are critical in practice: they make the difference between running out of memory and optimizing functional programs in seconds.
4. Sketch-Guided Equality Saturation
The previous section discussed crucial techniques for efficiently optimizing functional programs using equality saturation. But are these techniques enough to perform complex optimizations, i.e. those requiring long rewrite sequences? In (hagedorn2020-elevate) the authors report that 63,000 rewrite steps are required to perform loop blocking, vectorization, and parallelization optimizations for a matrix multiplication. Our evaluation in section 5 shows that even with the techniques discussed in section 3 equality saturation is unable to perform these optimizations, exhausting the available 60GB memory after more than one hour of search.
The issue is that as the e-graph grows, iterations become slower and require more memory. Combined with the fact that the e-graph is grown in a breadth-first manner, this makes finding long rewrite sequences inherently hard. The growth rate is aggravated by some combinations of rewrite rules, such as associativity and commutativity that generate an exponential number of equivalent permutations (wang2020-spores; nandi2020-synthesizing-CAD; willsey2021-egg). This motivates keeping the size of the e-graph small.
4.1. Keeping the e-graph size under control
Amongst the many ways to reduce e-graph size during exploration, we identify the following three general directions.
Deleting programs that are not considered valuable.
Often there are programs that we are obviously not interested in and, therefore, these could be removed from the e-graph. For example, we implement a filter removing all e-classes that only contribute to constructing programs of more than a given size limit. This is useful because exploring unreasonably large terms is typically not desirable.
Unfortunately, due to the extensive sharing, deleting individual programs from the e-graph is difficult. Therefore, Risegg only deletes e-nodes and e-classes, affecting all of their parents. This prohibits deleting individual programs that share sub-terms with other programs that we do not want to delete. In the future, it would be interesting to develop more sophisticated capabilities, e.g. deleting the matched left-hand side of a rewrite rule, which would also allow performing normalization.
Prioritizing growth in promising directions.
Previous work proposes rewrite rule schedulers as a way to control which rewrite rules should be applied on a given equality saturation iteration (willsey2021-egg). The SimpleScheduler from the egg library applies all rewrite rules on each iteration. The default BackoffScheduler prevents specific rules from being applied too often, reducing e-graph growth in the presence of “explosive” rules such as associativity and commutativity. Our experience with Rise is that using the BackoffScheduler is counterproductive because the desired optimizations depend on explosive rules. Future work may look into finding more advanced ways to prioritize e-graph growth, but at present Risegg does not use a rewrite rule scheduler.
An effective way to reduce e-graph size is to build and grow a new e-graph from the most promising term represented in a previous e-graph. Our sketch-guided technique develops this approach by defining a sequence of equality saturation searches, while defining a clear goal for each search in the form of a sketch.
4.2. The Intuition for Sketches
When designing optimizations it is useful for the programmer to think about the desired shape of the optimized program. Sketches are program patterns that capture this intuition while leaving details unspecified.
Previous work on optimizing the Harris corner detection image processing pipeline with rewrites (koehler2021-elevate-imgproc) used program snippets like the one in listing harris-shape3 to explain the effect of the various optimizations. The key insight is that explanatory program snippets can be utilized as sketches such as the one shown in listing harris-sketch. This sketch resembles the snippet in listing harris-shape3 but adds program holes (?) and constraints (contains) to explicitly elide program details.
Where Elevate rewriting is purely manual, sketch-guided equality saturation is semi-automated. That is, it allows the programmer to declaratively specify the desired optimization goal (the sketch) without needing to specify the detailed rewrite sequence to get there.
4.3. Sketch Definition
We define the simple SketchBasic language with just four constructors as a proof of concept. The syntax of SketchBasic and the set of terms that the constructors represent is defined in fig. 9. A sketch $s$ represents a set of terms $R(s)$, such that $R(s) \subseteq T$ where $T$ denotes all terms in the language we rewrite. We say that any $t \in R(s)$ satisfies the sketch $s$.
The sketch $?$ is the least precise as it represents all terms in the language. The sketch $F(s_1, \dots, s_n)$ represents all terms that match a specific $n$-ary function symbol $F$ from the term language, and whose children satisfy sketches $s_1, \dots, s_n$. The sketch $contains(s)$ represents all terms containing a term that satisfies sketch $s$. Finally, the sketch $s_1 \mid s_2$ represents terms satisfying either $s_1$ or $s_2$.
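The four constructors can be illustrated by a function that checks whether a single term satisfies a sketch. The tuple encoding of terms and sketches, and the reading of contains as "the term itself or one of its sub-terms satisfies the sketch", are assumptions for this example; the paper's own definition is the one in fig. 9.

```python
def satisfies(term, sketch):
    """Terms are (symbol, child, ...) tuples; sketches are tagged tuples."""
    kind = sketch[0]
    if kind == "any":                     # the hole ?
        return True
    if kind == "node":                    # F(s_1, ..., s_n)
        _, symbol, children = sketch
        return (term[0] == symbol
                and len(term) - 1 == len(children)
                and all(satisfies(t, s) for t, s in zip(term[1:], children)))
    if kind == "contains":                # contains(s)
        return (satisfies(term, sketch[1])
                or any(satisfies(t, sketch) for t in term[1:]))
    if kind == "or":                      # s_1 | s_2
        return satisfies(term, sketch[1]) or satisfies(term, sketch[2])
    raise ValueError(f"unknown sketch constructor: {kind}")


# The term  map(f) o map(g)  written with a binary "compose" symbol.
term = ("compose", ("map", ("f",)), ("map", ("g",)))
has_map = satisfies(term, ("contains", ("node", "map", [("any",)])))
has_reduce = satisfies(term, ("contains", ("node", "reduce", [("any",)])))
either = satisfies(term, ("or", ("node", "reduce", [("any",)]),
                          ("node", "compose", [("any",), ("any",)])))
```

The sketch `contains(map(?))` is satisfied without saying where the `map` occurs or what it is applied to, which is how sketches leave details unspecified.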
When rewriting terms in a typed language, sketches may be annotated with a type sketch constraining the type of terms: an annotated sketch represents the terms that satisfy both the sketch and the type sketch. The grammar of type sketches depends on the language we rewrite. We elide type sketches from our definition of SketchBasic for simplicity, but use them in section 5.
When writing sketches, a balance has to be found between being too precise (representing only one program) and too vague (representing programs that are not desired). This balance also interacts with the choice of rules, since the programs that may be found by the search are $R(s) \cap E$, where $E$ represents the set of terms that can be discovered to be equivalent to the initial term according to the given rules. This means that using a more restricted set of rules generally enables specifying less precise sketches.
4.4. Sketch-Guided Equality Saturation Algorithm
The idea behind Sketch-Guided Equality Saturation is to use multiple sketches to decompose a complex rewrite goal into a series of simpler rewrite goals. The overall process is illustrated in fig. 10.
The programmer guides the system by providing a sequence of sketches ($s_1, \dots, s_n$). Successive equality saturation searches are performed to find equivalent terms satisfying one sketch after the other. Because each sketch is satisfied by many terms, the programmer must also provide cost models ($c_1, \dots, c_n$) to select the term to be used as the start point for the next search. Sets of rewrite rules ($r_1, \dots, r_n$) are provided to grow the e-graph in each search.
The pseudo-code for the sketch-guided equality saturation algorithm is shown in LABEL:fig:sketching-alg. The extract function (line 3) is used to extract a term from the e-graph that satisfies the specified sketch while minimizing the specified cost model, and we describe it in the next subsection. The found function (line 3) is used to stop growing the e-graph by checking whether extract succeeded in returning a term. For efficiency our implementation of found does not rely on extract and checks only if a term could be extracted.
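The overall control flow can be illustrated with a small Python sketch. It deliberately swaps the e-graph for an explicit set of terms grown breadth-first, which is far less efficient but keeps the example self-contained. Sketches are modelled as plain predicates, and every name here (`rewrite_anywhere`, `saturate_until`, `sketch_guided_search`, the toy rules) is hypothetical rather than taken from the paper's implementation.

```python
from typing import Callable, Iterable, Optional, Sequence, Set

Term = tuple                               # (symbol, child_1, ..., child_n)
Rule = Callable[[Term], Iterable[Term]]    # rewrites applicable at a term's root
Sketch = Callable[[Term], bool]            # assumption: a sketch is a predicate
Cost = Callable[[Term], float]

def rewrite_anywhere(term: Term, rules: Sequence[Rule]) -> Set[Term]:
    """All terms obtained by applying one rule at one position of `term`."""
    results = {new for rule in rules for new in rule(term)}
    for i in range(1, len(term)):
        for new_child in rewrite_anywhere(term[i], rules):
            results.add(term[:i] + (new_child,) + term[i + 1:])
    return results

def extract(terms: Set[Term], sketch: Sketch, cost: Cost) -> Optional[Term]:
    """Cheapest explored term satisfying the sketch, if any."""
    candidates = [t for t in terms if sketch(t)]
    return min(candidates, key=cost) if candidates else None

def saturate_until(start, sketch, rules, cost, max_iterations=10):
    """Grow the explored set until the sketch is found, saturation, or timeout."""
    explored, frontier = {start}, {start}
    for _ in range(max_iterations):
        if any(sketch(t) for t in explored):       # plays the role of "found"
            break
        frontier = {n for t in frontier
                    for n in rewrite_anywhere(t, rules)} - explored
        if not frontier:                           # nothing new: saturated
            break
        explored |= frontier
    return extract(explored, sketch, cost)

def sketch_guided_search(term, sketches, rule_sets, costs):
    """One search per sketch; each result seeds the next search."""
    for sketch, rules, cost in zip(sketches, rule_sets, costs):
        term = saturate_until(term, sketch, rules, cost)
        if term is None:
            return None
    return term

# Toy usage: distribute a multiplication, then commute the resulting addition.
def distribute(t):
    if t[0] == "mul" and t[1][0] == "add":
        _, (_, x, y), z = t
        return [("add", ("mul", x, z), ("mul", y, z))]
    return []

def commute_add(t):
    return [("add", t[2], t[1])] if t[0] == "add" else []

def size(t):
    return 1 + sum(size(c) for c in t[1:])

start = ("mul", ("add", ("a",), ("b",)), ("c",))
sketches = [lambda t: t[0] == "add",
            lambda t: t[0] == "add" and t[1] == ("mul", ("b",), ("c",))]
print(sketch_guided_search(start, sketches, [[distribute], [commute_add]], [size, size]))
```

Each search only uses the rules supplied for its sketch, which is what keeps the explored space small in this toy setting.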
4.5. Sketch-Guided Equality Saturation Extraction
To extract the best program that satisfies a SketchBasic sketch $s$ from an e-graph, we define a helper function parameterized by the sketch $s$ and a cost function $c$ that must be monotonic and local. While extract returns a program from an e-class $e$, the helper returns a map from e-classes to optional tuples of costs and terms. After invoking the helper we simply look up the e-class $e$ in the map and extract the term from the optional tuple. For efficiency, we memoize previously computed results of the helper. The extraction is recursively defined over the four SketchBasic cases as follows.
Case 1: $?$ (any term). This case is equivalent to extracting the programs minimizing $c$ from the e-graph. We implement this extraction as an e-class analysis (section 2) whose data are optional tuples of costs and terms, with a function make that constructs analysis data and a function merge that combines analysis data from e-nodes in the same e-class.
Case 2: $F(s_1, \dots, s_n)$ (function symbol with child sketches). We consider each e-class $e$ containing $F(e_1, \dots, e_n)$ e-nodes and the terms that should be extracted for each child e-class $e_i$, indexing into the map returned by the helper for the corresponding child sketch $s_i$.
Case 3: $\mathit{contains}(s)$. We use another e-class analysis and initialize the analysis data to the result of the helper for $s$, corresponding to the base case where a term satisfying $s$ also satisfies $\mathit{contains}(s)$. To make the analysis data, we consider all terms that would contain terms satisfying $\mathit{contains}(s)$ and keep the best by reducing them using merge.
To merge the analysis data, we do the same as for Case 1.
Case 4: $s_1 \lor s_2$ (either alternative). We merge the results from $s_1$ and $s_2$.
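The four cases can be made concrete with a minimal Python sketch. It is a simplified illustration rather than the paper's implementation: the e-graph is assumed to be a plain dictionary from e-class ids to lists of e-nodes `(symbol, child_ids)`, the e-class analyses are replaced by naive fixed-point loops, and all names (`ex`, `make`, `AnySketch`, `Node`, `Contains`, `Or`) are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AnySketch: pass

@dataclass(frozen=True)
class Node:
    symbol: str
    children: tuple = ()

@dataclass(frozen=True)
class Contains:
    inner: object

@dataclass(frozen=True)
class Or:
    left: object
    right: object

def make(symbol, entries, cost):
    """Build a (cost, term) pair from a symbol and the chosen child entries."""
    return (cost(symbol, [c for c, _ in entries]),
            (symbol, *[t for _, t in entries]))

def ex(egraph, sketch, cost, memo=None):
    """Map each e-class id to the cheapest (cost, term) satisfying `sketch`.

    E-classes with no satisfying term are simply absent from the map.
    """
    memo = {} if memo is None else memo
    if sketch in memo:                      # memoize previously computed results
        return memo[sketch]
    result = {}

    def offer(eclass, candidate):
        """Merge step: keep the cheaper entry, report whether it improved."""
        current = result.get(eclass)
        if current is None or candidate[0] < current[0]:
            result[eclass] = candidate
            return True
        return False

    if isinstance(sketch, AnySketch):       # Case 1: plain cost-based extraction
        changed = True
        while changed:
            changed = False
            for eclass, enodes in egraph.items():
                for symbol, kids in enodes:
                    if all(k in result for k in kids):
                        cand = make(symbol, [result[k] for k in kids], cost)
                        changed |= offer(eclass, cand)
    elif isinstance(sketch, Node):          # Case 2: match symbol, recurse on children
        maps = [ex(egraph, s, cost, memo) for s in sketch.children]
        for eclass, enodes in egraph.items():
            for symbol, kids in enodes:
                if (symbol == sketch.symbol and len(kids) == len(maps)
                        and all(k in m for k, m in zip(kids, maps))):
                    offer(eclass, make(symbol, [m[k] for k, m in zip(kids, maps)], cost))
    elif isinstance(sketch, Contains):      # Case 3: base case, then propagate upwards
        anything = ex(egraph, AnySketch(), cost, memo)
        result.update(ex(egraph, sketch.inner, cost, memo))
        changed = True
        while changed:
            changed = False
            for eclass, enodes in egraph.items():
                for symbol, kids in enodes:
                    if not all(k in anything for k in kids):
                        continue
                    for i, kid in enumerate(kids):
                        if kid in result:   # child i contains a match, others are free
                            entries = [anything[k] for k in kids]
                            entries[i] = result[kid]
                            changed |= offer(eclass, make(symbol, entries, cost))
    elif isinstance(sketch, Or):            # Case 4: merge both alternatives
        for part in (sketch.left, sketch.right):
            for eclass, entry in ex(egraph, part, cost, memo).items():
                offer(eclass, entry)
    else:
        raise TypeError(f"unknown sketch: {sketch!r}")
    memo[sketch] = result
    return result

# Usage: e-class 2 holds both a*2 and a<<1; e-class 4 holds (class 2) / 2.
egraph = {0: [("a", ())], 1: [("2", ())],
          2: [("mul", (0, 1)), ("shl", (0, 3))],
          3: [("1", ())], 4: [("div", (2, 1))]}
weights = {"mul": 3}                        # hypothetical weights
cost = lambda symbol, child_costs: weights.get(symbol, 1) + sum(child_costs)
print(ex(egraph, AnySketch(), cost)[4])
print(ex(egraph, Contains(Node("mul", (AnySketch(), AnySketch()))), cost)[4])
```

In the usage example, the unconstrained extraction prefers the cheaper shift, while the `Contains` sketch forces the multiplication to be kept at a higher cost. The `Contains` case relies on the cost function being monotonic: choosing one child from the containing map and the cheapest term for every other child cannot be beaten by a more expensive choice for the other children.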
This section introduced sketch-guided equality saturation as a semi-automated process where the programmer guides multiple equality saturation searches. They do so by specifying sketches that declaratively describe the program shapes after the desired optimizations have been applied.
This section compares three different optimization methods: Elevate rewriting strategies, fully automated equality saturation, and the new sketch-guided equality saturation. In the evaluation both equality saturation techniques use the efficient implementation of name bindings presented in section 3.
We evaluate the optimization of a matrix multiplication as it allows us to compare sketch-guided equality saturation against published Elevate strategies that specify realistic optimizations equivalent to TVM schedules (hagedorn2020-elevate). TVM is a state-of-the-art deep learning compiler (tvm-2018) and (hagedorn2020-elevate) demonstrates that expressing optimizations performed by TVM as compositions of rewrites is possible and achieves the same high performance as TVM. The optimizations performed by TVM and Elevate are typical compiler optimizations, including loop blocking, loop permutations, vectorization, and parallelization. Overall we evaluate all seven differently optimized versions of matrix multiplication presented in (hagedorn2020-elevate) and described in the TVM manual.
In this evaluation, we make sure that the generated C code is equivalent modulo variable names to the manually optimized versions that already demonstrated competitive performance compared with TVM. First, we compare how much time and memory are required for fully automated equality saturation and our sketch-guided equality saturation. Then, we discuss how much programmer effort is required using manually written Elevate strategies compared to writing our sketches.
5.1. Optimization Time and Memory Consumption
Table 3. Runtime and memory consumption of sketch-guided equality saturation.
|version||sketches||found?||runtime||RAM||rules applied||e-nodes, e-classes|
|baseline||1||yes||0.5s||20 MB||2||51, 49|
|blocking||2||yes||7s||0.3 GB||11K||11K, 7K|
|vectorization||3||yes||7s||0.4 GB||11K||11K, 7K|
|loop-perm||3||yes||4s||0.3 GB||6K||10K, 7K|
|array-packing||4||yes||5s||0.4 GB||9K||10K, 7K|
|cache-blocks||4||yes||5s||0.5 GB||9K||10K, 7K|
|parallel||4||yes||5s||0.4 GB||9K||10K, 7K|
Both Elevate (https://github.com/elevate-lang/elevate) and our full Risegg implementation (https://github.com/rise-lang/shine/tree/sges/src/main/scala/rise/eqsat) are written in Scala, and we use standard Java utilities for measurements: System.nanoTime() to measure runtime, and the Runtime API to compute maximum memory consumption by sampling regularly.
The experiments are performed on two platforms. For Elevate strategies and our sketch-guided equality saturation, we use a less powerful AMD Ryzen 5 PRO 2500U with 4GB of RAM available to the JVM. For fully-automated non-guided equality saturation, we use a more powerful Intel Xeon E5-2640 v2 with 60GB of RAM available to the JVM.
Fully Automated Equality Saturation
Table 2 shows the runtime and memory consumption of using a single fully automated equality saturation to perform seven different optimizations. The search terminates as soon as the optimized version is found in the e-graph. Most optimizations are not found before exhausting the 60GB of available memory. Only the “baseline” and “blocking” versions are found and the search for the blocking version requires more than 1h and about 35GB of RAM. Millions of rewrite rules are applied and the e-graph contains millions of e-nodes and e-classes. More complex optimizations involve more rewrite rules, creating a richer space of equivalent programs but exhausting memory faster. As examples, “vectorization” and “loop-perm” include vectorization rules, while “array-packing”, “cache-blocks” and “parallel” include rules for memory storage.
Sketch-Guided Equality Saturation
Table 3 shows the runtime and memory consumption of our sketch-guided equality saturation, where sketches guide the optimization process and break a single equality saturation search into multiple searches. All optimizations are found in less than 10s, using at most 0.5GB of RAM. Interestingly, the number of rewrite rules applied by our semi-automatic approach is in the same order of magnitude as for the manual Elevate strategies (hagedorn2020-elevate). On one hand, equality saturation applies more rules than necessary because of its explorative nature. On the other hand, Elevate strategies apply more rules than necessary because they re-apply the same rule to the same sub-expression and do not orchestrate the shortest rewrite path possible. The constructed e-graphs contain no more than 11K e-nodes and 7K e-classes, at least two orders of magnitude fewer than the millions required for “blocking” without sketch guidance.
While complex optimizations of matrix multiplication are not feasible with fully automated equality saturation, they are feasible with sketch-guided equality saturation. In finding all optimized versions in less than 10 seconds, our semi-automated technique is practical and only about one order of magnitude slower than the Elevate strategies that perform these optimizations in under a second. Next, we investigate the programmer effort of sketch-guided equality saturation compared to orchestrating the rewrite sequence manually with Elevate.
5.2. Programmer Effort
Elevate strategies (hagedorn2020-elevate) are programs describing optimizations as compositions of rewrite rules. While Elevate is a functional language, strategies are inherently imperative in nature as they describe the rewrite steps required to transform the initial program into an optimized one. In contrast, sketches declaratively describe the desired programs rather than how to reach them.
Elevate enables the development of abstractions that help write concise strategies, such as the one performing the “blocking” optimization in LABEL:lst:elevate-blocking. Unfortunately, these abstractions are often program specific and complex to implement. For example, the reorder abstraction in line 5 is defined as shown in LABEL:lst:elevate-reorder. The implementation of this one abstraction is 43 lines long, involves the definition of 8 internal strategies, and carefully composes them together with more generic strategies in a recursive process. Still, this reorder abstraction is – despite its name – not capable of reordering generic nestings of map and reduce patterns, but only works for the matrix multiplication example. Additionally, it is hard for the programmer to reason about the list parameter which represents the desired reordering: what will the resulting program look like? Overall, the “blocking” optimization requires 112 lines of program-specific Elevate code.
The first authors of (hagedorn2020-elevate) and (koehler2021-elevate-imgproc), who used Elevate for optimizing matrix multiplication and image processing pipelines, estimated (in private communication with us) that they spent between two person-weeks and one person-month to develop the Elevate strategies.
To demonstrate the simplicity of sketches, we discuss those written for the matrix multiplication versions. We start by defining useful sketch abstractions that combine generic constructs from SketchBasic with Rise-specific type annotations (LABEL:rise-sketch-abs). In the code, -> is a function type and n.dt an array type of n elements of type dt. The type annotations restrict the iteration domains of patterns like map and reduceSeq. We use similar definitions for other language constructs.
LABEL:mm-baseline-sketch shows the sketch for the (basically unoptimized) “baseline” version. The sketch describes the desired program structure of two nested map patterns and a nested reduce. The comments on the right show the equivalent nested for loops.
The sketch describing the “blocking” version in LABEL:mm-blocking-sketch corresponds to an optimized program where the imperative loop nests have been split and reordered such that the iteration space is chunked into blocks that are processed by the three innermost for loops.
In contrast to the strategy in LABEL:lst:elevate-blocking, sketches focus on the effect the optimization has on the program, rather than on how the transformation is performed step by step. Developing sketches is significantly easier: we estimate that it took about two person-days to develop all sketches used for our evaluation, in contrast to the weeks required to develop the strategies. We also believe that it is more intuitive to describe familiar program shapes than to compose rules that rewrite the program.
Sketches for Guiding the Search
Using only the sketch in LABEL:mm-blocking-sketch enables equality saturation to find the optimized “blocking” program, but requires over 1 hour and 35GB. By guiding the search with an additional sketch shown in LABEL:mm-blocking-split-sketch we can significantly accelerate the search to less than 10s. This guiding sketch describes a program shape where the map and reduce patterns have been split but not yet reordered.
Table 4 shows how each optimization version can be described by logical optimization steps, each corresponding to a sketch describing the program after the step has been applied. Interestingly, the split sketch shown in LABEL:mm-blocking-split-sketch is a useful first guide for all optimized versions.
Choice of Rules and Cost Model
Besides the sketches, programmers also specify the rules used in each search and a cost model. For the split sketch, 8 rules are required that explain how to split map and reduce. The reorder sketches require 9 rules that swap various nestings of map and reduce. The store sketch requires 4 rules and the lower sketches 10 rules including map-fusion, 6 rules for vectorization, 1 rule for loop unrolling and 1 rule for loop parallelization. If we naively use all rules for the blocking search, the search time increases by about 25×, still finding the optimizations in minutes but showing the importance of selecting a small set of rules.
We use a simple cost model that minimizes weighted term size. Common rules and cost models are easy to reuse.
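A weighted term size cost model can be sketched in a few lines of Python. This is an assumption-laden illustration: terms are nested tuples `(symbol, child_1, ..., child_n)`, and the function name `weighted_term_size` and the weights shown are hypothetical, not the values used in the evaluation.

```python
def weighted_term_size(term, weights, default=1):
    """Sum a per-symbol weight over every node of the term.

    The cost of a term depends only on its root symbol and on the costs of
    its children (local), and it never decreases when a child becomes more
    expensive (monotonic), as long as all weights are non-negative.
    """
    symbol, *children = term
    return weights.get(symbol, default) + sum(
        weighted_term_size(child, weights, default) for child in children)

# Usage with hypothetical weights that penalize `map` nodes.
term = ("map", ("reduceSeq", ("add",), ("zero",)))
print(weighted_term_size(term, {"map": 2}))  # 5
```

Because the weights live in a plain dictionary, the same function can be reused across searches with different weightings.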
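Table 4. Logical optimization steps, each corresponding to a sketch, for each optimized version.
|version||sketches (logical optimization steps)|
|blocking||split + reorder|
|vectorization||split + reorder + lower|
|loop-perm||split + reorder + lower|
|array-packing||split + reorder + store + lower|
|cache-blocks||split + reorder + store + lower|
|parallel||split + reorder + store + lower|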
6. Related Work
Some compiler optimizations can be fully automated via equality saturation or heuristic searches (tate2009-equality-saturation; yang2021-eqsat-tensor; lift-rewrite-2015; polymage-2015). Although this approach can automatically yield high performance, it is not always feasible or even desirable as it lacks user control, may result in poor performance and may be too time-consuming.
Optimization Strategies. Compiler optimizations can be precisely controlled with rewriting strategies (visser1998-strategies; hagedorn2020-elevate; koehler2021-elevate-imgproc) or schedules (halide-2012; tvm-2018). However, optimization strategies are challenging to write, non-declarative, and quickly become over-detailed and program-specific.
Equality Saturation with Bindings. We are not the first to attempt applying equality saturation to languages with binding. Willsey et al. implement a partial evaluator for the lambda calculus in (willsey2021-egg) using explicit substitution. Although conceptually simple, this implementation has a prohibitively high performance cost as we demonstrated in section 3. In (smith2021-access-patterns), Smith et al. propose access patterns as a way to avoid the need for binding structures when representing tensor programs. In this paper, we present practical solutions to make equality saturation with bindings more efficient.
Sketching. The idea of sketching has been introduced for program synthesis (solar2008-synthesis-sketching), along with counterexample guided inductive synthesis which combines a synthesizer with a validation procedure. Our approach in this paper is different as we target optimizations rather than program synthesis. We use sketches as program patterns to filter a set of equivalent programs generated via equality saturation, and as a result, we do not need a validation procedure.
Equational Reasoning with E-graphs. E-graphs were originally designed for efficient congruence closure in theorem provers (nelson1980-techniques), and are used in the Z3 theorem prover (de2008-z3). They have also recently been used for semantic code search in the Yogo tool (premtoon2020-yogo). In the realm of theorem proving, rewriting strategies can be compared to procedural proof languages (harrison1996-hol; paulson1994-isabelle), while the sketch approach can be compared to declarative proof languages (corbineau2007-declarative-coq; wenzel2002-mizar; norell2007-agda; kaufmann1996-acl2).
This paper broadens the applicability of equality saturation for programming languages in two ways. We drastically improve the efficiency of equality saturation for languages with name bindings, like the lambda calculus. We propose sketch-guided equality saturation as a semi-automatic technique to scale equality saturation to complex optimizations that require long rewrite sequences. The experimental evaluation demonstrates that sketch-guided equality saturation enables seven sophisticated optimizations of matrix multiplication to be applied within seconds. The guided approach is declarative, and requires far less effort than imperative Elevate rewrite strategies. Moreover, most of the optimizations cannot be applied with fully-automated equality saturation.
We would like to thank Max Willsey for the open-source egg library and his valuable feedback, and the Rise and Elevate teams for their open-source work. This work was supported by the Engineering and Physical Sciences Research Council [EP/W007940/1].