1. Introduction
Term rewriting has been effective in optimizing compilers for decades (dershowitz1993rewritesystems). However, deciding when to apply each rewrite rule is hard and has a huge impact on the performance of the rewritten program: the so-called phase ordering problem. The challenge is that the global benefit of applying a rewrite rule depends on future rewrites. Maximizing local benefit in a greedy fashion is not sufficient in the absence of a convergence property, i.e. confluence and termination, as local optima may be far away from the global optimum.
Equality saturation (tate2009equalitysaturation; willsey2021egg) mitigates the phase ordering problem by exploring many ways to apply rewrite rules. Starting from an input program, an equality graph (egraph) is grown iteratively until reaching a fixed point (saturation), achieving a goal, or timing out. An egraph efficiently represents a large set of equivalent programs, and is grown by repeatedly applying all possible rewrite rules in a purely additive way. After growing the egraph, the best program found is extracted from it using a cost model, e.g. one that selects the fastest program.
The applicability of equality saturation has recently been broadened by introducing an amortized invariant restoration technique called rebuilding and a mechanism called eclass analyses (willsey2021egg). Nevertheless, the application of equality saturation for complex optimizations of realistic programs is limited by the following two issues.
Languages with name bindings. Previous equality saturation work either explicitly avoids the use of name binding for efficiency reasons (smith2021accesspatterns), or uses a simple but inefficient implementation (willsey2021egg). As almost all programming languages use variables, and hence name binding, this paper explores practical ways of efficiently implementing equality saturation for languages with name bindings. We study equality saturation for the lambda calculus as it is the standard formalism for functional languages. We show that using De Bruijn indices avoids overloading the egraph with equivalent terms, and that an approximate substitution enables searches performed in milliseconds where searches with naive explicit substitution quickly run out of memory.
Complex optimizations, i.e. those requiring long rewrite sequences. On each equality saturation iteration, the egraph tends to grow bigger since every possible rewrite rule is applied in a purely additive way. The growth rate is extremely rapid for some combinations of rewrite rules, such as associativity and commutativity that generate an exponential number of equivalent permutations (wang2020spores; nandi2020synthesizingCAD; willsey2021egg). In such cases, discovering long rewrite sequences that require many iterations is unfeasible. One way to address this issue is to limit the number of rules applied (wang2020spores; willsey2021egg), but this risks not finding optimizations that require rules that have been omitted. A second way is to use an external solver to speculatively add equivalences (nandi2020synthesizingCAD), but this requires the identification of subtasks that can benefit from being delegated.
This paper proposes sketch-guided equality saturation as a technique to break down complex optimizations into smaller ones. The programmer specifies rewrite goals by writing sketches: program patterns that leave details unspecified. While sketches have previously been used as a starting point for program synthesis (solar2008synthesissketching), our work uses sketches to end an equality saturation search once the sketch is satisfied. Guiding the rewriting using a sequence of sketches decomposes it into a sequence of relatively small equality saturation searches.
We demonstrate that sketch-guiding enables complex optimizations in the Rise functional language. We start by showing that existing equality saturation techniques are not sufficient for applying complex optimizations as the egraph grows too large. Previous work on Rise produced highly optimized code at the cost of the programmer orchestrating sequences of thousands of rewrite rules (hagedorn2020elevate). Our evaluation demonstrates that by combining our efficient name binding techniques with sketch-guiding, complex optimizations are discovered by equality saturation with little programmer guidance and in a matter of seconds.
To summarize, the contributions of this paper include:

The development of new techniques to support efficient equality saturation for a typed lambda calculus. The techniques are realized in the Risegg implementation for the Rise data-parallel functional language. We demonstrate the effectiveness of Risegg by optimizing a binomial filter application (section 3).

The proposal of sketch-guided equality saturation as a new semi-automatic technique to perform complex optimizations that require long rewrite sequences not discoverable by equality saturation alone. We demonstrate the practicality of sketches for guiding realistic optimizations of Harris corner detection (section 4).

A systematic comparative evaluation of sketch-guided equality saturation for optimizing matrix multiplication. Seven complex optimizations are demonstrated, including loop blocking, vectorization, and parallelization. We show that the complex optimizations are not feasible with fully automated equality saturation due to excessive runtime and memory consumption. In contrast, sketch-guided equality saturation performs the optimizations in seconds and with low memory consumption. At most four sketches are required to guide the search, i.e. significantly less effort than purely manual techniques (section 5).
2. Background
This section gives a technical overview of equality saturation and its application to express compiler optimizations. We also introduce the functional language Rise (hagedorn2020elevate) in which the programs we optimize in this paper are expressed.
2.1. Equality saturation
Equality saturation (tate2009equalitysaturation; willsey2021egg) is a technique for efficiently implementing rewrite-driven compiler optimizations without committing to a single rewrite choice. We demonstrate how equality saturation mitigates the phase ordering problem by using a rewriting example where greedily reducing a cost function is not sufficient to find the optimal program.
Rewriting is often used to fuse operators and avoid having every operator write its result to memory, for example:
(a) $(\mathit{map}\ (\mathit{map}\ f)) \circ \mathit{transpose} \circ (\mathit{map}\ (\mathit{map}\ g))$
(b) $(\mathit{map}\ (\mathit{map}\ (f \circ g))) \circ \mathit{transpose}$
The initial term (a) applies function $g$ to each element of a matrix (using two nested $\mathit{map}$s), transposes the result, and then applies function $f$ to each element. The optimized term (b) avoids storing an intermediate matrix in memory and transposes the input before applying $g$ and $f$ to each element. The following rewrite rules are sufficient to perform this optimization, if applied in the correct order:
(1) $\mathit{transpose} \circ \mathit{map}\ (\mathit{map}\ f) = \mathit{map}\ (\mathit{map}\ f) \circ \mathit{transpose}$
(2) $f \circ (g \circ h) = (f \circ g) \circ h$
(3) $\mathit{map}\ f \circ \mathit{map}\ g = \mathit{map}\ (f \circ g)$
Rule (1) states that transposing a two-dimensional array before or after applying a function to the elements is equivalent. Rule (2) states that function composition is associative. Finally, rule (3) is the rewrite rule for map fusion. In this example, minimizing the term size results in maximizing fused maps and, therefore, is a good cost model.
If we greedily apply rewrite rules that lower term size, we will only apply rule (3) as this is the only rule that reduces term size. However, rule (3) cannot be directly applied to term (a): we are in a local optimum. The only way to reduce term size further is to first apply the other rewrite rules, which may or may not pay off depending on future rewrites.
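The local optimum described above can be reproduced with a small sketch. It assumes terms are nested Python tuples, that rules (1) and (2) may be applied in both directions, and that exhaustive exploration over a plain set of terms stands in for an egraph (which would share subterms instead of enumerating them). All function names are hypothetical.

```python
T = "transpose"

def comp(a, b):
    return ("comp", a, b)

def map_(f):
    return ("map", f)

def is_map(t):
    return isinstance(t, tuple) and t[0] == "map"

def size(t):
    """Number of symbols in a term (the cost model)."""
    return 1 + sum(size(c) for c in t[1:]) if isinstance(t, tuple) else 1

def root_rewrites(t):
    """Terms obtained by applying one rule at the root of t."""
    out = []
    if isinstance(t, tuple) and t[0] == "comp":
        _, x, y = t
        # Rule (1), both directions: transpose commutes with map (map f).
        if (x == T and is_map(y) and is_map(y[1])) or \
           (y == T and is_map(x) and is_map(x[1])):
            out.append(comp(y, x))
        # Rule (2), both directions: composition is associative.
        if isinstance(y, tuple) and y[0] == "comp":
            out.append(comp(comp(x, y[1]), y[2]))
        if isinstance(x, tuple) and x[0] == "comp":
            out.append(comp(x[1], comp(x[2], y)))
        # Rule (3): map fusion, the only rule that reduces term size.
        if is_map(x) and is_map(y):
            out.append(map_(comp(x[1], y[1])))
    return out

def neighbors(t):
    """Terms obtained by applying one rule anywhere inside t."""
    yield from root_rewrites(t)
    if isinstance(t, tuple):
        for i in range(1, len(t)):
            for sub in neighbors(t[i]):
                yield t[:i] + (sub,) + t[i + 1:]

def greedy(t):
    """Only take rewrites that strictly lower the cost."""
    while True:
        better = [n for n in neighbors(t) if size(n) < size(t)]
        if not better:
            return t
        t = min(better, key=size)

def explore(t):
    """Breadth-first exploration of every reachable term."""
    seen, frontier = {t}, [t]
    while frontier:
        new = []
        for u in frontier:
            for n in neighbors(u):
                if n not in seen:
                    seen.add(n)
                    new.append(n)
        frontier = new
    return seen

a = comp(map_(map_("f")), comp(T, map_(map_("g"))))
b = comp(map_(map_(comp("f", "g"))), T)
stuck = greedy(a)                   # greedy rewriting cannot leave term (a)
best = min(explore(a), key=size)    # exhaustive exploration reaches a fused term
```

Greedy rewriting returns the initial term unchanged, whereas the exhaustive search reaches term (b) after first applying the size-preserving rules.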
We now investigate step by step how equality saturation makes it possible to minimize term size by exploring many possible ways to apply rewrites without getting stuck in local minima.
First, an equality graph (egraph) representing the initial term is constructed (fig. 0(a)). An egraph is a set of equivalence classes (eclasses). An eclass is a set of equivalent nodes (enodes). An enode is an $n$-ary function symbol ($F$) from the term language, associated with $n$ child eclasses ($e_1, \ldots, e_n$). Examples of symbols are $\mathit{map}$, $\mathit{transpose}$, and $\circ$. The egraph data structure is used during equality saturation to efficiently represent and rewrite a set of equivalent programs.
Second, the egraph is iteratively grown by applying rules non-destructively (figs. 0(d), 0(c) and 0(b)). While standard term rewriting picks a single possible rewrite in a depth-first manner, equality saturation explores all possible rewrites in a breadth-first manner. Within an equality saturation iteration, rewrites are applied independently: they may only depend on rewrites from previous iterations. For the sake of simplicity, we only apply a handful of rewrite rules in fig. 1. When applying a rewrite rule, the equality between its left-hand side and its right-hand side is recorded in the egraph. Rewrite rules stop being applied when a fixed point is reached (saturation), or when another stopping criterion is met (e.g. a timeout). If saturation is reached, all possible rewrites have been explored.
At that point, the egraph represents many terms that are equivalent according to the applied rules. An egraph is much more compact than a regular set of terms, as equivalent subterms are shared. Egraphs are capable of representing exponentially many terms in polynomial space, and even infinitely many terms in the presence of cycles (willsey2021egg). To maximize sharing, a congruence invariant is maintained: intuitively identical enodes should not be in different eclasses (fig. 2). However, we will see later that extensive sharing does not necessarily prevent the egraph size from exploding.
Finally, an extraction procedure selects the best term from the egraph according to a cost function. For a local cost function, i.e. one that computes the cost of a term from its function symbol and the costs of its children, a relatively simple bottom-up egraph traversal can be used (panchekha2015herbie). More complex cost functions require more complex extraction procedures (wang2020spores; wu2019carpentry).
An eclass analysis (willsey2021egg) propagates analysis data of type $D$ in a bottom-up fashion, and can be used for extraction when the cost function is local. An eclass analysis is defined by providing two functions:

A function $\mathit{make}$ constructing the analysis data from an $n$-ary symbol combined with the data of its child eclasses.

A function $\mathit{merge}$ merging the analysis data of enodes that are in the same eclass.

To compute the smallest term for each eclass, we define an eclass analysis whose data pairs a term with its size: $\mathit{make}$ adds one to the sum of the child sizes and builds the corresponding term, and $\mathit{merge}$ keeps the smaller of its two arguments.
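As a rough illustration of such an analysis, the following sketch computes the smallest term of every eclass. It assumes an egraph stored as a plain dictionary from eclass identifiers to lists of enodes, and a naive fixed-point loop instead of the incremental propagation a real implementation would use. The names `make`, `merge` and `smallest_terms` are chosen for this example.

```python
def make(symbol, child_data):
    """Analysis data of one enode: (size, term), built from its children's data."""
    size = 1 + sum(s for s, _ in child_data)
    term = (symbol, *[t for _, t in child_data])
    return (size, term)

def merge(d1, d2):
    """Data of an eclass: keep the smaller of two candidates."""
    return d1 if d1[0] <= d2[0] else d2

def smallest_terms(egraph):
    """egraph: dict mapping eclass id -> list of (symbol, child eclass ids)."""
    data = {}
    changed = True
    while changed:                      # iterate until no eclass improves
        changed = False
        for cid, enodes in egraph.items():
            for symbol, children in enodes:
                if any(c not in data for c in children):
                    continue            # a child has no known term yet
                candidate = make(symbol, [data[c] for c in children])
                best = merge(data[cid], candidate) if cid in data else candidate
                if data.get(cid) != best:
                    data[cid] = best
                    changed = True
    return data

# Eclass 0 holds both `a` and `(a * 2) / 2`, so the egraph is cyclic.
egraph = {
    0: [("div", (2, 1)), ("a", ())],
    1: [("2", ())],
    2: [("mul", (0, 1))],
}
result = smallest_terms(egraph)
```

The loop terminates even on cyclic egraphs because the data of an eclass only changes when a strictly smaller term is found.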
2.2. Rewriting the Rise functional language
In this paper we study the lambda calculus because it formalizes functional languages. To demonstrate the impact of our approach in practice, we use Rise (hagedorn2020elevate), which implements a typed lambda calculus, and many of the examples in this paper are Rise programs. Rise is a spiritual successor of Lift (liftrewrite2015; liftir2017), which demonstrated performance portability across hardware by automatically applying semantics-preserving rewrite rules to optimize programs from domains including scientific code (liftstencil2018) and convolutions (mogers2020convolution).
Rise provides standard lambda abstraction (λx. b), function application (f x), identifiers and literals. Rise also offers an extensible set of higher-order functions describing well-known data-parallel computational patterns. While Rise provides many computational patterns, we focus in this paper on two important patterns: map applies a function to each element of an array, and reduce combines all elements of an array into a single value given a binary reduction operator. To make Rise accessible to equality saturation, Rise programs are encoded as terms of shape $F(t_1, \ldots, t_n)$, as shown in table 1.
| Rise program | Symbol | Children |
| λx. b | lam x | b |
| f x | app | f, x |
| x | var x | (none) |
| map | map | (none) |
| reduce | reduce | (none) |
To control optimization, Rise is complemented by a second programming language, Elevate (hagedorn2020elevate), that allows programmers to describe complex compiler optimizations as compositions of rewrite rules, called strategies. The performance of the code generated by Rise and Elevate has been shown to be on par with the state-of-the-art deep learning compiler TVM (tvm2018) for matrix multiplication (hagedorn2020elevate), and competitive with, or even up to 1.4× better than, the state-of-the-art image processing compiler Halide (halide2012) for the Harris corner detection (koehler2021elevateimgproc). This makes Rise an interesting base language for exploring rewrite-based compiler optimizations.
Unfortunately, writing Elevate strategies manually is low level and time-consuming. A strategy describes precisely the rewrite sequence required for a particular optimization. Even though Elevate provides combinators and abstractions to help express complex optimizations, the authors of (hagedorn2020elevate) and (koehler2021elevateimgproc) report that expressing complex optimizations required between 2 and 4 person weeks of effort. The fundamental problem is that Elevate strategies express optimizations in an imperative style and require the programmer to orchestrate all rewrite steps deterministically. Besides being costly to develop, this also significantly limits the applicability of optimizations to many different programs: small program differences require adjustments to the rewrite sequence. So it is highly desirable to find some automatic, or at least semi-automatic, rewriting technique to reduce the programmer effort required to optimize Rise programs.
3. Efficient Equality Saturation for the Lambda Calculus
This section addresses the first issue with prior equality saturation techniques: the lack of effective support for languages with name bindings. We explore the engineering design choices required to efficiently implement equality saturation for a typed lambda calculus. A set of design choices is realized for the Rise language in the new Risegg implementation, which is heavily inspired by the egg library (willsey2021egg). The performance numbers in this section are from a prototype of Risegg (https://github.com/Bastacyclop/eggrise) for an untyped subset of Rise, implemented in Rust on top of the egg library.
To assess the efficiency of equality saturation in this section, our aim is to be able to discover certain rewrite goals in reasonable time on a laptop machine, i.e. with an AMD Ryzen 5 PRO 2500U processor and using no more than 4GB of RAM. Discovering a rewrite goal means that it is feasible to grow an egraph starting from the initial program until the goal program is represented in the eclass of the initial program.
Figure 4: rewrite rules for β-reduction, η-reduction, map-fusion, and map-fission.
Applying equality saturation to lambda calculus terms requires the efficient support of standard operations and rewrites. Figure 4 shows the standard rules of β-reduction and η-reduction. The other two rules encode standard map-fusion and map-fission, and are interesting because they introduce new name bindings on their right-hand side.
3.1. Substitution
In equality saturation, standard term substitution cannot be used to directly compute the substitution $b[e/x]$ required by the β-reduction of $(\lambda x.\, b)\ e$ in an egraph, because $b$ and $e$ are not terms but eclasses. A simple way to address this is to use explicit substitution as in egg (willsey2021egg). A syntactic constructor is added to represent substitution, as well as rewrite rules to encode its small-step behavior.
Explicit substitution adds all intermediate substitution steps to the egraph, quickly exploding its size. We can demonstrate the egraph size problem by attempting to discover a trivial rewrite goal using equality saturation:
(4)  
The rewrite in (4) merely requires a sequence of two map-fission rules, one map-fusion rule, plus a couple of the β-reduction and η-reduction rules that are pervasive when rewriting functional programs:
(5)  
Despite this simple rewrite sequence, rewrite goal (4) cannot reasonably be discovered using explicit substitution. After more than 40 seconds of equality saturation (10 iterations) the available 4GB memory is exhausted and the goal has not been discovered. The egraph contains 13M enodes and 3M eclasses.
So intermediate substitution steps cannot be added to the egraph, otherwise it grows uncontrollably. To avoid intermediate substitutions we propose extraction-based substitution, which works as follows:
1. extract a term for each eclass involved in the substitution (i.e. the eclass of the function body and the eclass of the argument);
2. perform standard term substitution;
3. add the resulting term to the egraph.
Extraction-based substitution is far more efficient than explicit substitution. For example, we can discover the rewrite goal (4) in less than a millisecond, with 5 iterations, and the egraph contains only 364 enodes and 277 eclasses.
Extraction-based substitution is, however, an approximation, as it computes the substitution for a subset of the terms represented by the eclasses involved and ignores the rest. Figure 5 shows an example where the initial egraph is in the middle, and the egraph after extraction-based substitution, for one particular choice of extracted terms, is on the right. This particular choice results in an egraph lacking a program that is included in the egraph without approximation (left in fig. 5).
In practice, we have not observed this approximation to be an issue, and believe this is for two reasons. First, the substitution is computed on each equality saturation iteration, where different terms may be extracted, increasing coverage of the set of terms represented by the eclasses involved. Second, many of the ignored equivalences are recovered either by egraph congruence, or by applying further rewrite rules.
Future work may investigate alternative substitution implementations to balance efficiency with non-approximation. For efficiency, extraction-based substitution is used in Risegg.
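The three steps of extraction-based substitution can be sketched as follows. This is a minimal illustration, not the Risegg code: it assumes an acyclic egraph stored as a dictionary, a hashcons dictionary mapping each enode to its eclass, extraction of the smallest term, and bound names that differ from the free names of the argument (so no capture can occur). The final union with the eclass of the β-redex is left out, and every function name is hypothetical.

```python
def size(t):
    return 1 + sum(size(k) for k in t[1:])

def extract(egraph, cid):
    """Step 1: smallest term of an eclass (assumes an acyclic egraph)."""
    options = [(sym, *[extract(egraph, c) for c in kids])
               for sym, kids in egraph[cid]]
    return min(options, key=size)

def substitute(t, x, e):
    """Step 2: standard substitution of e for variable x in term t."""
    sym, *kids = t
    if sym == ("var", x):
        return e
    if sym == ("lam", x):
        return t                        # x is shadowed below this binder
    return (sym, *[substitute(k, x, e) for k in kids])

def add_term(egraph, hashcons, t):
    """Step 3: add a term, reusing enodes that already exist."""
    sym, *kids = t
    enode = (sym, tuple(add_term(egraph, hashcons, k) for k in kids))
    if enode not in hashcons:
        cid = len(egraph)
        egraph[cid] = [enode]
        hashcons[enode] = cid
    return hashcons[enode]

def beta_by_extraction(egraph, hashcons, x, body, arg):
    tb, te = extract(egraph, body), extract(egraph, arg)
    return add_term(egraph, hashcons, substitute(tb, x, te))

egraph, hashcons = {}, {}
body = add_term(egraph, hashcons, ("app", ("g",), (("var", "x"),)))   # g x
arg = add_term(egraph, hashcons, ("mul", ("z",), ("2",)))             # z * 2
# Pretend an earlier rewrite proved z * 2 = z << 1 (same eclass).
shl = ("shl", (hashcons[("z", ())], add_term(egraph, hashcons, ("1",))))
egraph[arg].append(shl)
hashcons[shl] = arg
result = beta_by_extraction(egraph, hashcons, "x", body, arg)
```

In this example only `g (z * 2)` is substituted explicitly, yet the new enode points at the whole argument eclass, so `g (z << 1)` is represented as well. This is one instance of ignored equivalences being recovered through sharing.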
3.2. Name Bindings
In equality saturation, inappropriate handling of name bindings easily leads to serious efficiency issues. Consider rewrite rules like map-fusion that create a new lambda abstraction on their right-hand side. Which name should they introduce when they are applied? In standard term rewriting, generating a fresh name using a global counter (a.k.a. gensym) is a common solution. However, if a new name is generated each time the rewrite rule is applied, the egraph will quickly be overloaded with many α-equivalent terms (two terms are α-equivalent if one term can be made identical to the other simply by renaming variables).
Fewer α-equivalent terms are introduced if fresh names are generated as a function of the matched eclass identifiers. However, as the egraph grows and eclasses are merged, eclass identifiers change, and α-equivalent terms are still generated and duplicated in the egraph.
Figure 6 shows an example that demonstrates the practical issues when rewriting with α-equivalent terms. This Rise program computes a binomial filter (a 2D convolution) that is expressed using the slide pattern, which creates a sliding window to group neighboring elements that are then multiplied with the convolution kernel and summed. The purpose of the rewrite shown is to separate the 2D filter into two 1D filters according to a well-known convolution kernel equation:
(6) 
This separation optimization reduces both memory accesses and arithmetic complexity. An Elevate rewriting strategy achieves the optimization by orchestrating 30 rewrite rules including 17 β/η-reductions (koehler2021elevateimgproc).
The binomial filter optimization goal cannot be discovered by equality saturation if fresh names are generated for each rewrite rule application. After two minutes and 9 iterations the 4GB memory is exhausted, and the goal has not been discovered. The egraph contains 2.9M enodes and 1.4M eclasses, emphasizing the need for more efficient name handling.
De Bruijn indices (DEBRUIJN1972381) are a standard technique for representing lambda calculus terms without naming the bound variables, and avoid the need for α-conversions. If De Bruijn indices make two α-equivalent terms structurally identical, the regular egraph congruence invariant (which ensures that identical enodes do not end up in different eclasses) is enough to prevent the duplication of α-equivalent terms. Therefore, we translate our terms and rewrite rules to use De Bruijn indices instead of names, and observe a significant change in efficiency. With De Bruijn indices, 100ms is enough to discover the binomial filter rewrite goal. After 11 iterations, the egraph contains 3K enodes and 1K eclasses. Hence De Bruijn indices are used in Risegg.
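To make the effect on duplication concrete, the following sketch converts named terms to De Bruijn form. The tuple encoding and the names `to_de_bruijn` and `fused` are assumptions made for this example, and index 0 is taken to refer to the innermost binder. Two results of map-fusion that differ only in their generated name become the same term, so an egraph would store them once.

```python
def to_de_bruijn(t, env=()):
    """Replace bound variable names by indices (0 = innermost binder)."""
    kind = t[0]
    if kind == "var":
        name = t[1]
        # Free variables keep their name; bound ones become an index.
        return ("idx", env.index(name)) if name in env else t
    if kind == "lam":
        _, x, body = t
        return ("lam", to_de_bruijn(body, (x,) + env))
    if kind == "app":
        return ("app", to_de_bruijn(t[1], env), to_de_bruijn(t[2], env))
    return t                            # constants such as ("const", "map")

F, G = ("const", "f"), ("const", "g")

def fused(name):
    """map (\\name. f (g name)), with a caller-chosen bound name."""
    body = ("app", F, ("app", G, ("var", name)))
    return ("app", ("const", "map"), ("lam", name, body))

t1, t2 = fused("x1"), fused("x2")       # two gensym-style fresh names
same = to_de_bruijn(t1) == to_de_bruijn(t2)
```

Here `t1` and `t2` are different named terms, while their De Bruijn forms are structurally equal.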
True equality modulo renaming
While De Bruijn indices give a significant performance improvement they do not provide equality modulo renaming for subterms. Consider , where represents De Bruijn indices. Although and are structurally different, they both correspond to the same variable . Recent work has shown how to implement efficient hashing modulo alpha renaming (maziarz2021hashmodalphaeq), and could be used to investigate an even more efficient egraph representation. Another possibility would be to investigate the effectiveness of nominal rewriting techniques (fernandez2007nominalrewriting).
Translating name-based rules into index-based rules
Using De Bruijn indices means that rewrite rules also need to manipulate terms with De Bruijn indices. Thankfully, more user-friendly name-based rewrite rules can be automatically translated to the index-based rules used internally (bonelli2000bruijnrewriting).
Shifting De Bruijn indices
De Bruijn indices must be shifted when a term is used with different surrounding lambdas. In Risegg, extraction-based index shifting works like substitution, in three steps:
1. extract a term from the eclass whose indices need shifting;
2. perform standard index shifting;
3. add the resulting term to the egraph.
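The second step, standard index shifting, can be sketched on plain terms as follows. The tuple encoding and the function name `shift` are assumptions of this example: indices below the cutoff are bound inside the term and stay unchanged, while the remaining (free) indices are moved by the requested amount.

```python
def shift(t, by, cutoff=0):
    """Add `by` to every free De Bruijn index of t."""
    kind = t[0]
    if kind == "idx":
        return ("idx", t[1] + by) if t[1] >= cutoff else t
    if kind == "lam":
        # One more binder is in scope inside the body.
        return ("lam", shift(t[1], by, cutoff + 1))
    if kind == "app":
        return ("app", shift(t[1], by, cutoff), shift(t[2], by, cutoff))
    return t                            # constants are unaffected

# A lambda whose body applies its own parameter (index 0)
# to a variable bound outside the term (index 1).
term = ("lam", ("app", ("idx", 0), ("idx", 1)))
moved = shift(term, 1)                  # the term is placed under one more lambda
```

Only the outer reference changes: `moved` is `("lam", ("app", ("idx", 0), ("idx", 2)))`.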
Avoiding Name Bindings using Combinators
It is also possible to avoid name bindings entirely (smith2021accesspatterns). For example, it is possible to introduce a function composition combinator ‘∘’ (also used in section 2.1) that greatly simplifies the map-fusion and map-fission rules:
(intro)  
(mapfusion)  
(mapfission) 
However, this approach has its own downsides. Associativity rules are required, which we know increases the growth rate of the egraph (willsey2021egg). Only using a left/rightmost associativity rule avoids generating too many equivalent ways to parenthesize terms. But other rewrite rules now have to take this associativity convention into account, making their definition more difficult and their matching more expensive. In general, matching modulo associativity or commutativity are algorithmically hard problems (benanav1987complexitymatching).
The combinator on its own is also not sufficient to remove the need for name bindings. At one extreme, combinatory logic could be used as any lambda calculus term can be represented in combinatory logic, replacing function abstraction by a limited set of combinators. However, translating a lambda calculus term into combinatory logic results in a term of size in the worst case, where is the size of the initial term (lachowski2018complexity). Translating existing rewrite systems to combinatory logic would be challenging in itself.
3.3. Freshness Predicates
Handling predicates is not trivial in equality saturation. The η-reduction rule has the side condition that the bound variable is not free in the function being reduced ("$x$ not free in $f$"), but in an egraph $f$ is an eclass and not a term.
The predicate could be handled precisely by filtering the eclass $f$ into the set $f'$ of its terms in which $x$ is not free, and using $f'$ on the right-hand side of the rule. However, this requires splitting an eclass in two: one that satisfies the predicate, and one that does not. This would reduce sharing and increase the egraph size, be difficult to reason about in the presence of cycles, and interfere with the egraph amortized invariant restoration optimization (willsey2021egg).
The design of Risegg makes an engineering trade-off, as for substitution, and following egg (willsey2021egg). In Risegg, the η-reduction rewrite rule is only applied if $x$ is not free in any term represented by $f$. Advantages are that this predicate is efficient to compute using an eclass analysis, and that there is no need to split the eclass. The disadvantage is that it is an approximation that ignores some valid terms. Figure 7 shows an example where η-reduction is not applied by Risegg. In practice, we have not observed this approximation to be an issue, e.g. for the results presented in section 5.
3.4. Adding Types
Typed lambda calculi are pervasive providing the foundation for almost all functional languages, and a key consideration is how to add types into the egraph. Considering the terms and , there are broadly two alternatives:

Keep types polymorphic (one eclass):

Instantiate the types (two eclasses):
While keeping types polymorphic enables more sharing, instantiating types enables more precise type-based rewriting. While polymorphic types can be computed using an eclass analysis, instantiated types must be embedded in the egraph. There is no obvious best solution; instead, we observe a trade-off between the amount of sharing and the amount of information. Since Rise rewrite rules often match precise types, types are instantiated in Risegg.
Rewrite rule type inference
To avoid having to explicitly type rewrite rules by hand, we infer their types. After inferring the types on the left-hand side, we check that the right-hand side is well-typed for any well-typed program matching the left-hand side. When applied, typed rewrite rules match (deconstruct) types with their left-hand side, and construct types on their right-hand side. Type annotations can be used in Risegg to constrain the inferred types.
Hash-consing types
Since types are duplicated many times in the egraph, and since structural type equality is often required, we hash-cons types for efficiency (filliatre2006hashconsing).
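The general idea of hash-consing can be sketched with an interning table. This is an illustration of the technique, not the Risegg data structure: the class `TypeTable`, its `intern` method and the type constructors used below are hypothetical. Each structurally distinct type is stored once and identified by a small integer, so structural equality becomes an integer comparison.

```python
class TypeTable:
    """Stores each structurally distinct type once."""

    def __init__(self):
        self._ids = {}      # structural key -> identifier
        self._nodes = []    # identifier -> structural key

    def intern(self, ctor, *args):
        """args are identifiers of already interned types, or plain values."""
        key = (ctor, args)
        if key not in self._ids:
            self._ids[key] = len(self._nodes)
            self._nodes.append(key)
        return self._ids[key]

    def __len__(self):
        return len(self._nodes)

types = TypeTable()
f32 = types.intern("f32")
row = types.intern("array", 4, f32)             # array of 4 floats
fun1 = types.intern("fun", row, f32)            # array -> float
fun2 = types.intern("fun", types.intern("array", 4, f32), types.intern("f32"))
```

Building the same function type twice yields the same identifier, and the table holds only three entries.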
3.5. Summary
This section has shown important aspects to consider when using equality saturation for languages with name bindings. Specifically, we have seen that De Bruijn indices and extraction-based substitution are critical in practice: they make the difference between running out of memory and optimizing functional programs in seconds.
4. Sketch-Guided Equality Saturation
The previous section discussed crucial techniques for efficiently optimizing functional programs using equality saturation. But are these techniques enough to perform complex optimizations, i.e. those requiring long rewrite sequences? In (hagedorn2020elevate) the authors report that 63,000 rewrite steps are required to perform loop blocking, vectorization, and parallelization optimizations for a matrix multiplication. Our evaluation in section 5 shows that even with the techniques discussed in section 3 equality saturation is unable to perform these optimizations, exhausting the available 60GB memory after more than one hour of search.
The issue is that as the egraph grows, iterations become slower and require more memory. Combined with the fact that the egraph is grown in a breadthfirst manner, this makes finding long rewrite sequences inherently hard. The growth rate is aggravated by some combinations of rewrite rules, such as associativity and commutativity that generate an exponential number of equivalent permutations (wang2020spores; nandi2020synthesizingCAD; willsey2021egg). This motivates keeping the size of the egraph small.
4.1. Keeping the egraph size under control
Amongst the many ways to reduce egraph size during exploration, we identify the following three general directions.
Deleting programs that are not considered valuable.
Often there are programs that we are obviously not interested in and, therefore, these could be removed from the egraph. For example, we implement a filter removing all eclasses that only contribute to constructing programs of more than a given size limit. This is useful because exploring unreasonably large terms is typically not desirable.
Unfortunately, due to the extensive sharing, deleting individual programs from the egraph is difficult. Therefore, Risegg only deletes enodes and eclasses, affecting all of their parents. This prohibits deleting individual programs that share subterms with other programs that we do not want to delete. In the future, it would be interesting to develop more sophisticated capabilities, e.g. deleting the matched left-hand side of a rewrite rule, which would also allow performing normalization.
Prioritizing growth in promising directions.
Previous work proposes rewrite rule schedulers as a way to control which rewrite rules should be applied on a given equality saturation iteration (willsey2021egg). The SimpleScheduler from the egg library applies all rewrite rules on each iteration. The default BackoffScheduler prevents specific rules from being applied too often, reducing egraph growth in the presence of “explosive” rules such as associativity and commutativity. Our experience with Rise is that using the BackoffScheduler is counterproductive because the desired optimizations depend on explosive rules. Future work may look into finding more advanced ways to prioritize egraph growth, but at present Risegg does not use a rewrite rule scheduler.
Starting over.
An effective way to reduce egraph size is to build and grow a new egraph from the most promising term represented in a previous egraph. Our sketch-guided technique develops this approach by defining a sequence of equality saturation searches, while defining a clear goal for each search in the form of a sketch.
4.2. The Intuition for Sketches
When designing optimizations it is useful for the programmer to think about the desired shape of the optimized program. Sketches are program patterns that capture this intuition while leaving details unspecified.
Previous work on optimizing the Harris corner detection image processing pipeline with rewrites (koehler2021elevateimgproc) used program snippets like LABEL:harrisshape3 to explain the effect of the various optimizations. The key insight is that explanatory program snippets can be utilized as sketches such as the one shown in LABEL:harrissketch. This sketch resembles the snippet in LABEL:harrisshape3 but adds program holes (?) and constraints (contains) to explicitly elide program details.
Where Elevate rewriting is purely manual, sketch-guided equality saturation is semi-automated. That is, it allows the programmer to declaratively specify the desired optimization goal (the sketch) without needing to specify the detailed rewrite sequence to get there.
4.3. Sketch Definition
We define the simple SketchBasic language with just four constructors as a proof of concept. The syntax of SketchBasic and the sets of terms that the constructors represent are defined in fig. 9. A sketch $s$ represents a set of terms $R(s)$, such that $R(s) \subseteq T$ where $T$ denotes all terms in the language we rewrite. We say that any $t \in R(s)$ satisfies the sketch $s$.
The sketch $?$ is the least precise as it represents all terms in the language. The sketch $F(s_1, \ldots, s_n)$ represents all terms that match a specific $n$-ary function symbol $F$ from the term language, and whose children satisfy sketches $s_1, \ldots, s_n$. The sketch $\mathit{contains}(s)$ represents all terms containing a term that satisfies sketch $s$. Finally, the sketch $s_1 \mid s_2$ represents terms satisfying either $s_1$ or $s_2$.
When rewriting terms in a typed language, sketches may be annotated with a type sketch ($s : \tau$) constraining the type of terms. If $R_T(\tau)$ denotes the set of terms satisfying the type sketch $\tau$, then $R(s : \tau) = R(s) \cap R_T(\tau)$. The grammar of type sketches depends on the language we rewrite. We elide type sketches from our definition of SketchBasic for simplicity, but use them in section 5.
When writing sketches, a balance has to be found between being too precise (representing only one program) and too vague (representing programs that are not desired). This balance also interacts with the choice of rules, since the programs that may be found by the search are $R(s) \cap E$, where $E$ represents the set of terms that can be discovered to be equivalent to the initial term according to the given rules. This means that using a more restricted set of rules generally enables specifying less precise sketches.
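The four sketch constructors described above can be sketched as a small data type together with a check of whether a single term satisfies a sketch. The class names (`Hole`, `Node`, `Contains`, `Or`) and the tuple encoding of terms are hypothetical, and `Contains` is assumed to also accept a term that itself satisfies the inner sketch. Checking one term at a time is only meant to convey the semantics; the paper performs this check over all terms of an egraph.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hole:                 # matches every term
    pass

@dataclass(frozen=True)
class Node:                 # a symbol whose children satisfy sub-sketches
    symbol: str
    children: tuple = ()

@dataclass(frozen=True)
class Contains:             # some subterm (or the term itself) satisfies `inner`
    inner: object

@dataclass(frozen=True)
class Or:                   # either alternative is satisfied
    left: object
    right: object

def satisfies(term, sketch):
    """term is a tuple (symbol, child terms...)."""
    symbol, *kids = term
    if isinstance(sketch, Hole):
        return True
    if isinstance(sketch, Node):
        return (symbol == sketch.symbol
                and len(kids) == len(sketch.children)
                and all(satisfies(k, s) for k, s in zip(kids, sketch.children)))
    if isinstance(sketch, Contains):
        return satisfies(term, sketch.inner) or any(satisfies(k, sketch) for k in kids)
    if isinstance(sketch, Or):
        return satisfies(term, sketch.left) or satisfies(term, sketch.right)
    raise TypeError(f"unknown sketch: {sketch!r}")

# (map f) xs, encoded with explicit application nodes.
program = ("app", ("app", ("map",), ("f",)), ("xs",))
has_map = Contains(Node("app", (Node("map"), Hole())))
has_reduce = Contains(Node("reduce"))
```

With these definitions, `program` satisfies `has_map` and `Or(has_reduce, has_map)`, but not `has_reduce`.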
4.4. SketchGuided Equality Saturation Algorithm
The idea behind SketchGuided Equality Saturation is to use multiple sketches to decompose a complex rewrite goal into a series of simpler rewrite goals. The overall process is illustrated in fig. 10.
The programmer guides the system by providing a sequence of sketches ($s_1, \dots, s_n$). Successive equality saturation searches are performed to find equivalent terms satisfying one sketch after the other. Because each sketch is satisfied by many terms, the programmer must also provide cost models ($c_1, \dots, c_n$) to select the term to be used as the start point for the next search. Sets of rewrite rules ($r_1, \dots, r_n$) are provided to grow the egraph in each search.
The pseudocode for the sketchguided equality saturation algorithm is shown in LABEL:fig:sketchingalg. The extract function (line 3) is used to extract a term from the egraph that satisfies the specified sketch while minimizing the specified cost model, and we describe it in the next subsection. The found function (line 3) is used to stop growing the egraph by checking whether extract succeeded in returning a term. For efficiency our implementation of found does not rely on extract and checks only if a term could be extracted.
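The overall loop can be illustrated with a small Python sketch. It is a stand-in under loudly stated assumptions rather than the algorithm of the figure: the egraph is replaced by an explicit set of terms grown by one rewrite step per iteration (so there is no sharing of equivalent subterms), sketches are plain predicates, and every name (sketch_guided_search, grow, found, extract, distribute, commute, size) is chosen here for illustration.

```python
def rewrite_everywhere(t, rules):
    """Yield every term obtained by applying one rule at one position of t."""
    for rule in rules:
        yield from rule(t)
    if isinstance(t, tuple):
        head, *args = t
        for i, arg in enumerate(args):
            for new_arg in rewrite_everywhere(arg, rules):
                yield (head, *args[:i], new_arg, *args[i + 1:])


def grow(terms, rules):
    """One growth step: purely additive, known terms are never removed."""
    grown = set(terms)
    for t in terms:
        grown.update(rewrite_everywhere(t, rules))
    return grown


def found(terms, sketch):
    """Check whether some known term satisfies the sketch."""
    return any(sketch(t) for t in terms)


def extract(terms, sketch, cost):
    """Cheapest known term satisfying the sketch, or None."""
    candidates = [t for t in terms if sketch(t)]
    return min(candidates, key=cost) if candidates else None


def sketch_guided_search(term, sketches, rules, costs, max_iters=10):
    """Run one search per sketch; each result seeds the next search."""
    for sketch, rule_set, cost in zip(sketches, rules, costs):
        terms = {term}
        for _ in range(max_iters):
            if found(terms, sketch):
                break
            grown = grow(terms, rule_set)
            if grown == terms:        # saturation: nothing new to discover
                break
            terms = grown
        term = extract(terms, sketch, cost)
        if term is None:              # sketch not reached within the limits
            return None
    return term


# Hypothetical toy rules on terms such as ("mul", ("add", "a", "b"), "c").
def distribute(t):
    if isinstance(t, tuple) and len(t) == 3 and t[0] == "mul" \
            and isinstance(t[1], tuple) and len(t[1]) == 3 and t[1][0] == "add":
        _, (_, x, y), z = t
        yield ("add", ("mul", x, z), ("mul", y, z))


def commute(t):
    if isinstance(t, tuple) and len(t) == 3 and t[0] == "add":
        yield ("add", t[2], t[1])


def size(t):
    return 1 if isinstance(t, str) else 1 + sum(size(a) for a in t[1:])


start = ("mul", ("add", "a", "b"), "c")
is_sum = lambda t: isinstance(t, tuple) and t[0] == "add"
starts_with_bc = lambda t: is_sum(t) and t[1] == ("mul", "b", "c")
result = sketch_guided_search(
    start, [is_sum, starts_with_bc], [[distribute], [commute]], [size, size])
```

In this toy run the first search only needs the distribution rule and the second only the commutation rule, which mirrors the idea of giving each sketch its own small rule set.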
4.5. SketchGuided Equality Saturation Extraction
To extract the best program that satisfies a SketchBasic sketch $s$ from an egraph $g$ we define a helper function $\mathit{ex}(g, c, s)$, where $c$ is a cost function that must be monotonic and local. While extract returns a program from an eclass $e$, the helper returns a map from eclasses to optional tuples of costs and terms. After invoking $\mathit{ex}$ we simply look up the eclass $e$ in the map and extract the term from the optional tuple. For efficiency, we memoize previously computed results of $\mathit{ex}$. The helper is recursively defined over the four SketchBasic cases as follows.
Case 1: $s = {?}$. This case is equivalent to extracting the programs minimizing $c$ from the egraph. We implement this extraction as an eclass analysis (section 2) whose analysis data are tuples of costs and terms, with a function $\mathit{make}$ that constructs analysis data and a function $\mathit{merge}$ that combines analysis data from enodes in the same eclass.
Case 2: $s = F(s_1, \dots, s_n)$. We consider each eclass $e$ containing $F(e_1, \dots, e_n)$ enodes and the terms that should be extracted for each child eclass $e_i$. We write $m[e]$ for indexing into the map $m$ returned by $\mathit{ex}$.
Case 3: $s = \mathit{contains}(s_2)$. We use another eclass analysis and initialize the analysis data to $\mathit{ex}(g, c, s_2)$, corresponding to the base case where every term satisfying $s_2$ also satisfies $\mathit{contains}(s_2)$. To make the analysis data we consider all terms that would contain terms satisfying $\mathit{contains}(s_2)$ and keep the best by reducing them using $\mathit{merge}$.
To merge the analysis data, we do the same as for $?$.
Case 4: $s = s_1 \lor s_2$. We merge the results from $s_1$ and $s_2$.
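The four cases can be illustrated with a small Python sketch over a toy egraph. This is one possible reading of the description above, not the Risegg code: the egraph is a plain dictionary from eclass ids to enodes, the cost function is fixed to term size (which is monotonic and local), sketches are encoded as tuples, the eclass analyses are replaced by simple fixed-point loops, and all names (ex, extract, build, improve) are chosen here.

```python
ANY = ("any",)


def build(sym, picks):
    """Combine chosen child results into (cost, term); the cost is term size."""
    return 1 + sum(c for c, _ in picks), (sym, *(t for _, t in picks))


def improve(best, eclass, cand):
    """Keep cand if it is the first or the cheapest result for eclass."""
    if eclass not in best or cand[0] < best[eclass][0]:
        best[eclass] = cand
        return True
    return False


def ex(g, s, memo=None):
    """Map eclass ids to the cheapest (cost, term) satisfying sketch s."""
    memo = {} if memo is None else memo
    if s in memo:
        return memo[s]
    best = {}
    if s[0] == "any":                       # Case 1: plain cost-based extraction
        changed = True
        while changed:
            changed = False
            for e, enodes in g.items():
                for sym, kids in enodes:
                    if all(k in best for k in kids):
                        picks = [best[k] for k in kids]
                        changed |= improve(best, e, build(sym, picks))
    elif s[0] == "node":                    # Case 2: F(s1, ..., sn)
        _, f, subs = s
        maps = [ex(g, sub, memo) for sub in subs]
        for e, enodes in g.items():
            for sym, kids in enodes:
                if sym == f and len(kids) == len(maps) \
                        and all(k in m for k, m in zip(kids, maps)):
                    picks = [m[k] for k, m in zip(kids, maps)]
                    improve(best, e, build(sym, picks))
    elif s[0] == "contains":                # Case 3: contains(s2)
        anything = ex(g, ANY, memo)
        best = dict(ex(g, s[1], memo))      # base case: terms satisfying s2
        changed = True
        while changed:
            changed = False
            for e, enodes in g.items():
                for sym, kids in enodes:
                    for i, k in enumerate(kids):
                        if k in best and all(j in anything for j in kids):
                            picks = [best[j] if p == i else anything[j]
                                     for p, j in enumerate(kids)]
                            changed |= improve(best, e, build(sym, picks))
    elif s[0] == "or":                      # Case 4: s1 or s2
        best = dict(ex(g, s[1], memo))
        for e, cand in ex(g, s[2], memo).items():
            improve(best, e, cand)
    else:
        raise ValueError(f"unknown sketch: {s!r}")
    memo[s] = best
    return best


def extract(g, eclass, s):
    """Look up eclass in the map and drop the cost, or return None."""
    hit = ex(g, s).get(eclass)
    return None if hit is None else hit[1]


# Toy egraph in which eclass 0 holds a, (a * 2) / 2 and a * 1.
g = {
    0: [("a", ()), ("div", (2, 1)), ("mul", (0, 3))],
    1: [("2", ())],
    2: [("mul", (0, 1))],
    3: [("div", (1, 1)), ("1", ())],
}
DIV = ("node", "div", (ANY, ANY))
HAS_ONE = ("contains", ("node", "1", ()))
```

On this egraph, extract(g, 0, ANY) picks the smallest term for the root eclass, extract(g, 0, DIV) forces a division at the root, and extract(g, 0, HAS_ONE) returns the smallest root term that mentions the constant 1.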
4.6. Summary
This section introduces sketchguided equality saturation as a semiautomated process where the programmer guides multiple equality saturation searches. They do so by specifying sketches that declaratively describe the program shapes after the desired optimizations have been applied.
5. Evaluation
This section compares three different optimization methods: Elevate rewriting strategies, fully automated equality saturation, and the new sketchguided equality saturation. In the evaluation both equality saturation techniques use the efficient implementation of name bindings presented in section 3.
Table 2. Runtime and memory consumption of fully automated equality saturation.
version        sketches  found  time    RAM      rules applied  egraph (enodes, eclasses)
baseline       1         yes    0.5s    20 MB    2              51, 49
blocking       1         yes    1h+     35 GB    5M             4M, 2M
vectorization  1         no     1h+     60 GB+   ???            ???
loopperm       1         no     1h+     60 GB+   ???            ???
arraypacking   1         no     35min   60 GB+   ???            ???
cacheblocks    1         no     35min   60 GB+   ???            ???
parallel       1         no     35min   60 GB+   ???            ???
We evaluate the optimization of a matrix multiplication as it allows us to compare sketchguided equality saturation against published Elevate strategies that specify realistic optimizations equivalent to TVM schedules (hagedorn2020elevate). TVM is a state-of-the-art deep learning compiler (tvm2018), and (hagedorn2020elevate) demonstrates that expressing optimizations performed by TVM as compositions of rewrites is possible and achieves the same high performance as TVM. The optimizations performed by TVM and Elevate are typical compiler optimizations, including loop blocking, loop permutations, vectorization, and parallelization. Overall, we evaluate all seven differently optimized versions of matrix multiplication presented in (hagedorn2020elevate) and described in the TVM manual.
In this evaluation, we make sure that the generated C code is equivalent modulo variable names to the manually optimized versions that already demonstrated competitive performance compared with TVM. First, we compare how much time and memory are required for fully automated equality saturation and our sketchguided equality saturation. Then, we discuss how much programmer effort is required using manually written Elevate strategies compared to writing our sketches.
5.1. Optimization Time and Memory Consumption
Table 3. Runtime and memory consumption of sketchguided equality saturation.
version        sketches  found  time   RAM      rules applied  egraph (enodes, eclasses)
baseline       1         yes    0.5s   20 MB    2              51, 49
blocking       2         yes    7s     0.3 GB   11K            11K, 7K
vectorization  3         yes    7s     0.4 GB   11K            11K, 7K
loopperm       3         yes    4s     0.3 GB   6K             10K, 7K
arraypacking   4         yes    5s     0.4 GB   9K             10K, 7K
cacheblocks    4         yes    5s     0.5 GB   9K             10K, 7K
parallel       4         yes    5s     0.4 GB   9K             10K, 7K
Experimental Setup
Both Elevate (https://github.com/elevatelang/elevate) and our full Risegg implementation (https://github.com/riselang/shine/tree/sges/src/main/scala/rise/eqsat) are written in Scala and we use standard Java utilities for measurements: System.nanoTime() to measure runtime, and the Runtime API to compute maximum memory consumption by sampling regularly.
The experiments are performed on two platforms. For Elevate strategies and our sketchguided equality saturation, we use a less powerful AMD Ryzen 5 PRO 2500U with 4GB of RAM available to the JVM. For fully automated, non-guided equality saturation, we use a more powerful Intel Xeon E5-2640 v2 with 60GB of RAM available to the JVM.
Fully Automated Equality Saturation
Table 2 shows the runtime and memory consumption of using a single fully automated equality saturation to perform seven different optimizations. The search terminates as soon as the optimized version is found in the egraph. Most optimizations are not found before exhausting the 60GB of available memory. Only the “baseline” and “blocking” versions are found and the search for the blocking version requires more than 1h and about 35GB of RAM. Millions of rewrite rules are applied and the egraph contains millions of enodes and eclasses. More complex optimizations involve more rewrite rules, creating a richer space of equivalent programs but exhausting memory faster. As examples, “vectorization” and “loopperm” include vectorization rules, while “arraypacking”, “cacheblocks” and “parallel” include rules for memory storage.
SketchGuided Equality Saturation
Table 3 shows the runtime and memory consumption of our sketchguided equality saturation, where sketches guide the optimization process and break a single equality saturation search into multiple searches. All optimizations are found in less than 10s, using at most 0.5GB of RAM. Interestingly, the number of rewrite rules applied by our semiautomatic approach is in the same order of magnitude as for the manual Elevate strategies (hagedorn2020elevate). On one hand, equality saturation applies more rules than necessary because of its explorative nature. On the other hand, Elevate strategies apply more rules than necessary because they reapply the same rule to the same subexpression and do not orchestrate the shortest rewrite path possible. The constructed egraphs contain no more than 11K enodes and 7K eclasses (Table 3), at least two orders of magnitude fewer than the millions required for “blocking” without sketch guidance (Table 2).
Summary
While complex optimizations of matrix multiplication are not feasible with fully automated equality saturation, they are feasible with sketchguided equality saturation. By finding all optimized versions in less than 10 seconds, our semiautomated technique is practical and only about one order of magnitude slower than the Elevate strategies that perform these optimizations in under a second. Next, we investigate the programmer effort of sketchguided equality saturation compared to orchestrating the rewrite sequence manually with Elevate.
5.2. Programmer Effort
Elevate strategies
Elevate strategies (hagedorn2020elevate) are programs describing optimizations as compositions of rewrite rules. While Elevate is a functional language, strategies are inherently imperative in nature as they describe the rewrite steps required to transform the initial program into an optimized one. In contrast, sketches declaratively describe the desired programs rather than how to reach them.
Elevate enables the development of abstractions that help write concise strategies, such as the one performing the “blocking” optimization in LABEL:lst:elevateblocking. Unfortunately, these abstractions are often program-specific and complex to implement. For example, the reorder abstraction in line 5 is defined as shown in LABEL:lst:elevatereorder. The implementation of this one abstraction is 43 lines long, involves the definition of 8 internal strategies, and carefully composes them together with more generic strategies in a recursive process. Still, this reorder abstraction is, despite its name, not capable of reordering generic nestings of map and reduce patterns, but only works for the matrix multiplication example. Additionally, it is hard for the programmer to reason about the list parameter which represents the desired reordering: what will the resulting program look like? Overall, the “blocking” optimization requires 112 lines of program-specific Elevate code.
The first authors of (hagedorn2020elevate) and (koehler2021elevateimgproc) that used Elevate for optimizing matrix multiplication and image processing pipelines estimated (in private communication with us) that they spent between two person-weeks and one person-month to develop the Elevate strategies.
Sketches
To demonstrate their simplicity, we discuss the sketches written for the matrix multiplication versions. We start by defining useful sketch abstractions that combine generic constructs from SketchBasic with Rise-specific type annotations (LABEL:risesketchabs). In the code, -> is a function type and n.dt an array type of n elements of type dt. The type annotations restrict the iteration domains of patterns like map and reduceSeq. We use similar definitions for other language constructs.
LABEL:mmbaselinesketch shows the sketch for the (basically unoptimized) “baseline” version. The sketch describes the desired program structure of two nested map patterns and a nested reduce. The comments on the right show the equivalent nested for loops.
The sketch describing the “blocking” version in LABEL:mmblockingsketch corresponds to an optimized program where the imperative loop nests have been split and reordered such that the iteration space is chunked into blocks that are processed by the three innermost for loops.
In contrast to the strategy in LABEL:lst:elevateblocking, sketches focus on the effect the optimization has on the program, rather than how the transformation is performed step by step. Developing sketches is significantly easier: we estimate that it took about two person-days to develop all sketches used for our evaluation, in contrast to the weeks required to develop the strategies. We also believe that it is more intuitive to describe familiar program shapes than to compose rules that rewrite the program.
Sketches for Guiding the Search
Using only the sketch in LABEL:mmblockingsketch enables equality saturation to find the optimized “blocking” program, but requires over 1 hour and 35GB. By guiding the search with an additional sketch shown in LABEL:mmblockingsplitsketch we can significantly accelerate the search to less than 10s. This guiding sketch describes a program shape where the map and reduce patterns have been split but not yet reordered.
Table 4 shows how each optimization version can be described by logical optimization steps, each corresponding to a sketch describing the program after the step has been applied. Interestingly, the split sketch shown in LABEL:mmblockingsplitsketch is a useful first guide for all optimized versions.
Choice of Rules and Cost Model
Besides the sketches, programmers also specify the rules used in each search and a cost model. For the split sketch, 8 rules are required explaining how to split map and reduce. The reorder sketches require 9 rules that swap various nestings of map and reduce. The store sketch requires 4 rules and the lower sketches 10 rules including mapfusion, 6 rules for vectorization, 1 rule for loop unrolling and 1 rule for loop parallelization. If we naively use all rules for the blocking search, the search time increases by about 25×, still finding the optimizations in minutes but showing the importance of selecting a small set of rules.
We use a simple cost model that minimizes weighted term size. Common rules and cost models are easy to reuse.
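A weighted term size cost model can be sketched in a few lines of Python. The function name weighted_size, the tuple encoding of terms, and the example symbols and weights are assumptions made for illustration; the weights used in our evaluation are not reproduced here.

```python
def weighted_size(term, weights, default=1):
    """Sum one weight per node of the term; symbols not listed cost `default`."""
    symbol, *children = term
    return weights.get(symbol, default) + sum(
        weighted_size(child, weights, default) for child in children)


# Hypothetical term and weights: penalize one symbol to steer extraction.
term = ("map", ("reduce", ("x",)))
plain = weighted_size(term, {})                  # every node counts as 1
penalized = weighted_size(term, {"reduce": 10})  # reduce counts as 10
```

With positive weights the cost of a term is never lower than the cost of any of its children and only depends on the node itself and its children, which matches the monotonic and local requirement stated in section 4.5.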
Table 4. Logical optimization steps for each version, each step corresponding to a sketch.
version        sketches
blocking       split + reorder
vectorization  split + reorder + lower
loopperm       split + reorder + lower
arraypacking   split + reorder + store + lower
cacheblocks    split + reorder + store + lower
parallel       split + reorder + store + lower
6. Related Work
Automatic Optimization. Some compiler optimizations can be fully automated via equality saturation or heuristic searches (tate2009equalitysaturation; yang2021eqsattensor; liftrewrite2015; polymage2015). Although this approach can automatically yield high performance, it is not always feasible or even desirable as it lacks user control, may result in poor performance and may be too time-consuming.
Optimization Strategies. Compiler optimizations can be precisely controlled with rewriting strategies (visser1998strategies; hagedorn2020elevate; koehler2021elevateimgproc) or schedules (halide2012; tvm2018). However, optimization strategies are challenging to write, non-declarative, and quickly become over-detailed and program-specific.
Equality Saturation with Bindings. We are not the first to attempt applying equality saturation to languages with binding. Willsey et al. implement a partial evaluator for the lambda calculus in (willsey2021egg) using explicit substitution. Although conceptually simple, this implementation has a prohibitively high performance cost as we demonstrated in section 3. In (smith2021accesspatterns), Smith et al. propose access patterns as a way to avoid the need for binding structures when representing tensor programs. In this paper, we present practical solutions to make equality saturation with bindings more efficient.
Sketching. The idea of sketching has been introduced for program synthesis (solar2008synthesissketching), along with counterexample guided inductive synthesis which combines a synthesizer with a validation procedure. Our approach in this paper is different as we target optimizations rather than program synthesis. We use sketches as program patterns to filter a set of equivalent programs generated via equality saturation, and as a result, we do not need a validation procedure.
Equational Reasoning with Egraphs. Egraphs were originally designed for efficient congruence closure in theorem provers (nelson1980techniques), and are used in the Z3 theorem prover (de2008z3). They have also recently been used for semantic code search in the Yogo tool (premtoon2020yogo). In the realm of theorem proving, rewriting strategies can be compared to procedural proof languages (harrison1996hol; paulson1994isabelle), while the sketch approach can be compared to declarative proof languages (corbineau2007declarativecoq; wenzel2002mizar; norell2007agda; kaufmann1996acl2).
7. Conclusion
This paper broadens the applicability of equality saturation for programming languages in two ways. We drastically improve the efficiency of equality saturation for languages with name bindings, like the lambda calculus. We propose sketchguided equality saturation as a semiautomatic technique to scale equality saturation to complex optimizations that require long rewrite sequences. The experimental evaluation demonstrates that sketchguided equality saturation enables seven sophisticated optimizations of matrix multiplication to be applied within seconds. The guided approach is declarative, and requires far less effort than imperative Elevate rewrite strategies. Moreover, most of the optimizations cannot be applied with fullyautomated equality saturation.
Acknowledgements.
We would like to thank Max Willsey for the open-source egg library and his valuable feedback, and the Rise and Elevate teams for their open-source work. This work was supported by the Engineering and Physical Sciences Research Council [EP/W007940/1].