Optimizing Abstract Abstract Machines
The technique of abstracting abstract machines (AAM) provides a systematic approach for deriving computable approximations of evaluators that are easily proved sound. This article contributes a complementary step-by-step process for subsequently going from a naive analyzer derived under the AAM approach, to an efficient and correct implementation. The end result of the process is a two to three order-of-magnitude improvement over the systematically derived analyzer, making it competitive with hand-optimized implementations that compute fundamentally less precise results.
Program analysis provides sound predictive models of program behavior, but in order for such models to be effective, they must be efficiently computable and correct. Past approaches to designing program analyses have often featured abstractions that are far removed from the original language semantics, requiring ingenuity in their construction and effort in their verification. The abstracting abstract machines (AAM) approach dvanhorn:VanHorn2011Abstracting ; dvanhorn:VanHorn2012Systematic to deriving program analyses provides an alternative: a systematic way of transforming a programming language semantics in the form of an abstract machine into a family of abstract interpreters. It thus reduces the burden of constructing and verifying the soundness of an abstract interpreter.
By taking a machine-oriented view of computation, AAM makes it possible to design, verify, and implement program analyzers for realistic language features typically considered difficult to model. The approach was originally applied to features such as higher-order functions, stack inspection, exceptions, laziness, first-class continuations, and garbage collection. It has since been used to verify actor-based local:DOsualdo:12A and thread-based dvanhorn:Might2011Family parallelism and behavioral contracts dvanhorn:TobinHochstadt2012Higherorder ; it has been used to model Coq local:harvard , Dalvik local:dalvik , Erlang local:DOsualdo:12B , JavaScript local:DBLP:journals/corr/abs-1109-4467 , and Racket dvanhorn:TobinHochstadt2012Higherorder .
The primary strength of the approach is that abstract interpreters can be easily derived through a small number of steps from existing machine models. Since the relationships between abstract machines and higher-level semantic models—such as definitional interpreters dvanhorn:reynolds-hosc98 , structured operational semantics dvanhorn:Plotkin1981Structural , and reduction semantics dvanhorn:Felleisen2009Semantics —are well understood dvanhorn:Danvy:DSc , it is possible to navigate from these high-level semantic models to sound program analyzers in a systematic way. Moreover, since these analyses so closely resemble a language’s interpreter (a) implementing an analysis requires little more than implementing an interpreter, (b) a single implementation can serve as both an interpreter and analyzer, and (c) verifying the correctness of the implementation is straightforward.
Unfortunately, the AAM approach yields analyzers with poor performance relative to hand-optimized analyzers. Our work takes aim squarely at this “efficiency gap,” and narrows it in an equally systematic way through a number of simple steps, many of which are inspired by run-time implementation techniques such as laziness and compilation to avoid interpretative overhead. Each of these steps is proven correct, so the end result is an implementation that is trustworthy and efficient.
In this article, we develop a systematic approach to deriving a practical implementation of an abstract-machine-based analyzer using mostly semantic means rather than tricky engineering. Our goal is to empower programming language implementers and researchers to explore and convincingly exhibit their ideas with a low barrier to entry. The optimizations we describe are widely applicable and appear to scale far beyond the size of programs typically considered in the recent literature on flow analysis for functional languages.
We start with a quick review of the AAM approach to develop an analysis framework and then apply our step-by-step optimization techniques in the simplified setting of a core functional language. This allows us to explicate the optimizations with a minimal amount of inessential technical overhead. Following that, we scale this approach up to an analyzer for a realistic untyped, higher-order imperative language with a number of interesting features and then measure improvements across a suite of benchmarks.
At each step during the initial presentation and development, we evaluated the implementation on a set of benchmarks. The highlighted benchmark in figure 1 comes from Vardoulakis and Shivers dvanhorn:Vardoulakis2011CFA2 and tests distributivity of multiplication over addition on Church numerals. For the step-by-step development, this benchmark is particularly informative:
it can be written in most modern programming languages,
it was designed to stress an analyzer’s ability to deal with complicated environment and control structure arising from the use of higher-order functions to encode arithmetic, and
its improvement is about median in the benchmark suite considered in section 6, and thus it serves as a good sanity check for each of the optimization techniques considered.
We start, in section 3, by developing an abstract interpreter according to the AAM approach. In the initial abstraction, each state carries a store (what is called per-state store variance). The space of stores is exponential in size; without further abstraction, the analysis is exponential and thus cannot analyze the example in a reasonable amount of time. In section 4, we perform a further abstraction by widening the store. The resulting analyzer sacrifices precision for speed and is able to analyze the example in about 1 minute. This step is described by Van Horn and Might [dvanhorn:VanHorn2012Systematic, §3.5–6] and is necessary to make even small examples feasible. We therefore take a widened interpreter as the baseline for our evaluation. Section 5 gives a series of simple abstractions and implementation techniques that, in total, speed up the analysis by nearly a factor of 500, dropping the analysis time to a fraction of a second. Figure 1 shows the step-wise improvement of the analysis time for this example.
The AAM approach, in essence, does the following: it takes a machine-based view of computation and turns it into a finitary approximation by bounding the size of the store. With a limited address space, the store must map addresses to sets of values. Store updates are interpreted as joins, and store dereferences are interpreted by non-deterministic choice of an element from a set. The result of analyzing a program is a finite directed graph where nodes in the graph are (abstract) machine states and edges denote machine transitions between states.
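The store discipline described above can be illustrated with a minimal sketch. The names `join_update` and `deref` are hypothetical and chosen only for this illustration; the representation (a dictionary from addresses to frozen sets) is an assumption, not the representation used by the analyzer in this article.

```python
from typing import Dict, FrozenSet, Hashable, Iterator

# Assumed representation: an abstract store maps each address to a set of values.
Store = Dict[Hashable, FrozenSet[Hashable]]


def join_update(store: Store, addr: Hashable, values: FrozenSet[Hashable]) -> Store:
    """Interpret a store update as a join: old and new values are both kept."""
    new_store = dict(store)  # leave the input store untouched
    new_store[addr] = store.get(addr, frozenset()) | values
    return new_store


def deref(store: Store, addr: Hashable) -> Iterator[Hashable]:
    """Interpret a dereference as a choice: yield every value that may be at addr.

    A machine built on this would create one successor state per yielded value.
    """
    yield from sorted(store.get(addr, frozenset()), key=repr)


s0: Store = {}
s1 = join_update(s0, "x", frozenset({1}))
s2 = join_update(s1, "x", frozenset({2}))  # the second binding joins with the first
```

Because addresses are drawn from a finite set, repeated allocation reuses addresses, and the join is what keeps the approximation sound when that happens.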
The techniques we propose for optimizing analysis fall into the following categories:
(1) generate fewer states by avoiding the eager exploration of non-deterministic choices that will later collapse into a single join point. We accomplish this by applying lazy evaluation techniques so that non-determinism is evaluated by need.
(2) generate fewer states by avoiding unnecessary, intermediate states of a computation. We accomplish this by applying compilation techniques from functional languages to avoid interpretive overhead in the machine transition system.
(3) generate states faster. We accomplish this by better algorithm design in the fixed-point computation we use to generate state graphs.
Figure 2 shows the effect of (1) and (2) for the small motivating example in Earl, et al. dvanhorn:Earl2012Introspective . By generating significantly fewer states at a significantly faster rate, we are able to achieve large performance improvements in terms of both time and space.
Section 6 describes the evaluation of each optimization technique applied to an implementation supporting a more realistic set of features, including mutation, first-class control, compound data, a full numeric tower and many more forms of primitive data and operations. We evaluate this implementation against a set of benchmark programs drawn from the literature. For all benchmarks, the optimized analyzer outperforms the baseline by two to three orders of magnitude.
[Figure 2 panels: (a) Baseline, (b) Lazy, (c) Compiled (& lazy)]
In this section, we give a brief review of the AAM approach by defining a sound analytic framework for a core higher-order functional language: Landin’s ISWIM dvanhorn:Landin1966Next . In the subsequent sections, we will explore optimizations for the analyzer in this simplified setting, but scaling these techniques to realistic languages is straightforward and has been done for the analyzer evaluated in section 6.
ISWIM is a family of programming languages parameterized by a set of base values and operations. To make things concrete, we consider a member of the ISWIM family with integers, booleans, and a few operations. Figure 3 defines the syntax of ISWIM. It includes variables, literals (either integers, booleans, or operations), $\lambda$-expressions for defining procedures, procedure applications, and conditionals. Expressions carry a label, $\ell$, which is drawn from an unspecified set and denotes the source location of the expression; labels are used to disambiguate distinct, but syntactically identical pieces of syntax. We omit the label annotation in contexts where it is irrelevant.
The semantics is defined in terms of a machine model. The machine components are defined in figure 4; figure 5 defines the transition relation (unmentioned components stay the same). The evaluation of a program is defined as its set of traces that arise from iterating the machine transition relation. The function produces the set of all proofs of reachability for any state from the injection of program (from which one could extract a string of states). The machine is a very slight variation on a standard abstract machine for ISWIM in “eval, continue, apply” form dvanhorn:Danvy:DSc . It can be systematically derived from a definitional interpreter through a continuation-passing style transformation and defunctionalization, or from a structural operational semantics using the refocusing construction of Danvy and Nielsen dvanhorn:Danvy-Nielsen:RS-04-26 .
Compared with the standard machine semantics, this definition is different in the following ways, which make it abstractable as a program analyzer:
the store maps addresses to sets of values, not single values (more generally, stores can map to any domain that forms a Galois connection with sets of values, which enables elaborate abstractions of base values such as interval or octagon abstractions; we use sets of values for a simpler exposition),
continuations are heap-allocated, not stack-allocated,
there are “timestamps” () and syntax labels () threaded through the computation, and
the machine is implicitly parameterized by the functions , , , , and spaces , (and initial ).
To characterize concrete interpretation, set the implicit parameters of the relation given in figure 5 as follows:
These functions appear to ignore and , but they can be used to determinize the choice of fresh addresses. The on stores in the figure is a point-wise lifting of : . The resulting relation is non-deterministic in its choice of addresses, however it must always choose a fresh address when allocating a continuation or variable binding. If we consider machine states equivalent up to consistent renaming and fix an allocation scheme, this relation defines a deterministic machine (the relation is really a function).
The interpretation of primitive operations is defined by setting as follows:
To characterize abstract interpretation, set the implicit parameters just as above, but drop the condition. The relation takes some care to not make the analysis run forever; a simple instantiation is a flat abstraction where arithmetic operations return an abstract top element , and returns both and on . This family of interpreters is also non-deterministic in choices of addresses, but it is free to choose addresses that are already in use. Consequently, the machines may be non-deterministic when multiple values reside in a store location.
It is important to recognize from this definition that any allocation strategy is a sound abstract interpretation dvanhorn:Might2009Posteriori . In particular, concrete interpretation is a kind of abstract interpretation. So is an interpretation that allocates a single cell into which all bindings and continuations are stored. The former is an abstract interpretation with uncomputable reachability and gives only the ground truth of a program’s behavior; the latter is an abstract interpretation that is easy to compute but gives little information. Useful program analyses lie somewhere in between and can be characterized by their choice of address representation and allocation strategy. Uniform $k$-CFA dvanhorn:nielson-nielson-popl97 , presented next, is one such analysis.
To characterize uniform $k$-CFA, set the allocation strategy as follows, for a fixed constant $k$:
The notation $\lfloor t \rfloor_k$ denotes the truncation of a list of symbols to the leftmost $k$ symbols.
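The truncation operation, and the way a bounded timestamp keeps the address space finite, can be sketched as follows. Only `truncate` is taken directly from the text. The shapes of `tick` (prepend the current label, then truncate) and `alloc` (pair a variable with the timestamp) are assumptions made in the usual style of such allocators, and all three function names are hypothetical.

```python
from typing import Tuple

# Assumed representation: a timestamp is a tuple of at most k syntax labels.
Time = Tuple[str, ...]


def truncate(labels: Time, k: int) -> Time:
    """Keep only the leftmost k symbols of a list of symbols."""
    return labels[:k]


def tick(label: str, time: Time, k: int) -> Time:
    """Assumed shape: record the current label, then bound the history to k."""
    return truncate((label,) + time, k)


def alloc(var: str, time: Time) -> Tuple[str, Time]:
    """Assumed shape: an address is a variable paired with a timestamp."""
    return (var, time)


t = tick("l3", ("l2", "l1"), k=2)  # the oldest label "l1" is dropped
a = alloc("x", tick("l3", ("l2", "l1"), k=0))  # k = 0 gives one address per variable
```

With a finite set of labels and variables, there are finitely many timestamps of length at most $k$ and therefore finitely many addresses.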
All that remains is the interpretation of primitives. For abstract interpretation, we set to the function that returns on all inputs—a symbolic value we interpret as denoting the set of all integers.
At this point, we have abstracted the original machine to one which has a finite state space for any given program, and thus forms the basis of a sound, computable program analyzer for ISWIM.
The uniform -CFA allocation strategy would make in figure 5 a computable abstraction of possible executions, but one that is too inefficient to run, even on small examples. Through this section, we explain a succession of approximations to reach a more appropriate baseline analysis. We ground this path by first formulating the analysis in terms of a classic fixed-point computation.
Conceptually, the AAM approach calls for computing an analysis as a graph exploration: (1) start with an initial state, and (2) compute the transitive closure of the transition relation from that state. All visited states are potentially reachable in the concrete, and all paths through the graph are possible traces of execution.
We can cast this exploration process in terms of a fixed-point calculation. Given the initial state and the transition relation , we define the global transfer function:
Internally, this global transfer function computes the successors of all supplied states, and then includes the initial state:
Then, the evaluator for the analysis computes the least fixed-point of the global transfer function: where .
The possible traces of execution tell us the most about a program, so we take to be the (regular) set of paths through the computed graph. We elide the construction of the set of edges in this paper.
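As a rough illustration of the fixed-point view, the following sketch iterates a global transfer function from the empty set until nothing changes. The names `explore`, `transfer` and `toy_step` are hypothetical, and the toy transition relation on integers merely stands in for the machine transition relation.

```python
from typing import Callable, FrozenSet, Hashable, Iterable

State = Hashable


def explore(initial: State,
            step: Callable[[State], Iterable[State]]) -> FrozenSet[State]:
    """Least fixed point of: successors of all supplied states, plus the initial state."""

    def transfer(states: FrozenSet[State]) -> FrozenSet[State]:
        successors = {t for s in states for t in step(s)}
        return frozenset(successors | {initial})

    current: FrozenSet[State] = frozenset()
    while True:
        nxt = transfer(current)
        if nxt == current:  # no new states: fixed point reached
            return current
        current = nxt


def toy_step(n: int) -> Iterable[int]:
    """Toy transition relation standing in for the abstract machine."""
    return [(n + 2) % 6]


reachable = explore(0, toy_step)
```

Note that every iteration re-steps every state found so far, which is exactly the naive behavior that the later frontier-based formulation avoids.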
To conduct this naive exploration on the Vardoulakis and Shivers example would require considerable time. Even though the state space is finite, it is exponential in the size of the program. Even with $k = 0$, there are exponentially many stores in the AAM framework.
In the next subsection, we fix this with store widening to reach polynomial (albeit of high degree) complexity. This widening effectively lifts the store out of individual states to create a single, global shared store for all.
A common technique to accelerate convergence in flow analyses is to share a common, global store. Formally, we can cast this optimization as a second abstraction or as the application of a widening operator during the fixed-point iteration. (Technically, we would have to copy the value of the global store to all states being stepped to fit the formal definition of a widening, but this representation is order-isomorphic to that.) In the ISWIM language, such a widening makes 0-CFA quartic in the size of the program. Thus, complexity drops from intractable exponentiality to a merely daunting polynomial.
Since we can cast this optimization as a widening, there is no need to change the transition relation itself. Rather, what changes is the structure of the fixed-point iteration. In each pass, the algorithm will collect all newly produced stores and join them together. Then, before each transition, it installs this joined store into the current state.
To describe this process, AAM defined a transformation of the reduction relation so that it operates on a pair of a set of contexts () and a store (). A context includes all non-store components, e.g., the expression, the environment and the stack. The transformed relation, , is
To retain soundness, this store grows monotonically as the least upper bound of all occurring stores.
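A minimal sketch of one way to realize this widened iteration is given below. It assumes a `step` function that takes a context and a store and returns (context, store) pairs; that interface, and the names `join`, `widened_step`, `analyze_widened` and `toy_step`, are hypothetical and introduced only for illustration.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, Tuple

Store = Dict[Hashable, FrozenSet[Hashable]]
Context = Hashable
Step = Callable[[Context, Store], Iterable[Tuple[Context, Store]]]


def join(s1: Store, s2: Store) -> Store:
    """Point-wise union of two stores (their least upper bound)."""
    out = dict(s1)
    for addr, values in s2.items():
        out[addr] = out.get(addr, frozenset()) | values
    return out


def widened_step(contexts: FrozenSet[Context], store: Store,
                 step: Step) -> Tuple[FrozenSet[Context], Store]:
    """Step every context with the shared store, then join all resulting stores."""
    new_contexts = set(contexts)
    new_store = store
    for c in contexts:
        for c2, s2 in step(c, store):
            new_contexts.add(c2)
            new_store = join(new_store, s2)
    return frozenset(new_contexts), new_store


def analyze_widened(initial: FrozenSet[Context], step: Step):
    contexts, store = initial, {}
    while True:
        nxt = widened_step(contexts, store, step)
        if nxt == (contexts, store):
            return contexts, store
        contexts, store = nxt


def toy_step(c: Context, store: Store):
    """Toy relation: two binding sites write different values to the same address."""
    if c == "bind1":
        return [("use", join(store, {"x": frozenset({1})}))]
    if c == "bind2":
        return [("use", join(store, {"x": frozenset({2})}))]
    return []


contexts, store = analyze_widened(frozenset({"bind1", "bind2"}), toy_step)
```

In the toy run, the two per-state stores collapse into a single store in which `x` holds both values, which is the loss of precision that buys the polynomial bound.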
The final approximation we make to get to our baseline is to store-allocate all values that appear, so that any non-machine state that contains a value instead contains an address to a value. The AAM approach stops at the previous optimization. However, the fun continuation stores a value, and this makes the space of continuations quadratic rather than linear in the size of the program, for a monovariant analysis like 0-CFA. Having the space of continuations grow linearly with the size of the program will drop the overall complexity to cubic (as expected). We also need to allocate an address for the argument position in an ap state.
To achieve this linearity for continuations, we allocate an address for the value position when we create the continuation. This address and the tail address are both determined by the label of the application point, so the space becomes linear and the overall complexity drops to cubic. This is a critical abstraction in languages with $n$-ary functions, since otherwise the continuation space grows super-exponentially (). We extend the semantics to additionally allocate an address for the function value when creating the continuation. The continuation has to contain this address to remember where to retrieve values from in the store.
The new evaluation rules follow, where :
Now instead of storing the evaluated function in the continuation frame itself, we indirect it through the store for further control on complexity and precision:
Associated with this indirection, we now apply all functions stored in the address. This nondeterminism is necessary in order to continue with evaluation.
In this section, we discuss the optimizations for abstract interpreters that yield our ultimate performance gains. We have two broad categories of these optimizations: (1) pragmatic improvement, (2) transition elimination. The pragmatic improvements reduce overhead and trade space for time by utilizing:
timestamped stores;
store deltas; and
imperative, pre-allocated data structures.
The transition-elimination optimizations reduce the overall number of transitions made by the analyzer by performing:
frontier-based semantics;
lazy non-determinism; and
abstract compilation.
All pragmatic improvements are precision preserving (form complete abstractions), but the semantic changes are not in some cases, for reasons we will describe. We did not observe the precision differences in our evaluation.
We apply the frontier-based semantics combined with timestamped stores as our first step. The move to the imperative will be made last in order to show the effectiveness of these techniques in the purely functional realm.
The semantics given for store widening in section 4.2, while simple, is wasteful and does not model what typical implementations do. It causes all states found so far to step on each iteration, even if they are not revisited. This has negative performance and precision consequences (changes to the store can travel back in time in straight-line code). We instead use a frontier-based semantics that corresponds to the classic worklist algorithms for analysis. The difference is that the store is not modified in place, but updated after all frontier states have been processed. This has implications for the precision and determinism of the analysis: precision is higher, and the result is deterministic even if set iteration is not.
The state space changes from a store and set of contexts to a set of seen abstract states (context plus store), , a set of contexts to step (the frontier), , and a store to step those contexts with, :
We constantly see more states, so is always growing. The frontier, which is what remains to be done, changes. Let’s start with the result of stepping all the contexts in paired with the current store (call it for intermediate):
The next store is the least upper bound of all the stores in :
The next frontier is exactly the states that we found from stepping the last frontier, but have not seen before. They must be states, so we pair the contexts with the next store:
Finally, we add what we know we had not yet seen to the seen set:
To inject a program into this machine, we start off knowing we have seen the first state, and that we need to process the first state:
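The frontier-based iteration described in this subsection can be sketched as follows, under the same assumed `step` interface (a context and a store in, an iterable of (context, store) pairs out). All names here are hypothetical. Stores are frozen into hashable keys so that the seen set can hold (context, store) pairs, which mirrors the fact that the seen set contains full states.

```python
from typing import Dict, FrozenSet, Hashable

Store = Dict[Hashable, FrozenSet[Hashable]]


def join(s1: Store, s2: Store) -> Store:
    out = dict(s1)
    for addr, values in s2.items():
        out[addr] = out.get(addr, frozenset()) | values
    return out


def freeze(store: Store):
    """Hashable snapshot of a store, so states can live in a set."""
    return frozenset(store.items())


def frontier_step(seen, frontier, store, step):
    # Step only the frontier contexts, all against the same current store.
    results = [r for c in frontier for r in step(c, store)]
    # The next store is the least upper bound of every store produced.
    next_store = store
    for _, s in results:
        next_store = join(next_store, s)
    key = freeze(next_store)
    # The next frontier: contexts not yet seen at the next store.
    next_frontier = {c for c, _ in results if (c, key) not in seen}
    next_seen = seen | {(c, key) for c in next_frontier}
    return next_seen, next_frontier, next_store


def analyze(c0, step):
    store: Store = {}
    seen = {(c0, freeze(store))}  # injection: the first state is seen ...
    frontier = {c0}               # ... and is also the first work item
    while frontier:
        seen, frontier, store = frontier_step(seen, frontier, store, step)
    return seen, store


def toy_step(c, store):
    """Toy relation: a loop that keeps writing the same value to x."""
    return [("loop", join(store, {"x": frozenset({1})}))]


seen, final_store = analyze("loop", toy_step)
```

The loop terminates once a context reappears with an unchanged store; the store comparison hidden inside the membership check is the cost that timestamps remove.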
Notice that now has several copies of the abstract store in it. As it is, this semantics is much less efficient (but still more precise) than the previously proposed semantics because membership checks have to compare entire stores. Checking equality is expensive because the stores within each state are large, and nearly every entry must be checked against every other due to high similarities amongst stores.
There is a better way. Shivers’ original work on $k$-CFA was susceptible to the same problem, and he suggested three complementary optimizations: (1) make the store global; (2) update the store imperatively; and (3) associate every change in the store with a version number, its timestamp. Then, put timestamps in states where previously there were stores. Given two states, the analysis can now compare their stores just by comparing their timestamps, which is a constant-time operation.
There are two subtle losses of precision in Shivers’ original timestamp technique that we can fix.
In our semantics, the store does not change until the entire frontier has been explored. This avoids cross-branch pollution which would otherwise happen in Shivers’ semantics, e.g., when one branch writes to address and another branch reads from address .
The common implementation strategy for timestamps destructively updates each state’s timestamp. This loses temporal information about the contexts a state is visited in, and in what order. Our semantics has a drop-in replacement of timestamps for stores in the seen set (), so we do not experience precision loss.
The observation Shivers made was that the store is increasing monotonically, so all stores throughout execution will be totally ordered (form a chain). This observation allows you to replace stores with pointers into this chain. We keep the stores around in to achieve a complete abstraction. This corresponds to the temporal information about the execution’s effect on the store.
Note also that is only populated with states that have not been seen at the resulting store. This is what produces the more precise abstraction than the baseline widening.
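The idea of replacing stores by pointers into a chain of stores can be sketched as below. The class `VersionedStore` and its members are hypothetical. Using the index into the chain as the timestamp is an assumption made for simplicity, and the sketch keeps every store in `history` rather than updating a single store destructively.

```python
from typing import Dict, FrozenSet, Hashable, List

Store = Dict[Hashable, FrozenSet[Hashable]]


class VersionedStore:
    """A monotonically growing chain of stores, addressed by integer timestamps."""

    def __init__(self) -> None:
        self.history: List[Store] = [{}]  # oldest first; stores only grow

    @property
    def current(self) -> Store:
        return self.history[-1]

    @property
    def timestamp(self) -> int:
        return len(self.history) - 1

    def advance(self, new_store: Store) -> int:
        """Join new_store into the current store; bump the timestamp only on change."""
        merged = dict(self.current)
        for addr, values in new_store.items():
            merged[addr] = merged.get(addr, frozenset()) | values
        if merged != self.current:
            self.history.append(merged)
        return self.timestamp


vs = VersionedStore()
t1 = vs.advance({"x": frozenset({1})})
t2 = vs.advance({"x": frozenset({1})})  # nothing new: timestamp unchanged
t3 = vs.advance({"x": frozenset({2})})

# States are now (context, timestamp) pairs, compared in constant time.
seen = {("ctx", t1)}
```

Because the chain is kept, a timestamp can always be translated back to the store it denotes, which is the direction of the order isomorphism that recovers full states.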
The general fixed-point combinator we showed above can be specialized to this semantics, as well. In fact, is a functional relation, so we can get the least fixed-point of it directly.
maintains the invariant that all stores in are totally ordered and is an upper bound of the stores in .
maintains the invariant that is in order with respect to and .
is a complete abstraction of .
The proof follows from the order isomorphism that, in one direction, sorts all the stores in to form , and translates stores in to their distance from the end of (their timestamp). In the other direction, timestamps in are replaced by the stores they point to in .
The above technique requires joining entire (large) stores together. Additionally, there is still a comparison of stores, which we established is expensive. Not every step will modify all addresses of the store, so joining entire stores is wasteful in terms of memory and time. We can instead log store changes and replay the change log on the full store after all steps have completed, noting when there is an actual change. This uses far fewer join and comparison operations, leading to less overhead, and is precision-preserving.
We represent change logs as . Each becomes a log addition , where begins empty () for each step. Applying the changes to the full store is straightforward:
We change the semantics slightly to add to the change log rather than produce an entire modified store. The transition relation is identical except for the addition of this change log. We maintain the invariant that lookups will never rely on the change log, so we can use the originally supplied store unmodified.
A taste of the changes to the reduction relation is as follows:
We lift to accommodate for the asymmetry in the input and output, and change the frontier-based semantics in the following way:
Here combines change logs across all non-deterministic steps for a state to later be replayed. The order the combination happens in doesn’t matter, because join is associative and commutative.
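Replaying a change log onto the full store, while noting whether anything actually changed, might look like the following sketch. Representing a log as a list of (address, values) pairs and the name `replay` are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Hashable, List, Tuple

Store = Dict[Hashable, FrozenSet[Hashable]]
ChangeLog = List[Tuple[Hashable, FrozenSet[Hashable]]]


def replay(store: Store, log: ChangeLog) -> Tuple[Store, bool]:
    """Apply logged joins to the store; report whether the store really grew.

    The input store is never mutated; it is copied lazily on the first real change.
    """
    new_store = store
    changed = False
    for addr, values in log:
        old = new_store.get(addr, frozenset())
        merged = old | values
        if merged != old:
            if not changed:
                new_store = dict(store)  # copy once, only when needed
                changed = True
            new_store[addr] = merged
    return new_store, changed


base: Store = {"x": frozenset({1})}
log: ChangeLog = [("x", frozenset({1})), ("y", frozenset({2}))]
updated, grew = replay(base, log)
same, grew_again = replay(base, [("x", frozenset({1}))])
```

Logs from different non-deterministic steps can simply be concatenated before replay. Since each entry is applied with a join, the order of entries does not affect the result.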
iff
By cases on and .
Let . iff .
By induction on .
is a complete abstraction of .
Follows from previous lemma and that join is associative and commutative.
Tracing the execution of the analysis reveals an immediate shortcoming: there is a high degree of branching and merging in the exploration. Surveying this branching has no benefit for precision. For example, in a function application, (f x y), where f, x and y each have several values, each argument evaluation induces $n$-way branching, only to be ultimately joined back together in their respective application positions. Transition patterns of this shape litter the state-graph:
To avoid the spurious forking and joining, we delay the non-determinism until and unless it is needed in strict contexts (such as the guard of an if, a called procedure, or a numerical primitive application). Doing so collapses these forks and joins into a linear sequence of states:
This shift does not change the concrete semantics of the language to be lazy. Rather, it abstracts over transitions that the original non-deterministic semantics steps through. We say the abstraction is lazy because it delays splitting on the values in an address until they are needed in the semantics. It does not change the execution order that leads to the values that are stored in the address.
We introduce a new kind of value, , that represents a delayed non-deterministic choice of a value from . The following rules highlight the changes to the semantics:
Since if guards are in strict position, we must force the value to determine which branch to take. The middle rule uses only to combine with values in the store; it does not introduce needless non-determinism.
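The contrast between eager and delayed choice can be sketched as follows. The `Delayed` class stands in for the new kind of value that names an address instead of a chosen element; it and the helper names are hypothetical, and the treatment of conditionals (only `False` selects the else branch) is an assumption for this toy.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Hashable, List, Set

Store = Dict[Hashable, FrozenSet[Hashable]]


@dataclass(frozen=True)
class Delayed:
    """A delayed non-deterministic choice among the values stored at addr."""
    addr: Hashable


def force(store: Store, value) -> FrozenSet[Hashable]:
    """Resolve a value to the set of concrete choices it stands for."""
    if isinstance(value, Delayed):
        return store.get(value.addr, frozenset())
    return frozenset({value})


def eager_var_ref(store: Store, addr: Hashable) -> List[Hashable]:
    """Eager semantics: one successor per value at the address."""
    return sorted(store[addr], key=repr)


def lazy_var_ref(store: Store, addr: Hashable) -> List[Delayed]:
    """Lazy semantics: a single successor carrying the address itself."""
    return [Delayed(addr)]


def branches_for_guard(store: Store, guard) -> Set[str]:
    """An if guard is a strict position, so the guard value is forced here."""
    return {"else" if v is False else "then" for v in force(store, guard)}


store: Store = {"x": frozenset({1, 2, 3}), "b": frozenset({True, False})}
```

In this toy store, an eager reference to `x` forks three ways, while the lazy reference produces one successor and defers the split until something strict, such as a guard, demands it.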
We have two choices for how to implement lazy non-determinism.
This semantics introduces a subtle precision difference over the baseline. Consider a configuration where a reference to a variable and a binding of a variable will happen in one step, since store widening leads to stepping several states in one big “step.” With laziness, the reference will mean the original binding(s) of the variable or the new binding, because the actual store lookup is delayed one step (i.e. laziness is administrative).
The administrative nature of laziness means that we could remove the loss in precision by storing the result of the lookup in a value representing a delayed nondeterministic choice. This is a more common choice in the 0-CFA implementations we have seen, but it interferes with the next optimization because of the store-delta invariant that lookups must not depend on the change log.
If and then there exists a such that and
Here is straightforward — the left-hand side store must be contained in the right-hand-side store, and if values occur in the states, the left-hand-side value must be in the forced corresponding right-hand-side value. The proof is by cases on .
The prior optimization saved time by doing the same amount of reasoning as before but in fewer transitions. We can exploit the same idea—same reasoning, fewer transitions—with abstract compilation. Abstract compilation transforms complex expressions whose abstract evaluation is deterministic into “abstract bytecodes.” The abstract interpreter then does in one transition what previously took many. Refer back to figure 2 to see the effect of abstract compilation. In short, abstract compilation eliminates unnecessary allocation, deallocation and branching. The technique is precision preserving without store widening. We discuss the precision differences with store widening at the end of the section.
The compilation step converts expressions into functions that expect the other components of the ev state. Its definition in figure 6 shows close similarity to the rules for interpreting ev states. The next step is to change reduction rules that create ev states to instead call these functions. Figure 7 shows the modified reduction relation. The only change from the previous semantics is that state construction is replaced by calling the compiled expression. For notational coherence, we write for and for .
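The shape of such a compilation step can be sketched for a tiny expression language. This is a simplified illustration, not the definition in figure 6: expressions are assumed to be tagged tuples, continuations are kept inline rather than store-allocated, and the state and frame tags (`"co"`, `"ifk"`, `"addr"`) are hypothetical names.

```python
from typing import Callable, Dict, Tuple

Env = Dict[str, str]   # variable -> address (assumed representation)
Compiled = Callable[[Env, dict, object], Tuple]


def compile_expr(expr: Tuple) -> Compiled:
    """Turn an expression into a function awaiting the rest of an ev state.

    Dispatch on syntax happens once, here, instead of on every transition.
    """
    tag = expr[0]
    if tag == "lit":
        value = expr[1]
        # A literal goes straight to a continue state with its value.
        return lambda env, store, kont: ("co", kont, value)
    if tag == "var":
        name = expr[1]
        # A variable reference yields the address, to be forced later if needed.
        return lambda env, store, kont: ("co", kont, ("addr", env[name]))
    if tag == "if":
        guard = compile_expr(expr[1])
        then_code = compile_expr(expr[2])
        else_code = compile_expr(expr[3])
        # Skip the intermediate ev state for the guard: call its code directly.
        return lambda env, store, kont: guard(
            env, store, ("ifk", then_code, else_code, env, kont))
    raise ValueError(f"unknown expression tag: {tag!r}")


code = compile_expr(("if", ("var", "b"), ("lit", 1), ("lit", 2)))
state = code({"b": "a0"}, {}, "halt")
```

Calling `code` lands directly in the continue state for the guard, so the ev states for the conditional and for its guard never appear in the state graph.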
The correctness of abstract compilation seems obvious, but it has never before been rigorously proved. What constitutes correctness in the case of dropped states, anyway? Applying an abstract bytecode’s function does many “steps” in one go, at the end of which, the two semantics line up again (modulo representation of expressions). This constitutes the use of a notion of stuttering. We provide a formal analysis of abstract compilation without store widening with a proof of a stuttering bisimulation ianjohnson:BCG88 between this semantics and lazy non-determinism without widening to show precision preservation.
The number of transitions that can occur in succession from an abstract bytecode is roughly bounded by the amount of expression nesting in the program. We can use the expression containment order to prove stuttering bisimulation with a well-founded equivalence bisimulation (WEB) ianjohnson:manolios-diss. WEBs are equivalent to the notion of a stuttering bisimulation, but are more amenable to mechanization since they also only require reasoning over one step of the reduction relation. The trick is in defining a well-founded ordering that determines when the two semantics will match up again, what Manolios calls the pair of functions and (but we don’t need since the uncompiled semantics doesn’t stutter).
We define a refinement, , from non-compiled to compiled states (built structurally) by “committing” all the actions of an state (defined similarly to , but immediately applies the functions), and subsequently changing all expressions with their compiled variants. Since WEBs are for single transition systems, a WEB refinement is over the disjoint union of our two semantics, and the equivalence relation we use is just that a state is related to its refined state (and itself). Call this relation .
Before we prove this setup is indeed a WEB, we need one lemma that applying an abstract bytecode’s function is equal to refining the corresponding state:
Let . Let . .
The proof is by induction on .
is a WEB on