Data-dependence analysis aims to identify the def-use information in a program. However, the presence of pointers and references obscures such information. The analysis must cut through the tangle of aliasing to reason about data dependence, which serves as a substrate for many program analysis clients, such as impact analysis (arnold1996software) and program slicing (sridharan2007thin).
Path-sensitivity is a common axis for pursuing precision, yet it is especially challenging for data-dependence analysis, which can suffer from the “aliasing-path-explosion” problem. For instance, at a load statement, we need to track the path condition of the statement itself and the path conditions under which the dereferenced pointer points to different memory objects. Each load or store statement may access hundreds of memory objects, each memory object may be accessed at dozens or hundreds of locations in the program, and the number of calling contexts under which the statements execute can be exponential. Consequently, the number of disjunctive cases to track grows extremely large, far too many to enable a scalable analysis.
Existing path-sensitive data-dependence analyses can be classified into two major categories. The “fused” approach, such as symbolic execution (cadar2008klee), reasons path-sensitively about data-dependence information without a prior points-to analysis. It uses various logics to generate formulas encoding the entire history of memory writes and reads, which allows correlations between variables to be established automatically. However, it encodes constraints following control-flow paths, regardless of whether they are relevant to the data dependence of interest. Such “dense” analysis is known to have performance problems. For instance, Focal (Focal-FSE19), a state-of-the-art backward symbolic executor, takes almost 230 hours to answer on-demand queries for a program of nearly 33 KLoC.
Alternatively, the staged approach leverages an independent pointer analysis to approximate the def-use information, which is then leveraged to bootstrap the path-sensitive data-dependence analysis (yan2018spatio; blackshear2013thresher). Although the idea of leveraging pre-computed pointer information has advanced flow- and/or context-sensitive analysis (via sparsification (yan2018spatio), pruning (fink:typestate:issta), or partitioning (kahlon2008bootstrapping)), how to replicate this success for path-sensitive data-dependence analysis remains an open question.
The problem of computing transitive data-dependence relations can be formulated as a graph reachability problem, where value-flow graphs varying in precision act as the reachability indices. Without an index, approaches such as symbolic execution are hard to scale, whether they are exhaustive or demand-driven. When computing the index, there is a tension between tracking path-sensitive pointer information too early (livshits2003tracking; hackett2006aliasing; dillig2011precise), which leads to the overwhelming cost of value-flow graph construction, and tracking it too late, which reduces the benefits of path-sensitivity, because pointer information can be spuriously and/or redundantly propagated (yan2018spatio).
In this paper, we present Falcon, a fused and sparse approach to path-sensitive data-dependence analysis, which piggybacks the computation of pointer information on the resolution of data dependence. The key insight is that a data-dependence relation induced by pointer expressions can be identified without knowing the concrete memory objects referenced by the pointers. This enables us to efficiently and precisely build a reachability index for value flows, alleviating the need for explicitly and repeatedly enumerating vast amounts of points-to information.
We first introduce an all-program-points but lazy pointer analysis: it constructs symbolic and compositional value-flow graphs for the entire program, without computing exhaustive points-to information. The graphs act as the “conduits” for tracking transitive data dependence in a sparse and demand-driven manner. Our analysis provides the key precision benefits that path-sensitivity brings, namely path pruning and merging, via lightweight semi-decision procedures. To achieve context-sensitivity but avoid expensive summary cloning, it only clones the memory access-path expressions that are rooted at a function parameter and incur side effects, thus enabling local reasoning of value flows, as opposed to global reasoning about the entire heap.
We then present two client analyses, thin-slicing-based program understanding (sridharan2007thin; li2016program) and value-flow-analysis-based bug hunting (sui2012static; shi2018pinpoint; yan2018spatio). Crucially, the guards qualifying the graph edges have concisely merged value flows going through different memory objects. The clients (1) do not need to perform explicit case-splitting over the points-to sets when handling indirect reads/writes, thereby alleviating a major source of case explosion in previous path-sensitive analyses (PSE; blackshear2013thresher; yan2018spatio), because the size of a points-to set could be very large, and (2) can sparsely track the value flows by following the value-flow edges, aiding scalability. In summary, our approach separates the task of reasoning about “how the values flow through different memory objects” from answering queries about data dependence.
Specifically, there are two novel and critical features in our algorithm itself:
The pointer analysis for building value-flow graphs is both on-the-fly sparse and path-sensitive, in that it computes the heap def-use chains incrementally, along with the path-sensitive pointer information discovered. The Spas algorithm (sui2011spas) is the only previous work that has the same property, but it achieves incremental sparsity by following the level-by-level analysis (yu2010level) and is thus exhaustive.
When answering demand data-dependence queries, our analysis can stop as soon as enough evidence is gathered, without trying to find all pointed-to memory objects. Previous analyses (zheng2008demand; yan2011demand; spath2016boomerang; spath2019context) can answer demand alias queries directly, via different storeless representations (kanvar2016heap). However, none of these techniques supports path sensitivity.
Overall, this paper makes the following key contributions:
We identify and discuss the major challenges of scaling path-sensitive data-dependence analysis.
We introduce an efficient path-sensitive data-dependence analysis, which, for the first time, allows us to analyze multi-million-line code bases with the precision of full path-sensitivity in minutes.
We demonstrate the utility of our approach with two clients, namely thin slicing and value flow analysis.
We conduct an extensive evaluation on 16 real-world programs ranging from 13 KLoC to 8 MLoC.
In building value-flow graphs, Falcon outperforms Svf (sui2016svf), Sfs (hardekopf2011flow), and Dsa (lattner2007making), achieving on average 17, , and 4.4 speedups, respectively.
Compared with Supa (sui2016demand; sui2018vfdemand), the state-of-the-art demand-driven flow- and context-sensitive pointer analysis for C/C++, Falcon is 54× faster in answering thin slicing queries, and it improves the precision by 1.6×.
In comparison with Cred (yan2018spatio), a state-of-the-art path-sensitive value flow analysis for bug hunting, Falcon is on average 6× faster, and finds more real bugs (21 vs. 12) with a lower false-positive rate (25% vs. 47.8%).
We take the program in Fig. 1 as an example to motivate the path-sensitive data-dependence analysis, highlight its challenges, and explain the essence of our approach.
Importance of Path-Sensitive Data-Dependence Information
Suppose we need to detect double-free bugs for the program in Fig. 1(a). In this program, there are two memory deallocation statements free(a) and free(e). Observe that the value of can flow to only under the condition . Thus, the program is safe, because the deallocation statements execute under the condition .
Assume that we only approximate the data-dependence information with a path-insensitive pointer analysis, and then partially track path correlations of the memory deallocation statements. The pointer analysis can tell that may be data-dependent on , meaning that and may point-to the same memory object. Observe that the path conditions for the two statements free(a) and free(e) are both , which do not conflict with each other. Here, if taking as the path condition of a double-free vulnerability, our analysis would raise a false alarm, because the condition for to be data-dependent on is . To summarize, the imprecise data-dependence information caused by the points-to analysis would be passed on to the clients, hurting the precision.
Problem of Aliasing-Path-Explosion
However, obtaining path-sensitive data-dependence information is far from trivial, because tracking path-sensitive pointer information can require reasoning about a considerable number of disjunctive cases. For instance, at a load statement, we need to track the path condition of the statement itself and the path conditions under which the dereferenced pointer points to different memory objects. Each load or store statement may access hundreds of memory objects, each memory object may be accessed at dozens or hundreds of locations in the program, and the number of calling contexts under which the statements execute can be exponential. In summary, the transfer function for each statement needs to store and propagate an enormous amount of information. We term the problem aliasing-path-explosion: analyzing path-sensitive data-dependence information can lead to reasoning about an excessive number of paths (blackshear2013thresher).
A recent analysis (yan2018spatio) has leveraged the idea of sparsity to refine flow-insensitive results into path-sensitive ones on demand. It first constructs the flow-insensitive def-use chains with a pre-analysis, which then enable the primary path-sensitive analysis to be performed sparsely (sui2016demand; sui2018vfdemand; yan2018spatio). For instance, as shown in Fig. 1(b), the two edges between and state that the value of pointer can flow to the pointer via the memory objects or , implying that may be data-dependent on .
However, the flow-insensitive pre-analysis drops path information. When answering demand queries, the primary analysis still suffers from the aliasing-path-explosion. For example, suppose a client asks, “what are the set of variables may be data-dependent on?” Following their work (yan2018spatio), we perform an on-demand backward traversal from to , respectively. Apparently, in the worst case, such graph traversal needs to search five paths and solve five path constraints. This number of paths exceeds the total number of paths in the program (which is four), meaning that the aliasing-path-explosion can be even worse than the well-known scalability problem caused by conditional branching in symbolic execution.
In essence, existing sparse analyses can use a pre-analysis to identify the relevant memory objects (as in Fig. 1(b)). However, the pre-analysis can only reduce the number of memory objects to track, but not the number of value-flow paths going through the relevant memory objects, because it is unaware of the path conditions qualifying the value flows. When going for path sensitivity, the primary path-sensitive analysis phase cannot avoid aliasing-path-explosion.
At a high level, our approach works in two phases. In the first phase, we compute the guarded and storeless value-flow graphs. In the second phase, the clients can utilize the graphs to track transitive program dependence on demand. The crux is to judiciously merge abstract states while building the graphs, since when and how to merge them drastically affect the accuracy and performance of the client analyses.
We observe that many memory objects and paths qualifying data-dependence relations are redundant, which can be symbolically identified and merged. For example, intuitively, the variable may be data-dependent , no matter points-to or . However, the analysis in the previous work (yan2018spatio) has to separate the two edges and label them with different memory objects, so that it can preserve the capability of precision refinement based on the memory objects.
Based on the observation, our key idea is to use a symbolic storeless representation for pointer expressions, which concisely “indexes” how values flow in and out of the memory in a precision-preserving manner. Crucially, as illustrated in Fig. 1(c), our first phase takes advantage of a lightweight semi-decision procedure to achieve the following two merits, which significantly reduce the burden of the second phase: ① it can efficiently prune a number of infeasible value flows; and ② it can effectively merge and simplify path constraints at the time of merging value-flow edges. Besides, to build interprocedural value-flow graphs, it only clones the memory access-path expressions that are rooted at a function parameter and incur side effects, thereby avoiding computing a whole-program image of the heap.
Then, in the second phase, we can answer demand data-dependence queries for variables of interest. For example, consider the graph in Fig. 1(c), where the memory objects pointed-by and are implicit.
To determine the values that is data-dependent on, we perform a backward graph traversal, and only have to traverse two paths, i.e., from to and , respectively.
To detect double-free bugs, we perform a forward traversal from to , collecting the guards qualifying the edges (i.e., ). We then collect the path conditions of the two statements free(a) and free(e) (i.e., ). Clearly, we can eliminate the false positive because is unsatisfiable.
This section presents the basic terminology and notation used in the paper, including the language, the abstract domains, and the guarded value-flow graph.
We formalize our analysis with a simple language, as in Fig. 2. Programs are in the static single assignment form. Statements include address-taken statements, common assignments, φ-assignments, loads, stores, branches, returns, procedure calls, and sequencing. In the φ-assignment, is the gated function for each , which means if and only if is satisfied. Such gated functions can be computed in almost linear time (ottenstein1990program). Without loss of generality, we assume each function has only one return statement.
The symbols and abstract domains are listed in Fig. 3. A label indicates the position of a statement in the control flow graph. An abstract value is a pointer that points-to a memory location. A memory object represents a memory location that may contain different abstract values on different guard conditions . We factor the abstract domain to the points-to environment and abstract store . means that the pointer points-to the memory object under the condition . states that the memory object contains the value , which is stored into the memory object at the program point on the condition . For simplicity, we define the operation , so that we can query and under a precondition . Formally,
Guarded Value-Flow Graph
Intuitively, a value flows to if is assigned to directly (via an assignment, such as ) or indirectly (via pointer dereferences, such as ). Formally, we define the guarded value-flow graph as below.
Definition 3.1 (Guarded Value-Flow Graph). A guarded value-flow graph is a directed graph , where , , and are defined as follows:
is a set of nodes, each of which is denoted by , meaning that the variable is defined or used at a program location .
is a set of edges, each of which represents a value-flow relation. means that the value flows to .
maps each edge in the graph to a condition , meaning that the value-flow relation holds only when the condition is satisfied.
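The definition above can be pictured with a small data structure. The following is a minimal sketch, not the paper's implementation: the names `Node` and `GuardedVFG` are hypothetical, and guards are kept as opaque formula strings rather than solver terms.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A variable defined or used at a program location (hypothetical encoding)."""
    var: str
    loc: int


@dataclass
class GuardedVFG:
    """Directed graph whose edges carry the condition under which a value flows."""
    # edges[src][dst] = guard; guards are opaque strings in this sketch.
    edges: dict = field(default_factory=dict)

    def add_edge(self, src, dst, guard="true"):
        succs = self.edges.setdefault(src, {})
        old = succs.get(dst)
        # Parallel edges between the same nodes are merged by disjoining guards.
        succs[dst] = guard if old is None else f"({old}) | ({guard})"

    def guard(self, src, dst):
        """Return the guard of edge src -> dst, or None if there is no edge."""
        return self.edges.get(src, {}).get(dst)

    def nodes(self):
        """All nodes that appear as an edge source or target."""
        seen = set(self.edges)
        for succs in self.edges.values():
            seen.update(succs)
        return seen


g = GuardedVFG()
a, e = Node("a", 1), Node("e", 9)
g.add_edge(a, e, "c1")
g.add_edge(a, e, "c2")
print(g.guard(a, e))
```

Keeping one edge per node pair, with a disjunctive guard, mirrors the idea that the graph does not distinguish which memory object carried the value.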
Our approach first computes a guarded value-flow graph for each function. A specific client of data-dependence analysis can then be reduced to graph reachability problems, whereby the local value-flow graphs are stitched together by matching formal and actual parameters as well as return value and its receivers.
To achieve path-sensitivity, for a value-flow edge that only holds under some condition, we label the edge with the constraint . To establish such guarded edges, the key is to record what values will be stored into a memory object at a store statement (e.g., ) and query what values can be loaded at a load statement (e.g., ). We will detail this process in the following sections.
4. Intraprocedural Analysis
This section presents our intraprocedural analysis to compute and , so that we can establish indirect value flows by querying what values can be loaded at a load statement. We first define the abstract transformers, which enable a conventional data-flow analysis. At the end of § 4.1, we summarize the challenges for optimization, which are addressed in § 4.2 and § 4.3.
4.1. Abstract Transformers
Fig. 4 lists the rules for analyzing the basic statements. The rule for states that under the current points-to environment , abstract store , and path condition , the statement produces new points-to environment and/or abstract store .
Rule addr creates memory objects at allocation sites. Rule copy and Rule phi are self-explanatory; thus, we focus on the store and load rules.
Rule store processes a store statement under path condition , which results in new configurations of the abstract store . We first query the memory objects may point to, denoted . For all guarded memory objects , we update the abstract store to record the values that may hold. Following conventional singleton-based algorithms, if points to at most one concrete memory object, we can perform an indirect strong update, which kills other values held by the memory object (emami1994context; hardekopf2009semi; hardekopf2011flow; lhotak2011points; yu2010level).
Given a load statement under path condition at program location , we apply Rule load as follows. Similar to the store rule, we query the memory objects that may point-to under the condition , denoted . We then fetch the values from each memory object , denoted . Finally, for every , we update by adding the points-to set of under condition as a subset of the points-to set of .
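To make the store and load rules concrete, the following is a minimal sketch under simplifying assumptions that are ours, not the paper's: guards are plain conjunctions of Boolean literals (written `"c"` or `"!c"`), only weak updates are modeled (the strong-update case is omitted), and all names (`pts`, `mem`, `store`, `load`) are hypothetical.

```python
def neg(lit):
    """Negate a Boolean literal written as 'c' or '!c'."""
    return lit[1:] if lit.startswith("!") else "!" + lit


def conj(g1, g2):
    """Conjoin two guards (sets of literals); None means trivially unsatisfiable."""
    g = frozenset(g1) | frozenset(g2)
    return None if any(neg(l) in g for l in g) else g


def store(pts, mem, p, q, loc, pc):
    """Model '*p = q' at location loc under path condition pc (weak update only)."""
    for obj, g in pts.get(p, ()):
        guard = conj(g, pc)
        if guard is not None:
            # Record that obj may hold q, written at loc, under the joint guard.
            mem.setdefault(obj, []).append((q, loc, guard))


def load(pts, mem, p, q, pc):
    """Model 'p = *q' under pc; return the guarded values that may flow to p."""
    flows = []
    for obj, g in sorted(pts.get(q, ()), key=lambda t: t[0]):
        base = conj(g, pc)
        if base is None:
            continue
        for val, loc, gv in mem.get(obj, []):
            guard = conj(base, gv)
            if guard is None:
                continue
            flows.append((val, loc, guard))
            # pts(p) grows by pts(val), each target qualified by the new guard.
            for o2, g2 in list(pts.get(val, ())):
                g3 = conj(guard, g2)
                if g3 is not None:
                    pts.setdefault(p, set()).add((o2, g3))
    return flows


# Hypothetical example: p points to o1 when c holds and to o2 otherwise.
pts = {"p": {("o1", frozenset({"c"})), ("o2", frozenset({"!c"}))}}
mem = {}
store(pts, mem, "p", "a", 1, frozenset({"c"}))
store(pts, mem, "p", "b", 2, frozenset({"!c"}))
print(load(pts, mem, "x", "p", frozenset({"c"})))
```

In this example the load under `c` only sees the value `a`: the combination with the object reached under `!c` is discarded as soon as the conjunction becomes contradictory.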
Merging Value-Flow Edges
Recall from § 3 that our analysis computes a guarded value-flow graph that summarizes value flows induced by the memory. We formalize the rule of building indirect value-flow edges in Fig. 5. The vflow rule states that when and are stored and loaded from the same memory object , may alias with .
In the rule, suppose such that and . As illustrated in Fig. 1(b), conventional approaches build flow-insensitive value-flow edges (sui2016demand; yan2018spatio). Thus, they have to distinguish value flows induced by different memory objects, so that they can preserve the capability of precision refinement based on the memory objects. Such methods, however, can still suffer from the aliasing-path-explosion problem, which makes the analysis unscalable.
To tackle the problem, the vflow rule merges the value-flow edges, which not only reduces the number of edges, but also can normalize and simplify the conditions based on some simple rewriting rules. For example, when merging two value-flow edges under the conditions and respectively, the condition can be simplified as after merging.
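The merging step can be illustrated as follows. This is a sketch under our own assumptions: guards are kept in disjunctive normal form as sets of cubes, and only two rewriting rules are applied (merging two cubes that differ in one complementary literal, and absorption). The function names and the example conditions are hypothetical; the paper's actual rewriting rules are not reproduced here.

```python
def neg(lit):
    """Negate a Boolean literal written as 'x' or '!x'."""
    return lit[1:] if lit.startswith("!") else "!" + lit


def simplify(dnf):
    """Simplify a DNF guard given as a collection of cubes (sets of literals)."""
    cubes = {frozenset(c) for c in dnf}
    changed = True
    while changed:
        changed = False
        for c1 in list(cubes):
            for c2 in list(cubes):
                if c1 == c2 or c1 not in cubes or c2 not in cubes:
                    continue
                if c1 < c2:
                    # Absorption: c1 | (c1 & B) is equivalent to c1.
                    cubes.discard(c2)
                    changed = True
                    continue
                diff = c1 ^ c2
                if len(c1) == len(c2) and len(diff) == 2:
                    a, b = diff
                    if neg(a) == b:
                        # (A & x) | (A & !x) is equivalent to A.
                        cubes -= {c1, c2}
                        cubes.add(c1 & c2)
                        changed = True
    return frozenset(cubes)


def merge_edges(edges):
    """Merge edges (src, dst, obj, cube) that differ only in the memory object."""
    merged = {}
    for src, dst, _obj, cube in edges:
        merged.setdefault((src, dst), set()).add(frozenset(cube))
    return {key: simplify(cubes) for key, cubes in merged.items()}


# Two flows from a to e through different objects, under c & x and c & !x.
edges = [("a", "e", "o1", {"c", "x"}), ("a", "e", "o2", {"c", "!x"})]
print(merge_edges(edges))
```

After merging, a single edge from `a` to `e` remains, guarded by `c` alone, and the memory objects no longer appear on the edge.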
However, a highly precise (e.g., flow- and path-sensitive) analysis that uses the above rules to compute value-flow edges is notoriously expensive, due to the following challenges:
Conservative propagation. Propagating data-flow facts along control flows is expensive and unnecessary (hardekopf2011flow). To mitigate this problem, data-flow facts can be propagated along def-use chains. However, the def-use information of memory objects is unavailable without a pointer analysis. To resolve the paradox, most existing approaches (hardekopf2011flow; ye2014region; sui2016sparse; sui2016demand; yan2018spatio) perform a lightweight but imprecise pointer analysis to over-approximate the def-use chains. Due to the imprecision, many false def-use relations are introduced, hurting performance.
Constraints explosion. Our analysis needs to account for a sheer number of guard updates for each statement, quickly causing the explosion of constraints. For a demand-driven analysis, it is both intractable and unnecessary to pay the full price of path-sensitive reasoning up front.
We address these challenges by the following means:
The first challenge is addressed via an on-the-fly sparse analysis, which computes def-use relations incrementally during the analysis, instead of relying on precomputed imprecise def-use chains (§ 4.2).
The second challenge is addressed via a semi-path-sensitive analysis that simplifies and partially solves the constraints, to merge redundant value-flow edges and prune obviously false value-flow edges (§ 4.3).
4.2. On-the-Fly Sparse Analysis
To address the challenge of conservative propagation, we utilize the idea of sparsity to skip unnecessary control flows when propagating data-flow facts. To this end, instead of leveraging the imprecise def-use relations computed by a pre-analysis, we construct the def-use relations incrementally during the analysis, along with the precise pointer information discovered. To formally present the idea, we maintain the abstract store as a set of , which describes the abstract store at the basic block . Then the store and load rules are refined as follows.
The store Rule
We follow the idea in SSA form where a variable defined at a basic block can only be used in a basic block dominated by or in the dominance frontier where the definition is merged with other definitions (cytron1991efficiently). Suppose at basic block , a store statement writes a guarded value to the memory object . As shown in Alg. 1, it takes two steps to update the abstract store. First, we write into the local store (Line 3). Second, we propagate the abstract value to the dominance frontiers of . We update the guard for the propagated abstract value, which is the conjunction of and the path condition of (Lines 4-5). Note that it is unnecessary to propagate the abstract value to basic blocks dominated by , because at the load time, we can walk up the dominance tree to find the corresponding definitions (see the next paragraph).
Example 4.1.
Consider the program in Fig. 6. After , the variables points-to alloc, alloc, respectively. Suppose we are analyzing the store statement at basic block . The abstract value is stored into alloc and then propagated to the local abstract store of ’s dominance frontier . Therefore, we have and . Similarly, after analyzing the store statement , we have and .
The load Rule
As shown in Alg. 2, for a load statement at the basic block , we track the values that can be read from the memory objects pointed-by . For each memory object , it suffices to walk up the dominance tree (Lines 6-15) to gather abstract values until a strong update is found. The basic idea behind the approach is that the definition of a variable must dominate its uses. Clearly, this is a linear search in the dominance tree.
Example 4.2.
Consider the program in Fig. 6. When analyzing the load statement at the basic block , we need to read values from the memory objects pointed-by . To this end, we gather abstract values stored into alloc that is pointed-by . To do this, the sub-procedure ReadFromObject only visits two basic blocks and by walking up the dominator tree. From , we can read the value , which is written into alloc at and propagated to . From , we can read the value , which is written into alloc at that dominates .
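The refined store and load rules can be sketched as below. This is an illustration under stated assumptions rather than a transcription of Alg. 1 and Alg. 2: the immediate-dominator map `idom`, the dominance frontiers `df`, and the block path conditions `path_cond` are assumed to be precomputed; guards are conjunctions of literals; the propagated guard is conjoined with the path condition of the storing block; and the early exit on a strong update is omitted. The diamond-shaped control flow graph is hypothetical and is not the program of Fig. 6.

```python
def store_sparse(local, df, path_cond, block, obj, value, guard=frozenset()):
    """Write (value, guard) into the local store of block, then push it to
    the dominance frontier of block with a strengthened guard."""
    local.setdefault((block, obj), []).append((value, guard))
    for frontier in df.get(block, ()):
        propagated = guard | path_cond[block]
        local.setdefault((frontier, obj), []).append((value, propagated))


def load_sparse(local, idom, block, obj):
    """Collect the values of obj visible at block by walking up the dominator tree."""
    values = []
    while block is not None:
        values.extend(local.get((block, obj), []))
        block = idom.get(block)  # None once the root has been processed
    return values


# Hypothetical diamond: B1 branches on c to B2 / B3, which join at B4.
idom = {"B2": "B1", "B3": "B1", "B4": "B1"}
df = {"B2": ["B4"], "B3": ["B4"]}
path_cond = {
    "B1": frozenset(),
    "B2": frozenset({"c"}),
    "B3": frozenset({"!c"}),
    "B4": frozenset(),
}

local = {}
store_sparse(local, df, path_cond, "B1", "o", "z")
store_sparse(local, df, path_cond, "B2", "o", "a")
store_sparse(local, df, path_cond, "B3", "o", "b")
print(load_sparse(local, idom, "B4", "o"))
```

The load at `B4` visits only `B4` and its dominator `B1`; the stores in `B2` and `B3` are seen through the copies propagated to the frontier, so the sibling blocks themselves are never scanned.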
4.3. Semi-Path-Sensitive Analysis
Path-sensitivity comes in many flavors, depending on the kind of information encoded as constraints. Previous work on path-sensitive pointer analysis either adopts relatively coarse abstractions where Boolean variables abstract just the control flow, ignoring the actual predicate of a condition (sui2011spas); or takes expensive abstractions with first-order theory formulas tracking data predicates, which can incur huge overhead (livshits2003tracking; hackett2006aliasing; dillig2011precise). (Footnote 1: livshits2003tracking invoke a computer algebra system. hackett2006aliasing implement a procedure similar to bit-blasting that translates arithmetic constraints to SAT constraints. dillig2011precise use the Mistral SMT solver.)
For constructing the guarded value-flow graph, we explore a sweet spot in the space; it is semi-path-sensitive from the two aspects below.
First, the guards are conceptually first-order formulas, but abstracted as Boolean skeletons. For instance, we abstract the two branch conditions and to fresh Boolean literals and , respectively. Such encoding allows for certain degrees of branch correlation tracking. The design is similar in spirit to the DPLL(T) lazy SMT solving architecture, which separates propositional reasoning and theory reasoning in a modular and demand-driven way (DPLLT-JACM06).
Second, instead of applying a full-featured SAT/SMT solver, we adopt several linear-time semi-decision procedures such as unit propagation (zhang1996cient) for identifying “easy” unsatisfiable or valid constraints, as well as performing logical simplifications. In our experiment, we observe that about 70% of the path conditions constructed in the analysis are satisfiable. Of the remaining ones, 80% are easy constraints and can be solved with the semi-decision procedures.
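As an illustration of such a semi-decision procedure, the following sketch applies unit propagation to a guard in conjunctive normal form. The encoding is an assumption of ours (nonzero integers as literals, negative for negation, in the DIMACS style), and this simple version is not tuned to meet the linear-time bound. It answers `"unsat"` or `"sat"` only when propagation alone settles the question, and `"unknown"` otherwise.

```python
def unit_propagate(clauses):
    """Run unit propagation on a CNF formula.

    clauses: iterable of clauses, each an iterable of nonzero ints.
    Returns (status, assignment) with status in {"sat", "unsat", "unknown"}.
    """
    clauses = [set(c) for c in clauses]
    assignment = {}
    while True:
        if any(len(c) == 0 for c in clauses):
            return "unsat", assignment  # an empty clause is a conflict
        units = [next(iter(c)) for c in clauses if len(c) == 1]
        if not units:
            # No clause left means every clause is satisfied.
            return ("sat" if not clauses else "unknown"), assignment
        lit = units[0]
        assignment[abs(lit)] = lit > 0
        # Drop satisfied clauses and remove the falsified literal elsewhere.
        clauses = [c - {-lit} for c in clauses if lit not in c]


# c1 & (c1 -> c2) & !c2 is refuted by propagation alone.
print(unit_propagate([[1], [-1, 2], [-2]]))
# (c1 | c2) & (!c1 | !c2) has no unit clause, so the result is "unknown".
print(unit_propagate([[1, 2], [-1, -2]]))
```

A guard that comes back `"unsat"` lets the analysis prune the corresponding value flow immediately; an `"unknown"` guard is kept on the edge and left for the client to resolve if it ever matters.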
Many infeasible path-sensitive data-flow facts can be filtered because programmers tend to maintain an implicit and simple correlation of conditional points-to relations, both for ensuring some required logical properties and for readability. The semi-path-sensitive analysis makes this correlation explicit.
Compared with state-of-the-art approaches that construct the value-flow graph via a flow-insensitive points-to analysis and label each edge with a memory object (sui2016sparse; yan2018spatio), our analysis has two major benefits. First, it catches the path correlations between different statements, thus pruning more infeasible value flows than path-insensitive analyses. Second, it concisely merges and simplifies the guards qualifying a value flow, when merging the value-flow edges between load and store statements.
Example 4.3.
Consider the program in Fig. 1(a). Observe that the variable may point-to or . A path-insensitive algorithm will conclude that may alias with , where is a false positive. We now explain intuitively how our algorithm works and prunes the false positive. Let us consider the two cases where points-to or , respectively. First, if points-to , as in Fig. 7(a), our analysis will obtain the following values that may flow to :
The semi-path-sensitive analysis can decide that the guard of the second item is unsatisfiable. Hence, the value is pruned. Second, if points-to , as in Fig. 7(b), the analysis can also prune the value . Finally, after merging the two graphs induced by and , we obtain the graph in Fig. 7(c).
The on-the-fly sparse analysis and the semi-path-sensitive analysis conspire to address the challenges of conservative propagation and constraints explosion (§ 4.1). The sparse analysis skips unnecessary control flows when propagating data-flow facts, thereby improving the analysis efficiency. The semi-path-sensitive analysis removes a lot of false pointer information, which not only improves precision but also benefits efficiency because smaller points-to sets lead to less work (lhotak2008evaluating; smaragdakis2014introspective).
5. Interprocedural Analysis
The tenet of our interprocedural analysis is breaking down the entire abstraction into smaller components to enable on-demand resolution of alias relations. To achieve context-sensitivity but avoid expensive summary cloning, we first introduce the approach to building concise function summaries. We then sketch the process of constructing the interprocedural value-flow graphs.
A pointer analysis often faces the “chicken-and-egg” problem: performing the analysis requires a call graph, whose construction in turn requires reasoning about function pointers (grove2001framework). In our approach, we consult a flow- and context-insensitive analysis (zhang2013fast) for obtaining a sound call graph. Previous works (milanova2004precise; hardekopf2011flow) have shown that a precise call graph for C-like programs can be constructed using only flow-insensitive analysis. For C++ programs, we adopt the class hierarchy analysis to resolve virtual function calls (dean1995optimization).
5.1. Cloning-Based Analysis with Concise Function Summaries
For building interprocedural value-flow graphs, we perform a bottom-up and summary-based analysis. To achieve context-sensitivity, conventional summary-based analyses conservatively identify the side effects of a function, which are then cloned at every call site of the summarized function in the upper-level callers (xie2005scalable; dillig2011precise). However, the size of the side-effect summary can quickly explode, becoming a significant obstacle to scalability. To illustrate, consider the program in Fig. 8(a). In conventional approaches, the summary of foo is the points-to information, and , of the interface variable at the exit point of foo. That is,
By cloning the summary to the two call sites in qux and bar, the two variables and are cloned twice. When the summaries of qux and bar are cloned to their upper-level callers, and will continue to be cloned. As a result, the summary size will grow exponentially.
To mitigate the problem, our basic idea is to introduce symbolic auxiliary variables, each of which stands for a class of variables to clone. We then need to clone only a single auxiliary variable during interprocedural analysis, reducing the burden of cloning. For the above example, we introduce an extra value for the function foo to represent all values (e.g., and ) stored in the memory object pointed to by . As a result, the function summary shrinks to the following, and we only need to clone a single variable to the callers during the interprocedural analysis.
Intuitively, this process amounts to adding an extra return value to the function foo, as shown in Fig. 8(b). Formally, this summarization process is illustrated in Alg. 3, where the points-to results are always merged into a single auxiliary variable so that the burden of the cloning processes can be reduced. Each auxiliary variable stands for a modified non-local memory object accessed through an access path rooted at an interface variable. In conclusion, the summarization scheme enables local reasoning of the value flows, as opposed to global reasoning about the entire heap.
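The summarization idea can be sketched as follows. This is not Alg. 3: the access-path syntax (a string such as `"*p"`), the naming scheme for auxiliary variables, and the functions `summarize` and `instantiate` are all hypothetical, and guards are kept as opaque strings.

```python
def summarize(func, side_effects):
    """Build a concise summary with one auxiliary variable per access path.

    side_effects: {access_path: [(value, guard), ...]} at the exit of func,
    where each access path is rooted at a formal parameter, e.g. "*p".
    Returns (summary, internal_edges).
    """
    summary, internal_edges = {}, []
    for i, (path, values) in enumerate(sorted(side_effects.items())):
        aux = f"{func}$aux{i}"
        # All values stored through this path flow into the one auxiliary variable.
        for value, guard in values:
            internal_edges.append((value, aux, guard))
        summary[path] = aux
    return summary, internal_edges


def instantiate(summary, call_site, actuals):
    """Clone only the auxiliary variables at a call site.

    actuals maps each formal parameter name to the actual argument name.
    """
    cloned = {}
    for path, aux in summary.items():
        depth = len(path) - len(path.lstrip("*"))
        root = path[depth:]
        cloned["*" * depth + actuals[root]] = f"{aux}@{call_site}"
    return cloned


# Hypothetical callee: foo(p) stores a into *p when c holds, and b otherwise.
summary, internal = summarize("foo", {"*p": [("a", "c"), ("b", "!c")]})
print(summary)
print(instantiate(summary, "cs1", {"p": "x"}))
```

However many values are stored through `*p` inside the callee, each call site receives exactly one cloned variable per modified access path, so the summary size no longer multiplies along the call chain.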
5.2. Constructing the Value-Flow Graphs
Alongside the analysis, we construct the value-flow graph (§ 3) for each function. The graph has two types of edges representing value-flow relations: the direct edge connects a store to a load using the rule in Fig. 5, and the summary edge connects to if can transitively flow to . The analysis eagerly connects the summary edges between a formal argument (formal-in) at the function entry and the return value (formal-out) at the function exit. As shown in Fig. 8(c), the local value-flow graphs are stitched together by matching formal and actual parameters as well as return values and their receivers.
6. Answering Demand Queries
The storeless value-flow graphs allow tracking of transitive data dependence in a demand-driven fashion. By a forward or backward graph traversal, the graphs can be adopted in various applications.
Thin Slicing for Program Understanding
The first typical application is thin slicing (sridharan2007thin; li2016program), which can be implemented via a backward traversal on the value-flow graph. Thin slicing was introduced by sridharan2007thin to facilitate program debugging and understanding. A thin slice for a program variable, also known as the slicing seed, includes only the producer statements that directly affect the values of the variable. In contrast to conventional slicing, control dependence and the data dependence of the variable’s base pointer are excluded. Hence, thin slices are typically much smaller than conventional program slices.
Example 6.1 ().
Consider the program in Figure 8. To build the thin slice for the slicing seed at function qux, we only need to traverse the value-flow graphs from in a reversed direction. Such a traversal will visit all program statements that need to be included in the slice. For instance, and will be visited and included in the thin slice as they are the producer statements of the variable .
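The backward traversal can be sketched as a plain worklist algorithm. The edge list below is a hypothetical example, not the graph of Figure 8, and guards are ignored for brevity.

```python
from collections import defaultdict, deque


def thin_slice(edges, seed):
    """Return every node that can reach `seed` along value-flow edges."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    visited = {seed}
    work = deque([seed])
    while work:
        node = work.popleft()
        for p in preds[node]:
            if p not in visited:
                visited.add(p)
                work.append(p)
    return visited


# Hypothetical value-flow edges (producer -> consumer).
edges = [
    ("s1: x = 1", "s3: *p = x"),
    ("s3: *p = x", "s5: y = *q"),
    ("s2: p = malloc()", "s4: q = p"),  # base-pointer flow, not a producer of y
    ("s5: y = *q", "s6: print(y)"),
]
slice_for_y = thin_slice(edges, "s5: y = *q")
```

The statements that only compute the base pointers (`s2`, `s4`) are never visited, which is what keeps a thin slice smaller than a conventional one.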
Value Flow Analysis for Bug Hunting
The second typical application is value flow analysis (fastcheck-07; sui2012static). The analysis of value flows underpins the inspection of a broad range of software bugs, such as the violations of memory safety (e.g., null dereference, double free, etc.), the violations of resource usage (e.g., memory leak, socket leak, etc.), and security problems (e.g., the use of tainted data). Clearly, it is of vital importance to precisely resolve value flows caused by pointer aliasing, which is the key problem we address in the paper. Value flow analysis can be implemented via a forward traversal on the value-flow graphs, during which the alias constraints and property-specific constraints can be gathered together and handed to an SMT solver.
Example 6.2.
Suppose we need to detect double-free bugs for the program in Figure 8(a). We traverse its value-flow graphs (Figure 8(c)) starting from and obtain one path from to . We then stitch together the path condition under which is data-dependent on (i.e., ), and the path conditions of the two statements free(a) and free(n) (i.e., ). Observe that we do not compute the interprocedural data dependence between or .
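The final feasibility check can be sketched as follows. A brute-force enumeration over Boolean variables stands in for the SMT solver, and the conditions `c1` and `c2` are hypothetical; the point is only that the data-dependence guard and the two path conditions are conjoined before a report is issued.

```python
from itertools import product


def satisfiable(constraints, variables):
    """Brute-force stand-in for an SMT solver over Boolean variables."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(c(env) for c in constraints):
            return True
    return False


def double_free_feasible(flow_guard, free1_cond, free2_cond, variables):
    """Keep a report only if the flow and both frees can hold together."""
    return satisfiable([flow_guard, free1_cond, free2_cond], variables)


variables = ["c1", "c2"]

# Hypothetical case 1: the two frees sit on opposite branches of c1.
infeasible = double_free_feasible(
    flow_guard=lambda e: True,
    free1_cond=lambda e: e["c1"],
    free2_cond=lambda e: not e["c1"],
    variables=variables,
)

# Hypothetical case 2: the frees are guarded by independent conditions.
feasible = double_free_feasible(
    flow_guard=lambda e: e["c2"],
    free1_cond=lambda e: e["c1"],
    free2_cond=lambda e: e["c2"],
    variables=variables,
)
```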
When used for fully path-sensitive value flow analysis, our design has two advantages. First, it enhances precision, as a combined domain of pointer and source-sink information allows more precise information than could be obtained by solving each domain separately (Cousot:1979:SDP:567752.567778; fink:typestate:issta). Second, it allows scalability, because the client (1) does not need to perform explicit case splitting over the points-to sets when handling indirect reads/writes, alleviating a major source of case explosion in previous work (PSE; blackshear2013thresher; yan2018spatio); (2) sparsely tracks the value flows by following the data-dependence edges; and (3) can make distinctions between memory objects summarized by an access path on demand. Consequently, the client concentrates computational effort on the path- and context-sensitive pointer information only when it matters to the properties of interest.
Both the value-flow graph construction phase and the on-demand analysis phase are sparse, by piggybacking the computation of pointer information on the resolution of data dependence. The two phases mitigate the aliasing-path-explosion problem as follows. First, when constructing the value-flow graphs, duplicate edges are merged, many false edges are pruned, and the data-dependence guards are significantly simplified (§ 4). Second, when resolving transitive data dependence over the graphs, the concise summaries serve as the “conduits” that allow values of interest to flow in and out of the function scope.
To demonstrate the utility of Falcon, we examine its scalability in constructing the value-flow graphs (§ 7.2), and apply it to two practical clients, namely semi-path-sensitive thin slicing (§ 7.3) and fully path-sensitive bug hunting (§ 7.4).
We have implemented Falcon on top of LLVM 3.6. While the language in § 3 has restricted language constructs, our implementation supports most features of C/C++, such as unions, arrays, classes, dynamic memory allocation, and virtual functions. Arrays are considered monolithic. Falcon is soundy (livshits2015defense), which means that it handles most language features in a sound manner, while it also applies some unsound choices as in previous works (babic2008calysto; xie2005scalable). For instance, we unroll each loop twice on the control flow graph and call graph, do not handle inline assembly, and assume that distinct parameters do not alias each other.
7.1. Experimental Setup
In this section, we compare Falcon against three groups of analyses.
First, we compare with the following analyses for constructing the value-flow graphs: (1) Svf (sui2016svf), an inclusion-based, flow- and context-insensitive pointer analysis (https://github.com/SVF-tools/SVF); (2) Sfs (hardekopf2011flow), an inclusion-based, flow-sensitive, context-insensitive pointer analysis; (3) Dsa (lattner2007making), a unification-based, flow-insensitive, context-sensitive pointer analysis (https://github.com/seahorn/sea-dsa).
Second, for the thin slicing client, we compare with Supa (sui2016demand; sui2018vfdemand) (https://github.com/SVF-tools/SUPA), the state-of-the-art demand-driven flow- and context-sensitive pointer analysis for C/C++. Supa relies on Svf to build the value-flow graphs, based on which it answers demand queries.
Finally, we compare with Cred (yan2018spatio), a state-of-the-art path-sensitive pointer analysis for bug hunting. The tool is not open-source, so we implement its algorithm on top of Svf.
All of these analyses are field-sensitive, meaning that each field of a struct is treated as a separate variable.
We cannot compare with the pointer analyses in (guyer2005error; yu2010level; li2011boosting; li2013precise; sui2011spas; sui2014making; zhao2018parallel) because their implementations are not publicly available. For bug finding, we tried our best to compare with Saturn (hackett2006aliasing) and Compass (dillig2010fluid; dillig2011precise), but they are not runnable on the experimental environment that we are able to set up.
Subjects and Environment
Tbl. 1 shows the benchmarks. Six of them are taken from SPEC CINT2000 and ten from open-source projects. The programs cover a wide range of applications such as text editors and database engines, and their sizes vary from 13 KLoC to 8 MLoC. Note that since Falcon unrolls loops on the control flow graph and call graph, we feed the same transformed code to other tools. All experiments are conducted on a 64-bit machine with 40 Intel Xeon E5-2698 CPUs@2.20 GHz and 256 GB of RAM. All runtime numbers are medians of three runs.
7.2. Value-flow Graph Construction
First, we examine the scalability of Falcon for constructing value-flow graphs. The cutoff time per tool per program is 12 hours.
Comparing with Svf, Sfs, and Dsa
Tbl. 1 and Fig. 9 show the results of the four analyses. In terms of runtime overhead, we can see that they perform similarly on small programs. However, on programs with more than 500 KLoC, Svf and Sfs get derailed and become orders of magnitude more expensive. In particular, both fail to analyze mysql, rethinkdb, and firefox within 12 hours. Dsa is comparable to Falcon on vim and php, but much slower than Falcon on other large programs (git, wrk, libicu, and mysql). Also, Dsa cannot finish the analysis of rethinkdb and firefox. To sum up, Falcon is on average 17×, , and 4.4× faster than Svf, Sfs, and Dsa, respectively. In terms of memory consumption, on average, Falcon takes 1.4×, 1.9×, and 4.2× less memory than Svf, Sfs, and Dsa, respectively.
In Tbl. 1, OOT means the analysis runs out of the time budget (12 hours).
We attribute the graceful scalability of Falcon to two factors. First, the combination of on-the-fly sparsity and semi-path-sensitivity translates into significant gains in both precision and performance (§ 4). Second, we do not clone the full points-to information to achieve context-sensitivity. Instead, we leverage a concise summary to avoid computing a whole-program image of the heap (§ 5).
The Effects of Semi-Decision Procedures
To understand the effects of constraint solving, we set up two additional configurations of Falcon for constructing value-flow graphs. Specifically, Falcon-PI is path-insensitive, while Falcon-SAT uses a full-featured SAT solver. The last three columns of Tbl. 1 compare the three configurations. Falcon is usually more efficient than Falcon-PI, and occasionally much more so, due to the increased precision. However, Falcon-SAT is not a good choice in practice: its precision is offset by unbearable runtime overhead. In particular, Falcon-SAT runs out of the time budget for all programs of more than 500 KLoC.
The results indicate that solving constraints when building value-flow graphs pays off, which naturally raises the question: could we do better by tuning the semi-decision procedure more aggressively? However, we find that being “too aggressive” can lead to performance overhead that overwhelms the benefits. For instance, we tried the Gaussian elimination algorithm for solving linear constraints, leaving the analysis hard to scale to millions of lines of code. Adapting the decision procedures defines a sophisticated design space that deserves further optimizations.
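To illustrate the trade-off, the sketch below shows one deliberately cheap, incomplete check of the kind a semi-decision procedure might perform. It is an assumption on our part, not the procedure of § 4: guards are conjunctions of Boolean literals, and an edge is pruned only when its guard contains a literal together with its negation.

```python
def quick_unsat(literals):
    """Cheap, incomplete check: True only if the conjunction is obviously
    unsatisfiable. A False answer means "unknown", not "satisfiable".

    Each literal is a (name, polarity) pair.
    """
    seen = {}
    for name, polarity in literals:
        if seen.setdefault(name, polarity) != polarity:
            return True  # both x and not-x occur in the conjunction
    return False


def prune_edges(edges):
    """Drop value-flow edges whose guard is obviously contradictory."""
    return [(src, dst, guard) for src, dst, guard in edges if not quick_unsat(guard)]


# Hypothetical guarded edges into the same load.
edges = [
    ("store1", "load1", [("c1", True), ("c2", False)]),
    ("store2", "load1", [("c1", True), ("c1", False)]),  # contradictory guard
]
kept = prune_edges(edges)
```

The check runs in time linear in the size of the guard, which is why it can be applied to every edge during graph construction, whereas a complete solver call per edge would not scale.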
To provide insight into the nature of constraints explosion, in Fig. 10, we report the number of constraints Falcon deals with. Even when unrolling all loops, we can see that it is not unusual to have over constraints.
7.3. Thin Slicing for Program Understanding
This study aims to give a measure for the precision of Falcon’s semi-path-sensitive value-flow graphs (§ 4.3) with the thin slicing client. In the experimental results, we exclude the time for building value-flow graphs.
To generate a realistic set of queries, we compare Falcon against Supa for thin slicing, which is introduced by sridharan2007thin to facilitate program understanding and debugging. The thin slice for a given pair of program variable and statement, a.k.a., the slicing seed, includes only the producer statements that directly affect the values of the variable. In contrast to conventional slicing, control dependence and the data dependence of the variable’s base pointer are excluded. Hence, thin slices are typically much smaller than conventional program slices.
We generate the queries from the bug reports issued by a third-party typestate analysis, which only flags the buggy variables and program locations, but not the trace under which the bugs may occur. Thus, the thin slices can assist the developers in understanding the bug reports. Our results show that Falcon is scalable for the thin slicing client, taking under 240 milliseconds for each demand query. In summary, it achieves up to 302× speedup over Supa and 54× on average. Fig. 11 compares the precision of Falcon against Sfs, Dsa, and Supa on the 13 programs that get analyzed by all tools. The data for each program are normalized to the results of Svf, i.e., a higher bar corresponds to a more precise analysis. Briefly, we make the following observations:
The precision of Falcon is superior to that of the other analyses. The average size of slices produced by Falcon is 5.5×, 1.9×, 2.6×, and 1.3× smaller than that of Svf, Sfs, Dsa, and Supa, respectively.
Comparing Sfs and Svf, we see that flow sensitivity can substantially improve the precision of Andersen’s analysis in some programs, such as php and ffmpeg.
Dsa is comparable to Svf in some cases, and much more precise than Svf in many programs (e.g., vim, libicu, ffmpeg). The combination of context-sensitivity and unification may bring better precision than the flow- and context-insensitive Andersen’s analysis.
First, Falcon offers a clearly visible improved precision of pointer information. An important reason is that Falcon’s semi-path-sensitive analysis (§ 4.3) can prune away more spurious value flows, compared with the demand-driven flow- and context-sensitive analysis offered by Supa.
Second, the performance improvement arises because the value-flow graphs of Falcon are more compact than those of Supa. On the one hand, Supa constructs the graphs with a flow-insensitive analysis, whereas Falcon uses a flow-sensitive one. On the other hand, Supa has to explicate a memory object on an edge so that it can answer demand queries, and many of these edges are redundant.
Third, we observe that Falcon’s unsound assumption that function parameters are alias-free does not affect the soundness for more than of the queries, validated by manually checking the results. Similar to our observations, two recent studies (sui2014making; gharat2016flow) also show that the function parameters of real-world C/C++ programs tend to have few aliasing relations.
7.4. Value Flow Analysis for Bug Hunting
In this study, we investigate the efficiency and effectiveness of Falcon for double-free detection, by comparing it against Cred (yan2018spatio). Both Falcon and Cred employ the Z3 SMT solver (de2008z3) to achieve full path-sensitivity. Each tool is run in a single-thread mode and the cutoff time per program is 24 hours.
Tbl. 2 (surviving row). %FP: Cred 47.8% (11/23); Falcon 25% (7/28). OOT means the tool runs out of the time budget (24 hours).
Tbl. 2 presents the time and memory overhead of the tools. As can be seen, Falcon surpasses the performance of Cred for most large-scale programs, achieving up to 50× speedup, and 6× speedup on average. Besides, on average, Falcon takes 2× less memory than Cred. Although not shown in the table, we remark that, when allowed to run 10 analysis threads concurrently, Falcon can finish checking each program within 45 minutes. Tbl. 2 also shows the number of reported warnings (#Rep) as well as the number and the rate of false positives (#FP and %FP). We can see that Falcon detects more real vulnerabilities than Cred (21 vs. 12). The false-positive rates of Falcon and Cred are 25% and 47.8%, respectively. Falcon is aligned with the common industrial requirement of 30% false positives (Bessey:2010:FBL:1646353.1646374; McPeak:2013:SIS:2491411.2501854).
Overall, our findings suggest that the ideas behind Falcon have considerable practical value. In all aspects, including not only scalability but also precision and recall, the tool is promising in providing an industrial-strength capability for static bug hunting.
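As a quick sanity check of the reported counts, the arithmetic can be reproduced from the %FP row of Tbl. 2. We assume here that the 11/23 entry belongs to Cred and the 7/28 entry to Falcon, which matches the 21 vs. 12 real vulnerabilities quoted in the text; the helper name `fp_rate` is ours.

```python
def fp_rate(false_positives, reports):
    """False-positive rate in percent."""
    return 100.0 * false_positives / reports


# Counts taken from the %FP row of Tbl. 2 (tool assignment is an assumption).
cred_fp, cred_reports = 11, 23
falcon_fp, falcon_reports = 7, 28

cred_rate = fp_rate(cred_fp, cred_reports)
falcon_rate = fp_rate(falcon_fp, falcon_reports)

# Real bugs are the reports that are not false positives.
cred_real = cred_reports - cred_fp
falcon_real = falcon_reports - falcon_fp
```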
8. Related Work
Tbl. 3 gives key properties of several existing path-sensitive algorithms. Here we summarize some of the approaches taken by previous analyses, with a focus on pointer reasoning.
livshits2003tracking introduce a flow-, path-, and context-sensitive pointer analysis, which only scales to programs up to 13KLoC. The pointer analyses in (hackett2006aliasing) and (sui2011spas) are only intraprocedurally path-sensitive. dillig2010fluid; dillig2011precise present a path- and context-sensitive heap analysis that scales to programs with 128KLoC. blackshear2013thresher introduce a symbolic-explicit representation that incorporates the pre-computed flow-insensitive points-to facts to guide the backward symbolic execution. Similar to the index variables in (dillig2010fluid; dillig2011precise) and symbolic variables in (blackshear2013thresher), we use the guards qualifying value-flow graph edges to enable lazy case splitting over the points-to set. However, their approaches are either not demand-driven or non-sparse and, thus, do not scale well for large-scale programs.
Our work follows a long line of research on path-sensitive dataflow analysis. Esp (das2002esp) encodes a typestate property into a finite-state automaton, which is used as the criterion for partitioning and merging program paths. Essentially, Esp is similar to many other approaches such as trace partition (mauborgne2005trace) and elaborations (sankaranarayanan2006static) that control the trade-off between performing joining operations or logical disjunctions at control flow merge points. By contrast, Falcon uses logical disjunction to precisely merge value-flow guards.
Counterexample-guided abstraction refinement starts with an imprecise abstraction, which is iteratively and gradually refined (CEGAR-JACM; ball2002s). Our approach has a “refinement” flavor, but we compute the path- and context-sensitive conditions directly in the on-demand analysis phase, without using a refinement loop. Conceptually, our approach bears similarities to the staged analyses for typestate verification (fink:typestate:issta; fink2008effective), but we focus on path-sensitive analysis and decompose the cost into a semi-path-sensitive construction of value-flow graphs and a fully path-sensitive alias resolution over the graphs.
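The difference between joining and disjoining at a merge point can be seen on a toy example. The sketch below is a simplification under our own assumptions: guards are Python predicates over two hypothetical Boolean conditions, and the "join" is modeled as simply forgetting both guards.

```python
from itertools import product


def models(guard, variables):
    """All assignments (as tuples) that satisfy a guard given as a predicate."""
    return {
        vals
        for vals in product([False, True], repeat=len(variables))
        if guard(dict(zip(variables, vals)))
    }


def merge_disjunctive(g1, g2):
    """Precise merge: the flow holds if it holds along either incoming path."""
    return lambda env: g1(env) or g2(env)


def merge_join(g1, g2):
    """Coarse merge (simplified): forget both conditions."""
    return lambda env: True


variables = ["c1", "c2"]
g_then = lambda e: e["c1"] and e["c2"]        # guard along the then-branch
g_else = lambda e: (not e["c1"]) and e["c2"]  # guard along the else-branch

precise = models(merge_disjunctive(g_then, g_else), variables)
coarse = models(merge_join(g_then, g_else), variables)
only_c2 = models(lambda e: e["c2"], variables)
```

The disjunction is equivalent to `c2` alone, so the merged guard still excludes every path on which `c2` is false, whereas the coarse merge admits all four assignments.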
Shape analysis (TVLA-TOPLAS) proves data-structure invariants and has had a major impact on the verification community. Precise shape analyses (TVLA-TOPLAS; li2017semantic) that are capable of path-sensitive heap reasoning do not readily scale to large programs (fink2008effective). There have been scalable solutions such as compositional shape analysis based on bi-abduction (calcagno2011compositional; gulavani2009bottom), yet they do not guarantee precision.
Demand-Driven Pointer Analysis
Demand-driven program analyses only analyze parts of the program that are relevant for answering a given query. To date, most existing demand-driven pointer analyses for C/C++ (heintze2001demand; saha2005incremental; zheng2008demand) and Java (sridharan2005demand; sridharan2006refinement; yan2011demand; shang2012demand; lu2013incremental; Su2014Parallel; feng2015explorer) are flow-insensitive. Their underlying data structures, such as the pointer expression graph (zheng2008demand), entirely or partially lose the control flow information and, thus, are not easy to extend to path-sensitivity. Recently, there has been a resurgence of interest in demand-driven flow- or path-sensitive pointer analysis (spath2016boomerang; sui2016demand; spath2019context; yan2018spatio). Some of these approaches are not sparse (spath2016boomerang; spath2019context). Some of them are sparse but suffer from the aliasing-path-explosion issues that we address in this paper (sui2016demand; yan2018spatio).
There is an increasing interest in parametric pointer analyses (kastrinis2013hybrid; smaragdakis2014introspective; wei2015adaptive; jeong2017data; jeon2018precise; hassanshahi2017efficient; li2018scalability; li2018precision) that resemble demand-driven approaches. By contrast, they are not query-driven, but schedule analysis strategies such as selective context-sensitivity for different parts of the program (functions, allocation sites). Introspective analysis (smaragdakis2014introspective) tunes context sensitivity per function based on a pre-analysis that computes heuristics such as “total points-to information”. jeong2017data present a data-driven approach to guiding selective context-sensitive analysis, which assigns each function a context length based on a set of program features. Most of the recent advances focus on context-sensitive analysis. Our approach uses a flow-insensitive analysis for function pointers, and precise path- and context-sensitive analysis for other pointers of interest.
Data-dependence analysis aims to identify the def-use information in a program. It has many applications such as change-impact analysis (arnold1996software; Orso:2004:ECD:998675.999453; acharya2011practical), program slicing (sridharan2007thin; li2016program) and bug hunting (tripp2009taj; tripp2013andromeda; arzt2014flowdroid).
There is a huge amount of literature on context-sensitive data-dependence analysis via context-free language (CFL) reachability (spath2019context; horwitz1990SDG; Chatterjee:2017:ODR:3177123.3158118; reps1998program2; sridharan2006refinement), or other language reachability problems (Tang:POPL2015; Zhang:LCL). Tang:POPL2015 propose a TAL-reachability formulation that improves the scalability of constructing function summaries of library code. Zhang:LCL introduce linear conjunctive language (LCL) reachability for approximating the interleaved matching-parentheses problem of field- and context-sensitive data-dependence analysis. spath2019context present a context- and flow-sensitive data-flow analysis based on synchronized pushdown systems. P/taint (PTaint-OOPSLA) unifies points-to and taint analysis by extending the Datalog rules of the underlying points-to analysis and then computing the information all together. All of these techniques are path-insensitive.
The array dependence analysis (maydan1991efficient) community has developed several path-sensitive approaches that are based on SMT solving (mohammadi2018extending), quantifier elimination (Mohammadi:2019), among others (pothen2004elimination). Typically, these approaches focus on array manipulating programs and do not scale to large-scale software with complicated pointer operations.
Sparse Pointer Analysis
The idea of sparse analysis starts from the static single assignment (SSA) form where def-use chains are explicitly encoded (cytron1989efficient; choi1991automatic; chow1996effective). Such def-use chains allow the propagation of data-flow facts to skip unnecessary control flows. hardekopf2009semi present an inclusion-based and semi-sparse flow-sensitive pointer analysis by leveraging the partial SSA form in LLVM (lattner2004llvm). It is semi-sparse because it only utilizes the def-use chains of the top-level pointers.
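The benefit of following def-use chains can be illustrated on straight-line code. The sketch below is ours and assumes a hypothetical six-statement program with no branches; it computes the same reaching definitions twice, once by walking every statement and once by visiting only definitions and their uses.

```python
def dense_propagate(cfg_order, defs, uses, var):
    """Carry the fact for `var` through every statement in control-flow order."""
    visited, reaching, result = 0, None, {}
    for stmt in cfg_order:
        visited += 1
        if var in uses.get(stmt, ()):
            result[stmt] = reaching      # use sees the latest definition
        if defs.get(stmt) == var:
            reaching = stmt              # statement redefines var
    return result, visited


def sparse_propagate(def_use, var_defs):
    """Visit only the definitions of the variable and their direct uses."""
    visited, result = 0, {}
    for d in var_defs:
        visited += 1
        for u in def_use.get(d, ()):
            visited += 1
            result[u] = d
    return result, visited


# Hypothetical straight-line program s1..s6.
cfg_order = ["s1", "s2", "s3", "s4", "s5", "s6"]
defs = {"s1": "x", "s2": "y", "s3": "z", "s5": "x"}
uses = {"s4": {"x"}, "s6": {"x"}}
def_use = {"s1": ["s4"], "s5": ["s6"]}   # pre-built def-use chains for x

dense_result, dense_visits = dense_propagate(cfg_order, defs, uses, "x")
sparse_result, sparse_visits = sparse_propagate(def_use, ["s1", "s5"])
```

Both runs produce the same facts, but the sparse one skips the statements that neither define nor use `x`.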
To be fully sparse, the def-use information of other address-taken variables is needed. There are two classes of fully sparse analysis. First, the staged approaches (hardekopf2011flow; sui2016sparse; sui2016demand; yan2018spatio) exploit a lightweight and imprecise pre-computed pointer analysis to approximate the def-use relations. Owing to the imprecision, spurious value flows are introduced and harm the performance. Second, the on-the-fly approaches (chase1990analysis; yu2010level; sui2011spas; li2011boosting) construct the def-use chains alongside the pointer analysis. However, these approaches are exhaustive and, thus, do not scale well when path-sensitivity is required. Specifically, Spas (sui2011spas) is the only previous pointer analysis that is both path-sensitive and on-the-fly sparse, but it achieves incremental sparsity by extending the level-by-level analysis (yu2010level), which must be exhaustive because (1) a single level can consist of pointers from the whole program, and (2) pointers with higher levels strictly depend on the analysis results of pointers with lower levels.
We have presented Falcon, our approach to path-sensitive data-dependence analysis. At its heart stands an analysis that concisely constructs the guarded and compositional value-flow graphs, which allow tracking path- and context-sensitive def-use information on demand. The graceful scalability and high precision of Falcon rest on our solution to the aliasing-path-explosion problem. Specifically, Falcon is sparse, demand-driven, and can prune and simplify path constraints at an early stage of the analysis. Our work presents strong evidence that path-sensitive data-dependence analysis is a reasonable choice for millions of lines of code.