BCFA: Bespoke Control Flow Analysis for CFA at Scale

Many data-driven software engineering tasks, such as discovering programming patterns and mining API specifications, perform source code analysis over control flow graphs (CFGs) at scale. Analyzing millions of CFGs can be expensive, and the performance of the analysis depends heavily on the underlying CFG traversal strategy. State-of-the-art analysis frameworks use a fixed traversal strategy. We argue that a single traversal strategy does not fit all kinds of analyses and CFGs and propose bespoke control flow analysis (BCFA). Given a control flow analysis (CFA) and a large number of CFGs, BCFA selects the most efficient traversal strategy for each CFG. BCFA extracts a set of properties of the CFA by analyzing the code of the CFA and combines it with properties of the CFG, such as branching factor and cyclicity, to select the optimal traversal strategy. We have implemented BCFA in Boa and evaluated it using a set of representative static analyses that mainly involve traversing CFGs and two large datasets containing 287 thousand and 162 million CFGs. Our results show that BCFA can speed up the large-scale analyses by 1%-28%. Further, BCFA has low overheads (less than 0.2%) and a low misprediction rate (less than 0.01%).

1. Introduction

Data-driven techniques have been increasingly adopted in many software engineering (SE) tasks: API precondition mining (Nguyen et al., 2014; Khairunnesa et al., 2017), API usage mining (Acharya et al., 2007; Zhang et al., 2018), code search (McMillan et al., 2012), and discovering vulnerabilities (Yamaguchi et al., 2014), to name a few. These data-driven SE tasks perform source code analysis at scale over different program representations, such as source code text, abstract syntax trees (ASTs), and control-flow graphs (CFGs). For example, API precondition mining analyzes millions of methods that contain API call sites to capture conditions that are often checked before invoking the API. Source code mining infrastructures (Dyer et al., 2013; Bajracharya et al., 2014; Gousios, 2013) have started supporting CFG-level analysis to facilitate a variety of data-driven SE tasks.

The performance of source code analysis over CFGs depends heavily on the order in which nodes are visited during traversal: the traversal strategy. The graph traversal literature offers several strategies, e.g., depth-first, post-order, reverse post-order, topological order, and worklist-based strategies. However, state-of-the-art analysis frameworks use a fixed traversal strategy. For example, the Soot analysis framework (Lam et al., 2011) uses a topological ordering of the nodes to perform control flow analysis. Our observation is that, for analyzing millions of programs with different characteristics, no single strategy performs best for all kinds of analyses and programs. Both the properties of the analysis and the properties of the input program influence the traversal strategy selection. For example, for a control flow analysis that is data-flow sensitive, meaning the output for any node must be computed using the outputs of its neighbors, a traversal strategy that visits neighbors before visiting the node performs better than other kinds of traversal strategies. Similarly, if the CFG of the input program is sequential, a simple strategy that visits nodes in random order performs better than a more sophisticated strategy.
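To make the notion of a traversal strategy concrete, the following minimal sketch computes post-order and reverse post-order on a small CFG. The adjacency-list representation, the function names and the diamond-shaped example graph are illustrative assumptions, not part of the Boa implementation.

```python
from typing import Dict, List


def post_order(succ: Dict[int, List[int]], entry: int) -> List[int]:
    """Depth-first post-order: a node is emitted after the successors
    reached through it (back edges aside)."""
    seen = set()
    order: List[int] = []

    def visit(node: int) -> None:
        seen.add(node)
        for nxt in succ.get(node, []):
            if nxt not in seen:
                visit(nxt)
        order.append(node)

    visit(entry)
    return order


def reverse_post_order(succ: Dict[int, List[int]], entry: int) -> List[int]:
    """Reverse post-order: on an acyclic CFG every node appears after
    all of its predecessors, which suits forward analyses."""
    return list(reversed(post_order(succ, entry)))


# Hypothetical diamond-shaped CFG: 0 -> {1, 2}, 1 -> 3, 2 -> 3
cfg = {0: [1, 2], 1: [3], 2: [3], 3: []}
po = post_order(cfg, 0)
rpo = reverse_post_order(cfg, 0)
```

On an acyclic graph, a forward data-flow analysis that visits nodes in `rpo` sees every predecessor's output before it needs it, while a backward analysis gets the same benefit from `po`.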

We propose bespoke control flow analysis (BCFA), a novel source code analysis technique for performing large-scale source code analysis over control flow graphs. Given an analysis and a large collection of CFGs on which the analysis needs to be performed, BCFA selects an optimal traversal strategy for each CFG. To achieve that, BCFA deploys a novel decision tree that combines a set of analysis properties with a set of graph properties of the CFG. The analysis properties include data-flow sensitivity, loop sensitivity, and traversal direction, and the graph properties include cyclicity (whether the CFG contains branches and loops). No existing technique automatically selects a suitable strategy based on analysis and CFG properties. Since manually extracting the properties can be infeasible when analyzing millions of CFGs, we provide a technique to extract the analysis properties by analyzing the source code of the analysis.
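The idea of combining analysis properties with graph properties can be sketched as a small rule-based selector. This is a simplified illustration pieced together from statements made elsewhere in this paper (any order for data-flow insensitive traversals and sequential CFGs, worklist for loop sensitive traversals on CFGs with loops, and an ordering that follows the analysis direction otherwise). It is an assumption-laden sketch, not the actual decision tree; the names `Cyclicity` and `select_strategy` are hypothetical, and the iterative direction is deliberately left out.

```python
from enum import Enum


class Cyclicity(Enum):
    SEQUENTIAL = "sequential"
    BRANCH = "branch"
    LOOP = "loop"


def select_strategy(flow_sensitive: bool, loop_sensitive: bool,
                    direction: str, cyclicity: Cyclicity) -> str:
    """Pick one of the candidate strategies (PO, RPO, WPO, WRPO, ANY).

    Assumed rules, simplified from the prose of this paper.
    """
    if direction not in ("forward", "backward"):
        raise ValueError("this sketch only handles forward/backward traversals")
    # Data-flow insensitive traversals do not depend on neighbor outputs.
    if not flow_sensitive:
        return "ANY"
    # Assumption: a simple order is enough for sequential CFGs.
    if cyclicity is Cyclicity.SEQUENTIAL:
        return "ANY"
    # Loop sensitive traversals on cyclic CFGs need a fixpoint: use a worklist.
    if cyclicity is Cyclicity.LOOP and loop_sensitive:
        return "WRPO" if direction == "forward" else "WPO"
    # Otherwise follow the natural direction of the analysis.
    return "RPO" if direction == "forward" else "PO"


choice = select_strategy(True, True, "forward", Cyclicity.LOOP)
```

Because the selector only reads three analysis properties and one graph property, it can be evaluated once per CFG at negligible cost compared with the traversal itself.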

We have implemented BCFA in Boa, a source code mining infrastructure (Dyer et al., 2013, 2015), and evaluated it using a set of 21 source code analyses that mainly includes control and data-flow analyses. The evaluation is performed on two datasets: a dataset containing well-maintained projects from the DaCapo benchmark (with a total of 287K CFGs), and an ultra-large dataset containing more than 380K projects from GitHub (with a total of 162M CFGs). Our results show that BCFA can speed up the large-scale analyses by 1%-28% by selecting the most time-efficient traversal strategy. We also found that BCFA has low overheads for computing the analysis and graph properties (less than 0.2%) and a low misprediction rate (less than 0.01%).

In summary, this paper makes the following contributions:

  • It proposes a set of analysis properties and a set of graph properties that influence the selection of traversal strategy for CFGs.

  • It describes a novel decision tree for selecting the most suitable traversal strategy using the analysis and the graph properties.

  • It provides a technique to automatically extract the analysis properties by analyzing the source code of the analysis.

  • It provides an implementation of BCFA in Boa (Dyer et al., 2013, 2015) and a detailed evaluation using a wide variety of source code analyses and two large datasets containing 287 thousand and 162 million CFGs.

2. Empirical Evaluation

We conducted an empirical evaluation on a set of 21 basic source code analyses on two massive public code datasets to evaluate several factors of BCFA. First, we show the benefit of using BCFA over standard strategies by evaluating the reduction in running times of BCFA over the standard ones (§2.2). Then, we evaluate the correctness of the analysis results using BCFA to show that the decision analyses and optimizations in BCFA do not affect the correctness of the source code analyses (§2.3). We also evaluate the precision of our selection algorithm by measuring how often BCFA selects the most time-efficient traversal (§2.4). We evaluate how the different components of BCFA and different kinds of static and runtime properties impact the overall performance in §2.5. Finally, we show practical uses of BCFA in three applications in §2.6.

2.1. Analyses, Datasets and Experiment Setting

#    Analysis                                  Ts
1    Copy propagation (CP)                     3
2    Common sub-expression detection (CSD)     3
3    Dead code (DC)                            3
4    Loop invariant code (LIC)                 3
5    Upsafety analysis (USA)                   3
6    Valid FileReader (VFR)                    3
7    Mismatched wait/notify (MWN)              3
8    Available expression (AE)                 2
9    Dominator (DOM)                           2
10   Local may alias (LMA)                     2
11   Local must not alias (LMNA)               2
12   Live variable (LV)                        2
13   Nullness analysis (NA)                    2
14   Post-dominator (PDOM)                     2
15   Reaching definition (RD)                  2
16   Resource status (RS)                      2
17   Very busy expression (VBE)                2
18   Safe Synchronization (SS)                 2
19   Used and defined variable (UDV)           1
20   Useless increment in return (UIR)         1
21   Wait not in loop (WNIL)                   1
Table 1. List of source code analyses and the number of traversals each involves. Ts: total number of traversals. Each traversal is characterized by three properties: Flw (data-flow sensitive), Lp (loop sensitive) and Dir (traversal direction: iterative, forward or backward).

2.1.1. Analyses.

We collected source code analyses that traverse CFGs from textbooks and tools. We also ensured that the list of analyses covers all the static properties that BCFA considers, i.e., data-flow sensitivity, loop sensitivity and traversal direction (forward, backward and iterative). We ended up with 21 source code analyses, as shown in Table 1. They include 10 basic ones (analyses 1, 2, 8, 9, 10, 11, 12, 14, 15 and 19) from textbooks (Aho et al., 1986; Nielson et al., 2010) and 11 others for detecting source code bugs and code smells from the Soot framework (Vallée-Rai et al., 1999) (analyses 3, 4, 5, 13, 17 and 18) and the FindBugs tool (Ayewah et al., 2007) (analyses 6, 7, 16, 20 and 21). Table 1 also shows the number of traversals each analysis contains. All analyses are intra-procedural. We implemented all twenty-one of these analyses in Boa. (Our implementation infrastructure, Boa, currently supports only method-level analysis; however, our technique should be applicable to inter-procedural CFGs.)

Dataset   All graphs   Sequential       Branches        Loops (all)     Loops with branches   Loops without branches
DaCapo    287K         186K (65%)       73K (25%)       28K (10%)       21K (7%)              7K (2%)
GitHub    161,523K     111,583K (69%)   33,324K (21%)   16,617K (10%)   11,674K (7%)          4,943K (3%)
Table 2. Statistics of the generated control flow graphs.

Analysis   Avg. Time          Static   Runtime (DaCapo)    Runtime (GitHub)
           DaCapo   GitHub             Avg.    Total       Avg.     Total
CP         0.21     0.008     53       0.21    62,469      0.008    1359K
CSD        0.19     0.012     60       0.19    56,840      0.012    1991K
DC         0.19     0.010     45       0.19    54,822      0.010    1663K
LIC        0.21     0.006     69       0.20    60,223      0.006    992K
USA        0.19     0.006     90       0.19    54,268      0.009    1444K
VFR        0.18     0.007     42       0.18    52,483      0.007    1142K
MWN        0.18     0.006     36       0.18    52,165      0.006    1109K
AE         0.18     0.007     43       0.18    53,290      0.007    1169K
DOM        0.21     0.008     35       0.21    62,416      0.008    1307K
LMA        0.18     0.008     76       0.18    52,483      0.008    1346K
LMNA       0.18     0.008     80       0.18    53,182      0.008    1407K
LV         0.17     0.007     32       0.17    49,231      0.007    1273K
NA         0.16     0.008     64       0.16    46,589      0.008    1398K
PDOM       0.20     0.012     34       0.20    57,203      0.012    2040K
RD         0.20     0.007     48       0.20    57,359      0.007    1155K
RS         0.16     0.006     28       0.16    46,367      0.006    996K
VBE        0.17     0.006     44       0.17    49,138      0.006    1062K
SS         0.17     0.006     32       0.17    48,990      0.006    1009K
UDV        0.14     0.005     10       0.14    41,617      0.005    928K
UIR        0.14     0.006     14       0.14    41,146      0.006    1020K
WNIL       0.14     0.007     15       0.14    41,808      0.007    1210K
Table 3. Time contribution of each phase (in milliseconds).

2.1.2. Datasets.

We ran the analyses on two datasets: the DaCapo 9.12 benchmark (Blackburn et al., 2006) and a large-scale dataset containing projects from GitHub. The DaCapo dataset contains the source code of 10 open source Java projects: Apache Batik, Apache FOP, Apache Aurora, Apache Tomcat, Jython, Xalan-Java, PMD, H2 database, Sunflow and Daytrader. The GitHub dataset contains the source code of more than 380K Java projects collected from GitHub.com. Each method in the datasets was used to generate a control flow graph (CFG) on which the analyses would be run. The statistics of the two datasets are shown in Table 2. Both have similar distributions of CFGs over graph cyclicity (i.e., sequential, branch, and loop).
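The three cyclicity classes used in Table 2 can be computed with a single depth-first search. The sketch below is one possible way to do it, assuming a CFG given as a successor map in which every node appears as a key; the function name `classify_cfg` and the example graphs are hypothetical.

```python
from typing import Dict, List

WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished


def classify_cfg(succ: Dict[int, List[int]]) -> str:
    """Return 'loop' if the graph has a cycle, 'branch' if it is acyclic
    but some node has more than one successor, else 'sequential'."""
    color = {node: WHITE for node in succ}
    has_cycle = False
    for root in succ:
        if color[root] != WHITE:
            continue
        color[root] = GRAY
        stack = [(root, iter(succ[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if color[nxt] == GRAY:      # back edge: a cycle exists
                    has_cycle = True
                elif color[nxt] == WHITE:   # tree edge: descend
                    color[nxt] = GRAY
                    stack.append((nxt, iter(succ[nxt])))
                    break
            else:                           # all successors handled
                color[node] = BLACK
                stack.pop()
    if has_cycle:
        return "loop"
    if any(len(out) > 1 for out in succ.values()):
        return "branch"
    return "sequential"


straight = {0: [1], 1: [2], 2: []}
diamond = {0: [1, 2], 1: [3], 2: [3], 3: []}
looping = {0: [1], 1: [2], 2: [1, 3], 3: []}
```

The search is iterative rather than recursive so that very long sequential methods do not hit Python's recursion limit.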

2.1.3. Setting.

We compared BCFA against six standard traversal strategies: DFS, PO, RPO, WPO, WRPO and ANY. The running time for each analysis is measured from the start to the end of the analysis. The running time for BCFA also includes the time for computing the static and runtime properties, making the traversal strategy decision, optimizing it, and then using the optimized traversal strategy to traverse the CFG and run the analysis. The analyses on the DaCapo dataset were run on a single machine with 24 GB of RAM and 24 cores running the Linux 3.5.6-1.fc17 kernel. Running the analyses on the GitHub dataset on a single machine would take weeks, so we ran them on a cluster that runs a standard Hadoop 1.2.1 with 1 name and job tracker node, 10 compute nodes with a total of 148 cores, and 1 GB of RAM for each map/reduce task.

2.2. Running Time and Time Reduction

We first report the running times and then study the reductions (or speedup) against standard traversal strategies.

2.2.1. Running Time

Table 3 shows the running times for the 21 analyses on the two datasets. On average (column Avg. Time), each analysis took 0.14–0.21 ms and 0.005–0.012 ms to analyze a graph in the DaCapo and GitHub datasets, respectively. The variation in the average analysis time is mainly due to the difference in the machines used to run the analyses for the DaCapo and GitHub datasets. Also, the graphs in DaCapo are on average much larger than those in GitHub. Columns Static and Runtime show the time contributions of the different components of BCFA: the time for determining the static properties of each analysis, which is done once per analysis, and the time for constructing the CFG of each method and traversing it, which is done once for every constructed CFG. The time for collecting static information is negligible when compared to the total runtime information collection time (less than 0.2% for the DaCapo dataset and less than 0.01% for the GitHub dataset), as it is performed only once per traversal. When compared to the average runtime information collection time, the static time is quite significant. However, the overhead introduced by the static information collection phase diminishes as the number of CFGs increases and becomes insignificant when running on these two large datasets. This result shows the benefit of BCFA when applied to large-scale analysis.
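The overhead claim can be checked directly from Table 3 by dividing the one-time static cost by the total runtime cost. The short sketch below does this for two rows; the helper name `static_overhead_percent` is an assumption introduced for illustration, and the "K" totals are expanded to thousands.

```python
def static_overhead_percent(static_ms: float, total_runtime_ms: float) -> float:
    """One-time static analysis cost as a percentage of total runtime cost."""
    return 100.0 * static_ms / total_runtime_ms


# Values copied from Table 3 (milliseconds).
cp_dacapo = static_overhead_percent(53, 62_469)
cp_github = static_overhead_percent(53, 1_359_000)
usa_dacapo = static_overhead_percent(90, 54_268)   # largest static cost in the table
usa_github = static_overhead_percent(90, 1_444_000)

print(f"CP:  {cp_dacapo:.3f}% (DaCapo), {cp_github:.4f}% (GitHub)")
print(f"USA: {usa_dacapo:.3f}% (DaCapo), {usa_github:.4f}% (GitHub)")
```

Both rows stay below the 0.2% (DaCapo) and 0.01% (GitHub) bounds stated above, and the ratio shrinks as the number of analyzed CFGs grows because the numerator is fixed per analysis.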

2.2.2. Time Reduction

Analysis   DaCapo                                GitHub
           DFS   PO    RPO   WPO   WRPO  ANY     (order: DFS PO RPO WPO WRPO ANY)
CP         17%   83%   9%    66%   11%   72%     17% 88% 12% 80% 5% 82%
CSD        41%   93%   39%   74%   4%    89%     31% 24% 12% (completed runs only)
DC         41%   30%   89%   7%    64%   81%     25% 22% 7% (completed runs only)
LIC        17%   84%   8%    67%   7%    73%     19% 89% 15% 81% 19% 88%
USA        36%   92%   34%   72%   9%    87%     22% 17% 9% (completed runs only)
VFR        20%   41%   18%   51%   15%   62%     15% 40% 10% 44% 9% 53%
MWN        21%   35%   16%   35%   22%   49%     17% 31% 12% 33% 11% 46%
AE         40%   14%   39%   73%   14%   87%     16% 16% 11% (completed runs only)
DOM        54%   97%   48%   70%   6%    95%     27% 32% 6% (completed runs only)
LMA        35%   46%   28%   74%   6%    46%     22% 13% 6% (completed runs only)
LMNA       29%   39%   22%   68%   9%    41%     21% 15% 7% (completed runs only)
LV         38%   30%   84%   11%   56%   75%     25% 21% 68% 11% 69% 80%
NA         26%   88%   30%   50%   10%   80%     13% 87% 12% 71% 10% 85%
PDOM       51%   41%   95%   10%   72%   95%     24% 20% 24% (completed runs only)
RD         15%   80%   7%    62%   9%    68%     19% 91% 10% 79% 5% 86%
RS         31%   31%   30%   31%   28%   30%     16% 40% 9% 31% 7% 49%
VBE        40%   36%   88%   13%   76%   81%     28% 24% 10% (completed runs only)
SS         26%   39%   22%   37%   25%   57%     20% 35% 13% 34% 10% 50%
UDV        6%    5%    6%    10%   9%    3%      3% 4% 2% 7% 6% 0%
UIR        2%    2%    1%    3%    3%    0%      2% 5% 4% 7% 7% 0%
WNIL       3%    4%    5%    6%    8%    2%      3% 6% 5% 5% 6% 0%
Overall    31%   83%   70%   55%   35%   81%
(a) Time reduction for each analysis. Rows with three GitHub values list only the runs that completed; the remaining strategy runs timed out on the cluster.

Property         DaCapo
                 DFS   PO    RPO   WPO   WRPO  ANY
Data-flow        32%   84%   72%   57%   36%   83%
Non data-flow    4%    4%    4%    6%    6%    2%
(b) Reduction over analysis properties.

Property     DaCapo
             DFS   PO    RPO   WPO   WRPO  ANY
Sequential   20%   74%   63%   55%   28%   72%
Branch       31%   81%   66%   58%   40%   92%
Loop         53%   88%   75%   62%   37%   95%
(c) Reduction over graph properties.
Figure 1. Reduction in running times.

To evaluate the running-time efficiency of BCFA over other strategies, we ran the 21 analyses on the DaCapo and GitHub datasets using BCFA and the other strategies. When comparing BCFA to a standard strategy $S$, we computed the reduction rate $R = (T_S - T_B)/T_S$, where $T_S$ and $T_B$ are the running times using the standard strategy and BCFA, respectively. Some analyses have worst-case traversal strategies that might not be feasible to run on the GitHub dataset with 162 million graphs. For example, using post-order for a forward data-flow analysis visits the CFGs in the direction opposite to the natural direction of the analysis and hence takes a long time to complete. For such combinations of analyses and traversal strategies, the map and reduce tasks timed out in the cluster setting and thus produced no running times. The corresponding cells in Figure 1(a) are denoted with the symbol –.

The results in Figure 1(a) show that BCFA helps reduce the running times in almost all cases. The values indicate the reduction in running time obtained by adopting BCFA compared against the standard strategies. Most positive reductions are at least 10%, and many exceed 50%. Compared to the most time-efficient strategy for each analysis, BCFA achieved speedups from 1% (UIR with RPO) to 28% (RS with WRPO). More importantly, the most time-efficient and the worst traversal strategies vary across the analyses, which supports the need for BCFA. Over all analyses, the reduction was highest against the any-order and post-order (PO and WPO) strategies. The reduction was lowest against the strategies using depth-first search (DFS) and worklist with reverse post-ordering (WRPO). When compared with the next best performing traversal strategy for each analysis, BCFA reduces the overall execution time by about 13 to 72 minutes on the GitHub dataset. We do not report the overall numbers for the GitHub dataset due to the presence of failed runs.

Figure 1(b) shows time reductions for different types of analyses. For data-flow sensitive ones, the reduction rates were high, ranging from 32% to 84%. The running time was not improved much for non data-flow sensitive traversals, which correspond to the last three rows in Figure 1(a), with mostly single-digit reductions. BCFA performs almost the same as the any-order traversal strategy for analyses in this category. This is because any-order is the best strategy for all the CFGs in these analyses. BCFA also chooses the any-order traversal strategy and, thus, the performance is the same.

Figure 1(c) shows time reductions for different cyclicity types of input graphs. Reductions over graphs with loops are generally the highest and those over sequential graphs are the lowest.

2.3. Correctness of Analysis Results

To evaluate the correctness of the analysis results, we first chose worklist as the standard strategy to run the analyses on the DaCapo dataset and create the ground truth. We then ran the analyses using our hybrid approach, BCFA, and compared the results with the ground truth. In all analyses on all input graphs from the dataset, the results from BCFA always exactly matched the corresponding ones in the ground truth.

2.4. Traversal Strategy Selection Precision

In this experiment, we evaluated how well BCFA picks the most time-efficient strategy. We ran the 21 analyses on the DaCapo dataset using all the candidate traversals and the one selected by BCFA. One selection is counted for each pair of a traversal and an input graph, where BCFA selects a traversal strategy based on the properties of the analysis and the input graph. A selection is considered correct if its running time is at least as good as the running time of the fastest among all candidates. The precision is computed as the ratio of the number of correct selections to the total number of selections. The precision was 100% and 99.99% for loop insensitive and loop sensitive traversals, respectively.
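The precision definition above amounts to a simple ratio. A minimal sketch follows, assuming each selection is recorded as the running time of the selected strategy together with the running times of all candidates; the function name and the timing values are made up for illustration and are not measurements from the paper.

```python
from typing import Dict, List, Tuple

Selection = Tuple[float, Dict[str, float]]  # (selected time, candidate times)


def selection_precision(selections: List[Selection]) -> float:
    """Fraction of selections whose running time is at least as good as
    the fastest candidate strategy for the same traversal and graph."""
    correct = sum(
        1 for selected_ms, candidates in selections
        if selected_ms <= min(candidates.values())
    )
    return correct / len(selections)


# Hypothetical timings in milliseconds for four (traversal, graph) pairs.
selections = [
    (0.10, {"DFS": 0.12, "PO": 0.30, "RPO": 0.11}),
    (0.20, {"DFS": 0.25, "PO": 0.20, "RPO": 0.40}),
    (0.50, {"DFS": 0.45, "PO": 0.90, "RPO": 0.60}),  # a misprediction
    (0.05, {"DFS": 0.07, "PO": 0.06, "RPO": 0.09}),
]
precision = selection_precision(selections)
```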

Analysis Precision
DOM, PDOM, WNIL, UDV, UIR 100.00%
CP, CSD, DC, LIC, USA, VFR, MWN, AE, LMA, LMNA, LV, NA, RD, RS, VBE, SS 99.99%
Table 4. Traversal strategy prediction precision.

As shown in Table 4, the selection precision is 100% for all analyses that are not loop sensitive. For analyses that involve loop sensitive traversals, the prediction precision is 99.99%. Further analysis revealed that the selection precision is 100% for sequential CFGs and for CFGs with branches and no loops: BCFA always picks the most time-efficient traversal strategy. For CFGs with loops, the selection precision is 100% for loop insensitive traversals. The mispredictions occur with loop sensitive traversals on CFGs with loops. This is because, for loop sensitive traversals, BCFA picks worklist as the best strategy. The worklist approach was picked because it visits only as many nodes as needed, whereas other traversal strategies visit redundant nodes. However, using a worklist imposes the overhead of creating and maintaining a worklist containing all nodes in the CFG. This overhead is negligible for small CFGs. When running analyses on large CFGs, however, this overhead could become higher than the cost of visiting redundant nodes. Therefore, selecting worklist for loop sensitive traversals on large CFGs might not always result in the best running times.
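The trade-off described above is easier to see with a concrete worklist loop. The sketch below is a generic forward worklist fixpoint, not the Boa implementation: the worklist is seeded with every node (the bookkeeping overhead mentioned above), and a node is revisited only when the output of one of its predecessors changes. The function name, the toy transfer function and the example CFG are assumptions for illustration.

```python
from collections import deque
from typing import Callable, Dict, FrozenSet, List, Tuple

Fact = FrozenSet[int]


def worklist_forward(
    succ: Dict[int, List[int]],
    transfer: Callable[[int, Fact], Fact],
) -> Tuple[Dict[int, Fact], int]:
    """Run a forward analysis to a fixpoint; return outputs and visit count."""
    pred: Dict[int, List[int]] = {n: [] for n in succ}
    for node, outs in succ.items():
        for nxt in outs:
            pred[nxt].append(node)

    out: Dict[int, Fact] = {n: frozenset() for n in succ}
    work = deque(succ)      # seed with all nodes, in dictionary order
    queued = set(succ)      # membership set avoids duplicate entries
    visits = 0
    while work:
        node = work.popleft()
        queued.discard(node)
        visits += 1
        # Merge predecessor outputs (set union as the join operator).
        incoming = frozenset().union(*(out[p] for p in pred[node]))
        updated = transfer(node, incoming)
        if updated != out[node]:
            out[node] = updated
            for nxt in succ[node]:   # only affected successors are re-queued
                if nxt not in queued:
                    work.append(nxt)
                    queued.add(nxt)
    return out, visits


# Toy analysis: each node "defines" its own id and nothing is killed.
cfg = {0: [1], 1: [2], 2: [1, 3], 3: []}
out, visits = worklist_forward(cfg, lambda n, facts: facts | {n})
```

On this four-node CFG with one loop, the two loop nodes are the only ones revisited; the deque and the `queued` set are the extra state whose maintenance cost grows with the size of the CFG.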

2.5. Analysis on Traversal Optimization

Figure 2. Time reduction due to traversal optimization.

We evaluated the importance of optimizing the chosen traversal strategy by comparing BCFA with its non-optimized version. Figure 2 shows the reduction rate in running times for the 21 analyses. For analyses that involve at least one data-flow sensitive traversal, the optimization reduces the running time by at least 60%. This is because optimizations in such traversals reduce the number of traversal iterations over the graphs by eliminating the redundant result re-computation traversals and the unnecessary fixpoint condition checking traversals. For analyses involving only data-flow insensitive traversals, there is no reduction in execution time, as BCFA does not attempt to optimize them.

2.6. Case Studies

This section presents three applications adopted from prior work that showed significant benefit from the BCFA approach. These applications include one or more analyses listed in Table 1. We computed the reduction in the overall analysis time when compared to the WRPO traversal strategy (the second best performing traversal after BCFA); the results are shown in Figure 3.

Case   BCFA        WRPO        Reduce
APM    1527 min.   1702 min.   10%
AUM    883 min.    963 min.    8%
SVT    1417 min.   1501 min.   6%
Figure 3. Running time of the case studies on GitHub data.

API Precondition Mining (APM). This case study mines a large corpus of API usages to derive potential preconditions for API methods (Nguyen et al., 2014). The key idea is that API preconditions would be checked frequently in a corpus with a large number of API usages, while project-specific conditions would be less frequent. This case study mined the preconditions for all methods of java.lang.String.

API Usage Mining (AUM). This case study analyzes API usage code and mines API usage patterns (Xie and Pei, 2006). The mined patterns help developers understand and write API usages more effectively with fewer errors. Our analysis mined usage patterns for java.util APIs.

Finding Security Vulnerabilities with Tainted Object Propagation (SVT). This case study formulated a variety of widespread SQL injections as tainted object propagation problems (Livshits and Lam, 2005). Our analysis looked for all SQL injection vulnerabilities matching the specifications in the statically analyzed code.

Figure 3 shows that BCFA reduces running times by 80–175 minutes, a relative reduction of 6%–10%. Whether a 10% reduction is really significant depends on the context. A saving of 3 hours (10%) on a parallel infrastructure is significant. If the underlying parallel infrastructure is open, free or shared (Dyer et al., 2013; Bajracharya et al., 2014), a 3-hour saving enables supporting more concurrent users and analyses. If the infrastructure is a paid cluster (e.g., AWS), 3 hours less computing time could translate into substantial cost savings.
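The relative reductions in Figure 3 follow from the two running-time columns. The short sketch below recomputes them as (WRPO time − BCFA time) / WRPO time; the helper name `reduction_rate` is an assumption introduced for illustration.

```python
def reduction_rate(standard_min: float, bcfa_min: float) -> float:
    """Relative reduction in running time against a standard strategy."""
    return (standard_min - bcfa_min) / standard_min


# (BCFA minutes, WRPO minutes) copied from Figure 3.
cases = {"APM": (1527, 1702), "AUM": (883, 963), "SVT": (1417, 1501)}

for name, (bcfa, wrpo) in cases.items():
    saved = wrpo - bcfa
    rate = reduction_rate(wrpo, bcfa)
    print(f"{name}: saved {saved} min ({rate:.1%})")
```

The absolute savings are 175, 80 and 84 minutes, which round to the 10%, 8% and 6% reported in the figure.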

2.7. Threats to Validity

Our datasets do not contain a balanced distribution of graph cyclicity types. The majority of graphs in both the DaCapo and GitHub datasets are sequential (65% and 69%, respectively) and only 10% have loops. The impact of this threat is that paths and decisions along sequential graphs are taken more often. This threat is not easy to mitigate, as it is not practical to find a code dataset with a balanced distribution of graphs of various types. Nonetheless, our evaluation shows that the selection and optimization of the best traversal strategy for the remaining 35% of the graphs (graphs with branches and loops) plays an important role in improving the overall performance of the analysis over a large dataset of graphs.

3. Related Works

Atkinson and Griswold (Atkinson and Griswold, 2001) discuss several implementation techniques for improving the efficiency of data-flow analysis, namely factoring data-flow sets, the visitation order of the statements, and selective reclamation of the data-flow sets. They discuss two commonly used traversal strategies, iterative search and worklist, and propose a new worklist algorithm that results in 20% fewer node visits. In their algorithm, a node is processed only if the data-flow information of any of its successors (or predecessors) has changed. Tok et al. (Tok et al., 2006) proposed a new worklist algorithm for accelerating inter-procedural flow-sensitive data-flow analysis. They generate inter-procedural def-use chains on the fly to be used in their worklist algorithm to re-analyze only the parts that are affected by changes in the flow values. Hind and Pioli (Hind and Pioli, 1998) proposed an optimized priority-based worklist algorithm for pointer alias analysis, in which the nodes awaiting processing are placed on a worklist prioritized by the topological order of the CFG, such that nodes higher in the CFG are processed before nodes lower in the CFG. Bourdoncle (Bourdoncle, 1993) proposed the notion of weak topological ordering (WTO) of directed graphs and two iterative strategies based on WTO for computing the analysis solutions in the data-flow and abstract interpretation domains. Bourdoncle's technique is more suitable for cyclic graphs; for acyclic graphs, Bourdoncle proposes any topological ordering. Kildall (Kildall, 1973) proposes combining several optimizing functions with flow analysis algorithms for solving global code optimization problems. For some classes of data-flow analysis problems, there exist techniques for efficient analysis. For example, demand interprocedural data-flow analysis (Horwitz et al., 1995) can produce precise results in polynomial time for inter-procedural, finite, distributive, subset problems (IFDS), constant propagation (Wegman and Zadeck, 1991), etc. These works propose new traversal strategies for improving the efficiency of a certain class of source code analysis, whereas BCFA is a novel technique for selecting the best traversal strategy from a list of candidate traversal strategies, based on the static properties of the analysis and the runtime characteristics of the input graph.

Upadhyaya and Rajan (Upadhyaya and Rajan, 2018) proposed Collective Program Analysis (CPA), which leverages similarities between CFGs to speed up the analysis of millions of CFGs by analyzing only unique CFGs. CPA uses a pre-defined traversal strategy to traverse CFGs, whereas our technique selects the optimal traversal strategy and could be used within CPA. Upadhyaya and Rajan have also proposed an approach for accelerating ultra-large-scale mining by clustering the artifacts that are being mined (Upadhyaya and Rajan, 2017, 2018). BCFA and this approach share the goal of scaling large-scale mining but use complementary strategies.

Cobleigh et al. (Cobleigh et al., 2001) study the effect of worklist algorithms in model checking. They identified four dimensions along which a worklist algorithm can be varied. Based on these four dimensions, they evaluate 9 variations of the worklist algorithm. They do not solve the traversal strategy selection problem. Moreover, they do not take analysis properties into account. We consider both the static properties of the analysis, such as data-flow sensitivity and loop sensitivity, and the cyclicity of the graph. Further, we also consider non-worklist based algorithms, such as post-order, reverse post-order, control flow order, and any order, as candidate strategies.

Several infrastructures exist today for performing ultra-large-scale analysis (Dyer et al., 2013; Bajracharya et al., 2014; Gousios, 2013; Cosmo and Zacchiroli, 2017; Ma et al., 2019). Boa (Dyer et al., 2013) is a language and infrastructure for analyzing open source projects. Sourcerer (Bajracharya et al., 2014) is an infrastructure for large-scale collection and analysis of open source code. GHTorrent (Gousios, 2013) is a dataset and tool suite for analyzing GitHub projects. These frameworks currently support structural or abstract syntax tree (AST) level analysis, and a parallel framework such as map-reduce is used to improve the performance of ultra-large-scale analysis. By selecting the best traversal strategy, BCFA could help improve their performance beyond parallelization.

There has been much work targeting graph traversal optimization. Green-Marl (Hong et al., 2012) is a domain-specific language for expressing graph analysis. It uses the high-level algorithmic description of the graph analysis to exploit the exposed data-level parallelism. Green-Marl's optimization is similar to ours in utilizing the properties of the analysis description; however, BCFA also utilizes the properties of the graphs. Moreover, Green-Marl's optimization is through parallelism, while ours is by selecting a suitable traversal strategy. Pregel (Malewicz et al., 2010) is a map-reduce like framework that aims to bring distributed processing to graph algorithms. While Pregel's performance gain is through parallelism, BCFA achieves it by traversing the graph efficiently.

Acknowledgements.
This material is based upon work supported by the National Science Foundation (http://dx.doi.org/10.13039/100000001). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

References

  • M. Acharya, T. Xie, J. Pei, and J. Xu (2007) Mining api patterns as partial orders from source code: from usage scenarios to specifications. In Proceedings of the the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC-FSE ’07, New York, NY, USA, pp. 25–34. External Links: ISBN 978-1-59593-811-4, Link, Document Cited by: §1.
  • A. V. Aho, R. Sethi, and J. D. Ullman (1986) Compilers: principles, techniques, and tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. External Links: ISBN 0-201-10088-6 Cited by: §2.1.1.
  • D. C. Atkinson and W. G. Griswold (2001) Implementation techniques for efficient data-flow analysis of large programs. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM’01), ICSM ’01, Washington, DC, USA, pp. 52–. External Links: ISBN 0-7695-1189-9, Link, Document Cited by: §3.
  • N. Ayewah, W. Pugh, J. D. Morgenthaler, J. Penix, and Y. Zhou (2007) Evaluating static analysis defect warnings on production software. In Proceedings of the 7th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE ’07, New York, NY, USA, pp. 1–8. External Links: ISBN 978-1-59593-595-3, Link Cited by: §2.1.1.
  • S. Bajracharya, J. Ossher, and C. Lopes (2014) Sourcerer: an infrastructure for large-scale collection and analysis of open-source code. Sci. Comput. Program. 79, pp. 241–259. External Links: ISSN 0167-6423, Link, Document Cited by: §1, §2.6, §3.
  • S. M. Blackburn, R. Garner, C. Hoffmann, A. M. Khang, K. S. McKinley, R. Bentzur, A. Diwan, D. Feinberg, D. Frampton, S. Z. Guyer, M. Hirzel, A. Hosking, M. Jump, H. Lee, J. E. B. Moss, A. Phansalkar, D. Stefanović, T. VanDrunen, D. von Dincklage, and B. Wiedermann (2006) The dacapo benchmarks: java benchmarking development and analysis. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-oriented Programming Systems, Languages, and Applications, OOPSLA ’06, New York, NY, USA, pp. 169–190. External Links: ISBN 1-59593-348-4, Link, Document Cited by: §2.1.2.
  • F. Bourdoncle (1993) Efficient chaotic iteration strategies with widenings. In Formal Methods in Programming and Their Applications, D. Bjørner, M. Broy, and I. V. Pottosin (Eds.), Berlin, Heidelberg, pp. 128–141. External Links: ISBN 978-3-540-48056-3 Cited by: §3.
  • J. M. Cobleigh, L. A. Clarke, and L. J. Osterweil (2001) The Right Algorithm at the Right Time: Comparing Data Flow Analysis Algorithms for Finite State Verification. In Proceedings of the 23rd International Conference on Software Engineering, ICSE ’01, Washington, DC, USA, pp. 37–46. External Links: ISBN 0-7695-1050-7, Link Cited by: §3.
  • R. D. Cosmo and S. Zacchiroli (2017) Software heritage: why and how to preserve software source code. In iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan. External Links: Link Cited by: §3.
  • R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen (2013) Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, Piscataway, NJ, USA, pp. 422–431. External Links: ISBN 978-1-4673-3076-3, Link Cited by: 4th item, §1, §1, §2.6, §3.
  • R. Dyer, H. A. Nguyen, H. Rajan, and T. N. Nguyen (2015) Boa: ultra-large-scale software repository and source-code mining. ACM Trans. Softw. Eng. Methodol. 25 (1), pp. 7:1–7:34. External Links: ISSN 1049-331X, Link, Document Cited by: 4th item, §1.
  • G. Gousios (2013) The GHTorrent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, Piscataway, NJ, USA, pp. 233–236. External Links: ISBN 978-1-4673-2936-1, Link Cited by: §1, §3.
  • M. Hind and A. Pioli (1998) Assessing the Effects of Flow-Sensitivity on Pointer Alias Analyses. In SAS, pp. 57–81. Cited by: §3.
  • S. Hong, H. Chafi, E. Sedlar, and K. Olukotun (2012) Green-marl: a dsl for easy and efficient graph analysis. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, New York, NY, USA, pp. 349–362. External Links: ISBN 978-1-4503-0759-8, Link, Document Cited by: §3.
  • S. Horwitz, T. Reps, and M. Sagiv (1995) Demand Interprocedural Dataflow Analysis. In Proceedings of the 3rd ACM SIGSOFT Symposium on Foundations of Software Engineering, SIGSOFT ’95, New York, NY, USA, pp. 104–115. External Links: ISBN 0-89791-716-2, Link, Document Cited by: §3.
  • S. S. Khairunnesa, H. A. Nguyen, T. N. Nguyen, and H. Rajan (2017) Exploiting implicit beliefs to resolve sparse usage problem in usage-based specification mining. In OOPSLA’17: The ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA’17. Cited by: §1.
  • G. A. Kildall (1973) A unified approach to global program optimization. In Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, POPL ’73, New York, NY, USA, pp. 194–206. External Links: Link, Document Cited by: §3.
  • P. Lam, E. Bodden, O. Lhoták, and L. Hendren (2011) The soot framework for java program analysis: a retrospective. In Cetus Users and Compiler Infrastructure Workshop (CETUS 2011), Vol. 15, pp. 35. Cited by: §1.
  • V. B. Livshits and M. S. Lam (2005) Finding security vulnerabilities in java applications with static analysis. In Proceedings of the 14th Conference on USENIX Security Symposium - Volume 14, SSYM’05, Berkeley, CA, USA, pp. 18–18. External Links: Link Cited by: §2.6.
  • Y. Ma, C. Bogart, S. Amreen, R. Zaretzki, and A. Mockus (2019) World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data. In Proceedings of the 16th International Conference on Mining Software Repositories, MSR ’19, pp. 143–154. External Links: Link, Document Cited by: §3.
  • G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski (2010) Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, New York, NY, USA, pp. 135–146. External Links: ISBN 978-1-4503-0032-2, Link, Document Cited by: §3.
  • C. McMillan, M. Grechanik, D. Poshyvanyk, C. Fu, and Q. Xie (2012) Exemplar: a source code search engine for finding highly relevant applications. IEEE Trans. Softw. Eng. 38 (5), pp. 1069–1087. External Links: ISSN 0098-5589, Link, Document Cited by: §1.
  • H. A. Nguyen, R. Dyer, T. N. Nguyen, and H. Rajan (2014) Mining preconditions of apis in large-scale code corpus. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, New York, NY, USA, pp. 166–177. External Links: ISBN 978-1-4503-3056-5, Link, Document Cited by: §1, §2.6.
  • F. Nielson, H. R. Nielson, and C. Hankin (2010) Principles of program analysis. Springer Publishing Company, Incorporated. External Links: ISBN 3642084745, 9783642084744 Cited by: §2.1.1.
  • T. B. Tok, S. Z. Guyer, and C. Lin (2006) Efficient flow-sensitive interprocedural data-flow analysis in the presence of pointers. In Proceedings of the 15th International Conference on Compiler Construction, CC’06, Berlin, Heidelberg, pp. 17–31. External Links: ISBN 3-540-33050-X, 978-3-540-33050-9, Link, Document Cited by: §3.
  • G. Upadhyaya and H. Rajan (2017) On accelerating ultra-large-scale mining. In 2017 IEEE/ACM 39th International Conference on Software Engineering: New Ideas and Emerging Technologies Results Track (ICSE-NIER), pp. 39–42. External Links: Document Cited by: §3.
  • G. Upadhyaya and H. Rajan (2018) On accelerating source code analysis at massive scale. IEEE Transactions on Software Engineering 44 (7), pp. 669–688. External Links: Document, ISSN 2326-3881 Cited by: §3.
  • G. Upadhyaya and H. Rajan (2018) Collective program analysis. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 620–631. External Links: ISBN 978-1-4503-5638-1, Link, Document Cited by: §3.
  • R. Vallée-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam, and V. Sundaresan (1999) Soot - a java bytecode optimization framework. In Proceedings of the 1999 Conference of the Centre for Advanced Studies on Collaborative Research, CASCON ’99, pp. 13–. External Links: Link Cited by: §2.1.1.
  • M. N. Wegman and F. K. Zadeck (1991) Constant propagation with conditional branches. ACM Trans. Program. Lang. Syst. 13 (2), pp. 181–210. External Links: ISSN 0164-0925, Link, Document Cited by: §3.
  • T. Xie and J. Pei (2006) MAPO: mining api usages from open source repositories. In Proceedings of the 2006 International Workshop on Mining Software Repositories, MSR ’06, New York, NY, USA, pp. 54–57. External Links: ISBN 1-59593-397-2, Link, Document Cited by: §2.6.
  • F. Yamaguchi, N. Golde, D. Arp, and K. Rieck (2014) Modeling and discovering vulnerabilities with code property graphs. In Proceedings of the 2014 IEEE Symposium on Security and Privacy, SP ’14, Washington, DC, USA, pp. 590–604. External Links: ISBN 978-1-4799-4686-0, Link, Document Cited by: §1.
  • T. Zhang, G. Upadhyaya, A. Reinhardt, H. Rajan, and M. Kim (2018) Are code examples on an online q&a forum reliable? a study of api misuse on stack overflow. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA. Cited by: §1.

Appendix A: Omitted Results

| Analysis | Precision |
|---|---|
| DOM, PDOM, WNIL, UDV, UIR | 100.00% |
| CP, CSD, DC, LIC, USA, VFR, MWN, AE, LMA, LMNA, LV, NA, RD, RS, VBE, SS | 99.99% |
Table 5. Traversal strategy prediction precision.

Figure 4. Scatter charts for analyses that have loop-sensitive traversals, one panel per analysis: (a) CP, (b) CSD, (c) DC, (d) LIC, (e) USA, (f) VFR, (g) MWN, (h) AE, (i) LMA, (j) LMNA, (k) LV, (l) NA, (m) RD, (n) RS, (o) VBE, (p) SS. On the binary-valued axis, 1 indicates a correct traversal strategy prediction and 0 indicates a misprediction.
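The 1/0 encoding used in Figure 4 can be related to the precision values in Table 5 if precision is read as the share of CFGs whose predicted traversal strategy was correct. The following is a minimal sketch under that assumption; the function name `prediction_precision` and the sample outcomes are hypothetical and are not data from the evaluation.

```python
def prediction_precision(outcomes):
    """Return the percentage of correct predictions in a list of 1/0 outcomes.

    Assumption: 1 marks a correct traversal strategy prediction and
    0 marks a misprediction, as in the Figure 4 caption.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    if any(o not in (0, 1) for o in outcomes):
        raise ValueError("outcomes must be 0 or 1")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical sample: 9,999 correct predictions and one misprediction.
sample = [1] * 9999 + [0]
print(f"{prediction_precision(sample):.2f}%")
```

Under this reading, a value just below 100% corresponds to a very small number of mispredicted CFGs relative to the dataset size.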
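| Analysis | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CP | 32% | 0% | 13% | 0% | 0% | 0% | 0% | 0% | 5% | 0% | 50% |
| CSD | 32% | 0% | 13% | 0% | 0% | 0% | 0% | 0% | 5% | 0% | 50% |
| DC | 0% | 32% | 0% | 13% | 0% | 0% | 0% | 0% | 0% | 5% | 50% |
| LIC | 32% | 0% | 13% | 0% | 0% | 0% | 0% | 0% | 5% | 0% | 50% |
| USA | 32% | 0% | 13% | 0% | 0% | 0% | 0% | 0% | 5% | 0% | 50% |
| VFR | 32% | 0% | 13% | 0% | 0% | 0% | 0% | 0% | 5% | 0% | 50% |
| MWN | 32% | 0% | 13% | 0% | 0% | 0% | 0% | 0% | 5% | 0% | 50% |
| AE | 65% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 10% | 0% | 0% |
| DOM | 65% | 0% | 25% | 0% | 7% | 0% | 2% | 0% | 0% | 0% | 0% |
| LMA | 65% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 10% | 0% | 0% |
| LMNA | 65% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 10% | 0% | 0% |
| LV | 0% | 65% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 10% | 0% |
| NA | 65% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 10% | 0% | 0% |
| PDOM | 0% | 65% | 0% | 25% | 0% | 7% | 0% | 2% | 0% | 0% | 0% |
| RD | 65% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 10% | 0% | 0% |
| RS | 65% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 10% | 0% | 0% |
| VBE | 0% | 65% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 10% | 0% |
| SS | 65% | 0% | 25% | 0% | 0% | 0% | 0% | 0% | 10% | 0% | 0% |
| UDV | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 100% |
| UIR | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 100% |
| WNIL | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 0% | 100% |
| Overall | 32.46% | 9.27% | 12.69% | 3.62% | 0.26% | 0.26% | 0.10% | 0.10% | 4.50% | 1.04% | 35.70% |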
Figure 5. Distribution of decisions over the paths of the decision tree for the DaCapo Dataset. Background colors indicate the ranges of values: 0%, (0%, 1%), [1%, 10%) and [10%, 100%].
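Each row of the distribution in Figure 5 should add up to roughly 100%, up to rounding of the per-path percentages. The sketch below checks this for a few rows copied from the figure and reports the path that receives the largest share of decisions; the helper name `summarize` is hypothetical and the check is only an illustration, not part of the evaluation.

```python
# Rows copied from Figure 5: percentage of decisions per path P1..P11.
rows = {
    "CP": [32, 0, 13, 0, 0, 0, 0, 0, 5, 0, 50],
    "DOM": [65, 0, 25, 0, 7, 0, 2, 0, 0, 0, 0],
    "LV": [0, 65, 0, 25, 0, 0, 0, 0, 0, 10, 0],
    "UDV": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 100],
    "Overall": [32.46, 9.27, 12.69, 3.62, 0.26, 0.26,
                0.10, 0.10, 4.50, 1.04, 35.70],
}


def summarize(row):
    """Return (row total rounded to 2 decimals, label of the most used path)."""
    total = round(sum(row), 2)
    top_index = max(range(len(row)), key=row.__getitem__)
    return total, f"P{top_index + 1}"


summary = {name: summarize(row) for name, row in rows.items()}
for name, (total, top_path) in summary.items():
    print(f"{name}: total={total}%, most frequent path={top_path}")
```

The DOM row sums to 99% rather than 100%, which is consistent with the individual entries having been rounded to whole percentages.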