Data-driven techniques have been increasingly adopted in many software engineering (SE) tasks: API precondition mining (Nguyen et al., 2014; Khairunnesa et al., 2017), API usage mining (Acharya et al., 2007; Zhang et al., 2018), code search (McMillan et al., 2012), discovering vulnerabilities (Yamaguchi et al., 2014), to name a few. These data-driven SE tasks perform source code analysis over different program representations, such as source code text, abstract syntax trees (ASTs), and control-flow graphs (CFGs), at scale. For example, API precondition mining analyzes millions of methods that contain API call sites to capture conditions that are often checked before invoking the API. Source code mining infrastructures (Dyer et al., 2013; Bajracharya et al., 2014; Gousios, 2013) have started supporting CFG-level analysis to facilitate a variety of data-driven SE tasks.
The performance of source code analysis over CFGs heavily depends on the order in which nodes are visited during traversal: the traversal strategy. Several graph traversal strategies exist in the graph traversal literature, e.g., depth-first, post-order, reverse post-order, topological order, and worklist-based strategies. However, state-of-the-art analysis frameworks use a fixed traversal strategy. For example, the Soot analysis framework (Lam et al., 2011) uses a topological ordering of the nodes to perform control flow analysis. Our observation is that, when analyzing millions of programs with different characteristics, no single strategy performs best for all kinds of analyses and programs. Both the properties of the analysis and the properties of the input program influence the choice of traversal strategy. For example, for a control flow analysis that is data-flow sensitive, meaning the output for any node must be computed using the outputs of its neighbors, a traversal strategy that visits neighbors prior to visiting the node performs better than other kinds of traversal strategies. Similarly, if the CFG of the input program is sequential, a simple strategy that visits nodes in any order performs better than a more sophisticated strategy.
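To make one of the candidate strategies concrete, the reverse post-order traversal mentioned above can be sketched as follows (a minimal Python illustration assuming a dictionary-based CFG encoding; the actual implementation in this work is written in Boa):

```python
def reverse_post_order(cfg, entry):
    """Compute a reverse post-order over a CFG given as {node: [successors]}.

    For forward data-flow analyses, visiting nodes in reverse post-order
    ensures that most predecessors are processed before each node, which
    reduces the number of fixpoint iterations.
    """
    visited, order = set(), []

    def dfs(node):
        visited.add(node)
        for succ in cfg.get(node, []):
            if succ not in visited:
                dfs(succ)
        order.append(node)  # post-order: a node is emitted after its successors

    dfs(entry)
    order.reverse()  # reversing the post-order yields reverse post-order
    return order

# A small CFG with a branch: 0 -> {1, 2}, both joining at 3
cfg = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(reverse_post_order(cfg, 0))  # → [0, 2, 1, 3]
```

The entry node comes first and the join node last, so a single pass over this acyclic CFG already respects all dependencies of a forward analysis.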
We propose BCFA, a novel source code analysis technique for performing large-scale source code analysis over control flow graphs. Given an analysis and a large collection of CFGs on which the analysis needs to be performed, BCFA selects an optimal traversal strategy for each CFG. To achieve this, BCFA deploys a novel decision tree that combines a set of analysis properties with a set of graph properties of the CFG. The analysis properties include data-flow sensitivity, loop sensitivity, and traversal direction, and the graph properties include cyclicity (whether the CFG contains branches and loops). No existing technique automatically selects a suitable strategy based on analysis and CFG properties. Since manually extracting the properties can be infeasible when analyzing millions of CFGs, we provide a technique to extract the analysis properties by analyzing the source code of the analysis.
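As an illustration of the decision-tree idea, the selection can be sketched as below. This is a simplified assumption of the tree's shape, not BCFA's exact tree; the property names and branch order are illustrative only, and the strategy names follow the standard candidates (ANY, PO, RPO, and worklist variants):

```python
def select_strategy(data_flow_sensitive, direction, cyclicity):
    """Sketch: map two analysis properties (data-flow sensitivity, traversal
    direction) and one graph property (cyclicity) to a traversal strategy."""
    if not data_flow_sensitive:
        return "ANY"   # node outputs don't depend on neighbors: any order works
    if cyclicity == "loops":
        # cyclic CFG: iterate to a fixpoint, revisiting only changed nodes
        return "WRPO" if direction == "forward" else "WPO"
    # acyclic CFG: a single pass in the direction of the analysis suffices
    return "RPO" if direction == "forward" else "PO"

print(select_strategy(False, "forward", "loops"))       # → ANY
print(select_strategy(True, "forward", "branches"))     # → RPO
print(select_strategy(True, "backward", "loops"))       # → WPO
```

The point of the sketch is that both inputs matter: the same analysis can get a cheap single-pass strategy on an acyclic CFG and a worklist strategy on a cyclic one.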
We have implemented BCFA in Boa, a source code mining infrastructure (Dyer et al., 2013, 2015), and evaluated it using a set of 21 source code analyses consisting mainly of control and data-flow analyses. The evaluation is performed on two datasets: a dataset containing well-maintained projects from the DaCapo benchmark (with a total of 287K CFGs), and an ultra-large dataset containing more than 380K projects from GitHub (with a total of 162M CFGs). Our results show that BCFA can speed up large-scale analyses by 1%-28% by selecting the most time-efficient traversal strategy. We also found that BCFA has low overhead for computing the analysis and graph properties (less than 0.2%) and a low misprediction rate (less than 0.01%).
In summary, this paper makes the following contributions:
It proposes a set of analysis properties and a set of graph properties that influence the selection of traversal strategy for CFGs.
It describes a novel decision tree for selecting the most suitable traversal strategy using the analysis and the graph properties.
It provides a technique to automatically extract the analysis properties by analyzing the source code of the analysis.
2. Empirical Evaluation
We conducted an empirical evaluation on a set of 21 basic source code analyses on two public massive code datasets to evaluate several aspects of BCFA. First, we show the benefit of BCFA over standard strategies by evaluating the reduction in running times of BCFA over the standard ones (§2.2). Then, we evaluate the correctness of the analysis results produced using BCFA to show that the decision analyses and optimizations in BCFA do not affect the correctness of the source code analyses (§2.3). We also evaluate the precision of our selection algorithm by measuring how often BCFA selects the most time-efficient traversal (§2.4). We evaluate how the different components of BCFA and different kinds of static and runtime properties impact the overall performance in §2.5. Finally, we show practical uses of BCFA in three applications in §2.6.
2.1. Analyses, Datasets and Experiment Setting
|1||Copy propagation (CP)||3||✗||✗||—||✓||✓||✗||✗||—|
|2||Common sub-expression detection (CSD)||3||✗||✗||—||✓||✓||✗||✗||—|
|3||Dead code (DC)||3||✗||✗||—||✓||✓||✗||✗||—|
|4||Loop invariant code (LIC)||3||✗||✗||—||✓||✓||✗||✗||—|
|5||Upsafety analysis (USA)||3||✗||✗||—||✓||✓||✗||✗||—|
|6||Valid FileReader (VFR)||3||✗||✗||—||✓||✓||✗||✗||—|
|7||Mismatched wait/notify (MWN)||3||✗||✗||—||✓||✓||✗||✗||—|
|8||Available expression (AE)||2||✗||✗||—||✓||✓|
|10||Local may alias (LMA)||2||✗||✗||—||✓||✓|
|11||Local must not alias (LMNA)||2||✗||✗||—||✓||✓|
|12||Live variable (LV)||2||✗||✗||—||✓||✓|
|13||Nullness analysis (NA)||2||✗||✗||—||✓||✓|
|15||Reaching definition (RD)||2||✗||✗||—||✓||✓|
|16||Resource status (RS)||2||✗||✗||—||✓||✓|
|17||Very busy expression (VBE)||2||✗||✗||—||✓||✓|
|18||Safe Synchronization (SS)||2||✗||✗||—||✓||✓|
|19||Used and defined variable (UDV)||1||✗||✗||—|
|20||Useless increment in return (UIR)||1||✗||✗||—|
|21||Wait not in loop (WNIL)||1||✗||✗||—|
We collected source code analyses that traverse CFGs from textbooks and tools. We also ensured that the analysis list covers all the static properties discussed in §LABEL:sec:compute-properties, i.e., data-flow sensitivity, loop sensitivity and traversal direction (forward, backward and iterative). We ended up with 21 source code analyses, as shown in Table 1. They include 10 basic ones (analyses 1, 2, 8, 9, 10, 11, 12, 14, 15 and 19) from textbooks (Aho et al., 1986; Nielson et al., 2010) and 11 others for detecting source code bugs and code smells from the Soot framework (Vallée-Rai et al., 1999) (analyses 3, 4, 5, 13, 17 and 18) and the FindBugs tool (Ayewah et al., 2007) (analyses 6, 7, 16, 20 and 21). Table 1 also shows the number of traversals each analysis contains and their static properties as described in §LABEL:sec:compute-properties. All analyses are intra-procedural. We implemented all twenty-one of these analyses in Boa using the constructs described in §LABEL:sec:language. (Our implementation infrastructure Boa currently supports only method-level analysis; however, our technique should be applicable to inter-procedural CFGs.)
|Dataset||All graphs||Sequential||Branches (no loops)||Loops||Loops with branches||Loops without branches|
|DaCapo||287K||186K (65%)||73K (25%)||28K (10%)||21K (7%)||7K (2%)|
|GitHub||161,523K||111,583K (69%)||33,324K (21%)||16,617K (10%)||11,674K (7%)||4,943K (3%)|
We ran the analyses on two datasets: the DaCapo 9.12 benchmark (Blackburn et al., 2006) and a large-scale dataset containing projects from GitHub. The DaCapo dataset contains the source code of 10 open source Java projects: Apache Batik, Apache FOP, Apache Aurora, Apache Tomcat, Jython, Xalan-Java, PMD, H2 database, Sunflow and Daytrader. The GitHub dataset contains the source code of more than 380K Java projects collected from GitHub.com. Each method in the datasets was used to generate a control flow graph (CFG) on which the analyses would be run. The statistics of the two datasets are shown in Table 2. Both have similar distributions of CFGs over graph cyclicity (i.e., sequential, branch, and loop).
We compared BCFA against six standard traversal strategies described in §LABEL:sec:candidates: DFS, PO, RPO, WPO, WRPO and ANY. The running time for each analysis is measured from the start to the end of the analysis. The running time for BCFA also includes the time for computing the static and runtime properties, making the traversal strategy decision, optimizing it, and then using the optimized traversal strategy to traverse the CFG and run the analysis. The analyses on the DaCapo dataset were run on a single machine with 24 GB of RAM and 24 cores running a Linux 3.5.6-1.fc17 kernel. Running the analyses on the GitHub dataset on a single machine would take weeks to finish, so we ran them on a cluster running standard Hadoop 1.2.1 with 1 name and job tracker node, 10 compute nodes with 148 cores in total, and 1 GB of RAM for each map/reduce task.
2.2. Running Time and Time Reduction
We first report the running times and then study the reductions (or speedups) against the standard traversal strategies.
2.2.1. Running Time
Table 3 shows the running times for the 21 analyses on the two datasets. On average (column Avg. Time), each analysis took 0.14–0.21 ms and 0.005–0.012 ms to analyze a graph in the DaCapo and GitHub datasets, respectively. The variation in the average analysis time is mainly due to the difference in the machines used to run the analyses on the DaCapo and GitHub datasets. Also, the graphs in DaCapo are on average much larger than those in GitHub. Columns Static and Runtime show the time contributions of the different components of BCFA: the time for determining the static properties of each analysis, which is done once per analysis, and the time for constructing the CFG of each method and traversing it, which is done once per constructed CFG. We can see that the time for collecting static information is negligible, less than 0.2% for the DaCapo dataset and less than 0.01% for the GitHub dataset, when compared to the total runtime information collection time, as it is performed only once per traversal. When compared to the average runtime information collection time, the static time is quite significant. However, the overhead introduced by the static information collection phase diminishes as the number of CFGs increases and becomes insignificant when running on these two large datasets. This result shows the benefit of BCFA when applied to large-scale analyses.
2.2.2. Time Reduction
To evaluate the running-time efficiency of BCFA over the other strategies, we ran the 21 analyses on the DaCapo and GitHub datasets using BCFA and each of the standard strategies. When comparing BCFA to a standard strategy s, we computed the reduction rate R = (T_s − T_BCFA) / T_s, where T_s and T_BCFA are the running times using the standard strategy and BCFA, respectively. Some analyses have worst-case traversal strategies which might not be feasible to run on the GitHub dataset with 162 million graphs. For example, using post-order for a forward data-flow analysis visits the CFGs in the direction opposite to the natural direction of the analysis and hence takes a long time to complete. For such combinations of analyses and traversal strategies, the map and reduce tasks time out in the cluster setting and, thus, we do not have running times. The corresponding cells in Figure 1(a) are denoted with the symbol –.
The results in Figure 1(a) show that BCFA helps reduce the running times in almost all cases. The values indicate the reduction in running time from adopting BCFA compared against the standard strategies. Most of the positive reductions are above 10%, and many exceed 50%. Compared to the most time-efficient strategy for each analysis, BCFA could speed up from 1% (UIR with RPO) to 28% (RS with WRPO). More importantly, the most time-efficient and the worst traversal strategies vary across the analyses, which supports the need for BCFA. Over all analyses, the reduction was highest against the any-order and post-order (PO and WPO) strategies. The reduction was lowest against the depth-first search (DFS) and worklist with reverse post-ordering (WRPO) strategies. When compared with the next best performing traversal strategy for each analysis, BCFA reduces the overall execution time by about 13 to 72 minutes on the GitHub dataset. We do not report the overall numbers for the GitHub dataset due to the presence of failed runs.
Figure 1(b) shows time reductions for different types of analyses. For data-flow sensitive ones, the reduction rates were high, ranging from 32% to 84%. The running time was not improved much for non data-flow sensitive traversals, which correspond to the last three rows in Figure 1(a) (mostly single-digit reductions). BCFA actually performs almost the same as the ANY-order traversal strategy for analyses in this category. This is because the any-order traversal strategy is the best strategy for all the CFGs in these analyses; BCFA also chooses the any-order traversal strategy and, thus, the performance is the same.
Figure 1(c) shows time reductions for different cyclicity types of input graphs. We can see that reductions over graphs with loops are the highest and those over sequential graphs are the lowest.
2.3. Correctness of Analysis Results
To evaluate the correctness of the analysis results, we first chose the worklist strategy as the standard to run the analyses on the DaCapo dataset and create the ground truth of results. We then ran the analyses using our hybrid approach and compared the results with the ground truth. In all analyses on all input graphs from the dataset, the results from BCFA always exactly matched the corresponding ones in the ground truth.
2.4. Traversal Strategy Selection Precision
In this experiment, we evaluated how well BCFA picks the most time-efficient strategy. We ran the 21 analyses on the DaCapo dataset using all the candidate traversals and the one selected by BCFA. One selection is counted for each pair of a traversal and an input graph, where BCFA selects a traversal strategy based on the properties of the analysis and input graph. A selection is considered correct if its running time is at least as good as the running time of the fastest among all candidates. The precision is computed as the ratio of the number of correct selections to the total number of selections. The precision was 100% and 99.99% for loop insensitive and loop sensitive traversals, respectively.
|Analyses||Selection precision|
|DOM, PDOM, WNIL, UDV, UIR||100.00%|
|CP, CSD, DC, LIC, USA, VFR, MWN, AE, LMA, LMNA, LV, NA, RD, RS, VBE, SS||99.99%|
As shown in Table 5, the selection precision is 100% for all analyses that are not loop sensitive. For analyses that involve loop sensitive traversals, the selection precision is 99.99%. Further analysis revealed that the selection precision is 100% for sequential CFGs and for CFGs with branches but no loops: BCFA always picks the most time-efficient traversal strategy. For CFGs with loops, the selection precision is 100% for loop insensitive traversals. The mispredictions occur with loop sensitive traversals on CFGs with loops. This is because, for loop sensitive traversals, BCFA picks worklist as the best strategy. The worklist approach is picked because it visits only as many nodes as needed, whereas other traversal strategies visit redundant nodes. However, using a worklist imposes the overhead of creating and maintaining a worklist containing all nodes in the CFG. This overhead is negligible for small CFGs, but when running analyses on large CFGs, it can exceed the cost of visiting redundant nodes. Therefore, selecting worklist for loop sensitive traversals on large CFGs might not always result in the best running times.
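The worklist behavior described above can be sketched generically (a Python illustration with assumed helper signatures, not Boa code): a node is re-enqueued only when a predecessor's output changes, which is exactly why redundant visits are avoided at the price of building and maintaining the list.

```python
from collections import deque

def worklist_fixpoint(cfg, preds, init, transfer, merge):
    """Generic forward data-flow solver over a CFG given as
    {node: [successors]} with a predecessor map {node: [predecessors]}."""
    out = {n: init for n in cfg}
    work = deque(cfg)                      # initially every node is enqueued
    while work:
        n = work.popleft()
        ins = [out[p] for p in preds.get(n, [])]
        new = transfer(n, merge(ins) if ins else init)
        if new != out[n]:                  # output changed:
            out[n] = new
            work.extend(cfg[n])            # only successors need revisiting
    return out

# Example analysis: for each node, the set of nodes that can reach it
# (including itself), on a CFG with a loop between nodes 1 and 2.
cfg = {0: [1], 1: [2], 2: [1, 3], 3: []}
preds = {1: [0, 2], 2: [1], 3: [2]}
result = worklist_fixpoint(
    cfg, preds,
    init=frozenset(),
    transfer=lambda n, v: v | {n},
    merge=lambda vals: frozenset().union(*vals),
)
print(sorted(result[3]))  # → [0, 1, 2, 3]
```

Stable regions of the CFG are never revisited, but the deque and the change checks are pure overhead on a CFG that a single reverse post-order pass would already solve.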
2.5. Analysis on Traversal Optimization
We evaluated the importance of optimizing the chosen traversal strategy by comparing BCFA with the non-optimized version. Figure 2 shows the reduction rates in running times for the 21 analyses. For analyses that involve at least one data-flow sensitive traversal, the optimization helps reduce running time by at least 60%. This is because the optimizations in such traversals reduce the number of iterations over the graphs by eliminating redundant result re-computation traversals and unnecessary fixpoint condition checking traversals. For analyses involving only data-flow insensitive traversals, there is no reduction in execution time, as BCFA does not attempt to optimize them.
2.6. Case Studies
This section presents three applications, adopted from prior works, that benefited significantly from the BCFA approach. These applications include one or more of the analyses listed in Table 1. We computed the reduction in overall analysis time when compared to the WRPO traversal strategy (the second best performing traversal after BCFA); the results are shown in Figure 3.
|Application||BCFA||WRPO||Reduction|
|APM||1527 min.||1702 min.||10%|
|AUM||883 min.||963 min.||8%|
|SVT||1417 min.||1501 min.||6%|
API Precondition Mining (APM). This case study mines a large corpus of API usages to derive potential preconditions for API methods (Nguyen et al., 2014). The key idea is that API preconditions would be checked frequently in a corpus with a large number of API usages, while project-specific conditions would be less frequent. This case study mined the preconditions for all methods of java.lang.String.
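The frequency-based intuition behind APM can be sketched as follows (a hypothetical Python illustration with made-up condition strings; the actual mining in Nguyen et al. (2014) normalizes and ranks guard conditions more carefully):

```python
from collections import Counter

def mine_preconditions(call_sites, min_support=0.5):
    """call_sites: one set of normalized guard conditions per API call site.
    Conditions guarding at least min_support of all sites are reported as
    likely preconditions; rarer ones are treated as project-specific."""
    counts = Counter(c for site in call_sites for c in site)
    n = len(call_sites)
    return [c for c, k in counts.most_common() if k / n >= min_support]

# Guards observed before three hypothetical String API call sites
sites = [{"s != null", "i >= 0"}, {"s != null"}, {"s != null", "flag"}]
print(mine_preconditions(sites))  # → ['s != null']
```

The condition checked at every call site survives the support threshold, while the one-off, project-specific guards are filtered out.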
API Usage Mining (AUM). This case study analyzes API usage code and mines API usage patterns (Xie and Pei, 2006). The mined patterns help developers understand and write API usages more effectively with fewer errors. Our analysis mined usage patterns for java.util APIs.
Finding Security Vulnerabilities with Tainted Object Propagation (SVT). This case study formulated a variety of widespread security vulnerabilities, such as SQL injection, as tainted object propagation problems (Livshits and Lam, 2005). Our analysis looked for all SQL injection vulnerabilities matching the specifications in the statically analyzed code.
Figure 3 shows that BCFA helps reduce running times significantly, by 80–175 minutes, or 6%–10% in relative terms. To understand whether a 10% reduction is really significant, the context is important. Saving 3 hours (10%) on a parallel infrastructure is significant. If the underlying parallel infrastructure is open/free/shared (Dyer et al., 2013; Bajracharya et al., 2014), a 3-hour saving enables supporting more concurrent users and analyses. If the infrastructure is a paid cluster (e.g., AWS), 3 hours less computing time could translate into substantial cost savings.
2.7. Threats to Validity
Our datasets do not contain a balanced distribution of graph cyclicity. The majority of graphs in both the DaCapo and GitHub datasets are sequential (65% and 69%, respectively) and only 10% have loops. The impact of this threat is that paths and decisions for sequential graphs are taken more often. This threat is not easy to mitigate, as it is not practical to find a code dataset with a balanced distribution of graphs of various types. Nonetheless, our evaluation shows that the selection and optimization of the best traversal strategy for the remaining 35% of the graphs (graphs with branches and loops) plays an important role in improving the overall performance of the analysis over a large dataset of graphs.
3. Related Work
Atkinson and Griswold (Atkinson and Griswold, 2001) discuss several implementation techniques for improving the efficiency of data-flow analysis, namely: factoring data-flow sets, visitation order of the statements, and selective reclamation of the data-flow sets. They discuss two commonly used traversal strategies, iterative search and worklist, and propose a new worklist algorithm that results in 20% fewer node visits. In their algorithm, a node is processed only if the data-flow information of any of its successors (or predecessors) has changed. Tok et al. (Tok et al., 2006) proposed a new worklist algorithm for accelerating inter-procedural flow-sensitive data-flow analysis. They generate inter-procedural def-use chains on the fly to be used in their worklist algorithm to re-analyze only the parts affected by changes in the flow values. Hind and Pioli (Hind and Pioli, 1998) proposed an optimized priority-based worklist algorithm for pointer alias analysis, in which the nodes awaiting processing are placed on a worklist prioritized by the topological order of the CFG, such that nodes higher in the CFG are processed before nodes lower in the CFG. Bourdoncle (Bourdoncle, 1993) proposed the notion of weak topological ordering (WTO) of directed graphs and two iterative strategies based on WTO for computing analysis solutions in the data-flow and abstract interpretation domains. Bourdoncle's technique is more suitable for cyclic graphs; for acyclic graphs, Bourdoncle proposes any topological ordering. Kildall (Kildall, 1973) proposes combining several optimizing functions with flow analysis algorithms for solving global code optimization problems. For some classes of data-flow analysis problems, there exist techniques for efficient analysis.
For example, demand interprocedural data-flow analysis (Horwitz et al., 1995) can produce precise results in polynomial time for inter-procedural, finite, distributive, subset (IFDS) problems, constant propagation (Wegman and Zadeck, 1991), etc. These works propose new traversal strategies for improving the efficiency of certain classes of source code analysis, whereas BCFA is a novel technique for selecting the best traversal strategy from a list of candidates, based on the static properties of the analysis and the runtime characteristics of the input graph.
Upadhyaya and Rajan (Upadhyaya and Rajan, 2018) proposed Collective Program Analysis (CPA), which leverages similarities between CFGs to speed up analyzing millions of CFGs by analyzing only the unique ones. CPA uses a pre-defined traversal strategy to traverse CFGs, whereas our technique selects an optimal traversal strategy and could be utilized in CPA. Upadhyaya and Rajan have also proposed an approach for accelerating ultra-large-scale mining by clustering the artifacts being mined (Upadhyaya and Rajan, 2017, 2018). BCFA and this approach share the goal of scaling large-scale mining but use complementary strategies.
Cobleigh et al. (Cobleigh et al., 2001) study the effect of worklist algorithms in model checking. They identified four dimensions along which a worklist algorithm can be varied and, based on these dimensions, evaluated nine variations of the worklist algorithm. They do not solve the traversal strategy selection problem, nor do they take analysis properties into account. We consider both static properties of the analysis, such as data-flow sensitivity and loop sensitivity, and the cyclicity of the graph. Further, we also consider non-worklist-based algorithms, such as post-order, reverse post-order, control flow order, and any order, as candidate strategies.
Several infrastructures exist today for performing ultra-large-scale analysis (Dyer et al., 2013; Bajracharya et al., 2014; Gousios, 2013; Cosmo and Zacchiroli, 2017; Ma et al., 2019). Boa (Dyer et al., 2013) is a language and infrastructure for analyzing open source projects. Sourcerer (Bajracharya et al., 2014) is an infrastructure for large-scale collection and analysis of open source code. GHTorrent (Gousios, 2013) is a dataset and tool suite for analyzing GitHub projects. These frameworks currently support structural or abstract syntax tree (AST) level analysis, and a parallel framework such as map-reduce is used to improve the performance of ultra-large-scale analysis. By selecting the best traversal strategy, BCFA could help improve their performance beyond parallelization.
Much work has targeted graph traversal optimization. Green-Marl (Hong et al., 2012) is a domain-specific language for expressing graph analysis. It uses the high-level algorithmic description of the graph analysis to exploit the exposed data-level parallelism. Green-Marl's optimization is similar to ours in utilizing the properties of the analysis description; however, BCFA also utilizes the properties of the graphs. Moreover, Green-Marl optimizes through parallelism while BCFA does so by selecting a suitable traversal strategy. Pregel (Malewicz et al., 2010) is a map-reduce-like framework that aims to bring distributed processing to graph algorithms. While Pregel's performance gain is through parallelism, BCFA achieves its gain by traversing the graph efficiently.
Acknowledgements. This material is based upon work supported by the National Science Foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
- Mining api patterns as partial orders from source code: from usage scenarios to specifications. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC-FSE ’07, New York, NY, USA, pp. 25–34. External Links: Cited by: §1.
- Compilers: principles, techniques, and tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. External Links: Cited by: §2.1.1.
- Implementation techniques for efficient data-flow analysis of large programs. In Proceedings of the IEEE International Conference on Software Maintenance (ICSM’01), ICSM ’01, Washington, DC, USA, pp. 52–. External Links: Cited by: §3.
- Evaluating static analysis defect warnings on production software. In Proceedings of the 7th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, PASTE ’07, New York, NY, USA, pp. 1–8. External Links: Cited by: §2.1.1.
- Sourcerer: an infrastructure for large-scale collection and analysis of open-source code. Sci. Comput. Program. 79, pp. 241–259. External Links: Cited by: §1, §2.6, §3.
- The dacapo benchmarks: java benchmarking development and analysis. In Proceedings of the 21st Annual ACM SIGPLAN Conference on Object-oriented Programming Systems, Languages, and Applications, OOPSLA ’06, New York, NY, USA, pp. 169–190. External Links: Cited by: §2.1.2.
- Efficient chaotic iteration strategies with widenings. In Formal Methods in Programming and Their Applications, D. Bjørner, M. Broy, and I. V. Pottosin (Eds.), Berlin, Heidelberg, pp. 128–141. External Links: Cited by: §3.
- The Right Algorithm at the Right Time: Comparing Data Flow Analysis Algorithms for Finite State Verification. In Proceedings of the 23rd International Conference on Software Engineering, ICSE ’01, Washington, DC, USA, pp. 37–46. External Links: Cited by: §3.
- Software heritage: why and how to preserve software source code. In iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan. External Links: Cited by: §3.
- Boa: a language and infrastructure for analyzing ultra-large-scale software repositories. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, Piscataway, NJ, USA, pp. 422–431. External Links: Cited by: 4th item, §1, §1, §2.6, §3.
- Boa: ultra-large-scale software repository and source-code mining. ACM Trans. Softw. Eng. Methodol. 25 (1), pp. 7:1–7:34. External Links: Cited by: 4th item, §1.
- The GHTorrent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, Piscataway, NJ, USA, pp. 233–236. External Links: Cited by: §1, §3.
- Assessing the Effects of Flow-Sensitivity on Pointer Alias Analyses. In SAS, pp. 57–81. Cited by: §3.
- Green-marl: a dsl for easy and efficient graph analysis. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, New York, NY, USA, pp. 349–362. External Links: Cited by: §3.
- Demand Interprocedural Dataflow Analysis. In Proceedings of the 3rd ACM SIGSOFT Symposium on Foundations of Software Engineering, SIGSOFT ’95, New York, NY, USA, pp. 104–115. External Links: Cited by: §3.
- Exploiting implicit beliefs to resolve sparse usage problem in usage-based specification mining. In OOPSLA’17: The ACM SIGPLAN conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA’17. Cited by: §1.
- A unified approach to global program optimization. In Proceedings of the 1st Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, POPL ’73, New York, NY, USA, pp. 194–206. External Links: Cited by: §3.
- The soot framework for java program analysis: a retrospective. In Cetus Users and Compiler Infastructure Workshop (CETUS 2011), Vol. 15, pp. 35. Cited by: §1.
- Finding security vulnerabilities in java applications with static analysis. In Proceedings of the 14th Conference on USENIX Security Symposium - Volume 14, SSYM’05, Berkeley, CA, USA, pp. 18–18. External Links: Cited by: §2.6.
- World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data. In Proceedings of the 16th International Conference on Mining Software Repositories, MSR ’19, pp. 143–154. External Links: Cited by: §3.
- Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, New York, NY, USA, pp. 135–146. External Links: Cited by: §3.
- Exemplar: a source code search engine for finding highly relevant applications. IEEE Trans. Softw. Eng. 38 (5), pp. 1069–1087. External Links: Cited by: §1.
- Mining preconditions of apis in large-scale code corpus. In Proceedings of the 22Nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2014, New York, NY, USA, pp. 166–177. External Links: Cited by: §1, §2.6.
- Principles of program analysis. Springer Publishing Company, Incorporated. External Links: Cited by: §2.1.1.
- Efficient flow-sensitive interprocedural data-flow analysis in the presence of pointers. In Proceedings of the 15th International Conference on Compiler Construction, CC’06, Berlin, Heidelberg, pp. 17–31. External Links: Cited by: §3.
- On accelerating ultra-large-scale mining. In 2017 IEEE/ACM 39th International Conference on Software Engineering: New Ideas and Emerging Technologies Results Track (ICSE-NIER), Vol. , pp. 39–42. External Links: Cited by: §3.
- On accelerating source code analysis at massive scale. IEEE Transactions on Software Engineering 44 (7), pp. 669–688. External Links: Cited by: §3.
- Collective program analysis. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA, pp. 620–631. External Links: Cited by: §3.
- Soot - a java bytecode optimization framework. In Proceedings of the 1999 Conference of the Centre for Advanced Studies on Collaborative Research, CASCON ’99, pp. 13–. External Links: Cited by: §2.1.1.
- Constant propagation with conditional branches. ACM Trans. Program. Lang. Syst. 13 (2), pp. 181–210. External Links: Cited by: §3.
- MAPO: mining api usages from open source repositories. In Proceedings of the 2006 International Workshop on Mining Software Repositories, MSR ’06, New York, NY, USA, pp. 54–57. External Links: Cited by: §2.6.
- Modeling and discovering vulnerabilities with code property graphs. In Proceedings of the 2014 IEEE Symposium on Security and Privacy, SP ’14, Washington, DC, USA, pp. 590–604. External Links: Cited by: §1.
- Are code examples on an online q&a forum reliable? a study of api misuse on stack overflow. In Proceedings of the 40th International Conference on Software Engineering, ICSE ’18, New York, NY, USA. Cited by: §1.