Identifying software defects through testing is a challenging problem.
Over the years, a number of approaches have been developed to test software, including random mutation testing (black box fuzzing) Doupé et al. (2012); Woo et al. (2013), abstract interpretation (of either source or machine code) Cousot and Cousot (1977); Cadar et al. (2008); Ma et al. (2011), and property based testing Arts et al. (2006); Claessen and Hughes (2011).
Methods such as symbolic and concolic execution have increased the fidelity of analyses run over programs Schwartz et al. (2010). The development of Satisfiability Modulo Theories (SMT) solvers such as Z3, Boolector, and others has enabled powerful automated reasoning about software De Moura and Bjørner (2008); Brummayer and Biere (2009). Separation logic has allowed analyses to be applied to complicated data structures Reynolds (2002); Dongol et al. (2015).
American Fuzzy Lop (AFL) is an advanced fuzzing framework that has been used to discover a number of novel software vulnerabilities (https://github.com/mrash/afl-cve). AFL uses random mutations of byte strings to identify unique code paths and discover defects in target programs. The inputs that successfully generated unique code paths are then documented as "seed files". We propose to use these native seed files as training data for deep generative models to create augmented seed files. Our proposed reinitialization methods are a scalable process that can improve the time to discovery of software defects.
Other researchers have used machine learning to augment fuzzing frameworks, including Godefroid et al. (2017) and Wang et al. (2017). To identify deeper bugs and code paths, Steelix Li et al. (2017) uses a program-state based binary fuzzing approach, and Driller Stephens et al. (2016) demonstrates a hybrid approach using fuzzing and selective concolic execution. AFLFast Böhme et al. (2016) models coverage-based greybox fuzzing as a Markov chain.
Deep Neural Networks (DNNs) Bengio et al. (2015) have had great success in the fields of Natural Language Processing (NLP) Jones (2014); Wu et al. (2016), image classification Krizhevsky et al. (2012), and the playing of bounded games such as Go Silver et al. (2016) or video games like ATARI Mnih et al. (2013). Can these DNNs help existing program analysis tools perform better? In our work we investigate that question. We augment AFL Zalewski (2015), an advanced fuzzing framework, using Generative Adversarial Networks (GAN) Goodfellow et al. (2014) and Long Short Term Memory (LSTM) Sak et al. (2014) to increase its rate of unique code path discovery.
Our work quantifies the benefits that augmentation strategies such as generative models can provide, even when limited by small quantities of training data. By periodically perturbing the state of AFL as it explores the input space, we are able to improve its performance, as measured by unique code paths. Specifically, we test our approach on the software ecosystem surrounding the Ethereum Wood (2014) project. As a financial system, correctness of the Ethereum code base is important for guaranteeing that transactions or calculations run without fault. We choose ethkey as an initial fuzzing target. Ethkey is a small C++ program provided as part of the cpp-ethereum project used to load, parse, and perform maintenance on Ethereum wallets, and importantly, takes a simple input file, making it easy to test with AFL.
2 Experimental Design
First, we describe the basic functionality of AFL, highlighting the key features that connect with the proposed augmentation framework. Next, we describe the methodology used to create the LSTM and GAN generated seed files. As a baseline, we also consider random generation of seed files from the training data used to construct the LSTM and GAN models. AFL instruments target programs through extensions to the GCC compiler and, in conjunction with genetic algorithms, uses that instrumentation to create seed files. Each seed file documents the input that yielded a unique code path and the time of discovery, and is used as the basis for mutation (or fuzzing) to generate future seed files. Our augmentation strategy takes advantage of the fact that if an external tool places additional seed files in the AFL working directory, AFL will use those files as inputs in subsequent fuzzing runs.
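The hand-off described above, in which an external tool drops extra seed files where AFL will pick them up, can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `inject_seed_files`, the `synthetic_` file naming scheme, and the idea that `target_dir` is the directory AFL reads from are all assumptions made for the example.

```python
import hashlib
from pathlib import Path


def inject_seed_files(seeds, target_dir, prefix="synthetic"):
    """Write each byte string in `seeds` to its own file under `target_dir`.

    `target_dir` stands in for whichever directory the running AFL instance
    reads new inputs from (an assumption; the real path depends on the setup).
    Empty seeds and exact duplicates are skipped. Returns the paths written.
    """
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    seen = set()
    for seed in seeds:
        digest = hashlib.sha256(seed).hexdigest()
        if not seed or digest in seen:
            continue
        seen.add(digest)
        # Hypothetical naming scheme: a prefix plus a zero-padded counter.
        path = out / f"{prefix}_{len(written):06d}"
        path.write_bytes(seed)
        written.append(path)
    return written
```

Skipping exact duplicates before writing keeps the injected set small; whether AFL would discard such duplicates on its own is not stated in the text, so the sketch does not rely on it.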
To produce the training data for our methods, we run AFL on a target program for a fixed amount of time. AFL generates an initial set of seed files, one for each unique execution trace taken through the target program. We use these seed files as training examples for the LSTM and GAN models, which are both trained using Keras Chollet (2017).
Our LSTM is trained on the AFL-generated seed file corpus, concatenated into a single file, and generates new seed files up to a fixed maximum length in characters. The LSTM model has a wide initial layer, an internal dense layer, and a final softmax activation layer. To train the LSTM model, we use RMS propagation (RMSprop) as our optimizer and a categorical cross-entropy loss function. The model takes in a seed sequence sampled from the training corpus and predicts the next character in the sequence. We additionally tune a separate temperature parameter to diversify the output seed files from the network. We refer to the generated seed files as the LSTM seed files.
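Two data-handling steps in this paragraph lend themselves to a short sketch: slicing the concatenated corpus into (context, next character) training pairs, and rescaling a predicted next-character distribution with a temperature. The code below is a library-free illustration under stated assumptions; the window size of 8 and the function names are hypothetical, and the network itself is not shown.

```python
import math


def build_training_pairs(seed_files, window=8):
    """Concatenate seed files and slice (context, next_byte) training pairs.

    `window` is a hypothetical context length; the text does not give one.
    """
    corpus = b"".join(seed_files)
    return [(corpus[i:i + window], corpus[i + window])
            for i in range(len(corpus) - window)]


def apply_temperature(probs, temperature=1.0):
    """Rescale a next-character distribution by a temperature.

    Temperatures below 1 sharpen the distribution (less diverse output);
    temperatures above 1 flatten it (more diverse output).
    """
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

Sampling the next character from the rescaled distribution, rather than always taking the most likely one, is what lets a single trained model emit varied seed files.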
In our GAN architecture, two models are built: a generator G, which is pitted against a discriminator D. G is optimized to generate realistic output, and D has the task of predicting whether the data is real or fake. This training strategy is unsupervised and particularly expressive. The generative model G is a fully connected 2 layer DNN with a ReLU non-linearity as the inner activation and a tanh output activation. It is trained with a binary cross-entropy loss function via stochastic gradient descent. The discriminative model D is a 3 layer DNN in which the first layer has 25% dropout and is followed by two fully connected layers. It is trained with the Adam optimizer. We refer to the seed files resulting from the GAN process as the GAN seed files.
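The text does not say how seed bytes are presented to the GAN. Because the generator ends in a tanh activation, its outputs lie in [-1, 1], so one plausible encoding (an assumption, not the authors' stated method) pads each seed file to a fixed length and scales bytes linearly into that range. The sketch below shows that encoding, its inverse, and the binary cross-entropy loss named above; all function names and the zero-byte padding are hypothetical.

```python
import math


def bytes_to_vector(seed, length):
    """Pad or truncate `seed` to `length` bytes and scale into [-1, 1].

    Zero-byte padding and a fixed `length` are assumptions for illustration.
    """
    padded = seed[:length].ljust(length, b"\x00")
    return [b / 127.5 - 1.0 for b in padded]


def vector_to_bytes(vector):
    """Map tanh-range values back to bytes, clamping to the range 0..255."""
    return bytes(min(255, max(0, round((v + 1.0) * 127.5))) for v in vector)


def binary_cross_entropy(label, prediction, eps=1e-12):
    """Loss for one discriminator output: label 1 means real, 0 means fake."""
    p = min(1.0 - eps, max(eps, prediction))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
```

With this encoding, a generated vector can be turned back into a candidate seed file with `vector_to_bytes`, at the cost of fixing every synthetic file to the same length.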
Additionally, given the native AFL seed files, we randomly draw bytes from this training set and produce new, random seed files of matching length. This serves as a baseline to determine whether the added time and complexity of GAN and LSTM based seed generation truly provide an advantage over a simple strategy of randomly perturbing AFL’s state.
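A minimal sketch of this random baseline follows, assuming that bytes are drawn uniformly with replacement from the pooled contents of the native seed files. The function name, the explicit `lengths` argument, and the injectable random generator are choices made for the example rather than details given in the text.

```python
import random


def random_seed_files(native_seeds, lengths, rng=None):
    """Build one random seed file per entry of `lengths`.

    Bytes are drawn uniformly, with replacement, from the pooled bytes of
    `native_seeds` (an assumed sampling scheme). No learning is involved.
    """
    rng = rng or random.Random()
    pool = b"".join(native_seeds)
    if not pool:
        raise ValueError("native_seeds contains no bytes to sample from")
    return [bytes(rng.choice(pool) for _ in range(n)) for n in lengths]
```

Passing a seeded `random.Random` instance makes a baseline run repeatable, which helps when comparing it against the learned generators.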
Small Experiment: The seed files (LSTM, GAN, and random) alone are not an end goal. However, we are interested in characterizing their variability and other properties, as they provide a set of initial conditions when AFL is restarted. A fuzzing run on a single CPU core produces the unique code paths used to train the initial GAN and LSTM models. Random seed generation is performed by drawing random bytes from /dev/urandom. For each method, we generate a set of samples, reinitialize AFL on a single CPU with only the seed files of that method, and run for a fixed number of additional hours to measure the impact on code path discovery. Both LSTM and GAN models slightly outperform random sampling for AFL reinitialization. We summarize the resulting mean time to discover new code paths in Table 1.
Each seed file produces a program trace when supplied as an input to ethkey. Code paths that have different lengths will differ in at least one basic block or branch instruction. The unique code path length is fast to compute but only provides a lower bound on the number of unique code paths exercised by the testing framework across fuzzing runs using AFL. Two code paths with the same length can result from distinct traces; thus, a detailed evaluation is needed to determine the true uniqueness of a code path from seed file execution.
| Class C | Sec/Path | Relative Rate |
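The two numeric columns of Table 1 can be read as a mean discovery time and a ratio against a reference class. The sketch below shows one way to compute them; the definition of the relative rate as baseline seconds-per-path divided by the method's seconds-per-path is an assumption, since the text does not define it, and the numbers in the usage note are hypothetical.

```python
def seconds_per_path(elapsed_seconds, new_paths):
    """Mean time, in seconds, to discover one new code path."""
    if new_paths <= 0:
        raise ValueError("new_paths must be positive")
    return elapsed_seconds / new_paths


def relative_rate(method_sec_per_path, baseline_sec_per_path):
    """Discovery rate relative to a baseline (assumed definition).

    Values above 1 mean the method finds new paths faster than the baseline.
    """
    return baseline_sec_per_path / method_sec_per_path
```

For instance, under these assumed definitions a method that needs 40 seconds per path against a baseline of 50 seconds per path has a relative rate of 1.25.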
Large Experiment: To demonstrate the scalability of this augmentation strategy, we performed an extended, multi-core run of AFL. Each core in the AFL run stopped finding seed files after the first 10 to 12 hours of fuzzing, and the run accumulated a total of 39,185 seed files across 49 workers. All seed files produced within a given node are known to be unique, due to AFL’s internal bookkeeping mechanism. However, seed files whose contents differ across nodes could in principle exercise the same code path. By measuring the length of each program trace (code path), we can compute a lower bound on the number of unique paths discovered by counting only paths that have a unique length. After removing identical seed files from across the nodes, and seed files that resulted in the same code path length, we estimate 802 of the initial files were duplicates from the independent worker nodes. Removing those duplicates resulted in a total of 38,384 unique files.
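The two filtering steps described here, dropping byte-identical seed files across workers and then counting distinct trace lengths, can be sketched as follows. How trace lengths are measured is outside this example; the functions simply take them as integers, and the names and the per-worker dictionary layout are assumptions.

```python
import hashlib


def deduplicate_seeds(seeds_by_worker):
    """Return seed contents with byte-identical duplicates removed.

    `seeds_by_worker` maps a worker name to its list of seed byte strings.
    Workers are visited in sorted order so the result is deterministic.
    """
    seen = set()
    unique = []
    for worker in sorted(seeds_by_worker):
        for seed in seeds_by_worker[worker]:
            digest = hashlib.sha256(seed).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(seed)
    return unique


def unique_path_lower_bound(trace_lengths):
    """Lower bound on distinct code paths: the number of distinct lengths.

    Traces of different length must differ somewhere, while traces of equal
    length may or may not be the same path, so this never overcounts.
    """
    return len(set(trace_lengths))
```

The bound is conservative by construction: two different paths that happen to share a length are counted once.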
We then trained GAN and LSTM networks on the total corpus of unique seed files and generated approximately 20,000 samples from each method, respectively, to use as synthetic seed files in order to reinitialize AFL. GAN took approximately 30 minutes to train and generate synthetic seed files, while LSTM took 14 hours to do so.
In Table 2 we summarize the mean and variance of the length of program traces (i.e., code paths) associated with the seed files from native AFL and from the synthetic generation methods for this larger experiment. The synthetic seed files, when provided as inputs to the program under test, do not cause deep paths to be explored, compared to AFL. So, we cannot simply replace AFL with a generative model. Instead, we seek to combine generative models with AFL to boost its performance. We see from this data that, in fact, the seed files generated by LSTM and GAN are not representative of the native AFL distribution in terms of the mean and variance of code paths generated. This reinforces the need to use the LSTM and GAN seed files as an augmentation strategy rather than a direct replacement of AFL seeds.
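The per-class summary statistics discussed above can be computed with the standard library, as in the sketch below. Whether the reported variance is a population or a sample variance is not stated, so the use of the population variance here is an assumption, as are the function name and the class labels in the usage note.

```python
from statistics import mean, pvariance


def summarize_trace_lengths(lengths_by_class):
    """Map each seed class to the (mean, variance) of its trace lengths.

    Uses the population variance (an assumption); each class needs at least
    one trace length.
    """
    return {name: (mean(lengths), pvariance(lengths))
            for name, lengths in lengths_by_class.items()}
```

Called on a dictionary such as `{"afl": [...], "gan": [...], "lstm": [...]}`, this yields one row per class, which makes a mismatch between the native and synthetic distributions easy to spot.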
Next, we performed a fixed number of hours of fuzzing with GAN, LSTM, and a random reinitialization strategy that uses a random sampling of bytes from the initial seed files (i.e., performing no learning on the seed files). Table 3 summarizes our results. All three strategies allowed for additional seed files to be generated. The GAN-based approach produced seed files 14.23% faster than the random approach and 60.72% faster than using LSTM. We do lose 30 minutes of training time for GAN that could otherwise be used for fuzzing using the random sampling method; discounting by this amount of time reduces the code path rate to an 11.85% improvement. However, we are most interested in unique code paths. GAN found the greatest number of seed files whose associated code paths had lengths not found in the initial fuzzing run, outperforming the random control approach by 6.16%. The average code path length discovered by GAN was 13.84% longer than the random control, so GAN is capable of exercising deeper paths in the program. The LSTM model underperformed both GAN and random sampling and took a substantially longer time (14 hours) to train.
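The notion of novelty used in this comparison, code path lengths that did not occur in the initial fuzzing run, reduces to a set difference, and the percentages are relative differences against the random control. The sketch below illustrates both; the function names are hypothetical and no values from the experiments are reproduced.

```python
def novel_path_lengths(initial_lengths, new_lengths):
    """Trace lengths seen after reinitialization but not in the initial run."""
    return set(new_lengths) - set(initial_lengths)


def percent_improvement(value, baseline):
    """Relative difference of `value` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (value - baseline) / baseline
```

Counting novel lengths rather than novel files keeps the comparison tied to the lower bound on unique paths, so a strategy cannot score well merely by emitting many inputs that retrace known paths.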
| Class C | \|C\| | L(C) | Novel | L(C) Rate | Novel Rate |
In this work, we explored the utility of augmenting random mutation testing with deep neural models. Natively, AFL combines file mutation strategies from genetic algorithms with program instrumentation via compiler plugins. We observed the most improvement in AFL’s performance when we restarted a fuzzing run mid-course using novel seed files built from a GAN model. Though the synthetic seed file statistics on average had similar path lengths, the GAN outperformed reinitialization from a random or LSTM strategy when restarting the fuzzing system. The LSTM model was deficient in both training time and code path discovery time. Both approaches used no manual analysis or information about file formats for the program under test. The GAN and random strategies both improve the performance of AFL, even though the internal state of the program is never directly exposed.
Future work of interest includes experimentation on additional targets, including the DARPA Cyber Grand Challenge problems, open source OS network services, bytecode interpreters, and other system applications and programs where input data is easily generated. We also plan to explore exposing the internal state of the program under test in order to define a reward function for reinforcement learning. We envision this internal state could be exposed by: 1) the instrumentation AFL adds to programs via its GCC compiler plugins, 2) using Intel’s PIN tool to output the length of each code path or summary information about a given trace, or 3) recording program traces using a replay framework such as Mozilla’s rr tool in order to collect additional descriptive statistics.
The authors would like to thank Court Corley, Nathan Hodas, and Sam Winters for useful discussions. The research described in this paper is part of the Deep Science Initiative at Pacific Northwest National Laboratory. It was conducted under the Laboratory Directed Research and Development Program at PNNL, a multi-program national laboratory operated by Battelle for the U.S. Department of Energy.
- Arts et al. (2006): Arts, Thomas; Hughes, John; Johansson, Joakim; Wiger, Ulf: Testing telecoms software with quviq quickcheck. In: Proceedings of the 2006 ACM SIGPLAN workshop on Erlang, ACM, 2006, pp. 2–10
- Bengio et al. (2015): Bengio, Yoshua; Goodfellow, Ian J.; Courville, Aaron: Deep learning. In: Nature 521 (2015), pp. 436–444
- Böhme et al. (2016): Böhme, Marcel; Pham, Van-Thuan; Roychoudhury, Abhik: Coverage-based Greybox Fuzzing As Markov Chain. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security (CCS ’16). New York, NY, USA: ACM, 2016, pp. 1032–1043. URL http://doi.acm.org/10.1145/2976749.2978428. ISBN 978-1-4503-4139-4
- Brooks (2017): Brooks, Teresa N.: Survey of Automated Vulnerability Detection and Exploit Generation Techniques in Cyber Reasoning Systems. In: CoRR abs/1702.06162 (2017). URL http://arxiv.org/abs/1702.06162
- Brummayer and Biere (2009): Brummayer, Robert; Biere, Armin: Boolector: An efficient SMT solver for bit-vectors and arrays. In: Tools and Algorithms for the Construction and Analysis of Systems (2009), pp. 174–177
- Cadar et al. (2008): Cadar, Cristian; Dunbar, Daniel; Engler, Dawson R. et al.: KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In: OSDI vol. 8, 2008, pp. 209–224
- Chollet (2017): Chollet, François: Keras (2015). URL http://keras.io (2017)
- Claessen and Hughes (2011): Claessen, Koen; Hughes, John: QuickCheck: a lightweight tool for random testing of Haskell programs. In: ACM SIGPLAN Notices 46 (2011), no. 4, pp. 53–64
- Cousot and Cousot (1977): Cousot, Patrick; Cousot, Radhia: Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Proceedings of the 4th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, ACM, 1977, pp. 238–252
- De Moura and Bjørner (2008): De Moura, Leonardo; Bjørner, Nikolaj: Z3: An efficient SMT solver. In: Tools and Algorithms for the Construction and Analysis of Systems (2008), pp. 337–340
- Dongol et al. (2015): Dongol, Brijesh; Gomes, Victor B.; Struth, Georg: A program construction and verification tool for separation logic. In: International Conference on Mathematics of Program Construction, Springer, 2015, pp. 137–158
- Doupé et al. (2012): Doupé, Adam; Cavedon, Ludovico; Kruegel, Christopher; Vigna, Giovanni: Enemy of the State: A State-Aware Black-Box Web Vulnerability Scanner. In: USENIX Security Symposium vol. 14, 2012
- Godefroid et al. (2017): Godefroid, Patrice; Peleg, Hila; Singh, Rishabh: Learn&fuzz: Machine learning for input fuzzing. In: arXiv preprint arXiv:1701.07232 (2017)
- Goodfellow et al. (2014): Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua: Generative adversarial nets. In: Advances in neural information processing systems, 2014, pp. 2672–2680
- Jones (2014): Jones, Nicola: The learning machines. In: Nature 505 (2014), no. 7482, p. 146
- Krizhevsky et al. (2012): Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012, pp. 1097–1105
- Li et al. (2017): Li, Yuekang; Chen, Bihuan; Chandramohan, Mahinthan; Lin, Shang-Wei; Liu, Yang; Tiu, Alwen: Steelix: Program-state Based Binary Fuzzing. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering (ESEC/FSE 2017). New York, NY, USA: ACM, 2017, pp. 627–637. URL http://doi.acm.org/10.1145/3106237.3106295. ISBN 978-1-4503-5105-8
- Ma et al. (2011): Ma, Kin-Keung; Yit Phang, Khoo; Foster, Jeffrey; Hicks, Michael: Directed symbolic execution. In: Static Analysis (2011), pp. 95–111
- Mnih et al. (2013): Mnih, Volodymyr; Kavukcuoglu, Koray; Silver, David; Graves, Alex; Antonoglou, Ioannis; Wierstra, Daan; Riedmiller, Martin: Playing atari with deep reinforcement learning. In: arXiv preprint arXiv:1312.5602 (2013)
- Reynolds (2002): Reynolds, John C.: Separation logic: A logic for shared mutable data structures. In: Logic in Computer Science, 2002. Proceedings. 17th Annual IEEE Symposium on, IEEE, 2002, pp. 55–74
- Sak et al. (2014): Sak, Haşim; Senior, Andrew; Beaufays, Françoise: Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In: Fifteenth Annual Conference of the International Speech Communication Association, 2014
- Schwartz et al. (2010): Schwartz, Edward J.; Avgerinos, Thanassis; Brumley, David: All you ever wanted to know about dynamic taint analysis and forward symbolic execution (but might have been afraid to ask). In: Security and privacy (SP), 2010 IEEE symposium on, IEEE, 2010, pp. 317–331
- Shoshitaishvili et al. (2016): Shoshitaishvili, Yan; Wang, Ruoyu; Salls, Christopher; Stephens, Nick; Polino, Mario; Dutcher, Andrew; Grosen, John; Feng, Siji; Hauser, Christophe; Kruegel, Christopher et al.: Sok:(state of) the art of war: Offensive techniques in binary analysis. In: Security and Privacy (SP), 2016 IEEE Symposium on, IEEE, 2016, pp. 138–157
- Silver et al. (2016): Silver, David; Huang, Aja; Maddison, Chris J.; Guez, Arthur; Sifre, Laurent; Van Den Driessche, George; Schrittwieser, Julian; Antonoglou, Ioannis; Panneershelvam, Veda; Lanctot, Marc et al.: Mastering the game of Go with deep neural networks and tree search. In: Nature 529 (2016), no. 7587, pp. 484–489
- Stephens et al. (2016): Stephens, Nick; Grosen, John; Salls, Christopher; Dutcher, Andrew; Wang, Ruoyu; Corbetta, Jacopo; Shoshitaishvili, Yan; Kruegel, Christopher; Vigna, Giovanni: Driller: Augmenting Fuzzing Through Selective Symbolic Execution. In: NDSS vol. 16, 2016, pp. 1–16
- Wang et al. (2017): Wang, Junjie; Chen, Bihuan; Wei, Lei; Liu, Yang: Skyfire: Data-driven seed generation for fuzzing. 2017
- Woo et al. (2013): Woo, Maverick; Cha, Sang K.; Gottlieb, Samantha; Brumley, David: Scheduling black-box mutational fuzzing. In: Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, ACM, 2013, pp. 511–522
- Wood (2014): Wood, Gavin: Ethereum: A secure decentralised generalised transaction ledger. In: Ethereum Project Yellow Paper 151 (2014)
- Wu et al. (2016): Wu, Yonghui; Schuster, Mike; Chen, Zhifeng; Le, Quoc V.; Norouzi, Mohammad; Macherey, Wolfgang; Krikun, Maxim; Cao, Yuan; Gao, Qin; Macherey, Klaus et al.: Google’s neural machine translation system: Bridging the gap between human and machine translation. In: arXiv preprint arXiv:1609.08144 (2016)
- Zalewski (2015): Zalewski, Michał: American Fuzzy Lop (AFL) fuzzer (2015). 2015