, for finding bugs and security vulnerabilities in large-scale application programs. In fuzzing, the target program is executed repeatedly with provided or newly generated test cases. These test cases are generated by genetic operations such as mutation and crossover, or through symbolic execution and constraint solving. Depending on how much of the target program's structure is exposed, fuzz testing can be broadly classified into three types: white box, grey box and black box fuzz testing. In black box fuzz testing, no program analysis is performed, and new inputs are generated at random by mutating the bits and bytes of the provided seeds and of previously generated inputs. As a result, black box fuzzers often need a long time to create inputs that are both valid and malicious enough to expose software vulnerabilities. High throughput and very low overhead are the main advantages of black box fuzzing. Black box fuzzing can be made more efficient with techniques such as good-quality seed selection[7, 8] and proper scheduling of mutations.
In white box fuzz testing, extensive program analysis is performed, mostly through symbolic execution. Test cases are generated by collecting the path constraints and then negating and solving them one by one with the help of constraint solvers. Symbolic execution and constraint solving do not scale to large programs with millions of lines of code and paths. Because of path explosion, the number of constraints grows exponentially until no solver can handle them, which leads to imprecise results. Godefroid et al. have given an approach for efficient white box fuzz testing. In place of the classical depth-first search, they use a generational search algorithm, which attempts to expand multiple constraints instead of just the last one. Even though this approach is faster than classical white box fuzz testing, the authors observed that symbolic execution is very slow. White box fuzz testing therefore does not look very practical.
Grey box fuzz testing sits between black box and white box fuzzing. It uses a minimal amount of instrumentation to gain some knowledge about the program structure, and it uses a feedback mechanism such as coverage information to guide the fuzzing process. Some approaches combine symbolic execution with fuzzing[10, 11]. Taint analysis techniques have also been used along with fuzzing to improve its performance. Recently, fuzzing using machine learning[13, 14] and reinforcement learning techniques has been proposed.
AFL is a popular coverage-based evolutionary greybox fuzzing tool. AFL inserts only minimal instrumentation to record branch coverage, so the instrumentation overhead is much lower than in white box fuzz testing. AFL takes an instrumented binary of the program to be tested and one or more sample input test cases, generally referred to as seeds. These seeds are then mutated one by one to obtain new test cases. The main aim of AFL is to maximize branch coverage. AFL has found a large number of critical vulnerabilities in popular libraries and tools such as Mozilla Firefox, SQLite, Apple Safari, tcpdump, file, libxml2, lrzip, binutils and vlc. In most cases, it starts with an empty file as a seed input and is able to create valid and malicious inputs which expose crashes. The simplicity of AFL has made it very popular and efficient compared to other fuzzers. AFL maintains a queue of interesting test cases and dequeues them in round robin fashion. AFL cannot figure out the best test case to fuzz next; instead it uses round robin scheduling and some heuristics for skipping a test case while fuzzing. It performs deterministic fuzzing, random fuzzing and splicing on the currently selected test case. These operations generate new test cases, and those that cover new branches are considered interesting and are added to the queue. AFL periodically performs the culling queue operation, which marks some of the test cases as favored and gives them more preference during all fuzzing steps.
In this work, we aim at replacing the heuristics that AFL uses when assigning fuzzing iterations to a test case during random fuzzing. AFL decides the number of random fuzzing iterations for a test case on the basis of external features of the test case, such as its execution time, its bitmap coverage and its depth in the fuzzing hierarchy. It does not consider the quality of the test case in terms of its contents. A test case may therefore receive too many or too few random fuzzing iterations, which results in over-fuzzing or under-fuzzing. Some test cases can be fairly large (around 10KB), so it is not feasible to consider the entire contents of a test case when deciding the number of random fuzzing iterations. It is thus difficult to completely replace the heuristics with a learning mechanism. An earlier approach addresses the problem of assigning an energy value to the test cases based on the density of the distribution of unique paths traversed by the test cases, but it does not consider the test case contents while deciding the energy.
In this work, we explore whether it is possible to decide the energy value from the test case contents, using reinforcement learning techniques. We formalize this problem as a ‘contextual bandit problem’ and propose an algorithm to solve it. We take a fixed-length substring of the test case and treat it as the ‘state’ in the contextual bandit setting. We propose multipliers of the test case’s energy and treat them as the ‘action space’. We give a ‘reward function’ based on the number of interesting test cases generated after fuzzing the substring with our newly derived energy value. Our approach thus takes a state as input and outputs one of the actions from the action space, where the action is simply a multiplier. We multiply the energy value predicted by AFL for that test case by the multiplier to obtain a new energy value. We fuzz the substring with this new energy value, calculate the reward from the reward function, and continuously update the policy through the policy gradient method using the reward obtained.
We have implemented our approach on top of AFL, replacing AFL’s heuristics with our model learned through the policy gradient method. We have implemented a neural network architecture based on Long Short Term Memory (LSTM) to encode our state. We selected LSTM for encoding because the state is a sequential stream of bytes. We maintain an exploration-exploitation trade-off both for choosing the action and for choosing between fuzzing the entire test case and fuzzing the substring only. We have performed a number of experiments with various configurations of our model, across programs such as binutils, tcpdump, mpg321, libpng, gif2png and libxml2. We have also performed cross-binary experiments across the binutils binaries, where we train our model on one binary and test it on the other binaries. We compare the coverage, the total number of paths and the crashes resulting from these experiments.
In summary, we make the following contributions:
- We formalize the problem of deciding the energy as a ‘contextual bandit problem’.
- We present an algorithm to decide the energy multiplier of a test case given fixed-length contents of the test case.
- We implement our neural network based learning algorithm on top of AFL and compare the results of different configurations of our model with AFL.
- We integrate AFL’s code with the popular open-source machine learning framework tensorflow. This implementation can be useful to others when implementing any machine/reinforcement learning technique with fuzzers like AFL.
In this section, we explain AFL, Multi-armed Bandits problem and Contextual Bandits problem in detail.
American Fuzzy Lop (AFL) is a popular coverage-based evolutionary greybox fuzzing tool. The input given to AFL is an instrumented binary and one or more valid input tests called seeds. The main aim of AFL is to maximize branch coverage. The lightweight instrumentation for counting branch coverage is added to the target program at compile time. Algorithm 1 explains the working of AFL.
AFL first sets up the shared memory. It uses 64 KB of shared memory, known as trace bits, which captures the execution of an individual test case, and 64 KB of virgin bits, which captures the overall effect of all the test cases executed so far. It initializes each byte in the trace bits to 0. Each byte represents a branch. AFL then reads the test cases given as seeds, shuffles them and enqueues them in a queue. For each seed it stores information such as its length, file name, execution checksum, execution time, bitmap size and whether the test case is favored, as soon as each value is computed.
AFL executes these seeds one by one, and the trace bits get updated by the instrumentation. It records the execution count of each branch and then groups these counts into buckets of different capacities (powers of 2). The virgin bits are modified if the test case executes an entirely new branch, or executes an already seen branch with a count that falls into a bucket different from those occupied earlier. This helps in skipping tests which execute the same branches repeatedly. The different buckets also provide some degree of immunity against tuple collisions. The path executed by a test case is identified uniquely by the checksum of the trace bits, which AFL calculates with a hash function over the trace bits. AFL also counts the number of bytes set in the trace bits by each test case and stores this count with the test case in a field named bitmap size. The bitmap size proves useful when comparing two different test cases.
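The bucketing step above can be sketched in a few lines of Python. This is an illustrative model, not AFL's C code: the bucket boundaries (1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+) follow AFL's published technical notes, while the function names and the `seen` map are assumptions made for this sketch. AFL itself keeps an inverted "virgin" map; a plain "seen" map is used here only because it reads more easily.

```python
MAP_SIZE = 1 << 16  # 64 KB, one byte per branch tuple


def bucket(count: int) -> int:
    """Map a raw branch hit count to a single bucket bit."""
    if count == 0:
        return 0
    if count <= 2:
        return count          # counts 1 and 2 have their own buckets
    if count == 3:
        return 4
    if count <= 7:
        return 8
    if count <= 15:
        return 16
    if count <= 31:
        return 32
    if count <= 127:
        return 64
    return 128


def update_seen(trace: bytearray, seen: bytearray) -> bool:
    """Return True if the trace hits a new branch or a new bucket.

    `seen` accumulates every bucket bit observed so far (hypothetical
    stand-in for AFL's virgin bits, with the polarity inverted).
    """
    interesting = False
    for i, count in enumerate(trace):
        b = bucket(count)
        if b and not (seen[i] & b):
            seen[i] |= b
            interesting = True
    return interesting
```

A test case that repeats a branch 5 times and another that repeats it 6 times land in the same bucket, so only the first of them is treated as new behavior.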
Once a test case is executed, the favor factor (fav_factor in AFL's source code) is calculated for that test case. It is simply the product of the execution time and the length of the test case. AFL maintains a top rated test case for each byte. The top rated test case for a byte is the test case that has the lowest favor factor among those executing the branch corresponding to that byte. AFL goes over the bytes set by the test case one by one, and if the current test case has a lower favor factor than the existing top rated test case for a byte, it sets the current test case as top rated for that byte. It also stores a count of the bytes for which a given test case is top rated. If this counter is non-zero, AFL stores a minimized version of the full trace of that test case. This minimized trace is helpful in deciding the favored test cases. If a seed itself crashes or times out, AFL can stop further execution and may prompt the user to change such seeds. AFL advises keeping the seeds small and non-repetitive. AFL can also minimize these test cases while preserving their meaning. The above procedure is followed for all the seeds. If a test case is no longer top rated for any byte, there is no point in storing its minimized trace, so AFL frees the memory allocated for that minimized trace.
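The top rated bookkeeping described above can be sketched as follows. This is a simplified model under stated assumptions: `QueueEntry`, `update_top_rated` and the field names are hypothetical, the trace is stored as a list of byte indices instead of a bitmap, and an incumbent is replaced only when the newcomer's score is strictly lower, as the text describes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class QueueEntry:
    name: str
    exec_us: int            # execution time in microseconds
    length: int             # input length in bytes
    trace: List[int]        # indices of the trace bytes this test case sets
    top_rated_count: int = 0  # number of bytes for which it is top rated

    @property
    def score(self) -> int:
        # Product of execution time and length; lower is better.
        return self.exec_us * self.length


def update_top_rated(top_rated: Dict[int, QueueEntry], case: QueueEntry) -> None:
    """Make `case` top rated for every byte where it beats the incumbent."""
    for idx in case.trace:
        incumbent = top_rated.get(idx)
        if incumbent is not None:
            if case.score >= incumbent.score:
                continue                    # incumbent stays
            incumbent.top_rated_count -= 1  # incumbent loses this byte
        top_rated[idx] = case
        case.top_rated_count += 1
```

When `top_rated_count` drops back to zero, the entry no longer wins any byte, which is the point at which the text says its minimized trace can be freed.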
Before performing any mutation on a seed, AFL performs the culling queue operation. This operation gives less importance to older, lengthy and less favored tests, and more importance to the more efficient test cases. AFL goes over each byte and checks the top rated test case for that byte. Once it finds a top rated test case, it checks the minimized trace of that test case and marks the test case as favored for all the non-zero bytes in the trace. Storing the minimized trace is very useful here, because otherwise the entire test case would need to be re-executed. The favored test cases are given preference over non-favored test cases while fuzzing.
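The culling pass can be read as a greedy set cover over the trace bytes. The sketch below illustrates that reading; the function name and the dictionary-based data layout are assumptions for illustration, and AFL's real implementation works on bitmaps.

```python
from typing import Dict, List, Set


def cull_queue(top_rated: Dict[int, str], traces: Dict[str, Set[int]]) -> List[str]:
    """Pick favored test cases until every known trace byte is covered.

    top_rated maps a trace-byte index to the name of its top rated test case;
    traces maps a test case name to its minimized trace (a set of byte indices).
    """
    uncovered = set(top_rated)       # bytes that still need a favored test case
    favored: List[str] = []
    for idx in sorted(top_rated):
        if idx not in uncovered:
            continue                 # an earlier favored case already covers it
        winner = top_rated[idx]
        favored.append(winner)
        # The stored minimized trace tells us which bytes the winner covers,
        # so the test case does not have to be executed again.
        uncovered -= traces[winner]
    return favored
```

With `top_rated = {1: "a", 2: "b", 3: "b"}` and traces `a = {1, 2}`, `b = {2, 3}`, both test cases end up favored, because neither covers all three bytes alone.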
AFL dequeues test cases one by one and fuzzes them to generate the next generation of test cases. This operation is inspired by genetic algorithms, which include techniques such as crossover and mutation. AFL cannot figure out the best test case to fuzz next; instead it uses round robin scheduling and some heuristics for skipping a test case while fuzzing. AFL performs two types of fuzzing: deterministic fuzzing (performed only once per test case) and random fuzzing. If the current test case is not favored or has already been fuzzed, and there are favored test cases in the queue which have not yet been fuzzed, then AFL skips the current test case with 99% probability. This promotes short-duration test cases which can cover more branches. Even if there is no pending favored test case, a current test case that has already been fuzzed is skipped with 95% probability. This promotes test cases which have not yet been fuzzed over test cases which have already been fuzzed at least once. If there is no pending favored test case and the current test case has not been fuzzed, the chance of skipping it is reduced to 75%.
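The skipping heuristic can be summarized as a small decision function. The sketch below is an approximation under stated assumptions: the constant names mirror those recalled from AFL's configuration header and should be treated as illustrative, the last two probabilities are applied only to non-favored test cases (as AFL does), and AFL's additional conditions on queue size and queue cycle are left out.

```python
import random

SKIP_TO_NEW_PROB = 99    # favored cases are pending; current is old or not favored
SKIP_NFAV_OLD_PROB = 95  # nothing pending; current is not favored and already fuzzed
SKIP_NFAV_NEW_PROB = 75  # nothing pending; current is not favored and not yet fuzzed


def skip_probability(pending_favored: int, favored: bool, was_fuzzed: bool) -> int:
    """Return the percentage chance of skipping the current test case."""
    if pending_favored:
        if was_fuzzed or not favored:
            return SKIP_TO_NEW_PROB
        return 0
    if not favored:
        return SKIP_NFAV_OLD_PROB if was_fuzzed else SKIP_NFAV_NEW_PROB
    return 0


def should_skip(pending_favored: int, favored: bool, was_fuzzed: bool,
                rng: random.Random) -> bool:
    """Draw a number in [0, 100) and compare it with the skip probability."""
    return rng.randrange(100) < skip_probability(pending_favored, favored, was_fuzzed)
```

A favored test case that has not yet been fuzzed gets a skip probability of 0 in this sketch, so it is always processed when its turn comes.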
AFL then decides the number of random fuzzing iterations (also called the energy) for that test case on the basis of external features of the test case, such as its execution time, its bitmap coverage and its depth in the fuzzing hierarchy. It does not consider the test case contents. Tests that are fast, cover more branches and have more depth are given more energy. This energy assignment is entirely heuristic.
In deterministic fuzzing, AFL performs mutations such as a single bit flip (flipping only a single bit of the test case to generate a new test case), two walking bits flip, four walking bits flip, byte flip, two walking bytes flip and four walking bytes flip. After each flip it executes the newly formed test case and checks whether it is interesting (i.e. whether it affects the virgin bits). If the test case is not found to be interesting, it is skipped and not added to the queue. AFL also performs 8-bit, 16-bit and 32-bit arithmetic increment and decrement operations, where it selects each word in turn and keeps incrementing it until some maximum value is reached. It also sets bytes, words and double words to interesting values such as powers of 2 and corner values near the negative and positive overflow boundaries.
In random fuzzing, AFL performs a number of iterations equal to the energy given to the test case. It randomly flips a bit somewhere, sets a random byte, word or double word to some interesting value, sets a random byte to a random value, randomly adds to or subtracts from a random byte, deletes random bytes and clones random bytes. It also splices test cases: it selects one input file, randomly chooses another file and splices them at some random offset. This is similar to producing a next-generation child from the existing species using a crossover operator.
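To make the role of the energy concrete, the following sketch runs a tiny random fuzzing stage for a given number of iterations. It is a simplified illustration, not AFL's havoc stage: only five single-byte operators are modeled, each iteration applies exactly one of them to the original input, and the names `havoc_once` and `random_fuzz` are hypothetical. The list of interesting 8-bit values and the arithmetic range of 35 follow AFL's defaults.

```python
import random
from typing import List

INTERESTING_8 = [-128, -1, 0, 1, 16, 32, 64, 100, 127]
ARITH_MAX = 35


def havoc_once(data: bytes, rng: random.Random) -> bytes:
    """Apply one randomly chosen mutation operator to a copy of `data`."""
    buf = bytearray(data)
    if not buf:
        return bytes([rng.randrange(256)])
    pos = rng.randrange(len(buf))
    op = rng.randrange(5)
    if op == 0:      # flip a random bit
        buf[pos] ^= 1 << rng.randrange(8)
    elif op == 1:    # set a byte to an interesting value
        buf[pos] = rng.choice(INTERESTING_8) & 0xFF
    elif op == 2:    # add to or subtract from a byte
        buf[pos] = (buf[pos] + rng.randint(-ARITH_MAX, ARITH_MAX)) & 0xFF
    elif op == 3:    # delete a byte
        del buf[pos]
    else:            # clone a byte
        buf.insert(pos, buf[pos])
    return bytes(buf)


def random_fuzz(data: bytes, energy: int, seed: int = 0) -> List[bytes]:
    """Generate `energy` mutated variants of `data`."""
    rng = random.Random(seed)
    return [havoc_once(data, rng) for _ in range(energy)]
```

Doubling the energy doubles the number of variants generated and executed for a test case, which is why choosing it well matters for throughput.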
2.2 Reinforcement Learning
Reinforcement learning is an online learning technique used for decision making based on evaluative feedback rather than instructive feedback. It has been used extensively in the study of Atari games and animal behavior. Agents trained using these techniques have been able to beat human world champions in games such as backgammon, Go and Atari games. Reinforcement learning uses the Markov decision process (MDP) to formulate the interaction between an agent and its environment in terms of states, actions and rewards. Formally, an MDP $M$ is a tuple $M = (S, A, P, R, \gamma)$, where $S$ denotes a set of states, $A$ denotes a set of actions, $P$ denotes the state transition probability matrix, $R$ denotes the reward function and $\gamma$ denotes the discount factor. In reinforcement learning, the agent starts in some state $s_t$ and takes an action $a_t$ out of all the available actions from the action space. The environment gives a scalar reward $r_t$ to the agent, and the system transits to the next state $s_{t+1}$. The agent’s main aim is to maximize the expected sum of discounted future rewards, $\mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}\right]$. Choosing a certain action may give a bad immediate reward but lead to better future rewards. In contrast, choosing an action which gives a better immediate reward may lead to very poor future rewards. The agent has to deal with these conditions carefully while selecting an action in a given state. The action selection is decided according to the agent’s policy $\pi$.
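A small numeric example may help with the trade-off between immediate and future rewards. The sketch below computes the discounted sum of a finite reward sequence; the function name and the two reward sequences are made up for illustration.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite sequence."""
    total = 0.0
    # Work backwards so each step is one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# A poor first reward followed by a large delayed reward...
patient = discounted_return([-1.0, 0.0, 10.0], gamma=0.9)
# ...versus a good first reward followed by nothing.
greedy = discounted_return([1.0, 0.0, 0.0], gamma=0.9)
```

With a discount factor of 0.9 the first sequence is worth about 7.1 and the second exactly 1.0, so an agent that maximizes the discounted return prefers the action with the worse immediate reward.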
2.3 Multi-armed Bandits Problem
The multi-armed bandits problem is derived from the slot machine with multiple levers. The player plays one of the levers and obtains some reward. The main aim of the player is to maximize the rewards by selecting the best action. Formally, the $k$-armed bandits problem consists of $k$ actions $a_1, \ldots, a_k$. Each action $a_i$ is associated with a corresponding reward distribution $D_{a_i}$. Given some fixed number of trials $T$, at each trial $t$ the agent selects an action $a_t$ and receives a scalar reward $r_t$, where $r_t \sim D_{a_t}$. The agent does not know the reward distributions associated with the actions. The agent’s goal is to maximize the expected total reward over some time period. The actual value of an action $a$ is given by $q_*(a)$, where $q_*(a) = \mathbb{E}[r_t \mid a_t = a]$. The agent tries to estimate this value for each action from its trials. The agent estimates the value of an action $a$ at time $t$ as $Q_t(a)$, and its aim is to make $Q_t(a)$ close to $q_*(a)$. The agent achieves this through a trade-off between exploration (trying out random actions) and exploitation (choosing the action which has provided the highest value up to that time instance). Exploitation is useful for maximizing the expected reward of a single step, but the combination of exploration and exploitation is needed to achieve higher rewards in the long run. One technique for balancing the two is to explore more in the initial phase and then gradually increase exploitation. In contrast to the full reinforcement learning problem, the multi-armed bandits problem does not have a concept of an episode, and the rewards depend only on the action selected. A number of solutions have been proposed for the multi-armed bandits problem.
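As a minimal illustration of the exploration-exploitation trade-off, the sketch below runs an epsilon-greedy agent on a simulated bandit and keeps a sample-average estimate of each action's value. The Gaussian reward distributions, the arm means and all names are assumptions chosen for this example; this is one standard solution, not necessarily one of those the text refers to.

```python
import random
from typing import List, Sequence, Tuple


def run_bandit(true_means: Sequence[float], trials: int, epsilon: float,
               seed: int = 0) -> Tuple[List[float], List[int], float]:
    """Play `trials` rounds and return (value estimates, pull counts, total reward)."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k   # running estimate of each action's value
    pulls = [0] * k
    total = 0.0
    for _ in range(trials):
        if rng.random() < epsilon:
            action = rng.randrange(k)                            # explore
        else:
            action = max(range(k), key=lambda i: estimates[i])   # exploit
        reward = rng.gauss(true_means[action], 1.0)
        pulls[action] += 1
        # Incremental sample average: moves the estimate toward the new reward.
        estimates[action] += (reward - estimates[action]) / pulls[action]
        total += reward
    return estimates, pulls, total
```

Calling `run_bandit([0.0, 1.0, 3.0], trials=5000, epsilon=0.1)` lets the agent discover through occasional exploration that the third arm pays best, after which exploitation concentrates most pulls on it.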
2.4 Contextual Bandits Problem
The contextual bandits problem falls in between the full reinforcement learning problem and the multi-armed bandits problem. Formally, the contextual bandits problem consists of a state space $S$ and an action space $A$. At each time instance $t$, the environment presents a state $s_t \in S$ to the agent, the agent takes an action $a_t \in A$ and gets some reward $r_t$ from the environment. Each state-action pair $(s, a)$ is associated with a corresponding reward distribution $D_{s,a}$, which means that the reward $r_t$ is obtained from the distribution $D_{s_t, a_t}$. In the multi-armed bandits problem there is only a single state and multiple actions. In the contextual bandits problem, however, the reward depends on the state as well as on the action taken at that instance. The goal of the agent in this case is not only to find the best action for a single state; it needs to find the best actions for all the states and to maximize the expected total reward over some time period. The agent improves its policy based on its observations $(s_t, a_t, r_t)$. Contextual bandit problems are also known as associative search tasks. A variety of problems have been modeled as contextual bandits problems, such as personalized news recommendation, optimizing random allocation in clinical trials and adaptive routing in networks. The main difference between the contextual bandits problem and a full reinforcement learning problem is the state transition. In a full reinforcement learning problem, the next state depends on the current state and the action taken by the agent, so the agent needs to take delayed rewards into consideration when taking an action in any state. In contrast, for contextual bandits the agent does not need to consider a long episode: a single (state, action, reward) triple forms an episode, and the agent just focuses on selecting an optimal action for the given state.
3 Technical Details
We model the problem of deciding the number of random fuzzing iterations as a contextual-bandits problem. In this section, we describe all the components of this formulation.
We consider a contiguous stream of 128 bytes from the test case currently selected from the queue of test cases. Let $\mathcal{Q}$ denote the queue of test cases maintained by AFL and let $x \in \mathcal{Q}$ be the test case currently selected by AFL for fuzzing. We denote by $o$ the random offset into the test case $x$, where $o$ lies within the test case and $|x|$ denotes the length of the test case $x$. Thus our state can be represented as
$$ s = x[o],\; x[o+1],\; \ldots,\; x[o+127]. $$
If fewer than 128 bytes are available, we append the required number of zero bytes at the end of the test case $x$. In short, we take a substring of 128 contiguous bytes from the current test case and treat it as the state in the contextual bandits setting.
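The state construction can be sketched as below. The exact range of the random offset is not recoverable from the text, so this sketch assumes that the offset may fall anywhere inside the test case and that the window is zero-padded whenever it runs past the end; `extract_state` and `STATE_LEN` are names made up for the example.

```python
import random

STATE_LEN = 128  # number of contiguous bytes that form the state


def extract_state(test_case: bytes, rng: random.Random) -> bytes:
    """Take STATE_LEN contiguous bytes at a random offset, zero-padded if short."""
    # Assumption: any offset inside the test case is allowed.
    offset = rng.randrange(len(test_case)) if test_case else 0
    window = test_case[offset:offset + STATE_LEN]
    return window + b"\x00" * (STATE_LEN - len(window))
```

The returned value always has exactly 128 bytes, which keeps the input shape of the model fixed regardless of how large the test case grows.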
We define the set of multipliers as the action space. Currently our action space is,
In order to preserve the global context of the test case, we do not change the energy that AFL calculates for the test case. We then select one of the actions on the basis of our state and multiply the energy predicted by AFL by the selected multiplier to obtain the new energy value.
We define a reward function as follows,
Note that in the above formula we only consider the test cases generated by fuzzing our state, i.e. by fuzzing the substring of 128 contiguous bytes selected as the state. The number of interesting test cases generated is the number of test cases added to the queue by AFL on fuzzing the state. The total number of test cases generated is the number of test cases added to the queue by AFL plus the number of test cases not added to the queue by AFL (i.e. the uninteresting test cases generated by AFL on fuzzing).
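The reward formula itself is not reproduced in this text. Given the two quantities defined above, one natural reading is the fraction of generated test cases that turned out to be interesting; the sketch below computes that ratio purely as an assumption, and the function name is hypothetical.

```python
def reward(interesting: int, total: int) -> float:
    """Fraction of generated test cases that were added to the queue.

    Assumed form of the reward; the original formula may differ.
    """
    if total == 0:
        return 0.0   # nothing was generated, so there is nothing to reward
    return interesting / total
```

Under this reading, fuzzing a state for 200 iterations and adding 5 new test cases to the queue gives a reward of 0.025.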
We give the following two algorithms, the algorithm 2 explains the training procedure while the other one explains the testing procedure.
In Algorithm 2, we first of all initialize a variable with a value between 0 and 1. This probability value decides whether the entire test case will be fuzzed or just the state.
[Table 1 (body not recovered): columns AFL, AFL -d and AFL CB, repeated for each reported metric.]
The main aim of this algorithm is to predict the correct energy for the stream of 128 contiguous bytes selected from the test case, which we call the current state. If we fuzz only the state, however, we cannot perform operations such as deleting and cloning bytes, so the size of the test case will never grow or shrink during fuzzing. To prevent this, we maintain a balance between fuzzing the entire test case and fuzzing only the state. We generate a random number between 0 and 1. If it is less than the probability value initialized at the start of the algorithm, we perform random fuzzing on the complete test case with the energy decided by AFL. Otherwise, we perform random fuzzing only on the state. In this case, we first need to decide whether to explore or to exploit. We use an $\epsilon$-greedy policy to handle the trade-off between exploration and exploitation. In this policy, we explore with probability $\epsilon$ and exploit with probability $1 - \epsilon$. In the actual implementation, we again generate a random number between 0 and 1 and explore if it is less than $\epsilon$; otherwise we exploit. In exploration, we choose an action at random from the action space. In exploitation, we query our model, which we explain in the next section, for the action to select by giving the state as input. We multiply the energy by the predicted action to obtain a new energy value. We perform random fuzzing only on the state with this new energy value. We count the number of interesting test cases and the total number of test cases generated by fuzzing the state. We calculate the reward as per the reward function given in section 3.1.3. We update the model weights with the loss function, which is calculated from the reward. Note that we do not perform deterministic fuzzing in our approach, in order to focus on the problem of deciding the energy of a state for random fuzzing. During the experiments, we also disable deterministic fuzzing in the original AFL with the help of the ‘-d’ option provided by AFL. This training can be performed for some pre-decided time period.
The testing algorithm is similar to Algorithm 2, with a few changes. We do not update our trained model. Hence, in the testing algorithm there is no need to count the interesting and total test cases, to compute rewards or to update the model weights. Thus lines 30, 34, 39, 41 and 42 are not present in the testing algorithm.
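The training procedure described above can be sketched end to end with a stand-in environment. Everything specific in this sketch is an assumption made so that the example runs on its own: the multipliers in `ACTIONS`, the values of `P_FULL`, `EPSILON` and `LEARNING_RATE`, the `fake_fuzz` function that replaces AFL's random fuzzing stage, the ratio-based reward, and the small table of softmax preferences that replaces the LSTM policy.

```python
import math
import random

ACTIONS = [0.5, 1.0, 2.0]  # hypothetical energy multipliers
P_FULL = 0.4               # hypothetical chance of fuzzing the whole test case
EPSILON = 0.1              # hypothetical exploration rate
LEARNING_RATE = 0.1


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    s = sum(exps)
    return [e / s for e in exps]


def fake_fuzz(state, energy, rng):
    """Stand-in for random fuzzing of the state: returns (interesting, total)."""
    hit_rate = 0.05 if state[0] % 2 else 0.01
    interesting = sum(rng.random() < hit_rate for _ in range(energy))
    return interesting, energy


def train(test_cases, afl_energy, steps, seed=0):
    rng = random.Random(seed)
    prefs = {}   # per-context action preferences; stands in for the model
    log = []
    for _ in range(steps):
        case = rng.choice(test_cases)
        if rng.random() < P_FULL:
            log.append(("full", afl_energy))   # fuzz whole case with AFL's energy
            continue
        offset = rng.randrange(len(case))
        state = case[offset:offset + 128].ljust(128, b"\x00")
        context = state[0] % 2                 # crude feature instead of an LSTM encoding
        p = prefs.setdefault(context, [0.0] * len(ACTIONS))
        probs = softmax(p)
        if rng.random() < EPSILON:
            action = rng.randrange(len(ACTIONS))                          # explore
        else:
            action = max(range(len(ACTIONS)), key=lambda i: probs[i])     # exploit
        energy = max(1, int(afl_energy * ACTIONS[action]))
        interesting, total = fake_fuzz(state, energy, rng)
        r = interesting / total
        # Gradient step on the loss -log(pi(action)) * r for a softmax policy.
        for i in range(len(ACTIONS)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            p[i] += LEARNING_RATE * r * grad_log_pi
        log.append(("state", energy))
    return prefs, log
```

The testing variant of this loop would keep the action selection and the energy multiplication but drop the reward computation and the preference update.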
We construct a policy-based agent. As the input state is sequential data, we use an LSTM to encode our state. We use a single-layer LSTM with 100 recurrent units and the default tanh activation function. Our state is a stream of 128 bytes, so we create a $128 \times 8$ matrix, where each row contains a byte represented in binary format. We then pass it to the LSTM as input. Our model also consists of a single fully connected feed-forward layer with the softmax activation function and a random uniform weight initializer. This layer takes the final hidden state of the LSTM as input and outputs a vector of probabilities over the actions. We use the value of. We use a gradient descent optimizer with learning rate = 0.001. We use the policy gradient network to update the policy directly, hence we use the loss function $L = -\log(\pi) \cdot r$, where $\pi$ represents the current policy, i.e. the value output by our network for the corresponding action, and $r$ is the reward obtained.
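The input encoding and the loss can be illustrated without any machine learning framework. The sketch below turns a 128-byte state into a 128 x 8 bit matrix and evaluates the policy gradient loss for one action; the most-significant-bit-first ordering and both function names are assumptions, and the LSTM itself is not modeled.

```python
import math
from typing import List, Sequence


def state_to_bits(state: bytes) -> List[List[int]]:
    """One row per byte, eight columns per row, most significant bit first."""
    return [[(byte >> (7 - bit)) & 1 for bit in range(8)] for byte in state]


def policy_gradient_loss(action_probs: Sequence[float], action: int,
                         reward: float) -> float:
    """Negative log-probability of the chosen action, scaled by the reward."""
    return -math.log(action_probs[action]) * reward
```

A zero reward gives a zero loss, so only actions that produced interesting test cases change the policy, and a larger reward pushes the probability of the chosen action up more strongly.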
We have implemented our model using the popular open-source machine learning framework tensorflow-1.4.0. The original AFL code is written in C and C++. There are some machine learning libraries which can be used from C or C++ code, such as the tensorflow C++ API, Caffe2 and mlpack. However, these APIs do not support all the operations supported by the Python libraries and are currently in the development phase. Hence we decided to write our model code in Python. For integration we use ctypes, a foreign function library for Python. We treat AFL as the environment and write the code for the agent in Python using ctypes. We create a shared object from the fuzzer’s code during compilation. ctypes loads this shared object and can then call any AFL function from the Python code. ctypes also provides C-compatible data types, which make calling C functions and returning values easy and efficient. This implementation can be useful to others when integrating any machine/reinforcement learning technique with fuzzers like AFL. Our implementation is publicly available at https://bitbucket.org/iiscseal/bandit_afl
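The ctypes pattern described above looks roughly as follows. Since the fuzzer's shared object is not available here, the sketch loads the C standard library as a stand-in and calls `strlen`; it assumes a POSIX system, and the wrapper name `c_strlen` is made up. Loading a compiled fuzzer would use the same steps with the path of its shared object.

```python
import ctypes
import ctypes.util

# Load a shared library; libc stands in for the fuzzer's shared object.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature so ctypes converts arguments and results correctly.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call the C function from Python and return its result."""
    return libc.strlen(data)
```

Declaring `argtypes` and `restype` up front is what makes passing buffers and reading integer results safe across the language boundary.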
We used the value equal to 0.4. All experiments were performed on an Intel(R) Xeon(R) E5-2450 2.10 GHz machine. We have performed a number of experiments with various configurations of our model, across popular programs and binaries such as binutils-2.26, tcpdump-4.5.1, mpg321-0.3.2, libpng-1.6.32, gif2png-2.5.8 and libxml2-2.9.2.
In the first experiment, we run normal AFL for 24 hours on the binaries addr2line -e, cxxfilt, elfedit, nm-new -C, objcopy, objdump -d, readelf -a, size, strings, strip-new, gif2png, mpg321 --stdout, pngtest, xmllint and tcpdump -nn -r. We run 4 instances for each binary and then take the average of the coverage, total paths and crashes produced. For total paths and crashes, the average values are rounded off. For each binary, we give a single seed which is provided by AFL. In the second experiment, we run AFL with the ‘-d’ option for 24 hours. The ‘-d’ option runs AFL without the deterministic phase. As in the first experiment, we run 4 instances for each binary and then average the values. In the third experiment, we train our model on each binary for 4 hours using Algorithm 2, save our model and reset everything else, such as the queue, crashes and hangs. We then start our testing algorithm with a single input seed and run it for 24 hours. We perform this experiment on 4 instances per binary and then take the average values. The results of the first, second and third experiments are given in Table 1.
We have also performed cross-binary experiments across the binutils binaries, where we train our model on one binary and test it on the other binaries. The reason for performing the cross-binary experiments is that all these binaries belong to binutils and most of them use common libraries such as BFD. The coverage results are given in Table 2.
We have also added plots of coverage over time for all binaries for the second and third experiments in the above mentioned figures. Here we take the best instance in terms of coverage. We have also added plots of reward vs. iterations during the training phase. We found that we perform better than normal AFL (i.e. AFL with the deterministic phase) in terms of coverage for most of the binaries, but we perform worse than AFL without the deterministic phase. We hope that the results can be further improved by careful tuning of our contextual bandit model.
5 Related Work
Our work is mostly inspired by current work in the field of program analysis using machine/reinforcement learning. In this section we discuss these works. One of the early works in this area is Learn&Fuzz. The authors proposed a neural network based model for format-specific grammar construction. During fuzzing, AFL creates a lot of test cases which are invalid, so the Learn&Fuzz paper proposed a machine learning technique to create a grammar for the PDF format, which was then used to create valid test cases. In our work, we focus on format-independent fuzzing. Rajpal et al. have proposed an LSTM and sequence-to-sequence based model that learns a function to predict good locations in the input files to guide further mutations. This technique always takes the entire test case as input and then decides the fuzzing locations, but test cases can get fairly large as time proceeds, so querying the model may require a lot of time. Bottinger et al. have proposed the first work in fuzzing which involves reinforcement learning techniques. They take a substring of bytes as the state and use the deep Q-learning algorithm to choose a suitable mutation action. Most of the actions in their action space are PDF format specific. Also, they compare their model with a baseline fuzzer and not with AFL. In our work we do not assume any baseline fuzzer.
The above mentioned approaches apply machine learning techniques to the fuzzing process. A lot of work has also been done on improving fuzzing performance with the help of conventional program analysis techniques such as taint analysis and symbolic execution. Black box fuzzing has been made more efficient by using techniques such as good-quality seed selection[7, 8] and proper scheduling of mutations. Bohme et al. gave a new tool, AFL-Go, which can be used for directed fuzzing and for skipping mutations in unnecessary directions. They have also given an approach for assigning more energy to low-frequency paths to improve coverage. Lemieux et al. targeted rare branches to rapidly increase greybox fuzz testing coverage. Some techniques have improved the existing fuzzers for specific file formats.
6 Conclusions and Future Work
In this work, we formalize the problem of deciding the energy as a ‘contextual bandit problem’ for the very first time. We present an algorithm to decide the energy multiplier of a test case given fixed-length contents of the test case. We implement our neural network based learning algorithm on top of AFL and compare the results of different configurations of our model with AFL. We integrate AFL’s code with the popular open-source machine learning framework tensorflow. This implementation can be useful to others when implementing any machine/reinforcement learning technique with fuzzers like AFL. We evaluate our implementation on popular and large-scale target programs. More careful parameter tuning and engineering work may lead to results that outperform the existing fuzzers in terms of coverage. As future work, we can replace other heuristics of AFL with machine learning models. One open problem is deciding the next test case to be fuzzed from the queue of test cases, because AFL selects the test cases in round robin fashion and skips some of them using hand-designed heuristics. We think that the heuristics for deciding the top rated and favored test cases can also be completely removed or replaced if we are able to decide which test case is good or bad using a model learned from past experience.
-  Microsoft Security Risk Detection. https://www.microsoft.com/en-us/security-risk-detection/. [Online; accessed 10-January-2018].
-  Patrice Godefroid, Michael Y Levin, David A Molnar, et al. Automated whitebox fuzz testing. In NDSS, volume 8, pages 151–166, 2008.
-  libFuzzer – a library for coverage-guided fuzz testing. https://llvm.org/docs/LibFuzzer.html. [Online; accessed 10-January-2018].
-  OSS-Fuzz. https://opensource.google.com/projects/oss-fuzz/. [Online; accessed 10-January-2018].
-  Caroline Lemieux and Koushik Sen. FairFuzz: Targeting rare branches to rapidly increase greybox fuzz testing coverage. arXiv preprint arXiv:1709.07101, 2017.
-  Peach Fuzzer Platform. http://www.peachfuzzer.com/products/peach-platform/. [Online; accessed 10-July-2017].
-  Allen D Householder and Jonathan M Foote. Probability-based parameter selection for black-box fuzz testing. Technical report, Carnegie Mellon University, Software Engineering Institute, Pittsburgh, PA, 2012.
-  Alexandre Rebert, Sang Kil Cha, Thanassis Avgerinos, Jonathan M Foote, David Warren, Gustavo Grieco, and David Brumley. Optimizing seed selection for fuzzing. USENIX, 2014.
-  Maverick Woo, Sang Kil Cha, Samantha Gottlieb, and David Brumley. Scheduling black-box mutational fuzzing. In Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security, pages 511–522. ACM, 2013.
-  Saahil Ognawala, Ana Petrovska, and Kristian Beckers. An exploratory survey of hybrid testing techniques involving symbolic execution and fuzzing. arXiv preprint arXiv:1712.06843, 2017.
-  Nick Stephens, John Grosen, Christopher Salls, Andrew Dutcher, Ruoyu Wang, Jacopo Corbetta, Yan Shoshitaishvili, Christopher Kruegel, and Giovanni Vigna. Driller: Augmenting fuzzing through selective symbolic execution. In NDSS, volume 16, pages 1–16, 2016.
-  Vijay Ganesh, Tim Leek, and Martin Rinard. Taint-based directed whitebox fuzzing. In Proceedings of the 31st International Conference on Software Engineering, pages 474–484. IEEE Computer Society, 2009.
-  Patrice Godefroid, Hila Peleg, and Rishabh Singh. Learn&Fuzz: Machine learning for input fuzzing. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, pages 50–59. IEEE Press, 2017.
-  Mohit Rajpal, William Blum, and Rishabh Singh. Not all bytes are equal: Neural byte sieve for fuzzing. arXiv preprint arXiv:1711.04596, 2017.
-  Konstantin Böttinger, Patrice Godefroid, and Rishabh Singh. Deep reinforcement fuzzing. arXiv preprint arXiv:1801.04589, 2018.
-  Michael Zalewski. American Fuzzy Lop. http://lcamtuf.coredump.cx/afl/.
-  Michael Zalewski. Technical "whitepaper" for afl-fuzz. http://lcamtuf.coredump.cx/afl/technical_details.txt. [Online; accessed 10-July-2017].
-  Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. Coverage-based greybox fuzzing as Markov chain. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1032–1043. ACM, 2016.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
-  GNU Binutils. https://www.gnu.org/software/binutils/. [Online; accessed 10-January-2018].
-  tcpdump. https://www.tcpdump.org/tcpdump_man.html. [Online; accessed 10-January-2018].
-  mpg321 mp3 player. http://mpg321.sourceforge.net/. [Online; accessed 10-January-2018].
-  libpng. http://www.libpng.org/pub/png/libpng.html. [Online; accessed 10-January-2018].
-  gif2png. http://www.catb.org/esr/gif2png/. [Online; accessed 10-January-2018].
-  The XML C parser and toolkit of Gnome. http://xmlsoft.org/. [Online; accessed 10-January-2018].
-  Clay B Holroyd and Michael GH Coles. The neural basis of human error processing: reinforcement learning, dopamine, and the error-related negativity. Psychological review, 109(4):679, 2002.
-  Gerald Tesauro. TD-Gammon: A self-teaching backgammon program. In Applications of Neural Networks, pages 267–285. Springer, 1995.
-  Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.
-  UCL Course on RL. http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html. [Online; accessed 10-January-2018].
-  Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT-23th Conference on Learning Theory-2010, pages 13–p, 2010.
-  Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
-  Tensorflow Home page. https://www.tensorflow.org/. [Online; accessed 10-January-2018].
-  Caffe2 Home page. https://caffe2.ai/. [Online; accessed 10-January-2018].
-  mlpack Home page. https://www.mlpack.org/. [Online; accessed 10-January-2018].
-  ctypes python library. https://docs.python.org/2/library/ctypes.html. [Online; accessed 10-January-2018].
-  Marcel Böhme, Van-Thuan Pham, Manh-Dung Nguyen, and Abhik Roychoudhury. Directed greybox fuzzing. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS’17), 2017.