I Introduction
Fuzzing is the process of finding security vulnerabilities in input-processing code by repeatedly testing the code with modified, or fuzzed, inputs. Fuzzing is an effective way to find security vulnerabilities in software [1], and is becoming standard in the commercial software development process [2].
Existing fuzzing tools differ in how they fuzz program inputs, but none can exhaustively explore the entire input space of realistic programs in practice. Therefore, they typically use fuzzing heuristics to prioritize what (parts of) inputs to fuzz next. Such heuristics may be purely random, or they may attempt to optimize for a specific goal, such as maximizing code coverage.
In this paper, we investigate how to formalize fuzzing as a reinforcement learning problem. Intuitively, choosing the next fuzzing action given an input to mutate can be viewed as choosing the next move in a game like Chess or Go: while an optimal strategy might exist, it is unknown to us and we are bound to play the game (many times) in the search for it. By reducing fuzzing to reinforcement learning, we can then try to apply the same neural-network-based learning techniques that have beaten world-champion human experts in Backgammon [3, 4], Atari games [5], and the game of Go [6].
Specifically, fuzzing can be modeled as a learning process with a feedback loop. Initially, the fuzzer generates new inputs, and then runs the target program with each of them. For each program execution, the fuzzer extracts runtime information (gathered for example by binary instrumentation) for evaluating the quality (with respect to the defined search heuristic) of the current input. For instance, this quality can be measured as the number of (unique or not) instructions executed, or the overall runtime of the execution. By taking this quality feedback into account, a feedback-driven fuzzer can learn from past experiments, and then generate other new inputs, hopefully of better quality. This process repeats until a specific goal is reached, or bugs are found in the program. Similarly, the reinforcement learning setting defines an agent that interacts with a system. Each performed action causes a state transition of the system. Upon each performed action the agent observes the next state and receives a reward. The goal of the agent is to maximize the total reward over time.
Our mathematical model of fuzzing is captured in Figure 1. An input mutator engine $M$ generates a new input $I$ by performing a fuzzing action $a$, and subsequently observes a new state $x$ directly derived from $I$ as well as a reward $r$ that is measured by executing the target program $P$ with input $I$. We reduce input fuzzing to a reinforcement learning problem by formalizing it using Markov decision processes [7]. Our formalization allows us to apply state-of-the-art machine learning methods. In particular, we experiment with deep Q-learning.
In summary, we make the following contributions:

We formalize fuzzing as a reinforcement learning problem using the concept of Markov decision processes.

We introduce a fuzzing algorithm based on deep Q-learning that learns to choose highly rewarded fuzzing actions for any given program input.

We implement and evaluate a prototype of our approach.

We present empirical evidence that reinforcement fuzzing can outperform baseline random fuzzing.
II Related Work
Our work is influenced by three main streams of research: fuzzing, grammar reconstruction, and deep learning.
II-A Fuzzing
There are three main types of fuzzing techniques in use today: (1) blackbox random fuzzing [1, 8], (2) whitebox constraint-based fuzzing [9], and (3) grammar-based fuzzing [10, 1], which can be viewed as a variant of model-based testing [11]. Blackbox and whitebox fuzzing are fully automatic, and have historically proved to be very effective at finding security vulnerabilities in binary-format file parsers. In contrast, grammar-based fuzzing is not fully automatic: it requires an input grammar specifying the input format of the application under test. This grammar is typically written by hand, and this process is laborious, time-consuming, and error-prone. Nevertheless, grammar-based fuzzing is the most effective fuzzing technique known today for fuzzing applications with complex structured input formats, like web browsers, which must take as (untrusted) inputs web pages including complex HTML documents and JavaScript code.
State-of-the-art fuzzing tools like SAGE [9] or AFL [12] use coverage-based heuristics to guide their search for bugs towards less-covered code parts, but they do not use machine learning techniques as done in this paper.
Combining statistical neural-network-based machine learning with fuzzing is a novel approach and, to the best of our knowledge, there is just one prior paper on this topic: Godefroid et al. [13] use character-based language models to learn a generative model of fuzzing inputs, but they do not use reinforcement learning.
II-B Grammar Reconstruction
Research on reconstructing grammars from sample inputs for testing purposes started in the early 1970s [10, 14]. More recently, Bastani et al. [15] proposed an algorithm for automatic synthesis of a context-free grammar given a set of seed inputs and a blackbox target. Cui et al. [16] automatically detect record sequences and types in the input by identifying chunks based on taint tracking of input data in the respective subroutine calls. Similarly, the authors of [17] apply dynamic tainting to identify failure-relevant inputs. Another recently proposed approach [18] mines input grammars from valid inputs based on feedback from dynamic instrumentation of the target by tracking input characters.
II-C Deep Learning
Reinforcement learning [19] emerged from trial-and-error learning and optimal control for dynamic programming [7]. In particular, the Q-learning approach introduced by Watkins [20, 21] was recently combined with deep neural networks [3, 4, 5, 6] to efficiently learn policies over large state spaces, and has achieved impressive results in complex tasks.
III Reinforcement Learning
In this section we give the necessary background on reinforcement learning. We first introduce the concept of Markov decision processes [7], which provides the basis to formalize fuzzing as a reinforcement learning problem. Then we discuss the Q-learning approach to such problems and motivate the application of deep Q networks.
Reinforcement learning is the process of adapting an agent’s behavior during interaction with a system such that it learns to maximize the received rewards based on performed actions and system state transitions. The agent performs actions on a system it tries to control. For each action, the system undergoes a state transition. In turn, the agent observes the new state and receives a reward. The aim of the agent is to maximize its cumulative reward received during the overall time of system interaction. The following formal notation relates to the presentation given in [19].
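The interaction of the agent with the system can be seen as a stochastic process. In particular, a Markov decision process is defined as $\mathcal{M} = (X, A, P_0)$, where $X$ denotes a set of states, $A$ a set of actions, and $P_0$ the transition probability kernel. For each state-action pair $(x, a) \in X \times A$ and each measurable set $U \subseteq X \times \mathbb{R}$, the kernel gives the probability $P_0(U \mid x, a)$ that performing action $a$ in state $x$ causes the system to transition into some state of $X$ and yield some real-valued reward such that the resulting pair lies in $U$. $P_0$ directly provides the state transition probability kernel $P$ for single transitions
$$P(x, a, y) = P_0(\{y\} \times \mathbb{R} \mid x, a). \tag{1}$$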
This naturally gives rise to a stochastic process: an agent observing a certain state $x \in X$ chooses an action $a \in A$ to cause a state transition with the corresponding reward. By subsequently observing state transitions with corresponding rewards, the agent aims to learn an optimal behavior that earns the maximal possible cumulative reward over time. Formally, with the stochastic variables $(y(x,a), r(x,a))$ distributed according to $P_0(\cdot \mid x, a)$, the expected immediate reward for each choice of action is given by $\mathbb{E}[r(x,a)]$. In the following, for a stochastic variable $v$ the notation $v \sim D$ indicates that $v$ is distributed according to $D$. During the stochastic process $(X_t, A_t, R_{t+1})_{t \ge 0}$ the aim of an agent is to maximize the total discounted sum of rewards
$$\mathcal{R} = \sum_{t=0}^{\infty} \gamma^{t} R_{t+1}, \tag{2}$$
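where $\gamma$ with $0 \le \gamma \le 1$ indicates a discount factor that prioritizes rewards in the near future. The choice of action $a$ an agent makes in reaction to observing state $x$ is determined by its policy $a \sim \pi(\cdot \mid x)$. The policy maps observed states to actions and therefore determines the behavior of the agent. Let
$$Q^{\pi}(x, a) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, X_0 = x, A_0 = a \right] \tag{3}$$
denote the expected cumulative reward for an agent that behaves according to policy $\pi$. Then we can reduce our problem of approximating the best policy to approximating the optimal $Q$ function. One practical way to achieve this is adjusting $Q$ after each received reward according to
\begin{align}
Q(x, a) \leftarrow{} & Q(x, a) \tag{4} \\
& + \alpha \left( r + \gamma \max_{a'} Q(y, a') - Q(x, a) \right), \tag{5}
\end{align}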
where $\alpha$ indicates the learning rate. The process in this setting works as follows: the agent observes a state $x$ and performs the action $a = \arg\max_{a'} Q(x, a')$ (where $\arg\max$ denotes the argument value that maximizes $Q$) that maximizes the total expected future reward, and thereby causes a state transition from $x$ to $y$. Receiving reward $r$ and observing $y$, the agent then considers the best possible action $\arg\max_{a'} Q(y, a')$. Based on this consideration, the agent updates the value $Q(x, a)$. If, for example, the decision of taking action $a$ in state $x$ led to a state $y$ that allows choosing a high-reward action and additionally yielded a high reward $r$, the $Q$ value for this decision is adapted accordingly. Here, the factor $\alpha$ determines the rate of this $Q$ function update.
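The update rule in Equation (4) can be illustrated with a minimal tabular sketch. The function name `q_update`, the state and action labels, and the values of the learning rate and discount factor are hypothetical and chosen only for illustration; this is not the implementation evaluated later.

```python
from collections import defaultdict


def q_update(q, x, a, r, y, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning step for the observed transition (x, a, r, y)."""
    best_next = max(q[(y, b)] for b in actions)      # best estimate in next state
    td_error = r + gamma * best_next - q[(x, a)]     # target minus current estimate
    q[(x, a)] += alpha * td_error
    return q[(x, a)]


# Hypothetical states "s0", "s1" and actions "flip", "delete".
q = defaultdict(float)
actions = ["flip", "delete"]
first = q_update(q, "s0", "flip", r=1.0, y="s1", actions=actions)
second = q_update(q, "s1", "delete", r=0.0, y="s0", actions=actions)
```

The second call shows how reward propagates backwards: the value of acting in "s1" rises only because the successor state "s0" already carries a positive estimate.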
For small state and action spaces, $Q$ can be represented as a table. However, for large state spaces we have to approximate $Q$ with an appropriate function. An approximation using deep neural networks was recently introduced by Mnih et al. [5]. For such a representation, the update rule in Equation (4) directly translates to minimizing the loss function
$$L = \left( r + \gamma \max_{a'} Q(y, a') - Q(x, a) \right)^{2}. \tag{6}$$
The learning rate $\alpha$ in Equation (4) then corresponds to the rate of stochastic gradient descent during backpropagation.
Deep Q networks have been shown to handle large state spaces efficiently. This allows us to define an end-to-end algorithm directly on raw program inputs, as we will see in the next section.
IV Modeling Fuzzing as a Markov Decision Process
In this section we formalize fuzzing as a reinforcement learning problem using a Markov decision process by defining states, actions, and rewards in the fuzzing context.
IV-A States
We consider the system that the agent learns to interact with to be a given “seed” program input. Further, we define the states that the agent observes to be substrings of consecutive symbols within such an input. Formally, let $\Sigma$ denote a finite set of symbols. The set of possible program inputs $\mathcal{I}$ written in this alphabet is then defined by the Kleene closure $\mathcal{I} := \Sigma^{*}$. For an input string $x = (x_1, \ldots, x_n) \in \mathcal{I}$ let
$$S(x) := \{ (x_{1+i}, \ldots, x_{m+i}) \mid i \ge 0,\ m + i \le n \} \tag{7}$$
denote the set of all substrings of $x$. Clearly, $\bigcup_{x \in \mathcal{I}} S(x) = \mathcal{I}$ holds. We define the states of our Markov decision process to be $\mathcal{I}$. In the following, $x \in \mathcal{I}$ denotes an input for the target program and $x' \in S(x)$ a substring of this input.
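To make the set of all substrings in Equation (7) concrete, the following small sketch enumerates it for a byte string. The function name `substrings` is hypothetical, and the empty substring is left out here as a simplifying assumption, since it carries no information as an observed state.

```python
def substrings(x: bytes) -> set:
    """Return all non-empty contiguous substrings of x."""
    n = len(x)
    return {x[i:i + m] for m in range(1, n + 1) for i in range(n - m + 1)}


states = substrings(b"abc")
```

An input of length n has at most n(n+1)/2 distinct non-empty substrings, so the number of observable states grows quadratically with the seed size.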
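IV-B Actions
We define the set of possible actions $A$ of our Markov decision process to be random variables mapping substrings of an input to probabilistic rewriting rules
$$A := \{ a : \mathcal{I} \to (\mathcal{I} \times \mathcal{I}, \mathcal{F}, P) \mid a \sim \pi(x) \}, \tag{8}$$
where $\mathcal{F} = \sigma(\mathcal{I} \times \mathcal{I})$ denotes the $\sigma$-algebra of the sample space $\mathcal{I} \times \mathcal{I}$ and $P$ gives the probability for a given rewrite rule. In our implementation (see Section VI) we define a small subset of probabilistic string rewrite rules that operate on a given seed input.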
IV-C Rewards
We define rewards independently for two characteristics: 1) the next performed action $a$ and 2) the program execution with the next generated input $x$, i.e., $R(x, a) = E(x) + G(a)$.
In our implementation in Section VI we experiment with providing the number of newly discovered basic blocks, the execution path length, and the execution time of the target that processes the input $x$. For example, we can define the number of newly discovered blocks as
$$E(x) := \left| B(c_x) \setminus \bigcup_{\chi \in \mathcal{I}'} B(c_\chi) \right|, \tag{9}$$
where $c_x$ denotes the execution path the target program takes when processing input $x$, $B(c_x)$ is the set of unique basic blocks of this path, and $\mathcal{I}'$ is the set of previously processed inputs. Here, we define a basic block as a sequence of program instructions without branch instructions.
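The coverage reward of Equation (9) is a set difference followed by a count, which the following sketch spells out. It assumes that an instrumentation tool has already produced the basic block addresses of each execution; the function name `new_block_reward` and the addresses are hypothetical.

```python
def new_block_reward(path_blocks, previous_blocks):
    """Count basic blocks on the current path that no previous input reached.

    path_blocks: block addresses along the current execution path (may repeat).
    previous_blocks: union of unique blocks reached by previously processed inputs.
    """
    return len(set(path_blocks) - set(previous_blocks))


seen = {0x400000, 0x400010}
reward = new_block_reward([0x400010, 0x400020, 0x400020, 0x400030], seen)
memoryless = new_block_reward([0x400010, 0x400020, 0x400020, 0x400030], set())
```

With an empty history the reward reduces to the number of unique blocks on the path.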
V Reinforcement Fuzzing Algorithm
In this section we present the overall reinforcement fuzzing algorithm.
V-A Initialization
We start with an initial seed input. The choice of seed is not constrained in any way; it may not even be valid with regard to the input format of the target program. Next, we initialize the $Q$ function. For this, we apply a deep neural net that maps states to the estimated $Q$ values of each action, i.e., we simultaneously approximate the $Q$ values for all actions given a state as defined in Equation (7). This representation provides the advantage that we only need one forward pass to simultaneously receive the $Q$ values for all actions, instead of one forward pass per action. During $Q$ function initialization we distribute the network weights randomly.
V-B State Extraction
The state extraction step State() takes as input a seed and outputs a substring of this seed. In Section IV we defined the states of our Markov decision process to be the set of all strings over the input alphabet. For the given seed we extract a strict substring at a given offset and of a given width. In other words, the seed corresponds to the system as depicted in Figure 1, and the reinforcement agent observes a fragment of the whole system via this substring. We experimented with controllable (via action) and predefined choices of offsets and substring widths, as discussed in Section VI.
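A minimal sketch of the state extraction step, assuming the seed is held as a Python byte string. The name `extract_state`, the bounds checks, and the example seed are illustrative assumptions.

```python
def extract_state(seed: bytes, offset: int, width: int) -> bytes:
    """Return the strict substring of `seed` starting at `offset` with `width` bytes."""
    if not 0 < width < len(seed):
        raise ValueError("width must describe a strict, non-empty substring")
    if offset < 0 or offset + width > len(seed):
        raise ValueError("window must lie inside the seed")
    return seed[offset:offset + width]


state = extract_state(b"%PDF-1.4 obj", offset=5, width=3)
```

Rejecting out-of-range windows explicitly avoids the silent truncation that plain slicing would perform at the end of the seed.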
V-C Action Selection
The action selection step takes as input the current $Q$ function and an observed state, and outputs an action as defined in Equation (8). Actions are selected according to the policy following an $\epsilon$-greedy behavior: with probability $1 - \epsilon$ (for a small $\epsilon > 0$) the agent selects an action that is currently estimated optimal by the $Q$ function, i.e., it exploits the best possible choice based on experience. With probability $\epsilon$ it explores any other action, where the choice is uniformly distributed within the set of actions.
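The selection rule can be sketched in a few lines. The function name `select_action` and the representation of the estimates as a plain list with one value per action index are assumptions made for illustration.

```python
import random


def select_action(q_values, epsilon, rng=random):
    """Pick an action index following an epsilon-greedy policy.

    q_values: estimated Q values for the observed state, one per action.
    epsilon: probability of exploring a uniformly chosen action.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit


greedy = select_action([0.1, 0.9, 0.3], epsilon=0.0)
explored = select_action([0.1, 0.9, 0.3], epsilon=1.0, rng=random.Random(0))
```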
V-D Mutation
The mutation step takes as input a seed and an action. It outputs the string that is generated by applying the action to the seed. As indicated in Equation (8), we define actions to be mappings to probabilistic rewriting rules and not rewriting rules on their own. So applying an action to a seed means that we mutate the seed according to the rewrite rule that the action maps to within its probability space. We make this separation to distinguish between the randomness in the choice of the action and the randomness within the rewrite rule.
V-E Reward Evaluation
The reward evaluation step takes as input the target program, an action, and an input that was generated by applying this action to a seed. It outputs a positive number, the reward. The stochastic reward variable sums up the rewards for both the generated input and the selected action. The reward for the generated input is based on characteristics recorded during the program execution, as defined in Section IV-C.
V-F Update
The update step takes as input the extracted substring, the action that generated the mutated input, the evaluated reward, and the $Q$ function approximation, which in our case is a deep neural network. It outputs the updated $Q$ approximation. As indicated above, the choice of applying a deep neural network is motivated by the requirement to learn on raw substrings. The $Q$ function predicts, for a given state, the expected rewards for all defined actions simultaneously, i.e., it maps a substring to a vector with one entry per action. We update $Q$ in the sense that we adapt the predicted reward value of the selected action according to the target, i.e., the received reward plus the discounted best prediction for the next state. This yields the loss function given by Equation (6) for the selected action. All other actions are updated with zero loss. The convergence rate of $Q$ is primarily determined by the learning rate of stochastic gradient descent during backpropagation as well as the choice of the discount factor $\gamma$.
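One common way to obtain zero loss for all non-selected actions is to copy the network's own predictions into the target vector and overwrite only the entry of the selected action. The sketch below shows this construction on plain lists; the function names and the numeric values are hypothetical, and the actual network training is not shown.

```python
def q_target(q_pred, action, reward, q_next, gamma=0.9):
    """Build the training target for one transition.

    q_pred: predicted Q values for the observed state, one per action.
    q_next: predicted Q values for the next state.
    """
    target = list(q_pred)                          # zero loss for untouched actions
    target[action] = reward + gamma * max(q_next)  # target as in Equation (6)
    return target


def squared_loss(q_pred, target):
    """Sum of squared differences between prediction and target."""
    return sum((t - q) ** 2 for q, t in zip(q_pred, target))


pred = [0.2, 0.5, 0.1]
target = q_target(pred, action=1, reward=1.0, q_next=[0.0, 1.0, 0.3])
loss = squared_loss(pred, target)
```

Only the selected action contributes to the loss, so a gradient step moves just that output towards its target.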
V-G Joining the Pieces
Now that we have presented all individual steps, we can combine them to get the overall fuzzing algorithm as depicted in Figure 2.
We start with an initialization phase that outputs a seed as well as the initial version of $Q$. Then, the fuzzer enters the loop of state extraction, action selection, input mutation, reward evaluation, $Q$ update, and test case reset. Starting with a seed, the algorithm extracts a substring and, based on this observed state, chooses the next action according to its policy. The choice is made by looking at the best possible reward predicted via $Q$ and applying an $\epsilon$-greedy exploitation-exploration strategy. To guarantee initial exploration, we initially define a relatively high value for $\epsilon$ and monotonically decrease $\epsilon$ over time until it reaches a final small threshold; from then on it remains constant. The selected action provides a string substitution as indicated in Equation (8), which is applied to the seed for mutation. The generated mutant input is fed into the target program to evaluate the reward. Together with $Q$, the observed state, and the selected action, this reward is taken into account for the $Q$ update. Finally, the Reset() function periodically resets the input to a valid seed. In our implementation we reset the seed after each mutation, as described in Section VI. After the reset, the algorithm continues the loop.
We formulated the algorithm with just one single input seed. However, we could generalize this to a set of seed inputs by choosing another seed within this set for each iteration of the main loop.
The algorithm above performs reinforcement fuzzing with activated policy learning. We show in our evaluation in Section VI that the $Q$ network generalizes on states. This allows us to switch to high-throughput mutant generation with a fixed policy after a sufficiently long training phase.
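The loop of this section can be sketched end to end with toy components. Everything below is a simplifying assumption for illustration: a table replaces the deep Q network, `toy_reward` stands in for executing an instrumented target, the two actions are deterministic placeholders, the mutated window is treated as the next state, and all parameter values are arbitrary.

```python
import random
from collections import defaultdict


def reverse_window(window):
    return window[::-1]


def invert_first_byte(window):
    return bytes([window[0] ^ 0xFF]) + window[1:]


ACTIONS = [reverse_window, invert_first_byte]


def toy_reward(data):
    """Stand-in for running the target: number of distinct byte values."""
    return float(len(set(data)))


def fuzz(seed, steps=200, width=4, alpha=0.1, gamma=0.9,
         eps_start=1.0, eps_end=0.1, eps_decay=0.99, rng_seed=0):
    rng = random.Random(rng_seed)
    q = defaultdict(lambda: [0.0] * len(ACTIONS))    # table instead of a network
    eps, total = eps_start, 0.0
    for _ in range(steps):
        x = seed                                     # reset: start from the seed
        offset = rng.randrange(len(x) - width + 1)   # state extraction
        state = x[offset:offset + width]
        if rng.random() < eps:                       # epsilon-greedy selection
            a = rng.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        window = ACTIONS[a](state)                   # mutation
        mutant = x[:offset] + window + x[offset + width:]
        r = toy_reward(mutant)                       # reward evaluation
        target = r + gamma * max(q[window])          # Q update
        q[state][a] += alpha * (target - q[state][a])
        eps = max(eps_end, eps * eps_decay)          # decrease exploration
        total += r
    return q, total


q_table, total_reward = fuzz(b"%PDF-1.4 1 0 obj << >> endobj")
```

Freezing the table and calling only the selection and mutation steps would correspond to the fixed-policy mode mentioned above.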
VI Implementation and Evaluation
In this section we present details regarding our implementation together with an evaluation of the prototype.
VI-A Target Programs
As fuzzing targets we chose programs processing files in the Portable Document Format (PDF). This format is complex enough to provide a realistic testbed for evaluation. From the 1,300-page PDF specification [22], we just need the following basic understanding: each PDF document is a sequence of PDF bodies, each of which includes three sections named objects, cross-reference table, and trailer. While our algorithm is defined to be independent of the targeted input format, we used this structure to define fuzzing actions specifically crafted for PDF objects.
Initially we tested different PDF processing programs, including the PDF parser in the Microsoft Edge browser on Windows and several command line converters on Linux. All results in the following presentation refer to fuzzing the pdftotext program, mutating a seed file of PDF objects that include binary fields.
VI-B Implementation
In the following we present details regarding our implementation of the proposed reinforcement fuzzing algorithm. We apply existing frameworks for binary instrumentation and neural network training and implement the core framework including the learning module in Python 3.5.
State Implementation
Our fuzzer observes and mutates input files represented as binary strings. Depending on the choice of alphabet, we can choose between state representations of different granularity, for example bit or byte representations. We encode the state of a substring as the sequence of bytes of this string. Each byte is converted to its corresponding float value when processed by the $Q$ network. As introduced in Section V, each state is characterized by the offset and the width of the observed substring within the seed.
Action Implementation
We implement each action as a function in a Python dictionary. As string rewriting rules we take both probabilistic and deterministic actions into account. In the following we list the action classes we experiment with.

Random Bit Flips. This type of action mutates the substring with predefined and dynamically adjustable mutation ratios.

Insert Dictionary Tokens. This action inserts tokens from a predefined dictionary. The tokens in the dictionary consist of ASCII strings extracted from a set of selected seed files.

Shift Offset and Width. This type of action shifts the offset and width of the observed substring. Left and right shift take place at the PDF object level. Increasing and decreasing the width take place with byte granularity.

Shuffle. We define two actions for shuffling substrings. The first action shuffles bytes within the observed substring; the second action shuffles three segments of the PDF object that is located around the current offset.

Copy Window. We define two actions that copy the observed substring to a random offset within the seed. The first action inserts the bytes of the substring; the second overwrites bytes.

Delete Window. This action deletes the observed substring.
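The idea of keeping each action as a function in a Python dictionary can be sketched as follows for a few of the action classes listed above. The function names, the default mutation ratio, and the token list are hypothetical placeholders, and the PDF-object-level actions are omitted.

```python
import random


def random_bit_flips(window, rng, ratio=0.05):
    """Flip each bit of the window independently with probability `ratio`."""
    out = bytearray(window)
    for i in range(len(out)):
        for bit in range(8):
            if rng.random() < ratio:
                out[i] ^= 1 << bit
    return bytes(out)


def insert_token(window, rng, tokens=(b"obj", b"endobj", b"stream")):
    """Insert a dictionary token at a random position inside the window."""
    pos = rng.randrange(len(window) + 1)
    return window[:pos] + rng.choice(tokens) + window[pos:]


def shuffle_bytes(window, rng):
    data = list(window)
    rng.shuffle(data)
    return bytes(data)


def delete_window(window, rng):
    return b""


ACTIONS = {
    "random_bit_flips": random_bit_flips,
    "insert_token": insert_token,
    "shuffle_bytes": shuffle_bytes,
    "delete_window": delete_window,
}


def mutate(seed, offset, width, action, rng):
    """Apply the named action to the window and splice the result into the seed."""
    window = seed[offset:offset + width]
    return seed[:offset] + ACTIONS[action](window, rng) + seed[offset + width:]


rng = random.Random(0)
mutant = mutate(b"abcdef", 2, 2, "delete_window", rng)
```

Because every entry shares the signature `(window, rng)`, the agent can address actions by index or name without knowing what each one does.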
Reward Implementation
For evaluation of the reward we experimented with both coverage and execution time information.
To measure the coverage reward as defined in Equation (9), we used existing instrumentation frameworks. We initially used the Microsoft Nirvana toolset for measuring code coverage for the PDF parser included in Edge. However, to speed up training of the net we switched to smaller parser targets. On Linux we implemented a custom Intel Pin tool that counts the number of unique basic blocks within the pdftotext program.
Network Implementation
We implemented the Q-learning module in TensorFlow [23] by constructing a feed-forward neural network with four layers connected by nonlinear activation functions. The two hidden layers included between 64 and 180 hidden units (depending on the state size), and we applied tanh as activation function. We initialize the weights randomly from a uniform distribution. The learning rate of the gradient descent optimizer is set to a fixed initial value.
VI-C Evaluation
In this section we evaluate our implemented prototype. We present improvements over a predefined baseline and also discuss current limitations. All measurements were performed on a Xeon E5-2690 machine. Table I summarizes the improvements obtained in accumulated rewards for different reward functions, state sizes, and generalization to new inputs. We now explain the results in more detail.
VI-C1 Baseline
To show that our new reinforcement learning algorithm actually learns to perform high-reward actions given an input observation, we define a comparison baseline policy that selects actions randomly, where the choice is uniformly distributed over the action space. After a fixed number of generations, we calculated the quotient of the most recent accumulated rewards of our algorithm and of the baseline to measure the relative improvement.
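The comparison against the baseline amounts to a quotient of two sums, as in the sketch below. The function name `relative_improvement`, the length of the window of most recent rewards, and the sample numbers are assumptions for illustration.

```python
def relative_improvement(agent_rewards, baseline_rewards, window):
    """Quotient of recent accumulated rewards, expressed as a relative gain."""
    if window <= 0:
        raise ValueError("window must be positive")
    agent_sum = sum(agent_rewards[-window:])
    baseline_sum = sum(baseline_rewards[-window:])
    if baseline_sum == 0:
        raise ValueError("baseline reward sum is zero")
    return agent_sum / baseline_sum - 1.0


improvement = relative_improvement([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 2.0, 2.0], window=2)
```

A value of 0.75 in this toy example would read as a 75% improvement over the random baseline.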
VI-C2 Replay Memory
We experimented with two types of agent memory: the recorded state-action-reward-state sequences and the history of previously discovered basic blocks. The first type of memory is established during the fuzzing process by storing these sequences in order to regularly replay samples of them in the $Q$ update step. For each replay step, a random experience is sampled from the stored sequences to train the $Q$ network. We could not measure any improvement compared to the baseline with this method. Second, comparing against the history of previously discovered basic blocks also did not result in any improvement. Only a memoryless choice, i.e., an empty set of previously processed inputs in Equation (9), yielded good results. In our algorithm as depicted in Figure 2, we reset the basic block history after each step via the Reset() function.
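Table I: Improvement in accumulated rewards compared to the baseline.

| Setting | Improvement |
|---|---|
| Reward functions | |
| Code coverage | 7.75% |
| Execution time | 7% |
| Combined | 11.3% |
| State width | |
| Smaller width | 7% |
| Larger width | 3.1% |
| Generalization to new inputs | |
| New input | 4.7% |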
Since neither type of agent memory yielded any improvement, we switched them off for the following measurements. Further, we deactivated all actions that do not mutate the seed input, e.g., actions that adjust the global mutation ratio of random bit flips or that shift offsets and state widths. Instead of active offset and state width selection via an agent action, we set the offset for each iteration randomly, where the choice is uniformly distributed over the valid offsets within the seed, and we fixed the state width.
VI-C3 Choices of Rewards
We experimented with three different types of rewards: maximization of code coverage, maximization of execution time, and a combined reward of coverage and rescaled execution time for multi-goal fuzzing. While the coverage reward is deterministic, the execution time reward comes with minor noise in the time measurement. Measuring the execution time for different seeds and mutations revealed a variance that is two orders of magnitude smaller than the respective mean, so that execution time is stable enough to serve as a reliable reward function. All three choices provided improvements with respect to the baseline.
When rewarding execution time, our proposed fuzzing algorithm accumulates on average 7% more execution time reward than the baseline (Table I).
Since both time and coverage rewards yielded comparable improvements with regard to the baseline, we tested to what extent those two types of rewards correlate by measuring the average Pearson correlation coefficient between coverage and execution time. The observed correlation motivates the combined reward, in which the execution time is rescaled by a multiplicative factor so that it contributes to the reward comparably to coverage. Training the $Q$ net with the combined reward yielded an improvement of 11.3% in execution time (Table I). This result is better than taking exclusively coverage or execution time into account. There are two likely explanations for this result. First, the noise of the time measurement could introduce rewarding explorative behavior of the $Q$ net. Second, deterministic coverage information could add stability to the time-based reward.
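The two computations in this paragraph, the correlation check and the rescaled combination, can be sketched as follows. The function names, the choice of the scaling factor as the ratio of summed coverage to summed time, and the sample measurements are assumptions for illustration and not the values used in our experiments.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def time_scale(coverages, times):
    """Multiplicative factor that brings execution times to the coverage scale."""
    return sum(coverages) / sum(times)


def combined_reward(coverage, exec_time, scale):
    return coverage + scale * exec_time


# Hypothetical measurements: basic block counts and execution times in seconds.
coverages = [100.0, 120.0, 140.0]
times = [0.010, 0.012, 0.014]
rho = pearson(coverages, times)
scale = time_scale(coverages, times)
reward = combined_reward(coverages[0], times[0], scale)
```

Without rescaling, block counts in the hundreds would dominate times in the millisecond range, and the time term would have almost no influence on the combined reward.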
VI-C4 Q-net Activation Functions
Of all activation functions provided by the TensorFlow framework, we found tanh to yield the best results for our setting. The following table compares the different activation functions with respect to improvement in the code coverage reward.

| tanh | sigmoid | elu | softplus | softsign | relu |
|---|---|---|---|---|---|
| 7.75% | 6.56% | 5.3% | 2% | 6.4% | 1.3% |
VI-C5 State Width
Increasing the state width decreased the improvement (measured in average reward compared to the baseline) from 7% to 3.1% (Table I). In other words, smaller substrings are better recognized than large ones. This indicates that our proposed algorithm actually takes the structure of the state into account and learns to perform the best-rewarded actions according to this specific structure.
VI-C6 State Generalization
In order to achieve high-throughput fuzzing, we tested whether the already trained $Q$ net generalizes to previously unseen inputs. This would allow us to switch off $Q$ net training after a while and therefore avoid the high processing costs of evaluating the coverage reward. To measure generalization, we restricted the offset in the training phase to values in the first half of the seed file. For testing, we omitted reward measurement in the $Q$ update step as depicted in Figure 2 to stop the training phase, and only considered offsets in the second half of the seed file. This way, the $Q$ net is confronted with previously unseen states. This resulted in an improvement in execution time of 4.7% compared to the baseline (Table I).
VII Conclusion
Inspired by the similar nature of feedback-driven random testing and reinforcement learning, we introduce the first fuzzer that uses reinforcement learning in order to learn high-reward mutations with respect to predefined reward metrics. By automatically rewarding runtime characteristics of the target program to be tested, we obtain new inputs that likely drive program execution towards a predefined goal, such as maximized code coverage or processing time. To achieve this, we formalize fuzzing as a reinforcement learning problem using Markov decision processes. This allows us to construct a reinforcement-learning fuzzing algorithm based on deep Q-learning that chooses high-reward actions given an input seed.
The policy as defined in Section III can be viewed as a form of generalized grammar for the input structure. Given a specific state, it suggests a string replacement (i.e., a fuzzing action) based on experience. Especially if we reward execution path depth, we indirectly reward validity of inputs with regard to the input structure, as invalid inputs are likely to be rejected early during parsing and result in small path depths. We presented preliminary empirical evidence that our reinforcement fuzzing algorithm can learn how to improve its effectiveness at generating new inputs based on successive feedback. Future research should investigate this further, with more setup variants, benchmarks, and experiments.
References
 [1] M. Sutton, A. Greene, and P. Amini, Fuzzing: Brute Force Vulnerability Discovery, 1st ed. Boston, MA, USA: Addison-Wesley Professional, 2007.
 [2] M. Howard and S. Lipner, The Security Development Lifecycle. Microsoft Press, 2006.
 [3] G. Tesauro, “Practical issues in temporal difference learning,” in Advances in neural information processing systems, 1992, pp. 259–266.
 [4] ——, “TD-Gammon: A self-teaching backgammon program,” in Applications of Neural Networks. Springer, 1995, pp. 267–285.
 [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [6] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
 [7] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998.
 [8] A. Takanen, J. DeMott, and C. Miller, Fuzzing for Software Security Testing and Quality Assurance, 1st ed. Norwood, MA, USA: Artech House, Inc., 2008.
 [9] P. Godefroid, M. Y. Levin, and D. A. Molnar, “Automated whitebox fuzz testing.” in NDSS, vol. 8, 2008, pp. 151–166. [Online]. Available: http://46.43.36.213/sites/default/files/Automated%20Whitebox%20Fuzz%20Testing%20(paper)%20(Patrice%20Godefroid).pdf
 [10] P. Purdom, “A sentence generator for testing parsers,” BIT Numerical Mathematics, vol. 12, no. 3, pp. 366–375, 1972.
 [11] M. Utting, A. Pretschner, and B. Legeard, “A Taxonomy of Model-Based Testing,” Department of Computer Science, The University of Waikato, New Zealand, Tech. Rep, vol. 4, 2006.
 [12] M. Zalewski, “American fuzzy lop,” http://lcamtuf.coredump.cx/afl/.
 [13] P. Godefroid, H. Peleg, and R. Singh, “Learn&fuzz: Machine learning for input fuzzing,” in 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE 2017), January 2017. [Online]. Available: https://www.microsoft.com/enus/research/publication/learnfuzzmachinelearninginputfuzzing/
 [14] K. V. Hanford, “Automatic generation of test cases,” IBM Systems Journal, vol. 9, no. 4, pp. 242–257, 1970.
 [15] O. Bastani, R. Sharma, A. Aiken, and P. Liang, “Synthesizing program input grammars,” in Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, ser. PLDI 2017. New York, NY, USA: ACM, 2017, pp. 95–110. [Online]. Available: http://doi.acm.org/10.1145/3062341.3062349
 [16] W. Cui, M. Peinado, K. Chen, H. J. Wang, and L. Irun-Briz, “Tupni: Automatic reverse engineering of input formats,” in Proceedings of the 15th ACM Conference on Computer and Communications Security, ser. CCS ’08. New York, NY, USA: ACM, 2008, pp. 391–402. [Online]. Available: http://doi.acm.org/10.1145/1455770.1455820
 [17] J. Clause and A. Orso, “Penumbra: Automatically identifying failure-relevant inputs using dynamic tainting,” in Proceedings of the Eighteenth International Symposium on Software Testing and Analysis, ser. ISSTA ’09. New York, NY, USA: ACM, 2009, pp. 249–260. [Online]. Available: http://doi.acm.org/10.1145/1572272.1572301
 [18] M. Höschele and A. Zeller, “Mining input grammars with autogram,” in Proceedings of the 39th International Conference on Software Engineering Companion, ser. ICSEC ’17. Piscataway, NJ, USA: IEEE Press, 2017, pp. 31–34. [Online]. Available: https://doi.org/10.1109/ICSEC.2017.14

 [19] C. Szepesvári, “Algorithms for reinforcement learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 4, no. 1, pp. 1–103, 2010.
 [20] C. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, Cambridge University, 1989.
 [21] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
 [22] PDF Reference, 6th ed., Adobe Systems Incorporated, Nov. 2006, available at http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_17.pdf.
 [23] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pp. 265–283.