1. Introduction and Motivation
The construction of replication strategies for contingent claims in the presence of risk and market friction is a key problem of financial engineering. Following the seminal article of Merton [21] on dynamic replication for option pricing, a vast body of literature emerged around the optimal control perspective on pricing and hedging contingent claims. The main idea is that under standard Black-Scholes-Merton (BSM) assumptions (in particular, in a complete and frictionless market) there is a continuous trading strategy in the stock and risk-free security that perfectly replicates the price of a European option. In reality, continuous-time trading is of course impossible. It cannot even serve as a reasonable approximation due to the high resulting transaction costs. Instead, the replicating portfolio is adjusted at discrete times, balancing replication error against trading costs. The acceptable deviation from the ideal hedge depends on the risk tolerance of the investor. As a consequence the best hedging strategy constitutes a trade-off between replication error and risk on the one hand and trading costs on the other.
Leland [19] describes a method for hedging European options when, in addition to the ordinary BSM assumptions, there are proportional transaction costs and time-discretization errors. Hodges and Neuberger [14] recognized that (for a special measure of risk) the hedging problem is naturally amenable to a formulation in terms of a Bellman equation, which relates the hedging problem to the area of Reinforcement Learning (RL). In recent years, RL has gained wide attention after trained RL agents were shown to outperform humans in playing Atari and board games such as Go and Chess [28, 29, 22]. Several publications have also reported on promising hedging performance of trained RL agents, see e.g. [14, 17, 11, 7, 5, 6].
In terms of the choice of training algorithm, the above-mentioned articles focus on a specific variation of Q-learning [38] in conjunction with deep neural networks called Deep Q-Learning (DQN) [23]. The article [11] proposes DQN as a tool for training hedging agents with a quadratic utility functional but with no transaction costs. The articles [5, 6] are concerned with DQN-based hedging under coherent risk measures, which have been identified as the right class of measures of risk in the financial mathematics literature. Recent work [7] applies double Q-learning [37] to take account of returns and risks in the hedging problem. The article [17] studies a DQN approach for hedging in discrete time in the presence of transaction costs, focusing on mean-variance equivalent loss distributions, and inspired our article. The article at hand deviates from the preceding publications in that it studies a bandit-type algorithm for the application in hedging. The motivation is that, as compared to DQN, bandit algorithms are better studied in the RL area, have stronger performance guarantees, are easier to train, and yield results that can be interpreted more immediately. Yet they are sufficient to address the accounting Profit and Loss (P&L) formulation of the hedging problem as in [17]. It is important to mention that there are two formulations of the hedging exercise discussed in the literature, see e.g. [7, 4], which differ in the way rewards are computed for the RL agent: in the P&L formulation the agent is aware of the price of the derivative contract at each time step (e.g. from the BSM formula), as opposed to the Cash Flow formulation, where a replication portfolio is set up to fit the price process of the option contract; in this alternative formulation rewards are computed from changes of the value of the replication portfolio.
We focus on the P&L formulation of the hedging problem as a Risk-averse Contextual Multi-Armed Bandit (RCMAB) model. The multi-armed bandit is the prototypical model of reinforcement learning and has been studied extensively [3] within diverse areas such as online advertising selection, clinical trials, finance, etc., and configurations, such as neural-net-based policies, Gaussian policies, etc. As compared to the full-fledged RL setting and, in particular, to DQN, the multi-armed bandit takes limited account of the impact of an action on the environment. It is characterized by the assumption that current actions only influence the immediate rewards but not subsequent rewards. In terms of hedging, this implies that the agent's choices do not dynamically impact market prices (where a static functional model is still an option). The perspective taken is that:

standard microscopic market models in terms of Itô processes do not depend on trading decisions of market participants;

strong market impact of trading decisions signifies a singular situation. Any system trained on a simulated environment acquires a behavior typical of the simulation and will not be able to make informed decisions in a singular state.
In other words, an RL agent trained on simulated data from Itô processes will not usually acquire the knowledge to appropriately capture market impact in a practical setting. The latter would require training on samples with market impact, which in practice will often be hard to provide in sufficient quantity. In the case of a simulated market impact, such samples will simply present the modeled cost of the market impact.
2. Background and Methodology
2.1. Hedging
Under the standard theory of risk-bearing [15] the rational investor (agent) chooses actions to maximize the expected utility of wealth $w_T$ over the investment horizon $T$,
(2.1) $\max \mathbb{E}[u(w_T)],$
where the utility function $u$ is smooth, increasing and concave. The terminal wealth $w_T$ is the result of an initial endowment $w_0$ and a sequence of investment decisions at times $t = 0, \dots, T-1$. We focus on hedging a short euro vanilla call option contract but the discussion extends similarly to more general contracts. Let $S_t$ denote the price of the underlying at time $t$, and $K$ the strike price of the option contract. The terminal payoff of the call can be written as $(S_T - K)^+ = \max(S_T - K, 0)$.
The issuer of the option contract sets up a replicating portfolio $\Pi_t$ to hedge the risk of the option,
$\Pi_t = n_t S_t + B_t,$
where $n_t$ is the number of shares of the underlying (priced at $S_t$) held by the agent and $B_t$ is the issuer's bank account. For simpler notation we assume a flat interest rate (IR) market, i.e. a constant discount factor. The replication portfolio should match as closely as possible the value of the option contract over the investment horizon. The composition of the replication portfolio is computed by backward recursion beginning with
and setting
where the conditioning is on the natural filtration generated by the underlying stochastic processes. The self-financing constraint imposes that shares bought at time $t$ have to be equally billed to the bank account:
As a consequence it holds that
(2.2) 
This expresses the task of pricing and hedging as a recursive optimization problem, where remains to be determined. The open question is which utility function should be used. A natural choice is to measure risk as negative variance, which corresponds to a quadratic utility function in (2.1). In the P&L formulation of hedging the agent is provided with the price such that reward could be simply defined as
(2.3) 
This is motivated by the famous BSM delta hedge, and of course in the limit of vanishing time steps and a continuous action space the optimal action will be just the BSM delta, i.e. the sensitivity of the option price to the price of the underlying.
An important point about quadratic utility is that it is well known not to satisfy a Bellman optimality equation. In fact the exponential utility function
$u(w) = -e^{-\lambda w}, \quad \lambda > 0,$
in (2.1) leads to option contract values that follow a Bellman equation. As recognized already in [14] (in 1989) this is the setting of RL. So-called coherent risk measures have been identified as the right measures of risk in the mathematical finance literature. The recent articles of Buehler et al. [5, 6] investigate the respective hedging problems within RL methods.
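To make the discrete-time hedging setting concrete, the following Python sketch simulates a short call that is delta-hedged at a finite number of dates. It is an illustration only and not the setup of this article: zero interest rates, a driftless Geometric Brownian Motion, the absence of transaction costs and all numerical values (spot 100, strike 100, volatility 0.2, maturity 0.25) are assumptions made here, and the function names are hypothetical.

```python
import math
import random


def norm_cdf(x):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bsm_call(spot, strike, vol, tau):
    """Return (price, delta) of a European call, assuming zero interest rates."""
    if tau <= 0.0:
        return max(spot - strike, 0.0), (1.0 if spot > strike else 0.0)
    d1 = (math.log(spot / strike) + 0.5 * vol * vol * tau) / (vol * math.sqrt(tau))
    d2 = d1 - vol * math.sqrt(tau)
    return spot * norm_cdf(d1) - strike * norm_cdf(d2), norm_cdf(d1)


def hedge_short_call(n_steps, rng, spot=100.0, strike=100.0, vol=0.2, maturity=0.25):
    """Terminal P&L of a short call that is delta-hedged at n_steps equally spaced dates."""
    dt = maturity / n_steps
    price, delta = bsm_call(spot, strike, vol, maturity)
    cash = price - delta * spot  # premium received minus cost of the initial hedge
    for i in range(1, n_steps + 1):
        z = rng.gauss(0.0, 1.0)
        spot *= math.exp(-0.5 * vol * vol * dt + vol * math.sqrt(dt) * z)  # driftless GBM step
        if i < n_steps:
            _, new_delta = bsm_call(spot, strike, vol, maturity - i * dt)
            cash -= (new_delta - delta) * spot  # self-financing rebalance
            delta = new_delta
    return cash + delta * spot - max(spot - strike, 0.0)


def hedging_rmse(n_steps, n_paths=2000, seed=0):
    """Root-mean-square terminal P&L over simulated paths."""
    rng = random.Random(seed)
    total = sum(hedge_short_call(n_steps, rng) ** 2 for _ in range(n_paths))
    return math.sqrt(total / n_paths)
```

Under these assumptions, comparing `hedging_rmse(4)` with `hedging_rmse(64)` shows the replication error shrinking as the number of rebalancing dates grows, which is the frictionless limit in which the BSM delta becomes the optimal action.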
2.2. Contextual Multi-armed Bandit Model
The multi-armed bandit is a prototypical instance of RL and has been studied extensively [16, 2, 32] within diverse areas such as ad selection, clinical trials, finance, etc. The setup involves a set of $k$ possible choices (the name originates from a gambler who chooses from $k$ slot machines) and a sequence of $T$ periods. In each period $t = 1, \dots, T$ the learner makes her choice $a_t$ and receives a corresponding random reward $R_t$. The objective is to develop a policy that specifies the action to take at each period to maximize the cumulative reward
$\sum_{t=1}^{T} R_t$
over the execution episode. In the contextual multi-armed bandit setting, the agent is faced with non-stationary reward distributions. In each round the agent receives context about the state of the multi-armed bandit. Thus the task involves both trial-and-error learning to search for the best actions and the association of the action with the given context.
Contextual search tasks are an intermediate between the multi-armed bandit problem and the full RL problem [32, 18, 27]. They are like the full RL problem in that they involve learning a policy, but like the ordinary multi-armed bandit problem in that each action affects only the immediate reward. Like any RL agent, the contextual multi-armed bandit must trade off between exploration and exploitation: is it better to choose a lucrative action given a context, or should the agent explore in the hope of finding something even better? A priori, exploration implies risk in the sense that the agent must deviate from an optimal action. To take account of the hedging risk, in this article we employ a risk-averse version of CMAB.
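The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions made here, not the algorithm used in this article: contexts come from a small finite set, rewards are Gaussian with unit variance, and exploration is plain epsilon-greedy; the function name and all parameter values are hypothetical.

```python
import random


def run_contextual_bandit(true_means, n_rounds=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy learner for a contextual bandit with finitely many contexts.

    true_means[c][a] is the (unknown to the learner) expected reward of arm a in context c.
    Returns the greedy arm learned for each context and the cumulative reward.
    """
    rng = random.Random(seed)
    n_contexts, n_arms = len(true_means), len(true_means[0])
    counts = [[0] * n_arms for _ in range(n_contexts)]
    estimates = [[0.0] * n_arms for _ in range(n_contexts)]
    total_reward = 0.0
    for _ in range(n_rounds):
        ctx = rng.randrange(n_contexts)      # the environment reveals the context
        if rng.random() < epsilon:           # explore: try a random arm
            arm = rng.randrange(n_arms)
        else:                                # exploit: best arm for this context so far
            arm = max(range(n_arms), key=lambda a: estimates[ctx][a])
        reward = rng.gauss(true_means[ctx][arm], 1.0)
        counts[ctx][arm] += 1
        # running average of the rewards seen for this (context, arm) pair
        estimates[ctx][arm] += (reward - estimates[ctx][arm]) / counts[ctx][arm]
        total_reward += reward
    greedy = [max(range(n_arms), key=lambda a: estimates[c][a]) for c in range(n_contexts)]
    return greedy, total_reward
```

Note that the chosen arm never changes which context is drawn next; this is the defining restriction of the bandit setting compared with the full RL problem.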
While in the standard bandit problem the objective is to select the arm leading to the highest reward in expectation, the mean-variance bandit problem focuses on finding the arm that most effectively trades off its expected reward against its variance (i.e. the risk). In choosing mean-variance as our measure of risk we follow [24]. In this article the mean-variance of action $a$ is
$\mathrm{MV}_a = \mu_a - \rho\,\sigma_a^2,$
where $\rho$ measures the risk aversion, and $\mu_a$ and $\sigma_a$ are, respectively, the mean and the standard deviation associated to action $a$. The optimal action maximizes mean-variance, that is $a^* = \arg\max_a \mathrm{MV}_a$. For sample outcomes $x_{a,1}, \dots, x_{a,n}$ of action $a$, the empirical mean-variance at a given time is
$\widehat{\mathrm{MV}}_a = \hat\mu_a - \rho\,\hat\sigma_a^2,$
with empirical mean and standard deviation given by
$\hat\mu_a = \frac{1}{n}\sum_{i=1}^{n} x_{a,i}, \qquad \hat\sigma_a^2 = \frac{1}{n}\sum_{i=1}^{n} (x_{a,i} - \hat\mu_a)^2.$
The mean-variance multi-armed bandit problem has been introduced in [25], where also the mean-variance lower confidence bound algorithm (MV-LCB) is proposed. MV-LCB generalizes the classical upper confidence bound algorithm of [2] to the mean-variance setup. [33] proposes a risk-averse contextual bandit model for portfolio selection and contains a more detailed description of the model.
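As an illustration of how a mean-variance criterion changes arm selection, the sketch below scores each arm by its empirical mean minus a risk-aversion weight times its empirical variance and adds an optional exploration bonus. The specific form of the score (mean minus weight times variance, maximized) and the generic square-root bonus are assumptions made for this example; the exact confidence term of MV-LCB in [25] is not reproduced, and the function names are hypothetical.

```python
import math


def empirical_mean_variance(samples, risk_aversion):
    """Empirical mean minus risk_aversion times the (biased) empirical variance."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean - risk_aversion * variance


def select_arm(history, risk_aversion, bonus_scale=0.0):
    """Pick the arm with the largest optimistic mean-variance index.

    history[a] is the list of rewards observed so far for arm a.
    bonus_scale controls a generic exploration bonus (assumed form, not from [25]).
    """
    total = sum(len(rewards) for rewards in history)
    best_arm, best_index = 0, -math.inf
    for arm, rewards in enumerate(history):
        if not rewards:  # play every arm once before comparing indices
            return arm
        bonus = bonus_scale * math.sqrt(math.log(total) / len(rewards))
        index = empirical_mean_variance(rewards, risk_aversion) + bonus
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm
```

With `risk_aversion=0.0` the rule reduces to picking the arm with the highest empirical mean; increasing it makes a steady arm preferable to one with a higher mean but widely scattered rewards.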
2.3. Q-learning
In Q-learning the agent stores a Q-table that contains estimates of the quality of actions, given the current state of the agent. At each time step $t$, the agent i) performs an action $a_t$ following a policy based on the Q-table, ii) observes the corresponding reward $R_t$, and iii) updates the values in the Q-table based on observed rewards.
Let $V(s_t)$ be the estimate of the value of state $s_t$ at time $t$. Following a Monte Carlo approach for prediction, this estimate can be updated iteratively according to
(2.4) $V(s_t) \leftarrow V(s_t) + \alpha\,[G_t - V(s_t)],$
where $\alpha$ measures the size of Monte Carlo steps, and $G_t$ is the cumulative reward observed from time $t$ up to the termination of the training episode. The drawback of (2.4) is that the estimate can only be updated at the end of a training episode. The so-called Temporal Difference (TD) method [31] approximates $G_t$ by $R_t + \gamma V(s_{t+1})$ and leads to the following updating rule:
(2.5) $V(s_t) \leftarrow V(s_t) + \alpha\,[R_t + \gamma V(s_{t+1}) - V(s_t)],$
where $\gamma$ is a discount factor that weights the importance of current versus future rewards. The main advantage of (2.5) over (2.4) is that it can be evaluated at each time step. In Q-learning a table of action-values $Q(s_t, a_t)$ is stored at each time step $t$; in line with (2.5), the table is updated according to
(2.6) $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[R_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)].$
The iterative application of this procedure converges to the optimal policy if the learning rate $\alpha$ is not too large [32]. In contrast to the CMAB algorithm, Q-learning addresses the full-fledged RL problem, in particular reflecting the impact of trading decisions on the environment (market).
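The update rules (2.5) and (2.6) can be written as two short Python functions. The sketch assumes they take the standard textbook form of [32] (learning rate `alpha`, discount factor `gamma`, greedy maximum over next actions); the dictionary-based tables and the function names are hypothetical choices for illustration.

```python
from collections import defaultdict


def td_value_update(values, state, reward, next_state, alpha, gamma):
    """One temporal-difference step for a state-value table, as in (2.5)."""
    target = reward + gamma * values[next_state]
    values[state] += alpha * (target - values[state])
    return values[state]


def q_learning_update(q_table, state, action, reward, next_state, actions, alpha, gamma):
    """One Q-learning step for an action-value table, as in (2.6).

    q_table maps (state, action) pairs to estimates; missing entries default to 0.
    """
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


# Example: a single update from an empty table with one known next-state value.
q = defaultdict(float)
q[("s1", 0)] = 2.0
updated = q_learning_update(q, "s0", 1, 1.0, "s1", [0, 1], alpha=0.5, gamma=0.9)
```

Setting `gamma=0.0` removes the bootstrapped term, so the update tracks only the immediate reward of the chosen action; this is the sense in which the bandit setting is a special case of the full RL update.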
3. Algorithm design
The choice of RL algorithm depends on the type of interaction between agent and environment to be modeled. Bandit models are characterized by limited interaction: the environment is not influenced by agent decisions. This fits naturally into the P&L formulation of the hedging problem. In this article we compare the RCMAB algorithm to DQN, which has been the focus of the majority of previous works. The impressive performance of modern RL systems relies in equal parts on classical RL theory and Deep Neural Networks (NN). More specifically, in modern RL applications (see e.g. [27, 28, 29, 20, 13, 26]) NNs serve as flexible approximators that estimate expected rewards from complex states. Two NN-based architectures have been implemented for comparison.
3.1. Deep RCMAB algorithm
In order to maximize cumulative reward, NNs need to trade off what is expected to be best at the moment (i.e., exploitation) against potentially suboptimal exploratory actions, which are needed to learn from the data. In addition, exploratory actions should be coordinated along the entire decision-making process, rather than be performed independently at every step. This is where Thompson Sampling [35] enters into the RL framework. Thompson Sampling dynamically deals with the exploration-exploitation dilemma by maintaining at each time step a Bayesian estimate of the posterior distribution over models, and sampling actions in proportion to the probability that they are optimal [8, 1]. The correlations present in the sequence of observations (like the similarity of the images received from an Atari game at consecutive times) might cause instability of the training process as the neural network tends to overfit to this correlation. A common method to remedy this instability is to use a replay memory buffer:
Experience Replay [20]: In our architecture the transition is stored at each time step into a memory buffer of given capacity. For learning, a batch of given size is taken from the replay memory, which removes correlation from the observation sequence.
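A replay memory of this kind can be sketched as a bounded queue with random batch sampling. The uniform sampling without replacement, the tuple layout of a transition and the class name are assumptions made for illustration; the source only specifies a buffer of given capacity and batches of given size.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory that forgets the oldest transitions first."""

    def __init__(self, capacity, seed=None):
        self._buffer = deque(maxlen=capacity)  # deque drops the oldest item when full
        self._rng = random.Random(seed)

    def store(self, transition):
        """Append one transition, e.g. a (context, action, reward) tuple."""
        self._buffer.append(transition)

    def sample(self, batch_size):
        """Draw a batch uniformly at random, without replacement."""
        size = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), size)

    def __len__(self):
        return len(self._buffer)
```

Because consecutive transitions are unlikely to end up in the same batch, the training batches are far less correlated than the raw observation sequence.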
A typical shortcoming of linear neural networks is their lack of representational power. In order to alleviate this issue, we perform a Bayesian linear regression on top of the representation of the last layer of a neural network [30]. In other words, instead of directly regressing on input data, a neural network is trained and then a Bayesian linear regression is used on top to make decisions via Thompson Sampling. This implies that the weights of the output layer of the neural network are not used directly for the choice of the action. The Deep RCMAB Algorithm with Thompson Sampling is summarized in Algorithm 1. Notice also that the RCMAB algorithm has performance and convergence guarantees, see [32, 9, 18, 27] for details.
3.2. Deep Q-learning
The DQN algorithm was originally designed for learning from alternating visual data. The application of DQN to hedge a contingent claim constitutes an interesting off-label application of DQN algorithms. Existing publications and proof-of-concept implementations demonstrate that this application is indeed possible [17, 7, 5, 6], although simpler algorithms might yield better performance. Furthermore, the mentioned articles provide only little quantitative comparison of their results to established market practice, such as the BSM model.
In recent years various DQN-type algorithms have appeared. The architecture tested in the article at hand constitutes a trade-off between the stability of established algorithms and the performance of high-end algorithms. Our design follows the Atari game design [22]. The more recent Rainbow algorithms outperform the chosen algorithm in terms of sample efficiency [13]. From a practical perspective it is important not to underestimate the level of know-how and the maintenance effort needed for successful training and application of DQN algorithms. Q-learning is hard to train and intrinsically unstable [36], especially when used in conjunction with non-linear function approximators, such as neural networks, to approximate the action-value function. To remedy these instabilities two standard methods are employed.

Experience Replay: as described in Section 3.1;

Asynchronous updates: target values are updated with a larger period than Q-values, thereby reducing the impact of random fluctuations on the target. This approach is similar to supervised learning, where fixed targets are provided before learning starts.
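The effect of updating targets with a larger period than the Q-values can be sketched as follows. For brevity two lookup tables stand in for the online and target networks, which is an assumption of this example and not the architecture of this article; the class name and the parameter `sync_every` are hypothetical.

```python
import copy


class TabularDQNStandIn:
    """Q-learning with a periodically synchronised, otherwise frozen, copy of the Q-values."""

    def __init__(self, n_actions, gamma, sync_every):
        self.n_actions = n_actions
        self.gamma = gamma
        self.sync_every = sync_every
        self.online = {}   # state -> list of Q-values, updated at every step
        self.target = {}   # frozen copy used to build the regression targets
        self.steps = 0

    def _row(self, table, state):
        """Return the Q-values of a state, creating a zero row on first access."""
        return table.setdefault(state, [0.0] * self.n_actions)

    def update(self, state, action, reward, next_state, alpha):
        # The bootstrap term is read from the frozen table, not the online one.
        bootstrap = max(self._row(self.target, next_state))
        row = self._row(self.online, state)
        row[action] += alpha * (reward + self.gamma * bootstrap - row[action])
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target = copy.deepcopy(self.online)  # refresh targets only periodically
        return row[action]
```

Between two synchronisations the targets do not move, so each stretch of updates resembles a supervised regression problem with fixed labels.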
A schematic representation of the employed Q-learning algorithm is presented in Algorithm 2. Finally, it is common that other models perform better than DQN, especially in situations where the trained model is deployed in a production setting but the training process is not of interest by itself. For instance, the famous DQN Atari [22] can be improved with a more straightforward Monte Carlo tree search [10]. The performance of locomotive robots [12] can be compared with online trajectory optimization [34]. While training of the DQN agent required approximately 6500 CPU hours, online trajectory optimization runs on an ordinary notebook.
4. Proof of concept and results
We train an RCMAB agent to hedge a euro vanilla option contract in the P&L formulation. We opt for the simplest configuration, with Geometric Brownian Motion (no stochastic volatility) as the underlying stochastic model and the vanilla BSM model for pricing. Figure 4.1 depicts typical historic rewards during the training process. The following parameters were chosen:

Option contract: Short Euro vanilla call with strike and maturity .

Market: Underlying driven by Geometric Brownian Motion with , . Flat IR market.

Training: training episodes of samples each.

RCMAB algorithm: Neural linear posterior sampling with risk-adjusted reward. Neural Network layers of fully connected nodes. ReLU activation. Bayesian posterior Thompson sampling for exploration-exploitation trade-off. Risk-neutral investor chosen for benchmarking.
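The neural linear posterior sampling listed above combines a learned feature map with a Bayesian linear regression per action and Thompson Sampling over the regression weights. The sketch below shows only the Bayesian part and treats the feature vector as given, standing in for the last-layer representation of a trained network. A zero-mean Gaussian prior, a known noise variance and all class and function names are assumptions made for this example.

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_lower(low, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    return y


def solve_upper_t(low, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


class BayesianLinearArm:
    """Gaussian posterior over the weights w of the model reward = w . features + noise."""

    def __init__(self, dim, prior_precision=1.0, noise_var=1.0):
        self.precision = [[prior_precision if i == j else 0.0 for j in range(dim)]
                          for i in range(dim)]
        self.moment = [0.0] * dim  # accumulated features * reward / noise_var
        self.noise_var = noise_var

    def update(self, features, reward):
        for i, fi in enumerate(features):
            self.moment[i] += fi * reward / self.noise_var
            for j, fj in enumerate(features):
                self.precision[i][j] += fi * fj / self.noise_var

    def posterior_mean(self):
        low = cholesky(self.precision)
        return solve_upper_t(low, solve_lower(low, self.moment))

    def sample_weights(self, rng):
        low = cholesky(self.precision)
        mean = solve_upper_t(low, solve_lower(low, self.moment))
        z = [rng.gauss(0.0, 1.0) for _ in mean]
        noise = solve_upper_t(low, z)  # L^{-T} z has covariance equal to precision^{-1}
        return [m + e for m, e in zip(mean, noise)]


def thompson_select(arms, features, rng):
    """Sample one weight vector per arm and play the arm with the best sampled reward."""
    scores = []
    for arm in arms:
        weights = arm.sample_weights(rng)
        scores.append(sum(w * f for w, f in zip(weights, features)))
    return max(range(len(arms)), key=lambda a: scores[a])
```

Arms with few observations have a wide posterior and are therefore sampled optimistically from time to time, which is how exploration arises without a separate exploration schedule.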
For testing, the trained RCMAB agent runs over a sample of Geometric Brownian Motion paths not contained in the original training set and is compared to an oracle agent that always takes the best decision. Notice that optimal actions deviate from BSM hedges due to time and action space discretization. The result is presented in Figure 4.2, which depicts a typical hedging behavior of the trained RCMAB agent. As a benchmark, Figure 4.3 shows a histogram of the terminal P&L statistics of an RCMAB hedger on out-of-sample simulations. Notably, in the absence of transaction costs and risk adjustment (in the limit of small time steps) the hedges obtained by the RCMAB algorithm naturally converge to BSM deltas. This feature is desirable for benchmarking and not observable in the DQN framework (see below).
4.1. Q-learning
We train a DQN agent to hedge a euro vanilla option contract in the P&L formulation. The methodology is as described above and mirrors previous works, e.g. in [17]. A typical training process is presented in Figure 4.4. The setting of the experiment is the same as in the previous section but with:

Training: training episodes of samples each.

DQN algorithm: Neural linear model with risk-adjusted reward. Neural Network layers of fully connected nodes. ReLU activation. Risk-neutral investor chosen for benchmarking. We noticed that using fewer than 4 layers worsened the out-of-sample performance of the algorithm, while using more layers led to slower training without significant improvement of performance.
For testing, the trained DQN agent runs over a sample of Geometric Brownian Motion paths not contained in the original training set and, as for the CMAB, is compared to an oracle agent that always takes the best possible decision. The result is presented in Figure 4.5, which depicts a typical hedging behavior of the trained DQN agent. Figure 4.6 shows a histogram of the terminal P&L statistics of a DQN hedger on out-of-sample simulations.
In the absence of transaction costs and risk adjustment a fundamental discrepancy between the BSM and DQN approaches to hedging becomes apparent. While the optimal value of an option contract in the BSM model follows the BSM differential equation, the optimal value function in DQN solves a Bellman equation. This discrepancy is reflected in the optimal hedge: in the classical BSM economy the optimal hedge is provided by the delta-hedge portfolio, where the delta only depends on the current state. In DQN the optimal action takes account of future rewards and anticipates the expected reward structure, a feature that fits naturally with the exponential utility, see Section 2.1 above, but not the BSM model.
As a consequence the role of the discount parameter $\gamma$ in DQN (Section 2.3) is key: apart from being a discount factor, it also regulates, in terms of the Bellman equation, the dependency between the immediate reward resulting from a given action and future rewards. Experimental results are in line with the expectation that hedging performance becomes weaker (when benchmarked against BSM) when the DQN agent attempts to anticipate the future reward structure, see Figure 4.7. But what happens in the presence of propagating transaction costs? If costs are small, then $\gamma$ should probably be comparatively small. To quantify the correct value for $\gamma$ is, of course, a challenging task; $\gamma$ can be viewed as a delicate hyperparameter that should be tuned to fit well with the real markets.
5. Conclusions
We have confirmed, following the preceding publications [14, 17, 11, 7, 5, 6], that RL-based agents can be trained to hedge an option contract. We have introduced the RCMAB algorithm and demonstrated that for the P&L formulation of hedging, RCMAB outperforms DQN in terms of sample efficiency and hedging error (when compared to the standard BSM model).
Each of the preceding articles [14, 17, 11, 7, 5, 6, 25, 24] relies on one (or multiple) different measure(s) of risk. This implies that the same hedging path will be valued differently within the various methodologies and suggests a certain lack of consensus within the research community. The presence of delicate hyperparameters, such as the discount factor in DQN, which regulates the interdependency of rewards, and the risk-aversion parameter, which tunes between different measures of risk, makes the evaluation of hedging performance even more subtle.
We introduce a novel approach to hedging, which is motivated by its relative simplicity and performance guarantees in the literature. Indeed, hedges obtained from RCMAB naturally converge to BSM deltas when no risk adjustment, discretization error and transaction costs are present. This makes RCMAB a useful benchmark for performance comparison and the development of more sophisticated algorithms.
From a practical perspective we note that RL agents trained on simulated data will perform well in market situations that are similar to the training simulations. This reveals an important shortcoming of RL methods when applied to hedging of an option contract in practice: the typical sample inefficiency of RL (as compared to supervised learning) leaves the practitioner with little option but to train on simulated data, which leads to the standard financial engineering problem of choosing the right class of Itô processes and calibration. However, once such processes have been used for training, it can hardly be claimed that the agent operates in a fully model-independent way. The (comparatively) small amount of real-world data that the agent will encounter in operation will not change the overwhelming influence of the training in a statistically significant way. It is not uncommon that RL is used in finance in situations where large data sets are available for training. A typical example that is naturally amenable to RL algorithms is market-making, where trading occurs on a sub-microsecond time scale and produces an abundance of real-world data. The application of RL algorithms in slow businesses such as option hedging remains an area of active research.
References
 [1] S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pages 39.1–39.26, 2012.
 [2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine learning, 47(2-3):235–256, 2002.
 [3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE, 1995.
 [4] D. P. Bertsekas and S. Shreve. Stochastic optimal control: the discrete-time case. 2004.
 [5] H. Buehler, L. Gonon, J. Teichmann, and B. Wood. Deep hedging. Quantitative Finance, 19(8):1271–1291, 2019.
 [6] H. Buehler, L. Gonon, J. Teichmann, B. Wood, B. Mohan, and J. Kochems. Deep hedging: hedging derivatives under generic market frictions using reinforcement learning. Technical report, Swiss Finance Institute, 2019.
 [7] J. Cao, J. Chen, J. C. Hull, and Z. Poulos. Deep hedging of derivatives using reinforcement learning. Available at SSRN 3514586, 2019.
 [8] O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
 [9] M. Collier and H. U. Llorens. Deep contextual multi-armed bandits. arXiv preprint arXiv:1807.09809, 2018.
 [10] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in neural information processing systems, pages 3338–3346, 2014.
 [11] I. Halperin. QLBS: Q-learner in the Black-Scholes(-Merton) worlds. Available at SSRN 3087076, 2017.
 [12] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S.M. Eslami, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
 [13] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [14] S. Hodges and A. Neuberger. Option replication of contingent claims under transactions costs. Technical report, Working paper, Financial Options Research Centre, University of Warwick, 1989.
 [15] J. E. Ingersoll. Theory of financial decision making, volume 3. Rowman & Littlefield, 1987.
 [16] M. N. Katehakis and A. F. Veinott Jr. The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12(2):262–268, 1987.
 [17] P. N. Kolm and G. Ritter. Dynamic replication and hedging: A reinforcement learning approach. The Journal of Financial Data Science, 1(1):159–171, 2019.
 [18] A. Krause and C. S. Ong. Contextual gaussian process bandit optimization. In Advances in neural information processing systems, pages 2447–2455, 2011.
 [19] H. E. Leland. Option pricing and replication with transactions costs. The journal of finance, 40(5):1283–1301, 1985.
 [20] L. Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.
 [21] R. C. Merton. An intertemporal capital asset pricing model. Econometrica: Journal of the Econometric Society, pages 867–887, 1973.
 [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
 [23] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pages 4026–4034, 2016.
 [24] G. Ritter. Machine learning for trading. Available at SSRN 3015609, 2017.
 [25] A. Sani, A. Lazaric, and R. Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
 [26] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, and T. Graepel. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
 [27] E. Sezener, M. Hutter, D. Budden, J. Wang, and J. Veness. Online learning in contextual bandits using gated linear networks. arXiv preprint arXiv:2002.11611, 2020.
 [28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
 [29] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, and A. Bolton. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
 [30] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. A. Patwary, Prabhat, and R. Adams. Scalable bayesian optimization using deep neural networks. pages 2171–2180, 2015.
 [31] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
 [32] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
 [33] O. Szehr and L. Cannelli. Portfolio choice under market friction: A contextual mean-variance armed bandit model. Technical Report, 2019.
 [34] Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906–4913. IEEE, 2012.
 [35] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 [36] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems, pages 1075–1081, 1997.
 [37] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI conference on artificial intelligence, 2016.
 [38] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.