
Hedging using reinforcement learning: Contextual k-Armed Bandit versus Q-learning

by   Loris Cannelli, et al.

The construction of replication strategies for contingent claims in the presence of risk and market friction is a key problem of financial engineering. In real markets, continuous replication, such as in the model of Black, Scholes and Merton, is not only unrealistic but also undesirable due to high transaction costs. Over the last decades stochastic optimal-control methods have been developed to balance effective replication against losses. More recently, with the rise of artificial intelligence, temporal-difference Reinforcement Learning, in particular variations of Q-learning in conjunction with Deep Neural Networks, has attracted significant interest. From a practical point of view, however, such methods are often relatively sample inefficient, hard to train and lack performance guarantees. This motivates the investigation of a stable benchmark algorithm for hedging. In this article, the hedging problem is viewed as an instance of a risk-averse contextual k-armed bandit problem, for which a large body of theoretical results and well-studied algorithms is available. We find that the k-armed bandit model naturally fits the P&L formulation of hedging, providing a more accurate and sample-efficient approach than Q-learning and reducing to the Black-Scholes model in the absence of transaction costs and risks.



1. Introduction and Motivation

The construction of replication strategies for contingent claims in the presence of risk and market friction is a key problem of financial engineering. Following the seminal article of Merton [21] on dynamic replication for option pricing, a vast body of literature emerged around the optimal control perspective on pricing and hedging contingent claims. The main idea is that under standard Black-Scholes (BSM) assumptions (in particular, in a complete and frictionless market) there is a continuous trading strategy in the stock and risk-free security that perfectly replicates the price of a European option. In reality, continuous time trading is of course impossible. It cannot even serve as a reasonable approximation due to the high resulting transaction costs. Instead, the replicating portfolio is adjusted at discrete times optimizing between replication error and trading costs. The acceptable deviation from the ideal hedge depends on the risk tolerance of the investor. As a consequence the best hedging strategy constitutes a trade-off between replication error and risk on the one hand and trading costs on the other.

Leland [19] describes a method for hedging European options when, in addition to the ordinary BSM assumptions, there are proportional transaction costs and time-discretization errors. Hodges and Neuberger [14] recognized that (for a special measure of risk) the hedging problem lends itself naturally to a formulation in terms of a Bellman equation, which relates the hedging problem to the area of Reinforcement Learning (RL). In recent years, RL gained wide attention when it was recognized that trained RL agents outperform humans in playing Atari and board games such as Go and Chess [28, 29, 22]. Several publications have also reported promising hedging performance of trained RL agents, see e.g. [14, 17, 11, 7, 5, 6].

In terms of the choice of training algorithm, the above-mentioned articles focus on a specific variation of Q-learning [38] in conjunction with deep neural networks called Deep Q-Learning (DQN) [23]. The article [11] proposes DQN as a tool for training hedging agents with a quadratic utility functional but with no transaction costs. The articles [5, 6] are concerned with DQN-based hedging under coherent risk measures, which have been identified as the right class of measures of risk in the financial mathematics literature. Recent work [7] applies double Q-learning [37] to take account of returns and risks in the hedging problem. The article [17] studies a DQN approach for hedging in discrete time in the presence of transaction costs, focusing on mean-variance equivalent loss distributions, and inspired our article. The article at hand deviates from the preceding publications in that it studies a bandit-type algorithm for the application in hedging. The motivation is that, as compared to DQN, bandit algorithms are better studied in the RL area, have stronger performance guarantees, are easier to train, and their results can be interpreted more immediately. Yet they are sufficient to address the accounting Profit and Loss (P&L) formulation of the hedging problem as in [17]. It is important to mention that there are two formulations of the hedging exercise discussed in the literature, see e.g. [7, 4], which differ in the way rewards are computed for the RL agent: in the P&L formulation the agent is aware of the price of the derivative contract at each time (e.g. from the BSM formula), as opposed to the Cash Flow formulation, where a replication portfolio is set up to fit the price process of the option contract; in this alternative formulation rewards are computed from changes of the value of the replication portfolio.

We focus on the P&L formulation of the hedging problem as a Risk-averse Contextual k-Armed Bandit (R-CMAB) model. The k-armed bandit is the prototypical model of reinforcement learning and has been studied extensively [3] within diverse areas such as online advertising selection, clinical trials, finance, etc., and configurations, such as neural net based policies, Gaussian policies, etc. As compared to the full-fledged RL setting and, in particular, to DQN, the k-armed bandit takes limited account of the impact of an action on the environment. It is characterized by the assumption that current actions only influence the immediate rewards but not subsequent rewards. In terms of hedging, this implies that the agent’s choices do not dynamically impact market prices (where a static functional model is still an option). The perspective taken is that:

  1. standard microscopic market models in terms of Itô processes do not depend on trading decisions of market participants;

  2. strong market impact of trading decisions signifies a singular situation. Any system trained on a simulated environment acquires a behavior typical of the simulation and will not be able to make informed decisions in a singular state.

In other words, an RL agent trained on simulated data from Itô processes will not usually acquire the knowledge to appropriately capture market impact in a practical setting. The latter would require training on samples with market impact, which in practice will often be hard to provide in sufficient amount. In the case of a simulated market impact, such samples will simply present the modeled cost of the market impact.

2. Background and Methodology

2.1. Hedging

Under the standard theory of risk-bearing [15] the rational investor (agent) chooses actions to maximize the expected utility of wealth $w_T$ over the investment horizon $T$,

$$\max \; \mathbb{E}\left[ u(w_T) \right], \tag{2.1}$$

where the utility function $u$ is smooth, increasing and concave. The terminal wealth $w_T$ is the result of an initial endowment $w_0$ and a sequence of investment decisions at times $t = 0, \dots, T-1$. We focus on hedging a short European vanilla call option contract, but the discussion extends similarly to more general contracts. Let $S_t$ denote the price of the underlying at time $t$, and $K$ the strike price of the option contract. The terminal payoff can be written as

$$C_T = (S_T - K)^+ .$$

The issuer of the option contract sets up a replicating portfolio to hedge the risk of the option

$$\Pi_t = n_t S_t + B_t ,$$

where $n_t$ is the number of shares held by the agent and $B_t$ is the issuer's bank account. For simpler notation we assume a flat IR market, i.e. the discount factor is $1$. The replication portfolio should match as closely as possible the value of the option contract over the investment horizon. The composition of the replication portfolio is computed by back-propagation beginning with

and setting

where the conditioning is on the natural filtration generated by the underlying stochastic processes. The self-financing constraint imposes that the cost of shares bought at any time has to be billed in equal amount to the bank account:

As a consequence it holds that


This expresses the task of pricing and hedging as a recursive optimization problem, where remains to be determined. The open question is which utility function should be used. A natural choice is to measure risk as negative variance, which corresponds to a quadratic utility function in (2.1). In the P&L formulation of hedging the agent is provided with the price such that reward could be simply defined as


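This is motivated by the famous BSM delta hedge: in the limit of vanishing time steps, where the action space becomes continuous, the optimal action is just the BSM delta, i.e. the derivative of the option price with respect to the price of the underlying.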

An important point about quadratic utility is that it is well known not to satisfy a Bellman optimality equation. In fact, using an exponential utility function in (2.1) leads to option contract values that follow a Bellman equation. As already recognized in [14] (in 1989), this is the setting of RL. So-called coherent risk measures have been identified as the right measures of risk in the mathematical finance literature. The recent articles of Buehler et al. [5, 6] investigate the respective hedging problems with RL methods.
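As an illustration of the replication argument in this section, the following minimal Python sketch simulates a short European call that is hedged with the BSM delta at a finite number of dates, with zero interest rate and no transaction costs. The function names (`bsm_call`, `hedge_one_path`) and all parameter values (spot 100, strike 100, volatility 0.2, maturity 1, zero drift) are assumptions chosen for the example and are not taken from the article.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bsm_call(spot, strike, vol, tau):
    """BSM price and delta of a European call, assuming zero interest rate."""
    if tau <= 0.0:
        return max(spot - strike, 0.0), (1.0 if spot > strike else 0.0)
    d1 = (math.log(spot / strike) + 0.5 * vol**2 * tau) / (vol * math.sqrt(tau))
    d2 = d1 - vol * math.sqrt(tau)
    return spot * norm_cdf(d1) - strike * norm_cdf(d2), norm_cdf(d1)


def hedge_one_path(n_steps, spot=100.0, strike=100.0, vol=0.2, maturity=1.0, seed=0):
    """Terminal P&L of a short call delta-hedged at n_steps equidistant dates."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    price, shares = bsm_call(spot, strike, vol, maturity)
    bank = price - shares * spot  # premium received minus cost of the initial hedge
    for step in range(1, n_steps + 1):
        shock = rng.gauss(0.0, 1.0)
        # One step of Geometric Brownian Motion with zero drift (illustrative choice).
        spot *= math.exp(-0.5 * vol**2 * dt + vol * math.sqrt(dt) * shock)
        if step < n_steps:
            _, new_shares = bsm_call(spot, strike, vol, maturity - step * dt)
            bank -= (new_shares - shares) * spot  # self-financing rebalancing
            shares = new_shares
    # Hedge portfolio value minus the payoff owed to the option holder.
    return shares * spot + bank - max(spot - strike, 0.0)
```

Averaging the absolute terminal P&L over many seeds for, say, 5 and 100 rebalancing dates shows the replication error shrinking as the time step decreases, which is the limit in which the delta hedge becomes optimal.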

2.2. Contextual Multi-armed Bandit Model

The k-armed bandit is a prototypical instance of RL and has been studied extensively [16, 2, 32] within diverse areas such as ad selection, clinical trials, finance, etc. The setup involves a set of $k$ possible choices (the name originates from a gambler who chooses from $k$ slot machines) and a sequence of $T$ periods. In each period $t$ the learner makes her choice and receives a corresponding random reward $R_t$. The objective is to develop a policy that specifies the action to take at each period to maximize the cumulative reward

$$\sum_{t=1}^{T} R_t$$

over the execution episode. In the contextual k-armed bandit setting, the agent is faced with non-stationary reward distributions. In each round the agent receives context about the state of the k-armed bandit. Thus the task involves both trial-and-error learning to search for the best actions and the association of the action with the given context.

Contextual search tasks are intermediate between the k-armed bandit problem and the full RL problem [32, 18, 27]. They are like the full RL problem in that they involve learning a policy, but like the ordinary k-armed bandit problem in that each action affects only the immediate reward. Like any RL agent, the contextual k-armed bandit must trade off between exploration and exploitation: is it better to choose a lucrative action given a context, or should the agent explore in the hope of finding something even better? A priori, exploration implies risk in the sense that the agent must deviate from an optimal action. To take account of the hedging risk, in this article we employ a risk-averse version of CMAB.

While in the standard bandit problem the objective is to select the arm leading to the highest reward in expectation, the mean-variance bandit problem focuses on finding the arm that most effectively trades off its expected reward against its variance (i.e. the risk). In choosing mean-variance as our measure of risk we follow [24]. In this article the mean-variance of an action combines the mean and the standard deviation of the rewards associated with that action, with a parameter that measures the risk aversion. The optimal action maximizes mean-variance. For a finite sample of outcomes of an action, the empirical mean-variance at a given time is obtained by replacing the mean and the standard deviation with their empirical counterparts.

The mean-variance multi-armed bandit problem has been introduced in [25], where also the mean-variance lower confidence bound algorithm (MV-LCB) is proposed. MV-LCB generalizes the classical upper confidence bound algorithm of [2] to the mean-variance setup. [33] proposes a risk-averse contextual bandit model for portfolio selection and contains a more detailed description of the model.
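The risk-averse selection rule can be sketched in a few lines of Python. The functional form used here, empirical mean minus a risk-aversion coefficient times empirical variance, is an assumption made for the example (the article's exact expression may differ), and the names `empirical_mean_variance`, `best_arm` and the toy reward samples are hypothetical.

```python
import statistics


def empirical_mean_variance(rewards, risk_aversion):
    """Empirical mean minus risk_aversion times empirical variance (assumed form)."""
    mean = statistics.fmean(rewards)
    variance = statistics.pvariance(rewards, mu=mean)
    return mean - risk_aversion * variance


def best_arm(samples, risk_aversion):
    """Index of the arm with the largest empirical mean-variance."""
    scores = [empirical_mean_variance(r, risk_aversion) for r in samples]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy reward samples, purely illustrative.
samples = [
    [1.0, 1.0, 1.0, 1.0],    # arm 0: modest reward, no risk
    [4.0, -1.0, 5.0, -2.0],  # arm 1: higher mean, high variance
]
risk_neutral_choice = best_arm(samples, risk_aversion=0.0)
risk_averse_choice = best_arm(samples, risk_aversion=1.0)
```

With zero risk aversion the rule reduces to picking the arm with the highest empirical mean, while a positive coefficient shifts the choice towards the less volatile arm.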

2.3. Q-learning

In Q-learning the agent stores a Q-table that contains estimates of the quality ("Q") of actions, given the current state of the agent. At each time step $t$, the agent i) performs an action $a_t$ following a policy based on the Q-table, ii) observes the corresponding reward $R_{t+1}$, and iii) updates the values in the Q-table based on observed rewards.

Let $V(s_t)$ be the estimate of the quality value of state $s_t$ at time $t$. Following a Monte-Carlo approach for prediction, this estimate can be updated in an iterative way according to

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ G_t - V(s_t) \right], \tag{2.4}$$

where $\alpha$ measures the size of Monte-Carlo steps, and $G_t$ is the cumulative reward observed from time $t$ up to the termination of the training episode. The drawback of (2.4) is that the estimate can only be updated at the end of a training episode. The so-called Temporal Difference (TD) method [31] approximates $G_t$ by $R_{t+1} + \gamma V(s_{t+1})$ and leads to the following updating rule:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right], \tag{2.5}$$

where $\gamma$ is a discount factor that weights the importance of current versus future rewards. The main advantage of (2.5) over (2.4) is that it can be evaluated at each time step. In Q-learning a table of action-values $Q(s_t, a_t)$ is stored at each time step $t$; in line with (2.5), the table is updated according to

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]. \tag{2.6}$$

The iterative application of this procedure converges to the optimal policy if the learning rate is not too large [32]. In contrast to the CMAB algorithm, Q-learning addresses the full-fledged RL problem, in particular reflecting the impact of trading decisions on the environment (market).
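A minimal tabular sketch of the Q-learning update may help to fix ideas. The toy environment (a short chain of states with a reward of 1 for reaching the last state), the function name `q_learning` and all hyperparameter values are assumptions made for this illustration and have nothing to do with the hedging experiments of the article.

```python
import random


def q_learning(n_states=4, n_episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy chain; actions are 0 = left, 1 = right."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1  # terminal state, its Q-values stay at zero
    for _ in range(n_episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy policy based on the current Q-table.
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state = max(state - 1, 0) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Temporal-difference target: immediate reward plus discounted
            # value of the best action in the next state.
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
```

In this toy chain the learned values of moving right approach 1, `gamma` and `gamma**2` as one moves away from the goal, which makes the role of the discount factor directly visible in the table.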

3. Algorithm design

The choice of RL algorithm depends on the type of interaction between agent and environment to be modeled. Bandit models are characterized by limited interaction: the environment is not influenced by agent decisions. This fits naturally into the P&L formulation of the hedging problem. In this article we compare the R-CMAB algorithm to DQN, which has been the focus of the majority of previous works. The impressive performance of modern RL systems relies in equal parts on classical RL theory and Deep Neural Networks (NN). More specifically, in modern RL applications (see e.g. [27, 28, 29, 20, 13, 26]) NNs serve as flexible approximators that estimate expected rewards from complex states. Two NN-based architectures have been implemented for comparison.

3.1. Deep R-CMAB algorithm

In order to maximize cumulative reward, NNs need to trade off what is expected to be best at the moment (i.e., exploitation) against potentially sub-optimal exploratory actions, which are needed to learn from the data. In addition, exploratory actions should be coordinated along the entire decision-making process, rather than be performed independently at every step. This is where Thompson Sampling [35] enters into the RL framework. Thompson Sampling dynamically deals with the exploration-exploitation dilemma by maintaining at each time step a Bayesian estimate of the posterior distribution over models, and sampling actions in proportion to the probability that they are optimal [8, 1]. The correlations present in the sequence of observations (like the similarity of the images received from an Atari game at consecutive times) might cause instability of the training process as the neural network tends to overfit to this correlation. A common method to remedy this instability is to use a replay memory buffer:

  • Experience Replay [20]: In our architecture the transition is stored at each time step into a memory buffer of given capacity. For learning, a batch of given size is taken from the replay memory, which removes correlation from the observation sequence.

A typical shortcoming of linear neural networks is their lack of representational power. In order to alleviate this issue, we perform a Bayesian linear regression on top of the representation of the last layer of a neural network [30]. In other words, instead of directly regressing on input data, a neural network is trained and then a Bayesian linear regression is used on top to make decisions via Thompson Sampling. This implies that the weights of the output layer of the neural network are not used directly for the choice of the action. The Deep R-CMAB Algorithm with Thompson Sampling is summarized in Algorithm 1. Notice also that the R-CMAB algorithm has performance and convergence guarantees, see [32, 9, 18, 27] for details.

Input: Prior distribution over the model;
while a termination criterion is not satisfied do
    Get a context vector;
    Sample a model through Thompson Sampling;
    Compute the action through (Deep) NN Multi-Armed Bandits;
    Perform the action and observe the reward;
    Update the Bayesian estimate of the posterior distribution using the new observation;
end while
Algorithm 1 Deep CMAB Algorithm
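The loop of Algorithm 1 can be sketched in plain Python under strong simplifying assumptions: the neural-network representation is replaced by the raw scalar context, each arm carries a conjugate Gaussian posterior over a single regression weight, and the environment is a hypothetical toy with two arms. The class `LinearGaussianArm`, the function `run_thompson` and all numerical values are illustrative and are not the implementation used in the article.

```python
import math
import random


class LinearGaussianArm:
    """Bayesian linear regression reward = w * x + noise with a scalar weight w."""

    def __init__(self, prior_precision=1.0, noise_var=0.01):
        self.precision = prior_precision  # posterior precision of w
        self.mean = 0.0                   # posterior mean of w
        self.noise_var = noise_var        # assumed known observation noise

    def sample_weight(self, rng):
        """Draw one weight from the current Gaussian posterior."""
        return rng.gauss(self.mean, 1.0 / math.sqrt(self.precision))

    def update(self, x, reward):
        """Conjugate Gaussian posterior update after observing (x, reward)."""
        new_precision = self.precision + x * x / self.noise_var
        self.mean = (self.precision * self.mean + x * reward / self.noise_var) / new_precision
        self.precision = new_precision


def run_thompson(n_rounds=500, seed=0):
    rng = random.Random(seed)
    true_weights = [1.0, -1.0]  # toy environment, purely illustrative
    arms = [LinearGaussianArm() for _ in true_weights]
    for _ in range(n_rounds):
        x = rng.choice([-1.0, 1.0])                         # get a context
        sampled = [arm.sample_weight(rng) for arm in arms]  # sample a model
        # Act greedily with respect to the sampled model.
        action = max(range(len(arms)), key=lambda a: sampled[a] * x)
        reward = true_weights[action] * x + rng.gauss(0.0, 0.1)
        arms[action].update(x, reward)                      # posterior update
    return arms


arms = run_thompson()
```

Exploration is driven entirely by posterior uncertainty here: an arm with a wide posterior is occasionally sampled as the best one, and the randomness fades as its posterior concentrates.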

3.2. Deep Q-learning

The DQN learning algorithm has been originally designed for learning from alternating visual data. The application of DQN to hedge a contingent claim constitutes an interesting off-label application of DQN algorithms. Existing publications and proof-of-concept implementations demonstrate that this application is indeed possible [17, 7, 5, 6], although simpler algorithms might yield better performance. Furthermore, the mentioned articles provide only little quantitative comparison of their results to established market practice, such as the BSM model.

In recent years various DQN-type algorithms have appeared. The architecture tested in the article at hand constitutes a trade-off between the stability of established algorithms and the performance of high-end algorithms. Our design is oriented towards the Atari game design [22]. More recent Rainbow algorithms outperform the chosen algorithm in terms of sample efficiency [13]. From a practical perspective it is important not to underestimate the level of know-how and the maintenance effort needed for successful training and application of DQN algorithms. Q-learning is hard to train and intrinsically unstable [36], especially when used in conjunction with non-linear function approximators, such as neural networks, to approximate the action-value function. To remedy these instabilities two standard methods are employed.

  • Experience Replay: as described in Section 3.1;

  • Asynchronous updates: target values are updated with a larger period than Q-values, thereby reducing the impact of random fluctuations on the target. This approach is similar to supervised learning, where fixed targets are provided before learning starts.

A schematic representation of the employed Q-learning algorithm is presented in Algorithm 2. Finally, it is common that other models perform better than DQN, especially in situations where the trained model is deployed in a production setting but the training process is not of interest by itself. For instance, the famous DQN Atari [22] can be improved with a more straightforward Monte-Carlo Tree search [10]. The performance of locomotive robots [12] can be compared with online trajectory optimization [34]. While training of the DQN agent required approximately 6500 CPU hours, online trajectory optimization runs on an ordinary notebook.

Input: Initialize the action-value matrix Q with random weights; initialize the replay memory to a given capacity;
for each episode do
    Initialize the state;
    for each time step do
        With probability ε select a random action, otherwise select the action according to a Deep NN approach;
        Perform the action and observe the reward as in (2.3);
        Get the new state, and update the NN parameters;
        Store the transition in the replay memory;
        Sample random mini-batches from the replay memory;
        Update Q according to (2.6);
    end for
end for
Algorithm 2 Deep Q-learning Algorithm
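The two stabilization devices and the exploration rule of Algorithm 2 can be sketched independently of any neural-network library. The following Python fragment is a minimal illustration: the names `ReplayBuffer`, `epsilon_greedy` and `sync_target`, as well as the representation of network parameters as a plain list, are assumptions made for the example.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity memory of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size, rng):
        # Uniform sampling without replacement breaks the temporal ordering
        # of consecutive observations.
        return rng.sample(list(self.memory), min(batch_size, len(self.memory)))


def epsilon_greedy(q_values, epsilon, rng):
    """Random action with probability epsilon, otherwise the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


def sync_target(step, period, online_params, target_params):
    """Copy the online parameters into the target only every `period` steps."""
    if step % period == 0:
        target_params[:] = online_params
    return target_params


rng = random.Random(0)
buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.store(state=t, action=0, reward=0.0, next_state=t + 1)
batch = buffer.sample(2, rng)
```

Keeping the target parameters frozen between synchronizations is what makes the regression target in the Q-update quasi-static, in analogy to the fixed targets of supervised learning.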

4. Proof of concept and results

We train an R-CMAB agent to hedge a European vanilla option contract in the P&L formulation. We opt for the simplest configuration, with Geometric Brownian Motion (no stochastic volatility) as the underlying stochastic model and the vanilla BSM model for pricing. Figure 4.1 depicts typical historic rewards during the training process. The following parameters were chosen:

  • Option contract: Short Euro vanilla call with strike and maturity .

  • Market: Underlying driven by Geometric Brownian Motion with , . Flat IR market.

  • Training: training episodes of samples each.

  • R-CMAB algorithm: Neural linear posterior sampling with risk-adjusted reward. Neural network of fully connected layers. ReLU activation. Bayesian posterior Thompson sampling for the exploration-exploitation trade-off. Risk-neutral investor chosen for benchmarking.


Figure 4.1. Historic rewards of a typical R-CMAB training process over first episodes in P&L formulation. The graph shows the accumulated reward (over 50 samples) of the respective episode.

For testing, the trained R-CMAB agent runs over a sample of Geometric Brownian Motion paths not contained in the original training set and is compared to an oracle agent that always takes the best decision. Notice that optimal actions deviate from BSM delta hedges due to time and action space discretization. The result is presented in Figure 4.2, which depicts a typical hedging behavior of the trained R-CMAB agent. As a benchmark, Figure 4.3 shows a histogram of the terminal P&L statistics of an R-CMAB hedger on out of sample simulations. Notably, in the absence of transaction costs and risk adjustment (in the limit of small time steps) the hedges obtained by the R-CMAB algorithm naturally converge to the BSM delta. This feature is desirable for benchmarking and not observable in the DQN framework (see below).


Figure 4.2. Typical hedging behavior of a trained R-CMAB agent on an out of sample GBM path in the absence of transaction costs and risk. Green line: option values. Blue line: P&L of optimal hedger. Orange line: P&L of CMAB agent. Purple line: hedge portfolio of optimal agent. Red line: hedge portfolio of CMAB agent.


Figure 4.3. Histogram of terminal P&L of option and hedging portfolio at maturity as obtained on out of sample simulations. For benchmarking transaction costs and risk adjustment are switched off. The -axis is scaled up by a factor of . Orange: trained R-CMAB agent. Blue: oracle agent.

4.1. Q-learning

We train a DQN agent to hedge a European vanilla option contract in the P&L formulation. The methodology is as described above and mirrors previous works, e.g. [17]. A typical training process is presented in Figure 4.4. The setting of the experiment is the same as in the previous section but with:

  • Training: training episodes of samples each.

  • DQN algorithm: Neural linear model with risk-adjusted reward. Neural network of fully connected layers. ReLU activation. Risk-neutral investor chosen for benchmarking. We noticed that using fewer than 4 layers worsened the out of sample performance of the algorithm, while using more layers led to slower training without significant improvement of performance.


Figure 4.4. Historic rewards of a typical DQN training process over first episodes in P&L formulation. The graph shows the accumulated reward (over 50 samples) of the respective episode.

For testing, the trained DQN agent runs over a sample of Geometric Brownian Motion paths not contained in the original training set, and, as for the CMAB, is compared to an oracle agent that always takes the best possible decision. The result is presented in Figure 4.5, which depicts a typical hedging behavior of the trained DQN agent. Figure 4.6 shows a histogram of the terminal P&L statistics of a DQN hedger on out of sample simulations.

In the absence of transaction costs and risk adjustment a fundamental discrepancy between the BSM and DQN approaches to hedging becomes apparent. While the optimal value of an option contract in the BSM model follows the BSM differential equation, the optimal value function in DQN solves a Bellman equation. This discrepancy is reflected in the optimal hedge: in the classical BSM economy the optimal hedge is provided by the delta-hedge portfolio, where the delta only depends on the current state. In DQN the optimal action takes account of future rewards and anticipates the expected reward structure, a feature that fits naturally with the exponential utility, see Section 2.1 above, but not the BSM model.

As a consequence the role of the parameter $\gamma$ in DQN (Section 2.3) is key: apart from being a discount factor, it also regulates, in terms of the Bellman equation, the dependency between the immediate reward resulting from a given action and future rewards. Experimental results are in line with the expectation that hedging performance becomes weaker (when benchmarked against BSM) when the DQN agent attempts to anticipate the future reward structure, see Figure 4.7. But what happens in the presence of propagating transaction costs? If costs are small, then $\gamma$ should probably be comparatively small. Quantifying the correct value of $\gamma$ is, of course, a challenging task; $\gamma$ can be viewed as a delicate hyperparameter that should be tuned to fit well with real markets.
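To make the role of the discount factor concrete, the following worked equation writes out the standard Q-learning update and its limit for a vanishing discount factor. The notation ($s_t$ for the state, $a_t$ for the action, $R_{t+1}$ for the reward, $\alpha$ for the learning rate and $\gamma$ for the discount factor) follows common textbook usage and is an assumption here.

```latex
\begin{align*}
Q(s_t, a_t) &\leftarrow Q(s_t, a_t)
  + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right], \\
\gamma = 0: \quad
Q(s_t, a_t) &\leftarrow Q(s_t, a_t)
  + \alpha \left[ R_{t+1} - Q(s_t, a_t) \right].
\end{align*}
```

For $\gamma = 0$ the update no longer bootstraps from the next state, so $Q(s, a)$ becomes a running average of the immediate rewards observed for action $a$ in state $s$. This is the quantity a contextual bandit estimates, which is why small values of the discount factor move the DQN agent closer to the bandit setting.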


Figure 4.5. Typical hedging behavior of a trained DQN agent on an out of sample GBM path in the absence of transaction costs and risk. Green line: option values. Blue line: P&L of an optimal hedger. Orange line: P&L of DQN agent. Purple line: hedge portfolio of optimal agent. Red line: hedge portfolio of DQN agent.


Figure 4.6. Histogram of terminal P&L of option and hedging portfolio at maturity as obtained on out of sample simulations. For benchmarking transaction costs and risk adjustment are switched off. The -axis is scaled up by a factor of . Orange: trained DQN agent. Blue: oracle agent.


(a) Large $\gamma$


(b) Small $\gamma$
Figure 4.7. Terminal P&L of out of sample hedging paths for two levels of $\gamma$. Red line: (Half of) terminal P&L corresponding to the option within the BSM model. Orange line: (Half of) P&L corresponding to the underlying, DQN agent. Green line: (Half of) P&L corresponding to the underlying, optimal agent.

5. Conclusions

We have confirmed, following the preceding publications [14, 17, 11, 7, 5, 6], that RL-based agents can be trained to hedge an option contract. We have introduced the R-CMAB algorithm and demonstrated that for the P&L formulation of hedging, R-CMAB outperforms DQN in terms of sample efficiency and hedging error (when compared to the standard BSM model).

Each of the preceding articles [14, 17, 11, 7, 5, 6, 25, 24] relies on one (or multiple) different measure(s) of risk. This implies that the same hedging path will be valued differently within the various methodologies and suggests a certain lack of consensus within the research community. The presence of delicate hyperparameters, such as the discount factor in DQN, which regulates the inter-dependency of rewards, and the risk-aversion parameter, which tunes between different measures of risk, makes the evaluation of hedging performance even more subtle.

We introduce a novel approach to hedging, which is motivated by its relative simplicity and performance guarantees in the literature. Indeed, hedges obtained from R-CMAB naturally converge to BSM deltas when no risk adjustment, discretization error and transaction costs are present. This makes R-CMAB a useful benchmark for performance comparison and the development of more sophisticated algorithms.

From a practical perspective we note that RL agents trained on simulated data will perform well in market situations that are similar to the training simulations. This reveals an important shortcoming of RL methods when applied to hedging of an option contract in practice: the typical sample inefficiency of RL (as compared to supervised learning) leaves the practitioner with little option but to train on simulated data, which leads to the standard financial engineering problem of choosing the right class of Itô processes and calibrating them. However, once such processes have been used for training, it can hardly be claimed that the agent operates in a fully model-independent way. The (comparatively) small amount of real-world data that the agent will encounter in operation will not change the overwhelming influence of the training in a statistically significant way. It is not uncommon that RL is used in finance in situations where large data sets are available for training. A typical example that lends itself naturally to RL algorithms is market-making, where trading occurs on a sub-microsecond time scale and produces an abundance of real-world data. The application of RL algorithms in slow businesses such as option hedging remains an area of active research.


  • [1] S. Agrawal and N. Goyal. Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pages 39.1–39.26, 2012.
  • [2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • [3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science, pages 322–331. IEEE, 1995.
  • [4] D. P. Bertsekas and S. Shreve. Stochastic optimal control: the discrete-time case. 2004.
  • [5] H. Buehler, L. Gonon, J. Teichmann, and B. Wood. Deep hedging. Quantitative Finance, 19(8):1271–1291, 2019.
  • [6] H. Buehler, L. Gonon, J. Teichmann, B. Wood, B. Mohan, and J. Kochems. Deep hedging: hedging derivatives under generic market frictions using reinforcement learning. Technical report, Swiss Finance Institute, 2019.
  • [7] J. Cao, J. Chen, J. C. Hull, and Z. Poulos. Deep hedging of derivatives using reinforcement learning. Available at SSRN 3514586, 2019.
  • [8] O. Chapelle and L. Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
  • [9] M. Collier and H. U. Llorens. Deep contextual multi-armed bandits. arXiv preprint arXiv:1807.09809, 2018.
  • [10] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. Deep learning for real-time atari game play using offline monte-carlo tree search planning. In Advances in neural information processing systems, pages 3338–3346, 2014.
  • [11] I. Halperin. Qlbs: Q-learner in the black-scholes (-merton) worlds. Available at SSRN 3087076, 2017.
  • [12] N. Heess, D. TB, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, S.M. Eslami, et al. Emergence of locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286, 2017.
  • [13] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [14] S. Hodges and A. Neuberger. Option replication of contingent claims under transactions costs. Technical report, Working paper, Financial Options Research Centre, University of Warwick, 1989.
  • [15] J. E. Ingersoll. Theory of financial decision making, volume 3. Rowman & Littlefield, 1987.
  • [16] M. N. Katehakis and A. F. Veinott Jr. The multi-armed bandit problem: decomposition and computation. Mathematics of Operations Research, 12(2):262–268, 1987.
  • [17] P. N. Kolm and G. Ritter. Dynamic replication and hedging: A reinforcement learning approach. The Journal of Financial Data Science, 1(1):159–171, 2019.
  • [18] A. Krause and C. S. Ong. Contextual gaussian process bandit optimization. In Advances in neural information processing systems, pages 2447–2455, 2011.
  • [19] H. E. Leland. Option pricing and replication with transactions costs. The journal of finance, 40(5):1283–1301, 1985.
  • [20] L. Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.
  • [21] R. C. Merton. An intertemporal capital asset pricing model. Econometrica: Journal of the Econometric Society, pages 867–887, 1973.
  • [22] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • [23] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep exploration via bootstrapped dqn. In Advances in neural information processing systems, pages 4026–4034, 2016.
  • [24] G. Ritter. Machine learning for trading. Available at SSRN 3015609, 2017.
  • [25] A. Sani, A. Lazaric, and R. Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
  • [26] J. Schrittwieser, I. Antonoglou, T. Hubert, K. Simonyan, L. Sifre, S. Schmitt, A. Guez, E. Lockhart, D. Hassabis, and T. Graepel. Mastering atari, go, chess and shogi by planning with a learned model. arXiv preprint arXiv:1911.08265, 2019.
  • [27] E. Sezener, M. Hutter, D. Budden, J. Wang, and J. Veness. Online learning in contextual bandits using gated linear networks. arXiv preprint arXiv:2002.11611, 2020.
  • [28] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, and M. Lanctot. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016.
  • [29] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, and A. Bolton. Mastering the game of go without human knowledge. Nature, 550(7676):354–359, 2017.
  • [30] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. A. Patwary, Prabhat, and R. Adams. Scalable bayesian optimization using deep neural networks. pages 2171–2180, 2015.
  • [31] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  • [32] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [33] O. Szehr and L. Cannelli. Portfolio choice under market friction: A contextual mean-variance k-armed bandit model. Technical Report, 2019.
  • [34] Y. Tassa, T. Erez, and E. Todorov. Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4906–4913. IEEE, 2012.
  • [35] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • [36] J. N. Tsitsiklis and B. Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems, pages 1075–1081, 1997.
  • [37] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. In Thirtieth AAAI conference on artificial intelligence, 2016.
  • [38] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.