Large scale continuous-time mean-variance portfolio allocation via reinforcement learning

07/26/2019 ∙ by Haoran Wang, et al. ∙ 0

We propose to solve large scale Markowitz mean-variance (MV) portfolio allocation problem using reinforcement learning (RL). By adopting the recently developed continuous-time exploratory control framework, we formulate the exploratory MV problem in high dimensions. We further show the optimality of a multivariate Gaussian feedback policy, with time-decaying variance, in trading off exploration and exploitation. Based on a provable policy improvement theorem, we devise a scalable and data-efficient RL algorithm and conduct large scale empirical tests using data from the S&P 500 stocks. We found that our method consistently achieves over 10 econometric methods and the deep RL method by large margins, for both long and medium terms of investment with monthly and daily trading.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Reinforcement learning (RL) has demonstrated to be sucessful in games ([Silver et al., 2016], Silver et al. [2017], Mnih et al. [2015]) and robotics (Levine et al. [2016], Peters et al. [2003]), which also raised significant attention on its applications to quantitative finance. Notable examples include large scale optimal order execution using clssical Q-learning method (Nevmyvaka et al. [2006]), portfolio allocation using direct policy search (Moody and Saffell [2001], Moody et al. [1998]), and option pricing and hedging using deep RL methods (Buehler et al. [2019]), among others.

However, most existing works only focus on RL problems with expected utility of discounted rewards. Such criteria are either unable to fully characterize the uncertainty of the decision making process in financial markets or opaque to typical investors. On the other hand, mean–variance (MV) is one of the most important criteria for portfolio choice. Initiated in the Nobel Prize winning work Markowitz [1952] for portfolio selection in a single period, such a criterion yields an asset allocation strategy that minimizes the variance of the final payoff while targeting some prespecified mean return. The popularity of the MV criterion is not only due to its intuitive and transparent nature in capturing the tradeoff between risk and reward for practitioners, but also due to the theoretically interesting issue of time-inconsistency (or Bellman’s inconsistency) inherent with the underlying stochastic optimization and control problems.

In a recent paper Wang and Zhou [2019], the authors established an RL framework for studying the continuous-time MV portfolio selection, with continuous portfolio (action) and wealth (state) spaces. Their framework adopts a general entropy-regularized, relaxed stochastic control formulation, known as the exploratory formulation, which was originally developed in Wang et al. [2019] to capture explicitly the tradeoff between exploration and exploitation in RL for continuous-time optimization problems. The paper Wang and Zhou [2019] proved the optimality of Gaussian exploration (with time-decaying variance) for the MV problem in one dimension, and proposed a data-driven algorithm, the EMV algorithm, to learn the optimal Gaussian policy of the exploratory MV problem. Their simulation shows that the EMV algorithm outperforms both a classical econometric method and the deep deterministic policy gradient (DDPG) algorithm by large margins when solving the MV problem in the setting with only one risky asset.

It is the contribution of this work to generalize the continuous-time exploratory MV framework in Wang and Zhou [2019] to large scale portfolio selection setting, with the number of risky assets being relatively large and the available training data being relatively limited. We establish the theoretical optimality of the high-dimensional Gaussian policy and design a scalable EMV algorithm to directly output portfolio allocation strategies. By switching to portfolio selection in high dimensions, we can in principle take more advantage of the diversification effect (Markowitz [1959]) to have better performance while, however, potentially encountering the challenges of low sample efficiency and instability faced by most deep RL methods (Duan et al. [2016], Henderson et al. [2018]). Nevertheless, although the EMV algorithm is an on-policy approach, it can achieve better data efficiency than the off-policy method DDPG, thanks to a provable policy improvement theorem and the explicit functional structures of the theoretical optimal Gaussian policy and value function. For instance, in a years monthly trading experiment (see section 5.2) where the available data point for training is the same amount as the decision making times for testing, the EMV algorithm still can outperform various alternative methods considered in this paper. To further empirically test the performance and robustness of the EMV algorithm, we conduct experiments using both monthly and daily price data of the S&P stocks, for long and medium term investment horizons. Annual returns over have been consistently observed across most experiments. The EMV algorithm also demonstrated remarkable universal applicability, as it can be trained and tested respectively on different sets of data from stocks that are randomly selected and still achieves competitive and, actually, more robust performance (see Appendix D).

2 Notations & Background

2.1 Classical continuous-time MV problem

We consider the classical MV problem in continuous time (without RL), where the investment universe consists of one riskless asset (savings account) and risky assets (e.g., stocks). Let an investment planning horizon be fixed. Denote by the discounted wealth (i.e. state) process of an agent who rebalances her portfolio (i.e. action) investing in the risky and riskless assets with a strategy (policy) . Here is the discounted dollar value put in the risky assets at time . Under the geometric Brownian motion assumption for stocks prices and the standard self-financing condition, it follows (see Appendix A) that the wealth process satisfies


with an initial endowment being . Here, , , is a standard

-dimensional Brownian motion defined on a filtered probability space

. The vector

111All vectors in this paper are taken as column vectors. is typically known as the market price of risk, and is the volatility matrix which is assumed to be non-degenerate.

The classical continuous-time MV model then aims to solve the following constrained optimization problem


where satisfies the dynamics (1) under the investment strategy (portfolio) , and is an investment target set at as the desired target payoff at the end of the investment horizon .

Due to the variance in its objective, (2) is known to be time inconsistent. In this paper we focus ourselves to the so-called pre-committed strategies of the MV problem, which are optimal at only. To solve (2), one first transforms it into an unconstrained problem by applying a Lagrange multiplier :222Strictly speaking, is the Lagrange multiplier.


This problem can be solved analytically, whose solution depends on . Then the original constraint determines the value of . We refer a detailed derivation to Zhou and Li [2000].

2.2 Exploratory continuous-time MV problem

The classical MV solution requires the estimation of the market parameters from historical time series of assets prices. However, as well known in practice, it is difficult to estimate the investment opportunity parameters, especially the mean return vector (aka the

mean–blur problem; see, e.g., Luenberger [1998]) with a workable accuracy. Moreover, the classical optimal MV strategies are often extremely sensitive to these parameters, largely due to the procedure of inverting ill-conditioned covariance matrices to obtain optimal allocation weights. In view of these two issues, the Markowitz solution can be greatly irrelevant to the underlying investment objective.

On the other hand, RL techniques do not require, and indeed often skip, any estimation of model parameters. Rather, RL algorithms, driven by historical data, output optimal (or near-optimal) allocations directly. This is made possible by direct interactions with the unknown investment environment, in a learning (exploring) while optimizing (exploiting) fashion. Following Wang et al. [2019], we introduce the “exploratory" version of the state dynamics (1). In this formulation, the control (portfolio) process is randomized to represent exploration in RL, leading to a measure-valued or distributional control process whose density function is given by . The dynamics (1) is changed to


where is a 1-dimensional standard Brownian motion on the filtered probability space . Mathematically, (4) coincides with the relaxed control formulation in classical control theory, and it is adopted here to characterize the effect of exploration on the underlying continuous-time state dynamics change. We refer the readers to Wang et al. [2019] for a detailed discussion on the motivation of (4).

The randomized, distributional control process is to model exploration, whose overall level is in turn captured by its accumulative differential entropy


Further, we introduce a temperature parameter (or exploration weight) reflecting the tradeoff between exploitation and exploration. The entropy-regularized, exploratory MV problem is then to solve, for any fixed :


where is the set of admissible distributional controls on whose precise definition is relegated to Appendix B. Once this problem is solved with a minimizer , the Lagrange multiplier can be determined by the additional constraint .

The optimization objective (6) explicitly encourages exploration, in contrast to the classical problem (3) which concerns exploitation only.

3 Optimality of Gaussian Exploration

To solve the exploratory MV problem (6), we apply the classical Bellman’s principle of optimality for the optimal value function (see Appendix B for the precise definition of ):

for and . Following standard arguments, we deduce that satisfies the Hamilton-Jacobi-Bellman (HJB) equation


with the terminal condition . Here, denotes the set of density functions of probability measures on that are absolutely continuous with respect to the Lebesgue measure and denotes the generic unknown solution to the HJB equation.

Applying the usual verification technique and using the fact that if and only if and , a.e., on , we can solve the (constrained) optimization problem in the HJB equation (7) to obtain a feedback (distributional) control whose density function is given by


where denotes the Gaussian density function with mean vector and covariance matrix . It is assumed in (8) that , which will be verified in what follows.

Substituting the candidate optimal Gaussian feedback control policy (8) back into the HJB equation (7), the latter is transformed to


with , where denotes the matrix determinant. A direct computation yields that this equation has a classical solution


which clearly satisfies , for any . It then follows that the candidate optimal feedback Gaussian policy (8) reduces to


Finally, the optimal wealth process (4) under becomes


It has a unique strong solution for , as can be easily verified. We now summarize the above results in the following theorem.

Theorem 1

The optimal value function of the entropy-regularized exploratory MV problem (6) is given by


for . Moreover, the optimal feedback control is Gaussian, with its density function given by


The associated optimal wealth process under is the unique solution of the stochastic differential equation


Finally, the Lagrange multiplier is given by .

Proof. See Appendix C.1.   

Theorem 1 indicates that the level of exploration, measured by the variance of Gaussian policy , decays in time. The agent initially engages in exploration at the maximum level, and reduces it gradually (although never to zero) as time approaches the end of the investment horizon. Naturally, exploitation dominates exploration as time approaches maturity. Theorem 1 presents such a decaying exploration scheme endogenously which, to our best knowledge, has not been derived in the RL literature.

Moreover, the mean of the Gaussian distribution (

14) is independent of the exploration weight , while its variance is independent of the state . This highlights a perfect separation between exploitation and exploration, as the former is captured by the mean and the latter by the variance of the optimal Gaussian exploration. This property is also consistent with the linear–quadratic case in the infinite horizon studied in Wang et al. [2019].

It is reasonable to expect that the exploratory problem converges to its classical counterpart as the exploration weight decreases to 0. Let be the optimal feedback control for the classical MV problem, and denote by the optimal value function. Let be the Dirac measure centered at . Then the following result holds.

Theorem 2

For each ,


Proof. See Appendix C.2.   

4 RL Algorithm Design

4.1 A policy improvement theorem

We present a policy improvement theorem that is a crucial prerequisite for our interpretable RL algorithm, the EMV algorithm, which solves the exploratory MV problem in high dimensions.

Theorem 3 (Policy Improvement Theorem)

Let be fixed and be an arbitrarily given admissible feedback control policy. Suppose that the corresponding value function and satisfies , for any . Suppose further that the feedback policy defined by


is admissible. Then,


Proof. See Appendix C.3.   

The above theorem suggests that there are always policies in the Gaussian family that improves the value function of any given, not necessarily Gaussian, policy. Moreover, the Gaussian family is closed under the policy improvement scheme. Hence, without loss of generality, we can simply focus on the Gaussian policies when choosing an initial solution. The next result shows convergence of both the value functions and the policies from a specifically parameterized Gaussian policy.

Theorem 4

Let , with , and being a positive definite matrix. Denote by the sequence of feedback policies updated by the policy improvement scheme (16), and the sequence of the corresponding value functions. Then,




for any , where and are the optimal Gaussian policy (14) and the optimal value function (13), respectively.

Proof. See Appendix C.4.   

4.2 The EMV algorithm

We provide the EMV algorithm to directly learn the optimal solution of the continuous-time exploratory MV problem in high dimensions within competitive training time. Theorem 3 provides guidance for policy improvement. For the policy evaluation step, we follow Doya [2000] to minimize the continuous-time Bellman’s error


where is the total derivative and is the discretization step for the learning algorithm. This leads to the cost function to be minimized


using samples collected in the set under the current Gaussian policy . Here, both the value function and the Gaussian policy can be parametrized more explicitly, in view of (13), Theorem 3 and 4. The cost function (21) can then be minimized by stochastic gradient decent. Finally, the EMV algorithm updates the Lagrange multiplier every iterations based on stochastic approximation and the constraint , namely, , where ’s are the most recent terminal wealth values. We refer the readers to Wang and Zhou [2019] for a more detailed description of the EMV algorithm in the one risky asset scenario.

5 Empirical Results

5.1 Data and methods

We test the EMV algorithm on price data of the S&P stocks for both monthly and daily trading. For the former, we train the EMV algorithm on the years monthly data333All data is from Wharton Research Data Services (WRDS). from 08-31-1990 to 08-31-2000, and then test the learned allocation strategy from 09-29-2000 to 09-30-2010. The initial wealth is normalized as and the years target is , corresponding to a annualized target return. In the daily rebalancing scenario, the EMV algorithm is trained on the year daily data from 01-09-2017 to 01-08-2018 and tested on the subsequent year, with a return set as the target for the year investment horizon.

For comparison studies, we also train and test other alternative methods for solving the portfolio allocation problem on the same data. Specifically, we consider the classical econometric methods including Black-Litterman (BL, Black and Litterman [1992]), Fama-French (FF, Fama and French [1996]) and the Markowitz portfolio (Markowitz, Markowitz [1952]). A recently developed distributionally robust MV strategy, the Robust Wasserstein Profile Inference (RWPI, Blanchet et al. [2018]), is also included. To compare EMV with deep RL method, we adjust DDPG similarly as in Wang and Zhou [2019], so that it can solve the classical MV problem (3

). All experiments were performed on a MacBook Air laptop, with DDPG trained using Tensorflow.

5.2 Test I: monthly rebalancing

We first consider . By randomly selecting stocks for each set/seed, we compose different seeds. The split of training and testing data for EMV and DDPG is fixed as described above, but we consider two types of training. The first training method is batch (off-line) RL, where both algorithms are trained for multiple episodes using one seed, following by testing on the subsequent years data of that seed. The performance is then averaged over the seeds. Another method is to use all the seeds and select one seed randomly for each episode during training. Then both algorithms are tested on randomly selected seeds over the test period and the performance is averaged as well. The second method can be seen to artificially generate randomness for training and testing, and an algorithm that performs well using this method has universality and potential to generate to data of stocks in different sectors.

For competitive performance, we adopt a rolling-horizon based training and testing for all the other methods. Specifically, each time after the month ahead investment decision is made on the test set, we add the most recent price data point from the test set into the training set, and discard the most obsolete data point from the training set.

(a) Monthly rebalancing ()
(b) Daily rebalancing ()
Figure 1: Investment performance comparison for (a) years horizon with monthly rebalancing and (b) year horizon with daily rebalancing.

Figure 0(a) shows the performance of various investment strategies, including variants of the EMV algorithm with different gross leverage constraints on portfolios.444Leverage is a fundamental investment tool for most hedge funds; according to Ang et al. [2011], the average gross leverage across the hedge funds studied therein is . Under reasonable leverage constraint, the EMV algorithm still outperforms most other methods (which have no constraints, except DDPG) by a large margin, although it was trained only using the previous years monthly data.

The universal training and testing method was used for EMV and DDPG in Figure 0(a). Results for the batch method can be found in Appendix D. A remarkable fact in both cases is that the original EMV algorithm, devised to solve the exploratory MV problem (6) without constraint, achieves the target with minimal variance for most of the test period. We also report various investment outcomes in Table 1 when scaling up , the number of stocks in the portfolio.

5.3 Test II: daily rebalancing

For daily trading with , we report the performance of the EMV algorithm under different gross leverage constraints in Figure 0(b). The DDPG algorithm was not competitive in the daily trading setting (see Table 1) and, hence, omitted. For different , Table 1 summarizes the investment outcomes and the training time (per experiment). These results were obtained using the universal method for both training and testing.

EMV (L=) hrs hrs hrs 5
() ( ) ()
DDPG (L=) hrs hrs hrs
  (Unannualized) () ( ) ()
Monthly rebalancing
EMV (L=) hrs hrs hrs555Only episodes were trained, while the training episodes for other experiments were .
() () ()
DDPG (L=) hrs hrs hrs
() () ()
Daily rebalancing
Table 1: Annualized return (with Sharpe ratio) and training time corresponding to different for
(a) years horizon with monthly rebalancing and (b) year horizon with daily rebalancing.

6 Related Work

The difficulty of seeking the global optimum for Markov Decision Process (MDP) problems under the MV criterion has been previously noted in

Mannor and Tsitsiklis [2013]. In fact, the variance of reward-to-go is nonlinear in expectation and, as a result of Bellman’s inconsistency, most of the well-known RL algorithms cannot be applied directly.

Existing works on variance estimation and control generally divide into value based methods and policy based methods. Sobel [1982] obtained the Bellman’s equation for the variance of reward-to-go under a fixed, given policy. Sato et al. [2001] further derived the TD(0) learning rule to estimate the variance, followed by Sato and Kobayashi [2000] which applied this value based method to an MV portfolio selection problem. It is worth noting that due to the definition of the value function (i.e., the variance penalized expected reward-to-go) in Sato and Kobayashi [2000], Bellman’s optimality principle does not hold. As a result, it is not guaranteed that a greedy policy based on the latest updated value function will eventually lead to the true global optimal policy. The second approach, the policy based RL, was proposed in Tamar et al. [2013]. They also extended the work to linear function approximators and devised actor-critic algorithms for MV optimization problems for which convergence to the local optimum is guaranteed with probability one (Tamar and Mannor [2013]). Related works following this line of research include Prashanth and Ghavamzadeh [2013], Prashanth and Ghavamzadeh [2016], among others. Despite the various methods mentioned above, it remains an open and interesting question in RL to search for the global optimum under the MV criterion.

In this paper, rather than relying on the typical framework of discrete-time MDP and discretizing time and state/action spaces accordingly, we designed the EMV algorithm to learn the global optimal solution of the continuous-time exploratory MV problem (6) directly. As pointed out in Doya [2000]

, it is typically challenging to find the right granularity to discretize the state and action spaces, and naive discretization may lead to poor performance. On the other hand, grid-based discretization methods for solving the HJB equation cannot easily extend to high dimensions in practice due to the curse of dimensionality, although theoretical convergence results have been established (see

Munos and Bourgine [1998], Munos [2000]

). Our EMV algorithm, however, is computationally feasible and implementable in high dimensions, as demonstrated by the experiements, due to the explicit representations of the value functions and the portfolio strategies, thereby devoid of the curse of dimensionality. Note that our algorithm does not use (deep) neural networks, which have been applied extensively in literature for (high-dimensional) continuous RL problems (e.g.,

Lillicrap et al. [2016], Mnih et al. [2015]

) but known for unstable performance, sample inefficiency as well as extensive hyperparameter tuning (

Mnih et al. [2015], Duan et al. [2016], Henderson et al. [2018]), in addition to their low interpretability.666

Interpretability is one of the most important and pressing issues in the general artificial intelligence applications in financial industry due to, among others, the regulatory requirement.

7 Conclusions

We studied continuous-time mean-variance (MV) portfolio allocation problem in high dimensions using RL methods. Under the exploratory control framework for general continuous-time optimization problems, we formulated the exploratory MV problem in high dimensions and proved the optimality of Gaussian policy in achieving the best tradeoff between exploration and exploitation. Our EMV algorithm, designed by combining quantitative finance analysis and RL techniques to solve the exploratory MV problem, is interpretable, scalable and data efficient, thanks to a provable policy improvement theorem and efficient functional approximations based on the theoretical optimal solutions. It consistently outperforms both classical model-based econometric methods and model-free deep RL method, across different training and testing scenarios. Interesting future research includes testing the EMV algorithm for shorter trading horizons with tick data (e.g. high frequency trading), or for trading other financial instruments such as mean-variance option hedging.


The author would like to thank Prof. Xun Yu Zhou for generous support and continuing encouragement on this work. The author also wants to thank Lin (Charles) Chen for providing the results on BL, FF, Markowitz and RWPI methods.


  • Ang et al. [2011] Andrew Ang, Sergiy Gorovyy, and Gregory B Van Inwegen. Hedge fund leverage. Journal of Financial Economics, 102(1):102–126, 2011.
  • Black and Litterman [1992] Fischer Black and Robert Litterman. Global portfolio optimization. Financial Analysts Journal, 48(5):28–43, 1992.
  • Blanchet et al. [2018] Jose Blanchet, Lin Chen, and Xun Yu Zhou. Distributionally robust mean-variance portfolio selection with Wasserstein distances. arXiv preprint arXiv:1802.04885, 2018.
  • Buehler et al. [2019] Hans Buehler, Lukas Gonon, Josef Teichmann, and Ben Wood. Deep hedging. Quantitative Finance, pages 1–21, 2019.
  • Doya [2000] Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000.
  • Duan et al. [2016] Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In

    International Conference on Machine Learning

    , pages 1329–1338, 2016.
  • Fama and French [1996] Eugene F Fama and Kenneth R French. Multifactor explanations of asset pricing anomalies. The Journal of Finance, 51(1):55–84, 1996.
  • Henderson et al. [2018] Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep reinforcement learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Levine et al. [2016] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies. The Journal of Machine Learning Research, 17(1):1334–1373, 2016.
  • Lillicrap et al. [2016] Timothy Lillicrap, Jonathan Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2016.
  • Luenberger [1998] David G Luenberger. Investment Science. Oxford University Press, New York, 1998.
  • Mannor and Tsitsiklis [2013] Shie Mannor and John N Tsitsiklis. Algorithmic aspects of mean–variance optimization in Markov decision processes. European Journal of Operational Research, 231(3):645–653, 2013.
  • Markowitz [1952] Harry Markowitz. Portfolio selection. The Journal of Finance, 7(1):77–91, 1952.
  • Markowitz [1959] Harry Markowitz. Portfolio Selection: Efficient Diversification of Investments. Yale University Press, 1959. ISBN 9780300013726. URL
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
  • Moody and Saffell [2001] John Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889, 2001.
  • Moody et al. [1998] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17(5-6):441–470, 1998.
  • Munos [2000] Rémi Munos. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Machine Learning, 40(3):265–299, 2000.
  • Munos and Bourgine [1998] Rémi Munos and Paul Bourgine. Reinforcement learning for continuous stochastic control problems. In Advances in Neural Information Processing Systems, pages 1029–1035, 1998.
  • Nevmyvaka et al. [2006] Yuriy Nevmyvaka, Yi Feng, and Michael Kearns. Reinforcement learning for optimized trade execution. In Proceedings of the 23rd International Conference on Machine Learning, pages 673–680, 2006.
  • Peters et al. [2003] Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Reinforcement learning for humanoid robotics. In Proceedings of the 3rd IEEE-RAS International Conference on Humanoid Robots, pages 1–20, 2003.
  • Prashanth and Ghavamzadeh [2013] LA Prashanth and Mohammad Ghavamzadeh. Actor-critic algorithms for risk-sensitive MDPs. In Advances In Neural Information Processing Systems, pages 252–260, 2013.
  • Prashanth and Ghavamzadeh [2016] LA Prashanth and Mohammad Ghavamzadeh. Variance-constrained actor-critic algorithms for discounted and average reward MDPs. Machine Learning, 105(3):367–417, 2016.
  • Sato and Kobayashi [2000] Makoto Sato and Shigenobu Kobayashi. Variance-penalized reinforcement learning for risk-averse asset allocation. In International Conference on Intelligent Data Engineering and Automated Learning, pages 244–249. Springer, 2000.
  • Sato et al. [2001] Makoto Sato, Hajime Kimura, and Shibenobu Kobayashi. TD algorithm for the variance of return and mean-variance reinforcement learning. Transactions of the Japanese Society for Artificial Intelligence, 16(3):353–362, 2001.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, and Marc Lanctot. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
  • Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, and Adrian Bolton. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.
  • Sobel [1982] Matthew J Sobel. The variance of discounted Markov decision processes. Journal of Applied Probability, 19(4):794–802, 1982.
  • Tamar and Mannor [2013] Aviv Tamar and Shie Mannor. Variance adjusted actor critic algorithms. arXiv preprint arXiv:1310.3697, 2013.
  • Tamar et al. [2013] Aviv Tamar, Dotan Di Castro, and Shie Mannor. Temporal difference methods for the variance of the reward to go. In International Conference on Machine Learning, pages 495–503, 2013.
  • Wang and Zhou [2019] Haoran Wang and Xun Yu Zhou. Continuous-time mean-variance portfolio selection: A reinforcement learning framework. arXiv preprint arXiv:1904.11392, 2019.
  • Wang et al. [2019] Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Exploration versus exploitation in reinforcement learning: A stochastic control approach. arXiv preprint: arXiv:1812.01552v3, 2019.
  • Zhou and Li [2000] Xun Yu Zhou and Duan Li. Continuous-time mean-variance portfolio selection: A stochastic LQ framework. Applied Mathematics and Optimization, 42(1):19–33, 2000.

Appendix A Controlled Wealth Dynamics

Let , be a standard -dimensional Brownian motion defined on a filtered probability space that satisfies the usual conditions. The price process of the -th risky asset is a geometric Brownian motion governed by


with being the initial price at , and , being the mean return and volatility coefficients of the -th risky asset, respectively. We denote for brevity the mean return vector by , and the volatility matrix by , whose -th column represents the volatility of the -th risky asset. The riskless asset has a constant interest rate . We assume that is non-degenerate and hence there exists a -dimensional vector that satisfies , where is the -dimensional vector with all components being . The vector is known as the market price of risk. It is worth noting that the above assumptions are only made for the convenience of deriving theoretical results in the paper; in practice, all the model parameters are unknown and time-varying, and it is the goal of RL algorithms to directly output trading strategies without relying on estimation of any underlying parameters.

Denote by and the discounted dollar value put in the savings account and the risky assets, respectively, at time . It then follows that the discounted wealth process is , . The self-financing condition further implies that, using (22), we have

Appendix B Value Functions and Admissible Control Distributions

In order to rigorously solve (6) by dynamic programming, we need to define the value functions. For each , consider the state equation (4) on with . Define the set of admissible controls, , as follows. Let be the Borel algebra on . A (distributional) control (or portfolio/strategy) process belongs to , if

(i)  for each , a.s.;

(ii) for each , is -progressively measurable;

(iii) ;

(iv) .

Clearly, it follows from condition (iii) that the stochastic differential equation (SDE) (4) has a unique strong solution for that satisfies .

Controls in are measure-valued (or, precisely, density-function-valued) stochastic processes, which are also called open-loop controls in the control terminology. As in the classical control theory, it is important to distinguish between open-loop controls and feedback (or closed-loop) controls (or policies as in the RL literature, or laws as in the control literature). Specifically, a deterministic mapping is called an (admissible) feedback control if i) is a density function for each ; ii) for each , the following SDE (which is the system dynamics after the feedback policy is applied)


has a unique strong solution , and the open-loop control where . In this case, the open-loop control is said to be generated from the feedback policy with respect to the initial time and state, . It is useful to note that an open-loop control and its admissibility depend on the initial , whereas a feedback policy can generate open-loop controls for any , and hence is in itself independent of . Note that throughout this paper, we have used boldfaced to denote feedback controls, and the normal style to denote open-loop controls.

Now, for a fixed , define


for . The function is called the optimal value function of the problem.

Moreover, we define the value function under any given feedback control :


for , where is the open-loop control generated from with respect to and is the corresponding wealth process.

Note that in the control literature, given by (24) is called the value function. However, in the RL literature the term “value function" is also used for the objective value under a particular control (i.e. in (25)). So to avoid ambiguity we have called the optimal value function in this paper.

Appendix C Proofs

c.1 Proof of Theorem 1

The main proof of Theroem 1 would be the verification arguments that aim to show the optimal value function of problem (6) is given by (13) and that the candidate optimal policy (14) is indeed admissible, based on the definitions in Appendix B. Since the current exploratory MV problem is a special case of the exploratory linear-quadratic problem extensively studied in Wang et al. [2019], a detailed proof would follow the same lines of that of Theorem therein, and is left for interested readers.

Proof. We now determine the Lagrange multiplier through the constraint . It follows from (15), along with the standard estimate that and Fubini’s Theorem, that

Hence, . The constraint now becomes , which gives .   

c.2 Proof of Theorem 2

To prove the solution of the exploratory MV problem converges to that of the classical MV problem, as , we first recall the solution of the classical MV problem.

In order to apply dynamic programming for (3), we again consider the set of admissible controls, , for ,

: is -progressively measurable and

The (optimal) value function is defined by


for , where is fixed. Once this problem is solved, can be determined by the constraint , with being the optimal wealth process under the optimal portfolio .

The HJB equation is


with the terminal condition .

Standard verification arguments deduce the optimal value function to be


the optimal feedback control policy to be


and the corresponding optimal wealth process to be the unique strong solution to the SDE


Comparing the optimal wealth dynamics, (15) and (30), of the exploratory and classical problems, we note that they have the same drift coefficient (but different diffusion coefficients). As a result, the two problems have the same mean of optimal terminal wealth and hence the same value of the Lagrange multiplier determined by the constraint .

Proof. The weak convergence of the feedback controls follows from the explicit forms of in (14) and in (29). The pointwise convergence of the value functions follows easily from the forms of in (13) and in (28), together with the fact that


c.3 Proof of Theorem 3

Proof. Fix . Since, by assumption, the feedback policy is admissible, the open-loop control strategy, , generated from with respect to the initial condition is admissible. Let be the corresponding wealth process under . Applying Itô’s formula, we have


Define the stopping times , for . Then, from (31), we obtain


On the other hand, by standard arguments and the assumption that is smooth, we have

for any