Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

01/25/2019 ∙ by Pengqian Yu, et al.

Dynamic portfolio optimization is the process of sequentially allocating wealth to a collection of assets over consecutive trading periods, based on investors' return-risk profile. Automating this process with machine learning remains a challenging problem. Here, we design a deep reinforcement learning (RL) architecture with an autonomous trading agent that makes investment decisions periodically, based on a global objective. In particular, rather than relying on a purely model-free RL agent, we train our trading agent using a novel RL architecture consisting of an infused prediction module (IPM), a generative adversarial data augmentation module (DAM) and a behavior cloning module (BCM). Our model-based approach works with both on-policy and off-policy RL algorithms. We further design a back-testing and execution engine that interacts with the RL agent in real time. Using historical real financial market data, we simulate trading with practical constraints, and demonstrate that our proposed model is robust, profitable and risk-sensitive compared to baseline trading strategies and model-free RL agents from prior work.




1 Introduction

Reinforcement learning (RL) consists of an agent interacting with the environment, in order to learn an optimal policy by trial and error for sequential decision-making problems (Bertsekas, 2005; Sutton & Barto, 2018). The past decade has witnessed the tremendous success of deep RL in the fields of gaming, robotics and recommendation systems (Lillicrap et al., 2015; Silver et al., 2016; Mnih et al., 2015, 2016). However, its applications in the financial domain have not been explored as thoroughly.

Dynamic portfolio optimization remains one of the most challenging problems in the field of finance (Markowitz, 1959; Haugen & Haugen, 1990). It is a sequential decision-making process of continuously reallocating funds into a number of different financial investment products, with the main aim of maximizing return while constraining risk. Classical approaches to this problem include dynamic programming and convex optimization, which require discrete actions and thus suffer from the 'curse of dimensionality' (e.g., (Cover, 1991; Li & Hoi, 2014; Feng et al., 2015)).

There have been efforts to apply RL techniques to alleviate the dimensionality issue in the portfolio optimization problem (Moody & Saffell, 2001; Dempster & Leemans, 2006; Cumming et al., 2015; Jiang et al., 2017; Deng et al., 2017; Guo et al., 2018; Liang et al., 2018). The main idea is to train an RL agent that is rewarded if its investment decisions increase the logarithmic rate of return and penalized otherwise. However, these RL algorithms have several drawbacks. In particular, the approaches in (Moody & Saffell, 2001; Dempster & Leemans, 2006; Cumming et al., 2015; Deng et al., 2017) only yield discrete single-asset trading signals. The multi-asset setting was studied in (Guo et al., 2018); however, the authors did not take transaction costs into consideration, which limits the practical usage of their method. In more recent studies (Jiang et al., 2017; Liang et al., 2018), transaction costs were considered, but these works did not address the challenge of having insufficient financial market data for training robust machine learning algorithms. Moreover, the methods proposed in (Jiang et al., 2017; Liang et al., 2018) directly apply a model-free RL algorithm that is sample-inefficient and does not account for the stability and risk issues caused by the non-stationary financial market environment. In this paper, we propose a novel model-based RL approach that takes into account practical trading restrictions, such as transaction costs and order executions, to stably train an autonomous agent whose investment decisions are risk-averse yet profitable.

We highlight our three main contributions to realize a model-based RL algorithm for our problem setting. Our first contribution is an infused prediction module (IPM), which incorporates predictions of expected future observations into state-of-the-art RL algorithms. Our idea is inspired by previous attempts to merge prediction methods with RL. For example, RL has been successful in predicting the behavior of simple gaming environments (Oh et al., 2015). In addition, prediction-based models have been shown to improve the performance of RL agents in distributing energy over a smart power grid (Marinescu et al., 2017). In this paper, we explore two prediction models: a nonlinear dynamic Boltzmann machine (Dasgupta & Osogami, 2017) and a variant of parallel WaveNet (van den Oord et al., 2018). These models make use of the historical prices of all assets in the portfolio to predict the future price movements of each asset in a codependent manner. These predictions are then treated as additional features that the RL agent can use to improve its performance. Our experimental results show that using IPM provides significant performance improvements over baseline RL algorithms in terms of Sharpe ratio (Sharpe, 1966), Sortino ratio (Sortino & Price, 1994), maximum drawdown (MDD, see (Chekhlov et al., 2005)), value-at-risk (VaR, see (Artzner et al., 1999)) and conditional value-at-risk (CVaR, see (Rockafellar et al., 2000)).

Our second contribution is a data augmentation module (DAM), which makes use of a generative adversarial network (GAN, e.g., (Goodfellow et al., 2014)) to generate synthetic market data. This module is motivated by the fact that financial markets have limited data. To illustrate this, consider the case where new portfolio weights are decided by the agent on a daily basis. In such a scenario, which is not uncommon, the size of the training set for a particular asset over the past 10 years is only around 2,530 samples, since there are only about 253 trading days a year. Clearly, this is an extremely small dataset that may not be sufficient for training a robust RL agent. To overcome this difficulty, we train a recurrent GAN (Esteban et al., 2017) using historical asset data to produce realistic multi-dimensional time series. Differing from the objective function in (Goodfellow et al., 2014), we explicitly include the maximum mean discrepancy (MMD, see (Gretton et al., 2007)) in the generator loss, which further reduces the mismatch between the real and generated data distributions. We show that DAM helps to reduce over-fitting and typically leads to a portfolio with less volatility.

Our third contribution is a behavior cloning module (BCM), which provides one-step greedy expert demonstrations to the RL agent. Our idea comes from the imitation learning paradigm (also called learning from demonstrations), whose most common form is behavior cloning, which learns a policy through supervision provided by expert state-action pairs. In particular, the agent receives examples of behavior from an expert and attempts to solve a task by mimicking the expert's behavior, e.g., (Bain & Sammut, 1999; Abbeel & Ng, 2004; Ross et al., 2011). In RL, an agent attempts to maximize expected reward through interaction with the environment. Our proposed BCM combines aspects of conventional RL algorithms and supervised learning to solve complex tasks. This technique is similar in spirit to the work in (Nair et al., 2018). The difference is that we create the expert behavior via a one-step greedy strategy, by solving an optimization problem that maximizes the immediate reward in the current time step. Additionally, we only update the actor with respect to its auxiliary behavior cloning loss in an actor-critic algorithm setting. We demonstrate that BCM can prevent large changes in portfolio weights and thus keep volatility low, while also increasing returns in some cases.

To the best of our knowledge, this is the first work that leverages the state of the art in deep RL, extends it to a model-based setting and integrates it into the financial domain. Even though our proposed approach has been rigorously tested on an off-policy RL algorithm (in particular, the deep deterministic policy gradient (DDPG) algorithm (Lillicrap et al., 2015)), these concepts can be easily extended to on-policy RL algorithms such as proximal policy optimization (PPO) (Schulman et al., 2017) and trust region policy optimization (TRPO) (Schulman et al., 2015). We showcase the overall algorithm for model-based PPO for portfolio management and the corresponding results in the supplementary material. Additionally, we also provide algorithms for differential risk-sensitive deep RL for portfolio optimization in the supplementary material. For the rest of the main paper, our discussion centers on how our three contributions improve the performance of the off-policy DDPG algorithm.

This paper is organized as follows: In Section 2, we review the deep RL literature, and formulate the portfolio optimization problem as a deep RL problem. We describe the structure of our automatic trading system in Section 3. Specifically, we provide details of the infused prediction module, data augmentation module and behavior cloning module in Section 3.2 to Section 3.4. In Section 4, we report numerical experiments that serve to illustrate the effectiveness of methods described in this paper. We conclude in Section 5.

2 Preliminaries and Problem Setup

In this section, we briefly review the literature of deep reinforcement learning and introduce the mathematical formulation of the dynamic portfolio optimization problem.

A Markov Decision Process (MDP) is defined as a 6-tuple $(T, \gamma, S, A, P, r)$. Here, $T$ is the (possibly infinite) decision horizon; $\gamma \in (0, 1]$ is the discount factor; $S$ is the state space and $A$ is the action space, both assumed to be finite dimensional and continuous; $P: S \times A \times S \to [0, 1]$ is the transition kernel and $r: S \times A \to \mathbb{R}$ is the reward function. A policy is a mapping $\mu: S \to A$, specifying the action to choose in a particular state. At each time step $t$, the agent in state $s_t \in S$ takes an action $a_t = \mu(s_t)$, receives the reward $r_t = r(s_t, a_t)$ and transits to the next state $s_{t+1}$ according to $P(\cdot \mid s_t, a_t)$. The agent's objective is to maximize its expected return given the start distribution, $J(\mu) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^t r_t\right]$. The state-action value function, or the Q value function, is defined as $Q^{\mu}(s, a) = \mathbb{E}\left[\sum_{k=0}^{T} \gamma^k r_{t+k} \mid s_t = s, a_t = a\right]$.
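As a concrete check of these definitions, the discounted return for a short reward sequence can be computed directly (a toy sketch; the reward values are illustrative):

```python
# Discounted return J = sum_t gamma^t * r_t for a short episode.
gamma = 0.99
rewards = [1.0, 0.5, -0.2, 0.8]
ret = sum(gamma ** t * r for t, r in enumerate(rewards))
```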

The deep deterministic policy gradient (DDPG) algorithm (Lillicrap et al., 2015) is an off-policy model-free reinforcement learning algorithm for continuous control which utilizes large function approximators such as deep neural networks. DDPG is an actor-critic method, bridging the gap between policy gradient methods and value function approximation methods for RL. Intuitively, DDPG learns a state-action value function (critic) by minimizing the Bellman error, while simultaneously learning a policy (actor) by directly maximizing the estimated state-action value function with respect to the network parameters.

In particular, DDPG maintains an actor function $\mu(s \mid \theta^{\mu})$ with parameters $\theta^{\mu}$, a critic function $Q(s, a \mid \theta^{Q})$ with parameters $\theta^{Q}$, and a replay buffer $R$ as a set of tuples $(s_t, a_t, r_t, s_{t+1})$ for each experienced transition. DDPG alternates between running the policy to collect experiences (i.e., training roll-outs) and updating the parameters. In our implementation, training roll-outs were conducted with noise added to the policy network's parameter space to encourage exploration (Plappert et al., 2017). During each training step, DDPG samples a minibatch of $N$ tuples from $R$ to update the actor and critic networks. DDPG minimizes the following loss w.r.t. $\theta^{Q}$ to update the critic: $L(\theta^{Q}) = \frac{1}{N} \sum_{i} \left(y_i - Q(s_i, a_i \mid \theta^{Q})\right)^2$, where $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$. The actor parameters are updated using the policy gradient $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_a Q(s, a \mid \theta^{Q})\big|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$. To stabilize learning, the target Q value is usually computed using a separate network (called the target network) whose weights are an exponential average over time of the critic network. This results in smoother target values.
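The critic target and the soft target-network update can be sketched with a toy linear critic (the feature map, learning rates and all values are illustrative stand-ins, not the paper's deep networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear critic Q(s, a) = phi(s, a) . theta
def q_value(theta, s, a):
    phi = np.concatenate([s, a, s * a])   # simple fixed feature map
    return phi @ theta

gamma = 0.99    # discount factor
tau = 0.001     # soft target-update rate

theta = rng.normal(size=9)      # critic parameters
theta_target = theta.copy()     # target-network parameters

# One sampled transition (s, a, r, s') and the target actor's action a'
s, a = rng.normal(size=3), rng.normal(size=3)
r, s_next = 1.0, rng.normal(size=3)
a_next = np.tanh(s_next)        # stand-in for the target actor mu'(s')

# Bellman target y = r + gamma * Q'(s', mu'(s'))
y = r + gamma * q_value(theta_target, s_next, a_next)

# One gradient step on the squared Bellman error (Q(s, a) - y)^2
phi = np.concatenate([s, a, s * a])
td_error = q_value(theta, s, a) - y
theta -= 0.001 * 2 * td_error * phi

# Soft update: target weights track an exponential average of the critic's
theta_target = tau * theta + (1 - tau) * theta_target
```

After the gradient step, the critic's estimate moves toward the fixed target $y$, which is the stabilizing effect the target network is meant to provide.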

Financial portfolio management is the process of constant redistribution of available funds across a set of financial assets. Our goal is to create a dynamic portfolio allocation system that periodically generates investment decisions and then acts on these decisions autonomously. Following (Jiang et al., 2017), we consider a portfolio of $m+1$ assets, comprising $m$ risky assets and one risk-free asset (e.g., broker cash balance or a U.S. treasury bond). We introduce the following notation: given a matrix $X$, we denote the $i$-th row of $X$ by $X_{i,:}$ and the $j$-th column by $X_{:,j}$. We denote the closing, high and low price vectors of trading period $t$ as $\mathbf{v}_t$, $\mathbf{v}^{hi}_t$ and $\mathbf{v}^{lo}_t$, where $v_{i,t}$ is the closing price of the $i$-th asset in the $t$-th period. In this paper, we choose the first asset to be risk-free cash, i.e., $v_{0,t} \equiv 1$ for all $t$. We further define the price relative vector of the $t$-th trading period as $\mathbf{y}_t := \mathbf{v}_t \oslash \mathbf{v}_{t-1}$, where $\oslash$ denotes element-wise division. In addition, we let $x_{i,t}$ denote the percentage change of the closing price at time $t$ for asset $i$, with vector form $\mathbf{x}_t$; the prediction model operates on a time embedding of $n$ past observations of these percentage changes. We define $\mathbf{w}_t$ as the portfolio weight vector at the beginning of trading period $t$, where its $i$-th element $w_{i,t}$ represents the proportion of asset $i$ in the portfolio after capital reallocation, with $\sum_{i} w_{i,t} = 1$ for all $t$. We initialize our portfolio with $\mathbf{w}_0 = (1, 0, \dots, 0)^{\top}$. Due to price movements in the market, at the end of the same period the weights evolve according to $\mathbf{w}'_t = (\mathbf{y}_t \odot \mathbf{w}_t) / (\mathbf{y}_t \cdot \mathbf{w}_t)$, where $\odot$ is element-wise multiplication. Our goal at the end of period $t$ is to reallocate the portfolio vector from $\mathbf{w}'_t$ to $\mathbf{w}_{t+1}$ by selling and buying the relevant assets. Paying all commission fees, this reallocation action shrinks the portfolio value by a factor $\mu_t \in (0, 1]$, determined by the commission rate $c$ for purchasing and selling. In particular, we let $p_{t-1}$ denote the portfolio value at the beginning of period $t$ and $p'_t$ its value at the end of the period before reallocation, so that $p_t = \mu_t p'_t$. The immediate reward is the logarithmic rate of return, defined by $r_t := \log(p_t / p_{t-1})$.
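The weight-drift, cost-factor and log-return bookkeeping can be illustrated numerically (a minimal sketch; the proportional cost factor below is a simplification of the exact transaction factor, and all numbers are made up):

```python
import numpy as np

# Asset 0 is cash, so its price relative is always 1.
c = 0.002                          # commission rate (20 bp, assumed per side)
w = np.array([0.5, 0.3, 0.2])      # weights at the start of period t
y = np.array([1.0, 1.05, 0.97])    # price relative vector y_t

# Prices move during the period, so the weights drift: w' = (y * w) / (y . w)
w_drift = (y * w) / (y @ w)

# Crude stand-in for the cost factor mu_t: pay commission on the turnover
# needed to reach the new target weights (the exact mu_t is defined by a
# slightly more involved relation).
w_new = np.array([0.4, 0.4, 0.2])
mu = 1.0 - c * np.abs(w_new - w_drift).sum()

# Portfolio value update and the logarithmic rate of return
p_prev = 100000.0
p_new = p_prev * mu * (y @ w)
log_return = np.log(p_new / p_prev)
```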

We define the normalized close price matrix at time $t$ by $\mathbf{V}_t = \left[\mathbf{v}_{t-n+1} \oslash \mathbf{v}_t \mid \mathbf{v}_{t-n+2} \oslash \mathbf{v}_t \mid \cdots \mid \mathbf{v}_t \oslash \mathbf{v}_t\right]$, where $n$ is the time embedding. The normalized high price matrix is defined by $\mathbf{V}^{hi}_t = \left[\mathbf{v}^{hi}_{t-n+1} \oslash \mathbf{v}_t \mid \cdots \mid \mathbf{v}^{hi}_t \oslash \mathbf{v}_t\right]$, and the low price matrix $\mathbf{V}^{lo}_t$ can be defined similarly. We further define the price tensor as $\mathbf{X}_t = \left[\mathbf{V}_t, \mathbf{V}^{hi}_t, \mathbf{V}^{lo}_t\right]$. Our objective is to design an RL agent that observes the state $s_t$ and takes a sequence of actions (portfolio weights) $\mathbf{w}_t$ over time such that the final portfolio value is maximized.
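Constructing the price tensor from a window of price history can be sketched as follows (the asset count, window length and synthetic prices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 5     # number of assets and time embedding (illustrative)

# Fake close/high/low price history, one row per period
close = np.cumprod(1 + 0.01 * rng.standard_normal((n, m)), axis=0) * 100
high = close * (1 + 0.005)
low = close * (1 - 0.005)

# Normalize each window by the latest closing price, as in the text
V_close = (close / close[-1]).T   # shape (m, n)
V_high = (high / close[-1]).T
V_low = (low / close[-1]).T

# Stack into the price tensor X_t, channels-first: (3, m, n)
X = np.stack([V_close, V_high, V_low])
```

Note that the last column of the normalized close channel is identically 1, since each window is divided by its own latest close.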

3 System Design and Data

In this section, we discuss the detailed design of our proposed RL based automatic trading system.

Figure 1: Trading framework.

The trading framework referenced in this paper is represented in Figure 1 and is a modular system composed of a data handler (DH), an algorithm engine (AE) and a market simulation engine (MSE). The DH retrieves market data and deals with the required data transformations. It is designed for continuous data ingestion in order to provide the AE with the required set of information. The AE is a collection of models containing RL agents and environment specifications. We refer the readers to Algorithm A in the supplementary material for further details. The MSE is an online event-driven module that provides feedback of executed trades, which can eventually be used by the AE to compute rewards. In addition, it also executes investment decisions made by the AE. The strategy applied in this study is an asset allocation system that rebalances the available capital between the provided set of assets and a cash asset on a daily frequency.

The data used in this paper is a mix of U.S. equities[1] on tick level (trade by trade), aggregated to form open-high-low-close (OHLC) bars on an hourly frequency[2].

[1] We use data from Refinitiv DataScope, with experiments carried out on the following U.S. equities: Costco Wholesale Corporation, Cisco Systems, Ford Motors, Goldman Sachs, American International Group and Caterpillar. The selection is qualitative, with a balanced mix of volatile and less volatile stocks. An additional asset is cash (in U.S. dollars), representing the amount of capital not invested in any other asset. Furthermore, a generic market variable (the S&P 500 index) has been added as an additional feature. The data is shared in supplementary files, which will be made available publicly later.
[2] We execute orders hourly, while the agent produces decisions daily.

In the financial domain, it is common to use benchmark strategies to evaluate the relative profitability and risk profile of the tested strategies. A common benchmark strategy is the constantly rebalanced portfolio (CRP), where at each period the portfolio is rebalanced to the initial wealth distribution among the assets, including cash. This strategy exploits the mean-reverting nature of stock prices, as it sells assets that gained value while buying more of those losing value. In (Cover, 1991) it is shown that such a strategy is asymptotically the best for stationary stochastic processes such as stock prices, offering exponential returns; it is therefore an optimistic benchmark to compare against. Transaction fees have been fixed at a conservative level of 20 basis points[3] and, given the use of market orders, an additional 50 basis points of slippage[4] is applied.

[3] One basis point is equivalent to 0.01%.
[4] Slippage is defined as the relative deviation between the price at which the orders get executed and the price at which the agent produced the reallocation action. This conservative level of slippage is due to the lack of equity volume data: during less liquid trading periods, the agent's market orders might influence the market price and get a worse trade execution than expected in a simulated environment.
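A toy backtest of the CRP benchmark, with commission and slippage charged on turnover, might look like this (the random price model and all parameters are illustrative, not the paper's experimental setup):

```python
import numpy as np

fees = 0.0020 + 0.0050            # 20 bp commission + 50 bp slippage
rng = np.random.default_rng(2)
n_assets, n_periods = 4, 250      # 3 stocks + cash, roughly one trading year

target = np.full(n_assets, 1.0 / n_assets)   # CRP target: equal weights
value, w = 1.0, target.copy()
for _ in range(n_periods):
    # Price relatives: cash stays at 1, stocks follow a toy random walk
    y = np.concatenate([[1.0], 1 + 0.01 * rng.standard_normal(n_assets - 1)])
    value *= y @ w                     # market moves the portfolio value...
    w_drift = (y * w) / (y @ w)        # ...and the weights
    turnover = np.abs(target - w_drift).sum()
    value *= 1 - fees * turnover       # pay costs to rebalance back
    w = target.copy()
```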

Performance monitoring is based on a set of evaluation measures such as Sharpe and Sortino ratios, value-at-risk (VaR), conditional value-at-risk (CVaR), maximum drawdown (MDD), annualized volatility and annualized returns. Let $R$ denote a bounded random variable representing returns. The Sharpe ratio of $R$ is defined as $\mathbb{E}[R] / \sigma[R]$, where $\sigma[R]$ is the standard deviation of $R$.

The Sharpe ratio, representing the reward per unit of risk, has been recognized as not always desirable, since it is a symmetric measure of risk and hence also penalizes favorable low-cost events. The Sortino ratio, VaR and CVaR are risk measures which gained popularity for taking into consideration only the unfavorable part of the return distribution, or, equivalently, unwanted high cost. The Sortino ratio is defined similarly to the Sharpe ratio, with the standard deviation replaced by the downside deviation. The VaR at level $\alpha$ is the $\alpha$-quantile of $R$, and the CVaR at level $\alpha$ is the expected return of $R$ in the worst $\alpha$ fraction of cases.
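These evaluation measures can be sketched for a vector of period returns as follows (a minimal sketch; sign and quantile conventions for VaR/CVaR vary across the literature):

```python
import numpy as np

def risk_report(returns, alpha=0.05):
    # Toy implementations of the evaluation measures discussed above.
    mean, std = returns.mean(), returns.std()
    downside = returns[returns < 0].std()      # downside deviation
    sharpe = mean / std
    sortino = mean / downside
    var = -np.quantile(returns, alpha)         # loss at the alpha-quantile
    cvar = -returns[returns <= -var].mean()    # mean loss beyond VaR
    wealth = np.cumprod(1 + returns)
    mdd = (1 - wealth / np.maximum.accumulate(wealth)).max()
    return sharpe, sortino, var, cvar, mdd

rng = np.random.default_rng(3)
daily = 0.0004 + 0.01 * rng.standard_normal(500)   # fake daily returns
sharpe, sortino, var, cvar, mdd = risk_report(daily)
```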

3.1 Network Architecture

In this subsection, we provide a qualitative description of our proposed network before delving into the mathematical details in the subsequent subsections.

Figure 2: Agent architecture.

Figure 2 shows how we integrate the IPM, BCM and DAM into our network architecture. In particular, the DAM is used to append generated data to each training episode. The data fetcher fetches data from the augmented data in a step-wise fashion. This data is then fed to both the IPM and agent, which have multiple neural networks.

To illustrate our proposed off-policy version of the dynamic portfolio optimization algorithm, we adapt the actor-critic style DDPG algorithm (Lillicrap et al., 2015). In this setting, at least two networks (one for the actor and one for the critic) are required in the agent, as shown in Figure 2. Furthermore, our implementation utilizes both target networks (Mnih et al., 2015) and parameter noise exploration (Plappert et al., 2017), which in itself necessitates two additional networks for the actor. Our agent, comprising six separate networks (four for the actor and two for the critic), is described in Algorithm A of supplementary material Section A.

We next discuss how the agent is trained and tested. For each episode during training, we select an episode of data by selecting a random trading date that satisfies our desired episode length. The training data of each episode is then augmented by synthetic data generated by the DAM. At each time step of the episode, market data, i.e., data corresponding to the current time step of interest, fetched via the data fetcher is used to train the IPM which produces predictions of price percentage changes in the next time step. At the same time, the agent’s networks are updated by making use of data sampled from the memory (also referred to as the replay buffer) via the prioritized experience replay (Schaul et al., 2015). Once the agent’s networks are updated, an action, i.e., desired portfolio allocation in the next time step, can be obtained from the actor network that has been perturbed with the parameter noise. The MSE then executes this noisy action and the agent moves to the next state. The corresponding state, action, rewards, next state and computed one-step greedy action, produced by the BCM, are stored in the memory. This process is repeated for each step in the episode and for all episodes. During the testing period, the IPM continues to be updated at each time step while the agent is frozen in that it is no longer trained and actions are obtained from the actual actor network, i.e., the actor network without parameter noise.

As shown in Figure 2, the actor and critic networks in the agent consist of feature extraction (FE), feature analysis (FA) and network improvement (NI) components. The FE layers aim to extract features from the current price tensor $\mathbf{X}_t$. In our experiments, we have $m+1$ assets including cash and a time embedding of $n$ periods, which gives a price tensor of shape $(3, m+1, n)$ if using the channels-first convention. The FE layers can be either LSTM-based recurrent neural networks (RNN, see (Hochreiter & Schmidhuber, 1997)) or convolutional neural networks (CNN, see (Krizhevsky et al., 2012)). We find that the former typically yields better performance. Thus, the price tensor is reshaped into sequence form prior to being fed to the LSTM-based FE network.

The outputs at each time step of the LSTM-based FE network are concatenated into a single vector. Next, the previous action $\mathbf{w}_{t-1}$ is concatenated to this vector. Finally, it is further concatenated with the one-step predicted price percentage change vector produced by the IPM and a market index performance indicator (i.e., the price ratio of a market index such as the S&P 500 index). The resulting vector is then passed to a series of dense layers (i.e., multilayer perceptrons), which we refer to as the feature analysis (FA) component. We remark that the network in (Jiang et al., 2017) does not have dense layers, and thus may not account for non-linear relationships across various assets.
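The shape bookkeeping for the FA input can be sketched as follows (the LSTM feature size, asset count, and the assumption that the IPM predicts close/high/low changes per asset are all illustrative, not the paper's exact configuration):

```python
import numpy as np

# Illustrative sizes: 7 assets including cash, time embedding 10,
# LSTM feature size 16.
m_plus_1, n, lstm_dim = 7, 10, 16
lstm_outputs = np.zeros((n, lstm_dim))     # one FE output per time step
features = lstm_outputs.reshape(-1)        # concatenated across time
prev_action = np.zeros(m_plus_1)           # previous weights w_{t-1}
ipm_prediction = np.zeros(3 * m_plus_1)    # assumed: close/high/low per asset
market_indicator = np.zeros(1)             # e.g. S&P 500 price ratio
fa_input = np.concatenate(
    [features, prev_action, ipm_prediction, market_indicator]
)
```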

Finally, we have a network improvement (NI) component for each network which specifies how the network is updated. Specifically, NI synchronizes the learning rates between the actor and the critic, which preserves training stability by ensuring the actor is updated at a slower rate than the critic (Bhatnagar et al., 2009). It is also important to note that the actor network’s NI component receives gradients from the BCM, which makes use of one-step greedy actions to provide supervised updates to the actor network to reduce portfolio volatility.

3.2 Infused Prediction Module

Here, for the IPM, we implemented and evaluated two different multivariate prediction models, trained in an online manner and differing in their computational complexity. As the most time-efficient model, we implemented the nonlinear dynamic Boltzmann machine (NDyBM) (Dasgupta & Osogami, 2017), which predicts the future price of each asset conditioned on the history of all assets in the portfolio. As the NDyBM does not require backpropagation through time, it has a parameter update time complexity of $O(1)$ per step. This makes it very suitable for fast computation in online time-series prediction scenarios. However, the NDyBM assumes that the inputs come from an underlying Gaussian distribution. In order to make the IPM more generic, we also implemented a novel prediction model using dilated convolution layers, inspired by the WaveNet architecture (van den Oord et al., 2018). As there was no significant difference in predictive performance between the two models, in the rest of the paper we provide results with the faster NDyBM-based IPM. Details of our WaveNet-inspired architecture can be found in supplementary Section D.

We use a state-space augmentation technique and construct an augmented state space: each state is now a pair $(s_t, \psi_t)$, where $s_t$ is the original state and $\psi_t$ is the predicted future close, high and low asset percentage price change tensor.

The NDyBM can be seen as an unfolded Gaussian Boltzmann machine over an infinite time horizon (i.e., the full history), which generalizes a standard vector auto-regressive model with eligibility traces and nonlinear transformations of historical data (Dasgupta & Osogami, 2017). It represents the conditional probability density of $\mathbf{x}^{[t]}$ given the history $\mathbf{x}^{[:t-1]}$ as[5]
$p(\mathbf{x}^{[t]} \mid \mathbf{x}^{[:t-1]}) = \prod_{j=1}^{N} p_j(x_j^{[t]} \mid \mathbf{x}^{[:t-1]}),$
where each factor on the right-hand side denotes the conditional probability density of $x_j^{[t]}$ given $\mathbf{x}^{[:t-1]}$ for $j = 1, \dots, N$, and $N$ is the number of units in the NDyBM. Here, $x_j^{[t]}$ is considered to have a Gaussian distribution for each $j$:
$p_j(x_j^{[t]} \mid \mathbf{x}^{[:t-1]}) = \frac{1}{\sqrt{2 \pi \sigma_j^2}} \exp\left( - \frac{(x_j^{[t]} - \mu_j^{[t]})^2}{2 \sigma_j^2} \right).$
Here, $\boldsymbol{\mu}^{[t]}$ is the vector of expected values of the units at time $t$ given the history of past patterns. It is represented as
$\boldsymbol{\mu}^{[t]} = \mathbf{b} + \sum_{\delta=1}^{d-1} \mathbf{W}^{[\delta]} \mathbf{x}^{[t-\delta]} + \sum_{k=1}^{K} \mathbf{U}_k \boldsymbol{\alpha}_k^{[t-1]},$
where $\mathbf{b}$ is a bias vector, $\boldsymbol{\alpha}_k^{[t-1]}$ are eligibility trace vectors, $d$ is the maximum time-delay between connections, and $\mathbf{W}^{[\delta]}$ for $\delta = 1, \dots, d-1$ and $\mathbf{U}_k$ for $k = 1, \dots, K$ are weight matrices. The eligibility trace can be updated recursively as $\boldsymbol{\alpha}_k^{[t]} = \lambda_k \boldsymbol{\alpha}_k^{[t-1]} + \mathbf{x}^{[t-d+1]}$, where $\lambda_k$ is a fixed decay rate defined for each trace. Additionally, the bias parameter vector $\mathbf{b}$ is updated at each time step using an RNN layer. This RNN layer computes a nonlinear feature map of the past time series; the output weights from the RNN to the bias layer, along with the other NDyBM parameters (bias, weight matrices and variance), are updated online using a stochastic gradient method.

[5] For mathematical convenience, $\mathbf{x}_t$ and $\mathbf{x}^{[t]}$ are used interchangeably.

Following (Dasgupta & Osogami, 2017)[6], the NDyBM is trained to predict the next time-step close, high and low percentage changes for each asset, conditioned on the history of all other assets, such that the log-likelihood of each time series is maximized. As such, the NDyBM parameters $\boldsymbol{\theta}$ are updated at each step $t$ by following the gradient of the log conditional probability density of $\mathbf{x}^{[t]}$: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \eta \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}^{[t]} \mid \mathbf{x}^{[:t-1]})$.

[6] Details of the learning rule and derivation of the model are as per the original paper. Algorithm steps and hyper-parameter settings are provided in the supplementary material.
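The eligibility-trace recursion and mean computation described above can be sketched as follows (a minimal sketch with a single trace and assumed sizes; the RNN-updated bias and the gradient updates are omitted):

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(4)
N, d = 6, 3                   # units and time-delay (illustrative)
b = np.zeros(N)               # bias vector (its RNN update is omitted here)
W = [0.1 * rng.standard_normal((N, N)) for _ in range(d - 1)]
U = 0.1 * rng.standard_normal((N, N))
lam = 0.5                     # eligibility-trace decay rate
alpha = np.zeros(N)           # eligibility trace
recent = deque(maxlen=d - 1)  # the last d-1 observations

mu = None
for t in range(20):
    x = rng.standard_normal(N)    # incoming percentage-change vector
    if len(recent) == d - 1:
        # Mean prediction: bias + delayed-input terms + trace term
        mu = b + sum(W[i] @ xi for i, xi in enumerate(reversed(recent)))
        mu = mu + U @ alpha
        # Constant-time trace update: decay, then absorb the observation
        # leaving the delay window (x^[t-d+1])
        alpha = lam * alpha + recent[0]
    recent.append(x)
```

The trace update touches a fixed number of vectors per step regardless of history length, which is what makes the model attractive for online prediction.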

In the spirit of providing the agent with additional market signals and removing non-Markovian dynamics from the model, we further augment the state space, along with the prediction, with a market index performance indicator. The state space now takes the form $(s_t, \psi_t, \iota_t)$, where $\iota_t$ is the market index performance indicator for time step $t$.

3.3 Data Augmentation Module

Limited historical financial data prevents us from scaling up the deep RL agent. To mitigate this issue, we augment the data set via the recurrent generative adversarial network (RGAN) framework (Esteban et al., 2017) and generate synthetic time series. The RGAN follows the architecture of a regular GAN, with both the generator and the discriminator replaced by recurrent neural networks. Specifically, we generate the percentage change of the closing price for each asset separately, at a higher frequency than what the RL agent uses. We then downsample the generated series to obtain the synthetic high, low, close (HLC) triplet vectors. This approach avoids the non-stationarity in market price dynamics by generating a stationary percentage-change series instead, and the generated HLC triplet is guaranteed to maintain the correct relationship (i.e., generated highs are higher than generated lows).
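The downsampling of a generated higher-frequency percentage-change series into a daily HLC triplet can be sketched as follows (the hours-per-day count and the synthetic series are stand-ins for the generator output):

```python
import numpy as np

rng = np.random.default_rng(5)
hours_per_day = 8                                  # assumed bars per day
pct = 0.002 * rng.standard_normal(hours_per_day)   # synthetic hourly changes

path = np.cumprod(1 + pct)    # intraday price path relative to prior close
close = path[-1] - 1          # daily close percentage change
high = path.max() - 1         # daily high percentage change
low = path.min() - 1          # daily low percentage change
```

Because high and low are taken as the max and min of the same path that ends at the close, the triplet automatically satisfies low <= close <= high.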

We assume implicitly that the assets' percentage change vector follows a distribution $\mathbf{x} \sim P_{\mathrm{data}}$. Let $h$ be the hidden dimension. Our goal is to find a parameterized function $G_{\theta}$ such that, given a noise prior $\mathbf{z} \sim P_z$ with $\mathbf{z} \in \mathbb{R}^h$, the generated distribution $P_{G_{\theta}}$ is empirically similar to the data distribution $P_{\mathrm{data}}$. Formally, given two batches of observations $\{\mathbf{x}_i\}_{i=1}^{B} \sim P_{\mathrm{data}}$ and $\{\tilde{\mathbf{x}}_i\}_{i=1}^{B} \sim P_{G_{\theta}}$, where $B$ is the batch size, we want $P_{G_{\theta}} \approx P_{\mathrm{data}}$ under a certain similarity measure of distributions.

One suitable choice of such a similarity measure is the maximum mean discrepancy (MMD) (Gretton et al., 2012). Following (Li et al., 2015), we can show (see supplementary material) that with the RBF kernel $k$, minimizing
$\mathrm{MMD}_b^2 = \frac{1}{B^2} \sum_{i,j} k(\mathbf{x}_i, \mathbf{x}_j) - \frac{2}{B^2} \sum_{i,j} k(\mathbf{x}_i, \tilde{\mathbf{x}}_j) + \frac{1}{B^2} \sum_{i,j} k(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}_j),$
the biased estimator of the squared MMD, results in matching all moments between the two distributions.
As discussed in previous works (Arjovsky et al., 2017; Arjovsky & Bottou, 2017), vanilla GANs suffer from the problem that the discriminator becomes perfect when the real and the generated probabilities have disjoint supports (which is often the case under the hypothesis that real-world data lies on low-dimensional manifolds). This can lead to vanishing generator gradients, making training difficult. Furthermore, the generator can suffer from the mode collapse issue, where it succeeds in tricking the discriminator but the generated samples have low variation.

In our RGAN architecture, we model each asset separately by a pair of parameterized functions $(G_{\theta}, D_{\phi})$, and we use $\mathrm{MMD}_u^2$, the unbiased estimator of the squared MMD between $P_{G_{\theta}}$ and $P_{\mathrm{data}}$, as a regularizer for the generator, such that the generator not only tries to 'trick' the discriminator into classifying its output as coming from $P_{\mathrm{data}}$, but also tries to match $P_{G_{\theta}}$ with $P_{\mathrm{data}}$ in all moments. This alleviates the aforementioned issues in vanilla GANs: firstly, the MMD is defined even when the distributions have disjoint supports, and secondly, the gradients provided by the MMD term do not depend on the discriminator but only on the real data. Both the discriminator and the generator are trained with a gradient descent method. The discriminator objective is
$\max_{\phi} \; \frac{1}{B} \sum_{i=1}^{B} \log D_{\phi}(\mathbf{x}_i) + \frac{1}{B} \sum_{i=1}^{B} \log\left(1 - D_{\phi}(G_{\theta}(\mathbf{z}_i))\right),$
and the generator objective is
$\min_{\theta} \; \frac{1}{B} \sum_{i=1}^{B} \log\left(1 - D_{\phi}(G_{\theta}(\mathbf{z}_i))\right) + \lambda \, \mathrm{MMD}_u^2,$
where $\lambda$ weights the regularizer, given a batch of samples $\{\mathbf{z}_i\}_{i=1}^{B}$ drawn independently from a diagonal Gaussian noise prior $P_z$, and a batch of samples $\{\mathbf{x}_i\}_{i=1}^{B}$ drawn from the data distribution of the asset.
To select the bandwidth parameter in the RBF kernel, we set it to the median pairwise distance between the joint data. Both the discriminator and generator networks are LSTMs (Hochreiter & Schmidhuber, 1997).
Although the generator directly minimizes the estimated MMD, we go one step further and validate the RGAN by conducting a Kolmogorov-Smirnov (KS) test. Our results show that the RGAN is indeed generating data representative of the true underlying distribution. More details of the KS test can be found in the supplementary material.
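The biased squared-MMD estimator with an RBF kernel and the median-distance bandwidth heuristic can be sketched as follows (sample data and shapes are illustrative):

```python
import numpy as np

def mmd2_biased(X, Y, bandwidth):
    # Biased estimator of squared MMD with an RBF kernel;
    # X, Y have shape (batch, features).
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(X, X).mean() - 2 * k(X, Y).mean() + k(Y, Y).mean()

rng = np.random.default_rng(6)
real = rng.normal(0.0, 1.0, size=(64, 3))
close_fit = rng.normal(0.0, 1.0, size=(64, 3))   # matches the real distribution
mismatch = rng.normal(3.0, 1.0, size=(64, 3))    # clearly shifted

# Median pairwise-distance heuristic for the bandwidth, as in the text
dists = np.sqrt(((real[:, None] - real[None]) ** 2).sum(-1))
bw = np.median(dists[dists > 0])
```

A matched sample yields a much smaller estimate than a shifted one, which is what makes the statistic usable as a training signal.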

3.4 Behavior Cloning Module

In finance, some investors may favor a portfolio with lower volatility over the investment horizon. To achieve this, we propose a novel method of behavior cloning, with the primary purpose of reducing portfolio volatility while maintaining reward-to-risk ratios. We remark that one can broadly dichotomize imitation learning into passive collection of demonstrations (behavioral cloning) versus active collection of demonstrations. The former setting (Abbeel & Ng, 2004; Ross et al., 2011) assumes that demonstrations are collected a priori and the goal of imitation learning is to find a policy that mimics them. The latter setting (Daumé et al., 2009; Sun et al., 2017) assumes an interactive expert that provides demonstrations in response to actions taken by the current policy. By this definition, our proposed BCM is an active imitation learning algorithm. In particular, for every step that the agent takes during training, we calculate the one-step greedy action in hindsight. This one-step greedy action is computed by solving an optimization problem, given the next time period's price relatives, the current portfolio distribution and transaction costs. The objective is to maximize the return in the current time step, which is why the computed action is referred to as the one-step greedy action. For time step $t$, the optimization problem is
$\max_{\mathbf{w}_{t+1}} \; \log\left( \mu_{t+1} \, \mathbf{y}_{t+1} \cdot \mathbf{w}_{t+1} \right) \quad \text{s.t.} \quad \sum_{i} w_{i,t+1} = 1, \; w_{i,t+1} \ge 0,$
where $\mu_{t+1}$ is the transaction cost factor induced by reallocating to $\mathbf{w}_{t+1}$.
Solving the above optimization problem for time step $t$ yields an optimal expert greedy action, denoted by $\mathbf{a}_t^*$. This one-step greedy action is then stored in the replay buffer together with the corresponding $(s_t, a_t, r_t, s_{t+1})$ tuple. In each training iteration of the actor-critic algorithm, a mini-batch of $N$ tuples is sampled from the replay buffer. Using the sampled states, the actor's corresponding actions are computed, and the log-loss between the actor's actions and the one-step greedy actions is calculated:
$L_{\mathrm{BC}} = - \frac{1}{N} \sum_{i=1}^{N} \mathbf{a}_i^* \cdot \log \mu(s_i \mid \theta^{\mu}).$
Gradients of the log-loss with respect to the actor network can then be calculated and used to perturb the weights of the actor slightly. Using this loss directly would prevent the learned policy from improving significantly beyond the demonstration policy, as the actor would always be tied back to the demonstrations. To avoid this, a factor is used to discount the gradients such that the actor network is only slightly perturbed towards the one-step greedy action, thus maintaining the stability of the underlying RL algorithm. Following this, the usual DDPG algorithm described above is executed to train the agent.
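As an illustration of the idea (not the paper's exact formulation), the one-step greedy action can be approximated with a projected-gradient solver over the probability simplex, assuming a linear next-period return objective with an L1 transaction-cost penalty; all function names, constants and data below are assumptions:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al.)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def one_step_greedy(price_ratios, w_prev, cost=0.0025, lr=0.1, iters=500):
    """Hindsight one-step greedy portfolio: maximize next-period return
    minus an L1 transaction-cost penalty, subject to weights on the simplex."""
    w = w_prev.copy()
    for _ in range(iters):
        grad = price_ratios - cost * np.sign(w - w_prev)  # subgradient ascent
        w = project_simplex(w + lr * grad)
    return w

y = np.array([1.00, 1.10, 0.90])     # next period's price ratios (hindsight)
w_prev = np.full(3, 1.0 / 3.0)       # current uniform portfolio
w_star = one_step_greedy(y, w_prev)  # concentrates on the best asset
```

With a linear return objective the greedy action tends to a vertex of the simplex, i.e. an all-in allocation to the best-performing asset once transaction costs are small.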

Final accnt. value 574859 570482 586150 571853 571578 575430 577293 571417 580899
Ann. return 7.25% 7.14% 8.64% 7.26% 7.24% 7.60% 7.77% 7.22% 8.09%
Ann. volatility 12.65% 12.79% 14.14% 12.80% 12.79% 12.84% 12.81% 12.77% 12.77%
Sharpe ratio 0.57 0.56 0.61 0.57 0.57 0.59 0.61 0.57 0.63
Sortino ratio 0.80 0.78 0.87 0.79 0.79 0.83 0.85 0.79 0.89
1.27% 1.30% 1.41% 1.30% 1.30% 1.29% 1.28% 1.30% 1.27%
1.91% 1.93% 2.11% 1.94% 1.93% 1.93% 1.92% 1.93% 1.91%
MDD 13.10% 13.80% 12.20% 13.70% 13.70% 12.70% 12.60% 13.70% 12.40%
Table 1: Performance of the different models (the last column represents the IPM+DAM+BCM model; ann. and accnt. stand for annualized and account, respectively).

4 Experiments

All experiments were conducted using a backtesting environment (the algorithm of our model-based DDPG agent is detailed in supplementary material Section A), and the assets used are as described in Section 3. The period from 1 Jan 2005 to 31 Dec 2016 was used to train the agent, while the period between 1 Jan 2017 and 4 Dec 2018 was used to test it. We initialize our portfolio with in cash. We implement a CRP benchmark where funds are equally distributed among all assets, including the cash asset. We also compare against the prior work (Jiang et al., 2017) as a baseline, which can be considered a vanilla model-free DDPG algorithm (i.e., without data augmentation, infused prediction or behavioral cloning); in fact, our baseline is superior to the prior work (Jiang et al., 2017) due to the addition of prioritized experience replay and parameter noise for better exploration. It should be noted that, compared to the implementation in (Jiang et al., 2017), our baseline agent is situated in a more realistic trading environment that does not assume immediate trade execution. In particular, our backtesting engine executes market orders at the open of the next OHLC bar (refer to Section 3 for details on these terms) and adds slippage to the trading costs. Additional practical constraints are applied such that fractional trades of an asset are not allowed.

As shown in Table 1, adding both DAM and BCM to the baseline leads to a very small increase in the Sharpe ratio (from to ) and the Sortino ratio (from to ). We can draw two hypotheses here. First, we hypothesize that the vanilla DDPG algorithm (as used within our framework) is not over-fitting to the data, since data augmentation has a relatively small impact on performance. Similarly, it could be hypothesized that the baseline is efficient in its use of the signals available in the training set, such that the addition of behavior cloning by itself yields a negligible performance improvement. By making use of just IPM, we observe a significant improvement over the baseline in terms of the Sharpe (from to ) and Sortino (from to ) ratios, indicating that we are able to earn more return per unit of risk taken. However, we note that the volatility of the portfolio is significantly increased (from to ). This could be due to the agent over-fitting to the training set with the addition of this module: when the agent sees the testing set, its actions could result in more frequent losses, thereby increasing volatility.

To reduce over-fitting, we integrate IPM with DAM. As can be seen in Table 1, the Sharpe ratio is actually reduced slightly compared to the IPM case (from to ), although volatility is significantly reduced (from to ), which lowers the overall risk of the portfolio. Moreover, this substantiates our hypothesis that the agent was over-fitting to the training set. One possible reason why IPM over-fits to the training set while the baseline agent does not is that the agent with IPM is inherently more complex due to the presence of a prediction model; the larger network structure results in a higher probability of over-fitting to the training data. As such, using DAM to diversify the training set is highly beneficial in containing such over-fitting.

One drawback of using IPM+DAM is the decreased return-to-risk ratios mentioned above. To mitigate this, we make use of all our contributions as a single framework (i.e., IPM+DAM+BCM). As shown in Table 1, we not only recover the performance of the model trained using just IPM, but surpass it in terms of both the Sharpe (from to ) and Sortino ratios (from to ). It is worth noting that the addition of BCM has a strong impact compared to IPM+DAM in reducing volatility (from to ) and MDD (from to ). The observation that behavior cloning reduces downside risk is further substantiated when comparing the model trained via IPM with the model trained with IPM and BCM: the Sharpe and Sortino ratios are maintained while annualized volatility is significantly reduced. In addition, when comparing the baseline with the baseline plus BCM, we see that MDD is reduced and the Sharpe and Sortino ratios improve slightly.

We can conclude that IPM significantly improves portfolio management performance in terms of the Sharpe and Sortino ratios. This is particularly attractive for investors who aim to maximize their return per unit of risk. In addition, DAM helps to prevent over-fitting, especially in the case of larger and more complex network architectures, although it may impact reward-to-risk performance. BCM, as envisioned, helps to reduce portfolio risk, as seen by its ability to reduce either volatility or MDD across all runs; it is also interesting to note its ability to slightly improve the reward-to-risk ratios. Enabling all three modules, our proposed model-based approach achieves significant performance improvements compared with the benchmark and the baseline. In addition, we provide details of an additional risk adjustment module (which can adjust the reward function contingent on risk), with experimental results, in the supplementary material. Usage of the presented modules with a PPO-based on-policy algorithm is also provided there.

5 Conclusion

In this paper, we proposed a model-based deep reinforcement learning architecture to solve the dynamic portfolio optimization problem. To achieve a profitable and risk-sensitive portfolio, we developed infused prediction, GAN-based data augmentation and behavior cloning modules, and further integrated them into an automatic trading system. The stability and profitability of our proposed model-based RL trading framework were empirically validated in several independent experiments with real market data and practical constraints. Our approach is applicable not only to the financial domain, but also to general reinforcement learning domains that require practical consideration of decision-making risk and training-data scarcity. Future work could test our model's performance on gaming and robotics applications.


Appendix A Algorithms

Our proposed model-based off-policy actor-critic style RL architecture is summarized in Algorithm 1.

  Algorithm 1 Dynamic portfolio optimization algorithm (off-policy version).


1:  Input: Critic , actor and perturbed actor networks with weights , , and standard deviation of parameter noise .
2:  Initialize target networks , with weights ,
3:  Initialize replay buffer
4:  for episode  do
5:     Receive initial observation state
6:     for  do
7:         Predict future price tensor with prediction models using and form augmented state
8:         Use perturbed weight to select action
9:         Execute action , observe reward and new state
10:         Predict next future price tensor using prediction models with inputs and form the augmented state
11:         Solve the optimization problem (1) for the expert greedy action
12:         Store transition in
13:         Sample a minibatch of transitions, , from via prioritized replay, according to temporal difference error
14:         Compute
15:         Update the critic by annealing the prioritized replay bias while minimizing the loss:
16:         Maintain the ratio between the actual learning rates of the actor and critic
17:         Update using the sampled policy gradient:
18:         Calculate the expert auxiliary loss in (2) and update using with factor
19:         Update the target networks:
20:         Create adaptive actor weights from current actor weight and current :
21:         Generate adaptive perturbed actions for the sampled transition starting states : . With previously calculated actual actions , calculate the mean induced action noise:
22:         Update : if , , otherwise
23:     end for
24:     Update perturbed actor:
25:  end for


An on-policy version of our proposed model-based PPO style RL architecture can be found in Algorithm 2.

  Algorithm 2 Dynamic portfolio optimization algorithm (on-policy version).


1:  Input: Two-head policy network with weights , policy head and value head , and clipping threshold .
2:  for episode  do
3:     Receive initial observation state
4:     for  do
5:         Predict future price tensor with prediction models using and form augmented state
6:         Use the policy-head of policy network to produce the action mean and variance
7:         Execute action , observe reward and new state
8:         Compute the DSR or D3R based on (4) or (6)
9:         Predict next future price tensor using prediction models with inputs and form the augmented state
10:         Solve the optimization problem (1) for the expert greedy action
11:     end for
12:     Compute the probability ratio:
13:     Obtain the value estimate from the value head and calculate the advantage estimate
14:     for  do
15:         Update the policy network using the gradient of the clipped surrogate objective:
16:         Update the policy network using the gradient of value loss:
17:         Calculate the expert auxiliary loss in (2) and update policy network using with factor
18:     end for
19:  end for


Appendix B Experiments for On-policy Algorithm

As shown in Table 2, the proposed IPM improves the performance of both the PPO baseline and the baseline+BCM model. For brevity, we do not enumerate all module combinations; the purpose is to show that the proposed RL architecture can be extended to the on-policy setting.

For the trained PPO+BCM model, we arbitrarily chose the testing period between 17 Apr 2018 and 3 Dec 2018 to depict how the portfolio weights evolve over trading time in Figure 3. The corresponding trading signals are shown in Figure 4. It can be seen from both figures that the portfolio weights produced by the RL agent adapt to market changes and evolve over time without falling into a local optimum.

Figure 3: Portfolio weights for different assets (assets from top to bottom are Cash, Costco Wholesale Corporation, Cisco Systems, Ford Motors, Goldman Sachs, American International Group, Amgen Inc and Caterpillar).

Figure 4: Trading signals (last row is the S&P 500 index).

Appendix C Hyperparameters for Experiments

We report the hyperparameters for the actor and critic networks in Table 3.


The nonlinear dynamic Boltzmann machine in the infused prediction module uses the following hyperparameter settings: delay , decay rates , and learning rate , with the standard RMSProp optimizer. The input and output dimensions were fixed at three times the number of assets, corresponding to the high, low and close percentage change values. The RNN layer dimension is fixed at 100 units with a nonlinear activation function. A zero-mean noise with standard deviation was applied to each of the input dimensions at each time step, in order to slightly perturb the inputs to the network. This injected noise was cancelled by applying a standard Savitzky-Golay (savgol) filter (as implemented in the scientific computing package SciPy) with window length and polynomial order .
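The paper uses SciPy's Savitzky-Golay implementation; the pure-numpy sketch below illustrates the underlying idea of fitting a local polynomial in each sliding window and keeping its center value (the window length and polynomial order are placeholders, since the actual values are elided above):

```python
import numpy as np

def savgol_smooth(x, window=5, order=2):
    """Savitzky-Golay style smoothing: fit a polynomial of the given
    order to each sliding window and keep its value at the center.
    Endpoints are left unchanged for simplicity."""
    half = window // 2
    y = x.astype(float).copy()
    t = np.arange(window) - half          # centered window coordinates
    for i in range(half, len(x) - half):
        coeffs = np.polyfit(t, x[i - half:i + half + 1], order)
        y[i] = np.polyval(coeffs, 0.0)    # value of the local fit at the center
    return y
```

Because each window is fit exactly when the signal is itself a polynomial of the chosen order, the filter removes high-frequency noise while preserving smooth trends.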

For the variant of WaveNet (van den Oord et al., 2018) in the infused prediction module, we choose the number of dilation levels to be 6, filter length to be , the number of filters to be 32 and the learning rate to be . Inputs are scaled with min-max scaler with a window size equal to the receptive field of the network, which is .

For the DAM, we set the number of training samples to 30,000, the noise latent dimension to 8, the batch size to 128, the time embedding to 95, the generator RNN hidden units to 32, the discriminator RNN hidden units to 32 and the learning rate to . For each episode in Algorithm 1, we generate and append two months of synthetic market data for each asset.

For the BCM, we choose the factor to discount the gradient of the log-loss.

                    Baseline   BCM        IPM+BCM
Final accnt. value  595332     672586     711475
Ann. return         5.23%      15.79%     18.25%
Ann. volatility     14.24%     13.96%     14.53%
Sharpe ratio        0.37       1.13       1.26
Sortino ratio       0.52       1.62       1.76
                    1.41%      1.39%      1.38%
                    2.10%      2.09%      2.20%
MDD                 22.8%      12.40%     11.60%
Table 2: Performance for the on-policy PPO-style dynamic portfolio optimization algorithm (accnt. and ann. are abbreviations for account and annualized, respectively).
                        Actor network               Critic network
FE layer type           RNN bidirectional LSTM      RNN bidirectional LSTM
FE layer size           20, 8                       20, 8
FA layer type           Dense                       Dense
FA layer size           256, 128, 64, 32            256, 128, 64, 32
FA layer activation     Leaky ReLU                  Leaky ReLU
Optimizer               Gradient descent optimizer  Adam optimizer
Dropout                 0.5                         0.5
Learning rate           synchronized to be 100 times slower than the critic's actual learning rate
Episode length          650
Number of episodes      200
for parameter noise     0.01
Replay buffer size      1000
Table 3: Hyperparameters for the DDPG actor and critic networks.

Appendix D Details of Infused Prediction Module

d.1 Nonlinear Dynamic Boltzmann Machine Algorithm

Figure 5: A nonlinear dynamic Boltzmann machine unfolded in time. There are no connections between units within a layer; each unit is connected to other units only through time. The lack of intra-layer connections yields the conditional independence depicted.

In Figure 5, we show the typical unfolded structure of a nonlinear dynamic Boltzmann machine. The RNN layer (of a reservoir-computing type architecture) (Jaeger, 2003; Jaeger & Haas, 2004) is used to create nonlinear feature transforms of the historical time series and to update the bias parameter vector as follows:

where is the dimensional state vector at time of the RNN, and is the learned output weight matrix that connects the RNN state to the bias vector. The RNN state is updated based on the input time series as follows:

Here, and are the RNN internal weight matrix and the weight matrix projecting the time-series input into the RNN layer, respectively.
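A minimal sketch of this reservoir-style update, assuming a tanh nonlinearity and the 100-unit RNN layer mentioned in Appendix C; the weight names, scales and dimensions below are illustrative, and only the output matrix connecting the state to the bias would be learned:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_rnn, n_bias = 9, 100, 9     # e.g. 3 assets x (high, low, close)

# Fixed random reservoir weights; only A is learned in this scheme.
W = rng.normal(scale=0.1, size=(n_rnn, n_rnn))   # recurrent weights
U = rng.normal(scale=0.1, size=(n_rnn, n_in))    # input projection
A = np.zeros((n_bias, n_rnn))                    # learned output weights
b = np.zeros(n_bias)                             # static part of the bias

def step(s_prev, x_t):
    """One reservoir update: nonlinear state transition, then the
    time-varying bias produced from the new state."""
    s_t = np.tanh(W @ s_prev + U @ x_t)
    bias_t = b + A @ s_t
    return s_t, bias_t

s, x = np.zeros(n_rnn), rng.normal(size=n_in)
s, bias = step(s, x)
```

The fixed random recurrent weights are characteristic of reservoir computing: the state provides a rich nonlinear feature of the input history, and only the linear readout is trained.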

The NDyBM is trained online to predict based on the estimated for all time observations . The parameters are updated such that the log-likelihood of the given financial time-series data is maximized. We can derive exact stochastic gradient update rules for each of the parameters using this objective (see the original paper). Such a local update mechanism allows an update of the NDyBM parameters; a single-epoch update of the NDyBM thus takes sub-seconds, compared to tens of seconds for the WaveNet-inspired model. When scaling up to a large number of assets in an online learning scenario, this can provide significant computational benefits.

Our implementation of the NDyBM-based infused prediction module builds on publicly available open-source code. Algorithm 3 describes the basic steps.

  Algorithm 3 Online asset price change prediction with the NDyBM IPM module.


1:  Require: All the weight and bias parameters of the NDyBM are initialized to zero. The RNN weights are initialized randomly from and from . The FIFO queue is initialized with zero vectors, and the eligibility traces are initialized with zero vectors
2:  Input: Close, high and low percentage price change for each asset at each time step
3:  for   do
4:     Compute using & update the bias vector based on RNN layer
5:     Predict the expected price change pattern at time using
6:     Observe the current time series pattern at
7:     Update the parameters of NDyBM based on
8:     Update FIFO queues and eligibility traces by
9:     Update RNN layer state vector
10:  end for


d.2 WaveNet inspired multivariate time-series prediction

We use an autoregressive generative model, a variant of parallel WaveNet (van den Oord et al., 2018), to learn the temporal pattern of the percentage price change tensor , the part of the state space that is assumed to be independent of the agent's actions. Our network is inspired by previous work on adapting WaveNet to time series prediction (Mittelman, 2015; Borovykh et al., 2017). We denote the asset price tensor at time as . The joint distribution of prices over time is modelled as a factorized product of probabilities conditioned on a past window of size :

The model parameter is estimated through maximum likelihood estimation (MLE) with respect to . The joint probability is factorized both over time and over different assets, and the conditional probabilities are modelled as stacks of dilated causal convolutions. Causal convolution ensures that the output does not depend on future data; it can be implemented by front zero-padding in the time dimension such that the output has the same size in the time dimension as the input.
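A minimal sketch of causal convolution via front zero padding; the filter values and dilation factor below are illustrative, not the paper's:

```python
import numpy as np

def causal_dilated_conv(x, f, dilation=1):
    """1-D causal convolution: y[t] depends only on x[t], x[t-d], ...,
    implemented by front zero-padding so output length equals input length."""
    k = len(f)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])   # zeros before the series starts
    return np.array([
        sum(f[i] * xp[pad + t - i * dilation] for i in range(k))
        for t in range(len(x))
    ])

x = np.arange(6, dtype=float)
y = causal_dilated_conv(x, f=np.array([1.0, 1.0]), dilation=2)
# Here y[t] = x[t] + x[t-2], with implicit zeros for t < 2.
```

Changing a value of the input at time T only affects outputs at t >= T, which is exactly the causality property required for autoregressive prediction.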

Figure 6 shows the TensorBoard visualization of a dilated convolution stack of our WaveNet variant. At each time , the input window first goes through the common layer , a depthwise separable 2-d convolution followed by a convolution, with time and features (close, high, low) as height and width, and assets as channels. The output from is then fed into different stacks of the same architecture as depicted in Figure 6, one for each asset. and denote dilated convolutions with filter length and relu activation at level , which has dilation factor . Each takes the concatenation of and as input; this concatenation is represented by the residual blocks in the diagram. The final layer in Figure 6 is a convolution with filters, one for each of high, low and close, and the output is exactly . The outputs from the different stacks are then concatenated to produce the prediction .

Figure 6: The network structure of WaveNet.

The reason for modelling the probabilities this way is two-fold. First, it addresses the dependency on historical patterns of the financial market by using a high-order autoregressive model to capture long-term patterns. Models with recurrent connections can capture long-term information since they have internal memory, but they are slow to train. A fully convolutional model can process inputs in parallel, resulting in faster training compared to recurrent models, but the number of layers needed is linear in . This inefficiency can be solved by using stacked dilated convolutions. A dilated convolution with dilation factor uses filters with zeros inserted between their values, which allows it to operate at a coarser scale than a normal convolution with the same effective filter length. When stacking dilated convolution layers with filter length , if an exponentially increasing dilation factor is used at each layer , the effective receptive field will be , where is the total number of layers. Thus a large receptive field can be achieved with a number of layers only logarithmic in , requiring far fewer parameters than a normal fully convolutional model.
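The receptive-field growth described above can be sketched as follows; assuming a filter length of 2 (the actual value is elided above) and the 6 dilation levels used in our WaveNet variant, the receptive field would be 64 time steps:

```python
def receptive_field(filter_length, n_levels):
    """Effective receptive field of stacked dilated causal convolutions
    with dilation factors 1, 2, 4, ..., 2**(n_levels - 1): each level
    with dilation d adds (filter_length - 1) * d past steps."""
    return (filter_length - 1) * (2 ** n_levels - 1) + 1
```

Doubling the number of levels roughly doubles the receptive field while only adding a constant number of layers, which is the key efficiency argument for dilated stacks.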

Secondly, by factorizing not only over time but also over different assets, this model makes parallelization easier and is potentially more interpretable. The prediction of each asset is conditioned on all asset prices in the past window, and the model is easy to run in parallel since each stack can be placed on a different GPU.

The serial cross-correlation at lag for a pair of discrete time series is defined as

Figure 7: Out of sample actual and predicted closing price movement for asset Cisco Systems.

Roughly speaking, measures the similarity between and a lagged version of . A peak at indicates that has the highest correlation with on average. If is the predicted time series and is the real data, a naive prediction model that simply takes the last observation as its prediction would thus have . A prediction model will sometimes learn this trivial prediction, and we call such a model trend following. To test whether our model has learned a trend-following prediction, we convert the predicted percentage change vector back to prices (see Figure 7) and calculate the serial cross-correlation between the predicted and actual series. Figure 8 clearly shows that there is no trend-following behavior, as .
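The trend-following check can be sketched as follows: a naive model whose "prediction" is just yesterday's observation produces a cross-correlation peak at lag 1 (the data below is synthetic and all names are illustrative):

```python
import numpy as np

def serial_xcorr(pred, real, lag):
    """Pearson correlation between pred[t] and real[t - lag]; a peak at
    lag 1 means the prediction is essentially yesterday's observation."""
    if lag > 0:
        a, b = pred[lag:], real[:-lag]
    else:
        a, b = pred, real
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(1)
real = rng.normal(size=500).cumsum()                     # synthetic price path
naive_pred = np.roll(real, 1); naive_pred[0] = real[0]   # pred[t] = real[t-1]

lags = range(0, 5)
peak = max(lags, key=lambda k: serial_xcorr(naive_pred, real, k))
```

A genuine predictive model should instead peak at lag 0, i.e. its output should line up with the realized series rather than a delayed copy of it.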

Figure 8: Trend following analysis on predicted closing price movement for asset Cisco Systems.

Appendix E Details of Data Augmentation Module

Maximum mean discrepancy (MMD) (Gretton et al., 2012) is a pseudometric over , the space of probability measures on some compact metric set . Given a family of functions , MMD is defined as

When is the unit ball in a reproducing kernel Hilbert space (RKHS) associated with a universal kernel , i.e. , is not only a pseudometric but also a proper metric; that is, if and only if . One example of a universal kernel is the commonly used Gaussian radial basis function (RBF) kernel


Given samples from , the square of MMD has the following unbiased estimator:

We also have a biased estimator, in which the empirical estimate of the feature-space means is used.
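A minimal numpy sketch of the unbiased squared-MMD estimator with the Gaussian RBF kernel (the bandwidth, sample sizes and data below are illustrative):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimator of squared MMD (Gretton et al., 2012): the
    diagonal terms of the within-sample kernel sums are excluded."""
    m, n = len(X), len(Y)
    Kxx = rbf_kernel(X, X, sigma)
    Kyy = rbf_kernel(Y, Y, sigma)
    Kxy = rbf_kernel(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
same = rng.normal(0.0, 1.0, size=(200, 1))     # same distribution as X
shifted = rng.normal(3.0, 1.0, size=(200, 1))  # clearly different distribution
```

The estimate is near zero when both samples come from the same distribution and is large for a mean-shifted sample, which is what makes it usable as a training signal for the generator.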