Factor Representation and Decision Making in Stock Markets Using Deep Reinforcement Learning

08/03/2021 ∙ by Zhaolu Dong, et al. ∙ Georgia Institute of Technology

Deep reinforcement learning is a branch of machine learning in which an agent learns to act based on the state of its environment in order to maximize its total reward. It offers a good opportunity to model the complexity of portfolio choice in high-dimensional, data-driven environments by leveraging the representational power of deep neural networks. In this paper, we build a portfolio management system using direct deep reinforcement learning that periodically makes optimal portfolio choices among S&P500 underlying stocks by learning a good factor representation (as input). The results show that effective learning of market conditions and optimal portfolio allocations can significantly outperform the average market.




1 Literature Review

Both deep learning (DL) and reinforcement learning (RL) are biologically inspired frameworks, theoretically rooted in neuroscientific studies of behavior control. In practice, they are receiving increasing attention in finance applications. One actively researched area applies such models to market forecasting; a recent literature review of this field is given in Kimura (2019). Another area is to solve dynamic financial decision-making problems or to train the related trading systems with these models. For instance, Jiang et al. J.J. (2017) use the model-free Deep Deterministic Policy Gradient (DDPG) Hunt (2016) to dynamically optimize cryptocurrency portfolios. Similarly, Liang et al. Li (2018) optimize stock portfolios using DDPG as well as Proximal Policy Optimization (PPO) Klimov (2017). Fister et al. Jagriˇc (2019) train a trading system with a deep LSTM, while Moody and Saffell (1999) use recurrent reinforcement learning.

Deng et al. Dai (2017) use DL to automatically sense dynamic market conditions for informative feature learning and RL to make trading decisions. To the best of our knowledge, Dai (2017) is the first paper to learn (future) market conditions and trading decisions at the same time, i.e., to combine the two areas mentioned above. However, Dai (2017) only handles one share of a single asset. Building upon this, we extend the DL and RL framework, learned in a complex neural network (NN), to a portfolio setting, extracting features from multiple assets and then making optimal trading decisions.

In reinforcement learning, an agent must be able to sense the environment, make decisions, and achieve goals Barto (1998). Reinforcement learning methods can generally be categorized into value-function-based methods and actor-critic methods Schuitema (2012). A value-function-based method, such as Q-learning Hong (2007), solves optimization problems defined in a discrete space by estimating a value function such as the future discounted return. Applications of Q-learning to the portfolio optimization problem can be found in Jinjian Z. (2009), Jin and El-Saawy (2016) and Weijs (2018). However, this method is not ideal for trading problems because the trading environment is more complex than a discrete space. Unlike the value-function-based method, in an actor-critic method the “critic” learns a value function and then uses the result to update the parameters of the “actor.” Actor-critic methods such as DDPG and PPO applied to the portfolio optimization problem can be found in J.J. (2017) and Liu (2020). A closely related actor-based approach is direct reinforcement learning, which learns the policy directly from continuous sensory data. Moody and Saffell (1999) found that the recurrent reinforcement learning (RRL) direct reinforcement framework eliminates the need to build a forecasting model, avoids Bellman’s curse of dimensionality, and provides better results than Q-learning Moody and Saffell (2001).

2 Problem Definition

Portfolio management is the continuous reallocation of capital across a number of financial assets. In an automatic portfolio management system, these investment decisions and actions are made periodically, based on the current state of investment information. This section provides a mathematical setting for the portfolio management problem.

2.1 Trading Period

In this project, we consider daily stock trading, so the trading agent reallocates its assets once per trading day. For each trading day, we have the basic market information, namely the opening, highest, lowest and closing prices, as well as the trading volumes. Additional integrated market information includes manually constructed factors built from historical market data and news. The back-test experiments assume that on each trading day stocks can be bought or sold at the closing price. We only consider long positions since short selling is risky.

2.2 Data

The agent manages portfolios among the S&P500 underlying stocks. The historical daily price information (up to today) can be obtained using the “tidyquant” package in R. To use as much information as possible, we also include several technical analysis indicators such as EMA, RSI and MACD. Further, we include macroeconomic features and the sentiment of financial news.
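As a rough illustration of how such technical indicators could be derived from a closing-price series, the sketch below computes an exponential moving average and a simple (non-smoothed) RSI in plain Python. The function names, the window lengths and the sample prices are assumptions made for illustration; the paper does not state its indicator settings.

```python
def ema(prices, span):
    """Exponential moving average with smoothing factor 2 / (span + 1)."""
    alpha = 2.0 / (span + 1)
    out = [prices[0]]  # seed with the first observation
    for p in prices[1:]:
        out.append(alpha * p + (1 - alpha) * out[-1])
    return out


def rsi(prices, window=14):
    """Simple RSI over the last `window` price changes (plain sums, no smoothing)."""
    changes = [b - a for a, b in zip(prices[:-1], prices[1:])][-window:]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if losses == 0:
        return 100.0  # no down moves inside the window
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)


closes = [100.0, 101.0, 102.5, 101.5, 103.0, 104.0]  # hypothetical closing prices
ema_fast = ema(closes, span=3)
rsi_value = rsi(closes, window=5)
```

MACD could be built from the same helper as the difference between a fast and a slow EMA.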

2.3 Mathematical Formalism

We construct portfolios of one risk-free asset (bond) and $m$ stocks. At trading day $t$, the risk-free asset earns the risk-free rate of return $r^f_t$, and the $i$th stock has price $p_{i,t}$ for $i = 1, \dots, m$. The market return of the $i$th stock for the period ending at time $t$ is defined as $r_{i,t} = p_{i,t}/p_{i,t-1} - 1$. Defining the portfolio weight of the $i$th stock at time $t$ as $w_{i,t}$, an agent that only takes long positions has portfolio weights that satisfy

$$w_{i,t} \ge 0, \qquad \sum_{i=0}^{m} w_{i,t} = 1,$$

where $w_{0,t}$ represents the weight allocated to the bond. Given this constraint, it is natural to treat the portfolio weights as the output of a neural network whose output activation function is a softmax function.

When multiple assets are considered, the effective portfolio weights change every trading day due to price movements. Thus, maintaining constant or desired portfolio weights requires that positions be adjusted at each trading day. If there are no transaction costs, the wealth after $T$ periods for a portfolio management system is

$$W_T = W_0 \prod_{t=1}^{T} (1 + R_t) = W_0 \prod_{t=1}^{T} \left( \sum_{i=0}^{m} w_{i,t-1} \frac{p_{i,t}}{p_{i,t-1}} \right),$$

where $R_t$ is the portfolio return at time $t$, which depends on price changes and previous allocations, $p_{0,t}$ is the bond price, and by the risk-free property we have $p_{0,t}/p_{0,t-1} = 1 + r^f_t$.

In the real world, transaction costs are not negligible for frequent trading. We assume a proportional transaction cost $\delta$ is charged whenever positions are rebalanced. Then the wealth after $T$ periods becomes

$$W_T = W_0 \prod_{t=1}^{T} \left( \sum_{i=0}^{m} w_{i,t-1} \frac{p_{i,t}}{p_{i,t-1}} \right) \left( 1 - \delta \sum_{i=1}^{m} \left| w_{i,t} - \tilde{w}_{i,t} \right| \right),$$

where $\tilde{w}_{i,t} = w_{i,t-1} (p_{i,t}/p_{i,t-1}) \big/ \sum_{j=0}^{m} w_{j,t-1} (p_{j,t}/p_{j,t-1})$ is the effective portfolio weight of stock $i$ before readjusting at time $t$. The per-period return $R_t = W_t/W_{t-1} - 1$ is then net of costs.

At the beginning of each time $t$, based on the observations, the agent tries to find the optimal portfolio weights that maximize the final portfolio wealth. Thus, the objective is

$$\max_{\{w_t\}} \; \mathbb{E}\left[ U(W_T) \right],$$

where $U$ is the utility. The most common choice is to take $U$ as the log utility of final wealth, so that the objective is equivalent to maximizing the cumulative portfolio log-returns $\sum_{t=1}^{T} \log(W_t/W_{t-1})$. Other utility functions, such as the portfolio's Sharpe ratio, can also be considered:

$$S_T = \frac{\mathrm{mean}(R_1, \dots, R_T)}{\mathrm{std}(R_1, \dots, R_T)},$$

where the standard deviation of returns in the denominator can also be replaced by drawdowns to further control risk.
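The wealth recursion with proportional transaction costs can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function and argument names are hypothetical, index 0 is taken to be the bond, costs are charged on the stock positions only, and the Sharpe ratio uses the population standard deviation.

```python
import math


def softmax(scores):
    """Map raw scores to long-only weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def step_wealth(wealth, w_prev, w_new, gross, delta):
    """One trading period: apply price moves, then pay for rebalancing.

    w_prev : weights held over the period (index 0 is the bond)
    w_new  : target weights chosen at the end of the period
    gross  : gross returns p_t / p_{t-1} per asset (bond: 1 + r_f)
    delta  : proportional transaction cost rate
    """
    growth = sum(w * g for w, g in zip(w_prev, gross))
    # Effective weights after prices move, before readjusting.
    w_eff = [w * g / growth for w, g in zip(w_prev, gross)]
    # Assumption: cost is charged on the stock positions only (indices 1..m).
    turnover = sum(abs(a - b) for a, b in zip(w_new[1:], w_eff[1:]))
    return wealth * growth * (1 - delta * turnover), w_eff


def sharpe(returns):
    """Mean return divided by the (population) standard deviation."""
    mean = sum(returns) / len(returns)
    var = sum((r - mean) ** 2 for r in returns) / len(returns)
    return mean / math.sqrt(var)


weights = softmax([0.0, 0.0])  # equal split between bond and one stock
new_wealth, drifted = step_wealth(100.0, weights, weights, [1.0, 1.2], 0.01)
log_return = math.log(new_wealth / 100.0)  # one term of the log-utility sum
```

In this toy run the stock rises by 20%, so its effective weight drifts above one half and restoring the equal split incurs a small cost.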

2.4 Factor Representation

Good-quality input data is important for training DRL. Financial data usually contain a large amount of noise, jumps, and movement, leading to highly nonstationary time series. To learn robust feature representations directly from data, we borrow an idea from Natural Language Processing, namely using a GRU or LSTM network to learn an integrated feature representation of the historical data. The dropout method applied to the RNN can help to find the underlying features and reduce the uncertainty of the data, following Han and Liu (2017), Han and Liu (2018) and Han et al. (2020).

2.5 Direct Reinforcement Learning

Typical Reinforcement Learning methods fall into two types: critic-based methods (learning value functions) and actor-based methods (learning actions). Critic models are algorithms that directly estimate value functions and solve problems defined in a discrete space using dynamic programming. Among the available algorithms, Q-learning is the most widely used, and it was the first method adopted in our project. However, as mentioned above, Q-learning failed to approximate the complex market dynamics. Moreover, calculating the value function usually requires computing the future discounted return. In reality, when we deploy the RL system in online learning and trading mode, information about future returns is not available for computation, and hence critic models are not ideal for trading.

For actor-based RL models, the agent directly evaluates a trading policy defined over a continuous action space, instead of valuing market conditions in discrete states. Without dynamic programming, the optimization simply requires a differentiable function and a few latent variables, which saves a considerable amount of computation. In addition, the actor-based model is capable of handling continuous data. It is therefore well suited to financial market dynamics, especially for practical applications in online mode. Specifically, Direct Reinforcement Learning is used in our final model.

With a well-defined mathematical formalism for the utility function, an efficient way to learn the trading policy directly is Direct Reinforcement Learning. Specifically, we take the action at each time period to be the vector of weights allocated to each asset within the portfolio,

$$a_t = w_t = (w_{0,t}, w_{1,t}, \dots, w_{m,t}),$$

and approximate it using the nonlinear equation

$$w_t = \mathrm{RDNN}(f_t, w_{t-1}),$$

where RDNN represents a Recurrent Deep Neural Net mapping, $f_t$ represents the feature vector of market information at time $t$, and $w_{t-1}$ is the previous action. We add the previous action term to the computation in order to take the previous action into account and discourage frequent changes in position, which avoids potentially heavy transaction costs. The output layer of the RDNN maps each final asset allocation into the range (0,1). Note that, with the allocation decided at time $t$, the number of shares we need to trade for asset $i$ will be

$$\Delta n_{i,t} = \frac{(w_{i,t} - \tilde{w}_{i,t}) \, W_t}{p_{i,t}},$$

where $W_t$ is the portfolio wealth, $p_{i,t}$ is the price of asset $i$, and $\tilde{w}_{i,t}$ is the effective weight of asset $i$ before readjusting at time $t$.

In conclusion, Direct Reinforcement Learning aims to learn the set of trading rules for different states (defined by the feature input and previous actions) in order to maximize the expected utility.
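To make the link between a decided allocation and the resulting orders concrete, the sketch below converts target and pre-rebalancing weights into signed share counts. It assumes that the trade size is the weight gap times current wealth divided by the price; the function name and the numbers are hypothetical, and fractional shares are allowed for simplicity.

```python
def shares_to_trade(wealth, prices, w_target, w_effective):
    """Signed number of shares to buy (+) or sell (-) for each stock."""
    return [
        (wt - we) * wealth / p
        for wt, we, p in zip(w_target, w_effective, prices)
    ]


orders = shares_to_trade(
    wealth=100_000.0,
    prices=[50.0, 200.0],       # hypothetical closing prices
    w_target=[0.30, 0.20],      # weights chosen by the agent
    w_effective=[0.25, 0.25],   # weights after price drift, before trading
)
```

Here the first stock is bought and the second is sold, since its target weight lies below its drifted weight.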

2.6 RNN, DL and RDNN

In reality, the price movement of a stock depends on its past prices and trades, so the analysis of stocks should use reasoning about previous events to inform later ones, based on Han and Liu (2016), Han et al. (2017) and Han et al. (2021). Such sequential processing can be handled by a Recurrent Neural Network (RNN), which helps to retain both long-term and short-term information. A recurrent neural network can be understood as multiple copies of the same network, each passing a message to its successor. In our project, we took advantage of RNNs and constructed a one-layer GRU (a type of RNN) for fuzzy feature representation, which links the previous trading allocation $w_{t-1}$ to the input layer in order to discourage frequent trading. This process can be formulated as follows:

$$h_t = \mathrm{ReLU}\left( \langle a, f_t \rangle + b + u \, w_{t-1} \right),$$

where $h_t$ represents the output from the first layer of GRU, $f_t$ is the feature vector at time $t$, $\langle \cdot, \cdot \rangle$ represents the inner product between two input variables, and $a$, $b$ and $u$ represent coefficients for feature regression. After the linear regression, the piecewise-linear function

$$\mathrm{ReLU}(x) = \max(0, x)$$

is used as the activation function, so that the output of the first layer is no less than zero.

In addition, Deep Learning (DL) is a family of machine learning algorithms based on artificial neural networks, which are inspired by the structure and function of the brain Lombardo et al. (2019), Han et al. (2018). With a layer-by-layer structure that hierarchically transforms information, Deep Learning has proven successful in learning behaviours and optimizing decision making. In our project, one drawback of a one-layer NN is the lack of a "brain" to sense market shifts and adjust its trading decisions. Therefore, we introduce Deep Learning into our DRL model to learn trading behaviour with policy gradients for optimized decision making.

Regarding deep representation, a deep neural network has multiple connected layers that hierarchically transform the input into the output vector. On top of the first GRU layer, we further constructed a two-layer DNN to map the output of the fuzzy feature representation into the final asset allocation decision. The output of the $i$th node in layer $l-1$ becomes an input to layer $l$. The transformation between layer $l-1$ and layer $l$ is

$$z_i^{l} = \langle \theta_i^{l}, o^{l-1} \rangle + \beta_i^{l},$$

with $z_i^{l}$ being the intermediate value from the linear regression and

$$o_i^{l} = \mathrm{ReLU}(z_i^{l}) = \max(0, z_i^{l}),$$

where the parameter set $\{\theta_i^{l}, \beta_i^{l}\}$ contains the layer-wise latent variables for the DNN to learn. Note that the second layer of the overall network (the first layer of the DNN) uses ReLU as its activation function, the same as the preceding RNN layer, while the last layer $L$ computes the final asset allocation decisions using the sigmoid as its activation function:

$$w_{i,t} = \sigma(z_i^{L}) = \frac{1}{1 + e^{-z_i^{L}}}.$$

Note that the output of the sigmoid function ranges from 0 to 1 and hence obeys the no-short-selling rule.
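The three-stage mapping described in this section (a recurrent representation layer fed with the previous allocation, a ReLU decision layer, and a sigmoid output layer) can be sketched in plain Python. This is only an illustrative forward pass under simplifying assumptions: the first layer is a plain recurrent ReLU layer rather than a full GRU, the layer sizes are toy values instead of 128-128-64, and all names and the random initialization are hypothetical.

```python
import math
import random


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(<weights_i, inputs> + bias_i)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def rdnn_forward(features, prev_alloc, params):
    """features: market feature vector; prev_alloc: previous allocation."""
    x = features + prev_alloc              # feed the previous action back in
    h = dense(x, *params[0], relu)         # representation layer (non-negative)
    z = dense(h, *params[1], relu)         # first decision layer
    return dense(z, *params[2], sigmoid)   # allocations in (0, 1)


rng = random.Random(0)
n_features, n_assets = 4, 3                # toy sizes
params = [
    init_layer(8, n_features + n_assets, rng),
    init_layer(8, 8, rng),
    init_layer(n_assets, 8, rng),
]
alloc = rdnn_forward([0.1, -0.2, 0.3, 0.0], [0.0, 0.0, 0.0], params)
next_alloc = rdnn_forward([0.2, 0.1, -0.1, 0.4], alloc, params)  # recurrence
```

Because each sigmoid output lies in (0, 1) independently, the allocations do not automatically sum to one; a normalization step or a softmax output would be needed to enforce the budget constraint.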

In conclusion, the overall structure of our DRL model consists of a one-layer RNN for feature representation and a two-layer DNN for learning trading actions. Note that the DNN uses extended features that include previous actions; thus, we refer to the resulting network as an RDNN. Prior research concluded that a neural network with a 128-128-64 neuron structure is the most efficient for financial analysis, so this structure is also adopted in this project.

2.7 Task-Aware Back Propagation Through Time

To adapt frequently to market changes, the parameters need to be updated with back propagation (gradient descent) whenever a batch of data is passed through. However, back propagation runs into issues with recurrent and deep structures. If we denote $\theta$ as a general parameter in the RDNN model, then the gradient of the utility $U_T$ can be calculated using the chain rule:

$$\frac{\partial U_T}{\partial \theta} = \sum_{t} \frac{\partial U_T}{\partial w_t} \frac{d w_t}{d \theta}, \qquad \frac{d w_t}{d \theta} = \frac{\partial w_t}{\partial \theta} + \frac{\partial w_t}{\partial w_{t-1}} \frac{d w_{t-1}}{d \theta}, \qquad w_t = g_t^{3}\big( g_t^{2}\big( g_t^{1}(f_t, w_{t-1}) \big) \big),$$

where $g_t^{i}$ represents the nonlinear mapping through the $i$th layer of the neural network at time $t$. Note that, because we also include the previous asset allocation as an input, deriving $dw_t/d\theta$ requires $dw_{t-1}/d\theta$ to be calculated recursively, which makes gradient descent difficult. By setting up a time stack based on different values of $\tau$ for time-based unfolding, the current system reduces to a minimal recurrent structure to which the typical BP method is easily applied. In addition, the parameters’ gradients at each time step are averaged together to form the final gradients.

However, with the time stacks the original DNN becomes deeper, leading to vanishing gradients. To solve this, we set up virtual connections between the policy function and the variables of each time stack to bring in more gradient information.
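The recursive dependence of the gradient on the previous allocation can be illustrated on a one-parameter toy model. The sketch below is not the task-aware BPTT procedure itself; it assumes a scalar policy w_t = sigmoid(theta * f_t + u * w_{t-1}) and a toy utility U = sum_t w_t * r_t, and checks the recursive gradient against a finite difference. All names and numbers are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def utility_and_grad(theta, u, feats, rets):
    """Return U = sum_t w_t * r_t and dU/dtheta via the recursion in time."""
    w_prev, dw_prev = 0.0, 0.0      # initial allocation and its derivative
    total, grad = 0.0, 0.0
    for f, r in zip(feats, rets):
        w = sigmoid(theta * f + u * w_prev)
        slope = w * (1.0 - w)       # derivative of the sigmoid at this step
        # Total derivative: direct term plus the term flowing through w_{t-1}.
        dw = slope * f + slope * u * dw_prev
        total += w * r
        grad += r * dw
        w_prev, dw_prev = w, dw
    return total, grad


feats = [0.5, -1.0, 0.8, 0.2]       # hypothetical features
rets = [0.01, -0.02, 0.015, 0.005]  # hypothetical asset returns
value, grad = utility_and_grad(0.3, 0.7, feats, rets)

# Central finite difference as an independent check of the recursion.
eps = 1e-6
numeric = (
    utility_and_grad(0.3 + eps, 0.7, feats, rets)[0]
    - utility_and_grad(0.3 - eps, 0.7, feats, rets)[0]
) / (2 * eps)
```

Truncating the recursion after a fixed number of steps (resetting `dw_prev` to zero) would correspond to unfolding over a limited time stack.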

2.8 Methods

In this paper, we use Python for coding, along with libraries including pandas, NumPy and PyTorch.

First of all, we set up a financial environment with an initial capital of $100,000 in cash and a commission rate of 0.01% of the traded amount. Trading takes place on a daily basis and no short selling is allowed. The data include daily adjusted open, close, high and low prices and volumes of stocks and ETFs from 2007 to 2019. We select this time range because it covers enough information to demonstrate shifts between bull and bear markets. Then, we calculate technical indicators and vectorize all available information as the state on which the agent is trained. Finally, to stabilize and smooth the price time series, all the information is normalized using data from the past 20 days.
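One way such a 20-day normalization could be implemented is a rolling z-score that uses only past observations, so no future information leaks into the state. The exact normalization scheme is not specified in the text; the z-score form, the function name and the handling of short histories below are assumptions.

```python
import math


def rolling_zscore(values, window=20):
    """Normalize each value with the mean and std of the preceding `window` values."""
    out = []
    for t in range(len(values)):
        past = values[max(0, t - window):t]  # strictly before time t
        if len(past) < 2:
            out.append(0.0)                  # not enough history yet
            continue
        mean = sum(past) / len(past)
        std = math.sqrt(sum((v - mean) ** 2 for v in past) / len(past))
        out.append((values[t] - mean) / std if std > 0 else 0.0)
    return out


prices = [100.0 + 0.5 * t for t in range(30)]  # hypothetical price series
normalized = rolling_zscore(prices, window=20)
```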

In the next step, we build a deep-reinforcement-learning-driven trading agent. We first apply the dropout method, in which randomly selected neurons are ignored during training to enhance model robustness. The dropout rate is set to 20%. We then initialize the first layer of the DRL network, which is a recurrent net with 128 neurons and ReLU as the activation function. After the RNN, we build a two-layer DNN with 128 and 64 neurons respectively, using ReLU as the activation function for the first layer and sigmoid for the second layer. The learning rate is set to 0.001 and the batch size is set to 64.
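The dropout step can be illustrated with a small sketch of "inverted" dropout, where surviving activations are rescaled so that their expected value is unchanged. The inverted variant, the function name and the seed are assumptions; only the 20% rate comes from the text.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # no-op at evaluation time
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(42)
hidden = [1.0] * 1000
dropped = dropout(hidden, rate=0.2, rng=rng)
zero_fraction = sum(1 for v in dropped if v == 0.0) / len(dropped)
```

With a 20% rate, roughly one fifth of the units are zeroed on each training pass, while evaluation uses all units unchanged.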

To compare the performance of our system against the S&P500 benchmark index, we select SPY, the largest ETF tracking the S&P500, as a single-asset portfolio for the first strategy. We then train a portfolio consisting of selected stocks and one risk-free asset. After that, we compare the total profit of these two strategies.

3 Result and Discussion

The strategy with consistent training and trading sequences shows strong profit-making ability, outperforming the buy-and-hold strategy (which simulates the return from the S&P500) by over $250,000 in around 10 years.

Figure 1: Performance comparison between the agent’s strategy and the buy-and-hold strategy in terms of investment return ($)

At the same time, it is worth pointing out from Figure 1 that, during the financial crisis (between late 2008 and early 2009), the values of the two portfolios start to diverge, showing that the Deep Reinforcement Learning system has developed a sense of risk and acts accordingly, thus experiencing smaller losses.
To further investigate the strategy used by DRL under different market environments, we trained an agent during the financial crisis, then stopped training to preserve the strategy and let it trade in the following period. As shown in Figure 2, the portfolio value under this strategy stays in a stable zone while the market experiences sharp declines in March 2009 and March 2010. However, as the market starts to rise in late 2010, the conservative holding of shares leads to the portfolio value being surpassed by the buy-and-hold strategy.

Figure 2: Trading performance of the risk-averse strategy

On the other hand, we guided the agent to develop a risk-seeking strategy by training it during early 2006, which was a rising period. We evaluated its performance during the financial crisis. As expected, the aggressive holding of the portfolio leads to under-performance compared to the benchmark. As indicated by Figure 3, the portfolio value of our strategy stays below that of the buy-and-hold strategy during periods of fluctuation and market collapse.

Figure 3: Trading performance of the risk-seeking strategy

Finally, the shift in the agent’s trading appetite is further demonstrated by the Gross Leverage plot shown below. From 2006 to 2007, the gross leverage stays at around 1 (100%), meaning that our trading agent has all of its wealth allocated to stocks. However, after experiencing the Great Recession in 2008, our agent’s trading behaviour became more conservative, holding less stock when the market is volatile (shown as the troughs between 2008 and 2010).

Figure 4: Gross Leverage (proportion of wealth invested in stocks)
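For reference, the gross leverage plotted here is the proportion of wealth invested in stocks, which for a long-only portfolio can be computed as sketched below. The function name and the sample holdings are hypothetical.

```python
def gross_leverage(shares, prices, cash):
    """Fraction of total wealth held in stocks (between 0 and 1 when long-only)."""
    stock_value = sum(n * p for n, p in zip(shares, prices))
    return stock_value / (stock_value + cash)


fully_invested = gross_leverage([100, 50], [50.0, 100.0], cash=0.0)
half_invested = gross_leverage([100, 50], [50.0, 100.0], cash=10_000.0)
```

A value near 1 corresponds to the fully invested regime described for 2006 to 2007, while lower values correspond to the more conservative holdings after 2008.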

4 Limitations and Next-Step Improvements

The first limitation is that our model does not consider the market impact of trades on stocks. We assume zero market impact; that is, the agent we trained is a small market participant and the capital it invests is so insignificant that it has no influence on the market. Although price impact (either temporary or permanent) is implemented in classical structural models, it is difficult to implement in our model, since our model is purely data-driven and we cannot rewrite the historical data.

The second limitation of our model is that, as trading costs increase, the performance of Deep Direct Reinforcement Learning decreases significantly. To address this issue, the next step of our implementation is to add constraints on turnover. For example, we can set a threshold on the turnover in the basket. Further, we will try different definitions of actions in our DRL. Currently our control is the portfolio weight. In the presence of large transaction costs, this may not be suitable since the portfolio weights naturally change with price movements. Hence, decreasing the cost of transactions requires the learned action to partially predict stock movements, which is difficult. To make this easier, we will try to directly control the number of shares invested in each stock. Decreasing the cost of transactions is then simply equivalent to decreasing the change in the number of shares held. Similarly, we can add constraints on turnover. For instance, we can require half of the stock tickers in the basket to remain unchanged.

Thirdly, our model only considers a limited set of pre-selected stocks. Currently we only model the optimal portfolio allocations given 20 pre-selected stocks. In the future, while fixing the total number of stocks in the portfolio, we may try general optimal stock selection directly from the underlying stocks in the S&P500. This means that the stock tickers in our basket may change over time. Note that increasing the number of stocks will cause exponential growth in computation. As we add more stocks of interest, the wealth achieved by the Deep Reinforcement Learning system degrades significantly due to the costs of high trading frequency and aggressiveness. By controlling turnover as mentioned above, we hope the new general portfolio selection model can generate promising results. These further implementations are described in the following subsections.

4.1 Mask Network for general selection

We introduce a mask for general selection. To select a fixed number of stocks, e.g., 20 stocks, from a large pool, we introduce a mask vector in which the value 1 means selected and 0 means not selected. This mask vector is generated by a separate network, distinct from the actor NN, as shown in the figure below. The mask network takes the state variables of all stocks in the large pool as input and generates a score vector; the mask is then generated by selecting the top 20 scores. Note that at times 0 and 1 the ticker names of the top 20 selected stocks may be totally different (as distinguished by different colors). The portfolio weights are then obtained by normalizing the product of the actions generated by the RDNN and the mask. As a result, the portfolio weights of the 20 selected stocks are non-zero, while the portfolio weights of the remaining stocks are all zero.

Figure 5: Mask network for general stock selection.
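The masking and renormalization step can be sketched as follows. The scores and actions would come from the mask network and the actor network respectively; here they are hypothetical hard-coded numbers, and the function names are assumptions.

```python
def top_k_mask(scores, k):
    """Return 1 for the k highest-scoring stocks and 0 for the rest."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(scores))]


def masked_weights(actions, mask):
    """Zero out unselected stocks and renormalize so the weights sum to one."""
    kept = [a * m for a, m in zip(actions, mask)]
    total = sum(kept)  # positive as long as the selected actions are positive
    return [x / total for x in kept]


scores = [0.2, 0.9, 0.1, 0.7, 0.5]   # hypothetical mask-network scores
actions = [0.5, 0.6, 0.4, 0.2, 0.9]  # hypothetical actor outputs in (0, 1)
mask = top_k_mask(scores, k=2)
weights = masked_weights(actions, mask)
```

Only the two selected stocks receive non-zero weight, and those weights sum to one after normalization.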

4.2 Turnover Control

The mask helps us make general selections from a large pool of stocks, but it introduces a lot of uncertainty. As mentioned, the stock tickers selected at time 0 may be totally different from those selected at time 1. In that case, at time 1 our agent first needs to liquidate all stocks invested in at time 0, which usually generates large transaction costs. To solve this issue, we add a constraint on the turnover of our portfolio, restricting the turnover of stocks to be no more than a given threshold. For example, we can keep half of the stock tickers in our basket unchanged between two consecutive trading days. The tricky part is that we need to guarantee that half of the stock tickers in the basket come from the selections of the previous trading day. In the figure shown, at time 1 the mask network generates a score vector. We only need to select the top 10 additional stocks, since the top 10 stock tickers selected at time 0 must be reserved at time 1. Using the score vector generated at time 1 and the 20 stock tickers selected at time 1, we can update the 10 tickers that will be reserved at time 2. The same process is repeated over the whole time period. The comparison of the general selection model with and without turnover control is shown in the figure below. The green line is the average market performance using a simple buy-and-hold strategy. The blue line shows the general auto-trading result on the pool of S&P 500 underlying stocks without turnover control, while the orange line shows the result with 50% turnover control. Turnover control does indeed help to increase our model's performance.

Figure 6: Mask network for general stock selection with turnover control

Figure 7: Comparison of general selection model with and without turnover control
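The turnover-controlled selection can be sketched as below: the reserved tickers are kept, the remaining slots are filled with the best-scoring other stocks, and the set to be reserved for the next day is updated. How the reserved half is chosen is not fully specified in the text; taking the better-scoring half of the current basket is an assumption, as are the function name and the toy numbers.

```python
def select_with_turnover(scores, reserved, k):
    """Keep `reserved` tickers, then fill up to k with the best remaining scores."""
    candidates = sorted(
        (i for i in range(len(scores)) if i not in reserved),
        key=lambda i: scores[i],
        reverse=True,
    )
    selected = set(reserved) | set(candidates[: k - len(reserved)])
    # Assumption: carry over the better-scoring half of today's basket.
    ranked = sorted(selected, key=lambda i: scores[i], reverse=True)
    next_reserved = set(ranked[: k // 2])
    return selected, next_reserved


scores_t1 = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2]  # hypothetical scores at time 1
reserved_t1 = {1, 3}                        # tickers carried over from time 0
basket_t1, reserved_t2 = select_with_turnover(scores_t1, reserved_t1, k=4)
```

In this toy example tickers 1 and 3 stay in the basket despite low scores, so at most half of the basket is traded out on any day.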

4.3 Action Control on Stock Shares

We can directly control the number of shares invested in each stock. In this setting, decreasing the transaction costs is simply equivalent to decreasing the change in the number of shares held. We change the activation function in the output layer of the actor NN to tanh(), which yields values ranging from -1 to 1. Since our initial endowment is 100K US dollars, we restrict the maximum number of shares per stock to 100. Thus, the optimal number of shares equals 100 multiplied by the output of the tanh() function. The new model setting also helps to achieve high performance, as shown in the following result.

Figure 8: Direct control on stock shares.
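The share-based action can be sketched as follows: raw network outputs are squashed by tanh and scaled by the per-stock cap of 100 shares, and the transaction cost then depends only on the change in share counts. The function names, the cost rate and the sample values are assumptions, and the text does not say how negative outputs are handled.

```python
import math

MAX_SHARES = 100  # per-stock cap stated in the text


def target_shares(raw_outputs, max_shares=MAX_SHARES):
    """Scale tanh outputs in (-1, 1) to share counts in (-max_shares, max_shares)."""
    return [max_shares * math.tanh(x) for x in raw_outputs]


def rebalance_cost(old_shares, new_shares, prices, rate):
    """Proportional cost on the traded dollar amount."""
    return rate * sum(
        abs(n - o) * p for o, n, p in zip(old_shares, new_shares, prices)
    )


holdings = target_shares([0.0, 0.5, 2.0])  # hypothetical raw outputs
cost = rebalance_cost([10.0, 0.0], [15.0, 0.0], [20.0, 30.0], rate=0.0001)
```

Keeping the share counts unchanged from one day to the next yields zero cost, which is the property this action definition is meant to exploit.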

5 Conclusions

This paper introduces the concepts and implementation of a Deep Direct Reinforcement Learning (DDRL) framework for financial signal representation and portfolio trading. With the risk and reward models identified, we set the objective of maximizing the ultimate reward, measured as the absolute return in portfolio value. To build the DDRL framework, a recurrent neural net is introduced to handle sequential data representation, followed by a two-layer deep neural network for decision learning. Task-Aware Back Propagation Through Time is introduced to address the issues of recursive derivatives and vanishing gradients.

To compare the performance of DRL against the S&P 500 index, the algorithm is tested by trading portfolios of selected stocks and comparing the result with the strategy of buying and holding SPY, an ETF that tracks the S&P500 index, over a time range between 2006 and 2019. The result appears promising, with DRL agents outperforming the market return (buy and hold of SPY) by $250,000.

To address the limitations related to trading cost and portfolio size, we introduce a mask for general selection, set a threshold on stock turnover, use the number of shares instead of weights as the action, and change the objective to the Sharpe ratio. To the best of our knowledge, this is the first attempt to use DRL for auto trading on a large pool of stocks. In the future, there is room for further improvement towards more sophisticated portfolio management.


  • A. G. Barto (1998) "Reinforcement learning: an introduction". Cambridge, MA, USA: MIT Press.
  • Q.H. Dai (2017) "Deep direct reinforcement learning for financial signal representation and trading". IEEE Trans. Neural Netw., vol. 28, no. 3, pp. 653-664.
  • J. Han, F. Ding, X. Liu, L. Torresani, J. Peng, and Q. Liu (2020) "Stein variational inference for discrete distributions". arXiv preprint arXiv:2003.00605.
  • J. Han and Q. Liu (2016) "Bootstrap model aggregation for distributed statistical learning". In Advances in Neural Information Processing Systems, pp. 1795–1803.
  • J. Han and Q. Liu (2017) "Stein variational adaptive importance sampling". In Uncertainty in Artificial Intelligence.
  • J. Han and Q. Liu (2018) "Stein variational gradient descent without gradient". arXiv preprint arXiv:1806.02775.
  • J. Han, S. Lombardo, C. Schroers, and S. Mandt (2018) "Deep probabilistic video compression". arXiv preprint arXiv:1810.02845.
  • J. Han, M. R. Min, L. Han, L. E. Li, and X. Zhang (2021) "Disentangled recurrent wasserstein autoencoder". arXiv preprint arXiv:2101.07496.
  • J. Han, H. Zhang, and Z. Zhang (2017) "High efficiently numerical simulation of the tdgl equation with reticular free energy in hydrogel". arXiv preprint arXiv:1706.02906.
  • E. Hong (2007) "A multiagent approach to Q-learning for daily stock trading". IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, vol. 37, no. 6, pp. 864-877.
  • Hunt et al. (2016) "Continuous control with deep reinforcement learning". https://arxiv.org/abs/1509.02971.
  • L. J.J. (2017) "A deep reinforcement learning framework for the financial portfolio management problem". https://arxiv.org/abs/1706.10059.
  • T. Jagriˇc (2019) "Deep learning for stock market trading: a superior trading strategy". Neural Network World, 29(3):151–171.
  • O. Jin and H. El-Saawy (2016) "Portfolio management using reinforcement learning". https://arxiv.org/pdf/1909.09571.pdf.
  • Jinjian Z. et al. (2009) "Algorithm trading using q-learning and recurrent reinforcement learning". http://cs229.stanford.edu/proj2009/LvDuZhai.pdf.
  • H. Kimura (2019) "Literature review: machine learning techniques applied to financial market prediction". Expert Syst. Appl., vol. 124, pp. 226–251.
  • O. Klimov (2017) "Proximal policy optimization algorithms". https://arxiv.org/abs/1707.06347.
  • Y.R. Li (2018) "Adversarial deep reinforcement learning in portfolio management". https://arxiv.org/abs/1808.09940.
  • Liu et al. (2020) "Deep reinforcement learning for automated stock trading: an ensemble strategy". ACM International Conference on AI in Finance.
  • S. Lombardo, J. Han, C. Schroers, and S. Mandt (2019) "Deep generative video compression". In Advances in Neural Information Processing Systems, pp. 9283–9294.
  • J. Moody and M. Saffell (1999) "Reinforcement learning for trading". Advances in Neural Information Processing Systems, vol. 11, pp. 917-923, MIT Press.
  • J. Moody and M. Saffell (2001) "Learning to trade via direct reinforcement". IEEE Transactions on Neural Networks, vol. 12, no. 4, pp. 875-889.
  • E. Schuitema (2012) "Efficient model learning methods for actor–critic control". IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 42, no. 3, pp. 591-602.
  • L. Weijs (2018) "Reinforcement learning in portfolio management and its interpretation". Erasmus Universiteit Rotterdam.