Reinforcement Learning: Prediction, Control and Value Function Approximation

08/28/2019 ∙ by Haoqian Li, et al. ∙ 0

With the increasing power of computers and the rapid development of self-learning methodologies such as machine learning and artificial intelligence, the problem of constructing automatic Financial Trading Systems (FTFs) has become an increasingly attractive research topic. An intuitive way of developing such a trading algorithm is to use Reinforcement Learning (RL) algorithms, which do not require model-building. In this paper, we dive into the RL algorithms and illustrate the definitions of the reward function, actions and policy functions in detail, and we introduce algorithms that could be applied to FTFs.




1 Introduction

In quantitative trading, a trader’s objective is to optimize some measure of execution performance, e.g., profit or risk-adjusted return, subject to certain constraints. There is a large body of work using predictive models of price changes for quantitative trading, especially for high-frequency trading and market microstructure data (Kearns and Nevmyvaka (2013); Gerlein, et al. (2016); Schumaker and Chen (2009) and Abis (2017)). These models are trained on specific machine learning objective functions (e.g., regression and classification loss functions), and thus there is no guarantee that they globally optimize performance as measured by the trader’s objective.

The financial market is one of the most dynamic and fluctuating entities that exist, which makes it difficult to model its behavior accurately. However, Reinforcement Learning, as another type of self-adaptive approach, can overcome these types of difficulties by learning directly from the outcomes of its actions. More specifically, investment decision making in RL is a stochastic control problem, or a Markov Decision Process (MDP), where the trading strategies are learned from direct interactions with the market. Thus the need for building forecasting models for future prices or returns is eliminated.

Much research has been conducted on RL in quantitative trading (Bertoluzzo and Corazza (2012); Moody et al. (1998); Moody and Saffell (2001); Gold (2003); Nevmyvaka et al. (2006) and Eilers et al. (2014)). Generally, these studies show that trading strategies based on RL perform better than those based on supervised machine learning methodologies. Additionally, they discuss using the Direct Recurrent RL (DDRL) approach instead of the traditional TD-learning and Q-learning methods of the RL field.

In this paper, we first present some basic aspects of RL in Section 2 and then go through the MDP and the reward function in Section 3. In Section 4, we introduce the model-free prediction and control, while in Section 5, we talk about the value function approximation. In Section 6, we provide an algorithm that combines the knowledge of all previous sections.

2 Basic Framework

Reinforcement Learning (RL), rooted in the field of control theory, is a branch of machine learning explicitly designed for taking suitable actions to maximize the cumulative reward. It is employed by an agent that takes actions in an environment so as to find the best possible behavior or path in a specific situation. Reinforcement learning differs from traditional supervised learning: in supervised learning, the training data comes with an answer key provided by an external supervisor, and the model is trained on the correct answers, whereas in reinforcement learning the agent itself decides what to do in order to perform well (as quantified by a defined reward function) in the given task.

3 MDP and The Value Function

RL is built on the interactions between the agent and the environment. At any time $t$, the agent receives an observation $O_t$ from the environment, takes some action $A_t$ (possibly random), and receives a reward $R_t$ (immediate or long-term) from the environment.

3.1 Basic Framework

The history $H_t = O_1, R_1, A_1, \dots, A_{t-1}, O_t, R_t$ is the sequence of observations, actions, and rewards, where the agent selects actions and the environment selects observations and rewards. The state is the information determining what happens next. Formally, an information state $S_t = f(H_t)$ is a function of the history containing all the useful information from the history, and the state sequence is assumed to possess the Markov property.

An RL agent consists of two components, namely, a policy and a value function. A policy $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$ is a distribution over actions given the state $s$, which fully defines an agent’s behavior as a mapping from states to actions, while the value function represents the goodness of each state based on the long-term expected cumulative rewards.

3.2 Reward and Return

A reward $R_t$ is a scalar feedback signal indicating how well an agent is doing at time $t$. In the trading scenario, we can apply the Sharpe Ratio or the differential Sharpe Ratio proposed by Moody et al. (1998) as $R_t$, which is a better estimate for the risk-adjusted profit. The return $G_t$ is the total discounted reward starting from time-step $t$

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},$$

where $\gamma \in [0, 1]$ is the discount factor. In some settings, a long-term reward is delayed, and it is better for the agent to sacrifice the immediate reward in exchange for the long-term reward. For instance, in the stock market, one can gain a higher long-term risk-adjusted return by sacrificing the short-term stock return.
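As a small illustration of the discounted return, the sketch below sums a finite list of rewards with discount factor `gamma`. The function name `discounted_return` and the reward values are hypothetical and chosen only for this example.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite reward sequence."""
    g = 0.0
    # Accumulate backwards so that each step computes g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3}.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With `gamma` close to 0 the agent is myopic and mostly values the first reward; with `gamma` close to 1 later rewards count almost as much as immediate ones.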

3.3 MDP

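In the RL framework, it is usually assumed that the system satisfies the Markov property

$$\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t],$$

which states that the probability of transitioning from the current state $S_t$ to the next state $S_{t+1}$ depends only on the current state $S_t$, instead of the whole history, i.e., the future is independent of the past given the present. An MDP is an environment where all states are Markov and can be viewed as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is a finite set of states, $\mathcal{A}$ is a finite set of actions, $\mathcal{P}_{ss'}^{a} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$ is the state transition probability, $\mathcal{R}_s^a = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$ is a reward function for taking action $a$ in state $s$ at time-step $t$, and $\gamma \in [0, 1]$ is the discount factor. Given the MDP and a policy $\pi$, the state sequence $S_1, S_2, \dots$ is a Markov process.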

3.4 Value Function

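In RL, there are basically two types of value functions, namely, the expected return of a state and the expected return of an action. The state-value function $v_\pi(s)$ of an MDP measures the expected return starting from state $s$, given the policy $\pi$

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s],$$

while the action-value function $q_\pi(s, a)$ is the expected return starting from state $s$, taking action $a$, and following the policy $\pi$

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a].$$

We can decompose the value functions into the immediate reward plus the discounted value of the successor state

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s], \qquad q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a].$$

These are the Bellman equations, a cornerstone of algorithms such as TD-learning and Q-learning. A policy $\pi$ is said to outperform another policy $\pi'$ if $v_\pi(s) \geq v_{\pi'}(s)$ for all $s \in \mathcal{S}$, and the agent’s objective is to find an optimal policy that is better than or equal to all the other policies. The optimal policy identifies the values $v_*(s)$ and $q_*(s, a)$ such that

$$v_*(s) = \max_\pi v_\pi(s), \qquad q_*(s, a) = \max_\pi q_\pi(s, a).$$

In fact, the optimal value functions are recursively determined by the Bellman optimality equations

$$v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) \mid S_t = s, A_t = a], \qquad q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \mid S_t = s, A_t = a],$$

which state that the value of a state under the optimal policy should be equal to the expected return for the optimal action from the state itself.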
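To make the Bellman optimality equation concrete, the following sketch repeatedly applies it as an update on a tiny MDP whose transitions and rewards are fully known. This is an illustration only: the two-state MDP, the data layout of `P` and `R`, and the function name `value_iteration` are assumptions made for this example, and a known model is exactly what the model-free methods of the next section avoid.

```python
def value_iteration(P, R, gamma, n_iter=200):
    """Iterate v(s) <- max_a [ R[s][a] + gamma * sum_s' P(s'|s,a) * v(s') ].

    P[s][a] is a list of (probability, next_state) pairs,
    R[s][a] is the expected immediate reward.
    """
    n_states = len(P)
    v = [0.0] * n_states
    for _ in range(n_iter):
        v = [
            max(
                R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a])
                for a in range(len(P[s]))
            )
            for s in range(n_states)
        ]
    return v


# Hypothetical MDP: action 0 stays in the current state, action 1 switches state.
P = [[[(1.0, 0)], [(1.0, 1)]],
     [[(1.0, 1)], [(1.0, 0)]]]
R = [[0.0, 1.0],   # state 0: stay pays 0, switch pays 1
     [2.0, 0.0]]   # state 1: stay pays 2, switch pays 0
v_star = value_iteration(P, R, gamma=0.5)
print(v_star)  # converges to about [3.0, 4.0]
```

In this example the optimal behavior is to move to state 1 and stay there, so $v_*(1) = 2 / (1 - 0.5) = 4$ and $v_*(0) = 1 + 0.5 \cdot 4 = 3$.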

4 Model-Free Prediction and Control

4.1 Model-Free Prediction

Model-free prediction estimates the value function of an unknown MDP, and there are three approaches, namely, Monte-Carlo learning (MC), Temporal-Difference learning (TD(0)), and TD($\lambda$). Since MC learns from complete episodes and requires all episodes to terminate, in this paper we only focus on the TD(0) and TD($\lambda$) algorithms.

TD methods learn directly from episodes of experience under policy $\pi$, and there is no requirement for knowledge of the MDP transitions and rewards. TD differs from MC in that TD learns from incomplete episodes by bootstrapping. Denote by $\mu_k$ the mean of a sequence $x_1, x_2, \dots, x_k$, which can be computed incrementally. Then,

$$\mu_k = \frac{1}{k} \sum_{j=1}^{k} x_j = \mu_{k-1} + \frac{1}{k}(x_k - \mu_{k-1}).$$

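If we apply this to the value function and treat the expected value as an empirical mean, then given a policy $\pi$, we can update the value function $V(S_t)$ toward the estimated return $R_{t+1} + \gamma V(S_{t+1})$ by the following equation

$$V(S_t) \leftarrow V(S_t) + \alpha \big( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big),$$

where $\alpha$ is the step-size and $R_{t+1} + \gamma V(S_{t+1})$ is called the TD target. Note the subtle difference: the true TD target $R_{t+1} + \gamma v_\pi(S_{t+1})$ is an unbiased estimate of $v_\pi(S_t)$, whereas the TD target $R_{t+1} + \gamma V(S_{t+1})$ is biased. This method is called TD(0), and it essentially looks one step ahead when adjusting value estimates.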
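A minimal sketch of a single tabular TD(0) update is shown below. The dictionary-based value table, the state labels, and the `terminal` flag are assumptions made for illustration.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to the value table V (a dict) in place.

    Returns the TD error (TD target minus the current estimate).
    """
    # The TD target bootstraps from the current estimate of the next state.
    target = r if terminal else r + gamma * V[s_next]
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error


# Hypothetical two-state example.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
# TD target = 0.5 + 0.9 * 1.0 = 1.4, so V["A"] moves from 0 to about 0.14.
```

Because the update uses only the observed reward and the current estimate of the next state, it can be applied after every transition without waiting for the episode to end.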

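Naturally, we can generalize the TD(0) algorithm to $n$ steps into the future, and define the n-step return as

$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}),$$

which is the cumulative return of $n$ time-steps plus the value onward. We then average the n-step returns over different $n$ and treat the average as our new target, which combines all the n-step returns $G_t^{(n)}$. More specifically, we use the weight $(1 - \lambda)\lambda^{n-1}$ to define the $\lambda$-return as

$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)},$$

where $\lambda \in [0, 1]$. When $\lambda = 1$, the credit is deferred until the end of the episode (long-term); when $\lambda = 0$, the algorithm only looks at the reward one step further (myopic).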

Similar to TD(0), we can use $G_t^\lambda$ to define the TD($\lambda$) algorithm, in which we update $V(S_t)$ as

$$V(S_t) \leftarrow V(S_t) + \alpha \big( G_t^\lambda - V(S_t) \big),$$

which is called the forward view of TD($\lambda$), as we update the value function towards the $\lambda$-return. However, this forward algorithm requires knowledge of all future rewards, and therefore can only be computed from complete episodes. To solve this problem, we use a backward TD($\lambda$) to update online from incomplete sequences by introducing the eligibility trace of a state $E_t(s)$, which is the degree to which the state has been visited in the recent past, combining both a frequency heuristic and a recency heuristic. One version of the eligibility trace is defined as

$$E_0(s) = 0, \qquad E_t(s) = \gamma \lambda E_{t-1}(s) + \mathbf{1}(S_t = s),$$

which is used to update all the states that have been recently visited according to their eligibility when reinforcement is received. The backward TD($\lambda$) keeps an eligibility trace for every state $s$ and updates the value $V(s)$ for every state

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \qquad V(s) \leftarrow V(s) + \alpha \delta_t E_t(s),$$

while more details are covered by Kaelbling et al. (1996), Dayan (1992) and Dayan (1994).
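The backward view can be sketched in a few lines for the tabular case. The episode format, a list of `(state, reward, next_state)` tuples with `None` marking termination, is an assumption made for this example, as are the state labels.

```python
def td_lambda_episode(V, episode, alpha, gamma, lam):
    """Backward-view TD(lambda) over one episode, updating the dict V in place.

    episode: list of (state, reward, next_state); next_state is None at termination.
    """
    E = {s: 0.0 for s in V}  # one accumulating eligibility trace per state
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else V[s_next]
        delta = r + gamma * v_next - V[s]      # one-step TD error
        for x in E:
            E[x] *= gamma * lam                # decay every trace
        E[s] += 1.0                            # bump the visited state
        for x in V:
            V[x] += alpha * delta * E[x]       # credit states by eligibility
    return V


episode = [("A", 0.0, "B"), ("B", 1.0, None)]
V_long = td_lambda_episode({"A": 0.0, "B": 0.0}, episode, alpha=0.5, gamma=1.0, lam=1.0)
V_myopic = td_lambda_episode({"A": 0.0, "B": 0.0}, episode, alpha=0.5, gamma=1.0, lam=0.0)
```

With `lam=1.0` the final reward is credited to both visited states, while with `lam=0.0` only the state visited immediately before the reward is updated, which matches the long-term versus myopic behavior described for the $\lambda$-return.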

4.2 Model-Free Control

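Model-free control means optimizing the value function of an unknown MDP. Previously, we discussed how to evaluate a policy through value functions; in this section, we focus on how to improve a policy. It is worth noting that greedy policy improvement over $V(s)$ requires a model of the MDP, i.e.,

$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^{a} V(s') \Big),$$

where $\mathcal{P}_{ss'}^{a}$ denotes the probability of transitioning from state $s$ into state $s'$ when taking action $a$. The knowledge of such a probability is required, so this approach is not model-free. Instead, we can improve the policy over $Q(s, a)$, for which the greedy step $\pi'(s) = \arg\max_a Q(s, a)$ needs no model, together with a simple idea called $\epsilon$-greedy exploration, in which all $m = |\mathcal{A}|$ actions are tried with non-zero probability: we choose the greedy action with probability $1 - \epsilon$, and choose an action at random with probability $\epsilon$. Then, we have

$$\pi(a \mid s) = \begin{cases} \epsilon / m + 1 - \epsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a'), \\ \epsilon / m & \text{otherwise.} \end{cases}$$

It has been proved that for any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement, i.e., $v_{\pi'}(s) \geq v_\pi(s)$. Thus, we can apply TD to $Q(S, A)$ by using $\epsilon$-greedy policy improvement and update at every time-step. A natural first choice is to use TD(0): $Q(S, A) \leftarrow Q(S, A) + \alpha \big( R + \gamma Q(S', A') - Q(S, A) \big)$, where $R$ is the immediate reward obtained from the state $S$ by taking action $A$, $S'$ is the next state, and $A'$ is the next action chosen by the same policy. This algorithm is also called SARSA on-policy control.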
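The two ingredients of SARSA, $\epsilon$-greedy action selection and the TD(0) update of $Q$, can be sketched as follows. The table layout (a dict keyed by `(state, action)`), the action labels `"buy"` and `"sell"`, and the numeric values are hypothetical.

```python
import random


def epsilon_greedy(Q, s, actions, eps, rng=random):
    """With probability eps pick a uniformly random action, otherwise the greedy one.

    The greedy action therefore has total probability 1 - eps + eps / m.
    """
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One SARSA update using the action a_next actually chosen in s_next."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error


actions = ["buy", "sell"]
Q = {(0, "buy"): 1.0, (0, "sell"): 2.0, (1, "buy"): 0.5, (1, "sell"): 0.0}
a = epsilon_greedy(Q, 0, actions, eps=0.0)   # eps = 0 is purely greedy: "sell"
delta = sarsa_update(Q, 0, a, r=1.0, s_next=1, a_next="buy", alpha=0.5, gamma=0.9)
```

SARSA is on-policy because `a_next` is drawn from the same $\epsilon$-greedy policy that is being improved.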

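Similar to the state-value function, we define the n-step $Q$-return as

$$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$$

and use the weight $(1 - \lambda)\lambda^{n-1}$ to define the forward SARSA($\lambda$)

$$q_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}, \qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big( q_t^\lambda - Q(S_t, A_t) \big).$$

Following the same logic, we use one eligibility trace for each state-action pair, where $E_0(s, a) = 0$ and $E_t(s, a) = \gamma \lambda E_{t-1}(s, a) + \mathbf{1}(S_t = s, A_t = a)$. The backward SARSA($\lambda$) updates are then defined as

$$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t), \qquad Q(s, a) \leftarrow Q(s, a) + \alpha \delta_t E_t(s, a).$$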

4.3 Q-Learning

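In so-called off-policy learning, we evaluate a target policy $\pi(a \mid s)$ to compute $v_\pi(s)$ or $q_\pi(s, a)$ while following a different behavior policy $\mu(a \mid s)$, so two different policies are used in the policy improvement process. The target policy is used to estimate the value functions, and the behavior policy is used to generate the actions actually taken. Since this form of off-policy learning is based on the action-values $Q(s, a)$, the algorithm is called Q-learning. Specifically, we allow both behavior and target policies to improve by applying a greedy algorithm to the target policy $\pi$ w.r.t. $Q(s, a)$ and applying $\epsilon$-greedy to the behavior policy $\mu$ w.r.t. $Q(s, a)$. The target policy is a greedy search

$$\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a').$$

Since we have the following sequence of equalities for the target, with $A' = \pi(S_{t+1})$,

$$R_{t+1} + \gamma Q(S_{t+1}, A') = R_{t+1} + \gamma Q\big(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a')\big) = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a'),$$

we can update $Q(S, A)$ as follows

$$Q(S, A) \leftarrow Q(S, A) + \alpha \big( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \big).$$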
5 Value Function Approximation

5.1 Parameter and Feature Vector

So far, we have assumed that the states are discrete variables, in which every state $s$ has an entry $V(s)$ and every state-action pair $(s, a)$ has an entry $Q(s, a)$. However, in the financial market, the price of equities is characterized by continuous states, which requires us to generalize from a limited set of states to infinitely many states.

Since a value function maps a state or state-action pair to a value, we can build a function that estimates the value function everywhere. In this case, we create a parameter vector $\mathbf{w}$ (or $\mathbf{w}_t$ at step $t$) and update the parameter at each step $t$ using TD learning, such that the associated value function $\hat{v}(s, \mathbf{w})$ or $\hat{q}(s, a, \mathbf{w})$ depends entirely on $\mathbf{w}$, which varies step by step.

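Next, our objective is to find a parameter vector $\mathbf{w}$ minimizing the mean-squared error between the approximate value and the true value

$$J(\mathbf{w}) = \mathbb{E}_\pi \big[ (v_\pi(S) - \hat{v}(S, \mathbf{w}))^2 \big],$$

where $\hat{v}(s, \mathbf{w})$ approximates the value $v_\pi(s)$, given a state $s$ and a parameter vector $\mathbf{w}$. What remains is to find a feature vector that can represent states

$$\mathbf{x}(S) = (x_1(S), \dots, x_n(S))^\top.$$

We can then represent the value function by a linear combination of features

$$\hat{v}(S, \mathbf{w}) = \mathbf{x}(S)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(S) w_j$$

and use a stochastic gradient descent method to minimize the mean-squared error. It can be easily found that $\nabla_{\mathbf{w}} \hat{v}(S, \mathbf{w}) = \mathbf{x}(S)$. Therefore, to find a local minimum of $J(\mathbf{w})$, we adjust the parameter $\mathbf{w}$ in the direction

$$\Delta \mathbf{w} = -\tfrac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}), \qquad \text{sampled as} \qquad \Delta \mathbf{w} = \alpha \big( v_\pi(S) - \hat{v}(S, \mathbf{w}) \big) \mathbf{x}(S),$$

where $\alpha$ is the step-size.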
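The stochastic gradient step for a linear value function can be sketched as below. Because the true value $v_\pi(S)$ is not observed in practice, this sketch substitutes the TD(0) target $R + \gamma \hat{v}(S', \mathbf{w})$ for it, which is an assumption of the example; the feature vectors and function names are also hypothetical.

```python
def linear_value(x, w):
    """Linear value estimate: the dot product of features x and weights w."""
    return sum(xi * wi for xi, wi in zip(x, w))


def td0_linear_update(w, x, r, x_next, alpha, gamma):
    """Return new weights w + alpha * delta * x, where delta uses the TD(0) target.

    For a linear approximator the gradient of the estimate w.r.t. w is just x.
    """
    delta = r + gamma * linear_value(x_next, w) - linear_value(x, w)
    return [wi + alpha * delta * xi for wi, xi in zip(w, x)]


# Hypothetical features for the current and next state.
w = [0.0, 0.0]
w = td0_linear_update(w, x=[1.0, 0.0], r=1.0, x_next=[0.0, 1.0], alpha=0.1, gamma=0.9)
# delta = 1.0, so only the weight of the active feature moves: w is about [0.1, 0.0]
```

With one-hot features this reduces to the tabular TD(0) update; with shared features a single update generalizes to every state that activates the same features.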

5.2 Action-Value Function Approximation

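Following the same approach, we approximate the action-value function $\hat{q}(S, A, \mathbf{w}) \approx q_\pi(S, A)$ by minimizing the mean-squared error

$$J(\mathbf{w}) = \mathbb{E}_\pi \big[ (q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}))^2 \big]$$

and then use stochastic gradient descent to find a local minimum. Again, a feature vector is defined to represent the state-action pair

$$\mathbf{x}(S, A) = (x_1(S, A), \dots, x_n(S, A))^\top$$

and the action-value function is represented by a linear combination of features

$$\hat{q}(S, A, \mathbf{w}) = \mathbf{x}(S, A)^\top \mathbf{w} = \sum_{j=1}^{n} x_j(S, A) w_j.$$

Thus, the stochastic gradient descent updates are as follows

$$\nabla_{\mathbf{w}} \hat{q}(S, A, \mathbf{w}) = \mathbf{x}(S, A), \qquad \Delta \mathbf{w} = \alpha \big( q_\pi(S, A) - \hat{q}(S, A, \mathbf{w}) \big) \mathbf{x}(S, A),$$

where $\alpha$ is the step-size. Since $q_\pi(S, A)$ is unknown, we substitute a target for it: the TD(0) target $R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w})$ or the $\lambda$-return $q_t^\lambda$. In the framework of approximation, the backward-view updating rule is

$$\delta_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) - \hat{q}(S_t, A_t, \mathbf{w}), \qquad E_t = \gamma \lambda E_{t-1} + \nabla_{\mathbf{w}} \hat{q}(S_t, A_t, \mathbf{w}), \qquad \Delta \mathbf{w} = \alpha \delta_t E_t.$$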

5.3 LSTD

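The gradient descent optimization of $\mathbf{w}$ works in general, including for non-linear value function approximation. However, when the approximation is linear, we can solve for the least-squares solution directly, by observing that at the minimum of the least-squares error $LS(\mathbf{w})$ over the training data $\mathcal{D}$, the expected update should be zero, i.e., $\mathbb{E}_{\mathcal{D}}[\Delta \mathbf{w}] = 0$. We then obtain

$$\alpha \sum_{t=1}^{T} \mathbf{x}(S_t) \big( v_t^\pi - \mathbf{x}(S_t)^\top \mathbf{w} \big) = 0 \quad \Longrightarrow \quad \mathbf{w} = \Big( \sum_{t=1}^{T} \mathbf{x}(S_t) \mathbf{x}(S_t)^\top \Big)^{-1} \sum_{t=1}^{T} \mathbf{x}(S_t) v_t^\pi,$$

where the true values $v_t^\pi$ are unknown, and in practice our training data consists of noisy samples of $v_t^\pi$. For LSTD(0), $\mathbf{w}$ is solved by

$$\mathbf{w} = \Big( \sum_{t=1}^{T} \mathbf{x}(S_t) \big( \mathbf{x}(S_t) - \gamma \mathbf{x}(S_{t+1}) \big)^\top \Big)^{-1} \sum_{t=1}^{T} \mathbf{x}(S_t) R_{t+1},$$

while for LSTD($\lambda$), with the eligibility trace $E_t = \gamma \lambda E_{t-1} + \mathbf{x}(S_t)$, $\mathbf{w}$ is solved by

$$\mathbf{w} = \Big( \sum_{t=1}^{T} E_t \big( \mathbf{x}(S_t) - \gamma \mathbf{x}(S_{t+1}) \big)^\top \Big)^{-1} \sum_{t=1}^{T} E_t R_{t+1}.$$

In practice, we use the LSTDQ algorithm and simply substitute $\mathbf{x}(S_t)$ with $\mathbf{x}(S_t, A_t)$. For LSTDQ(0), $\mathbf{w}$ is solved by

$$\mathbf{w} = \Big( \sum_{t=1}^{T} \mathbf{x}(S_t, A_t) \big( \mathbf{x}(S_t, A_t) - \gamma \mathbf{x}(S_{t+1}, \pi(S_{t+1})) \big)^\top \Big)^{-1} \sum_{t=1}^{T} \mathbf{x}(S_t, A_t) R_{t+1},$$

and in LSTDQ($\lambda$), with $E_t = \gamma \lambda E_{t-1} + \mathbf{x}(S_t, A_t)$, we have

$$\mathbf{w} = \Big( \sum_{t=1}^{T} E_t \big( \mathbf{x}(S_t, A_t) - \gamma \mathbf{x}(S_{t+1}, \pi(S_{t+1})) \big)^\top \Big)^{-1} \sum_{t=1}^{T} E_t R_{t+1}.$$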
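A small pure-Python sketch of LSTD(0) is shown below: it accumulates the matrix and vector from the formula and solves the resulting linear system. The helper `solve`, the transition format, and the one-hot features are assumptions made for this example; the sketch does not guard against a singular matrix, which can occur with too little data.

```python
def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][k] * w[k] for k in range(i + 1, n))) / M[i][i]
    return w


def lstd0(transitions, n_features, gamma):
    """LSTD(0): accumulate A = sum x (x - gamma x')^T and b = sum x r, then solve.

    transitions: list of (x, r, x_next); x_next is all zeros at termination.
    """
    A = [[0.0] * n_features for _ in range(n_features)]
    b = [0.0] * n_features
    for x, r, x_next in transitions:
        for i in range(n_features):
            b[i] += x[i] * r
            for j in range(n_features):
                A[i][j] += x[i] * (x[j] - gamma * x_next[j])
    return solve(A, b)


# Hypothetical chain: state A -> state B (reward 0), then B -> terminal (reward 1).
transitions = [([1.0, 0.0], 0.0, [0.0, 1.0]),
               ([0.0, 1.0], 1.0, [0.0, 0.0])]
w = lstd0(transitions, n_features=2, gamma=0.9)
# With one-hot features the weights are the state values: about [0.9, 1.0]
```

Unlike the incremental TD update, this uses all the data at once and needs no step-size, at the cost of solving a linear system in the number of features.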

6 Policy Gradient with Actor-Critic

6.1 Policy Gradient

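In this section, we use $\mathbf{w}$ and $\theta$ to parameterize the value function and the policy function, respectively. Our goal is to find the optimal $\theta$, based on a policy $\pi_\theta(a \mid s)$ with parameter $\theta$. To measure the quality of a policy $\pi_\theta$, we can use the mean value

$$J_{avV}(\theta) = \sum_{s} d^{\pi_\theta}(s) \, v_{\pi_\theta}(s)$$

or the mean reward per time-step

$$J_{avR}(\theta) = \sum_{s} d^{\pi_\theta}(s) \sum_{a} \pi_\theta(a \mid s) \, \mathcal{R}_s^a,$$

where $d^{\pi_\theta}(s)$ is the stationary distribution of the Markov chain for $\pi_\theta$. We then need to find the $\theta$ that maximizes $J(\theta)$. Similar to the previous section, the policy gradient algorithm searches for a local maximum of $J(\theta)$ by ascending the gradient of the objective w.r.t. the parameters $\theta$

$$\Delta \theta = \alpha \nabla_\theta J(\theta),$$

where $\nabla_\theta J(\theta)$ is the policy gradient, and $\alpha$ is the step-size. We now compute the policy gradient analytically using the likelihood ratio trick

$$\nabla_\theta \pi_\theta(a \mid s) = \pi_\theta(a \mid s) \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)} = \pi_\theta(a \mid s) \nabla_\theta \log \pi_\theta(a \mid s),$$

where $\nabla_\theta \log \pi_\theta(a \mid s)$ is the score function. Finally, we apply the policy gradient theorem, which holds for both objective functions, and obtain the policy gradient as

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a \mid s) \, q_{\pi_\theta}(s, a) \big],$$

while more details can be found in Sutton et al. (2000).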
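As one concrete case of the score function, the sketch below assumes a softmax policy whose action preferences are linear in hypothetical features $\mathbf{x}(s, a)$. The source does not fix a policy class, so this choice is an assumption. For this class the score has the closed form $\mathbf{x}(s, a) - \sum_b \pi_\theta(b \mid s) \mathbf{x}(s, b)$.

```python
import math


def softmax_policy(theta, features):
    """features[a] is the feature vector x(s, a); returns pi(a | s) for every action."""
    prefs = [sum(t * f for t, f in zip(theta, x)) for x in features]
    m = max(prefs)                       # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def score(theta, features, a):
    """Score function: x(s, a) minus the policy-weighted average feature vector."""
    pi = softmax_policy(theta, features)
    n = len(theta)
    mean_x = [sum(pi[b] * features[b][i] for b in range(len(features))) for i in range(n)]
    return [features[a][i] - mean_x[i] for i in range(n)]


theta = [0.0, 0.0]
features = [[1.0, 0.0], [0.0, 1.0]]      # one hypothetical feature vector per action
g = score(theta, features, a=0)           # uniform policy, so g = [0.5, -0.5]
```

Multiplying this score by an estimate of $q_{\pi_\theta}(s, a)$ and a step-size gives one stochastic policy gradient step.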

6.2 Actor-Critic

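We now combine the policy parametrization with the action-value function parametrization, where we use a critic to estimate the action-value function $Q_{\mathbf{w}}(s, a) \approx q_{\pi_\theta}(s, a)$ and update the action-value function parameter $\mathbf{w}$. Next, the actor updates the policy parameter $\theta$ in the direction suggested by the approximated action-value function, as follows

$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q_{\mathbf{w}}(s, a) \big], \qquad \Delta \theta = \alpha \nabla_\theta \log \pi_\theta(a \mid s) \, Q_{\mathbf{w}}(s, a).$$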


6.3 Advantage Actor-Critic

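One of the disadvantages of the policy gradient method is that it usually has a large variance. To reduce the variance without changing the expectation, we subtract a baseline function $B(s)$ from the policy gradient (Grondman et al. (2012) and Sutton et al. (2000)). A good baseline function is the state-value function $B(s) = v_{\pi_\theta}(s)$, and hence, we have

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a \mid s) \, A_{\pi_\theta}(s, a) \big],$$

where $A_{\pi_\theta}(s, a) = q_{\pi_\theta}(s, a) - v_{\pi_\theta}(s)$ is called the advantage function. In fact, the TD error of $v_{\pi_\theta}$, $\delta_{\pi_\theta} = r + \gamma v_{\pi_\theta}(s') - v_{\pi_\theta}(s)$, is an unbiased estimate of the advantage function, and we can use the TD error to compute the policy gradient $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} [ \nabla_\theta \log \pi_\theta(a \mid s) \, \delta_{\pi_\theta} ]$. In practice, we can use the approximated TD error, and obtain the policy gradient as

$$\delta_{\mathbf{v}} = r + \gamma V_{\mathbf{v}}(s') - V_{\mathbf{v}}(s), \qquad \nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a \mid s) \, \delta_{\mathbf{v}} \big],$$

where $\mathbf{v}$ is the set of critic parameters. We apply TD(0) as mentioned and obtain

$$\Delta \theta = \alpha \big( r + \gamma V_{\mathbf{v}}(s') - V_{\mathbf{v}}(s) \big) \nabla_\theta \log \pi_\theta(a \mid s).$$

As in the backward view of TD($\lambda$), we can apply the eligibility trace and obtain the following updating rule

$$\delta_t = r_{t+1} + \gamma V_{\mathbf{v}}(s_{t+1}) - V_{\mathbf{v}}(s_t), \qquad e_t = \gamma \lambda e_{t-1} + \nabla_\theta \log \pi_\theta(a_t \mid s_t), \qquad \Delta \theta = \alpha \delta_t e_t.$$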
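The TD(0) version of the advantage actor-critic update can be sketched as follows. This is a sketch under several assumptions that are not in the source: a linear critic $V_{\mathbf{v}}(s) = \mathbf{v}^\top \mathbf{x}(s)$, a softmax actor with linear preferences, a separate critic step-size `beta`, and hypothetical feature vectors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(theta, feats):
    """Softmax policy over actions with linear preferences theta . x(s, a)."""
    prefs = [dot(theta, x) for x in feats]
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(theta, v, x_s, x_next, feats, a, r, alpha, beta, gamma):
    """One TD(0) advantage actor-critic update.

    theta: actor parameters; v: critic parameters (linear state-value function);
    x_s, x_next: state features; feats[b]: policy features x(s, b) for action b.
    Returns (new_theta, new_v, delta).
    """
    delta = r + gamma * dot(v, x_next) - dot(v, x_s)   # approximate TD error
    pi = softmax(theta, feats)
    n = len(theta)
    mean_x = [sum(pi[b] * feats[b][i] for b in range(len(feats))) for i in range(n)]
    score = [feats[a][i] - mean_x[i] for i in range(n)]
    new_theta = [t + alpha * delta * g for t, g in zip(theta, score)]   # actor step
    new_v = [vi + beta * delta * xi for vi, xi in zip(v, x_s)]          # critic step
    return new_theta, new_v, delta


theta, v, delta = actor_critic_step(
    theta=[0.0, 0.0], v=[0.0, 0.0],
    x_s=[1.0, 0.0], x_next=[0.0, 1.0],
    feats=[[1.0, 0.0], [0.0, 1.0]],
    a=0, r=1.0, alpha=0.1, beta=0.5, gamma=0.9,
)
```

A positive TD error raises the probability of the action just taken and raises the critic's estimate of the state it was taken in; a negative TD error does the opposite.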



7 Conclusion

In this technical report, we summarized the basic framework of RL and then went through the development of algorithms including TD, Q-learning, and Actor-Critic. We also introduced how to use value function approximation to evaluate the state value, policy value, and state-action pair value. With a concrete understanding of the RL framework, we will be able to apply RL to the field of quantitative trading. In recent years, most of the published work related to this topic has been based on dynamic programming, TD learning (Moody et al. (1998); Richard S. Sutton (1988)) or Q-learning (C.J.C.H. Watkins (1989)). These methods leave the policy gradient method out. On the other hand, some other work such as Moody and Saffell (2001) focused only on the policy gradient method without using value functions. In this paper, however, we introduced another class of RL algorithms that combines value function approximation with the policy gradient method, namely, the Actor-Critic algorithm. Therefore, one possible future research direction is to compare the performance of these different types of algorithms on real datasets. Another research direction might be improving the RL algorithm itself. For example, it would be interesting to try to prove the convexity of the value functions and apply an interior-point algorithm while doing policy gradient search.

With regard to the application of RL in quantitative trading, an agent must also have its own interpretation of the information set, specifically, the external/market variables. Consequently, a natural question to ask is which factors we should take into consideration to fully describe the environment while keeping the description simple. In addition, calculating the exact Sharpe Ratio at every time step is extremely expensive, and we might need to use the differential Sharpe Ratio as an approximation. What would be the risk-adjusted return of an RL algorithm taking all of these into account? Answering these research questions is our next goal, and we will address them in our future work.


References

  • Kearns and Nevmyvaka (2013) Kearns, M., & Nevmyvaka, Y. (2013). Machine Learning for Market Microstructure and High Frequency Trading. High Frequency Trading: New Realities for Traders, Markets, and Regulators.
  • Gerlein, et al. (2016) Gerlein, E. A., McGinnity, M., Belatreche, A., & Coleman, S. (2016). Evaluating Machine Learning Classification for Financial Trading: An Empirical Approach. Expert Systems with Applications, 54 193–207.
  • Schumaker and Chen (2009) Schumaker, R. P., & Chen, H. (2009). A Quantitative Stock Prediction System based on Financial News. Information Processing & Management, 45 571–583.
  • Abis (2017) Abis, S. (2017). Man vs. Machine: Quantitative and Discretionary Equity Management. Unpublished working paper. Columbia Business School.
  • Bertoluzzo and Corazza (2012) Bertoluzzo, F., & Corazza, M. (2012). Testing Different Reinforcement Learning Configurations for Financial Trading: Introduction and Applications. Procedia Economics and Finance, 3 68–77.
  • Gold (2003) Gold, C. (2003). FX Trading via Recurrent Reinforcement Learning. In 2003 IEEE International Conference on Computational Intelligence for Financial Engineering, Proceedings. (pp. 363-370). IEEE.
  • Moody and Saffell (2001) Moody, J., & Saffell, M. (2001). Learning to Trade via Direct Reinforcement. IEEE Transactions on Neural Networks, 12 875–889.
  • Moody et al. (1998) Moody, J., Wu, L., Liao, Y., & Saffell, M. (1998). Performance Functions and Reinforcement Learning for Trading Systems and Portfolios. Journal of Forecasting, 17 441–470.
  • Nevmyvaka et al. (2006) Nevmyvaka, Y., Feng, Y., & Kearns, M. (2006). Reinforcement Learning for Optimized Trade Execution. In Proceedings of The 23rd International Conference on Machine Learning (pp. 673–680). ACM.
  • Eilers et al. (2014) Eilers, D., Dunis, C. L., von Mettenheim, H. J., & Breitner, M. H. (2014). Intelligent Trading of Seasonal Effects: A Decision Support Algorithm based on Reinforcement Learning. Decision Support Systems, 64 100–108.
  • Kaelbling et al. (1996) Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996). Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4 237–285.
  • Dayan (1992) Dayan, P. (1992). The Convergence of TD(λ) for General λ. Machine Learning, 8 341–362.
  • Dayan (1994) Dayan, P., & Sejnowski, T. J. (1994). TD(λ) Converges with Probability 1. Machine Learning, 14 295–301.
  • Sutton et al. (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Advances in Neural Information Processing Systems, 1057–1063.
  • Grondman et al. (2012) Grondman, I., Busoniu, L., Lopes, G. A., & Babuska, R. (2012). A Survey of Actor-Critic Reinforcement Learning: Standard and Natural Policy Gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42 1291–1307.
  • Richard S. Sutton (1988) Sutton, R. S. (1988). Learning to Predict by the Method of Temporal Differences. Machine Learning, 3 9–44.
  • C.J.C.H. Watkins (1989) Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. Ph.D. thesis, Cambridge University, Psychology Department.