Reinforcement learning (RL) algorithms have been applied to and achieved strong performance in a wide variety of challenging domains, from games to robotic control [sutton2018reinforcement, mnih2015human, silver2016mastering, silver2017mastering, vinyals2019grandmaster]. Over the last few decades, numerous RL algorithms have appeared in the literature on sequential decision-making and optimal control. These algorithms are generally divided into value-based methods and policy-based methods, depending on whether a parameterized policy is learned. In value-based RL, a tabular or parameterized state (or state-action) value function is learned, and the optimal policy is derived directly from the value function. In contrast, policy-based RL directly parameterizes the policy and updates its parameters.
Early RL algorithms mostly belong to the value-based family, which derives the optimal policy from a learned value function. Policy iteration (PI) is considered the first value-based algorithm: policy evaluation (PEV) and policy improvement (PIM) alternate until the action-value function converges to the optimal action-value function. In the PEV step, the action-value function is updated under a fixed policy until it converges; in the PIM step, a better policy is obtained from the updated action-value function [howard1960dynamic]. Value iteration (VI) is a truncated form of policy iteration, in which the action-value function is updated only once in the PEV step instead of until convergence [puterman1978modified]. However, PI and VI require complete knowledge of the environment in the PEV step to compute the action-value estimate by bootstrapping. Monte Carlo (MC) algorithms were proposed to approximate the action-value function by averaging episodic returns, so no model is needed [sutton2018reinforcement]. However, they suffer from large variance and cannot perform updates until the end of an episode. SARSA and Q-learning are two famous temporal difference (TD) algorithms; they combine ideas from PI and MC and are among the most widely used methods in RL [sutton1988learning]. They compute the action-value estimate of a state-action pair in the PEV step by bootstrapping only from its next sampled state-action pair, instead of from all adjacent state-action pairs or the whole episodic return, which greatly reduces estimation variance and speeds up learning at the cost of added bias. The main difference between SARSA and Q-learning lies in whose experiences they use to estimate the action-value function. SARSA uses experiences generated by the target policy, i.e., the policy we aim to update; this requires the target policy to remain stochastic for exploration and thus limits the optimality of the algorithm, because the optimal policy is usually deterministic [rummery1994line]. Q-learning relieves this limitation by learning the greedy policy while exploring with an $\epsilon$-greedy policy, which was one of the early breakthroughs in RL [watkins1992q].
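The on-policy/off-policy distinction above can be made concrete with a minimal tabular sketch. The toy chain MDP and all function names below are our own illustration, not from any cited work: SARSA bootstraps from the next action actually sampled by the $\epsilon$-greedy policy, while Q-learning bootstraps from the greedy next action.

```python
import numpy as np

def td_control_demo(n_states=4, episodes=200, alpha=0.1, gamma=0.9,
                    eps=0.1, seed=0):
    """SARSA vs. Q-learning on a toy chain MDP (illustrative sketch).

    States 0..3; action 1 moves right, action 0 moves left; every step
    costs -1; reaching the rightmost state ends the episode.
    """
    rng = np.random.default_rng(seed)

    def step(s, a):
        s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        return s2, -1.0, s2 == n_states - 1

    def eps_greedy(Q, s):
        return int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))

    Q_sarsa = np.zeros((n_states, 2))
    Q_qlearn = np.zeros((n_states, 2))
    for _ in range(episodes):
        # SARSA: bootstrap on the next action actually sampled (on-policy)
        s, done = 0, False
        a = eps_greedy(Q_sarsa, s)
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(Q_sarsa, s2)
            target = r if done else r + gamma * Q_sarsa[s2, a2]
            Q_sarsa[s, a] += alpha * (target - Q_sarsa[s, a])
            s, a = s2, a2
        # Q-learning: bootstrap on the greedy next action (off-policy)
        s, done = 0, False
        while not done:
            a = eps_greedy(Q_qlearn, s)
            s2, r, done = step(s, a)
            target = r if done else r + gamma * Q_qlearn[s2].max()
            Q_qlearn[s, a] += alpha * (target - Q_qlearn[s, a])
            s = s2
    return Q_sarsa, Q_qlearn
```

On this cost-to-go chain both algorithms learn to move right; the learned action values differ slightly because SARSA evaluates the exploratory $\epsilon$-greedy policy while Q-learning evaluates the greedy one.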
However, action-value methods are not feasible for continuous or large discrete action spaces, because finding the greedy policy in such spaces is impractical. To address this, methods that learn a parameterized policy were proposed, called policy gradient (PG) methods, in which the parameterized policy selects actions without consulting an action-value function. PG methods can learn specific probabilities for taking each action, can approach deterministic policies asymptotically, and naturally handle continuous action spaces. Marbach and Tsitsiklis (2001) derived the policy gradient theorem, which gives an exact formula for how performance is affected by the policy parameters without involving derivatives of the state distribution, providing a theoretical foundation for PG methods [marbach2001simulation]. The REINFORCE method follows directly from the policy gradient theorem; it uses the episodic return to estimate the action-value function and is thus the first practical application of the theorem [williams1992simple]. Like MC methods, REINFORCE suffers from large variance. Williams (1992) added a state-value function as a baseline, which reduces REINFORCE's variance without introducing bias [williams1992simple]. Actor-critic (AC) methods use a one-step TD estimate of the action-value function, further reducing variance at the cost of introducing bias [degris2012off]. Beyond variance reduction, PG methods have been improved in other directions. Kakade (2002) proposed the natural policy gradient (NPG), which updates policy parameters in the space normed by the Fisher information matrix; this removes the influence of the particular policy parameterization and yields a more stable gradient in parameter space [kakade2002natural]. Degris et al. (2012) introduced off-policy actor-critic, enabling gradient estimation from experience generated by any policy [degris2012off]. Silver et al. (2014) introduced the policy gradient theorem for deterministic policies and developed on-policy and off-policy deterministic policy gradient algorithms based on it [silver2014deterministic].
With the rise of deep learning, many traditional RL algorithms have been extended to deep RL algorithms by using deep neural networks as policy and value function approximators. Among action-value methods, DQN combined Q-learning with convolutional neural networks and experience replay, learning to play many Atari games at human-level performance from raw pixels; this kick-started many recent successes in scaling RL to complex sequential decision-making problems, such as Double DQN, Prioritized Experience Replay, the Dueling network architecture, Distributional Q-learning, and their combination, Rainbow [van2016deep, schaul2015prioritized, wang2015dueling, bellemare2017distributional, hessel2018rainbow]. Among policy gradient methods, A3C combined actor-critic with fully connected networks and succeeded on a wide variety of continuous motor control problems [mnih2016asynchronous], and deep DPG (DDPG), an extension of DPG, successfully solves more than 20 simulated physics tasks [lillicrap2015continuous].
These many RL algorithms are usually categorized by the way they choose actions, i.e., value-based, policy-based, and actor-critic. However, we observe that two fundamental mechanisms for finding the optimal policy underlie these algorithms. One is what we call indirect methods, which acquire the optimal policy by solving the Bellman equation of the action-value function and deriving the optimal action from it. The other is direct methods, which seek the optimal policy by directly optimizing an objective on policy performance. In this paper, we show that the two classes of methods are equivalent and can be unified under the actor-critic architecture if certain conditions on the initial state distribution of the problem hold. In addition, convergence results are presented for both direct and indirect methods. Finally, we classify current mainstream RL algorithms by this criterion and compare it with other criteria, including value-based versus policy-based and model-based versus model-free.
The rest of this paper is organized as follows. Section II introduces preliminaries on value functions and stationary distributions. Section III introduces the concepts of direct and indirect methods and establishes their equivalence and unification. Section IV presents convergence results for both direct and indirect methods. Section V classifies current mainstream RL algorithms using our criterion and compares it with other criteria. Section VI concludes this work.
We study the standard reinforcement learning (RL) setting in which an agent interacts with an environment by observing a state $s_t$, selecting an action $a_t$, receiving a reward $r_t$, and observing the next state $s_{t+1}$. We model this process with a Markov decision process (MDP). Here, $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, and $p(s'|s,a)$ is the transition function. Throughout we assume that $\mathcal{S}$ and $\mathcal{A}$ are finite sets. A policy $\pi(a|s)$ maps a state to a distribution over actions, $r(s,a)$ is the reward function, $d_0$ is the distribution of the initial state, and $\gamma \in (0,1)$ is the discount factor.
II-A State-value function and action-value function
We seek to learn the optimal policy $\pi^{*}$, which has the maximum state-value function $V^{\pi}(s)$ for every state. The state-value function is the expected sum of discounted rewards from a state when following policy $\pi$:
$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r_t \;\Big|\; s_0 = s\right],$$
where $a_t \sim \pi(\cdot|s_t)$, $r_t = r(s_t,a_t)$, and $s_{t+1} \sim p(\cdot|s_t,a_t)$. Similarly, we use the following standard definition of the action-value function $Q^{\pi}(s,a)$:
$$Q^{\pi}(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r_t \;\Big|\; s_0 = s,\, a_0 = a\right].$$
By the dynamic programming principle, we obtain the self-consistency condition,
$$V^{\pi}(s) = \sum_{a}\pi(a|s)\Big[r(s,a) + \gamma\sum_{s'}p(s'|s,a)V^{\pi}(s')\Big], \quad (1)$$
which reveals the relationship between the state values of adjacent states under an arbitrary policy, and the Bellman equation,
$$V^{*}(s) = \max_{a}\Big[r(s,a) + \gamma\sum_{s'}p(s'|s,a)V^{*}(s')\Big].$$
From $\pi(a|s)$ and $p(s'|s,a)$, the state-to-state transition function is
$$p^{\pi}(s'|s) = \sum_{a}\pi(a|s)\,p(s'|s,a).$$
Denoting $\sum_{a}\pi(a|s)r(s,a)$ as $r^{\pi}(s)$, then (1) can be expressed as
$$V^{\pi}(s) = r^{\pi}(s) + \gamma\sum_{s'}p^{\pi}(s'|s)V^{\pi}(s').$$
In vector notation, this becomes
$$V^{\pi} = r^{\pi} + \gamma P^{\pi}V^{\pi},$$
where $[P^{\pi}]_{ss'} = p^{\pi}(s'|s)$ and $[r^{\pi}]_{s} = r^{\pi}(s)$. The state-value function is in fact the fixed point of the self-consistency operator $T^{\pi}$, which is defined as
$$(T^{\pi}V)(s) = \sum_{a}\pi(a|s)\Big[r(s,a) + \gamma\sum_{s'}p(s'|s,a)V(s')\Big].$$
Similarly, we define the Bellman operator $T$ as
$$(TV)(s) = \max_{a}\Big[r(s,a) + \gamma\sum_{s'}p(s'|s,a)V(s')\Big].$$
Both $T^{\pi}$ and $T$ are $\gamma$-contraction mappings with respect to the maximum norm, which means each has a unique fixed point, $V^{\pi}$ and $V^{*}$ respectively. Moreover, the process $V \leftarrow T^{\pi}V$ converges to $V^{\pi}$, and the process $V \leftarrow TV$ converges to $V^{*}$. More interesting for us, the operator $T^{\pi}$ also describes the expected behavior of learning rules such as temporal-difference learning and consequently their learning dynamics [sutton2018reinforcement, tsitsiklis1997analysis].
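These fixed-point properties can be checked numerically. The following sketch (a randomly generated MDP of our own, not from the paper) iterates $T^{\pi}$ and $T$ and compares the result with the closed-form solution $V^{\pi} = (I - \gamma P^{\pi})^{-1} r^{\pi}$:

```python
import numpy as np

def contraction_demo(seed=0, n_s=5, n_a=3, gamma=0.9, iters=500):
    """Iterate the self-consistency and Bellman operators on a random MDP."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_s, n_a, n_s)); P /= P.sum(-1, keepdims=True)  # p(s'|s,a)
    R = rng.random((n_s, n_a))                                      # r(s,a)
    pi = rng.random((n_s, n_a)); pi /= pi.sum(-1, keepdims=True)    # pi(a|s)

    def T_pi(V):  # (T^pi V)(s) = sum_a pi(a|s)[r + gamma sum_s' p V(s')]
        return (pi * (R + gamma * P @ V)).sum(-1)

    def T(V):     # (T V)(s) = max_a [r + gamma sum_s' p V(s')]
        return (R + gamma * P @ V).max(-1)

    V = np.zeros(n_s)
    for _ in range(iters):
        V = T_pi(V)                  # converges to V^pi by gamma-contraction
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = (pi * R).sum(-1)
    V_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)  # closed form

    Vstar = np.zeros(n_s)
    for _ in range(iters):
        Vstar = T(Vstar)             # converges to V*
    return V, V_pi, Vstar
```

The iterated $V$ matches the linear-solve value to numerical precision, and the fixed point of $T$ dominates $V^{\pi}$, as expected.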
II-B Stationary distribution and function approximation
By definition, a distribution $d^{\pi}$ is a stationary state distribution of policy $\pi$ if and only if
$$d^{\pi}(s') = \sum_{s}d^{\pi}(s)\,p^{\pi}(s'|s), \quad \forall s' \in \mathcal{S}. \quad (2)$$
Furthermore, according to the properties of Markov chains [ross1996stochastic], given a policy $\pi$, there exists a unique stationary state distribution $d^{\pi}$ if the Markov chain generated by $\pi$ is indecomposable, aperiodic, and positive-recurrent.
Assumption 1. The Markov chain generated by $\pi$ is indecomposable, aperiodic, and positive-recurrent.
We assume Assumption 1 holds in the following. By the properties of Markov chains, the state distribution approaches the stationary distribution as time goes by, so we generally make the following assumption.
Assumption 2. If a state $s$ is generated by following policy $\pi$, then $s \sim d^{\pi}$.
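Condition (2) suggests a simple numerical procedure: power-iterate the state distribution until it stops changing. The helper below is our own illustrative sketch, not part of the paper's algorithms.

```python
import numpy as np

def stationary_distribution(P_pi, tol=1e-12, max_iter=100000):
    """Power-iterate d <- d P^pi until the state distribution is stationary."""
    n = P_pi.shape[0]
    d = np.full(n, 1.0 / n)          # arbitrary starting distribution
    for _ in range(max_iter):
        d_next = d @ P_pi            # one step of the Markov chain
        if np.abs(d_next - d).max() < tol:
            return d_next            # fixed point of (2)
        d = d_next
    return d
```

For instance, on a two-state chain with $p^{\pi} = \begin{pmatrix}0.9 & 0.1\\ 0.5 & 0.5\end{pmatrix}$, the stationary distribution is $(5/6,\, 1/6)$.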
For large-scale MDPs, there are too many states and/or actions to store, and learning the value of each state individually is too slow. A more practical approach is to solve large-scale MDPs with value function approximation, which generalizes from seen states to unseen states. We approximate the state-value function by a parameterized function $V(s;w)$ with parameters $w$, written $V_w$ for short. Likewise, we approximate the policy by $\pi(a|s;\theta)$ with parameters $\theta$, written $\pi_\theta$ for short. Since the tabular case can be regarded as a special case of a parameterized function, we mainly discuss how to obtain the optimal policy and value function in the parameterized setting.
III Direct methods and indirect methods
Now we are ready to introduce the concepts of direct and indirect methods.
Definition 1 (Direct RL). Direct RL finds the optimal policy by directly maximizing the state-value function $V^{\pi_\theta}(s)$ for all $s \in \mathcal{S}$.
Definition 2 (Indirect RL). Indirect RL finds the optimal policy by solving the Bellman optimality equation for $V^{*}$.
III-A Direct methods
III-A1 Vanilla policy gradient
By Definition 1, direct RL seeks the policy $\pi_\theta$ that maximizes the value function $V^{\pi_\theta}(s)$ for every state. However, due to the limited fitting capacity of the approximate function, practical direct RL algorithms usually maximize the following policy objective function
$$J(\theta) = \mathbb{E}_{s\sim d_0}\big[V^{\pi_\theta}(s)\big] = \sum_{s}d_0(s)V^{\pi_\theta}(s). \quad (3)$$
By the policy gradient theorem [sutton2000policy], the update gradient for the policy function is
$$\nabla_{\theta}J(\theta) = \sum_{s_0}d_0(s_0)\sum_{s}\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_0 \to s, t, \pi)\sum_{a}\nabla_{\theta}\pi(a|s;\theta)\,Q^{\pi}(s,a), \quad (4)$$
where $\Pr(s_0 \to s, t, \pi)$ is the probability of transitioning from $s_0$ to $s$ in $t$ steps while following policy $\pi$. Defining the discounted visiting frequency (DVF)
$$d_{\gamma}^{\pi}(s) = \sum_{s_0}d_0(s_0)\sum_{t=0}^{\infty}\gamma^{t}\Pr(s_0 \to s, t, \pi),$$
the policy update gradient can be expressed as
$$\nabla_{\theta}J(\theta) = \sum_{s}d_{\gamma}^{\pi}(s)\sum_{a}\nabla_{\theta}\pi(a|s;\theta)\,Q^{\pi}(s,a). \quad (5)$$
The core procedure of direct RL is shown in Algorithm 1. However, three obstacles stand in the way of making it practical. First, the properties of the DVF are unclear; second, summation over all states and actions is infeasible; third, the true value function is not accessible. These problems are tackled as follows.
Properties of DVF
In practical applications, it is usually intractable to compute the DVF for each given policy $\pi$. To understand the properties of this distribution, we make the following two propositions.
Proposition 1. When $d_0 = d^{\pi}$, then
$$d_{\gamma}^{\pi}(s) = \frac{1}{1-\gamma}\,d^{\pi}(s).$$
Proof. According to (2), when $d_0 = d^{\pi}$, it is clear that
$$\sum_{s_0}d^{\pi}(s_0)\Pr(s_0 \to s, t, \pi) = d^{\pi}(s), \quad \forall t \ge 0.$$
So it follows that
$$d_{\gamma}^{\pi}(s) = \sum_{t=0}^{\infty}\gamma^{t}d^{\pi}(s) = \frac{1}{1-\gamma}\,d^{\pi}(s). \qquad \blacksquare$$
Before the second proposition, the following lemma is necessary.
Lemma 1. If the state-to-state transition function $p^{\pi}$ of a policy $\pi$ corresponds to an indecomposable, aperiodic, and positive-recurrent Markov chain, then its $t$-step transition function converges to the stationary state distribution of $\pi$, and the average over the first $T$ timesteps also converges to the stationary state distribution of $\pi$ [ross1996stochastic], i.e.,
$$\lim_{t\to\infty}\Pr(s_0 \to s, t, \pi) = d^{\pi}(s), \qquad \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\Pr(s_0 \to s, t, \pi) = d^{\pi}(s).$$
Proposition 2. As $\gamma$ approaches 1,
$$(1-\gamma)\,d_{\gamma}^{\pi}(s) \to d^{\pi}(s).$$
For the last step of the proof, we use the property of indecomposable, aperiodic, and positive-recurrent Markov chains. ∎
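Proposition 2 is easy to verify numerically. The sketch below (a random chain of our own construction) truncates the series $d_{\gamma}^{\pi} = \sum_t \gamma^t d_0 (P^{\pi})^t$ for $\gamma$ close to 1 and compares the normalized DVF with the stationary distribution.

```python
import numpy as np

def dvf_vs_stationary(gamma=0.999, seed=0, n=4, horizon=20000):
    """Check numerically that (1-gamma)*DVF approaches d^pi as gamma -> 1."""
    rng = np.random.default_rng(seed)
    P = rng.random((n, n)); P /= P.sum(-1, keepdims=True)  # p^pi(s'|s)
    d0 = np.full(n, 1.0 / n)                               # arbitrary d_0
    # truncated series: dvf = sum_t gamma^t * (d0 P^t)
    dvf = np.zeros(n); dt = d0.copy()
    for _ in range(horizon):
        dvf += dt
        dt = gamma * (dt @ P)
    dvf *= (1.0 - gamma)                                   # normalize
    # stationary distribution by power iteration
    d = d0.copy()
    for _ in range(5000):
        d = d @ P
    return dvf, d
```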
Note that when we choose $d_0 = d^{\pi_\theta}$, we keep changing the objective function (3) every time the parameters are updated, rather than the objective becoming $\sum_{s}d^{\pi_\theta}(s)V^{\pi_\theta}(s)$. The gradient of the latter objective is not accessible because there is no analytic form relating $\theta$ and its stationary state distribution.
Unbiased estimation of policy gradient
The policy gradient (5) is in the form of an expectation. To estimate it, we collect a batch of samples $\mathcal{B} = \{(s_i, a_i)\}$ generated by policy $\pi$ and approximate the expectation by the sample average, as shown in the following equation
$$\nabla_{\theta}J(\theta) \approx \frac{1}{|\mathcal{B}|}\sum_{(s_i,a_i)\in\mathcal{B}}\nabla_{\theta}\log\pi(a_i|s_i;\theta)\,Q^{\pi}(s_i,a_i). \quad (6)$$
By Assumption 2, this is an unbiased estimate of the policy gradient.
However, the true value function is unknown. It can be approximated by Monte Carlo estimation, i.e., using the episodic return, which gives the REINFORCE algorithm; or it can be approximated with value function approximation, i.e., using a parameterized value function.
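The REINFORCE estimate can be sketched in a few lines. The two-armed bandit below is our own minimal illustration (the episodic return is just the one-step reward, standing in for $Q^{\pi}$), using the log-likelihood form of (6) with a softmax policy:

```python
import numpy as np

def reinforce_two_armed(seed=0, iters=2000, alpha=0.1):
    """REINFORCE with a softmax policy on a 2-armed bandit (illustrative)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    means = np.array([0.0, 1.0])               # arm 1 has higher expected reward
    for _ in range(iters):
        p = np.exp(theta - theta.max()); p /= p.sum()   # softmax policy
        a = int(rng.choice(2, p=p))
        G = means[a] + rng.normal(0.0, 0.1)             # sampled return
        grad_logp = -p.copy(); grad_logp[a] += 1.0      # grad of log pi(a)
        theta += alpha * G * grad_logp                  # REINFORCE update
    p = np.exp(theta - theta.max()); p /= p.sum()
    return p
```

The policy concentrates on the better arm; the large per-sample variance of $G$ is exactly what the baseline and actor-critic refinements discussed above aim to reduce.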
Value function approximation
The value function can be approximated by minimizing the distance between the approximate value function $V_w$ and the true value function $V^{\pi}$. The mean squared error under the stationary state distribution is usually used as the distance, i.e.,
$$J_V(w) = \mathbb{E}_{s\sim d^{\pi}}\Big[\big(V^{\pi}(s) - V(s;w)\big)^{2}\Big].$$
We use gradient descent to minimize it. Its gradient is
$$\nabla_{w}J_V(w) = -2\,\mathbb{E}_{s\sim d^{\pi}}\Big[\big(V^{\pi}(s) - V(s;w)\big)\nabla_{w}V(s;w)\Big].$$
The true value function is not accessible, so we construct the update target $G_t$ from samples under $\pi$ by the Monte Carlo method, i.e.,
$$G_t = \sum_{k=0}^{\infty}\gamma^{k}r_{t+k},$$
or by the temporal difference method, i.e.,
$$G_t = r_t + \gamma V(s_{t+1};w),$$
to approximate the true value function $V^{\pi}(s_t)$.
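The two targets above can be contrasted on a tiny example. The deterministic 2-state chain below is our own illustration: both the Monte Carlo and TD(0) targets drive $V_w$ (here a table) to the same true values, but the TD target bootstraps from the current estimate.

```python
import numpy as np

def value_targets_demo(gamma=0.9, sweeps=3000, alpha=0.05):
    """Monte Carlo vs. TD(0) targets for evaluating a fixed policy.

    Episode: state 0 -(r=0)-> state 1 -(r=1)-> terminal, deterministically.
    True values: V(0) = gamma, V(1) = 1.
    """
    true_V = np.array([gamma, 1.0])
    V_mc = np.zeros(2)
    V_td = np.zeros(2)
    for _ in range(sweeps):
        # Monte Carlo targets: full discounted return from each state
        G1 = 1.0
        G0 = 0.0 + gamma * G1
        V_mc[1] += alpha * (G1 - V_mc[1])
        V_mc[0] += alpha * (G0 - V_mc[0])
        # TD(0) targets: r + gamma * V(s'), bootstrapped from the estimate
        V_td[1] += alpha * (1.0 - V_td[1])
        V_td[0] += alpha * (0.0 + gamma * V_td[1] - V_td[0])
    return V_mc, V_td, true_V
```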
By using value function approximation in the policy gradient estimate, the actor-critic architecture can be derived from the direct method, as shown in Fig. 1.
III-A2 Other direct algorithms
There are several other forms of policy gradient. To reduce the variance of the estimated gradient without adding bias, a value function is used as a baseline [sutton2018reinforcement]. To obtain robust policy improvement, a KL divergence or clipping penalty is employed to constrain policy variation [kakade2002natural, schulman2015trust, wu2017scalable, schulman2017proximal]. To make model-free RL algorithms more efficient, deterministic policies are applied to reduce the number of samples needed to estimate the gradient accurately [silver2014deterministic]. To enhance sample efficiency, several off-policy policy gradient methods employ a separate behavior policy for exploration. Off-policy AC adopts importance sampling (IS) to correct the action distribution when estimating the policy gradient [degris2012off]. To reduce its variance, ACER uses truncated IS with a correction term, and Reactor uses a "leave-one-out" technique at the cost of introducing bias, while IPG interpolates between the on-policy and off-policy gradients to stabilize learning [wang2016sample, gruslys2018reactor, gu2017interpolated]. Although Off-PAC is simple and widely used, it is actually a semi-gradient; to this end, ACE and Geoff-AC make use of emphatic methods to give the true off-policy gradient, though both suffer from large variance [imani2018off, zhang2019generalized]. Some methods attempt to relieve the variance issue by dropping the IS correction term: DDPG and TD3 take advantage of deterministic policies, which naturally eliminate the IS factor in the policy gradient [lillicrap2015continuous, fujimoto2018addressing], while SAC, Soft Q-learning, and Trust-PCL all work in the entropy-regularized framework and optimize policy parameters with off-policy data without IS [haarnoja2018soft, haarnoja2017reinforcement, nachum2017trust]. SIL exploits only good state-action pairs in the replay buffer and can be viewed as an implementation of lower-bound-soft-Q-learning under the entropy-regularized RL framework [oh2018self]. Off-policy mechanisms also enable RL to scale, as in A3C (A2C), Ape-X, and IMPALA [mnih2016asynchronous, horgan2018distributed, espeholt2018impala].
III-B Indirect methods
III-B1 Policy iteration and value iteration
By Definition 2, indirect methods seek the solution $V^{*}$ of the Bellman equation
$$V^{*}(s) = \max_{a}\Big[r(s,a) + \gamma\sum_{s'}p(s'|s,a)V^{*}(s')\Big],$$
and acquire the optimal policy indirectly by
$$\pi^{*}(s) = \arg\max_{a}\Big[r(s,a) + \gamma\sum_{s'}p(s'|s,a)V^{*}(s')\Big].$$
There are several ways to solve the Bellman equation, among which the most typical are policy iteration and value iteration.
In the policy iteration algorithm, we start with an arbitrary policy $\pi_0$ and generate a sequence of new policies $\pi_1, \pi_2, \ldots$. Given policy $\pi_k$, we perform a policy evaluation (PEV) step that computes $V^{\pi_k}$ as the solution of the system
$$V(s) = \sum_{a}\pi_k(a|s)\Big[r(s,a) + \gamma\sum_{s'}p(s'|s,a)V(s')\Big]$$
by fixed-point iteration, i.e., $V \leftarrow T^{\pi_k}V$, until convergence. We then perform a policy improvement (PIM) step, which computes a new policy $\pi_{k+1}$ that satisfies
$$\pi_{k+1}(s) = \arg\max_{a}\Big[r(s,a) + \gamma\sum_{s'}p(s'|s,a)V^{\pi_k}(s')\Big].$$
Then we go back to the PEV step. This process continues until $V^{\pi_{k+1}}(s) = V^{\pi_k}(s)$ for all $s$. From the theory of dynamic programming, the policy iteration algorithm is guaranteed to terminate with the optimal policy $\pi^{*}$.
In the value iteration algorithm, we start with an arbitrary initial value function $V_0$ and keep performing
$$V_{k+1} = TV_k,$$
so the generated sequence of value functions converges to $V^{*}$. Finally, the optimal policy is found by computing a policy that satisfies
$$\pi^{*}(s) = \arg\max_{a}\Big[r(s,a) + \gamma\sum_{s'}p(s'|s,a)V^{*}(s')\Big].$$
Value iteration can be seen as a special case of policy iteration: it iterates only once in the PEV step, whereas policy iteration iterates in PEV until convergence.
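The two algorithms can be compared on a small tabular MDP. The random MDP and function names below are our own sketch: policy iteration uses an exact linear solve for PEV plus greedy PIM, while value iteration performs one Bellman backup per sweep; both recover the same optimal policy and value.

```python
import numpy as np

def pi_vs_vi(seed=1, n_s=4, n_a=2, gamma=0.9):
    """Tabular policy iteration vs. value iteration on a random MDP."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_s, n_a, n_s)); P /= P.sum(-1, keepdims=True)
    R = rng.random((n_s, n_a))

    def q_from_v(V):                       # Q(s,a) = r + gamma sum_s' p V(s')
        return R + gamma * P @ V

    # Policy iteration: exact PEV (linear solve) + greedy PIM until stable
    pi = np.zeros(n_s, dtype=int)
    while True:
        P_pi = P[np.arange(n_s), pi]       # p(s'|s, pi(s))
        r_pi = R[np.arange(n_s), pi]
        V = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)   # PEV
        pi_new = q_from_v(V).argmax(-1)                          # PIM
        if np.array_equal(pi_new, pi):
            break
        pi = pi_new

    # Value iteration: one Bellman backup per sweep
    V_vi = np.zeros(n_s)
    for _ in range(1000):
        V_vi = q_from_v(V_vi).max(-1)
    pi_vi = q_from_v(V_vi).argmax(-1)
    return pi, V, pi_vi, V_vi
```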
III-B2 Approximate policy iteration
We take approximate policy iteration as an example to illustrate how indirect methods work in the function approximation setting.
Inspired by the PEV step of policy iteration, in every iteration approximate policy iteration seeks the solution of the self-consistency condition of $\pi_\theta$, i.e., $V_w = T^{\pi_\theta}V_w$. Like policy iteration, the PEV step of approximate policy iteration repeatedly computes the update target $G = T^{\pi_\theta}V_w$ and minimizes the distance between $V_w$ and $G$ until $V_w$ converges. As with value approximation in direct methods, the mean squared error under the stationary state distribution is used as the distance, shown as
$$J_{\mathrm{PEV}}(w) = \mathbb{E}_{s\sim d^{\pi_\theta}}\Big[\big(G(s) - V(s;w)\big)^{2}\Big].$$
Its gradient is
$$\nabla_{w}J_{\mathrm{PEV}}(w) = -2\,\mathbb{E}_{s\sim d^{\pi_\theta}}\Big[\big(G(s) - V(s;w)\big)\nabla_{w}V(s;w)\Big].$$
Similarly, in the PIM step, approximate policy iteration seeks to minimize the distance between $T^{\pi_\theta}V_w$ and $TV_w$. The distance is chosen as their absolute error under some state distribution independent of $\theta$, e.g., $d_0$, as shown in the following equation
$$J_{\mathrm{PIM}}(\theta) = \mathbb{E}_{s\sim d_0}\Big[\big|(TV_w)(s) - (T^{\pi_\theta}V_w)(s)\big|\Big].$$
It is obvious that $(T^{\pi_\theta}V_w)(s) \le (TV_w)(s)$, and $(TV_w)(s)$ is a constant irrelevant to $\theta$. We can thus remove the absolute-value operator and equivalently maximize the objective
$$J_{\mathrm{PIM}}'(\theta) = \mathbb{E}_{s\sim d_0}\big[(T^{\pi_\theta}V_w)(s)\big] = \mathbb{E}_{s\sim d_0}\Big[\sum_{a}\pi(a|s;\theta)\,Q_w(s,a)\Big], \quad (7)$$
where $Q_w(s,a) = r(s,a) + \gamma\sum_{s'}p(s'|s,a)V(s';w)$. Its gradient is
$$\nabla_{\theta}J_{\mathrm{PIM}}'(\theta) = \mathbb{E}_{s\sim d_0}\Big[\sum_{a}\nabla_{\theta}\pi(a|s;\theta)\,Q_w(s,a)\Big], \quad (8)$$
which can be estimated from samples in the same way as equation (6).
III-B3 Other indirect algorithms
In practice, approximate dynamic programming (ADP) is a class of methods that seeks the Bellman solution; moreover, most indirect methods are value-based methods, which have no explicit policy and perform policy improvement through the greedy policy w.r.t. the Q-function [powell2007approximate]. Q-learning is a typical traditional indirect method, and its extension DQN achieved great success in deep RL on computer games [watkins1992q, mnih2015human]. DDQN partially addresses the overestimation problem of Q-learning by decoupling the selection and evaluation of the bootstrap action [van2016deep]. PER replays important transitions more frequently to learn more efficiently than DQN [schaul2015prioritized]. Dueling DQN uses a dueling architecture consisting of two streams that represent the value and advantage functions, which helps to generalize across actions [wang2015dueling]. C51 learns a categorical distribution of discounted returns instead of estimating the mean [bellemare2017distributional]. Rainbow combines these independent improvements to the DQN algorithm and provides state-of-the-art performance on the Atari 2600 benchmark [hessel2018rainbow]. NAF was proposed as a continuous variant of Q-learning, which can be regarded as an alternative to policy gradient methods [gu2016continuous]. While Q-learning and its variants learn by one-step bootstrapping, Retrace($\lambda$) is the first return-based off-policy control algorithm converging almost surely to $Q^{*}$ without the GLIE (Greedy in the Limit with Infinite Exploration) assumption [munos2016safe, singh2000convergence].
Several works have drawn a connection between policy gradient methods and Q-learning in the framework of entropy regularization. O'Donoghue et al. (2016) decomposed the Q-function into a policy part and a value part, inspired by dueling Q-networks, and showed that taking the gradient of the Bellman error of the Q-function leads to a result similar to the policy gradient [o2016combining]. Nachum et al. (2017) proposed the path consistency learning (PCL) algorithm based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization, which can be interpreted as generalizing both policy gradient and Q-learning algorithms [nachum2017bridging]. Haarnoja et al. (2017) used Stein variational gradient descent to derive a procedure that jointly updates the Q-function and policy, which approximately samples from the Boltzmann distribution [haarnoja2017reinforcement]. Schulman et al. (2017) showed that there is a precise equivalence between Q-learning and policy gradient methods in the framework of entropy-regularized reinforcement learning, where "soft" (entropy-regularized) Q-learning methods are secretly implementing policy gradient updates [schulman2017equivalence].
While these works connect direct and indirect methods without a parameterized policy, we establish the equivalence between direct and indirect methods with a parameterized policy. We take vanilla policy gradient and approximate policy iteration as examples for comparison. Since they have exactly the same gradient in the policy evaluation procedure (value approximation), we focus on the policy improvement procedure (policy update), where there are several differences between the direct (5) and indirect (8) policy gradients.
The value function in the policy gradient is different: the objective function of the indirect policy gradient (7) depends on the result of policy evaluation, which is fixed in the policy improvement step, so we do not have to unroll it when taking the gradient; as a result, the value function in its gradient (8) is also fixed and independent of $\theta$. The objective function of direct methods is the value itself, which is a function not only of the state but also of the policy parameters $\theta$. When we take its gradient, we have to unroll it until we obtain the form of (5); as a consequence, the value in the gradient is also a function of $\theta$, i.e., the true value function. This is one of the main differences between direct and indirect RL methods. Note, however, that this difference naturally disappears when we estimate the gradient, because the true value function in the direct gradient is not accessible and can only be estimated by the value approximation used in the indirect gradient.
The state distribution in the policy gradient is different: although the indirect and direct policy gradients both seek to optimize the value function under the initial state distribution $d_0$, their gradients are expectations under different state distributions. For direct RL methods, due to the unrolling effect, the direct policy gradient is an expectation under the discounted visiting frequency, which approximates the stationary distribution when we choose the initial state distribution to be the stationary state distribution at every iteration (Proposition 1) or when $\gamma \to 1$ (Proposition 2). For indirect RL methods, the indirect policy gradient is an expectation under the initial state distribution; the only way to match the stationary distribution is to choose the initial state distribution to be the stationary state distribution at every iteration. When estimating the direct gradient, we should use samples generated by the current policy $\pi_\theta$, whose state distribution is assumed to be the stationary state distribution of $\pi_\theta$ by Assumption 2; when estimating the indirect gradient, we should use samples drawn from the initial state distribution.
By the analysis above, because the direct policy gradient must resort to an approximate value function in practice, the first difference concerning the value function is naturally eliminated. The only remaining difference concerns the state distribution. However, this difference can also be eliminated if both direct and indirect methods adopt, at every iteration of the policy update, an objective whose initial distribution is the current stationary state distribution. In conclusion, direct methods are equivalent to indirect methods as long as we choose $d_0 = d^{\pi_\theta}$ at each iteration, as shown in Fig. 3.
TABLE I: Classification of RL algorithms by the direct/indirect criterion versus the value-based/policy-based criterion

| | Value-based | Policy-based |
|---|---|---|
| Indirect | DP [bellman1966dynamic], Soft Q-learning [haarnoja2017reinforcement], Q-learning [watkins1992q], DQN [mnih2015human], Dueling DQN [wang2015dueling], Rainbow [hessel2018rainbow], DDQN [van2016deep], PER [schaul2015prioritized], C51 [bellemare2017distributional], NAF [gu2016continuous] | ADP [powell2007approximate], HDP [venayagamoorthy2002comparison], ADGDHP [zhang2013overview], ADHDP [fuselli2013action], DHP [ni2015model], GDHP [szuster2016globalized], CDADP [duan2019deep], DGPI [duan2019generalized] |
| Direct | — | Natural PG [kakade2002natural], TRPO [schulman2015trust], PPO [schulman2017proximal], DPG [silver2014deterministic], Off-PAC [degris2012off], ACER [wang2016sample], Reactor [gruslys2018reactor], IPG [gu2017interpolated], ACE [imani2018off], Geoff-AC [zhang2019generalized], DDPG [lillicrap2015continuous], TD3 [fujimoto2018addressing], SAC [haarnoja2018soft], SIL [oh2018self], ACKTR [wu2017scalable], Trust-PCL [nachum2017trust], I2A [racaniere2017imagination], A3C (A2C) [mnih2016asynchronous], APE-X [horgan2018distributed], IMPALA [espeholt2018impala], MVE [feinberg2018model], STEVE [buckman2018sample] |
TABLE II: Classification of RL algorithms by the direct/indirect criterion versus the model-based/model-free criterion

| | Model-based | Model-free |
|---|---|---|
| Indirect | DP [bellman1966dynamic], ADP [powell2007approximate], HDP [venayagamoorthy2002comparison], ADHDP [fuselli2013action], DHP [ni2015model], GDHP [szuster2016globalized], ADGDHP [zhang2013overview], CDADP [duan2019deep], DGPI [duan2019generalized] | Soft Q-learning [haarnoja2017reinforcement], Q-learning [watkins1992q], TD($\lambda$) [tsitsiklis1994asynchronous], DQN [mnih2015human], GAE [schulman2015high], DDQN [van2016deep], PER [schaul2015prioritized], Dueling DQN [wang2015dueling], C51 [bellemare2017distributional], Rainbow [hessel2018rainbow], NAF [gu2016continuous], Retrace($\lambda$) [munos2016safe] |
| Direct | MVE [feinberg2018model], STEVE [buckman2018sample], ME-TRPO [kurutach2018model], PILCO [deisenroth2011pilco], Recurrent world models [ha2018recurrent], GPS [levine2013guided], I2A [racaniere2017imagination] | Natural PG [kakade2002natural], TRPO [schulman2015trust], ACKTR [wu2017scalable], PPO [schulman2017proximal], DPG [silver2014deterministic], Off-PAC [degris2012off], ACER [wang2016sample], Reactor [gruslys2018reactor], IPG [gu2017interpolated], ACE [imani2018off], Geoff-AC [zhang2019generalized], DDPG [lillicrap2015continuous], TD3 [fujimoto2018addressing], SAC [haarnoja2018soft], Trust-PCL [nachum2017trust], SIL [oh2018self], A3C (A2C) [mnih2016asynchronous], APE-X [horgan2018distributed] |
IV Convergence results
IV-A Direct methods
Before getting into the convergence analysis of direct methods, we first define some notation for convenience. Consider the following policy update process,
$$\theta_{t+1} = \theta_t + \alpha_t\big(\nabla_{\theta}J(\theta_t) + w_t\big), \quad (9)$$
where $w_t$ is the error of the stochastic gradient estimate. Denoting $\theta_t$ as $x_t$, $\nabla_{\theta}J(\theta_t)$ as $s_t$, and $J$ as $f$, process (9) becomes
$$x_{t+1} = x_t + \alpha_t(s_t + w_t),$$
and we have the following theorem:
Theorem 1 (Stochastic gradient theorem [bertsekas2000gradient]). Let $\{x_t\}$ be a sequence generated by the method
$$x_{t+1} = x_t + \alpha_t(s_t + w_t),$$
where $\alpha_t$ is a deterministic positive stepsize, $s_t = \nabla f(x_t)$ is the steepest ascent direction, and $w_t$ is a random noise term. Let $\{\mathcal{F}_t\}$ be an increasing sequence of $\sigma$-fields. We assume the following:
(a) $x_t$ and $s_t$ are $\mathcal{F}_t$-measurable.
(b) (Lipschitz continuity of $\nabla f$) The function $f$ is continuously differentiable and there exists a constant $L$ such that
$$\|\nabla f(x) - \nabla f(\bar{x})\| \le L\|x - \bar{x}\|, \quad \forall x, \bar{x}.$$
(c) We have, for all $t$ and with probability 1,
$$\mathbb{E}[w_t \mid \mathcal{F}_t] = 0, \qquad \mathbb{E}\big[\|w_t\|^2 \mid \mathcal{F}_t\big] \le A\big(1 + \|\nabla f(x_t)\|^2\big),$$
where $A$ is a positive deterministic constant.
(d) The stepsize $\alpha_t$ is positive and satisfies
$$\sum_{t=0}^{\infty}\alpha_t = \infty, \qquad \sum_{t=0}^{\infty}\alpha_t^2 < \infty.$$
Then either $f(x_t) \to \infty$ or $f(x_t)$ converges to a finite value and $\lim_{t\to\infty}\nabla f(x_t) = 0$ with probability 1. Furthermore, every limit point of $\{x_t\}$ is a stationary point of $f$.
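Conditions (c) and (d) can be illustrated on a one-dimensional toy problem. The sketch below (our own example, not from the cited theorem) performs noisy gradient ascent on $f(x) = -(x-2)^2$ with the classic Robbins-Monro stepsizes $\alpha_t = 1/t$, which satisfy $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$:

```python
import numpy as np

def noisy_gradient_ascent(seed=0, iters=20000):
    """Ascent on f(x) = -(x-2)^2 with noisy gradients and diminishing steps."""
    rng = np.random.default_rng(seed)
    x = 10.0
    for t in range(1, iters + 1):
        grad = -2.0 * (x - 2.0)        # exact gradient of f (Lipschitz, cond. (b))
        noise = rng.normal(0.0, 1.0)   # zero-mean, bounded-variance error (cond. (c))
        alpha = 1.0 / t                # sum = inf, sum of squares < inf (cond. (d))
        x += alpha * (grad + noise)
    return x
```

Despite unit-variance noise on every gradient, the iterate settles near the maximizer $x = 2$, consistent with the theorem's conclusion that the gradient vanishes along the sequence.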
IV-B Indirect methods
We only establish the convergence of approximate policy iteration. We consider an approximate policy iteration algorithm that generates a sequence of policies $\pi_k$ and a corresponding sequence of approximate value functions $V_k$ satisfying
$$\|V_k - V^{\pi_k}\|_{\infty} \le \epsilon, \qquad \|T^{\pi_{k+1}}V_k - TV_k\|_{\infty} \le \delta, \qquad k = 0, 1, \ldots,$$
where $\epsilon$ and $\delta$ are some positive scalars. The scalar $\epsilon$ is an assumed worst-case bound on the error incurred during policy evaluation, and $\delta$ is a bound on the error incurred in the course of the computations required for a policy update. Then we have the following theorem.
Theorem 2 (Error bound for approximate policy iteration [bertsekas1996neuro]). The sequence of policies $\pi_k$ generated by the approximate policy iteration algorithm satisfies
$$\limsup_{k\to\infty}\|V^{\pi_k} - V^{*}\|_{\infty} \le \frac{\delta + 2\gamma\epsilon}{(1-\gamma)^2}.$$
V Classification of RL algorithms
In this section, we classify mainstream RL algorithms using the direct/indirect criterion. In Table I, we compare it with the value-based versus policy-based criterion. We find that most model-based methods, e.g., ADP, are classified as policy-based but are actually indirect methods. Moreover, all value-based methods are indirect methods, because direct methods require a parameterized policy. We also compare the criterion with the model-based versus model-free criterion in Table II.
In this paper, we group current RL algorithms into direct and indirect methods. Direct methods are defined as algorithms that find the optimal policy by directly optimizing the expectation of the accumulated future reward using gradient-based methods, while indirect methods are defined as algorithms that obtain the optimal policy by indirectly solving the sufficient and necessary condition derived from Bellman's principle of optimality, i.e., the Bellman equation. We take vanilla policy gradient and approximate policy iteration to study their internal relationship and reveal that both direct and indirect methods can be unified in the actor-critic architecture and are equivalent if we always choose the stationary state distribution of the current policy as the initial state distribution of the MDP. Moreover, by the theory of stochastic gradient methods, convergence of direct methods can be guaranteed if the gradient error has zero mean and bounded second moment; for indirect methods, the error upper bound is determined by the error bounds of the PEV and PIM steps. Finally, we classify current mainstream RL algorithms and compare the direct/indirect criterion with other criteria, including value-based versus policy-based and model-based versus model-free.
We would like to thank Mr. Zhengyu Liu and Dr. Qi Sun for their valuable suggestions throughout this research.