I Introduction
Reinforcement learning (RL) algorithms have been applied to, and achieved good performance in, a wide variety of challenging domains, from games to robotic control [sutton2018reinforcement, mnih2015human, silver2016mastering, silver2017mastering, vinyals2019grandmaster]. Over the last few decades, numerous RL algorithms have appeared in the literature on sequential decision-making and optimal control. These RL algorithms are generally divided into value-based methods and policy-based methods, depending on whether a parameterized policy is learned. In value-based RL, a tabular or parameterized state (or state-action) value function is learned, and the optimal policy is directly calculated or derived from the value function. In contrast, policy-based RL directly parameterizes the policy and updates its parameters in some way.
Early RL algorithms mostly belong to the value-based category, which directly derives the optimal policy from a learned value function. Policy iteration (PI) is considered the first value-based algorithm, in which policy evaluation (PEV) and policy improvement (PIM) keep alternating until the action-value function converges to the optimal action-value function. In the PEV step, the action-value function is updated under a fixed policy until it converges; in the PIM step, a better policy is obtained according to the updated action-value function [howard1960dynamic]. Value iteration (VI) is a form of truncated policy iteration, in which the action-value function is updated only once in the PEV step instead of being updated until convergence [puterman1978modified]. However, PI and VI require complete knowledge of the environment in the PEV step to calculate the action-value estimate in a bootstrapping way. Monte Carlo (MC) algorithms were proposed to approximate the action-value function by averaging episodic returns, so that no model is needed [sutton2018reinforcement]. But they suffer from large variance and cannot perform updates until the end of an episode. SARSA and Q-learning are two famous temporal-difference (TD) algorithms; they combine ideas from PI and MC and are the most widely used algorithms in RL
[sutton1988learning]. They calculate the action-value estimate of a state-action pair in the PEV step by bootstrapping only to its next sampled state-action pair, instead of all of its adjacent state-action pairs or the whole episodic return from it, which greatly reduces estimation variance and speeds up the learning process at the cost of adding bias. The main difference between SARSA and Q-learning lies in which policy's experiences they use to calculate the action-value estimate. SARSA uses experiences generated by the target policy, i.e., the policy we aim to update, which requires the target policy to remain stochastic for exploration and thus limits the optimality of the algorithm, because the optimal policy is usually deterministic [rummery1994line]. Q-learning relieves this limitation by learning the greedy policy while exploring with a different, non-greedy (e.g., ε-greedy) policy, which is one of the early breakthroughs in RL [watkins1992q]. However, action-value methods are not feasible for continuous or large discrete action spaces, because finding the greedy policy is impractical in such action spaces. To solve this problem, methods that learn a parameterized policy were proposed, called policy gradient (PG) methods, in which the parameterized policy enables actions to be taken without consulting the action-value function. PG methods can learn specific probabilities for taking the actions. Besides, PG methods can approach deterministic policies asymptotically and naturally handle continuous action spaces. Marbach and Tsitsiklis (2001) obtained a policy gradient theorem, which gives an exact formula for how performance is affected by the policy parameters that does not involve derivatives of the state distribution, providing a theoretical foundation for PG methods
[marbach2001simulation]. The REINFORCE method follows directly from the policy gradient theorem; it uses the episodic return to estimate the action-value function and is thus the first practical application of the policy gradient theorem [williams1992simple]. Similar to MC methods, REINFORCE suffers from large variance. Williams (1992) showed that adding a state-value function as a baseline reduces REINFORCE's variance without introducing bias [williams1992simple]. Actor-critic (AC) methods use a one-step TD method for action-value estimation, further reducing variance at the cost of introducing bias [degris2012off]. Beyond variance reduction, PG methods have been improved in several other directions. Kakade (2002) proposed the natural policy gradient (NPG), which updates policy parameters in the space normed by the Fisher information matrix; this eliminates the influence of how the policy is parameterized and yields a more stable gradient in parameter space [kakade2002natural]. Degris et al. (2012) introduced off-policy actor-critic to enable gradient estimation from experience data of any policy [degris2012off]. Silver et al. (2014) introduced the policy gradient theorem for deterministic policies, and developed on-policy and off-policy deterministic policy gradient algorithms based on it [silver2014deterministic]. With the rise of deep learning, many traditional RL algorithms have been extended to deep RL algorithms by using deep neural networks as policy and value function estimators. For traditional action-value methods, DQN combined Q-learning with convolutional neural networks and experience replay, enabling agents to learn to play many Atari games at human-level performance from raw pixels; this kickstarted many recent successes in scaling RL to complex sequential decision-making problems, such as Double DQN, Prioritized Experience Replay, the Dueling network architecture, Distributional Q-learning, and their combination, Rainbow
[van2016deep, schaul2015prioritized, wang2015dueling, bellemare2017distributional, hessel2018rainbow]. For traditional policy gradient methods, A3C combined actor-critic with fully connected networks and succeeded in a wide variety of continuous motor control problems [mnih2016asynchronous]. Deep DPG (DDPG) is an extension of DPG and successfully solves more than 20 simulated physics tasks [lillicrap2015continuous]. These many RL algorithms are usually categorized by the way they choose actions, i.e., value-based, policy-based, and actor-critic. However, we observe that two fundamental mechanisms for finding the optimal policy underlie these RL algorithms. One of them is what we call indirect methods, which acquire the optimal policy by solving the Bellman equation of the action-value function and deriving the optimal action from it. The other is direct methods, which seek the optimal policy by directly optimizing an objective with respect to policy performance. In this paper, we reveal that the two classes of methods are equivalent and can be unified under the actor-critic architecture if certain conditions on the initial state distribution of the problem hold. Besides, convergence results are introduced for both direct and indirect methods. Finally, we also classify current mainstream RL algorithms by this criterion, and compare it with other criteria, including value-based vs. policy-based and model-based vs. model-free.
The rest of this paper is organized as follows. Section II introduces preliminaries on value functions and stationary distributions. Section III introduces the concepts of direct and indirect methods, and establishes their equivalence and unification. Section IV introduces convergence results for both direct and indirect methods. Section V classifies current main RL algorithms using our criterion and compares it with other criteria. Section VI summarizes this work.
II Preliminaries
We study the standard reinforcement learning (RL) setting in which an agent interacts with an environment by observing a state $s_t$, selecting an action $a_t$, receiving a reward $r_t$, and observing the next state $s_{t+1}$. We model this process with a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, p, r, d_0, \gamma)$. Here, $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, and $p(s' \mid s, a)$ is the transition function. Throughout, we will assume that $\mathcal{S}$ and $\mathcal{A}$ are finite sets and write $n = |\mathcal{S}|$. A policy $\pi(a \mid s)$ maps a state to a distribution over actions, $r(s, a)$ is the reward function, $d_0$ is the distribution of the initial state, and $\gamma \in [0, 1)$ is the discount factor.
II-A State-value function and action-value function
We seek to learn the optimal policy $\pi^*$, which has the maximum state-value function $v^*$. The state-value function is the expected sum of discounted rewards from a state when following policy $\pi$:
$$v^\pi(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s\Big],$$
where $r_t = r(s_t, a_t)$, $a_t \sim \pi(\cdot \mid s_t)$, and $s_{t+1} \sim p(\cdot \mid s_t, a_t)$. Similarly, we use the following standard definition of the action-value function $q^\pi$:
$$q^\pi(s, a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, s_0 = s, a_0 = a\Big].$$
By the dynamic programming principle, we can get the self-consistency condition,
$$v^\pi(s) = \sum_a \pi(a \mid s)\Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v^\pi(s')\Big], \qquad (1)$$
which reveals the relationship between the state values of adjacent states under an arbitrary policy, and the Bellman equation,
$$v^*(s) = \max_a \Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v^*(s')\Big].$$
From $\pi$ and $p$, the state-to-state transition function is
$$p^\pi(s' \mid s) = \sum_a \pi(a \mid s)\, p(s' \mid s, a).$$
Denoting $\sum_a \pi(a \mid s)\, r(s, a)$ as $r^\pi(s)$, then (1) can be expressed as
$$v^\pi(s) = r^\pi(s) + \gamma \sum_{s'} p^\pi(s' \mid s)\, v^\pi(s').$$
In vector notation, this becomes
$$v^\pi = r^\pi + \gamma P^\pi v^\pi,$$
where $v^\pi, r^\pi \in \mathbb{R}^n$ and $P^\pi \in \mathbb{R}^{n \times n}$ with $[P^\pi]_{ss'} = p^\pi(s' \mid s)$. The state-value function $v^\pi$ is in fact the fixed point of the self-consistency operator $T^\pi$, which is defined as
$$T^\pi v = r^\pi + \gamma P^\pi v.$$
Similarly, we define the Bellman operator $T$ as
$$(T v)(s) = \max_a \Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s')\Big].$$
Both $T^\pi$ and $T$ are contraction mappings with respect to the maximum norm, which means $T^\pi$ and $T$ have unique fixed points, $v^\pi$ and $v^*$ respectively. Besides, the process $v_{k+1} = T^\pi v_k$ converges to $v^\pi$, and the process $v_{k+1} = T v_k$ converges to $v^*$. More interestingly for us, the operator $T^\pi$ also describes the expected behavior of learning rules such as temporal-difference learning and consequently their learning dynamics [sutton2018reinforcement, tsitsiklis1997analysis].
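As a sanity check of the contraction property, the following sketch iterates the self-consistency operator $T^\pi$ on a hypothetical two-state MDP (the transition matrix, rewards, and discount factor are illustrative assumptions, not taken from this paper) and compares the result with the closed-form fixed point $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$:

```python
import numpy as np

# Hypothetical 2-state MDP under a fixed policy pi (all numbers illustrative).
gamma = 0.9
n = 2
r_pi = np.array([1.0, 0.0])        # r^pi[s]: expected one-step reward under pi
P_pi = np.array([[0.8, 0.2],       # P^pi[s, s']: state-to-state transitions
                 [0.3, 0.7]])

def T_pi(v):
    """Self-consistency operator: T^pi v = r^pi + gamma * P^pi v."""
    return r_pi + gamma * P_pi @ v

# Iterating T^pi from an arbitrary start converges to the fixed point v^pi.
v = np.zeros(n)
for _ in range(1000):
    v = T_pi(v)

# Closed-form fixed point: v^pi = (I - gamma * P^pi)^{-1} r^pi.
v_closed = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
assert np.allclose(v, v_closed)
```

Because $T^\pi$ is a $\gamma$-contraction in the maximum norm, the error shrinks by at least a factor of $\gamma$ per iteration, so a few hundred iterations suffice here.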
II-B Stationary distribution and function approximation
From the definition of a stationary distribution, a distribution $d^\pi$ is a stationary distribution under $\pi$ if and only if
$$d^\pi(s') = \sum_s d^\pi(s)\, p^\pi(s' \mid s). \qquad (2)$$
Furthermore, according to the properties of Markov chains [ross1996stochastic], given a policy $\pi$, there exists a unique stationary state distribution $d^\pi$ if the Markov chain generated by $\pi$ is indecomposable, aperiodic, and positive-recurrent.
Assumption 1.
The Markov chain generated by $\pi$ is indecomposable, aperiodic, and positive-recurrent.
We will assume Assumption 1 holds in the following. By the properties of Markov chains, the state distribution always becomes stationary as time goes by, so we generally make the following assumption.
Assumption 2.
If $s_t$ is generated by policy $\pi$, then $s_t \sim d^\pi$.
For a large-scale MDP, there are too many states and/or actions to store, and learning the value function of each state individually is too slow. A more practical way is to solve large-scale MDPs by value function approximation, which generalizes RL from seen states to unseen states. We approximate the state-value function by a parameterized value function $v(s; w)$, where $w \in \mathbb{R}^{d_w}$; we write $v_w$ or $v$ for short. Besides, we approximate the policy by $\pi(a \mid s; \theta)$, where $\theta \in \mathbb{R}^{d_\theta}$; we write $\pi_\theta$ or $\pi$ for short. Since the tabular case can be regarded as a special case of the parameterized function, we will mainly discuss how to obtain the optimal policy function and value function in the following.
III Direct methods and indirect methods
Now, we are ready to introduce concepts of direct and indirect methods.
Definition 1.
(Direct RL). Direct RL finds the optimal policy by directly maximizing the state-value function $v^{\pi_\theta}(s)$ for states drawn from the initial state distribution $d_0$.
Definition 2.
(Indirect RL). Indirect RL finds the optimal policy by solving Bellman's optimality equation for $v^*$ (or $q^*$).
III-A Direct methods
III-A1 Vanilla policy gradient
By Definition 1, direct RL seeks to find $\theta$ that maximizes the value function $v^{\pi_\theta}$. However, due to the limited fitting ability of the approximation function, current direct RL algorithms usually maximize the following policy objective function:
$$J(\theta) = \mathbb{E}_{s \sim d_0}\big[v^{\pi_\theta}(s)\big]. \qquad (3)$$
By the policy gradient theorem [sutton2000policy], the update gradient for the policy function is
$$\nabla_\theta J(\theta) = \sum_{s_0} d_0(s_0) \sum_s \sum_{t=0}^{\infty} \gamma^t \Pr(s_0 \to s, t, \pi) \sum_a \nabla_\theta \pi(a \mid s; \theta)\, q^\pi(s, a),$$
where $\Pr(s_0 \to s, t, \pi)$ is the state distribution at time $t$ starting from state $s_0$ and following $\pi$, i.e., the probability of transitioning from $s_0$ to $s$ in $t$ steps under policy $\pi$. Defining the discounted visiting frequency (DVF)
$$d^\pi_\gamma(s) = (1 - \gamma) \sum_{s_0} d_0(s_0) \sum_{t=0}^{\infty} \gamma^t \Pr(s_0 \to s, t, \pi),$$
the policy update gradient can be expressed as
$$\nabla_\theta J(\theta) = \frac{1}{1 - \gamma} \sum_s d^\pi_\gamma(s) \sum_a \nabla_\theta \pi(a \mid s; \theta)\, q^\pi(s, a). \qquad (4)$$
The core procedure of direct RL is shown in Algorithm 1. However, there are three obstacles to making it practical. First, the properties of the DVF are not clear; second, summation over all states and actions is impossible; third, the true value function is not accessible. These problems are tackled as follows.

Properties of DVF
In practical applications, it is usually intractable to calculate the DVF $d^\pi_\gamma$ for each given $\pi$. To understand the properties of this distribution, we make the following two propositions.
Proposition 1.
When $d_0 = d^\pi$ holds, then $d^\pi_\gamma = d^\pi$.
Proof. If $d_0 = d^\pi$, then by stationarity the state distribution at every time step is $d^\pi$, so $d^\pi_\gamma = (1 - \gamma)\sum_{t=0}^{\infty} \gamma^t d^\pi = d^\pi$. ∎
Before the second proposition, the following lemma is necessary at this point.
Lemma 1.
If the state-to-state transition function $p^\pi$ of a policy $\pi$ corresponds to an indecomposable, aperiodic, and positive-recurrent Markov chain, then its $t$-step transition function converges to the stationary state distribution of $\pi$, and the average over the first $t$ time steps also converges to the stationary state distribution of $\pi$ [ross1996stochastic], i.e.,
$$\lim_{t \to \infty} \Pr(s_0 \to s, t, \pi) = d^\pi(s)$$
and
$$\lim_{t \to \infty} \frac{1}{t} \sum_{k=0}^{t-1} \Pr(s_0 \to s, k, \pi) = d^\pi(s).$$
Proposition 2.
When $\gamma$ approaches 1, $\lim_{\gamma \to 1} d^\pi_\gamma(s) = d^\pi(s)$.
Proof.
For the last step of the proof, we use the property of indecomposable, aperiodic, and positive-recurrent Markov chains stated in Lemma 1. ∎
By Propositions 1 and 2, when $\gamma \to 1$ or when we choose $d_0 = d^\pi$, the policy gradient becomes
$$\nabla_\theta J(\theta) = \frac{1}{1 - \gamma}\, \mathbb{E}_{s \sim d^\pi,\, a \sim \pi}\big[\nabla_\theta \log \pi(a \mid s; \theta)\, q^\pi(s, a)\big]. \qquad (5)$$
Note that when we choose $d_0 = d^{\pi_\theta}$, it means we keep changing the objective function (3) every time the parameters are updated, rather than that the objective becomes $\mathbb{E}_{s \sim d^{\pi_\theta}}\big[v^{\pi_\theta}(s)\big]$. The gradient of the latter objective is not accessible, because there is no analytic relationship between $\theta$ and the stationary state distribution $d^{\pi_\theta}$.

Unbiased estimation of policy gradient
The policy gradient (5) takes the form of an expectation. To estimate it, we collect a batch of samples $\mathcal{B} = \{(s_i, a_i)\}$ generated by policy $\pi$ and approximate the expectation by an average, as shown in the following equation:
$$\widehat{\nabla_\theta J}(\theta) = \frac{1}{(1 - \gamma)\,|\mathcal{B}|} \sum_{(s_i, a_i) \in \mathcal{B}} \nabla_\theta \log \pi(a_i \mid s_i; \theta)\, q^\pi(s_i, a_i). \qquad (6)$$
By Assumption 2, this is an unbiased estimate of the policy gradient.
However, the value function $q^\pi$ is not known. It can be approximated by Monte Carlo estimation, i.e., using the episodic return, which yields the REINFORCE algorithm; or it can be approximated by value function approximation, i.e., using a parameterized critic.
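To make the Monte Carlo route concrete, the following sketch implements the REINFORCE variant on a hypothetical two-state MDP: the episodic return $G_t$ stands in for $q^\pi(s_t, a_t)$ in the sample-average gradient (6). The MDP, the tabular softmax parameterization, the step size, and the episode count are all illustrative assumptions, and the $\gamma^t$ weighting of updates is dropped, as is common in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative).
n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.7, 0.3], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                 # R[s, a]; a=1 in s=1 pays best
              [0.5, 2.0]])

theta = np.zeros((n_s, n_a))              # tabular softmax policy parameters

def pi(s, theta):
    """Softmax policy pi(. | s; theta)."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_pi(s, a, theta):
    """grad_theta log pi(a | s; theta) for the softmax parameterization."""
    g = np.zeros_like(theta)
    g[s] = -pi(s, theta)
    g[s, a] += 1.0
    return g

alpha, T = 0.01, 30
for episode in range(3000):
    s, traj = 0, []
    for t in range(T):
        a = rng.choice(n_a, p=pi(s, theta))
        traj.append((s, a, R[s, a]))
        s = rng.choice(n_s, p=P[s, a])
    # Monte Carlo return G_t replaces q^pi(s_t, a_t) in the gradient estimate.
    G = 0.0
    for s_t, a_t, r_t in reversed(traj):
        G = r_t + gamma * G
        theta += alpha * grad_log_pi(s_t, a_t, theta) * G
```

After training, the policy should prefer the high-reward action in state 1, i.e., `pi(1, theta)[1]` should exceed `pi(1, theta)[0]`.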

Value function approximation
The value function can be approximated by minimizing the distance between the approximate value function $v_w$ and the true value function $v^\pi$. The mean squared error under the stationary state distribution is usually used as the distance, i.e.,
$$J(w) = \frac{1}{2}\,\mathbb{E}_{s \sim d^\pi}\big[(v^\pi(s) - v(s; w))^2\big].$$
We use gradient descent to minimize its right-hand side. Its gradient is
$$\nabla_w J(w) = \mathbb{E}_{s \sim d^\pi}\big[(v(s; w) - v^\pi(s))\, \nabla_w v(s; w)\big].$$
The true value function $v^\pi$ is not accessible, so we construct an update target from samples under $\pi$ by the Monte Carlo method, i.e.,
$$v_{\text{target}}(s_t) = \sum_{k=0}^{\infty} \gamma^k r_{t+k},$$
or by the temporal-difference method, i.e.,
$$v_{\text{target}}(s_t) = r_t + \gamma\, v(s_{t+1}; w),$$
to approximate the true value function $v^\pi$, that is,
$$\nabla_w J(w) \approx \mathbb{E}_{s \sim d^\pi}\big[(v(s; w) - v_{\text{target}}(s))\, \nabla_w v(s; w)\big].$$
By using value function approximation in the policy gradient estimate, the actor-critic architecture can be derived from the direct method, as shown in Fig. 1.
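The semi-gradient update with the TD target can be sketched as follows on a hypothetical five-state random walk (the chain size, reward placement, step size, and episode count are illustrative assumptions; one-hot features make $v(s; w)$ effectively tabular):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 5-state random-walk chain (illustrative): move left or right
# uniformly at random; reward 1 only on reaching the right end; both ends
# are terminal and end the episode.
n, gamma, alpha = 5, 0.9, 0.1
w = np.zeros(n)                       # linear weights; one-hot features

def feat(s):
    x = np.zeros(n)
    x[s] = 1.0
    return x

for episode in range(5000):
    s = 2                             # start in the middle
    for _ in range(100):
        s_next = s + rng.choice([-1, 1])
        r = 1.0 if s_next == n - 1 else 0.0
        done = s_next in (0, n - 1)
        # TD target: r + gamma * v(s'; w), with zero value at terminals.
        target = r + (0.0 if done else gamma * (w @ feat(s_next)))
        # Semi-gradient step: w += alpha * (target - v(s; w)) * grad_w v(s; w).
        w += alpha * (target - w @ feat(s)) * feat(s)
        if done:
            break
        s = s_next
```

The learned values should increase from left to right for the interior states, since states closer to the rewarding end have higher expected discounted return.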
III-A2 Other direct algorithms
There are several other forms of policy gradient. To reduce the variance of the estimated gradient without adding bias, a value function is utilized as a baseline [sutton2018reinforcement]. To obtain robust policy improvement, a KL-divergence or clipping penalty is employed to constrain policy variation [kakade2002natural, schulman2015trust, wu2017scalable, schulman2017proximal]. To make model-free RL algorithms more efficient, deterministic policies are applied to reduce the number of samples needed to estimate the gradient accurately [silver2014deterministic]. To enhance sample efficiency, several off-policy policy gradient methods employ a separate behavior policy for exploration. Off-policy AC (Off-PAC) adopts importance sampling (IS) to correct the action distribution when estimating the policy gradient [degris2012off]. To reduce its variance, ACER uses truncated IS with a correction term, and Reactor uses a "leave-one-out" technique at the cost of introducing bias, while IPG interpolates between the on-policy and off-policy gradients to stabilize learning [wang2016sample, gruslys2018reactor, gu2017interpolated]. Although Off-PAC is simple and widely used, its gradient is actually a semi-gradient; to this end, ACE and Geoff-AC make use of the emphatic method to give the true off-policy gradient, though they both suffer from large variance [imani2018off, zhang2019generalized]. Some methods attempt to relieve the variance issue by dropping the IS correction term. DDPG and TD3 take advantage of deterministic policies and naturally eliminate the IS term in their policy gradients [lillicrap2015continuous, fujimoto2018addressing]. SAC, Soft Q-learning, and Trust-PCL all fall under the entropy-regularized framework and optimize policy parameters with off-policy data without IS [haarnoja2018soft, haarnoja2017reinforcement, nachum2017trust]. SIL exploits only good state-action pairs in the replay buffer and can be viewed as an implementation of lower-bound soft Q-learning under the entropy-regularized RL framework [oh2018self]. Off-policy mechanisms also enable RL to scale, as in A3C (A2C), Ape-X, and IMPALA [mnih2016asynchronous, horgan2018distributed, espeholt2018impala].

III-B Indirect methods
III-B1 Policy iteration and value iteration
By Definition 2, indirect methods seek the solution $v^*$ of the Bellman equation
$$v^*(s) = \max_a \Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v^*(s')\Big],$$
and acquire the optimal policy indirectly by
$$\pi^*(s) = \arg\max_a \Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v^*(s')\Big].$$
There are several ways to solve the Bellman equation, among which the most typical are policy iteration and value iteration.
In the policy iteration algorithm, we start with an arbitrary policy $\pi_0$ and generate a sequence of new policies $\pi_1, \pi_2, \ldots$. Given policy $\pi_k$, we perform a policy evaluation (PEV) step that computes $v^{\pi_k}$ as the solution of the system
$$v(s) = \sum_a \pi_k(a \mid s)\Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v(s')\Big]$$
by repeated fixed-point iteration, i.e., $v_{j+1} = T^{\pi_k} v_j$, until convergence. We then perform a policy improvement (PIM) step, which computes a new policy $\pi_{k+1}$ that satisfies
$$\pi_{k+1}(s) = \arg\max_a \Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v^{\pi_k}(s')\Big].$$
Then we go back to the PEV step. This process continues until $v^{\pi_{k+1}} = v^{\pi_k}$ for all $s$. From the theory of dynamic programming, it is guaranteed that the policy iteration algorithm terminates with the optimal policy $\pi^*$.
In the value iteration algorithm, we start with an arbitrary initial value function vector $v_0$ and keep performing
$$v_{k+1}(s) = \max_a \Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_k(s')\Big],$$
so that the generated sequence of value functions converges to $v^*$. Finally, the optimal policy is found by computing a policy that satisfies
$$\pi^*(s) = \arg\max_a \Big[r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v^*(s')\Big].$$
It can be seen that value iteration is a special case of policy iteration: it iterates only once in the PEV step, while policy iteration iterates in PEV until convergence.
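Both procedures can be sketched in a few lines on a randomly generated MDP. Here the PEV step solves the linear system $v = r^\pi + \gamma P^\pi v$ exactly, and value iteration applies the Bellman operator repeatedly; the MDP itself is an illustrative assumption:

```python
import numpy as np

# Hypothetical 3-state, 2-action MDP with random dynamics (illustrative).
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # P[s, a, :] sums to 1
R = rng.uniform(0, 1, size=(n_s, n_a))            # R[s, a]

def policy_iteration():
    pi = np.zeros(n_s, dtype=int)                 # arbitrary initial policy
    while True:
        # PEV: solve v = r^pi + gamma * P^pi v exactly (a linear system).
        P_pi = P[np.arange(n_s), pi]
        r_pi = R[np.arange(n_s), pi]
        v = np.linalg.solve(np.eye(n_s) - gamma * P_pi, r_pi)
        # PIM: greedy policy w.r.t. the one-step lookahead values.
        q = R + gamma * P @ v                     # q[s, a]
        pi_new = q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new

def value_iteration(iters=2000):
    v = np.zeros(n_s)
    for _ in range(iters):
        v = (R + gamma * P @ v).max(axis=1)       # v <- T v
    return v

pi_star, v_pi = policy_iteration()
v_vi = value_iteration()
assert np.allclose(v_pi, v_vi, atol=1e-6)
```

On a finite MDP both return the same optimal values: policy iteration after finitely many PEV/PIM alternations, and value iteration in the limit of repeated Bellman backups.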
III-B2 Approximate policy iteration
We take approximate policy iteration to illustrate how indirect methods work in the approximate-function setting.
Inspired by the PEV step of policy iteration, in every iteration approximate policy iteration aims to find the solution of the self-consistency condition of $\pi_\theta$, that is, $v^{\pi_\theta} = T^{\pi_\theta} v^{\pi_\theta}$. Like policy iteration, the PEV step of approximate policy iteration keeps calculating the update target $v_{\text{target}}$ and optimizing the distance between $v_w$ and $v_{\text{target}}$ until $v_w$ converges, in which $v_{\text{target}}$ is calculated by $v_{\text{target}} = T^{\pi_\theta} v_w$. As with value approximation in the direct method, the mean squared error under the stationary state distribution is used as the distance, shown as
$$J(w) = \frac{1}{2}\,\mathbb{E}_{s \sim d^\pi}\big[(v_{\text{target}}(s) - v(s; w))^2\big].$$
Its gradient is
$$\nabla_w J(w) = \mathbb{E}_{s \sim d^\pi}\big[(v(s; w) - v_{\text{target}}(s))\, \nabla_w v(s; w)\big].$$
Similarly, in the PIM step, approximate policy iteration seeks to minimize the distance between $T^{\pi_\theta} v_w$ and $T v_w$. The distance is chosen to be their absolute error under some state distribution independent of $\theta$, e.g., the initial state distribution $d_0$, as shown in the following equation:
$$J(\theta) = \mathbb{E}_{s \sim d_0}\big[\big|(T v_w)(s) - (T^{\pi_\theta} v_w)(s)\big|\big].$$
It is obvious that $(T^{\pi_\theta} v_w)(s) \le (T v_w)(s)$ for all $s$, and $(T v_w)(s)$ is a constant irrelevant to $\theta$. We can thus remove the absolute-value operator and equivalently maximize the objective
$$J(\theta) = \mathbb{E}_{s \sim d_0}\Big[\sum_a \pi(a \mid s; \theta)\, q_w(s, a)\Big], \qquad (7)$$
where $q_w(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_w(s')$.
Its gradient is
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d_0}\Big[\sum_a \nabla_\theta \pi(a \mid s; \theta)\, q_w(s, a)\Big] = \mathbb{E}_{s \sim d_0,\, a \sim \pi}\big[\nabla_\theta \log \pi(a \mid s; \theta)\, q_w(s, a)\big], \qquad (8)$$
which can be estimated from samples, similarly to equation (6).
III-B3 Other indirect algorithms
In practice, approximate dynamic programming (ADP) is a class of methods seeking the Bellman solution; besides, most indirect methods are value-based methods, which have no explicit policy and use the greedy policy w.r.t. the Q-function to conduct policy improvement [powell2007approximate]. Q-learning is a typical traditional indirect method, and its deep RL extension DQN has achieved great success in Atari video games [watkins1992q, mnih2015human]. DDQN partially addresses the overestimation problem of Q-learning by decoupling the selection and evaluation of the bootstrap action [van2016deep]. PER replays important transitions more frequently, learning more efficiently than DQN [schaul2015prioritized]. Dueling DQN uses a dueling architecture consisting of two streams that represent the value and advantage functions, which helps to generalize across actions [wang2015dueling]. C51 learns a categorical distribution of discounted returns instead of estimating the mean [bellemare2017distributional]. Rainbow combines these independent improvements to the DQN algorithm and provides state-of-the-art performance on the Atari 2600 benchmark [hessel2018rainbow]. NAF is proposed as a continuous variant of the Q-learning algorithm, which can be regarded as an alternative to policy gradient methods [gu2016continuous]. While Q-learning and its variants learn by one-step bootstrapping, Retrace(λ) is the first return-based off-policy control algorithm converging a.s. to $q^*$ without the GLIE (Greedy in the Limit with Infinite Exploration) assumption [munos2016safe, singh2000convergence].
III-C Equivalence
Several works have drawn connections between policy gradient methods and Q-learning in the framework of entropy regularization. O'Donoghue et al. (2016) decomposed the Q-function into a policy part and a value part, inspired by dueling Q-networks, and showed that taking the gradient of the Bellman error of the Q-function leads to a result similar to the policy gradient [o2016combining]. Nachum et al. (2017) proposed the path consistency learning (PCL) algorithm based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization, which can be interpreted as generalizing both policy gradient and Q-learning algorithms [nachum2017bridging]. Haarnoja et al. (2017) used a method called Stein variational gradient descent to derive a procedure that jointly updates the Q-function and policy, which approximately samples from the Boltzmann distribution [haarnoja2017reinforcement]. Schulman et al. (2017) showed that there is a precise equivalence between Q-learning and policy gradient methods in the framework of entropy-regularized reinforcement learning, where "soft" (entropy-regularized) Q-learning methods are secretly implementing policy gradient updates [schulman2017equivalence].
While these works drew connections between direct and indirect methods without a parameterized policy, we establish the equivalence between direct and indirect methods with a parameterized policy. We take vanilla policy gradient and approximate policy iteration as examples to compare. Since they have exactly the same gradient in the policy evaluation procedure (value approximation procedure), we focus on the policy improvement procedure (policy update), in which there are several differences between the direct (5) and indirect (8) policy gradients.

The value function in the policy gradient is different: The objective function of the indirect policy gradient (7) depends on the result of policy evaluation, which is fixed in the policy improvement step, so we do not have to unroll it when taking the gradient; as a result, the value function in its gradient (8) is also fixed and independent of $\theta$. The objective function of direct methods is straightforward, in which the value is not only a function of the state but also a function of the policy parameters $\theta$. When we take its gradient, we have to unroll it until we obtain the form of (5). As a consequence, the value in the gradient is also a function of $\theta$, i.e., the true value function. This is one of the main differences between direct and indirect RL methods. It should be noted, however, that this difference naturally disappears when we estimate the gradient, because the true value function in the direct gradient is not accessible and can only be estimated by the value approximation used in the indirect gradient.

The state distribution in the policy gradient is different: Although the indirect and direct policy gradients both seek to optimize the value function under the initial state distribution $d_0$, their gradients are expectations with respect to different state distributions. For direct RL methods, due to the unrolling effect, the direct policy gradient is an expectation under the discounted visiting frequency, which approximates the stationary distribution when we choose the initial state distribution as the stationary state distribution every time (by Proposition 1) or when $\gamma \to 1$ (by Proposition 2). For indirect RL methods, the indirect policy gradient is an expectation under the initial state distribution; the only way to match the stationary distribution is to choose the initial state distribution as the stationary state distribution every time. When estimating the direct gradient, we should use samples generated by the current policy, whose state distribution is assumed to be the stationary state distribution by Assumption 2. When estimating the indirect gradient, we should use samples generated from the initial state distribution.
By the analyses above, because the direct policy gradient must resort to an approximate value function in practice, the first difference, concerning the value function, is naturally eliminated. The only remaining difference concerns the state distribution. However, this difference can also be eliminated if both direct and indirect methods choose a new objective function in the policy update step at every iteration, in which the objective always uses the current stationary state distribution as the initial distribution. In conclusion, direct methods are equivalent to indirect methods as long as we choose $d_0 = d^{\pi_\theta}$ at each iteration, as shown in Fig. 3.
TABLE I: Classification of RL algorithms by direct/indirect vs. value-based/policy-based.

Indirect, value-based: DP [bellman1966dynamic], Soft Q-learning [haarnoja2017reinforcement], Q-learning [watkins1992q], DQN [mnih2015human], DDQN [van2016deep], PER [schaul2015prioritized], Dueling DQN [wang2015dueling], C51 [bellemare2017distributional], Rainbow [hessel2018rainbow], NAF [gu2016continuous], Retrace(λ) [munos2016safe]

Indirect, policy-based: ADP [powell2007approximate], HDP [venayagamoorthy2002comparison], ADHDP [fuselli2013action], DHP [ni2015model], GDHP [szuster2016globalized], ADGDHP [zhang2013overview], CDADP [duan2019deep], DGPI [duan2019generalized]

Direct, value-based: (none; direct methods require a parameterized policy)

Direct, policy-based: Natural PG [kakade2002natural], TRPO [schulman2015trust], PPO [schulman2017proximal], DPG [silver2014deterministic], Off-PAC [degris2012off], ACER [wang2016sample], Reactor [gruslys2018reactor], IPG [gu2017interpolated], ACE [imani2018off], Geoff-AC [zhang2019generalized], DDPG [lillicrap2015continuous], TD3 [fujimoto2018addressing], SAC [haarnoja2018soft], SIL [oh2018self], ACKTR [wu2017scalable], Trust-PCL [nachum2017trust], I2A [racaniere2017imagination], A3C (A2C) [mnih2016asynchronous], Ape-X [horgan2018distributed], IMPALA [espeholt2018impala], MVE [feinberg2018model], STEVE [buckman2018sample], GPS [levine2013guided]

TABLE II: Classification of RL algorithms by direct/indirect vs. model-based/model-free.

Indirect, model-based: DP [bellman1966dynamic], ADP [powell2007approximate], HDP [venayagamoorthy2002comparison], ADHDP [fuselli2013action], DHP [ni2015model], GDHP [szuster2016globalized], ADGDHP [zhang2013overview], CDADP [duan2019deep], DGPI [duan2019generalized]

Indirect, model-free: Soft Q-learning [haarnoja2017reinforcement], Q-learning [watkins1992q], TD(λ) [tsitsiklis1994asynchronous], DQN [mnih2015human], GAE [schulman2015high], DDQN [van2016deep], PER [schaul2015prioritized], Dueling DQN [wang2015dueling], C51 [bellemare2017distributional], Rainbow [hessel2018rainbow], NAF [gu2016continuous], Retrace(λ) [munos2016safe]

Direct, model-based: MVE [feinberg2018model], STEVE [buckman2018sample], ME-TRPO [kurutach2018model], PILCO [deisenroth2011pilco], Recurrent world models [ha2018recurrent], GPS [levine2013guided], I2A [racaniere2017imagination]

Direct, model-free: Natural PG [kakade2002natural], TRPO [schulman2015trust], ACKTR [wu2017scalable], PPO [schulman2017proximal], DPG [silver2014deterministic], Off-PAC [degris2012off], ACER [wang2016sample], Reactor [gruslys2018reactor], IPG [gu2017interpolated], ACE [imani2018off], Geoff-AC [zhang2019generalized], DDPG [lillicrap2015continuous], TD3 [fujimoto2018addressing], SAC [haarnoja2018soft], Trust-PCL [nachum2017trust], SIL [oh2018self], A3C (A2C) [mnih2016asynchronous], Ape-X [horgan2018distributed], IMPALA [espeholt2018impala]

IV Convergence results
IV-A Direct methods
Before getting into the convergence analysis of direct methods, we first define some symbols for convenience. Consider the following process:
$$\theta_{t+1} = \theta_t + \alpha_t\, \widehat{\nabla_\theta J}(\theta_t). \qquad (9)$$
Denoting the true gradient $\nabla_\theta J(\theta_t)$ by $s_t$ and the estimation error $\widehat{\nabla_\theta J}(\theta_t) - \nabla_\theta J(\theta_t)$ by $w_t$, process (9) becomes
$$\theta_{t+1} = \theta_t + \alpha_t (s_t + w_t),$$
and we have the following theorem:
Theorem 1.
(Stochastic gradient theorem [bertsekas2000gradient]) Let $\theta_t$ be a sequence generated by the method
$$\theta_{t+1} = \theta_t + \alpha_t (s_t + w_t),$$
where $\alpha_t$ is a deterministic positive stepsize, $s_t = \nabla_\theta J(\theta_t)$ is the steepest ascent direction, and $w_t$ is a random noise term. Let $\mathcal{F}_t$ be an increasing sequence of $\sigma$-fields. We assume the following:
(a) $\theta_t$ and $s_t$ are $\mathcal{F}_t$-measurable.
(b) (Lipschitz continuity of $\nabla_\theta J$) The function $J$ is continuously differentiable and there exists a constant $L$ such that
$$\|\nabla_\theta J(\theta) - \nabla_\theta J(\bar{\theta})\| \le L\, \|\theta - \bar{\theta}\|, \quad \forall\, \theta, \bar{\theta}.$$
(c) We have, for all $t$ and with probability 1,
$$\mathbb{E}[w_t \mid \mathcal{F}_t] = 0$$
and
$$\mathbb{E}\big[\|w_t\|^2 \mid \mathcal{F}_t\big] \le A\,\big(1 + \|\nabla_\theta J(\theta_t)\|^2\big),$$
where $A$ is a positive deterministic constant.
(d) The stepsize $\alpha_t$ is positive and satisfies
$$\sum_{t=0}^{\infty} \alpha_t = \infty, \qquad \sum_{t=0}^{\infty} \alpha_t^2 < \infty.$$
Then either $J(\theta_t) \to \infty$ or $J(\theta_t)$ converges to a finite value and $\lim_{t \to \infty} \nabla_\theta J(\theta_t) = 0$. Furthermore, every limit point of $\theta_t$ is a stationary point of $J$.
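Theorem 1 can be illustrated numerically. The sketch below runs stochastic gradient ascent on a hypothetical concave objective with zero-mean, bounded-variance gradient noise and Robbins-Monro stepsizes $\alpha_t = 1/t$, so that conditions (b)-(d) hold; the objective and the noise model are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical concave objective J(theta) = -||theta - theta_star||^2,
# whose unique stationary point is theta_star.
theta_star = np.array([1.0, -2.0])

def grad_J(theta):
    return -2.0 * (theta - theta_star)

theta = np.zeros(2)
for t in range(1, 200_001):
    alpha = 1.0 / t                    # sum(alpha) = inf, sum(alpha^2) < inf
    noise = rng.normal(0.0, 1.0, 2)    # E[w_t] = 0, bounded second moment
    theta = theta + alpha * (grad_J(theta) + noise)

# The iterates approach the stationary point, where grad J = 0.
assert np.linalg.norm(theta - theta_star) < 0.1
```

With a constant stepsize the iterates would instead hover in a noise ball around the optimum; the square-summable stepsize is what drives the gradient to zero almost surely.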
IV-B Indirect methods
We only establish the convergence of approximate policy iteration. We consider an approximate policy iteration algorithm that generates a sequence of policies $\pi_k$ and a corresponding sequence of approximate value functions $v_k$ satisfying
$$\|v_k - v^{\pi_k}\|_\infty \le \delta, \qquad k = 0, 1, \ldots,$$
and
$$\|T^{\pi_{k+1}} v_k - T v_k\|_\infty \le \epsilon, \qquad k = 0, 1, \ldots,$$
where $\delta$ and $\epsilon$ are some positive scalars. The scalar $\delta$ is an assumed worst-case bound on the error incurred during policy evaluation. The scalar $\epsilon$ is a bound on the error incurred in the course of the computations required for a policy update. Then we have the following theorem.
Theorem 2.
(Error bound for approximate policy iteration [bertsekas1996neuro]) The sequence of policies $\pi_k$ generated by the approximate policy iteration algorithm satisfies
$$\limsup_{k \to \infty} \|v^{\pi_k} - v^*\|_\infty \le \frac{\epsilon + 2\gamma\delta}{(1 - \gamma)^2}.$$
V Classification of RL algorithms
In this section, we classify mainstream RL algorithms with the direct/indirect criterion. In Table I, we compare it with the value-based/policy-based criterion. We find that most model-based methods, e.g., ADP, are classified as policy-based methods but are actually indirect methods. Besides, all value-based methods are indirect methods, because direct methods need a parameterized policy. We also compare our criterion with the model-based/model-free criterion in Table II.
VI Conclusion
In this paper, we group current RL algorithms into direct and indirect methods, where direct methods are defined as algorithms that obtain the optimal policy by directly optimizing the expectation of accumulated future rewards with gradient-based methods, while indirect methods are defined as algorithms that obtain the optimal policy by indirectly solving the sufficient and necessary condition derived from Bellman's principle of optimality, i.e., the Bellman equation. We take vanilla policy gradient and approximate policy iteration to study their internal relationship, and reveal that both direct and indirect methods can be unified in the actor-critic architecture and are equivalent if we always choose the stationary state distribution of the current policy as the initial state distribution of the MDP. Besides, by the theory of stochastic gradient methods, the convergence of direct methods can be guaranteed if the gradient error has zero mean and bounded second moment. For indirect methods, the asymptotic error can be bounded in terms of the error bounds of the PEV and PIM steps. Finally, we classify the current mainstream RL algorithms and compare our criterion with others, including value-based vs. policy-based and model-based vs. model-free.
Acknowledgment
We would like to acknowledge Mr. Zhengyu Liu and Dr. Qi Sun for their valuable suggestions throughout this research.