1 Introduction
In reinforcement learning, experiences are sequences of states, actions and rewards generated by the agent interacting with the environment. The agent's goal is to learn from experience and to seek an optimal policy in a decision system with delayed rewards. Two fundamental mechanisms have been studied. One is the temporal-difference (TD) learning method, which combines the Monte Carlo method and dynamic programming [Sutton1988]. The other is the eligibility trace [Sutton1984, Watkins1989], a short-term memory process defined as a function of states. TD learning combined with eligibility traces provides a bridge between one-step learning and Monte Carlo methods through the trace-decay parameter $\lambda$ [Sutton1988].
Recently, multi-step $Q(\sigma)$ [Sutton and Barto2017] unified $n$-step Sarsa ($\sigma=1$, full-sampling) and $n$-step Tree-backup ($\sigma=0$, pure-expectation). For an intermediate value $0<\sigma<1$, $Q(\sigma)$ creates a mixture of the full-sampling and pure-expectation approaches and can perform better than the extreme cases $\sigma=0$ or $\sigma=1$ [De Asis et al.2018].
The results in [De Asis et al.2018] imply a fundamental trade-off in reinforcement learning: should one estimate the value function with a pure-expectation ($\sigma=0$) algorithm or a full-sampling ($\sigma=1$) algorithm? Although the pure-expectation approach has lower variance, it requires more complex and more expensive computation [Van Seijen et al.2009]. The full-sampling algorithm, on the other hand, needs less computation time but may have worse asymptotic performance [De Asis et al.2018]. Multi-step $Q(\sigma)$ [Sutton and Barto2017] is the first attempt to combine pure-expectation with full-sampling algorithms; however, multi-step temporal-difference learning is too expensive during training. In this paper, we combine the $Q(\sigma)$ algorithm with eligibility traces and create a new algorithm, called $Q(\sigma,\lambda)$. Our $Q(\sigma,\lambda)$ unifies the Sarsa$(\lambda)$ algorithm [Rummery and Niranjan1994] and the $Q^{\pi}(\lambda)$ algorithm [Harutyunyan2016]. When $\sigma$ varies from 0 to 1, $Q(\sigma,\lambda)$ changes continuously from $Q^{\pi}(\lambda)$ ($\sigma=0$ in $Q(\sigma,\lambda)$) to Sarsa$(\lambda)$ ($\sigma=1$ in $Q(\sigma,\lambda)$). We also focus on the trade-off between pure-expectation and full-sampling in control tasks; our experiments show that an intermediate value of $\sigma$ can achieve better performance than either extreme case.

Our contributions are summarized as follows:

We define a new operator, the mixed-sampling operator, through which we can deduce the corresponding policy evaluation algorithm and control algorithm.

For the new policy evaluation algorithm, we give an upper error bound.

We present a new algorithm, $Q(\sigma,\lambda)$, which unifies Sarsa$(\lambda)$ and $Q^{\pi}(\lambda)$. For the control problem, we prove that both the offline and the online $Q(\sigma,\lambda)$ algorithms converge to the optimal value function.
2 Framework and Notation
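The standard episodic reinforcement learning framework [Sutton and Barto2017] is often formalized as a Markov decision process (MDP). Such a framework considers 5-tuples of the form $M=(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)$, where $\mathcal{S}$ indicates the set of all states, $\mathcal{A}$ indicates the set of all actions, $P_{ss'}^{a}$ indicates the state-transition probability from state $s$ to state $s'$ under action $a$; $R_{ss'}^{a}$ indicates the expected reward for a transition, and $\gamma$ is the discount factor. In this paper, we denote by $\{(S_t,A_t,R_t)\}_{t\ge 0}$ a trajectory of the state-reward sequence in one episode.

A policy $\pi$ is a probability distribution on $\mathcal{S}\times\mathcal{A}$, and a stationary policy is a policy that does not change over time.

Consider the state-action value $q$, which maps $\mathcal{S}\times\mathcal{A}$ to $\mathbb{R}$. For a given policy $\pi$, the corresponding state-action value is
$$q^{\pi}(s,a)=\mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}R_{t}\,\Big|\,S_{0}=s,A_{0}=a\Big].$$
The optimal state-action value is defined as
$$q^{*}(s,a)=\max_{\pi}q^{\pi}(s,a).$$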
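Bellman operator $\mathcal{T}^{\pi}$:
$$\mathcal{T}^{\pi}q=\mathcal{R}^{\pi}+\gamma\mathcal{P}^{\pi}q \qquad (1)$$
Bellman optimality operator $\mathcal{T}^{*}$:
$$\mathcal{T}^{*}q=\max_{\pi}\big(\mathcal{R}^{\pi}+\gamma\mathcal{P}^{\pi}q\big) \qquad (2)$$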
where and , the corresponding entry is:
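The value functions $q^{\pi}$ and $q^{*}$ satisfy the following Bellman equation and optimal Bellman equation, respectively:
$$\mathcal{T}^{\pi}q^{\pi}=q^{\pi},\qquad \mathcal{T}^{*}q^{*}=q^{*}.$$
Both $\mathcal{T}^{\pi}$ and $\mathcal{T}^{*}$ are contraction operators in the sup-norm, that is to say, for any $q_1,q_2$, $\|\mathcal{T}q_1-\mathcal{T}q_2\|_{\infty}\le\gamma\|q_1-q_2\|_{\infty}$ with $\mathcal{T}=\mathcal{T}^{\pi}$ or $\mathcal{T}^{*}$. Since the fixed point of a contraction operator is unique, value iteration converges: $(\mathcal{T}^{\pi})^{n}q\to q^{\pi}$ and $(\mathcal{T}^{*})^{n}q\to q^{*}$ as $n\to\infty$, for any initial $q$ [Bertsekas et al.2005].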
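As a rough illustration of this convergence, the following sketch iterates the Bellman operator for a fixed policy on a toy problem where the model is known. The 3-state transition matrix, the rewards and the function name `policy_evaluation` are hypothetical choices made for this example and do not come from the paper.

```python
import numpy as np

def policy_evaluation(P_pi, r_pi, gamma, n_iters=500):
    """Repeatedly apply q <- r_pi + gamma * P_pi @ q, starting from zero."""
    q = np.zeros_like(r_pi, dtype=float)
    for _ in range(n_iters):
        q = r_pi + gamma * P_pi @ q
    return q

# Hypothetical toy model under a fixed policy (illustrative numbers only).
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.2, 0.0, 0.8]])
r_pi = np.array([0.0, 1.0, -1.0])
gamma = 0.9

q_iter = policy_evaluation(P_pi, r_pi, gamma)
# Exact fixed point of q = r_pi + gamma * P_pi q, for comparison.
q_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
```

Because the operator is a $\gamma$-contraction, the gap between `q_iter` and `q_exact` shrinks at least geometrically with the number of iterations.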
Unfortunately, neither system (1) nor system (2) can be solved directly, because the transition probabilities $\mathcal{P}$ and rewards $\mathcal{R}$ of the environment are usually unknown. In practice, no model of the environment is available; this setting is called model-free.
2.1 One-step TD Learning Algorithms
The TD learning algorithm [Sutton1984, Sutton1988] is one of the most significant algorithms in model-free reinforcement learning. The idea of bootstrapping is critical to TD learning: current estimates of the value function are used as targets during the learning process.
Given a target policy $\pi$ which is to be learned and a behavior policy $\mu$ that generates the trajectory, the learning is called on-policy if $\pi=\mu$; otherwise it is off-policy.
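Sarsa: For a given sample transition $(S_t,A_t,R_t,S_{t+1},A_{t+1})$, Sarsa [Rummery and Niranjan1994] is an on-policy learning algorithm that updates the $Q$ value as follows:
$$Q_{k+1}(S_t,A_t)=Q_k(S_t,A_t)+\alpha_k\delta_k^{S} \qquad (3)$$
$$\delta_k^{S}=R_t+\gamma Q_k(S_{t+1},A_{t+1})-Q_k(S_t,A_t) \qquad (4)$$
where $\delta_k^{S}$ is the $k$th TD error and $\alpha_k$ is the step-size.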
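Expected-Sarsa: Expected-Sarsa [Van Seijen et al.2009] uses the expectation of all the next state-action values under the target policy $\pi$ to estimate the $Q$ value as follows:
$$Q_{k+1}(S_t,A_t)=Q_k(S_t,A_t)+\alpha_k\delta_k^{ES},$$
$$\delta_k^{ES}=R_t+\gamma\sum_{a\in\mathcal{A}}\pi(S_{t+1},a)Q_k(S_{t+1},a)-Q_k(S_t,A_t) \qquad (5)$$
where $\delta_k^{ES}$ is the $k$th expected TD error. Expected-Sarsa is an off-policy learning algorithm if the behavior policy $\mu$ differs from the target policy $\pi$; for example, when $\pi$ is greedy with respect to $Q$, Expected-Sarsa reduces to Q-Learning [Watkins1989]. If the trajectory is generated by $\pi$, Expected-Sarsa is an on-policy algorithm [Van Seijen et al.2009].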
Both of the above algorithms are guaranteed to converge under some conditions [Singh et al.2000, Van Seijen et al.2009].
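$Q(\sigma)$: One-step $Q(\sigma)$ [Sutton and Barto2017, De Asis et al.2018] is a weighted average between the Sarsa update and the Expected-Sarsa update through the sampling parameter $\sigma$:
$$Q_{k+1}(S_t,A_t)=Q_k(S_t,A_t)+\alpha_k\big(\sigma\delta_k^{S}+(1-\sigma)\delta_k^{ES}\big) \qquad (6)$$
where $\sigma\in[0,1]$ is the degree of sampling, with $\sigma=1$ denoting full-sampling and $\sigma=0$ denoting pure-expectation with no sampling, and $\delta_k^{S}$, $\delta_k^{ES}$ are given in (4) and (5).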
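The one-step update described above can be sketched in tabular form as follows. This is a minimal illustration, assuming a NumPy array `Q` of shape (states, actions) and a vector `pi_next` of target-policy probabilities in the next state; the function name and argument names are hypothetical.

```python
import numpy as np

def q_sigma_update(Q, s, a, r, s_next, a_next, pi_next, alpha, gamma, sigma):
    """One-step Q(sigma) update on a tabular Q (modified in place).

    sigma = 1 recovers the Sarsa update and sigma = 0 the Expected-Sarsa
    update; intermediate values mix the two TD errors linearly.
    """
    # Sampled TD error: bootstrap from the action actually taken next.
    delta_sarsa = r + gamma * Q[s_next, a_next] - Q[s, a]
    # Expected TD error: bootstrap from the expectation under the target policy.
    delta_expected = r + gamma * np.dot(pi_next, Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * (sigma * delta_sarsa + (1.0 - sigma) * delta_expected)
    return Q
```

Passing a greedy one-hot `pi_next` with `sigma=0` gives a Q-Learning style update, in line with the remark on Expected-Sarsa above.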
2.2 $\lambda$-Return Algorithm
The one-step TD learning algorithm can be generalized to multi-step bootstrapping methods. The $\lambda$-return algorithm [Watkins1989] is a particular way to mix many multi-step TD learning algorithms by weighting $n$-step returns proportionally to $\lambda^{n-1}$.
The $\lambda$ operator (the notation follows the textbook [Bertsekas et al.2012]) is a flexible way to express the $\lambda$-return algorithm. Consider a trajectory ,
where is step returns from initial stateaction pair , the term , called returns, and .
Since $q^{\pi}$ is the fixed point of $\mathcal{T}^{\pi}$, $q^{\pi}$ remains the fixed point of the $\lambda$ operator. When $\lambda=0$, the $\lambda$ operator is equal to the usual Bellman operator $\mathcal{T}^{\pi}$. When $\lambda=1$, the evaluation becomes the Monte Carlo method. It is well known that $\lambda$ trades off the bias of bootstrapping from an approximate $q^{\pi}$ against the variance of sampled multi-step return estimates [Kearns and Singh2000]. In practice, a high but intermediate value of $\lambda$ is typically better [Singh and Dayan1998, Sutton1996].
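To make the weighting concrete, the sketch below combines the $n$-step returns of a finite episode into a single $\lambda$-return. It assumes the common episodic convention in which the last (full Monte Carlo) return receives all remaining weight; the function name and this truncation convention are assumptions of the example rather than definitions from the paper.

```python
def lambda_return(n_step_returns, lam):
    """Mix n-step returns G_1, ..., G_N of one episode into a lambda-return.

    For n < N the weight is (1 - lam) * lam**(n - 1); the final return G_N
    (the Monte Carlo return) gets the remaining mass lam**(N - 1), so the
    weights sum to one.
    """
    N = len(n_step_returns)
    total = 0.0
    for n, g in enumerate(n_step_returns[:-1], start=1):
        total += (1.0 - lam) * lam ** (n - 1) * g
    total += lam ** (N - 1) * n_step_returns[-1]
    return total
```

With `lam=0` only the one-step return survives, and with `lam=1` only the Monte Carlo return does, matching the two extremes described above.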
3 Mixed-sampling Operator
In this section, we present the mixed-sampling operator, which is one of our key contributions and provides a flexible tool for analyzing our new algorithm later. By introducing a sampling parameter $\sigma\in[0,1]$, the mixed-sampling operator varies continuously from the pure-expectation method to the full-sampling method. We first analyze the contraction property of the operator. Then we introduce the $\lambda$-return version of the mixed-sampling operator. Finally, we give an upper error bound for the corresponding policy evaluation algorithm.
3.1 Contraction of Mixed-sampling Operator
Definition 1.
Mixed sampling operator is a map on to
(7) 
where
The parameter $\sigma$ is the degree of sampling introduced by the $Q(\sigma)$ algorithm [De Asis et al.2018]. At one extreme ($\sigma=0$, pure-expectation), the operator yields the $n$-step returns in $Q^{\pi}(\lambda)$ [Harutyunyan2016], which are built from the $k$th expected TD error $\delta_k^{ES}$. Multi-step Sarsa [Sutton and Barto2017] lies at the other extreme ($\sigma=1$, full-sampling). Every intermediate value of $\sigma$ creates a mixed method that varies continuously from pure-expectation to full-sampling, which is why we call it the mixed-sampling operator.
$\lambda$-Return Version. We now define the $\lambda$-version of the mixed-sampling operator:
(8) 
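where $\lambda$ is the parameter that takes the operator from the TD(0) version to the Monte Carlo version, as usual. When $\sigma=0$, the operator reduces to that of $Q^{\pi}(\lambda)$ [Harutyunyan2016]; when $\sigma=1$, it reduces to the $\lambda$ operator. The next theorem provides a basic property of the operator.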
Theorem 1.
The operator is a contraction: for any ,
Furthermore, for any initial , the sequence is generated by the iteration
can converge to the unique fixed point of .
Proof.
Unfolding the operator , we have
(9) 
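where . Based on the fact that both the $\lambda$ operator [Bertsekas et al.2012] and the $Q^{\pi}(\lambda)$ operator [Harutyunyan2016, Munos et al.2016] are contractions, and that the mixed-sampling operator is a convex combination of these two operators, the mixed-sampling operator is a contraction. ∎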
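The convex-combination step of this argument can be spelled out generically. The derivation below uses placeholder operators $A$ and $B$ with sup-norm contraction moduli $\beta_A,\beta_B\in[0,1)$; these names are illustrative and are not the paper's notation.

```latex
\begin{align*}
\bigl\|(\sigma A+(1-\sigma)B)q_1-(\sigma A+(1-\sigma)B)q_2\bigr\|_\infty
 &\le \sigma\,\|Aq_1-Aq_2\|_\infty+(1-\sigma)\,\|Bq_1-Bq_2\|_\infty \\
 &\le \bigl(\sigma\beta_A+(1-\sigma)\beta_B\bigr)\,\|q_1-q_2\|_\infty ,
\end{align*}
% and for every sigma in [0,1]:
\[
\sigma\beta_A+(1-\sigma)\beta_B\;\le\;\max\{\beta_A,\beta_B\}\;<\;1 .
\]
```

The first inequality is the triangle inequality, and the second applies the contraction property of each operator separately, so the mixture is a contraction for every $\sigma\in[0,1]$.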
3.2 Upper Error Bound of Policy Evaluation
In this section we discuss the quality of the policy evaluation iteration in Theorem 1. Our results show that when $\mu$ and $\pi$ are sufficiently close, the accuracy of the policy evaluation iteration increases gradually as $\sigma$ decreases from 1 to 0.
Lemma 1.
If a sequence satisfies , then for any , we have
Furthermore, for any , has the following estimation
Theorem 2 (Upper error bound of policy evaluation).
Consider the policy evaluation algorithm , if the behavior policy is away from the target policy , in the sense that , , and , then for a large , the policy evaluation sequence satisfy
where for a given policy , is determined by the learning system.
Proof.
Firstly, we provide an equation which could be used later:
(10) 
Rewrite the policy evaluation iteration:
Note is fixed point of [Harutyunyan2016], then we merely consider next estimator:
The first equation is derived by replacing in (10) with . Since is away from , the first inequality is determined the following fact:
where is determined by the reinforcement learning system and independent of . . For the given policy , is a constant on determined by learning system, we denote it . ∎
Remark 1.
The proof of Theorem 2 depends strictly on the assumption that $\epsilon$ is small but never zero, where $\epsilon$ is a bound on the discrepancy between the behavior policy $\mu$ and the target policy $\pi$. That is to say, the prediction ability of the policy evaluation iteration depends on the gap between $\mu$ and $\pi$.
4 Control Algorithm
In this section, we present the $Q(\sigma,\lambda)$ algorithm for control. We analyze the offline version of $Q(\sigma,\lambda)$, which converges to the optimal value function exponentially fast.
Considering the typical iteration , is an arbitrary sequence of corresponding behavior policies, is calculated by the following two steps,
Step1: policy evaluation
Step2: policy improvement
that is, the new policy is the greedy policy with respect to the current value estimate. We call the approach defined by Step 1 and Step 2 above the $Q(\sigma,\lambda)$ control algorithm.
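The policy improvement step can be sketched for a tabular value estimate as follows. This is only an illustration of greedy improvement, assuming a NumPy array `Q` of shape (states, actions); the function name is hypothetical, and ties are broken toward the lowest action index, which is a choice of this example.

```python
import numpy as np

def greedy_policy(Q):
    """Return the greedy policy for Q as one-hot action probabilities."""
    n_states, n_actions = Q.shape
    pi = np.zeros((n_states, n_actions))
    # argmax picks the first maximizing action in each state.
    pi[np.arange(n_states), np.argmax(Q, axis=1)] = 1.0
    return pi
```

Alternating a policy evaluation step with this improvement step gives the two-step scheme described above.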
In the following, we present the convergence rate of the $Q(\sigma,\lambda)$ control algorithm.
Theorem 3 (Convergence of Control Algorithm).
Considering the sequence generated by the control algorithm, given , then
Particularly, for , then sequence converges to exponentially fast:
Proof.
By the definition of ,
we have (the second inequality is based on two results: [Munos et al.2016] Theorem 2 and [Bertsekas et al.2012] Proposition 6.3.10):
∎
5 Online Implementation of $Q(\sigma,\lambda)$
We have discussed the contraction property of the mixed-sampling operator, through which we introduced the $Q(\sigma,\lambda)$ control algorithm. The iterations in Theorem 2 and Theorem 3 are both offline versions. In this section, we give the online version of $Q(\sigma,\lambda)$ and discuss its convergence.
5.1 Online Learning
Offline learning is expensive because the learning process must be carried out at the end of an episode, whereas online learning updates the value function at a lower computational cost and with better performance. There is a simple interpretation of the equivalence between offline and online learning: by the end of the episode, the total update of the forward view (offline learning) is equal to the total update of the backward view (online learning) [Sutton and Barto1998]. (True online learning was first introduced by [Seijen and Sutton2014]; more details are in [Van Seijen et al.2016].) Through this equivalence, online learning can be seen as an inexpensive implementation of the offline algorithm. Another interpretation of online learning was provided by [Singh and Sutton1996]: TD learning with accumulating traces approximates the every-visit Monte Carlo method, and TD learning with replacing traces approximates the first-visit Monte Carlo method.
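The iterations in Theorem 2 and Theorem 3 are expressed in terms of expectations. In practice, we only have access to the sampled trajectory. Using statistical approaches, we can use the trajectory to estimate the value function. Algorithm 1 corresponds to the online form of $Q(\sigma,\lambda)$.

Algorithm 1: Online $Q(\sigma,\lambda)$ algorithm
Require: Initialize $Q(s,a)$ arbitrarily, for all $s\in\mathcal{S}$, $a\in\mathcal{A}$
Require: Initialize $\mu$ to be the behavior policy
Parameters: step-size $\alpha$
Repeat (for each episode):
    Initialize state-action pair $(S_0,A_0)$
    For $t=0,1,2,\dots$:
        Observe a sample $R_t, S_{t+1}, A_{t+1}\sim\mu$
        $\delta_t^{\sigma}=R_t+\gamma\big[\sigma Q(S_{t+1},A_{t+1})+(1-\sigma)\sum_{a}\pi(S_{t+1},a)Q(S_{t+1},a)\big]-Q(S_t,A_t)$
        For all $s\in\mathcal{S}$, $a\in\mathcal{A}$:
            $e(s,a)\leftarrow\gamma\lambda e(s,a)+\mathbb{I}\{(S_t,A_t)=(s,a)\}$
            $Q(s,a)\leftarrow Q(s,a)+\alpha\delta_t^{\sigma}e(s,a)$
        End For
        $S_t\leftarrow S_{t+1}$, $A_t\leftarrow A_{t+1}$
        If $S_{t+1}$ is terminal: Break
    End For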
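A tabular sketch of one episode of such an online update is given below. It is a reading of the algorithm under stated assumptions, not the authors' implementation: the TD error mixes the sampled and expected bootstrap targets with weight `sigma`, traces are accumulating and decay by `gamma * lam`, and the bootstrap term is taken as zero at termination. The function name and the pre-recorded `transitions` format are hypothetical.

```python
import numpy as np

def q_sigma_lambda_episode(Q, transitions, pi, alpha, gamma, lam, sigma):
    """Apply online trace-based updates for one episode (Q modified in place).

    Q           : float array of shape (n_states, n_actions)
    transitions : iterable of (s, a, r, s_next, a_next, done) tuples
                  generated by the behavior policy
    pi          : array of shape (n_states, n_actions) with target-policy
                  action probabilities
    """
    e = np.zeros_like(Q)  # eligibility traces, reset at episode start
    for s, a, r, s_next, a_next, done in transitions:
        if done:
            bootstrap = 0.0  # assumption: terminal states have zero value
        else:
            expected = np.dot(pi[s_next], Q[s_next])
            bootstrap = sigma * Q[s_next, a_next] + (1.0 - sigma) * expected
        delta = r + gamma * bootstrap - Q[s, a]
        e *= gamma * lam      # decay all traces
        e[s, a] += 1.0        # accumulate trace for the visited pair
        Q += alpha * delta * e
    return Q
```

Recording transitions up front keeps the sketch independent of any particular environment interface; in an actual online setting the loop body would run once per environment step.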
5.2 Online Learning Convergence Analysis
We make some common assumptions similar to [Bertsekas and Tsitsiklis1996, Harutyunyan2016].
Assumption 1.
, minimum visit frequency, every pair can be visited.
Assumption 2.
For every historical chain in a MDPs, , where is a positive constants, is a positive integer.
For the convenience of expression, we give some notations firstly. Let
be the vector obtained after
iterations in the th trajectory, and the superscript emphasizes online learning. We denote the th trajectory as sampled by the policy . Then the online update rules can be expressed as follows:where is the length of the  trajectory.
Theorem 4.
Based on the Assumption 1 and Assumption 2, stepsize satisfying,, is greedy with respect to , then , where is short for with probability one.
Proof.
After some simple algebra:
where . We rewrite the offline update:
where is the returns at time when the pair was visited in the th trajectory,
the superscript in emphasizes the forward (offline) update. denotes the times of the pair visited in the th trajectory.
We define the residual
between and the offline estimate in the th trajectory:
Set , then we consider the next random iterative process:
(11) 
where
Step1:Upper bound on :
(12) 
where .
where is the difference between the total online updates of first steps and the first times offline update in th trajectory. By induction on , we have:
where is a consist and , .
Based on the condition of stepsize in the Theorem 4, , then we have (12).
Step2: .
In fact:
From the property of eligibility trace(more details refer to [Bertsekas et al.2012]) and Assumption 2, we have:
Then according to (11), for some :
Step3: Considering the iteration (11) and Theorem 1 in [Jaakkola et al.1994], then we have . ∎
Based on Theorem 3 in [Munos et al.2016] and our Theorem 4, if $\pi$ is greedy with respect to $Q$, then $Q$ in Algorithm 1 converges to $q^{*}$ with probability one.
Remark 2. The conclusion in [Jaakkola et al.1994] is similar to our Theorem 4, but the update is different from ours, and we further develop it under Assumption 2.
6 Experiments
6.1 Experiment for Prediction Capability
In this section, we test the prediction ability of $Q(\sigma,\lambda)$ in the 19-state random walk environment, a one-dimensional MDP that is widely used in reinforcement learning [Sutton and Barto2017, De Asis et al.2018]. At each state the agent has two actions, left and right, and takes each action with equal probability.
We compare the root-mean-square (RMS) error as a function of episodes, with $\sigma$ varying from 0 to 1 in steps of 0.2. Results in Figure 1 show that the performance of $Q(\sigma,\lambda)$ increases gradually as $\sigma$ decreases from 1 to 0, which agrees with the upper error bound in Theorem 2.
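For reference, the RMS error in such an experiment is measured against the true state values of the random walk. The sketch below computes those values under assumptions the text does not state: termination on the left yields reward -1, termination on the right yields +1, all other rewards are 0, and there is no discounting (the convention used in [Sutton and Barto2017]). Function names are hypothetical.

```python
import numpy as np

def random_walk_true_values(n_states=19):
    """Solve v = r + P v for the equiprobable random walk (undiscounted)."""
    P = np.zeros((n_states, n_states))
    r = np.zeros(n_states)
    for i in range(n_states):
        if i > 0:
            P[i, i - 1] = 0.5
        else:
            r[i] += 0.5 * (-1.0)  # assumed reward for exiting on the left
        if i < n_states - 1:
            P[i, i + 1] = 0.5
        else:
            r[i] += 0.5 * 1.0     # assumed reward for exiting on the right
    return np.linalg.solve(np.eye(n_states) - P, r)

def rms_error(estimate, truth):
    """Root-mean-square error between estimated and true state values."""
    diff = np.asarray(estimate, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

Under these assumptions the true values increase linearly from -0.9 in the leftmost state to 0.9 in the rightmost state.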