1 Introduction
The focus of our work is no-regret model-free algorithms for average-cost reinforcement learning (RL) problems with value function approximation. Most existing regret results for model-free methods apply either to settings with no function approximation [StrLiWe06, AzOsMu17, JinAlBuJo18], or to systems with linear action-value functions and special structure, e.g. [mflq2019]. One exception is the recently proposed Politex algorithm [politex]. Politex
is a variant of policy iteration, where the policy in each phase is selected to be near-optimal in hindsight for the sum of all past action-value function estimates. This is in contrast to standard policy iteration, where each policy is typically greedy with respect to the most recent action-value estimate. In uniformly mixing Markov decision processes (MDPs), when the action-value error decreases to some level
at a parametric rate, the regret of Politex scales as $O(T^{3/4})$. The Politex regret bound requires the error of a value function estimated from $n$ transitions to scale as $O(1/\sqrt{n})$ (up to the irreducible approximation error). Such an error bound was shown by the authors to hold for the least-squares policy evaluation (LSPE) method of bertsekas1996temporal whenever all policies produced by Politex sufficiently explore the feature space (the hidden constant depends on the dimension of the feature space, among other problem-dependent constants). However, this uniform-exploration assumption is often unrealistic. More generally, controlling the value function estimation error in some fixed norm (as required by the general theory of Politex) may be difficult when policies have strong control over which parts of the state space they visit. In this work, we instead propose a deliberate exploration scheme that ensures sufficient coverage of the state/feature space. We also propose a value function estimation approach for linear value function approximation that allows us to obtain performance guarantees similar to those mentioned above, but under milder assumptions.
In particular, we propose to address this problem by separating the problem of finding a policy that explores its environment well (a version of pure exploration) from the problem of learning to maximize a specific reward function. Pure exploration has been the subject of a large number of previous works (e.g., schmidhuber1991adaptive, thrun1992active, pathakICLR19largescale, hazan2019provably and references therein). Our new assumption only requires a single exploratory policy to be available beforehand; we interleave the target policy with short segments of exploration, and our main result quantifies how the choice of the exploration policy impacts the regret. We view our approach as a practical remedy that allows algorithm designers to focus on one aspect of the full RL problem (either learning good value functions, or learning a good exploration policy).
We propose to estimate the value of states generated by the exploration policy using returns computed from rollouts generated by the target policy – a hybrid setting that has both off- and on-policy aspects. We provide a Monte-Carlo value estimation algorithm that can learn from such data, and analyze its error in the linear function approximation case. Learning from exploratory data allows us to obtain a more meaningful bound on the estimation error than the available results for temporal difference (TD)-based methods [Sutton1988], as we cover states which the target policy might not visit otherwise. Under a uniform mixing assumption, the described estimation procedure can be accomplished using a single trajectory, and the exploration scheme can be thought of as performing soft resets within the trajectory.
We complement the novel algorithmic and theoretical contributions with synthetic experiments chosen to demonstrate that explicit exploration is indeed beneficial in hard-to-explore MDPs. While our analysis only holds for linear value functions, we also experiment with neural networks used for value function approximation and demonstrate that the proposed exploration scheme improves the performance of
Politex on cart-pole swing-up, a problem known to be hard for algorithms that only explore using simple strategies like dithering.
2 Background
2.1 Problem definition
For an integer $n$, let $[n] = \{1, \dots, n\}$. We consider MDPs specified by a finite set of states, a finite set of actions, a cost function $c$, and a transition kernel $P$. The assumption that the number of states is finite is non-essential, while relaxing the assumption that the number of actions is finite is less trivial. Recall that a policy $\pi$ is a mapping from states to distributions over actions, and following policy $\pi$ means that in time step $t$, after observing state $x_t$, an action $a_t$ is sampled from distribution $\pi(\cdot|x_t)$. Denoting by $(x'_1,a'_1), (x'_2,a'_2), \dots$ the sequence of state-action pairs generated under a baseline policy $\pi'$, the regret of a reinforcement learning algorithm that gives rise to the state-action sequence $(x_1,a_1), (x_2,a_2), \dots$, with respect to policy $\pi'$, is defined as
$$R_T = \sum_{t=1}^{T} c(x_t, a_t) - \sum_{t=1}^{T} c(x'_t, a'_t). \qquad (1)$$
Our goal is to design a learning algorithm that guarantees a small regret with high probability with respect to a large class of baseline policies.
Throughout the paper, we make the following assumption, which ensures that the quantities we define next are well-defined [yu2009convergence].

Assumption A1 (Single recurrent class) The states of the MDP under any policy form a single recurrent class.
MDPs satisfying this condition are also known as unichain MDPs [Puterman94, Section 8.3.1]. Under Assumption A1, the states observed under policy $\pi$
form a Markov chain that has a unique stationary distribution, denoted by $\mu_\pi$.
We also let $\nu_\pi(x,a) = \mu_\pi(x)\pi(a|x)$, which is the unique stationary distribution of the Markov chain of state-action pairs under $\pi$. We use $P_\pi$ to denote the state transition matrix under policy $\pi$. Define the state-action transition matrix $H_\pi$ by $H_\pi((x,a),(x',a')) = P(x'|x,a)\pi(a'|x')$. For convenience, we will interchangeably write $H_\pi(x,a,x',a')$ to denote this value. We will also do this for other transition matrices/kernels. Finally, let $\lambda_\pi$ be the average cost of the policy $\pi$. We use the subscript $0$ to denote quantities related to the exploration policy, so $\mu_0$ is the stationary distribution of the exploration policy $\pi_0$. Part of our analysis will require the following mixing assumption on all policies.
Assumption A2 (Uniformly fast mixing) Any policy $\pi$ has a unique stationary distribution $\nu_\pi$. Furthermore, there exists a constant $\kappa > 0$ such that for any policy $\pi$ and any distribution $\nu$ over state-action pairs, $\|(\nu - \nu_\pi) H_\pi\|_1 \le e^{-1/\kappa} \|\nu - \nu_\pi\|_1$.
2.2 Actionvalue estimation
For a fixed policy $\pi$, we define $Q_\pi(x,a)$, the action-value of $\pi$ at the state-action pair $(x,a)$, as the expected total differential cost incurred when the process is started from state $x$, the first action is $a$, and in the remaining time steps policy $\pi$ is followed. Here, the differential cost in a time step is the difference between the immediate cost for that time step and the average expected cost $\lambda_\pi$ under $\pi$.
Under our assumption, up to the addition of a scalar multiple of $\mathbf{1}$ (the all-ones vector), $Q_\pi$
(viewed as a vector) is the unique solution to the Bellman equation
$Q_\pi = c - \lambda_\pi \mathbf{1} + H_\pi Q_\pi$. We will also use that, given a function $Q$, any Boltzmann policy $\pi(a|x) \propto \exp(-\eta Q(x,a))$ is invariant to shifting $Q$ by a constant.

One common approach to dealing with large state-action spaces is to use function approximators to approximate $Q_\pi$. While our algorithm works with general function approximators, we will provide an estimation approach and analysis for linear approximators, where the approximate solution lies in the subspace spanned by a feature matrix $\Phi$. With linear function approximation, value estimation algorithms solve the projected Bellman equation $\Phi w = \Pi_\mu (c - \lambda_\pi \mathbf{1} + H_\pi \Phi w)$. Here, $\Pi_\mu$ is the projection matrix with respect to the stationary distribution $\mu$ of a policy. In on-policy estimation, $\mu$ corresponds to the evaluated policy $\pi$, while in off-policy estimation it corresponds to a behavior policy used to collect data. Let $w_{TD}$ be the solution of the above equation, also known as the TD solution, and let $\widehat{w}_n$ be the on-policy estimate computed using $n$ data points. For on-policy algorithms, error bounds of the following form are available: $\|\Phi \widehat{w}_n - \Phi w_{TD}\|_\mu \le O(1/\sqrt{n})$. For convergence results of this type, see [TsitsiklisVanRoy1997, TsitsiklisVanRoy1999, antos2008learning, SuttonSzepesvariMaei2009, MaeiSzepesvariBhatnagarSutton2010, lazaric2012finite, GeistScherrer2014, farahmand2016regularized, liu2012regularized, liu2015finite, bertsekas1996temporal, yu2009convergence].
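To make the projected Bellman equation concrete, the following sketch (our illustration, not from the paper: the two-state chain, costs, and the single feature are made-up values) computes the average-cost TD solution in closed form and compares it to the best $\mu$-weighted projection of the true differential value function:

```python
# Hypothetical two-state chain; the transition matrix, costs, and single
# feature below are illustrative assumptions, not taken from the paper.
P = [[0.9, 0.1],
     [0.2, 0.8]]        # P[x][y]: probability of moving x -> y under the policy
c = [1.0, 0.0]          # per-state cost
phi = [1.0, -1.0]       # one feature per state (d = 1), excludes the constant vector

# Stationary distribution mu (closed form for a two-state chain: mu P = mu).
mu = [P[1][0] / (P[0][1] + P[1][0]), 0.0]
mu[1] = 1.0 - mu[0]
lam = mu[0] * c[0] + mu[1] * c[1]    # average cost lambda_pi

# TD solution of the projected Bellman equation:
#   Phi^T D (I - P) Phi w = Phi^T D (c - lam 1),  with D = diag(mu).
A = sum(mu[x] * phi[x] * (phi[x] - sum(P[x][y] * phi[y] for y in range(2)))
        for x in range(2))
b = sum(mu[x] * phi[x] * (c[x] - lam) for x in range(2))
w_td = b / A

# True differential value h solves (I - P) h = c - lam 1 with mu . h = 0;
# then project h onto span(phi) in the mu-weighted norm.
diff = (c[0] - lam) / (1.0 - P[0][0])          # h[0] - h[1]
h = [mu[1] * diff, mu[1] * diff - diff]
w_proj = (sum(mu[x] * phi[x] * h[x] for x in range(2))
          / sum(mu[x] * phi[x] ** 2 for x in range(2)))

print(w_td, w_proj)   # the TD fixed point and the best projection differ
```

Note that `w_td` and `w_proj` do not coincide; this gap between the TD fixed point and the best linear approximation is exactly the issue revisited in Section 4.2.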
Unfortunately, one issue with such weighted-norm bounds is that the error can be very large for state-action pairs which are rarely visited by the policy. To overcome this issue, we aim to bound the error with respect to a weighted norm whose weights span all directions of the feature space. We will assume access to an exploration policy that excites “all directions” of the feature space:

Assumption A3 (Uniformly excited features) There exists a positive real $\sigma$ such that for the exploration policy $\pi_0$, $\lambda_{\min}\big(\mathbb{E}_{(x,a)\sim\nu_0}[\varphi(x,a)\varphi(x,a)^\top]\big) \ge \sigma$, where $\varphi(x,a)$ denotes the feature vector of the pair $(x,a)$.
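This condition can be checked empirically from samples by estimating the smallest eigenvalue of the feature second-moment matrix. The sketch below is our illustration (the feature vectors are made up) for two-dimensional features:

```python
def min_eig_2x2(a, b, d):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[a, b], [b, d]]."""
    tr, det = a + d, a * d - b * b
    return tr / 2.0 - ((tr / 2.0) ** 2 - det) ** 0.5

def excitation_constant(feats):
    """Empirical lambda_min of (1/n) sum_i phi_i phi_i^T for 2-d features."""
    n = float(len(feats))
    a = sum(f[0] * f[0] for f in feats) / n
    b = sum(f[0] * f[1] for f in feats) / n
    d = sum(f[1] * f[1] for f in feats) / n
    return min_eig_2x2(a, b, d)

# A policy visiting both feature directions satisfies the assumption...
sigma_good = excitation_constant([(1.0, 0.0), (0.0, 1.0)] * 500)
# ...while a policy stuck in one direction does not (lambda_min = 0).
sigma_bad = excitation_constant([(1.0, 0.0)] * 1000)
print(sigma_good, sigma_bad)
```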
Given the exploration policy, our goal will be to bound the estimation error in the norm weighted by its stationary distribution $\nu_0$. We will describe an algorithm for which we can obtain such a bound; it estimates the value function from on-policy trajectories whose initial states are sampled from $\mu_0$. We are not aware of any finite-time error bounds for off-policy methods in the average-cost setting.
2.3 Policy iteration and Politex
Policy iteration algorithms (see e.g. bertsekas2011approximate) alternate between executing a policy and estimating its action-value function, and policy improvement based on the estimate. In Politex, the policy produced at each phase $k$ is a Boltzmann distribution over the sum of all previous action-value estimates, $\pi_k(a|x) \propto \exp\big(-\eta \sum_{j<k} \widehat{Q}_j(x,a)\big)$. Here, $\widehat{Q}_j$ is the action-value estimate at phase $j$. Under uniform mixing, and for action-value errors that decrease to a level $\varepsilon$ at a parametric rate, the regret of Politex scales as $O(T^{3/4} + \varepsilon T)$. The authors show that the error condition holds for the LSPE algorithm when all policies (not only the exploration policy) satisfy the feature excitation condition. Since this is difficult to guarantee in practice, in this work we aim to bound all errors in the same norm, weighted by the stationary state-action distribution of an exploratory policy $\pi_0$. Thus, we only require the feature excitation condition to hold for a single known policy.
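The Politex policy update can be sketched as follows (a minimal illustration of ours; `q_sums[x][a]` holds the running sum of past action-value estimates for state `x`, and `eta` is the Boltzmann parameter):

```python
import math

def politex_policy(q_sums, eta):
    """pi(a|x) proportional to exp(-eta * sum_j Qhat_j(x, a)); the minus sign
    is there because the Q estimates are costs to be minimized."""
    policy = {}
    for x, qs in q_sums.items():
        # Subtracting min(qs) is numerically stable and harmless: Boltzmann
        # policies are invariant to shifting the values by a constant
        # (the invariance noted in Section 2.2).
        lo = min(qs)
        ws = [math.exp(-eta * (q - lo)) for q in qs]
        z = sum(ws)
        policy[x] = [w / z for w in ws]
    return policy

pi = politex_policy({0: [1.0, 2.0, 4.0]}, eta=1.0)
print(pi[0])   # the lowest-cost action gets the highest probability
```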
Figure 1: Pseudocode of EE-Politex: Politex(initial state, exploration policy, phase length), with subroutine CollectData.
3 Explorationenhanced Politex
The proposed algorithm, which we will refer to as EE-Politex (exploration-enhanced Politex), is shown in Figure 1. Compared to Politex, the main difference is that we assume access to a fast-mixing exploration policy which spans the feature space, and we run that policy in short segments on a fixed schedule. Intuitively, the exploration policy serves the purpose of performing soft resets of the environment to a random state within a single trajectory. The action-value function of each policy is then estimated using on-policy trajectories, whose initial states are chosen approximately i.i.d. from the stationary state distribution of the exploration policy. We assume access to an estimation black box which can learn from such data. In the next section, we show a concrete least-squares Monte-Carlo (LSMC) algorithm with an error bound of $\varepsilon + O(1/\sqrt{n})$ when run on $n$ data tuples, where $\varepsilon$ is the worst-case approximation error. Our value-estimation requirement is weaker than that in politex, since we provide exploratory data to the estimation black box, rather than requiring each policy produced by Politex to sufficiently explore. However, as we show next, the exploration segments come at the price of a slightly worse regret of $O(T^{4/5})$ compared to $O(T^{3/4})$.
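The interleaving schedule can be sketched as follows (our minimal illustration, not the paper's pseudocode; the segment lengths are free parameters whose optimal scaling is derived in Section 3.1):

```python
def ee_schedule(num_phases, segments_per_phase, tau_explore, tau_target):
    """Yield ('explore' | 'target', phase index) for each time step.

    Within every phase, each segment first runs the exploration policy for
    tau_explore steps (a soft reset toward its stationary distribution) and
    then the current target policy for tau_target steps.
    """
    for k in range(num_phases):
        for _ in range(segments_per_phase):
            for _ in range(tau_explore):
                yield ("explore", k)
            for _ in range(tau_target):
                yield ("target", k)

steps = list(ee_schedule(num_phases=2, segments_per_phase=4,
                         tau_explore=3, tau_target=7))
explore_frac = sum(mode == "explore" for mode, _ in steps) / len(steps)
print(len(steps), explore_frac)   # 80 steps, 30% of them spent exploring
```

The regret analysis balances the cost of these exploration steps against the improved value-estimation error they enable.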
3.1 Regret of EEPolitex
Consider any learning agent that produces the state-action sequence $(x_1,a_1), (x_2,a_2), \dots$ while interacting with an MDP. For a fixed time step $t$, let $\pi_t$ denote the policy that is used to generate $a_t$. Let $\mathcal{T}_0$ be the set of rounds in which the exploration policy is played, and let $R_{T,0}$ be the pseudo-regret in those rounds. Similarly to politex, we decompose the regret into pseudo-regret and noise terms:
$$R_T = \sum_{t=1}^{T} (\lambda_{\pi_t} - \lambda_{\pi'}) + \sum_{t=1}^{T} \big(c(x_t,a_t) - \lambda_{\pi_t}\big) + \sum_{t=1}^{T} \big(\lambda_{\pi'} - c(x'_t,a'_t)\big). \qquad (2)$$
We first bound the pseudo-regret by a direct application of Theorem 4.1 of politex to the non-exploration rounds.
Theorem 3.1.
Fix $\delta \in (0,1)$, and let the error level $\varepsilon$ and the phase schedule be such that, for any phase $k$, with probability $1-\delta$,
$$\|\widehat{Q}_k - Q_{\pi_k}\|_{\nu_0} \le \varepsilon \qquad (3)$$
and $\|\widehat{Q}_k\|_\infty \le Q_{\max}$ for any $k$. Then, with probability $1-2\delta$, the regret of Politex relative to the reference policy $\pi'$ satisfies
where the hidden constant depends on problem-dependent quantities.
Next, we bound the noise terms using a modified version of Lemma 4.4 of politex that accounts for the additional policy switches due to exploration:
Lemma 3.2.
Let Assumption A2 hold. If the algorithm has a total of $m$ policy restarts, and each policy is executed for at most $\tau$ timesteps, then with probability at least $1-\delta$, we have that
The proof is given in Appendix A.
Assume that the action-value error bound is of the form $\varepsilon + C/\sqrt{n}$, where $\varepsilon$ is the irreducible approximation error (defined in the next section) and $C$ is a constant. Given $m$ policy updates and phases of length $\tau$, the exploration term is bounded by the total time spent exploring, the pseudo-regret is bounded via Theorem 3.1, and the remaining noise terms are controlled by Lemma 3.2. By optimizing the trade-off between these terms, we obtain the choice of $m$, $\tau$, and the exploration segment lengths, and the corresponding regret is of order $O(T^{4/5})$.
4 Leastsquares MonteCarlo estimation with exploration
Our approach is to directly solve a least-squares problem and find a weight vector $w$ such that $\Phi w \approx Q_\pi$. In order to do so, we use the definition of the value of policy $\pi$,
$$Q_\pi(x,a) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \big(c(x_t,a_t) - \lambda_\pi\big) \,\Big|\, x_0 = x,\ a_0 = a\Big]. \qquad (4)$$
Unlike TD methods, this approach does not rely on the Bellman equation.
Let $U$
be the uniform distribution over the action space and let $\rho = \mu_0 \otimes U$
be the data-generating distribution. (We define $\mu \otimes \pi$ to be the distribution on state-action pairs that puts probability mass $\mu(x)\pi(a|x)$ on the pair $(x,a)$.) Let $\bar{Q}_\pi = \Phi w_\pi$ be the best linear value function estimator with respect to the norm $\|\cdot\|_\rho$. The irreducible approximation error $\varepsilon$ is the smallest value such that $\|\bar{Q}_\pi - Q_\pi\|_\rho \le \varepsilon$ uniformly for all policies. We use $Q_{\max}$ to denote a constant such that $\|Q_\pi\|_\infty \le Q_{\max}$ uniformly for all policies. Let $m$ be a sufficiently large integer (polynomial in the mixing time of policy $\pi$). Let $D = \{(x_i, a_i, Z_i)\}_{i=1}^{n}$ be a dataset, where for each $i$, $x_i$ is a state sampled under the exploration distribution $\mu_0$, $a_i$ is sampled from the uniform distribution $U$, and $Z_i$ is a trajectory generated under policy $\pi$ starting at state $x'_i$, where $x'_i$ is the first state observed after taking action $a_i$ in state $x_i$. The trajectory $Z_i$ has the form $(x_{i,1}, a_{i,1}, c_{i,1}), \dots, (x_{i,m}, a_{i,m}, c_{i,m})$, where $x_{i,t}$ is the state obtained by starting from state $x'_i$ and running policy $\pi$ for $t$ rounds, and $c_{i,t}$ is the corresponding cost. Each sample's regression target is the truncated differential return
$$z_i = \sum_{t=1}^{m} \big(c_{i,t} - \widehat{\lambda}_\pi\big), \qquad (5)$$
where $\widehat{\lambda}_\pi$ is an estimate of the average cost of $\pi$.
Consider the dataset $D$. Let $\widehat{w}$ be the solution of the following least-squares problem:
$$\widehat{w} = \operatorname*{arg\,min}_{w} \sum_{i=1}^{n} \big(\varphi(x_i,a_i)^\top w - z_i\big)^2.$$
The state-action value estimate is $\widehat{Q}(x,a) = \varphi(x,a)^\top \widehat{w}$.
We will analyze the above estimation procedure in the next section. Although we treat each trajectory as a single sample, in practice it is often beneficial to use all of the data. For example, first-visit Monte-Carlo methods rely on returns from the first time a state-action pair is encountered in a trajectory, while every-visit Monte-Carlo methods average the returns obtained from all observations of a state-action pair. We will refer to the analyzed approach as one-visit. Also, the choice of the one-visit estimator is made to simplify the analysis; the first-visit or every-visit estimates might be more appropriate in practice.
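A minimal sketch of the one-visit fit (our illustration, for $d = 2$ via explicit normal equations; `data` pairs the initial state-action features with the truncated differential return of the corresponding rollout):

```python
def lsmc_fit(data):
    """Least-squares Monte-Carlo fit: minimize sum_i (phi_i . w - z_i)^2."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for phi, z in data:
        a11 += phi[0] * phi[0]
        a12 += phi[0] * phi[1]
        a22 += phi[1] * phi[1]
        b1 += phi[0] * z
        b2 += phi[1] * z
    det = a11 * a22 - a12 * a12   # positive when the features are excited
    return ((a22 * b1 - a12 * b2) / det,
            (a11 * b2 - a12 * b1) / det)

# Noiseless synthetic returns generated by the made-up weights (2, -1);
# the fit recovers them exactly.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = lsmc_fit(data)
print(w)
```

With noisy returns, the recovered weights concentrate around the best linear fit at the $O(1/\sqrt{n})$ rate analyzed below.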
4.1 Value estimation
First, we show that the bias due to the finite rollout length is exponentially small.
Lemma 4.1.
For the choice of and for , we have that
The proof is given in Appendix B. Given that these errors are exponentially small, we will ignore them to improve the readability.
Let $\widehat{M} = \frac{1}{n}\sum_{i=1}^{n} \varphi(x_i,a_i)\varphi(x_i,a_i)^\top$ be the empirical estimate of $M = \mathbb{E}_\rho[\varphi\varphi^\top]$ computed from the data $D$. The next lemma bounds the least-squares Monte-Carlo estimation error.
Lemma 4.2.
Fix $\delta \in (0,1)$. Under the assumption that $n$ is sufficiently large, with probability at least $1-\delta$, we have that
The proof is given in Appendix B. The above result, combined with the triangle inequality and the definition of $\bar{Q}_\pi$, implies that for sufficiently large $n$, $\|\widehat{Q} - Q_\pi\|_\rho \le \varepsilon + O(1/\sqrt{n})$. Thus, the quantity $\varepsilon$ in Theorem 3.1 can be chosen to be of this order.
4.2 Comparison with existing work
Our value estimation approach is related to off-policy temporal difference methods such as LSTD and LSPE, in the sense that those methods attempt to solve the projected Bellman equation, where the projection matrix is weighted by the distribution of the behavior policy, and the transition matrix corresponds to the target policy. The goal is again to bound the estimation error in the weighted norm. However, while some analysis of off-policy LSTD exists for discounted MDPs [Yu2010], we are not aware of any similar results for average-cost problems. In fact, it is known that off-policy LSPE can diverge, due to the composition of the projection and transition matrices not being contractive in the weighted norm [bertsekas2011approximate]. In comparison to off-policy Monte-Carlo methods, our approach benefits from the fact that returns are estimated using on-policy rollouts, and hence we do not require importance weights.
Our bound has two advantages over the available results for on-policy LSTD and LSPE. First, the available bounds for these methods involve certain undesirable terms that do not appear in our result. For example, politex show an error bound in which the estimation error is inflated by a factor depending on the contraction coefficient $\alpha$ of the mapping $\Pi_{\mu_\pi} H_\pi$ with respect to the norm $\|\cdot\|_{\mu_\pi}$, where $w_{TD}$ is the TD fixed point and the finite-sample LSPE estimate for policy $\pi$ is compared against it. TsitsiklisVanRoy1999 show that $\alpha$ is smaller than one. However, it can be arbitrarily close to one, which makes such error bounds meaningless. Similar quantities appear in the error bounds for the LSTD algorithm. In contrast, our bounds do not depend on this measure of contractiveness. Second, the TD fixed-point solution is not the same as the best possible estimate in the linear span of the features, and this introduces an additional approximation error. Theorem 2.2 of yu2010error shows that we might lose a multiplicative factor of $1/\sqrt{1-\alpha^2}$:
$$\|\Phi w_{TD} - Q_\pi\|_{\mu_\pi} \le \frac{1}{\sqrt{1-\alpha^2}} \inf_{w} \|\Phi w - Q_\pi\|_{\mu_\pi}. \qquad (6)$$
In contrast, we aim to get a better estimate by minimizing the error directly, without imposing the fixed-point constraints.
5 Experiments
5.1 DeepSea environment
We first demonstrate the advantages of our exploration scheme on a small-scale environment known as DeepSea (see also OsbandWenVanRoy2016), in which exploration can nonetheless be difficult. The states of this environment comprise an $N \times N$ grid, and there are two actions. The environment transitions and costs are deterministic. On action 0 in cell $(i,j)$, the environment transitions down and left, to cell $(i+1, j-1)$. On action 1 in cell $(i,j)$, the environment transitions down and right, to cell $(i+1, j+1)$. The agent starts in the top-left state. The reward (negative cost) in the bottom-right state is large and positive for any action. For all other states, the reward for action 0 is 0, and the reward for action 1 is a small negative value. Thus, during the first $N$ steps (and in the episodic version of this task), the agent can obtain a positive return only if it always takes action 1, even though it is expensive. In the infinite-horizon setting, an optimal strategy first gets to the right edge and then takes an equal number of left and right actions; a simple strategy that always takes action 1 obtains a lower average reward. We represent states as vectors containing one-hot indicators for each grid coordinate.
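A sketch of this environment in code (our reconstruction; the exact reward constants and the infinite-horizon wrap-around behavior are assumptions, since they are not fully specified above):

```python
class DeepSea:
    SMALL_COST = 0.01    # assumed per-step cost of action 1 (the "expensive" action)
    GOAL_REWARD = 1.0    # assumed reward in the bottom-right state

    def __init__(self, n):
        self.n = n
        self.row, self.col = 0, 0    # agent starts in the top-left corner

    def step(self, action):
        at_goal = self.row == self.n - 1 and self.col == self.n - 1
        if at_goal:
            reward = self.GOAL_REWARD            # rewarded for any action at the goal
        elif action == 1:
            reward = -self.SMALL_COST            # down-right, slightly expensive
        else:
            reward = 0.0                         # down-left, free
        self.col = min(self.col + 1, self.n - 1) if action == 1 else max(self.col - 1, 0)
        self.row = (self.row + 1) % self.n       # assumed wrap-around for the infinite horizon
        return (self.row, self.col), reward

env = DeepSea(4)
rewards = [env.step(1)[1] for _ in range(4)]     # always take action 1
print(rewards)   # three small costs, then the goal reward
```

The difficulty is visible here: a dithering policy almost never strings together the $N$ consecutive action-1 steps needed to see the goal reward even once.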
We experiment with Politex-LSPE, EE-Politex-LSMC, and Politex-LSMC, i.e. Politex with value estimation using LSMC and no exploration. We also evaluate an online version of RLSVI [OsbandWenVanRoy2016] with linear function approximation, similar to the version described in politex. For exploration, we use a policy that always takes action 1 and runs for $N$ steps in each rollout. This policy can help discover the hidden reward, but incurs additional costs. For LSMC, we evaluate first-visit, every-visit, and one-visit (i.e. just using the first sample) return estimates computed from length-$N$ rollouts. The results are shown in Figure 1 as costs, i.e. negative rewards. On a small grid, all policies achieve the lowest cost. However, as the grid size increases, RLSVI and no-exploration Politex converge to the suboptimal policy which always takes action 0. The performance of one-visit LSMC with exploration also deteriorates for larger $N$, and costs are positive due to the exploration segments, suggesting that longer runs (more samples) are required in this case.
5.2 Sparse Cartpole with function approximation
In the next experiment, we examine whether the promising theoretical results presented in this paper lead to a practical algorithm when applied in the context of neural networks. We take the classic Cartpole (a.k.a. inverted pendulum) problem [tassa2018deepmind] (Fig. 2 right), where the objective is to balance a pole attached to a cart by only moving the cart left and right along its axis. The observation is a tuple consisting of the position of the cart and its velocity, the cosine and sine of the angle of the pole relative to the upright position, and the rate of change of this angle. There are three actions, which correspond to applying force to the cart towards the left or right, or applying no force. Each episode begins with the pole hanging downwards and ends after 1000 timesteps. There is a small cost of 0.1 for any movement of the pole. A reward of 1 is collected whenever the pole is almost upright and the cart is centered (with a controllable threshold). This is a difficult exploration problem, as the rewards are sparse; in particular, no reward is seen at intermediate states, which the agent must nevertheless explore to solve the problem. We compare EE-Politex to Politex, where for EE-Politex we make use of a separate exploration policy that was trained with a separate reward function that encourages the exploration of states where the pole is swung up. We approximate state-action value functions using neural networks. We execute policies that are based only on the $k$ most recent networks, where $k$ is a parameter to be chosen. Further implementation details are given in Section C.1. Rather than evaluating learned policies after training, in line with the setting of this paper, we plot the total reward obtained by the agent against the total number of time steps. From Fig. 2, we can see that without the exploration policy, Politex never finds the solution to the problem, and learns not to move the cart at all.
On the other hand, EE-Politex takes advantage of the exploration policy and manages to learn to collect rewards. We also ran experiments on Ms Pacman (Atari), where we found that mixing in an exploration policy did not help (details in Section C.2).
6 Discussion
We have proposed an exploration strategy that utilizes a fast-mixing exploratory policy. This strategy can be used with an action-value estimation algorithm that learns from on-policy trajectories whose initial states are chosen from the stationary state distribution of the exploration policy. One such algorithm is least-squares Monte Carlo, for which we have provided an analysis of the estimation error. Integrating our exploration scheme into the Politex algorithm results in a new algorithm that enjoys sublinear regret guarantees under much weaker assumptions than previous work. The exploration scheme was demonstrated to improve empirical performance in difficult environments. While much work has been devoted to learning exploration policies, an interesting open problem for future work is learning nontrivial exploration policies which span the state-action feature space (and do not depend on returns).
References
Appendix A Proof of Lemma 3.2
We prove that if exploration-enhanced Politex has a total of $m$ policy restarts, and each policy is executed for at most $\tau$ timesteps, then with probability at least $1-\delta$, we have that
Proof.
For the first noise term, the proof is the same as in Lemma 4.4 of politex, and we provide only a sketch. We decompose the term as follows:
(7) 
where $\nu_{\pi_t}$ denotes the stationary distribution of the policy $\pi_t$, $\nu_t$ is the state-action distribution after $t$ time steps, and we used that $\lambda_{\pi_t} = \nu_{\pi_t}^\top c$. We bound the first term in the equation above using the uniform mixing assumption:
(8) 
where we have assumed that the mixing condition holds for each executed policy. The second term can be written as a sum of terms $(\nu_t - e_t)^\top c$, where $e_t$ is a binary indicator vector with a nonzero entry at the linear index of the state-action pair $(x_t,a_t)$. The bound follows from the fact that the partial sums form a vector-valued martingale with a bounded difference sequence, and the Azuma inequality.
The bound on the second noise term follows similarly, by noticing that Politex makes policy switches after segments of bounded length. Decomposing analogously to the first term, we have that with probability at least $1-\delta$, the contribution of each segment is bounded. Within each of the $m$ iterations, we have identically-distributed bounded segments corresponding to the same policy. A similar observation applies to the segments corresponding to the exploration policy. Thus, using a union bound and Hoeffding’s inequality, we have
(9) 
∎
Appendix B Proofs of Section 4.1
First, we prove that for the choice of rollout length $m$ in Lemma 4.1, the truncation bias is exponentially small.
Proof of Lemma 4.1.
Let $e_x$ be a binary indicator vector corresponding to a state $x$. By Eq. 5,
By the uniformly fast mixing assumption,
and
where the last inequality follows from the choice of the rollout length $m$. ∎
Proof of Lemma 4.2.
We have that and . Thus,
Using the assumption that the features are uniformly excited under the exploration policy,
(10) 
For the second term, consider that for any vector ,
and $\widehat{M}$ is simply its estimate using $n$ i.i.d. samples, so the second term can be bounded accordingly. For the first term, notice that only
a few elements of the relevant vector are nonzero, and these elements are random variables with zero expectation. So for any deterministic vector,
by Hoeffding’s inequality, with probability at least $1-\delta$, the claimed bound follows. ∎
Appendix C Neural network experiment setup
C.1 Cartpole experiment
Our implementation of the Cartpole experiment is based on horgan2018distributed, which is a distributed implementation of DQN [mnih2015human], featuring Dueling networks [DBLP:journals/corr/WangFL15], N-step returns [DBLP:journals/corr/AsisHHS17], Prioritized replay [schaul2015prioritized], and Double Q-learning [DBLP:journals/corr/HasseltGS15]. To adapt this to Politex, we used TD-learning and Boltzmann exploration, with the learning rate set according to SOLO FTRL [OrDa15]: for a given state,
where the tunable constant multiplier was chosen based on preliminary experiments, the number of actions in the game enters the learning rate, and
the state-action values of all past networks, indexed from the first phase up to the current timestep, are summed; the minimisation over a scalar offset (a multiple of the all-ones vector) achieves robustness against the changing ranges of state-action values. The minimisation is a one-dimensional convex optimisation problem, which we solve numerically.
Both methods used the same neural network architecture of a dueling MLP Q-network with one hidden layer of size 512. Each learner step uniformly samples a batch of 128 experiences from the experience replay memory. The actors take 16 game steps in total for each learning step. We enter a new phase and terminate the current phase when the number of learning steps taken in the current phase reaches 100 times the square root of the total learning steps taken. When a new phase is started, the freshly learned neural network is copied into a circular buffer of size 10, which is used by the actors to calculate the averaged Q-values, weighted by the length of each phase. For EE-Politex, we split each phase into “mini-phases” of length scaling with the square root of the current phase length. Each mini-phase consists of following the exploration policy for 1000 steps, and then following the learned policy for a number of steps that scales with the square root of the current phase length. This corresponds to the parameter settings used in our analysis.
C.2 Ms Pacman (Atari) experiment
We also compared Politex with EE-Politex on the Atari game Ms Pacman. Here, the exploration strategy mixed into EE-Politex was trained such that the agent had to drive Ms Pacman to random positions on the map. This policy collects significantly less reward than one trained to maximise rewards, but explores the map well. We found that this exploration did not help EE-Politex; in particular, it seems that exploration is relatively simple in this game, and enough exploration is performed even without mixing in the exploration policy. The results are shown in Fig. 3; the exploration only hurts EE-Politex, as the agent collects less reward during exploration episodes.