The focus of our work is no-regret model-free algorithms for average-cost reinforcement learning (RL) problems with value function approximation. Most existing regret results for model-free methods apply either to settings with no function approximation [StrLiWe06, AzOsMu17, JinAlBuJo18], or to systems with linear action-value functions and special structure, e.g. [mflq2019]. One exception is the recently proposed Politex algorithm of politex. Politex
is a variant of policy iteration, where the policy in each phase is selected to be near-optimal in hindsight for the sum of all past action-value function estimates. This is in contrast to standard policy iteration, where each policy is typically greedy with respect to the most recent action-value estimate. In uniformly-mixing Markov decision processes (MDPs), when the action-value error decreases to some level at a parametric rate, the regret of Politex scales as .
The Politex regret bound requires the error of a value function estimated from transitions to scale as . Such an error bound was shown by the authors to hold for the least-squares policy evaluation (LSPE) method of bertsekas1996temporal whenever all policies produced by Politex sufficiently explore the feature space (the hidden constant depends on the dimension of the feature space, among other problem-dependent constants). However, this uniform-exploration assumption is often unrealistic. More generally, controlling the value function estimation error in some fixed norm (as required by the general theory of Politex) may be difficult when policies have strong control over which parts of the state space they visit. In this work, we instead propose a deliberate exploration scheme that ensures sufficient coverage of the state/feature space. We also propose a value function estimation approach for linear value function approximation that yields performance guarantees similar to the previously mentioned ones, but under milder assumptions.
In particular, we propose to address this problem by separating the problem of finding a policy that explores its environment well (a version of pure exploration) from the problem of learning to maximize a specific reward function. Pure exploration has been the subject of a large number of previous works (e.g., schmidhuber1991adaptive, thrun1992active, pathakICLR19largescale, hazan2019provably, and references therein). Our new assumption only requires a single exploratory policy to be available beforehand; the algorithm interleaves the target policy with short segments of exploration, and our main result quantifies how the choice of the exploration policy impacts the regret. We view our approach as a practical remedy that allows algorithm designers to focus on one aspect of the full RL problem (either learning good value functions, or learning a good exploration policy).
We propose to estimate the value of states generated by the exploration policy using returns computed from rollouts generated by the target policy – a hybrid, in-between case that has both off- and on-policy aspects. We provide a Monte-Carlo value estimation algorithm that can learn from such data, and analyze its error bound in the linear function approximation case. Learning from exploratory data allows us to obtain a more meaningful bound on the estimation error than the available results for temporal difference (TD)-based methods [Sutton-1988], as we cover states which the target policy might not visit otherwise. Under a uniform mixing assumption, the described estimation procedure can be accomplished using a single trajectory, and the exploration scheme can be thought of as performing soft resets within the trajectory.
We complement the novel algorithmic and theoretical contributions with synthetic experiments chosen to demonstrate that explicit exploration is indeed beneficial in hard-to-explore MDPs. While our analysis only holds for linear value functions, we also experiment with neural networks used for value function approximation and demonstrate that the proposed exploration scheme improves the performance of Politex on cartpole swing-up, a problem which is known to be hard for algorithms that only explore using simple strategies such as dithering.
2.1 Problem definition
For an integer , let . We consider MDPs specified by states, actions, cost function , and transition matrix . The assumption that the number of states is finite is non-essential, while relaxing the assumption that the number of actions is finite is less trivial. Recall that a policy is a mapping from states to distributions over actions and following policy means that in time step , after observing state an action is sampled from distribution . Denoting by the sequence of state-actions generated under a baseline policy , the regret of a reinforcement learning algorithm that gives rise to the state-action sequence , with respect to policy , is defined as
Our goal is to design a learning algorithm that guarantees a small regret with high probability with respect to a large class of baseline policies.
Throughout the paper, we make the following assumption, which ensures that the quantities we define next are well-defined [yu2009convergence].
Assumption A1 (Single recurrent class) The states of the MDP under any policy form a single recurrent class.
MDPs satisfying this condition are also known as unichain MDPs [Puterman94, Section 8.3.1]. Under Assumption A1, the states observed under policy
form a Markov chain that has a unique stationary distribution, denoted by. We also let , which is the unique stationary distribution of the Markov chain of state-action pairs under . We use to denote the transition matrix under policy . Define the transition matrix by . For convenience, we will interchangeably write to denote this value. We will also do this for other transition matrices/kernels. Finally, let be the average cost of the policy and . We use subscript to denote the quantities related to the exploration policy. So is the stationary distribution of the exploration policy . Part of our analysis will require the following mixing assumption on all policies.
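As a concrete illustration (not part of the analysis), the stationary distribution of a policy's state Markov chain can be computed for a small MDP by solving the linear system defined by stationarity and normalization; a minimal numpy sketch with an illustrative two-state chain:

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic Markov chain with transition matrix P.

    Solves mu^T P = mu^T together with the normalization sum(mu) = 1
    as an (overdetermined) linear system via least squares.
    """
    n = P.shape[0]
    # Stack (P^T - I) mu = 0 with the constraint 1^T mu = 1.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

# Symmetric two-state chain: stays with prob. 0.9, switches with prob. 0.1,
# so the stationary distribution is uniform.
P = np.array([[0.9, 0.1], [0.1, 0.9]])
mu = stationary_distribution(P)
```

For a policy, one would first form the state transition matrix induced by that policy and then apply the same computation.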
Assumption A2 (Uniformly fast mixing) Any policy has a unique stationary distribution . Furthermore, there exists a constant such that for any distribution ,
2.2 Action-value estimation
For a fixed policy, we define , the action-value of in , as the expected total differential cost incurred when the process is started from state , the first action is and in the rest of the time steps policy is followed. Here, the differential cost in a time step is the difference between the immediate cost for that time step and the average expected cost under .
Under our assumption, up to addition of a scalar multiple of ,
(viewed as a vector) is the unique solution to the Bellman equation. We will also use that given a function , any Boltzmann policy is invariant to shifting by a constant.
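This invariance is easy to verify numerically: adding the same constant to every action value shifts all Boltzmann logits equally and leaves the resulting distribution unchanged. A minimal sketch (the action-value vector and temperature below are illustrative):

```python
import numpy as np

def boltzmann(q, eta=1.0):
    """Boltzmann (softmax) policy over action values q with inverse temperature eta."""
    z = eta * (q - q.max())          # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = np.array([1.0, 2.0, 0.5])
pi1 = boltzmann(q)
pi2 = boltzmann(q + 100.0)           # shift every action value by the same constant
```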
One common approach to dealing with large state-action spaces is using function approximators to approximate . While our algorithm works with general function approximators, we will provide an estimation approach and analysis for linear approximators, where the approximate solution lies in the subspace spanned by a feature matrix . With linear function approximation, value estimation algorithms solve the projected Bellman equation defined by . Here, is the projection matrix with respect to the stationary distribution of a policy . In on-policy estimation, , while in off-policy estimation is a behavior policy used to collect data. Let be the solution of the above equation, also known as the TD solution. Let be the on-policy estimate computed using data points. For on-policy algorithms, error bounds of the following form are available: . For convergence results of this type, see [Tsitsiklis-VanRoy-1997, Tsitsiklis-VanRoy-1999, antos2008learning, Sutton-Szepesvari-Maei-2009, Maei-Szepesvari-Bhatnagar-Sutton-2010, lazaric2012finite, Geist-Scherrer-2014, farahmand2016regularized, liu2012regularized, liu2015finite, bertsekas1996temporal, yu2009convergence].
Unfortunately, one issue with such weighted-norm bounds is that the error can be very large for state-action pairs that are rarely visited by the policy. To overcome this issue, we aim to bound the error with respect to a weighted norm whose weights span all directions of the feature space. We will assume access to an exploration policy that excites “all directions” of the feature space:
Assumption A3 (Uniformly excited features) There exists a positive real such that for the exploration policy , .
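Assumption A3 requires the feature second-moment matrix under the exploration policy's stationary distribution to be well conditioned, i.e. its smallest eigenvalue is bounded away from zero. In practice one can sanity-check this on sampled features; a hedged sketch with synthetic features standing in for sampled state-action features:

```python
import numpy as np

def min_feature_excitation(Phi):
    """Smallest eigenvalue of the empirical second-moment matrix (1/n) Phi^T Phi.

    Phi is an (n, d) matrix whose rows are feature vectors of sampled
    state-action pairs. Assumption A3 asks the population version of this
    quantity, under the exploration policy's stationary distribution, to be
    at least some sigma > 0.
    """
    n = Phi.shape[0]
    M = Phi.T @ Phi / n
    return float(np.linalg.eigvalsh(M).min())

rng = np.random.default_rng(0)
Phi = rng.normal(size=(1000, 4))     # synthetic features; stands in for phi(s, a)
sigma_hat = min_feature_excitation(Phi)
```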
Given the exploration policy, our goal will be to bound the error in the norm weighted by its stationary distribution . We will describe an algorithm for which we can obtain such a bound, which estimates the value function from on-policy trajectories whose initial states are sampled from . We are not aware of any finite-time error bounds for off-policy methods in the average-cost setting.
2.3 Policy iteration and Politex
Policy iteration algorithms (see e.g. bertsekas2011approximate) alternate between policy evaluation, i.e. executing a policy and estimating its action-value function, and policy improvement based on the estimate. In Politex, the policy produced at each phase is a Boltzmann distribution over the sum of all previous action-value estimates, . Here, is the action-value estimate at phase . Under uniform mixing and for action-value errors that behave as , the regret of Politex scales as . The authors show that the error condition holds for the LSPE algorithm when all policies (not only ) satisfy the feature-excitation condition. Since this is difficult to guarantee in practice, in this work we aim to bound all errors in the same norm, weighted by the stationary state-action distribution of an exploratory policy . Thus, we only require the feature-excitation condition to hold for a single known policy.
[Figure 1: Pseudocode of Politex (initial state , exploration policy , length ) and the CollectData (, , , , ) subroutine.]
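To make the policy-update rule concrete, here is a minimal sketch of a Boltzmann policy over the sum of past action-value estimates. The tabular representation and learning rate are illustrative, not the paper's exact parameterization; the sign convention below is for rewards (for costs, negate the exponent):

```python
import numpy as np

def politex_policy(q_estimates, eta=0.1):
    """Politex policy after several phases.

    q_estimates: list of (num_states, num_actions) arrays, one action-value
    estimate per completed phase. The returned policy is, in each state, a
    Boltzmann distribution over the SUM of all past estimates (in contrast to
    policy iteration, which is greedy w.r.t. the most recent estimate only).
    """
    q_sum = np.sum(q_estimates, axis=0)                       # (S, A)
    logits = eta * (q_sum - q_sum.max(axis=1, keepdims=True))  # stable softmax
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Two phases of estimates on a toy 2-state, 2-action problem.
qs = [np.array([[1.0, 0.0], [0.0, 1.0]]),
      np.array([[2.0, 0.0], [0.0, 2.0]])]
pi = politex_policy(qs)
```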
3 Exploration-enhanced Politex
The proposed algorithm, which we refer to as EE-Politex (exploration-enhanced Politex), is shown in Figure 1. Compared to Politex, the main difference is that we assume access to a fast-mixing exploration policy that spans the state space, and we run that policy in short segments on a fixed schedule. Intuitively, the exploration policy serves the purpose of performing soft resets of the environment to a random state within a single trajectory. The action-value function of each policy is then estimated using on-policy trajectories whose initial states are chosen approximately i.i.d. from the stationary state distribution of the exploration policy. We assume access to an estimation black-box that can learn from such data. In the next section, we show a concrete least-squares Monte Carlo (LSMC) algorithm with an error bound of when run on data tuples, where is the worst-case approximation error. Our value-estimation requirement is weaker than that of politex, since we provide exploratory data to the estimation black-box, rather than requiring each policy produced by Politex to sufficiently explore. However, as we show next, the exploration segments come at the price of a slightly worse regret of compared to .
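The interleaving itself is simple: on a fixed schedule, the agent switches from the current target policy to the exploration policy for a short segment, which acts as a soft reset. A sketch of one phase's schedule (the segment lengths `tau_explore` and `tau_target` are hypothetical placeholders for the paper's parameters):

```python
def phase_schedule(phase_len, tau_explore, tau_target):
    """Return a list of ('explore', n) / ('target', n) segments covering one phase.

    Each mini-segment runs the exploration policy for tau_explore steps (a
    soft reset to roughly stationary exploratory states), then the current
    target policy for tau_target steps, until phase_len steps are spent.
    """
    segments, t = [], 0
    while t < phase_len:
        e = min(tau_explore, phase_len - t)
        segments.append(("explore", e))
        t += e
        if t >= phase_len:
            break
        g = min(tau_target, phase_len - t)
        segments.append(("target", g))
        t += g
    return segments

sched = phase_schedule(phase_len=10, tau_explore=2, tau_target=3)
```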
3.1 Regret of EE-Politex
Consider any learning agent that produces the state-action sequence while interacting with an MDP. For a fixed time step , let denote the policy that is used to generate . Let be the rounds that the exploration policy is played, i.e. , and let be the pseudo-regret in those rounds. Similarly to politex, we decompose regret into pseudo-regret and noise terms:
We first bound the pseudo-regret by a direct application of Theorem 4.1 of politex to rounds .
Fix . Let , , and be such that for any , with probability ,
and for any . Letting , with probability , the regret of Politex relative to the reference policy satisfies
Next, we bound the noise terms and using a modified version of Lemma 4.4 of politex that accounts for additional policy switches due to exploration:
Suppose the assumptions of Section 2.1 hold. If the algorithm has a total of policy restarts, and each policy is executed for at most timesteps, then with probability at least , we have that
The proof is given in Appendix A.
Assume that the action-value error bound is of the form , where is the irreducible approximation error (defined in the next section), and is a constant. Given policy updates and phases of length , the exploration term is bounded as , the pseudo-regret is bounded as , and the terms and are of order . By optimizing the terms, we obtain , , , , and the corresponding regret is of order .
4 Least-squares Monte-Carlo estimation with exploration
Our approach is to directly solve a least-squares problem and find a solution such that . In order to do so, we use the definition of the value of policy ,
Unlike TD methods, this approach does not rely on the Bellman equation.
Let be the uniform distribution over the action space and let be the data-generating distribution (we define to be the distribution on that puts probability mass on the pair ). Let be the best linear value function estimator with respect to the norm . The irreducible approximation error is the largest such that uniformly for all policies. We use to denote a constant such that uniformly for all policies.
Let be a sufficiently large integer (polynomial in the mixing time of policy ). Let be a dataset, where for each , is a state sampled under exploration distribution , is sampled from the uniform distribution , and is a trajectory generated under policy starting at state where is the first state observed after taking action in state . The trajectory has the form , where is the state obtained by starting from state and running policy for rounds, and .
Consider the dataset . Let be the solution of the following least-squares problem
The state-action value estimate is .
We will analyze the above estimation procedure in the next section. Although we treat each trajectory as a single sample, in practice it is often beneficial to use all the data. For example, first-visit Monte-Carlo methods rely on returns from the first time a state-action pair is encountered in a trajectory, while every-visit Monte-Carlo methods average the returns obtained from all occurrences of a state-action pair. We refer to the analyzed approach as one-visit. Also, the choice of estimator is made to simplify the analysis; the estimate might be more appropriate in practice.
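A hedged sketch of the one-visit variant analyzed here: each rollout contributes a single regression target (the differential return from its first state-action pair), and the least-squares problem is solved in closed form. The features and returns below are synthetic placeholders, used only to check that an exactly linear target is recovered:

```python
import numpy as np

def lsmc_one_visit(features, returns):
    """One-visit least-squares Monte-Carlo fit.

    features: (n, d) matrix, row i = phi(s_i, a_i) for the FIRST state-action
              pair of rollout i (one sample per trajectory).
    returns:  (n,) vector of differential returns, i.e. summed costs along the
              rollout minus the rollout length times the estimated average cost.
    Returns the least-squares weights w, with Q_hat(s, a) = phi(s, a)^T w.
    """
    w, *_ = np.linalg.lstsq(features, returns, rcond=None)
    return w

# Synthetic sanity check: if the true action-value function is exactly linear
# in the features, the noiseless fit recovers it.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(200, 3))
w_true = np.array([0.5, -1.0, 2.0])
G = Phi @ w_true                      # noiseless differential returns
w_hat = lsmc_one_visit(Phi, G)
```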
4.1 Value estimation
First, we show that the bias due to the finite rollout length is exponentially small.
For the choice of and for , we have that
The proof is given in Appendix B. Given that these errors are exponentially small, we will ignore them to improve the readability.
Let be the empirical estimate of using data . The next lemma bounds the least-squares Monte-Carlo estimation error.
Fix . Under the assumption that , with probability at least , we have that
4.2 Comparison with existing work
Our value estimation approach is related to off-policy temporal difference methods such as LSTD and LSPE, in the sense that those methods attempt to solve the projected Bellman equation , where the projection matrix is weighted by the distribution of , and the transition matrix corresponds to the target policy. The goal is again to bound the estimation error in the -weighted norm. However, while some analysis of off-policy LSTD exists for discounted MDPs [Yu-2010], we are not aware of any similar results for average-cost problems. In fact, it is known that off-policy LSPE can diverge, due to the matrix not being a contraction in the -weighted norm [bertsekas2011approximate]. Compared to off-policy Monte-Carlo methods, our approach benefits from the fact that returns are estimated using on-policy rollouts, and hence we do not require importance weights.
Our bound has two advantages over the available results for on-policy LSTD and LSPE. First, the available bounds for these methods involve certain undesirable terms that do not appear in our result. For example, politex show an error bound of the form . Here, is the TD fixed point, is the finite-sample LSPE estimate for policy , and is the contraction coefficient of the mapping with respect to the norm , i.e. . Tsitsiklis-VanRoy-1999 show that is smaller than one. However, it can be arbitrarily close to one, which makes such error bounds meaningless. Similar quantities appear in the error bounds for the LSTD algorithm. In contrast, our bounds do not depend on this measure of contractiveness. Second, the TD fixed-point solution is not the same as the best possible estimate in the linear span of the features, and this introduces an additional approximation error. Let . Theorem 2.2 of yu2010error shows that we might lose a multiplicative factor of :
In contrast, we aim to get a better estimate by minimizing the error directly without imposing these constraints.
5.1 DeepSea environment
We first demonstrate the advantages of our exploration scheme on a small-scale environment known as DeepSea (see also Osband-Wen-VanRoy-2016), in which exploration can nonetheless be difficult. The states of this environment comprise an grid, and there are two actions. The environment transitions and costs are deterministic. On action 0 in cell , the environment transitions down and left, to cell . On action 1 in cell , the environment transitions down and right, to cell . The agent starts in state . The reward (negative cost) in the bottom-right state is for any action. For all other states, the reward for action 0 is 0, and the reward for action 1 is -1. Thus, during the first steps (and in the episodic version of this task), the agent can obtain a positive return only if it always takes action 1, even though it is expensive. In the infinite horizon setting, an optimal strategy first gets to the right edge and then takes an equal number of left and right actions, and has an average reward close to . A simple strategy that always takes action 1 obtains an average reward close to . We represent states as length- vectors containing one-hot indicators for each grid coordinate.
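For concreteness, here is a minimal sketch of the deterministic DeepSea dynamics as described above. The boundary handling (rows wrapping around in the infinite-horizon setting, columns clipped to the grid) and the bottom-right reward value are modeling assumptions; the paper's elided quantities are not reproduced:

```python
def deep_sea_step(row, col, action, n, goal_reward=1.0):
    """One step of an n x n DeepSea grid.

    Action 0 moves down-left, action 1 moves down-right. Columns are clipped
    to the grid; rows wrap around (an assumption for this infinite-horizon
    sketch). Action 1 yields reward -1 and action 0 yields 0, except that any
    action taken in the bottom-right cell yields goal_reward (the actual
    value is elided in the text and assumed here).
    """
    new_row = (row + 1) % n
    new_col = max(col - 1, 0) if action == 0 else min(col + 1, n - 1)
    if (row, col) == (n - 1, n - 1):
        reward = goal_reward
    else:
        reward = 0.0 if action == 0 else -1.0
    return (new_row, new_col), reward

# Always taking action 1 from (0, 0) reaches the right edge after n - 1 steps,
# paying the action-1 penalty along the way.
n = 5
state, total = (0, 0), 0.0
for _ in range(n - 1):
    state, r = deep_sea_step(*state, action=1, n=n)
    total += r
```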
We experiment with Politex-LSPE, EE-Politex-LSMC, and Politex-LSMC, i.e. Politex with value estimation using LSMC and no exploration. We also evaluate an online version of RLSVI [Osband-Wen-VanRoy-2016] with linear function approximation, similar to the version described in politex. For exploration, we use a policy that always takes action 1 and runs for steps in each rollout. This policy can help discover the hidden reward but incurs additional costs. For LSMC, we evaluate first-visit, every-visit, and one-visit (i.e. just using the first sample) return estimates computed from length- rollouts. The results are shown in Figure 1 as costs, i.e. negative rewards. On a small grid, all policies achieve the lowest cost. However, as the grid size increases, RLSVI and no-exploration Politex converge to the suboptimal policy which always takes action 0. The performance of one-visit LSMC with exploration also deteriorates for higher , and costs are positive due to exploration segments, suggesting that longer runs (more samples) are required in this case.
5.2 Sparse Cartpole with function approximation
In the next experiment, we examine whether the promising theoretical results presented in this paper lead to a practical algorithm when combined with neural networks. We take the classic Cartpole (a.k.a. inverted pendulum) problem [tassa2018deepmind] (Fig. 2, right), where the objective is to balance a pole attached to a cart by only moving the cart left and right along the axis. The observation is a tuple consisting of the coordinate of the cart and its velocity, the cosine and sine of the angle of the pole relative to the upright position, and the rate of change of this angle. There are three actions, corresponding to applying force to the cart towards the left or the right, or applying no force. Each episode begins with the pole hanging downwards and ends after 1000 timesteps. There is a small penalty (a reward of -0.1) for any movement of the pole. A reward of 1 is collected whenever the pole is almost upright and the cart is centered (with a controllable threshold). This is a difficult exploration problem, as the rewards are sparse; in particular, no reward is observed at intermediate states, which the agent nevertheless has to explore in order to solve the problem. We compare EE-Politex to Politex, where for EE-Politex we use a separate exploration policy trained with a separate reward function that encourages the exploration of states where the pole is swung up. We approximate state-action value functions using neural networks. We execute policies that are based only on the most recent neural networks, where is a parameter to be chosen. Further implementation details are given in Section C.1. Rather than evaluating learned policies after training, in line with the setting of this paper, we plot the total reward obtained by the agent against the total number of time steps. From Fig. 2, we can see that without the exploration policy, Politex never finds the solution to the problem, and learns not to move the cart at all.
On the other hand, EE-Politex takes advantage of the exploration policy and manages to learn to collect rewards. We also ran experiments on Ms Pacman (Atari), where, in contrast, we found that mixing in an exploration policy did not help (details in Section C.2).
We have proposed an exploration strategy that utilizes a fast-mixing exploratory policy. This strategy can be used with an action-value estimation algorithm that learns from on-policy trajectories whose initial states are chosen from the stationary state-distribution of the exploration policy. One such algorithm is least-squares Monte Carlo, for which we have provided an analysis of the estimation error. Integrating our exploration scheme into the Politex algorithm results in a new algorithm that enjoys sublinear regret guarantees under much weaker assumptions than previous work. The exploration scheme was demonstrated to improve empirical performance in difficult environments. While much work has been devoted to learning exploration policies, an interesting open problem for future work is learning non-trivial exploration policies which span the state-action feature space (and do not depend on returns).
Appendix A Proof of Lemma 3.2
We prove that if exploration-enhanced Politex has a total of policy restarts, and each policy is executed for at most timesteps, then with probability at least , we have that
For , the proof is the same as in Lemma 4.4 of politex, and we provide only a sketch. We decompose the term as follows:
where denotes the stationary distribution of the policy , is the state-action distribution after time steps, and we used that . We bound in the equation above using the uniform mixing assumption:
where we have assumed that . The second term can be written as , where , and is a binary indicator vector with a non-zero at the linear index of the state-action pair . The bound follows from the fact that is a vector-valued martingale with a bounded difference sequence and the Azuma inequality.
The bound on follows similarly by noticing that Politex makes policy switches of length . Decomposing analogously to , we have that . For , we have that with probability at least , each length- segment is bounded by . Within each of iterations, we have such identically-distributed bounded segments corresponding to the same policy. A similar observation applies to length- segments corresponding to the exploration policy. Thus, using a union bound and Hoeffding’s inequality, we have
Appendix B Proofs of Section 4.1
First, we prove that for the choice of and for ,
Proof of Lemma 4.1.
Let be a binary indicator vector corresponding to a state . By Eq. 5,
By the uniformly fast mixing assumption,
where the last inequality follows by the choice of and for . ∎
Proof of Lemma 4.2.
We have that and . Thus,
Using the assumption that for the exploration policy,
For the second term, consider that for any vector ,
and is simply its estimate using i.i.d. samples. So the second term can be bounded as . For the first term, let , and notice that only
elements of this vector are non-zero and these elements are random variables with zero expectation. So for any deterministic vector, by Hoeffding’s inequality and with probability at least ,
Appendix C Neural network experiment setup
c.1 Cartpole experiment
Our implementation of the Cartpole experiment is based on horgan2018distributed, which is a distributed implementation of DQN mnih2015human, featuring Dueling networks [DBLP:journals/corr/WangFL15], N-step returns [DBLP:journals/corr/AsisHHS17], Prioritized replay [schaul2015prioritized], and Double Q-learning [DBLP:journals/corr/HasseltGS15]. To adapt this to Politex, we used TD-learning and Boltzmann exploration with the learning rate set according to SOLO FTRL by [OrDa15]: For a given state ,
where is a tuneable constant multiplier (chosen based on preliminary experiments); is the number of actions in the game and
where are the state-action values for all past -networks indexed from to the current timestep , is a vector of all ones, and the minimisation over achieves robustness against the changing ranges of state-action values. The minimisation is a one-dimensional convex optimisation problem, which we solve numerically.
Both methods used the same neural network architecture: a dueling MLP Q-network with one hidden layer of size 512. Each learner step uniformly samples a batch of 128 experiences from the experience replay memory. The actors take 16 game steps in total for each learner step. We terminate the current phase and enter a new one when the number of learner steps taken in the current phase reaches 100 times the square root of the total number of learner steps taken. When a new phase is started, the freshly learned neural network is copied into a circular buffer of size 10, which is used by the actors to calculate the averaged Q-values, weighted by the length of each phase. For EE-Politex, we split each phase into “mini-phases” whose length scales with the square root of the current phase length. Each mini-phase consists of following the exploration policy for 1000 steps, and then following the learned policy for a number of steps that scales with the square root of the current phase length. This corresponds to the settings , , , .
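The phase-termination rule described above (end a phase when the learner steps taken within it reach 100 times the square root of the total learner steps) can be sketched as a simple check. The constant multiplier comes from the text; everything else is schematic, and a smaller multiplier is used in the example for brevity:

```python
import math

def phase_lengths(total_steps, c=100.0):
    """Simulate the phase schedule: a phase ends when the number of learner
    steps taken within it reaches c * sqrt(total learner steps so far).
    Returns the list of completed phase lengths."""
    lengths, total, in_phase = [], 0, 0
    while total < total_steps:
        total += 1
        in_phase += 1
        if in_phase >= c * math.sqrt(total):
            lengths.append(in_phase)
            in_phase = 0
    return lengths

# With c = 10, the first phase ends once t >= 10 * sqrt(t), i.e. at t = 100,
# and subsequent phases grow as the total step count grows.
lens = phase_lengths(10_000, c=10.0)
```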
c.2 Ms Pacman (Atari) experiment
We also compared Politex with EE-Politex on the Atari game Ms Pacman. Here, the exploration policy mixed into EE-Politex was trained such that the agent had to drive Ms Pacman to random positions on the map. This policy collects significantly fewer rewards than one trained to maximise rewards, but explores the map well. We found that this exploration did not help EE-Politex; in particular, it seems that exploration is relatively simple in this game, and enough exploration is performed even without mixing in the exploration policy. The results are shown in Fig. 3; the exploration only hurts EE-Politex, as the agent collects less reward during exploration episodes.