Sequential decision making tasks are most commonly formulated as Markov Decision Problems (MDPs) Puterman:1994. An MDP models a world with state transitions that depend on the action an agent may choose. Transitions also yield rewards. Every MDP is guaranteed to have an optimal policy: a state-to-action mapping that maximises expected long-term reward Bellman:1957. Yet, on a given task, it might not be necessary to sense state at each time step in order to optimise performance. For example, even if the hardware allows a car to sense state and select actions every millisecond, it might suffice on typical roads to do so once every ten milliseconds. The resulting slowdown in reaction might have a negligible effect on performance, and be justified by the substantial savings in sensing and computation.
Recent empirical studies bring to light a less obvious benefit from reducing the frequency of sensing: sheer improvements in performance when behaviour is learned Braylan+HMM:2015; durugkar2016deep; Lakshminarayanan+SR:2017. On the popular Atari console games benchmark Bellemare+NVB:2014 for reinforcement learning (RL), reduced sensing takes the form of “frame-skipping”, since agents in this domain sense image frames and respond with actions. In the original implementation, sensing is limited to every 4th frame, with the intent of lightening the computational load mnih2015human. However, subsequent research has shown that higher performance levels can be reached by skipping up to 180 frames in some games Braylan+HMM:2015.
We continue to use the term “frame-skipping” generically across all sequential decision making tasks, denoting by parameter $d$ the number of time steps between sensing steps (so $d = 1$ means no frame-skipping). When using $d > 1$, observe that it is necessary to specify an entire sequence of actions, to execute in an open-loop fashion, in between sensed frames. The most common strategy for so doing is “action-repetition”, whereby the same atomic action is repeated $d$ times. Action-repetition has been the default strategy for implementing frame-skipping on the Atari console games, both when $d$ is treated as a hyperparameter mnih2015human; Braylan+HMM:2015 and when it is adapted on-line, during the agent’s lifetime durugkar2016deep; Lakshminarayanan+SR:2017; Sharma+LR:2017.
In this paper, we analyse the role of frame-skipping and action-repetition in RL, examining in short why they work. We begin by surveying topics in sequential decision making that share connections with frame-skipping and action-repetition (Section 2). Thereafter we provide formal problem definitions in Section 3. In Section 4, we take up the problem of prediction: estimating the value function of a fixed policy. We show that prediction with frame-skipping continues to give consistent estimates when used with linear function approximation. Additionally, $d$ serves as a handle to simultaneously tune the amount of bootstrapping and the task horizon. In Section 5, we investigate the control setting, wherein behaviour is adapted based on experience. First we define a task-specific quantity called the “price of inertia”, in terms of which we bound the loss incurred by action-repetition. Thereafter we show that frame-skipping might still be beneficial in aggregate because it reduces the effective task horizon. In Section 6, we augment our analysis with empirical findings on different tasks and learning algorithms. Among our results is a successful demonstration of learning defensive play in soccer, a hitherto less-explored side of the game ALA16-hausknecht. We conclude with a summary in Section 7.
2. Literature Survey
Frame-skipping may be viewed as an instance of (partial) open-loop control, under which a predetermined sequence of (possibly different) actions is executed without heed to intermediate states. Aiming to minimise sensing, Hansen et al. hansen1997reinforcement propose a framework for incorporating variable-length open-loop action sequences in regular (closed-loop) control. The primary challenge in general open-loop control is that the number of action sequences of some given length $d$ is exponential in $d$. Consequently, the main focus in the area is on strategies to prune corresponding data structures Tan:1991; McCallum:1996-thesis; hansen1997reinforcement. Since action repetition restricts itself to a set of actions with size linear in $d$, it allows $d$ itself to be set much higher in practice Braylan+HMM:2015.
To the best of our knowledge, the earliest treatment of action-repetition in the form we consider here is by Buckland and Lawrence buckland1994transition. While designing agents to negotiate a race track, these authors note that successful controllers need only change actions at “transition points” such as curves, while repeating the same action for long stretches. They propose an algorithmic framework that only keeps transition points in memory, thereby achieving savings. In spite of limitations such as the assumption of a discrete state space, their work provides conceptual guidance for later work. For example, Buckland (Buckland:1994-thesis, see Section 3.1.4) informally discusses “inertia” as a property determining a task’s suitability to action repetition—and we formalise the very same concept in Section 5.
Investigations into the effect of action repetition on learning begin with the work of McGovern et al. McGovern+SF:1997, who identify two qualitative benefits: improved exploration (also affirmed by Randløv randlov1999learning) and the shorter resulting task horizon. While these inferences are drawn from experiments on small, discrete tasks, they find support in a recent line of experiments on the Atari environment, in which neural networks are used as the underlying representation durugkar2016deep. In the original implementation of the DQN algorithm on Atari console games, actions are repeated 4 times, mainly to reduce computational load mnih2015human. However, subsequent research has shown that higher performance levels can be reached by persisting actions for longer, up to 180 frames in some games Braylan+HMM:2015. More recently, Lakshminarayanan et al. Lakshminarayanan+SR:2017 propose a policy gradient approach to optimise $d$ (fixed to be either 4 or 20) on-line. Their work sets up the FiGAR (Fine Grained Action Repetition) algorithm Sharma+LR:2017, which optimises over a wider range of $d$ and achieves significant improvements in many Atari games. It is all this empirical evidence in favour of action repetition that motivates our quest for a theoretical understanding.
The idea that “similar” states will have a common optimal action also forms the basis for state aggregation, a process under which such states are grouped together Li+WL:2006. In practice, state aggregation usually requires domain knowledge, which restricts it to low-dimensional settings silva2009compulsory; dazeleycoarse. Generic, theoretically-grounded state-aggregation methods are even less practicable, and often validated on tasks with only a handful of states Abel+ALL:2018. In contrast, action-repetition applies the principle that states that occur close in time are likely to be similar. Naturally this rule of thumb is an approximation, but one that remains applicable in higher-dimensional settings (the HFO defense task in Section 6 has 16 features).
Even on tasks that favour action-repetition, it could be beneficial to explicitly reason about the “intention” of an action, such as to reach a particular state. This type of temporal abstraction is formalised as an option sutton1999between, which is a closed-loop policy with initial and terminating constraints. As numerous experiments show, action-repetition performs well on a variety of tasks in spite of being open-loop in between decisions. We expect options to be more effective when the task at hand requires explicit skill discovery Konidaris:2016.
In this paper, we take the frame-skip parameter $d$ as discrete, fixed, and known to the agent. Thus frame-skipping differs from Semi-Markov Decision Problems Bradtke+Duff:1995, in which the duration of actions can be continuous, random, and unknown. In the specific context of temporal difference learning, frame-skipping may be interpreted both as a technique to control bootstrapping (Sutton+Barto:2018, see Section 6.2) and as one to reduce the task horizon petrik2009biasing.
3. Problem Definition
We begin with background on MDPs, and thereafter formalise the prediction and control problems with frame-skipping.
3.1. Background: MDPs
A Markov Decision Problem (MDP) $M$ comprises a set of states $S$ and a set of actions $A$. Taking action $a \in A$ from state $s \in S$ yields a numeric reward with expected value $R(s, a)$, which is bounded in $[-R_{\max}, R_{\max}]$ for some $R_{\max} > 0$. $R$ is the reward function of $M$. The transition function $T$ specifies a probability distribution over $S$: for each $s, s' \in S$ and $a \in A$, $T(s, a, s')$ is the probability of reaching $s'$ by taking action $a$ from $s$. An agent is assumed to interact with $M$ over time, starting at some state. At each time step the agent must decide which action to take. The action yields a next state drawn stochastically according to $T$ and a reward according to $R$, resulting in a state-action-reward sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, \dots$. The natural objective of the agent is to maximise some notion of expected long term reward, which we take here to be $\mathbb{E}[r_0 + \gamma r_1 + \gamma^2 r_2 + \dots]$, where $\gamma \in [0, 1]$ is a discount factor. We assume $\gamma < 1$ unless the task encoded by $M$ is episodic: that is, all policies eventually reach a terminal state with probability $1$.
A policy $\pi: S \times A \to [0, 1]$ specifies for each $s \in S$ a probability $\pi(s, a)$ of taking action $a \in A$ (hence $\sum_{a \in A} \pi(s, a) = 1$). If an agent takes actions according to such a policy (by our definition, $\pi$ is Markovian and stationary), the expected long-term reward accrued starting at state $s \in S$ is denoted $V^\pi(s)$; $V^\pi$ is the value function of $\pi$. Let $\Pi$ be the set of all policies. It is a well-known result that for every MDP $M$, there is an optimal policy $\pi^\star \in \Pi$ such that for all $s \in S$ and $\pi \in \Pi$, $V^{\pi^\star}(s) \geq V^\pi(s)$ Bellman:1957 (indeed there is always a deterministic policy that satisfies optimality).
In the reinforcement learning (RL) setting, an agent interacts with an MDP by sensing state and receiving rewards, in turn specifying actions to influence its future course. In the prediction setting, the agent follows a fixed policy , and is asked to estimate the value function . Hence, for prediction, it suffices to view the agent as interacting with a Markov Reward Process (MRP) (an MDP with decisions fixed by ). In the control setting, the agent is tasked with improving its performance over time based on the feedback received. On finite MDPs, exact prediction and optimal control can both be achieved in the limit of infinite experience Watkins+Dayan:1992; rummery1994line.
In this paper, we consider generalisations of both prediction and control in which a frame-skip parameter $d \geq 1$ is provided as input in addition to MDP $M$. With frame-skipping, the agent is only allowed to sense every $d$-th state: that is, if the agent has sensed state $s_t$ at time step $t \geq 0$, it is oblivious to states $s_{t+1}, s_{t+2}, \dots, s_{t+d-1}$, and next only observes $s_{t+d}$. We assume, however, that the discounted sum of the rewards accrued in between (or the $d$-step return) is available to the agent at time step $t + d$. Indeed in many applications (see, for example, Section 6), this return $G_t^{t+d}$, defined below, can be obtained without explicit sensing of intermediate states.
$$G_t^{t+d} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots + \gamma^{d-1} r_{t+d-1}.$$
In the problems we formalise below, taking $d = 1$ gives the versions with no frame-skipping.
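Prediction problem. In the case of prediction, we assume that a fixed policy $\pi$ is independently being executed on $M$: that is, for $t \geq 0$, $a_t \sim \pi(s_t, \cdot)$. However, since the agent’s sensing is limited to every $d$-th transition, its interaction with the resulting MRP becomes a sequence of the form $s_0, G_0^{d}, s_d, G_d^{2d}, s_{2d}, \dots$, in which each sensed state is followed by the $d$-step return accrued from it. The agent must estimate $V^\pi$ based on this sequence.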
Control problem. In the control setting, the agent is itself in charge of action selection. However, due to the constraint on sensing, the agent cannot select actions based on state at all time steps. Rather, at each time step $t$ that state is sensed, the agent can specify a $d$-length action sequence $b \in A^d$, which will be executed open-loop for $d$ steps (until the next sensing step $t + d$). Hence, the agent-environment interaction takes the form $s_0, b_0, G_0^{d}, s_d, b_d, G_d^{2d}, s_{2d}, \dots$, where for $i \geq 0$, $b_{id}$ is a $d$-length action sequence and $G_{id}^{(i+1)d}$ is the $d$-step return that follows it. The agent’s aim is still to maximise its long-term reward, but observe that for $d > 1$, it might not be possible to match $\pi^\star$, which is fully closed-loop.
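As an illustration of the interaction protocol described above, the following minimal Python sketch wraps an environment so that each decision repeats a single action for $d$ steps and hands back only the last state together with the discounted $d$-step return. The `ToyChain` environment, the `ActionRepeat` class, and the `reset`/`step` interface are hypothetical names introduced here for illustration; they are assumptions, not part of the paper.

```python
class ToyChain:
    """Hypothetical 1-D chain: reward -1 per step until position >= goal."""

    def __init__(self, goal=6):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right)
        self.pos += action
        done = self.pos >= self.goal
        return self.pos, -1.0, done


class ActionRepeat:
    """Repeat each chosen action d times; expose only every d-th state."""

    def __init__(self, env, d, gamma):
        self.env, self.d, self.gamma = env, d, gamma

    def reset(self):
        return self.env.reset()

    def step(self, action):
        g, discount = 0.0, 1.0
        for _ in range(self.d):
            state, reward, done = self.env.step(action)
            g += discount * reward      # accumulate the discounted d-step return
            discount *= self.gamma
            if done:                    # stop early if the episode terminates
                break
        return state, g, done


env = ActionRepeat(ToyChain(goal=6), d=3, gamma=0.5)
state = env.reset()
state, g, done = env.step(+1)
print(state, g, done)
```

Intermediate states are never returned to the caller, which mirrors the assumption that only the return, and not the states in between, is available at the next sensing step.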
In the next section, we analyse the prediction setting with frame-skipping; in Section 5 we consider the control setting.
4. Prediction with Frame-skipping
In this section, we drop the reference to MDP $M$ and policy $\pi$, only assuming that together they fix an MRP $P$ with reward function $R$, transition function $T$, and discount factor $\gamma$. For $s, s' \in S$, $R(s)$ is the reward obtained on exiting $s$ and $T(s, s')$ the probability of reaching $s'$. For the convergence of any learning algorithm to the value function $V$ of $P$, it is necessary that $P$ be irreducible, ensuring that each state will be visited infinitely often in the limit. If using frame-skip $d > 1$, we must also assume that $P$ is aperiodic; otherwise some state might only be visited in between sensing steps, thus precluding convergence to its value. We proceed with the assumption that $P$ is irreducible and aperiodic (in other words, ergodic). Let $\mu$, subject to $\sum_{s \in S} \mu(s) = 1$, be the stationary distribution on $S$ induced by $P$.
4.1. Consistency of Frame-skipping
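If using frame-skipping with parameter $d$, it is immediate that the agent’s interaction may be viewed as a regular one (with no frame-skipping) with induced MRP $P_d$, which has reward function $R_d$, transition function $T_d$, and discount factor $\gamma^d$, and in which, if we treat reward functions as $|S|$-length vectors and transition functions as $|S| \times |S|$ matrices,
$$R_d = R + \gamma T R + \gamma^2 T^2 R + \dots + \gamma^{d-1} T^{d-1} R, \qquad T_d = T^d.$$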
Since $P$ is ergodic, it follows that $P_d$ is ergodic. Thus, any standard prediction algorithm (for example, TD($\lambda$) (Sutton+Barto:2018, see Chapter 12)) can be applied on $P$ with frame-skip $d$ (equivalent to being applied on $P_d$ with no frame-skip) to converge to its value function $V_d$. It is easy to see that $V_d = V$. Surprisingly, it also emerges that the stationary distribution on $S$ induced by $P_d$ (denote it $\mu_d$, where $\sum_{s \in S} \mu_d(s) = 1$) is identical to $\mu$, the stationary distribution induced by $P$. The following proposition formally establishes the consistency of frame-skipping.
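Proposition 1. For $d \geq 1$, $V_d = V$ and $\mu_d = \mu$.
Proof. For the first part, we have that for $s \in S$,
$$V_d(s) = \mathbb{E}\left[\sum_{k=0}^{\infty} (\gamma^d)^k \sum_{i=0}^{d-1} \gamma^i r_{kd+i} \,\middle|\, s_0 = s\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right] = V(s).$$
For the second part, observe that since $\mu$ is the stationary distribution induced by $P$, it satisfies $\mu T = \mu$ (treating $\mu$ as a row vector). With frame-skip $d$, where $T_d = T^d$ is the $d$-step transition matrix, we have $\mu T_d = \mu T^d = (\mu T) T^{d-1} = \mu T^{d-1} = \dots = \mu$, establishing that $\mu_d = \mu$ (its uniqueness following from the ergodicity of $P_d$). ∎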
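The consistency claim can be checked numerically on a small example. The sketch below is an illustration only: the three-state MRP is made up, the helper names (`induced_mrp`, `value_function`, `stationary`) are hypothetical, and the induced process is assumed to have $d$-step expected rewards, the $d$-th power of the transition matrix, and discount $\gamma^d$. It uses plain Python lists so that no external library is needed.

```python
def mat_vec(T, v):
    """Matrix-vector product T v."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in T]


def mat_mat(A, B):
    """Matrix-matrix product A B for square matrices."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def induced_mrp(R, T, gamma, d):
    """Return (R_d, T_d, gamma^d): d-step rewards, d-step transitions, discount."""
    n = len(R)
    R_d = [0.0] * n
    term = list(R)  # holds gamma^i T^i R at iteration i
    T_d = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(d):
        R_d = [x + y for x, y in zip(R_d, term)]
        term = [gamma * x for x in mat_vec(T, term)]
        T_d = mat_mat(T_d, T)
    return R_d, T_d, gamma ** d


def value_function(R, T, gamma, sweeps=2000):
    """Iterate the Bellman backup V <- R + gamma T V until (numerically) converged."""
    V = [0.0] * len(R)
    for _ in range(sweeps):
        V = [r + gamma * tv for r, tv in zip(R, mat_vec(T, V))]
    return V


def stationary(T, sweeps=2000):
    """Power iteration mu <- mu T, starting from the uniform distribution."""
    n = len(T)
    mu = [1.0 / n] * n
    for _ in range(sweeps):
        mu = [sum(mu[i] * T[i][j] for i in range(n)) for j in range(n)]
    return mu


# A made-up ergodic MRP with three states.
R = [1.0, 0.0, -2.0]
T = [[0.5, 0.5, 0.0],
     [0.1, 0.6, 0.3],
     [0.4, 0.0, 0.6]]
gamma, d = 0.9, 3

R_d, T_d, gamma_d = induced_mrp(R, T, gamma, d)
gap_V = max(abs(a - b) for a, b in
            zip(value_function(R, T, gamma), value_function(R_d, T_d, gamma_d)))
gap_mu = max(abs(a - b) for a, b in zip(stationary(T), stationary(T_d)))
print("max |V - V_d| =", gap_V)
print("max |mu - mu_d| =", gap_mu)
```

Both gaps should be at the level of floating-point round-off, in line with the proposition.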
Preserving the stationary distribution is especially relevant for prediction with approximate architectures, as we see next.
4.2. Frame-skipping with a Linear Architecture
As a concrete illustration, we consider the effect of frame-skip in Linear TD() (Sutton+Barto:2018, see Chapter 12), the well-known family of on-line prediction algorithms. We denote our generalised version of the algorithm TD(), where is the given frame-skip parameter and controls bootstrapping. With a linear architecture, is approximated by , where for , is a -length vector of features. The -length coefficient vector is updated based on experience, keeping a -length eligibility trace vector for backing up rewards to previously-visited states. Starting with and arbitrary , an update is made as follows for each , based on the tuple :
where is the learning rate. Observe that with full bootstrapping (), each update by TD() is identical to a multi-step (here -step) backup (Sutton+Barto:2018, see Chapter 7) on . The primary difference, however, is that regular multi-step (and -) backups are performed at every time step. By contrast, TD makes an update only once every steps, hence reducing sensing (as well as the computational cost of updating ) by a factor of .
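For concreteness, here is a minimal sketch of one such update, written as ordinary accumulating-trace linear TD($\lambda$) applied to the frame-skipped process, with the $d$-step return playing the role of the reward and $\gamma^d$ the role of the discount. This form and the ordering of the trace and weight updates follow the textbook presentation and are assumptions; the paper's exact update rule may be arranged differently. The function and argument names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def td_d_lambda_update(theta, trace, phi_s, phi_next, G, gamma, d, lam, alpha):
    """One linear TD update from the tuple (s_t, G, s_{t+d}).

    theta    -- current coefficient vector
    trace    -- current eligibility trace vector
    phi_s    -- features of the sensed state s_t
    phi_next -- features of the next sensed state s_{t+d}
    G        -- discounted d-step return accrued between the two sensing steps
    """
    gamma_d = gamma ** d
    # Decay the trace by gamma^d * lambda and add the current features.
    trace = [gamma_d * lam * e + x for e, x in zip(trace, phi_s)]
    # d-step TD error: bootstraps from the estimate at s_{t+d}.
    delta = G + gamma_d * dot(theta, phi_next) - dot(theta, phi_s)
    theta = [w + alpha * delta * e for w, e in zip(theta, trace)]
    return theta, trace


theta, trace = [0.0], [0.0]
theta, trace = td_d_lambda_update(theta, trace, phi_s=[1.0], phi_next=[1.0],
                                  G=2.0, gamma=0.5, d=2, lam=0.0, alpha=0.1)
print(theta, trace)
```

Only one call is made per sensing step, so the number of weight updates per unit of environment time shrinks by a factor of $d$.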
[Figure caption fragment: Results are averages from 1000 random runs; standard errors are negligible.]
With linear function approximation, the best result one can hope to achieve is convergence to
It is also well-known that linear converges to some such that Tsitsiklis+VanRoy:1997. Note that TD() on is the same as TD() on . Hence, from Proposition 1, we conclude that TD() on converges to some such that . The significance of this result is that the rate of sensing can be made arbitrarily small (by increasing ), and yet convergence to achieved (by taking ). The result might appear intriguing, since for fixed , a tighter bound is obtained by increasing (making fewer updates). Nonetheless, note that the bound is on the convergent limit; the best results after any finite number of steps are likely to be obtained for some finite value of
$d$. The bias-variance analysis of multi-step returns Kearns+Singh:2000 applies as is to $d$: small values imply more bootstrapping and bias, large values imply higher variance.
To demonstrate the effect of in practice, we construct a relatively simple MRP—described in Figure (a)a—in which linear TD() has to learn only a single parameter . Figure (c)c shows the prediction errors after 100,000 steps (thus learning updates). When and are fixed, observe that the error for smaller values of is minimised at , suggesting that can be a useful parameter to tune in practice. However, the lowest errors can always be obtained by taking sufficiently close to and suitably lowering , with no need to tune . We obtain similar results by generalising “True online TD()” VanSeijen+MPMS:2016; its near-identical plot is omitted.
5. Control with Action-repetition
In this section, we analyse frame-skipping in the control setting, wherein the agent is in charge of action selection. If sensing is restricted to every $d$-th step, recall from Section 3 that the agent must choose a $d$-length sequence of actions at every sensing step. The most common approach Braylan+HMM:2015; durugkar2016deep is to perform action-repetition: that is, to restrict this choice to sequences of the same action. This way the agent continues to have $|A|$ action sequences to consider (rather than $|A|^d$). It is also possible to consider $d$ as a parameter for the agent to itself learn, possibly as a function of state lakshminarayanan2016dynamic; Lakshminarayanan+SR:2017. We report some results from this approach in Section 6, but proceed with our analysis by taking $d$ to be a fixed input parameter. Thus, the agent must pick an action sequence $a^d$: some action $a \in A$ repeated $d$ times.
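It is not hard to see that interacting with input MDP $M$ by repeating actions $d$ times is equivalent to interacting with an induced MDP $M_d$ without action-repetition hansen1997reinforcement. Here $M_d$ shares the states and actions of $M$, and has reward function $R_d$, transition function $T_d$, and discount factor $\gamma^d$. For $a \in A$, (1) let $R_a$ denote $R(\cdot, a)$ as an $|S|$-length vector, thus $R_a(s) = R(s, a)$ for $s \in S$, and (2) let $T_a$ denote $T(\cdot, a, \cdot)$ as an $|S| \times |S|$ matrix, thus $T_a(s, s') = T(s, a, s')$ for $s, s' \in S$. Then $R_d(s, a) = (R_d)_a(s)$ and $T_d(s, a, s') = (T_d)_a(s, s')$, where
$$(R_d)_a = R_a + \gamma T_a R_a + \dots + \gamma^{d-1} (T_a)^{d-1} R_a, \qquad (T_d)_a = (T_a)^d.$$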
5.1. Price of inertia
The risk of using $d > 1$ in the control setting is that in some tasks, a single unwarranted repetition of action could be catastrophic. On the other hand, in tasks with gradual changes of state, the agent must be able to recover. To quantify the amenability of task $M$ to action repetition, we define a term called its “price of inertia”, denoted $\Delta_M$. For $s \in S$, $a \in A$, and $j \geq 1$, let $Q^\star_M(s, a^j)$ denote the expected long-term reward of repeating action $a$ from state $s$ for $j$ time steps, and thereafter acting optimally on $M$. The price of inertia quantifies the cost of a single repetition:
$$\Delta_M = \max_{s \in S,\, a \in A} \left( Q^\star_M(s, a) - Q^\star_M(s, a^2) \right).$$
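To make the definition tangible, the sketch below computes the price of inertia on a made-up two-state MDP. It reads the prose definition as the largest gap, over state-action pairs, between taking an action once and then acting optimally, and taking it twice and then acting optimally; that reading, the helper names, and the example MDP are all assumptions made for illustration.

```python
def optimal_q(R, T, gamma, sweeps=1000):
    """Value iteration. R[s][a] is the expected reward, T[s][a][t] the transition probability."""
    n, m = len(R), len(R[0])

    def backup(V, s, a):
        return R[s][a] + gamma * sum(T[s][a][t] * V[t] for t in range(n))

    V = [0.0] * n
    for _ in range(sweeps):
        V = [max(backup(V, s, a) for a in range(m)) for s in range(n)]
    return [[backup(V, s, a) for a in range(m)] for s in range(n)]


def price_of_inertia(R, T, gamma):
    """Largest loss from repeating an action once more before acting optimally."""
    n, m = len(R), len(R[0])
    Q = optimal_q(R, T, gamma)
    worst = 0.0
    for s in range(n):
        for a in range(m):
            # Value of taking a twice from s, then acting optimally:
            # the second step is scored by Q*(next state, a).
            q_twice = R[s][a] + gamma * sum(T[s][a][t] * Q[t][a] for t in range(n))
            worst = max(worst, Q[s][a] - q_twice)
    return worst


# Two states; action 0 = stay, action 1 = move to the other state.
# Staying in state 1 pays reward 1; everything else pays 0.
R = [[0.0, 0.0],
     [1.0, 0.0]]
T = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]

delta = price_of_inertia(R, T, gamma=0.5)
print("price of inertia:", delta)
```

In this example the costliest repetition is moving twice from state 0, which returns the agent to where it started instead of letting it settle in the rewarding state.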
The price of inertia $\Delta_M$ is a local, horizon-independent property, which we expect to be small in many families of tasks. As a concrete illustration, consider the family of deterministic MDPs that have “reversible” actions. A calculation in Appendix A (appendices are provided in the supplementary material) shows that for any such MDP $M$, $\Delta_M$ admits a horizon-independent upper bound.
To further aid our understanding of the price of inertia , we devise a “pitted grid world” task, shown in Figure (a)a. This task allows for us to control and examine its effect on performance as is varied. The task is a typical grid world with cells, walls, and a goal state to be reached. Episodes begin from a start state chosen uniformly at random from a designated set. The agent can select “up”, “down”, “left”, and “right” as actions. A selected action is retained with probability 0.85, otherwise switched to one picked uniformly at random, and thereafter implemented to have the corresponding effect. There is a reward of at each time step, except when reaching special “pit” states, which incur a larger penalty. It is precisely by controlling the pit penalty that we control . The task is undiscounted. Figure (a)a shows optimal policies for and (that is, on ); observe that they differ especially in the vicinity of pits (which are harder to avoid with ).
5.2. Value deficit of action-repetition
Naturally, the constraint of having to repeat actions times may limit the maximum possible long-term value attainable. We upper-bound the resulting deficit as a function of and . For MDP , note that is the optimal value function.222In our forthcoming analysis, we treat value and action value functions as vectors, with denoting the max norm.
For , .
For and , define the terms and First we prove
for . The result is trivial for . Assuming it is true for , we get
In effect, (1) bounds the loss from persisting action for steps, which we incorporate in the long-term loss from action-repetition. To do so, we consider a policy that takes the same atomic actions as , but persists them for steps. In other words, for , . For , let denote the expected long-term reward accrued from state by taking the first decisions based on (that is, applying for time steps), and then acting optimally (with no action-repetition, according to ). We prove by induction, for :
For base case, we apply (1) and get
Assuming the result true for , and again using (1), we establish it for .
Observe that : the value of when is executed in . The result follows by using , and substituting for and . ∎
For , .
For every , , and , there exists an MDP with and discount factor such that .
The matching lower bound in Proposition 4 arises from a carefully-designed MDP; in practice we expect to encounter tasks for which the upper bound on is loose. Although our analysis is for infinite discounted reward, we expect to play a similar role on undiscounted episodic tasks such as the pitted grid world. Figure (b)b shows computed values of the performance drop from action-repetition, which monotonically increases with for every value. Even so, the analysis to follow shows that using might yet be the most effective if behaviour is learned.
5.3. Analysis of control with action-repetition
We now proceed to our main result: that the deficit induced by can be offset by the benefit it brings in the form of a shorter task horizon. Since standard control algorithms such as Q-learning and Sarsa may not even converge with function approximation, we sidestep the actual process used to update weights. All we assume is that (1) the learning process produces as its output , an approximate action value function, and (2) as is the common practice, the recommended policy is greedy with respect to : that is, for , . We show that on an MDP for which is small, it could in aggregate be beneficial to execute with frame-skip ; for clarity let us denote the resulting policy . The result holds regardless of whether was itself learned with or without frame-skipping, although in practice, we invariably find it more effective to use the same frame-skip parameter for both learning and evaluation.
Singh and Yee singh1994upper provide a collection of upper bounds on the performance loss from acting greedily with respect to an approximate value function or action value function. The lemma below is not explicitly derived in their analysis; we furnish an independent proof in Appendix D.
For MDP , let be an -approximation of . In other words, . Let be greedy with respect to . We have:
The implication of the lemma is that the performance loss due to a prediction error $\epsilon$ scales as $\frac{1}{1 - \gamma}$. Informally, $\frac{1}{1 - \gamma}$ may be viewed as the effective task horizon. Now observe that if a policy is implemented with frame-skip $d$, the loss only scales as $\frac{1}{1 - \gamma^d}$, which can be substantially smaller. However, the performance loss defined in Lemma 5 is with respect to optimal values in the underlying MDP, which is $M_d$ (rather than $M$) when action-repetition is performed with $d > 1$. Fortunately, we already have an upper bound on $\|V^\star_M - V^\star_{M_d}\|_\infty$ from Lemma 2, which we can add to the one from Lemma 5 to meaningfully compare $\pi_d$ with the optimal policy of $M$. Doing so, we obtain our main result.
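A quick calculation shows how sharply the horizon shrinks. The snippet below assumes the usual reading of the effective horizon as $1/(1 - \gamma)$, which becomes $1/(1 - \gamma^d)$, counted in decision steps, under frame-skip $d$; the function name and the value $\gamma = 0.99$ are chosen here for illustration only.

```python
def effective_horizon(gamma, d=1):
    """Effective horizon 1 / (1 - gamma^d), counted in decision (sensing) steps."""
    return 1.0 / (1.0 - gamma ** d)


gamma = 0.99
for d in (1, 2, 4, 8, 16):
    print(f"d = {d:2d}: effective horizon = {effective_horizon(gamma, d):.2f}")
```

The horizon falls roughly in proportion to $d$ while $\gamma^d$ stays close to 1, which is the regime where an approximation error is amplified the most.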
Fix MDP , and . Assume that a learning algorithm returns action-value function . Let be greedy with respect to . There exist constants and such that
with the dependencies of and shown explicitly in parentheses.
By the triangle inequality,
Lemma 2 upper-bounds the first RHS term by , where Observe that the second RHS term may be written as , which Lemma 5 upper-bounds by , where is an -approximation of . In turn, can be replaced by , which is itself upper-bounded using the triangle inequality by . Corollary 3 upper-bounds by . As for , observe that it only depends on and . In aggregate, we have
for appropriately defined and . ∎
While the first term in the bound increases with , the second term decreases on account of the shortening horizon. The overall bound is likely to be minimised by intermediate values of especially when the price of inertia () is small and the approximation error () large. We observe exactly this trend in the pitted grid world environment when we have an agent learn using Q-learning (with 0.05-greedy exploration and a geometrically-annealed learning rate). As a crude form of function approximation, we constrain (randomly chosen) pairs of neighbouring states to share the same Q-values. Observe from figures (b)b (lower panel) and (c)c that indeed the best results are achieved when .
6. Empirical Evaluation
The pitted grid world was an ideal task to validate our theoretical results, since it allowed us to control the price of inertia and to benchmark learned behaviour against optimal values. In this section, we evaluate action-repetition on more realistic tasks, wherein the evaluation is completely empirical. Our experiments test methodological variations and demonstrate the need for action-repetition for learning in a new, challenging task.
We begin with Acrobot, the classic control task consisting of two links and two joints (shown in Figure (a)). The goal is to move the tip of the lower link above a given height, in the shortest time possible. Three actions are available at each step: leftward, rightward, and zero torque. Our experiments use the OpenAI Gym brockman2016openai implementation of Acrobot, which takes 5 actions per second. States are represented as a tuple of six features: $\cos\theta_1$, $\sin\theta_1$, $\cos\theta_2$, $\sin\theta_2$, $\dot{\theta}_1$, and $\dot{\theta}_2$, where $\theta_1$ and $\theta_2$ are the link angles. The start state in every episode is set up around the stable point: $\theta_1$, $\theta_2$, $\dot{\theta}_1$, and $\dot{\theta}_2$ are sampled uniformly at random from $[-0.1, 0.1]$. A reward of -1 is given each time step, and 0 at termination. Although Acrobot is episodic and undiscounted, we expect that as with the pitted grid world, the essence of Theorem 6 will still apply. Note that with control at 5 Hz, Acrobot episodes can last up to 500 steps when actions are selected uniformly at random.
We execute Sarsa(), a straightforward generalisation of TD() to the control setting, using 1-dimensional tile coding (Sutton+Barto:2018, see Section 12.7).
Tuning other parameters to optimise results for , we set
, and an initial exploration rate , decayed by a factor of after each episode. Figure (b)b shows learning curves for different values. At 8,000 episodes, the best results are for ; in fact Sarsa() with up to dominates Sarsa(). It appears that Acrobot does not need control at 5 Hz; action-repetition shortens the task horizon and enhances learning.
Frame-skipping versus reducing discount factor. If the key contribution of to successful learning is the reduction in horizon from to , a natural idea is to artificially reduce the task’s discount factor , even without action-repetition. Indeed this approach has been found effective in conjunction with approximate value iteration petrik2009biasing and model-learning jiang2015dependence. Figure (c)c shows the values of policies learned by Sarsa() after episodes of training, when the discount factor (originally ) is reduced. Other parameters are as before. As expected, some values of do improve learning. Setting helps the agent finish the task in steps: an improvement of steps over regular Sarsa(). However, the configuration of performs even better—implying that on this task, is more effective to tune than . Although decreasing and increasing both have the effect of shrinking the horizon, the former has the consequence of revising the very definition of long-term reward. As apparent from Proposition 1, entails no such change. That tuning these parameters in conjunction yields the best results (at ) prompts future work to investigate their interaction. Interestingly, we find no benefit from using on the pitted grid world task.
Action-repetition in policy gradient methods. Noting that some of the recent successes of frame-skipping are on policy gradient methods Sharma+LR:2017, we run reinforce Williams:1992 on Acrobot using action-repetition. Our controller computes an output for each action as a linear combination of the inputs, thereafter applying soft-max action-selection. The resulting weights (including biases) are updated by gradient ascent to optimise the episodic reward , using the Adam optimiser with initial learning rate . We set to . Figure (d)d shows that yet again, performance is optimised at . Note that our implementation of reinforce performs baseline subtraction, which reduces the variance of the gradient estimate and improves results for . Even so, an empirical plot of the variance (Figure (e)e) shows that it falls further as is increased, with a relatively steep drop around . As yet, we do not have an analytical explanation of this behaviour. Although known upper bounds on the variance of policy gradients Zhao+HNS:2011 have a quadratic dependence on the task horizon, which is decreased by from to , they are also quadratic in the maximum reward, which is increased by from to . We leave it to future work to explain the empirical observation of a significant reduction of the policy gradient variance with on Acrobot.
6.2. Action-repetition in new, complex domain
Before wrapping up, we share our experience of implementing action-repetition in a new, relatively complex domain. We expect practitioners to confront similar design choices in other tasks, too.
The Half Field Offense (HFO) environment ALA16-hausknecht models a game of soccer in which an offense team aims to score against a defense team, when playing on one half of a soccer field (Figure (a)a). While previous investigations in this domain have predominantly focused on learning successful offense behaviour, we address the task of learning defense. Our rationale is that successful defense must anyway have extended sequences of actions such as approaching the ball and marking a player. Note that in 2 versus 2 (2v2) HFO, the average number of decisions made in an episode is roughly 8 for offense, and 100 for defense. We implement four high-level defense actions: mark_player, reduce_angle_to_goal, go_to_ball, and defend_goal. The continuous state space is represented by features such as distances and positions ALA16-hausknecht. Episodes yield a single reward at the end: 1 for no goal and 0 for goal. No discounting is used. As before, we run Sarsa() with 1-dimensional tile coding.
In the 2v2 scenario, we train one defense agent, while using built-in agents for the goalkeeper and offense. Consistent with earlier studies durugkar2016deep; McGovern+SF:1997, we observe that action-repetition assists in exploration. With , random action-selection succeeds on only of the episodes; the success rate increases with , reaching for . Figure (b)b shows learning curves: points are shown after every 5,000 training episodes, obtained by evaluating learned policies for 2,000 episodes. All algorithms use (optimised for
Sarsa() at 50,000 episodes). Action-repetition shows a dramatic effect on Sarsa, which only registers a modest improvement over random behaviour with , but with , even outperforms a defender from the helios team akiyama2018helios2018 that won the RoboCup competition in 2010 and 2012.
Optimising . A natural question arising from our observations is whether we can tune “on-line”, based on the agent’s experience. We obtain mixed results from investigating this possibility. In one approach, we augment the atomic set of actions with extended sequences; in another we impose a penalty on the agent every time it switches actions. Neither of these approaches yields any appreciable benefit. The one technique that does show a rising learning curve, included in Figure (b)b, is FiGAR-Sarsa, under which we associate both action and (picked from ) with state, and update each -value independently. However, at 50,000 episodes of training, this method still trails Sarsa() with (static) by a significant margin.
Observe that the methods described above all allow the agent to adapt within each learning episode. On the other hand, the reported successes of tuning on Atari games Lakshminarayanan+SR:2017; Sharma+LR:2017 are based on policy gradient methods, in which a fixed policy is executed in each episode (and updated between episodes). In line with this approach, we design an outer loop that treats each value of (from a finite set) as an arm of a multi-armed bandit. A full episode, with Sarsa() updates using the corresponding, fixed frame-skip is played out on every pull. The state of each arm is saved between its pulls (but no data is shared between arms). Since we cannot make the standard “stochastic” assumption here, we use the EXP3.1 algorithm auer2002nonstochastic, which maximises expected payoff in the adversarial setting. Under EXP3.1, arms are sampled according to a probability distribution, which gets updated whenever an arm is sampled. Figure (c)c shows a learning curve corresponding to this meta-algorithm (based on a moving average of 500 episodes); we set for the best overall results. It is apparent from the curve and affirmed by the inset that Exp3.1 is quick to identify as the best among the given choices ( and are also picked many times due to their quick convergence, even if to suboptimal performance).
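The outer loop can be sketched as follows. This is a simplified stand-in, not the paper's procedure: it uses plain EXP3 with a fixed exploration rate `eta` rather than EXP3.1, which additionally restarts EXP3 in epochs. The callback `run_episode(d)` is hypothetical; it is assumed to run one full learning episode with frame-skip `d`, keep that arm's learner state between calls, and return a payoff in $[0, 1]$ (for instance, 1 for a successful defence and 0 otherwise). The arm values in the usage example are placeholders.

```python
import math
import random


def exp3_over_frame_skips(arms, run_episode, pulls, eta=0.1, seed=0):
    """Treat each frame-skip value as a bandit arm; return final sampling probabilities."""
    rng = random.Random(seed)
    k = len(arms)
    weights = [1.0] * k

    def distribution():
        total = sum(weights)
        # Mix the weight-proportional distribution with uniform exploration.
        return [(1.0 - eta) * w / total + eta / k for w in weights]

    for _ in range(pulls):
        probs = distribution()
        i = rng.choices(range(k), weights=probs)[0]
        payoff = run_episode(arms[i])          # assumed to lie in [0, 1]
        estimate = payoff / probs[i]           # importance-weighted payoff estimate
        weights[i] *= math.exp(eta * estimate / k)
    return distribution()


def stub_episode(d):
    """Stand-in for a real learner: only frame-skip 8 ever succeeds."""
    return 1.0 if d == 8 else 0.0


final_probs = exp3_over_frame_skips([1, 8, 32], stub_episode, pulls=500)
print(final_probs)
```

Because only the pulled arm is updated and no data is shared between arms, each frame-skip value keeps its own learner, as in the description above.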
7. Conclusion
In this paper, we analyse frame-skipping, a simple approach that has recently shown much promise in applications of RL, and is especially relevant as technology continues to drive up frame rates and clock speeds. In the prediction setting, we establish that frame-skipping retains the consistency of prediction. In the control setting, we provide both theoretical and empirical justification for action-repetition, which applies the principle that tasks with gradual changes of state can benefit from a shortening of the horizon. Indeed action-repetition allows TD learning to succeed on the defense variant of HFO, a hitherto less-studied aspect of the game. Although we are able to automatically tune the frame-skip parameter using an outer loop, it would be interesting to examine how the same can be achieved within each episode.
Appendix A Price of Inertia for Deterministic MDPs with Reversible Transitions
Consider a deterministic MDP in which transitions can be “reversed”: in other words, for , if taking from leads to , then there exists an action such that taking from leads to . Now suppose action carries the agent from to , and thereafter from to . We have:
Since , it follows that is at most .
Appendix B Proof of Proposition 3
Appendix C Proof of Proposition 4
The figure below shows an MDP with states , and actions stay (dashed) and move (solid). All transitions are deterministic, and shown by arrows labeled with rewards. The positive reward is set to .
It can be verified that