1 Introduction
This paper is concerned with reinforcement learning (RL) under
limited adaptivity or low switching cost, a setting in which the agent is allowed to act in the environment for a long period but is constrained to switch its policy at most a small number of times. A small switching cost restricts the agent from frequently adjusting its exploration strategy based on feedback from the environment.

There are strong practical motivations for developing RL algorithms under limited adaptivity. The setting of restricted policy switching captures various real-world settings where deploying new policies comes at a cost. For example, in medical applications where actions correspond to treatments, it is often unrealistic to execute fully adaptive RL algorithms; instead, one can only run a fixed policy approved by the domain experts to collect data, and a separate approval process is required every time one would like to switch to a new policy [18, 2, 3]. In personalized recommendation [24], it is computationally impractical to adjust the policy online based on instantaneous data, and a more common practice is to aggregate data over a long period before deploying a new policy. In problems where we run RL for compiler optimization [4] and hardware placement [19], as well as for learning to optimize databases [17], it is often desirable to limit the frequency of changes to the policy, since it is costly to recompile the code, to run profiling, to reconfigure an FPGA device, or to restructure a deployed relational database. The problem is even more prominent in RL-guided new material discovery, as it takes time to fabricate the materials and set up the experiments [23, 20]. In many of these applications, adaptivity often turns out to be the real bottleneck.
Understanding limited adaptivity RL is also important from a theoretical perspective. First, algorithms with low adaptivity (a.k.a. "batched" algorithms) that are as effective as their fully sequential counterparts have been established in bandits [22, 11], online learning [8], and optimization [10], and it would be interesting to extend such understanding to RL. Second, algorithms with few policy switches are naturally easy to parallelize, as there is no need for parallel agents to communicate if they all execute the same policy. Third, limited adaptivity is closely related to off-policy RL (the extreme case of no policy switching corresponds to off-policy RL, where the algorithm can only choose one data collection policy [13]) and offers a relaxation less challenging than the pure off-policy setting.
In this paper, we take initial steps towards studying theoretical aspects of limited adaptivity RL by designing low-regret algorithms with limited adaptivity. We focus on model-free algorithms, in particular Q-learning, which was recently shown by Jin et al. [15] to achieve a sublinear regret bound with UCB exploration and a careful step-size choice. Our goal is to design Q-learning type algorithms that achieve similar regret bounds with a bounded switching cost.
The main contributions of this paper are summarized as follows:

We propose a notion of local switching cost that captures the adaptivity of an RL algorithm in episodic MDPs (Section 2). Algorithms with lower local switching cost make fewer changes to their deployed policies.

Building on insights from the UCB2 algorithm for multi-armed bandits [5] (Section 3), we propose our main algorithms, Q-learning with UCB2-{Hoeffding, Bernstein} exploration. We prove that these two algorithms achieve $\tilde{O}(\sqrt{H^4 SAT})$ and $\tilde{O}(\sqrt{H^3 SAT})$ regret (respectively) and $O(H^3 SA \log(K/A))$ local switching cost (Section 4). The regret matches that of their vanilla counterparts in [15], but the switching cost is only logarithmic in the number of episodes.

We show how our low switching cost algorithms can be applied in the concurrent RL setting [12], in which multiple agents can act in parallel (Section 5). The parallelized versions of our algorithms with UCB2 exploration give rise to Concurrent Q-learning algorithms, which achieve a nearly linear speedup in execution time and compare favorably against existing concurrent algorithms in sample complexity for exploration.

We show a simple $\Omega(HSA)$ lower bound on the switching cost for any sublinear-regret algorithm, which is at most an $O(H^2 \log K)$ factor away from our upper bound (Section 7).
1.1 Prior work
Low-regret RL
Sample-efficient RL has been studied extensively since the classical work of Kearns and Singh [16] and Brafman and Tennenholtz [7], with a focus on obtaining a near-optimal policy in polynomial time, i.e. PAC guarantees. A subsequent line of work initiated the study of regret in RL and provided algorithms that achieve $\sqrt{T}$-type regret [14, 21, 1]. In our episodic MDP setting, the information-theoretic lower bound for the regret is $\Omega(\sqrt{H^2 SAT})$, which is matched in recent work by the UCBVI [6] and ORLC [9] algorithms. On the other hand, while all the above low-regret algorithms are essentially model-based, the recent work of [15] shows that model-free algorithms such as Q-learning are able to achieve regret that is only a $\sqrt{H}$ factor worse than the lower bound.
Low switching cost / batched algorithms
Auer et al. [5] propose UCB2 for bandit problems, which achieves the same regret bound as UCB but has switching cost only $O(A \log T)$ instead of the naive $O(T)$. Cesa-Bianchi et al. [8] study the switching cost in online learning in both the adversarial and the stochastic setting, and design an algorithm for stochastic bandits that achieves optimal $\tilde{O}(\sqrt{T})$ regret and $O(\log\log T)$ switching cost.
Learning algorithms with switching cost bounded by a fixed constant $M$ are often referred to as batched algorithms. Minimax rates for batched algorithms have been established in various problems such as bandits [22, 11] and convex optimization [10]. In all these scenarios, minimax optimal batched algorithms are obtained for all $M$, and their rate matches that of fully adaptive algorithms once $M = \Omega(\log\log T)$.
2 Problem setup
In this paper, we consider undiscounted episodic tabular MDPs of the form $(\mathcal{S}, \mathcal{A}, H, P, r)$. The MDP has horizon $H$, with trajectories of the form $(x_1, a_1, \dots, x_H, a_H, x_{H+1})$, where $x_h \in \mathcal{S}$ and $a_h \in \mathcal{A}$. The state space and action space are discrete with $|\mathcal{S}| = S$ and $|\mathcal{A}| = A$. The initial state $x_1$ can be either adversarial (chosen by an adversary with access to our algorithm) or stochastic, specified by some fixed distribution. For any
$(x, a, h)$, the transition probability is denoted as
$P_h(\cdot \mid x, a)$. The reward is denoted as $r_h(x, a) \in [0, 1]$, which we assume to be deterministic (our results can be straightforwardly extended to the case of stochastic rewards). We assume in addition that the reward at step $H+1$ is zero for all states, so that the last state $x_{H+1}$ is effectively an (uninformative) absorbing state.

A deterministic policy $\pi$ consists of $H$ sub-policies $\pi_1, \dots, \pi_H$ with $\pi_h : \mathcal{S} \to \mathcal{A}$. For any deterministic policy $\pi$, let $V_h^\pi$ and $Q_h^\pi$ denote its value function and state-action value function at the $h$-th step, respectively. Let $\pi^\star$ denote an optimal policy, and let $V_h^\star$ and $Q_h^\star$ denote the optimal $V$ and $Q$ functions for all $h \in [H]$. As a convenient shorthand, we denote $[P_h V](x, a) := \mathbb{E}_{x' \sim P_h(\cdot \mid x, a)}[V(x')]$, and also use an empirical counterpart in the proofs to denote the observed transition. Unless otherwise specified, we will focus on deterministic policies in this paper, which is without loss of generality as there exists at least one deterministic policy that is optimal.
Regret
We focus on the regret for measuring the performance of RL algorithms. Let $K$ be the number of episodes that the agent can play (so that the total number of steps is $T := KH$). The regret of an algorithm is defined as
$$\mathrm{Regret}(K) := \sum_{k=1}^{K} \left[ V_1^\star(x_1^k) - V_1^{\pi^k}(x_1^k) \right],$$
where $\pi^k$ is the policy the algorithm employs before episode $k$ starts, and $V_1^\star$ is the optimal value function for the entire episode.
Miscellaneous notation
We use standard big-O notation in this paper: $f = O(g)$ means that there exists an absolute constant $C > 0$ such that $f \le C g$ (and similarly for $\Omega$). We write $f = \tilde{O}(g)$ to mean $f = O(g \cdot L)$, where $L$ depends at most poly-logarithmically on all the problem parameters.
2.1 Measuring adaptivity through local switching cost
To quantify the adaptivity of RL algorithms, we consider the following notion of local switching cost. The local switching cost (henceforth also "switching cost") between any pair of policies $(\pi, \pi')$ is defined as the number of $(h, x)$ pairs on which $\pi$ and $\pi'$ differ:
$$n_{\mathrm{switch}}(\pi, \pi') := \left| \left\{ (h, x) \in [H] \times \mathcal{S} : \pi_h(x) \ne \pi'_h(x) \right\} \right|.$$
For an RL algorithm that employs policies $(\pi^1, \dots, \pi^K)$, its local switching cost is defined as
$$N_{\mathrm{switch}} := \sum_{k=1}^{K-1} n_{\mathrm{switch}}(\pi^k, \pi^{k+1}).$$
Note that (1) $N_{\mathrm{switch}}$
is in general a random variable, as each $\pi^{k+1}$
can depend on the outcomes in the MDP; and (2) we have the trivial bound $n_{\mathrm{switch}}(\pi, \pi') \le HS$ for any pair of policies, and hence $N_{\mathrm{switch}} \le HS(K-1)$ for any algorithm.

Remark. The local switching cost naturally extends the notion of switching cost in online learning [8] and is suitable in scenarios where the cost of deploying a new policy scales with the portion of $(h, x)$ pairs on which the action is changed.
A closely related notion of adaptivity is the global switching cost, which simply measures how many times the algorithm switches its entire policy:
$$N^{\mathrm{gl}}_{\mathrm{switch}} := \sum_{k=1}^{K-1} \mathbf{1}\left\{ \pi^{k+1} \ne \pi^k \right\}.$$
As $\pi^{k+1} \ne \pi^k$ implies $n_{\mathrm{switch}}(\pi^k, \pi^{k+1}) \ge 1$, we have the trivial bound $N^{\mathrm{gl}}_{\mathrm{switch}} \le N_{\mathrm{switch}}$. However, the global switching cost can be substantially smaller for algorithms that tend to change the policy "entirely" rather than "locally". In this paper, we focus on bounding $N_{\mathrm{switch}}$, and leave the task of obtaining tighter bounds on $N^{\mathrm{gl}}_{\mathrm{switch}}$ as future work.
3 UCB2 for multi-armed bandits
To gain intuition about the switching cost, we briefly review the UCB2 algorithm [5] for multi-armed bandit problems, which achieves the same regret bound as the original UCB but has a substantially lower switching cost.
The multi-armed bandit problem can be viewed as an RL problem with $H = 1$ and $S = 1$, so that the agent need only play one action and observe the (random) reward. The reward distributions are unknown to the agent, and the goal is to achieve low regret.
The UCB2 algorithm is a variant of the celebrated UCB (Upper Confidence Bound) algorithm for bandits. UCB2 also maintains upper confidence bounds on the true mean rewards but, when an arm is found to maximize the upper confidence bound, plays it multiple times rather than just once. Specifically, when an arm is found to maximize the UCB for the $r$-th time, UCB2 will play it $\tau(r+1) - \tau(r)$ times, where
(1) $\tau(r) := \lceil (1+\alpha)^r \rceil$
for $r \ge 0$ and some parameter $\alpha \in (0, 1)$ to be determined. (Footnote: for convenience, here we treat $\tau(r)$ as an integer. In Q-learning we cannot make this approximation, as we choose $\alpha$ to be very small, and will massage the sequence to deal with it.) The full UCB2 algorithm is presented in Algorithm 1.
[Auer et al. [5]] For any $\alpha \in (0, 1)$, the UCB2 algorithm achieves an expected regret bound of order $\sum_{a : \Delta_a > 0} \log T / \Delta_a$ (up to $\alpha$-dependent constants),
where $\Delta_a$ is the gap between arm $a$ and the optimal arm. Further, the switching cost is at most $O(A \log T)$. The switching cost bound in Theorem 1 comes directly from the fact that the epoch lengths $\tau(r+1) - \tau(r)$ grow geometrically, so that an arm pulled $n$ times in total is pulled in only $O(\log n / \alpha)$ epochs; aggregating over arms uses the concavity of $n \mapsto \log n$ and Jensen's inequality. Such an approach is fairly general, and we follow it in the sequel to develop RL algorithms with low switching cost.
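The logarithmic switching bound can be checked numerically. The sketch below (with an illustrative choice of $\alpha$ and function names of ours) counts how many UCB2 epochs with boundaries $\tau(r) = \lceil (1+\alpha)^r \rceil$ are needed before a single arm accumulates $T$ pulls:

```python
import math

def tau(r, alpha=0.5):
    # Epoch boundaries of UCB2: an arm selected for the r-th time is
    # played tau(r+1) - tau(r) times in a row.
    return math.ceil((1 + alpha) ** r)

def num_epochs(T, alpha=0.5):
    # Smallest r with tau(r) >= T: the number of epochs (and hence the
    # number of switches attributable to this arm) grows like log(T)/alpha.
    r = 0
    while tau(r, alpha) < T:
        r += 1
    return r
```

For instance, a million pulls of one arm fit within a few dozen epochs, so the number of switches per arm is only logarithmic in the number of pulls.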
4 Q-learning with UCB2 exploration
In this section, we propose our main algorithm, Q-learning with UCB2 exploration, and show that it achieves sublinear regret as well as logarithmic local switching cost.
4.1 Algorithm description
High-level idea
Our algorithm maintains two sets of optimistic estimates: a running estimate that is updated after every episode, and a delayed estimate that is updated only occasionally but is used to select actions. In between two updates to the delayed estimate, the policy stays fixed, so the number of policy switches is bounded by the number of updates to the delayed estimate.
To describe our algorithm, define the triggering sequence as
(2)
where the parameters will be inputs to the algorithm, and define for every state-action pair the associated quantities used by the update rule.
Two-stage switching strategy
The triggering sequence (2) defines a two-stage strategy for switching policies. Suppose that, for a given $(x, a, h)$, the algorithm decides to take action $a$ at state $x$ and step $h$ for the $n$-th time, and has observed the transition and updated the running estimate accordingly. Then, whether to also update the deployed policy is decided as follows:

Stage I: if $n$ is below the threshold specified by the triggering sequence, then always perform the policy update.

Stage II: otherwise, perform the above update only if $n$ belongs to the triggering sequence, that is, $n = \tau(j)$ for some $j$.
In other words, for any state-action pair, the algorithm performs eager policy updates during the initial visits, and switches to delayed policy updates thereafter according to the UCB2 schedule.
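The two-stage rule can be sketched as follows; the threshold `n0` and the exact geometric form of the stage-II sequence are illustrative stand-ins for the triggering sequence (2), whose precise parameters are set by the algorithm:

```python
import math

def is_trigger(n, n0, alpha):
    # Stage-II triggering times: an illustrative geometric sequence
    # {ceil(n0 * (1 + alpha)**r) : r = 0, 1, ...}.
    r = 0
    while True:
        t = math.ceil(n0 * (1 + alpha) ** r)
        if t >= n:
            return t == n
        r += 1

def should_update_policy(n, n0, alpha):
    # Stage I: eager policy updates during the first n0 visits.
    if n <= n0:
        return True
    # Stage II: delayed updates, only at triggering times.
    return is_trigger(n, n0, alpha)
```

Because the stage-II triggering times grow geometrically, each state-action pair contributes only logarithmically many policy updates after its initial visits.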
Optimistic exploration bonus
We employ either a Hoeffding-type or a Bernstein-type exploration bonus to ensure that our running estimates are optimistic. The full algorithm with the Hoeffding-style bonus is presented in Algorithm 2.
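A single running-estimate update with the Hoeffding-style bonus can be sketched as below, following the update rule of Jin et al. [15] with step size $\alpha_n = (H+1)/(H+n)$; the constant `c` and the omission of value clipping are illustrative simplifications:

```python
import math

def q_update(Q_h, V_next, x, a, x_next, r, n, H, iota, c=1.0):
    # One optimistic Q-learning step at stage h after the n-th visit of
    # (x, a): Q <- (1 - alpha_n) * Q + alpha_n * (r + V_{h+1}(x') + b_n).
    alpha = (H + 1) / (H + n)                 # step size alpha_n
    bonus = c * math.sqrt(H ** 3 * iota / n)  # Hoeffding-style bonus b_n
    target = r + V_next[x_next] + bonus
    Q_h[x][a] = (1 - alpha) * Q_h[x][a] + alpha * target
```

The delayed estimate used for action selection is refreshed from this running estimate only at the triggering times of the two-stage strategy.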
4.2 Regret and switching cost guarantee
We now present our main results. [Q-learning with UCB2H exploration achieves sublinear regret and low switching cost] With an appropriate choice of the parameters, with probability at least $1 - \delta$, the regret of Algorithm 2 is bounded by $\tilde{O}(\sqrt{H^4 SAT})$. Further, the local switching cost is bounded as $N_{\mathrm{switch}} \le O(H^3 SA \log(K/A))$. Theorem 4.2 shows that the total regret of Q-learning with UCB2 exploration is $\tilde{O}(\sqrt{H^4 SAT})$, the same as the UCB version of [15]. In addition, the local switching cost of our algorithm is $O(H^3 SA \log(K/A))$, which is only logarithmic in $K$, whereas the UCB version can in the worst case incur the trivial bound $N_{\mathrm{switch}} \le HS(K-1)$. We give a high-level overview of the proof of Theorem 4.2 in Section 6, and defer the full proof to Appendix A.
Bernstein version
Replacing the Hoeffding bonus with a Bernstein-type bonus, we can achieve $\tilde{O}(\sqrt{H^3 SAT})$ regret (a $\sqrt{H}$ factor better than UCB2H) with the same switching cost bound. [Q-learning with UCB2B exploration achieves sublinear regret and low switching cost] With an appropriate choice of the parameters, with probability at least $1 - \delta$, the regret of Algorithm 3 is bounded by $\tilde{O}(\sqrt{H^3 SAT})$ provided that $T$ is sufficiently large. Further, the local switching cost is bounded as $N_{\mathrm{switch}} \le O(H^3 SA \log(K/A))$. The full algorithm description, as well as the proof of Theorem 4.2, are deferred to Appendix B.
4.3 PAC guarantee
Our low switching cost algorithms also achieve PAC learnability guarantees. Specifically, we have the following. [PAC bound for Q-learning with UCB2 exploration] Suppose (WLOG) that the initial state $x_1$ is deterministic. For any $\epsilon > 0$, Q-learning with {UCB2H, UCB2B} exploration can output a (stochastic) policy $\hat{\pi}$ that is $\epsilon$-optimal with high probability
after $\mathrm{poly}(S, A, H, 1/\epsilon)$ episodes. The proof of Corollary 4.3 involves turning the regret bounds in Theorem 4.2 into PAC bounds using the online-to-batch conversion, similarly to [15]. The full proof is deferred to Appendix C.
5 Application: Concurrent Q-learning
Our low switching cost Q-learning can be applied to develop algorithms for concurrent RL [12], a setting in which multiple RL agents act in parallel, hopefully accelerating exploration in wall-clock time.
Setting
We assume there are $M$ agents/machines, where each machine interacts with an independent copy of the episodic MDP (so that the transitions and rewards on the MDPs are mutually independent). Within each episode, the machines must play synchronously and cannot communicate; they can only exchange information after the entire episode has finished. Note that our setting is in this respect more stringent than that of [12], which allows communication after each timestep.
We define a "round" as the duration in which the machines simultaneously finish one episode and (optionally) communicate and update their policies. We measure the performance of a concurrent algorithm by the number of rounds it requires to find an $\epsilon$-optimal policy. With larger $M$, we expect this number of rounds to be smaller, and the best we can hope for is a linear speedup in which the number of rounds scales as $1/M$.
Concurrent Q-learning
Intuitively, any low switching cost algorithm can be made into a concurrent algorithm, as its execution can be parallelized in between two consecutive policy switches. Indeed, we can design concurrent versions of our low switching Q-learning algorithms that achieve a nearly linear speedup. [Concurrent Q-learning achieves nearly linear speedup] There exist concurrent versions of Q-learning with {UCB2H, UCB2B} exploration that, given a budget of $M$ parallel machines, return an $\epsilon$-optimal policy within a number of
rounds of execution that scales inversely with $M$, plus an additive overhead. Theorem 5 shows that concurrent Q-learning has a linear speedup so long as $M$ is not too large. In particular, in high-accuracy (small $\epsilon$) cases, the overhead term is negligible and we essentially have a linear speedup over a wide range of $M$. The proof of Theorem 5 is deferred to Appendix D.
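The round structure described above can be sketched as follows, with a toy MDP simulator standing in for the real environment (the environment dynamics and the functions `play_episode` and `run_round` are ours, purely for illustration):

```python
import random

def play_episode(policy, rng, H=3, S=2):
    # Toy rollout: uniform random transitions, reward equal to the action.
    traj, x = [], 0
    for h in range(H):
        a = policy[h][x]
        x_next = rng.randrange(S)
        traj.append((h, x, a, float(a), x_next))
        x = x_next
    return traj

def run_round(policy, rngs):
    # One synchronous round: every machine executes the SAME deployed
    # policy on its own MDP copy; no communication until the round ends.
    return [play_episode(policy, rng) for rng in rngs]
```

Because all machines share one fixed policy within a round, parallel execution requires no mid-episode coordination, which is exactly what a low switching cost enables.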
Comparison with existing concurrent algorithms
Theorem 5 also implies a PAC mistake bound: there exist concurrent algorithms on $M$ machines, Concurrent Q-learning with {UCB2H, UCB2B}, that perform a near-optimal action on all but a bounded number of
actions with high probability (detailed argument in Appendix D.2).
We compare ourselves with the Concurrent MBIE (CMBIE) algorithm of [12], which considers discounted, infinite-horizon MDPs and has a corresponding mistake bound (stated in terms of the number of states, number of actions, and discount factor of the discounted infinite-horizon MDP).
Our concurrent Q-learning compares favorably against CMBIE in terms of the mistake bound:

Dependence on $\epsilon$. Our algorithm has a lower-order dependence on $1/\epsilon$ than CMBIE.

Dependence on the problem size. The bounds are not directly comparable in general, but under the "typical" correspondence between episodic and discounted MDPs (one can transform an episodic MDP into an infinite-horizon MDP by augmenting the state with the step index; note also that the "effective" horizon of a discounted MDP is $1/(1-\gamma)$), CMBIE has a higher dependence on the horizon, as well as an additional term due to its model-based nature.
6 Proof overview of Theorem 4.2
The proof of Theorem 4.2 involves two parts: the switching cost bound and the regret bound. The switching cost bound follows directly from the UCB2 switching schedule, similarly to the bandit case (cf. Section 3). However, such a switching schedule results in delayed policy updates, which makes establishing the regret bound technically challenging.
The key to the regret bound for "vanilla" Q-learning in [15] is a propagation of error argument, which shows that the regret (technically, an upper bound on the regret) from the $h$-th step onward (henceforth the stage-$h$ regret)
is bounded by $(1 + 1/H)$ times the stage-$(h+1)$ regret, plus a bounded error term. As $(1 + 1/H)^H \le e$, this fact can be applied recursively for $h = H, H-1, \dots, 1$, resulting in a total regret bound that is not exponential in $H$. The control of the (excess) error propagation factor by $1/H$ and the ability to converge are then achieved simultaneously via the step-size choice $\alpha_t = (H+1)/(H+t)$.
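The compounding step is where this factor matters: a per-stage blow-up of $(1 + 1/H)$ applied over all $H$ stages stays below $e$, so the recursion contributes only a constant. A quick numeric check (illustrative):

```python
import math

# The per-stage factor (1 + 1/H), compounded over all H stages, is
# bounded by e for every H, so the recursion is never exponential in H.
for H in (2, 10, 100, 1000):
    compounded = (1 + 1 / H) ** H
    assert compounded < math.e
```

Had the per-stage factor instead been a constant bounded away from 1 independently of $H$ (say 2), the compounded factor would be $2^H$, and the total regret bound would degrade exponentially with the horizon.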
In contrast, our low-switching version of Q-learning updates the exploration policy in a delayed fashion according to the UCB2 schedule. Specifically, the policy at episode $k$ does not correspond to the argmax of the current running estimate, but rather to a previous version of it. This introduces a mismatch between the estimate used for exploration and the estimate being updated, and it is a priori unclear whether such a mismatch will blow up the propagation of error.
We resolve this issue via a novel error analysis, which at a high level consists of the following steps:

We show that the stage-$h$ error is upper bounded by a max error
(Lemma A.3). On the right-hand side, the first term has no mismatch and can be bounded similarly to [15]. The second term is a perturbation term, which we bound in a precise way that relates it to the step sizes between the relevant episodes and to the stage-$(h+1)$ regret (Lemma A.3).

We show that, under the UCB2 scheduling, the combined error above results in only a mild blow-up in the relation between the stage-$h$ and stage-$(h+1)$ regrets: the multiplicative factor can now be bounded by a quantity close to $1 + 1/H$ (Lemma A.4). Choosing the parameters appropriately keeps this multiplicative factor small enough for the propagation of error argument to go through.
We hope that the above analysis can be applied more broadly in analyzing exploration problems with delayed updates or asynchronous parallelization.
7 Lower bound on switching cost
Let $\mathcal{M}$ be the set of episodic MDPs satisfying the conditions in Section 2. For any RL algorithm whose switching cost is below the order of $HSA$, we have
that its worst-case regret over $\mathcal{M}$ is linear in $K$. Theorem 7 implies that the switching cost of any sublinear-regret algorithm is lower bounded by $\Omega(HSA)$, which is quite intuitive, as one would like to play each action at least once at every (step, state) pair. Compared with this lower bound, the switching cost we achieve through UCB2 scheduling is off by at most a factor of $O(H^2 \log K)$. We believe that the $\log K$ factor is not necessary, as there exist algorithms achieving double-logarithmic switching cost in bandits [8], and we leave the tightening of this factor as future work. The proof of Theorem 7 is deferred to Appendix E.
8 Conclusion
In this paper, we take steps toward studying limited adaptivity RL. We propose a notion of local switching cost to account for the adaptivity of RL algorithms. We design a Q-learning algorithm with infrequent policy switching that achieves $\tilde{O}(\sqrt{H^3 SAT})$ regret while switching its policy at most $O(H^3 SA \log(K/A))$ times. Our algorithm works in the concurrent setting through parallelization, achieving a nearly linear speedup and a favorable sample complexity. Our proof involves a novel perturbation analysis for exploration algorithms with delayed updates, which could be of broader interest.
There are many interesting future directions, including (1) low switching cost algorithms with tighter regret bounds, most likely via model-based approaches; (2) algorithms with even lower switching cost; and (3) investigating the connection to other settings such as off-policy RL.
References
 Agrawal and Jia [2017] S. Agrawal and R. Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
 Almirall et al. [2012] D. Almirall, S. N. Compton, M. Gunlicks-Stoessel, N. Duan, and S. A. Murphy. Designing a pilot sequential multiple assignment randomized trial for developing an adaptive treatment strategy. Statistics in Medicine, 31(17):1887–1902, 2012.
 Almirall et al. [2014] D. Almirall, I. Nahum-Shani, N. E. Sherwood, and S. A. Murphy. Introduction to SMART designs for the development of adaptive interventions: with application to weight loss research. Translational Behavioral Medicine, 4(3):260–274, 2014.

 Ashouri et al. [2018] A. H. Ashouri, W. Killian, J. Cavazos, G. Palermo, and C. Silvano. A survey on compiler autotuning using machine learning. ACM Computing Surveys (CSUR), 51(5):96, 2018.
 Auer et al. [2002] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
 Azar et al. [2017] M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 263–272. JMLR.org, 2017.
 Brafman and Tennenholtz [2002] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
 Cesa-Bianchi et al. [2013] N. Cesa-Bianchi, O. Dekel, and O. Shamir. Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems, pages 1160–1168, 2013.
 Dann et al. [2018] C. Dann, L. Li, W. Wei, and E. Brunskill. Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056, 2018.
 Duchi et al. [2018] J. Duchi, F. Ruan, and C. Yun. Minimax bounds on stochastic batched convex optimization. In Conference On Learning Theory, pages 3065–3162, 2018.
 Gao et al. [2019] Z. Gao, Y. Han, Z. Ren, and Z. Zhou. Batched multi-armed bandits problem. arXiv preprint arXiv:1904.01763, 2019.

 Guo and Brunskill [2015] Z. Guo and E. Brunskill. Concurrent PAC RL. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
 Hanna et al. [2017] J. P. Hanna, P. S. Thomas, P. Stone, and S. Niekum. Data-efficient policy evaluation through behavior policy search. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1394–1403. JMLR.org, 2017.
 Jaksch et al. [2010] T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
 Jin et al. [2018] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4868–4878, 2018.
 Kearns and Singh [2002] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49(2-3):209–232, 2002.
 Krishnan et al. [2018] S. Krishnan, Z. Yang, K. Goldberg, J. Hellerstein, and I. Stoica. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196, 2018.
 Lei et al. [2012] H. Lei, I. Nahum-Shani, K. Lynch, D. Oslin, and S. A. Murphy. A "SMART" design for building individualized treatment sequences. Annual Review of Clinical Psychology, 8:21–48, 2012.
 Mirhoseini et al. [2017] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean. Device placement optimization with reinforcement learning. In International Conference on Machine Learning (ICML-17), pages 2430–2439. JMLR.org, 2017.
 Nguyen et al. [2019] P. Nguyen, T. Tran, S. Gupta, S. Rana, M. Barnett, and S. Venkatesh. Incomplete conditional density estimation for fast materials discovery. In Proceedings of the 2019 SIAM International Conference on Data Mining, pages 549–557. SIAM, 2019.
 Osband et al. [2013] I. Osband, D. Russo, and B. Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
 Perchet et al. [2016] V. Perchet, P. Rigollet, S. Chassang, E. Snowberg, et al. Batched bandit problems. The Annals of Statistics, 44(2):660–681, 2016.
 Raccuglia et al. [2016] P. Raccuglia, K. C. Elbert, P. D. Adler, C. Falk, M. B. Wenny, A. Mollo, M. Zeller, S. A. Friedler, J. Schrier, and A. J. Norquist. Machinelearningassisted materials discovery using failed experiments. Nature, 533(7601):73, 2016.
 Theocharous et al. [2015] G. Theocharous, P. S. Thomas, and M. Ghavamzadeh. Personalized ad recommendation systems for lifetime value optimization with guarantees. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
Appendix A Proof of Theorem 4.2
This section is structured as follows. We collect notation in Section A.1, list some basic properties of the running estimate in Section A.2, establish useful perturbation bounds in Section A.3, and present the proof of the main theorem in Section A.4.
a.1 Notation
Let the superscript $k$ denote the values of the running and delayed estimates in Algorithm 2 before the $k$-th episode has started; at initialization the two coincide.
Define the sequences
$$\alpha_t := \frac{H+1}{H+t}, \qquad \alpha_t^0 := \prod_{j=1}^{t} (1 - \alpha_j), \qquad \alpha_t^i := \alpha_i \prod_{j=i+1}^{t} (1 - \alpha_j).$$
For $t \ge 1$, we have $\alpha_t^0 = 0$ and $\sum_{i=1}^{t} \alpha_t^i = 1$. For $t = 0$, we have $\alpha_0^0 = 1$.
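The weights $\alpha_t^i = \alpha_i \prod_{j=i+1}^{t}(1 - \alpha_j)$ induced by the step sizes $\alpha_j = (H+1)/(H+j)$ of [15] can be computed directly from the definition; the sketch below (our function names) verifies numerically that they form a probability distribution for $t \ge 1$:

```python
def step_sizes(t, H):
    # alpha_j = (H + 1) / (H + j), the step-size choice of [15].
    return [(H + 1) / (H + j) for j in range(1, t + 1)]

def weights(t, H):
    # alpha_t^i = alpha_i * prod_{j=i+1}^{t} (1 - alpha_j): the weight
    # the Q-learning recursion puts on the i-th observed target.
    a = step_sizes(t, H)
    w = []
    for i in range(1, t + 1):
        wi = a[i - 1]
        for j in range(i + 1, t + 1):
            wi *= 1 - a[j - 1]
        w.append(wi)
    return w
```

Since $\alpha_1 = 1$, the weight on the initialization vanishes for $t \ge 1$ and the remaining weights sum to one.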
With the definition of $\alpha_t^i$ in hand, we have an explicit formula for the running estimate as a weighted combination, with weights $\alpha_t^i$, of the targets formed in the updating episodes,
where $t$ is the number of updates on the state-action pair prior to the
$k$-th episode, and $k_1 < \dots < k_t$
are the indices of those episodes. Note that the $i$-th update happens only if the algorithm indeed observes the pair and takes the action on the $h$-th step of episode $k_i$.

Throughout the proof we let $\iota := \log(SAT/\delta)$ denote a log factor, where we recall $\delta$ is the pre-specified tail probability.
a.2 Basics
[Properties of $\alpha_t^i$; Lemma 4.1, [15]] The following properties hold for the sequence $\{\alpha_t^i\}$:

$\frac{1}{\sqrt{t}} \le \sum_{i=1}^{t} \frac{\alpha_t^i}{\sqrt{i}} \le \frac{2}{\sqrt{t}}$ for every $t \ge 1$.

$\max_{i \in [t]} \alpha_t^i \le \frac{2H}{t}$ and $\sum_{i=1}^{t} (\alpha_t^i)^2 \le \frac{2H}{t}$ for every $t \ge 1$.

$\sum_{t=i}^{\infty} \alpha_t^i = 1 + \frac{1}{H}$ for every $i \ge 1$.

Further, with probability at least $1 - \delta$, choosing the bonus $b_i = c\sqrt{H^3 \iota / i}$ for some absolute constant $c$, we have for all $(x, a, h, k)$ that the running estimates are optimistic up to a bounded error,
where $\iota$ is the log factor defined in Section A.1. Remark. The first part of the Lemma, i.e. the expression of the estimate in terms of rewards and value functions, is an aggregated form of the Q-learning updates, and is independent of the actual exploration policy as well as the bonus.
a.3 Perturbation bound under delayed Q updates
For any $(h, k)$, let
(4)
denote the errors of the estimated values relative to the optimal and the deployed-policy values. As the running estimate is optimistic, the regret can be bounded by the sum of these errors.
The goal of the propagation of error argument is to relate the stage-$h$ error to the stage-$(h+1)$ error.
We begin by showing that the stage-$h$ error is controlled by the max of the unperturbed error and a perturbation term. [Max error under delayed policy update] We have
(5)
where the perturbation term depends on $k$. In particular, if the delayed and running estimates agree, then the perturbation vanishes and the upper bound reduces to the unperturbed error.
Proof.
We first show (5). By the definition of the error terms,
it suffices to show that the delayed value is bounded by the maximum of the two terms on the right-hand side.
Indeed, the first bound holds directly.
On the other hand, the deployed action maximizes the delayed estimate. Due to the scheduling of the delayed update, the delayed estimate was set to the running estimate at an earlier episode and has not been updated since then, so the two coincide at that point.
Now, defining
the two vectors in question,
they only differ in the component corresponding to the taken action (which is the only action taken and therefore also the only component that is updated). If the running estimate is also maximized at this action, then the two maxima coincide; otherwise it is maximized at some other action, and we have the complementary bound. Putting these together, we get
the claimed inequality, which implies (5).
∎
Lemma A.3 suggests bounding the error by bounding the "main term" and the "perturbation term" separately. We now establish the bound on the perturbation term. [Perturbation bound] For any $(h, k)$ such that the perturbation term is nonzero, we have
(6) 
where