We study the problem of choosing among a set of learning algorithms in sequential decision-making problems with partial feedback. Learning algorithms are designed to perform well when certain favorable conditions are satisfied. However, the learning agent might not know in advance which algorithm is most appropriate for the problem it is currently facing.
As an example, consider the application of stochastic bandit algorithms in personalization problems, where in each round a user visits the website and the learning algorithm should present the item that is most likely to receive a click or be purchased. When contextual information (such as location, browser type, etc.) is available, we might decide to learn a click model given the user context. If the context is not predictive of the user behavior, using a simpler non-contextual bandit algorithm might lead to better performance. As another example, consider the problem of tuning the exploration rate of bandit algorithms. Typically, the exploration rate in an $\varepsilon$-greedy algorithm has the form $c/t$, where $t$ is time and the optimal value of the constant $c$ depends on unknown quantities related to the reward vector. The decision rule of the UCB algorithm also involves an exploration bonus (Auer et al., 2002). Choosing values smaller than the theoretically suggested value can lead to better performance in practice if the theoretical value is too conservative. However, if the exploration bonus is too small, the regret can be linear. It is desirable to have a model selection strategy that finds a near-optimal parameter value in an online fashion.
A model selection strategy can also be useful in finding effective reinforcement learning methods. A great number of reinforcement learning algorithms have been proposed and studied in the literature (Sutton and Barto, 2018; Szepesvári, 2010). In some specialized domains, we might have a reasonable idea of the type of solution that can perform well. In general, however, designing a reinforcement learning solution can be a daunting task, as the solution often involves many components. In fact, in some problems it is not even clear whether we should use a reinforcement learning solution or a simpler contextual bandit solution. For example, bandit algorithms are used in many personalization and recommendation problems, although the decisions of the learning system can potentially change the future traffic, so that inherently we face a Markov decision process. In such problems, the available data might not be enough to solve the problem using an RL algorithm, and a simpler bandit solution might be preferable. The complexity of the RL problem is often not known in advance, and we would like to adapt to the complexity of the problem in an online fashion.
While model selection is a well-studied topic in supervised learning, results in the bandit and RL settings are scarce. Maillard and Munos (2011) propose a method for the model selection problem based on EXP4 with additional uniform exploration. Agarwal et al. (2017) obtain improved results by an online mirror descent method with a carefully selected mirror map. The algorithm is called CORRAL, and under a stability condition, it is shown to enjoy strong regret guarantees. Many bandit algorithms that are designed for stochastic environments (such as UCB, Thompson sampling, etc.) do not satisfy the stability condition and thus cannot be directly used as base algorithms for CORRAL. Although it might be possible to make these algorithms stable by proper modifications, the process can be tedious. To overcome this issue, Pacchiano et al. (2020) propose a generic smoothing procedure that transforms nearly any stochastic algorithm into one that is stable. The results of Agarwal et al. (2017) and Pacchiano et al. (2020) require the knowledge of the optimal base regret. Foster et al. (2019) study bandit model selection among linear bandit algorithms when the dimensionality of the underlying linear reward model, and thus the optimal base regret, is not known. A related problem is studied by Chatterji et al. (2020).
In this paper, we propose a model selection method for bandit and RL problems in stochastic environments. We call our method “regret balancing” because it maintains regret estimates of the base algorithms and tries to keep the empirical regret of all algorithms roughly the same. The method achieves regret balancing by playing the base algorithm with the smallest empirical regret. An algorithm can have small empirical regret for two reasons: either it chooses good actions, or it has not been played enough. By playing the algorithm with the smallest empirical regret, the model selection procedure finds an effective trade-off between exploration and exploitation.
The proposed approach has several notable properties. First, no stability condition is needed, and any base algorithm can be used without modification. Note that when applied to stochastic bandit algorithms, Agarwal et al. (2017) and Pacchiano et al. (2020) modify the base algorithms to ensure certain stability conditions. Second, our approach is intuitive and almost as simple as a UCB rule. By contrast, many existing model selection approaches have a complicated form. Finally, the approach can be readily applied to reinforcement learning problems.
The proposed approach, similar to a number of existing solutions, requires the knowledge of the regret of the optimal base algorithm. We show that, in general, any model selection strategy that achieves a near-optimal regret requires either the optimal base regret or direct sampling from the arms. We show that by adding a forced exploration scheme, and hence direct access to the arms, the regret balancing strategy can achieve near-optimal regret in a class of problems without the knowledge of the optimal base regret. Further, we show a class of problems where any near-optimal model selection procedure is indeed implementing a regret balancing method, possibly implicitly.
As we will show, the regret of our model selection strategy is $O(\sqrt{T})$, where $T$ is the time horizon. This regret is minimax optimal, given the existing lower bound for the model selection problem that scales as $\Omega(\sqrt{T})$ (Pacchiano et al., 2020): even if it is known that a base algorithm has logarithmic regret, the fast logarithmic regret cannot be preserved in general.
We show a number of applications of the proposed approach for model selection. We show how a near-optimal regret can be achieved in the class of $\varepsilon$-greedy algorithms without any prior knowledge of the reward function. We also show how the proposed approach can be used for representation learning in bandit problems. Further, we show a model selection strategy to choose among reinforcement learning algorithms. As a consequence for reinforcement learning, if a set of feature maps is given and the value functions are known to be linear in a feature map belonging to this set, we can use the regret balancing strategy to achieve a regret that is near-optimal up to a constant factor. Finally, the proposed regret balancing strategy can also be used as a bandit algorithm. We show how the approach is implemented as an algorithm for linear stochastic bandits.
1.1 Problem Definition
For an integer $K$, we use $[K]$ to denote the set $\{1,\dots,K\}$. A contextual bandit problem is a sequential game between a learner and an environment. We consider a set of learners. The game is specified by a context space $\mathcal{X}$, an action set $[K]$ of size $K$, a reward function $f:\mathcal{X}\times[K]\to\mathbb{R}$, and a time horizon $n$. In round $t\in[n]$, the learner observes the context $x_t\in\mathcal{X}$ and chooses an action $a_t$ from the action set. Then the learner observes a reward $r_t=f(x_t,a_t)+\eta_t$, where, for a positive constant $\sigma$, $\eta_t$ is a $\sigma$-sub-Gaussian random variable, meaning that for any $\lambda\in\mathbb{R}$, $\mathbb{E}[e^{\lambda\eta_t}]\le e^{\lambda^2\sigma^2/2}$. In the special case of linear contextual bandits (Lattimore and Szepesvári, 2020), we are given a feature map $\phi:\mathcal{X}\times[K]\to\mathbb{R}^d$ such that $f(x,a)=\langle\phi(x,a),\theta_*\rangle$ for an unknown vector $\theta_*\in\mathbb{R}^d$. Let $b_t=\mathbb{E}[\max_{a\in[K]}f(x_t,a)]$ be the expected reward of the optimal action at time $t$, where the expectation is taken with respect to the randomization in $x_t$ and $\eta_t$. The goal is to have small regret, defined as $\mathrm{Regret}(n)=\sum_{t=1}^n b_t-\sum_{t=1}^n r_t$. If $(x_t)_{t\in[n]}$ is an IID sequence, then $b_t$ is the same constant for all rounds and we use $b_*$ to denote this value. The game is challenging as the reward function is not known in advance. If $\mathcal{X}$ contains only one element, then the problem reduces to the multi-armed bandit problem. If an action influences the distribution of the next context, then the problem is a Markov decision process (MDP), and it is more suitable to define regret with respect to the policy that has the highest total (or stationary) reward (see Section 2.2 for more details).
A bandit model selection problem is specified by a class of bandit problems and a set of bandit algorithms. Let $M$ be the number of bandit algorithms (called base algorithms in what follows). As defined above, $\mathrm{Regret}_i(n)$ is the regret of the $i$th base in the underlying bandit problem if the base algorithm is executed alone. In a bandit model selection problem, the decision making is a two-step process. In round $t$, the learner chooses base $i_t$ from the set of bandit algorithms, the base observes the context and selects an action from the set of actions, and the reward $r_t$ of the action is revealed to the learner. Then the internal state of base $i_t$ is updated using reward $r_t$. The regret of the overall model selection strategy is defined with respect to $(b_t)$:
$$\mathrm{Regret}(n)=\sum_{t=1}^n b_t-\sum_{t=1}^n r_t\,.$$
Let $i_*$ be the optimal base, i.e., the base with the smallest regret if it is played in all rounds, $i_*=\arg\min_{i\in[M]}\mathrm{Regret}_i(n)$. We would like to ensure that $\mathrm{Regret}(n)$ is not much larger than $\mathrm{Regret}_{i_*}(n)$. A reinforcement learning model selection problem is defined similarly (see Section 2.2 for more details).
2 Regret Balancing
At a high level, the main idea is to estimate the empirical regret of the base algorithms during the rounds that the algorithms are played, and to ensure that all base algorithms suffer roughly the same empirical regret. This simple idea ensures a good trade-off between exploration and exploitation: if a base algorithm is played only for a small number of rounds, or if it plays good actions, then its empirical regret will be small and it will be chosen by the model selection procedure.
2.1 Bandit Model Selection
In this section, we present the regret balancing model selection method. Consider a bandit model selection problem in a stochastic environment. Let $T_i(t)$ be the number of rounds that base $i$ is played up to but not including round $t$, and let $U_i(t)$ be the total reward of this base during these rounds. With an abuse of notation, we also use $T_i(t)$ to denote the set of rounds that base $i$ is selected. Let $H_{i,t}$ be all data in the rounds that base $i$ is played up to round $t$, and let $\mathcal{H}$ be the space of all such histories for all $i$ and $t$. We use $T_*(t)$, $U_*(t)$, and $H_{*,t}$ to denote the quantities related to the optimal base $i_*$, which was defined earlier in the problem definition. The regret of base $i$ during the rounds $T_i(t)$ is
$$\mathrm{Reg}_i(t)=\sum_{s\in T_i(t)}b_s-U_i(t)\,.$$
We assume that a high-probability (possibly data-dependent) upper bound on the regret of the optimal base algorithm is known: a function $R$ is given so that for any $\delta\in(0,1)$, with probability at least $1-\delta$, $\mathrm{Reg}_{i_*}(t)\le R(\delta,T_{i_*}(t))$ for any $t$. (We can use different probabilistic guarantees here, and any form used here will also appear in Theorem 2.1.) For example, for the UCB algorithm we have $R(\delta,t)=\widetilde O(\sqrt{Kt})$ (we use the $\widetilde O$ notation to hide polylogarithmic terms), and for the OFUL algorithm we have a data-dependent bound expressed in terms of an empirical covariance matrix (Abbasi-Yadkori et al., 2011). Given that $\mathrm{Reg}_{i_*}$ is defined with respect to the realized rewards $r_t$, the regret upper bound should be at least of order $\sqrt{T_{i_*}(t)}$.
Next, we describe the model selection strategy. In round $t$, let $j_t$ be the optimistic base and $B_t$ be the optimistic value,
$$j_t=\arg\max_{i\in[M]}\frac{U_i(t)+R(\delta,T_i(t))}{T_i(t)}\,,\qquad B_t=\max_{i\in[M]}\frac{U_i(t)+R(\delta,T_i(t))}{T_i(t)}\,.\qquad(1)$$
The variable $B_t$ estimates the value of the best action. Define the empirical regret of base $i$ by
$$\widehat{\mathrm{Reg}}_i(t)=T_i(t)\,B_t-U_i(t)\,.$$
Recall the true regret defined by $\mathrm{Reg}_i(t)=\sum_{s\in T_i(t)}b_s-U_i(t)$. Notice that we have $\widehat{\mathrm{Reg}}_{j_t}(t)=R(\delta,T_{j_t}(t))$, i.e., $B_t$ is chosen so that the empirical regret of the optimistic base scales as the target regret of the optimal base. Throughout the game, we play bases to ensure that the empirical regrets of all bases are roughly the same. To be more precise, in round $t$, we choose the base with the smallest empirical regret:
$$i_t=\arg\min_{i\in[M]}\widehat{\mathrm{Reg}}_i(t)\,.\qquad(2)$$
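The balancing rule above can be sketched in code. This is a minimal illustration, not the paper's implementation: the base-algorithm interface and the target-regret function `target` are assumptions, and an unplayed base is handled by treating its play count as one round.

```python
import math


class RegretBalancer:
    """Model selection by regret balancing (sketch).

    bases:  opaque base-algorithm objects (interface is an assumption).
    target: callable t -> assumed high-probability regret bound for the
            optimal base after t of its own rounds, e.g. c * sqrt(t).
    """

    def __init__(self, bases, target):
        self.bases = bases
        self.target = target
        self.plays = [0] * len(bases)      # T_i(t): rounds base i was played
        self.rewards = [0.0] * len(bases)  # U_i(t): total reward of base i

    def choose(self):
        # Optimistic value B_t: largest average reward plus normalized target.
        B = max(
            (u + self.target(max(n, 1))) / max(n, 1)
            for u, n in zip(self.rewards, self.plays)
        )
        # Empirical regret of base i is T_i * B_t - U_i; play the smallest.
        regrets = [n * B - u for u, n in zip(self.rewards, self.plays)]
        return min(range(len(self.bases)), key=lambda i: regrets[i])

    def update(self, i, reward):
        self.plays[i] += 1
        self.rewards[i] += reward
```

Unplayed bases have zero empirical regret, so they are tried first; a base keeps getting played until its empirical regret catches up with the others.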
This choice will most likely increase the empirical regret of base $i_t$. The next theorem shows the model selection guarantee of the regret balancing strategy.
Theorem 2.1. If $b_t=b_*$ for a constant $b_*$ regardless of time $t$, and if with probability at least $1-\delta$, $\mathrm{Reg}_{i_*}(t)\le R(\delta,T_{i_*}(t))$ for any $t$, then with probability at least $1-\delta$, $\mathrm{Regret}(n)=O\big(M\,R(\delta,n)\big)$.
First, we show that $B_t$ is an optimistic estimate of the average optimal reward. By (1) and the regret guarantee of the optimal base,
$$B_t\ \ge\ \frac{U_{i_*}(t)+R(\delta,T_{i_*}(t))}{T_{i_*}(t)}\ \ge\ \frac{U_{i_*}(t)+\mathrm{Reg}_{i_*}(t)}{T_{i_*}(t)}\ =\ b_*\,.$$
Let $i_t$ be the base chosen at time $t$ and $j_t$ be the optimistic base. The cumulative regret of base $i_t$ at time $t$ can be bounded as
$$T_{i_t}(t)\,b_*-U_{i_t}(t)\ \le\ T_{i_t}(t)\,B_t-U_{i_t}(t)\ =\ \widehat{\mathrm{Reg}}_{i_t}(t)\ \le\ \widehat{\mathrm{Reg}}_{j_t}(t)\ =\ R(\delta,T_{j_t}(t))\,,\qquad(3)$$
where the first inequality holds by optimism of $B_t$, the equality by the definition of $\widehat{\mathrm{Reg}}_{i_t}$, and the second inequality by the definition of $i_t$ and $j_t$.
Let $\tau_i$ be the last time step that base $i$ is played. Given that the instantaneous regret is upper bounded by a constant, by (3) the regret can be bounded as
$$\mathrm{Regret}(n)\ \le\ \sum_{i=1}^M\big(T_i(\tau_i)\,b_*-U_i(\tau_i)\big)+O(M)\ \le\ M\,R(\delta,n)+O(M)\,.$$
The condition that $b_t=b_*$ for a constant $b_*$ regardless of time $t$ is needed so that the empirical regret is measured against the same benchmark for every base. The condition holds in the following model selection problems: choosing a feature mapping in a stochastic bandit problem, and choosing the optimal exploration rate among a number of $\varepsilon$-greedy algorithms. The condition is also satisfied for choosing between multi-armed bandits and stochastic linear contextual bandits, where $b_*$ is a time-independent constant value for IID contexts $(x_t)$.
As we mentioned earlier, the regret upper bound $R$ should be of order $\sqrt{t}$. Thus, our approach can achieve the regret of the optimal base as long as the optimal regret is at least $\Omega(\sqrt{n})$. This observation is consistent with the lower bound argument of Pacchiano et al. (2020), who show that, in general, $O(\sqrt{n})$ is the best rate that can be achieved by any model selection strategy. Unfortunately, this lower bound implies that in a model selection setting, we can no longer hope to achieve the logarithmic regret bounds that can usually be obtained in stochastic bandit problems. Notice that such logarithmic bounds are shown for the pseudo-regret and not for the regret as defined above. The pseudo-regret is the difference of the expected rewards of the optimal arm and the arm played; it is not directly observed by the learner, and it can be estimated only up to an error of order $\sqrt{n}$.
In this section, we show some applications of the regret balancing strategy.
Regret Balancing for Bandits
The regret balancing strategy can be used as a bandit algorithm. To use it as a multi-armed bandit algorithm, we treat each arm as a base algorithm and choose $R(\delta,t)=O\big(\sigma\sqrt{t\log(1/\delta)}\big)$ as the regret of the optimal arm. To see this, notice that by the sub-Gaussianity of the noise, with probability at least $1-\delta$, the total reward of the optimal arm after $t$ plays deviates from $t\,b_*$ by at most $O\big(\sigma\sqrt{t\log(1/\delta)}\big)$. In Figure 1-Left, we compare regret balancing with the UCB algorithm (Auer et al., 2002) on a 4-armed Bernoulli bandit. In regret balancing, we treat each arm as a base algorithm and use a target regret proportional to $\sqrt{t}$ as the target regret.
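A simulation in the spirit of this experiment can be sketched as follows. The arm means, the constant `c` in the $c\sqrt{t}$ target, and the use of pseudo-regret as the reported quantity are illustrative choices, not the paper's exact setup.

```python
import math
import random


def regret_balancing_mab(means, horizon, c=1.0, seed=0):
    """Run regret balancing on a Bernoulli bandit, treating each arm as a
    trivial base algorithm with target regret c * sqrt(t) (sketch).

    Returns the pseudo-regret; `means` and `c` are illustrative values."""
    rng = random.Random(seed)
    K = len(means)
    plays = [0] * K        # T_i: plays of arm i
    totals = [0.0] * K     # U_i: total reward of arm i
    best = max(means)
    pseudo_regret = 0.0
    for _ in range(horizon):
        # Optimistic value: best average reward plus normalized target.
        B = max((totals[i] + c * math.sqrt(max(plays[i], 1)))
                / max(plays[i], 1) for i in range(K))
        # Play the arm with the smallest empirical regret T_i * B - U_i.
        arm = min(range(K), key=lambda i: plays[i] * B - totals[i])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        plays[arm] += 1
        totals[arm] += reward
        pseudo_regret += best - means[arm]
    return pseudo_regret
```

A sub-optimal arm keeps being played only until its accumulated empirical regret reaches the $c\sqrt{t}$ target, so the pseudo-regret grows sublinearly.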
Figure 1: Regret balancing vs. UCB and OFUL. Mean and standard deviation over 2000 and 20 runs, respectively.
Next, we show the implementation of the strategy as an algorithm for linear stochastic bandits. Consider the following problem. In round $t$, the learner chooses action $X_t$ from a (possibly time-varying) decision set $D_t$ that is a subset of the unit sphere and observes a reward $Y_t=\langle X_t,\theta_*\rangle+\eta_t$, where $\theta_*\in\mathbb{R}^d$ is an unknown parameter vector and $\eta_t$ is a $\sigma$-sub-Gaussian noise term. (This formulation includes the special case of linear contextual bandits with $D_t=\{\phi(x_t,a):a\in[K]\}$.) Let $X_t^*$ be the optimal action at time $t$, defined as $X_t^*=\arg\max_{x\in D_t}\langle x,\theta_*\rangle$. The objective is to have small regret, defined as $\mathrm{Regret}(n)=\sum_{t=1}^n\langle X_t^*-X_t,\theta_*\rangle$.
We state some notation before defining the bandit method. For a regularization parameter $\lambda>0$, let $V_t=\lambda I+\sum_{s<t}X_sX_s^\top$ be the empirical covariance matrix, and let $\|x\|_V=\sqrt{x^\top Vx}$ be the weighted 2-norm of vector $x$. Let $\widehat\theta_t=V_t^{-1}\sum_{s<t}Y_sX_s$ be the regularized least-squares estimate. Let $\beta_t(\delta)$ be as defined in Appendix A. Let $\widetilde X_t=\arg\max_{x\in D_t}\big(\langle x,\widehat\theta_t\rangle+\beta_t(\delta)\|x\|_{V_t^{-1}}\big)$ be the “optimistic” choice in round $t$. A UCB approach would take action $\widetilde X_t$ next. Regret balancing, however, uses the optimistic choice to estimate the empirical regrets of different choices. Let $B_t=\langle\widetilde X_t,\widehat\theta_t\rangle+\beta_t(\delta)\|\widetilde X_t\|_{V_t^{-1}}$, which will be shown to be an upper bound on the value of the best action. In round $t$, we choose the action with the smallest empirical regret,
$$X_t=\arg\min_{x\in D_t}\frac{B_t-\langle x,\widehat\theta_t\rangle}{\|x\|_{V_t^{-1}}^2}\,.$$
Intuitively, $B_t-\langle x,\widehat\theta_t\rangle$ is an estimate of the instantaneous regret of action $x$, and $\|x\|_{V_t^{-1}}^{-2}$ is roughly the number of times that $x$ is played. (In multi-armed bandits, where actions are fixed axis-aligned unit vectors, $\|x\|_{V_t^{-1}}^{-2}$ counts the number of times an action is played.) The next theorem bounds the regret of the regret balancing strategy. The proof is in Appendix B.
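The action rule above can be sketched for a finite arm set. This reflects our reading of the selection rule (estimated instantaneous regret times the proxy play count $\|x\|_{V^{-1}}^{-2}$); the confidence width `beta` is assumed to be supplied by a bound such as Theorem A.1.

```python
import numpy as np


def rb_linear_action(arms, theta_hat, V_inv, beta):
    """One step of regret balancing for linear bandits (sketch).

    arms:      (K, d) array of candidate feature vectors.
    theta_hat: regularized least-squares estimate of the parameter.
    V_inv:     inverse of the regularized covariance matrix V_t.
    beta:      confidence width (assumed given).
    Returns the index minimizing the empirical-regret proxy
    (B_t - <x, theta_hat>) / ||x||^2_{V^{-1}}.
    """
    est = arms @ theta_hat                                 # <x, theta_hat>
    widths = np.einsum('ij,jk,ik->i', arms, V_inv, arms)   # ||x||^2_{V^{-1}}
    B = np.max(est + beta * np.sqrt(widths))               # optimistic value B_t
    # 1 / ||x||^2_{V^{-1}} acts as a proxy for the play count of direction x.
    return int(np.argmin((B - est) / widths))
```

With `beta = 0` and a well-estimated parameter, the rule reduces to playing the empirically best arm; a larger `beta` shifts plays toward under-explored directions.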
Theorem 2.2. For any $\delta\in(0,1)$, with probability at least $1-\delta$, $\mathrm{Regret}(n)=\widetilde O(\,\cdot\,)$. Here $\widetilde O$ hides polylogarithmic terms in $n$, $d$, $\lambda$, and $1/\delta$.
The regret bound in the theorem is slightly worse than the minimax optimal rate of $O(d\sqrt{n})$; however, as we show next, the regret balancing strategy can be a competitive linear bandit algorithm in practice. In Figure 1-Right, we compare regret balancing, as described above, with the OFUL algorithm (Abbasi-Yadkori et al., 2011) on a contextual linear bandit problem with two arms, whose parameter vectors are drawn uniformly at random at the beginning of the experiment. In each round, the reward of an arm is the inner product of its parameter vector with a context drawn uniformly at random, plus noise.
Optimizing the Exploration Rate
Next, we consider the performance of regret balancing as a bandit model selection strategy. First, consider optimizing the exploration rate in an $\varepsilon$-greedy algorithm. The $\varepsilon$-greedy method is a simple and popular bandit method. In round $t$, the algorithm plays an action chosen uniformly at random with a small probability $\varepsilon_t$, and plays the empirically best, or greedy, choice otherwise. For a well-chosen $\varepsilon_t$, this simple strategy can be very competitive. The optimal value of $\varepsilon_t$, however, depends on the unknown reward function: it is known that the optimal value of $\varepsilon_t$ is proportional to $1/(\Delta^2 t)$, where $\Delta$ is the smallest gap between the optimal reward and the sub-optimal rewards (Lattimore and Szepesvári, 2020). With this choice of exploration rate, the regret scales as $\widetilde O(1/\Delta)$ when the gap is large and as $\widetilde O(\sqrt{n})$ when the gap is small.
We apply the regret balancing strategy to find a near-optimal exploration rate. The result directly follows from Theorem 2.1. A similar result, but for a different algorithm, is shown by Pacchiano et al. (2020).
Let $n$ be the time horizon and let $M=O(\log n)$. For $i\in[M]$, let base $i$ be the $\varepsilon$-greedy algorithm with exploration rate $\varepsilon_{i,t}$ in round $t$, with the rates forming a grid. By an appropriate choice of the target regret for the large-gap regime (or for the small-gap regime), the regret balancing model selection with this set of base algorithms achieves near-optimal regret in the corresponding regime.
Next, we evaluate the performance of regret balancing in finding a near-optimal exploration rate. Consider a bandit problem with two Bernoulli arms. Consider 18 $\varepsilon$-greedy base algorithms with exploration rates $\varepsilon_{i,t}=c_i/t$, where the values of $c_i$ are on a geometric grid. We apply regret balancing with a target regret bound proportional to $\sqrt{t}$ and the set of $\varepsilon$-greedy base algorithms. The experiment is repeated 20 times. Figure 2-Left shows the performance of the regret balancing strategy.
The sublinear regret bounds of linear bandit algorithms are valid only as long as the reward function is truly a linear function of the input feature representation. Assume it is known that the reward function is linear in one of the feature maps $\phi_1,\dots,\phi_M$, but the identity of the true feature map is unknown. By applying Theorem 2.1 to $M$ OFUL algorithms, each using one of the feature maps, we obtain a regret that scales with the regret of the OFUL algorithm using the true feature map, up to a factor of $M$.
As an application, we consider the problem of choosing between UCB and OFUL. Contexts are drawn from the standard normal distribution, but the first element of the context vector is always 1. First, consider a problem where each arm has a reward vector drawn uniformly at random at the beginning. We use regret balancing with a target function proportional to $\sqrt{t}$ to perform model selection between UCB and OFUL. Results are shown in Figure 2-Middle. In this experiment, OFUL performs better than UCB, and the performance of regret balancing is in between. Next, we consider a problem where the mean reward of each arm is generated uniformly at random at the beginning. In each round, we observe a context, but the expected reward of each arm does not depend on it. We use a target regret proportional to $\sqrt{t}$. Figure 2-Right shows that in this setting UCB performs better than OFUL, and the performance of regret balancing is again in between.
Choosing Among Reinforcement Learning Algorithms
We consider the model selection problem in finite-horizon reinforcement learning problems. The ideas can be easily extended to the average-reward setting as well, but we choose a finite-horizon setting to simplify the presentation.
A finite-horizon reinforcement learning problem is specified by a horizon $H$, a state space $\mathcal{S}$ that is partitioned into $H$ disjoint sets, an action space $\mathcal{A}$, a transition dynamics $P$ that maps a state-action pair to a distribution over the states in the next stage, and a reward function $r$ that assigns a scalar value to each state-action pair. The objective is to find a policy $\pi$, that is, a mapping from states to distributions over actions, that maximizes the total reward.
The model selection problem is defined next. In episode $k$, the learner chooses base $i_k$ from a set of $M$ RL algorithms, the base is executed for $H$ rounds, and the rewards of the actions are revealed to the learner. Let $v_*$ be the total reward of the optimal policy in the underlying reinforcement learning problem. Quantities $T_i(k)$, $U_i(k)$, $R$, $j_k$, $B_k$, etc. are defined similarly to the bandit case. For example, $T_i(k)$ is the number of episodes that base $i$ is played up to episode $k$. The regret balancing strategy is defined next. In episode $k$, let $j_k$ be the optimistic base, and let $B_k$ be such that $B_k=\max_{i\in[M]}\big(U_i(k)+R(\delta,T_i(k))\big)/T_i(k)$. Define the empirical regret of base $i$ by $\widehat{\mathrm{Reg}}_i(k)=T_i(k)\,B_k-U_i(k)$. In episode $k$, we choose the base with the smallest empirical regret: $i_k=\arg\min_{i\in[M]}\widehat{\mathrm{Reg}}_i(k)$. The next theorem shows the model selection guarantee for the regret balancing strategy. The analysis is almost identical to the analysis of the bandit model selection in the previous section.
Theorem 2.3. If the optimal total reward per episode is a constant $v_*$ regardless of the episode, and if for any $k$, with probability at least $1-\delta$, $\mathrm{Reg}_{i_*}(k)\le R(\delta,T_{i_*}(k))$, then with probability at least $1-\delta$, the regret after $K$ episodes is $O\big(M\,R(\delta,K)\big)$.
In Figure 3, we perform model selection with base algorithms UCRL2 (Jaksch et al., 2010), a Q-learning method with $\varepsilon$-greedy exploration, and PSRL (Osband et al., 2013) in the RiverSwim domain (Strehl and Littman, 2008). Regret balancing adapts to the best performing strategy (PSRL in this case).
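The episode-level loop used in such an experiment can be sketched as follows. The base-algorithm interface (`run_episode`) and the target function are assumptions; the key point, as described above, is that only the chosen base runs and updates its internal state in each episode.

```python
import math


class EpisodicRegretBalancer:
    """Regret balancing over RL base algorithms at episode granularity (sketch).

    Each base exposes run_episode(env) -> total episode return and keeps its
    own internal state; target(k) is the assumed regret bound for the best
    base after k of its own episodes.
    """

    def __init__(self, bases, target):
        self.bases = bases
        self.target = target
        self.episodes = [0] * len(bases)   # T_i: episodes run by base i
        self.returns = [0.0] * len(bases)  # U_i: total return of base i

    def step(self, env):
        # Optimistic per-episode value across bases.
        B = max((u + self.target(max(k, 1))) / max(k, 1)
                for u, k in zip(self.returns, self.episodes))
        # Run the base with the smallest empirical regret T_i * B - U_i.
        i = min(range(len(self.bases)),
                key=lambda j: self.episodes[j] * B - self.returns[j])
        ret = self.bases[i].run_episode(env)
        self.episodes[i] += 1
        self.returns[i] += ret
        return i, ret
```

A base whose episode returns lag behind accumulates empirical regret and is gradually frozen out, which is how the selection concentrates on the best-performing base.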
As another application, consider the problem of choosing the state representation in reinforcement learning. Many existing theoretical results hold under the assumption that a correct state representation (or feature map) is given. As examples, Abbasi-Yadkori et al. (2019) show sublinear regret bounds under the assumption that the value function of any policy is linear in a given feature vector, while Jin et al. (2019) show sublinear regret bounds for linear MDPs, i.e., when the transition dynamics and the reward function are known to be linear in a given feature vector. Given $M$ candidate feature maps, one of which is fully aligned with the true dynamics of the MDP, we can apply the regret balancing strategy and, by Theorem 2.3, the performance will be optimal up to a factor of $M$.
Corollary 2.2. Let the underlying problem be a linear MDP parametrized by an unknown feature map $\phi_*$, and let $\Phi=\{\phi_1,\dots,\phi_M\}$ be a family of feature maps with $\phi_*\in\Phi$. For regret balancing with the target regret of LSVI-UCB and with a class of LSVI-UCB base algorithms (Jin et al., 2019), each instantiated with a feature map in $\Phi$, the regret is larger than the regret of LSVI-UCB run with the true feature map by at most a factor of $M$.
3 Lower Bounds
3.1 Regret Balancing
In this section, we show that for any model selection algorithm there are problem instances where the algorithm must do regret balancing. For simplicity, we restrict ourselves to the case $M=2$ and to a simple class of problem instances, although it is possible to extend the argument to richer families and beyond two base algorithms.
Let $\mathcal{M}$ be a model selection algorithm with expected regret $\bar R(n)$ up to time $n$. We say an algorithm “model selects” w.r.t. a class of algorithms if, for any two base algorithms with expected regrets $R_1(n)$ and $R_2(n)$, there exists $n_0$ such that for all $n\ge n_0$, $\bar R(n)=O\big(\min\{R_1(n),R_2(n)\}\big)$. We say that algorithm $\mathcal{M}$ is regret balancing for the base algorithms if for all $\delta\in(0,1)$ there exists $n_0$ such that for all $n\ge n_0$, with probability at least $1-\delta$,
$$\widehat{\mathrm{Reg}}_1(n)=\Theta\big(\widehat{\mathrm{Reg}}_2(n)\big)\,,\qquad(4)$$
where $\widehat{\mathrm{Reg}}_1(n)$ and $\widehat{\mathrm{Reg}}_2(n)$ are the empirical regrets of algorithms 1 and 2, respectively. The main result of this section is to show there exist problem and algorithm classes such that any model selection strategy must be regret balancing.
Theorem 3.1. There exist two algorithm classes with $M=2$ such that any model selection strategy for these classes must satisfy the condition in (4) for all pairs of base algorithms whose regrets are distinct.
The complete proof is in Appendix C. The proof proceeds by contradiction. We consider a pair of simple deterministic algorithm classes. Suppose there exist two algorithms such that $\mathcal{M}$ does not regret balance them. In this case, for infinitely many $n$ and with probability at least $\delta$ for each such $n$, w.l.o.g. the first algorithm's regret must be larger than that of the second by a constant factor. We now construct another algorithm that acts just like the first until the moment $\mathcal{M}$ stops pulling it (in the probability-$\delta$ event) and then acts optimally. The new algorithm has better regret than the original. It can be shown that in this probability-$\delta$ event, $\mathcal{M}$ will be unable to detect whether it is playing the original algorithm or the new one, thus incurring a large regret. ∎
3.2 The Knowledge of the Optimal Base Regret
We show that prior knowledge of the optimal base regret is needed to achieve the optimal regret.
Theorem 3.2. There is a model selection problem such that if the learner does not know the regret of the best base, and does not have direct access to the arms, then its regret is larger than that of the optimal base.
The complete proof is in Appendix D. Let there be two base algorithms, and let $R_1(n)$ and $R_2(n)$ be their regrets incurred when called by the model selection strategy. If the two regrets are of different orders, we can construct the bases such that they both have zero regret after the learner stops selecting them. Therefore their regrets when running alone are $R_1(n)$ and $R_2(n)$, and the learner has regret of the same order as the larger of the two, which is higher than the regret of the better base running alone. If, however, the two regrets are of the same order, then, since the learner does not know the optimal arm reward, we can create another environment where the optimal arm reward is different, so that in the new environment the regrets are no longer equal. ∎
This work does not present any foreseeable negative societal consequences.
- Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvári, C., and Weisz, G. (2019). POLITEX: regret bounds for policy iteration using expert prediction. In ICML.
- Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320.
- Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. (2017). Corralling a band of bandit algorithms. In COLT, pp. 12–38.
- Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, pp. 235–256.
- Chatterji, N., Muthukumar, V., and Bartlett, P. (2020). OSOM: a simultaneously optimal algorithm for multi-armed and linear contextual bandits. In AISTATS.
- Foster, D., Krishnamurthy, A., and Luo, H. (2019). Model selection for contextual bandits. In Advances in Neural Information Processing Systems.
- Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11, pp. 1563–1600.
- Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2019). Provably efficient reinforcement learning with linear function approximation. arXiv preprint arXiv:1907.05388.
- Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
- Maillard, O.-A. and Munos, R. (2011). Adaptive bandits: towards the best history-dependent strategy. In AISTATS.
- Maillard, O.-A., Nguyen, P., Ortner, R., and Ryabko, D. (2013). Optimal regret bounds for selecting the state representation in reinforcement learning. In ICML.
- Maillard, O.-A., Munos, R., and Ryabko, D. (2011). Selecting the state-representation in reinforcement learning. In NIPS.
- Optimal regret bounds for selecting the state representation in reinforcement learning. In ALT.
- Ortner, R., Pirotta, M., Lazaric, A., Fruit, R., and Maillard, O.-A. (2019). Regret bounds for learning state representations in reinforcement learning. In NIPS.
- Osband, I., Russo, D., and Van Roy, B. (2013). (More) efficient reinforcement learning via posterior sampling. In NIPS.
- Pacchiano, A., Phan, M., Abbasi-Yadkori, Y., Rao, A., Zimmert, J., Lattimore, T., and Szepesvári, C. (2020). Model selection in contextual stochastic bandit problems. arXiv preprint arXiv:2003.01704.
- Strehl, A. L. and Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8), pp. 1309–1331.
- Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: an introduction. MIT Press.
- Szepesvári, C. (2010). Algorithms for reinforcement learning. Morgan and Claypool.
Appendix A Some useful results
We state a result on the error of the least-squares predictor.
Theorem A.1 (Theorem 2 of Abbasi-Yadkori et al. (2011)).
Assume $\|\theta_*\|_2\le S$ and $\|X_t\|_2\le L$ for all $t$. Let
$$\beta_t(\delta)=\sigma\sqrt{2\log\frac{1}{\delta}+d\log\Big(1+\frac{tL^2}{\lambda d}\Big)}+\sqrt{\lambda}\,S\,.$$
For any $\delta\in(0,1)$, with probability at least $1-\delta$, for all $t$ and any $x\in\mathbb{R}^d$,
$$\big|\langle x,\widehat\theta_t-\theta_*\rangle\big|\le\beta_t(\delta)\,\|x\|_{V_t^{-1}}\,.$$
Lemma A.2. Let $(X_s)_{s\ge1}$ be a sequence in $\mathbb{R}^d$ and define $V_t=\lambda I+\sum_{s<t}X_sX_s^\top$ for a regularizer $\lambda>0$. If $\lambda\ge L^2$ and $\|X_s\|_2\le L$ for all $s$, then
$$\sum_{t=1}^n\|X_t\|_{V_t^{-1}}^2\le 2d\log\Big(1+\frac{nL^2}{d\lambda}\Big)\,.$$
Appendix B Regret balancing for linear bandits
Proof of Theorem 2.2.
By Theorem A.1, with probability at least $1-\delta$, for all $t$ and all $x\in D_t$, $|\langle x,\widehat\theta_t-\theta_*\rangle|\le\beta_t(\delta)\|x\|_{V_t^{-1}}$. In what follows, we condition on the high-probability event that these inequalities hold.
First, we show that $B_t$ is an optimistic estimate of $\langle X_t^*,\theta_*\rangle$. By the definition of $\widetilde X_t$,
$$B_t=\langle\widetilde X_t,\widehat\theta_t\rangle+\beta_t(\delta)\|\widetilde X_t\|_{V_t^{-1}}\ \ge\ \langle X_t^*,\widehat\theta_t\rangle+\beta_t(\delta)\|X_t^*\|_{V_t^{-1}}\ \ge\ \langle X_t^*,\theta_*\rangle\,.$$
We upper bound the instantaneous regret,
Using the fact that , and hence , we get that for any . Thus,
Thus, by the Cauchy–Schwarz inequality and Lemma A.2,
Appendix C Proof of Theorem 3.1
Let the two classes of algorithms be defined as follows: if an algorithm belongs to the first class, then there exists a value such that the algorithm has a deterministic instantaneous regret equal to that value during all time steps. If an algorithm belongs to the second class, then there are a time index and two values such that the algorithm has a deterministic instantaneous regret equal to the first value up to that time, and equal to the second value afterwards. We show the following theorem:
Let be an algorithm that for all timesteps plays a policy achieving (deterministically) an instantaneous regret of for some . Similarly let be an algorithm that for all timesteps plays a policy with a deterministic instantaneous regret of for some .
We proceed by contradiction. If $\mathcal{M}$ is not regret balancing for the two bases, then there exists a $\delta$ such that, for some nonzero positive constants and for infinitely many $n$, the condition in Equation 7 fails with probability at least $\delta$. W.l.o.g., the condition in Equation 7 implies that for infinitely many $n$, with probability at least $\delta$:
For any such let this event be called . Define and to be the random number of times in that algorithm (respectively algorithm ) was called by . In this case, Equation 8 becomes, with probability at least :
This in turn implies , and additionally, since , with probability at least we have . We now proceed to show a lower bound for the regret of the master in each of the two cases.
Let where and . Notice that . In we have . In , which in turn implies by Equation 9 that and therefore that in it holds that . Since , we conclude that .
Case . Assume the master has model selection guarantees (in expectation) w.r.t. the algorithm. Therefore . As a consequence of Equation 9, with probability at least it holds that .
This analysis shows that, in case $\mathcal{M}$ does not satisfy regret balancing, one of the following must hold:
If : then $\mathcal{M}$ must incur an expected regret of at least for some , thus precluding any model selection guarantees for $\mathcal{M}$.
If : Then with probability at least it follows that for some constants . Furthermore, if is assumed to satisfy model selection guarantees, it must be the case that for large enough, we can conclude that with probability at least , . We focus on this case to find a contradiction.
Two alternative worlds. Having analyzed what happens if a master does not regret balance algorithms 1 and 2, we proceed to show our lower bound. Let the two base algorithms be defined as above, and let two further base algorithms be defined as:
- The first acts exactly as the original algorithm does.
- The second acts as the original algorithm does only up to time , and afterwards it pulls the optimal arm (deterministically).
Let and be the random number of times and are played by .
Suppose the master above is presented with sampled uniformly at random between and . We show the following:
Let . Note that environment is indistinguishable from environment in the probability at least event that . This implies that in environment , and with probability at least , . In this same event and for large enough since it must be the case that and , and therefore that:
Since for large enough the optimal regret for is instead , and for large enough:
We conclude that the master could not have satisfied model selection. ∎
Appendix D Proof of Theorem 3.2
Let the set of arms be . Let and be such that . Let . Define two environments and with reward vectors and , respectively. Let and be two base algorithms defined by the following fixed policies when running alone in or :
We also construct base defined as follows. Let and be two constants. Base mimics base when , and picks arm when . The instantaneous rewards of and when running alone are and for all . Next, consider model selection with base algorithms and in . Let and be the number of rounds that and