Recently, reinforcement learning (RL) has shown spectacular success. By experimenting, computers can learn how to autonomously perform tasks that no programmer could teach them. However, the performance of these approaches significantly depends on the application domains. In general, we need reinforcement learning algorithms with both good empirical performance and strong theoretical guarantees. This goal cannot be achieved without the efficient exploration of the environment, which has been studied in episodic RL.
In episodic RL, an agent interacts with the environment in a series of episodes while trying to maximize the total reward accumulated over time (Burnetas and Katehakis, 1997; Sutton and Barto, 1998). This learning process leads to a fundamental trade-off: should the agent explore insufficiently understood states and actions to gain new knowledge that results in better long-term performance, or exploit its existing information to maximize short-run rewards? Existing algorithms focus on how to balance this trade-off appropriately under the implicit assumption that the exploration cost remains the same over time. However, in a variety of application scenarios, the exploration cost is time-varying and situation-dependent. Such scenarios give us an opportunity to explore more when the exploration cost is relatively low and exploit more when that cost is high, thus adaptively balancing the exploration-exploitation trade-off to achieve better performance. Consider the following motivating examples.
Motivating scenario 1: return variation in game.
Consider a game or a gambling machine where, in some rounds, players may attain special multipliers (, , …, etc) on their reward. They can win a large number of points by getting lucky and having large prizes supplemented by large multipliers. Hence, when a player is given a large multiplier, they had better play the move they believe is the best. Such a conservative move is less risky, especially since taking a “bad” action in this situation would result in a significant loss. On the other hand, in a game round with no multiplier or a small one, playing an experimental action is less risky, since the regret of trying a suboptimal move is lower in this case.
Motivating scenario 2: value variation in sequential recommendations.
In a sequential recommender system for e-commerce, the system successively suggests candidate products to users to maximize the total click-through rate (i.e., the probability that a user accepts the recommendation) based on users’ preferences and browsing history. We note that the real monetary return of a recommendation (if accepted) can differ depending on other factors, such as users with different levels of purchasing power or loyalty (e.g., diamond vs. silver status). Because the ultimate goal is to maximize the overall monetary reward, intuitively, when the monetary return of a recommendation (if accepted) is low, the monetary regret of suggesting suboptimal products is low, leading to a low exploration cost. Correspondingly, high returns lead to high regret and a high exploration cost.
Opportunistic reinforcement learning.
Motivated by these examples, we propose and study opportunistic episodic reinforcement learning, a new paradigm of reinforcement learning problems where the regret of executing a suboptimal action depends on a varying cost, referred to as the variation factor, associated with the environmental conditions. When the variation factor is low, so is the cost/regret of picking a suboptimal action, and vice versa. Therefore, intuitively, we should explore more when the variation factor is low and exploit more when the variation factor is high. As its name suggests, opportunistic RL leverages the dynamics of the variation factor to reduce regret.
In this work, we propose the OppUCRL2 algorithm for opportunistic learning in episodic RL, which introduces variation factor-awareness to balance the exploration-exploitation trade-off. OppUCRL2 significantly outperforms UCRL2 (Jaksch et al., 2010) in simulation and has the same theoretical guarantee with respect to the regret. The opportunistic RL concept is also easy to generalize to other reinforcement learning algorithms. To demonstrate this, we design the OppPSRL algorithm based on PSRL (Ian et al., 2013). It also achieves better performance than the original version in simulation. To the best of our knowledge, this is the first work proposing and studying the concept of opportunistic reinforcement learning. We believe this work will serve as a foundation for the opportunistic reinforcement learning concept and help further address the exploration-exploitation trade-off.
2 Related Work
Optimism in the face of uncertainty (OFU) is a popular paradigm for the exploration-exploitation trade-off in RL, where each pair of states and actions is offered some optimism bonus. The agent then chooses a policy that is optimal under the “optimistic” model of the environment. To learn efficiently, it maintains some control over its uncertainty by assigning a more substantial optimistic bonus to potentially informative states and actions. The assigned bonus can stimulate and guide the exploration process. Most OFU algorithms provide strong theoretical guarantees (Azar et al., 2017; Bartlett and Tewari, 2009; Jaksch et al., 2010; Dann and Brunskill, 2015; Strehl et al., 2009). A popular competitor to OFU algorithms is inspired by Thompson sampling (TS) (Chapelle and Li, 2011). In RL, TS approaches (Strens, 2000) maintain a posterior distribution over the reward function and the transition kernel, then compute the optimal policy for an MDP randomly sampled from the posterior. One of the well-known TS algorithms in the literature is Posterior Sampling for Reinforcement Learning (PSRL) (Ian et al., 2013; Osband and Van Roy, 2017).
The opportunistic learning idea was introduced in (Wu et al., 2018) for classic $K$-armed bandits and in (Guo et al., 2019) for contextual bandits. In reinforcement learning, the authors of (Dann et al., 2019) consider the case where each episode has a side context and propose the ORLC algorithm, which can use the context information to estimate the dynamics of the environment but does not include the opportunistic concept, unlike our work. To the best of our knowledge, no prior work has provided a formal mathematical formulation and a rigorous performance analysis for opportunistic reinforcement learning.
3 Problem Formulation
We consider an RL problem in an episodic finite-horizon Markov decision process (MDP), $M = (\mathcal{S}, \mathcal{A}, H, P, r)$, where $\mathcal{S}$ is a finite state space with cardinality $S$, $\mathcal{A}$ is a finite action space with cardinality $A$, $H$ is the horizon that represents the number of time steps in each episode, $P$ is a state transition distribution such that $P(\cdot \mid s, a)$ dictates a distribution over next states if action $a$ is taken in state $s$, and $r$ is the deterministic reward function. For simplicity, we assume that the reward function is known to the agent but the transition distribution is unknown.
In each episode of this MDP, an initial state $s_1$ is chosen arbitrarily by the environment before it starts. For each step $h \in [H]$ (we write $[n]$ for $\{1, \dots, n\}$), the agent observes a state $s_h$, selects an action $a_h$, receives a reward $r(s_h, a_h)$, and then the state transits to the next state $s_{h+1}$, which is drawn from the distribution $P(\cdot \mid s_h, a_h)$. The episode ends in state $s_{H+1}$.
A policy for an agent during the episode is expressed as a mapping $\pi : \mathcal{S} \times [H] \to \mathcal{A}$. We write $V_h^{\pi}$ for the value function at step $h$ under policy $\pi$. For a state $s$, $V_h^{\pi}(s)$ is the expected return (i.e., sum of rewards) received under policy $\pi$, starting from $s_h = s$, i.e., $V_h^{\pi}(s) = \mathbb{E}\left[\sum_{h'=h}^{H} r(s_{h'}, \pi(s_{h'}, h')) \mid s_h = s\right]$. Because the action space, state space, and horizon are finite, and the reward function is deterministic, there always exists an optimal policy $\pi^*$ that attains the best value $V_h^{*}(s) = \max_{\pi} V_h^{\pi}(s)$ for all $s$ and $h$. For an episode with initial state $s_1$, the quality of a policy $\pi$ is measured by the regret, that is, the gap between the value function at step $1$ under policy $\pi$ and that under the optimal policy, i.e., $V_1^{*}(s_1) - V_1^{\pi}(s_1)$. The goal of the classic RL problem is for an RL agent to interact with the environment (MDP $M$) for $K$ episodes in a sequential manner and find the optimal policy.
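When the model is known, the optimal value function described above can be computed by backward induction over the horizon. The following is a minimal sketch in plain Python. The function name `optimal_values` and the two-state, two-action MDP are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def optimal_values(P, r, H):
    """Backward induction for a finite-horizon MDP with known model.

    P[s][a] is a list of next-state probabilities, r[s][a] is the
    deterministic reward, and H is the horizon. Returns (V, pi), where
    V[h][s] is the optimal value at step h (0-indexed) and pi[h][s] is a
    greedy optimal action at that step.
    """
    n_states = len(P)
    V = [[0.0] * n_states for _ in range(H + 1)]  # V[H] = 0: nothing left to collect
    pi = [[0] * n_states for _ in range(H)]
    for h in range(H - 1, -1, -1):
        for s in range(n_states):
            # One-step lookahead value of every action in state s at step h.
            q = [r[s][a] + sum(p * V[h + 1][s2] for s2, p in enumerate(P[s][a]))
                 for a in range(len(P[s]))]
            pi[h][s] = max(range(len(q)), key=q.__getitem__)
            V[h][s] = q[pi[h][s]]
    return V, pi


# Hypothetical toy MDP (illustrative numbers only).
P = [[[1.0, 0.0], [0.5, 0.5]],   # state 0: action 0 stays, action 1 reaches state 1 w.p. 0.5
     [[0.0, 1.0], [1.0, 0.0]]]   # state 1: action 0 stays, action 1 returns to state 0
r = [[0.0, 0.0],                 # no reward in state 0
     [1.0, 0.0]]                 # reward 1 for staying in state 1
V, pi = optimal_values(P, r, H=3)
```

The per-episode regret of any other policy is then the difference between `V[0][s]` and that policy's own value at the initial state `s`.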
Next we introduce opportunistic reinforcement learning in an episodic finite-horizon MDP. For each episode $k$, let $L_k$ be an external variation factor that does not change during the episode. We assume $L_k$ is independent of the MDP $M$ for all $k$. To distinguish different episodes, we use $s_{k,h}$, $a_{k,h}$, $r_{k,h}$ to denote the state, action and reward in step $h$ of episode $k$. The expected actual return for episode $k$ is defined as $L_k V_1^{\pi}(s)$ if the initial state is $s$ and the policy of the agent is $\pi$. Before the $k$-th episode, the agent can observe the initial state $s_{k,1}$ and the current value of $L_k$. Based on the policy $\pi_k$ that the agent selects, the expected actual return that the learner receives is $L_k V_1^{\pi_k}(s_{k,1})$.
This model captures the essence of the opportunistic RL paradigm for the motivating scenarios in the introduction. In the opportunistic RL model, we notice that the optimal policy that maximizes $L_k V_1^{\pi}(s_{k,1})$ for each episode does not change over episodes and is the same as the optimal policy $\pi^*$ in the standard RL problem for the MDP $M$. So, the best expected actual return for episode $k$ is $L_k V_1^{*}(s_{k,1})$.
The goal is to minimize the actual total regret over $K$ episodes in terms of the expected actual return. In particular, we define the actual total regret in the opportunistic RL problem over $K$ episodes with respect to the expected actual return as:
$$R(K) = \mathbb{E}\left[\sum_{k=1}^{K} L_k V_1^{*}(s_{k,1})\right] - \mathbb{E}\left[\sum_{k=1}^{K} L_k V_1^{\pi_k}(s_{k,1})\right]. \tag{1}$$
In a special case, equation (1) has an equivalent form: when $L_k$ is i.i.d. over the episodes with mean value $\bar{L}$, the total regret with respect to the actual reward is $R(K) = \bar{L}\, \mathbb{E}\left[\sum_{k=1}^{K} V_1^{*}(s_{k,1})\right] - \mathbb{E}\left[\sum_{k=1}^{K} L_k V_1^{\pi_k}(s_{k,1})\right]$. Note that in general, it is likely that $\mathbb{E}\left[L_k V_1^{\pi_k}(s_{k,1})\right] \neq \bar{L}\, \mathbb{E}\left[V_1^{\pi_k}(s_{k,1})\right]$, because the policy $\pi_k$ can depend on $L_k$.
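To make the regret definition concrete, the sketch below computes the variation factor-weighted regret from per-episode quantities. It assumes the multiplicative form described in the text, where each episode's value gap is scaled by that episode's variation factor. The function name `opportunistic_regret` and the sample numbers are hypothetical.

```python
def opportunistic_regret(factors, optimal_values, achieved_values):
    """Sum over episodes of factor * (optimal value - achieved value).

    factors[k] is the variation factor of episode k, optimal_values[k] the
    optimal value at the initial state, and achieved_values[k] the value of
    the policy actually played in that episode.
    """
    if not (len(factors) == len(optimal_values) == len(achieved_values)):
        raise ValueError("all inputs must have one entry per episode")
    return sum(L * (v_star - v_pi)
               for L, v_star, v_pi in zip(factors, optimal_values, achieved_values))


# Two episodes with variation factors 0 and 1, and one exploratory episode
# whose value gap is 1.0. Only the timing of the exploration differs.
explore_when_low = opportunistic_regret([0.0, 1.0], [3.0, 3.0], [2.0, 3.0])
explore_when_high = opportunistic_regret([0.0, 1.0], [3.0, 3.0], [3.0, 2.0])
```

The same suboptimal episode costs nothing when it coincides with a zero variation factor and costs its full value gap when it coincides with a high one, which is the effect opportunistic exploration tries to exploit.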
4 Opportunistic Reinforcement Learning Algorithm
In this section, we propose two opportunistic algorithms, designed based on optimism in the face of uncertainty and on posterior sampling, respectively.
We first introduce the OppUCRL2 algorithm, an opportunistic variant of UCRL2 (Jaksch et al., 2010).
Alg. 1 takes a hyper-parameter and uses the normalized variation factor $\tilde{L}_k$ of the variation factor $L_k$, defined as
$$\tilde{L}_k = \frac{[L_k]_{l^-}^{l^+} - l^-}{l^+ - l^-}, \tag{2}$$
where $l^-$ and $l^+$ are respectively the lower and upper thresholds for truncating the variation factor level, and $[L_k]_{l^-}^{l^+} = \max\{l^-, \min\{L_k, l^+\}\}$. The variation factor normalization restricts the impact of the variation factor term in the confidence bounds, which avoids under- or over-exploration. We note that the normalized variation factor is only employed in the algorithm itself. Indeed, the regret depends on the real variation factor $L_k$ and not on $\tilde{L}_k$.
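The normalization step can be sketched in a few lines of Python. This assumes the truncate-then-rescale form suggested by the description: values at or below the lower threshold map to 0, values at or above the upper threshold map to 1, and values in between are rescaled linearly. The function and argument names are hypothetical.

```python
def normalize_variation_factor(factor, lower, upper):
    """Truncate `factor` to [lower, upper], then rescale it to [0, 1]."""
    if upper <= lower:
        raise ValueError("upper threshold must be strictly above lower threshold")
    clipped = max(lower, min(factor, upper))
    return (clipped - lower) / (upper - lower)


low = normalize_variation_factor(0.1, lower=0.2, upper=0.8)    # below the lower threshold
mid = normalize_variation_factor(0.5, lower=0.2, upper=0.8)    # halfway between thresholds
high = normalize_variation_factor(0.95, lower=0.2, upper=0.8)  # above the upper threshold
```

Only the normalized value enters the algorithm; the regret is still measured with the raw variation factor.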
In the initialization, the algorithm sets up counters for the number of times each state-action pair has been played and each transition tuple has occurred up to the current episode, together with the start time of each episode. Before the start of the $k$-th episode, the algorithm observes the variation factor and normalizes it by Eq. 2 in line 4. The empirical estimate of the transition distribution is calculated from all historical transitions observed so far in line 6. The width of the high-probability confidence region of the transition distribution is estimated by Hoeffding’s inequality and the normalized variation factor in line 7. Then, a plausible MDP set is created in line 8. It consists of finite-horizon MDPs with the same known reward function and with transition probabilities inside the high-probability confidence region of the given width for all state-action pairs.
Next, in line 9, Alg. 1 calls the subroutine Finite Horizon Extended Value Iteration (see Appendix A.1 for more details), which returns an optimistic MDP with the best achievable reward from the plausible MDP set and the corresponding optimistic policy. The idea behind finite horizon extended value iteration is the same as in (Puterman, 1994; Dann and Brunskill, 2015). Last, the policy is executed throughout the episode and the counts are updated.
In general, Alg. 1 explores more when the variation factor is relatively low, and exploits more when the variation factor is relatively high. To see this, note that the quantity computed in line 7 is the adaptive width of the confidence region for the plausible MDP set, modulated by the normalized variation factor $\tilde{L}_k$, which determines the level of exploration. For example, when the variation factor is at its lowest level, at or below the lower threshold ($L_k \le l^-$), we have $\tilde{L}_k = 0$, the width of the confidence region is the same as that of the UCRL2 algorithm, and the algorithm learns the policy in the same way as conventional UCRL2. At the other extreme, when the variation factor is at or above the upper threshold ($L_k \ge l^+$), i.e., $\tilde{L}_k = 1$, the width is zero; that is, when the variation factor is at its highest level, the algorithm purely exploits the existing knowledge and selects the best policy. By exploiting variation factor awareness, and given that the actual regret is scaled by the variation factor level, OppUCRL2 can achieve a lower regret than the original UCRL2.
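The modulation of the confidence width can be illustrated with a short Python sketch. The exact expression used in line 7 of Alg. 1 is not reproduced in this text, so the code makes two explicit assumptions: the base width is the L1 confidence width from UCRL2 (Jaksch et al., 2010), and the normalized variation factor enters through a linear factor of one minus its value. That linear form is only one choice that matches the two endpoint behaviours described above. All function names are hypothetical.

```python
import math


def ucrl2_width(n_visits, n_states, n_actions, t, delta):
    """UCRL2-style L1 confidence width for one state-action pair (assumed base width)."""
    return math.sqrt(14.0 * n_states * math.log(2.0 * n_actions * t / delta)
                     / max(1, n_visits))


def opportunistic_width(n_visits, n_states, n_actions, t, delta, normalized_factor):
    """Assumed modulation: shrink the width linearly as the normalized factor grows."""
    if not 0.0 <= normalized_factor <= 1.0:
        raise ValueError("normalized_factor must lie in [0, 1]")
    return (1.0 - normalized_factor) * ucrl2_width(n_visits, n_states, n_actions, t, delta)


# Illustrative numbers only: 50 visits, 6 states, 2 actions, time step 100.
w_low = opportunistic_width(50, 6, 2, 100, 0.05, normalized_factor=0.0)   # full exploration
w_mid = opportunistic_width(50, 6, 2, 100, 0.05, normalized_factor=0.5)
w_high = opportunistic_width(50, 6, 2, 100, 0.05, normalized_factor=1.0)  # pure exploitation
```

With a normalized factor of 0 the width equals the plain UCRL2 width, and with a normalized factor of 1 the plausible set collapses to the empirical model.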
5 Regret Analysis for OppUCRL2
In this section, we present an upper bound on the regret of OppUCRL2. We study a simple case with a periodic square wave variation factor. Specifically, we assume that the variation factor takes one fixed value when the episode index is even and another fixed value when it is odd. Because the variation factor-aware regret in Eq. 1 for opportunistic learning differs from the classical regret definition, a fair comparison of OppUCRL2 and the original UCRL2 algorithm requires deriving the regret bounds for both of them based on Eq. 1. Following the same logic as (Jaksch et al., 2010; Ghavamzadeh et al., 2020), we obtain Theorem 1 and Theorem 2, which show that UCRL2 and OppUCRL2 achieve the same regret bound in the periodic square wave variation factor case (see Appendix B for more details).
Theorem 1 (Regret Bound for UCRL2 under Periodic Square Wave Variation Factor).
For a finite horizon MDP, , and if the episode index is even, and if is odd, consider a parameter , then the regret of UCRL2 is bounded with a probability at least by,
Theorem 2 (Regret Bound for OppUCRL2 under Periodic Square Wave Variation Factor).
For a finite horizon MDP, , and if the episode index is even, and if is odd, consider a parameter , then the regret of OppUCRL2 is bounded with a probability at least by,
6 Experimental Evaluation
In this section, we evaluate the empirical performance of OppUCRL2 and OppPSRL compared to the original UCRL2 and PSRL algorithms. We use three classic examples of the OpenAI Gym, namely River Swim, Cliff Walking and Frozen Lake, which represent three different test cases (Strehl and Littman, 2008): undiscounted reward in a stochastic environment, undiscounted reward in a deterministic environment, and discounted reward in a deterministic environment (see Appendix C.1 for more details). Here, stochastic and deterministic describe the state transition distribution. The River Swim and Cliff Walking environments can be formulated as undiscounted, episodic MDPs, while Frozen Lake is a discounted, episodic task. We report the average of 20 simulations with different seeds and show the 95% confidence interval. We use the same scaling factors for both algorithms, chosen experimentally for each environment. Each algorithm is run with its best input precision hyper-parameters obtained by grid search.
We first introduce the result under random binary-valued variation factor. We assume that the variation factor is i.i.d. over the episodes, with , where and . Let denote the probability that the variation factor is low, i.e., . Fig. 4 shows the regret for different algorithms under random binary-value variation factor with and .
Opportunistic RL algorithms outperform the corresponding original RL algorithms in every environment by significantly reducing the regret. Specifically, for River Swim, Cliff Walking and Frozen Lake, at the end of the -th episode, OppUCRL2 reduces the regret by , and respectively compared with UCRL2. OppPSRL reduces the regret by , and respectively compared with PSRL. We also notice that the regret of OppUCRL2 stops growing after a constant time, which indicates convergence to the optimal policy. This is because it pushes most exploration moves to the episodes where the variation factor is equal to zero. As a result, the exploration cost is negligible. Although OppUCRL2 largely outperforms UCRL2, it may have higher regret at the beginning, especially in environments with less deterministic behavior such as River Swim and Cliff Walking. OppUCRL2 emphasizes the main insight of the exploration-exploitation trade-off: we may sacrifice some short-term rewards to improve future performance. This observation, combined with the constant-time optimal regret, demonstrates OppUCRL2’s capability to learn and adapt to the environment’s dynamics over time. We also test these algorithms in the continuous variation factor case and find that the opportunistic versions of the algorithms also perform better (see Appendix C.2 and D for more details).
7 Conclusion
In this paper, we study opportunistic reinforcement learning, where the regret of choosing a suboptimal action depends on an external condition called the variation factor. We establish the OppUCRL2 and OppPSRL algorithms, variants of the well-known UCRL2 and PSRL algorithms. We also analyze the regret of OppUCRL2 and present regret bounds. Experimental results demonstrate substantial benefits from employing low-cost opportunistic exploration in the OppUCRL2 and OppPSRL algorithms under variation factor fluctuations.
References
- Azar et al. (2017). Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 263–272. Cited by: §2.
- Bartlett and Tewari (2009). REGAL: a regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pp. 35–42. Cited by: Appendix D, §2.
- Burnetas and Katehakis (1997). Optimal adaptive policies for Markov decision processes. Mathematics of Operations Research 22 (1), pp. 222–255. Cited by: §1.
- Chapelle and Li (2011). An empirical evaluation of Thompson sampling. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 2249–2257. Cited by: §2.
- Dann and Brunskill (2015). Sample complexity of episodic fixed-horizon reinforcement learning. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, pp. 2818–2826. Cited by: §A.1, §2, §4.
- Dann et al. (2019). Policy certificates: towards accountable reinforcement learning. In International Conference on Machine Learning, pp. 1507–1516. Cited by: §2.
- Ghavamzadeh et al. (2020). Exploration in reinforcement learning. Tutorial at AAAI’20. Cited by: Appendix B, §5.
- Guo et al. (2019). AdaLinUCB: opportunistic learning for contextual bandits. In IJCAI. Cited by: Appendix D, §2.
- Ian et al. (2013). (More) efficient reinforcement learning via posterior sampling. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, pp. 3003–3011. Cited by: §1, §2.
- Jaksch et al. (2010). Near-optimal regret bounds for reinforcement learning. The Journal of Machine Learning Research 11, pp. 1563–1600. Cited by: Appendix B, Appendix D, §1, §2, §4, §5.
- Osband and Van Roy (2017). Why is posterior sampling better than optimism for reinforcement learning?. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 2701–2710. Cited by: §A.2, §2, §4.
- Puterman (1994). Markov decision processes: discrete stochastic dynamic programming. 1st edition, John Wiley & Sons, Inc., New York, NY, USA. Cited by: §A.1, §4.
- Strehl and Littman (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences 74 (8), pp. 1309–1331. Cited by: §6.
- Strehl et al. (2009). Reinforcement learning in finite MDPs: PAC analysis. The Journal of Machine Learning Research 10, pp. 2413–2444. Cited by: §2.
- Strens (2000). A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pp. 943–950. Cited by: §2.
- Sutton and Barto (1998). Introduction to reinforcement learning. 1st edition, MIT Press, Cambridge, MA, USA. Cited by: §1.
- Wu et al. (2018). Adaptive exploration-exploitation tradeoff for opportunistic bandits. In ICML. Cited by: Appendix D, §2.
Appendix A Algorithms
A.1 Finite Horizon Extended Value Iteration
Alg. 2, Finite Horizon Extended Value Iteration, is used as a subroutine of Alg. 1 (OppUCRL2). The input of Alg. 2 is the plausible MDP set. Its output is the optimistic MDP with the best achievable reward from that set and the optimistic policy used in line 9 of Alg. 1. In particular, the value function is evaluated for a given finite-horizon MDP in the plausible set under a given policy, and the optimistic MDP and optimistic policy are the pair that maximizes this value function for all states. The idea behind finite horizon extended value iteration is the same as in (Puterman, 1994; Dann and Brunskill, 2015): put as much transition probability as possible on the state with maximal value, at the expense of transition probabilities to states with small values. Then, in order to make the result correspond to a probability distribution again, the transition probabilities to states with small values are reduced iteratively with respect to the constraint. This implies that extended value iteration solves a linear optimization problem over the convex polytope formed by the set of transition probabilities that lie inside the confidence region and sum to one.
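The inner step described above, moving probability mass toward the highest-value next state and removing it from low-value states, can be sketched as follows for a single state-action pair. This follows the standard construction from UCRL2 (Jaksch et al., 2010) under the assumption that the confidence region is an L1 ball of the given width around the empirical estimate. The function name and the example numbers are hypothetical.

```python
def optimistic_transition(p_hat, width, values):
    """Most optimistic next-state distribution within an L1 ball around p_hat.

    p_hat is the empirical next-state distribution, width the L1 radius of
    the confidence region, and values[s] the current value of next state s.
    """
    # Next states sorted from highest to lowest value.
    order = sorted(range(len(values)), key=lambda s: values[s], reverse=True)
    p = list(p_hat)
    best = order[0]
    # Add up to half of the L1 budget to the most valuable next state.
    p[best] = min(1.0, p_hat[best] + width / 2.0)
    # Remove the surplus mass from the least valuable states first.
    for s in reversed(order):
        excess = sum(p) - 1.0
        if excess <= 1e-12:
            break
        p[s] = max(0.0, p[s] - excess)
    return p


p_opt = optimistic_transition([0.5, 0.3, 0.2], width=0.4, values=[1.0, 5.0, 2.0])
p_same = optimistic_transition([0.5, 0.3, 0.2], width=0.0, values=[1.0, 5.0, 2.0])
```

With zero width the empirical distribution is returned unchanged, which corresponds to the pure-exploitation case in which the plausible set contains only the empirical model.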
In this section, we generalize the opportunistic RL concept to a sampling-based algorithm, OppPSRL, which is a variant of Posterior Sampling for Reinforcement Learning (PSRL) (Osband and Van Roy, 2017). In each episode, PSRL samples a single MDP from the plausible MDP set and then selects a policy that has maximum value for that MDP.
Inspired by the opportunistic learning idea, we propose Alg. 3, OppPSRL. The input is a prior distribution over the MDP. In general, we can formulate the state transition distribution as a Dirichlet distribution. At the beginning of each episode, Alg. 3 calculates the normalized variation factor in line 3. Then it uses the normalized variation factor to rescale the Dirichlet parameters. Next, an MDP is sampled from the scaled distribution. In line 6, it computes the policy that has maximum value for the sampled MDP. Finally, the policy is executed throughout the episode and the posterior distribution is updated with the new observations.
The core step of Alg. 3 is Line 4. Intuitively, when is small, it can decrease the value of and the corresponding Dirichlet distribution is more concentrated, then the sampled MDP in Line 5 is similar to the empirical MDP with high probability, so the policy in Line 6 is more conservative and less exploratory. When is larger, the distribution flattens, it provides the opportunity for the agent to explore new MDP and try under-explored actions.
Appendix B Regret Analysis
This section presents the proofs of the theorems in the main paper.
Lemma 1 (Regret Bound for UCRL2 in finite horizon MDP).
For a finite horizon MDP, , consider a parameter , the regret of UCRL2 is bounded with a probability at least by,
B.1 Proof of Theorem 1
For the periodic square wave variation factor case, we can categorize the episodes into two groups, then analyze the regret independently, which can still guarantee an upper bound for the regret because the variation factor is independent from the MDP and UCRL2 algorithm. Specifically, from Eq. 1, we have
So, based on the union bound and Lemma 1, we can get the bound in Theorem 1. ∎
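The grouping of episodes used in this proof can be written out explicitly. The identity below is a sketch under assumed notation: $L_{\mathrm{even}}$ and $L_{\mathrm{odd}}$ stand for the two levels of the square wave and $\Delta_k$ for the value gap of episode $k$. These symbols are introduced here for illustration and are not taken from the original text.

```latex
\begin{align*}
\sum_{k=1}^{K} L_k \,\Delta_k
  &= L_{\mathrm{even}} \sum_{\substack{1 \le k \le K \\ k \text{ even}}} \Delta_k
   \;+\; L_{\mathrm{odd}} \sum_{\substack{1 \le k \le K \\ k \text{ odd}}} \Delta_k,
\qquad
\Delta_k = V_1^{*}(s_{k,1}) - V_1^{\pi_k}(s_{k,1}).
\end{align*}
```

Because the variation factor is constant within each group, it factors out of each sum, and each remaining sum is an unweighted regret over roughly half of the episodes, to which a bound of the kind in Lemma 1 is then applied.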
B.2 Proof of Theorem 2
In order to bound the regret of OppUCRL2, we can use the same decomposition as in Eq. B.1. The difference is that the exploration in OppUCRL2 is related to the variation factor, so we cannot directly apply Lemma 1 in the analysis. However, we notice that in the analysis of UCRL2 in finite-horizon MDPs, the regret bound is mainly dominated by the number of visits to all state-action pairs. To obtain an upper bound on the regret of OppUCRL2 under the periodic square wave variation factor, we can regard it as two independent algorithms with different exploration parameters in odd and even episodes. This is valid because the number of visits to each state-action pair in OppUCRL2 is at least the same as in the two independent UCRL2 cases, and the difference only affects the constant coefficient in the bound. So, OppUCRL2 can achieve the bound shown in Theorem 2. ∎
Appendix C Simulation
C.1 Environment Setting
C.2 Evaluation Using Continuous Variation Factor
We investigate the performance of the algorithms under a continuous variation factor. We assume that the variation factor is i.i.d. over episodes and sampled from a Beta distribution. Figure 11 shows the regrets for different algorithms and environments. Here, we define the lower threshold such that where , and the upper threshold such that .
For River Swim, Cliff Walking and Frozen Lake, at the end of the -th episode, OppUCRL2 reduces regret by , and respectively compared with UCRL2. OppPSRL reduces regret by , and respectively compared with PSRL. For OppUCRL2, we also see trends similar to those in the previous experiments. However, with a Beta-distributed variation factor, the OppUCRL2 algorithm does not achieve a constant-time regret. This is because the variation factor does not switch sharply between 0 and 1, and the exploration carried out in low variation factor episodes usually does not coincide with a zero variation factor, thus generating an extra overhead compared to the previous experimental case.
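The continuous setting can be mimicked with the Python standard library. In the sketch below the Beta parameters and the threshold probabilities are hypothetical, since the values used in the experiments are not recoverable from this text. It also assumes that the thresholds are chosen as quantiles of the variation factor distribution, which is only one possible reading of the threshold definition above.

```python
import random


def empirical_quantile(samples, q):
    """Return the sample below which roughly a fraction q of the samples fall."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, max(0, int(q * len(ordered))))
    return ordered[index]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical Beta(2, 2) variation factors, one per episode.
factors = [rng.betavariate(2.0, 2.0) for _ in range(10_000)]

# Hypothetical thresholds: 20% of factors fall below `lower`, 20% above `upper`.
lower = empirical_quantile(factors, 0.2)
upper = empirical_quantile(factors, 0.8)
share_below = sum(f < lower for f in factors) / len(factors)
```

Since a Beta sample is almost never exactly zero, even the episodes below the lower threshold carry a small positive variation factor, which is consistent with the extra overhead reported above.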
Appendix D Discussion
We reserve this section to discuss the limitations of our work and possible future improvements.
Weakly communicating MDPs: In this paper, we focused on the setting of finite horizon MDPs. Some previous approaches to exploration provide regret bounds for the more general setting of weakly communicating MDPs (Jaksch et al., 2010; Bartlett and Tewari, 2009). However, we believe that our analysis can be extended to this more general case using existing techniques such as the “doubling trick” (Jaksch et al., 2010).
Computational and statistical efficiency: The proposed algorithm is computationally tractable. In each episode, it performs an optimistic value iteration with computational cost of the same order as solving a known MDP. Besides, the obtained regret bounds guarantee with a high probability the statistical efficiency of the algorithm.
Theoretical regret bound: In the current work, we show that OppUCRL2 and UCRL2 can achieve the same bound in the periodic square wave variation factor case. However, in the simulation, the opportunistic version has significantly better results. This implies that the bound of OppUCRL2 is not tight, at least under some circumstances. In the existing literature on finite-horizon MDPs, the analysis considers all state-action pairs together to get an upper bound on the regret, which ignores differences in the strength of exploration. So, in order to get a better bound for opportunistic reinforcement learning algorithms, similar to the better bounds obtained in the opportunistic bandit setting (Wu et al., 2018; Guo et al., 2019), we may need to consider each state-action pair independently. We will consider this direction in future work.