Reinforcement learning (RL) is about learning to make sequential decisions in an unknown environment through trial and error. It finds wide applications in robotics (Kober et al., 2013), autonomous driving (Shalev-Shwartz et al., 2016), game AI (Silver et al., 2017), and beyond. We consider a basic RL model, the Markov decision process (MDP). In the MDP, an agent at a state $s \in \mathcal{S}$ is able to play an action $a \in \mathcal{A}$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces. Then the system transitions to another state $s' \in \mathcal{S}$ according to an unknown probability $P(s' \mid s, a)$, while returning an immediate reward $r(s, a)$. The goal of the agent is to obtain the maximal possible return after playing for a period of time, even though she has no knowledge about the transition probabilities at the beginning.
The performance of a learning algorithm is measured by “regret”. Regret is the difference between the cumulative reward obtained using the best possible policy and the cumulative reward obtained by the learning algorithm. In the tabular setting where $\mathcal{S}$ and $\mathcal{A}$ are finite sets, there exist algorithms that achieve asymptotic regret (e.g., (Jaksch et al., 2010; Osband and Van Roy, 2016; Osband et al., 2017; Agrawal and Jia, 2017; Azar et al., 2017; Dann et al., 2018; Jin et al., 2018)), where $T$ is the number of time steps. However, the aforementioned regret bound depends polynomially on $|\mathcal{S}|$ and $|\mathcal{A}|$, the sizes of the state and action spaces, which can be very large or even infinite. For instance, the game of Go has unique states, and a robotic arm has infinitely many continuous-valued states. In the most general sense, the regret is nonimprovable in the worst case (Jaksch et al., 2010). This issue is more generally known as the “curse of dimensionality” of control and dynamic programming (Bellman, 1966).
To tackle the dimensionality, a common practice is to use features to parameterize high-dimensional value and policy functions in compact representations, with the hope that the features can capture leading structures of the MDP. In fact, there are phenomenal empirical successes of reinforcement learning using explicit features and/or neural networks as implicit features (see, e.g., (Mnih et al., 2015)). However, there is a lack of theoretical understanding about using features for exploration in RL and its learning complexity. In this paper, we are interested in the following theoretical question:
How to use features for provably efficient exploration in reinforcement learning?
Furthermore, we consider online RL in a reproducing kernel space. Kernel methods are well known to be powerful for capturing nonlinearity and high dimensionality in many machine learning tasks (Shawe-Taylor et al., 2004). We are interested in using kernel methods to capture nonlinearity in the state-transition dynamics of an MDP. A kernel space may consist of infinitely many implicit feature functions. We study the following questions: How to use kernels in online reinforcement learning? Can one achieve low regret even though the kernel space is infinite-dimensional? The goal of this paper is to answer the aforementioned questions affirmatively. In particular, we would like to design algorithms that take advantage of given features and kernels to achieve efficient exploration.
1.1 Our Approach and Main Results
Consider episodic reinforcement learning in a finite-horizon MDP. The agent learns through $N$ episodes, and each episode consists of $H$ time steps. Here $H$ is also called the planning horizon. Let $\phi(\cdot) \in \mathbb{R}^{d}$ and $\psi(\cdot) \in \mathbb{R}^{d'}$ be given feature functions. We focus on the case that the probability transition model $P$ can be fully embedded in the feature space (Assumption 1), i.e., there exists some core matrix $M^*$ such that
$$P(s' \mid s, a) = \phi(s, a)^\top M^* \psi(s').$$
In the kernel setting, this condition is equivalent to requiring that the transition probability model belong to the product space of the reproducing kernel spaces. This condition is essentially equivalent to using the features $\phi$ to represent value functions (Parr et al., 2008). When the probability transition model cannot be fully embedded using $\phi$, value function approximation using $\phi$ may lead to arbitrarily large Bellman error (Yang and Wang, 2019).
We propose an algorithm, which is referred to as MatrixRL, that actively explores the state-action space by estimating the core matrix via ridge regression. The algorithm balances the exploitation-exploration tradeoff by constructing a confidence ball of the core matrix for optimistic dynamic programming. It can be thought of as a “matrix bandit” algorithm which generalizes the idea of linear bandit (e.g., (Dani et al., 2008; Li et al., 2010; Chu et al., 2011)). It is proved to achieve the regret bound either
$$\tilde{O}\big(H^2 d^{3/2} \sqrt{T}\big) \quad \text{or} \quad \tilde{O}\big(H^2 d \sqrt{T}\big),$$
depending on regularity properties of the features. MatrixRL can be implemented efficiently in space $O(d^2)$. Each step can be carried out in closed form. Next we extend MatrixRL to work with the kernel spaces with $k_\phi(\cdot, \cdot) = \langle \phi(\cdot), \phi(\cdot) \rangle$ and $k_\psi(\cdot, \cdot) = \langle \psi(\cdot), \psi(\cdot) \rangle$, and show that it admits a kernelized version. The kernelized MatrixRL achieves a regret bound of
$$\tilde{O}\big(H^2 \tilde{d} \sqrt{T}\big),$$
where $\tilde{d}$ is the effective dimension of the kernel space, even if there may be infinitely many features. The regret bounds using features or kernels do not depend on the sizes of the state and action spaces, making efficient exploration possible in high dimensions.
Note that for linear bandit, the regret lower bound is known to be $\tilde{\Omega}(d\sqrt{T})$ (Dani et al., 2008). Since linear bandit is a special case of RL, our regret bounds match the lower bound up to polylog factors in $d$ and $T$. To the best of our knowledge, for reinforcement learning using features/kernels, our result gives the first regret bound that is simultaneously near-optimal in time $T$, polynomial in the planning horizon $H$, and near-optimal in the feature dimension $d$.
1.2 Related Literature
In the tabular case where there are finitely many states and actions without any structural knowledge, complexity and regret for RL have been extensively studied. For $H$-horizon episodic RL, efficient methods typically achieve regret that scales asymptotically as (see, for example, (Jaksch et al., 2010; Osband and Van Roy, 2016; Osband et al., 2017; Agrawal and Jia, 2017; Azar et al., 2017; Dann et al., 2018; Jin et al., 2018)). In particular, (Jaksch et al., 2010) provided a regret lower bound for $H$-horizon MDP. There is also a line of works studying the sample complexity of obtaining a value or policy that is at most $\epsilon$-suboptimal (Kakade, 2003; Strehl et al., 2006, 2009; Szita and Szepesvári, 2010; Lattimore and Hutter, 2014; Azar et al., 2013; Dann and Brunskill, 2015; Sidford et al., 2018). The optimal sample complexity for finding an $\epsilon$-optimal policy is (Sidford et al., 2018) for a discounted MDP with discount factor $\gamma$. The optimal lower bound has been proven in (Azar et al., 2013).
There is also a line of works on solving MDPs with function approximation, for instance (Baird, 1995; Tsitsiklis and Van Roy, 1997; Parr et al., 2008; Mnih et al., 2013, 2015; Silver et al., 2017; Yang and Wang, 2019). There are also phenomenal empirical successes in deep reinforcement learning (e.g., (Silver et al., 2017)). However, there are not many works on the regret analysis of RL with function approximators. Very recently, (Azizzadenesheli et al., 2018) studied the regret bound for linear function approximators. However, their bound has a factor that can be exponential in $H$. (Chowdhury and Gopalan, 2019) consider the regret bound for kernelized MDPs. However, they need a Gaussian process prior and assume that the transition is deterministic with some controllable amount of noise, which is a very restrictive setting. Another work (Modi and Tewari, 2019) also considers the linear setting for RL. However, its regret bound depends linearly on the number of states. To the best of our knowledge, no other work achieves a regret bound for RL with function approximators that is simultaneously near-optimal in $T$, polynomial in $H$, and independent of the size of the state-action space.
Our results are also related to the literature on linear bandits. Bandit problems can be viewed as a special case of Markov decision problems. There is a line of works on linear bandit problems and their regret analysis (Dani et al., 2008; Rusmevichientong and Tsitsiklis, 2010; Li et al., 2010; Abbasi-Yadkori et al., 2011; Chu et al., 2011). For a more detailed survey, please refer to (Bubeck et al., 2012). Part of our results are inspired by the kernelization of linear bandit problems, e.g., (Valko et al., 2013; Chowdhury and Gopalan, 2017), who studied the regret bound when the features of each arm lie in some reproducing kernel Hilbert space.
2 Problem Formulation
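In an episodic Markov decision process (MDP for short), there is a set of states $\mathcal{S}$ and a set of actions $\mathcal{A}$, which are not necessarily finite. At any state $s \in \mathcal{S}$, an agent is allowed to play an action $a \in \mathcal{A}$. She receives an immediate reward $r(s, a)$ after playing $a$ at $s$, and the process transitions to the next state $s' \in \mathcal{S}$ with probability $P(s' \mid s, a)$, where $P$ is the collection of transition distributions. After $H$ time steps, the system restarts at a prespecified state $s_0$. The full instance of an MDP can be described by the tuple $M = (\mathcal{S}, \mathcal{A}, P, r, s_0, H)$. The agent would like to find a policy $\pi: \mathcal{S} \times [H] \to \mathcal{A}$ that maximizes the long-term expected reward starting from every state $s$ and every stage $h \in [H]$, i.e.,
$$V^{\pi}_h(s) := \mathbb{E}\Big[\sum_{h'=h}^{H} r\big(s_{h'}, \pi(s_{h'}, h')\big) \,\Big|\, s_h = s\Big].$$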
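We call $V^{\pi}$ the value function of policy $\pi$. A policy $\pi^*$ is said to be optimal if it attains the maximal possible value at every state-stage pair $(s, h)$. We denote $V^*$ as the optimal value function. We also denote the optimal action-value function (or $Q$-function) as
$$Q^*_h(s, a) := r(s, a) + \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\big[V^*_{h+1}(s')\big].$$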
In the online RL setting, the learning algorithm interacts with the environment episodically. Each episode starts from state $s_0$ and takes $H$ steps to finish. We let $n$ denote the current number of episodes and $h \in [H]$ denote the current time step. We equalize $t := (n-1)H + h$ and $(n, h)$ and may switch between the two notations. We use the following definition of regret.
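Suppose we run algorithm $\mathcal{K}$ in the online environment of an MDP $M$ for $T = NH$ steps. We define the regret for algorithm $\mathcal{K}$ as
$$\mathrm{Regret}_{\mathcal{K}}(T) = \mathbb{E}_{\mathcal{K}}\Big[\sum_{n=1}^{N}\Big(V^*_1(s_0) - \sum_{h=1}^{H} r(s_{n,h}, a_{n,h})\Big)\Big], \tag{1}$$
where $\mathbb{E}_{\mathcal{K}}$ is taken over the random path of states under the control of algorithm $\mathcal{K}$.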
Throughout this paper, we focus on RL problems where the probability transition kernel can be fully embedded in a given feature space.
Assumption 1 (Feature Embedding of Transition Model).
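For each $(s, a) \in \mathcal{S} \times \mathcal{A}$ and $s' \in \mathcal{S}$, feature vectors $\phi(s, a) \in \mathbb{R}^{d}$ and $\psi(s') \in \mathbb{R}^{d'}$ are given a priori. There exists an unknown matrix $M^* \in \mathbb{R}^{d \times d'}$ such that
$$P(s' \mid s, a) = \phi(s, a)^\top M^* \psi(s').$$
Here, we call the matrix $M^*$ a transition core.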
Note that when $\phi$ and $\psi$ are features associated with two reproducing kernel spaces $\mathcal{H}_1$ and $\mathcal{H}_2$, this assumption requires that $P$ belong to their product kernel space $\mathcal{H}_1 \times \mathcal{H}_2$.
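As a sanity check on the embedding assumption, the following sketch builds a small synthetic MDP whose transition matrix factors through a core matrix. The construction (simplex-valued state-action features, a row-stochastic core, and one emission distribution over next states per latent feature) and all names such as `make_embedded_mdp` are illustrative assumptions rather than part of the paper; it is only one convenient way to guarantee that the bilinear form yields valid probabilities.

```python
import numpy as np

def make_embedded_mdp(n_sa=8, n_states=5, d=3, d_prime=4, seed=0):
    """Build features and a core matrix so that P = Phi @ M @ Psi.T is a valid transition matrix."""
    rng = np.random.default_rng(seed)
    # Row i of Phi is phi(s, a) for the i-th state-action pair (a point on the simplex).
    Phi = rng.dirichlet(np.ones(d), size=n_sa)               # shape (n_sa, d)
    # Core matrix: each row is a distribution over the d' next-state features.
    M = rng.dirichlet(np.ones(d_prime), size=d)              # shape (d, d')
    # Row s' of Psi is psi(s'); each column of Psi is a distribution over next states.
    Psi = rng.dirichlet(np.ones(n_states), size=d_prime).T   # shape (n_states, d')
    P = Phi @ M @ Psi.T                                      # P[(s,a), s'] = phi^T M psi(s')
    return Phi, M, Psi, P

Phi, M, Psi, P = make_embedded_mdp()
print(P.sum(axis=1))  # every row sums to 1 up to floating-point error
```

Because each factor is row-stochastic, their product is too, so every row of `P` is a probability distribution even though the learner would only ever see the features.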
For simplicity of presentation, we assume throughout that the reward function $r$ is known. This is in fact without loss of generality because learning about the environment $P$ is much harder than learning about $r$. In the case where $r$ is unknown but satisfies $\mathbb{E}[r(s, a) \mid s, a] = \phi(s, a)^\top \theta^*$ for some unknown $\theta^* \in \mathbb{R}^{d}$, we can extend our algorithm by adding a step of optimistic reward estimation as in LinUCB (Dani et al., 2008; Chu et al., 2011). This would generate extra regret, which is a lower-order term compared to our current regret bounds.
3 RL Exploration in Feature Space
In this section, we study the near optimal way to balance exploration and exploitation in RL using a given set of features. We aim to develop an online RL algorithm with regret that depends only on the feature size but not on the size of the state-action space. Our algorithm is inspired by the LinUCB algorithm (Chu et al., 2011) and its variants (Dani et al., 2008) and can be viewed as a “matrix bandit” method.
3.1 The MatrixRL Algorithm
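The high-level idea of the algorithm is to approximate the unknown transition core $M^*$ using data that has been collected so far. Suppose at time step $t = (n, h)$ (i.e., episode $n \le N$ and stage $h \le H$), we obtain the following state-action-state transition triplet: $(s_{n,h}, a_{n,h}, \tilde{s}_{n,h})$, where $\tilde{s}_{n,h} := s_{n,h+1}$. For simplicity, we denote the associated features by
$$\phi_{n,h} := \phi(s_{n,h}, a_{n,h}) \in \mathbb{R}^{d}, \qquad \psi_{n,h} := \psi(\tilde{s}_{n,h}) \in \mathbb{R}^{d'}.$$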
Estimating the core matrix.
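Let $K_\psi := \sum_{s' \in \mathcal{S}} \psi(s')\psi(s')^\top$. We construct our estimator of $M^*$ as:
$$M_n = A_n^{-1} \sum_{n' < n,\, h \le H} \phi_{n',h}\, \psi_{n',h}^\top K_\psi^{-1}, \qquad A_n = I + \sum_{n' < n,\, h \le H} \phi_{n',h}\, \phi_{n',h}^\top.$$
Let us explain the intuition of $M_n$. Note that
$$\mathbb{E}\big[\phi_{n,h}\psi_{n,h}^\top K_\psi^{-1} \mid s_{n,h}, a_{n,h}\big] = \sum_{s'} \phi_{n,h}\, P(s' \mid s_{n,h}, a_{n,h})\, \psi(s')^\top K_\psi^{-1} = \sum_{s'} \phi_{n,h}\phi_{n,h}^\top M^* \psi(s')\psi(s')^\top K_\psi^{-1} = \phi_{n,h}\phi_{n,h}^\top M^*.$$
Therefore $M_n$ is the solution to the following ridge regression problem:
$$M_n = \arg\min_{M} \sum_{n' < n,\, h \le H} \big\|\psi_{n',h}^\top K_\psi^{-1} - \phi_{n',h}^\top M\big\|_2^2 + \|M\|_F^2.$$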
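A minimal NumPy sketch of such a ridge-regression estimate of the core matrix is given below. It assumes the estimator takes the standard closed form $A^{-1}\sum_i \phi_i \psi_i^\top K_\psi^{-1}$ with $A = I + \sum_i \phi_i\phi_i^\top$ and unit regularization; the function name `estimate_core`, the synthetic features, and the randomly drawn next states are hypothetical and serve only to exercise the linear algebra.

```python
import numpy as np

def estimate_core(phis, psis_next, K_psi):
    """Closed-form ridge estimate of the core matrix from observed feature pairs.

    phis:      (t, d)   rows phi(s_i, a_i) of visited state-action pairs
    psis_next: (t, d')  rows psi(s'_i) of the observed next states
    K_psi:     (d', d') matrix sum_{s'} psi(s') psi(s')^T, assumed invertible
    """
    d = phis.shape[1]
    A = np.eye(d) + phis.T @ phis                      # regularized covariance of phi
    targets = np.linalg.solve(K_psi, psis_next.T).T    # rows psi_i^T K_psi^{-1}
    M_hat = np.linalg.solve(A, phis.T @ targets)       # A^{-1} sum_i phi_i targets_i^T
    return M_hat, A

rng = np.random.default_rng(0)
Psi = rng.random((6, 4))                 # hypothetical psi features for 6 states
K_psi = Psi.T @ Psi
phis = rng.normal(size=(50, 3))          # hypothetical phi features of 50 visited pairs
psis_next = Psi[rng.integers(0, 6, size=50)]
M_hat, A = estimate_core(phis, psis_next, K_psi)
```

Using `np.linalg.solve` instead of forming explicit inverses is a numerical-stability choice of this sketch; an online implementation would more likely update $A^{-1}$ incrementally.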
Upper confidence RL using a matrix ball.
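In online RL, a critical step is to estimate the future value of the current state and action using dynamic programming. To better balance exploitation and exploration, we use a matrix ball to construct an optimistic value function estimator. At episode $n$:
$$Q_{n,H+1}(s, a) = 0, \qquad Q_{n,h}(s, a) = r(s, a) + \max_{M \in B_n} \phi(s, a)^\top M \Psi^\top V_{n,h+1} \quad \forall h \in [H],$$
where $V_{n,h}(s) = \Pi_{[0,H]}\big[\max_a Q_{n,h}(s, a)\big]$.
Here the matrix ball $B_n$ is constructed as
$$B_n = \big\{M \in \mathbb{R}^{d \times d'} : \|A_n^{1/2}(M - M_n)\|_{2,1} \le \sqrt{\beta_n}\big\}, \tag{5}$$
where $\beta_n$ is a parameter to be determined later, and $\|Y\|_{2,1} = \sum_i \sqrt{\sum_j Y(i,j)^2}$. At time $(n, h)$, suppose the current state is $s_{n,h}$; we play the optimistic action $a_{n,h} = \arg\max_a Q_{n,h}(s_{n,h}, a)$. The full algorithm is given in Algorithm 1.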
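To illustrate why an optimistic value over a matrix ball need not require a numerical search, the sketch below works out the case of a Frobenius-norm ball $\{M : \|A^{1/2}(M - \hat{M})\|_F \le \sqrt{\beta}\}$. This choice of norm is an assumption made for illustration and may differ from the ball used in the algorithm. By the Cauchy-Schwarz inequality, the maximum of $\phi^\top M w$ over this ball equals $\phi^\top \hat{M} w + \sqrt{\beta}\,\|A^{-1/2}\phi\|_2\,\|w\|_2$, where $w$ stands in for the vector of next-state feature values. The function names are hypothetical.

```python
import numpy as np

def optimistic_value(phi, M_hat, A, w, beta):
    """Maximum of phi^T M w over {M : ||A^{1/2}(M - M_hat)||_F <= sqrt(beta)}."""
    width = np.sqrt(phi @ np.linalg.solve(A, phi))     # ||A^{-1/2} phi||_2
    return phi @ M_hat @ w + np.sqrt(beta) * width * np.linalg.norm(w)

def maximizing_core(phi, M_hat, A, w, beta):
    """A matrix on the boundary of the ball that attains the maximum."""
    u = np.linalg.solve(A, phi)                        # A^{-1} phi
    scale = np.sqrt(beta) / (np.sqrt(phi @ u) * np.linalg.norm(w))
    return M_hat + scale * np.outer(u, w)

rng = np.random.default_rng(1)
d, d_prime, beta = 3, 4, 2.0
X = rng.normal(size=(20, d))
A = np.eye(d) + X.T @ X                                # positive definite by construction
M_hat = rng.normal(size=(d, d_prime))
phi, w = rng.normal(size=d), rng.normal(size=d_prime)

M_opt = maximizing_core(phi, M_hat, A, w, beta)
gap = optimistic_value(phi, M_hat, A, w, beta) - phi @ M_opt @ w
print(abs(gap) < 1e-9)  # the closed form is attained by M_opt
```

The bonus term grows with $\|A^{-1/2}\phi\|_2$, so directions of the feature space that have been visited rarely receive a larger exploration bonus.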
3.2 Regret Bounds for MatrixRL
Let $\Psi$ be the matrix of all $\psi$ features. We first introduce some regularity conditions of the feature space.
Assumption 2 (Feature Regularity).
Let and be positive parameters.
(Footnote 2: the norm here is the operator norm.)
With these conditions we are ready to provide the regret bound.
Consider the case where and are absolute constants. For example, we may let and let be orthonormal functions over (meaning that ). Then Assumption 2 automatically holds with . In this case, our regret bound is simply . The -dependence in such a regret bound is consistent with the regret of the -ball algorithm for linear bandit (Dani et al., 2008).
Further, if the feature space admits a tighter bound for the value function in this space, we can slightly modify our algorithm to achieve a sharper regret bound. To do this, we need to slightly change our Assumption 2 to Assumption 2′.
Assumption 2′ (Stronger Feature Regularity).
Let and be positive parameters.
We modify the algorithm slightly by using a Frobenius-norm matrix ball instead of the 2-1 norm and computing sharper confidence bounds. Let in (3.1), where
Then a sharper regret bound can be established.
The only stronger condition needed by Assumption 2′ is . It can be satisfied if is a set of sparse features, or if is a set of highly concentrated features.
We remark that in Theorem 1 and Theorem 2, we need to know the value of $N$ in $\beta_n$ before the algorithm runs. In the case when $N$ is unknown, one can use the doubling trick to learn $N$ adaptively: first we run the algorithm with an initial guess of $N$, then with successively doubled guesses until the true $N$ is reached. It is standard knowledge that this trick increases the overall regret by only a constant factor (e.g., (Besson and Kaufmann, 2018)).
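The doubling trick can be sketched as a simple schedule of guesses. The snippet below is a generic illustration, not the paper's procedure: the initial guess of 1 and the helper name `doubling_schedule` are assumptions, and in practice each phase would restart the learning algorithm with its confidence parameter set from the current guess.

```python
def doubling_schedule(total_episodes, first_guess=1):
    """Yield (guess, episodes_run) pairs until total_episodes have been played.

    Each phase runs the learner as if the horizon were `guess` episodes,
    then doubles the guess; the last phase is cut short when play ends.
    """
    played, guess = 0, first_guess
    while played < total_episodes:
        run = min(guess, total_episodes - played)
        yield guess, run
        played += run
        guess *= 2

schedule = list(doubling_schedule(10))
print(schedule)  # [(1, 1), (2, 2), (4, 4), (8, 3)]
```

Since the phase lengths grow geometrically, the number of restarts is logarithmic in the true number of episodes.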
Proof Sketch. The proof consists of two parts. We show that when the core matrix belongs to the sequence of constructed balls , the estimated Q-functions provide optimistic estimates of the optimal values, therefore the algorithm’s regret can be bounded using the sum of confidence bounds on the sample path. The second part constructs a martingale difference sequence by decomposing a matrix into an iterative sum and uses a probabilistic concentration argument to show that the “good” event happens with sufficiently high probability. Full proofs of Theorems 1 and 2 are deferred to the appendix.
Near optimality of regret bounds. The regret bound in Theorem 2 matches the optimal regret bound for linear bandit (Dani et al., 2008). In fact, linear bandit is a special case of RL: the planning horizon is $H = 1$. Therefore our bound is nearly optimal in $d$ and $T$.
Implementation. Algorithm 1 can be implemented easily in space $O(d^2)$. When implementing Step 6 using (3.1), we do not need to compute the entire $Q$-function, as the algorithm only queries the $Q$-values at particular encountered state-action pairs. For the computation of , we can apply random sampling over the columns of to accelerate the computation (see, e.g., (Drineas and Mahoney, 2016) for more details). We can also apply the random sampling method to compute the matrix and approximately.
Closed-form confidence bounds. Equation (3.1) requires maximization over a matrix ball. However, it is not necessary to solve this maximization problem explicitly. The algorithm only requires an optimistic $Q$-value. In fact, we can use a closed-form confidence bound instead of searching for the optimal $M$ in the confidence ball. It can be verified that Theorem 1 still holds (by following the same proof) if we replace the second equation of (3.1) with the following equation (see the proof of Theorem 2)
4 RL Exploration in Kernel Space
In this section, we transform MatrixRL to work with kernels instead of explicitly given features. Suppose we are given two reproducing kernel Hilbert spaces with kernel functions and , respectively. There exist implicit features such that and , but the learning algorithm can access the kernel functions only.
4.1 Kernelization of MatrixRL
The high-level idea of Kernelized MatrixRL is to represent all the features used in Algorithm 1 with their corresponding kernel representations. We first introduce some notation. For episode and , we denote and as the Gram matrices, respectively, i.e., for all ,
where . We denote and by
We are now ready to kernelize Algorithm 1. The full algorithm of Kernelized MatrixRL is given in Algorithm 2. Note that the new Q function estimator (6) is the dualization form of (3.1). Therefore Algorithm 2 is more general but essentially equivalent to Algorithm 1 if we let and . See Section B for the proof.
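The equivalence between a feature-based estimator and its kernelized form rests on a standard matrix identity, $(I + \Phi^\top\Phi)^{-1}\Phi^\top = \Phi^\top(I + \Phi\Phi^\top)^{-1}$. The sketch below checks this numerically for a generic ridge predictor with a linear kernel. It is an illustration of the dualization idea only: the names, the synthetic data, and the generic targets `Y` are assumptions, and the paper's kernelized $Q$-function estimator involves additional terms not reproduced here.

```python
import numpy as np

def primal_prediction(Phi, Y, phi_new):
    """phi_new^T (I + Phi^T Phi)^{-1} Phi^T Y, using explicit features."""
    d = Phi.shape[1]
    return phi_new @ np.linalg.solve(np.eye(d) + Phi.T @ Phi, Phi.T @ Y)

def dual_prediction(K, k_new, Y):
    """k_new^T (I + K)^{-1} Y, using only kernel evaluations."""
    t = K.shape[0]
    return k_new @ np.linalg.solve(np.eye(t) + K, Y)

rng = np.random.default_rng(2)
Phi = rng.normal(size=(30, 5))     # hypothetical explicit features of 30 visited pairs
Y = rng.normal(size=(30, 4))       # hypothetical regression targets
phi_new = rng.normal(size=5)

K = Phi @ Phi.T                    # Gram matrix of the linear kernel
k_new = Phi @ phi_new              # kernel evaluations against the new point
primal = primal_prediction(Phi, Y, phi_new)
dual = dual_prediction(K, k_new, Y)
print(np.allclose(primal, dual))
```

The dual form never touches `Phi` directly once `K` and `k_new` are available, which is what allows the same computation to run with an infinite-dimensional feature map.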
4.2 Regret Bound for Kernelized MatrixRL
We define the effective dimension of the kernel space as
where , with is the Gram matrix over data set . Note that captures the effective dimension of the space spanned by the features of state-action pairs. Consider the case when are -dimensional unit vectors, then has dimension at most . It can be verified that . A similar notion of effective dimension was introduced by (Valko et al., 2013) for analyzing kernelized contextual bandits.
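For intuition, the sketch below evaluates a log-determinant ratio of the form $\log\det(I + K_X)/\log(1 + t)$ on one sampled data set of size $t$. The normalization by $\log(1 + t)$, the restriction to a single data set rather than a supremum over data sets, and the name `log_det_ratio` are assumptions of this illustration. For $d$-dimensional unit-norm features with a linear kernel, $\det(I + K_X) = \det(I + X^\top X) \le (1 + t/d)^d$, so the ratio stays below $d$.

```python
import numpy as np

def log_det_ratio(K):
    """log det(I + K) / log(1 + t) for a t x t Gram matrix K."""
    t = K.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(t) + K)
    return logdet / np.log(1.0 + t)

rng = np.random.default_rng(3)
d, t = 4, 200
X = rng.normal(size=(t, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm features
ratio = log_det_ratio(X @ X.T)                  # linear kernel Gram matrix
print(ratio <= d)
```

`np.linalg.slogdet` is used instead of `np.log(np.linalg.det(...))` to avoid overflow for large Gram matrices.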
Further, we need regularity assumptions for the kernel space.
Let be generated by orthonormal basis on , i.e., there exists such that and . There exists a constant such that
where denotes the Hilbert space norm.
The formal guarantee of Kernelized MatrixRL is presented as follows.
Note that in Assumption 3, we can additionally relax the assumption on the orthogonality of . A similar regret bound can be proved under Assumption 2′. The proof of Theorem 3 is very similar to that of Theorem 2. Although Kernelized MatrixRL does not access the features, the proof is based on the underlying features and the equivalence between kernel representation and feature representation. We postpone it to Section B.
Remark. Similarly to MatrixRL, Kernelized MatrixRL can be generalized to deal with an unknown reward function by using the Kernelized Bandit (Valko et al., 2013). Again, since linear bandit problems are special cases of kernel RL with , our results match the linear bandit bound on and . The computation time of Kernelized MatrixRL scales with time as (by applying randomized algorithms, e.g., (Dani et al., 2008), in dealing with matrices), still polynomial in . We can apply random features or sketching techniques for kernels to additionally accelerate the computation (e.g., (Rahimi and Recht, 2008; Yang et al., 2017)).
This paper provides the algorithm MatrixRL for episodic reinforcement learning in high dimensions. It also provides the first regret bounds that are near-optimal in time $T$ and feature dimension $d$ and polynomial in the planning horizon $H$. MatrixRL uses given features (or kernels) to estimate a core transition matrix and its confidence ball, which is used to compute optimistic Q-functions for balancing the exploitation-exploration tradeoff. We prove that the regret of MatrixRL is bounded by $O(H^2 d \log T \sqrt{T})$, where $d$ is the number of features, provided that the feature space satisfies some regularity conditions. MatrixRL has an equivalent kernel version, which does not require explicit features. The kernelized MatrixRL satisfies a regret bound $O(H^2 \tilde{d} \log T \sqrt{T})$, where $\tilde{d}$ is the effective dimension of the kernel space. For future work, it remains open whether the regularity condition can be relaxed and whether there is a more efficient way of constructing confidence balls in order to further reduce the regret.
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320.
- Agrawal and Jia (2017) Agrawal, S. and Jia, R. (2017). Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194.
- Azar et al. (2013) Azar, M. G., Munos, R., and Kappen, H. J. (2013). Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349.
- Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449.
- Azizzadenesheli et al. (2018) Azizzadenesheli, K., Brunskill, E., and Anandkumar, A. (2018). Efficient exploration through bayesian deep q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE.
- Baird (1995) Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
- Bellman (1966) Bellman, R. (1966). Dynamic programming. Science, 153(3731):34–37.
- Besson and Kaufmann (2018) Besson, L. and Kaufmann, E. (2018). What doubling tricks can and can’t do for multi-armed bandits. arXiv preprint arXiv:1803.06971.
- Bubeck et al. (2012) Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122.
- Chowdhury and Gopalan (2017) Chowdhury, S. R. and Gopalan, A. (2017). On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 844–853. JMLR. org.
- Chowdhury and Gopalan (2019) Chowdhury, S. R. and Gopalan, A. (2019). Online learning in kernelized markov decision processes. In Chaudhuri, K. and Sugiyama, M., editors, Proceedings of Machine Learning Research, volume 89, pages 3197–3205. PMLR.
- Chu et al. (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. E. (2011). Contextual Bandits with Linear Payoff Functions. Technical report.
- Dani et al. (2008) Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback.
- Dann and Brunskill (2015) Dann, C. and Brunskill, E. (2015). Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, pages 2818–2826.
- Dann et al. (2018) Dann, C., Li, L., Wei, W., and Brunskill, E. (2018). Policy certificates: Towards accountable reinforcement learning. arXiv preprint arXiv:1811.03056.
- Drineas and Mahoney (2016) Drineas, P. and Mahoney, M. W. (2016). Randnla: randomized numerical linear algebra. Communications of the ACM, 59(6):80–90.
- Freedman (1975) Freedman, D. A. (1975). On tail probabilities for martingales. The Annals of Probability, 3(1):100–118.
- Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.
- Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873.
- Kakade (2003) Kakade, S. M. (2003). On the sample complexity of reinforcement learning. PhD thesis, University of London London, England.
- Kober et al. (2013) Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
- Lattimore and Hutter (2014) Lattimore, T. and Hutter, M. (2014). Near-optimal pac bounds for discounted mdps. Theoretical Computer Science, 558:125–143.
- Li et al. (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM.
- Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.
- Modi and Tewari (2019) Modi, A. and Tewari, A. (2019). Contextual markov decision processes using generalized linear models. arXiv preprint arXiv:1903.06187.
- Osband and Van Roy (2016) Osband, I. and Van Roy, B. (2016). On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732.
- Osband et al. (2017) Osband, I., Van Roy, B., Russo, D., and Wen, Z. (2017). Deep exploration via randomized value functions. arXiv preprint arXiv:1703.07608.
- Parr et al. (2008) Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pages 752–759. ACM.
- Rahimi and Recht (2008) Rahimi, A. and Recht, B. (2008). Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184.
- Rusmevichientong and Tsitsiklis (2010) Rusmevichientong, P. and Tsitsiklis, J. N. (2010). Linearly parameterized bandits. Mathematics of Operations Research, 35(2):395–411.
- Shalev-Shwartz et al. (2016) Shalev-Shwartz, S., Shammah, S., and Shashua, A. (2016). Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295.
- Shawe-Taylor et al. (2004) Shawe-Taylor, J., Cristianini, N., et al. (2004). Kernel methods for pattern analysis. Cambridge university press.
- Sidford et al. (2018) Sidford, A., Wang, M., Wu, X., Yang, L. F., and Ye, Y. (2018). Near-optimal time and sample complexities for solving discounted markov decision process with a generative model. arXiv preprint arXiv:1806.01492.
- Silver et al. (2017) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550(7676):354.
- Strehl et al. (2009) Strehl, A. L., Li, L., and Littman, M. L. (2009). Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10(Nov):2413–2444.
- Strehl et al. (2006) Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, M. L. (2006). Pac model-free reinforcement learning. In Proceedings of the 23rd international conference on Machine learning, pages 881–888. ACM.
- Szita and Szepesvári (2010) Szita, I. and Szepesvári, C. (2010). Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 1031–1038.
- Tsitsiklis and Van Roy (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). Analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems, pages 1075–1081.
- Valko et al. (2013) Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. (2013). Finite-time analysis of kernelised contextual bandits. pages 654–663.
- Yang and Wang (2019) Yang, L. F. and Wang, M. (2019). Sample-optimal parametric q-learning with linear transition models. arXiv preprint arXiv:1902.04779.
- Yang et al. (2017) Yang, Y., Pilanci, M., Wainwright, M. J., et al. (2017). Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics, 45(3):991–1023.
Appendix A Analysis and Proofs
In this section we will focus on proving Theorem 1. In the proof we will also establish all the necessary analytical tools for proving Theorem 2 and Theorem 3. We provide the proofs of the last two theorems at the end of this section.
The proof of Theorem 1 consists of two steps: (a) We first show that if the true transition core is always in the confidence ball , defined in Equation 5, we can then achieve the desired regret bound; (b) We then show that with high probability, the event required by (a) happens. We formalize the event required by step (a) as follows.
Definition 2 (Good Estimator Event).
For all $n \in [N]$, we denote $E_n = 1$ if $M^* \in B_{n'}$ for all $n' \le n$, and $E_n = 0$ otherwise.
Note that $E_n$ is completely determined by the game history up to episode $n$. In the next section, we show (a).
A.1 Regret Under Good Event
To better investigate the regret formulation (1), we rewrite it according to Algorithm 1. Note that conditioning on the history before episode , the algorithm plays a fixed policy for episode . Therefore, we have
where . We now show that the algorithm always plays an optimistic action (an action with value estimated greater than the optimal value of the state).
Lemma 4 (Optimism).
Suppose for , we have the good estimator event, , happens. Then for and , we have
We prove the lemma by induction on . It is vacuously true for the case since . Suppose the lemma holds for some . We then have
We now consider . Note that
Next we show that the confidence ball actually gives a strong upper bound for the estimation error: the estimation error is “along” the direction of the exploration.
For any we have
as desired. ∎
Next we show that the value iteration per-step does not introduce too much error.
Suppose for , . Then for , we have
We then have
We are now ready to show the regret bound.
Suppose Assumption 2 holds, , then,
Consider for a fixed . Denote as the filtration of fixing the history up to time (i.e., fixing but not ). Since if , we can always bound . We then have
We then denote