Policy evaluation is a key step in many reinforcement learning systems. Policy evaluation approximates the value of each state (the expected future sum of rewards) given a policy and either a model of the world or a stream of data produced by an agent’s choices. In classical policy iteration schemes, the agent continually alternates between improving the policy using the current approximation of the value function, and updating the approximate value function for the new policy. Policy search methods like actor-critic estimate the value function of the current policy to perform gradient updates for the policy.
However, there has been relatively little research into methods for accurately evaluating policy evaluation algorithms when the true values are not available. In most domains where we are interested in performing policy evaluation, it is difficult or impossible to compute the true value function. We may not have access to the transition probabilities or the reward function in every state, making it impossible to obtain the closed-form solution of the true value function. Even if we have access to a full model of the environment, we may not be able to represent the value function if the number of states is too large or the state is continuous. Aside from small finite MDPs like gridworlds and random MDPs, where closed-form solutions can be computed (Geist and Scherrer, 2014; White and White, 2016), we often do not have access to the true value function. In nearly all our well-known benchmark domains, such as Mountain Car, Puddle World, Cart Pole, and Acrobot, we must turn to some other method to evaluate learning progress.
One option that has been considered is to estimate the objective minimized by the algorithms. Several papers (Sutton et al., 2008; Du et al., 2017) have compared the performance of the algorithms in terms of their target objective on a batch of samples, using the approximate linear system for the mean-squared projected Bellman error (MSPBE). One estimator, called RUPEE (White, 2015), is designed to incrementally approximate the MSPBE by keeping a running average across data produced during learning. Some terms, such as the feature covariance matrix, can be estimated easily; however, one component of the MSPBE includes the current weights, and is biased by this moving average approach. More problematically, some algorithms do not converge to the minimum of the MSPBE, such as residual gradient for the mean-squared Bellman error (Baird, 1995) or Emphatic Temporal Difference (ETD) learning (Sutton et al., 2016), which minimize a variant of the MSPBE with a different weighting. This approach, therefore, is limited to comparing algorithms that minimize the same objective.
The more standard approach has been to use rollouts from states to obtain samples of returns. To obtain these rollout estimates, three parameters need to be chosen: the number of states from which to roll out, the number of rollouts or trajectories, and the length of each rollout. Given these rollouts, the true values can be estimated from each of the chosen states, stored offline, and then used for comparison repeatedly during experiments. These evaluation schemes, however, have used intuitively chosen parameters, without any guarantee that the distance to the true values, the error, is well estimated. Early work comparing gradient TD algorithms (Maei et al., 2009) used 2500 sampled trajectories, but compared to returns rather than value estimates. For several empirical studies using benchmark domains, like Mountain Car and Acrobot, there are a variety of choices, including (Gehring et al., 2016); , and 1000 length rollouts (Pan et al., 2017); and , (Le et al., 2017). For a continuous physical system, Dann et al. (2014) used as few as 10 rollouts from a state. Other papers have mentioned that extensive rollouts are used, but did not describe how (Konidaris et al., 2011; Dabney and Thomas, 2014). (Note that Boyan and Moore (1995) used rollouts for a complementary purpose, to train a nonlinear value function, rather than for evaluating policy evaluation algorithms.) In general, as new policy evaluation algorithms are derived, it is essential to find a solution to this open problem: How can we confidently compare value estimates returned by our algorithms?
In this work, we provide an algorithm that ensures, with high probability, that the estimated distance has small error in approximating the true distance between the true value function and an arbitrary estimate. We focus in the main body of the paper on the clipped mean-absolute percentage value error (CMAPVE) as a representative example of the general strategy. We provide additional results for a variety of other losses in the appendix, to facilitate use for a broader range of error choices. We conclude by demonstrating the rollout parameters chosen for several case studies, highlighting that previous intuitive choices did not effectively direct sampling. We hope for this algorithm to become a standard approach for generating estimates of the true values to facilitate comparison of policy evaluation algorithms by reinforcement learning researchers.
2 Measures of Learning Performance
This paper investigates the problem of comparing algorithms that estimate the discounted sum of future rewards incrementally for a fixed policy. In this section, we first introduce the policy evaluation problem and motivate a particular measure of learning performance for policy evaluation algorithms. In the following section, we discuss how to estimate this measure.
We model the agent’s interaction with the world as a Markov decision process (MDP), defined by a (potentially uncountable) set of states, a finite set of actions , transitions , rewards and a scalar discount function . On each time step , the agent selects an action according to its behaviour policy , the environment transitions into a new state and the agent receives a scalar reward . In policy evaluation, the agent’s objective is to estimate the expected return
where is called the state-value function for the target policy . From a stream of data, the agent incrementally approximates this value function, . For experiments, to report learning curves, we need to measure the accuracy of this estimate every step or at least periodically, such as every 10 steps.
For policy evaluation, when the policy remains fixed, the value error remains the gold standard of evaluation. Ignoring how useful the value function is for policy improvement, our only goal is accuracy with respect to . Assume some weighting , a probability distribution over states. Given access to the true values, it is common to estimate the mean squared value error
or the mean absolute value error
The integral is replaced with a sum if the set of states is finite. Because we consider how to estimate this error for continuous state domains—for which it is more difficult to directly estimate —we preferentially assume the states are continuous.
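For reference, the two errors can be written out in one common notation. The symbols here are assumptions made for this sketch, since they are not fixed by the surrounding text: $v_\pi$ for the true value function, $V$ for the learned estimate, $d$ for the state weighting and $\mathcal{S}$ for the state space.

```latex
\mathrm{MSVE}(V) = \int_{\mathcal{S}} d(s)\,\bigl(V(s) - v_\pi(s)\bigr)^2 \, ds ,
\qquad
\mathrm{MAVE}(V) = \int_{\mathcal{S}} d(s)\,\bigl|V(s) - v_\pi(s)\bigr| \, ds .
```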
These losses, however, have several issues, beyond estimating them. The key issue is that the scale of the returns can be quite different across states. This skews the loss and, as we will see, makes it more difficult to get high-accuracy estimates. Consider a cost-to-goal problem, where the agent receives a reward of -1 per step. From one state the value could be and for another it could be . For a prediction of and respectively, the absolute value error for both states would be . However, the prediction of for the first state is quite accurate, whereas a prediction of for the second state is highly inaccurate.
One alternative is to estimate a percentage error, or relative error. The mean absolute percentage value error is
for some . The term in the denominator ensures the MAPVE does not become excessively high, if true values of states are zero or near zero. For example, for , the MAPVE is essentially the MAVE for small , which reflects that small absolute differences are meaningful for these smaller numbers. For large , the has little effect, and the MAPVE becomes a true percentage error, reflecting the fact that we are particularly interested in relative errors for larger .
Additionally, the MAPVE can be quite large if is highly inaccurate. When estimating these performance measures, however, it is uninteresting to focus on obtaining high-accuracy estimates of a very large MAPVE. Rather, it is sufficient to report that is highly inaccurate, and focus the estimation of the loss on more accurate . Towards this goal, we introduce the clipped MAPVE
for some . This provides a maximum percentage error. For example, setting caps error estimates for approximate values that are worse than inaccurate. Such a level of inaccuracy is already high, and when comparing policy evaluation algorithms, we are much more interested in their percentage error—particularly compared to each other—once we are within a reasonable range around the true values. Note that can be chosen to be the maximum value of the loss, and so the following results remain quite general.
Though many losses could be considered, we put forward the CMAPVE as a proposed standard for policy evaluation. The parameters and can both be appropriately chosen by the experimentalist, for a given MDP. These parameters give sufficient flexibility in highlighting differences between algorithms, while still enabling high-confidence estimates of these errors, which we discuss next. For this reason, we use CMAPVE as the loss in the main body of the text. However, for completeness, we also show how to modify the analysis and algorithms for other losses in the appendix.
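As a rough illustration of the losses discussed in this section, the sketch below computes the MAPVE and the CMAPVE over a finite set of states. It assumes the states were sampled from the weighting distribution, so that a plain average stands in for the integral. The function names, the argument names `tau` (the denominator offset) and `c` (the clipping level), and the example values are all hypothetical choices made for this illustration.

```python
def per_state_percentage_errors(v_true, v_hat, tau):
    """Absolute error of each state divided by (|true value| + tau)."""
    return [abs(vh - vt) / (abs(vt) + tau) for vt, vh in zip(v_true, v_hat)]


def mapve(v_true, v_hat, tau):
    """Mean absolute percentage value error over the given states."""
    errs = per_state_percentage_errors(v_true, v_hat, tau)
    return sum(errs) / len(errs)


def cmapve(v_true, v_hat, tau, c):
    """Clipped MAPVE: each state's percentage error is capped at c."""
    errs = per_state_percentage_errors(v_true, v_hat, tau)
    return sum(min(c, e) for e in errs) / len(errs)


# Hypothetical cost-to-goal values: one state far from the goal, one near it.
v_true = [-1000.0, -1.0]
v_hat = [-990.0, -11.0]  # both predictions are off by 10 in absolute terms

print(per_state_percentage_errors(v_true, v_hat, tau=1.0))
print(mapve(v_true, v_hat, tau=1.0))
print(cmapve(v_true, v_hat, tau=1.0, c=2.0))
```

With these made-up numbers the absolute error is the same in both states, while the percentage error is about 1% in the first state and 500% in the second; clipping at `c=2.0` limits how much the second state can dominate the average.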
3 High-Confidence Estimates of Value Error
Our goal now is to approximate the value error, CMAPVE, with high-confidence, for any value function . Instead of approximating the error directly for each , the typical approach is to estimate as accurately as possible, for a large set of states . Given these high-accuracy estimates , the true expected error can be approximated from this subset of states for any .
Since the CMAPVE needs to be computed frequently, for many steps during learning potentially across many algorithms, it is important for this estimate of CMAPVE to be efficiently computable. An important requirement, then, is for the number of states to be as small as possible, so that all the can be stored and the summed difference is quick to compute.
One possible approach is to estimate the true value function using a powerful function approximator, offline. A large batch of data could be gathered, and a learning method used to train . This large function approximator would not even need to be stored: only would need to be saved once this offline procedure was complete. This approach, however, will be biased by the form of the function approximator, which can favor certain policy evaluation algorithms during evaluation. Further, it is difficult to quantify this bias, particularly in a general way agnostic to the type of function approximator an experimentalist might use for their learning setting.
An alternative strategy is to use many sampled rollouts from this subset of states. This strategy is general—requiring only access to samples from the MDP. A much larger number of interactions can be used with the MDP, to compute , because this is computed once, offline, to facilitate many further empirical comparisons between algorithms after. For example, one may want to examine the early learning performance of two different policy evaluation algorithms—which may themselves receive only a small number of samples. The cached then enables computing this early learning performance. However, even offline, there are limits to how many samples can be computed feasibly, particularly for computationally costly simulators (Dann et al., 2014).
Therefore, our goal is the following: how can we efficiently compute high-confidence estimates of CMAPVE, using a minimal number of offline rollouts? The choice of a clipped loss actually enables the number of states to remain small (shown in Lemma 2), enabling efficient computation of CMAPVE. In the next section, we address the second point: how to obtain high-confidence estimates, given access to that approximates . In the following section, we discuss how to obtain these .
We first provide an overview of the approach, to make it easier to follow the argument. We additionally include a notation table (Table 1), particularly to help discern the various value functions.
Table 1: Notation.
|true values for policy|
|true values for policy , when using truncated rollouts to length|
|estimated values for policy using rollouts, when using truncated rollouts to length|
|estimated values for policy , being evaluated|
|distribution over the states ,|
|number of states ,|
|true error, under|
|an upper bound on the maximum absolute value reward,|
|maximum absolute value for the policy for any state, e.g.,|
|the number of times the error estimate is queried|
First, we consider several value function approximations, for use within the bound, summarized in Table 1. The goal is to determine the accuracy of the estimates of the learned with respect to the true values . We estimate true values for using repeated rollouts from ; this results in two forms of error. The first is due to truncated rollouts, which for the continuing case would otherwise be infinitely long. The second source of error is due to using an empirical estimate of the true values, by averaging sampled returns. We denote as the true values, for truncated returns, and as the sample estimate of from truncated rollouts.
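The sample estimate from truncated rollouts can be sketched as follows. This is a minimal illustration that assumes a constant discount `gamma` (the text allows a more general discount function) and that each rollout is already available as a list of rewards; the function names are hypothetical.

```python
def truncated_return(rewards, gamma):
    """Discounted sum of a (possibly truncated) sequence of rewards."""
    g = 0.0
    # Accumulate backwards: G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def rollout_value_estimate(reward_trajectories, gamma):
    """Average the truncated returns of several rollouts from one state."""
    returns = [truncated_return(tr, gamma) for tr in reward_trajectories]
    return sum(returns) / len(returns)


# Two hypothetical rollouts of length 3 from the same start state.
print(rollout_value_estimate([[1.0, 1.0, 1.0], [0.0, 0.0, 2.0]], gamma=0.5))
```

The two error sources named above map onto this sketch directly: the length of each reward list controls the truncation error, and the number of lists controls the sampling error of the average.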
Second, we consider the approximation in computing the loss: the difference between and . We consider the true loss and the approximate loss , in Table 1. The argument in Theorem 1 revolves around upper bounding the difference between these two losses, in terms of three terms. These terms are bounded in Lemmas 2, 3 and 4. Lemma 2 bounds the error due to sampling only a subset of . Lemma 3 bounds the error from approximating with truncated rollouts. Lemma 4 bounds the error from dividing by instead of .
Finally, to obtain this general bound, we first assume that we can obtain highly-accurate estimates of . We state these two assumptions in Assumptions 1 and 2. These estimates could be obtained with a variety of sampling strategies, and so we separate this step from the main proof. We later develop one algorithm to obtain such estimates, in Section 4.
3.2 Main Result
We will compute rollout values from a set of sampled states . Each rollout consists of a trajectory simulated, or rolled out, some number of steps. The length of this trajectory can itself be random, depending on whether an episode terminates or whether the trajectory is estimated to be a sufficiently accurate sample of the full, non-truncated return. We first assume that we have access to such trajectories and rollout estimates and in later sections show how to obtain such trajectories and estimates.
For any and sampled state , the trajectory lengths are specified such that,
Starting from , assume you have trajectories of rewards for trajectory index and rollout index for a trajectory length that depends on the trajectory. The approximated rollout values
are an () -approximation to the true expected values, where is an instance of the random variable
is an instance of the random variable
i.e., for , with probability at least , the following holds for all states
Proof: We need to bound the errors introduced from having a reduced number of states, a finite set of trajectories to approximate the expected returns for each of those states and truncated rollouts to get estimates of returns. To do so, we first consider the difference under the approximate clipped loss, to the true value function.
The first component is bounded in Lemma 2. For the second component, notice that
However, these two differences are difficult to compare, because they have different denominators: the first has , whereas the second has . We therefore further separate each component in the sum
The first difference has the same denominator, so
Therefore, putting it all together, we have
where the first, second and third components are bounded in Lemmas 2, 3 and 4 respectively. Finally, Hoeffding’s bound (Lemma 2) fails with probability at most and Assumption 2 fails with probability at most ; by the union bound, the final bound holds with probability at least .
Lemma 2 (Dependence on ).
Suppose the empirical loss mean estimates are computed number of times. Then with probability at least :
Proof: Since the empirical mean loss over the sampled states is an unbiased estimate of the true loss, we can use Hoeffding’s bound for variables bounded between . For any of the times, the concentration probability is as follows:
Thus, due to union bound over all the times, for all those empirical loss mean estimates, the following holds
Rearranging the above, to express in terms of ,
Therefore, with probability at least ,
Proof: We can split up this error into sampling error for a finite length rollout and for a finite number of trajectories. We can consider the unclipped error, which is an upper bound on the clipped error.
These two terms are both bounded by , by assumption.
Under Assumption 2,
Proof: We need to bound the difference due to the difference in normalizer. To do so, we simply need to find a constant such that
The key is to lower bound , which results in an upper bound on the first term and consequently an upper bound on the difference between the two terms.
where the second inequality is due to Assumption 2. Now further
So, for and , the term upper bounds the difference. Because and , this term is maximized when , and . In the worst case, therefore, which finishes the proof.
3.3 Satisfying the Assumptions
The bounds above relied heavily on accurate sample estimates of . To obtain Assumption 1, we need to roll out trajectories sufficiently far to ensure that truncated sampled returns do not incur too much bias. For problems with discounting, for , the returns can be truncated once becomes sufficiently small, as the remaining terms in the sum for the return have negligible weight. For episodic problems with no discounting, it is likely that trajectories need to be simulated until termination, since rewards beyond the truncation horizon would not be discounted and so could have considerable weight.
We show how to satisfy Assumption 1, for the discounted setting. Note that for the trivial setting of , it is sufficient to use , so we assume .
For and , if
then satisfies Assumption 1:
Proof: The first component can be bounded as
Setting as in (14) ensures , completing the proof.
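The idea of truncating once the discounted tail is negligible can be sketched numerically. The snippet below is an illustration under stated assumptions, not the exact condition of the lemma: it assumes a constant discount `gamma` in (0, 1), rewards bounded in absolute value by `r_max`, and an absolute tolerance `eps` on the truncation bias. Under those assumptions the rewards after step `l` contribute at most `gamma**l * r_max / (1 - gamma)`. All names are hypothetical.

```python
import math


def truncation_length(gamma, r_max, eps):
    """A rollout length l with gamma**l * r_max / (1 - gamma) <= eps.

    Assumes a constant discount 0 < gamma < 1 and |reward| <= r_max.
    """
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if r_max <= 0.0 or eps <= 0.0:
        raise ValueError("r_max and eps must be positive")
    target = eps * (1.0 - gamma) / r_max  # required upper bound on gamma**l
    if target >= 1.0:
        return 0  # even the untruncated return is bounded by eps
    return math.ceil(math.log(target) / math.log(gamma))


l = truncation_length(gamma=0.9, r_max=1.0, eps=0.01)
print(l, 0.9 ** l / (1.0 - 0.9))
```

Because the required length grows like `log(1/eps) / log(1/gamma)`, tightening `eps` is cheap, whereas pushing `gamma` towards 1 is what makes rollouts long.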
For Assumption 2, we need a stopping rule for sampling truncated returns that ensures is within of the true expected value of the truncated returns, . The idea is to continue sampling truncated returns until the confidence interval around the mean estimate shrinks sufficiently to ensure, with high probability, that the value estimates are within of the true values. Such stopping rules have been designed for non-negative random variables (Domingos and Hulten, 2001; Dagum et al., 2006), and extended to more general random variables (Mnih et al., 2008). We defer the development of such an algorithm for this setting until the next section.
4 The Rollout Algorithm
We can now design a high-confidence algorithm for estimating the accuracy of a value function. Practically, the most important number to reduce is , because these values will be stored and used for comparisons on each step. The choice of a clipped loss, however, makes it more manageable to control . In this section, we focus more on how much the variability in trajectories, and trajectory length, impact the number of required samples.
The general algorithm framework is given in Algorithm 1. The algorithm is straightforward once given an algorithm to sample rollouts from a given state. The rollout algorithm is where development can be directed, to reduce the required number of samples. This rollout algorithm needs to be designed to satisfy Assumptions 1 and 2. We have already shown how to select trajectory lengths to satisfy Assumption 1. Below, we describe how to select and how to satisfy Assumption 2.
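The outer framework can be sketched in a few lines. This is only an illustration of the structure described above, not a transcription of Algorithm 1: `sample_state` and `estimate_value` are hypothetical callables supplied by the experimenter, where `estimate_value` stands for whatever rollout procedure is used to satisfy Assumptions 1 and 2.

```python
import random


def estimate_true_values(sample_state, estimate_value, num_states):
    """Draw states and cache one rollout-based value estimate per state."""
    states, values = [], []
    for _ in range(num_states):
        s = sample_state()                # state drawn from the chosen weighting
        states.append(s)
        values.append(estimate_value(s))  # expensive step, done once and offline
    return states, values


# Toy stand-ins, purely for illustration: states are numbers in [0, 1]
# and the "rollout estimate" is a known function of the state.
rng = random.Random(0)
states, values = estimate_true_values(
    sample_state=lambda: rng.uniform(0.0, 1.0),
    estimate_value=lambda s: 2.0 * s,
    num_states=5,
)
print(list(zip(states, values)))
```

The cached `states` and `values` are what later comparisons reuse on every evaluation step, which is why the number of sampled states matters most for storage and query time.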
Specifying the number of sampled states .
For the number of required samples for the outer loop in Algorithm 1, we need enough samples to match the bound in Lemma 2.
is chosen as and thus we are being slightly conservative regarding the error to ensure correctness with high probability. We opt for a separate choice of for this part of the bound, because it is completely separate from the other errors. This number could be chosen slightly larger, to reduce the number of required sampled states to compare to, whereas might need to be smaller depending on the choice of and . Separating them explicitly can significantly reduce the in the outer loop, both improving time and storage, as well as later comparison time, without impacting the accuracy of the algorithm.
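To give a feel for the numbers involved, the sketch below inverts the standard two-sided Hoeffding bound together with a union bound over repeated queries. It assumes the per-state clipped loss lies in `[0, c]`, that states are drawn i.i.d. from the weighting distribution, and that the error estimate is queried `num_queries` times; the constants in the lemma itself may differ, and all names here are hypothetical.

```python
import math


def num_states_hoeffding(eps, delta, c, num_queries=1):
    """Number of sampled states m such that, with probability >= 1 - delta,
    all num_queries empirical mean losses are within eps of their expectations.

    Uses 2 * num_queries * exp(-2 * m * eps**2 / c**2) <= delta.
    """
    if eps <= 0.0 or c <= 0.0 or not 0.0 < delta < 1.0 or num_queries < 1:
        raise ValueError("invalid parameters")
    return math.ceil(c ** 2 * math.log(2.0 * num_queries / delta) / (2.0 * eps ** 2))


print(num_states_hoeffding(eps=0.1, delta=0.05, c=1.0))
print(num_states_hoeffding(eps=0.1, delta=0.05, c=1.0, num_queries=1000))
```

The dependence on the number of queries is only logarithmic, while the dependence on the clipping level `c` is quadratic, which is one way to see why a clipped loss keeps the number of stored states manageable.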
Satisfying Assumption 2.
Our goal is to get an ()-approximation of , with a feasible number of samples. In many cases, it is difficult to make parametric assumptions about returns in reinforcement learning. A simple strategy is to use a stopping rule for generating returns, based on general concentration inequalities, like Hoeffding’s bound, that make few assumptions about the random variables. If we had a bit more information, however, such as the variance of the returns, we could obtain a tighter bound, using Bernstein’s inequality and so reduce the number of required samples. We cannot know this variance a priori, but fortunately an empirical Bernstein bound has been developed (Mnih et al., 2008). Using this bound, Mnih et al. (2008) designed EBGStop, which incrementally estimates variance and significantly reduces the number of samples required to get high-confidence estimates.
EBGStop can be used, without modification, given a mechanism to sample truncated returns that satisfy Assumption 1. However, we generalize the algorithm to allow for our less restrictive condition , as opposed to the original algorithm which ensured . When in our algorithm, it reduces to the original; since this is a generalization of that algorithm, we continue to call it EBGStop. This modification is important when , since this would require when . For , once the accuracy is within , the algorithm can stop. The algorithm is summarized in Algorithm 2. The proof closely follows the proof for EBGStop; we include it in Appendix A. Algorithm 2 uses geometric sampling, like EBGStop, to improve sample efficiency. The idea is to avoid checking the stopping condition after every sample. Instead, for some , the condition is checked after samples; the next check occurs at . This modification improves sample efficiency from a multiplicative factor of to , where is the range of the random variables and is the mean.
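To make the stopping idea concrete, here is a simplified empirical Bernstein stopping rule. It is a sketch under stated assumptions and not Algorithm 2 or EBGStop itself: it checks the condition after every sample rather than on a geometric grid, it spends the failure probability as `delta * (p - 1) / (p * t**p)` at step `t`, and it stops once the confidence half-width is at most `eps * (L + tau)`, where `L` is a lower confidence bound on the absolute mean. The function name, the parameter `p` and the `max_samples` cap are hypothetical.

```python
import math


def empirical_bernstein_stop(sample, value_range, eps, tau, delta,
                             p=1.1, max_samples=1_000_000):
    """Sample until the mean is, with probability >= 1 - delta, within
    eps * (|true mean| + tau) of the true mean (or max_samples is reached).

    `sample` is a zero-argument callable returning one bounded draw;
    `value_range` is the width of the interval containing the draws.
    """
    budget = delta * (p - 1.0) / p  # sum over t of budget / t**p is <= delta
    n, mean, m2 = 0, 0.0, 0.0
    while n < max_samples:
        x = sample()
        n += 1
        # Welford's online update of the mean and sum of squared deviations.
        d = x - mean
        mean += d / n
        m2 += d * (x - mean)
        std = math.sqrt(max(m2, 0.0) / n)
        log_term = math.log(3.0 * n ** p / budget)
        half_width = (std * math.sqrt(2.0 * log_term / n)
                      + 3.0 * value_range * log_term / n)
        lower_abs = max(0.0, abs(mean) - half_width)  # lower bound on |true mean|
        if half_width <= eps * (lower_abs + tau):
            break
    return mean, n


# Degenerate example: a constant return, so the empirical variance is zero.
estimate, used = empirical_bernstein_stop(lambda: 0.5, value_range=1.0,
                                          eps=0.1, tau=1.0, delta=0.05)
print(estimate, used)
```

Even with zero variance the rule needs a few hundred samples here, because the range-dependent term of the bound only shrinks at rate `1/n`; with noisy returns the variance term, shrinking at rate `1/sqrt(n)`, typically dominates.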
Algorithm 2 returns an -approximation :
For any and , Algorithm 1 returns an -accurate approximation: with probability at least ,
Figure caption (fragment): An additional point of interest is that a few states required significantly more sampled returns, indicated by the outliers depicted as individual points.
5 Experiments on Benchmark Problems
We investigate the required number of samples to get with a level of accuracy, for different probability levels. We report this for two continuous-state benchmark problems—Mountain Car and Puddle World—which have previously been used to compare policy evaluation algorithms. Our goal is to (a) demonstrate how this framework can be used to obtain high-confidence estimates of accuracy and (b) provide some insight into how many samples are needed, even for simple reinforcement learning benchmark domains.
We report the number of returns sampled by Algorithm 2, averaged across several states. The domains, Mountain Car and Puddle World, are as specified in the policy evaluation experiments by Pan et al. (2017). For Mountain Car, we use the energy pumping policy, with random action selection for the three actions. For Puddle World, we used a uniform random policy for the four actions. They are both episodic tasks, with a maximum absolute value of and respectively. The variance in Puddle World is particularly high, as it has regions with high-variance, high-magnitude rewards. We sampled states uniformly across the state-space, to provide some insight into the variability of the number of returns sampled across the state-space. We tested and , and set . We focus here on how many returns need to be sampled, rather than the trajectory length, and so do not use nor explicitly compute clipped errors .
The results indicate that EBGStop requires a large number of samples, particularly in Puddle World. Figure 0(b) for Mountain Car and Figure 0(a) for Puddle World both indicate that decreasing from to , to enable higher-accuracy estimates of value function error, causes an exponential increase in the required number of samples, an increase of to for Mountain Car and to for Puddle World. An accuracy level of , which corresponds to a difference of 1% for clipped errors, is a typical choice for policy evaluation experiments, yet requires an inordinate number of samples, particularly in Puddle World.
We further investigated lower bounds on the required number of samples. Though EBGStop is a state-of-the-art stopping algorithm, to ensure high-confidence bounds for any distribution with bounded mean and variance, it collects more samples than is actually required. To assess its efficiency gap, we also include an idealistic approach to computing the confidence intervals, using repeated subsamples computed from the simulator. By obtaining a very large number of estimates of the sample average, using samples of the truncated return, we can estimate the actual variability of the sample average. We provide additional details in Appendix D. Such a method to compute the confidence interval is not a viable algorithm to reduce the number of samples generated. Rather, the goal here is to report a lower bound on the number of samples required, for comparison and to indicate how much the sampling algorithm could be improved. The number of samples generated by EBGStop is typically between 10 and 100 times more than the optimal number of samples, which indicates that there is much room to improve sample efficiency.
In this work, we present the first principled approach to obtain high-confidence error estimates of learned value functions. Our strategy is focused on the setting tackled by reinforcement learning empiricists, comparing value function-learning algorithms. In this context, the accuracy of value estimates, for multiple algorithms, needs to be computed repeatedly, every few steps with increasing data given to the learning algorithms. We provide a general framework for such a setting, where we store estimates of true value functions using samples of truncated returns. The framework for estimating true values for comparison is intentionally generic, to enable any (sample-efficient) stopping algorithm to be used. We propose one solution, which uses empirical Bernstein bounds, to significantly reduce the required number of samples over other concentration inequalities, such as Hoeffding’s bound.
This paper highlights several open challenges. As demonstrated in the experiments, there is a large gap between the actual required number of samples and that provided by the algorithm using an empirical Bernstein stopping-rule. For some simulators, this overestimate could result in a prohibitively large number of samples. Although this is a problem more generally faced by the sampling literature, it is particularly exacerbated in reinforcement learning where the variability across states and returns can be high, with large maximum values. An important avenue, then, is to develop more sample-efficient sampling algorithms to make high-confidence error estimates feasible for a broader range of settings in reinforcement learning.
Another open challenge is to address how to sample states . This paper is agnostic to how these states are obtained. However, it is not always straightforward to sample these from a desired distribution. Some choices are simple, such as randomly selecting these across the state space. For other cases, it is more complicated, such as sampling these from the stationary distribution of the behaviour policy, . The typical strategy is to run for a burn-in period, so that afterwards it is more likely for states to be sampled from the stationary distribution. The theoretical effectiveness of this strategy, however, is not yet well-understood. There has been work estimating empirical mixing times (Hsu et al., 2015) and some work bounding the number of samples required for burn-in (Paulin, 2015). Nonetheless, it remains an important open question on how to adapt these results for the general reinforcement learning setting.
One goal of this paper has been to highlight an open problem that has largely been ignored by reinforcement learning empiricists. We hope for this framework to stimulate further work in high-confidence estimates of value function accuracy.
References
- Geist and Scherrer  Matthieu Geist and Bruno Scherrer. Off-policy learning with eligibility traces: a survey. The Journal of Machine Learning Research, 2014.
- White and White  Adam M White and Martha White. Investigating practical, linear temporal difference learning. In International Conference on Autonomous Agents and Multiagent Systems, 2016.
- Sutton et al.  R Sutton, C Szepesvári, A Geramifard, and Michael Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Conference on Uncertainty in Artificial Intelligence, 2008.
- Du et al.  Simon S Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic Variance Reduction Methods for Policy Evaluation. In International Conference on Machine Learning, 2017.
- White  Adam White. Developing a predictive approach to knowledge. PhD thesis, University of Alberta, 2015.
- Baird  Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, 1995.
- Sutton et al.  Richard S Sutton, A R Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 2016.
- Maei et al.  HR Maei, C Szepesvári, S Bhatnagar, D Precup, D Silver, and Richard S Sutton. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, 2009.
- Gehring et al.  Clement Gehring, Yangchen Pan, and Martha White. Incremental Truncated LSTD. In International Joint Conference on Artificial Intelligence, 2016.
- Pan et al.  Yangchen Pan, Adam White, and Martha White. Accelerated Gradient Temporal Difference Learning. In International Conference on Machine Learning, 2017.
- Le et al.  Lei Le, Raksha Kumaraswamy, and Martha White. Learning Sparse Representations in Reinforcement Learning with Sparse Coding. In International Joint Conference on Artificial Intelligence, 2017.
- Dann et al.  Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: a survey and comparison. The Journal of Machine Learning Research, 2014.
- Boyan and Moore  J Boyan and A W Moore. Generalization in Reinforcement Learning: Safely Approximating the Value Function. Advances in Neural Information Processing Systems, 1995.
- Konidaris et al.  George Konidaris, Scott Niekum, and Philip S Thomas. TDgamma: Re-evaluating Complex Backups in Temporal Difference Learning. In Advances in Neural Information Processing Systems, 2011.
- Dabney and Thomas  William Dabney and Philip S Thomas. Natural Temporal Difference Learning. In AAAI Conference on Artificial Intelligence, 2014.
- Domingos and Hulten  Pedro M Domingos and Geoff Hulten. A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering. In International Conference on Machine Learning, 2001.
- Dagum et al.  Paul Dagum, Richard Karp, Michael Luby, and Sheldon Ross. An Optimal Algorithm for Monte Carlo Estimation. SIAM Journal on Computing, 2006.
- Mnih et al.  Volodymyr Mnih, Csaba Szepesvari, and Jean-Yves Audibert. Empirical Bernstein stopping. In International Conference on Machine Learning, 2008.
- Hsu et al.  Daniel Hsu, Aryeh Kontorovich, and Csaba Szepesvari. Mixing Time Estimation in Reversible Markov Chains from a Single Sample Path. In Advances in Neural Information Processing Systems, 2015.
- Paulin  Daniel Paulin. Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, 2015.
- Audibert et al.  Jean-Yves Audibert, Rémi Munos, and Csaba Szepesvari. Tuning Bandit Algorithms in Stochastic Environments. In Algorithmic Learning Theory. 2007.
- Welford  BP Welford. Note on a method for calculating corrected sums of squares and products. Technometrics, 1962.
Appendix A Supplementary Lemmas
The clipped error satisfies the triangle inequality.
Proof: This result follows because , still holds under clipping. To see why, consider the following. If either or is clipped to , then clearly the sum is larger than . Otherwise, if only is clipped to , then it can only have been strictly decreased and again the inequality must hold. Once we have this inequality, we can use the fact that and to get the and .