1 Introduction
When applying reinforcement learning (RL), particularly to real-world applications, it is desirable to have algorithms that reliably achieve high levels of performance without requiring expert knowledge or significant human intervention. For researchers, having algorithms of this type would mean spending less time tuning algorithms to solve benchmark tasks and more time developing solutions to harder problems. Current evaluation practices do not properly account for the uncertainty in the results (Henderson et al., 2018) and neglect the difficulty of applying RL algorithms to a given problem. Consequently, existing RL algorithms are difficult to apply to real-world applications (Dulac-Arnold et al., 2019). To both make and track progress towards developing reliable and easy-to-use algorithms, we propose a principled evaluation procedure that quantifies the difficulty of using an algorithm.
For an evaluation procedure to be useful for measuring the usability of RL algorithms, we suggest that it should have four properties. First, to ensure accuracy and reliability, an evaluation procedure should be scientific, such that it provides information to answer a research question, tests a specific hypothesis, and quantifies any uncertainty in the results. Second, the performance metric should capture the usability of the algorithm over a wide variety of environments. For a performance metric to capture the usability of an algorithm, it should include the time and effort spent tuning the algorithm’s hyperparameters (e.g., step-size and policy structure). Third, the evaluation procedure should be non-exploitative (Balduzzi et al., 2018), meaning no algorithm should be favored by performing well on an over-represented subset of environments or by abusing a particular score normalization method. Fourth, an evaluation procedure should be computationally tractable, meaning that a typical researcher should be able to run the procedure and repeat experiments found in the literature.

As an evaluation procedure requires a question to answer, we pose the following to use throughout the paper: which algorithm(s) perform well across a wide variety of environments with little or no environment-specific tuning? Throughout this work, we refer to this question as the general evaluation question. This question is different from the one commonly asked in articles proposing a new algorithm; the common question is, can algorithm X outperform other algorithms on tasks A, B, and C? In contrast to the common question, the expected outcome for the general evaluation question is not to find methods that maximize performance with optimal hyperparameters but to identify algorithms that do not require extensive hyperparameter tuning and thus are easy to apply to new problems.
In this paper, we contend that the standard evaluation approaches do not satisfy the above properties, and are not able to answer the general evaluation question. Thus, we develop a new procedure for evaluating RL algorithms that overcomes these limitations and can accurately quantify the uncertainty of performance. The main ideas in our approach are as follows. We present an alternative view of an algorithm such that sampling its performance can be used to answer the general evaluation question. We define a new normalized performance measure, performance percentiles, which uses a relative measure of performance to compare algorithms across environments. We show how to use a game-theoretic approach to construct an aggregate measure of performance that permits quantifying uncertainty. Lastly, we develop a technique, performance bound propagation (PBP), to quantify and account for uncertainty throughout the entire evaluation procedure. We provide source code so others may easily apply the methods we develop here; it can be found at https://github.com/ScottJordan/EvaluationOfRLAlgs.
2 Notation and Preliminaries
In this section, we give notation used in this paper along with an overview of an evaluation procedure. In addition to this section, a list of symbols used in this paper is presented in Appendix C. We represent a performance metric of an algorithm, $i \in \mathcal{A}$, on an environment, $j \in \mathcal{M}$, as a random variable $X_{i,j}$. This representation captures the variability of results due to the choice of the random seed controlling the stochastic processes in the algorithm and the environment. The choice of the metric depends on the property being studied and is up to the experiment designer. The performance metric used in this paper is the average of the observed returns from one execution of the entire training procedure, which we refer to as the average return. The cumulative distribution function (CDF), $F_{X_{i,j}} : \mathbb{R} \to [0, 1]$, describes the performance distribution of algorithm $i$ on environment $j$ such that $F_{X_{i,j}}(x) := \Pr(X_{i,j} \le x)$. The quantile function, $Q_{X_{i,j}}(\alpha) := \inf_x \{x \in \mathbb{R} \mid F_{X_{i,j}}(x) \ge \alpha\}$, maps a cumulative probability, $\alpha \in (0, 1)$, to a score such that an $\alpha$ proportion of samples of $X_{i,j}$ are less than or equal to $Q_{X_{i,j}}(\alpha)$. A normalization function, $g : \mathbb{R} \times \mathcal{M} \to \mathbb{R}$, maps a score, $x$, an algorithm receives on an environment, $j$, to a normalized score, $g(x, j)$, which has a common scale for all environments. In this work, we seek an aggregate performance measure, $y_i \in \mathbb{R}$, for an algorithm, $i$, such that $y_i := \sum_{j=1}^{|\mathcal{M}|} q_j \mathbb{E}[g(X_{i,j}, j)]$, where $q_j \ge 0$ for all $j \in \{1, \dots, |\mathcal{M}|\}$ and $\sum_{j=1}^{|\mathcal{M}|} q_j = 1$. In Section 4, we discuss choices for the normalizing function and weightings that satisfy the properties specified in the introduction.

The primary quantities of interest in this paper are the aggregate performance measures for each algorithm and confidence intervals on that measure. Let $y \in \mathbb{R}^{|\mathcal{A}|}$ be a vector representing the aggregate performance for each algorithm. We desire confidence intervals, $Y^-, Y^+ \in \mathbb{R}^{|\mathcal{A}|}$, such that, for a confidence level $\delta$,

$$\Pr\left(\forall i \in \{1, \dots, |\mathcal{A}|\},\; y_i \in [Y^-_i, Y^+_i]\right) \ge 1 - \delta. \qquad (1)$$
To compute an aggregate performance measure and its confidence intervals that meet the criteria laid out in the introduction, one must consider the entire evaluation procedure. We view an evaluation procedure as having three main components: data collection, data aggregation, and reporting of the results. During the data collection phase, samples of the performance metric are collected for each combination of an algorithm and an environment. In the data aggregation phase, all samples of performance are normalized so the metric on each environment is on a similar scale; they are then aggregated to provide a summary of each algorithm’s performance across all environments. Lastly, the uncertainty of the results is quantified and reported.
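The CDF and quantile function introduced in this section can be estimated from samples of performance. The following is a minimal sketch, assuming the samples are average returns from independent trials; the function names and the example values are hypothetical and are not taken from the paper’s source code.

```python
import bisect
import math


def empirical_cdf(samples):
    """Return F(x): the fraction of samples less than or equal to x."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(x):
        # bisect_right counts how many sorted samples are <= x
        return bisect.bisect_right(ordered, x) / n

    return cdf


def empirical_quantile(samples):
    """Return Q(alpha): the smallest sample whose CDF value is at least alpha."""
    ordered = sorted(samples)
    n = len(ordered)

    def quantile(alpha):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        # smallest 1-based rank k with k / n >= alpha
        k = max(math.ceil(alpha * n), 1)
        return ordered[k - 1]

    return quantile


# Hypothetical average returns from five trials of one algorithm on one environment.
returns = [12.0, 7.5, 9.0, 15.0, 9.0]
F = empirical_cdf(returns)
Q = empirical_quantile(returns)
```

Here `F(9.0)` is the proportion of trials with an average return of at most 9.0, and `Q(0.5)` is the empirical median under the infimum-based definition above.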
3 Data Collection
In this section, we discuss how common data collection methods are unable to answer the general evaluation question and then present a new method that can. We first highlight the core difference between our approach and previous methods.
The main difference between data collection methods is in how the samples of performance are collected for each algorithm on each environment. Standard approaches rely on first tuning an algorithm’s hyperparameters, i.e., any input to an algorithm that is not the environment, and then generating samples of performance. Our method instead relies on having a definition of an algorithm that can automatically select, sample, or adapt hyperparameters. This method can be used to answer the general evaluation question because its performance measure represents the knowledge required to use the algorithm. We discuss these approaches below.
3.1 Current Approaches
A typical evaluation procedure used in RL research is the tune-and-report method. As depicted in Figure 1, the tune-and-report method has two phases: a tuning phase and a testing phase. In the tuning phase, hyperparameters are optimized either manually or via a hyperparameter optimization algorithm. Then after tuning, only the best hyperparameters are selected and executed for a number of trials using different random seeds to provide an estimate of performance.
The tune-and-report data collection method does not satisfy the usability requirement or the scientific requirement. Recall that our objective is to capture the difficulty of using a particular algorithm. Because the tune-and-report method ignores the amount of data used to tune the hyperparameters, an algorithm that only works well after significant tuning could be favored over one that works well without environment-specific tuning, thus violating the requirements.
Consider an extreme example of an RL algorithm that includes all policy parameters as hyperparameters. This algorithm would then likely be optimal after any iteration of hyperparameter tuning that finds the optimal policy. This effect is more subtle in standard algorithms, where hyperparameter tuning infers problem-specific information about how to search for the optimal policy (e.g., how much exploration is needed, or how aggressive policy updates can be). Furthermore, this demotivates the creation of algorithms that are easier to use but do not improve performance after finding optimal hyperparameters.
The tune-and-report method violates the scientific property by not accurately capturing the uncertainty of performance. Multiple i.i.d. samples of performance are taken after hyperparameter tuning and used to compute a bound on the mean performance. However, these samples of performance do not account for the randomness due to hyperparameter tuning. As a result, any statistical claim would be inconsistent with repeated evaluations of this method. This has been observed in several studies where further hyperparameter tuning has shown no difference in performance relative to baseline methods (Lucic et al., 2018; Melis et al., 2018).
The evaluation procedure proposed by Dabney (2014) addresses issues with uncertainty due to hyperparameter tuning and performance not capturing the usability of algorithms. Dabney’s evaluation procedure computes performance as a weighted average over all iterations of hyperparameter tuning, and the entire tuning process is repeated over multiple trials. Even though this evaluation procedure fixes the problems with the tune-and-report approach, it violates our computationally tractable property by requiring many executions of the algorithm to produce just a few samples of performance. In the case where it is not clear how hyperparameters should be set. Furthermore, this style of evaluation does not cover the case where it is prohibitive to perform hyperparameter tuning, e.g., slow simulations, long agent lifetimes, lack of a simulator, and situations where it is dangerous or costly to deploy a bad policy. In these situations, it is desirable for algorithms to be insensitive to the choice of hyperparameters or able to adapt them during a single execution. It is in this setting that the general evaluation question can be answered.
3.2 Our Approach
In this section, we outline our method, complete data collection, that does not rely on hyperparameter tuning. If there were no hyperparameters to tune, evaluating algorithms would be simpler. Unfortunately, how to automatically set hyperparameters has been an understudied area. Thus, we introduce the notion of a complete algorithm definition.
Definition 1 (Algorithm Completeness).
An algorithm is complete on an environment $j$ when defined such that the only required input to the algorithm is meta-information about environment $j$, e.g., the number of state features and actions.
Algorithms with a complete definition can be used on an environment without specifying any hyperparameters. Note that this does not say that an algorithm cannot receive forms of problem-specific knowledge, only that it is not required. A well-defined algorithm will be able to infer effective combinations of hyperparameters or adapt them during learning. There are many ways to make an existing algorithm complete. In this work, algorithms are made complete by defining a distribution from which to randomly sample hyperparameters. Random sampling may produce poor or divergent behavior in the algorithm, but this only indicates that it is not yet known how to set the hyperparameters of the algorithm automatically. Thus, when faced with a new problem, finding decent hyperparameters will be challenging. One way to make an adaptive complete algorithm is to include a hyperparameter optimization method in the algorithm. However, all tuning must be done within the same fixed amount of time and cannot propagate information over trials used to obtain statistical significance.
Figure 2 shows the complete data collection method. For this method we limit the scope of algorithms to only include ones with complete definitions; thus, it does not violate any of the properties specified. This method satisfies the scientific requirement since it is designed to answer the general evaluation question, and the uncertainty of performance can be estimated using all of the trials. Again, this data collection method captures the difficulty of using an algorithm since the complete definition encodes the knowledge necessary for the algorithm to work effectively. The compute time of this method is tractable, since each execution of the algorithm produces an independent sample of performance.
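To make the idea concrete, the following is a minimal sketch of complete data collection, assuming an algorithm is made complete by sampling hyperparameters from fixed ranges. The hyperparameter names, the ranges, and the toy `run_trial` stand-in are all hypothetical; a real experiment would train an agent inside `run_trial` and return its average return.

```python
import math
import random


def sample_hyperparameters(num_features, num_actions, rng):
    """Hypothetical complete definition: the caller supplies only
    meta-information about the environment; everything else is sampled."""
    return {
        "step_size": 10 ** rng.uniform(-4, 0),  # log-uniform, assumed range
        "discount": rng.uniform(0.9, 1.0),      # assumed range
        "trace_decay": rng.uniform(0.0, 1.0),   # assumed range
        "num_features": num_features,
        "num_actions": num_actions,
    }


def run_trial(hyperparameters, rng):
    """Toy stand-in for one full training run returning the average return.
    The score peaks at a step size of 0.01 only so the sketch is runnable."""
    noise = rng.gauss(0.0, 0.1)
    return -abs(math.log10(hyperparameters["step_size"]) + 2.0) + noise


def complete_data_collection(num_trials, num_features, num_actions, seed=0):
    """Collect one independent sample of performance per execution."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_trials):
        # Fresh hyperparameters each trial: nothing is carried across trials.
        hyperparameters = sample_hyperparameters(num_features, num_actions, rng)
        samples.append(run_trial(hyperparameters, rng))
    return samples


samples = complete_data_collection(num_trials=100, num_features=4, num_actions=2, seed=1)
```

Because no tuning phase precedes the trials, every execution contributes a sample, and poor draws of hyperparameters show up directly in the performance distribution.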
The practical effects of using the complete data collection method are as follows. Researchers do not have to spend time tuning each algorithm to try and maximize performance. Fewer algorithm executions are required to obtain a statistically meaningful result. With this data collection method, improving upon algorithm definitions will become significant research contributions and lead to algorithms that are easy to apply to many problems.
4 Data Aggregation
Answering the general evaluation question requires a ranking of algorithms according to their performance on all environments. The aggregation step accomplishes this task by combining the performance data generated in the collection phase and summarizing it across all environments. However, data aggregation introduces several challenges. First, each environment has a different range of scores that need to be normalized to a common scale. Second, a uniform weighting of environments can introduce bias. For example, the set of environments might include many slight variants of one domain, giving that domain a larger weight than a single environment coming from a different domain.
4.1 Normalization
The goal in score normalization is to project scores from each environment onto the same scale while not being exploitable by the environment weighting. In this section, we first show how existing normalization techniques are exploitable or do not capture the properties of interest. Then we present our normalization technique: performance percentiles.
4.1.1 Current Approaches
We examine two normalization techniques: performance ratios and policy percentiles. We discuss other normalization methods in Appendix A. The performance ratio is commonly used with the Arcade Learning Environment to compare the performance of algorithms relative to human performance (Mnih et al., 2015; Machado et al., 2018). The performance ratio of two algorithms and on an environment is . This ratio is sensitive to the location and scale of the performance metric on each environment, such that an environment with scores in the range will produce larger differences than those on the range . Furthermore, all changes in performance are assumed to be equally challenging, i.e., going from a score of to is the same difficulty as to . This assumption of linearity of difficulty is not reflected on environments with nonlinear changes in the score as an agent improves, e.g., completing levels in Super Mario.
A critical flaw in the performance ratio is that it can produce an arbitrary ordering of algorithms when combined with the arithmetic mean (Fleming and Wallace, 1986), meaning a different algorithm in the denominator could change the relative rankings. Using the geometric mean can address this weakness of performance ratios, but does not resolve the other issues.
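A small worked example illustrates the ordering problem. The scores below are made up for illustration: two hypothetical algorithms, A and B, on two environments. With the arithmetic mean of performance ratios, the ranking flips depending on which algorithm is in the denominator, whereas the geometric mean gives the same ranking for either choice.

```python
import math

# Hypothetical mean scores on two environments (illustrative values only).
scores = {"A": [10.0, 2.0], "B": [1.0, 10.0]}


def ratios(algorithm, reference):
    """Per-environment performance ratio of `algorithm` relative to `reference`."""
    return [x / r for x, r in zip(scores[algorithm], scores[reference])]


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))


def ranking(mean_fn, reference):
    """Algorithms ordered from best to worst under the given mean and reference."""
    summary = {name: mean_fn(ratios(name, reference)) for name in scores}
    return sorted(summary, key=summary.get, reverse=True)


arithmetic_vs_a = ranking(arithmetic_mean, "A")  # B ranked first
arithmetic_vs_b = ranking(arithmetic_mean, "B")  # A ranked first
geometric_vs_a = ranking(geometric_mean, "A")    # A ranked first
geometric_vs_b = ranking(geometric_mean, "B")    # A ranked first
```

With A as the reference, B’s arithmetic mean ratio is (0.1 + 5) / 2 = 2.55 against 1 for A; with B as the reference, A’s is (10 + 0.2) / 2 = 5.1 against 1 for B, so each algorithm appears best when the other is the reference.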
Another normalization technique is policy percentiles, a method that projects the score of an algorithm through the performance CDF of random policy search (Dabney, 2014). The normalized score for an algorithm, , is , where is the performance CDF when a policy is sampled uniformly from a set of policies, , on an environment , i.e, . Policy percentiles have a unique advantage in that performance is scaled according to how difficult it is to achieve that level of performance relative to random policy search. Unfortunately, policy percentiles rely on specifying , which often has a large search space. As a result, most policies will perform poorly, making all scores approach . It is also infeasible to use when random policy search is unlikely to achieve high levels of performance. Despite these drawbacks, the scaling of scores according to a notion of difficulty is desirable, so we adapt this idea to use any algorithm’s performance as a reference distribution.
4.1.2 Our Approach
An algorithm’s performance distribution can have an interesting shape with large changes in performance that are due to divergence, lucky runs, or simply that small changes to a policy can result in large changes in performance (Jordan et al., 2018). These effects can be seen in Figure 3, where there is a quick rise in cumulative probability for a small increase in performance. Inspired by Dabney (2014)’s policy percentiles, we propose performance percentiles, a score normalization technique that can represent these intricacies.
The probability integral transform shows that projecting a random variable through its CDF transforms the variable to be uniform on $[0, 1]$ (Dodge and Commenges, 2006). Thus, normalizing an algorithm’s performance by its CDF will equally distribute and represent a linear scaling of difficulty across $[0, 1]$. When normalizing performance against another algorithm’s performance distribution, the normalized score distribution will shift towards zero when the algorithm is worse than the normalizing distribution and shift towards one when it is superior. As seen in Figure 3, the CDF can be seen as encoding the relative difficulty of achieving a given level of performance, where large changes in an algorithm’s CDF output indicate a high degree of difficulty for that algorithm to make an improvement and similarly small changes in output correspond to low change in difficulty. In this context, difficulty refers to the amount of random chance (luck) needed to achieve a given level of performance.

To leverage these properties of the CDF, we define performance percentiles, which use a weighted average of each algorithm’s CDF to normalize scores for each environment.
Definition 2 (Performance Percentile).
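In an evaluation of algorithms, $\mathcal{A}$, the performance percentile for a score $x$ on an environment, $j$, is $F_{\bar{X}_j}(x)$, where $F_{\bar{X}_j}$ is the mixture of the per-algorithm performance CDFs $F_{\bar{X}_j}(x) := \sum_{i=1}^{|\mathcal{A}|} w_{i,j} F_{X_{i,j}}(x)$, with weights $w_j \in \mathbb{R}^{|\mathcal{A}|}$, $\sum_{i=1}^{|\mathcal{A}|} w_{i,j} = 1$, and $w_{i,j} \ge 0$ for all $i$.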
Performance percentiles thus capture the performance characteristic of an environment relative to some averaged algorithm. We discuss how to set the weights in the next section.
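As a minimal sketch of Definition 2, assuming empirical CDFs are used in place of the true performance distributions, a performance percentile on one environment can be computed as a weighted mixture of per-algorithm CDFs. The algorithm names, samples, and equal weights below are hypothetical; in the paper the weights come from the game described in the next section.

```python
import bisect


def empirical_cdf(samples):
    """Return F(x): the fraction of samples less than or equal to x."""
    ordered = sorted(samples)
    return lambda x: bisect.bisect_right(ordered, x) / len(ordered)


def performance_percentile(score, samples_by_algorithm, weights):
    """Mixture CDF value of `score` on one environment: sum_i w_i * F_i(score)."""
    if min(weights.values()) < 0 or abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    return sum(
        w * empirical_cdf(samples_by_algorithm[name])(score)
        for name, w in weights.items()
    )


# Hypothetical average returns of two algorithms on a single environment.
samples = {"alg1": [1.0, 2.0, 3.0, 4.0], "alg2": [3.0, 4.0, 5.0, 6.0]}
weights = {"alg1": 0.5, "alg2": 0.5}  # assumed equal weighting for illustration

percentile = performance_percentile(3.0, samples, weights)
```

A score of 3.0 is at or above three of four `alg1` samples and one of four `alg2` samples, so its percentile under equal weights is 0.5 * 0.75 + 0.5 * 0.25 = 0.5.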
Performance percentiles are closely related to the concept of (probabilistic) performance profiles (Dolan and Moré, 2002; Barreto et al., 2010). The difference is that performance profiles report the cumulative distribution of normalized performance metrics over a set of tasks (environments), whereas performance percentiles are a technique for normalizing scores on each task (environment).
4.2 Summarization
A weighting over environments is needed to form an aggregate measure. We desire a weighting over environments such that no algorithm can exploit the weightings to increase its ranking. Additionally, for the performance percentiles, we need to determine the weighting of algorithms to use as the reference distribution. Inspired by the work of Balduzzi et al. (2018), we propose a weighting of algorithms and environments, using the equilibrium of a two-player game.
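In this game, one player, $p$, will try to select an algorithm to maximize the aggregate performance, while a second player, $q$, chooses the environment and reference algorithm to minimize $p$’s score. Player $p$’s pure strategy space, $\mathcal{S}_1$, is the set of algorithms $\mathcal{A}$, i.e., $p$ plays a strategy $s_1 = i$ corresponding to an algorithm $i$. Player $q$’s pure strategy space, $\mathcal{S}_2$, is the cross product of a set of environments, $\mathcal{M}$, and algorithms, $\mathcal{A}$, i.e., player $q$ plays a strategy $s_2 = (j, k)$ corresponding to a choice of environment $j$ and normalization algorithm $k$. We denote the pure strategy space of the game by $\mathcal{S} := \mathcal{S}_1 \times \mathcal{S}_2$. A strategy, $s \in \mathcal{S}$, can be represented by a tuple $s = (s_1, s_2) = (i, (j, k))$.

The utility of strategy $s$ is measured by a payoff function $u_p : \mathcal{S} \to \mathbb{R}$ and $u_q : \mathcal{S} \to \mathbb{R}$ for players $p$ and $q$ respectively. The game is defined to be zero sum, i.e., $u_q(s) = -u_p(s)$. We define the payoff function to be $u_p(s) := \mathbb{E}[F_{X_{k,j}}(X_{i,j})]$, the expected percentile of the performance $X_{i,j}$ of algorithm $i$ under the performance CDF $F_{X_{k,j}}$ of the normalization algorithm $k$ on environment $j$. Both players $p$ and $q$ sample strategies from probability distributions $p \in \Delta(\mathcal{S}_1)$ and $q \in \Delta(\mathcal{S}_2)$, where $\Delta(\mathcal{X})$ is the set of all probability distributions over $\mathcal{X}$.

The equilibrium solution of this game naturally balances the normalization and environment weightings to counter each algorithm’s strengths without conferring an advantage to a particular algorithm. Thus, the aggregate measure will be useful in answering the general evaluation question.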
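After finding a solution $(p^*, q^*)$, the aggregate performance measure $y_i$ for an algorithm $i$ is defined as

$$y_i := \sum_{j=1}^{|\mathcal{M}|} \sum_{k=1}^{|\mathcal{A}|} q^*_{j,k}\, \mathbb{E}\left[F_{X_{k,j}}(X_{i,j})\right], \qquad (2)$$

where $q^*_{j,k}$ is the equilibrium weight on environment $j$ and normalization algorithm $k$.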
To find an equilibrium solution, we employ the α-Rank technique (Omidshafiei et al., 2019), which returns a stationary distribution over the pure strategy space. α-Rank allows for efficient computation of both the equilibrium and confidence intervals on the aggregate performance (Rowland et al., 2019). We describe this method and our implementation in Appendix B.
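The aggregation step can be sketched as follows, assuming empirical CDFs stand in for the true performance distributions. The sketch does not implement α-Rank: the weighting over (environment, reference algorithm) pairs is a uniform placeholder, and the data layout, names, and values are hypothetical.

```python
import bisect


def empirical_cdf(samples):
    """Return F(x): the fraction of samples less than or equal to x."""
    ordered = sorted(samples)
    return lambda x: bisect.bisect_right(ordered, x) / len(ordered)


def payoff(data, i, j, k):
    """Estimate the mean percentile of algorithm i's samples on environment j
    under the empirical CDF of reference algorithm k on the same environment."""
    cdf_k = empirical_cdf(data[k][j])
    xs = data[i][j]
    return sum(cdf_k(x) for x in xs) / len(xs)


def aggregate_performance(data, q):
    """Aggregate score per algorithm: sum over (j, k) of q[(j, k)] * payoff(i, j, k)."""
    return {
        i: sum(w * payoff(data, i, j, k) for (j, k), w in q.items())
        for i in data
    }


# Hypothetical samples: data[algorithm][environment] -> list of average returns.
data = {
    "alg1": {"env1": [1.0, 2.0, 3.0], "env2": [5.0, 6.0, 7.0]},
    "alg2": {"env1": [2.0, 3.0, 4.0], "env2": [1.0, 2.0, 3.0]},
}

# Placeholder weighting (NOT an equilibrium): uniform over all (j, k) pairs.
pairs = [(j, k) for j in ("env1", "env2") for k in data]
q_uniform = {pair: 1.0 / len(pairs) for pair in pairs}

y = aggregate_performance(data, q_uniform)
```

Replacing `q_uniform` with an equilibrium weighting would change only the dictionary passed to `aggregate_performance`; the payoff estimates are unaffected.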
5 Reporting Results
As it is crucial to quantify the uncertainty of all claimed performance measures, we first discuss how to compute confidence intervals for both single environment and aggregate measures, then give details on displaying the results.
5.1 Quantifying Uncertainty
In keeping with our objective to have a scientific evaluation, we require our evaluation procedure to quantify any uncertainty in the results. When concerned with only a single environment, standard concentration inequalities can compute confidence intervals on the mean performance. Similarly, when displaying the distribution of performance, one can apply standard techniques for bounding the empirical distribution of performance. However, computing confidence intervals on the aggregate has additional challenges.
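As one example of a standard technique for bounding the empirical distribution of performance, the sketch below uses the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality to build a confidence band around an empirical CDF. Choosing DKW here is an assumption made for illustration; this paragraph does not name a specific technique, and the function name is hypothetical.

```python
import bisect
import math


def dkw_band(samples, delta):
    """Return (lower, upper, eps), where lower(x) and upper(x) bound the true
    CDF for all x simultaneously with probability at least 1 - delta."""
    ordered = sorted(samples)
    n = len(ordered)
    # DKW: Pr(sup_x |F_n(x) - F(x)| > eps) <= 2 * exp(-2 * n * eps**2)
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))

    def empirical(x):
        return bisect.bisect_right(ordered, x) / n

    def lower(x):
        return max(empirical(x) - eps, 0.0)

    def upper(x):
        return min(empirical(x) + eps, 1.0)

    return lower, upper, eps


# Hypothetical performance samples from 100 trials.
lower, upper, eps = dkw_band(list(range(100)), delta=0.05)
```

The band width depends only on the number of samples and the confidence level, shrinking at a rate proportional to one over the square root of the sample size.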
Notice that in (2) computing the aggregate performance requires two unknown values: the equilibrium weighting $q^*$ and the mean normalized performance, $\mathbb{E}[F_{X_{k,j}}(X_{i,j})]$. Since $q^*$ depends on mean normalized performance, any uncertainty in the mean normalized performance results in uncertainty in $q^*$. To compute valid confidence intervals on the aggregate performance, the uncertainty through the entire process must be considered.
We introduce a process to compute the confidence intervals, which we refer to as performance bound propagation (PBP). We represent PBP as a function that maps a dataset $D$ containing all samples of performance and a confidence level $\delta$ to vectors $Y^-$ and $Y^+$ representing the lower and upper confidence intervals, i.e., $(Y^-, Y^+) = \mathrm{PBP}(D, \delta)$.
The overall procedure for PBP is as follows. First, compute confidence intervals for each performance CDF $F_{X_{i,j}}$. Then, using these intervals, compute confidence intervals on each mean normalized performance. Next, determine an uncertainty set $\mathcal{Q}$ for $q^*$ that results from uncertainty in the mean normalized performance. Finally, for each algorithm, find the minimum and maximum aggregate performance over the uncertainty in the mean normalized performances and $\mathcal{Q}$. We provide pseudocode in Appendix C and source code in the repository.
We prove that PBP produces valid confidence intervals for a confidence level $\delta$ and a dataset $D$ containing samples of performance for all algorithms $i \in \mathcal{A}$ and environments $j \in \mathcal{M}$.
Theorem 1.
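If $(Y^-, Y^+) = \mathrm{PBP}(D, \delta)$, then

$$\Pr\left(\forall i \in \{1, \dots, |\mathcal{A}|\},\; y_i \in [Y^-_i, Y^+_i]\right) \ge 1 - \delta. \qquad (3)$$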
Proof.
Although the creation of valid confidence intervals is critical to this contribution, due to space restrictions the proof is presented in Appendix C. ∎
5.2 Displaying Results
In this section, we describe our method for reporting the results. There are three parts to our method: answering the stated hypothesis; providing tables and plots showing the performance and ranking of algorithms for all environments and the aggregate score; and, for each performance measure, providing confidence intervals to convey uncertainty.
The learning curve plot is a standard in RL and displays a performance metric (often the return) over regular intervals during learning. While this type of plot might be informative for describing some aspects of the algorithm’s performance, it does not directly show the performance metric used to compare algorithms, making visual comparisons less obvious. Therefore, to provide the most information to the reader, we suggest plotting the distribution of performance for each algorithm on each environment. Plotting the distribution of performance has been suggested in many fields as a means to convey more information (Dolan and Moré, 2002; Farahmand et al., 2010; Reimers and Gurevych, 2017; Cohen et al., 2018). Often in RL, the objective is to maximize a metric, so we suggest showing the quantile function over the CDF as it allows for a more natural interpretation of the performance, i.e., the higher the curve, the better the performance (Bellemare et al., 2013). Figure 4 shows the performance distribution with confidence intervals for different sample sizes. It is worth noting that when tuning hyperparameters the data needed to compute these distributions is already being collected, but only the results from the tuned runs are being reported. Reporting only the tuned performance shows what an algorithm can achieve, not what it is likely to achieve.
6 Experimental Results
In this section, we describe and report the results of experiments to illustrate how this evaluation procedure can answer the general evaluation question and identify when a modification to an algorithm or its definition improves performance. We also investigate the reliability of different bounding techniques on the aggregate performance measure.
6.1 Experiment Description
To demonstrate the evaluation procedure, we compare the following algorithms: Actor-Critic with eligibility traces (AC) (Sutton and Barto, 2018), Q(λ), Sarsa(λ) (Sutton and Barto, 1998), NAC-TD (Morimura et al., 2005; Degris et al., 2012; Thomas, 2014), and proximal policy optimization (PPO) (Schulman et al., 2017). The learning rate is often the most sensitive hyperparameter in RL algorithms. So, we include three versions of Sarsa(λ), Q(λ), and AC: a base version, a version that scales the step-size with the number of parameters (e.g., Sarsa(λ)-s), and an adaptive step-size method, Parl2 (Dabney, 2014), that does not require specifying the step-size. Since none of these algorithms have an existing complete definition, we create one by randomly sampling hyperparameters from fixed ranges. We consider all parameters necessary to construct each algorithm, e.g., step-size, function approximator, discount factor, and eligibility trace decay. For the continuous state environments, each algorithm employs linear function approximation using the Fourier basis (Konidaris et al., 2011) with a randomly sampled order. See Appendix E for full details of each algorithm.
These algorithms are evaluated on 15 environments: eight discrete MDPs, half with stochastic transition dynamics, and seven continuous state environments, namely CartPole (Florian, 2007), Mountain Car (Sutton and Barto, 1998), Acrobot (Sutton, 1995), and four variations of the pinball environment (Konidaris and Barto, 2009; Geramifard et al., 2015). For each independent trial, the environments have their dynamics randomly perturbed to help mitigate environment overfitting (Whiteson et al., 2011); see code for details. For further details about the experiment see Appendix F.
While these environments have simple features compared to the Arcade Learning Environment (Bellemare et al., 2013), they remain useful in evaluating RL algorithms for three reasons. First, experiments finish quickly. Second, the environments provide interesting insights into an algorithm’s behavior. Third, as our results will show, there is not yet a complete algorithm that can reliably solve each one.
We execute each algorithm on each environment for 10,000 trials. While this number of trials may seem excessive, our goal is to detect a statistically meaningful result. Detecting such a result is challenging because the variance of RL algorithms’ performance is high, we are comparing $11 \times 15 = 165$ random variables, and we do not assume the performances are normally distributed. Computationally, executing ten thousand trials is not burdensome if one uses an efficient programming language such as Julia (Bezanson et al., 2017) or C++, where we have noticed approximately two orders of magnitude faster execution than similar Python implementations. We investigate using smaller sample sizes at the end of this section.

Table 1: Aggregate Performance

Algorithms | Score | Rank
Sarsa-Parl2 | | 1 (2,1)
Q-Parl2 | | 2 (2,1)
AC-Parl2 | | 3 (11,3)
Sarsa-s | | 4 (11,3)
AC-s | | 5 (11,3)
Sarsa | | 6 (11,3)
AC | | 7 (11,3)
Q-s | | 8 (11,3)
Q | | 9 (11,3)
NAC-TD | | 10 (11,3)
PPO | | 11 (11,3)
6.2 Algorithm Comparison
The aggregate performance measures and confidence intervals are illustrated in Figure 5 and given in Table 1. Appendix I lists the performance tables and distribution plots for each environment. Examining the empirical performances in these figures, we notice two trends. The first is that our evaluation procedure can identify differences that are not noticeable in standard evaluations. For example, all algorithms perform near optimally when tuned properly (indicated by the high end of the performance distribution). The primary differences between algorithms are in the frequency of high performance and divergence (indicated by the low end of the performance distribution). Parl2 methods rarely diverge, giving a large boost in performance relative to the standard methods.
The second trend is that our evaluation procedure can identify when theoretical properties do or do not make an algorithm more usable. For example, Sarsa(λ) algorithms outperform their Q(λ) counterparts. This result might stem from the fact that Sarsa(λ) is known to converge with linear function approximation (Perkins and Precup, 2002) while Q(λ) is known to diverge (Baird, 1995; Wiering, 2004). Additionally, NAC-TD performs worse than AC despite natural gradients being a superior ascent direction. This result is due in part to the fact that it is unknown how to set the three step-sizes in NAC-TD, making it more difficult to use than AC. Together these observations point out the deficiency in the way new algorithms have been evaluated. That is, tuning hyperparameters hides the lack of knowledge required to use the algorithm, introducing bias that favors the new algorithm. In contrast, our method forces this knowledge to be encoded into the algorithm, leading to a fairer and more reliable comparison.
6.3 Experiment Uncertainty
While the trends discussed above might hold true in general, we must quantify our uncertainty. Based on the confidence intervals given using PBP, we claim with confidence that on these environments and according to our algorithm definitions, Sarsa-Parl2 and Q-Parl2 have a higher aggregate performance of average returns than all other algorithms in the experiment. It is clear that 10,000 trials per algorithm per environment is not enough to detect a unique ranking of algorithms using the nonparametric confidence intervals in PBP. We now consider alternative methods, PBP-t and the percentile bootstrap. PBP-t replaces the nonparametric intervals in PBP with ones based on the Student’s t-distribution. We detail these methods in Appendix G. From Figure 5, it is clear that both alternative bounds are tighter and thus useful in detecting differences. Since assumptions of these bounds are different and not typically satisfied, it is unclear if they are valid.
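For readers unfamiliar with the percentile bootstrap, the following is a minimal sketch of the basic technique applied to the mean of a single set of performance samples. It is not the aggregate-performance procedure detailed in Appendix G; the function name, the number of resamples, and the example data are assumptions made for illustration.

```python
import random


def percentile_bootstrap_ci(samples, delta, num_resamples=2000, seed=0):
    """Percentile bootstrap interval for the mean at confidence level 1 - delta."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(samples, k=n)) / n for _ in range(num_resamples)
    )
    # Take the empirical delta/2 and 1 - delta/2 quantiles of the resampled means.
    lower_index = int((delta / 2.0) * num_resamples)
    upper_index = int((1.0 - delta / 2.0) * num_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical performance samples with a sample mean of 50.5.
samples = [float(x) for x in range(1, 101)]
low, high = percentile_bootstrap_ci(samples, delta=0.05)
```

The interval is read directly from the resampled means, which is why it tends to be tight but carries no finite-sample guarantee.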
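Table 2: Confidence Interval Performance (FR: failure rate; SIG: proportion of statistically significant pairwise comparisons)

Samples | PBP FR | PBP SIG | PBP-t FR | PBP-t SIG | Bootstrap FR | Bootstrap SIG
 | 0.0 | 0.0 | 1.000 | 0.00 | 0.112 | 0.11
 | 0.0 | 0.0 | 0.000 | 0.00 | 0.092 | 0.37
 | 0.0 | 0.0 | 0.000 | 0.02 | 0.084 | 0.74
 | 0.0 | 0.0 | 0.000 | 0.34 | 0.057 | 0.83
 | 0.0 | 0.33 | 0.003 | 0.83 | 0.069 | 0.83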
To test the different bounding techniques, we estimate the failure rate of each confidence interval technique at different sample sizes. For this experiment we execute trials of the evaluation procedure using sample sizes (trials per algorithm per environment) of , , , , and . There are a total of million samples per algorithm per environment. To reduce computational cost, we limit this experiment to only include Sarsa(λ)Parl2, Q(λ)Parl2, ACParl2, and Sarsa(λ)s. Additionally, we reduce the environment set to be the discrete environments and Mountain Car. We compute the failure rate of the confidence intervals, where a valid confidence interval will have a failure rate less than or equal to , e.g., for failure rate should be less than . We report the failure rate (FR) and the proportion of statistically significant pairwise comparisons (SIG) in Table 2. All methods use the same data, so the results are not independent.
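The idea of estimating a failure rate can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes a single synthetic performance distribution (an exponential with known mean, chosen only because the true mean is known), uses a plain percentile bootstrap for the mean rather than the full procedure of Appendix G, and all function names are hypothetical.

```python
import random
import statistics


def percentile_bootstrap(samples, delta, rng, n_boot=200):
    """Two-sided (1 - delta) percentile-bootstrap interval for the mean."""
    n = len(samples)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_boot)
    )
    lo_idx = int((delta / 2.0) * (n_boot - 1))
    hi_idx = int((1.0 - delta / 2.0) * (n_boot - 1))
    return means[lo_idx], means[hi_idx]


def failure_rate(interval_fn, draw_fn, true_mean, n_samples, n_trials):
    """Fraction of repeated experiments whose interval misses the true mean."""
    failures = 0
    for _ in range(n_trials):
        samples = [draw_fn() for _ in range(n_samples)]
        lo, hi = interval_fn(samples)
        if not (lo <= true_mean <= hi):
            failures += 1
    return failures / n_trials


rng = random.Random(0)
delta = 0.05      # nominal failure level (an assumption for this sketch)
true_mean = 1.0   # mean of an exponential distribution with rate 1


def draw():
    return rng.expovariate(1.0)


for n in (10, 40):
    fr = failure_rate(
        lambda s: percentile_bootstrap(s, delta, rng), draw, true_mean, n, 200
    )
    print(f"n={n:3d}  bootstrap failure rate={fr:.3f}  (nominal level {delta})")
```

Comparing the printed rate against the nominal level is the toy analogue of the FR column in Table 2; with only 200 repetitions the estimate is itself noisy.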
The PBP method has zero failures, indicating it is overly conservative. The failure rate of PBPt is expected to converge to zero as the number of samples increases due to the central limit theorem. PBPt begins to identify significant results at a sample size of , but it is only at that it can identify all pairwise differences. (Footnote 2: SarsaParl2 and QParl2 have similar performance on discrete environments so we consider detecting of results optimal.) The bootstrap technique has the tightest intervals, but has a high failure rate.
These results are stochastic and will not necessarily hold with different numbers of algorithms and environments. So, one should use caution in making claims that rely on either PBPt or bootstrap. Nevertheless, to detect statistically significant results, we recommend running between , and samples, and using the PBPt over bootstrap.
While this number of trials seems high, it is a necessity, as comparison of multiple algorithms over many environments is a challenging statistical problem with many sources of uncertainty. Thus, one should be skeptical of results that use substantially fewer trials. Additionally, researchers are already conducting many trials that go unreported when tuning hyperparameters. Since our method requires no hyperparameter tuning, researchers can instead spend the same amount of time collecting more trials that can be used to quantify uncertainty.
There are a few ways that the number of trials needed can be reduced. The first is to think carefully about what question one should answer so that only a few algorithms and environments are required. The second is to use active sampling techniques to determine when to stop generating samples of performance for each algorithm environment pair (Rowland et al., 2019). It is important to caution the reader that this process can bias the results if the sequential tests are not accounted for (Howard et al., 2018).
Summarizing our experiments, we make the following observations. Our experiments with complete algorithms show that there is still more work required to make standard RL algorithms work reliably on even extremely simple benchmark problems. As a result of our evaluation procedure, we were able to identify performance differences in algorithms that are not noticeable under standard evaluation procedures. The tests of the confidence intervals suggest that both PBP and PBPt provide reliable estimates of uncertainty. These outcomes suggest that this evaluation procedure will be useful in comparing the performance of RL algorithms.
7 Related Work
This paper is not the first to investigate and address issues in empirically evaluating algorithms. The evaluation of algorithms has become a significant enough topic to spawn its own field of study, known as experimental algorithmics (Fleischer et al., 2002; McGeoch, 2012).
In RL, there have been significant efforts to discuss and improve the evaluation of algorithms (Whiteson and Littman, 2011). One common theme has been to produce shared benchmark environments, such as those in the annual reinforcement learning competitions (Whiteson et al., 2010; Dimitrakakis et al., 2014), the Arcade Learning Environment (Bellemare et al., 2013), and others too numerous to list here. Recently, there has been a trend of explicit investigations into the reproducibility of reported results (Henderson et al., 2018; Islam et al., 2017; Khetarpal et al., 2018; Colas et al., 2018). These efforts are in part due to the inadequate experimental practices and reporting in RL and general machine learning (Pineau et al., 2020; Lipton and Steinhardt, 2018). Similar to these studies, this work has been motivated by the need for a more reliable evaluation procedure to compare algorithms. The primary difference between our work and these is that the knowledge required to use an algorithm gets included in the performance metric.
An important aspect of evaluation not discussed so far in this paper is competitive versus scientific testing (Hooker, 1995). Competitive testing is the practice of having algorithms compete for top performance on benchmark tasks. Scientific testing is the careful experimentation of algorithms to gain insight into how an algorithm works. The main difference in these two approaches is that competitive testing only says which algorithms worked well but not why, whereas scientific testing directly investigates the what, when, how, or why better performance can be achieved.
There are several examples of recent works using scientific testing to expand our understanding of commonly used methods. Lyle et al. (2019) compare distributional RL approaches using different function approximation schemes, showing that distributional approaches are only effective when nonlinear function approximation is used. Tucker et al. (2018) explore the sources of variance reduction in action dependent control variates, showing that improvement was small or due to additional bias. Witty et al. (2018) and Atrey et al. (2020) investigate learned behaviors of an agent playing Atari 2600 games using ToyBox (Foley et al., 2018), a tool designed explicitly to enable carefully controlled experimentation of RL agents. While at first glance the techniques developed here seem to be compatible only with competitive testing, this is only because we specified a question with a competitive answer. The techniques developed here, particularly complete algorithm definitions, can be used to accurately evaluate the impact of various algorithmic choices. This allows for careful experimentation to determine what components are essential to an algorithm.
8 Conclusion
The evaluation framework that we propose provides a principled method for evaluating RL algorithms. This approach facilitates fair comparisons of algorithms by removing unintentional biases common in the research setting. By developing a method to establish highconfidence bounds over this approach, we provide the framework necessary for reliable comparisons. We hope that our provided implementations will allow other researchers to easily leverage this approach to report the performances of the algorithms they create.
Acknowledgements
The authors would like to thank Kaleigh Clary, Emma Tosch, and members of the Autonomous Learning Laboratory: Blossom Metevier, James Kostas, and Chris Nota, for discussion and feedback on various versions of this manuscript. Additionally, we would like to thank the reviewers and metareviewers for their comments, which helped improve this paper.
This work was performed in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative. This work was supported in part by a gift from Adobe. This work was supported in part by the Center for Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor. Research reported in this paper was sponsored in part by the CCDC Army Research Laboratory under Cooperative Agreement W911NF1720196 (ARL IoBT CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.
References
 [1] (2011) 2011 IEEE symposium on adaptive dynamic programming and reinforcement learning, ADPRL. IEEE. Cited by: S. Whiteson, B. Tanner, M. E. Taylor, and P. Stone (2011).
 Confidence limits for the value of an arbitrary bounded random variable with a continuous distribution function. Bulletin of The International and Statistical Institute 43, pp. 249–251. Cited by: §C.1.
 Exploratory not explanatory: counterfactual analysis of saliency maps for deep reinforcement learning. In 8th International Conference on Learning Representations, ICLR, Cited by: §7.
 Residual algorithms: reinforcement learning with function approximation. In Machine Learning, Proceedings of the Twelfth International Conference on Machine Learning, A. Prieditis and S. J. Russell (Eds.), pp. 30–37. Cited by: §6.2.
 Reevaluating evaluation. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, NeurIPS., pp. 3272–3283. Cited by: Appendix A, §1, §4.2.
 Probabilistic performance profiles for the experimental evaluation of stochastic algorithms. See Genetic and evolutionary computation conference, GECCO, Pelikan and Branke, pp. 751–758. Cited by: §4.1.2.

 The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279. Cited by: Appendix A, Appendix A, §5.2, §6.1, §7.
 Julia: A fresh approach to numerical computing. SIAM Review 59 (1), pp. 65–98. Cited by: §6.1.
 Distributed evaluations: ending neural point metrics. CoRR abs/1806.03790. External Links: 1806.03790 Cited by: §5.2.
 How many random seeds? Statistical power analysis in deep reinforcement learning experiments. CoRR abs/1806.08295. External Links: 1806.08295 Cited by: §7.
 PageRank optimization by edge selection. Discrete Applied Mathematics 169, pp. 73–87. Cited by: Appendix B.
 Adaptive stepsizes for reinforcement learning. Ph.D. Thesis, University of Massachusetts Amherst. Cited by: §3.1, §4.1.1, §4.1.2, §6.1.
 Maximizing pagerank via outlinks. CoRR abs/0711.2867. External Links: 0711.2867 Cited by: Appendix B.
 Modelfree reinforcement learning with continuous action in practice. In American Control Conference, ACC, pp. 2177–2182. Cited by: §6.1.
 The reinforcement learning competition 2014. AI Magazine 35 (3), pp. 61–65. Cited by: §7.
 The oxford dictionary of statistical terms. Oxford University Press on Demand. Cited by: §4.1.2.
 Benchmarking optimization software with performance profiles. Math. Program. 91 (2), pp. 201–213. Cited by: §4.1.2, §5.2.
 Challenges of realworld reinforcement learning. CoRR abs/1904.12901. External Links: 1904.12901 Cited by: §1.
 Asymptotic minimax character of a sample distribution function and of the classical multinomial estimator. Annals of Mathematical Statistics 27, pp. 642–669. Cited by: §C.1, §C.1.

 Interaction of culturebased learning and cooperative coevolution and its application to automatic behaviorbased system design. IEEE Trans. Evolutionary Computation 14 (1), pp. 23–57. Cited by: §5.2.
 Ergodic control and polyhedral approaches to pagerank optimization. IEEE Trans. Automat. Contr. 58 (1), pp. 134–148. Cited by: Appendix B, Appendix B, §C.2, §C.2, Appendix D, Appendix D.
 Experimental algorithmics, from algorithm design to robust and efficient software [dagstuhl seminar, september 2000]. Lecture Notes in Computer Science, Vol. 2547, Springer. Cited by: §7.
 How not to lie with statistics: the correct way to summarize benchmark results. Commun. ACM 29 (3), pp. 218–221. Cited by: §4.1.1.
 Correct equations for the dynamics of the cartpole system. Center for Cognitive and Neural Studies (Coneural), Romania. Cited by: §6.1.
 Toybox: Better Atari Environments for Testing Reinforcement Learning Agents. In NeurIPS 2018 Workshop on Systems for ML, Cited by: §7.
 RLPy: A valuefunctionbased reinforcement learning framework for education and research. Journal of Machine Learning Research 16, pp. 1573–1578. Cited by: Appendix F, §6.1.
 Deep reinforcement learning that matters. In Proceedings of the ThirtySecond AAAI Conference on Artificial Intelligence, (AAAI18), pp. 3207–3214. Cited by: §1, §7.

 Testing heuristics: we have it all wrong. Journal of Heuristics 1 (1), pp. 33–42. Cited by: §7.
 Uniform, nonparametric, nonasymptotic confidence sequences. arXiv: Statistics Theory. Cited by: §6.3.
 Reproducibility of benchmarked deep reinforcement learning tasks for continuous control. CoRR abs/1708.04133. External Links: 1708.04133 Cited by: §7.
 Using cumulative distribution based performance analysis to benchmark models. In Critiquing and Correcting Trends in Machine Learning Workshop at Neural Information Processing Systems, Cited by: §4.1.2.
 Reevaluate: reproducibility in evaluating reinforcement learning algorithms. Cited by: §7.
 Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems 22., pp. 1015–1023. Cited by: §6.1.
 Value function approximation in reinforcement learning using the fourier basis. In Proceedings of the TwentyFifth AAAI Conference on Artificial Intelligence, AAAI, Cited by: Appendix E, §6.1.
 Troubling trends in machine learning scholarship. CoRR abs/1807.03341. External Links: 1807.03341 Cited by: §7.
 Are gans created equal? A largescale study. In Advances in Neural Information Processing Systems 31., pp. 698–707. Cited by: §3.1.
 A comparative analysis of expected and distributional reinforcement learning. In The ThirtyThird AAAI Conference on Artificial Intelligence, pp. 4504–4511. Cited by: §7.
 Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. J. Artif. Intell. Res. 61, pp. 523–562. Cited by: §4.1.1.
 The tight constant in the DvoretzkyKieferWolfowitz inequality. The Annals of Probability 18 (3), pp. 1269–1283. Cited by: §C.1, §C.1.
 A guide to experimental algorithmics. Cambridge University Press. Cited by: §7.
 On the state of the art of evaluation in neural language models. In 6th International Conference on Learning Representations, ICLR, Cited by: §3.1.
 Humanlevel control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §4.1.1.
 Utilizing the natural gradient in temporal difference reinforcement learning with eligibility traces. In International Symposium on Information Geometry and Its Applications, pp. 256–263. Cited by: §6.1.
 α-Rank: multiagent evaluation by evolution. Scientific reports 9 (1), pp. 1–29. Cited by: Appendix B, Appendix B, §4.2.
 The pagerank citation ranking: bringing order to the web.. Technical report Stanford InfoLab. Cited by: Appendix B, Appendix B.
 Genetic and evolutionary computation conference, GECCO. ACM. Cited by: A. M. S. Barreto, H. S. Bernardino, and H. J. C. Barbosa (2010).
 A convergent form of approximate policy iteration. In Advances in Neural Information Processing Systems 15, pp. 1595–1602. Cited by: §6.2.
 Improving reproducibility in machine learning research (A report from the NeurIPS 2019 reproducibility program). CoRR abs/2003.12206. External Links: 2003.12206 Cited by: §7.
 Markov decision processes: discrete stochastic dynamic programming. Wiley Series in Probability and Statistics, Wiley. Cited by: §C.2.

 Reporting score distributions makes a difference: Performance study of LSTMnetworks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP, pp. 338–348. Cited by: §5.2.
 Multiagent evaluation under incomplete information. In Advances in Neural Information Processing Systems 3, NeurIPS, pp. 12270–12282. Cited by: Appendix B, Appendix B, §C.2, §C.2, §4.2, §6.3.
 Proximal policy optimization algorithms. CoRR abs/1707.06347. External Links: 1707.06347 Cited by: §6.1.
 Reinforcement learning: an introduction. MIT press. Cited by: §6.1.
 Reinforcement learning  an introduction. Adaptive computation and machine learning, MIT Press. Cited by: §6.1, §6.1.
 Generalization in reinforcement learning: successful examples using sparse coarse coding. In Advances in Neural Information Processing Systems 8, pp. 1038–1044. Cited by: §6.1.
 Bias in natural actorcritic algorithms. In Proceedings of the 31th International Conference on Machine Learning, ICML, pp. 441–448. Cited by: §6.1.
 The mirage of actiondependent baselines in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, ICML, pp. 5022–5031. Cited by: §7.
 Introduction to the special issue on empirical evaluations in reinforcement learning. Mach. Learn. 84 (12), pp. 1–6. Cited by: §7.
 Protecting against evaluation overfitting in empirical reinforcement learning. See 1, pp. 120–127. Cited by: Appendix A, §6.1.
 Report on the 2008 reinforcement learning competition. AI Magazine 31 (2), pp. 81–94. Cited by: §7.
 Convergence and divergence in standard and averaging reinforcement learning. In Machine Learning: ECML 2004, 15th European Conference on Machine Learning, Lecture Notes in Computer Science, Vol. 3201, pp. 477–488. Cited by: §6.2.
 Tight performance bounds on greedy policies based on imperfect value functions. Cited by: Appendix D.
 Measuring and characterizing generalization in deep reinforcement learning. arXiv preprint arXiv:1812.02868. Cited by: §7.
Appendix
Appendix A Other Normalization Methods
A simple normalization technique is to map scores on an environment that are in the range to , i.e., (Bellemare et al., 2013). However, this can result in normalized performance measures that cluster in different regions of for each environment. For example, consider one environment where the minimum is , the maximum is and a uniform random policy can score around . Similarly, consider a second environment where the minimum score is , maximum is , and random gets around . On the first environment, algorithms will tend to have a normalized performance near and in the second case most algorithms will have a normalized performance near . So in the second environment algorithms will likely appear worse than algorithms in the first regardless of how close to optimal they are. This means the normalized performances are not really comparable.
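The clustering effect can be made concrete with a small sketch. The two environments and all of their numbers below are hypothetical and chosen only for illustration; the linear range mapping is the standard one and is assumed to match the normalization described above.

```python
def range_normalize(score, lowest, highest):
    """Linearly map a score from [lowest, highest] onto [0, 1]."""
    return (score - lowest) / (highest - lowest)


# Hypothetical environments; these numbers are illustrative, not from the paper.
environments = {
    "env_1": {"lowest": 0.0, "highest": 100.0, "random": 80.0},
    "env_2": {"lowest": -1000.0, "highest": 0.0, "random": -900.0},
}

for name, env in environments.items():
    random_norm = range_normalize(env["random"], env["lowest"], env["highest"])
    # An algorithm that closes half of the gap between random and the maximum.
    halfway = env["random"] + 0.5 * (env["highest"] - env["random"])
    halfway_norm = range_normalize(halfway, env["lowest"], env["highest"])
    print(f"{name}: random={random_norm:.2f}  halfway-to-optimal={halfway_norm:.2f}")
```

In this toy setting the same relative improvement over random maps to 0.90 in the first environment and 0.55 in the second, so the normalized scores are not directly comparable.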
A different version of this approach uses the minimum and maximum mean performance of each algorithm (Bellemare et al., 2013; Balduzzi et al., 2018). Let be the sample mean of . Then this normalization method uses the following function, . This sets the best algorithm’s performance on each environment to and the worst to , spreading the range of values out over the whole interval . This normalization technique does not correct for nonlinear scaling of performance. As a result algorithms could be near or if there is an outlier algorithm that does very well or poorly. For example, one could introduce a terrible algorithm that just chooses one action the whole time. This makes the environment seem easier as all scores would be near except for this bad algorithm. We would like the evaluation procedure to be robust to the addition of poor algorithms.
An alternative normalization technique proposed by Whiteson et al. (2011) uses the probability that one algorithm outperforms another on an environment, , i.e., . This technique is intuitive and straightforward to estimate but neglects the difference in score magnitudes. For example, consider that algorithm always scores a and algorithm always scores , the probability that is better than is , but the difference between them is small, and the normalized score of neglects this difference.
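A minimal sketch of the probability-of-outperforming normalization shows how it discards score magnitudes. The estimator below (a fraction of winning sample pairs) and the constant scores are assumptions made for illustration; the paper's exact estimator is not reproduced here.

```python
def prob_outperforms(scores_a, scores_b):
    """Fraction of sample pairs (a, b) in which a scores strictly higher than b."""
    wins = sum(a > b for a in scores_a for b in scores_b)
    return wins / (len(scores_a) * len(scores_b))


# Hypothetical returns: algorithm A is deterministic and barely better than B.
barely_better = prob_outperforms([1.00] * 5, [0.99] * 5)
# Hypothetical returns: algorithm A is far better than B.
far_better = prob_outperforms([100.0] * 5, [0.0] * 5)

print(barely_better, far_better)  # both are 1.0 despite very different gaps
```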
Appendix B α-Rank and our Implementation
The α-Rank procedure finds a solution to a game by computing the stationary distribution of strategy profiles when each player is allowed to change their strategy. This is done by constructing a directed graph where nodes are pure strategies and edges have weights corresponding to the probability that one of the players switches strategies. This graph can be represented by a Markov matrix, . The entry corresponds to a probability of switching from a strategy to . Only one player is allowed to change strategies at a time, so the possible transitions for a strategy , are any strategies or for all and . The entries of the matrix for valid transitions are:
(4) 
where , represents the player who switched from strategy to , is the population constant (we set it to ). To ensure is irreducible we follow the damping approach used in PageRank (Page et al., 1999), i.e., , where is a hyperparameter and represents the probability of randomly switching to any strategy in . This method differs from the infinite-α approach presented by Rowland et al. (2019), but in the limit as the solutions are in agreement. This approach has the benefit that it has no data-dependent hyperparameter and has a simple interpretation. We expand on these differences below. We set so the stationary distribution can cover the longest possible chain of strategies before a reset occurs.
The equilibrium over strategies is then given by the stationary distribution of the Markov chain induced by , i.e., is a distribution such that . The aggregate performance can then be computed using as in (2). However, we use the following alternative but equivalent method to compute the aggregate performance more efficiently (Fercoq et al., 2013)
(5) 
where is a vector with entries with and . Notice that is ignored because is already specified by .
The typical α-Rank procedure uses transition probabilities, , that are based on a logistic transformation of the payoff difference . These differences are scaled by a parameter α; as α approaches infinity, the transition matrix approximates the Markov Conley chain (MCC), which is the motivation for using α-Rank as a solution concept for games. See the work of Omidshafiei et al. (2019) for more detailed information.
The entries of the matrix as given by Rowland et al. (2019) are:
(6) 
where , represents the player who switched from strategy to , is the population constant (we set it to following the prior work). Theoretically, α could be chosen arbitrarily high and the matrix would still be irreducible. However, due to numerical precision issues, a high value of α sets transition probabilities to zero for some dominated strategies, i.e., . The suggested method to choose α is to tune it on a logarithmic scale to find the highest value such that the transition matrix, , is still irreducible (Omidshafiei et al., 2019).
This strategy works when the payoffs are known, but when they represent empirical samples of performance, the value of α chosen will depend on the empirical payoff functions. Setting α based solely on the empirical payoffs could introduce bias to the matrix based on that sample. So we need a different solution without a data-dependent hyperparameter.
In the MCC graph construction, all edges leading to strategies with strictly greater payoffs have the same positive weight. All edges that lead to strategies with the same payoff have the same weight, but one that is less than that of the strictly greater payoff. There are no transitions to strategies with worse payoffs. As α → ∞ the transition probabilities quickly saturate to if and if . So we use the saturation values to set the transition probabilities so that our matrix is close to the MCC construction. However, this often makes the transition matrix reducible, i.e., the stationary distribution might have mass on only one strategy.
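The saturation behavior can be illustrated numerically. The sketch below assumes the logistic-style switching probability commonly stated for α-Rank (Omidshafiei et al., 2019), namely (1 − exp(−αΔ)) / (1 − exp(−αmΔ)) for a payoff gain Δ ≠ 0 and 1/m otherwise; this expression is recalled from that work and should be checked against (6). The payoff gains, the value of m, and the function name are hypothetical.

```python
import math


def fixation_probability(payoff_gain, alpha, m):
    """Logistic-style probability of switching given a payoff gain.

    payoff_gain: payoff of the new strategy minus payoff of the current one.
    alpha: scaling of the payoff difference; m: population constant.
    """
    if payoff_gain == 0.0:
        return 1.0 / m
    if -alpha * m * payoff_gain > 700.0:
        # exp() would overflow a double; the ratio is numerically zero here.
        return 0.0
    return math.expm1(-alpha * payoff_gain) / math.expm1(-alpha * m * payoff_gain)


m = 50  # hypothetical population constant
for alpha in (0.1, 1.0, 10.0, 100.0):
    better = fixation_probability(0.1, alpha, m)
    worse = fixation_probability(-0.1, alpha, m)
    print(f"alpha={alpha:6.1f}  better={better:.4f}  worse={worse:.3e}  equal={1.0 / m:.4f}")
```

As α grows, the probability of moving to a better strategy approaches one, the probability of moving to a worse one collapses toward zero (and eventually underflows), and ties stay at 1/m, which is the pattern the MCC construction encodes directly.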
To force the matrix to be irreducible we modify the matrix to include a random jump to any strategy with probability , i.e., . This is commonly done in the PageRank method (Page et al., 1999), which also computes the stationary distribution of a Markov matrix. For the matrix is unchanged and represents the MCC solution, but is reducible. For near one, the stationary distribution will be similar to the MCC solution, with high weight placed on dominant strategies and small weight given to weak ones. As the stationary distribution becomes more uniform and is only considering shorter sequences of transitions before a random jump occurs.
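The effect of the random jump can be sketched on a tiny chain. In the snippet below, `p` is assumed to be the probability of following the original transition matrix (so the jump happens with probability 1 − p); that parameterization, the three-strategy matrix, and the function names are illustrative assumptions, and power iteration stands in for whatever solver the implementation actually uses.

```python
def damp(matrix, p):
    """Mix a row-stochastic matrix with a uniform random jump.

    With probability p a transition follows `matrix`; with probability
    1 - p the chain jumps to a uniformly random strategy.
    """
    n = len(matrix)
    return [[p * matrix[i][j] + (1.0 - p) / n for j in range(n)] for i in range(n)]


def stationary_distribution(matrix, iters=10_000, tol=1e-12):
    """Power iteration for a row vector q satisfying q = q * matrix."""
    n = len(matrix)
    q = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(q[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(q, nxt)) < tol:
            return nxt
        q = nxt
    return q


# Hypothetical chain: strategy 0 moves toward 1, 1 toward 2, and 2 is absorbing,
# so the undamped matrix is reducible.
chain = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]

for p in (0.9, 0.1):
    q = stationary_distribution(damp(chain, p))
    print(f"p={p}: " + ", ".join(f"{x:.3f}" for x in q))
```

With `p` near one most of the mass sits on the dominant strategy, while a small `p` gives a nearly uniform distribution, matching the qualitative description above.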
We chose to set so that the expected number of transitions to occur before a random jump is . This allows for propagation of transition probabilities to cover every strategy combination. We could have chosen to set near one, e.g., , but this would make the computation of the confidence intervals take longer. This is because optimizing the within its bounds is equivalent to finding the optimal value function of a Markov decision process (MDP) with a discount parameter of . See the work of de Kerchove et al. (2007); Fercoq et al. (2013); Csáji et al. (2014) for more information on this connection. Solving an MDP with a discount near one causes value iteration and policy iteration to converge more slowly than if is small. So we chose such that it could still find solutions near the MCC solution, but remain computationally efficient.
Appendix C Confidence Intervals on the Aggregate Performance
Symbol List  

Symbol  Description 
set of algorithms in the evaluation  
set of environments in the evaluation  
random variable representing performance of algorithm on environment  
number of samples of performance for algorithm on environment  
the sample of performance of algorithm on environment and sorted such that  
data set containing all samples of performance for each algorithm on each environment  
is the aggregate performance for each algorithm  
lower and upper confidence intervals on computed using  
confidence level for the aggregate performance  
cumulative distribution function (CDF) of and is also used for normalization  
empirical cumulative distribution function constructed using samples  
,  lower and upper confidence intervals on computed using 
performance of algorithm , i.e.,  
lower and upper confidence intervals on computed using .  
strategy for player where and is often denoted using  
strategy for player where and is often denoted using  
joint strategy where and is often denoted as  
strategy for player represented as a distribution over  
strategy for player represented as a distribution over  
payoff for player when is played, i.e.,  
payoff for player when is played, i.e.,  
confidence intervals on for player computed using D 
In this section, we detail the PBP procedure for computing confidence intervals and on the aggregate performance and prove that they hold with high probability. That is, we show that for any confidence level ;
(7) 
We will first describe the PBP procedure to compute confidence intervals and then prove that they are valid. A list of the symbols used in the construction of confidence intervals and their descriptions is provided in Table 3 to refresh the reader. The steps to compute the confidence intervals are outlined in Algorithm 1.
Recall that the aggregate performance for an algorithm is
(8) 
where is the equilibrium solution to the game specified in Section 4.2. To compute valid confidence intervals on using a dataset , the uncertainty of and mean normalized performance must be accounted for. PBP accomplishes this in three primary steps. The first step is to compute confidence intervals on such that
(9) 
The second step is to compute the uncertainty set containing all possible that are compatible with and . The last step is to compute the smallest and largest possible aggregate performances for each algorithm over these sets, i.e.,
(10) 
PBP follows this process, except in the last two steps is never explicitly constructed to improve computational efficiency. Intuitively, the procedure provides valid confidence intervals because all values used to compute the aggregate performance depend on the normalized performance. So by guaranteeing with probability at least that the true mean normalized performances will be between and , then so long as the upper (lower) confidence interval computed is at least as large (small) as the maximum (minimum) of the aggregate score for any setting of , the confidence intervals will be valid.
We break the rest of this section into two subsections. The first subsection discusses constructing the confidence intervals on the mean normalized performance and proving their validity. The second subsection describes how to construct the confidence intervals on the aggregate performance and proves their validity.
C.1 Confidence intervals on the normalized performance
The normalized performance has two unknowns, and the distribution of . To compute confidence intervals on for all , confidence intervals are needed on the output of all distribution functions . The confidence intervals on the distributions can then be combined to get confidence intervals on .
To compute confidence intervals on we assume that is bounded on the interval for all and . Let be the empirical CDF with
(12) 
where is the number of samples of , is the sample of , and if event is true and otherwise. Using the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality (Dvoretzky et al., 1956) with tight constants (Massart, 1990), we define and to be the lower and upper confidence intervals on , i.e.,
(13) 
where is a confidence level and we use . By the DKW inequality with tight constants the following property is known:
Property 1 (DKW with tight constants confidence intervals).
(14) 
Further, by the union bound we have that
(15) 
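A DKW confidence band on a CDF can be sketched in a few lines. The snippet assumes the standard two-sided form of the inequality with Massart's constant, where the half-width is sqrt(ln(2/δ) / (2n)); how the paper splits the confidence level across algorithm and environment pairs under the union bound is not reproduced here, and the function names and sample values are hypothetical.

```python
import math
from bisect import bisect_right


def dkw_epsilon(n, delta):
    """Half-width of a two-sided (1 - delta) DKW band with Massart's constant."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def ecdf(sorted_samples, x):
    """Empirical CDF: fraction of samples less than or equal to x."""
    return bisect_right(sorted_samples, x) / len(sorted_samples)


def dkw_band(samples, x, delta):
    """Lower and upper confidence bounds on F(x), clipped to [0, 1]."""
    xs = sorted(samples)
    eps = dkw_epsilon(len(xs), delta)
    f_hat = ecdf(xs, x)
    return max(0.0, f_hat - eps), min(1.0, f_hat + eps)


# Hypothetical performance samples for one algorithm on one environment.
samples = [0.1, 0.4, 0.35, 0.8, 0.65, 0.5, 0.9, 0.2, 0.7, 0.55]
lower, upper = dkw_band(samples, x=0.5, delta=0.05)
print(f"F(0.5) is in [{lower:.3f}, {upper:.3f}] with confidence 0.95")
```

Because the band holds uniformly over x, the same half-width can be reused at every evaluation point of the CDF for a given sample set.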
To construct confidence intervals on the mean normalized performance, we will use Anderson’s inequality (Anderson, 1969). Let be a bounded random variable on , with sorted samples , , and . Let be a monotonically increasing function. Anderson’s inequality specifies for a confidence level the following high confidence bounds on :
(16)  
where uses the DKW inequality with tight constants and as defined in (13).
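Anderson-style bounds on the mean of a monotone function of a bounded random variable can be sketched as follows. The code assumes the textbook form of the bound, obtained by integrating against the lower and upper DKW bands, with a two-sided DKW half-width so that both bounds hold together with probability at least 1 − δ; the exact constants in (16) may differ, and the function name, default `g`, and sample values are hypothetical.

```python
import math


def anderson_bounds(samples, a, b, delta, g=lambda x: x):
    """High-confidence bounds on E[g(X)] for X in [a, b] and nondecreasing g."""
    xs = sorted(samples)
    n = len(xs)
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    # z[0] = a, z[i] = i-th order statistic, z[n + 1] = b
    z = [a] + xs + [b]
    # Upper bound: shift the empirical CDF down by eps (clipped at 0).
    upper = g(b) - sum(
        (g(z[i + 1]) - g(z[i])) * max(0.0, i / n - eps) for i in range(1, n + 1)
    )
    # Lower bound: shift the empirical CDF up by eps (clipped at 1).
    lower = g(z[n]) - sum(
        (g(z[i + 1]) - g(z[i])) * min(1.0, i / n + eps) for i in range(0, n)
    )
    return lower, upper


# Hypothetical samples bounded on [0, 1].
samples = [0.2, 0.4, 0.6, 0.8]
lower, upper = anderson_bounds(samples, a=0.0, b=1.0, delta=0.05)
print(f"mean is in [{lower:.4f}, {upper:.4f}] with confidence 0.95")
```

With eps set to zero both expressions reduce to the sample mean of g, which is a quick sanity check on the indexing; with few samples the interval is wide because it makes no distributional assumption beyond boundedness.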
Anderson’s inequality can be used to bound the mean normalized performance since is a monotonically increasing function and . Since is unknown, we replace in Anderson’s inequality with for the lower bound and for the upper bound. This gives the following confidence intervals for :
(17)  
where , , and . We now prove the following lemma:
Lemma 1.
If and are computed by (17), then:
(18) 
Proof.
By Anderson’s inequality we know that is a high-confidence upper bound on and similarly is a high-confidence lower bound on , i.e.,
(19)  