1 Introduction
Reinforcement Learning (RL) agents aim to maximise collected rewards by interacting over a certain period of time in initially unknown environments. Actions that yield the highest performance according to the current knowledge of the environment and those that maximise the gathering of new knowledge on the environment may not be the same. This is the dilemma known as Exploration/Exploitation (E/E). In such a context, using prior knowledge of the environment is extremely valuable, since it can help guide the decision-making process in order to reduce the time spent on exploration. Model-based Bayesian Reinforcement Learning (BRL) (Dearden99; Strens00) specifically targets RL problems for which such prior knowledge is encoded in the form of a probability distribution (the “prior”) over possible models of the environment. As the agent interacts with the actual model, this probability distribution is updated according to Bayes' rule into what is known as the “posterior distribution”. The BRL process may be divided into two learning phases. The offline learning phase is the phase in which the prior knowledge is used to warm up the agent for its future interactions with the real model. The online learning phase, on the other hand, refers to the actual interactions between the agent and the model. In many applications, interacting with the actual environment may be very costly (e.g. medical experiments). In such cases, the experiments made during the online learning phase are likely to be much more expensive than those performed during the offline learning phase.
In this paper, we investigate how the way BRL algorithms use the offline learning phase may impact online performance. To properly compare Bayesian algorithms, the first comprehensive BRL benchmarking protocol is designed, following the foundations of Castronovo14. “Comprehensive BRL benchmark” refers to a tool which assesses the performance of BRL algorithms over a large set of problems that are actually drawn according to a prior distribution. In previous papers addressing BRL, authors usually validate their algorithm by testing it on a few test problems, defined by a small set of predefined MDPs. For instance, BAMCP (Guez2012), SBOSS (Castro10), and BFS3 (Asmuth11approachingbayesoptimalilty) are all validated on a fixed number of MDPs. In their validation process, the authors select a few BRL tasks, for which they choose one arbitrary transition function, which defines the corresponding MDP. Then, they define one prior distribution compliant with the transition function. This type of benchmarking is problematic in the sense that the authors actually know the hidden transition function of each test case. It also creates an implicit incentive to overfit their approach to a few specific transition functions, which should be completely unknown before interacting with the model. In this paper, we compare BRL algorithms on several different tasks. In each task, the real transition function is drawn from a random distribution, instead of being arbitrarily fixed. Each algorithm is thus evaluated, for each test case, against a distribution over an infinitely large set of MDPs rather than a handful of fixed ones. To perform our experiments, we developed the BBRL library, whose objective is also to provide other researchers with our benchmarking tool.
This paper is organised as follows: Section 2 presents the problem statement. Section 3 formally defines the experimental protocol designed for this paper. Section 4 briefly presents the library. Section 5 shows a detailed application of our protocol, comparing several well-known BRL algorithms on three different benchmarks. Section 6 concludes the study.
2 Problem Statement
This section is dedicated to the formalisation of the different tools and concepts discussed in this paper.
2.1 Reinforcement Learning
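Let $M = (X, U, f(\cdot), \rho_M, p_{M,0}(\cdot), \gamma)$ be a given unknown MDP, where $X$ denotes its finite state space and $U$ refers to its finite action space. When the MDP is in state $x_t$ at time $t$ and action $u_t$ is selected, the agent moves instantaneously to a next state $x_{t+1}$ with a probability of $P(x_{t+1} \mid x_t, u_t) = f(x_t, u_t, x_{t+1})$. An instantaneous deterministic, bounded reward $r_t = \rho_M(x_t, u_t, x_{t+1})$ is observed.
Let $h_t = (x_0, u_0, r_0, x_1, \ldots, x_{t-1}, u_{t-1}, r_{t-1}, x_t)$ denote the history observed until time $t$. An E/E strategy is a stochastic policy $\pi$ which, given the current state $x_t$, returns an action $u_t \sim \pi(h_t)$. Given a probability distribution over initial states $p_{M,0}(\cdot)$, the expected return of a given E/E strategy $\pi$ with respect to the MDP $M$ can be defined as follows:
$$J_M^{\pi} = \mathbb{E}_{x_0 \sim p_{M,0}(\cdot)} \left[ R_M^{\pi}(x_0) \right],$$
where $R_M^{\pi}(x_0)$ is the stochastic sum of discounted rewards received when applying the policy $\pi$, starting from an initial state $x_0$:
$$R_M^{\pi}(x_0) = \sum_{t=0}^{+\infty} \gamma^t r_t.$$
RL aims to learn the behaviour that maximises $J_M^{\pi}$, i.e. learning a policy $\pi^*$ defined as follows:
$$\pi^* \in \arg\max_{\pi} J_M^{\pi}.$$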
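As a minimal sketch of the quantity being maximised, the discounted sum of rewards of one finite trajectory could be computed as follows. The function name `discounted_return` and the argument names `rewards` and `gamma` are illustrative assumptions, not part of BBRL.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t] over a finite list of observed rewards.

    `rewards` and `gamma` are hypothetical names chosen for this sketch.
    """
    total = 0.0
    discount = 1.0  # equals gamma**t at step t
    for reward in rewards:
        total += discount * reward
        discount *= gamma
    return total
```

Averaging this value over many trajectories, each started from a state drawn from the initial state distribution, would give a Monte Carlo estimate of the expected return.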
2.2 Prior Knowledge
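In this paper, the actual MDP is assumed to be initially unknown. Model-based Bayesian Reinforcement Learning (BRL) proposes to model the uncertainty using a probability distribution $p^0_{\mathcal{M}}(\cdot)$ over a set of candidate MDPs $\mathcal{M}$. Such a probability distribution is called a prior distribution and can be used to encode specific prior knowledge available before interaction. Given a prior distribution $p^0_{\mathcal{M}}(\cdot)$, the expected return of a given E/E strategy $\pi$ is defined as:
$$J^{\pi}_{p^0_{\mathcal{M}}(\cdot)} = \mathbb{E}_{M \sim p^0_{\mathcal{M}}(\cdot)} \left[ J_M^{\pi} \right],$$
where $J_M^{\pi}$ denotes the expected return of $\pi$ on the MDP $M$ (Section 2.1).
In the BRL framework, the goal is to maximise $J^{\pi}_{p^0_{\mathcal{M}}(\cdot)}$, by finding $\pi^*$, which is called the “Bayesian optimal policy” and defined as follows:
$$\pi^* \in \arg\max_{\pi} J^{\pi}_{p^0_{\mathcal{M}}(\cdot)}.$$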
2.3 Computation time characterisation
Most BRL algorithms rely on some properties which, given sufficient computation time, ensure that their agents will converge to an optimal behaviour. However, it is hard to know beforehand whether an algorithm will satisfy fixed computation time constraints while providing good performance.
The parameterisation of the algorithms makes the selection even more complex. Most BRL algorithms depend on parameters (number of transitions simulated at each iteration, etc.) which, in some way, can affect the computation time. In addition, for one given algorithm and fixed parameters, the computation time often varies from one simulation to another. These features make it nearly impossible to compare BRL algorithms under strict computation time constraints. In this paper, to address this problem, algorithms are run with multiple choices of parameters, and we analyse their time performance a posteriori.
Furthermore, a distinction between the offline and online computation time is made. Offline computation time corresponds to the phase during which the agent is able to exploit its prior knowledge, but cannot interact with the MDP yet. One can see it as the time given to make the first decision. In most algorithms considered in this paper, this phase is generally used to initialise some data structure. On the other hand, online computation time corresponds to the time consumed by an algorithm for making each decision.
There are many ways to characterise algorithms based on their computation time. One can compare them based on the average time needed per step or on the offline computation time alone. To remain flexible, for each run of each algorithm, we store its computation times $(b_i)_{-1 \le i}$, with $i$ indexing the time step, and $b_{-1}$ denoting the offline learning time. Then a feature function $\phi((b_i)_{-1 \le i})$ is extracted from this data. This function is used as a metric to characterise and discriminate algorithms based on their time requirements.
In our protocol, which is detailed in the next section, two types of characterisation are used. For a set of experiments, algorithms are classified based on their offline computation time only, i.e. we use $\phi((b_i)_{-1 \le i}) = b_{-1}$. Afterwards, the constraint is defined as $\phi((b_i)_{-1 \le i}) \le K$, $K > 0$, in case it is required to only compare the algorithms that have an offline computation time lower than $K$. For another set of experiments, algorithms are separated according to their empirical average online computation time. In this case, $\phi((b_i)_{-1 \le i}) = \frac{1}{n} \sum_{0 \le i < n} b_i$, where $n$ is the number of online time steps. Algorithms can then be classified based on whether or not they respect the constraint $\phi((b_i)_{-1 \le i}) \le K$, $K > 0$.
This formalisation could be used for any other computation time characterisation. For example, one could want to analyse algorithms based on the longest computation time of a trajectory, and define $\phi((b_i)_{-1 \le i}) = \max_{-1 \le i} b_i$.
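The characterisations discussed in this section can be sketched as plain functions over a list of recorded times. The storage convention used here (offline time at index 0, followed by one entry per online step) and all function names are assumptions made for this illustration only.

```python
def phi_offline(times):
    """Offline computation time; times[0] is assumed to hold it."""
    return times[0]


def phi_online_mean(times):
    """Empirical average of the online computation times (times[1:])."""
    online = times[1:]
    return sum(online) / len(online)


def phi_longest(times):
    """Longest recorded computation time of the run."""
    return max(times)


def satisfies(phi, times, bound):
    """True if the run respects the constraint phi(times) <= bound."""
    return phi(times) <= bound
```

For instance, `satisfies(phi_online_mean, [2.0, 0.1, 0.3], 0.25)` checks whether a run with two online steps stays under an average budget of 0.25 time units per step.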
3 A new Bayesian Reinforcement Learning benchmark protocol
3.1 A comparison criterion for BRL
In this paper, a real Bayesian evaluation is proposed, in the sense that the different algorithms are compared on a large set of problems drawn according to a test probability distribution. This is in contrast with the Bayesian literature (Guez2012; Castro10; Asmuth11approachingbayesoptimalilty), where authors pick a fixed number of MDPs on which they evaluate their algorithm.
Our criterion to compare algorithms is to measure their average rewards against a given random distribution of MDPs, using another distribution of MDPs as prior knowledge. In our experimental protocol, an experiment is defined by a prior distribution $p^0_{\mathcal{M}}(\cdot)$ and a test distribution $p_{\mathcal{M}}(\cdot)$. Both are random distributions over the set of possible MDPs, not stochastic transition functions. To illustrate the difference, let us take an example. Let $(x, u, x')$ be a transition. Given a transition function $f: X \times U \times X \to [0; 1]$, $f(x, u, x')$ is the probability of observing $x'$ if we chose $u$ in $x$. In this paper, this function $f$ is assumed to be the only unknown part of the MDP that the agent faces. Given a certain test case, $f$ corresponds to a unique MDP $M \in \mathcal{M}$. A Bayesian learning problem is then defined by a probability distribution over a set $\mathcal{M}$ of possible MDPs. We call it a test distribution, and denote it $p_{\mathcal{M}}(\cdot)$. Prior knowledge can then be encoded as another distribution over $\mathcal{M}$, denoted $p^0_{\mathcal{M}}(\cdot)$. We call “accurate” a prior which is identical to the test distribution ($p^0_{\mathcal{M}}(\cdot) = p_{\mathcal{M}}(\cdot)$), and we call “inaccurate” a prior which is different ($p^0_{\mathcal{M}}(\cdot) \neq p_{\mathcal{M}}(\cdot)$).
In previous Bayesian literature, authors select a fixed number of MDPs , train and test their algorithm on them. Doing so does not guarantee any generalisation capabilities. To solve this problem, a protocol that allows rigorous comparison of BRL algorithms is designed. Training and test data are separated, and can even be generated from different distributions (in what we call the inaccurate case).
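More precisely, our protocol can be described as follows: each algorithm is first trained on the prior distribution. Then, its performance is evaluated by estimating the expectation of the discounted sum of rewards when it faces MDPs drawn from the test distribution. Let $J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$ be this value:
$$J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}} = \mathbb{E}_{M \sim p_{\mathcal{M}}} \left[ J_M^{\pi(p^0_{\mathcal{M}})} \right],$$
where $\pi(p^0_{\mathcal{M}})$ is the algorithm $\pi$ trained offline on $p^0_{\mathcal{M}}$. In our Bayesian RL setting, we want to find the algorithm $\pi^*$ which maximises $J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$ for the $\langle p^0_{\mathcal{M}}, p_{\mathcal{M}} \rangle$ experiment:
$$\pi^* \in \arg\max_{\pi} J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}.$$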
In addition to the performance criterion, we also measure the empirical computation time. In practice, all problems are subject to time constraints. Hence, it is important to take this parameter into account when comparing different algorithms.
3.2 The experimental protocol
In practice, we can only sample a finite number of trajectories, and must rely on estimators to compare algorithms. This section describes our experimental protocol, which is based on our comparison criterion for BRL and provides a detailed computation time analysis.
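An experiment is defined by (i) a prior distribution $p^0_{\mathcal{M}}$ and (ii) a test distribution $p_{\mathcal{M}}$. Given these, an agent $\pi$ is evaluated as follows:

1. Train $\pi$ offline on $p^0_{\mathcal{M}}$.

2. Sample $N$ MDPs from the test distribution $p_{\mathcal{M}}$.

3. For each sampled MDP $M$, compute an estimate $\bar{J}_M^{\pi(p^0_{\mathcal{M}})}$ of $J_M^{\pi(p^0_{\mathcal{M}})}$.

4. Use these values to compute an estimate $\bar{J}^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$.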
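To estimate $J_M^{\pi(p^0_{\mathcal{M}})}$, the expected return of agent $\pi$ trained offline on $p^0_{\mathcal{M}}$, one trajectory is sampled on the MDP $M$, and the cumulated return is computed: $\bar{J}_M^{\pi(p^0_{\mathcal{M}})} = R_M^{\pi(p^0_{\mathcal{M}})}(x_0)$.
To estimate this return, each trajectory is truncated after $T$ steps. Therefore, given an MDP $M$ and its initial state $x_0$, we observe $\bar{R}_M^{\pi(p^0_{\mathcal{M}})}(x_0)$, an approximation of $R_M^{\pi(p^0_{\mathcal{M}})}(x_0)$:
$$\bar{R}_M^{\pi(p^0_{\mathcal{M}})}(x_0) = \sum_{t=0}^{T} \gamma^t r_t.$$
If $R_{max}$ denotes the maximal instantaneous reward an agent can receive when interacting with an MDP drawn from $p_{\mathcal{M}}$, then choosing $T$ as
$$T = \left\lfloor \frac{\log\left(\epsilon (1 - \gamma) / R_{max}\right)}{\log \gamma} \right\rfloor$$
guarantees the approximation error is bounded by $\epsilon > 0$, since the discarded tail of the sum is at most $\gamma^{T+1} R_{max} / (1 - \gamma)$.
A single value of $\epsilon$ is set for all experiments, as a compromise between measurement accuracy and computation time.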
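To make the truncation argument concrete, the following sketch computes a horizon from the geometric tail bound of a discounted sum. The names `truncation_horizon`, `tail_bound`, `epsilon`, `gamma` and `r_max` are assumptions of this example, and the expression is the standard one for this bound rather than a transcription of the BBRL source.

```python
import math


def truncation_horizon(epsilon, gamma, r_max):
    """Steps to keep so that the discarded discounted tail is at most epsilon.

    Assumes 0 < gamma < 1, r_max > 0 and epsilon * (1 - gamma) < r_max.
    """
    return math.floor(math.log(epsilon * (1.0 - gamma) / r_max) / math.log(gamma))


def tail_bound(horizon, gamma, r_max):
    """Upper bound on the discounted rewards collected after step `horizon`."""
    return gamma ** (horizon + 1) * r_max / (1.0 - gamma)
```

The horizon grows as the discount factor approaches 1, which is one reason why accuracy has to be traded against computation time.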
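Finally, to estimate our comparison criterion $J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$, the empirical average of the algorithm performance is computed over $N$ different MDPs, sampled from $p_{\mathcal{M}}$:
$$\bar{J}^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}} = \frac{1}{N} \sum_{0 \le i < N} \bar{J}_{M_i}^{\pi(p^0_{\mathcal{M}})} = \frac{1}{N} \sum_{0 \le i < N} \bar{R}_{M_i}^{\pi(p^0_{\mathcal{M}})}(x_0) \qquad (1)$$
For each agent $\pi$, we retrieve $\mu_{\pi} = \bar{J}^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$ and $\sigma_{\pi}$, the empirical mean and standard deviation of the results observed, respectively. This gives us the following statistical confidence interval at 95% for $J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$:
$$J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}} \in \left[ \mu_{\pi} - \frac{2 \sigma_{\pi}}{\sqrt{N}} ;\; \mu_{\pi} + \frac{2 \sigma_{\pi}}{\sqrt{N}} \right].$$
The values reported in the following figures and tables are estimations of the interval within which $J^{\pi(p^0_{\mathcal{M}})}_{p_{\mathcal{M}}}$ lies, with probability $0.95$.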
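The estimator and its confidence interval can be sketched as follows. This is an illustration under assumptions: the function name `mean_and_ci95` is hypothetical, the unbiased sample variance is used, and the 97.5% normal quantile is rounded to 2.

```python
import math


def mean_and_ci95(returns):
    """Empirical mean of per-MDP returns and an approximate 95% interval.

    `returns` holds one truncated discounted return per sampled MDP and
    must contain at least two values.
    """
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / (n - 1)
    half_width = 2.0 * math.sqrt(variance) / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width)
```

The interval shrinks as the square root of the number of sampled MDPs, so halving its width requires four times as many test MDPs.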
As introduced in Section 2.3, in our methodology, a function $\phi$ of computation times is used to classify algorithms based on their time performance. The choice of $\phi$ depends on the type of time constraints that are the most important to the user. In this paper, we reflect this by showing three different ways to choose $\phi$. These three choices lead to three different ways to look at the results and compare algorithms. The first one is to classify algorithms based on their offline computation time, the second one is to classify them based on their average online computation time. The third is a combination of the first two choices of $\phi$, which we denote $\phi_{off}$ and $\phi_{on}$. The objective is, for each pair of constraints $\phi_{off} \le K_1$ and $\phi_{on} \le K_2$, $K_1, K_2 > 0$, to identify the best algorithms that respect these constraints. In order to achieve this: (i) all agents that do not satisfy the constraints are discarded; (ii) for each algorithm, the agent leading to the best performance on average is selected; (iii) we build the list of agents whose performances are not significantly different (Footnote 1: a paired sampled test with a confidence level of 95% has been used to determine when two agents are statistically equivalent; more details in Appendix C).
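Steps (i) and (ii) of this selection procedure can be sketched as a filter followed by a per-algorithm maximum. The dictionary keys (`algorithm`, `score`, `offline`, `online`) and the function name are assumptions made for this example; step (iii), the statistical test, is deliberately left out.

```python
def select_best_agents(agents, offline_bound, online_bound):
    """Keep, for each algorithm, its best-scoring agent within both bounds.

    `agents` is a list of dicts with the hypothetical keys 'algorithm',
    'score', 'offline' (offline time) and 'online' (mean online time).
    """
    best = {}
    for agent in agents:
        # Step (i): discard agents violating either time constraint.
        if agent["offline"] > offline_bound or agent["online"] > online_bound:
            continue
        # Step (ii): keep the best remaining agent of each algorithm.
        current = best.get(agent["algorithm"])
        if current is None or agent["score"] > current["score"]:
            best[agent["algorithm"]] = agent
    return best
```

Sweeping the two bounds over a grid and recording which algorithms survive at each point gives the kind of two-dimensional comparison described in this section.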
The results will help us to identify, for each experiment, the most suitable algorithm(s) depending on the constraints the agents must satisfy. This protocol is an extension of the one presented in Castronovo14.
4 BBRL library
BBRL (Footnote 2: BBRL stands for Benchmarking tools for Bayesian Reinforcement Learning) is a C++ open-source library for Bayesian Reinforcement Learning (discrete state/action spaces). This library provides high-level features, while remaining as flexible and documented as possible to address the needs of any researcher in this field. To this end, we developed a complete command-line interface, along with a comprehensive website:
BBRL focuses on the core operations required to apply the comparison benchmark presented in this paper. To do a complete experiment with the BBRL library, follow these five steps:

1. We create a test and a prior distribution. Those distributions are represented by Flat Dirichlet Multinomial distributions (FDM), parameterised by a state space, an action space, a vector of parameters, and a reward function. For more information about the FDM distributions, see Section 5.2.

    ./BBRL-DDS --mdp_distrib_generation \
        --name <name> --short_name <short name> \
        --n_states <> --n_actions <> \
        --ini_state <> \
        --transition_weights \
            <> <> \
        --reward_type "RT_CONSTANT" \
        --reward_means \
            <> <> \
        --output <output file>

   A distribution file is created.

2. We create an experiment. An experiment is defined by a set of MDPs, drawn from a test distribution defined in a distribution file, a discount factor and a horizon limit.

    ./BBRL-DDS --new_experiment \
        --name <name> \
        --mdp_distribution "DirMultiDistribution" \
        --mdp_distribution_file <distribution file> \
        --n_mdps <> --n_simulations_per_mdp 1 \
        --discount_factor <> --horizon_limit <> \
        --compress_output \
        --output <output file>

   An experiment file is created and can be used to conduct the same experiment for several agents.

3. We create an agent. An agent is defined by an algorithm, a set of parameters, and a prior distribution defined in a distribution file, on which the created agent will be trained.

    ./BBRL-DDS --offline_learning \
        --agent <> [<parameters >] \
        --mdp_distribution "DirMultiDistribution" \
        --mdp_distribution_file <distribution file> \
        --output <output file>

   An agent file is created. The file also stores the computation time observed during the offline training phase.

4. We run the experiment. We need to provide an experiment file, an algorithm and an agent file.

    ./BBRL-DDS --run_experiment \
        --experiment \
        --experiment_file <experiment file> \
        --agent <> \
        --agent_file <agent file> \
        --n_threads 1 \
        --compress_output \
        --safe_simulations \
        --refresh_frequency 60 \
        --backup_frequency 900 \
        --output <output file>

   A result file is created. This file contains the set of all transitions encountered during each trajectory. The computation times we observed are also stored in this file. It is often impossible to measure precisely the computation time of a single decision. This is why only the computation time of each trajectory is reported in this file.

5. Our results are exported. After each experiment has been performed, a set of result files is obtained. We need to provide all agent files and result files to export the data.

    ./BBRL-export --agent <> \
        --agent_file <agent file #1> \
        --experiment \
        --experiment_file <result file #1> \
        ...
        --agent <> \
        --agent_file <agent file #K> \
        --experiment \
        --experiment_file <result file #K>

   BBRL will sort the data automatically and produce several files for each experiment:

   - A graph comparing offline computation cost w.r.t. performance;
   - A graph comparing online computation cost w.r.t. performance;
   - A graph where the X-axis represents the offline time bound, while the Y-axis represents the online time bound. A point of the space corresponds to a pair of bounds. An algorithm is associated with a point of the space if its best agent, satisfying the constraints, is among the best ones when compared to the others;
   - A table reporting the results of each agent.

   BBRL will also produce a report file in LaTeX gathering the 3 graphs and the table for each experiment.

A large number of commands have to be entered in order to reproduce the results of this paper. We therefore provide several scripts to simplify the process. By completing some configuration files, the user can define the agents, the possible values of their parameters and the experiments to conduct.
Those configuration files are then used by a script called make_scripts.sh, included within the library, whose purpose is to generate four other scripts:

0init.sh: Create the experiment files, and create the formula sets required by OPPS agents.
1ol.sh: Create the agents and train them on the prior distribution(s).
2re.sh: Run all the experiments.
3export.sh: Generate the LaTeX reports.
Due to the high computation power required, we made those scripts compatible with workload managers such as SLURM. In this case, each cluster should provide the same amount of CPU power in order to get consistent time measurements. To sum up, when the configuration files are completed correctly, one can start the whole process by executing the four scripts, and retrieve the results in nice LaTeX reports.
It is worth noting that there is no computation budget given to the agents. This is due to the diversity of the algorithms implemented. No algorithm is “anytime” natively, in the sense that we cannot stop the computation at any time and receive an answer from the agent instantly. Strictly speaking, it is possible to develop an anytime version of some of the algorithms considered in BBRL. However, we made the choice to stay as close as possible to the original algorithms proposed in their respective papers for reasons of fairness. In consequence, although computation time is a central parameter in our problem statement, it is never explicitly given to the agents. We instead let each agent run as long as necessary and analyse the time elapsed afterwards.
Another point which needs to be discussed is the impact of the implementation of an algorithm on the comparison results. For each algorithm, many implementations are possible, some being better than others. Even though we did our best to provide the best possible implementations, BBRL does not compare algorithms but rather the implementations of each algorithm. Note that this issue mainly concerns small problems, since the complexity of the algorithms is preserved.
5 Illustration
This section presents an illustration of the protocol presented in Section 3. We first describe the algorithms considered for the comparison in Section 5.1, followed by a description of the benchmarks in Section 5.2. Section 5.3 shows and analyses the results obtained.
5.1 Compared algorithms
In this section, we present the list of the algorithms considered in this study. The pseudocode of each algorithm can be found in Appendix A. For each algorithm, a list of “reasonable” values is provided to test each of their parameters. When an algorithm has more than one parameter, all possible parameter combinations are tested.
5.1.1 Random
At each time-step $t$, the action $u_t$ is drawn uniformly from $U$.
5.1.2 Greedy
The Greedy agent maintains an approximation of the current MDP and computes, at each time-step, its associated Q-function. The action is either selected randomly (with a probability of $\epsilon$, $0 \le \epsilon \le 1$), or greedily (with a probability of $1 - \epsilon$) with respect to the approximated model.
Tested values:

.
5.1.3 Softmax
The Softmax agent maintains an approximation of the current MDP and computes, at each time-step, its associated Q-function. The action is selected randomly, where the probability of drawing each action is determined by its Q-value and a temperature parameter. The temperature controls the impact of the Q-function on these probabilities (low temperature: greedy selection; high temperature: random selection).
Tested values:

.
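As an illustration of temperature-controlled selection, the sketch below uses the standard Boltzmann form, in which the weight of an action is the exponential of its Q-value divided by the temperature. This form is an assumption and may differ from the exact expression used by the agent described here; all names are hypothetical.

```python
import math
import random


def softmax_probabilities(q_values, temperature):
    """Boltzmann probabilities over actions for a strictly positive temperature."""
    highest = max(q_values)  # subtracted for numerical stability
    weights = [math.exp((q - highest) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_action(q_values, temperature, rng=random):
    """Draw one action index according to the Boltzmann probabilities."""
    probabilities = softmax_probabilities(q_values, temperature)
    return rng.choices(range(len(q_values)), weights=probabilities, k=1)[0]
```

With a very small temperature almost all the probability mass falls on the best action, while a very large temperature makes the choice nearly uniform.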
5.1.4 OPPS
Given a prior distribution and an E/E strategy space (either discrete or continuous), the Offline, Prior-based Policy Search algorithm (OPPS) identifies a strategy which maximises the expected discounted sum of returns over MDPs drawn from the prior.
The OPPS for Discrete Strategy spaces algorithm (OPPS-DS) (Castronovo12; Castronovo14) formalises the strategy selection problem as a multi-armed bandit problem, with one arm per candidate E/E strategy. Pulling an arm amounts to drawing an MDP from the prior, and playing the E/E strategy associated with this arm on it for one single trajectory. The discounted sum of returns observed is the return of this arm. This multi-armed bandit problem is solved by using the UCB1 algorithm (Auer2002; Audibert2007). The time budget is defined by a variable corresponding to the total number of draws performed by UCB1.
The E/E strategies considered by Castronovo et al. are index-based strategies, where the index is generated by evaluating a small formula. A formula is a mathematical expression, combining specific features (Q-functions of different models) by using standard mathematical operators (addition, subtraction, logarithm, etc.). The discrete E/E strategy space is the set of all formulas which can be built by combining at most a fixed number of features/operators.
OPPS-DS does not come with any guarantee. However, the UCB1 bandit algorithm used to identify the best E/E strategy within the set of strategies provides statistical guarantees that the best E/E strategies are identified with high probability after a certain budget of experiments. It is nevertheless not clear that the best strategy of the E/E strategy space considered is a high-performance strategy regardless of the problem.
Tested values:

(Footnote 3: The number of arms is always equal to the number of strategies in the given set.)

.
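The bandit formulation can be sketched with a textbook UCB1 loop. Here `pull(arm)` is a hypothetical callback that would draw an MDP from the prior, play the strategy of that arm for one trajectory, and return the observed return rescaled to [0, 1]; this is not the BBRL implementation.

```python
import math


def ucb1(pull, n_arms, budget):
    """Run UCB1 for `budget` draws and return the arm with the best mean."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(budget):
        if t < n_arms:
            arm = t  # play every arm once before using the index
        else:
            arm = max(
                range(n_arms),
                key=lambda k: sums[k] / counts[k]
                + math.sqrt(2.0 * math.log(t) / counts[k]),
            )
        sums[arm] += pull(arm)
        counts[arm] += 1
    # Arms never played (budget < n_arms) are ranked last.
    return max(
        range(n_arms),
        key=lambda k: sums[k] / counts[k] if counts[k] else float("-inf"),
    )
```

The offline cost of such a procedure grows linearly with the budget of draws, since each draw simulates one full trajectory.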
5.1.5 BAMCP
Bayes-adaptive Monte Carlo Planning (BAMCP) (Guez2012) is an evolution of the Upper Confidence Tree (UCT) algorithm (Kocsis2006), where each transition is sampled according to the history of observed transitions. The principle of this algorithm is to adapt UCT to planning in a Bayes-adaptive MDP, also called the belief-augmented MDP, which is the MDP obtained when considering augmented states made of the concatenation of the actual state and the posterior. The BAMCP algorithm is made computationally tractable by using a sparse sampling strategy, which avoids sampling a model from the posterior distribution at every node of the planning tree. Note that BAMCP also comes with theoretical guarantees of convergence towards Bayesian optimality.
In practice, BAMCP relies on two parameters: (i) a parameter which defines the number of nodes created at each time-step, and (ii) a parameter which defines the depth of the tree from the root.
Tested values:

,

.
5.1.6 BFS3
The Bayesian Forward Search Sparse Sampling (BFS3) (Asmuth11approachingbayesoptimalilty) is a Bayesian RL algorithm which applies the principle of the FSSS (Forward Search Sparse Sampling, see Kearns2002) algorithm to belief-augmented MDPs. It first samples one model from the posterior, which is then used to sample transitions. The algorithm then relies on lower and upper bounds on the value of each augmented state to prune the search space. The authors also show that BFS3 converges towards Bayes-optimality as the number of samples increases.
In practice, the parameters of BFS3 are used to control how much computational power is allowed. One parameter defines the number of nodes to develop at each time-step, $C$ defines the branching factor of the tree, and a last parameter controls its maximal depth.
Tested values:

,

,

.
5.1.7 SBOSS
The Smarter Best of Sampled Set (SBOSS) (Castro10) is a Bayesian RL algorithm which relies on the assumption that the model is sampled from a Dirichlet distribution. From this assumption, it derives uncertainty bounds on the value of state-action pairs. It then uses those bounds to decide how many models to sample from the posterior, and how often the posterior should be updated in order to reduce the computational cost of Bayesian updates. The sampling technique is then used to build a merged MDP, as in Asmuth09, and to derive the corresponding optimal action with respect to that MDP.
In practice, the number of sampled models is determined dynamically by one parameter, and the resampling frequency depends on a second parameter.
Tested values:

,

.
5.1.8 BEB
The Bayesian Exploration Bonus (BEB) (Kolter09nearbayesianexploration) is a Bayesian RL algorithm which builds, at each time-step, the expected MDP given the current posterior. Before solving this MDP, it computes a new reward function, which adds to the original reward an exploration bonus that depends on the number of times each transition has been observed so far. This algorithm solves the mean MDP of the current posterior, in which the original reward function is replaced by this new one, and applies its optimal policy on the current MDP for one step. The bonus is a parameter controlling the E/E balance. BEB comes with theoretical guarantees of convergence towards Bayesian optimality.
Tested values:

.
5.1.9 Computation times variance
Each algorithm has one or more parameters that can affect the number of sampled transitions from a given state, or the length of each simulation. This, in turn, impacts the computation time requirement at each step. Hence, for some algorithms, no choice of parameters can bring the computation time below or over certain values. In other words, each algorithm has its own range of computation time. Note that, for some methods, the computation time is influenced concurrently by several parameters. We present a qualitative description of how computation time varies as a function of parameters in Table 1.
Algorithm | Offline phase duration | Online phase duration
Random | Almost instantaneous. | Almost instantaneous.
Greedy (Footnote 4) | Almost instantaneous. | Varies in inverse proportion to $\epsilon$. Can vary a lot from one step to another.
OPPS-DS | Varies proportionally to the budget of draws. | Varies proportionally to the number of features involved in the selected E/E strategy.
BAMCP | Almost instantaneous. | Varies proportionally to the number of nodes to develop at each step and to the maximal depth of the tree.
BFS3 | Almost instantaneous. | Varies proportionally to the number of nodes to develop at each step, the branching factor of the tree and its maximal depth.
SBOSS (Footnote 7) | Almost instantaneous. | Varies in inverse proportion to its two parameters. Can vary a lot from one step to another, with a general decreasing tendency.
BEB | Almost instantaneous. | Constant.
(Footnote 4: If a random decision is chosen, the model is not solved.)
(Footnote 7: The number of models sampled is inversely proportional to one parameter, while the frequency at which the models are sampled is inversely proportional to the other. When an MDP has been sufficiently explored, the number of models to sample and the frequency of the sampling will decrease.)
5.2 Benchmarks
In our setting, the transition matrix is the only element which differs between two MDPs drawn from the same distribution. For each state-action pair, we define a Dirichlet distribution, which represents the uncertainty about the transitions occurring from that pair. A Dirichlet distribution is parameterised by a set of concentration parameters, one per possible next state. We gather all concentration parameters in a single vector. Consequently, our MDP distributions are parameterised by the reward function and several Dirichlet distributions, themselves parameterised by this vector.
Such a distribution is denoted by . In the Bayesian Reinforcement Learning community, these distributions are referred to as Flat Dirichlet Multinomial distributions (FDMs).
We chose to study two different cases:

- Accurate case: the test distribution is fully known (the prior is identical to the test distribution),

- Inaccurate case: the test distribution is unknown (the prior differs from the test distribution).

In the inaccurate case, we make no assumption on the transition matrix. We represent this lack of knowledge by a uniform FDM distribution, where each transition has been observed one single time (all concentration parameters are equal to 1).
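Drawing a transition function from such a distribution can be sketched with the standard library alone, by normalising Gamma draws to obtain Dirichlet samples. The nested-list layout `theta[x][u]` and all function names are assumptions made for this example, not the BBRL file format.

```python
import random


def sample_dirichlet(alphas, rng=random):
    """One Dirichlet sample, obtained by normalising independent Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_transition_function(n_states, n_actions, theta, rng=random):
    """Return f with f[x][u][y] = P(y | x, u); theta[x][u] holds the
    concentration parameters of the state-action pair (x, u)."""
    return [
        [sample_dirichlet(theta[x][u], rng) for u in range(n_actions)]
        for x in range(n_states)
    ]


def uniform_prior(n_states, n_actions):
    """Concentration parameters all equal to 1 (each transition seen once)."""
    return [[[1.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
```

Calling `sample_transition_function(5, 3, uniform_prior(5, 3))` repeatedly yields a different MDP each time, which is what allows an agent to be evaluated over a distribution of problems rather than a single one.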
5.2.1 Generalised Chain distribution ()
The Generalised Chain (GC) distribution is inspired by the five-state chain problem ( states, actions) (Dearden98). The agent starts at State 1, and has to go through States 2, 3 and 4 in order to reach the last state (State 5), where the best rewards are. At each step, an action can either let the agent move from its current state to the next one or force it to go back to State 1. The transition matrix is drawn from an FDM. More details, including its parameters and the reward function, can be found in Appendix B.1.
5.2.2 Generalised Double-Loop distribution ()
The Generalised Double-Loop (GDL) distribution is inspired by the double-loop problem ( states, actions) (Dearden98). Two loops of states cross at the state where the agent starts. One loop is a trap: if the agent enters it, it has no choice but to cross all the states composing it in order to exit. Exiting this loop provides a small reward. The other loop yields a good reward. However, each action of this loop can either let the agent move to the next state of the loop or force it to return to the starting state with no reward. The transition matrix is drawn from an FDM. More details, including its parameters and the reward function, can be found in Appendix B.2.
5.2.3 Grid distribution ()
The Grid distribution is inspired by Dearden's maze problem (25 states, 4 actions) (Dearden98). The agent is placed at a corner of a 5x5 grid (the S cell), and has to reach the opposite corner (the G cell). When it succeeds, it returns to its initial state and receives a reward. The agent can perform 4 different actions, corresponding to the 4 directions (up, down, left, right). However, depending on the cell the agent is in, each action has a certain probability of failing, which can prevent the agent from moving in the selected direction. The transition matrix is drawn from an FDM. More details, including its parameters and the reward function, can be found in Appendix B.3.
5.3 Discussion of the results
5.3.1 Accurate case



As it can be seen in Figure 7
, OPPS is the only algorithm whose offline time cost varies. In the three different settings, OPPS can be launched after a few seconds, but behaves very poorly. However, its performances increased very quickly when given at least one minute of computation time. Algorithms that do not use offline computation time have a wide range of different scores. This variance represents the different possible configurations for these algorithms, which only lead to different online computation time.
On Figure 7, BAMCP, BFS3 and SBOSS have variable online time costs. BAMCP behaved poorly on the first experiment, but obtained the best score on the second one and was pretty efficient on the last one. BFS3 was good only on the second experiment. SBOSS was never able to get a good score in any cases. Note that OPPS online time cost varies slightly depending on the formula’s complexity.
If we look at the top-right point in Figure 9, which corresponds to the least restrictive bounds, we notice that OPPSDS and BEB were the best algorithms in every experiment. Greedy was a good candidate in the first two experiments. BAMCP was also a very good choice, except in the first experiment. In contrast, BFS3 and SBOSS were good choices only in the first experiment.
On closer inspection, OPPSDS was always one of the best algorithms as soon as its minimal offline computation time requirement was met.
Moreover, when we place the offline-time bound just below the minimal offline time cost of OPPSDS, we can see how the top is affected as the online-time bound increases from left to right:

GC: (Random), (SBOSS), (BEB, Greedy), (BEB, BFS3, Greedy).
GDL: (Random), (Random, SBOSS), (Greedy), (BEB, Greedy), (BAMCP, BEB, Greedy).
Grid: (Random), (SBOSS), (Greedy), (BEB, Greedy).
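The way such a sequence of tops is read off can be sketched as follows. The function keeps the agents that fit within both time bounds and returns those whose score is within a tolerance of the best eligible score. The tolerance rule is an assumption made for this illustration and may differ from the criterion used in the experiments; the agent names and all numbers are placeholders, not results from this paper.

```python
def top_agents(results, offline_bound, online_bound, tolerance=0.0):
    """Return the names of the best agents that respect both time bounds.

    results: list of (name, offline_time, online_time, score) tuples.
    An agent is in the top if its score is within `tolerance` of the best
    eligible score (assumed rule, for illustration only).
    """
    eligible = [r for r in results
                if r[1] <= offline_bound and r[2] <= online_bound]
    if not eligible:
        return []
    best = max(r[3] for r in eligible)
    return sorted({r[0] for r in eligible if r[3] >= best - tolerance})

# Placeholder measurements: (name, offline time, online time, score).
results = [
    ("agent_a", 0.0, 0.001, 10.0),
    ("agent_b", 0.0, 0.01, 20.0),
    ("agent_c", 0.0, 1.0, 30.0),
    ("agent_d", 0.0, 1.0, 29.5),
    ("agent_e", 60.0, 0.01, 31.0),  # needs a large offline budget
]

# Sweep the online-time bound from tight to loose with a small offline budget.
for online_bound in (0.005, 0.05, 2.0):
    print(online_bound, top_agents(results, 1.0, online_bound, tolerance=1.0))
```

With a small offline bound, `agent_e` is never eligible, which mirrors how the lists above exclude OPPSDS by construction.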
We can clearly see that SBOSS was the first algorithm to appear at the top, with a very small online computation cost, followed by Greedy and BEB. Beyond a certain online time bound, BFS3 emerged in the first experiment, while BAMCP emerged in the second. Neither of them was able to compete with BEB or Greedy in the last experiment.
Softmax was never able to reach the top, regardless of the configuration.
Figure 9 reports the best score observed for each algorithm, independently of any time measure. Note that the variance is very similar for all algorithms in the GDL and Grid experiments. On the contrary, the variance oscillates between and . However, OPPS seems to be the least stable algorithm in the three cases.
5.3.2 Inaccurate case



GC Experiment
GDL Experiment
Agent    Score
Random
eGreedy ( = 0.3)
SoftMax ( = 0.05)
OPPSDS ()
BAMCP (= 10000, = 50)
BFS3 ( = 1250, C = 15, = 50)
SBOSS ( = 0.1, = 0.001)
BEB ( = 2.5)

Grid Experiment
Agent    Score
Random
eGreedy ( = 0.2)
SoftMax ( = 0.05)
OPPSDS ()
BAMCP ( = 25000, = 25)
BFS3 ( = 1, = 15, = 50)
SBOSS ( = 0.001, = 0.1)
BEB ( = 0.25)
As in the accurate case, Figure 11 shows impressive performance for OPPSDS, which beat all other algorithms in every experiment. We can also notice that, as observed in the accurate case, the scores of the OPPSDS agents are very close in the Grid experiment. However, only a few of them were able to significantly surpass the others, contrary to the accurate case, where most OPPSDS agents were very good candidates.
Surprisingly, SBOSS was a very good alternative to BAMCP and BFS3 in the first two experiments, as shown in Figure 11. It surpassed both algorithms in the first one while being very close to the performance of BAMCP in the second. The relative performance of BAMCP and BFS3 remained the same in the inaccurate case, even if the advantage of BAMCP is less visible in the second experiment. BEB was no longer able to compete with OPPSDS and was even beaten by BAMCP and BFS3 in the last experiment. Greedy was still a decent choice, except in the first experiment. As observed in the accurate case, Softmax performed very poorly in every case.
In Figure 13, if we look at the top-right point, we can see that OPPSDS is the best choice in the second and third experiments. BEB, SBOSS and Greedy share first place with OPPSDS in the first one.
If we place the offline-time bound just below the minimal offline time cost of OPPSDS, we can see how the top is affected as the online-time bound increases from left to right:

GC: (Random), (Random, SBOSS), (SBOSS), (BEB, SBOSS, Greedy), (BEB, BFS3, SBOSS, Greedy).
GDL: (Random), (Random, SBOSS), (BAMCP, Random, SBOSS), (BEB, SBOSS, Greedy), (BEB, BFS3, SBOSS, Greedy), (BAMCP, BEB, BFS3, SBOSS, Greedy).
Grid: (Random), (Random, SBOSS), (BAMCP, BEB, BFS3, Random, SBOSS), (Greedy).
SBOSS is again the first algorithm to appear in the rankings. Greedy is the only one that reached the top in every case, even when facing BAMCP and BFS3 given a high online computation budget. BEB no longer appears to be undeniably better than the others. Besides, the first two experiments show that most algorithms obtained similar results, except for BAMCP, which does not appear at the top in the first experiment. In the last experiment, Greedy succeeded in beating all other algorithms.
Figure 13 does not provide any information beyond what we observed in the accurate case.
5.3.3 Summary
In the accurate case, OPPSDS was always among the best algorithms, at the cost of some offline computation time. When the offline time budget was too constrained for OPPSDS, different algorithms were suitable depending on the online time budget:

Low online time budget: SBOSS was the fastest algorithm to make better decisions than a random policy.

Medium online time budget (100 times more than the low online time budget): BEB reached performances similar to OPPSDS on each experiment.

High online time budget (100 times more than the medium online time budget): In the first experiment, BFS3 managed to catch up with BEB and OPPSDS when given sufficient time. In the second experiment, it was BAMCP that achieved this result. Neither BFS3 nor BAMCP was able to compete with BEB and OPPSDS in the last experiment.
The inaccurate case changed the picture. BEB was not as good as it seemed to be in the accurate case, while SBOSS improved significantly compared to the others. For its part, OPPSDS obtained the best overall results in the inaccurate case by outperforming all the other algorithms in two out of three experiments while remaining among the best ones in the last experiment.
6 Conclusion
We have proposed a new extensive BRL comparison methodology which takes into account both the performance and the time requirements of each algorithm. In particular, our benchmarking protocol shows that no single algorithm dominates all the others in all scenarios. The protocol we introduced can compare anytime algorithms to non-anytime algorithms while measuring the impact of inaccurate offline training. By comparing algorithms on large sets of problems, we avoid overfitting to a single problem. Our methodology is associated with an open-source library, BBRL, and we hope that it will help other researchers to design algorithms whose performance is put into perspective with computation times, which may be critical in many applications. This library is specifically designed to handle new algorithms easily, and is provided with a complete and comprehensive documentation website.
Michaël Castronovo acknowledges the financial support of the FRIA. Raphael Fonteneau is a postdoctoral fellow of the F.R.S.-FNRS (Belgian Funds for Scientific Research).