1 Introduction
1.1 Motivation
Imagine two treatments are being tested in a medical trial. The treatments are cheap but having doctors evaluate whether they worked costs £5,000 for each additional patient. Treatments are assigned using a Bandit design (Kuleshov and Precup, 2014) and after 200 trials the difference in mean evaluation between the two treatments is tiny. Should the trial continue?
At some point an additional trial is not worth another £5,000. This cost of evaluating outcomes is not incorporated into standard Bandits. When playing Bandits, deciding whether to explore depends only on the estimated differences in expected (discounted) return between arms. The same is true for Reinforcement Learning in MDPs: the cost of providing a reward for a state-action pair is not a parameter of the learning problem.
This makes sense when the reward function is created all at once and offline, as when it is hand-engineered. But if the rewards are created incrementally online, as in the medical trial, then an important feature of the decision problem has been left out.
Online construction of rewards is common in real-world Bandit problems: customers subjected to A/B testing may be paid to give feedback on new products (Scott, 2015). Recent research, spurred by the difficulty of hand-engineering rewards, has formalised more general approaches to online reward construction. In Reward Learning, a reward function is learned online from human evaluations of the agent’s behaviour (Warnell et al., 2017; Christiano et al., 2017; Saunders et al., 2017). In Inverse Reinforcement Learning (IRL) and Imitation Learning (Abbeel and Ng, 2004; Ho and Ermon, 2016; Evans et al., 2016), humans provide demonstrations that are used to infer the reward function or optimal policy. These demonstrations can be provided offline or online but the reward function is always specified incrementally, as a set of human actions or trajectories.
In Reward Learning and IRL, the human labour required to construct rewards is a significant cost. How can this cost be reduced? Intuitively, if the RL agent can predict an action’s reward then a human need not provide it. In Active Reinforcement Learning (ARL), this choice of whether to pay for reward construction is given to the RL agent (Krueger et al., 2016). It is the analogue of Active Learning, where an algorithm decides online whether to have the next data point labelled in a classification or regression task (Settles, 2012).
1.2 ARL Definition and Illustration
To fix intuition, we define the ARL problem here and elaborate on this definition in later sections. An active reinforcement learning (ARL) problem is a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, H, c)$. The components $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, H)$ define a regular Markov Decision Process (MDP), where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{T}$ is the transition function, $\mathcal{R}$ is the reward function on state-action pairs, and $H$ is the time horizon. The component $c$ is a scalar constant, the “query cost”, which specifies the cost of observing rewards. All components except $\mathcal{R}$ and $\mathcal{T}$ are initially known to the agent.
ARL proceeds as follows. At time step $t$, the agent takes an action pair $(a_t, q_t)$, where $a_t \in \mathcal{A}$ and $q_t \in \{0, 1\}$, which determines a reward $r_t$ and next state $s_{t+1}$. If $q_t = 1$, the agent pays $c$ to observe the reward $r_t$, and so receives a total reward of $r_t - c$. If $q_t = 0$, the agent does not observe the reward; so if the agent did something bad it will not be knowingly punished. The agent’s total return after $H$ timesteps is defined as:
$$\sum_{t=1}^{H} \left( r_t - c \, q_t \right)$$
We emphasise that actions for which the agent did not observe the reward still count towards the return.
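As a minimal sketch of this definition, the return can be computed from a list of (reward, query indicator) pairs. The function name `arl_return` and the example trajectory are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence, Tuple


def arl_return(steps: Sequence[Tuple[float, int]], query_cost: float) -> float:
    """Total ARL return: every reward counts, and each query subtracts the cost."""
    return sum(reward - query_cost * queried for reward, queried in steps)


# Each step is (reward r_t, query indicator q_t); only the first step is queried.
trajectory = [(1.0, 1), (0.0, 0), (1.0, 0), (-2.0, 0)]
total = arl_return(trajectory, query_cost=0.5)
```

The last step illustrates the point made above: the unobserved penalty of -2.0 still lowers the return even though the agent never sees it.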
An ARL problem depends crucially on how the query cost $c$ compares to the agent’s expected total returns. When $c$ is large relative to the expected returns, the agent should never query and should rely on prior knowledge about the reward function $\mathcal{R}$. When $c$ is very small, the agent can use a regular RL algorithm and always query. In between these two extremes, the agent must carefully select a subset of actions to query and so RL algorithms are not readily applicable to ARL. Figure 1 shows two MDPs (Early Fork and Late Fork) that illustrate the challenge of deciding which actions to query. RL algorithms perform suboptimally on these MDPs unless effort is made to adapt them to ARL.
This paper presents the following contributions:

We show that ARL for tabular MDPs can be reduced to planning in a Bayes-Adaptive MDP.

We adapt the MCTS-based algorithm BAMCP (Guez et al., 2012) to provide an asymptotically optimal model-based algorithm for Bayesian ARL.

BAMCP fails in practice on small MDPs. We introduce BAMCP++, which uses smarter model-free rollouts and substantially outperforms BAMCP.

We benchmark BAMCP++ against model-free algorithms with ARL-specific exploration heuristics. BAMCP++ outperforms model-free methods on random MDPs.
1.3 Related Work
How does ARL (as defined above) relate to regular RL? In regular RL there is no cost for deciding to observe a reward. Yet regular RL does involve “active learning” in the more general sense: the agent decides which actions to explore instead of passively receiving them. So techniques for exploration in regular RL might carry over to ARL.
Unfortunately, most practical algorithms for regular RL use heuristics for exploration such as ε-greedy, optimism (Auer et al., 2002; Kolter and Ng, 2009), and Thompson sampling (Osband et al., 2013). While these heuristics achieve near-optimal exploration for certain classes of RL problem (Bubeck et al., 2012; Azar et al., 2017), they are not directly applicable to ARL, as explained in Section 3. There are RL algorithms that try to explore in ways closer to the decision-theoretic optimum. Various algorithms use an approximation to the Bayesian value of information (Srinivas et al., 2009; Dearden et al., 1998) and so relate to our Section 3. An alternative non-Bayesian approach is to have the agent learn about the transitions to which the optimal policy is most sensitive (Epshteyn et al., 2008).
There is a substantial literature on active learning of rewards provided online by humans (Wirth et al., 2017; Dragan, 2017). Daniel et al. (2014) learn a reward function on trajectories (not actions) from human feedback and use Bayesian optimization techniques to select which trajectories to have labelled. D. Sadigh et al. (2017) learn a reward function on state-action pairs and their agent optimizes actions to be informative about this function. These reward-learning techniques are aimed at continuous-state environments and do not straightforwardly transfer to our tabular ARL setting. Our work also differs from D. Sadigh et al. in that we optimize for informativeness about the optimal policy and not the true reward function. As Figure 1 illustrates, if some states are unavoidable then their reward is irrelevant to the optimal policy.
2 Background
This section reviews Bayesian RL and the BAMCP algorithm. Later we cast ARL as a special kind of Bayesian RL problem and apply BAMCP to ARL.
2.1 Bayesian RL
An MDP is specified by $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, H)$, with components defined in Section 1.2. While our algorithms apply more generally, this paper focuses on finite, episodic MDPs (Osband et al., 2013), where $H$ is the episode length. A Bayesian RL problem (Ghavamzadeh et al., 2015; Guez, 2015) is specified by an MDP and an agent’s prior distribution $P(\theta)$ on the transition function parameters $\theta$. The agent’s posterior at timestep $t$ is then given by $P(\theta \mid h_t) \propto P(h_t \mid \theta) \, P(\theta)$, where $P(h_t \mid \theta)$ is the likelihood of history $h_t$ given the transition function with parameters $\theta$.
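The Bayesian RL problem can be transformed into an MDP planning problem by augmenting the state space with the agent’s belief and the transition function with the agent’s belief update. The resulting MDP is defined by $(\mathcal{S}^+, \mathcal{A}, \mathcal{T}^+, \mathcal{R}, H)$ and is called a Bayes-Adaptive MDP (BAMDP), where:

$\mathcal{S}^+$ is the set of hyperstates $(s, h)$, each pairing a state $s \in \mathcal{S}$ with a history $h$ that determines the agent’s belief;

$\mathcal{T}^+$ is the combined transition function between states and beliefs: $\mathcal{T}^+\big((s, h), a, (s', h a s')\big) = \int \mathcal{T}_\theta(s, a, s') \, P(\theta \mid h) \, d\theta$; and

The initial hyperstate is determined by the initial distribution over $\mathcal{S}$ and the prior on the transition function.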
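The hyperstate transition can be sketched for the tabular case. This sketch assumes a symmetric Dirichlet prior over next states, so that the belief is summarised by transition counts; the helper names `predictive` and `step_hyperstate` are hypothetical and not from the paper.

```python
def predictive(counts, s, a, s_next, n_states, alpha=1.0):
    """Posterior-predictive P(s' | s, a) under a symmetric Dirichlet(alpha) prior."""
    total = sum(counts.get((s, a, x), 0) for x in range(n_states))
    return (counts.get((s, a, s_next), 0) + alpha) / (total + alpha * n_states)


def step_hyperstate(hyper, a, s_next):
    """Move to s_next and fold the observed transition into the belief counts."""
    s, counts = hyper
    new_counts = dict(counts)  # copy so the previous hyperstate is left intact
    new_counts[(s, a, s_next)] = new_counts.get((s, a, s_next), 0) + 1
    return (s_next, new_counts)


# A hyperstate is (current state, transition counts); start in state 0 with no data.
hyper0 = (0, {})
p_before = predictive(hyper0[1], s=0, a=0, s_next=1, n_states=2)
hyper1 = step_hyperstate(hyper0, a=0, s_next=1)
p_after = predictive(hyper1[1], s=0, a=0, s_next=1, n_states=2)
```

Observing the transition raises the predictive probability of repeating it, which is the sense in which planning over hyperstates accounts for future learning.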
2.2 BAMCP: MCTS for Bayesian RL
BAMCP is a Monte Carlo Tree Search (MCTS) algorithm for Bayesian RL (Guez et al., 2012). It converges in probability to the optimal Bayesian policy (i.e. the optimal policy for the corresponding BAMDP) in the limit of infinitely many MC simulations. In experiments, it has achieved near state-of-the-art performance in a range of environments (Castronovo et al., 2016; Guez, 2015).
At any given timestep $t$, BAMCP attempts to compute the Bayes-optimal action for the current state $s_t$ under the agent’s posterior $P(\theta \mid h_t)$. As is common for work on Bayesian RL, this posterior is only over the transition function and not the reward function. [Footnote 1: For our experiments in ARL the agent will always be uncertain about the reward function.] BAMCP is an online algorithm. At each timestep, it updates the posterior on an observation from the real MDP and then uses MCTS to simulate possible futures using models sampled from this posterior. The MCTS builds a search tree mapping histories to value-function estimates (see Fig 2). A node corresponds to a posterior belief (summarised by the history $h$) and the current state-action pair $(s, a)$, and for each node the algorithm maintains a value estimate $Q(s, h, a)$ and visit count $N(s, h, a)$. BAMCP’s behaviour can be specified in four steps:

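Node selection: At any node BAMCP chooses to expand the subtree for the action chosen by a UCB policy. In particular, when at node $(s, h)$, the algorithm expands the action given by:
$$\arg\max_{a} \; Q(s, h, a) + c_{\mathrm{ucb}} \sqrt{\frac{\log N(s, h)}{N(s, h, a)}}$$
where $Q$ and $N$ are the value estimate and visit count of the node, $N(s, h) = \sum_a N(s, h, a)$, and $c_{\mathrm{ucb}}$ is an exploration constant.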

Expansion: This node selection continues until it reaches the final timestep of the episode or a leaf node. At leaf nodes exactly one child node is added per simulation.

Rollouts: If additional steps outside the tree need to be simulated, a rollout policy, trained by running Q-learning on observations from the real MDP, selects actions. No new nodes are added during the rollout phase.

Backup: After the rollout, value estimates of tree nodes along the trajectory are updated with the sampled returns. A simple average over all trajectories is computed.
BAMCP also uses root sampling and lazy sampling to improve efficiency (Guez et al., 2012).
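The node-selection step can be sketched with a standard UCB1-style rule. The function name `ucb_action`, the dictionary layout and the choice to try unvisited actions first are assumptions made for illustration; they are not taken from BAMCP's reference implementation.

```python
import math


def ucb_action(q_values, visit_counts, exploration_const):
    """Pick the action maximising Q(a) + const * sqrt(log N / N(a))."""
    total = sum(visit_counts.values())
    best_action, best_score = None, -math.inf
    for action, q in q_values.items():
        n = visit_counts[action]
        if n == 0:
            return action  # an untried action is expanded before any scoring
        score = q + exploration_const * math.sqrt(math.log(total) / n)
        if score > best_score:
            best_action, best_score = action, score
    return best_action


# Two actions at one node: a well-explored one and a rarely tried query action.
q = {"no_query": 0.5, "query": 0.4}
counts = {"no_query": 100, "query": 5}
chosen = ucb_action(q, counts, exploration_const=1.0)
```

With a positive exploration constant the rarely visited action receives a larger bonus; with the constant set to zero the rule is purely greedy on the current value estimates.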
3 Algorithms for ARL
3.1 Reducing ARL Problems to BAMDPs
We consider Active RL (defined in Section 1.2) in the Bayesian setting, where the agent has a prior distribution over the reward and transition functions. This is similar to a Bayesian RL problem. Actions in ARL reduce to regular RL actions by crossing each regular action with an indicator variable. But unlike in RL, an ARL agent does not always observe a scalar reward. To accommodate this, we introduce a null reward. If the agent takes an action without querying, it observes the null reward in place of the scalar reward. The definition of the agent’s belief update is modified to not update on the null reward. With this minor emendation, Bayesian ARL can be reduced to an MDP in an augmented state space exactly as in Section 2.
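A minimal sketch of a belief update that skips the null reward, assuming Bernoulli rewards with a Beta prior stored as pseudo-counts. The Beta prior, the use of `None` for the null reward and the function name are illustrative assumptions.

```python
from typing import Optional, Tuple


def update_reward_belief(belief: Tuple[int, int], observed: Optional[int]) -> Tuple[int, int]:
    """Update Beta(alpha, beta) pseudo-counts; None stands for the null reward."""
    alpha, beta = belief
    if observed is None:
        return belief  # unqueried action: nothing observed, belief unchanged
    return (alpha + observed, beta + 1 - observed)


prior = (1, 1)
after_null = update_reward_belief(prior, None)    # action taken without querying
after_success = update_reward_belief(prior, 1)    # queried, reward 1 observed
after_failure = update_reward_belief(prior, 0)    # queried, reward 0 observed
```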
3.2 RL Algorithms Fail at ARL
Can we apply Bayesian RL algorithms to Bayesian ARL? Many such algorithms can be straightforwardly adapted to deal with the null reward and produce well-typed output for ARL. Yet naive adaptations often fail pathologically. For instance, they might never choose to query and hence learn nothing. Here are some principles used in RL algorithms that lead to pathologies in ARL.
Optimism in the face of uncertainty
Optimism means adding bonuses to more uncertain rewards and taking optimal actions in the resulting optimistic MDP (Kolter and Ng, 2009; Araya et al., 2012; Auer et al., 2002). An optimal agent in a known MDP never queries. Optimism treats the optimistic MDP as (temporarily) known and hence optimism applied to ARL never queries.
Thompson Sampling (PSRL)
Thompson Sampling samples from the posterior on MDPs and plans in the sampled MDP (Osband et al., 2013; Strens, 2000). This fails for the same reason as optimism.
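The shared failure of optimism and Thompson Sampling can be illustrated on a one-step Bandit. In this sketch (all names are hypothetical) the planner treats a sampled or optimistic set of mean rewards as known, so a query only ever subtracts the cost.

```python
def plan_in_known_bandit(mean_rewards, query_cost):
    """Return the best (arm, query) pair when mean rewards are treated as known."""
    candidates = [(arm, q) for arm in range(len(mean_rewards)) for q in (0, 1)]
    # Querying reveals nothing new in a known model, so it only costs query_cost.
    return max(candidates, key=lambda aq: mean_rewards[aq[0]] - query_cost * aq[1])


# Pretend these means were drawn from the posterior (Thompson Sampling)
# or inflated by an uncertainty bonus (optimism).
sampled_means = [0.3, 0.7]
arm, query = plan_in_known_bandit(sampled_means, query_cost=0.1)
```

For any positive query cost the selected pair has `query == 0`, so the agent never gathers the observations needed to change its posterior.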
Model-free TD-learning with random exploration
TD-learning is described in Sutton and Barto (1998). The value of querying an action is always lower than the value of not querying the same action. So for every action, a TD-learner learns to avoid querying the action and so fails when some actions must be queried many times.
3.3 Applying BAMCP to ARL
BAMCP is simple to adapt to Bayesian ARL and does not lead to obvious pathologies like the principles above. In fact, it converges in the limit to the optimal Bayesian policy for the Bayes-Adaptive MDP derived from the ARL problem. Adapting BAMCP to ARL requires a few modifications of Guez (2015), which are depicted in Figure 2. First, we explicitly model uncertainty over both the reward function $\mathcal{R}$ and the transition function $\mathcal{T}$. Second, the rollout policy only considers non-querying actions (as querying is pointless for rollouts that do not learn). Third, querying is incorporated into Monte-Carlo simulations. When simulating a trajectory, each action can be queried or not queried, as represented by the indicator $q$. If the action is not queried, the search tree does not branch on the reward (since there is no reward observation) but the reward backup is still performed. If the action is queried, its reward is observed and reduced by the query cost $c$.
3.4 Algorithm for ARL: BAMCP++
As we show in Section 4, BAMCP performs poorly on ARL. We introduce BAMCP++ (Algorithm 1), which builds on BAMCP and leads to improved estimates of the value of querying actions. The first new feature of BAMCP++ is Delayed Tree Expansion. UCB tree expansion often avoids query actions, because it is hard to recognise their value when estimating via noisy rollouts. To address this, we accumulate the results of multiple rollouts from a leaf node before letting UCB expand the actions from that node. This reduces the variance of value estimates, helping to prevent query actions from being prematurely dismissed. The second new feature of BAMCP++ addresses a problem with the rollouts themselves.
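Delayed Tree Expansion, as described above, might be sketched as a leaf node that averages several rollout returns before it becomes eligible for expansion. The class, the attribute names and the threshold parameter `expand_after` are hypothetical; Algorithm 1 is not reproduced here, so this is only one plausible reading of the mechanism.

```python
class LeafNode:
    """Leaf that collects rollout returns and unlocks expansion after a threshold."""

    def __init__(self, expand_after: int):
        self.expand_after = expand_after  # hypothetical delayed-expansion parameter
        self.returns = []
        self.expanded = False

    def record_rollout(self, rollout_return: float) -> None:
        self.returns.append(rollout_return)
        if len(self.returns) >= self.expand_after:
            self.expanded = True  # UCB may now expand actions below this node

    @property
    def value(self) -> float:
        """Average of the accumulated rollout returns (0.0 before any rollout)."""
        return sum(self.returns) / len(self.returns) if self.returns else 0.0


node = LeafNode(expand_after=3)
node.record_rollout(1.0)
node.record_rollout(0.0)
expanded_after_two = node.expanded
node.record_rollout(2.0)
```

Averaging over several rollouts before expansion lowers the variance of the leaf's estimate, which is the stated motivation for the delay.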
3.4.1 Episodic Rollouts
BAMCP’s rollout policy is responsible for value estimation in parts of the state space not yet covered by the MCTS search tree. Returns from a rollout are used to initialise leaf nodes and are also propagated back up the tree.
BAMCP’s rollout policy consists of a Q-learner trained on observations from the real MDP. This can result in a vicious circle when applied to ARL: (i) the Q-learner can only learn from the real MDP if the agent chooses to query; (ii) the agent only chooses to query if simulated queries lead to higher reward; (iii) simulated queries only lead to higher reward if the information gained is exploited and random rollouts do not exploit it. Our experiments suggest this happens in practice: BAMCP queries far too little. Related to the vicious circle, BAMCP’s rollouts do not share information across related hyperstates. After getting a big reward ten out of ten times from one simulated Bandit arm, the rollout is just as likely to choose the other arm.
In Episodic Rollouts, the rollout policy is still a Q-learner. But instead of just training on the real MDP, we also train on the observations from the current MC simulation. Let $Q_t$ denote a Q-learner trained on the real MDP up to timestep $t$. For each MC simulation, the rollout is performed by a distinct Q-learner $\tilde{Q}$ that is initialised to $Q_t$ but then trained by Q-learning on observations in the simulated MDP $\tilde{M}$ [Footnote 2: This is the root-sampled MDP.] (see QLEARNUPDATE applied to $\tilde{Q}$ in Algorithm 1). This simulation consists of repeated episodes of $\tilde{M}$ and so $\tilde{Q}$ gradually learns a better policy for $\tilde{M}$, sharing information across hyperstates and exploiting querying actions. The rollout’s actions are sampled from a Boltzmann distribution. [Footnote 3: Since the search tree eventually covers the entire state space (due to UCB), we can freely modify the rollout policy without removing the asymptotic guarantees of MCTS.]
Episodic Rollouts use a model-free agent that learns during simulation, at the cost of a slower rollout. Having a fast model-free agent to guide model-based simulations is also central to AlphaZero (Silver et al., 2016, 2017; Anthony et al., 2017), where the model-free network is trained to predict the result of MCTS simulations.
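The idea of Episodic Rollouts can be sketched on a one-state, two-arm Bandit: a per-simulation copy of the real Q-table is trained on simulated observations while actions are drawn from a Boltzmann distribution. The learning rate, temperature, arm means and function names are illustrative assumptions, not values from the paper.

```python
import copy
import math
import random


def boltzmann_action(q_row, temperature, rng):
    """Sample an action index with probability proportional to exp(Q / temperature)."""
    top = max(q_row)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in q_row]
    return rng.choices(range(len(q_row)), weights=weights, k=1)[0]


def q_learn_update(q, s, a, r, s_next, lr, gamma):
    """One tabular Q-learning step on the transition (s, a, r, s_next)."""
    target = r + gamma * max(q[s_next])
    q[s][a] += lr * (target - q[s][a])


real_q = [[0.0, 0.0]]            # Q-learner trained on the real MDP so far
sim_q = copy.deepcopy(real_q)    # distinct learner for this one simulation
rng = random.Random(0)
sampled_arm_means = [0.2, 0.9]   # stands in for the root-sampled model

for _ in range(200):
    a = boltzmann_action(sim_q[0], temperature=0.1, rng=rng)
    r = 1.0 if rng.random() < sampled_arm_means[a] else 0.0
    q_learn_update(sim_q, 0, a, r, 0, lr=0.1, gamma=0.0)
```

Because `sim_q` is a deep copy, learning inside the simulation never alters `real_q`; only the simulated learner adapts to the sampled model.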
3.5 Model-free Agents for ARL
As noted above, model-free agents such as ε-greedy Q-learners can fail pathologically at ARL. We want to investigate whether Q-learners augmented with querying heuristics can perform well on ARL.
The First-N Heuristic queries each state-action pair on the first $N$ visits. The hyperparameter $N$ can be tuned empirically or set using prior knowledge of the transition function and the variance of reward distributions.
The Mind-Changing Cost Heuristic (MCCH) of Krueger et al. (2016) is based on bounding the value of querying and is closely related to the Value of Information heuristic (Dearden et al., 1998). After enough timesteps, an optimal Bayesian ARL agent may stop querying because the value of information (which decreases over time) does not exceed the query cost (which is constant). Likewise, MCCH computes an approximate upper bound on the value of querying and avoids querying if the query cost exceeds the bound. The bound is based on the number of episodes remaining, the value of the best possible policy (consistent with existing evidence), the value of the currently known best policy, and finally the number of queries required for the agent to learn that it should switch to the best possible policy. The difference between these two values can be upper-bounded by the total reward possible in an episode (given the maximum reward). Since the number of required queries is difficult to approximate without prior knowledge, we replace it with a hyperparameter that needs to be tuned. If the agent follows MCCH for MDPs, it queries whenever the query cost is below this bound.
The First-N Heuristic and MCCH can be combined with any model-free learner. In our experiments, we use an ε-greedy Q-learner. For First-N, if a state-action pair has been queried $N$ times, it cannot be chosen for exploratory actions. For MCCH, the agent follows ε-greedy up until it stops querying, at which point it just exploits using its fixed Q-values.
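Both heuristics reduce to a simple query rule. The sketch below implements the query rule of First-N as stated above (it omits the restriction on exploratory actions); the MCCH function is only one plausible reading of the verbal description, since the exact inequality is not reproduced in this text. All names, and the specific form of the MCCH test, are assumptions.

```python
from collections import defaultdict


class FirstNQuery:
    """Query each state-action pair on its first n visits, then never again."""

    def __init__(self, n: int):
        self.n = n
        self.visits = defaultdict(int)

    def should_query(self, state, action) -> bool:
        self.visits[(state, action)] += 1
        return self.visits[(state, action)] <= self.n


def mcch_should_query(query_cost, episodes_remaining, max_episode_reward, queries_to_switch):
    """ASSUMED form: query while the cost of changing one's mind is below the possible gain.

    queries_to_switch plays the role of the tuned hyperparameter, and
    max_episode_reward upper-bounds the per-episode gain from switching policy.
    """
    return query_cost * queries_to_switch < episodes_remaining * max_episode_reward


first_n = FirstNQuery(n=2)
decisions = [first_n.should_query(0, 1) for _ in range(4)]
early = mcch_should_query(1.0, episodes_remaining=10, max_episode_reward=1.0, queries_to_switch=5)
late = mcch_should_query(1.0, episodes_remaining=3, max_episode_reward=1.0, queries_to_switch=5)
```

Under this reading MCCH stops querying once too few episodes remain to recoup the cost of the queries, matching the qualitative behaviour described above.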
4 Experiments
We test BAMCP and BAMCP++ on Bandits and then investigate the scalability of BAMCP++ on a range of larger MDPs.
4.1 BAMCP vs. BAMCP++ in Bandits
In the ARL version of multi-armed Bandits, the agent decides both which arm to pull and whether to pay a cost to query that arm. Optimal behaviour in ARL Bandits has a simple form: the agent queries every action up to some point and thereafter never queries (Krueger et al., 2016). We test BAMCP against BAMCP++ on a two-arm Bernoulli Bandit, with parameters for the two arms and a query cost . The total number of trials (which is known) varies up to 40. Both algorithms have priors over arm parameters and use 200,000 Monte-Carlo simulations. Grid search was used to set the UCB hyperparameter and BAMCP++’s delayed tree-expansion parameter.
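A minimal sketch of such an ARL Bernoulli Bandit environment follows. The class name, the arm probabilities and the query cost used here are placeholders chosen for illustration and are not the experimental settings.

```python
import random


class ARLBernoulliBandit:
    """Bernoulli bandit where the reward is only revealed when the agent pays to query."""

    def __init__(self, arm_probs, query_cost, seed=0):
        self.arm_probs = arm_probs
        self.query_cost = query_cost
        self.rng = random.Random(seed)
        self.total_return = 0.0  # tracked by the environment, hidden from the agent

    def step(self, arm, query):
        reward = 1.0 if self.rng.random() < self.arm_probs[arm] else 0.0
        self.total_return += reward - (self.query_cost if query else 0.0)
        return reward if query else None  # None plays the role of the null reward


# Deterministic placeholder arms so the example is easy to follow.
env = ARLBernoulliBandit(arm_probs=[0.0, 1.0], query_cost=0.25)
obs_queried = env.step(arm=1, query=True)     # pays the cost and sees the reward
obs_unqueried = env.step(arm=1, query=False)  # reward still counts but is unseen
```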
4.1.1 BAMCP++ is near optimal
Figure 3 shows returns averaged over 100 repeats of the same ARL Bandit (for horizons up to 40). We compare BAMCP and BAMCP++ to the optimal policy (which always pulls the best arm and never queries) and to the optimal policy minus the cost of up to three queries (for a fairer comparison). The optimal policy is distinct from the Bayes optimal policy, which is the ideal comparison but is hard to compute (Krueger et al., 2016). BAMCP++ is mostly close to the optimal policy minus three queries, whereas BAMCP is closer to the random policy.
While BAMCP++ is near-optimal for horizon , it is suboptimal for smaller horizons. What explains this? For sufficiently small , the Bayes optimal agent does not query and performs randomly. However, for the Bayes optimal agent would query and so BAMCP++ falters. The difficulty is that querying is only optimal if the agent performs flawlessly after the query. Hence many MCTS samples are needed to recognise that querying is Bayes optimal (as most trajectories that start with querying are bad). This is illustrated in Figure 5, which shows the estimated BAMDP Q-values for query and non-query actions in the first timestep for . Even after 100,000 simulations, non-querying is (incorrectly) estimated to be superior.
BAMCP is outperformed by BAMCP++. Figure 4 shows the probability of queries for and . For these horizons, the Bayes optimal agent queries at the first few timesteps with probability one. Yet BAMCP almost never queries () or queries with low probability (). BAMCP (unlike the random agent) exploits information gained from queries but because it fails to recognise the value of queries it never gains much information.
4.1.2 BAMCP’s problems in regular RL
Is the failure of BAMCP in Bandits due to a special feature of ARL, or does BAMCP fail at related problems in regular RL? We tested BAMCP on the Double-Loop (Fig 6), an RL environment that poses a similar challenge to ARL Bandits. For this environment the agent knows the rewards and has a Dirichlet prior on the transition probabilities. While BAMCP achieved excellent performance on Double-Loops with (Guez, 2015; Castronovo et al., 2016), we test it up to . We set the UCB parameter and the number of MC simulations to 10,000 (following Guez). Figure 6 shows that BAMCP’s performance drops rapidly after and ends up no better than a simple model-free Q-learner. How is this poor performance related to ARL? Suppose the agent believes (after trying both loops) that reaching state is likely worse than the right loop. The reason to explore is that if it is better it can be exploited many times. But unless MCTS simulates that systematic exploitation the agent will not explore.
4.2 Benchmarking BAMCP++ and Model-Free Algorithms
Having shown that BAMCP++ does well on ARL Bandits, we test it on more complex MDPs with unknown transition dynamics and compare it against model-free algorithms.
4.2.1 BAMCP++ on Late Fork
We test BAMCP++ on Late Fork (Fig 1) with . This is a 3-state MDP, where the first two actions are unavoidable and should not be queried. The query cost is . In the condition “Known Transitions”, all transitions are known and only the rewards for each action are unknown. In “Unknown Transitions”, the agent knows which actions are available at each state but not where the actions lead. The priors are for rewards and symmetric Dirichlet with parameter for transitions. Figure 8 shows total returns averaged over 50 runs for different horizons. (The number of episodes plays the same role as the number of trials in Bandits.)
BAMCP++ achieves close to the optimal policy when the horizon is above 17. But does it explore in the Bayes optimal way? Figure 9 shows the probability of querying actions at each timestep in a setting with horizon episodes, which corresponds to the midpoint on the x-axis of Fig 8. The spikes in the graph show the agent alternates between querying with probability zero (at the unavoidable action) and querying with positive probability (at the fork), just as the Bayes optimal agent does. [Footnote 4: For “Unknown Transitions” the agent knows that actions are unavoidable while not knowing where they lead.]
4.2.2 BAMCP++ on Random MDPs
BAMCP++ does well on very small MDPs like Bandits and 3-state Late Fork. Can it scale to larger and more varied MDPs? We compare BAMCP++, MCCH, and First-N on the Fork environments (Fig 1) and on random MDPs with 5 states and 3 actions. The query cost is throughout. To generate 25 random MDPs for testing algorithms, we sample rewards and transitions from symmetric Dirichlet distributions with and respectively. We call this the generating prior. The BAMCP++ agent uses the generating prior across all MDPs (including the Fork environments) and uses a fixed number of MC simulations (200,000).
BAMCP++ and First-N use a fixed set of hyperparameters for all MDPs in Table 1. These are set by grid search on random MDPs sampled from the generating prior. So the hyperparameters are tuned for the task “Rand25” but not for any other tasks in Table 1. We tried fixing hyperparameters for MCCH in the same way but performance was so poor that we instead tuned hyperparameters for each row in Table 1.
On random MDPs, BAMCP++ substantially outperforms the model-free approaches. The mean performance averaged over all 25 random MDPs is shown in row “Rand25” of Table 1. Here each algorithm has its hyperparameters tuned to the task. Figure 10 shows performance (total return vs. number of queries) on the same task but with a range of different hyperparameter settings. MCCH performs poorly because without tuning of hyperparameters it queries far too much. First-N and BAMCP++ are both fairly robust to hyperparameter settings in terms of both number of queries and total return. BAMCP++ achieves more reward without querying more, suggesting it makes smarter choices of where to explore and which actions to query.
On Early and Late Fork environments, BAMCP++ performs best on horizon 30, while First-N wins on horizon 50. The Fork environments all have a maximum per-episode reward of 1 and hence a maximum total reward of 30 and 50 (for horizons of 30 and 50 episodes). As the horizon increases, BAMCP++ improves its absolute score significantly but its score declines as a fraction of the maximal total return. What explains this decline? The most challenging task “Early550” is a 10-state MDP with a planning horizon of 250 timesteps (50 episodes × 5 steps per episode). This is a vastly larger search tree than for “Late430” but the number of MCTS simulations at each timestep was the same, making it harder to sample the best exploration strategies.
MCCH and First-N initially query all states indiscriminately. As the horizon increases, they scale well because there is more time for their indiscriminate querying to be exploited. The strong overall performance of First-N is partly due to our choice of MDPs. All reward distributions were Bernoulli (which have an upper bound on their variance) and differences between optimal values for actions were rarely very small. So by tuning the hyperparameter (the maximum number of queries per action) on random MDPs, First-N was well adapted to all the MDPs in Table 1. But outside our experiments the same MDP could have reward distributions with huge variation in variance (e.g. Gaussian rewards with and ) and so a Bayes optimal ARL agent would need to query some actions many more times than others.
Table 1: Results for BAMCP++, MCCH and First-N on each task.
Task      BAMCP++    MCCH      First-N
Rand25    60.2[8.8]  48.5[17]  55.7[16]
Late430   28.2[0.7]  25.1[9.0] 26.1[2.9]
Late530   27.4[0.2]  25.7[8.2] 25.6[2.5]
Late450   45.2[0.7]  41.7[12]  46.3[7.6]
Late550   43.5[1.0]  42.1[13]  45.2[5.3]
Early430  25.9[1.9]  22.8[11]  24.5[2.5]
Early530  23.8[3.7]  22.8[10]  23.7[2.5]
Early450  41.2[3.2]  40.1[17]  43.2[8.8]
Early550  32.9[6.9]  39.3[16]  42.9[5.5]
5 Conclusion
Active RL is a twist on standard RL in which the cost of evaluating the reward of actions is incorporated into the agent’s objective. It is motivated by settings where rewards are constructed incrementally online, as when humans provide feedback to a learning agent. We introduced BAMCP++, an algorithm for Bayesian ARL in tabular MDPs which converges to the Bayes optimal policy in the limit of infinitely many Monte-Carlo samples. In experiments, BAMCP++ achieves near-optimal performance on small MDPs and outperforms model-free algorithms on MDPs with 15 actions and a horizon of 100 timesteps.
The key idea behind BAMCP++ is that MCTS is guided by a sophisticated (and more computationally costly) model-free learner in the rollouts. This helps alleviate a fundamental challenge for simulation-based ARL algorithms. Such algorithms must simulate recouping the upfront query costs by exploiting the information gained from queries. This requires simulations that are non-random (to capture exploitation) over many timesteps (query costs are only recouped after many timesteps).
Acknowledgements
OE was supported by the Future of Humanity Institute (University of Oxford) and the Future of Life Institute grant 2015144846. SSch is in a PhD position supported by Dyson. Clare Lyle contributed to early work on model-free heuristics and suggested the Early Fork environment. We thank Joelle Pineau and Jan Leike for helpful conversations. We thank David Abel, Michael Osborne and Thomas McGrath for comments on a draft.
References

Abbeel and Ng (2004) P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, 2004.
Anthony et al. (2017) T. Anthony, Z. Tian, and D. Barber. Thinking fast and slow with deep learning and tree search. In Advances in Neural Information Processing Systems, pages 5366–5376, 2017.
Araya et al. (2012) M. Araya, O. Buffet, and V. Thomas. Near-optimal brl using optimistic local transitions. arXiv preprint arXiv:1206.4613, 2012.
Auer et al. (2002) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
Azar et al. (2017) M. G. Azar, I. Osband, and R. Munos. Minimax regret bounds for reinforcement learning. arXiv preprint arXiv:1703.05449, 2017.
Bubeck et al. (2012) S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
Castronovo et al. (2016) M. Castronovo, D. Ernst, A. Couëtoux, and R. Fonteneau. Benchmarking for bayesian reinforcement learning. PloS one, 11(6), 2016.
Christiano et al. (2017) P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4302–4310, 2017.
D. Sadigh et al. (2017) D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS), 2017.
Daniel et al. (2014) C. Daniel, M. Viering, J. Metz, O. Kroemer, and J. Peters. Active reward learning. In Robotics: Science and Systems, 2014.
Dearden et al. (1998) R. Dearden, N. Friedman, and S. Russell. Bayesian q-learning. In AAAI, pages 761–768. AAAI Press, 1998.
Dragan (2017) A. D. Dragan. Robot planning with mathematical models of human state and action. arXiv preprint arXiv:1705.04226, 2017.
Epshteyn et al. (2008) A. Epshteyn, A. Vogel, and G. DeJong. Active reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pages 296–303. ACM, 2008.
Evans et al. (2016) O. Evans, A. Stuhlmüller, and N. D. Goodman. Learning the preferences of ignorant, inconsistent agents. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pages 323–329. AAAI Press, 2016.
Ghavamzadeh et al. (2015) M. Ghavamzadeh, S. Mannor, J. Pineau, A. Tamar, et al. Bayesian reinforcement learning: A survey. Foundations and Trends® in Machine Learning, 8(5-6):359–483, 2015.
Guez (2015) A. Guez. Sample-Based Search Methods For Bayes-Adaptive Planning. PhD thesis, UCL (University College London), 2015.
Guez et al. (2012) A. Guez, D. Silver, and P. Dayan. Efficient bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems, pages 1025–1033, 2012.
Ho and Ermon (2016) J. Ho and S. Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565–4573, 2016.
Judah et al. (2012) K. Judah, A. Fern, and T. G. Dietterich. Active imitation learning via reduction to iid active learning. In UAI, pages 428–437, 2012.
Kolter and Ng (2009) J. Z. Kolter and A. Y. Ng. Near-bayesian exploration in polynomial time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 513–520. ACM, 2009.
Krueger et al. (2016) D. Krueger, J. Leike, O. Evans, and J. Salvatier. Active reinforcement learning: Observing rewards at a cost. In Future of Interactive Learning Machines, NIPS Workshop, 2016.
Kuleshov and Precup (2014) V. Kuleshov and D. Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
Osband et al. (2013) I. Osband, D. Russo, and B. Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
Saunders et al. (2017) W. Saunders, G. Sastry, A. Stuhlmueller, and O. Evans. Trial without error: Towards safe reinforcement learning via human intervention. arXiv preprint arXiv:1707.05173, 2017.
Scott (2015) S. L. Scott. Multi-armed bandit experiments in the online service economy. Applied Stochastic Models in Business and Industry, 31(1):37–45, 2015.
Settles (2012) B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6:1–114, 2012.
Silver et al. (2016) D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
Silver et al. (2017) D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
Srinivas et al. (2009) N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
Strens (2000) M. Strens. A bayesian framework for reinforcement learning. In Proceedings of the 17th international conference on Machine learning, 2000.
Subramanian et al. (2016) K. Subramanian, C. L. Isbell Jr, and A. L. Thomaz. Exploration from demonstration for interactive reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, pages 447–456. International Foundation for Autonomous Agents and Multiagent Systems, 2016.
Sutton and Barto (1998) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
Warnell et al. (2017) G. Warnell, N. Waytowich, V. Lawhern, and P. Stone. Deep tamer: Interactive agent shaping in high-dimensional state spaces. arXiv preprint arXiv:1709.10163, 2017.
Wirth et al. (2017) C. Wirth, R. Akrour, G. Neumann, J. Fürnkranz, et al. A survey of preference-based reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46, 2017.