1 Introduction
When seeing a patient concerned about a potentially cancerous skin blemish, a doctor must decide which diagnostic assessments are required. Some measurements, such as touch and visual inspection, can easily be conducted during the initial consultation, whilst others require sophisticated equipment and drawn-out lab analyses, and carry higher costs. The doctor must actively decide whether the higher-cost assessment will provide the information needed to accurately and efficiently select the next treatment action.
The above scenario describes a Markov Decision Process (MDP) with observation classes and costs. Environments of this nature include the following properties:

One or more classes of observations (measurements) of the next state are possible;

The measurements have explicit associated costs; and,

The value of the measurement depends on time and space.
Indeed, a wide variety of sequential decision making tasks, such as materials design, public health planning during a pandemic and operational planning (business decision making) involve the choice of actions and classes of observations with associated costs.
The MDP formalism and the environments on which reinforcement learning (RL) algorithms are developed and tested, however, are not designed to explore such settings Brockman et al. (2016). In the canonical framework, observations of the state of the environment are produced automatically and instantaneously, and have no explicit associated costs. Generally, agents are agnostic to the state observations provided by the environment in the sense that they learn from what they receive. To the extent that the agent might try to improve the quality of observations, it is through deep feature representations Mnih and others (2015), maintaining a belief state for partially observable MDPs Kaelbling et al. (1998), or taking actions to change the state of the environment in order to gain a better understanding of it Bossens et al. (2019). Thus, prior work has considered observations, yet has dealt with neither the selection of observation classes nor the minimization of observation costs.
Here, we frame MDPs with observation classes and costs as an active learning problem. Active learning is typically applied to supervised machine learning with the aim of reducing the cost of labelling training data Settles (2012). However, active learning has recently been applied to RL in the context of determining reward from external experts Epshteyn et al. (2008); Krueger et al. (2016); Schulze and Evans (2018). In contrast, we postulate that in some domains observations of the state of the environment, like supervised labels, are expensive to obtain. In the context of this work, the active component is applied to learning which measurements to make in a given state at a particular time, or deciding not to make a measurement at all, thereby foregoing the additional information and the cost associated with it. The aim is to maximize the discounted sum of rewards minus observation costs, which we denote as the costed return.
We propose the Active Measure Reinforcement Learning (Amrl) framework, in which the agent learns a policy and a state estimator in parallel via online experience. The agent chooses action pairs that change the environment and dictate whether the next state is measured directly or estimated. As the state estimator is refined over time, the agent smoothly shifts to rely on it increasingly, thereby lowering its observation costs. This enables Amrl agents to achieve a higher costed return.
We demonstrate an implementation of Amrl using Q-learning and a statistical state estimator (Amrl-Q). We compare Amrl-Q to Q-learning and Dyna-Q on four benchmark learning environments, including a new chemistry-motivated environment, the Junior Scientist environment. The results show that Amrl-Q achieves a higher sum of rewards minus observation costs than Q-learning and Dyna-Q, whilst learning at an equivalent rate to Q-learning and Dyna-Q.
1.1 Contributions
The main contributions of this work are:

Formalization of MDPs with observation classes and costs

Definition of the Active Measure RL framework (Amrl)

Initial implementation of a Q-learning approach, the Amrl-Q algorithm

Analysis of Amrl-Q on benchmark RL environments
2 Related Work
Previous work on active reinforcement learning has focused on ameliorating the problem of defining a complete reward function over the state-action space Akrour et al. (2012); Krueger et al. (2016); Schulze and Evans (2018). In addition to selecting an action at each time step, the agents in these proposals actively decide whether to request that a human expert provide the reward for the state-action pair. To minimize reliance on human experts, a cost is assigned to requesting a human-specified reward. The agent aims to minimize this cost whilst maximizing the discounted sum of rewards. Similarly, Amrl maximizes the discounted sum of rewards minus the sum of observation costs. However, Amrl differs in that state observations, rather than rewards, are the bottleneck in the learning process. Moreover, the Amrl agent may have multiple different measurements of the state of the environment available to it, each of which has a distinct cost.
Active perception relates to our work in that the agent takes actions to increase the information available Gibson (1966). The key distinction, however, is that in active adaptive perception applied to RL, the agents employ self-modification and self-evaluation to improve their perception Bossens et al. (2019). In contrast, the Amrl agent aims to judiciously select observation classes so that it has the necessary and sufficient information to choose the next action and maximize the costed return.
Recently, the authors in Chen et al. (2017); Li et al. (2019) proposed extending the concept of multi-view learning from supervised domains to reinforcement learning. They formulate this as an agent having multiple views of the state-space available to it. This is the case, for example, for agents controlling autonomous vehicles equipped with multiple sensors. This previous work, however, does not contain the concept of observation costs, which are fundamental in applications of Amrl.
Approximate dynamic programming (ADP) aims to ameliorate the “curse-of-dimensionality” in dynamic programming Powell (2007). It is connected to our work in the sense that it introduces a new component, the post-decision state, to the interaction with the environment. In contrast, our work, which is not focused on the curse-of-dimensionality, formulates an action pair that determines the process to be applied in the environment and the observation class to be made.
The learning of the state transition dynamics in the Amrl framework is consistent with the techniques employed in model-based RL Deisenroth and Rasmussen (2011); Kumar et al. (2016); Gal et al. (2016). The goal of model-based RL, however, is to reduce the number of real-world training steps needed to obtain an optimal policy. This solves neither our problem of selecting the observation class nor that of minimizing the associated observation costs.
Learning algorithms for POMDPs utilize a state estimator to internalize the agent’s recent experience in order to reduce uncertainty in partially observable environments. At each time step, the next action is selected based on the agent’s belief state as determined by its state estimator, rather than the observation emitted from the environment Kaelbling et al. (1998). In contrast, in Amrl the agent is learning an optimal policy under an MDP with observation costs. The agent chooses between paying the cost to measure the true state of the environment and estimating it. Thus, in Amrl the state estimator is a mechanism to increase the costed return, not to manage partial observability.
3 Preliminaries
We define active measure reinforcement learning as a tuple: $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \mathcal{C}, \gamma)$. The components $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ make up a standard MDP where $\mathcal{S}$ is the state-space, $\mathcal{A}$ is the action-space, $\mathcal{P}$ is the state transition probabilities, $\mathcal{R}$ is the reward function, and $\gamma$ is a discount factor. $\mathcal{P}$ and $\mathcal{R}$ are not known by the agent. $\mathcal{C}$ is the cost charged to the agent each time it decides to measure the state of the environment. Thus, for a state $s_{t+1}$, the environment returns the cost as follows:
$$c_{t+1} = \begin{cases} \mathcal{C}, & \text{if the agent measures } s_{t+1}, \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$
Applications may have multiple observation classes $m \in \{1, \dots, M\}$, such as different sensors that serve different purposes. In this case, each measurement class may have a different associated cost. Selecting $m$ constitutes the active learning choice on the part of the agent. The values $m \geq 1$ indicate a specific observation class (such as a specific sensor) to be used, whereas $m = 0$ specifies that no measurement of the environment is to be made; when $m = 0$, the agent uses its state estimator in place of a measurement of the environment.
As in Schulze and Evans (2018), at each time step the agent selects an action pair. In Amrl, the action pair $(a_t, m_t)$ consists of an atomic process $a_t$ (e.g., move left) and an observation class $m_t$. Thus, if $m_t \geq 1$, the process $a_t$ is applied to the environment, and the environment returns the reward $r_{t+1}$ and the next state observation $s_{t+1}$ measured via $m_t$. Here, $s_{t+1}$ results from the underlying, unknown transition dynamics $\mathcal{P}$. For $m_t = 0$, the process $a_t$ is applied to the environment, but the environment only returns the reward $r_{t+1}$. In this case, the Amrl agent estimates the next state, $\hat{s}_{t+1}$, and selects its next action pair, $(a_{t+1}, m_{t+1})$, based on this estimate. This leads to an alternative agent-environment interaction sequence of the form:
$$s_0,\ (a_0, m_0),\ r_1,\ c_1,\ \hat{s}_1,\ (a_1, m_1),\ r_2,\ c_2,\ s_2,\ \dots \tag{2}$$
where the agent starts each episode with a true measurement of the environment’s current state, $s_0$, and proceeds to sequentially select action pairs that determine the process to be applied and whether to measure the next state or estimate it instead.
Importantly, the reward emitted from the environment is always a function of the process $a_t$ and the true state of the environment $s_t$, regardless of whether the agent selected $a_t$ based on $s_t$ or an estimate $\hat{s}_t$. For simplicity and generality, at times we drop the hat notation on the state estimates.
In this work, we focus on episodic environments with discrete state and action sets, $\mathcal{S}$ and $\mathcal{A}$, and stationary state-transition dynamics. In an MDP with measurement costs, the objective is to select a sequence of action pairs that maximizes the costed return, which is defined as the discounted sum of rewards minus the sum of measurement costs:
$$G_t = \sum_{k=0}^{T-t-1} \gamma^{k} \left( r_{t+k+1} - c_{t+k+1} \right) \tag{3}$$
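As a small illustration of the costed return, the following sketch computes it for a single episode. It assumes that the measurement cost is discounted together with the reward at each step, which the text does not pin down; the function name `costed_return` and the example reward and cost values are hypothetical.

```python
def costed_return(rewards, costs, gamma):
    """Discounted sum of (reward - measurement cost) over one episode.

    Assumption: costs are discounted with the same factor as rewards.
    """
    total = 0.0
    for t, (r, c) in enumerate(zip(rewards, costs)):
        total += (gamma ** t) * (r - c)
    return total


# Three-step episode that ends with a reward of 1 (illustrative values).
rewards = [0.0, 0.0, 1.0]
always_measure = costed_return(rewards, [0.1, 0.1, 0.1], gamma=1.0)
never_measure = costed_return(rewards, [0.0, 0.0, 0.0], gamma=1.0)
print(always_measure, never_measure)
```

An agent that reaches the same rewards while measuring less often obtains the higher costed return, which is the quantity the comparison in this paper is based on.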
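In Amrl, a policy, $\pi$, maps states and action pairs to a probability, such that $\pi((a, m) \mid s)$ is the probability of selecting action pair $(a, m)$ in state $s$. The value function associated with policy $\pi$ is:
$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ \sum_{k=0}^{T-t-1} \gamma^{k} \left( r_{t+k+1} - c_{t+k+1} \right) \,\middle|\, s_t = s \right] \tag{4}$$
where the action pairs are selected according to $\pi$. Since action pairs can be thought of as a higher-level class of action, the standard RL theorems hold. Thus, there is at least one policy $\pi^{*}$ such that $V^{\pi^{*}}(s) \geq V^{\pi}(s)$ for all $s$ and $\pi$, where $\pi^{*}$ is an optimal policy and $V^{\pi^{*}}$ is the corresponding value function.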
4 Amrl-Q
We propose an initial implementation of the Amrl framework for a tabular learning environment. Our proposed solution utilizes Q-learning for the value function and a statistical state transition model. We focus on tabular problems here for clarity in the demonstration and analysis. Our future work will implement Amrl solutions for continuous state and action spaces.
4.1 Overview
As previously stated, the Amrl-Q framework learns a value function, $Q$, and a state estimator, $\hat{\mathcal{P}}$, in parallel. Learning both $Q$ and $\hat{\mathcal{P}}$ is essential to the active-learning-based solution, which enables the agent to reduce the total number of times it requests a true measurement. The theory behind this can be demonstrated with the Markov chain in Figure 1.
This Markov chain forms a two-action (left, right) episodic RL problem where the agent starts in state zero and receives a reward of one upon entering the absorbing state, state four. For temporal difference (TD) learning methods, such as Q-learning, applied to episodic problems such as this, the values of states and actions are refined over episodes of training from the state closest to the absorbing state back to the start state. The backup algorithm for Q-learning is:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{5}$$
where $Q(s_t, a_t)$ is the value of action $a_t$ in state $s_t$ at time $t$, $\alpha$ is the learning rate and $\gamma$ is the discount factor. If we assume a Q-table initialized to zeros, after one episode of training is complete, only $Q(3, \text{right})$ will have a value greater than zero; after the second episode is complete, states 2 and 3 will have values greater than zero, and so on. In general, for an $n$-state chain of this nature, the agent will require $n - 1$ episodes of training to start to improve the values associated with the start state, state 0.
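The backward propagation of values described above can be reproduced with a few lines of code. The sketch below applies the Q-learning backup in (5) to the five-state chain, assuming for simplicity that the agent moves right at every step (the shortest path), so that exactly one additional state acquires a positive value per episode. The learning rate and discount factor are illustrative assumptions.

```python
N_STATES = 5            # states 0..4; state 4 is absorbing
LEFT, RIGHT = 0, 1


def run_episode(q, alpha=0.5, gamma=0.9):
    """One episode of always moving right from state 0 with the Q-learning backup."""
    s = 0
    while s != N_STATES - 1:
        s_next = s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # The absorbing state has no successor value to bootstrap from.
        bootstrap = 0.0 if s_next == N_STATES - 1 else max(q[s_next])
        q[s][RIGHT] += alpha * (r + gamma * bootstrap - q[s][RIGHT])
        s = s_next


q = [[0.0, 0.0] for _ in range(N_STATES)]
first_positive = []     # lowest-numbered state with a positive value after each episode
for episode in range(1, 5):
    run_episode(q)
    first_positive.append(min(s for s in range(N_STATES) if max(q[s]) > 0))
print(first_positive)   # [3, 2, 1, 0]
```

Under this fixed policy, the start state only receives a non-zero value in the fourth episode, consistent with the argument in the text. With an exploring policy the agent can revisit states within an episode, so values may occasionally propagate faster than one state per episode.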
The number of times the agent visits each state per episode indicates how many true measurements of the environment it will make. We can estimate this by calculating the fundamental matrix of the absorbing Markov chain shown in Figure 1. The fundamental matrix is defined as $N = (I - T)^{-1}$, where $I$ is the identity matrix and $T$ is the matrix representing the transient states in $\mathcal{P}$. Based on this, the expected number of visits to states 0, 1, 2 and 3 before absorbing, for an agent starting in state 0 and following a random policy, is 8, 6, 4 and 2, respectively. Thus, in the first four episodes of training, the Q-learning agent is expected to take 46 measurements of the environment.
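The expected visit counts quoted above can be checked directly. The following sketch builds the transient part of the transition matrix for the five-state chain under a uniformly random policy and inverts $I - T$ with exact fractions. It assumes that taking "left" in state 0 leaves the agent in state 0; the helper `invert` is a plain Gauss-Jordan routine written for this example.

```python
from fractions import Fraction


def invert(matrix):
    """Invert a square matrix of Fractions by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


half = Fraction(1, 2)
# Transitions among the transient states 0..3 under a uniform random policy.
# Assumption: "left" in state 0 keeps the agent in state 0.
T = [
    [half, half, 0, 0],
    [half, 0, half, 0],
    [0, half, 0, half],
    [0, 0, half, 0],
]
identity_minus_T = [[Fraction(int(i == j)) - Fraction(T[i][j]) for j in range(4)]
                    for i in range(4)]
N = invert(identity_minus_T)
expected_visits = [int(v) for v in N[0]]   # row 0: the agent starts in state 0
print(expected_visits)   # [8, 6, 4, 2]
```

Row 0 of the fundamental matrix gives the expected number of visits to each transient state for an agent that starts in state 0, which matches the figures of 8, 6, 4 and 2.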
If we consider the state estimator $\hat{\mathcal{P}}$ learned by Amrl, according to the calculations above, in the first episode of training the agent is expected to have tried both actions in states 0, 1, 2 and 3 a total of 4, 3, 2 and 1 times, respectively. Since, for deterministic transition dynamics $\mathcal{P}$, the agent must try each state-action pair once to have an accurate $\hat{\mathcal{P}}$, Amrl can safely switch from actively measuring the next state to estimating it with $\hat{\mathcal{P}}$ after the first episode of training. Moreover, because the agent tries actions in states closer to the start sooner and more frequently, it can switch to using $\hat{\mathcal{P}}$ in these states even before the first episode of training ends. In this way, Amrl is able to improve measurement efficiency well beyond what can be achieved by standard RL methods and model-based RL.
4.2 Algorithm
The Amrl-Q algorithm maintains a Q-table and a count-based statistics table for the state-transition model $\hat{\mathcal{P}}$. In this initial presentation, we limit the agent to selecting from one observation class. Therefore, the agent maintains an $|\mathcal{S}| \times k$ dimensional Q-table, where $k$ is the number of action pairs. An environment with 2 actions has 4 action pairs, and thus a four-column Q-table.
The Q-table is updated in the standard way as:
$$Q(s_t, (a_t, m_t)) \leftarrow Q(s_t, (a_t, m_t)) + \alpha \left[ (r_{t+1} - c_{t+1}) + \gamma \max_{(a, m)} Q(s_{t+1}, (a, m)) - Q(s_t, (a_t, m_t)) \right] \tag{6}$$
The agent employs an $\epsilon$-greedy strategy to pick action pairs from the Q-table. If the action pair at time $t$ includes $m_t = 1$, then the agent chooses to pay the cost of measuring the next state from the environment. Otherwise, the agent estimates the next state from its model as $\hat{s}_{t+1} \sim \hat{\mathcal{P}}(\cdot \mid s_t, a_t)$. When the agent chooses to measure the true state, it updates $\hat{\mathcal{P}}$ for the corresponding action.
Much like a human learning a new task, the first few times an agent enters a state it must measure the result of taking an action. We produce this behaviour by initializing the Q-values for action pairs involving state measurements with a small positive value, and the Q-values for action pairs involving state estimates with zero (implications of initialization are discussed below).
Over successive visits to a state $s$ and applications of a process $a$ and measurement $m = 1$, the return for $(a, m = 1)$ will be less than the maximum possible return because the measurement cost $c$ is subtracted from the reward $r$. Since moving without measuring does not incur an additional measurement cost, in time and as the model improves, moving and relying on the learned model produces an increased reward and the agent shifts to this strategy.
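The outline of the algorithm is:

Initialize a biased Q-table of size $|\mathcal{S}| \times k$
Initialize a state-transition statistics table of size $|\mathcal{S}| \times |\mathcal{A}| \times |\mathcal{S}|$ to zeros
Get the first state $s$ from the environment
Repeat until done:
    Select action pair $(a, m)$ with the $\epsilon$-greedy policy from the Q-table for state $s$
    Apply action $a$ to the environment
    If measure ($m = 1$):
        Measure next state $s'$ in the environment
        Update the state-transition model for action $a$ with $(s, a, s')$
    Else:
        Sample next state $s' \sim \hat{\mathcal{P}}(\cdot \mid s, a)$
    Get reward $r$ from the environment
    Get cost $c$ from the environment
    Update the Q-table for state $s$ with tuple $(s, (a, m), r, c, s')$
    Set $s \leftarrow s'$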
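The outline above can be turned into a short runnable sketch. The version below runs on a deterministic 11-state chain and is illustrative only: the reward values, measurement cost, hyperparameters, the fallback used when the model has no data for a state-action pair, and the function names `chain_step` and `amrl_q` are all assumptions rather than the settings used in the experiments.

```python
import random

N_STATES = 11          # chain states 0..10; state 10 is the goal
ACTIONS = (-1, +1)     # move left, move right
# Action pairs (a, m): m = 1 measures the next state, m = 0 estimates it.
PAIRS = [(a, m) for m in (1, 0) for a in range(len(ACTIONS))]


def chain_step(state, a):
    """Deterministic chain dynamics; reward values are assumed for illustration."""
    nxt = min(max(state + ACTIONS[a], 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else -0.01), done


def amrl_q(episodes=200, alpha=0.1, gamma=0.95, eps=0.1, cost=0.05,
           measure_init=0.01, max_steps=500, seed=0):
    rng = random.Random(seed)
    # Biased Q-table: measure columns start slightly positive, estimate columns at zero.
    q = [[measure_init if m == 1 else 0.0 for (_, m) in PAIRS]
         for _ in range(N_STATES)]
    # Count-based transition statistics: counts[s][a][s_next].
    counts = [[[0] * N_STATES for _ in ACTIONS] for _ in range(N_STATES)]
    measurements = []
    for _ in range(episodes):
        true_s = 0
        s = true_s                      # each episode starts with a true measurement
        n_meas = 0
        for _ in range(max_steps):
            if rng.random() < eps:      # epsilon-greedy over action pairs
                k = rng.randrange(len(PAIRS))
            else:
                best = max(q[s])
                k = rng.choice([i for i, v in enumerate(q[s]) if v == best])
            a, m = PAIRS[k]
            true_next, r, done = chain_step(true_s, a)
            if m == 1:                  # pay the cost and observe the true next state
                s_next, c = true_next, cost
                counts[s][a][s_next] += 1
                n_meas += 1
            else:                       # estimate the next state from the learned model
                c = 0.0
                row = counts[s][a]
                if sum(row) > 0:
                    s_next = rng.choices(range(N_STATES), weights=row)[0]
                else:
                    s_next = s          # assumed fallback when the model has no data
            target = (r - c) + (0.0 if done else gamma * max(q[s_next]))
            q[s][k] += alpha * (target - q[s][k])
            true_s, s = true_next, s_next
            if done:
                break
        measurements.append(n_meas)
    return q, measurements


q_table, measurements_per_episode = amrl_q(episodes=50)
print(measurements_per_episode[:5], measurements_per_episode[-5:])
```

Note that the environment always advances its true state, while the agent acts on `s`, which may be an estimate; the reward is therefore always a function of the true state, as required by the framework. The sketch also assumes the environment signals the end of the episode even when the next state is not measured.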

5 Experimental Setup
The following experiments are conducted on episodic, discrete state and action problems. Our analysis involves three standard RL environments (Chain, Frozen Lake and Taxi) and one new environment (Junior Scientist). Each of these environments has the feature that the agent must actively decide if and when to measure the state of the environment. In the case of the OpenAI Gym environments (Frozen Lake and Taxi), we have implemented a wrapper class in Python that adds the Amrl functionality.
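To make the idea of such a wrapper concrete, here is a minimal sketch. It is not the wrapper used in the experiments: it assumes an environment exposing the classic `reset()` and `step(action)` interface, where `step` returns `(observation, reward, done, info)`, and it uses a hypothetical stand-in environment, `ToyChain`, so that the example is self-contained. The class name `ActiveMeasureWrapper` and the five-element return value are likewise assumptions.

```python
class ToyChain:
    """Hypothetical stand-in environment with a classic reset()/step() interface."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        done = self.state == self.n_states - 1
        return self.state, (1.0 if done else 0.0), done, {}


class ActiveMeasureWrapper:
    """Hides the next state unless the agent pays to measure it."""

    def __init__(self, env, measure_cost):
        self.env = env
        self.measure_cost = measure_cost

    def reset(self):
        # The first state of each episode is always measured.
        return self.env.reset()

    def step(self, action_pair):
        action, measure = action_pair
        obs, reward, done, info = self.env.step(action)
        cost = self.measure_cost if measure else 0.0
        # Without a measurement the observation is withheld (None).
        return (obs if measure else None), reward, cost, done, info


env = ActiveMeasureWrapper(ToyChain(), measure_cost=0.05)
start = env.reset()
hidden = env.step((1, 0))     # move right, estimate: no observation, no cost
observed = env.step((1, 1))   # move right, measure: observation and cost returned
print(hidden[:3], observed[:3])
```

The underlying environment still advances on every step, so the reward always reflects the true state; only the observation and the cost depend on the measurement choice.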
5.1 RL Environments
Chain environment: A chain of 11 states, $s_0, \dots, s_{10}$, where the agent starts at $s_0$ and the episode ends when the agent enters $s_{10}$. Upon entering goal state $s_{10}$, the agent receives a reward of . The agent receives a reward of at each time step. The agent is charged a measure cost of for measuring the state of the environment. Measuring the state results in the environment returning the current state in the chain. The action space is {left, right}. We evaluate the performance with both deterministic state transitions and stochastic state transitions. In the stochastic setup, the environment has a probability of the actions being swapped at each time step.
Frozen Lake environment: In this environment, the agent learns to navigate from a start location to a goal in a frozen lake grid with holes in the ice. Each episode ends when the agent reaches the goal or falls through a hole in the ice. The agent receives a reward of at the goal, otherwise. The agent pays a cost of for measuring the state of the environment. Measuring the state results in the environment returning the current position in the 2-dimensional frozen lake grid. The action-space in the environment is . In this implementation, the agent is prevented from moving off the grid. We evaluate the agents with both the predefined deterministic and slippery settings in the OpenAI Gym.
Taxi environment: The agent learns to navigate a city grid world to pick up and drop off passengers at the appropriate location Dietterich (2000). The agent receives a reward for dropping off at the correct location, for illegal pick-up or drop-off and at each time step. The agent is charged a cost of for measuring the state of the environment. Measuring the state results in the environment returning the current position in the city grid. The action-space includes .
Junior Scientist environment: This environment emulates a student learning to manipulate an energy source to produce a desired state change in a target material. Specifically, the agent starts with a sealed container of water composed of an initial percent ice, percent water and percent gas (). The agent learns to sequentially and incrementally adjust a heat source in order to transition the ratio of ice, liquid and gas from to a goal ratio . The episode ends when the agent declares that it has reached the goal and it is correctly in the goal state. The action-space includes , where decrease and increase are fixed incremental adjustments in the energy source. The agent receives a reward of when it reaches the goal and correctly declares that it is done, and receives a reward of at each time step. The agent is charged for measuring the state of the environment. Measuring the state results in the environment returning the cumulative energy which has been added to or removed from the system.
5.2 RL Algorithms
We compare the relative performance of Amrl-Q to non-active methods: Q-learning Watkins and Dayan (1992) and Dyna-Q Sutton (1990). Since neither Q-learning nor Dyna-Q is an active RL method, they require a measurement of the environment at each time step. As a result, they are charged the measurement cost at each time step. The relative performance of these methods is assessed in terms of the sum of rewards minus observation costs, along with the mean number of steps and measurements per episode. To the best of our knowledge, we are proposing the first solution to the Amrl problem. As such, Q-learning and Dyna-Q are a reasonable baseline for comparison in this introductory work.
5.3 Methodology
For each RL algorithm in our evaluation, we utilize a discount factor of and $\epsilon$-greedy exploration . The Q-tables for both Q-learning and Dyna-Q are initialized to zeros. The columns of the Q-table in Amrl-Q associated with estimate ($m = 0$) are initialized to zeros and those associated with measure ($m = 1$) are set to a small positive value, typically (we also explore the impact of larger values). The results presented are mean performance averaged over 20 random trials, enough to be statistically significant. In Dyna-Q, we employ 5 planning steps, a reasonable baseline, after each real step.
6 Results
We initially focus on the performance of each agent in the deterministic environments. We highlight the impact of stochasticity in the Discussion.
6.1 Chain
The mean performance of each agent is shown in Figure 2. The left plot displays the mean number of steps to the goal for Q-learning, Dyna-Q, and Amrl-Q. All three methods learn a policy that takes a similar number of steps to the goal. Naturally, Dyna-Q learns faster (red line versus green and blue). It is worth noting that Dyna-style planning could easily be incorporated into Amrl-Q as an enhancement; however, this is beyond the scope of this study.
Whilst Q-learning and Dyna-Q require a measurement after each action, Amrl-Q actively decides whether to measure or estimate the next state. The purple line in Figure 2 shows the mean number of measurements per episode made by Amrl-Q. In the very early episodes, the number of measurements is similar to that of Dyna-Q; however, it quickly drops well below the alternatives.
The cost savings resulting from fewer measurements for Amrl-Q can be seen in the higher costed return presented in the plot on the right. As in the previous analysis, Amrl-Q is initially similar to Dyna-Q. This holds while Amrl-Q learns about the state transition dynamics. Because Amrl-Q dynamically shifts its measurement behaviour in each state as it learns about the transition dynamics, over episodes of training it reduces its measurement costs and acquires a higher costed return (blue line).
Figure 3 summarizes the total number of visits and measurements made in each state after 1, 20 and 40 episodes of training for Q-learning and Amrl-Q (for Q-learning, the state visits and measurements are equivalent). The plot on the left shows that initially Amrl-Q (blue bar) visits most states slightly more frequently than Q-learning (red bar). Importantly, however, the purple bars show that it measures each state less frequently than Q-learning. Thus, the measurement costs are lower from the outset. After 20 and 40 episodes of training (centre and right plots), the state visit frequency of Amrl-Q is consistent with Q-learning. In these later episodes of training, however, Amrl-Q requires significantly fewer state measurements than Q-learning. This highlights the advantage that the Amrl framework has in its ability to shift from measuring the state of the environment to estimating it as more experience (episodes of training) is gathered. This behaviour is shown in greater detail in Figure 4.
Figure 4 contains four 2-dimensional histograms. These depict the number of visits to each state (plots 1 and 2) and the number of measurements in each state (plots 3 and 4) as a function of episodes of training; because Q-learning measures the state on each visit, plots 1 and 3 are the same. The x-axis specifies the state in the chain and the y-axis indicates the number of episodes of training completed. The darker black cells indicate more visits or measurements, whilst white indicates a moderate number and red depicts a low number. As a result of the learning behaviour of Q-learning that was discussed in Section 4, the lower diagonal of the state visit and measurement plots for Q-learning, and of the state visit plot for Amrl-Q, has a light red to black shading, with the darkest black appearing in the lower left corner. The upper diagonal, where the shading is uniformly dark red, shows the time at which the agent has learned a policy that enables it to transition directly from the current state to the goal. This occurs within just a few episodes of training for Q-learning in state 10 (first plot, lower right), whereas it takes approximately 50 episodes of training for state 0 (first plot, upper left). Whilst the state visit distributions are very similar for Q-learning and Amrl-Q, their state measurement distributions differ markedly in magnitude. The maximum state measurement value for Amrl-Q (rightmost plot) is 6, in comparison to 16 for Q-learning. Moreover, the shading in the Amrl-Q measurement plot quickly shifts from light red to dark red. In fewer than 30 episodes of training, the agent is able to replace all measurements of the environment with its own estimates.
6.2 Frozen Lake
Figure 5 shows the mean number of steps to the goal for each algorithm on the deterministic Frozen Lake. Similar to the deterministic chain, Amrl-Q learns at the same rate as Q-learning. It takes approximately the same number of steps per episode (green versus blue line). Dyna-Q learns faster than the alternatives, but converges to a similar mean number of steps as Q-learning and Amrl-Q (red line). Amrl-Q requires fewer measurements on average (purple line). The mean number of steps per episode at the end of training for each method is: random agent = 31.95, Q-learning = 13.99, Dyna-Q = 15.45, Amrl-Q = 18.52. Importantly, however, Amrl-Q only takes a mean of 10.50 measurements per episode.
6.3 Taxi
Figure 6 depicts the mean number of steps to the goal for each algorithm on the Taxi environment. This is a more challenging environment because it requires the agent to learn an intermediate goal. Nonetheless, the relative performance of the considered algorithms is consistent with our previous results. Amrl-Q learns at a similar rate to Q-learning, and takes approximately the same number of steps (green versus blue line). Dyna-Q learns faster (red line), but converges to a similar average number of steps as Q-learning and Amrl-Q. Amrl-Q requires fewer measurements on average (purple line). The mean number of steps per episode is as follows: random agent = 31.95, Q-learning = 14.83, Dyna-Q = 14.67, Amrl-Q = 15.30. Amrl-Q takes an average of 12.13 measurements per episode.
6.4 Junior Scientist
Figure 7 shows the mean of the costed return for each algorithm on the Junior Scientist environment. Once again, Dyna-Q learns slightly faster than Q-learning and Amrl-Q. The plot on the left clearly shows Amrl-Q shifting away from measuring the state after approximately 2,000 episodes of training (purple line). The fact that the mean number of steps (blue line) is stable during this shift indicates that the agent is not becoming ‘lost’ in the state space due to bad estimates.
7 Discussion
Figure 8 shows the evolution of the values of the Q-table for Amrl-Q over episodes of training on the deterministic chain environment. The x-axis shows the four action pairs [(move left, measure), (move right, measure), (move left, estimate), (move right, estimate)] that the agent chooses from. The y-axis shows each state, where 0 is the start state and 10 is the goal state. From left to right, the first plot is the initialized Q-table. It is followed by the Q-values after increments of 29 episodes of training. In earlier episodes of training, the action pair (move right, measure) has the highest values. The sequence of plots demonstrates that over episodes of training, the action pair (move right, estimate) comes to have the highest value. Thus, the agent shifts over time away from its reliance on more costly measurements.
The shift to estimating the next state occurs naturally within the Q-learning backup algorithm, given sufficient exploration. There is a clear trade-off in this evolution. If an agent in state $s$ relies on its state estimator before it is sufficiently accurate, it will be misinformed about its current location. As a result, it is likely to select the wrong action and take more time to reach the goal. Moreover, the agent’s Q updates will be applied to the wrong state. Alternatively, if an agent in state $s$ utilizes measurements longer than is necessary (i.e., when $\hat{\mathcal{P}}$ is sufficiently accurate), it needlessly pays the measurement cost, which lowers its reward. In Amrl-Q, proper exploration and the initialization of the Q-table serve to balance this trade-off. However, more sophisticated solutions using model confidence are expected to produce even better performance. We leave the study of such methods to future work.
The right column of Figure 9 depicts how the initialization of the Q-values associated with measure in Amrl-Q shapes the number of measurements made by the agent. The top plot depicts the number of measurements on the deterministic chain and the bottom plot the number on the stochastic chain. The episodes of training are plotted on the x-axis and the mean number of measurements is plotted on the y-axis. This clearly shows that as the initialization is decreased towards zero, the number of measurements made by the agent reduces.
The number of measurements per state-action pair has important implications for performance in the stochastic environments. In the lower right plot, which applies to the stochastic environment, the difference between the initializations of 0.01 and 0.005 is much smaller than in the deterministic case. In the stochastic case, the agent using the initialization of 0.005 shifts to using its state estimator before it is sufficiently accurate. As a result, the agent is operating from error-prone estimates of its current state, and thus requires more steps and more measurements on average.
The column on the left shows how the initialization impacts the costed return. The upper plot shows that, given enough time, the agent overcomes the larger initialization to achieve a costed return equivalent to that of agents with smaller initial values. The lower plot demonstrates the benefit of a large initial value in environments with stochastic transitions. From early episodes of training, the difference in mean performance (shown without error bars in the embedded plot) of the agents with different initializations is small. In the large plot (with error bars) it is clear that the larger initial value leads to a notably lower standard deviation. Given the added robustness of the larger initial values, and the fact that the agent will converge to the same performance, we advise against setting it too close to zero.
The plot on the left in Figure 10 depicts the mean of the costed return for Amrl-Q, Q-learning and Dyna-Q on the stochastic Chain. In this case, the action pairs involving measure are initialized to 0.01. The results for the Slippery Frozen Lake environment are plotted on the right. This is much more complex than the stochastic chain because it involves a larger number of actions and more variability in the transition dynamics. In this setting all methods have a high variance. The action pairs associated with measure must be set to a large value (in this case 10.0) in order to provide time to stabilize. The Amrl-Q agent begins to slowly shift away from relying on measurements after approximately 1,000 episodes of training.
8 Conclusion
We introduced a sequential decision-making framework, Amrl, in which the agent selects both an action and an observation class at each time step. The observation classes have associated costs and provide information that depends on time and space. We formulate our solution in terms of active learning, and empirically show that Amrl-Q learns to shift from relying on costly measurements of the environment to using its state estimator via online experience. Amrl-Q learns at a similar rate to Q-learning and Dyna-Q, and achieves a higher costed return. Amrl has the potential to expand the applicability of RL to important applications in operational planning, scientific discovery, and medical treatments. To achieve this, additional research is required to develop Amrl methods for continuous state and action environments, and function approximation methods, such as Gaussian processes and deep learning.
References
 [1] (2012) April: active preference learning-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 116–131.
 [2] (2019) Learning to learn with active adaptive perception. Neural Networks 115, pp. 30–49.
 [3] (2016) OpenAI Gym. arXiv preprint arXiv:1606.01540.
 [4] (2017) Double-task deep Q-learning with multiple views. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 1050–1058.
 [5] (2011) PILCO: a model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472.
 [6] (2000) Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, pp. 227–303.
 [7] (2008) Active reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 296–303.
 [8] (2016) Improving PILCO with Bayesian neural network dynamics models. In Data-Efficient Machine Learning workshop, ICML, Vol. 4, pp. 34.
 [9] (1966) The senses considered as perceptual systems. pp. 1–5.
 [10] (1998) Planning and acting in partially observable stochastic domains. Artificial Intelligence 101 (1–2), pp. 99–134.
 [11] (2016) Active reinforcement learning: observing rewards at a cost. In Future of Interactive Learning Machines, NIPS Workshop.
 [12] (2016) Optimal control with learned local models: application to dexterous manipulation. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 378–383.
 [13] (2019) Multi-view reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1418–1429.
 [14] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
 [15] (2007) Approximate dynamic programming: solving the curses of dimensionality. Vol. 703, John Wiley & Sons.
 [16] (2018) Active reinforcement learning with Monte-Carlo tree search. arXiv preprint arXiv:1803.04926.
 [17] (2012) Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, pp. 1–114.
 [18] (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Machine Learning Proceedings 1990, pp. 216–224.
 [19] (1992) Q-learning. Machine Learning 8 (3–4), pp. 279–292.