1 Balancing the Needs of Many Learners
Learning about many things can provide numerous benefits to a reinforcement learning system. Adding many auxiliary losses to a deep learning system can act as a regularizer on the representation, ultimately resulting in better final performance in reward maximization problems, as demonstrated with Unreal [Jaderberg et al., 2016]. A collection of value functions encoding goal-directed behaviours can be combined to generate new policies that generalize to goals unseen during training [Schaul et al., 2015]. Learning in hierarchical robot-control problems can be improved with persistent exploration, provided call-return execution of a collection of subgoal policies or skills [Riedmiller et al., 2018], even if those policies are imperfectly learned. In all these examples, a collection of general value functions is updated from a single stream of experience. The question we tackle in this paper is how to sculpt that stream of experience, that is, how to adapt the system’s behaviour, to optimize the learning of a collection of value functions.
One answer is to simply maximize the environmental reward. This was the approach explored with Unreal and resulted in significant performance improvements in challenging visual navigation problems. However, it is not hard to imagine situations where this approach would be limited. In general, the reward may be delayed and sparse; what should the agent do in the absence of external motivations? Learning reusable knowledge such as skills [Sutton et al., 1999] or a model of the world might result in more long-term reward. Such auxiliary learning objectives could emerge automatically during learning [Silver et al., 2017]. Most agent architectures, however, include explicit skill and model learning components. It seems natural that progress towards these auxiliary learning objectives could positively influence the agent’s behaviour, resulting in improved learning overall.
Learning many value functions off-policy from a shared stream of experience, with function approximation and an unknown environment, provides a natural setting to investigate no-reward intrinsically motivated learning. The basic idea is simple. The aim is to accurately estimate many value functions, each with an independent learner; there is no external reward signal. Directly optimizing the data collection for all learners jointly is difficult because we cannot directly measure this total learning objective and actions have an indirect impact on learning efficiency. There is a large related literature in active learning [Cohn et al., 1996, Balcan et al., 2009, Settles, 2009, Golovin and Krause, 2011, Konyushkova et al., 2017] and active perception [Bajcsy et al., 2018], from which to draw inspiration for a solution but which do not directly apply to this problem. In active learning the agent must sub-select from a larger set of items, to choose which points to label. Active perception is a subfield of vision and robotics. Much of the work in active perception has focused on specific settings (namely visual attention [Bylinskii et al., 2015], localization in robotics [Patten et al., 2018] and sensor selection [Satsangi et al., 2018]) or assumes knowledge of the dynamics (see [Bajcsy et al., 2018, Section 5]).
An alternative strategy is to formulate our task as a reinforcement learning problem. We can use an internal or intrinsic reward that approximates the total learning across all learners. The behaviour can be adapted to maximize the intrinsic reward by choosing actions in each state that maximize the total learning of the system. The choice of intrinsic rewards can have a significant impact on the sample efficiency of such intrinsically motivated learning systems. To the best of our knowledge, this paper provides the first formulation of parallel value function learning as a reinforcement learning task. Fortunately, there are many ideas from related areas that can inform our choice of intrinsic rewards.
Rewards computed from internal statistics about the learning process have been explored in many contexts over the years. Intrinsic rewards have been shown to induce behaviour that resembles the developmental stages exhibited by young humans and animals [Barto, 2013, Chentanez et al., 2005, Oudeyer et al., 2007, Lopes et al., 2012, Haber et al., 2018]. Internal measures of learning have been used to improve skill or option learning [Chentanez et al., 2005, Schembri et al., 2007, Barto and Simsek, 2005, Santucci et al., 2013, Vigorito, 2016], and model learning [Schmidhuber, 1991b, Schmidhuber, 2008]. Most recent work has investigated using intrinsic reward as a bonus to encourage additional exploration in single-task learning [Itti and Baldi, 2006, Stadie et al., 2015, Bellemare et al., 2016, Pathak et al., 2017, Hester and Stone, 2017, Tang et al., 2017, Andrychowicz et al., 2017, Achiam and Sastry, 2017, Martin et al., 2017, Colas et al., 2018].
It remains unclear, however, which of these measures of learning would work best in our no-reward setting. Most prior work has focused on providing demonstrations of the utility of particular intrinsic reward mechanisms. One study focused on a suite of large-scale control domains with a single scalar external reward [Burda et al., 2018], comparing different instantiations of a learning system that use an intrinsic reward based on model error as an exploration bonus. To the best of our knowledge there has never been a broad empirical comparison of intrinsic rewards.
A computational study of intrinsic rewards is certainly needed, but tackling this problem with function approximation and off-policy updating is not the place to start. Estimating multiple value functions in parallel requires off-policy algorithms because each value function is conditioned on a policy that is different than the exploratory behaviour used to select actions. In problems of moderate complexity, these off-policy updates can introduce significant technical challenges. Popular off-policy algorithms like Q-learning and V-trace can diverge with function approximation [Sutton and Barto, 2018]. Sound off-policy algorithms exist, but require tuning additional parameters and are relatively understudied in practice. Even in tabular problems, good performance requires tuning the parameters of each component of the learning system, a complication that escalates with the number of value functions. Finally, the agent must solve the primary exploration problem in order to make use of intrinsic rewards. Finding states with high intrinsic reward may not be easy, even if we assume the intrinsic reward is reliable and informative. To avoid these many confounding factors, the right place to start is in a simpler setting.
In this paper, we investigate and compare different intrinsic reward mechanisms in a new bandit-like parallel learning testbed. The testbed consists of a single state and multiple actions. Each action is associated with an independent scalar target to be estimated by an independent prediction learner. A successful behaviour policy will focus on actions that generate the most learning across the prediction learners. However, like auxiliary task learning systems, the overall task is partially observable, and learning is never done. The targets change without an explicit notification to the agent, and the task continually changes due to changes in action selection and learning of the individual prediction learners. Different configurations of the target distributions can simulate unlearnable targets, nonstationary targets, and easy-to-predict targets. Our new testbed provides a simple instantiation of a problem where introspective learners should help achieve low overall error. An introspective prediction learner is one that can autonomously increase its rate of learning when progress is possible, and decrease learning when progress is not, or cannot be, made.
Our second contribution is a comprehensive empirical comparison of different intrinsic reward mechanisms, including well-known ideas from reinforcement learning and active learning. Our computational study of learning progress highlighted a simple principle: intrinsic rewards based on the amount of learning (e.g., change in weights) can generate useful behaviour if each individual learner is introspective. Otherwise, without introspective learners, no intrinsic reward was able to consistently produce useful behaviours. We conclude with a discussion about how the ideas of introspective learners and intrinsic rewards based on the change in weights are equally applicable in our one-state prediction problem and large-scale problems where off-policy learning and function approximation are required.
2 Problem Formulation
In this section we formalize a testbed for comparing intrinsic rewards using a stateless prediction task and independent learners. This formalism is meant to simplify the study of balancing the needs of many learners and so facilitate comprehensive comparisons.
We formalize our multiple-prediction learning setting as a collection of independent, online supervised learning tasks. On each discrete time step $t$, the agent selects one task $i \in \{1, \ldots, N\}$, causing a target signal $C_{t,i}$ to be sampled from an (unknown) target distribution, $C_{t,i} \sim p_{t,i}$, where $C_{t,i}$ denotes the random variable with distribution $p_{t,i}$. This distribution is indexed by time to reflect that it can change on each time step; this enables a wide range of different target distributions to be considered, to model this nonstationary, multi-prediction learning setting. We provide the definition we use in this work later in this section, in Equation (3).
Associated with each prediction task is a simple prediction learner $i$ that maintains a real-valued vector of weights $w_{t,i}$, to produce an estimate, $\hat{c}_{t,i}$, of the expected value of the target, $\mathbb{E}[C_{t,i}]$. On a step where task $i$ is selected, $w_{t,i}$ could be updated using any standard learning algorithm. In this work, we use a 1-dimensional weight vector, and so the update is a simple delta rule:
$$w_{t+1,i} = w_{t,i} + \alpha_{t,i}\,\delta_{t,i} \tag{1}$$
where $\alpha_{t,i}$ is a scalar learning rate and $\delta_{t,i} = C_{t,i} - \hat{c}_{t,i}$ is the prediction error of prediction learner $i$ on step $t$. On a step where task $i$ is not selected, $w_{t,i}$ is not updated, implicitly setting $\delta_{t,i}$ to zero.
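The delta-rule update in Equation (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `PredictionLearner`, the zero initial weight, and the step size of 0.5 are hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class PredictionLearner:
    """Scalar delta-rule (LMS) learner for a single prediction task."""

    w: float = 0.0      # 1-dimensional weight; it doubles as the estimate
    alpha: float = 0.1  # step size (illustrative default, not from the paper)

    def predict(self) -> float:
        return self.w

    def update(self, target: float) -> float:
        """Apply w <- w + alpha * delta and return the prediction error delta."""
        delta = target - self.predict()
        self.w += self.alpha * delta
        return delta


# Three updates towards a constant target of 1.0 with alpha = 0.5.
learner = PredictionLearner(alpha=0.5)
for c in [1.0, 1.0, 1.0]:
    learner.update(c)
print(learner.w)  # 0.875: the estimate halves its distance to the target each step
```

A learner whose task is not selected on a step simply skips `update`, which leaves its weight unchanged.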
The primary learning objective is to minimize the Mean Squared Error up to time $t$ for all of the learners:
$$\text{MSE}(t) = \frac{1}{t} \sum_{k=1}^{t} \frac{1}{N} \sum_{i=1}^{N} \left(\hat{c}_{k,i} - \mathbb{E}[C_{k,i}]\right)^2 \tag{2}$$
The behaviour does not get to observe this error, both because it only observes one target $C_{t,i}$ on each step, and because that target is a noisy sample of the true expected value $\mathbb{E}[C_{t,i}]$. It can nonetheless attempt to minimize this unobserved error.
In order to minimize Equation (2), we must devise a way to choose which prediction task to sample. This can be naturally formulated as a sequential decision-making problem, where on each time step $t$, the agent chooses an action $i$, resulting in a new sample of $C_{t,i}$, and an update to $w_{t,i}$. In order to learn a preference over actions we associate a reward $R_t$ with each action selection, and thus with each prediction task. We investigate different intrinsic rewards. Given a suitable definition of reward, we can make use of any bandit learning algorithm suitable for nonstationary problems.
In this work we use a Gradient Bandit agent^1 [Sutton and Barto, 2018], which attempts to maximize the expected average reward by modifying a vector of action preferences $h_t \in \mathbb{R}^N$ based on the difference between the reward and average reward baseline:
$$h_{t+1,i} = \begin{cases} h_{t,i} + \beta\,(R_t - \bar{r}_t)\,(1 - \pi_t(i)) & \text{if action } i \text{ was selected on step } t \\ h_{t,i} - \beta\,(R_t - \bar{r}_t)\,\pi_t(i) & \text{otherwise} \end{cases}$$
where $\beta$ is the step size of the bandit, $\bar{r}_t$ is the average of all the rewards up to time $t$, maintained using an exponential average, and $h_0$ and $\bar{r}_0$ are both initialized to zero. Actions are selected probabilistically according to a softmax distribution which converts the preferences to probabilities (with a temperature of one):
$$\pi_t(i) = \frac{\exp(h_{t,i})}{\sum_{j=1}^{N} \exp(h_{t,j})}$$
^1 Our framework differs from the usual multi-armed bandit setting in at least two ways: (1) the distributions of the targets are nonstationary, and (2) our objective is to minimize error across all learners, meaning the optimal strategy is to obtain a distribution over the actions, not to find the single best action. We used the Gradient Bandit algorithm because it is similar to policy gradient methods in reinforcement learning, reflecting the setting which we are ultimately interested in: learning the behaviour for a Horde of demons in Markov Decision Process problems with function approximation [Sutton et al., 2011]. The Gradient Bandit will sample all the actions infinitely often.^2
^2 In preliminary experiments with the Drifter-Distractor problem described in Section 3, we also compared the $\epsilon$-greedy bandit learning algorithm [Sutton and Barto, 2018, pp. 27–28]. The conclusions were the same as with the gradient bandit. However, the gradient bandit is better suited to nonstationary tasks [Sutton and Barto, 2018], which we focus on here.
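The behaviour learner described above can be sketched as follows. This is a hedged illustration of the textbook gradient bandit with a softmax policy and an exponentially averaged reward baseline; the class name `GradientBandit`, the parameter names `step_size` and `baseline_rate`, their values, and the toy reward used in the usage loop are all assumptions made for the example.

```python
import math
import random


class GradientBandit:
    """Gradient bandit with a softmax policy (temperature one) and reward baseline."""

    def __init__(self, n_actions, step_size=0.1, baseline_rate=0.1, seed=0):
        self.h = [0.0] * n_actions  # action preferences, initialised to zero
        self.r_bar = 0.0            # average-reward baseline, initialised to zero
        self.step_size = step_size
        self.baseline_rate = baseline_rate
        self.rng = random.Random(seed)

    def policy(self):
        """Softmax over preferences; the max is subtracted for numerical stability."""
        m = max(self.h)
        exps = [math.exp(x - m) for x in self.h]
        z = sum(exps)
        return [e / z for e in exps]

    def select(self):
        """Sample an action index according to the softmax probabilities."""
        return self.rng.choices(range(len(self.h)), weights=self.policy())[0]

    def update(self, action, reward):
        """Move preferences along the reward-minus-baseline signal."""
        pi = self.policy()
        advantage = reward - self.r_bar
        for j in range(len(self.h)):
            if j == action:
                self.h[j] += self.step_size * advantage * (1.0 - pi[j])
            else:
                self.h[j] -= self.step_size * advantage * pi[j]
        # Exponential average of the rewards seen so far.
        self.r_bar += self.baseline_rate * (reward - self.r_bar)


# Toy usage: action 0 always yields reward 1, the other actions yield 0.
bandit = GradientBandit(n_actions=3, seed=0)
for _ in range(2000):
    a = bandit.select()
    bandit.update(a, 1.0 if a == 0 else 0.0)
probs = bandit.policy()
```

In the setting of this paper the reward passed to `update` would be an intrinsic reward computed from the selected prediction learner rather than the toy reward used here. Because the policy is a softmax, every action keeps a nonzero probability of being selected.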
The targets for each prediction learner are intended to replicate the dynamics of targets that a parallel auxiliary task learning system might experience, such as sensor values of a robot. To simulate a range of interesting dynamics, we construct each $p_{t,i}$ as a Gaussian distribution with drifting mean:
$$C_{t,i} \sim \mathcal{N}(\mu_{t,i}, \sigma^2_{t,i}), \qquad \mu_{t+1,i} = \Pi\left(\mu_{t,i} + \epsilon_{t,i}\right), \quad \epsilon_{t,i} \sim \mathcal{N}(0, \xi^2_{t,i}) \tag{3}$$
where $\sigma^2_{t,i}$ controls sampling noise, $\xi^2_{t,i}$ controls the rate of drift and $\Pi$ projects the drifting mean $\mu_{t,i}$ back to a fixed range to keep it bounded. The variance and drift are indexed by $t$, because we explore settings where they change periodically. These changes are not communicated to the agent, and the individual LMS learners are prevented from storing explicit histories of the targets. The purpose of this choice was to simulate partial observability common in many large-scale systems (e.g., [Sutton et al., 2011, Modayil et al., 2014, Jaderberg et al., 2016, Silver et al., 2017]). Given our setup, both prediction learners and the behaviour learner would do well to treat their respective learning tasks as nonstationary and track rather than converge [Sutton et al., 2007]. Each sample is drawn as $C_{t,i} \sim \mathcal{N}(\mu_{t,i}, \sigma^2_{t,i})$; the mean $\mu_{t,i}$ is kept within the bounded range and is updated on each step regardless of which action is selected.
Our formalism is summarized in Figure 1.
3 Simulating Parallel Prediction Problems
We consider several prediction problems corresponding to different settings of $\sigma^2$ and $\xi^2$ to define the task distribution in Equation (3). We introduce three problems, with target data simulated from those problems shown in Figure 2.
The Drifter-Distractor problem has four targets, one for each action: (1) two (stationary) noisy targets as distractors, (2) a slowly drifting target and (3) a constant target, with $\sigma^2$ and $\xi^2$ for each of these types in Table 1.
The Switch Drifter-Distractor problem is similar to Drifter-Distractor except that, after 50,000 time steps, the associations between the actions and the target distributions are permuted as detailed in Table 1. To do well in this problem, the system must be able to respond to changes. In addition, in phase two of this problem, two targets exhibit the same drift characteristics; the agent should prefer both actions equally.
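The target process behind these problems, a Gaussian sample around a mean that performs a bounded random walk, can be simulated with a short sketch. The class name `DriftingTarget`, the projection bounds of -50 and 50, the seeds, and the three parameter settings below are illustrative assumptions and are not the entries of Table 1.

```python
import math
import random


class DriftingTarget:
    """Gaussian target whose mean follows a bounded random walk."""

    def __init__(self, sigma2, xi2, mu=0.0, low=-50.0, high=50.0, seed=0):
        self.sigma = math.sqrt(sigma2)   # standard deviation of the sampling noise
        self.xi = math.sqrt(xi2)         # standard deviation of the drift
        self.mu = mu
        self.low, self.high = low, high  # assumed bounds for the projection
        self.rng = random.Random(seed)

    def step(self):
        """Drift the mean; called on every time step, whether or not sampled."""
        drifted = self.mu + self.rng.gauss(0.0, self.xi)
        self.mu = min(self.high, max(self.low, drifted))  # project onto [low, high]

    def sample(self):
        """Draw one noisy target around the current mean."""
        return self.rng.gauss(self.mu, self.sigma)


# Three illustrative target types: noisy distractor, slow drifter, constant.
targets = [
    DriftingTarget(sigma2=1.0, xi2=0.0, seed=1),
    DriftingTarget(sigma2=0.01, xi2=0.01, seed=2),
    DriftingTarget(sigma2=0.0, xi2=0.0, mu=3.0, seed=3),
]
for _ in range(1000):
    for target in targets:
        target.step()
```

Calling `step` on every target each time step, independently of the action chosen, mirrors the statement that the means drift regardless of which action is selected; only the selected target would then be sampled.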
Task | 1 | 2 | 3 | 4 | 5 | 7 and 8
$\sigma^2$ | 0.1 | 0.5 | 1.0 | 0.01 | 0.01 | 0.0
$\xi^2$ | 0.0 | 0.0 | 0.0 | 0.01 | 0.05 | 0.0
The Jumpy Eight-Action problem is designed to require sampling different prediction tasks with different frequencies. In this problem all the targets drift, but at different rates and with different amounts of sampling variance, as summarized in Table 2. The best approach is to select several actions probabilistically depending on their drift and sampling variance. We add an additional target type that drifts more dramatically over time, with periodic shifts in the mean:
(4) 
where indicator and switches signs if . The sample from a Bernoulli ensures the jumps are rare, but the large mean of the Gaussian makes it likely for this jump to be large when it occurs, as shown in Figure 2. This problem simulates a prediction problem where the target changes by a large magnitude in a semi-regular pattern, but then remains constant. This could occur due to changes in the world outside the prediction learner’s control and representational abilities. These large-magnitude jumps in the prediction target are also possible in off-policy learning settings where the agent’s behaviour changes, perhaps encountering a totally new part of the world.
4 Intrinsic Rewards for Multi-prediction Learning
Many learning systems draw inspiration from the exploratory behaviour of young humans and animals, uncertainty reduction in active learning, and information theory, and the resulting techniques could all be packed into the suitcase of curiosity and intrinsic motivation. In an attempt to distill the key ideas and perform a meaningful yet inclusive empirical study, we consider only methods applicable to our problem formulation of multi-prediction learning.^3 Although few approaches have been suggested for off-policy multi-task reinforcement learning (with [Chentanez et al., 2005, White et al., 2014] as notable exceptions), many existing approaches can be used to generate intrinsic rewards for multiple, independent prediction learners (see Barto’s excellent summary [Barto, 2013]). We first summarize methods we evaluate in our empirical study. The specific form of each intrinsic reward discussed below is given in Table 3, with italicized names below corresponding to the entries in the table. We conclude by mentioning several rewards we did not evaluate, and why.
^3 Automatic curriculum generation systems often assume that some tasks cannot be tackled until other easier tasks have been solved first. Curriculum has been most explored in the supervised multi-task learning case where the agent can switch between tasks at any moment [Graves et al., 2017, Matiisen et al., 2017]. Although we could use some ideas from curriculum learning in our bandit-like task, these approaches would not be compatible with more general sequential decision-making tasks with delayed outcomes (e.g., a robot), which represents our ultimate goal.
Several intrinsic rewards are based on violated expectations, or surprise. This notion can be formalized using the prediction error itself to compute the instantaneous Absolute Error or Squared Error. We can obtain a less noisy measure of violated expectations with a windowed average of the error, which we call Expected Error. Regardless of the specific form, if the error increases, then the intrinsic reward increases, encouraging further sampling for that target. Such errors can be normalized, as was done for Unexpected Demon Error [White et al., 2014], to mitigate the impact of noise in and magnitude of the targets.
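The error-based rewards described above can be illustrated with a small sketch. It is not a transcription of Table 3: the function and class names are hypothetical, and the smoothed error uses an exponentially weighted average with an assumed `decay` of 0.1 as a stand-in for the windowed average mentioned in the text.

```python
def absolute_error(delta):
    """Instantaneous absolute prediction error."""
    return abs(delta)


def squared_error(delta):
    """Instantaneous squared prediction error."""
    return delta * delta


class ExpectedError:
    """Magnitude of an exponentially weighted average of the signed error.

    The exponential form and the decay value are assumptions of this sketch.
    """

    def __init__(self, decay=0.1):
        self.decay = decay
        self.avg = 0.0

    def update(self, delta):
        self.avg = (1.0 - self.decay) * self.avg + self.decay * delta
        return abs(self.avg)


# Zero-mean noise: errors alternate in sign, so the averaged error stays small
# even though the absolute error is 1 on every step.
noisy_tracker = ExpectedError(decay=0.1)
noisy = [noisy_tracker.update(d) for d in [1.0, -1.0] * 50]

# Persistent error: the same sign every step, so the averaged error approaches 1.
steady_tracker = ExpectedError(decay=0.1)
steady = [steady_tracker.update(1.0) for _ in range(100)]
```

The contrast between `noisy` and `steady` shows why averaging the signed error before taking its magnitude gives a less noisy signal than the instantaneous errors: pure sampling noise largely cancels, while a systematic error does not.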
Another category of methods focuses on learning progress, and assumes that the learning system is capable of continually improving its policy or predictions. This is trivially true for approaches designed for tabular stationary problems [Chentanez et al., 2005, Still and Precup, 2012, Little and Sommer, 2013, Meuleau and Bourgine, 1999, Barto and Simsek, 2005, Szita and Lőrincz, 2008, Lopes et al., 2012]. The most well-known approaches for integrating intrinsic motivation make use of rewards based on improvements in (model) error, including Error Reduction [Schmidhuber, 1991b, Schmidhuber, 2008], and Oudeyer’s model Error Derivative approach [Oudeyer et al., 2007]. Improvement in the value function can also be used to construct rewards, and can be computed from the Positive Error Part [Schembri et al., 2007], or by tracking improvement in the value function over all states [Barto and Simsek, 2005]. As our experiments reveal, however, intrinsic rewards requiring improvement can lead to suboptimal behaviour in nonstationary tracking problems.
An alternative to learning progress or improvement is to reward amount of learning. This does not penalize errors becoming worse, and instead only measures that estimates are changing: the prediction learner is still adjusting its estimates and so is still learning. Bayesian Surprise [Itti and Baldi, 2006] formalizes the idea of amount of learning. For a Bayesian learner, which maintains a distribution over the weights, Bayesian Surprise corresponds to the KL-divergence between this distribution over parameters before and after the update. This KL-divergence measures how much the distribution over parameters has changed. Bayesian Surprise can be seen as a stochastic sample of Mutual Information, which is the expected KL-divergence between prior and posterior across possible observed targets. We discuss this more in Section 5. Other measures based on information gain have been explored [Still and Precup, 2012, Little and Sommer, 2013, Achiam and Sastry, 2017, de Abril and Kanai, 2018], though they have been found to perform similarly to Bayesian Surprise [Little and Sommer, 2013].
Though derived assuming stationarity, we can use Bayesian Surprise for our nonstationary setting by using exponential averages which prioritize recent data in sample averages for means and variances (see Table 3 for the formula). We can additionally consider non-Bayesian strategies for measuring amount of learning, including those based on change in error (Absolute Error Derivative), Variance of Prediction, Uncertainty Change (how much the variance in the prediction changes), and the Weight Change, which we discuss in more depth in the next section. Note that several learning progress measures can be modified to reflect amount of learning by taking the absolute value, and so removing the focus on increase rather than change (this must be done with care as we likely do not want to reward model predictions becoming worse, for example). We test such a modification to Oudeyer’s model derivative, which we refer to as Absolute Error Derivative.
[Table 3: Intrinsic reward definitions. Reward names listed include Weight Change*, Stepsize Change*, Absolute Error Derivative*, Variance of error, Variance of Prediction*, Uncertainty Reduction, and Uncertainty Change*. The table notes refer to an exponentially weighted average with a decay rate, a small constant, parameters that specify the length of the window and amount of overlap, and an estimate computed using an exponential average variant of Welford’s algorithm.]
There are several strategies which we omit, because they would (1) result in uniform exploration in our pure exploration problem, (2) require particular predictions about state to drive exploration, or (3) are based on statistics of the targets rather than the statistics generated by the prediction learners. Count-based approaches [Brafman and Tennenholtz, 2002, Bellemare et al., 2016, Sutton and Barto, 2018] are completely unsupervised, rewarding visits to under-sampled states or actions, resulting in uniform exploration in our problem. Though count-based approaches are sometimes used in learning systems, they reflect novelty rather than learning progress or surprise [Barto et al., 2013]. Many methods use a model to encourage exploration [Schmidhuber, 2008, Chentanez et al., 2005, Stadie et al., 2015, Pathak et al., 2017] such as by using Bayesian Surprise for next-state prediction [Houthooft et al., 2016]. Subgoal discovery systems [Kulkarni et al., 2016, Andrychowicz et al., 2017, Péré et al., 2018] define rewards to reach particular states. Empowerment and state control systems are explicitly designed to respect and use the fact that some tasks or regions of the state space cannot be well learned. Often such systems use only unsupervised signals relating to statistics of the exploration policy, ignoring the statistics generated by the learning process itself [Karl et al., 2017]. Like count-based approaches, unsupervised measures like this would induce uniform exploration in our stateless task. Finally, we do not test intrinsic rewards based only on targets, such as variance of the target. To see why, consider a behaviour that estimates the variance for a constant target, and quickly determines it only needs to select that action a few times. The prediction learner, however, could have a poor estimate of this target, and may need many more samples to converge to the true value. Separately estimating possible amount of learning from actual amount of learning has clear limitations.^4
^4 In the bandit setting, with a simple sample average learner, the variance of the prediction target provides a measure of uncertainty for the learned prediction [Audibert et al., 2007, Garivier and Moulines, 2011, Antos et al., 2008], and has been successfully applied in education applications [Liu et al., 2014, Clement et al., 2015]. When generalizing to other learners and problem settings, however, variance of the target will no longer obviously reflect uncertainty in the predictions. We therefore instead directly test intrinsic rewards that measure uncertainty in predictions.
5 Approximating Bayesian Surprise
One natural question given this variety of intrinsic rewards is whether there is an optimal approach. In some settings, there is in fact a clear answer. In a stationary, stateless problem where the goal is to estimate means of multiple targets, it has been shown that the agent should take actions proportional to the variance of each target to obtain minimal regret [Antos et al., 2008]. For a stationary setting, with state, an optimal approach would be to take actions to maximize information gain (the reduction in entropy after an update) across learners. We therefore use information gain as the criterion to measure optimal action selection. In this section, we describe how to maximize information gain in an ideal case, and approximation strategies otherwise. The goal of this section is to provide intuition and motivation, as we do not yet have theoretical claims about the approximation strategies. We hope instead for this discussion to help lead to such a formalization.
We first show that information gain is maximized when maximizing expected Bayesian surprise, assuming Bayesian learners. A Bayesian learner updates weights $w_t$ for a parameterized distribution $p_{w_t}$ on the parameters $\theta$ needed to make the prediction $\hat{c}_t$. The parameters can be seen as a random variable, $\Theta$, with distribution $p_{w_t}$. The goal is to narrow this distribution around the true parameters $\theta^*$ that generate $C$. After seeing each new sample, the posterior distribution over parameters is computed using the previous distribution $p_{w_t}$ and the new sample, $c$, using the update
$$p_{w_{t+1}}(\theta) = \frac{p(c \mid \theta)\, p_{w_t}(\theta)}{\int p(c \mid \theta')\, p_{w_t}(\theta')\, d\theta'}$$
A Bayesian learner is one that uses exact updates to obtain the posterior. We assume the prior $p_{w_0}$ is appropriately specified so that $p_{w_0}(\theta^*) > 0$, and so $p_{w_t}$ has nonzero support at $\theta^*$ as $t \to \infty$ almost surely for any stochastic sequence $c_1, c_2, \ldots$.
Bayesian surprise is defined as the KL divergence between the distribution over parameters before and after an update [Itti and Baldi, 2006]:
$$\mathrm{KL}\left(p_{w_{t+1}} \,\|\, p_{w_t}\right) = \int p_{w_{t+1}}(\theta) \log \frac{p_{w_{t+1}}(\theta)}{p_{w_t}(\theta)}\, d\theta \tag{5}$$
The Bayesian surprise is high when taking an action that produces a stochastic outcome that results in a large change between the prior and posterior distributions over parameters. The expectation of the KL-divergence over stochastic outcomes, with a Bayesian learner, corresponds to the Information Gain. This result is straightforward, but we explicitly show it in the following theorem to provide intuition. Notice that Information Gain defined in Equation (6) is relative to the model class of our learner, rather than some objective notion of information content.
Theorem 1
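Assume targets $c$ are distributed according to true parameters $\theta^*$, with density $p(\cdot \mid \theta^*)$. For a Bayesian learner that maintains distribution $p_{w_t}$ over parameters $\theta$,
$$\mathbb{E}\left[\mathrm{KL}\left(p_{w_{t+1}} \,\|\, p_{w_t}\right)\right] = I(C; \Theta) \tag{6}$$
where the expectation is over stochastic outcomes $C$. Note that $I(C; \Theta)$, the mutual information between $C$ and $\Theta$, is also called the Information Gain.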
Proof
The weights $w_{t+1}$ are dependent on the observed $c$. By definition, this integral gives an expected KL across possible observed $c$.
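For intuition, the standard identity behind this result can be written out as a short derivation. This is a sketch under stated assumptions rather than the authors' proof: it assumes the joint density factorizes as $p(c, \theta) = p(c \mid \theta)\, p_{w_t}(\theta)$, that the expectation is taken under the learner's own predictive density $p(c) = \int p(c \mid \theta)\, p_{w_t}(\theta)\, d\theta$, and that the exact Bayesian update gives $p(\theta \mid c) = p_{w_{t+1}}(\theta)$.

```latex
\begin{align*}
I(C; \Theta)
  &= \int\!\!\int p(c, \theta) \log \frac{p(c, \theta)}{p(c)\, p_{w_t}(\theta)} \, d\theta \, dc \\
  &= \int p(c) \int p(\theta \mid c) \log \frac{p(\theta \mid c)}{p_{w_t}(\theta)} \, d\theta \, dc \\
  &= \int p(c)\, \mathrm{KL}\left(p_{w_{t+1}} \,\|\, p_{w_t}\right) dc
   = \mathbb{E}_{C}\left[\mathrm{KL}\left(p_{w_{t+1}} \,\|\, p_{w_t}\right)\right]
\end{align*}
```

The second line uses $p(c, \theta) = p(c)\, p(\theta \mid c)$, and the third replaces the posterior $p(\theta \mid c)$ by $p_{w_{t+1}}$, which depends on the observed $c$ through the updated weights.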
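To make this more concrete, consider Bayesian surprise for a Bayesian learner with a simple Gaussian distribution over parameters. For our simplified problem setting, the weights for the Bayesian learner are $w_t = (\mu_t, \sigma_t^2)$ for the Gaussian distribution over the parameters $\theta$, which in this case is the current estimate of the mean of the target, $\hat{c}_t$. The Bayesian surprise is
$$\mathrm{KL}\left(p_{w_{t+1}} \,\|\, p_{w_t}\right) = \frac{1}{2}\left[\log \frac{\sigma_t^2}{\sigma_{t+1}^2} + \frac{\sigma_{t+1}^2 + (\mu_{t+1} - \mu_t)^2}{\sigma_t^2} - 1\right]$$
We can make this even simpler if we consider the variance to be fixed, rather than learned. The Bayesian surprise then simplifies to
$$\mathrm{KL}\left(p_{w_{t+1}} \,\|\, p_{w_t}\right) = \frac{(\mu_{t+1} - \mu_t)^2}{2\sigma^2} \tag{7}$$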
This value is maximized when the squared change in weights is maximal. Therefore, though Bayesian surprise in general may be expensive to compute, for some settings it is as straightforward as measuring the change in weights.
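The fixed-variance simplification can be checked numerically. The sketch below uses the closed-form KL divergence between two scalar Gaussians, ordered as posterior relative to prior, and compares it with the squared change in the mean divided by twice the variance; the function names and the numbers are illustrative assumptions.

```python
import math


def gaussian_kl(mu_post, var_post, mu_prior, var_prior):
    """KL( N(mu_post, var_post) || N(mu_prior, var_prior) ) for scalar Gaussians."""
    return 0.5 * (
        math.log(var_prior / var_post)
        + (var_post + (mu_post - mu_prior) ** 2) / var_prior
        - 1.0
    )


def fixed_variance_surprise(w_before, w_after, var):
    """Bayesian surprise when the variance is fixed: squared weight change / (2 var)."""
    return (w_after - w_before) ** 2 / (2.0 * var)


# With equal variances, the general formula collapses to the simple one.
general = gaussian_kl(mu_post=1.5, var_post=2.0, mu_prior=1.0, var_prior=2.0)
simple = fixed_variance_surprise(w_before=1.0, w_after=1.5, var=2.0)
print(general, simple)  # both 0.0625
```

When the variance is fixed, the log term vanishes and the remaining terms reduce to the squared change in the mean scaled by the variance, so ranking actions by this surprise is the same as ranking them by squared weight change.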
Additionally, we can also consider approximations to Bayesian surprise for non-Bayesian learners. A non-Bayesian learner typically estimates the parameters directly, such as by maximizing likelihood or taking the maximum a posteriori (MAP) estimate
$$\theta_t = \arg\max_{\theta}\; p(\theta \mid c_1, \ldots, c_t)$$
Now instead of maintaining the full posterior as $p_{w_t}$, the agent need only learn $\theta_t$ directly. Because $\theta_t$ is the mode of the posterior, for many distributions $\theta_t$ will actually equal a component of $w_t$. For the Gaussian example above with a learned variance, $\theta_t$ equals the first component of $w_t$, the mean $\mu_t$. For a fixed variance, $\theta_t$ exactly equals $w_t$. Therefore, the non-Bayesian learner would have the exact same information gain, measured by the Bayesian surprise in (7).
This direct connection, for Bayesian and non-Bayesian learners, only exists for a limited set of distributions. One such class is the natural exponential family distribution over the parameters. Examples include the Gaussian with a fixed variance and a mean parameter, and the Gamma distribution with a fixed shape parameter and a scale parameter. Each natural exponential family has the property that the KL-divergence between two distributions with parameters $\theta_1$ and $\theta_2$ corresponds to a (Bregman) divergence directly on the parameters [Banerjee et al., 2005]. For a Gaussian, this divergence is the squared error normalized by the variance, as above in Equation (7). Another distribution that has this connection is a Laplace distribution with a mean parameter and fixed variance. Then the KL-divergence is a function of the absolute difference between the two means. Note that this connection is limited to certain posterior distributions, but is true for general problem settings, even the general reinforcement learning setting. The distributions before and after an update, $p_{w_t}$ and $p_{w_{t+1}}$ respectively, are over the parameters of the prediction learner. These parameters are more complex in settings with state (such as parameters to a neural network), but we can nonetheless consider exponential family distributions on those parameters.
This discussion motivates a simple proposal to approximate Bayesian surprise and Bayesian learners for a general setting with non-Bayesian learners: using weight change with introspective learners. An introspective learner is not a precise definition, but rather a scale. A perfectly introspective learner would be a Bayesian learner, or in some cases the equivalent non-Bayesian MAP learner. A perfectly non-introspective learner could be a random update. The more closely the learner approximates the weights of the perfectly introspective learner, the better its solution and the better the Bayesian surprise reflects the Information Gain. Further, because the underlying distribution may not be known, we use the change in weights as an approximation.
For concreteness, consider the following system. Each prediction learner is augmented with a procedure to automatically adapt the step size parameter, based on the errors produced over time. In this paper we use the Autostep algorithm [Mahmood et al., 2012]. The Autostep algorithm will automatically reduce the step size towards zero if the target is unlearnable, increase it when successive errors have the same sign, and leave it unchanged if the error is zero. A learner with a fixed step size could be considered a non-introspective learner, because the learner will forever chase the noise. The weight change for such a learner would not at all be reflective of Information Gain, reflecting instead only the inadequacy of the learner. The Autostep learner, on the other hand, like a Bayesian or MAP learner, will stop learning once new samples provide no new information.
This proposal reflects the following philosophy: there should be an explicit separation between the role of the behaviour agent (to balance data generation amongst parallel prediction learners) and the role of the prediction learners (to learn). If the agent trusts that the prediction learners are using the data appropriately, this suggests simpler intrinsic rewards based solely on the prediction learner’s parameters, such as the change in the weights. The alternative is to assume that the intrinsic rewards must be computed to overcome poor learning. This approach would require the agent to recognize when a prediction learner is non-introspective, and to stop generating rewards to encourage exploration for that learner. If the agent can measure this, though, then presumably so too can the prediction learner; they are after all part of the same learning system. The learner should then be able to use the same measure to adjust its own learning, and avoid large Bayesian surprise simply from ineffective updates to weights.
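In this work, we define the change in weights using the norm,
(8) $\text{Weight Change} = \lVert w_{t+1} - w_t \rVert$
It follows from Equation (1) that weight change is simply Absolute Error scaled by the step size, emphasizing the role that learner capability plays in ensuring an effective reward.
(9) $\lVert w_{t+1} - w_t \rVert = \alpha_t \, \lvert \delta_t \rvert$
where $w_t$ denotes the weights of a prediction learner at time $t$, $\alpha_t$ its step size and $\delta_t$ its prediction error.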
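The relationship between weight change, step size and absolute error can be checked numerically. The sketch below assumes that Equation (1) is the usual delta-rule (LMS) update of a single scalar weight; the function names are hypothetical.

```python
import math


def lms_update(w, target, step_size):
    """One delta-rule update of a scalar prediction; returns (new_w, error)."""
    delta = target - w
    return w + step_size * delta, delta


def weight_change(w_old, w_new):
    """Absolute change in a scalar weight."""
    return abs(w_new - w_old)


w_old, step_size = 0.25, 0.5
w_new, delta = lms_update(w_old, target=1.0, step_size=step_size)
reward = weight_change(w_old, w_new)

# The reward equals the step size times the absolute error.
assert math.isclose(reward, step_size * abs(delta))
```

Under this assumption, a step size of zero silences the reward regardless of the error, which is why the reward is only as informative as the learner's step-size adaptation.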
Remark: The above discussion applies to the non-stationary setting, by treating the non-stationarity as partial observability. We can assume that the world is stationary, driven by some hidden state, but that it appears non-stationary to the agent because it only observes partial information. If a Bayesian agent had the correct model class, it could still maximize information gain. For example, the agent could know there is a hidden parameter defining the rate of drift for the mean of the target distribution. It could then maintain a posterior over both this hidden parameter and the mean and covariance of the target distribution, based on observed data. As above, it would be unlikely for the agent to have this true model class, and a prediction learner would likely only be an approximation. It remains an important open theoretical question how such approximations influence the agent’s ability to maximize information gain.
6 Experiments
We conducted six experiments, two in each of the three problems described in Section 3, where the prediction learners used either a single constant, global step size (giving non-introspective learners) or step-size adaptation with a step size per prediction learner (giving introspective learners). The goal of the experiments is to understand the behaviour of different intrinsic rewards in different problems, with different prediction learners (introspective and non-introspective). Our hypothesis is that non-introspective prediction learners render intrinsic rewards based on amount of learning ineffective and that, with introspective learners, simple rewards like Weight Change can enable the behaviour to effectively balance the needs of all prediction learners.
In these problems, these two learners correspond to non-introspective and introspective learners, because a vector of adaptive step sizes is critical for these changing, heterogeneous tasks. A constant global step size cannot balance the need to track the drifting targets against the need to learn slowly on the high-variance targets. Ideally we could tune the step size for each prediction learner individually to mitigate this issue; however, that approach does not scale to the settings we ultimately care about [Sutton et al., 2011, Modayil et al., 2014, Jaderberg et al., 2016]. If the step size is too large for the high-variance target, then the prediction learner will continually make large updates due to the sampling variance, never converging to low error. If the step size is too small for the tracking target, then the prediction learner’s estimate will often lag, causing high error. The step-size adaptation algorithm Autostep [Mahmood et al., 2012] provides such a vector of adaptive step sizes.^5
^5 We experimented with other step-size adaptation methods, like RMSProp, but the results were qualitatively the same. Quantitatively, the results with Autostep were better than those with RMSProp. Autostep is a meta-descent method designed for incremental online learning problems like ours, which may explain this difference.
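The tension between tracking and averaging with a single fixed step size can be reproduced in a few lines. The sketch below is a toy illustration, not one of the problems from Section 3: the drifting target, the drift rate, and the alternating target (a deterministic stand-in for zero-mean noise) are all assumptions chosen so that the outcome is easy to verify.

```python
def run_fixed_step(targets, step_size, w0=0.0):
    """Run a fixed-step-size delta-rule learner.

    Returns the absolute error and absolute weight change on the final step.
    """
    w = w0
    error = update = 0.0
    for x in targets:
        delta = x - w
        error = abs(delta)
        update = abs(step_size * delta)
        w += step_size * delta
    return error, update


def drifting_target(n, rate=0.01):
    """Target that moves by `rate` every step."""
    return [rate * t for t in range(n)]


def alternating_target(n):
    """Zero-mean target flipping between +1 and -1 (stand-in for noise)."""
    return [1.0 if t % 2 == 0 else -1.0 for t in range(n)]


for step_size in (0.01, 0.5):
    lag, _ = run_fixed_step(drifting_target(2000), step_size)
    _, churn = run_fixed_step(alternating_target(2000), step_size)
    print(f"step size {step_size}: lag on drifter {lag:.3f}, "
          f"update size on noise {churn:.3f}")
```

In this toy setting the lag on the drifting target settles at the drift rate divided by the step size, while the update size on the alternating target settles at 2α/(2 − α) for step size α. The small step size therefore lags badly on the drifter, and the large one keeps making large updates on the noise; no single value does well on both.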
The performance difference between these prediction learners is stark, with Autostep significantly improving tracking, enabling different rates for different prediction learners and reducing the step sizes on unlearnable or noisy targets once learning is complete.
In all our experiments, an extensive parameter search was conducted over the parameters of the agent (Gradient Bandit), the prediction learners, and the reward functions. The best performing parameter combination according to RMSE (see Equation 2) for each was used to generate the results. In each of our three experiments, 13,800 parameter combinations were tested and the results were averaged over 200 independent runs. The parameters swept were: agent learning rate, average reward parameter, learning rate for non-introspective learners, initial learning rate of introspective learners, meta learning rate for Autostep, and the reward-function parameters from Table 3. For clarity of presentation, we include a large selection of approaches advocated in the literature, listed in Table 3, but omit results for several poorly performing reward functions.
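The behaviour agent in these experiments is a gradient bandit, tuned through an agent learning rate and an average reward parameter. A minimal sketch of such an agent, following the textbook formulation [Sutton and Barto, 2018] rather than the code used for the experiments, could look as follows. The class name, parameter names and default values are assumptions made for illustration.

```python
import math
import random


class GradientBandit:
    """Softmax policy over action preferences with an average-reward baseline."""

    def __init__(self, n_actions, step_size=0.1, avg_reward_step=0.01, seed=0):
        self.prefs = [0.0] * n_actions          # action preferences
        self.avg_reward = 0.0                   # running baseline
        self.step_size = step_size              # agent learning rate
        self.avg_reward_step = avg_reward_step  # average reward parameter
        self.rng = random.Random(seed)

    def policy(self):
        """Softmax over preferences, shifted by the max for numerical stability."""
        top = max(self.prefs)
        exps = [math.exp(p - top) for p in self.prefs]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self):
        """Sample an action index from the current policy."""
        return self.rng.choices(range(len(self.prefs)), weights=self.policy())[0]

    def update(self, action, reward):
        """Move preferences along the reward gradient, then update the baseline."""
        pi = self.policy()
        advantage = reward - self.avg_reward
        for a in range(len(self.prefs)):
            indicator = 1.0 if a == action else 0.0
            self.prefs[a] += self.step_size * advantage * (indicator - pi[a])
        self.avg_reward += self.avg_reward_step * (reward - self.avg_reward)
```

In the setting of this paper, `reward` would be the intrinsic reward produced by the prediction learner associated with the chosen action, so actions whose learners keep producing above-average reward gain probability.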
In the Drifter-Distractor problem, we focus on qualitative understanding of step sizes over time, shown in Figures 3 (non-introspective learner) and 4 (introspective learner). There are two key conclusions from this experiment. First, introspective learners were critical for intrinsic rewards based on amount of learning, particularly Weight Change and Bayesian Surprise. Without Autostep, both Weight Change and Bayesian Surprise incorrectly cause the agent to prefer the two high-variance targets, because their targets continually generate changes to the prediction. With Autostep, however, the weights can converge for the constant and high-variance targets, and both agents correctly prefer the drifting target. Second, measures based on violated expectations (Unexpected Demon Error, Squared Error and Expected Error) or learning progress (Error Reduction) induce either nearly uniform selection or a focus on noisy targets, with or without Autostep.
We ran a similar experiment, called the Switch Drifter-Distractor, where targets unexpectedly change after 50,000 steps (see Figures 5 and 6). The conclusions are similar. One notable point is that the Variance of Prediction reacts surprisingly well to the change, as well as Weight Change does. Another notable point is that Bayesian Surprise emphasizes differences before and after updates more than Weight Change does, because both the means (weights) and the variances are different. Consequently, after the sudden change, when the constant target becomes a drifting target, the magnitude of intrinsic rewards for that action dominates that of the other action that changed to a drifting target. Therefore, even though the agent samples the other action several times and sees that the error is high, it nonetheless erroneously settles on a single action.
Our final experiment, in the Jumpy Eight-Action problem, quantitatively compares the best performing intrinsic rewards in a setting where the agent should prefer several different actions. To achieve good performance, the agent must continually sample three actions, with different probabilities, and ignore three noise targets and two constant targets. We only report results with Autostep, based on the conclusions from the first two experiments. We report RMSE in Figure 8 and highlight the action-selection probabilities for Weight Change, Bayesian Surprise, Variance of Prediction, Expected Error and Change in Uncertainty in Figure 7. These five were the most promising intrinsic rewards from the first two experiments. The agent based on Weight Change achieves the lowest overall RMSE, because its rewards cause the agent to prefer the actions associated with the jumpy target and the two drifting targets to different degrees. Variance of Prediction causes the agent to select only the action corresponding to the jumpy target, thus suffering higher error on the two undersampled drifting targets. Reward based on Bayesian Surprise does not induce a preference between one of the drifting targets and the jumpy target, causing higher RMSE. Expected Error and Change in Uncertainty rewards initially overvalue the jumpy target, slightly degrading performance. Expected Error incorrectly induces a preference for a high-variance target, resulting in a bump in the learning curve near the end of learning.
7 Adapting the Behaviour of a Horde of Demons
The ideas and algorithms promoted in this paper may be even more impactful when combined with policy-contingent, temporally-extended prediction learning. Imagine learning hundreds or thousands of off-policy predictions from a single stream of experience, as in the Unreal [Jaderberg et al., 2016] and Horde [Sutton et al., 2011] architectures. In these settings, the agent’s behaviour must balance overall speed of learning with prediction accuracy. That is, it must balance action choices that generate off-policy updates across many predictions with the need to occasionally choose actions in almost total agreement with one particular policy. In general we cannot assume that each prediction target is independent, as we have done in this paper; selecting a particular sequence of actions might generate useful off-policy updates to several predictions in parallel [White et al., 2012]. There have been several promising investigations of how intrinsic rewards might benefit single (albeit complex) task learning (see [Pathak et al., 2017, Hester and Stone, 2017, Tang et al., 2017, Colas et al., 2018]). However, to the best of our knowledge, no existing work has studied adapting the behaviour based on intrinsic rewards of a model-based or otherwise parallel off-policy learning system.
It seems clear that simple intrinsic reward schemes and the concept of an introspective agent should scale nicely to these more ambitious problem settings. We could swap our stateless LMS learners for Q-learning with experience replay, or gradient temporal difference learning [Maei et al., 2010]. The weight-change reward could be computed for each predictor with computation linear in the number of weights. It would be natural to learn the behaviour policy with an average-reward actor-critic architecture, instead of the gradient bandit algorithm used here. Finally, the notion of an introspective learner still simply requires that each prediction learner can adapt its learning rate. This can be achieved with quasi-second-order methods like Adam [Kingma and Ba, 2015], or extensions of the Autostep algorithm to the case of temporal difference learning and function approximation [Kearney et al., 2019, Jacobsen et al., 2019]. It is not possible to know whether the ideas advocated in this paper will work well in a large-scale off-policy prediction learning architecture like Horde; however, they will certainly scale up.
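The linear cost of the weight-change reward is easy to see for a predictor with a weight vector. The sketch below assumes a linear LMS predictor and an ℓ1 norm for the weight change; both are illustrative choices rather than the temporal difference learners discussed above, and the function names are hypothetical.

```python
def linear_lms_update(weights, features, target, step_size):
    """One delta-rule update of a linear predictor; returns (new_weights, error)."""
    prediction = sum(w * x for w, x in zip(weights, features))
    delta = target - prediction
    new_weights = [w + step_size * delta * x for w, x in zip(weights, features)]
    return new_weights, delta


def weight_change_l1(old_weights, new_weights):
    """Sum of absolute per-weight changes: a single pass over the weights."""
    return sum(abs(n - o) for o, n in zip(old_weights, new_weights))


old = [0.0, 0.0, 0.0]
features = [1.0, 0.0, 2.0]
new, delta = linear_lms_update(old, features, target=1.0, step_size=0.25)
reward = weight_change_l1(old, new)

# For this update the reward is step_size * |delta| * sum(|x_i|).
assert abs(reward - 0.25 * abs(delta) * sum(abs(x) for x in features)) < 1e-12
```

The reward needs only one pass over the weights, and under these assumptions it remains the absolute error scaled by the step size, now also scaled by the magnitude of the active features.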
Maximizing intrinsic reward as presented in this paper is not a form of exploration; it is a mechanism for defining good behaviour. In our stateless prediction task, sufficient exploration was provided by the stochastic behaviour policy. The stochasticity of the policy combined with the intrinsic reward allowed the agent to discover good policies. In the switched task, the behaviour was able to adapt to an abrupt and unanticipated change to the target distributions. In this case, Autostep did not decay the step-size parameters too low, ensuring the policy occasionally sampled all the actions. This will not always be the case, and additional exploration will likely be needed. The objective of this paper was to define good behaviours for multi-prediction learning through the lens of intrinsic reward and internal measures of learning. Efficient exploration is an open problem in reinforcement learning. Combining the ideas advocated in this paper with exploration bonuses or planning could work well, but this topic is left to future work.
8 Conclusion
The goal of this work was to systematically investigate intrinsic rewards for a multi-prediction setting. This paper has three main contributions. The first is a new benchmark for comparing intrinsic rewards. Our bandit-like task requires the agent to demonstrate several important capabilities: avoiding dawdling on noisy outcomes, tracking non-stationary outcomes, and seeking actions for which consistent learning progress is possible. Second, we provided a survey of intrinsically motivated learning systems, and compared 15 different analogs of well-known intrinsic reward schemes. Finally, we demonstrated how one of the simplest intrinsic rewards, weight change, outperformed all other rewards tested when combined with step-size adaptation. We motivated this approach based on connections to Bayesian surprise with Bayesian learners. These introspective prediction learners can decide for themselves when learning is done, and therefore the change in the weights of an introspective learner provides a clear reward to drive behaviour.
References
 [Achiam and Sastry, 2017] Achiam, J. and Sastry, S. (2017). Surprise-based intrinsic motivation for deep reinforcement learning. arXiv:1703.01732.
 [Andrychowicz et al., 2017] Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems.
 [Antos et al., 2008] Antos, A., Grover, V., and Szepesvári, C. (2008). Active learning in multi-armed bandits. In International Conference on Algorithmic Learning Theory.
 [Audibert et al., 2007] Audibert, J.-Y., Munos, R., and Szepesvári, C. (2007). Variance estimates and exploration function in multi-armed bandit. In CERTIS Research Report 07–31.
 [Bajcsy et al., 2018] Bajcsy, R., Aloimonos, Y., and Tsotsos, J. K. (2018). Revisiting active perception. Autonomous Robots.
 [Balcan et al., 2009] Balcan, M.-F., Beygelzimer, A., and Langford, J. (2009). Agnostic active learning. Journal of Computer and System Sciences.
 [Banerjee et al., 2005] Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research.
 [Barto et al., 2013] Barto, A., Mirolli, M., and Baldassarre, G. (2013). Novelty or surprise? Frontiers in Psychology, 4:907.
 [Barto, 2013] Barto, A. G. (2013). Intrinsic motivation and reinforcement learning. In Intrinsically Motivated Learning in Natural and Artificial Systems.
 [Barto and Simsek, 2005] Barto, A. G. and Simsek, O. (2005). Intrinsic motivation for reinforcement learning systems. In Yale Workshop on Adaptive and Learning Systems.
 [Bellemare et al., 2016] Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems.
 [Brafman and Tennenholtz, 2002] Brafman, R. I. and Tennenholtz, M. (2002). R-max – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research.
 [Burda et al., 2018] Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. (2018). Large-scale study of curiosity-driven learning. arXiv:1808.04355.
 [Bylinskii et al., 2015] Bylinskii, Z., DeGennaro, E., Rajalinghamd, R., Ruda, H., Zhang, J., and Tsotsos, J. (2015). Towards the quantitative evaluation of visual attention models. Vision Research.
 [Chentanez et al., 2005] Chentanez, N., Barto, A. G., and Singh, S. P. (2005). Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems.
 [Clement et al., 2015] Clement, B., Roy, D., Oudeyer, P.-Y., and Lopes, M. (2015). Multi-armed bandits for intelligent tutoring systems. Journal of Educational Data Mining.
 [Cohn et al., 1996] Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research.
 [Colas et al., 2018] Colas, C., Sigaud, O., and Oudeyer, P.-Y. (2018). GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. In International Conference on Machine Learning.
 [de Abril and Kanai, 2018] de Abril, I. M. and Kanai, R. (2018). Curiosity-driven reinforcement learning with homeostatic regulation. arXiv:1801.07440v2.
 [Garivier and Moulines, 2011] Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory.
 [Golovin and Krause, 2011] Golovin, D. and Krause, A. (2011). Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research.
 [Gordon and Ahissar, 2011] Gordon, G. and Ahissar, E. (2011). Reinforcement active learning hierarchical loops. In International Joint Conference on Neural Networks.
 [Graves et al., 2017] Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. (2017). Automated curriculum learning for neural networks. In International Conference on Machine Learning.
 [Haber et al., 2018] Haber, N., Mrowca, D., Fei-Fei, L., and Yamins, D. L. (2018). Learning to play with intrinsically-motivated self-aware agents. arXiv:1802.07442.
 [Hester and Stone, 2017] Hester, T. and Stone, P. (2017). Intrinsically motivated model learning for developing curious robots. Artificial Intelligence.
 [Houthooft et al., 2016] Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). Vime: Variational information maximizing exploration. In Advances in Neural Information Processing Systems.
 [Itti and Baldi, 2006] Itti, L. and Baldi, P. F. (2006). Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems.
 [Jacobsen et al., 2019] Jacobsen, A., Schlegel, M., Linke, C., Degris, T., White, A., and White, M. (2019). Metadescent for online, continual prediction. In AAAI Conference on Artificial Intelligence.
 [Jaderberg et al., 2016] Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2016). Reinforcement learning with unsupervised auxiliary tasks. arXiv:1611.05397.
 [Karl et al., 2017] Karl, M., Soelch, M., Becker-Ehmck, P., Benbouzid, D., van der Smagt, P., and Bayer, J. (2017). Unsupervised real-time control through variational empowerment. arXiv:1710.05101.
 [Kearney et al., 2019] Kearney, A., Veeriah, V., Travnik, J., Pilarski, P. M., and Sutton, R. S. (2019). Learning feature relevance through step size adaptation in temporal-difference learning. arXiv:1903.03252.
 [Kingma and Ba, 2015] Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
 [Konyushkova et al., 2017] Konyushkova, K., Sznitman, R., and Fua, P. (2017). Learning active learning from data. In Advances in Neural Information Processing Systems.
 [Kulkarni et al., 2016] Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems.
 [Little and Sommer, 2013] Little, D. Y.-J. and Sommer, F. T. (2013). Learning and exploration in action-perception loops. Frontiers in Neural Circuits.
 [Liu et al., 2014] Liu, Y.-E., Mandel, T., Brunskill, E., and Popovic, Z. (2014). Trading off scientific knowledge and user learning with multi-armed bandits. In International Conference on Educational Data Mining.
 [Lopes et al., 2012] Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems.
 [Maei et al., 2010] Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In International Conference on Machine Learning.
 [Mahmood et al., 2012] Mahmood, A. R., Sutton, R. S., Degris, T., and Pilarski, P. M. (2012). Tuning-free step-size adaptation. In International Conference on Acoustics, Speech and Signal Processing.
 [Martin et al., 2017] Martin, J., Sasikumar, S. N., Everitt, T., and Hutter, M. (2017). Count-based exploration in feature space for reinforcement learning. arXiv:1706.08090.
 [Matiisen et al., 2017] Matiisen, T., Oliver, A., Cohen, T., and Schulman, J. (2017). Teacher-student curriculum learning. arXiv:1707.00183.
 [Meuleau and Bourgine, 1999] Meuleau, N. and Bourgine, P. (1999). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning.
 [Mirolli and Baldassarre, 2013] Mirolli, M. and Baldassarre, G. (2013). Functions and mechanisms of intrinsic motivations. In Intrinsically Motivated Learning in Natural and Artificial Systems. Springer.
 [Modayil et al., 2014] Modayil, J., White, A., and Sutton, R. S. (2014). Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior.
 [Oudeyer et al., 2007] Oudeyer, P.-Y., Kaplan, F., and Hafner, V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation.
 [Pathak et al., 2017] Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning.
 [Patten et al., 2018] Patten, T., Martens, W., and Fitch, R. (2018). Monte carlo planning for active object classification. Autonomous Robots.
 [Péré et al., 2018] Péré, A., Forestier, S., Sigaud, O., and Oudeyer, P.-Y. (2018). Unsupervised learning of goal spaces for intrinsically motivated goal exploration. arXiv:1803.00781.
 [Riedmiller et al., 2018] Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. (2018). Learning by playing - solving sparse reward tasks from scratch. arXiv:1802.10567.
 [Santucci et al., 2013] Santucci, V. G., Baldassarre, G., and Mirolli, M. (2013). Which is the best intrinsic motivation signal for learning multiple skills? Frontiers in Neurorobotics, 7:22.
 [Satsangi et al., 2018] Satsangi, Y., Whiteson, S., Oliehoek, F. A., and Spaan, M. T. (2018). Exploiting submodular value functions for scaling up active perception. Autonomous Robots.
 [Schaul et al., 2015] Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal value function approximators. In International Conference on Machine Learning.
 [Schembri et al., 2007] Schembri, M., Mirolli, M., and Baldassarre, G. (2007). Evolving childhood’s length and learning parameters in an intrinsically motivated reinforcement learning robot. In International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems.
 [Schmidhuber, 1991a] Schmidhuber, J. (1991a). Curious model-building control systems. In International Joint Conference on Neural Networks.
 [Schmidhuber, 1991b] Schmidhuber, J. (1991b). A possibility for implementing curiosity and boredom in model-building neural controllers. In International Conference on Simulation of Adaptive Behavior: From Animals to Animats.
 [Schmidhuber, 2008] Schmidhuber, J. (2008). Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. In Workshop on Anticipatory Behavior in Adaptive Learning Systems.
 [Settles, 2009] Settles, B. (2009). Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
 [Silver et al., 2017] Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. (2017). The Predictron: End-to-end learning and planning. In International Conference on Machine Learning.
 [Stadie et al., 2015] Stadie, B. C., Levine, S., and Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. arXiv:1507.00814.
 [Still and Precup, 2012] Still, S. and Precup, D. (2012). An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148.
 [Sutton and Barto, 2018] Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction, 2nd Edition. MIT press.
 [Sutton et al., 2007] Sutton, R. S., Koop, A., and Silver, D. (2007). On the role of tracking in stationary environments. In International Conference on Machine Learning.
 [Sutton et al., 2011] Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems.
 [Sutton et al., 1999] Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.
 [Szita and Lőrincz, 2008] Szita, I. and Lőrincz, A. (2008). The many faces of optimism: a unifying approach. In International Conference on Machine learning.
 [Tang et al., 2017] Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., DeTurck, F., and Abbeel, P. (2017). #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems.
 [Vigorito, 2016] Vigorito, C. M. (2016). Intrinsically Motivated Exploration in Hierarchical Reinforcement Learning. PhD thesis, University of Massachusetts Amherst.
 [White et al., 2012] White, A., Modayil, J., and Sutton, R. S. (2012). Scaling life-long off-policy learning. In International Conference on Development and Learning and Epigenetic Robotics.
 [White et al., 2014] White, A., Modayil, J., and Sutton, R. S. (2014). Surprise and curiosity for big data robotics. In AAAI Workshop on Sequential Decision-Making with Big Data.
 [White, 2015] White, A. M. (2015). Developing a Predictive Approach to Knowledge. PhD thesis, University of Alberta.