1 Balancing the Needs of Many Learners
Learning about many things can provide numerous benefits to a reinforcement learning system. Adding many auxiliary losses to a deep learning system can act as a regularizer on the representation, ultimately resulting in better final performance in reward maximization problems, as demonstrated with Unreal [Jaderberg et al., 2016]. A collection of value functions encoding goal-directed behaviours can be combined to generate new policies that generalize to goals unseen during training [Schaul et al., 2015]. Learning in hierarchical robot-control problems can be improved with persistent exploration, provided call-return execution of a collection of subgoal policies or skills [Riedmiller et al., 2018], even if those policies are imperfectly learned. In all these examples, a collection of general value functions is updated from a single stream of experience. The question we tackle in this paper is how to sculpt that stream of experience (that is, how to adapt the system’s behaviour) to optimize the learning of a collection of value functions.
One answer is to simply maximize the environmental reward. This was the approach explored with Unreal and resulted in significant performance improvements in challenging visual navigation problems. However, it is not hard to imagine situations where this approach would be limited. In general, the reward may be delayed and sparse; what should the agent do in the absence of external motivations? Learning reusable knowledge such as skills [Sutton et al., 1999] or a model of the world might result in more long-term reward. Such auxiliary learning objectives could emerge automatically during learning [Silver et al., 2017]. Most agent architectures, however, include explicit skill and model learning components. It seems natural that progress towards these auxiliary learning objectives could positively influence the agent’s behaviour, resulting in improved learning overall.
Learning many value functions off-policy from a shared stream of experience, with function approximation and an unknown environment, provides a natural setting to investigate no-reward intrinsically motivated learning. The basic idea is simple. The aim is to accurately estimate many value functions, each with an independent learner; there is no external reward signal. Directly optimizing the data collection for all learners jointly is difficult because we cannot directly measure this total learning objective and actions have an indirect impact on learning efficiency. There is a large related literature in active learning [Cohn et al., 1996, Balcan et al., 2009, Settles, 2009, Golovin and Krause, 2011, Konyushkova et al., 2017] and active perception [Bajcsy et al., 2018] from which to draw inspiration, but neither applies directly to this problem. In active learning the agent must subselect from a larger set of items, to choose which points to label. Active perception is a subfield of vision and robotics. Much of the work in active perception has focused on specific settings (namely visual attention [Bylinskii et al., 2015], localization in robotics [Patten et al., 2018] and sensor selection [Satsangi et al., 2018]) or assumes knowledge of the dynamics (see [Bajcsy et al., 2018, Section 5]).
An alternative strategy is to formulate our task as a reinforcement learning problem. We can use an internal or intrinsic reward that approximates the total learning across all learners. The behaviour can be adapted to maximize the intrinsic reward by choosing actions in each state that maximize the total learning of the system. The choice of intrinsic rewards can have a significant impact on the sample efficiency of such intrinsically motivated learning systems. To the best of our knowledge, this paper provides the first formulation of parallel value function learning as a reinforcement learning task. Fortunately, there are many ideas from related areas that can inform our choice of intrinsic rewards.
Rewards computed from internal statistics about the learning process have been explored in many contexts over the years. Intrinsic rewards have been shown to induce behaviour that resembles the developmental stages exhibited by young humans and animals [Barto, 2013, Chentanez et al., 2005, Oudeyer et al., 2007, Lopes et al., 2012, Haber et al., 2018]. Internal measures of learning have been used to improve skill or option learning [Chentanez et al., 2005, Schembri et al., 2007, Barto and Simsek, 2005, Santucci et al., 2013, Vigorito, 2016], and model learning [Schmidhuber, 1991b, Schmidhuber, 2008]. Most recent work has investigated using intrinsic reward as a bonus to encourage additional exploration in single task learning [Itti and Baldi, 2006, Stadie et al., 2015, Bellemare et al., 2016, Pathak et al., 2017, Hester and Stone, 2017, Tang et al., 2017, Andrychowicz et al., 2017, Achiam and Sastry, 2017, Martin et al., 2017, Colas et al., 2018].
It remains unclear, however, which of these measures of learning would work best in our no-reward setting. Most prior work has focused on providing demonstrations of the utility of particular intrinsic reward mechanisms. One study focused on a suite of large-scale control domains with a single scalar external reward [Burda et al., 2018], comparing different instantiations of a learning system that use an intrinsic reward based on model-error as an exploration bonus. To the best of our knowledge there has never been a broad empirical comparison of intrinsic rewards.
A computational study of intrinsic rewards is certainly needed, but tackling this problem with function approximation and off-policy updating is not the place to start. Estimating multiple value functions in parallel requires off-policy algorithms because each value function is conditioned on a policy that is different than the exploratory behaviour used to select actions. In problems of moderate complexity, these off-policy updates can introduce significant technical challenges. Popular off-policy algorithms like Q-learning and V-trace can diverge with function approximation [Sutton and Barto, 2018]. Sound off-policy algorithms exist, but require tuning additional parameters and are relatively understudied in practice. Even in tabular problems, good performance requires tuning the parameters of each component of the learning system—a complication that escalates with the number of value functions. Finally, the agent must solve the primary exploration problem in order to make use of intrinsic rewards. Finding states with high intrinsic reward may not be easy, even if we assume the intrinsic reward is reliable and informative. To avoid these many confounding factors, the right place to start is in a simpler setting.
In this paper, we investigate and compare different intrinsic reward mechanisms in a new bandit-like parallel learning testbed. The testbed consists of a single state and multiple actions. Each action is associated with an independent scalar target to be estimated by an independent prediction learner. A successful behaviour policy will focus on actions that generate the most learning across the prediction learners. However, like auxiliary task learning systems, the overall task is partially observable, and learning is never done. The targets change without an explicit notification to the agent, and the task continually changes due to changes in action selection and learning of the individual prediction learners. Different configurations of the target distributions can simulate unlearnable targets, non-stationary targets, and easy-to-predict targets. Our new testbed provides a simple instantiation of a problem where introspective learners should help achieve low overall error. An introspective prediction learner is one that can autonomously increase its rate of learning when progress is possible, and decrease learning when progress is not being made or cannot be made.
Our second contribution is a comprehensive empirical comparison of different intrinsic reward mechanisms, including well-known ideas from reinforcement learning and active learning. Our computational study of learning progress highlighted a simple principle: intrinsic rewards based on the amount of learning (e.g., change in weights) can generate useful behaviour if each individual learner is introspective. Otherwise, without introspective learners, no intrinsic reward was able to consistently produce useful behaviours. We conclude with a discussion about how the ideas of introspective learners and intrinsic rewards based on the change in weights are equally applicable in our one-state prediction problem and large-scale problems where off-policy learning and function approximation are required.
2 Problem Formulation
In this section we formalize a testbed for comparing intrinsic rewards using a state-less prediction task and independent learners. This formalism is meant to simplify the study of balancing the needs of many learners and to facilitate comprehensive comparisons.
We formalize our multiple-prediction learning setting as a collection of independent, online supervised learning tasks. On each discrete time step $t$, the agent selects one task $i \in \{1, \ldots, N\}$, causing a target signal $Y_{t,i}$ to be sampled from an (unknown) target distribution $P_{t,i}$, where $Y_{t,i}$ denotes the random variable with distribution $P_{t,i}$. This distribution is indexed by time to reflect that it can change on each time step; this enables a wide range of different target distributions to be considered, to model this nonstationary, multi-prediction learning setting. We provide the definition we use in this work later in this section, in Equation (3).
Associated with each prediction task $i$ is a simple prediction learner that maintains a real-valued vector of weights, $w_{t,i}$, to produce an estimate, $\hat{y}_{t,i}$, of the expected value of the target, $\mathbb{E}[Y_{t,i}]$. On a step where task $i$ is selected, $w_{t,i}$ could be updated using any standard learning algorithm. In this work, we use a 1-dimensional weight vector, so that $\hat{y}_{t,i} = w_{t,i}$ and the update is a simple delta-rule:
$$w_{t+1,i} = w_{t,i} + \alpha_{t,i}\,\delta_{t,i} \tag{1}$$
where $\alpha_{t,i}$ is a scalar learning rate and $\delta_{t,i} = Y_{t,i} - \hat{y}_{t,i}$ is the prediction error of prediction learner $i$ on step $t$. On a step where task $i$ is not selected, $w_{t,i}$ is not updated, implicitly setting $w_{t+1,i}$ to $w_{t,i}$.
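The delta-rule update described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation; the class name `DeltaRuleLearner`, its default step size, and the example targets are assumptions made for the sketch.

```python
class DeltaRuleLearner:
    """Scalar prediction learner updated with the delta rule (illustrative sketch)."""

    def __init__(self, step_size=0.1, initial_weight=0.0):
        self.step_size = step_size    # scalar learning rate (assumed fixed here)
        self.weight = initial_weight  # 1-dimensional weight, which is also the prediction

    def predict(self):
        return self.weight

    def update(self, target):
        """Move the weight toward the observed target and return the prediction error."""
        error = target - self.weight
        self.weight += self.step_size * error
        return error


learner = DeltaRuleLearner(step_size=0.5)
first_error = learner.update(2.0)   # weight moves from 0.0 to 1.0
second_error = learner.update(2.0)  # weight moves from 1.0 to 1.5
```

A learner whose task is not selected on a step simply does not call `update`, so its weight is left unchanged.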
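The primary learning objective is to minimize the Mean Squared Error up to time $T$ for all of the $N$ learners:
$$\text{MSE}_T = \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N} \left( \hat{y}_{t,i} - \mathbb{E}[Y_{t,i}] \right)^2 \tag{2}$$
where $\hat{y}_{t,i}$ is the prediction of learner $i$ on step $t$ and $Y_{t,i}$ is its target. The behaviour does not get to observe this error, both because it only observes one target on each step, and because that target is a noisy sample of the true expected value $\mathbb{E}[Y_{t,i}]$. It can nonetheless attempt to minimize this unobserved error.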
In order to minimize Equation (2), we must devise a way to choose which prediction task to sample. This can be naturally formulated as a sequential decision-making problem, where on each time step $t$, the agent chooses an action $i$, resulting in a new sample of the target for task $i$ and an update to the weights of learner $i$. In order to learn a preference over actions we associate a reward with each action selection, and thus with each prediction task. We investigate different intrinsic rewards. Given a suitable definition of reward, we can make use of any bandit learning algorithm suitable for non-stationary problems.
In this work we use a Gradient Bandit agent [Sutton and Barto, 2018] (see Footnotes 1 and 2), which attempts to maximize the expected average reward by modifying a vector of action preferences $h_t$ based on the difference between the reward and average reward baseline:
$$h_{t+1,j} = h_{t,j} + \beta\,(R_t - \bar{r}_t)\,\big(\mathbb{1}[j = i_t] - \pi_t(j)\big) \quad \text{for every action } j$$
where $i_t$ is the action selected on step $t$, $R_t$ is its reward, $\beta$ is a scalar step size, $\bar{r}_t$ is the average of all the rewards up to time $t$, maintained using an exponential average, and $h_0$ and $\bar{r}_0$ are both initialized to zero. Actions are selected probabilistically according to a softmax distribution which converts the preferences to probabilities (with a temperature of one):
$$\pi_t(j) = \frac{\exp(h_{t,j})}{\sum_{k} \exp(h_{t,k})}$$
Footnote 1: Our framework differs from the usual multi-armed bandit setting in at least two ways: (1) the distributions of the targets are non-stationary, and (2) our objective is to minimize error across all learners, meaning the optimal strategy is to obtain a distribution over the actions, not to find the single best action. We used the Gradient Bandit algorithm because it is similar to policy gradient methods in reinforcement learning, reflecting the setting which we are ultimately interested in: learning the behaviour for a Horde of demons in Markov Decision Process problems with function approximation [Sutton et al., 2011]. The Gradient Bandit will sample all the actions infinitely often.
Footnote 2: In preliminary experiments with the Drifter-Distractor problem described in Section 3, we also compared the $\epsilon$-greedy bandit learning algorithm [Sutton and Barto, 2018, pp. 27–28]. The conclusions were the same as with the gradient bandit. However, the gradient bandit is better suited to non-stationary tasks [Sutton and Barto, 2018], which we focus on here.
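A gradient bandit with a softmax policy and an exponentially averaged reward baseline can be sketched as follows. This is a hedged illustration in the style of the textbook algorithm, not the authors' code; the class name, the step sizes `step_size` and `baseline_rate`, and the toy reward (1 for action 2, 0 otherwise) are assumptions chosen only for the example.

```python
import numpy as np


def softmax(preferences):
    """Convert action preferences to probabilities (temperature of one)."""
    shifted = preferences - np.max(preferences)  # subtract max for numerical stability
    exp_prefs = np.exp(shifted)
    return exp_prefs / exp_prefs.sum()


class GradientBandit:
    def __init__(self, num_actions, step_size=0.1, baseline_rate=0.1, seed=0):
        self.preferences = np.zeros(num_actions)  # preferences start at zero
        self.average_reward = 0.0                 # baseline starts at zero
        self.step_size = step_size                # hypothetical value
        self.baseline_rate = baseline_rate        # hypothetical value
        self.rng = np.random.default_rng(seed)

    def policy(self):
        return softmax(self.preferences)

    def select_action(self):
        return int(self.rng.choice(len(self.preferences), p=self.policy()))

    def update(self, action, reward):
        probs = self.policy()
        one_hot = np.zeros_like(probs)
        one_hot[action] = 1.0
        # Raise the preference of the chosen action when the reward beats the baseline.
        self.preferences += self.step_size * (reward - self.average_reward) * (one_hot - probs)
        # Exponential average of the rewards seen so far.
        self.average_reward += self.baseline_rate * (reward - self.average_reward)


bandit = GradientBandit(num_actions=4)
for _ in range(200):
    action = bandit.select_action()
    bandit.update(action, reward=1.0 if action == 2 else 0.0)
final_policy = bandit.policy()
```

Because every action keeps non-zero probability under the softmax, the agent continues to sample all tasks, which matters when targets can change without notice.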
The targets for each prediction learner are intended to replicate the dynamics of targets that a parallel auxiliary task learning system might experience, such as sensor values of a robot. To simulate a range of interesting dynamics, we construct each target distribution as a Gaussian distribution with drifting mean:
$$Y_{t,i} \sim \mathcal{N}(\mu_{t,i}, \sigma^2_{t,i}), \qquad \mu_{t+1,i} = \Pi\big(\mu_{t,i} + \epsilon_{t,i}\big), \quad \epsilon_{t,i} \sim \mathcal{N}(0, \xi^2_{t,i}) \tag{3}$$
where $\sigma^2_{t,i}$ controls sampling noise, $\xi^2_{t,i}$ controls the rate of drift and $\Pi$ projects the drifting mean back to a fixed range to keep it bounded. The variance and drift are indexed by $t$, because we explore settings where they change periodically. These changes are not communicated to the agent, and the individual LMS learners are prevented from storing explicit histories of the targets. The purpose of this choice was to simulate partial observability common in many large-scale systems (e.g., [Sutton et al., 2011, Modayil et al., 2014, Jaderberg et al., 2016, Silver et al., 2017]). Given our setup, both prediction learners and the behaviour learner would do well to treat their respective learning tasks as non-stationary and track rather than converge [Sutton et al., 2007]. Each mean $\mu_{t,i}$ is bounded to this range and is updated on each step regardless of which action is selected. Our formalism is summarized in Figure 1.
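A Gaussian target with a drifting, bounded mean can be simulated as in the sketch below. The class name `DriftingTarget`, the bounds `low` and `high`, and all numeric settings are hypothetical placeholders; the values used in the paper's experiments are not reproduced here.

```python
import numpy as np


class DriftingTarget:
    """Gaussian target whose mean follows a bounded random walk (illustrative sketch)."""

    def __init__(self, noise_std, drift_std, low=-1.0, high=1.0, mean=0.0, seed=0):
        self.noise_std = noise_std  # controls sampling noise
        self.drift_std = drift_std  # controls the rate of drift
        self.low, self.high = low, high  # assumed bounds for the projection
        self.mean = mean
        self.rng = np.random.default_rng(seed)

    def step(self):
        """Drift the mean; call on every time step, whether or not the target is sampled."""
        drifted = self.mean + self.rng.normal(0.0, self.drift_std)
        self.mean = float(np.clip(drifted, self.low, self.high))  # project back into range

    def sample(self):
        """Draw one noisy observation around the current mean."""
        return float(self.rng.normal(self.mean, self.noise_std))


constant = DriftingTarget(noise_std=0.0, drift_std=0.0, mean=0.5)
drifter = DriftingTarget(noise_std=0.1, drift_std=0.5, seed=1)
drifter_means = []
for _ in range(1000):
    drifter.step()
    drifter_means.append(drifter.mean)
constant_sample = constant.sample()
```

Setting both `noise_std` and `drift_std` to zero gives a constant target, a large `noise_std` with zero drift gives a noisy distractor, and a non-zero `drift_std` gives a drifter.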
3 Simulating Parallel Prediction Problems
We consider several prediction problems corresponding to different settings of the sampling variance and drift rate that define the target distributions in Equation (3). We introduce three problems, with target data simulated from those problems shown in Figure 2.
The Drifter-Distractor problem has four targets, one for each action: (1) two (stationary) noisy targets as distractors, (2) a slowly drifting target, and (3) a constant target, with the sampling variance and drift rate for each of these types given in Table 1.
The Switch Drifter-Distractor problem is similar to Drifter-Distractor except, after 50,000 time-steps the associations between the actions and the target distributions are permuted as detailed in Table 1. To do well in this problem, the system must be able to respond to changes. In addition, in phase two of this problem, two targets exhibit the same drift characteristics; the agent should prefer both actions equally.
| Task | 1 | 2 | 3 | 4 | 5 | 7 and 8 |
The Jumpy Eight-Action problem is designed to require sampling different prediction tasks with different frequencies. In this problem all the targets drift, but at different rates and with different amounts of sampling variance, as summarized in Table 2. The best approach is to select several actions probabilistically depending on their drift and sampling variance. We add an additional target type that drifts more dramatically over time, with periodic shifts in the mean:
where indicator and switches signs if . The sample from a Bernoulli ensures the jumps are rare, but the large mean of the Gaussian makes it likely for this jump to be large when it occurs, as shown in Figure 2. This problem simulates a prediction problem where the target changes by a large magnitude in a semi-regular pattern, but then remains constant. This could occur due to changes in the world outside the prediction learner’s control and representational abilities. These large-magnitude jumps in prediction target are also possible in off-policy learning settings where the agent’s behaviour changes, perhaps encountering a totally new part of the world.
4 Intrinsic Rewards for Multi-prediction Learning
Many learning systems draw inspiration from the exploratory behaviour of young humans and animals, uncertainty reduction in active learning, and information theory, and the resulting techniques could all be packed into the suitcase of curiosity and intrinsic motivation. In an attempt to distill the key ideas and perform a meaningful yet inclusive empirical study, we consider only methods applicable to our problem formulation of multi-prediction learning (see Footnote 3). Although few approaches have been suggested for off-policy multi-task reinforcement learning (with [Chentanez et al., 2005, White et al., 2014] as notable exceptions), many existing approaches can be used to generate intrinsic rewards for multiple, independent prediction learners (see Barto’s excellent summary [Barto, 2013]). We first summarize methods we evaluate in our empirical study. The specific form of each intrinsic reward discussed below is given in Table 3, with italicized names below corresponding to the entries in the table. We conclude by mentioning several rewards we did not evaluate, and why.
Footnote 3: Automatic curriculum generation systems often assume that some tasks cannot be tackled until other easier tasks have been solved first. Curriculum has been most explored in the supervised multi-task learning case where the agent can switch between tasks at any moment [Graves et al., 2017, Matiisen et al., 2017]. Although we could use some ideas from curriculum learning in our bandit-like task, these approaches would not be compatible with more general sequential decision making tasks with delayed outcomes (e.g., a robot), which represents our ultimate goal.
Several intrinsic rewards are based on violated expectations, or surprise. This notion can be formalized using the prediction error itself to compute the instantaneous Absolute Error or Squared Error. We can obtain a less noisy measure of violated expectations with a windowed average of the error, which we call Expected Error. Regardless of the specific form, if the error increases, then the intrinsic reward increases encouraging further sampling for that target. Such errors can be normalized, such as was done for Unexpected Demon Error [White et al., 2014], to mitigate the impact of noise in and magnitude of the targets.
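The error-based rewards described above can be sketched as follows. The exact forms used in the study are those given in Table 3; here, as an assumption for illustration only, the averaged variant is implemented as the magnitude of an exponentially weighted average of the signed error, and the name `ExpectedError` and the decay rate `0.9` are hypothetical.

```python
def absolute_error(error):
    """Instantaneous surprise: magnitude of the prediction error."""
    return abs(error)


def squared_error(error):
    """Instantaneous surprise: squared prediction error."""
    return error ** 2


class ExpectedError:
    """Less noisy surprise signal: magnitude of an averaged signed error (assumed form)."""

    def __init__(self, decay=0.9):
        self.decay = decay          # hypothetical decay rate for the exponential average
        self.average_error = 0.0

    def reward(self, error):
        self.average_error = self.decay * self.average_error + (1.0 - self.decay) * error
        return abs(self.average_error)


# Zero-mean noise: errors alternate in sign, so the averaged error stays small.
noisy = ExpectedError()
for step in range(100):
    noisy_reward = noisy.reward(1.0 if step % 2 == 0 else -1.0)

# Systematic error: errors share a sign, so the averaged error stays large.
biased = ExpectedError()
for _ in range(100):
    biased_reward = biased.reward(1.0)
```

In this toy comparison the instantaneous absolute error is 1 in both cases, whereas the averaged signal separates sign-alternating noise from a persistent error.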
Another category of methods focuses on learning progress, and assumes that the learning system is capable of continually improving its policy or predictions. This is trivially true for approaches designed for tabular stationary problems [Chentanez et al., 2005, Still and Precup, 2012, Little and Sommer, 2013, Meuleau and Bourgine, 1999, Barto and Simsek, 2005, Szita and Lőrincz, 2008, Lopes et al., 2012]. The most well-known approaches for integrating intrinsic motivation make use of rewards based on improvements in (model) error, including Error Reduction [Schmidhuber, 1991b, Schmidhuber, 2008] and Oudeyer’s model Error Derivative approach [Oudeyer et al., 2007]. Improvement in the value function can also be used to construct rewards, and can be computed from the Positive Error Part [Schembri et al., 2007], or by tracking improvement in the value function over all states [Barto and Simsek, 2005]. As our experiments reveal, however, intrinsic rewards requiring improvement can lead to suboptimal behaviour in nonstationary tracking problems.
An alternative to learning progress or improvement is to reward the amount of learning. This does not penalize errors becoming worse, and instead only measures that estimates are changing: the prediction learner is still adjusting its estimates and so is still learning. Bayesian Surprise [Itti and Baldi, 2006] formalizes the idea of amount of learning. For a Bayesian learner, which maintains a distribution over the weights, Bayesian Surprise corresponds to the KL-divergence between this distribution over parameters before and after the update. This KL-divergence measures how much the distribution over parameters has changed. Bayesian Surprise can be seen as a stochastic sample of Mutual Information, which is the expected KL-divergence between prior and posterior across possible observed targets. We discuss this more in Section 5. Other measures based on information gain have been explored [Still and Precup, 2012, Little and Sommer, 2013, Achiam and Sastry, 2017, de Abril and Kanai, 2018], though they have been found to perform similarly to Bayesian Surprise [Little and Sommer, 2013].
Though derived assuming stationarity, we can use Bayesian Surprise for our non-stationary setting by using exponential averages which prioritize recent data in sample averages for means and variances (see Table 3 for the formula). We can additionally consider non-Bayesian strategies for measuring amount of learning, including those based on change in error (Absolute Error Derivative), Variance of Prediction, Uncertainty Change—how much the variance in the prediction changes—and the Weight Change, which we discuss in more depth in the next section. Note that several learning progress measures can be modified to reflect amount of learning by taking the absolute value, and so removing the focus on increase rather than change (this must be done with care as we likely do not want to reward model predictions becoming worse, for example). We test such a modification to Oudeyer’s model derivative, which we refer to as Absolute Error Derivative.
[Table 3: Intrinsic rewards evaluated in the empirical study. Surviving row names: Absolute Error Derivative*, Variance of Error, Variance of Prediction*. Surviving notes: an exponentially weighted average with a decay rate; a small constant; parameters specifying the length of the window and the amount of overlap; a variance estimate computed with an exponential-average variant of Welford’s algorithm.]
There are several strategies which we omit, because they would (1) result in uniform exploration in our pure exploration problem, (2) require particular predictions about state to drive exploration, or (3) are based on statistics of the targets rather than the statistics generated by the prediction learners. Count-based approaches [Brafman and Tennenholtz, 2002, Bellemare et al., 2016, Sutton and Barto, 2018] are completely unsupervised, rewarding visits to under sampled states or actions—resulting in uniform exploration in our problem. Though count-based approaches are sometimes used in learning systems, they reflect novelty rather than learning progress or surprise [Barto et al., 2013]. Many methods use a model to encourage exploration [Schmidhuber, 2008, Chentanez et al., 2005, Stadie et al., 2015, Pathak et al., 2017] such as by using Bayesian Surprise for next-state prediction [Houthooft et al., 2016]. Subgoal discovery systems [Kulkarni et al., 2016, Andrychowicz et al., 2017, Péré et al., 2018] define rewards to reach particular states. Empowerment and state control systems are explicitly designed to respect and use the fact that some tasks or regions of the state-space cannot be well learned. Often such systems use only unsupervised signals relating to statistics of the exploration policy, ignoring the statistics generated by the learning process itself [Karl et al., 2017]. Like count-based approaches, unsupervised measures like this would induce uniform exploration in our state-less task. Finally, we do not test intrinsic rewards based only on targets, such as variance of the target. To see why, consider a behaviour that estimates the variance for a constant target, and quickly determines it only needs to select that action a few times. The prediction learner, however, could have a poor estimate of this target, and may need many more samples to converge to the true value. 
Separately estimating possible amount of learning from actual amount of learning has clear limitations (see Footnote 4).
Footnote 4: In the bandit setting, with a simple sample average learner, the variance of the prediction target provides a measure of uncertainty for the learned prediction [Audibert et al., 2007, Garivier and Moulines, 2011, Antos et al., 2008], and has been successfully applied in education applications [Liu et al., 2014, Clement et al., 2015]. When generalizing to other learners and problem settings, however, variance of the target will no longer obviously reflect uncertainty in the predictions. We therefore instead directly test intrinsic rewards that measure uncertainty in predictions.
5 Approximating Bayesian Surprise
One natural question given this variety of intrinsic rewards is whether there is an optimal approach. In some settings, there is in fact a clear answer. In a stationary, stateless problem where the goal is to estimate means of multiple targets, it has been shown that the agent should take actions proportional to the variance of each target to obtain minimal regret [Antos et al., 2008]. For a stationary setting, with state, an optimal approach would be to take actions to maximize information gain (the reduction in entropy after an update) across learners. We therefore use information gain as the criterion to measure optimal action selection. In this section, we describe how to maximize information gain in an ideal case, and approximation strategies otherwise. The goal of this section is to provide intuition and motivation, as we do not yet have theoretical claims about the approximation strategies. We hope instead for this discussion to help lead to such a formalization.
We first show that information gain is maximized when maximizing expected Bayesian surprise, assuming Bayesian learners. A Bayesian learner updates weights $w_t$ for a parameterized distribution $p_{w_t}$ on the parameters $\theta$ needed to make the prediction. The parameters can be seen as a random variable, $\Theta$, with distribution $p_{w_t}$. The goal is to narrow this distribution around the true parameters $\theta^*$ that generate the targets $Y$. After seeing each new sample, the posterior distribution over parameters is computed using the previous distribution and the new sample, $y_t$, using the update
$$p_{w_{t+1}}(\theta) = \frac{p(y_t \mid \theta)\, p_{w_t}(\theta)}{\int p(y_t \mid \theta')\, p_{w_t}(\theta')\, d\theta'}$$
A Bayesian learner is one that uses exact updates to obtain the posterior. We assume the prior is appropriately specified so that it places non-zero mass on the true parameters $\theta^*$, and so $p_{w_t}$ has non-zero support there as $t \to \infty$, almost surely for any stochastic sequence $y_1, y_2, \ldots$.
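Bayesian surprise is defined as the KL divergence between the distribution over parameters before and after an update [Itti and Baldi, 2006]:
$$\mathrm{KL}\left(p_{w_{t+1}} \,\|\, p_{w_t}\right) = \int p_{w_{t+1}}(\theta) \log \frac{p_{w_{t+1}}(\theta)}{p_{w_t}(\theta)} \, d\theta$$
where $p_{w_t}$ and $p_{w_{t+1}}$ denote the learner's distributions over parameters $\theta$ before and after the update.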
The Bayesian surprise is high when taking an action that produces a stochastic outcome that results in a large change between the prior and posterior distributions over parameters. The expectation of the KL-divergence over stochastic outcomes, with a Bayesian learner, corresponds to the Information Gain. This result is straightforward, but we explicitly show it in the following theorem to provide intuition. Notice that Information Gain defined in Equation (6) is relative to the model class of our learner, rather than some objective notion of information content.
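Theorem. Assume targets $y$ are distributed according to true parameters $\theta^*$, with density $p(y \mid \theta^*)$. For a Bayesian learner that maintains distribution $p_{w_t}$ over parameters $\theta$,
$$\mathbb{E}\left[\mathrm{KL}\left(p_{w_{t+1}} \,\|\, p_{w_t}\right)\right] = I(\Theta; Y) \tag{6}$$
where the expectation is over stochastic outcomes $Y$. Note that $I(\Theta; Y)$, the mutual information between $\Theta$ and $Y$, is also called the Information Gain.
Proof. The mutual information can be written as
$$I(\Theta; Y) = \int p(y) \int p_{w_{t+1}}(\theta) \log \frac{p_{w_{t+1}}(\theta)}{p_{w_t}(\theta)} \, d\theta \, dy,$$
where $p_{w_{t+1}}(\theta) = p(\theta \mid y)$ is the posterior and $p(y)$ is the marginal density of the outcome. The weights $w_{t+1}$ are dependent on the observed $y$. By definition, this integral gives an expected KL across possible observed $y$.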
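The relationship between expected Bayesian surprise and mutual information can be checked numerically on a small discrete model. The sketch below assumes a two-valued parameter and a binary outcome, with the expectation taken under the learner's own predictive distribution over outcomes; the prior and likelihood values are arbitrary illustrative numbers, not taken from the paper.

```python
import numpy as np

prior = np.array([0.3, 0.7])          # p(theta) for theta in {0, 1}
likelihood = np.array([[0.8, 0.2],    # p(y | theta = 0) for y in {0, 1}
                       [0.1, 0.9]])   # p(y | theta = 1) for y in {0, 1}

joint = prior[:, None] * likelihood   # p(theta, y)
marginal_y = joint.sum(axis=0)        # predictive distribution p(y)
posterior = joint / marginal_y        # p(theta | y); column y holds one posterior


def kl(p, q):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))


# Expected Bayesian surprise: average KL(posterior || prior) over outcomes y.
expected_surprise = sum(
    marginal_y[y] * kl(posterior[:, y], prior) for y in range(len(marginal_y))
)

# Mutual information computed directly from the joint distribution.
independent = prior[:, None] * marginal_y[None, :]
mutual_information = float(np.sum(joint * np.log(joint / independent)))
```

The two quantities agree up to floating-point error, which is the identity the theorem relies on for an exact Bayesian update.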
To make this more concrete, consider Bayesian surprise for a Bayesian learner with a simple Gaussian distribution over parameters. For our simplified problem setting, the weights for the Bayesian learner are $w_t = (\mu_t, \sigma_t^2)$ for the Gaussian distribution over the parameters $\theta$, which in this case is the current estimate of the mean of the target. The Bayesian surprise is
$$\mathrm{KL}\left(p_{w_{t+1}} \,\|\, p_{w_t}\right) = \frac{1}{2}\left[\log \frac{\sigma_t^2}{\sigma_{t+1}^2} + \frac{\sigma_{t+1}^2 + (\mu_{t+1} - \mu_t)^2}{\sigma_t^2} - 1\right]$$
We can make this even simpler if we consider the variance $\sigma^2$ to be fixed, rather than learned. The Bayesian surprise then simplifies to
$$\mathrm{KL}\left(p_{w_{t+1}} \,\|\, p_{w_t}\right) = \frac{(\mu_{t+1} - \mu_t)^2}{2\sigma^2} \tag{7}$$
This value is maximized when the squared change in weights is maximal. Therefore, though Bayesian surprise in general may be expensive to compute, for some settings it is as straightforward as measuring the change in weights.
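The simplification for a fixed variance can be verified with the standard closed-form KL divergence between two scalar Gaussians. The function names and the numbers below are illustrative assumptions; the sketch only checks that the general formula reduces to the squared change in the mean divided by twice the variance.

```python
import math


def gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL( N(mean_p, var_p) || N(mean_q, var_q) ) for scalar Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)


def fixed_variance_surprise(mean_before, mean_after, variance):
    """Bayesian surprise when only the mean is updated and the variance is fixed."""
    return (mean_after - mean_before) ** 2 / (2.0 * variance)


# The mean moves from 1.0 to 1.5 while the variance stays at 4.0.
general = gaussian_kl(mean_p=1.5, var_p=4.0, mean_q=1.0, var_q=4.0)
simplified = fixed_variance_surprise(mean_before=1.0, mean_after=1.5, variance=4.0)
```

With equal variances the logarithm term vanishes and the divergence is symmetric, so the direction of the KL does not matter in this special case.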
Additionally, we can consider approximations to Bayesian surprise for non-Bayesian learners. A non-Bayesian learner typically estimates the parameters directly, such as by maximizing likelihood or taking the maximum a posteriori (MAP) estimate
$$\theta_t = \arg\max_{\theta}\; p(\theta \mid y_1, \ldots, y_t)$$
Now instead of maintaining the full posterior $p_{w_t}$ with weights $w_t$, the agent need only learn $\theta_t$ directly. Because $\theta_t$ is the mode of the posterior, for many distributions $\theta_t$ will actually equal a component of $w_t$. For the Gaussian example above with a learned variance, $\theta_t$ equals the first component of $w_t$, the mean $\mu_t$. For a fixed variance, $\theta_t$ exactly equals $w_t$. Therefore, the non-Bayesian learner would have the exact same information gain, measured by the Bayesian surprise in (7).
This direct connection, for Bayesian and non-Bayesian learners, only exists for a limited set of distributions. One such class is the natural exponential family distribution over the parameters. Examples include the Gaussian with fixed variance and mean $\mu$, and the Gamma distribution with a fixed shape parameter and scale parameter. Each natural exponential family has the property that the KL-divergence between two distributions with parameters $w_t$ and $w_{t+1}$ corresponds to a (Bregman) divergence directly on the parameters [Banerjee et al., 2005]. For a Gaussian, this divergence is the squared error normalized by the variance, as above in Equation (7). Another distribution that has this connection is a Laplace distribution with mean $\mu$ and fixed variance. Then the KL-divergence is a function of the absolute change in the mean, $|\mu_{t+1} - \mu_t|$. Note that this connection is limited to certain posterior distributions, but is true for general problem settings, even the general reinforcement learning setting. The distributions before and after an update, $p_{w_t}$ and $p_{w_{t+1}}$ respectively, are over the parameters of the prediction learner. These parameters are more complex in settings with state (such as the parameters of a neural network), but we can nonetheless consider exponential family distributions on those parameters.
This discussion motivates a simple proposal to approximate Bayesian surprise and Bayesian learners for a general setting with non-Bayesian learners: using weight change with introspective learners. Introspection is not a precise definition, but rather a scale. A perfectly introspective learner would be a Bayesian learner, or in some cases the equivalent non-Bayesian MAP learner. A perfectly non-introspective learner could be a random update. The more closely the learner's weights approximate those of the perfectly introspective learner, the better its solution and the better the Bayesian surprise reflects the Information Gain. Further, because the underlying distribution may not be known, we use the change in weights as an approximation.
For concreteness, consider the following system. Each prediction learner is augmented with a procedure to automatically adapt its step-size parameter, based on the errors produced over time. In this paper we use the Autostep algorithm [Mahmood et al., 2012]. The Autostep algorithm will automatically reduce the step size towards zero if the target is unlearnable, increase it when successive errors have the same sign, and not change it if the error is zero. A learner with a fixed step size could be considered a non-introspective learner, because the learner will forever chase the noise. The weight change for such a learner would not at all be reflective of Information Gain, reflecting instead only the inadequacy of the learner. The Autostep learner, on the other hand, like a Bayesian or MAP learner, will stop learning once new samples provide no new information.
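The contrast between a learner that chases noise and one that stops learning can be illustrated with a toy experiment. The sketch below does not implement Autostep; as a stand-in assumption it uses a sample-average learner (step size $1/n$), which also drives its updates toward zero on a stationary noisy target. All names and numeric settings are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# A stationary, purely noisy target: after the mean is found, nothing more can be learned.
targets = rng.normal(loc=0.0, scale=1.0, size=5000)


def weight_changes(step_size_fn):
    """Run a scalar delta-rule learner and record the magnitude of each weight change."""
    weight = 0.0
    changes = []
    for n, target in enumerate(targets, start=1):
        delta = step_size_fn(n) * (target - weight)
        weight += delta
        changes.append(abs(delta))
    return np.array(changes)


fixed_changes = weight_changes(lambda n: 0.1)         # fixed step size: keeps chasing noise
decaying_changes = weight_changes(lambda n: 1.0 / n)  # stand-in for an introspective learner

fixed_recent = float(fixed_changes[-1000:].mean())
decaying_recent = float(decaying_changes[-1000:].mean())
```

For the fixed-step learner the weight change stays large indefinitely, so a reward based on weight change would keep drawing the behaviour toward an unlearnable target. A $1/n$ step size would not track drifting targets, which is one reason an adaptive method such as Autostep is used instead.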
This proposal reflects the following philosophy: there should be an explicit separation in the role of the behaviour agent—to balance data generation amongst parallel prediction learners—and the role of the prediction learners—to learn. If the agent trusts that the prediction learners are using the data appropriately, this suggests simpler intrinsic rewards based solely on the prediction learner’s parameters, such as the change in the weights. The alternative is to assume that the intrinsic rewards must be computed to overcome poor learning. This approach would require the agent to recognize when a prediction learner is non-introspective, and to stop generating rewards to encourage exploration for that agent. If the agent can measure this, though, then presumably so too can the prediction learner—they are after all part of the same learning system. The learner should then be able to use the same measure to adjust its own learning, and avoid large Bayesian surprise simply from ineffective updates to weights.
In this work, we define the change in weights using the norm,
It follows from Equation (1) that weight change is simply Absolute Error scaled by the step size, emphasizing the role that learner capability plays in ensuring an effective reward.
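As a worked illustration of this scaling, suppose (in our own illustrative notation, since Equation (1) is not repeated here) that prediction learner j holds a single weight and applies an update of the form below, with non-negative step size α and prediction error δ. For a scalar weight every ℓ_p norm reduces to the absolute value, so the choice of norm does not affect the result.

```latex
\begin{align*}
w_{t+1,j} &= w_{t,j} + \alpha_{t,j}\,\delta_{t,j}, \\
\lVert w_{t+1,j} - w_{t,j} \rVert &= \lvert \alpha_{t,j}\,\delta_{t,j} \rvert
  = \alpha_{t,j}\,\lvert \delta_{t,j} \rvert .
\end{align*}
```

A learner whose step size has decayed towards zero therefore produces almost no reward even when its error is large, which is the desired behaviour on an unlearnable target.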
Remark: The above discussion applies to the non-stationary setting, by treating the non-stationarity as partial observability. We can assume that the world is stationary, driven by some hidden state, but that it appears non-stationary to the agent because it only observes partial information. If a Bayesian agent had the correct model class, it could still maximize information gain. For example, the agent could know there is a hidden parameter defining the rate of drift for the mean of the distribution over . It could then maintain a posterior over both and the mean and covariance of , based on observed data. As above, it would be unlikely for the agent to have this true model class, and a prediction learner would likely only be an approximation. It remains an important open theoretical question how such approximations influence the agent’s ability to maximize information gain.
We conducted six experiments, with two experiments in each of the three problems described in Section 3, where the prediction learners used either a single constant, global step size (giving non-introspective learners) or step-size adaption with a step size per prediction learner (giving introspective learners). The goal of the experiments is to understand the behaviour of different intrinsic rewards in different problems, with different prediction learners (introspective and non-introspective). Our hypothesis is that non-introspective prediction learners render intrinsic rewards based on amount of learning ineffective, and, with introspective learners, simple rewards like Weight Change can enable the behaviour to effectively balance the needs of all prediction learners.
In these problems, these two learners correspond to non-introspective and introspective learners, because a vector of adaptive step sizes is critical for these changing, heterogeneous tasks.
A constant global step size cannot balance the need to track the drifting targets against the need to learn slowly on the high-variance targets. Ideally we could tune the step size for each prediction learner individually to mitigate this issue; however, that approach does not scale to the settings we ultimately care about [Sutton et al., 2011, Modayil et al., 2014, Jaderberg et al., 2016]. If the step size is too large for the high-variance target, then the prediction learner will continually make large updates due to the sampling variance, never converging to low error. If the step size is too small for the tracking target, then the prediction learner’s estimate will often lag, causing high error. The step-size adaption algorithm Autostep [Mahmood et al., 2012] provides such a vector of adaptive step sizes. (We experimented with other step-size adaption methods, like RMSProp, but the results were qualitatively the same. The results with Autostep were quantitatively better, compared with RMSProp. Autostep is a meta-descent method designed for incremental online learning problems like ours, which may explain this difference.)
The performance difference between these prediction learners is stark, with Autostep significantly improving tracking, enabling different rates for different prediction learners and reducing the step sizes on unlearnable targets or noisy targets once learning is complete.
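The trade-off faced by a single global step size can be seen in a small deterministic sketch. The targets here are illustrative stand-ins and not the problems from Section 3: a linearly drifting target, and a target alternating between +1 and -1 in place of zero-mean noise. The function name and all constants are assumptions made for the example.

```python
def run_fixed_step(targets, alpha):
    """Run a scalar LMS learner with a fixed step size.

    Returns the absolute error and the absolute weight change on the final step.
    """
    w = 0.0
    error = change = 0.0
    for y in targets:
        delta = y - w
        w += alpha * delta
        error, change = abs(delta), abs(alpha * delta)
    return error, change


steps = 2000
drifting = [0.01 * t for t in range(steps)]                    # steadily drifting target
noisy = [1.0 if t % 2 == 0 else -1.0 for t in range(steps)]    # zero-mean stand-in for noise

for alpha in (0.01, 0.5):
    lag, _ = run_fixed_step(drifting, alpha)
    _, churn = run_fixed_step(noisy, alpha)
    print(f"alpha={alpha}: lag on drift={lag:.3f}, weight change on noise={churn:.3f}")
```

In this sketch the small step size lags far behind the drifting target, while the large step size keeps making large weight changes on the alternating target. No single value is good for both, which is why a weight-change reward computed from a fixed-step learner would favour the noisy target.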
In all our experiments, an extensive parameter search was conducted over the parameters of the agent (Gradient Bandit), the prediction learners, and the reward functions. The best performing parameter combinations according to RMSE (see Equation 2) for each were used to generate the results. In each of our three experiments 13,800 parameter combinations were tested and the results were averaged over 200 independent runs. The parameters swept were: agent learning rate, average reward parameter, learning rate for non-introspective learners, initial learning rate of introspective learners, meta learning rate for Autostep, and , from Table 3. For clarity of presentation, we include a large selection of approaches advocated in the literature, listed in Table 3, but omit results for several poorly performing reward functions.
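To show where the two agent parameters enter, the following is a minimal sketch of a gradient bandit with an average-reward baseline, in the textbook form of [Sutton and Barto, 2018]. The class and the names `step_size` and `baseline_rate` are hypothetical; we assume they play the roles of the agent learning rate and the average reward parameter, and the sketch is not the exact agent used in our experiments.

```python
import math
import random


class GradientBandit:
    """Softmax policy over action preferences, updated from a scalar reward."""

    def __init__(self, n_actions, step_size=0.1, baseline_rate=0.1):
        self.prefs = [0.0] * n_actions      # action preferences
        self.step_size = step_size          # assumed: agent learning rate
        self.baseline_rate = baseline_rate  # assumed: average reward parameter
        self.avg_reward = 0.0               # running reward baseline

    def policy(self):
        """Return softmax action probabilities (shifted by the max for stability)."""
        m = max(self.prefs)
        exps = [math.exp(p - m) for p in self.prefs]
        total = sum(exps)
        return [e / total for e in exps]

    def act(self, rng=random):
        """Sample an action index from the current policy."""
        return rng.choices(range(len(self.prefs)), weights=self.policy())[0]

    def update(self, action, reward):
        """Move preferences along the policy gradient, then update the baseline."""
        probs = self.policy()
        advantage = reward - self.avg_reward
        for a, p in enumerate(probs):
            indicator = 1.0 if a == action else 0.0
            self.prefs[a] += self.step_size * advantage * (indicator - p)
        self.avg_reward += self.baseline_rate * (reward - self.avg_reward)
```

In the multi-prediction setting, `reward` would be the intrinsic reward (for example, the weight change) reported by the prediction learner associated with the chosen action.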
In the Drifter-Distractor problem, we focus on qualitative understanding of step sizes over time, shown in Figures 3 (non-introspective learner) and 4 (introspective learner). There are two key conclusions from this experiment. First, introspective learners were critical for intrinsic rewards based on amount of learning, particularly Weight Change and Bayesian Surprise. Without Autostep, both Weight Change and Bayesian Surprise incorrectly cause the agent to prefer the two high-variance targets because their targets continually generate changes to the prediction. With Autostep, however, the weights can converge for the constant and high-variance targets, and both agents correctly prefer the drifting target. Second, measures based on violated expectations—Unexpected Demon Error, Squared Error and Expected Error—or learning progress—Error Reduction—induce either nearly uniform selection or focus on noisy targets, with or without Autostep.
We ran a similar experiment, called the Switch Drifter-Distractor, where targets unexpectedly change after 50,000 steps (see Figures 5 and 6). The conclusions are similar. One notable point is that the Variance of Prediction reacts surprisingly well to the change, as well as Weight Change does. Another notable point is that Bayesian Surprise emphasizes differences before and after updates, more so than Weight Change, because both means (weights) and variances are different. Consequently, after the sudden change, when the constant target becomes a drifting target, the magnitude of intrinsic rewards for that action dominates the other action that changed to a drifting target. Therefore, even though the agent samples the other action several times to see that the error is high, it nonetheless erroneously settles on a single action.
Our final experiment in the Jumpy Eight-Action problem quantitatively compares the best performing intrinsic rewards in a setting where the agent should prefer several different actions. To achieve good performance, the agent must continually sample three actions, with different probabilities, and ignore three noise targets and two constant targets. We only report results with Autostep, based on the conclusions from the first two experiments. We report RMSE in Figure 8 and highlight the action-selection probabilities for Weight Change, Bayesian Surprise, Variance of Prediction, Expected Error and Change in Uncertainty in Figure 7. These five were the most promising intrinsic rewards from the first two experiments. The agent based on Weight Change achieves the lowest overall RMSE, because the rewards cause the agent to prefer the actions associated with the jumpy target and the two drifting targets, each to a different degree. Variance of Prediction causes the agent to only select the action corresponding to the jumpy target, thus suffering higher error on the two undersampled drifting targets. Reward based on Bayesian Surprise does not induce a preference between one of the drifting targets and the jumpy target, causing higher RMSE. Expected Error and Change in Uncertainty rewards initially over-value the jumpy target, slightly degrading performance. Expected Error incorrectly induces a preference for a high-variance target, resulting in a bound in the learning curve near the end of learning.
7 Adapting the Behaviour of a Horde of Demons
The ideas and algorithms promoted in this paper may be even more impactful when combined with policy-contingent, temporally-extended prediction learning. Imagine learning hundreds or thousands of off-policy predictions from a single stream of experience, as in the Unreal [Jaderberg et al., 2016] and Horde [Sutton et al., 2011] architectures. In these settings, the agent’s behaviour must balance overall speed of learning with prediction accuracy. That is, balancing action choices that generate off-policy updates across many predictions, with the need to occasionally choose actions in almost total agreement with one particular policy. In general we cannot assume that each prediction target is independent as we have done in this paper; selecting a particular sequence of actions might generate useful off-policy updates to several predictions in parallel [White et al., 2012]. There have been several promising investigations of how intrinsic rewards might benefit single (albeit complex) task learning (see [Pathak et al., 2017, Hester and Stone, 2017, Tang et al., 2017, Colas et al., 2018]). However, to the best of our knowledge, no existing work has studied adapting the behaviour based on intrinsic rewards of a model-based or otherwise parallel off-policy learning system.
It seems clear that simple intrinsic reward schemes and the concept of an introspective agent should scale nicely to these more ambitious problem settings. We could swap our state-less LMS learners for Q-learning with experience replay, or gradient temporal difference learning [Maei et al., 2010]. The weight-change reward could be computed for each predictor with computation linear in the number of weights. It would be natural to learn the behaviour policy with an average-reward actor-critic architecture, instead of the gradient bandit algorithm used here. Finally, the notion of an introspective learner still simply requires that each prediction learner can adapt its learning rate. This can be achieved with quasi-second-order methods like Adam [Kingma and Ba, 2015], or extensions of the Autostep algorithm to the case of temporal difference learning and function approximation [Kearney et al., 2019, Jacobsen et al., 2019]. It is not possible to know whether the ideas advocated in this paper will work well in a large-scale off-policy prediction learning architecture like Horde; however, they will certainly scale up.
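The linear-cost claim for the weight-change reward can be sketched as follows. This is an illustrative helper, not part of our implementation: the function name is hypothetical, each predictor's weights are assumed to be stored as a flat list of floats, and the ℓ1 norm is assumed for the change in weights.

```python
def weight_change_rewards(old_weights, new_weights):
    """Return one weight-change reward per predictor.

    old_weights and new_weights are lists of per-predictor weight vectors,
    taken before and after the update. Each reward is a single pass over
    that predictor's weights, so the cost is linear in the number of weights.
    """
    return [
        sum(abs(new - old) for old, new in zip(old_vec, new_vec))
        for old_vec, new_vec in zip(old_weights, new_weights)
    ]


before = [[0.0, 1.0], [2.0]]
after = [[0.5, 0.5], [2.0]]
rewards = weight_change_rewards(before, after)
```

Here the first predictor moved both of its weights and receives a positive reward, while the second predictor did not change and receives a reward of zero.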
Maximizing intrinsic reward as presented in this paper is not a form of exploration; it is a mechanism for defining good behaviour. In our state-less prediction task, sufficient exploration was provided by the stochastic behaviour policy. The stochasticity of the policy combined with the intrinsic reward allowed the agent to discover good policies. In the switched task, the behaviour was able to adapt to abrupt and unanticipated change to the target distributions. In this case, Autostep did not decay the step-size parameters too low, ensuring the policy occasionally sampled all the actions. This will not always be the case, and additional exploration will likely be needed. The objective of this paper was to define good behaviours for multi-prediction learning through the lens of intrinsic reward and internal measures of learning. Efficient exploration is an open problem in reinforcement learning. Combining the ideas advocated in this paper with exploration bonuses or planning could work well, but this topic is left to future work.
The goal of this work was to systematically investigate intrinsic rewards for a multi-prediction setting. This paper has three main contributions. The first is a new benchmark for comparing intrinsic rewards. Our bandit-like task requires the agent to demonstrate several important capabilities: avoiding dawdling on noisy outcomes, tracking non-stationary outcomes, and seeking actions for which consistent learning progress is possible. Second, we provided a survey of intrinsically motivated learning systems, and compared 15 different analogs of well-known intrinsic reward schemes. Finally, we demonstrated how one of the simplest intrinsic rewards, weight change, outperformed all other rewards tested when combined with step-size adaption. We motivated this approach based on connections to Bayesian surprise with Bayesian learners. These introspective prediction learners can decide for themselves when learning is done, and therefore the change in the weights of an introspective learner provides a clear reward to drive behaviour.
- [Achiam and Sastry, 2017] Achiam, J. and Sastry, S. (2017). Surprise-based intrinsic motivation for deep reinforcement learning. arXiv:1703.01732.
- [Andrychowicz et al., 2017] Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. (2017). Hindsight experience replay. In Advances in Neural Information Processing Systems.
- [Antos et al., 2008] Antos, A., Grover, V., and Szepesvári, C. (2008). Active learning in multi-armed bandits. In International Conference on Algorithmic Learning Theory.
- [Audibert et al., 2007] Audibert, J.-Y., Munos, R., and Szepesvári, C. (2007). Variance estimates and exploration function in multi-armed bandit. In CERTIS Research Report 07–31.
- [Bajcsy et al., 2018] Bajcsy, R., Aloimonos, Y., and Tsotsos, J. K. (2018). Revisiting active perception. Autonomous Robots.
- [Balcan et al., 2009] Balcan, M.-F., Beygelzimer, A., and Langford, J. (2009). Agnostic active learning. Journal of Computer and System Sciences.
- [Banerjee et al., 2005] Banerjee, A., Merugu, S., Dhillon, I. S., and Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research.
- [Barto et al., 2013] Barto, A., Mirolli, M., and Baldassarre, G. (2013). Novelty or surprise? Frontiers in Psychology, 4:907.
- [Barto, 2013] Barto, A. G. (2013). Intrinsic motivation and reinforcement learning. In Intrinsically Motivated Learning in Natural and Artificial Systems.
- [Barto and Simsek, 2005] Barto, A. G. and Simsek, O. (2005). Intrinsic motivation for reinforcement learning systems. In Yale Workshop on Adaptive and Learning Systems.
- [Bellemare et al., 2016] Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems.
- [Brafman and Tennenholtz, 2002] Brafman, R. I. and Tennenholtz, M. (2002). R-max – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research.
- [Burda et al., 2018] Burda, Y., Edwards, H., Pathak, D., Storkey, A., Darrell, T., and Efros, A. A. (2018). Large-scale study of curiosity-driven learning. arXiv:1808.04355.
- [Bylinskii et al., 2015] Bylinskii, Z., DeGennaro, E., Rajalingham, R., Ruda, H., Zhang, J., and Tsotsos, J. (2015). Towards the quantitative evaluation of visual attention models. Vision Research.
- [Chentanez et al., 2005] Chentanez, N., Barto, A. G., and Singh, S. P. (2005). Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems.
- [Clement et al., 2015] Clement, B., Roy, D., Oudeyer, P.-Y., and Lopes, M. (2015). Multi-armed bandits for intelligent tutoring systems. Journal of Educational Data Mining.
- [Cohn et al., 1996] Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1996). Active learning with statistical models. Journal of Artificial Intelligence Research.
- [Colas et al., 2018] Colas, C., Sigaud, O., and Oudeyer, P.-Y. (2018). GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. In International Conference on Machine Learning.
- [de Abril and Kanai, 2018] de Abril, I. M. and Kanai, R. (2018). Curiosity-driven reinforcement learning with homeostatic regulation. arXiv:1801.07440v2.
- [Garivier and Moulines, 2011] Garivier, A. and Moulines, E. (2011). On upper-confidence bound policies for switching bandit problems. In International Conference on Algorithmic Learning Theory.
- [Golovin and Krause, 2011] Golovin, D. and Krause, A. (2011). Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research.
- [Gordon and Ahissar, 2011] Gordon, G. and Ahissar, E. (2011). Reinforcement active learning hierarchical loops. In International Joint Conference on Neural Networks.
- [Graves et al., 2017] Graves, A., Bellemare, M. G., Menick, J., Munos, R., and Kavukcuoglu, K. (2017). Automated curriculum learning for neural networks. In International Conference on Machine Learning.
- [Haber et al., 2018] Haber, N., Mrowca, D., Fei-Fei, L., and Yamins, D. L. (2018). Learning to play with intrinsically-motivated self-aware agents. arXiv:1802.07442.
- [Hester and Stone, 2017] Hester, T. and Stone, P. (2017). Intrinsically motivated model learning for developing curious robots. Artificial Intelligence.
- [Houthooft et al., 2016] Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems.
- [Itti and Baldi, 2006] Itti, L. and Baldi, P. F. (2006). Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems.
- [Jacobsen et al., 2019] Jacobsen, A., Schlegel, M., Linke, C., Degris, T., White, A., and White, M. (2019). Meta-descent for online, continual prediction. In AAAI Conference on Artificial Intelligence.
- [Jaderberg et al., 2016] Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2016). Reinforcement learning with unsupervised auxiliary tasks. arXiv:1611.05397.
- [Karl et al., 2017] Karl, M., Soelch, M., Becker-Ehmck, P., Benbouzid, D., van der Smagt, P., and Bayer, J. (2017). Unsupervised real-time control through variational empowerment. arXiv:1710.05101.
- [Kearney et al., 2019] Kearney, A., Veeriah, V., Travnik, J., Pilarski, P. M., and Sutton, R. S. (2019). Learning feature relevance through step size adaptation in temporal-difference learning. arXiv:1903.03252.
- [Kingma and Ba, 2015] Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.
- [Konyushkova et al., 2017] Konyushkova, K., Sznitman, R., and Fua, P. (2017). Learning active learning from data. In Advances in Neural Information Processing Systems.
- [Kulkarni et al., 2016] Kulkarni, T. D., Narasimhan, K., Saeedi, A., and Tenenbaum, J. (2016). Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems.
- [Little and Sommer, 2013] Little, D. Y.-J. and Sommer, F. T. (2013). Learning and exploration in action-perception loops. Frontiers in Neural Circuits.
- [Liu et al., 2014] Liu, Y.-E., Mandel, T., Brunskill, E., and Popovic, Z. (2014). Trading off scientific knowledge and user learning with multi-armed bandits. In International Conference on Educational Data Mining.
- [Lopes et al., 2012] Lopes, M., Lang, T., Toussaint, M., and Oudeyer, P.-Y. (2012). Exploration in model-based reinforcement learning by empirically estimating learning progress. In Advances in Neural Information Processing Systems.
- [Maei et al., 2010] Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. (2010). Toward off-policy learning control with function approximation. In International Conference on Machine Learning.
- [Mahmood et al., 2012] Mahmood, A. R., Sutton, R. S., Degris, T., and Pilarski, P. M. (2012). Tuning-free step-size adaptation. In International Conference on Acoustics, Speech and Signal Processing.
- [Martin et al., 2017] Martin, J., Sasikumar, S. N., Everitt, T., and Hutter, M. (2017). Count-based exploration in feature space for reinforcement learning. arXiv:1706.08090.
- [Matiisen et al., 2017] Matiisen, T., Oliver, A., Cohen, T., and Schulman, J. (2017). Teacher-student curriculum learning. arXiv:1707.00183.
- [Meuleau and Bourgine, 1999] Meuleau, N. and Bourgine, P. (1999). Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning.
- [Mirolli and Baldassarre, 2013] Mirolli, M. and Baldassarre, G. (2013). Functions and mechanisms of intrinsic motivations. In Intrinsically Motivated Learning in Natural and Artificial Systems. Springer.
- [Modayil et al., 2014] Modayil, J., White, A., and Sutton, R. S. (2014). Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior.
- [Oudeyer et al., 2007] Oudeyer, P.-Y., Kaplan, F., and Hafner, V. (2007). Intrinsic motivation systems for autonomous mental development. IEEE Transactions on Evolutionary Computation.
- [Pathak et al., 2017] Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017). Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning.
- [Patten et al., 2018] Patten, T., Martens, W., and Fitch, R. (2018). Monte Carlo planning for active object classification. Autonomous Robots.
- [Péré et al., 2018] Péré, A., Forestier, S., Sigaud, O., and Oudeyer, P.-Y. (2018). Unsupervised learning of goal spaces for intrinsically motivated goal exploration. arXiv:1803.00781.
- [Riedmiller et al., 2018] Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Van de Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. (2018). Learning by playing-solving sparse reward tasks from scratch. arXiv:1802.10567.
- [Santucci et al., 2013] Santucci, V. G., Baldassarre, G., and Mirolli, M. (2013). Which is the best intrinsic motivation signal for learning multiple skills? Frontiers in Neurorobotics, 7:22.
- [Satsangi et al., 2018] Satsangi, Y., Whiteson, S., Oliehoek, F. A., and Spaan, M. T. (2018). Exploiting submodular value functions for scaling up active perception. Autonomous Robots.
- [Schaul et al., 2015] Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015). Universal value function approximators. In International Conference on Machine Learning.
- [Schembri et al., 2007] Schembri, M., Mirolli, M., and Baldassarre, G. (2007). Evolving childhood’s length and learning parameters in an intrinsically motivated reinforcement learning robot. In International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems.
- [Schmidhuber, 1991a] Schmidhuber, J. (1991a). Curious model-building control systems. In International Joint Conference on Neural Networks.
- [Schmidhuber, 1991b] Schmidhuber, J. (1991b). A possibility for implementing curiosity and boredom in model-building neural controllers. In International Conference on Simulation of Adaptive Behavior: From Animals to Animats.
- [Schmidhuber, 2008] Schmidhuber, J. (2008). Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. In Workshop on Anticipatory Behavior in Adaptive Learning Systems.
- [Settles, 2009] Settles, B. (2009). Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences.
- [Silver et al., 2017] Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. (2017). The Predictron: End-to-end learning and planning. In International Conference on Machine Learning.
- [Stadie et al., 2015] Stadie, B. C., Levine, S., and Abbeel, P. (2015). Incentivizing exploration in reinforcement learning with deep predictive models. arXiv:1507.00814.
- [Still and Precup, 2012] Still, S. and Precup, D. (2012). An information-theoretic approach to curiosity-driven reinforcement learning. Theory in Biosciences, 131(3):139–148.
- [Sutton and Barto, 2018] Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction, 2nd Edition. MIT Press.
- [Sutton et al., 2007] Sutton, R. S., Koop, A., and Silver, D. (2007). On the role of tracking in stationary environments. In International Conference on Machine Learning.
- [Sutton et al., 2011] Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems.
- [Sutton et al., 1999] Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence.
- [Szita and Lőrincz, 2008] Szita, I. and Lőrincz, A. (2008). The many faces of optimism: a unifying approach. In International Conference on Machine Learning.
- [Tang et al., 2017] Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, X., Duan, Y., Schulman, J., DeTurck, F., and Abbeel, P. (2017). #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems.
- [Vigorito, 2016] Vigorito, C. M. (2016). Intrinsically Motivated Exploration in Hierarchical Reinforcement Learning. PhD thesis, University of Massachusetts Amherst.
- [White et al., 2012] White, A., Modayil, J., and Sutton, R. S. (2012). Scaling life-long off-policy learning. In International Conference on Development and Learning and Epigenetic Robotics.
- [White et al., 2014] White, A., Modayil, J., and Sutton, R. S. (2014). Surprise and curiosity for big data robotics. In AAAI Workshop on Sequential Decision-Making with Big Data.
- [White, 2015] White, A. M. (2015). Developing a Predictive Approach to Knowledge. PhD thesis, University of Alberta.