In this paper, we consider the problem of online long-term multi-step prediction in partially observable domains. The agent’s objective is to predict a signal of interest multiple steps into the future, where the prediction is updated online, on every time-step. In addition, we are interested in problems where the immediate sensory information available to the agent is insufficient for accurate prediction. Instead, the agent must construct explicit memories or use a recurrent learning architecture such as a recurrent neural network. Most popular benchmarks in reinforcement learning are fully observable. Others, like the Arcade Learning Environment, exhibit minor partial observability where frame-stacking can be used to achieve good performance [Bellemare et al., 2013, Machado et al., 2018].
In this paper, we are interested in incremental learning approaches to our prediction tasks that are scalable in the same way humans and animals are. As far as we know, people and animals do not use more computation to make longer predictions. Instead, we use temporal abstraction [Sutton et al., 1999] to both predict and remember at a variety of timescales. Our primary interest is in learning systems that exhibit this temporal scalability. We seek methods where (1) the computational complexity is linear in the number of learned parameters, and (2) the cost of updating long-term multi-step predictions and remembering things from the past is independent of span [Hasselt and Sutton, 2015]. That is, the compute and memory associated with making and updating a prediction whose outcome is not observable for $k$ time-steps are independent of $k$. Similarly, the cost of remembering things from the past should not be a function of how long ago the event occurred; recurrent architectures that unroll the network all the way to the observation of interest do not meet our criteria.
There are several benchmark problems available for evaluating state construction and representation learning. For example, DeepMind Lab contains several 3D simulation problems inspired by experiments in neuroscience [Beattie et al., 2016, Wayne et al., 2018]. These tasks have been used to benchmark numerous large-scale learning systems including MERLIN [Wayne et al., 2018], IMPALA [Espeholt et al., 2018], and specialized memory architectures [Wayne et al., 2018, Parisotto et al., 2019]. All published results on these domains require several billion steps of interaction and cloud-scale compute [Beattie et al., 2016, Wayne et al., 2018, Parisotto et al., 2019, Fortunato et al., 2019, Espeholt et al., 2018]. The tasks in DeepMind Lab represent aspirational challenge problems for state construction algorithms. However, there is also a need for a new set of tasks suited to rapid prototyping and statistically significant comparisons of new ideas.
We contribute three test problems inspired by experiments in animal learning. Each of the problems was designed to represent a key challenge in online multi-step prediction and state construction. The idea is that good performance on one of our test problems should not require sophisticated multi-threaded learning architectures and large compute. Indeed, our test suite is similar to the recently proposed behavior suite [Osband et al., 2019], but our focus is on prediction instead of control.
Our first problem is based on trace conditioning. The agent must predict a distal stimulus in relation to a previously observed cue; just as a rabbit predicts an upcoming puff to its eye based on a cue in order to close its inner eye-lid in advance of the puff. The challenge here is representational: how does the agent bridge the gap between the cue and the eye puff in a way that is not specific to the particular arrangement of stimuli and does not require computation and storage related to the length of time between the cue and the stimulus [Ludvig et al., 2012, Sutton and Barto, 2018]. Predicting future stimuli in such partially observable settings has broad relevance in AI. An agent should be able to predict the strength and location of the next enemy attack based on the current game scene.
Our second task is inspired by positive/negative patterning experiments where the agent must predict a binary outcome which only occurs if a particular pattern of stimuli is presented [Mackintosh, 1974]. Finally, our third test problem combines trace conditioning, positive/negative patterning, and numerous irrelevant distractor signals. Each of our three problems has tunable hyper-parameters that can be adjusted to smoothly vary the difficulty.
We provide a set of baseline results for the first problem, both to illustrate its difficulty and to provide a set of initial benchmarks. The baseline methods include recurrent learning systems trained by truncated backprop-through-time, and simple methods inspired by animal learning models. We study the performance as we vary both the hyper-parameters of our test problem and the key performance parameters of the baseline methods. Our results highlight the difficulty of our test problems for online recurrent approaches and show that the agent’s performance often exhibits significant parameter sensitivity.
Our test problems pose learning and representational challenges which are (1) relevant to state-of-the-art AI systems and large-scale benchmarks, (2) known to be solvable by a variety of animals, and (3) simple and lightweight, facilitating extensive replication studies, hyper-parameter sweeps, and analysis with modest compute resources.
2 Classical conditioning as representation learning
The study of multi-step prediction learning in the face of partial observability dates back to the origins of classical conditioning. Pavlov was perhaps the first to observe that animals form predictive relationships between sensory cues, while training dogs to associate the sound of a metronome with the presentation of food [Pavlov, 1927]. The animal uses the sound of a metronome (which is never associated with food in nature) to predict roughly when the food will arrive, which induces a hardwired behavioral response. The ability of animals to learn the predictive relationship between stimuli enables them to respond appropriately in important situations. These responses can be preparatory, like a dog’s salivation before food presentation, or protective, like blinking to protect the eyes in anticipation of danger. Predicting the future in the face of limited information is useful to humans too. You predict when the bus might stop next, and perhaps get off, based on the distal memory of the bell. You might predict when the water from the tap will get too hot and move your hand in advance. The study of prediction, timing, and memory in natural systems remains of chief interest to those who wish to replicate it in artificial systems.
Some of the most relevant theories on multi-step prediction in animals have been explored in trace conditioning. In the classical setup, two stimuli are presented to the animal in sequence as shown in Figure 1. The first is called the conditioned stimulus or CS (the predictive trigger), which usually takes the form of a light or tone. Then an unconditioned stimulus (US), such as a puff of air to the animal’s eye, is presented, which generates a behavioral response called the unconditioned response (UR): the rabbit closes its inner eye-lid. After enough pairings of the CS and US, the animal produces a conditioned response (e.g., closing the inner eye-lid before the puff of air), behaving in advance of the US. This arrangement is interesting because there is a gap, called the trace interval, between the offset of the CS and the onset of the US where no stimuli are presented. Empirically we can only reliably measure the strength and timing of the animal’s anticipatory behavior: the muscles controlling the inner eye-lid. However, the common view is that the rabbit is making a multi-step prediction of the US triggered by the onset of the CS that grows in strength closer to the onset of the US [Schneiderman, 1966, Sutton and Barto, 1990, 2018], similar to the conditioned response in Figure 1.
The mystery for both animal learning and AI is how the agent fills the gap. No stimuli occur during the gap and yet the prediction of the US rises on each time-step. There must be some temporal generalization of the stimuli occurring inside the animal. Additionally, what is the form of the prediction being made, and what algorithm is used to update it? Previous work has suggested that the predictions resemble discounted returns used in reinforcement learning [Dickinson, 1980, Wagner, 1978], sometimes called nexting predictions [Modayil et al., 2014], which can be learned using temporal difference learning and eligibility traces (i.e., TD($\lambda$)). Indeed, the TD model of classical conditioning has been shown to emulate several phenomena observed in animals [Ludvig et al., 2012, 2008, Sutton and Barto, 1990].
On the question of representation or agent state, the answer is less clear. The TD model can generate predictions consistent with the animal data, but only if the state representation fills the gap between the CS and US in the right way [Ludvig et al., 2012]. A flag indicating the CS just happened, called the presence representation, will not induce predictions that increase over time, and a clock is not plausible given the range of timescales, the presence of other non-relevant distracting signals, and the massive number of predictive relationships an agent must learn in its lifetime (though Ludvig’s microstimulus representation can be viewed as a clock whose resolution gets worse over time [Ludvig et al., 2008, 2012]). Hand-designed temporal representations do reproduce the animal data well [Ludvig et al., 2012, 2008], but their generality remains unclear. Ideally, the learning system could discover for itself how to represent different stimuli over time in a way that (1) is useful across a variety of prediction tasks, and (2) requires computation and storage independent of the size of the trace interval. Animals do require more training to learn trace conditioning tasks with longer and longer trace intervals, but there is no evidence that the update mechanisms or representations fundamentally change as a function of the trace interval [Howard, ]. This suggests that recurrent architectures like RNNs, LSTMs, and other gated architectures [Elman, 1990, Hochreiter and Schmidhuber, 1997, Chung et al., 2014] trained by different flavors of back-prop through time [Mozer, 1989, Robinson and Fallside, 1987, Werbos, 1988] may not be ideal, due to the need to store and unroll network activations back in time.
Trace conditioning represents a family of test problems with many potential variations. There can be several additional stimuli called distractors, which are unrelated to the CS and US. The CS and US could occur for different lengths of time and overlap in different ways. There can be multiple CSs, and the US might only occur for particular orderings and configurations of the CSs. In positive/negative patterning, for example, the CSs all occur at the same time, but only a particular pattern of active and inactive CSs triggers the US. In positive patterning the combination of CSs activates the US but the individual CSs do not. In negative patterning each CS in isolation activates the US whereas their combination does not. Finally, there is a rich space of combinations of trace conditioning (where there is a CS-US gap), distractors, and positive/negative patterning. In this paper, we propose three such variations as test problems for online multi-step prediction and state construction algorithms.
3 From animal learning to online multi-step prediction
We model our multi-step prediction task as a non-stationary, uncontrolled dynamical system. On each step $t$, the agent observes the available stimuli $\mathbf{o}_t$. In the simplest case $\mathbf{o}_t$ would contain $[\mathrm{CS}_t, \mathrm{US}_t]$. On each step, the agent makes a prediction, denoted $\hat{v}_t$, about the future value of the US. In general $\mathbf{o}_t$ may contain other signals that are either unrelated to the US, called distractors, or other stimuli that may be relevant to the prediction of future US; regardless, $\mathbf{o}_t$ does not fully capture the current state of the system. As discussed in Section 2, a suitable choice for formulating these US predictions is the expected discounted return or value function:
$$v(s) \doteq \mathbb{E}\left[G_t \mid S_t = s\right],$$
where the return is $G_t \doteq \sum_{k=0}^{\infty} \gamma^{k}\, \mathrm{US}_{t+k+1}$ and $S_t$ is the unobserved state. The variable $\gamma \in [0, 1)$ defines the horizon of the US prediction. In Section 4, we provide examples of this particular formulation of the US prediction.
We will use semi-gradient temporal difference (TD) learning to incrementally estimate $v$ on each time step [Sutton, 1988]. Semi-gradient TD is the most commonly used algorithm for these online prediction tasks, and has appealing features relevant to our setting: TD is (1) simple and computationally frugal (linear complexity), and (2) efficient and accurate for learning multi-step predictions online from real data (see Modayil et al. [2014]). Semi-gradient TD learns a parametric approximation $\hat{v}_t(\mathbf{w}) \approx v(S_t)$ by updating a vector of parameters $\mathbf{w}$ as follows:
$$\delta_t \doteq \mathrm{US}_{t+1} + \gamma\, \hat{v}_{t+1}(\mathbf{w}_t) - \hat{v}_t(\mathbf{w}_t),$$
$$\mathbf{z}_t \doteq \gamma \lambda\, \mathbf{z}_{t-1} + \nabla_{\mathbf{w}} \hat{v}_t(\mathbf{w}_t),$$
$$\mathbf{w}_{t+1} \doteq \mathbf{w}_t + \alpha\, \delta_t\, \mathbf{z}_t,$$
where $\alpha$ is the learning rate and $\lambda$ controls the decay of the eligibility trace $\mathbf{z}_t$. The precise form of $\nabla_{\mathbf{w}} \hat{v}_t$ depends on the parameterization scheme. In the linear case $\hat{v}_t(\mathbf{w}) = \mathbf{w}^\top \mathbf{x}_t$ and $\nabla_{\mathbf{w}} \hat{v}_t = \mathbf{x}_t$, where $\mathbf{x}_t$ is a vector of features constructed from $\mathbf{o}_t$. In the non-linear case $\hat{v}_t$ can be computed by a neural network and $\nabla_{\mathbf{w}} \hat{v}_t$ is the gradient of the network output with respect to its weights.
In many cases $\mathbf{o}_t$ does not provide enough information to accurately estimate $v(S_t)$: the problem is partially observable. The agent would do better using observations from previous time-steps. In the linear case this might be handled by constructing $\mathbf{x}_t$ from $\mathbf{o}_t, \mathbf{o}_{t-1}, \ldots$. In the non-linear case a hidden state $\mathbf{h}_t$ is usually constructed recursively from $\mathbf{h}_{t-1}$ and $\mathbf{o}_t$ by a recurrent neural network [Elman, 1990, Hochreiter and Schmidhuber, 1997]. A common approach for training RNNs is backpropagation through time (BPTT), which computes the gradient back through time [Rumelhart et al., 1986]. BPTT can be expensive since it needs to compute the gradient all the way back to the first state. An efficient alternative is truncated backpropagation through time, which only goes $T$ steps back to compute the gradient.
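To make the linear case concrete, the following is a minimal sketch of one semi-gradient TD($\lambda$) step with linear function approximation, written in plain Python. The function name `td_lambda_update`, its argument names, and the toy constants in the usage loop are illustrative assumptions and do not come from the paper; the update itself is the standard one, with TD error equal to the next US plus the discounted next prediction minus the current prediction.

```python
def td_lambda_update(w, z, x, x_next, us_next, alpha, gamma, lam):
    """One linear semi-gradient TD(lambda) step.

    w, z      : weight vector and eligibility trace (lists of floats)
    x, x_next : feature vectors on the current and next step
    us_next   : value of the US observed on the next step
    Returns the updated (w, z) and the TD error delta.
    """
    v = sum(wi * xi for wi, xi in zip(w, x))            # current prediction
    v_next = sum(wi * xi for wi, xi in zip(w, x_next))  # next prediction, same weights
    delta = us_next + gamma * v_next - v                # TD error
    z = [gamma * lam * zi + xi for zi, xi in zip(z, x)]  # accumulating trace
    w = [wi + alpha * delta * zi for wi, zi in zip(w, z)]
    return w, z, delta


# Toy check (hypothetical constants): one constant feature and a US that is
# always 1, so the discounted return is 1 / (1 - gamma) = 2 for gamma = 0.5.
w, z = [0.0], [0.0]
for _ in range(500):
    w, z, delta = td_lambda_update(w, z, [1.0], [1.0], 1.0,
                                   alpha=0.1, gamma=0.5, lam=0.9)
```

Both compute and memory per step are linear in the number of weights and do not depend on how far in the future the US arrives, which is the scaling property the paper asks for.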
4 Test problem 1: Trace-conditioning
The first test problem is analogous to trace conditioning. It includes a series of trials on each of which a sequence of stimuli is presented. The stimuli include a CS and a US. Each trial starts with the onset of the CS which lasts for time steps and is followed by the onset of the US in time steps which lasts for times step. The time from the CS onset to the US onset is called the inter-stimulus interval (ISI) which in this problem is . The time from the US onset to the start of the next trial is called the inter-trial interval (ITI). In this problem, ITI is ISI . is . Figure 2 shows the CS, the US, and the ideal prediction time steps before the CS onset to time steps after the US offset for one trial. To make the test problem more challenging, we include distractor stimuli that do not contain any information about the US. The distractors occur in a Poisson fashion and last for
time steps. The first to tenth distractor occur with probability, , …, respectively at each time step.
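The trial structure described above can be sketched as a small generator. This is only an illustration under stated assumptions: the function names and the numeric values in the usage lines (a CS of 4 steps, an ISI of 10, a US of 2 steps, an ITI of 30) are hypothetical placeholders and not the settings used in the paper, and the Poisson-like distractors are approximated by an independent onset draw on every step.

```python
import random


def trace_conditioning_trial(cs_len, isi, us_len, iti):
    """Return (cs, us) binary lists for one trial, starting at CS onset.

    isi: steps from CS onset to US onset.
    iti: steps from US onset to the start of the next trial.
    """
    length = isi + iti
    cs = [1 if t < cs_len else 0 for t in range(length)]
    us = [1 if isi <= t < isi + us_len else 0 for t in range(length)]
    return cs, us


def distractor_stream(length, onset_prob, duration, rng):
    """Binary distractor: on each step an onset occurs with probability
    onset_prob and keeps the signal on for `duration` steps."""
    signal = [0] * length
    remaining = 0
    for t in range(length):
        if rng.random() < onset_prob:
            remaining = duration
        if remaining > 0:
            signal[t] = 1
            remaining -= 1
    return signal


# Hypothetical settings, for illustration only.
rng = random.Random(0)
cs, us = trace_conditioning_trial(cs_len=4, isi=10, us_len=2, iti=30)
distractor = distractor_stream(len(cs), onset_prob=0.05, duration=4, rng=rng)
```

With these placeholder values, steps 4 through 9 form the trace interval: neither the CS nor the US is active, so any prediction made there has to rely on the agent's own memory.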
To see why this task can be challenging, let us consider an example of failure using the presence representation: one binary feature per stimulus which is activated when the stimulus is present. The presence representation is not sufficient for learning the trace-conditioning problem since there are no active features during the empty interval between the CS offset and the US onset (Figure 2). A successful example on the trace-conditioning problem is provided in Figure 2, row 5, using the Microstimuli representation. The Microstimuli representation is inspired by models from animal learning that successfully associate the CS with the US by keeping traces of the stimuli [Ludvig et al., 2012, 2008, Hull, 1939]. In this case, the prediction increases only after the CS onset whereas the ideal prediction has non-zero values before the CS onset. This makes sense because each trial is independent and the onset of the US is unpredictable by design, just like in trace conditioning experiments with animals. Finally, note that the ideal prediction reaches its maximum just before the US onset and steps downward after. This happens because the discounted sum of future US is maximal just before US onset: at this instant in time the US is multiplied by the largest possible values of the discount weights $\gamma^k$. This temporal profile is consistent with previous work on nexting [Modayil et al., 2014] and computational modeling [Ludvig et al., 2012].
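The shape of the ideal prediction can be checked with a short calculation. The sketch below assumes the return is the discounted sum of future US values, $G_t = \sum_{k \ge 0} \gamma^k\, \mathrm{US}_{t+k+1}$, and computes it with the backward recursion $G_t = \mathrm{US}_{t+1} + \gamma G_{t+1}$; the function name and the toy signal are illustrative choices, not taken from the paper.

```python
def discounted_returns(us, gamma):
    """Compute G_t = sum_k gamma**k * US_{t+k+1} for every step t.

    Uses the backward recursion G_t = US_{t+1} + gamma * G_{t+1}.
    US values beyond the end of the list are treated as zero.
    """
    g = [0.0] * len(us)
    for t in range(len(us) - 2, -1, -1):
        g[t] = us[t + 1] + gamma * g[t + 1]
    return g


# Toy signal: a single US pulse at step 3.
us = [0, 0, 0, 1, 0, 0]
g = discounted_returns(us, gamma=0.5)
# g == [0.25, 0.5, 1.0, 0.0, 0.0, 0.0]
```

The return peaks at step 2, the step immediately before the US onset, and drops once the US has passed, matching the temporal profile described above.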
We studied the performance of three groups of baselines. The first group included the presence representation and the presence representation plus onset and offset features. The presence representation includes one feature per stimulus which is one whenever the corresponding stimulus is on. The onset and offset features are one at the onset and offset of the corresponding stimulus, respectively. The second group of baselines included trace-based methods that keep a trace of the onset of each stimulus and apply coarse coding to it. Note that the stimulus traces are different from the eligibility traces: they carry the trace of the stimuli in the representation during the trace interval, whereas the eligibility traces do not include active features during the empty interval, making them insufficient. This group includes the tile-coded-traces (TCT) and Microstimuli (MS) methods, which respectively use tile coding [Albus, 1975, 1981] and radial basis functions for coarse coding. The third group of baselines included an RNN. To evaluate the performance, we computed the squared return error, $\mathrm{SRE}_t \doteq (\hat{v}_t - G_t)^2$, where $\hat{v}_t$ is the agent’s prediction and $G_t$ is the return on step $t$. To summarize the performance within each trial, we then averaged the SRE within the trial, resulting in a mean squared return error (MSRE).
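As an illustration of the trace-based baselines, the sketch below builds microstimulus-style features: a memory trace is set to one at stimulus onset and decays geometrically, and a bank of Gaussian radial basis functions is applied to the trace height, each scaled by the trace itself. This follows one common formulation in the style of Ludvig et al. [2008]; the function names and the decay, width, and feature-count values are assumptions chosen for the example and are not the settings used in the experiments.

```python
import math


def stimulus_trace(onsets, decay):
    """Memory trace: reset to 1 at each stimulus onset, then decay geometrically."""
    y, trace = 0.0, []
    for onset in onsets:
        y = 1.0 if onset else y * decay
        trace.append(y)
    return trace


def microstimuli(trace_height, num_ms, sigma):
    """Features x_i = f(y, i / m, sigma) * y with a Gaussian basis f.

    Centers are spaced evenly over trace heights in (0, 1].
    """
    feats = []
    for i in range(1, num_ms + 1):
        center = i / num_ms
        basis = math.exp(-((trace_height - center) ** 2) / (2 * sigma ** 2))
        basis /= math.sqrt(2 * math.pi)
        feats.append(basis * trace_height)
    return feats


# Hypothetical constants, for illustration only.
trace = stimulus_trace([1, 0, 0, 0], decay=0.5)   # [1.0, 0.5, 0.25, 0.125]
features = [microstimuli(y, num_ms=4, sigma=0.08) for y in trace]
```

Unlike the presence representation, these features stay active after the stimulus has ended, and different features peak at different delays, which gives a linear TD learner something to attach its rising prediction to during the trace interval.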
We studied the effect of ISI on the performance of the baseline methods including ISI , , and . We adapted according to the ISI following: . We swept over the parameters of each method including the step-size, the number of tiles/RBFs for the trace-based methods, and the truncation parameter and hidden layer size for the RNN. (See the appendix.) Figure 3.A shows the bar chart for different values of ISI with the height of each bar showing the area under the learning curve (AUC). The parameters were optimized for the AUC. The trace-based methods performed well across different values of ISI whereas the performance of the RNN depended on the value of ISI. Figure 3.B shows the sensitivity of the trace-based methods and RNN to their parameters. Each dot corresponds to one parameter setting and the height of each dot shows the AUC. The trace-based methods were robust to their parameters across different values of ISI. However, the RNN became more sensitive to its parameters as ISI got bigger. Moreover, for smaller values of ISI, bigger truncation parameter and hidden layer resulted in lower error, whereas for ISI no relation was found between the error and the truncation parameter or the hidden layer size.
5 Test problems 2 and 3
The second test problem, noisy patterning with distractors, is analogous to positive/negative patterning in psychology. It considers a situation where non-linear combinations of CSs activate the US. As we discussed in Section 2, in negative patterning each CS in isolation activates the US, but the CSs presented together do not. Interestingly, these tasks correspond to well-known logical operations such as XOR, which neural networks are well known to solve easily. To make the problem more challenging, we designed it such that multiple configurations of the CSs activate the US, and we added distractors and noise.
The second test problem includes $n$ CSs and a US. Of the possible configurations of the CSs, $k$ activate the US. Each trial starts with each CS taking a value of 0 or 1. If the values of the CSs match an activating configuration, the US becomes 1 in 4 time steps (i.e., ISI equals 4). The next trial starts in 40 time steps (i.e., ITI equals 40). In half of the trials, one of the activating configurations occurs each of which includes activated CSs and non-activated CSs. The test problem also includes distractors, which occur at the same time as the CSs but do not contribute to the US activation. We also added some level of noise to the problem. In x percent of the trials, an activating configuration occurs but the US remains 0, or a non-activating configuration occurs and the US gets activated. is . Figure 4 shows the signals for the case of CSs distractors.
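The per-trial sampling described above can be sketched as follows. The function name, its arguments, and the example configurations are hypothetical; the sketch covers only which CSs, distractors, and US value a trial gets (an activating configuration in half of the trials, with the US flipped at a given noise rate) and leaves out the timing of the US within the trial.

```python
import random


def patterning_trial(n_cs, activating, n_distractors, noise, rng):
    """Sample one trial and return (cs, distractors, us).

    activating: list of CS configurations (tuples of 0/1) that activate the US.
    noise:      probability that the US is flipped relative to the rule.
    Assumes at least one non-activating configuration exists.
    """
    if rng.random() < 0.5:
        cs = rng.choice(activating)
    else:
        # Rejection-sample a configuration that does not activate the US.
        while True:
            cs = tuple(rng.randint(0, 1) for _ in range(n_cs))
            if cs not in activating:
                break
    us = 1 if cs in activating else 0
    if rng.random() < noise:
        us = 1 - us
    distractors = tuple(rng.randint(0, 1) for _ in range(n_distractors))
    return cs, distractors, us


# Negative patterning with two CSs is XOR: each CS alone activates the US.
negative_patterning = [(1, 0), (0, 1)]
rng = random.Random(0)
trials = [patterning_trial(2, negative_patterning, 3, 0.0, rng) for _ in range(200)]
```

Positive patterning with two CSs corresponds to `activating = [(1, 1)]`; in both cases no single CS determines the US on its own, which is why a linear learner on the presence representation cannot separate activated from non-activated trials.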
The presence representation provides a failure example for test problem 2: while it predicts a high value in an activated trial, it fails to take the interactions between the two CSs into account, resulting in a non-zero prediction in the non-activated trial (Figure 4). An RNN, on the other hand, makes reasonably good predictions in both activated and non-activated trials (Figure 4).
The level of difficulty of the second test problem can be controlled by the number of CSs, the number of activating configurations, number of distractors, and the level of noise. The study of how each of these parameters affects the performance of the baseline methods will remain for future work.
The third test problem is a combination of the first two. As in test problem 1, there is a gap between the CS offset and the US onset, and as in test problem 2, some combinations of the CSs activate the US. To learn this problem successfully, a learner has to both fill the trace interval effectively and respond to the CSs in a non-linear way. This motivates the design of methods that perform reasonably in both problems 1 and 2. The testbed also includes distractors and some level of noise. For test problem 3, the level of difficulty can be controlled with the ISI, the number of CSs, and the number of configurations, as well as the number of distractors and the level of noise. Identifying the parameters that affect the difficulty of the problem the most and providing baseline results for test problem 3 remain for future work.
Challenging benchmark problems have facilitated the study of artificial learning systems. In this paper, we presented three light-weight but challenging test problems inspired by problems that animals can solve. We also provided baseline results for our first test problem, including results from modern recurrent learning systems as well as simple methods from animal learning. Our results suggest that well-tuned modern recurrent mechanisms cannot solve the hard instances of our first test problem. On the other hand, simple animal learning models do not account for the interaction between the signals, making them inadequate for test problems 2 and 3 where the learner has to respond to a non-linear combination of signals. This motivates the design of new methods that can perform well across all variations of the three test problems.
- [Beattie et al., 2016] DeepMind Lab. arXiv preprint arXiv:1612.03801.
- [Bellemare et al., 2013] The Arcade Learning Environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
- [Chung et al., 2014] Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- [Dickinson, 1980] Contemporary animal learning theory. Vol. 1, CUP Archive.
- [Elman, 1990] Finding structure in time. Cognitive Science 14 (2), pp. 179–211.
- [Espeholt et al., 2018] IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.
- [Fortunato et al., 2019] Generalization of reinforcement learners with working and episodic memory. In Advances in Neural Information Processing Systems, pp. 12469–12478.
- [Hasselt and Sutton, 2015] Learning to predict independent of span. arXiv preprint arXiv:1508.04582.
- [Hochreiter and Schmidhuber, 1997] Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- [Howard, ] 3.2 memory for time.
- [Hull, 1939] The problem of stimulus equivalence in behavior theory. Psychological Review 46 (1), pp. 9.
- [Ludvig et al., 2008] Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Computation 20 (12), pp. 3034–3054.
- [Ludvig et al., 2012] Evaluating the TD model of classical conditioning. Learning & Behavior 40 (3), pp. 305–319.
- [Machado et al., 2018] Revisiting the Arcade Learning Environment: evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research 61, pp. 523–562.
- [Mackintosh, 1974] The psychology of animal learning. Academic Press.
- [Modayil et al., 2014] Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior 22 (2), pp. 146–160.
- [Mozer, 1989] A focused back-propagation algorithm for temporal pattern recognition. Complex Systems 3 (4), pp. 349–381.
- [Osband et al., 2019] Behaviour suite for reinforcement learning. arXiv preprint arXiv:1908.03568.
- [Parisotto et al., 2019] Stabilizing transformers for reinforcement learning. arXiv preprint arXiv:1910.06764.
- [Pavlov, 1927] Conditioned reflexes. Oxford University Press, London.
- [Robinson and Fallside, 1987] The utility driven dynamic error propagation network. University of Cambridge Department of Engineering Cambridge, MA.
- [Rumelhart et al., 1986] Learning internal representation by error propagation, parallel distributed processing. MIT Press, Cambridge.
- [Schneiderman, 1966] Interstimulus interval function of the nictitating membrane response of the rabbit under delay versus trace conditioning. Journal of Comparative and Physiological Psychology 62 (3), pp. 397.
- [Sutton and Barto, 1990] Time-derivative models of Pavlovian reinforcement.
- [Sutton and Barto, 2018] Reinforcement learning: an introduction. MIT Press.
- [Sutton et al., 1999] Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211.
- [Sutton, 1988] Learning to predict by the methods of temporal differences. Machine Learning 3 (1), pp. 9–44.
- [Wagner, 1978] Expectancies and the priming of STM. Cognitive Processes in Animal Behavior, pp. 177–209.
- [Wayne et al., 2018] Unsupervised predictive memory in a goal-directed agent. arXiv preprint arXiv:1803.10760.
- [Werbos, 1988] Generalization of backpropagation with application to a recurrent gas market model. Neural Networks 1 (4), pp. 339–356.
For the trace-conditioning test problem, we swept over the parameters of each method. See Table 1.