1. Predictions as Knowledge: Understanding the World Through Forecasts
Intelligence has been defined many ways throughout history; a criterion central to many of these definitions is goal-oriented behaviour: the ability to learn, plan, and act in order to accomplish a task. Acquiring knowledge and using it to support decision-making plays an important role in intelligent systems. It is no surprise, then, that a long-standing pursuit of Artificial Intelligence research is the development of agents capable of independently constructing knowledge of their environment.
The grand challenge of these systems is determining how to construct knowledge. The world is complex; it is so complex that at any given moment there is insufficient information available to us with our limited senses to make decisions. It is impossible to understand the entirety of the world from our immediate observations alone. To cope with this immense complexity, we construct abstractions with which we can interpret the world in order to make decisions. The challenge of constructing knowledge is then the challenge of relating an agent’s sensations over time in order to construct these abstractions with which we can come to understand the world.
One approach to knowledge construction is predictive knowledge: a growing collection of research which attempts to express all of an agent’s world knowledge exclusively in terms of predictions about the environment, typically using the approach of sutton_horde_2011. As an agent interacts with the world, it estimates many General Value Functions: the expectations of many sensorimotor signals. Ordinary value functions underpin most of reinforcement learning: they estimate the value, or discounted sum of future reward, in a given state sutton_reinforcement_1998. General Value Functions (GVFs) expand upon value functions by estimating arbitrary signals an agent has access to, not just the reward (see white_developing_2015 for an introduction to General Value Functions and their use in constructing an agent’s knowledge).
For instance, a predictive knowledge agent might express one aspect of knowledge about keys as “If I put my hand in my pockets, I predict I will feel my keys”. Such a complex and abstract notion as “my keys” would be impractical to capture through one prediction alone. Many predictions must be made in order to capture all aspects of such a broad concept as keys: many predictions would be necessary both to inform the state of such a prediction, and to construct the target the prediction is about. For these reasons, a central component of predictive knowledge is the use of predictions to inform one another. Knowledge is constructed starting with low-level immediate predictions about sensation (such as “can I touch something in front of me?”), which can then be interrelated to express more abstract, conceptual aspects of the environment schapire_diversity-based_1988, for instance spatial awareness ring_representing_2016.
In this paper, we argue that evaluation methods for predictive knowledge systems are as yet underdeveloped, leading to an inability to differentiate between a prediction that is useful in informing decision-making and a useless one. As we will show in what follows, this inability to precisely evaluate single predictions has consequences for both how we structure and how we evaluate predictive knowledge architectures as a whole. To explicate this further, we examine the definition of a GVF and discuss how GVFs can be interrelated to form abstractions in a worked example.
2. The Anatomy of a Prediction: General Value Functions
General Value Functions estimate the discounted sum of some signal $C$ over discrete time-steps $t = 0, 1, 2, \ldots$, defined as the return $G_t = \sum_{k=0}^{\infty} \left( \prod_{j=1}^{k} \gamma_{t+j} \right) C_{t+k+1}$. On each time-step the agent receives some vector of observations $o_t$ which describes the environment, and takes an action $a_t$. The observations are used to construct the agent-state $\phi_t$: the state of the environment from the agent’s perspective. A GVF is parameterized by a set of weights $w$ which, when combined with the agent-state, produce an estimate of the return, $v(\phi_t; w) \approx \mathbb{E}[G_t]$.
What the GVF’s prediction is about is determined by its question parameters, including the signal of interest $C$ (often called the cumulant), a discounting function $\gamma$, and a policy $\pi$ which describes the behaviour over which the predictions are made. Answer parameters include the step-size $\alpha$ (also known as the learning rate), which scales updates to the weights, and the linear or non-linear function-approximator used to construct state.
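As a concrete illustration of how question and answer parameters fit together, the following minimal sketch represents a GVF with a linear function approximator and updates it with TD(0). The class name `GVF`, its field names, and the choice of TD(0) are assumptions made for illustration only; the policy is left implicit by assuming the data is generated by the policy the prediction is about.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence


@dataclass
class GVF:
    """Hypothetical container for a linear GVF (illustrative names only)."""
    # Question parameters: what the prediction is about.
    cumulant: Callable[[Sequence[float]], float]  # signal of interest
    gamma: Callable[[Sequence[float]], float]     # discounting function
    # Answer parameters: how the prediction is learned.
    step_size: float
    n_features: int
    weights: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.weights:
            self.weights = [0.0] * self.n_features

    def predict(self, phi: Sequence[float]) -> float:
        """Linear estimate of the return for the agent-state phi."""
        return sum(w * x for w, x in zip(self.weights, phi))

    def update(self, phi, obs_next, phi_next) -> float:
        """One TD(0) update; returns the temporal-difference error."""
        c = self.cumulant(obs_next)
        g = self.gamma(obs_next)
        delta = c + g * self.predict(phi_next) - self.predict(phi)
        self.weights = [w + self.step_size * delta * x
                        for w, x in zip(self.weights, phi)]
        return delta


# Usage: a constant cumulant of 1 with a constant discount of 0.5,
# seen through a single always-on feature.
gvf = GVF(cumulant=lambda obs: obs[0], gamma=lambda obs: 0.5,
          step_size=0.5, n_features=1)
for _ in range(200):
    gvf.update(phi=[1.0], obs_next=[1.0], phi_next=[1.0])
print(gvf.predict([1.0]))  # approaches 1 / (1 - 0.5) = 2
```

Two GVFs built this way with the same `cumulant` and `gamma` but different `step_size` or features ask the same question and may still give very different answers.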
GVFs can be interrelated by 1) informing the state of a higher-order GVF, or 2) acting as a cumulant (a signal of interest) for a higher-order GVF. Imagine an agent which has access to visual stimuli and the ability to reach out and feel in front of itself. We may start with humble beginnings, predicting whether the agent can touch something immediately in front of itself if the agent reaches out. This provides a rudimentary sense of spatial awareness immediately in front of the agent. Using the learned touch prediction, we can make a further prediction: “Would I predict that I could touch something if I turned left or right?” From here, we open up a world of increasing complexity, leveraging what is already learnt to build models of the world.
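The second kind of interrelation, a learned prediction acting as the cumulant of a higher-order prediction, can be sketched in a toy setting. Everything below is a hypothetical simplification rather than the experimental setup of this paper: the agent stands in a corner, always reaches out, always turns left, uses a one-hot encoding of its heading as state, and both predictions use a discount of 0.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]


# Hypothetical toy world: four headings, listed in the order reached by
# repeatedly turning left. Walls lie to the north and to the east.
HEADINGS = ["north", "west", "south", "east"]
WALL_AHEAD = [1.0, 0.0, 0.0, 1.0]


def learn_touch_and_touch_left(steps=400, step_size=0.5):
    n = len(HEADINGS)
    w_touch = [0.0] * n       # first-order: "will I feel something if I reach out?"
    w_touch_left = [0.0] * n  # higher-order: "would I, if I first turned left?"
    heading = 0
    for _ in range(steps):
        phi = one_hot(heading, n)
        # Layer 1: the cumulant is the observed touch signal.
        delta = WALL_AHEAD[heading] - dot(w_touch, phi)
        w_touch = [w + step_size * delta * x for w, x in zip(w_touch, phi)]
        # The agent turns left; layer 2 uses layer 1's *prediction* in the
        # new state as its cumulant, not the raw touch signal.
        next_heading = (heading + 1) % n
        cumulant = dot(w_touch, one_hot(next_heading, n))
        delta = cumulant - dot(w_touch_left, phi)
        w_touch_left = [w + step_size * delta * x
                        for w, x in zip(w_touch_left, phi)]
        heading = next_heading
    return w_touch, w_touch_left


w_touch, w_touch_left = learn_touch_and_touch_left()
for name, value in zip(HEADINGS, w_touch_left):
    print(f"facing {name}: touch-left prediction = {value:.3f}")
```

Because the higher-order target is itself a learned estimate, any shortcoming in the touch prediction is inherited by the touch-left prediction.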
3. The Problem of Evaluation: Deciding What to Learn
There has been steady progress in predictive approaches to knowledge in Reinforcement Learning. The first suggestion that knowledge could be constructed using incrementally learned predictions in Reinforcement Learning dates back to early papers on temporal-difference learning sutton_learning_1988-2, and is influenced by a long line of AI research focused on constructing models of the world exclusively in terms of an agent’s observations cunningham_intelligence_1972; becker_model_1973; drescher_made-up_1991-1; schapire_diversity-based_1988. These early suggestions have been shaped into a proposal that knowledge can be constructed online, in real-time, continually, as an agent interacts with its environment sutton_horde_2011, typically by learning many value functions.
There has been success in incrementally learning interrelated predictions to conceptualize abstract aspects of the environment tanner_temporal-difference_2005; makino_-line_2008, which has been further extended from temporal-difference networks to networks of General Value Functions schlegel_general_2018. Along the way, much work has focused on improving understanding of the underlying methods upon which predictive knowledge architectures rely: a few such works include demonstrations of the real-time effectiveness of predictive knowledge sutton_horde_2011, step-size adaptation for tuning-free learning mahmood_tuning-free_2012; kearney_learning_2019, and better understanding of the empirical performance of off-policy learning methods ghiassian_first_2017.
In spite of this steady march of progress, one pernicious problem has remained unresolved: how do agents choose what to predict? This unanswered question presents an obstruction which has limited applications of predictive knowledge to a few examples. Predictive knowledge has been used in industrial laser welding gunther_intelligent_2016, bionic limb control edwards_machine_2016, and reactive robot control systems modayil_multi-timescale_2014. While these practical applications are impressive, they share a common trait: in each of these instances the predictions are hand-selected and specified by engineers. We do not have systems which are capable of independently choosing what predictions to learn and how to interrelate them. Moreover, these real-world applications involve limited low-level sensorimotor predictions, not the high-level abstract predictions which originally motivated predictive knowledge. Even when the predictions are hand-selected by engineers, it is challenging to describe abstract notions of the environment in terms of GVFs. Early work demonstrated progress in conceptualizing objects in terms of predictions koop_investigating_2008; however, predictive knowledge systems have not lived up to the lofty ambitions which were first set out in part because it is not clear how to decide what to predict.
An agent cannot predict everything about the world. The world is vast and complex; from the many predictions it could make, the agent must choose the GVFs which will enable it to make sense of the world. Certainly, not all predictions are created equal. Two GVFs may have the same question parameters (cumulant $C$, discount $\gamma$, and policy $\pi$) and thus be predictions about the same experience. While being about the same experience, these two predictions may have very different answer parameters and thus very different estimates; feature construction, the step-size parameter, and even the amount of experience all contribute to how well a value function is estimated.
One option for choosing GVFs is to generate and test predictions: to select some GVFs, learn them, and after a period of time decide which from the collection are worth making, and which can be replaced schlegel_general_2018. To be able to compare predictions, we must have some metric or means of evaluating them.
When an agent is making a prediction, the agent is making an assertion about the world as observed through its data stream. We determine how well the value function is estimated for a given observation by comparing the estimate with an approximation of the true return pilarski_dynamic_2012; edwards_machine_2016; gunther_intelligent_2016: $\tilde{G}_t = \sum_{k=0}^{n-1} \left( \prod_{j=1}^{k} \gamma_{t+j} \right) C_{t+k+1}$, for some buffer-size $n$ which determines how many steps into the future cumulants $C$ are stored to produce the return estimate on any given time-step, with $\gamma$ the discount. The truthfulness of the prediction can be described as the extent to which the estimated value matches the true, observed return (this approach is advocated in the original proposal of sutton_horde_2011 and used in numerous applications gunther_intelligent_2016; edwards_application_2016).
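The buffered comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the cumulants and discounts are assumed to be stored in plain lists indexed by time-step, and the error is reported as the buffered return minus the estimate.

```python
def truncated_return(cumulants, gammas, t, n):
    """Discounted sum of the n cumulants that follow time-step t."""
    g, discount = 0.0, 1.0
    for k in range(n):
        idx = t + k + 1
        if idx >= len(cumulants):
            break  # ran off the end of the recorded data
        g += discount * cumulants[idx]
        discount *= gammas[idx]
    return g


def return_errors(estimates, cumulants, gammas, n):
    """Buffered return minus estimate, for every t with a full buffer."""
    last = len(cumulants) - n
    return [truncated_return(cumulants, gammas, t, n) - estimates[t]
            for t in range(last)]


# Usage: a constant cumulant of 1 and discount of 0.5 with a 3-step buffer.
cumulants = [1.0] * 10
gammas = [0.5] * 10
print(truncated_return(cumulants, gammas, t=0, n=3))  # 1 + 0.5 + 0.25 = 1.75
print(return_errors([2.0] * 10, cumulants, gammas, n=3)[0])  # 1.75 - 2.0 = -0.25
```

Note that the error for time-step t only becomes available n steps later, once the buffer has filled.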
In this paper, we argue that the fundamental challenges in developing more complex predictive knowledge systems stem from poor evaluation methods: that existing limitations of predictive knowledge are a result of how we evaluate predictions (for example, by return error) rather than a critical flaw in predictive knowledge as a paradigm.
4. Issues With Evaluating
At first blush, using error to differentiate between the useful and the useless seems effective. This is not so. Figure 2 presents a simple square-pulse (in grey) as a cumulant, along with two functions which estimate its return. While this example is contrived, there are many situations in which we would want to make such a prediction; being able to detect the onset of events is often useful in decision-making. For example, in the previous section, we worked out an example where an agent built a sense of spatial awareness (Figure 1) by predicting whether it could touch something in front of itself. In the spatial awareness example, touch is a binary signal that rises and falls, similar to this simple synthetic example.
We present two estimates (green and orange) of the return of the square-pulse for a fixed discount factor $\gamma$. The predictive estimate rises before the signal of interest rises, and falls before the signal of interest falls: it precedes the signal of interest. The tracking estimate rises and falls after the signal of interest: it is not predictive. When deciding which prediction to make in order to inform decision-making, it is obvious to the engineer hand-designing GVFs that the tracking estimate is poor. The tracking prediction is redundant: we would be better off simply using the original observation as a feature. While this insight is obvious when inspecting the relationship of GVFs to their signals of interest, systems that autonomously pick which predictions to make use error estimates to differentiate between GVFs that are useful for informing further decision-making and those which are not. Evaluating based on error alone, we would be led to the conclusion that the tracking estimate should be kept.
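To make the difference between an estimate that precedes its signal and one that follows it concrete, the sketch below computes the ideal discounted return of a square pulse alongside a simple lagging average of the same pulse. The pulse length, the discount of 0.9, and the lagging average standing in for a tracking estimate are all assumptions chosen for illustration; they are not the estimates plotted in Figure 2, and the lagging average is left on the scale of the cumulant rather than the return.

```python
def discounted_return(cumulants, gamma):
    """Ideal return G_t = C_{t+1} + gamma * G_{t+1}, with zero return past the end."""
    returns = [0.0] * len(cumulants)
    for t in range(len(cumulants) - 2, -1, -1):
        returns[t] = cumulants[t + 1] + gamma * returns[t + 1]
    return returns


def lagging_average(cumulants, rate):
    """Exponential moving average of the signal: it can only follow, never lead."""
    estimate, out = 0.0, []
    for c in cumulants:
        estimate += rate * (c - estimate)
        out.append(estimate)
    return out


pulse = [0.0] * 20 + [1.0] * 20 + [0.0] * 20
anticipatory = discounted_return(pulse, gamma=0.9)
tracking = lagging_average(pulse, rate=0.2)

# Before the onset (t = 15) only the anticipatory estimate has risen;
# after the offset (t = 45) only the tracking estimate is still elevated.
for t in (15, 25, 38, 45):
    print(t, pulse[t], round(anticipatory[t], 3), round(tracking[t], 3))
```

The anticipatory curve is already falling at t = 38 while the pulse is still high, whereas the tracking curve carries no information that the raw signal does not already provide.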
Low return error does not imply that a GVF is useful. More than a contrived example, these predictions are prototypical of the GVFs we are interested in using to inform decision-making: we are often interested in anticipating the onset of a stimulus. See kearney_when_2019 for an example of how such difficulties play out in predictions used to inform bionic limb control systems for individuals with upper-limb amputations.
While existing applications of predictive knowledge systems are hand-engineered, if we choose to build predictive knowledge systems that independently make decisions about what to learn and how to learn it, we must be able to assess the quality of a prediction in a robust, reliable way. We cannot depend on the domain knowledge of system designers. In order to build such predictive knowledge systems successfully, we must be able to pick and choose between the different predictions we might want to make: we must be able to discriminate between predictions which have low error for poor reasons and predictions which explain their signal of interest pilarski_real-time_2013. Put simply, the fact that a prediction is accurate does not make it useful in informing behaviour.
5. Issues With Evaluating Networks of Predictions
A motivation of predictive knowledge is that GVFs can encode information about possible futures which can then be used to inform other predictions in turn by 1) using an estimate as an input feature when making a higher-order GVF, or 2) using a learned estimate as a cumulant for another GVF. That is, the thrust of predictive knowledge is its construction of higher-order GVFs from lower-order GVFs. In the previous section, we demonstrated how estimated return error can be misleading in the evaluation of single GVFs. In this section, we demonstrate that poor evaluation of lower-order GVFs has consequences for the performance of higher-order GVFs. In order to demonstrate these challenges in evaluation, we turn our attention to the off-policy setting.
In the off-policy setting, we estimate value functions under some target policy $\pi$ which may not match the agent’s current behaviour policy $\mu$. By making predictions about behaviours the agent is not always taking, we introduce a new problem: how do we determine how accurate our predictions are when they are predicting futures which do not necessarily occur? In the on-policy case, it was possible to estimate the return online. We could simply store recent estimates and compare them to the observed return. In the off-policy case, the behaviour policy may not overlap enough with the target policy to accurately estimate the true return. Enough experience can be periodically gathered by taking an excursion rafiee_predictive_2018: setting the behaviour policy to the target policy in order to collect enough experience to estimate the return for the policy $\pi$. By taking an excursion, we turn off-policy evaluation into an on-policy evaluation problem for a brief period of time; by forgoing other learning goals we are able to collect enough experience to evaluate a prediction. However, the cost of taking an excursion can be substantial. An agent shouldn’t have to leap off a cliff in order to determine whether it was correct in predicting that jumping would be lethal. Moreover, by taking an excursion, we are only able to evaluate GVFs under a specific policy $\pi$, possibly a small subset of all the GVFs being learnt at any given time.
An off-policy error metric which can be calculated online in real-time is RUPEE: the Recent Unsigned Projected Error Estimate. RUPEE estimates the mean squared projected Bellman error of a single GVF (see white_developing_2015, pages 119-122, for an explanation of RUPEE). While RUPEE does not correspond to prediction accuracy, it gives an estimate of learning progress with respect to the features used to construct an agent’s state representation, and can be calculated online. For the following experiment we use RUPEE as an evaluation metric. In addition to each of the aforementioned concerns, by estimating learning progress using RUPEE in the off-policy case, we fall prey to the same evaluation trap as using the return error in the on-policy case.
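For readers unfamiliar with the metric, a rough sketch of a RUPEE-style computation follows. It assumes one commonly described formulation: a secondary weight vector `h`, updated as in gradient-TD methods, is combined with a bias-corrected running average of the product of the TD error and the eligibility trace. The class name, parameter names, and default values are hypothetical, and the details may differ from the definition given in white_developing_2015.

```python
import math


class Rupee:
    """Sketch of a RUPEE-style estimate (assumed form, illustrative names)."""

    def __init__(self, n_features, alpha_h=0.01, beta0=0.1):
        self.h = [0.0] * n_features            # secondary weights, as in GTD-style learners
        self.delta_e_bar = [0.0] * n_features  # running average of delta * trace
        self.tau = 0.0                         # normaliser removing the initial bias of the average
        self.alpha_h = alpha_h
        self.beta0 = beta0

    def update(self, delta, e, phi):
        """Consume one TD error, trace vector e, and feature vector phi."""
        h_phi = sum(h * x for h, x in zip(self.h, phi))
        self.h = [h + self.alpha_h * (delta * ei - h_phi * xi)
                  for h, ei, xi in zip(self.h, e, phi)]
        self.tau = (1.0 - self.beta0) * self.tau + self.beta0
        beta = self.beta0 / self.tau
        self.delta_e_bar = [(1.0 - beta) * d + beta * delta * ei
                            for d, ei in zip(self.delta_e_bar, e)]
        inner = sum(h * d for h, d in zip(self.h, self.delta_e_bar))
        return math.sqrt(abs(inner))


# Usage: one update with a TD error of 1 on a single active feature.
rupee = Rupee(n_features=1, alpha_h=0.5, beta0=0.1)
print(rupee.update(delta=1.0, e=[1.0], phi=[1.0]))  # sqrt(0.5)
```

Nothing in this computation refers to the observed return: the estimate measures learning progress relative to the chosen features, which is why a very poor representation can still score well.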
We demonstrate these off-policy issues in a simple Minecraft johnson2016malmo grid-world which reflects the example introduced in Figure 1, a simplification of the thought experiment introduced in ring_representing_2016. The world is a square pen which is 30 $\times$ 30 and two blocks high. The mid-section of each wall has a silver column, and the base of each wall is a unique colour. On every time-step, the agent receives observations which contain: 1) the pixel input from the environment (Figure 3(a)), and 2) whether or not the agent is touching something. We demonstrate the hidden difficulty of off-policy evaluation for predictive knowledge using three simple predictions: whether the agent will touch something if it extends its hand, and whether the agent could touch something if it turned left or right (as introduced in Figure 1). These predictions are useful building-blocks which can inform much more complex predictions that express abstract aspects of the world, such as basic navigation and spatial awareness ring_representing_2016. In order to get to these higher-order abstractions, we must first be able to get these simple, primary predictions right.
We construct two GVF networks which are specified with the same question parameters, but differ in answer parameters used. That is, both networks are approximating the same value-functions; however, the way they learn their approximation differs. One touch prediction uses a Tile Coder sutton_reinforcement_1998; sherstov2005function as a function approximator. To construct state, the agent tiles together the binary touch input and a randomly initialized sub-sampling of pixels. In contrast, the tracking GVF uses only a single bias bit as a representation. We choose this, as it is obvious to any designer that a bias bit is insufficient to inform any of the chosen predictions: we cannot predict whether the agent can touch a wall using a single bit to represent our Minecraft world. Using this obviously poor GVF, we demonstrate that we can achieve lower RUPEE than a well-crafted GVF (Figures 3 and 5).
By comparing the two touch predictions based on their RUPEE (Figure 2(a)), we would be led to conclude that the bias bit GVF is superior to the tile-coded GVF: we would be led, against our intuition, to think that the prediction which does not predict is superior. When we examine the actual predictions made by each GVF, we are told a different story (Figure 4(a)). The bias bit prediction is poor because it tracks. An architect designing a system understands this prediction is poor because it is redundant: the immediate sensation of touch tells us whether or not an agent is touching something. The intent of the prediction is to determine whether or not an agent can touch a wall without needing to engage in the behaviour. When the agent does touch a wall, the prediction is updated and stored in the weights of the GVF. Only when the agent is touching a wall will the bias bit GVF predict that it can touch a wall. By looking at the internal error alone, we miss this critical shortcoming.
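The tracking behaviour of a bias-bit representation can be reproduced in a few lines. This is a simplified sketch, not the experimental setup: it assumes a discount of 0, a fixed step-size, on-policy updates, and a hypothetical touch signal that is off, then on, then off again.

```python
def bias_bit_predictions(touch_signal, step_size=0.5):
    """Predictions of a GVF whose only feature is a constant bias bit."""
    w = 0.0
    predictions = []
    for touch in touch_signal:
        predictions.append(w)         # prediction made before touch is observed
        w += step_size * (touch - w)  # update with discount 0 and feature value 1
    return predictions


touch = [0.0] * 10 + [1.0] * 10 + [0.0] * 10
preds = bias_bit_predictions(touch)

print(preds[10])  # first contact with the wall: still 0.0, nothing was anticipated
print(preds[20])  # first step after contact ends: still close to 1
```

With a single weight and no state information, the estimate can only echo the recent history of the touch signal, so it is high while the agent is touching and shortly afterwards, and nowhere else.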
The challenges of differentiating between a good and bad prediction have an impact which extends beyond the single prediction. A core motivation for the development of predictive knowledge is the composition of predictions: being able to use predictions as inputs to inform the features constructed for another prediction, or being able to make predictions of existing predictions. In systems which use GVFs to construct an agent’s knowledge of the world, predictions are intended to inform further learning processes. Low RUPEE or low return error does not necessarily equate to more useful predictions for these further purposes. In our example, two additional GVFs use learned touch predictions as their cumulant, or signal of interest (Figure 1). By using a prediction as the target of another prediction, the agent is building a rudimentary sense of perception that is grounded in the data stream. Of course, having an accurate touch prediction is useful in and of itself; however, in this predictive knowledge architecture, that is not the only role the touch prediction plays. Being able to anticipate whether something can be touched is necessary to inform these further Touch Left and Touch Right predictions, building the agent’s spatial awareness.
We want not only an accurate touch prediction, but one which is capable of informing Touch Left and Touch Right predictions. In Figure 5, we display the RUPEE of Touch Left and Touch Right. There are two sets of these predictions: the first uses the bias bit GVF’s prediction as its cumulant; the second uses the tile-coded GVF’s prediction as its cumulant. In this layer, the GVFs all share the same function approximator: they all use sufficient representations to learn a reasonable estimate. In this case, a random sub-sampling of the pixel input, the binary touch signal, and the touch prediction are all tiled together to construct the state for each GVF. The only differentiating factor is which cumulant is used: the prediction from either the tracking touch GVF, or the anticipatory touch GVF.
When we examined the first layer’s Touch predictions, the tracking GVF seemed superior based on RUPEE. When we examine the RUPEE of the second set of predictions (Figure 2(b)), we catch a glimpse of the down-stream effects of this misunderstanding. Although the difference is only slight, the GVFs dependent on the tracking Touch prediction have a higher RUPEE than those using the predictive Touch GVF. This point is brought into focus when we examine the predictions made by each touch-left and touch-right prediction (Figures 4(b) and 4(c)). When we examine average trajectories where the agent approaches a wall and turns left, the touch-right prediction using the tracking touch GVF as a cumulant (Figure 4(b), in orange) rises and falls with its underlying GVF. That is, the touch-right prediction with a tracking cumulant predicts a wall even before turning such that the wall is to its right, while the touch-right prediction with a predictive cumulant is able to better match the ground-truth. This disparity is further exacerbated in Figure 4(c), where we see that the touch-left prediction dependent on the tracking touch GVF as a cumulant incorrectly anticipates that a wall is on its left, even as it turns away from it. By using a poor underlying touch prediction, the higher-order GVFs become unlearnable. By examining only the error, the metric used to inform predictive knowledge architectures, we miss this. The usage of a prediction tells us more about the quality of that prediction than error alone.
Our arguments rely on demonstrating quirks of particular GVF estimates: we demonstrate that poor behaviour of estimates can be hidden by commonly used error metrics. This kind of inquiry into the structure of predictions cannot be automated: it relies on inspection by system designers, a form of evaluation which cannot scale. Moreover, these precise comparisons are limited to simple domains. The room our agent inhabits is so simple that we can acquire the ground-truth in order to examine the predictions as is done in Figure 5. In many domains of interest, this ease of comparison is simply impossible. Each of these factors further frustrates the problem of determining what to learn, and whether particular GVFs are useful for informing decision-making. In problem settings that are more complex, system designers have no recourse and must address the issues we have raised.
6. A Proposal: Evaluating Feature Relevance
In the preceding sections, we laid out an argument that existing evaluation methods for General Value Functions in predictive knowledge architectures are insufficient. We demonstrated that return error, in the absence of any additional information, is misleading: return error and RUPEE are insufficient to determine the quality of a GVF in the on-policy and off-policy settings. Most importantly, we demonstrated how the error of a prediction tells us little about how useful a prediction is for informing further predictions, the foundational motivation of predictive knowledge. We now propose an alternative approach to tackling evaluation for predictive knowledge architectures, focusing on feature relevance.
[Figure: the average active step-sizes for each layer of both the prediction and tracking networks, averaged over 30 independent trials. Error bars are standard error of the mean.]
All else being equal, a good forecast is one whose features are well aligned with the prediction problem at hand: that is, the features are relevant. One way to determine the relevance of features is by learning step-sizes. Some meta-gradient learning methods tune the step-size parameter based on the relevance of a given feature. For instance, TD Incremental Delta-Bar-Delta (TIDBD) kearney_learning_2019 assigns a step-size $\alpha_i$ to each weight $w_i$, adjusting the step-size based on the correlation of recent weight updates. If many weight updates in the same direction are made, then a more efficient use of experience would have been to make one large update with a larger $\alpha_i$. If an update has over-shot, then the weight updates will be uncorrelated, and thus the step-size should be smaller. More broadly, we can view these forms of step-size adaptation as the most basic form of representation learning (see kearney_learning_2019 for discussion of feature relevance and meta-descent methods).
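The idea of adjusting a per-weight step-size from the correlation of recent updates can be sketched as below. This is a minimal illustration that assumes an on-policy, semi-gradient TD(λ) form with step-sizes stored as exponentiated parameters `beta` and a trace `h` of recent updates; the function name and argument names are hypothetical, details may differ from kearney_learning_2019, and the off-policy generalization is not reproduced here.

```python
import math


def tidbd_step(w, beta, h, z, phi, phi_next, cumulant, gamma, lam, theta):
    """One in-place update of weights w, log step-sizes beta, update trace h,
    and eligibility trace z. Returns the TD error."""
    v = sum(wi * xi for wi, xi in zip(w, phi))
    v_next = sum(wi * xi for wi, xi in zip(w, phi_next))
    delta = cumulant + gamma * v_next - v
    for i in range(len(w)):
        # Meta-update: raise the step-size when the current update agrees
        # with the trace of recent updates, lower it when they disagree.
        beta[i] += theta * delta * phi[i] * h[i]
        alpha = math.exp(beta[i])
        z[i] = gamma * lam * z[i] + phi[i]
        w[i] += alpha * delta * z[i]
        # Decay the trace of recent updates and add the newest one.
        h[i] = h[i] * max(0.0, 1.0 - alpha * phi[i] * z[i]) + alpha * delta * z[i]
    return delta


# Usage: a positive TD error on a feature whose recent updates were also
# positive (h > 0) nudges that feature's step-size upwards.
w, beta, h, z = [0.0], [math.log(0.1)], [1.0], [0.0]
tidbd_step(w, beta, h, z, phi=[1.0], phi_next=[1.0],
           cumulant=1.0, gamma=0.0, lam=0.0, theta=0.01)
print(math.exp(beta[0]))  # slightly above the initial step-size of 0.1
```

Storing the step-size as an exponent keeps it positive and makes each meta-update a proportional change rather than an absolute one.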
To demonstrate step-sizes as feature relevance, we generalize TIDBD kearney_learning_2019 to GTD($\lambda$) (Algorithm 1), creating a step-size adaptation method for the off-policy touch, touch-left, and touch-right predictions we previously introduced. We depict how the average step-size values for each prediction change during learning in Figure 6. The touch predictions both tune their step-sizes slowly over time, tapering close to values of 0 (Figure 5(a)). In the case of the tile-coded touch prediction, the step-sizes taper as the prediction is slowly learnt. In the case of the bias-bit prediction, the weight updates are not correlated, and the step-size is slowly lowered. In this sense, the step-sizes’ magnitude is a metric of learning progress.
Alone, feature relevance is insufficient to inform our evaluation of predictions. While we are able to discriminate between the tracking and predictive touch-left and touch-right predictions (Figure 5(b)), the tracking and predictive touch predictions are not appreciably different when examining their step-sizes (Figure 5(a)). Step-sizes do not tell the full story; our step-sizes are a weighting of our features when learning some weights $w$. The learned step-sizes in combination with the learned weights give us greater insight into the performance of our GVFs. In Figure 7, a combination of the absolute value of the learned weights $|w|$ and the step-sizes $\alpha$ is plotted. The step-size enters this combination so that the value grows as the step-size shrinks, as the magnitude of the step-size describes progress in learning. Intuitively, a feature which is stable, and thus has a small $\alpha$, and has a relatively large weight is preferable.
By examining this weighted feature value, we are finally able to separate the tracking and anticipatory touch predictions (Figure 6(a)). As the step-sizes decrease, the value of both the tracking and anticipatory predictions rises; however, since the magnitude of the weight is low for the bias-bit, its weighted feature value remains low. This clarity in comparison carries over to the touch-left and touch-right predictions (Figure 6(b)). From Figure 5(b), we know that the tracking-based touch-left and touch-right predictions’ step-sizes never decay: the tracking predictions’ step-sizes maintain an average value of approximately 0.25 for the duration of the trials, while the anticipatory predictions’ step-sizes decay as the predictions are learnt. This results in a pronounced bifurcation between the two predictions. By looking at weighted features, we are able to see and interpret what has been lost in our error estimates.
Using step-sizes that describe feature relevance to inform other aspects of learning is already an established practice. For instance, learned step-sizes have been used to inform feature discovery mahmood_representation_2013 and exploration methods linke2019adapting. Recent work has suggested that step-sizes can be used to monitor the status of robots and indicate when physical damage has occurred to a system gunther_meta-learning_2019. Moreover, using internal learning measurements to evaluate predictive knowledge systems has been suggested in other works sherstan_introspective_2016, although no existing applications of predictive knowledge use step-sizes for evaluation.
Using the learning method we generalized, AutoStep for GTD($\lambda$), we can learn step-sizes online and incrementally as the agent is interacting with the environment. In situations where traditional prediction error metrics fail, the magnitude of learned weights and step-sizes enables differentiation between GVFs that are useful in informing further predictions, and GVFs which are not. In brief, GVFs can be evaluated in a meaningful, scalable way using feature relevance.
Within Reinforcement Learning, there are the seeds of an approach to constructing machine knowledge through prediction. This has manifested itself in a handful of promising real-world applications. Existing applications hint at the possibilities of fully automated and independently learned architectures; however, they rely on hand-crafted predictions that are chosen by engineers for each specific domain of application. We argue that the challenge of discovering predictions stems from a misunderstanding of how to evaluate predictive knowledge. Our results demonstrate that common methods for evaluating predictive knowledge do not enable us to differentiate between useful predictions and trite ones: between predictions that are useful for further decision-making and learning processes, and those that are not. We further establish how such misunderstandings in low-level sensorimotor predictions result in down-stream prediction difficulties. To remedy this, we suggest two approaches. First, we suggest evaluating predictions not just by their own prediction error, but also by the error of further predictions which depend on them. Second, we suggest looking at internal measurements of learning to further inform evaluation. To demonstrate the effectiveness of this latter approach, we generalize the TIDBD algorithm to off-policy learning and show that by examining the relevance of the features of a GVF’s representation, we gain greater insight into the usefulness of a prediction.