Most domains of interest are partially observable, where a part of the state is hidden from or unobserved by the agent. Consider an agent that is unable to localize itself within a room using its sensor readings. By keeping a history of interaction, the agent can obtain state and so overcome this partial observability. Once our limited agent reaches a wall, it can determine its distance from the wall in the future by remembering this interaction. This simple strategy, however, can be problematic if a long history length is needed (McCallum, 1996).
A recurrent neural network (RNN) provides a recurrent state-update function, where the state is updated as a function of the (learned) state on the previous step and the current observations. These recurrent connections can be unrolled back in time infinitely far, making it possible for the current RNN state to depend on observations far back in time. Several specialized activation units have been crafted to improve learning of long-term dependencies, including long short-term memory units (LSTMs) (Hochreiter and Schmidhuber, 1997) and gated recurrent units (GRUs) (Cho et al., 2014).
One issue with RNNs, however, is that computing gradients back-in-time is costly. Real Time Recurrent Learning (RTRL) (Williams and Zipser, 1989) is a real-time algorithm, but is prohibitively expensive: quartic in the hidden dimension size $n$, that is, $O(n^4)$. Back propagation through time (BPTT), on the other hand, requires maintaining the entire trajectory, which is infeasible for many of the online learning systems we consider here. A truncated form of BPTT (p-BPTT) is often used to reduce the complexity of training, where complexity grows linearly with the truncation level $p$: $O(pn^2)$. Unfortunately, training can be highly sensitive to the truncation parameter (Pascanu et al., 2013), particularly if the dependencies back-in-time are longer than the chosen $p$, as we reaffirm in our experiments.
In this paper, we propose a new RNN architecture that is significantly more robust to the truncation parameter in p-BPTT, often achieving better performance with complete truncation ($p = 1$). The key idea is to constrain the hidden state to be multi-step predictions. Such predictive state approaches have been previously considered (Littman et al., 2001; Sutton and Tanner, 2004; Rafols et al., 2005; Silver, 2012; Downey et al., 2017). We formulate an architecture and corresponding objective that generalizes beyond previous approaches, enabling these predictions to be general policy-contingent multi-step predictions, called General Value Functions (GVFs) (Sutton et al., 2011). These GVFs have been shown to represent a wide array of multi-step predictions (Modayil et al., 2014). We demonstrate through a series of experiments that GVF networks are effective for representing the state and are much more robust to train, allowing even simple gradient updates with no gradients needed back-in-time (i.e., with $p = 1$). We highlight these properties in three partially observable domains with long-term dependencies, designed to investigate learning state-update functions in a continual learning setting.
Our work provides additional evidence for the predictive representation hypothesis: that state components restricted to be predictions about the future result in better predictive accuracy and better generalization. Previously, Schaul and Ring (2013) showed how a collection of optimal GVFs (not learned while the system was operating) provides a better state representation for a reward-maximizing task than a collection of optimal PSR predictions. A competing but related idea is that the state need not be predictive. However, if additional auxiliary prediction and control tasks are combined with a learned state, then dramatic improvements in reward-maximizing tasks are possible (Jaderberg et al., 2016), even if these tasks are not related directly to the main task. The auxiliary task losses cause the system to learn a state that generalizes better. Our experiments show that predictive state components provide a distinct advantage over RNNs augmented with auxiliary tasks.
2 Problem Formulation
We consider a partially observable setting, where the observations are a function of an unknown, unobserved underlying state. We formulate this problem as a partially observable Markov decision process (POMDP), though we will not consider reward and rather only the dynamics. The dynamics of the POMDP are specified by a Markov decision process (MDP), with state space $\mathcal{S}$, action space $\mathcal{A}$, and transition probabilities $P: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, \infty)$. On each time step $t$ the agent receives an observation vector $\mathbf{o}_t \in \mathcal{O} \subset \mathbb{R}^m$, as a function of the underlying state $s_t \in \mathcal{S}$. The agent only observes $\mathbf{o}_t$, not $s_t$, and then takes an action $a_t \in \mathcal{A}$, producing a sequence of observations and actions: $\mathbf{o}_0, a_0, \mathbf{o}_1, a_1, \ldots$.
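The goal for the agent in this partially observable setting is to identify a state representation $\mathbf{s}_t \in \mathbb{R}^n$ which is a sufficient statistic (summary) of past interaction. More precisely, such a sufficient state would ensure that predictions about future outcomes given this state are independent of the history $\mathbf{h}_t = \mathbf{o}_0, a_0, \mathbf{o}_1, a_1, \ldots, \mathbf{o}_{t-1}, a_{t-1}, \mathbf{o}_t$, i.e. for any $i > 0$
$$p(\mathbf{o}_{t+i} \mid \mathbf{s}_t, a_t, \ldots, a_{t+i-1}) = p(\mathbf{o}_{t+i} \mid \mathbf{s}_t, \mathbf{h}_t, a_t, \ldots, a_{t+i-1}).$$
Such a state summarizes the history, removing the need to store the entire (potentially infinite) history.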
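One strategy for learning a state is with recurrent neural networks (RNNs), which learn a state-update function. Imagine a setting where the agent has a sufficient state $\mathbf{s}_t$ for this step. To obtain a sufficient state for the next step, it simply needs to update $\mathbf{s}_t$ with the new information in the given observation $\mathbf{o}_{t+1}$. The goal, therefore, is to learn a state-update function $f: \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^n$ such that
$$\mathbf{s}_{t+1} = f(\mathbf{s}_t, \mathbf{o}_{t+1})$$
provides a sufficient state $\mathbf{s}_{t+1}$. The update function $f$ is parameterized by a weight vector $\theta$. An example of a simple RNN update function, for $\theta$ composed of stacked vectors $\theta^{(j)} \in \mathbb{R}^{n+m}$ for each hidden state $j \in \{1, \ldots, n\}$, is, for activation function $\sigma: \mathbb{R} \to \mathbb{R}$,
$$\mathbf{s}_{t+1} = \Big[ \sigma\big(\theta^{(1)\top} [\mathbf{s}_t; \mathbf{o}_{t+1}]\big), \ldots, \sigma\big(\theta^{(n)\top} [\mathbf{s}_t; \mathbf{o}_{t+1}]\big) \Big]^\top.$$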
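The simple RNN state update described above can be sketched in a few lines of NumPy. This is only an illustration: the function name `rnn_state_update`, the choice of `tanh` as the activation, and the sizes `n = 4` and `m = 2` are assumptions made for the example, not details taken from the paper.

```python
import numpy as np

def rnn_state_update(s_prev, o_next, theta, activation=np.tanh):
    """One step of a simple RNN update s_{t+1} = sigma(theta [s_t; o_{t+1}]).

    theta has shape (n, n + m): row j holds the stacked weights for hidden unit j.
    """
    x = np.concatenate([s_prev, o_next])  # stacked input [s_t; o_{t+1}]
    return activation(theta @ x)

rng = np.random.default_rng(0)
n, m = 4, 2                               # hypothetical hidden and observation sizes
theta = rng.normal(scale=0.5, size=(n, n + m))

s = np.zeros(n)                           # initial state
for o in np.eye(m)[[0, 1, 0]]:            # a toy sequence of three one-hot observations
    s = rnn_state_update(s, o, theta)
```

The state `s` is the only quantity carried between steps, which is what lets it act as a summary of the history.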
The goal in this work is to develop an efficient algorithm to learn this state-update function that does not depend on the number of steps back-in-time to an important event. Most RNN algorithms learn this state-update by minimizing prediction error to desired targets $y_t \in \mathbb{R}$ across time steps, with an error such as $\sum_t (\mathbf{s}_t^\top \mathbf{w} - y_t)^2$ for some weights $\mathbf{w} \in \mathbb{R}^n$ and the state $\mathbf{s}_t$ implicitly a function of parameters $\theta$. We pursue an alternative strategy, inspired by predictive representations, where the state-update function is learned such that each hidden state is an accurate prediction about future outcomes.
3 GVF Networks
In this section, we propose a new RNN architecture, where hidden states are constrained to be predictions. In particular, we propose to constrain the hidden layer to predict policy-contingent, multi-step outcomes about the future, called General Value Functions (GVFs). We first describe our GVF Networks (GVFNs) architecture, and then develop the objective function and algorithm to learn GVFNs. There are several related predictive approaches, in particular TD Networks, that we discuss in Section 4, after introducing GVFNs.
3.1 The GVFN architecture
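A GVFN is an RNN, and so is a state-update function $f$, but with the additional criterion that each element in $\mathbf{s}_t$ corresponds to a prediction, that is, to a GVF. To embed GVFs into a recurrent network structure we need to extend the definition of GVFs (Sutton et al., 2011) to the partially observable setting. The first step is to replace states with histories. We define $\mathcal{H}$ to be the minimal set of histories that enables the Markov property for the distribution over the next observation:
$$\mathcal{H} = \Big\{ \mathbf{h}_t = (\mathbf{o}_0, a_0, \ldots, a_{t-1}, \mathbf{o}_t) \;\Big|\; \underbrace{\Pr(\mathbf{o}_{t+1} \mid \mathbf{h}_t, a_t) = \Pr(\mathbf{o}_{t+1} \mid \mathbf{o}_{-1}, a_{-1}, \mathbf{h}_t, a_t)}_{\text{Markov property}}, \;\; \underbrace{\Pr(\mathbf{o}_{t+1} \mid \mathbf{h}_t, a_t) \neq \Pr(\mathbf{o}_{t+1} \mid \mathbf{o}_1, a_1, \ldots, a_{t-1}, \mathbf{o}_t, a_t)}_{\text{minimality}} \Big\}.$$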
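A GVF question is a tuple $(\pi, c, \gamma)$ composed of a policy $\pi: \mathcal{H} \times \mathcal{A} \to [0, \infty)$, cumulant $c: \mathcal{H} \to \mathbb{R}$ and termination function $\gamma: \mathcal{H} \to [0, 1]$. (In the definition of GVFs given access to state, the cumulant and termination function are defined on states, actions and next states. When defined on histories, the cumulant and termination only need to be defined on $\mathbf{h}_{t+1}$, and not on $(\mathbf{h}_t, a_t, \mathbf{h}_{t+1})$, because $\mathbf{h}_{t+1}$ contains $\mathbf{h}_t$ and $a_t$.) The answer to a GVF question is defined as the value function, $V: \mathcal{H} \to \mathbb{R}$, which gives the expected, cumulative discounted cumulant from any history $\mathbf{h}_t \in \mathcal{H}$, and which can be defined recursively with a Bellman equation as
$$V(\mathbf{h}_t) = \sum_{a_t \in \mathcal{A}} \pi(a_t \mid \mathbf{h}_t) \sum_{\mathbf{o}_{t+1} \in \mathcal{O}} \Pr(\mathbf{o}_{t+1} \mid \mathbf{h}_t, a_t) \big[ c(\mathbf{h}_{t+1}) + \gamma(\mathbf{h}_{t+1}) V(\mathbf{h}_{t+1}) \big]. \qquad (4)$$
The sums can be replaced with integrals if $\mathcal{A}$ or $\mathcal{O}$ are continuous sets.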
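A GVFN is composed of $n$ GVFs, with each hidden state component $\mathbf{s}_{t,j}$ trained such that $\mathbf{s}_{t,j} \approx V^{(j)}(\mathbf{h}_t)$ for the $j$-th GVF and history $\mathbf{h}_t$. Each hidden state component, therefore, is a prediction about a multi-step policy-contingent question. The hidden state is updated recurrently as $\mathbf{s}_t = f_\theta(\mathbf{s}_{t-1}, \mathbf{o}_t)$ for a parametrized function $f_\theta$, where $f_\theta$ is trained towards ensuring $\mathbf{s}_{t,j} \approx V^{(j)}(\mathbf{h}_t)$. This is summarized in Figure 1.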
General value functions provide a rich language for encoding predictive knowledge. In their simplest form, GVFs with constant $\gamma$ correspond to multi-timescale predictions referred to as Nexting predictions (Modayil et al., 2014). Allowing $\gamma$ to change as a function of state or history, GVF predictions can combine finite-horizon prediction with predictions that terminate when specific outcomes are observed (Modayil et al., 2014). For example, in the Compass World used in our experiments, we might predict the likelihood the agent will observe a red square over the next few steps, if it were to execute a policy that always moves forward under the current heading. This question can be specified simply as a GVF. If the current observation is 'red square', then the cumulant is 1 and $\gamma$ is 0. If the observation is some other colour, then the cumulant is 0 and $\gamma$ is 0.9, corresponding to approximately a 10 step prediction horizon. This what-if question asks: if I were to move forward until termination (drive forward is the if), will I see red over the next 10 steps (see red is the what)? In addition, we can create rich hierarchies of questions by forming compositional predictions: GVFs that make use of the prediction of another GVF as their prediction target (Sutton et al., 2011). Compositional GVFs can be learned independently, without requiring the agent to actually perform the sequence of actions corresponding to the composition of the policies. GVFs are not new to this work, and we suggest the reader consult the literature for extensive motivation and additional examples (Sutton et al., 2011; Schaul and Ring, 2013; Modayil et al., 2014; White, 2015).
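To make the red-square question concrete, the sketch below computes the discounted sum of cumulants along a single hypothetical trajectory of observations generated by the move-forward policy. The names `gvf_return`, `cumulant`, `gamma` and the string observations are illustrative assumptions; the GVF answer itself is the expectation of this quantity over trajectories.

```python
def gvf_return(observations, cumulant, gamma):
    """Discounted sum of cumulants along one trajectory o_{t+1}, o_{t+2}, ...

    Uses the recursion G_t = c(o_{t+1}) + gamma(o_{t+1}) * G_{t+1}, evaluated backwards.
    """
    g = 0.0
    for o in reversed(observations):
        g = cumulant(o) + gamma(o) * g
    return g

# The red-square question from the text: cumulant 1 and termination (gamma = 0)
# on red, cumulant 0 and gamma = 0.9 on any other colour.
cumulant = lambda o: 1.0 if o == "red" else 0.0
gamma = lambda o: 0.0 if o == "red" else 0.9

# Hypothetical observations seen while driving forward: red appears on the third step.
traj = ["white", "white", "red", "white"]
print(gvf_return(traj, cumulant, gamma))  # about 0.81, i.e. 0.9 ** 2
```

Because $\gamma$ drops to zero on red, anything observed after the red square does not contribute, which is how a state-dependent $\gamma$ encodes termination.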
3.2 The Objective Function for GVFNs
Each state component of a GVFN is a value function prediction, and so is approximating the fixed point to a Bellman equation with history in Equation (4). Because the GVFs are in a network, the Bellman equations are coupled in two ways: through composition—where one GVF can be the cumulant for another GVF—and through the recurrent state representation. We first consider the Bellman Network operator, which defines the value function recursion jointly for the collection of GVFs including compositions. We show that the Bellman Network operator is a contraction, as long as compositions between GVFs are acyclic. We then explain how the coupling that arises from the recurrent state representation can be handled using a projected operator, and provide the final objective for GVFNs, called the Mean-Squared Projected Bellman Network Error (MSPBNE).
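We first define the Bellman Network operator. For the $j$-th GVF, with question $(\pi_j, c_j, \gamma_j)$, let the expected cumulant value under the policy be
$$C^{(j)}(\mathbf{h}_t) = \sum_{a_t \in \mathcal{A}} \pi_j(a_t \mid \mathbf{h}_t) \sum_{\mathbf{o}_{t+1} \in \mathcal{O}} \Pr(\mathbf{o}_{t+1} \mid \mathbf{h}_t, a_t) \, c_j(\mathbf{h}_{t+1})$$
and the expected discounted transition be
$$P^{(j)}_{\pi,\gamma}(\mathbf{h}_t, \mathbf{h}_{t+1}) = \pi_j(a_t \mid \mathbf{h}_t) \Pr(\mathbf{o}_{t+1} \mid \mathbf{h}_t, a_t) \, \gamma_j(\mathbf{h}_{t+1})$$
and zero otherwise for inconsistent histories, where $\mathbf{h}_t$ is not a subset of $\mathbf{h}_{t+1}$. Let $V^{(j)} \in \mathbb{R}^{|\mathcal{H}|}$ be the vector of values for GVF $j$. The Bellman Network operator $\mathcal{B}$ is
$$\mathcal{B} \begin{bmatrix} V^{(1)} \\ \vdots \\ V^{(n)} \end{bmatrix} = \begin{bmatrix} C^{(1)} + P^{(1)}_{\pi,\gamma} V^{(1)} \\ \vdots \\ C^{(n)} + P^{(n)}_{\pi,\gamma} V^{(n)} \end{bmatrix}.$$
The Bellman Network operator needs to be treated as a joint operator on all the GVFs because of compositional predictions, where the prediction on the next step of GVF $j$ is the cumulant for GVF $i$. When iterating the Bellman operator, $V^{(j)}$ is not only involved in its own Bellman equation, but also in the Bellman equation for $V^{(i)}$. Without compositions, the Bellman Network operator would separate into individual Bellman operators that operate on each $V^{(j)}$ independently.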
To use such a Bellman Network operator, we need to ensure that iterating under this operator converges to a fixed point. The result is relatively straightforward, simply requiring that the connections between GVFs be acyclic. For example, GVF $i$ cannot be a cumulant for GVF $j$ if $j$ is already a cumulant for $i$. More generally, the connections between GVFs cannot create a cycle, such as $1 \to 2 \to 3 \to 1$. We show that this condition is sufficient for convergence, and provide a counterexample illustrating that iteration can diverge when it is violated. We provide the proofs for the below results in Appendix A.
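Theorem 1. Let $G$ be the directed graph where each vertex corresponds to a GVF node and each directed edge $(i, j)$ indicates that $j$ is a cumulant for $i$. If each individual GVF Bellman operator is a contraction with rate at most $\gamma_c < 1$ and $G$ is acyclic, iterating $V_{k+1} = \mathcal{B} V_k$ converges to a unique fixed point.
Proposition 1. There exists an MDP and policy such that, for two GVFs in a cycle, iteration with the Bellman Network operator diverges.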
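As an illustration of the convergence result, the following sketch iterates a two-GVF Bellman Network operator in a tabular setting and compares the result to the closed-form fixed point. It makes several simplifying assumptions that are not from the paper: the three states of a small deterministic chain stand in for histories, both GVFs share one policy (folded into the transition matrix `P`), the discounts are constant, and the names `composition`, `is_acyclic` and `bellman_network` are hypothetical. GVF `a` is compositional: its cumulant is the next-step value of GVF `b`.

```python
from graphlib import TopologicalSorter, CycleError
import numpy as np

# Each key maps a GVF to the set of GVFs it uses as cumulants.
composition = {"a": {"b"}, "b": set()}

def is_acyclic(graph):
    """Return True if the composition graph admits a topological ordering."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False

P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])      # deterministic 3-cycle under the policy
c_b = np.array([0.0, 0.0, 1.0])      # cumulant of GVF b, defined on the next state
gamma_a, gamma_b = 0.5, 0.9          # constant discounts (assumed)

def bellman_network(V_a, V_b):
    """Apply the joint operator once; both GVFs are updated from the old values."""
    new_a = P @ V_b + gamma_a * (P @ V_a)   # cumulant of a is b's next-step value
    new_b = P @ c_b + gamma_b * (P @ V_b)
    return new_a, new_b

V_a, V_b = np.zeros(3), np.zeros(3)
for _ in range(500):
    V_a, V_b = bellman_network(V_a, V_b)

# Closed-form fixed points, solved in topological order (b first, then a).
I = np.eye(3)
V_b_star = np.linalg.solve(I - gamma_b * P, P @ c_b)
V_a_star = np.linalg.solve(I - gamma_a * P, P @ V_b_star)
```

Solving for `b` before `a` mirrors the topological ordering used in the proof: once the non-compositional GVF has settled, the GVF that depends on it sees a fixed cumulant.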
With a valid Bellman Network operator, we can proceed to approximating the fixed point. The above fixed point equation assumes a tabular setting, where the values can be estimated directly for each history. GVFNs, however, have a restricted functional form, where the value estimates must be a parametrized function of the current observation and value predictions from the last time step. Under such a functional form, it is unlikely that we can exactly solve for the fixed point. (One approach to exactly solve such an equation has been to define a belief state, as in POMDPs, and solve for the value function as a function of belief state. These approaches guarantee that the fixed point can be identified; however, they also require that the belief state be identified and are known to be NP-hard.) Rather, we will solve for a projected fixed point, which projects into the space of representable value functions.
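Define the space of functions as
$$\mathcal{F} = \{ V_\theta : \theta \in \Theta \},$$
where $V_\theta$ stacks the predictions produced by the state-update function $f_\theta$ for every history, and the projection operator
$$\Pi_{\mathcal{F}} V = \arg\min_{\hat{V} \in \mathcal{F}} \| \hat{V} - V \|_d^2,$$
where $d$ is the stationary distribution over histories, when following the behaviour policy $\mu$. The MSPBNE, for GVFs $\{(\pi_j, c_j, \gamma_j)\}_{j=1}^n$ and state-update function $f_\theta$ parameterized by $\theta$, is
$$\text{MSPBNE}(\theta) = \| \Pi_{\mathcal{F}} \mathcal{B}_\theta V_\theta - V_\theta \|_d^2,$$
where $\mathcal{B}_\theta$ is parameterized by the weights $\theta$ because the cumulant could be related to the value prediction for another GVF. If the cumulants do not include composition, then $\mathcal{B}_\theta$ is simply constant in terms of $\theta$. A variant of the MSPBNE has been previously introduced for TD networks (Silver, 2012); the above generalizes that MSPBNE to GVF networks. Because it is a strict generalization, we use the same name.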
There are a variety of possible strategies to optimize the MSPBNE for GVFNs, similarly to how there are a variety of strategies to optimize the MSPBE for GVFs. We can compute a gradient of the MSPBNE, using similar approaches to those used for learning nonlinear value functions (Maei et al., 2010) and for the MSPBNE for TD networks (Silver, 2012). We derive a full gradient strategy, which we call Recurrent GTD (see Equation (20) in the Appendix). Additionally, however, we can propose semi-gradient approximations, including Recurrent TD and even simpler approximations that simply do TD($\lambda$) for this step, without computing gradients back-in-time. We find in our experiments that training GVFNs is robust to these choices, suggesting that constraining the hidden states to be predictions can significantly simplify learning a state-update function.
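The simplest of these options, a one-step semi-gradient TD update with no gradients back-in-time, might look like the sketch below. This is not the Recurrent GTD algorithm derived in the appendix; it is a minimal illustration under stated assumptions: a single-layer sigmoid state update, a TD(0) target, per-GVF importance-sampling ratios `rhos` for off-policy correction, and the hypothetical function name `gvfn_td_step`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gvfn_td_step(theta, s_prev, o_t, o_next, cumulants, gammas, rhos, alpha=0.1):
    """One semi-gradient TD(0) update for all GVFs, treating s_prev as a constant.

    theta: (n, n + m) weights, one row per GVF.
    cumulants, gammas, rhos: length-n arrays evaluated for the transition to t + 1
    (rhos are importance-sampling ratios; all ones when on-policy).
    Returns the updated weights and the state s_t.
    """
    x_t = np.concatenate([s_prev, o_t])
    s_t = sigmoid(theta @ x_t)                   # predictions at time t
    x_next = np.concatenate([s_t, o_next])
    s_next = sigmoid(theta @ x_next)             # bootstrap predictions at t + 1
    delta = cumulants + gammas * s_next - s_t    # one TD error per GVF
    # d s_t[j] / d theta[j] with s_prev held fixed, i.e. truncation p = 1
    grad = (s_t * (1.0 - s_t))[:, None] * x_t[None, :]
    theta = theta + alpha * (rhos * delta)[:, None] * grad
    return theta, s_t

# Toy usage with two GVFs and a one-dimensional observation.
theta0 = np.zeros((2, 3))
theta1, s1 = gvfn_td_step(
    theta0, s_prev=np.zeros(2), o_t=np.array([1.0]), o_next=np.array([1.0]),
    cumulants=np.array([1.0, 0.0]), gammas=np.array([0.0, 0.9]), rhos=np.ones(2),
)
```

The update touches only quantities available at times $t$ and $t + 1$, so its per-step cost does not depend on how far back the relevant dependency lies.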
4 Connection to other predictive state approaches
The idea that an agent's knowledge might be represented as predictions has a long history in machine learning. The first references to such a predictive approach can be found in the work of Cunningham (1972), Becker (1973), and Drescher (1991), who hypothesized that agents would construct their understanding of the world from interaction, rather than human engineering. These ideas inspired work on predictive state representations (PSRs) (Littman et al., 2001), as an approach to modeling dynamical systems. Simply put, a PSR can predict all possible interactions between an agent and its environment by reweighting a minimal collection of core tests (sequences of actions and observations) and their predictions, without the need for a finite history or dynamics model. Extensions to high-dimensional continuous tasks have demonstrated that the predictive approach to dynamical system modeling is competitive with state-of-the-art system identification methods (Hsu et al., 2012). PSRs can be combined with options (Wolfe and Singh, 2006), and preliminary work suggests discovery of the core tests is possible (McCracken and Bowling, 2005). One important limitation of the PSR formalism, including prior combinations of PSRs and RNNs (Downey et al., 2017; Choromanski et al., 2018), is that the agent's internal representation of state must be composed exclusively of probabilities of observation sequences.
A TD network (Sutton and Tanner, 2004) is similarly composed of predictions, and updates using the current observation and previous step predictions like an RNN. TD networks with options (Rafols et al., 2005) condition the predictions on temporally extended actions similar to GVF networks, but do not incorporate several of the recent modernizations of GVFs, including state-dependent discounting and convergent off-policy training methods. The key differences, then, between GVF networks and TD networks are in how the question networks are expressed and subsequently how they can be answered. GVF networks are less cumbersome to specify, because they use the language of GVFs. Further, once in this language, it is more straightforward to apply algorithms designed for learning GVFs.
Finally, there has been some work on learning and using a collection of GVFs. Originally GVFs were introduced as part of the Horde architecture (Sutton et al., 2011), though experiments were limited to learning a dozen non-compositional GVFs. Schaul and Ring (2013) showed how a collection of optimal GVFs (not learned while the system was operating) provides a better state representation for a reward-maximizing task than a collection of optimal PSR predictions. Beyond the original, potentially divergent TD network algorithms, Silver (2012) introduced Gradient TD networks (GTDN) and specified a valid gradient-descent update rule for TD networks. The GTDN formulation provides a way to learn a network of predictions, but is restricted to on-policy learning, and the experiments were limited to one-step prediction ($\gamma$ always zero). GVFNs enable off-policy learning of many what-if predictions about many different policies, independent of the behaviour policy used to learn them. Makino and Takagi (2008) incrementally discovered TD networks, building a restricted collection of GVFs, and demonstrated effective learning on several benchmark POMDP tasks.
5 Experiments
5.1 Environments
We evaluate the performance of GVFNs on three partially observable domains: Cycle World, Ring World, and Compass World. These environments are designed to have long temporal dependencies back in time, and enable systematic investigation of the sensitivity of the truncation level in p-BPTT to this horizon. We compare against RNNs using GRUs, which are designed for long temporal dependencies, like LSTMs. We investigate three questions: 1) can GVFNs learn accurate predictions, 2) how robust are GVFNs to the truncation level in p-BPTT, and 3) does using GVFs explicitly as state provide more benefit than using the same GVFs in an alternative way, as in auxiliary tasks.
Cycle World is a six-state domain (Tanner and Sutton, 2005a) where the agent steps forward through a cycle deterministically. All the states are indistinguishable except state six. The observation vector is simply a two-bit binary encoding indicating whether the agent is in state six or not. The goal is to predict the observation bit on the next time step.
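A minimal sketch of Cycle World as described above is given here for concreteness. The class name `CycleWorld`, the zero-based state indexing, and the ordering of the two observation bits are assumptions made for illustration.

```python
import numpy as np

class CycleWorld:
    """Deterministic cycle of n_states states; only the last one is distinguishable."""

    def __init__(self, n_states=6):
        self.n_states = n_states
        self.state = 0  # zero-indexed, so 'state six' is index 5

    def observe(self):
        active = self.state == self.n_states - 1
        # two-bit encoding [bit, 1 - bit]; the bit order is an assumption
        return np.array([1.0, 0.0]) if active else np.array([0.0, 1.0])

    def step(self):
        """Move forward one state and return the new observation."""
        self.state = (self.state + 1) % self.n_states
        return self.observe()

env = CycleWorld()
bits = [int(env.step()[0]) for _ in range(12)]
print(bits)  # [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0]
```

Predicting the next bit requires knowing the position in the cycle, which the current observation alone reveals only in state six.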
Ring World is a six-state ring, similar to Cycle World, but the agent can move forward or backward. The observation vector is again a binary encoding indicating whether the agent is in state six or not, but contains four bits in order to also encode the previous action. The task is 5-Markov, indicating that history-based approaches must store the 5 most recent observations to make accurate predictions. This extension of Cycle World is used to investigate the effect of off-policy learning. The behaviour policy is to randomly select between the actions forward and backward.
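Ring World can be sketched in the same way. The exact four-bit layout is not spelled out in the text, so the version below assumes that the two-bit observation is written into the slot belonging to the action just taken; that layout, the uniform behaviour policy, and all names are assumptions made for illustration.

```python
import numpy as np

FORWARD, BACKWARD = 0, 1

class RingWorld:
    """Six-state ring where the agent can step forward or backward."""

    def __init__(self, n_states=6, seed=0):
        self.n_states = n_states
        self.state = 0  # zero-indexed, so 'state six' is index 5
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        move = 1 if action == FORWARD else -1
        self.state = (self.state + move) % self.n_states
        bit = float(self.state == self.n_states - 1)
        obs = np.zeros(4)
        # assumed layout: the [bit, 1 - bit] pair goes in the slot of the action taken
        obs[2 * action: 2 * action + 2] = [bit, 1.0 - bit]
        return obs

    def behaviour_action(self):
        """Uniformly random behaviour policy over the two actions (assumed uniform)."""
        return int(self.rng.integers(2))

env = RingWorld()
first = env.step(BACKWARD)   # from index 0 to index 5, i.e. into state six
second = env.step(FORWARD)   # back to index 0
```

Since the behaviour policy is random while the GVF policies are fixed, data gathered this way is off-policy with respect to the questions being learned.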
Compass World is a gridworld (Rafols et al., 2005) where the agent can only see the colour immediately in front of it. There are four walls, with different colours; the agent observes this colour if it takes the action forward in front of the wall. Otherwise, the agent just sees white. There are five colours in total, with one wall having two colours, making it more difficult to predict. The observation encoding includes two bits for each colour: one to indicate that the colour is active and the other to indicate that some other colour is active. Similar to Ring World, the full observation vector is encoded based on which action was taken, and includes a bias unit. The behaviour policy chooses randomly between moving one step forward, turning right or left for one step, moving forward until the wall is reached (leap), or randomly selecting actions for $k$ steps (wander).
5.2 Algorithms and Architectures
To specify GVFNs, we need to define a collection of GVFs and specify the connections between compositional predictions as required. For Cycle World and Ring World we arrange the GVFs in a way that is difficult but possible to learn from interaction with the world, as detailed in previous work on these domains (Tanner and Sutton, 2005b). The architecture used in Compass World is composed of 64 GVFs. We describe the GVFs in more detail, for all three domains, in Appendix D. We additionally added one useful GVF to the Cycle World, to examine the impact of less adversarially defined GVFs. This echo GVF has a constant discount $\gamma$ and terminates when the observed bit becomes one. This GVF reflects the distance to state six, where the bit is one, from every other state.
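The sense in which the echo GVF reflects distance can be checked with a small tabular computation on the underlying cycle. The discount value `0.9` below is a placeholder chosen for illustration, and the function name `echo_gvf_values` is hypothetical; the cumulant is taken to be the observed bit, with termination ($\gamma = 0$) when that bit becomes one.

```python
def echo_gvf_values(n_states=6, discount=0.9):
    """Tabular values of an echo-style GVF on a deterministic cycle.

    discount is a placeholder value. The cumulant is the observed bit at the next
    state, and the GVF terminates (gamma = 0) when that bit becomes one.
    """
    values = [0.0] * n_states
    # Index n_states - 1 is 'state six'; fill in values walking backwards from it.
    for steps in range(1, n_states + 1):
        state = (n_states - 1 - steps) % n_states
        nxt = (state + 1) % n_states
        bit = 1.0 if nxt == n_states - 1 else 0.0
        gamma = 0.0 if bit else discount
        values[state] = bit + gamma * values[nxt]
    return values

values = echo_gvf_values()
# values[k] equals discount ** (d - 1), where d is the number of steps to state six
```

Each state gets a distinct value, so this single prediction is enough to disambiguate position in the cycle, which is why it is a helpful addition to the network.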
The GRUs are trained with truncated BPTT to minimize the mean squared TD error (MSTDE) with different truncation levels. The RNNs are given exactly the same number of units as the GVFN. We also train the GRUs with the GVFs as auxiliary tasks in Cycle World and Compass World. This experiment is meant to isolate the impact of learning additional predictions on representation learning. If the GRU with auxiliary predictions performs as well as the GVFN, then the result lends support to the auxiliary task effect over the predictive representation hypothesis.
5.3 Results
We first consider overall performance across the three domains, shown in Figures 2(a), 3(a) and 4(a). The most compelling results comparing to GRUs are in Ring World and Compass World, in Figures 3 and 4. In Ring World, the performance of the GRU is significantly worse than that of GVFNs. The figure depicts the learning curves for a variety of truncation levels. For all truncation levels, GVFNs converge to near zero error. GRUs, on the other hand, can do almost as well for larger truncation levels, but require truncation larger than the length of the Ring World and fail for shorter truncation levels. This is even more stark in Compass World, where the GRU truncation level was swept as high as 256, and the prediction performance was poor for all values tested. GVFNs, on the other hand, were robust to this truncation, enabling learning even with just one-step gradient updates. This result highlights that, for appropriately specified GVF questions, GVFNs can learn in partially observable domains with long temporal dependencies, with simple TD-update rules.
To further test GVFNs, with more ill-specified GVFs in the network, we tested a more difficult network configuration in Cycle World, shown in Figure 2. Again, for a sufficiently large truncation level (the length of the cycle), GRUs can perform well, converging faster than GVFNs. However, for a truncation level less than the length of the cycle (a level of five), GRUs again fail. GVFNs, on the other hand, can converge with a truncation level of four, though in this case they can no longer converge with just one-step gradient updates (i.e., a truncation level of one), as shown in Figure 2(b). Once we add the echo GVF, GVFNs can converge with less truncation and, though not depicted here, can even converge with one-step updates when using traces (i.e., with TD($\lambda$)). This indicates that in some cases GVFNs do need some number of gradients back-in-time, but with a reasonable set of GVFs, particularly those that are not designed to be difficult to predict, GVFNs can use simple one-step updates. Moreover, in either case, GVFNs are more robust to truncation level than GRUs.
We then examined, in more depth, the impact of truncation in Cycle World and Ring World. In Cycle World, as mentioned above, GVFNs are more robust to truncation level, even with a poorly specified network. In Ring World, in Figure 3(b), GVFNs converged to a near-zero solution for all truncation levels after 100k steps. GRUs, on the other hand, could not reach the same error level even with a longer truncation level. The figures also provide dotted lines showing performance in early learning. Interestingly, GVFNs actually perform a bit worse with less truncation, likely because computing gradients further back-in-time makes training less stable.
6 Conclusion
In this work, we made a case for a new recurrent architecture, called GVF Networks, that can be trained without back-propagation through time. We first derived a sound fixed-point objective for these networks. We then showed in experiments that GVFNs can outperform GRUs, without requiring gradients to be computed (far) back in time. We demonstrated that this is particularly true for GVFNs with an expert set of GVF questions, but that good performance could also be obtained with a naive generation strategy for GVFs, still outperforming the best GRU model.
A natural extension is to consider a GVFN that only constrains certain hidden states to be predictions and otherwise allows other states to simply be set to improve prediction accuracy for the targets. It is in fact straightforward to learn GVFNs with this modification, as will become obvious below in developing the algorithm to learn GVFNs. Additionally, GVFNs could even be combined with other RNN types, like LSTMs, by simply concatenating the states learned by the two RNN types. Overall, GVFNs provide a complementary addition to the many other RNN architectures available, particularly for continual learning systems with long temporal dependencies; with this work, we hope to expand interest and investigation further into these promising RNN models.
- Becker  Joseph D Becker. A model for the encoding of experiential information. Computer Models of Thought and Language, 1973.
- Cho et al.  Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. In International Conference on Learning Representations, 2014.
- Choromanski et al.  Krzysztof Choromanski, Carlton Downey, and Byron Boots. Initialization matters: Orthogonal predictive state recurrent neural networks. In International Conference on Learning Representations, 2018.
- Cunningham  Michael Cunningham. Intelligence: Its Organization and Development. Academic Press, 1972.
- Downey et al.  Carlton Downey, Ahmed Hefny, Byron Boots, Geoffrey J Gordon, and Boyue Li. Predictive State Recurrent Neural Networks. In Advances in Neural Information Processing Systems, 2017.
- Drescher  Gary L Drescher. Made-up minds: a constructivist approach to artificial intelligence. MIT Press, 1991.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, 1997.
- Hopfield  J J Hopfield. Neural Network and Physical Systems with Emergent Collective Computational Abilities. Proceedings of the National Academy of Sciences of the United States of America, 1982.
- Hsu et al.  D Hsu, SM Kakade, and Tong Zhang. A spectral algorithm for learning Hidden Markov Models. Journal of Computer and System Sciences, 2012.
- Jaderberg et al.  Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
- Lin and Mitchell  Long-Ji Lin and Tom M Mitchell. Reinforcement learning with hidden states. In International Conference on Simulation of Adaptive Behavior, 1993.
- Littman et al.  Michael L Littman, Richard S Sutton, and S Singh. Predictive representations of state. In Advances in Neural Information Processing Systems, 2001.
- Maei  H Maei. Gradient Temporal-Difference Learning Algorithms. PhD thesis, University of Alberta, 2011.
- Maei et al.  H Maei, C Szepesvári, S Bhatnagar, and R Sutton. Toward Off-Policy Learning Control with Function Approximation. In International Conference on Machine Learning, 2010.
- Maei et al.  HR Maei, C Szepesvári, S Bhatnagar, D Precup, D Silver, and Richard S Sutton. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, 2009.
- Makino and Takagi  Takaki Makino and Toshihisa Takagi. On-line discovery of temporal-difference networks. In International Conference on Machine Learning, 2008.
- McCallum  R A McCallum. Learning to use selective attention and short-term memory in sequential tasks. In International Conference on Simulation of Adaptive Behavior, 1996.
- McCracken and Bowling  P McCracken and Michael H Bowling. Online discovery and learning of predictive state representations. In Advances in Neural Information Processing Systems, 2005.
- Modayil et al.  Joseph Modayil, Adam White, and Richard S Sutton. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior - Animals, Animats, Software Agents, Robots, Adaptive Systems, 2014.
- Pascanu et al.  Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 2013.
- Pearlmutter  Barak A Pearlmutter. Fast Exact Multiplication by the Hessian. Neural Computation, 1994.
- Rafols et al.  Eddie J Rafols, Mark B Ring, Richard S Sutton, and Brian Tanner. Using predictive representations to improve generalization in reinforcement learning. In International Joint Conference on Artificial Intelligence, 2005.
- Schaul and Ring  Tom Schaul and Mark Ring. Better generalization with forecasts. In International Joint Conference on Artificial Intelligence, 2013.
- Silver  D Silver. Gradient Temporal Difference Networks. In European Workshop on Reinforcement Learning, 2012.
- Sutton and Tanner  Richard S Sutton and Brian Tanner. Temporal-Difference Networks. In Advances in Neural Information Processing Systems, 2004.
- Sutton et al.  Richard S Sutton, J Modayil, M Delp, T Degris, P.M. Pilarski, A White, and D Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and Multiagent Systems, 2011.
- Tanner and Sutton [2005a] B Tanner and Richard S Sutton. Temporal-Difference Networks with History. In International Joint Conference on Artificial Intelligence, 2005a.
- Tanner and Sutton [2005b] Brian Tanner and Richard S Sutton. TD($\lambda$) networks. In International Conference on Machine Learning, 2005b.
- White  Adam White. Developing a predictive approach to knowledge. PhD thesis, University of Alberta, 2015.
- Williams and Zipser  Ronald J Williams and David Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1989.
- Wolfe and Singh  Britton Wolfe and Satinder P Singh. Predictive state representations with options. In International Conference on Machine Learning, 2006.
Appendix A Proofs of theorems
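Theorem 1. Let $G$ be the directed graph where each vertex corresponds to a GVF node and each directed edge $(i, j)$ indicates that $j$ is a cumulant for $i$. If each individual GVF Bellman operator is a contraction with rate at most $\gamma_c < 1$ and $G$ is acyclic, iterating $V_{k+1} = \mathcal{B} V_k$ converges to a unique fixed point.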
We cannot rely on the Bellman network operator being a contraction on each step. Rather, we explicitly prove that the sequence converges (Part 1) and that it converges to a unique fixed point (Part 2 and 3).
Part 1: The sequence defined by converges to a limit .
Because is acyclic, we have a linear topological ordering of the vertices, , where for each directed edge , comes before in the ordering. Therefore, starting from the last GVF , we know that the Bellman operator is a contraction with rate , and so iterating for steps results in the error
As , converges to its fixed point.
We will use induction for the argument, with the above as the base case. Assume for all there exists a ball of radius where and as . Consider the next GVF in the ordering, .
Case 1: If does not have another GVF as its cumulant, then iterating with independently iterates and so converges because the Bellman operator is a contraction, and so clearly such an exists.
Case 2: If has precisely one GVF that is its cumulant, then after some steps, we know the change gets very small. As a result, the change in is
For sufficiently large , can be made arbitrarily small. If , then
and so the iteration is a contraction on step . Else, if , then this implies the difference is already within a very small ball, with radius . As , the difference can oscillate between being within this ball, which shrinks to zero, or being iterated with a contraction that also shrinks the difference. In either case, there exists an such that , where as .
Case 3: If has a weighted sum of GVFs as its cumulant, the argument is similar to Case 2, simply with a weighted sum of .
Therefore, because we have such an for all GVFs in the network, we know the sequence converges.
Part 2: is a fixed point of .
Because the Bellman network operator is continuous, the limit can be taken inside the operator
Part 3: is the only fixed point of .
Consider an alternative solution . Then, because of the uniqueness of fixed points under Bellman operators, all those GVFs that have non-compositional cumulants have unique fixed points and so those components in must be the same as . Then, all the GVFs next in the ordering that use those GVFs as cumulants have a unique cumulant, and so must then also converge to a unique value, because their Bellman operators with fixed GVFs as cumulants have a unique fixed point. This argument continues for the remaining GVFs in the ordering. ∎
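The convergence argument can be checked numerically on a toy problem. The following is a minimal sketch, not the construction used in the proof: it assumes a two-state Markov chain under a fixed policy, a non-compositional GVF (GVF 2) with a per-state expected cumulant and constant discount, and a second GVF (GVF 1) whose cumulant is GVF 2's prediction at the next state. The transition matrix, cumulant, discounts, and all function names are hypothetical and chosen only for illustration.

```python
def matvec(P, v):
    """Multiply a row-stochastic matrix P (list of rows) by a vector v."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def bellman_network_step(P, c, gamma1, gamma2, v1, v2):
    """One synchronous sweep of the (assumed) Bellman network operator.

    GVF 2: cumulant c (expected value per state), discount gamma2.
    GVF 1: cumulant is GVF 2's prediction at the next state, discount gamma1.
    The graph is acyclic: GVF 1 depends on GVF 2, but not the reverse.
    """
    Pv1, Pv2 = matvec(P, v1), matvec(P, v2)
    new_v2 = [ci + gamma2 * x for ci, x in zip(c, Pv2)]
    new_v1 = [y + gamma1 * x for y, x in zip(Pv2, Pv1)]
    return new_v1, new_v2


def iterate_acyclic(P, c, gamma1, gamma2, steps):
    """Iterate from zero and record the largest per-step change."""
    v1, v2 = [0.0] * len(P), [0.0] * len(P)
    gaps = []
    for _ in range(steps):
        n1, n2 = bellman_network_step(P, c, gamma1, gamma2, v1, v2)
        gaps.append(max(abs(a - b) for a, b in zip(n1 + n2, v1 + v2)))
        v1, v2 = n1, n2
    return v1, v2, gaps


if __name__ == "__main__":
    P = [[0.5, 0.5], [0.2, 0.8]]  # hypothetical transition matrix
    v1, v2, gaps = iterate_acyclic(P, [1.0, 0.0], 0.9, 0.5, 500)
    print("largest change on first step:", gaps[0])
    print("largest change on last step: ", gaps[-1])
```

In this sketch the change in GVF 2 shrinks first, since it is an ordinary Bellman iteration, and the change in GVF 1 shrinks once its cumulant has stabilised, mirroring the ordering used in the induction.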
Proposition 1 There exists an MDP and policy such that, for two GVFs in a cycle, iteration with the Bellman Network operator diverges.
Define a two state MDP and policy such that
and , where the rewards are irrelevant since the GVFs have each other as cumulants. The resulting Bellman iteration is
Since the matrix is an expansion, for many initial this iteration goes to infinity, such as initial . ∎
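The effect of a cycle can also be seen in a small numerical experiment. The sketch below is not the MDP from the proposition. It is a hypothetical two-state chain in which two GVFs share a constant discount and each uses the other's next-state prediction as its cumulant. The form of the update, the transition matrix, the discount, and the function names are all assumptions made for illustration.

```python
def matvec(P, v):
    """Multiply a row-stochastic matrix P (list of rows) by a vector v."""
    return [sum(p * x for p, x in zip(row, v)) for row in P]


def cycle_step(P, gamma, v1, v2):
    """One synchronous sweep for two GVFs that are each other's cumulant."""
    Pv1, Pv2 = matvec(P, v1), matvec(P, v2)
    new_v1 = [c + gamma * x for c, x in zip(Pv2, Pv1)]  # cumulant: GVF 2
    new_v2 = [c + gamma * x for c, x in zip(Pv1, Pv2)]  # cumulant: GVF 1
    return new_v1, new_v2


def run_cycle(P, gamma, steps):
    """Iterate from an all-ones start and record the max-norm per step."""
    v1, v2 = [1.0] * len(P), [1.0] * len(P)
    norms = []
    for _ in range(steps):
        v1, v2 = cycle_step(P, gamma, v1, v2)
        norms.append(max(abs(x) for x in v1 + v2))
    return norms


if __name__ == "__main__":
    norms = run_cycle([[0.5, 0.5], [0.2, 0.8]], 0.9, 50)
    print("max-norm after 1 step:  ", norms[0])
    print("max-norm after 50 steps:", norms[-1])
```

With an all-ones start, each sweep multiplies every entry by roughly one plus the discount, so the iterates grow geometrically for any positive discount under these assumptions.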
Appendix B Deriving an Update for GVFNs
We first recast the objective function for GVFNs in a form similar to the nonlinear MSPBE (Maei et al., 2009), which will make it more straightforward to take the gradient. The approach for taking the gradient is similar to that for the nonlinear MSPBE, since the MSPBNE is a nonlinear objective based on a projected Bellman operator, but becomes slightly more complex due to taking gradients back through time. We highlight at the end two simpler algorithms that could be used to train GVFNs, which we show in our experiments are equally effective in learning the GVFN but significantly simpler. These updates rely on the fact that GVFNs appear to be much more robust to the level of truncation when doing backpropagation-through-time, facilitating the use of updates that do a simple TD update only for the current step.
For a given history , for all GVFs, we take the gradient of their predictions w.r.t.
Assume the behaviour policy has stationary distribution for all where for any of the policies . Assume that for that is continuously differentiable as a function of for all histories where and that the matrix
is nonsingular, where represents a random vector for the history in the expectation. Then for importance sampling ratios , and TD-errors
where is a history immediately following history .
The extension is a relatively straightforward modification of the nonlinear MSPBE (Maei, 2011) and the TD-network MSPBNE (Silver, 2012). The main modification is in the extension to off-policy sampling—both allowing different and necessitating the addition of importance sampling ratio—and the extension to transition-based discounting.
Before providing the gradient of the MSPBNE, we introduce one more piece of notation to indicate compositions. Let be the mapping to give a weighted edge between GVFs, where is the weighted edge between if is a cumulant for , and if there is no connection.
Using this, we can write
Assume that is twice continuously differentiable as a function of for all histories where and that , defined in Equation (13), is non-singular in a small neighbourhood of . Then for
we get the gradient
For simplicity in notation below, we drop the explicit dependence on the random variable in the expectations.
Using this theorem, we can sample gradients for GVFNs. Like GTD, we need to estimate the second set of weights , as a quasi-stationary estimate. The key difficulty is in obtaining the gradient of the value functions w.r.t. , which requires backpropagation through time; we provide details on how to obtain these gradients in the next section. However, given access to the gradient of the value functions w.r.t. , the update is relatively straightforward. The typically preferred gradient update uses (19), rather than (18). This preferred update is often called TDC but is now also more simply labeled GTD, with GTD2 labeled as the less desirable update. Our proposed Recurrent GTD update, therefore, uses (19) and is
Given some intuition for the MSPBNE, we can also consider simpler algorithms that do not provide true gradients for this objective. Similarly to how nonlinear TD is used, in place of nonlinear GTD, we can obtain a Recurrent TD algorithm that is a semi-gradient algorithm by neglecting the gradients of the values on the next step, and neglecting the gradients through the question network.
This algorithm corresponds to Recurrent GTD with the second set of weights fixed at zero, just as GTD reduces to TD in that case. While the semi-gradient form does not optimize the MSPBNE, it is more computationally efficient than Recurrent GTD and we have found it to be generally as effective. Finally, we can go even further and update GVFNs using a very simple update: a one-step linear TD update. On each step, the state vector composed of predictions and the observation vector are concatenated to produce a feature vector .
TD() run separately for each GVF in the GVFN:
This neglects all gradients back-in-time, and simply runs instances of TD() concurrently. Given our results indicating that GVFNs are robust to the truncation level in backpropagation-through-time, it is perhaps not too surprising that this simple update was effective for GVFNs. The simplicity in training GVFNs, removing all need to maintain the network structure or compute onerous gradients, is one of the most compelling reasons to consider them as another standard recurrent architecture.
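The one-step linear TD update described above can be sketched as follows. This is a hedged illustration rather than the exact update used in the experiments: it assumes linear predictions, numeric cumulants supplied by the caller, constant per-GVF discounts, and on-policy data (no importance sampling ratios). The function name `gvfn_td_step` and its arguments are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def gvfn_td_step(W, s_prev, o_t, o_next, cumulants, gammas, alpha):
    """One concurrent one-step TD update, one weight vector per GVF.

    W:         list of weight vectors, one per GVF
    s_prev:    predictions from the previous step (the recurrent state)
    o_t:       observation vector on this step
    o_next:    observation vector on the next step
    cumulants: one cumulant value per GVF for this transition (assumed given)
    gammas:    one discount per GVF
    alpha:     step size
    Returns the updated weights and the predictions for this step.
    """
    x_t = s_prev + o_t                    # feature vector: predictions ++ observation
    s_t = [dot(w, x_t) for w in W]        # current predictions
    x_next = s_t + o_next                 # next feature vector reuses current predictions
    s_next = [dot(w, x_next) for w in W]  # bootstrapped next predictions

    new_W = []
    for w, v, v_next, c, g in zip(W, s_t, s_next, cumulants, gammas):
        delta = c + g * v_next - v        # TD error for this GVF
        # Semi-gradient step: x_t is treated as a fixed feature vector,
        # so no gradient flows back through s_prev.
        new_W.append([wi + alpha * delta * xi for wi, xi in zip(w, x_t)])
    return new_W, s_t


if __name__ == "__main__":
    W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # two GVFs, one observation feature
    W, s = gvfn_td_step(W, [0.0, 0.0], [1.0], [1.0], [1.0, 2.0], [0.5, 0.0], 0.1)
    print(W, s)
```

Treating the previous predictions as plain features is what removes the need for any gradient computation back in time in this sketch.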
Appendix C Computing gradients of the value function back through time
In this section, we show how to compute , which was needed in the algorithms. For both Backpropagation Through Time and Real Time Recurrent Learning, it is useful to take advantage of the following formula for recurrent sensitivities
where is the Kronecker delta function. Given this formula, BPTT or RTRL can simply be applied.
For Recurrent GTD, though not for Recurrent TD, we additionally need to compute the Hessian back in time, for the Hessian-vector product. The Hessian for each value function is a matrix; computing the Hessian-vector product naively would cost at least for each GVF, which is prohibitively expensive. We can avoid this using R-operators, also known as Pearlmutter’s method (Pearlmutter, 1994). The R-operator is defined as
for a (vector-valued) function and satisfies
Therefore, instead of computing the Hessian and then multiplying it by , this operation can be completed in linear time, in the length of .
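To make the idea concrete, the sketch below applies the standard form of the R-operator, the directional derivative of a function along a vector, to a hand-written gradient, which yields the Hessian-vector product without ever forming the Hessian. It is only an illustration: the test function, the `Dual` class (a minimal forward-mode dual number), and the name `r_operator` are assumptions for this example and are unrelated to the GVFN value functions.

```python
class Dual:
    """Number a + b*eps with eps**2 == 0; b carries a directional derivative."""

    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        o = Dual._lift(other)
        return Dual(self.val + o.val, self.eps + o.eps)

    __radd__ = __add__

    def __mul__(self, other):
        o = Dual._lift(other)
        # Product rule: (a + b eps)(c + d eps) = ac + (ad + bc) eps
        return Dual(self.val * o.val, self.val * o.eps + self.eps * o.val)

    __rmul__ = __mul__


def grad_f(w):
    """Hand-written gradient of the toy function f(w) = w0^2 * w1 + w1^3."""
    w0, w1 = w
    return [2 * w0 * w1, w0 * w0 + 3 * w1 * w1]


def r_operator(g, w, v):
    """Directional derivative of g at w along v, i.e. d/dr g(w + r v) at r = 0.

    When g is a gradient, the result is the Hessian-vector product H v,
    computed with a single extra pass through g.
    """
    shifted = [Dual(wi, vi) for wi, vi in zip(w, v)]
    return [Dual._lift(out).eps for out in g(shifted)]


if __name__ == "__main__":
    print(r_operator(grad_f, [1.0, 2.0], [3.0, 4.0]))
```

For this toy function the Hessian is [[2 w1, 2 w0], [2 w0, 6 w1]], so the result can be checked by hand against an explicit matrix-vector product.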
Specifically, for our setting, we have
To make the calculation more manageable, we separate it into each partial derivative for every node k and associated weight j.