1 Introduction
Much knowledge can be represented by answers to predictive questions (Sutton, 2009), for example, “to know that Joe is in the coffee room is to predict that you will see him if you went there” (Sutton, 2009). Such knowledge is referred to as predictive knowledge (Sutton, 2009; Sutton et al., 2011). General Value Functions (GVFs, Sutton et al. 2011) are commonly used to represent predictive knowledge. GVFs are essentially the same as canonical value functions (Puterman, 2014; Sutton and Barto, 2018).
However, the policy, the reward function, and the discount function associated with GVFs are usually carefully designed such that the numerical value of a GVF at certain states matches the numerical answer to certain predictive questions. In this way, GVFs can represent predictive knowledge.
Consider the concrete example in Figure 1, where a microdrone is doing a random walk. The microdrone is initialized somewhere with 100% battery. L4 is a power station where its battery is recharged to 100%. Each clockwise movement consumes 2% of the battery, and each counterclockwise movement consumes 1% (for simplicity, we assume negative battery levels, e.g., -10%, are legal). Furthermore, each movement fails with probability 1%, in which case the microdrone remains in the same location and no energy is consumed. An example of a predictive question in this system is:
Question 1.
Starting from L1, how much energy will be consumed in expectation before the next charge?
To answer this question, we can model the system as a Markov Decision Process (MDP). The policy is uniformly random and the reward for each movement is the additive inverse of the corresponding battery consumption. Whenever the microdrone reaches state L4, the episode terminates. Under this setup, the answer to Question 1 is the expected cumulative reward when starting from L1, i.e., the state value of L1. Hence, GVFs can represent the predictive knowledge in Question 1. As a GVF is essentially a value function, it can be trained with any data stream from agent-environment interaction via Reinforcement Learning (RL, Sutton and Barto 2018), demonstrating the generality of the GVF approach. However, the most appealing feature of GVFs is their compatibility with off-policy learning, making this representation of predictive knowledge scalable and efficient. For example, in the Horde architecture (Sutton et al., 2011), many GVFs are learned in parallel with gradient-based off-policy temporal difference methods (Sutton et al., 2009b, a; Maei, 2011). In the microdrone example, we can learn the answer to Question 1 under many different conditions (e.g., when the charging station is located at L2 or when the microdrone moves clockwise with probability 80%) simultaneously with off-policy learning by considering different reward functions, discount functions, and policies.

GVFs, however, cannot answer many other useful questions, e.g., if at some time $t$, we find the microdrone at L1, how much battery do we expect it to have? As such questions emphasize the influence of possible past events on the present, we refer to their answers as retrospective knowledge. Such retrospective knowledge is useful, for example, in anomaly detection. Suppose the microdrone runs for several weeks by itself while we are traveling. When we return at time $t$, we find the microdrone is at L1. We can then examine the battery level and see if it is similar to the expected battery at L1. If there is a large difference, it is likely that there is something wrong with the microdrone. There are, of course, many methods to perform such anomaly detection.
For example, we could store the full running log of the microdrone during our travel and examine it when we are back. The memory requirement to store the full log, however, increases according to the length of our travel. By contrast, if we have retrospective knowledge, i.e., the expected battery level at each location, we can program the microdrone to log its battery level at each step (overwriting the record from the previous step). We can then examine the battery level when we are back and see if it matches our expectation. The current battery level can be easily computed via the previous battery level and the energy consumed at the last step, using only constant computation per step. The storage of the battery level requires only constant memory as we do not need to store the full history, which would not be feasible for a microdrone. Thus retrospective knowledge provides a memory-efficient way to perform anomaly detection. Of course, this approach may have lower accuracy than storing the full running log. This is indeed a trade-off between accuracy and memory, and we expect applications of this approach in memory-constrained scenarios such as embedded systems.
To know the expected battery level at L1 at time $t$ is essentially to answer the following question:
Question 2.
How much energy do we expect the microdrone to have consumed since the last time it had 100% battery given that it is at L1 at time $t$?
Unfortunately, GVFs cannot represent retrospective knowledge (e.g., the answer to Question 2) easily. GVFs provide a mechanism to ignore all future events after reaching certain states via setting the discount function at those states to be 0. This mechanism is useful for representing predictive knowledge. For example, in Question 1, we do not care about events after the next charge. For retrospective knowledge, we, however, need a mechanism to ignore all previous events before reaching certain states. For example, in Question 2, we do not care about events before the last time the microdrone had 100% battery. Unfortunately, GVFs do not have such a mechanism. In Appendix A, we describe several tricks that attempt to represent retrospective knowledge with GVFs and explain why they are invalid.
In this paper, we propose Reverse GVFs to represent retrospective knowledge. Using the same MDP formulation of the microdrone system, let the random variable $\bar{G}_t$ denote the energy the microdrone has consumed at time $t$ since the last time it had 100% battery. To answer Question 2, we are interested in the conditional expectation of $\bar{G}_t$ given that $S_t = \text{L1}$, where $S_t$ is the location at time $t$. We refer to such conditional expectations as Reverse GVFs, which we propose to learn via Reverse Reinforcement Learning. The key idea of Reverse RL is still bootstrapping, but in the reverse direction. It is easy to see that $\bar{G}_t$ depends on $\bar{G}_{t-1}$ and the energy consumption from $t-1$ to $t$. In general, in Reverse RL, the quantity of interest at time $t$ depends on that at time $t-1$. This idea of bootstrapping from the past has been explored by Wang et al. (2007, 2008); Hallak and Mannor (2017); Gelada and Bellemare (2019); Zhang et al. (2020c) but was limited to the density ratio learning setting. We propose several Reverse RL algorithms and prove their convergence under linear function approximation. We also propose Distributional Reverse RL algorithms akin to Distributional RL (Bellemare et al., 2017; Dabney et al., 2017; Rowland et al., 2018) to compute the probability of an event for anomaly detection. We demonstrate empirically the utility of Reverse GVFs in anomaly detection and representation learning.

Besides Reverse RL, there are other approaches we could consider for answering Question 2. For example, we could formalize it as a simple regression task, where the input is the location and the target is the power consumption since the last time the microdrone had 100% battery. We show below that this regression formulation is a special case of Reverse RL, similar to how Monte Carlo is a special case of temporal difference learning (Sutton, 1988). Alternatively, answering Question 2 is trivial if we have formulated the system as a Partially Observable MDP. We could use either the location or the battery level as the state and the other as the observation. In either case, however, deriving the conditional observation probabilities is nontrivial. We could also model the system as a reversed chain directly, as in Morimura et al. (2010), in light of reverse bootstrapping. This, however, creates difficulties in off-policy learning, which we discuss in Section 5.
2 Background
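We consider an infinite-horizon Markov Decision Process (MDP) with a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a transition kernel $p: \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$, and an initial distribution $\mu_0: \mathcal{S} \to [0, 1]$. In the GVF framework, users define a reward function $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, a discount function $\gamma: \mathcal{S} \to [0, 1]$, and a policy $\pi: \mathcal{A} \times \mathcal{S} \to [0, 1]$ to represent certain predictive questions. An agent is initialized at $S_0$ according to $\mu_0$. At time step $t$, the agent at state $S_t$ selects an action $A_t$ according to $\pi(\cdot \mid S_t)$, receives a bounded reward $R_{t+1}$ satisfying $\mathbb{E}[R_{t+1}] = r(S_t, A_t)$, and proceeds to the next state $S_{t+1}$ according to $p(\cdot \mid S_t, A_t)$. We then define the return at time step $t$ recursively as
$$G_t \doteq R_{t+1} + \gamma(S_{t+1}) G_{t+1}, \tag{1}$$
which allows us to define the general value function $v_\pi(s) \doteq \mathbb{E}[G_t \mid S_t = s]$. (Footnote 1: For a full treatment of GVFs, one can use a transition-dependent reward function $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ and a transition-dependent discount function $\gamma: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ as suggested by White (2017). In this paper, we consider $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ and $\gamma: \mathcal{S} \to [0, 1]$ for ease of presentation. All the results presented in this paper can be directly extended to transition-dependent reward and discount functions.) The general value function $v_\pi$ is essentially the same as the canonical value function (Puterman, 2014; Sutton and Barto, 2018). The name general emphasizes its usage in representing predictive knowledge. In the microdrone example (Figure 1), we define the reward function as $r(s, a_0) = -2$ and $r(s, a_1) = -1$ for all $s$, where $a_0$ is moving clockwise and $a_1$ is moving counterclockwise. We define the discount function as $\gamma(\text{L4}) = 0$ and $\gamma(s) = 1$ for $s \neq \text{L4}$. Then it is easy to see that the numerical value of $v_\pi(\text{L1})$ is the answer to Question 1. In the rest of the paper, we use functions and vectors interchangeably, e.g., we also interpret $v_\pi$ as a vector in $\mathbb{R}^{|\mathcal{S}|}$. Furthermore, all vectors are column vectors.

The general value function $v_\pi$ is the unique fixed point of the generalized Bellman operator $\mathcal{T}$ (Yu et al., 2018): $\mathcal{T} y \doteq r_\pi + P_\pi \Gamma y$, where $P_\pi \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ is the state transition matrix, i.e., $P_\pi(s, s') \doteq \sum_a \pi(a \mid s) p(s' \mid s, a)$, $r_\pi \in \mathbb{R}^{|\mathcal{S}|}$ is the reward vector, i.e., $r_\pi(s) \doteq \sum_a \pi(a \mid s) r(s, a)$, and $\Gamma$ is a diagonal matrix whose $s$-th diagonal entry is $\gamma(s)$. To ensure $v_\pi$ is well-defined, we assume $\pi$ and $\gamma$ are defined such that $(I - P_\pi \Gamma)^{-1}$ exists (Yu, 2015). Then if we interpret $1 - \gamma(s)$ as the probability for an episode to terminate at $s$, we can assume termination occurs w.p. 1.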
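Because the general value function is the fixed point of a linear operator, Question 1 can be answered by solving one linear system. The sketch below does this for the microdrone. It is illustrative only and rests on assumptions not stated in the text: the clockwise order is taken to be L1, L2, L3, L4 (Figure 1 is not reproduced here), the names `build_model` and `solve_gvf` are hypothetical, and a failed move is modelled as a zero-reward self-transition, so the expected reward already accounts for the 1% failure rate.

```python
import numpy as np

# Hypothetical encoding: indices 0..3 stand for L1..L4.
# Assumption: clockwise order is L1 -> L2 -> L3 -> L4 -> L1.
N_STATES = 4
P_FAIL = 0.01                 # a movement fails; the drone stays and uses no energy
COST = {+1: 2.0, -1: 1.0}     # battery % for a clockwise (+1) / counterclockwise (-1) move


def build_model():
    """Return (P_pi, r_pi) for the uniformly random policy."""
    P = np.zeros((N_STATES, N_STATES))
    r = np.zeros(N_STATES)
    for s in range(N_STATES):
        for direction, cost in COST.items():
            prob_action = 0.5
            s_next = (s + direction) % N_STATES
            P[s, s_next] += prob_action * (1.0 - P_FAIL)
            P[s, s] += prob_action * P_FAIL
            # Reward is the additive inverse of the energy actually consumed.
            r[s] += prob_action * (1.0 - P_FAIL) * (-cost)
    return P, r


def solve_gvf(P, r, gamma):
    """Solve v = r + P Gamma v, the fixed point of the generalized Bellman operator."""
    Gamma = np.diag(gamma)
    return np.linalg.solve(np.eye(len(r)) - P @ Gamma, r)


P_pi, r_pi = build_model()
gamma = np.array([1.0, 1.0, 1.0, 0.0])   # the discount is zero at the charging station L4
v_pi = solve_gvf(P_pi, r_pi, gamma)
expected_energy_from_L1 = -v_pi[0]        # rewards are negated energy
print(f"Expected energy before the next charge from L1: {expected_energy_from_L1:.2f}%")
```

Under these assumptions the solve gives 4.5% for L1. Setting the discount to zero at L4 removes the L4 column of $P_\pi \Gamma$, which is how the system "forgets" everything after the next charge.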
3 Reverse General Value Function
Inspired by the return $G_t$, we define the reverse return $\bar{G}_t$, which accumulates previous rewards:
$$\bar{G}_t \doteq R_t + \gamma(S_{t-1}) \bar{G}_{t-1}, \qquad \bar{G}_0 \doteq 0. \tag{2}$$
In the reverse return $\bar{G}_t$, the discount function $\gamma$ has different semantics than in the return $G_t$. Namely, in $G_t$, the discount function downweights future rewards, while in $\bar{G}_t$, the discount function downweights past rewards. In an extreme case, setting $\gamma(S_{t-1}) = 0$ allows us to ignore all the rewards before time $t$ when computing the reverse return $\bar{G}_t$, which is exactly the mechanism we need to represent retrospective knowledge.
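The reverse return can be maintained online with a single running scalar, which is what makes the constant-memory logging described in the introduction possible. The following sketch is a minimal illustration under two assumptions: the recursion is "current reward plus the previous state's discount times the previous reverse return", and the reverse return starts at zero. The function names and the toy trajectory are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def reverse_returns(
    trajectory: Iterable[Tuple[str, float]],
    gamma: Callable[[str], float],
    initial_state: str,
) -> List[float]:
    """Compute the reverse return after each step.

    `trajectory` yields (S_t, R_t) for t = 1, 2, ...; `initial_state` is S_0.
    Only `g_bar` and `prev_state` are carried between steps; the output list
    exists purely so the example can be inspected.
    """
    g_bar = 0.0                      # assumed base case at t = 0
    prev_state = initial_state
    history = []
    for state, reward in trajectory:
        # The discount of the *previous* state decides how much of the past is kept.
        g_bar = reward + gamma(prev_state) * g_bar
        history.append(g_bar)
        prev_state = state
    return history


def microdrone_gamma(state: str) -> float:
    """Discount is zero at the charging station, one elsewhere."""
    return 0.0 if state == "L4" else 1.0


# Hypothetical trajectory: L3 -> L4 -> L1 -> L2 -> L1,
# with three clockwise moves (-2 each) followed by one counterclockwise move (-1).
steps = [("L4", -2.0), ("L1", -2.0), ("L2", -2.0), ("L1", -1.0)]
g_bars = reverse_returns(steps, microdrone_gamma, initial_state="L3")
print(g_bars)
```

On arrival at L1 at $t = 2$ the reverse return contains only the last reward, because the discount at L4 wipes out everything accumulated before the charge.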
Let us consider the microdrone example again (Figure 1) and try to answer Question 2. Assume the microdrone was initialized at L3 at $t = 0$ and visited L4 and L1 afterwards. Then it is easy to see that $-\bar{G}_2$, the negated reverse return at $t = 2$, is exactly the energy the microdrone has consumed since its last charge. In general, if we find the microdrone at L1 at time $t$, the expectation of the energy that the microdrone has consumed since its last charge is exactly $-\mathbb{E}[\bar{G}_t \mid S_t = \text{L1}]$. Note the answer to Question 2 is not homogeneous in $t$. For example, suppose the microdrone is initialized at L4 at $t = 0$. If we find it at L1 at $t = 1$, it is trivial to see the microdrone has consumed 2% battery. By contrast, if we find it at L1 at a much later time step, computing the energy consumption since the last time it had 100% battery is nontrivial. It is inconvenient that the answer depends on the time step but, fortunately, we can show the following:
Assumption 1.
The chain induced by is ergodic and exists.
Theorem 1.
Under Assumption 1, the limit $\lim_{t \to \infty} \mathbb{E}[\bar{G}_t \mid S_t = s]$ exists, which we refer to as $\bar{v}_\pi(s)$. Furthermore, we define the reverse Bellman operator $\bar{\mathcal{T}}$ as
(3) 
where with being the stationary distribution of the chain induced by , is the transition matrix, i.e., , and with . Then is a contraction mapping w.r.t. some weighted maximum norm, and is its unique fixed point. We have .
The proof of Theorem 1 is based on Sutton et al. (2016); Zhang et al. (2019, 2020c) and is detailed in the appendix. Theorem 1 states that the numerical value of $\bar{v}_\pi(\text{L1})$ approximately answers Question 2. When Question 2 is asked for a large enough $t$, the error in the answer is arbitrarily small. We call $\bar{v}_\pi$ a Reverse General Value Function, which approximately encodes the retrospective knowledge, i.e., the answer to the retrospective question induced by $\pi, r$, and $\gamma$.
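For a small chain, the limiting reverse value can be computed directly. The sketch below does not transcribe Eq. (3). It is a derivation of my own using Bayes' rule on the stationary chain: the probability of having come from a given predecessor is that predecessor's stationary probability times the forward transition probability, divided by the stationary probability of the current state. It assumes the clockwise order L1, L2, L3, L4 and, because failed moves consume no energy, it uses transition-dependent expected rewards (the extension mentioned in the footnote on White (2017)). All variable names are hypothetical.

```python
import numpy as np

N_STATES, P_FAIL = 4, 0.01
COST = {+1: 2.0, -1: 1.0}                  # clockwise / counterclockwise energy use (%)
GAMMA = np.array([1.0, 1.0, 1.0, 0.0])     # index 3 is the charging station L4

# P[s, s']  = probability of moving s -> s' under the uniformly random policy.
# PR[s, s'] = the same probability weighted by the reward earned on that transition.
P = np.zeros((N_STATES, N_STATES))
PR = np.zeros((N_STATES, N_STATES))
for s in range(N_STATES):
    for direction, cost in COST.items():
        s_next = (s + direction) % N_STATES
        P[s, s_next] += 0.5 * (1.0 - P_FAIL)
        PR[s, s_next] += 0.5 * (1.0 - P_FAIL) * (-cost)
        P[s, s] += 0.5 * P_FAIL            # failed move: stay put, zero reward

# Stationary distribution: the left eigenvector of P for eigenvalue 1.
eigenvalues, eigenvectors = np.linalg.eig(P.T)
d = np.real(eigenvectors[:, np.argmax(np.real(eigenvalues))])
d = d / d.sum()

# M[s, s_prev] = d[s_prev] * gamma[s_prev] * P[s_prev, s] / d[s]
M = (P.T * (GAMMA * d)) / d[:, None]
# b[s] = sum over s_prev of d[s_prev] * PR[s_prev, s] / d[s]
b = (PR.T @ d) / d

v_bar = np.linalg.solve(np.eye(N_STATES) - M, b)
print(np.round(v_bar, 3))
```

Under these assumptions the value at L1 is $-4.5$, i.e., an expected 4.5% of battery consumed since the last charge. Note that the discount here multiplies the predecessor's value, mirroring the role of $\gamma(S_{t-1})$ in the reverse return.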
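Based on the reverse Bellman operator $\bar{\mathcal{T}}$, we now present the Reverse TD algorithm. Let us consider linear function approximation with a feature function $x: \mathcal{S} \to \mathbb{R}^K$, which maps a state to a $K$-dimensional feature. We use $X \in \mathbb{R}^{|\mathcal{S}| \times K}$ to denote the feature matrix, each row of which is $x(s)^\top$. Our estimate for the reverse GVF $\bar{v}_\pi$ is then $Xw$, where $w \in \mathbb{R}^K$ contains the learnable parameters. At time step $t$, Reverse TD computes $w_{t+1}$ as
$$w_{t+1} \doteq w_t + \alpha_t \big(R_t + \gamma(S_{t-1}) x_{t-1}^\top w_t - x_t^\top w_t\big) x_t, \tag{4}$$
where $x_t \doteq x(S_t)$ is shorthand, and $\{\alpha_t\}$ is a deterministic positive nonincreasing sequence satisfying the Robbins-Monro condition (Robbins and Monro, 1951), i.e., $\sum_t \alpha_t = \infty$ and $\sum_t \alpha_t^2 < \infty$. We have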
Proposition 1.
The proof of Proposition 1 is based on the proof of the convergence of linear TD in Bertsekas and Tsitsiklis (1996). In particular, we need to show that the matrix governing the expected update is negative definite. Details are provided in the appendix. For a sanity check, it is easy to verify that in the tabular setting (i.e., $X = I$), the limit point of Reverse TD is indeed $\bar{v}_\pi$. Inspired by the success of TD($\lambda$) (Sutton, 1988) and COP-TD($\lambda$) (Hallak and Mannor, 2017), we also extend Reverse TD to Reverse TD($\lambda$), which updates $w$ as
(5) 
With $\lambda = 1$, Reverse TD($\lambda$) reduces to supervised learning.
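To make the one-step algorithm concrete, here is a tabular sketch of Reverse TD on the microdrone random walk. It is an illustration under assumptions, not the experimental code: the clockwise order is taken to be L1, L2, L3, L4, a constant step size replaces the Robbins-Monro schedule, and the second half of the iterates is averaged to reduce noise. The function names `step` and `reverse_td` are hypothetical.

```python
import random

N_STATES, P_FAIL = 4, 0.01
GAMMA = [1.0, 1.0, 1.0, 0.0]      # index 3 is the charging station L4


def step(state, rng):
    """Sample one transition of the random walk; return (next_state, reward)."""
    direction, cost = rng.choice([(+1, 2.0), (-1, 1.0)])   # uniformly random policy
    if rng.random() < P_FAIL:
        return state, 0.0          # failed move: no displacement, no energy used
    return (state + direction) % N_STATES, -cost


def reverse_td(num_steps=200_000, alpha=0.01, seed=0):
    """Tabular Reverse TD; returns the iterate averaged over the second half of training."""
    rng = random.Random(seed)
    w = [0.0] * N_STATES           # one parameter per state (tabular features)
    avg = [0.0] * N_STATES
    n_avg = 0
    prev = 0                       # start at L1
    for t in range(num_steps):
        state, reward = step(prev, rng)
        # Bootstrap from the PREVIOUS state, discounted by the previous state's gamma.
        target = reward + GAMMA[prev] * w[prev]
        w[state] += alpha * (target - w[state])
        if t >= num_steps // 2:
            n_avg += 1
            for s in range(N_STATES):
                avg[s] += (w[s] - avg[s]) / n_avg
        prev = state
    return avg


estimate = reverse_td()
print([round(v, 2) for v in estimate])
```

The only difference from ordinary tabular TD is the direction of the bootstrap: the value of the state just entered is moved toward the reward just received plus the discounted value of the state just left. With these settings the estimate at L1 should settle near $-4.5$.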
Distributional Learning. In anomaly detection with Reverse GVFs, we compare the observed quantity (a scalar) with our retrospective knowledge (a scalar, the conditional expectation). It is not clear how to translate the difference between the two scalars into a decision about whether there is an anomaly. If our retrospective knowledge is a distribution instead, we can perform anomaly detection from a probabilistic perspective. To this end, we propose Distributional Reverse TD, akin to Bellemare et al. (2017); Rowland et al. (2018).
We use
to denote the conditional probability distribution of
given , where is the set of all probability measures over the measurable space , with being the Borel sets of . Moreover, we use to denote the vector whose th element is . By the definition of , we have for any(6) 
where is defined as , and is the pushforward measure, i.e., , where is the preimage of . To study when , we define
(7) 
When , Eq (6) suggests evolves according to . We, therefore, define the distributional reverse Bellman operator as . We have
Proposition 2.
Under Assumption 1, is a contraction mapping w.r.t. a metric , and we refer to its fixed point as . Assuming , then .
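We now provide a practical algorithm to approximate the fixed point of the distributional reverse Bellman operator based on quantile regression, akin to Dabney et al. (2017). We use $N$ quantiles with quantile levels $\{\tau_i\}_{i=1,\dots,N}$, where $\tau_i \doteq \frac{(i-1)/N + i/N}{2}$. The measure at a state $s$ is approximated with $\frac{1}{N} \sum_{i=1}^{N} \delta_{q_i(s; \theta)}$, where $\delta_x$ is a Dirac at $x$, $q_i(s; \theta)$ is a quantile corresponding to the quantile level $\tau_i$, and $\theta$ denotes the learnable parameters. Given a transition $(s, a, r, s')$, we train $\theta$ to minimize the following quantile regression loss (8), where $\bar{\theta}$ contains the parameters of the target network (Mnih et al., 2015), which is synchronized with $\theta$ periodically, and $\rho^\kappa_{\tau}(u) \doteq |\tau - \mathbb{I}_{u < 0}| H_\kappa(u)$ is the quantile regression loss function. $H_\kappa$ is the Huber loss, i.e., $H_\kappa(u) \doteq \frac{1}{2} u^2$ if $|u| \le \kappa$ and $H_\kappa(u) \doteq \kappa (|u| - \frac{1}{2} \kappa)$ otherwise, where $\kappa$ is a hyperparameter.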
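The quantile regression loss with a Huber penalty can be sketched in a few lines. The version below follows the standard formulation from the quantile-regression distributional RL literature. How the targets are formed in the reverse setting (the reward plus the previous state's discount times the previous state's target-network quantiles) is an assumption made for this example, as are the function names, the normalization, and the toy numbers.

```python
import numpy as np


def huber(u, kappa=1.0):
    """Huber loss: quadratic within [-kappa, kappa], linear outside."""
    abs_u = np.abs(u)
    return np.where(abs_u <= kappa, 0.5 * u ** 2, kappa * (abs_u - 0.5 * kappa))


def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
    """Sum over predicted quantiles of the mean asymmetric Huber loss over targets.

    pred_quantiles: shape (N,), the learnable quantile locations.
    target_samples: shape (M,), treated as fixed samples of the target distribution.
    """
    pred = np.asarray(pred_quantiles, dtype=float)
    target = np.asarray(target_samples, dtype=float)
    n = pred.shape[0]
    taus = (np.arange(n) + 0.5) / n                 # midpoint quantile levels
    u = target[None, :] - pred[:, None]             # u[i, j] = target_j - pred_i
    weight = np.abs(taus[:, None] - (u < 0).astype(float))
    return float((weight * huber(u, kappa)).mean(axis=1).sum())


# Assumed reverse-style target: reward plus discounted quantiles of the PREVIOUS state.
prev_quantiles = np.array([-3.0, -1.0])             # hypothetical target-network output
reward, gamma_prev = -2.0, 1.0
targets = reward + gamma_prev * prev_quantiles      # -> [-5.0, -3.0]
loss = quantile_huber_loss(np.array([-4.0, -4.0]), targets)
print(loss)
```

The asymmetric weight is what pushes each output toward its own quantile level: an overestimate of a low quantile is penalized more heavily than an underestimate, and the reverse holds for a high quantile.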
Dabney et al. (2017) provide more details about quantile-regression-based distributional RL.

Off-policy Learning. We would also like to be able to answer Question 2 without making the microdrone do a random walk, i.e., we may have another policy $\mu$ for the microdrone to collect data. In this scenario, we want to learn $\bar{v}_\pi$ off-policy. We consider Off-policy Reverse TD, which updates $w$ as:
(9) 
where and is obtained by following the behavior policy . Here we assume access to the density ratio , which can be learned via Hallak and Mannor (2017); Gelada and Bellemare (2019); Nachum et al. (2019); Zhang et al. (2020a, b).
Proposition 3.
Off-policy Reverse TD converges to the same point as on-policy Reverse TD. This convergence relies heavily on having the true density ratio. When using a learned estimate for the density ratio, approximation error is inevitable and thus convergence is not ensured. It is straightforward to consider a GTD (Sutton et al., 2009b, a; Maei, 2011) analogue, Reverse GTD, as Zhang et al. (2020c) do in Gradient Emphasis Learning. The convergence of Off-policy Reverse GTD is straightforward (Zhang et al., 2020c), but to a different point from on-policy Reverse TD.
4 Experiments
The Effect of $\lambda$. At time step $t$, the reverse return $\bar{G}_t$ is known and can approximately serve as a sample for $\bar{v}_\pi(S_t)$. It is natural to model this as a regression task where the input is $S_t$, and the target is $\bar{G}_t$. This is indeed Reverse TD(1). So we first study the effect of $\lambda$ in Reverse TD($\lambda$). We consider the microdrone example in Figure 1. The dynamics are specified in Section 1. The reward function and the discount function are specified in Section 2. The policy is uniformly random. We use a tabular representation and compute the ground truth $\bar{v}_\pi$ analytically. We vary $\lambda$ in . For each $\lambda$, we use a constant step size tuned from . We report the Mean Value Error (MVE) against training steps in Figure 2. At a time step $t$, assuming our estimation is , the MVE is computed as . The results show that the bias of the estimate decreases quickly at the beginning. As a result, variance of the update target becomes the major obstacle in the learning process, which explains why the best performance is achieved by smaller $\lambda$ in this experiment.

[Figure 2 caption: The step size is tuned to minimize the MVE at the end of training. All curves are averaged over 30 independent runs, with shaded regions indicating standard errors.]
Anomaly Detection.
Tabular Representation. Consider the microdrone example once again (Figure 1).
Suppose we want the microdrone to follow a policy where .
However, something can go wrong when the microdrone is following .
For example, it may start to take with probability 0.9 at all states due to a malfunctioning navigation system,
which we refer to as a policy anomaly.
The microdrone may also consume 2% extra battery per step with probability 0.5 due to a malfunctioning engine,
which we refer to as a reward anomaly,
i.e., the reward becomes
with probability 0.5.
We cannot afford to monitor the microdrone every time step but can do so occasionally,
and we hope if something has gone wrong we can discover it.
Since it is a microdrone,
it does not have the memory to store all the logs between examinations.
We now demonstrate that Reverse GVFs can discover such anomalies using only constant memory and computation.
Our experiment consists of two phases. In the first phase, we train Reverse GVFs off-policy. Our behavior policy is uniformly random with . The target policy is with . Given a transition following , we update the parameters , which is a lookup table in this experiment, to minimize . In this way, we approximate with quantiles for all . The MVE against training steps is reported in Figure 6a.
In the second phase, we use the learned from the first phase for anomaly detection when we actually deploy . Namely, we let the microdrone follow for steps and compute $\bar{G}_t$ on the fly. In the first steps, there is no anomaly. In the second steps, the aforementioned reward anomaly or policy anomaly happens every step. We aim to discover the anomaly via computing the likelihood that $\bar{G}_t$ is sampled from , namely, we compute where is a hyperparameter and we use in our experiments. We do not have access to but only estimated quantiles . To compute , we need to first find a distribution whose quantiles are . This operation is referred to as imputation in Rowland et al. (2018). Such a distribution is not unique. The commonly used imputation strategy for quantile-regression-based distributional RL is a uniform mixture of Diracs located at the estimated quantiles (Dabney et al., 2017). This distribution, however, makes it difficult to compute. Inspired by the fact that a Dirac can be regarded as the limit of a normal distribution with decreasing scale, we define our approximation for as a uniform mixture of normal distributions centered at the estimated quantiles, where the scale is a hyperparameter and we use in our experiments. Note does not necessarily have the quantiles . We report against time steps in Figure 6b. When the anomaly occurs after the first steps, the probability of anomaly reported by Reverse GVF becomes high.

Nonlinear Function Approximation. We now consider Reacher from OpenAI Gym (Brockman et al., 2016) and use neural networks as a function approximator for . Our setup is the same as the tabular setting except that the tasks are different. For a state $s$, we define $\gamma(s) = 0$ if the distance between the end of the robot arm and the target is less than 0.02. Otherwise we always have $\gamma(s) = 1$. When the robot arm reaches a state $s$ with $\gamma(s) = 0$, the arm and the target are reinitialized randomly. We first train a deterministic policy with TD3 (Fujimoto et al., 2018) achieving an average episodic return of . In the first phase, we use a Gaussian behavior policy . The target policy is . In the second phase, we consider two kinds of anomaly. In the policy anomaly, we consider three settings where the policy becomes , and respectively. In the reward anomaly, we consider three settings where with probability 0.5 the reward becomes , , and respectively. We report the estimated probability of anomaly in Figure 6c. When an anomaly happens after the first steps, the probability of anomaly reported by Reverse GVF becomes high.

Representation Learning. Veeriah et al. (2019) show that automatically discovered GVFs can be used as auxiliary tasks (Jaderberg et al., 2016) to improve representation learning, yielding a performance boost in the main task. Let and be the reward function and the discount factor of the main task. Veeriah et al. (2019) propose two networks for solving the main task: a main task and answer network, parameterized by , and a question network, parameterized by . The two networks do not share parameters. The question network takes as input states and outputs two scalars, representing a reward signal and a discount factor . The network has two heads with a shared backbone. The backbone represents the internal state representation of the agent. One head represents the policy , as well as the value function , for the main task. The other head represents the answer to the predictive question specified by , i.e., this head represents the value function . At time step , is updated to minimize two losses and . Here is the usual RL loss for and , e.g., Veeriah et al. (2019) consider the loss used in IMPALA (Espeholt et al., 2018). is the TD loss for training with and . Minimizing improves the policy directly, and Veeriah et al. (2019) show that minimizing , the loss of the auxiliary task, facilitates the learning of by improving representation learning. Every steps, the question network is updated to minimize . In this way, the question network is trained to propose useful predictive questions for learning the main task.
We now show that automatically discovered Reverse GVFs can also be used as auxiliary tasks to improve the learning of the main task. We propose an IMPALA+ReverseGVF agent, which is the same as the IMPALA+GVF agent in Veeriah et al. (2019) except that we replace with . Here is the Reverse TD loss for training the reverse general value function with and , and the head replaces the head in Veeriah et al. (2019). We benchmark our IMPALA+ReverseGVF agent against a plain IMPALA agent, an IMPALA+RewardPrediction agent, an IMPALA+PixelControl agent, and an IMPALA+GVF agent in ten Atari games. The IMPALA+RewardPrediction agent predicts the immediate reward of the main task of its current state-action pair as an auxiliary task (Jaderberg et al., 2016). The IMPALA+PixelControl agent maximizes the change in pixel intensity of different regions of the input image as an auxiliary task (Jaderberg et al., 2016). According to Veeriah et al. (2019), those ten Atari games are the ten where the IMPALA+PixelControl agent achieves the largest improvement over the plain IMPALA agent over all 57 Atari games.
The results in Figure 7 show that IMPALA+ReverseGVF yields a performance boost over plain IMPALA in 7 out of 10 tested games, and the improvement is larger than 25% in 5 games. IMPALA+ReverseGVF outperforms IMPALA+RewardPrediction in all 10 tested games, indicating reward prediction is not a good auxiliary task for an IMPALA agent in those ten games. IMPALA+ReverseGVF outperforms IMPALA+PixelControl in 8 out of 10 tested games, though the games are selected in favor of IMPALA+PixelControl. IMPALA+ReverseGVF also outperforms IMPALA+GVF, the state of the art in discovering auxiliary tasks, in 3 games. Overall, our empirical study confirms that Reverse GVFs are a useful inductive bias for composing auxiliary tasks, though they do not achieve a new state of the art. We conjecture that IMPALA+GVF outperforms IMPALA+ReverseGVF because a GVF aligns better with the main task than a Reverse GVF, in that the value function of the main task itself is also a GVF.
5 Related Work
Our reverse return is inspired by the follow-on trace in Sutton et al. (2016), which is defined as $F_t \doteq i(S_t) + \gamma(S_t) \rho_{t-1} F_{t-1}$, where $\rho_{t-1}$ is the importance sampling ratio and $i$ is a user-defined interest function specifying the user's preference for different states. Sutton et al. (2016) use the follow-on trace to reweight value function updates in Emphatic TD. Later on, Zhang et al. (2020c) propose to learn the conditional expectation of the follow-on trace given the current state with function approximation in off-policy actor-critic algorithms. This follow-on trace perspective is one origin of bootstrapping in the reverse direction, and the follow-on trace is used only for stabilizing off-policy learning. The second origin is related to learning the stationary distribution of a policy, which dates back to Wang et al. (2007, 2008) in dual dynamic programming for stable policy evaluation and policy improvement. Later on, Hallak and Mannor (2017); Gelada and Bellemare (2019) propose the stochastic approximation algorithms (discounted) COP-TD to learn the density ratio, i.e., the ratio between the stationary distribution of the target policy and that of the behavior policy, to stabilize off-policy learning. Our Reverse TD differs from the discounted COP-TD in that (1) Reverse TD is on-policy and does not have importance sampling ratios, while discounted COP-TD is designed only for the off-policy setting, as there is no density ratio in the on-policy setting; (2) Reverse TD uses the reward $R_t$ in the update, while discounted COP-TD uses a carefully designed constant. The third origin is an application of RL in web page ranking (Yao and Schuurmans, 2013), where a different Reverse Bellman Equation is proposed to learn the authority score function. Although the idea of reverse bootstrapping is not new, we want to highlight that this paper is the first to apply this idea for representing retrospective knowledge and show its utility in anomaly detection and representation learning. We are also the first to use distributional learning in reverse bootstrapping, providing a probabilistic perspective for anomaly detection.
Another approach for representing retrospective knowledge is to work directly with a reversed chain like Morimura et al. (2010). First, assume the initial distribution is the same as the stationary distribution . We can then compute the posterior action distribution given the next state and the posterior state distribution given the action and the next state using Bayes' rule. We can then define a new MDP with the same state space and the same action space . But the new policy is the posterior distribution and the new transition kernel is the posterior distribution . Intuitively, this new MDP flows in the reverse direction of the original MDP. Samples from the original MDP can also be interpreted as samples from the new MDP. Assuming we have a trajectory from the original MDP following , we can interpret the trajectory as a trajectory from the new MDP, allowing us to work on the new MDP directly. For example, applying TD in the new MDP is equivalent to applying Reverse TD in the original MDP. However, in the new MDP, we no longer have access to the policy, i.e., we cannot compute explicitly as it requires both and , to which we do not have access. This is acceptable in the on-policy setting but renders the off-policy setting infeasible, as we do not know the target policy at all. We, therefore, argue that working on the reversed chain directly is only compatible with on-policy learning.
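The two posteriors that define the reversed chain follow from Bayes' rule applied to the joint probability of state, action, and next state under stationarity. The sketch below computes them for the microdrone. It is an illustration rather than the construction of Morimura et al. (2010) verbatim: the array layout, the function name `reversed_chain`, and the clockwise order L1, L2, L3, L4 are assumptions, and the code assumes every (action, next state) pair has positive probability so that no division by zero occurs.

```python
import numpy as np


def reversed_chain(d, pi, p):
    """Posterior policy and transition kernel of the reversed chain.

    d:  shape (S,), stationary distribution of the forward chain.
    pi: shape (S, A), pi[s, a] is the forward policy.
    p:  shape (S, A, S), p[s, a, s_next] is the forward transition kernel.
    Returns:
      pi_rev[s_next, a]   = Pr(A_t = a | S_{t+1} = s_next)
      p_rev[s_next, a, s] = Pr(S_t = s | A_t = a, S_{t+1} = s_next)
    """
    joint = d[:, None, None] * pi[:, :, None] * p      # joint[s, a, s_next]
    action_next = joint.sum(axis=0)                     # [a, s_next]
    d_next = action_next.sum(axis=0)                    # [s_next]
    pi_rev = (action_next / d_next).T                   # [s_next, a]
    p_rev = np.transpose(joint / action_next, (2, 1, 0))  # [s_next, a, s]
    return pi_rev, p_rev


# Microdrone ring: action 0 = clockwise, action 1 = counterclockwise; 1% failures.
S, A, fail = 4, 2, 0.01
p = np.zeros((S, A, S))
for s in range(S):
    p[s, 0, (s + 1) % S] += 1 - fail
    p[s, 1, (s - 1) % S] += 1 - fail
    p[s, :, s] += fail
pi = np.full((S, A), 0.5)
d = np.full(S, 0.25)

pi_rev, p_rev = reversed_chain(d, pi, p)
print(pi_rev[0], p_rev[1, 0])
```

Both posteriors need the stationary distribution and the forward kernel, which is the practical obstacle noted in the text: for a target policy that was never executed, neither quantity is available from data.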
6 Conclusion
In this paper, we present Reverse GVFs for representing retrospective knowledge and formalize the Reverse RL framework. We demonstrate the utility of Reverse GVFs in both anomaly detection and representation learning. Investigating Reverse-GVF-based anomaly detection with real-world data is a possible direction for future work. In this paper, we investigate Reverse RL only in a policy evaluation sense. Reverse RL for control is another possible direction for future work.
Broader Impact
Reverse RL makes it possible to implement anomaly detection with little extra memory. This is particularly important for embedded systems with limited memory, e.g., satellites, spacecraft, microdrones, and IoT devices. The saved memory can be used to improve other functionalities of those systems. Systems where memory is not a bottleneck, e.g., self-driving cars, benefit from Reverse-RL-based anomaly detection as well, as saving memory saves energy, making them more environmentally friendly.
Reverse RL provides a probabilistic perspective for anomaly detection, so misjudgment is possible. Users may have to make a decision considering other available information as well to reach a certain confidence level. Like any other neural network application, combining neural networks with Reverse-RL-based anomaly detection is also vulnerable to adversarial attacks. This means the users, e.g., companies or governments, should take extra care against such attacks when deciding whether there is an anomaly. Otherwise, they may suffer property losses. Although Reverse RL itself does not have any bias or unfairness, if the simulator used to train Reverse GVFs is biased or unfair, the learned GVFs are likely to inherit that bias or unfairness. Although Reverse RL itself does not raise any privacy issue, to make a better simulator for training, users may be tempted to exploit personal data. Like any artificial intelligence system, Reverse-RL-based anomaly detection has the potential to greatly improve human productivity. However, it may also reduce the need for human workers, resulting in job losses.
Acknowledgments and Disclosure of Funding
SZ is generously funded by the Engineering and Physical Sciences Research Council (EPSRC). This project has received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (grant agreement number 637713). The experiments were made possible by a generous equipment grant from NVIDIA.
References
 Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. (2017). A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887.
 Bertsekas and Tsitsiklis (1989) Bertsekas, D. P. and Tsitsiklis, J. N. (1989). Parallel and distributed computation: numerical methods. Prentice hall Englewood Cliffs, NJ.
 Bertsekas and Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific Belmont, MA.
 Borkar (2009) Borkar, V. S. (2009). Stochastic approximation: a dynamical systems viewpoint. Springer.
 Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
 Dabney et al. (2017) Dabney, W., Rowland, M., Bellemare, M. G., and Munos, R. (2017). Distributional reinforcement learning with quantile regression. arXiv preprint arXiv:1710.10044.
 Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. (2018). IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561.
 Fujimoto et al. (2018) Fujimoto, S., van Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477.
 Gelada and Bellemare (2019) Gelada, C. and Bellemare, M. G. (2019). Off-policy deep reinforcement learning by bootstrapping the covariate shift. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence.
 Hallak and Mannor (2017) Hallak, A. and Mannor, S. (2017). Consistent on-line off-policy evaluation. In Proceedings of the 34th International Conference on Machine Learning.
 Horn and Johnson (2012) Horn, R. A. and Johnson, C. R. (2012). Matrix analysis (2nd Edition). Cambridge university press.
 Jaderberg et al. (2016) Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. (2016). Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397.
 Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Levin and Peres (2017) Levin, D. A. and Peres, Y. (2017). Markov chains and mixing times. American Mathematical Soc.
 Maei (2011) Maei, H. R. (2011). Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta.
 Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature.
 Morimura et al. (2010) Morimura, T., Uchibe, E., Yoshimoto, J., Peters, J., and Doya, K. (2010). Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning. Neural computation.
 Nachum et al. (2019) Nachum, O., Chow, Y., Dai, B., and Li, L. (2019). DualDICE: Behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733.
 Nair and Hinton (2010) Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning.
 Puterman (2014) Puterman, M. L. (2014). Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
 Robbins and Monro (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics.
 Rowland et al. (2018) Rowland, M., Bellemare, M. G., Dabney, W., Munos, R., and Teh, Y. W. (2018). An analysis of categorical distributional reinforcement learning. arXiv preprint arXiv:1802.08163.
 Sutton (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning.
 Sutton (2009) Sutton, R. S. (2009). The grand challenge of predictive empirical abstract knowledge. In Working Notes of the IJCAI-09 Workshop on Grand Challenges for Reasoning from Experiences.
 Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd Edition). MIT press.
 Sutton et al. (2009a) Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. (2009a). Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning.
 Sutton et al. (2009b) Sutton, R. S., Maei, H. R., and Szepesvári, C. (2009b). A convergent $O(n)$ temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems.
 Sutton et al. (2016) Sutton, R. S., Mahmood, A. R., and White, M. (2016). An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research.
 Sutton et al. (2011) Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D. (2011). Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. In Proceedings of the 10th International Conference on Autonomous Agents and Multiagent Systems.
 Veeriah et al. (2019) Veeriah, V., Hessel, M., Xu, Z., Rajendran, J., Lewis, R. L., Oh, J., van Hasselt, H. P., Silver, D., and Singh, S. (2019). Discovery of useful questions as auxiliary tasks. In Advances in Neural Information Processing Systems.
 Wang et al. (2007) Wang, T., Bowling, M., and Schuurmans, D. (2007). Dual representations for dynamic programming and reinforcement learning. In 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning.
 Wang et al. (2008) Wang, T., Bowling, M., Schuurmans, D., and Lizotte, D. J. (2008). Stable dual dynamic programming. In Advances in Neural Information Processing Systems.
 White (2017) White, M. (2017). Unifying task specification in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning.
 Yao and Schuurmans (2013) Yao, H. and Schuurmans, D. (2013). Reinforcement ranking. arXiv preprint arXiv:1303.5988.
 Yu (2015) Yu, H. (2015). On convergence of emphatic temporal-difference learning. In Conference on Learning Theory.
 Yu et al. (2018) Yu, H., Mahmood, A. R., and Sutton, R. S. (2018). On generalized Bellman equations and temporal-difference learning. The Journal of Machine Learning Research.
 Zhang et al. (2020a) Zhang, R., Dai, B., Li, L., and Schuurmans, D. (2020a). GenDICE: Generalized offline estimation of stationary values. In International Conference on Learning Representations.
 Zhang et al. (2019) Zhang, S., Boehmer, W., and Whiteson, S. (2019). Generalized off-policy actor-critic. In Advances in Neural Information Processing Systems.
 Zhang et al. (2020b) Zhang, S., Liu, B., and Whiteson, S. (2020b). GradientDICE: Rethinking generalized offline estimation of stationary values. In Proceedings of the 37th International Conference on Machine Learning.
 Zhang et al. (2020c) Zhang, S., Liu, B., Yao, H., and Whiteson, S. (2020c). Provably convergent two-timescale off-policy actor-critic with function approximation. In Proceedings of the 37th International Conference on Machine Learning.
Appendix A Failure in Representing Retrospective Knowledge with GVFs
One may consider answering Question 2 with a GVF by setting L4 to be the initial state and terminating an episode when the microdrone gets to L1. Then the value of L4 seems to be the answer to Question 2. To understand how this approach fails, consider the transitions L4 → L3 → L4 → L1. It then becomes clear that we are unable to design a Markovian reward for the transition L3 → L4: this reward has to be non-Markovian to cancel all previously accumulated rewards. To make the reward Markovian, one may augment the state space with the battery level, which significantly increases the size of the state space. More importantly, this renders off-policy learning infeasible. The transition kernel on this augmented state space depends on the original reward function, so we cannot use off-policy learning to learn a GVF associated with a different reward function, as changing the reward function changes the transition kernel on the augmented state space. We can, of course, include the information about the new reward function in the augmented state space. This, however, means that the size of the state space grows exponentially with the number of reward functions we want to consider in off-policy learning.
There is an even deeper defect. Consider the setting where we have two charging stations, say L2 and L4. If we want to use GVFs directly as described above, assuming the aforementioned issues could somehow be solved, we need to set the initial state to L2 and L4 respectively. We then solve the two MDPs and compute the value of the initial state in each. Finally, we may need to combine the two values into the answer using the stationary distribution of the original MDP, which is, unfortunately, unknown.
To summarize, there may be some retrospective knowledge that GVFs can represent if enough tweaks are applied. But in general, representing retrospective knowledge with GVFs suffers from poor generality and poor scalability.
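The quantity at stake can be made concrete with a small Monte Carlo simulation of the microdrone from the introduction. The sketch below is only an illustration under stated assumptions: Figure 1 is taken to be a ring of four locations L1 to L4 in which a clockwise move goes L1 → L2 → L3 → L4 → L1, and the function name, step count, and seed are hypothetical. It tracks the energy consumed since the last charge in a running total that is wiped on arrival at L4, which is exactly the history-cancelling behaviour that a Markovian reward on the four locations cannot express.

```python
import random


def simulate_energy_since_charge(num_steps=200_000, seed=0):
    """Monte Carlo estimate of the expected energy (in % of battery) consumed
    since the last charge, given that the microdrone is currently at L1.

    Assumption (not from the paper): locations L1..L4 form a ring, stored as
    indices 0..3, and a clockwise move increases the index by one (mod 4).
    """
    rng = random.Random(seed)
    num_locations = 4
    station, query = 3, 0            # L4 recharges; the question is about L1
    location, consumed = station, 0.0
    total, visits = 0.0, 0
    for _ in range(num_steps):
        clockwise = rng.random() < 0.5           # uniformly random policy
        if rng.random() >= 0.01:                 # a move fails w.p. 1%
            step, cost = (1, 2.0) if clockwise else (-1, 1.0)
            location = (location + step) % num_locations
            consumed += cost
            if location == station:
                consumed = 0.0                   # recharge erases the history
        if location == query:                    # average over time spent at L1
            total += consumed
            visits += 1
    return total / visits


print(simulate_energy_since_charge())
```

Under these assumptions, solving the corresponding four-state linear system by hand gives 4.5% for L1, so the printed estimate should land close to that value. Moving the charging station or changing the movement probabilities requires a fresh simulation here, which is the kind of repetition that off-policy learning is meant to avoid.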
Appendix B Proofs
Lemma 1.
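(Corollary 6.1 on page 150 of Bertsekas and Tsitsiklis (1989)) If $A$ is a square nonnegative matrix and $\rho(A) < 1$, then there exists some vector $w > 0$ such that $\|A\|_\infty^w < 1$. Here $>$ is elementwise greater and $\rho(\cdot)$ is the spectral radius. For a vector $x$, its weighted maximum norm is $\|x\|_\infty^w \doteq \max_i |x_i| / w_i$. For a matrix $A$, $\|A\|_\infty^w \doteq \max_{x \neq 0} \|Ax\|_\infty^w / \|x\|_\infty^w$.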
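Lemma 1 can be checked numerically on a small example. The sketch below is illustrative only: the matrix `A` and the helper names are hypothetical, and the weight vector $w = (I - A)^{-1}\mathbf{1} = \sum_k A^k \mathbf{1}$ is one standard choice that works for a nonnegative matrix with spectral radius below one, not necessarily the construction used by Bertsekas and Tsitsiklis (1989). With this choice $Aw = w - \mathbf{1}$, so every ratio $(Aw)_i / w_i$ is strictly below one.

```python
def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def contraction_weights(A, iterations=200):
    """Approximate w = (I - A)^{-1} 1 by iterating w <- 1 + A w.

    The iteration converges when the spectral radius of A is below one.
    """
    w = [1.0] * len(A)
    for _ in range(iterations):
        w = [1.0 + y for y in mat_vec(A, w)]
    return w


def weighted_max_norm(A, w):
    """Induced weighted maximum norm of a NONNEGATIVE matrix A:
    max_i (sum_j A[i][j] * w[j]) / w[i]."""
    return max(y / wi for y, wi in zip(mat_vec(A, w), w))


# Nonnegative, spectral radius sqrt(0.2) < 1, yet the largest row sum is 2.
A = [[0.0, 2.0],
     [0.1, 0.0]]

unweighted = weighted_max_norm(A, [1.0, 1.0])   # ordinary maximum norm
w = contraction_weights(A)
weighted = weighted_max_norm(A, w)
print(unweighted, w, weighted)
```

The ordinary maximum norm of this matrix exceeds one, so it is not a contraction in that norm, while the reweighted norm falls below one, which is the content of the lemma.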
B.1 Proof of Theorem 1
Proof.
Given the similarity between and the follow-on trace as discussed in Section 5, the existence of can be established in exactly the same way as Zhang et al. (2019) establish the existence of in their Lemma 1. We therefore omit this to avoid verbatim repetition. We have
(10)  
(11)  
(12)  
(Law of total expectation)  
(13) 
The matrix form of Eq (13) is exactly , solving which leads to . Assumption 1 implies . As and have the same eigenvalues (e.g., see Theorem 1.3.22 in Horn and Johnson (2012)), we have . Lemma 1 then implies is a contraction mapping w.r.t. some weighted maximum norm. ∎
B.2 Proof of Proposition 1
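We first state a lemma about the convergence of the following iterates
(14)  $\theta_{t+1} \doteq \theta_t + \alpha_t \big(A(Y_t)\theta_t + b(Y_t)\big)$,
where $\{Y_t\}$ is a Markov chain evolving in a space $\mathcal{Y}$, $A: \mathcal{Y} \to \mathbb{R}^{K \times K}$, $b: \mathcal{Y} \to \mathbb{R}^{K}$.
Assumption 2.
(Assumption 4.5 in Bertsekas and Tsitsiklis (1996))
(a) The step sizes $\{\alpha_t\}$ are nonnegative, deterministic, and satisfy $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$.
(b) The chain $\{Y_t\}$ has a stationary distribution $d_{\mathcal{Y}}$.
(c) The matrix $\bar{A} \doteq \mathbb{E}_{y \sim d_{\mathcal{Y}}}[A(y)]$ is negative definite.
(d) There is a constant $C_0$ such that $\|A(y)\| \leq C_0$ and $\|b(y)\| \leq C_0$ for all $y \in \mathcal{Y}$.
(e) There exist scalars $C_1 > 0$ and $\tau \in [0, 1)$ such that
(15)  $\big\|\mathbb{E}[A(Y_t) \mid Y_0 = y] - \bar{A}\big\| \leq C_1 \tau^t, \quad \big\|\mathbb{E}[b(Y_t) \mid Y_0 = y] - \bar{b}\big\| \leq C_1 \tau^t$ for all $t$ and $y \in \mathcal{Y}$,
where $\bar{b} \doteq \mathbb{E}_{y \sim d_{\mathcal{Y}}}[b(y)]$.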
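To see what the conditions of Assumption 2 buy, it can help to run a toy instance of such iterates. The sketch below assumes the standard linear stochastic approximation form $\theta_{t+1} = \theta_t + \alpha_t (A(Y_t)\theta_t + b(Y_t))$ studied by Bertsekas and Tsitsiklis (1996); the two-state chain, the scalar values of $A(y)$ and $b(y)$, and the function name are all hypothetical choices made for illustration and do not come from the paper.

```python
import random


def linear_sa(num_steps=200_000, seed=0):
    """Scalar linear stochastic approximation driven by a two-state Markov chain.

    Every quantity below is an illustrative assumption, chosen so that the
    conditions (a)-(e) hold and the limit can be computed by hand.
    """
    rng = random.Random(seed)
    P = [[0.9, 0.1], [0.3, 0.7]]   # ergodic chain, stationary dist. (0.75, 0.25)
    A = [-1.0, -3.0]               # A(y); stationary mean is -1.5 (negative)
    b = [2.0, 0.0]                 # b(y); stationary mean is 1.5
    y, theta = 0, 0.0
    for t in range(num_steps):
        alpha = 1.0 / (t + 1)      # sum alpha_t = inf, sum alpha_t^2 < inf
        theta += alpha * (A[y] * theta + b[y])
        y = 0 if rng.random() < P[y][0] else 1   # sample the next chain state
    return theta


print(linear_sa())
```

For this toy instance the stationary means are $\bar{A} = -1.5$ and $\bar{b} = 1.5$, so the solution of $\bar{A}\theta + \bar{b} = 0$ is $\theta = 1$, and the iterate should approach that value even though individual updates are driven by correlated samples.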
Lemma 2.
Proof.
We first consider a deterministic reward setting, i.e., we assume . The Reverse TD update Eq (4) can be rearranged as
(16) 
where . It is easy to verify that and . Assumption 2(a) is satisfied automatically. Obviously is ergodic and its stationary distribution is . Assumption 2(b) is now satisfied.
We now verify Assumption 2(c). Our proof is inspired by the proof of Lemma 6.4 in Bertsekas and Tsitsiklis (1996). Let , we aim to show . As has linearly independent columns, it suffices to show . We have
(17)  
(18) 
where the first inequality comes from Jensen’s inequality, whose equality holds iff all components of are the same scalar (referred to as ). When that happens, we have . Note there exists at least one such that , otherwise is singular, violating Assumption 1. So . To conclude, for any , we always have , yielding
(19) 
which completes the proof.
Assumption 2(d) is straightforward as is finite. Assumption 2(e) is trivial in our setting, as we do not use eligibility traces, and can be obtained from standard arguments about the mixing time of the MDP (e.g., Theorem 4.9 in Levin and Peres (2017)).
The extension from deterministic rewards to stochastic rewards is standard (e.g., see Section 2.2 in Borkar (2009)) and is thus omitted. ∎