1 Introduction
Reinforcement Learning (RL) has successfully addressed many complex problems, such as playing computer games, chess, and even Go with superhuman performance [Mnih et al., 2015, Silver et al., 2018]. These impressive results are possible thanks to a vast amount of interactions of the RL agent with its environment. Such a strategy is unsuitable in settings where the agent has to perform and learn at the same time. Consider, for example, a caregiver robot in a hospital that has to learn a new task, such as a new route to deliver meals. In such a setting, the agent cannot collect a vast amount of training samples but has to adapt quickly instead. Transfer learning aims to provide mechanisms to quickly adapt agents in such settings [Taylor and Stone, 2009, Lazaric, 2012, Zhu et al., 2020]. The rationale is to use knowledge from previously encountered source tasks for a new target task to improve the learning performance on the target task. The previous knowledge can help reduce the amount of interactions required to learn the new optimal behavior. For example, the caregiver robot could reuse knowledge about the layout of the hospital it learned in previous source tasks (e.g. guiding a person) to learn to deliver meals.
The Successor Feature (SF) and Generalized Policy Improvement (GPI) framework [Barreto et al., 2020] is a prominent transfer learning mechanism for tasks where only the reward function differs. Its basic premise is that the rewards which the RL agent tries to maximize are defined based on a low-dimensional feature descriptor $\phi$. For our caregiver robot, this could be the IDs of the beds or rooms that it is visiting, in contrast to its high-dimensional visual state input from a camera. The rewards are then computed not based on its visual input but on the IDs of the beds or rooms that it visits. The expected cumulative discounted successor features ($\psi$) are learned for each behavior that the robot learned in the past. They represent the dynamics in the feature space that the agent experiences for a behavior. This corresponds to the rooms or beds the caregiver agent would visit if using the behavior. This representation of feature dynamics is independent from the reward function. A behavior learned in a previous task and described by this SF representation can be directly reevaluated for a different reward function. In a new task, i.e. for a new reward function, the GPI procedure reevaluates the behaviors learned in previous tasks. It then selects at each state the behavior of a previous task if it improves the expected reward. This allows behaviors learned in previous source tasks to be reused for a new target task. A similar transfer strategy can also be observed in the behavior of humans [Momennejad et al., 2017, Momennejad, 2020, Tomov et al., 2021].
The classical SF&GPI framework [Barreto et al., 2017, 2018] makes the assumption that rewards are a linear composition of the features via a reward weight vector $\mathbf{w}_i$ that depends on the task $M_i$: $r(s,a,s') = \phi(s,a,s')^\top \mathbf{w}_i$. This assumption allows to effectively separate the feature dynamics of a behavior from the rewards and thus to reevaluate previous behaviors given a new reward function, i.e. a new weight vector $\mathbf{w}_j$. Nonetheless, this assumption also restricts successful application of SF&GPI to problems where such a linear decomposition is possible. This paper investigates the application of the SF&GPI framework to general reward functions: $r(s,a,s') = R(\phi(s,a,s'))$. We propose to learn the cumulative discounted probability over the successor features, named the $\xi$-function, and refer to the proposed framework as $\xi$-learning. Our work is related to Janner et al. [2020], Touati and Ollivier [2021], and brings two important additional contributions. First, we provide a mathematical proof of the convergence of $\xi$-learning. Second, we demonstrate how $\xi$-learning can be used for meta-RL, using the $\xi$-function to reevaluate behaviors learned in previous tasks for a new reward function $R$. Furthermore, $\xi$-learning can also be used to transfer knowledge to new tasks using GPI.
The contribution of our paper is threefold:

We introduce a new RL algorithm, $\xi$-learning, based on the cumulative discounted probability of successor features, and two variants of its update operator.

We provide theoretical proofs of the convergence of $\xi$-learning to the optimal policy and a guarantee of its transfer learning performance under the GPI procedure.

We experimentally compare $\xi$-learning in tasks with linear and general reward functions, and in tasks with discrete and continuous features, to standard Q-learning and the classical SF framework, demonstrating the interest and advantage of $\xi$-learning.
2 Background
2.1 Reinforcement Learning
RL investigates algorithms to solve multi-step decision problems, aiming to maximize the sum over future rewards [Sutton and Barto, 2018]. RL problems are modeled as Markov Decision Processes (MDPs), defined as a tuple $M = (\mathcal{S}, \mathcal{A}, p, R, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action set. An agent transitions from a state $s_t$ to another state $s_{t+1}$ using action $a_t$ at time point $t$, collecting a reward $r_{t+1}$. This process is stochastic and the transition probability $p(s_{t+1} | s_t, a_t)$ describes which state $s_{t+1}$ is reached. The reward function defines the scalar reward $r_{t+1} = R(s_t, a_t, s_{t+1})$ for the transition. The goal in an MDP is to maximize the expected return $G_t = \mathbb{E}\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i+1}\right]$, where $\gamma \in [0,1)$. The discount factor $\gamma$ weights collected rewards by discounting future rewards more strongly. RL provides algorithms to learn a policy $\pi: \mathcal{S} \to \mathcal{A}$ defining which action to take in which state to maximize $G_t$.
Value-based RL methods use the concept of value functions to learn the optimal policy. The state-action value function, called the Q-function, is defined as the expected future return when taking action $a$ in state $s$ and then following policy $\pi$:
(1)  $Q^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i+1} \,\middle|\, s_t = s, a_t = a\right]$
The Q-function can be recursively defined following the Bellman equation, such that the current Q-value depends on the maximum Q-value of the next state $s_{t+1}$. The optimal policy for an MDP can then be expressed based on the optimal Q-function $Q^*$, by taking at every step the action with the maximum Q-value: $\pi^*(s) = \arg\max_a Q^*(s,a)$.
The optimal Q-function can be learned using a temporal difference method such as Q-learning [Watkins and Dayan, 1992]. Given a transition $(s_t, a_t, r_{t+1}, s_{t+1})$, the Q-value is updated according to:

(2)  $Q_{k+1}(s_t, a_t) = Q_k(s_t, a_t) + \alpha_k \left( r_{t+1} + \gamma \max_{a'} Q_k(s_{t+1}, a') - Q_k(s_t, a_t) \right)$

where $\alpha_k$ is the learning rate at iteration $k$.
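As an illustration of update (2), a minimal tabular sketch in NumPy (the toy MDP and all parameter values are our own illustration, not taken from the paper):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step for a transition (s, a, r, s')."""
    td_target = r + gamma * np.max(Q[s_next])   # bootstrapped target
    Q[s, a] += alpha * (td_target - Q[s, a])    # move Q towards the target
    return Q

# Toy 2-state, 2-action MDP: repeat the same transition until convergence.
Q = np.zeros((2, 2))
for _ in range(100):
    q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
# Q[0, 1] approaches r + gamma * max_a Q[1, a] = 1.0 here.
```

With a decaying learning rate satisfying the Robbins-Monro conditions (14), the same iterate converges in the stochastic case as well.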
2.2 Transfer Learning and the SF&GPI Framework
We are interested in the transfer learning setting where the agent has to solve a set of tasks $\mathcal{M} = \{M_1, \ldots, M_n\}$ that, in our case, differ only in their reward function. The Successor Feature (SF) framework provides a principled way to perform transfer learning [Barreto et al., 2017, 2018]. SF assumes that the reward function can be decomposed into a linear combination of features $\phi(s,a,s') \in \mathbb{R}^d$ and a reward weight vector $\mathbf{w}_i$ that is defined for a task $M_i$:

(3)  $r_i(s,a,s') = \phi(s,a,s')^\top \mathbf{w}_i$
We refer to such reward functions as linear reward functions. Since the various tasks differ only in their reward functions, the features $\phi$ are the same for all tasks in $\mathcal{M}$.
Given the decomposition above, it is also possible to rewrite the Q-function as an expected discounted sum over future features, the successor features $\psi$, multiplied by the reward weight vector $\mathbf{w}_i$:

(4)  $Q_i^\pi(s,a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k \phi_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]^\top \mathbf{w}_i = \psi^\pi(s,a)^\top \mathbf{w}_i$
This decouples the dynamics of the policy $\pi$ in the feature space of the MDP from the expected rewards for such features. Thus, it is now possible to evaluate the policy in a different task $M_j$ using a simple multiplication of the weight vector with the $\psi$-function: $Q_j^\pi(s,a) = \psi^\pi(s,a)^\top \mathbf{w}_j$. Interestingly, the $\psi$-function also follows the Bellman equation:

(5)  $\psi^\pi(s,a) = \mathbb{E}_\pi\left[\phi_{t+1} + \gamma\, \psi^\pi(s_{t+1}, \pi(s_{t+1})) \,\middle|\, s_t = s, a_t = a\right]$
and can therefore be learned with conventional RL methods. Moreover, Lehnert and Littman [2019] showed the equivalence of SF-learning to Q-learning.
In a new task $M_{n+1}$, Generalized Policy Improvement (GPI) can be used to select at each state the best action over all policies $\pi_1, \ldots, \pi_n$ learned so far:

(6)  $\pi(s) = \arg\max_a \max_i Q_{n+1}^{\pi_i}(s,a) = \arg\max_a \max_i \psi^{\pi_i}(s,a)^\top \mathbf{w}_{n+1}$
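A minimal sketch of the GPI selection rule (6) with tabular successor features; the toy policies, features, and task weights are hypothetical:

```python
import numpy as np

def gpi_action(psi_list, w_new, s):
    """GPI (Eq. 6): argmax_a max_i psi^{pi_i}(s, a)^T w_new.

    psi_list: one array of shape (n_states, n_actions, d) per source policy.
    w_new:    reward weight vector of the new task.
    """
    # Re-evaluate every source policy on the new task, then take the best.
    q = np.stack([psi[s] @ w_new for psi in psi_list])  # (n_policies, n_actions)
    return int(np.argmax(q.max(axis=0)))

# Two source policies over 1 state, 2 actions, and 2-dim features.
psi_1 = np.array([[[1.0, 0.0], [0.0, 0.0]]])  # policy 1: action 0 reaches feature 0
psi_2 = np.array([[[0.0, 0.0], [0.0, 1.0]]])  # policy 2: action 1 reaches feature 1
w_new = np.array([0.2, 1.0])                  # new task rewards feature 1 more
a_best = gpi_action([psi_1, psi_2], w_new, s=0)  # selects action 1
```

Note that no source policy needs to be retrained: only the dot product with the new weight vector changes.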
Barreto et al. [2018] proved that, under appropriate conditions on the optimal policy approximations, the policy constructed in (6) is close to the optimal one, and their difference is upper-bounded:

(7)  $Q_{n+1}^*(s,a) - Q_{n+1}^\pi(s,a) \leq \frac{2}{1-\gamma}\left( \| r_{n+1} - \tilde{r} \|_\infty + \min_i \| \tilde{r} - r_i \|_\infty + \epsilon \right)$
where $\tilde{r}(s,a,s') = \phi(s,a,s')^\top \tilde{\mathbf{w}}$ denotes the reward of the theoretically closest linear task and $\epsilon$ the approximation error of the learned policies. For an arbitrary reward function, the result can be interpreted in the following manner. Given the arbitrary task $M_{n+1}$, we identify the theoretically closest possible linear reward task $\tilde{M}$ with reward $\tilde{r}$. For this theoretically closest task, we search the linear task in our set of tasks $\mathcal{M}$ (from which we also construct the GPI optimal policy (6)) which is closest to it. The upper bound between $Q_{n+1}^*$ and $Q_{n+1}^\pi$ is then defined by 1) the difference between task $M_{n+1}$ and the theoretically closest possible linear task $\tilde{M}$: $\|r_{n+1} - \tilde{r}\|_\infty$; and by 2) the difference between the theoretical task $\tilde{M}$ and the closest task $M_i \in \mathcal{M}$: $\min_i \|\tilde{r} - r_i\|_\infty$. If our new task is also linear, then $\tilde{r} = r_{n+1}$ and the first term in (7) vanishes.
Very importantly, this result shows that the SF framework will only provide a good approximation of the true Q-function if the reward function in a task can be represented by a linear decomposition. If this is not the case, the error in the approximation increases with the distance between the true reward function and its best linear approximation, as stated by the term $\|r_{n+1} - \tilde{r}\|_\infty$.
3 Method: $\xi$-learning
3.1 Definition and foundations of $\xi$-learning
The goal of this paper is to investigate the application of SF&GPI to tasks with general reward functions over state features $\phi$:

(8)  $r(s,a,s') = R(\phi(s,a,s')) = R(\phi_{t+1})$
where we define $\phi_{t+1} = \phi(s_t, a_t, s_{t+1})$. Under this assumption, the Q-function cannot be linearly decomposed into a part that describes the feature dynamics and one that describes the rewards, as in the linear SF framework (4). To overcome this issue, we propose to define the expected cumulative discounted probability of successor features, or $\xi$-function, which is going to be the central mathematical object of the paper, as:

(9)  $\xi^\pi(s,a,\phi) = \sum_{i=0}^{\infty} \gamma^i\, p_i^\pi(\phi \,|\, s, a)$

where $p(\phi_{t+i+1} = \phi \,|\, s_t = s, a_t = a, \pi)$, or in short $p_i^\pi(\phi \,|\, s, a)$, is the probability density function of the features at time $t+i+1$, following policy $\pi$ and conditioned on $s$ and $a$ being the state and action at time $t$ respectively. Note that $\xi^\pi$ depends not only on the policy but also on the state transition probability (constant throughout the paper). With the definition of the $\xi$-function, the Q-function rewrites:

(10)  $Q^\pi(s,a) = \int_{\phi} R(\phi)\, \xi^\pi(s,a,\phi)\, \mathrm{d}\phi$
Depending on the reward function $R$, there are several $\xi$-functions that correspond to the same Q-function. Formally, this is an equivalence relationship, and the quotient space has a one-to-one correspondence with the space of Q-functions.
Proposition 1.
(Equivalence between $\xi$-functions and $Q$) Let $\Xi$ denote the space of $\xi$-functions. Let $\sim$ be defined as $\xi_1 \sim \xi_2 \Leftrightarrow \int_\phi R(\phi)\,\xi_1(s,a,\phi)\,\mathrm{d}\phi = \int_\phi R(\phi)\,\xi_2(s,a,\phi)\,\mathrm{d}\phi$ for all $(s,a)$. Then, $\sim$ is an equivalence relationship, and there is a bijective correspondence between the quotient space $\Xi/\!\sim$ and the space of Q-functions $\mathcal{Q}$.
Corollary 1.
The bijection between $\Xi/\!\sim$ and $\mathcal{Q}$ allows to induce a norm on $\Xi/\!\sim$ from the supremum norm on $\mathcal{Q}$, with which $\Xi/\!\sim$ is a Banach space (since $\mathcal{Q}$ is Banach with $\|\cdot\|_\infty$):

(11)  $\big\| [\xi] \big\| = \sup_{s,a} \left| \int_\phi R(\phi)\, \xi(s,a,\phi)\, \mathrm{d}\phi \right|$
Similar to the Bellman equation for the Q-function, we can define a Bellman operator for the $\xi$-function, denoted by $T$, as:

(12)  $(T\xi)(s,a,\phi) = \mathbb{E}_{s' \sim p(\cdot|s,a)}\left[ p(\phi_{t+1} = \phi \,|\, s, a, s') + \gamma\, \xi(s', a^*, \phi) \right], \quad a^* = \arg\max_{a'} \int_{\phi'} R(\phi')\, \xi(s', a', \phi')\, \mathrm{d}\phi'$
As in the case of the Q-function, we can use $T$ to construct a contractive operator:
Proposition 2.
($\xi$-learning has a fixed point) The operator $T$ is well-defined w.r.t. the equivalence $\sim$, and therefore induces an operator $\tilde{T}$ defined over $\Xi/\!\sim$. $\tilde{T}$ is contractive w.r.t. $\|\cdot\|$. Since $\Xi/\!\sim$ is Banach, $\tilde{T}$ has a unique fixed point, and iterating $\tilde{T}$ starting anywhere converges to that point.
In other words, successive applications of the operator $T$ converge towards the class of optimal $\xi$-functions, or equivalently to an optimal $\xi$-function defined up to an additive function $h$ satisfying $\int_\phi R(\phi)\, h(s,a,\phi)\, \mathrm{d}\phi = 0$ for all $(s,a)$ (i.e. $\xi^* + h \sim \xi^*$).
While these two results state the theoretical links to standard Q-learning formulations (see Appendix A for the proofs), the operator defined in (12) is not usable in practice because of the expectation over next states. In the next section, we define the optimization iterate, prove its convergence, and provide two variants to perform the updates.
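For discrete feature sets, the re-evaluation in (10) reduces to a weighted sum. A minimal sketch (the toy arrays and reward tables are our own illustration):

```python
import numpy as np

def q_from_xi(xi, R):
    """Eq. (10) for discrete features: Q(s, a) = sum_phi R(phi) * xi(s, a, phi).

    xi: array (n_states, n_actions, n_features) of cumulative discounted
        feature probabilities; R: array (n_features,) of per-feature rewards.
    """
    return xi @ R

# One state, two actions, three possible feature values.
xi = np.array([[[0.5, 0.2, 0.0],
                [0.0, 0.1, 0.6]]])
R_task_a = np.array([1.0, 0.0, 0.0])  # task a rewards only feature 0
R_task_b = np.array([0.0, 0.0, 1.0])  # task b rewards only feature 2
# The same xi re-evaluates the policy under both reward functions:
q_a = q_from_xi(xi, R_task_a)  # action 0 preferred
q_b = q_from_xi(xi, R_task_b)  # action 1 preferred
```

The key point is that one learned $\xi$ supports arbitrary, non-linear reward tables $R$ without retraining.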
3.2 $\xi$-learning algorithms
In order to learn the $\xi$-function, we introduce the $\xi$-learning update operator, which is an off-policy temporal difference method analogous to Q-learning. Given a transition $(s, a, \phi_{t+1}, s')$, the $\xi$-learning update operator is defined as:

(13)  $\xi_{k+1}(s,a,\phi) = (1 - \alpha_k)\, \xi_k(s,a,\phi) + \alpha_k \left( p(\phi_{t+1} = \phi \,|\, s, a, s') + \gamma\, \xi_k(s', a^*, \phi) \right)$

where $a^* = \arg\max_{a'} \int_{\phi'} R(\phi')\, \xi_k(s', a', \phi')\, \mathrm{d}\phi'$.
The following is one of the main results of the manuscript, stating the convergence of $\xi$-learning:
Theorem 1.
(Convergence of $\xi$-learning) For a sequence of state-action-feature transitions $(s_k, a_k, \phi_{k+1}, s'_k)_{k \in \mathbb{N}}$, consider the $\xi$-learning update given in (13). If the sequence of state-action-feature triples visits each state-action pair infinitely often, and if the learning rate $\alpha_k$ is an adapted sequence satisfying the Robbins-Monro conditions:

(14)  $\sum_{k=0}^{\infty} \alpha_k = \infty, \qquad \sum_{k=0}^{\infty} \alpha_k^2 < \infty$

then the sequence of $\xi$-function classes corresponding to the iterates converges to the optimum, which corresponds to the optimal Q-function to which standard Q-learning updates would converge:

(15)  $\lim_{k \to \infty} \int_\phi R(\phi)\, \xi_k(s,a,\phi)\, \mathrm{d}\phi = Q^*(s,a), \quad \forall (s,a)$
The proof is provided in Appendix A and follows the same flow as for Q-learning.
The previous theorem provides convergence guarantees under the assumption that either $p(\phi_{t+1} = \phi \,|\, s, a, s')$ is known, or an unbiased estimate can be constructed. In the following, we propose two different ways to approximate this probability from a given transition so as to perform the update (13).
Model-free (MF) $\xi$-learning:
The first instance of $\xi$-learning, which we call Model-free (MF) $\xi$-learning, uses the same principle as standard model-free temporal difference learning methods. For a given transition, the update assumes that the probability of the observed feature $\phi_{t+1}$ is $1$, whereas for all other features ($\phi \neq \phi_{t+1}$) the probability is $0$ (see Appendix C for continuous features). The resulting updates are:

(16)  $\xi_{k+1}(s,a,\phi) = (1 - \alpha_k)\, \xi_k(s,a,\phi) + \alpha_k \left( \mathbb{1}[\phi = \phi_{t+1}] + \gamma\, \xi_k(s', a^*, \phi) \right)$

Due to the stochastic nature of the update, and if the learning rate discounts over time, the $\xi$-function will learn the true probability of $\phi_{t+1}$. A problematic point with the MF procedure is that it potentially induces a high variance when the true feature probabilities are not binary. To cope with this potentially negative effect, we propose a different variant.
One-step SF Model (MB) $\xi$-learning:
We introduce a second $\xi$-learning procedure, called One-step SF Model (MB) $\xi$-learning, that attempts to reduce the variance of the update. To do so, MB $\xi$-learning estimates the distribution over the next features over time. Let $\hat{p}(\phi \,|\, s, a)$ denote the current estimate of the feature distribution. Given a transition $(s, a, \phi_{t+1}, s')$, the model is updated according to:

(17)  $\hat{p}_{k+1}(\phi \,|\, s, a) = (1 - \beta_k)\, \hat{p}_k(\phi \,|\, s, a) + \beta_k\, \mathbb{1}[\phi = \phi_{t+1}]$

where $\beta_k$ is the learning rate. After updating the model $\hat{p}$, it can be used in place of $p(\phi_{t+1} = \phi \,|\, s, a, s')$ for the update as defined in (13). Since the learned model is independent from the reward function and from the policy, it can be learned and used over all tasks.
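A tabular sketch of the MB variant, combining the model update (17) with the $\xi$-update (13); again a minimal, hypothetical illustration rather than the paper's implementation:

```python
import numpy as np

def mb_xi_update(xi, p_hat, R, s, a, phi_next, s_next,
                 alpha=0.1, beta=0.1, gamma=0.9):
    """One-step SF model (MB) xi-learning step.

    p_hat: array (n_states, n_actions, n_features) estimating p(phi | s, a);
    its estimate replaces the one-hot target of the MF variant, which
    lowers the variance of the update when feature outcomes are stochastic.
    """
    # 1) Move the one-step feature model towards the observed feature (Eq. 17).
    one_hot = np.zeros(p_hat.shape[-1])
    one_hot[phi_next] = 1.0
    p_hat[s, a] += beta * (one_hot - p_hat[s, a])
    # 2) xi-update (Eq. 13) using the model estimate as target probability.
    a_star = int(np.argmax(xi[s_next] @ R))
    target = p_hat[s, a] + gamma * xi[s_next, a_star]
    xi[s, a] += alpha * (target - xi[s, a])
    return xi, p_hat
```

Because `p_hat` averages over past outcomes, the $\xi$-target no longer jumps between 0 and 1 for stochastic features.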
3.3 Meta $\xi$-learning
After discussing $\xi$-learning on a single task and showing its theoretical convergence, we can now investigate how it can be applied in transfer learning. Similar to the linear SF framework, the $\xi$-function allows to reevaluate a policy $\pi_i$ learned for task $M_i$ in a new task $M_j$:

(18)  $Q_j^{\pi_i}(s,a) = \int_\phi R_j(\phi)\, \xi^{\pi_i}(s,a,\phi)\, \mathrm{d}\phi$
This allows us to apply GPI in (6) for arbitrary reward functions, in a similar manner to what was proposed for linear reward functions in [Barreto et al., 2018]. We extend the GPI result to the $\xi$-learning framework as follows:
Theorem 2.
(Generalized policy improvement in $\xi$-learning) Let $\mathcal{M} = \{M_1, \ldots, M_n\}$ be the set of tasks, each one associated to a (possibly different) weighting function $R_i$. Let $\xi^{\pi_i^*}$ be a representative of the optimal class of $\xi$-functions for task $M_i$, and let $\tilde{\xi}^{\pi_i}$ be an approximation to the optimal $\xi$-function, with approximation error bounded by $\epsilon$ in the induced norm (11). Then, for another task $M$ with weighting function $R$, the policy defined as:

(19)  $\pi(s) = \arg\max_a \max_i \int_\phi R(\phi)\, \tilde{\xi}^{\pi_i}(s,a,\phi)\, \mathrm{d}\phi$

satisfies:

(20)  $Q^*(s,a) - Q^\pi(s,a) \leq \frac{2}{1-\gamma} \left( \min_i \| R - R_i \|_\infty + \epsilon \right)$

where $\| R - R_i \|_\infty = \max_\phi | R(\phi) - R_i(\phi) |$.
The proof is provided in Appendix A.
4 Experiments
We evaluated $\xi$-learning in two environments. The first has discrete features. It is a modified version of the object collection task by Barreto et al. [2017], which we extended with features of higher complexity, allowing the use of general reward functions. See Appendix D.1 for experimental results in the original environment. The second environment, the racer environment, evaluates the agents in tasks with continuous features.
4.1 Discrete Features  Object Collection Environment
Environment:
The environment consists of 4 rooms (Fig. 1 - a). The agent starts an episode in position S and has to learn to reach the goal position G. During an episode, the agent can collect objects to gain further rewards. Each object has 2 properties: 1) color: orange or blue, and 2) form: box or triangle. The state space is a high-dimensional vector. It encodes the agent's position using a grid of two-dimensional Gaussian radial basis functions. Moreover, it includes a memory about which objects have already been collected. Agents can move in 4 directions. The features $\phi \in \{0,1\}^5$ are binary vectors. The first 2 dimensions encode if an orange or a blue object was picked up. The 2 following dimensions encode the form. The last dimension encodes if the agent reached goal G. For example, $\phi = (1, 0, 1, 0, 0)$ encodes that the agent picked up an orange box.
Tasks:
Each agent learns sequentially 300 tasks which differ in their reward for collecting objects. We compared agents in two settings: tasks with either linear or general reward functions. For each linear task $M_i$, the rewards are defined by a linear combination of the features and a weight vector $\mathbf{w}_i$. The weights for the first 4 dimensions define the rewards for collecting an object with a specific property. They are randomly sampled from a uniform distribution. The final weight defines the reward for reaching the goal position, which is the same for each task. The general reward functions are sampled by assigning a different, uniformly sampled reward to each possible combination of object properties, such that, for example, picking up an orange box results in its own sampled reward.
Figure 1: (a) Collection Environment; (b) Tasks with Linear Reward Functions; (c) Effect of Non-Linearity; (d) Tasks with General Reward Functions.
$\xi$-learning reached the highest average reward per task for (b) linear and (d) general reward functions. The average over 10 runs per algorithm and the standard error of the mean are depicted. (c) The performance difference between $\xi$-learning and SFQL is stronger for general reward tasks with high non-linearity, i.e. where a linear reward model yields a high error. SFQL reaches only a fraction of MF $\xi$-learning's performance in tasks where the linear reward model has a high mean error.
Agents:
We compared $\xi$-learning to Q-learning (QL) and classical SF Q-learning (SFQL) [Barreto et al., 2017]. All agents use function approximation for their state-action functions (Q-, $\psi$-, or $\xi$-function). An independent linear mapping is used to map the state to the values for each of the 4 actions. As the features are discrete, the $\xi$-function and the model $\hat{p}$ are approximated by an independent mapping for each action and possible feature $\phi$. The Q-value for the $\xi$-agents (Eq. 10) is computed by $Q(s,a) = \sum_\phi R(\phi)\, \xi(s,a,\phi)$. The reward functions of each task are given to the agents. For SFQL, the sampled reward weights were given in tasks with linear reward functions. For general reward functions, a linear model approximating the rewards was learned for each task and its weights given to SFQL. Each task was executed for a fixed number of steps, and the average performance over 10 runs per algorithm was measured. We performed a grid search over the parameters of each agent, reporting here the performance of the parameters with the highest total reward over all tasks.
Results:
$\xi$-learning outperformed SFQL and QL in tasks with linear and general reward functions (Fig. 1 - b, d). MF showed a slight advantage over MB $\xi$-learning in both settings. We further studied the effect of the non-linearity of general reward functions on the performance of classical SF compared to $\xi$-learning by evaluating them in tasks with different levels of non-linearity. We sampled general reward functions that resulted in different levels of mean absolute model error when linearly approximated. We trained SFQL and MF $\xi$-learning in each of these conditions on 300 tasks and measured the ratio between the total return of SFQL and MF (Fig. 1 - c). The relative performance of SFQL compared to MF decreases with higher non-linearity of the reward functions. For reward functions that are nearly linear, both have a similar performance, whereas for reward functions that are difficult to model with a linear relation, SFQL reaches only a fraction of the performance of $\xi$-learning. This follows SFQL's theoretical limitation in (7) and shows the advantage of $\xi$-learning over SFQL in non-linear reward tasks.
4.2 Continuous Features  Racer Environment
Environment and Tasks:
We further evaluated the agents in an environment with continuous features (Fig. 2 - a). The agent is randomly placed in the environment and has to drive around for 200 time steps before the episode ends. Similar to a car, the agent has an orientation and momentum, so that it can only drive straight, or in a right or left curve. The agent reappears on the opposite side if it exits one side. The distances to 3 markers are provided as features $\phi \in \mathbb{R}^3$. Rewards depend on these distances, where each feature dimension has 1 or 2 preferred distances defined by Gaussian functions. For each of the 65 tasks, the number of Gaussians and their properties ($\mu$, $\sigma$) are randomly sampled for each feature dimension. Fig. 2 (a) shows a reward function with dark areas depicting higher rewards. The agent has to learn to drive around in such a way as to maximize its trajectory over positions with high rewards. The state space is a high-dimensional vector encoding the agent's position and orientation. As before, the 2D position is encoded using a grid of two-dimensional Gaussian radial basis functions. Similarly, the orientation is also encoded using Gaussian radial basis functions.
Agents:
We introduce an MF $\xi$-agent for continuous features (CMF) (Appendix C.2.1). CMF discretizes each feature dimension into bins with bin centers $c_k$. It learns for each dimension $d$ and bin $k$ the value $\xi_d(s,a,c_k)$. Q-values (Eq. 10) are computed by $Q(s,a) = \sum_d \sum_k R_d(c_k)\, \xi_d(s,a,c_k)$. SFQL received an approximated weight vector that was trained before the task started on several uniformly sampled features and rewards.
Results:
$\xi$-learning reached the highest performance of all agents (Fig. 2 - b). SFQL reaches only a low performance, below QL, because it is not able to approximate the general reward functions sufficiently well with its linear reward model. $\xi$-learning only slightly improves over QL, showing that SF&GPI transfer in this environment is less efficient than in the object collection environment (Fig. 1).
Figure 2: (a) Racer Environment; (b) Tasks with General Reward Functions.
5 Discussion
$\xi$-learning in Tasks with General Reward Functions:
$\xi$-learning allows to disentangle the dynamics of policies in the feature space of a task from the associated reward, see (10). The experimental evaluation in tasks with general reward functions (Fig. 1 - d, and Fig. 2) shows that $\xi$-learning can therefore successfully apply GPI to transfer knowledge from learned tasks to new ones. Given a general reward function, it can successfully reevaluate learned policies for knowledge transfer. Instead, classical SFQL based on a linear decomposition (3) cannot be directly applied given a general reward function. In this case, a linear approximation has to be learned, which shows inferior performance compared to $\xi$-learning, which directly uses the true reward function.
$\xi$-learning in Tasks with Linear Reward Functions:
$\xi$-learning also shows an increased performance over SFQL in environments with linear reward functions (Fig. 1 - b). This effect cannot be attributed to differences in their computation of the expected return of a policy, as both are correct. A possible explanation could be that $\xi$-learning reduces the complexity of the function approximation for the $\xi$-function compared to the $\psi$-function in SFQL.
Continuous Feature Spaces:
For tasks with continuous features (racer environment), $\xi$-learning successfully used a discretization of each feature dimension and learned the $\xi$-values independently for each dimension. This strategy is viable for reward functions that are cumulative over the feature dimensions: $R(\phi) = \sum_d R_d(\phi_d)$. The Q-value can then be computed by summing over the independent dimensions $d$ and the bins $k$: $Q(s,a) = \sum_d \sum_k R_d(c_k)\, \xi_d(s,a,c_k)$. For more general reward functions, the space of all feature combinations would need to be discretized, which grows exponentially with each new dimension. As a solution, the $\xi$-function could be directly defined over the continuous feature space, but this yields some problems. First, the computation of the expected return requires an integral over features instead of a sum, which is a priori intractable. Second, the representation and training of the $\xi$-function becomes harder, as it would be defined over a continuum, increasing the difficulty of the function approximation. Janner et al. [2020] and Touati and Ollivier [2021] propose methods that might allow to represent a continuous $\xi$-function, but it is unclear if they converge and if they can be used for transfer learning.
Computational Complexity:
The improved performance of SFQL and $\xi$-learning over QL in the transfer learning setting comes at the cost of an increased computational complexity. The GPI procedure (6) of both approaches requires evaluating, at each step, the $\psi$- or $\xi$-function over all previously experienced tasks in $\mathcal{M}$. As a consequence, the computational complexity increases linearly with each new task that is added. A solution is to apply GPI only over a subset of learned policies. Nonetheless, how to optimally select this subset remains an open question.
6 Related work
Transfer Learning:
Transfer methods in RL can be generally categorized according to the type of tasks between which transfer is possible and the type of transferred knowledge [Taylor and Stone, 2009, Lazaric, 2012, Zhu et al., 2020]. In the case of SF&GPI, of which $\xi$-learning is part, tasks differ only in their reward functions. The type of knowledge that is transferred are policies learned in source tasks, which are reevaluated in the target task and recombined using the GPI procedure. A natural use case for $\xi$-learning are continual problems [Khetarpal et al., 2020] where an agent has to continually adapt to changing tasks, which are, in our setting, different reward functions.
Successor Features:
SF are based on the concept of successor representations [Dayan, 1993, Momennejad, 2020]. Successor representations predict the future occurrence of all states for a policy in the same manner as SF do for features. Their application is restricted to low-dimensional state spaces using tabular representations. SF extended them to domains with high-dimensional state spaces [Kulkarni et al., 2016, Zhang et al., 2017, Barreto et al., 2017, 2018] by predicting the future occurrence of low-dimensional features that are relevant to define the return. Several extensions to the SF framework have been proposed. One direction aims to learn appropriate features from data, such as by optimally reconstructing rewards [Barreto et al., 2017], using the concept of mutual information [Hansen et al., 2019], or grouping temporally similar states [Madjiheurem and Toni, 2019]. Another direction is the generalization of the $\psi$-function over policies [Borsa et al., 2018], analogous to universal value function approximation [Schaul et al., 2015]. Similar approaches use successor maps [Madarasz, 2019], goal-conditioned policies [Ma et al., 2020], or successor feature sets [Brantley et al., 2021]. Other directions include their application to POMDPs [Vértes and Sahani, 2019], combination with max-entropy principles [Vertes, 2020], or hierarchical RL [Barreto et al., 2021]. In contrast to $\xi$-learning, all these approaches build on the assumption of linear reward functions, whereas $\xi$-learning allows the SF&GPI framework to be used with general reward functions. Nonetheless, most of the extensions for linear SF can be combined with $\xi$-learning.
Model-based RL:
SF represent the dynamics of a policy in the feature space decoupled from the rewards, allowing to reevaluate the policy under different reward functions. SF therefore share similar properties with model-based RL [Lehnert and Littman, 2019]. In general, model-based RL methods learn a one-step model of the environment dynamics $p(s_{t+1} \,|\, s_t, a_t)$. Given a policy and an arbitrary reward function, rollouts can be performed using the learned model to evaluate the return. In practice, the rollouts have a high variance for long-term predictions, rendering them ineffective. Recently, Janner et al. [2020] proposed the $\gamma$-model framework that learns a generative model of discounted long-term predictions in continuous domains. Nonetheless, its application to transfer learning is not discussed and no convergence is proven, as it is for $\xi$-learning. The same holds for the forward-backward MDP representation proposed in Touati and Ollivier [2021]. Tang et al. [2021] also propose to decouple the dynamics in the state space from the rewards, but learn an internal representation of the rewards. This does not allow to reevaluate a policy for a new reward function without relearning the mapping.
7 Conclusion
The introduced $\xi$-learning framework learns the expected cumulative discounted probability of successor features, which disentangles the dynamics of a policy in the feature space of a task from the expected rewards. This allows $\xi$-learning to reevaluate the expected return of learned policies for general reward functions and to use it for transfer learning utilizing GPI. We proved that $\xi$-learning converges to the optimal policy, and showed experimentally its improved performance over Q-learning and the classical SF framework for tasks with linear and general reward functions.
Ethics Statement
$\xi$-learning and its associated optimization algorithms represent general RL procedures similar to Q-learning. Their potential negative societal impact depends on their application domains, which range over all possible societal areas in a similar manner as for other general RL procedures.
Beyond the topic of the paper, we did our best to cite the relevant literature and to fairly compare with previous ideas, concepts and methods. To that aim, all agents are trained and evaluated within the same software environment, and under the very same experimental settings.
Reproducibility Statement
In order to ensure high chances of reproducibility, we provide many details of the method and experiments associated with the paper. In particular, we have provided the proofs for all mathematical results announced in the main paper (see Appendix A). These constitute the theoretical foundation of the proposed $\xi$-learning methodology. Secondly, we have provided all experimental details (methods and environments) required for reproducing our experiments, namely Appendix B for the object collection environment and Appendix C for the racer environment. In addition, we provide additional results in Appendix D to fully illustrate the interest of the proposed method. Finally, we provide an anonymous link to the source code, so that reviewers can run it if necessary.
References
 Barreto et al. [2017] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pages 4055–4065, 2017.
 Barreto et al. [2018] A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Zidek, and R. Munos. Transfer in deep reinforcement learning using successor features and generalised policy improvement. In International Conference on Machine Learning, pages 501–510. PMLR, 2018.
 Barreto et al. [2020] A. Barreto, S. Hou, D. Borsa, D. Silver, and D. Precup. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 117(48):30079–30087, 2020.
 Barreto et al. [2021] A. Barreto, D. Borsa, S. Hou, G. Comanici, E. Aygün, P. Hamel, D. Toyama, J. Hunt, S. Mourad, D. Silver, et al. The option keyboard: Combining skills in reinforcement learning. arXiv preprint arXiv:2106.13105, 2021.
 Borsa et al. [2018] D. Borsa, A. Barreto, J. Quan, D. Mankowitz, R. Munos, H. van Hasselt, D. Silver, and T. Schaul. Universal successor features approximators. arXiv preprint arXiv:1812.07626, 2018.
 Brantley et al. [2021] K. Brantley, S. Mehri, and G. J. Gordon. Successor feature sets: Generalizing successor representations across policies. arXiv preprint arXiv:2103.02650, 2021.
 Dayan [1993] P. Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.
 Hansen et al. [2019] S. Hansen, W. Dabney, A. Barreto, T. Van de Wiele, D. Warde-Farley, and V. Mnih. Fast task inference with variational intrinsic successor features. arXiv preprint arXiv:1906.05030, 2019.
 Janner et al. [2020] M. Janner, I. Mordatch, and S. Levine. Gamma-models: Generative temporal difference learning for infinite-horizon prediction. In NeurIPS, 2020.
 Khetarpal et al. [2020] K. Khetarpal, M. Riemer, I. Rish, and D. Precup. Towards continual reinforcement learning: A review and perspectives. arXiv preprint arXiv:2012.13490, 2020.
 Kulkarni et al. [2016] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396, 2016.
 Lazaric [2012] A. Lazaric. Transfer in reinforcement learning: a framework and a survey. In Reinforcement Learning, pages 143–173. Springer, 2012.
 Lehnert and Littman [2019] L. Lehnert and M. L. Littman. Successor features support model-based and model-free reinforcement learning. CoRR, abs/1901.11437, 2019.
 Ma et al. [2020] C. Ma, D. R. Ashley, J. Wen, and Y. Bengio. Universal successor features for transfer reinforcement learning. arXiv preprint arXiv:2001.04025, 2020.
 Madarasz [2019] T. J. Madarasz. Better transfer learning with inferred successor maps. arXiv preprint arXiv:1906.07663, 2019.
 Madjiheurem and Toni [2019] S. Madjiheurem and L. Toni. State2vec: Off-policy successor features approximators. arXiv preprint arXiv:1910.10277, 2019.
 Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 Momennejad [2020] I. Momennejad. Learning structures: Predictive representations, replay, and generalization. Current Opinion in Behavioral Sciences, 32:155–166, 2020.
 Momennejad et al. [2017] I. Momennejad, E. M. Russek, J. H. Cheong, M. M. Botvinick, N. D. Daw, and S. J. Gershman. The successor representation in human reinforcement learning. Nature Human Behaviour, 1(9):680–692, 2017.
 Schaul et al. [2015] T. Schaul, D. Horgan, K. Gregor, and D. Silver. Universal value function approximators. In International conference on machine learning, pages 1312–1320, 2015.
 Silver et al. [2018] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.
 Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
 Tang et al. [2021] H. Tang, J. Hao, G. Chen, P. Chen, C. Chen, Y. Yang, L. Zhang, W. Liu, and Z. Meng. Foresee then evaluate: Decomposing value estimation with latent future prediction. arXiv preprint arXiv:2103.02225, 2021.
 Taylor and Stone [2009] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009.
 Tomov et al. [2021] M. S. Tomov, E. Schulz, and S. J. Gershman. Multi-task reinforcement learning in humans. Nature Human Behaviour, pages 1–10, 2021.
 Touati and Ollivier [2021] A. Touati and Y. Ollivier. Learning one representation to optimize all rewards. arXiv preprint arXiv:2103.07945, 2021.
 Tsitsiklis [1994] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.
 Vertes [2020] E. Vertes. Probabilistic learning and computation in brains and machines. PhD thesis, UCL (University College London), 2020.
 Vértes and Sahani [2019] E. Vértes and M. Sahani. A neurally plausible model learns successor representations in partially observable environments. arXiv preprint arXiv:1906.09480, 2019.
 Watkins and Dayan [1992] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
 Zhang et al. [2017] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard. Deep reinforcement learning with successor features for navigation across similar environments. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2371–2378. IEEE, 2017.
 Zhu et al. [2020] Z. Zhu, K. Lin, and J. Zhou. Transfer learning in deep reinforcement learning: A survey. arXiv preprint arXiv:2009.07888, 2020.
Appendix A Theoretical Proofs
A.1 Proof of Proposition 1
Let us start by recalling the original statement in the main paper.
Proposition 1.
(Equivalence between functions and Q) Let . Let be defined as . Then, is an equivalence relation, and there is a bijective correspondence between the quotient space and .
Proof.
We prove the statements sequentially.
is an equivalence relation:
To prove this, we need to demonstrate that is reflexive, symmetric, and transitive. All three properties are straightforward since: , and .
Bijective correspondence:
To prove bijectivity, we first prove that the correspondence is injective, then that it is surjective. Regarding injectivity: , which we prove by contrapositive:
(21) 
In order to prove the surjectivity, we start from a function and select an arbitrary , then the following function:
(22) 
satisfies that and that . We conclude that there is a bijective correspondence between the elements of and of . ∎
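Although the exact relation is defined in the main paper, the proof pattern above can be illustrated on a concrete toy relation. The sketch below is purely illustrative (the choice of relation "equal up to an additive constant", standing in for "differ by a kernel function", and all names are our own assumptions) and checks the three defining properties of an equivalence relation numerically:

```python
import numpy as np

# Toy relation on real-valued functions over a finite grid:
#   f ~ g  iff  f - g is constant.
# This stands in for "differ by a kernel function"; the concrete
# choice here is an illustrative assumption, not the paper's relation.
def equivalent(f, g, tol=1e-12):
    d = f - g
    return bool(np.max(np.abs(d - d.mean())) < tol)

rng = np.random.default_rng(0)
f = rng.normal(size=5)
g = f + 3.0          # differs from f by a constant
h = g - 1.5          # differs from g (and hence f) by a constant

assert equivalent(f, f)                       # reflexive
assert equivalent(f, g) and equivalent(g, f)  # symmetric
assert equivalent(f, g) and equivalent(g, h) and equivalent(f, h)  # transitive
assert not equivalent(f, f + np.arange(5.0))  # non-constant shift breaks it
```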
A.2 Proof of Corollary 1
Let us recall the result:
Corollary 1.
The bijection between and allows us to induce a norm on from the supremum norm on , with which is a Banach space (since is Banach with ):
(23) 
Proof.
The norm induced on the quotient space is defined via the correspondence between and , as given in the previous equation. The norm is well-defined since it does not depend on the choice of class representative. Therefore, all metric properties are transferred, and is immediately Banach with the norm . ∎
A.3 Proof of Proposition 2
Let us restate the result:
Proposition 2.
(learning has a fixed point) The operator is well-defined w.r.t. the equivalence , and therefore induces an operator defined over . is contractive w.r.t. . Since is Banach, has a unique fixed point, and iterating from any starting point converges to that fixed point.
Proof.
We prove the statements above one by one:
The operator is well defined:
Let us first recall the definition of the operator in (12), where we removed the dependency on for simplicity:
Let be two different representatives of the class ; then we can write:
(24) 
because . Therefore the operator is well defined in the quotient space, since the image of class does not depend on the function chosen to represent the class.
Contractive operator :
The contractiveness of can be proven directly:
(25) 
The contractiveness of can also be understood as being inherited from the standard Bellman operator on . Indeed, given a function, one can easily see that applying the standard Bellman operator to the function corresponding to leads to the function corresponding to .
Fixed point of :
To conclude the proof, we use the fact that any contractive operator on a Banach space, in our case , has a unique fixed point , and that for any starting point , the sequence converges to w.r.t. the corresponding norm . ∎
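The Banach fixed-point argument used above can be illustrated numerically. The sketch below is our own toy example (a tabular Bellman optimality operator on a made-up MDP, not the operator of the paper): repeated application of a γ-contraction shrinks the sup-norm gap geometrically and converges to the unique fixed point:

```python
import numpy as np

# Tiny toy MDP: 3 states, 2 actions. The Bellman optimality operator T
# is a gamma-contraction in the sup norm, so by the Banach fixed-point
# theorem repeated application converges to its unique fixed point.
gamma = 0.9
rng = np.random.default_rng(1)
R = rng.normal(size=(3, 2))                  # rewards R[s, a]
P = rng.dirichlet(np.ones(3), size=(3, 2))   # transitions P[s, a, s']

def T(Q):
    V = Q.max(axis=1)         # greedy value per state, shape (3,)
    return R + gamma * P @ V  # (3, 2, 3) @ (3,) -> (3, 2)

Q = np.zeros((3, 2))
gaps = []                     # sup-norm distance between successive iterates
for _ in range(200):
    Q_next = T(Q)
    gaps.append(np.abs(Q_next - Q).max())
    Q = Q_next

# The gap contracts at least geometrically with rate gamma,
# and the final iterate is (numerically) a fixed point of T.
assert all(g2 <= gamma * g1 + 1e-12 for g1, g2 in zip(gaps, gaps[1:]))
assert np.allclose(T(Q), Q, atol=1e-8)
```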
A.4 Proof of Theorem 1
The two previous propositions will be useful to prove that the learning iterates converge in . Let us restate the definition of the operator from (13):
and the theoretical result:
Theorem 1.
(Convergence of learning) For a sequence of state-action-feature triples, consider the learning update given in (13). If the sequence of state-action-feature triples visits each state-action pair infinitely often, and if the learning rate is an adapted sequence satisfying the Robbins-Monro conditions:
(26) 
then the sequence of function classes corresponding to the iterates converges to the optimum, which corresponds to the optimal Q-function to which standard Q-learning updates would converge:
(27) 
Proof.
The proof follows the structure of the convergence proof for Q-learning [Tsitsiklis, 1994]. Indeed, we rewrite the operator above as:
with defined as:
Obviously, satisfies , which, together with the contractiveness of , is sufficient to demonstrate the convergence of the iterative procedure, as done for Q-learning. In our case, the optimal function is defined up to an additive kernel function . The correspondence with the optimal Q-learning function is a direct application of the correspondence between the and Q-learning problems. ∎
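The Robbins-Monro conditions (the learning rates sum to infinity while their squares sum to a finite value) are satisfied, for instance, by α_t = 1/t. As a minimal illustrative sketch (a standard stochastic-approximation mean estimate, not the paper's learning update; all names are our own), iterates driven by such a schedule average out the noise and converge:

```python
import numpy as np

# Robbins-Monro schedule alpha_t = 1/t satisfies
#   sum_t alpha_t = inf   and   sum_t alpha_t^2 < inf.
# Minimal stochastic-approximation example (not the paper's update rule):
# estimate the mean of noisy samples r_t via
#   x_{t+1} = x_t + alpha_t * (r_t - x_t),
# which with alpha_t = 1/t is exactly the running average and converges
# to the true mean despite the per-sample noise.
rng = np.random.default_rng(2)
true_mean = 0.7
x = 0.0
for t in range(1, 200_001):
    r = true_mean + rng.normal(scale=0.5)  # noisy observation
    x += (1.0 / t) * (r - x)

assert abs(x - true_mean) < 0.01
```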
A.5 Proof of Theorem 2
Let us restate the result.
Theorem 2.
(Generalised policy improvement in learning) Let be the set of tasks, each one associated with a (possibly different) weighting function . Let be a representative of the optimal class of functions for task , , and let be an approximation to the optimal function, . Then, for another task with weighting function , the policy defined as:
(28) 
satisfies:
(29) 
where .
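The policy construction in (28) can be sketched numerically: re-evaluate each previously learned (approximate) value function on the new task and act greedily with respect to their pointwise maximum. The sketch below is a generic GPI illustration with made-up Q-tables (our own names and numbers, not the ξ-function evaluation of the paper):

```python
import numpy as np

# Generic GPI sketch: given approximate Q-tables of policies learned on
# previous source tasks (already re-evaluated under the new task's
# reward), select the action maximising their pointwise maximum.
def gpi_action(q_tables, state):
    # q_tables: list of arrays of shape (n_states, n_actions)
    stacked = np.stack([q[state] for q in q_tables])  # (n_tasks, n_actions)
    return int(stacked.max(axis=0).argmax())

q1 = np.array([[1.0, 0.0], [0.2, 0.3]])  # policy from source task 1
q2 = np.array([[0.5, 2.0], [0.1, 0.0]])  # policy from source task 2

# In state 0, task 2's policy promises more (2.0 via action 1).
assert gpi_action([q1, q2], state=0) == 1
# In state 1, task 1's policy dominates (0.3 via action 1).
assert gpi_action([q1, q2], state=1) == 1
```

The GPI theorem then guarantees that the resulting policy is at least as good, at every state, as each of the source policies evaluated on the new task, up to the stated approximation error.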
Proof.
The proof proceeds in two steps. First, we exploit the proof of Proposition 1 of Barreto et al. [2017], and in particular their (13), which states:
(30) 
where