Xi-Learning: Successor Feature Transfer Learning for General Reward Functions

10/29/2021
by Chris Reinke, et al.

Transfer in Reinforcement Learning aims to improve learning performance on target tasks using knowledge from experienced source tasks. Successor features (SF) are a prominent transfer mechanism in domains where the reward function changes between tasks. They reevaluate the expected return of previously learned policies in a new target task to transfer their knowledge. A limiting factor of the SF framework is its assumption that rewards linearly decompose into successor features and a reward weight vector. We propose a novel SF mechanism, ξ-learning, based on learning the cumulative discounted probability of successor features. Crucially, ξ-learning allows the expected return of policies to be reevaluated for general reward functions. We introduce two ξ-learning variations, prove the convergence of ξ-learning, and provide a guarantee on its transfer performance. Experimental evaluations based on ξ-learning with function approximation demonstrate the prominent advantage of ξ-learning over available mechanisms not only for general reward functions, but also in the case of linearly decomposable reward functions.


1 Introduction

Reinforcement Learning (RL) has successfully addressed many complex problems such as playing computer games, chess, and even Go with superhuman performance [Mnih et al., 2015, Silver et al., 2018]. These impressive results are possible thanks to a vast amount of interactions of the RL agent with its environment/task. Such a strategy is unsuitable in settings where the agent has to perform and learn at the same time. Consider, for example, a caregiver robot in a hospital that has to learn a new task, such as a new route to deliver meals. In such a setting, the agent cannot collect a vast amount of training samples but has to adapt quickly instead. Transfer learning aims to provide mechanisms to quickly adapt agents in such settings [Taylor and Stone, 2009, Lazaric, 2012, Zhu et al., 2020]. The rationale is to use knowledge from previously encountered source tasks for a new target task to improve the learning performance on the target task. The previous knowledge can help reduce the amount of interactions required to learn the new optimal behavior. For example, the caregiver robot could reuse knowledge about the layout of the hospital it learned in previous source tasks (e.g. guiding a person) to learn to deliver meals.

The Successor Feature (SF) and Generalized Policy Improvement (GPI) framework [Barreto et al., 2020] is a prominent transfer learning mechanism for tasks where only the reward function differs. Its basic premise is that the rewards which the RL agent tries to maximize are defined based on a low-dimensional feature descriptor φ. For our caregiver robot these could be the IDs of beds or rooms that it is visiting, in contrast to its high-dimensional visual state input from a camera. The rewards are then computed not based on its visual input but on the IDs of the beds or rooms that it visits. The expected cumulative discounted successor features, the ψ-function, are learned for each behavior that the robot learned in the past. They represent the dynamics in the feature space that the agent experiences for a behavior. This corresponds to the rooms or beds the caregiver agent would visit if using the behavior. This representation of feature dynamics is independent of the reward function. A behavior learned in a previous task and described by this SF representation can therefore be directly re-evaluated for a different reward function. In a new task, i.e. for a new reward function, the GPI procedure re-evaluates the behaviors learned in previous tasks. It then selects at each state the behavior of a previous task if it improves the expected reward. This allows behaviors learned in previous source tasks to be reused for a new target task. A similar transfer strategy can also be observed in the behavior of humans [Momennejad et al., 2017, Momennejad, 2020, Tomov et al., 2021].

The classical SF&GPI framework [Barreto et al., 2017, 2018] makes the assumption that rewards are a linear composition of the features φ via a reward weight vector w that depends on the task: r(s, a, s') = φ(s, a, s')^⊤ w. This assumption allows to effectively separate the feature dynamics of a behavior from the rewards and thus to re-evaluate previous behaviors given a new reward function, i.e. a new weight vector w. Nonetheless, this assumption also restricts the successful application of SF&GPI to problems where such a linear decomposition is possible. This paper investigates the application of the SF&GPI framework to general reward functions: r(s, a, s') = R(φ(s, a, s')). We propose to learn the cumulative discounted probability over the successor features, named the ξ-function, and refer to the proposed framework as ξ-learning. Our work is related to Janner et al. [2020] and Touati and Ollivier [2021], and brings two important additional contributions. First, we provide a mathematical proof of the convergence of ξ-learning. Second, we demonstrate how ξ-learning can be used for meta-RL, using the ξ-function to re-evaluate behaviors learned in previous tasks for a new reward function R. Furthermore, ξ-learning can also be used to transfer knowledge to new tasks using GPI.

The contribution of our paper is three-fold:

  • We introduce a new RL algorithm, ξ-learning, based on the cumulative discounted probability of successor features, and two variants of its update operator.

  • We provide theoretical proofs of the convergence of ξ-learning to the optimal policy and a guarantee of its transfer learning performance under the GPI procedure.

  • We experimentally compare ξ-learning to standard Q-learning and the classical SF framework in tasks with linear and general reward functions, and in tasks with discrete and continuous features, demonstrating the interest and advantage of ξ-learning.

2 Background

2.1 Reinforcement Learning

RL investigates algorithms to solve multi-step decision problems, aiming to maximize the sum over future rewards [Sutton and Barto, 2018]. RL problems are modeled as Markov Decision Processes (MDPs), which are defined as a tuple M = (S, A, p, r, γ), where S and A are the state and action sets. An agent transitions from a state s_t to another state s_{t+1} using action a_t at time point t, collecting a reward r_{t+1}. This process is stochastic, and the transition probability p(s_{t+1} | s_t, a_t) describes which state is reached. The reward function r(s_t, a_t, s_{t+1}) defines the scalar reward r_{t+1} for the transition. The goal in an MDP is to maximize the expected return G_t = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ], where γ ∈ [0, 1) is the discount factor. The discount factor weights collected rewards by discounting future rewards more strongly. RL provides algorithms to learn a policy π : S → A defining which action to take in which state to maximize G_t.

Value-based RL methods use the concept of value functions to learn the optimal policy. The state-action value function, called Q-function, is defined as the expected future return when taking action a in state s and then following policy π:

Q^π(s, a) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]    (1)

The Q-function can be recursively defined following the Bellman equation, such that the current Q-value depends on the maximum Q-value of the next state s_{t+1}. The optimal policy for an MDP can then be expressed based on the Q-function, by taking at every step the action with the maximum Q-value: π(s) = argmax_a Q*(s, a).

The optimal Q-function can be learned using a temporal difference method such as Q-learning [Watkins and Dayan, 1992]. Given a transition (s_t, a_t, r_{t+1}, s_{t+1}), the Q-value is updated according to:

Q_{k+1}(s_t, a_t) ← (1 − α_k) Q_k(s_t, a_t) + α_k [ r_{t+1} + γ max_{a'} Q_k(s_{t+1}, a') ]    (2)

where α_k is the learning rate at iteration k.
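
To make the update concrete, a minimal tabular Q-learning step following (2) could look as follows (an illustrative sketch with hypothetical state and action indices, not the implementation used in the paper):

    import numpy as np

    n_states, n_actions = 25, 4
    gamma, alpha = 0.95, 0.1                 # discount factor and learning rate
    Q = np.zeros((n_states, n_actions))

    def q_update(s, a, r, s_next):
        """One Q-learning step for a transition (s, a, r, s_next), cf. Eq. (2)."""
        td_target = r + gamma * Q[s_next].max()
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * td_target

    q_update(3, 1, 1.0, 4)                   # example transition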

2.2 Transfer Learning and the SF&GPI Framework

We are interested in the transfer learning setting where the agent has to solve a set of tasks M that, in our case, differ only in their reward function. The Successor Feature (SF) framework provides a principled way to perform transfer learning [Barreto et al., 2017, 2018]. SF assumes that the reward function can be decomposed into a linear combination of features φ(s, a, s') and a reward weight vector w_i that is defined for a task M_i:

r_i(s, a, s') = φ(s, a, s')^⊤ w_i    (3)

We refer to such reward functions as linear reward functions. Since the various tasks differ only in their reward functions, the features φ are the same for all tasks in M.
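
For intuition, a linear reward function in the sense of (3) simply weights each feature dimension; the following toy sketch uses made-up feature and weight values, not the tasks of the paper:

    import numpy as np

    # feature vector of a transition (hypothetical binary encoding)
    phi = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
    # task-specific reward weights w_i: one weight per feature dimension
    w_i = np.array([0.7, -0.3, 0.2, 0.1, 1.0])

    r = phi @ w_i   # Eq. (3): r_i(s, a, s') = phi(s, a, s')^T w_i
    print(r)        # 0.9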

Given the decomposition above, it is also possible to rewrite the Q-function as an expected discounted sum over future features, the successor features ψ^π(s, a), multiplied with the reward weight vector w_i:

Q_i^π(s, a) = E_π[ Σ_{k=0}^∞ γ^k φ_{t+k+1} | s_t = s, a_t = a ]^⊤ w_i = ψ^π(s, a)^⊤ w_i    (4)

This decouples the dynamics of the policy π in the feature space of the MDP from the expected rewards for such features. Thus, it is now possible to evaluate the policy π in a different task M_j using a simple multiplication of the weight vector w_j with the ψ-function: Q_j^π(s, a) = ψ^π(s, a)^⊤ w_j. Interestingly, the ψ-function also follows the Bellman equation:

ψ^π(s, a) = E[ φ_{t+1} + γ ψ^π(s_{t+1}, π(s_{t+1})) | s_t = s, a_t = a ]    (5)

and can therefore be learned with conventional RL methods. Moreover, Lehnert and Littman [2019] showed the equivalence of SF-learning to Q-learning.
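
A tabular sketch of how the ψ-function can be learned with a TD update following (5) and re-evaluated on a new task via a dot product with its weight vector; the shapes and the policy handling (the next action a_next chosen by the fixed policy) are illustrative assumptions:

    import numpy as np

    n_states, n_actions, n_features = 25, 4, 5
    gamma, alpha = 0.95, 0.1
    psi = np.zeros((n_states, n_actions, n_features))  # successor features per (s, a)

    def sf_update(s, a, phi_next, s_next, a_next):
        """TD update of the successor features, cf. Eq. (5)."""
        target = phi_next + gamma * psi[s_next, a_next]
        psi[s, a] += alpha * (target - psi[s, a])

    def q_from_sf(s, a, w):
        """Re-evaluate the policy on a task with weight vector w, cf. Eq. (4)."""
        return float(psi[s, a] @ w)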

When facing a new task M_{n+1}, Generalized Policy Improvement (GPI) can be used to select, over all policies learned so far, the action that behaves best:

π(s) ∈ argmax_a max_i Q_{n+1}^{π_i}(s, a),   with Q_{n+1}^{π_i}(s, a) = ψ^{π_i}(s, a)^⊤ w_{n+1}    (6)

Barreto et al. [2018] proved that, under appropriate conditions on the approximations of the optimal policies, the policy constructed in (6) is close to the optimal one, and their difference is upper-bounded:

Q_{n+1}^{π^*_{n+1}}(s, a) − Q_{n+1}^π(s, a) ≤ 2/(1 − γ) ( ‖r_{n+1} − r̃_{n+1}‖_∞ + min_i ‖r̃_{n+1} − r_i‖_∞ )    (7)

where r̃_{n+1} = φ^⊤ w̃_{n+1} denotes the best linear approximation of the reward function r_{n+1}. For an arbitrary reward function, the result can be interpreted in the following manner. Given the arbitrary task M_{n+1}, we identify the theoretically closest possible linear reward task with reward r̃_{n+1} = φ^⊤ w̃_{n+1}. For this theoretically closest task, we search the linear task M_i in our set of tasks M (from which we also construct the GPI optimal policy (6)) which is closest to it. The upper bound between Q_{n+1}^{π^*_{n+1}} and Q_{n+1}^π is then defined by 1) the difference between task r_{n+1} and the theoretically closest possible linear task r̃_{n+1}: ‖r_{n+1} − r̃_{n+1}‖_∞; and by 2) the difference between the theoretical task r̃_{n+1} and the closest task in M: min_i ‖r̃_{n+1} − r_i‖_∞. If our new task is also linear, then r_{n+1} = r̃_{n+1} and the first term in (7) vanishes.

Very importantly, this result shows that the SF framework will only provide a good approximation of the true Q-function if the reward function of a task can be represented using a linear decomposition. If this is not the case, then the error of the approximation increases with the distance between the true reward function and its best linear approximation, as stated by the first term in (7).
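
In code, the GPI action selection of (6) for linear rewards reduces to a maximum over source policies and actions of the dot products between their successor features and the new task's weight vector; a sketch assuming tabular ψ-functions stored per previous task:

    import numpy as np

    def gpi_action_linear(psi_list, s, w_new):
        """GPI over the SFs of previously learned policies, cf. Eq. (6).

        psi_list: list of arrays of shape (n_states, n_actions, n_features),
                  one per source policy pi_i.
        w_new:    reward weights of the new task.
        """
        # Q-values of every source policy re-evaluated on the new task
        q_values = np.stack([psi[s] @ w_new for psi in psi_list])  # (n_policies, n_actions)
        return int(q_values.max(axis=0).argmax())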

3 Method: ξ-learning

3.1 Definition and foundations of ξ-learning

The goal of this paper is to investigate the application of SF&GPI to tasks with general reward functions over state features φ:

r(s, a, s') = R(φ(s, a, s')) = R(φ)    (8)

where we define φ ≡ φ(s, a, s'). Under this assumption the Q-function cannot be linearly decomposed into a part that describes the feature dynamics and one that describes the rewards, as in the linear SF framework (4). To overcome this issue, we propose to define the expected cumulative discounted probability of successor features, or ξ-function, which is going to be the central mathematical object of the paper, as:

ξ^π(s, a, φ) = Σ_{k=0}^∞ γ^k p_π(φ_{t+k+1} = φ | s_t = s, a_t = a)    (9)

where p_π(φ_{t+k+1} = φ | s_t = s, a_t = a), or in short p_π(φ | s, a), is the probability density function of the features at time t+k+1, following policy π and conditioned on s and a being the state and action at time t, respectively. Note that p_π depends not only on the policy π but also on the state transition probabilities (constant throughout the paper). With the definition of the ξ-function, the Q-function rewrites:

Q^π(s, a) = ∫_φ R(φ) ξ^π(s, a, φ) dφ    (10)

Depending on the reward function R, there are several ξ-functions that correspond to the same Q-function. Formally, this is an equivalence relationship, and the quotient space has a one-to-one correspondence with the Q-function space.
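
With discrete features, the integral in (10) becomes a sum over the enumerated feature vectors; a small tabular sketch with hypothetical sizes and an illustrative reward table:

    import numpy as np

    n_states, n_actions, n_phi = 25, 4, 8          # n_phi: number of distinct feature vectors
    xi = np.zeros((n_states, n_actions, n_phi))    # xi^pi(s, a, phi)
    R = np.random.uniform(-1.0, 1.0, size=n_phi)   # general reward R(phi), one value per feature

    def q_from_xi(s, a):
        """Q^pi(s, a) = sum_phi R(phi) * xi^pi(s, a, phi), cf. Eq. (10)."""
        return float(xi[s, a] @ R)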

Proposition 1.

(Equivalence between ξ functions and Q) Let Ξ denote the set of ξ-functions ξ : S × A × Φ → ℝ. Let ∼ be defined as ξ_1 ∼ ξ_2 ⟺ ∫_φ R(φ) ξ_1(s, a, φ) dφ = ∫_φ R(φ) ξ_2(s, a, φ) dφ for all (s, a) ∈ S × A. Then, ∼ is an equivalence relationship, and there is a bijective correspondence between the quotient space Ξ/∼ and the space of Q-functions.

Corollary 1.

The bijection between Ξ/∼ and the space of Q-functions allows us to induce a norm on Ξ/∼ from the supremum norm on Q-functions, with which Ξ/∼ is a Banach space (since the space of Q-functions is Banach with ‖·‖_∞):

‖[ξ]‖ := sup_{(s,a)} | ∫_φ R(φ) ξ(s, a, φ) dφ |    (11)

Similar to the Bellman equation for the Q-function, we can define a Bellman operator for the ξ-function, denoted by T, as:

(T ξ)(s, a, φ) = p(φ_{t+1} = φ | s_t = s, a_t = a) + γ E_{s_{t+1}}[ ξ(s_{t+1}, a*, φ) ],   a* = argmax_{a'} ∫_{φ'} R(φ') ξ(s_{t+1}, a', φ') dφ'    (12)

As in the case of the Q-function, we can use T to construct a contractive operator:

Proposition 2.

(ξ-learning has a fixed point) The operator T is well-defined w.r.t. the equivalence ∼, and therefore induces an operator defined over Ξ/∼. This induced operator is contractive w.r.t. ‖·‖. Since Ξ/∼ is Banach, the induced operator has a unique fixed point, and iterating it starting anywhere converges to that point.

In other words, successive applications of the operator T converge towards the class of optimal ξ-functions, or equivalently to an optimal ξ-function defined up to an additive function κ satisfying ∫_φ R(φ) κ(s, a, φ) dφ = 0 for all (s, a) (i.e. κ ∼ 0).

While these two results establish (see Appendix A for the proofs) the theoretical links to standard Q-learning formulations, the operator defined in (12) is not usable in practice because of the expectation. In the next section, we define the optimisation iterate, prove its convergence, and provide two variants to perform the updates.

3.2 ξ-learning algorithms

In order to learn the ξ-function, we introduce the ξ-learning update operator, which is an off-policy temporal difference method analogous to Q-learning. Given a transition (s_t, a_t, φ_{t+1}, s_{t+1}), the ξ-learning update operator is defined as:

ξ_{k+1}(s_t, a_t, φ) ← (1 − α_k) ξ_k(s_t, a_t, φ) + α_k [ p(φ_{t+1} = φ | s_t, a_t) + γ ξ_k(s_{t+1}, a*, φ) ]   for all φ    (13)

where a* = argmax_{a'} ∫_{φ'} R(φ') ξ_k(s_{t+1}, a', φ') dφ'.

The following is one of the main results of the manuscript, stating the convergence of ξ-learning:

Theorem 1.

(Convergence of ξ-learning) For a sequence of state-action-feature triples (s_t, a_t, φ_{t+1}), consider the ξ-learning update given in (13). If the sequence of state-action-feature triples visits each state-action pair infinitely often, and if the learning rate α_k is an adapted sequence satisfying the Robbins-Monro conditions:

Σ_{k=0}^∞ α_k = ∞   and   Σ_{k=0}^∞ α_k^2 < ∞    (14)

then the sequence of ξ-function classes corresponding to the iterates converges to the optimal class [ξ*], which corresponds to the optimal Q-function to which standard Q-learning updates would converge:

∫_φ R(φ) ξ*(s, a, φ) dφ = Q*(s, a),   ∀(s, a) ∈ S × A    (15)

The proof is provided in Appendix A and follows the same flow as the proof for Q-learning.

The previous theorem provides convergence guarantees under the assumption that either the feature probability p(φ_{t+1} = φ | s_t, a_t) is known, or an unbiased estimate of it can be constructed. In the following, we propose two different ways to approximate this probability from a given transition so as to perform the ξ-update (13).

Model-Free (MF) ξ-learning:

The first instance of ξ-learning, which we call Model-Free (MF) ξ-learning, uses the same principle as standard model-free temporal difference learning methods. For a given transition, the update assumes that the probability of the observed feature φ_{t+1} is 1, whereas for all other features (φ ≠ φ_{t+1}) the probability is 0; see Appendix C for continuous features. The resulting updates are:

ξ_{k+1}(s_t, a_t, φ) ← (1 − α_k) ξ_k(s_t, a_t, φ) + α_k [ 1{φ = φ_{t+1}} + γ ξ_k(s_{t+1}, a*, φ) ]   for all φ    (16)

Due to the stochastic update of the ξ-function, and provided the learning rate decays over time, the ξ-update will learn the true probability of φ_{t+1}. A problematic point with the MF procedure is that it potentially induces a high variance when the true feature probabilities are not binary. To cope with this potentially negative effect, we propose a different variant.
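
A tabular sketch of the MF ξ-update: the observed next feature is treated as having probability one and all other features probability zero; the mapping of a feature vector to an index (phi_next_idx), the table sizes, and the reward table are illustrative assumptions:

    import numpy as np

    n_states, n_actions, n_phi = 25, 4, 8
    gamma, alpha = 0.95, 0.1
    xi = np.zeros((n_states, n_actions, n_phi))
    R = np.random.uniform(-1.0, 1.0, size=n_phi)   # reward of the current task, R(phi)

    def mf_xi_update(s, a, phi_next_idx, s_next):
        """Model-free xi-update, cf. Eq. (16)."""
        # greedy next action w.r.t. the current task, as in Eq. (13)
        a_star = int((xi[s_next] @ R).argmax())
        # one-hot "probability" of the observed feature
        p = np.zeros(n_phi)
        p[phi_next_idx] = 1.0
        target = p + gamma * xi[s_next, a_star]    # for every phi simultaneously
        xi[s, a] += alpha * (target - xi[s, a])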

One-Step SF Model (MB) ξ-learning:

We introduce a second ξ-learning procedure, called One-Step SF Model (MB) ξ-learning, that attempts to reduce the variance of the update. To do so, MB ξ-learning estimates the distribution over the successor features over time. Let p̂(φ | s, a) denote the current estimate of the feature distribution. Given a transition (s_t, a_t, φ_{t+1}), the model is updated according to:

p̂_{k+1}(φ | s_t, a_t) ← (1 − β_k) p̂_k(φ | s_t, a_t) + β_k 1{φ = φ_{t+1}}   for all φ    (17)

where β_k is the learning rate of the model. After updating the model p̂, it can be used in place of p(φ_{t+1} = φ | s_t, a_t) in the ξ-update defined in (13). Since the learned model p̂ is independent of the reward function and of the policy, it can be learned and used over all tasks.
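
The MB variant replaces the one-hot target of the MF update by the learned one-step feature model p̂(φ | s, a), updated as in (17) and then plugged into (13); a tabular sketch under the same illustrative assumptions as above:

    import numpy as np

    n_states, n_actions, n_phi = 25, 4, 8
    gamma, alpha, beta = 0.95, 0.1, 0.1        # beta: learning rate of the feature model
    xi = np.zeros((n_states, n_actions, n_phi))
    p_hat = np.full((n_states, n_actions, n_phi), 1.0 / n_phi)  # one-step feature model
    R = np.random.uniform(-1.0, 1.0, size=n_phi)

    def mb_xi_update(s, a, phi_next_idx, s_next):
        """One-step SF model (MB) xi-update, cf. Eqs. (17) and (13)."""
        # update the reward- and policy-independent feature model p_hat(phi | s, a)
        one_hot = np.zeros(n_phi)
        one_hot[phi_next_idx] = 1.0
        p_hat[s, a] += beta * (one_hot - p_hat[s, a])
        # xi-update (13) using the model estimate instead of the observed one-hot feature
        a_star = int((xi[s_next] @ R).argmax())
        target = p_hat[s, a] + gamma * xi[s_next, a_star]
        xi[s, a] += alpha * (target - xi[s, a])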

3.3 Meta ξ-learning

After discussing ξ-learning on a single task and showing its theoretical convergence, we can now investigate how it can be applied in transfer learning. Similar to the linear SF framework, the ξ-function allows a policy π_i learned for task M_i to be reevaluated in a new task M_j:

Q_j^{π_i}(s, a) = ∫_φ R_j(φ) ξ^{π_i}(s, a, φ) dφ    (18)

This allows us to apply the GPI procedure (6) to arbitrary reward functions, in a similar manner to what was proposed for linear reward functions in [Barreto et al., 2018]. We extend the GPI result to the ξ-learning framework as follows:

Theorem 2.

(Generalised policy improvement in ξ-learning) Let M = {M_1, …, M_n} be the set of tasks, each one associated to a (possibly different) weighting function R_i. Let ξ^{π_i^*} be a representative of the optimal class of ξ-functions for task M_i, and let ξ̃^{π_i} be an approximation to the optimal ξ-function with approximation error at most ε. Then, for another task M_{n+1} with weighting function R_{n+1}, the policy defined as:

π(s) ∈ argmax_a max_i ∫_φ R_{n+1}(φ) ξ̃^{π_i}(s, a, φ) dφ    (19)

satisfies:

Q_{n+1}^{π^*_{n+1}}(s, a) − Q_{n+1}^π(s, a) ≤ 2/(1 − γ) ( min_i ‖R_{n+1} − R_i‖_∞ + ε )    (20)

where ‖R_{n+1} − R_i‖_∞ = sup_φ |R_{n+1}(φ) − R_i(φ)|.

The proof is provided in Appendix A.
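
In code, GPI with ξ-functions mirrors the linear case, but the re-evaluation in (19) uses the new task's general reward function directly instead of a weight vector; a sketch assuming one tabular ξ-function per previously learned policy:

    import numpy as np

    def gpi_action_xi(xi_list, s, R_new):
        """GPI action selection with xi-functions, cf. Eq. (19).

        xi_list: list of arrays of shape (n_states, n_actions, n_phi),
                 one xi-function per previously learned policy pi_i.
        R_new:   array of shape (n_phi,), rewards R_new(phi) of the new task.
        """
        q_values = np.stack([xi[s] @ R_new for xi in xi_list])  # (n_policies, n_actions)
        return int(q_values.max(axis=0).argmax())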

4 Experiments

We evaluated ξ-learning in two environments. The first has discrete features. It is a modified version of the object collection task by Barreto et al. [2017]. We added features with higher complexity to it, allowing the use of general reward functions. See Appendix D.1 for experimental results in the original environment. The second environment, the racer environment, evaluates the agents in tasks with continuous features.

4.1 Discrete Features - Object Collection Environment

Environment:

The environment consists of 4 rooms (Fig. 1 - a). The agent starts an episode in position S and has to learn to reach the goal position G. During an episode, the agent can collect objects to gain further rewards. Each object has 2 properties: 1) color: orange or blue, and 2) form: box or triangle. The state space is a high-dimensional vector. It encodes the agent's position using a grid of two-dimensional Gaussian radial basis functions. Moreover, it includes a memory of which objects have already been collected. Agents can move in 4 directions. The features φ ∈ {0, 1}^5 are binary vectors. The first 2 dimensions encode whether an orange or a blue object was picked up. The 2 following dimensions encode the form. The last dimension encodes whether the agent reached goal G. For example, φ = (1, 0, 1, 0, 0) encodes that the agent picked up an orange box.

Tasks:

Each agent learns 300 tasks sequentially, which differ in their reward for collecting objects. We compared agents in two settings: tasks with either linear or general reward functions. For each linear task M_i, the rewards are defined by a linear combination of features and a weight vector w_i. The weights for the first 4 dimensions define the rewards for collecting an object with a specific property. They are randomly sampled from a uniform distribution. The final weight defines the reward for reaching the goal position, which is the same for each task. The general reward functions are sampled by assigning a different reward to each possible combination of object properties using uniform sampling, such that, for example, picking up an orange box might yield a different reward than picking up an orange triangle.

(a) Collection Environment (b) Tasks with Linear Reward Functions (c) Effect of Non-Linearity (d) Tasks with General Reward Functions

Figure 1: In the (a) object collection environment, ξ-learning reached the highest average reward per task for (b) linear and (d) general reward functions. The average over 10 runs per algorithm and the standard error of the mean are depicted. (c) The performance difference between ξ-learning and SFQL is stronger for general reward tasks with high non-linearity, i.e. where a linear reward model yields a high error. SFQL reaches only a fraction of MF ξ-learning's performance in tasks with a high mean linear reward model error.
Agents:

We compared ξ-learning to Q-learning (QL) and classical SF Q-learning (SFQL) [Barreto et al., 2017]. All agents use function approximation for their state-action functions (Q-, ψ-, or ξ-function). An independent linear mapping of the state is used to compute the values for each of the 4 actions. As the features are discrete, the ξ-function and the ξ-model are approximated by an independent mapping for each action and each possible feature φ. The Q-value for the ξ-agents (Eq. 10) is computed by summing over the possible features: Q(s, a) = Σ_φ R(φ) ξ(s, a, φ). The reward functions of each task are given to the ξ-agents. For SFQL, the sampled reward weights were given in tasks with linear reward functions. For general reward functions, a linear model approximating the rewards was learned for each task and its weights given to SFQL. Each task was executed for a fixed number of steps, and the average performance over 10 runs per algorithm was measured. We performed a grid search over the parameters of each agent, reporting here the performance of the parameters with the highest total reward over all tasks.

Results:

ξ-learning outperformed SFQL and QL for tasks with linear and general reward functions (Fig. 1 - b, d). MF showed a slight advantage over MB ξ-learning in both settings. We further studied the effect of the non-linearity of general reward functions on the performance of classical SF compared to ξ-learning by evaluating them in tasks with different levels of non-linearity. We sampled general reward functions that resulted in different levels of mean absolute model error when approximated with a linear reward model. We trained SFQL and MF ξ-learning in each of these conditions on 300 tasks and measured the ratio between the total return of SFQL and MF (Fig. 1 - c). The relative performance of SFQL compared to MF decreases with higher non-linearity of the reward functions. For reward functions that are nearly linear (low mean error), both have a similar performance, whereas for reward functions that are difficult to model with a linear relation (high mean error), SFQL reaches only a fraction of the performance of ξ-learning. This follows SFQL's theoretical limitation in (7) and shows the advantage of ξ-learning over SFQL in non-linear reward tasks.

4.2 Continuous Features - Racer Environment

Environment and Tasks:

We further evaluated the agents in an environment with continuous features (Fig. 2 - a). The agent is randomly placed in the environment and has to drive around for 200 timesteps before the episode ends. Similar to a car, the agent has an orientation and momentum, so that it can only drive straight or turn in a left or right curve. The agent reappears on the opposite side if it exits one side. The distances to 3 markers are provided as features φ ∈ ℝ^3. Rewards depend on the distances, where each feature dimension has 1 or 2 preferred distances defined by Gaussian functions. For each of the 65 tasks, the number of Gaussians and their properties (mean and standard deviation) are randomly sampled for each feature dimension. Fig. 2 (a) shows a reward function with dark areas depicting higher rewards. The agent has to learn to drive around in such a way as to maximize its trajectory over positions with high rewards. The state space is a high-dimensional vector encoding the agent's position and orientation. As before, the 2D position is encoded using a grid of two-dimensional Gaussian radial basis functions. Similarly, the orientation is also encoded using Gaussian radial basis functions.

Agents:

We introduce an MF ξ-agent for continuous features (CMF ξ) (Appendix C.2.1). CMF ξ discretizes each feature dimension d into bins with bin centers c_1, …, c_m. It learns for each dimension d and bin b the ξ-value ξ_d(s, a, c_b). Q-values (Eq. 10) are computed by summing over the dimensions and bins: Q(s, a) = Σ_d Σ_b R_d(c_b) ξ_d(s, a, c_b). SFQL received an approximated weight vector that was trained, before the task started, on several uniformly sampled features and rewards.
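
A sketch of how such a binned Q-value could be computed: ξ-values are kept per feature dimension and bin, and the Q-value sums the per-dimension contributions, assuming the reward is cumulative over dimensions as discussed in Section 5 (bin counts, distance ranges, and reward shapes are illustrative assumptions):

    import numpy as np

    n_states, n_actions = 100, 3
    n_dims, n_bins = 3, 21                        # 3 distance features, 21 bins each (illustrative)
    bin_centers = np.linspace(0.0, 1.0, n_bins)   # hypothetical normalized distances
    # one xi-table per feature dimension: xi_d(s, a, bin)
    xi = np.zeros((n_dims, n_states, n_actions, n_bins))

    def q_from_binned_xi(s, a, reward_fns):
        """Q(s, a) = sum_d sum_b R_d(c_b) * xi_d(s, a, c_b), cf. Eq. (10) with binning.

        reward_fns: list of n_dims callables, R_d(distance) -> reward contribution.
        """
        q = 0.0
        for d, R_d in enumerate(reward_fns):
            q += float(xi[d, s, a] @ R_d(bin_centers))
        return q

    # example: each dimension has one preferred distance defined by a Gaussian (illustrative)
    reward_fns = [lambda x, mu=mu: np.exp(-((x - mu) ** 2) / 0.02) for mu in (0.2, 0.5, 0.8)]
    print(q_from_binned_xi(0, 0, reward_fns))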

Results:

ξ-learning reached the highest performance of all agents (Fig. 2 - b). SFQL reaches only a low performance, below QL, because it is not able to approximate the general reward functions sufficiently well with its linear reward model. ξ-learning only slightly improves over QL, showing that SF&GPI transfer in this environment is less efficient than in the object collection environment (Fig. 1).

(a) Racer Environment (b) Tasks with General Reward Functions

Figure 2: (a) Example of a reward function for the racer environment based on the distances to its 3 markers. (b) ξ-learning reaches the highest average reward per task. SFQL yields a performance even below QL, as it is not able to model the reward function with its linear combination of weights and features. The average over 10 runs per agent and the standard error of the mean are depicted.

5 Discussion

ξ-learning in Tasks with General Reward Functions:

ξ-learning allows the dynamics of policies in the feature space of a task to be disentangled from the associated rewards, see (10). The experimental evaluations in tasks with general reward functions (Fig. 1 - d, and Fig. 2) show that ξ-learning can therefore successfully apply GPI to transfer knowledge from learned tasks to new ones. Given a general reward function, it can successfully re-evaluate learned policies for knowledge transfer. In contrast, classical SFQL based on a linear decomposition (3) cannot be directly applied to a general reward function. In this case, a linear approximation has to be learned, which shows inferior performance to ξ-learning, which directly uses the true reward function.

ξ-learning in Tasks with Linear Reward Functions:

ξ-learning also shows an increased performance over SFQL in environments with linear reward functions (Fig. 1 - b). This effect cannot be attributed to differences in their computation of the expected return of a policy, as both are correct. A possible explanation could be that ξ-learning reduces the complexity of the function approximation of the ξ-function compared to the ψ-function in SFQL.

Continuous Feature Spaces:

For tasks with continuous features (racer environment), ξ-learning successfully used a discretization of each feature dimension and learned the ξ-values independently for each dimension. This strategy is viable for reward functions that are cumulative over the feature dimensions: R(φ) = Σ_d R_d(φ_d). The Q-value can then be computed by summing over the independent dimensions d and the bins b: Q(s, a) = Σ_d Σ_b R_d(c_b) ξ_d(s, a, c_b). For more general reward functions, the space of all feature combinations would need to be discretized, which grows exponentially with each new dimension. As a solution, the ξ-function could be directly defined over the continuous feature space, but this yields some problems. First, the computation of the expected return requires an integral over features instead of a sum, which is a priori intractable. Second, the ξ-function would be defined over a continuum, increasing the difficulty of representing, training, and approximating it. Janner et al. [2020] and Touati and Ollivier [2021] propose methods that might allow a continuous ξ-function to be represented, but it is unclear if they converge and if they can be used for transfer learning.

Computational Complexity:

The improved performance of SFQL and ξ-learning over QL in the transfer learning setting comes at the cost of an increased computational complexity. The GPI procedure (6) of both approaches requires evaluating at each step the ψ-function or ξ-function over all previously experienced tasks in M. As a consequence, the computational complexity increases linearly with each new task that is added. A solution is to apply GPI only over a subset of learned policies. Nonetheless, how to optimally select this subset remains an open question.

6 Related work

Transfer Learning:

Transfer methods in RL can be generally categorized according to the type of tasks between which transfer is possible and the type of transferred knowledge [Taylor and Stone, 2009, Lazaric, 2012, Zhu et al., 2020]. In the case of SF&GPI, of which ξ-learning is part, tasks differ only in their reward functions. The type of knowledge that is transferred are the policies learned in source tasks, which are re-evaluated in the target task and recombined using the GPI procedure. A natural use case for ξ-learning are continual problems [Khetarpal et al., 2020] where an agent has to continually adapt to changing tasks, which are, in our setting, different reward functions.

Successor Features:

SF are based on the concept of successor representations [Dayan, 1993, Momennejad, 2020]. Successor representations predict the future occurrence of all states for a policy, in the same manner as SF do for features. Their application is restricted to low-dimensional state spaces using tabular representations. SF extended them to domains with high-dimensional state spaces [Kulkarni et al., 2016, Zhang et al., 2017, Barreto et al., 2017, 2018] by predicting the future occurrence of low-dimensional features that are relevant to define the return. Several extensions to the SF framework have been proposed. One direction aims to learn appropriate features from data, for example by optimally reconstructing rewards [Barreto et al., 2017], using the concept of mutual information [Hansen et al., 2019], or by grouping temporally similar states [Madjiheurem and Toni, 2019]. Another direction is the generalization of the ψ-function over policies [Borsa et al., 2018], analogous to universal value function approximation [Schaul et al., 2015]. Similar approaches use successor maps [Madarasz, 2019], goal-conditioned policies [Ma et al., 2020], or successor feature sets [Brantley et al., 2021]. Other directions include their application to POMDPs [Vértes and Sahani, 2019], the combination with max-entropy principles [Vertes, 2020], or hierarchical RL [Barreto et al., 2021]. In contrast to ξ-learning, all these approaches build on the assumption of linear reward functions, whereas ξ-learning allows the SF&GPI framework to be used with general reward functions. Nonetheless, most of the extensions for linear SF can be combined with ξ-learning.

Model-based RL:

SF represent the dynamics of a policy in the feature space decoupled from the rewards, allowing them to be reevaluated under different reward functions. They therefore share similar properties with model-based RL [Lehnert and Littman, 2019]. In general, model-based RL methods learn a one-step model of the environment dynamics p(s_{t+1} | s_t, a_t). Given a policy and an arbitrary reward function, rollouts can be performed using the learned model to evaluate the return. In practice, the rollouts have a high variance for long-term predictions, rendering them ineffective. Recently, Janner et al. [2020] proposed the γ-model framework that learns to represent the discounted occupancy of future states in continuous domains. Nonetheless, its application to transfer learning is not discussed and no convergence is proven, as it is for ξ-learning. The same holds for the forward-backward MDP representation proposed in Touati and Ollivier [2021]. Tang et al. [2021] also propose to decouple the dynamics in the state space from the rewards, but learn an internal representation of the rewards. This does not allow a policy to be reevaluated for a new reward function without relearning the mapping.

7 Conclusion

The introduced ξ-learning framework learns the expected cumulative discounted probability of successor features, which disentangles the dynamics of a policy in the feature space of a task from the expected rewards. This allows ξ-learning to reevaluate the expected return of learned policies for general reward functions and to use it for transfer learning utilizing GPI. We proved that ξ-learning converges to the optimal policy, and showed experimentally its improved performance over Q-learning and the classical SF framework for tasks with linear and general reward functions.

Ethics Statement

ξ-learning and its associated optimization algorithms represent general RL procedures similar to Q-learning. Their potential negative societal impact depends on their application domains, which range over all possible societal areas in a similar manner as for other general RL procedures.

Beyond the topic of the paper, we did our best to cite the relevant literature and to fairly compare with previous ideas, concepts and methods. To that aim, all agents are trained and evaluated within the same software environment, and under the very same experimental settings.

Reproducibility Statement

In order to ensure a high chance of reproducibility, we provide extensive details of the method and experiments associated with the paper. In particular, we have provided the proofs for all mathematical results announced in the main paper (see Appendix A). These constitute the theoretical foundation of the proposed ξ-learning methodology. Secondly, we have provided all experimental details (methods and environments) required for reproducing our experiments, namely in Appendix B for the object collection environment and Appendix C for the racer environment. In addition, we provide additional results in Appendix D to further illustrate the interest of the proposed method. Finally, we provide an anonymous link to the source code, so that reviewers can run it if necessary.

References

Appendix A Theoretical Proofs

A.1 Proof of Proposition 1

Let us start by recalling the original statement in the main paper.

Proposition 1.

(Equivalence between ξ functions and Q) Let Ξ denote the set of ξ-functions ξ : S × A × Φ → ℝ. Let ∼ be defined as ξ_1 ∼ ξ_2 ⟺ ∫_φ R(φ) ξ_1(s, a, φ) dφ = ∫_φ R(φ) ξ_2(s, a, φ) dφ for all (s, a) ∈ S × A. Then, ∼ is an equivalence relationship, and there is a bijective correspondence between the quotient space Ξ/∼ and the space of Q-functions.

Proof.

We prove the statements sequentially.

∼ is an equivalence relationship:

To prove this we need to demonstrate that ∼ is reflexive, symmetric, and transitive. The three are quite straightforward, since ξ ∼ ξ, ξ_1 ∼ ξ_2 ⟺ ξ_2 ∼ ξ_1, and (ξ_1 ∼ ξ_2 and ξ_2 ∼ ξ_3) ⟹ ξ_1 ∼ ξ_3 all follow directly from the defining equality of integrals.

Bijective correspondence:

To prove the bijectivity, we will first prove that the correspondence is injective, then surjective. Regarding the injectivity, we prove it by contrapositive:

[ξ_1] ≠ [ξ_2]  ⟹  ∃(s, a) : ∫_φ R(φ) ξ_1(s, a, φ) dφ ≠ ∫_φ R(φ) ξ_2(s, a, φ) dφ  ⟹  Q_{ξ_1} ≠ Q_{ξ_2}    (21)

In order to prove the surjectivity, we start from a Q-function Q and select an arbitrary feature φ_0; then the following function:

(22)

satisfies that it belongs to Ξ and that its corresponding Q-function is Q. We conclude that there is a bijective correspondence between the elements of Ξ/∼ and the space of Q-functions. ∎

A.2 Proof of Corollary 1

Let us recall the result:

Corollary 1.

The bijection between Ξ/∼ and the space of Q-functions allows us to induce a norm on Ξ/∼ from the supremum norm on Q-functions, with which Ξ/∼ is a Banach space (since the space of Q-functions is Banach with ‖·‖_∞):

‖[ξ]‖ := sup_{(s,a)} | ∫_φ R(φ) ξ(s, a, φ) dφ |    (23)
Proof.

The norm on the quotient space Ξ/∼ is induced from its correspondence with the space of Q-functions and is naturally defined as in the previous equation. The norm is well defined since it does not depend on the class representative. Therefore, all the metric properties are transferred, and Ξ/∼ is immediately Banach with the norm ‖·‖. ∎

A.3 Proof of Proposition 2

Let us restate the result:

Proposition 2.

(ξ-learning has a fixed point) The operator T is well-defined w.r.t. the equivalence ∼, and therefore induces an operator defined over Ξ/∼. This induced operator is contractive w.r.t. ‖·‖. Since Ξ/∼ is Banach, the induced operator has a unique fixed point, and iterating it starting anywhere converges to that point.

Proof.

We prove the statements above one by one:

The operator T is well defined:

Let us first recall the definition of the operator T in (12), where we drop part of the notation for simplicity:

Let ξ_1 and ξ_2 be two different representatives of the class [ξ]; we can write:

(24)

because ξ_1 ∼ ξ_2. Therefore the operator is well defined on the quotient space, since the image of a class does not depend on the function chosen to represent the class.

The induced operator is contractive:

The contractiveness can be proven directly:

(25)

The contractiveness can also be understood as being inherited from the standard Bellman operator on Q-functions. Indeed, given a ξ-function, one can easily see that applying the standard Bellman operator to the Q-function corresponding to ξ leads to the Q-function corresponding to T ξ.

Fixed point of the induced operator:

To conclude the proof, we use the fact that any contractive operator on a Banach space, in our case (Ξ/∼, ‖·‖), has a unique fixed point, and that for any starting point, the sequence of iterates converges to that fixed point w.r.t. the corresponding norm ‖·‖. ∎

A.4 Proof of Theorem 1

These two propositions will be useful to prove that the ξ-learning iterates converge in Ξ/∼. Let us restate the definition of the update operator from (13):

and the theoretical result:

Theorem 1.

(Convergence of ξ-learning) For a sequence of state-action-feature triples (s_t, a_t, φ_{t+1}), consider the ξ-learning update given in (13). If the sequence of state-action-feature triples visits each state-action pair infinitely often, and if the learning rate α_k is an adapted sequence satisfying the Robbins-Monro conditions:

Σ_{k=0}^∞ α_k = ∞   and   Σ_{k=0}^∞ α_k^2 < ∞    (26)

then the sequence of ξ-function classes corresponding to the iterates converges to the optimal class [ξ*], which corresponds to the optimal Q-function to which standard Q-learning updates would converge:

∫_φ R(φ) ξ*(s, a, φ) dφ = Q*(s, a),   ∀(s, a) ∈ S × A    (27)
Proof.

The proof re-uses the flow of the proof used for Q-learning [Tsitsiklis, 1994]. Indeed, we rewrite the operator above as:

with the stochastic noise term defined as:

Obviously, the noise term has zero conditional mean, which, together with the contractiveness of the operator, is sufficient to demonstrate the convergence of the iterative procedure as done for Q-learning. In our case, the optimal ξ-function is defined up to an additive kernel function κ with κ ∼ 0. The correspondence with the optimal Q-learning function is a direct application of the correspondence between the ξ- and Q-learning problems. ∎

A.5 Proof of Theorem 2

Let us restate the result.

Theorem 2.

(Generalised policy improvement in ξ-learning) Let M = {M_1, …, M_n} be the set of tasks, each one associated to a (possibly different) weighting function R_i. Let ξ^{π_i^*} be a representative of the optimal class of ξ-functions for task M_i, and let ξ̃^{π_i} be an approximation to the optimal ξ-function with approximation error at most ε. Then, for another task M_{n+1} with weighting function R_{n+1}, the policy defined as:

π(s) ∈ argmax_a max_i ∫_φ R_{n+1}(φ) ξ̃^{π_i}(s, a, φ) dφ    (28)

satisfies:

Q_{n+1}^{π^*_{n+1}}(s, a) − Q_{n+1}^π(s, a) ≤ 2/(1 − γ) ( min_i ‖R_{n+1} − R_i‖_∞ + ε )    (29)

where ‖R_{n+1} − R_i‖_∞ = sup_φ |R_{n+1}(φ) − R_i(φ)|.

Proof.

The proof is stated in two steps. First, we exploit the proof of Proposition 1 of Barreto et al. [2017], and in particular their Eq. (13), which states:

(30)

where