Lifelong Reinforcement Learning (RL) is an online problem where an agent faces a series of RL tasks, drawn sequentially. Transferring knowledge from prior experience while solving new tasks is a key question in that setting (lazaric2012transfer; taylor2009transfer). We elaborate on the intuitive idea that similar tasks should allow a large amount of transfer. An agent able to compute online a similarity measure between source tasks and the current target task should be able to perform transfer accordingly. By measuring the amount of inter-task similarity, we design a novel method for value transfer, practically deployable in the online Lifelong RL setting. Specifically, we introduce a metric between MDPs and prove that the optimal Q-value function is Lipschitz continuous with respect to the MDP space. This property makes it possible to compute a provable upper bound on the optimal value function of an unknown target task, given the learned optimal value function of a source task. Knowing this upper bound accelerates the convergence of an RMax-like algorithm (brafman2002r), which relies on an optimistic estimate of the optimal Q-value function. Overall, the proposed transfer method consists in computing online the distance between source and target tasks, deducing from the source task an upper bound on the optimal Q-value function of the target task, and using this bound to accelerate learning. Importantly, this transfer is non-negative (it cannot cause performance degradation) as the computed upper bound provably does not underestimate the optimal Q-value function.
Our contributions are as follows. First, we study theoretically the Lipschitz continuity of the optimal Q-value function in the task space by introducing a metric between MDPs (Section 3). Then, we use this continuity property to propose a value-transfer method based on a local distance between MDPs (Section 4). Full knowledge of both MDPs is not required and the transfer is non-negative, which makes the method both practical in an online setting and safe. In Section 4.2, we build a PAC-MDP algorithm called Lipschitz RMax, applying this transfer method in the online Lifelong RL setting. We provide sample and computational complexity bounds and showcase the algorithm in Lifelong RL experiments (Section 5).
2 Background and related work
Reinforcement Learning (RL) (sutton1998introduction) is a framework for sequential decision making. The problem is typically modeled as a Markov Decision Process (MDP) (puterman2014markov) consisting of a 4-tuple $\langle S, A, R, T \rangle$, where $S$ is a state space, $A$ an action space, $R_s^a$ is the expected reward of taking action $a$ in state $s$ and $T_{ss'}^a$ is the transition probability of reaching state $s'$ when taking action $a$ in state $s$. Without loss of generality, we assume $R_s^a \in [0, 1]$. Given a discount factor $\gamma \in [0, 1)$, the expected cumulative return $\sum_t \gamma^t R_{s_t}^{a_t}$ obtained along a trajectory starting with state $s$ and action $a$ using policy $\pi$ in MDP $M$ is noted $Q^\pi_M(s, a)$ and called the Q-function. The optimal Q-function $Q^*_M$ is the highest attainable expected return from $(s, a)$ and $V^*_M(s) = \max_{a} Q^*_M(s, a)$ is the optimal value function in $s$. Notice that $R_s^a \leq 1$ implies $Q^*_M(s, a) \leq \frac{1}{1 - \gamma}$ for all $(s, a)$. This maximum upper bound is used by the RMax algorithm as an optimistic initialization of the learned Q-function. A key point to reduce the sample complexity of this algorithm is to benefit from a tighter upper bound, which is the purpose of our transfer method.
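The maximum upper bound mentioned above can be checked numerically. The following is a minimal sketch, assuming rewards lie in [0, 1] and a tabular MDP stored as NumPy arrays; the helper names `optimal_q` and `random_mdp` are hypothetical and introduced only for illustration.

```python
import numpy as np

def optimal_q(R, T, gamma, n_iter=2000):
    """Plain value iteration.

    R: (S, A) array of expected rewards, assumed to lie in [0, 1].
    T: (S, A, S) array of transition probabilities.
    """
    Q = np.zeros_like(R)
    for _ in range(n_iter):
        V = Q.max(axis=1)        # greedy state values
        Q = R + gamma * (T @ V)  # Bellman optimality backup
    return Q

def random_mdp(n_states, n_actions, rng):
    """Draw a random tabular MDP with rewards in [0, 1] (illustration only)."""
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
    T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
    return R, T

gamma = 0.9
v_max = 1.0 / (1.0 - gamma)  # optimistic initialization value
R, T = random_mdp(5, 3, np.random.default_rng(0))
Q_star = optimal_q(R, T, gamma)
print(bool((Q_star <= v_max + 1e-9).all()))
```

The bound is attained only when every reward equals 1, which is why a tighter, task-specific upper bound leaves room for faster learning.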
Lifelong RL (silver2013lifelong; brunskill2014pac) is the problem of experiencing online a series of MDPs drawn from an unknown distribution. Each time an MDP is sampled, a classical RL problem takes place where the agent is able to interact with the environment to maximize its expected return. In this setting, it is reasonable to think that knowledge gained on previous MDPs could be re-used to improve the performance in new MDPs. In this paper, we provide a novel method for such transfer by characterizing the way the optimal Q-function can evolve across tasks. As commonly done (wilson2007multi; brunskill2014pac; abel2018policy) we restrict the scope of the study to the case where sampled MDPs share the same state-action space . For brevity, we will refer indifferently to MDPs, models or tasks, and write them .
Using a metric between MDPs has the appealing characteristic of quantifying the amount of similarity between tasks, which intuitively should be linked to the amount of transfer achievable. song2016measuring define a metric based on the bi-simulation metric introduced by ferns2004metrics and the Wasserstein metric (villani2008optimal). Value transfer is performed between states with low bi-simulation distances. However, this metric requires knowing both MDPs completely and is thus unusable in the Lifelong RL setting where we expect to perform transfer before having learned the current MDP. Further, the transfer technique they propose does allow negative transfer (see Appendix, Section LABEL:sec:example-negative-transfer). carroll2005task also define a value-transfer method based on a measure of similarity between tasks. However, this measure is not computable online and thus not applicable to the Lifelong RL setting. mahmud2013clustering and brunskill2013sample propose MDP clustering methods respectively using a metric quantifying the regret of running the optimal policy of one MDP in the other MDP and the norm between the MDP models. An advantage of clustering is to prune the set of possible source tasks. They use their approach for policy transfer, which differs from the value-transfer method proposed in this paper. ammar2014automated learn the model of a source MDP and view the prediction error on a target MDP as a dissimilarity measure in the task space. Their method makes use of samples from both tasks and is not readily applicable to the online setting considered in this paper. lazaric2008transfer provide a practical method for sample transfer, computing a similarity metric reflecting the probability of the models to be identical. Their approach is applicable in a batch RL setting as opposed to the online setting considered in this paper. 
The approach developed by sorg2009transfer is very similar to ours in the sense that they prove bounds on the optimal Q-function for new tasks, assuming that both MDPs are known and that a soft homomorphism exists between the state spaces. brunskill2013sample also provide a method that can be used for Q-function bounding in multi-task RL.
abel2018policy present the MaxQInit algorithm, providing transferable bounds on the Q-function with high probability while preserving PAC-MDP guarantees (strehl2009reinforcement). Given a set of solved tasks, they derive the probability that the maximum over the Q-values of previous MDPs is an upper bound on the current task’s optimal Q-function. This results in a method for non-negative transfer with high probability once enough tasks have been sampled. The method developed by abel2018policy is similar to ours on two fundamental points: first, a theoretical upper bound on optimal Q-values across the MDP space is built; second, this provable upper bound is used to transfer knowledge between MDPs by replacing the maximum bound in an RMax-like algorithm, providing PAC guarantees. The difference between the two approaches is illustrated in Figure 1, where the MaxQInit bound is the one developed by abel2018policy and the Lipschitz bound is the one we present in this paper. In this figure, the essence of the Lipschitz bound is noticeable. It stems from the fact that the optimal Q-value function is locally Lipschitz continuous in the MDP space w.r.t. a specific metric. Confirming the intuition, MDPs that are close w.r.t. this metric have close optimal Q-values. Note that neither bound is uniformly better than the other, as suggested by Figure 1. Hence, combining all the bounds results in a tighter upper bound, as we will illustrate in experiments (Section 5). We first carry out the theoretical characterization of the Lipschitz continuity properties in the following section. Then, we build on this result to propose a practical transfer method for the online Lifelong RL setting.
3 Lipschitz continuity of Q-functions
The intuition we build on is that similar MDPs should have similar optimal Q-functions. Formally, this insight can be translated into a continuity property of the optimal Q-functions over the MDP space . The remainder of this section mathematically formalizes this intuition that will be used in the next Section to derive a practical method for value transfer. To that end, we introduce a local pseudo-metric characterizing the distance between the models of two MDPs at a particular state-action pair. A reminder and a detailed discussion on the metrics (and related objects) used herein can be found in the Appendix, Section LABEL:sec:app:local-mdps-distance-discussion. Given two tasks , , and a function , we define the pseudo-metric between models at w.r.t. as:
This pseudo-metric is relative to a positive function . We implicitly cast this definition in the context of discrete state spaces. The extension to continuous spaces is straightforward but beyond the scope of this paper. [Local pseudo-Lipschitz continuity] For two MDPs , for all ,
with the MDPs local pseudo-metric and the local MDP dissimilarity is the unique solution to the following fixed-point equation for :
All the proofs of the paper can be found in the Appendix. This result establishes that the distance between the optimal Q-functions of two MDPs at is controlled by a local dissimilarity between the MDPs. The latter follows a fixed-point equation (Equation 3), which can be solved by Dynamic Programming (DP) (bellman1957dynamic). Note that, although the local MDP dissimilarity is asymmetric, is a pseudo-metric, hence the name pseudo-Lipschitz continuity. Similar results for the value function of a fixed policy and the optimal value function can easily be derived (Appendix, Section LABEL:sec:app:other-results). Overall, the optimal Q-functions of two close MDPs, in the sense of Equation 1, are themselves close to each other. Borrowing the notations of Proposition 1, given that is known, the function
can be used as an upper bound on with an unknown MDP. This is the basis on which we construct a computable and transferable upper bound in Section 4. A consequence of Proposition 1 is a global pseudo-Lipschitz continuity: [Global pseudo-Lipschitz continuity] For two MDPs , , for all ,
is easier to compute. Indeed, can be computed once and for all, contrarily to that needs to be evaluated for all . However, we do not use this result for transfer because it is impractical to compute online. Indeed, estimating the maximum in the definition of might be as hard as solving both MDPs, which, when it happens, is too late for transfer to be useful.
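The local continuity result of this section can be illustrated numerically. The sketch below rests on explicit assumptions, since the displayed equations are not reproduced here: the model pseudo-metric is taken to be the weighted form |R - R̄| + Σ f(s') |T - T̄| with f = γ V* of the second task, and the dissimilarity is taken to be the fixed point of d = D + γ Σ T max d. These forms, and the function names, are illustrative choices that satisfy the stated property; they are not confirmed to be the exact definitions of Equations 1 and 3.

```python
import numpy as np

def optimal_q(R, T, gamma, n_iter=2000):
    """Value iteration on a tabular MDP (R: (S, A), T: (S, A, S))."""
    Q = np.zeros_like(R)
    for _ in range(n_iter):
        Q = R + gamma * (T @ Q.max(axis=1))
    return Q

def model_distance(R, T, R_bar, T_bar, f):
    """Assumed pseudo-metric: |R - R_bar| + sum_s' f(s') |T - T_bar|."""
    return np.abs(R - R_bar) + np.abs(T - T_bar) @ f

def local_dissimilarity(R, T, R_bar, T_bar, gamma, n_iter=2000):
    """Assumed fixed point d = D + gamma * sum_s' T[s, a, s'] * max_a' d[s', a'],
    solved by dynamic programming."""
    V_bar = optimal_q(R_bar, T_bar, gamma, n_iter).max(axis=1)
    D = model_distance(R, T, R_bar, T_bar, gamma * V_bar)
    d = np.zeros_like(D)
    for _ in range(n_iter):
        d = D + gamma * (T @ d.max(axis=1))
    return d

def lipschitz_upper_bound(R, T, R_bar, T_bar, gamma):
    """Upper bound on Q* of (R, T) built from Q* of (R_bar, T_bar)."""
    delta = np.minimum(
        local_dissimilarity(R, T, R_bar, T_bar, gamma),
        local_dissimilarity(R_bar, T_bar, R, T, gamma),
    )
    return optimal_q(R_bar, T_bar, gamma) + delta
```

Under these assumptions, `lipschitz_upper_bound` never falls below the target task's optimal Q-function, and it collapses to that Q-function when the two tasks are identical. Note that this version uses the true models of both tasks, which is exactly what is unavailable online.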
4 Transfer using the Lipschitz continuity
A purpose of value transfer, when interacting online with a new MDP, is to initialize the value function and drive the exploration to accelerate learning. We aim to exploit value transfer in a method guaranteeing three conditions:
C1. the resulting algorithm is PAC-MDP;
C2. the transfer accelerates learning;
C3. the transfer is non-negative.
From Proposition 1, one can naturally define a local upper bound on the optimal Q-function of an MDP given the optimal Q-function of another MDP.
Given two tasks and , for all , the Lipschitz upper bound on induced by is defined as with:
The optimism in the face of uncertainty principle leads to consider that the long-term expected return from any state is the maximum return, unless proven otherwise. The RMax algorithm (brafman2002r) in particular explores an MDP so as to shrink this upper bound. RMax is a model-based, online RL algorithm with PAC-MDP guarantees (strehl2009reinforcement), which means that convergence to a near-optimal policy is guaranteed in a polynomial number of steps with high probability. It relies on an optimistic model initialization that yields an optimistic upper bound on the optimal Q-function, then acts greedily w.r.t. this bound. By default, it takes the maximum value, but any tighter upper bound is admissible. Thus, shrinking this bound with Equation 6 is expected to improve the learning speed or sample complexity for new tasks in Lifelong RL.
In , during the resolution of a task , is split into a subset of known state-action pairs and its complement of unknown pairs. A state-action pair is known if the number of collected reward and transition samples allows estimating an -accurate model in -norm with probability higher than . We refer to and as the precision parameters. This translates into a threshold on the number of visits to a pair that are necessary to reach this precision. Given the experience of a set of MDPs , we define the total bound as the minimum over all the Lipschitz bounds induced by each previous MDP. Given a partially known task , the set of known state-action pairs , and the set of Lipschitz bounds on induced by previous tasks , the function defined below is an upper bound on for all .
with . Traditionally in , Equation 7 is solved to a precision via Value Iteration. This yields a function
that is a valid heuristic (provable upper bound on) for the exploration of MDP .
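The computation described above can be sketched as an RMax-style value iteration in which known pairs are backed up through the learned model and unknown pairs are held at the total bound. This is a minimal illustration under stated assumptions: tabular arrays, rewards in [0, 1], and a fixed iteration count standing in for the precision parameter. The names `total_bound` and `optimistic_q` are hypothetical.

```python
import numpy as np

def total_bound(lipschitz_bounds, gamma, shape):
    """Pointwise minimum of 1 / (1 - gamma) and every transferred bound."""
    U = np.full(shape, 1.0 / (1.0 - gamma))
    for B in lipschitz_bounds:
        U = np.minimum(U, B)
    return U

def optimistic_q(R_hat, T_hat, known, U, gamma, n_iter=2000):
    """Value iteration where unknown pairs are pinned to the bound U.

    R_hat, T_hat: learned model, only trusted where `known` is True.
    known: (S, A) boolean mask of known state-action pairs.
    U: (S, A) upper bound used on unknown pairs.
    """
    Q = U.copy()
    for _ in range(n_iter):
        backup = R_hat + gamma * (T_hat @ Q.max(axis=1))
        Q = np.where(known, backup, U)
    return Q
```

With no previous task, `total_bound([], gamma, shape)` reduces to the default maximum value, so the sketch falls back to the vanilla optimistic initialization. Any transferred bound that is valid can only lower the resulting heuristic, never push it below the optimal Q-function.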
4.1 A tractable upper bound on
The key issue addressed in this Section is how to actually compute . Consider two tasks and , on which vanilla has been applied, yielding the respective sets of known state-action pairs and , along with the learned models and , and the upper bounds and respectively on and . Notice that, if , then for all . Conversely, if , is an -accurate estimate of in -norm with high probability. Equation 7 allows the transfer of knowledge from to if can be computed. Unfortunately, the true optimal value functions, transition and reward models, necessary to compute , are unknown (see Equation 6). Thus, we propose to compute a looser upper bound based on the learned models and value functions. First, we provide an upper bound on the pseudo metric between models and .
Given two tasks , and respectively , the subsets of where their models are known with accuracy in -norm with probability at least ,
with , the upper bound on the pseudo-metric between models defined as follows:
This upper bound on the distance between MDPs can be calculated analytically (see Appendix, Section LABEL:sec:app:analytical-calculation-dmodelhat). The magnitude of the term is controlled by . In the case where no information is available on the maximum value of , we have that . measures the accuracy with which the tasks are known: the smaller , the tighter the bound. Note that is used as an upper bound on the true . In many cases, ; e.g. for stochastic shortest path problems, which feature rewards only upon reaching terminal states, we have that and thus is a tighter bound for transfer. Combining and Equation 3, one can derive an upper bound on , detailed in Proposition 4.1. Given two tasks and , the set of state-action pairs for which is known with accuracy in -norm with probability at least . If , the solution of the following fixed-point equation on is an upper bound on with probability at least :
Similarly as in Proposition 4.1, the condition illustrates the fact that for a large return horizon (large ), a high accuracy (small ) is needed for the bound to be computable. Eventually, a computable upper bound on given with high probability is given by
The associated upper bound on (Equation 7) given the set of previous tasks is defined by
This upper bound can be used to transfer knowledge from a partially solved task to a target task. If for some pairs, then the convergence rate can be improved. As complete knowledge of both tasks is not needed, it can be applied online in a Lifelong RL setting. In the next section, we describe an algorithm that leverages this value-transfer method.
In Lifelong RL, MDPs are encountered sequentially. Applying RMax to task yields the set of known state-action pairs , the learned models and , and the upper bound on . Saving this information when the task changes makes it possible to compute the upper bound of Equation 11 for the new task, and to use it to shrink the optimistic heuristic of RMax. This effectively transfers value functions between tasks based on task similarity. As the new task is explored online, the task similarity is progressively assessed with better confidence, refining the values of , and eventually , allowing for more efficient transfer where the task similarity is appraised. The resulting algorithm, Lipschitz RMax (LRMax), is presented in Algorithm LABEL:alg:lrmax. To avoid ambiguities with , we use to store learned features (, , , ) about previous MDPs. In a nutshell, the behavior of LRMax on a given task is precisely that of RMax, but with a tighter admissible heuristic that becomes better as the new task is explored (while this heuristic remains constant in vanilla RMax). LRMax is PAC-MDP (Condition C1) as stated in Propositions 4.2 and 4.2 below. With and , the sample complexity of vanilla RMax is , which is improved by LRMax in Proposition 4.2 and meets Condition C2. Finally is a proved upper bound with high probability on , which avoids negative transfer and meets Condition C3. [Sample complexity (strehl2009reinforcement)] With probability , the greedy policy w.r.t. computed by LRMax achieves an -optimal return in MDP after
samples (when logarithmic factors are ignored), with defined in Equation 11 a non-static, decreasing quantity, upper bounded by . Consequently from Proposition 4.2, the sample complexity of is no worse than that of . [Computational complexity] The total computational complexity of Lipschitz is
with the number of interaction steps, the precision of value iteration and the number of tasks.
4.3 Refining bounds with maximum model distance
relies on upper bounds on the local distances between tasks (Equation 9). The quality of the Lipschitz bound on greatly depends on the quality of those estimates and can be improved accordingly. We discuss two methods to provide finer estimates.
First, from the definition of , it is easy to show that model pseudo-distances are always upper bounded by . However, in practice, the tasks experienced in Lifelong RL might not cover the full span of possible MDPs and may be systematically closer to each other than . For instance, the distance between two games in the Arcade Learning Environment (ALE) (bellemare2013ale), is smaller than the maximum distance between any two MDPs defined on the common state-action space of the ALE (extended discussion in Appendix, Section LABEL:sec:app:dmax). Let be the maximum model distance at a particular pair. Prior knowledge might indicate a smaller upper bound for than . We will note such an upper bound . Solving Equation 9 boils down to accumulating values in . Reducing a estimate in a single pair actually reduces in all pairs. Thus, replacing in Equation 9 by , provides a smaller upper bound on , and thus a smaller which allows transfer if it is lesser than . Consequently, such an upper bound can make a difference between successful and unsuccessful transfer, even if its value is of little importance. Conversely, setting a value for quantifies the distance between MDPs where transfer is efficient.
Furthermore, one can estimate online the value of , lifting the previous hypothesis of available prior knowledge. One can build an empirical estimate of the maximum model distance at : , being the set of explored tasks. The pitfall being that, with few explored tasks, could underestimate . Proposition 4.3 provides a lower bound on the probability that does not underestimate , depending on the number of sampled tasks. In turn this indicates when upper bounds with high probability, which can be combined with Algorithm LABEL:alg:lrmax to improve the performance. Consider an algorithm producing -accurate in -norm model estimates with probability at least for a subset of after interacting with an MDP. For all , after sampling tasks with , the following lower bound holds:
The assumption of a lower bound on the sampling probability of a task implies that is finite and is commonly seen as a non-adversarial task sampling strategy (abel2018policy).
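The two refinements discussed in this section, capping local distances with a known maximum and estimating that maximum from explored tasks, can be sketched as follows. The weighted form of the model distance and the use of the second task's value function as the weight are assumptions made for illustration, and all names are hypothetical. As noted above, the empirical estimate may underestimate the true maximum when few tasks have been explored, so this sketch alone gives no guarantee.

```python
import numpy as np
from itertools import permutations

def model_distance(R, T, R_bar, T_bar, f):
    """Assumed pseudo-metric: |R - R_bar| + sum_s' f(s') |T - T_bar|."""
    return np.abs(R - R_bar) + np.abs(T - T_bar) @ f

def empirical_max_distance(tasks, gamma):
    """Largest pairwise model distance observed at each state-action pair.

    tasks: list of (R, T, V) tuples, one per explored task, where
    R is (S, A), T is (S, A, S) and V is an (S,) value estimate.
    """
    d_max = np.zeros_like(tasks[0][0])
    for (R, T, _), (R_bar, T_bar, V_bar) in permutations(tasks, 2):
        d = model_distance(R, T, R_bar, T_bar, gamma * V_bar)
        d_max = np.maximum(d_max, d)
    return d_max

def capped_distance(d_hat, d_max):
    """Replace a local distance estimate by the tighter of the two bounds."""
    return np.minimum(d_hat, d_max)
```

Because the estimate is a running maximum over ordered task pairs, adding a newly explored task can only raise it, which matches the intuition that confidence in the estimate grows with the number of sampled tasks.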
5 Experiments
The experiments reported here (code available at https://github.com/SuReLI/llrl) illustrate how 1) LRMax allows for early performance increase in Lifelong RL by efficiently transferring knowledge between tasks; 2) the Lipschitz bound of Equation 10 improves the sample complexity compared to RMax by providing a tighter upper bound on the optimal Q-function. The reproducibility checklist (mlreproducibility2019) is documented in the Appendix, Section LABEL:sec:app:reproducibility-checklist.
We evaluate different variants of in a Lifelong RL experiment. The algorithm will be used as a no-transfer baseline. () denotes Algorithm LABEL:alg:lrmax with prior . denotes the MaxQInit algorithm from abel2018policy, consisting in a state-of-the art PAC-MDP algorithm achieving transfer with PAC guarantees. Both and algorithms achieve value transfer by providing a tighter upper bound on than . Computing both upper-bounds and taking the minimum results in combining the two approaches. We include such a combination in our study with the algorithm. Similarly, () consists in the latter algorithm, benefiting from prior knowledge .
The environment we used in all experiments is a variant of the “tight” environment used by abel2018policy. This is a grid-world, the initial state is in the centre, actions are the cardinal moves (Appendix, Section LABEL:sec:app:tight). The reward is zero everywhere except for the three goal cells in the upper-right corner. Each time a task is sampled, a new reward value is drawn from for each of the three goal cells and a probability of slipping (performing a different action than the one selected) is picked in . Hence, tasks have different reward and transition functions. We sample 15 tasks in sequence among a pool of 5 possible different sampled tasks. Each is run for 2000 episodes of length 10. The operation is repeated 10 times to provide narrow confidence intervals. We used , and (discussion in Appendix, Section LABEL:sec:app:nknown). We drew tasks from a finite set of five MDPs. This allows the application of and the subsequent comparison below. Note, however, that does not require the set of MDPs to be finite, which is a noticeable advantage in applicability. Other lifelong RL experiments are reported in the Appendix, Section LABEL:sec:app:additional-experiments.
The results are reported in Figure 2. Figure 1(a) displays the discounted return for each task, averaged across episodes. Similarly, Figure 1(b) displays the discounted return for each episode, averaged across tasks (same color code as Figure 1(a)). Figure 1(c) displays the discounted return for five specific instances, detailed below. To avoid inter-task disparities, all the aforementioned discounted returns are displayed relatively to an estimator of the optimal expected return for each task. For readability purposes, Figures 1(b) and 1(c) display a moving average over 100 episodes. Figure 1(d) reports the benefits of various values of on the algorithmic properties.
In Figure 1(a), we first observe that benefits from the transfer method, as the average discounted return increases as more tasks are experienced. Moreover, this advantage appears as early as the second task. Conversely, the algorithm needs to wait for task 12 before benefiting from transfer. As suggested in Section 4.3, various amounts of prior allow the transfer method to be more or less efficient: a smaller known upper bound on causes a larger discounted return gain. Combining both approaches in the algorithm outperforms all other methods. Episode-wise, we observe in Figure 1(b) that the transfer method allows for faster convergence, hence decreases the sample complexity. Interestingly, features three stages in the learning process. 1) The first episodes are characterized by a direct exploitation of the transferred knowledge, causing these episodes to yield high payoff. This is due to the combined facts that the Lipschitz bound of Equation 10 is larger on promising regions of seen on previous tasks and the fact that acts greedily w.r.t. that bound. 2) This high performance regime is followed by the exploration of unknown regions of , in our case yielding low returns. Indeed, as promising regions are explored first, the bound becomes tighter for the corresponding state-action pairs, enough for the Lipschitz bound of unknown pairs to become larger, thus driving the exploration towards low payoff regions. Such regions are quickly identified and never revisited thereafter. 3) Eventually, stops exploring and converges to the optimal policy. Importantly, in all experiments, never features negative transfer as supported by the provability of the Lipschitz upper bound with high probability. This is indeed demonstrated by the fact that it is at least as efficient in learning as the no-transfer baseline.
Figure 1(c) displays the collected returns of LRMax, LRMax(0.1), and MaxQInit for specific tasks. We observe that LRMax benefits from the transfer as early as task 2, where the aforementioned three-stage behavior is visible. Again, MaxQInit needs to wait for task 12 to leverage the transfer method. However, the bounds it provides are tight enough to allow for almost zero exploration of the task.
In Figure 1(d), we display the following quantities for various values of : , is the ratio of the time the Lipschitz bound was tighter than the bound ; , is the relative gain of time steps before convergence when comparing to . This quantity is estimated based on the last updates of the empirical model ; , is the relative total return gain on 2000 episodes of w.r.t. . First, we observe an increase of as becomes tighter. This means that the Lipschitz bound of Equation 10 becomes effectively smaller than . This phenomenon leads to faster convergence, indicated by . Eventually, this increased convergence rate allows for a net total return gain, illustrated by the increase of .
Overall, in this analysis, we have shown that LRMax benefits from an improved sample complexity thanks to the value-transfer method. The knowledge of a prior further increases this benefit. The method is comparable to the MaxQInit method and has some advantages, such as being usable from the earliest tasks and being applicable to infinite sets of tasks. Moreover, the transfer is non-negative while preserving the PAC-MDP guarantees of the algorithm. In addition to the analysis performed here, we show in the Appendix, Section LABEL:sec:app:prior-use-experiment that, when provided with any prior knowledge , LRMax increasingly stops using this prior as the task is explored. This confirms the claim of Section 4.3 that providing such a prior enables transfer even if its value is of little importance.
We have studied theoretically the Lipschitz continuity property of the optimal Q-function in the MDP space. This led to a local Lipschitz continuity result, establishing that the optimal Q-functions of two close MDPs are themselves close to each other. This distance between Q-functions can be computed by Dynamic Programming. We then proposed a value-transfer method using this continuity property with the Lipschitz RMax algorithm, practically implementing this approach in the Lifelong RL setting. The algorithm preserves PAC-MDP guarantees, accelerates learning in subsequent tasks and performs non-negative transfer. Potential improvements of the algorithm were discussed, namely introducing prior knowledge on the maximum distance between models and estimating this distance online with high probability. We showcased the algorithm in Lifelong RL experiments and demonstrated empirically its ability to accelerate learning. The results also confirm that no negative transfer occurs, regardless of parameter settings. It should be noted that our approach can directly extend other PAC-MDP algorithms (szita2010model; rao2012v; pazis2016improving; dann2017unifying) to the Lifelong setting. In hindsight, we believe this contribution provides a sound basis for non-negative value transfer via MDP similarity, a development that was lacking in the literature. Key insights for the practitioner lie both in the theoretical analysis and in the practical derivation of a transfer scheme that achieves non-negative transfer with PAC guarantees.
We would like to thank Dennis Wilson for fruitful discussions and paper reviews. This research was supported by the Occitanie region, France; ISAE-SUPAERO; fondation ISAE-SUPAERO; École Doctorale Systèmes; and ONERA, the French Aerospace Lab.