iw-transfer-rl
Code for our paper "Importance Weighted Transfer of Samples in Reinforcement Learning" (https://arxiv.org/abs/1805.10886), which was accepted at ICML 2018.
view repo
We consider the transfer of experience samples (i.e., tuples < s, a, s', r >) in reinforcement learning (RL), collected from a set of source tasks to improve the learning process in a given target task. Most of the related approaches focus on selecting the most relevant source samples for solving the target task, but then all the transferred samples are used without considering anymore the discrepancies between the task models. In this paper, we propose a model-based technique that automatically estimates the relevance (importance weight) of each source sample for solving the target task. In the proposed approach, all the samples are transferred and used by a batch RL algorithm to solve the target task, but their contribution to the learning process is proportional to their importance weight. By extending the results for importance weighting provided in supervised learning literature, we develop a finite-sample analysis of the proposed batch RL algorithm. Furthermore, we empirically compare the proposed algorithm to state-of-the-art approaches, showing that it achieves better learning performance and is very robust to negative transfer, even when some source tasks are significantly different from the target task.
READ FULL TEXT VIEW PDFCode for our paper "Importance Weighted Transfer of Samples in Reinforcement Learning" (https://arxiv.org/abs/1805.10886), which was accepted at ICML 2018.
The goal of transfer in Reinforcement Learning (RL) (Sutton & Barto, 1998) is to speed-up RL algorithms by reusing knowledge obtained from a set of previously learned tasks. The intuition is that the experience made by learning source tasks might be useful for solving a related, but different, target task. Transfer across multiple tasks may be achieved in different ways. The available approaches differ in the type of information transferred (e.g., samples, value functions, parameters, policies, etc.) and in the criteria used to establish whether such knowledge could be beneficial for solving the target or not.
This work focuses on the problem of transferring samples from a set of source MDPs to augment the dataset used to learn the target MDP. To motivate our approach, consider a typical learning scenario where samples are costly to obtain. This is often the case in robotics applications, where the interaction with the real environment could be extremely time-consuming, thus reducing the number of samples available. The typical remedy of adopting a simulator often leads to sub-optimal solutions due to the differences with respect to the real environment. A more effective approach is to transfer the simulated samples to speed-up learning in the target task.
The transfer of samples has been widely studied in the supervised learning community. In particular, Crammer et al. (2008) formalized the problem from a theoretical perspective and provided generalization bounds for the transfer scenario. An interesting result is a trade-off between the number of tasks from which to transfer and the total number of samples. In RL, Taylor et al. (2008) and Lazaric et al. (2008) proposed almost simultaneously methods to transfer single samples. While the former method focused on a model-based approach, the latter one proposed a selective approach to transfer samples into a batch RL algorithm (e.g., Fitted Q-Iteration (Ernst et al., 2005)). Furthermore, Lazaric et al. (2008) considered a model-free approach to compute a similarity measure between tasks, which was used to decide which samples to transfer. More recently, Lazaric & Restelli (2011) analyzed the transfer of samples in batch RL from a theoretical perspective, demonstrating again the trade-off between the total number of samples and the number of tasks from which to transfer. Finally, Laroche & Barlier (2017) proposed a way to transfer all the samples to augment the dataset used by Fitted Q-Iteration. The limitation of this approach resides in the restrictive assumption that all the tasks are assumed to share the same transition dynamics and differ only in the reward function. For a survey on transfer in RL, we refer the reader to (Taylor & Stone, 2009; Lazaric, 2012).
One of the main drawbacks of many previous works is that, even after a detailed selection, transferred samples are used in the target task without accounting for the differences between the original (source) MDP and the target one, thus introducing a bias even in the asymptotic case. In this paper, we present a novel approach to transfer samples into a batch RL algorithm. Unlike other works, we do not assume any particular similarity between tasks besides a shared state-action space, and we develop a new model-based methodology to automatically select the relevance (importance weight) of each sample. Existing algorithms for transferring across different state-action spaces (e.g., Taylor et al., 2007) can be straightforwardly combined to our method. Our approach transfers all the samples, but their impact in solving the target task is proportional to their importance weight. To compute the importance weight of each sample, we rely on a non-parametric estimate of the MDP structure. In particular, we adopt Gaussian processes (Rasmussen & Williams, 2006) to estimate the reward and state transition models of the source and target tasks from samples. Then, we propose a robust way to compute two sets of importance weights, one for the reward model and one for the transition model. We introduce an approximate value iteration algorithm based on Fitted Q-iteration that uses such weights to account for the distribution shift introduced by the different MDPs, thus implicitly selecting which samples have higher priority based on their likelihood to be generated from the target MDP. We provide a theoretical analysis showing the asymptotic correctness of our approach and an empirical evaluation on two classical RL domains and a real-world task.
In this section, we start by introducing our mathematical notation. Then, we recall concepts of Markov decision processes and approximate value iteration. Finally, we formalize the transfer settings considered in this work.
For a measurable space , we denote by
the set of probability measures over
and by the space of measurable functions over bounded by , i.e., . Given a probability measure , we define the -norm of a measurable function as . Let be a -valued sequence for some space . For , the empirical norm of a function is . Note that when , we have that . Whenever the subscript is dropped, we implicitly consider the -norm.We define a discounted Markov Decision Process (MDP) as a tuple , where is a measurable state space, is a finite set of actions, is the transition probability kernel, is the reward probability kernel, and is the discount factor. We suppose is uniformly bounded by . A Markov randomized policy maps states to distributions over actions as . As a consequence of taking an action in , the agent receives a reward and the state evolves accordingly to . We define the action-value function of a policy as and the optimal action-value function as for all . Notice that is bounded by . Then, the optimal policy is a policy that is greedy with respect to , i.e., for all , . The optimal action-value function is also the unique fixed-point of the optimal Bellman operator , which is defined by (e.g., Puterman, 1994).
Fitted Q-Iteration (FQI) (Ernst et al., 2005) is a batch RL algorithm that belongs to the family of Approximate Value Iteration (AVI). AVI is a value-based approach that represents Q-functions by a hypothesis space of limited capacity. Starting from an initial action-value function , at each iteration , AVI approximates the application of the optimal Bellman operator in such that . Formally, let be a set of transitions such that and define the empirical optimal Bellman operator as . Then, at each iteration , FQI computes
(1) |
We consider a set of tasks, i.e., MDPs, , where denotes the target and the sources. We suppose all tasks share the same state-action space and have potentially different dynamics and reward. Suppose that, for , we have access to a dataset of samples from the -th MDP, , where state-action pairs are drawn from a common distribution .^{1}^{1}1This assumption can be relaxed at the price of a much more complex theoretical analysis. The goal of transfer learning is to use the samples in to speed up the learning process in the target task .
In this section, we introduce our approach to the transfer of samples. Recall that our goal is to exploit at best samples in to augment the dataset used by FQI to solve the target task, thus speeding up the learning process. In the rest of the paper we exploit the fact that FQI decomposes the RL problem into a sequence of supervised learning problems. It is easy to notice that the optimization problem (1) is an instance of empirical risk minimization, where are the input data, are the targets, and is a squared loss.
As mentioned in the introduction, we aim to exploit all the available samples to solve the target task. Suppose we adopt a naive approach where we concatenate all the samples, i.e., , to solve (1). This approach suffers from sample selection bias (Cortes et al., 2008), i.e., samples are collected from different distributions or domains. In fact, although we assumed state-action pairs to be sampled from a fixed task-independent distribution, the target variables are distributed according to the MDP they come from.
A standard technique used to correct the bias or discrepancy induced by the distribution shift is importance weighting
. This technique consists in weighting the loss function to emphasize the error on some samples and decrease it on others, to correct the mismatch between distributions
(Cortes et al., 2008). The definition of the importance weight for the point is where is the distribution of the target, and is the distribution according to which sample is collected. In our specific case, given an arbitrary sample, its joint distribution under MDP
is . Denote by the -th sample drawn from MDP , then its importance weight is given by .By feeding FQI on the full dataset with samples weighted by (for short), we get an algorithm that automatically selects which samples to exploit, i.e., those that, based on the importance weights, are more likely to be generated from the target MDP. This approach looks appealing but presents several issues. First, the distribution is, even in the case where the MDPs are known, very hard to characterize. Second, consider a simple case where we have a source MDP with the same transition dynamics as the target, but with entirely different reward. Then, the importance weights defined above are likely to be very close to zero for any source sample, thus making transfer useless. However, we would like a method able to leverage the fact that transition dynamics do not change, thus transferring only that part of the sample.
To overcome the second limitation, we consider the following variation of the FQI algorithm. At the first iteration of FQI, we use all the samples to fit a model of the target reward function:
(2) |
where is the hypothesis space we consider to represent action-value functions^{2}^{2}2Differently from other works (e.g., Farahmand & Precup, 2012; Tosatto et al., 2017), we suppose, for the sake of simplicity, the hypothesis space to be bounded by . Although this is a strong assumption, it can be relaxed by considering truncated functions. We refer the reader to (Györfi et al., 2006) for the theoretical consequences of such relaxation. and
(3) |
Problem (2) is unbiased if , though
is frequently used since it provides lower variance. The theoretical analysis is not affected by the choice of
, while in the experiments we will use . Then, at each iteration , FQI updates the Q-function as:(4) |
where and . Intuitively, instead of considering the reward in the dataset , we use . Since the stochasticity due to the reward samples is now removed, only the transition kernel plays a role, and the importance weights are given by:
(5) |
The resulting algorithm, named Importance Weighted Fitted Q-Iteration (IWFQI), is shown in Algorithm 1. In practice, we have to compute an estimate of and since and are unknown quantities. We postpone this topic to Section 5 since several approaches can be exploited. Instead, in the following section, we present a theoretical analysis that is independent of the way the importance weights are estimated.
We now study the theoretical properties of our IWFQI algorithm. We analyze the case where we have samples from one source task, but no samples from the target task are available, i.e., , , and . A generalization to the case where target samples or samples from more sources are available is straightforward, and it only complicates our derivation. To ease our notation, we adopt the subscript “T” and “S” to denote the target and the source. Furthermore, we want to emphasize that the results provided in this section are independent from the way the importance weights are estimated.
Consider the sequence of action-value functions computed by IWFQI. At each iteration , we incur in an error in approximating the optimal Bellman operator. Our goal is to bound, in terms of such errors, , i.e., the expected error under distribution between the performance of the optimal policy and that of the policy greedy w.r.t. . Here is an arbitrary evaluation distribution over that the user can freely choose. In practice, it might coincide with the sampling distribution . Since IWFQI belongs to the family of AVI algorithms, we can resort to Theorem 3.4 in (Farahmand, 2011). We report here the version with -norm for the sake of completeness.
(Theorem 3.4 of (Farahmand, 2011)) Let be a positive integer and . Then, for any sequence and the corresponding sequence , where , we have:
where:
We refer the reader to Chapter 3 of (Farahmand, 2011) for the definitions of the coefficients and .
Intuitively, the bound given in Theorem 1 depends on the errors made by IWFQI in approximating the optimal Bellman operator at each iteration. Thus, our problem reduces to bounding such errors. Cortes et al. (2010) already provided a theoretical analysis of importance weighted regression. However, their results are not immediately applicable to our case since they only consider a regression problem where the target variable is a deterministic function of the input . On the other hand, we have the more general regression estimation problem where
is a random variable, and we want to learn its conditional expectation given
. Thus, we extend Theorem 4 of (Cortes et al., 2010) to provide a bound on the expected -error between the hypothesis returned by a weighted regressor (with estimated weights ) and the regression function . Following (Cortes et al., 2010), we denote by the pseudo-dimension of a real-valued function class . The proof is in the appendix.Let be a functional space. Suppose we have a dataset of i.i.d. samples distributed according to , while is the target distribution. Assume almost surely. Let , be any positive function, , , , and , where denotes the empirical expectation on . Furthermore, assume and . Then, for any , the following holds with probability at least :
Notice that this result is of practical interest outside of the reinforcement learning field. Here it is used to bound the errors in order to state the following result.
Let be a hypothesis space, a distribution over , a sequence of Q-functions as defined in Equation (4), and the optimal Bellman operator of the target task. Suppose to have a dataset of N i.i.d. samples drawn from the source task according to a joint distribution . Let , denote the ideal importance weights defined in (5) and (3), and ,
denote arbitrary positive functions with bounded second moments. Define
, , where . Similarly, define , , and for the transition model. Then, for any , with probability at least :where is the concentrability coefficient of one-step transitions as defined in (Farahmand, 2011, Definition 5.2).
As expected, four primary sources of error contribute to our bound: (i) the bias due to estimated weights (first four terms), (ii) the approximation error (fifth and sixth term), (iii) the estimation error (seventh term), (iv) the propagation error (eighth term). Notice that, assuming to have a consistent estimator for the importance weights (an example is given in Section 5), the bias term vanishes as the number of samples tends to infinity. Furthermore, the estimation error decreases with , thus vanishing as the number of samples increases. Thus, in the asymptotic case our bound shows that the only source of error is due to the limited capacity of the functional space under consideration, as in most AVI algorithms. Furthermore, we notice that fitting the reward function and using it instead of the available samples propagates an error term through iterations, i.e., the approximation error . If we were able to estimate the importance weights for the typical case where both reward and transition samples are used, we could get rid of such error. However, since the resulting weights somehow depend on the joint densities between and , we expect their variance, as measured by , to be much bigger, thus making the resulting bound even larger. Furthermore, we argue that, when the reward function is simple enough and only a limited number of samples is available, a separate fit might be beneficial even for plain FQI. In fact, the variance of the empirical optimal Bellman operator can be reduced by removing the source of stochasticity due to the reward samples at the cost of propagating a small approximation error through iterations. The bounds for AVI, (e.g., Munos & Szepesvári, 2008; Farahmand, 2011; Farahmand & Precup, 2012), can be straightforwardly extended to such case by adopting a procedure similar to the one described in the proof of Theorem 3. Finally, in most practical applications the reward function is actually known and, thus, does not need to be fitted. In such cases, it is possible to get rid of the corresponding terms in Theorem 3, allowing transfer to occur without errors even when rewards are completely different between tasks.
In this section, we specify how to compute the importance weights. Since and are unknown, we only have access to an estimation of and used in (3) and (5), respectively. To obtain an approximation of the unknown densities, we consider Gaussian Processes (GPs) although any distribution matching technique and/or probabilistic model can be used.
We use the available samples to fit two Gaussian processes (GPs) (Rasmussen & Williams, 2006) for each task : one for the transition model and one for the reward model . To motivate our choice, GPs have been successfully adopted to model stochastic dynamical systems with high-dimensional and continuous state-action spaces in many existing works (e.g., Kuss & Rasmussen, 2004; Deisenroth & Rasmussen, 2011; Doshi-Velez & Konidaris, 2016; Berkenkamp et al., 2017). For the sake of simplicity, we only show how to compute the importance weights for the reward model. Our procedure straightforwardly generalizes to the transition model.
Given a sample from the -th task, the
-th GP returns a Gaussian distribution over the reward’s mean, i.e.,
, which, together with the target GP’s prediction, induces a distribution over the importance weights. In practice, the choice of a single importance weight can rely on some statistics of such distribution (e.g., its mean or mode). Perhaps not surprisingly, this is made non-trivial by the fact that explicitly characterizing such distribution is very complicated, and computing empirical statistics requires an expensive repeated sampling from the GPs’ posteriors. Interestingly, the following theorem shows that this is not necessary when the reward model follows a Gaussian law, as the expected weights under their unknown distribution can be computed in closed-form.Assume each task to have Gaussian reward distribution with unknown mean. Given the available samples in , we build an estimate of the reward distribution such that, for any MDP , . Then, given a sample from the -th MDP, its importance weight , where is the distribution induced by the GPs’ predictions. Let and suppose , then
(6) |
The proof is in Appendix A. In practice, we estimate the importance weights by taking their expectation as in (6), i.e., . Intuitively, using the expected weights is more robust than merely taking the ratio of the estimated densities. Furthermore, the estimated weights converge to the true ones when the GP predictions are perfect, i.e., when and , both in the source and in the target. This is a significant advantage over the more common approach of density-ratio estimation (Sugiyama et al., 2012), where a parametric form for the weight function is typically assumed. One drawback is that the expectation diverges when , that is, when the source GP has a prediction variance that is greater than the intrinsic noise of the model. Notice, however, that this happens very rarely since the source GP is never asked to predict samples it has not seen during training. Furthermore, since in practice the model noise is unknown and has to be estimated, an overestimation is beneficial as it introduces a regularization effect (Mohammadi et al., 2016), thus avoiding the problem mentioned above.
In (Taylor et al., 2008), the authors propose a method to transfer samples for model-based RL. Although they assume tasks might have different state-action space, inter-task mappings are used to map source samples to the target. However, the proposed method does not account for differences in the transition or reward models, which could lead to significant negative transfer when the tasks are different. Our approach, on the other hand, can selectively discard samples based on the estimated difference between the MDPs. Lazaric et al. (2008) compute a compliance measure between the target and source tasks and use it to specify from which tasks the transfer is more likely to be beneficial. Furthermore, a relevance measure is computed within each task to determine what are the best samples to transfer. These two measures are then combined to transfer samples into FQI. Once again, our approach does not require any explicit condition to decide what to transfer, nor does it require any assumption of similarity between the tasks. Furthermore, the compliance and relevance measures computed in (Lazaric et al., 2008) jointly account for both the reward and transition models, thus discarding samples when either one of the models is very different between the tasks. On the other hand, our approach can retain at least the part of the sample that is similar, at the cost of introducing a small bias. In (Laroche & Barlier, 2017), the authors propose a technique for transferring samples into FQI under the assumption that the transition dynamics do not change between the tasks. Similarly to our method, they learn the reward function at the first iteration and substitute the predicted values to the reward samples in the dataset. This allows them to safely adopt the full set of target and source samples in the remaining FQI iterations, as all tasks share the same transition model and, thus, samples are unbiased. However, we argue that this assumption of shared dynamics indeed limits the applicability of the transfer method to most real-world tasks.
In the supervised learning literature, Crammer et al. (2008) analyzed the transfer problem from a theoretical perspective and extended the classical generalization bounds to the case where samples are directly transferred from a set of source tasks. The most relevant result in their bounds is a trade-off between the total number of samples transferred and the total number of tasks from which transfer occurs. Increasing the first term decreases the variance, while it is likely to increase the bias due to the differences between the tasks. On the other hand, decreasing the first term also decreases the bias, but it is likely to increase the variance due to the limited number of samples. We observe that such trade-off does not arise in our case. Our method transfers all samples while accounting for the differences between the tasks. The only bias term is due to the errors in the estimation of the task models, which is likely to decrease as the number of samples increases.
Another work from the supervised learning literature that is related to our approach is (Garcke & Vanck, 2014). The authors proposed a method to transfer samples from a different dataset and used importance weighting to correct the distribution shift. However, they leveraged ideas from density-ratio estimation (e.g., Sugiyama et al., 2012) and supposed the weight function to have a given parametric form, thus directly estimating it from the data. Conversely, we estimate the densities involved and try to characterize the weight distribution, taking its expectation as our final estimate.
We evaluate IWFQI on three different domains with increasing level of complexity. In all experiments, we compare our method to two existing algorithms for transferring samples into FQI: the relevance-based transfer (RBT) algorithm of (Lazaric et al., 2008) and the shared-dynamics transfer (SDT) algorithm of (Laroche & Barlier, 2017).
Our first experimental domain is a modified version of the puddle world environment presented in (Sutton, 1996). Puddle world is a discrete-action, continuous-state (stochastic) navigation problem (see Appendix C.1 for a complete description). At each time-step, the agent receives a reward of plus a penalization proportional to the distance from all puddles. Each action moves the agent by in the corresponding direction. In particular, we consider two versions of the environment: (i) shared dynamics, where is fixed, and (ii) puddle-based dynamics, where slows-down the agent proportionally to the distance from all puddles.
We consider three source tasks and one target task, where each task has different puddles in different locations (see Appendix C.1). For each source task, we generate a dataset of episodes from a nearly-optimal policy. We run IWFQI with weights computed according to Equation (6), where we set the model noise to be ten times the true value. For evaluating our weight estimation procedure, we also run IWFQI with ideal importance weights (computed as the ratio of the true distributions). In each algorithm, FQI is run for 50 iterations with Extra-Trees (Ernst et al., 2005). An -greedy policy () is used to collect data in the target task.
Shared dynamics. We start by showing the results for in Figure 2(left). All results are averaged over runs and are reported with confidence intervals. As expected, FQI alone is not able to learn the target task in such a small number of episodes. On the other hand, IWFQI has a good jump-start and converges to an optimal policy in only episodes. Interestingly, IWFQI with ideal weights has almost the same performance, thus showing the robustness of our weight estimation procedure. RBT also learns the optimal policy rather quickly. However, the limited number of target and source samples available in this experiment makes it perform significantly worse in the first episodes. Since in this version of the puddle world the dynamics do not change between tasks, SDT also achieves good performance, converging to a nearly-optimal policy.
Puddle-based dynamics. We also show the results for the more challenging version of the environment were puddles both penalize and slow-down the agent (see Figure 2(right)). Notice that, in this case, transition dynamics change between tasks, thus making the transfer more challenging. Similarly, as before, our approach quickly learns the optimal policy and is not affected by the estimated weights. Furthermore, the benefits of over-estimating the model noise can be observed from the small improvement over IWFQI-ID. RBT is also able to learn the optimal policy. However, the consequences of inaccurately computing compliance and relevance are more evident in this case, where the algorithm negatively transfers samples in the first episodes. Finally, SDT still shows an improvement over plain FQI, but it is not able to learn the optimal policy due to the bias introduced by the different dynamics.
Acrobot (Sutton & Barto, 1998) is a classic control problem where the goal is to swing-up a two-link pendulum by applying positive or negative torque to the joint between the two links. Due to its non-linear and complex dynamics, Acrobot represents a very challenging problem, requiring a considerable amount of samples to be solved. In this experiment, we consider a multi-task scenario where robots might have different link lengths and masses . Our target task is the classic Acrobot swing-up problem, where the robot has lengths and masses . Furthermore, we consider two source tasks. The first is another swing-up task where the robot has lengths and masses . The second is a constant-spin task, where the goal is to make the first joint rotate at a fixed constant speed, with lengths and masses . The exact definition of the tasks’ dynamics and rewards is in Appendix C.2. Notice the intrinsic difficulty of transfer: the first source task has the same reward as the target but very different dynamics, and conversely for the second source task. Using nearly-optimal policies, we generate episodes from the first source and episodes from the second. We run all algorithms (except SDT since the problem violates the shared-dynamics assumption) for episodes and average over runs. Results are shown in Figure 2(left). We notice that both our approach and RBT achieve a good jump-start and learn faster than plain FQI. However, to better investigate how samples are transferred, we show the transfer ratio from each source task in Figure 2(right). Since RBT transfers rewards and transitions jointly, it decides to compensate the highly biased reward samples from the constant-spin task by over-sampling the first source task. However, it inevitably introduces bias from the different dynamics. Our approach, on the other hand, correctly transfers almost all reward samples from the swing-up task, while discarding those from the constant-spin task. Due to transition noise over-estimation, IWFQI achieves an interesting adaptive behaviour: during the initial episodes, when few target samples are available, and the GPs are inaccurate, more samples are transferred. This causes a reduction of the variance in the first phases of learning that is much greater than the increase of bias. However, as more target samples are available, the transfer becomes useless, and our approach correctly decides to discard most transition samples, thus minimizing both bias and variance.
In this experiment, we consider a real-world problem where the goal is to learn how to optimally control a water reservoir system. More specifically, the objective is to learn a per-day water release policy that meets a given demand while keeping the water level below a flooding threshold. Castelletti et al. (2010) successfully addressed such problem by adopting batch RL techniques. However, the authors proved that, due to the highly non-linear and noisy environment, an enormous amount of historical data is needed to achieve good performance. Consider now the case where a new water reservoir, for which no historical data is available, needs to be controlled. Since each sample corresponds to one day of release, learning by direct interaction with the environment is not practical and leads to poor control policies during the initial years, when only a little experience has been collected. Although we do not know the new environment, it is reasonable to assume that we have access to operational data from existing reservoirs. Then, our solution is to transfer samples to immediately achieve good performance. However, such reservoirs might be located in very different environments and weight objectives differently, thus making transfer very challenging.
We adopt a system model similar to the one proposed in (Castelletti et al., 2010). The state variables are the current water storage and day , while there are discrete actions, each corresponding to a particular release decision. The system evolves according to the simple mass balance equation , where is the net inflow at day and is modeled as periodic function, with period of one year, plus Gaussian noise. Given the demand and the flooding threshold , the reward function is a convex combination of the two objectives, , where . Different tasks have different inflow functions and different reward weights, which model different geographic regions and objectives, respectively.
We collected samples, corresponding to years of historical data, from each of source water reservoirs under a hand-coded expert policy. Further details about the tasks are given in Appendix C.3. We compared our approach to FQI and RBT over the first years of learning. An -greedy policy () was used to collect batches of year of samples, except for the first batch, for which an expert’s policy was used. Results, averaged over runs, are shown in Figure 3. We notice that IWFQI immediately outperforms the expert’s policy and quickly achieves near-optimal performance. RBT, on the other hand, has a good jump-start but then seems to worsen its performance. Once again, this is because each source task has at least few samples that can be transferred. However, selecting such samples is very complicated and leads to negative transfer in case of failure. Finally, FQI performs significantly worse than all alternatives and is, thus, not reported.
In this paper, we presented Importance Weighted Fitted Q-Iteration, a novel AVI algorithm for transferring samples in batch RL that uses importance weighting to automatically account for the difference in the source and target distributions. IWFQI exploits Gaussian processes to learn transition and reward models that are used to compute the importance weights. The use of two different processes for reward and transition models allows maximizing the information transferred. We theoretically investigated IWFQI showing (i) its asymptotic correctness in general settings, and (ii) how to compute a robust statistical estimate of the weights for Gaussian models. Finally, we empirically proved its effectiveness in common benchmarks and on a real-world water control problem.
One of the drawbacks of our method is that it does not fully exploit possible similarities between tasks. Recent approaches (e.g., Doshi-Velez & Konidaris, 2016; Killian et al., 2017) learn models relating a family of tasks to ease the transfer of knowledge. Exploring how such relations can benefit our approach (e.g., to improve the weight estimates) is an interesting line for future developments.
Products and convolutions of gaussian probability density functions.
2003.In this section we consider the case where the transition and reward models are Gaussian distributions.
A task is jointly Gaussian when it has:
Gaussian Reward: for any
(7) |
Gaussian Transitions: for any
(8) |
.
In this setting, by using a Gaussian process to estimate the transition and reward models, we can compute the expected importance weights, where the expectation is taken under the distribution induced by the GPs’ predictions, in closed form. Note that, since reward and transition weights are independent, we can independently consider Assumptions 7 and 8 for the computation of and , respectively.
See 4
In order to simplify our notation, we consider the dependence on implicit. We have:
The first integral is over the product of two Gaussian densities, which is known to be (Bromiley, 2003):
(9) |
where the values of the mean and variance of the first density are not important to complete this proof since such density integrates out. By adopting the same procedure as the one described in (Bromiley, 2003), we can write the ratio of the two Gaussian densities in the second integral as:
(10) |
where, again, the values of and are not relevant to our proof. Finally, we can write:
∎
We can derive a similar result for the transition model by considering Assumption 8.
Assume each task to have Gaussian transition distribution with unknown mean. Furthermore, suppose that . Given the available samples in , we build an estimate of the transition distribution such that, for any MDP and state component , . Then, given a sample from the -th MDP, the importance weight for the transition model given by is distributed according to some distribution induced by the GPs’ predictions. Let and suppose for all , then
Comments
There are no comments yet.