Does Knowledge Transfer Always Help to Learn a Better Policy?

12/06/2019, by Fei Feng et al.

One of the key approaches to saving samples when learning a policy for a reinforcement learning problem is to use knowledge from an approximate model, such as a simulator. However, does knowledge transfer from approximate models always help to learn a better policy? Despite numerous empirical studies of transfer reinforcement learning, an answer to this question is still elusive. In this paper, we provide a strong negative result, showing that even full knowledge of an approximate model may not help reduce the number of samples required to learn an accurate policy for the true model. We construct an example of reinforcement learning models and show that the sample complexity with or without knowledge transfer has the same order. On the bright side, effective knowledge transfer is still possible under additional assumptions. In particular, we demonstrate that knowing the (linear) bases of the true model significantly reduces the number of samples needed to learn an accurate policy.


1 Introduction

Reinforcement learning (RL) is the framework of learning to control an unknown system through trial and error. Recently, RL has achieved phenomenal empirical successes; e.g., AlphaGo (Silver et al., 2016) defeated the best human player in Go, and OpenAI used RL to precisely and robustly control a robotic arm (Andrychowicz et al., 2017). The RL framework is general enough to capture a broad spectrum of topics, including health care, traffic control, and experimental design (Sutton et al., 1992; Esteva et al., 2019; Si and Wang, 2001; Wiering, 2000; Denil et al., 2016). However, successful applications of RL in these domains are still rare. The major obstacle that prevents RL from being widely used is its high sample complexity: both AlphaGo and the OpenAI arm took nearly a thousand years of human-equivalent experience to achieve good performance. One way to reduce the number of training samples is to mimic how human beings learn: borrow knowledge from previous experiences. In robotics research, a robot may need to accomplish different tasks at different times. Instead of learning every task from scratch, it is more desirable for the robot to exploit the similarities between the underlying models of these tasks and adapt to new jobs quickly. Another example is that RL agents are often trained in simulators and then deployed in the real world (Ng et al., 2006; Itsuki, 1995; Dosovitskiy et al., 2017). It is still desirable to improve their performance using samples collected from the real world, and one might hope that agents trained in simulators (approximate models) can adapt to the real world (true model) faster than agents that know nothing. Both examples lead to a natural question: if models are similar, can we achieve fast adaptation through knowledge transfer?

This paper focuses on answering the above question. Suppose the true unknown model is a Markov Decision Process (MDP) $M$ and the RL agent is provided with an approximate model $\widetilde{M}$ satisfying $d(M, \widetilde{M}) \le \beta$, where $d(\cdot,\cdot)$ is a statistical distance and $\beta$ is a small scalar. We would like to study the sample complexity of learning a policy for $M$ such that its error (the error of a policy is the difference between the value of the policy and the optimal value) is at most $\epsilon$, with $\epsilon < \beta$ (i.e., the high-precision regime). For a fixed $\epsilon$, common wisdom would suggest that a better $\widetilde{M}$ (e.g., a smaller $\beta$) should help reduce the sample complexity.

The most natural choice of $d$ is the total-variation (TV) distance between the transition kernels of $M$ and $\widetilde{M}$. It is well known (see, e.g., Puterman 2014) that an optimal policy for $\widetilde{M}$ has an error of at most $O(\beta)$ in $M$, where $O(\cdot)$ hides constants determined by the model. In this paper we show, however, that to obtain an $\epsilon$-optimal policy (the formal definition is given in Sec. 1.2), the number of samples required is of the form
$$\widetilde{\Omega}\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \epsilon^2}\right)$$
when $\epsilon$ is below a threshold determined by $\beta$ and the model. Note that this sample complexity is independent of $\beta$. In particular, the complexity does not improve as $\beta$ becomes smaller (as long as $\epsilon$ stays in the high-precision regime). This renders the knowledge that $M$ lies in a TV-distance ball around $\widetilde{M}$ useless when pursuing high-precision control without further structural information about the model. To show the lower bound, we leverage techniques for proving hardness in the bandit literature (e.g., Mannor and Tsitsiklis 2004) and in reinforcement learning (e.g., Azar et al. 2013) to carefully show that the approximate model does not provide the critical information that matters for high-precision control of the true model. Therefore, learning a high-precision policy does not benefit from the approximate model.

To complement the lower bound, we further investigate the structural information about a model that provably helps knowledge transfer. We show that if the unknown model lies in the convex hull of a set of known base models, we can obtain high-precision control with a number of samples significantly smaller than that of learning from scratch. Specifically, the number of samples is proportional to $d$, the number of base models, rather than the much larger $|\mathcal{S}|$, the number of states in the model.

1.1 Related Work

Reducing sample complexity is a core research goal in RL. Many related sub-branches of RL, e.g., multi-task RL (Brunskill and Li, 2013; Ammar et al., 2014; Calandriello et al., 2014), lifelong RL (Abel et al., 2018; Brunskill and Li, 2014), and meta-RL (Al-Shedivat et al., 2017), provide different schemes for utilizing experience from previous tasks. See also the survey by Taylor and Stone (2009) for more related work. However, these results focus on different special cases of knowledge utilization rather than on the fundamental question of whether an approximate model is useful for policy learning and what guarantees we can have. In the area of Sim-to-Real (short for simulator-to-real-environment), some works point out that an imperfect approximate model may degrade learning performance, and efforts have been made to address this issue, e.g., Kober et al. (2013); Buckman et al. (2018); Kalweit and Boedecker (2017); Kurutach et al. (2018). There has been active empirical research, but little is known in theory. A more closely related work is Jiang (2018), which shows that even if the approximate model differs from the real environment in a single state-action pair (but which one is unknown), such an approximate model can still be information-theoretically useless. This is another interesting direction to look at. However, the statistical distance from such a model to the true model can be arbitrarily large, and hence the policy of the approximate model carries no guarantee on the true model. The limits of the benefit that an approximate model can bring are also studied in Jiang and Li (2015), where the authors build a policy-value estimator and use the approximate model to reduce variance. However, they demonstrate that, if no extra knowledge is provided, only the part of the variance arising from the randomness of the policy can be eliminated, not the stochasticity in state transitions.

In order to take more advantage of previous experience, additional structural information is needed. A number of structural settings have been studied in the literature. For instance, in Brunskill and Li (2013), all models are assumed to be drawn from a finite set of MDPs with identical state and action spaces but different reward and/or transition probabilities; in Abel et al. (2018), one studied case requires that all models share the same transition dynamics and only the reward functions change, following a hidden distribution; in Mann and Choe (2012), a special mapping between the approximate model and the true model is assumed such that the approximate model can provide a good action-value initialization for the true model; in Calandriello et al. (2014), all tasks can be accurately represented in a linear approximation space and the weight vectors are jointly sparse; in Modi et al. (2019), every model's transition kernel and reward function lie in the linear span of known base models. To complement our lower bound, we study an MDP model that shares a similar information structure to that of Modi et al. (2019). In contrast to Modi et al. (2019), our model has an infinite horizon and a different loss function. Although not the main focus of this paper, our proposed model and algorithm provide another effective approach that supports knowledge transfer.

It is worth mentioning that structural information, such as the existence of a lower-dimensional knowledge-sharing space, is not exclusive to RL. It also finds applications in supervised multi-task learning, e.g., Kumar and Daume III (2012); Ruvolo and Eaton (2013); Maurer et al. (2013).

1.2 Preliminaries

Notation

We use lowercase letters for scalars, capital letters for vectors and functions (and for a few specific scalars), capital boldface letters for matrices, and calligraphic letters for sets. The cardinality of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$. We use $[n]$ to represent the set $\{1, 2, \dots, n\}$. The simplex in $\mathbb{R}^n$ is denoted by $\Delta_n$. We abbreviate the Kullback–Leibler divergence to KL and use $O(\cdot)$, $\Omega(\cdot)$, and $\underline{\Omega}(\cdot)$ to denote leading orders in upper bounds, lower bounds, and minimax lower bounds, respectively; we use $\widetilde{O}(\cdot)$ and $\widetilde{\Omega}(\cdot)$ to hide polylog factors.

Markov Decision Process

In this paper, we focus on discounted Markov Decision Processes (MDPs) with an infinite horizon, while the same analysis extends straightforwardly to other MDP settings. We use $M$ to represent an MDP. Each MDP is described by a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a finite action space, $P$ is an $|\mathcal{S}||\mathcal{A}| \times |\mathcal{S}|$ matrix with each row being a state-transition distribution, $R$ is the reward function, and $\gamma \in (0,1)$ is a discount factor. We denote by $P(s' \mid s, a)$ the $\big((s,a), s'\big)$-th entry of $P$ and by $P(\cdot \mid s, a)$ the $(s,a)$-th row of $P$. We also use $R$ for the vector of rewards $\big(R(s,a)\big)_{(s,a)\in\mathcal{S}\times\mathcal{A}}$ and $|\mathcal{S}||\mathcal{A}|$ for the total number of state-action pairs. At each time step, the controller observes a state $s$ and selects an action $a$ according to a policy $\pi$, where $\pi$ maps a state to an action. The action transitions the environment to a new state $s'$ with probability $P(s' \mid s, a)$. Meanwhile, the controller receives an instant reward $R(s, a)$. Given a policy $\pi$, we define its value function as:
$$V^{\pi}(s) := \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s\right], \tag{1}$$
where the expectation is taken over the trajectory generated by following $\pi$, i.e., $a_t = \pi(s_t)$ and $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The objective of RL is to learn an optimal policy $\pi^*$ such that its value at every state $s$ is maximized over all policies, i.e., $V^{\pi^*}(s) = \max_{\pi} V^{\pi}(s)$ for all $s \in \mathcal{S}$. We also denote the optimal value function by $V^* := V^{\pi^*}$. In practice, the optimal value/policy is in general not attainable, so it makes sense to study sub-optimal policies. We call a policy $\pi$ an $\epsilon$-optimal policy for $M$ if $\|V^* - V^{\pi}\|_{\infty} \le \epsilon$. We also denote the action-value function (or Q-function) of a policy $\pi$ by $Q^{\pi}$. Specifically,
$$Q^{\pi}(s, a) := R(s, a) + \gamma \big\langle P(\cdot \mid s, a), V^{\pi} \big\rangle. \tag{2}$$
We adapt the notion of sub-optimality to value functions and Q-functions as well, i.e., $V$ and $Q$ are $\epsilon$-optimal if $\|V^* - V\|_{\infty} \le \epsilon$ and $\|Q^* - Q\|_{\infty} \le \epsilon$, respectively. Furthermore, if we let $V^{\pi} \in \mathbb{R}^{|\mathcal{S}|}$ be the vector whose $s$-th coordinate is $V^{\pi}(s)$, $R^{\pi} \in \mathbb{R}^{|\mathcal{S}|}$ the vector whose $s$-th coordinate is $R(s, \pi(s))$, and $P^{\pi} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ the matrix whose $s$-th row is $P(\cdot \mid s, \pi(s))$, then by definition it holds that
$$V^{\pi} = R^{\pi} + \gamma P^{\pi} V^{\pi}, \quad \text{i.e.,} \quad V^{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}. \tag{3}$$
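For concreteness, Eq. (3) can be evaluated in a few lines of numpy. The sketch below assumes the tabular array layout described in its comments; the array shapes and random numbers are illustrative assumptions, not the paper's code.

```python
import numpy as np

def evaluate_policy(P, R, gamma, pi):
    """Exact policy evaluation via Eq. (3): V^pi = (I - gamma * P^pi)^{-1} R^pi.

    P:  transition tensor of shape (S, A, S) with P[s, a, s'] = P(s' | s, a)
    R:  reward matrix of shape (S, A)
    pi: deterministic policy, an integer array of shape (S,)
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]          # row s is P(. | s, pi(s)); shape (S, S)
    R_pi = R[np.arange(S), pi]          # entry s is R(s, pi(s));   shape (S,)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Tiny random example (all numbers are hypothetical, for illustration only).
rng = np.random.default_rng(0)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # each P[s, a] is a distribution over next states
R = rng.uniform(size=(S, A))
pi = np.zeros(S, dtype=int)                  # the policy that always plays action 0
print(evaluate_policy(P, R, gamma, pi))
```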

TV-distance for MDPs

To measure the closeness of two MDPs $M_1 = (\mathcal{S}, \mathcal{A}, P_1, R_1, \gamma)$ and $M_2 = (\mathcal{S}, \mathcal{A}, P_2, R_2, \gamma)$, we introduce the following metric $d_{TV}$:
$$d_{TV}(M_1, M_2) := \max_{(s,a) \in \mathcal{S}\times\mathcal{A}} \big\| P_1(\cdot \mid s, a) - P_2(\cdot \mid s, a) \big\|_1.$$
Note that the distance is only defined between MDPs with the same state space, action space, and discount factor. The name TV comes from the fact that $\tfrac{1}{2}\|P_1(\cdot \mid s, a) - P_2(\cdot \mid s, a)\|_1$ is the total-variation distance between $P_1(\cdot \mid s, a)$ and $P_2(\cdot \mid s, a)$. We write $M_2 \in \mathcal{B}(M_1, \beta)$ if $d_{TV}(M_1, M_2) \le \beta$ for some $\beta \ge 0$.
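The metric is straightforward to compute from the transition tensors. The sketch below assumes the $\ell_1$ convention written above (whether a factor of $1/2$ is absorbed is a normalization choice) and the same tabular shapes as earlier.

```python
import numpy as np

def d_tv(P1, P2):
    """TV-style distance between two MDPs sharing state/action spaces:
    the maximum over (s, a) of the l1 distance between transition rows.

    P1, P2: transition tensors of shape (S, A, S).
    """
    return float(np.max(np.abs(P1 - P2).sum(axis=-1)))
```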

Generative Model

Given an MDP, we define a special sample oracle, the generative model. A generative model takes any state-action pair $(s, a) \in \mathcal{S} \times \mathcal{A}$ as input and outputs a next state $s' \sim P(\cdot \mid s, a)$, together with the corresponding instant reward.
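Such an oracle is easy to emulate once the MDP arrays are known. The stand-in below is a hedged sketch rather than the paper's formal oracle; in particular, returning the instant reward alongside the next state is an assumption.

```python
import numpy as np

class GenerativeModel:
    """Sample oracle for an MDP: query any (s, a) and receive s' ~ P(. | s, a)."""

    def __init__(self, P, R, seed=0):
        self.P, self.R = P, R                      # shapes (S, A, S) and (S, A)
        self.rng = np.random.default_rng(seed)

    def sample(self, s, a):
        s_next = self.rng.choice(self.P.shape[-1], p=self.P[s, a])
        return s_next, self.R[s, a]                # next state and instant reward
```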

2 Problem Formulation and Illustration

We formalize our setting and the considered problem of knowledge transfer as follows.
Problem 1.
Suppose the unknown true model is an MDP $M$ and an agent is provided with full information about a prior model $\widetilde{M}$ satisfying $d_{TV}(M, \widetilde{M}) \le \beta$, where $\beta$ is a known constant. How many samples does it take to learn an $\epsilon$-optimal policy for $M$, where $\epsilon$ is an accuracy parameter?
Due to the properties of the TV-distance, an optimal policy of $\widetilde{M}$ is already an $O(\beta)$-optimal policy for $M$, where $O(\cdot)$ hides model-dependent constants. It is natural to hope that a smaller $\beta$ leads to a smaller sample complexity, i.e., that the knowledge from $\widetilde{M}$ helps policy learning in $M$. However, when a higher precision is desired, i.e., $\epsilon < \beta$ (up to model-dependent constants), we show that the number of samples depends only on $\epsilon$, rather than $\beta$, unless additional assumptions about the model are made. This conclusion is particularly striking, since the knowledge of an approximate model in the vicinity of the true model is then almost "useless."

2.1 A Basic Case

We illustrate our point with a simple case. Define two MDPs $M$ and $\widetilde{M}$ as shown in Figure 1. Both have 5 states $\{x, y_1, y_2, z_1, z_2\}$, where $x$ has two actions $a_1$ and $a_2$ and the remaining states are all single-action. After taking $a_i$, $x$ transitions to $y_i$ deterministically. In $\widetilde{M}$, $y_1$ and $y_2$ transition to themselves with probabilities 0.5 and 0.4, respectively, and to $z_1$ and $z_2$ with probabilities 0.5 and 0.6, respectively. In $M$, each $y_i$ transitions to itself with probability $p_i$ and to $z_i$ with probability $1 - p_i$; $p_1$ and $p_2$ are unknown. In both models, $z_1$ and $z_2$ are absorbing states. They also share the same reward function: the reward is 1 on $y_1$ and $y_2$ and 0 otherwise. Without loss of generality, we take $\beta = 0.2$ (the reader can easily generate similar examples for other values of $\beta$ following the same principle). Since $d_{TV}(M, \widetilde{M}) \le \beta$, we have that $p_1 \in [0.4, 0.6]$ and $p_2 \in [0.3, 0.5]$.

Figure 1: The left MDP is $\widetilde{M}$ and the right MDP is $M$, where $p_1 \in [0.4, 0.6]$ and $p_2 \in [0.3, 0.5]$.

In $M$, if $p_1 > p_2$, the optimal policy takes $a_1$ at $x$, and the sub-optimality of a policy that instead returns $a_2$ at $x$ is governed by the gap between the values of $y_1$ and $y_2$ (which is determined by $|p_1 - p_2|$ and $\gamma$). When $\epsilon$ is smaller than this gap, producing an $\epsilon$-optimal policy requires finding out that $p_1 > p_2$ with high probability; if $p_2 > p_1$, vice versa. Therefore, the problem of learning an $\epsilon$-optimal policy is equivalent to identifying the larger value in $\{p_1, p_2\}$. The knowledge we can use from $\widetilde{M}$ is that $p_1 \in [0.4, 0.6]$ and $p_2 \in [0.3, 0.5]$, which does not help reduce the sample complexity because the two intervals overlap.
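The sketch below makes the overlap argument concrete under the reconstruction above; the state names, reward placement, discount factor, and the two parameter settings are illustrative assumptions.

```python
import numpy as np

def q_at_x(p1, p2, gamma=0.9):
    """Closed-form Q-values at the decision state x in the 5-state example:
    action a_i moves to y_i (reward 0 at x); y_i pays reward 1, self-loops
    with probability p_i, and otherwise falls into an absorbing reward-0 state."""
    v_y = lambda p: 1.0 / (1.0 - gamma * p)        # V(y_i) = 1 / (1 - gamma * p_i)
    return gamma * v_y(p1), gamma * v_y(p2)

# Two parameter settings, both consistent with the prior knowledge
# p1 in [0.4, 0.6] and p2 in [0.3, 0.5].
for p1, p2 in [(0.45, 0.50), (0.55, 0.45)]:
    q1, q2 = q_at_x(p1, p2)
    best = "a1" if q1 > q2 else "a2"
    print(f"p1={p1}, p2={p2}: Q(x,a1)={q1:.3f}, Q(x,a2)={q2:.3f}, best={best}")
```

Both settings are compatible with the prior model, yet they disagree on the optimal action at $x$, so resolving it requires samples from $M$ itself.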

2.2 Empirical Verification

Besides the simple case above, we also provide a numerical demonstration on a sailing problem (Vanderbei, 1996). In Figure 2, we generate two MDPs $M$ and $\widetilde{M}$ with a small TV-distance $d_{TV}(M, \widetilde{M}) \le \beta$. We compare the performance of two algorithms: 1. direct Q-learning (Watkins and Dayan, 1992) with transition samples from $M$ (blue line); 2. warm-start Q-learning, which uses the full knowledge of $\widetilde{M}$ to generate a nearly optimal Q-function for $\widetilde{M}$ and then uses that Q-function to initialize a subsequent Q-learning run with transition samples from $M$ (red line). Both algorithms use the same batch of transition samples from $M$. Since $\widetilde{M}$ is close to $M$, warm-start Q-learning is much better than its learning-from-scratch counterpart in the initial stage. However, the two curves overlap as they approach the optimal value, indicating similar sample complexities for both algorithms when pursuing a high-precision Q-value estimate.

Figure 2: A toy comparison between direct Q-learning and warm-start Q-learning initialized with a nearly optimal Q-function of $\widetilde{M}$.
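The comparison can be reproduced qualitatively with a few lines of tabular Q-learning. The sketch below is not the paper's experimental code; the oracle `sample_fn`, the constant learning rate `alpha`, and the uniform sampling of state-action pairs are assumptions.

```python
import numpy as np

def q_learning(sample_fn, S, A, gamma, n_steps, Q_init=None, alpha=0.1, seed=0):
    """Tabular Q-learning from a sample oracle, optionally warm-started.

    sample_fn(s, a) -> (s_next, reward) draws a transition from the true MDP M.
    Q_init can be a near-optimal Q-function computed on the approximate model
    M_tilde (warm start, red curve) or None (learning from scratch, blue curve).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, A)) if Q_init is None else Q_init.astype(float).copy()
    for _ in range(n_steps):
        s, a = rng.integers(S), rng.integers(A)          # uniformly sampled (s, a)
        s_next, r = sample_fn(s, a)
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])            # standard Q-learning update
    return Q
```

Running this twice on the same sample budget, once with `Q_init=None` and once with a near-optimal Q-function of $\widetilde{M}$, is expected to show the pattern in Figure 2: the warm start dominates early, while the high-precision tails coincide.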

3 Lower Bound of Transfer Learning from a TV-distance Ball

In this section, we formally prove that an approximate model does not help when learning a high-precision policy. In particular, we show the following lower bound.
Theorem 1.
(Main Result)  Let $M$ be an unknown MDP, and suppose an MDP $\widetilde{M}$ satisfying $d_{TV}(M, \widetilde{M}) \le \beta$ is given. There exists a threshold $\epsilon_0 > 0$, determined by $\beta$ and $\gamma$, such that for all $\epsilon \in (0, \epsilon_0)$ and $\delta \in (0, 1)$, the sample complexity of learning an $\epsilon$-optimal policy for $M$ with probability at least $1 - \delta$ is
$$\Omega\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \epsilon^2} \log\frac{1}{\delta}\right).$$
As shown in Azar et al. (2013), Sidford et al. (2018), and Agarwal et al. (2019), the sample complexity of directly learning an $\epsilon$-optimal policy for an MDP with high probability under a generative model is
$$\widetilde{\Theta}\!\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3 \epsilon^2}\right).$$
We conclude that, for any $\beta$, when $\epsilon$ is small enough, learning with this prior knowledge is at least as hard as learning without it, if all we know is that the true model lies in a small TV-distance ball around the approximate model. As any online algorithm can also be run in the generative-model setting, the lower bound automatically carries over to the online setting as well.
Before starting the proof, we give the following definition of correctness for RL algorithms.
Definition 1.
($(\epsilon, \delta)$-correctness) Given $\epsilon, \delta \in (0, 1)$ and a prior model $\widetilde{M}$, we say that an RL algorithm $\mathcal{K}$ is $(\epsilon, \delta)$-correct if, for every MDP $M$ with $d_{TV}(M, \widetilde{M}) \le \beta$, $\mathcal{K}$ outputs an $\epsilon$-optimal policy for $M$ with probability at least $1 - \delta$.

Next, we construct a class of MDPs. We select one model from this class to serve as the prior knowledge. We then show that if an RL algorithm $\mathcal{K}$ uses significantly fewer samples than the lower bound, there always exists an MDP in the class on which $\mathcal{K}$ fails to be $(\epsilon, \delta)$-correct. This establishes the lower bound on the sample complexity.

Construction of the Hard Case

We define a family of MDPs $\mathbb{M}$. These MDPs have the structure depicted in Figure 3. The state space consists of three disjoint subsets: $\mathcal{X}$ (gray nodes), $\mathcal{Y}$ (green nodes), and $\mathcal{Z}$ (blue nodes). The set $\mathcal{X}$ includes states $x_1, \dots, x_K$, and each of them has $L$ available actions $a_1, \dots, a_L$. States in $\mathcal{Y}$ and $\mathcal{Z}$ are all single-action. In total, $\mathcal{X}$ contributes $KL$ state-action pairs. For a state $x \in \mathcal{X}$, taking action $a$ transitions to a state $y_{x,a} \in \mathcal{Y}$ with probability 1; this mapping from $\mathcal{X} \times \{a_1, \dots, a_L\}$ to $\mathcal{Y}$ is one-to-one. A state $y \in \mathcal{Y}$ transitions to itself with probability $p_y$ and to a corresponding state $z_y \in \mathcal{Z}$ with probability $1 - p_y$; the parameters $p_y$ can differ across models. All states in $\mathcal{Z}$ are absorbing. The reward function is: $R(s) = 1$ if $s \in \mathcal{Y}$; $R(s) = 0$ otherwise.

Figure 3: The class of MDPs considered in the proof of Theorem 1. Nodes represent states and arrows show transitions. $\mathcal{X}$ consists of all gray nodes, $\mathcal{Y}$ of all green nodes, and the blue nodes form $\mathcal{Z}$.

The family $\mathbb{M}$ is a generalization of a multi-armed bandit problem used in Mannor and Tsitsiklis (2004) to prove a lower bound on bandit learning. A similar example is used in Azar et al. (2013) to prove a lower bound on reinforcement learning without any prior knowledge. An MDP $M \in \mathbb{M}$ is fully determined by the parameter set $\{p_y\}_{y \in \mathcal{Y}}$, and its optimal Q-function takes the values:
$$Q^*(x, a) = \frac{\gamma}{1 - \gamma\, p_{y_{x,a}}} \ \ \text{for } x \in \mathcal{X},\ a \in \{a_1,\dots,a_L\}, \qquad V^*(y) = \frac{1}{1 - \gamma\, p_{y}} \ \ \text{for } y \in \mathcal{Y}. \tag{4}$$
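Since the reward placement and Eq. (4) above are reconstructed from the surrounding description, here is a small value-iteration check on one member of the family; the concrete sizes, parameter values, and the use of a single shared absorbing state are illustrative assumptions.

```python
import numpy as np

def build_hard_mdp(p, gamma):
    """One member of the family: K x-states with L actions each; action (x, a)
    leads deterministically to a dedicated y-state, which pays reward 1,
    self-loops with probability p[x, a], and otherwise drops into an
    absorbing reward-0 state (a single shared z-state here, for simplicity)."""
    K, L = p.shape
    n = K + K * L + 1
    P = np.zeros((n, L, n))
    R = np.zeros((n, L))
    z = n - 1
    for x in range(K):
        for a in range(L):
            y = K + x * L + a
            P[x, a, y] = 1.0                 # x --a--> y deterministically
            P[y, :, y] = p[x, a]             # y self-loops w.p. p[x, a]
            P[y, :, z] = 1.0 - p[x, a]
            R[y, :] = 1.0                    # reward 1 on y-states
    P[z, :, z] = 1.0                         # absorbing, reward 0
    return P, R

gamma = 0.9
p = np.array([[0.50, 0.55]])                 # K = 1, L = 2 (hypothetical values)
P, R = build_hard_mdp(p, gamma)
Q = np.zeros_like(R)
for _ in range(5000):                        # Q-value iteration
    Q = R + gamma * (P @ Q.max(axis=1))
print(Q[0])                                  # Q*(x, a) from value iteration
print(gamma / (1.0 - gamma * p[0]))          # the closed form in Eq. (4)
```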

Prior Model and Hypotheses of $M$

Now, we select a prior model $\widetilde{M} \in \mathbb{M}$ whose parameters $\{\tilde{p}_y\}$ are all equal to a common value determined by the discount factor $\gamma$; we restrict the range of $\gamma$ so that this common value lies strictly between 0 and 1. Given $\widetilde{M}$, let $\mathcal{K}$ be an $(\epsilon, \delta)$-correct algorithm. Denote by $M_0$ the hypothesis whose parameters coincide with those of $\widetilde{M}$. We consider the following possibilities for $M$:
(5) (6)
where the perturbation size is selected such that
(7)
and such that
(8)
Note that the parameter set of each hypothesis in (5) differs from that of $M_0$ only on a single action, and the parameter set of each hypothesis in (6) differs from that of $M_0$ only on a single state-action pair. When the perturbation is small enough, all of the models above lie in the TV-distance ball $\{M : d_{TV}(M, \widetilde{M}) \le \beta\}$. We refer to them as hypotheses of $M$. Every hypothesis induces a probability measure over the same sample space. We denote by $\mathbb{E}_0$ and $\mathbb{P}_0$ the expectation and probability under hypothesis $M_0$, and analogously for the other hypotheses. These probability measures capture both the randomness in the corresponding MDP and the randomization carried out by the algorithm $\mathcal{K}$, for example its sampling strategy. It is worth mentioning that in Azar et al. (2013), the authors implicitly assume that the numbers of samples drawn from different states are determined before the algorithm starts and do not change during learning (this is due to the conditional-independence argument in their Lemma 18). Such an assumption does not cover adaptive sampling strategies. Our result, in contrast, includes adaptive sampling.
In the sequel, we fix two parameters of the construction whose values will be determined later. Let $t^*$ be a sample-count threshold, whose exact value involves a constant to be determined later. We also define $T_y$, the number of samples that algorithm $\mathcal{K}$ draws from the generative model with input state $y$ before it stops (these sample calls are not necessarily consecutive). For every $y \in \mathcal{Y}$, we define the following three events:
(9) (10) (11)
where $r_y$ is the (non-discounted) sum of rewards obtained by calling the generative model $t^*$ times with input state $y$. For these events, we have the following lemmas.
Lemma 1.
For any , if  , .
Proof.
Thus, . ∎
Lemma 2.
For any , if  , .
Proof.
When $T_y \le t^*$, under hypothesis $M_0$, the instant rewards from state $y$ are, by definition, i.i.d. Bernoulli random variables. By the Chernoff–Hoeffding bound, we have that
(12) (13)
Thus, when , . ∎
Now, suppose $\mathcal{K}$ is $(\epsilon, \delta)$-correct with the accuracy parameter set appropriately; then, for every hypothesis, $\mathcal{K}$ should return a policy that selects the optimal action at every $x \in \mathcal{X}$ with probability at least $1 - \delta$. Define the event $\mathcal{E}$ accordingly. Combining the results above, we have that
(14)
Next, we show that if the expected number of samples drawn at any state $y \in \mathcal{Y}$ is less than $t^*$, then $\mathcal{E}$ occurs with probability greater than $\delta$ under the corresponding perturbed hypothesis.
Lemma 3.
Let . For any , when , if  , then .
Proof.
Given a state $y$ and a hypothesis, we denote the length-$t^*$ random sequence of instant rewards obtained by calling the generative model $t^*$ times with input state $y$. Under hypothesis $M_0$, this is an i.i.d. Bernoulli sequence; under a perturbed hypothesis, it is an i.i.d. Bernoulli sequence with the perturbed parameter. We define the likelihood function by letting
its value at every possible realization be the probability of that realization. This function can be used to define a random variable evaluated at the sample path of the random sequence. Following the previous notation, $r_y$ is the sum of rewards, i.e., the total number of 1s in the sequence. Then the likelihood ratio is
(15) (16) (17)
By our choice of , , and , it holds that and . With the fact that for and for , we have that
(18) (19)
Thus
(20) (21)
due to . Next, we proceed on the event . By definition, if occurs, event has occurred. Using for , it follows that
Using for , we have that
Further, we have that when occurs, also occurs. Therefore,
(22) (23) (24)
By taking small enough, e.g. , we have
By a change of measure, we deduce that
(25)
If $\mathcal{K}$ is $(\epsilon, \delta)$-correct, then under each perturbed hypothesis it should produce a policy that picks the perturbed action with probability greater than $1 - \delta$. Thus, the event $\mathcal{E}$ must have probability at most $\delta$ under every perturbed hypothesis, which, by Lemma 3, requires $\mathbb{E}[T_y] \ge t^*$ for all $y \in \mathcal{Y}$. Summing over the states in $\mathcal{Y}$ gives the claimed total number of samples, which concludes the proof of Theorem 1.

4 A Case Study for Knowledge Transfer in Reinforcement Learning

In this section, we impose a new assumption on the similarity among models such that transferring knowledge achieves fast adaptation. We consider a sequence of MDPs that share the same state space, action space, and discount factor, but have different transition dynamics and/or reward functions. At each time step $t$, we want to learn an $\epsilon$-optimal policy for the current MDP $M_t$. The assumption we propose is a convex-hull structure, as stated below.
Assumption 1.
Given a finite set of MDPs $\mathcal{M} = \{M_1, \dots, M_d\}$ with common state space, action space, and discount factor, we have $M_t \in \operatorname{conv}(\mathcal{M})$ for all $t$. (Here, $M \in \operatorname{conv}(\mathcal{M})$ if there exists a vector $\lambda \in \Delta_d$ such that, for every state-action pair $(s, a)$, $P(\cdot \mid s, a) = \sum_{i=1}^{d} \lambda_i P_i(\cdot \mid s, a)$ and $R(s, a) = \sum_{i=1}^{d} \lambda_i R_i(s, a)$.) We have full knowledge of all MDPs in $\mathcal{M}$ and access to a generative model of each $M_t$; a minimal sketch of this mixture structure is given below.
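The sketch assumes the tabular array shapes indicated in its comments and a single mixing vector $\lambda$ shared by all state-action pairs; it is an illustration of Assumption 1, not the paper's code.

```python
import numpy as np

def mix_models(base_P, base_R, lam):
    """Form the MDP whose transition kernel and reward are the lam-weighted
    convex combination of the base models, with one weight vector shared by
    every state-action pair (the convex-hull structure of Assumption 1).

    base_P: array of shape (d, S, A, S);  base_R: array of shape (d, S, A)
    lam:    array of shape (d,) with lam >= 0 and sum(lam) == 1
    """
    assert np.all(lam >= 0) and np.isclose(lam.sum(), 1.0)
    P = np.tensordot(lam, base_P, axes=1)    # sum_i lam_i * P_i(.|s, a)
    R = np.tensordot(lam, base_R, axes=1)    # sum_i lam_i * R_i(s, a)
    return P, R
```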
We define a set of matrices $\{\Phi_{s,a}\}_{(s,a) \in \mathcal{S}\times\mathcal{A}}$, where the $i$-th column of $\Phi_{s,a}$ is $P_i(\cdot \mid s, a)$. Since $M_t \in \operatorname{conv}(\mathcal{M})$, there exists a vector $\lambda^{t} \in \Delta_d$ such that
$$P_t(\cdot \mid s, a) = \Phi_{s,a} \lambda^{t} \quad \text{for all } (s, a) \in \mathcal{S} \times \mathcal{A}. \tag{26}$$
We define a matrix $\Phi$ by stacking all $\Phi_{s,a}$ vertically, i.e., $\Phi := [\Phi_{s,a}]_{(s,a) \in \mathcal{S}\times\mathcal{A}} \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}||\mathcal{S}| \times d}$.
We make the following assumption about $\Phi$.
Assumption 2.
$\Phi$ has full column rank.
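To see why this assumption is useful, here is a hedged sketch (not the paper's algorithm; the function and variable names are hypothetical): when the stacked base rows at a handful of chosen state-action pairs have full column rank, accurately estimating just those rows of the unknown kernel pins down the mixing vector $\lambda$ by least squares, and with it the entire model, which is the intuition behind the $d$-versus-$|\mathcal{S}|$ sample saving.

```python
import numpy as np

def estimate_lambda(base_P, rows, P_hat_rows):
    """Least-squares recovery of the mixing weights from a few estimated rows.

    base_P:     (d, S, A, S) known base transition kernels
    rows:       list of m chosen state-action pairs (s, a)
    P_hat_rows: (m, S) empirical estimates of P_t(. | s, a) at those pairs
    """
    # Stack, for each chosen (s, a), the d base rows as columns: shape (m*S, d).
    Phi = np.concatenate([base_P[:, s, a, :].T for (s, a) in rows], axis=0)
    y = np.asarray(P_hat_rows).reshape(-1)               # stacked estimated rows
    lam, *_ = np.linalg.lstsq(Phi, y, rcond=None)        # solve Phi @ lam ~= y
    return lam
```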
Since $d$ is much smaller than the number of rows of $\Phi$, the assumption is easy to satisfy in real applications. A direct consequence is the following:
Lemma 4.
There exists a set $\mathcal{D} \subseteq \mathcal{S} \times \mathcal{A}$ such that the matrix formed by stacking all $\Phi_{s,a}$ with $(s,a) \in \mathcal{D}$