## 1 Introduction

Reinforcement learning (RL) is the framework of learning to control an unknown system through trial and error. Recently, RL has achieved phenomenal empirical successes, e.g., AlphaGo (Silver et al., 2016) defeated the best human player in Go, and OpenAI used RL to precisely and robustly control a robotic arm (Andrychowicz et al., 2017). The RL framework is general enough to capture a broad spectrum of topics, including health care, traffic control, and experimental design (Sutton et al., 1992; Esteva et al., 2019; Si and Wang, 2001; Wiering, 2000; Denil et al., 2016). However, *successful* applications of RL in these domains are still rare. The major obstacle that prevents RL from being widely used is its high sample complexity: both AlphaGo and the OpenAI arm took nearly a thousand years of human-equivalent experience to achieve good performance.

One way to reduce the number of training samples is to mimic how human beings learn – borrowing knowledge from previous experiences. In robotics research, a robot may need to accomplish different tasks at different times. Instead of learning every task from scratch, it is more desirable for the robot to utilize the similarities between the underlying models of these tasks and adapt to future new jobs quickly. Another example is that RL agents are often trained in simulators and then applied to the real world (Ng et al., 2006; Itsuki, 1995; Dosovitskiy et al., 2017). It is still desirable to have their performance improve after seeing samples collected from the real world, and one might hope that agents trained in simulators (approximate models) can adapt to the real world (true model) faster than agents knowing nothing. Both examples lead to a natural question: if models are similar, can we achieve fast adaptation through knowledge transfer?

This paper focuses on answering the above question. Suppose the true unknown model is a Markov Decision Process (MDP) $M$ and the RL agent is provided with an approximate model $\bar{M}$ with $d(M, \bar{M}) \le \beta$ for some metric $d$. We consider the case where the target error^{1} $\epsilon$ of the learned policy is at most of the order of $\beta$ (i.e., the *high-precision* regime).

^{1} The error of a policy is the difference between the values of the policy and the optimal policy.

For a fixed $\epsilon$, a common wisdom would suggest that a better prior model (e.g., a smaller $\beta$) can help reduce the sample complexity. The most natural choice of $d$ is the total-variation (TV) distance between the transition kernels of $M$ and $\bar{M}$. It is well-known (see, e.g., Puterman 2014) that an optimal policy for $\bar{M}$ has an error at most of order $\beta$ in $M$, where the hidden constant is determined by the model. In this paper we show, however, that to obtain an $\epsilon$-optimal policy (the formal definition is given in Sec. 1.2), the number of samples required is of a form

*independent* of $\beta$. In particular, the complexity does not improve as $\beta$ becomes smaller (as long as $\epsilon$ remains in the high-precision regime). This renders the knowledge from a TV-distance ball around $\bar{M}$

*useless* when pursuing high-precision control without further structural information about the model. To establish the lower bound, we leverage techniques for proving hardness in the bandit literature (e.g., Mannor and Tsitsiklis 2004) and in reinforcement learning (e.g., Azar et al. 2013) to carefully show that the approximate model does not provide

*critical information* that matters for high-precision control of the true model. Therefore, learning a high-precision policy does not benefit from the approximate model. To complement the lower bound, we further investigate the possible structural information of a model that provably helps knowledge transfer. We show that if the unknown model is in the convex hull of a set of known base models, we are able to obtain high-precision control with a number of samples significantly fewer than that of learning from scratch. Specifically, the number of samples needed scales with the number of base models.

### 1.1 Related Work

Reducing sample complexity is a core research goal in RL. Many related sub-branches of RL, e.g., multi-task RL (Brunskill and Li, 2013; Ammar et al., 2014; Calandriello et al., 2014), lifelong RL (Abel et al., 2018; Brunskill and Li, 2014), and meta-RL (Al-Shedivat et al., 2017), provide different schemes to utilize experiences from previous tasks. See also the survey (Taylor and Stone, 2009) for more related work. However, these results focus on different special cases of knowledge utilization rather than understanding the fundamental question of whether an approximate model is useful for policy learning and what guarantees we can have. In the area of Sim-to-Real,^{2}

^{2} It stands for simulator-to-real-environment.

some works point out that an imperfect approximate model may degrade the performance of learning, and efforts have been made to address this issue, e.g., (Kober et al., 2013; Buckman et al., 2018; Kalweit and Boedecker, 2017; Kurutach et al., 2018). There has been active empirical research, but little is known in theory. A more closely related work is Jiang 2018, which shows that even if the approximate model differs from the real environment in a single state-action pair (but which one is unknown), such an approximate model could still be information-theoretically useless. This is another interesting direction to look at; however, the statistical distance from such a model to the true model can be arbitrarily large, and hence the policy of the approximate model does not have a guarantee on the true model. The limitation of the benefit that an approximate model could bring can also be found in Jiang and Li 2015, where the authors build a policy value estimator and use the approximate model to reduce variance. However, they demonstrate that, if no extra knowledge is provided, only the part of the variance arising from the randomness in the policy can be eliminated, not the stochasticity in state transitions.

In order to take more advantage of previous experiences, additional structural information is needed. A number of structured settings have been studied in the literature. For instance, in Brunskill and Li 2013, all models are assumed to be drawn from a finite set of MDPs with identical state and action spaces, but different reward and/or transition probabilities; in

Abel et al. 2018, one studied case requires that all models share the same transition dynamics and only the reward functions change, following a hidden distribution; in Mann and Choe 2012, a special mapping between the approximate model and the true model is assumed such that the approximate model can provide a good action-value initialization for the true model; in Calandriello et al. 2014, all tasks can be accurately represented in a linear approximation space and the weight vectors are jointly sparse; in

Modi et al. 2019, every model’s transition kernel and reward function lie in the linear span of known base models. To complement our lower bound, we study an MDP model that shares a similar information structure to that in Modi et al. 2019. In contrast to Modi et al. 2019, our model is infinite-horizon and the loss function is also different. Although not the main focus of this paper, our proposed model and algorithm provide another effective approach that supports knowledge transfer.

It is worth mentioning that structural information such as the existence of a lower-dimensional knowledge-sharing space is not exclusive to RL. One can also find its applications in supervised multi-task learning, e.g., Kumar and Daume III 2012; Ruvolo and Eaton 2013; Maurer et al. 2013.

### 1.2 Preliminaries

#### Notation

We use small letters for scalars, capital letters for vectors, functions, and certain specific scalars, capital boldface letters for matrices, and calligraphic letters for sets. The cardinality of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$. We use $[n]$ to represent the set $\{1, 2, \dots, n\}$. The simplex in $\mathbb{R}^{d}$ is denoted by $\Delta_{d}$. We abbreviate Kullback-Leibler divergence to KL. We use standard asymptotic notation for leading orders in upper, lower, and minimax lower bounds, and the tilde-decorated variants (e.g., $\tilde{O}$, $\tilde{\Omega}$) to hide polylog factors.

#### Markov Decision Process

In this paper, we focus on the discounted Markov Decision Process (MDP) with an infinite horizon, while the same analysis straightforwardly extends to other settings of MDP. We use $M$ to represent an MDP. Each MDP is described as a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a finite action space, $P$ is a $(|\mathcal{S}||\mathcal{A}|) \times |\mathcal{S}|$ matrix with each row being a state transition distribution, $r$ is the reward function, and $\gamma \in (0, 1)$ is a discount factor. We denote by $P_{(s,a),s'}$ the $((s,a), s')$th entry of $P$ and by $P_{(s,a)}$ the $(s,a)$th row of $P$. We write $|\mathcal{S}||\mathcal{A}|$ for the total number of state-action pairs. At each time step, the controller observes a state $s$ and selects an action $a = \pi(s)$ according to a *policy* $\pi$, where $\pi$ maps a state to an action. The action transitions the environment to a new state $s'$ with probability $P(s' \mid s, a)$. Meanwhile, the controller receives an instant reward $r(s, a)$. Given a policy $\pi$, we define its value function as

$$V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r\big(s_t, \pi(s_t)\big) \,\Big|\, s_0 = s\Big].$$

We define an *$\epsilon$-optimal policy* for $M$ as a policy $\pi$ such that $V^{\pi}(s) \ge V^{*}(s) - \epsilon$ for every state $s$, where $V^{*}$ is the value function of an optimal policy. We define the *action-value function* (or *Q-function*) for a policy $\pi$ as $Q^{\pi}(s, a) = r(s, a) + \gamma \langle P_{(s,a)}, V^{\pi} \rangle$. Specifically, $Q^{*}$ denotes the Q-function of an optimal policy.
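The definitions above can be checked numerically. The following sketch (a hypothetical toy example; the states, rewards, and discount factor are arbitrary choices, not from the paper) evaluates $V^{\pi}$ by iterating the Bellman expectation operator until convergence:

```python
# Policy evaluation: V(s) = r(s) + gamma * sum_s' P(s'|s) V(s')
# Hypothetical 2-state MDP with a single action per state (so the policy is trivial).
P = {0: [0.9, 0.1],   # transition distribution from state 0
     1: [0.2, 0.8]}   # transition distribution from state 1
r = {0: 1.0, 1: 0.0}  # instant rewards
gamma = 0.9

V = [0.0, 0.0]
for _ in range(2000):  # the Bellman operator is a gamma-contraction, so this converges
    V = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(2)) for s in range(2)]

# Q(s, a) = r(s, a) + gamma * <P_(s,a), V>; here each state has one action
Q = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(2)) for s in range(2)]
print(V, Q)
```

At the fixed point, one more Bellman step leaves the values unchanged, which is exactly the relation $Q^{\pi}(s, a) = r(s, a) + \gamma \langle P_{(s,a)}, V^{\pi} \rangle$.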

#### TV-distance for MDPs

To measure the closeness of two MDPs $M_1$ and $M_2$ (with common state and action spaces), we introduce the metric $d(M_1, M_2) := \max_{(s,a)} d_{TV}\big(P^{1}_{(s,a)}, P^{2}_{(s,a)}\big)$, the largest *total variation* distance between the transition distributions of $M_1$ and $M_2$ over all state-action pairs. We write $M_1 \in B(M_2, \beta)$ if $d(M_1, M_2) \le \beta$ for some $\beta \ge 0$.
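As a concrete illustration (a hypothetical sketch; the two kernels below are arbitrary), the metric takes the largest per-row total-variation distance between the two transition matrices:

```python
# d(M1, M2) = max over state-action pairs of d_TV(P1_(s,a), P2_(s,a)),
# where d_TV(p, q) = 0.5 * sum_x |p(x) - q(x)| for discrete distributions.
def tv(p, q):
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical kernels: one row per state-action pair, one column per next state.
P1 = [[0.5, 0.5, 0.0], [0.1, 0.8, 0.1]]
P2 = [[0.4, 0.6, 0.0], [0.1, 0.7, 0.2]]

d = max(tv(r1, r2) for r1, r2 in zip(P1, P2))
print(d)  # the largest per-row TV distance
```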

#### Generative Model

Given an MDP, we define a special sample oracle – a generative model. A generative model takes any state-action pair $(s, a)$ as input and outputs a next state $s'$ with probability $P(s' \mid s, a)$.

## 2 Problem Formulation and Illustration

We formalize our setting and the considered problem of knowledge transfer as below.

###### Problem 1.

Suppose the unknown true model is an MDP $M$ and an agent is provided with the full information of a prior model $\bar{M}$ satisfying $d(M, \bar{M}) \le \beta$, where $\beta$ is a known constant. How many samples does it take to learn an $\epsilon$-optimal policy for $M$, where $\epsilon$ is an accuracy parameter?

### 2.1 A Basic Case

We illustrate our point with a simple case. Define two MDPs $M$ and $\bar{M}$ as shown in Figure 1. Both of them have 5 states $s_0, s_1, s_2, s_3, s_4$, where $s_0$ has two actions $a_1$ and $a_2$ and the rest are all single-action states. After taking $a_i$ ($i = 1, 2$), $s_0$ transitions to $s_i$ deterministically. In $\bar{M}$, $s_1$ and $s_2$ transition to themselves with probabilities 0.5 and 0.4, respectively, and to $s_3$ and $s_4$ with probabilities 0.5 and 0.6, respectively. In $M$, each $s_i$ ($i = 1, 2$) transitions to itself with probability $p_i$ and to $s_{i+2}$ with probability $1 - p_i$, where $p_1$ and $p_2$ are unknown. In both models, $s_3$ and $s_4$ are absorbing states. They also have the same reward function: $r(s) = 1$ if $s \in \{s_1, s_2\}$ and $r(s) = 0$ otherwise. Without loss of generality, we fix the remaining parameters (the reader can easily generate similar examples for other parameter values following the same principle). Since $d(M, \bar{M}) \le \beta$, we have that $p_1 \in [0.5 - \beta, 0.5 + \beta]$ and $p_2 \in [0.4 - \beta, 0.4 + \beta]$. In $M$, if $p_1 > p_2$, the optimal policy takes $a_1$ at $s_0$; a policy that instead returns $a_2$ at $s_0$ has a value at $s_0$ lower by an amount that scales with $p_1 - p_2$, and is thus still $\epsilon$-optimal when $p_1 - p_2$ is small. When $p_1 - p_2$ is large relative to $\epsilon$, to produce an $\epsilon$-optimal policy, an algorithm must find out that $p_1 > p_2$ with high probability; for $p_2 > p_1$, vice versa. Therefore, the problem of learning an $\epsilon$-optimal policy is equivalent to identifying the larger value from $\{p_1, p_2\}$. The knowledge we can use from $\bar{M}$ is that $p_1 \in [0.5 - \beta, 0.5 + \beta]$ and $p_2 \in [0.4 - \beta, 0.4 + \beta]$, which does not help reduce the sample complexity once the two intervals overlap (i.e., once $\beta \ge 0.05$).

### 2.2 Empirical Verification

Besides the previous simple case, we also provide a numerical demonstration on a sailing problem (Vanderbei, 1996). In Figure 2, we generate two MDPs $M$ and $\bar{M}$ with a small TV distance between them. We compare the performances of two algorithms: 1. direct Q-learning (Watkins and Dayan, 1992) with transition samples from $M$ (blue line); 2. using the full knowledge of $\bar{M}$ to generate a nearly optimal Q-function for $\bar{M}$, and then using that Q-function to initialize the subsequent Q-learning algorithm with transition samples from $M$ (red line). Both algorithms use the same batch of transition samples from $M$. Since $\bar{M}$ is close to $M$, the warm-start Q-learning is much better than the learning-from-scratch counterpart in the initial stage. However, the two curves overlap as they approach the optimal value, indicating similar sample complexities for both algorithms when pursuing a high-precision Q-value estimation.

## 3 Lower Bound of Transfer Learning from a TV-distance Ball

In this section, we formally prove that an approximate model does not help when learning a high-precision policy. In particular, we show the following lower bound.

###### Theorem 1.

(Main Result) Let $M$ be an unknown MDP. Suppose MDP $\bar{M}$ is given and it satisfies $d(M, \bar{M}) \le \beta$. There exists a threshold $\epsilon_0$ such that for all $\epsilon \in (0, \epsilon_0)$ and $\delta \in (0, 1)$, the sample complexity of learning an $\epsilon$-optimal policy for $M$ with probability at least $1 - \delta$ is independent of $\beta$. In other words, learning *with* prior knowledge is at least as hard as learning

*without* prior knowledge, if we only know that the true model lies in a small TV-distance ball around the approximate model. As any online algorithm can be applied in the generative model case, the lower bound automatically adapts to the online setting as well. Before starting the proof, we give the following definition of the correctness of RL algorithms.

###### Definition 1.

($(\epsilon, \delta)$-correctness) Given $\epsilon, \delta$ and a prior model $\bar{M}$, we say that an RL algorithm $\mathcal{A}$ is $(\epsilon, \delta)$-correct if, for every admissible true model $M$, $\mathcal{A}$ can output an $\epsilon$-optimal policy with probability at least $1 - \delta$.

#### Construction of the Hard Case

We define a family of MDPs. These MDPs have the structure depicted in Figure 3. The state space consists of three disjoint subsets (gray nodes, green nodes, and blue nodes). The gray set includes multiple states, each with the same set of available actions; states in the green and blue sets are all single-action. For a gray state, taking an action transitions it to a green state with probability 1, and this mapping from gray state-action pairs to green states is one-to-one. Each green state transitions to itself with some probability $p$ and to a corresponding blue state with probability $1 - p$; the value of $p$ can differ across green states and across models. All blue states are absorbing. The reward function is 1 at the green states and 0 otherwise. This family is a generalization of a multi-armed bandit problem used in Mannor and Tsitsiklis 2004 to prove a lower bound on bandit learning; a similar example is also used in Azar et al. 2013 to prove a lower bound on reinforcement learning without any prior knowledge. An MDP in this family is fully determined by the set of self-loop probabilities of its green states, and these probabilities in turn determine its Q-function.

#### Prior Model and Hypotheses of the True Model

Now, we select a *prior* model $\bar{M}$ and a set of *hypotheses* for the true model $M$. Every hypothesis gives a probability measure over the same sample space. We denote by $\mathbb{E}_i$ and $\mathbb{P}_i$ the expectation and probability under hypothesis $i$, respectively. These probability measures capture both the randomness in the corresponding MDP and the randomization carried out by the algorithm $\mathcal{A}$, for example its sampling strategy. It is worth mentioning that in Azar et al. 2013, the authors implicitly assume that the numbers of samples drawn for different states are determined before the start of the algorithm and do not change during learning (this is due to their *conditional independence* argument in Lemma 18). Such an assumption does not apply to adaptive sampling strategies; our result covers adaptive sampling. In the sequel, we fix two hypotheses whose parameters will be determined later.

###### Lemma 1.

For any , if , .

###### Proof.

###### Lemma 2.

For any , if , .

###### Proof.

When , under hypothesis , . By definition, the instant rewards from the green state are i.i.d. Bernoulli random variables. By the Chernoff-Hoeffding bound, we have that

###### Lemma 3.

Let . For any , when , if , then .

###### Proof.

Given and , we denote by the length- random sequence of instant rewards obtained by calling the generative model times with the input state. As one can see, under one hypothesis this is an i.i.d. Bernoulli sequence with one parameter; under the other, it is an i.i.d. Bernoulli sequence with the other parameter. We define the corresponding likelihood function accordingly.

## 4 A Case Study for Knowledge Transfer in Reinforcement Learning

In this section, we impose a new assumption on *similarities* among models under which transferring knowledge achieves fast adaptation. We consider a sequence of MDPs that share the same state space, action space, and discount factor, but have different transition dynamics and/or reward functions. At each time step $t$, we want to learn an $\epsilon$-optimal policy for the current MDP. The assumption we propose is a convex hull structure, as stated below.

###### Assumption 1.

Given a finite set of MDPs $\mathcal{D} = \{M_1, \dots, M_d\}$, where all $M_i$ share the same state space, action space, and discount factor, we have^{3} $M_t \in \mathrm{conv}(\mathcal{D})$ for all $t$. We have full knowledge of all MDPs in $\mathcal{D}$ and access to a generative model of each $M_t$.

^{3} $M \in \mathrm{conv}(\mathcal{D})$ if there exists a vector $\lambda \in \Delta_{d}$ such that, for any state-action pair $(s, a)$, $P_{(s,a)} = \sum_{i=1}^{d} \lambda_i P^{i}_{(s,a)}$ and $r(s, a) = \sum_{i=1}^{d} \lambda_i r_i(s, a)$.
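Under this convex hull assumption, learning the transition kernel reduces to estimating the low-dimensional mixing weights. The following minimal sketch (hypothetical: two base kernels and an exactly-mixed target, so the weight vector is one-dimensional) recovers the mixing weight by a simple grid search; it is an illustration of the structure, not the paper's algorithm:

```python
# Assumption: P = sum_i lambda_i * P_i with lambda on the simplex.
# Hypothetical sketch with d = 2 base kernels, so lambda = (w, 1 - w), w in [0, 1].
P1 = [[0.9, 0.1], [0.3, 0.7]]  # transition rows of base model M_1
P2 = [[0.5, 0.5], [0.6, 0.4]]  # transition rows of base model M_2

def mix(w):
    """Convex combination w * P1 + (1 - w) * P2, row by row."""
    return [[w * a + (1 - w) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(P1, P2)]

P_true = mix(0.25)  # the true model lies in the convex hull of {M_1, M_2}

def err(w):
    """Worst-row L1 error between the candidate mixture and the true kernel."""
    return max(sum(abs(x - y) for x, y in zip(r, t))
               for r, t in zip(mix(w), P_true))

# Grid search over the 1-D simplex recovers the mixing weight.
w_hat = min((i / 1000 for i in range(1001)), key=err)
print(w_hat)
```

In practice the rows of `P_true` would be replaced by empirical estimates from the generative model, and the number of samples needed is governed by the number of base models $d$ rather than the size of the state-action space.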