1 Introduction

Reinforcement learning (RL) is the framework of learning to control an unknown system through trial and error. Recently, RL has achieved phenomenal empirical successes: e.g., AlphaGo (Silver et al., 2016) defeated the best human player in Go, and OpenAI used RL to precisely and robustly control a robotic arm (Andrychowicz et al., 2017). The RL framework is general enough to capture a broad spectrum of topics, including health care, traffic control, and experimental design (Sutton et al., 1992; Esteva et al., 2019; Si and Wang, 2001; Wiering, 2000; Denil et al., 2016). However, successful applications of RL in these domains are still rare. The major obstacle preventing RL from being widely used is its high sample complexity: both AlphaGo and the OpenAI arm took nearly a thousand years of human-equivalent experience to achieve good performance.

One way to reduce the number of training samples is to mimic how human beings learn: borrow knowledge from previous experiences. In robotics research, a robot may need to accomplish different tasks at different times. Instead of learning every task from scratch, it is preferable for the robot to exploit the similarities between the underlying models of these tasks and adapt to new jobs quickly. Another example is that RL agents are often trained in simulators and then deployed in the real world (Ng et al., 2006; Itsuki, 1995; Dosovitskiy et al., 2017). It is still desirable for their performance to improve after seeing samples collected from the real world, and one might hope that agents trained in simulators (approximate models) can adapt to the real world (true model) faster than agents that know nothing. Both examples lead to a natural question: if models are similar, can we achieve fast adaptation through knowledge transfer?
This paper focuses on answering the above question. Suppose the true unknown model is a Markov Decision Process (MDP) and the RL agent is provided with an approximate model with
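As a point of reference, the following is a minimal formalization of this setup in the standard infinite-horizon discounted setting; the notation here is illustrative and is not taken verbatim from the paper:
\[
M = (\mathcal{S}, \mathcal{A}, P, r, \gamma), \qquad \widehat{M} = (\mathcal{S}, \mathcal{A}, \widehat{P}, \widehat{r}, \gamma),
\]
where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(\cdot \mid s, a)$ is the transition kernel, $r(s, a)$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. The approximate model $\widehat{M}$ shares the state and action spaces of the true model $M$ but may have perturbed dynamics $\widehat{P}$ and rewards $\widehat{r}$; the question is how much access to $\widehat{M}$ can reduce the number of samples needed to learn a near-optimal policy in $M$.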
1.1 Related Work

Reducing sample complexity is a core research goal in RL. Many related sub-branches of RL, e.g., multi-task RL (Brunskill and Li, 2013; Ammar et al., 2014; Calandriello et al., 2014), lifelong RL (Abel et al., 2018; Brunskill and Li, 2014), and meta-RL (Al-Shedivat et al., 2017), provide different schemes for utilizing experiences from previous tasks; see also the survey by Taylor and Stone (2009) for more related work. However, these results focus on special cases of knowledge utilization rather than on the fundamental question of whether an approximate model is useful for policy learning and what guarantees we can have. In the area of Sim-to-Real (short for simulator-to-real-environment), some works point out that an imperfect approximate model may degrade the performance of learning, and efforts have been made to address this issue, e.g., (Kober et al., 2013; Buckman et al., 2018; Kalweit and Boedecker, 2017; Kurutach et al., 2018). Empirical research here is active, but little is known in theory. More closely related is Jiang (2018), who shows that even if the approximate model differs from the real environment in only a single state-action pair (whose identity is unknown), such an approximate model can still be information-theoretically useless. This is another interesting direction to look at; however, the statistical distance from such a model to the true model can be arbitrarily large, and hence the policy of the approximate model carries no guarantee on the true model. The limited benefit that an approximate model can bring is also shown in Jiang and Li (2015),
where the authors build a policy value estimator and use the approximate model to reduce its variance. However, they demonstrate that, if no extra knowledge is provided, only the part of the variance arising from the randomness of the policy can be eliminated, not the part arising from the stochasticity of the state transitions.

In order to take more advantage of previous experiences, additional structural information is needed. A number of structural settings have been studied in the literature. For instance, in Brunskill and Li (2013),
all models are assumed to be drawn from a finite set of MDPs with identical state and action spaces but different rewards and/or transition probabilities; in Abel et al. (2018), one studied case requires that all models share the same transition dynamics and only the reward functions change according to a hidden distribution; in Mann and Choe (2012), a special mapping between the approximate model and the true model is assumed, such that the approximate model can provide a good action-value initialization for the true model; in Calandriello et al. (2014),
all tasks can be accurately represented in a linear approximation space and the weight vectors are jointly sparse; in Modi et al. (2019), every model's transition kernel and reward function lie in the linear span of known base models (see the sketch at the end of this subsection). To complement our lower bound, we study an MDP model that shares a similar information structure to that of Modi et al. (2019). In contrast to Modi et al. (2019),
our model is infinite-horizon and our loss function is different. Although not the main focus of this paper, our proposed model and algorithm provide another effective approach that supports knowledge transfer. It is worth mentioning that structural information such as the existence of a lower-dimensional knowledge-sharing space is not exclusive to RL; it also finds applications in supervised multi-task learning (Kumar and Daume III, 2012; Ruvolo and Eaton, 2013; Maurer et al., 2013).
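Returning to the linear-span structure above: the schematic below uses illustrative notation ($d$ known base models $(P_i, r_i)$ and a task-specific weight vector $w$), which does not necessarily match the exact symbols of Modi et al. (2019). Each task's dynamics and rewards are convex combinations of the base models,
\[
P(\cdot \mid s, a) = \sum_{i=1}^{d} w_i \, P_i(\cdot \mid s, a), \qquad r(s, a) = \sum_{i=1}^{d} w_i \, r_i(s, a), \qquad w \in \Delta^{d},
\]
so that learning a new task reduces to estimating the $d$-dimensional weight vector $w$ rather than the full transition kernel and reward function.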
Notation

We use small letters for scalars, capital letters for vectors, functions, and some specific scalars, capital boldface letters for matrices, and calligraphic letters for sets. The cardinality of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$. We use $[n]$ to represent the set $\{1, 2, \ldots, n\}$. The simplex in $\mathbb{R}^d$ is denoted by $\Delta^d$.
We abbreviate Kullback-Leibler divergence to KL. We use $O$ to denote leading orders in upper bounds and $\Omega$ in lower and minimax lower bounds, and we use $\widetilde{O}$ and $\widetilde{\Omega}$ to hide the polylog factors.
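For completeness, the KL divergence between distributions $P$ and $Q$ on a countable set $\mathcal{X}$ (with $Q(x) = 0$ implying $P(x) = 0$) is
\[
\mathrm{KL}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.
\]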