An increasingly viable alternative to hand-coding the whole behavior of a robot is to let the robot learn by itself how it should act in order to reach a human-defined goal, using Reinforcement Learning (RL) (Sutton2018). However, autonomous robots generally require significant equipment allowing them to sense their surroundings, such as various proximity sensors, and might need their whole environment to be designed so as to ease their navigation and favour smooth operation. Unfortunately, it is often unfeasible: i) to equip robots targeted at a general audience with expensive sensory equipment, and ii) to ensure the best possible conditions in every real-life situation, e.g., by preparing a user's house, or the streets the robot needs to navigate. It is, however, possible to provide a prototypical robot with the necessary sensors, as well as a fitting environment, within the robot designers' lab. Nevertheless, to guarantee good performance outside the lab, it is crucial that the robot learns to get by without the help of sensors and exceptionally good environmental conditions. This particular problem of transitioning from the lab world to the real world falls into the realm of transfer learning.
Transfer learning has the potential to make RL agents much faster at mastering new tasks, by allowing the agent to reuse knowledge acquired in one or several previous tasks. Many different components can vary between the source and target tasks. The state description, for instance, might not be the same in the two tasks; one might be richer and/or easier to learn from than the other (Taylor2009, Section 3.2.1). This dissimilarity naturally emerges when trying to share knowledge across robots equipped with different sensors. In this paper, we are interested in transferring knowledge from one robotic platform to another, both tackling the same task with the same action set, while sensing their environment differently. The transfer is made from the robot whose state description is empirically easier to learn from (the output of 8 proximity sensors) to the robot whose state description is harder to learn from (the output of a single camera). This could allow a cheap, under-equipped robot to perform as well in the field as a more sophisticated one, for which the task is much easier to learn.
Current transfer learning techniques can be sorted into two categories: those that use the transferred knowledge to bias the agent's exploration strategy (Fernandez2006; Taylor2007a; Madden2004; Zhan2015; Plisnier2019), a technique primarily used with Q-Learning-based methods; and those that train the agent to imitate the transferred knowledge, either by initializing the agent with the transferred knowledge (Taylor2007b) or by dynamically teaching the agent (Parisotto2015; Brys2015; Plisnier2019). Except for the Actor-Advisor (Plisnier2019), there has not been much investigation into allowing the agent to both be guided by transferred knowledge and learn to imitate it at the same time. In addition, to our knowledge, transfer between tasks with completely different state spaces constitutes a novel setting.
Our agent, an Epuck robot simulated in V-REP (a software package allowing the simulation of several existing robots in environments that can be customized by algorithm designers: http://www.coppeliarobotics.com/index.html), learns using Bootstrapped Dual Policy Iteration (BDPI; Steckelmacher2019). BDPI is a model-free actor-critic reinforcement learning algorithm for continuous-state, discrete-action settings, with one actor and several off-policy critics. This algorithm allows us to investigate three forms of transfer, inspired by the Actor-Advisor: i) purely via exploration alteration, ii) purely via imitation learning, and iii) a mix of both exploration alteration and imitation learning. We empirically show that our BDPI extension for transfer learning allows a simulated camera-equipped Epuck robot to leverage proximity sensors that are not at its disposal, and learn a much better policy than what was originally possible to achieve with its simple setup.
2.1 Markov Decision Processes
A discrete-time Markov Decision Process (MDP) (Bellman1957) with discrete actions is defined by the tuple $\langle S, A, R, T \rangle$: a possibly-infinite set $S$ of states; a finite set $A$ of actions; a reward function $R(s_t, a_t, s_{t+1}) \in \mathbb{R}$ returning a scalar reward for each state transition; and a transition function $T(s_{t+1} \mid s_t, a_t)$ taking as input a state-action pair and returning a probability distribution over new states.
A stochastic stationary policy $\pi(a \mid s)$ maps each state to a probability distribution over actions. At each time-step, the agent observes $s_t$, selects $a_t \sim \pi(\cdot \mid s_t)$, then observes $r_{t+1} \equiv R(s_t, a_t, s_{t+1})$ and $s_{t+1} \sim T(\cdot \mid s_t, a_t)$. The tuple $(s_t, a_t, r_{t+1}, s_{t+1})$ is called an experience tuple. An optimal policy $\pi^*$ maximizes the expected cumulative discounted reward $\mathbb{E}_{\pi}[\sum_t \gamma^t r_{t+1}]$, where $\gamma \in [0, 1)$ is a discount factor. The goal of the agent is to find $\pi^*$ based on its experiences within the environment.
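To make the return concrete, the discounted sum can be computed by iterating over the rewards backwards. The following minimal Python sketch is illustrative only, not part of any cited implementation:

```python
def discounted_return(rewards, gamma=0.99):
    """Cumulative discounted reward: r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret  # R_t = r_t + gamma * R_{t+1}
    return ret
```

For example, with three rewards of 1 and a discount factor of 0.5, the return is 1 + 0.5 + 0.25 = 1.75.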
2.2 Bootstrapped Dual Policy Iteration
Bootstrapped Dual Policy Iteration (BDPI; Steckelmacher2019) is an actor-critic method, with one actor and $N_c$ critics. The critics are trained using Aggressive Bootstrapped Clipped DQN (Steckelmacher2019), a version of Clipped DQN (Fujimoto18) that performs $N_t$ training iterations per training epoch. Each critic $i$ maintains two Q-functions, $Q^i_A$ and $Q^i_B$. Each training epoch, a batch $E_i$ is sampled for each critic from a shared experience buffer $B$. Then, for each training iteration, every critic swaps its $Q_A$ and $Q_B$ functions, then $Q_A$ is trained using Equation 1 on $E_i$.
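As a hedged sketch of the clipped target underlying this update (the exact BDPI formulation may differ; we follow the Clipped DQN idea of taking the minimum of the two Q-functions at the greedy action of $Q_A$):

```python
import numpy as np

def clipped_dqn_target(r, next_qa, next_qb, gamma=0.99, done=False):
    """Sketch of a Clipped DQN target: r + gamma * min(Q_A, Q_B), evaluated
    at the greedy action of Q_A in the next state. next_qa / next_qb are
    hypothetical arrays of Q-values over actions in s'."""
    if done:
        return r
    a_star = int(np.argmax(next_qa))  # greedy action under Q_A
    return r + gamma * min(next_qa[a_star], next_qb[a_star])
```

Taking the minimum of the two estimates counteracts the over-estimation bias of vanilla Q-Learning targets.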
The actor is trained using a variant of Conservative Policy Iteration (Pirotta2013). Every training epoch, after the critics have been updated a number of times, the actor is trained towards the greedy policies of all its critics. This is achieved by sequentially applying Equation 4 $N_c$ times, each iteration updating the actor based on a different critic $i$:

$\pi(s) \leftarrow (1 - \lambda)\,\pi(s) + \lambda\,\Gamma(Q^i_A(s, \cdot))$   (4)

where $\lambda$ is the actor learning rate and $\Gamma$ maps the Q-values of a critic to its greedy policy. A great asset of BDPI over other state-of-the-art actor-critic methods is its high sample-efficiency, due to the aggressiveness of its off-policy critics.
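The sequential conservative steps towards each critic's greedy policy can be sketched as follows. This is our own illustrative code, not BDPI's literal implementation; the step size `lam` and the final renormalization are assumptions:

```python
import numpy as np

def actor_update(pi, critics_q, lam=0.05):
    """Move the actor distribution pi (over actions in one state) towards the
    greedy policy of each critic in turn (Conservative Policy Iteration style).
    critics_q: list of Q-value arrays, one per critic, for the current state."""
    for q in critics_q:
        greedy = np.zeros_like(pi)
        greedy[np.argmax(q)] = 1.0            # greedy policy of this critic
        pi = (1.0 - lam) * pi + lam * greedy  # conservative step towards it
    return pi / pi.sum()                      # keep a valid distribution
```

Because each step only moves the actor a fraction `lam` towards a greedy policy, a single noisy critic cannot destabilize the actor.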
2.3 Policy Shaping and the Actor-Advisor
Policy Shaping (Kartoun2010; Griffith2013; Macglashan2017; Harrison2018) generally aims at letting an external advisory policy (which we call $\pi_s$ since, in our case, it is learned in the source task) alter or determine the agent's behavior. The specific Policy Shaping formula we consider in this paper is the one suggested by Griffith2013:

$\pi_{mix}(s) \propto \pi(s) \odot \pi_s(s)$   (5)

where $\pi$ is the state-dependent policy learned by the agent, $\pi_s$ is the state-dependent advice, and $\odot$ is the element-wise product, the result being normalized to a probability distribution. The actions executed by the agent in the environment are sampled from this mixture of the agent's current learned policy $\pi$ and the external advisory policy $\pi_s$. Executing actions from this mixture allows the advisor to guide the agent's exploration and potentially improve its performance. This method not only allows the actor to benefit from the advisor's expertise; it also lets the actor eventually outperform its advisor, if the actor's sensors are more informative than the advisor's. This way, the actor's performance is never bounded by its advisor's, and the advisor does not need complete knowledge of the task to be solved.
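For discrete actions, this mixture is straightforward to implement; a minimal sketch:

```python
import numpy as np

def mix_policies(pi, pi_s):
    """Policy Shaping mixture (Griffith et al. style): element-wise product of
    the agent's policy and the advice, renormalized to a distribution."""
    mixed = pi * pi_s
    return mixed / mixed.sum()
```

Note that when the advice is uniform, the mixture reduces to the agent's own policy, so an uninformed advisor does no harm.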
The Actor-Advisor (Plisnier2019) applies this technique to a variety of RL sub-domains, namely learning from a human teacher, learning under a safe backup policy, and transferring a previously learned policy. The Actor-Advisor assumes a Policy Gradient (Sutton2000) actor, learning a parametric policy $\pi_\theta$. At acting time, actions are sampled from the mixed policy $\pi_{mix}$, obtained with the above-mentioned Policy Shaping formula in Equation 5 (Griffith2013). The mixed policy is integrated in the standard Policy Gradient loss (Sutton2000) used to train the agent:

$\mathcal{L}(\pi_\theta) = -\sum_t \log\big(\pi_{mix}(a_t \mid s_t, \pi_s(s_t))\big)\, R_t$

with $\pi_{mix}(a_t \mid s_t, \pi_s(s_t))$ the probability to execute action $a_t$ at time $t$, given as input the state $s_t$ and some state-dependent advice $\pi_s(s_t)$, and the return $R_t = \sum_{\tau \geq t} \gamma^{\tau - t} r_{\tau+1}$, with $\gamma \in [0, 1)$, a simple discounted sum of future rewards. Computing the gradient based on the mixed policy $\pi_{mix}$ rather than on $\pi_\theta$ only potentially allows the actor's learning to be influenced by the advisor, in addition to being guided during exploration.
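Assuming the per-timestep mixed probabilities of the executed actions and the corresponding returns have already been collected, the loss itself is a one-liner (an illustrative sketch, not the authors' code):

```python
import numpy as np

def pg_loss(mixed_probs_taken, returns):
    """Policy Gradient loss on the *mixed* policy:
    L = -sum_t log(pi_mix(a_t | s_t, advice)) * R_t.
    mixed_probs_taken[t] is the mixed probability of the action taken at t."""
    return -np.sum(np.log(mixed_probs_taken) * returns)
```

Minimizing this loss increases the (mixed) probability of actions that led to high returns, which is how the advisor's preferences leak into the actor's parameters.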
3 Transfer Learning
Transferring knowledge in Reinforcement Learning potentially improves sample-efficiency, as it allows an agent to exploit relevant past knowledge while tackling a new task, instead of learning it from scratch (Taylor2009). Usually, the valuable knowledge to be transferred in Reinforcement Learning is considered to be the actual output of a reinforcement learner: an action-value function or a policy $\pi_s$ (Brys2016, p. 34). Some work also considers reusing skills, or options (Sutton1999), as a transfer of knowledge across tasks (Andre2002; Ravindran2003; Konidaris2007). In this section, we sort previous work into categories related to the way $\pi_s$ is transferred into the agent, and look at what is allowed to differ between the source task and the target task. The two predominant ways in which knowledge can be transferred are: i) $\pi_s$ serves as a guide during exploration; ii) $\pi_s$ is used to train or initialize the agent, so that the agent actively imitates $\pi_s$.
A fast and effective way of transferring a policy is to leverage it in the agent's exploration strategy. Altering the exploration strategy is a popular technique in the safe RL domain, and consists in biasing or determining the actions taken by the agent at action-selection time (Garcia2015). Such exploration requires the agent to be able to learn from off-policy experiences. The motivation behind guided exploration is the poor performance of a fresh agent at the beginning of learning, in addition to the presence of obstacles that are difficult to overcome in the environment. An exploration guided by a smarter external policy, such as $\pi_s$, could improve the agent's early performance, as well as help it, in the long run, explore fruitful areas.
Some existing work applies guided exploration to transfer learning (Fernandez2006; Taylor2007b; Madden2004), and illustrates how this technique allows the agent to outperform $\pi_s$. Regarding the components of the source and target tasks that are allowed to differ, Fernandez2006 considers different goal placements (hence, different reward functions), Madden2004 uses symbolically learned knowledge to tackle states that are seen by the agent for the first time, and Taylor2007b assumes similar state variables and actions, but a different reward function. The translation functions required to map a state/action in one task to a state/action in another are assumed to be provided.
Although an improved exploration might result in improved performance and a jump-start, an agent whose actions are simply overridden by an external policy does not actively learn to imitate it. Other techniques have proposed to "teach" the agent to perform the target task (instead of merely guiding it), either by dynamic teaching or by straightforward initialization. Imitation learning aims to allow a student agent to learn the policy of a demonstrator, out of data that the demonstrator has generated (Hussein2017). Similarly, policy distillation (Bucila2006) can be applied to RL to train a fresh agent with one or several expert policies, resulting in one smaller, potentially multi-task RL agent (Rusu2015).
Imitation learning and policy distillation are somewhat related to transfer learning (Hussein2017, p. 24), although imitation and distillation assume that the source and target tasks are the same, while transfer does not. The Actor-Mimic (Parisotto2015) uses several DQN policies (each expert in a different source task) to train a multi-task student network, by minimizing the cross-entropy loss between the student and experts' policies. To perform transfer, the resulting multi-task expert policy is used to initialize yet another DQN network, which learns the target task. The Actor-Mimic assumes that the source and target tasks share the same observation and action spaces, with different reward and transition functions. In Q-value reuse (Taylor2007a, Section 5.5), a Q-Learner uses the source task's action-value function to kickstart its learning of the target task, while also learning a new action-value function to compensate for the source knowledge that is irrelevant. In Taylor2007a, the agent learns to play Keepaway games, and is introduced to a game with more players, resulting in more actions and state variables. Brys2015 transfers $\pi_s$ to a Q-Learning agent through reward shaping; the differences in the action and state spaces between source and target tasks are resolved using a provided translation function.
The Actor-Advisor (Plisnier2019) tries to get the best of both the exploration-alteration and teaching worlds. It mixes $\pi_s$ with a Policy Gradient actor's policy at action-selection time (hence biasing the exploration strategy), using the policy mixing formula of Griffith2013; the mixed policy is also integrated in the actor's loss. This way, the learning process is influenced by $\pi_s$, while $\pi_s$ also guides the agent's exploration. In the transfer task in Plisnier2019, the doors of a maze are shifted, resulting in a change in the dynamics of the environment.
4 BDPI with Transfer Learning
We explore three transfer learning approaches that can take place while using BDPI as the learning algorithm. The first method consists in transferring $\pi_s$ purely via the agent's exploration strategy at action-selection time, for which we reuse the Policy Shaping formula by Griffith2013 (Section 2.3, Equation 5). Therefore, at acting time, the agent executes an action sampled from the mixture of the BDPI agent's policy with $\pi_s$. In the rest of this paper, we refer to this transfer learning approach as "transfer at acting time".
The second method to induce transfer learning is to allow the agent to actively learn to imitate $\pi_s$. This is achieved by including the policy mixing in the BDPI actor's update rule (Section 2.2); we refer to this approach as "transfer at learning time":

$\pi(s) \leftarrow (1 - \lambda - \lambda_s)\,\pi(s) + \lambda\,\Gamma(Q^i_A(s, \cdot)) + \lambda_s\,\pi_s(s)$

$\lambda$ is always set to 0.05, and $\lambda_s = \alpha\lambda$, where $\alpha$ denotes a fixed transfer learning parameter. When $\alpha = 0$, no transfer at learning time occurs, and this amounts to applying the original BDPI actor update (Equation 4). When evaluating transfer at learning time in our experiments, we generally fix $\alpha$ so that $\lambda_s < \lambda$, allowing the critics' greedy policy to have a more important influence on the actor than $\pi_s$. Still, even with a small $\lambda_s$, our empirical results show that the transfer of $\pi_s$ has a non-negligible impact on the agent's learning.
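Transfer at learning time can be sketched by extending the conservative actor step with an extra step towards the advice. The weights `lam` and `alpha`, and the exact way the two steps combine, are our illustrative assumptions rather than the paper's literal update:

```python
import numpy as np

def actor_update_with_transfer(pi, critics_q, pi_s_state, lam=0.05, alpha=0.2):
    """Sketch of transfer at learning time: alongside each conservative step
    towards a critic's greedy policy (weight lam), the actor also takes a
    smaller step (weight lam_s = alpha * lam) towards the advice pi_s."""
    lam_s = alpha * lam
    for q in critics_q:
        greedy = np.zeros_like(pi)
        greedy[np.argmax(q)] = 1.0
        pi = (1.0 - lam - lam_s) * pi + lam * greedy + lam_s * pi_s_state
    return pi / pi.sum()
```

Setting `alpha` to 0 recovers the plain BDPI-style actor update, which is convenient for ablations.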
In our setting, an Epuck robot first learns $\pi_s$, the optimal policy to navigate in a room while observing the output of several proximity sensors. $\pi_s$ is then transferred to another Epuck robot, learning the same task while observing its environment through a single camera. Based on the transfer learning techniques allowed by BDPI (Section 4), we compare three approaches to transfer learning: i) performing the transfer purely at acting time, ii) performing the transfer purely at learning time, and iii) performing the transfer both at acting and learning time. We now detail the environment in which our experiments take place.
Our agent is embodied by a simulated Epuck robot in V-REP (Figure 1). It has 2 actuators, one for its left motor and one for its right motor; 8 proximity sensors, spread around the robot; and one camera on the front. Two environments emerge from this setting: one in which the robot observes the scene through its 8 proximity sensors, and one in which it observes the scene through its camera. In both environments, states are continuous. There are 5 discrete actions: staying still, accelerating the left motor, accelerating the right one, accelerating both, or decelerating both. The robot is in a square room made of four one-meter-long walls, with an off-center pillar. An episode starts with the robot appearing at a random position in the room. Even when the agent only observes camera images, the proximity sensors are used to evaluate how close the robot is to the walls; if its distance to a wall is smaller than 10 cm, it receives a -1 reward. The only positively rewarded action is accelerating both motors (i.e., going forward); if the agent chooses this action, it receives +1. An episode ends after 500 timesteps. The optimal policy is to move in circles around the room without hitting the walls or the pillar.
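The reward logic above can be sketched as follows. The action index `FORWARD` is hypothetical, and whether the wall penalty and the forward bonus combine additively is our assumption from the description:

```python
def reward(action, min_wall_distance):
    """Sketch of the reward signal: -1 when closer than 10 cm to a wall
    (measured with the proximity sensors), +1 for accelerating both motors
    (going forward). Distances are in meters."""
    FORWARD = 3  # hypothetical index of the "accelerate both motors" action
    r = 0
    if min_wall_distance < 0.10:
        r -= 1
    if action == FORWARD:
        r += 1
    return r
```

Under this reward, circling the room at a safe distance from walls and pillar yields +1 per timestep, matching the optimal policy described above.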
Learning the task merely out of camera images, on the other hand, is much more challenging (nearly impossible, even) for state-of-the-art RL algorithms than learning it from the 8 proximity sensors. Hence, the idea is to first learn $\pi_s$, a near-optimal policy that BDPI learns out of the 8 proximity sensors in a small number of episodes, and then to transfer $\pi_s$ to the agent learning out of camera images. For the transfer of $\pi_s$ to occur, either at acting or learning time, $\pi_s$ must still be fed with the output of the 8 proximity sensors. This is achieved by allowing the step() function of our environment to return this output via the "info" dictionary; hence, our environment returns both the camera-based observation and the proximity-sensor-based observation to the agent. Therefore, no translation function (Taylor2007a) between state representations is required. Moreover, this implementation remains aligned with our case scenario, in which robots learn in a lab environment with all necessary sensor inputs constantly available to them.
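A Gym-style step() returning both observations can be sketched as follows; the class, the simulator handle, and its method names are hypothetical, chosen only to illustrate the info-dictionary mechanism:

```python
class DualObservationEnv:
    """Sketch of an environment whose state is the camera observation, while
    the proximity-sensor observation is exposed through the `info` dictionary
    so that the advisor pi_s can still be queried at every step."""

    def __init__(self, sim):
        self.sim = sim  # hypothetical simulator handle

    def step(self, action):
        obs_camera, reward, done = self.sim.apply(action)
        info = {"proximity_sensors": self.sim.read_proximity_sensors()}
        return obs_camera, reward, done, info
```

The learner only ever consumes `obs_camera`; the advisor reads `info["proximity_sensors"]`, so no translation function between the two state representations is needed.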
We parametrize BDPI with $N_c$ critics, all trained at every timestep on a fresh 256-experience replay minibatch, by applying Equation 1 once ($N_t = 1$). The experience minibatches are sampled from a shared buffer of 50000 experiences. The critics' learning rate in Equation 1 is set to 0.2.
To first learn $\pi_s$ based on the proximity sensors (without any transfer), the actor learning rate $\lambda$ in Equation 4 is set to 0.05. BDPI's neural networks are trained for 20 epochs per training iteration, on the mean-squared-error loss. When learning while observing camera images, the number of training epochs is reduced to 1. The policy network has one hidden layer of 100 neurons, with a learning rate set separately for learning out of the 8 proximity sensors and for learning out of camera images.
In the next section, we evaluate three ways to transfer $\pi_s$ to the BDPI agent while it learns using only its camera. When transferring solely at acting time, $\lambda_s$ is set to 0; when transferring solely at learning time, $\lambda_s$ is set to a positive value and the policy mixing at acting time is disabled; when transfer occurs both at acting and learning time, both mechanisms are active.
We performed three runs for each of the 5 following settings: i) learning out of proximity sensors, no transfer; ii) learning out of camera images, no transfer; iii) learning out of camera images, with transfer solely at acting time; iv) learning out of camera images, with transfer solely at learning time; v) learning out of camera images, with transfer both at acting and learning time.
Transferring solely at acting time, hence strongly altering the agent's exploration strategy, leads to a significant improvement of the agent's performance while learning from camera images (Figure 2). Compared to the other two approaches, it is also the most effective one. The lesser performance of the mixed approach, compared to transfer at acting time only, suggests that allowing $\pi_s$ to also influence the agent's learning rule can actually be detrimental.
When deploying physical reinforcement learning agents in the field, it is not always possible to ensure the optimal equipment and learning conditions that are available in a lab environment. Hence, it is desirable to prepare the agent while in the lab, where all necessary equipment is still available. This way, the agent can learn to get by without the particular equipment that will no longer be present once it is deployed in the real world. In this paper, we propose to use transfer learning to perform this transition from the lab world to the real world. A robot that quickly learns the task, thanks to its many expensive sensors that are easy to learn from, helps a less well-equipped robot that senses its environment through raw images. Once the camera-equipped robot has learned the task while being advised by the sensor-equipped one, it performs much better than it would have without transfer, as if it were exploiting expensive sensors that it does not have, and is ready for deployment. To achieve this transfer of policy, we extended BDPI, which allows for three different forms of transfer. Our experiments, simulated in V-REP, show that our method greatly improves the sample-efficiency of an Epuck robot sensing its environment through a single camera, which remains a highly challenging problem for state-of-the-art reinforcement learning algorithms.