Trajectory-based Learning for Ball-in-Maze Games

11/28/2018 ∙ by Sujoy Paul, et al. ∙ MERL ∙ University of California, Riverside

Deep Reinforcement Learning (RL) has shown tremendous success in solving several games and robotics tasks. However, unlike humans, it generally requires a large number of training instances. Trajectories that demonstrate how to solve the task at hand can help increase the sample efficiency of deep RL methods. In this paper, we present a simple approach to using such trajectories, applied to the challenging Ball-in-Maze Games recently introduced in the literature. We show that, despite not using human-generated trajectories and using only the simulator as a model to generate a limited number of trajectories, we obtain a speed-up of about 2-3x in the learning process. We also discuss some challenges we observed while using trajectory-based learning with very sparse reward functions.




1 Introduction

The challenging task of solving a Ball-in-Maze Game (BiMGame) with a robot has recently been introduced van Baar et al. (2018); Romeres et al. (2018) (see Fig. 1(a)). The task is challenging because maze puzzles exhibit complex dynamics due to static friction and collisions with the geometry (or between marbles when there is more than one), and because solving them requires long-horizon planning. Sample-efficient, model-based Reinforcement Learning (RL) approaches, e.g., Levine et al. (2016), are desirable. However, for BiMGames, model-based learning remains unsolved Romeres et al. (2018). Instead, in van Baar et al. (2018) the authors propose to use model-free deep RL Mnih et al. (2016), and demonstrate a transfer learning approach from simulation to a real robot.

A major drawback of model-free RL is its lack of sample efficiency, since exploration is performed in an unstructured fashion. In this paper, we investigate several algorithms to reduce the required number of samples. We adopt ideas from trajectory-based imitation learning methods, in which controllers are learned to model the distribution of trajectories and are subsequently refined. The necessary trajectories are provided by an expert, often a human. Rather than requiring a human expert to generate trajectories for each instance of this game, in the context of Sim2Real we can exploit the simulator to generate those trajectories instead. We use these trajectories to pre-train the network policy and then fine-tune it with policy-gradient-based RL.

In the remainder of this paper, we first briefly discuss previously introduced Imitation Learning (IL) approaches along with the challenges we encountered to learn BiMGame. Thereafter, we present the details of the trajectory generation procedure, followed by the IL-RL framework we take to learn from these trajectories and the experimental results for different tasks of BiMGame. We conclude the paper with the challenges we faced with the different tasks and future directions to address them.

2 Related Work

Recently, Deep Neural Networks (DNNs) have enabled RL approaches to achieve impressive results. It is beyond the scope of this paper to list all prior work, and we only discuss the most relevant. The Sim2Real approach in van Baar et al. (2018) is based on Asynchronous Advantage Actor-Critic (A3C) Mnih et al. (2016). However, deep RL learns on a trial-and-error basis, thus requiring a significant number of training samples.

Many imitation learning approaches use trajectories obtained from experts Ross et al. (2011); Ross and Bagnell (2014); Sun et al. (2017); Cheng et al. (2018) to guide the learning process to be more structured. Of particular interest is the AggreVateD method Sun et al. (2017), which is similar to A3C, with the exception that the advantage estimate is that of an expert policy rather than the current policy being learned. We found that this approach relies heavily on very accurate estimation of the value function at all visited states, and it did not solve BiMGame consistently, even when the expert value function was that of a trained policy that is able to solve the game consistently.

To make RL sample efficient, the authors in Nagabandi et al. (2017) propose to start with model-based deep RL, followed by model-free fine-tuning. The cloning from model-based to model-free is performed by DAgger Ross et al. (2011), using the trajectories generated by model-based RL. As the model is learned from data, the performance obtained by the model-based method is quite low. Modeling even more complex dynamics and a high-dimensional state space, such as in BiMGame, would be difficult to achieve. The authors in Rahmatizadeh et al. (2016) propose a Sim2Real approach that avoids RL and instead relies on supervised training from expert trajectories. Compared to BiMGame, the dynamics of these tasks are much simpler.

Figure 1: (a) Ball-in-Maze Game puzzle: the real (Left) and simulated (Right) BiMGame. (b) This plot presents the distribution of trajectory lengths generated by using the simulator as the model

3 Generating Trajectories: Using the Simulator as the Model

The trajectories to be used for learning can be obtained from a human teacher Rahmatizadeh et al. (2016); Yu et al. (2018) or from planning-based algorithms Guo et al. (2014). In Nagabandi et al. (2017), the authors first learned a neural network dynamics model and then used it to generate trajectories. However, when learning in the simulator (in a Sim2Real paradigm), we can leverage the internal physics engine to roll out in time and generate trajectories by optimizing the cumulative reward function. Formally, at time step $t$, we aim to solve the following:

$$A_t^{*} = \arg\max_{A_t} \sum_{t'=t}^{t+H} r(s_{t'}, a_{t'}) \quad \text{s.t.} \quad s_{t'+1} = f(s_{t'}, a_{t'}) \tag{1}$$

where $f$ is the simulator, $r$ is the reward function, $H$ is the horizon of optimization and $A_t = \{a_t, \dots, a_{t+H}\}$ is the set of actions from $t$ to $t+H$. We only take the first action $a_t^{*}$, move to state $s_{t+1}$ and repeat the same optimization procedure to choose the next action. As we use a non-differentiable simulator, we employ a random shooting strategy Rao (2009), where we sample $N$ sets of $A_t$ and choose the one which maximizes the objective function. We set $H$ and $N$ empirically.
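The receding-horizon random shooting procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the physics simulator is replaced by a hypothetical 2-D point that moves in fixed steps, the reward is assumed to be the negative radial distance to the center, and the names `simulate`, `reward`, `random_shooting` and `generate_trajectory`, as well as all default parameter values, are made up for this example.

```python
import math
import random

# Hypothetical stand-in for the physics simulator: a 2-D point that either
# stays put or moves by a fixed step along one axis (5 discrete actions).
ACTIONS = [(0.0, 0.0), (1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
STEP = 0.1


def simulate(state, action):
    """Toy transition function playing the role of the simulator."""
    dx, dy = ACTIONS[action]
    return (state[0] + STEP * dx, state[1] + STEP * dy)


def reward(state):
    """Assumed distance-based reward: negative radial distance to the center."""
    return -math.hypot(state[0], state[1])


def random_shooting(state, horizon, num_samples, rng):
    """Sample random action sequences, roll each out, return the best first action."""
    best_return, best_first = -math.inf, 0
    for _ in range(num_samples):
        sequence = [rng.randrange(len(ACTIONS)) for _ in range(horizon)]
        s, total = state, 0.0
        for a in sequence:
            s = simulate(s, a)
            total += reward(s)
        if total > best_return:
            best_return, best_first = total, sequence[0]
    return best_first


def generate_trajectory(start, horizon=5, num_samples=200, max_steps=100,
                        goal_radius=0.1, seed=0):
    """Apply only the first action of each plan, then re-plan from the new state."""
    rng = random.Random(seed)
    state, trajectory = start, []
    for _ in range(max_steps):
        action = random_shooting(state, horizon, num_samples, rng)
        next_state = simulate(state, action)
        trajectory.append((state, action, reward(next_state)))
        state = next_state
        if math.hypot(state[0], state[1]) < goal_radius:
            break
    return trajectory
```

Each element of the returned list is a state-action-reward triplet, which is the form a trajectory dataset for imitation would take. Because only a finite number of random sequences is evaluated at each step, the resulting trajectories are generally not optimal.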

Using this method, we obtain a dataset of trajectories $\mathcal{D}$, each of which is a sequence of state-action-reward triplets. Note that the optimization process, and thus the trajectories obtained with it, is not optimal. Moreover, the choice of reward function used in the above optimization procedure strongly affects the error in the trajectories. In this work, we use the following distance-based reward function:


where $d_t$ is the radial distance of the ball from the center of the board at time $t$. Note that, as the radial path is not feasible due to obstructions, this reward function is only informative near the gate openings and not elsewhere. In spite of using such a reward function, the above method is able to solve the BiMGame consistently. Fig. 1(b) shows the distribution of trajectory lengths required to solve the BiMGame using the above method. We next describe the methods we use to learn the tasks from the trajectories $\mathcal{D}$. The methods are agnostic to the procedure with which the trajectories are obtained.

4 Learning From Trajectories

4.1 Supervised Pre-training

The dataset $\mathcal{D}$ forms a rich source of information to guide the RL agents toward faster convergence. We use this set of trajectories to pre-train the DNN policy. We follow the architecture of A3C Mnih et al. (2016), which consists of a DNN with two heads, one for the policy $\pi$ and the other for value estimation $V$, with partially shared parameters between $\pi$ and $V$. We pre-train the deep neural network with a loss of the following form:


where $K$ is the number of discrete actions. The first part of the loss function is the cross-entropy loss that trains the policy network, the second part trains the value function estimator, and the last part is the regularization loss. We denote the policy learned by minimizing Eqn. 3 as $\pi_{\mathrm{pre}}$.
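A three-part pre-training loss of the kind described here can be sketched in plain Python. This is one plausible instantiation, not the exact expression used in the paper: the value target is assumed to be the discounted return observed along the trajectory, the regularizer is assumed to be an L2 penalty, the three terms are summed with unit weights, and the names `discounted_returns`, `pretraining_loss` and `l2_weight` are hypothetical.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Discounted return from each time step of one trajectory (assumed value target)."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pretraining_loss(logits_batch, values_batch, actions, returns, params,
                     l2_weight=1e-4):
    """Cross-entropy on expert actions + squared value error + L2 regularization."""
    n = len(actions)
    cross_entropy, value_error = 0.0, 0.0
    for logits, value, action, target in zip(logits_batch, values_batch,
                                             actions, returns):
        probs = softmax(logits)
        cross_entropy -= math.log(probs[action])   # policy head
        value_error += (value - target) ** 2        # value head
    regularization = l2_weight * sum(p * p for p in params)
    return cross_entropy / n + value_error / n + regularization
```

Dropping the cross-entropy term from `pretraining_loss` gives the kind of objective one would use to train a value-only network from the same trajectories.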

The pre-trained policy $\pi_{\mathrm{pre}}$ can take actions with low error rates at states sampled from the distribution induced by the trajectories in $\mathcal{D}$. However, a small error at the beginning compounds quadratically with time Ross et al. (2011), as the DNN agent starts visiting states that are not sampled from the distribution of $\mathcal{D}$. Algorithms like DAgger can be used to fine-tune the policy on the states visited by rolling out $\pi_{\mathrm{pre}}$, using ground-truth labels obtained from an expert policy $\pi^{*}$. Querying $\pi^{*}$ is often very costly and may not even be feasible in some applications.

In this paper, instead of taking the route of DAgger, we fine-tune the policy using policy gradient RL in an A3C framework. This method is quite simple, as we just need to initialize A3C with $\pi_{\mathrm{pre}}$ instead of randomly initialized network parameters.

4.2 Value Function as a Reward

Another way of using the trajectories for sample-efficient learning is to shape the reward Ng et al. (1999) using the value function learned from the trajectories. We train a network with only the value head to estimate $V$ using the dataset $\mathcal{D}$, by optimizing only the last two parts of Eqn. 3. Thereafter, we use A3C with a transformed reward function as follows:


While using this reward function, the A3C agents can start either from randomly initialized weights or from weights pre-trained by minimizing the loss in Eqn. 3. Note that in the latter case, the value network used for reward shaping and the pre-trained A3C network are two different networks. The former is kept fixed after training and serves only as a source of an auxiliary reward, while the latter is fine-tuned using the A3C framework.
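As an illustration of reward shaping with a fixed, learned value function, the sketch below uses the potential-based form of Ng et al. (1999), in which the learned value acts as the potential. Whether the transformed reward in this work takes exactly this form is an assumption; the function names and the default discount factor are hypothetical.

```python
def shaped_reward(reward, value, next_value, gamma=0.99, terminal=False):
    """Potential-based shaping: r + gamma * V(s') - V(s), with V(s') = 0 at episode end."""
    bootstrap = 0.0 if terminal else gamma * next_value
    return reward + bootstrap - value


def shape_trajectory(rewards, values, gamma=0.99):
    """Apply the shaping to one trajectory; values[t] is the fixed network's V(s_t)."""
    shaped = []
    last = len(rewards) - 1
    for t, r in enumerate(rewards):
        next_value = values[t + 1] if t < last else 0.0
        shaped.append(shaped_reward(r, values[t], next_value, gamma,
                                    terminal=(t == last)))
    return shaped
```

With this form, the discounted sum of shaped rewards along an episode equals the discounted sum of the original rewards minus the value of the first state, so the shaping adds a dense signal to a sparse task without changing which policies are optimal.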

5 Experiments

Tasks. We perform experiments on three instances of BiMGame. All games are initialized with the marble in the outermost ring. We refer to the first instance as FULL, where the agent receives a reward of +1 or -1 for moving a marble through a gate towards or away from the center, respectively. The goal is to move the marble into the central portion of the maze. We define additional instances with a sparser reward signal, where a reward of +1 is received only if the goal (terminal state) is attained, and the reward is zero otherwise. We refer to these games as Steps-to-Go (STG). We define STG1 as the goal of moving the marble through one of the gates between the first and second ring, and STG2 as the goal of moving it through one of the gates between the second and third ring. There are 5 possible actions in the games: clockwise and anti-clockwise rotations about each of the two principal axes in the plane of the board, and a No-Operation action.
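The reward rules of the FULL and STG instances and the 5-way action set can be written down compactly. The sketch below assumes a hypothetical ring index that is 0 for the outermost ring and increases towards the center; the action names and the helper functions are illustrative and not taken from the simulator.

```python
# Hypothetical labels for the 5 discrete actions: two rotation directions
# about each of the two board axes, plus a no-operation action.
ACTIONS = ["x_clockwise", "x_anticlockwise",
           "y_clockwise", "y_anticlockwise", "no_op"]


def full_reward(prev_ring, ring):
    """FULL: +1 for passing a gate towards the center, -1 for moving away, else 0."""
    if ring > prev_ring:
        return 1.0
    if ring < prev_ring:
        return -1.0
    return 0.0


def stg_reward(ring, k):
    """STG-k: reward +1 and episode end once the marble has passed k gates inward."""
    reached = ring >= k
    return (1.0 if reached else 0.0), reached
```

For example, under this indexing STG1 terminates with a reward of +1 as soon as the marble enters ring 1, and every earlier step yields zero reward, which is what makes the STG instances much sparser than FULL.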

Algorithms. We compare the following algorithms: A3C; supervised pre-training followed by A3C fine-tuning; A3C with the value-based reward function; and supervised pre-training followed by A3C fine-tuning with the value-based reward function. We also compare with DAgger Ross et al. (2011).

Deep Neural Network. The network architecture we train is Conv-Conv-FC-LSTM. The input to the LSTM is the feature from the previous layer, along with the action and reward of the previous step. After the LSTM layer, we add two FC heads for the policy and the value function.

Figure 2: Learning curves for the three tasks: (a) FULL, (b) STG1 and (c) STG2.

Results. During pre-training, the maximum classification accuracy we obtain on a test set of trajectories is low. This shows that the problem at hand is quite difficult to learn using only the trajectories and supervised training. This result is in contrast to that in Nagabandi et al. (2017), where the authors achieved perfect cloning from model-based to model-free for their tasks using trajectories together with DAgger, which requires a step to query an expert and can be costly in general. It can also be observed in the plots that DAgger performs much worse on the harder tasks, STG2 and FULL, than on the easier task, STG1. Moreover, DAgger-based imitation learning performs much worse than the framework presented in this work.

We use the pre-trained network as a starting point for A3C. The cumulative reward over the A3C training period is presented in Fig. 2. As may be observed, pre-training the network helps to learn faster (2-3x) compared to learning from randomly initialized weights. We also experimented with the FULL version of BiMGame using an additional reward as in Eqn. 2, but with $d_t$ being the geodesic distance to the center of the board instead of the radial distance. This reward function (even with different scaling factors) did not help to speed up the learning process. We argue that this is a possible reason why the value-based reward does not play a major role in faster learning.

We observed some unstable behavior while training STG2, resulting in high variation in the time to converge compared to FULL and STG1. In the interest of time, we report the best results obtained over a few runs of each algorithm. Even with such variation, we noted at least a 2x improvement with pre-training.

6 Challenges and Future Work

BiMGame is a challenging task for both model-based and model-free RL. In this work, we show that supervised pre-training on trajectories obtained with a simulator can speed up learning. Yet the number of steps required is still on the order of millions. A combination of model-based and model-free RL seems to be necessary to further reduce the required number of steps, but for BiMGame further research is needed on how to achieve this.

We noticed that, for game instances with sparser rewards such as STG3 and STG4, the pre-trained policy seems to gradually forget its previous training during fine-tuning: it starts off with a small positive cumulative reward, which gradually decreases until it reaches zero. After many steps, none of the algorithms, including randomly initialized A3C, showed any sign of starting to learn these two tasks. This currently remains an open problem. DAgger, which relies entirely on supervised training to mimic the MPC expert, does show a very slow growth in cumulative reward, but it requires expensive queries to the expert.

Humans can learn to play BiMGame without much effort, and readily adapt to different game instances, i.e., different geometry, material, or number of marbles. An interesting research direction may be to incorporate human priors within the learning framework. To that end, breaking the task into a sequence of subtasks, determined automatically or from trajectories, seems desirable.