Hybrid Reinforcement Learning with Expert State Sequences

03/11/2019 ∙ by Xiaoxiao Guo, et al. ∙ IBM

Existing imitation learning approaches often require that the complete demonstration data, including sequences of actions and states, are available. In this paper, we consider a more realistic and difficult scenario where a reinforcement learning agent only has access to the state sequences of an expert, while the expert actions are unobserved. We propose a novel tensor-based model to infer the unobserved actions of the expert state sequences. The policy of the agent is then optimized via a hybrid objective combining reinforcement learning and imitation learning. We evaluate our hybrid approach on an illustrative domain and Atari games. The empirical results show that (1) the agents are able to leverage expert state sequences to learn faster than pure reinforcement learning baselines, (2) our tensor-based action inference model is advantageous compared to standard deep neural networks in inferring expert actions, and (3) the hybrid policy optimization objective is robust against noise in expert state sequences.





Human expert behavioral data are widely used for policy learning in sequential decision-making tasks [Schaal1999, Argall et al.2009]. One of the most effective paradigms is imitation learning, where a policy is trained via direct supervision to clone expert behaviors [Pomerleau1989, Ross, Gordon, and Bagnell2011]. Imitation learning generally requires both observable states and actions as input. However, expert actions are often unavailable or not directly usable. Literature has considered scenarios where the expert and the imitation learner may have different viewpoints [Liu et al.2017], temporal resolution or action sets [Yu et al.2018]. In such cases, cloning behavior directly is not an option. How to leverage such data to facilitate learning is a realistic and challenging problem.

In this paper, we investigate a novel learning scenario where an agent learns from both its own experience and state-trajectory-only expert demonstrations. We propose an iterative learning framework as follows, illustrated in Figure 1. We first learn a novel tensor-based action inference model as the learning agent interacts with the environment. Our model enforces a duality for consistent learning as the inferred action from two consecutive states reconstructs the latter state. Then, upon observation of demonstration without actions, the learned dynamics is used to infer the missing actions. Finally, we improve the learning policy by jointly considering the imitation performance and rewards from environment interaction via Advantage Actor-Critic (A2C) [Dhariwal et al.2017].

To demonstrate the effectiveness of the proposed iterative learning process, we conduct experiments on the Taxi domain as well as eight commonly used Atari games. The experimental results confirm a faster convergence rate of the proposed framework compared to advantage actor-critic alone, and a better policy compared to behavioral cloning from observations (BCO) [Torabi, Warnell, and Stone2018a], which only considers a similar action inference approach for behavioral cloning but ignores potential reward signals. We additionally show that our framework is robust against noisy expert state trajectories, and works well even when the number of demonstrations is limited.

Figure 1: The proposed hybrid reinforcement learning with expert state sequences framework.

Related Work

Imitation Learning

The tasks of learning from demonstrations and imitation learning have attracted considerable research attention from many fields in machine learning. The approaches are roughly divided into two groups, behavioral cloning and Inverse Reinforcement Learning (IRL). Survey articles include [Schaal1999, Billard et al.2008, Argall et al.2009]. Behavioral Cloning (BC) uses supervised learning, where the learner directly regresses onto the policy of the expert [Pomerleau1989, Ross, Gordon, and Bagnell2011, Liu et al.2017]. This requires observation-action tuples, and cannot be applied when actions are absent. On the other hand, IRL methods aim to infer the goal of an expert, expressed as a reward function [Ng and Russell2000, Abbeel and Ng2004, Ziebart et al.2008, Levine, Popovic, and Koltun2011, Borsa et al.2017]. This can then be used for RL to recover a policy [Ratliff, Bagnell, and Zinkevich2006, Ramachandran and Amir2007]. Contemporaneous work [Torabi, Warnell, and Stone2018b] extends BC to Behavioral Cloning from Observations (BCO), a setting where expert state-transitions are observed but actions are not observed. This differs from our setting where we hybridize BCO with RL.

Hybrid RL with Expert Actions

Several works have focused on combining standard RL with supervised training on expert actions, for example, by alternating steps of RL and IRL to obtain more accurate estimates [Ho and Ermon2016, Finn, Levine, and Abbeel2016]. Although both lines of research were successfully applied to a variety of tasks, they almost always assume that the state-action trajectories of experts are in the same space as the one the learner observes. As argued in some recent work [Stadie, Abbeel, and Sutskever2017, Duan et al.2017, Liu et al.2017], such an assumption is restrictive and unrealistic. Instead, they proposed to discover the transformations between the learner and the teacher state space. Additional recent work combines human expert data and RL in policy learning. [gilbert2015reducing], [lipton2016efficient] and [xxyyzz] store expert state-action pairs in a replay buffer to facilitate learning. [hosu2016playing] utilize expert state-action pairs to facilitate action value learning. [subramanian2016exploration] leverage human data for efficient exploration. [hester2017learning] combine several approaches and demonstrate superior performance on Atari games. [nair2017rldemo] reported significant progress in training robotic tasks using a combination of RL and a small set of human demonstrations. Our method differs from all these approaches in that we do not assume expert actions are available.

Very recent work has started to investigate leveraging expert state sequences only to accelerate imitation learning. [aytar2018playing] utilize state-only demonstration data to address the hard exploration issue for Atari games. [zhu2018reinforcement] leverage a small amount of demonstration data to assist a reinforcement learning agent for robotic manipulation tasks.

Model-based RL

Researchers have known for decades that learning a domain model concurrently with learning a behavior policy can significantly improve over model-free RL [Sutton1991]. [Chebotar-2017-964] recently demonstrated much better sample efficiency using model-based RL for robotics tasks, without requiring demonstrations. [oh2015action] and [machado2018eigenoption] trained deep neural network models to predict the next frame or successor representation in Atari games with good effect. They applied element-wise multiplications on the state and action embeddings to obtain the embedding of the next state, which could be viewed as a special case of our model.

Sequential Decision Making with Expert State Sequences

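We formulate the sequential decision-making task as a Markov decision process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. $P$ is the state transition function and $P(s'|s,a)$ is the probability that the next state is $s'$ given that action $a$ is taken in the current state $s$. $R$ is the reward function with $R(s,a)$ being the expectation of immediate rewards, $r$, of taking action $a$ in state $s$. $\gamma$ is a temporal discounting factor. A stochastic policy $\pi(a|s)$ specifies the action to take in states. A state value function is defined as the expected sum of discounted rewards following a policy $\pi$ from a state $s$: $V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_0 = s\right]$. Similarly, a state-action value function is defined as $Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_0 = s, a_0 = a\right]$. The optimal policy $\pi^{*}$ has action value function $Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$. Taking actions greedily with respect to $Q^{*}$ yields the optimal policy $\pi^{*}$.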

In addition, we assume a set of state sequences (with unknown actions) demonstrated by an expert is given. We denote expert state sequences as a set of state pairs, $D = \{(\hat{s}_i, \hat{s}'_i)\}$, where $\hat{s}'_i$ is the next state of $\hat{s}_i$. Note that we do not assume that states are consecutive across different pairs, i.e., $\hat{s}'_i$ is not necessarily equivalent to $\hat{s}_{i+1}$. We aim to design a flexible framework that can accommodate datasets in various formats.
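As a small illustration of this data format, the sketch below turns state-only trajectories into the set of consecutive state pairs. The function name `state_pairs` and the integer state encoding are hypothetical choices made for the example, not part of the paper.

```python
from typing import Hashable, List, Sequence, Tuple


def state_pairs(
    trajectories: Sequence[Sequence[Hashable]],
) -> List[Tuple[Hashable, Hashable]]:
    """Flatten state-only trajectories into (state, next_state) pairs.

    Pairs are only formed inside a trajectory, so the last state of one
    trajectory is never paired with the first state of the next one.
    """
    pairs = []
    for traj in trajectories:
        pairs.extend(zip(traj[:-1], traj[1:]))
    return pairs


# Two toy trajectories with integer state ids (hypothetical data).
expert_pairs = state_pairs([[0, 1, 2], [7, 8]])
```

Because the pairs are stored independently, the same structure also accommodates demonstrations that come as isolated transitions rather than full trajectories.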

Hybrid Reinforcement Learning with Expert State Sequences

To utilize the expert state sequences, our method learns a model of the environment to infer the missing expert actions from consecutive expert states. The inferred actions combined with the expert states are then utilized to provide additional supervision on the policy of our agent via behavioral cloning. The learning paradigm of our agent is illustrated in Figure 1. The agent interacts with the environment following its current policy $\pi$, as traditional RL agents do. $\theta$ denotes the learnable parameters of the policy $\pi$. When the agent interacts with the environment, the state-action-next-state tuples $(s, a, s')$ of its experience are collected to additionally train an action inference model $M$, which maps the state-next-state pair into an action. The details of the model will be provided in the Low-Rank Tensor Formulation section. We sample a batch of consecutive expert state pairs and apply the action inference model to obtain the action estimate $\hat{a} = M(\hat{s}, \hat{s}')$ for each expert state pair $(\hat{s}, \hat{s}')$. The action estimate and the expert state are combined as a batch of expert state-action pairs to optimize the policy of the agent via behavioral cloning. The agent also applies RL to optimize its policy simultaneously.

In the rest of this section, we provide details of the action inference model $M$, followed by the hybrid training objective combining behavioral cloning and RL.

Modeling State Transition Dynamics and Action Inference

Traditional model-based RL usually estimates the forward dynamics of the stochastic state transitions of the environment as $P^f$, with $P^f(s'|s,a)$ being the likelihood of the next state $s'$ conditioned on the current state $s$ and action $a$. Ideally, the action inference model should be consistent with such a model-based RL approach. Specifically, the output of the action inference model should be the action maximizing the likelihood of the observed state transition. Such an approach may require $|\mathcal{A}|$-time computation of $P^f$ in order to find the best action output. A more computation-friendly solution would be to obtain a consistent view of the state transitions which directly estimates the likelihood of actions conditioned on two consecutive states, $P^i$, with $P^i(a|s,s')$ being the likelihood of action $a$ for the state pair $(s,s')$. Tensors offer a natural representation of the joint model of $P^f$ and $P^i$. A tensor is able to compute the two views of the state transitions effectively. One technical contribution of this paper is a tensor-based action inference model to maintain the two views of the environment dynamics consistently and simultaneously.

Low-Rank Tensor Formulation

Motivation in Small Domains

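For an MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, $P^f$ and $P^i$ can be represented as tensor multiplication on a shared tensor $\mathcal{T} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}| \times |\mathcal{S}|}$, where $\mathcal{T}_{s,a,s'}$ stores the count number, $c(s,a,s')$, of the tuple $(s,a,s')$ in the agent’s experience. Then the maximum likelihood estimate of $P^f$ and $P^i$ could be represented as:

$$P^f(\cdot \mid s,a) = \frac{\mathcal{T} \times_1 e_s \times_2 e_a}{\|\mathcal{T} \times_1 e_s \times_2 e_a\|_1}, \qquad P^i(\cdot \mid s,s') = \frac{\mathcal{T} \times_1 e_s \times_3 e_{s'}}{\|\mathcal{T} \times_1 e_s \times_3 e_{s'}\|_1}.$$

Here $e_x$ is a one-hot vector at location $x$, $\|\cdot\|_1$ is the $\ell_1$ norm of a vector, and $\times_m$ denotes the mode-$m$ product, which contracts the $m$-th index of the tensor with a vector: $(\mathcal{T} \times_m v)_{\ldots i_{m-1} i_{m+1} \ldots} = \sum_{i_m} \mathcal{T}_{\ldots i_m \ldots} v_{i_m}$. Note that $\mathcal{T} \times_1 e_s \times_2 e_a \times_3 e_{s'} = c(s,a,s')$, and can be viewed as a score of the tuple $(s,a,s')$ from tensor multiplications. $\mathcal{T} \times_1 e_s \times_2 e_a$ is a vector of length $|\mathcal{S}|$ with entries $c(s,a,\cdot)$, and $\mathcal{T} \times_1 e_s \times_3 e_{s'}$ is a vector of length $|\mathcal{A}|$ with entries $c(s,\cdot,s')$. $P^f(\cdot \mid s,a)$ and $P^i(\cdot \mid s,s')$ are the normalized versions of these two count vectors. By organizing the count numbers of the tuples $(s,a,s')$ into a tensor, $P^f$ and $P^i$ could be modeled jointly by the same tensor in the tabular cases. But for MDPs with a large state space, such a tensor would be both memory-demanding and computationally expensive.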
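The tabular case can be sketched in a few lines. The example below keeps the counts in a sparse dictionary instead of a dense tensor and normalises them along the next-state or the action axis; the function names and the toy experience are hypothetical, and the sketch only illustrates the count-and-normalise idea described above.

```python
from collections import Counter


def count_tensor(experience):
    """Sparse stand-in for the count tensor: (s, a, s_next) -> count."""
    return Counter(experience)


def forward_mle(counts, s, a):
    """Estimate of P^f(. | s, a): normalise counts over next states."""
    row = {s2: c for (s0, a0, s2), c in counts.items() if s0 == s and a0 == a}
    total = sum(row.values())
    return {s2: c / total for s2, c in row.items()} if total else {}


def inverse_mle(counts, s, s_next):
    """Estimate of P^i(. | s, s_next): normalise counts over actions."""
    row = {a0: c for (s0, a0, s2), c in counts.items() if s0 == s and s2 == s_next}
    total = sum(row.values())
    return {a0: c / total for a0, c in row.items()} if total else {}


# Toy experience: "right" from state 0 usually reaches state 1.
experience = [(0, "right", 1), (0, "right", 1), (0, "right", 0), (0, "up", 1)]
counts = count_tensor(experience)
p_f = forward_mle(counts, 0, "right")  # distribution over next states
p_i = inverse_mle(counts, 0, 1)        # distribution over actions
```

Both views read from the same counts, which is the consistency property the shared tensor is meant to provide; unseen state pairs simply return an empty distribution here.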

Figure 2: Tensor formulation of state-transition modeling for playing Atari games. A state consists of a window of frames, which are images from the game screen. Those images pass through a convolutional neural network and result in a state embedding (left part). Given a tuple $(s,a,s')$, our model predicts a score by multiplying a tensor $\mathcal{T}$ with the hidden representations of $(s,a,s')$, i.e., $h_s$, $h_a$ and $h_{\delta s}$ (right part). Learned matrices map the state embeddings of $s$ and $s'$ and the one-hot encoding of action $a$ to these vectors.

We reduce the tensor computation via the following approaches: (1) We reduce the dimensionality of the tensor by embedding the states in a lower-dimensional space. We embed the state representation $\phi(s)$ (Footnote 1: for example, $\phi(s)$ could be a one-hot vector from a lookup table for small state space domains, or an output vector from a ConvNet to represent image inputs for video game domains) into a lower-dimensional space via a matrix $W_s$: $h_s = W_s \phi(s)$. Instead of using the embedding of the next state $s'$, we use the embedding for the difference between two states in tensor multiplications. The rationale for embedding the state differences instead of the next states directly is that not all the information in the next states is relevant to the actions. Moreover, the difference between two states could be embedded in an even lower-dimensional space, since not all state information is relevant to state transitions. The effects of actions are more related to the state differences. We denote the state difference embedding as $h_{\delta s}$. Similarly, we embed the actions as $h_a$. Thus the score of a tuple $(s,a,s')$ can be computed as $\mathcal{T} \times_1 h_s \times_2 h_a \times_3 h_{\delta s}$, where $\mathcal{T}$ is now a third-order tensor over the three embedding spaces. Figure 2 demonstrates the above formulation with an Atari game example. Under this formulation, the predicted representation of the action and state difference is thus:

$$\hat{h}_a = \mathcal{T} \times_1 h_s \times_3 h_{\delta s}, \qquad \hat{h}_{\delta s} = \mathcal{T} \times_1 h_s \times_2 h_a.$$

In order to have a consistent score for the tuple $(s,a,s')$ using either the predicted embedding or the original embedding, part of the learning objective is to minimize the distance between the predicted representations and the original ones: $\|h_a - \hat{h}_a\|_1 + \|h_{\delta s} - \hat{h}_{\delta s}\|_1$.

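(2) Even though the tensor $\mathcal{T}$ is now independent of the sizes of the state space and the action space, it may still be computationally expensive. To further reduce the computation, we first introduce symmetry into the tensor, with $\mathcal{T}_{i,j,k} = \mathcal{T}_{i,k,j}$, by assuming that the action and the difference in the state embedding between the current state and the next state could be embedded similarly when conditioned on the current state. Then we approximate the tensor slices as a sum of rank-1 matrices following [li2017visual]:

$$\mathcal{T}[:, :, k] = \sum_{r=1}^{R} u_r^k \otimes v_r^k,$$

where $U_r, V_r \in \mathbb{R}^{d \times d}$ are matrices for embedding dimension $d$, and $\otimes$ denotes the outer product. $u_r^k$ denotes the $k$-th column of the matrix $U_r$, and likewise $v_r^k$ for $V_r$. Thus each element of $\hat{h}_{\delta s}$ can be written as $\hat{h}_{\delta s}[k] = \sum_{r=1}^{R} (h_s^\top u_r^k)(h_a^\top v_r^k)$. Letting $\circ$ denote the Hadamard (point-wise) product, we have:

$$\hat{h}_{\delta s} = \sum_{r=1}^{R} (h_s^\top U_r) \circ (h_a^\top V_r).$$

With our symmetric $\mathcal{T}$, the tensor slices could be approximated by the same $U_r$ and $V_r$, thus we have:

$$\hat{h}_a = \sum_{r=1}^{R} (h_s^\top U_r) \circ (h_{\delta s}^\top V_r).$$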

This gives a computationally efficient dual state transition model to predict both $\hat{h}_a$ ($a$ is the action) and $\hat{h}_{\delta s}$ ($\delta s$ is the state difference). The predicted $\hat{h}_a$ is then used to predict the action probability $P^i(a|s,s')$ through a learned classification layer over actions.

Similarly, the probability of the next state, $P^f(s'|s,a)$, is predicted from $\hat{h}_{\delta s}$. (Footnote 2: Exactly following the tensor scoring function gives a probability estimate in which each action is scored against the concatenation of the embeddings $h_a$ for all $a$, and each next state against the concatenation of the embeddings $h_{\delta s}$ for all $s'$. However, as adopted by [li2017visual], the proposed estimation gives better empirical results. This is possibly because the classification objective benefits from more free parameters. Furthermore, the next-state likelihood can be replaced by a reconstruction loss for large state spaces, as used in [oh2015action].)
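To make the low-rank dual computation concrete, here is a minimal pure-Python sketch. It assumes the factorised form common to low-rank bilinear models, a sum over R rank terms of the element-wise product of two linear projections, with the same factor matrices reused for both directions; the names `vec_mat`, `dual_predict`, `U`, `V` and the toy sizes are assumptions made for illustration rather than the paper's exact parameterisation.

```python
import random


def vec_mat(v, m):
    """Row vector times matrix: returns v^T m as a list."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def dual_predict(h_s, h_other, U, V):
    """Compute sum_r (h_s^T U_r) * (h_other^T V_r), element-wise.

    Passing the action embedding as h_other predicts the state-difference
    embedding; passing the state-difference embedding predicts the action
    embedding. The same factors U, V serve both directions.
    """
    out = [0.0] * len(U[0][0])
    for U_r, V_r in zip(U, V):
        left, right = vec_mat(h_s, U_r), vec_mat(h_other, V_r)
        out = [o + l * r for o, l, r in zip(out, left, right)]
    return out


# Toy sizes (assumed): embedding dimension 4, rank 2.
rng = random.Random(0)
d, R = 4, 2
U = [[[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)] for _ in range(R)]
V = [[[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)] for _ in range(R)]
h_s, h_a, h_ds = ([rng.gauss(0, 1) for _ in range(d)] for _ in range(3))

h_ds_hat = dual_predict(h_s, h_a, U, V)   # predicted state-difference embedding
h_a_hat = dual_predict(h_s, h_ds, U, V)   # predicted action embedding
```

Each prediction costs R pairs of matrix-vector products, which is the saving over contracting a full third-order tensor.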

Figure 3: Model architectures for (a) Taxi and (b) Atari games. The Multi-Layer Perceptron (MLP) action inference baseline is also visualized in (a).

Learning Objective for the State Transition Model

The experience of the agent while interacting with the environment is used to optimize the dual state transition model. Since the model is end-to-end trainable, we optimize it to maximize the likelihood of the tuples $(s,a,s')$ of the agent’s own experience.

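The training objective is defined as follows:

$$L^{\text{dual-model}} = \mathbb{E}_{(s,a,s')}\left[ -\log P^f(s' \mid s,a) - \log P^i(a \mid s,s') + \|h_a - \hat{h}_a\|_1 + \|h_{\delta s} - \hat{h}_{\delta s}\|_1 \right]$$

In our learning scenario, only $P^i$ is relevant, so we simplify the objective to model only the action inference part:

$$L^{\text{act}} = \mathbb{E}_{(s,a,s')}\left[ -\log P^i(a \mid s,s') + \|h_a - \hat{h}_a\|_1 + \|h_{\delta s} - \hat{h}_{\delta s}\|_1 \right]$$

The learned $P^i$ is used to infer the actions given two consecutive expert states $(\hat{s}, \hat{s}')$: $M(\hat{s}, \hat{s}') = \arg\max_a P^i(a \mid \hat{s}, \hat{s}')$.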
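The simplified objective can be evaluated for a single transition as in the sketch below. It assumes that the action probabilities come from a softmax over per-action logits; the function names and argument layout are hypothetical, and the embeddings are passed in as plain lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def l1(u, v):
    """L1 distance between two vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def action_inference_loss(action_logits, action, h_a, h_a_hat, h_ds, h_ds_hat):
    """Per-sample loss: action NLL plus the two L1 consistency terms."""
    nll = -math.log(softmax(action_logits)[action])
    return nll + l1(h_a, h_a_hat) + l1(h_ds, h_ds_hat)


def infer_action(action_logits):
    """Inferred action: the argmax of the predicted action distribution."""
    return max(range(len(action_logits)), key=lambda a: action_logits[a])


# Toy call (assumed values): two actions, 2-dimensional embeddings.
loss = action_inference_loss([0.0, 0.0], 0, [1.0, 2.0], [1.0, 2.0], [0.5, 0.5], [0.5, 0.5])
```

Averaging this quantity over a batch of the agent's own transitions gives a Monte Carlo estimate of the expectation in the objective.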

Figure 4: Learning curves of different agents on the Taxi domain. The figures show the average cumulative rewards across 80,000 time steps. The shaded regions represent the standard deviation of the average cumulative rewards. The figures are best viewed in color. (a) The performance of our method and different baselines. BC denotes agents that only use the inferred actions on expert states to optimize their policies. MLP/Dual denote the architecture of the action inference model. (b) The performance of our method and variants with different action inference model ranks. (c) The performance of our method under different types of noise in the expert state sequences.

Hybrid Learning Objective

The hybrid training objective of the policy combines the RL objective to maximize the expected sum of the discounted rewards and the imitation objective to maximize the likelihood of the inferred actions on the demonstrated states. Our RL method is Advantage Actor Critic (A2C). A2C learns the state value by minimizing the squared advantage function values $A(s)^2$. A2C optimizes the policy via the policy gradient $A(s) \nabla_\theta \log \pi(a \mid s; \theta)$. Let $\theta$ denote the parameters of the policy $\pi$. The hybrid objective of our policy learning is:

$$U^{\text{hybrid}}(\theta) = \mathbb{E}_{s,a}\left[ A(s) \log \pi(a \mid s; \theta) + \alpha H(\pi(\cdot \mid s)) \right] + \mathbb{E}_{(\hat{s}, \hat{s}') \sim \rho(D)}\left[ \log \pi(M(\hat{s}, \hat{s}') \mid \hat{s}; \theta) \right]$$

where $\rho(D)$ is a sampling distribution on the expert state pairs. It could be uniform or biased to match the state distribution of the agent’s policy via curriculum learning. $H(\pi(\cdot \mid s))$ is the entropy of the policy for state $s$.
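As a rough numerical illustration of the hybrid objective, the sketch below evaluates it on small batches of policy outputs. The batch layout, the function names and the default entropy weight `alpha=0.01` are assumptions made for the example; in training, the advantage would be treated as a constant and the value would be maximised with respect to the policy parameters.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def hybrid_objective(rl_batch, expert_batch, alpha=0.01):
    """Monte Carlo estimate of the hybrid objective.

    rl_batch: list of (policy_probs, taken_action, advantage) from the
        agent's own rollouts.
    expert_batch: list of (policy_probs_on_expert_state, inferred_action).
    """
    rl_term = sum(
        adv * math.log(probs[a]) + alpha * entropy(probs)
        for probs, a, adv in rl_batch
    ) / len(rl_batch)
    bc_term = sum(
        math.log(probs[a_hat]) for probs, a_hat in expert_batch
    ) / len(expert_batch)
    return rl_term + bc_term


# Toy batches (assumed values) with two actions.
rl_batch = [([0.5, 0.5], 0, 2.0)]
expert_batch = [([0.25, 0.75], 1)]
value = hybrid_objective(rl_batch, expert_batch, alpha=0.0)
```

The two terms share the same policy distribution, so a gradient step on this value mixes the reward-driven signal with the cloning signal on inferred expert actions.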

Empirical Evaluation

To validate the proposed learning paradigm, we evaluate our proposed method on the Taxi domain [Dietterich1998] and eight Atari games from OpenAI Gym [Brockman et al.2016].

Taxi Domain

We first evaluate our method on the Taxi domain. In addition, we analyze the performance of our proposed method when different types of noise exist in the expert state sequences and show that our hybrid learning approach is more robust compared to pure behavioral cloning from expert state sequences  [Torabi, Warnell, and Stone2018a]. Last, we illustrate the parameter compression rate compared to the full tensor approach and analyze the parameter sensitivity.

Experiment Setup

As shown in Figure 3(a), our agent architecture consists of two main components: A2C and the action inference model. A2C uses a two-step forward estimate for the advantage function. The state is represented as a one-hot vector of length 500 for both the actor and critic. The policy and state values are computed via separate linear transformations. No parameters are shared between the actor and the critic. The action inference model first projects current states and next states to vectors of length 128. The matrices and of the action inference model are of size . The action inference model has rank 2. We use a hand-written rule to collect demonstrations covering all 500 states. The performance of this rule is optimal.
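The two-step advantage estimate mentioned above can be written out as follows. This is the standard n-step form with n = 2; the function name and the default discount `gamma=0.99` are assumptions, since the section does not state the discount factor.

```python
def two_step_advantage(r0, r1, v_s0, v_s2, gamma=0.99):
    """Two-step advantage estimate.

    A(s_0) = r_0 + gamma * r_1 + gamma^2 * V(s_2) - V(s_0),
    where r_0 and r_1 are the next two rewards and V is the critic.
    """
    return r0 + gamma * r1 + gamma ** 2 * v_s2 - v_s0


# Toy values (assumed): rewards 1 and 2, V(s_2) = 4, V(s_0) = 1.
adv = two_step_advantage(1.0, 2.0, v_s0=1.0, v_s2=4.0, gamma=0.5)
```

The critic is trained to shrink the square of this quantity, while the actor uses it to weight the log-probability of the action that was taken.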

To analyze our hybrid RL with expert state sequences, we compare with the following methods:

  • A2C: This method trains as a standard RL task, ignoring the expert state sequences. Its configuration is the same as our A2C component.

  • Behavioral cloning with duality action inference (BC-Dual): This agent does not consider RL signals and only optimizes its policies by cloning the inferred actions on expert states. The action inference model is the same as ours.

  • Imitation Learning (IL): Only this agent has access to the expert actions. This agent utilizes the expert actions to conduct behavioral cloning directly. No reinforcement learning signal is leveraged.

To evaluate our action inference model, we additionally compare two variants where our action inference model is replaced by a multi-layer perceptron (MLP), as illustrated in Figure 3(a). Similar to our action inference model, the states are first projected to vectors of length 128; the two state embeddings are then concatenated and passed through two fully connected layers, each with 128 units followed by a ReLU nonlinearity. The total number of parameters of the MLP is close to ours. (Footnote 3: The number of parameters of our action inference model is and the MLP is roughly .) The variants that replace our action inference model with the MLP are:

  • A2C with MLP-based action inference (Hybrid-MLP): It is the same as our proposed hybrid RL method except that the action inference model is replaced by the MLP baseline.

  • Behavioral cloning with MLP action inference (BC-MLP): It is the same as Behavioral cloning with duality action inference except the action inference model is replaced by the MLP baseline.

Each agent is trained and evaluated on 16 independent runs with different random seeds.

Figure 5: Action prediction accuracy on the expert actions of our action inference model and the MLP-based action inference model in hybrid policy learning at different training time steps.

Figure 6: Learning curves of our method and the A2C baseline on eight Atari games: (a) Alien, (b) BeamRider, (c) Breakout, (d) MsPacman, (e) Pong, (f) Qbert, (g) Seaquest, (h) SpaceInvaders. The figures show the average scores of the trained models with different numbers of image frames. The shaded regions represent one standard deviation.

Experiment Results

Figure 4(a) shows learning curves of different agents, and we make the following observations. 1) By comparing the hybrid agents with their pure imitation learning counterparts (Ours vs. BC-Dual, Hybrid-MLP vs. BC-MLP), we see that the hybrid agents perform better. Pure behavioral cloning training signals only discriminate optimal from non-optimal actions, while the reward signals help the agent to discriminate among all actions, which helps to identify good actions for exploring the environment. With the help of RL signals, the distribution of training data for the action inference model changes such that it learns faster on key states. The results also show that, while the pure behavioral cloning agent improves rapidly at the beginning of learning, it takes much longer than its hybrid counterpart to reach the optimal policy. 2) By comparing our action inference model to its MLP counterparts (Ours vs. Hybrid-MLP, and BC-Dual vs. BC-MLP), we see that the dual action inference model has a performance advantage. Since the dual model, unlike the MLP, does not have any nonlinear transformations, it is more data efficient in learning the environment dynamics and more robust to state distribution changes during learning, as shown by the action prediction accuracy of Ours vs. Hybrid-MLP in Figure 5. 3) Both hybrid agents outperform the pure A2C agents, which shows that the hybrid objective is effective in leveraging the expert state sequences to facilitate learning.

Effect of Ranks on Performance & Parameter Reduction

Our low-rank tensor model efficiently reduces the parameter space while keeping good enough performance. First, the full parameter tensor without any low-rank approximation technique contains parameters for the Taxi domain ( and ). In comparison, our best-performing rank-2 model has a total number of parameters of , compressing the original tensor at a ratio of . Figure 4(b) compares the learning curves with different ranks. Even when the rank is set to 1, the performance degrades but the advantage over pure A2C agents is preserved. Setting a higher rank (R=4) does not improve the results, and the learning curve is almost the same as that of the rank-2 model.

Robustness against Noise in Demonstrations

The above results demonstrate the performance advantage in an ideal setting where the expert state sequences cover the whole state space and the expert behaves optimally. We analyze the robustness of our agent against potential noise from the expert state sequences. We study two potential types of noise in expert state sequences: (1) the missing state ratio, namely the percentage of states that do not exist in the expert state sequences, and (2) the non-optimal action ratio, the percentage of state transitions caused by non-optimal actions. The performance of our agents for various values of these two ratios is summarized in Figure 4(c). Comparing the hybrid agents with the pure behavioral cloning agents, the hybrid agents are able to recover the optimal policies while the pure behavioral cloning agents get stuck at certain sub-optimal policies. This illustrates that the hybrid approach relies less on the optimality of the demonstrations. The figure also shows that non-optimal state transitions have a significant impact on the performance because the agent could get stuck in the Taxi domain, which could result in a sum of negative rewards until the maximum number of steps is reached.
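One way to simulate the first noise type is sketched below: a fraction of the source states is chosen at random and every expert pair starting from one of them is removed. How the noise was actually injected in the experiments is not described in this section, so the procedure and the name `drop_states` are assumptions for illustration only.

```python
import random


def drop_states(pairs, missing_ratio, rng):
    """Remove all expert pairs whose source state falls in a random subset.

    missing_ratio is the fraction of distinct source states to drop.
    """
    states = sorted({s for s, _ in pairs})
    n_drop = int(round(missing_ratio * len(states)))
    dropped = set(rng.sample(states, n_drop))
    return [(s, s2) for s, s2 in pairs if s not in dropped]


# Toy demonstration set (assumed): a chain of ten states.
pairs = [(i, i + 1) for i in range(10)]
noisy_pairs = drop_states(pairs, 0.3, random.Random(0))
```

The second noise type could be simulated analogously by replacing a fraction of the pairs with transitions produced by non-optimal actions.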

Atari Games

We evaluate our method on eight Atari games with machine generated state sequences to evaluate the scalability of our method.

Experiment Setup

Model Architecture

We adopt the commonly used CNN architecture as in [mnih2015humanlevel] for Atari games. As shown in Figure 3(b), the last four images are stacked in channel and rescaled to as state input. The state encoding function is a four-layer convolutional neural network. The first hidden layer convolves 32 8×8 filters with stride 4. The second layer convolves 64 4×4 filters with stride 2. The third layer convolves 32 3×3 filters with stride 1. The last layer of the state encoding function is fully-connected and consists of 512 output units. Each layer is followed by ReLU as nonlinearity. The last output vector is passed through a linear layer to generate the value estimates for the critic of A2C and through another linear layer followed by a softmax as the policy for the actor of A2C. Our action inference model shares the same state encoding CNN. The last output vector of length 512 is first passed through a linear layer with 128 output units. The matrices are all . The rank is set to be 8.
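The spatial sizes flowing through the three convolutional layers can be checked with a few lines of arithmetic. The sketch assumes an 84×84 input, the size commonly used with this architecture, and no padding; neither value is stated in this section, so both are labeled assumptions.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution along one spatial axis."""
    return (size - kernel) // stride + 1


size = 84  # assumed input resolution, not given in the text above
sizes = []
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:  # layers described above
    size = conv_out(size, kernel, stride)
    sizes.append(size)

# The third layer has 32 filters, so the flattened feature length that
# feeds the 512-unit fully-connected layer is size * size * 32.
flat_features = size * size * 32
```

Under these assumptions the feature map shrinks to 20, then 9, then 7 pixels per side, giving 1568 inputs to the fully-connected layer.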

We use A2C agents pre-trained for 5 million frames to generate 100 trajectories as demonstration state sequences. Each trajectory is terminated when one player life is lost. Similar to the Taxi domain, the agents are trained and evaluated on 16 simultaneous environments with different random seeds.

Expert State Sampling Curriculum

Unlike the Taxi domain, the state mismatch between the demonstration and the agent at the beginning of learning is significant. As the action inference model is only trained on the agent’s own state distribution, the demonstration states that are far from the agent’s own experience could be wrongly labeled. Because of this, the pure behavioral cloning from demonstration agents (the BC- variants) fail to achieve reasonable performance. On the other hand, learning from states that are far away from the agent’s own state distribution is not immediately helpful. To mitigate such mismatch, we apply a curriculum in sampling the demonstration when optimizing the policies. Specifically, we only sample from an initial window of time steps of each trajectory at the beginning of learning, and we gradually enlarge this window by one step approximately every 8,000 frames. In this way, the sampled state distribution of the demonstration better matches the agent’s own experience. Furthermore, we only use the inferred actions to optimize the agent’s policies after 100,000 frames, when the action prediction model becomes reliable.
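The sampling curriculum can be sketched as follows. The 8,000-frame growth interval and the 100,000-frame warm-up come from the description above; the initial window of one step, the function names and the uniform sampling within the window are assumptions made for the example.

```python
import random


def curriculum_horizon(frames_seen, initial_horizon=1, frames_per_step=8000):
    """Number of leading time steps of each trajectory open for sampling."""
    return initial_horizon + frames_seen // frames_per_step


def sample_expert_pairs(trajectories, frames_seen, batch_size, rng,
                        warmup_frames=100_000):
    """Sample expert state pairs from the currently allowed prefix.

    Returns an empty batch during warm-up, when the inferred actions are
    not yet used to update the policy.
    """
    if frames_seen < warmup_frames:
        return []
    horizon = curriculum_horizon(frames_seen)
    pool = [
        (t[i], t[i + 1])
        for t in trajectories
        for i in range(min(horizon, len(t) - 1))
    ]
    return [rng.choice(pool) for _ in range(batch_size)] if pool else []


# Toy trajectory (assumed): three states, so at most two pairs.
batch = sample_expert_pairs([[0, 1, 2]], 100_000, 4, random.Random(0))
```

Early in training only the first few transitions of each demonstration are eligible, which keeps the cloned states close to those the agent has already visited.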

Experiment Results

The learning curves of our agent and the A2C baseline are shown in Figure 6. Of the eight games we evaluate, the hybrid agent is able to leverage expert demonstrations to speed up learning on six (Alien, BeamRider, Breakout, MsPacman, Pong and Qbert, Figure 6(a-f)). For the other two games: on Seaquest, A2C seems to get stuck at a score of 1800, and neither the hybrid agent nor the A2C agent is able to escape from that local optimum; on SpaceInvaders, the action inference has the worst accuracy among all games.


We have proposed an iterative learning paradigm that facilitates decision-making by utilizing demonstrations from experts. Different from many previous approaches, we consider a realistic and difficult setting in which the actions performed by the experts are unavailable. To better make use of the state-only demonstrations, we propose to learn a novel tensor-based action inference model based on the agent’s own experience. The learned dynamics are then used to infer the missing actions from the expert demonstrations. Finally, a hybrid objective is proposed that improves the policy via imitation learning and A2C jointly. The experimental results on eight Atari games and an illustrative Taxi domain demonstrate the advantageous performance of our model against a set of baselines. We also show that our model is robust against noisy expert state trajectories.