Introduction
Human expert behavioral data are widely used for policy learning in sequential decision-making tasks [Schaal1999, Argall et al.2009]. One of the most effective paradigms is imitation learning, where a policy is trained via direct supervision to clone expert behaviors [Pomerleau1989, Ross, Gordon, and Bagnell2011]. Imitation learning generally requires both observable states and actions as input. However, expert actions are often unavailable or not directly usable. The literature has considered scenarios where the expert and the imitation learner have different viewpoints [Liu et al.2017], temporal resolutions, or action sets [Yu et al.2018]. In such cases, cloning behavior directly is not an option. How to leverage such data to facilitate learning is a realistic and challenging problem.
In this paper, we investigate a novel learning scenario where an agent learns from both its own experience and state-trajectory-only expert demonstrations. We propose an iterative learning framework, illustrated in Figure 1. First, we learn a novel tensor-based action inference model as the learning agent interacts with the environment. Our model enforces a duality for consistent learning: the action inferred from two consecutive states must reconstruct the latter state. Then, upon observing demonstrations without actions, the learned dynamics model is used to infer the missing actions. Finally, we improve the policy by jointly considering the imitation performance and the rewards from environment interaction via Advantage Actor-Critic (A2C) [Dhariwal et al.2017].
To demonstrate the effectiveness of the proposed iterative learning process, we conduct experiments on the Taxi domain as well as eight commonly used Atari games. The experimental results confirm a faster convergence rate of the proposed framework compared to advantage actor-critic alone, and a better policy compared to behavioral cloning from observations (BCO) [Torabi, Warnell, and Stone2018a], which uses a similar action inference approach for behavioral cloning but ignores potential reward signals. We additionally show that our framework is robust against noisy expert state trajectories and works well even when the number of demonstrations is limited.
Related Work
Imitation Learning
The tasks of learning from demonstrations and imitation learning have attracted considerable research attention from many fields in machine learning. The approaches are roughly divided into two groups, behavioral cloning and Inverse Reinforcement Learning (IRL). Survey articles include [Schaal1999, Billard et al.2008, Argall et al.2009]. Behavioral Cloning (BC) uses supervised learning, where the learner directly regresses onto the policy of the expert [Pomerleau1989, Ross, Gordon, and Bagnell2011, Liu et al.2017]. This requires observation-action tuples and cannot be applied when actions are absent. On the other hand, IRL methods aim to infer the goal of an expert, expressed as a reward function [Ng and Russell2000, Abbeel and Ng2004, Ziebart et al.2008, Levine, Popovic, and Koltun2011, Borsa et al.2017]. This can then be used for RL to recover a policy [Ratliff, Bagnell, and Zinkevich2006, Ramachandran and Amir2007]. Contemporaneous work [Torabi, Warnell, and Stone2018b] extends BC to Behavioral Cloning from Observations (BCO), a setting where expert state transitions are observed but actions are not. Our setting differs in that we hybridize BCO with RL.
Hybrid RL with Expert Actions
Several works have focused on combining standard RL with supervised training on expert actions, for example, by alternating steps of RL and IRL to obtain more accurate estimates [Ho and Ermon2016, Finn, Levine, and Abbeel2016]. Although both lines of research have been successfully applied to a variety of tasks, they almost always assume that the state-action trajectories of experts lie in the same space as the one the learner observes. As argued in some recent work [Stadie, Abbeel, and Sutskever2017, Duan et al.2017, Liu et al.2017], such an assumption is restrictive and unrealistic. Instead, these works propose to discover the transformations between the learner and the teacher state spaces. Additional recent work combines human expert data and RL in policy learning. Gilbert et al. [Gilbert et al.2015], Lipton et al. [Lipton et al.2016] and [xxyyzz] store expert state-action pairs in a replay buffer to facilitate learning. Hosu and Rebedea [Hosu and Rebedea2016] utilize expert state-action pairs to facilitate action value learning. Subramanian, Isbell Jr, and Thomaz [Subramanian, Isbell Jr, and Thomaz2016] leverage human data for efficient exploration. Hester et al. [Hester et al.2017] combine several approaches and demonstrate superior performance on Atari games. Nair et al. [Nair et al.2017] report significant progress in training robotic tasks using a combination of RL and a small set of human demonstrations. Our method differs from all these approaches in that we do not assume expert actions are available.
Very recent work has started to investigate leveraging expert state sequences alone to accelerate imitation learning. Aytar et al. [Aytar et al.2018] utilize state-only demonstration data to address the hard exploration issue in Atari games. Zhu et al. [Zhu et al.2018] leverage a small amount of demonstration data to assist a reinforcement learning agent in robotic manipulation tasks.
Model-based RL
Researchers have known for decades that learning a domain model concurrently with learning a behavior policy can significantly improve over model-free RL [Sutton1991]. Chebotar et al. [Chebotar et al.2017] recently demonstrated much better sample efficiency using model-based RL for robotics tasks, without requiring demonstrations. Oh et al. [Oh et al.2015] and Machado et al. [Machado et al.2018] trained deep neural network models to predict the next frame or the successor representation in Atari games with good effect. They applied element-wise multiplications on the state and action embeddings to obtain the embedding of the next state, which can be viewed as a special case of our model.
Sequential Decision Making with Expert State Sequences
We formulate the sequential decision-making task as a Markov decision process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the state space and $\mathcal{A}$ is the action space. $P$ is the state transition function, and $P(s' \mid s, a)$ is the probability that the next state is $s'$ given that action $a$ is taken in the current state $s$. $R$ is the reward function, with $R(s, a)$ being the expectation of the immediate reward $r$ of taking action $a$ in state $s$. $\gamma$ is a temporal discounting factor. A stochastic policy $\pi(a \mid s)$ specifies the action to take in state $s$. A state value function is defined as the expected sum of discounted rewards following a policy $\pi$ from a state $s$: $V^{\pi}(s) = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s]$. Similarly, a state-action value function is defined as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \mid s_{0} = s, a_{0} = a]$. The optimal policy $\pi^{*}$ has action value function $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$. Taking actions greedily with respect to $Q^{*}$ yields the optimal policy $\pi^{*}$.
In addition, we assume that a set of state sequences (with unknown actions) demonstrated by an expert is given. We denote the expert state sequences as a set of state pairs, $D = \{(\hat{s}_i, \hat{s}'_i)\}$, where $\hat{s}'_i$ is the next state of $\hat{s}_i$. Note that we do not assume that states are consecutive across different pairs, i.e., $\hat{s}'_i$ is not necessarily equal to $\hat{s}_{i+1}$. We aim to design a flexible framework that can accommodate datasets in various formats.
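As a small illustration of the value-function definitions above, the following sketch computes $Q^{*}$ and the greedy policy for a toy two-state MDP by repeated Bellman optimality backups. The MDP, its rewards, and the helper name `q_value_iteration` are hypothetical and are not taken from the paper.

```python
# Toy 2-state, 2-action MDP (all numbers are illustrative, not from the paper).
# P[s][a] lists (next_state, probability); R[s][a] is the expected immediate reward.
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}
gamma = 0.9


def q_value_iteration(P, R, gamma, sweeps=500):
    """Apply the Bellman optimality backup to every (s, a) pair for a fixed number of sweeps."""
    Q = {s: {a: 0.0 for a in P[s]} for s in P}
    for _ in range(sweeps):
        # The new table is built entirely from the previous one before it is rebound.
        Q = {
            s: {
                a: R[s][a] + gamma * sum(p * max(Q[s2].values()) for s2, p in P[s][a])
                for a in P[s]
            }
            for s in P
        }
    return Q


Q_star = q_value_iteration(P, R, gamma)
# Acting greedily with respect to Q* gives the optimal policy.
pi_star = {s: max(Q_star[s], key=Q_star[s].get) for s in Q_star}
```

In this toy problem the greedy policy always moves toward, and then stays in, the rewarding state 1.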
Hybrid Reinforcement Learning with Expert State Sequences
To utilize the expert state sequences, our method learns a model of the environment to infer the missing expert actions from consecutive expert states. The inferred actions combined with the expert states are then used to provide additional supervision on the policy of our agent via behavioral cloning. The learning paradigm of our agent is illustrated in Figure 1. The agent interacts with the environment following its current policy $\pi$, as traditional RL agents do. $\theta$ denotes the learnable parameters of the policy $\pi$. When the agent interacts with the environment, the state-action-next-state tuples $(s, a, s')$ of its experience are collected to additionally train an action inference model $M$, which maps a state and next-state pair $(s, s')$ to an action. The details of the model are provided in the Low-Rank Tensor Formulation section. We sample a batch of consecutive expert state pairs $(\hat{s}, \hat{s}')$ and apply the action inference model to obtain the action estimate $\hat{a} = M(\hat{s}, \hat{s}')$ for each expert state pair. The action estimate $\hat{a}$ and the expert state $\hat{s}$ are combined into a batch of expert state-action pairs to optimize the policy of the agent via behavioral cloning. The agent simultaneously applies RL to optimize its policy.
In the rest of this section, we provide details of the action inference model $M$, followed by the hybrid training objective combining behavioral cloning and RL.
Modeling State Transition Dynamics and Action Inference
Traditional model-based RL usually estimates the forward dynamics of the stochastic state transitions of the environment as $P^{f}(s' \mid s, a)$, the likelihood of the next state $s'$ conditioned on the current state $s$ and action $a$. Ideally, the action inference model $M$ should be consistent with such a model-based RL approach. Specifically, the output of the action inference model should be the action maximizing the likelihood of the observed state transition. Such an approach may require evaluating $P^{f}(s' \mid s, a)$ for every candidate action in order to find the best action output. A more computation-friendly solution is to obtain a consistent view of the state transitions that directly estimates the likelihood of actions conditioned on two consecutive states, $P^{i}(a \mid s, s')$, the likelihood of action $a$ for the state pair $(s, s')$. Tensors offer a natural representation of the joint model of $P^{f}$ and $P^{i}$. A tensor is able to compute the two views of the state transitions effectively. One technical contribution of this paper is a tensor-based action inference model that maintains the two views of the environment dynamics consistently and simultaneously.
Low-Rank Tensor Formulation
Motivation in Small Domains
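For an MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, $P^{f}$ and $P^{i}$ can be represented as tensor multiplications on a shared tensor $T \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{A}| \times |\mathcal{S}|}$, where $T_{s,a,s'}$ stores the count, $c(s, a, s')$, of the tuple $(s, a, s')$ in the agent's experience. The maximum likelihood estimates of $P^{f}$ and $P^{i}$ can then be represented as:
$$P^{f}(\cdot \mid s, a) = \frac{T \times_1 e_s \times_2 e_a}{\| T \times_1 e_s \times_2 e_a \|_1}, \qquad P^{i}(\cdot \mid s, s') = \frac{T \times_1 e_s \times_3 e_{s'}}{\| T \times_1 e_s \times_3 e_{s'} \|_1},$$
where $e_x$ is a one-hot vector at location $x$, $\|\cdot\|_1$ is the $\ell_1$ norm of a vector, and $\times_m$ denotes the mode-$m$ product, defined as $(T \times_m v)_{\ldots i_{m-1} i_{m+1} \ldots} = \sum_{i_m} T_{\ldots i_{m-1} i_m i_{m+1} \ldots} v_{i_m}$. Note that $T \times_1 e_s \times_2 e_a \times_3 e_{s'} = c(s, a, s')$, which can be viewed as a score of the tuple $(s, a, s')$ from tensor multiplications. $T \times_1 e_s \times_2 e_a$ is a vector of length $|\mathcal{S}|$ whose entries are $c(s, a, \cdot)$, and $T \times_1 e_s \times_3 e_{s'}$ is a vector of length $|\mathcal{A}|$ whose entries are $c(s, \cdot, s')$. $P^{f}(\cdot \mid s, a)$ and $P^{i}(\cdot \mid s, s')$ are the normalized versions of these two vectors. By organizing the counts of the tuples $(s, a, s')$ into a tensor, $P^{f}$ and $P^{i}$ can be modeled jointly by the same tensor in the tabular case. For MDPs with a large state space, however, such a tensor would be both memory-demanding and computationally expensive.
We reduce the tensor computation via the following approaches: (1) We reduce the dimensionality of the tensor by embedding the states in a lower-dimensional space. We embed the state representation $\phi(s)$ (for example, $\phi(s)$ could be a one-hot vector from a lookup table for small state space domains, or the output vector of a ConvNet representing image inputs for video game domains) into a lower-dimensional space via a matrix $W_s$: $h_s = W_s \phi(s)$. Instead of using the embedding of the next state $s'$, we use the embedding of the difference between the two states in the tensor multiplications. The rationale for embedding state differences instead of next states directly is that not all the information in the next state is relevant to the action. Moreover, the difference between two states can be embedded in an even lower-dimensional space, since not all state information is relevant to state transitions. The effects of actions are more closely related to the state differences. We denote the state difference embedding as $h_{\delta s}$, where $h_{\delta s} = W_{\delta}(\phi(s') - \phi(s))$. Similarly, we embed the actions as $h_a = W_a e_a$. Thus the score of a tuple $(s, a, s')$ can be computed as $T \times_1 h_s \times_2 h_a \times_3 h_{\delta s}$, where $T$ is now a tensor over the embedding dimensions. Figure 2 demonstrates the above formulation with an Atari game example. Under this formulation, the predicted representations of the action and the state difference are:
$$\hat{h}_a = T \times_1 h_s \times_3 h_{\delta s}, \qquad \hat{h}_{\delta s} = T \times_1 h_s \times_2 h_a.$$
In order to have a consistent score for the tuple $(s, a, s')$ using either the predicted embedding or the original embedding, part of the learning objective is to minimize the distance between the predicted representations and the original ones: $\| h_a - \hat{h}_a \|_1 + \| h_{\delta s} - \hat{h}_{\delta s} \|_1$.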
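The tabular motivation at the start of this subsection can be sketched in a few lines: one table of counts yields both the forward view and the inverse (action inference) view by normalizing along different axes. The index order `counts[s, a, s']`, the toy experience tuples, and the function names are assumptions made for illustration only.

```python
import numpy as np

num_states, num_actions = 3, 2
# Hypothetical experience tuples (s, a, s') from a toy 3-state problem.
experience = [(0, 1, 1), (0, 1, 1), (0, 0, 0), (1, 1, 2), (1, 0, 0), (0, 1, 0)]

# Shared count tensor indexed as counts[s, a, s'] (the mode order is an assumption).
counts = np.zeros((num_states, num_actions, num_states))
for s, a, s_next in experience:
    counts[s, a, s_next] += 1


def forward_view(counts, s, a):
    """Maximum-likelihood estimate of P^f(. | s, a): normalize the counts over s'."""
    fibre = counts[s, a, :]
    return fibre / fibre.sum()  # undefined if (s, a) was never visited


def inverse_view(counts, s, s_next):
    """Maximum-likelihood estimate of P^i(. | s, s'): normalize the counts over a."""
    fibre = counts[s, :, s_next]
    return fibre / fibre.sum()  # undefined if (s, s') was never observed


p_next = forward_view(counts, s=0, a=1)          # distribution over next states
p_action = inverse_view(counts, s=0, s_next=1)   # distribution over actions
inferred_action = int(np.argmax(p_action))       # most likely action for the pair (0, 1)
```

Both views read from the same tensor, which is the consistency property the low-rank model is designed to keep once states are embedded rather than enumerated.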
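(2) Even though $T$ is now independent of the sizes of the state space and the action space, it may still be computationally expensive. To further reduce the computation, we first introduce symmetry into the tensor, with $T_{i,j,k} = T_{i,k,j}$, by assuming that the action and the difference in the state embedding between the current state and the next state can be embedded similarly when conditioned on the current state. We then approximate the tensor slices as sums of rank-1 matrices, following Li et al. [Li et al.2017]:
$$T_{:,k,:} \approx \sum_{r=1}^{R} U^{k}_{:,r} \otimes V^{k}_{:,r},$$
where $U^{k}$ and $V^{k}$ are matrices with $R$ columns, and $\otimes$ denotes the outer product. $U^{k}_{:,r}$ denotes the $r$-th column of the matrix $U^{k}$. Thus each element of $\hat{h}_a$ can be written as $\hat{h}_a[k] = \sum_{r=1}^{R} (h_s^{\top} U^{k}_{:,r}) (h_{\delta s}^{\top} V^{k}_{:,r})$. Letting $\circ$ denote the Hadamard (pointwise) product, we have:
$$\hat{h}_a = \sum_{r=1}^{R} (U_r^{\top} h_s) \circ (V_r^{\top} h_{\delta s}),$$
where $U_r$ and $V_r$ collect the $r$-th columns of all $U^{k}$ and $V^{k}$, respectively. With our symmetric $T$, the tensor slices can be approximated by the same $U_r$ and $V_r$, thus we have:
$$\hat{h}_{\delta s} = \sum_{r=1}^{R} (U_r^{\top} h_s) \circ (V_r^{\top} h_a).$$
This gives a computationally efficient dual state transition model that predicts both $\hat{h}_a$ ($a$ is the action) and $\hat{h}_{\delta s}$ ($\delta s$ is the state difference). The predicted $\hat{h}_a$ is then used to predict the action probability $P^{i}(a \mid s, s') = \mathrm{softmax}(W^{i} \hat{h}_a + b^{i})_a$, where $W^{i}$ and $b^{i}$ are the weights and bias of a linear classification layer.
Similarly, the probability of the next state is $P^{f}(s' \mid s, a) = \mathrm{softmax}(W^{f} \hat{h}_{\delta s} + b^{f})_{s'}$, where $W^{f}$ and $b^{f}$ are the weights and bias of a second linear classification layer. (Exactly following the tensor scoring function gives the probability estimate $P^{i}(a \mid s, s') = \mathrm{softmax}(H_a^{\top} \hat{h}_a)_a$, where $H_a$ is the concatenation of the $h_a$ for all $a$; and $P^{f}(s' \mid s, a) = \mathrm{softmax}(H_{\delta s}^{\top} \hat{h}_{\delta s})_{s'}$, where $H_{\delta s}$ is the concatenation of the $h_{\delta s}$ for all $s'$. However, as adopted by Li et al. [Li et al.2017], the proposed estimation gives better empirical results. This is possibly because the classification objective benefits from more free parameters.) Furthermore, $P^{f}$ can be replaced by a reconstruction loss for large state spaces, as used by Oh et al. [Oh et al.2015].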
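The following is a minimal sketch of a rank-limited dual predictor in the spirit of the formulation above, not a reproduction of the authors' implementation. It assumes each rank component has one square factor matrix for the state embedding and one for the partner embedding, with both prediction directions sharing the same factors; the names `U`, `V`, and `predict_partner`, the sizes, and the random values are all hypothetical. The last lines check numerically that the Hadamard form equals contraction with the full three-way tensor implied by the factors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank = 8, 2  # embedding size and rank are illustrative only

# One pair of factor matrices per rank component, shared by both directions.
U = rng.normal(scale=0.3, size=(rank, d, d))  # applied to the state embedding
V = rng.normal(scale=0.3, size=(rank, d, d))  # applied to the action or state-difference embedding


def predict_partner(h_s, h_other):
    """Sum over rank components of Hadamard products of two linear projections."""
    return sum((U[r] @ h_s) * (V[r] @ h_other) for r in range(rank))


h_s, h_a, h_ds = rng.normal(size=(3, d))
h_a_hat = predict_partner(h_s, h_ds)   # predicted action embedding
h_ds_hat = predict_partner(h_s, h_a)   # predicted state-difference embedding

# Equivalent full tensor: T[i, k, j] = sum_r U[r, k, i] * V[r, k, j].
T = np.einsum("rki,rkj->ikj", U, V)
h_a_hat_full = np.einsum("ikj,i,j->k", T, h_s, h_ds)

full_params = T.size               # d ** 3 entries in the dense tensor
factor_params = U.size + V.size    # 2 * rank * d * d entries in the factors
```

The factorized form never materializes `T`, so its cost grows with `rank * d * d` rather than `d ** 3`.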
Model architectures for (a) Taxi and (b) Atari games. The Multi-Layer Perceptron (MLP) action inference baseline is also visualized in (a).
Learning Objective for the State Transition Model
The experience of the agent while interacting with the environment is used to optimize the dual state transition model $M$. Since $M$ is end-to-end trainable, we optimize $M$ to maximize the likelihood of the tuples $(s, a, s')$ of the agent's own experience.
The training objective is defined as follows:
$$L^{\text{dual-model}} = \mathbb{E}_{(s,a,s')} \big[ -\log P^{f}(s' \mid s, a) - \log P^{i}(a \mid s, s') + \| h_a - \hat{h}_a \|_1 + \| h_{\delta s} - \hat{h}_{\delta s} \|_1 \big]$$
In our learning scenario, only $P^{i}$ is relevant, so we simplify the objective to model only the action inference part:
$$L^{\text{act}} = \mathbb{E}_{(s,a,s')} \big[ -\log P^{i}(a \mid s, s') + \| h_a - \hat{h}_a \|_1 + \| h_{\delta s} - \hat{h}_{\delta s} \|_1 \big]$$
The learned $M$ is used to infer the actions given two consecutive expert states $(\hat{s}, \hat{s}')$: $\hat{a} = M(\hat{s}, \hat{s}') = \arg\max_{a} P^{i}(a \mid \hat{s}, \hat{s}')$.
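To make the simplified objective concrete, the sketch below evaluates the per-sample action inference loss (negative log-likelihood of the taken action plus the two L1 consistency terms) and the arg-max action used for labeling expert pairs. It assumes the model exposes unnormalized action scores (`logits`) and the four embedding vectors; those inputs and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def l1_distance(u, v):
    """L1 distance between two equally sized vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def action_inference_loss(logits, action, h_a, h_a_hat, h_ds, h_ds_hat):
    """Per-sample loss: -log P^i(a | s, s') plus the two embedding consistency terms."""
    nll = -log_softmax(logits)[action]
    return nll + l1_distance(h_a, h_a_hat) + l1_distance(h_ds, h_ds_hat)


def infer_action(logits):
    """Action with the highest inferred likelihood for a state pair."""
    return max(range(len(logits)), key=lambda a: logits[a])


# Illustrative values only.
example_loss = action_inference_loss(
    logits=[0.0, 0.0], action=0,
    h_a=[1.0, 0.0], h_a_hat=[0.5, 0.0],
    h_ds=[0.0, 1.0], h_ds_hat=[0.0, 1.0],
)
example_action = infer_action([0.1, 2.0, -1.0])
```

In training, this quantity would be averaged over the agent's own $(s, a, s')$ tuples; only `infer_action` is needed when labeling expert state pairs.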
Learning curves of different agents on the Taxi domain. The figures show the average cumulative rewards across 80,000 time steps. The shaded regions represent the standard deviation of the average cumulative rewards. The figures are best viewed in color. (a) The performance of our method and different baselines. BC denotes agents that use only the inferred actions on expert states to optimize their policies. MLP/Dual denote the architecture of the action inference model. (b) The performance of our method and variants with different action inference model ranks. (c) The performance of our method under different types of noise in the expert state sequences.
Hybrid Learning Objective
The hybrid training objective of the policy combines the RL objective, which maximizes the expected sum of discounted rewards, and the imitation objective, which maximizes the likelihood of the inferred actions on the demonstrated states. Our RL method is Advantage Actor-Critic (A2C). A2C learns the state value by minimizing the squared advantage function values $A(s)^2$. A2C optimizes the policy via the policy gradient $A(s) \nabla_{\theta} \log \pi(a \mid s; \theta)$. Let $\theta$ denote the parameters of the policy $\pi$. The hybrid objective of our policy learning is:
$$U^{\text{hybrid}}(\theta) = \mathbb{E}_{s,a} \big[ A(s) \log \pi(a \mid s; \theta) + \alpha H(\pi(\cdot \mid s)) \big] + \mathbb{E}_{(\hat{s}, \hat{s}') \sim \rho(D)} \big[ \log \pi(M(\hat{s}, \hat{s}') \mid \hat{s}; \theta) \big]$$
where $\rho(D)$ is a sampling distribution on the expert state pairs. It could be uniform or biased to match the state distribution of the agent's policy via curriculum learning. $H(\pi(\cdot \mid s))$ is the entropy of the policy for state $s$, and $\alpha$ weights the entropy term.
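A minimal sketch of how the hybrid objective could be evaluated on one batch is given below. It assumes the policy probabilities, sampled actions, and advantage estimates are already available as plain Python values, treats the advantage as a constant, and averages each expectation over its batch; the batch layout and function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def hybrid_objective(rl_batch, expert_batch, alpha):
    """Monte Carlo estimate of the hybrid objective (to be maximized).

    rl_batch:     list of (policy_probs, taken_action, advantage) from the agent's rollouts.
    expert_batch: list of (policy_probs, inferred_action) on sampled expert states.
    """
    rl_term = sum(
        adv * math.log(probs[a]) + alpha * entropy(probs)
        for probs, a, adv in rl_batch
    ) / len(rl_batch)
    bc_term = sum(
        math.log(probs[a_hat]) for probs, a_hat in expert_batch
    ) / len(expert_batch)
    return rl_term + bc_term


# Illustrative values only.
rl_batch = [([0.5, 0.5], 0, 2.0)]
expert_batch = [([0.25, 0.75], 1)]
value = hybrid_objective(rl_batch, expert_batch, alpha=0.01)
```

In an automatic-differentiation framework, the same expression would be built from the policy network's outputs so that its gradient with respect to $\theta$ combines the A2C and behavioral cloning signals.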
Empirical Evaluation
To validate the proposed learning paradigm, we evaluate our proposed method on the Taxi domain [Dietterich1998] and eight Atari games from OpenAI Gym [Brockman et al.2016].
Taxi Domain
We first evaluate our method on the Taxi domain. In addition, we analyze the performance of our proposed method when different types of noise exist in the expert state sequences and show that our hybrid learning approach is more robust than pure behavioral cloning from expert state sequences [Torabi, Warnell, and Stone2018a]. Finally, we illustrate the parameter compression rate compared to the full tensor approach and analyze the parameter sensitivity.
Experiment Setup
As shown in Figure 2(a), our agent architecture consists of two main components: A2C and the action inference model. A2C uses a two-step forward estimate of the advantage function. The state is represented as a one-hot vector of length 500 for both the actor and the critic. The policy and the state values are computed via separate linear transformations. No parameters are shared between the actor and the critic. The action inference model first projects the current and next states to vectors of length 128. The factor matrices of the action inference model operate on these embeddings, and the model has rank 2. We use a human-designed rule to collect demonstrations covering all 500 states. The performance of the human rule is optimal.
To analyze our hybrid RL with expert state sequences, we compare with the following methods:

A2C: This method trains as a standard RL task, ignoring the expert state sequences. Its configuration is the same as our A2C component.

Behavioral cloning with duality action inference (BCDual): This agent does not consider RL signals and only optimizes its policies by cloning the inferred actions on expert states. The action inference model is the same as ours.

Imitation Learning (IL): Only this agent has access to the expert actions. This agent utilizes the expert actions to conduct behavioral cloning directly. No reinforcement learning signal is leveraged.
To evaluate our action inference model, we additionally compare two variants in which our action inference model is replaced by a multilayer perceptron (MLP), as illustrated in Figure 2(a). Similar to our action inference model, the states are first projected to vectors of length 128; the two state embeddings are then concatenated and passed through two fully connected layers, each with 128 units followed by a ReLU nonlinearity. The total number of parameters of the MLP is close to ours. The variants that replace our action inference model with the MLP are:
A2C with MLP-based action inference (HybridMLP): It is the same as our proposed hybrid RL method except that the action inference model is replaced by the MLP baseline.

Behavioral cloning with MLP action inference (BCMLP): It is the same as Behavioral cloning with duality action inference except the action inference model is replaced by the MLP baseline.
Each agent is trained and evaluated on 16 independent runs with different random seeds.
Experiment Results
Figure 3(a) shows learning curves of different agents, and we make the following observations. 1) Comparing the hybrid agents with their pure imitation learning counterparts (Ours vs. BCDual, HybridMLP vs. BCMLP), we see that the hybrid agents perform better. Pure behavioral cloning training signals only discriminate optimal from non-optimal actions, while the reward signals help the agent discriminate among all actions, which can help identify good actions for exploring the environment. With the help of RL signals, the distribution of training data for the action inference model changes such that it learns faster on key states. The results also show that, while the pure behavioral cloning agent improves rapidly at the beginning of learning, it takes much longer than its hybrid counterpart to reach the optimal policy. 2) Comparing our action inference model with its MLP counterparts (Ours vs. HybridMLP, and BCDual vs. BCMLP), the dual action inference model demonstrates performance advantages. Because the dual model, unlike the MLP, has no nonlinear transformations, it is more data-efficient in learning the environment dynamics and more robust to state distribution changes during learning, as shown by the action prediction accuracy of Ours vs. HybridMLP in Figure 5. 3) Both hybrid agents outperform the pure A2C agents, which shows that the hybrid objective is effective in leveraging the expert state sequences to facilitate learning.
Effect of Ranks on Performance & Parameter Reduction
Our low-rank tensor model efficiently reduces the parameter space while maintaining good performance. First, the full parameter tensor without any low-rank approximation contains $|\mathcal{S}| \times |\mathcal{S}| \times |\mathcal{A}|$ parameters for the Taxi domain ($|\mathcal{S}| = 500$ and $|\mathcal{A}| = 6$). In comparison, our best-performing rank-2 model has far fewer parameters, compressing the original tensor considerably. Figure 3(b) compares the learning curves for different ranks. Even when the rank is set to 1, performance degrades but the advantage over pure A2C agents is preserved. Setting a higher rank (R=4) does not improve the results; the learning curve is almost the same as that of the rank-2 model.
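The scale of the savings can be illustrated with simple counting. The sketch below is not a reproduction of the paper's reported figures: it assumes the standard Taxi sizes (500 states, 6 actions), an embedding size of 128, and a hypothetical layout with two square factor matrices per rank component, ignoring projection and classifier weights.

```python
def tabular_tensor_params(num_states, num_actions):
    """Entries of a count tensor indexed by (state, action, next state)."""
    return num_states * num_actions * num_states


def core_tensor_params(embed_dim):
    """Entries of a dense three-way tensor over embeddings of equal size."""
    return embed_dim ** 3


def factor_params(embed_dim, rank):
    """Entries of 2 * rank square factor matrices (hypothetical layout)."""
    return 2 * rank * embed_dim * embed_dim


tabular = tabular_tensor_params(num_states=500, num_actions=6)
core = core_tensor_params(embed_dim=128)
factors_by_rank = {rank: factor_params(embed_dim=128, rank=rank) for rank in (1, 2, 4)}
```

Under these assumptions the factor count grows linearly with the rank, whereas the dense tensors grow with the square of the state count or the cube of the embedding size.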
Robustness against Noise in Demonstrations
The above results demonstrate the performance advantage in an ideal setting where the expert state sequences cover the whole state space and the expert behaves optimally. We now analyze the robustness of our agent against potential noise in the expert state sequences. We study two types of noise: (1) the missing state ratio, namely the percentage of states that do not appear in the expert state sequences, and (2) the non-optimal action ratio, the percentage of state transitions caused by non-optimal actions. The performance of our agents for various values of these two ratios is summarized in Figure 3(c). Compared with the pure behavioral cloning agents, the hybrid agents are able to recover the optimal policies, while the pure behavioral cloning agents get stuck at certain suboptimal policies. This illustrates that the hybrid approach relies less on the optimality of the demonstrations. The figure also shows that non-optimal state transitions have a significant impact on performance, because the agent can get stuck in the Taxi domain, which results in a sum of negative rewards until the maximum number of steps is reached.
Atari Games
We test the scalability of our method on eight Atari games with machine-generated state sequences.
Experiment Setup
Model Architecture
We adopt the commonly used CNN architecture of Mnih et al. [Mnih et al.2015] for Atari games. As shown in Figure 2(b), the last four images are stacked along the channel dimension and rescaled to form the state input. The state encoding function is a four-layer convolutional neural network. The first hidden layer convolves 32 filters of size 8×8 with stride 4. The second layer convolves 64 filters of size 4×4 with stride 2. The third layer convolves 32 filters of size 3×3 with stride 1. The last layer of the state encoding function is fully connected and consists of 512 output units. Each layer is followed by a ReLU nonlinearity. The last output vector is passed through a linear layer to generate the value estimates for the critic of A2C, and through another linear layer followed by a softmax to give the policy for the actor of A2C. Our action inference model shares the same state encoding CNN. The last output vector of length 512 is first passed through a linear layer with 128 output units. The factor matrices of the action inference model all have the same size, and the rank is set to 8.
We use pretrained A2C agents trained for 5 million frames to generate 100 trajectories as demonstration state sequences. Each trajectory is terminated when one player life is lost. Similar to the Taxi domain, the agents are trained and evaluated on 16 simultaneous environments with different random seeds.
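As a quick check on the encoder described above, the sketch below traces the spatial size through the three convolutional layers. The input side length of 84 pixels is an assumption (it is the size commonly used with this architecture), and the convolutions are assumed to use no padding.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial size after a convolution without padding."""
    return (input_size - kernel_size) // stride + 1


# (number of filters, kernel size, stride) for the three convolutional layers.
conv_layers = [(32, 8, 4), (64, 4, 2), (32, 3, 1)]

size = 84  # assumed input side length; not stated in the text above
sizes = []
for _, kernel, stride in conv_layers:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)

# Number of features fed into the 512-unit fully connected layer.
flattened = conv_layers[-1][0] * size * size
```

With these assumptions the feature maps shrink from 84 to 20, 9, and then 7 pixels per side before the fully connected layer.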
Expert State Sampling Curriculum
Unlike in the Taxi domain, the state mismatch between the demonstrations and the agent at the beginning of learning is significant. Because the action inference model is trained only on the agent's own state distribution, demonstration states that are far from the agent's own experience can be wrongly labeled. For this reason, the pure behavioral cloning from demonstration agents (BC) fail to achieve reasonable performance. Moreover, learning from states that are far from the agent's own state distribution is not immediately helpful. To mitigate this mismatch, we apply a curriculum when sampling the demonstrations to optimize the policies. Specifically, at the beginning of learning we sample only from an initial window of time steps of each trajectory. We gradually extend this window by 1 step approximately every 8,000 frames. In this way, the sampled state distribution of the demonstrations better matches the agent's own experience. Furthermore, we only use the inferred actions to optimize the agent's policies after 100,000 frames, when the action prediction model becomes reliable.
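The sampling curriculum described above can be sketched as follows. The initial window length of 1, the uniform sampling within the window, and the function names are assumptions; the 8,000-frame increment and the 100,000-frame warm-up follow the text.

```python
import random


def curriculum_horizon(frames_seen, frames_per_increment=8000, initial_horizon=1):
    """Number of leading time steps of each trajectory that may be sampled."""
    return initial_horizon + frames_seen // frames_per_increment


def sample_expert_pairs(trajectories, frames_seen, batch_size, rng, warmup_frames=100_000):
    """Sample (state, next_state) pairs from the allowed prefix of each trajectory.

    Returns an empty list during warm-up, when inferred actions are not yet trusted.
    Assumes at least one trajectory contains two or more states.
    """
    if frames_seen < warmup_frames:
        return []
    horizon = curriculum_horizon(frames_seen)
    candidates = [
        (traj[t], traj[t + 1])
        for traj in trajectories
        for t in range(min(horizon, len(traj) - 1))
    ]
    return [rng.choice(candidates) for _ in range(batch_size)]


# Illustrative usage with integer stand-ins for states.
rng = random.Random(0)
toy_trajectories = [[0, 1, 2, 3, 4], [10, 11, 12]]
early_batch = sample_expert_pairs(toy_trajectories, 50_000, 4, rng)
later_batch = sample_expert_pairs(toy_trajectories, 104_000, 8, rng)
```

Each sampled pair would then be labeled by the action inference model before being used in the behavioral cloning term.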
Experiment Results
The learning curves of our agent and the A2C baseline are shown in Figure 6. Of the eight games we evaluate, the hybrid agent is able to leverage expert demonstrations to speed up learning on six (Alien, BeamRider, Breakout, MsPacman, Pong and Qbert; Figure 6(a-f)). For the other two games: on Seaquest, A2C appears to get stuck at a score of 1800, and neither the hybrid agent nor the A2C agent is able to escape from that local optimum; on SpaceInvaders, the action inference model has its worst accuracy.
Conclusion
We have proposed an iterative learning paradigm to facilitate decision-making by utilizing demonstrations from experts. Unlike many previous approaches, we consider a realistic and difficult setting in which the actions performed by the experts are unavailable. To make better use of the state-only demonstrations, we propose to learn a novel tensor-based action inference model from the agent's own experience. The learned dynamics model is then used to infer the missing actions in the expert demonstrations. Finally, a hybrid objective is proposed that improves the policy via imitation learning and A2C jointly. The experimental results on eight Atari games and an illustrative Taxi domain demonstrate the advantageous performance of our model over a set of baselines. We also show that our model is robust against noisy expert state trajectories.
References
 [Abbeel and Ng2004] Abbeel, P., and Ng, A. Y. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, 1. ACM.
 [Argall et al.2009] Argall, B. D.; Chernova, S.; Veloso, M.; and Browning, B. 2009. A survey of robot learning from demonstration. Robotics and autonomous systems 57(5):469–483.
 [Aytar et al.2018] Aytar, Y.; Pfaff, T.; Budden, D.; Paine, T. L.; Wang, Z.; and de Freitas, N. 2018. Playing hard exploration games by watching youtube. arXiv preprint arXiv:1805.11592.
 [Billard et al.2008] Billard, A.; Calinon, S.; Dillmann, R.; and Schaal, S. 2008. Robot programming by demonstration. In Springer handbook of robotics. Springer. 1371–1394.
 [Borsa et al.2017] Borsa, D.; Piot, B.; Munos, R.; and Pietquin, O. 2017. Observational learning by reinforcement learning. arXiv preprint arXiv:1706.06617.
 [Brockman et al.2016] Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; and Zaremba, W. 2016. Openai gym.
 [Chebotar et al.2017] Chebotar, Y.; Hausman, K.; Zhang, M.; Sukhatme, G.; Schaal, S.; and Levine, S. 2017. Combining model-based and model-free updates for trajectory-centric reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, 703–711. JMLR.org.
 [Dhariwal et al.2017] Dhariwal, P.; Hesse, C.; Klimov, O.; Nichol, A.; Plappert, M.; Radford, A.; Schulman, J.; Sidor, S.; and Wu, Y. 2017. Openai baselines. https://github.com/openai/baselines.
 [Dietterich1998] Dietterich, T. G. 1998. The maxq method for hierarchical reinforcement learning. In Proceedings of the Fifteenth International Conference on Machine Learning, 118–126. Morgan Kaufmann Publishers Inc.
 [Duan et al.2017] Duan, Y.; Andrychowicz, M.; Stadie, B.; Ho, J.; Schneider, J.; Sutskever, I.; Abbeel, P.; and Zaremba, W. 2017. One-shot imitation learning. arXiv preprint arXiv:1703.07326.
 [Finn, Levine, and Abbeel2016] Finn, C.; Levine, S.; and Abbeel, P. 2016. Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, 49–58.
 [Gilbert et al.2015] Gilbert, H.; Spanjaard, O.; Viappiani, P.; and Weng, P. 2015. Reducing the number of queries in interactive value iteration. In International Conference on Algorithmic Decision Theory, 139–152. Springer.
 [Hester et al.2017] Hester, T.; Vecerik, M.; Pietquin, O.; Lanctot, M.; Schaul, T.; Piot, B.; Sendonaris, A.; Dulac-Arnold, G.; Osband, I.; Agapiou, J.; et al. 2017. Learning from demonstrations for real world reinforcement learning. arXiv preprint arXiv:1704.03732.
 [Ho and Ermon2016] Ho, J., and Ermon, S. 2016. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, 4565–4573.
 [Hosu and Rebedea2016] Hosu, I.-A., and Rebedea, T. 2016. Playing atari games with deep reinforcement learning and human checkpoint replay. arXiv preprint arXiv:1607.05077.

 [Lakshminarayanan, Ozair, and Bengio2016] Lakshminarayanan, A. S.; Ozair, S.; and Bengio, Y. 2016. Reinforcement learning with few expert demonstrations. In NIPS Workshop on Deep Learning for Action and Interaction.
 [Levine, Popovic, and Koltun2011] Levine, S.; Popovic, Z.; and Koltun, V. 2011. Nonlinear inverse reinforcement learning with gaussian processes. In Advances in Neural Information Processing Systems, 19–27.
 [Li et al.2017] Li, Y.; Duan, N.; Zhou, B.; Chu, X.; Ouyang, W.; and Wang, X. 2017. Visual question generation as dual task of visual question answering. arXiv preprint arXiv:1709.07192.
 [Lipton et al.2016] Lipton, Z. C.; Gao, J.; Li, L.; Li, X.; Ahmed, F.; and Deng, L. 2016. Efficient exploration for dialogue policy learning with bbq networks & replay buffer spiking. arXiv preprint arXiv:1608.05081.
 [Liu et al.2017] Liu, Y.; Gupta, A.; Abbeel, P.; and Levine, S. 2017. Imitation from observation: Learning to imitate behaviors from raw video via context translation. arXiv preprint arXiv:1707.03374.
 [Machado et al.2018] Machado, M. C.; Rosenbaum, C.; Guo, X.; Liu, M.; Tesauro, G.; and Campbell, M. 2018. Eigenoption discovery through the deep successor representation. ICLR.
 [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
 [Nair et al.2017] Nair, A.; McGrew, B.; Andrychowicz, M.; Zaremba, W.; and Abbeel, P. 2017. Overcoming exploration in reinforcement learning with demonstrations. CoRR abs/1709.10089.
 [Ng and Russell2000] Ng, A. Y., and Russell, S. J. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, 663–670. Morgan Kaufmann Publishers Inc.
 [Oh et al.2015] Oh, J.; Guo, X.; Lee, H.; Lewis, R. L.; and Singh, S. 2015. Action-conditional video prediction using deep networks in atari games. In Advances in Neural Information Processing Systems, 2863–2871.
 [Pomerleau1989] Pomerleau, D. A. 1989. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information processing systems, 305–313.
 [Ramachandran and Amir2007] Ramachandran, D., and Amir, E. 2007. Bayesian inverse reinforcement learning. Urbana 51(61801):1–4.
 [Ratliff, Bagnell, and Zinkevich2006] Ratliff, N. D.; Bagnell, J. A.; and Zinkevich, M. A. 2006. Maximum margin planning. In Proceedings of the 23rd international conference on Machine learning, 729–736. ACM.

 [Ross, Gordon, and Bagnell2011] Ross, S.; Gordon, G. J.; and Bagnell, D. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, 627–635.
 [Schaal1999] Schaal, S. 1999. Is imitation learning the route to humanoid robots? Trends in cognitive sciences 3(6):233–242.
 [Stadie, Abbeel, and Sutskever2017] Stadie, B. C.; Abbeel, P.; and Sutskever, I. 2017. Third-person imitation learning. arXiv preprint arXiv:1703.01703.
 [Subramanian, Isbell Jr, and Thomaz2016] Subramanian, K.; Isbell Jr, C. L.; and Thomaz, A. L. 2016. Exploration from demonstration for interactive reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous Agents & Multiagent Systems, 447–456. International Foundation for Autonomous Agents and Multiagent Systems.
 [Sutton1991] Sutton, R. S. 1991. Dyna, an integrated architecture for learning, planning and reacting. SIGART Bulletin 2:160–163.
 [Torabi, Warnell, and Stone2018a] Torabi, F.; Warnell, G.; and Stone, P. 2018a. Behavioral cloning from observation. arXiv preprint arXiv:1805.01954.
 [Torabi, Warnell, and Stone2018b] Torabi, F.; Warnell, G.; and Stone, P. 2018b. Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, 4950–4957. AAAI Press.
 [Yu et al.2018] Yu, T.; Finn, C.; Xie, A.; Dasari, S.; Zhang, T.; Abbeel, P.; and Levine, S. 2018. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557.
 [Zhu et al.2018] Zhu, Y.; Wang, Z.; Merel, J.; Rusu, A.; Erez, T.; Cabi, S.; Tunyasuvunakool, S.; Kramár, J.; Hadsell, R.; de Freitas, N.; et al. 2018. Reinforcement and imitation learning for diverse visuomotor skills. arXiv preprint arXiv:1802.09564.
 [Ziebart et al.2008] Ziebart, B. D.; Maas, A. L.; Bagnell, J. A.; and Dey, A. K. 2008. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, 1433–1438. Chicago, IL, USA.