1 Introduction
Reinforcement learning (RL) is a framework for training agents to maximize rewards in large, unknown, stochastic environments. In recent years, combining techniques from deep learning with reinforcement learning has yielded a string of successful applications in game playing and robotics Mnih et al. (2015, 2016); Schulman et al. (2015a); Levine et al. (2016). These successful applications, and the speed at which the abilities of RL algorithms have been increasing, make it an exciting area of research with significant potential for future applications.

One of the major weaknesses of RL is the need to manually specify a reward function. For each task we wish our agent to accomplish, we must provide it with a reward function whose maximizer will precisely recover the desired behavior. This weakness is addressed by the field of Inverse Reinforcement Learning (IRL). Given a set of expert trajectories, IRL algorithms produce a reward function under which these expert trajectories are optimal. Recently, there has been a significant amount of work on IRL, and current algorithms can infer a reward function from a very modest number of demonstrations (e.g., Abbeel & Ng (2004); Ratliff et al. (2006); Ziebart et al. (2008); Levine et al. (2011); Ho & Ermon (2016); Finn et al. (2016)).
While IRL algorithms are appealing, they impose the somewhat unrealistic requirement that the demonstrations be provided from the first-person point of view with respect to the agent. Human beings learn to imitate entirely from third-person demonstrations – i.e., by observing other humans achieve goals. Indeed, in many situations, first-person demonstrations are outright impossible to obtain. Meanwhile, third-person demonstrations are often relatively easy to obtain.
The goal of this paper is to develop an algorithm for third-person imitation learning. Future advancements in this class of algorithms would significantly improve the state of robotics, because they would enable people to easily teach robots new skills and abilities. Importantly, we want our algorithm to be unsupervised: it should be able to observe another agent perform a task, infer that there is an underlying correspondence to itself, and find a way to accomplish the same task.
We offer an approach to this problem by borrowing ideas from domain confusion Tzeng et al. (2014) and generative adversarial networks (GANs) Goodfellow et al. (2014). The high-level idea is to introduce an optimizer under which we can recover both a domain-agnostic representation of the agent's observations, and a cost function which utilizes this domain-agnostic representation to capture the essence of expert trajectories. We formulate this as a third-person RL-GAN problem, and our solution builds on the first-person RL-GAN formulation of Ho & Ermon (2016).

Surprisingly, we find that this simple approach is able to solve the problems presented in this paper (illustrated in Figure 1), even though the student's observations are related in a complicated way to the teacher's demonstrations (given that both the observations and the demonstrations are pixel-level). As techniques for training GANs become more stable and capable, we expect our algorithm to solve harder third-person imitation tasks without any direct supervision.
2 Related Work
Imitation learning (also learning from demonstrations or programming by demonstration) considers the problem of acquiring skills from observing demonstrations. Imitation learning has a long history, with several good survey articles, including (Schaal, 1999; Calinon, 2009; Argall et al., 2009). Two main lines of work within imitation learning are: 1) behavioral cloning, where the demonstrations are used to directly learn a mapping from observations to actions using supervised learning, potentially with interleaved learning and data collection (e.g., Pomerleau (1989); Ross et al. (2011)); and 2) inverse reinforcement learning (Ng et al., 2000), where a reward function is estimated that explains the demonstrations as (near-)optimal behavior. This reward function could be represented as nearness to a trajectory (Calinon et al., 2007; Abbeel et al., 2010), as a weighted combination of features (Abbeel & Ng, 2004; Ratliff et al., 2006; Ramachandran & Amir, 2007; Ziebart et al., 2008; Boularias et al., 2011; Kalakrishnan et al., 2013; Doerr et al., 2015), or could also involve feature learning (Ratliff et al., 2007; Levine et al., 2011; Wulfmeier et al., 2015; Finn et al., 2016; Ho & Ermon, 2016).

This past work, however, is not directly applicable to the third-person imitation learning setting. In third-person imitation learning, the observations and actions obtained from the demonstrations are not the same as what the imitator agent will be faced with. A typical scenario would be: the imitator agent watches a human perform a demonstration, and then has to execute that same task. As discussed in Nehaniv & Dautenhahn (2001), the "what and how to imitate" questions become significantly more challenging in this setting. To directly apply existing behavioral cloning or inverse reinforcement learning techniques would require knowledge of a mapping between observations and actions in the demonstrator space and observations and actions in the imitator space. Such a mapping is often difficult to obtain, and it typically relies on providing feature representations that capture the invariance between both environments Carpenter et al. (2002); Shon et al. (2005); Calinon et al. (2007); Nehaniv (2007); Gioioso et al. (2013); Gupta et al. (2016). Contrary to prior work, we consider third-person imitation learning from raw sensory data, where no such features are made available.
The most closely related work to ours is by Finn et al. (2016); Ho & Ermon (2016); Wulfmeier et al. (2015), who also consider inverse reinforcement learning directly from raw sensory data. However, the applicability of their approaches is limited to the first-person setting. Indeed, matching raw sensory observations is impossible in the third-person setting.
Our work also closely builds on advances in generative adversarial networks Goodfellow et al. (2014), which are very closely related to imitation learning as explained in Finn et al. (2016); Ho & Ermon (2016). In our optimization formulation, we apply the gradient flipping technique from Ganin & Lempitsky (2014).
The problem of adapting what is learned in one domain to another domain has been studied extensively in computer vision in the supervised learning setting Yang et al. (2007); Mansour et al. (2009); Kulis et al. (2011); Aytar & Zisserman (2011); Duan et al. (2012); Hoffman et al. (2013); Long & Wang (2015). It has also been shown that features trained in one domain can often be relevant to other domains Donahue et al. (2014). The work most closely related to ours is Tzeng et al. (2014, 2015), who also consider an explicit domain confusion loss, forcing trained classifiers to rely on features that do not allow distinguishing between two domains. This work in turn relates to earlier work by Bromley et al. (1993); Chopra et al. (2005), which also considers supervised training of deep feature embeddings.
Our approach to third-person imitation learning relies on reinforcement learning from raw sensory data in the imitator domain. Several recent advances in deep reinforcement learning have made this practical, including Deep Q-Networks (Mnih et al., 2015), Trust Region Policy Optimization (Schulman et al., 2015a), A3C (Mnih et al., 2016), and Generalized Advantage Estimation (Schulman et al., 2015b). Our approach uses Trust Region Policy Optimization.
3 Background and Preliminaries
A discrete-time finite-horizon discounted Markov decision process (MDP) is represented by a tuple $M = (\mathcal{S}, \mathcal{A}, P, r, \rho_0, \gamma, T)$, in which $\mathcal{S}$ is a state set, $\mathcal{A}$ an action set, $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}_+$ a transition probability distribution, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ a reward function, $\rho_0 : \mathcal{S} \to \mathbb{R}_+$ an initial state distribution, $\gamma \in [0, 1]$ a discount factor, and $T$ the horizon.

In the reinforcement learning setting, the goal is to find a policy $\pi_\theta$ parametrized by $\theta$ that maximizes the expected discounted sum of rewards incurred, $\eta(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^t r(s_t, a_t)\right]$, where $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi_\theta(a_t \mid s_t)$, and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$.
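As a quick worked example, the discounted return of a single trajectory can be estimated as follows (a minimal Python sketch; the helper function is ours, for illustration only):

```python
def discounted_return(rewards, gamma):
    """Monte-Carlo estimate of the discounted return sum_t gamma^t * r_t
    for a single trajectory of rewards r_0, ..., r_T."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: a 3-step trajectory with gamma = 0.99.
print(discounted_return([1.0, 0.0, 1.0], gamma=0.99))  # 1.0 + 0.99**2 * 1.0 = 1.9801
```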
In the (first-person) imitation learning setting, we are not given the reward function. Instead, we are given traces (i.e., sequences of states traversed) generated by an expert who acts according to an unknown policy $\pi_E$. The goal is to find a policy $\pi_\theta$ that performs as well as the expert against the unknown reward function. It was shown in Abbeel & Ng (2004) that this can be achieved through inverse reinforcement learning by finding a policy $\pi_\theta$ that matches the expert's empirical expectation over the discounted sum of all features that might contribute to the reward function. The work by Ho & Ermon (2016) generalizes this to the setting where no features are provided, as follows: find a policy $\pi_\theta$ that makes it impossible for a discriminator (in their work a deep neural net) to distinguish states visited by the expert from states visited by the imitator agent. This can be formalized as follows:
$$\min_{D_R} \; -\,\mathbb{E}_{\pi_\theta}\big[\log D_R(s_t)\big] \;-\; \mathbb{E}_{\pi_E}\big[\log\big(1 - D_R(s_t)\big)\big] \qquad (1)$$
Here, the expectations are over the states experienced by the policy of the imitator agent, $\pi_\theta$, and by the policy of the expert, $\pi_E$, respectively. $D_R$ is the discriminator, which outputs the probability of a state having originated from a trace of the imitator policy $\pi_\theta$. If the discriminator is perfectly able to distinguish which policy originated state–action pairs, then $D_R$ will consistently output a probability of 1 in the first term and a probability of 0 in the second term, making the objective its lowest possible value of zero. It is the role of the imitator agent to find a policy that makes it difficult for the discriminator to make that distinction. The desired equilibrium has the imitator agent making it impractical for the discriminator to distinguish, hence forcing the discriminator to assign probability 0.5 in all cases. Ho & Ermon (2016) present a practical approach for solving this type of game when representing both $\pi_\theta$ and $D_R$ as deep neural networks. Their approach repeatedly performs gradient updates on each of them. Concretely, for a current policy $\pi_\theta$, traces can be collected, which together with the expert traces form a labeled dataset on which $D_R$ can be trained with supervised learning, minimizing the negative log-likelihood (in practice, only performing a modest number of updates). For a fixed $D_R$, this is a policy optimization problem where $-\log D_R(s_t)$ is the reward, and policy gradients can be computed from those same traces. Their approach uses trust region policy optimization (Schulman et al., 2015a) to update the imitator policy from those gradients.

In our work we will have more terms in the objective, so for compactness of notation, we will realize the discriminative minimization from Eqn. (1) as follows:
$$\min_{D_R} \; \sum_i \mathcal{L}\big(D_R(s_i),\, c\ell_i\big) \qquad (2)$$

where $s_i$ is state $i$, $c\ell_i$ is the correct class label (was the state obtained from an expert vs. from a non-expert), and $\mathcal{L}$ is the standard cross-entropy loss.
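For concreteness, a minimal PyTorch sketch of this discriminator step is shown below. The network shape, observation dimensionality, learning rate, and the label convention (0 for imitator states, 1 for expert states) are our own illustrative choices, not specifics from Ho & Ermon (2016):

```python
import torch
import torch.nn as nn

OBS_DIM = 64  # assumed observation dimensionality (illustrative)
discriminator = nn.Sequential(          # D_R: logits over {imitator, expert}
    nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()              # the loss L in Eqn. (2)

def discriminator_step(imitator_states, expert_states):
    """One supervised update of D_R on a mixed batch of states."""
    states = torch.cat([imitator_states, expert_states])
    labels = torch.cat([torch.zeros(len(imitator_states), dtype=torch.long),
                        torch.ones(len(expert_states), dtype=torch.long)])
    loss = ce(discriminator(states), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```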
4 A Formal Definition of the Third-Person Imitation Learning Problem
Formally, the third-person imitation learning problem can be stated as follows. Suppose we are given two Markov decision processes $M_{\pi_E}$ and $M_{\pi_\theta}$. Suppose further there exists a set of traces $\rho = \{(s_1, \dots, s_n)\}_{i=0}^{N}$ which were generated under a policy $\pi_E$ acting optimally under some unknown reward $R_{\pi_E}$. In third-person imitation learning, one attempts to recover by proxy through $\rho$ a policy $\pi_\theta = f(\rho)$ which acts optimally with respect to $R_{\pi_\theta}$.
5 A Third-Person Imitation Learning Algorithm
5.1 Game Formulation
In this section, we discuss a simple algorithm for third-person imitation learning. The algorithm is able to successfully discriminate between expert and novice policies, even when the policies are executed in different environments. Subsequently, this discrimination signal can be used to train policies in new domains via RL, by training the novice policy to fool the discriminator and thereby match the expert policy.

In the third-person setting, observations are typically available rather than direct state access, so going forward we will work with observations $o_t$ instead of states $s_t$ as representing the expert traces. The top row of Figure 1 illustrates what these observations look like in our experiments.
We begin by recalling that in the algorithm proposed by Ho & Ermon (2016), the loss in Equation (2) is used to train a discriminator capable of distinguishing expert from non-expert policies. Unfortunately, (2) will likely fail when the expert and non-expert act in different environments, since $D_R$ will quickly learn these environmental differences and use them as a strong classification signal.
To handle the third-person setting, where expert and novice are in different environments, we observe that $D_R$ works by first extracting features from $s$, and then using those features to make a classification. Suppose then that we partition $D_R$ into a feature extractor $D_F$ and an actual classifier which assigns probabilities to the outputs of $D_F$. Overloading notation, we will refer to the classifier as $D_R$ going forward. For example, in a deep neural network representation, $D_F$ would correspond to the earlier layers and $D_R$ to the later layers. The problem is then to ensure that $D_F$ contains no information regarding the rollout's domain label $d\ell$ (i.e., expert vs. novice domain). This can be realized as

$$\min_{D_R,\,D_F} \; \sum_i \mathcal{L}\big(D_R(D_F(s_i)),\, c\ell_i\big) \quad \text{s.t.} \quad MI\big(D_F(s);\, d\ell\big) = 0,$$

where $MI$ denotes mutual information; we have hence abused notation by using $D_R$, $D_F$, and $d\ell$ to mean the classifier, feature extractor, and domain label respectively, as well as distributions over these objects.

The mutual information term can be instantiated by introducing another classifier $D_D$, which takes features produced by $D_F$ and outputs the probability that those features were produced by a rollout in the expert vs. non-expert environment. (See Bridle et al. (1992); Barber & Agakov (2005); Krause et al. (2010); Chen et al. (2016) for further discussion on instantiating the information term by introducing another classifier.) If $d\ell_i$ denotes the correct domain label of sample $i$, the problem can be written as

$$\min_{D_R,\,D_F} \; \max_{D_D} \; \sum_i \mathcal{L}\big(D_R(D_F(s_i)),\, c\ell_i\big) \;-\; \lambda\,\mathcal{L}\big(D_D(D_F(s_i)),\, d\ell_i\big) \qquad (3)$$

In words, we wish to minimize class loss while maximizing domain confusion.
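Structurally, this amounts to a shared feature trunk with two classification heads. A minimal sketch under assumed shapes follows (the actual architecture used in our experiments is given in Appendix B; the 605-dimensional feature size is an inference explained there):

```python
import torch.nn as nn

FEAT_DIM = 605  # flattened output of the convolutional trunk (see Appendix B)
D_F = nn.Sequential(  # shared feature extractor over 50x50 RGB frames
    nn.Conv2d(3, 5, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(5, 5, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2), nn.Flatten())
D_R = nn.Sequential(  # class head: expert vs. novice behavior
    nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, 2))
D_D = nn.Sequential(  # domain head: expert vs. novice environment
    nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, 2))
```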
Often, it can be difficult even for humans to judge a static image as expert vs. non-expert, because a single frame conveys no information about the environmental change effected by the agent's actions. For example, if a point-mass is attempting to move to a target location and starts far away from its goal state, it can be difficult to judge whether the policy itself is bad or the initialization was simply unlucky. In response to this difficulty, we give $D_R$ access to not only the image at time $t$, but also at some future time $t + n$. Define the frame pair $\bar{o}_t = (o_t, o_{t+n})$. The classifier then makes a prediction $D_R(D_F(\bar{o}_t)) = \hat{c\ell}$.

This renders the following formulation:

$$\min_{D_R,\,D_F} \; \max_{D_D} \; \sum_i \mathcal{L}\big(D_R(D_F(\bar{o}_i)),\, c\ell_i\big) \;-\; \lambda\,\mathcal{L}\big(D_D(D_F(\bar{o}_i)),\, d\ell_i\big) \qquad (4)$$

Note that we also want to optimize over $D_F$, the feature extractor, but it feeds both into $D_R$ and into $D_D$, whose objectives compete over it (hidden under the min-max above); we address this now.
To deal with the competition over $D_F$, we introduce a function $\mathcal{G}$ that acts as the identity when moving forward through a directed acyclic graph and flips the sign of the gradient when backpropagating through the graph. This technique has enjoyed recent success in computer vision; see, for example, Ganin & Lempitsky (2014). With this trick, the problem reduces to its final form

$$\min_{D_R,\,D_D,\,D_F} \; \sum_i \mathcal{L}\big(D_R(D_F(\bar{o}_i)),\, c\ell_i\big) \;+\; \lambda\,\mathcal{L}\big(D_D(\mathcal{G}(D_F(\bar{o}_i))),\, d\ell_i\big) \qquad (5)$$

In Equation (5), we flip the gradient's sign during backpropagation into $D_F$ with respect to the domain classification loss. This corresponds to stochastic gradient ascent away from features that are useful for domain classification, thus ensuring that $D_F$ produces domain-agnostic features. Equation (5) can be solved efficiently with stochastic gradient descent. Here, $\lambda$ is a hyperparameter that determines the trade-off between the objectives competing over $D_F$.

To ensure sufficient signal for discrimination between expert and non-expert behavior, we collect third-person demonstrations in the expert domain from both an expert and from a non-expert.

Our complete formulation is graphically summarized in Figure 2.
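A gradient reversal layer of the kind $\mathcal{G}$ denotes can be implemented in a few lines of PyTorch. The sketch below assumes $D_F$, $D_R$, and $D_D$ are modules as sketched earlier and that obs_pairs is a batch of (suitably encoded) frame pairs $(o_t, o_{t+n})$; it illustrates the technique in Equation (5) and is not the paper's actual code:

```python
import torch
import torch.nn.functional as F

class GradientReversal(torch.autograd.Function):
    """The function G in Eqn. (5): identity on the forward pass,
    sign-flipped gradient on the backward pass (Ganin & Lempitsky, 2014)."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def third_person_loss(D_F, D_R, D_D, obs_pairs, class_labels, domain_labels, lam):
    """One evaluation of Eqn. (5). D_R and D_D both descend their own
    cross-entropy losses, while the reversed gradient drives D_F *away*
    from features that are useful for domain classification."""
    feats = D_F(obs_pairs)
    class_loss = F.cross_entropy(D_R(feats), class_labels)
    domain_loss = F.cross_entropy(D_D(GradientReversal.apply(feats)), domain_labels)
    return class_loss + lam * domain_loss
```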
5.2 Algorithm
To solve the game formulation in Equation (5), we perform alternating (partial) optimization over the policy $\pi_\theta$ and the discriminator components $D_R$, $D_D$, and $D_F$, through which the reward and the domain confusion are encoded.

The optimization over $D_R$, $D_D$, and $D_F$ is done through stochastic gradient descent with ADAM (Kingma & Ba, 2014).

Our generator ($\pi_\theta$) step is similar to the generator step in the algorithm of Ho & Ermon (2016). We simply use the class discriminator's output on the novice's frame pairs as the cost. Using policy gradient methods (TRPO), we train the generator to minimize this cost, pushing the policy further towards replicating expert behavior. Once the generator step is done, we start again with the discriminator step. The entire process is summarized in Algorithm 1.
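For illustration, one way to turn the class discriminator's output into a per-timestep cost for the generator step is sketched below; the exact functional form (raw probability vs. a log-probability, as in GAN-style objectives) is not pinned down by the prose above, so treat this as an assumption:

```python
import torch

def imitation_cost(D_F, D_R, obs_pair_batch):
    """Cost fed to TRPO: the probability the class discriminator assigns to
    each frame pair having come from the novice (label 0 in our convention).
    Fooling D_R drives this probability down, i.e., towards expert-looking
    behavior."""
    with torch.no_grad():  # the generator step does not update the discriminator
        logits = D_R(D_F(obs_pair_batch))
        prob_novice = torch.softmax(logits, dim=-1)[..., 0]
    return prob_novice
```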
6 Experiments
We seek to answer the following questions through experiments:

Is it possible to solve the third-person imitation learning problem in simple settings? That is, given a collection of expert image-based rollouts in one domain, is it possible to train a policy in a different domain that replicates the essence of the original behavior?

Does the algorithm we propose benefit from both domain confusion and the multi-time-step (velocity) input?

How sensitive is our proposed algorithm to the selection of hyperparameters used in deployment?

How sensitive is our proposed algorithm to changes in camera angle?

How does our method compare against some reasonable baselines?
6.1 Environments
To evaluate our algorithm, we consider three environments in the MuJoCo physics simulator. There are two different versions of each environment, an expert variant and a novice variant. Our goal is to train a cost function that is domain agnostic, and hence can be trained with images on the expert domain but nevertheless produce a reasonable cost on the novice domain. See Figure 1 for a visualization of the differences between expert and novice environments for the three tasks.
Point: A point-mass attempts to reach a point in a plane. The color of the target and the camera angle change between domains.

Reacher: A two-DOF arm attempts to reach a designated point in the plane. The camera angle, the length of the arms, and the color of the target point change between domains. Note that changing the camera angle significantly alters the image background color from largely gray to roughly 30 percent black; this presents a significant challenge for our method.

Inverted Pendulum: A classic RL task wherein a pendulum must be made to balance via control. For this domain, we only change the color of the pendulum and not the camera angle. Since there is no target point, we found that changing the camera angle left the domain-invariant representations with too little information, resulting in a failure case. In contrast to some traditional renderings of this problem, we do not terminate an episode when the agent falls, but rather allow data collection to continue for a fixed horizon.
6.2 Evaluations
Is it possible to solve the third-person imitation learning problem in simple settings? In Figure 3, we see that our proposed algorithm is indeed able to recover reasonable policies for all three tasks we examined. Initially, the training is quite unstable due to the domain confusion wreaking havoc on the learned cost. However, after several iterations the policies eventually head towards reasonable local minima and the standard deviation over the reward distribution shrinks substantially. Finally, we note that the extracted feature representations used to complete this task are in fact domain-agnostic, as seen in Figure 9. Hence, the learning is properly taking place from a third-person perspective.

Does the algorithm we propose benefit from both domain confusion and the multi-time-step input? We answer this question with the experiments summarized in Figure 5. This experiment compares our approach with: (i) our approach without the domain confusion loss; (ii) our approach without the multi-time-step input; (iii) our approach without the domain confusion loss and without the multi-time-step input (which is very similar to the approach in Ho & Ermon (2016)). We see that adding domain confusion is essential for strong performance in all three experiments. Meanwhile, adding multi-time-step input marginally improves the results. See also Figure 7 for an analysis of the effects of multi-time-step input on the final results.
How sensitive is our proposed algorithm to the selection of hyperparameters used in deployment? Figure 6 shows the effect of the domain confusion coefficient $\lambda$, which trades off how much we should weight the domain confusion objective vs. the standard cost-recovery objective, on the final performance of the algorithm. Setting $\lambda$ too low results in slower learning and features that are not domain-invariant. Setting $\lambda$ too high results in an objective that is too quick to destroy information, which makes it impossible to recover an accurate cost.
For multi-time-step input, one must choose the number of look-ahead frames $n$ that are utilized. If too small a window is chosen, the agent's actions have not yet effected much change in the environment, and it is difficult to discern any additional class signal over static images. If too large a time gap passes, causality becomes difficult to infer, and the agent does worse than if it were simply trained on static frames. Figure 7 illustrates that no number of look-ahead frames is consistently optimal across tasks. However, one intermediate value showed good performance over all tasks, and so this value was utilized in all other experiments.

How sensitive is our algorithm to changes in camera angle? We present graphs for the reacher and point experiments wherein we examine the final reward obtained by a policy trained with third-person imitation learning vs. the camera angle difference between the first-person and third-person perspectives. We omit the inverted pendulum experiment, as the color and not the camera angle changes in that setting, and slowly transitioning the color makes for an uninteresting experiment.
How does our method compare against reasonable baselines? We consider the following baselines for comparison against third-person imitation learning. 1) Standard reinforcement learning using full state information and the true reward signal; this agent is trained via TRPO. 2) Standard GAIL (first-person imitation learning), where the agent receives first-person demonstrations and attempts to imitate the correct behavior; this is an upper bound on how well we can expect to do, since we have the correct perspective. 3) Training a policy using first-person data and applying it to the third-person environment.

We compare all three of these baselines to third-person imitation learning. As we see in Figure 9: 1) Standard RL, which (unlike the imitation learning approaches) has access to the full state and true reward, helps calibrate the performance of the other approaches. 2) First-person imitation learning faces a simpler imitation problem and accordingly outperforms third-person imitation, yet third-person imitation learning is nevertheless competitive. 3) Applying the first-person policy to the third-person agent fails miserably, illustrating that explicitly considering third-person imitation is important in these settings.
Somewhat unfortunately, the different reward function scales make it difficult to capture information on the variance of each learning curve. Consequently, in Appendix A we have included the full learning curves for these experiments with variance bars, each plotted with an appropriate scale to examine the variance of the individual curves.
7 Discussion and Future Work
In this paper, we presented the problem of third-person imitation learning. We argue that this problem will be important going forward, as techniques in reinforcement learning and generative adversarial learning improve and the cost of collecting first-person samples remains high. We presented an algorithm which builds on Generative Adversarial Imitation Learning and is capable of solving simple third-person imitation tasks.

One promising direction of future work in this area is to jointly train policy features and cost features at the pixel level, allowing the reuse of image features. Code to train a third-person imitation learning agent on the domains from this paper is available at: https://github.com/bstadie/third_person_im
Acknowledgements
This work was done partially at OpenAI and partially at Berkeley. Work done at Berkeley was supported in part by DARPA under the Simplex program and the FunLoL program.
References

Abbeel, P. and Ng, A. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004.
Abbeel, Pieter, Coates, Adam, and Ng, Andrew Y. Autonomous helicopter aerobatics through apprenticeship learning. The International Journal of Robotics Research, 2010.
Argall, Brenna D., Chernova, Sonia, Veloso, Manuela, and Browning, Brett. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, 2009.
Aytar, Yusuf and Zisserman, Andrew. Tabula rasa: Model transfer for object category detection. In 2011 International Conference on Computer Vision, pp. 2252–2259. IEEE, 2011.
Barber, D. and Agakov, F. V. Kernelized infomax clustering. NIPS, 2005.
Boularias, A., Kober, J., and Peters, J. Relative entropy inverse reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
Bridle, J. S., Heading, A. J., and MacKay, D. J. Unsupervised classifiers, mutual information and 'phantom targets'. NIPS, 1992.
Bromley, Jane, Bentz, James W., Bottou, Léon, Guyon, Isabelle, LeCun, Yann, Moore, Cliff, Säckinger, Eduard, and Shah, Roopak. Signature verification using a "siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7(04):669–688, 1993.
Calinon, Sylvain. Robot Programming by Demonstration. EPFL Press, 2009.
Calinon, Sylvain, Guenter, Florent, and Billard, Aude. On learning, representing, and generalizing a task in a humanoid robot. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(2):286–298, 2007.
Carpenter, Malinda, Call, Josep, and Tomasello, Michael. Understanding "prior intentions" enables two-year-olds to imitatively learn a complex task. Child Development, 73(5):1431–1441, 2002.
Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. NIPS, 2016.
Chopra, Sumit, Hadsell, Raia, and LeCun, Yann. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), volume 1, pp. 539–546. IEEE, 2005.
Doerr, A., Ratliff, N., Bohg, J., Toussaint, M., and Schaal, S. Direct loss minimization inverse optimal control. In Proceedings of Robotics: Science and Systems (R:SS), Rome, Italy, July 2015.
Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor. DeCAF: A deep convolutional activation feature for generic visual recognition. In ICML, pp. 647–655, 2014.
Duan, Lixin, Xu, Dong, and Tsang, Ivor. Learning with augmented features for heterogeneous domain adaptation. arXiv preprint arXiv:1206.4660, 2012.
Finn, C., Levine, S., and Abbeel, P. Guided cost learning: Deep inverse optimal control via policy optimization. ICML, 2016.
Ganin, Y. and Lempitsky, V. Unsupervised domain adaptation by backpropagation. arXiv preprint arXiv:1409.7495, 2014.
Gioioso, G., Salvietti, G., Malvezzi, M., and Prattichizzo, D. An object-based approach to map human hand synergies onto robotic hands with dissimilar kinematics. Robotics: Science and Systems VIII, pp. 97, 2013.
Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
Gupta, Abhishek, Eppner, Clemens, Levine, Sergey, and Abbeel, Pieter. Learning dexterous manipulation for a soft robotic hand from human demonstration. arXiv preprint arXiv:1603.06348, 2016.
Ho, J. and Ermon, S. Generative adversarial imitation learning. arXiv preprint arXiv:1606.03476, 2016.
Hoffman, Judy, Rodner, Erik, Donahue, Jeff, Darrell, Trevor, and Saenko, Kate. Efficient learning of domain-invariant image representations. arXiv preprint arXiv:1301.3224, 2013.
Kalakrishnan, M., Pastor, P., Righetti, L., and Schaal, S. Learning objective functions for manipulation. In International Conference on Robotics and Automation (ICRA), 2013.
Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2014.
Krause, A., Perona, P., and Gomes, R. G. Discriminative clustering by regularized information maximization. NIPS, 2010.
Kulis, Brian, Saenko, Kate, and Darrell, Trevor. What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1785–1792. IEEE, 2011.
Levine, S., Popovic, Z., and Koltun, V. Nonlinear inverse reinforcement learning with Gaussian processes. In Advances in Neural Information Processing Systems (NIPS), 2011.
Levine, Sergey, Finn, Chelsea, Darrell, Trevor, and Abbeel, Pieter. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(39):1–40, 2016.
Long, Mingsheng and Wang, Jianmin. Learning transferable features with deep adaptation networks. CoRR, abs/1502.02791, 2015.
Mansour, Yishay, Mohri, Mehryar, and Rostamizadeh, Afshin. Domain adaptation: Learning bounds and algorithms. arXiv preprint arXiv:0902.3430, 2009.
Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin, Fidjeland, Andreas K., Ostrovski, Georg, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Mnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P., Harley, Tim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. arXiv preprint arXiv:1602.01783, 2016.
Nehaniv, Chrystopher L. Nine billion correspondence problems. Imitation and Social Learning in Robots, Humans and Animals: Behavioural, Social and Communicative Dimensions. Cambridge University Press, 2007.
Nehaniv, Chrystopher L. and Dautenhahn, Kerstin. Like me? Measures of correspondence and imitation. Cybernetics & Systems, 32(1–2):11–51, 2001.
Ng, A., Russell, S., et al. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2000.
Pomerleau, Dean A. ALVINN: An autonomous land vehicle in a neural network. In Advances in Neural Information Processing Systems, pp. 305–313, 1989.
Ramachandran, D. and Amir, E. Bayesian inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, volume 51, 2007.
Ratliff, N., Bagnell, J. A., and Zinkevich, M. A. Maximum margin planning. In International Conference on Machine Learning (ICML), 2006.
Ratliff, N., Bradley, D., Bagnell, J. A., and Chestnutt, J. Boosting structured prediction for imitation learning. In Advances in Neural Information Processing Systems (NIPS), 2007.
Ross, Stéphane, Gordon, Geoffrey J., and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, 2011.
Schaal, Stefan. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.
Schulman, John, Levine, Sergey, Moritz, Philipp, Jordan, Michael I., and Abbeel, Pieter. Trust region policy optimization. arXiv preprint arXiv:1502.05477, 2015a.
Schulman, John, Moritz, Philipp, Levine, Sergey, Jordan, Michael, and Abbeel, Pieter. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
Shon, Aaron, Grochow, Keith, Hertzmann, Aaron, and Rao, Rajesh P. Learning shared latent structure for image synthesis and robotic imitation. In Advances in Neural Information Processing Systems, pp. 1233–1240, 2005.
Tzeng, Eric, Hoffman, Judy, Zhang, Ning, Saenko, Kate, and Darrell, Trevor. Deep domain confusion: Maximizing for domain invariance. arXiv preprint arXiv:1412.3474, 2014.
Tzeng, Eric, Devin, Coline, Hoffman, Judy, Finn, Chelsea, Peng, Xingchao, Abbeel, Pieter, Levine, Sergey, Saenko, Kate, and Darrell, Trevor. Towards adapting deep visuomotor representations from simulated to real environments. arXiv preprint arXiv:1511.07111, 2015.
Wulfmeier, M., Ondruska, P., and Posner, I. Maximum entropy deep inverse reinforcement learning. arXiv preprint arXiv:1507.04888, 2015.
Yang, Jun, Yan, Rong, and Hauptmann, Alexander G. Cross-domain video concept detection using adaptive SVMs. In Proceedings of the 15th ACM International Conference on Multimedia, pp. 188–197. ACM, 2007.
Ziebart, B., Maas, A., Bagnell, J. A., and Dey, A. K. Maximum entropy inverse reinforcement learning. In AAAI Conference on Artificial Intelligence, 2008.
8 Appendix A: Learning Curves for Baselines
Here, we plot the learning curves for each of the baselines mentioned in the experiments section as a standalone plot. This allows one to better examine the variance of each individual learning curve.
9 Appendix B: Architecture Parameters
Joint Feature Extractor: Input images are of size 50 × 50 with 3 channels (RGB). The network consists of 2 convolutional layers, each followed by a max-pooling layer of size 2. Each convolutional layer uses 5 filters of size 3.

Domain Discriminator and Class Discriminator: Input is the domain-agnostic output of the convolutional layers. Each consists of two feed-forward layers of size 128, followed by a final feed-forward layer of size 2 and a softmax layer to obtain the log probabilities.

ADAM is used for discriminator training with a learning rate of 0.001. The RL generator uses the off-the-shelf TRPO implementation available in rllab.
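Putting these architecture notes together, a PyTorch sketch follows. The ReLU nonlinearities, the absence of padding, and the resulting flattened feature size of 5 × 11 × 11 = 605 are our inferences from the stated layer sizes, not details given above:

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """D_F: two 3x3 convolutions with 5 filters each, each followed by
    2x2 max pooling, applied to 50x50 RGB input.
    Spatial sizes: 50 -> 48 -> 24 -> 22 -> 11, so 5*11*11 = 605 features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 5, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(5, 5, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten())

    def forward(self, x):  # x: (batch, 3, 50, 50)
        return self.net(x)

class DiscriminatorHead(nn.Module):
    """Shared shape for the class discriminator D_R and the domain
    discriminator D_D: two feed-forward layers of size 128, a final layer
    of size 2, and a log-softmax to produce log probabilities."""
    def __init__(self, in_dim=605):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 2), nn.LogSoftmax(dim=-1))

    def forward(self, feats):
        return self.net(feats)

D_F = FeatureExtractor()
D_R, D_D = DiscriminatorHead(), DiscriminatorHead()
```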