Cross-Domain Imitation Learning with a Dual Structure

06/02/2020, by Sungho Choi, et al.

In this paper, we consider cross-domain imitation learning (CDIL), in which an agent in a target domain learns a policy that performs well in the target domain by observing expert demonstrations in a source domain, without accessing any reward function. In order to overcome the domain difference for imitation learning, we propose a dual-structured learning method. The proposed method extracts two feature vectors from each input observation, one containing domain information and the other containing policy-expertness information, and then enhances the feature vectors by synthesizing new feature vectors containing both target-domain and policy-expertness information. The proposed CDIL method is tested on several MuJoCo tasks in which the domain difference is induced by image angles or colors. Numerical results show that the proposed method outperforms existing algorithms in CDIL and achieves almost the same performance as imitation learning without domain difference.


1 Introduction

Imitation Learning (IL) is a framework that reproduces the behavior of an expert by mimicking its demonstrations Osa et al. (2018). IL can circumvent the difficulty of designing a reward function for each task in Reinforcement Learning (RL) Mnih et al. (2015); Silver et al. (2016); Sutton and Barto (2018), since the reward function must be designed carefully to train the agent toward desirable and intended behavior. There are numerous results in IL Abbeel and Ng (2004); Bain and Sammut (1999); Finn et al. (2016); Ho and Ermon (2016); Ng and Russell (2000); Ross et al. (2011); Torabi et al. (2018); Ziebart et al. (2008) that successfully learn complex behaviors in various environments.

Although conventional IL methods are powerful, they assume that the expert and the agent are in the same domain. In more general cases, the agent in a target domain should mimic the behavior of an expert who exists in a source domain that is different from the target domain. For example, a driving agent might have to learn driving skills in the real world from demonstrations in a driving simulator, or a robot receiving visual data from its sensor might have to imitate new movements of other robots using images taken from different angles. Such situations are natural in the real world, and this cross-domain imitation learning (CDIL) is more challenging because the agent cannot directly follow the expert demonstration Gamrian and Goldberg (2019); Gupta et al. (2017); Higgins et al. (2017); Liu et al. (2018); Stadie et al. (2017); Yu et al. (2018). The domain adaptation problem is hard in RL and IL, especially when visual data are used as inputs Gamrian and Goldberg (2019); Liu et al. (2018); Stadie et al. (2017); Yu et al. (2018). For example, a change of color or viewing angle of the input image seems easy from a human's perspective, but it is difficult to overcome when training vision-based policies for RL or IL in robot control. Variations in color or viewing angle can greatly change pixel values, and even small differences can make learning fail, because the true reward is unknown and must be estimated only from raw images in this case.

In this paper, we propose a new learning framework to train the agent's policy for CDIL with visual input in an RL setting, based on dual generative-adversarial learning. The basic idea is as follows. We extract two base feature vectors from each input image, one preserving its domain information (source or target) and the other preserving the policy-expertness information (expert or non-expert), and then enhance the feature vectors by synthesizing new feature vectors containing both target-domain and policy-expertness information (a combination that does not exist in the original data set in CDIL). Moreover, we adapt critical hyperparameters for feature extraction to automatically balance the strength of preserving one type of information against the strength of deleting the other type of information. The proposed method yields significant performance improvement over existing methods for CDIL and, in the case of visual input data, almost achieves the performance obtained when there is no domain difference.

2 Background and Related Works

Imitation Learning

IL aims to learn behaviors from demonstrations, which are typically given in the form of a sequence of states and actions or raw images Osa et al. (2018). There are several categories of IL methods. For example, Behavior Cloning (BC) Bain and Sammut (1999); Ross et al. (2011); Torabi et al. (2018) uses supervised learning to train models that directly map states to actions. Inverse Reinforcement Learning (IRL) Abbeel and Ng (2004); Finn et al. (2016); Ng and Russell (2000); Ziebart et al. (2008) recovers the reward function under which the expert policy is optimal; the policy is then trained using RL to maximize the performance with respect to the recovered reward function. While BC is simple and does not require any RL steps, IRL methods allow the agent to understand the expert's behavior and to generalize more easily. Also, there are GAN-based methods Goodfellow et al. (2014) that match the distribution of the agent's behavior with that of the expert Fu et al. (2018); Ho and Ermon (2016).

Imitation Learning with Domain Difference

CDIL is the framework for IL when the expert and the agent exist in two different domains, and arises naturally in many real-world situations. There exist a few previous works for CDIL. The method in Liu et al. (2018) trains a model to transform the source-domain demonstrations into the target-domain perspective so that the agent can use them for learning. Although this method requires time-aligned data from multiple source domains, it works in real-world settings. Another approach is Third-Person Imitation Learning (TPIL) Stadie et al. (2017), which is closely related to our work. TPIL trains a model in an adversarial manner, based on a domain-independent adaptation method Ganin and Lempitsky (2015) with generative-adversarial IL (GAIL) Ho and Ermon (2016). From each input image, it extracts a single type of feature vector, which is domain-independent but preserves the information of policy expertness, and the extracted feature vector is fed into another discriminator that determines the policy expertness label to estimate reward. However, due to the absence of the expert data in the target domain, it is hard to extract desired feature vectors when the domain difference becomes large. Our proposed method solves this difficulty using dual feature extraction and dual discrimination, and overcomes the limitation of TPIL.

Domain Adaptation and Transfer Learning

Our method is also related to domain adaptation (DA). DA assumes a covariate shift between domains, i.e., the data distribution in the source domain differs from that in the target domain. DA aims to learn a model in a target domain by exploiting training data in a source domain, and is widely used in image processing Ganin and Lempitsky (2015); Murez et al. (2017); Zhu et al. (2017) and in RL Carr et al. (2019); Gamrian and Goldberg (2019); Gupta et al. (2017). There are two major DA approaches: pixel-level DA Zhu et al. (2017) seeks a direct mapping between the two domains, whereas feature-level DA Ganin and Lempitsky (2015); Murez et al. (2017) seeks a mapping between the source (and/or target) domain data space and a feature space.

Transfer Learning aims to solve a task given a model trained for a different task. There are numerous works on transfer learning combined with RL Carr et al. (2019); Gamrian and Goldberg (2019); Gupta et al. (2017). In particular, the method in Gamrian and Goldberg (2019) adversarially trains a translation model that transfers images between the source and target domains, and both RL and supervised learning for IL are used to train the target policy network. Although this method also applies RL and IL to a problem similar to ours, it differs from our work in that its agent can always access the true reward function during the learning phase in the target domain, whereas our learner uses only rewards estimated by the trained model.

Image Translation

Image translation aims to find a mapping between source-domain images and target-domain images Yi et al. (2017). The methods presented in Lee et al. (2018, 2019); Huang et al. (2018) extract two feature vectors from each image: one contains only domain-specific information and the other contains only domain-independent information. New images are generated by feeding both the domain-specific feature vector from one domain and the domain-independent feature vector from the other domain into a generator. Our dual feature extraction and synthesis for CDIL follow a similar spirit.

3 Cross-Domain Imitation Learning Problem

In this paper, we consider the CDIL problem under a typical RL framework. The source and target domains are modelled as Markov Decision Processes (MDPs), denoted by M_S and M_T, respectively, where the subscripts S and T indicate the source and target domains. In the source domain, S_S is the state space, A_S is the action space, P_S is the state transition probability, R_S is the reward function, γ_S is the discount factor, and ρ_S is the initial state distribution. The target-domain spaces, functions and distributions S_T, A_T, P_T, R_T, γ_T, ρ_T are defined similarly. In this paper, the domain difference between the two MDPs is a visual difference in the state spaces, which is still a non-trivial domain gap to overcome, as already mentioned.

In the source domain, there is an expert (E) policy π_SE, which is assumed to be optimal for the task. In addition, we assume that there is a non-expert (N) policy π_SN in the source domain. There can be several ways to model the non-expert policy, but we choose a policy taking random actions because it is simple to implement. In the target domain, there is a learner (L) policy π_TL, parametrized by θ, which needs to be trained. We assume that the learner in the target domain does not have access to the true underlying states and reward function of the expert and the non-expert in the source domain, but can receive visual observations onto which the states of the expert and the non-expert are projected. Let O_SE, O_SN, and O_TL denote the sets of observations generated by π_SE, π_SN, and π_TL, respectively. Each observation carries two true labels. That is, for each observation o, the domain label l_D(o) indicates whether o is generated in the source domain or in the target domain, and the policy-expertness label l_E(o) indicates whether o is generated by the expert policy or the non-expert policy.

The goal is to train the learner policy π_TL so that it performs a given task well in the target domain by using the source-domain demonstrations O_SE and O_SN and its own target-domain observations O_TL.

4 Proposed Method

4.1 Reward Estimation Model

In this section, we present our reward estimation method to train the learner for the CDIL problem. The method is based on dual generative-adversarial learning composed of two base feature extractors, a domain feature extractor F_D and an expertness feature extractor F_E, and two discriminators, a domain discriminator D_D and an expertness discriminator D_E. Fig. 1 shows the overall structure. Every observation o in O_SE, O_SN, and O_TL is fed into both F_D and F_E, each of which outputs a base feature vector of length n that preserves only the required information about the input. That is, for each o, the domain base feature vector F_D(o) preserves the domain information of o (i.e., whether o comes from the source or the target domain), while the expertness base feature vector F_E(o) preserves the expertness information of o (i.e., whether o is generated by the expert or the non-expert policy). For example, the output of F_D for an observation in O_SE contains information indicating the source domain, and the output of F_E for the same observation contains information indicating the expert policy.

An additional aspect of our method is feature synthesis. Feature synthesis produces feature vectors representing all combinations of domain information and expertness information from the base feature vectors. The synthesized feature vectors are described on the right side of Fig. 1, and explained below.

Single-Feature Synthesis

Single-feature synthesis produces single synthesized feature vectors by concatenating a base feature vector with a zero vector placed in the position of the non-required information. For each observation o, we synthesize two feature vectors, [F_D(o), 0] and [0, F_E(o)], where 0 is a zero vector of proper dimension, F_D(o) is the domain base feature vector of o, and F_E(o) is the expertness base feature vector of o. The size of each synthesized feature vector is 2n; the first n components are intended to represent the domain information and the last n components are intended to represent the expertness information, with the missing information filled by the zero vector.
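To make the synthesis concrete, the following is a minimal sketch in PyTorch, assuming base feature vectors of length n = 128 as in Appendix A; the function name and tensors are illustrative and not part of the paper.

import torch

def synthesize_single(f_domain=None, f_expert=None, n=128):
    # Build a single synthesized feature vector of size 2n: the first n slots
    # carry domain information, the last n slots carry expertness information,
    # and the missing half is filled with zeros.
    zero = torch.zeros(n)
    if f_domain is not None and f_expert is None:
        return torch.cat([f_domain, zero])   # [F_D(o), 0]
    if f_expert is not None and f_domain is None:
        return torch.cat([zero, f_expert])   # [0, F_E(o)]
    raise ValueError("provide exactly one base feature vector")

# Illustrative base features standing in for F_D(o) and F_E(o):
f_D, f_E = torch.randn(128), torch.randn(128)
v_domain_only = synthesize_single(f_domain=f_D)  # fed to both D_D and D_E
v_expert_only = synthesize_single(f_expert=f_E)  # fed to both D_D and D_E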

The synthesized feature vectors for each o are fed into both discriminators D_D and D_E, where D_D predicts the domain label and D_E predicts the expertness label. We want F_D(o) to contain domain information only, so from the vector [F_D(o), 0], D_D should be able to predict the domain label but D_E should not be able to predict the expertness label. That is, F_D should extract base feature vectors that help D_D and fool D_E. Likewise, we want F_E(o) to contain expertness information only, so from the vector [0, F_E(o)], D_E should be able to predict the expertness label but D_D should not be able to predict the domain label. Hence, F_E should extract base feature vectors that help D_E and fool D_D.

Figure 1: Overall structure of the proposed model. (Left) The leftmost white squares are input observation images. All trapezoids are implemented with neural networks. The outputs of the discriminators are probabilities. (Right) The possible combinations for feature synthesis: upper, single-feature synthesis; lower, double-feature synthesis.

Double-Feature Synthesis

Double-feature synthesis generates double synthesized feature vectors of size 2n by combining the output of F_D and the output of F_E without zero-vector insertion. Again, the first n components represent the domain information and the last n components represent the expertness information. Here, the domain half and the expertness half may come from any pair of observations, possibly taken from different observation sets; e.g., combining the domain feature of a target-domain learner observation with the expertness feature of a source-domain expert observation is possible. In this way, we produce synthetic feature vectors containing all combinations of the domain information and the expertness information, including the target-domain expert combination that does not exist in the original observations. The double synthesized feature vectors are also fed into D_D and D_E. Since the input to each discriminator can now combine any two observation types, the roles of D_D and D_E are modified slightly as follows: from a double synthesized feature vector, D_D predicts the domain label of the observation that provided the domain half, regardless of the expertness half, and D_E predicts the expertness label of the observation that provided the expertness half, regardless of the domain half.
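A matching sketch of double-feature synthesis under the same assumptions; combining the domain half of target-domain learner observations with the expertness half of source-domain expert observations is how the missing target-domain expert combination is produced.

import torch

def synthesize_double(f_domain, f_expert):
    # Concatenate a domain base feature vector and an expertness base feature
    # vector taken from (possibly different) observations; no zero padding.
    return torch.cat([f_domain, f_expert], dim=-1)   # size 2n

# Illustrative batched example (names are ours): domain features of learner
# observations in O_TL combined with expertness features of expert
# observations in O_SE give synthetic "target-domain expert" vectors.
n = 128
f_D_target = torch.randn(32, n)   # F_D outputs for observations in O_TL
f_E_expert = torch.randn(32, n)   # F_E outputs for observations in O_SE
v_target_expert = synthesize_double(f_D_target, f_E_expert)  # shape (32, 256)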

Note that in the considered visual-input setting, a single image is sufficient to contain its domain information. However, it does not fully contain the expertness information, because the expertness of a policy is determined by the combination of a state and the corresponding action. Hence, to handle this problem, we use the combination of the current synthesized feature vector and the synthesized feature vector from 4 timesteps before as the input to D_E, as in Stadie et al. (2017), whereas we use only the current synthesized feature vector as the input to D_D. The combination of the current image and the image from 4 timesteps before contains both the state information and the corresponding action information, because the state changes due to the action. For notational simplicity, the delayed term in the input to D_E is omitted in what follows, although it is present.
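The temporal pairing for D_E can be sketched as follows, assuming a trajectory of synthesized feature vectors of size 2n = 256; with the 4-step delay this yields the 512-dimensional D_E input mentioned in Appendix A, while D_D keeps the current vector only.

import torch

def pair_with_delay(synth_feats, delay=4):
    # synth_feats: (T, 2n) synthesized feature vectors along one trajectory.
    # Returns (T - delay, 4n) inputs for D_E, concatenating the vector at
    # time t with the vector at time t - delay.
    return torch.cat([synth_feats[delay:], synth_feats[:-delay]], dim=1)

T, n = 50, 128
v = torch.randn(T, 2 * n)          # synthesized vectors for one trajectory
d_E_inputs = pair_with_delay(v)    # shape (46, 512), fed to D_E
d_D_inputs = v                     # shape (50, 256), fed to D_D as-is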

Loss Function

The loss functions for the proposed framework are given below. The major design objective is to set up a pair of generative-adversarial interactions between the base feature extractors and the discriminators, realized in the dual structure described above.

First, the loss function for the two base feature extractors (F_D, F_E) is given by

(1)

where

(2)
(3)
(4)

The components in (2)-(4) are defined as

(5)
(6)
(7)
(8)
(9)
(10)

Here, the expectations in (5)-(10) are empirical expectations over the corresponding observation sets; l_D(o) and l_E(o) denote the true domain and expertness labels of each observation o; and the weighting factors compensating for the difference in the amount of available data are defined as

(11)
(12)

where |·| denotes the cardinality of a set, 1{·} is the indicator function, and the cross entropy is taken between two probability values. Note that all discriminator outputs in (5)-(10) are classification probabilities, so they lie between 0 and 1.

For the update of the base feature extractors (F_D, F_E) based on (1), the two discriminators (D_D, D_E) are fixed. In (1), two terms are associated with the single synthesized feature vectors and one term is associated with the double synthesized feature vectors, and their relative weighting is controlled by the hyperparameter β. The term in (2) is set so that the output of F_D helps D_D but fools D_E, while the term in (3) is set so that the output of F_E helps D_E but fools D_D; the helping and the fooling are balanced by the hyperparameter λ. For the double synthesized feature vectors in (4), the goal of the base feature extractors is that D_D classifies the domain based only on the first half of each vector and D_E classifies the expertness based only on the second half.
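Since (1)-(10) are not reproduced above, the following PyTorch sketch only illustrates the described structure of the extractor loss: cross-entropy terms in which each base feature helps its own discriminator and, weighted by λ, fools the other one, with β weighting the double-synthesis term. The data-balancing weights of (11)-(12) and the delayed pairing for D_E are omitted, and all names are ours, so the exact terms may differ from the paper's.

import torch
import torch.nn.functional as F

def extractor_loss(f_D, f_E, D_D, D_E, l_D, l_E, lam=0.5, beta=0.5):
    # f_D, f_E: (B, n) base features for a batch of observations.
    # l_D, l_E: (B,) binary domain / expertness labels as floats.
    # D_D, D_E: callables mapping a (B, 2n) vector to a probability in (0, 1),
    #           e.g., the fully connected discriminators of Appendix A.
    zeros = torch.zeros_like(f_D)
    v_dom = torch.cat([f_D, zeros], dim=1)    # [F_D(o), 0]
    v_exp = torch.cat([zeros, f_E], dim=1)    # [0, F_E(o)]
    v_dbl = torch.cat([f_D, f_E], dim=1)      # double synthesis
    bce = F.binary_cross_entropy
    # F_D should help D_D (minimize CE) and fool D_E (maximize CE).
    loss_dom = bce(D_D(v_dom), l_D) - lam * bce(D_E(v_dom), l_E)
    # F_E should help D_E and fool D_D.
    loss_exp = bce(D_E(v_exp), l_E) - lam * bce(D_D(v_exp), l_D)
    # On double vectors, each discriminator should classify its own half.
    loss_dbl = bce(D_D(v_dbl), l_D) + bce(D_E(v_dbl), l_E)
    return loss_dom + loss_exp + beta * loss_dbl

The discriminators are kept fixed while this loss is minimized with respect to the extractor parameters.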

Next, the loss function for the two discriminators (D_D, D_E) is given as follows:

(13)

where

(14)
(15)
(16)

Here, the loss components on the right-hand sides (RHSs) of (14)-(16) are defined in (5)-(10), and λ and β are the same hyperparameters as those used in (1)-(4).

For the update of the discriminators (D_D, D_E) based on (13), the two base feature extractors (F_D, F_E) are fixed. Eq. (16) means that, for fixed F_D and F_E, each discriminator aims to perform its own job well based on the double synthesized feature vectors. Note that the signs in the RHSs of (14)-(15) are now reversed to plus as compared to (2)-(3): from the perspective of the discriminators, both discriminators want to do their jobs well even on the single synthesized feature vectors.

The base feature extractors and the discriminators are updated based on (1) and (13) in an alternating manner.
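A sketch of the alternating update under the same assumptions as the extractor-loss sketch above: the discriminator loss reuses the same cross-entropy terms with the single-feature signs flipped to plus, as described for (13)-(16), and the two parameter groups are stepped alternately with separate ADAM optimizers (learning rate 0.0001, as in Appendix A). The toy networks exist only to make the snippet self-contained.

import torch
import torch.nn as nn
import torch.nn.functional as F

n = 128
F_D, F_E = nn.Linear(10, n), nn.Linear(10, n)             # toy extractors
D_D = nn.Sequential(nn.Linear(2 * n, 1), nn.Sigmoid(), nn.Flatten(0))
D_E = nn.Sequential(nn.Linear(2 * n, 1), nn.Sigmoid(), nn.Flatten(0))
opt_disc = torch.optim.Adam(list(D_D.parameters()) + list(D_E.parameters()), lr=1e-4)
opt_feat = torch.optim.Adam(list(F_D.parameters()) + list(F_E.parameters()), lr=1e-4)

def dual_losses(obs, l_D, l_E, lam=0.5, beta=0.5):
    f_D, f_E = F_D(obs), F_E(obs)
    zeros = torch.zeros_like(f_D)
    v_dom, v_exp = torch.cat([f_D, zeros], 1), torch.cat([zeros, f_E], 1)
    v_dbl = torch.cat([f_D, f_E], 1)
    bce = F.binary_cross_entropy
    help_dom, fool_dom = bce(D_D(v_dom), l_D), bce(D_E(v_dom), l_E)
    help_exp, fool_exp = bce(D_E(v_exp), l_E), bce(D_D(v_exp), l_D)
    dbl = bce(D_D(v_dbl), l_D) + bce(D_E(v_dbl), l_E)
    # Extractor loss (1): help the own discriminator, fool the other.
    loss_feat = (help_dom - lam * fool_dom) + (help_exp - lam * fool_exp) + beta * dbl
    # Discriminator loss (13): all signs positive.
    loss_disc = (help_dom + lam * fool_dom) + (help_exp + lam * fool_exp) + beta * dbl
    return loss_feat, loss_disc

obs = torch.randn(16, 10)                    # toy observations
l_D = torch.randint(0, 2, (16,)).float()     # domain labels
l_E = torch.randint(0, 2, (16,)).float()     # expertness labels

# One alternating step: update D_D, D_E with F_D, F_E fixed, then vice versa.
_, loss_disc = dual_losses(obs, l_D, l_E)
opt_disc.zero_grad(); loss_disc.backward(); opt_disc.step()
loss_feat, _ = dual_losses(obs, l_D, l_E)
opt_feat.zero_grad(); loss_feat.backward(); opt_feat.step()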

4.2 Learner Policy Update with Estimated Reward

After the reward estimation model has been trained for a number of time steps, the learner updates its policy for a number of time steps based on the estimated reward and its own states and actions, and these two phases of reward estimation model learning and policy learning alternate. The estimated reward used for the policy update during the learning period is given by

(17)

where the reward at time step t is computed from the learner's observation in the target domain at that time step. Any standard RL method can be applied to update the learner policy; in this paper, we chose PPO Schulman et al. (2017) as the policy update algorithm. The details of the policy network and parameters are explained in Appendix A. The overall procedure for training the reward estimation model and the learner policy is summarized in Algorithm 1.

  Input: Number of epochs, observation datasets O_SE, O_SN, O_TL, the labels l_D(o), l_E(o) for each observation, and the networks F_D, F_E, D_D, D_E
  Initialize the networks F_D, F_E, D_D, D_E and the learner policy π_TL.
  repeat
     Extract base feature vectors F_D(o) and F_E(o) from each observation o
     Produce single-feature and double-feature synthesized vectors
     Update D_D, D_E based on the loss (13).
     Update F_D, F_E based on the loss (1).
     for each learner observation in O_TL do
        Compute the estimated reward (17).
     end for
     Update π_TL using the estimated rewards.
  until the number of epochs is reached
Algorithm 1 Cross-Domain Imitation Learning with a Dual Structure (CDIL-DS)
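Since (17) is not reproduced above, the snippet below uses a common GAIL-style surrogate computed from the expertness discriminator's output on the learner's synthesized (and 4-step-paired) features; this particular functional form is our assumption, not necessarily the one in (17).

import torch

def estimated_reward(D_E, v_t):
    # v_t: synthesized feature vector(s) for the learner's observation at
    # time t, already paired with the 4-step-delayed vector for D_E.
    # Assumed GAIL-style surrogate: larger when D_E judges the learner's
    # observation to be expert-like.
    with torch.no_grad():
        p_expert = D_E(v_t).clamp(1e-6, 1 - 1e-6)
    return -torch.log(1.0 - p_expert)

# The resulting rewards replace the environment reward in the PPO update of
# the learner policy; no true reward from the target domain is used.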

4.3 Hyperparameter Adaptation

For the intended dual generative-adversarial learning for CDIL, the hyperparameter λ plays the key role in extracting the domain-information-only base feature and the expertness-information-only base feature.

Figure 2: Classification accuracies in Reacher-Angle: (a) fixed λ and (b) λ adaptation.

Fig. 2(a) shows the average domain and expertness classification accuracies over the first 200 epochs in the Reacher-Angle environment with fixed λ and β, where each classification accuracy is defined as the probability of correct classification by a discriminator given the single synthesized feature vector composed of one base feature vector and the zero vector. The ideal case is that each discriminator classifies its own type of information almost perfectly while its accuracy on the other type of information stays near 0.5 (chance level), which corresponds to perfect dual base feature extraction. As seen in Fig. 2(a), all but one of the accuracies behave well with the chosen fixed λ. The problem, however, is that it is difficult to manually find a suitable λ for good dual feature extraction in each environment. Therefore, we propose a method to adapt λ, while we leave β as a hyperparameter (whose impact will be shown to be less sensitive in Section 6).

Figure 3: Performance comparison between fixed λ (blue) and λ adaptation (green).

The proposed λ adaptation method is as follows. We first compute the classification accuracies for the single synthesized feature vectors. The ideal case is that, for the domain-information feature vector, the domain accuracy is close to 1 and the expertness accuracy is close to 0.5, and, for the expertness-information feature vector, the expertness accuracy is close to 1 and the domain accuracy is close to 0.5. In order to control λ toward this ideal case, we aggregate the accuracies into two quantities, a keep accuracy (the accuracy of each discriminator on the type of information its input feature vector is supposed to contain) and a cross accuracy (the accuracy on the type of information that is supposed to be removed), considering that a common λ is used in (2)-(3) and (14)-(15). If the keep accuracy is low, the base feature vector does not sufficiently contain the necessary information, and hence λ should be decreased. On the other hand, if the cross accuracy is larger than 0.5, the base feature vector does not sufficiently remove the unnecessary information, and hence λ should be increased. Furthermore, if the cross accuracy is smaller than 0.5, the base feature extractors are too strong, so that the discriminator predicts the label the other way around, and hence λ should be decreased. Here, we compute the expertness accuracy only for observations from the source domain, because we do not have a clear-cut expert in the target domain during the learning process.

Based on the above discussion, we present the λ adaptation method in Algorithm 2, which incorporates a range for λ and a dead zone for numerical stability. We applied Algorithm 2 to the same setup as in Fig. 2(a), except for the λ adaptation. The result is shown in Figs. 2(b) and 3. It is seen that Algorithm 2 works properly and achieves the desired base feature extraction. Note that the dead zone for the cross accuracy is set to [0.4, 0.6] in Algorithm 2. As seen in Fig. 3, the performance improves with the adaptive control of λ.

  Input: the multiplicative adaptation factors, the keep-accuracy threshold, the cross-accuracy dead zone [0.4, 0.6], and the range [λ_min, λ_max]
  For each epoch, update the reward estimation model and compute the keep accuracy and the cross accuracy.
  if the keep accuracy is below its threshold then decrease λ (multiply by the decrease factor)
  else if the cross accuracy is above the dead zone then increase λ (multiply by the increase factor)
  else if the cross accuracy is below the dead zone then decrease λ (multiply by the decrease factor)
  else do not change λ.
  end if
  Clip λ so that it stays within [λ_min, λ_max].
Algorithm 2 λ adaptation for the reward estimation model
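A sketch of Algorithm 2 in Python under the naming of Section 4.3 (keep accuracy and cross accuracy); the threshold, multiplicative factor, and the exact form of the conditions are assumptions chosen only to match the textual description, with the cross-accuracy dead zone set to [0.4, 0.6].

def adapt_lambda(lam, acc_keep, acc_cross,
                 keep_thresh=0.9, dead_lo=0.4, dead_hi=0.6,
                 factor=1.1, lam_min=0.1, lam_max=3.0):
    # acc_keep:  accuracy of each discriminator on the information its input
    #            feature vector is supposed to contain (ideally close to 1).
    # acc_cross: accuracy on the information that should have been removed
    #            (ideally close to 0.5).
    if acc_keep < keep_thresh:
        lam /= factor     # features keep too little required information
    elif acc_cross > dead_hi:
        lam *= factor     # features remove too little unnecessary information
    elif acc_cross < dead_lo:
        lam /= factor     # extractors too strong: labels predicted inversely
    # inside the dead zone [dead_lo, dead_hi]: leave lambda unchanged
    return min(max(lam, lam_min), lam_max)

# Example: called once per epoch after updating the reward estimation model.
lam = 0.5
lam = adapt_lambda(lam, acc_keep=0.95, acc_cross=0.7)   # -> lambda increased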

5 Experimental Setup

We evaluated the performance of the proposed method for four tasks in the MuJoCo Todorov et al. (2012) simulator: Inverted Pendulum, Inverted Double Pendulum, Reacher-Angle and Reacher-Complex. Visualized source domain and target domain observations for each environment are provided in Appendix B.

Inverted Pendulum (IP): A movable cart with a pole on top. The goal is to move the cart so as to keep the pole from falling down. Each episode lasts at most 1000 timesteps and terminates when the pole falls down. The pole color in the source domain is different from that in the target domain.

Inverted Double Pendulum (IDP): A movable cart with two poles, one on top of the other. This is a harder version of Inverted Pendulum. The rest is the same as that of Inverted Pendulum. The pole color in the source domain is different from that in the target domain.

Reacher-Angle (Reacher-A): A two-link robot arm at the center and a target point. The goal is to move the arm so that its endpoint reaches the target point without too much movement. The length of each episode is fixed at 50 timesteps. The camera angle for observations in the source domain is tilted 30 degrees from that in the target domain. We fixed the goal position to simplify the problem.

Reacher-Complex (Reacher-C): A harder version of Reacher-Angle. In addition to the tilted camera angle, a checkered box and a light source are also added to the source domain. The rest is the same as that of Reacher-Angle. This task is difficult because there are substantial visual differences between the two domains.

Among several CDIL methods Liu et al. (2018); Gamrian and Goldberg (2019); Stadie et al. (2017), we chose TPIL Stadie et al. (2017), together with GAIL Ho and Ermon (2016), as the comparison baselines, since the assumptions required by these baselines are similar to those of the proposed method. Note that the goal of the proposed method becomes the same as that of GAIL when there is no difference between the source and target domains. Unlike the proposed method, Gamrian and Goldberg (2019) assumes the availability of true rewards during policy training, and time-aligned data from multiple source domains are required in Liu et al. (2018). Our method and the baselines are implemented based on RLlab Duan et al. (2016). For a fair comparison, we used PPO Schulman et al. (2017) with the same learning rate for GAIL, TPIL, and the proposed method, and used ADAM Kingma and Ba (2014) as the optimizer. For GAIL, we set the source domain to be identical to the target domain; this serves as an upper bound on our performance because GAIL assumes no domain difference. The details of the network structure and parameters are provided in Appendix A.

6 Results

Performances:    The performance of the trained learner is evaluated in the target domain for the proposed method and the baselines on the four tasks described in Section 5. Fig. 4 shows the average return for each task. We set the learning rate to 0.0001 for all methods. For the proposed method, we evaluated both fixed λ and λ adapted based on Algorithm 2, whose input parameters were set to one set of values for Inverted Pendulum and Inverted Double Pendulum and to another set for Reacher-Angle and Reacher-Complex. It is seen that the proposed method, with both fixed and adapted λ, almost achieves the performance of GAIL, which was developed for IL with no domain difference, whereas TPIL failed to train the learner in these environments. Hence, the proposed method is very effective in overcoming the visual domain difference in CDIL. It is also seen that the proposed λ adaptation in Algorithm 2 is effective and yields a performance gain over the fixed-λ version.

Figure 4: Performance results for (a) Inverted Pendulum, (b) Inverted Double Pendulum, (c) Reacher-Angle, and (d) Reacher-Complex. Solid lines indicate the average return over 5 trials, and the shaded area indicates plus/minus one standard deviation.

Algorithm   λ      β     lr        IP         IDP         Reacher-A     Reacher-C
CDIL-DS     0.5    0.5   0.0001    874±170    5978±2008   -20.6±18.5    -19.8±16.0
CDIL-DS     0.2    0.5   0.0001    933±133    5132±2528   -36.1±24.4    -23.3±17.8
CDIL-DS     1.0    0.5   0.0001    960±80     5711±2004   -15.9±9.1     -20.7±7.3
CDIL-DS     3.0    0.5   0.0001    820±184    6124±1934   -45.3±16.7    -12.3±1.7
CDIL-DS     0.5    0.2   0.0001    1000±0     5089±2745   -28.6±22.5    -19.2±8.3
CDIL-DS     0.5    1.0   0.0001    741±258    7077±1711   -29.0±12.1    -24.0±19.7
CDIL-DS     0.5    3.0   0.0001    693±381    6522±2272   -26.5±20.4    -34.4±24.2
CDIL-DS     0.5    0.5   0.001     1000±0     8240±1321   -58.9±18.7    -56.4±24.1
CDIL-DS     0.5    0.5   0.0005    821±357    7362±2732   -30.6±17.5    -55.8±23.5
CDIL-DS     0.5    0.5   0.00003   142±133    4619±2731   -18.1±9.7     -33.4±32.2
CDIL-DS     adapt  0.5   0.0001    1000±0     8105±1617   -10.0±2.0     -19.6±9.0
GAIL        -      -     0.0001    1000±0     9299±18     -9.2±1.8      -9.2±1.8
TPIL        -      -     0.0001    265±172    502±98      -64.7±20.9    -60.0±24.5
Table 1: Performance results (average return ± one standard deviation) over various values of the hyperparameters λ, β, and the learning rate (lr) for the proposed method (including the fixed-λ and adapted-λ versions) and the two baselines.

Ablation Study

We tested the impact of several hyperparameters of the proposed method: λ, β, and the learning rate. Table 1 shows the average return of the trained learner policy in the target domain for various values of these hyperparameters. First, we fixed β = 0.5 and varied λ over {0.2, 0.5, 1.0, 3.0}. As expected, Table 1 shows a noticeable performance variation on the more difficult Reacher-Angle task, whereas there is no large variation on the easier Inverted Pendulum and Inverted Double Pendulum tasks. Next, we fixed λ = 0.5 and varied β over {0.2, 0.5, 1.0, 3.0}. With this variation of β, the performance on the more difficult Reacher-Angle and Reacher-Complex tasks does not change much. Again, it is seen that for fixed β, the adapted-λ version yields the best performance in most cases.

7 Conclusions

In this paper, we have proposed a framework for visual CDIL based on dual generative-adversarial learning with dual feature extraction and dual discrimination. The proposed method extracts two base feature vectors from each input image: one preserving only its domain information and the other preserving only the policy-expertness information, and then synthesizes feature vectors by concatenating the two base feature vectors. Based on the dual feature extraction strategy and dual generative-adversarial learning, the proposed method can train a learner policy in the target domain even if it is different from the source domain. Numerical results show that the proposed method overcomes the visual domain difference in CDIL and almost achieves the performance of GAIL developed for IL with no domain difference for the considered MuJoCo tasks.

Broader Impact

Our dual-structured learning framework for CDIL can be applied to enhance the learning performance and efficiency in imitation learning and robot control applications. In general, domain adaptation frameworks for imitation learning alleviate the difficulty of collecting heavily customized visual demonstrations for robots to acquire desired behaviors. Such flexibility makes it easier for robots to acquire skills in new real-world environments using demonstrations from easier environments such as simulators, as mentioned in the introduction. This is useful when ideal demonstration data are hard to obtain, due to a lack of available demonstrations or the high cost of collecting real-world data. Developing advanced CDIL algorithms has positive impacts on reducing the cost and time required for learning and on increasing efficiency in many real-world problems that can be approached by cross-domain imitation learning. On the other hand, with advances in reinforcement learning and artificial intelligence in general, the operation of systems becomes more and more automatic. Hence, some current jobs may be affected: some may disappear and some new jobs may also be created.

References

  • P. Abbeel and A. Y. Ng (2004) Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04), New York, NY, USA, pp. 1.
  • M. Bain and C. Sammut (1999) A framework for behavioural cloning. In Machine Intelligence 15, pp. 103–129.
  • T. Carr, M. Chli, and G. Vogiatzis (2019) Domain adaptation for reinforcement learning on the Atari. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS '19), pp. 1859–1861.
  • Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel (2016) Benchmarking deep reinforcement learning for continuous control. In Proceedings of the 33rd International Conference on Machine Learning, PMLR 48, pp. 1329–1338.
  • C. Finn, S. Levine, and P. Abbeel (2016) Guided cost learning: deep inverse optimal control via policy optimization. In Proceedings of the 33rd International Conference on Machine Learning (ICML '16), pp. 49–58.
  • J. Fu, K. Luo, and S. Levine (2018) Learning robust rewards with adversarial inverse reinforcement learning. In 6th International Conference on Learning Representations (ICLR 2018).
  • S. Gamrian and Y. Goldberg (2019) Transfer learning for related reinforcement learning tasks via image-to-image translation. In Proceedings of the 36th International Conference on Machine Learning, PMLR 97, pp. 2063–2072.
  • Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, PMLR 37, pp. 1180–1189.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pp. 2672–2680.
  • A. Gupta, C. Devin, Y. Liu, P. Abbeel, and S. Levine (2017) Learning invariant feature spaces to transfer skills with reinforcement learning. arXiv:1703.02949.
  • I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner (2017) DARLA: improving zero-shot transfer in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, PMLR 70, pp. 1480–1490.
  • J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS '16), pp. 4572–4580.
  • X. Huang, M. Liu, S. Belongie, and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.
  • H. Lee, H. Tseng, J. Huang, M. K. Singh, and M. Yang (2018) Diverse image-to-image translation via disentangled representations. arXiv:1808.00948.
  • H. Lee, H. Tseng, Q. Mao, J. Huang, Y. Lu, M. K. Singh, and M. Yang (2019) DRIT++: diverse image-to-image translation via disentangled representations. arXiv:1905.01270.
  • Y. Liu, A. Gupta, P. Abbeel, and S. Levine (2018) Imitation from observation: learning to imitate behaviors from raw video via context translation. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1118–1125.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015) Human-level control through deep reinforcement learning. Nature 518, pp. 529–533.
  • Z. Murez, S. Kolouri, D. J. Kriegman, R. Ramamoorthi, and K. Kim (2017) Image to image translation for domain adaptation. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4500–4509.
  • A. Y. Ng and S. J. Russell (2000) Algorithms for inverse reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML '00), pp. 663–670.
  • T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters (2018) An algorithmic perspective on imitation learning. Foundations and Trends in Robotics 7 (1-2), pp. 1–179.
  • S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR 15, pp. 627–635.
  • J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv:1707.06347.
  • D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
  • B. C. Stadie, P. Abbeel, and I. Sutskever (2017) Third-person imitation learning. arXiv:1703.01703.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. Second edition, The MIT Press.
  • E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5026–5033.
  • F. Torabi, G. Warnell, and P. Stone (2018) Behavioral cloning from observation. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI '18), pp. 4950–4957.
  • Z. Yi, H. Zhang, P. Tan, and M. Gong (2017) DualGAN: unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2849–2857.
  • T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. In 6th International Conference on Learning Representations (ICLR 2018), Workshop Track.
  • J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251.
  • B. D. Ziebart, A. Maas, J. A. Bagnell, and A. K. Dey (2008) Maximum entropy inverse reinforcement learning. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI '08), pp. 1433–1438.

Appendix A Details for Training Neural Networks

We trained the learner policy for 1000 epochs, where one epoch consists of reward model training followed by learner policy training. For each epoch, we draw 4000 samples from each of O_SE, O_SN, and O_TL, so that we have 12,000 samples in total. First, we train the reward model by repeatedly drawing three mini-batches of size 10, one from each of the three 4000-sample sets; this is repeated 400 times per epoch. Then, based on the 4000 samples from O_TL, we update the learner policy using PPO for 125 iterations with mini-batches of size 32.
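The per-epoch schedule above can be sketched as follows; update_reward_model and update_policy_ppo are placeholders for the update steps of Sections 4.1 and 4.2, and the use of random.sample for minibatching is an implementation assumption.

import random

SAMPLES_PER_SET, REWARD_ITERS, REWARD_BATCH = 4000, 400, 10
PPO_ITERS, PPO_BATCH = 125, 32

def update_reward_model(batch_SE, batch_SN, batch_TL):
    pass   # placeholder: one update of F_D, F_E, D_D, D_E

def update_policy_ppo(minibatch):
    pass   # placeholder: one PPO minibatch update of the learner policy

def train_one_epoch(O_SE, O_SN, O_TL):
    # Reward model: 400 iterations, one size-10 minibatch from each set.
    for _ in range(REWARD_ITERS):
        update_reward_model(random.sample(O_SE, REWARD_BATCH),
                            random.sample(O_SN, REWARD_BATCH),
                            random.sample(O_TL, REWARD_BATCH))
    # Learner policy: 125 PPO iterations with minibatches of size 32,
    # drawn from the 4000 learner samples.
    for _ in range(PPO_ITERS):
        update_policy_ppo(random.sample(O_TL, PPO_BATCH))

train_one_epoch(list(range(SAMPLES_PER_SET)),
                list(range(SAMPLES_PER_SET)),
                list(range(SAMPLES_PER_SET)))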

Reward Estimation Model    The domain base feature extractor consists of 2 CNN layers with 5 filters of size 3×3, followed by a fully connected layer. The ReLU activation is used except at the output layer. The input to the domain base feature extractor is an RGB image of size 50×50×3. The expertness base feature extractor has the same structure as the domain base feature extractor without sharing network weights. The size of the output of both feature extractors (i.e., the size of each base feature vector) is 128. The domain discriminator consists of 2 fully connected hidden layers of size 128, followed by a fully connected output layer. The input size of the domain discriminator is 256 = 128×2. The ReLU activation is used except at the output layer. The expertness discriminator has a structure similar to that of the domain discriminator, except that its input size is 512 instead of 256 due to the concatenation of the features of two images, at the current time step and at 4 time steps before. The parameters of the feature extractors and the discriminators are updated using the ADAM optimizer with learning rate 0.0001.
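A sketch of the reward estimation networks following the sizes above; strides, padding, and pooling of the convolutional layers are not specified in the text, so the values below are assumptions.

import torch
import torch.nn as nn

class BaseFeatureExtractor(nn.Module):
    # 2 conv layers (5 filters, 3x3) + one fully connected layer to a
    # 128-dimensional base feature; ReLU everywhere except the output.
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 5, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(5, 5, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten())
        with torch.no_grad():
            flat = self.conv(torch.zeros(1, 3, 50, 50)).shape[1]
        self.fc = nn.Linear(flat, feat_dim)

    def forward(self, x):              # x: (B, 3, 50, 50) RGB observations
        return self.fc(self.conv(x))

def make_discriminator(in_dim):
    # 2 fully connected hidden layers of size 128 + sigmoid output probability.
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(),
                         nn.Linear(128, 1), nn.Sigmoid())

F_D, F_E = BaseFeatureExtractor(), BaseFeatureExtractor()   # no weight sharing
D_D = make_discriminator(256)   # one synthesized vector (2 x 128)
D_E = make_discriminator(512)   # current + 4-steps-before synthesized vectors
opt_feat = torch.optim.Adam(list(F_D.parameters()) + list(F_E.parameters()), lr=1e-4)
opt_disc = torch.optim.Adam(list(D_D.parameters()) + list(D_E.parameters()), lr=1e-4)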

Policy Network    The policy network consists of 2 fully-connected hidden layers followed by a fully connected output layer. The size of the hidden layers is 32 for Inverted Pendulum, Reacher-Angle and Reacher-Complex, and 100 for Inverted Double Pendulum. The policy is trained using the clipping version of PPO (with a fixed clipping ratio and discount factor) and the GAE parameter set to 1. PPO also uses the ADAM optimizer for network parameter updates with learning rate 0.0003. Gradient clipping is applied, and the maximum gradient is 1.0.
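For completeness, a sketch of the policy network and PPO settings: the hidden activation, the scalar action head, and the clipping ratio and discount factor (whose values are not reproduced above) are assumptions, while the hidden sizes, GAE parameter, learning rate, and gradient clipping follow the text.

import torch
import torch.nn as nn

def make_policy(obs_dim, act_dim, hidden=32):
    # 2 fully connected hidden layers + linear output (e.g., a Gaussian mean);
    # hidden = 32 for Inverted Pendulum / Reacher-Angle / Reacher-Complex,
    # hidden = 100 for Inverted Double Pendulum. ReLU is an assumption.
    return nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, act_dim))

PPO_CONFIG = dict(clip_ratio=0.2,     # assumed: value not given above
                  gamma=0.99,         # assumed: value not given above
                  gae_lambda=1.0,     # stated in the text
                  lr=3e-4,            # stated in the text
                  max_grad_norm=1.0)  # stated in the text

policy = make_policy(obs_dim=11, act_dim=1, hidden=100)   # e.g., IDP; dims assumed
optimizer = torch.optim.Adam(policy.parameters(), lr=PPO_CONFIG["lr"])
# After loss.backward() in the PPO update:
# torch.nn.utils.clip_grad_norm_(policy.parameters(), PPO_CONFIG["max_grad_norm"])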

Appendix B Observation Images for Source and Target Domains

Figure 5: Sample observation images: starting from the left column, (a) Inverted Pendulum, (b) Inverted Double Pendulum, (c) Reacher-Angle and (d) Reacher-Complex environment. The images on the top are from the source domain, and the images at the bottom are from the target domain.

Appendix C Learning Curves for Several Hyperparameters

This section shows the learning curves for several hyperparameters. Each solid line indicates the average return over 5 trials, and the fainted area indicates the plus/minus 1 standard deviation. If not mentioned, the default hyperparameters are and learning rate = 0.0001.

Figure 6: Performance results obtained by varying (a) λ, (b) β, and (c) the learning rate in the Inverted Pendulum environment.
Figure 7: Performance results obtained by varying (a) λ, (b) β, and (c) the learning rate in the Inverted Double Pendulum environment.
Figure 8: Performance results obtained by varying (a) λ, (b) β, and (c) the learning rate in the Reacher-Angle environment.
Figure 9: Performance results obtained by varying (a) λ, (b) β, and (c) the learning rate in the Reacher-Complex environment.