Domain-Adversarial and -Conditional State Space Model for Imitation Learning

01/31/2020 ∙ by Ryo Okumura, et al. ∙ Panasonic Corporation of North America ∙ Ritsumeikan Univ

State representation learning (SRL) in partially observable Markov decision processes has been studied to learn abstract features of data that are useful for robot control tasks. For SRL, acquiring domain-agnostic states is essential for achieving efficient imitation learning (IL); without them, IL is hampered by domain-dependent information that is useless for control. However, existing methods fail to remove such disturbances from the states when the data from experts and agents show large domain shifts. To overcome this issue, we propose a domain-adversarial and -conditional state space model (DAC-SSM) that enables control systems to obtain domain-agnostic and task- and dynamics-aware states. DAC-SSM jointly optimizes the state inference, observation reconstruction, forward dynamics, and reward models. To remove domain-dependent information from the states, the model is trained with domain discriminators in an adversarial manner, and the reconstruction is conditioned on domain labels. We experimentally evaluated the model predictive control performance via IL for continuous control of sparse reward tasks in simulators and compared it with that of an existing SRL method. The agents trained with DAC-SSM achieved performance comparable to the experts and more than twice that of the baselines. We conclude that domain-agnostic states are essential for IL under large domain shifts and that such states can be obtained with DAC-SSM.


1 Introduction

State representation learning (SRL) [18] has been studied to obtain compact and expressive representations of robot control tasks from high-dimensional sensor data, such as images. Appropriate state representation enables agents to achieve high performance for discrete and continuous control tasks ranging from games [9] to real robots [29]. Sequential state space models have been shown to improve the performance and sample efficiency of robot control tasks in partially observable Markov decision processes (POMDPs). The deep planning network (PlaNet) [10] is a planning methodology in the latent space that is trained with a task- and dynamics-aware state space model called a recurrent state space model (RSSM). RSSM jointly optimizes the state inference, observation reconstruction, forward dynamics, and reward models. PlaNet performs model predictive control (MPC) [5, 21] for planning in the state space obtained by RSSM.

Acquiring domain-agnostic states is essential for achieving efficient imitation learning (IL). Without domain-agnostic states, IL is hampered by domain-dependent information, which is useless for control. In the context of IL, it is natural to assume that the data from experts and agents have domain shifts [27]. However, the current SRL methods [10, 16] fail to remove such disturbances from the states when the domain shifts are large. In IL, a discriminator serves as an imitation reward function to distinguish the state-action pairs of the experts from those of the agents [11]. If the obtained states are NOT domain-agnostic, the discriminator is disturbed by the domain-dependent information, which is eye-catching but unrelated to the control and tasks. As a result, the imitation reward becomes unsuitable for the control, and IL is disrupted. Figure 1 shows examples of domain shifts between the data from an expert and an agent. We define domain shifts as control-irrelevant changes in the data, such as appearance: colors, textures, backgrounds, viewing angles, and objects unrelated to the control. Domain shifts are caused, for example, by changes in camera settings, the location of data collection, or the appearance of the robot. They also arise when objects unseen in one domain appear in the other. For example, an operator is present in the expert images when he or she gives demonstrations via the direct teaching mode of a robot; in this case, the presence of the operator in the images causes the domain shift.

To overcome this problem, in this paper we propose a domain-agnostic and task- and dynamics-aware SRL model, called a domain-adversarial and -conditional state space model (DAC-SSM). DAC-SSM builds on RSSM and is trained with a domain discriminator and an expert discriminator. To remove the domain-dependent information from the states, (1) the state space is trained with the domain discriminator in an adversarial manner, and (2) the encoder and decoder of DAC-SSM are conditioned on domain labels. The domain discriminator is trained to identify which domain the acquired states belong to. The negative loss function of the domain discriminator, called the domain confusion loss [28], is added to the loss function of the state space. To reduce the domain confusion loss, the states are trained to be domain-agnostic: DAC-SSM learns to infer states that give the domain discriminator few clues for distinguishing their domain. Moreover, the states are disentangled by conditioning the encoder and decoder on domain labels, as in conditional variational autoencoders (CVAE) [13]. Owing to this disentanglement, the domain-dependent information is eliminated from the state representation. Because DAC-SSM jointly optimizes the state inference, observation reconstruction, forward dynamics, and reward models, the obtained states are task- and dynamics-aware as well as domain-agnostic. To the best of our knowledge, no previous study has combined domain-adversarial training with SRL for control tasks.

The main contribution of this paper is the implementation and experimental demonstration that the state representation obtained via DAC-SSM is suitable for IL with large domain shifts. We compared DAC-SSM with existing SRL methods in terms of MPC performance via IL for continuous-control sparse-reward tasks in the MuJoCo physics simulator [26]. The agents trained with DAC-SSM achieved performance comparable to the expert and more than twice that of the baselines.

2 Related studies

State representation learning for POMDPs

The sequential state space model has been studied to solve tasks in POMDPs. Lee et al. [16] proposed a sequential latent variable model that propagates historical information from a control system via contextual stochastic states. They jointly optimized the actor and critic using the state space model. Gangwani et al. [3] jointly optimized the expert discriminator together with policy, forward and inverse dynamics, and action models to obtain task- and dynamics-aware state representation. Their state representation, however, is not domain-agnostic.

Domain-agnostic feature representation

Domain-agnostic feature representations have been obtained by domain-adversarial training or by disentangling the latent space [6]. Domain-adversarial training is a simple and effective approach to extracting feature representations that are unrelated to the domains of the data. Tzeng et al. [28] added the domain confusion loss to the loss function of the feature extractor. Ganin et al. [4] introduced a gradient reversal layer to back-propagate the negated gradient of the domain discriminator loss to the feature extractor. CVAE is a well-known method that can disentangle domain-dependent information from the latent space: Kingma et al. [13] conditioned the encoder and decoder on domain labels to obtain domain-agnostic latent variables.
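As a concrete illustration of the adversarial mechanism described above, here is a minimal PyTorch sketch of a gradient reversal layer in the style of Ganin et al. [4]; it is not code from the cited papers, and the scaling factor `lambd` is an illustrative parameter.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated, scaled gradient on the backward
    pass, so the feature extractor is trained to confuse the domain classifier."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the features.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# Usage: features pass through grad_reverse before the domain classifier, so
# minimizing the classifier loss drives the features toward domain confusion.
```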

Imitation learning (IL)

IL [22] is a powerful and widely accepted approach that makes agents mimic expert behavior by using a set of task demonstrations. Ho and Ermon [11] proposed an IL framework called generative adversarial imitation learning (GAIL). In GAIL, imitation rewards are computed by the expert discriminator, which distinguishes whether a state-action pair is generated by an agent policy or comes from the expert demonstrations. They formulated the joint process of reinforcement learning and inverse reinforcement learning as a two-player game between the policy and the discriminator, analogous to generative adversarial networks [7]. GAIL has been shown to solve complex high-dimensional continuous control tasks [15, 1, 20, 23].

IL with the domain shifts

Using common measurable features is a popular approach: for example, keypoints of objects [24] and/or tracked marker positions [8, 17] are used as the states. With this approach, one can directly apply existing IL techniques without addressing the domain shifts; however, such features are not always available. Stadie et al. [25] added the domain confusion loss to the expert discriminator to make it domain-agnostic. By computing the imitation reward with this discriminator, they achieved IL under large domain shifts. Their approach, however, does not include SRL.

Figure 2: (a) Concept of the proposed method. The expert discriminator serves as an imitation reward function. Red, green, and blue arrows represent the flow of state inference and the computation of imitation rewards: the red arrow is for the expert data, the green arrow is for agent data whose behavior is expert-like, and the blue arrow is for agent data whose behavior is NOT expert-like. (b) Training architecture of DAC-SSM. The dashed lines represent back-propagation paths. The domain confusion losses are added to the state space losses. Separate replay buffers hold the data from the agents, experts, and novices, and the domain discriminator distinguishes the domain of the inferred states.

3 Proposed Method

3.1 Concept of proposed method

Figure 2(a) shows the concept of DAC-SSM. The expert discriminator serves as an imitation reward function. Because DAC-SSM builds a domain-agnostic state space, higher rewards are provided to the agents for expert-like behavior. In contrast, the existing method builds a domain-aware state space, so the expert discriminator easily distinguishes the states of the agents even when their behavior is expert-like.

3.2 State space model

In POMDPs, an individual image does not contain all the information about the states. Therefore, our model builds on RSSM, which uses contextual states to propagate historical information. The notation comprises discrete time steps, contextual deterministic states, stochastic states, image observations, continuous actions, and domain labels. The model follows the mixed deterministic/stochastic dynamics below:

  • Transition model:

  • State model:

  • Observation model:

The transition model was implemented as a recurrent neural network. To train the model, we maximized the probability of a sequence of observations in the entire generative process:

(1)

Generally, this objective is intractable. We utilize the following evidence lower bound (ELBO) on the log-likelihood by introducing a posterior that infers the approximate stochastic states.

(2)

The posterior and the observation model are implemented as an encoder and a decoder, respectively, and both are conditioned on the domain labels, which allow them to change their behavior depending on the domain. As in CVAE, the domain-dependent information is thereby eliminated from the obtained deterministic and stochastic states.
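For concreteness, the following is a sketch of the mixed deterministic/stochastic dynamics and the domain-conditional ELBO, written in PlaNet-style notation with deterministic state $h_t$, stochastic state $s_t$, observation $o_t$, action $a_t$, and domain label $y$; these symbols are our assumption, carried over from RSSM, rather than a reproduction of the paper's exact equations.

```latex
% Assumed RSSM-style generative model with domain-conditional decoder
\begin{align*}
  \text{Transition model:}\quad  & h_t = f(h_{t-1}, s_{t-1}, a_{t-1}) \\
  \text{State model:}\quad       & s_t \sim p(s_t \mid h_t) \\
  \text{Observation model:}\quad & o_t \sim p(o_t \mid h_t, s_t, y) \\
  \text{Posterior (encoder):}\quad & s_t \sim q(s_t \mid h_t, o_t, y)
\end{align*}
% Corresponding evidence lower bound (cf. Eq. (2))
\begin{equation*}
  \ln p(o_{1:T} \mid a_{1:T}, y) \;\geq\; \sum_{t=1}^{T}
  \Big( \mathbb{E}_{q}\big[\ln p(o_t \mid h_t, s_t, y)\big]
  - \mathbb{E}\big[\mathrm{KL}\big(q(s_t \mid h_t, o_t, y)\,\|\,p(s_t \mid h_t)\big)\big] \Big)
\end{equation*}
```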

3.3 Domain and expert discriminators

We further introduce a domain discriminator and an expert discriminator. The role of the domain discriminator is to compute the domain confusion losses. Separate replay buffers hold the data from the agents, experts, and novices. The data from the novices are in the same domain as those from the experts but are non-optimal for the tasks. The loss function of the domain discriminator is as follows:

(3)

Here, we introduce a simple abbreviation of the expectation to avoid complexity:

(4)

Similarly, the loss function of the expert discriminator is denoted as follows:

(5)

The expert discriminator serves as an imitation reward function. It is trained to distinguish whether state-action pairs come from expert episodes or not.
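A minimal PyTorch-style sketch of how the two discriminator losses and the domain confusion loss could be computed from states inferred on batches drawn from the agent, expert, and novice buffers is given below. The function names, labeling convention, and use of binary cross-entropy are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def domain_discriminator_loss(d_domain, feat_agent, feat_expert_domain):
    """Train the domain discriminator to label states from the agent domain as 1
    and states from the expert domain (expert + novice data) as 0.
    States are detached so this loss only updates the discriminator."""
    logits_a = d_domain(feat_agent.detach())
    logits_e = d_domain(feat_expert_domain.detach())
    return (F.binary_cross_entropy_with_logits(logits_a, torch.ones_like(logits_a)) +
            F.binary_cross_entropy_with_logits(logits_e, torch.zeros_like(logits_e)))

def domain_confusion_loss(d_domain, feat_agent, feat_expert_domain):
    """Negative domain discriminator loss, added to the state space loss so the
    inferred states give the discriminator few clues about their domain."""
    logits_a = d_domain(feat_agent)
    logits_e = d_domain(feat_expert_domain)
    return -(F.binary_cross_entropy_with_logits(logits_a, torch.ones_like(logits_a)) +
             F.binary_cross_entropy_with_logits(logits_e, torch.zeros_like(logits_e)))

def expert_discriminator_loss(d_expert, sa_expert, sa_other):
    """Train the expert discriminator to separate expert state-action pairs from
    agent and novice state-action pairs; it later serves as the imitation reward."""
    logits_e = d_expert(sa_expert.detach())
    logits_o = d_expert(sa_other.detach())
    return (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e)) +
            F.binary_cross_entropy_with_logits(logits_o, torch.zeros_like(logits_o)))
```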

3.4 Training of DAC-SSM

Figure 2(b) displays a diagram of the training architecture of DAC-SSM. The dashed lines represent back-propagation paths. The model is trained by minimizing the state space losses together with the domain confusion losses:

(6)

where the coefficient of the domain confusion loss is a hyperparameter. The reward models are trained with the following losses:

(7)

The gradient of the expert discriminator losses is not propagated to DAC-SSM. The gradient of the domain discriminator losses is not propagated to DAC-SSM directly either, but the domain confusion losses are added to the state space losses. Thus, the obtained states become domain-agnostic as well as task- and dynamics-aware: they carry considerable information useful for control but few clues about the domain-dependent information. We prepared two types of datasets for each task: expert data, which are successful trajectories in the expert domain, and novice data, which are non-optimal trajectories in the expert domain. Agent data are collected during training.
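Putting the pieces together, the following is a hypothetical sketch of one training step with simple stand-in modules in place of the real RSSM-based network (cf. the discriminator losses sketched in Section 3.3). It only illustrates the gradient flow described here: the state space model is updated with the state space loss plus the weighted domain confusion loss, while the domain discriminator is updated separately on detached states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat_dim, lam = 40, 1.0                       # e.g. 32 deterministic + 8 stochastic dims
state_net = nn.GRUCell(64, feat_dim)          # stand-in for the state space model
d_domain = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

ssm_optim = torch.optim.Adam(state_net.parameters(), lr=1e-3)
disc_optim = torch.optim.Adam(d_domain.parameters(), lr=1e-3)

def domain_bce(feat_agent, feat_expert):
    """Domain classification loss: agent-domain states labeled 1, expert-domain states 0."""
    la, le = d_domain(feat_agent), d_domain(feat_expert)
    return (F.binary_cross_entropy_with_logits(la, torch.ones_like(la)) +
            F.binary_cross_entropy_with_logits(le, torch.zeros_like(le)))

def training_step(emb_agent, emb_expert, h_agent, h_expert):
    # Infer state features for an agent-domain batch and an expert-domain batch.
    feat_agent = state_net(emb_agent, h_agent)
    feat_expert = state_net(emb_expert, h_expert)

    # (a) Update the state space model with L_ssm + lam * L_confusion (cf. Eq. (6)).
    l_ssm = feat_agent.pow(2).mean() + feat_expert.pow(2).mean()  # placeholder for the
                                                                  # reconstruction/KL/reward terms
    l_conf = -domain_bce(feat_agent, feat_expert)                 # domain confusion loss
    ssm_optim.zero_grad()
    (l_ssm + lam * l_conf).backward()
    ssm_optim.step()

    # (b) Update the domain discriminator on detached states, so its gradient
    # is never propagated into DAC-SSM directly.
    disc_optim.zero_grad()
    domain_bce(feat_agent.detach(), feat_expert.detach()).backward()
    disc_optim.step()
```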

3.5 Planning algorithm

We used the cross entropy method (CEM) [2] to search for the best action sequence in the obtained state space. CEM is a robust population-based optimization algorithm that infers a distribution over action sequences that maximizes an objective. Because the objective is modeled as a function of the states and actions, the planner can operate purely in the low-dimensional latent space without generating images. Multiple types of rewards are used for the objective [14, 12] in the context of control as inference [19]. We define the distribution over the task-optimality as follows:

(8)

The distribution over the imitation-optimality is calculated using the expert discriminator:

(9)

We use the contextual states to calculate both rewards because contextual information is essential in POMDPs. Hence, the objective of the CEM is to maximize the probability of the task- and imitation-optimalities, as given below:

(10)

where the sum runs over the planning horizon of the CEM.
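The following is a minimal sketch of CEM planning in the latent space with the dual objective, under the assumption that the learned transition, reward, and expert discriminator models are available as callables on latent states. The horizon, iteration, and sample counts, and the 10:1 task/imitation weighting mentioned in Section 4.1, are illustrative defaults rather than the paper's exact settings.

```python
import torch

def latent_cem_plan(init_state, transition, reward_model, expert_disc, action_dim,
                    horizon=12, iters=10, candidates=1000, top_k=100, imit_weight=0.1):
    """Search for an action sequence in latent space: roll candidate sequences
    through the learned transition model, score them with task reward plus
    weighted imitation reward log D(s, a), and refit a Gaussian to the best."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    with torch.no_grad():
        for _ in range(iters):
            actions = mean + std * torch.randn(candidates, horizon, action_dim)
            state = init_state.expand(candidates, -1)        # init_state: (1, state_dim)
            score = torch.zeros(candidates)
            for t in range(horizon):
                state = transition(state, actions[:, t])
                task_r = reward_model(state, actions[:, t]).squeeze(-1)
                imit_r = torch.log(expert_disc(state, actions[:, t]).squeeze(-1) + 1e-6)
                score = score + task_r + imit_weight * imit_r  # dual objective
            best = score.topk(top_k).indices
            mean, std = actions[best].mean(dim=0), actions[best].std(dim=0)
    return mean[0]  # execute only the first action (MPC)
```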

Figure 3: We consider three tasks: Cup-Catch, Finger-Spin, and Connector-Insertion. We consider two agent domains for Finger-Spin: in one, the colors of the bodies and floor differ from the expert domain; in the other, the viewing angle differs as well. In Connector-Insertion, human fingers hold the connector in the expert domain, while robot fingers hold it in the agent domain.
Figure 4: Comparison of our proposed method with the baselines. The plots show the test performance over the number of collected episodes. The lines show the medians, and the shaded areas show the 5th to 95th percentiles over 4 seeds and 20 trajectories. The dashed lines show the average scores of the expert trajectories. We compare DAC-SSM with three types of reward functions: task, imitation, and dual, where dual means a weighted sum of the task and imitation rewards. DAC-SSM: with the domain confusion loss and the DC decoder. DA-SSM: with the domain confusion loss, without the DC decoder. DC-SSM: without the domain confusion loss, with the DC decoder. We used not only a DC decoder but also a DC encoder for the tilted-view Finger-Spin. PlaNet+: naive combination of the expert discriminator with RSSM.
Task                      | DAC/dual  | DAC/imitation | DAC/task | DA/dual  | DC/dual   | PlaNet  | PlaNet+
Cup-Catch                 | 728±223   | 304±323       | 375±371  | 233±350  | 788±149   | 470±398 | 479±359
Finger-Spin               | 405±42    | 488±50        | 130±73   | 190±108  | 419±41    | 157±73  | 124±91
Finger-Spin (tilted view) | 406±45    | 507±48        | 167±87   | 123±80   | 394±51    | 162±87  | 156±89
Connector-Insertion       | 40.2±29.1 | 50.5±25.0     | 0.4±4.0  | 0.0±0.0  | 40.9±26.7 | 0.7±3.4 | 2.1±8.1
Table 1: Mean MPC performance after 1,000 episodes (± one standard deviation); boldface indicates better results.

Task                | DC/dual   | DAC/dual  | DAC/dual  | DAC/dual  | DAC/dual  | DAC/dual
Finger-Spin         | 419±41    | 417±45    | 409±47    | 405±42    | 337±52    | 1±2
Connector-Insertion | 40.9±26.7 | 35.3±28.4 | 37.4±29.6 | 40.2±29.1 | 49.5±25.5 | 17.3±23.8
Table 2: Mean MPC performance after 1,000 episodes for different values of the domain confusion loss coefficient (± one standard deviation); each DAC/dual column uses a different coefficient. Boldface indicates better results.
[Figure 5 row labels: Ground Truth; Reconstruction with the labels of the agent domain; Reconstruction with the labels of the expert domain; Reconstruction without the domain labels]
Figure 5: Example image sequence (first row) and corresponding open-loop video predictions (second to last rows) for the Finger-Spin task. Columns 1-5 are context frames reconstructed from posterior samples; the remaining images were generated from open-loop prior samples. The second and third rows were reconstructed with expert and agent domain labels, respectively. The last row was reconstructed from the contextual states without domain labels; another decoder was trained separately for the reconstruction of these images, and the first image of the last row is reconstructed from contextual states initialized to zero.

4 Experiments

4.1 Environments and hyperparameters

We considered three tasks in the MuJoCo physics simulator: Cup-Catch, Finger-Spin, and Connector-Insertion. Figure 3 shows the expert and agent domains for each task. For Finger-Spin, we prepared two different agent domains: one has different colors of objects and floors compared to the expert domain, and the other additionally has a different viewing angle. It is difficult to train control policies using only task rewards because all tasks here have sparse rewards. Cup-Catch and Finger-Spin are instances of the DeepMind Control Suite [30]. We also built a new task, Connector-Insertion, in which the agent attempts to insert a connector into a socket. Constant rewards are obtained while the connector is in the socket. The position and angle of the connector and socket were initialized with random values at the start of each episode. In this task, we added a constant bias to the action that moves the connector upward in the figure, which is equivalent to introducing the domain knowledge that the socket lies in that direction.
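As an illustration of the action bias described above, a hypothetical wrapper is sketched below; the biased axis and bias magnitude are made-up values, since the paper does not state them.

```python
import numpy as np

class UpwardBiasWrapper:
    """Adds a constant bias to the action component that moves the connector
    toward the socket, encoding the domain knowledge that the socket lies in
    that direction. Axis index and magnitude are illustrative only."""
    def __init__(self, env, bias_axis=1, bias=0.1):
        self.env = env
        self.bias_axis = bias_axis
        self.bias = bias

    def step(self, action):
        action = np.asarray(action, dtype=np.float32).copy()
        action[self.bias_axis] += self.bias   # constant bias toward the socket
        return self.env.step(action)

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)
```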

The contextual state and stochastic state sizes were 32 and 8, respectively, for all experiments. A small latent size is sufficient for DAC-SSM because domain-related information is eliminated from the latent space. The decoder refers to the domain labels to reconstruct domain-specific observations: the domain label was simply concatenated to the deterministic and stochastic states and fed into the domain-conditional (DC) decoder. We used not only the DC decoder but also the DC encoder for the tilted-view Finger-Spin; the DC encoder was implemented by training two separate encoders and switching between them based on the domain label. We used batches of 40 sequence chunks of 40 steps each for training. Apart from the above, we adopted the same hyperparameters and architectures as PlaNet for the state space model. We implemented both the expert and domain discriminators as two fully connected layers of size 64 with ReLU activations. The domain confusion loss coefficient is 1.0 unless otherwise noted. For planning, we used CEM with a short planning horizon, a fixed number of optimization iterations and candidate samples, and refitting to the best candidates. The action repeats were 4, 2, and 800 for Cup-Catch, Finger-Spin, and Connector-Insertion, respectively. The action repeat for Connector-Insertion was extremely large because we set the MuJoCo simulation timestep to a very small value; otherwise, objects easily pass through each other when they come into forceful contact. We evaluated three types of objectives for planning: dual, imitation, and task rewards. The dual reward is a weighted sum of the task and imitation rewards with a ratio of 10:1.
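For reference, the hyperparameters stated in this subsection can be collected into a single configuration; the dictionary below is a hypothetical summary for readability, and values not stated in the text (such as the CEM horizon and sample counts) are omitted rather than guessed.

```python
# Hypothetical summary of the hyperparameters stated in Section 4.1.
DAC_SSM_CONFIG = {
    "contextual_state_size": 32,
    "stochastic_state_size": 8,
    "batch": {"sequence_chunks": 40, "chunk_length": 40},
    "discriminators": {"layers": 2, "units": 64, "activation": "relu"},
    "domain_confusion_coefficient": 1.0,   # unless otherwise noted
    "dual_reward_weights": {"task": 10, "imitation": 1},
    "action_repeat": {"Cup-Catch": 4, "Finger-Spin": 2, "Connector-Insertion": 800},
}
```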

4.2 Applying state representation to IL with domain shifts

Figure 4 and Table 1 compare DAC-SSM using dual rewards (DAC/dual) with a baseline using an existing SRL method (PlaNet/task) and a naive combination of the expert discriminator with that baseline (PlaNet+). DAC/dual achieved much higher performance than the two baselines on all tasks. This is because the domain-aware state representation of PlaNet does not help the agents achieve higher performance via imitation learning under the domain shifts. We also compared DAC-SSM variants using dual rewards (DAC/dual), imitation rewards (DAC/imitation), and task rewards (DAC/task). Except for Cup-Catch, DAC/imitation achieved the best performance, because the planning horizon is too short for Finger-Spin and Connector-Insertion. We further trained our proposed model (DAC/dual) as well as versions with domain-adversarial training but without domain-conditional encoders/decoders (DA/dual), and with domain-conditional encoders/decoders but without domain-adversarial training (DC/dual). The performance of DAC/dual and DC/dual was almost the same, and that of DA/dual was much lower. In this experimental setting, the domain-adversarial training was not effective because the domain confusion loss coefficient was too small. Table 2 shows that, with a larger coefficient, DAC/dual achieved higher performance than DC/dual for Connector-Insertion. These results show that the states obtained with DAC-SSM help the agents perform effective imitation learning under domain shifts.

4.3 Reconstruction from State Representation

Figure 5 shows a sequence of ground-truth examples and images reconstructed from the state representation obtained with DAC-SSM for Finger-Spin. The first five columns show context frames reconstructed from posterior samples, and the remaining images were generated from open-loop prior samples. The images in the second and third rows were reconstructed from a sequence of deterministic and stochastic states with a domain label via the DC decoder. The joint angles of the robotic arm and the target object were successfully reconstructed from the states, whereas the domain-dependent information (the colors of the floor and object) depended on the domain labels. The images in the last row were reconstructed from the contextual states without domain labels, using another decoder that was trained separately from our model. The joint angles were successfully reconstructed, whereas the colors appeared to be a mixture of the two domains. These results show that the states obtained with DAC-SSM contain control-relevant information such as the joint angles, but not domain-dependent information such as the colors, which is unrelated to the control. In other words, we successfully acquired domain-agnostic and task- and dynamics-aware state representation via DAC-SSM.

5 Conclusion and Discussions

We showed that domain-agnostic and task- and dynamics-aware state representation can be obtained via DAC-SSM. To obtain such state representation, we introduced domain-adversarial training and domain-conditional encoders/decoders into a recent task- and dynamics-aware sequential state space model. Moreover, we experimentally evaluated the MPC performance via IL with large domain shifts for continuous-control sparse-reward tasks in simulators. The state representation from DAC-SSM helped the agents achieve performance comparable to the expert. The existing SRL method failed to remove domain-dependent information from the states, and thus the agents could not perform effective IL with large domain shifts. We conclude that domain-agnostic and control-aware states are essential for IL with large domain shifts, and such states can be obtained via DAC-SSM.

A remaining question is whether DAC-SSM is applicable to larger and/or different types of domain shifts, e.g., modality variations of the data. Since the domain confusion loss coefficient is task-dependent, as shown in Table 2, we expect that better state representation can be obtained by actively varying the coefficient. Acquiring task-agnostic states to achieve a universal controller is also an appealing direction for future work. Learning from human demonstration is a challenging but interesting direction as well; it includes obtaining appropriate state representation from expert data without action data. Implementation for real robotic tasks is another important direction. Acquiring fully stochastic state representation is necessary for real-world tasks because the control systems of real robots have much larger uncertainty than simulations.

Acknowledgments

Most of the experiments were conducted in ABCI (AI Bridging Cloud Infrastructure), built by the National Institute of Advanced Industrial Science and Technology, Japan.

References

  • [1] N. Baram, O. Anschel, I. Caspi, and S. Mannor (2017) End-to-end differentiable adversarial imitation learning. In ICML, Cited by: §2.
  • [2] K. Chua, R. Calandra, R. McAllister, and S. Levine (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NIPS, Cited by: §3.5.
  • [3] T. Gangwani, J. Lehman, Q. Liu, and J. Peng (2019) Learning belief representations for imitation learning in pomdps. In UAI, Cited by: §2.
  • [4] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. S. Lempitsky (2015) Domain-adversarial training of neural networks. In J. Mach. Learn. Res., Vol. 17, pp. 59:1–59:35. Cited by: §2.
  • [5] C. E. Garcia, D. M. Prett, and M. Morari (1989) Model predictive control: theory and practice—a survey. In Automatica, Cited by: §1.
  • [6] A. Gonzalez-Garcia, J. van de Weijer, and Y. Bengio (2018) Image-to-image translation for cross-domain disentanglement. In NIPS, Cited by: §2.
  • [7] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §2.
  • [8] A. Gupta, C. Eppner, S. Levine, and P. Abbeel (2016) Learning dexterous manipulation for a soft robotic hand from human demonstrations. In IROS, Cited by: §2.
  • [9] D. Ha and J. Schmidhuber (2018) Recurrent world models facilitate policy evolution. In NIPS, Cited by: §1.
  • [10] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson (2018) Learning latent dynamics for planning from pixels. In arXiv, Cited by: §1, §1.
  • [11] J. Ho and S. Ermon (2016) Generative adversarial imitation learning. In NIPS, Cited by: §1, §2.
  • [12] R. Kaushik, K. Chatzilygeroudis, and J. Mouret (2018) Multi-objective model-based policy search for data-efficient learning with sparse rewards. In CoRL, Cited by: §3.5.
  • [13] D. Kingma, D. Rezende, S. Mohamed, and M. Welling (2014) Semi-supervised learning with deep generative models. In NIPS, Cited by: §1, §2.
  • [14] A. Kinose and T. Taniguchi (2019) Integration of imitation learning using gail and reinforcement learning using task-achievement rewards via probabilistic generative model. In arXiv, Cited by: §3.5.
  • [15] I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson (2018) Discriminator-actor-critic: addressing sample inefficiency and reward bias in adversarial imitation learning. In ICLR, Cited by: §2.
  • [16] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine (2019) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. In arXiv, Cited by: §1, §2.
  • [17] Y. Lee, E. S. Hu, Z. Yang, and J. J. Lim (2019) To follow or not to follow: selective imitation learning from observations. In CoRL, Cited by: §2.
  • [18] T. Lesort, N. Díaz-Rodríguez, J. Goudou, and D. Filliat (2018) State representation learning for control: an overview. In Neural Networks, Vol. 108. Cited by: §1.
  • [19] S. Levine (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. In arXiv, Cited by: §3.5.
  • [20] Y. Li, J. Song, and S. Ermon (2017) InfoGAIL: interpretable imitation learning from visual demonstrations. In NIPS, Cited by: §2.
  • [21] M. Okada and T. Taniguchi (2019) Variational inference mpc for bayesian model-based reinforcement learning. In CoRL, Cited by: §1.
  • [22] S. Schaal (1999) Is imitation learning the route to humanoid robots?. In Trends in Cognitive Sciences, Vol. 3. Cited by: §2.
  • [23] A. Sharma, M. Sharma, N. Rhinehart, and K. M. Kitani (2018) Directed-info gail: learning hierarchical policies from unsegmented demonstrations using directed information. In arXiv, Cited by: §2.
  • [24] M. Sieb, Z. Xian, A. Huang, O. Kroemer, and K. Fragkiadaki (2019) Graph-structured visual imitation. In CoRL, Cited by: §2.
  • [25] B. C. Stadie, P. Abbeel, and I. Sutskever (2017) Third-person imitation learning. In ICLR, Cited by: §2.
  • [26] E. Todorov, T. Erez, and Y. Tassa (2012) MuJoCo: a physics engine for model-based control. In IROS, Cited by: §1.
  • [27] F. Torabi, G. Warnell, and P. Stone (2019) Recent advances in imitation learning from observation. In IJCAI, Cited by: §1.
  • [28] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. In arXiv, Cited by: §1, §2.
  • [29] A. Wang, T. Kurutach, K. Liu, P. Abbeel, and A. Tamar (2019) Learning robotic manipulation through visual planning and acting. In RSS, Cited by: §1.
  • [30] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. de Las Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, T. Lillicrap, and M. Riedmiller (2018) DeepMind control suite. In arXiv, Cited by: §4.1.