State representation learning (SRL) has been studied to obtain compact and expressive representations of robot control tasks from high-dimensional sensor data such as images. Appropriate state representation enables agents to achieve high performance on discrete and continuous control tasks, from games to real robots. Sequential state space models have been shown to improve the performance and sample efficiency of robot control tasks in partially observable Markov decision processes (POMDPs). The deep planning network (PlaNet) is a methodology for planning in the latent space that is trained with a task- and dynamics-aware state space model called a recurrent state space model (RSSM). RSSM jointly optimizes the state inference, observation reconstruction, forward dynamics, and reward models. PlaNet performs model predictive control (MPC) [5, 21] for planning in the state space obtained by RSSM.
Acquiring domain-agnostic states is essential for efficient imitation learning (IL). Without domain-agnostic states, IL is hampered by domain-dependent information that is useless for control. In the context of IL, it is natural to assume that the data from experts and agents have domain shifts. However, current SRL methods [10, 16] fail to remove such disturbances from the states when the domain shifts are large. In IL, a discriminator serves as an imitation reward function by distinguishing the state-action pairs of the experts from those of the agents. If the obtained states are not domain-agnostic, the discriminator is distracted by domain-dependent information that is eye-catching but unrelated to the control and tasks. As a result, the imitation reward becomes unsuitable for the control, and IL is disrupted. Figure 1 shows examples of domain shifts between the data from an expert and an agent. We define domain shifts as control-irrelevant changes of the data, such as appearance: e.g., colors, textures, backgrounds, viewing angles, and objects unrelated to the control. Domain shifts are caused, for example, by changes in camera settings, the location of data collection, or the appearance of the robot. They also arise when objects unseen in one domain appear in the other. For example, an operator will be present in the expert images when he or she provides demonstrations via the direct teaching mode of a robot; in this case, the presence of the operator in the images causes the domain shifts.
To overcome this problem, we propose a domain-agnostic and task- and dynamics-aware SRL model, called a domain-adversarial and -conditional state space model (DAC-SSM). DAC-SSM builds on RSSM and is trained with a domain discriminator and an expert discriminator. To remove the domain-dependent information from the states, (1) the state space is trained adversarially against the domain discriminator, and (2) the encoder and decoder of DAC-SSM are conditioned on domain labels. The domain discriminator is trained to identify which domain the acquired states belong to. The negative loss function of the domain discriminator, called the domain confusion loss, is added to the loss function of the state space. To reduce the domain confusion loss, the states are trained to be domain-agnostic: DAC-SSM learns to infer states that give the domain discriminator few clues for distinguishing their domain. Moreover, the states are disentangled by conditioning the encoder and decoder on domain labels, as in conditional variational autoencoders (CVAE). Owing to this disentanglement, the domain-dependent information is eliminated from the state representation. Because DAC-SSM jointly optimizes the state inference, observation reconstruction, forward dynamics, and reward models, the obtained states are task- and dynamics-aware as well as domain-agnostic. To the best of our knowledge, no previous study has combined domain adversarial training with SRL for control tasks.
The main contribution of this paper is the implementation and experiments demonstrating that the state representation obtained via DAC-SSM is suitable for IL with large domain shifts. We compared DAC-SSM to existing SRL methods in terms of MPC performance via IL on continuous control sparse reward tasks in the MuJoCo physics simulator. The agents using DAC-SSM achieved performance comparable to the expert and more than twice that of the baselines.
2 Related studies
State representation learning for POMDPs
Sequential state space models have been studied to solve tasks in POMDPs. Lee et al. proposed a sequential latent variable model that propagates historical information from a control system via contextual stochastic states, and jointly optimized the actor and critic using the state space model. Gangwani et al. jointly optimized the expert discriminator together with policy, forward and inverse dynamics, and action models to obtain task- and dynamics-aware state representation. Their state representation, however, is not domain-agnostic.
Domain-agnostic feature representation
Domain-agnostic feature representation has been obtained by domain-adversarial training or by disentangling the latent space. Domain-adversarial training is a simple and effective approach to extracting feature representation that is unrelated to the domains of the data. Tzeng et al. added the domain confusion loss to the loss function of the feature extractor. Ganin et al. introduced a gradient reversal layer to back-propagate the negative gradient of the domain discriminator loss to the feature extractor. CVAE is a well-known method for disentangling domain-dependent information from the latent space: the encoder and decoder are made conditional on domain labels to obtain domain-agnostic latent variables.
Imitation learning (IL)
IL is a powerful and widely accepted approach that makes agents mimic expert behavior by using a set of task demonstrations. Ho and Ermon proposed an IL framework called Generative Adversarial Imitation Learning (GAIL). In GAIL, imitation rewards are computed by the expert discriminator, which distinguishes whether a state-action pair is generated by the agent policy or drawn from the expert demonstrations. They formulated the joint process of reinforcement learning and inverse reinforcement learning as a two-player game between the policy and the discriminator, analogous to Generative Adversarial Networks. GAIL has been shown to solve complex high-dimensional continuous control tasks [15, 1, 20, 23].
IL with the domain shifts
Using common measurable features is one popular approach. For example, keypoints of objects and/or tracked marker positions [8, 17] are used as the states. With this approach, existing IL techniques can be applied directly without considering the domain shifts. However, such features are not always available. Stadie et al. added the domain confusion loss to the expert discriminator to make it domain-agnostic. By computing the imitation reward with this discriminator, they achieved IL with large domain shifts. Their approach, however, does not include SRL.
3 Proposed Method
3.1 Concept of proposed method
Figure 2 (a) shows the concept of DAC-SSM, in which the expert discriminator serves as an imitation reward function. Because DAC-SSM builds a domain-agnostic state space, higher rewards are provided to the agents for expert-like behavior. In contrast, the existing method builds a domain-aware state space, so the expert discriminator easily distinguishes the states of the agents even when their behavior is expert-like.
3.2 State space model
In POMDPs, an individual image does not contain all the information about the states. Therefore, our model builds on RSSM, which has contextual states to propagate historical information. We use the following notation: discrete time steps $t$, contextual deterministic states $h_t$, stochastic states $s_t$, image observations $o_t$, continuous actions $a_t$, and domain labels $d$. The model follows the mixed deterministic/stochastic dynamics below:
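Following the standard RSSM formulation of PlaNet (the exact equations below are a reconstruction; the symbols $h_t$, $s_t$, $o_t$, $a_t$, $r_t$, and domain label $d$ are assumed), the dynamics can be written as:

\[
\begin{aligned}
\text{deterministic transition:}\quad & h_t = f(h_{t-1}, s_{t-1}, a_{t-1}),\\
\text{stochastic state:}\quad & s_t \sim p(s_t \mid h_t),\\
\text{observation model:}\quad & o_t \sim p(o_t \mid h_t, s_t, d),\\
\text{reward model:}\quad & r_t \sim p(r_t \mid h_t, s_t),
\end{aligned}
\]

where $f$ is a recurrent neural network (a GRU in PlaNet).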
In general, this objective is intractable. We therefore use the following evidence lower bound (ELBO) on the log-likelihood, introducing a posterior to approximately infer the stochastic states.
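One common form of this bound, reconstructed under the assumption that the model follows PlaNet with a domain-conditional encoder $q(s_t \mid h_t, o_t, d)$ and decoder $p(o_t \mid h_t, s_t, d)$:

\[
\ln p(o_{1:T} \mid a_{1:T}, d) \;\ge\; \sum_{t=1}^{T} \Big( \mathbb{E}_{q}\big[\ln p(o_t \mid h_t, s_t, d)\big] \;-\; \mathbb{E}_{q}\big[\mathrm{KL}\big[\,q(s_t \mid h_t, o_t, d)\,\|\,p(s_t \mid h_t)\,\big]\big] \Big).
\]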
The posterior and the observation model are implemented as an encoder and a decoder, respectively. Both are conditioned on the domain labels, which help them change their behavior depending on the domain. As in CVAE, the domain-dependent information is thereby eliminated from the obtained deterministic and stochastic states.
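As a minimal illustrative sketch (the state sizes follow Section 4.1; the function names are hypothetical), conditioning the decoder on the domain can be as simple as concatenating a one-hot domain label to the states:

```python
import numpy as np

def one_hot(domain: int, n_domains: int = 2) -> np.ndarray:
    """One-hot encoding of a domain label (expert vs. agent domain)."""
    v = np.zeros(n_domains)
    v[domain] = 1.0
    return v

def dc_decoder_input(h: np.ndarray, s: np.ndarray, domain: int) -> np.ndarray:
    """Concatenate deterministic state h, stochastic state s, and a
    one-hot domain label before feeding the domain-conditional decoder."""
    return np.concatenate([h, s, one_hot(domain)])

# Sizes from the paper: contextual state 32, stochastic state 8, 2 domains.
x = dc_decoder_input(np.zeros(32), np.zeros(8), domain=1)
assert x.shape == (42,)  # 32 + 8 + 2
```

The same pattern applies to a domain-conditional encoder, although the paper implements the DC encoder as two separate encoders switched by the domain label.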
3.3 Domain and expert discriminators
We further introduce a domain discriminator and an expert discriminator. The domain discriminator is used to compute the domain confusion loss. We maintain replay buffers for the data from the agents, experts, and novices. The data from the novices are in the same domain as those from the experts but are non-optimal for the tasks. The loss function of the domain discriminator is as follows:
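A plausible reconstruction of this loss is a standard binary cross-entropy that labels states by the domain they come from (the buffer symbols $\mathcal{B}_{\mathrm{agent}}$, $\mathcal{B}_{\mathrm{expert}}$, $\mathcal{B}_{\mathrm{novice}}$ and the discriminator symbol $D_{\mathrm{dom}}$ are assumptions):

\[
\mathcal{L}_{\mathrm{dom}} = -\,\mathbb{E}_{(h_t, s_t) \sim \mathcal{B}_{\mathrm{agent}}}\big[\ln D_{\mathrm{dom}}(h_t, s_t)\big] \;-\; \mathbb{E}_{(h_t, s_t) \sim \mathcal{B}_{\mathrm{expert}} \cup \mathcal{B}_{\mathrm{novice}}}\big[\ln\big(1 - D_{\mathrm{dom}}(h_t, s_t)\big)\big].
\]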
Here, we introduce a simple abbreviation of the expectation to avoid complexity:
Similarly, the loss function of the expert discriminator is denoted as follows:
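A GAIL-style reconstruction (the symbols $\mathcal{B}$ for the replay buffers and $D_{\mathrm{exp}}$ for the expert discriminator are assumptions; novice data count as non-expert because they are non-optimal):

\[
\mathcal{L}_{\mathrm{exp}} = -\,\mathbb{E}_{(h_t, s_t, a_t) \sim \mathcal{B}_{\mathrm{expert}}}\big[\ln D_{\mathrm{exp}}(h_t, s_t, a_t)\big] \;-\; \mathbb{E}_{(h_t, s_t, a_t) \sim \mathcal{B}_{\mathrm{agent}} \cup \mathcal{B}_{\mathrm{novice}}}\big[\ln\big(1 - D_{\mathrm{exp}}(h_t, s_t, a_t)\big)\big].
\]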
The expert discriminator serves as an imitation reward function. It is trained to distinguish whether state-action pairs come from episodes of the experts or not.
3.4 Training of DAC-SSM
Figure 2 (b) displays a diagram of the training architecture of DAC-SSM. The dashed lines represent back-propagation paths. The model is trained by minimizing the state space losses together with the domain confusion losses:
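One plausible form of this combined objective, with $\lambda$ as the weighting hyperparameter and the domain confusion loss defined as the negative domain discriminator loss (the symbols $\mathcal{L}_{\mathrm{SSM}}$, $\mathcal{L}_{\mathrm{conf}}$, and $\mathcal{L}_{\mathrm{dom}}$ are assumptions for illustration):

\[
\mathcal{L} = \mathcal{L}_{\mathrm{SSM}} + \lambda\,\mathcal{L}_{\mathrm{conf}}, \qquad \mathcal{L}_{\mathrm{conf}} = -\,\mathcal{L}_{\mathrm{dom}},
\]

where $\mathcal{L}_{\mathrm{SSM}}$ is the negative ELBO of the state space model.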
where $\lambda$ is a hyperparameter weighting the domain confusion loss. The reward models are trained with their own losses:
The gradient of the expert discriminator loss is not propagated to DAC-SSM. The gradient of the domain discriminator loss is not propagated to DAC-SSM directly either; instead, the domain confusion loss is added to the state space loss. Thus, the obtained states become domain-agnostic as well as task- and dynamics-aware: they carry considerable information useful for control (task- and dynamics-aware) but few clues about the domain-dependent information (domain-agnostic). We prepared two types of datasets for each task: expert and novice data. Expert data are successful trajectories for the tasks in the expert domain, whereas novice data are non-optimal trajectories in the expert domain. Agent data are collected during training.
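The gradient routing described above can be summarized in a small sketch (all names are hypothetical; the loss values are toy scalars standing in for the real model and discriminator losses):

```python
def train_step(model_loss, domain_disc_loss, expert_disc_loss, lam=1.0):
    """Return the scalar objective each module minimizes.

    - The state space model minimizes its own loss plus lam times the
      domain confusion loss (the negative domain discriminator loss).
    - Each discriminator minimizes only its own loss; in the real
      implementation their gradients are not propagated into the
      state space model.
    """
    confusion_loss = -domain_disc_loss
    ssm_objective = model_loss + lam * confusion_loss
    return {
        "ssm": ssm_objective,
        "domain_disc": domain_disc_loss,
        "expert_disc": expert_disc_loss,
    }

obj = train_step(model_loss=2.0, domain_disc_loss=0.5, expert_disc_loss=0.7)
assert obj["ssm"] == 1.5  # 2.0 + 1.0 * (-0.5)
```

The key point is that a well-trained domain discriminator (low `domain_disc_loss`) produces a large confusion loss for the model, pushing the states toward domain-agnosticism.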
3.5 Planning algorithm
We used the cross entropy method (CEM) to search for the best action sequence in the obtained state space. CEM is a robust population-based optimization algorithm that infers a distribution over action sequences maximizing an objective. Because the objective is modeled as a function of the states and actions, the planner can operate purely in the low-dimensional latent space without generating images. Multiple types of rewards are used for the objective [14, 12] in the context of control as inference. We define the distribution over the task-optimality as follows:
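In the control-as-inference framework, the task-optimality variable is typically tied to the task reward as follows (a reconstruction; the symbols $\mathcal{O}^{\mathrm{task}}_t$ and $r^{\mathrm{task}}$ are assumptions):

\[
p\big(\mathcal{O}^{\mathrm{task}}_t = 1 \,\big|\, h_t, s_t, a_t\big) = \exp\big(r^{\mathrm{task}}(h_t, s_t, a_t)\big).
\]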
The distribution over the imitation-optimality is calculated by using the expert discriminator:
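A plausible form uses the expert discriminator output as a GAIL-style imitation reward (the symbols $\mathcal{O}^{\mathrm{imit}}_t$, $r^{\mathrm{imit}}_t$, and $D_{\mathrm{exp}}$ are assumptions):

\[
p\big(\mathcal{O}^{\mathrm{imit}}_t = 1 \,\big|\, h_t, s_t, a_t\big) = \exp\big(r^{\mathrm{imit}}_t\big), \qquad r^{\mathrm{imit}}_t = \ln D_{\mathrm{exp}}(h_t, s_t, a_t).
\]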
We use the contextual states to calculate both rewards because contextual information is essential in POMDPs. Hence, the objective of the CEM is to maximize the probability of the task- and imitation-optimalities, as given below:
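This objective can be reconstructed as maximizing the log-probability of both optimality variables over the action sequence, where $H$ denotes the planning horizon (the exact form is an assumption):

\[
\max_{a_{1:H}} \; \sum_{t=1}^{H} \Big( \ln p\big(\mathcal{O}^{\mathrm{task}}_t = 1 \,\big|\, h_t, s_t, a_t\big) + \ln p\big(\mathcal{O}^{\mathrm{imit}}_t = 1 \,\big|\, h_t, s_t, a_t\big) \Big),
\]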
where $H$ is the planning horizon of the CEM.
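The CEM loop itself can be sketched as follows (a minimal generic implementation on a toy objective, not the paper's code; the iteration and population counts are illustrative):

```python
import numpy as np

def cem_plan(objective, horizon, action_dim, iterations=10,
             candidates=1000, top_k=100, rng=None):
    """Cross entropy method over action sequences.

    Repeatedly samples candidate action sequences from a Gaussian,
    evaluates the objective, and refits the Gaussian to the top_k best
    candidates, returning the final mean as the planned sequence.
    """
    rng = rng or np.random.default_rng(0)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        samples = rng.normal(mean, std, size=(candidates, horizon, action_dim))
        scores = np.array([objective(seq) for seq in samples])
        elite = samples[np.argsort(scores)[-top_k:]]  # best candidates
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Toy objective: prefer action sequences close to 0.5 at every step.
plan = cem_plan(lambda seq: -np.sum((seq - 0.5) ** 2), horizon=4, action_dim=1)
assert plan.shape == (4, 1)
```

In the paper's setting, the objective would instead roll out the learned dynamics from the current contextual state and sum the task and imitation log-optimalities over the horizon.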
4 Experiments

In the result figures, the reported spread represents one standard deviation.
4.1 Environments and hyperparameters
We considered three tasks in the MuJoCo physics simulator: Cup-Catch, Finger-Spin, and Connector-Insertion. Figure 3 shows the expert and agent domains for each task. For Finger-Spin, we prepared two different agent domains: one has different colors of the objects and floor compared to the expert domain, and the other additionally has a different viewing angle. Because all tasks here are of the sparse reward type, it is difficult to train control policies using only task rewards. Cup-Catch and Finger-Spin are instances of the DeepMind Control Suite. We also built a new task, Connector-Insertion, in which the agent attempts to insert a connector into a socket. Constant rewards are obtained while the connector is in the socket. The position and angle of the connector and socket were initialized with random values at the start of each episode. In this task, we added a constant bias to the action that moves the connector upward in the image, which is equivalent to introducing the domain knowledge that the socket is located in the upward direction.
The contextual and stochastic state sizes were 32 and 8 for all experiments. A small latent size is sufficient for DAC-SSM because domain-related information is eliminated from the latent space. The decoder refers to the domain labels to reconstruct domain-specific observations: the domain label was simply concatenated to the contextual and stochastic states and fed into the domain-conditional (DC) decoder. We used not only the DC decoder but also a DC encoder for the Finger-Spin task with the tilted view; the DC encoder was implemented by training two separate encoders and switching between them based on the domain label. We used batches of 40 sequence chunks of 40 steps each for training. Except as mentioned above, we adopted the same hyperparameters and architectures as PlaNet for the state space model. We implemented both the expert and domain discriminators as two fully connected layers of size 64 with ReLU activations. The domain confusion loss coefficient is 1.0 unless otherwise noted. For planning, we used CEM with a short planning horizon, a fixed number of optimization iterations and candidate samples, and refitting to the best candidates. The action repeats were 4, 2, and 800 for Cup-Catch, Finger-Spin, and Connector-Insertion, respectively. The action repeat for Connector-Insertion was extremely large because we set the simulation timestep of MuJoCo to a very small value; otherwise, objects easily pass through each other when they come into forceful contact. We evaluated three types of objectives for planning: dual, imitation, and task rewards. The dual rewards are a weighted sum of task and imitation rewards with a ratio of 10:1.
4.2 Applying state representation to IL with domain shifts
Figure 4 and Table 1 compare DAC-SSM using dual rewards (DAC/dual) to a baseline existing SRL method (PlaNet/task) and a naive combination of the expert discriminator with the baseline (PlaNet+). DAC/dual achieved much higher performance on all tasks than the two baselines because the domain-aware state representation of PlaNet does not help the agents achieve higher performance via imitation learning under the domain shifts. We also compared versions of DAC-SSM using dual rewards (DAC/dual), imitation rewards (DAC/imitation), and task rewards (DAC/task). Except for Cup-Catch, DAC/imitation achieved the best performance; this is because the planning horizon length is too short for Finger-Spin and Connector-Insertion. We further trained our proposed model (DAC/dual) as well as versions with domain adversarial training but without domain-conditional encoders/decoders (DA/dual), and with domain-conditional encoders/decoders but without domain adversarial training (DC/dual). The performance of DAC/dual and DC/dual was almost the same, and that of DA/dual was much lower. In this experimental setting, the domain adversarial training was not effective because the domain confusion loss coefficient was too small. Table 2 shows that DAC/dual achieved higher performance than DC/dual with a larger domain confusion loss coefficient for Connector-Insertion. These results show that the states obtained with DAC-SSM help the agents achieve effective imitation learning under the domain shifts.
4.3 Reconstruction from State Representation
Figure 5 shows a sequence of ground-truth examples and images reconstructed from the state representation obtained with DAC-SSM for Finger-Spin. The first 5 columns show context frames reconstructed from posterior samples; the remaining images were generated from open-loop prior samples. The images in the second and third rows were reconstructed from the same sequence of states with each domain label via the DC decoder. The joint angles of the robotic arm and the target object were successfully reconstructed from the states, whereas the domain-dependent information (the colors of the floor and the object) depended on the domain labels. The images in the last row were reconstructed from the contextual states without domain labels, using another decoder trained separately from our model. The joint angles were successfully reconstructed, whereas the colors appeared to be a mixture of the two domains. These results show that the states obtained with DAC-SSM carry control-dependent information such as the joint angles but not domain-dependent information such as the colors, which is unrelated to the control. In other words, we successfully acquired the domain-agnostic and task- and dynamics-aware state representation via DAC-SSM.
5 Conclusion and Discussions
We showed that domain-agnostic and task- and dynamics-aware state representation can be obtained via DAC-SSM. To obtain such representation, we introduced domain adversarial training and domain-conditional encoders/decoders into a recent task- and dynamics-aware sequential state space model. We experimentally evaluated the MPC performance via IL with large domain shifts on continuous control sparse reward tasks in simulation. The state representation from DAC-SSM helped the agents achieve performance comparable to the expert. The existing SRL failed to remove domain-dependent information from the states, and thus the agents could not perform effective IL under large domain shifts. We conclude that domain-agnostic and control-aware states are essential for IL with large domain shifts, and that such states are obtained via DAC-SSM.
A remaining question is whether DAC-SSM is applicable to larger and/or different types of domain shifts, e.g., modality variants of the data. Since the domain confusion loss coefficient is task dependent, as shown in Table 2, better state representation may be obtained by actively varying the coefficient. Acquiring task-agnostic states to achieve a universal controller is also an appealing direction for future work. Learning from human demonstrations is a challenging but interesting direction as well; it includes obtaining appropriate state representation from expert data without action data. Implementation on real robotic tasks is another important direction: acquiring fully stochastic state representation will be necessary there, because the control system of a real robot has much larger uncertainty than a simulation.
Most of the experiments were conducted in ABCI (AI Bridging Cloud Infrastructure), built by the National Institute of Advanced Industrial Science and Technology, Japan.
- (2017) End-to-end differentiable adversarial imitation learning. In ICML.
- (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NIPS.
- (2019) Learning belief representations for imitation learning in POMDPs. In UAI.
- (2015) Domain-adversarial training of neural networks. J. Mach. Learn. Res., Vol. 17, pp. 59:1–59:35.
- (1989) Model predictive control: theory and practice—a survey. Automatica.
- (2018) Image-to-image translation for cross-domain disentanglement. In NIPS.
- (2014) Generative adversarial nets. In NIPS.
- (2016) Learning dexterous manipulation for a soft robotic hand from human demonstrations. In IROS.
- (2018) Recurrent world models facilitate policy evolution. In NIPS.
- (2018) Learning latent dynamics for planning from pixels. arXiv preprint.
- (2016) Generative adversarial imitation learning. In NIPS.
- (2018) Multi-objective model-based policy search for data-efficient learning with sparse rewards. In CoRL.
- (2014) Semi-supervised learning with deep generative models. In NIPS.
- (2019) Integration of imitation learning using GAIL and reinforcement learning using task-achievement rewards via probabilistic generative model. arXiv preprint.
- (2018) Discriminator-actor-critic: addressing sample inefficiency and reward bias in adversarial imitation learning. In ICLR.
- (2019) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. arXiv preprint.
- (2019) To follow or not to follow: selective imitation learning from observations. In CoRL.
- (2018) State representation learning for control: an overview. Neural Networks, Vol. 108.
- (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. arXiv preprint.
- (2017) InfoGAIL: interpretable imitation learning from visual demonstrations. In NIPS.
- (2019) Variational inference MPC for Bayesian model-based reinforcement learning. In CoRL.
- (1999) Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, Vol. 3.
- (2018) Directed-Info GAIL: learning hierarchical policies from unsegmented demonstrations using directed information. arXiv preprint.
- (2019) Graph-structured visual imitation. In CoRL.
- (2017) Third-person imitation learning. In ICLR.
- (2012) MuJoCo: a physics engine for model-based control. In IROS.
- (2019) Recent advances in imitation learning from observation. In IJCAI.
- (2014) Deep domain confusion: maximizing for domain invariance. arXiv preprint.
- (2019) Learning robotic manipulation through visual planning and acting. In RSS.
- (2018) DeepMind Control Suite. arXiv preprint.