1 Introduction
State representation learning (SRL) [18] has been studied to obtain compact and expressive representations of robot control tasks from high-dimensional sensor data, such as images. An appropriate state representation enables agents to achieve high performance in discrete and continuous control tasks, from games [9] to real robots [29]. Sequential state space models have been shown to improve the performance and sample efficiency of robot control tasks in partially observable Markov decision processes (POMDPs). The deep planning network (PlaNet) [10] is a planning methodology in the latent space that is trained with a task- and dynamics-aware state space model called a recurrent state space model (RSSM). RSSM jointly optimizes the state inference, observation reconstruction, forward dynamics, and reward models. The authors of PlaNet performed model predictive control (MPC) [5, 21] for planning in the state space obtained with RSSM.
Acquiring domain-agnostic states is essential for achieving efficient imitation learning (IL). Without domain-agnostic states, IL is hampered by domain-dependent information, which is useless for control. In the context of IL, it is natural to assume that the data from experts and agents have domain shifts [27]. However, the current SRL methods [10, 16] fail to remove such disturbances from the states when the domain shifts are large. In IL, a discriminator serves as an imitation reward function to distinguish the state-action pairs of the experts from those of the agents [11]. If the obtained states are not domain-agnostic, the discriminator is disturbed by the domain-dependent information, which is eye-catching but unrelated to the control and tasks. As a result, the imitation reward becomes unsuitable for the control, and IL is disrupted. Figure 1 shows examples of domain shifts between the data from an expert and an agent. We define the domain shifts as control-irrelevant changes in the data, such as appearance: e.g., colors, textures, backgrounds, viewing angles, and objects that are unrelated to the control. The domain shifts are caused, for example, by changes in camera settings, the location of data collection, or the appearance of the robot. The domain shifts are also caused when objects unseen in one domain appear in the other domain. For example, an operator will be present in the expert images when he or she gives demonstrations via the direct teaching mode of a robot. In this case, the presence of the operator in the images is the cause of the domain shifts.
To overcome this problem, we propose a domain-agnostic and task- and dynamics-aware SRL model, called a domain-adversarial and conditional state space model (DACSSM). DACSSM builds on RSSM, and it is trained with a domain discriminator and an expert discriminator. To remove the domain-dependent information from the states, (1) the state space is trained with the domain discriminator in an adversarial manner, and (2) the encoder and decoder of DACSSM are conditioned on domain labels. The domain discriminator is trained to identify which domain the acquired states belong to. The negative loss function of the domain discriminator, called the domain confusion loss [28], is added to the loss function of the state space. To reduce the domain confusion loss, the states are trained to be domain-agnostic. In other words, owing to the domain confusion loss, DACSSM is trained to infer states that give the domain discriminator few clues for distinguishing the domain of the states. Moreover, the states are disentangled by conditioning the encoder and decoder on domain labels, as in conditional variational autoencoders (CVAE) [13]. Owing to the disentanglement, the domain-dependent information is eliminated from the state representation. Because DACSSM jointly optimizes the state inference, observation reconstruction, forward dynamics, and reward models, the obtained states are task- and dynamics-aware as well as domain-agnostic. To the best of our knowledge, no studies have combined domain adversarial training with SRL for control tasks.
The main contribution of this paper is the implementation and experiments that demonstrate that the state representation obtained via DACSSM is suitable for IL with large domain shifts. We compared DACSSM to the existing SRL methods in terms of MPC performance via IL for continuous control sparse reward tasks in the MuJoCo physics simulator [26]. The agents in DACSSM achieved a performance comparable to the expert and more than twice that of the baselines.
2 Related studies
State representation learning for POMDPs
Sequential state space models have been studied to solve tasks in POMDPs. Lee et al. proposed a sequential latent variable model that propagates historical information from a control system via contextual stochastic states [16]. They jointly optimized the actor and critic using the state space model. Gangwani et al. jointly optimized the expert discriminator using policy, forward and inverse dynamics, and action models to obtain a task- and dynamics-aware state representation [3]. Their state representation, however, is not domain-agnostic.
Domain-agnostic feature representation
Domain-agnostic feature representation has been obtained by domain-adversarial training or by disentangling the latent space [6]. Domain-adversarial training is a simple and effective approach to extract a feature representation that is unrelated to the domains of the data. Tzeng et al. added the domain confusion loss to the loss function of the feature extractor [28]. Ganin et al. introduced a gradient reversal layer to backpropagate a negative gradient of the domain discriminator loss to the feature extractor [4]. CVAE is a well-known method that is able to disentangle domain-dependent information from the latent space. Its encoder and decoder are conditioned on domain labels to obtain domain-agnostic latent variables [13].
Imitation learning (IL)
IL [22] is a powerful and accepted approach that makes agents mimic expert behavior by using a set of demonstrations of tasks. Ho and Ermon proposed an IL framework called Generative Adversarial Imitation Learning (GAIL) [11]. In GAIL, imitation rewards are computed by the expert discriminator, which distinguishes whether a state-action pair is generated by an agent policy or comes from the expert demonstrations. They formulated a joint process of reinforcement learning and inverse reinforcement learning as a two-player game between the policy and the discriminator, analogous to Generative Adversarial Networks [7]. GAIL has been shown to solve complex high-dimensional continuous control tasks [15, 1, 20, 23].
IL with the domain shifts
Using common measurable features is one of the popular approaches. For example, keypoints of objects [24] and/or tracking marker positions [8, 17] are used as the states. In this approach, one can directly apply existing IL techniques without having to handle the domain shifts. However, such features are not always available. Stadie et al. added the domain confusion loss to the expert discriminator to make it domain-agnostic [25]. By computing the imitation reward using the discriminator, they successfully achieved IL with large domain shifts. Their approach, however, does not include SRL.
3 Proposed Method
3.1 Concept of proposed method
Figure 2 (a) shows the concept of DACSSM, in which the expert discriminator serves as an imitation reward function. Because DACSSM builds a domain-agnostic state space, higher rewards are provided to the agents for expert-like behavior. In contrast, the existing method builds a domain-aware state space, in which the expert discriminator easily distinguishes the states from the agents even when the behavior of the agents is expert-like.
3.2 State space model
In POMDPs, an individual image does not have all the information about the states. Therefore, our model builds on RSSM, which has contextual states to propagate historical information. The model is defined over discrete time steps, contextual deterministic states, stochastic states, image observations, continuous actions, and domain labels. It follows mixed deterministic/stochastic dynamics, which consist of a transition model, a state model, and an observation model. The transition model was implemented as a recurrent neural network. To train the model, we maximized the probability of a sequence of observations in the entire generative process:
(1) 
In general, this objective is intractable. We therefore use the following evidence lower bound (ELBO) on the log-likelihood, introducing a posterior to infer the approximate stochastic states.
(2) 
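For orientation, the generative process and the bound can be sketched in the usual RSSM notation. The symbols are assumptions chosen for illustration ($h_t$ for the contextual deterministic states, $s_t$ for the stochastic states, $o_t$ for the image observations, $a_t$ for the actions, $y$ for the domain label), the reward term is omitted for brevity, and the exact form used for DACSSM may differ.

```latex
\begin{align*}
h_t &= f(h_{t-1}, s_{t-1}, a_{t-1}) && \text{(transition model)} \\
s_t &\sim p(s_t \mid h_t) && \text{(state model)} \\
o_t &\sim p(o_t \mid h_t, s_t, y) && \text{(observation model)}
\end{align*}
\begin{equation*}
\ln p(o_{1:T} \mid a_{1:T}, y) \;\ge\; \sum_{t=1}^{T} \mathbb{E}_{q}\Big[ \ln p(o_t \mid h_t, s_t, y)
  - \mathrm{KL}\big( q(s_t \mid h_t, o_t, y) \,\|\, p(s_t \mid h_t) \big) \Big]
\end{equation*}
```

In this sketch the domain label enters only the posterior (encoder) and the observation model (decoder), which matches the conditioning described in the text.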
The posterior and the observation model are implemented as an encoder and a decoder, respectively. Both are conditioned on the domain labels, which allows them to change their behavior depending on the domain. As in CVAE, the domain-dependent information is thereby eliminated from the obtained contextual and stochastic states.
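A minimal sketch of how a decoder input could be conditioned on a domain label by concatenation is given below. The state sizes (32 and 8) are the ones reported in Section 4.1; the helper names `one_hot` and `decoder_input`, the two-domain assumption, and the one-hot encoding of the label are hypothetical choices made for illustration.

```python
import numpy as np

H_SIZE = 32       # contextual (deterministic) state size, as reported in Section 4.1
S_SIZE = 8        # stochastic state size, as reported in Section 4.1
NUM_DOMAINS = 2   # assumption: one expert domain and one agent domain


def one_hot(labels, num_classes):
    """Return a (batch, num_classes) one-hot matrix for integer labels."""
    labels = np.asarray(labels)
    out = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
    out[np.arange(labels.shape[0]), labels] = 1.0
    return out


def decoder_input(h, s, domain_labels):
    """Concatenate both states with a one-hot domain label (hypothetical helper)."""
    y = one_hot(domain_labels, NUM_DOMAINS)
    return np.concatenate([h, s, y], axis=-1)


batch = 4
h = np.zeros((batch, H_SIZE), dtype=np.float32)
s = np.zeros((batch, S_SIZE), dtype=np.float32)
x = decoder_input(h, s, [0, 1, 1, 0])
print(x.shape)  # (4, 42)
```

Because the label is supplied separately, the decoder does not need the states themselves to carry appearance information such as colors.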
3.3 Domain and expert discriminators
We further introduce the domain and expert discriminators. The role of the domain discriminator is to compute the domain confusion losses. We keep separate replay buffers for the data from the agents, experts, and novices. The data from the novices are in the same domain as those from the experts, but are non-optimal for the tasks. The loss function of the domain discriminator is denoted as follows:
(3) 
Here, we introduce a simple abbreviation of the expectation to avoid complexity:
(4) 
Similarly, the loss function of the expert discriminator is denoted as follows:
(5) 
The expert discriminator serves as an imitation reward function. It is trained to distinguish whether state-action pairs are from episodes of the experts or not.
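The two discriminator objectives can be illustrated with a standard binary cross-entropy, which is an assumption here rather than the paper's stated loss. In this sketch the domain discriminator labels expert and novice data (same domain) as 1 and agent data as 0, whereas the expert discriminator labels only expert data as 1. All function names are hypothetical, and the inputs are discriminator logits.

```python
import numpy as np


def bce(logits, targets):
    """Mean binary cross-entropy from logits, in a numerically stable form."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)
    # max(x, 0) - x * t + log(1 + exp(-|x|))
    per_sample = np.maximum(logits, 0.0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(np.mean(per_sample))


def _loss(negative_logits, positive_logits):
    """Cross-entropy with label 0 for the first group and label 1 for the second."""
    neg = np.concatenate([np.asarray(x, dtype=np.float64) for x in negative_logits])
    pos = np.concatenate([np.asarray(x, dtype=np.float64) for x in positive_logits])
    logits = np.concatenate([neg, pos])
    targets = np.concatenate([np.zeros_like(neg), np.ones_like(pos)])
    return bce(logits, targets)


def domain_discriminator_loss(agent, expert, novice):
    """Expert and novice data share a domain (label 1); agent data is label 0."""
    return _loss([agent], [expert, novice])


def expert_discriminator_loss(agent, expert, novice):
    """Only expert data is label 1; agent and novice data are label 0."""
    return _loss([agent, novice], [expert])


# An undecided discriminator (all logits zero) gives a loss of log(2).
print(domain_discriminator_loss([0.0], [0.0], [0.0]))
```

The only difference between the two losses in this sketch is where the novice data fall, which reflects that novices share the experts' domain but not their optimality.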
3.4 Training of DACSSM
Figure 2 (b) shows a diagram of the training architecture of DACSSM. The dashed lines represent backpropagation paths. The model is trained by minimizing the state space losses together with the domain confusion losses:
(6) 
where the coefficient of the domain confusion losses is a hyperparameter. The reward models are trained with the following losses:
(7) 
The gradient of the expert discriminator losses is not propagated to DACSSM. The gradient of the domain discriminator losses is not propagated to DACSSM directly either; instead, the domain confusion losses are added to the state space losses. Thus, the obtained states become domain-agnostic as well as task- and dynamics-aware: they carry considerable information that is useful for control but few clues regarding the domain-dependent information. We prepared two types of datasets for each task: expert and novice data. Expert data are successful trajectories for tasks in the expert domain, whereas novice data are non-optimal trajectories for tasks in the expert domain. Agent data are collected during training.
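The way the domain confusion loss enters the model objective can be sketched in a few lines. Following the description in Section 1, the confusion loss is taken to be the negative of the domain discriminator loss; the function names are hypothetical, and the default coefficient of 1.0 is the value reported in Section 4.1.

```python
def domain_confusion_loss(domain_discriminator_loss):
    """Negative of the domain discriminator loss, as described in Section 1."""
    return -domain_discriminator_loss


def model_loss(state_space_loss, domain_discriminator_loss, coefficient=1.0):
    """State space loss plus the weighted domain confusion loss (sketch)."""
    return state_space_loss + coefficient * domain_confusion_loss(domain_discriminator_loss)


# The more confused the domain discriminator is (higher discriminator loss),
# the lower the model loss becomes for the same state space loss.
confident = model_loss(state_space_loss=2.0, domain_discriminator_loss=0.1)
confused = model_loss(state_space_loss=2.0, domain_discriminator_loss=0.6)
print(confident > confused)  # True
```

In an actual training loop the discriminator would be updated to decrease its own loss while the state space model is updated with this combined loss, so the two are adversaries.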
3.5 Planning algorithm
We used the cross entropy method (CEM) [2] to search for the best action sequence in the obtained state space. CEM is a robust population-based optimization algorithm that infers a distribution over action sequences that maximize an objective. Because the objective is modeled as a function of the states and actions, the planner can operate purely in the low-dimensional latent space without generating images. Multiple types of rewards are used for the objective [14, 12] in the context of control as inference [19]. We define the distribution over the task-optimality as follows:
(8) 
The distribution over the imitation-optimality is calculated by using the expert discriminator:
(9) 
We use the contextual states to calculate both rewards because contextual information is essential in POMDPs. Hence, the objective of the CEM is to maximize the probability of the task- and imitation-optimalities, as given below:
(10) 
Here, the objective is evaluated over the planning horizon of the CEM.
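A minimal CEM planner is sketched below to make the search procedure concrete. It is not the planner used in the experiments: the horizon, iteration count, candidate count, and elite count are hypothetical values, the action bounds of [-1, 1] are an assumption, and `toy_objective` is a stand-in for the learned reward models and expert discriminator. The 10:1 weighting of task and imitation rewards follows Section 4.1.

```python
import numpy as np


def cem_plan(objective, horizon, action_dim, iterations=20,
             candidates=500, top_k=50, seed=0):
    """Return the mean action sequence found by the cross entropy method."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(iterations):
        # Sample candidate action sequences from the current Gaussian belief.
        noise = rng.standard_normal((candidates, horizon, action_dim))
        actions = np.clip(mean + std * noise, -1.0, 1.0)
        scores = np.array([objective(a) for a in actions])
        # Refit the belief to the best-scoring candidates.
        elite = actions[np.argsort(scores)[-top_k:]]
        mean = elite.mean(axis=0)
        std = elite.std(axis=0)
    return mean


def toy_objective(actions, target=0.5):
    """Stand-in dual reward: weighted sum of two toy reward terms (10:1)."""
    task_reward = -np.sum((actions - target) ** 2)
    imitation_reward = -np.sum(np.abs(actions - target))
    return 10.0 * task_reward + 1.0 * imitation_reward


plan = cem_plan(toy_objective, horizon=5, action_dim=2)
print(plan.shape)  # (5, 2)
```

In the latent-space setting, `objective` would roll the candidate actions forward through the transition model and sum the predicted rewards, so no images need to be generated during planning.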
Table 1
Task  DAC/dual  DAC/imitation  DAC/task  DA/dual  DC/dual  PlaNet  PlaNet+
CupCatch  728±223  304±323  375±371  233±350  788±149  470±398  479±359
FingerSpin  405±42  488±50  130±73  190±108  419±41  157±73  124±91
FingerSpin (tilted view)  406±45  507±48  167±87  123±80  394±51  162±87  156±89
ConnectorInsertion  40.2±29.1  50.5±25.0  0.4±4.0  0.0±0.0  40.9±26.7  0.7±3.4  2.1±8.1
± represents one standard deviation.
Table 2 (columns: six settings of the domain confusion loss coefficient; the column headers are missing)
Task
FingerSpin  419±41  417±45  409±47  405±42  337±52  1±2
ConnectorInsertion  40.9±26.7  35.3±28.4  37.4±29.6  40.2±29.1  49.5±25.5  17.3±23.8
4 Experiments
4.1 Environments and hyperparameters
We considered three tasks in the MuJoCo physics simulator: CupCatch, FingerSpin, and ConnectorInsertion. Figure 3 shows the expert and agent domains for each task. For FingerSpin, we prepared two different agent domains. One agent domain of FingerSpin has different colors of objects and floors compared to the expert domain. The other agent domain of FingerSpin additionally has a different viewing angle. It is difficult to train control policies by using only task rewards because all tasks here are of the sparse reward type. CupCatch and FingerSpin are instances of the DeepMind Control Suite [30]. We also built a new task, ConnectorInsertion, in which the agent attempts to insert a connector into a socket. Constant rewards were obtained when the connector was in the socket. The position and angle of the connector and socket were initialized with random values at the start of the episodes. In this task, we added a constant bias to the action that moves the connector upward on the page. This is equivalent to introducing the domain knowledge that the socket lies upward on the page.
The contextual state and stochastic state sizes were 32 and 8, respectively, for all experiments. A small latent size is enough for DACSSM because domain-related information is eliminated from the latent space. The decoder refers to the domain labels to reconstruct domain-specific observations. The domain label was simply concatenated to the contextual and stochastic states and fed into the domain conditional (DC) decoder. We used not only the DC decoder but also the DC encoder for the tilted view of FingerSpin. We implemented the DC encoder by training two separate encoders and switching between them based on the domain label. We used batches of 40 sequence chunks, each 40 steps long, for training. Apart from the above, we adopted the same hyperparameters and architectures as PlaNet for the state space model. We implemented both the expert and domain discriminators as two fully connected layers of size 64 with ReLU activations. The domain confusion loss coefficient is 1.0 unless otherwise noted. For planning, we used CEM with a short planning horizon; its other settings are the number of optimization iterations, the number of candidate samples, and the number of best candidates used for refitting. The action repeats were 4, 2, and 800 for CupCatch, FingerSpin, and ConnectorInsertion, respectively. The action repeat for ConnectorInsertion was extremely large because we set the simulation timestep of MuJoCo to a very small value; otherwise, objects easily pass through each other when they come into forceful contact. We evaluated three types of objectives for the planning: dual, imitation, and task rewards. The dual rewards are a weighted sum of the task and imitation rewards with a ratio of 10:1.
4.2 Applying state representation to IL with domain shifts
Figure 4 and Table 1 compare DACSSM using dual rewards (DAC/dual) to a baseline of the existing SRL method (PlaNet/task) and a naive implementation of the expert discriminator on top of the baseline (PlaNet+). DAC/dual achieved much higher performance for all tasks than the two baselines. This is because the domain-aware state representation of PlaNet does not help the agents to achieve higher performance via imitation learning with the domain shifts. We also compared three versions of DACSSM: one using dual rewards (DAC/dual), one using imitation rewards (DAC/imitation), and one using task rewards (DAC/task). Except for CupCatch, DAC/imitation achieved the best performance. This is because the planning horizon length is too short for FingerSpin and ConnectorInsertion. We further trained our proposed model (DAC/dual) as well as versions with domain adversarial training but without domain conditional encoders/decoders (DA/dual), and with domain conditional encoders/decoders but without domain adversarial training (DC/dual). The performances of DAC/dual and DC/dual were almost the same, and that of DA/dual was much lower. In the settings of this experiment, the domain adversarial training was not effective because the domain confusion loss coefficient was too small. Table 2 shows that DAC/dual achieved higher performance than DC/dual for ConnectorInsertion with a different value of the coefficient. These results show that the states obtained with DACSSM help the agents to achieve effective imitation learning with the domain shifts.
4.3 Reconstruction from State Representation
Figure 5 shows a sequence of ground-truth examples and images reconstructed from the state representation obtained with DACSSM for FingerSpin. The first five columns show context frames that were reconstructed from posterior samples, and the remaining images were generated from open-loop prior samples. The images in the second and third rows were reconstructed from a sequence of contextual and stochastic states together with a domain label via the DC decoder. The joint angles of the robotic arm and the target object were successfully reconstructed from the states, whereas the domain-dependent information (colors of the floor and object) depended on the domain labels. The images in the last row were reconstructed from the contextual states without domain labels, using another decoder that was trained separately from our model. The joint angles were successfully reconstructed, whereas the colors appeared to be a mixture of the two domains. These results show that the states obtained with DACSSM contain control-dependent information such as the joint angles, but do not contain domain-dependent information such as the colors, which is not related to the control. In other words, we successfully acquired a domain-agnostic and task- and dynamics-aware state representation via DACSSM.
5 Conclusion and Discussions
We showed that a domain-agnostic and task- and dynamics-aware state representation was obtained via DACSSM. To obtain such a state representation, we introduced domain adversarial training and domain conditional encoders/decoders into a recent task- and dynamics-aware sequential state space model. Moreover, we experimentally evaluated the MPC performance via IL with large domain shifts for continuous control sparse reward tasks in simulators. The state representation from DACSSM helped the agents to achieve performance comparable to the expert. The existing SRL methods failed to remove domain-dependent information from the states, and thus the agents could not perform effective IL with large domain shifts. We conclude that domain-agnostic and control-aware states are essential for IL with large domain shifts, and that such states are obtained via DACSSM.
A remaining question is whether DACSSM is applicable to larger and/or different types of domain shifts, e.g., data of different modalities. Since the best domain confusion loss coefficient depends on the task, as shown in Table 2, we can expect that a better state representation is obtained by actively varying this coefficient. Acquiring task-agnostic states to achieve a universal controller is also an appealing direction for future work. Learning from human demonstration is a challenging but interesting direction; it includes obtaining an appropriate state representation from expert data without action data. Implementation for real robotic tasks is another important direction. Acquiring a fully stochastic state representation is necessary for real-world tasks because the control system of a real robot has much larger uncertainty than a simulation.
Acknowledgments
Most of the experiments were conducted in ABCI (AI Bridging Cloud Infrastructure), built by the National Institute of Advanced Industrial Science and Technology, Japan.
References
 [1] (2017) End-to-end differentiable adversarial imitation learning. In ICML.
 [2] (2018) Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NIPS.
 [3] (2019) Learning belief representations for imitation learning in POMDPs. In UAI.
 [4] (2015) Domain-adversarial training of neural networks. In J. Mach. Learn. Res., Vol. 17, pp. 59:1–59:35.
 [5] (1989) Model predictive control: theory and practice—a survey. In Automatica.
 [6] (2018) Image-to-image translation for cross-domain disentanglement. In NIPS.
 [7] (2014) Generative adversarial nets. In NIPS.
 [8] (2016) Learning dexterous manipulation for a soft robotic hand from human demonstrations. In IROS.
 [9] (2018) Recurrent world models facilitate policy evolution. In NIPS.
 [10] (2018) Learning latent dynamics for planning from pixels. In arXiv.
 [11] (2016) Generative adversarial imitation learning. In NIPS.
 [12] (2018) Multi-objective model-based policy search for data-efficient learning with sparse rewards. In CoRL.
 [13] (2014) Semi-supervised learning with deep generative models. In NIPS.
 [14] (2019) Integration of imitation learning using GAIL and reinforcement learning using task-achievement rewards via probabilistic generative model. In arXiv.
 [15] (2018) Discriminator-actor-critic: addressing sample inefficiency and reward bias in adversarial imitation learning. In ICLR.
 [16] (2019) Stochastic latent actor-critic: deep reinforcement learning with a latent variable model. In arXiv.
 [17] (2019) To follow or not to follow: selective imitation learning from observations. In CoRL.
 [18] (2018) State representation learning for control: an overview. In Neural Networks, Vol. 108.
 [19] (2018) Reinforcement learning and control as probabilistic inference: tutorial and review. In arXiv.
 [20] (2017) InfoGAIL: interpretable imitation learning from visual demonstrations. In NIPS.
 [21] (2019) Variational inference MPC for Bayesian model-based reinforcement learning. In CoRL.
 [22] (1999) Is imitation learning the route to humanoid robots?. In Trends in Cognitive Sciences, Vol. 3.
 [23] (2018) Directed-info GAIL: learning hierarchical policies from unsegmented demonstrations using directed information. In arXiv.
 [24] (2019) Graph-structured visual imitation. In CoRL.
 [25] (2017) Third-person imitation learning. In ICLR.
 [26] (2012) MuJoCo: a physics engine for model-based control. In IROS.
 [27] (2019) Recent advances in imitation learning from observation. In IJCAI.
 [28] (2014) Deep domain confusion: maximizing for domain invariance. In arXiv.
 [29] (2019) Learning robotic manipulation through visual planning and acting. In RSS.
 [30] (2018) DeepMind control suite. In arXiv.