1 Introduction
Reinforcement learning (RL) can leverage the modeling capability of generative models to solve complex sequential decision-making problems more efficiently. RL has been applied to end-to-end training of deep visuomotor robotic policies [levine2016end, levine2018learning], but it is typically too data-inefficient, especially when applied to tasks that provide only a terminal reward at the end of an episode. One way to alleviate the data-inefficiency problem in RL is to leverage prior knowledge to reduce the complexity of the optimization problem. One prior that significantly reduces the temporal complexity is an approximation of the distribution from which valid action sequences can be sampled. Such distributions can be efficiently approximated by training generative models given a sufficient number of valid action sequences.
The question is then how to combine powerful RL optimization algorithms with the modeling capability of generative models to improve the efficiency of policy training. Moreover, which characteristics of the generative models are important for efficient policy training? A suitable generative model must capture the widest possible distribution of the training data to generate as many distinct motion trajectories as possible, while avoiding the generation of invalid trajectories outside the training dataset. The diversity of the generated data enables the policy to complete a given task for a wider set of target states. On the other hand, adhering to the distribution of the training data ensures safety when the generated trajectories are executed on a real robotic platform.
In this paper, we (1) exploit RL to train a deep visuomotor policy that, combined with a generative model, generates feedforward sequences of motor commands to control a robotic arm given image pixels as the input, and (2) provide a set of measures, two of which we introduce, to evaluate the quality of the latent space of different generative models when regulated by RL policy search algorithms.
Regarding (1), we propose a learning framework that divides the deep visuomotor sequential decision-making problem into the following subproblems that can be solved more efficiently: (a) an unsupervised generative model training problem that approximates the distribution of motor actions, (b) a trust-region policy optimization problem that solves a contextual multi-armed bandit without temporal complexities, and (c) a supervised learning problem that trains the deep visuomotor policy in an end-to-end fashion.
Regarding (2), we evaluate generative models based on (a) the quality and coverage of the samples they generate using the precision and recall metric [kynkaanniemi2019improved, sajjadi2018assessing], and (b) the quality of their latent representations using our novel measures of disentanglement and local linearity. Both of our metrics leverage the end states obtained after execution of the generated trajectories on a robotic platform. Disentanglement measures to what extent separate sets of dimensions in the latent space control different aspects of the task, while local linearity measures the complexity of the generative process and system dynamics in the neighbourhood of each latent representation. Our hypothesis is that a generative model that is well disentangled, locally linear and able to generate realistic samples that closely follow the training data (i.e., has high precision and high recall) leads to more efficient neural network policy training. We experimentally investigate this hypothesis on several generative models, namely β-VAEs [higgins2017beta] and InfoGANs [chen2016infogan], and use automatic relevance determination regression (ARD) to quantify the effect of each characteristic on the RL policy training performance.
In summary, the advantages of our framework are:

It improves the data-efficiency of the policy training algorithm by incorporating prior knowledge in the form of a distribution over valid sequences of actions, thereby reducing the search space.

It helps to acquire complex visuomotor policies given sparse terminal rewards provided at the end of a successful episode. The proposed formulation converts the sequential decision-making problem into a contextual multi-armed bandit without temporal complexities. It therefore alleviates the temporal credit assignment problem that is inherent in sequential decision-making tasks, which enables efficient policy training with only terminal rewards.

It enables safe RL exploration by sampling actions only from the approximated distribution. This is in stark contrast to the typical RL algorithms in which random actions are taken during the exploration phase.

It provides a set of measures for evaluation of the generative model based on which it is possible to predict the performance of the RL policy training task prior to the actual training.
This paper provides a comprehensive overview of our earlier work on RL policy training based on generative models [ghadirzadeh2017deep, arndt2019meta, chen2019adversarial, hamalainen2019affordance, butepage2019imitating] and is organized as follows: in Section 2, we provide an overview of the related work. We formally introduce the problem of policy training with generative models in Section 3, and describe how the framework is trained in Section 4. In Section 5 we first briefly overview VAEs and GANs, and then define all of the evaluation measures used to predict the final policy training performance. Section 6 describes the end-to-end training of the perception and control modules. We present the experimental results in Section 7. Finally, the conclusions and future work are provided in Section 8.
2 Related work
Our work addresses two types of problems: (1) visuomotor policy training based on unsupervised generative model training and trust-region policy optimization, and (2) evaluation of generative models to forecast the efficiency of the final policy training task. We introduce the related work for each of the problems in the following sections.
2.1 Data-efficient End-to-end Policy Training
In recent years, end-to-end training of visuomotor policies using deep reinforcement learning has gained in popularity in robotics research [ghadirzadeh2017deep, levine2016end, finn2016deep, kalashnikov2018qt, quillen2018deep, singh2017gplac, devin2018deep, pinto2017asymmetric]. However, deep RL algorithms are typically very data-hungry, and learning a general policy, i.e., a policy that also performs well for previously unseen inputs, requires a farm of robots continuously collecting data for several days [levine2018learning, finn2017deep, gu2017deep, dasari2019robonet]. The need for large-scale data collection has hindered the applicability of RL solutions to many practical robotics tasks. Recent studies have tried to improve the data-efficiency by training the policy in simulation and transferring the acquired visuomotor skills to the real setup [quillen2018deep, pinto2017asymmetric, abdolmaleki2020distributional, peng2018sim], a paradigm known as sim-to-real transfer learning. Sim-to-real approaches are utilized for two tasks in deep policy training: (1) training the perception model via randomization of the texture and shape of visual objects in simulation and using the trained model directly in the real-world setup (zero-shot transfer) [hamalainen2019affordance, tobin2017domain], and (2) training the policy in simulation by randomizing the dynamics of the task and transferring the policy to the real setup by fine-tuning it with real data (few-shot transfer learning) [arndt2019meta, peng2018sim]. However, challenges in the design of the simulation environment can cause large differences between the real and the simulated environments, which hinder an efficient transfer of knowledge between these two domains. In such cases, transfer learning from other domains, e.g., human demonstrations [butepage2019imitating, yu2018one] and simpler task setups [chen2019adversarial, chen2018deep], can help the agent to learn a policy more efficiently. In this work, we exploit human demonstrations to shape the robot motion trajectories by training generative models that reproduce the demonstrated trajectories. Following our earlier work [chen2019adversarial], we exploit adversarial domain adaptation techniques [tzeng2017adversarial, tzeng2020adapting] to improve the generality of the acquired policy when the policy is trained in a simple task environment with a small amount of training data. In the rest of this section, we review related studies that improve the data-efficiency and generality of RL algorithms by utilizing trust-region terms, converting the RL problem into a supervised learning problem, and using trajectory-centered approaches that shape motion trajectories prior to the policy training.
Schulman et al. introduced two policy gradient methods, trust-region policy optimization (TRPO) [schulman2015trust] and proximal policy optimization (PPO) [schulman2017proximal], that scale well to nonlinear policies, such as neural networks. The key component of TRPO and PPO is a surrogate objective function with a trust-region term, based on which the policy can be updated and monotonically improved without the policy distribution changing abruptly after each iteration.
In TRPO, the changes in the distributions of the policies before and after each update are penalized by a KL divergence term. Therefore, the policy is forced to stay in a trust region given by the action distribution of the current policy. Our EM formulation yields a similar trust-region term, with the difference being that it penalizes the changes in the distribution of the deep policy and a so-called variational policy that will be introduced as a part of our proposed optimization algorithm. Since our formulation allows the use of any policy gradient solution, we use the same RL objective function.
The EM algorithm has been used for policy training in a number of prior works, e.g., by Neumann [neumann2011variational], Deisenroth et al. [deisenroth2013survey], and Levine and Koltun [levine2013variational]. The key idea is to introduce variational policies to decompose the policy training into two downstream tasks that are trained iteratively until no further policy improvement can be observed [ghadirzadeh2018sensorimotor]. In a more recent work, Levine et al. [levine2016end] introduced the guided policy search (GPS) algorithm, which divides the visuomotor policy training task into a trajectory optimization and a supervised learning problem. GPS alternates between two steps: (1) optimizing a set of trajectories by exploiting a trust-region term to stay close to the action distribution given by the deep policy, and (2) updating the deep policy to reproduce the motion trajectories. Our EM solution differs from the GPS framework and earlier approaches in that we optimize the trajectories by regulating a generative model that is trained prior to the policy training. Training generative models enables the learning framework to exploit human expert knowledge as well as to optimize the policy given only terminal rewards, as explained earlier.
Trajectory-centric approaches, such as dynamic movement primitives (DMPs), have been popular because of the ease of integrating expert knowledge in the policy training process via physical demonstration [peters2006policy, peters2008reinforcement, ijspeert2003learning, ijspeert2013dynamical, hazara2019transferring]. However, such models are less expressive compared to deep neural networks and are particularly limited when it comes to the end-to-end training of the perception and the control elements of the model. Moreover, these approaches cannot be used to train reactive policies where the action is adjusted in every timestep based on the observed sensory input [haarnoja2018composable]. On the other hand, deep generative models can model complex dependencies within the data by learning the underlying data distribution from which realistic samples can be obtained. Furthermore, generative models can be easily accommodated in larger neural networks without affecting the data integrity. Our framework based on generative models enables training both feedback (reactive) and feedforward policies by adjusting the policy network architecture.
The use of generative models in policy training has become popular in recent years [ghadirzadeh2017deep, butepage2019imitating, hamalainen2019affordance, chen2019adversarial, arndt2019meta, lippi2020latent, gothoskar2020learning, igl2018deep, buesing2018learning, mishra2017prediction, ke2018modeling, hafner2018learning, rhinehart2018deep, krupnik2019multi] because of their low-dimensional and regularized latent spaces. However, latent variable generative models are mainly studied to train a long-term state prediction model that is used in the context of trajectory optimization and model-based reinforcement learning [buesing2018learning, mishra2017prediction, ke2018modeling, hafner2018learning, rhinehart2018deep, krupnik2019multi]. Regulating generative models based on reinforcement learning to produce sequences of actions according to the visual state first appeared in our prior work [ghadirzadeh2017deep]. Since then, we have applied the framework to different robotic tasks, e.g., throwing balls [ghadirzadeh2017deep], shooting hockey pucks [arndt2019meta], pouring into mugs [chen2019adversarial, hamalainen2019affordance], and in a variety of problem domains, e.g., sim-to-real transfer learning [hamalainen2019affordance, arndt2019meta] and domain adaptation to acquire general policies [chen2019adversarial]. This work is a comprehensive study that builds the theoretical foundation of our earlier work and in addition studies the characteristics of the generative models that can affect the efficiency of the policy training task.
2.2 Evaluation of Generative Models
Although generative models have proved successful in many domains [lippi2020latent, brock2018large, wang2018high, vae_anom, vae_text8672806], assessing their quality remains a challenging problem [challenging_common]. It involves analysing the quality of both latent representations and generated samples. A general belief is that a superior model possesses a well-structured latent space and tightly matches the distribution of the training data.
A widely adopted approach for assessing the quality of the latent representations is the measure of disentanglement [higgins2018towards, repr_learning_survey]. A representation is said to be disentangled if each latent component encodes exactly one ground truth generative factor present in the data [kim2018disentangling]. Existing frameworks for both learning and evaluating disentangled representations [higgins2017beta, kim2018disentangling, eastwood2018framework, chen2018isolating] rely on the assumption that the ground truth factors of variation are known a priori and are independent. Their core idea is to measure how changes in the generative factors affect the latent representations and vice versa. In cases when an encoder network is available, this is typically achieved with a classifier that was trained to predict which generative factor was held constant given a latent representation. In generative models without an encoder network, such as GANs, disentanglement is measured by visually inspecting the latent traversals provided that the input data are images [chen2016infogan, jeon2019ibgan, lee2020high, liu2019oogan]. However, these measures are difficult to apply when generative factors of variation are unknown or when manual visual inspection is not possible, both of which are the case with sequences of motor commands for controlling a robotic arm. We therefore define a measure of disentanglement that does not rely on any of these requirements and instead leverages the end states of the downstream robotics task corresponding to a given set of latent action representations. In contrast to the existing measures, it measures how changes in the latent space affect the obtained end states in a fully unsupervised way.
Assessing the quality of the learned distribution and samples from it presents a separate challenge. Generated samples and their variation should resemble those obtained from the training data distribution. Early metrics such as IS [IS_NIPS2016_6125], FID [FID_NIPS2017_7240] and KID [binkowski2018demystifying] provided a promising start but were shown to be unable to distinguish between failure cases, such as mode collapse or unrealistic generated samples. Instead of using a one-dimensional score, [sajjadi2018assessing] propose to evaluate the learned distribution by comparing the samples from it with the ground truth training samples using the notion of precision and recall. Intuitively, precision measures the similarity between the generated and real samples, while recall determines the fraction of the true distribution that is covered by the distribution learned by the model. The measure was further improved both theoretically and practically by [revisiting_pr], while [kynkaanniemi2019improved] provides an explicit nonparametric variant of the original probabilistic approach.
3 Preliminaries
Our framework consists of three parts: (1) policy training using the EM algorithm combined with generative models, (2) training and evaluation of the generative models, and (3) training of the perception module. In this section, we introduce the problem of training a feedforward deep policy network and describe our approach to solving it by leveraging generative models. We then present each of the three parts of the framework in the following sections.
We deal with episodic tasks where the goal is to find a sequence of actions consisting of timesteps and motor actions for a given state . More specifically, we wish to find a policy , represented by a deep neural network , based on which a sequence of motor actions can be sampled given a state of the environment . A state contains information about the current configuration of the environment as well as the goal to reach in case a goal-conditioned policy has to be obtained. For a given state and a sequence of actions a reward is given at the end of each trial according to the reward probability distribution which is unknown to the learning agent. The policy is represented by a neural network with parameters that are trained to maximize the log-likelihood of receiving high rewards
(1)
where denotes the distribution over states.
Our approach is based on training a generative model , parametrized by , that maps a low-dimensional latent action variable into a motion trajectory where . In other words, we assume that the search space is limited to the trajectories spanned by the generative model. In this case, the feedforward policy search problem splits into two subproblems: (1) finding the mapping and (2) finding the sub-policy , where . In the rest of the text we refer to the sub-policy as the policy and omit the parameters from the notation when they are not relevant. Instead of marginalizing over trajectories as in Eq. (1) we marginalize over the latent variable by exploiting the generative model
(2) 
Once the models are trained the policy output is found by first sampling from the policy and then using the mapping to get the sequence of motor actions . There are four advantages to performing the policy search in the low-dimensional latent space: (1) the search space is considerably smaller compared to the high-dimensional trajectory space, (2) the sequential decision-making problem is reduced to a contextual multi-armed bandit without temporal complexities and the credit assignment problem, (3) the exploration is done based on trajectories generated by the generative model, which increases the safety of real robot explorations, and (4) it provides a natural way to incorporate prior knowledge based on human demonstration or general trajectory planning algorithms.
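The two-stage action generation described above (sample a latent action from the policy, then decode it into a full trajectory) can be sketched as follows. This is only a minimal illustration and not the authors' implementation: the linear-Gaussian policy, the linear decoder, all dimensions and all names are hypothetical stand-ins for trained neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
STATE_DIM = 4      # dimension of the state (e.g. extracted visual features)
LATENT_DIM = 2     # dimension of the latent action variable
T, M = 50, 7       # time steps and motor channels of one trajectory

# Hypothetical stand-ins for the trained models: a linear-Gaussian policy
# and a linear decoder. Real models would be neural networks.
W_policy = rng.normal(size=(LATENT_DIM, STATE_DIM))
W_decoder = rng.normal(size=(T * M, LATENT_DIM))


def sample_latent_action(state, sigma=0.1):
    """Sample a latent action from the stand-in policy given a state."""
    mean = W_policy @ state
    return mean + sigma * rng.normal(size=LATENT_DIM)


def decode_trajectory(latent_action):
    """Map a latent action to a full trajectory of shape (T, M)."""
    return (W_decoder @ latent_action).reshape(T, M)


def act(state):
    """Policy output: sample in latent space, then decode to motor commands."""
    return decode_trajectory(sample_latent_action(state))


state = rng.normal(size=STATE_DIM)
trajectory = act(state)
print(trajectory.shape)  # (50, 7)
```

The point of the sketch is the dimensionality gap: the policy only has to choose 2 numbers per episode, while the decoder expands them into 350 motor commands.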
In the following section, we introduce the expectation-maximization algorithm for training an action-selection policy using a generative model based on which different motor trajectories suitable for solving a given task can be generated.
4 Expectation-Maximization Policy Training
We use the EM algorithm to find an optimal policy . We first introduce a variational policy which is a simpler auxiliary distribution used to improve the training of . As the goal is to maximize the reward probability we start by expressing its logarithm as , where we used the identity and omitted the conditioning on in the reward probability for simplicity. Following the EM derivation introduced in [neumann2011variational] and using the identity , the expression can be further decomposed into
(3) 
The second term (II) is the Kullback-Leibler (KL) divergence between distributions and , which is a non-negative quantity. Therefore, the first term (I) provides a lower bound for . To maximize the latter we use the EM algorithm which is an iterative procedure consisting of two steps known as the expectation (E) and the maximization (M) steps, introduced in the following sections.
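For readers following the derivation, the decomposition into a lower bound (I) and a KL term (II) is an instance of the standard variational identity. The sketch below writes it out under assumed notation that is not taken from the source: $s$ for the state, $\alpha$ for the latent action, $r$ for the reward event, $\pi_\theta(\alpha \mid s)$ for the policy, and $q(\alpha \mid s)$ for the variational policy. The first line uses $\int q(\alpha \mid s)\, d\alpha = 1$ and the second uses $p_\theta(r \mid s) = p(r \mid s, \alpha)\, \pi_\theta(\alpha \mid s) / p_\theta(\alpha \mid r, s)$.

```latex
\begin{align}
\log p_\theta(r \mid s)
  &= \int q(\alpha \mid s)\, \log p_\theta(r \mid s)\, d\alpha \\
  &= \underbrace{\int q(\alpha \mid s)\,
       \log \frac{p(r \mid s, \alpha)\, \pi_\theta(\alpha \mid s)}{q(\alpha \mid s)}\, d\alpha}_{\text{(I)}}
   + \underbrace{\int q(\alpha \mid s)\,
       \log \frac{q(\alpha \mid s)}{p_\theta(\alpha \mid r, s)}\, d\alpha}_{\text{(II)}}
\end{align}
```

Term (II) is $\mathrm{KL}\big(q(\alpha \mid s)\,\|\,p_\theta(\alpha \mid r, s)\big) \ge 0$, so term (I) can never exceed the left-hand side.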
4.1 Expectation step
The E-step maximizes by minimizing the KL divergence term (II) in Eq. (3). Since does not depend on the sum of the KL divergence term (II) and the lower bound term (I) is a constant value for different . Therefore, reducing (II) by optimizing increases the lower bound (I). Assuming that is parametrized by the E-step objective function is given by
(4) 
where we used Bayes' rule and substituted by . In typical RL applications we assume that the reward is given by a deterministic reward function . In this case, can be maximized indirectly by maximizing the expected reward value on which we can apply the policy gradient theorem. Moreover, acts as a trust region term forcing not to deviate too much from the policy distribution . Therefore, we can apply policy search algorithms with trust region terms to optimize the objective given in Eq. (4). Following the derivations introduced in [schulman2015trust], we adopt the TRPO objective for the E-step optimization
(5) 
where is the advantage function, is the value function and denotes the optimal solution for the given iteration. Note that the action latent variable is always sampled from the policy and not from the variational policy .
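To make the structure of such an E-step objective concrete, the sketch below evaluates an importance-weighted surrogate with a trust-region term for diagonal Gaussian policies. It rests on several assumptions that are not in the source: both the policy and the variational policy are diagonal Gaussians over the latent action, the KL term is applied as a soft penalty with a hypothetical weight `kl_weight` (TRPO itself uses a hard constraint), and the KL direction is taken from the variational policy to the policy. Latent actions are sampled from the policy, so the ratio reweights them towards the variational policy.

```python
import numpy as np


def gaussian_log_prob(x, mean, std):
    """Log-density of a diagonal Gaussian, summed over the last axis."""
    var = std ** 2
    return np.sum(-0.5 * ((x - mean) ** 2 / var + np.log(2 * np.pi * var)), axis=-1)


def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis."""
    return np.sum(
        np.log(std_p / std_q)
        + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2 * std_p ** 2)
        - 0.5,
        axis=-1,
    )


def e_step_surrogate(latents, advantages, q_mean, q_std, pi_mean, pi_std, kl_weight=1.0):
    """Penalized surrogate to be maximized with respect to the variational policy.

    latents:    (N, D) latent actions that were sampled from the policy pi
    advantages: (N,)   advantage estimates for those samples
    """
    log_ratio = gaussian_log_prob(latents, q_mean, q_std) - gaussian_log_prob(
        latents, pi_mean, pi_std
    )
    ratio = np.exp(log_ratio)                     # importance weights q / pi
    kl = gaussian_kl(q_mean, q_std, pi_mean, pi_std)
    return np.mean(ratio * advantages - kl_weight * kl)
```

When the variational policy equals the policy, the ratio is 1 and the KL term is 0, so the surrogate reduces to the mean advantage; this gives a simple sanity check for an implementation.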
4.2 Maximization step
The M-step directly maximizes the lower bound (I) in Eq. (3) by optimizing the policy parameters while holding the variational policy constant. Following [deisenroth2013survey] and noting that the dynamics of the system are not affected by the choice of the policy parameters , we maximize (I) by minimizing the following KL divergence
(6) 
In other words, the M-step updates the policy to match the distribution of the variational policy which was updated in the E-step. As in the E-step, denotes the optimal solution for the given iteration.
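As a simplified illustration of why the M-step is a supervised learning problem, consider the special case where both policies are Gaussians with the same fixed covariance. The KL divergence between them is then proportional to the squared distance between their means, so minimizing it amounts to regressing the policy mean onto the variational policy mean. The sketch below does this for a linear policy in closed form; the linear parametrization and the function names are assumptions made for illustration, whereas the paper's policy is a deep network trained by gradient descent.

```python
import numpy as np


def _with_bias(states):
    """Append a constant 1 to every state so the policy has a bias term."""
    return np.hstack([states, np.ones((states.shape[0], 1))])


def m_step_linear_policy(states, q_means):
    """Fit policy weights so the policy mean matches the variational mean.

    states:  (N, state_dim)  sampled states
    q_means: (N, latent_dim) means of the variational policy at those states
    Returns weights of shape (state_dim + 1, latent_dim).
    """
    weights, *_ = np.linalg.lstsq(_with_bias(states), q_means, rcond=None)
    return weights


def policy_mean(states, weights):
    """Mean latent action predicted by the fitted linear policy."""
    return _with_bias(states) @ weights
```

If the variational means are themselves a linear function of the state, the fit recovers them exactly; otherwise it returns the least-squares approximation.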
A summary of the EM policy training is given in Algorithm 1. In each iteration, a set of states is sampled from the state distribution . For each state a latent action variable is sampled from the distribution given by the policy . A generative model is then used to map every latent action variable into a full motor trajectory which is then deployed on the robot to get the corresponding reward value. In the inner loop, the variational policy and the main policy are updated iteratively on batches of data using the objective functions for the E- and M-steps of the policy optimization method.
The number of variational policies can be chosen arbitrarily. We suggest optimizing several variational policies simultaneously in the E-step by splitting the state space into smaller subspaces, for each of which one variational policy is optimized. This can improve the efficiency of the policy training since variational policies with fewer parameters have to be optimized in a small region of the state space. The analysis of the effect of the number of variational policies on the policy training is presented in Sec. 7.4.
5 Generative Model Training
So far we discussed how to train an action-selection policy based on the EM algorithm to regulate the action latent variable which is the input to a generative model. In this section we review two prominent approaches to train a generative model, VAE and GAN, which we use to generate sequences of actions required to solve the sequential decision-making problem. We then introduce a set of measures used to predict which properties of a generative model will influence the policy training.
5.1 Training generative models
We aim to model the true distribution of the motor actions that are suitable to complete a given task. To this end, we introduce a low-dimensional random variable with a probability density function representing the latent actions which are mapped into unique trajectories by a generative model . The model is trained to maximize the likelihood of the training trajectories under the entire latent variable space.
5.1.1 Variational autoencoders
A VAE [kingma2014auto, rezende2014stochasticvae2] consists of encoder and decoder neural networks representing the parameters of the approximate posterior distribution and the likelihood function , respectively. The encoder and decoder neural networks, parametrized by and , respectively, are jointly trained to optimize the variational lower bound
(7) 
where the parameter β [higgins2017beta] is a variable that controls the trade-off between the reconstruction fidelity and the structure of the latent space regulated by the KL divergence.
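The trade-off controlled by this weight can be illustrated with a minimal sketch of a β-weighted VAE loss (the negative of the lower bound). The assumptions here are not from the source: a diagonal Gaussian approximate posterior, a standard normal prior, and a squared-error reconstruction term, which corresponds to a Gaussian likelihood up to scale and additive constants. Trajectories are treated as flattened vectors.

```python
import numpy as np


def beta_vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Negative beta-weighted lower bound, averaged over the batch.

    x, x_recon:  (N, D) trajectories and their reconstructions (flattened)
    mu, log_var: (N, L) mean and log-variance of the approximate posterior
    beta:        weight of the KL term; larger values favour latent structure
                 over reconstruction fidelity
    """
    # Reconstruction term: squared error (Gaussian likelihood up to constants).
    recon = np.sum((x - x_recon) ** 2, axis=-1)
    # Closed-form KL between N(mu, diag(exp(log_var))) and N(0, I).
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)
    return np.mean(recon + beta * kl)
```

With a perfect reconstruction and a posterior equal to the prior the loss is zero; moving the posterior mean away from zero is penalized in proportion to `beta`.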
5.1.2 Generative adversarial networks
A GAN model [goodfellow2014generative] consists of generator and discriminator neural networks that are trained by playing a min-max game. The generative model , parametrized by , transforms a latent variable sampled from the prior noise distribution into a trajectory . The model needs to produce realistic samples resembling those obtained from the training data distribution . It is trained by playing an adversarial game against the discriminator network , parametrized by , which needs to distinguish a generated sample from a real one. The competition between the two networks is expressed as the following min-max objective
(8) 
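For reference, the min-max objective of the original GAN paper [goodfellow2014generative] is commonly written as below. The symbols are those of that paper and are assumptions with respect to this text: $x$ is a real sample (here, a training trajectory), $z$ is a latent variable drawn from the prior $p_z$, $G$ is the generator and $D$ the discriminator.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_z(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator is rewarded for assigning high probability to real samples and low probability to generated ones, while the generator is trained to reduce the second term.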
However, the original GAN formulation from Eq. (8) does not impose any restrictions on the latent variable and therefore the generator can use in an arbitrary way. To learn disentangled latent representations we instead use InfoGAN [chen2016infogan] which is a version of GAN with an information-theoretic regularization added to the original objective. The regularization is based on the idea of maximising the mutual information between the latent code and the corresponding generated sample . An InfoGAN model is trained using the following information min-max objective [chen2016infogan]
(9) 
where is an approximation of the true unknown posterior distribution and a hyperparameter. In practice, is a neural network which shares all the convolutional layers with the discriminator network except for the last few output layers.
5.2 Evaluation of the generative models
We review the characteristics of generative models that can potentially improve the policy training by measuring precision and recall, disentanglement and local linearity. Our goal is to be able to judge the quality of the policy training by evaluating the generative models prior to the RL training. We relate the measures to the performance of the policy training in Section 7.2.
5.2.1 Precision and recall
Precision and recall for distributions is a measure, first introduced by [sajjadi2018assessing] and further improved by [kynkaanniemi2019improved], for evaluating the quality of a distribution learned by a generative model . It is based on the comparison of samples obtained from with the samples from the ground truth reference distribution. In our case, the reference distribution is the one of the training motor trajectories. Intuitively, precision measures the quality of the generated sequences of motor actions by quantifying how similar they are to the training trajectories. It determines the fraction of the generated samples that are realistic. On the other hand, recall evaluates how well the learned distribution covers the reference distribution and it determines the fraction of the training trajectories that can be generated by the generative model. In the context of the policy training, we would like the output of to be as similar as possible to the demonstrated motor trajectories. It is also important that covers the entire state space as it must be able to reach different goal states from different task configurations. Therefore, the generative model needs to have both high precision and high recall scores.
The improved measure introduced by [kynkaanniemi2019improved] is based on an approximation of manifolds of both training and generated data. In particular, given a set of either real training trajectories or generated trajectories , the corresponding manifold is estimated by forming hyperspheres around each trajectory with radius equal to the distance to its k-th nearest neighbour . To determine whether or not a given novel trajectory lies within the volume of the approximated manifold we define a binary function
By counting the number of generated trajectories that lie on the manifold of the real data we obtain the precision, and similarly the recall by counting the number of real trajectories that lie on the manifold of the generated data
In our experiments, we use the original implementation provided by [kynkaanniemi2019improved] directly on the trajectories as opposed to their representations as suggested in the paper.
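The nearest-neighbour construction can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the idea, not the original code of [kynkaanniemi2019improved]: trajectories are assumed to be flattened into vectors, distances are Euclidean, and the neighbourhood size `k=3` is an arbitrary default.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between all rows of a (N, D) and b (K, D)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))


def kth_nn_radii(points, k):
    """Distance from each point to its k-th nearest neighbour in the same set."""
    d = pairwise_distances(points, points)
    # After sorting, column 0 is the point itself (distance 0).
    return np.sort(d, axis=1)[:, k]


def coverage(queries, reference, k=3):
    """Fraction of queries that fall inside at least one reference hypersphere."""
    radii = kth_nn_radii(reference, k)
    inside = pairwise_distances(queries, reference) <= radii[None, :]
    return float(np.mean(np.any(inside, axis=1)))


def precision_recall(real, generated, k=3):
    """Precision: generated samples on the real manifold. Recall: the reverse."""
    precision = coverage(generated, real, k)
    recall = coverage(real, generated, k)
    return precision, recall
```

Two limiting cases are useful as a check: identical sets give precision and recall of 1, and a generated set shifted far away from the real data gives 0 for both. The sketch builds full distance matrices, so it is only suitable for moderately sized sets.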
5.2.2 Disentangled representation
In this section, we define our measure for evaluating the disentanglement of latent action representations. Disentangled representations obtained through unsupervised representation learning have been shown to improve the learning efficiency of a large variety of downstream machine learning tasks [bengio2013representation, chen2016infogan, higgins2017beta]. We define a representation of the motor data obtained from the latent space of a generative model as disentangled if every end state of the system is controllable by one latent dimension. For example, consider a reaching task where the goal is to reach different points in the Cartesian space. A well-disentangled representation given by a generative model forces each axis of the position of the end-effector to be controlled by one dimension of the latent variable. Our hypothesis is that the more disentangled the representation is, the more efficient the policy training is.
Our measure is based on statistical testing performed on the end state space of the system. Let be the set of end states obtained by executing the training motor trajectories on a robotic platform. If representations given by are well disentangled, then setting one latent dimension to a fixed value should result in limited variation in the corresponding generated end states . For example, if the 1st latent dimension controls the axis position of the end-effector in the reaching task then setting it to a fixed value should limit the set of possible positions. In other words, we wish to quantify how dissimilar the set of end states , obtained by holding one latent dimension constant, is from the set . To compute such dissimilarity we use maximum mean discrepancy (MMD) [JMLR:v13:gretton12a], which is a statistical test for determining if two sets of samples were produced by the same distribution. Using kernels, MMD maps both sets into a feature space called reproducing kernel Hilbert space, and computes the distance between mean values of the samples in each group. In our implementations we compute the unbiased estimator of the squared MMD (Lemma 6 in [JMLR:v13:gretton12a]) given by
where , are the two sets of samples and is the exponential kernel with hyperparameter determining the smoothness. Due to the nature of the exponential kernel, the higher the MMD score, the lower the similarity between the two sets.
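A minimal sketch of the unbiased squared-MMD estimator is given below. The exact kernel used by the authors is not recoverable from the text, so the squared-exponential form `exp(-gamma * ||a - b||^2)` and the name `gamma` for the smoothness hyperparameter are assumptions. The estimator averages kernel values within each set (excluding each sample paired with itself) and subtracts twice the average cross-set kernel value.

```python
import numpy as np


def exp_kernel(a, b, gamma=1.0):
    """Assumed kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dist)


def mmd2_unbiased(x, y, gamma=1.0):
    """Unbiased estimate of the squared MMD between samples x (m, D) and y (n, D)."""
    m, n = len(x), len(y)
    k_xx = exp_kernel(x, x, gamma)
    k_yy = exp_kernel(y, y, gamma)
    k_xy = exp_kernel(x, y, gamma)
    # Within-set terms exclude the diagonal (i == j), which makes the
    # estimator unbiased.
    term_x = (np.sum(k_xx) - np.trace(k_xx)) / (m * (m - 1))
    term_y = (np.sum(k_yy) - np.trace(k_yy)) / (n * (n - 1))
    return term_x + term_y - 2.0 * np.mean(k_xy)
```

Because it is unbiased, the estimate can be slightly negative when the two sets come from the same distribution; large positive values indicate dissimilar sets.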
Our measure can be described in three phases. We provide an intuitive summary of each phase but refer the reader to Appendix A for a rigorous description of the algorithm. In phase 1 (Fig. 2 left) we generate the two sets of end states on which we run the statistical tests. For a fixed latent dimension we perform a series of latent interventions where we set the th component of a latent code to a fixed value for . Each intervention is performed on a set of samples drawn from the prior distribution . We denote by the set of end states obtained by executing the trajectories generated from the intervened latent samples. For example, latent interventions on the 1st latent dimension yield the sets . Moreover, we denote by the set of randomly subsampled end states that correspond to the training motor data.
In phase 2 (Fig. 2 right), we perform the tests on each pair of sets and obtained in phase 1. In particular, we wish to determine if an intervention on a given dimension induced a change in any of the components of the end state space. If such a change exists, we consider the latent dimension to be well disentangled. Moreover, if we can find a set of latent dimensions that induce changes on different components of the end states, we consider the generative model to be well disentangled. Therefore, for a fixed latent dimension and a fixed intervention , the objective is to find the component of the end state space for which and are most dissimilar. This translates to finding the component yielding the largest value where denotes the set of the th components of the states in . Note that if the dimension is entangled such a component does not exist (see Appendix A for details).
In phase 3 we aggregate the values of the performed tests and define the final disentanglement score for the generative model . In phase 2 we linked each latent dimension with zero or one end state component and computed the corresponding value of the test. Here, we first select pairs with the largest values. We define as the sum of the selected values, and as the number of unique end state space components present in the selected pairs normalised by the total number of components . Finally, we define the Disentanglement Score as a pair
Intuitively, quantifies the effect of latent interventions on the end states and can therefore be thought of as disentangling precision. On the other hand, measures how many different aspects of the end states are captured in the latent space and can be thus thought of as disentangling recall. The defined score is a novel fully unsupervised approximate measure of disentanglement for generative models combined with the RL policy training. Its absolute values can however vary depending on the kernel parameter determining its smoothness. Moreover, this measure is not to be confused with the precision and recall from Sec. 5.2.1 where the aim is to evaluate the quality of the generated samples as opposed to the quality of the latent representations.
5.2.3 Local linearity
The linearity of system dynamics plays a vital role in control theory but has not been studied in the context of generative model training. The system dynamics govern the evolution of the states as the result of applying a sequence of motor actions to the robot. Our hypothesis is that a generative model that, combined with the environment, satisfies local linearity performs better in the policy training.
Let the mapping represent the environment and let denote the state of the system after executing the actions . Let be the Euclidean neighbourhood of a latent action . Then the composition of the maps mapping from the action latent space to the end state of the system is considered linear in the neighbourhood of if there exists an affine transformation
(10)  
such that for every . We measure the local linearity of a generative model on a subset of latent actions by calculating the mean square error (MSE) of obtained on .
6 End-to-end Training of Perception and Control
The EM policy training algorithm presented in Sec. 4 updates the deep policy using the supervised learning objective function introduced in Eq. (6) (the M-step objective). Similar to guided policy search [levine2016end], the EM policy training formulation enables training of the perception and control parts of the deep policy together in an end-to-end fashion. In this section, we describe two techniques that can improve the efficiency of the end-to-end training.
6.1 Input remapping trick
The input remapping trick [levine2016end] can be applied to condition the variational policies on a low-dimensional compact state representation, , instead of the high-dimensional states given by the sensory observations, e.g., camera images. The policy training phase can be done in a controlled environment such that extra measurements other than the sensory observation of the system can be provided. These extra measurements can be, for example, the position of a target object on a tabletop. Therefore, the image observations can be paired with a compact task-specific state representation such that is used in the E-step for updating the variational policy and in the M-step for updating the policy .
6.2 Domain adaptation for perception training
The goal of this section is to exploit unlabeled images, captured without involving the robot, to improve the data-efficiency of the visuomotor policy training. Domain adaptation techniques, e.g., adversarial methods [chen2019adversarial], can improve the end-to-end training of visuomotor policies with limited robot data samples. The unlabeled task-specific images are exploited in the M-step to improve the generality of the visuomotor policy to manipulate novel task objects in cluttered backgrounds.
The M-step is updated to include an extra loss function to adapt data from the two different domains, (1) unlabeled images and (2) robot visuomotor data. The images must contain only one task object in a cluttered background, possibly different from the task object used by the robot during the policy training. Given images from the two domains, the basic idea is to extract visual features such that it is not possible to detect the source of the features. More details of the method can be found in our recent work [chen2019adversarial].
7 Experiments
In this section, we present our experimental results aimed at answering the following questions:

(1) How does the proposed approach to divide the visuomotor policy training into several more manageable downstream tasks perform for end-to-end training of visuomotor policies?

(2) Which of the characteristics precision and recall, disentanglement and local linearity of a generative model affect the policy training the most?

(3) Does multi-agent training improve the efficiency of the policy training? What is the optimal number of agents (i.e., variational policies) to speed up the training?
We answer (1) by training a deep visuomotor policy for a picking task on a real robotic platform. We demonstrate that a deep visuomotor policy can be trained after a few hundred trials that can be safely executed on a physical robot in less than 30 minutes. In this case, the visuomotor policy maps raw image pixel values to a complete sequence of motor actions to pick an object at different positions and orientations on a tabletop. By providing the results from our earlier study [chen2019adversarial], we motivate the use of the introduced adversarial domain adaptation algorithm to improve the generality of the obtained deep visuomotor policy. The policy is general in the sense that it can manipulate novel task objects in a cluttered environment, a situation that has never been encountered during the policy training phase. We emphasize that our approach cannot be directly compared with either PPO, since PPO requires vast amounts of data, or GPS, since GPS requires a reward at every time step instead of only a terminal reward as in our case.
We answer (2) by training several VAE and InfoGAN generative models with various hyperparameters and evaluating them using the measures introduced in Sec. 5.2. For each model, we train several RL policies based on the proposed EM algorithm and investigate how the performance is related to the discussed properties of the generative models.
Finally, we answer (3) by modifying the E-step of the EM algorithm to optimize for different numbers of variational policies. We choose a generative model that performed well in the single policy training task and evaluate the performance of the final policy training when deploying several variational policies. As a result, we can identify the number of variational policies that yields the best final policy training performance.
7.1 Experimental setup
We apply the framework to a picking task in which a 7 degree-of-freedom robotic arm (ABB YuMi) must move its end-effector to different positions and orientations on a tabletop to pick a randomly placed object. This task is a suitable benchmark to answer our research questions. First of all, the task requires feedforward control of the arm over 79 timesteps to precisely reach a target position without any position feedback during the execution. Therefore, precision is an important factor for this problem setup. Secondly, reaching a wide range of positions and orientations on the tabletop requires the generative model to generate all possible combinations of motor commands that bring the end-effector to every possible target position and orientation. This means that needs to have a high recall. Thirdly, it is straightforward to evaluate the disentanglement of the latent representations as well as the local linearity of the dynamical system that is formed by and the robot kinematic model. Finally, this is a suitable task for end-to-end training, especially by exploiting the introduced adversarial domain adaptation technique to obtain generality for the policy training task.
Note that the applicability of our framework (up to minor differences) to a wide range of robotic problems has already been addressed by our prior work. In particular, we successfully evaluated it in several robotic task domains, e.g., ball throwing to visual targets [ghadirzadeh2017deep], shooting hockey pucks using a hockey stick [arndt2019meta], pouring into different mugs [hamalainen2019affordance], picking objects [chen2019adversarial], and imitating human greeting gestures [butepage2019imitating].
7.2 Generative model training
We construct a dataset containing sequences of motor actions using the default MoveIt planner. We collected joint velocity trajectories that move the end-effector of the robot from a home position to different positions and orientations on a tabletop. The trajectories were sampled at Hz and brought to a common length of 79 timesteps ( seconds duration) by appending zeros to the joint velocity trajectories that are shorter than 79 timesteps. The target positions and orientations consisting of Euler angles form a dimensional end state space, and were sampled uniformly to cover an area of cm rad.
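The zero-padding step described above can be sketched as follows. This is a minimal illustration rather than our data pipeline: the function name `pad_trajectory` is hypothetical, and the array layout (timesteps along the first axis, joints along the second) and the final flattening into a single vector are assumptions.

```python
import numpy as np

def pad_trajectory(joint_velocities, num_steps=79):
    """Append zero velocities so that a trajectory has exactly num_steps rows.

    joint_velocities has shape (timesteps, num_joints), assumed layout.
    """
    traj = np.asarray(joint_velocities, dtype=float)
    if traj.shape[0] > num_steps:
        raise ValueError("trajectory is longer than the target length")
    padded = np.zeros((num_steps, traj.shape[1]))
    padded[: traj.shape[0]] = traj
    return padded

# Usage: a short 7-joint trajectory is padded and flattened into one vector.
short_trajectory = np.ones((10, 7))
padded = pad_trajectory(short_trajectory)
flat = padded.reshape(-1)
```

Zero velocities at the end keep the arm at rest once the planned motion has finished, so the padded trajectory reaches the same end state as the original one.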
The generative models are neural networks mapping a low-dimensional action latent variable into a dimensional vector representing motor actions and timesteps. The input size of the networks is the same as the dimension of the latent space . We refer the reader to Appendix B for the exact architecture as well as all the training details. The prior distribution is assumed to be a standard normal distribution in the case of the VAEs, and a uniform distribution in the case of the InfoGAN models.
In total, we trained VAE models and InfoGAN models. The VAEs were trained with different parameters and latent sizes . The parameters of each model together with the values of both the KL divergence term and the reconstruction term (right and left term in Eq. (7), respectively) are summarised in Table 1. At the beginning of the training we set and gradually increase its value until the KL loss drops below a predetermined threshold set to and . The resulting value is reported in Table 1 and kept fixed until the end of the training.
Table 2 summarizes the training parameters and loss function values of the InfoGAN models. We trained the models with latent sizes and (right term in Eq. (9)). We report the total model loss (Eq. (9)), the generator loss (middle term in Eq. (9)) and the mutual information loss (right term in Eq. (9)) obtained on the last epoch.
index  latent size  KL loss  reconst. loss  

VAE1  2  
VAE2  2  
VAE3  2  
VAE4  3  
VAE5  3  
VAE6  3  
VAE7  6  
VAE8  6  
VAE9  6 
Model  latent size  Gloss  Iloss  total loss  

GAN1  2  
GAN2  2  
GAN3  2  
GAN4  3  
GAN5  3  
GAN6  3  
GAN7  6  
GAN8  6  
GAN9  6 
7.3 Evaluating the generative models
We evaluated all the generative models based on precision and recall, disentanglement and local linearity measures described in Section 5.2.
Precision and recall For each generative model we randomly sampled samples from the latent prior distribution . The corresponding generated trajectories were compared with randomly chosen training trajectories which were sampled only once and fixed for all the models. The neighbourhood size was set to as suggested in [kynkaanniemi2019improved]. The resulting precision and recall scores are shown in Figure 3.
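The k-nearest-neighbour construction behind the precision and recall measure of [kynkaanniemi2019improved] can be sketched as below. This is a simplified illustration operating directly on trajectory vectors; the function names are hypothetical, and the neighbourhood size `k` is left as a required argument rather than fixed to the value used in our experiments.

```python
import numpy as np

def pairwise_dists(a, b):
    """Euclidean distances between the rows of a and the rows of b."""
    sq = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.sqrt(np.maximum(sq, 0.0))

def knn_radii(points, k):
    """Distance from every point to its k-th nearest neighbour in the same set."""
    dists = np.sort(pairwise_dists(points, points), axis=1)
    # Column 0 is the (near-zero) distance of each point to itself.
    return dists[:, k]

def coverage(queries, reference, k):
    """Fraction of queries lying inside at least one k-NN ball of the reference set."""
    radii = knn_radii(reference, k)
    dists = pairwise_dists(queries, reference)
    return float(np.mean(np.any(dists <= radii[None, :], axis=1)))

def precision_recall(real, generated, k):
    """Precision: generated samples covered by the real set; recall: the converse."""
    precision = coverage(generated, real, k)
    recall = coverage(real, generated, k)
    return precision, recall
```

Both inputs are arrays of shape (num_samples, dim), and `k` must be smaller than the number of samples in each set.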
Disentanglement We measured the disentanglement using with the kernel parameter . For a given model we performed interventions on every latent dimension . The size of the sets and containing the end states was set to . For complete details we refer the reader to Appendix A. The obtained disentanglement scores are visualised in Figure 4.
Local linearity Given a generative model we randomly sampled latent actions from the prior . For each such latent action we set and sampled points from its neighbourhood . We then fit an affine transformation (Eq. (10)) on randomly selected neighbourhood points and calculated test MSE on the remaining points. We report the average test MSE obtained across all points in Figure 5.
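The local linearity evaluation for a single latent action can be sketched as follows. This is a minimal sketch under stated assumptions: `end_state_fn` is a hypothetical stand-in for the composition of the generative model and the environment, the neighbourhood is sampled uniformly from a Euclidean ball, and the default values of `radius`-independent parameters (`num_points`, `num_train`) are illustrative rather than the settings used in our experiments.

```python
import numpy as np

def local_linearity_mse(end_state_fn, z0, radius, num_points=100, num_train=70, seed=0):
    """Fit an affine map on part of a neighbourhood of z0 and return the test MSE."""
    rng = np.random.default_rng(seed)
    z0 = np.asarray(z0, dtype=float)
    dim = z0.size
    # Sample points uniformly from the Euclidean ball of the given radius around z0.
    directions = rng.normal(size=(num_points, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = radius * rng.uniform(size=(num_points, 1)) ** (1.0 / dim)
    z = z0 + directions * radii
    # End states reached from each sampled latent action.
    states = np.stack([np.asarray(end_state_fn(p), dtype=float) for p in z])
    # Append a column of ones so that least squares also fits the offset term.
    z_aug = np.hstack([z, np.ones((num_points, 1))])
    coeffs, *_ = np.linalg.lstsq(z_aug[:num_train], states[:num_train], rcond=None)
    predictions = z_aug[num_train:] @ coeffs
    return float(np.mean((predictions - states[num_train:]) ** 2))
```

Averaging the returned value over many sampled latent actions gives a single score per generative model; an exactly affine mapping yields a test MSE close to zero.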
EM policy training For each generative model, introduced in Table 1 and Table 2, we trained one policy with three different random seeds. The average training performance for each of the models is provided in Fig. 6.
Correlation between evaluation metrics and EM policy training
We labeled each generative model with the maximum as well as the average reward achieved during the EM policy training across all three random seeds. We used the automatic relevance determination (ARD) framework to determine the influence of each of the evaluation measures presented in Section 5.2 on the final EM policy training. In particular, we fit an ARD model on the evaluation results obtained in Section 7.3 and report the resulting regression weights in Table 3. We first note that the negative coefficient for the precision metric is a consequence of the experimental results as the generative models in general have a larger precision than recall (see Figure 3). In the future, this can be avoided by more thorough fine-tuning.
Next, we see that for achieving maximum reward, the most important property of generative models is recall followed by precision and then the disentangling precision . The disentangling recall as well as the test MSE have in this case a low influence on the policy training. On the contrary, for the average achieved reward, we observe more balanced weights for the pair of precision and recall metrics as well as for the pair of disentangling precision and disentangling recall. Moreover, we see that local linearity can be disregarded.
We conclude that for successful policy training in the case of the picking task, the generative model should be trained to have high precision and high recall. Moreover, the results imply that the performance of is not affected by the structure of the latent space measured by disentanglement and local linearity. However, we note that the latent structure can be essential when performing tasks represented by more complex data, for which structured latent representations would be beneficial.
DiP  DiR  Precision  Recall  Test MSE  

max reward  
mean reward 
7.4 Multi-agent learning
Fig. 7 demonstrates the performance of the EM policy training algorithm for different numbers of variational policies in the E-step. As shown, using more than one variational policy in the E-step improves the performance at the beginning of the training. In particular, the RL agent learns faster with more than 4 variational policies for the first 100 iterations. However, the best final performance is still achieved by the EM policy training with one variational policy. This observation requires further investigation, which is part of our future work.
8 Conclusion
We presented an RL framework that, combined with generative models, trains deep visuomotor policies in a data-efficient manner. The generative models are integrated into the RL optimization by introducing a latent variable that is a low-dimensional representation of motor actions. Using the latent action variable , we divided the optimization of the parameters of the deep visuomotor policy into two parts: optimizing the parameters of a generative model that generates valid sequences of motor actions, and optimizing the parameters of a sub-policy , where . The sub-policy parameters are found using the EM algorithm, while generative model parameters are trained unsupervised to optimize the objective corresponding to the chosen generative model. In summary, the complete framework consists of three data-efficient downstream tasks: (1) training the generative model , (2) training the sub-policy , and (3) supervised end-to-end training of the deep visuomotor policy .
Moreover, we provided a set of measures for evaluating the quality of the generative models regulated by the RL policy search algorithms such that we can predict the performance of the deep policy training prior to the actual training. In particular, we defined two new measures, disentanglement and local linearity, that evaluate the quality of the latent space of the generative model , and complemented them with the precision and recall measure [kynkaanniemi2019improved], which evaluates the quality of the generated samples. We experimentally demonstrated the predictive power of these measures on a picking task using a set of different VAE and GAN generative models.
Acknowledgments
This work was supported by the Knut and Alice Wallenberg Foundation, the EU through the project EnTimeMent, the Swedish Foundation for Strategic Research through the COIN project, and the Academy of Finland through the DEEPEN project.
References
Appendix A Disentanglement Score
Let be a fixed latent dimension. In phase , we perform interventions on for . Interventions are chosen from the set of equidistant points on the interval such that . The value is chosen such that is in the support of the prior distribution . Since is a standard normal distribution in the case of VAEs and a uniform distribution on the interval in the case of GANs, we set to be and in the case of the VAEs and GANs, respectively. Each intervention is performed on samples from the prior distribution and yields a set of end states denoted by . For each intervention we additionally randomly sample a set of end states corresponding to the training motor data. Note that elements of both and are dimensional with being the dimension of the end state space.
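The construction of the intervened latent samples can be sketched as follows. The helper names are hypothetical, the intervention grid is assumed to be symmetric around zero, and the prior sampler is passed in as a callable so that the sketch does not commit to a particular prior or interval.

```python
import numpy as np

def intervention_grid(bound, num_interventions):
    """Equidistant intervention values on the (assumed symmetric) interval [-bound, bound]."""
    return np.linspace(-bound, bound, num_interventions)

def intervened_latents(sample_prior, dim, value, num_samples, seed=0):
    """Draw latent codes from the prior and fix one latent dimension to a given value.

    sample_prior(rng, n) is a hypothetical callable returning an (n, latent_size) array.
    """
    rng = np.random.default_rng(seed)
    z = np.array(sample_prior(rng, num_samples), dtype=float)
    z[:, dim] = value
    return z

# Usage with a standard normal prior over a 3-dimensional latent space.
normal_prior = lambda rng, n: rng.standard_normal((n, 3))
grid = intervention_grid(1.5, 5)
z_fixed = intervened_latents(normal_prior, dim=1, value=grid[0], num_samples=50)
```

Decoding each row of `z_fixed` with the generative model and executing the resulting trajectory would then yield the set of end states associated with this intervention.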
In phase , we calculate the for fixed intervention and every end state component . In order to determine if the difference between the sets and is significant, we first perform a permutation test to determine the critical value where denotes the significance level. We say that the intervention was significant for an end state component if . The calculations in phase were repeated times with a resampled set of training end states . In all our experiments we set and perform the permutation test times with .
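The permutation test used to obtain the critical value can be sketched generically as follows. The sketch accepts any two-sample statistic as a callable, so the MMD estimator can be plugged in; the function name and the default values of `num_permutations` and `significance` are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

def permutation_critical_value(statistic, x, y, num_permutations=100,
                               significance=0.05, seed=0):
    """Estimate the (1 - significance) quantile of a statistic under the null hypothesis.

    The null distribution is approximated by repeatedly shuffling the pooled
    samples and splitting them into two groups of the original sizes.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y], axis=0)
    m = len(x)
    null_stats = np.empty(num_permutations)
    for i in range(num_permutations):
        perm = rng.permutation(len(pooled))
        null_stats[i] = statistic(pooled[perm[:m]], pooled[perm[m:]])
    return float(np.quantile(null_stats, 1.0 - significance))

# Usage with a simple stand-in statistic (absolute difference of means).
mean_gap = lambda a, b: float(abs(a.mean() - b.mean()))
data_rng = np.random.default_rng(1)
set_a = data_rng.normal(0.0, 1.0, size=(100, 1))
set_b = data_rng.normal(3.0, 1.0, size=(100, 1))
critical = permutation_critical_value(mean_gap, set_a, set_b)
is_significant = mean_gap(set_a, set_b) > critical
```

An intervention is then treated as significant for a given end state component when the observed statistic exceeds the estimated critical value.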
Therefore, phases and yield functions and defined by:
where denotes the average score calculated on a subset of performed interventions that were significant. For the sake of simplicity we omit the explicit reference to the model from the notations and simply write and .
In phase we define the final disentanglement score for the generative model using the functions and . Let be a subset of containing its largest three elements, and let be the set of the corresponding end state components. We define the Disentanglement Score as a pair
(11) 
where denotes the subset of unique elements of the set .
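The aggregation in the last phase can be sketched as follows, using the names DiP and DiR for the disentangling precision and recall as in Table 3. The input format, a list of (latent dimension, linked end state component or `None`, test value) tuples, is an assumption made for this sketch.

```python
def disentanglement_score(links, num_components, top_k=3):
    """Aggregate per-dimension test results into the pair (DiP, DiR).

    links: list of (latent_dim, component, value) tuples, where component is
    None if the latent dimension was not linked to any end state component.
    """
    linked = [(component, value) for _, component, value in links
              if component is not None]
    # Keep the pairs with the largest test values.
    selected = sorted(linked, key=lambda pair: pair[1], reverse=True)[:top_k]
    dip = sum(value for _, value in selected)
    # Unique end state components among the selected pairs, normalised.
    dir_score = len({component for component, _ in selected}) / num_components
    return dip, dir_score

# Usage: two latent dimensions affect component 0, one affects component 2.
example_links = [(0, 0, 0.5), (1, 0, 0.4), (2, None, 0.0), (3, 2, 0.1)]
dip, dir_score = disentanglement_score(example_links, num_components=3)
```

In this example two of the three end state components are covered by the selected pairs, so DiR is 2/3, while DiP is the sum of the three selected test values.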
Appendix B Generative models
B.1 Variational Autoencoder
The architecture of the decoder neural network is visualised in Table 5. The encoder neural network is symmetric to the generator with two output linear layers of size representing the mean and the log standard deviation of the approximate posterior distribution. All the models were trained for epochs with the learning rate fixed to .
B.2 InfoGAN
The architectures of the generator, discriminator and Q neural network parametrizing are summarised in Tables 5 and 5. All the models were trained for epochs with the learning rates of the optimizers for the generator and discriminator networks fixed to .
Linear(, ) + BatchNorm + ReLU 

Linear(, ) + BatchNorm + ReLU 
Linear(, ) + BatchNorm + ReLU 
Linear(, ) 
Shared layers  Linear(, ) + ReLU 

Linear(, ) + ReLU  
discriminator  Linear(, ) + Sigmoid 
Qnet  Linear(, ) 
Linear(, ) 