Recently, there has been considerable interest in solving difficult robotics problems using learning. Learning-based approaches offer advantages in flexibility and adaptability, and don’t require the engineer to supply much knowledge about the behavior or structure of the system. Reinforcement learning (RL), a common approach here, solves a problem from the ground up, guided only by a reward function and experience interacting with the world, with no prior knowledge of how the world works. Recent advances in reinforcement learning have led to significant success on challenging problems, but usually only in settings where it is possible to leverage enormous data collection capabilities, either with large numbers of physical robots or with simulated environments. A lack of world knowledge makes longer horizons increasingly difficult for RL algorithms, so we are interested in ways to incorporate knowledge from other sources to improve the performance of our RL learners.
Substantial world knowledge is available in the form of existing planning techniques for robots that are able to reason directly about cause and effect relationships in manipulating the world. In many cases they may not be able to provide fully optimized solutions to planning problems, but they can offer a significant prior base of knowledge about what useful motions in an environment look like. We allow them to express this knowledge by providing a set of demonstration trajectories that show useful, goal directed actions in our environment. In this work we aim to make this knowledge available to an RL or other planning agent in the form of an embedding of the state space of the robot and environment such that the geometry of the embedding space has a desirable relationship to the demonstration trajectories.
The idea of our embedding objective is that if we take a demonstrated trajectory, the path of that trajectory in the embedding space should be linear and constant velocity. Our system pursues this objective on short overlapping segments of the demonstrations, thus driving the full embedded demonstrations towards linearity as well. In this objective, an ideal representation would be one that creates a metric space for our system, with distances representing actual manipulation distance in our system. It is also desirable to do this embedding in a variational framework as this will allow us to avoid the difficult problem of defining a distance metric that makes sense for our raw state space.
Although these learned embedding spaces are likely to be imperfect against our objective, we hypothesize that they provide a useful intermediate representation for RL algorithms that attempt to learn a policy function. The mechanism here could be a reduction in the learned function distance from our embedding to an ideal target function or smoothing of the reward surface enabling more reliable convergence during training. We test this empirically by training embedding networks for sample environments and then using standard RL algorithms to learn behaviors in these environments. We compare the training performance of using our state embeddings against learning from the unmodified state observations and see significant improvements in training performance and reliability.
In our experiments we also observe that, for some problems, augmenting the observed state with our embedding significantly reduces the variance in training performance that results from changes in the initial random seed. Seed variance is a significant confounding factor both in practical RL and in RL research.
II Related Work
Our optimization objective is inspired by work on word embeddings for language models, where words are mapped to vectors in a latent space. Canonical techniques in this space use sequence data from natural language and set an optimization objective that words that occur sequentially in language should have embeddings that are nearly co-linear in the latent space. We treat the sequence data from a robot trajectory in a similar way, and develop a new embedding formulation to meet the needs of our robot state embeddings, which we discuss further in Section III.
Robot learning from demonstrations exists in multiple other forms. Inverse reinforcement learning supposes that the demonstrations are drawn from a distribution optimizing some reward function, and attempts to learn that reward function. A very similar problem form, known as imitation learning or behavior cloning, aims to reproduce the behavior of the demonstrator without explicitly learning the underlying reward function. Generative adversarial imitation learning (GAIL) combines these ideas with inspiration from generative adversarial networks by training a discriminative cost function simultaneously with an RL agent. Our approach sits slightly outside these techniques: while we assume the demonstrator has some form of optimality in movement costs, we do not attempt to learn the demonstrator’s reward function per se, and indeed in our RL step we may drop in a reward function of our choice.
Embedding spaces have been applied before in reinforcement learning, for example as action space embeddings that support skill transfer between tasks. State embeddings, while less common in RL, have been used in robotics applications. Ichter and Pavone train embedded state spaces with predictable dynamics in the latent space, enabling sampling-based planning techniques to function directly in the latent space. An important point of comparison for our work is the embed-to-control algorithm. This algorithm uses a variational model to learn state embeddings supporting the objective of having locally linear state transition dynamics in the embedding space. Our approach also takes the high-level form of a variational auto-encoder, with some specific structure in the distribution of states and embedded states. Although the embed-to-control structure and objective are different from our own, that work uses very similar mathematical tools for resolving underlying difficulties and learning the embedding.
III Plan-Space State Embedding
In this section we will describe our formulation for learning a plan-space state embedding. This embedding space is built on the concept of constraining demonstrated trajectories to be linear in the embedding space. We present an initial direct formulation of this objective, then show why optimizing that objective may be problematic, and fix those issues with a more sophisticated variational model.
III-A Trajectory Linearity Model
We wish to learn an encoding of the raw state space such that locality in the embedding implies locality in manipulation or control distance. To do this we take advantage of a dataset of demonstration trajectories believed to be efficient, or ideally near optimal. By the optimal substructure property, snippets of these trajectories are then also efficient for their respective sub-goals (the final position in the snippet). To satisfy our embedding objective, we attempt to constrain the demonstrated trajectories to form straight lines through our embedding space. We form a loss function by sampling evenly spaced triplets of states along a demonstrated trajectory and rewarding the middle point for having an embedding close to the average of the embeddings of the edge points.
We define our robot state as , and we define our embedding space as . The dimensionality of the embedding space will be a hyperparameter, and can in general be greater or less than the robot state dimensionality. We consider a trajectory to be a vector of states such as, and sampled triplets from trajectories are notated as , although in general the states may be more than a single time step apart from each other. We also define the trajectory distribution of our training data as .
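The linearity objective described above can be illustrated with a minimal sketch. It assumes the embeddings of each triplet are already available as arrays; the function name `midpoint_linearity_loss` and the choice of a squared Euclidean penalty are illustrative assumptions, not the exact loss used in this work.

```python
import numpy as np

def midpoint_linearity_loss(z0, z1, z2):
    """Mean squared distance between each middle embedding and the
    average of its two endpoint embeddings (hypothetical helper).

    z0, z1, z2: arrays of shape (num_triplets, embed_dim).
    """
    z0, z1, z2 = (np.asarray(z, dtype=float) for z in (z0, z1, z2))
    midpoint = 0.5 * (z0 + z2)
    return float(np.mean(np.sum((z1 - midpoint) ** 2, axis=-1)))

# A straight, constant-velocity embedded path incurs (numerically) no loss.
t = np.linspace(0.0, 1.0, 5)[:, None]
line = t * np.array([[2.0, -1.0, 0.5]])
straight_loss = midpoint_linearity_loss(line[:-2], line[1:-1], line[2:])
```

A path that bends between its endpoints yields a strictly positive value, which is what pushes embedded demonstrations towards straight lines.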
III-C Direct State Encoding
A simple and direct form of this problem would be to use a modified auto-encoder structure. We define two parametric functions and , for encoding and decoding respectively. We could then define our loss function as
This defines our loss in terms of the -norm of the difference in state values, however it is not clear that this is a reasonable distance metric in the robot state space, particularly for spaces that include velocities or higher order inertial states. We could make this more general by allowing an arbitrary distance metric in the state space as
Unfortunately, in this formulation becomes a powerful hyperparameter without any obviously good choices. In another sense, if we had access to a good distance metric on our space, we would already have a solution to the majority of our problem. In the next section, we propose a probabilistic approach that can be optimized without the need for a distance metric in the robot state space.
III-D Variational State Encoding
Instead of modeling our embedding function as a deterministic function of the state , we make the embedding probabilistic, so . In this way we will now be able to more flexibly model uncertainties in paths, as well as no longer having to specify a distance metric in the state space .
We assume that our encoded states are drawn from a prior distribution and robot states are conditioned on the latent states as . When considering a trajectory triplet we say that states for respectively. However, while , is defined differently as . We create an approximation of the encoding distribution . We also create an approximate inverse decoding distribution .
We see that when we define and our problem takes the form of a variational auto-encoder , where we wish to maximize the marginal log likelihood of the data:
We choose to restrict andand as a function of and for and
respectively. In this form we can use the generic stochastic gradient variational bayes (SGVB) estimator from. If we further approximate as
then we can analytically calculate the KL-divergence and use the lower variance form of the SGVB estimator. Unfortunately, in this case we have unbound a key constraint in our system, so we add an additional KL-divergence term between and , modifying the lower bound as follows:
where λ is a training hyperparameter. These calculations are visualized in the information flow diagram in Figure 2. Note that in this diagram, values are shown where we internally track the distribution of the respective ’s, and these distributions are only sampled when necessary (shown as dotted lines).
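As a reference for the analytic KL-divergence mentioned above, the following sketch gives the standard closed form for two Gaussians with diagonal covariance. The function name and argument layout are assumptions made for illustration; which pair of distributions is compared, and how the term is weighted, follow the text rather than this snippet.

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, diag(var_p)) || N(mu_q, diag(var_q)) ) in nats.

    All arguments are 1-D arrays of equal length; variances must be positive.
    """
    mu_p, var_p, mu_q, var_q = (
        np.asarray(a, dtype=float) for a in (mu_p, var_p, mu_q, var_q)
    )
    per_dim = np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0
    return float(0.5 * np.sum(per_dim))
```

Because the expression is differentiable in all four arguments, it can be used directly as a penalty term without sampling.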
We note that is a distribution over and we will refer to values sampled from this distribution as . Because we have assumed that our ’s are drawn from a normal distribution, we can analytically calculate the distribution where is constrained to be the mean value of and . Our embedding distribution gives us
where and are vectors in the dimensionality of our embedding space. The distribution of is also a normal random variable as follows:
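For reference, the standard result for the average of two Gaussian variables can be sketched as follows. This assumes the two endpoint embeddings are independent and have diagonal covariance, which is an assumption of the sketch; the function name `midpoint_gaussian` is hypothetical.

```python
import numpy as np

def midpoint_gaussian(mu0, var0, mu2, var2):
    """Mean and per-dimension variance of (z0 + z2) / 2 for independent
    z0 ~ N(mu0, diag(var0)) and z2 ~ N(mu2, diag(var2))."""
    mu0, var0, mu2, var2 = (
        np.asarray(a, dtype=float) for a in (mu0, var0, mu2, var2)
    )
    mean = 0.5 * (mu0 + mu2)
    # Var[(a + b) / 2] = (Var[a] + Var[b]) / 4 for independent a and b.
    variance = 0.25 * (var0 + var2)
    return mean, variance
```

When both endpoints share the same variance, the midpoint variance is half of it.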
While we optimize the modified lower bound above, we will track performance of the embedding against a metric designed to measure whether the embedding space is creating the geometry we desire. We look at distances in the embedding space compared with the trajectory distance between points along a trajectory. To do this analysis we have to account for the fact that there could be arbitrary scale differences between distances in the two spaces, so we have to find a scale factor to match the embedding space distances to the trajectory space distances. We calculate the distance integral along each of our demonstration trajectories denoted and with a current snapshot of the embedding function calculate embedded distances where is the mean value from . For consistency we normalize the trajectory lengths by dividing them by the mean trajectory length.
We then calculate the best matching scale factor by minimizing the following error term:
which can be calculated analytically as:
We can then look at the error between the scaled embedded distances and the demonstrated path distances. During training of the state embedding we track the mean and standard deviation of the absolute values of these errors on our training dataset.
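The scale matching step can be sketched as below, assuming the error term is a sum of squared differences between scaled embedded distances and normalized trajectory distances. That squared-error form, and the names `best_scale` and `scaled_errors`, are assumptions for illustration. Under this assumption the minimizing scale has a closed form.

```python
import numpy as np

def best_scale(embedded_dist, traj_dist):
    """Scale a minimizing sum_i (a * e_i - d_i)^2, i.e. a = <e, d> / <e, e>."""
    e = np.asarray(embedded_dist, dtype=float)
    d = np.asarray(traj_dist, dtype=float)
    return float(np.dot(e, d) / np.dot(e, e))

def scaled_errors(embedded_dist, traj_dist):
    """Return (scale, mean |error|, std |error|) after normalizing the
    trajectory lengths by their mean, as described in the text."""
    e = np.asarray(embedded_dist, dtype=float)
    d = np.asarray(traj_dist, dtype=float)
    d = d / d.mean()
    scale = best_scale(e, d)
    abs_err = np.abs(scale * e - d)
    return scale, float(abs_err.mean()), float(abs_err.std())
```

If embedded distances are exactly proportional to trajectory distances, both the mean and the standard deviation of the absolute errors are zero, which is the ideal case for the desired geometry.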
IV-A Kinodynamic Planner Demonstrations
Our embedding algorithm relies on a dataset of trajectory demonstrations; however, it is agnostic to the source of these trajectories. As such, we might use any available planning algorithm, expert system, or even human demonstrations to gather this data. In this work we use the SST* algorithm implemented in the OMPL motion planning library. Our robot environments are specified in and simulated by the MuJoCo robot simulator.
To produce the necessary trajectory demonstrations for our algorithm, we have built an interface layer between MuJoCo and OMPL. Our interface allows us to use MuJoCo in the inner loop of OMPL algorithms as the state propagator, enabling OMPL’s kinodynamic planning algorithms to function on a wide variety of spaces specified for MuJoCo with minimal additional problem specification. We are sharing this interface code for academic use at https://github.com/mpflueger/mujoco-ompl.
IV-B Variational State Embedding
We implemented the variational form of our state encoder and are sharing our source code at https://github.com/mpflueger/plan-space-state-embedding. Our implementation maximizes the lower variance form of the SGVB estimator using the Adam algorithm.
For the experiments in this paper we parameterized the encoder and decoder networks as fully connected nets with 2 hidden layers of 64 units per layer. The networks produce two sets of output for and the diagonal terms of the covariance matrix . The units have linear output. In order to avoid numerical issues we define our to output standard deviations rather than variances, and bound outputs to the range by using the activation function:
We also observe that the distribution of is systematically smaller than the distributions of otherwise predicted by as a result of it being distributed as the average of and . This could create a systematic error in learning because in equation 7 we place a cost on the distributional match of to , despite them having the same underlying parameters with ’s drawn from the same distribution. By looking at equation 9 we can see that the variance of is expected to be half that of the variance of , yet we are constraining them to come from the same distribution. We correct this issue by doubling the variance of to match the distribution from . The hyperparameter λ from equation 7 was set to 0.5.
Before training begins the dataset of demo trajectories is transformed into a set of state triplets. The triplets have even and varied spacing along the trajectories, we used step differences of .
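The conversion of a demonstration into evenly spaced triplets might look like the following sketch. The function name `make_triplets` and the example step values are hypothetical, since the step differences actually used are not reproduced here.

```python
import numpy as np

def make_triplets(trajectory, steps):
    """Collect (start, middle, end) state triplets with equal index spacing.

    trajectory: array of shape (T, state_dim).
    steps: iterable of positive integer spacings k; each triplet uses
           indices (i, i + k, i + 2k).
    Returns an array of shape (num_triplets, 3, state_dim).
    """
    traj = np.asarray(trajectory, dtype=float)
    triplets = []
    for k in steps:
        for i in range(len(traj) - 2 * k):
            triplets.append((traj[i], traj[i + k], traj[i + 2 * k]))
    if not triplets:
        # Trajectory too short for every requested spacing.
        return np.empty((0, 3) + traj.shape[1:])
    return np.array(triplets)

# Example with illustrative spacings of 1 and 2 steps.
demo = np.arange(12, dtype=float).reshape(6, 2)
triplets = make_triplets(demo, steps=(1, 2))
```

Using several spacings exposes the embedding to both short and longer segments of each demonstration.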
IV-C RL in Embedding Space
To evaluate how using this embedding space affects the performance of reinforcement learning (RL) algorithms, we have chosen a couple of benchmark RL environments where we can modify the observed state space sent to the RL algorithm while keeping other aspects of the system the same. To facilitate this we used implementations of trust region policy optimization (TRPO) and proximal policy optimization (PPO) provided by the garage library, and created custom environment wrappers using the pre-trained state encoders from our variational state embedding.
Our environment wrappers always return the mean value as the embedded state. Additionally, we can look at the effect of completely replacing the normal state space, or augmenting it by appending our embedded state vector onto the existing state vector.
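The two observation modes can be sketched independently of any particular RL library. The class below is a hypothetical stand-in for the environment wrappers: `encode_mean` is assumed to be a callable returning the mean of the embedding distribution for a given state, and `state_slice` (also an assumption) selects which observation entries are passed to the encoder.

```python
import numpy as np

class EmbeddedObservation:
    """Maps a raw observation to an augmented or replaced observation."""

    def __init__(self, encode_mean, state_slice, mode="augment"):
        if mode not in ("augment", "replace"):
            raise ValueError("mode must be 'augment' or 'replace'")
        self.encode_mean = encode_mean  # pre-trained encoder (assumed callable)
        self.state_slice = state_slice  # entries of obs fed to the encoder
        self.mode = mode

    def __call__(self, obs):
        obs = np.asarray(obs, dtype=float)
        z = np.asarray(self.encode_mean(obs[self.state_slice]), dtype=float)
        if self.mode == "replace":
            return z
        # Augment: append the embedded state to the existing observation.
        return np.concatenate([obs, z])
```

In practice the transform would be applied to every observation returned by the environment's reset and step calls, leaving rewards and dynamics unchanged.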
In our experiments we used two primary environments, each with two variants. Our environments are based on standard environments from the OpenAI Gym project, however in each case we have also created modified environments to make the tasks more challenging.
The cartpole environment has a cart that can move along a 1 meter track and has an uncontrolled spinning pole attached to it. The cartpole environment has a 4 dimensional state, including the position and velocity of both the cart and pole. Control of the cart along the track is done by choosing an applied force (continuous value), with an upper limit on the available force. Two RL tasks are defined for the cartpole environment. In the first environment the cart starts near the center of the track with the pole nearly upright and reward is given for keeping the pole upright as long as possible. The second, much more difficult task, starts the cart and pole at random locations with small random velocities and reward is given for swinging the pole into an upright position and keeping it there. We refer to this variant as the cartpole swingup task.
The reacher environment consists of a 2-link robot arm operating in a plane, with an objective to move the tip of the second link to a particular position. The target position is encoded as part of the state space of the environment as the desired position of the end point of the robot arm. The robot is controlled by choosing bounded joint torques. Our two reacher variants differ in the observation space presented to the policy function. The initial version from Gym observes sine and cosine of the joint angles, which we refer to as reacher trig. We also created a version of the task that observes the joint angles directly as radians instead (reacher raw). Both versions observe joint velocities. In both cases, the state input to our embedding function is the raw joint angles and joint velocities (a 4 dimensional space). Note here that our state embedding does not observe the target location, so although we include some performance curves of the embedding replacing the normal observation space, the target is not observable in these cases and goal directed behavior is not possible.
V-B Embedding Performance
In Figure 4 we show training curves for our plan-space state embeddings for the cartpole and reacher environments. The plots show performance on our evaluation metric from Section III-E and the variational lower bound from equation 7. For each environment we train with our embedding space Z-dimension set to 10 (the high-Z configuration) and set to 3 (the low-Z configuration). Note that both the cartpole and reacher environments have a native dimensionality of 4. In each configuration we run training 5 times with different random seeds, and choose as our final embeddings for each configuration the one with the best performance as measured by our metric from Section III-E.
We see from training curves for our state embeddings that although loss on the optimization objective converges consistently, there is still significant seed noise in the performance on our evaluation metric error. Because this is an offline process it is easy to deal with this seed noise by training multiple seeds in parallel and selecting the best one, however, we believe that in future work it may make sense to more closely examine the convergence characteristics of this algorithm in order to get more consistent performance.
V-C RL Performance
We examine the performance benefit of using our algorithm with two reinforcement learning algorithms, trust region policy optimization (TRPO) and proximal policy optimization (PPO). We use the reference implementation of these algorithms provided by the garage project. TRPO and PPO are both policy gradient algorithms, but with different underlying mechanisms. Policy representations were parameterized as a neural network with two fully connected hidden layers of 32 units. Experiments were run with a discount factor of 0.99 for 100 time steps, except for the cartpole swingup, which got 200 time steps (to allow enough time to swing up and demonstrate balance of the pole).
Figure 5 shows training curves for TRPO and Figure 6 shows the same curves for PPO. All of the plots show average (un-discounted) performance across the training runs. The cartpole swingup task in particular has some observable qualitative performance levels: policies that achieve near 100 can quickly swing up the pole from any initial state and maintain stable balance for the rest of the episode. In the 0 to -100 range are policies that can swing up the pole but fail to achieve stable balance. The -200 to -300 range are policies that are putting energy in the pole, but not in a controlled way. In the -600 range are policies that have learned to zero controls to avoid the control penalty and do nothing. For the reacher task it can generally be said that scores around -6 are effectively achieving the task of reaching to the target point, whereas scores below -14 do not appear to demonstrate significant goal-directed behavior.
In the standard cartpole task (where the pole starts in a nearly upright position with low velocity) the baselines are extremely strong, achieving near perfect performance almost immediately. In these cases augmenting with the embedded state seems to be even or slightly destabilizing. Using the embedded state only can still achieve good performance, especially with TRPO, though not as good as baseline.
In the cartpole swingup task, augmenting with the embedded state offers significant performance improvement in both TRPO and PPO. In TRPO we see slightly faster rise time, better average performance from both high-Z and low-Z embeddings, similar best case performance, and reduced seed variance towards the end. With PPO we see substantially better best and 2nd best case performance from both embeddings, and slightly improved average performance, though PPO is generally under-performing TRPO in our experiments.
For our reacher tasks with TRPO, we see much smaller variations. With trigonometric observations the baseline reacher appears difficult to outperform, and our low-Z embedding is even, with the high-Z underperforming early and then catching up. With angular (raw) observations the low-Z embedding offers some early benefits in reduced variance, though high-Z slightly underperforms. With PPO both embeddings appear to slightly underperform, and best case results are mixed. Absolute performance on the PPO reacher tasks is not good compared to TRPO though.
When training with policies that observe only the embedding we generally do not get good performance, which suggests that these state embeddings are not encoding enough information to fully observe the system state, or that the learning problem is particularly difficult. In the case of the cartpole, some system state can clearly be inferred, given the respectable (though sub-baseline) performance on the standard cartpole task. Also, as noted earlier, the reacher spaces have a changing goal position that is not observed in our embedding-only experiments, so we do not expect to see good performance on this task.
Overall, we see a couple of patterns in performance when augmenting with embedding spaces. On tasks where the baseline algorithm is already very strong and reliable we don’t see improvements, and we think this is reasonable as there may not be much room for improvement. The standard cartpole is firmly in this category, and trigonometric observations of the reacher represent a very engineered feature space, which may be difficult to improve upon.
We see the most improvement on tasks where the baseline can achieve good performance but does not do so reliably, which clearly describes the cartpole swingup, and to a lesser extent the reacher with angular (raw) observations.
VI Conclusion & Discussion
In this paper we’ve proposed a new form of state embedding to learn the geometry of the plan-space of an environment from expert demonstrations, in this case supplied by a motion planner. We’ve shown how to train these embeddings, and that once trained they appear to improve the performance and reliability of policy gradient RL algorithms when used to augment the observed state space.
In future work we hope to examine the impact on a greater variety of RL algorithms as well as looking at more varied problem spaces. The experiments in this paper have formulated our embedding from an already observed joint state space, however it could also be formulated to be a function of a more difficult to interpret observation space such as images, where it might also offer particular advantages in encoding information about the environment beyond robot joint states.
Another interesting direction of future work could be to evaluate the usefulness of this technique using human demonstrations, which could be particularly relevant to complex problems where effective planners are not readily available.
The performance improvements we have observed in RL problems suggest the need for a better understanding of the reward contours of learned policy functions. Since our embedding space is a function of the observed state space in these example problems, it could be interpreted as an expansion of the expressiveness of the policy network through a pretrained sub-network. Although we offer some intuition as to why our embedding objective should produce a useful embedding function, the field lacks a more principled understanding of the need for expressiveness in policy functions.
-  (2016) A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852. Cited by: §II.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §II.
-  (2018) Learning an embedding space for transferable robot skills. In International Conference on Learning Representations. Cited by: §II.
-  (2016) Generative adversarial imitation learning. In Advances in neural information processing systems, pp. 4565–4573. Cited by: §II.
-  (2019) Robot motion planning in learned latent spaces. IEEE Robotics and Automation Letters 4 (3), pp. 2407–2414. Cited by: §II.
-  (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §II, §III-D, §III-D, §III-D.
-  (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980. Cited by: §IV-B.
-  (2016) Asymptotically optimal sampling-based kinodynamic planning. The International Journal of Robotics Research 35 (5), pp. 528–564. Cited by: §IV-A.
-  (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §II.
-  (2000) Algorithms for inverse reinforcement learning. In Proc. 17th International Conf. on Machine Learning, pp. 663–670. Cited by: §II.
-  (2019) Rover-irl: inverse reinforcement learning with soft value iteration networks for planetary rover path planning. IEEE Robotics and Automation Letters 4 (2), pp. 1387–1394. Cited by: §II.
-  (2015) Trust region policy optimization. In International conference on machine learning, pp. 1889–1897. Cited by: §IV-C.
-  (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: §IV-C.
-  (2012-12) The Open Motion Planning Library. IEEE Robotics & Automation Magazine 19 (4), pp. 72–82. Note: http://ompl.kavrakilab.org Cited by: §IV-A.
-  (2019) Garage: a toolkit for reproducible reinforcement learning research. GitHub. Note: https://github.com/rlworkgroup/garage Cited by: §IV-C, §V-C.
-  (2012) Mujoco: a physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. Cited by: §IV-A.
-  (2015) Embed to control: a locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pp. 2746–2754. Cited by: §II, §III-D.
-  (2008) Maximum entropy inverse reinforcement learning.. In AAAI, Vol. 8, pp. 1433–1438. Cited by: §II.