EMI
Implementation for ICML 2019 paper, EMI: Exploration with Mutual Information.
Policy optimization struggles when the reward signal is very sparse: it essentially degenerates into random search until the agent accidentally stumbles upon a rewarding state or the goal state. Recent works utilize intrinsic motivation to guide exploration via generative models, predictive forward models, or more ad hoc measures of surprise. We propose EMI, an exploration method that constructs embedding representations of states and actions without relying on generative decoding of the full observation, and instead extracts predictive signals that can guide exploration based on forward prediction in the representation space. Our experiments show state-of-the-art performance on challenging locomotion tasks with continuous control and on image-based exploration tasks with discrete actions on Atari.
The central task in reinforcement learning is to learn policies that maximize the total reward received from interacting with an unknown environment. Although recent methods have been shown to solve a range of complex tasks (Mnih et al., 2015; Schulman et al., 2015, 2017), their success hinges on the agent constantly receiving intermediate reward feedback. In challenging environments with sparse reward signals, these methods struggle to obtain meaningful policies unless the agent luckily stumbles into rewarding or predefined goal states.
To address this, prior works on exploration generally utilize some kind of intrinsic motivation mechanism to provide a measure of surprise. These measures can be based on density estimation via generative models (Bellemare et al., 2016; Fu et al., 2017; Oh et al., 2015), predictive forward models (Stadie et al., 2015; Houthooft et al., 2016), or more ad hoc measures that aim to approximate surprise (Pathak et al., 2017). Methods based on predictive forward models and generative models must model the distribution over state observations, which can make them difficult to scale to complex, high-dimensional observation spaces, while methods that eschew direct forward prediction or density estimation rely on heuristic measures of surprise that may not transfer effectively to a wide range of tasks.
Our aim in this work is to devise a method for exploration that does not require direct generation of high-dimensional state observations, while still retaining the benefit of being able to measure surprise based on forward prediction. If exploration is performed by seeking out states that maximize surprise, then the problem, in essence, is measuring surprise, which requires a representation where functionally similar states are close together and functionally distinct states are far apart.
In this paper, we propose to learn compact representations for both states and actions simultaneously, satisfying the following criteria: First, given the representations of a state and the corresponding next state, the uncertainty about the representation of the corresponding action should be minimal. Second, given the representations of a state and the corresponding action, the uncertainty about the representation of the corresponding next state should also be minimal. Third, the action embedding should seamlessly support both continuous and discrete actions. Finally, we impose a linear dynamics model in the representation space, together with a term that accounts for the rare transitions that are irreducible under that model. Given the representation, we guide exploration by measuring surprise based on forward prediction and on the relative increase in diversity in the embedding space. Figure 1 illustrates an example visualization of our learned state embeddings and sample trajectories in the representation space in Montezuma’s Revenge.
We present two main technical contributions that make this into a practical exploration method. First, we describe how compact state and action representations can be constructed via the Donsker & Varadhan (1983) estimation of mutual information, without relying on generative decoding of full observations. Second, we show that imposing a linear topology on the learned embedding space (such that the transitions are linear), thereby offloading most of the modeling burden onto the embedding function itself, provides an essential and informative measure of surprise when visiting novel states.
For the experiments, we show that we can use our representations on a range of complex image-based tasks and robotic locomotion tasks with continuous actions. We report significantly improved results compared to recent intrinsic-motivation-based exploration methods (Fu et al., 2017; Pathak et al., 2017) on several challenging Atari tasks and robotic locomotion tasks with sparse rewards.
Our work is related to the following strands of active research:
Unsupervised representation learning via mutual information estimation. Recent literature on unsupervised representation learning generally focuses on extracting latent representations that maximize an approximate lower bound on the mutual information between the code and the data. In the context of generative adversarial networks (Goodfellow et al., 2014), Chen et al. (2016) and Belghazi et al. (2018) aim at maximizing an approximation of the mutual information between the latent code and the raw data. Belghazi et al. (2018) estimate the mutual information with a neural network via the Donsker & Varadhan (1983) representation to learn better generative models. Hjelm et al. (2018) build on this idea and train a decoder-free encoding representation that maximizes the mutual information between the input image and the representation. Furthermore, their method uses the f-divergence (Nowozin et al., 2016) estimation of the Jensen-Shannon divergence rather than the KL divergence to estimate the mutual information, for better numerical stability. Oord et al. (2018) estimate mutual information via an autoregressive model and make predictions on local patches in an image.
Thomas et al. (2017) aim to learn representations that maximize the causal relationship between the distributed policies and the representation of changes in the state.
Exploration with intrinsic motivation. Prior works on exploration mostly employ intrinsic motivation to estimate a measure of novelty or surprise to guide exploration. Bellemare et al. (2016) utilize density estimation via the CTS (Bellemare et al., 2014) generative model and derive pseudo-counts as the intrinsic motivation. Fu et al. (2017) avoid building explicit density models by training K-exemplar models that distinguish a state from all other observed states. Some methods train predictive forward models (Stadie et al., 2015; Houthooft et al., 2016; Oh et al., 2015) and use the prediction error as the intrinsic motivation. Oh et al. (2015) employ generative decoding of the full observation via recursive autoencoders, which can be challenging to scale to high-dimensional observations. VIME (Houthooft et al., 2016) approximates the environment dynamics, uses the information gain of the learned dynamics model as the intrinsic reward, and showed encouraging results on robotic locomotion problems. However, the method needs to update the dynamics model for each observation and is unlikely to scale to complex tasks with high-dimensional states such as Atari games.
Other approaches utilize more ad hoc measures (Pathak et al., 2017; Tang et al., 2017) that aim to approximate surprise. ICM (Pathak et al., 2017) transforms the high-dimensional states to a feature space and imposes cross-entropy and Euclidean losses so that the action and the feature of the next state are predictable. However, ICM does not utilize mutual information like VIME to directly measure the uncertainty, and it is limited to discrete actions. Our method (EMI) is also reminiscent of Kohonen & Somervuo (1998) in the sense that we seek to construct a decoder-free latent space from high-dimensional observation data with a topology in the latent space. In contrast to these prior works on exploration, we construct the representation under a linear topology and do not require decoding of the full observation, but instead encode the essential predictive signal that can be used for guiding exploration.
We consider a Markov decision process defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $P(s_{t+1} \mid s_t, a_t)$ is the environment transition distribution, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. Let $\pi_\theta(a \mid s)$ denote a stochastic policy over actions given states, and let $\rho_0(s_0)$ denote the distribution of the initial state $s_0$. The discounted sum of expected rewards under the policy is defined by

$$\eta(\pi_\theta) = \mathbb{E}_{\tau}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right], \tag{1}$$

where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ denotes the trajectory, $s_0 \sim \rho_0(s_0)$, $a_t \sim \pi_\theta(a_t \mid s_t)$, and $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$. The objective in policy-based reinforcement learning is to search over the space of parameterized policies (i.e. neural networks) in order to maximize $\eta(\pi_\theta)$.
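To make the objective concrete, here is a minimal Python sketch of a Monte Carlo estimate of the discounted return in Equation 1; the function and the toy reward sequences are illustrative and not part of the original implementation.

```python
import numpy as np

def discounted_return(rewards, gamma=0.995):
    """Single-trajectory sample of Equation 1: sum_t gamma^t * r(s_t, a_t)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# eta(pi_theta) is then approximated by averaging over sampled trajectories.
toy_trajectories = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # illustrative reward sequences
eta_estimate = np.mean([discounted_return(tau) for tau in toy_trajectories])
```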
Also, denote $P^{\pi}_{SAS'}$ as the joint probability distribution of singleton experience tuples $(s, a, s')$ obtained by starting from $s_0 \sim \rho_0(s_0)$ and following the policy $\pi$. Furthermore, define $P^{\pi}_{A}$ as the marginal distribution of actions, $P^{\pi}_{SS'}$ as the marginal distribution of states and the corresponding next states, $P^{\pi}_{S'}$ as the marginal distribution of the next states, and $P^{\pi}_{SA}$ as the marginal distribution of states and actions, all under the policy $\pi$.
Our goal is to construct embedding representations of the observation and action (discrete or continuous) for complex dynamical systems that do not rely on generative decoding of the full observation, but still provide a useful predictive signal that can be used for exploration. This requires a representation where functionally similar states are close together and functionally distinct states are far apart. We approach this objective by maximizing mutual information under several criteria.
We first introduce the embedding functions $\phi : \mathcal{S} \to \mathbb{R}^{d}$ for states and $\psi : \mathcal{A} \to \mathbb{R}^{d}$ for actions, with parameters $\alpha$ and $\beta$ (i.e. neural networks), respectively. We seek to learn the embedding functions of states ($\phi$) and actions ($\psi$) satisfying the following two criteria:
Given the embedding representations of states $\phi(s)$ and actions $\psi(a)$, the uncertainty about the embedding representation of the corresponding next states $\phi(s')$ should be minimal, and vice versa.
Given the embedding representations of states $\phi(s)$ and the corresponding next states $\phi(s')$, the uncertainty about the embedding representation of the corresponding actions $\psi(a)$ should also be minimal, and vice versa.
Intuitively, the first criterion translates to maximizing the mutual information between $[\phi(s); \psi(a)]$ and $\phi(s')$, which we define as $\mathcal{I}_S$ in Equation 2, and the second criterion translates to maximizing the mutual information between $[\phi(s); \phi(s')]$ and $\psi(a)$, defined as $\mathcal{I}_A$ in Equation 3.
$$\mathcal{I}_S = I\big([\phi(s); \psi(a)];\ \phi(s')\big) = D_{\mathrm{KL}}\big(P^{\pi}_{SAS'} \,\big\|\, P^{\pi}_{SA} \otimes P^{\pi}_{S'}\big) \tag{2}$$

$$\mathcal{I}_A = I\big([\phi(s); \phi(s')];\ \psi(a)\big) = D_{\mathrm{KL}}\big(P^{\pi}_{SAS'} \,\big\|\, P^{\pi}_{SS'} \otimes P^{\pi}_{A}\big) \tag{3}$$
Mutual information is not bounded from above, and estimating and maximizing it is notoriously difficult in high-dimensional settings. Motivated by Hjelm et al. (2018) and Belghazi et al. (2018), we compute the Donsker & Varadhan (1983) lower bound of mutual information. Concretely, the Donsker-Varadhan representation is a tight estimator for the mutual information of two random variables $X$ and $Z$, derived as in Equation 4:

$$I(X; Z) = D_{\mathrm{KL}}\big(\mathbb{P}_{XZ} \,\big\|\, \mathbb{P}_X \otimes \mathbb{P}_Z\big) \;\ge\; \sup_{\omega}\ \mathbb{E}_{\mathbb{P}_{XZ}}\big[T_\omega(x, z)\big] - \log \mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}\big[e^{T_\omega(x, z)}\big], \tag{4}$$
where $T_\omega$ is a differentiable transform with parameter $\omega$. Furthermore, for better numerical stability, we utilize a different measure between the joint and the product of marginals than the KL divergence. In particular, we employ the Jensen-Shannon divergence (JSD) (Hjelm et al., 2018), which is bounded both from below and above by $0$ and $\log 2$ (note that in Nowozin et al. (2016), the lower bound is actually derived for $2\,\mathrm{JSD}(P \,\|\, Q) - \log 4$ rather than for $\mathrm{JSD}(P \,\|\, Q)$ itself), yielding the bound
$$\mathrm{JSD}\big(\mathbb{P}_{XZ} \,\big\|\, \mathbb{P}_X \otimes \mathbb{P}_Z\big) \;\ge\; \sup_{\omega}\ \mathbb{E}_{\mathbb{P}_{XZ}}\big[-\mathrm{sp}\big(-T_\omega(x, z)\big)\big] - \mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}\big[\mathrm{sp}\big(T_\omega(x, z)\big)\big], \tag{5}$$

where $\mathrm{sp}(x) = \log(1 + e^{x})$ denotes the softplus function. The inequality follows from the variational characterization of f-divergences (Nowozin et al., 2016), substituting the discriminator parameterization of Hjelm et al. (2018) and the Fenchel conjugate of the Jensen-Shannon divergence. ∎
Applying Equation 5 to $\mathcal{I}_S$ and $\mathcal{I}_A$, we have

$$\mathcal{I}_S \;\ge\; \mathbb{E}_{P^{\pi}_{SAS'}}\!\big[-\mathrm{sp}\big(-T_{\omega_S}\big([\phi(s); \psi(a)],\ \phi(s')\big)\big)\big] \;-\; \mathbb{E}_{P^{\pi}_{SA} \otimes P^{\pi}_{S'}}\!\big[\mathrm{sp}\big(T_{\omega_S}\big([\phi(s); \psi(a)],\ \phi(s')\big)\big)\big], \tag{6}$$

$$\mathcal{I}_A \;\ge\; \mathbb{E}_{P^{\pi}_{SAS'}}\!\big[-\mathrm{sp}\big(-T_{\omega_A}\big([\phi(s); \phi(s')],\ \psi(a)\big)\big)\big] \;-\; \mathbb{E}_{P^{\pi}_{SS'} \otimes P^{\pi}_{A}}\!\big[\mathrm{sp}\big(T_{\omega_A}\big([\phi(s); \phi(s')],\ \psi(a)\big)\big)\big], \tag{7}$$

where $T_{\omega_S}$ and $T_{\omega_A}$ are the statistics networks for the two mutual information terms. The expectations in Equation 6 and Equation 7 are approximated using empirical samples from trajectories. Note, the samples from the marginals $P^{\pi}_{SA} \otimes P^{\pi}_{S'}$ and $P^{\pi}_{SS'} \otimes P^{\pi}_{A}$ are obtained by dropping $\phi(s')$ and $\psi(a)$, respectively, from samples drawn from $P^{\pi}_{SAS'}$ and re-pairing them across samples. Figure 2 illustrates the computational architecture for estimating the lower bounds on $\mathcal{I}_S$ and $\mathcal{I}_A$.
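The following is a minimal PyTorch sketch of how the lower bounds in Equations 6 and 7 could be estimated from a minibatch of embedded transitions; the statistics-network architecture, the hidden sizes, and the in-batch shuffling used to draw samples from the product of marginals are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StatisticsNet(nn.Module):
    """T_omega: scores an embedding pair (x, z); higher scores on joint samples."""
    def __init__(self, x_dim, z_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=-1)).squeeze(-1)

def jsd_lower_bound(T, x, z):
    """JSD-based bound of Equation 5 estimated on a minibatch.

    Joint samples are the aligned (x_i, z_i) pairs; samples from the product of
    marginals are formed by shuffling z within the batch (an assumption).
    """
    joint_scores = T(x, z)
    marginal_scores = T(x, z[torch.randperm(z.size(0))])
    return -F.softplus(-joint_scores).mean() - F.softplus(marginal_scores).mean()

# Given embedded batches phi_s, psi_a, phi_next of shape (B, d):
# T_S = StatisticsNet(x_dim=2 * d, z_dim=d)   # x = [phi(s); psi(a)], z = phi(s')
# T_A = StatisticsNet(x_dim=2 * d, z_dim=d)   # x = [phi(s); phi(s')], z = psi(a)
# i_s = jsd_lower_bound(T_S, torch.cat([phi_s, psi_a], dim=-1), phi_next)   # Eq. 6
# i_a = jsd_lower_bound(T_A, torch.cat([phi_s, phi_next], dim=-1), psi_a)   # Eq. 7
```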
Since the embedding representation space is learned, it is natural to impose a topology on it (Kohonen, 1983). In EMI, we impose a simple and convenient topology where transitions are linear since this spares us from having to also represent a complex dynamical model. This allows us to offload most of the modeling burden onto the embedding function itself, which in turn provides us with a useful and informative measure of surprise when visiting novel states. Once the embedding representations are learned, this linear dynamics model allows us to measure surprise in terms of the residual error under the model or measure diversity in terms of the similarity in the embedding space. Section 5 discusses the intrinsic reward computation procedure in more detail.
Concretely, we seek to learn the representations of states and actions such that the representation of the corresponding next state follows linear dynamics, i.e. $\phi(s') = \phi(s) + \psi(a)$. Intuitively, we would like the nonlinear aspects of the dynamics to be offloaded to the neural networks so that the dynamics become linear in the embedding space. Regardless of the expressivity of the neural networks, however, there always exists irreducible error under the linear dynamics model. For example, the state transition that leads the agent from one room to another in Atari environments (i.e. Venture, Montezuma’s Revenge, etc.), or a transition that leaves the agent in the same position under certain actions (i.e. the agent bumping into a wall when navigating a maze environment), would be extremely challenging to explain under the linear dynamics model.
To this end, we introduce the error model $e(s, a)$, another neural network taking the state and action as input, which estimates the irreducible error under the linear model. Motivated by the work of Candès et al. (2011), we seek to minimize the number of transitions that require a nonzero error vector (the $\ell_{2,0}$ norm of the stacked error terms), so that the error term contributes only on rare, unexplainable occasions. Equation 8 shows the embedding learning problem under linear dynamics with sparse errors.
$$\underset{\alpha,\, \beta,\, e}{\text{minimize}}\quad \big\| E \big\|_{2,0} \qquad \text{subject to} \qquad \Phi_{S'} = \Phi_{S} + \Psi_{A} + E, \tag{8}$$
where we use matrix notation for compactness: $\Phi_S$, $\Psi_A$, $\Phi_{S'}$, and $E$ denote the matrices whose columns are the respective embedding representations (and error terms) of the samples. Relaxing the $\ell_{2,0}$ norm with the $\ell_{2,1}$ norm, Equation 9 shows our final learning objective:
$$\underset{\alpha,\, \beta,\, \omega_S,\, \omega_A,\, e}{\text{minimize}}\quad -\,\mathcal{I}_S \;-\; \mathcal{I}_A \;+\; \lambda_{\mathrm{error}}\, \big\| \Phi_{S'} - (\Phi_{S} + \Psi_{A} + E) \big\|_F^2 \;+\; \lambda_{\mathrm{sparsity}}\, \big\| E \big\|_{2,1}. \tag{9}$$
$\lambda_{\mathrm{error}}$ and $\lambda_{\mathrm{sparsity}}$ are hyperparameters which control the relative contributions of the linear dynamics error and the sparsity. In practice, we found the optimization process to be more stable when we further regularize the distribution of the action embedding representation to follow a predefined prior distribution. Concretely, we regularize the action embedding distribution to follow a standard normal distribution via a KL divergence penalty $D_{\mathrm{KL}}\big(P_{\psi(a)} \,\|\, \mathcal{N}(0, I)\big)$, similar to VAEs (Kingma & Welling, 2013). Intuitively, this has the effect of grounding the distribution of the action embedding representation (and consequently the state embedding representation) across different iterations of the learning process. (Note, regularizing the distribution of state embeddings instead renders the optimization process much more unstable. This is because the distribution of states is much more likely to be skewed than the distribution of actions, especially during the initial stage of optimization, so the Gaussian approximation becomes much less accurate for states than for actions.)
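A minimal PyTorch sketch of how the combined objective in Equation 9 and the Gaussian prior regularizer could be assembled, under the reconstructed form above; the network sizes, the batch-statistics Gaussian approximation for the KL penalty, and the default λ values are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Embedder(nn.Module):
    """A small MLP standing in for the state (phi) or action (psi) embedding network."""
    def __init__(self, in_dim, emb_dim=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, emb_dim))

    def forward(self, x):
        return self.net(x)

def emi_embedding_loss(phi_s, psi_a, phi_next, error, i_s, i_a,
                       lam_error=0.05, lam_sparsity=0.1):
    """Assemble the objective of Equation 9 (as reconstructed above) on a batch:
    negative MI lower bounds, squared linear-dynamics residual, and an l2,1
    penalty on the per-sample error vectors."""
    residual = phi_next - (phi_s + psi_a + error)
    dynamics_err = residual.pow(2).sum(dim=-1).mean()
    sparsity = error.norm(dim=-1).mean()        # relaxed (l2,1) sparsity of the error
    return -(i_s + i_a) + lam_error * dynamics_err + lam_sparsity * sparsity

def action_prior_kl(psi_a):
    """KL(N(mu, diag(var)) || N(0, I)) computed from batch statistics of psi(a);
    a stand-in for the standard-normal regularizer mentioned in the text."""
    mu, var = psi_a.mean(dim=0), psi_a.var(dim=0) + 1e-8
    return 0.5 * (var + mu.pow(2) - 1.0 - var.log()).sum()
```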
We consider two different formulations of the intrinsic reward. First, we consider the relative difference in the novelty of state representations, based on distances in the embedding space, similar to Oh et al. (2015), as shown in Equation 10. The relative difference ensures that the intrinsic reward diminishes to zero (Ng et al., 1999) once the agent has sufficiently explored the state space. Second, we consider a formulation based on the prediction error under the linear dynamics model, as shown in Equation 11. This formulation incorporates the sparse error term and ensures that we discount the irreducible error, which does not signify novelty.
$$r^{\mathrm{div}}(s, a, s') \;=\; n\big(\phi(s')\big) - n\big(\phi(s)\big), \qquad n\big(\phi(s)\big) = \frac{1}{K} \sum_{k=1}^{K} \big\| \phi(s) - \phi(s)_{(k)} \big\|_2, \tag{10}$$

$$r^{\mathrm{pred}}(s, a, s') \;=\; \big\| \phi(s) + \psi(a) + e(s, a) - \phi(s') \big\|_2^2, \tag{11}$$

where $\phi(s)_{(k)}$ denotes the $k$-th nearest previously visited state embedding to $\phi(s)$.
Note the relative diversity term should be computed after the representations are updated based on the samples from the latest trajectories while the prediction error term should be computed before the update. Algorithm 1 shows the complete learning procedure in detail.
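Below is a minimal PyTorch sketch of the two intrinsic reward formulations; the prediction-error form follows Equation 11 as reconstructed above, while the k-nearest-neighbor novelty used for the relative diversity term is an illustrative assumption about Equation 10, and `memory` is a hypothetical buffer of previously visited state embeddings.

```python
import torch

def prediction_error_reward(phi_s, psi_a, phi_next, error):
    """Intrinsic reward as the residual of the linear model including e(s, a) (Eq. 11)."""
    return (phi_next - (phi_s + psi_a + error)).pow(2).sum(dim=-1)

def relative_diversity_reward(phi_s, phi_next, memory, k=5):
    """Relative novelty in embedding space in the spirit of Equation 10: the mean
    distance of phi(s') to its k nearest visited embeddings minus the same
    quantity for phi(s). The kNN form is an assumption, not the paper's exact formula."""
    def novelty(x):
        dists = torch.cdist(x, memory)                          # (B, |memory|)
        return dists.topk(k, dim=-1, largest=False).values.mean(dim=-1)
    return novelty(phi_next) - novelty(phi_s)

# The augmented reward then mixes extrinsic and intrinsic terms, e.g.
#   r_aug = r_ext + eta * r_int
# with a small intrinsic coefficient eta (0.1 in most of our runs).
```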
We compare the experimental performance of EMI to recent prior works on both low-dimensional locomotion tasks with continuous control from the rllab benchmark (Duan et al., 2016) and complex vision-based tasks with discrete control from the Arcade Learning Environment (Bellemare et al., 2013). For the locomotion tasks, we chose the SwimmerGather and SparseHalfCheetah environments for direct comparison against the prior work of Fu et al. (2017). SwimmerGather is a hierarchical task where a two-link robot needs to reach green pellets, which give positive rewards, instead of red pellets, which give negative rewards. SparseHalfCheetah is a challenging locomotion task where a cheetah-like robot does not receive any reward until it moves 5 units in one direction.
For vision-based tasks, we selected Freeway, Frostbite, Venture, Montezuma’s Revenge, Gravitar, and Solaris for comparison with recent prior works (Pathak et al., 2017; Fu et al., 2017). These six Atari environments feature very sparse reward feedback and often contain many moving distractor objects, which can be challenging for methods that rely on explicit decoding of the full observations (Oh et al., 2015).
We use TRPO (Schulman et al., 2015) for policy optimization because of its capability to support both discrete and continuous actions and its robustness with respect to hyperparameters. In the locomotion experiments, we use a 2-layer fully connected neural network as the policy network. In the Atari experiments, we use a 2-layer convolutional neural network followed by a single-layer fully connected neural network. We convert the 84 x 84 input RGB frames to grayscale images and resize them to 52 x 52, following the practice in Tang et al. (2017). The embedding dimensionality is set to 2 in all of the environments except for Gravitar and Solaris, where we use a higher dimensionality due to their more complex environment dynamics. We use the Adam (Kingma & Ba, 2015) optimizer to train the embedding networks. Please refer to Section A.1 for more details.
We compare EMI with TRPO (Schulman et al., 2015), EX2 (Fu et al., 2017), and ICM (Pathak et al., 2017) on two challenging locomotion environments: SwimmerGather and SparseHalfCheetah. Figure 4 shows that EMI outperforms the baseline methods on both tasks. Figure 2(b) visualizes the scatter plot of the learned state embeddings and an example trajectory for the SparseHalfCheetah experiment. The figure shows that the learned representation successfully preserves the similarity in observation space. Please refer to Section A.3 for further experiments, including an ablation study.
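As a concrete illustration of the frame preprocessing described above, here is a minimal sketch using OpenCV; the interpolation mode and the scaling to [0, 1] are assumptions.

```python
import cv2
import numpy as np

def preprocess_frame(frame_rgb):
    """Convert an 84 x 84 RGB frame to grayscale and resize it to 52 x 52."""
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)
    small = cv2.resize(gray, (52, 52), interpolation=cv2.INTER_AREA)
    return small.astype(np.float32) / 255.0
```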
Performance of EMI on locomotion tasks with sparse rewards compared to baseline methods (TRPO, EX2, ICM). The solid line is the mean reward (y-axis) of 5 different seeds at each iteration (x-axis) and the shaded area represents one standard deviation from the mean.
For vision-based exploration tasks, our results in Figure 5 show that EMI achieves state-of-the-art performance on Freeway, Frostbite, Venture, and Montezuma’s Revenge in comparison to the baseline exploration methods. Figures 2(c), 2(d), 2(e), and 2(f) illustrate our learned state embeddings. Since the embedding dimensionality is set to 2, we directly visualize the scatter plot of the embedding representation in 2D. Figure 2(d) shows that the embedding space naturally separates state samples into two clusters, each of which corresponds to a different room in Montezuma’s Revenge. Figure 2(f) shows smooth sample transitions along the embedding space in Frostbite, where functionally similar states are close together and distinct states are far apart. For information about how our error term works in these vision-based tasks, please refer to Section A.2.
Extending our experiments in Figure 4 and Figure 5, we further compare EMI with other exploration methods, as shown in Table 1. EMI achieves the best performance on 6 out of the 8 environments.
| Environment | EMI (5 seeds) | EX2 (5 seeds) | ICM (5 seeds) | SimHash | VIME | TRPO (5 seeds) |
|---|---|---|---|---|---|---|
| SwimmerGather | 0.442 | 0.200 | 0 | 0.258 | 0.196 | 0 |
| SparseHalfCheetah | 194.9 | 153.7 | 1.4 | 0.5 | 98.0 | 0 |
| Freeway | 34.0 | 27.1 | 33.6 | 33.5 | — | 26.7 |
| Frostbite | 7388 | 3387 | 4465 | 5214 | — | 2034 |
| Venture | 646 | 589 | 418 | 616 | — | 263 |
| Gravitar | 599 | 550 | 424 | 604 | — | 508 |
| Solaris | 2775 | 2276 | 2453 | 4467 | — | 3101 |
| Montezuma's Revenge | 387 | 0 | 161 | 238 | — | 0 |
We presented EMI, a practical exploration method that does not rely on the direct generation of high-dimensional observations and instead extracts a predictive signal that can be used for exploration within a compact representation space. Our results on challenging robotic locomotion tasks with continuous actions and on high-dimensional image-based games with sparse rewards show that our approach transfers to a wide range of tasks and achieves state-of-the-art results, significantly outperforming recent prior works on exploration. As future work, we would like to explore utilizing the learned linear dynamics model for optimal planning in the embedding representation space. In particular, we would like to investigate how an optimal trajectory from a state to a given goal in the embedding space under the linear representation topology translates to the optimal trajectory in the observation space under complex dynamical systems.
This work is supported by Samsung Advanced Institute of Technology. Hyun Oh Song is the corresponding author.
In all experiments, we use the Adam optimizer with a learning rate of 0.001 and a minibatch size of 512 for 3 epochs to optimize the embedding networks. In each iteration, we use the collected TRPO batch to train the embedding networks, except for SparseHalfCheetah, which uses a FIFO replay buffer of size 250,000. The embedding dimensionality is set to 2 in all of the environments except for Gravitar and Solaris, where we use a higher dimensionality. The relative diversity term is used as the intrinsic reward with a weight of 0.1, except for Venture and Montezuma’s Revenge, where the intrinsic reward is the prediction error term with a weight of 0.001. The following tables give the detailed information on the remaining hyperparameters.

| Hyperparameter | SwimmerGather | SparseHalfCheetah |
|---|---|---|
| TRPO method | Single Path | Single Path |
| TRPO step size | 0.01 | 0.01 |
| TRPO batch size | 50k | 5k |
| Policy network | A 2-layer FC with (64, 32) hidden units (tanh) | A 2-layer FC with (64, 32) hidden units (tanh) |
| Baseline network | A 32 hidden units FC (ReLU) | Linear baseline |
| φ (state embedding) network | Same structure as policy network | Same structure as policy network |
| ψ (action embedding) network | A 64 hidden units FC (ReLU) | A 64 hidden units FC (ReLU) |
| Information network | A 2-layer FC with (64, 64) hidden units (ReLU) | A 2-layer FC with (64, 64) hidden units (ReLU) |
| Error network | | |
| Max path length | 500 | 500 |
| Discount factor | 0.995 | 0.995 |
| λ_error | 0.05 | 0.05 |
| λ_sparsity | 0.1 | 0.1 |
| Hyperparameter | Freeway, Frostbite, Venture, Montezuma's Revenge, Gravitar, Solaris |
|---|---|
| TRPO method | Single Path |
| TRPO step size | 0.01 |
| TRPO batch size | 100k |
| Policy network | 2 convolutional layers (16 8x8 filters of stride 4, 32 4x4 filters of stride 2), followed by a 256 hidden units FC (ReLU) |
| Baseline network | Same structure as policy network |
| φ (state embedding) network | Same structure as policy network |
| ψ (action embedding) network | A 64 hidden units FC (ReLU) |
| Information network | A 2-layer FC with (64, 64) hidden units (ReLU) |
| Error network | State input passes through the same network structure as the policy network; a concat layer concatenates the state output and the action, followed by a 256 hidden units FC (ReLU) |
| Max path length | 4500 |
| Discount factor | 0.995 |
| λ_error | 0.1 |
| λ_sparsity | 0.5 |
In order to understand how the error term in EMI works in practice, we visualize three representative transition samples in Figure 6 and check the residual error norm without the error term, $\|\phi(s') - \phi(s) - \psi(a)\|_2$, and the norm of the error term, $\|e(s, a)\|_2$.
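The two quantities can be computed directly from the learned networks; a minimal sketch (function name and tensor shapes are illustrative):

```python
import torch

def diagnostic_norms(phi_s, psi_a, phi_next, error):
    """Residual norm without the error term vs. the norm of the error term itself."""
    residual_without_error = (phi_next - (phi_s + psi_a)).norm(dim=-1)
    error_norm = error.norm(dim=-1)
    return residual_without_error, error_norm
```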
In the case of Figure 5(a), due to the discrepancy between the two different background images, the change in the state embedding becomes large, which makes the residual error, as well as the error term, large. Figure 5(b) describes a case where the action chosen by the policy has no effect on the state. Linear models without any noise term can easily fail on such events; thus, the error term in our model grows to absorb the modeling error, and both the residual error without the error model and the error term itself were large for this transition.
On the other hand, Figure 5(c) represents a case where the chosen action affects the environment as intended, and both the residual error norm and the error term norm were small.
In conclusion, we observed that the error terms had much larger norms in cases such as Figure 5(a) and Figure 5(b) compared to the case in Figure 5(c), in order to absorb the occasional irreducible large residual errors under the linear dynamics model.
Figure 7 shows an ablation study of the loss terms in EMI to verify the influence of each factor. Ablating a single factor, such as the information term, the linear dynamics with sparse noise, or the unit Gaussian KL divergence constraint, degrades performance significantly, which means that each factor has a non-trivial impact on EMI. Also, simultaneously ablating the information term together with another factor diminishes the reward to zero, which indicates that the information term has the most critical impact on EMI.
In the reward augmentation process, the EMI agent computes the intrinsic reward and then learns from the sum of the extrinsic reward and the intrinsic reward scaled by a coefficient. Figure 8 shows the impact of this intrinsic reward coefficient on EMI. Although the coefficient used in our main experiments gives the best performance, other choices also give comparable performance. We conclude that EMI is robust to the choice of intrinsic reward coefficient.