The Laplacian in RL: Learning Representations with Efficient Approximations

10/10/2018 ∙ by Yifan Wu, et al. ∙ Google ∙ Carnegie Mellon University

The smallest eigenvectors of the graph Laplacian are well-known to provide a succinct representation of the geometry of a weighted graph. In reinforcement learning (RL), where the weighted graph may be interpreted as the state transition process induced by a behavior policy acting on the environment, approximating the eigenvectors of the Laplacian provides a promising approach to state representation learning. However, existing methods for performing this approximation are ill-suited in general RL settings for two main reasons: First, they are computationally expensive, often requiring operations on large matrices. Second, these methods lack adequate justification beyond simple, tabular, finite-state settings. In this paper, we present a fully general and scalable method for approximating the eigenvectors of the Laplacian in a model-free RL context. We systematically evaluate our approach and empirically show that it generalizes beyond the tabular, finite-state setting. Even in tabular, finite-state settings, its ability to approximate the eigenvectors outperforms previous proposals. Finally, we show the potential benefits of using a Laplacian representation learned using our method in goal-achieving RL tasks, providing evidence that our technique can be used to significantly improve the performance of an RL agent.




1 Introduction

The performance of machine learning methods generally depends on the choice of data representation (Bengio et al., 2013). In reinforcement learning (RL), the choice of state representation may affect generalization (Rafols et al., 2005), exploration (Tang et al., 2017; Pathak et al., 2017), and speed of learning (Dubey et al., 2018). As a motivating example, consider goal-achieving tasks, a class of RL tasks which has recently received significant attention (Andrychowicz et al., 2017; Pong et al., 2018). In such tasks, the agent’s task is to achieve a certain configuration in state space; e.g. in Figure 1 the environment is a two-room gridworld and the agent’s task is to reach the red cell. A natural reward choice is the negative Euclidean (L2) distance from the goal (e.g., as used in Nachum et al., 2018). The ability of an RL agent to quickly and successfully solve the task is thus heavily dependent on the representation of the states used to compute the L2 distance. Computing the distance on one-hot (i.e. tabular) representations of the states (equivalent to a sparse reward) is most closely aligned with the task’s directive. However, such a representation can be disadvantageous for learning speed, as the agent receives the same reward signal for all non-goal cells. One may instead choose to compute the L2 distance on $(x, y)$ representations of the grid cells. This allows the agent to receive a clear signal which encourages it to move to cells closer to the goal. Unfortunately, this representation is agnostic to the environment dynamics, and in cases where the agent’s movement is obstructed (e.g. by a wall as in Figure 1), this choice of reward is likely to cause premature convergence to sub-optimal policies unless sophisticated exploration strategies are used. The ideal reward structure would be defined on state representations whose distances roughly correspond to the ability of the agent to reach one state from another.
Although there are many suitable such representations, in this paper, we focus on a specific approach based on the graph Laplacian, which is notable for this and several other desirable properties.

Figure 1: Visualization of the shaped reward defined by the L2 distance from the red cell on an $(x, y)$ representation (left) and Laplacian representation (right).

For a symmetric weighted graph, the Laplacian is a symmetric matrix with a row and column for each vertex. The $d$ smallest eigenvectors of the Laplacian provide an embedding of each vertex in $\mathbb{R}^d$ which has been found to be especially useful in a variety of applications, such as graph visualization (Koren, 2003), clustering (Ng et al., 2002), and more (Chung & Graham, 1997).

Naturally, the use of the Laplacian in RL has also attracted attention. In an RL setting, the vertices of the graph are given by the states of the environment. For a specific behavior policy, edges between states are weighted by the probability of transitioning from one state to the other (and vice-versa). Several previous works have proposed that approximating the eigenvectors of the graph Laplacian can be useful in RL. For example, Mahadevan (2005) shows that using the eigenvectors as basis functions can accelerate learning with policy iteration. Machado et al. (2017a, b) show that the eigenvectors can be used to construct options with exploratory behavior. The Laplacian eigenvectors are also a natural solution to the aforementioned reward-shaping problem. If we use a uniformly random behavior policy, the Laplacian state representations will be appropriately aware of the walls present in the gridworld and will induce an L2 distance as shown in Figure 1 (right). This choice of representation accurately reflects the geometry of the problem, not only providing a strong learning signal at every state, but also avoiding spurious local optima.

While the potential benefits of using Laplacian-based representations in RL are clear, current techniques for approximating or learning the representations are ill-suited for model-free RL. For one, current methods mostly require an eigendecomposition of a matrix. When this matrix is the actual Laplacian (Mahadevan, 2005), the eigendecomposition can easily become prohibitively expensive. Even for methods which perform the eigendecomposition on a reduced matrix (Machado et al., 2017a, b), the eigendecomposition step may be computationally expensive, and furthermore precludes the applicability of the method to stochastic or online settings, which are common in RL. Perhaps more crucially, the justification for many of these methods is made in the tabular setting. The applicability of these methods to more general settings is unclear.

To resolve these limitations, we propose a computationally efficient approach to approximate the eigenvectors of the Laplacian with function approximation based on the spectral graph drawing objective, an objective whose optimum yields the desired eigenvector representations. We present the objective in a fully general RL setting and show how it may be stochastically optimized over mini-batches of sampled experience. We empirically show that our method provides a better approximation to the Laplacian eigenvectors than previous proposals, especially when the raw representation is not tabular. We then apply our representation learning procedure to reward shaping in goal-achieving tasks, and show that our approach outperforms both sparse rewards and rewards based on L2 distance in the raw feature space. Results are shown under a set of gridworld maze environments and difficult continuous control navigation environments.

2 Background

We present the eigendecomposition framework in terms of general Hilbert spaces. By working with Hilbert spaces, we provide a unified treatment of the Laplacian and our method for approximating its eigenvectors (Cayley, 1858), which are called eigenfunctions in Hilbert spaces (Riesz, 1910), regardless of the underlying space (discrete or continuous). To simplify the exposition, the reader may substitute the following simplified definitions:

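  • The state space $S$ is a finite enumerated set $\{1, \ldots, |S|\}$.

  • The probability measure $\rho$ is a probability distribution over $S$.

  • The Hilbert space $\mathcal{H}$ is $\mathbb{R}^{|S|}$, for which elements $f \in \mathcal{H}$ are $|S|$-dimensional vectors representing functions $f: S \to \mathbb{R}$.

  • The inner product $\langle f, g \rangle_{\mathcal{H}}$ of two elements $f, g \in \mathcal{H}$ is a weighted dot product of the corresponding vectors, with weighting given by $\rho$; i.e. $\langle f, g \rangle_{\mathcal{H}} = \sum_{u=1}^{|S|} f(u) g(u) \rho(u)$.

  • A linear operator is a mapping $A: \mathcal{H} \to \mathcal{H}$ corresponding to a weighted matrix multiplication; i.e. $Af(u) = \sum_{v=1}^{|S|} f(v) A(u, v) \rho(v)$.

  • A self-adjoint linear operator $A$ is one for which $\langle f, Ag \rangle_{\mathcal{H}} = \langle Af, g \rangle_{\mathcal{H}}$ for all $f, g \in \mathcal{H}$. This corresponds to $A$ being a symmetric matrix.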
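As an illustration only, the simplified finite-state definitions can be sketched in NumPy. The state-space size, the random distribution `rho`, and the helper names `inner` and `apply_op` are assumptions made for this example and do not come from the original work.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5                                    # |S|, a hypothetical small state space
rho = rng.random(n)
rho /= rho.sum()                         # probability distribution over S

def inner(f, g, rho):
    """Weighted dot product <f, g>_H = sum_u f(u) g(u) rho(u)."""
    return float(np.sum(f * g * rho))

def apply_op(A, f, rho):
    """Weighted matrix multiplication (Af)(u) = sum_v f(v) A(u, v) rho(v)."""
    return A @ (f * rho)

B = rng.random((n, n))
A_sym = (B + B.T) / 2                    # a symmetric kernel matrix
f, g = rng.standard_normal(n), rng.standard_normal(n)

# Self-adjointness: <f, A g>_H should equal <A f, g>_H when A is symmetric.
lhs = inner(f, apply_op(A_sym, g, rho), rho)
rhs = inner(apply_op(A_sym, f, rho), g, rho)
print(abs(lhs - rhs))
```

The weighting by `rho` appears in both the inner product and the operator, which is why a symmetric kernel matrix is enough for self-adjointness here.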

2.1 A Space and a Measure

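We now present the more general form of these definitions. Let $S$ be a set, $\Sigma$ be a $\sigma$-algebra, and $\rho$ be a measure such that $(S, \Sigma, \rho)$ constitutes a measure space. Consider the set of square-integrable real-valued functions $L^2(S, \Sigma, \rho) = \{ f: S \to \mathbb{R} \ \text{s.t.} \ \int_S |f(u)|^2 \, d\rho(u) < \infty \}$. When associated with the inner-product,

$$\langle f, g \rangle_{\mathcal{H}} = \int_S f(u) g(u) \, d\rho(u),$$

this set of functions forms a complete inner product Hilbert space (Hilbert, 1906; Riesz, 1910). The inner product gives rise to a notion of orthogonality: Functions $f, g$ are orthogonal if $\langle f, g \rangle_{\mathcal{H}} = 0$. It also induces a norm on the space: $\|f\|^2 = \langle f, f \rangle_{\mathcal{H}}$. We denote $\mathcal{H} = L^2(S, \Sigma, \rho)$ and additionally restrict $\rho$ to be a probability measure, i.e. $\int_S 1 \, d\rho(u) = 1$.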

2.2 The Laplacian

To construct the graph Laplacian in this general setting, we consider linear operators $D: \mathcal{H} \to \mathcal{H}$ which are Hilbert-Schmidt integral operators (Bump, 1998), expressible as,

$$Df(u) = \int_S f(v) D(u, v) \, d\rho(v),$$

where with a slight abuse of notation we also use $D: S \times S \to \mathbb{R}^+$ to denote the kernel function. We assume that (i) the kernel function satisfies $D(u, v) = D(v, u)$ for all $u, v \in S$ so that the operator is self-adjoint; (ii) for each $u \in S$, $D(u, \cdot)$ is the Radon-Nikodym derivative (density function) from some probability measure to $\rho$, i.e. $\int_S D(u, v) \, d\rho(v) = 1$ for all $u$. With these assumptions, $D$ is a compact, self-adjoint linear operator, and hence many of the spectral properties associated with standard symmetric matrices extend to $D$.

The Laplacian $L$ of $D$ is defined as the linear operator on $\mathcal{H}$ given by,

$$Lf(u) = f(u) - \int_S f(v) D(u, v) \, d\rho(v) = f(u) - Df(u). \tag{1}$$

The Laplacian may also be written as the linear operator $I - D$, where $I$ is the identity operator. Any eigenfunction with associated eigenvalue $\lambda$ of the Laplacian is an eigenfunction with eigenvalue $1 - \lambda$ for $D$, and vice-versa.

Our goal is to find the first $d$ eigenfunctions $f_1, \ldots, f_d$ associated with the smallest $d$ eigenvalues of $L$ (subject to rotation of the basis). (The existence of these eigenfunctions is formally discussed in Appendix A.) The mapping $\phi: S \to \mathbb{R}^d$ defined by $\phi(u) = [f_1(u), \ldots, f_d(u)]$ then defines an embedding or representation of the space $S$.
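For a finite state space the eigenfunctions can be computed directly, which gives a reference point for the approximation developed later. The sketch below assumes a hypothetical random walk on a four-vertex undirected graph; because that walk is reversible, the kernel $P(v \mid u) / \rho(v)$ is already symmetric. The function name and the graph are illustrative only.

```python
import numpy as np

# Hypothetical example: random walk on a small undirected weighted graph.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = W.sum(axis=1)
P = W / deg[:, None]           # P[u, v] = probability of moving from u to v
rho = deg / deg.sum()          # stationary distribution of this walk
K = P / rho[None, :]           # kernel D(u, v); symmetric because the walk is reversible

def smallest_laplacian_eigenfunctions(K, rho, d):
    """Return the d smallest eigenvalues of L = I - D and their eigenfunctions."""
    s = np.sqrt(rho)
    S = s[:, None] * K * s[None, :]      # symmetric matrix similar to the operator D
    mu, G = np.linalg.eigh(S)            # eigenvalues of D in ascending order
    lam = 1.0 - mu[::-1]                 # eigenvalues of L in ascending order
    F = G[:, ::-1] / s[:, None]          # eigenfunctions, orthonormal under rho
    return lam[:d], F[:, :d]

lam, phi = smallest_laplacian_eigenfunctions(K, rho, d=2)
print(lam)      # the first entry is (numerically) zero
print(phi)      # row u is the embedding phi(u)
```

Rescaling by $\sqrt{\rho}$ turns the weighted operator into an ordinary symmetric matrix, so a standard symmetric eigensolver can be used; the eigenvalue $\lambda$ of $L$ is recovered as $1 - \mu$ from the eigenvalue $\mu$ of $D$.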

2.3 Spectral Graph Drawing

Spectral graph drawing (Koren, 2003) provides an optimization perspective on finding the eigenvectors of the Laplacian. Suppose we have a large graph, composed of (possibly infinitely many) vertices with weighted edges representing pairwise (non-negative) affinities (denoted by $D(u, v) \geq 0$ for vertices $u$ and $v$). To visualize the graph, we would like to embed each vertex in a low dimensional space (e.g., $\mathbb{R}^d$ in this work) so that pairwise distances in the low dimensional space are small for vertices with high affinity. Using our notation, the graph drawing objective is to find a set of orthonormal functions $f_1, \ldots, f_d$ defined on the space $S$ which minimize

$$G(f_1, \ldots, f_d) = \frac{1}{2} \int_S \int_S \sum_{k=1}^{d} (f_k(u) - f_k(v))^2 D(u, v) \, d\rho(u) \, d\rho(v). \tag{2}$$

The orthonormal constraints can be written as $\langle f_j, f_k \rangle_{\mathcal{H}} = \delta_{jk}$ for all $j, k \in [1, d]$ where $\delta_{jk} = 1$ if $j = k$ and $\delta_{jk} = 0$ otherwise.

The graph drawing objective (2) may be expressed more succinctly in terms of the Laplacian:

$$G(f_1, \ldots, f_d) = \sum_{k=1}^{d} \langle f_k, L f_k \rangle_{\mathcal{H}}. \tag{3}$$

The minimum value of (3) is the sum of the $d$ smallest eigenvalues of $L$. Accordingly, the minimum is achieved when $f_1, \ldots, f_d$ span the same subspace as the corresponding $d$ eigenfunctions. In the next section, we will show that the graph drawing objective is amenable to stochastic optimization, thus providing a general, scalable approach to approximating the eigenfunctions of the Laplacian.
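The equivalence between the pairwise form and the Laplacian form of the graph drawing objective, and the claim about its minimum, can be checked numerically on a small finite example. The graph below and the function names are assumptions for illustration; the kernel is that of a reversible random walk, so it satisfies the symmetry and density conditions.

```python
import numpy as np

def graph_drawing_pairwise(F, K, rho):
    """Pairwise form: 0.5 * sum_{u,v} sum_k (f_k(u) - f_k(v))^2 D(u,v) rho(u) rho(v)."""
    diff = F[:, None, :] - F[None, :, :]             # shape (n, n, d)
    sq = (diff ** 2).sum(axis=2)
    return 0.5 * float((sq * K * rho[:, None] * rho[None, :]).sum())

def graph_drawing_laplacian(F, K, rho):
    """Laplacian form: sum_k <f_k, L f_k>_H with L f = f - D f."""
    LF = F - K @ (F * rho[:, None])
    return float((F * LF * rho[:, None]).sum())

# Hypothetical four-vertex graph with a reversible random walk.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = W.sum(axis=1)
rho = deg / deg.sum()
K = (W / deg[:, None]) / rho[None, :]

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 2))                      # arbitrary functions f_1, f_2
g_pair = graph_drawing_pairwise(F, K, rho)
g_lap = graph_drawing_laplacian(F, K, rho)

# Evaluate the objective at the two smallest eigenfunctions of L.
s = np.sqrt(rho)
mu, G = np.linalg.eigh(s[:, None] * K * s[None, :])
F_opt = G[:, ::-1][:, :2] / s[:, None]
g_opt = graph_drawing_pairwise(F_opt, K, rho)
print(g_pair, g_lap, g_opt, np.sum(1.0 - mu[::-1][:2]))
```

The two forms agree for arbitrary functions, and at the eigenfunctions the value equals the sum of the two smallest eigenvalues of $L$.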

3 Representation Learning with the Laplacian

In this section, we specify the meaning of the Laplacian in the RL setting (i.e., how to set $\rho$ and $D$ appropriately). We then elaborate on how to approximate the eigenfunctions of the Laplacian by optimizing the graph drawing objective via stochastic gradient descent on sampled states and pairs of states.

3.1 The Laplacian in a Reinforcement Learning Setting

In RL, an agent interacts with an environment by observing states and acting on the environment. We consider the standard MDP setting (Puterman, 1990). Briefly, at time $t$ the environment produces an observation $s_t \in S$, which at time $t = 0$ is determined by a random sample from an environment-specific initial distribution $P_0$. The agent’s policy produces a probability distribution over possible actions $\pi(a \mid s_t)$ from which it samples a specific action $a_t \in A$ to act on the environment. The environment then yields a reward $r_t$ sampled from an environment-specific reward distribution function $R(s_t, a_t)$, and transitions to a subsequent state $s_{t+1}$ sampled from an environment-specific transition distribution function $P(s_t, a_t)$. We consider defining the Laplacian with respect to a fixed behavior policy $\pi$. Then, the transition distributions $P^\pi(s_{t+1} \mid s_t)$ form a Markov chain. We assume this Markov chain has a unique stationary distribution.

We now introduce a choice of $\rho$ and $D$ for the Laplacian in the RL setting. We define $\rho$ to be the stationary distribution of the Markov chain $P^\pi$ such that for any measurable $U \subset S$ we have $\rho(U) = \int_S P^\pi(U \mid v) \, d\rho(v)$.

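As $D(u, v)$ represents the pairwise affinity between two vertices $u$ and $v$ on the graph, it is natural to define $D(u, v)$ in terms of the transition distribution. (The one-step transitions can be generalized to multi-step transitions in the definition of $D$, which provide better performance for RL applications in our experiments. See Appendix B for details.) Recall that $D$ needs to satisfy (i) $D(u, v) = D(v, u)$ and (ii) $D(u, \cdot)$ is the density function from a probability measure to $\rho$ for all $u$. We define

$$D(u, v) = \frac{1}{2} \frac{dP^\pi(s_{t+1} = v \mid s_t = u)}{d\rho(v)} + \frac{1}{2} \frac{dP^\pi(s_{t+1} = u \mid s_t = v)}{d\rho(u)}, \tag{4}$$

which satisfies these conditions. ($D(u, v) = D(v, u)$ follows from the definition. See Appendix B for a proof that $D(u, \cdot)$ is a density.) In other words, the affinity between states $u$ and $v$ is the average of the two-way transition probabilities: If $S$ is finite then the first term in (4) is $\frac{1}{2} P^\pi(s_{t+1} = v \mid s_t = u) / \rho(v)$ and the second term is $\frac{1}{2} P^\pi(s_{t+1} = u \mid s_t = v) / \rho(u)$.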
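For a finite Markov chain, the affinity can be built from a transition matrix and its stationary distribution, and the two required conditions can be checked directly. The sketch below uses a hypothetical three-state, non-reversible chain; the function names and the matrix `P` are assumptions for illustration.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalised to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)
    i = np.argmin(np.abs(vals - 1.0))
    rho = np.real(vecs[:, i])
    return rho / rho.sum()

def affinity(P, rho):
    """Finite-state affinity: D(u, v) = 0.5 * P(v|u) / rho(v) + 0.5 * P(u|v) / rho(u)."""
    forward = P / rho[None, :]           # entry (u, v) is P(v|u) / rho(v)
    backward = P.T / rho[:, None]        # entry (u, v) is P(u|v) / rho(u)
    return 0.5 * (forward + backward)

# Hypothetical transition matrix with P[u, v] = P(s_{t+1} = v | s_t = u).
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
rho = stationary_distribution(P)
D = affinity(P, rho)

print(np.allclose(D, D.T))               # condition (i): symmetry
print(np.allclose(D @ rho, 1.0))         # condition (ii): sum_v D(u, v) rho(v) = 1
```

Condition (ii) relies on `rho` being stationary: the backward term sums to $\sum_v P(u \mid v) \rho(v) / \rho(u) = 1$ only when $\rho P = \rho$.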

3.2 Approximating the Laplacian Eigenfunctions

Given this definition of the Laplacian, we now aim to learn the eigen-decomposition embedding $\phi$. In the model-free RL context, we have access to states and pairs of states (or sequences of states) only via sampling; i.e. we may sample states $u$ from $\rho(u)$ and pairs of $u, v$ from $\rho(u) P^\pi(v \mid u)$. This imposes several challenges on computing the eigendecomposition:

  • Enumerating the state space $S$ may be intractable due to the large cardinality or continuity.

  • For arbitrary pairs of states $(u, v)$, we do not have explicit access to $D(u, v)$.

  • Enforcing exact orthonormality of $f_1, \ldots, f_d$ may be intractable in innumerable state spaces.

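With our choices for $\rho$ and $D$, the graph drawing objective (Eq. 2) is a good start for resolving these challenges because it can be expressed as an expectation (see Appendix C for the derivation):

$$G(f_1, \ldots, f_d) = \frac{1}{2} \mathbb{E}_{u \sim \rho, \, v \sim P^\pi(\cdot \mid u)} \left[ \sum_{k=1}^{d} (f_k(u) - f_k(v))^2 \right]. \tag{5}$$

Minimizing the objective with stochastic gradient descent is straightforward by sampling transition pairs $(s_t, s_{t+1})$ as $(u, v)$ from the replay buffer. The difficult part is ensuring orthonormality of the functions. To tackle this issue, we first relax the orthonormality constraint to a soft constraint $\sum_{j,k} (\langle f_j, f_k \rangle_{\mathcal{H}} - \delta_{jk})^2 < \epsilon$. Using standard properties of expectations, we rewrite the inequality as follows:

$$\sum_{j,k} \left( \mathbb{E}_{u \sim \rho} [f_j(u) f_k(u)] - \delta_{jk} \right)^2 = \sum_{j,k} \mathbb{E}_{u \sim \rho, \, v \sim \rho} \left[ (f_j(u) f_k(u) - \delta_{jk}) (f_j(v) f_k(v) - \delta_{jk}) \right] < \epsilon.$$

In practice, we transform this constraint into a penalty and solve the unconstrained minimization problem. The resulting penalized graph drawing objective is

$$\tilde{G}(f_1, \ldots, f_d) = G(f_1, \ldots, f_d) + \beta \, \mathbb{E}_{u \sim \rho, \, v \sim \rho} \left[ \sum_{j,k} (f_j(u) f_k(u) - \delta_{jk}) (f_j(v) f_k(v) - \delta_{jk}) \right], \tag{6}$$

where $\beta$ is the Lagrange multiplier.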

The $d$-dimensional embedding $\phi(u) = [f_1(u), \ldots, f_d(u)]$ may be learned using a neural network function approximator. We note that $\tilde{G}$ has a form which appears in many other representation learning objectives, being comprised of an attractive and a repulsive term. The attractive term minimizes the squared distance of embeddings of randomly sampled transitions experienced by the policy $\pi$, while the repulsive term repels the embeddings of states independently sampled from $\rho$. The repulsive term is especially interesting and we are unaware of anything similar to it in other representation learning objectives: It may be interpreted as orthogonalizing the embeddings of two randomly sampled states while regularizing their norm away from zero by noticing

$$\sum_{j,k} (f_j(u) f_k(u) - \delta_{jk}) (f_j(v) f_k(v) - \delta_{jk}) = \left( \phi(u)^\top \phi(v) \right)^2 - \|\phi(u)\|_2^2 - \|\phi(v)\|_2^2 + d. \tag{7}$$
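A mini-batch estimate of the penalized objective can be sketched as below, using the closed form of the repulsive term: the squared inner product of two embeddings, minus both squared norms, plus $d$. This is a NumPy illustration of the loss value only, not the authors' implementation; the function names, batch layout, and the value of `beta` are assumptions, and in practice the embeddings would come from a neural network trained with automatic differentiation.

```python
import numpy as np

def penalized_graph_drawing_loss(phi_u, phi_v, phi_a, phi_b, beta):
    """Mini-batch estimate of the penalized graph drawing objective.

    phi_u, phi_v: embeddings of sampled transition pairs (s_t, s_{t+1}), shape (B, d).
    phi_a, phi_b: embeddings of two independently sampled batches of states, shape (B, d).
    """
    d = phi_u.shape[1]
    attractive = 0.5 * np.mean(np.sum((phi_u - phi_v) ** 2, axis=1))
    dot = np.sum(phi_a * phi_b, axis=1)
    repulsive = np.mean(dot ** 2
                        - np.sum(phi_a ** 2, axis=1)
                        - np.sum(phi_b ** 2, axis=1)
                        + d)
    return float(attractive + beta * repulsive)

def repulsive_explicit(x, y):
    """Explicit double sum over j, k of (x_j x_k - delta_jk)(y_j y_k - delta_jk)."""
    eye = np.eye(len(x))
    return float(np.sum((np.outer(x, x) - eye) * (np.outer(y, y) - eye)))

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(3)
closed = (x @ y) ** 2 - x @ x - y @ y + 3
print(repulsive_explicit(x, y), closed)   # the two values agree
```

The closed form costs $O(d)$ per pair instead of the $O(d^2)$ double sum, which is one reason to prefer it in a training loop.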


4 Related Work

One of the main contributions of our work is a principled treatment of the Laplacian in a general RL setting. While several previous works have proposed the use of the Laplacian in RL (Mahadevan, 2005; Machado et al., 2017a), they have focused on the simple, tabular setting. In contrast, we provide a framework for Laplacian representation learning that applies generally (i.e., when the state space is innumerable and may only be accessed via sampling).

Our main result is showing that the graph drawing objective may be used to stochastically optimize a representation module which approximates the Laplacian eigenfunctions. Although a large body of work exists regarding stochastic approximation of an eigendecomposition (Cardot & Degras, 2018; Oja, 1985), many of these approaches require storage of the entire eigendecomposition. This scales poorly and fails to satisfy the desiderata for model-free RL – a function approximator which yields arbitrary rows of the eigendecomposition. Some works have proposed extensions that avoid this requirement by use of Oja’s rule (Oja, 1982). Originally defined within the Hebbian framework, recent work has applied the rule to kernelized PCA (Xie et al., 2015), and extending it to settings similar to ours is a potential avenue for future work.

In RL, Machado et al. (2017b) propose a method to approximate the Laplacian eigenvectors with function approximators via an equivalence between proto-value functions (Mahadevan, 2005) and spectral decomposition of the successor representation (Stachenfeld et al., 2014). Importantly, they propose an approach for stochastically approximating the eigendecomposition when the state space is large. Unfortunately, their approach is only justified in the tabular setting and, as we show in our results below, does not generalize beyond. Moreover, their eigenvectors are based on an explicit eigendecomposition of a constructed reduced matrix, and thus are not appropriate for online settings.

Approaches more similar to ours (Shaham et al., 2018; Pfau et al., 2018) optimize objectives similar to Eq. 2, but handle the orthonormality constraint differently. Shaham et al. (2018) introduce a special-purpose orthonormalizing layer, which ensures orthonormality at the mini-batch level. Unfortunately, this does not ensure orthonormality over the entire dataset and requires large mini-batches for stability. Furthermore, the orthonormalization process can be numerically unstable, and in our preliminary experiments we found that TensorFlow frequently crashed due to numerical errors from this sort of orthonormalization.

Pfau et al. (2018) turn the problem into an unconstrained optimization objective. However, in their chosen form, one cannot compute unbiased stochastic gradients. Moreover, their approach scales quadratically in the number of embedding dimensions. Our approach does not suffer from these issues.

Finally, we note that our work provides a convincing application of Laplacian representations on difficult RL tasks, namely reward-shaping in continuous-control environments. Although previous works have presented interesting preliminary results, their applications were either restricted to small discrete state spaces (Mahadevan, 2005) or focused on qualitative assessments of the learned options (Machado et al., 2017a, b).

5 Experiments

5.1 Evaluating the Learned Representations

We first evaluate the learned representations by how well they approximate the subspace spanned by the smallest eigenfunctions of the Laplacian. We use the following evaluation protocol: (i) Given an embedding $\phi: S \to \mathbb{R}^d$, we first find its principal $d$-dimensional orthonormal basis $h_1, \ldots, h_d$, onto which we project all embeddings in order to satisfy the orthonormality constraint of the graph drawing objective; (ii) the evaluation metric is then computed as the value of the graph drawing objective using the projected embeddings. In this subsection, we use finite state spaces, so step (i) can be performed by SVD.
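One possible reading of this protocol for a finite state space is sketched below. It assumes that orthonormality is measured under the $\rho$-weighted inner product, so the embedding matrix is rescaled by $\sqrt{\rho}$ before the SVD; this detail, the function name, and the toy graph are assumptions, and the authors' evaluation code may differ.

```python
import numpy as np

def evaluate_embedding(Phi, K, rho):
    """Graph drawing objective of a rho-orthonormal basis spanning the columns of Phi."""
    s = np.sqrt(rho)
    U, _, _ = np.linalg.svd(s[:, None] * Phi, full_matrices=False)
    H = U / s[:, None]                         # basis functions h_1, ..., h_d
    LH = H - K @ (H * rho[:, None])            # apply L = I - D
    return float(np.sum(H * LH * rho[:, None]))

# Hypothetical four-vertex graph with a reversible random walk.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = W.sum(axis=1)
rho = deg / deg.sum()
K = (W / deg[:, None]) / rho[None, :]

d = 2
s = np.sqrt(rho)
mu, G = np.linalg.eigh(s[:, None] * K * s[None, :])
optimum = float(np.sum(1.0 - mu[::-1][:d]))    # sum of the d smallest eigenvalues of L
F_opt = G[:, ::-1][:, :d] / s[:, None]

rng = np.random.default_rng(0)
Phi_good = F_opt @ rng.standard_normal((d, d)) # same subspace, different basis
Phi_rand = rng.standard_normal((4, d))         # unrelated embedding
gap_good = evaluate_embedding(Phi_good, K, rho) - optimum
gap_rand = evaluate_embedding(Phi_rand, K, rho) - optimum
print(gap_good, gap_rand)
```

An embedding that spans the optimal subspace in any basis has a gap of zero, which matches the statement that the eigenfunctions are only identified up to rotation.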

Figure 2: FourRoom Env.

We used a FourRoom gridworld environment (Figure 2). We generate a dataset of experience by randomly sampling transitions using a uniformly random policy with random initial state. We compare the embedding learned by our approximate graph drawing objective against methods proposed by Machado et al. (2017a, b). Machado et al. (2017a) find the first $d$ eigenvectors of the Laplacian by eigen-decomposing a matrix formed by stacked transitions, while Machado et al. (2017b) eigen-decompose a matrix formed by stacked learned successor representations. We evaluate the methods with three different raw state representations of the gridworld: (i) one-hot vectors (“index”), (ii) $(x, y)$ coordinates (“position”) and (iii) top-down pixel representation (“image”).

Figure 3: Evaluation of learned representations. The x-axis shows number of transitions used for training and y-axis shows the gap between the graph drawing objective of the learned representations and the optimal Laplacian-based representations (lower is better). We find our method (graph drawing) more accurately approximates the desired representations than previous methods. See Appendix D for details and additional results.

We present the results of our evaluations in Figure 3. Our method outperforms the previous methods with all three raw representations. Both of the previous methods were justified in the tabular setting; surprisingly, however, they underperform our method even with the tabular representation. Moreover, our method performs well even when the number of training samples is small.

5.2 Laplacian Representation Learning for Reward Shaping

We now move on to demonstrating the power of our learned representations to improve the performance of an RL agent. We focus on a family of tasks – goal-achieving tasks – in which the agent is rewarded for reaching a certain state. We show that in such settings our learned representations are well-suited for reward shaping.

Goal-achieving tasks and reward shaping. A goal-achieving task is defined by an environment with transition dynamics but no reward, together with a goal vector $z_g \in Z$, where $Z$ is the goal space. We assume that there is a known predefined function $h: S \to Z$ that maps any state $s \in S$ to a goal vector $h(s) \in Z$. The learning objective is to train a policy that controls the agent to get to some state $s$ such that $\|h(s) - z_g\| \leq \epsilon$. For example the goal space may be the same as the state space with $Z = S$ and $h(s) = s$ being the identity mapping, in which case the target is a state vector. More generally the goal space can be a subspace of the state space. For example, in control tasks a state vector may contain both position and velocity information while a goal vector may just be a specific position. See Plappert et al. (2018) for an extensive discussion and additional examples.

A reward function needs to be defined in order to apply reinforcement learning to train an agent that can perform a goal-achieving task. Two typical ways of defining a reward function for this family of tasks are (i) the sparse reward: $r_t = -\mathbb{1}[\|h(s_{t+1}) - z_g\| > \epsilon]$ as used by Andrychowicz et al. (2017) and (ii) the shaped reward based on Euclidean distance $r_t = -\|h(s_{t+1}) - z_g\|$ as used by Pong et al. (2018); Nachum et al. (2018). The sparse reward is consistent with what the agent is supposed to do but may slow down learning. The shaped reward may either accelerate or hurt the learning process depending on whether distances in the raw feature space accurately reflect the geometry of the environment dynamics.

Reward shaping with learned representations. We expect that distance-based reward shaping with our learned representations can speed up learning compared to sparse reward while avoiding the bias in the raw feature space. More specifically, we define the reward based on distance in a learned latent space. If the goal space is the same as the state space, i.e. $S = Z$, the reward function can be defined as $r_t = -\|\phi(s_{t+1}) - \phi(z_g)\|$. If $S \neq Z$ we propose two options: (i) The first is to learn an embedding $\phi: Z \to \mathbb{R}^d$ of the goal space and define $r_t = -\|\phi(h(s_{t+1})) - \phi(z_g)\|$. (ii) The second option is to learn an embedding $\phi: S \to \mathbb{R}^d$ of the state space and define $r_t = -\|\phi(s_{t+1}) - \phi(h^{-1}(z_g))\|$, where $h^{-1}(z)$ is defined as picking an arbitrary state $s$ (may not be unique) that achieves $h(s) = z$. We experiment with both options when $S \neq Z$.
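The reward definitions discussed here can be written as small functions. The sketch below is illustrative only: the function names are hypothetical, the inputs are assumed to already be goal-space vectors (or states, for the embedding variant), and `phi` stands for any learned embedding passed in as a callable.

```python
import numpy as np

def sparse_reward(z_next, z_goal, eps):
    """-1 while the goal is farther than eps away, 0 once it is reached."""
    return -float(np.linalg.norm(z_next - z_goal) > eps)

def l2_reward(z_next, z_goal):
    """Shaped reward: negative Euclidean distance in the raw feature space."""
    return -float(np.linalg.norm(z_next - z_goal))

def embedding_reward(phi, x_next, x_goal):
    """Shaped reward: negative Euclidean distance in the learned latent space."""
    return -float(np.linalg.norm(phi(x_next) - phi(x_goal)))

current = np.array([0.0, 0.0])
goal = np.array([3.0, 4.0])
toy_phi = lambda x: 2.0 * x          # stand-in for a learned embedding
print(sparse_reward(current, goal, eps=0.5))
print(l2_reward(current, goal))
print(embedding_reward(toy_phi, current, goal))
```

For the two options with distinct state and goal spaces, the same `embedding_reward` applies: either pass goal vectors $h(s_{t+1})$ and $z_g$ with a goal-space embedding, or pass full states with a state-space embedding.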

5.2.1 GridWorld

We experiment with the gridworld environments with $(x, y)$ coordinates as the observation. We evaluate on three different mazes: OneRoom, TwoRooms and HardMaze, as shown in the top row of Figure 4. The red grids are the goals and the heatmap shows the distances from each grid to the goal in the learned Laplacian embedding space. We can qualitatively see that the learned rewards are well-suited to the task and appropriately reflect the environment dynamics, especially in TwoRooms and HardMaze where the raw feature space is very ill-suited.

Figure 4: Results of reward shaping with a learned Laplacian embedding in GridWorld environments. The top row shows the L2 distance in the learned embedding space. The bottom row shows empirical performance. Our method (mix) can reach optimal performance faster than the baselines, especially in harder mazes. Policies are trained by DQN.

These representations are learned according to our method using a uniformly random behavior policy. Then we define the shaped reward as a half-half mix of the L2 distance in the learned latent space and the sparse reward. We found this mix to be advantageous, as the L2 distance on its own does not provide enough of a gradient in the reward when near the goal. We plot the learning performance of an agent trained according to this learned reward in Figure 4. All plots are based on 5 different random seeds. We compare against (i) sparse: the sparse reward, (ii) l2: the shaped reward based on the L2 distance in the raw feature space, (iii) rawmix: the mixture of (i) and (ii). Our mixture of the shaped reward based on learned representations and the sparse reward is labelled as “mix” in the plots. We observe that in the OneRoom environment all shaped reward functions significantly outperform the sparse reward, which indicates that in goal-achieving tasks a properly shaped reward can accelerate learning of the policy, justifying our motivation of applying learned representations for reward shaping. In the TwoRooms and HardMaze environments, where the raw feature space cannot reflect an accurate distance, our Laplacian-based shaped reward learned using the graph drawing objective (“mix”) significantly outperforms all other reward settings.
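The half-half mix can be sketched as an equal-weight average of the latent-space distance reward and the sparse reward. The text does not state how the two terms are scaled, so the plain average, the tolerance `eps`, and the function name below are assumptions.

```python
import numpy as np

def mixed_reward(phi, s_next, s_goal, eps=1e-6, weight=0.5):
    """Equal-weight mix of the learned-distance reward and the sparse reward."""
    dist = np.linalg.norm(phi(s_next) - phi(s_goal))
    raw_gap = np.linalg.norm(np.asarray(s_next, float) - np.asarray(s_goal, float))
    sparse = 0.0 if raw_gap <= eps else -1.0
    return weight * (-float(dist)) + (1.0 - weight) * sparse

identity_phi = lambda x: np.asarray(x, float)   # stand-in for a learned embedding
print(mixed_reward(identity_phi, (0, 0), (3, 4)))
print(mixed_reward(identity_phi, (3, 4), (3, 4)))
```

The sparse term adds a discrete jump at the goal, which addresses the weak reward gradient near the goal that the distance term alone provides.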

5.2.2 Continuous Control

To further verify the benefit of our learned representations in reward shaping, we also experiment with continuous control navigation tasks. These tasks are much harder to solve than the gridworld tasks because the agent must simultaneously learn to control itself and navigate to the goal. We use MuJoCo (Todorov et al., 2012) to create 3D mazes and learn to control two types of agents, PointMass and Ant, to navigate to a certain area in the maze, as shown in Figure 5. Unlike the gridworld environments the goal space is distinct from the state space, so we apply our two introduced methods to align the spaces: (i) learning to only embed the $(x, y)$ coordinates of the state (mix) or (ii) learning to embed the full state (fullmix). We run experiments with both methods. As shown in Figure 5 both “mix” and “fullmix” outperform all other methods, which further justifies the benefits of using our learned representations for reward shaping. It is interesting to see that both embedding the goal space and embedding the state space still provide a significant advantage even if neither of them is a perfect solution. For goal space embedding, part of the state vector (e.g. velocities) is ignored so the learned embedding may not be able to capture the full structure of the environment dynamics. For state space embedding, constructing the state vector from the goal vector makes achieving the goal more challenging since there is a larger set of states (e.g. with different velocities) that achieve the goal but the shaped reward encourages the policy to reach only one of them. Having a better way to align the two spaces would be an interesting future direction.

Figure 5: Results of reward shaping with a learned Laplacian embedding in continuous control environments. Our learned representations are used by the “mix” and “fullmix” variants (see text for details), whose performance dominates that of all other methods. Policies are trained by DDPG.

6 Conclusion

We have presented an approach to learning a Laplacian-based state representation in RL settings. Our approach is both general – being applicable to any state space regardless of cardinality – and scalable – relying only on the ability to sample mini-batches of states and pairs of states. We have further provided an application of our method to reward shaping in both discrete spaces and continuous-control settings. With our scalable and general approach, many more potential applications of Laplacian-based representations are now within reach, and we encourage future work to continue investigating this promising direction.


Appendix A Existence of smallest eigenvalues of the Laplacian.

Since the Hilbert space $\mathcal{H}$ may have infinitely many dimensions we need to make sure that the smallest $d$ eigenvalues of the Laplacian operator are well defined. Since $L = I - D$, if $\lambda$ is an eigenvalue of $D$ then $1 - \lambda$ is an eigenvalue of $L$. So we turn to discuss the existence of the largest $d$ eigenvalues of $D$. According to our definition $D$ is a compact self-adjoint linear operator on $\mathcal{H}$. So it has the following properties according to the spectral theorem:

  • $D$ has either (i) a finite set of eigenvalues or (ii) countably many eigenvalues $\{\lambda_1, \lambda_2, \ldots\}$ and $\lambda_n \to 0$ if there are infinitely many. All eigenvalues are real.

  • Any eigenvalue $\lambda$ satisfies $-\|D\| \leq \lambda \leq \|D\|$ where $\|\cdot\|$ is the operator norm.

If the operator $D$ has a finite set of $n$ eigenvalues, its largest $d$ eigenvalues exist when $d$ is smaller than $n$.

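If $D$ has an infinite but countable set of eigenvalues we first characterize what the eigenvalues look like:

Let $f_1$ be $f_1(u) = 1$ for all $u \in S$. Then $Df_1(u) = \int_S f_1(v) D(u, v) \, d\rho(v) = \int_S D(u, v) \, d\rho(v) = 1$ for all $u \in S$, thus $Df_1 = f_1$. So $\lambda = 1$ is an eigenvalue of $D$.

Recall that the operator norm is defined as

$$\|D\| = \inf \{ c \geq 0 : \|Df\| \leq c \|f\| \ \ \forall f \in \mathcal{H} \}.$$

Define $q_u$ to be the probability measure such that $\frac{dq_u}{d\rho} = D(u, \cdot)$. We have

$$Df(u)^2 = \mathbb{E}_{v \sim q_u} [f(v)]^2 \leq \mathbb{E}_{v \sim q_u} [f(v)^2] = \int_S f(v)^2 D(u, v) \, d\rho(v)$$

and

$$\|Df\|^2 = \int_S Df(u)^2 \, d\rho(u) \leq \int_S \int_S f(v)^2 D(u, v) \, d\rho(v) \, d\rho(u) = \int_S f(v)^2 \, d\rho(v) = \|f\|^2,$$

which hold for any $f \in \mathcal{H}$. Hence $\|D\| \leq 1$.

So the absolute values of the eigenvalues of $D$ can be written as a non-increasing sequence which converges to $0$ with the largest eigenvalue being $1$. If $d$ is smaller than the number of positive eigenvalues of $D$ then the largest $d$ eigenvalues are guaranteed to exist. Note that this condition for $d$ is stricter than the condition when $D$ has finitely many eigenvalues. We conjecture that this restriction is an artifact of the analysis and in practice using any value of $d$ would be valid when $\mathcal{H}$ has infinite dimensions.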

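Appendix B Defining $D$ for Multi-step Transitions

To introduce a more general definition of $D$, we first introduce a generalized discounted transition distribution $P^\pi_\lambda$ defined by

$$P^\pi_\lambda(v \mid u) = \sum_{\tau=1}^{\infty} (\lambda^{\tau-1} - \lambda^{\tau}) P^\pi(s_{t+\tau} = v \mid s_t = u), \tag{8}$$

where $\lambda \in [0, 1)$ is a discount factor, with $\lambda = 0$ corresponding to the one-step transition distribution $P^\pi_0 = P^\pi$. Notice that $P^\pi_\lambda(v \mid u)$ can be also written as $P^\pi_\lambda(v \mid u) = \mathbb{E}_{\tau \sim q_\lambda} [P^\pi(s_{t+\tau} = v \mid s_t = u)]$ where $q_\lambda(\tau) = \lambda^{\tau-1} - \lambda^{\tau}$. So sampling from $P^\pi_\lambda(v \mid u)$ can be done by first sampling $\tau \sim q_\lambda$ then rolling out the Markov chain for $\tau$ steps starting from $u$.

Note that for readability we state the definition of $P^\pi_\lambda$ in terms of discrete probability distributions but in general $P^\pi_\lambda(\cdot \mid u)$ is defined as a probability measure by stating the discounted sum (8) for any measurable set of states $U \subset S$, instead of a single state $v$.

Also notice that when $\lambda > 0$, sampling $v$ from $P^\pi_\lambda(v \mid u)$ requires rolling out more than one step from $u$ (and the rollout can be arbitrarily long). Given that the replay buffer contains finite length (say $T$) trajectories, sampling exactly from the defined distribution is impossible. In practice, after sampling $u = s_t$ in a trajectory and $\tau$ from $q_\lambda(\tau)$ we discard this sample if $t + \tau > T$.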
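The sampling procedure with discarding can be sketched with the standard library alone. The step count is drawn from a geometric distribution, $q_\lambda(\tau) = \lambda^{\tau-1}(1 - \lambda)$, by repeated coin flips. The function names and the zero-based indexing of the trajectory are assumptions for illustration.

```python
import random

def sample_tau(lam, rng):
    """Draw tau >= 1 with probability lam**(tau - 1) * (1 - lam)."""
    tau = 1
    while rng.random() < lam:
        tau += 1
    return tau

def sample_pair(trajectory, lam, rng):
    """Return (s_t, s_{t+tau}), or None if t + tau runs past the trajectory end."""
    t = rng.randrange(len(trajectory))
    tau = sample_tau(lam, rng)
    if t + tau >= len(trajectory):       # discard: the rollout would leave the trajectory
        return None
    return trajectory[t], trajectory[t + tau]

rng = random.Random(0)
trajectory = list(range(10))             # stand-in for a stored sequence of states
pairs = [sample_pair(trajectory, 0.5, rng) for _ in range(5)]
print(pairs)
```

Discarding biases the sample toward shorter gaps near the end of each trajectory, which is the approximation the text accepts in practice.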

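With the discounted transition distributions, the generalized $D$ is now defined as

$$D(u, v) = \frac{1}{2} \frac{dP^\pi_\lambda(s = v \mid u)}{d\rho(v)} + \frac{1}{2} \frac{dP^\pi_\lambda(s = u \mid v)}{d\rho(u)}.$$

We assume that $P^\pi_\lambda(\cdot \mid u)$ is absolutely continuous with respect to $\rho$ for any $u$ so that the Radon-Nikodym derivatives are well defined. This assumption is mild since it is saying that for any state that is reachable from some state under $P^\pi$ we have a positive probability to sample it from $\rho$, i.e. the behavior policy $\pi$ is able to explore the whole state space (not necessarily efficiently).

Proof of $D(u, \cdot)$ being a density of some probability measure with respect to $\rho$. We need to show that $\int_S D(u, v) \, d\rho(v) = 1$. Let $f(\cdot \mid u)$ be the density function of $P^\pi_\lambda(\cdot \mid u)$ with respect to $\rho$; then $D(u, v) = \frac{1}{2} f(v \mid u) + \frac{1}{2} f(u \mid v)$. According to the definition of $f$ we have $\int_S f(v \mid u) \, d\rho(v) = 1$. It remains to show that $\int_S f(u \mid v) \, d\rho(v) = 1$ for any $u$.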

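First notice that if $\rho$ is the stationary distribution of $P^\pi$ it is also the stationary distribution of $P^\pi_\lambda$ such that $\rho(U) = \int_S P^\pi_\lambda(U \mid v) \, d\rho(v)$ for any measurable $U \subset S$. Let $g(u) = \int_S f(u \mid v) \, d\rho(v)$. For any measurable set $U \subset S$ we have

$$\int_{u \in U} g(u) \, d\rho(u) = \int_{u \in U} \int_{v \in S} f(u \mid v) \, d\rho(v) \, d\rho(u) = \int_{v \in S} \int_{u \in U} f(u \mid v) \, d\rho(u) \, d\rho(v) = \int_{v \in S} P^\pi_\lambda(U \mid v) \, d\rho(v) = \rho(U),$$

(The last equality is the property of the stationary distribution.)

which means that $g$ is the density function of $\rho$ with respect to $\rho$. So $g(u) = 1$ holds for all $u$. (For simplicity we ignore the statement of “almost surely” throughout the paper.)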

Discussion of finite time horizon. Because proving to be a density requires the fact that is the stationary distribution of , the astute reader may suspect that sampling from the replay buffer will differ from the stationary distribution when the initial state distribution is highly concentrated, the mixing rate is slow, and the time horizon is short. In this case, one can adjust the definition of the transition probabilities to better reflect what is happening in practice: Define a new transition distribution by adding a small probability to “reset”: . This introduces a randomized termination to approximate termination of trajectories (e.g., due to a time limit) without adding dependencies on (to retain the Markov property). Then, and can be defined in the same way with respect to . Now the replay buffer can be viewed as rolling out a single long trajectory with so that sampling from the replay buffer approximates sampling from the stationary distribution. Note that under the new definition of , minimizing the graph drawing objective requires sampling state pairs that may span the “reset” transition. In practice, we ignore these pairs as we do not want to view “resets” as edges in RL. When (e.g. ) is small, the chance of sampling these “reset” pairs is very small, so our adjusted definition still approximately reflects what is being done in practice.
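For a tabular chain, the reset construction discussed above can be illustrated with a short sketch. It assumes the adjusted transition is a mixture of the original transition matrix and a jump to the initial state distribution; the names `add_reset`, `rho0`, and `delta` are hypothetical, and the toy chain is only an example.

```python
import numpy as np


def add_reset(P, rho0, delta):
    """Mix a row-stochastic matrix P with a reset to rho0 taken with probability delta."""
    n = P.shape[0]
    return (1.0 - delta) * P + delta * np.outer(np.ones(n), rho0)


def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Approximate the stationary distribution of P by power iteration."""
    rho = np.full(P.shape[0], 1.0 / P.shape[0])
    for _ in range(iters):
        nxt = rho @ P
        if np.abs(nxt - rho).sum() < tol:
            return nxt
        rho = nxt
    return rho


# Toy example: a deterministic 3-state cycle that always starts in state 0.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
rho0 = np.array([1.0, 0.0, 0.0])
P_reset = add_reset(P, rho0, delta=0.1)
rho = stationary_distribution(P_reset)
```

In this toy case the reset also makes the periodic cycle aperiodic, so the power iteration converges to a unique stationary distribution.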

Appendix C Derivation of (5)

(Switching the notation and in the second term gives the same quantity as the first term.)

Appendix D Additional results and experiment details

D.1 Evaluating the Learned Representations

Environment details. The FourRoom gridworld environment, as shown in Figure 2, has 152 discrete states and 4 actions. A tabular (index) state representation is a one-hot vector with 152 dimensions. A position state representation is a two-dimensional vector representing the coordinates, scaled within . An image state representation contains 15 by 15 RGB pixels with different colors representing the agent, walls and open ground. The transitions are deterministic. Each episode starts from a uniformly sampled state and has a length of 50. We use this data to perform representation learning and evaluate the final learned representation using our evaluation protocol.

Implementation of baselines. Both of the approaches in Machado et al. (2017a) and Machado et al. (2017b) output eigenoptions where is the dimension of a state feature space , which can be either the raw representation (Machado et al., 2017a) or a representation learned by a forward prediction model (Machado et al., 2017b). Given the eigenoptions, an embedding can be obtained by letting . Following their theoretical results, it can be seen that if is the one-hot representation of the tabular states and the stacked rows contain a unique enumeration of all transitions/states, then spans the same subspace as the smallest eigenvectors of the Laplacian.

Additional results. Additional results for are shown in Figure 6.

Figure 6: Evaluation of learned representations for .

Choice of in (6). When the optimization problem associated with our objective (6) may be solved exactly, increasing will always lead to better approximations of the exact graph drawing objective (2) as the soft constraint approaches the hard constraint. However, the optimization problem becomes harder to solve with SGD when the value of is too large. We perform an ablation study over to show this trade-off in Figure 7. We can see that the optimal value of increases as increases.
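To make the role of the soft-constraint weight concrete, the following sketch evaluates a penalized graph drawing loss on batches of representations. The specific form is an assumption (an attractive term on sampled transition pairs plus a weighted orthonormality penalty estimated from two independent batches of states), and the names `attractive_term`, `repulsive_term`, `penalized_loss`, and `beta` are hypothetical rather than taken from the authors' code.

```python
import numpy as np


def attractive_term(f_u, f_v):
    """Half the mean squared distance between representations of transition pairs."""
    return 0.5 * np.mean(np.sum((f_u - f_v) ** 2, axis=1))


def repulsive_term(f_a, f_b):
    """Orthonormality penalty from two independent batches of states.

    Uses the identity
      sum_{j,k} (a_j a_k - delta_jk)(b_j b_k - delta_jk)
        = (a . b)^2 - |a|^2 - |b|^2 + d.
    """
    d = f_a.shape[1]
    dot = np.sum(f_a * f_b, axis=1)
    sq_a = np.sum(f_a ** 2, axis=1)
    sq_b = np.sum(f_b ** 2, axis=1)
    return np.mean(dot ** 2 - sq_a - sq_b + d)


def penalized_loss(f_u, f_v, f_a, f_b, beta):
    """Attractive term plus beta times the soft orthonormality penalty."""
    return attractive_term(f_u, f_v) + beta * repulsive_term(f_a, f_b)


rng = np.random.default_rng(0)
f_u, f_v = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))  # transition pairs
f_a, f_b = rng.normal(size=(8, 3)), rng.normal(size=(8, 3))  # independent states
loss = penalized_loss(f_u, f_v, f_a, f_b, beta=1.0)
```

A larger `beta` pushes the minimizer closer to satisfying the orthonormality constraint but increases the scale of the penalty gradient, which is one way to read the trade-off described above.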

Figure 7: Ablation study for the value of . We use and .

Hyperparameters. is defined using one-step transitions ( in (9)). We use , batch size , Adam optimizer with learning rate and total training steps . For representation mappings, we use a linear mapping for index states, a two-hidden-layer fully connected neural network for position states, and a convolutional network for image states. All activation functions are ReLU. The convolutional network contains 3 conv-layers with output channels , kernel sizes , strides and a final linear mapping to representations.

D.2 Laplacian Representation Learning for Reward Shaping

D.2.1 GridWorld

Environment details. All mazes have a total size of by grids, with actions and the total number of states determined by the walls. We use position as the raw state representation. Since the states are discrete, the success criterion is set as reaching the exact grid. Each episode has a length of .

Hyperparameters. For representation learning we use . In the definition of we use the discounted multi-step transitions (9) with . For the approximate graph drawing objective (6) we use and (instead of ) if otherwise to control the scale of L2 distances. We pretrain the representations for steps (this number of steps is not optimized, and we observe that training converges much earlier) using Adam with batch size and learning rate . For policy training, we use vanilla DQN with an online network and a target network. The target network is updated every steps with a mixing rate of (of the current online network with of the previous target network). Epsilon-greedy with is used for exploration. The reward discount is . The policy is trained with the Adam optimizer with learning rate . For both the representation mapping and the Q-functions we use a fully connected network (parameters not shared) with 3 hidden layers and 256 units in each layer. All activation functions are ReLU.
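The target-network mixing and epsilon-greedy exploration mentioned above can be sketched as follows. This is an illustrative sketch, not the authors' training code: parameters are represented as a plain list of arrays, and the names `mix_target`, `epsilon_greedy`, `mix`, and the numeric values in the example are hypothetical.

```python
import numpy as np


def mix_target(target_params, online_params, mix):
    """Blend `mix` of the online parameters with (1 - mix) of the old target parameters."""
    return [(1.0 - mix) * t + mix * o for t, o in zip(target_params, online_params)]


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
online = [np.ones((2, 2)), np.ones(2)]
target = [np.zeros((2, 2)), np.zeros(2)]
target = mix_target(target, online, mix=0.05)   # example value, not from the paper
action = epsilon_greedy(np.array([0.1, 0.7, 0.2, 0.0]), epsilon=0.2, rng=rng)
```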

D.2.2 Continuous Control

Environment details. The PointMass agent has a dimensional state space and a dimensional action space. The Ant agent has a dimensional state space and a dimensional action space. The success criterion is set as reaching an L2 ball centered around a specific position, with radius of the total size of the maze, as shown in Figure 5. Each episode has a length of .
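As a rough illustration of the success criterion and of distance-based reward shaping, consider the sketch below. It assumes the shaped reward is the negative L2 distance between the learned representations of the current state and the goal; the names `is_success`, `shaped_reward`, `radius_fraction`, and all numeric values are hypothetical and not taken from the paper.

```python
import numpy as np


def is_success(position, goal_position, maze_size, radius_fraction):
    """True if the agent is inside the L2 ball of radius radius_fraction * maze_size."""
    distance = np.linalg.norm(np.asarray(position) - np.asarray(goal_position))
    return bool(distance <= radius_fraction * maze_size)


def shaped_reward(phi_state, phi_goal):
    """Negative L2 distance between representations of the state and the goal."""
    return -float(np.linalg.norm(np.asarray(phi_state) - np.asarray(phi_goal)))


reached = is_success([0.0, 0.0], [3.0, 4.0], maze_size=100.0, radius_fraction=0.1)
reward = shaped_reward([0.0, 0.0], [3.0, 4.0])
```

The success test is computed on raw positions, while the shaped reward is computed on representations, so the two functions generally take different inputs for the same state.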

Hyperparameters. For representation learning we use . In the definition of we use the discounted multi-step transitions (9) with for PointMass and for Ant. For the approximate graph drawing objective (6) we use and (instead of ) if otherwise to control the scale of L2 distances. We pretrain the representations for steps using Adam with batch size and learning rate for PointMass and for Ant. For policy training, we use vanilla DDPG with an online network and a target network. The target network is updated every step with a mixing rate of . For exploration we follow the Ornstein-Uhlenbeck process as described in the original DDPG paper. The reward discount is . The policy is trained with the Adam optimizer with batch size , actor learning rate and critic learning rate for PointMass and for Ant. For the representation mapping we use a fully connected network (parameters not shared) with 3 hidden layers and 256 units in each layer. Both the actor network and the critic network have 2 hidden layers with units . All activation functions are ReLU.
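For readers unfamiliar with Ornstein-Uhlenbeck exploration noise, a minimal discretized sketch is given below. The class name `OUNoise` and the default values of `theta`, `sigma`, and `dt` are assumptions chosen for illustration; they are not the settings reported here.

```python
import numpy as np


class OUNoise:
    """Discretized zero-mean Ornstein-Uhlenbeck process for temporally correlated noise."""

    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.dim = dim
        self.theta = theta      # strength of the pull back toward zero (assumed value)
        self.sigma = sigma      # scale of the random perturbation (assumed value)
        self.dt = dt
        self.rng = np.random.default_rng(seed)
        self.x = np.zeros(dim)

    def reset(self):
        """Restart the process at zero, e.g. at the beginning of an episode."""
        self.x = np.zeros(self.dim)

    def sample(self):
        """Advance the process by one step and return the new noise vector."""
        drift = -self.theta * self.x * self.dt
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.standard_normal(self.dim)
        self.x = self.x + drift + diffusion
        return self.x.copy()


noise = OUNoise(dim=2)
noisy_action = np.clip(np.zeros(2) + noise.sample(), -1.0, 1.0)
```

The noise sample is added to the actor's output before clipping to the action bounds; the bounds of plus and minus one in the example are likewise an assumption.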

Figure 8: Results of reward shaping with pretrained-then-fixed vs. online-learned representations.

Online learning of representations. We also present results of learning the representations online instead of pretraining and then fixing them, and observe equivalent performance, as shown in Figure 8, suggesting that our method may be successfully used in online settings. For online training the agent moves faster in the maze during policy learning, so we anneal the in from its initial value to towards the end of training with linear decay. The reason that online training provides no benefit is that our randomized starting position setting enables efficient exploration even with just random-walk policies. Investigating the benefit of online training in exploration-hard tasks would be an interesting future direction.
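The linear annealing mentioned above can be written as a small schedule function. This is a generic sketch; the name `linear_anneal` and the start and end values in the example are hypothetical, since the actual values are not reproduced in this text.

```python
def linear_anneal(step, total_steps, start, end):
    """Linearly interpolate from `start` at step 0 to `end` at `total_steps`."""
    fraction = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + fraction * (end - start)


# Example with placeholder values: decay from 0.9 to 0.5 over 100 steps.
schedule = [linear_anneal(s, 100, start=0.9, end=0.5) for s in (0, 50, 100)]
```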