1 Introduction
The performance of machine learning methods generally depends on the choice of data representation
(Bengio et al., 2013). In reinforcement learning (RL), the choice of state representation may affect generalization (Rafols et al., 2005), exploration (Tang et al., 2017; Pathak et al., 2017), and speed of learning (Dubey et al., 2018). As a motivating example, consider goal-achieving tasks, a class of RL tasks which has recently received significant attention (Andrychowicz et al., 2017; Pong et al., 2018). In such tasks, the agent's task is to achieve a certain configuration in state space; e.g. in Figure 1 the environment is a two-room gridworld and the agent's task is to reach the red cell. A natural reward choice is the negative Euclidean (L2) distance from the goal (e.g., as used in Nachum et al. (2018)). The ability of an RL agent to quickly and successfully solve the task is thus heavily dependent on the representation of the states used to compute the L2 distance. Computing the distance on one-hot (i.e. tabular) representations of the states (equivalent to a sparse reward) is most closely aligned with the task's directive. However, such a representation can be disadvantageous for learning speed, as the agent receives the same reward signal for all non-goal cells. One may instead choose to compute the L2 distance on $(x, y)$ representations of the grid cells. This allows the agent to receive a clear signal which encourages it to move to cells closer to the goal. Unfortunately, this representation is agnostic to the environment dynamics, and in cases where the agent's movement is obstructed (e.g. by a wall as in Figure 1), this choice of reward is likely to cause premature convergence to sub-optimal policies unless sophisticated exploration strategies are used. The ideal reward structure would be defined on state representations whose distances roughly correspond to the ability of the agent to reach one state from another.
Although many such representations are suitable, in this paper we focus on a specific approach based on the graph Laplacian, which is notable for this and several other desirable properties.
For a symmetric weighted graph, the Laplacian is a symmetric matrix with a row and column for each vertex. The eigenvectors of the Laplacian associated with its smallest eigenvalues provide an embedding of each vertex in $\mathbb{R}^d$ which has been found to be especially useful in a variety of applications, such as graph visualization (Koren, 2003), clustering (Ng et al., 2002), and more (Chung & Graham, 1997).
Naturally, the use of the Laplacian in RL has also attracted attention. In an RL setting, the vertices of the graph are given by the states of the environment. For a specific behavior policy, edges between states are weighted by the probability of transitioning from one state to the other (and vice versa). Several previous works have proposed that approximating the eigenvectors of the graph Laplacian can be useful in RL. For example, Mahadevan (2005) shows that using the eigenvectors as basis functions can accelerate learning with policy iteration. Machado et al. (2017a, b) show that the eigenvectors can be used to construct options with exploratory behavior. The Laplacian eigenvectors are also a natural solution to the aforementioned reward-shaping problem. If we use a uniformly random behavior policy, the Laplacian state representations will be appropriately aware of the walls present in the gridworld and will induce an L2 distance as shown in Figure 1 (right). This choice of representation accurately reflects the geometry of the problem, not only providing a strong learning signal at every state, but also avoiding spurious local optima.
While the potential benefits of using Laplacian-based representations in RL are clear, current techniques for approximating or learning the representations are ill-suited for model-free RL. For one, current methods mostly require an eigendecomposition of a matrix. When this matrix is the actual Laplacian (Mahadevan, 2005), the eigendecomposition can easily become prohibitively expensive. Even for methods which perform the eigendecomposition on a reduced matrix (Machado et al., 2017a, b), the eigendecomposition step may be computationally expensive, and furthermore precludes the applicability of the method to stochastic or online settings, which are common in RL. Perhaps more crucially, the justification for many of these methods is made in the tabular setting. The applicability of these methods to more general settings is unclear.
To resolve these limitations, we propose a computationally efficient approach to approximate the eigenvectors of the Laplacian with function approximation based on the spectral graph drawing objective, an objective whose optimum yields the desired eigenvector representations. We present the objective in a fully general RL setting and show how it may be stochastically optimized over mini-batches of sampled experience. We empirically show that our method provides a better approximation to the Laplacian eigenvectors than previous proposals, especially when the raw representation is not tabular. We then apply our representation learning procedure to reward shaping in goal-achieving tasks, and show that our approach outperforms both sparse rewards and rewards based on L2 distance in the raw feature space. Results are shown under a set of gridworld maze environments and difficult continuous control navigation environments.
2 Background
We present the eigendecomposition framework in terms of general Hilbert spaces. By working with Hilbert spaces, we provide a unified treatment of the Laplacian and our method for approximating its eigenvectors (Cayley, 1858) – eigenfunctions in Hilbert spaces (Riesz, 1910) – regardless of the underlying space (discrete or continuous). To simplify the exposition, the reader may substitute the following simplified definitions:

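The state space $S$ is a finite enumerated set $\{1, \dots, |S|\}$.

The inner product $\langle f, g \rangle_{\mathcal{H}}$ of two elements $f, g \in \mathcal{H}$ (vectors in $\mathbb{R}^{|S|}$ representing functions $S \to \mathbb{R}$) is a weighted dot product of the corresponding vectors, with weighting given by a probability distribution $\rho$ over $S$; i.e. $\langle f, g \rangle_{\mathcal{H}} = \sum_{u=1}^{|S|} f(u) g(u) \rho(u)$.

A linear operator is a mapping $A: \mathcal{H} \to \mathcal{H}$ corresponding to a weighted matrix multiplication; i.e. $Af(u) = \sum_{v=1}^{|S|} f(v) A(u, v) \rho(v)$.

A self-adjoint linear operator $A$ is one for which $\langle f, Ag \rangle_{\mathcal{H}} = \langle Af, g \rangle_{\mathcal{H}}$ for all $f, g \in \mathcal{H}$. This corresponds to $A$ being a symmetric matrix.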
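The simplified finite definitions above can be checked numerically. The following is a minimal sketch, assuming a hypothetical three-state space; the values of `rho` and `A` and the helper names `inner` and `apply_op` are illustrative and do not come from the paper.

```python
import numpy as np

# Hypothetical 3-state example; rho and A are illustrative values only.
rho = np.array([0.5, 0.3, 0.2])            # probability distribution over S
A = np.array([[1.0, 2.0, 0.5],
              [2.0, 0.0, 1.5],
              [0.5, 1.5, 3.0]])            # symmetric kernel matrix A(u, v)

def inner(f, g, rho):
    """Weighted dot product <f, g>_H = sum_u f(u) g(u) rho(u)."""
    return float(np.sum(f * g * rho))

def apply_op(A, f, rho):
    """Weighted matrix multiplication (Af)(u) = sum_v f(v) A(u, v) rho(v)."""
    return A @ (f * rho)

f = np.array([1.0, -2.0, 0.5])
g = np.array([0.3, 0.7, -1.0])

lhs = inner(f, apply_op(A, g, rho), rho)   # <f, Ag>_H
rhs = inner(apply_op(A, f, rho), g, rho)   # <Af, g>_H
```

Because `A` is symmetric, `lhs` and `rhs` agree up to floating-point error, which is the self-adjointness property stated in the last item.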
2.1 A Space and a Measure
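We now present the more general form of these definitions. Let $S$ be a set, $\Sigma$ be a $\sigma$-algebra, and $\rho$ be a measure such that $(S, \Sigma, \rho)$ constitutes a measure space. Consider the set of square-integrable real-valued functions $L^2(S, \Sigma, \rho) = \{f: S \to \mathbb{R} \ \text{s.t.} \int_S |f(u)|^2 \, d\rho(u) < \infty\}$. When associated with the inner product,
$$\langle f, g \rangle_{\mathcal{H}} = \int_S f(u) g(u) \, d\rho(u),$$
this set of functions forms a complete inner product space, i.e. a Hilbert space (Hilbert, 1906; Riesz, 1910). The inner product gives rise to a notion of orthogonality: functions $f, g$ are orthogonal if $\langle f, g \rangle_{\mathcal{H}} = 0$. It also induces a norm on the space: $\|f\|^2 = \langle f, f \rangle_{\mathcal{H}}$. We denote $\mathcal{H} = L^2(S, \Sigma, \rho)$ and additionally restrict $\rho$ to be a probability measure, i.e. $\int_S 1 \, d\rho(u) = 1$.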
2.2 The Laplacian
To construct the graph Laplacian in this general setting, we consider linear operators $D: \mathcal{H} \to \mathcal{H}$ which are Hilbert-Schmidt integral operators (Bump, 1998), expressible as
$$Df(u) = \int_S f(v) D(u, v) \, d\rho(v),$$
where with a slight abuse of notation we also use $D: S \times S \to \mathbb{R}^+$ to denote the kernel function. We assume that (i) the kernel function satisfies $D(u, v) = D(v, u)$ for all $u, v \in S$ so that the operator is self-adjoint; (ii) for each $u \in S$, $D(u, \cdot)$ is the Radon-Nikodym derivative (density function) from some probability measure to $\rho$, i.e. $\int_S D(u, v) \, d\rho(v) = 1$ for all $u$. With these assumptions, $D$ is a compact, self-adjoint linear operator, and hence many of the spectral properties associated with standard symmetric matrices extend to $D$.
The Laplacian $L$ of $D$ is defined as the linear operator on $\mathcal{H}$ given by
$$Lf(u) = f(u) - \int_S f(v) D(u, v) \, d\rho(v) = f(u) - Df(u). \tag{1}$$
The Laplacian may also be written as the linear operator $L = I - D$, where $I$ is the identity operator. Any eigenfunction with associated eigenvalue $\lambda$ of the Laplacian is an eigenfunction with eigenvalue $1 - \lambda$ for $D$, and vice versa.
Our goal is to find the first $d$ eigenfunctions $f_1, \dots, f_d$ associated with the smallest $d$ eigenvalues of $L$ (subject to rotation of the basis). The existence of these eigenfunctions is formally discussed in Appendix A. The mapping $\phi: S \to \mathbb{R}^d$ defined by $\phi(u) = [f_1(u), \dots, f_d(u)]$ then defines an embedding or representation of the space $S$.
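For a finite state space, the embedding can be computed directly by an eigendecomposition. The sketch below is an illustration under stated assumptions rather than the method proposed in this paper: the matrix `W` is a made-up symmetric joint distribution used only to build a kernel `D` satisfying assumptions (i) and (ii), and `laplacian_embedding` is a hypothetical helper name. The weighted operator is conjugated by $\sqrt{\rho}$ so that a standard symmetric eigensolver applies.

```python
import numpy as np

# Illustrative symmetric joint distribution over pairs of states (sums to 1).
W = np.array([[0.10, 0.15, 0.05],
              [0.15, 0.10, 0.15],
              [0.05, 0.15, 0.10]])
rho = W.sum(axis=1)                      # marginal distribution over states
D = W / np.outer(rho, rho)               # symmetric, and sum_v D(u, v) rho(v) = 1

def laplacian_embedding(D, rho, d):
    """Return the d smallest eigenvalues of L = I - D and eigenfunctions
    orthonormal under <f, g>_H = sum_u f(u) g(u) rho(u)."""
    sqrt_rho = np.sqrt(rho)
    # D acts as f -> D @ (rho * f); conjugating by sqrt(rho) gives a symmetric matrix.
    M = sqrt_rho[:, None] * D * sqrt_rho[None, :]
    mu, G = np.linalg.eigh(M)            # eigenvalues of D, ascending
    order = np.argsort(-mu)[:d]          # largest of D = smallest of L
    F = G[:, order] / sqrt_rho[:, None]  # map back: f = g / sqrt(rho)
    return 1.0 - mu[order], F            # row u of F is phi(u)

lam, F = laplacian_embedding(D, rho, d=2)
```

Row `u` of `F` plays the role of $\phi(u)$; the first column is the constant function, which has eigenvalue $0$ for $L$.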
2.3 Spectral Graph Drawing
Spectral graph drawing (Koren, 2003) provides an optimization perspective on finding the eigenvectors of the Laplacian. Suppose we have a large graph, composed of (possibly infinitely many) vertices with weighted edges representing pairwise (non-negative) affinities (denoted by $D(u, v) \ge 0$ for vertices $u$ and $v$). To visualize the graph, we would like to embed each vertex in a low dimensional space (e.g., $\mathbb{R}^d$ in this work) so that pairwise distances in the low dimensional space are small for vertices with high affinity. Using our notation, the graph drawing objective is to find a set of orthonormal functions $f_1, \dots, f_d$ defined on the space $S$ which minimize
$$G(f_1, \dots, f_d) = \frac{1}{2} \int_S \int_S \sum_{k=1}^{d} \left(f_k(u) - f_k(v)\right)^2 D(u, v) \, d\rho(u) \, d\rho(v). \tag{2}$$
The orthonormal constraints can be written as $\langle f_j, f_k \rangle_{\mathcal{H}} = \delta_{jk}$ for all $j, k \in [1, d]$ where $\delta_{jk} = 1$ if $j = k$ and $\delta_{jk} = 0$ otherwise.
The graph drawing objective (2) may be expressed more succinctly in terms of the Laplacian:
$$G(f_1, \dots, f_d) = \sum_{k=1}^{d} \langle f_k, L f_k \rangle_{\mathcal{H}}. \tag{3}$$
The minimum value of (3) is the sum of the $d$ smallest eigenvalues of $L$. Accordingly, the minimum is achieved when $f_1, \dots, f_d$ span the same subspace as the corresponding $d$ eigenfunctions. In the next section, we will show that the graph drawing objective is amenable to stochastic optimization, thus providing a general, scalable approach to approximating the eigenfunctions of the Laplacian.
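The equivalence of the pairwise form (2) and the Laplacian form (3) can be verified numerically in the finite case. This is a minimal sketch with assumed inputs: the kernel is built from a random symmetric joint distribution, and the function names `objective_pairwise` and `objective_laplacian` are hypothetical. The equality holds for any functions, not only orthonormal ones, given assumptions (i) and (ii) on the kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 4))
W = A + A.T
W /= W.sum()                             # illustrative symmetric joint distribution
rho = W.sum(axis=1)
D = W / np.outer(rho, rho)               # symmetric kernel with sum_v D(u, v) rho(v) = 1

def objective_pairwise(F, D, rho):
    """Eq. (2): 1/2 sum_{u,v} sum_k (f_k(u) - f_k(v))^2 D(u, v) rho(u) rho(v)."""
    diff = F[:, None, :] - F[None, :, :]         # shape (|S|, |S|, d)
    sq = np.sum(diff ** 2, axis=-1)
    return 0.5 * float(np.sum(sq * D * np.outer(rho, rho)))

def objective_laplacian(F, D, rho):
    """Eq. (3): sum_k <f_k, L f_k>_H with (L f)(u) = f(u) - sum_v f(v) D(u, v) rho(v)."""
    LF = F - D @ (rho[:, None] * F)
    return float(np.sum(F * LF * rho[:, None]))

F = rng.normal(size=(4, 2))              # arbitrary functions f_1, f_2 as columns
g2 = objective_pairwise(F, D, rho)
g3 = objective_laplacian(F, D, rho)
```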
3 Representation Learning with the Laplacian
In this section, we specify the meaning of the Laplacian in the RL setting (i.e., how to set $\rho$ and $D$ appropriately). We then elaborate on how to approximate the eigenfunctions of the Laplacian by optimizing the graph drawing objective via stochastic gradient descent on sampled states and pairs of states.
3.1 The Laplacian in a Reinforcement Learning Setting
In RL, an agent interacts with an environment by observing states and acting on the environment. We consider the standard MDP setting (Puterman, 1990). Briefly, at time $t$ the environment produces an observation $s_t \in S$, which at time $t = 0$ is determined by a random sample from an environment-specific initial distribution $P_0$. The agent's policy produces a probability distribution over possible actions $\pi(a | s_t)$ from which it samples a specific action $a_t \in A$ to act on the environment. The environment then yields a reward $r_t$ sampled from an environment-specific reward distribution function $R(s_t, a_t)$, and transitions to a subsequent state $s_{t+1}$ sampled from an environment-specific transition distribution function $P(s_t, a_t)$. We consider defining the Laplacian with respect to a fixed behavior policy $\pi$. Then, the transition distributions $P^\pi(s_{t+1} | s_t)$ form a Markov chain. We assume this Markov chain has a unique stationary distribution.
We now introduce a choice of $\rho$ and $D$ for the Laplacian in the RL setting. We define $\rho$ to be the stationary distribution of the Markov chain such that for any measurable $U \subset S$ we have $\rho(U) = \int_S P^\pi(U | v) \, d\rho(v)$.
As $D(u, v)$ represents the pairwise affinity between two vertices $u$ and $v$ on the graph, it is natural to define $D(u, v)$ in terms of the transition distribution. (The one-step transitions can be generalized to multi-step transitions in the definition of $D$, which provide better performance for RL applications in our experiments; see Appendix B for details.) Recall that $D$ needs to satisfy (i) $D(u, v) = D(v, u)$ and (ii) $D(u, \cdot)$ is the density function from a probability measure to $\rho$ for all $u$. We define
$$D(u, v) = \frac{1}{2} \frac{dP^\pi(s | u)}{d\rho(s)}\bigg|_{s=v} + \frac{1}{2} \frac{dP^\pi(s | v)}{d\rho(s)}\bigg|_{s=u}, \tag{4}$$
which satisfies these conditions ($D(u, v) = D(v, u)$ follows from the definition; see Appendix B for a proof that $D(u, \cdot)$ is a density). In other words, the affinity between states $u$ and $v$ is the average of the two-way transition probabilities: if $S$ is finite then the first term in (4) is $\frac{1}{2} P^\pi(s_{t+1} = v | s_t = u) / \rho(v)$ and the second term is $\frac{1}{2} P^\pi(s_{t+1} = u | s_t = v) / \rho(u)$.
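In the finite case, $\rho$ and $D$ can be built explicitly from a transition matrix. The following sketch assumes a hypothetical three-state chain `P` (rows are $P^\pi(\cdot | u)$; the numbers are illustrative only) and checks the two required properties of $D$. The helper names are not from the paper.

```python
import numpy as np

# Hypothetical transition matrix of the Markov chain induced by the behavior policy.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

def stationary_distribution(P):
    """Solve rho P = rho with sum(rho) = 1 (assumes a unique stationary distribution)."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(A, b, rcond=None)[0]

def affinity(P, rho):
    """Finite-state form of Eq. (4): D(u, v) = P(v|u) / (2 rho(v)) + P(u|v) / (2 rho(u))."""
    return 0.5 * P / rho[None, :] + 0.5 * P.T / rho[:, None]

rho = stationary_distribution(P)
D = affinity(P, rho)
```

Symmetry of `D` holds by construction, while the density condition $\sum_v D(u, v) \rho(v) = 1$ relies on stationarity of `rho`.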
3.2 Approximating the Laplacian Eigenfunctions
Given this definition of the Laplacian, we now aim to learn the eigendecomposition embedding $\phi$. In the model-free RL context, we have access to states and pairs of states (or sequences of states) only via sampling; i.e. we may sample states $u$ from $\rho(u)$ and pairs of $u, v$ from $\rho(u) P^\pi(v | u)$. This imposes several challenges on computing the eigendecomposition:

Enumerating the state space $S$ may be intractable due to the large cardinality or continuity.

For arbitrary pairs of states $(u, v)$, we do not have explicit access to $D(u, v)$.

Enforcing exact orthonormality of $f_1, \dots, f_d$ may be intractable in innumerable state spaces.
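With our choices for $\rho$ and $D$, the graph drawing objective (Eq. 2) is a good start for resolving these challenges because it can be expressed as an expectation (see Appendix C for the derivation):
$$G(f_1, \dots, f_d) = \frac{1}{2} \mathbb{E}_{u \sim \rho, \, v \sim P^\pi(\cdot | u)} \left[ \sum_{k=1}^{d} \left(f_k(u) - f_k(v)\right)^2 \right]. \tag{5}$$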
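The agreement between the integral form (2) and the expectation form (5) can be checked exactly on a small chain. This is a sketch under assumptions: the transition matrix `P` is a made-up example, and the expectation is computed by exact summation over $\rho(u) P^\pi(v | u)$ rather than by sampling.

```python
import numpy as np

# Hypothetical 3-state chain; values are illustrative only.
P = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
n = P.shape[0]

# Stationary distribution: solve rho P = rho with sum(rho) = 1.
A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
b = np.append(np.zeros(n), 1.0)
rho = np.linalg.lstsq(A, b, rcond=None)[0]

# Finite-state kernel from Eq. (4).
D = 0.5 * P / rho[None, :] + 0.5 * P.T / rho[:, None]

rng = np.random.default_rng(0)
F = rng.normal(size=(n, 2))                                  # arbitrary f_1, f_2
sq = np.sum((F[:, None, :] - F[None, :, :]) ** 2, axis=-1)   # sq[u, v] = sum_k (f_k(u) - f_k(v))^2

g_eq2 = 0.5 * float(np.sum(sq * D * np.outer(rho, rho)))     # Eq. (2)
g_eq5 = 0.5 * float(np.sum(rho[:, None] * P * sq))           # Eq. (5), exact expectation
```

The two values coincide because the squared difference is symmetric in $(u, v)$, so the two halves of $D$ contribute equally.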
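Minimizing the objective with stochastic gradient descent is straightforward by sampling transition pairs $(s_t, s_{t+1})$ as $(u, v)$ from the replay buffer. The difficult part is ensuring orthonormality of the functions. To tackle this issue, we first relax the orthonormality constraint to a soft constraint $\sum_{j,k} \left(\langle f_j, f_k \rangle_{\mathcal{H}} - \delta_{jk}\right)^2 < \epsilon$. Using standard properties of expectations, we rewrite the inequality as follows:
$$\sum_{j,k} \left(\mathbb{E}_{u \sim \rho}\left[f_j(u) f_k(u)\right] - \delta_{jk}\right)^2 = \sum_{j,k} \mathbb{E}_{u \sim \rho, \, v \sim \rho}\left[\left(f_j(u) f_k(u) - \delta_{jk}\right)\left(f_j(v) f_k(v) - \delta_{jk}\right)\right] < \epsilon,$$
where $u$ and $v$ are sampled independently. In practice, we transform this constraint into a penalty and solve the unconstrained minimization problem. The resulting penalized graph drawing objective is
$$\tilde{G}(f_1, \dots, f_d) = G(f_1, \dots, f_d) + \beta \, \mathbb{E}_{u \sim \rho, \, v \sim \rho}\left[\sum_{j,k} \left(f_j(u) f_k(u) - \delta_{jk}\right)\left(f_j(v) f_k(v) - \delta_{jk}\right)\right], \tag{6}$$
where $\beta$ is the Lagrange multiplier.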
The $d$-dimensional embedding $\phi(u) = [f_1(u), \dots, f_d(u)]$ may be learned using a neural network function approximator. We note that $\tilde{G}$ has a form which appears in many other representation learning objectives, being comprised of an attractive and a repulsive term. The attractive term minimizes the squared distance of embeddings of randomly sampled transitions experienced by the policy $\pi$, while the repulsive term repels the embeddings of states independently sampled from $\rho$. The repulsive term is especially interesting and we are unaware of anything similar to it in other representation learning objectives: it may be interpreted as orthogonalizing the embeddings of two randomly sampled states while regularizing their norm away from zero by noticing
$$\sum_{j,k} \left(f_j(u) f_k(u) - \delta_{jk}\right)\left(f_j(v) f_k(v) - \delta_{jk}\right) = \left(\phi(u)^\top \phi(v)\right)^2 - \|\phi(u)\|_2^2 - \|\phi(v)\|_2^2 + d. \tag{7}$$
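A mini-batch estimate of the penalized objective (6) can be sketched as follows, using the closed form (7) for the repulsive term. This is an illustrative NumPy version operating on already-computed embeddings; the function names, argument layout, and batch construction are assumptions, and in practice the same expression would be evaluated inside an automatic differentiation framework so that gradients flow to the network producing $\phi$.

```python
import numpy as np

def repulsive_term(phi_a, phi_b):
    """Per-sample right-hand side of Eq. (7) for embeddings of shape (B, d)."""
    d = phi_a.shape[1]
    dots = np.sum(phi_a * phi_b, axis=1)
    return dots ** 2 - np.sum(phi_a ** 2, axis=1) - np.sum(phi_b ** 2, axis=1) + d

def graph_drawing_loss(phi_u, phi_v, phi_a, phi_b, beta):
    """Mini-batch estimate of Eq. (6).

    phi_u, phi_v: embeddings of sampled transition pairs (s_t, s_{t+1}).
    phi_a, phi_b: embeddings of two independently sampled batches of states.
    """
    attractive = 0.5 * np.mean(np.sum((phi_u - phi_v) ** 2, axis=1))
    return float(attractive + beta * np.mean(repulsive_term(phi_a, phi_b)))

# Check identity (7) against the explicit double sum over j, k.
rng = np.random.default_rng(0)
a, b = rng.normal(size=4), rng.normal(size=4)
eye = np.eye(4)
double_sum = float(np.sum((np.outer(a, a) - eye) * (np.outer(b, b) - eye)))
closed_form = float(repulsive_term(a[None, :], b[None, :])[0])
```

Using the closed form avoids materializing the $d \times d$ outer products for every sample in the batch.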
4 Related Work
One of the main contributions of our work is a principled treatment of the Laplacian in a general RL setting. While several previous works have proposed the use of the Laplacian in RL (Mahadevan, 2005; Machado et al., 2017a), they have focused on the simple, tabular setting. In contrast, we provide a framework for Laplacian representation learning that applies generally (i.e., when the state space is innumerable and may only be accessed via sampling).
Our main result is showing that the graph drawing objective may be used to stochastically optimize a representation module which approximates the Laplacian eigenfunctions. Although a large body of work exists regarding stochastic approximation of an eigendecomposition (Cardot & Degras, 2018; Oja, 1985), many of these approaches require storage of the entire eigendecomposition. This scales poorly and fails to satisfy the desiderata for model-free RL: a function approximator which yields arbitrary rows of the eigendecomposition. Some works have proposed extensions that avoid this requirement by use of Oja's rule (Oja, 1982). Originally defined within the Hebbian framework, recent work has applied the rule to kernelized PCA (Xie et al., 2015), and extending it to settings similar to ours is a potential avenue for future work.
In RL, Machado et al. (2017b) propose a method to approximate the Laplacian eigenvectors with function approximators via an equivalence between proto-value functions (Mahadevan, 2005) and spectral decomposition of the successor representation (Stachenfeld et al., 2014). Importantly, they propose an approach for stochastically approximating the eigendecomposition when the state space is large. Unfortunately, their approach is only justified in the tabular setting and, as we show in our results below, does not generalize beyond it. Moreover, their eigenvectors are based on an explicit eigendecomposition of a constructed reduced matrix, and thus are not appropriate for online settings.
Approaches more similar to ours (Shaham et al., 2018; Pfau et al., 2018) optimize objectives similar to Eq. 2, but handle the orthonormality constraint differently. Shaham et al. (2018) introduce a special-purpose orthonormalizing layer, which ensures orthonormality at the mini-batch level. Unfortunately, this does not ensure orthonormality over the entire dataset and requires large mini-batches for stability. Furthermore, the orthonormalization process can be numerically unstable, and in our preliminary experiments we found that TensorFlow frequently crashed due to numerical errors from this sort of orthonormalization.
Pfau et al. (2018) turn the problem into an unconstrained optimization objective. However, in their chosen form, one cannot compute unbiased stochastic gradients. Moreover, their approach scales quadratically in the number of embedding dimensions. Our approach does not suffer from these issues.
Finally, we note that our work provides a convincing application of Laplacian representations on difficult RL tasks, namely reward-shaping in continuous-control environments. Although previous works have presented interesting preliminary results, their applications were either restricted to small discrete state spaces (Mahadevan, 2005) or focused on qualitative assessments of the learned options (Machado et al., 2017a, b).
5 Experiments
5.1 Evaluating the Learned Representations
We first evaluate the learned representations by how well they approximate the subspace spanned by the smallest $d$ eigenfunctions of the Laplacian. We use the following evaluation protocol: (i) Given an embedding $\phi: S \to \mathbb{R}^d$, we first find its principal $d$-dimensional orthonormal basis $h_1, \dots, h_d$, onto which we project all embeddings in order to satisfy the orthonormality constraint of the graph drawing objective; (ii) the evaluation metric is then computed as the value of the graph drawing objective using the projected embeddings. In this subsection, we use finite state spaces, so step (i) can be performed by SVD.
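One possible reading of this protocol for a finite state space is sketched below. It is an assumption-laden illustration, not the evaluation code used for the reported results: orthonormality is taken with respect to the $\rho$-weighted inner product, the kernel `D` is built from a random symmetric joint distribution, and `evaluate_embedding` is a hypothetical name.

```python
import numpy as np

def evaluate_embedding(Phi, D, rho):
    """Orthonormalize the columns of Phi (shape |S| x d) by SVD, then return
    the graph drawing objective sum_k <h_k, L h_k>_H of the resulting basis."""
    sqrt_rho = np.sqrt(rho)
    # SVD in sqrt(rho)-scaled coordinates, where the weighted inner product
    # becomes the ordinary dot product.
    U, _, _ = np.linalg.svd(sqrt_rho[:, None] * Phi, full_matrices=False)
    H = U / sqrt_rho[:, None]                 # columns orthonormal under rho
    LH = H - D @ (rho[:, None] * H)           # apply L = I - D
    return float(np.sum(H * LH * rho[:, None]))

# Illustrative kernel on 5 states.
rng = np.random.default_rng(0)
A = rng.random((5, 5))
W = A + A.T
W /= W.sum()
rho = W.sum(axis=1)
D = W / np.outer(rho, rho)

random_score = evaluate_embedding(rng.normal(size=(5, 2)), D, rho)
```

Lower values are better: the score is bounded below by the sum of the $d$ smallest eigenvalues of $L$, and the bound is attained when the embedding spans the same subspace as the corresponding eigenfunctions.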
We used a FourRoom gridworld environment (Figure 2). We generate a dataset of experience by randomly sampling transitions using a uniformly random policy with random initial state. We compare the embedding learned by our approximate graph drawing objective against methods proposed by Machado et al. (2017a, b). Machado et al. (2017a) find the first $d$ eigenvectors of the Laplacian by eigendecomposing a matrix formed by stacked transitions, while Machado et al. (2017b) eigendecompose a matrix formed by stacked learned successor representations. We evaluate the methods with three different raw state representations of the gridworld: (i) one-hot vectors ("index"), (ii) $(x, y)$ coordinates ("position") and (iii) top-down pixel representation ("image").
We present the results of our evaluations in Figure 3. Our method outperforms the previous methods with all three raw representations. Both of the previous methods were justified in the tabular setting; surprisingly, however, they underperform our method even with the tabular representation. Moreover, our method performs well even when the number of training samples is small.
5.2 Laplacian Representation Learning for Reward Shaping
We now move on to demonstrating the power of our learned representations to improve the performance of an RL agent. We focus on a family of tasks, goal-achieving tasks, in which the agent is rewarded for reaching a certain state. We show that in such settings our learned representations are well-suited for reward shaping.
Goal-achieving tasks and reward shaping. A goal-achieving task is defined by an environment with transition dynamics but no reward, together with a goal vector $z_g \in Z$, where $Z$ is the goal space. We assume that there is a known predefined function $h: S \to Z$ that maps any state $s \in S$ to a goal vector $h(s) \in Z$. The learning objective is to train a policy that controls the agent to get to some state $s$ such that $\|h(s) - z_g\| \le \epsilon$. For example, the goal space may be the same as the state space with $Z = S$ and $h(s) = s$ being the identity mapping, in which case the target is a state vector. More generally, the goal space can be a subspace of the state space. For example, in control tasks a state vector may contain both position and velocity information while a goal vector may just be a specific position. See Plappert et al. (2018) for an extensive discussion and additional examples.
A reward function needs to be defined in order to apply reinforcement learning to train an agent that can perform a goal-achieving task. Two typical ways of defining a reward function for this family of tasks are (i) the sparse reward $r_t = -\mathbb{1}\left[\|h(s_{t+1}) - z_g\| > \epsilon\right]$, as used by Andrychowicz et al. (2017), and (ii) the shaped reward based on Euclidean distance $r_t = -\|h(s_{t+1}) - z_g\|$, as used by Pong et al. (2018) and Nachum et al. (2018). The sparse reward is consistent with what the agent is supposed to do but may slow down learning. The shaped reward may either accelerate or hurt the learning process depending on whether distances in the raw feature space accurately reflect the geometry of the environment dynamics.
Reward shaping with learned representations. We expect that distance-based reward shaping with our learned representations can speed up learning compared to sparse reward while avoiding the bias in the raw feature space. More specifically, we define the reward based on distance in a learned latent space. If the goal space is the same as the state space, i.e. $S = Z$, the reward function can be defined as $r_t = -\|\phi(s_{t+1}) - \phi(z_g)\|$. If $S \ne Z$ we propose two options: (i) The first is to learn an embedding $\phi: Z \to \mathbb{R}^d$ of the goal space and define $r_t = -\|\phi(h(s_{t+1})) - \phi(z_g)\|$. (ii) The second option is to learn an embedding $\phi: S \to \mathbb{R}^d$ of the state space and define $r_t = -\|\phi(s_{t+1}) - \phi(h^{-1}(z_g))\|$, where $h^{-1}(z)$ is defined as picking an arbitrary state $s$ (which may not be unique) that achieves $h(s) = z$. We experiment with both options when $S \ne Z$.
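The three reward definitions discussed above can be written as small functions. This is a hedged sketch: the function names are hypothetical, inputs are assumed to be already mapped to the relevant space (e.g. $h(s_{t+1})$ for the first two), and `toy_phi` is a stand-in for a learned embedding used only so the example runs.

```python
import numpy as np

def sparse_reward(z_next, z_goal, eps):
    """r_t = -1[ ||h(s_{t+1}) - z_g|| > eps ]; z_next is h(s_{t+1})."""
    return -float(np.linalg.norm(z_next - z_goal) > eps)

def l2_reward(z_next, z_goal):
    """Shaped reward r_t = -||h(s_{t+1}) - z_g|| in the raw feature space."""
    return -float(np.linalg.norm(z_next - z_goal))

def embedding_reward(phi, x_next, x_goal):
    """Shaped reward r_t = -||phi(x_next) - phi(x_goal)|| in the learned latent space."""
    return -float(np.linalg.norm(phi(x_next) - phi(x_goal)))

def toy_phi(x):
    """Placeholder embedding (an assumption); a learned network would go here."""
    return 2.0 * np.asarray(x)

goal = np.zeros(2)
r_sparse = sparse_reward(np.array([3.0, 4.0]), goal, eps=0.1)
r_l2 = l2_reward(np.array([3.0, 4.0]), goal)
r_emb = embedding_reward(toy_phi, np.array([3.0, 4.0]), goal)
```

For option (i) one would pass $h(s_{t+1})$ and $z_g$ to `embedding_reward`; for option (ii), $s_{t+1}$ and a state chosen as $h^{-1}(z_g)$.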
5.2.1 GridWorld
We experiment with the gridworld environments with $(x, y)$ coordinates as the observation. We evaluate on three different mazes: OneRoom, TwoRooms and HardMaze, as shown in the top row of Figure 4. The red grid cells are the goals and the heatmap shows the distances from each grid cell to the goal in the learned Laplacian embedding space. We can qualitatively see that the learned rewards are well-suited to the task and appropriately reflect the environment dynamics, especially in TwoRooms and HardMaze where the raw feature space is very ill-suited.
These representations are learned according to our method using a uniformly random behavior policy. Then we define the shaped reward as a half-half mix of the L2 distance in the learned latent space and the sparse reward. We found this mix to be advantageous, as the L2 distance on its own does not provide enough of a gradient in the reward when near the goal. We plot the learning performance of an agent trained according to this learned reward in Figure 4. All plots are based on 5 different random seeds. We compare against (i) sparse: the sparse reward, (ii) l2: the shaped reward based on the L2 distance in the raw feature space, (iii) rawmix: the mixture of (i) and (ii). Our mixture of the shaped reward based on learned representations and the sparse reward is labelled as "mix" in the plots. We observe that in the OneRoom environment all shaped reward functions significantly outperform the sparse reward, which indicates that in goal-achieving tasks a properly shaped reward can accelerate learning of the policy, justifying our motivation of applying learned representations for reward shaping. In the TwoRooms and HardMaze environments, where the raw feature space cannot reflect an accurate distance, our Laplacian-based shaped reward learned using the graph drawing objective ("mix") significantly outperforms all other reward settings.
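The half-half mix can be sketched as a convex combination of the two rewards. The equal weighting `w = 0.5` follows the description above; the function name, the boolean `reached` flag, and the absence of any rescaling of the distance term are assumptions made for illustration.

```python
import numpy as np

def mixed_reward(phi_next, phi_goal, reached, w=0.5):
    """w * (negative L2 distance in the learned space) + (1 - w) * sparse reward.

    phi_next, phi_goal: embeddings of the next state and of the goal.
    reached: True if the goal condition ||h(s) - z_g|| <= eps holds.
    """
    shaped = -float(np.linalg.norm(phi_next - phi_goal))
    sparse = 0.0 if reached else -1.0
    return w * shaped + (1.0 - w) * sparse

r_far = mixed_reward(np.array([3.0, 4.0]), np.zeros(2), reached=False)
r_goal = mixed_reward(np.zeros(2), np.zeros(2), reached=True)
```

The sparse component contributes a fixed step in reward on reaching the goal, which is where the distance term alone changes only slightly.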
5.2.2 Continuous Control
To further verify the benefit of our learned representations in reward shaping, we also experiment with continuous control navigation tasks. These tasks are much harder to solve than the gridworld tasks because the agent must simultaneously learn to control itself and navigate to the goal. We use MuJoCo (Todorov et al., 2012) to create 3D mazes and learn to control two types of agents, PointMass and Ant, to navigate to a certain area in the maze, as shown in Figure 5. Unlike the gridworld environments, the $(x, y)$ goal space is distinct from the state space, so we apply our two introduced methods to align the spaces: (i) learning to only embed the $(x, y)$ coordinates of the state (mix) or (ii) learning to embed the full state (fullmix). We run experiments with both methods. As shown in Figure 5, both "mix" and "fullmix" outperform all other methods, which further justifies the benefits of using our learned representations for reward shaping. It is interesting to see that both embedding the goal space and embedding the state space still provide a significant advantage even if neither of them is a perfect solution. For goal space embedding, part of the state vector (e.g. velocities) is ignored so the learned embedding may not be able to capture the full structure of the environment dynamics. For state space embedding, constructing the state vector from the goal vector makes achieving the goal more challenging since there is a larger set of states (e.g. with different velocities) that achieve the goal but the shaped reward encourages the policy to reach only one of them. Having a better way to align the two spaces would be an interesting future direction.
6 Conclusion
We have presented an approach to learning a Laplacian-based state representation in RL settings. Our approach is both general (applicable to any state space regardless of cardinality) and scalable (relying only on the ability to sample mini-batches of states and pairs of states). We have further provided an application of our method to reward shaping in both discrete spaces and continuous-control settings. With our scalable and general approach, many more potential applications of Laplacian-based representations are now within reach, and we encourage future work to continue investigating this promising direction.
References
Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
Bump (1998) D. Bump. Automorphic Forms and Representations. Cambridge University Press, 1998. ISBN 9780521658188. URL https://books.google.com/books?id=QQ1cr7B6XqQC.
Cardot & Degras (2018) Hervé Cardot and David Degras. Online principal component analysis in high dimension: Which algorithm to choose? International Statistical Review, 86(1):29–50, 2018.
Cayley (1858) Arthur Cayley. A memoir on the theory of matrices. Philosophical Transactions of the Royal Society of London, 148:17–37, 1858.
Chung & Graham (1997) Fan RK Chung and Fan Chung Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
Dubey et al. (2018) Rachit Dubey, Pulkit Agrawal, Deepak Pathak, Thomas L Griffiths, and Alexei A Efros. Investigating human priors for playing video games. ICML, 2018.
Hilbert (1906) David Hilbert. Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen. Vierte Mitteilung. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse, 1906:157–228, 1906.
Koren (2003) Yehuda Koren. On spectral graph drawing. In International Computing and Combinatorics Conference, pp. 496–508. Springer, 2003.
Machado et al. (2017a) Marlos C Machado, Marc G Bellemare, and Michael Bowling. A Laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956, 2017a.
Machado et al. (2017b) Marlos C Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. Eigenoption discovery through the deep successor representation. arXiv preprint arXiv:1710.11089, 2017b.
Mahadevan (2005) Sridhar Mahadevan. Proto-value functions: Developmental reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, pp. 553–560. ACM, 2005.
Nachum et al. (2018) Ofir Nachum, Shane Gu, Honglak Lee, and Sergey Levine. Data-efficient hierarchical reinforcement learning. arXiv preprint arXiv:1805.08296, 2018.
Ng et al. (2002) Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pp. 849–856, 2002.
Oja (1982) Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.
Oja (1985) Erkki Oja. On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. 1985.
Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning (ICML), volume 2017, 2017.
Pfau et al. (2018) David Pfau, Stig Petersen, Ashish Agarwal, David Barrett, and Kim Stachenfeld. Spectral inference networks: Unifying spectral methods with deep learning. arXiv preprint arXiv:1806.02215, 2018.
Plappert et al. (2018) Matthias Plappert, Marcin Andrychowicz, Alex Ray, Bob McGrew, Bowen Baker, Glenn Powell, Jonas Schneider, Josh Tobin, Maciek Chociej, Peter Welinder, et al. Multi-goal reinforcement learning: Challenging robotics environments and request for research. arXiv preprint arXiv:1802.09464, 2018.
Pong et al. (2018) Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081, 2018.
Puterman (1990) Martin L Puterman. Markov decision processes. Handbooks in Operations Research and Management Science, 2:331–434, 1990.
Rafols et al. (2005) Eddie J Rafols, Mark B Ring, Richard S Sutton, and Brian Tanner. Using predictive representations to improve generalization in reinforcement learning. In IJCAI, pp. 835–840, 2005.
Riesz (1910) Friedrich Riesz. Untersuchungen über Systeme integrierbarer Funktionen. Mathematische Annalen, 69(4):449–497, 1910.
Shaham et al. (2018) Uri Shaham, Kelly Stanton, Henry Li, Boaz Nadler, Ronen Basri, and Yuval Kluger. SpectralNet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587, 2018.
Stachenfeld et al. (2014) Kimberly L Stachenfeld, Matthew Botvinick, and Samuel J Gershman. Design principles of the hippocampal cognitive map. In Advances in Neural Information Processing Systems, pp. 2528–2536, 2014.
Tang et al. (2017) Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2753–2762, 2017.
Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
Xie et al. (2015) Bo Xie, Yingyu Liang, and Le Song. Scale up nonlinear component analysis with doubly stochastic gradients. In Advances in Neural Information Processing Systems, pp. 2341–2349, 2015.
Appendix A Existence of smallest eigenvalues of the Laplacian.
Since the Hilbert space $\mathcal{H}$ may have infinitely many dimensions, we need to make sure that the smallest $d$ eigenvalues of the Laplacian operator are well defined. Since $L = I - D$, if $\lambda$ is an eigenvalue of $D$ then $1 - \lambda$ is an eigenvalue of $L$. So we turn to discuss the existence of the largest $d$ eigenvalues of $D$. According to our definition, $D$ is a compact self-adjoint linear operator on $\mathcal{H}$. So it has the following properties according to the spectral theorem:

$D$ has either (i) a finite set of eigenvalues or (ii) countably many eigenvalues $\{\lambda_1, \lambda_2, \dots\}$ and $\lambda_n \to 0$ if there are infinitely many. All eigenvalues are real.

Any eigenvalue $\lambda$ satisfies $-\|D\| \le \lambda \le \|D\|$ where $\|\cdot\|$ is the operator norm.
If the operator $D$ has a finite set of $n$ eigenvalues, its largest $d$ eigenvalues exist when $d$ is smaller than $n$.
If has a infinite but countable set of eigenvalues we first characterize what the eigenvalues look like:
Let be for all . Then for all thus . So is an eigenvalue of .
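Recall that the operator norm is defined as
$$\|D\| = \inf \{ c \ge 0 : \|Df\| \le c \|f\| \ \text{for all } f \in \mathcal{H} \}.$$
Define $q_u$ to be the probability measure such that $\frac{dq_u}{d\rho} = D(u, \cdot)$. We have
$$\big(Df(u)\big)^2 = \big(\mathbb{E}_{v \sim q_u}[f(v)]\big)^2 \le \mathbb{E}_{v \sim q_u}\big[f(v)^2\big] = \int_S f(v)^2 D(u, v) \, d\rho(v)$$
and
$$\|Df\|^2 = \int_S \big(Df(u)\big)^2 \, d\rho(u) \le \int_S \int_S f(v)^2 D(u, v) \, d\rho(v) \, d\rho(u) = \int_S f(v)^2 \, d\rho(v) = \|f\|^2,$$
which hold for any $f \in \mathcal{H}$. Hence $\|D\| \le 1$.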
So the absolute values of the eigenvalues of $D$ can be written as a non-increasing sequence that converges to $0$, with the largest eigenvalue being $1$. If $d$ is smaller than the number of positive eigenvalues of $D$, then the largest $d$ eigenvalues are guaranteed to exist. Note that this condition on $d$ is stricter than the condition when $D$ has finitely many eigenvalues. We conjecture that this restriction is an artifact of the analysis and that in practice any value of $d$ would be valid when $\mathcal{H}$ has infinite dimensions.
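The two facts used in this argument (the constant function is an eigenfunction with eigenvalue 1, and the operator norm is at most 1) can be checked numerically on a finite graph. The sketch below is only an illustration under stated assumptions: the 4-vertex weight matrix `W` is hypothetical, $\rho$ is taken to be the stationary distribution of the random walk on that graph, and the kernel is the discrete analogue $D(u, v) = P(v \mid u) / \rho(v)$.

```python
import random

# Hypothetical 4-vertex symmetric weighted graph (illustrative values only).
W = [
    [0.0, 1.0, 2.0, 0.0],
    [1.0, 0.0, 1.0, 1.0],
    [2.0, 1.0, 0.0, 3.0],
    [0.0, 1.0, 3.0, 0.0],
]
n = len(W)
deg = [sum(row) for row in W]
total = sum(deg)
rho = [d / total for d in deg]  # stationary distribution of the random walk

# Discrete kernel D(u, v) = P(v | u) / rho(v); symmetric because W is symmetric.
D = [[(W[u][v] / deg[u]) / rho[v] for v in range(n)] for u in range(n)]


def apply_D(f):
    """(Df)(u) = sum_v f(v) D(u, v) rho(v)."""
    return [sum(f[v] * D[u][v] * rho[v] for v in range(n)) for u in range(n)]


def norm(f):
    """Norm induced by the rho-weighted inner product."""
    return sum(f[u] ** 2 * rho[u] for u in range(n)) ** 0.5


# The constant function 1 is mapped to itself, so 1 is an eigenvalue of D.
D_ones = apply_D([1.0] * n)

# ||Df|| / ||f|| stays at or below 1 for randomly drawn test functions f.
rng = random.Random(0)
ratios = []
for _ in range(1000):
    f = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ratios.append(norm(apply_D(f)) / norm(f))
max_ratio = max(ratios)
```

A random search of this kind only illustrates the bound; it does not replace the proof.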
Appendix B Defining $D$ for Multi-step Transitions
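To introduce a more general definition of $D$, we first introduce a generalized discounted transition distribution $P^\pi_\lambda$ defined by
$$P^\pi_\lambda(v \mid u) = \sum_{\tau=1}^{\infty} \big(\lambda^{\tau-1} - \lambda^{\tau}\big) \, P(s_{t+\tau} = v \mid s_t = u; \pi), \tag{8}$$
where $\lambda \in [0, 1)$ is a discount factor, with $\lambda = 0$ corresponding to the one-step transition distribution $P^\pi_0 = P^\pi$. Notice that $P^\pi_\lambda(v \mid u)$ can also be written as $P^\pi_\lambda(v \mid u) = \mathbb{E}_{\tau \sim q_\lambda}\big[P(s_{t+\tau} = v \mid s_t = u; \pi)\big]$, where $q_\lambda(\tau) = \lambda^{\tau-1} - \lambda^{\tau}$. So sampling from $P^\pi_\lambda(\cdot \mid u)$ can be done by first sampling $\tau \sim q_\lambda$ and then rolling out the Markov chain for $\tau$ steps starting from $u$.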
Note that for readability we state the definition of $P^\pi_\lambda$ in terms of discrete probability distributions, but in general $P^\pi_\lambda(\cdot \mid u)$ is defined as a probability measure by stating the discounted sum (8) for any measurable set of states $U \subseteq S$, instead of a single state $v$.
Also notice that when $\lambda > 0$, sampling $v$ from $P^\pi_\lambda(\cdot \mid u)$ requires rolling out more than one step from $u$ (and the rollout can be arbitrarily long). Given that the replay buffer contains finite-length (say $T$) trajectories, sampling exactly from the defined distribution is impossible. In practice, after sampling $u = s_t$ in a trajectory and $\tau$ from $q_\lambda$, we discard this sample if $t + \tau > T$.
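The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `sample_tau` and `sample_pair` are hypothetical, a trajectory is assumed to be a plain Python list of states, and a sample is discarded whenever the rollout index would fall past the end of that list.

```python
import random


def sample_tau(lam, rng):
    """Sample tau >= 1 with P(tau) = lam**(tau - 1) - lam**tau (a geometric law)."""
    tau = 1
    while rng.random() < lam:  # continue with probability lam
        tau += 1
    return tau


def sample_pair(trajectory, lam, rng):
    """Return (u, v) with v reached tau steps after u, or None if the sample is discarded."""
    t = rng.randrange(len(trajectory))
    tau = sample_tau(lam, rng)
    if t + tau >= len(trajectory):
        return None  # the rollout would run past the stored trajectory
    return trajectory[t], trajectory[t + tau]


if __name__ == "__main__":
    rng = random.Random(0)
    states = list(range(20))  # stand-in trajectory s_0, ..., s_19
    pairs = [sample_pair(states, 0.9, rng) for _ in range(5)]
    print(pairs)
```

With `lam = 0` the loop never continues, so `tau` is always 1 and the pairs are one-step transitions. Discarding long rollouts biases the sample toward short gaps when trajectories are short relative to $1 / (1 - \lambda)$.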
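With the discounted transition distributions, the generalized $D$ is now defined as
$$D(u, v) = \frac{1}{2} \frac{dP^\pi_\lambda(s \mid u)}{d\rho(s)} \bigg|_{s=v} + \frac{1}{2} \frac{dP^\pi_\lambda(s \mid v)}{d\rho(s)} \bigg|_{s=u}. \tag{9}$$
We assume that $P^\pi_\lambda(\cdot \mid u)$ is absolutely continuous with respect to $\rho$ for any $u$, so that the Radon-Nikodym derivatives are well defined. This assumption is mild: it says that any state that is reachable from some state under $P^\pi$ has a positive probability of being sampled from $\rho$, i.e. the behavior policy $\pi$ is able to explore the whole state space (not necessarily efficiently).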
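Proof of $D(u, \cdot)$ being a density of some probability measure with respect to $\rho$. We need to show that $\int_S D(u, v) \, d\rho(v) = 1$. Let $f(\cdot \mid u)$ be the density function of $P^\pi_\lambda(\cdot \mid u)$ with respect to $\rho$; then $D(u, v) = \frac{1}{2} f(v \mid u) + \frac{1}{2} f(u \mid v)$. According to the definition of $f$ we have $\int_S f(v \mid u) \, d\rho(v) = 1$. It remains to show that $\int_S f(u \mid v) \, d\rho(v) = 1$ for any $u$.
First notice that if $\rho$ is the stationary distribution of $P^\pi$, it is also the stationary distribution of $P^\pi_\lambda$, such that $\rho(U) = \int_S P^\pi_\lambda(U \mid v) \, d\rho(v)$ for any measurable $U \subseteq S$. Let $g(u) = \int_S f(u \mid v) \, d\rho(v)$. For any measurable set $U \subseteq S$ we have
$$\int_U g(u) \, d\rho(u) = \int_S \int_U f(u \mid v) \, d\rho(u) \, d\rho(v) = \int_S P^\pi_\lambda(U \mid v) \, d\rho(v) = \rho(U) \quad \text{(property of the stationary distribution)},$$
which means that $g$ is the density function of $\rho$ with respect to $\rho$. So $g(u) = 1$ holds for all $u$. (For simplicity we ignore the statement of “almost surely” throughout the paper.)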
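For a finite state space the density property can be verified directly. The sketch below is an illustration under stated assumptions: the 3-state transition matrix `P` and the discount `lam` are hypothetical, the infinite sum in (8) is truncated, and $\rho$ is obtained by power iteration. It builds the discrete analogue of (9) and checks that $\sum_v D(u, v) \rho(v) = 1$ for every $u$, even though the chain is not reversible.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


# Hypothetical non-reversible 3-state chain; row u holds P(. | u).
P = [
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
    [0.5, 0.3, 0.2],
]
n = len(P)
lam = 0.9

# Truncated discounted sum: sum over tau of (lam^(tau-1) - lam^tau) * P^tau.
P_lam = [[0.0] * n for _ in range(n)]
P_tau = [row[:] for row in P]
for tau in range(1, 400):
    weight = lam ** (tau - 1) - lam ** tau
    for u in range(n):
        for v in range(n):
            P_lam[u][v] += weight * P_tau[u][v]
    P_tau = matmul(P_tau, P)

# Stationary distribution of P by power iteration (rho <- rho P).
rho = [1.0 / n] * n
for _ in range(2000):
    rho = [sum(rho[u] * P[u][v] for u in range(n)) for v in range(n)]

# Discrete analogue of the symmetrized kernel.
D = [[0.5 * P_lam[u][v] / rho[v] + 0.5 * P_lam[v][u] / rho[u] for v in range(n)]
     for u in range(n)]

# Total mass of D(u, .) under rho, one entry per state u.
mass = [sum(D[u][v] * rho[v] for v in range(n)) for u in range(n)]
```

The second half of each `mass[u]` equals 1 only because `rho` is stationary for `P_lam`, which is the step the proof relies on.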
Discussion of finite time horizon. Because proving $D(u, \cdot)$ to be a density requires the fact that $\rho$ is the stationary distribution of $P^\pi$, the astute reader may suspect that sampling from the replay buffer will differ from the stationary distribution when the initial state distribution is highly concentrated, the mixing rate is slow, and the time horizon is short. In this case, one can adjust the definition of the transition probabilities to better reflect what is happening in practice: define a new transition distribution by adding a small probability $\delta$ to “reset”: $\tilde{P}^\pi(v \mid u) = (1 - \delta) P^\pi(v \mid u) + \delta P_0(v)$, where $P_0$ is the initial state distribution. This introduces a randomized termination to approximate termination of trajectories (e.g., due to a time limit) without adding dependencies on $t$ (to retain the Markov property). Then $\rho$ and $D$ can be defined in the same way with respect to $\tilde{P}^\pi$. Now the replay buffer can be viewed as rolling out a single long trajectory with $\tilde{P}^\pi$, so that sampling from the replay buffer approximates sampling from the stationary distribution. Note that under the new definition of $D$, minimizing the graph drawing objective requires sampling state pairs that may span the “reset” transition. In practice, we ignore these pairs, as we do not want to view “resets” as edges in RL. When $\delta$ (e.g. ) is small, the chance of sampling these “reset” pairs is very small, so our adjusted definition still approximately reflects what is being done in practice.
Appendix C Derivation of (5)
(Switching the notation $u$ and $v$ in the second term gives the same quantity as the first term.)
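The steps behind this remark can be sketched as follows. The sketch assumes that (5) is the expectation form of the graph drawing objective over $d$ functions $f_1, \dots, f_d$, and that $D$ is the one-step kernel, i.e. (9) with $\lambda = 0$. The symbol names and the exact statement of (5) are assumptions of this sketch.

```latex
\begin{align*}
\sum_{k=1}^{d} \langle f_k, L f_k \rangle_{\mathcal{H}}
&= \frac{1}{2} \int_S \int_S \sum_{k=1}^{d} \big(f_k(u) - f_k(v)\big)^2 D(u, v) \, d\rho(u) \, d\rho(v) \\
&= \frac{1}{4} \int_S \int_S \sum_{k=1}^{d} \big(f_k(u) - f_k(v)\big)^2
   \frac{dP^\pi(s \mid u)}{d\rho(s)} \bigg|_{s=v} d\rho(v) \, d\rho(u) \\
&\quad + \frac{1}{4} \int_S \int_S \sum_{k=1}^{d} \big(f_k(u) - f_k(v)\big)^2
   \frac{dP^\pi(s \mid v)}{d\rho(s)} \bigg|_{s=u} d\rho(u) \, d\rho(v) \\
&= \frac{1}{2} \int_S \int_S \sum_{k=1}^{d} \big(f_k(u) - f_k(v)\big)^2 \, dP^\pi(v \mid u) \, d\rho(u) \\
&= \frac{1}{2} \, \mathbb{E}_{u \sim \rho, \, v \sim P^\pi(\cdot \mid u)}
   \bigg[ \sum_{k=1}^{d} \big(f_k(u) - f_k(v)\big)^2 \bigg].
\end{align*}
```

The first equality uses the symmetry of $D$ and $\int_S D(u, v) \, d\rho(v) = 1$. The third uses the fact that the two quarter-weighted terms coincide after renaming $u$ and $v$.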
Appendix D Additional results and experiment details
D.1 Evaluating the Learned Representations
Environment details. The FourRoom gridworld environment, as shown in Figure 2, has 152 discrete states and 4 actions. A tabular (index) state representation is a one-hot vector with 152 dimensions. A position state representation is a two-dimensional vector representing the coordinates, scaled within . An image state representation contains 15 by 15 RGB pixels with different colors representing the agent, walls and open ground. The transitions are deterministic. Each episode starts from a uniformly sampled state and has a length of 50. We use this data to perform representation learning and evaluate the final learned representation using our evaluation protocol.
Implementation of baselines. Both of the approaches in Machado et al. (2017a) and Machado et al. (2017b) output eigenoptions where is the dimension of a state feature space , which can be either the raw representation (Machado et al., 2017a) or a representation learned by a forward prediction model (Machado et al., 2017b). Given the eigenoptions, an embedding can be obtained by letting . Following their theoretical results, it can be seen that if is the one-hot representation of the tabular states and the stacked rows contains unique enumeration of all transitions/states spans the same subspace as the smallest eigenvectors of the Laplacian.
Additional results. Additional results for are shown in Figure 6.
Choice of $\beta$ in (6). When the optimization problem associated with our objective (6) can be solved exactly, increasing $\beta$ always leads to a better approximation of the exact graph drawing objective (2), as the soft constraint approaches the hard constraint. However, the optimization problem becomes harder to solve by SGD when the value of $\beta$ is too large. We perform an ablation study over $\beta$ to show this tradeoff in Figure 7. We can see that the optimal value of $\beta$ increases as $d$ increases.
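To make the role of the penalty weight concrete, here is a single-sample sketch of a soft-constrained objective. It assumes that (6) has the form "attractive term on a transition pair plus a weight times an orthonormality penalty on two independently sampled states". That form, the function name `approx_graph_drawing_loss`, and all argument names are assumptions of this sketch, not a statement of the authors' code.

```python
def approx_graph_drawing_loss(f_u, f_v, f_a, f_b, beta, delta=1.0):
    """Single-sample estimate of a soft-constrained graph drawing loss.

    f_u, f_v: d-dimensional representations of a transition pair (u, v).
    f_a, f_b: representations of two independently sampled states.
    beta:     weight on the orthonormality penalty (assumed form).
    delta:    target value on the diagonal (1 for exact orthonormality).
    """
    d = len(f_u)
    # Attractive term: pulls representations of connected states together.
    attractive = 0.5 * sum((f_u[k] - f_v[k]) ** 2 for k in range(d))
    # Penalty term: product of two independent estimates of the constraint violation.
    penalty = 0.0
    for j in range(d):
        for k in range(d):
            target = delta if j == k else 0.0
            penalty += (f_a[j] * f_a[k] - target) * (f_b[j] * f_b[k] - target)
    return attractive + beta * penalty


if __name__ == "__main__":
    same = [1.0, 0.0]
    for beta in (0.0, 1.0, 10.0):
        print(beta, approx_graph_drawing_loss(same, same, same, same, beta))
```

In this form a larger weight makes the constraint term dominate the gradient, which is one way to read the tradeoff described above.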
Hyperparameters. $D$ is defined using one-step transitions ($\lambda = 0$ in (9)). We use , batch size , Adam optimizer with learning rate and total training steps . For representation mappings, we use a linear mapping for index states, a two-hidden-layer fully connected neural network for position states and a convolutional network for image states. All activation functions are ReLU. The convolutional network contains 3 convolutional layers with output channels , kernel sizes , strides , and a final linear mapping to representations.
D.2 Laplacian Representation Learning for Reward Shaping
D.2.1 GridWorld
Environment details. All mazes have a total size of by grids, with actions and total number of states decided by the walls. We use position as raw state representations. Since the states are discrete, the success criterion is set as reaching the exact grid. Each episode has a length of .
Hyperparameters. For representation learning we use . In the definition of $D$ we use the discounted multi-step transitions (9) with . For the approximate graph drawing objective (6) we use and (instead of ) if otherwise to control the scale of L2 distances. We pretrain the representations for steps with Adam, using batch size and learning rate (this number of steps is not optimized, and we observe that training converges much earlier). For policy training, we use vanilla DQN with an online network and a target network. The target network is updated every steps with a mixing rate of (of the current online network with of the previous target network). Epsilon-greedy with is used for exploration. Reward discount is . The policy is trained with the Adam optimizer with learning rate . For both the representation mapping and the Q-functions we use a fully connected network (parameters not shared) with 3 hidden layers and 256 units in each layer. All activation functions are ReLU.
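The target-network update described above (a periodic blend of the online and target parameters) can be sketched as below. The function names, the flat-list parameter format, and the example values of `period` and `mix` are hypothetical. The update period and mixing rate actually used are the ones stated in the hyperparameters.

```python
def soft_update(target_params, online_params, mix):
    """Blend parameters: new_target = mix * online + (1 - mix) * old_target."""
    return [mix * o + (1.0 - mix) * t for t, o in zip(target_params, online_params)]


def maybe_update_target(step, period, target_params, online_params, mix):
    """Apply the blend only on steps that are multiples of `period`."""
    if step % period == 0:
        return soft_update(target_params, online_params, mix)
    return target_params


if __name__ == "__main__":
    target, online = [0.0, 2.0], [1.0, 4.0]
    for step in range(1, 11):
        # Placeholder values: update every 5 steps with mixing rate 0.25.
        target = maybe_update_target(step, 5, target, online, 0.25)
    print(target)
```

A mixing rate of 1 recovers a hard copy of the online network, and updating every step with a small mixing rate recovers the Polyak-style averaging commonly used with DDPG.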
D.2.2 Continuous Control
Environment details. The PointMass agent has a dimensional state space and a dimensional action space. The Ant agent has a dimensional state space and a dimensional action space. The success criterion is set as reaching an L2 ball centered around a specific position with the radius as of the total size of the maze, as shown in Figure 5. Each episode has a length of .
Hyperparameters. For representation learning we use . In the definition of $D$ we use the discounted multi-step transitions (9) with for PointMass and for Ant. For the approximate graph drawing objective (6) we use and (instead of ) if otherwise to control the scale of L2 distances. We pretrain the representations for steps with Adam, using batch size and learning rate for PointMass and for Ant. For policy training, we use vanilla DDPG with an online network and a target network. The target network is updated every step with a mixing rate of . For exploration we follow the Ornstein-Uhlenbeck process as described in the original DDPG paper. Reward discount is . The policy is trained with the Adam optimizer with batch size , actor learning rate and critic learning rate for PointMass and for Ant. For the representation mapping we use a fully connected network (parameters not shared) with 3 hidden layers and 256 units in each layer. Both the actor network and the critic network have 2 hidden layers with units . All activation functions are ReLU.
Online learning of representations. We also present results of learning the representations online instead of pretraining and then fixing them, and observe equivalent performance, as shown in Figure 8, suggesting that our method may be successfully used in online settings. For online training the agent moves faster in the maze during policy learning, so we anneal the $\lambda$ in $D$ from its initial value to towards the end of training with linear decay. The reason that online training provides no benefit is that our randomized starting position setting enables efficient exploration even with just random walk policies. Investigating the benefit of online training in exploration-hard tasks would be an interesting future direction.
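A linear decay schedule of the kind mentioned above can be sketched as follows. The function name and the start, end, and horizon values in the usage example are placeholders, not the values used in the experiments.

```python
def linear_anneal(step, total_steps, start, end):
    """Interpolate linearly from `start` at step 0 to `end` at `total_steps`, then hold."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


if __name__ == "__main__":
    # Placeholder schedule: discount annealed from 0.9 to 0.1 over 100 steps.
    for step in (0, 25, 50, 75, 100, 150):
        print(step, linear_anneal(step, 100, 0.9, 0.1))
```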