Towards Better Laplacian Representation in Reinforcement Learning with Generalized Graph Drawing

07/12/2021 · Kaixin Wang, et al.

The Laplacian representation has recently attracted increasing attention in reinforcement learning, as it provides a succinct and informative representation of states by taking the eigenvectors of the Laplacian matrix of the state-transition graph as state embeddings. Such a representation captures the geometry of the underlying state space and benefits RL tasks such as option discovery and reward shaping. To approximate the Laplacian representation in large (or even continuous) state spaces, recent works propose minimizing a spectral graph drawing objective, which however has infinitely many global minimizers other than the eigenvectors. As a result, their learned Laplacian representation may differ from the ground truth. To solve this problem, we reformulate the graph drawing objective into a generalized form and derive a new learning objective, which is proved to have the eigenvectors as its unique global minimizer. It enables learning high-quality Laplacian representations that faithfully approximate the ground truth. We validate this via comprehensive experiments on a set of gridworld and continuous control environments. Moreover, we show that our learned Laplacian representations lead to more exploratory options and better reward shaping.


1 Introduction

Reinforcement learning (RL) aims to train an agent that takes proper sequential actions based on states perceived from the environment (Sutton and Barto, 2018). The quality of state representations is thus important to agent performance, as it affects generalization ability (Zhang et al., 2018; Stooke et al., 2020; Agarwal et al., 2021), exploration ability (Pathak et al., 2017; Machado et al., 2017, 2020), and learning efficiency (Dubey et al., 2018; Wu et al., 2019), among others. Recently, the Laplacian representation has received increasing attention (Mahadevan, 2005; Machado et al., 2017; Wu et al., 2019; Jinnai et al., 2019). It views the states and transitions in an RL environment as nodes and edges in a graph, and forms a d-dimensional state representation from the d smallest eigenvectors of the graph Laplacian. Such representations can capture the geometry of the underlying state space, as illustrated in Fig. 1(b), which greatly benefits RL in option discovery (Machado et al., 2017; Jinnai et al., 2019) and reward shaping (Wu et al., 2019).

Figure 1: Visualization of the environment and Laplacian state representations (2nd dimension). (a) Top view of a continuous control navigation environment. (b) Ground-truth Laplacian representation. It encodes the geometry of the environment: nearby states have similar values while distant states have dissimilar values. (c) Our learned representation, which closely matches the ground truth. (d) Representation learned by spectral graph drawing (Wu et al., 2019). It significantly diverges from the ground truth and fails to capture geometric information about the state space. Best viewed in color.

However, computing the exact Laplacian representation is challenging: directly computing the eigendecomposition of the graph Laplacian requires access to the environment transition dynamics and involves expensive matrix operations, so it is largely limited to small finite state spaces. To deal with large (or even continuous) state spaces, previous works resort to approximation methods (Machado et al., 2017, 2018; Wu et al., 2019). The most efficient one is that of Wu et al. (2019), which minimizes a spectral graph drawing objective (Koren, 2005). However, this objective has infinitely many global minimizers besides the ground-truth Laplacian representation (i.e., the d smallest eigenvectors), as it is invariant to an arbitrary orthogonal transformation over the optimization variables. The resulting representations can therefore correspond to other minimizers and diverge from the ground truth, and thus fail to encode the geometry of the state space as expected (see Fig. 1(d)).

To break such invariance and approximate the Laplacian representation more closely, we reformulate the graph drawing objective into a generalized form by introducing a coefficient for each of its terms. By assigning decreasing values to these coefficients, we derive a training objective that breaks the undesired invariance. We provide theoretical guarantees that, under mild assumptions, the proposed objective has the d smallest eigenvectors as its unique global minimizer. As shown in Fig. 1(c), minimizing the new objective ensures a faithful approximation to the ground-truth Laplacian representation.

To verify the effectiveness of our method for learning high-quality Laplacian representations, we conduct experiments in gridworld and continuous control environments. We show that the representations learned by our method approximate the ground truth more accurately than those obtained from the graph drawing objective. Furthermore, we apply the learned representations to two downstream RL tasks. For the option discovery task (Machado et al., 2017), our method leads to options that are more exploratory than those built on the representation learned by graph drawing; for the reward shaping task (Wu et al., 2019), our learned representation accelerates the agent's learning better than previous work (Wu et al., 2019).

The rest of the paper is organized as follows. In Sec. 2, we introduce background on RL and the Laplacian representation. In Sec. 3, we propose our new objective for learning the Laplacian representation. In Sec. 4, we conduct experiments to demonstrate that the proposed objective learns high-quality representations. Sec. 5 reviews related works and Sec. 6 concludes the paper.

2 Background

2.1 Reinforcement Learning

In the RL framework (Sutton and Barto, 2018), an agent interacts with an environment by observing states and taking actions, with the aim of maximizing cumulative reward. We consider the Markov Decision Process (MDP) formalism in this paper. An MDP can be described by a 5-tuple (S, A, P, R, γ). Specifically, at time t the agent observes state s_t ∈ S and takes an action a_t ∈ A. The environment then yields a reward signal r_t sampled from the reward function R(s_t, a_t). The state observation at the next timestep, s_{t+1}, is sampled according to an environment-specific transition distribution P(·|s_t, a_t). A policy π is defined as a mapping that returns an action a given a state s. The goal of the agent is to learn an optimal policy π* that maximizes the expected cumulative reward:

π* = argmax_{π ∈ Π} E_π[ Σ_{t≥0} γ^t r_t ],    (1)

where Π denotes the policy space and γ ∈ [0, 1) is the discount factor.

2.2 Laplacian Representation in RL

By considering states and transitions in an MDP as nodes and edges in a graph, the Laplacian state representation is formed from the d smallest eigenvectors of the graph Laplacian. Specifically, each eigenvector (of length |S|) corresponds to one dimension of the Laplacian representation for all states. Formally, we denote the graph as G = (S, E), where E is the edge set consisting of transitions between states. The graph Laplacian of G is defined as L = D − A, where A is the adjacency matrix of G and D is the degree matrix (Chung and Graham, 1997). We denote the i-th smallest eigenvalue of L as λ_i, and the corresponding unit eigenvector as e_i. The d-dimensional Laplacian representation of a state s is φ(s) = (e_1[s], e_2[s], …, e_d[s]), where e_i[s] denotes the entry in vector e_i that corresponds to state s. In particular, e_1 is a normalized all-ones vector, so e_1[s] has the same value for all s.
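For concreteness, the following NumPy sketch computes the ground-truth Laplacian representation of a small finite state graph by eigendecomposing L = D − A. It is our own minimal illustration of the definition above (the toy chain graph and function names are ours, not code from the paper).

```python
import numpy as np

def laplacian_representation(adjacency: np.ndarray, d: int) -> np.ndarray:
    """Ground-truth Laplacian representation for a small finite state graph.

    adjacency: symmetric |S| x |S| adjacency matrix of the state-transition graph.
    d: number of smallest eigenvectors to keep.
    Returns an |S| x d matrix whose row s is the representation phi(s).
    """
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency                 # L = D - A
    _, eigvecs = np.linalg.eigh(laplacian)         # eigenvalues returned in ascending order
    return eigvecs[:, :d]                          # columns approximate e_1, ..., e_d

# Toy example: a 4-state chain s0 - s1 - s2 - s3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
phi = laplacian_representation(A, d=2)
print(phi)  # the first column is the constant eigenvector e_1 (up to sign)
```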

The Laplacian representation is known to be able to capture the geometry of the underlying state space (Mahadevan, 2005; Machado et al., 2017), and thus has been applied in option discovery (Machado et al., 2017; Jinnai et al., 2019) and reward shaping (Wu et al., 2019).

In the Laplacian framework for option discovery (Machado et al., 2017), each dimension i of the Laplacian representation defines an intrinsic reward function r_i(s, s') = e_i[s'] − e_i[s]. The options (Sutton et al., 1999) are discovered by maximizing the cumulative discounted intrinsic reward. These options act at different time scales; that is, when an agent follows an option, the length of its trajectory until termination varies across different options (see Fig. 3 in (Machado et al., 2017)). Such a property makes these options helpful for exploration: longer options enable agents to quickly reach distant areas, while shorter options ensure sufficient exploration in local areas.

When using the Laplacian representation for reward shaping in goal-achieving tasks (Wu et al., 2019), the reward is shaped based on Euclidean distance, as in (Pong* et al., 2018; Nachum et al., 2018). Specifically, the pseudo-reward is defined as the negative distance between the agent's state s and the goal state z in representation space: r̃(s) = −‖φ(s) − φ(z)‖_2. Since the Laplacian representation reflects the geometry of the environment dynamics, such a pseudo-reward can help accelerate the learning process.
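As a hedged illustration (the function and variable names below are ours, not the paper's), this distance-based pseudo-reward can be computed from any learned representation network as follows:

```python
import numpy as np

def shaped_reward(phi, state, goal):
    """Pseudo-reward: negative L2 distance between state and goal in representation space.

    phi: callable mapping an observation to its d-dimensional Laplacian representation.
    """
    return -float(np.linalg.norm(phi(state) - phi(goal)))
```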

2.3 Approximating Laplacian Representation

Obtaining the Laplacian representation by directly computing the eigendecomposition of the graph Laplacian requires access to the transition dynamics of the environment and involves expensive matrix operations, which is infeasible for environments with large or even continuous state spaces. An efficient approach for approximating the Laplacian representation is proposed by Wu et al. (2019), which minimizes the following spectral graph drawing objective (Koren, 2005):

min_{u_1, …, u_d}  Σ_{i=1}^d u_i^T L u_i,   s.t.  u_i^T u_j = δ_ij,  ∀ i, j ∈ {1, …, d},    (2)

where u_1, …, u_d are to approximate the eigenvectors e_1, …, e_d, and δ_ij is the Kronecker delta. However, minimizing such an objective can only ensure that u_1, …, u_d span the same subspace as e_1, …, e_d, as mentioned in (Wu et al., 2019). It does not guarantee u_i = e_i for each i, because the global minimizer is not unique.

Transforming (e_1, …, e_d) with an arbitrary orthogonal matrix also achieves the global minimum (Koren, 2005). Therefore, the problem in Eqn. (2) does not ensure that the solution equals the eigenvectors; optimization may converge to any other global minimizer. Accordingly, the learned Laplacian representations may diverge from the ground truth. We will show in Sec. 4 that such representations are less helpful for discovering exploratory options and for reward shaping.
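This invariance is easy to check numerically. In the sketch below (a toy ring graph of our own choosing, not from the paper), rotating the eigenvector matrix by a random orthogonal matrix leaves the graph drawing objective of Eqn. (2) and the orthonormality constraints unchanged, even though the columns are no longer the eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Laplacian of a 5-state ring graph.
A = np.roll(np.eye(5), 1, axis=1) + np.roll(np.eye(5), -1, axis=1)
L = np.diag(A.sum(axis=1)) - A

d = 3
_, eigvecs = np.linalg.eigh(L)
E = eigvecs[:, :d]                               # d smallest eigenvectors

Q, _ = np.linalg.qr(rng.normal(size=(d, d)))     # random d x d orthogonal matrix
U = E @ Q                                        # rotated solution, still feasible

obj = lambda X: np.trace(X.T @ L @ X)            # sum_i x_i^T L x_i
print(np.allclose(obj(E), obj(U)))               # True: same objective value
print(np.allclose(U.T @ U, np.eye(d)))           # True: still orthonormal
print(np.allclose(U, E))                         # False: not the eigenvectors
```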

3 Method

As discussed in Sec. 2.3, the graph drawing objective is invariant under orthogonal transformations: applying an arbitrary orthogonal transformation to the d smallest eigenvectors also yields a global minimizer, which hinders learning Laplacian representations close to the ground truth.

We therefore consider breaking such invariance to achieve a more accurate approximation. To this end, we reformulate the graph drawing objective into a weighted-sum form, yielding the following generalized graph drawing objective:

min_{u_1, …, u_d}  Σ_{i=1}^d c_i u_i^T L u_i,   s.t.  u_i^T u_j = δ_ij,  ∀ i, j ∈ {1, …, d},    (3)

where c_i > 0 is the coefficient for the i-th term u_i^T L u_i. When c_i = 1 for every i, the objective degenerates to the original graph drawing objective in Eqn. (2).

We find that, under mild assumptions, if c_1 > c_2 > ⋯ > c_d > 0 are strictly decreasing, then the d smallest eigenvectors (the ground-truth Laplacian representation) form the unique global minimizer of the above generalized graph drawing objective, as stated in the following theorem.

Theorem 1.

Assume u_i ∈ span{e_1, …, e_d} for i = 1, …, d, and λ_1 < λ_2 < ⋯ < λ_d. Then, c_1 > c_2 > ⋯ > c_d > 0 is a sufficient condition for the generalized graph drawing objective in Eqn. (3) to have (e_1, …, e_d) as its unique global minimizer (up to the sign of each u_i), and the corresponding minimum is Σ_{i=1}^d c_i λ_i.

Proof.

Here we give a proof sketch of Theorem 1; the full proof is deferred to the Appendix. Denote the objective of problem (3) as J(u_1, …, u_d), and let U = (u_1, …, u_d) and E = (e_1, …, e_d). Since u_i ∈ span{e_1, …, e_d} and u_i^T u_j = δ_ij, without loss of generality U can be written as U = EQ, where Q is a d × d orthogonal matrix.

We first prove optimality. By applying Fubini's Theorem (Fubini, 1907), we rewrite J(U) as a weighted sum whose weights depend on the coefficients c_i and the eigenvalues λ_i, and lower-bound it by Σ_{i=1}^d c_i λ_i (this step uses c_1 > c_2 > ⋯ > c_d > 0). Hence J(U) ≥ Σ_{i=1}^d c_i λ_i. The inequality is tight when U = E, which proves optimality.

Then, we prove uniqueness by contradiction. Denote the global minimum as J* = Σ_{i=1}^d c_i λ_i. Assume there exists another global minimizer U' with U' ≠ ±E column-wise, i.e., J(U') = J*; we exclude ±E here because the sign of each eigenvector is arbitrary. Again, we rewrite U' = EQ', where Q' is an orthogonal matrix, so the assumption is equivalent to J(EQ') = J*. Due to the optimality of E, we know J(EQ') ≥ J*. However, we can show that equality forces Q' to be a diagonal matrix with entries ±1, i.e., U' = ±E column-wise. This contradicts the assumption and hence proves uniqueness. ∎

We will empirically show that the two assumptions hold in our experiments (see Sec. 4.4.1). Based on the above theorem, we can choose any strictly decreasing positive coefficients c_1 > c_2 > ⋯ > c_d > 0 to obtain a learning objective that faithfully approximates the Laplacian representation. A natural choice is c_i = d − i + 1, which gives the following objective:

min_{u_1, …, u_d}  Σ_{i=1}^d (d − i + 1) u_i^T L u_i,   s.t.  u_i^T u_j = δ_ij,  ∀ i, j ∈ {1, …, d}.    (4)

We use the above objective throughout the rest of the paper, and conduct ablative experiments with other choices of the coefficients (see Sec. 4.4.2).

Note that Theorem 1 implies a useful property of the generalized graph drawing objective in Eqn. (3): there is a one-to-one correspondence between its solution and the d smallest eigenvectors, i.e., u_i = e_i for each i. With this, a specific dimension of the Laplacian representation (e.g., the 2nd dimension) can be read off directly from the corresponding solution (e.g., u_2). This exact correspondence is useful for studying how each dimension of the representation influences an RL task, e.g., reward shaping (see Sec. 4.3). The spectral graph drawing objective does not have such a property.

The above theoretical results can be easily generalized to the function space (i.e., Hilbert space), which corresponds to a continuous state space in RL (see Appendix).

Training objective  In RL applications, it is hard to directly optimize problem (4), because L is not accessible and enumerating the state space may be infeasible. To make the optimization amenable, we follow the practice in (Wu et al., 2019) and express the objective as an expectation. The objective in Eqn. (4) can be rewritten as

Σ_{i=1}^d (d − i + 1) u_i^T L u_i = Σ_{i=1}^d (d − i + 1) Σ_{(s, s') ∈ E} (u_i[s] − u_i[s'])^2,    (5)

where the inner summation on the right-hand side is over all edges (i.e., transitions) in the graph, and u_i[s] denotes the entry of vector u_i corresponding to state s. In practice, we train a neural network f = (f_1, …, f_d) with a d-dimensional output to approximate the Laplacian representation of state s. Since we only have sampled transitions, we express Eqn. (5) as an expectation and minimize the following objective:

min_f  E_{(s, s') ∼ D} [ Σ_{i=1}^d (d − i + 1) (f_i(s) − f_i(s'))^2 ],    (6)

where (s, s') is a state transition sampled from a dataset of transitions D.

The orthonormal constraints in Eqn. (4) can be implemented as a penalty term:

min_f  E_{(s, s') ∼ D} [ Σ_{i=1}^d (d − i + 1) (f_i(s) − f_i(s'))^2 ] + β G(f),    (7)

where

G(f) = E_{s ∼ ρ, s' ∼ ρ} [ Σ_{j=1}^d Σ_{k=1}^d ( f_j(s) f_k(s) − δ_jk )( f_j(s') f_k(s') − δ_jk ) ].    (8)

Here β is the penalty weight and ρ denotes the distribution of states in D. Please refer to (Wu et al., 2019) and the Appendix for the detailed derivation.
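Below is a minimal PyTorch sketch of this training loss. It is our own illustration under stated assumptions: coefficients c_i = d − i + 1, an orthonormality penalty in the style of Wu et al. (2019), and hypothetical tensor names; it is not the authors' released code.

```python
import torch

def laplacian_rep_loss(f_s, f_s_next, f_rho1, f_rho2, beta=1.0):
    """Generalized graph drawing loss for learning Laplacian representations.

    f_s, f_s_next: representations of sampled transitions (s, s'), shape (B, d).
    f_rho1, f_rho2: representations of states sampled independently from rho, shape (B, d).
    beta: weight of the orthonormality penalty.
    """
    d = f_s.shape[1]
    # Decreasing coefficients c_i = d - i + 1, i.e. (d, d-1, ..., 1).
    c = torch.arange(d, 0, -1, dtype=f_s.dtype, device=f_s.device)

    # Attractive term: weighted squared differences along sampled transitions (Eqn. 6).
    attractive = (c * (f_s - f_s_next) ** 2).sum(dim=1).mean()

    # Orthonormality penalty in the style of Wu et al. (2019):
    # sum_{j,k} (f_j(s) f_k(s) - delta_jk)(f_j(s') f_k(s') - delta_jk), with s, s' ~ rho.
    eye = torch.eye(d, dtype=f_s.dtype, device=f_s.device)
    gram1 = torch.einsum('bi,bj->bij', f_rho1, f_rho1) - eye
    gram2 = torch.einsum('bi,bj->bij', f_rho2, f_rho2) - eye
    penalty = (gram1 * gram2).sum(dim=(1, 2)).mean()

    return attractive + beta * penalty
```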

4 Experiments

Figure 2: Environments for experiments (agents depicted in red).

In this section, we conduct extensive experiments to validate the effectiveness of our method in improving learned Laplacian representation. Specifically, in Sec. 4.1, we evaluate the learned representations on how well they approximate the ground truth. In Sec. 4.2 and Sec. 4.3, we evaluate the learned representations on their effectiveness in two downstream tasks, i.e. for discovering exploratory options and improving reward shaping. Finally, in Sec. 4.4, we empirically verify the assumptions used in Theorem 1 and evaluate other coefficient choices.

We use two discrete gridworld environments and two continuous control environments in our experiments (see Fig. 2), following previous work (Wu et al., 2019). The gridworld environments are built with MiniGrid (Chevalier-Boisvert et al., 2018) and the continuous control environments are created with PyBullet (Coumans and Bai, 2016). Note that for the gridworld environments our setting is not tabular, since we approximate the Laplacian representation by training neural networks on raw observations (such as positions or top-view images) rather than learning a lookup table over all states. For all experiments, we use d = 10 for the dimension of the Laplacian representation. More details about the training setup can be found in the Appendix. For clarity, throughout the experiments we use baseline to refer to the method of (Wu et al., 2019).

4.1 Learning Laplacian Representations

Figure 3: Visualization of the learned 10-dimensional Laplacian representation and the ground truth on GridMaze. Each heatmap shows one dimension of the representation for all states in the environment, where each state is a single cell. Best viewed in color.
Figure 4: Visualization of the learned 10-dimensional Laplacian representations and the ground truth on PointRoom. Each heatmap shows one dimension of the representation for all states (via interpolation) in the environment. Best viewed in color.

Environment | GridRoom | GridMaze | GridRoom (image) | GridMaze (image) | PointRoom | PointMaze
Baseline | 0.239 | 0.220 | 0.310 | 0.229 | 0.239 | 0.255
Ours | 0.991 | 0.962 | 0.985 | 0.984 | 0.963 | 0.779
Table 1: Similarity (Eqn. 9) between the learned representation and the ground truth, averaged across 3 runs. (image) denotes using image observations.

We take (Wu et al., 2019) as our baseline and, following its practice, we also train a neural network to approximate the Laplacian representation, using trajectories collected by a uniformly random policy with random starts. For the environments with discrete state spaces (GridRoom and GridMaze), we conduct experiments with both position observations and image observations. The ground-truth Laplacian representations (i.e., eigenvectors) are computed by eigendecomposing the graph Laplacian matrix. For the environments with continuous state spaces (PointRoom and PointMaze), we use positions as observations, and the ground-truth representations (i.e., eigenfunctions) are approximated by the finite difference method with a 5-point stencil (Peter and Lutz, 2003). Please see the Appendix for more training details.

To get an intuitive comparison between our method and the baseline in approximating the Laplacian representation, we visualize the learned state representations as well as the ground truth ones of GridMaze and PointRoom in Fig. 3 and Fig. 4. As the figures show, our learned Laplacian representations approximate the ground truth much more accurately, while the baseline representations significantly diverge from the ground truth. Similar results in other environments are included in the Appendix.

To quantify the approximation quality of the learned representations, we calculate the absolute dimension-wise cosine similarities between the learned representations and the ground-truth ones, and take the average over all dimensions, which yields the following metric:

sim = (1/d) Σ_{i=1}^d | ⟨f_i, e_i⟩ | / ( ‖f_i‖ ‖e_i‖ ),    (9)

where f_i collects the i-th dimension of the learned representation f_i(s) over all states s (defined in Sec. 3), and e_i is the i-th dimension of the ground truth, i.e., the eigenvector whose entry for state s is e_i[s] (defined in Sec. 2.2). Note that sim lies in the range [0, 1], and a larger sim means that the learned representation is closer to the ground truth. As shown in Tab. 1, our method achieves much higher sim than the baseline, indicating a better approximation.
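A sketch of this metric in NumPy (our own implementation of Eqn. (9); the array layout is an assumption):

```python
import numpy as np

def avg_abs_cosine_similarity(learned: np.ndarray, ground_truth: np.ndarray) -> float:
    """Average absolute dimension-wise cosine similarity (Eqn. 9).

    learned, ground_truth: |S| x d matrices; column i holds the i-th dimension
    of the representation evaluated at every state.
    """
    sims = []
    for f_i, e_i in zip(learned.T, ground_truth.T):
        cos = np.dot(f_i, e_i) / (np.linalg.norm(f_i) * np.linalg.norm(e_i))
        sims.append(abs(cos))
    return float(np.mean(sims))
```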

Moreover, we provide empirical evidence for our discussion in Sec. 3 that our method converges to the unique global minimizer, while the baseline method can converge to different minima. To illustrate this, we visualize the representations learned in 3 different runs, and use the following metric to measure the variation between the representations learned in the m-th run and those learned in the n-th run (m ≠ n):

diff_{m,n} = 1 − (1/d) Σ_{i=1}^d | ⟨f_i^{(m)}, f_i^{(n)}⟩ | / ( ‖f_i^{(m)}‖ ‖f_i^{(n)}‖ ),    (10)

where f_i^{(m)} and f_i^{(n)} denote the i-th dimension of the representation of all states learned in the m-th and n-th runs, respectively. diff_{m,n} lies in the range [0, 1], and a larger value implies larger inconsistency between the representations learned in the two runs. As Fig. 5 shows, the representations learned by the baseline method vary a lot across runs, indicating convergence to different minima. In contrast, our method yields consistent approximations.

The above results demonstrate the superiority of learning Laplacian representation with our proposed objective and empirically support our theoretical analysis in Sec. 3.

Figure 5: (a) and (b): Visualization of the first 3 dimensions of the representations learned by our method (a) and the baseline method (b). (c) and (d): Pairwise diff_{m,n} across runs, computed with our learned representations (c) and those learned by the baseline (d).

4.2 Option Discovery

As discussed in Sec. 2.2, the Laplacian representation can be applied to discovering exploratory options. Here we evaluate the effectiveness of our learned representations in discovering exploratory options, to further show the superiority of our method over the baseline.

Following (Machado et al., 2017), we learn 2 options for each dimension of the learned representation: one with the intrinsic reward function r_i(s, s') = f_i(s') − f_i(s) and the other with its negation −r_i(s, s'), where f_i(s) denotes the i-th dimension of the representation of state s (see Sec. 3). The options are learned with Deep Q-learning (Mnih et al., 2013). Since the first dimension of the Laplacian representation has the same value for every state (see Sec. 2.2), it cannot provide an informative intrinsic reward. Therefore, we do not learn options for the first dimension of our learned representation.
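As a small sketch (the function signature is ours, not from the paper), the per-dimension intrinsic rewards used to train the two options can be written as:

```python
def intrinsic_reward(f, s, s_next, dim, sign=1):
    """Intrinsic reward for an option built from one dimension of the representation.

    f: learned representation network mapping a state to a d-dimensional vector.
    dim: index of the dimension to use (dim >= 1; dimension 0 is constant).
    sign: +1 or -1, giving the two options learned per dimension.
    """
    return sign * (f(s_next)[dim] - f(s)[dim])
```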

For each learned option, we compute the average trajectory length for an agent that starts from each state and follows this option until arriving at termination states. This reflects the time scale at which an option acts: options acting at longer time scales enable agents to quickly reach distant areas, and shorter options ensure sufficient exploration in local areas. As Fig. 6 shows, trajectory lengths for our method vary across different dimensions, implying the options operate at different time scales. Such options enable exploration in both nearby and distant areas. In contrast, options discovered from the baseline representation operate at similarly short time scales, which may hinder exploration to the distant areas.

To further validate this, we measure the expected number of steps required for an agent (equipped with the learned options) to navigate between different rooms in the GridRoom environment. Specifically, we measure the average number of steps required for an agent starting from room i to reach room j, and report it as the expected number of steps needed to navigate between the two rooms. As shown in Fig. 7, agents equipped with options discovered from the baseline representation typically take more steps to reach distant rooms. In comparison, with our method, agents can reach faraway rooms within a similar number of steps as nearby rooms.

Figure 6: Average length of trajectories for options discovered from each dimension of the different representations. A longer length implies the option acts at a longer time scale. For each dimension, the result is averaged over the corresponding 2 options.
Figure 7: (a) Indices of rooms. Neighboring rooms are assigned consecutive indices; a larger difference between two indices means the corresponding rooms are farther apart in the environment. (b) Average steps needed to navigate between two rooms (baseline). The agent needs many steps to navigate between distant rooms, e.g., room 1 and room 16. (c) Average steps needed to navigate between two rooms (ground truth). (d) Average steps needed to navigate between two rooms (our method).

4.3 Reward Shaping

The Laplacian representation can be used for reward shaping in goal-achieving tasks, as mentioned in Sec. 2.2. Previous work (Wu et al., 2019) uses all dimensions of the learned representation to define the pseudo-reward, i.e., r̃(s) = −‖f(s) − f(z)‖_2, where f(s) is the d-dimensional representation of state s output by a neural network and z is the goal state. Such a pseudo-reward is influenced by every dimension of the representation. As each dimension of the Laplacian representation captures different geometric information about the state space (e.g., see Fig. 4), a natural question is: which dimensions of the Laplacian representation matter more for reward shaping? Furthermore, can we achieve better reward shaping than using all dimensions?

Figure 8: Results of reward shaping with each dimension of the learned Laplacian representations. One baseline curve denotes reward shaping with the L2 distance in the raw observation space (i.e., position), and sparse denotes no reward shaping.

In this subsection, we study these questions by comparing individual dimensions of the learned representation for reward shaping. Specifically, we define the pseudo-reward as r̃_i(s) = −| f_i(s) − f_i(z) |, where f_i(s) denotes the i-th dimension of the representation of state s. Following (Wu et al., 2019), we train the agents using Deep Q-learning (Mnih et al., 2013) with positions as observations, and measure the agent's success rate of reaching the goal state. To eliminate the bias introduced by the goal position, we select multiple goals for each environment such that they spread over the state space, and average the results over the different goals. We do not experiment with the first dimension of our learned representation, since every state has the same value in that dimension and the pseudo-reward would always be 0.
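For clarity, the single-dimension pseudo-reward studied here can be sketched as follows (again with our own, hypothetical function names):

```python
def per_dim_shaped_reward(f, state, goal, dim):
    """Pseudo-reward from a single representation dimension: -|f_dim(s) - f_dim(z)|."""
    return -abs(f(state)[dim] - f(goal)[dim])
```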

As shown in Fig. 8(a) and Fig. 8(b), using the lower dimensions of our learned Laplacian representation for reward shaping better accelerates the agent's learning process. Furthermore, in Fig. 8(c) and Fig. 8(d), we compare using the best single dimension against using all dimensions, for both our learned representation and the baseline representation. The results show that the low dimension of our representation significantly outperforms the other choices, further improving reward shaping. These results suggest that the lower dimensions of the Laplacian representation are more important for reward shaping. By learning a high-quality Laplacian representation, our method makes it easier to choose which eigenvectors to use for reward shaping, leading to improved performance.

Figure 9: Sum of absolute cosine similarities between each learned dimension and the eigenvectors e_1, …, e_d during training.
Figure 10: Eigenvalues of the graph Laplacian matrix for two environments with discrete state-space.

4.4 Analysis

Here we first empirically verify the two assumptions in Theorem 1, and then conduct ablative experiments with different choices of coefficients for our generalized graph drawing objective (3). We use GridRoom and GridMaze environments for experiments in this subsection.

4.4.1 Verification of assumptions

The first assumption requires that all optimization variables lie in the span of the d smallest eigenvectors throughout the optimization process, i.e., u_i ∈ span{e_1, …, e_d} for i = 1, …, d.

We empirically verify a necessary and sufficient condition for this assumption: for each u_i, the angle between u_i and its projection onto span{e_1, …, e_d} is 0. Specifically, we compute the cosine distance between u_i and its projection during training. As Fig. 9 shows, the cosine distance stays close to 0 throughout training, which implies that the assumption holds in our experiments.
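A small NumPy sketch of this check (our own illustration; E is assumed to hold the d smallest eigenvectors as orthonormal columns and u is one learned dimension evaluated on all states):

```python
import numpy as np

def span_cosine_distance(u: np.ndarray, E: np.ndarray) -> float:
    """Cosine distance between u and its projection onto the span of E's columns.

    Close to 0 when u (approximately) lies in the span of the d smallest eigenvectors.
    """
    proj = E @ (E.T @ u)          # orthogonal projection (E has orthonormal columns)
    cos = np.dot(u, proj) / (np.linalg.norm(u) * np.linalg.norm(proj))
    return 1.0 - cos
```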

To verify whether the second assumption holds, i.e., whether the d smallest eigenvalues of the graph Laplacian are distinct (λ_1 < λ_2 < ⋯ < λ_d), we calculate the smallest eigenvalues of the graph Laplacian of our environments and plot them in Fig. 10. It is clear that the eigenvalues are distinct, which demonstrates the validity of this assumption.

4.4.2 Evaluation on other coefficient choices

In the above experiments, we choose the coefficients of our generalized graph drawing objective to be c_i = d − i + 1. In this subsection, we evaluate the effectiveness of other choices of the coefficients.

Specifically, we select two groups of coefficients that differ from the default group (i.e., c_i = d − i + 1, as used in Eqn. (4)): the first group has an increasing first-order difference between consecutive coefficients, while the second group has a decreasing first-order difference. We plot the two groups in Fig. 11(a). For comparison, we also include the default coefficient group, which has a constant first-order difference of 1. We then train neural networks with the generalized objective in Eqn. (3) using group 1 and group 2 (other experimental settings are the same as in Sec. 4.1). We use the similarity between the learned representations and the ground truth to evaluate the quality of the representations, as in Sec. 4.1. As can be seen from Fig. 11(b), the representations learned with objectives using group 1 or group 2 are as good as those learned with the default setting.

Figure 11: (a) Coefficient values of different groups. (b) Absolute cosine similarity (averaged across dimensions) between our learned representation and ground truth.

5 Related Works

By viewing the state transition process in RL as a graph where nodes are states and edges are transitions, previous works build a Laplacian-based state representation and successfully apply it in value function approximation (Mahadevan, 2005), option discovery (Machado et al., 2017) and reward shaping (Wu et al., 2019).

Mahadevan (2005) proposes proto-value functions, viewing the Laplacian representations as basis state representations, and uses them to approximate value functions. Recently, Machado et al. (2017) introduce a framework for option discovery which builds options from the Laplacian representation. They show that such options are useful for multiple tasks and helpful in exploration. Later, Machado et al. (2018) extend their Laplacian option discovery framework to settings where handcrafted features are not available, based on a connection between proto-value functions (Mahadevan, 2005) and successor representations (Dayan, 1993; Stachenfeld et al., 2014; Barreto et al., 2017). Jinnai et al. (2019) leverage the approximated Laplacian representation to learn deep covering options for exploration.

Our work focuses on better approximating the Laplacian representation in environments with large state spaces. Most related to our method is the approach proposed in (Wu et al., 2019), which optimizes a spectral graph drawing objective (Koren, 2005) to approximate the eigenvectors. Though efficient, their method has difficulty learning a Laplacian representation close to the ground truth, because their minimization objective has infinitely many global minimizers besides the eigenvectors. Our work improves on their method by proposing a new objective that admits the eigenvectors as its unique global minimizer, which greatly increases the approximation quality in empirical evaluations. Other approaches for approximating the Laplacian representation include performing singular value decomposition on the incidence matrix (Machado et al., 2017, 2018), and training neural networks with constrained stochastic optimization (Shaham et al., 2018) or bi-level stochastic optimization (Pfau et al., 2019). However, as discussed in (Wu et al., 2019), these approaches either require expensive matrix operations or scale poorly.

6 Conclusion

The Laplacian representation provides a succinct and informative state representation for RL that captures the geometry of the underlying state space. Such a representation is beneficial for discovering exploratory options and for reward shaping. In this paper, we propose a new objective that greatly improves the approximation quality of the learned Laplacian representation for environments with large or even continuous state spaces. We demonstrate the superiority of our method over previous work via theoretical analysis and empirical evaluation. Our method is efficient and simple to implement. With it, one can learn high-quality Laplacian representations and apply them to various RL tasks such as option discovery and reward shaping.

7 Acknowledgements

Jiashi Feng is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG-100E-2019-035), Singapore National Research Foundation (“CogniVision” grant NRF-CRP20-2017-0003).

References

  • R. Agarwal, M. C. Machado, P. S. Castro, and M. G. Bellemare (2021) Contrastive behavioral similarity embeddings for generalization in reinforcement learning. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. van Hasselt, and D. Silver (2017) Successor features for transfer in reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4058–4068. Cited by: §5.
  • M. Chevalier-Boisvert, L. Willems, and S. Pal (2018) Minimalistic gridworld environment for openai gym. GitHub. Note: https://github.com/maximecb/gym-minigrid Cited by: §4, D. Environment Descriptions.
  • F. R. Chung and F. C. Graham (1997) Spectral graph theory. American Mathematical Soc.. Cited by: §2.2.
  • E. Coumans and Y. Bai (2016) PyBullet, a python module for physics simulation for games, robotics and machine learning. Note: http://pybullet.org Cited by: §4, D. Environment Descriptions.
  • P. Dayan (1993) Improving generalization for temporal difference learning: the successor representation. Neural Computation 5 (4), pp. 613–624. Cited by: §5.
  • R. Dubey, P. Agrawal, D. Pathak, T. Griffiths, and A. Efros (2018) Investigating human priors for playing video games. In International Conference on Machine Learning, pp. 1349–1357. Cited by: §1.
  • G. Fubini (1907) Sugli integrali multipli: nota. Tipografia della R. Accademia dei Lincei. External Links: Link Cited by: §3, A. Proof of Theorem 1.
  • Y. Jinnai, J. W. Park, M. C. Machado, and G. Konidaris (2019) Exploration in reinforcement learning with deep covering options. In International Conference on Learning Representations, Cited by: §1, §2.2, §5.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR (Poster), External Links: Link Cited by: E.1 Learning Laplacian Representations.
  • Y. Koren (2005) Drawing graphs by eigenvectors: theory and practice. Computers & Mathematics with Applications 49 (11-12), pp. 1867–1888. Cited by: §1, §2.3, §2.3, §5.
  • M. C. Machado, M. G. Bellemare, and M. Bowling (2017) A laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, pp. 2295–2304. Cited by: §1, §1, §1, §2.2, §2.2, §4.2, §5, §5, §5, E.2 Option Discovery.
  • M. C. Machado, M. G. Bellemare, and M. Bowling (2020) Count-based exploration with the successor representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 5125–5133. Cited by: §1.
  • M. C. Machado, C. Rosenbaum, X. Guo, M. Liu, G. Tesauro, and M. Campbell (2018) Eigenoption discovery through the deep successor representation. In International Conference on Learning Representations, External Links: Link Cited by: §1, §5, §5.
  • S. Mahadevan (2005) Proto-value functions: developmental reinforcement learning. In Proceedings of the 22nd international conference on Machine learning, pp. 553–560. Cited by: §1, §2.2, §5, §5.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §4.2, §4.3, E.2 Option Discovery, E.3 Reward Shaping.
  • O. Nachum, S. Gu, H. Lee, and S. Levine (2018) Data-efficient hierarchical reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3307–3317. Cited by: §2.2.
  • D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell (2017) Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pp. 2778–2787. Cited by: §1.
  • K. Peter and A. Lutz (2003) For the beginning: the finite difference method for the poisson equation. In Numerical Methods for Elliptic and Parabolic Partial Differential Equations, pp. 19–45. External Links: ISBN 978-0-387-21762-8, Document, Link Cited by: §4.1.
  • D. Pfau, S. Petersen, A. Agarwal, D. G. T. Barrett, and K. L. Stachenfeld (2019) Spectral inference networks: unifying deep and spectral learning. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • V. Pong*, S. Gu*, M. Dalal, and S. Levine (2018) Temporal difference models: model-free deep RL for model-based control. In International Conference on Learning Representations, External Links: Link Cited by: §2.2.
  • U. Shaham, K. Stanton, H. Li, R. Basri, B. Nadler, and Y. Kluger (2018) SpectralNet: spectral clustering using deep neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • K. L. Stachenfeld, M. Botvinick, and S. J. Gershman (2014) Design principles of the hippocampal cognitive map. Advances in neural information processing systems 27, pp. 2528–2536. Cited by: §5.
  • A. Stooke, K. Lee, P. Abbeel, and M. Laskin (2020) Decoupling representation learning from reinforcement learning. arXiv preprint arXiv:2009.08319. Cited by: §1.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §1, §2.1.
  • R. S. Sutton, D. Precup, and S. Singh (1999) Between mdps and semi-mdps: a framework for temporal abstraction in reinforcement learning. Artificial intelligence 112 (1-2), pp. 181–211. Cited by: §2.2.
  • Y. Wu, G. Tucker, and O. Nachum (2019) The laplacian in RL: learning representations with efficient approximations. In International Conference on Learning Representations, External Links: Link Cited by: Figure 1, §1, §1, §1, §2.2, §2.2, §2.3, §3, §3, §4.1, §4.3, §4.3, §4, §5, §5, B. Extension to Continuous setting, B. Extension to Continuous setting, C. Obtaining Training objective, E.1 Learning Laplacian Representations, E.3 Reward Shaping.
  • A. Zhang, H. Satija, and J. Pineau (2018) Decoupling dynamics and reward for transfer learning. arXiv preprint arXiv:1804.10689. Cited by: §1.

A. Proof of Theorem 1

To prove Theorem 1, we first introduce the following Lemma 1.

Lemma 1.

Let be an orthogonal matrix, and . For , we have

(11)
Proof.

Since is an orthogonal matrix, we know that the sum of ’s first rows is equal to that of ’s first columns, i.e.:

(12)

Therefore, we have

(13)

If we view as a block matrix, i.e.

(14)

Lemma 1 says the sum of elements in is equal to the sum of elements in .

With Lemma 1, now we can prove Theorem 1 as following.

Proof.

Let E = (e_1, …, e_d) denote the matrix of the first d eigenvectors of L. Since u_i ∈ span{e_1, …, e_d} and u_i^T u_j = δ_ij, we may rewrite U = EQ, where Q is a d × d orthogonal matrix. The objective of problem (3) then becomes:

(15)

We first prove optimality. Let denote the gap between the objective and . We have

(16)

Note that , then we have:

(17)

Let , and , then we can rewrite as:

(18)

Note that and that, for , . We then apply Fubini’s Theorem (Fubini, 1907) to :

(19)

Note that for , we have

(20)

According to Lemma 1, we know

(21)

Therefore, we have

(22)

Since , with Eqn. (22), we can obtain

(23)

I.e., the following inequality holds:

(24)

Since , the inequality is tight when

(25)

Therefore, we conclude that is the global minimum, and is one minimizer.

Next, we prove uniqueness. Assume that there is another minimizer for this problem, denoted as . We have

(26)

Here we require because the sign of is arbitrary and hence we do not distinguish them. Again, can be written as , where is an orthogonal matrix. Therefore, proposition in Eqn. (26) is equivalent to

(27)

Denote . By the optimality of , we have

(28)

From Eqn. (16) to Eqn. (22), we know

(29)

The equality holds if and only if . Additionally, according to Lemma 1, we have

(30)

Therefore, we also have . Accordingly, all off-diagonal elements of are 0, i.e., . Moreover, since is orthogonal, the following equality holds

(31)

So we have

(32)

which contradicts the proposition in Eqn. (27). Based on the above, we conclude that (e_1, …, e_d) is the unique global minimizer. ∎

B. Extension to Continuous setting

In Sec. 2 and Sec. 3, we discuss the Laplacian representation and our proposed objective in the discrete case. In this section, we extend the previous discussion to the continuous setting. Consider a graph with infinitely many nodes (i.e., states), where weighted edges represent pairwise non-negative affinities (denoted by w(u, v) for nodes u and v).

Following (Wu et al., 2019), we give the following definitions. A Hilbert space H is defined as the set of square-integrable real-valued functions on the graph nodes, associated with the inner product

⟨f, g⟩ = ∫ f(u) g(u) dρ(u),    (33)

where ρ is a probability measure over the nodes, i.e., ∫ dρ(u) = 1. The norm of a function f is defined as ‖f‖ = ⟨f, f⟩^{1/2}. Functions f and g are orthogonal if ⟨f, g⟩ = 0; functions f_1, …, f_d are orthonormal if ⟨f_i, f_j⟩ = δ_ij. The graph Laplacian is defined as a linear operator L on H, given by

(34)

Our goal is to learn f_1, …, f_d ∈ H that approximate the eigenfunctions associated with the d smallest eigenvalues of L. The graph drawing objective used in (Wu et al., 2019) is

min_{f_1, …, f_d ∈ H}  Σ_{i=1}^d ⟨f_i, L f_i⟩,   s.t.  ⟨f_i, f_j⟩ = δ_ij,  ∀ i, j ∈ {1, …, d}.    (35)

Extending this objective to the generalized form gives us

min_{f_1, …, f_d ∈ H}  Σ_{i=1}^d c_i ⟨f_i, L f_i⟩,   s.t.  ⟨f_i, f_j⟩ = δ_ij,  ∀ i, j ∈ {1, …, d}.    (36)

Similarly, for continuous setting, Theorem 1 can be extended to the following theorem:

Theorem 2.

Assume f_i ∈ span{e_1, …, e_d} for i = 1, …, d, where e_1, …, e_d are the eigenfunctions associated with the d smallest eigenvalues λ_1 < λ_2 < ⋯ < λ_d of L. Then, c_1 > c_2 > ⋯ > c_d > 0 is a sufficient condition for the generalized graph drawing objective in Eqn. (36) to have (e_1, …, e_d) as its unique global minimizer, and the corresponding minimum is Σ_{i=1}^d c_i λ_i.

To prove the Theorem 2, we need the following Lemma 2 and Lemma 3.

Lemma 2.

Let f_1, …, f_d be orthonormal functions in H, and consider the inner products of each f_i with the eigenfunctions e_j. Then we have (i)