1 Introduction
Reinforcement learning (RL) aims to train an agent that takes proper sequential actions based on the states it perceives from the environment (Sutton and Barto, 2018). The quality of state representations is therefore important to agent performance, as it benefits generalization ability (Zhang et al., 2018; Stooke et al., 2020; Agarwal et al., 2021), exploration ability (Pathak et al., 2017; Machado et al., 2017, 2020), and learning efficiency (Dubey et al., 2018; Wu et al., 2019). Recently, the Laplacian representation has received increasing attention (Mahadevan, 2005; Machado et al., 2017; Wu et al., 2019; Jinnai et al., 2019). It views the states and transitions in an RL environment as nodes and edges in a graph, and forms a $d$-dimensional state representation from the $d$ smallest eigenvectors of the graph Laplacian (i.e., the eigenvectors associated with the $d$ smallest eigenvalues). Such representations capture the geometry of the underlying state space, as illustrated in Fig. 1(b), which greatly benefits RL in option discovery (Machado et al., 2017; Jinnai et al., 2019) and reward shaping (Wu et al., 2019).
However, computing the exact Laplacian representation is very challenging: directly computing the eigendecomposition of the graph Laplacian requires access to the environment transition dynamics and involves expensive matrix operations. Hence it is largely limited to small finite state spaces. To deal with large (or even continuous) state spaces, previous works resort to approximation methods (Machado et al., 2017, 2018; Wu et al., 2019). A particularly efficient one is that of Wu et al. (2019), which minimizes a spectral graph drawing objective (Koren, 2005). However, this objective has infinitely many other global minimizers besides the ground-truth Laplacian representation (i.e., the $d$ smallest eigenvectors), as it is invariant to an arbitrary orthogonal transformation of the optimization variables. The resulting representations can therefore correspond to other minimizers and diverge from the ground truth, and are thus unable to encode the geometry of the state space as expected (see Fig. 1(d)).
To break this invariance and obtain an approximation closer to the ground-truth Laplacian representation, we reformulate the graph drawing objective into a generalized form by introducing a coefficient for each of its terms. By assigning decreasing values to these coefficients, we derive a training objective that breaks the undesired invariance. We provide theoretical guarantees that, under mild assumptions, the proposed objective has the $d$ smallest eigenvectors as its unique global minimizer. As shown in Fig. 1(c), minimizing the new objective yields a faithful approximation of the ground-truth Laplacian representation.
To verify the effectiveness of our method for learning high-quality Laplacian representations, we conduct experiments in gridworld and continuous control environments. We show that the representations learned by our method approximate the ground truth more accurately than those learned with the graph drawing objective. Furthermore, we apply the learned representations to two downstream RL tasks. In the option discovery task (Machado et al., 2017), our method leads to discovered options that are more exploratory than those obtained from the representation learned by graph drawing; in the reward shaping task (Wu et al., 2019), our learned representation accelerates agents’ learning more than that of previous work (Wu et al., 2019).
The rest of the paper is organized as follows. In Sec. 2, we introduce background on RL and the Laplacian representation in RL. In Sec. 3, we propose our new objective for learning the Laplacian representation. In Sec. 4, we conduct experiments to demonstrate that our proposed objective learns high-quality representations. In Sec. 5 we review related work, and Sec. 6 concludes the paper.
2 Background
2.1 Reinforcement Learning
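In the RL framework (Sutton and Barto, 2018), an agent interacts with an environment by observing states and taking actions, with the aim of maximizing cumulative reward. We consider the Markov Decision Process (MDP) formalism in this paper. An MDP can be described by a 5-tuple whose components include the state space $\mathcal{S}$, the action space $\mathcal{A}$, the reward function $r$, and the transition distribution $P$. Specifically, at time $t$ the agent observes state $s_t \in \mathcal{S}$ and takes an action $a_t \in \mathcal{A}$. The environment then yields a reward signal $r_t$ sampled from the reward function $r(s_t, a_t)$. The state observation in the next timestep, $s_{t+1} \in \mathcal{S}$, is sampled according to an environment-specific transition distribution function $P(s_{t+1} \mid s_t, a_t)$. A policy is defined as a mapping $\pi: \mathcal{S} \to \mathcal{A}$ that returns an action $a$ given a state $s$. The goal of the agent is to learn an optimal policy $\pi^*$ that maximizes the expected cumulative reward:
$$\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_{\pi, P} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \right], \tag{1}$$
where $\Pi$ denotes the policy space and $\gamma \in [0, 1)$ is a discount factor.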
2.2 Laplacian Representation in RL
By considering states and transitions in an MDP as nodes and edges in a graph, the Laplacian state representation is formed by the $d$ smallest eigenvectors of the graph Laplacian. Specifically, each eigenvector (of length $|\mathcal{S}|$) corresponds to a dimension of the Laplacian representation for all states. Formally, we denote the graph as $\mathcal{G} = (\mathcal{S}, \mathcal{E})$, where $\mathcal{E}$ is the edge set consisting of transitions between states. The graph Laplacian of $\mathcal{G}$ is defined as $L = D - A$, where $A$ is the adjacency matrix of $\mathcal{G}$, and $D = \mathrm{diag}(A\mathbf{1})$ is the degree matrix (Chung and Graham, 1997). We denote the $i$-th smallest eigenvalue of $L$ as $\lambda_i$, and the corresponding unit eigenvector as $e_i \in \mathbb{R}^{|\mathcal{S}|}$. The $d$-dimensional Laplacian representation of a state $s$ is $\phi(s) = (e_1[s], \dots, e_d[s])$, where $e_i[s]$ denotes the entry in vector $e_i$ that corresponds to state $s$. In particular, $e_1$ is a normalized all-ones vector and has the same value for all $s$.
The Laplacian representation is known to capture the geometry of the underlying state space (Mahadevan, 2005; Machado et al., 2017), and thus has been applied in option discovery (Machado et al., 2017; Jinnai et al., 2019) and reward shaping (Wu et al., 2019).
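As a minimal sketch of the construction described in this subsection (not the authors' code), the following NumPy snippet builds the graph Laplacian as degree matrix minus adjacency matrix for a small chain graph and reads the representation off the smallest eigenvectors. The 5-state chain and the helper name `laplacian_representation` are assumptions made purely for illustration.

```python
import numpy as np

def laplacian_representation(adjacency: np.ndarray, d: int):
    """Return the d smallest eigenvalues and unit eigenvectors of L = D - A."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    # For symmetric matrices, eigh returns eigenvalues in ascending order
    # together with orthonormal eigenvectors (as columns).
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    return eigenvalues[:d], eigenvectors[:, :d]

# Toy example (illustrative assumption): a 5-state chain 0-1-2-3-4.
n = 5
A = np.zeros((n, n))
for s in range(n - 1):
    A[s, s + 1] = A[s + 1, s] = 1.0

lams, E = laplacian_representation(A, d=3)
phi = E  # row s of phi is the 3-dimensional representation of state s
```

In this toy graph the first column of `E` is constant across states (up to sign), which matches the remark that the first dimension carries no state-specific information.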
In the Laplacian framework for option discovery (Machado et al., 2017), each dimension of the Laplacian representation defines an intrinsic reward function $r_i(s, s') = e_i[s'] - e_i[s]$. The options (Sutton et al., 1999) are discovered by maximizing the cumulative discounted intrinsic reward. These options act at different time scales; that is, when an agent follows an option to take actions, the length of its trajectory until termination varies across different options (see Fig. 3 in (Machado et al., 2017)). This property makes the options helpful for exploration: longer options enable agents to quickly reach distant areas, and shorter options ensure sufficient exploration in local areas.
When using the Laplacian representation for reward shaping in goal-achieving tasks (Wu et al., 2019), the reward is shaped based on Euclidean distance, as in (Pong et al., 2018; Nachum et al., 2018). Specifically, the pseudo-reward is defined as the negative distance between the agent’s state and the goal state in representation space: $r_t = -\|\phi(s_{t+1}) - \phi(s_{\mathrm{goal}})\|$, where $\phi(\cdot)$ denotes the Laplacian representation of a state. Since the Laplacian representation can reflect the geometry of the environment dynamics, such a pseudo-reward can help accelerate the learning process.
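The distance-based shaping described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `shaped_reward` and the use of plain arrays for the representations of the agent's state and the goal state are hypothetical choices, not part of the cited method.

```python
import numpy as np

def shaped_reward(phi_state, phi_goal):
    """Negative Euclidean distance between two representation vectors."""
    diff = np.asarray(phi_state, dtype=float) - np.asarray(phi_goal, dtype=float)
    return -float(np.linalg.norm(diff))

# Hypothetical 2-dimensional representations of a state and of the goal.
example = shaped_reward([0.0, 0.0], [3.0, 4.0])
```

The pseudo-reward is zero exactly when the two representations coincide and becomes more negative as they move apart in representation space.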
2.3 Approximating Laplacian Representation
Obtaining the Laplacian representation by directly computing the eigendecomposition of the graph Laplacian requires access to the transition dynamics of the environment and involves expensive matrix operations, which is infeasible for environments with a large or even continuous state space. One efficient approach for approximating the Laplacian representation is proposed by Wu et al. (2019), which minimizes the following spectral graph drawing objective (Koren, 2005):
$$\min_{u_1, \dots, u_d} \sum_{i=1}^{d} u_i^\top L u_i \quad \text{s.t.} \quad u_i^\top u_j = \delta_{ij}, \ \forall i, j = 1, \dots, d, \tag{2}$$
where $u_1, \dots, u_d$ are to approximate the eigenvectors $e_1, \dots, e_d$ of the graph Laplacian $L$, and $\delta_{ij}$ is the Kronecker delta. However, minimizing such an objective can only ensure that $u_1, \dots, u_d$ span the same subspace as $e_1, \dots, e_d$, as mentioned in (Wu et al., 2019). It does not guarantee $u_i = e_i$ for $i = 1, \dots, d$, because the global minimizer is not unique.
Transforming $(e_1, \dots, e_d)$ with an arbitrary orthogonal transformation also achieves the global minimum (Koren, 2005). Therefore, solving the problem in Eqn. (2) does not ensure that the solution is the eigenvectors, and the optimization may converge to any other global minimizer. Accordingly, the learned Laplacian representations may diverge from the ground truth. We will show in Sec. 4 that such representations are less helpful for discovering exploratory options and for reward shaping.
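The non-uniqueness discussed here can be checked numerically. The sketch below assumes the unweighted graph drawing objective is the sum of quadratic forms $u_i^\top L u_i$ (equivalently, a trace), and uses a toy 6-state chain and a random orthogonal matrix, both of which are illustrative assumptions.

```python
import numpy as np

def graph_drawing_objective(U, L):
    """Sum_i u_i^T L u_i, where u_i is the i-th column of U."""
    return float(np.trace(U.T @ L @ U))

# Toy 6-state chain graph (illustrative assumption).
n, d = 6, 3
A = np.zeros((n, n))
for s in range(n - 1):
    A[s, s + 1] = A[s + 1, s] = 1.0
L = np.diag(A.sum(axis=1)) - A
_, vecs = np.linalg.eigh(L)
E = vecs[:, :d]  # the d smallest eigenvectors

# Random d x d orthogonal matrix obtained from a QR decomposition.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
U = E @ Q  # rotated solution

same_value = np.isclose(graph_drawing_objective(E, L),
                        graph_drawing_objective(U, L))
still_orthonormal = np.allclose(U.T @ U, np.eye(d))
```

Both `E` and the rotated `U` satisfy the orthonormality constraint and attain the same objective value, even though the columns of `U` are no longer individual eigenvectors.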
3 Method
As discussed in Sec. 2.3, the graph drawing objective is invariant under orthogonal transformation: applying an arbitrary orthogonal transformation to the $d$ smallest eigenvectors also yields a global minimizer, which hinders learning Laplacian representations close to the ground truth.
We therefore consider breaking this invariance to achieve a more accurate approximation. To this end, we reformulate the graph drawing objective into a weighted-sum form, yielding the following generalized graph drawing objective:
$$\min_{u_1, \dots, u_d} \sum_{i=1}^{d} c_i \, u_i^\top L u_i \quad \text{s.t.} \quad u_i^\top u_j = \delta_{ij}, \ \forall i, j = 1, \dots, d, \tag{3}$$
where $c_i$ is the coefficient for the $i$-th term $u_i^\top L u_i$. When $c_i = 1$ for every $i$, it degenerates to the original graph drawing objective in Eqn. (2).
We find that, under mild assumptions, if $c_1, \dots, c_d$ are strictly decreasing, then the $d$ smallest eigenvectors (the ground-truth Laplacian representation) form the unique global minimizer of the above generalized graph drawing objective, as stated in the following theorem.
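Theorem 1.
Assume $u_i \in \mathrm{span}(\{e_1, \dots, e_d\})$, $\forall i = 1, \dots, d$, and $\lambda_1 < \lambda_2 < \dots < \lambda_d$. Then, $c_1 > c_2 > \dots > c_d > 0$ is a sufficient condition for the generalized graph drawing objective to have a unique global minimizer $(u_1^*, \dots, u_d^*) = (e_1, \dots, e_d)$ (up to the sign of each $e_i$), and the corresponding minimum is $\sum_{i=1}^{d} c_i \lambda_i$.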
Proof.
Here we give the proof sketch of Theorem 1 and the full proof is deferred to the Appendix. Denote the objective of problem (3) as . Let . Since and , without loss of generality, can be written as , where
is an orthogonal matrix.
We first prove optimality. By applying Fubini’s Theorem (Fubini, 1907) to , we can rewrite as , where depends on and . We can prove (this is given by ). Hence . Notice that the inequality is tight when , which proves the optimality.
Then, we prove uniqueness by contradiction. Denote the global minimum as . Assume that there exists another global minimizer, denoted as , i.e., . Here we require because the sign of is arbitrary. Again, we rewrite , where is an orthogonal matrix. Therefore, the assumption is equivalent to . Due to the optimality of , we know . However, we can prove that this equality implies . This contradicts with , and hence proves uniqueness. ∎
We will empirically show that the two assumptions hold in our experiments (see Sec. 4.4.1). Based on the above theorem, we can choose strictly decreasing coefficients $c_1 > \dots > c_d > 0$ to obtain a learning objective that can faithfully approximate the Laplacian representation. A natural choice is $c_i = d - i + 1$, which gives the following objective:
$$\min_{u_1, \dots, u_d} \sum_{i=1}^{d} (d - i + 1) \, u_i^\top L u_i \quad \text{s.t.} \quad u_i^\top u_j = \delta_{ij}, \ \forall i, j = 1, \dots, d. \tag{4}$$
We use the above objective throughout the rest of our paper, and conduct ablative experiments with other choices of the coefficients (see Sec. 4.4.2).
Note that Theorem 1 implies a property of the generalized graph drawing objective in Eqn. (3): there is a one-to-one correspondence between its solutions and the $d$ smallest eigenvectors, i.e., $u_i^* = e_i$ for $i = 1, \dots, d$. With this, a specific dimension (e.g., the 2nd dimension) of the Laplacian representation can be easily derived from the corresponding solution (e.g., $u_2^*$). This exact correspondence is useful for studying how each dimension of the representation influences an RL task, e.g., reward shaping (see Sec. 4.3). Note that the spectral graph drawing objective does not have such a property.
The above theoretical results can be easily generalized to function spaces (i.e., Hilbert spaces), which correspond to a continuous state space in RL (see Appendix).
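To make the effect of decreasing coefficients concrete, the following sketch compares the weighted objective at the eigenvectors and at a rotated copy of them. The toy 6-state chain, the 45-degree rotation of the first two eigenvectors, and all function names are assumptions chosen for illustration. For this particular rotation the gap works out to $(c_1 - c_2)(\lambda_2 - \lambda_1)/2$, which vanishes for equal coefficients and is positive for decreasing ones.

```python
import numpy as np

def generalized_objective(U, L, c):
    """Sum_i c_i * u_i^T L u_i, where u_i is the i-th column of U."""
    return float(np.sum(c * np.diag(U.T @ L @ U)))

# Toy 6-state chain graph (illustrative assumption).
n, d = 6, 3
A = np.zeros((n, n))
for s in range(n - 1):
    A[s, s + 1] = A[s + 1, s] = 1.0
L = np.diag(A.sum(axis=1)) - A
lams, vecs = np.linalg.eigh(L)
E = vecs[:, :d]

# Rotate the first two eigenvectors by 45 degrees; keep the third fixed.
r = np.sqrt(0.5)
Q = np.array([[r, -r, 0.0],
              [r,  r, 0.0],
              [0.0, 0.0, 1.0]])
U = E @ Q

uniform = np.ones(d)                            # c_i = 1 (original objective)
decreasing = np.arange(d, 0, -1, dtype=float)   # c_i = d - i + 1

gap_uniform = (generalized_objective(U, L, uniform)
               - generalized_objective(E, L, uniform))
gap_decreasing = (generalized_objective(U, L, decreasing)
                  - generalized_objective(E, L, decreasing))
```

With equal coefficients the rotated solution is indistinguishable from the eigenvectors in objective value, whereas with decreasing coefficients it is strictly worse.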
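Training objective. In RL applications, it is hard to directly optimize problem (4) because $L$ is not accessible and enumerating the state space may be infeasible. To make the optimization tractable, we follow the practice in (Wu et al., 2019) and express the objective as an expectation. The objective in Eqn. (4) can be rewritten as
$$\sum_{i=1}^{d} (d - i + 1) \, u_i^\top L u_i = \sum_{k=1}^{d} \sum_{i=1}^{k} u_i^\top L u_i = \sum_{k=1}^{d} \sum_{i=1}^{k} \sum_{(s, s') \in \mathcal{E}} \big( u_i[s] - u_i[s'] \big)^2, \tag{5}$$
where the inner summation of the right-hand side is over all edges (i.e., transitions) in the graph, and $u_i[s]$ denotes the entry of vector $u_i$ corresponding to state $s$. In practice, we train a neural network $f(\cdot) = [f_1(\cdot), \dots, f_d(\cdot)]$ with $d$-dimensional output to approximate the Laplacian representation for state $s$. Since we only have sampled transitions, we can express Eqn. (5) as an expectation and minimize the following objective (the orthonormality constraint from Eqn. (4) still applies):
$$\mathbb{E}_{(s, s') \sim \mathcal{T}} \left[ \sum_{k=1}^{d} \sum_{i=1}^{k} \big( f_i(s) - f_i(s') \big)^2 \right], \tag{6}$$
where $(s, s')$ is a state-transition sampled from a dataset of transitions $\mathcal{T}$.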
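A minimal sketch of how the sampled form of this objective might be estimated from a batch of transitions is given below. It covers only the weighted squared-difference term with the default coefficients $d - i + 1$; how the orthonormality constraint is enforced is not shown, and the function name and array layout are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np

def generalized_drawing_loss(f_s, f_next):
    """Batch estimate of sum_k sum_{i<=k} (f_i(s) - f_i(s'))^2.

    f_s, f_next: arrays of shape (batch, d) holding the network outputs for
    the sampled states s and their successor states s'.
    """
    _, d = f_s.shape
    # The nested sum counts dimension i exactly (d - i + 1) times.
    coeffs = np.arange(d, 0, -1, dtype=float)
    sq_diff = (f_s - f_next) ** 2          # shape (batch, d)
    return float(np.mean(sq_diff @ coeffs))

# Hypothetical batch of 4 transitions with d = 3 output dimensions.
rng = np.random.default_rng(1)
f_s = rng.standard_normal((4, 3))
f_next = rng.standard_normal((4, 3))
loss = generalized_drawing_loss(f_s, f_next)
```

Collapsing the nested sum into per-dimension weights avoids an explicit double loop while computing the same quantity.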
4 Experiments
In this section, we conduct extensive experiments to validate the effectiveness of our method in improving learned Laplacian representation. Specifically, in Sec. 4.1, we evaluate the learned representations on how well they approximate the ground truth. In Sec. 4.2 and Sec. 4.3, we evaluate the learned representations on their effectiveness in two downstream tasks, i.e. for discovering exploratory options and improving reward shaping. Finally, in Sec. 4.4, we empirically verify the assumptions used in Theorem 1 and evaluate other coefficient choices.
We use two discrete gridworld environments and two continuous control environments for our experiments (see Fig. 2), following previous work (Wu et al., 2019). The gridworld environments are built with MiniGrid (Chevalier-Boisvert et al., 2018) and the continuous control environments are created with PyBullet (Coumans and Bai, 2016). Note that for gridworld environments, our setting is not tabular, since we approximate the Laplacian representation by training neural networks on raw observations (such as positions or top-view images) rather than learning a mapping table for all states. All experiments use the same dimension $d$ for the Laplacian representation. More details about the training setup can be found in the Appendix. For clarity, throughout the experiments we use "baseline" to refer to the method of (Wu et al., 2019).
4.1 Learning Laplacian Representations
Environment  GridRoom  GridMaze  GridRoom (image)  GridMaze (image)  PointRoom  PointMaze 
Baseline  0.239  0.220  0.310  0.229  0.239  0.255 
Ours  0.991  0.962  0.985  0.984  0.963  0.779 
We take (Wu et al., 2019) as our baseline and, following its practice, we also train a neural network to approximate the Laplacian representation, using trajectories collected by a uniformly random policy with random starts. For the environments with a discrete state space (GridRoom and GridMaze), we conduct experiments with both position observations and image observations. The ground-truth Laplacian representations (i.e., eigenvectors) are computed by eigendecomposing the graph Laplacian matrix. For environments with continuous state spaces (PointRoom and PointMaze), we use positions as observations, and the ground-truth representations (i.e., eigenfunctions) are approximated by the finite difference method with a 5-point stencil (Peter and Lutz, 2003). Please see the Appendix for more training details.
To get an intuitive comparison between our method and the baseline in approximating the Laplacian representation, we visualize the learned state representations as well as the ground-truth ones for GridMaze and PointRoom in Fig. 3 and Fig. 4. As the figures show, our learned Laplacian representations approximate the ground truth much more accurately, while the baseline representations diverge significantly from the ground truth. Similar results in other environments are included in the Appendix.
To quantify the approximation quality of the learned representations, we calculate the absolute dimension-wise cosine similarities between the learned representations and the ground-truth ones, and take the average over all dimensions, which yields the following SimGT metric:
$$\mathrm{SimGT} = \frac{1}{d} \sum_{i=1}^{d} \frac{\left| \sum_{s \in \mathcal{S}} f_i(s) \, e_i[s] \right|}{\sqrt{\sum_{s \in \mathcal{S}} f_i(s)^2} \, \sqrt{\sum_{s \in \mathcal{S}} e_i[s]^2}}, \tag{9}$$
where $s$ is a state, $f_i(s)$ is the $i$-th dimension of the learned representation of $s$ (defined in Sec. 3), and $e_i[s]$ is the $i$-th dimension of the ground truth for $s$ (defined in Sec. 2.2). Note that SimGT is in the range $[0, 1]$, and a larger SimGT means that the representation is closer to the ground truth. As shown in Tab. 1, our method achieves a much higher SimGT than the baseline, indicating a better approximation.
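The averaged absolute cosine similarity described above can be sketched in a few lines. The function name `sim_gt` and the convention that rows index states and columns index representation dimensions are assumptions of this sketch, not details taken from the paper's code.

```python
import numpy as np

def sim_gt(learned, ground_truth):
    """Mean over dimensions of |cosine similarity| between matching columns.

    learned, ground_truth: arrays of shape (num_states, d).
    """
    dots = np.abs(np.sum(learned * ground_truth, axis=0))
    norms = (np.linalg.norm(learned, axis=0)
             * np.linalg.norm(ground_truth, axis=0))
    return float(np.mean(dots / norms))

# Hypothetical ground truth with two orthonormal dimensions over 4 states.
truth = np.eye(4)[:, :2]
perfect = sim_gt(truth, truth)
sign_and_scale_invariant = sim_gt(-2.0 * truth, truth)
swapped_dimensions = sim_gt(truth[:, ::-1], truth)
```

Taking the absolute value makes the score insensitive to the arbitrary sign of each eigenvector, while a representation whose dimensions are permuted or mixed is penalized because the comparison is made dimension by dimension.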
Moreover, we provide empirical evidence for our discussion in Sec. 3 that our method converges to the unique global minimizer, while the baseline method can converge to different minima. To illustrate this, we visualize the learned representations in 3 different runs, and use the following metric to measure the variance between the representations learned in the $i$-th run and those learned in the $j$-th run:
(10)
where the compared quantities are the $k$-th dimension of the learned representation of state $s$ in the $i$-th and $j$-th run, respectively. A larger value implies larger inconsistency in the learned representations between 2 runs. As Fig. 5 shows, the learned representations of the baseline method vary considerably across runs, indicating convergence to different minima. In contrast, our method yields consistent approximations.
The above results demonstrate the superiority of learning Laplacian representation with our proposed objective and empirically support our theoretical analysis in Sec. 3.
4.2 Option Discovery
As discussed in Sec. 2.2, the Laplacian representation can be applied to discovering exploratory options. Here we evaluate the effectiveness of our learned representations in discovering exploratory options, to further show the superiority of our method over the baseline.
Following (Machado et al., 2017), we learn 2 options for each dimension of the learned representation: one with an intrinsic reward function $r_i(s, s') = f_i(s') - f_i(s)$ and the other with $-r_i(s, s')$, where $f_i(s)$ denotes the $i$-th dimension of the representation for state $s$ (see Sec. 3). The options are learned with Deep Q-learning (Mnih et al., 2013). Since the first dimension of the Laplacian representation has the same value for every state (see Sec. 2.2), it cannot provide an informative intrinsic reward. Therefore, we do not learn options for the first dimension of our learned representation.
For each learned option, we compute the average trajectory length for an agent that starts from each state and follows this option until arriving at termination states. This reflects the time scale at which an option acts: options acting at longer time scales enable agents to quickly reach distant areas, and shorter options ensure sufficient exploration in local areas. As Fig. 6 shows, trajectory lengths for our method vary across different dimensions, implying the options operate at different time scales. Such options enable exploration in both nearby and distant areas. In contrast, options discovered from the baseline representation operate at similarly short time scales, which may hinder exploration to the distant areas.
To further validate this, we measure the expected number of steps required for an agent (equipped with learned options) to navigate between different rooms in the GridRoom environment. Specifically, we denote as the average number of steps required for an agent starting from room to reach room . We then calculate as the expected number of steps required to navigate between two rooms. As shown in Fig. 7, agents equipped with options discovered from the baseline representation typically take more steps to reach distant rooms. In comparison, with our method, agents can reach faraway rooms within a similar number of steps as for nearby rooms.
4.3 Reward Shaping
The Laplacian representation can be used for reward shaping in goal-achieving tasks, as mentioned in Sec. 2.2. Previous work (Wu et al., 2019) uses all $d$ dimensions of the learned representation to define the pseudo-reward, i.e., $r_t = -\|f(s_{t+1}) - f(s_{\mathrm{goal}})\|$, where $f(s)$ is the $d$-dimensional representation of state $s$ output by a neural network. Such a pseudo-reward is influenced by each dimension of the representation. As each dimension of the Laplacian representation captures different geometric information about the state space (e.g., see Fig. 4), a natural question is: which dimension of the Laplacian representation matters more to reward shaping? Furthermore, can we achieve better reward shaping than using all dimensions?
In this subsection, we study these questions by comparing individual dimensions of the learned representation for reward shaping. Specifically, we define the pseudo-reward as $r_t = -|f_i(s_{t+1}) - f_i(s_{\mathrm{goal}})|$, where $f_i(s)$ denotes the $i$-th dimension of the representation for state $s$. Following (Wu et al., 2019), we train the agents using Deep Q-learning (Mnih et al., 2013) with positions as observations, and measure the agent’s success rate of reaching the goal state. To eliminate the bias brought by the goal position, we select multiple goals for each environment such that they spread over the state space, and average the results over the different goals. We do not experiment with the first dimension of our learned representation, since every state has the same value and the pseudo-reward is always 0.
As shown in Fig. 8(a) and Fig. 8(b), using lower dimensions of our learned Laplacian representation for reward shaping better accelerates the agent’s learning process. Furthermore, we compare using the best dimension against using all dimensions, for both our learned representation and the baseline representation, in Fig. 8(c) and Fig. 8(d). Results show that the lower dimension of our representation significantly outperforms the others, further improving reward shaping. The above results suggest that lower dimensions of the Laplacian representation are more important to reward shaping. By learning a high-quality Laplacian representation, our method makes it easier to choose which eigenvectors to use for reward shaping, which leads to improved performance.
4.4 Analysis
Here we first empirically verify the two assumptions in Theorem 1, and then conduct ablative experiments with different choices of coefficients for our generalized graph drawing objective (3). We use GridRoom and GridMaze environments for experiments in this subsection.
4.4.1 Verification of assumptions
The first assumption requires that all optimization variables lie in the span of the $d$ smallest eigenvectors during the optimization process, i.e., $u_i \in \mathrm{span}(\{e_1, \dots, e_d\})$ for all $i$.
We empirically verify a necessary and sufficient condition for this assumption: for each $u_i$, the angle between it and its projection onto $\mathrm{span}(\{e_1, \dots, e_d\})$ is 0. Specifically, we compute the cosine distance between $u_i$ and its projection during training. As Fig. 9 shows, the cosine distance is close to 0 during the whole training process, which implies that the assumption holds in our experiments.
To verify whether the second assumption holds, i.e., whether the $d$ smallest eigenvalues of the graph Laplacian are distinct ($\lambda_1 < \lambda_2 < \dots < \lambda_d$), we calculate the $d$ smallest eigenvalues of the graph Laplacian of our environments, and plot them in Fig. 10. It is clear that the eigenvalues are distinct, demonstrating the validity of this assumption.
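The projection-based check can be sketched as follows, assuming the eigenvectors are stored as orthonormal columns of a matrix. The helper name `cosine_distance_to_span` and the small hand-built examples are illustrative assumptions.

```python
import numpy as np

def cosine_distance_to_span(u, E):
    """1 - cos(angle) between u and its projection onto span of E's columns.

    E is assumed to have orthonormal columns; u must not be orthogonal to
    the span, otherwise the projection is zero and the ratio is undefined.
    """
    projection = E @ (E.T @ u)
    cos = (projection @ u) / (np.linalg.norm(projection) * np.linalg.norm(u))
    return float(1.0 - cos)

# Hypothetical span: the first two coordinate axes of a 4-dimensional space.
E = np.eye(4)[:, :2]
inside = cosine_distance_to_span(np.array([1.0, 2.0, 0.0, 0.0]), E)
outside = cosine_distance_to_span(np.array([1.0, 0.0, 1.0, 0.0]), E)
```

A vector lying in the span gives a distance of 0, while any component outside the span makes the distance strictly positive.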
4.4.2 Evaluation on other coefficient choices
In the above experiments, we choose the coefficients of our generalized graph drawing objective to be $c_i = d - i + 1$. In this subsubsection, we evaluate the effectiveness of other choices of $c_i$.
Specifically, we select two groups of coefficients that are different from the default group (i.e., $c_i = d - i + 1$, as used in Eqn. (4)): the first group has an increasing first-order difference, while the second group has a decreasing first-order difference. We plot the two groups in Fig. 11(a). For comparison, we also include the default coefficient group, which has a constant first-order difference of 1. We then train neural networks with the generalized objective in Eqn. (3) using group 1 and group 2 (other experimental settings are the same as in Sec. 4.1). We use the similarity between learned representations and the ground truth to evaluate the quality of the representations, as done in Sec. 4.1. As can be seen from Fig. 11(b), representations learned with objectives using group 1 or group 2 are as good as those learned with the default setting.
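To illustrate what coefficient groups with different first-order differences can look like, the sketch below builds strictly decreasing coefficients from a list of differences between consecutive coefficients. The specific difference values, the value 1 for the last coefficient, and the helper names are hypothetical; they are not the groups used in the experiments.

```python
def coefficients_from_differences(diffs, last=1.0):
    """Build c_1 > ... > c_d from diffs[i] = c_{i+1} - c_{i+2} (0-indexed),
    i.e. the gaps between consecutive coefficients, given the last one."""
    coeffs = [last]
    for delta in reversed(diffs):
        coeffs.append(coeffs[-1] + delta)
    return coeffs[::-1]

def is_strictly_decreasing(coeffs):
    return all(a > b for a, b in zip(coeffs, coeffs[1:]))

# Constant difference of 1 recovers the default c_i = d - i + 1 (here d = 5).
default_group = coefficients_from_differences([1, 1, 1, 1])
# Hypothetical groups with increasing and decreasing first-order differences.
increasing_diff_group = coefficients_from_differences([1, 2, 3, 4])
decreasing_diff_group = coefficients_from_differences([4, 3, 2, 1])
```

All three groups are strictly decreasing and positive, so each satisfies the coefficient condition of Theorem 1.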
5 Related Works
By viewing the state transition process in RL as a graph where nodes are states and edges are transitions, previous works build a Laplacian-based state representation and successfully apply it in value function approximation (Mahadevan, 2005), option discovery (Machado et al., 2017) and reward shaping (Wu et al., 2019).
Mahadevan (2005) proposes proto-value functions, viewing the Laplacian representations as basis state representations, and uses them to approximate value functions. Recently, Machado et al. (2017) introduce a framework for option discovery, which builds options from the Laplacian representation. They show that such options are useful for multiple tasks and helpful in exploration. Later, Machado et al. (2018) extend their Laplacian option discovery framework to settings where handcrafted features are not available, based on a connection between proto-value functions (Mahadevan, 2005) and successor representations (Dayan, 1993; Stachenfeld et al., 2014; Barreto et al., 2017). Jinnai et al. (2019) leverage the approximated Laplacian representation to learn deep covering options for exploration.
Our work focuses on better approximating the Laplacian representation in environments with large state spaces. Most related to our method is the approach proposed in (Wu et al., 2019). The authors optimize a spectral graph drawing objective (Koren, 2005) to approximate the eigenvectors. Though efficient, their method has difficulty learning a Laplacian representation close to the ground truth, because their minimization objective has infinitely many other global minimizers besides the eigenvectors. Our work improves on their method by proposing a new objective, which admits the eigenvectors as its unique global minimizer and hence greatly increases the approximation quality in empirical evaluations. Other approaches for approximating the Laplacian representation include performing singular value decomposition on the incidence matrix (Machado et al., 2017, 2018), and training neural networks with constrained stochastic optimization (Shaham et al., 2018) or bilevel stochastic optimization (Pfau et al., 2019). However, as discussed in (Wu et al., 2019), these approaches either require expensive matrix operations or suffer from poor scalability.
6 Conclusion
The Laplacian representation provides a succinct and informative state representation for RL, which captures the geometry of the underlying state space. Such a representation is beneficial for discovering exploratory options and for reward shaping. In this paper, we propose a new objective that greatly improves the approximation quality of the learned Laplacian representation, for environments with large or even continuous state spaces. We demonstrate the superiority of our method over previous work via theoretical analysis and empirical evaluation. Our method is efficient and simple to implement. With our method, one can learn a high-quality Laplacian representation and apply it to various RL tasks such as option discovery and reward shaping.
7 Acknowledgements
Jiashi Feng is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG100E2019035), Singapore National Research Foundation (“CogniVision” grant NRFCRP2020170003).
References
Agarwal et al. (2021). Contrastive behavioral similarity embeddings for generalization in reinforcement learning. In International Conference on Learning Representations.
Barreto et al. (2017). Successor features for transfer in reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 4058–4068.
Chevalier-Boisvert et al. (2018). Minimalistic gridworld environment for OpenAI Gym. GitHub. https://github.com/maximecb/gym-minigrid
Chung and Graham (1997). Spectral graph theory. American Mathematical Soc.
Coumans and Bai (2016). PyBullet, a Python module for physics simulation for games, robotics and machine learning. http://pybullet.org
Dayan (1993). Improving generalization for temporal difference learning: the successor representation. Neural Computation 5 (4), pp. 613–624.
Dubey et al. (2018). Investigating human priors for playing video games. In International Conference on Machine Learning, pp. 1349–1357.
Fubini (1907). Sugli integrali multipli: nota. Tipografia della R. Accademia dei Lincei.
Jinnai et al. (2019). Exploration in reinforcement learning with deep covering options. In International Conference on Learning Representations.
Kingma and Ba (2015). Adam: a method for stochastic optimization. In ICLR (Poster).
Koren (2005). Drawing graphs by eigenvectors: theory and practice. Computers & Mathematics with Applications 49 (11-12), pp. 1867–1888.
Machado et al. (2017). A Laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, pp. 2295–2304.
Machado et al. (2020). Count-based exploration with the successor representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 5125–5133.
Machado et al. (2018). Eigenoption discovery through the deep successor representation. In International Conference on Learning Representations.
Mahadevan (2005). Proto-value functions: developmental reinforcement learning. In Proceedings of the 22nd International Conference on Machine Learning, pp. 553–560.
Mnih et al. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Nachum et al. (2018). Data-efficient hierarchical reinforcement learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3307–3317.
Pathak et al. (2017). Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, pp. 2778–2787.
Peter and Lutz (2003). For the beginning: the finite difference method for the Poisson equation. In Numerical Methods for Elliptic and Parabolic Partial Differential Equations, pp. 19–45. ISBN 9780387217628.
Pfau et al. (2019). Spectral inference networks: unifying deep and spectral learning. In International Conference on Learning Representations.
Pong et al. (2018). Temporal difference models: model-free deep RL for model-based control. In International Conference on Learning Representations.
Shaham et al. (2018). SpectralNet: spectral clustering using deep neural networks. In International Conference on Learning Representations.
Stachenfeld et al. (2014). Design principles of the hippocampal cognitive map. Advances in Neural Information Processing Systems 27, pp. 2528–2536.
Stooke et al. (2020). Decoupling representation learning from reinforcement learning. arXiv preprint arXiv:2009.08319.
Sutton and Barto (2018). Reinforcement learning: an introduction. MIT Press.
Sutton et al. (1999). Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211.
Wu et al. (2019). The Laplacian in RL: learning representations with efficient approximations. In International Conference on Learning Representations.
Zhang et al. (2018). Decoupling dynamics and reward for transfer learning. arXiv preprint arXiv:1804.10689.
A. Proof of Theorem 1
Lemma 1.
Let be an orthogonal matrix, and . For , we have
(11) 
Proof.
Since is an orthogonal matrix, we know that the sum of ’s first rows is equal to that of ’s first columns, i.e.:
(12) 
Therefore, we have
(13)  
∎
If we view as a block matrix, i.e.
(14) 
Lemma 1 says the sum of elements in is equal to the sum of elements in .
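A quick numerical sanity check of this block-matrix statement is sketched below, under the assumption (made only for this sketch) that the lemma is read as a statement about the squared entries of an orthogonal matrix: since every row and every column of squared entries sums to 1, the squared entries of the upper-right block and of the lower-left block have equal sums. The matrix size, the split index `k`, and the random construction are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 5, 2

# Random orthogonal matrix from a QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
P = Q ** 2  # element-wise squared entries

# Rows and columns of P each sum to 1 because Q is orthogonal.
row_sums = P.sum(axis=1)
col_sums = P.sum(axis=0)

# Off-diagonal blocks of the k / (d - k) partition.
upper_right = P[:k, k:].sum()
lower_left = P[k:, :k].sum()
```

The equality follows because the first `k` rows and the first `k` columns both have total squared mass `k`, and they share the same upper-left block.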
Proof.
Let denote the matrix of the first eigenvectors of , i.e., . Since , and , we may rewrite , where is an orthogonal matrix. Let . Then, the objective of problem (3) becomes:
(15)  
We first prove optimality. Let denote the gap between the objective and . We have
(16)  
Note that , then we have:
(17)  
Let , and , then we can rewrite as:
(18) 
Note that and that, for , . We then apply Fubini’s Theorem (Fubini, 1907) to :
(19)  
Since , with Eqn. (22), we can obtain
(23)  
I.e., the following inequality holds:
(24) 
Since , the inequality is tight when
(25) 
Therefore, we conclude that is the global minimum, and is one minimizer.
Next, we prove uniqueness. Assume that there is another minimizer for this problem, denoted as . We have
(26)  
Here we require because the sign of is arbitrary and hence we do not distinguish them. Again, can be written as , where is an orthogonal matrix. Therefore, proposition in Eqn. (26) is equivalent to
(27) 
Denote . By the optimality of , we have
(28) 
From Eqn. (16) to Eqn. (22), we know
(29) 
The equality holds if and only if . Additionally, according to Lemma 1, we have
(30) 
Therefore, we also have . Accordingly, all offdiagonal elements of are 0, i.e., . Moreover, since is orthogonal, the following equality holds
(31) 
So we have
(32)  
which contradicts the proposition in Eqn. (27). Based on the above, we conclude that $(e_1, \dots, e_d)$ is the unique global minimizer. ∎
B. Extension to Continuous setting
In Sec. 2 and Sec. 3, we discuss the Laplacian representation and our proposed objective in discrete case. In this section we extend previous discussions to continuous settings. Consider a graph with infinitely many nodes (i.e., states), where weighted edges represent pairwise nonnegative affinities (denoted by for nodes and ).
Following (Wu et al., 2019), we give the following definitions. A Hilbert space is defined to be the set of squareintegrable realvalued functions on graph nodes, i.e. , associated with the innerproduct
(33) 
where
is a probability measure,
i.e. . The norm of a function is defined as . Functions are orthogonal if ; functions are orthonormal if . The graph Laplacian is defined as a linear operator on , given by(34) 
Our goal is to learn for approximating the eigenfunctions associated with the smallest eigenvalues of . The graph drawing objective used in (Wu et al., 2019) is
(35)  
Extending this objective to the generalized form gives us
(36)  
Similarly, for continuous setting, Theorem 1 can be extended to the following theorem:
Theorem 2.
Assume , and . Then, is a sufficient condition for the generalized graph drawing objective to have a unique global minimizer , and the corresponding minimum is .
Lemma 2.
Let be orthonormal functions in , and be the inner product of and , i.e., . Then we have (i)