1 Introduction
In reinforcement learning (RL), an agent, or decision maker, takes sequential actions and observes the consequent rewards and states, which are unknown a priori. These sequential observations improve the agent’s knowledge of the environment, with the final goal of learning the optimal policy that maximizes the long-term reward. The learning control problem is usually formulated as a Markov decision process (MDP), where each state has an associated value function, which estimates the expected long-term reward for some policy (usually the optimal one). Classical MDPs represent the value function by a lookup table, with one entry for each state. However, this does not scale with the dimension of the state space (and, implicitly, of the action space), leading to slow learning in high-dimensional reinforcement learning problems. Approximate reinforcement learning addresses this problem by learning a function that approximates the true value function. In the literature, many types of functions have been studied
[Kaelbling et al., 1996, Sutton and Barto, 1998].

In this work, we study linear value function approximation, where the value function is represented as a weighted linear sum of a set of features (called basis functions). Linear function approximation can represent complex value functions by choosing arbitrarily complex basis functions. Under this framework, one of the main challenges is to identify the right set of basis functions. Typical linear approximation architectures such as polynomial basis functions (where each basis function is a polynomial term) and radial basis functions (where each basis function is a Gaussian with fixed mean and variance) have been studied in the context of reinforcement learning
[Lagoudakis, 2003]. These architectures assume that the underlying state space has Euclidean geometry. However, in realistic scenarios, the MDP’s state space is likely to exhibit irregularities. For instance, consider the environment depicted in Figure 1. As can be seen in Figure 1(b), neighboring states can have values that are far apart (such as states on opposite sides of a wall). In such cases, these traditional parametric functions may not be able to approximate the value function accurately.

Consequently, other basis functions have been studied to address this issue. Examples include the Fourier basis [Konidaris et al., 2011], diffusion wavelets [Mahadevan and Maggioni, 2006], the Krylov basis [Petrik et al., 2010] and Bellman Error Basis Functions [Parr et al., 2007, Parr et al., 2008]. In particular, [Mahadevan and Maggioni, 2007] introduce representation policy iteration
(RPI), a spectral graph framework for solving Markov decision processes by jointly learning representations and optimal policies. The authors first note that MDPs can be intuitively represented as graphs, with the states as nodes and the transition probabilities as the similarity matrix. Then, under the assumption that the value function can be modelled as a diffusion process over the state graph (and is therefore a smooth function), they approximate the value function (a smooth signal on the graph) as a linear combination of the first Laplacian eigenmaps on the state graph. These features, known as proto-value functions (PVFs), preserve the smoothness of the value function.

In this paper, we argue that constructing a graph that perfectly models the MDP, such that the value function is indeed smooth on the graph, is not trivial. Therefore, there is a need to automatically learn, from limited data, basis functions that capture the geometry of the underlying state space. Given the success of recent node embedding models [Grover and Leskovec, 2016, Kipf and Welling, 2016b, Ribeiro et al., 2017, Donnat et al., 2018], we propose to investigate representation learning algorithms on graphs to learn the basis functions in linear value function approximation.

The idea behind recent successful representation learning approaches is to learn a mapping that embeds the nodes of a graph as low-dimensional vectors. They aim to optimize the representations so that geometric relationships in the embedding space preserve the structure of the original graph.
[Hamilton et al., 2017] survey recent methods for representation learning on graphs. In this work, we generalize the RPI algorithm [Mahadevan and Maggioni, 2007] to allow different basis functions, and analyze the performance of several representation learning methods for value function approximation.

The rest of this paper is structured as follows. In Section 2, we introduce background material on Markov decision processes and value function approximation, and describe the representation learning algorithms used in this work. The General Representation Policy Iteration algorithm is described in Section 3. In Section 4, we discuss experimental results, summarize the main findings and give directions for future work. We conclude in Section 5.
2 Background
2.1 Markov decision process (MDP)
Markov decision processes are discrete-time stochastic control processes that provide a widely used mathematical framework for modeling decision making under uncertainty. Specifically, at each time step, the process is in some state $s$, and the agent can choose any action $a$ that is available in state $s$. As a consequence of the action taken, the agent finds itself in a new state $s'$ and observes an instantaneous reward $r$. We define a discrete MDP by the tuple $M = (S, A, P, R)$, where $S$ is a finite set of discrete states, $A$ a finite set of actions, $P$ describes the transition model, with $P(s, a, s')$ giving the probability of moving from state $s$ to $s'$ given action $a$, and $R$ describes the reward function, with $R(s, a)$ giving the immediate reward from taking action $a$ in state $s$. Given a policy $\pi$, a value function $V^{\pi}$ is a mapping $S \to \mathbb{R}$ that describes the expected long-term discounted sum of rewards observed by the agent in any given state $s$ when following policy $\pi$. Solving an MDP requires finding a policy $\pi^*$ that defines the optimal value function $V^*$, which satisfies the following constraints, where $\gamma \in [0, 1)$ denotes the discount factor:

$$V^*(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s' \in S} P(s, a, s')\, V^*(s') \Big)$$

This recursive equation is known as the standard form of Bellman’s equation. The optimal value function $V^*$ is the unique solution to Bellman’s equation, and an optimal policy acts greedily with respect to it. The solution can be found by dynamic programming, iteratively evaluating the value functions for all states.
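The dynamic-programming solution described above can be sketched as value iteration. The snippet below is a minimal illustration rather than the authors' implementation; the dictionary layout of `P` and `R`, the discount factor of 0.9 and the three-state chain are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# P[s][a] is a list of (next_state, probability); R[s][a] is the immediate reward.
Transitions = Dict[int, Dict[str, List[Tuple[int, float]]]]
Rewards = Dict[int, Dict[str, float]]


def backup(P: Transitions, R: Rewards, V: Dict[int, float],
           s: int, a: str, gamma: float) -> float:
    """One-step lookahead: R(s, a) + gamma * sum_s' P(s, a, s') V(s')."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])


def value_iteration(P: Transitions, R: Rewards, gamma: float = 0.9,
                    tol: float = 1e-10) -> Dict[int, float]:
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(backup(P, R, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def greedy_policy(P: Transitions, R: Rewards, V: Dict[int, float],
                  gamma: float = 0.9) -> Dict[int, str]:
    """Pick, in each state, the action with the best one-step lookahead."""
    return {s: max(P[s], key=lambda a: backup(P, R, V, s, a, gamma)) for s in P}


# Hypothetical 3-state chain: state 2 is an absorbing goal state.
P = {
    0: {"stay": [(0, 1.0)], "right": [(1, 1.0)]},
    1: {"stay": [(1, 1.0)], "right": [(2, 1.0)]},
    2: {"stay": [(2, 1.0)]},
}
R = {
    0: {"stay": 0.0, "right": 0.0},
    1: {"stay": 0.0, "right": 1.0},
    2: {"stay": 0.0},
}
V = value_iteration(P, R)
policy = greedy_policy(P, R, V)
```

With one table entry per state, this is exactly the lookup-table representation whose cost motivates the function approximation discussed next.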
2.2 Value Function Approximation
In large state spaces, computing exact value functions can be computationally intractable. A possible solution is to estimate the value function with function approximation (value function approximation) [Bertsekas and Tsitsiklis, 1996]. Commonly, the value function is approximated as a weighted sum of a set of features (called basis functions) [Montague, 1999, Mahadevan, 2007, Konidaris et al., 2011, Lagoudakis, 2003]:

$$V^{\pi}(s) \approx \hat{V}^{\pi}(s \mid \theta) = \sum_{i=1}^{d} \phi_i(s)\, \theta_i$$

where the $\phi_i$ are the basis functions, the $\theta_i$ are their weights, and $d$ is the dimension of the feature space.
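As a small illustration of the weighted-sum form above, the following sketch fits the weights by ordinary least squares with NumPy. The polynomial basis, the five states on a line and the target values are hypothetical; any feature matrix with one row per state and one column per basis function would do.

```python
import numpy as np


def fit_weights(Phi: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Least-squares weights theta such that Phi @ theta approximates v."""
    theta, *_ = np.linalg.lstsq(Phi, v, rcond=None)
    return theta


# Hypothetical example: 5 states on a line, d = 3 polynomial basis functions.
states = np.arange(5, dtype=float)
Phi = np.stack([states ** i for i in range(3)], axis=1)  # shape (5, 3)
v_true = 2.0 + 0.5 * states ** 2                         # lies in the span of Phi
theta = fit_weights(Phi, v_true)
v_hat = Phi @ theta                                      # approximate values
```

Because the target here lies in the span of the basis, the fit is exact; with a poorly chosen basis the residual would be non-zero, which is the approximation error the choice of basis functions is meant to reduce.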
The basis functions can be hand-crafted [Sutton and Barto, 1998] or automatically constructed [Mahadevan and Maggioni, 2007], and the model parameters are typically learned via standard parameter estimation methods such as least-squares policy iteration (LSPI) [Lagoudakis, 2003]. However, how to design the set of basis functions for a data-efficient function approximation framework is still an open question. The main difficulty is to find a set of basis functions that is low-dimensional (to ensure data-efficient learning) and yet a meaningful representation of the MDP (to reduce the suboptimality due to the value function approximation).
The representation policy iteration algorithm (RPI) was introduced in [Mahadevan and Maggioni, 2007] to address this problem. It is a three-step algorithm consisting of (1) a sample collection phase, (2) a representation learning phase and (3) a parameter estimation phase. RPI is described in further detail in Section 3. In this work, we propose to generalize RPI to allow different representation learning methods. In particular, we first observe that the state space topology of an MDP can be intuitively modeled as an (un)directed weighted graph, with the states as nodes and the transition probability matrix as the similarity matrix. When the transition probabilities are unknown, we can construct a graph from collected samples by connecting temporally consecutive states with a unit cost edge. Therefore, similarly to [Mahadevan, 2007], we propose to construct the graph from collected samples of an agent acting in the environment given by the MDP. We then learn representations on the graph induced by the MDP using node embedding methods. Finally, we use the learned representations to linearly approximate the value function. In the next section, we describe the node embedding models that we exploit within this framework.
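The graph construction from samples can be sketched in a few lines. This is a minimal illustration under stated assumptions: states are hashable identifiers, each episode is a list of visited states, and self-loops (an agent that stayed in place) are ignored, which the text does not specify.

```python
from typing import Dict, Hashable, Iterable, List, Set


def build_state_graph(episodes: Iterable[List[Hashable]]) -> Dict[Hashable, Set[Hashable]]:
    """Connect temporally consecutive states with an undirected unit-cost edge."""
    adj: Dict[Hashable, Set[Hashable]] = {}
    for episode in episodes:
        for s in episode:
            adj.setdefault(s, set())          # every visited state becomes a node
        for s, s_next in zip(episode, episode[1:]):
            if s != s_next:                   # assumption: skip self-loops
                adj[s].add(s_next)
                adj[s_next].add(s)
    return adj


# Hypothetical sampled trajectories over four states.
episodes = [[0, 1, 1, 2], [2, 3]]
graph = build_state_graph(episodes)
```

States that were never visited do not appear in the graph at all, which is one reason an estimated graph can differ from the ideal one.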
2.3 Representation Learning on Graphs
We propose to use the following learned node embedding models as basis functions for value function approximation, in order to automatically encode the graph structure (and hence the MDP) into low-dimensional embeddings.
Node2Vec
Node2vec [Grover and Leskovec, 2016] is an algorithmic framework for learning continuous feature representations for nodes in networks. It is inspired by the powerful language model Skip-gram [Mikolov et al., 2013], which is based on the hypothesis that words that appear in the same context share semantic meaning. In networks, the same hypothesis can be made for nodes, where the context of a node is derived by considering the nodes that appear in the same random walk on the graph. Therefore, node2vec learns the node embeddings based on random walk statistics. The key is to optimize the node embeddings so that nodes have similar embeddings if they tend to co-occur on short (biased) random walks over the graph. Moreover, it allows for a flexible definition of random walks by introducing parameters that interpolate between walks that behave more like a breadth-first search and walks that behave more like a depth-first search.
Specifically, for a graph $G = (\mathcal{V}, \mathcal{E}, W)$ (where $\mathcal{V}$ is a set of nodes, $\mathcal{E}$ a set of edges and $W$ the weight matrix) and a set $\mathcal{W}$ of biased random walks collected under a specific sampling strategy on the graph $G$, node2vec seeks to maximize the log-probability of observing the network neighborhood of each node conditioned on its feature representation, given by $f$ (a matrix of $|\mathcal{V}| \times d$ parameters, where $d$ is the dimension of the feature space):

$$\max_{f} \sum_{w \in \mathcal{W}} \sum_{i=1}^{|w|} \log \Pr\big( N(w_i) \mid f(w_i) \big)$$

where $N(w_i)$ describes the neighborhood of the $i$th node in the walk $w$.
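The biased walks that feed this objective can be sketched as follows. The snippet only covers walk sampling on an unweighted graph, not the Skip-gram training itself. It assumes the second-order scheme of node2vec, in which the return parameter `p` and the in-out parameter `q` reweight each candidate step according to its distance from the previously visited node; the adjacency-dictionary format and the four-node cycle are hypothetical.

```python
import random
from typing import Dict, List, Set


def biased_walk(adj: Dict[int, Set[int]], start: int, length: int,
                p: float, q: float, rng: random.Random) -> List[int]:
    """Sample one second-order random walk of at most `length` nodes."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = sorted(adj[cur])
        if not nbrs:                       # dead end: stop early
            break
        if len(walk) == 1:                 # first step is uniform
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:                  # going back to the previous node
                weights.append(1.0 / p)
            elif x in adj[prev]:           # staying at distance 1 from prev
                weights.append(1.0)
            else:                          # moving outward, away from prev
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk


# Hypothetical 4-node cycle graph.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {2, 0}}
walk = biased_walk(adj, start=0, length=10, p=0.5, q=2.0, rng=random.Random(0))
```

A small `p` with a large `q` keeps the walk near its starting point, whereas a large `p` with a small `q` pushes it outward.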
Struc2Vec
By introducing a bias in the sampling strategy, node2vec can learn representations that do not only place nearby nodes close together in the embedding space, but also capture the structural roles of the nodes, independently of their global location in the graph. The recent node embedding approach struc2vec, proposed by [Ribeiro et al., 2017], specifically addresses the problem of embedding nodes such that their structural roles are preserved. The model generates a series of weighted auxiliary graphs $G'_k$ (with $k = 1, 2, \ldots$) from the original graph $G$, where the auxiliary graph $G'_k$ captures structural similarities between the nodes’ $k$-hop neighborhoods. Formally, let $R_k(v_i)$ denote the ordered sequence of degrees of the nodes that are exactly $k$ hops away from $v_i$. The edge weights $w_k(v_i, v_j)$ in the auxiliary graph $G'_k$ are recursively defined by the structural distance between nodes $v_i$ and $v_j$:

$$w_k(v_i, v_j) = w_{k-1}(v_i, v_j) + d\big( R_k(v_i), R_k(v_j) \big)$$

where $w_0(v_i, v_j) = 0$ and $d(R_k(v_i), R_k(v_j))$ is the distance between the ordered degree sequences $R_k(v_i)$ and $R_k(v_j)$, computed via dynamic time warping [Ribeiro et al., 2017].
Once the weighted auxiliary graphs are computed, struc2vec runs biased random walks over them and proceeds as node2vec, optimising the log-probability of observing a network neighborhood based on these random walks.
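The structural distance between two nodes can be illustrated with a small dynamic time warping routine over ordered degree sequences. This is a sketch under assumptions: the element cost `max(x, y) / min(x, y) - 1` is assumed here as a plausible degree-comparison cost and is not stated in the text, and the five-node path graph is hypothetical.

```python
from collections import deque
from typing import Callable, Dict, List, Set


def degree_sequence(adj: Dict[int, Set[int]], u: int, k: int) -> List[int]:
    """Sorted degrees of the nodes that are exactly k hops away from u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        if dist[x] == k:                   # no need to expand beyond depth k
            continue
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return sorted(len(adj[x]) for x, dx in dist.items() if dx == k)


def dtw(a: List[int], b: List[int], cost: Callable[[int, int], float]) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
            D[i][j] = cost(a[i - 1], b[j - 1]) + step
    return D[n][m]


def degree_cost(x: int, y: int) -> float:
    """Assumed element cost: zero for equal degrees, relative otherwise."""
    return max(x, y) / min(x, y) - 1.0


# Hypothetical path graph 0 - 1 - 2 - 3 - 4.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
d_ends = dtw(degree_sequence(adj, 0, 1), degree_sequence(adj, 4, 1), degree_cost)
d_mixed = dtw(degree_sequence(adj, 2, 2), degree_sequence(adj, 0, 2), degree_cost)
```

The two end nodes of the path are far apart in the graph but have identical 1-hop degree sequences, so their distance is zero; this is the sense in which the embedding is structural rather than positional.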
GraphWave
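The GraphWave algorithm proposed by [Donnat et al., 2018] takes a different approach to learning structural node embeddings. It learns node representations based on the diffusion of a spectral graph wavelet centered at each node. For a graph $G$, $L = D - A$ denotes the graph Laplacian, where $A$ is the adjacency matrix and $D$ is a diagonal matrix whose entries are the row sums of the adjacency matrix. Let $U$ denote the eigenvector decomposition of the graph Laplacian, $L = U \Lambda U^{\top}$, and let $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$ denote the eigenvalues of $L$. Given a heat kernel $g_s(\lambda) = e^{-\lambda s}$ for a given scale $s$, GraphWave uses $U$ and $g_s$ to compute a vector $\Psi_a$ representing the diffusion pattern for node $a$ as follows:

$$\Psi_a = U \, \mathrm{diag}\big( g_s(\lambda_1), \ldots, g_s(\lambda_N) \big) \, U^{\top} \delta_a$$

where $\delta_a$ is the one-hot indicator vector for node $a$. Then, the characteristic function of each node’s coefficients $\Psi_a$ is computed as

$$\phi_a(t) = \frac{1}{N} \sum_{m=1}^{N} e^{i t \Psi_{ma}}$$

Finally, to obtain the structural node embedding $\chi_a$ for node $a$, the parametric function $\phi_a(t)$ is sampled at $d$ evenly spaced points $t_1, \ldots, t_d$:

$$\chi_a = \big[ \mathrm{Re}(\phi_a(t_i)),\ \mathrm{Im}(\phi_a(t_i)) \big]_{t_1, \ldots, t_d}$$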
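The GraphWave pipeline (heat-kernel diffusion, empirical characteristic function, sampling) can be sketched with NumPy. This is an illustrative reading of the steps above, not the reference implementation: the scale, the number of sample points and their range are arbitrary assumptions, and the real and imaginary parts are concatenated rather than interleaved.

```python
import numpy as np


def graphwave_embeddings(A: np.ndarray, scale: float = 1.0,
                         n_points: int = 4, t_max: float = 10.0) -> np.ndarray:
    """Structural embeddings from heat-kernel wavelets, one row per node."""
    A = np.asarray(A, dtype=float)
    L = np.diag(A.sum(axis=1)) - A                  # combinatorial Laplacian D - A
    lam, U = np.linalg.eigh(L)                      # L = U diag(lam) U^T
    Psi = U @ np.diag(np.exp(-scale * lam)) @ U.T   # column a holds the wavelet of node a
    ts = np.linspace(1.0, t_max, n_points)          # evenly spaced sample points
    # phi[a, j] = mean over m of exp(i * t_j * Psi[m, a])
    phi = np.exp(1j * ts[None, None, :] * Psi[:, :, None]).mean(axis=0)
    return np.concatenate([phi.real, phi.imag], axis=1)


# Hypothetical path graph 0 - 1 - 2 - 3.
A = np.zeros((4, 4))
for i in range(3):
    A[i, i + 1] = A[i + 1, i] = 1.0
emb = graphwave_embeddings(A)
```

On this path graph the two end nodes receive identical embeddings, as do the two inner nodes, because the embedding depends only on how heat diffuses around a node and not on where the node is located.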
Variational Graph Auto-Encoder
As opposed to directly encoding each node, autoencoders aim at directly incorporating the graph structure into the encoder algorithm. The key idea is to compress information about a node’s local neighborhood. The Variational Graph Auto-Encoder proposed by [Kipf and Welling, 2016b] is a latent variable model for graph-structured data, capable of learning interpretable latent representations for undirected graphs. The Graph Auto-Encoder uses a Graph Convolutional Neural Network (GCN) [Kipf and Welling, 2016a] to encode the graph and an inner-product decoder to reconstruct it. The Variational Graph Auto-Encoder additionally makes use of latent variables.

3 General Representation Policy Iteration
In the context of value function approximation, the representation policy iteration algorithm (RPI) was introduced in [Mahadevan and Maggioni, 2007] to learn the approximating function. RPI is a three-step algorithm consisting of (1) a sample collection phase, which builds a training dataset of quadruples (state, action, next state, reward); (2) a representation learning phase, which defines a set of basis functions; and (3) a parameter estimation phase, in which the coefficients of the linear approximation are learned. A generalized version of the RPI algorithm [Mahadevan, 2007] is described in Algorithm 1.
In the original RPI, the representation learning phase is predefined. Namely, an undirected weighted graph $G$ is built from the available data set. Then a diffusion operator, such as the normalized Laplacian, is computed on the graph $G$, and the $d$-dimensional basis functions $\phi_1, \ldots, \phi_d$ are constructed from a spectral analysis of the diffusion operator. Specifically, the $\phi_i$’s are the smoothest eigenvectors of the graph Laplacian and are known as proto-value functions (PVFs). The key is that, given a state graph that perfectly represents the MDP, the value function can be modelled as a diffusion process over the graph (and is therefore a smooth function). Hence, given the spectral properties of the Laplacian operator, the proto-value functions are a good choice of basis functions for preserving the smoothness of the value function.
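The construction of proto-value functions can be sketched as an eigendecomposition of the normalized Laplacian. The form $I - D^{-1/2} W D^{-1/2}$ used below is one common definition and is an assumption here, as are the five-node path graph and the requirement that no node is isolated.

```python
import numpy as np


def proto_value_functions(W: np.ndarray, d: int):
    """Return the d smallest eigenvalues and eigenvectors of the normalized Laplacian."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(deg)                    # assumes no isolated nodes
    L_norm = np.eye(len(W)) - inv_sqrt[:, None] * W * inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_norm)        # eigenvalues in ascending order
    return eigvals[:d], eigvecs[:, :d]               # smoothest eigenvectors first


# Hypothetical path graph with 5 states and unit edge weights.
W = np.zeros((5, 5))
for i in range(4):
    W[i, i + 1] = W[i + 1, i] = 1.0
eigvals, Phi = proto_value_functions(W, d=3)
```

Each column of `Phi` is one basis function evaluated on all states, so `Phi` can be used directly as the feature matrix of the linear approximation.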
However, it is not guaranteed that we can construct, from a limited number of samples, a graph whose proto-value functions reflect the underlying state space. In fact, we can show that the value function is not as smooth on the estimated graph (constructed from samples) as it is on the ideal graph, where the edges are weighted by the transition probabilities. We consider the environment depicted in Figure 1. To construct the estimated graph, we first collect samples by running 100 independent episodes, each starting at a random initial state and taking successive random actions until either a maximum of 100 steps have been made or the goal state has been reached. We then connect temporally consecutive states with a unit cost edge. The ideal graph is simply the graph with edges representing actual transition probabilities (i.e. edges between accessible states have weight 1, edges between an accessible state and a wall state have weight 0, and edges between an accessible or difficult-access state and a difficult-access state have weight 0.2). We use the following function to measure the global smoothness of the value function $V$ on a graph:

$$V^{\top} L V = \frac{1}{2} \sum_{i, j} W_{ij} \big( V(i) - V(j) \big)^2$$

where $L$ is the graph Laplacian and $W$ is the weight matrix of the graph. In other words, if values $V(i)$ and $V(j)$ from a smooth function reside on two well-connected nodes (i.e. $W_{ij}$ is large), they are expected to have a small distance $(V(i) - V(j))^2$, hence $V^{\top} L V$ is small overall.
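The smoothness measure is the Laplacian quadratic form, which can be checked numerically in both of its standard forms. The sketch below assumes the combinatorial Laplacian (degree matrix minus weight matrix); the three-node path and the two value vectors are hypothetical.

```python
from typing import List, Sequence


def smoothness(W: Sequence[Sequence[float]], v: Sequence[float]) -> float:
    """Edge form: 0.5 * sum_ij W_ij * (v_i - v_j)^2. Smaller means smoother."""
    n = len(v)
    return 0.5 * sum(W[i][j] * (v[i] - v[j]) ** 2
                     for i in range(n) for j in range(n))


def quadratic_form(W: Sequence[Sequence[float]], v: Sequence[float]) -> float:
    """Matrix form: v^T (D - W) v, with D the diagonal degree matrix."""
    n = len(v)
    deg: List[float] = [sum(row) for row in W]
    diag_part = sum(deg[i] * v[i] * v[i] for i in range(n))
    cross_part = sum(W[i][j] * v[i] * v[j] for i in range(n) for j in range(n))
    return diag_part - cross_part


# Hypothetical path graph 0 - 1 - 2 with unit weights.
W = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
v_smooth = [0.0, 1.0, 2.0]   # changes slowly along edges
v_rough = [0.0, 5.0, 0.0]    # jumps across both edges

s_smooth = smoothness(W, v_smooth)
s_rough = smoothness(W, v_rough)
```

The smoothly varying vector scores far lower than the one that jumps across both edges.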
As seen in Table 1, this analysis shows a reduction of the value function smoothness when going from the ideal weighted graph to the estimated and unweighted graph (usually considered in realistic settings, when the transition probability is not known a priori).
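Table 1: Smoothness measure of the value function on the estimated graph and on the ideal weighted graph (lower is smoother).

Graph              Smoothness
Estimated graph    14831.72
Weighted graph     5705.65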
As a result, the smoothest proto-value functions of the estimated graph, on which the value function is less smooth, are not expected to reconstruct the true value function as well as the smoothest proto-value functions of the ideal graph. This is verified in Figure 2, which shows, in both cases, the mean squared error (MSE) of the approximate value function, fitted in the least-squares sense to the true value function computed via value iteration [Montague, 1999] for the environment shown in Figure 1.
To overcome this limitation, we propose to use the node embedding methods described in Section 2.3 to automatically learn basis functions that capture the geometry of the underlying state space. In the following, we describe how to apply these feature learning methods within reinforcement learning.
4 Experiments
4.1 Set up
We consider the two-room environment used in [Mahadevan and Maggioni, 2007], shown in Figure 3. It consists of 100 states in total, divided into 57 accessible states and 43 inaccessible states representing walls. There is one goal state, marked in red, and the agent is rewarded for reaching it.
We also consider the obstacles-room environment depicted in Figure 3. In this environment, some states are inaccessible because they represent exterior walls, and some are accessible from neighbouring states only with a fixed probability (they represent a moving obstacle or a difficult-access space). All the other states are freely accessible. The agent is rewarded for reaching the state located at the upper-right corner.
We construct the corresponding graphs where each location is a node, and the transitions (4 possible actions: left, right, up and down) are represented by the edges.
We run and evaluate the General Representation Policy Iteration (GRPI) algorithm using embedding models from Section 2.3 to compute the basis functions in the second phase of the algorithm.

We first collect a set of 100 sampled random walks, each of length 100 (or terminating early when the goal state is reached). The sampling dynamics are as follows: starting from a random accessible state, the agent takes one of the four possible actions (move up, down, left or right). If a movement is possible, it succeeds with probability 0.9. Otherwise, the agent remains in the same state. If the agent reaches the goal state, it receives a reward of 100 and is randomly reset to an accessible interior state. We use off-policy sampling (random policy) to collect the samples, except in the case of node2vec, where the samples are generated under a biased random walk. We use grid search to find the optimal hyperparameters $p$ and $q$ that guide the walk according to [Grover and Leskovec, 2016].
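The sample collection phase under the random policy can be sketched as below. The grid encoding (a set of free `(row, column)` cells), the tuple layout `(s, a, r, s_next)`, the fixed seed and the choice to end an episode at the goal instead of resetting are assumptions made for this illustration.

```python
import random
from typing import List, Set, Tuple

Cell = Tuple[int, int]
Sample = Tuple[Cell, str, float, Cell]

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def collect_samples(free_cells: Set[Cell], goal: Cell, n_episodes: int = 100,
                    max_steps: int = 100, p_success: float = 0.9,
                    goal_reward: float = 100.0, seed: int = 0) -> List[Sample]:
    """Run random-policy episodes and record (s, a, r, s_next) transitions."""
    rng = random.Random(seed)
    starts = sorted(free_cells - {goal})
    action_names = sorted(ACTIONS)
    samples: List[Sample] = []
    for _ in range(n_episodes):
        s = rng.choice(starts)
        for _ in range(max_steps):
            a = rng.choice(action_names)
            dr, dc = ACTIONS[a]
            target = (s[0] + dr, s[1] + dc)
            # A possible move succeeds with probability p_success; otherwise stay.
            moved = target in free_cells and rng.random() < p_success
            s_next = target if moved else s
            r = goal_reward if s_next == goal else 0.0
            samples.append((s, a, r, s_next))
            if s_next == goal:             # assumption: episode ends at the goal
                break
            s = s_next
    return samples


# Hypothetical 3 x 3 open room with the goal in the upper-right corner.
free_cells = {(r, c) for r in range(3) for c in range(3)}
goal = (0, 2)
samples = collect_samples(free_cells, goal)
```

The resulting transitions serve both to build the state graph and, later, to estimate the weights of the linear approximation.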
We then use the sample transitions to build an undirected graph whose weight matrix is the adjacency matrix, and run each of the embedding models, node2vec (n2v), struc2vec (s2v), variational graph auto-encoder (VGAE) and GraphWave (GW), for different choices of the embedding dimension. In the case of node2vec, we reuse the sample set to derive the node neighbourhoods used in the objective function.

We learn the parameters of the linear value approximation using the parameter estimation method LSPI [Lagoudakis, 2003] on the collected set of samples.
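The parameter estimation phase can be sketched as a compact least-squares policy iteration loop. This is a simplified illustration rather than the configuration used in the experiments: the discount factor, the small ridge term `reg`, the `done` flag on each sample and the tabular features of the toy problem are assumptions.

```python
import numpy as np


def lstdq(samples, phi, policy, n_features, gamma=0.9, reg=1e-6):
    """Solve A w = b with A = sum f (f - gamma f')^T and b = sum r f."""
    A = reg * np.eye(n_features)           # small ridge term keeps A invertible
    b = np.zeros(n_features)
    for s, a, r, s_next, done in samples:
        f = phi(s, a)
        f_next = np.zeros(n_features) if done else phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)


def lspi(samples, phi, actions, n_features, gamma=0.9, n_iter=20, tol=1e-6):
    """Alternate policy evaluation (LSTDQ) and greedy policy improvement."""
    w = np.zeros(n_features)
    for _ in range(n_iter):
        def policy(s, w=w):
            return max(actions, key=lambda a: float(phi(s, a) @ w))
        w_new = lstdq(samples, phi, policy, n_features, gamma)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w


# Hypothetical toy problem: from state 0, "go" reaches the goal (reward 1).
actions = ["stay", "go"]


def phi(s, a):
    """Tabular features: one indicator per (state, action) pair."""
    f = np.zeros(4)
    f[2 * s + actions.index(a)] = 1.0
    return f


samples = [(0, "go", 1.0, 1, True), (0, "stay", 0.0, 0, False)]
w = lspi(samples, phi, actions, n_features=4)
```

In the setting of this paper, `phi` would be built from the learned node embedding of the state (for example, one copy of the embedding per action) instead of the tabular indicator used here.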

We use the policies learned by GRPI for each model to run simulations starting from each accessible state. We compare the performance of the models in terms of the average number of steps required to reach the goal. We also compare with the traditional PVF basis functions. The results for the two environments, averaged over 20 independent runs, are shown in Figure 4. Each run consists of episodes of at most 100 steps, where each episode is terminated early if the agent reaches the goal state.
4.2 Discussion
Figure 4 shows the average number of steps to reach the goal as a function of the dimension of the basis functions. We first observe that the policy learned via the GraphWave basis functions leads to very poor performance regardless of the dimension. We investigate this phenomenon by looking at the approximate value function learned with this basis. The approximate state values are depicted in Figure 5. Because GraphWave aims at learning embeddings that are exclusively structural, we hypothesise that they fail to capture global network properties. In fact, the embeddings learned by GraphWave for the corner states in the two-room environment are equal, which makes it impossible to learn different state values for them with a linear approximation. This suggests that, although GraphWave is a powerful model for capturing structural information in networks, it is not a good choice of basis functions for approximating the value function on a graph.
On the other hand, we notice that although struc2vec was also designed to capture structural similarities between nodes, it also preserves the local properties of the graph by considering neighborhoods of different sizes [Ribeiro et al., 2017]. Hence, struc2vec is able to accurately approximate the value function even in graphs that have symmetrical structure such as the tworoom environment.
Finally, the results show that VGAE and node2vec are good choices of basis functions for approximating the value function in low dimension. Indeed, they lead to good performance in terms of the number of steps to reach the goal state with basis functions of dimension as low as 20 for VGAE and 30 for node2vec. In contrast, the PVFs require a dimension of at least 70 to reach comparable performance on the two-room domain and a dimension of 50 on the obstacles-room domain.
We observed that the sampling strategy used in node2vec has a significant impact on the performance of the learned policy. We use grid search to select the parameters $p$ and $q$ that guide the random walk, and show the performance of node2vec for selected values of $p$ and $q$ in Figure 6. When $p$ is small and $q$ is large, the strategy is biased to encourage walks to backtrack a step and to visit nodes that are close to the current node in the walk. It therefore leads to walks that approximate a breadth-first search, gathering a local view of the underlying graph with respect to the starting node. On the other hand, when $p$ is large and $q$ is small, the walk approximates a depth-first search and leads to more outward exploration. [Grover and Leskovec, 2016] show that the former type of sampling strategy reflects structural equivalences between nodes, whereas the latter captures homophily within the network. Figure 6 suggests that, for approximating value functions, structural equivalence plays a more important role than homophily.
4.3 Additional Results
In order to investigate whether we can expect a similar behaviour in larger environments, we consider a 100 by 50 three-room environment (similar to the two-room environment but with two interior walls, with the upper wall having its opening more to the right and the lower wall having its opening more to the left). We construct the graph from 500 collected samples of length at most 100 and derive the PVFs and the node2vec embeddings. For each of these sets of basis functions, we solve the linear approximation problem in the least-squares sense by minimizing the following loss function with respect to the parameter $\theta$, using the optimal value function $V^*$ computed via value iteration [Montague, 1999]:

$$\mathcal{L}(\theta) = \sum_{s \in S} \Big( V^*(s) - \sum_{i=1}^{d} \phi_i(s)\, \theta_i \Big)^2$$

Figure 7 shows the gain from adopting node2vec feature learning for reinforcement learning in a high-dimensional state space.
4.4 Main Findings and Future Work
We summarize below the main findings of our work.

The smoothness assumption of the value function on an estimated unweighted graph does not necessarily hold.

Using basis functions that automatically learn to embed the geometry of the graph induced by the MDP can lead to improved performance over the proto-value functions.

Such embedding models need to capture the structural equivalence of the nodes while preserving the local properties of the graph.

Under sampling strategies that satisfy the requirements of the previous point, node2vec [Grover and Leskovec, 2016] outperforms the commonly used proto-value functions.

The Variational Graph Auto-Encoder, which is more complex than node2vec and requires more training, leads to only a minor performance improvement over node2vec.
These findings encourage the further study of representation learning on graphs for achieving efficient and accurate policy learning in reinforcement learning. In particular, the question of scalability to large or continuous state spaces arises. Future work includes analyzing to what extent one can efficiently learn good embeddings with limited samples in very large state spaces. Another interesting open question in this direction is whether good representations can be inferred for states that have never been visited.
Naturally, future work should also aim at further improving the quality of the embeddings for solving reinforcement learning problems. A possibility would be to make use of the reward observed during the sample collection phase to build features that are not only based on state transitions, but capture reward information as well.
5 Conclusion
In this work, we have studied the representation policy iteration algorithm with a modified representation learning phase that allows any model to be used for computing the basis functions in the linear value approximation. We investigated several models for learning high-quality node embeddings that preserve the geometry of the graph induced by the Markov decision process, and compared the performance of these representation learning models in the context of value function approximation. Finally, we observed that models designed to capture the global structural geometry of the graph while preserving local properties do well at approximating the value function in low feature space dimensions, significantly outperforming the commonly considered proto-value functions for this task.
References
[Bertsekas and Tsitsiklis, 1996] Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, 1st edition.
[Donnat et al., 2018] Donnat, C., Zitnik, M., Hallac, D., and Leskovec, J. (2018). Learning structural node embeddings via diffusion wavelets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’18, pages 1320–1329, New York, NY, USA. ACM.
[Grover and Leskovec, 2016] Grover, A. and Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 855–864.
[Hamilton et al., 2017] Hamilton, W. L., Ying, R., and Leskovec, J. (2017). Representation learning on graphs: Methods and applications. CoRR, abs/1709.05584.
[Kaelbling et al., 1996] Kaelbling, L. P., Littman, M. L., and Moore, A. W. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285.
[Kipf and Welling, 2016a] Kipf, T. N. and Welling, M. (2016a). Semi-supervised classification with graph convolutional networks. CoRR, abs/1609.02907.
[Kipf and Welling, 2016b] Kipf, T. N. and Welling, M. (2016b). Variational graph auto-encoders. CoRR, abs/1611.07308.
[Konidaris et al., 2011] Konidaris, G., Osentoski, S., and Thomas, P. (2011). Value function approximation in reinforcement learning using the Fourier basis. In Proceedings of the Twenty-Fifth Conference on Artificial Intelligence, pages 380–385.
[Lagoudakis, 2003] Lagoudakis, M. (2003). Least-squares policy iteration. The Journal of Machine Learning Research, 4:1107–1149.
[Mahadevan, 2007] Mahadevan, S. (2007). Learning representation and control in Markov decision processes: New frontiers. Foundations and Trends in Machine Learning, 1(4):403–565.
[Mahadevan and Maggioni, 2006] Mahadevan, S. and Maggioni, M. (2006). Value function approximation with diffusion wavelets and Laplacian eigenfunctions. In Weiss, Y., Schölkopf, B., and Platt, J. C., editors, Advances in Neural Information Processing Systems 18, pages 843–850. MIT Press.
[Mahadevan and Maggioni, 2007] Mahadevan, S. and Maggioni, M. (2007). Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. J. Mach. Learn. Res., 8:2169–2231.
[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.
[Montague, 1999] Montague, P. (1999). Reinforcement Learning: An Introduction, by Sutton, R. S. and Barto, A. G. Trends in Cognitive Sciences, 3(9):360.
[Parr et al., 2008] Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 752–759, New York, NY, USA. ACM.
[Parr et al., 2007] Parr, R., Painter-Wakefield, C., Li, L., and Littman, M. (2007). Analyzing feature generation for value-function approximation. In Proceedings of the 24th International Conference on Machine Learning, ICML ’07, pages 737–744, New York, NY, USA. ACM.
[Petrik et al., 2010] Petrik, M., Taylor, G., Parr, R., and Zilberstein, S. (2010). Feature selection using regularization in approximate linear programs for Markov decision processes. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel, pages 871–878.
[Ribeiro et al., 2017] Ribeiro, L. F., Saverese, P. H., and Figueiredo, D. R. (2017). Struc2vec: Learning node representations from structural identity. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, pages 385–394, New York, NY, USA. ACM.
[Sutton and Barto, 1998] Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction. IEEE Trans. Neural Networks, 9(5):1054–1054.