# Representations for Stable Off-Policy Reinforcement Learning

Reinforcement learning with function approximation can be unstable and even divergent, especially when combined with off-policy learning and Bellman updates. In deep reinforcement learning, these issues have been dealt with empirically by adapting and regularizing the representation, in particular with auxiliary tasks. This suggests that representation learning may provide a means to guarantee stability. In this paper, we formally show that there are indeed nontrivial state representations under which the canonical TD algorithm is stable, even when learning off-policy. We analyze representation learning schemes that are based on the transition matrix of a policy, such as proto-value functions, along three axes: approximation error, stability, and ease of estimation. In the most general case, we show that a Schur basis provides convergence guarantees, but is difficult to estimate from samples. For a fixed reward function, we find that an orthogonal basis of the corresponding Krylov subspace is an even better choice. We conclude by empirically demonstrating that these stable representations can be learned using stochastic gradient descent, opening the door to improved techniques for representation learning with deep networks.


## 1 Introduction

Value function learning algorithms are known to demonstrate divergent behavior under the combination of bootstrapping, function approximation, and off-policy data, what Sutton & Barto (2018) call the “deadly triad” (see also van Hasselt et al., 2018). In reinforcement learning theory, it is well-established that methods such as Q-learning and TD(0) enjoy no general convergence guarantees under linear function approximation and off-policy data (Baird, 1995; Tsitsiklis & Van Roy, 1996). Despite this potential for failure, Q-learning and other temporal-difference algorithms remain the methods of choice for learning value functions in practice due to their simplicity and scalability.

In deep reinforcement learning, instability has been mitigated empirically through the use of auxiliary tasks, which shape and regularize the representation that is learned by the neural network. Methods using auxiliary tasks concurrently optimize the value function loss and an auxiliary representation learning objective such as visual reconstruction of observation

(Jaderberg et al., 2016), latent transition and reward prediction (Gelada et al., 2019), adversarial value functions (Bellemare et al., 2019), or inverse kinematics (Pathak et al., 2017). In robotics, distributional reinforcement learning (Bellemare et al., 2017) in particular has proven a surprisingly effective auxiliary task (Bodnar et al., 2019; Vecerik et al., 2019; Cabi et al., 2019). While the stability of such methods remains an empirical phenomenon, it suggests that a carefully chosen representation learning algorithm may provide a means towards formally guaranteed stability of value function learning.

In this paper, we seek procedures for discovering representations that guarantee the stability of TD(0), a canonical algorithm for estimating the value function of a policy. We analyze the expected dynamics of TD(0), with the aim of characterizing representations under which TD(0) is provably stable. Learning dynamics of temporal-difference methods have been studied in depth in the context of a fixed state representation (Tsitsiklis & Van Roy, 1996; Borkar & Meyn, 2000; Yu & Bertsekas, 2009; Maei et al., 2009; Dalal et al., 2017). We go one step further by considering this representation as a component that can actively be shaped, and study stability guarantees that emerge from various representation learning schemes.

We show that the stability of a state representation is affected by: 1) the space of value functions it can express, and 2) how it parameterizes this space. We find a tight connection between stability and the geometry of the transition matrix, enabling us to provide stability conditions for algorithms that learn features from the transition matrix of a policy (Dayan, 1993; Mahadevan & Maggioni, 2007; Wu et al., 2018; Behzadian et al., 2019) and rewards (Petrik, 2007; Parr et al., 2007). Our analysis reveals that a number of popular representation learning algorithms, including proto-value functions, generally lead to representations that are not stable, despite their appealing approximation characteristics.

As special cases of a more general framework, we study two classes of stable representations. The first class consists of representations that are approximately invariant under the transition dynamics (Parr et al., 2008), while the second consists of representations that remain stable under reparameterization. From this study, we find that stable representations can be obtained from common matrix decompositions and furthermore, as solutions of simple iterative optimization procedures. Empirically, we find that different procedures trade off learnability, stability, and approximation error. In the large data regime, the Schur decomposition and a variant of the Krylov basis (Petrik, 2007) emerge as reliable techniques for obtaining a stable representation.

We conclude by demonstrating that these techniques can be operationalized using stochastic gradient descent on suitable loss functions. We show that the Schur decomposition arises from the task of predicting the expectation of one’s own features at the next time step, whereas a variant of the Krylov basis arises from the task of predicting future expected rewards. This is particularly significant, as both of these auxiliary tasks have in fact been heuristically proposed in prior work

(François-Lavet et al., 2018; Gelada et al., 2019). Our result confirms the validity of these auxiliary tasks, not only for improving approximation error but, more importantly, for taming the famed instabilities of off-policy learning.

## 2 Background

We consider a Markov decision process (MDP) on a finite state space S and finite action space A. The state transition distribution is given by P(·|s,a), the reward function by r:S×A→R, the initial state distribution by ν, and the discount factor by γ∈[0,1). We write n=|S||A|, and treat real-valued functions of state and action as vectors in Rn.

A stochastic policy π induces a Markov chain on state-action pairs with transition matrix Pπ. The value function Qπ for a policy π is the expected return conditioned on the starting state-action pair,

 Qπ(si,ai)=Eπ[∑t≥0γtr(st,at)|s0=si,a0=ai].

The value function also satisfies Bellman’s equation; in vector notation (Puterman, 1994),

 Qπ=r+γPπQπ,

from which we recover the concise Qπ=(I−γPπ)−1r.

### 2.1 Approximate Policy Evaluation

Approximate policy evaluation is the problem of estimating Qπ from a family of value functions, given a distribution ξ over transitions (c.f. Bertsekas, 2011). We refer to ξ as the data distribution, and define a diagonal matrix Ξ with the elements of ξ on the diagonal. If the data distribution is the stationary distribution of Pπ, the data is on-policy, and off-policy otherwise. We equip Rn with the inner product and norm induced by the data distribution: ⟨u,v⟩Ξ=u⊤Ξv and ∥v∥2Ξ=⟨v,v⟩Ξ. Most concepts from Euclidean inner products extend to this setting; see Appendix A for a review.

We consider a two-stage procedure for estimating value functions (Levine et al., 2017; Chung et al., 2019; Bertsekas, 2018). We first learn a representation, a d-dimensional mapping φ:S×A→Rd, through an explicit representation learning step. After a representation is learned, approximate policy evaluation is performed with the family of value functions linear in the representation φ: Qθ(s,a)=φ(s,a)⊤θ, where θ∈Rd is a vector of weights.

The representation φ corresponds to a matrix Φ∈Rn×d whose rows are the vectors φ(s,a) for different state-action pairs (s,a). For clarity of presentation, we assume that Φ has full rank. A representation is orthogonal if Φ⊤ΞΦ=I; these correspond to features which are normalized and uncorrelated. We write Span(Φ) to denote the subspace of value functions expressible using Φ, and Π the orthogonal projection operator onto Span(Φ), with closed form Π=Φ(Φ⊤ΞΦ)−1Φ⊤Ξ.
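For concreteness, the weighted projection can be checked numerically. The following sketch (with a randomly chosen Φ and ξ standing in for a learned representation and data distribution) verifies that Π=Φ(Φ⊤ΞΦ)−1Φ⊤Ξ is idempotent, fixes Span(Φ), and is self-adjoint in the Ξ-weighted inner product:

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 6, 3

# Hypothetical representation matrix and data distribution.
Phi = rng.standard_normal((n, d))
xi = rng.random(n); xi /= xi.sum()
Xi = np.diag(xi)

# Orthogonal projection onto Span(Phi) w.r.t. the <u, v>_Xi inner product.
Pi = Phi @ np.linalg.inv(Phi.T @ Xi @ Phi) @ Phi.T @ Xi

assert np.allclose(Pi @ Pi, Pi)            # idempotent
assert np.allclose(Pi @ Phi, Phi)          # leaves Span(Phi) fixed
assert np.allclose(Xi @ Pi, (Xi @ Pi).T)   # self-adjoint under <.,.>_Xi
```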

### 2.2 Temporal Difference Methods

TD fixed-point methods are a popular class of methods for approximate policy evaluation that attempt to find value functions satisfying Φθ=Π(r+γPπΦθ) (Bradtke & Barto, 1996; Gordon, 1995; Maei et al., 2009; Dann et al., 2014). If this equation has a fixed-point, the solution is unique (Lagoudakis & Parr, 2003) and can be expressed as

 θ∗TD=(Φ⊤Ξ(I−γPπ)Φ)−1Φ⊤Ξr.

We study TD(0), the canonical update rule for discovering this fixed point. With a step size η and transitions (s,a,r,s′,a′) sampled from the data distribution, TD(0) takes the update

 θk+1=θk−η∇θQθk(s,a)(Qθk(s,a)−(r+γQθk(s′,a′))).

In matrix form, this corresponds to an expected update over all state-action pairs:

 θk+1=θk−η(Φ⊤Ξ(I−γPπ)Φθk−Φ⊤Ξr). (1)

With appropriately chosen decay of the step size, the stochastic update will converge if the expected update converges (Benveniste et al., 1990; Tsitsiklis & Van Roy, 1996). However, these updates are not the gradient of any well-defined objective function except in special circumstances (Barnard, 1993; Ollivier, 2018), and hence do not inherit convergence properties from the classical optimization literature. The main aim of this paper is to provide conditions on the representation matrix Φ under which the update is convergent. We are especially interested in schemes that are convergent independent of the data distribution ξ.
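The expected update of equation 1 is easy to simulate. The sketch below (a randomly generated chain as a stand-in, with ξ set to the stationary distribution so that the classical on-policy guarantee applies) iterates the expected update and recovers the TD fixed point; the step size is derived from the eigenvalues of the iteration matrix so that the linear recursion is a contraction:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 6, 3, 0.9

# Random transition matrix, reward, and features (illustrative stand-ins).
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)
Phi = rng.standard_normal((n, d))

# On-policy data distribution: the stationary distribution of P.
evals, evecs = np.linalg.eig(P.T)
xi = np.real(evecs[:, np.argmax(np.real(evals))]); xi /= xi.sum()
Xi = np.diag(xi)

A = Phi.T @ Xi @ (np.eye(n) - gamma * P) @ Phi
b = Phi.T @ Xi @ r
theta_star = np.linalg.solve(A, b)

# Pick eta so every eigenvalue of I - eta*A lies inside the unit circle
# (possible because Re(lambda) > 0 for on-policy data).
eta = 0.9 * min(2 * lam.real / abs(lam) ** 2 for lam in np.linalg.eigvals(A))
theta = np.zeros(d)
for _ in range(200_000):
    theta -= eta * (A @ theta - b)

assert np.max(np.abs(theta - theta_star)) < 1e-6
```

Off-policy, the same recursion can diverge for every positive step size, which is precisely the failure mode analyzed in Section 3.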

We will characterize the stability of TD(0) and a representation through the spectrum of relevant matrices. For a matrix A∈Rn×n, the spectrum is the set of eigenvalues of A, written as Spec(A). The spectral radius ρ(A) denotes the maximum magnitude of its eigenvalues. Stochastic transition matrices satisfy ρ(Pπ)=1. We consider a potentially nonsymmetric matrix A to be positive definite if all non-zero vectors x satisfy x⊤Ax>0.

### 2.3 Representation Learning

In reinforcement learning, a large class of methods has focused on constructing a representation from the transition and reward functions, beginning perhaps with proto-value functions (Mahadevan & Maggioni, 2007). Involving Pπ and r in the representation learning process is natural, since the value function is itself constructed from these two objects. As we shall later see, the stability criteria for these are also simple and coherent. Additionally, there is a large body of literature on the ease (or difficulty) with which these methods can be estimated from samples, and by proxy are amenable to gradient-descent schemes. Here we review the most common of these representation learning methods along with a few natural extensions. Table 1 shows how their construction arises from different matrix operations on Pπ and, in the case of the Krylov basis, on r.

Laplacian Representations: Proto-value functions (Mahadevan & Maggioni, 2007) capture the high-level structure of an environment, using the bottom eigenvectors of the normalized Laplacian of an undirected graph formed from environment transitions. This formalism extends to reversible Markov chains with on-policy data, but does not generalize to directional transitions, stochastic dynamics, and off-policy data. In the general setting, the Laplacian representation (Wu et al., 2018) uses the top eigenvectors of the symmetrized transition matrix (EigSymm) ½(Pπ+Ξ−1Pπ⊤Ξ). We demonstrate in Section 4.3 that when data is off-policy, modifying the representation to omit eigenvectors whose eigenvalues exceed a threshold can provide strong stability guarantees.

Singular Vector Representations: Representations using singular vectors have been well-studied in representation learning for RL, because they are expressive and often yield strong performance guarantees. Fast Feature Selection (Behzadian et al., 2019) uses the top left singular vectors of the transition matrix as features. Similarly, Stachenfeld et al. (2014) and Machado et al. (2018) use the top left singular vectors of the successor representation (Dayan, 1993), a time-based representation which predicts future state visitations: Ψ=(I−γPπ)−1. We discover in Section 3.4 that the SVD objective of minimizing the norm of the approximation error fails to preserve the spectral properties of transition matrices needed for stability, and can induce divergent behavior in TD(0). In contrast, we show that decompositions constrained to preserve the spectrum of the transition matrix, such as the Schur decomposition, guarantee stability and performance.

Reward-Informed Methods: If the reward structure of the problem is known a priori, a representation can focus its capacity on modelling future rewards and how they diffuse through the environment. Towards this goal, Petrik (2007) suggested the Krylov basis generated by Pπ and r as features. Bellman Error Basis Functions (BEBFs) (Parr et al., 2007) iteratively build a representation by adding the Bellman error of the best solution found so far as a new feature. Parr et al. (2008) show that under certain initial conditions for BEBFs, both representations span the Krylov subspace generated by rewards. Although no general guarantees exist for arbitrary rewards, we discover that when rewards are easily predictable, orthogonal representations that span this Krylov subspace have stability guarantees.

## 3 Stability Analysis of Arbitrary Representations

To begin, we study the stability of TD(0) given an arbitrary representation. For conciseness, we call TD(0) the algorithm whose expected update is described by equation 1; this is an algorithm which may or may not be off-policy (according to ξ and Pπ), and learns a linear approximation of the value function using features Φ. The following formalizes our notion of stability.

###### Definition 3.1.

TD(0) is stable if there is a step-size η>0 such that, when taking updates according to equation 1 from any θ0∈Rd, we have θk→θ∗TD.

### 3.1 Learning Dynamics

For a sufficiently small step-size η, the discrete update of equation 1 behaves like the continuous-time dynamical system

 ∂∂t(θt−θ∗TD)=−AΦ(θt−θ∗TD), (2)

whose behaviour is driven by the iteration matrix

 AΦ=Φ⊤Ξ(I−γPπ)Φ.

Put another way, the learned parameters evolve approximately according to the linear dynamical system defined by the iteration matrix AΦ. As might be expected, TD(0) is stable if this linear dynamical system is globally stable in the usual sense (Borkar & Meyn, 2000).

The iteration matrix – and, as we shall see, the global stability of the linear dynamical system – depends on the data distribution, the representation, and, to a lesser extent, on the discount factor. It does not, however, depend on the reward function, which only affects the accuracy of the TD fixed-point solution θ∗TD.

### 3.2 Stability Criteria

To understand the behaviour of TD(0), it is useful to contrast it with gradient descent on a weighted squared loss

 ℓ(θ)=(Φθ−y)⊤Ξ(Φθ−y),

where y is a vector of supervised targets. Gradient descent on ℓ also corresponds to a linear dynamical system, albeit one whose iteration matrix is symmetric and positive definite. The behaviour of TD(0) is complicated by the fact that AΦ is not guaranteed to be positive definite or symmetric, as the matrix Ξ(I−γPπ) itself is in general neither. In fact, the documented good behaviour of TD(0) arises in contexts where AΦ itself is closer to a gradient descent iteration matrix: positive definite when the data distribution is on-policy (Tsitsiklis & Van Roy, 1996), and symmetric when the Markov chain described by Pπ is reversible (Ollivier, 2018).

Following a well-known result from linear system theory (see e.g. Zadeh & Desoer, 2008), the asymptotic behavior of TD(0) more generally depends on the eigenvalues of the iteration matrix.

###### Proposition 3.1.

TD(0) is stable if and only if the eigenvalues of the iteration matrix AΦ have positive real components, that is

 Spec(AΦ)⊂C+:={z:Re(z)>0}.

We say that a particular choice of representation Φ is stable for (Pπ,Ξ,γ) when AΦ satisfies the above condition.

###### Proof.

See Appendix B for all proofs. ∎

Whenever the transition matrix, data distribution, and discount factor are evident, we will refer to Φ simply as a stable representation.
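The criterion can be illustrated on a minimal off-policy counterexample in the spirit of the two-state MDP of Tsitsiklis & Van Roy (1996): a single feature taking values 1 and 2, both states transitioning to the second, and a uniform data distribution. The 1×1 iteration matrix then has a negative eigenvalue, so the expected TD(0) update diverges for every positive step size. A sketch:

```python
import numpy as np

gamma = 0.95
P = np.array([[0.0, 1.0],
              [0.0, 1.0]])      # both states transition to state 2
Phi = np.array([[1.0],
                [2.0]])         # single feature: phi(s1) = 1, phi(s2) = 2
Xi = np.diag([0.5, 0.5])        # uniform data distribution (off-policy:
                                # the stationary distribution is (0, 1))

A = Phi.T @ Xi @ (np.eye(2) - gamma * P) @ Phi
assert A[0, 0] < 0              # Spec(A) not contained in C+: unstable

# The expected update (with zero reward) blows up for any eta > 0.
theta, eta = 1.0, 0.1
for _ in range(1000):
    theta -= eta * A[0, 0] * theta
assert abs(theta) > 1e6
```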

### 3.3 Effect of Subspace Parametrization

When measuring the approximation error that arises from a particular representation Φ, it suffices to consider the subspace spanned by the columns of Φ. It therefore makes no difference whether these columns are orthogonal (corresponding, informally speaking, to uncorrelated features) or not. By contrast, we now show that the stability of the learning process does depend on how the linear subspace spanned by Φ is parametrized.

Recall that Φ is orthogonal if Φ⊤ΞΦ=I. As it turns out, the stability of an orthogonal representation is determined by the induced transition matrix ΠPπΠ, which describes how next-state features affect the TD(0) value estimates.

###### Proposition 3.2.

An orthogonal representation Φ is stable if and only if the real part of the eigenvalues of the induced transition matrix is bounded above, according to

 Spec(ΠPπΠ)⊂{z∈C:Re(z)<1/γ}.

In particular, Φ is stable if ρ(ΠPπΠ)<1/γ.

Although the original transition matrix satisfies this spectral radius condition, since ρ(Pπ)=1<1/γ, the induced transition matrix can have eigenvalues beyond the stable region and lead to learning instability.

More generally, a representation Φ can be decomposed into an orthogonal basis Φ′ and a reparametrization R∈Rd×d, where Φ′ is an orthogonal representation spanning the same space as Φ and R is an invertible matrix with Φ=Φ′R. The eigenvalues of the iteration matrix can be re-expressed as

 Spec(AΦ)=Spec(R⊤AΦ′R)=Spec(RR⊤AΦ′).

Despite spanning the same space, Φ and Φ′ have iteration matrices with different spectra: Spec(AΦ)≠Spec(AΦ′) in general. As a result, the stability of Φ not only depends on the spectrum of AΦ′, but also on how the reparametrization R shifts these eigenvalues. Put another way, Φ may be unstable even if its orthogonal equivalent Φ′ is stable. The classical example of divergence given by Baird (1995) can be attributed to this phenomenon. In this example, the constructed representation expresses the same value functions as a stable tabular representation, but parametrizes the space in a different way and thus induces divergence.

### 3.4 Singular Vector Representations

The singular value decomposition is an appealing approach to representation learning: choosing vectors corresponding to large singular values guarantees, in a certain measure, low approximation error (Stachenfeld et al., 2014; Behzadian et al., 2019). Unfortunately, as we now show, doing so may be inimical to stability.

We denote by ΦSVD and ΦSR the representations whose features are the top d left singular vectors of Pπ and Ψ, respectively. Recall that these vectors arise as part of a solution to a low-rank matrix approximation problem. We write ^Pπ and ^Ψ to denote the corresponding rank-d approximations.

###### Proposition 3.3 (Svd).

The representation ΦSVD is stable if and only if the low-rank approximation satisfies

 ρ(^Pπ)<1/γ.
###### Proposition 3.4 (Successor Representation).

Recall that Ψ=(I−γPπ)−1. The representation ΦSR is stable if and only if the low-rank approximation satisfies

 Spec(^Ψ)⊂C+∪{0}.

Stability of a singular vector representation requires that the low-rank approximation maintain the spectral properties of the original matrix. This implies that such representations are not stable in general – the SVD low-rank approximation is chosen to minimize the norm of the error, and the spectrum of the approximation can deviate arbitrarily from that of the original matrix (Golub & van Loan, 2013). We note that the spectral conditions hold in the limit of almost-perfect approximation, but achieving this level of accuracy in practice may require an impractical number of additional features.
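This failure mode is easy to probe: truncated SVD reconstructions of a stochastic matrix need not satisfy the condition of Proposition 3.3, even though the full matrix has ρ(Pπ)=1. A sketch (random stochastic matrix as a stand-in) checks the spectral radius condition at each rank:

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma = 10, 0.9

P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
assert abs(max(abs(np.linalg.eigvals(P))) - 1.0) < 1e-8   # rho(P) = 1

U, s, Vt = np.linalg.svd(P)
stable_ranks = []
for k in range(1, n + 1):
    P_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation
    if max(abs(np.linalg.eigvals(P_hat))) < 1.0 / gamma:
        stable_ranks.append(k)

# The full-rank "approximation" recovers P itself, so rank n always passes;
# intermediate ranks may or may not, depending on how the spectrum deviates.
assert n in stable_ranks
```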

## 4 Representation Learning with Stability Guarantees

Our analysis of singular vector representations shows that representations optimizing for other measures, such as approximation error, may lose properties of the transition matrix needed for stability. In this section, we study representations that are constrained, either in expressibility or in spectrum, to ensure stability.

### 4.1 Invariant Representations

We first consider representations whose induced transition matrix preserves the eigenvalues of the transition matrix to guarantee stability. These representations are closely linked to invariant subspaces of value functions that are closed under the transition dynamics of the policy.

###### Definition 4.1.

A representation Φ is Pπ-invariant if its corresponding linear subspace is closed under Pπ, that is

 Span(PπΦ)⊆Span(Φ).

Pπ-invariant subspaces are generated by the eigenspaces of Pπ, and so invariant representations provide a natural way to reflect the geometry of the transition matrix. For these representations, we show that any eigenvalue of the induced transition matrix is also an eigenvalue of the transition matrix; this constraint ensures that invariant representations are always stable.

###### Theorem 4.1.

An orthogonal invariant representation Φ satisfies

 Spec(ΠPπΠ)⊆Spec(Pπ)∪{0}

and is therefore stable.

Parr et al. (2008) studied the quality of the TD fixed-point solution on invariant subspaces, and found it to directly correlate with how well the subspace models reward. Our findings on stability emphasize the importance of their result – with invariant representations that can predict reward, good value functions not only exist, but are also reliably discovered by TD(0).

Although estimating the eigenvectors of a nonsymmetric matrix is numerically unstable, finding orthogonal bases for its eigenspaces can be done tractably, for example through the Schur decomposition.

###### Definition 4.2.

Let A∈Cn×n be a complex matrix. A Schur decomposition of A, written A=UTU∗, is one in which T is upper triangular and U is unitary. For any d, Span(u1,…,ud) is an A-invariant subspace.

The Schur decomposition of Pπ provides a sequence of vectors that span invariant subspaces, and can be constructed so that the first d basis vectors span the top d-dimensional eigenspace of Pπ. We define the representation using the first d Schur basis vectors to be the Schur representation.

When the transition matrix is reversible and data is on-policy, the Schur representation coincides with proto-value functions, and consequently also with the successor representation (Machado et al., 2018). Unlike singular vector representations, the Schur representation preserves the spectrum of the transition matrix at every step, and always guarantees stability.

###### Corollary 4.1.1.

The Schur representation is invariant and thus stable.
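A Schur representation can be constructed directly with standard dense linear algebra, e.g. `scipy.linalg.schur`. The sketch below uses the complex Schur form for brevity (the real Schur form would avoid complex-valued features, and scipy's `sort` argument could be used to select the top eigenspace) on a random stochastic matrix standing in for Pπ; with a uniform data distribution, the leading Schur vectors give an exactly invariant, hence stable, representation:

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(3)
n, d, gamma = 8, 3, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)

# Complex Schur form P = Q T Q*: any leading block of Schur vectors
# spans an invariant subspace of P.
T, Q = schur(P, output="complex")
Phi = Q[:, :d]

# Invariance: P Phi = Phi T[:d, :d], so Span(P Phi) lies in Span(Phi).
assert np.linalg.norm(P @ Phi - Phi @ T[:d, :d]) < 1e-10

# With uniform xi, the iteration matrix has eigenvalues (1 - gamma*lam)/n
# for lam in Spec(P); since |lam| <= 1, all real parts are positive.
A = Phi.conj().T @ (np.eye(n) - gamma * P) @ Phi / n
assert np.linalg.eigvals(A).real.min() > 0
```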

A partial Schur basis can be constructed through orthogonal iteration, a generalized variant of power iteration.

###### Proposition 4.1 (Golub & van Loan (2013)).

Let λ1,…,λn be the ordered eigenvalues of Pπ. If |λd|>|λd+1| and the initial basis Φ0 is not orthogonal to the top eigenspace, the sequence generated via orthogonal iteration is

 Φk=Orthog(Span(PπΦk−1)),

where Orthog finds an orthogonal basis. As k→∞, Span(Φk) converges to the unique top d-dimensional eigenspace of Pπ.

In Section 5, we will see that the orthogonal iteration scheme can be approximated using a loss function and a target network (Mnih et al., 2015), and subsequently minimized with stochastic gradient descent, making it a potentially important tool for learning stable representations in practice.

### 4.2 Approximately Invariant Representations

In the previous section, we studied invariant representations, which are constrained to exactly preserve the eigenvalues of the transition matrix. We now discuss a relaxation to approximate invariance, under which the spectrum of the induced matrix deviates from that of the transition matrix by a controlled amount while still preserving stability. We find that approximate invariance has interesting implications for representations that span a Krylov subspace generated by rewards (Petrik, 2007; Parr et al., 2007).

###### Definition 4.3.

A representation Φ is ϵ-invariant if

 maxv∈Span(Φ) ∥ΠPπv−Pπv∥Ξ/∥v∥Ξ ≤ ϵ.

An approximately invariant representation spans a space in which the transition dynamics are not fully closed, but approximately so, as measured by the Ξ-norm. We provide a simple condition for when an ϵ-invariant representation is stable, under the assumption that the transition matrix is diagonalizable. If Pπ is diagonalizable with eigenbasis A, the distance between the eigenvalues of the induced transition matrix and those of the original transition matrix can be bounded by a function of a) ϵ, the degree of approximate invariance, and b) the condition number κΞ(A) of the eigenbasis (Trefethen & Embree, 2005).

###### Theorem 4.2.

Let Φ be an orthogonal and ϵ-invariant representation for Pπ. If Pπ is diagonalizable with eigenbasis A, then Φ is stable if

 ϵ < ((1−γ)/γ)⋅(1/κΞ(A)).

This bound is quite stringent, especially for discount factors close to one and ill-conditioned eigenvector bases, but may be improved if the transition matrix has a special structure. For the general setting when the transition matrix is not diagonalizable, similar but more complicated bounds exist (Shi & Wei, 2012).

Approximately invariant representations are of particular interest when studying the Krylov subspace generated by rewards,

 Kd(Pπ,r)=Span{r,Pπr,…,(Pπ)d−1r}.

Representations that span this space admit a simple form of approximate invariance.

###### Proposition 4.2.

A representation spanning Kd(Pπ,r) is ϵ-invariant if

 ∥ΠPπv−Pπv∥Ξ/∥v∥Ξ ≤ ϵ,

where v=(Pπ)d−1r, and Π is the projection onto the d-dimensional Krylov subspace Kd(Pπ,r).

Orthogonal representations of this Krylov subspace are approximately invariant if they can predict the reward at the d-th timestep well from the rewards attained in the first d−1 timesteps. For rewards that diffuse through the environment rapidly and can be predicted easily, an orthogonal basis of the Krylov space generated by rewards is approximately invariant and thus stable. Challenging environments with sparse rewards and temporal separation, however, may require a prohibitively large Krylov space to guarantee stability. Note that there is an important distinction between orthogonal representations spanning a Krylov subspace and the Krylov basis itself: for most practical applications, rewards are highly correlated, and because of the challenges of parametrization, the latter can be unstable.
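An orthogonal basis for the Krylov subspace can be obtained from a QR factorization of the matrix [r, Pπr, …, (Pπ)d−1r]. The sketch below (random P and r as stand-ins) builds such a basis and checks the structural fact behind Proposition 4.2: every Krylov direction except the last maps back into the subspace exactly, so the invariance defect is carried entirely by the single vector (Pπ)d−1r:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 8, 4
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
r = rng.random(n)

# Krylov matrix [r, P r, ..., P^{d-1} r] and an orthogonal basis of its span.
K = np.column_stack([np.linalg.matrix_power(P, i) @ r for i in range(d)])
Phi, _ = np.linalg.qr(K)
Pi = Phi @ Phi.T                      # projection onto Span(K)

# P maps the first d-1 Krylov vectors back into the subspace exactly...
assert np.linalg.norm((np.eye(n) - Pi) @ P @ K[:, :-1]) < 1e-6

# ...so any epsilon-invariance defect comes from v = P^{d-1} r alone.
v = K[:, -1]
eps = np.linalg.norm((np.eye(n) - Pi) @ P @ v) / np.linalg.norm(v)
assert np.isfinite(eps)
```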

### 4.3 Positive-Definite Representations

Invariant representations are stable because the spectrum of the projected transitions is constrained to closely mimic the eigenvalues of the transition matrix. What we call positive definite representations instead guarantee stability by constraining the set of expressible value functions to lie within a safe set. Positive definite representations are stable regardless of parametrization, unlike any family of representations discussed so far.

###### Definition 4.4.

The set of positive-definite value functions is

 SPD={ v∈Rn  |  ⟨v,Pπv⟩Ξ < γ−1∥v∥2Ξ }.

Note that SPD is not necessarily closed under addition. The two-state MDP presented by Tsitsiklis & Van Roy (1996) where TD(0) diverges can be interpreted through the lens of this set. For this example, the state representation only expresses value functions outside of SPD, which “grow” too quickly under the transition dynamics and consequently lead to divergence. We focus on representations whose span falls within this set of safe value functions.

###### Definition 4.5.

We say that a representation Φ is positive-definite if

 Span(Φ)⊆SPD.

Note that a positive-definite representation remains so under reparametrization, unlike in the general case. In the special case of on-policy data, ⟨v,Pπv⟩Ξ≤∥v∥2Ξ for all v, and all representations are positive-definite (Tsitsiklis & Van Roy, 1996).

###### Theorem 4.3.

A positive-definite representation Φ has a positive-definite iteration matrix AΦ, and is thus stable.

The Laplacian representation, which computes the spectral eigendecomposition of the symmetrized transition matrix

 K:=(1/2)(Pπ+Ξ−1Pπ⊤Ξ)=UΛU⊤Ξ,

provides an interesting bifurcation of value functions into those that are positive-definite and those that are not. As a consequence of Theorem 4.3, a stable representation is obtained by using the eigenvectors corresponding to eigenvalues smaller than γ−1.

###### Proposition 4.3.

Let λ1≥⋯≥λn be the eigenvalues of K, in decreasing order, and u1,…,un the corresponding eigenvectors. Define d∗ as the smallest integer such that λd∗<γ−1. For any i≥0, the safe Laplacian representation, defined as

 Φ=[ud∗,ud∗+1,…,ud∗+i],

is positive-definite and stable.

While including eigenvectors for larger eigenvalues does not guarantee divergence, the basis [u1,…,ud∗−1] is unstable (see appendix). When the data is on-policy, all eigenvalues of K are below the threshold γ−1, and the safe Laplacian corresponds exactly to the original representation.
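The construction can be carried out with standard dense linear algebra. The sketch below (random off-policy Pπ and ξ as stand-ins) diagonalizes the symmetrized matrix through its symmetric conjugate Ξ^{1/2}KΞ^{−1/2}, keeps the Ξ-orthogonal eigenvectors with eigenvalues below γ−1, and confirms that the resulting iteration matrix is positive definite:

```python
import numpy as np

rng = np.random.default_rng(6)
n, gamma = 8, 0.9
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
xi = rng.random(n); xi /= xi.sum()            # off-policy data distribution
Xi, Xi_h = np.diag(xi), np.diag(np.sqrt(xi))

# K = (P + Xi^{-1} P^T Xi)/2 is self-adjoint under <.,.>_Xi, so we can
# diagonalize the ordinary symmetric matrix Xi^{1/2} K Xi^{-1/2} instead.
M = Xi_h @ P @ np.linalg.inv(Xi_h)
lam, W = np.linalg.eigh(0.5 * (M + M.T))
U = np.linalg.inv(Xi_h) @ W                   # Xi-orthogonal eigenvectors of K

# Safe Laplacian: drop eigenvectors whose eigenvalue reaches 1/gamma.
Phi = U[:, lam < 1.0 / gamma]
assert Phi.shape[1] > 0

A = Phi.T @ Xi @ (np.eye(n) - gamma * P) @ Phi
assert np.linalg.eigvalsh(0.5 * (A + A.T)).min() > 0   # positive definite
```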

We finish our discussion with a cautionary point. Although positive-definite representations admit amenable optimization properties, such as invariance to reparametrization and monotonic convergence, they can only express value functions that satisfy a growth condition. Under on-policy sampling this growth condition is nonrestrictive, but as the policy deviates from the data distribution, the expressiveness of positive-definite representations reduces greatly.

## 5 Experiments

We complement our theoretical results with an experimental evaluation, focusing on the following questions:

• How closely do the theoretical conditions we describe match stability requirements in practice?

• Can stable representations be learned using samples?

• Can they be learned using neural networks?

We conduct our study in the four-room domain (Sutton et al., 1999). We augment this domain with a task where the agent must reach the top right corner to receive a reward of +1 (Figure 1). The policy evaluation problem is to accurately estimate the value function of a near-optimal policy from data consisting of trajectories sampled by a uniform policy.

We are interested in the usefulness of the representation learning schemes summarized in Table 1 as a function of the number of features d that are used. We measure both the stability of the learned representation and its accuracy in estimating the greedy policy with respect to the fixed value function. We chose the latter measure as it is more informative than value approximation error when the number of features is small. See Appendix C for full details about the experimental setup.

Exact Representations: We first consider the quality of the representations in exact form, assuming access to the true transition matrix and reward function (Figure 2). We find that the general empirical profiles for stability match our theoretical characterizations. Singular vectors of the successor representation have low error but are unstable for most small numbers of features. Although the Krylov basis of rewards and its orthogonalization both have the same estimation errors, they have drastically different stability profiles, confirming our analysis from Section 4.2. Amongst the proposed methods that consistently produce stable representations, the Schur basis admits low error and, with enough features, is fully expressive. In contrast, the safe Laplacian representation takes an irrecoverable performance hit, as it discards the top eigenvectors of the symmetrized transition matrix, which contain reward-relevant information.

Estimation with Samples: In practice, representations must be learned from finite data. To test the numerical robustness of the representation learning schemes, we construct an empirical transition matrix from a variable number of samples and learn a representation using this matrix.

We measure the difference between the subspaces spanned by the estimated and true representation (Figure 3). We find that estimating the Schur representation can be more challenging than the other methods, and requires an order of magnitude more data to compute accurately than the singular-vector and spectral representations. This is a well-known problem in numerical linear algebra, as eigenspaces of nonsymmetric matrices (Schur) are more sensitive to perturbation and estimation error than eigenspaces of symmetric matrices (Spectral, Svd). This implies a three-way tradeoff between stability, approximation error, and ease of estimation when choosing a representation for a general environment. The successor representation is unstable, the safe Laplacian is limited in its approximation power, and the Schur decomposition is harder to learn from samples. The orthogonal Krylov basis emerges as a strong method by these measures, but requires additional knowledge in the form of the reward function.

Estimation with Neural Networks: In our final set of experiments, we show that the Schur representation and the orthogonal Krylov representation can be learned by neural networks by performing stochastic gradient descent on certain auxiliary objectives.

It has been noted previously that training a representation network with a final linear layer to predict features causes the neural network to learn a basis for the target features (Bellemare et al., 2019). A Krylov representation can then be learned by predicting the rewards received over the next several time steps, with one prediction per feature. Similarly, orthogonal iteration for learning the Schur representation (Proposition 3.2) can be approximated with a two-timescale algorithm that (a) at each step, predicts the feature values of a fixed target representation network at the next time step, and (b) infrequently refreshes the target representation network with the current one. As our stability guarantees hold for orthogonal representations, the neural network must learn uncorrelated features, which can be enforced explicitly or with a penalty-based orthogonality loss (Wu et al., 2018). We fully describe the auxiliary objectives and provide implementation details in Appendix C.
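In the tabular limit, this two-timescale procedure reduces to subspace (orthogonal) iteration: fitting the network to the target features plays the role of multiplying by the transition matrix, and the orthogonality constraint plays the role of the QR step. A sketch of that idealized loop, with exact least-squares fits standing in for gradient steps (the reversible random-walk chain below is our own toy construction, chosen so the spectrum is real):

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 10, 3
# Reversible chain (real spectrum): random symmetric weights, row-normalized.
W = rng.random((n, n))
W = (W + W.T) / 2
P = W / W.sum(axis=1, keepdims=True)

Phi, _ = np.linalg.qr(rng.standard_normal((n, d)))
for _ in range(2000):              # each pass = one "target refresh"
    target = P @ Phi               # (a) predict next-step target features
    Phi, _ = np.linalg.qr(target)  # (b) refresh target, enforce orthogonality

# Span(Phi) is now approximately P-invariant.
residual = np.linalg.norm(P @ Phi - Phi @ (Phi.T @ P @ Phi))
print(residual)
```

In the neural-network version, step (a) becomes stochastic gradient steps on the predictive loss and step (b) becomes the periodic target-network swap plus renormalization.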

Figure 4 demonstrates that these predictive losses can be optimized easily with neural networks and can learn stable, approximately invariant representations. We note that this auxiliary task of predicting future latent states has been heuristically proposed before (François-Lavet et al., 2018; Gelada et al., 2019) as a way to reduce approximation error. Our results indicate that such auxiliary tasks may not only help reduce approximation error, but more importantly, can mitigate divergence in the learning process and provide for stable optimization.

## 6 Conclusion

We have presented an analysis of stability guarantees for value-function learning under various representation learning procedures. Our analysis provides conditions for stability of many algorithms that learn features from transitions, and demonstrates how representation learning procedures constrained to respect the geometry of the transition matrix can induce stability. We demonstrated that the Schur decomposition and orthogonal Krylov bases are rich representations that mitigate divergence in off-policy value function learning, and further showed that they can be learned using stochastic gradient descent on a loss function.

Our work provides formal evidence that representation learning can prevent divergence without sacrificing approximation quality. To carry our results to the full practical case, stability should be extended to the sequence of policies that are encountered during policy iteration. One should also consider the effects of learning value functions and representations concurrently, and the ensuing interactions in the representation. Our work suggests that studying stable representations in these contexts can be a promising avenue forward for the development of principled auxiliary tasks for stable deep reinforcement learning.

## Acknowledgements

We would like to thank Nicolas Le Roux, Marlos C. Machado, Courtney Paquette, Fabian Pedregosa, Doina Precup, and Ahmed Touati for helpful discussions and contributions. We additionally thank Marlos C. Machado and Courtney Paquette for constructive feedback on an earlier manuscript.

## References

• Baird (1995) Baird, L. C. Residual algorithms: Reinforcement learning with function approximation. In ICML, 1995.
• Barnard (1993) Barnard, E. Temporal-difference methods and Markov models. IEEE Transactions on Systems, Man, and Cybernetics, 1993.
• Behzadian et al. (2019) Behzadian, B., Gharatappeh, S., and Petrik, M. Fast feature selection for linear value function approximation. In ICAPS, 2019.
• Bellemare et al. (2017) Bellemare, M. G., Dabney, W., and Munos, R. A distributional perspective on reinforcement learning. In ICML, 2017.
• Bellemare et al. (2019) Bellemare, M. G., Dabney, W., Dadashi, R., Taïga, A. A., Castro, P. S., Roux, N. L., Schuurmans, D., Lattimore, T., and Lyle, C. A geometric perspective on optimal representations for reinforcement learning. In Advances in Neural Information Processing Systems, 2019.
• Benveniste et al. (1990) Benveniste, A., Priouret, P., and Métivier, M. Adaptive Algorithms and Stochastic Approximations. Springer-Verlag, Berlin, Heidelberg, 1990. ISBN 0387528946.
• Bertsekas (2011) Bertsekas, D. P. Approximate policy iteration: a survey and some new methods. Journal of Control Theory and Applications, 9:310–335, 2011.
• Bertsekas (2018) Bertsekas, D. P. Feature-based aggregation and deep reinforcement learning: a survey and some new implementations. IEEE/CAA Journal of Automatica Sinica, 6:1–31, 2018.
• Bodnar et al. (2019) Bodnar, C., Li, A., Hausman, K., Pastor, P., and Kalakrishnan, M. Quantile QT-opt for risk-aware vision-based robotic grasping. arXiv, 2019.
• Borkar & Meyn (2000) Borkar, V. S. and Meyn, S. P. The o.d.e. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control and Optimization, 38:447–469, 2000.
• Bradtke & Barto (1996) Bradtke, S. J. and Barto, A. G. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22:33–57, 1996.
• Cabi et al. (2019) Cabi, S., Colmenarejo, S. G., Novikov, A., Konyushkova, K., Reed, S., Jeong, R., Zolna, K., Aytar, Y., Budden, D., Vecerik, M., Sushkov, O., Barker, D., Scholz, J., Denil, M., de Freitas, N., and Wang, Z. A framework for data-driven robotics. arXiv, 2019.
• Chung et al. (2019) Chung, W., Nath, S., Joseph, A. G., and White, M. Two-timescale networks for nonlinear value function approximation. In International Conference on Learning Representations, 2019.
• Dalal et al. (2017) Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. Finite sample analyses for td(0) with function approximation. In AAAI, 2017.
• Dann et al. (2014) Dann, C., Neumann, G., and Peters, J. Policy evaluation with temporal differences: a survey and comparison. J. Mach. Learn. Res., 15:809–883, 2014.
• Dayan (1993) Dayan, P. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5:613–624, 1993.
• François-Lavet et al. (2018) François-Lavet, V., Bengio, Y., Precup, D., and Pineau, J. Combined reinforcement learning via abstract representations. arXiv, 2018.
• Gelada et al. (2019) Gelada, C., Kumar, S., Buckman, J., Nachum, O., and Bellemare, M. G. DeepMDP: Learning continuous latent space models for representation learning. In Proceedings of the International Conference on Machine Learning, 2019.
• Golub & van Loan (2013) Golub, G. H. and van Loan, C. F. Matrix Computations. JHU Press, fourth edition, 2013. ISBN 1421407949 9781421407944.
• Gordon (1995) Gordon, G. J. Stable function approximation in dynamic programming. In ICML, 1995.
• Jaderberg et al. (2016) Jaderberg, M., Mnih, V., Czarnecki, W., Schaul, T., Leibo, J. Z., Silver, D., and Kavukcuoglu, K. Reinforcement learning with unsupervised auxiliary tasks. ArXiv, abs/1611.05397, 2016.
• Lagoudakis & Parr (2003) Lagoudakis, M. G. and Parr, R. Least-squares policy iteration. J. Mach. Learn. Res., 4:1107–1149, 2003.
• Levine et al. (2017) Levine, N., Zahavy, T., Mankowitz, D., Tamar, A., and Mannor, S. Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems, 2017.
• Machado et al. (2018) Machado, M. C., Rosenbaum, C., Guo, X., Liu, M., Tesauro, G., and Campbell, M. Eigenoption discovery through the deep successor representation. In International Conference on Learning Representations, 2018.
• Maei et al. (2009) Maei, H. R., Szepesvari, C., Bhatnagar, S., Precup, D., Silver, D., and Sutton, R. S. Convergent temporal-difference learning with arbitrary smooth function approximation. In NIPS, 2009.
• Mahadevan & Maggioni (2007) Mahadevan, S. and Maggioni, M. Proto-value functions: A laplacian framework for learning representation and control in markov decision processes. J. Mach. Learn. Res., 8:2169–2231, 2007.
• Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
• Ollivier (2018) Ollivier, Y. Approximate temporal difference learning is a gradient descent for reversible policies. ArXiv, abs/1805.00869, 2018.
• Parr et al. (2007) Parr, R., Painter-Wakefield, C., Li, L., and Littman, M. L. Analyzing feature generation for value-function approximation. In ICML ’07, 2007.
• Parr et al. (2008) Parr, R., Li, L., Taylor, G., Painter-Wakefield, C., and Littman, M. L. An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In ICML ’08, 2008.
• Pathak et al. (2017) Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 488–489, 2017.
• Petrik (2007) Petrik, M. An analysis of laplacian methods for value function approximation in mdps. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI’07, pp. 2574–2579, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
• Puterman (1994) Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994. ISBN 0471619779.
• Shi & Wei (2012) Shi, X. and Wei, Y. A sharp version of bauer–fike’s theorem. Journal of Computational and Applied Mathematics, 236(13):3218 – 3227, 2012. ISSN 0377-0427.
• Stachenfeld et al. (2014) Stachenfeld, K. L., Botvinick, M., and Gershman, S. J. Design principles of the hippocampal cognitive map. In Advances in Neural Information Processing Systems, 2014.
• Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2nd edition, 2018.
• Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. P. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artif. Intell., 112:181–211, 1999.
• Trefethen & Embree (2005) Trefethen, L. N. and Embree, M. Spectra and Pseudospectra: The Behavior of Nonnormal Matrices and Operators. Princeton University Press, 2005.
• Tsitsiklis & Roy (1996) Tsitsiklis, J. N. and Roy, B. V. Analysis of temporal-difference learning with function approximation. In NIPS, 1996.
• van Hasselt et al. (2018) van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. Deep reinforcement learning and the deadly triad. ArXiv, abs/1812.02648, 2018.
• Vecerik et al. (2019) Vecerik, M., Sushkov, O., Barker, D., Rothörl, T., Hester, T., and Scholz, J. A practical approach to insertion with variable socket position using deep reinforcement learning. 2019.
• Wu et al. (2018) Wu, Y., Tucker, G., and Nachum, O. The laplacian in rl: Learning representations with efficient approximations. ArXiv, abs/1810.04586, 2018.
• Yu & Bertsekas (2009) Yu, H. and Bertsekas, D. P. Basis function adaptation methods for cost approximation in mdp. In Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.
• Zadeh & Desoer (2008) Zadeh, L. and Desoer, C. Linear system theory: the state space approach. Courier Dover Publications, 2008.

## Appendix A Linear Algebra and Spectral Theory

### A.1 Inner Products

A positive-definite symmetric matrix $\Xi$ induces an inner product and norm on $\mathbb{R}^n$. Specifically, the inner product is written as $\langle x, y \rangle_\Xi = x^\top \Xi y$, and the corresponding norm is $\|x\|_\Xi = \sqrt{\langle x, x \rangle_\Xi}$. This corresponds to a Hilbert space $(\mathbb{R}^n, \langle \cdot, \cdot \rangle_\Xi)$. In our work, we equip $\mathbb{R}^n$ (where $n$ is the number of state-action pairs) with the inner product induced by the data distribution $\Xi$. We also equip $\mathbb{R}^d$ (the parameter space) with the usual Euclidean inner product.

Most definitions and constructions involving the Euclidean inner product generalize to arbitrary Hilbert spaces; we describe some of them on $(\mathbb{R}^n, \langle \cdot, \cdot \rangle_\Xi)$. Two vectors $x, y$ are orthogonal if $\langle x, y \rangle_\Xi = 0$. A matrix $\Phi$ is orthogonal if its columns have unit norm and are orthogonal to one another: $\Phi^\top \Xi \Phi = I$. Transposes and symmetric matrices generalize through the adjoint of a matrix $A$, written as $A^* = \Xi^{-1} A^\top \Xi$. A matrix is self-adjoint if $A = A^*$, and for matrices that are not self-adjoint, the symmetric component is given by $\tfrac{1}{2}(A + A^*)$. We write $\|A\|_\Xi$ for the matrix norm induced by the corresponding vector norm.
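For concreteness, the defining adjoint identity $\langle Av, w \rangle_\Xi = \langle v, A^* w \rangle_\Xi$ with $A^* = \Xi^{-1} A^\top \Xi$ can be verified numerically. A small sketch with randomly drawn $\Xi$ and $A$ (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
xi = rng.dirichlet(np.ones(n))          # positive weights on the diagonal
Xi = np.diag(xi)

def inner(u, v):                        # <u, v>_Xi = u^T Xi v
    return u @ Xi @ v

A = rng.standard_normal((n, n))
A_star = np.linalg.inv(Xi) @ A.T @ Xi   # adjoint w.r.t. the Xi-inner product
v, w = rng.standard_normal(n), rng.standard_normal(n)
print(np.isclose(inner(A @ v, w), inner(v, A_star @ w)))
```

The symmetric component $\tfrac{1}{2}(A + A^*)$ is then self-adjoint by construction, which is what the safe Laplacian analysis relies on.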

Matrix decompositions for a matrix $A$ can be revisited with respect to this inner product.

• Spectral Decomposition: If $A$ is self-adjoint, it admits a decomposition $A = U \Lambda U^{-1}$, where $U$ is an orthogonal matrix whose columns are eigenvectors of $A$, and $\Lambda$ a diagonal matrix with the corresponding eigenvalues.

• SVD: $A$ admits a decomposition $A = U \Sigma V^\top \Xi$, where $U$ is an orthogonal matrix whose columns are the left singular vectors of $A$, $V$ is an orthogonal matrix whose columns are the right singular vectors of $A$, and $\Sigma$ a diagonal matrix with the corresponding singular values. Letting $U_1, V_1$ correspond to the first $k$ singular vectors and $\Sigma_1$ the diagonal matrix with the corresponding singular values, the low-rank approximation $U_1 \Sigma_1 V_1^\top \Xi$ minimizes $\|A - B\|_\Xi$ amongst all rank-$k$ matrices $B$.

### A.2 Eigenvalues

We define the eigenvalues of $A$ to be the roots of the characteristic polynomial $\det(A - \lambda I)$. Some eigenvalues may correspond to a multiple root; we refer to this multiplicity as the algebraic multiplicity. Every eigenvalue $\lambda$ corresponds to an eigenspace $E_\lambda$ of eigenvectors with this eigenvalue. If the algebraic multiplicity of any eigenvalue does not equal the dimensionality of $E_\lambda$, then $A$ is said to be defective. Otherwise, the matrix is diagonalizable as $A = U \Lambda U^{-1}$, where $U$ is a basis of eigenvectors of $A$, and $\Lambda$ the diagonal matrix of corresponding eigenvalues.

We write $\mathrm{Spec}(A)$ to denote the set of eigenvalues (the spectrum) of the matrix $A$. The spectral radius of a matrix is the maximum magnitude of its eigenvalues, written as $\rho(A)$. For two matrices $A, B$, we have the following cyclicity: $\mathrm{Spec}(AB) \cup \{0\} = \mathrm{Spec}(BA) \cup \{0\}$. As a consequence, we also have that $\rho(AB) = \rho(BA)$. We utilize this cyclicity heavily in the ensuing proofs.
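The cyclicity property is easy to check numerically: the nonzero eigenvalues of $AB$ and $BA$ coincide, with the larger product carrying extra zeros. A quick sketch (random rectangular factors of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))
B = rng.standard_normal((3, 5))

ev_AB = np.linalg.eigvals(A @ B)    # 5 eigenvalues (two are ~0)
ev_BA = np.linalg.eigvals(B @ A)    # 3 eigenvalues

nonzero = ev_AB[np.abs(ev_AB) > 1e-10]
# Every eigenvalue of BA appears among the nonzero eigenvalues of AB.
print(all(min(abs(nonzero - z)) < 1e-8 for z in ev_BA))
# In particular, the spectral radii agree.
print(np.isclose(max(abs(ev_AB)), max(abs(ev_BA))))
```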

The perturbation of the eigenvalues of a diagonalizable matrix can be bounded simply via the Bauer-Fike theorem. Specifically, if $A$ is diagonalizable as $A = V \Lambda V^{-1}$, then every eigenvalue $\mu$ of the perturbed matrix $A + E$ can be bounded in distance from the original eigenvalues as $\min_{\lambda \in \mathrm{Spec}(A)} |\mu - \lambda| \le \kappa(V) \|E\|$, where $\kappa(V) = \|V\| \|V^{-1}\|$ is the condition number of the eigenbasis. As a simple corollary of the Bauer-Fike theorem, we have that $\rho(A + E) \le \rho(A) + \kappa(V) \|E\|$.
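The Bauer-Fike bound can likewise be illustrated numerically (a sketch on a random instance of our own; `kappa` is the condition number of the eigenvector basis):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))
lam, V = np.linalg.eig(A)                  # A = V diag(lam) V^{-1}
kappa = np.linalg.cond(V)                  # condition number of the eigenbasis

E = 1e-3 * rng.standard_normal((n, n))     # small perturbation
mu = np.linalg.eigvals(A + E)

# Every perturbed eigenvalue is within kappa * ||E||_2 of an original one.
worst = max(min(abs(m - lam)) for m in mu)
print(worst <= kappa * np.linalg.norm(E, 2))
```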

## Appendix B Proofs

See Proposition 3.1.

###### Proof of Proposition 3.1.

We review the update taken by TD(0) (Equation 1), rewritten to express the connection to the implied iteration matrix $A_\Phi = \Phi^\top \Xi (I - \gamma P^\pi) \Phi$. Notice that $A_\Phi \theta^*_{TD} = \Phi^\top \Xi r$.

$$\begin{aligned}
\theta_{k+1} - \theta^*_{TD} &= \theta_k - \eta\left(\Phi^\top \Xi (I - \gamma P^\pi) \Phi\, \theta_k - \Phi^\top \Xi r\right) - \theta^*_{TD} \\
&= \theta_k - \theta^*_{TD} - \eta\left(A_\Phi \theta_k - A_\Phi \theta^*_{TD}\right) \\
&= (I - \eta A_\Phi)(\theta_k - \theta^*_{TD})
\end{aligned}$$

Unrolling the iteration, the error to the optimal solution takes the form $\theta_k - \theta^*_{TD} = (I - \eta A_\Phi)^k (\theta_0 - \theta^*_{TD})$.

This iteration converges from any initialization if and only if the spectral radius is bounded by one: $\rho(I - \eta A_\Phi) < 1$.

From here, we can easily show that TD(0) is stable if and only if $\mathrm{Spec}(A_\Phi) \subset \mathbb{C}_+$. If there is some step-size $\eta$ for which $\rho(I - \eta A_\Phi) < 1$, then every $\lambda \in \mathrm{Spec}(A_\Phi)$ satisfies $|1 - \eta \lambda| < 1$, which forces $\mathrm{Re}(\lambda) > 0$, so $\mathrm{Spec}(A_\Phi) \subset \mathbb{C}_+$. Similarly, if $\mathrm{Spec}(A_\Phi) \subset \mathbb{C}_+$, then letting $\eta < \min_{\lambda \in \mathrm{Spec}(A_\Phi)} 2\,\mathrm{Re}(\lambda)/|\lambda|^2$ satisfies $\rho(I - \eta A_\Phi) < 1$. ∎
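Both directions of this argument are easy to check numerically: when every eigenvalue of $A_\Phi$ has positive real part, any step-size below $\min_\lambda 2\,\mathrm{Re}(\lambda)/|\lambda|^2$ makes the iteration a spectral-radius contraction. A small sketch (the random stochastic matrix and tabular features are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, gamma = 12, 0.9
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)                # row-stochastic
Xi = np.diag(rng.dirichlet(np.ones(n)))
Phi = np.eye(n)                                  # tabular features (stable)

A = Phi.T @ Xi @ (np.eye(n) - gamma * P) @ Phi
ev = np.linalg.eigvals(A)
assert np.all(ev.real > 0)                       # Spec(A_Phi) in C_+

eta = 0.9 * min(2 * ev.real / np.abs(ev) ** 2)   # admissible step-size
rho = max(abs(np.linalg.eigvals(np.eye(n) - eta * A)))
print(rho < 1)                                   # iteration contracts
```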

See Proposition 3.2.

###### Proof of Proposition 3.2.

For an orthogonal representation, the iteration matrix can be written as $A_\Phi = \Phi^\top \Xi (I - \gamma P^\pi) \Phi = I - \gamma\, \Phi^\top \Xi P^\pi \Phi$. Then,

$$\begin{aligned}
\mathrm{Spec}(A_\Phi) \subset \mathbb{C}_+
&\iff \mathrm{Spec}(\Phi^\top \Xi P^\pi \Phi) \subset \{z \in \mathbb{C} : \mathrm{Re}(z) < \tfrac{1}{\gamma}\} \\
&\iff \mathrm{Spec}(\Pi P^\pi) \subset \{z \in \mathbb{C} : \mathrm{Re}(z) < \tfrac{1}{\gamma}\} \\
&\iff \mathrm{Spec}(\Pi P^\pi \Pi) \subset \{z \in \mathbb{C} : \mathrm{Re}(z) < \tfrac{1}{\gamma}\}
\end{aligned}$$

The second step follows from the cyclicity of the spectrum and the observation that for an orthogonal representation $\Phi$, the projection can be written as $\Pi = \Phi \Phi^\top \Xi$. The spectral radius condition is immediate. ∎
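The identity $\Pi = \Phi \Phi^\top \Xi$ used in this step can be sanity-checked: for a $\Xi$-orthogonal $\Phi$ it is idempotent and $\Xi$-self-adjoint, i.e. the orthogonal projection onto $\mathrm{Span}(\Phi)$. A sketch (the $\Xi^{1/2}$-QR construction of a $\Xi$-orthogonal basis is our own device):

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 8, 3
xi = rng.dirichlet(np.ones(n))
Xi = np.diag(xi)

# Build a Xi-orthogonal Phi: QR in the rescaled space, then map back.
Q, _ = np.linalg.qr(np.sqrt(Xi) @ rng.standard_normal((n, d)))
Phi = np.diag(1 / np.sqrt(xi)) @ Q

Pi = Phi @ Phi.T @ Xi                             # projection onto Span(Phi)
print(np.allclose(Phi.T @ Xi @ Phi, np.eye(d)))   # Xi-orthogonality
print(np.allclose(Pi @ Pi, Pi))                   # idempotent
```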

See Proposition 3.3.

###### Proof of Proposition 3.3.

We can write the SVD factorization of the transition matrix as

$$P^\pi = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \begin{bmatrix} V_1^\top \\ V_2^\top \end{bmatrix} \Xi$$

Then, for $\Phi = U_1$, we have $\Pi P^\pi = U_1 U_1^\top \Xi P^\pi = U_1 \Sigma_1 V_1^\top \Xi$. The necessary and sufficient conditions follow from Proposition 3.2. ∎

See Proposition 3.4.

###### Proof of Proposition 3.4.

We can write the SVD factorization of the successor representation $\Psi = (I - \gamma P^\pi)^{-1}$ as

$$\Psi = \begin{bmatrix} U_1 & U_2 \end{bmatrix} \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{bmatrix} \begin{bmatrix} V_1^\top \\ V_2^\top \end{bmatrix} \Xi
\qquad
(I - \gamma P^\pi) = \begin{bmatrix} V_1 & V_2 \end{bmatrix} \begin{bmatrix} \Sigma_1^{-1} & 0 \\ 0 & \Sigma_2^{-1} \end{bmatrix} \begin{bmatrix} U_1^\top \\ U_2^\top \end{bmatrix} \Xi$$

Then, for $\Phi = U_1$, the iteration matrix can be written as $A_\Phi = U_1^\top \Xi (I - \gamma P^\pi) U_1 = U_1^\top \Xi V_1 \Sigma_1^{-1}$.

Now, writing the rank-$k$ truncation of the successor representation as $\hat\Psi = U_1 \Sigma_1 V_1^\top \Xi$, with pseudoinverse $\hat\Psi^+ = V_1 \Sigma_1^{-1} U_1^\top \Xi$, the cyclicity of the spectrum implies the desired criterion:

$$\mathrm{Spec}(\hat\Psi^+) = \mathrm{Spec}(V_1 \Sigma_1^{-1} U_1^\top \Xi) = \mathrm{Spec}(U_1^\top \Xi V_1 \Sigma_1^{-1}) \cup \{0\} = \mathrm{Spec}(A_\Phi) \cup \{0\}.$$
∎

See Theorem 4.1.

###### Proof of Theorem 4.1.

Let $\lambda$ be a nonzero eigenvalue of $\Pi P^\pi \Pi$ with an eigenvector $v$. Since $\lambda v = \Pi P^\pi \Pi v \in \mathrm{Span}(\Phi)$ and $\lambda \ne 0$, we have $v \in \mathrm{Span}(\Phi)$.

Since $P^\pi$ is invariant on $\mathrm{Span}(\Phi)$, $\Pi P^\pi \Pi v = P^\pi v = \lambda v$, and therefore $\lambda$ is an eigenvalue of $P^\pi$. Therefore,

$$\mathrm{Spec}(\Pi P^\pi \Pi) \subseteq \mathrm{Spec}(P^\pi) \cup \{0\}.$$

The spectrum of $P^\pi$ thus controls the stability of the representation: $P^\pi$ is a stochastic matrix satisfying $\rho(P^\pi) \le 1 < \frac{1}{\gamma}$, and thus $\rho(\Pi P^\pi \Pi) < \frac{1}{\gamma}$, implying stability through Proposition 3.2. ∎

See Proposition 4.1.

###### Proof of Proposition 4.1.

See Theorem 7.3.1 in Golub & van Loan (2013). ∎
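Proposition 4.1 refers to orthogonal (subspace) iteration, which computes a basis for a dominant invariant subspace, i.e. a partial Schur basis. A minimal sketch (the test matrix with a prescribed, well-separated real spectrum is our own construction):

```python
import numpy as np

def orthogonal_iteration(A, d, iters=500, seed=0):
    """Repeatedly multiply by A and re-orthogonalize; Span(Q) converges to
    the dominant d-dimensional invariant subspace of A (Golub & van Loan,
    Thm 7.3.1), provided |lambda_d| > |lambda_{d+1}|."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((A.shape[0], d)))
    for _ in range(iters):
        Q, _ = np.linalg.qr(A @ Q)
    return Q

# Test matrix with known, well-separated real eigenvalues.
rng = np.random.default_rng(6)
V = rng.standard_normal((10, 10))
eigs = [5.0, 4.0, 3.0, 1.0, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
A = V @ np.diag(eigs) @ np.linalg.inv(V)

Q = orthogonal_iteration(A, d=3)
residual = np.linalg.norm(A @ Q - Q @ (Q.T @ A @ Q))  # invariance defect
print(residual < 1e-6)
```

The sensitivity discussed in Section 5 shows up here too: for nonsymmetric $A$, the computed subspace degrades quickly as the eigenvalue gap shrinks or the eigenbasis becomes ill-conditioned.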

See Theorem 4.2.

###### Proof of Theorem 4.2.

We can rewrite the definition of $\varepsilon$-invariance in terms of a matrix norm: $\|\Pi P^\pi \Pi - P^\pi \Pi\|_\Xi \le \varepsilon$. Thus, letting $E = \Pi P^\pi \Pi - P^\pi \Pi$, we have $\|E\|_\Xi \le \varepsilon$.

Now, suppose that $\Pi P^\pi \Pi$ has an eigenvalue-eigenvector pair $(\lambda, v)$ with $\lambda \ne 0$; as in the proof of Theorem 4.1, $v \in \mathrm{Span}(\Phi)$, so $\Pi v = v$. This means that

$$\lambda v = \Pi P^\pi \Pi v = P^\pi \Pi v + E v = P^\pi v + E v \implies \lambda \in \mathrm{Spec}(P^\pi + E)$$

Now, the Bauer-Fike theorem (see Appendix A above), with $V$ the eigenbasis diagonalizing $P^\pi$, implies that $|\lambda| \le \rho(P^\pi) + \kappa(V) \|E\|_\Xi \le 1 + \kappa(V)\,\varepsilon$. Now, if $\varepsilon < \frac{1}{\kappa(V)}\left(\frac{1}{\gamma} - 1\right)$, then $\rho(\Pi P^\pi \Pi) < \frac{1}{\gamma}$, and stability follows from Proposition 3.2. ∎

See Proposition 4.2. Remark: Each vector orthogonalized out of the Krylov sequence can be interpreted as the component of the reward at the $k$-th timestep that cannot be predicted from the rewards at the first $k-1$ timesteps.

###### Proof of Proposition 4.2.

Any vector $v$ can be decomposed into two components: $v = \Pi_{d-1} v + (I - \Pi_{d-1}) v$, where $\Pi_{d-1}$ denotes the projection onto the first $d-1$ Krylov vectors. Because $P^\pi$ maps the span of the first $d-1$ Krylov vectors into $\mathrm{Span}(\Phi)$, the projection acts as the identity on $P^\pi \Pi_{d-1} v$, and these terms cancel:

$$\begin{aligned}
\frac{\|\Pi P^\pi v - P^\pi v\|_\Xi}{\|v\|_\Xi}
&= \frac{\|\Pi P^\pi (\Pi_{d-1} v + (I - \Pi_{d-1}) v) - P^\pi (\Pi_{d-1} v + (I - \Pi_{d-1}) v)\|_\Xi}{\|\Pi_{d-1} v + (I - \Pi_{d-1}) v\|_\Xi} \\
&= \frac{\|\Pi P^\pi (I - \Pi_{d-1}) v - P^\pi (I - \Pi_{d-1}) v\|_\Xi}{\|\Pi_{d-1} v + (I - \Pi_{d-1}) v\|_\Xi}
\end{aligned}$$

This expression is maximized whenever $(I - \Pi_{d-1}) v$ is nonzero and $\Pi_{d-1} v = 0$, which is true whenever $v$ lies in the orthogonal complement of the first $d-1$ Krylov vectors within $\mathrm{Span}(\Phi)$. For such a $v$,

$$\sup_{v \in \mathrm{Span}(\Phi)} \frac{\|\Pi P^\pi v - P^\pi v\|_\Xi}{\|v\|_\Xi} = \frac{\|\Pi P^\pi v - P^\pi v\|_\Xi}{\|v\|_\Xi}$$
∎

See Theorem 4.3.

###### Proof of Theorem 4.3.

First, we show that the iteration matrix is positive-definite, and then show that this implies stability.

For any $x$, let $v = \Phi x$. By assumption, $\langle v, P^\pi v \rangle_\Xi < \frac{1}{\gamma} \|v\|_\Xi^2$; rearranging this definition implies that $\langle v, (I - \gamma P^\pi) v \rangle_\Xi > 0$. Therefore

$$x^\top A_\Phi x = v^\top \Xi (I - \gamma P^\pi) v = \langle v, (I - \gamma P^\pi) v \rangle_\Xi > 0.$$

Now, we consider an eigenvalue $\lambda$ of the iteration matrix $A_\Phi$, and a corresponding unit eigenvector $x$. We know that $\bar\lambda$ is also an eigenvalue of $A_\Phi$, with unit eigenvector $\bar x$. Then,

$$(x + \bar x)^\top A_\Phi (x + \bar x) = \lambda x^\top x + \lambda \bar x^\top x + \bar\lambda x^\top \bar x + \bar\lambda \bar x^\top \bar x = 2(\lambda + \bar\lambda)$$

Positive-definiteness implies that $(x + \bar x)^\top A_\Phi (x + \bar x) > 0$, and therefore the real component of $\lambda$, $\mathrm{Re}(\lambda) = \frac{1}{2}(\lambda + \bar\lambda)$, must also be positive. ∎

See Proposition 4.3.

###### Proof of Proposition 4.3.

We shall show that $\langle v, P^\pi v \rangle_\Xi < \gamma^{-1} \|v\|_\Xi^2$ for all $v \in \mathrm{Span}(\Phi)$, which implies the proposition. Since the skew-adjoint part of $P^\pi$ contributes nothing to a real quadratic form,

$$\langle v, P^\pi v \rangle_\Xi = \left\langle v, \tfrac{1}{2}\left(P^\pi + \Xi^{-1} (P^\pi)^\top \Xi\right) v \right\rangle_\Xi$$

Consider some $v \in \mathrm{Span}(\Phi)$, which can be expressed in the eigenbasis $\{u_k\}$ of the symmetrized matrix as $v = \sum_{k=d^*}^n \alpha_k u_k$. We have

$$\begin{aligned}
\langle v, P^\pi v \rangle_\Xi &= \left\langle v, \tfrac{1}{2}\left(P^\pi + \Xi^{-1} (P^\pi)^\top \Xi\right) v \right\rangle_\Xi \\
&= \left\langle \sum_{k=d^*}^n \alpha_k u_k,\; \sum_{k=d^*}^n \lambda_k \alpha_k u_k \right\rangle_\Xi \\
&< \gamma^{-1} \left\langle \sum_{k=d^*}^n \alpha_k u_k,\; \sum_{k=d^*}^n \alpha_k u_k \right\rangle_\Xi \\
&= \gamma^{-1} \|v\|_\Xi^2
\end{aligned}$$

Hence $\langle v, P^\pi v \rangle_\Xi < \gamma^{-1} \|v\|_\Xi^2$ for every $v \in \mathrm{Span}(\Phi)$. The second-to-last line is a result of the eigenvalues $\lambda_k$, $k \ge d^*$, being bounded by $\gamma^{-1}$.

Since $\langle v, P^\pi v \rangle_\Xi < \gamma^{-1} \|v\|_\Xi^2$, we also have $\langle v, (I - \gamma P^\pi) v \rangle_\Xi > 0$, and stability ensues from Theorem 4.3.

As a sidenote, we can use this same sequence of steps to show that a representation using only the top eigenvectors of the symmetrized transition matrix (those with eigenvalues exceeding $\gamma^{-1}$) is never stable. Defining the representation from those eigenvectors and following the same set of steps yields that $\langle v, (I - \gamma P^\pi) v \rangle_\Xi < 0$ for any $v \in \mathrm{Span}(\Phi)$. This implies that for this representation, the iteration matrix is negative-definite, and has all eigenvalues with negative real component; it is therefore not stable. ∎

## Appendix C Empirical Evaluation

### C.1 Experimental Setup

Four-room Domain: The four-room domain (Sutton et al., 1999) has 104 discrete states arranged into four “rooms”. At any state, the agent can take one of four actions corresponding to cardinal directions; if a wall blocks movement in the selected direction, the agent remains in place.

Policy Evaluation: We augment this domain with a task where the agent must reach the top right corner of the environment. The corresponding reward function is sparse: the agent receives a reward of +1 when it is in the desired state, and zero otherwise. The policy evaluation problem is to find the value function of a near-optimal policy, which takes the optimal action with high probability and a randomly selected action otherwise. Data is collected by rolling out fixed-length trajectories from the center of the bottom-left room with a uniform policy, which samples actions uniformly at random. The discount factor $\gamma$ is held fixed across experiments.

### C.2 Exact Evaluation

In this setting, the exact transition matrix and data distribution are used to create the representation. We compute the decompositions according to Table 1 and Appendix A. Stability is measured for a given representation by explicitly forming the induced iteration matrix, computing its eigenvalues, and checking that every eigenvalue has positive real part. To measure accuracy, we considered three metrics (Figure C.2).

• Policy Accuracy: (displayed in paper) This measures how well the greedy policy for the true value function matches the greedy policy for the estimated value function. This is given as

$$\frac{1}{|S|} \sum_{s \in S} \delta\left(\operatorname*{argmax}_a \hat Q(s, a) \ne \operatorname*{argmax}_a Q^\pi(s, a)\right)$$
• Optimal Projection Error: This measures how far the true value function is from the subspace of expressible value functions, $\|\Pi Q^\pi - Q^\pi\|_\Xi$. As the number of features increases, this error monotonically decreases, but it may not be indicative of the quality of the solution.

• Bellman Projection Error: This measures how far the solution reached by TD(0) (the TD fixed point) is from the true value function: $\|\Phi \theta^*_{TD} - Q^\pi\|_\Xi$. This measure of error is nonmonotonic (adding extra features can cause errors to increase) and unbounded. Furthermore, in the regime of a low number of features, this error greatly underestimates the quality of the recovered solution.
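These metrics have direct matrix expressions; a sketch of the first and last (the function names are ours, `Q` arrays are state-by-action tables, and the random instance is illustrative):

```python
import numpy as np

def policy_accuracy_error(Q_hat, Q_true):
    """Fraction of states whose greedy action under the estimated Q
    disagrees with the greedy action under the true Q."""
    return np.mean(Q_hat.argmax(axis=1) != Q_true.argmax(axis=1))

def bellman_projection_error(Phi, P, Xi, r, gamma):
    """Xi-weighted distance between the TD fixed point Phi @ theta* and
    the true value function (I - gamma P)^{-1} r."""
    n = P.shape[0]
    A = Phi.T @ Xi @ (np.eye(n) - gamma * P) @ Phi
    theta = np.linalg.solve(A, Phi.T @ Xi @ r)   # TD fixed point
    err = Phi @ theta - np.linalg.solve(np.eye(n) - gamma * P, r)
    return float(np.sqrt(err @ Xi @ err))

# Sanity check: with the full tabular basis, the TD fixed point is exact.
rng = np.random.default_rng(8)
n = 6
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
Xi = np.diag(rng.dirichlet(np.ones(n)))
r = rng.random(n)
print(np.isclose(bellman_projection_error(np.eye(n), P, Xi, r, 0.9), 0.0))
```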

### C.3 Estimation from Samples

To measure how well the representations can be estimated from samples, we consider the difference between the subspaces spanned by the estimated and true representations. In particular, we sample transitions from the data distribution and reconstruct the empirical transition matrix $\hat P^\pi$ from these transitions. If a particular state-action pair is never sampled, the prior we use for the transition matrix is that taking this action deterministically leads back to the same state. We construct the estimated representation $\hat\Phi$ from $\hat P^\pi$, and measure the distance between the true representation and the estimated representation as the Frobenius norm of the difference between the corresponding projections, $\|\Pi_\Phi - \Pi_{\hat\Phi}\|_F$. The Frobenius norm is selected in particular because it measures an expected distance, as compared to the maximum distance measured by the operator norm $\|\Pi_\Phi - \Pi_{\hat\Phi}\|_2$.
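The subspace distance itself reduces to comparing the two induced projections. A sketch (assuming Euclidean orthogonalization for simplicity, whereas the paper's representations are $\Xi$-orthogonal):

```python
import numpy as np

def subspace_distance(Phi, Phi_hat):
    """Frobenius distance between the orthogonal projections onto the
    column spans of the true and estimated representations."""
    def proj(M):
        Q, _ = np.linalg.qr(M)
        return Q @ Q.T
    return np.linalg.norm(proj(Phi) - proj(Phi_hat), ord="fro")

e1 = np.array([[1.0], [0.0], [0.0]])
e2 = np.array([[0.0], [1.0], [0.0]])
print(subspace_distance(e1, 2 * e1))   # same span -> 0
print(subspace_distance(e1, e2))       # orthogonal spans -> sqrt(2)
```

Working with projections makes the measure invariant to the particular basis chosen for each representation, which matters because eigenvector routines return bases only up to rotation and sign.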

### C.4 Estimation with Gradient Descent

When learning the representation using gradient descent, we train a network with one hidden layer of $d$ units and no activation function, which takes in state-action pairs encoded in one-hot form and outputs the prediction targets. The values of the units in the hidden layer form the representation $\phi(s, a; \theta)$. The network is trained with minibatched stochastic gradient updates, all implemented in Jax.

• Schur Decomposition: To mimic the orthogonal iteration procedure, we use the following training loss, where $\theta_t$ are the parameters of the target network:

$$L(\theta; \theta_t) = \mathbb{E}_{(s,a) \sim \xi,\; s' \sim P(\cdot \mid s, a)}\left[\left\|f(s, a; \theta) - \mathbb{E}_{a' \sim \pi}\left[\phi(s', a'; \theta_t)\right]\right\|^2\right]$$

This loss is optimized using stochastic gradient descent. The target network is updated periodically, and after every target network update, the representation is renormalized to satisfy $\Phi^\top \Xi \Phi = I$.

• Reward Krylov Basis: We use a regression loss in which the $i$-th output of the network is trained to predict the expected reward $i$ steps into the future,

$$L(\theta) = \mathbb{E}_{(s,a) \sim \xi}\left[\sum_{i=1}^{d}\left(f_i(s, a; \theta) - \mathbb{E}_\pi\left[r_i \mid s_0 = s, a_0 = a\right]\right)^2\right]$$

where the inner expectation comes from trajectories that are generated by the policy being evaluated, starting from $(s, a)$. Although this loss requires that the evaluated policy be run in the environment, it serves a didactic purpose: it shows that these Krylov bases can be learned with additional domain knowledge. This loss is optimized using the Adam optimizer.