# Efficient decorrelation of features using Gramian in Reinforcement Learning

Learning good representations is a long-standing problem in reinforcement learning (RL). One of the conventional ways to achieve this goal in the supervised setting is through regularization of the parameters. Extending some of these ideas to the RL setting has not yielded similar improvements in learning. In this paper, we develop an online regularization framework for decorrelating features in RL and demonstrate its utility in several test environments. We prove that the proposed algorithm converges in the linear function approximation setting and does not change the main objective of maximizing cumulative reward. We demonstrate how to scale the approach to deep RL using the Gramian of the features, achieving computational complexity that is linear in the number of features and quadratic in the batch size. We conduct an extensive empirical study of the new approach on Atari 2600 games and show a significant improvement in sample efficiency in 40 out of 49 games.

## 1. Introduction

Learning a good representation is an important part of machine learning [2]. In reinforcement learning (RL), and in particular deep RL, a significant challenge is learning features that generalize to new states and tasks [13][33]. Disentangling factors of variance, especially from highly structured and correlated data such as images, is important to achieving good compact representations [2]. While some work has been done on studying and improving generalization by regularizing features in RL [13], very little work has been done on disentangling factors of variance in RL in an online setting. A sensible approach to disentangling, or decorrelating, features in RL is to apply dimensionality reduction techniques like principal component analysis (PCA) to data collected offline [10][18]. Collecting data in advance is a significant disadvantage of these methods; our objective is to demonstrate a theoretically justified approach to decorrelating features in RL in an online manner that is computationally efficient and achieves performance gains, particularly in deep RL problems.

Decorrelating features in an online manner applicable to RL and computationally efficient is not an easy problem [23]. Usually one assumes that the features provided are uncorrelated [4] [25]. The need for decorrelating features online has increased with the explosion of interest in deep RL. Firstly, it is often not practically feasible to collect large amounts of data before training the agent especially in infinite or uncountable state spaces. Secondly, if the environment is non-stationary or extremely large and complex, continual learning and disentangling of the representation can be advantageous to track the important features of the environment for making decisions.

The goal is to learn a generalized representation such that similar observed states produce similar value estimates. In the tabular setting, there is no representation learning; however, in linear and deep function approximation settings, representation learning is an important problem, especially in infinitely large state spaces. Some older works look at state aggregation [21] and soft state aggregation [27], which involve partitioning the state space according to some property. While helpful in improving generalization to new states, state aggregation only allows for generalization within a specific partition or cluster.

Neural networks [17] are so-called universal approximators, i.e. they can approximate any continuous function over a compact set [11], and can be used to achieve generalization over unseen states in deep RL [32] [22] [26]. Other methods of learning generalizable feature representations include kernel methods [6]. We focus on decorrelating linear and deep representations in RL. On supervised problems, decorrelating features has been shown to improve generalization. [24] is one of the earliest results on applying decorrelation in neural networks. The decorrelation of features was hypothesized to explain, in part, the underlying working principle of the visual cortex V1 [3]. In more recent research, it was empirically demonstrated that feature decorrelation improves generalization and reduces overfitting in a variety of tasks and across different models [8] [9] [7]. [9] show experimentally that decorrelating features is competitive in performance with dropout [28], a widely used regularizer, and can be combined with dropout to achieve even better performance.

Unfortunately, common methods of regularizing a network in supervised learning seem to improve generalization in RL but do not improve performance [13]. The authors of [13] performed an empirical study of the effects of dropout and weight regularization on the generalization properties of DQN [20] and concluded that they help learn more general-purpose representations. Another work [12] proposed a weight regularization approach, focusing on the supporting theory but lacking experimental results on the generalization properties of the method. While generalizing to similar tasks in RL is an important problem, our focus is on a theoretically grounded approach to decorrelating features while the agent interacts online with the environment, and on measuring its impact on learning performance. Earlier work by [19] introduced online feature decorrelation in RL but provided only empirical results. Their approach scales quadratically with the number of features, whereas the proposed approach scales linearly with the number of features under non-linear function approximation.

The main contributions of this paper are as follows:

• Develop an online algorithm for decorrelating features in RL

• Prove its convergence

• Justify theoretically that the proposed regularizing decorrelation loss in the RL setting does not change the RL objective

• Scale up linearly in features in the deep RL setting and demonstrate empirically that decorrelating features in RL improves performance on 40 out of 49 Atari 2600 games

The rest of the paper is organized as follows. In section 2, we present the proposed algorithm and justify it theoretically. In section 3, we show empirically that the proposed algorithm for online decorrelation of features provides performance benefits on many RL problems. In section 4, we draw some conclusions and in the appendix we show the proofs and derivations for the algorithm along with further details on the experiments.

## Decorrelating features in RL

In this section we first describe the problem setting and notation. Then, we introduce the decorrelating constraint into the Mean Squared Value Error minimization problem with TD(0) [29]. We consider the linear function approximation case in our theoretical analysis due to its tractability, and we show that adding a regularizer term in this case does not change the original TD(0) solution. Finally, we prove the convergence of the stochastic approximation algorithm for TD(0) with the decorrelating regularizer by applying the Borkar-Meyn theorem (Chapter 2 of [5]).

### Problem setting

We model the environment as a Markov Decision Process with state transition probabilities

and initial state distribution invariant with respect to , i.e. and . We also assume finite state and finite actions spaces, i.e. , , with a given set of feature functions and . We denote the set of all terminal states by . The reward is a real-valued function . Agent’s objective is to maximize expected discounted sum of rewards: by adjusting its decision rule represented by a policy function . Specifically, in Q-learning, the agent estimates the state-action values and given picks an action with the highest state-action value. In the function approximation case, state-action values are approximated by a linear function: , where . The feature function can be freely chosen, including for example a one-hot state encoding to a Neural Network.

### Decorrelating regularizer and analytical gradients

Features are decorrelated when the covariance matrix of the state features is diagonal, i.e. given a batch of states features , the off-diagonal elements that correspond to the covariances vanish , where

are the standard basis vectors. In the case when features are correlated it is possible to find a transformation of features,

via diagonalization or SVD, s.t. [15]. Such a solution requires access to the features of all states. In the case of decorrelating features in RL, it is desirable to learn such a transformation

online. One possible loss function for learning the value function of a given policy is to augment the Mean Squared Value Error (MSVE) loss,

$$L_{TD}(\theta) = 0.5 \sum_s \mu(s) \left[ R(s) + \gamma \phi(s')^T \theta - \phi(s)^T \theta \right]^2, \tag{1}$$

with a feature decorrelating regularizer, which is the L2 loss of the off-diagonal elements of the .

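$$L_{REG}(\theta, A) = 0.5 \sum_s \mu(s) \left[ R(s) + \gamma \phi(s')^T A \theta - \phi(s)^T A \theta \right]^2 + 0.5 \lambda \sum_{i} \cdots \tag{2}$$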

Geometrically, orthogonal transformation of features can be viewed as a rotation (given ) of the feature space. Therefore, if the feature space is rotated, then the value function defined over that space does not change if its parameters are adjusted accordingly. However, this form of the loss is not as simple as it could be because of the two sets of parameters .

Note that matrix diagonalization is unique only if we restrict the diagonalizing matrix to be orthogonal. If the column vectors of the diagonalizing matrix are rescaled, i.e. , the resulting matrix, while no longer guaranteed to be orthogonal, is still a diagonalizing matrix. This observation allows us to reduce the number of parameters in Equation (2) from to since the regularization term will be zero for any such rescaling of a matrix into .
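The rescaling argument can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic feature batch; the names `Phi`, `scales` and `off_diagonal_energy` are illustrative and do not come from the paper. It diagonalizes an uncentered second-moment matrix with an orthogonal matrix, rescales the columns, and confirms that the off-diagonal entries still vanish even though orthogonality is lost.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of correlated features: 200 samples, 3 features.
X = rng.normal(size=(200, 3))
Phi = X @ np.array([[1.0, 0.5, 0.2],
                    [0.0, 1.0, 0.7],
                    [0.0, 0.0, 1.0]])
C = Phi.T @ Phi  # uncentered second-moment matrix (symmetric)

# Orthogonal diagonalizer taken from the eigendecomposition of C.
_, V = np.linalg.eigh(C)

def off_diagonal_energy(M):
    """Sum of squared off-diagonal entries of a square matrix."""
    off = M - np.diag(np.diag(M))
    return float(np.sum(off ** 2))

# Rescale each column of V by an arbitrary nonzero factor.
scales = np.array([2.0, -0.5, 3.0])
A = V * scales  # same as V @ np.diag(scales)

print(off_diagonal_energy(C))            # clearly positive: features are correlated
print(off_diagonal_energy(V.T @ C @ V))  # numerically zero: V diagonalizes C
print(off_diagonal_energy(A.T @ C @ A))  # numerically zero: rescaled A still diagonalizes C
print(np.allclose(A.T @ A, np.eye(3)))   # False: A is no longer orthogonal
```

Because the penalty on off-diagonal entries is blind to column scale, the scale is free to carry the value weights, which is the intuition behind folding the weight vector into the matrix.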

The benefit of such a reduction is two-fold. First, fewer parameters may lead to faster learning. Second, theoretical analysis of the reduced problem is simpler. Observe that the original problem in Equation (2) is parametrized by a vector and a matrix which belong to different metric spaces. On the other hand, the reduced problem is parametrized by a single matrix and the update is easier to analyze.

The loss of the reduced problem is

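$$L_{REG}(A) = 0.5 \sum_s \mu(s) \left[ R(s) + \gamma \phi(s')^T A \mathbf{1} - \phi(s)^T A \mathbf{1} \right]^2 + 0.5 \lambda \sum_{i} \cdots \tag{3}$$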

The following proposition ensures that solving Equation (3) also solves Equation (1).

###### Proposition 1.

Let be a global minimum of Equation (3). Then is a diagonal matrix. Furthermore, the global minimum values of Equation (3) and Equation (1) are equal.

###### Proof.

See Appendix. ∎

### Convergence

In order to obtain an online algorithm, one needs the semi-gradient of Equation (3). The reparametrized update (see the Appendix for the derivation) is:

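$$\frac{\partial L_{REG}(A)}{\partial A} = -\sum_s \mu(s) \phi(s) \mathbf{1}^T \left[ R(s) + \gamma \phi(s')^T A \mathbf{1} - \phi(s)^T A \mathbf{1} \right] + \lambda \Phi^T D \Phi A \left[ E_1^T \tilde{D} E_2 + E_2^T \tilde{D} E_1 \right] \tag{4}$$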

where for

• a diagonal matrix constructed from the off diagonal elements of the covariance matrix for the transformed features, i.e.

• are standard basis vectors of

• is the vector of ones

For example in , , .
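To make the first term of Equation (4) concrete, here is a minimal sketch of a single stochastic step along that term only, assuming NumPy. The function name `td_term_update` and the step size `alpha` are illustrative, and the decorrelating term is deliberately left out, so this is not the full update.

```python
import numpy as np

def td_term_update(A, phi, phi_next, reward, gamma, alpha):
    """One stochastic step along the TD term of Equation (4) only.

    The decorrelating (lambda) term is intentionally omitted in this sketch.
    """
    ones = np.ones(A.shape[1])
    theta = A @ ones  # implied value weights, theta = A 1
    delta = reward + gamma * (phi_next @ theta) - phi @ theta  # TD error
    # Descent on the loss adds alpha * delta * phi 1^T to A.
    return A + alpha * delta * np.outer(phi, ones)

# Hypothetical transition with d = 3 features.
A0 = np.eye(3)
phi = np.array([1.0, 0.0, 2.0])
phi_next = np.array([0.0, 1.0, 1.0])
A1 = td_term_update(A0, phi, phi_next, reward=1.0, gamma=0.9, alpha=0.1)

# The implied weights follow a TD(0) step whose step size is scaled by d.
theta0 = A0 @ np.ones(3)
delta0 = 1.0 + 0.9 * (phi_next @ theta0) - phi @ theta0
print(np.allclose(A1 @ np.ones(3), theta0 + 0.1 * 3 * delta0 * phi))  # True
```

The check at the end reflects that every column of the matrix receives the same increment, so the implied weight vector moves like ordinary TD(0) with the step size multiplied by the number of columns.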

In order to apply the Borkar-Meyn theorem, we have to ensure the iterates are bounded; i.e. . One option is to simply assume this condition outright, which is often done in practice. In that case we arrive at the following convergence result.

###### Proposition 2.

Assuming that almost surely and the Robbins-Monro step-size conditions, the iterates of the stochastic update converge to a compact, connected, internally chain-transitive set of . Additionally, if is of full rank, there exists such a set that contains at least one equilibrium of .

###### Proof.

See Appendix. ∎

Alternatively, a more theoretically sound approach is to ensure . We achieve this by introducing an additional step that projects onto the space of orthogonal matrices (see Section 5.4 of [5]), i.e. , where is the Frobenius norm. The solution to this projection problem is provided by the following proposition.

###### Proposition 3.

The projection of a square matrix onto the space of orthogonal matrices is given by , where and are obtained from SVD of .

###### Proof.

See Appendix. ∎
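The projection step can be sketched in a few lines, assuming NumPy and assuming the projection is the product of the two orthogonal SVD factors, which is what the appendix argument yields and is also the standard nearest-orthogonal-matrix result. The function name `project_to_orthogonal` and the test matrix are illustrative.

```python
import numpy as np

def project_to_orthogonal(A):
    """Nearest orthogonal matrix to A in Frobenius norm, from the SVD A = U S V^T."""
    U, _, Vt = np.linalg.svd(A)
    return U @ Vt

rng = np.random.default_rng(0)
A = np.eye(4) + 0.1 * rng.normal(size=(4, 4))  # hypothetical near-orthogonal iterate
Q = project_to_orthogonal(A)

print(np.allclose(Q.T @ Q, np.eye(4)))  # True: Q is orthogonal

# Empirical check: no randomly drawn orthogonal matrix is closer to A than Q.
best = np.linalg.norm(A - Q)
closest_random = min(
    np.linalg.norm(A - np.linalg.qr(rng.normal(size=(4, 4)))[0])
    for _ in range(1000)
)
print(best <= closest_random + 1e-12)  # True
```

The SVD costs cubic time in the matrix dimension, which is consistent with the remark that dropping the projection reduces the per-step complexity.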

For the Borkar-Meyn theorem, we also need to show that the semi-gradient above is Lipschitz over its domain. If we project the iterates to , the following proposition gives us the Lipschitz property.

###### Proposition 4.

defined by Equation (4) over the space of orthogonal matrices satisfies the Lipschitz condition.

###### Proof.

See Appendix. ∎

Lastly, the conditions of the Borkar-Meyn theorem on the martingale difference noise also follow from the projection step. This result is proved in Proposition 6 in the Appendix.

### Linear Q-learning

Incorporating the decorrelating update into Q-learning in the linear setting is quite straightforward. The only change is in the weight update step.

In practice we found to be close to orthogonal without the projection step in Algorithm 1; in this case, the complexity reduces to quadratic in the number of features. Such complexity might still be prohibitive when scaling to higher-dimensional spaces.

### Scaling up to deep RL

In the case of high-dimensional representations such as those of neural networks, complexity that is quadratic in the number of features in Equation (1) might be a significant computational limitation. One possible solution was suggested by [3]. The main idea is to move the quadratic complexity from the number of features to the sample size by representing the covariance by a Gramian matrix. This approach relies on the fact that in practice neural networks are trained with mini-batch SGD, which uses a fixed, small batch size.

For completeness, we reproduce the result by [3] in notation consistent with our work. Let , where is the last hidden layer of the NN. Then the sum of squared elements of is

$$\sum_{i,j} \mathrm{Cov}_{i,j}^2 = \mathrm{Tr}(\Phi^T \Phi \Phi^T \Phi) = \mathrm{Tr}(\Phi \Phi^T \Phi \Phi^T) = \sum_{i,j} G_{i,j}^2 \tag{5}$$

where . Therefore, the decorrelating regularizer is equivalent to

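$$\begin{aligned}
\sum_{i \neq j} \mathrm{Cov}_{i,j}^2 &= \sum_{i,j} \mathrm{Cov}_{i,j}^2 - \sum_{i=j} \mathrm{Cov}_{i,j}^2 \\
&= \mathrm{Tr}(\Phi^T \Phi \Phi^T \Phi) - \sum_{i=j} \mathrm{Cov}_{i,j}^2 \\
&= \mathrm{Tr}(\Phi \Phi^T \Phi \Phi^T) - \sum_{i=j} \mathrm{Cov}_{i,j}^2 \\
&= \sum_{i,j} G_{i,j}^2 - \sum_i \mathrm{Var}_i^2
\end{aligned} \tag{6}$$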

We extend this idea to the deep RL setting via the following optimization objective:

$$L_{DQN\text{-}Gram} = L_{DQN} + \lambda \left[ \sum_{i,j} G_{i,j}^2 - \sum_i \mathrm{Var}_i^2 \right] \tag{7}$$

In Equation (7) the complexity is linear in the number of features and quadratic in the sample size, which allows Algorithm 1 to scale to the deep RL setting more efficiently than [19]. We call our algorithm DQN-Gram. It is outlined in Algorithm 2.
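The equivalence in Equations (5) and (6) is easy to check numerically. The sketch below assumes NumPy, takes the covariance to be the uncentered product of the feature matrix with itself as written in Equation (5), and takes each variance to be the corresponding diagonal entry; the function names and batch shape are illustrative.

```python
import numpy as np

def decorrelation_penalty_cov(Phi):
    """Sum of squared off-diagonal entries of Cov = Phi^T Phi; costs O(N d^2)."""
    cov = Phi.T @ Phi                    # d x d
    off = cov - np.diag(np.diag(cov))
    return float(np.sum(off ** 2))

def decorrelation_penalty_gram(Phi):
    """Same quantity computed from the Gramian G = Phi Phi^T; costs O(N^2 d)."""
    gram = Phi @ Phi.T                   # N x N
    var = np.sum(Phi ** 2, axis=0)       # diagonal of Phi^T Phi, computed in O(N d)
    return float(np.sum(gram ** 2) - np.sum(var ** 2))

rng = np.random.default_rng(0)
Phi = rng.normal(size=(32, 512))  # hypothetical mini-batch: N = 32 samples, d = 512 features

print(np.isclose(decorrelation_penalty_cov(Phi),
                 decorrelation_penalty_gram(Phi)))  # True
```

With a batch of 32 and 512 features, the Gramian route builds a 32 by 32 matrix instead of a 512 by 512 one, which is where the saving comes from when the batch is much smaller than the feature dimension.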

## Experiments

We perform extensive experiments with our algorithm in several settings: stochastic linear regression (SLR), linear RL and deep RL. The experiments reveal a significant improvement in sample efficiency. We also examine the properties of the trained neural network and conclude that the decorrelating regularizer increases the model capacity that is used and the number of learned features.

### Stochastic Linear Regression

Decorrelating objective developed in the previous section applies to the SLR case, since it is a special case of (2) where the target is a fixed label and . The data for the SLR is generated as follows: , label . We perform two sets of experiments:

• data is uncorrelated, i.e. ;

• data is correlated .

We set the learning rate to 0.01 and the mini-batch size to 1. The results are reported in Figure 1. In both cases, SLR with the decorrelating regularizer has better sample efficiency.

### Linear RL

For the linear RL setting we perform experiments on the Mountain car environment [30]. We use tile coding as a feature approximator with 2 tiles. Note that tile coding produces sparse features with low correlation. In order to control for correlation we augment the feature space by adding extra dimensions that are copies of the original features. For example if , then with . To test the hypothesis of robustness with respect to correlated features we perform a step size sweep jointly with the sweep. For each (step-size, ) pair we average the number of steps in each episode over 50 episodes and 100 runs [following (Sutton 2018 p 248)]. It can be concluded from the results presented in Figure 2 that decorrelating Q-learning improves performance and decreases parameter sensitivity when the features have relatively high correlation, compared to when the features already have relatively low correlation.
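The feature augmentation used to control correlation can be illustrated as follows. This is a minimal sketch assuming NumPy; the helper `augment_with_copies`, its parameter `k`, and the random stand-in features are hypothetical and are not the tile-coded Mountain Car features used in the experiment.

```python
import numpy as np

def augment_with_copies(Phi, k):
    """Append copies of the first k feature columns (k is a hypothetical knob)."""
    return np.concatenate([Phi, Phi[:, :k]], axis=1)

rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 4))       # stand-in for low-correlation features
Phi_aug = augment_with_copies(Phi, 2)

corr = np.corrcoef(Phi_aug, rowvar=False)
print(Phi_aug.shape)               # (500, 6)
print(corr[0, 4], corr[1, 5])      # both equal 1 up to rounding: copies are perfectly correlated
```

Increasing the number of copied columns raises the fraction of perfectly correlated feature pairs, which gives a simple dial for testing sensitivity to correlation.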

### Deep RL

We investigate decorrelating features of the value function in deep RL with DQN. In our experiments we set all the hyperparameters following [20] except for the optimizer and the learning rate. In our setup we used Adam [16] with a learning rate of 1e-4. It can be seen from Figure 3 that DQN-Gram and DQN-decor produce equivalent simulation results, as predicted by their theoretical equivalence. However, DQN-Gram has lower computational complexity in the number of features.

In order to get a better understanding of the improved performance of the model with the decorrelating regularizer we study the properties of the neural networks in question.

We hypothesize that there is a difference in the sparsity of the learned features between DQN and DQN trained with the decorrelating regularizer. In order to test this hypothesis we compute the histogram of non-zero activations of the features (last hidden layer) of the trained agents across 1 million states. In addition, we measure the sparsity of activations by varying the threshold, i.e.

$$\text{sparsity}_\epsilon = 1 - \frac{\sum_{i=1}^{N} \mathbb{I}_{|\phi_i| > \epsilon}}{N} \tag{8}$$

It can be seen from Figure 5 that there is a significant difference in sparsity. Interestingly, the model trained with the decorrelating regularizer has denser representations, indicating a richer representation that potentially better exploits the representational capabilities of the network architecture.
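The thresholded sparsity measure of Equation (8) can be computed as in the sketch below, assuming NumPy and assuming that N counts all activations passed in; the function name `sparsity` and the sample activations are illustrative.

```python
import numpy as np

def sparsity(phi, eps):
    """Fraction of activations with |phi_i| <= eps, following Equation (8)."""
    phi = np.asarray(phi, dtype=float)
    return 1.0 - np.count_nonzero(np.abs(phi) > eps) / phi.size

# Hypothetical last-hidden-layer activations for one state.
activations = np.array([0.0, 0.0, 0.02, 0.5, 1.3, 0.0, 0.08, 2.1])

for eps in (0.0, 0.05, 0.1):
    print(eps, sparsity(activations, eps))
# 0.0 0.375
# 0.05 0.5
# 0.1 0.625
```

Sweeping the threshold separates exactly-zero units from units that are merely small, which matters when comparing networks whose activations differ in scale.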

Observe that the Gram regularizer can be decomposed in the following way:

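$$\sum_{n,m} G_{n,m}^2 - \sum_d \mathrm{Var}(\phi_{*d})^2 = \sum_n \|\phi_{n*}\|^4 + \sum_{n \neq m} (\phi_{n*}^T \phi_{m*})^2 - \sum_d \mathrm{Var}(\phi_{*d})^2 \tag{9}$$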

Therefore, minimizing the Gram regularizer results in the following:

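• minimizes the norm of features through $\sum_n \|\phi_{n*}\|^4$

• decorrelates samples through $\sum_{n \neq m} (\phi_{n*}^T \phi_{m*})^2$

• increases the variance of features via $-\sum_d \mathrm{Var}(\phi_{*d})^2$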

where the first dimension, enumerated by $n$, is the sample number and the second dimension, enumerated by $d$, is the feature dimension. However, these three objectives are not independent, which in practice introduces trade-offs between them. For example, from Figure 6 we can see that in the Atari 2600 game Seaquest the norm of the features and the sample correlation do drop, but the variance does not change in the direction favored by the regularizer: it drops despite the pressure applied by the regularizer to increase it. In addition, the above decomposition explains the difference in the gradient dynamics introduced by the Gram regularizer. It can be seen from Figure 6 that the growth of the gradients of the last linear layer in DQN is mainly due to the norm of the features and not the TD error.
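The decomposition in Equation (9) can be verified term by term. The following sketch assumes NumPy, a random feature matrix whose rows are samples and whose columns are features, and variances taken as the diagonal of the uncentered feature product, consistent with Equation (6); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(16, 64))  # rows: samples n, columns: features d

gram = Phi @ Phi.T
var = np.sum(Phi ** 2, axis=0)   # assumed: Var of feature d is the d-th diagonal entry of Phi^T Phi

norm_term = np.sum(np.sum(Phi ** 2, axis=1) ** 2)  # sum over n of ||phi_n||^4
off = gram - np.diag(np.diag(gram))
sample_term = np.sum(off ** 2)                     # sum over n != m of (phi_n^T phi_m)^2
variance_term = np.sum(var ** 2)

lhs = np.sum(gram ** 2) - variance_term
rhs = norm_term + sample_term - variance_term
print(np.isclose(lhs, rhs))  # True
```

Logging the three terms separately during training is one way to see which of the competing pressures dominates at a given point.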

## Summary

Representation learning in RL has attracted more attention in recent years due to the advancements in deep learning (DL) and its application in RL. However, not every approach that improves representation learning in supervised deep learning yields similar gains in RL, e.g. dropout [13]. The main reason for this phenomenon might be that RL differs from supervised learning in its objective. Therefore, we introduced a theoretically justifiable regularization approach in RL. We showed that the feature decorrelating regularizer in RL does not interfere with the main objective and introduced a new algorithm based on it that is proved to converge in policy evaluation. We showed that our method can be scaled to deep RL with computational complexity that is linear in the number of features and quadratic in the mini-batch size. Finally, we examined the statistical properties of the features in the deep RL setting and found that the decorrelating regularizer better exploits the representational capabilities of the neural network by increasing the number of useful features learned.

An area worth investigating in future work is the effect of decorrelating features on generalization to unseen states and similar tasks in RL, which is a significant challenge in RL [13]. Another area of future work is to investigate how decorrelation of features can help improve performance alongside other enhancements like distributional RL and prioritized sampling in the Rainbow architecture [14]. It is likely that the decorrelating regularizer can improve the learned features in a similar fashion to distributional RL when viewed as an auxiliary task [1]; hence, we think that combining the decorrelating regularizer with Rainbow might be beneficial.

## References

• [1] M. G. Bellemare, W. Dabney, and R. Munos (2017) A distributional perspective on reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 449–458. Cited by: Summary.
• [2] Y. Bengio, A. Courville, and P. Vincent (2013-08) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828. External Links: Document, ISSN 0162-8828 Cited by: 1. Introduction.
• [3] Y. Bengio and J. S. Bergstra (2009) Slow, decorrelated features for pretraining complex cell-like networks. In Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), pp. 99–107. External Links: Link
• [4] D. P. Bertsekas and J. N. Tsitsiklis (1996) Neuro-dynamic programming. Vol. 5, Athena Scientific Belmont, MA. Cited by: 1. Introduction.
• [5] V. S. Borkar (2009) Stochastic approximation: a dynamical systems viewpoint. Vol. 48, Springer. Cited by: Decorrelating features in RL, Appendix, footnote 2.
• [6] B. E. Boser, I. M. Guyon, and V. N. Vapnik (1992) A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pp. 144–152. Cited by: 1. Introduction.
• [7] X. Chang, T. Xiang, and T. M. Hospedales (2018) Scalable and effective deep cca via soft decorrelation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1488–1497. Cited by: 1. Introduction.
• [8] B. Cheung, J. A. Livezey, A. K. Bansal, and B. A. Olshausen (2014) Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583. Cited by: 1. Introduction.
• [9] M. Cogswell, F. Ahmed, R. Girshick, L. Zitnick, and D. Batra (2015) Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068. Cited by: 1. Introduction.
• [10] W. Curran, T. Brys, M. Taylor, and W. Smart (2015-05) Using pca to efficiently represent state spaces. Cited by: 1. Introduction.
• [11] G. Cybenko (1989) Approximation by superposition of sigmoidal functions. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314. Cited by: 1. Introduction.
• [12] A. M. Farahmand, M. Ghavamzadeh, S. Mannor, and C. Szepesvári (2009) Regularized policy iteration. In Advances in Neural Information Processing Systems, pp. 441–448. Cited by: 1. Introduction.
• [13] J. Farebrother, M. C. Machado, and M. Bowling (2018) Generalization and regularization in dqn. arXiv preprint arXiv:1810.00123. Cited by: 1. Introduction, 1. Introduction, Summary, Summary.
• [14] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver (2018) Rainbow: combining improvements in deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: Summary.
• [15] I. Jolliffe (1986) . Springer-Verlag, New York.
• [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Deep RL.
• [17] Y. LeCun, D. Touretzky, G. Hinton, and T. Sejnowski (1988) A theoretical framework for back-propagation. In Proceedings of the 1988 connectionist models summer school, Vol. 1, pp. 21–28. Cited by: 1. Introduction.
• [18] D. Liu, H. Li, and D. Wang (2015-06-01) Feature selection and feature learning for high-dimensional batch reinforcement learning: a survey. International Journal of Automation and Computing 12 (3), pp. 229–242. External Links: ISSN 1751-8520, Document, Link Cited by: 1. Introduction.
• [19] B. Mavrin, H. Yao, and L. Kong (2019) Deep reinforcement learning with decorrelation. arXiv preprint arXiv:1903.07765. Cited by: 1. Introduction, Scaling up to deep RL.
• [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529. Cited by: 1. Introduction, Deep RL.
• [21] A. W. Moore (1991) Variable resolution dynamic programming: efficiently learning action maps in multivariate real-valued state-spaces. In Machine Learning Proceedings 1991, pp. 333–337. Cited by: 1. Introduction.
• [22] B. Neyshabur, S. Bhojanapalli, D. Mcallester, and N. Srebro (2017) Exploring generalization in deep learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5947–5956. External Links: Link Cited by: 1. Introduction.
• [23] E. Oja and J. Karhunen (1985) On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of mathematical analysis and applications 106 (1), pp. 69–84. Cited by: 1. Introduction.
• [24] E. Oja (1982) Simplified neuron model as a principal component analyzer. Journal of mathematical biology 15 (3), pp. 267–273. Cited by: 1. Introduction.
• [25] R. Parr, L. Li, G. Taylor, C. Painter-Wakefield, and M. L. Littman (2008) An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pp. 752–759. Cited by: 1. Introduction.
• [26] R. Shwartz-Ziv and N. Tishby (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810. Cited by: 1. Introduction.
• [27] S. P. Singh, T. Jaakkola, and M. I. Jordan (1995) Reinforcement learning with soft state aggregation. In Advances in neural information processing systems, pp. 361–368. Cited by: 1. Introduction.
• [28] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: 1. Introduction.
• [29] R. S. Sutton (1988) Learning to predict by the methods of temporal differences. Machine learning 3 (1), pp. 9–44. Cited by: Decorrelating features in RL.
• [30] R. S. Sutton (1996) Generalization in reinforcement learning: successful examples using sparse coarse coding. In Advances in neural information processing systems, pp. 1038–1044. Cited by: Linear RL.
• [31] J. N. Tsitsiklis and B. Van Roy (1997) Analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems, pp. 1075–1081. Cited by: Appendix.
• [32] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: 1. Introduction.
• [33] C. Zhao, O. Sigaud, F. Stulp, and T. M. Hospedales (2019) Investigating generalisation in continuous deep reinforcement learning. CoRR abs/1902.07015. Cited by: 1. Introduction.

## Appendix

###### Proposition 1.

Let be a global minimum of Equation (3). Then is a diagonal matrix. Furthermore, the global minimum values of Equation (3) and Equation (1) are equal.

###### Proof.

Let be a global minimum of Equation (3). Assume that is not diagonal. We will derive a contradiction.

Write , where is the regularizer term in Equation (3), and is the MSVE term. Let be the global minimum of Equation (1). By assumption, . Let be any matrix that diagonalizes . Such a matrix must exist since is real and symmetric. Define . By definition, we have that , so that . As well, satisfies the following.

$$(V D_\theta)^T \Phi^T D \Phi V D_\theta = D_\theta D D_\theta.$$

Hence, . We then have

$$L_{REG}(V D_\theta) = L_1(V D_\theta) = L_{TD}(\theta^*_{TD}) \le L_{TD}(A \mathbf{1}) = L_1(A)$$

This is a contradiction by our assumption that minimizes . Hence, must be diagonal.

For the second part of the proof, assume that is a global minimum of Equation (3) as above. Since from the discussion above we know that , we have that . Let be the global minimum of Equation (1). Assume for the sake of contradiction that . But using the same construction of from above, we would have that , which cannot be the case. Hence, . ∎

###### Proposition 3.

The projection of a square matrix onto the space of orthogonal matrices is given by , where and are obtained from the SVD of , .

###### Proof.

The projection of onto is the solution of . Consider SVD of , . Then, by the unitary invariance of the Frobenius norm,

$$\|A - Q\|_F = \|U \Sigma V^T - Q\|_F = \|\Sigma - U^T Q V\|_F. \tag{10}$$

Note that since is a group and , it follows that . Therefore, .

Taking into account that is a diagonal matrix,

$$\begin{aligned}
\|\Sigma - Q\|_F^2 &= \sum_i (\Sigma_{ii} - Q_{ii})^2 + \sum_{i \neq j} Q_{ij}^2 \\
&= \sum_i (\Sigma_{ii}^2 + Q_{ii}^2 - 2 \Sigma_{ii} Q_{ii}) + \sum_{i \neq j} Q_{ij}^2 \\
&= \sum_i \Sigma_{ii}^2 - 2 \sum_i \Sigma_{ii} Q_{ii} + \sum_{i,j} Q_{ij}^2 \\
&= \sum_i \Sigma_{ii}^2 - 2 \sum_i \Sigma_{ii} Q_{ii} + \mathrm{Tr}(Q^T Q) \\
&= \sum_i \Sigma_{ii}^2 - 2 \sum_i \Sigma_{ii} Q_{ii} + \mathrm{Tr}(I) \\
&= \sum_i \Sigma_{ii}^2 - 2 \sum_i \Sigma_{ii} Q_{ii} + n
\end{aligned}$$

Noting that by SVD and ,

$$I = \arg\min_{Q^T Q = I} \left[ \sum_i \Sigma_{ii}^2 - 2 \sum_i \Sigma_{ii} Q_{ii} + n \right]$$

Hence, . The result follows from Equation (10). ∎

###### Proposition 4.

defined by Equation (4) over the space of orthogonal matrices satisfies the Lipschitz condition.

###### Proof.

Let is orthogonal . Consider defined by . Hence, and . Since is continuous and is closed (as a singleton), is also closed. Note also that is bounded in the operator norm, since for any , . Therefore, is compact by Heine-Borel theorem.

Observe that . is therefore continuous and reaches its maximum over the compact . This means that is bounded over , so that satisfies the Lipschitz condition if restricted to . ∎

###### Proposition 2.

Assuming that almost surely and the Robbins-Monro step-size conditions, the iterates of the unprojected stochastic update converge to a compact, connected, internally chain-transitive set of . Additionally, if is of full rank, there exists such a set that contains at least one equilibrium of .

###### Proof.

The claim follows from satisfying the assumptions for Theorem 2 in [5, Ch. 2]. First, the martingale differences satisfy the required bound by Proposition 6, we assume the Robbins-Monro step-size conditions, and we assume almost surely. Second, note that the proof that is Lipschitz in Proposition 4 still goes through if and we take to be defined on a compact ball centered at 0 with radius more than .

The existence of a compact, connected, internally chain-transitive set of that contains an equilibrium of this ODE follows from Proposition 5 (namely, the set is the equilibrium point itself).

###### Proposition 5.

Assuming that is full rank, the following ODE has at least one equilibrium point.

$$\dot{A}(t) = h(A(t)).$$

###### Proof.

Let $V$ be an orthogonal matrix that diagonalizes $\Phi^T D \Phi$. Let us constrain $A$ to be a diagonal matrix. We aim to solve the following system of equations for $A$.

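$$0 = h(VA) = -\sum_s \mu(s) \phi(s) \mathbf{1}^T \left( R(s) + \gamma \phi(s')^T V A \mathbf{1} - \phi(s)^T V A \mathbf{1} \right) + \lambda \sum_{i} \cdots$$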

Since diagonalizes and since is diagonal, we have that for . Hence, the second term in vanishes. We are left with

$$\sum_s \mu(s) \phi(s) \mathbf{1}^T R(s) + \sum_s \mu(s) \phi(s) \left( \gamma \phi(s')^T - \phi(s)^T \right) V A \mathbf{1}.$$

Since we assume that is full rank and , is invertible, as proven for instance in [31]. Because is assumed to be orthogonal and therefore invertible, we can explicitly solve for . ∎

For generality, let us first derive the semi-gradient of .

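$$\frac{\partial L(\theta, A)}{\partial \theta} = -\sum_s \mu(s) A^T \phi(s) \left[ R(s) + \gamma \phi(s')^T A \theta - \phi(s)^T A \theta \right] \tag{11}$$

$$\frac{\partial L(\theta, A)}{\partial A} = -\sum_s \mu(s) \phi(s) \theta^T \left[ R(s) + \gamma \phi(s')^T A \theta - \phi(s)^T A \theta \right] + \frac{\partial}{\partial A} 0.5 \lambda \sum_{i} \cdots$$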

We now expand the second term above.

$$\frac{\partial}{\partial A} 0.5 \lambda \sum_{i} \cdots$$

Setting yields the semi-gradient of .

Define

$$g(A_n) = \phi(s) \mathbf{1}^T \left[ R(s) + \gamma \phi(s')^T A_n \mathbf{1} - \phi(s)^T A_n \mathbf{1} \right] + \lambda \sum_{i} \cdots$$

which is the argument of Equation (4) at . Define also

$$M_{n+1} = g(A_n) - \left. \frac{\partial L(A)}{\partial A} \right|_{A_n} \tag{15}$$

###### Proposition 6.

Assume that almost surely, or that we project the iterates to at every step. Then defined by Equation (15) is

• a martingale difference sequence with respect to

$$\mathcal{F}_n = \sigma(A_m, M_m, m \le n) = \sigma(A_0, M_1, \ldots, M_n)$$

i.e. a.s.

• and the are square-integrable with

$$E\left[ \|M_{n+1}\|^2 \mid \mathcal{F}_n \right] \le K (1 + \|A_n\|^2) \quad \text{a.s.} \quad \forall n \in \mathbb{N}$$

###### Proof.

The application of iterated expectations immediately yields a.s. .

Recall that by Proposition 4. Also note that is Lipschitz by the same argument as in Proposition 4. Furthermore, if is Lipschitz, then for any fixed ,

$$\|f(x)\| - \|f(x_0)\| \le \|f(x) - f(x_0)\| \le L \|x - x_0\|$$

Rearranging the terms and applying the triangle inequality one obtains:

$$\|f(x)\| \le L \|x - x_0\| + \|f(x_0)\| \le L \|x\| + L \|x_0\| + \|f(x_0)\|$$

Hence, by the triangle inequality and Equation (16):

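$$\|M_{n+1}\| = \left\| g(A_n) - \left. \frac{\partial L(A)}{\partial A} \right|_{A_n} \right\| \le \|g(A_n)\| + \left\| \left. \frac{\partial L(A)}{\partial A} \right|_{A_n} \right\| \le K_1 (1 + \|A_n\|) + K_2 (1 + \|A_n\|) = K (1 + \|A_n\|)$$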

with . Therefore,

$$\|M_{n+1}\|^2 \le K^2 (1 + \|A_n\|)^2 = K^2 \left( 1 + 2 \|A_n\| + \|A_n\|^2 \right)$$

where by compactness of if we are projecting iterates, or by the other boundedness assumption. Finally,

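$$K^2 \left( 1 + 2C + \|A_n\|^2 \right) = K^2 (1 + 2C) + K^2 \|A_n\|^2 \le K^2 (1 + 2C) + K^2 (1 + 2C) \|A_n\|^2 = K^2 (1 + 2C) \left( 1 + \|A_n\|^2 \right) = \tilde{K} \left( 1 + \|A_n\|^2 \right)$$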

Combining last two inequalities, one obtains:

$$\|M_{n+1}\|^2 \le \tilde{K} \left( 1 + \|A_n\|^2 \right)$$

Applying conditional expectations completes the proof. ∎