1 Introduction
Recent impressive advances in reinforcement learning (RL) range from robotics to strategy games and recommendation systems (Kalashnikov et al., 2018; Li et al., 2010). Reinforcement learning is canonically regarded as an active learning process, also referred to as online RL, in which the agent interacts with the environment throughout training. In contrast, offline RL algorithms learn from large, previously collected static datasets and thus do not rely on environment interactions
(Agarwal et al., 2020b; Ernst et al., 2005; Fujimoto et al., 2019). Online data collection is performed in simulation or by means of real-world interactions, e.g. in robotics, and in either scenario interactions may be costly and/or dangerous. In principle, offline datasets only need to be collected once, which alleviates the aforementioned shortcomings of costly online interactions. Offline datasets are typically collected using behavioral policies for the specific task, ranging from random or near-optimal policies to human demonstrations. In particular, being able to leverage the latter is a major advantage of offline RL over online approaches; the learned policies can then be deployed or fine-tuned on the desired environment. Offline RL has successfully been applied to learn agents that outperform the behavioral policy used to collect the data (Kumar et al., 2020; Wu et al., 2019; Agarwal et al., 2020a; Ernst et al., 2005). However, such algorithms admit major shortcomings with regard to overfitting and overestimating the true state-action values outside the data distribution. One solution was recently proposed by Sinha et al. (2021), who tested several data augmentation schemes to improve the performance and generalization capabilities of the learned policies.
However, despite the recent progress, learning from offline demonstrations is a tedious endeavour, as the dataset typically does not cover the full state-action space. Moreover, offline RL algorithms by definition do not admit the possibility of further environment exploration to refine their distributions towards an optimal policy. It was argued previously that it is practically impossible for an offline RL agent to learn an optimal policy, as generalization to data near the dataset distribution generically leads to compounding errors such as overestimation bias (Kumar et al., 2020). In this paper, we look at offline RL through the lens of Koopman spectral theory, in which nonlinear dynamics are represented in terms of a linear operator acting on the space of measurement functions of the system. Through this representation, symmetries of the dynamics may be inferred directly and can then be used to guide data augmentation strategies, see Figure 1. We further provide theoretical results on the existence and nature of symmetries relevant for control systems such as reinforcement learning. More specifically, we apply Koopman spectral theory by first learning symmetries of the system's underlying dynamics in a self-supervised fashion from the static dataset, and second, employing the latter to extend the offline dataset at training time with out-of-distribution samples. As these reflect the system's dynamics, the additional data is to be interpreted as an exploration of the environment's phase space.
Some prior works have explored symmetry of the state-action space in the context of Markov Decision Processes (MDPs) (Higgins et al., 2018; Balaraman and Andrew, 2004; van der Pol et al., 2020), since many control tasks exhibit apparent symmetries, e.g. the classic cartpole task, which is symmetric about the y-axis. However, the paradigm we introduce in this work is of a different nature entirely. The distinction is twofold: first, the symmetries are learned in a self-supervised way and are in general not apparent to the developer; second, we are concerned with symmetry transformations of state tuples which leave the action invariant, inferred from the dynamics inherited from the behavioral policy of the underlying offline data. In other words, we seek to derive a neighbourhood around an MDP tuple in the offline dataset in which the behavioral policy is likely to choose the same action, based on its dynamics in the environment. In practice the Koopman latent space representation is learned in a self-supervised manner by training to predict the next state using a VAE model (Kingma and Welling, 2013). To summarize, in this paper we propose Koopman Forward (Conservative) Q-learning (KFC): a model-free Q-learning algorithm which uses the symmetries in the dynamics of the environment to guide data augmentation strategies. We also provide thorough theoretical justifications for KFC. Finally, we empirically test our approach on several challenging benchmark datasets from D4RL (Fu et al., 2021), MetaWorld (Yu et al., 2019) and Robosuite (Zhu et al., 2020) and find that by using KFC we can improve the state of the art on most benchmark offline reinforcement learning tasks.
2 Preliminaries and background
2.1 Offline RL & Conservative Q-learning
Reinforcement learning algorithms train policies to maximize the cumulative reward received by an agent interacting with an environment. Formally, the setting is given by a Markov decision process $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, and the transition density function $p(s_{t+1}|s_t, a_t)$ from the current state and action to the next state. Moreover, $\gamma \in (0,1)$ is the discount factor and $r(s,a)$ the reward function. At any discrete time $t$ the agent chooses an action $a_t$ according to its underlying policy $\pi_\theta(a_t|s_t)$ based on the information of the current state $s_t$, where the policy is parametrized by $\theta$. We focus on Actor-Critic methods for continuous control tasks in the following. In deep RL the parameters $\theta$ are the weights in a deep neural network function approximation of the policy, or Actor, as well as of the state-action value function, or Critic, respectively, and are optimized by gradient descent. The agent, i.e. the Actor-Critic, is trained to maximize the expected discounted cumulative reward $J = \mathbb{E}_{\pi}\big[\sum_{t} \gamma^{t}\, r(s_t, a_t)\big]$ with respect to the policy network, i.e. its parameters. For notational simplicity we omit the explicit dependency on the latter in the remainder of this work. Furthermore, the state-action value function $Q(s,a)$ returns the value of performing a given action $a$ while being in the state $s$. The Q-function is trained by minimizing the so-called Bellman error as

$$\hat{Q}^{k+1} \leftarrow \arg\min_Q\ \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(r(s,a) + \gamma\,\mathbb{E}_{a'\sim\pi(\cdot|s')}\big[\hat{Q}^k(s',a')\big] - Q(s,a)\big)^2\Big] \quad (1)$$
This is commonly referred to as the policy evaluation step, where the hat denotes the target Q-function. In offline RL one aims to learn an optimal policy from the given dataset $\mathcal{D}$, as the option of exploring the MDP is not available. The policy is optimized to maximize the state-action value function via the policy improvement step
$$\pi \leftarrow \arg\max_{\pi}\ \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi(\cdot|s)}\big[Q(s,a)\big] \quad (2)$$
Note that behavioural policies, including suboptimal or randomized ones, may be used to generate the static dataset $\mathcal{D}$. In that case offline RL algorithms face difficulties in the learning process.
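As a concrete illustration, the policy evaluation step in Eq. (1) can be sketched in a tabular toy setting; the dataset, policy, and target Q-values below are illustrative stand-ins, not part of the paper's setup:

```python
import numpy as np

# Minimal tabular sketch of the policy evaluation step (Eq. 1):
# regress Q(s, a) towards the one-step Bellman target built from a
# frozen *target* Q-function (the "hat" Q). The data is a toy
# stand-in for an offline dataset D of (s, a, r, s') tuples.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99

Q = np.zeros((n_states, n_actions))                 # Q-function being trained
Q_target = rng.normal(size=(n_states, n_actions))   # frozen target Q-hat
policy = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform pi

# Toy offline dataset: (state, action, reward, next_state)
dataset = [(0, 1, 1.0, 2), (2, 0, 0.0, 3), (3, 1, -1.0, 0)]

for _ in range(200):
    for s, a, r, s_next in dataset:
        # Bellman target: r + gamma * E_{a'~pi}[Q_hat(s', a')]
        target = r + gamma * policy[s_next] @ Q_target[s_next]
        # Gradient step on the squared Bellman error
        Q[s, a] += 0.1 * (target - Q[s, a])
```

After many sweeps, Q matches the Bellman targets on the dataset tuples; in deep RL the tabular update is replaced by gradient descent on network weights.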
CQL algorithm:
CQL is built on top of the Soft Actor-Critic (SAC) algorithm (Haarnoja et al., 2018), which employs soft policy iteration of a stochastic policy (Haarnoja et al., 2017). A policy entropy regularization term is added to the policy improvement step in Eq. (2) as

$$\pi \leftarrow \arg\max_{\pi}\ \mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi(\cdot|s)}\big[Q(s,a) - \alpha\,\log\pi(a|s)\big] \quad (3)$$
where $\alpha$ either is a fixed hyperparameter or may be chosen to be trainable. CQL reduces the overestimation of state-action values, in particular those out of distribution with respect to $\mathcal{D}$. It achieves this by regularizing the Q-function in Eq. (1) with a term minimizing its values over out-of-distribution, randomly sampled actions as

$$\hat{Q}^{k+1} \leftarrow \arg\min_Q\ \beta\,\Big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(\cdot|s)}\big[Q(s,a)\big] - \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big]\Big) + \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(r(s,a) + \gamma\,\mathbb{E}_{a'\sim\pi(\cdot|s')}\big[\hat{Q}^k(s',a')\big] - Q(s,a)\big)^2\Big] \quad (4)$$
where $\mu$ is given by the prediction of the policy $\pi$ and $\beta$ is a hyperparameter balancing the regularizer term.
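The conservative regularizer in Eq. (4) can be sketched as follows; the sampled Q-values and the `cql_penalty` helper are hypothetical stand-ins for illustration:

```python
import numpy as np

# Sketch of the CQL regularizer added to the Bellman error (Eq. 4):
# Q-values of actions sampled from the current policy (potentially
# out-of-distribution) are pushed down, while Q-values of dataset
# actions are pushed up. Shapes and data are illustrative only.
rng = np.random.default_rng(1)

def cql_penalty(q_policy_actions, q_dataset_actions, beta=1.0):
    """beta * (E[Q(s, a~pi)] - E[Q(s, a~D)]) -- the conservative term."""
    return beta * (q_policy_actions.mean() - q_dataset_actions.mean())

q_pi = rng.normal(loc=2.0, size=32)    # Q at policy-sampled actions (inflated)
q_data = rng.normal(loc=1.0, size=32)  # Q at dataset actions

penalty = cql_penalty(q_pi, q_data, beta=5.0)
# Overestimated out-of-distribution Q-values yield a positive penalty,
# so minimizing (Bellman error + penalty) lowers them.
```

The penalty vanishes when policy actions score no higher than dataset actions, which is exactly the conservative behaviour CQL targets.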
2.2 Koopman theory
Historically, the Koopman theoretic perspective on dynamical systems was introduced to describe the evolution of measurements of Hamiltonian systems (Koopman, 1931; Mezić, 2005). The underlying dynamics of most modern reinforcement learning tasks are of nonlinear nature, i.e. the agent's actions lead to changes of its state described by a complex nonlinear dynamical system. In contrast to linear systems, which are completely characterized by their spectral decomposition, nonlinear systems lack such a unified characterisation. The Koopman operator theoretic framework describes nonlinear dynamics via a linear, infinite-dimensional Koopman operator and thus inherits certain tools applicable to linear control systems (Mauroy et al., 2020; Kaiser et al., 2021)
. In practice one aims to find a finite-dimensional representation of the Koopman operator, which is equivalent to obtaining a coordinate transformation in which the nonlinear dynamics are approximately linear. A general non-affine control system is governed by a system of nonlinear ordinary differential equations (ODEs) as
$$\dot{x} = f(x, u) \quad (5)$$
where $x \in \mathcal{X} \subseteq \mathbb{R}^n$ is the n-dimensional state vector and $u \in \mathcal{U} \subseteq \mathbb{R}^m$ the m-dimensional action vector, with $\mathcal{X} \times \mathcal{U}$ the state-action space. Moreover, $\dot{x}$ is the time derivative, and $f$ is some general nonlinear, at least differentiable, vector-valued function. For a discrete-time system, Eq. (5) takes the form
$$x_{t+1} = F(x_t, u_t) \quad (6)$$
where $x_t$ denotes the state at time $t$ and $F$ is an at least differentiable vector-valued function.
Definition 1 (Koopman operator)
Let $\mathcal{F}$ be the (Banach) space of all measurement functions (observables) $g: \mathcal{X} \to \mathbb{R}$. Then the Koopman operator $\mathcal{K}: \mathcal{F} \to \mathcal{F}$ is defined by

$$\mathcal{K}\,g(x_t) = g\big(F(x_t, u_t)\big) = g(x_{t+1}) \quad (7)$$

where $g \in \mathcal{F}$.
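A classic toy example illustrates the definition: a nonlinear map becomes exactly linear in a hand-picked finite set of observables. The system and observables below are a standard textbook choice, not taken from this paper:

```python
import numpy as np

# Toy example of a nonlinear system that becomes linear in a lifted
# space of observables g(x) = (x1, x2, x1^2). The discrete dynamics
# F(x1, x2) = (a*x1, b*x2 + c*x1^2) are nonlinear in x, but a
# finite-dimensional Koopman matrix K advances g(x_t) exactly:
# g(x_{t+1}) = K @ g(x_t).
a, b, c = 0.9, 0.5, 1.0

def F(x):
    x1, x2 = x
    return np.array([a * x1, b * x2 + c * x1 ** 2])

def g(x):
    x1, x2 = x
    return np.array([x1, x2, x1 ** 2])

# x1 -> a*x1, x2 -> b*x2 + c*x1^2, x1^2 -> a^2 * x1^2
K = np.array([[a,   0.0, 0.0],
              [0.0, b,   c  ],
              [0.0, 0.0, a * a]])

x = np.array([2.0, -1.0])
# Linear evolution of the observables matches the nonlinear trajectory.
print(np.allclose(K @ g(x), g(F(x))))  # True
```

In general the lifted space is infinite-dimensional; learning a good finite set of observables is exactly what the encoder in later sections is for.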
Many systems can be modeled by a bilinearisation where the action enters the controlling equations (5) linearly, as $f(x, u) = f_0(x) + u\,f_1(x)$ for functions $f_0, f_1$. In that case the action of the Koopman operator takes the simple form

$$\mathcal{K}(u) = \mathcal{A} + u\,\mathcal{B} \quad (8)$$
where $\mathcal{A}$ and $\mathcal{B}$ decompose the Koopman operator into the free and the forcing term, respectively. Associated with a Koopman operator is its eigenspectrum, that is, the eigenvalues $\lambda_i$ and the corresponding eigenfunctions $\varphi_i$, such that $\mathcal{K}\varphi_i = \lambda_i\varphi_i$. In practice one derives a finite set of observables, in which case the approximation to the Koopman operator admits a finite-dimensional matrix representation $K$. The matrix $K$ representing the Koopman operator may be diagonalized by a matrix $P$ containing the eigenvectors of $K$ as columns. In that case the eigenfunctions are derived by $\varphi = P^{-1} g$, and one infers from Eq. (5) that $\dot{\varphi}_i = \lambda_i\varphi_i$ with eigenvalues $\lambda_i$. These ODEs admit simple solutions for their time evolution, namely exponential functions, as expected for a linear system.

3 The Koopman forward framework
In this section we introduce the Koopman forward framework. We initiate our discussion with a focus on symmetries in dynamical control systems in Subsection 3.1 where we additionally present some theoretical results. We then proceed in Subsection 3.2 by presenting the Koopman forward framework for Qlearning based on CQL.
Moreover, we discuss the algorithm as well as the Koopman Forward model’s deep neural network architecture.
Overview: Briefly, our theoretical results culminate in Theorems 3.6 and 3.7, which provide a roadmap for how specific symmetries of the dynamical control system are to be inferred from a VAE forward prediction model.
In particular, Theorem 3.6 guarantees that the procedure leads to true symmetries of the system, and Theorem 3.7 that actual new data points can be derived by applying symmetry transformations to existing ones.
The practical limitation is that the VAE model, parameterized by a neural network, is trained on data collected by one of many possible behavior policies, which amounts to learning the approximate dynamics of the real system.
The theoretical limitations are twofold: first, the theorems only hold for dynamical systems with differentiable state transitions;
second, we employ a bilinearisation Ansatz for the Koopman operator of the system.
In practice, many RL environments incorporate dynamics with discontinuous "contact" events, where the bilinearisation Ansatz may not be applicable.
However, empirically we find that our approach nevertheless is successful for such RL environments
and such “contact” events do not affect the performance significantly (see Appendix C.1).
3.1 Symmetries of dynamical control systems
Let us start by introducing symmetries in the simpler context of dynamical systems without control (Sinha et al., 2020), given by $\dot{x} = f(x)$, where we use notation analogous to Eq. (5).
Definition 2 (Equivariant Dynamical System)
Consider the dynamical system $\dot{x} = f(x)$ and let $G$ be a group acting on the state space $\mathcal{X}$. Then the system is called equivariant if $f(\sigma x) = \sigma f(x)$ for all $\sigma \in G$. For a discrete-time dynamical system one defines equivariance analogously, namely via $F(\sigma x) = \sigma F(x)$.
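A minimal numerical check of Definition 2, with a toy system chosen for illustration:

```python
import numpy as np

# Toy check of Definition 2: the cubic map F(x) = x - x**3 is
# equivariant under the reflection group {+1, -1} acting on the state,
# since F(-x) = -F(x) (the map is odd).
def F(x):
    return x - x ** 3

x = np.linspace(-2.0, 2.0, 101)
print(np.allclose(F(-x), -F(x)))  # True
```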
Lemma 3.1
The map $\hat\sigma: \mathcal{F} \to \mathcal{F}$ given by $\hat\sigma g = g \circ \sigma$ defines a group action on the Koopman space of observables $\mathcal{F}$.
Theorem 3.2
Let $\mathcal{K}$ be the Koopman operator associated with an equivariant system $\dot{x} = f(x)$. Then

$$\mathcal{K}\,\hat\sigma = \hat\sigma\,\mathcal{K} \quad (9)$$
Theorem 3.2 states that for an equivariant system any symmetry transformation commutes with the Koopman operator. For the proof, see (Sinha et al., 2020). Let us next turn to the case relevant for RL, namely control systems. In the remainder of this section we focus on dynamical systems given as in Eq. (6) and Eq. (8).
Definition 3 (Action Equivariant Dynamical System)
Let $G$ be a group acting on the state-action space $\mathcal{X} \times \mathcal{U}$ of a general control system as in Eq. (6), such that it acts as the identity operation on $\mathcal{U}$, i.e. $\sigma(x, u) = (\sigma x, u)$. Then the system is called action-equivariant if

$$F(\sigma x_t, u_t) = \sigma F(x_t, u_t) \quad (10)$$
In particular, it is easy to see that the bilinearisation in Eq. (8) is action-equivariant if both $f_0$ and $f_1$ are equivariant, i.e. $f_{0,1}(\sigma x) = \sigma f_{0,1}(x)$.
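A toy check of action-equivariance for a control-affine system; the functions $f_0$, $f_1$ are illustrative choices:

```python
import numpy as np

# Toy check of action-equivariance (Definition 3) for a control-affine
# system x' = f0(x) + u*f1(x) with f0(x) = -x**3 and f1(x) = x. Both
# are odd functions, so the reflection x -> -x acts as the identity on
# the action u and commutes with the dynamics.
def f(x, u):
    return -x ** 3 + u * x

x = np.linspace(-2.0, 2.0, 101)
for u in (-1.0, 0.0, 0.7):
    # f(sigma x, u) == sigma f(x, u) with sigma = -1, for every action u
    assert np.allclose(f(-x, u), -f(x, u))
print("action-equivariant under x -> -x")
```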
Lemma 3.3
The map $\hat\sigma: \mathcal{F} \to \mathcal{F}$ given by $\hat\sigma g = g \circ \sigma$ defines a group action on the Koopman space of observables $\mathcal{F}$.
Theorem 3.4
Let $\mathcal{K}(u)$ be the Koopman operator associated with an action-equivariant system as in Eq. (6). Then

$$\mathcal{K}(u)\,\hat\sigma = \hat\sigma\,\mathcal{K}(u) \quad (11)$$
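In the finite-dimensional matrix approximation used later, commutation relations of this kind can be verified numerically. A sketch, assuming a toy symmetric Koopman matrix so that its eigendecomposition stays real (the general complex-valued case needs more care):

```python
import numpy as np

# Sketch: a transformation Sigma that commutes with a finite-dimensional
# Koopman matrix K can be built from the eigenvectors of K. K here is a
# random symmetric stand-in, not a learned operator; generically the
# eigenvalues/eigenvectors are complex and keeping Sigma real requires
# extra structure.
rng = np.random.default_rng(2)
N = 5
M = rng.normal(size=(N, N))
K = M + M.T                      # toy diagonalizable "Koopman" matrix

eigvals, P = np.linalg.eigh(K)   # columns of P are eigenvectors of K

def symmetry(omega):
    """Sigma(omega) = P diag(omega) P^{-1}: commutes with K by construction."""
    return P @ np.diag(omega) @ np.linalg.inv(P)

omega = rng.normal(size=N)       # arbitrary control parameters
Sigma = symmetry(omega)

# The commutator Sigma K - K Sigma vanishes for any draw of omega.
print(np.allclose(Sigma @ K - K @ Sigma, 0.0, atol=1e-8))  # True
```

The same construction reappears in Section 3.2 as case (II) of the symmetry generators.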
Neither Theorem 3.2 nor 3.4 serves as a theoretical foundation for the control setup aimed for in the next section. Let us thus turn our focus to the relevant case of a control system which admits a Koopman operator description as

$$g(x_{t+1}) = \mathcal{K}(u_t)\,g(x_t) \quad (12)$$

where $\mathcal{K}(u)$ is a family of operators with analytical dependence on $u$. Note that the bilinearisation in Eq. (8) is a special case of Eq. (12). Furthermore, let $T(u)$ be a family of invertible operators s.t. $T(u)^{-1}: \mathcal{F} \to \Phi$ is a mapping to the (Banach) space of eigenfunctions $\Phi$, which moreover obeys $T(u)^{-1}\,\mathcal{K}(u)\,T(u) = \Lambda(u)$, with $\Lambda(u)$ diagonal and where $\Lambda(u)$ contains the eigenvalues $\lambda_i(u)$. The existence of such operators puts further restrictions on the Koopman operator in Eq. (12) (see Appendix A for the details). As we are in the end concerned with the finite-dimensional approximation of the Koopman operator, i.e. its matrix representation, this amounts simply to the matrix being diagonalisable, with $T$ the matrix containing its eigenvectors as columns.
Lemma 3.5
The map $\tilde\sigma: \mathcal{F} \to \mathcal{F}$ given by

$$\tilde\sigma = T\,\hat\sigma\,T^{-1} \quad (13)$$

defines a group action on the Koopman space of observables $\mathcal{F}$, where $\hat\sigma$ is defined analogously to Lemma 3.3 but acting on $\Phi$ instead of $\mathcal{F}$, i.e. on the eigenfunctions.
We refer to a system as in Eq. (12) admitting a symmetry as in Lemma 3.5 as an action-equivariant control system in the following.
Theorem 3.6
Let us next build on the core of Theorem 3.6 to provide theoretical statements on how data points may be shifted by symmetry transformations. For that, let us change our notation to establish an easier connection to the algorithmic setting. Let $E$ and $D$ denote the encoder and decoder to and from the Koopman space approximation, respectively, i.e. $D \circ E = \mathrm{id}$.
Theorem 3.7
Let the system be an action-equivariant control system as in Theorem 3.6. Then one can use a symmetry transformation $\Sigma$ to shift both the state $s_t$ as well as the next state $s_{t+1}$:

$$\tilde{s}_t = D\big(\Sigma\,E(s_t)\big) \quad (16)$$
$$\tilde{s}_{t+1} = D\big(\Sigma\,E(s_{t+1})\big) \quad (17)$$
One may account for practical limitations, i.e. an error $\epsilon$, by relaxing the assumption to $D \circ E = \mathrm{id} + \epsilon$. One then finds that the symmetry-shifted data points are correct up to a term of order $\epsilon$; thus the error becomes suppressed when $\epsilon$ is small. (The error may be due to practical limitations of capturing the true dynamics as well as the symmetry map.) Moreover, note that the equivalent theorem holds when $E \circ D = \mathrm{id} + \epsilon$.
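A minimal sketch of the shifts in Eqs. (16) and (17), assuming a toy linear encoder with an exact inverse and a generic near-identity matrix standing in for a symmetry transformation; neither is derived from an actual Koopman commutation condition here:

```python
import numpy as np

# Sketch of the data shift in Theorem 3.7 with an exactly invertible
# toy encoder/decoder (D = E^{-1}) and a matrix Sigma acting in the
# latent space: both s_t and s_{t+1} are shifted consistently, so no
# forecast model is needed to produce the augmented next state.
rng = np.random.default_rng(4)
N = 4
E_mat = rng.normal(size=(N, N))     # toy linear encoder
D_mat = np.linalg.inv(E_mat)        # exact decoder: D(E(s)) = s

E = lambda s: E_mat @ s
D = lambda z: D_mat @ z

# Near-identity stand-in for a symmetry transformation (small shift).
Sigma = np.eye(N) + 0.05 * rng.normal(size=(N, N))

s, s_next = rng.normal(size=N), rng.normal(size=N)
s_aug = D(Sigma @ E(s))             # Eq. (16)
s_next_aug = D(Sigma @ E(s_next))   # Eq. (17)

# With an exact inverse, decoding undoes encoding; the augmented pair
# differs from the original only by the latent-space shift.
print(np.allclose(D(E(s)), s))  # True
```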
3.2 KFC algorithm
In the previous section we laid the theoretical foundation for the KFC algorithm by providing a roadmap on how to derive symmetries of dynamical control systems based on a Koopman latent space representation. The goal is to generate new data points for the RL algorithm at training time via Eq. (16) in Theorem 3.7. The reward is not part of the symmetry shift process and simply remains unchanged by assumption.
On the power of using symmetries: Let us emphasize the practical advantage of employing symmetries to generate augmented data points. It is evident from Theorem 3.7 that a symmetry transformation shifts both the state as well as the next state, which evades the necessity of forecasting states. Thus the use of an inaccurate forecast model is avoided and the accuracy and generalisation capabilities of the VAE are fully utilized. The magnitude of the induced shift is controlled by a parameter chosen such that out-of-distribution generalisation errors remain limited (for the details of the errors, refer to Appendix A).
Let us next discuss the technical details of incorporating the symmetry shifts into a specific Q-learning algorithm. In practice the Koopman latent space representation is N-dimensional, i.e. finite. One uses the matrix representations of the Koopman operator and of the symmetry generators, together with matrix multiplication, instead of the formal operator mapping. Moreover, we use a normally distributed random variable for the control parameter of the shift. We base our analysis on the CQL algorithm; however, we expect that our approach of exploration should benefit a wider range of RL algorithms. Following in the footsteps of Sinha et al. (2021), our approach leaves the policy improvement step in Eq. (3) unchanged but modifies the policy evaluation step in Eq. (4) by applying the symmetry shift to the states entering the Bellman error:

$$\hat{Q}^{k+1} \leftarrow \arg\min_Q\ \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\Big[\big(r(s,a) + \gamma\,\mathbb{E}_{a'\sim\pi(\cdot|\tilde{s}')}\big[\hat{Q}^k(\tilde{s}',a')\big] - Q(\tilde{s},a)\big)^2\Big], \quad \tilde{s} = D\big(\Sigma(\omega)E(s)\big),\ \tilde{s}' = D\big(\Sigma(\omega)E(s')\big) \quad (18)$$
The state-space symmetry generating function $\Sigma(\omega)$ depends on normally distributed random variables $\omega$. Note that we only modify the Bellman error term in Eq. (18) and leave the CQL-specific regularizer untouched. We study two distinct cases, which differ on an algorithmic level (Eq. (18) holds for both cases, thus we have deliberately dropped the case subscript; moreover, note that in case (II) the symmetry transformations are constructed from the eigenvectors of the Koopman matrix):
$$\text{(I)}:\ \Sigma(\omega) = \mathbb{1} + \omega\,S, \qquad \text{(II)}:\ \Sigma(\omega) = P\,\mathrm{diag}(\omega_1, \ldots, \omega_N)\,P^{-1} \quad (19)$$
where $P$ diagonalizes the Koopman operator $K$ with eigenvalues $\lambda_i$, i.e. the latter can be expressed as $K = P\,\mathrm{diag}(\lambda_1,\ldots,\lambda_N)\,P^{-1}$. Moreover, $S$ is a fixed matrix representation of a symmetry transformation of the dynamical system in Koopman space. Note that we abuse our notation, as $E(s)$ directly provides the Koopman space observables. Following the guidelines provided by Theorem 3.6 for a symmetry acting on the state-action space of dynamical control systems as in Eq. (8), the symmetry generators are required to commute with the Koopman operator. Thus, on the one hand, for case (I) in Eq. (19) the symmetry generator matrix $S$ is obtained by solving the equation $SK = KS$, which may be accomplished by employing a Sylvester algorithm [31]. On the other hand, case (II) is solved by computing the eigenvectors of $K$, which then constitute the columns of $P$. Thus in particular one infers that

$$[\Sigma(\omega), K] = 0 \quad (20)$$

where $[\cdot\,,\cdot]$ denotes the commutator of two matrices. Eq. (20) is solved by construction of $\Sigma(\omega)$, which commutes with the Koopman operator for all values of the random variables. (Let us emphasize that the Koopman operator's eigenvalues and eigenvectors are generically complex-valued; however, the definition in Eq. (19) ensures that $\Sigma(\omega)$ is a real-valued matrix.) In conclusion, the advantage of case (I) is that it is computationally less expensive; however, it provides less freedom to explore different symmetry directions than case (II). Lastly, it proved useful empirically to train the RL algorithm not only on symmetry-shifted data as in Eq. (18
). We employ our data exploration shift only with a fixed probability; otherwise we use a shift of the state by a normally distributed random variable. The Koopman forward model: The KFC algorithm requires pre-training of a Koopman forward model, which is closely related to a VAE architecture, as
$$\hat{s}_{t+1} = D\big((A + u_t\,B)\,E(s_t)\big) \quad (21)$$
where both $E$ and $D$ are approximated by Multi-Layer Perceptrons (MLPs), and the bilinear Koopman-space operator approximations are implemented by a single fully connected layer for $A$ and $B$, respectively. The model is trained on batches of the offline dataset tuples $(s_t, u_t, s_{t+1})$ and optimized via an additive loss function of the VAE and the forward-prediction part of the model. We refer the reader to Appendix
B for details.

Table 1: Normalized returns on the D4RL benchmark suite.

Domain  | Task Name              | CQL   | S4RL  | S4RL-adv | KFC   | KFC++
AntMaze | antmaze-umaze          | 74.0  | 91.3  | 94.1     | 96.9  | 99.8
        | antmaze-umaze-diverse  | 84.0  | 87.8  | 88.0     | 91.2  | 91.1
        | antmaze-medium-play    | 61.2  | 61.9  | 61.6     | 60.0  | 63.1
        | antmaze-medium-diverse | 53.7  | 78.1  | 82.3     | 87.1  | 90.5
        | antmaze-large-play     | 15.8  | 24.4  | 25.1     | 24.8  | 25.6
        | antmaze-large-diverse  | 14.9  | 27.0  | 26.2     | 33.1  | 34.0
Gym     | cheetah-random         | 35.4  | 52.3  | 53.9     | 48.6  | 49.2
        | cheetah-medium         | 44.4  | 48.8  | 48.6     | 55.9  | 59.1
        | cheetah-medium-replay  | 42.0  | 51.4  | 51.7     | 58.1  | 58.9
        | cheetah-medium-expert  | 62.4  | 79.0  | 78.1     | 79.9  | 79.8
        | hopper-random          | 10.8  | 10.8  | 10.7     | 10.4  | 10.7
        | hopper-medium          | 58.0  | 78.9  | 81.3     | 90.6  | 94.2
        | hopper-medium-replay   | 29.5  | 35.4  | 36.8     | 48.6  | 49.0
        | hopper-medium-expert   | 111.0 | 113.5 | 117.9    | 121.0 | 125.5
        | walker-random          | 7.0   | 24.9  | 25.1     | 19.1  | 17.6
        | walker-medium          | 79.2  | 93.6  | 93.1     | 102.1 | 108.0
        | walker-medium-replay   | 21.1  | 30.3  | 35.0     | 48.0  | 46.1
        | walker-medium-expert   | 98.7  | 112.2 | 107.1    | 114.0 | 115.3
Adroit  | pen-human              | 37.5  | 44.4  | 51.2     | 61.3  | 60.0
        | pen-cloned             | 39.2  | 57.1  | 58.2     | 71.3  | 68.4
        | hammer-human           | 4.4   | 5.9   | 6.3      | 7.0   | 9.4
        | hammer-cloned          | 2.1   | 2.7   | 2.9      | 3.0   | 4.2
        | door-human             | 9.9   | 27.0  | 35.3     | 44.1  | 46.1
        | door-cloned            | 0.4   | 2.1   | 0.8      | 3.6   | 5.6
        | relocate-human         | 0.2   | 0.2   | 0.2      | 0.2   | 0.2
        | relocate-cloned        | 0.1   | 0.1   | 0.1      | 0.1   | 0.1
Franka  | kitchen-complete       | 43.8  | 77.1  | 88.1     | 94.1  | 94.9
        | kitchen-partial        | 49.8  | 74.8  | 83.6     | 92.3  | 95.9
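Returning to the Koopman forward model of Eq. (21): a minimal numpy sketch of its forward pass and additive loss, with linear/tanh stand-ins for the paper's MLPs (all shapes and layers are illustrative assumptions, not the exact architecture):

```python
import numpy as np

# Sketch of the Koopman forward model (Eq. 21). The encoder E and
# decoder D stand in for MLPs; A and B are the bilinear Koopman-space
# operators. The training loss is the additive combination of a
# reconstruction term and a forward-prediction term.
rng = np.random.default_rng(3)
state_dim, latent_dim = 3, 8

W_enc = rng.normal(scale=0.1, size=(latent_dim, state_dim))
W_dec = rng.normal(scale=0.1, size=(state_dim, latent_dim))
A = rng.normal(scale=0.1, size=(latent_dim, latent_dim))
B = rng.normal(scale=0.1, size=(latent_dim, latent_dim))

def encode(s):
    return np.tanh(W_enc @ s)

def decode(z):
    return W_dec @ z

def forward(s, u):
    """Predict s_{t+1} = D((A + u*B) E(s_t)) for a scalar action u."""
    return decode((A + u * B) @ encode(s))

# One toy offline tuple (s_t, u_t, s_{t+1}); in training these come in
# batches from the dataset D.
s, u, s_next = rng.normal(size=state_dim), 0.5, rng.normal(size=state_dim)

recon_loss = np.sum((decode(encode(s)) - s) ** 2)   # VAE-style term
pred_loss = np.sum((forward(s, u) - s_next) ** 2)   # forward-prediction term
loss = recon_loss + pred_loss
```

In the paper this model is pre-trained once on the static dataset; the trained encoder/decoder and the operators $A$, $B$ are then frozen for the symmetry-shift augmentation.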
4 Empirical evaluation
In this section, we first experiment with the popular D4RL benchmark commonly used for offline RL (Fu et al., 2021). The benchmark covers various tasks such as locomotion tasks with MuJoCo Gym (Brockman et al., 2016), tasks that require hierarchical planning such as antmaze, and other robotics tasks such as kitchen and adroit (Rajeswaran et al., 2017). Furthermore, similar to S4RL (Sinha et al., 2021), we perform experiments on 6 different challenging robotics tasks from MetaWorld (Yu et al., 2019) and RoboSuite (Zhu et al., 2020). We compare KFC to the baseline CQL algorithm (Kumar et al., 2020) and to the two best performing augmentation variants from S4RL, S4RL and S4RL-adv (Sinha et al., 2021). We use the exact same hyperparameters as proposed in the respective papers. Furthermore, similar to S4RL, we build KFC on top of CQL (Kumar et al., 2020) to ensure conservative Q-estimates during policy evaluation.
4.1 D4RL benchmarks
We present results on the D4RL benchmark test suite and report the normalized return in Table 1. We see that both KFC and KFC++ consistently outperform both the baseline CQL and S4RL across multiple tasks and data distributions. Outperforming S4RL and S4RL-adv on various types of environments suggests that KFC and KFC++ fundamentally improve upon the data augmentation strategies discussed in S4RL. KFC variants also improve the performance of learned agents on challenging environments such as antmaze, which requires hierarchical planning, and the kitchen and adroit tasks, which have sparse rewards and large action spaces. Similarly, KFC variants also perform well on difficult data distributions such as "medium-replay", which is collected by simply using all the data that the policy encountered while training the base SAC policy, and "human", which is collected using human demonstrations on robotic tasks and results in a non-Markovian behaviour policy (more details can be found in the D4RL manuscript (Fu et al., 2021)). Furthermore, to our knowledge, the results for KFC++ are state of the art in policy learning from D4RL datasets for most environments and data distributions.
We do note that S4RL outperforms KFC on the "random" splits of the data distributions, which is expected, as KFC depends on learning a simple dynamics model of the data to guide the data augmentation strategy. Since the "random" split consists of random actions in the environment, no useful dynamics model can be learned from it.
4.2 Metaworld and Robosuite benchmarks
To further test the ability of KFC, we perform additional experiments on challenging robotic tasks. Following (Sinha et al., 2021), we perform additional experiments with 4 MetaWorld environments (Yu et al., 2019) and 2 RoboSuite environments (Zhu et al., 2020). We followed the same method to collect the data as described in Appendix F of S4RL (Sinha et al., 2021), and report the mean percent of goals reached, where the condition of reaching the goal is defined by the environment.
We report the results in Figure 2, where we see that by using KFC to guide the data augmentation strategy for a base CQL agent, we are able to learn an agent that performs significantly better. Furthermore, we see that for more challenging tasks such as "push" and "door-close" in MetaWorld, KFC++ outperforms the base CQL algorithm and the S4RL agent by a significant margin. This set of experiments further highlights the ability of KFC to guide the data augmentation strategy.
5 Related works
The use of data augmentation techniques in Q-learning has been discussed recently (Laskin et al., 2020a,b; Sinha et al., 2021). In particular, our work shares strong parallels with (Sinha et al., 2021). Our modification of the policy evaluation step of the CQL algorithm (Kumar et al., 2020) is analogous to the one in Sinha et al. (2021). However, the latter randomly augments the data, while our augmentation framework is based on symmetry state shifts. Regarding the connection to world models (Ha and Schmidhuber, 2018): there, a VAE is used to encode the state information while a separate recurrent neural network predicts future states. Their latent representation is not of Koopman type, and no symmetries or data augmentations are derived.
Algebraic symmetries of the state-action space in Markov Decision Processes (MDPs) originate in (Balaraman and Andrew, 2004) and were discussed recently in the context of RL in (van der Pol et al., 2020). Their goal is to preserve the essential algebraic homomorphism symmetry structure of the original MDP while finding a more compact representation. The symmetry maps considered in our work are more general and are utilized in a different way. Symmetry-based representation learning (Higgins et al., 2018) refers to the study of symmetries of the environment manifested in the latent representation. The symmetries in our case are derived from the Koopman operator, not the latent representation directly. In (Caselles-Dupré et al., 2019) the authors discuss representation learning of symmetries (Higgins et al., 2018) allowing for interactions with the environment. A Forward-VAE model which is similar to our Koopman-Forward VAE model is employed. However, our approach is based on theoretical results providing a roadmap to derive explicit symmetries of the dynamical system as well as their utilisation for state shifts.
In (Sinha et al., 2020) the authors extend the Koopman operator from a local to a global description using symmetries of the dynamics. They do not discuss action-equivariant dynamical control systems or data augmentation. In (Salova et al., 2019) the imprint of known symmetries on the block-diagonal Koopman space representation for non-control dynamical systems is discussed. This is close to the spirit of disentanglement (Higgins et al., 2018). Our results concern control setups and the derivation of symmetries. On another front, the application of Koopman theory in control or reinforcement learning has also been discussed recently. For example, Li et al. (2020) propose to use compositional Koopman operators with graph neural networks to learn dynamics that can quickly adapt to new environments with unknown physical parameters and produce control signals to achieve a specified goal. Kaiser et al. (2021) discuss the use of Koopman eigenfunctions as a transformation of the state into a globally linear space where classical control techniques are applicable. To the best of our knowledge, this paper is the first to discuss a Koopman latent space for data augmentation.
6 Conclusions
In this work we proposed a symmetry-based data augmentation technique derived from a Koopman latent space representation. It enables a meaningful extension of offline RL datasets describing dynamical systems, i.e. further "exploration" without additional environment interactions. The approach is based on our theoretical results on symmetries of dynamical control systems and symmetry shifts of data. Both hold for systems with differentiable state transitions and with a bilinearisation Ansatz for the Koopman operator. However, the empirical results show that the framework is successfully applicable beyond those limitations. We empirically evaluated our method on several benchmark offline reinforcement learning task suites (D4RL, MetaWorld and RoboSuite) and find that by using our framework we consistently improve the state of the art of Q-learning algorithms.
Acknowledgments
This work was supported by JSPS KAKENHI (Grant Number JP18H03287), and JST CREST (Grant Number JPMJCR1913). We would like to thank Y. Nishimura for technical support.
References

- An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning.
- An optimistic perspective on offline reinforcement learning. arXiv:1907.04543.
- Approximate homomorphisms: a framework for non-exact minimization in Markov decision processes. In International Conference on Knowledge Based Computer Systems.
- OpenAI Gym. arXiv:1606.01540.
- Modern Koopman theory for dynamical systems. arXiv:2102.12086.
- Symmetry-based disentangled representation learning requires interaction with environments. In Advances in Neural Information Processing Systems, Vol. 32.
- Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6 (18), pp. 503-556.
- D4RL: datasets for deep data-driven reinforcement learning. arXiv:2004.07219.
- Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, PMLR Vol. 97, pp. 2052-2062.
- Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, Vol. 31.
- Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, PMLR Vol. 70, pp. 1352-1361.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, PMLR Vol. 80, pp. 1861-1870.
- Towards a definition of disentangled representations. arXiv:1812.02230.
- Data-driven discovery of Koopman eigenfunctions for control. Machine Learning: Science and Technology 2, pp. 035023.
- Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of The 2nd Conference on Robot Learning, PMLR Vol. 87, pp. 651-673.
- Auto-encoding variational Bayes. arXiv:1312.6114.
- Hamiltonian systems and transformation in Hilbert space. Proceedings of the National Academy of Sciences of the United States of America 17 (5), pp. 315-318.
- Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 1179-1191.
- CURL: contrastive unsupervised representations for reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, PMLR Vol. 119, pp. 5639-5650.
- Reinforcement learning with augmented data. In Advances in Neural Information Processing Systems, Vol. 33, pp. 19884-19895.
- A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pp. 661-670.
- Learning compositional Koopman operators for model-based control. In Proceedings of the 8th International Conference on Learning Representations (ICLR'20).
- The Koopman operator in systems and control: concepts, methodologies, and applications. Springer.
- Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dynamics 41, pp. 309-325.
- Rectified linear units improve restricted Boltzmann machines. In ICML.
- PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, pp. 8026-8037.
- Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv:1709.10087.
- Koopman operator and its approximations for systems with symmetries. Chaos: An Interdisciplinary Journal of Nonlinear Science 29 (9), pp. 093128.
- S4RL: surprisingly simple self-supervision for offline reinforcement learning. In Conference on Robot Learning. arXiv:2103.06326.
- Koopman operator methods for global phase space exploration of equivariant dynamical systems. arXiv:2003.04870.
- [31] Sylvester algorithm, scipy.org.
- MDP homomorphic networks: group symmetries in reinforcement learning. In Advances in Neural Information Processing Systems, Vol. 33, pp. 4199-4210.
 Behavior regularized offline reinforcement learning. External Links: 1911.11361 Cited by: §1.
 Metaworld: a benchmark and evaluation for multitask and meta reinforcement learning. arXiv preprint arXiv:1910.10897. Cited by: §1, §4.2, §4.
 Robosuite: a modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293. Cited by: §1, §4.2, §4.
Appendix A Proofs
In this section we provide the proofs of the theoretical results of Section 3.
Proof of Lemma 3.1 and Theorem 3.2
The proofs of both the lemma and the theorem can be found in Sinha et al. (2020).
Proof of Lemma 3.3:
We aim to show that the map $\hat\rho: \Sigma \times \mathcal{F} \to \mathcal{F}$ given by $\hat\rho(w, g) = \mathcal{T}_w\, g$ defines a group action on the Koopman space of observables $\mathcal{F}$, where $\Sigma$ denotes the symmetry group of Definition 3. Firstly, let $g \in \mathcal{F}$ and let $e \in \Sigma$ be the identity element; then we see that it provides the existence of an identity element of the action by
$$\hat\rho(e, g) = \mathcal{T}_e\, g = g.$$
Secondly, let $w_1, w_2 \in \Sigma$ and let $\circ$ denote the group operation, i.e. $w_1 \circ w_2 \in \Sigma$. Then
$$\hat\rho\big(w_1, \hat\rho(w_2, g)\big) = \mathcal{T}_{w_1}\, \mathcal{T}_{w_2}\, g \overset{(I)}{=} \mathcal{T}_{w_1 \circ w_2}\, g = \hat\rho(w_1 \circ w_2, g),$$
where in (I) we have used the invertibility and the group property of $\Sigma$. Lastly, it follows analogously for $w^{-1} \in \Sigma$ that
$$\hat\rho\big(w^{-1}, \hat\rho(w, g)\big) = \mathcal{T}_{w^{-1}}\, \mathcal{T}_w\, g = \mathcal{T}_{w^{-1} \circ w}\, g = \mathcal{T}_e\, g = g.$$
Thus the existence of an inverse is established, which concludes the proof of the group property of $\hat\rho$.
Proof of Theorem 3.4:
We aim to show that, with $\mathcal{K}_{a_t}$ being the Koopman operator associated with a $\Sigma$ action-equivariant system $s_{t+1} = F(s_t, a_t)$,
$$\mathcal{K}_{a_t}\, \mathcal{T}_w = \mathcal{T}_w\, \mathcal{K}_{a_t} \quad \text{for all } w \in \Sigma.$$
First of all, note that by the definition of the Koopman operator of non-affine control systems it obeys
$$[\mathcal{K}_{a_t}\, g](s_t) = g\big(F(s_t, a_t)\big) = g(s_{t+1}).$$
Using the latter one thus infers that
$$[\mathcal{K}_{a_t}\, \mathcal{T}_w\, g](s_t) = [\mathcal{T}_w\, g]\big(F(s_t, a_t)\big) = g\big(h_w(F(s_t, a_t))\big) \overset{(I)}{=} g\big(F(h_w(s_t), a_t)\big),$$
where $[\mathcal{T}_w\, g](s) = g\big(h_w(s)\big)$ denotes the symmetry action with state-space map $h_w$, and in (I) we have used that it is a $\Sigma$ action-equivariant system. Moreover, one finds that
$$g\big(F(h_w(s_t), a_t)\big) = [\mathcal{K}_{a_t}\, g]\big(h_w(s_t)\big) = [\mathcal{T}_w\, \mathcal{K}_{a_t}\, g](s_t),$$
which concludes the proof.
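For a linear system the commutation relation above can be verified numerically. The following sketch is illustrative only (linear dynamics and a rotation symmetry chosen by hand, not taken from the paper): restricted to linear observables $g(s) = c^\top s$, the Koopman operator acts on the coefficient vector as a matrix, and equivariance of the dynamics translates into commuting matrices.

```python
import numpy as np

def rot(theta):
    """2-D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Linear dynamics s_{t+1} = A s_t with a candidate symmetry s -> W s.
A = rot(0.3)        # dynamics: rotation by 0.3 rad (illustrative)
W = rot(np.pi / 2)  # symmetry: rotation by 90 degrees (illustrative)

# Equivariance of the dynamics, F(W s) = W F(s), is equivalent to A W = W A.
assert np.allclose(A @ W, W @ A)

# On linear observables g(s) = c^T s the Koopman operator acts on the
# coefficient vector c as K = A^T, and the symmetry operator
# (T_w g)(s) = g(W s) acts as T = W^T; the two then commute.
K, T = A.T, W.T
assert np.allclose(K @ T, T @ K)
```

Planar rotations commute with one another, so any pair of angles works here; a symmetry that does not commute with the dynamics would fail the first assertion already.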
To proceed, let us recall some information from the main text of this work. Let $P$ be a family of invertible operators s.t. $P: \mathcal{F} \to \Phi$ is a mapping to the (Banach) space of eigenfunctions $\Phi$, which moreover obeys $\mathcal{K} = P^{-1} \Lambda P$, with $\Lambda\, \varphi_i = \lambda_i\, \varphi_i$ and where $\varphi_i \in \Phi$. The existence of such operators puts further restrictions on the Koopman operator in Eq. (12). However, as our algorithm employs the finite-dimensional approximation of the Koopman operator, i.e. its matrix representation, this amounts simply to the requirement that the matrix be diagonalizable, in which case $P^{-1}$ is the matrix containing its eigenvectors as columns. To develop a better understanding of the criteria required in the infinite-dimensional case we employ an alternative formulation of the so-called spectral theorem below.
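As a minimal illustration of the finite-dimensional criterion, the sketch below (the matrix values are hypothetical stand-ins for a Koopman matrix fit to data) checks diagonalizability with NumPy, writing $K = V \Lambda V^{-1}$ with the eigenvectors of $K$ as the columns of $V$:

```python
import numpy as np

# Stand-in for a learned finite-dimensional Koopman matrix (the values are
# hypothetical; in practice the matrix is fit to the offline dataset).
K = np.array([[0.9, 0.2],
              [0.1, 0.8]])

# Diagonalizability check: K = V @ diag(eigvals) @ V^{-1}, where the
# columns of V are the eigenvectors of K.
eigvals, V = np.linalg.eig(K)
K_reconstructed = V @ np.diag(eigvals) @ np.linalg.inv(V)
assert np.allclose(K, K_reconstructed)
```

A defective (non-diagonalizable) matrix would make `np.linalg.inv(V)` ill-conditioned and break the reconstruction, which is exactly the restriction discussed above.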
Theorem A.1 (Spectral Theorem)
Let $A$ be a bounded self-adjoint operator on a Hilbert space $\mathcal{H}$. Then there is a measure space $(X, \mu)$, a real-valued essentially bounded measurable function $f$ on $X$, and a unitary operator $U: \mathcal{H} \to L^2_\mu(X)$, i.e. $U^\dagger U = U U^\dagger = \mathbb{1}$, such that
$$U^\dagger\, T\, U = A, \qquad [T\varphi](x) = f(x)\,\varphi(x), \qquad (22)$$
where $T$ denotes the multiplication operator by $f$. In other words, every bounded self-adjoint operator is unitarily equivalent to a multiplication operator. In contrast to the finite-dimensional case we need to slightly alter our criteria to $\mathcal{K} = U^\dagger\, T\, U$, with the multiplication operator $T$ taking the role of $\Lambda$ and where the unitary $U$ takes the role of $P$. Concludingly, a sufficient condition for our criteria to hold in terms of operators on Hilbert spaces is that the Koopman operator is self-adjoint, i.e. that
$$\langle \mathcal{K} g_1,\, g_2 \rangle = \langle g_1,\, \mathcal{K} g_2 \rangle \quad \text{for all } g_1, g_2 \in \mathcal{F}. \qquad (23)$$
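A finite-dimensional analogue of Theorem A.1 can be checked directly: a real symmetric matrix is unitarily equivalent to a diagonal (multiplication) operator with real spectrum. A short sketch with an arbitrary illustrative matrix:

```python
import numpy as np

# Finite-dimensional analogue of the spectral theorem: a self-adjoint
# (here: real symmetric) matrix is unitarily equivalent to a diagonal
# "multiplication" operator with real spectrum. Values are illustrative.
K = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
assert np.allclose(K, K.T)  # self-adjointness

eigvals, U = np.linalg.eigh(K)       # U is orthogonal (unitary over R)
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(eigvals.imag, 0)  # the spectrum is real

# U diag(f) U^T reproduces K, mirroring Eq. (22).
assert np.allclose(U @ np.diag(eigvals) @ U.T, K)
```

Note that `numpy.linalg.eigh` assumes the symmetric part of its input; for a non-symmetric Koopman matrix one must fall back on `numpy.linalg.eig` and the weaker diagonalizability criterion.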
Proof of Lemma 3.5:
We aim to show that the map $\bar\rho: \Sigma \times \Phi \to \Phi$ given by
$$\bar\rho(w, \varphi) = \bar{\mathcal{T}}_w\, \varphi, \qquad \bar{\mathcal{T}}_w = P\, \mathcal{T}_w\, P^{-1},$$
defines a group action on the Koopman space of observables. Here $\bar{\mathcal{T}}_w$ is defined analogously to the action in Lemma 3.3 but acting on $\Phi$ instead of $\mathcal{F}$, by conjugation with $P$. First of all, note that
$$\bar{\mathcal{T}}_{w_1}\, \bar{\mathcal{T}}_{w_2} = P\, \mathcal{T}_{w_1}\, P^{-1} P\, \mathcal{T}_{w_2}\, P^{-1} = P\, \mathcal{T}_{w_1} \mathcal{T}_{w_2}\, P^{-1}. \qquad (24)$$
We proceed analogously as in the proof of Lemma 3.3 above. Firstly, let $\varphi \in \Phi$ and let $e \in \Sigma$ be the identity element; then one infers from Eq. (24) the existence of an identity element of the action by
$$\bar\rho(e, \varphi) = P\, \mathcal{T}_e\, P^{-1} \varphi = \varphi,$$
where we have used the notation $\mathcal{T}_e = \mathbb{1}$ for the identity operator on $\mathcal{F}$. Secondly, let $w_1, w_2 \in \Sigma$ and let $\circ$ denote the group operation, i.e. $w_1 \circ w_2 \in \Sigma$. Then
$$\bar\rho\big(w_1, \bar\rho(w_2, \varphi)\big) = \bar{\mathcal{T}}_{w_1}\, \bar{\mathcal{T}}_{w_2}\, \varphi \overset{(I)}{=} P\, \mathcal{T}_{w_1 \circ w_2}\, P^{-1} \varphi = \bar\rho(w_1 \circ w_2, \varphi),$$
where in (I) we have used Eq. (24) as well as the invertibility and the group property of $\Sigma$. Lastly, it follows analogously for $w^{-1} \in \Sigma$ that
$$\bar\rho\big(w^{-1}, \bar\rho(w, \varphi)\big) = P\, \mathcal{T}_{w^{-1} \circ w}\, P^{-1} \varphi = P\, \mathcal{T}_e\, P^{-1} \varphi = \varphi.$$
Thus the existence of an inverse is established, which concludes the proof of the group property of $\bar\rho$.
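The group-action axioms verified above can also be sanity-checked on toy matrices: conjugating a family of operators by a fixed invertible map preserves identity, composition, and inverses. In the sketch below, hypothetical planar rotations stand in for the symmetry operators and an arbitrary invertible matrix for the change of basis; none of these values come from the paper.

```python
import numpy as np

def rot(theta):
    """2-D rotation matrix; composition corresponds to adding angles."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Arbitrary invertible change of basis (toy stand-in for the operator P).
P = np.array([[2.0, 1.0],
              [0.0, 1.0]])
P_inv = np.linalg.inv(P)

def T_bar(theta):
    """Conjugated symmetry operator, mirroring T_bar = P T P^{-1}."""
    return P @ rot(theta) @ P_inv

w1, w2 = 0.4, 1.1
# Composition: T_bar(w1) T_bar(w2) == T_bar(w1 ∘ w2) (angle addition).
assert np.allclose(T_bar(w1) @ T_bar(w2), T_bar(w1 + w2))
# Identity and inverse survive the conjugation as well.
assert np.allclose(T_bar(0.0), np.eye(2))
assert np.allclose(T_bar(-w1) @ T_bar(w1), np.eye(2))
```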
Proof of Theorem 3.6:
We aim to show a twofold statement.
($\Rightarrow$):
Firstly, that if $s_{t+1} = F(s_t, a_t)$ is a $\Sigma$ action-equivariant control system with a symmetry action as in Lemma 3.5, which furthermore admits a Koopman operator representation as
$$g(s_{t+1}) = [\mathcal{K}_{a_t}\, g](s_t),$$
then
$$\mathcal{K}_{a_t}\, \mathcal{T}_w = \mathcal{T}_w\, \mathcal{K}_{a_t} \quad \text{for all } w \in \Sigma.$$
($\Leftarrow$):
Secondly, the converse. Namely, if a control system obeys Eqs. (14) and (15), then it is $\Sigma$ action-equivariant, i.e. $h_w\big(F(s_t, a_t)\big) = F\big(h_w(s_t), a_t\big)$.
Let us start with the first implication, i.e. ($\Rightarrow$).
First of all, note that by the definition of the Koopman operator it obeys
$$[\mathcal{K}_{a_t}\, g](s_t) = g\big(F(s_t, a_t)\big) = g(s_{t+1}).$$
Using the latter one infers that
$$[\mathcal{K}_{a_t}\, \mathcal{T}_w\, g](s_t) = [\mathcal{T}_w\, g]\big(F(s_t, a_t)\big) = g\big(h_w(F(s_t, a_t))\big) = g\big(F(h_w(s_t), a_t)\big),$$
where the last equality uses the $\Sigma$ action-equivariance of the system. Moreover, one derives that
$$g\big(F(h_w(s_t), a_t)\big) = [\mathcal{K}_{a_t}\, g]\big(h_w(s_t)\big) = [\mathcal{T}_w\, \mathcal{K}_{a_t}\, g](s_t),$$
which concludes the proof of the first part of the theorem. Let us next show the converse implication, i.e. ($\Leftarrow$). For this case it is practical to use the discrete-time system notation explicitly. Let $h_w$ be a symmetry of the state space and let $\tilde{s}_t = h_w(s_t)$ and $\tilde{s}_{t+1} = h_w(s_{t+1})$ be the shifted states. Then
$$g(\tilde{s}_{t+1}) = [\mathcal{T}_w\, g](s_{t+1}) = [\mathcal{K}_{a_t}\, \mathcal{T}_w\, g](s_t).$$
Thus in particular
$$g(\tilde{s}_{t+1}) \overset{(I)}{=} [\mathcal{T}_w\, \mathcal{K}_{a_t}\, g](s_t) = [\mathcal{K}_{a_t}\, g]\big(h_w(s_t)\big) = [\mathcal{K}_{a_t}\, g](\tilde{s}_t),$$
where in (I) we have used that the symmetry operator commutes with the Koopman operator. Moreover, one finds that
$$[\mathcal{K}_{a_t}\, g](\tilde{s}_t) = g\big(F(\tilde{s}_t, a_t)\big) = g\big(F(h_w(s_t), a_t)\big),$$
from which one concludes that
$$g\big(h_w(s_{t+1})\big) = g\big(F(h_w(s_t), a_t)\big).$$
Finally, we use the invertibility of the Koopman space observables, i.e. $g^{-1} \circ g = \mathrm{id}$, to infer
$$h_w(s_{t+1}) = F\big(h_w(s_t), a_t\big).$$
Thus
$$h_w\big(F(s_t, a_t)\big) = F\big(h_w(s_t), a_t\big), \qquad (25)$$
which at last concludes our proof of the second part of the theorem.
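This equivalence is what licenses the data-augmentation strategy of the main text: applying a symmetry map to a stored transition yields another transition consistent with the true dynamics. A toy numeric check for a linear control system (the dynamics, symmetry, and values below are illustrative; in this toy example the symmetry is applied to the action as well):

```python
import numpy as np

def rot(theta):
    """2-D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Toy linear control system s_{t+1} = F(s_t, a_t) = A s_t + B a_t.
A, B = rot(0.3), np.eye(2)
W = rot(np.pi)  # symmetry map applied to states (and actions here)

# W commutes with A and B, so F(W s, W a) = W F(s, a).
s, a = np.array([1.0, 0.5]), np.array([0.1, -0.2])
s_next = A @ s + B @ a          # stored offline transition (s, a, s')

s_aug, a_aug = W @ s, W @ a     # symmetry-mapped transition
s_aug_next = A @ s_aug + B @ a_aug
assert np.allclose(s_aug_next, W @ s_next)  # a valid transition again
```

The assertion confirms that the mapped tuple lies on the true dynamics, so it can be appended to the replay buffer without querying the environment.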
Proof of Theorem 3.7:
Let $s_{t+1} = F(s_t, a_t)$ be a $\Sigma$ action-equivariant control system as in Theorem 3.6. For symmetry maps