Koopman Q-learning: Offline Reinforcement Learning via Symmetries of Dynamics

11/02/2021
by   Matthias Weissenbacher, et al.

Offline reinforcement learning leverages large datasets to train policies without interactions with the environment. The learned policies may then be deployed in real-world settings where interactions are costly or dangerous. Current algorithms over-fit to the training dataset and as a consequence perform poorly when deployed to out-of-distribution generalizations of the environment. We aim to address these limitations by learning a Koopman latent representation which allows us to infer symmetries of the system's underlying dynamics. The latter are then utilized to extend the otherwise static offline dataset during training; this constitutes a novel data augmentation framework which reflects the system's dynamics and is thus to be interpreted as an exploration of the environment's phase space. To obtain the symmetries we employ Koopman theory, in which nonlinear dynamics are represented in terms of a linear operator acting on the space of measurement functions of the system, so that symmetries of the dynamics may be inferred directly. We provide novel theoretical results on the existence and nature of symmetries relevant for control systems such as reinforcement learning settings. Moreover, we empirically evaluate our method on several benchmark offline reinforcement learning tasks and datasets, including D4RL, Metaworld and Robosuite, and find that by using our framework we consistently improve the state of the art for Q-learning methods.



1 Introduction

The recent impressive advances in reinforcement learning (RL) range from robotics to strategy games and recommendation systems (Kalashnikov et al., 2018; Li et al., 2010). Reinforcement learning is canonically regarded as an active learning process - also referred to as online RL - where the agent interacts with the environment at each training run. In contrast, offline RL algorithms learn from large, previously collected static datasets and thus do not rely on environment interactions (Agarwal et al., 2020b; Ernst et al., 2005; Fujimoto et al., 2019). Online data collection is performed by simulations or by means of real-world interactions, e.g. in robotics, and in either scenario interactions may be costly and/or dangerous.

In principle offline datasets only need to be collected once, which alleviates the aforementioned shortcomings of costly online interactions. Offline datasets are typically collected using behavioral policies for the specific task, ranging from random policies and near-optimal policies to human demonstrations. In particular, being able to leverage the latter is a major advantage of offline RL over online approaches; the learned policies can then be deployed or finetuned on the desired environment. Offline RL has successfully been applied to learn agents that outperform the behavioral policy used to collect the data (Kumar et al., 2020; Wu et al., 2019; Agarwal et al., 2020a; Ernst et al., 2005). However, current algorithms admit major shortcomings with regard to over-fitting and overestimating the true state-action values outside the data distribution. One solution was recently proposed by Sinha et al. (2021), who tested several data augmentation schemes to improve the performance and generalization capabilities of the learned policies.

However, despite the recent progress, learning from offline demonstrations is a tedious endeavour as the dataset typically does not cover the full state-action space. Moreover, offline RL algorithms by definition do not admit the possibility of further environment exploration to refine their distributions towards an optimal policy. It was argued previously that it is basically impossible for an offline RL agent to learn an optimal policy, as the generalization to nearby data generically leads to compounding errors such as overestimation bias (Kumar et al., 2020). In this paper, we look at offline RL through the lens of Koopman spectral theory, in which nonlinear dynamics are represented in terms of a linear operator acting on the space of measurement functions of the system. Through this representation the symmetries of the dynamics may be inferred directly and can then be used to guide data augmentation strategies; see Figure 1. We further provide theoretical results on the existence and nature of symmetries relevant for control systems such as reinforcement learning. More specifically, we apply Koopman spectral theory by first learning symmetries of the system's underlying dynamics in a self-supervised fashion from the static dataset, and second employing the latter to extend the offline dataset at training time by out-of-distribution values. As this reflects the system's dynamics, the additional data is to be interpreted as an exploration of the environment's phase space.

Some prior works have explored symmetry of the state-action space in the context of Markov Decision Processes (MDPs) (Higgins et al., 2018; Balaraman and Andrew, 2004; van der Pol et al., 2020), since many control tasks exhibit apparent symmetries, e.g. the classic cart-pole task which is symmetric across the y-axis. However, the paradigm we introduce in this work is of an entirely different nature. The distinction is twofold: first, the symmetries are learned in a self-supervised way and are in general not apparent to the developer; second, we are concerned with symmetry transformations of state tuples which leave the action invariant, inferred from the dynamics inherited from the behavioral policy of the underlying offline data. In other words, we seek to derive a neighbourhood around an MDP tuple in the offline dataset in which the behavioral policy is likely to choose the same action, based on its dynamics in the environment. In practice the Koopman latent space representation is learned in a self-supervised manner by training to predict the next state using a VAE model (Kingma and Welling, 2013).

To summarize, in this paper, we propose Koopman Forward (Conservative) Q-learning (KFC): a model-free Q-learning algorithm which uses the symmetries in the dynamics of the environment to guide data augmentation strategies. We also provide thorough theoretical justifications for KFC. Finally, we empirically test our approach on several challenging benchmark datasets from D4RL (Fu et al., 2021), MetaWorld (Yu et al., 2019) and Robosuite (Zhu et al., 2020) and find that by using KFC we can improve the state-of-the-art on most benchmark offline reinforcement learning tasks.

2 Preliminaries and background

2.1 Offline RL & Conservative Q-learning

Reinforcement learning algorithms train policies to maximize the cumulative reward received by an agent who interacts with an environment. Formally the setting is given by a Markov decision process $(\mathcal{S}, \mathcal{A}, p, r, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, and transition density function $p(s_{t+1}|s_t, a_t)$ from the current state and action to the next state. Moreover, $\gamma \in (0,1)$ is the discount factor and $r(s_t, a_t)$ the reward function. At any discrete time $t$ the agent chooses an action $a_t \sim \pi_\phi(\cdot|s_t)$ according to its underlying policy based on the information of the current state $s_t$, where the policy is parametrized by $\phi$. We focus on Actor-Critic methods for continuous control tasks in the following. In deep RL the parameters $\phi$ and $\theta$ are the weights in deep neural network function approximations of the policy $\pi_\phi$ (Actor) and the state-action value function $Q_\theta$ (Critic), respectively, and are optimized by gradient descent. The agent, i.e. the Actor-Critic, is trained to maximize the expected $\gamma$-discounted cumulative reward $\mathbb{E}_{\pi_\phi}\big[\sum_t \gamma^t\, r(s_t, a_t)\big]$ with respect to the policy network, i.e. its parameters $\phi$. For notational simplicity we omit the explicit parameter dependence in the remainder of this work. Furthermore, the state-action value function $Q_\theta(s_t, a_t)$ returns the value of performing a given action $a_t$ while being in the state $s_t$. The Q-function is trained by minimizing the so-called Bellman error

$\mathcal{L}(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}}\Big[\big(r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi_\phi(\cdot|s_{t+1})}\big[\hat{Q}(s_{t+1}, a_{t+1})\big] - Q_\theta(s_t, a_t)\big)^2\Big] \qquad (1)$

This is commonly referred to as the policy evaluation step, where the hat denotes the target Q-function. In offline RL one aims to learn an optimal policy from the given dataset $\mathcal{D}$, as the option of exploring the MDP is not available. The policy is optimized to maximize the state-action value function via the policy improvement step

$\pi \leftarrow \arg\max_{\pi}\; \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi(\cdot|s_t)}\big[Q_\theta(s_t, a_t)\big] \qquad (2)$

Note that behavioural policies, including sub-optimal or randomized ones, may be used to generate the static dataset $\mathcal{D}$. In that case offline RL algorithms face difficulties in the learning process.
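For concreteness, a minimal PyTorch sketch of this policy evaluation step might look as follows; it is an illustration rather than the authors' implementation, and the `policy.sample` interface and the `done`-flag handling are assumptions:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of the policy evaluation step of Eq. (1): regress the Q-network
# onto the one-step Bellman target built from a target network and actions
# sampled from the current policy.
def bellman_loss(q_net, q_target, policy, batch, gamma=0.99):
    s, a, r, s_next, done = batch          # tensors sampled from the offline dataset D
    with torch.no_grad():
        a_next, _ = policy.sample(s_next)  # a' ~ pi(.|s'); assumed policy API
        target = r + gamma * (1.0 - done) * q_target(s_next, a_next)
    return F.mse_loss(q_net(s, a), target)
```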
CQL algorithm: CQL is built on top of the Soft Actor-Critic algorithm (SAC) (Haarnoja et al., 2018), which employs soft policy iteration of a stochastic policy (Haarnoja et al., 2017). A policy entropy regularization term is added to the policy improvement step in Eq. (2) as

$\pi \leftarrow \arg\max_{\pi}\; \mathbb{E}_{s_t \sim \mathcal{D},\, a_t \sim \pi(\cdot|s_t)}\big[Q_\theta(s_t, a_t) - \alpha \log \pi(a_t|s_t)\big] \qquad (3)$

where $\alpha$ either is a fixed hyperparameter or may be chosen to be trainable. CQL reduces the overestimation of state-action values - in particular those out of distribution with respect to $\mathcal{D}$. It achieves this by regularizing the Q-function in Eq. (1) with a term minimizing its values over out-of-distribution, randomly sampled actions as

$\mathcal{L}_{CQL}(\theta) = \mathcal{L}(\theta) + \beta\, \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\mathbb{E}_{a \sim \mu(\cdot|s_t)}\big[Q_\theta(s_t, a)\big] - \mathbb{E}_{a \sim \hat{\pi}_{\mathcal{D}}(\cdot|s_t)}\big[Q_\theta(s_t, a)\big]\Big] \qquad (4)$

where $\mu$ is given by the prediction of the policy and $\beta$ is a hyperparameter balancing the regularizer term.
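For illustration, a simplified sketch of such a conservative regularizer, added on top of the Bellman loss of Eq. (1), could look as follows; this is a rough approximation in the spirit of CQL, not the exact objective of Kumar et al. (2020):

```python
import torch

# Hedged sketch of a CQL-style regularizer: push down Q-values of actions
# proposed by the current policy (out-of-distribution samples) and push up
# Q-values of dataset actions, weighted by a balancing coefficient.
def cql_regularizer(q_net, policy, s, a_data, num_samples=10, beta=5.0):
    B = s.shape[0]
    s_rep = s.repeat_interleave(num_samples, dim=0)
    a_pi, _ = policy.sample(s_rep)                 # assumed policy API
    q_pi = q_net(s_rep, a_pi).view(B, num_samples)
    q_data = q_net(s, a_data)
    return beta * (q_pi.mean(dim=1) - q_data).mean()

# total critic loss would then be: bellman_loss(...) + cql_regularizer(...)
```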

Figure 1: Overview of Koopman Q-learning. The states of a data point are shifted along symmetry trajectories parametrized by a continuous shift parameter while the action is kept constant. Symmetry transformations can be combined to reach a specific subset of the surrounding $\epsilon$-ball region.

2.2 Koopman theory

Historically, the Koopman theoretic perspective on dynamical systems was introduced to describe the evolution of measurements of Hamiltonian systems (Koopman, 1931; Mezić, 2005). The underlying dynamics of most modern reinforcement learning tasks are of nonlinear nature, i.e. the agent's actions lead to changes of its state described by a complex non-linear dynamical system. In contrast to linear systems, which are completely characterized by their spectral decomposition, non-linear systems lack such a unified characterisation. The Koopman operator theoretic framework describes nonlinear dynamics via a linear infinite-dimensional Koopman operator and thus inherits certain tools applicable to linear control systems (Mauroy et al., 2020; Kaiser et al., 2021). In practice one aims to find a finite-dimensional representation of the Koopman operator, which is equivalent to obtaining a coordinate transformation in which the nonlinear dynamics are approximately linear. A general non-affine control system is governed by a system of non-linear ordinary differential equations (ODEs) as

$\frac{d}{dt}\, s(t) = F\big(s(t), a(t)\big) \qquad (5)$

where $s(t) \in \mathcal{S} \subseteq \mathbb{R}^n$ is the n-dimensional state vector and $a(t) \in \mathcal{A} \subseteq \mathbb{R}^m$ the m-dimensional action vector, with $\mathcal{S} \times \mathcal{A}$ the state-action space. Moreover, $\frac{d}{dt}$ denotes the time derivative, and $F$ is some general non-linear - at least $C^1$-differentiable - vector-valued function. For a discrete-time system, Eq. (5) takes the form

$s_{t+1} = F(s_t, a_t) \qquad (6)$

where $s_t$ denotes the state at time $t$ and $F$ is an at least $C^1$-differentiable vector-valued function.

Definition 1 (Koopman operator)

Let $\mathcal{F}$ be the (Banach) space of all measurement functions (observables). Then the Koopman operator $\mathcal{K}$ is defined by

$(\mathcal{K}\psi)(s_t, a_t) = \psi\big(F(s_t, a_t)\big) = \psi(s_{t+1}) \qquad (7)$

where $\psi \in \mathcal{F}$.

Many systems can be modeled by a bilinearisation where the action enters the controlling equations (5) linearly, as $\frac{d}{dt}s(t) = f_0\big(s(t)\big) + \sum_{i=1}^m a^i(t)\, f_i\big(s(t)\big)$ for functions $f_0, f_i$. In that case the action of the Koopman operator takes the simple form

$\mathcal{K}(a_t)\,\psi(s_t) = \Big(\mathcal{K}_0 + \textstyle\sum_{i=1}^m a_t^i\, \mathcal{K}_i\Big)\,\psi(s_t) = \psi(s_{t+1}) \qquad (8)$

where $\mathcal{K}_0$ and $\mathcal{K}_i$ decompose the Koopman operator into the free term and the forcing terms, respectively. Associated with a Koopman operator is its eigenspectrum, that is, the eigenvalues $\lambda_j$ and the corresponding eigenfunctions $\varphi_j$, such that $\mathcal{K}\varphi_j = \lambda_j \varphi_j$. In practice one derives a finite set of observables, in which case the approximation to the Koopman operator admits a finite-dimensional matrix representation $K$. The matrix $K$ representing the Koopman operator may be diagonalized by a matrix $T$ containing the eigenvectors of $K$ as columns. In that case the eigenfunctions are derived by $\varphi = T^{-1}\psi$ and one infers from Eq. (5) that $\frac{d}{dt}\varphi_j = \lambda_j\, \varphi_j$ with eigenvalues $\lambda_j$. These ODEs admit simple solutions for their time evolution, namely exponential functions, which is expected for a linear system.
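To make the finite-dimensional picture concrete, the following numpy sketch uses a standard toy system from the Koopman literature (cf. Brunton et al., 2021) whose dynamics become exactly linear in a three-dimensional set of observables; the system and the closed-form comparison are part of this illustration, not of the paper:

```python
import numpy as np

# The nonlinear system  dx1/dt = mu*x1,  dx2/dt = lam*(x2 - x1^2)  admits an
# exact finite-dimensional Koopman representation in the lifted observables
# psi(x) = (x1, x2, x1^2).
mu, lam = -0.3, -1.0

# Continuous-time Koopman (generator) matrix acting on the lifted coordinates.
K = np.array([[mu,  0.0,  0.0],
              [0.0, lam, -lam],
              [0.0, 0.0,  2 * mu]])

eigvals, T = np.linalg.eig(K)        # columns of T are eigenvectors

def lift(x):
    return np.array([x[0], x[1], x[0] ** 2])

def koopman_evolve(x0, t):
    """Evolve the lifted state linearly: psi(t) = T exp(diag(eigvals) t) T^-1 psi(0)."""
    phi0 = np.linalg.solve(T, lift(x0))   # eigenfunction coordinates
    phi_t = np.exp(eigvals * t) * phi0    # purely exponential time evolution
    return (T @ phi_t)[:2]                # recover (x1, x2)

def true_evolve(x0, t):
    """Closed-form solution of the nonlinear ODE, for comparison."""
    x1 = x0[0] * np.exp(mu * t)
    a = lam / (lam - 2 * mu) * x0[0] ** 2
    x2 = (x0[1] - a) * np.exp(lam * t) + a * np.exp(2 * mu * t)
    return np.array([x1, x2])

x0 = np.array([1.0, 0.5])
print(koopman_evolve(x0, 2.0))
print(true_evolve(x0, 2.0))   # identical: the lifted dynamics are exactly linear
```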

3 The Koopman forward framework

In this section we introduce the Koopman forward framework. We initiate our discussion with a focus on symmetries in dynamical control systems in Subsection 3.1 where we additionally present some theoretical results. We then proceed in Subsection 3.2 by presenting the Koopman forward framework for Q-learning based on CQL. Moreover, we discuss the algorithm as well as the Koopman Forward model’s deep neural network architecture.
Overview: Put simply, our theoretical results culminate in Theorems 3.6 and 3.7, which provide a road-map on how specific symmetries of the dynamical control system are to be inferred from a VAE forward prediction model. In particular, Theorem 3.6 guarantees that the procedure leads to true symmetries of the system, and Theorem 3.7 that actual new data points can be derived by applying symmetry transformations to existing ones. The practical limitation is that the VAE model, parameterized by a neural net, is trained on the data collected from one of many behavior policies, which amounts to learning the approximate dynamics of the real system. The theoretical limitations are twofold: firstly, the theorems only hold for dynamical systems with differentiable state transitions; secondly, we employ a bilinearisation Ansatz for the Koopman operator of the system. In practice, many RL environments incorporate dynamics with discontinuous "contact" events where the bilinearisation Ansatz may not be applicable. However, empirically we find that our approach is nevertheless successful for such RL environments and that such "contact" events do not affect the performance significantly (see Appendix C.1).

3.1 Symmetries of dynamical control system

Let us start by introducing symmetries in the simpler context of a dynamical system without control (Sinha et al., 2020), given by $\frac{d}{dt}s(t) = F\big(s(t)\big)$, where we use notation analogous to Eq. (5).

Definition 2 (Equivariant Dynamical System)

Consider the dynamical system $\frac{d}{dt}s = F(s)$ and let $G$ be a group acting on the state space $\mathcal{S}$. Then the system is called $G$-equivariant if $F(g \cdot s) = g \cdot F(s)$ for all $g \in G$. For a discrete-time dynamical system one defines equivariance analogously, namely if $F(g \cdot s_t) = g \cdot F(s_t)$.
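As a minimal illustration of Definition 2 (our own toy example, not from the paper), the reflection symmetry of a frictionless pendulum can be checked numerically:

```python
import numpy as np

# The frictionless pendulum  d/dt (theta, omega) = (omega, -sin(theta))  is
# equivariant under the reflection g: (theta, omega) -> (-theta, -omega).
def F(s):
    theta, omega = s
    return np.array([omega, -np.sin(theta)])

def g(s):
    return -s   # reflection through the origin

rng = np.random.default_rng(0)
for _ in range(5):
    s = rng.uniform(-np.pi, np.pi, size=2)
    assert np.allclose(F(g(s)), g(F(s)))   # G-equivariance: F(g(s)) = g(F(s))
print("reflection equivariance verified on random states")
```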

Lemma 3.1

The map $U_g: \mathcal{F} \to \mathcal{F}$ given by $(U_g \psi)(s) = \psi(g \cdot s)$ defines a group action on the Koopman space of observables $\mathcal{F}$.

Theorem 3.2

Let $\mathcal{K}$ be the Koopman operator associated with a $G$-equivariant system. Then

$U_g \circ \mathcal{K} = \mathcal{K} \circ U_g \quad \forall\, g \in G \qquad (9)$

Theorem 3.2 states that for a $G$-equivariant system any symmetry transformation commutes with the Koopman operator. For the proof see (Sinha et al., 2020). Let us next turn to the case relevant for RL, namely control systems. In the remainder of this section we focus on dynamical systems given as in Eq. (6) and Eq. (8).

Definition 3 (Action Equivariant Dynamical System)

Let $G$ be a group acting on the state-action space $\mathcal{S} \times \mathcal{A}$ of a general control system as in Eq. (6) such that it acts as the identity operation on $\mathcal{A}$, i.e. $g \cdot (s, a) = (g \cdot s, a)$. Then the system is called $G$-action-equivariant if

$F(g \cdot s_t, a_t) = g \cdot F(s_t, a_t) \quad \forall\, g \in G \qquad (10)$

In particular, it is easy to see that the bilinearisation in Eq. (8) is $G$-action-equivariant if each of the functions $f_0, f_i$ is itself $G$-equivariant.

Lemma 3.3

The map $U_g: \mathcal{F} \to \mathcal{F}$ given by $(U_g \psi)(s, a) = \psi(g \cdot s, a)$, i.e. acting only on the state argument, defines a group action on the Koopman space of observables $\mathcal{F}$.

Theorem 3.4

Let $\mathcal{K}$ be the Koopman operator associated with a $G$-action-equivariant system. Then

$U_g \circ \mathcal{K} = \mathcal{K} \circ U_g \quad \forall\, g \in G \qquad (11)$

Neither Theorem 3.2 nor Theorem 3.4 serves as a theoretical foundation for the control setup aimed for in the next section. Let us thus turn our focus to the relevant case of a control system which admits a Koopman operator description as

$\psi(s_{t+1}) = \mathcal{K}(a_t)\, \psi(s_t) \qquad (12)$

where $\mathcal{K}(a)$ is a family of operators with analytical dependence on the action $a$. Note that the bilinearisation in Eq. (8) is a special case of Eq. (12). Furthermore, let $T(a)$ be a family of invertible operators s.t. $T(a)^{-1}$ is a mapping to the (Banach) space of eigenfunctions of $\mathcal{K}(a)$, i.e. $T(a)$ diagonalizes the Koopman operator. The existence of such operators puts further restrictions on the Koopman operator in Eq. (12) (see Appendix A for the details). As we are in the end concerned with the finite-dimensional approximation of the Koopman operator, i.e. its matrix representation $K$, this amounts simply to $K$ being diagonalisable, with $T$ the matrix containing its eigenvectors as columns.

Lemma 3.5

The map given by

(13)

defines a group action on the Koopman space of observables $\mathcal{F}$, where $U_g$ is defined analogously to Lemma 3.3 but acting on the space of eigenfunctions instead of $\mathcal{F}$.

In the following we refer to a system as in Eq. (12) admitting a symmetry as in Lemma 3.5 as a $G$-action-equivariant control system.

Theorem 3.6

Let $s_{t+1} = F(s_t, a_t)$ be a $G$-action-equivariant control system with a symmetry action as in Lemma 3.5 which furthermore admits a Koopman operator representation as

$\psi(s_{t+1}) = \mathcal{K}(a_t)\, \psi(s_t) \qquad (14)$

Then

$U_g \circ \mathcal{K}(a) = \mathcal{K}(a) \circ U_g \quad \forall\, g \in G,\ a \in \mathcal{A} \qquad (15)$

Moreover, the converse is also true. If a control system obeys Equations (14) and (15), then it is $G$-action-equivariant, i.e. $F(g \cdot s_t, a_t) = g \cdot F(s_t, a_t)$.

Let us next build on the core of Theorem 3.6 to provide theoretical statements on how data points may be shifted by symmetry transformations. For that let us change our notation to establish an easier connection to the algorithmic setting. Let $E$ and $D$ denote the encoder and decoder to and from the Koopman space approximation, respectively, i.e. $D\big(E(s)\big) \approx s$.

Theorem 3.7

Let $s_{t+1} = F(s_t, a_t)$ be a $G$-action-equivariant control system as in Theorem 3.6. Then one can use a symmetry transformation $U_g$ to shift both $s_t$ as well as $s_{t+1}$:

$\tilde{s}_t = D\big(U_g\, E(s_t)\big) \qquad (16)$
$\tilde{s}_{t+1} = D\big(U_g\, E(s_{t+1})\big) \qquad (17)$

One may account for practical limitations, i.e. a reconstruction error, by assuming that the encoder-decoder pair only approximately inverts itself. One then finds that the induced error on the shifted states is of the same order and becomes suppressed for small symmetry shifts. (The error may be due to practical limitations of capturing the true dynamics as well as the symmetry map.) Moreover, note that an equivalent theorem holds in the analogous case.

3.2 KFC algorithm

In the previous section we laid the theoretical foundation for the KFC algorithm by providing a road-map on how to derive symmetries of dynamical control systems based on a Koopman latent space representation. The goal is to generate new data points for the RL algorithm at training time via Eqs. (16) and (17) of Theorem 3.7. The reward is not part of the symmetry shift process and simply remains unchanged.
On the power of using symmetries: Let us emphasize the practical advantage of employing symmetries to generate augmented data points. It is evident from Theorem 3.7 that a symmetry transformation shifts both $s_t$ as well as $s_{t+1}$, which evades the necessity of forecasting states. Thus the use of an inaccurate forecast model is avoided and the accuracy and generalisation capabilities of the VAE are fully utilized. The magnitude of the induced shift is controlled by a parameter $\epsilon$, chosen small enough to limit out-of-distribution generalisation errors. (For the details of the errors, refer to Appendix A.)
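In pseudocode, the resulting augmentation of a transition tuple can be sketched as follows; this is an illustrative sketch in which `encoder`, `decoder` and `sym_generator` stand in for the learned Koopman encoder/decoder and the symmetry construction of Section 3.2:

```python
import torch

# Hedged sketch of the symmetry-based augmentation of Theorem 3.7: s_t and s_{t+1}
# are encoded into the Koopman latent space, shifted by a symmetry matrix M(eps)
# that commutes with the Koopman operator, and decoded back. Action and reward of
# the tuple stay unchanged, and no forecasting of future states is needed.
def symmetry_shift(batch, encoder, decoder, sym_generator, eps_std=0.1):
    s, a, r, s_next = batch                        # (B, dim_s), (B, dim_a), ...
    eps = torch.randn(s.shape[0], 1) * eps_std     # per-sample shift magnitude
    M = sym_generator(eps)                         # (B, N, N) matrices with [M, K] = 0
    z, z_next = encoder(s), encoder(s_next)        # Koopman observables, shape (B, N)
    z_shift = torch.bmm(M, z.unsqueeze(-1)).squeeze(-1)
    z_next_shift = torch.bmm(M, z_next.unsqueeze(-1)).squeeze(-1)
    return decoder(z_shift), a, r, decoder(z_next_shift)
```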

Let us next discuss technical details of the incorporation of the symmetry shifts into a specific Q-learning algorithm. In practice the Koopman latent space representation is N-dimensional, i.e. finite. One uses the matrix representations of the Koopman operator and of the symmetry generators, together with matrix multiplication, instead of the formal operator mappings. Moreover, we use a normally distributed random variable $\epsilon$ for the control parameter of the shift. We base our analysis on the CQL algorithm; however, we expect that our approach to exploration should benefit a wider range of RL algorithms. Following in the footsteps of (Sinha et al., 2021), our approach leaves the policy improvement as in Eq. (3) unchanged but modifies the policy evaluation step in Eq. (4) as

$\mathcal{L}(\theta) \;\to\; \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D},\, \epsilon}\Big[\big(r_t + \gamma\,\mathbb{E}_{a_{t+1} \sim \pi_\phi(\cdot|\tilde{s}_{t+1})}\big[\hat{Q}(\tilde{s}_{t+1}, a_{t+1})\big] - Q_\theta(\tilde{s}_t, a_t)\big)^2\Big], \qquad \tilde{s} = D\big(M(\epsilon)\, E(s)\big) \qquad (18)$

The state-space symmetry-generating function $M(\epsilon)$ depends on the normally distributed random variables $\epsilon$. Note that we only modify the Bellman error of Eq. (18) and leave the CQL-specific regularizer untouched. We study two distinct cases which differ on an algorithmic level. (Eq. (18) holds for both cases, thus we have deliberately dropped the corresponding subscript. Moreover, note that in case (II) the symmetry transformations are constructed from the eigenvectors of the Koopman operator.)

(19)

Here $T$ diagonalizes the Koopman operator with eigenvalues $\lambda_j$, i.e. the latter can be expressed as $K = T\, \mathrm{diag}(\lambda_1, \dots, \lambda_N)\, T^{-1}$. Moreover, $M$ is a fixed matrix representation of a symmetry transformation of the dynamical system in Koopman space. Note that we abuse our notation in that the encoder provides the Koopman space observables, i.e. $E(s_t) \approx \psi(s_t)$. Theorem 3.6, which governs symmetries acting on the state-action space of dynamical control systems as in Eq. (8), requires the symmetry generators to commute with the Koopman operator. Thus, on the one hand, for case (I) in Eq. (19) the symmetry generator matrix is obtained by solving the equation $KM = MK$, which may be accomplished by employing a Sylvester algorithm (31). On the other hand, case (II) is solved by computing the eigenvectors of $K$, which then constitute the columns of $T$. Thus in particular one infers that

$\big[K,\, M(\epsilon)\big] = K\,M(\epsilon) - M(\epsilon)\,K = 0 \qquad (20)$

where $[\cdot\,,\cdot]$ denotes the commutator of two matrices. Eq. (20) is solved by construction of $M(\epsilon)$, which commutes with the Koopman operator for all values of the random variables. (Moreover, let us emphasize that the Koopman operator's eigenvalues and eigenvectors generically are $\mathbb{C}$-valued. However, the definition in Eq. (19) ensures that $M(\epsilon)$ is an $\mathbb{R}$-valued matrix.) In conclusion, the advantage of case (I) is that it is computationally less expensive; however, it provides less freedom to explore different symmetry directions than case (II). Lastly, it proved useful empirically to train the RL algorithm not only on symmetry-shifted data as in Eq. (18). We employ our data exploration shift only with probability $p$; otherwise we use a plain shift of the state by a normally distributed random variable.
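The commutation requirement can be illustrated with a short numpy sketch in the spirit of case (II); it is not the authors' implementation, and the Koopman matrix below is a random stand-in for a learned one. Case (I) would instead solve a Sylvester-type equation for a fixed generator, e.g. with scipy.linalg.solve_sylvester (31):

```python
import numpy as np

# Build a real matrix M from the eigenvectors of K that commutes with K:
# apply a random rescaling in eigenfunction coordinates, using identical
# factors for complex-conjugate eigenvalue pairs so that M stays real-valued.
rng = np.random.default_rng(0)
N = 6
K = np.eye(N) + 0.3 * rng.normal(size=(N, N))   # stand-in for a learned Koopman matrix
lam, T = np.linalg.eig(K)

def symmetry_matrix(eps_scale=0.05):
    d = np.ones(N, dtype=complex)
    done = np.zeros(N, dtype=bool)
    for i in range(N):
        if done[i]:
            continue
        factor = 1.0 + eps_scale * rng.normal()   # random epsilon per eigendirection
        d[i], done[i] = factor, True
        if np.iscomplex(lam[i]):                  # pair up with the conjugate eigenvalue
            j = int(np.argmin(np.abs(lam - np.conj(lam[i]))))
            d[j], done[j] = factor, True
    return (T @ np.diag(d) @ np.linalg.inv(T)).real

M = symmetry_matrix()
print(np.max(np.abs(K @ M - M @ K)))   # commutator vanishes up to numerical error
```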

The Koopman forward model: The KFC algorithm requires pre-training of a Koopman forward model which is closely related to a VAE architecture,

$\hat{s}_{t+1} = D\Big(\big(K_0 + \textstyle\sum_{i=1}^m a_t^i\, K_i\big)\, E(s_t)\Big) \qquad (21)$

where both the encoder $E$ and the decoder $D$ are approximated by Multi-Layer Perceptrons (MLPs) and the bilinear Koopman-space operator approximations $K_0, K_i$ are each implemented by a single fully connected layer. The model is trained on batches of the offline dataset tuples $(s_t, a_t, s_{t+1})$ and optimized via an additive loss function combining the VAE objective and the forward-prediction part of the model. We refer the reader to Appendix B for details.
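A hedged PyTorch sketch of such a Koopman forward model is given below; the layer sizes, the Gaussian VAE parametrization and the loss weights are illustrative assumptions rather than the exact configuration of Appendix B:

```python
import torch
import torch.nn as nn

class KoopmanForwardVAE(nn.Module):
    """Sketch of a VAE whose latent space is advanced by a bilinear Koopman step, cf. Eq. (21)."""
    def __init__(self, state_dim, action_dim, latent_dim=32, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * latent_dim))   # mean and log-variance
        self.decoder = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, state_dim))
        self.K0 = nn.Linear(latent_dim, latent_dim, bias=False)           # free term
        self.K = nn.ModuleList([nn.Linear(latent_dim, latent_dim, bias=False)
                                for _ in range(action_dim)])              # forcing terms

    def forward(self, s, a):
        mu, logvar = self.encoder(s).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()              # reparametrization
        # bilinear Koopman step: z' = (K0 + sum_i a_i K_i) z
        z_next = self.K0(z) + sum(a[:, i:i + 1] * self.K[i](z) for i in range(len(self.K)))
        return self.decoder(z), self.decoder(z_next), mu, logvar

def koopman_forward_loss(model, s, a, s_next, kl_weight=1e-3):
    s_rec, s_next_pred, mu, logvar = model(s, a)
    recon = ((s_rec - s) ** 2).mean()                  # VAE reconstruction term
    forward = ((s_next_pred - s_next) ** 2).mean()     # forward-prediction term
    kl = -0.5 * (1 + logvar - mu ** 2 - logvar.exp()).mean()
    return recon + forward + kl_weight * kl
```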

Domain Task Name CQL S4RL-(N) S4RL-(Adv) KFC KFC++
AntMaze antmaze-umaze 74.0 91.3 94.1 96.9 99.8
antmaze-umaze-diverse 84.0 87.8 88.0 91.2 91.1
antmaze-medium-play 61.2 61.9 61.6 60.0 63.1
antmaze-medium-diverse 53.7 78.1 82.3 87.1 90.5
antmaze-large-play 15.8 24.4 25.1 24.8 25.6
antmaze-large-diverse 14.9 27.0 26.2 33.1 34.0
Gym cheetah-random 35.4 52.3 53.9 48.6 49.2
cheetah-medium 44.4 48.8 48.6 55.9 59.1
cheetah-medium-replay 42.0 51.4 51.7 58.1 58.9
cheetah-medium-expert 62.4 79.0 78.1 79.9 79.8
hopper-random 10.8 10.8 10.7 10.4 10.7
hopper-medium 58.0 78.9 81.3 90.6 94.2
hopper-medium-replay 29.5 35.4 36.8 48.6 49.0
hopper-medium-expert 111.0 113.5 117.9 121.0 125.5
walker-random 7.0 24.9 25.1 19.1 17.6
walker-medium 79.2 93.6 93.1 102.1 108.0
walker-medium-replay 21.1 30.3 35.0 48.0 46.1
walker-medium-expert 98.7 112.2 107.1 114.0 115.3
Adroit pen-human 37.5 44.4 51.2 61.3 60.0
pen-cloned 39.2 57.1 58.2 71.3 68.4
hammer-human 4.4 5.9 6.3 7.0 9.4
hammer-cloned 2.1 2.7 2.9 3.0 4.2
door-human 9.9 27.0 35.3 44.1 46.1
door-cloned 0.4 2.1 0.8 3.6 5.6
relocate-human 0.2 0.2 0.2 0.2 0.2
relocate-cloned -0.1 -0.1 -0.1 -0.1 -0.1
Franka kitchen-complete 43.8 77.1 88.1 94.1 94.9
kitchen-partial 49.8 74.8 83.6 92.3 95.9
Table 1: We experiment with the full set of the D4RL tasks and report the mean normalized episodic returns over 5 random seeds using the same protocol as Fu et al. (2021). We compare against 3 competitive baselines: CQL and the two best performing S4RL data augmentation strategies. We see that KFC and KFC++ consistently outperform the baselines. We use the baseline numbers reported in Sinha et al. (2021).

4 Empirical evaluation

In this section, we will first experiment with the popular D4RL benchmark commonly used for offline RL (Fu et al., 2021). The benchmark covers various different tasks such as locomotion tasks with Mujoco Gym (Brockman et al., 2016), tasks that require hierarchical planning such as antmaze, and other robotics tasks such as kitchen and adroit (Rajeswaran et al., 2017). Furthermore, similar to S4RL (Sinha et al., 2021), we perform experiments on 6 different challenging robotics tasks from MetaWorld (Yu et al., 2019) and RoboSuite (Zhu et al., 2020). We compare KFC to the baseline CQL algorithm (Kumar et al., 2020) and to the two best performing augmentation variants from S4RL, S4RL-(N) and S4RL-(Adv) (Sinha et al., 2021). We use the exact same hyperparameters as proposed in the respective papers. Furthermore, similar to S4RL, we build KFC on top of CQL (Kumar et al., 2020) to ensure conservative Q-estimates during policy evaluation.

4.1 D4RL benchmarks

We present results on the benchmark D4RL test suite and report the normalized return in Table 1. We see that both KFC and KFC++ consistently outperform both the baseline CQL and S4RL across multiple tasks and data distributions. Outperforming S4RL-(N) and S4RL-(Adv) on various different types of environments suggests that KFC and KFC++ fundamentally improve on the data augmentation strategies discussed in S4RL. KFC variants also improve the performance of learned agents on challenging environments such as antmaze, which requires hierarchical planning, and the kitchen and adroit tasks, which are sparse-reward and have large action spaces. Similarly, KFC variants also perform well on difficult data distributions such as "medium-replay", which is collected by simply using all the data that the policy encountered while training a base SAC policy, and "-human", which is collected using human demonstrations on robotic tasks and results in a non-Markovian behaviour policy (more details can be found in the D4RL manuscript (Fu et al., 2021)). Furthermore, to our knowledge, the results for KFC++ are state-of-the-art in policy learning from D4RL datasets for most environments and data distributions.

We do note that S4RL outperforms KFC on the "-random" split of the data distributions, which is expected, as KFC depends on learning a simple dynamics model of the data to guide the data augmentation strategy. Since the "-random" split consists of random actions in the environment, the forward model is unable to learn useful dynamics.

(a) MetaWorld Environments
(b) RoboSuite Environments
Figure 2: Results on challenging dexterous robotics environments using data collected with a similar strategy as S4RL (Sinha et al., 2021). We report the percentage of goals that the agent is able to reach during evaluation, where the goal is set by the environment. We see that KFC and KFC++ consistently outperform both CQL and the two best performing S4RL variants.

4.2 Metaworld and Robosuite benchmarks

To further test the ability of KFC, we perform additional experiments on challenging robotic tasks. Following (Sinha et al., 2021), we perform additional experiments with 4 MetaWorld environments (Yu et al., 2019) and 2 RoboSuite environments (Zhu et al., 2020). We followed the same method to collect the data as described in Appendix F of S4RL (Sinha et al., 2021), and report the mean percent of goals reached, where the condition of reaching the goal is defined by the environment.

We report the results in Figure 2, where we see that by using KFC to guide the data augmentation strategy for a base CQL agent, we are able to learn an agent that performs significantly better. Furthermore, we see that for more challenging tasks such as "push" and "door-close" in MetaWorld, KFC++ outperforms the base CQL algorithm and the S4RL agent by a significant margin. This set of experiments further highlights the ability of KFC to guide the data augmentation strategy.

5 Related works

The use of data augmentation techniques in Q-learning has been discussed recently (Laskin et al., 2020, 2020; Sinha et al., 2021). In particular, our work shares strong parallels with (Sinha et al., 2021). Our modification of the policy evaluation step of the CQL algorithm (Kumar et al., 2020) is analogous to the one in Sinha et al. (2021). However, the latter randomly augments the data, while our augmentation framework is based on symmetry state shifts. Regarding the connection to world models (Ha and Schmidhuber, 2018): there, a VAE is used to encode the state information while a separate recurrent neural network predicts future states. Their latent representation is not of Koopman type, and no symmetries or data augmentations are derived.

Algebraic symmetries of the state-action space in Markov Decision Processes (MDPs) originate in (Balaraman and Andrew, 2004) and were discussed recently in the context of RL in (van der Pol et al., 2020). Their goal is to preserve the essential algebraic homomorphism symmetry structure of the original MDP while finding a more compact representation. The symmetry maps considered in our work are more general and are utilized in a different way. Symmetry-based representation learning (Higgins et al., 2018) refers to the study of symmetries of the environment manifested in the latent representation. The symmetries in our case are derived from the Koopman operator, not from the latent representation directly. In (Caselles-Dupré et al., 2019) the authors discuss representation learning of symmetries (Higgins et al., 2018) allowing for interactions with the environment. A Forward-VAE model, which is similar to our Koopman-Forward VAE model, is employed. However, our approach is based on theoretical results providing a road-map to derive explicit symmetries of the dynamical system as well as their utilisation for state shifts.

In (Sinha et al., 2020) the authors extend the Koopman operator from a local to a global description using symmetries of the dynamics. They do not discuss action-equivariant dynamical control systems nor data augmentation. In (Salova et al., 2019) the imprint of known symmetries on the block-diagonal Koopman space representation of non-control dynamical systems is discussed. This is close in spirit to disentanglement (Higgins et al., 2018). Our results, in contrast, concern control setups and the derivation of symmetries. On another front, the application of Koopman theory in control or reinforcement learning has also been discussed recently. For example, Li et al. (2020) propose to use compositional Koopman operators with graph neural networks to learn dynamics that can quickly adapt to new environments with unknown physical parameters and produce control signals to achieve a specified goal. Kaiser et al. (2021) discuss the use of Koopman eigenfunctions as a transformation of the state into a globally linear space where classical control techniques are applicable. To the best of our knowledge, this paper is the first to discuss the Koopman latent space for data augmentation.

6 Conclusions

In this work we proposed a symmetry-based data augmentation technique derived from a Koopman latent space representation. It enables a meaningful extension of offline RL datasets describing dynamical systems, i.e. further "exploration" without additional environment interactions. The approach is based on our theoretical results on symmetries of dynamical control systems and on symmetry shifts of data. Both hold for systems with differentiable state transitions and with a bilinearisation Ansatz for the Koopman operator. However, the empirical results show that the framework is successfully applicable beyond those limitations. We empirically evaluated our method on several benchmark offline reinforcement learning tasks from D4RL, Metaworld and Robosuite and find that by using our framework we consistently improve the state of the art for Q-learning algorithms.

Acknowledgments

This work was supported by JSPS KAKENHI (Grant Number JP18H03287), and JST CREST (Grant Number JPMJCR1913). We would like to thank Y. Nishimura for technical support.

References

  • R. Agarwal, D. Schuurmans, and M. Norouzi (2020a) An optimistic perspective on offline reinforcement learning. In International Conference on Machine Learning. Cited by: §1.
  • R. Agarwal, D. Schuurmans, and M. Norouzi (2020b) An optimistic perspective on offline reinforcement learning. External Links: 1907.04543 Cited by: §1.
  • R. Balaraman and G. B. Andrew (2004) Approximate homomorphisms: a framework for non-exact minimization in markov decision processes.. In In International Conference on Knowledge Based Computer Systems, Cited by: §1, §5.
  • G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba (2016) OpenAI gym. External Links: 1606.01540 Cited by: §C.1, §4.
  • S. L. Brunton, M. Budišić, E. Kaiser, and J. N. Kutz (2021) Modern koopman theory for dynamical systems. External Links: 2102.12086 Cited by: §C.4.
  • H. Caselles-Dupré, M. Garcia Ortiz, and D. Filliat (2019) Symmetry-based disentangled representation learning requires interaction with environments. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Cited by: §5.
  • D. Ernst, P. Geurts, and L. Wehenkel (2005) Tree-based batch mode reinforcement learning. Journal of Machine Learning Research 6 (18), pp. 503–556. Cited by: §1, §1.
  • J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine (2021) D4RL: datasets for deep data-driven reinforcement learning. External Links: 2004.07219 Cited by: Table 3, §1, Table 1, §4.1, §4.
  • S. Fujimoto, D. Meger, and D. Precup (2019) Off-policy deep reinforcement learning without exploration. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 2052–2062. Cited by: §1.
  • D. Ha and J. Schmidhuber (2018) Recurrent world models facilitate policy evolution. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. . External Links: Link Cited by: §5.
  • T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 1352–1361. Cited by: §2.1.
  • T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine (2018) Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1861–1870. Cited by: Appendix B, §2.1.
  • I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner (2018) Towards a definition of disentangled representations. External Links: 1812.02230 Cited by: §1, §5, §5.
  • E. Kaiser, J. N. Kutz, and S. L. Brunton (2021) Data-driven discovery of Koopman eigenfunctions for control. Machine Learning: Science and Technology 2, pp. 035023. Cited by: §2.2, §5.
  • D. Kalashnikov, A. Irpan, P. Pastor, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly, M. Kalakrishnan, V. Vanhoucke, and S. Levine (2018) Scalable deep reinforcement learning for vision-based robotic manipulation. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.), Proceedings of Machine Learning Research, Vol. 87, , pp. 651–673. Cited by: §1.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1.
  • B. O. Koopman (1931) Hamiltonian systems and transformation in Hilbert space. Proceedings of the National Academy of Sciences of the United States of America 17 (5), pp. 315–318. Cited by: §2.2.
  • A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1179–1191. Cited by: Appendix B, Appendix B, §1, §1, §4, §5.
  • M. Laskin, A. Srinivas, and P. Abbeel (2020) CURL: contrastive unsupervised representations for reinforcement learning. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 5639–5650. Cited by: §5.
  • M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas (2020) Reinforcement learning with augmented data. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 19884–19895. Cited by: §5.
  • L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. Cited by: §1.
  • Y. Li, H. He, J. Wu, D. Katabi, and A. Torralba (2020) Learning compositional Koopman operators for model-based control. In Proc. of the 8th Int’l Conf. on Learning Representation (ICLR’20), Cited by: §5.
  • A. Mauroy, I. Mezić, and Y. Susuki (2020) The Koopman operator in systems and control: concepts, methodologies, and applications. Book, Springer. Cited by: §2.2.
  • I. Mezić (2005) Spectral properties of dynamical systems, model reduction and decompositions. Nonlinear Dynamics 41, pp. 309–325. Cited by: §2.2.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Icml, Cited by: Appendix B.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32, pp. 8026–8037. Cited by: Appendix B.
  • A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine (2017) Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087. Cited by: §4.
  • A. Salova, J. Emenheiser, A. Rupe, J. P. Crutchfield, and R. M. D'Souza (2019) Koopman operator and its approximations for systems with symmetries. Chaos: An Interdisciplinary Journal of Nonlinear Science 29 (9), pp. 093128. External Links: Document Cited by: §5.
  • S. Sinha, A. Mandlekar, and A. Garg (2021) S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning. In Conference on Robot Learning, External Links: 2103.06326 Cited by: Appendix B, §C.2, §1, §3.2, Table 1, Figure 2, §4.2, §4, §5.
  • S. Sinha, S. P. Nandanoori, and E. Yeung (2020) Koopman operator methods for global phase space exploration of equivariant dynamical systems. External Links: 2003.04870 Cited by: Appendix A, §3.1, §3.1, §5.
  • [31] Sylvester algorithm - scipy.org. External Links: Link Cited by: §3.2.
  • E. van der Pol, D. Worrall, H. van Hoof, F. Oliehoek, and M. Welling (2020) MDP homomorphic networks: group symmetries in reinforcement learning. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 4199–4210. Cited by: §1, §5.
  • Y. Wu, G. Tucker, and O. Nachum (2019) Behavior regularized offline reinforcement learning. External Links: 1911.11361 Cited by: §1.
  • T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine (2019) Meta-world: a benchmark and evaluation for multi-task and meta reinforcement learning. arXiv preprint arXiv:1910.10897. Cited by: §1, §4.2, §4.
  • Y. Zhu, J. Wong, A. Mandlekar, and R. Martín-Martín (2020) Robosuite: a modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293. Cited by: §1, §4.2, §4.

Appendix A Proofs

In this section we provide the proofs of the theoretical results of Section 3.

Proof of Lemma 3.1 and Theorem 3.2

The proofs of the lemma as well as the theorem can be found in (Sinha et al., 2020).

Proof of Lemma 3.3:

We aim to show that the map of Lemma 3.3 defines a group action on the Koopman space of observables $\mathcal{F}$, where $G$ denotes the symmetry group of Definition 3. Firstly, let $e \in G$ be the identity element; then we see that it provides the existence of an identity element of the action by

Secondly, let $g, h \in G$ with $\circ$ denoting the group operation, i.e. $g \circ h \in G$. Then

where we have used the invertibility and the group property of $G$. Lastly, it follows analogously for the inverse $g^{-1} \in G$ that

Thus the existence of an inverse is established, which concludes the proof of the group property of the action.

Proof of Theorem 3.4:

We aim to show that, with $\mathcal{K}$ the Koopman operator associated with a $G$-action-equivariant system, the commutation relation of Theorem 3.4 holds, i.e.

First of all note that by the definition of the Koopman operator of non-affine control systems it obeys

Using the latter one thus infers that

where in (I) we have used that it is a $G$-action-equivariant system. Moreover, one finds that

which concludes the proof.
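Schematically, the commutation argument can be written out as follows; this is a hedged sketch using the notation assumed above (a family of Koopman operators $\mathcal{K}_a$ acting on state observables $\psi$), not necessarily the authors' exact derivation.

```latex
% Hedged sketch of the commutation computation (assumed notation:
% (\mathcal{K}_a \psi)(s) = \psi(F(s,a)) and (U_g \psi)(s) = \psi(g \cdot s)).
\begin{align*}
  (\mathcal{K}_a U_g \psi)(s)
    &= (U_g \psi)\big(F(s,a)\big)
     = \psi\big(g \cdot F(s,a)\big) \\
    &= \psi\big(F(g \cdot s, a)\big)      && \text{($G$-action-equivariance)} \\
    &= (\mathcal{K}_a \psi)(g \cdot s)
     = (U_g \mathcal{K}_a \psi)(s),
\end{align*}
% hence \mathcal{K}_a \circ U_g = U_g \circ \mathcal{K}_a for all g \in G and a \in \mathcal{A}.
```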

To proceed let us re-evaluate some information from the main text of this work. Let $T(a)$ be a family of invertible operators s.t. $T(a)^{-1}$ is a mapping to the (Banach) space of eigenfunctions of $\mathcal{K}(a)$, i.e. $T(a)$ diagonalizes the Koopman operator. The existence of such operators puts further restrictions on the Koopman operator in Eq. (12). However, as our algorithm employs the finite-dimensional approximation of the Koopman operator, i.e. its matrix representation $K$, this amounts simply to $K$ being diagonalizable, with $T$ the matrix containing its eigenvectors as columns. To develop a better understanding of the required criteria in the infinite-dimensional case we employ an alternative formulation of the so-called spectral theorem below.

Theorem A.1 (Spectral Theorem)

Let $A$ be a bounded self-adjoint operator on a Hilbert space $\mathcal{H}$. Then there is a measure space $(X, \Sigma, \mu)$, a real-valued essentially bounded measurable function $f$ on $X$, and a unitary operator $U: \mathcal{H} \to L^2(X, \mu)$, i.e. $U^{\dagger} U = U U^{\dagger} = \mathrm{id}$, such that

$U\, A\, U^{-1} = M_f, \quad \text{where } (M_f \varphi)(x) = f(x)\, \varphi(x) \text{ for } \varphi \in L^2(X, \mu) \qquad (22)$

In other words, every bounded self-adjoint operator is unitarily equivalent to a multiplication operator. In contrast to the finite-dimensional case we need to slightly alter our criteria accordingly, replacing diagonalisation by unitary equivalence to a multiplication operator. In conclusion, a sufficient condition for our criteria to hold in terms of operators on Hilbert spaces is that the Koopman operator is self-adjoint, i.e. that

$\mathcal{K}(a)^{\dagger} = \mathcal{K}(a) \qquad (23)$
Proof of Lemma 3.5:

We aim to show that the map given by

defines a group action on the Koopman space of observables $\mathcal{F}$, where $U_g$ is defined analogously to Lemma 3.3 but acting on the space of eigenfunctions instead of $\mathcal{F}$. First of all note that

(24)

We proceed analogously to the proof of Lemma 3.3 above. Firstly, let $e \in G$ be the identity element; then one infers from Eq. (24) that it provides the existence of an identity element of the action by

where we have used the notation

Secondly, let $g, h \in G$ with $\circ$ denoting the group operation, i.e. $g \circ h \in G$. Then

where we have used the invertibility and the group property of $G$. Lastly, it follows analogously for the inverse $g^{-1} \in G$ that

Thus the existence of an inverse is established, which concludes the proof of the group property of the action.

Proof of Theorem 3.6:

We aim to show two implications.

(⇒):

Firstly, that for a $G$-action-equivariant control system with a symmetry action as in Lemma 3.5, which furthermore admits a Koopman operator representation as

Then

(⇐):

Secondly, the converse. Namely, if a control system obeys Eqs. (14) and (15), then it is $G$-action-equivariant, i.e. $F(g \cdot s_t, a_t) = g \cdot F(s_t, a_t)$.

Let us start with the first implication, i.e. (⇒).

First of all note that by the definition of the Koopman operator it obeys

Using the latter one infers that

Moreover, one derives that

which concludes the proof of the first part of the theorem. Let us next show the converse implication, i.e. (⇐). For this case it is practical to use the discrete-time system notation explicitly. Let $g \in G$ be a symmetry of the state space and let $\tilde{s}_t$ and $\tilde{s}_{t+1}$ be the $g$-shifted states. Then

Thus in particular

where in (I) we have used that the symmetry operator commutes with the Koopman operator. Moreover, one finds that

from which one concludes that

Finally, we use the invertibility of the Koopman space observables to infer

Thus

(25)

which at last concludes our proof of the second part of the theorem.

Proof of Theorem 3.7:

Let $s_{t+1} = F(s_t, a_t)$ be a $G$-action-equivariant control system as in Theorem 3.6. For symmetry maps