# Towards Scalable Koopman Operator Learning: Convergence Rates and A Distributed Learning Algorithm

In this paper, we propose an alternating optimization algorithm to the nonconvex Koopman operator learning problem for nonlinear dynamic systems. We show that the proposed algorithm will converge to a critical point with rate O(1/T) or O(1/ T) under some mild assumptions. To handle the high dimensional nonlinear dynamical systems, we present the first-ever distributed Koopman operator learning algorithm. We show that the distributed Koopman operator learning has the same convergence properties as a centralized Koopman operator learning problem, in the absence of optimal tracker, so long as the basis functions satisfy a set of state-based decomposition conditions. Experiments are provided to complement our theoretical results.

## Authors

• 127 publications
• 3 publications
• 11 publications
• 5 publications
• ### Bridging data science and dynamical systems theory

This short review describes mathematical techniques for statistical anal...
02/18/2020 ∙ by Tyrus Berry, et al. ∙ 0

• ### On the Koopman operator of algorithms

A systematic mathematical framework for the study of numerical algorithm...
07/25/2019 ∙ by Felix Dietrich, et al. ∙ 0

• ### Successive Projection for Solving Systems of Nonlinear Equations/Inequalities

Solving large-scale systems of nonlinear equations/inequalities is a fun...
12/14/2020 ∙ by Wen-Jun Zeng, et al. ∙ 0

• ### On the Convergence of Local Descent Methods in Federated Learning

In federated distributed learning, the goal is to optimize a global trai...

• ### A Tutorial on Distributed (Non-Bayesian) Learning: Problem, Algorithms and Results

We overview some results on distributed learning with focus on a family ...
09/23/2016 ∙ by Angelia Nedić, et al. ∙ 0

• ### On the Convergence of Tsetlin Machines for the XOR Operator

The Tsetlin Machine (TM) is a novel machine learning algorithm with seve...
01/07/2021 ∙ by Lei Jiao, et al. ∙ 0

• ### SLSGD: Secure and Efficient Distributed On-device Machine Learning

We consider distributed on-device learning with limited communication an...
03/16/2019 ∙ by Cong Xie, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

In recent years, there is an increasing interest in transferring operator theoretic techniques such as Koopman operator [1, 2]

for the analysis of dynamical systems. Such operator based methods differ from classical approaches, in that they define the evolution of observable functions in a function space rather than using state vectors in a state space. The power of these operator theoretic methods is that it provides linear representations of nonlinear time-invariant systems, albeit in higher dimensional spaces that are sometimes countable or uncountable. Various numerical approaches, such as dynamic mode decomposition(DMD), Hankel-DMD, extended dynamic mode decomposition (E-DMD), structured dynamic mode decomposition (S-DMD) have been proposed for discovering the Koopman operator of a nonlinear system, using a series of dictionary functions with spanning or universal function approximation properties

[3, 4, 2, 5, 6]

. Recently, researchers have shown it is possible to integrate machine-driven learning representations with dynamic mode decomposition algorithms, using variational autoencoders to achieve phase-dependent representations of spectra

[7] or delay embeddings [8]

, shallow neural networks

[3]

, linearly recurrent neural networks for balancing expressiveness and overfitting

[9]

, and deep RELU feedforward networks for predictive modeling in biological and transmission systems

[10]. E-DMD [3] and Deep-DMD [10] have been utilized in various domains, including nonlinear system identification [11, 12, 13, 14], image processing [4, 15] and robotic control [16, 17].

Generally speaking, the learning especially the training phase of Koopman operator is trying to minimize the empirical loss based on the training set, e.g., the data sampled from the real trajectory of dynamic system. Compared to the traditional machine learning problem which learns the unknown mapping from input to output, the Koopman learning has two tasks: 1) Learning the function space that lifts state space to a high even infinite dimensional space. 2) Learning a linear mapping within that function space. These two tasks are highly related to each other, e.g., inappropriate function space learned will lead to poor learning performance even the linear mapping is perfect. However, to the best of our knowledge, the method of Koopman training has not gotten enough attention up to now. Another challenge is that, when parameterized function approximation such as neural network is used, the learning problem is nonconvex. For instance, even for single layer neutral network, it is NP-complete to find the global optimal

[18]. However, recent works [19, 20, 21] show that for over-parameterized (wide) shallow neural networks, local optima provide satisfactory performance. Specifically, they show that every local optimum is global optimum if the hidden layer is non-singular; every local minimum of the simplified objective is close to the global minimum. In this paper, we contribute a proof of convergence for Koopman learning algorithms utilizing shallow neural networks, and derive conditions for first-order optimality, the properties of the so-called dictionary functions used in deep and E-DMD that guarantee convergence. We propose alternating optimization algorithm with an optimal tracker for training the Koopman operator. By proving the objective function’s smoothness property, we show that our algorithm admits convergence rate for chosen constant learning rate and for diminishing learning rate. We illustrate convergence of the alternating optimization algorithm for single-node training (non-distributed) on two nonlinear systems with oscillatory dynamics.

A second major contribution of this paper is the development of a distributed Koopman operator learning algorithm. Most Koopman operator learning algorithms operate under the assumption of full-state measurements. Frequently, in engineered and natural systems represented by data, full-state measurements are not available, or are too expensive to collect. For example, power distribution networks consisting of hundreds of thousands of nodes exhibit real-time dynamics on systems that are poorly modeled, calibrated, or dated. Biological networks operate on thousands of genes to generate transcriptomic reponse profiles as a function of time; full-state measurement via deep sequencing is prohibitively expensive. In many instances, it is much more feasible to collect measurements from select locations, via strategic placement of observers [22]

, which gives rise to a different form of data — time-series data that is spatially distributed or fragmented across the whole network. We address the challenge of training distributed representations of Koopman operators and develop a

distributed Koopman learning algorithm, proving asymptotic convergence, and illustrating predictive accuracy and convergence on several simulated examples.

The rest of the paper is organized as follows. Section II introduces the Koopman operator learning problem. Section III describes our alternating optimization algorithm for Koopman learning and proves the convergence. Section IV extends our algorithm to a distributed version and shows its convergence. Section V shows the performance of two algorithms validated by two nonlinear systems. Section VI concludes this paper.

## Ii Koopman operator Learning Problem

In this paper, we consider a discrete time open-loop nonlinear dynamic system of the following form

 xn+1 =f(xn), (1) yn =h(xn), (2)

where and are continuously differentiable. The function is the state-space model and the function maps current state to a vector of observables or output . The Koopman operator of system (1), if it exists, is a linear operator that acts on observable functions and forward propagates them in time. To be more specific, the Koopman operator for this system must satisfy the equation

 ψ(xn+1) =K(ψ(xn)), (3) yn =H(ψ(xn)), (4)

where is a basis function that defines the lifted space of observables and is a constant matrix. Based on the Koopman operator theory, is the basis function of observables under which is -invariant for all . This implies that the Koopman operator comprehensively interprets the flow of the observable trajectory .

Based on the data-driven method [10] [23], a general model for approximating Koopman operator given the data trajectory can be formulated as follows:

 minψ,K D(ψ,K):=12NN∑i=1∥ψ(xi+1)−Kψ(xi)∥22. (5)

The above model aims to minimize the empirical loss from the learning perspective. One can slightly change the objective function by adding certain regularized term, e.g., for sparse operator or for avoiding large training lose, to make the tradeoff between the training and generalization error.

While there has been a surge of interest in using neural networks to perform Koopman learning, little is known regarding the convergence and numerical stability of the training process. This motivates us to investigate the property of Koopman learning during its training phase. There are two challenges in solving for the optimization problem (5) in practice. First, the basis function is unknown. This makes it difficult to ascertain what functions and how many functions to include, let alone the minimal number of functions, to ensure -invariant. Recently, EDMD [23] uses an expansive set of orthonomal polynomial basis functions, but this approach does not scale well and suffers from overfitting with an increasing number of basis functions. Deep-DMD [10] adopts the neutral networks to approximate the basis function based on universal approximation theorem, but it lacks the theoretical guarantee, e.g., the stability and convergence. Second, the objective function is nonconvex. Therefore it is unrealistic to expect an algorithm to converge to global minima.

Here we focus on the basis function based on parametric method, specifically, . A typical example is a fully connected one-layer neutral network since for wide shallow neutral network, local optima provide satisfactory under some mild conditions [19, 20, 21], where

is the layer parameter and activation function. With the parametric basis method, problem (

5) becomes

 (6)

Although this problem is nonconvex, there are some interesting structures. For example, if we fix the parameter of the basis function, optimizing is a quadratic problem that finds the linear mapping to . On the other hand, with fixed , optimizing is to adjust the parameter

to find the function space that satisfies the linear transformation mapping (this is still nonconvex but will reduce the complexity a lot). We thus consider the algorithm that alternatively optimize over

and .

## Iii Alternating Optimization Algorithm

In this section, we first states our alternating algorithm and then investigate its convergence properties. Let and denote by the Frobenius norm. The detail of the algorithm is shown in Algorithm 1. Here measures how far the gradient is from that at the critical point and track the best parameters so far. We make the following assumptions.

###### Assumption 1

The function is bounded and has a bounded gradient and Hessian.

###### Assumption 2

The parameters and are bounded, e.g., there exist two constant and such that and .

Assumption 1 looks strong. However, it holds for several popular activation functions such as logistic function (), hyperbolic tangent (), and inverse hyperbolic tangent (. By Assumptions 1 and 2, one can verify that the objective function is bounded, i.e., there exists a constant such that . We can show that has Lipschitz-continuous gradient with respect to the parameter of basis functions.

###### Lemma 1

Under Assumptions 1 and 2 and given the data trajectory , we have

 ∥∇WF(W1,K)−∇WF(W2,K)∥F≤LW∥W1−W2∥F

with

 LW=√2dUKLΨ∑Ni=1∥xi∥2ΔiN,

where and is the Lipschitz constant for the function .

First, denote by the -th column of matrix , the -th row of matrix , and the -th dimension of . We can compute the element of as:

 ∇WF(W,K)[j,k] = 1NN∑i=1−(ψ(Wxi+1)−Kψ(Wxi))TK[:,j]ψ′(Wxi)xi[k],

and as:

 ∇WF(W,K) = −1NN∑i=1KT(ψ(Wxi+1)−Kψ(Wxi))⊙ψ′(Wxi)αWixTi = −1NN∑i=1KTαWixTi,

where denotes the element-wise production. We can then write the gradient difference with respect to and as

 ∥∇WF(W1,K)−∇WF(W2,K)∥F = 1NN∑i=1∥KT(αW1i−αW2i)xTi∥F ≤ 1NN∑i=1∥K∥F∥(αW1i−αW2i)xTi∥F ≤ 1NN∑i=1∥K∥F∥αW1i−αW2i∥2∥xi∥2.

So if we can show that is Lipschitz-continuous, the proof is done. We have

 αW1i[j]−αW2i[j] = (ψ(W1jxi+1)ψ′(W1jxi)−ψ(W2jxi+1)ψ′(W2jxi))βij −(Kjψ(W1xi)ψ′(W1jxi)−Kjψ(W2xi)ψ′(W2jxi)γij).

Consider function in and its gradient By Assumption 1, is bounded by some constant, denoted by . Let , we can bound as follows:

 ∥βi∥22 =∥ψ(W1xi+1)⊙ψ′(W1xi)−ψ(W2yi)⊙ψ′(W2xi)∥22 ≤2L2Ψ(∥W1xi+1−W2xi+1∥22+∥W1xi−W2xi∥22) ≤2L2Ψ(∥xi∥22+∥xi+1∥22)∥W1−W2∥2F,

where the last inequality is due to Cauchy-Schwarz inequality. Similarly, we can bound as follows:

 ∥γi∥22 ≤n∑j=1∥Kj∥22L2Ψ(d∑k=1(W1kxi−W2kxi)2+(W1jxi−W2jxi)2) ≤(dU2K+d2∥Kmaxj∥22)L2Ψ∥W1xi−W2xi∥22 ≤(dU2K+d2∥Kmaxj∥22)L2Ψ∥xi∥22∥W1−W2∥2F ≤2dU2KL2Ψ∥xi∥22∥W1−W2∥2F,

where the second inequality is by Assumption 2. Combining the above results, we have

 ∥∇WF(W1,K)−∇WF(W2,K)∥F ≤ 1NN∑i=1∥K∥F∥αW1i−αW2i∥2∥xi∥2 ≤ √2d∥K∥FLΨ∑Ni=1∥xi∥2ΔiN∥W1−W2∥F,

where . Similarly, has Lipschitz-continuous gradient with respect to the parameter of the linear mapping.

###### Lemma 2

Under Assumption 1 and assume that the basis function is bounded by , we have

 ∥∇KF(W,K1)−∇KF(W,K2)∥F≤LK∥K1−K2∥F

with .

 ∇KF(W,K)=1NN∑i=1(Kψ(Wxi)−ψ(Wxi+1))ψ(Wxi)T,

and

 ∥∇KF(W,K1)−∇KF(W,K2)∥F =1N∥N∑i=1(K1−K2)ψ(Wxi)ψ(Wxi)T∥F ≤1N∥K1−K2∥F∥N∑i=1ψ(Wxi)ψ(Wxi)T∥F ≤1N∥K1−K2∥F√d2(Nh2)2 =dh2∥K1−K2∥F.

With Lemmas 1 and 2, we now show that Algorithm 1 will converge to a critical point with convergence rate or .

###### Theorem 1

Under Assumptions 1 and 2, Algorithm 1 for Koopman operator learning will converge to a critical point. With constant learning rate , its convergence rate is ; and with diminishing learning rate , its convergence rate is .

Since the objective function is Lipschitz gradient continuous with respect to , the descent lemma [24] can be applied and we have

 F(Wt,Kt+1) ≤ F(Wt,Kt)+tr(∇KF(Wt,Kt)T(Kt+1−Kt)) +LK2∥Kt+1−Kt∥2F = F(Wt,Kt)−ηKtr(∇KF(Wt,Kt)T∇KF(Wt,Kt)) +LK2∥Kt+1−Kt∥2F = F(Wt,Kt)+(η2KLK2−ηK)∥∇KF(Wt,Kt)∥2F, (7)

where denotes the trace of the matrix. The first equality is due to the gradient update of and the second equality is by the fact that .

As for the basis function’s parameter , we can have the similar result since the objective function is Lipschitz gradient continuous with respect to .

 F(Wt+1,Kt+1) ≤ F(Wt,Kt+1)+(η2WLW2−ηW)∥∇WF(Wt,Kt+1)∥2F. (8)

So by the equation (7) and (8), we have the following for each complete update from .

 F(Wt+1,Kt+1) ≤ F(Wt,Kt)+(η2KLK2−ηK)∥∇KF(Wt,Kt)∥2F +(η2WLW2−ηW)∥∇WF(Wt,Kt+1)∥2F. (9)

We sum both sides of inequality (9) from and obtain

 F(WT+1,KT+1) ≤ F(W0,K0)+T∑t=0(η2KLK2−ηK)∥∇KF(Wt,Kt)∥2F +T∑t=0(η2WLW2−ηW)∥∇WF(Wt,Kt+1)∥2F (10)

(1) Constant learning rate

If we choose the constant stepsize, e.g., , and let , , and we can bound the gradients as follows.

 T∑t=0∥∇KF(Wt,Kt)∥2F+∥∇WF(Wt,Kt+1)∥2F ≤ (F(W0,K0)−F(WT+1,KT+1))S. (11)

One can see that, each term on the right is non-negative and their summation is bounded by some constant. Based on current analysis, we can conclude that the alternating optimization algorithm will converge asymptotically to one critical point even without optimal tracker and when

Based on inequality (11), one can bound the minimum gradients up to for Algorithm 1.

 mint=0,⋯,T∥∇KF(Wt,Kt)∥2F+∥∇WF(Wt,Kt+1)∥2F ≤ (F(W0,K0)−F(WT+1,KT+1))ST≤2RST

(2) Diminishing learning rate

If we choose the diminishing learning rate, e.g, , the result becomes

 mint=0,⋯,T∥∇KF(Wt,Kt)∥2F+∥∇WF(Wt,Kt+1)∥2F ≤ (F(W0,K0)−F(WT+1,KT+1))∑Tt=0(ηt−L(ηt)22).

We know

 T∑t=0(ηt−L(ηt)22)=T∑t=0(1t+1−L2(t+1)2) ≥ ln(T+2)−L2−T∑t=1L2t(t+1)=ln(T+2)−L+L2(T+1). (12)

So for diminishing stepsize, we can obtain

 mint=0,⋯,T∥∇KF(Wt,Kt)∥2F+∥∇WF(Wt,Kt+1)∥2F ≤ 2RO(lnT).

So we can see for this problem, the constant stepsize has the better convergence rate than the diminishing stepsize with the help of optimal tracker. Both cases show that

 ∥∇KF(W∗,K∗)∥2F+∥∇WF(W∗,K∗)∥2F→0.

## Iv Distributed Koopman Learning

We now develop an algorithm to treat the learning problem for Koopman operators of high dimensional nonlinear dynamical systems. Even if there are only a thousand states in the underlying nonlinear system, the dimension of the dictionary functions explodes exponentially with the number of states. Memory constraints thus make it infeasible to train a Koopman operator using a centralized or stand-alone computing node. This motivates the derivation of a scalable, distributed approximation algorithm to relieve this problem.

###### Assumption 3

The basis function can be decomposed or approximated by where is the new basis function for and is a subset of with .

Based on Assumption 3, we can reformulate the centralized Koopman objective function as

 F(W,K) = 12NN∑i=1∥∥ ∥ ∥∥⎡⎢ ⎢ ⎢⎣ψ1(x1i+1;W1)⋮ψq(xqi+1;Wq)⎤⎥ ⎥ ⎥⎦−⎡⎢ ⎢ ⎢⎣K11⋯K1q⋮  ⋱  ⋮Kq1⋯Kqq⎤⎥ ⎥ ⎥⎦⎡⎢ ⎢⎣ψ1(x1i;W1)⋮ψq(xqi;Wq)⎤⎥ ⎥⎦∥∥ ∥ ∥∥2F

Our distributed Koopman learning’s structure is as follows. Denote by the set of computation nodes which can communicate with each other. For each computation node , it only store part of the data set , its corresponding row and column of Koopman operator and its basis function .

For node , its gradient will compose two parts. The first part can be calculated based on its own knowledge. Another part needs the information from other nodes. We first define by the error term for node with data point , where

 eij=ψi(xij+1;Wi)−[Ki1,⋯,Kiq]⎡⎢ ⎢ ⎢⎣ψ1(x1j;W1)⋮ψq(xqj;Wq)⎤⎥ ⎥ ⎥⎦,

and define by the Jacobi matrix of function , we then have the following distributed Koopman learning algorithm shown in Algorithm 2.

For our distributed Koopman operator learning Algorithm 2, line 6-8 is the communication stage, each computation node calculates its result in the lifted dimensional space and broadcast within our communication network. After the communication, the information is enough to compute local error term , and node send to node (line 8). Here the communication stage ends and computation stage (line 9-11) begins. will sum up all the information for each data point. The last is update stage with gradient descent method (line 14-15). Based on this distributed algorithm and Assumption 3, we can prove this is equivalent to the centralized gradient descent algorithm.

###### Lemma 3

Under Assumption 3, the distributed Koopman learning in Algorithm 2 is equivalent to the following update:

 Kt+1 =Kt−ηK∇KF(Wt,Kt), (13) Wt+1 =Wt−ηW∇WF(Wt,Kk).

Based on line 9-11 in Algorithm 2, one can verify the following after updating all the data points

 Ai =N∑j=1J(ψi(xij+1;Wi))⊤eij, Bi =−N∑jJ(ψi(xi;Wi))⊤∑k∈QKTiveij, Ci =N∑jeijvec[S1j,⋯,Sqj].

Compared to gradients of , we can find that

 1N(Ai+Bi) =∇WF(Wt,Kk), 1NCi =∇KF(Wt,Kt).

So the update stage (line 14-15) is the same with equation (13) which finishes the proof.

###### Remark 1

Our alternating Koopman operator learning (Algorithm 1) can be regarded as nonlinear Gauss-Seidel iterations [25], while our distributed Koopman operator learning lies in the model of nonlinear Jacobi iteration [25]. Here we choose nonlinear Jacobi iteration for distributed Koopman operator learning due to that nonlinear Jacobi iteration is (1) suitable for parallel computation and (2) with less communication overhead.

By lemma 3, we can get the convergence result for our distributed Koopman operator learning.

###### Theorem 2

Under Assumption 1, 2 and 3, the distributed Koopman operator learning based on Algorithm 2 will converge to one critical point asymptotically.

The equation (13) is the case without optimal tracker of Algorithm 1. The proof of theorem 1 can be applied directly here (equation (11)). The advantages of distributed Koopman learning over centralized one are not only the scalability, e.g., the ability to handle the high dimensional nonlinear dynamics, but also the feasibility to adjust to different complexity of the partial observations. For example, if one partial observation is with complexity dynamic, we can increase the number of basis function.

On the other hand, algorithm 2 for distributed Koopman learning is under the ideal synchronous model. Although the computation of each node is parallel, the computation will not start until the broadcast process finishes. This can harm heavily on the efficiency of the distributed Koopman learning due to some reason, e.g., if one node has a very low link speed, all the other need wait for this node. Also, one packet loss will lead to all nodes waiting for the resending. However, it is easily to extend Algorithm 2 to handle asynchronous model as displayed in Algorithm 3. Each node will store the received information in its memory keeping updating when new information comes. once the computation node comes to computation stage, it will directly use the information stored in the memory instead of waiting for the newest information.

For the asynchronous version of gradient descent algorithm, [26] (Theorem 2), [27] (Proposition 2.1), [28] (Theorem 7) e.t.c. show that synchronous and asynchronous algorithm will converge to the same point once the communication delay is bounded. Their proof can be applied to our case with slight change.

###### Lemma 4

([26],[27],[28]) If the communication delay is bounded by some constant, with small enough stepsize, asynchronous algorithm 3 will asymptotically converge to the same point with synchronous one.

## V Experiments

In our experiment, we evaluate the performance of our alternating optimization and distributed algorithms respectively. At each experiment, we sample some points from the real trajectory to prepare the training set and prediction set. Note that our prediction phase is multi-step prediction, e.g, given one initial state, our algorithm will predict the following trajectory using the -invariant property:

To evaluate the performance of alternating optimization, we consider Van der Pol oscillator shown in Example 1.

###### Example 1

Van der Pol oscillator

 ˙x1 =μ(x1−13x31−x2) (14) ˙x2 =1μx1 (15)

In this example, we choose . The number of date points we sampled is 600 with 400 points for training and 200 for prediction. We construct a very simple network with one layer and 3 dimensions to learn the pattern of Van der Pol oscillator. The total training time is around 1.08s (i7-8700 CPU @ 3.20 GHz, 8 GB RAM) with 500 iterations and constant stepsize is 0.23. Fig. 1 shows the multi-step prediction result with alternating optimization method. One step prediction error is around and 200 step prediction error is around .

###### Example 2

Glycolytic pathway

 ˙x1 =J−k1x1x61+(x6k1)q (16) ˙x2 =2k1x1x61+(x6k1)q−k2x2(n−x5)−k6x2