Towards Scalable Koopman Operator Learning: Convergence Rates and A Distributed Learning Algorithm

09/30/2019 ∙ by Zhiyuan Liu, et al. ∙ University of Colorado Boulder The Regents of the University of California 0

In this paper, we propose an alternating optimization algorithm to the nonconvex Koopman operator learning problem for nonlinear dynamic systems. We show that the proposed algorithm will converge to a critical point with rate O(1/T) or O(1/ T) under some mild assumptions. To handle the high dimensional nonlinear dynamical systems, we present the first-ever distributed Koopman operator learning algorithm. We show that the distributed Koopman operator learning has the same convergence properties as a centralized Koopman operator learning problem, in the absence of optimal tracker, so long as the basis functions satisfy a set of state-based decomposition conditions. Experiments are provided to complement our theoretical results.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In recent years, there is an increasing interest in transferring operator theoretic techniques such as Koopman operator [1, 2]

for the analysis of dynamical systems. Such operator based methods differ from classical approaches, in that they define the evolution of observable functions in a function space rather than using state vectors in a state space. The power of these operator theoretic methods is that it provides linear representations of nonlinear time-invariant systems, albeit in higher dimensional spaces that are sometimes countable or uncountable. Various numerical approaches, such as dynamic mode decomposition(DMD), Hankel-DMD, extended dynamic mode decomposition (E-DMD), structured dynamic mode decomposition (S-DMD) have been proposed for discovering the Koopman operator of a nonlinear system, using a series of dictionary functions with spanning or universal function approximation properties

[3, 4, 2, 5, 6]

. Recently, researchers have shown it is possible to integrate machine-driven learning representations with dynamic mode decomposition algorithms, using variational autoencoders to achieve phase-dependent representations of spectra

[7] or delay embeddings [8]

, shallow neural networks

[3]

, linearly recurrent neural networks for balancing expressiveness and overfitting

[9]

, and deep RELU feedforward networks for predictive modeling in biological and transmission systems

[10]. E-DMD [3] and Deep-DMD [10] have been utilized in various domains, including nonlinear system identification [11, 12, 13, 14], image processing [4, 15] and robotic control [16, 17].

Generally speaking, the learning especially the training phase of Koopman operator is trying to minimize the empirical loss based on the training set, e.g., the data sampled from the real trajectory of dynamic system. Compared to the traditional machine learning problem which learns the unknown mapping from input to output, the Koopman learning has two tasks: 1) Learning the function space that lifts state space to a high even infinite dimensional space. 2) Learning a linear mapping within that function space. These two tasks are highly related to each other, e.g., inappropriate function space learned will lead to poor learning performance even the linear mapping is perfect. However, to the best of our knowledge, the method of Koopman training has not gotten enough attention up to now. Another challenge is that, when parameterized function approximation such as neural network is used, the learning problem is nonconvex. For instance, even for single layer neutral network, it is NP-complete to find the global optimal

[18]. However, recent works [19, 20, 21] show that for over-parameterized (wide) shallow neural networks, local optima provide satisfactory performance. Specifically, they show that every local optimum is global optimum if the hidden layer is non-singular; every local minimum of the simplified objective is close to the global minimum. In this paper, we contribute a proof of convergence for Koopman learning algorithms utilizing shallow neural networks, and derive conditions for first-order optimality, the properties of the so-called dictionary functions used in deep and E-DMD that guarantee convergence. We propose alternating optimization algorithm with an optimal tracker for training the Koopman operator. By proving the objective function’s smoothness property, we show that our algorithm admits convergence rate for chosen constant learning rate and for diminishing learning rate. We illustrate convergence of the alternating optimization algorithm for single-node training (non-distributed) on two nonlinear systems with oscillatory dynamics.

A second major contribution of this paper is the development of a distributed Koopman operator learning algorithm. Most Koopman operator learning algorithms operate under the assumption of full-state measurements. Frequently, in engineered and natural systems represented by data, full-state measurements are not available, or are too expensive to collect. For example, power distribution networks consisting of hundreds of thousands of nodes exhibit real-time dynamics on systems that are poorly modeled, calibrated, or dated. Biological networks operate on thousands of genes to generate transcriptomic reponse profiles as a function of time; full-state measurement via deep sequencing is prohibitively expensive. In many instances, it is much more feasible to collect measurements from select locations, via strategic placement of observers [22]

, which gives rise to a different form of data — time-series data that is spatially distributed or fragmented across the whole network. We address the challenge of training distributed representations of Koopman operators and develop a

distributed Koopman learning algorithm, proving asymptotic convergence, and illustrating predictive accuracy and convergence on several simulated examples.

The rest of the paper is organized as follows. Section II introduces the Koopman operator learning problem. Section III describes our alternating optimization algorithm for Koopman learning and proves the convergence. Section IV extends our algorithm to a distributed version and shows its convergence. Section V shows the performance of two algorithms validated by two nonlinear systems. Section VI concludes this paper.

Ii Koopman operator Learning Problem

In this paper, we consider a discrete time open-loop nonlinear dynamic system of the following form

(1)
(2)

where and are continuously differentiable. The function is the state-space model and the function maps current state to a vector of observables or output . The Koopman operator of system (1), if it exists, is a linear operator that acts on observable functions and forward propagates them in time. To be more specific, the Koopman operator for this system must satisfy the equation

(3)
(4)

where is a basis function that defines the lifted space of observables and is a constant matrix. Based on the Koopman operator theory, is the basis function of observables under which is -invariant for all . This implies that the Koopman operator comprehensively interprets the flow of the observable trajectory .

Based on the data-driven method [10] [23], a general model for approximating Koopman operator given the data trajectory can be formulated as follows:

(5)

The above model aims to minimize the empirical loss from the learning perspective. One can slightly change the objective function by adding certain regularized term, e.g., for sparse operator or for avoiding large training lose, to make the tradeoff between the training and generalization error.

While there has been a surge of interest in using neural networks to perform Koopman learning, little is known regarding the convergence and numerical stability of the training process. This motivates us to investigate the property of Koopman learning during its training phase. There are two challenges in solving for the optimization problem (5) in practice. First, the basis function is unknown. This makes it difficult to ascertain what functions and how many functions to include, let alone the minimal number of functions, to ensure -invariant. Recently, EDMD [23] uses an expansive set of orthonomal polynomial basis functions, but this approach does not scale well and suffers from overfitting with an increasing number of basis functions. Deep-DMD [10] adopts the neutral networks to approximate the basis function based on universal approximation theorem, but it lacks the theoretical guarantee, e.g., the stability and convergence. Second, the objective function is nonconvex. Therefore it is unrealistic to expect an algorithm to converge to global minima.

Here we focus on the basis function based on parametric method, specifically, . A typical example is a fully connected one-layer neutral network since for wide shallow neutral network, local optima provide satisfactory under some mild conditions [19, 20, 21], where

is the layer parameter and activation function. With the parametric basis method, problem (

5) becomes

(6)

Although this problem is nonconvex, there are some interesting structures. For example, if we fix the parameter of the basis function, optimizing is a quadratic problem that finds the linear mapping to . On the other hand, with fixed , optimizing is to adjust the parameter

to find the function space that satisfies the linear transformation mapping (this is still nonconvex but will reduce the complexity a lot). We thus consider the algorithm that alternatively optimize over

and .

Iii Alternating Optimization Algorithm

In this section, we first states our alternating algorithm and then investigate its convergence properties. Let and denote by the Frobenius norm. The detail of the algorithm is shown in Algorithm 1. Here measures how far the gradient is from that at the critical point and track the best parameters so far. We make the following assumptions.

1 Initialization: randomly initialize and , ,,.
2 while Not Converge do
3      
4      
5      
6       if   then
7            
8       end if
9      
10 end while
Algorithm 1 Alternating Operator Koopman Learning With Tracking
Assumption 1

The function is bounded and has a bounded gradient and Hessian.

Assumption 2

The parameters and are bounded, e.g., there exist two constant and such that and .

Assumption 1 looks strong. However, it holds for several popular activation functions such as logistic function (), hyperbolic tangent (), and inverse hyperbolic tangent (. By Assumptions 1 and 2, one can verify that the objective function is bounded, i.e., there exists a constant such that . We can show that has Lipschitz-continuous gradient with respect to the parameter of basis functions.

Lemma 1

Under Assumptions 1 and 2 and given the data trajectory , we have

with

where and is the Lipschitz constant for the function .

First, denote by the -th column of matrix , the -th row of matrix , and the -th dimension of . We can compute the element of as:

and as:

where denotes the element-wise production. We can then write the gradient difference with respect to and as

So if we can show that is Lipschitz-continuous, the proof is done. We have

Consider function in and its gradient By Assumption 1, is bounded by some constant, denoted by . Let , we can bound as follows:

where the last inequality is due to Cauchy-Schwarz inequality. Similarly, we can bound as follows:

where the second inequality is by Assumption 2. Combining the above results, we have

where . Similarly, has Lipschitz-continuous gradient with respect to the parameter of the linear mapping.

Lemma 2

Under Assumption 1 and assume that the basis function is bounded by , we have

with .

The gradient

and

With Lemmas 1 and 2, we now show that Algorithm 1 will converge to a critical point with convergence rate or .

Theorem 1

Under Assumptions 1 and 2, Algorithm 1 for Koopman operator learning will converge to a critical point. With constant learning rate , its convergence rate is ; and with diminishing learning rate , its convergence rate is .

Since the objective function is Lipschitz gradient continuous with respect to , the descent lemma [24] can be applied and we have

(7)

where denotes the trace of the matrix. The first equality is due to the gradient update of and the second equality is by the fact that .

As for the basis function’s parameter , we can have the similar result since the objective function is Lipschitz gradient continuous with respect to .

(8)

So by the equation (7) and (8), we have the following for each complete update from .

(9)

We sum both sides of inequality (9) from and obtain

(10)

(1) Constant learning rate

If we choose the constant stepsize, e.g., , and let , , and we can bound the gradients as follows.

(11)

One can see that, each term on the right is non-negative and their summation is bounded by some constant. Based on current analysis, we can conclude that the alternating optimization algorithm will converge asymptotically to one critical point even without optimal tracker and when

Based on inequality (11), one can bound the minimum gradients up to for Algorithm 1.

(2) Diminishing learning rate

If we choose the diminishing learning rate, e.g, , the result becomes

We know

(12)

So for diminishing stepsize, we can obtain

So we can see for this problem, the constant stepsize has the better convergence rate than the diminishing stepsize with the help of optimal tracker. Both cases show that

Iv Distributed Koopman Learning

We now develop an algorithm to treat the learning problem for Koopman operators of high dimensional nonlinear dynamical systems. Even if there are only a thousand states in the underlying nonlinear system, the dimension of the dictionary functions explodes exponentially with the number of states. Memory constraints thus make it infeasible to train a Koopman operator using a centralized or stand-alone computing node. This motivates the derivation of a scalable, distributed approximation algorithm to relieve this problem.

Assumption 3

The basis function can be decomposed or approximated by where is the new basis function for and is a subset of with .

Based on Assumption 3, we can reformulate the centralized Koopman objective function as

Our distributed Koopman learning’s structure is as follows. Denote by the set of computation nodes which can communicate with each other. For each computation node , it only store part of the data set , its corresponding row and column of Koopman operator and its basis function .

For node , its gradient will compose two parts. The first part can be calculated based on its own knowledge. Another part needs the information from other nodes. We first define by the error term for node with data point , where

and define by the Jacobi matrix of function , we then have the following distributed Koopman learning algorithm shown in Algorithm 2.

1 Initialization:
2    node randomly initilizes its , .
3 while Not Converge do
4      
5       for  do
6             node calculates ,.
7             node broadcasts ,.
8             node calculates , sends to node , .
9            
10             .
11            
12       end for
13      for   do
14             .
15            
16       end for
17      
18 end while
Algorithm 2 Distributed Koopman Operator Learning

For our distributed Koopman operator learning Algorithm 2, line 6-8 is the communication stage, each computation node calculates its result in the lifted dimensional space and broadcast within our communication network. After the communication, the information is enough to compute local error term , and node send to node (line 8). Here the communication stage ends and computation stage (line 9-11) begins. will sum up all the information for each data point. The last is update stage with gradient descent method (line 14-15). Based on this distributed algorithm and Assumption 3, we can prove this is equivalent to the centralized gradient descent algorithm.

Lemma 3

Under Assumption 3, the distributed Koopman learning in Algorithm 2 is equivalent to the following update:

(13)

Based on line 9-11 in Algorithm 2, one can verify the following after updating all the data points

Compared to gradients of , we can find that

So the update stage (line 14-15) is the same with equation (13) which finishes the proof.

Remark 1

Our alternating Koopman operator learning (Algorithm 1) can be regarded as nonlinear Gauss-Seidel iterations [25], while our distributed Koopman operator learning lies in the model of nonlinear Jacobi iteration [25]. Here we choose nonlinear Jacobi iteration for distributed Koopman operator learning due to that nonlinear Jacobi iteration is (1) suitable for parallel computation and (2) with less communication overhead.

By lemma 3, we can get the convergence result for our distributed Koopman operator learning.

Theorem 2

Under Assumption 1, 2 and 3, the distributed Koopman operator learning based on Algorithm 2 will converge to one critical point asymptotically.

The equation (13) is the case without optimal tracker of Algorithm 1. The proof of theorem 1 can be applied directly here (equation (11)). The advantages of distributed Koopman learning over centralized one are not only the scalability, e.g., the ability to handle the high dimensional nonlinear dynamics, but also the feasibility to adjust to different complexity of the partial observations. For example, if one partial observation is with complexity dynamic, we can increase the number of basis function.

On the other hand, algorithm 2 for distributed Koopman learning is under the ideal synchronous model. Although the computation of each node is parallel, the computation will not start until the broadcast process finishes. This can harm heavily on the efficiency of the distributed Koopman learning due to some reason, e.g., if one node has a very low link speed, all the other need wait for this node. Also, one packet loss will lead to all nodes waiting for the resending. However, it is easily to extend Algorithm 2 to handle asynchronous model as displayed in Algorithm 3. Each node will store the received information in its memory keeping updating when new information comes. once the computation node comes to computation stage, it will directly use the information stored in the memory instead of waiting for the newest information.

1 lines 1-7 of Algorithm 2.
2 node calculates based on the current in the memory, sends to node , .
3
4 based on the current in the memory, .
5 based on the current in the memory, .
lines 13-16 of Algorithm 2.
Algorithm 3 Asynchronous Distributed Koopman Learning

For the asynchronous version of gradient descent algorithm, [26] (Theorem 2), [27] (Proposition 2.1), [28] (Theorem 7) e.t.c. show that synchronous and asynchronous algorithm will converge to the same point once the communication delay is bounded. Their proof can be applied to our case with slight change.

Lemma 4

([26],[27],[28]) If the communication delay is bounded by some constant, with small enough stepsize, asynchronous algorithm 3 will asymptotically converge to the same point with synchronous one.

V Experiments

In our experiment, we evaluate the performance of our alternating optimization and distributed algorithms respectively. At each experiment, we sample some points from the real trajectory to prepare the training set and prediction set. Note that our prediction phase is multi-step prediction, e.g, given one initial state, our algorithm will predict the following trajectory using the -invariant property:

To evaluate the performance of alternating optimization, we consider Van der Pol oscillator shown in Example 1.

Example 1

Van der Pol oscillator

(14)
(15)

In this example, we choose . The number of date points we sampled is 600 with 400 points for training and 200 for prediction. We construct a very simple network with one layer and 3 dimensions to learn the pattern of Van der Pol oscillator. The total training time is around 1.08s (i7-8700 CPU @ 3.20 GHz, 8 GB RAM) with 500 iterations and constant stepsize is 0.23. Fig. 1 shows the multi-step prediction result with alternating optimization method. One step prediction error is around and 200 step prediction error is around .

Fig. 1: Alternating optimization for centralized Koopman operator learning with Van der Pol oscillator. In this experiments, only the points at time 0 are given. All the data points [1-200] are our predictions with Koopman learning.
Example 2

Glycolytic pathway

(16)