
# Distributed Online System Identification for LTI Systems Using Reverse Experience Replay

Identification of linear time-invariant (LTI) systems plays an important role in control and reinforcement learning. Both asymptotic and finite-time offline system identification are well studied in the literature. For online system identification, the idea of stochastic-gradient descent with reverse experience replay (SGD-RER) was recently proposed, where the data sequence is stored in several buffers and the stochastic-gradient descent (SGD) update is performed backward in each buffer to break the time dependency between data points. Inspired by this work, we study distributed online system identification of LTI systems over a multi-agent network. We consider agents as identical LTI systems, and the network goal is to jointly estimate the system parameters by leveraging the communication between agents. We propose DSGD-RER, a distributed variant of the SGD-RER algorithm, and theoretically characterize the improvement of the estimation error with respect to the network size. Our numerical experiments confirm the reduction of the estimation error as the network size grows.


## I Introduction

System identification, the process of estimating the unknown parameters of a dynamical system from the observed input-output sequence, is a classical problem in control, reinforcement learning, and time-series analysis. Among this class of problems, learning the transition matrix of an LTI system is a prominent, well-studied case, and classical results characterize the asymptotic properties of these estimators [aastrom1971system, ljung1999system, chen2012identification, goodwin1977dynamic].

Recently, there has been a renewed interest in the problem of identification of LTI systems, and modern statistical techniques are applied to achieve finite-time sample complexity guarantees [dean2019sample, jedra2020finite, faradonbeh2018finite, simchowitz2018learning, sarkar2019near]. However, the aforementioned works focus on the offline setup, where the estimator has access to the entire data sequence from the outset. The offline estimator cannot be directly extended to the streaming/online setup, where the system parameters need to be estimated on-the-fly. To this end, the idea of stochastic-gradient descent with reverse experience replay (SGD-RER) was proposed [kowshik2021streaming] to build an online estimator, where the data sequence is stored in several buffers, and the SGD update is performed backward in each buffer to break the time dependency between data points. This online method achieves the optimal sample complexity of the offline setup up to log factors.

In this paper, we study the distributed online identification problem for a network of identical LTI systems with unknown dynamics. Each system is modeled as an agent in a multi-agent network, which receives its own data sequence from the underlying system. The goal of this network is to jointly estimate the system parameters by leveraging the communication between agents. We propose DSGD-RER, a distributed version of the online system identification algorithm in [kowshik2021streaming], where every agent applies SGD in the reverse order and communicates its estimate with its neighbors. We show that this decentralized scheme can improve the estimation error bound as the network size increases, and the simulation results demonstrate this theoretical property.

### I-A Related Work

Recently, several works have studied the finite-time properties of LTI system identification. In [dean2019sample], it is shown that the dynamics of a fully observable system can be recovered from multiple trajectories by a least-squares estimator when the number of trajectories scales linearly with the system dimension. Furthermore, system identification using a single trajectory has been studied for stable [jedra2020finite] and unstable [faradonbeh2018finite, simchowitz2018learning, sarkar2019near] LTI systems. The works of [simchowitz2018learning, jedra2019sample] establish the theoretical lower bound of the sample complexity for fully observable LTI systems. The case of partially observable LTI systems is also addressed for stable [oymak2019non, sarkar2019finite, tsiamis2019finite, simchowitz2019learning] and unstable [zheng2020non] systems. By applying a regularized estimator, the work of [fattahi2020learning] improves the dependency of the sample complexity on the system dimension to poly-logarithmic scaling. In contrast to the aforementioned works, which focus on the finite-time analysis of offline estimators, [kowshik2021streaming] proposes an online estimation algorithm, where the data sequence is split into several buffers, and in each buffer the SGD update is applied in the reverse order to remove the bias due to the time-dependency of data points. As mentioned before, our work builds on [kowshik2021streaming] for distributed online system identification. Finite-time analysis of system identification has also been studied more recently for non-linear dynamical systems. In the works of [sattar2020non, foster2020learning], sample complexity bounds are derived for dynamical systems described via generalized linear models, and [kowshik2021near] further improves the dependence of the complexity bound on the mixing-time constant.

## II Problem Formulation

### II-A Notation

We use the following notation in this work:

- [m]: the set {1, 2, …, m} for any integer m
- σmax(A): the largest singular value of A
- σmin(A): the smallest singular value of A
- ‖·‖: the Euclidean (spectral) norm of a vector (matrix)
- E[·]: the expectation operator
- [A]ij: the entry in the i-th row and j-th column of A
- A⊗B: the Kronecker product of A and B
- A⪰B: (A−B) is positive semi-definite

### II-B Distributed Online System Identification

We consider a multi-agent network of identical LTI systems, where the dynamics of agent k are defined as

 $$x^{k}_{t+1}=Ax^{k}_{t}+w^{k}_{t},\qquad k\in[m].$$

Here, A is the unknown transition matrix, and {w^k_t} is the noise sequence, generated independently from a distribution with zero mean and finite covariance Σ. The goal of the agents is to recover the matrix A collaboratively. Though this task can be accomplished by an individual agent (e.g., as in [kowshik2021streaming]), we will show that collective system identification improves the theoretical error bound of estimation.
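As a concrete illustration of this data model, the following sketch simulates m identical LTI systems driven by independent noise; the dimension, horizon, stability margin, and Gaussian noise law below are assumptions chosen for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, T = 4, 5, 200              # state dimension, number of agents, horizon

# A stable transition matrix with operator norm < 1 (cf. Assumption 1).
A = rng.standard_normal((d, d))
A *= 0.9 / np.linalg.norm(A, 2)

# Every agent runs the SAME system A but observes its own noise sequence.
X = np.zeros((m, T + 1, d))       # X[k, t] holds agent k's state x^k_t
for t in range(T):
    noise = rng.standard_normal((m, d))     # w^k_t: zero mean, Sigma = I here
    X[:, t + 1] = X[:, t] @ A.T + noise     # x^k_{t+1} = A x^k_t + w^k_t
```

Since ‖A‖ < 1, each trajectory mixes toward the stationary distribution rather than diverging.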

###### Assumption 1

The considered system is stable in the sense that ‖A‖ < 1.

With Assumption 1, it can be shown that for any initial state, the state distribution converges to a stationary distribution with covariance matrix G = ∑_{i=0}^{∞} A^i Σ (A^⊤)^i. Note that Assumption 1 is more stringent than the standard stability assumption (i.e., spectral radius ρ(A) < 1); however, the analysis is extendable to this case under some modifications of the time horizon (see [kowshik2021streaming] for more details).

###### Assumption 2

For any k and t, the noise w^k_t is sub-Gaussian, i.e.,

 $$\mathbb{E}_{w}\bigl[\exp\bigl(y\langle x,w\rangle\bigr)\bigr]\le\exp\Bigl(\frac{y^{2}}{2}C_{\mu}\langle x,\Sigma x\rangle\Bigr),$$

for any vector x and scalar y.

Sub-Gaussian noise is a standard choice for finite-time analysis of system identification (see, e.g., [abbasi2011online, simchowitz2018learning, sarkar2019near]). Also, in some existing works, i.i.d. Gaussian noise is applied to ensure the persistence of excitation ([jedra2019sample, oymak2019non]). In this paper, we assume that the noise covariance Σ is positive-definite.

Problem Statement:

Departing from the classical offline ordinary least squares (OLS) estimator, the goal is to develop a fully decentralized algorithm in which each agent produces estimates of the unknown system matrix A in an online fashion. Specifically, at each iteration, every agent first updates its estimate based on the current information, and then communicates its estimate with its neighbors. The communication is modeled via a graph that captures the underlying network topology. As mentioned earlier, the goal of an online distributed system identification algorithm, as opposed to the centralized one, is to leverage the network element to provide a finer guarantee for each agent's estimation error. In this work, we quantify the estimation error of each agent such that the error bound encapsulates the dependency on the network size and topology.

Network Structure:

During the estimation process, the agents communicate locally based on a symmetric doubly stochastic matrix P, i.e., all elements of P are non-negative and each row and column of P sums to one. Agents i and j communicate with each other if P_ij > 0; otherwise P_ij = 0. Thus, the neighborhood of agent i is the set of agents j with P_ij > 0. The network is assumed to be connected, i.e., for any two agents i, j, there is a (potentially multi-hop) path from i to j. We also assume P has a positive diagonal. Then, there exists a geometric mixing bound for P [liu2008monte]: ‖P^t − (1/m)𝟙𝟙⊤‖ ≤ β^t, where β is the second largest singular value of P. Agents exchange only local system estimates, and they do not share any other information in the process. The communication is consistent with the structure of P. We elaborate on this in the algorithm description.
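For intuition, here is a minimal sketch of such a mixing matrix on a ring of m agents, assuming a self-weight of 1/2 and a weight of 1/4 to each of the two neighbors (the same scheme as Net B in the experiments section); it checks double stochasticity and the geometric mixing rate governed by the second largest singular value.

```python
import numpy as np

m = 8
P = np.zeros((m, m))
for i in range(m):
    P[i, i] = 0.5                 # positive diagonal (self-weight)
    P[i, (i - 1) % m] = 0.25      # left ring neighbor
    P[i, (i + 1) % m] = 0.25      # right ring neighbor

# P is symmetric, so its singular values are |eigenvalues|.
assert np.allclose(P, P.T)

# The second largest singular value beta drives the mixing rate:
# ||P^t - (1/m) 11^T|| <= beta^t for this symmetric, doubly stochastic P.
beta = np.sort(np.linalg.svd(P, compute_uv=False))[-2]
gap = np.linalg.norm(np.linalg.matrix_power(P, 50) - np.ones((m, m)) / m, 2)
```

A smaller beta (denser connectivity) means the powers of P approach the uniform averaging matrix faster.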

## III Algorithm and Theoretical Results

We now lay out the distributed online linear system identification algorithm and provide the theoretical bound for the estimation error.

### III-A Offline and Online Settings

For the identification of LTI systems, it is well-known that the (centralized) OLS estimator is statistically optimal. OLS is an offline estimator that can also be implemented in an online fashion (i.e., in the form of recursive least squares) by keeping track of the data covariance matrix and the residual based on the current estimate. On the other hand, gradient-based methods provide efficient mechanisms for system identification. In particular, SGD uses the gradient of the current data pair to perform the update

 $$A_{t+1}=A_{t}-2\gamma\,(A_{t}x_{t}-x_{t+1})x_{t}^{\top},$$

where γ > 0 is the step size. Despite the efficient update, SGD suffers from the time-dependency over the data sequence, which leads to biased estimators. To observe this, unrolling the recursive update rule of SGD, we have

 $$A_{t}-A=(A_{0}-A)\prod_{s=0}^{t-1}\bigl(I-2\gamma x_{s}x_{s}^{\top}\bigr)+2\gamma\sum_{s=0}^{t-1}w_{s}x_{s}^{\top}\prod_{l=s+1}^{t-1}\bigl(I-2\gamma x_{l}x_{l}^{\top}\bigr),$$

where, in the second term, the dependence of later states on previous noise realizations prevents the estimator from being unbiased even if A_0 is initialized with a distribution where E[A_0 − A] = 0. To deal with this issue, [kowshik2021streaming] develops the method SGD-RER, which applies SGD in the reverse order of the sequence to break the dependency over time. Suppose that the estimate is updated along the opposite direction of the sequence. In particular, if we have T samples and use the SGD update in the reverse order, the problematic term in the above equation takes the following form

 $$2\gamma\sum_{k=0}^{t-1}w_{T-k}x_{T-k}^{\top}\prod_{l=k+1}^{t-1}\bigl(I-2\gamma x_{T-l}x_{T-l}^{\top}\bigr),$$

whose expectation is equal to zero, since the later noise realizations are independent of previous states. Evidently, this approach does not work in an online sense (the entire sequence of samples has to be collected first). However, we can mimic it by dividing the data into smaller buffers and using reverse SGD within each buffer [kowshik2021streaming].
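The reverse-order pass can be sketched in a few lines for a single collected buffer; the system, buffer size, and step size below are illustrative assumptions, and the point is only the backward loop over the (x_t, x_{t+1}) pairs, which makes each multiplying noise term independent of the states it meets.

```python
import numpy as np

rng = np.random.default_rng(1)
d, B, gamma = 3, 3000, 0.003

A = 0.5 * np.eye(d)                      # true (stable) transition matrix
x = np.zeros((B + 1, d))
for t in range(B):
    x[t + 1] = A @ x[t] + rng.standard_normal(d)

A_hat = np.zeros((d, d))                 # estimate, initialized at zero
for t in range(B - 1, -1, -1):           # reverse order: t = B-1, ..., 0
    grad = np.outer(A_hat @ x[t] - x[t + 1], x[t])
    A_hat -= 2 * gamma * grad            # A <- A - 2*gamma*(A x_t - x_{t+1}) x_t^T

err = np.linalg.norm(A_hat - A, 2)
```

The update rule is identical to forward SGD; only the traversal order of the stored pairs changes.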

### III-B Distributed SGD with Reverse Experience Replay

We extend SGD-RER to the distributed case and call it the DSGD-RER algorithm. Each agent splits its sequence of data pairs into several buffers of size B, and within each buffer all agents perform SGD in the reverse order collectively, based on the network topology. Between two consecutive buffers, u data pairs are discarded in order to decrease the correlation of data in different buffers. The proposed method is outlined in Algorithm 1 and depicted in Fig. 1.

Averaged Iterate over Buffers: To further improve the convergence rate, we average the iterates. Each agent outputs the tail-averaged estimate, i.e., the average of the last iterates of all buffers (line 11 of Algorithm 1).

Coupled Process: Despite the gap between buffers, there is still dependency across data points in different buffers, which makes the analysis challenging. To simplify the analysis, we consider the idea of a coupled process [kowshik2021streaming], where for the state sequence received by agent k, the corresponding coupled sequence is defined as follows.

1. For each buffer t, the starting state of the coupled process is independently generated from the stationary distribution of the state.

2. The coupled process evolves according to the noise sequence of the actual process, such that

 $$\tilde x^{k,t}_{i+1}=A\tilde x^{k,t}_{i}+w^{k,t}_{i},\qquad i=0,\dots,S-1\quad(S=B+u).$$

We will see in the analysis that it is more straightforward to first compute the estimation error based on the coupled process and then quantify the excess error introduced by replacing the coupled process with the actual one. Note that, based on the dynamics of the coupled process, this excess error is controlled by the gap size u.
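Putting the pieces together, the sketch below mimics DSGD-RER end to end: fill a buffer, discard a gap of u points, run the SGD update backward over the buffer while mixing estimates through P after every step, and tail-average the last iterate of each buffer. Algorithm 1 itself is not reproduced in this text, so the buffer handling and every hyper-parameter here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, B, u, N, gamma = 3, 4, 100, 20, 20, 0.01
S = B + u                                  # buffer plus discarded gap

A = 0.5 * np.eye(d)                        # true transition matrix
P = np.full((m, m), 1.0 / m)               # fully connected mixing matrix

A_hat = np.zeros((m, d, d))                # one running estimate per agent
tail_sum = np.zeros((m, d, d))
x = np.zeros((m, d))                       # current state of each agent
for _ in range(N):                         # N buffers
    traj = np.zeros((m, S + 1, d))
    traj[:, 0] = x
    for t in range(S):
        traj[:, t + 1] = traj[:, t] @ A.T + rng.standard_normal((m, d))
    x = traj[:, S]                         # resume the stream after the gap
    for t in range(B - 1, -1, -1):         # reverse pass over the buffer
        for k in range(m):                 # local SGD step of each agent
            g = np.outer(A_hat[k] @ traj[k, t] - traj[k, t + 1], traj[k, t])
            A_hat[k] = A_hat[k] - 2 * gamma * g
        A_hat = np.einsum('ij,jab->iab', P, A_hat)   # consensus step
    tail_sum += A_hat                      # last iterate of each buffer
A_tail = tail_sum / N                      # tail-averaged estimate
err = np.linalg.norm(A_tail[0] - A, 2)
```

Only the first B pairs of each S-step segment are replayed; the trailing u pairs form the discarded gap between buffers.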

### III-C Theoretical Guarantee

In this section, we present our main theoretical result. By running Algorithm 1 with the specified hyper-parameters, we show that, in the estimation error upper bound of any agent, the term corresponding to the leading term in [kowshik2021streaming] improves as the network size increases.

###### Theorem 1

Let Assumptions 1-2 hold and consider the following hyper-parameter setup:

1. .

2. , where is a problem-dependent constant.

3. .

Then, by applying Algorithm 1, with probability at least a level determined by some positive constants, the estimation error of the tail-averaged estimate of each agent over all buffers is bounded as follows:

 (1)

where , and are problem-dependent constants.

###### Remark 1

Comparing (1) with the estimation error upper bound of the centralized SGD-RER in [kowshik2021streaming], we observe that the term in (1) corresponding to the dominant term of [kowshik2021streaming] decreases with the network size m, which shows that the contributed error can be reduced by enlarging the network.

## IV Numerical Experiments

Experiment Setup: We consider a network of LTI systems, where the transition matrix has the form A = QΛQ⊤: Q is a randomly generated orthogonal matrix, and Λ is a diagonal matrix with half of its diagonal entries set to one value and the rest to another. The step size is set using a scale estimated as the sum of the norms of the first samples of each agent, and the remaining hyper-parameters (buffer size, gap size, and horizon) are fixed accordingly. The noise follows the standard Gaussian distribution.
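The transition-matrix construction above can be sketched as follows; the paper's actual diagonal values are not recoverable from this text, so 0.9 and 0.5 below are hypothetical placeholders, and the orthogonal factor is drawn via a QR decomposition.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4

# QR decomposition of a Gaussian matrix gives a random orthogonal Q.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Diagonal factor: half of the entries at one value, half at another
# (0.9 and 0.5 are placeholders, not the paper's values).
D = np.diag([0.9] * (d // 2) + [0.5] * (d - d // 2))
A = Q @ D @ Q.T                   # A = Q D Q^T, symmetric and stable

spec_norm = np.linalg.norm(A, 2)  # equals the largest diagonal entry of D
```

Since A is symmetric with eigenvalues given by the diagonal of D, its spectral norm is exactly the largest placeholder value, so Assumption 1 holds by construction.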

Networks: We examine the dependence of the estimation error on 1) the network size and 2) the network topology. For the former, we compare the centralized SGD-RER [kowshik2021streaming] with our distributed algorithm on cyclic graphs with various numbers of agents. For the latter, we look at the performance of a fixed-size network with the following topologies: a) Net A, the fully disconnected network; b) Net B, a cyclic graph in which each agent keeps a self-weight and assigns equal weight to each of its two neighbors; c) Net C, the fully connected network.

Performance: From Fig. 2, we verify the dependency of the estimation precision on the network size: for a larger network, the error of the tail-averaged estimate is smaller. Furthermore, all decentralized schemes outperform the centralized SGD-RER [kowshik2021streaming]. To see the influence of network topology, notice that the second largest singular value β is ordered across the three topologies as β(Net A) > β(Net B) > β(Net C). We can see in Fig. 3 that a smaller β results in a better estimation error. Furthermore, to isolate the effect of DSGD-RER, we also plot the estimation error of vanilla distributed SGD on the fully connected network. Its error does not shrink over time, since the estimator is biased, so the reverse update process in DSGD-RER is critical.

## V Conclusion

In this work, we considered the distributed online system identification problem for a network of identical LTI systems. We proposed a distributed online estimation algorithm and characterized the estimation error with respect to the network size and topology. We indeed observed in numerical experiments that larger network size as well as better connectivity can improve the estimation performance. For future directions, this online estimation method can be combined with decentralized control techniques to build decentralized, real-time adaptive controllers as explored in [chang2021distributed, chang2021regret]. Another potential direction is to extend the technique for identification of non-linear dynamical systems.

### -A Notations and Proof Sketch

To analyze the estimation error under the network topology P, we decompose it into two parts: (I) the estimation error resulting from applying the fully connected network as the communication scheme; (II) the difference between the estimates derived from P and from the fully connected scheme. To better analyze (I), we apply the idea of coupled processes and consider each agent's estimate based on the coupled process. From the estimate initialization and the update scheme, under the fully connected scheme all agents share a common coupled estimate, which we denote by $\tilde A^{t,\mathrm{avg}}_{i}$, with the recursive update rule:

 $$\tilde A^{t,\mathrm{avg}}_{i+1}=\tilde A^{t,\mathrm{avg}}_{i}-\frac{2\gamma}{m}\sum_{k=1}^{m}\bigl(\tilde A^{t,\mathrm{avg}}_{i}\tilde x^{k,t}_{-i}-\tilde x^{k,t}_{-(i+1)}\bigr)\tilde x^{k,t\,\top}_{-i}.$$

In Section -B, we give the bounds on the bias and variance terms of $\tilde A^{t,\mathrm{avg}}_{B}$, which, together with the results for (II), yield the main theorem.

For ease of analysis, we use the following notation in the rest of the proof:

 $$\tilde P^{t,\mathrm{avg}}_{i}=I-\frac{2\gamma}{m}\sum_{k=1}^{m}\tilde x^{k,t}_{i}\tilde x^{k,t\,\top}_{i},\qquad \tilde H^{t,\mathrm{avg}}_{i,j}=\begin{cases}\prod_{s=i}^{j}\tilde P^{t,\mathrm{avg}}_{-s}, & i\le j\\ I, & i>j\end{cases}$$

and

 $$x^{k,t}_{-i}=x^{k}_{t(S-1)-i},\quad 0\le i\le S-1,\qquad \mathcal{D}^{t}_{-j}=\bigl\{\|x^{k,t}_{-i}\|\le R:\ \forall k,\ j\le i\le B-1\bigr\},\qquad \tilde{\mathcal{D}}^{t}_{-j}=\bigl\{\|\tilde x^{k,t}_{-i}\|\le R:\ \forall k,\ j\le i\le B-1\bigr\},$$
 $$\mathcal{D}^{s,t}=\begin{cases}\cap_{r=s}^{t}\mathcal{D}^{r}_{-0}, & s\le t\\ \Omega, & s>t\end{cases},\qquad \tilde{\mathcal{D}}^{s,t}=\begin{cases}\cap_{r=s}^{t}\tilde{\mathcal{D}}^{r}_{-0}, & s\le t\\ \Omega, & s>t\end{cases},\qquad \hat{\mathcal{D}}^{s,t}=\mathcal{D}^{s,t}\cap\tilde{\mathcal{D}}^{s,t}.$$

The estimate $\tilde A^{t,\mathrm{avg}}_{B}$ consists of a bias term and a variance term, which have the following expressions:

 $$\tilde A^{t,\mathrm{avg}}_{B}-A=(\tilde A^{t,\mathrm{avg}}_{B,\mathrm{bias}}-A)+\tilde A^{t,\mathrm{avg}}_{B,\mathrm{var}},\qquad \tilde A^{t,\mathrm{avg}}_{B,\mathrm{bias}}-A=(A_{0}-A)\prod_{s=0}^{t}\tilde H^{s,\mathrm{avg}}_{0,B-1},$$
 $$\tilde A^{t,\mathrm{avg}}_{B,\mathrm{var}}=2\gamma\sum_{r=0}^{t}\sum_{j=0}^{B-1}\Bigl(\frac{\sum_{k=1}^{m}w^{k,t-r}_{-j}\tilde x^{k,t-r\,\top}_{-j}}{m}\Bigr)\tilde H^{t-r,\mathrm{avg}}_{j+1,B-1}\prod_{s=r-1}^{0}\tilde H^{t-s,\mathrm{avg}}_{0,B-1}.$$

### -B Bounds on the Estimation Error of Ã^avg and A^avg

###### Lemma 2

Suppose that Assumption 2 holds; then, for any x, y, and λ, we have

 $$\mathbb{E}\Bigl[\exp\Bigl(\frac{\gamma}{m}\lambda^{2}C_{\mu}x^{\top}\Sigma x\,\bigl\langle y,\tilde H^{t,\mathrm{avg}\,\top}_{0,B-1}\tilde H^{t,\mathrm{avg}}_{0,B-1}y\bigr\rangle+\lambda x^{\top}\Delta^{t}y\Bigr)\Bigm|\tilde{\mathcal{D}}^{t}_{-0}\Bigr]\le\frac{\exp\bigl(\frac{\gamma}{m}\lambda^{2}C_{\mu}x^{\top}\Sigma x\,\|y\|^{2}\bigr)}{\mathbb{P}(\tilde{\mathcal{D}}^{t}_{-0})}.$$

For ease of analysis, we also define the following quantities:

 $$\Delta^{t}:=2\gamma\sum_{j=0}^{B-1}\Bigl(\frac{\sum_{k=1}^{m}w^{k,t}_{-j}\tilde x^{k,t\,\top}_{-j}}{m}\Bigr)\tilde H^{t,\mathrm{avg}}_{j+1,B-1},\qquad \Delta^{t}_{-i}:=2\gamma\sum_{j=i}^{B-1}\Bigl(\frac{\sum_{k=1}^{m}w^{k,t}_{-j}\tilde x^{k,t\,\top}_{-j}}{m}\Bigr)\tilde H^{t,\mathrm{avg}}_{j+1,B-1},$$
 $$\Xi^{t}_{0}:=\frac{\gamma}{m}\lambda^{2}C_{\mu}x^{\top}\Sigma x\,\bigl\langle y,\tilde H^{t,\mathrm{avg}\,\top}_{0,B-1}\tilde H^{t,\mathrm{avg}}_{0,B-1}y\bigr\rangle,\qquad \Xi^{t}_{i}:=\frac{\gamma}{m}\lambda^{2}C_{\mu}x^{\top}\Sigma x\,\bigl\langle y,\tilde H^{t,\mathrm{avg}\,\top}_{i,B-1}\tilde H^{t,\mathrm{avg}}_{i,B-1}y\bigr\rangle.$$

From the notation above, under the event $\tilde{\mathcal{D}}^{t}_{-0}$ we have

 $$\bigl\langle y,\tilde H^{t,\mathrm{avg}\,\top}_{i,B-1}\tilde H^{t,\mathrm{avg}}_{i,B-1}y\bigr\rangle\le\bigl\langle y,\tilde H^{t,\mathrm{avg}\,\top}_{i+1,B-1}\tilde H^{t,\mathrm{avg}}_{i+1,B-1}y\bigr\rangle-\frac{2\gamma}{m}\sum_{k=1}^{m}\bigl\langle y,\tilde H^{t,\mathrm{avg}\,\top}_{i+1,B-1}\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}\tilde H^{t,\mathrm{avg}}_{i+1,B-1}y\bigr\rangle,\tag{2}$$

which comes from the following result:

 $$\tilde P^{t,\mathrm{avg}\,\top}_{-i}\tilde P^{t,\mathrm{avg}}_{-i}+\frac{2\gamma}{m}\sum_{k=1}^{m}\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}=I-\frac{2\gamma}{m}\sum_{k=1}^{m}\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}+\frac{4\gamma^{2}}{m^{2}}\sum_{k=1}^{m}\sum_{s=1}^{m}\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}\tilde x^{s,t}_{-i}\tilde x^{s,t\,\top}_{-i}\preceq I-\frac{2\gamma}{m}\sum_{k=1}^{m}\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}+\frac{4\gamma^{2}R}{m}\sum_{k=1}^{m}\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}\preceq I.$$

Further defining

 $$\varkappa^{t}_{i}:=\sum_{k=1}^{m}\frac{2\gamma^{2}\lambda^{2}C_{\mu}}{m^{2}}\langle x,\Sigma x\rangle\bigl\langle y,\tilde H^{t,\mathrm{avg}\,\top}_{i+1,B-1}\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}\tilde H^{t,\mathrm{avg}}_{i+1,B-1}y\bigr\rangle,$$

we get

 $$\begin{aligned}\mathbb{E}\bigl[\exp(\Xi^{t}_{0}+\lambda\langle x,\Delta^{t}y\rangle)\bigm|\tilde{\mathcal{D}}^{t}_{-0}\bigr]&=\frac{\mathbb{E}\bigl[\exp(\Xi^{t}_{0}+\lambda\langle x,\Delta^{t}_{-0}y\rangle)\,\mathbb{1}(\tilde{\mathcal{D}}^{t}_{-0})\bigr]}{\mathbb{P}(\tilde{\mathcal{D}}^{t}_{-0})}\\&=\frac{\mathbb{E}\bigl[\exp(\Xi^{t}_{0}+\lambda\langle x,(\Delta^{t}_{-0}-\Delta^{t}_{-1})y\rangle+\lambda\langle x,\Delta^{t}_{-1}y\rangle)\,\mathbb{1}(\tilde{\mathcal{D}}^{t}_{-0})\bigr]}{\mathbb{P}(\tilde{\mathcal{D}}^{t}_{-0})}\\&\le\frac{\mathbb{E}\bigl[\exp(\Xi^{t}_{0}+\varkappa^{t}_{0}+\lambda\langle x,\Delta^{t}_{-1}y\rangle)\,\mathbb{1}(\tilde{\mathcal{D}}^{t}_{-0})\bigr]}{\mathbb{P}(\tilde{\mathcal{D}}^{t}_{-0})}\\&\le\frac{\mathbb{E}\bigl[\exp(\Xi^{t}_{1}+\lambda\langle x,\Delta^{t}_{-1}y\rangle)\,\mathbb{1}(\tilde{\mathcal{D}}^{t}_{-0})\bigr]}{\mathbb{P}(\tilde{\mathcal{D}}^{t}_{-0})}\\&\le\frac{\mathbb{E}\bigl[\exp(\Xi^{t}_{1}+\lambda\langle x,\Delta^{t}_{-1}y\rangle)\,\mathbb{1}(\tilde{\mathcal{D}}^{t}_{-1})\bigr]}{\mathbb{P}(\tilde{\mathcal{D}}^{t}_{-0})},\end{aligned}\tag{3}$$

where the first inequality is due to the sub-Gaussianity of the noise and the fact that $w^{k,t}_{-0}$ is independent of $\tilde x^{k,t}_{-0}$, $\tilde H^{t,\mathrm{avg}}_{1,B-1}$, and $\Delta^{t}_{-1}$; the second inequality comes from (2); and the last inequality is based on $\tilde{\mathcal{D}}^{t}_{-0}\subseteq\tilde{\mathcal{D}}^{t}_{-1}$. Then, by expanding the recursive relationship in (3), the result is proved.

###### Corollary 3

From Lemma 2, we have

 $$\mathbb{E}\bigl[\exp\bigl(\lambda x^{\top}\Delta^{t}y\bigr)\bigm|\tilde{\mathcal{D}}^{t}_{-0}\bigr]\le\frac{\exp\bigl(\frac{\gamma}{m}\lambda^{2}C_{\mu}x^{\top}\Sigma x\,\|y\|^{2}\bigr)}{\mathbb{P}(\tilde{\mathcal{D}}^{t}_{-0})}.$$
###### Lemma 4

Suppose that , then under the following relationship holds:

The result can be derived by following a proof similar to that of Lemma 28 in [kowshik2021streaming].

###### Lemma 5

Suppose that , Assumption 2 holds, and there exists a constant such that and . Then for any we get

where .

The result can be derived with Lemma 4 and a proof similar to that of Lemma 30 in [kowshik2021streaming].

###### Lemma 6

Suppose that the conditions in Lemma 5 hold and there exists a constant such that and . Then conditioned on , we have

1. almost surely.

2. , where are constants and depends on .

The result can be derived by following a proof similar to that of Lemma 31 in [kowshik2021streaming].

Here we define .

###### Lemma 7

Suppose that and the conditions in Lemma 6 hold. Then for any , we get

 $$\mathbb{P}\Bigl(\|F^{a,N-1}\|\ge c_{5}\Bigl(\zeta+\frac{1}{\gamma B\,\sigma_{\min}(G)}\Bigr)\Bigm|\tilde{\mathcal{D}}^{a,N-1}\Bigr)\le\delta,$$

where , and is a constant depending on .

Consider the tail-averaged estimate:

 $$\hat{\tilde A}^{\mathrm{var,avg}}_{0,N-1}=\frac{1}{N}\sum_{t=0}^{N-1}\tilde A^{t,\mathrm{avg}}_{B,\mathrm{var}}=\frac{1}{N}\sum_{t=0}^{N-1}\Delta^{t}F^{t,N-1}.$$

Define the following events:

 $$\tilde{\mathcal{M}}^{t}:=\Bigl\{\|F^{t,N-1}\|\le c_{5}\Bigl(\zeta+\frac{1}{\gamma B\,\sigma_{\min}(G)}\Bigr)\Bigr\},\qquad \tilde{\mathcal{M}}^{t,N-1}=\cap_{s=t}^{N-1}\tilde{\mathcal{M}}^{s}.$$
###### Theorem 8

Suppose that the conditions in Lemma 2, Lemma 4, and Lemma 7 hold. Assume and define . Then we have

 $$\mathbb{P}\Bigl(\bigl\|\hat{\tilde A}^{\mathrm{var,avg}}_{0,N-1}\bigr\|\ge\beta\Bigm|\tilde{\mathcal{M}}^{0,N-1}\cap\tilde{\mathcal{D}}^{0,N-1}\Bigr)\le\exp\Bigl(c_{6}d-\frac{\beta^{2}Nm}{8\gamma C_{\mu}\sigma_{\max}(\Sigma)(1+2\alpha)}\Bigr),$$

where c_6 is a problem-independent constant.

The result can be derived by following a proof similar to that of Theorem 33 in [kowshik2021streaming].

###### Lemma 9

Suppose that and the event of holds. Then for and , we have

 $$\bigl\|A^{t,\mathrm{avg}}_{i}-\tilde A^{t,\mathrm{avg}}_{i}\bigr\|\le(16\gamma^{2}R^{2}T^{2}+8\gamma RT)\|A^{u}\|.$$

Consider the update rule of $A^{t,\mathrm{avg}}_{i}$:

 $$\begin{aligned}A^{t,\mathrm{avg}}_{i+1}&=A^{t,\mathrm{avg}}_{i}-\frac{2\gamma}{m}\sum_{k=1}^{m}\bigl(A^{t,\mathrm{avg}}_{i}x^{k,t}_{-i}-x^{k,t}_{-(i+1)}\bigr)x^{k,t\,\top}_{-i}\\&=A^{t,\mathrm{avg}}_{i}-\frac{2\gamma}{m}\sum_{k=1}^{m}\bigl(A^{t,\mathrm{avg}}_{i}\tilde x^{k,t}_{-i}-\tilde x^{k,t}_{-(i+1)}\bigr)\tilde x^{k,t\,\top}_{-i}+\frac{2\gamma}{m}\sum_{k=1}^{m}A^{t,\mathrm{avg}}_{i}\bigl(\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}-x^{k,t}_{-i}x^{k,t\,\top}_{-i}\bigr)\\&\quad+\frac{2\gamma}{m}\sum_{k=1}^{m}\bigl(x^{k,t}_{-(i+1)}x^{k,t\,\top}_{-i}-\tilde x^{k,t}_{-(i+1)}\tilde x^{k,t\,\top}_{-i}\bigr).\end{aligned}\tag{4}$$

Based on the equation above, it can be shown that

 $$\begin{aligned}A^{t,\mathrm{avg}}_{i+1}-\tilde A^{t,\mathrm{avg}}_{i+1}&=\bigl(A^{t,\mathrm{avg}}_{i}-\tilde A^{t,\mathrm{avg}}_{i}\bigr)\tilde P^{t,\mathrm{avg}}_{-i}+\frac{2\gamma}{m}\sum_{k=1}^{m}A^{t,\mathrm{avg}}_{i}\bigl(\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}-x^{k,t}_{-i}x^{k,t\,\top}_{-i}\bigr)\\&\quad+\frac{2\gamma}{m}\sum_{k=1}^{m}\bigl(x^{k,t}_{-(i+1)}x^{k,t\,\top}_{-i}-\tilde x^{k,t}_{-(i+1)}\tilde x^{k,t\,\top}_{-i}\bigr).\end{aligned}\tag{5}$$

For the term , based on the update rule, the choice of and the condition of , we have

from which we conclude that for and ,

 $$\bigl\|A^{t,\mathrm{avg}}_{i}\bigr\|\le 2\gamma RT.\tag{6}$$

For the term , it can be shown that

 $$\begin{aligned}\bigl\|\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}-x^{k,t}_{-i}x^{k,t\,\top}_{-i}\bigr\|&=\bigl\|\tilde x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}-x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}+x^{k,t}_{-i}\tilde x^{k,t\,\top}_{-i}-x^{k,t}_{-i}x^{k,t\,\top}_{-i}\bigr\|\\&\le 2\sqrt{R}\,\bigl\|x^{k,t}_{-i}-\tilde x^{k,t}_{-i}\bigr\|=2\sqrt{R}\,\bigl\|A^{(S-1)-i}\bigl(x^{k,t}_{0}-\tilde x^{k,t}_{0}\bigr)\bigr\|\le 4R\,\|A^{u}\|,\end{aligned}\tag{7}$$

where the first inequality is due to the norm bound enforced by the conditioning event, and the last inequality comes from the gap between buffers. Similarly, we can show the same bound for $\|x^{k,t}_{-(i+1)}x^{k,t\,\top}_{-i}-\tilde x^{k,t}_{-(i+1)}\tilde x^{k,t\,\top}_{-i}\|$.

Substituting (6) and (7) into (5), we get

from which we conclude that for and ,

 $$\bigl\|A^{t,\mathrm{avg}}_{i}-\tilde A^{t,\mathrm{avg}}_{i}\bigr\|\le(16\gamma^{2}R^{2}T^{2}+8\gamma RT)\|A^{u}\|.$$

### -C Difference between A^{t,P}_{i,k} and A^{t,avg}_i

###### Lemma 10

Suppose that and the event of holds. Then for and , we have for any agent

 ∥∥At,Pi,k−At,a