I. Introduction
System identification, the process of estimating the unknown parameters of a dynamical system from an observed input-output sequence, is a classical problem in control, reinforcement learning, and time-series analysis. Among this class of problems, learning the transition matrix of a linear time-invariant (LTI) system is a prominent, well-studied case, and classical results characterize the asymptotic properties of these estimators [aastrom1971system, ljung1999system, chen2012identification, goodwin1977dynamic].
Recently, there has been renewed interest in the identification of LTI systems, and modern statistical techniques have been applied to achieve finite-time sample complexity guarantees [dean2019sample, jedra2020finite, faradonbeh2018finite, simchowitz2018learning, sarkar2019near]. However, the aforementioned works focus on the offline setup, where the estimator has access to the entire data sequence from the outset. An offline estimator cannot be directly extended to the streaming/online setup, where the system parameters need to be estimated on the fly. To this end, stochastic gradient descent with reverse experience replay (SGD-RER) was proposed [kowshik2021streaming] to build an online estimator: the data sequence is stored in several buffers, and the SGD updates are performed backward within each buffer to break the time dependency between data points. This online method achieves the optimal sample complexity of the offline setup up to logarithmic factors.
In this paper, we study the distributed online identification problem for a network of identical LTI systems with unknown dynamics. Each system is modeled as an agent in a multi-agent network, which receives its own data sequence from the underlying system. The goal of the network is to jointly estimate the system parameters by leveraging communication between agents. We propose D-SGD-RER, a distributed version of the online system identification algorithm in [kowshik2021streaming], where every agent applies SGD in the reverse order and communicates its estimate with its neighbors. We show that this decentralized scheme improves the estimation error bound as the network size increases, and our simulation results demonstrate this theoretical property.
I-A. Related Work
Several recent works have studied the finite-time properties of LTI system identification. In [dean2019sample], it is shown that the dynamics of a fully observable system can be recovered from multiple trajectories by a least-squares estimator when the number of trajectories scales linearly with the system dimension. Furthermore, system identification from a single trajectory has been studied for stable [jedra2020finite] and unstable [faradonbeh2018finite, simchowitz2018learning, sarkar2019near] LTI systems. The works [simchowitz2018learning, jedra2019sample] establish theoretical lower bounds on the sample complexity for fully observable LTI systems. The case of partially observable LTI systems has also been addressed for stable [oymak2019non, sarkar2019finite, tsiamis2019finite, simchowitz2019learning] and unstable [zheng2020non] systems. By applying an $\ell_1$-regularized estimator, the work [fattahi2020learning] improves the dependency of the sample complexity on the system dimension to polylogarithmic scaling. Contrary to the aforementioned works, which focus on the finite-time analysis of offline estimators, [kowshik2021streaming] proposes an online estimation algorithm, where the data sequence is split into several buffers, and in each buffer the SGD update is applied in the reverse order to remove the bias due to the time-dependency of data points. As mentioned before, our work builds on [kowshik2021streaming] for distributed online system identification. Finite-time analysis of system identification has also been studied for nonlinear dynamical systems: in [sattar2020non, foster2020learning], sample complexity bounds are derived for dynamical systems described via generalized linear models, and [kowshik2021near] further improves the dependence of the complexity bound on the mixing-time constant.
II. Problem Formulation
II-A. Notation
We use the following notation in this work:
$[n]$: the set $\{1, \ldots, n\}$ for any integer $n$
$\sigma_{\max}(A)$: the largest singular value of matrix $A$
$\sigma_{\min}(A)$: the smallest singular value of matrix $A$
$\|\cdot\|$: the Euclidean (spectral) norm of a vector (matrix)
$\mathbb{E}[\cdot]$: the expectation operator
$[A]_{jk}$: the entry in the $j$-th row and $k$-th column of $A$
$A \otimes B$: the Kronecker product of $A$ and $B$
$A \succeq 0$: $A$ is positive semi-definite
II-B. Distributed Online System Identification
We consider a multi-agent network of $m$ identical LTI systems, where the dynamics of agent $i \in [m]$ is defined as
$$x^i_{t+1} = A^* x^i_t + \eta^i_t,$$
where $A^* \in \mathbb{R}^{d \times d}$ is the unknown transition matrix, and $\{\eta^i_t\}$ is a noise sequence generated independently from a distribution with zero mean and finite covariance $\Sigma$. The goal of the agents is to recover the matrix $A^*$ collaboratively. Though this task can be accomplished by an individual agent (e.g., as in [kowshik2021streaming]), we will show that collective system identification improves the theoretical error bound of the estimation.
Assumption 1
The considered system is stable in the sense that $\|A^*\| < 1$.
Under Assumption 1, it can be shown that for any initial state, the distribution of $x^i_t$ converges to a stationary distribution $\pi$ with a covariance matrix we denote by $G$. Note that Assumption 1 is more stringent than the standard stability assumption (i.e., a spectral radius strictly less than one); however, the analysis is extendable to that case under some modifications of the time horizon (see [kowshik2021streaming] for more details).
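For intuition, the stationary covariance $G$ admits an explicit form; the display below is a standard fact under Assumption 1 (the symbol $G$ is our notation and may differ from the paper's):
$$G = A^* G (A^*)^\top + \Sigma = \sum_{k=0}^{\infty} (A^*)^k \,\Sigma\, \big((A^*)^k\big)^\top,$$
i.e., $G$ solves a discrete Lyapunov equation, and the series converges since $\|A^*\| < 1$.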
Assumption 2
For any $t$ and any agent $i$, the noise $\eta^i_t$ is sub-Gaussian, i.e., there exists $\sigma^2 > 0$ such that
$$\mathbb{E}\big[\exp(\lambda^\top \eta^i_t)\big] \le \exp\big(\sigma^2 \|\lambda\|^2 / 2\big)$$
for any $\lambda \in \mathbb{R}^d$.
Sub-Gaussian noise is a standard choice for the finite-time analysis of system identification (see, e.g., [abbasi2011online, simchowitz2018learning, sarkar2019near]). Also, in some existing works, i.i.d. Gaussian noise is used to ensure persistence of excitation [jedra2019sample, oymak2019non]. In this paper, we assume that the covariance $\Sigma$ is positive-definite.
Problem Statement:
Departing from the classical offline ordinary least squares (OLS) estimator, the goal is to develop a fully decentralized algorithm, where each agent produces estimates of the unknown system matrix $A^*$ in an online fashion. Specifically, at each iteration, each agent first updates its estimate based on the current information, and then communicates its estimate with its neighbors. The communication is modeled via a graph that captures the underlying network topology. As mentioned earlier, the goal of an online distributed system identification algorithm, as opposed to a centralized one, is to leverage the network to provide a finer estimation guarantee for each agent. In this work, we quantify the estimation error of each agent, such that the error bound encapsulates the dependency on the network size and topology.
Network Structure: During the estimation process, the agents communicate locally based on a symmetric doubly stochastic matrix $P \in \mathbb{R}^{m \times m}$, i.e., all elements of $P$ are non-negative, $P = P^\top$, and $P\mathbf{1} = \mathbf{1}$. Agents $i$ and $j$ communicate with each other if $[P]_{ij} > 0$; otherwise $[P]_{ij} = 0$. Thus, $\mathcal{N}_i = \{j : [P]_{ij} > 0\}$ is the neighborhood of agent $i$. The network is assumed to be connected, i.e., for any two agents $i$ and $j$, there is a (potentially multi-hop) path from $i$ to $j$. We also assume $P$ has a positive diagonal. Then, there exists a geometric mixing bound for $P$ [liu2008monte], such that $\|P^k - \frac{1}{m}\mathbf{1}\mathbf{1}^\top\| \le \beta^k$, where $\beta < 1$ is the second largest singular value of $P$. Agents exchange only local system estimates, and they do not share any other information in the process. The communication is consistent with the structure of $P$. We elaborate on this in the algorithm description.
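As a concrete illustration of this setup (not part of the paper's algorithm), the following sketch builds a symmetric doubly stochastic matrix for a cyclic network and computes its second largest singular value $\beta$; the particular weights are an assumption made for the example.

```python
import numpy as np

def cyclic_weight_matrix(m, self_weight=0.5):
    """Symmetric doubly stochastic matrix for a cycle of m agents.

    Each agent keeps `self_weight` on itself and splits the remainder
    equally between its two neighbors (an illustrative choice).
    """
    P = np.zeros((m, m))
    w = (1.0 - self_weight) / 2.0
    for i in range(m):
        P[i, i] = self_weight
        P[i, (i - 1) % m] += w
        P[i, (i + 1) % m] += w
    return P

P = cyclic_weight_matrix(8)
assert np.allclose(P, P.T) and np.allclose(P.sum(axis=1), 1.0)
# The second largest singular value governs the geometric mixing rate.
beta = np.linalg.svd(P, compute_uv=False)[1]
print(f"beta = {beta:.4f}")
```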
III. Algorithm and Theoretical Results
We now lay out the distributed online linear system identification algorithm and provide the theoretical bound for the estimation error.
III-A. Offline and Online Settings
For the identification of LTI systems, it is well-known that the (centralized) OLS estimator is statistically optimal. OLS is an offline estimator that can also be implemented in an online fashion (i.e., in the form of recursive least squares) by keeping track of the data covariance matrix and the residual based on the current estimate. On the other hand, gradient-based methods provide efficient mechanisms for system identification. In particular, SGD uses the gradient of the current data pair to perform the update
$$\hat{A}_{t+1} = \hat{A}_t - \gamma\,(\hat{A}_t x_t - x_{t+1})\,x_t^\top,$$
where $\gamma > 0$ is the step size. Despite the efficient update, SGD suffers from the time-dependency over the data sequence, which leads to biased estimators. To observe this, unrolling the recursive update rule of SGD (using $x_{t+1} = A^* x_t + \eta_t$), we have
$$\hat{A}_T - A^* = (\hat{A}_0 - A^*)\prod_{t=0}^{T-1}\big(I - \gamma\, x_t x_t^\top\big) + \gamma \sum_{t=0}^{T-1} \eta_t x_t^\top \prod_{s=t+1}^{T-1}\big(I - \gamma\, x_s x_s^\top\big),$$
where in the second term, the dependence of later states on previous noise realizations prevents the estimator from being unbiased, even if $\hat{A}_0$ is initialized with a distribution satisfying $\mathbb{E}[\hat{A}_0] = A^*$. To deal with this issue, [kowshik2021streaming] develops the method SGD-RER, which applies SGD in the reverse order of the sequence to break the dependency over time. Suppose that the estimate is updated along the opposite direction of the sequence. In particular, if we have $T$ samples and use the SGD update in the reverse order, the problematic term in the above equation takes the form
$$\gamma \sum_{t=0}^{T-1} \eta_t x_t^\top \prod_{s=0}^{t-1}\big(I - \gamma\, x_s x_s^\top\big),$$
whose expectation is equal to zero, since the later noise realizations are independent of the previous states. Evidently, this approach does not work in an online sense (as the entire sequence of samples has to be collected first). However, we can mimic it by dividing the data into smaller buffers and using reverse-order SGD within each buffer [kowshik2021streaming].
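As a sanity check of this argument, the following Monte Carlo experiment on a scalar AR(1) model (all constants are illustrative) estimates the mean of the noise term under the two orderings; the forward-ordered term exhibits a clearly nonzero mean, while the reverse-ordered one averages to zero.

```python
import numpy as np

# Scalar AR(1): x_{t+1} = a x_t + eta_t. Compare the mean of the noise
# term when the product runs over *later* states (forward SGD ordering)
# versus *earlier* states (reverse ordering).
rng = np.random.default_rng(1)
a, gamma, T, trials = 0.5, 0.1, 8, 20_000
fwd_sum, rev_sum = 0.0, 0.0
for _ in range(trials):
    eta = rng.standard_normal(T)
    x = np.zeros(T + 1)
    for t in range(T):
        x[t + 1] = a * x[t] + eta[t]
    # forward: factors involve later states, which depend on eta_t (bias)
    fwd_sum += sum(eta[t] * x[t] * np.prod(1 - gamma * x[t + 1:T] ** 2)
                   for t in range(T))
    # reverse: factors involve earlier states, independent of eta_t (no bias)
    rev_sum += sum(eta[t] * x[t] * np.prod(1 - gamma * x[:t] ** 2)
                   for t in range(T))
print(f"forward mean: {fwd_sum / trials:+.3f}, reverse mean: {rev_sum / trials:+.3f}")
```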
III-B. Distributed SGD with Reverse Experience Replay
We extend SGD-RER to the distributed case and call it the D-SGD-RER algorithm. Each agent splits its sequence of data pairs into several buffers of size $B$, and within each buffer all agents perform SGD in the reverse order collectively, based on the network topology. Between two consecutive buffers, $u$ data pairs are discarded in order to decrease the correlation between data in different buffers. The proposed method is outlined in Algorithm 1 and depicted in Fig. 1.
Averaged Iterate over Buffers: To further improve the convergence rate, we average the iterates. Each agent computes as its estimation output the tail-averaged estimate, i.e., the average of the last estimates of all buffers (line 11 of Algorithm 1); see the sketch below.
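To make the procedure concrete, here is a minimal Python sketch of the buffer / reverse-update / consensus loop; the variable names, data layout, and bookkeeping are our assumptions, and Algorithm 1 contains details (e.g., hyper-parameter choices and the exact communication schedule) omitted here.

```python
import numpy as np

def d_sgd_rer(X, P, gamma, B, u):
    """Minimal sketch of D-SGD-RER (illustrative; not the paper's exact Algorithm 1).

    X     : (m, T+1, d) array, the state trajectory of each of the m agents
    P     : (m, m) symmetric doubly stochastic mixing matrix
    gamma : step size; B : buffer size; u : gap size
    Returns one (d, d) tail-averaged estimate per agent.
    """
    m, T1, d = X.shape
    A = np.zeros((m, d, d))              # each agent's running estimate
    buffer_ends = []                     # last iterate of each buffer
    t = 0
    while t + B + u < T1 - 1:
        # process the buffer [t, t + B) in *reverse* time order
        for s in range(t + B - 1, t - 1, -1):
            for i in range(m):           # local SGD step on pair (x_s, x_{s+1})
                grad = np.outer(A[i] @ X[i, s] - X[i, s + 1], X[i, s])
                A[i] = A[i] - gamma * grad
            # consensus step: each agent mixes neighbors' estimates via P
            A = np.einsum('ij,jkl->ikl', P, A)
        buffer_ends.append(A.copy())
        t += B + u                       # skip u samples between buffers
    return np.mean(buffer_ends, axis=0)  # tail average over buffers
```

In a fully connected network ($P = \frac{1}{m}\mathbf{1}\mathbf{1}^\top$), the consensus step replaces every local estimate with the network average after each update.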
Coupled Process: Despite the gap between buffers, there is still dependency across data points in different buffers, which makes the analysis challenging. To simplify the analysis, we consider the idea of a coupled process [kowshik2021streaming], where for the state sequence received by agent $i$, the corresponding coupled sequence $\{\tilde{x}^i_t\}$ is defined as follows.
1) For each buffer, the starting state is independently generated from $\pi$, the stationary distribution of the state.
2) The coupled process evolves according to the noise sequence of the actual process, such that $\tilde{x}^i_{t+1} = A^* \tilde{x}^i_t + \eta^i_t$ within each buffer.
We will see in the analysis that it is more straightforward to first compute the estimation error based on the coupled process and then quantify the excess error introduced by replacing the coupled process with the actual one. Note that, based on the dynamics of the coupled process, this excess error is related to the gap size $u$.
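For illustration, a coupled trajectory for a single buffer could be simulated as follows, assuming access to $A^*$, the buffer's actual noise realizations, and a sampler for the stationary distribution (all names here are hypothetical):

```python
import numpy as np

def coupled_buffer(A_star, noise, stationary_draw):
    """One buffer of the coupled process: restart from an independent
    stationary sample, then reuse the *actual* noise realizations."""
    x = stationary_draw                # independent draw from pi
    traj = [x]
    for eta_t in noise:                # same noise as the true trajectory
        x = A_star @ x + eta_t
        traj.append(x)
    return np.stack(traj)
```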
III-C. Theoretical Guarantee
In this section, we present our main theoretical result. By running Algorithm 1 with properly chosen hyper-parameters, we show that, in the estimation error upper bound of any agent, the term corresponding to the leading term in [kowshik2021streaming] improves as the network size increases.
Theorem 1
Let Assumptions 1-2 hold and consider the following hyper-parameter setup:
1) the buffer size $B$;
2) the step size $\gamma$, whose choice involves a problem-dependent constant;
3) the gap size $u$.
Then, by applying Algorithm 1, with high probability (quantified by some positive constants), the estimation error of the tail-averaged estimate of agent $i$ over all buffers is bounded as in (1), where the quantities appearing in the bound are problem-dependent constants.
Remark 1
IV. Numerical Experiments
Experiment Setup: We consider a network of identical LTI systems, where the transition matrix has the form $A^* = Q \Lambda Q^\top$. Here, $Q$ is a randomly generated orthogonal matrix, and $\Lambda$ is a diagonal matrix with half of its diagonal entries equal to one fixed value and the rest equal to another. For the step size, we set $\gamma$ based on a normalization quantity estimated as the sum of the norms of the first samples of each agent. The remaining hyper-parameters (the buffer size, the gap size, and the time horizon) are fixed across all experiments. The starting state and the noise follow the standard Gaussian distribution.
Networks: We examine the dependence of the estimation error on 1) the network size and 2) the network topology. For the former, we consider the centralized SGD-RER [kowshik2021streaming] and our distributed algorithm on cyclic graphs with various numbers of agents $m$. For the latter, we look at the performance of a fixed-size network with the following topologies: a) $P = I$ (Net A), the fully disconnected network; b) $P$ capturing a cyclic graph, where each agent has a positive self-weight and assigns equal weights to its two neighbors (Net B); c) $P = \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ (Net C), the fully connected network.
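A sketch of how such a transition matrix could be generated is given below; the dimension and the two diagonal values are illustrative placeholders rather than the exact values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10                                   # illustrative state dimension
# Random orthogonal Q via QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
# Diagonal spectrum: half the entries at one value, the rest at another
# (0.9 and 0.5 are placeholders, not the paper's reported values)
lam = np.concatenate([np.full(d // 2, 0.9), np.full(d - d // 2, 0.5)])
A_star = Q @ np.diag(lam) @ Q.T
assert np.linalg.norm(A_star, 2) < 1     # Assumption 1 (spectral norm < 1)
```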
Performance: Fig. 2 verifies the dependency of the estimation precision on the network size: for a larger network, the error of the tail-averaged estimate is smaller. Furthermore, all decentralized schemes outperform the centralized SGD-RER [kowshik2021streaming]. To see the influence of the network topology, notice that the value of $\beta$ for the three networks is ordered as $\beta_{\mathrm{A}} > \beta_{\mathrm{B}} > \beta_{\mathrm{C}}$. We can see in Fig. 3 that a smaller $\beta$ results in a smaller estimation error. Furthermore, to highlight the effect of D-SGD-RER, we also plot the estimation error of vanilla distributed SGD in a fully connected network. Its error does not shrink over time, as the estimator is biased, so the reverse update process in D-SGD-RER is critical.
V. Conclusion
In this work, we considered the distributed online system identification problem for a network of identical LTI systems. We proposed a distributed online estimation algorithm and characterized the estimation error with respect to the network size and topology. We observed in numerical experiments that a larger network size, as well as better connectivity, improves the estimation performance. As a future direction, this online estimation method can be combined with decentralized control techniques to build decentralized, real-time adaptive controllers, as explored in [chang2021distributed, chang2021regret]. Another potential direction is to extend the technique to the identification of nonlinear dynamical systems.
A. Notation and Proof Sketch
To analyze the estimation error under the network topology $P$, we decompose it into two parts: (I) the estimation error resulting from applying the fully connected network as the communication scheme; (II) the difference between the estimates derived from $P$ and from the fully connected scheme. To better analyze (I), we apply the idea of coupled processes: we consider each agent's estimate based on the coupled process. From the estimate initialization and the update scheme, these estimates coincide across agents under the fully connected topology, and the resulting common iterate follows a recursive update rule.
In Section B, we give bounds on the bias and variance terms, which, together with the results for (II), yield the main theorem.
For ease of analysis, we introduce additional notation used in the rest of the proof. The estimate consists of a bias term and a variance term.
B. Bounds on the Estimation Error of the Bias and Variance Terms
Lemma 2
Suppose that Assumption 2 holds and the step-size condition is satisfied. Then, for any buffer index and any agent, we have
For ease of analysis, we introduce the following notation.
From the notation above, under the corresponding event, we have
(2) 
where the second inequality comes from the following result:
Using the bound above and the corresponding definition, we get
(3) 
where the first inequality is due to the sub-Gaussianity of the noise and the fact that it is independent of the earlier states; the second inequality comes from (2); and the last inequality is based on the step-size condition. Then, by expanding the recursive relationship in (3), the result is proved.
Corollary 3
From Lemma 2, we have
Lemma 4
Suppose that the step-size condition holds; then, under the stated event, the following relationship holds:
The result can be derived by following a proof similar to that of Lemma 28 in [kowshik2021streaming].
Lemma 5
Suppose that Assumption 2 holds, the step-size condition is satisfied, and there exists a constant such that the stated bounds hold. Then, for any admissible index, we get
where the involved quantity is problem-dependent.
The result can be derived with Lemma 4 and a proof similar to that of Lemma 30 in [kowshik2021streaming].
Lemma 6
Suppose that the conditions in Lemma 5 hold and there exists a constant such that the stated bounds are satisfied. Then, conditioned on the corresponding event, we have:
1) the first bound holds almost surely;
2) the second bound holds, where the involved constants are problem-dependent.
The result can be derived by following a proof similar to that of Lemma 31 in [kowshik2021streaming].
Here, we define an additional quantity used in the sequel.
Lemma 7
Suppose that the conditions in Lemma 6 hold, together with the additional step-size condition. Then, for any admissible index, we get
where the involved constant depends on the problem parameters.
Consider the tailaveraged estimate:
Define the following events:
Theorem 8
The result can be derived by following a proof similar to that of Theorem 33 in [kowshik2021streaming].
Lemma 9
Suppose that the step-size condition and the stated event hold. Then, for the relevant indices, we have
Consider the update rule of the estimate:
(4) 
Based on the equation above, it can be shown that
(5) 
For the first term, based on the update rule, the choice of step size, and the stated condition, we have
from which we conclude that, for the relevant indices,
(6) 
For the second term, it can be shown that
(7) 
where the first inequality is due to the stated condition, and the last inequality comes from the gap between buffers. The remaining term can be bounded similarly.
C. Difference between the Estimates under $P$ and under the Fully Connected Scheme
Lemma 10
Suppose that the step-size condition and the stated event hold. Then, for the relevant indices, we have, for any agent,