System identification, the process of estimating the unknown parameters of a dynamical system from an observed input-output sequence, is a classical problem in control, reinforcement learning, and time-series analysis. Among this class of problems, learning the transition matrix of an LTI system is a prominent, well-studied case, and classical results characterize the asymptotic properties of the corresponding estimators [aastrom1971system, ljung1999system, chen2012identification, goodwin1977dynamic].
Recently, there has been renewed interest in the identification of LTI systems, with modern statistical techniques applied to obtain finite-time sample complexity guarantees [dean2019sample, jedra2020finite, faradonbeh2018finite, simchowitz2018learning, sarkar2019near]. However, the aforementioned works focus on the offline setup, where the estimator has access to the entire data sequence from the outset. Offline estimators cannot be directly extended to the streaming/online setup, where the system parameters must be estimated on the fly. To this end, stochastic gradient descent with reverse experience replay (SGD-RER) was proposed [kowshik2021streaming] as an online estimator: the data sequence is stored in several buffers, and the SGD update proceeds backward within each buffer to break the time dependency between data points. This online method achieves the optimal sample complexity of the offline setup up to log factors.
In this paper, we study the distributed online identification problem for a network of identical LTI systems with unknown dynamics. Each system is modeled as an agent in a multi-agent network, which receives its own data sequence from the underlying system. The goal of this network is to jointly estimate the system parameters by leveraging the communication between agents. We propose DSGD-RER, a distributed version of the online system identification algorithm in [kowshik2021streaming], where every agent applies SGD in the reverse order and communicates its estimate with its neighbors. We show that this decentralized scheme can improve the estimation error bound as the network size increases, and the simulation results demonstrate this theoretical property.
I-A Related Work
Recently, several works have studied the finite-time properties of LTI system identification. In [dean2019sample], it is shown that the dynamics of a fully observable system can be recovered from multiple trajectories by a least-squares estimator when the number of trajectories scales linearly with the system dimension. Furthermore, system identification using a single trajectory has been studied for stable [jedra2020finite] and unstable [faradonbeh2018finite, simchowitz2018learning, sarkar2019near] LTI systems. The works of [simchowitz2018learning, jedra2019sample] establish theoretical lower bounds on the sample complexity for fully observable LTI systems. The case of partially observable LTI systems has also been addressed for stable [oymak2019non, sarkar2019finite, tsiamis2019finite, simchowitz2019learning] and unstable [zheng2020non] systems. By applying an $\ell_1$-regularized estimator, the work of [fattahi2020learning] improves the dependency of the sample complexity on the system dimension to poly-logarithmic scaling. Contrary to the aforementioned works, which focus on the finite-time analysis of offline estimators, [kowshik2021streaming] proposes an online estimation algorithm, where the data sequence is split into several buffers, and within each buffer the SGD update is applied in the reverse order to remove the bias due to the time dependency of data points. As mentioned before, our work builds on [kowshik2021streaming] for distributed online system identification. Finite-time analysis of system identification has more recently been extended to non-linear dynamical systems: in [sattar2020non, foster2020learning], sample complexity bounds are derived for dynamical systems described by generalized linear models, and [kowshik2021near] further improves the dependence of the complexity bound on the mixing-time constant.
II Problem Formulation
We use the following notation in this work:
II-B Distributed Online System Identification
We consider a multi-agent network of $m$ identical LTI systems, where the dynamics of agent $i$ is defined as
\[
X_{t+1}^i = A^* X_t^i + \eta_t^i,
\]
where $A^*$ is the unknown transition matrix, and $\{\eta_t^i\}$ is a noise sequence generated independently from a distribution with zero mean and finite covariance $\Sigma$. The goal of the agents is to recover the matrix $A^*$ collaboratively. Though this task can be accomplished by an individual agent (e.g., as in [kowshik2021streaming]), we will show that collective system identification improves the theoretical error bound of the estimation.
Assumption 1: The considered system is stable in the sense that $\Vert A^* \Vert < 1$.
Under Assumption 1, it can be shown that for any initial state $X_0$, the distribution of $X_t$ converges to a stationary distribution $\pi$ with covariance matrix $G = \sum_{k=0}^{\infty} (A^*)^k \Sigma \,(A^*)^{k\top}$. Note that Assumption 1 is more stringent than the standard stability assumption (i.e., spectral radius $\rho(A^*) < 1$); however, the analysis extends to this case under some modifications of the time horizon (see [kowshik2021streaming] for more details).
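Numerically, this stationary covariance is the fixed point of the map $G \mapsto A^* G A^{*\top} + \Sigma$; the sketch below computes it by fixed-point iteration for an illustrative stable system (the matrices here are placeholders, not those used in the paper):

```python
import numpy as np

def stationary_covariance(A, Sigma, tol=1e-12, max_iter=10_000):
    """Iterate G <- A G A^T + Sigma until convergence (valid when ||A|| < 1)."""
    G = Sigma.copy()
    for _ in range(max_iter):
        G_next = A @ G @ A.T + Sigma
        if np.linalg.norm(G_next - G) < tol:
            return G_next
        G = G_next
    return G

# Example: a stable 2-D system with identity noise covariance.
A = np.array([[0.5, 0.1], [0.0, 0.4]])
Sigma = np.eye(2)
G = stationary_covariance(A, Sigma)
# G solves the discrete Lyapunov equation G = A G A^T + Sigma.
assert np.allclose(G, A @ G @ A.T + Sigma)
```

Because $\Vert A^* \Vert < 1$, the iteration contracts geometrically and converges from any initialization.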
Assumption 2: For any $i$ and $t$, the noise $\eta_t^i$ is sub-Gaussian, i.e.,
\[
\mathbb{E}\big[\exp\big(\langle v, \eta_t^i \rangle\big)\big] \le \exp\big(\sigma^2 \Vert v \Vert^2 / 2\big)
\]
for any $v \in \mathbb{R}^d$.
Sub-Gaussian noise is a standard choice for finite-time analysis of system identification (see, e.g., [abbasi2011online, simchowitz2018learning, sarkar2019near]). Also, in some existing works, i.i.d. Gaussian noise is assumed to ensure persistence of excitation [jedra2019sample, oymak2019non]. In this paper, we assume that $\Sigma$ is positive-definite.
Departing from the classical offline ordinary least squares (OLS) estimator, the goal is to develop a fully decentralized algorithm, where each agent produces estimates of the unknown system $A^*$ in an online fashion. Specifically, at each iteration, agent $i$ first updates its estimate based on the current information, and then communicates its estimate to its neighbors. The communication is modeled via a graph that captures the underlying network topology. As mentioned earlier, the goal of an online distributed system identification algorithm, as opposed to the centralized one, is to leverage the network structure to provide a finer guarantee on each agent's estimation error. In this work, we quantify the estimation error of each agent, such that the error bound encapsulates the dependency on the network size and topology.
During the estimation process, the agents communicate locally based on a symmetric doubly stochastic matrix $W$, i.e., all elements of $W$ are non-negative and $\sum_j W_{ij} = \sum_i W_{ij} = 1$. Agents $i$ and $j$ communicate with each other if $W_{ij} > 0$; otherwise, $W_{ij} = 0$. Thus, $\mathcal{N}_i = \{ j : W_{ij} > 0 \}$ is the neighborhood of agent $i$. The network is assumed to be connected, i.e., for any two agents $i, j$, there is a (potentially multi-hop) path from $i$ to $j$. We also assume $W$ has a positive diagonal. Then, there exists a geometric mixing bound for $W$ [liu2008monte], such that $\Vert W^k - \frac{1}{m}\mathbf{1}\mathbf{1}^\top \Vert \le \beta^k$, where $\beta$ is the second largest singular value of $W$. Agents exchange only local system estimates and do not share any other information in the process. The communication is consistent with the structure of $W$; we elaborate on this in the algorithm description.
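As a concrete illustration, the sketch below builds such a matrix $W$ for a ring network (with hypothetical weights: self-weight 1/2 and 1/4 per neighbor), computes its second largest singular value, and checks the geometric mixing bound numerically:

```python
import numpy as np

def cyclic_mixing_matrix(m, self_weight=0.5):
    """Symmetric doubly stochastic matrix for a ring: self-loop plus two neighbors."""
    W = np.zeros((m, m))
    w = (1.0 - self_weight) / 2.0
    for i in range(m):
        W[i, i] = self_weight
        W[i, (i - 1) % m] = w
        W[i, (i + 1) % m] = w
    return W

m = 8
W = cyclic_mixing_matrix(m)
ones = np.ones((m, m)) / m
beta = np.linalg.svd(W - ones, compute_uv=False)[0]  # second largest singular value of W
# Geometric mixing: ||W^k - (1/m) 11^T|| <= beta^k.
for k in (1, 5, 10):
    assert np.linalg.norm(np.linalg.matrix_power(W, k) - ones, 2) <= beta**k + 1e-9
```

For a symmetric $W$, subtracting the all-ones average matrix removes the principal singular value $1$, so the top singular value of the difference is exactly $\beta$.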
III Algorithm and Theoretical Results
We now lay out the distributed online linear system identification algorithm and provide the theoretical bound for the estimation error.
III-A Offline and Online Settings
For the identification of LTI systems, it is well known that the (centralized) OLS estimator is statistically optimal. OLS is an offline estimator that can also be implemented in an online fashion (i.e., as recursive least squares) by keeping track of the data covariance matrix and the residual based on the current estimate. On the other hand, gradient-based methods provide efficient mechanisms for system identification. In particular, SGD uses the gradient induced by the current data pair $(X_t, X_{t+1})$ to perform the update
\[
A_{t+1} = A_t + \gamma\,\big(X_{t+1} - A_t X_t\big) X_t^\top,
\]
where $\gamma > 0$ is the step size. Despite the efficient update, SGD suffers from the time dependency of the data sequence, which leads to biased estimates. To observe this, unrolling the recursive update rule of SGD, we have
\[
A_T - A^* = (A_0 - A^*) \prod_{t=0}^{T-1} \big(I - \gamma X_t X_t^\top\big)
+ \gamma \sum_{t=0}^{T-1} \eta_t X_t^\top \prod_{s=t+1}^{T-1} \big(I - \gamma X_s X_s^\top\big),
\]
where, in the second term, the dependence of later states on previous noise realizations prevents the estimator from being unbiased, even if $A_0$ is initialized such that $\mathbb{E}[A_0] = A^*$. To deal with this issue, [kowshik2021streaming] develops SGD-RER, which applies SGD in the reverse order of the sequence to break the dependency over time. Suppose that the estimate is updated along the opposite direction of the sequence. In particular, if we have samples $X_0, \ldots, X_T$ and use the SGD update in the reverse order (processing $t = T-1$ down to $t = 0$), the problematic term above takes the following form
\[
\gamma \sum_{t=0}^{T-1} \eta_t X_t^\top \prod_{s=0}^{t-1} \big(I - \gamma X_s X_s^\top\big),
\]
whose expectation is equal to zero, since the later noise realizations are independent of previous states. Evidently, this approach does not work in an online sense (as the entire sequence of samples has to be collected first). However, we can mimic it by dividing the data into smaller buffers and applying reverse SGD within each buffer [kowshik2021streaming].
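For intuition, the reverse-order pass over a single fully collected buffer can be sketched as follows (the system, step size, and horizon are illustrative choices for the example only):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, gamma = 2, 500, 0.01
A_star = np.array([[0.5, 0.1], [0.0, 0.4]])  # illustrative stable system

# Simulate one trajectory X_{t+1} = A* X_t + eta_t.
X = np.zeros((T + 1, d))
for t in range(T):
    X[t + 1] = A_star @ X[t] + rng.standard_normal(d)

# Reverse-order SGD over the buffer: process samples t = T-1, ..., 0,
# so each noise term only ever multiplies states that precede it in time.
A_hat = np.zeros((d, d))
for t in reversed(range(T)):
    A_hat += gamma * np.outer(X[t + 1] - A_hat @ X[t], X[t])

assert np.linalg.norm(A_hat - A_star) < 1.0  # rough sanity check on the estimate
```

The update line is exactly the SGD rule above; only the iteration order changes, which is what removes the bias term in expectation.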
III-B Distributed SGD with Reverse Experience Replay
We extend SGD-RER to the distributed case and call the result the DSGD-RER algorithm. Each agent splits its sequence of data pairs into several buffers of size $B$, and within each buffer all agents perform SGD in the reverse order collectively, based on the network topology. Between two consecutive buffers, a gap of $u$ data pairs is discarded in order to decrease the correlation of data across buffers. The proposed method is outlined in Algorithm 1 and depicted in Fig. 1.
Averaged Iterate over Buffers: To further improve the convergence rate, we average the iterates. Each agent outputs the tail-averaged estimate, i.e., the average of the last estimates across all buffers (line 11 of Algorithm 1).
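The overall scheme can be summarized in the following simplified sketch. It is an interpretation rather than a faithful reproduction of Algorithm 1: here the mixing step is applied after every local update, and the tail average is taken over the final iterate of each buffer, which may differ in detail from the paper's pseudocode.

```python
import numpy as np

def dsgd_rer(X, W, gamma, B, u):
    """Simplified DSGD-RER sketch.
    X: (m, T+1, d) state trajectories, one per agent.
    W: (m, m) doubly stochastic mixing matrix.
    Buffers of size B separated by gaps of size u; reverse-order SGD
    inside each buffer, followed by a consensus step across agents."""
    m, T1, d = X.shape
    T = T1 - 1
    A = np.zeros((m, d, d))          # current estimates
    tail_sum = np.zeros((m, d, d))   # running sum of end-of-buffer estimates
    n_buffers = 0
    start = 0
    while start + B <= T:
        for t in reversed(range(start, start + B)):  # reverse order in buffer
            for i in range(m):
                err = X[i, t + 1] - A[i] @ X[i, t]
                A[i] = A[i] + gamma * np.outer(err, X[i, t])
            A = np.einsum('ij,jkl->ikl', W, A)  # consensus via W
        tail_sum += A
        n_buffers += 1
        start += B + u  # skip the gap to weaken inter-buffer dependence
    return tail_sum / max(n_buffers, 1)  # averaged estimates

# Tiny example: 4 agents observing the same illustrative system.
rng = np.random.default_rng(1)
m, d, T = 4, 2, 600
A_star = np.array([[0.5, 0.1], [0.0, 0.4]])
X = np.zeros((m, T + 1, d))
for i in range(m):
    for t in range(T):
        X[i, t + 1] = A_star @ X[i, t] + rng.standard_normal(d)
W = np.full((m, m), 1.0 / m)  # fully connected mixing for the demo
A_hat = dsgd_rer(X, W, gamma=0.01, B=50, u=10)
assert np.linalg.norm(A_hat[0] - A_star) < 1.0
```

The consensus step mixes the $m$ estimate matrices with the rows of $W$, so with a well-connected $W$ each agent effectively averages information from the whole network.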
Coupled Process: Despite the gap between buffers, there is still dependency across data points in different buffers, which makes the analysis challenging. To simplify the analysis, we adopt the idea of a coupled process [kowshik2021streaming], where for the state sequence $\{X_t^i\}$ received by agent $i$, the corresponding coupled sequence $\{\tilde{X}_t^i\}$ is defined as follows.
For each buffer, the starting state is generated independently from $\pi$, the stationary distribution of the state.
The coupled process evolves according to the noise sequence of the actual process, such that
\[
\tilde{X}_{t+1}^i = A^* \tilde{X}_t^i + \eta_t^i.
\]
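Since the coupled process shares the noise sequence with the actual one, their difference within a buffer satisfies $X_t^i - \tilde{X}_t^i = (A^*)^t (X_0^i - \tilde{X}_0^i)$ and contracts geometrically under Assumption 1, which is what makes the approximation argument work. A small numerical check (illustrative system; the coupled start is drawn from a standard Gaussian as a stand-in for $\pi$):

```python
import numpy as np

rng = np.random.default_rng(2)
d, B = 2, 50
A = np.array([[0.5, 0.1], [0.0, 0.4]])  # illustrative stable system

noise = rng.standard_normal((B, d))      # shared noise sequence
x0_actual = rng.standard_normal(d)       # actual buffer start
x0_coupled = rng.standard_normal(d)      # fresh draw (stand-in for pi)

def roll(x0):
    """Evolve x_{t+1} = A x_t + eta_t with the shared noise."""
    xs = [x0]
    for t in range(B):
        xs.append(A @ xs[-1] + noise[t])
    return np.array(xs)

X, X_tilde = roll(x0_actual), roll(x0_coupled)
# X_t - X~_t = A^t (x0 - x0~), so the gap shrinks at rate ||A||^t.
gaps = np.linalg.norm(X - X_tilde, axis=1)
assert gaps[-1] < gaps[0] * 0.01
```

This is the mechanism behind discarding $u$ samples between buffers: after the gap, the actual state is nearly indistinguishable from a fresh stationary draw.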
We will see in the analysis that it is more straightforward to first compute the estimation error based on the coupled process, and then quantify the excess error introduced by replacing the coupled process with the actual one. Note that, based on the dynamics of the coupled process, this excess error is controlled by the gap size $u$.
The estimate is initialized as the zero matrix for all agents. The number of buffers is $N = T/(B+u)$, where $T$ is the length of the data sequence.
III-C Theoretical Guarantee
In this section, we present our main theoretical result. By running Algorithm 1 with the specified hyper-parameters, we show that, in the estimation error upper bound of any agent, the term corresponding to the leading term of [kowshik2021streaming] improves as the network size increases.
, where is a problem-dependent constant.
where , and are problem-dependent constants.
IV Numerical Experiments
Experiment Setup: We consider a network of LTI systems, where the transition matrix has the form $A^* = U \Lambda U^\top$.
$U$ is a randomly generated orthogonal matrix, and $\Lambda$ is a diagonal matrix with half of its diagonal entries set to one value and the rest set to another. For the step size, we set $\gamma$ based on a normalization constant estimated as the sum of the norms of the first samples of the agent. The other hyper-parameters (buffer size, gap size, and horizon) are set accordingly. The starting state is the zero vector, and the noise follows the standard Gaussian distribution.
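The construction of such a transition matrix can be sketched as follows; the two eigenvalue levels (0.9 and 0.5 below) are placeholders, since the paper's actual values are elided here:

```python
import numpy as np

def random_transition_matrix(d, lam_big=0.9, lam_small=0.5, seed=0):
    """A* = U diag(lams) U^T with U a random orthogonal matrix.
    lam_big/lam_small are illustrative values, not the paper's."""
    rng = np.random.default_rng(seed)
    # QR of a Gaussian matrix yields an orthogonal factor U.
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))
    lams = np.array([lam_big] * (d // 2) + [lam_small] * (d - d // 2))
    return U @ np.diag(lams) @ U.T

A_star = random_transition_matrix(4)
# The matrix is symmetric, so its spectral norm equals the largest
# eigenvalue magnitude, and the system satisfies Assumption 1.
assert np.linalg.norm(A_star, 2) < 1.0
```

Keeping all eigenvalues strictly inside the unit interval guarantees $\Vert A^* \Vert < 1$, matching the stability assumption of the analysis.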
Networks: We examine the dependence of the estimation error on 1) the network size and 2) the network topology. For the former, we compare the centralized SGD-RER [kowshik2021streaming] with our distributed algorithm on cyclic graphs with various numbers of agents. For the latter, we look at the performance of a fixed-size network with the following topologies: a) $W = I$ (Net A), the fully disconnected network; b) $W$ capturing a cyclic graph, where each agent keeps a self-weight and assigns equal weights to its two neighbors (Net B); c) $W = \frac{1}{m}\mathbf{1}\mathbf{1}^\top$ (Net C), the fully connected network.
Performance: Fig. 2 verifies the dependency of the estimation precision on the network size: for a larger network, the error of the tail-averaged estimate is smaller. Furthermore, all decentralized schemes outperform centralized SGD-RER [kowshik2021streaming]. To see the influence of the network topology, notice that the value of $\beta$ for the three networks is ordered as $\beta_A > \beta_B > \beta_C$. We can see in Fig. 3 that a smaller $\beta$ results in a better estimation error. Furthermore, to isolate the effect of the reverse updates in DSGD-RER, we also plot the estimation error of vanilla distributed SGD in a fully connected network. The error does not shrink over time, as the estimator is biased, so the reverse update process in DSGD-RER is critical.
In this work, we considered the distributed online system identification problem for a network of identical LTI systems. We proposed a distributed online estimation algorithm and characterized the estimation error with respect to the network size and topology. We indeed observed in numerical experiments that a larger network size as well as better connectivity improve the estimation performance. For future directions, this online estimation method can be combined with decentralized control techniques to build decentralized, real-time adaptive controllers, as explored in [chang2021distributed, chang2021regret]. Another potential direction is to extend the technique to the identification of non-linear dynamical systems.
A Notations and Proof Sketch
To analyze the estimation error under the network topology $W$, we decompose it into two parts: (I) the estimation error resulting from applying the fully connected network as the communication scheme; (II) the difference between the estimates derived from $W$ and from the fully connected scheme. To better analyze (I), we apply the idea of coupled processes: we consider agent $i$'s estimate based on the coupled process. From the estimate initialization and the update scheme, this estimate satisfies the following recursive update rule:
In Section B, we derive the corresponding bounds, which, together with the results for (II), yield the main theorem.
For ease of analysis, we use the following notation throughout the rest of the proof:
The error consists of the bias term and the variance term, which are expressed as follows:
B Bounds on the Estimation Error
Suppose that Assumption 2 holds. Then, for any agent and time, we have
For ease of analysis, we introduce the following notation:
From the notation above, under the specified event we have
where the second inequality comes from the following result:
Knowing that and defining
where the first inequality is due to the sub-Gaussianity of the noise and the fact that it is independent of the preceding states; the second inequality comes from (2), and the last inequality is based on the parameter choice. Then, by expanding the recursive relationship in (3), the result is proved.
From Lemma 2, we have
Suppose that the stated condition holds; then, under the event defined above, the following relationship holds:
The result can be derived by following an argument similar to the proof of Lemma 28 in [kowshik2021streaming].
Suppose that , Assumption 2 holds, and there exists a constant such that and . Then for any we get
The result can be derived from Lemma 4 and an argument similar to the proof of Lemma 30 in [kowshik2021streaming].
Suppose that the conditions in Lemma 5 hold and there exists a constant such that and . Then conditioned on , we have
, where are constants and depends on .
The result can be derived by following an argument similar to the proof of Lemma 31 in [kowshik2021streaming].
Here we define .
Suppose that and the conditions in Lemma 6 hold. Then for any , we get
where , and is a constant depending on .
Consider the tail-averaged estimate:
Define the following events:
The result can be derived by following an argument similar to the proof of Theorem 33 in [kowshik2021streaming].
Suppose that and the event of holds. Then for and , we have
Consider the update rule of the estimate:
Based on the equation above, it can be shown that
For this term, based on the update rule, the choice of step size, and the stated condition, we have
from which we conclude that for and ,
For the other term, it can be shown that
where the first inequality is due to the stated condition and the last inequality comes from the gap between buffers. Similarly, the analogous bound can be shown.
C Difference between the Estimates under the Two Communication Schemes
Suppose that and the event of holds. Then for and , we have for any agent