# Distributed Identification of Contracting and/or Monotone Network Dynamics

This paper proposes methods for identification of large-scale networked systems with guarantees that the resulting model will be contracting – a strong form of nonlinear stability – and/or monotone, i.e. order relations between states are preserved. The main challenges that we address are: simultaneously searching for model parameters and a certificate of stability, and scalability to networks with hundreds or thousands of nodes. We propose a model set that admits convex constraints for stability and monotonicity, and has a separable structure that allows distributed identification via the alternating directions method of multipliers (ADMM). The performance and scalability of the approach is illustrated on a variety of linear and non-linear case studies, including a nonlinear traffic network with a 200-dimensional state space.

## I Introduction

System identification is the process of generating dynamic models from data [1], and is also referred to as learning dynamical systems (e.g. [2]). When scaling control and identification algorithms to large-scale systems, it can be useful to treat a system as a sparse network of local subsystems interconnected through a graph [3, 4, 5]. In this paper, we propose algorithms for identification of such networked systems in state space form:

$$x_{t+1} = a(x_t, u_t, u_{t+1}), \tag{1}$$

where $x_t$ and $u_t$ are the state and input respectively, and the model dynamics $a$ can be either linear or non-linear. We assume that measurements (or estimates) of state and input sequences are available.

Our approach:

1. uses distributed computation (i.e. network nodes only share data and parameters with immediate neighbors),

2. can generate models with a strong form of stability called contraction,

3. can generate monotone models, i.e. ordering relations between states are preserved.

Imposing contraction and/or monotonicity on models provides two benefits when identifying systems that are known to satisfy those properties. Firstly, incorporating prior knowledge can significantly improve the quality of the identified models. Secondly, it guarantees that properties of the real system that are useful for controller design are present in the identified model.

This work is motivated by the observation that many systems have the combination of large-scale, sparse dynamics, monotonicity and stability. Examples include traffic networks [6, 7, 8], chemical reactions [9], combination therapies [10, 11, 12, 13], wildfires [14] and power scheduling [10].

The key technical difficulty we address is the simultaneous identification and stability verification of large-scale networked systems. We propose a convex model set with scalable stability conditions and an algorithm based on ADMM that decomposes the identification problem into easily solvable sub-problems that require only local communication between subsystems.

### I-A Networked System Identification

Standard approaches to system identification do not work well for large-scale networked systems for three reasons [15]: firstly, the dataset must be collected at a central location, a process which may be prohibitive for complex systems; secondly, the computational and memory complexities prohibit application to large systems; finally, the network structure may not be preserved by identification. For instance, standard subspace identification methods have computational and memory complexities that grow rapidly with the state dimension and data length, and any sparse structure in the dynamics is destroyed through an unknown similarity transformation [16].

Previous work in networked system identification can be loosely categorized into two areas: the identification of a network topology [17, 18, 19], and the identification of a system’s dynamics with known topology. In the latter category, almost all prior work has focused on the case where subsystems are linear time invariant (LTI) and described by state space models [15, 20, 21] or transfer functions (a.k.a. modules) [22, 23].

When identifying the subsystem dynamics, states or outputs of neighbors are treated as exogenous inputs, ignoring feedback loops induced by the network topology. This improves scalability as the identification of each subsystem can be performed in parallel. However, accurate identification of the individual subsystems does not imply accurate identification of the full network, because the ignored feedback loops may have a strong effect and even introduce instability. A simple case with two subsystems which has received significant attention is closed-loop identification [24].

Prior works in networked system identification assume stable LTI network dynamics and establish identifiability [25] and consistency [26]. These assumptions then imply model stability in the infinite data limit. However, model stability is not guaranteed with finite data sets or in non-linear black-box identification problems, where the true system is usually not in the model set.

### I-B Identification of Stable Models

Standard methods for system identification do not guarantee model stability, even if the system from which the data are collected is stable. For linear system identification considerable attention has been paid to this problem, and several methods have been suggested based on regularisation or model constraints [27, 28, 29, 30]. Even for linear systems, the set of stable models is not convex in standard parameterisations, and to the authors’ knowledge all existing methods introduce some bias into the identification procedure.

For linear systems most definitions of stability are equivalent. The nonlinear case is more nuanced, and the definition used depends on the requirements of the problem at hand. Standard Lyapunov methods are not appropriate in system identification as the stability certificate must be constructed about a known stable solution, whereas the very purpose of system identification is to predict a system’s response to previously unseen inputs. Contraction [31] and incremental stability (e.g. [32]) are more appropriate since they ensure stability of all possible solutions and consequently, do not require a-priori knowledge of the inputs and state trajectories.

Stability guarantees have also been investigated for nonlinear system identification. For instance, systems can be identified using sets of stable recurrent neural networks [33] or stable Gaussian process state space models [34]. A limitation of these approaches is that they do not allow joint search for a model and its stability certificate, which can be conservative even for linear systems.

This paper builds on previous work in jointly-convex parameterization of models and their stability certificates via implicit models [35, 36, 37, 38, 39, 40] and associated convex bounds for model fidelity via Lagrangian relaxation [38, 41, 42]. The main development in this paper is to significantly improve scalability of this approach via a novel model parametrization and contraction constraint that are jointly convex and permit a particular upstream/downstream network decomposition (defined below).

### I-C Monotone and Positive Systems

Monotone systems are a class of dynamical systems characterized by the preservation of an order relation between solutions (cf. Definition 2 below). A closely related class is positive systems, for which state variables remain non-negative for all non-negative inputs (cf. Definition 3 below). For linear systems, positivity and monotonicity are equivalent.

A useful property of monotone systems is that they often admit simplified stability tests. In particular, for linear positive systems the existence of separable Lyapunov functions, i.e. those representable as the sum or maximum over functions of individual state variables, is necessary and sufficient for stability [43]. This property has been used to simplify analysis [44], control [45] and identification [46] of positive systems. Separable stability certificates have also been shown to exist for certain classes of nonlinear monotone systems [47, 48, 49, 50], and have been used for distributed stability verification [51] and control [52]. Monotonicity can also simplify nonlinear model predictive control [10] and formal verification using signal temporal logic [53].

There are, however, few identification algorithms that guarantee monotonicity. In [54], monotone gene networks are identified using the monotone P-splines developed in [55]. This approach, however, does not guarantee model stability.

### I-D Least-Squares Equation Error

Identification typically involves the optimization of a quality of fit metric over a model set. In this paper we use what is arguably the simplest and most widely-applied quality-of-fit metric, least-squares equation error (a.k.a. one step ahead prediction error):

$$J_{ee}(\theta) = \sum_{t=0}^{T-1} \left| a(\tilde{x}_t, \tilde{u}_t) - \tilde{x}_{t+1} \right|^2, \tag{2}$$

where $\tilde{x}_t$ and $\tilde{u}_t$ are state and input measurements or estimates. Least-squares equation error is a natural choice for short-term prediction if state measurements are available.

If long-term predictions are needed, then simulation error, defined as

$$J_{se}(\theta) = \sum_{t=0}^{T-1} |x_t - \tilde{x}_t|^2, \quad \text{s.t. } x_{t+1} = a(x_t, \tilde{u}_t), \tag{3}$$

is a better measure of performance. The dependence on simulated states, however, renders the cost function non-convex [56, 57] and notoriously difficult to optimize [58]. Consequently, equation error optimization is often used to initialize local search methods (e.g. gradient descent) for models with good simulation error or used as a surrogate for simulation error with better numerical properties. In the latter context, model stability is particularly important since a model can have small equation error but be unstable and therefore exhibit very large simulation error. In fact, when a model is contracting, it can be shown that small equation error implies small simulation error [35].
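To make the distinction concrete, the following sketch compares equation error (2) and simulation error (3) for a hypothetical scalar linear model; the system, parameter values, and data are invented for illustration. An unstable model can achieve modest one-step-ahead error while its rollout diverges.

```python
import numpy as np

# Hypothetical scalar linear model x_{t+1} = a_hat*x_t + b_hat*u_t,
# used only to contrast the two quality-of-fit metrics.
def equation_error(a_hat, b_hat, x_meas, u_meas):
    # One-step-ahead (equation) error, cf. (2):
    # each prediction restarts from the *measured* state x_meas[t].
    pred = a_hat * x_meas[:-1] + b_hat * u_meas[:-1]
    return float(np.sum((pred - x_meas[1:]) ** 2))

def simulation_error(a_hat, b_hat, x_meas, u_meas):
    # Simulation error, cf. (3): the model is rolled out from x_meas[0]
    # using its *own* previous prediction, so errors compound over time.
    x_sim = np.empty_like(x_meas)
    x_sim[0] = x_meas[0]
    for t in range(len(x_meas) - 1):
        x_sim[t + 1] = a_hat * x_sim[t] + b_hat * u_meas[t]
    return float(np.sum((x_sim - x_meas) ** 2))

# Noise-free data from a stable true system (a=0.9, b=1.0).
rng = np.random.default_rng(0)
u = rng.standard_normal(200)
x = np.zeros(200)
for t in range(199):
    x[t + 1] = 0.9 * x[t] + u[t]

print(equation_error(0.9, 1.0, x, u))    # 0 for the true, noise-free model
# An unstable model (|a_hat| > 1) still has finite equation error, but its
# simulation error blows up -- hence the importance of model stability.
print(simulation_error(1.05, 1.0, x, u) > simulation_error(0.9, 1.0, x, u))
```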

In many contexts, system state measurements are not available. Nevertheless, equation error frequently arises as a sub-problem via estimated states, e.g. in subspace identification algorithms [59, 60, 21], where states are estimated using matrix factorizations, or in maximum likelihood identification via the expectation maximization (EM) algorithm, where they are estimated from the joint smoothing distribution [61, 62].

### I-E Contributions

The main contributions of this work are as follows: we propose a model structure and convex constraints that guarantee monotonicity, positivity, and/or contraction of the model. For large-scale networked systems, we refine the model and constraints to have a separable structure, and we introduce a separable bound on equation error, so the identification problem can be solved using distributed computation. The algorithm, based on ADMM, decomposes into easily solved separable optimization problems at each step. Data and parameters are only communicated to immediate neighbours in the network. Finally, we evaluate the scalability and fitting performance of the method on a number of numerical examples.

## II Preliminaries and Problem Setup

### Notation

A graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is defined by a set of nodes (vertices) $\mathcal{V}$ and edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. The vector $\mathbf{1}$ is the column vector of ones, with size inferred from context. For vectors $a, b$, $a \ge b$ refers to the element-wise inequality. For matrices $A, B$, $A \ge B$ and $A > B$ refer to element-wise inequalities. For symmetric matrices $P$, $P \succ 0$ means that $P$ is positive definite. For a vector $v$, $\operatorname{diag}(v)$ is the matrix with the elements of $v$ along the diagonal. The set of $n \times n$ symmetric matrices is denoted $\mathbb{S}^n$. The set of non-singular M-matrices is denoted $\mathcal{M}^n$; for a matrix $A$, $A \in \mathcal{M}^n$ means $A_{ij} \le 0$ for $i \ne j$ and $\operatorname{Re}(\lambda_i(A)) > 0$, where $\lambda_i(A)$ are the eigenvalues of $A$. For brevity, we will sometimes drop the arguments from a function where the meaning may be inferred from context.

### II-A Differential Dynamics

The contraction and monotonicity conditions we study can be verified by way of a system's differential dynamics, a.k.a. linearized, variational, or prolonged dynamics. For the system (1), the differential dynamics are

$$\delta_{x_{t+1}} = A(x_t, u_t, u_{t+1})\,\delta_{x_t} + B(x_t, u_t, u_{t+1})\,\delta_{u_t}, \tag{4}$$

where $A = \partial a / \partial x_t$ and $B = \partial a / \partial u_t$. In conjunction with (1), the differential dynamics describe the linearized dynamics along all solutions of the system.

### II-B Contraction Analysis

We use the following definition of nonlinear stability:

###### Definition 1 (Contraction).

A system is termed contracting with rate $\alpha$, where $0 \le \alpha < 1$, if for any two initial conditions $x_0^a$, $x_0^b$, given the same input sequence $u_t$ and some $p \ge 1$, there exists a continuous, non-negative function $b(\cdot, \cdot)$ such that the corresponding trajectories satisfy $|x_t^a - x_t^b|^p \le \alpha^t\, b(x_0^a, x_0^b)$.

Contraction can be proven by finding a contraction metric which verifies conditions on the differential dynamics [31]. A contraction metric is a function $V(t, x, \delta)$ such that:

$$V(t, x, 0) = 0, \qquad V(t, x, \delta) \ge \mu |\delta|^p, \tag{5}$$

$$V(t+1, x_{t+1}, \delta_{x_{t+1}}) \le \alpha V(t, x_t, \delta_{x_t}), \tag{6}$$

for some $\mu > 0$ and $p \ge 1$.

The choice of contraction metric is problem dependent. Prior works have proposed quadratic contraction metrics, for which (6) is linear in the stability certificate and can be verified using semi-definite programming. A number of works have also noted that using a weighted $\ell^1$ norm can lead to separable constraints [63, 51], allowing for stability verification of large-scale networked systems.

In the context of system identification, the joint search for the model $a$ in (1) and a contraction metric $V$ is non-convex, due to the composition of $V$ with the unknown dynamics in (6).
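As a toy illustration of Definition 1, the sketch below checks a simple sufficient condition for contraction of a linear system $x_{t+1} = A x_t$ under a weighted $\ell^1$ metric $V(\delta) = w^\top |\delta|$. The matrix, weights, and rate are hypothetical, and this elementwise test is one simple choice of certificate, not the paper's general machinery.

```python
import numpy as np

# Sufficient condition: sum_i w_i |A_ij| <= alpha * w_j for every column j
# implies w^T |A d| <= alpha * w^T |d| for all d, i.e. V contracts at rate alpha.
def certifies_contraction(A, w, alpha):
    return bool(np.all(w @ np.abs(A) <= alpha * w + 1e-12))

A = np.array([[0.5, 0.2],
              [0.1, 0.6]])
w = np.ones(2)   # uniform weights suffice for this (column-dominant) example
print(certifies_contraction(A, w, alpha=0.9))   # True

# The metric indeed shrinks the difference between any two trajectories:
d = np.array([1.0, -2.0])         # a differential / trajectory difference
V0 = w @ np.abs(d)
V1 = w @ np.abs(A @ d)
print(V1 <= 0.9 * V0)             # True
```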

### II-C Monotone and Positive Systems

We now define monotonicity and positivity of dynamical systems.

###### Definition 2 (Monotone System).

A system (1) is termed monotone if for inputs $u_t^a$, $u_t^b$ and initial conditions $x_0^a$, $x_0^b$, the following implication holds:

$$x_0^a \ge x_0^b, \quad u_t^a \ge u_t^b \ \ \forall t \implies x_t^a \ge x_t^b \ \ \forall t.$$

Monotonicity results from $A(x_t, u_t, u_{t+1}) \ge 0$ and $B(x_t, u_t, u_{t+1}) \ge 0$, where $A$ and $B$ come from the differential dynamics (4).

###### Definition 3 (Positive System).

A system (1) is positive if for all inputs $u_t \ge 0$ and initial conditions $x_0 \ge 0$, the resulting trajectory satisfies $x_t \ge 0$ for all $t$.

A sufficient condition for a system to be positive is for it to be monotone and to admit $x_t = 0$ as a solution for $u_t = 0$, i.e. $a(0, 0, 0) = 0$ in (1).
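Definitions 2 and 3 can be checked numerically on a toy example. For a linear system $x_{t+1} = A x_t + B u_t$ with $A, B \ge 0$ elementwise, the system is monotone, and positive since $x = 0$ is a solution for $u = 0$; the matrices and inputs below are invented for illustration.

```python
import numpy as np

# Elementwise-nonnegative linear dynamics: monotone and positive.
A = np.array([[0.5, 0.3],
              [0.2, 0.4]])
B = np.array([[1.0], [0.5]])

def simulate(x0, u_seq):
    x = np.array(x0, dtype=float)
    traj = [x.copy()]
    for u in u_seq:
        x = A @ x + B @ u
        traj.append(x.copy())
    return np.array(traj)

u_a = [np.array([1.0])] * 10   # dominating input sequence
u_b = [np.array([0.5])] * 10
xa = simulate([2.0, 1.0], u_a)  # dominating initial condition
xb = simulate([1.0, 0.5], u_b)

print(np.all(xa >= xb))   # order preserved: monotone (Definition 2)
print(np.all(xb >= 0))    # nonneg. data give nonneg. states (Definition 3)
```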

### II-D Network Structure

We assume model (1) is partitioned into $N$ subsystems. The interactions between these subsystems are described by a directed graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. Here, we have a set of nodes $\mathcal{V} = \{1, \ldots, N\}$ corresponding to the subsystems. Each subsystem $i$ has its own state, denoted $x^i$, and may take an input, denoted $u^i$ (we allow for the case where a subsystem has no input). The global state and input are attained by concatenating the states and inputs of each subsystem,

$$x = \begin{bmatrix} x^1 \\ \vdots \\ x^N \end{bmatrix}, \qquad u = \begin{bmatrix} u^1 \\ \vdots \\ u^N \end{bmatrix}. \tag{7}$$

The set of edges $\mathcal{E}$ describes how the subsystems interact with each other. In particular, $(j, i) \in \mathcal{E}$ means that the state of subsystem $j$ affects the state of subsystem $i$. The edge list may arise naturally from the context of the problem, e.g. in traffic networks where edges come from the physical topology of the road network, or may be identified from data [17, 64].

For each subsystem $i$, we define the set of upstream neighbours $\mathcal{V}_u^i = \{j \mid (j, i) \in \mathcal{E}\}$ and the set of downstream neighbours $\mathcal{V}_d^i = \{j \mid (i, j) \in \mathcal{E}\}$. The term upstream neighbours of $i$ refers to the subsystems whose states affect the state of subsystem $i$, and the term downstream neighbours refers to the subsystems whose states are affected by subsystem $i$'s state. In general, we allow self-loops so that a node can be both upstream and downstream to itself. This notation is illustrated in Fig. 1.

We can write the dynamics of the individual interacting subsystems as follows:

$$x^i_{t+1} = a^i(\breve{x}^i_t, \breve{u}^i_t, \breve{u}^i_{t+1}), \qquad i = 1, \ldots, N, \tag{8}$$

where $a^i$ corresponds to the $i$th block of $a$ in (1), and $\breve{x}^i$ and $\breve{u}^i$ denote the concatenated states and inputs of the upstream neighbours $\mathcal{V}_u^i$.
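The upstream/downstream bookkeeping above is straightforward to construct from an edge list; the sketch below does so for a hypothetical 4-node directed chain with self-loops, following the convention that edge $(j, i)$ means subsystem $j$ affects subsystem $i$.

```python
# Build upstream/downstream neighbour sets from an edge list (Section II-D).
# Hypothetical topology: self-loops plus a directed chain 1 -> 2 -> 3 -> 4.
N = 4
edges = {(1, 1), (2, 2), (3, 3), (4, 4),   # self-loops
         (1, 2), (2, 3), (3, 4)}           # chain

upstream   = {i: sorted(j for (j, k) in edges if k == i) for i in range(1, N + 1)}
downstream = {i: sorted(k for (j, k) in edges if j == i) for i in range(1, N + 1)}

print(upstream[2])    # [1, 2]: node 2 is affected by node 1 and by itself
print(downstream[2])  # [2, 3]: node 2 affects itself and node 3
# Because of the self-loop, each node is both upstream and downstream of itself.
```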

### II-E Separable Optimization using ADMM

Consider an optimization problem of the form,

$$\min_\theta\ J(\theta), \tag{9}$$

which may include constraints on $\theta$ via indicator functions appearing in $J$. The indicator function for the constraint $\theta \in \Theta$ is the function which is zero for $\theta \in \Theta$ and infinite otherwise.

###### Definition 4 (Separable).

The problem (9) is termed separable with respect to the partitioning $\theta = (\theta^1, \ldots, \theta^N)$ if it can be written as $J(\theta) = \sum_{i=1}^N J^i(\theta^i)$.

In this paper we encounter problems of the form:

$$\min_\theta\ \sum_{i=1}^N J_a^i(\theta_a^i) + \sum_{j=1}^M J_b^j(\theta_b^j), \tag{10}$$

where $\{\theta_a^i\}_{i=1}^N$ and $\{\theta_b^j\}_{j=1}^M$ are two different partitions of the same vector $\theta$. In our context, these partitionings correspond to the sets of upstream or downstream neighbors discussed in the previous section. For such problems, the alternating directions method of multipliers (ADMM) can be applied [65]. We write (10) as

$$\min_{\theta, \phi}\ \sum_{i=1}^N J_a^i(\theta_a^i) + \sum_{j=1}^M J_b^j(\phi_b^j), \quad \text{s.t. } \theta - \phi = 0. \tag{11}$$

Applying ADMM results in iterations in which each step is separable with respect to the partition $\{\theta_a^i\}$ or $\{\phi_b^j\}$, and can thus be solved via distributed computing. For convex problems, ADMM is guaranteed to converge to the optimal solution [65].
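To make the consensus splitting concrete, here is a minimal numpy sketch of scaled-form ADMM for a problem of the form (11): a least-squares cost stands in for the fit objective, and a simple nonnegativity constraint stands in for the paper's (much richer) stability and monotonicity constraint set. All problem data are invented, and the sign convention follows the standard scaled form, which is equivalent up to the sign of the dual variable.

```python
import numpy as np

# min_theta  0.5*||A theta - b||^2 + g(phi)   s.t. theta = phi,
# where g is the indicator of {phi >= 0} (a stand-in constraint set).
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))
b = A @ np.array([1.0, -0.5, 2.0])   # unconstrained optimum has a negative entry

rho = 1.0
theta = np.zeros(3); phi = np.zeros(3); v = np.zeros(3)
AtA, Atb = A.T @ A, A.T @ b
for _ in range(1000):
    # theta-step: smooth cost plus proximal term, closed form for least squares
    theta = np.linalg.solve(AtA + rho * np.eye(3), Atb + rho * (phi - v))
    # phi-step: Euclidean projection onto the constraint set
    phi = np.maximum(theta + v, 0.0)
    # scaled dual update
    v = v + theta - phi

print(np.linalg.norm(theta - phi) < 1e-4)   # consensus reached
print(np.all(phi >= 0))                     # constraint satisfied
```

The same three-step pattern (smooth fit step, projection step, dual update) reappears in the distributed algorithm of Section IV-C, with each step additionally splitting across network nodes.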

### II-F Problem Statement

To summarise, the main objective of this paper is as follows. Given state and input measurements $\tilde{x}_t$, $\tilde{u}_t$ and a graph $\mathcal{G}$ describing the network topology, identify models (8) at each node such that:

• during the identification procedure, each subsystem only communicates with immediate (upstream and downstream) neighbours;

• convergence is guaranteed and least-squares equation error is small at each subsystem;

• model behavioural constraints such as contraction, monotonicity, and/or positivity can be guaranteed for the interconnected system (1).

## III Convex Behavioral Constraints

In this section we develop a convex parametrization of models with contraction, monotonicity and/or positivity guarantees. As described in subsection II-B, jointly searching for a model (1) and contraction metric is non-convex.

Following [37, 38], we solve this problem by instead searching for models in the following implicit form:

$$e(x_{t+1}, u_{t+1}) = f(x_t, u_t). \tag{12}$$

The differential dynamics of (12) are:

$$E(x_{t+1}, u_{t+1})\,\delta_{x_{t+1}} = F(x_t, u_t)\,\delta_{x_t} + K(x_t, u_t)\,\delta_{u_t}, \tag{13}$$

where $E = \partial e / \partial x$, $F = \partial f / \partial x$ and $K = \partial f / \partial u$.

###### Definition 5 (Well-Posed).

An implicit model of the form (12) is termed well-posed if for every $x_t$, $u_t$, $u_{t+1}$ there is a unique $x_{t+1}$ satisfying (12).

I.e., well-posedness means that $e$ is a bijection with respect to its first argument, and implies the existence of an explicit model of the form (1) where $a(x_t, u_t, u_{t+1}) = e(\cdot, u_{t+1})^{-1}(f(x_t, u_t))$. Furthermore, it implies that for any initial condition $x_0$ and sequence of inputs $u_t$, there exists a unique trajectory satisfying (12).

### III-A Stability and Monotonicity Constraints

In this section, we develop convex conditions on the implicit model (12) that guarantee well-posedness, monotonicity, positivity, and contraction. The main result is the following:

###### Theorem 1.

A model of the form (12) is:

(a) well-posed if there exists $\epsilon > 0$ such that for all $x, u$,

$$E(x, u) + E(x, u)^\top \succ \epsilon I; \tag{14}$$

(b) contracting with rate $\alpha$ if (a) holds and there exists a matrix function $S(x, u)$ such that for all $x, u$:

$$-S(x, u) \le F(x, u) \le S(x, u), \tag{15}$$

$$\mathbf{1}^\top \big( \alpha E(x, u) - S(x, u) \big) \ge 0; \tag{16}$$

(c) monotone if (a) holds and for all $x, u$:

$$F(x, u) \ge 0, \quad K(x, u) \ge 0, \quad E(x, u) \in \mathcal{M}^n; \tag{17}$$

(d) positive if (c) holds and:

$$e(0) = f(0, 0); \tag{18}$$

(e) contracting and monotone if (a) and (c) hold, and for all $x, u$:

$$\mathbf{1}^\top \big( \alpha E(x, u) - F(x, u) \big) \ge 0. \tag{19}$$

Positivity is also enforced if (18) holds.

###### Proof.

See appendix -A. ∎

We refer to the stability conditions in Theorem 1 (b) or (e) as contraction conditions, as they ensure contraction in a state-dependent weighted $\ell^1$ norm of the differentials, noting that for the purpose of contraction analysis the exogenous input can be considered as a time-variation.

###### Remark 1.

Theorem 1 requires an exponential contraction rate $\alpha$ to be specified. A weaker form of incremental stability can also be imposed by replacing (16) with

$$\mathbf{1}^\top \big( E(x, u) - S(x, u) \big) \ge \mu \mathbf{1}^\top \tag{20}$$

for some $\mu > 0$, and similarly for (19). This implies that $|\delta_{x_t}| \to 0$ as $t \to \infty$, following a line of reasoning similar to [38].
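For a linear implicit model $E x_{t+1} = F x_t + K u_t$, the conditions of Theorem 1 reduce to constant matrix inequalities that can be checked directly. The sketch below verifies (14), (17), and (19) for hand-picked matrices; it is an illustrative feasibility check, not an identification procedure.

```python
import numpy as np

# Hand-picked linear implicit model E x_{t+1} = F x_t + K u_t.
E = np.array([[ 1.0, -0.1],
              [-0.1,  1.0]])   # M-matrix pattern: nonpositive off-diagonals
F = np.array([[0.4, 0.2],
              [0.1, 0.3]])
K = np.array([[0.5], [0.2]])
alpha, eps = 0.9, 1e-3
ones = np.ones(2)

# (14) well-posedness: E + E^T positive definite (margin eps)
wellposed = bool(np.all(np.linalg.eigvalsh(E + E.T) > eps))
# (17) monotonicity: F >= 0 and K >= 0 elementwise
monotone = bool(np.all(F >= 0) and np.all(K >= 0))
# (19) contraction + monotonicity: 1^T (alpha*E - F) >= 0, columnwise
contracting = bool(np.all(ones @ (alpha * E - F) >= 0))

print(wellposed, monotone, contracting)   # all True for this example
```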

### III-B Model Parametrizations

As formulated above, Theorem 1 applies to models represented by the infinite dimensional space of continuously differentiable functions and . In practice, these functions are usually parametrized by a finite-dimensional vector. In this section we briefly discuss some common model parametrizations and how the constraints can be enforced.

For linear models, (14) is a semidefinite constraint and (15)-(19) are linear, so they can be enforced using semidefinite programming. Furthermore, if $E$ is diagonal, then (14) is also linear and the model set is polytopic.

If the functions and are multivariate polynomials or trigonometric polynomials, then the constraints can be enforced using sum of squares programming [66, 67].

The model set (12) also contains a class of recurrent neural networks with slope-restricted, invertible activation functions. In this case, $e$ is the inverse of the activation functions, $f$ is affine, and simulation of the explicit model yields the equation of a standard recurrent neural network [68]. The conditions in Theorem 1 (b) or (d) then correspond to diagonal dominance conditions on the weight matrices, which can be enforced via linear constraints.

Finally, if the requirement for global verification of these properties is relaxed, then the constraints can be applied pointwise for arbitrary parametrizations of $e$ and $f$, which amounts to linear and semidefinite constraints if $e$ and $f$ are linearly parametrized.

## IV Distributed Identification

In this section we consider the problem of distributed identification of networked systems with the behavioral constraints introduced in Theorem 1. First, we propose a particular structure for (12) for which the constraints in Theorem 1 are separable. We then propose an objective function that is separable (with respect to a different partition). Finally we propose an algorithm for fitting the proposed models that requires only local communication between subsystems at each step.

### IV-A Distributed Model

We propose the following model structure for distributed identification, in which $e^i$ depends only on local states and inputs, and the right-hand side is a summation of nonlinear functions of states and inputs from upstream neighbours:

$$e^i(x^i_{t+1}, u^i_{t+1}) = \sum_{j \in \mathcal{V}_u^i} f^{ij}(x^j_t, u^j_t). \tag{21}$$

Models of the form (21) are widely used for statistical modelling, and are referred to as generalized additive models (GAMs) [69]. This class of models also includes linear systems and a class of recurrent neural networks. We assume that each of the functions $e^i$ and $f^{ij}$ is linearly parametrized.

We define two partitions of the model parameters: the sets of upstream and downstream parameters, denoted $\theta_u^i$ and $\theta_d^i$ respectively. Objective functions, constraints and optimization problems are called upstream-separable or downstream-separable if they are separable with respect to these partitions. Upstream- and downstream-separable optimization problems are closely related to the column-wise and row-wise separable optimization problems used in [70].

For the parametrization (21), the differential dynamics have a sparsity pattern determined by the network topology. In particular, the $(i, k)$ block of $F$ is:

$$F^{ik} = \frac{\partial}{\partial x^k} \sum_{j \in \mathcal{V}_u^i} f^{ij}(x^j, u^j) = \begin{cases} \dfrac{\partial f^{ik}}{\partial x^k}, & k \in \mathcal{V}_u^i, \\ 0, & k \notin \mathcal{V}_u^i, \end{cases}$$

and $E$ is block diagonal. This means that the block $E^{ii}$ depends only on the parameters of $e^i$, and the block $F^{ik}$ depends only on the parameters of $f^{ik}$. As each block of $E$ and $F$ has an independent parametrization, functions of disjoint sets of elements of $E$ or $F$ will be separable.
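The additive structure and its sparsity can be illustrated with a toy GAM of the form (21) on a 3-node chain with self-loops; here each local $e^i$ is taken as the identity so the explicit update is just the sum of the per-edge nonlinearities, and the tanh form and gains are hypothetical.

```python
import numpy as np

# Upstream neighbour sets for a 3-node chain with self-loops.
upstream = {0: [0], 1: [0, 1], 2: [1, 2]}

def f_ij(i, j, x_j):
    # One scalar nonlinearity per edge (j -> i); saturating, for illustration.
    return 0.4 * np.tanh(x_j)

def step(x):
    # Explicit update: with e_i the identity, x_i(t+1) is the sum over
    # upstream neighbours, exactly the GAM right-hand side of (21).
    return np.array([sum(f_ij(i, j, x[j]) for j in upstream[i])
                     for i in upstream])

x = np.array([1.0, -0.5, 2.0])
print(step(x))
# Sparsity: node 0's update depends only on x[0]; perturbing x[2] cannot
# affect nodes 0 or 1, mirroring the block structure of F described above.
```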

### IV-B Convex Bounds for Equation Error

In Section IV-A we propose a convex set of implicit models. However, this approach shifts the convexity problem from the model set to the objective function, as equation error (2), subject to (12), is no longer convex in the model parameters.

One approach might be to minimize the implicit equation error

$$J_{iee} = \sum_{t=1}^{T-1} \left| e(\tilde{x}_{t+1}, \tilde{u}_{t+1}) - f(\tilde{x}_t, \tilde{u}_t) \right|^2 \tag{22}$$

as a surrogate for equation error. This approach, however, strongly biases the resulting model and leads to poor performance [42]. Instead we use the convex upper bound for equation error proposed in [42], which is based on Lagrangian relaxation.

The least-squares equation error (2) for the implicit model (12) is:

$$J_{ee}(\theta) = \sum_{t=1}^{T-1} |x_{t+1} - \tilde{x}_{t+1}|^2, \quad \text{s.t. } e(x_{t+1}, \tilde{u}_{t+1}) = f(\tilde{x}_t, \tilde{u}_t). \tag{23}$$

Note that this problem is not jointly convex in $\theta$ and $x_{t+1}$. The following convex upper bound was proposed in [42]:

$$J_{ee} \le \hat{J}_{ee}(\theta) = \sum_{t=1}^{T-1} \sup_{x_{t+1}} \Big\{ |x_{t+1} - \tilde{x}_{t+1}|^2 - 2\lambda(x_{t+1})^\top \big( e(x_{t+1}, \tilde{u}_{t+1}) - f(\tilde{x}_t, \tilde{u}_t) \big) \Big\}, \tag{24}$$

where $\lambda(\cdot)$ is a Lagrange multiplier. The function (24) is convex in $\theta$ as it is the supremum of an infinite family of convex functions [38].

For our parametrization (21), $E$ is block diagonal, which then implies that (24) is upstream separable, so it can be written as

$$\hat{J}_{ee}(\theta) = \sum_{i=1}^N \hat{J}_{ee}^i(\theta_u^i), \tag{25}$$

where

$$\hat{J}_{ee}^i(\theta_u^i) = \sum_{t=1}^{T-1} \sup_{x^i_{t+1}} \Big\{ |x^i_{t+1} - \tilde{x}^i_{t+1}|^2 - 2(x^i_{t+1} - \tilde{x}^i_{t+1})^\top \Big( e^i(x^i_{t+1}, \tilde{u}^i_{t+1}) - \sum_{j \in \mathcal{V}_u^i} f^{ij}(\tilde{x}^j_t, \tilde{u}^j_t) \Big) \Big\}.$$

The evaluation of $\hat{J}_{ee}^i$ is not trivial, as it involves the calculation of the supremum of a non-linear multivariate function. In this work, we linearise (25) with respect to $x^i_{t+1}$ and solve for the supremum of the resulting concave quadratic function, giving:

$$\hat{J}_{ee}^i(\theta_u^i) \approx \bar{J}_l^i(\theta_u^i) = \sum_{t=1}^{T-1} \epsilon_t^{i\top} \big( E_t^i + E_t^{i\top} - I \big)^{-1} \epsilon_t^i, \tag{26}$$

where $\epsilon_t^i = e^i(\tilde{x}^i_{t+1}, \tilde{u}^i_{t+1}) - \sum_{j \in \mathcal{V}_u^i} f^{ij}(\tilde{x}^j_t, \tilde{u}^j_t)$ is the implicit equation error and $E_t^i = E^i(\tilde{x}^i_{t+1}, \tilde{u}^i_{t+1})$. The cost function (26) can be optimized via a semidefinite program. Alternative methods for minimizing LREE can also be found in [42].
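For a fixed linear implicit model, one term of the cost (26) is straightforward to evaluate; the sketch below does so with assumed matrices $E$, $F$ (no input, for brevity) and surrogate "measurements", as a numerical illustration rather than an identification step. The weighting $(E + E^\top - I)^{-1}$ requires $E + E^\top \succ I$, which holds for the $E$ chosen here.

```python
import numpy as np

# Assumed linear implicit model e(x) = E x, f(x) = F x.
E = np.array([[1.2, 0.0],
              [0.0, 1.1]])
F = np.array([[0.5, 0.1],
              [0.2, 0.4]])

rng = np.random.default_rng(2)
x_meas = rng.standard_normal((50, 2))   # surrogate state "measurements"

W = np.linalg.inv(E + E.T - np.eye(2))  # fixed weighting in (26)
cost = 0.0
for t in range(49):
    eps_t = E @ x_meas[t + 1] - F @ x_meas[t]   # implicit equation error
    cost += eps_t @ W @ eps_t                    # one term of the sum in (26)

print(cost > 0.0)   # sum of positive-definite quadratic forms
```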

### IV-C Alternating Directions Method of Multipliers (ADMM)

In Section IV-A we introduced a model set for which the constraints in Theorem 1 are downstream separable, and in Section IV-B we introduced an upstream separable objective function. Note, however, that the constraints and objective are not jointly separable with respect to the same partition. We use ADMM to solve this problem.

We now develop the algorithm for the case where (12) is well-posed, monotone and contracting; a parallel construction without the monotonicity or contraction constraints introduces no additional complexity. Consider the following set of parameters:

$$\Theta_{m\ell_1} = \{\theta \mid \text{(14), (17), (19) hold}\}. \tag{27}$$

Applying ADMM as discussed in Section II-E to the problem $\min_{\theta \in \Theta_{m\ell_1}} \hat{J}_{ee}(\theta)$ gives the following scheme at iteration $k$:

$$\theta^{(k+1)} = \arg\min_\theta\ \hat{J}_{ee}(\theta) + \frac{\rho}{2} \| \theta - \phi^{(k)} - v^{(k)} \|^2, \tag{28}$$

$$\phi^{(k+1)} = \arg\min_{\phi \in \Theta_{m\ell_1}}\ \frac{\rho}{2} \| \theta^{(k+1)} - \phi - v^{(k)} \|^2, \tag{29}$$

$$v^{(k+1)} = v^{(k)} - \theta^{(k+1)} + \phi^{(k+1)}, \tag{30}$$

for some penalty parameter $\rho > 0$.

When using a GAM structure (21), we have the following result:

###### Proposition 1.

For the model structure (21), the ADMM iteration (28) separates into upstream-separable optimization problems of the form (31) and the ADMM iteration (29) separates into downstream-separable optimization problems of the form (32).

###### Proof.

See Appendix -B. ∎

In particular, the ADMM approach corresponds to performing the following iterations locally at each node $i$:

$$\theta_u^i(k+1) = \arg\min_{\theta_u^i}\ \bar{J}_l^i(\theta_u^i) + \frac{\rho}{2} \| \theta_u^i - \phi_u^i(k) - v_u^i(k) \|^2, \tag{31}$$

$$\phi_d^i(k+1) = \arg\min_{\phi_d^i \in \Theta_{m\ell_1}^i}\ \frac{\rho}{2} \| \theta_d^i(k+1) - \phi_d^i - v_d^i(k) \|^2, \tag{32}$$

$$v_u^i(k+1) = v_u^i(k) - \theta_u^i(k+1) + \phi_u^i(k+1). \tag{33}$$

The distributed algorithm is listed in Algorithm 1. The steps (31) and (32) require access to the upstream and downstream parameters respectively. These can be solved in parallel by the nodes in the graph; however, communication of both upstream and downstream parameters is necessary between steps. The update (33) is trivially separable and can be solved as either an upstream- or downstream-separable problem.

Termination of ADMM after a finite number of iterations means that the two parameter vectors $\theta$ and $\phi$ will disagree. For this reason, we take $\phi$ as the solution, to ensure that the well-posedness, monotonicity and contraction constraints (14), (17), (19) are satisfied.

## V Discussion

### V-A Conservatism of the Separable Model Structure

We have proposed searching over the model set (21) with (27), and it is important to understand which systems may fall into this model set. A particular question of interest is whether there are contracting and monotone systems that cannot be represented by this structure. There are two main reasons why this may occur: the separable structure of the model (21), and the assumption of a separable contraction metric in condition (19).

An exact characterization of the functions that can be approximated via the GAM structure (21) is difficult to give; however, such models have been widely applied in statistical modelling, see [69] for details. Note that while the functions in the implicit system (21) are additive, the resulting explicit system (8) may not be, since the inverse of an additive function need not itself be additive.

Conservatism may also be introduced by the assumption of a separable contraction metric. For the case of linear positive systems, it has been shown that the existence of a separable Lyapunov function is both necessary and sufficient for stability [43]. This means that the proposed model set contains all stable positive linear systems [46]:

###### Theorem 2.

For the system (21), if $e$ and $f$ are affine in $x$ and $u$, then the model set characterised by (14), (17) and (19) is a parametrization of all stable, discrete-time, positive linear systems.

###### Proof.

See Appendix -C. ∎

Things are more complicated for nonlinear monotone systems. Separable contraction metrics have been shown to exist for certain classes of monotone systems [49], and separable weighted contraction metrics have been used for the analysis of monotone systems [6, 51]. For incrementally exponentially stable systems, it has been shown that the existence of weighted contraction metrics is necessary and sufficient [50]; however, the state-dependent weighting depends on all system states and is therefore not separable in the sense we use. To the authors’ knowledge, a complete characterisation of the class of contracting monotone systems that admit separable metrics is still an open problem.

### V-B Consistency

It has been previously noted that system identification approaches that guarantee stability can introduce a bias towards systems that are too stable [28, 29, 71]. Empirical evidence suggests that for methods based on Lagrangian relaxation [41, 42] this bias is smaller.

There are a number of situations that lend themselves towards consistent identification. Firstly, consider the situation where we have noiseless state and input measurements produced by a model in the model set, so that $e(\tilde{x}_{t+1}, \tilde{u}_{t+1}) = f(\tilde{x}_t, \tilde{u}_t)$ for all $t$ at the true parameters. Then the implicit equation error is zero at those parameters, so the bound is tight and LREE recovers the true minimizer of equation error.

Now, consider the situation where the unconstrained minimizer of equation error (2) is a monotone, additive function that is contracting in the identity metric. That is, for the minimizing function $a^\star$, the following hold:

1. $a^\star$ is additive, so that (8) can be written as $x^i_{t+1} = \sum_{j \in \mathcal{V}_u^i} a^{ij}(x^j_t, u^j_t)$;

2. $\partial a^\star / \partial x \ge 0$ and $\partial a^\star / \partial u \ge 0$;

3. $\mathbf{1}^\top \partial a^\star / \partial x \le \alpha \mathbf{1}^\top$ for some $\alpha \in [0, 1)$.

Then, optimizing (26) returns the same solution as the unconstrained least-squares minimizer of (2).

###### Proposition 2.

Consider models of the form (21) with $e$ the identity and $f$ linearly parametrized. If properties 1, 2, 3 hold for the unconstrained least-squares minimizer $a^\star$ of (2), then optimizing (26) subject to (27) returns $a^\star$.

###### Proof.

Our proof mirrors that of [42, Sec. IV Proposition 1]. ∎

### V-C Iteration Complexity of Distributed Algorithm

In this section, we investigate the computational complexity of each step in the distributed algorithm. In general, the complexity depends on the model parametrization used, however, we limit our discussion to the case where the models are parametrized by polynomials and the constraints are enforced using sum of squares programming.

The first step, (31), is a semi-definite program and can be solved using standard solvers. If no structural properties are exploited, a primal-dual interior point method (IPM) requires a number of operations per iteration per node that grows polynomially in the number of upstream free parameters [72].

The second step, (32), is a sum-of-squares problem that can be solved as a semi-definite program. If and both have degree , then the size of the Gram matrix corresponding to (19) for the additive model (21) is . Solving (32) using a primal-dual IPM requires approximately operations per iteration per node [72], where is the number of downstream free parameters.

If a local computational resource is associated with each node in the network, and the number of neighbours for each node satisfies a uniform bound, then the time taken for each iteration will not increase with the number of nodes. However, computation time will grow quickly with the number of neighbours, the size of the local states, and the degrees of the polynomials used in the model.
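The distributed update pattern behind steps (31) and (32) can be illustrated with a toy global-consensus ADMM applied to ordinary least squares. Everything below is an illustrative sketch: the splitting, penalty parameter, and problem data are made up, and the paper's actual per-node steps are SDP/SOS programs rather than plain least squares.

```python
import numpy as np

# Toy global-consensus ADMM: each "node" holds local data (A_i, b_i) and the
# nodes jointly minimize sum_i ||A_i th - b_i||^2 over a shared parameter th.
rng = np.random.default_rng(1)
n_nodes, m, p = 4, 30, 5
As = [rng.normal(size=(m, p)) for _ in range(n_nodes)]
bs = [rng.normal(size=m) for _ in range(n_nodes)]

rho = 1.0                                    # ADMM penalty (illustrative)
th = [np.zeros(p) for _ in range(n_nodes)]   # local parameter copies
u = [np.zeros(p) for _ in range(n_nodes)]    # scaled dual variables
z = np.zeros(p)                              # consensus variable

for _ in range(300):
    # Local updates run in parallel across nodes (ridge-regularized LS).
    for i in range(n_nodes):
        th[i] = np.linalg.solve(As[i].T @ As[i] + rho * np.eye(p),
                                As[i].T @ bs[i] + rho * (z - u[i]))
    # Consensus (averaging) and dual updates.
    z = np.mean([th[i] + u[i] for i in range(n_nodes)], axis=0)
    for i in range(n_nodes):
        u[i] += th[i] - z
```

The per-iteration cost of each local solve is independent of the network size, which mirrors the scaling argument above: with one computational resource per node, wall-clock time per ADMM iteration stays flat as nodes are added.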

### V-D Other Quality of Fit Criteria

Lagrangian relaxation of least-squares equation error was chosen as it is convex, upstream separable, quick to compute, and leads to a simple implementation of ADMM. Any method that treats neighbouring states as exogenous inputs will be upstream separable. However, any such approach will also be susceptible to instability due to the introduction of new feedback loops via the network topology, even if it guarantees stability of the local models. Consequently, one can similarly apply any convex quality-of-fit criterion, such as convex upper bounds on simulation error [38, 41], and still guarantee convergence of ADMM. Alternatively, a non-convex quality-of-fit criterion like simulation error can be used at the expense of ADMM’s convergence guarantees.

If a model structure does not permit distributed identification, the conditions proposed in Section III can still be used to ensure stability and/or monotonicity. Joint convexity of the model set and stability constraints is still important, as it simplifies constrained optimization, allowing for the easy application of penalty, barrier, or projected gradient methods [39].

## VI Numerical Experiments

In this section we present numerical results exploring the scalability and identification performance of the proposed approach.

This section is structured as follows: first, we look at the identification of positive linear systems, and explore the computational complexity of the separable and quadratic-metric contraction conditions; we then explore the consistency of fitting nonlinear models when the true system lies in the model set, essentially analysing the effect of the convex bound on equation error; finally, we apply the method to the identification of a (simulated) nonlinear traffic network. The traffic network does not lie in the model set, so only an approximate model can be identified. We explore the regularising effect of the model constraints and the scalability of the method to large networks.

Previous methods for the identification of models with stability guarantees have ensured contraction using a quadratic metric [38, 41, 42]. Contraction is implied by the following semidefinite constraint:

 W(x,u,θ) ⪰ 0    ∀(x,u), (34)

where

 W(x,u,θ) = [ E(x,u) + E(x,u)⊤ − P − ηI    F(x,u)⊤
              F(x,u)                        P ]

where P = P⊤ ≻ 0 and η > 0. We refer to (34) as an ℓ2 contraction condition, as it implies the contraction conditions (6) with a state-dependent weighted norm of the differentials.
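As a numerical sanity check of the block LMI (34) in the linear case, the sketch below assembles W for one feasible choice, E = P and F = PA with P solving a discrete Lyapunov equation. This choice is purely illustrative (it is not the paper's parametrization): the Schur complement of W is then P − ηI − A⊤PA = ηI ≻ 0, so W is positive definite.

```python
import numpy as np

n, eta = 4, 0.1
rng = np.random.default_rng(2)
A = rng.normal(size=(n, n))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))  # make A Schur stable

# Solve P = A'PA + Q (i.e. P - A'PA = 2*eta*I) via the convergent series
# P = sum_k (A')^k Q A^k, valid since the spectral radius of A is < 1.
Q = 2 * eta * np.eye(n)
P, term = Q.copy(), Q.copy()
for _ in range(2000):
    term = A.T @ term @ A
    P += term

E, F = P, P @ A
W = np.block([[E + E.T - P - eta * np.eye(n), F.T],
              [F,                             P]])
assert np.min(np.linalg.eigvalsh(W)) > 0  # (34) holds for this choice
```

In the paper, E, F, and the metric are functions of (x, u) and (34) is enforced over all arguments via sum-of-squares programming; the constant-matrix check here only illustrates the block structure.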

We will make future reference to the following convex sets of parameters, in addition to Θ defined in (27):

 Θu={θ | (???),(???)},  Θm={θ | (???),(???),(???)} Θmℓ2={θ | (???),(???),(???)}

Here the subscripts refer to the following properties:

• Θ – monotone contracting models,

• Θm – monotone models,

• Θu – models that are not constrained to be contracting or monotone,

• Θmℓ2 – models that are monotone and contracting in the ℓ2 metric (34).

All functions are polynomials containing all monomials of their arguments up to a certain degree.

As a baseline for comparison, we also fit explicit polynomial models (1) by least squares without any separable structure imposed. We will also compare to standard wavelet and sigmoid Nonlinear AutoRegressive with eXogenous input (NARX) models implemented as part of the Matlab System Identification Toolbox.

For the implicit models, the model class prefix is followed by the degrees of the polynomials in parentheses. For example, the notation refers to unconstrained models with having degree and having degree . For the explicit polynomial models, the degree used follows in parentheses, so are explicit polynomial models of degree 5 in all arguments.

The NARX models were fit at each node using the regressors . The wavelet NARX models were set to automatically choose the number of basis functions, and the sigmoid NARX models were set to use a fixed number of basis functions. The focus for each model was set to produce the best performance: for the wavelet network we used a focus on simulation, and for the sigmoid network a focus on prediction.

The constraints (14), (17), (18), (19) and (34) are enforced using sum-of-squares programming [66]. All programs are solved using the SDP solver MOSEK with the parser YALMIP [73] on a standard desktop computer (Intel Core i7, 16 GB RAM).

### VI-A Identification of Linear Positive Systems

In this subsection we study the scalability of the proposed method for the identification of linear positive systems.

We compare the computation time using the proposed contraction constraint to a previously proposed contraction constraint (i.e., a quadratic Lyapunov function). Note that for linear systems, both model sets are parameterizations of all stable positive linear systems, so no difference in quality of fit is observed.

#### VI-A1 Scalability of Separable Linear and Quadratic Metrics

We illustrate the difference in scalability between the models and . Each experimental trial consists of the following steps:

1. A stable positive system with state dimension is randomly generated using Matlab’s rand function; the system matrix has a banded structure with bandwidth equal to . Stability is ensured by rescaling the matrix to have a spectral radius of .

2. The system is simulated for time steps; training data is obtained by adding white noise to the simulated states at an SNR of 40 dB.

3. This process is repeated 5 times for each .
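The trial setup above can be sketched in a few lines of numpy. The state dimension, bandwidth, trajectory length, and the 0.95 spectral radius below are illustrative stand-ins for the elided values; only the 40 dB SNR is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, band, T = 20, 2, 100  # illustrative sizes

# Random nonnegative matrix with banded structure (a positive system).
A = rng.random((n, n))
A[np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > band] = 0.0

# Ensure stability by rescaling the spectral radius (0.95 is illustrative).
A *= 0.95 / np.max(np.abs(np.linalg.eigvals(A)))

# Simulate the system, then add white measurement noise at an SNR of 40 dB.
x = np.zeros((T + 1, n))
x[0] = rng.random(n)
for t in range(T):
    x[t + 1] = A @ x[t]
noise_std = np.sqrt(np.mean(x ** 2) / 10 ** (40 / 10))
x_noisy = x + rng.normal(scale=noise_std, size=x.shape)
```

Because A is elementwise nonnegative, the simulated states remain nonnegative, which is the defining property of a positive linear system.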

The time taken to solve each optimization problem is shown in Fig. 2. Here, we see a significant improvement in computational complexity, from approximately cubic growth for the quadratic-metric models to linear growth for the separable models. The networked approach allows us to solve stable identification problems with at least states.

Note that no explicit attempts to exploit the sparsity of the system were made; use of solvers and parsers designed to exploit sparsity could improve performance, especially for the SDPs associated with the LMI parametrization, e.g. [74].

### VI-B Identification of Nonlinear Models

In this section we study the consistency of fitting nonlinear implicit models via the LREE bound on equation error. In Section V-B we saw that in the noiseless case, optimization of LREE will return the true model parameters. We will now explore the effect of introducing noise on the model estimates. The experiments in this section can be seen to supplement those in [42, Sec. IV] which studied the effects of noise and model stability on consistency in the linear setting.

We generate models by sampling a parameter vector and then projecting onto the set . The models have degree 3, state size and . We then generate training data with samples by randomly sampling from the uniform distribution, and generate noisy measurements by adding normally distributed noise with a specified signal-to-noise ratio (SNR). Models are then trained by minimizing with , and performance is measured using the Normalized Equation Error (NEE):

 NEE = |a(x,u) − a∗(x,u)|₂² / |a∗(x,u)|₂², (35)

where a is the identified dynamic model, a∗ is the true model, and |·|₂ denotes the sample estimate of the 2-norm of a function.
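A sample-based NEE computation, cf. (35), can be sketched as follows; the helper name `nee` and the toy functions are illustrative.

```python
import numpy as np

def nee(a_hat, a_true, X, U):
    """Sample-based Normalized Equation Error, cf. (35):
    ||a - a*||_2^2 / ||a*||_2^2 estimated over the sampled (x, u) pairs."""
    err = np.array([a_hat(x, u) - a_true(x, u) for x, u in zip(X, U)])
    ref = np.array([a_true(x, u) for x, u in zip(X, U)])
    return float(np.sum(err ** 2) / np.sum(ref ** 2))

# Toy check: a model whose output is 1.1x the true one gives NEE = 0.01.
rng = np.random.default_rng(3)
X, U = rng.normal(size=(50, 2)), rng.normal(size=(50, 1))
a_true = lambda x, u: x + 0.1 * u.sum()
a_hat = lambda x, u: 1.1 * a_true(x, u)
print(round(nee(a_hat, a_true, X, U), 6))  # -> 0.01
```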

In Figure 3, we have plotted the NEE that results from fitting models by optimizing LREE (26) and implicit equation error (22). We can see that LREE provides a much better fit than implicit equation error, especially as the number of data points increases.

To explore the effect of noise on the consistency of LREE, we have plotted NEE versus the size of the dataset for varying noise level (measured in decibels) in Figure 4. If we had a consistent estimator of the explicit model (1), we would expect the NEE to decrease with a consistent slope for all SNR levels. What we in fact observe, however, is that in noisier conditions the NEE initially decreases and then plateaus at a certain level. This phenomenon can also be seen in [75, Sec. IV], where LREE produces models biased towards being too stable, even in the infinite-data limit.

### VI-C Identification of Traffic Networks

In this section we examine a potential application of our approach: the identification of a traffic network. The dynamics of traffic networks are thought to be monotone when operating in the free-flow regime [7]. Note that monotonicity of some traffic models is lost when certain nodes are congested [76].

The data are generated using the model in [7], which is not in the proposed model set. Hence this section provides a test of robustness of the proposed approach to modelling assumptions.

For this application, we consider using equation error as a surrogate for simulation error. Model performance is therefore measured using Normalized Simulation Error (NSE):

 NSE = ∑ₜ |xₜ − x̃ₜ|² / ∑ₜ |x̃ₜ|², (36)

where x̃ₜ are the simulated states.
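Simulation error is computed by rolling the identified model forward and comparing against the measured trajectory, in contrast to the one-step equation error (35). A minimal sketch, with illustrative toy models (the denominator convention follows (36)):

```python
import numpy as np

def rollout(f, x0, U):
    """Simulate an identified model x_{t+1} = f(x_t, u_t) forward from x0."""
    xs = [np.asarray(x0, dtype=float)]
    for u in U:
        xs.append(np.asarray(f(xs[-1], u), dtype=float))
    return np.array(xs)

def nse(x, x_sim):
    """Normalized Simulation Error, cf. (36):
    sum_t |x_t - x~_t|^2 / sum_t |x~_t|^2."""
    return float(np.sum((x - x_sim) ** 2) / np.sum(x_sim ** 2))

# Toy usage: compare a perturbed model's rollout against the true trajectory.
f_true = lambda x, u: 0.9 * x + u
f_hat = lambda x, u: 0.85 * x + u
U = [0.5] * 20
x_true = rollout(f_true, [1.0], U)
x_hat = rollout(f_hat, [1.0], U)
error = nse(x_true, x_hat)
```

Unlike equation error, small one-step modelling errors compound over the rollout, which is why stability of the identified model matters so much for this metric.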

We will first introduce the model, then study the effect of the model constraints by comparison to existing methods, and finally examine scalability to large networks.

#### VI-C1 Simulation of a traffic network

The dynamics are simulated over a graph (e.g. Fig. 6), where each node represents a road with state corresponding to the density of traffic on the road, denoted ρ_i. Nodes marked in allow cars to flow into the network, and nodes marked out allow cars to flow out of the network. Each edge is randomly assigned a turning preference R_{ji}, normalized so that the total number of cars at each intersection is conserved. Each node has a capacity of . Vehicles transfer from road i to road j according to the routing policy,

 f_{i→j}(ρ) = R_{ji} d_i(ρ_i) min{ 1, s_j(ρ_j) / ∑_{k∈V_j^u} R_{kj} d_k(ρ_k) },

where d_i and s_i are monotone demand and supply curves for road i. The dynamics of the complete system are then found to be

 ρ̇_i = f_i^in − f_i^out, (37)

where

 f_i^in = { u_i, i ∈ in;  ∑_{j∈V_i^u} f_{j→i}, i ∉ in }

 f_i^out = { d_i(ρ_i), i ∈ out;  ∑_{j∈V_i^d} f_{i→j}, i ∉ out }.

The input nodes take a time-varying input u_i(t). We use the following method to generate data sets of size :

1. First, we generate an input signal for each input node. The signal changes value every 5 seconds to a new value that is normally distributed with mean . Negative values of the input are set to zero. An example input signal is shown in Fig. 5.

2. The dynamics (37) are integrated over seconds.

3. A training set of size is generated by sampling every 0.5 seconds.
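The data-generation procedure above can be sketched with a rough Euler integration of a three-road chain. The graph, the demand/supply curves d(ρ) = min(ρ, c) and s(ρ) = max(2c − ρ, 0), and all numeric parameters here are illustrative assumptions, not the model of [7]; only the 5-second input hold, the clipping of negative inputs, and the 0.5-second sampling come from the text.

```python
import numpy as np

# Illustrative 3-road chain: road 0 receives external inflow, road 2 exits.
cap = 1.0
d = lambda r: np.minimum(r, cap)            # monotone demand (assumed shape)
s = lambda r: np.maximum(2 * cap - r, 0.0)  # monotone supply (assumed shape)

def rhs(rho, u):
    # Single-upstream case of the routing policy: flow = min(demand, supply).
    f01 = d(rho[0]) * min(1.0, s(rho[1]) / max(d(rho[0]), 1e-9))
    f12 = d(rho[1]) * min(1.0, s(rho[2]) / max(d(rho[1]), 1e-9))
    return np.array([u - f01, f01 - f12, f12 - d(rho[2])])

# Piecewise-constant input held for 5 s; Euler integration, sampled every 0.5 s.
rng = np.random.default_rng(0)
dt, T = 0.01, 50.0
rho = np.zeros(3)
samples = []
for k in range(int(T / dt)):
    if k % int(5 / dt) == 0:
        u = max(rng.normal(0.3, 0.1), 0.0)  # negative inputs clipped to zero
    rho = rho + dt * rhs(rho, u)
    if k % int(0.5 / dt) == 0:
        samples.append(rho.copy())
samples = np.array(samples)
```

With these monotone demand/supply curves and a small Euler step, the densities stay nonnegative, consistent with the positivity of the traffic model.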

#### VI-C2 Regularization Effect of Model Constraints

In this section we will explore the effects of introducing monotonicity, positivity, and contraction constraints.

Introducing model constraints limits the expressivity of our model. Consequently, one might expect the estimator bias to increase and the variance to decrease [77, Chapter 7]. Empirical evidence in this section suggests that a judicious choice of constraints can reduce the variance with a minimal increase in bias.

Using the method outlined in Section VI-C1 for simulating a traffic network and the graph depicted in Fig. 6, we generate 100 different training sets of size with and then compare the results on three different validation sets. The first validation set has inputs generated with parameters (the same as the training set). The second and third validation sets have parameters and respectively. These are used to test the generalizability of our model to inputs outside the training set.

In Figures 10 and 14, we have plotted the NSE on both the training set and validation sets 1 and 2 for our proposed model sets, the polynomial model, the NARX models, and the model set . The percentage of models that displayed instability is indicated in the bar graph in the upper portion of each figure.

In all cases, the identified linear models performed poorly. This is unsurprising as the true system is highly non-linear.

Comparing the models and with the remaining models, we can see that the stability constraints have a regularizing effect, where increasing the degree of the polynomials reduces the median NSE; in other words, increasing model complexity improves model fidelity. The other models, on the other hand, perform worse with increasing complexity. This is most clearly seen in the models , where increasing the polynomial degree results in poorer fits on validation data.

Our results also suggest that model stability constraints significantly improve robustness. Without stability constraints, a model that appears stable during training may turn out to be unstable under a slight shift in the input data distribution. This can be seen most clearly in the models and , where on the training data distribution, most models are stable. However, increasing the variance of the inputs to the network results in a large number of unstable models with unbounded NSE, cf. panel (c) of the corresponding figures. Further evidence is shown in Table I, where we can see that once the variance of the input data doubles, almost all models that do not have stability constraints are unstable.

To compare against a standard approach, we also fit wavelet and sigmoid NARX models using the Matlab System Identification Toolbox. The resulting NSE is shown in Fig. 14, and the numbers of models producing unstable behaviour and negative state estimates are shown in Tables I and II respectively. While we observed extremely high performance of the individually identified sub-systems, simulating the network interconnection of those sub-systems produces many unstable models, many negative state estimates, and poor quality of fit.