Learning Topology and Dynamics of Large Recurrent Neural Networks

Large-scale recurrent networks have drawn increasing attention recently because of their capabilities in modeling a large variety of real-world phenomena and physical mechanisms. This paper studies how to identify all authentic connections and estimate system parameters of a recurrent network, given a sequence of node observations. This task becomes extremely challenging in modern network applications, because the available observations are usually very noisy and limited, and the associated dynamical system is strongly nonlinear. By formulating the problem as multivariate sparse sigmoidal regression, we develop simple-to-implement network learning algorithms with rigorous convergence guarantees for a variety of sparsity-promoting penalty forms. A quantile variant of progressive recurrent network screening is proposed for efficient computation; it allows for direct cardinality control of the network topology in estimation. Moreover, we investigate recurrent network stability conditions in Lyapunov's sense and integrate such stability constraints into sparse network learning. Experiments show excellent performance of the proposed algorithms in network topology identification and forecasting.


I Introduction

There has been an increasing interest in identifying network dynamics and topologies in the emerging scientific discipline of network science. In a dynamical network, the evolution of a node is controlled not only by itself, but also by other nodes. For example, in gene regulatory networks [1], the expression levels of genes influence each other, following some dynamic rules, such that the genes are connected together to form a dynamical system. If the topology and evolution rules of the network are known, we can analyze the regulation between genes or detect unusual behaviors to help diagnose and cure genetic diseases. Similarly, the modeling and estimation of dynamical networks are of great importance in various domains, including stock markets, brain networks, and social networks [2, 3, 4]. To accurately identify the topology and dynamics underlying these networks, researchers have devoted much effort to developing appropriate mathematical models and corresponding estimation methods.

In the literature, linear dynamical models are commonly used. For example, the human brain connectivity network [5] can be characterized by a set of linear differential equations, where the rate of change of the activation/observation of any node is a weighted sum of the activations/observations of its neighbors: $dx_i/dt = \sum_{j\neq i} \alpha_{ij} x_j - d_i x_i$. Here the $\alpha_{ij}$ provide the connection weights and $d_i$ is the decay rate. Nevertheless, a lot of complex dynamical networks clearly demonstrate nonlinear relationships between the nodes. For instance, the strength of influence is unbounded in the previous simple linear combination, but the so-called "saturation" effect widely exists in physical/biological systems (neurons, genes, and stocks)—the external influence on a node, no matter how strong the total input activation is, cannot go beyond a certain threshold. To capture this mechanism, nonlinearity must be introduced into the network system: $dx_i/dt = l_i\,\pi(\sum_{j\neq i}\alpha_{ij}x_j + u_i) - d_i x_i + c_i$, where $\pi$ denotes a nonlinear activation function, typically taken to be the sigmoidal function $\pi(\theta) = 1/(1+e^{-\theta})$. It has a proper shape to resemble many real-world mechanisms and behaviors.

This model description is associated with a continuous-time recurrent neural network. The feedback loops allow the network to exhibit interesting dynamic temporal behaviors and to capture many kinds of relationships. It is also biologically realistic in, say, modeling the effect of an input spike train. Recurrent networks have been successfully applied to a wide range of problems in bioinformatics, financial market forecasting, electric circuits, computer vision, and robotics; see, e.g., [6, 7, 8, 9, 10] among many others.

In practical applications, it is often necessary to include noise contamination: $dx_i = \big(l_i\,\pi(\sum_{j\neq i}\alpha_{ij}x_j + u_i) - d_i x_i + c_i\big)dt + \sigma\,dB_{i,t}$, where $B_t = [B_{1,t},\cdots,B_{n,t}]^T$ stands for an $n$-dimensional Brownian motion. Among the very many unknown parameters, $A = [\alpha_{ij}]$ might be the most important: the zero-nonzero pattern of $A$ indicates if there exists a (direct) connection from node $j$ to node $i$. Collecting all such connections results in a directed graph to describe the node interaction structure.

A fundamental question naturally arises: Given a sequence of node observations (possibly at very few time points), can one identify all existing connections and estimate all system parameters of a recurrent network?

This task becomes extremely challenging in modern big network applications, because the available observations are usually very noisy and only available at a relatively small number of time points (say $T$), due to budget or equipment limitations. One frequently faces applications with $n$ much larger than $T$. In addition, in this continuous-time setting, no analytical formula of the likelihood exists for the stochastic model, which increases the estimation difficulty even in large samples [11]. Instead of considering multi-step ad hoc procedures, this paper aims at learning the network system as a whole. Multivariate statistical techniques will be developed for identifying the complete topology and recovering all dynamical parameters. To the best of our knowledge, automatic topology and dynamics learning in large-scale recurrent networks has not been studied before.

In this work, we are interested in networks that are sparse in topology. First, many real-world complex dynamical networks indeed have sparse or approximately sparse structures. For example, in regulatory networks, a gene is only regulated by a handful of others [12]. Second, when the number of nodes is large or very large compared with the number of observations, the sparsity assumption reduces the number of model parameters so that the system is estimable. Third, from a philosophical point of view, sparse network modeling is consistent with the principle of Occam's razor.

Not surprisingly, there is a surge of interest in using compressive sensing techniques for parsimonious network topology learning and dynamics prediction. However, relying on sparsity alone seems to have only limited power in addressing the difficulties of large-scale network learning from big data. To add more prior knowledge and to further reduce the number of effective unknowns, we propose to study how to incorporate structural properties of the network system, in addition to sparsity, into learning and estimation. In fact, real-life networks of interest usually demonstrate asymptotic stability. This is one of the main reasons why practitioners perform only a limited number of measurements of the system, and it provides another type of parsimony or shrinkage in network learning.

In this paper we develop sparse sigmoidal network learning algorithms, with rigorous convergence guarantees, for a variety of sparsity-promoting penalty forms. A quantile variant, progressive recurrent network screening, is proposed for efficient computation and allows for direct cardinality control of the network topology in estimation. Moreover, we investigate recurrent network stability conditions in Lyapunov's sense, and incorporate such stability constraints into sparse network learning. The remainder of this paper is organized as follows. Section II introduces the sigmoidal recurrent network model and formulates a multivariate regularized problem based on the discrete-time approximate likelihood. Section III proposes a class of sparse network learning algorithms based on the study of sparse sigmoidal regressions. A novel and efficient recurrent network screening (RNS) with theoretical guarantee of convergence is advocated for topology identification in ultra-high dimensions. Section IV investigates asymptotic stability conditions in recurrent systems, resulting in stable-sparse sigmoidal (S³) network learning. In Section V, synthetic-data experiments and real applications are given. All proof details are left to the Appendices.

II Model Formulation

To describe the evolving state of a continuous-time recurrent neural network, one usually defines an associated dynamical system. Ideally, without any randomness, the network behavior can be specified by a set of ordinary differential equations:

$$\frac{dx_i}{dt} = l_i\,\pi\Big(\sum_{j\neq i}\alpha_{ij}x_j + u_i\Big) - d_i x_i + c_i,\qquad i=1,\cdots,n, \tag{1}$$

where $x_i$, short for $x_i(t)$, denotes the dynamic process of node $i$. Throughout the paper, $\pi$ is the sigmoidal activation function

$$\pi(\theta) = \frac{1}{1+e^{-\theta}}, \tag{2}$$

which is most frequently used in recurrent networks. This function is smooth and strictly increasing, and it has a proper shape to resemble many real-world mechanisms and behaviors. Due to noise contamination, a stochastic differential equation model is more realistic:

$$dx_i = \Big(l_i\,\pi\Big(\sum_{j\neq i}\alpha_{ij}x_j + u_i\Big) - d_i x_i + c_i\Big)dt + \sigma\,dB_t, \tag{3}$$

where $B_t$ is a standard Brownian motion and $\sigma$ reflects the stochastic nature of the system. Typically $l_i, d_i > 0$, and in some applications $\alpha_{ii} = 0$ (no self-regulation) is required.

In the sigmoidal recurrent network model, the coefficients $\alpha_{ij}$ characterize the between-node interactions. In particular, $\alpha_{ij} = 0$ indicates that node $j$ does not directly influence node $i$; otherwise, node $j$ regulates node $i$, and is referred to as a regulator of node $i$ in gene regulatory networks. Such a regulation relationship can be either excitatory (if $\alpha_{ij}$ is positive) or inhibitory (if $\alpha_{ij}$ is negative). In this way, $A = [\alpha_{ij}]$ is associated with a directed graph that captures the Granger causal relationships between all nodes [13], and the topology of the recurrent network is revealed by the zero-nonzero pattern of the $A$ matrix. Therefore, to identify all significant regulation links, it is of great interest to estimate $A$, given a sequence of node observations (snapshots of the network).

Denote the system state at time $t$ by $x(t)$ or $x_t$ (or simply $x$ when there is no ambiguity). Define $A = [\alpha_{ij}] \in \mathbb{R}^{n\times n}$, $u = [u_1,\cdots,u_n]^T$, $c = [c_1,\cdots,c_n]^T$, $L = \mathrm{diag}\{l_1,\cdots,l_n\}$, and $D = \mathrm{diag}\{d_1,\cdots,d_n\}$. Then (3) can be represented in a multivariate form

$$dx_t = \big(L\pi(Ax+u) - Dx + c\big)dt + \sigma\,dB_t, \tag{4}$$

where $B_t$ is an $n$-dimensional standard Brownian motion. While this model is specified in continuous time, in practice the observations are always collected at discrete time points. Estimating the parameters of an SDE model from few discrete observations is very challenging, and there rarely exists an analytical expression of the exact likelihood. A common treatment is to discretize (4) and use an approximate likelihood instead. We use the Euler discretization scheme (see, e.g., [14]):

$$\Delta x = \big(L\pi(Ax+u) - Dx + c\big)\Delta t + \sigma\,\Delta B_t.$$
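As an illustration, the Euler scheme can be used to simulate trajectories from model (4). The following sketch is our own (the function name and parameter layout are assumptions, not from the paper); it encodes the no-self-regulation convention by simply zeroing the diagonal of $A$:

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def simulate_network(A, u, l, d, c, sigma, x0, dt, steps, rng=None):
    """Euler-Maruyama simulation of dx = (L pi(Ax+u) - Dx + c) dt + sigma dB_t.
    A: (n, n) connection matrix (set A[i, i] = 0 for no self-regulation);
    u, l, d, c: length-n parameter vectors. Returns a (steps+1, n) trajectory."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    traj = [x.copy()]
    for _ in range(steps):
        drift = l * sigmoid(A @ x + u) - d * x + c
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.standard_normal(x.size)
        traj.append(x.copy())
    return np.array(traj)
```

With $\sigma = 0$ the sketch reduces to a forward-Euler solver for the deterministic system (1).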

Suppose the system (4) is observed at time points $t_1 < t_2 < \cdots < t_{T+1}$. Let $x_s$ be the observed values of all nodes at $t_s$. Define $\Delta t_s = t_{s+1} - t_s$ and $\Delta x_s = x_{s+1} - x_s$, $1 \le s \le T$. Because $\Delta B_t \sim N(0, \Delta t\, I)$, the negative conditional log-likelihood for the discretized model is given by

$$\begin{aligned} \ell(x_1,\cdots,x_{T+1}\,|\,x_1) &= -\log P(\Delta x_1,\cdots,\Delta x_T\,|\,x_1) \\ &= -\log P(\Delta x_1\,|\,x_1)\cdots P(\Delta x_T\,|\,x_T) \\ &= \sum_{s=1}^{T} \big\|\Delta x_s/\Delta t_s - \big(L\pi(Ax_s+u) - Dx_s + c\big)\big\|_2^2\,\Delta t_s/(2\sigma^2) + C(\sigma^2) \\ &=: f(A,u,l,d,c)/\sigma^2 + C(\sigma^2). \end{aligned}$$

$C(\sigma^2)$ is a function that depends on $\sigma^2$ only. The fitting criterion $f$ is separable in the nodes. To see this, let $\alpha_i^T$ denote the $i$th row of $A$ and $x_{i,s}$ the $i$th component of $x_s$. Then, it is easy to verify that

$$f(A,u,l,d,c) = \frac{1}{2}\sum_{i=1}^{n}\sum_{s=1}^{T}\Big(\frac{\Delta x_{i,s}}{\Delta t_s} - \big(l_i\pi(\alpha_i^T x_s + u_i) - d_i x_{i,s} + c_i\big)\Big)^2 \Delta t_s. \tag{5}$$
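For concreteness, the criterion (5) can be evaluated directly from a sequence of snapshots. This sketch (our own naming) vectorizes over nodes and time steps:

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

def approx_nll(A, u, l, d, c, X_obs, t_obs):
    """Eq. (5): f = (1/2) sum_{i,s} (dx_{i,s}/dt_s - drift_i(x_s))^2 * dt_s,
    where X_obs holds one row x_s per time point and t_obs are the times."""
    dt = np.diff(t_obs)                       # Delta t_s, length T
    Y = np.diff(X_obs, axis=0) / dt[:, None]  # Delta x_{i,s} / Delta t_s
    Xs = X_obs[:-1]                           # states x_s, s = 1, ..., T
    drift = sigmoid(Xs @ A.T + u) * l - Xs * d + c
    return 0.5 * np.sum((Y - drift) ** 2 * dt[:, None])
```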

Conventionally, the unknown parameters can then be estimated by minimizing $f$. In modern applications, however, the number of available observations (of order $T$) is often much smaller than the number of variables to be estimated (of order $n^2$), due to, for example, equipment/budget limitations. Classical MLE methods do not apply well in this high-dimensional setting.

Fortunately, the networks of interest in reality often possess topology sparsity. For example, a stock price may not be directly influenced by all the other stocks in the stock market. A parsimonious network with only significant regulation links is much more interpretable. Statistically speaking, the sparsity in $A$ suggests the necessity of shrinkage estimation [15], which can be done by adding penalties and/or constraints to the loss function. The general penalized maximum likelihood problem is

$$\min_{A,u,l,d,c}\ f(A,u,l,d,c) + \sum_{i,j} P(\alpha_{ij};\lambda_{ji}), \tag{6}$$

where $P$ is a penalty promoting sparsity and the $\lambda_{ji}$ are regularization parameters. Among the very many possible choices of $P$, the $\ell_1$ penalty is perhaps the most popular to enforce sparsity:

$$P(t;\lambda) = \lambda|t|. \tag{7}$$

It provides a convex relaxation of the $\ell_0$ penalty

$$P(t;\lambda) = \frac{\lambda^2}{2}\,1_{t\neq 0}. \tag{8}$$

Taking both topology identification and dynamics prediction into consideration, we are particularly interested in the '$\ell_0$+$\ell_2$' penalty [16]

$$P(t;\lambda,\eta) = \frac{1}{2}\frac{\lambda^2}{1+\eta}\,1_{t\neq 0} + \frac{\eta}{2}t^2, \tag{9}$$

where the $\ell_2$ penalty, or Tikhonov regularization, can effectively deal with large noise and collinearity [17, 18] to enhance estimation accuracy.

The shrinkage estimation problem (6) is however nontrivial. The loss $f$ is nonconvex, $\pi$ is nonlinear, and the penalty $P$ may be nonconvex or even discrete, let alone the high-dimensionality challenge. Indeed, in many practical networks, the available observations are quite limited and noisy. Effective and efficient learning algorithms are in great need to meet the modern big data challenge.

III Sparse Sigmoidal Regression for Recurrent Network Learning

III-A Univariate-response sigmoidal regression

As analyzed previously, to solve (6), it is sufficient to study a univariate-response learning problem

$$\min_{\beta,l,\gamma}\ \frac{1}{2}\sum_{s=1}^{T} w_s\big(y_s - l\,\pi(\tilde{x}_s^T\beta) - z_s^T\gamma\big)^2 + \sum_{k=1}^{n} P(\beta_k;\lambda_k) =: F(\beta,l,\gamma), \tag{10}$$

where $w_s$, $y_s$, $\tilde{x}_s$, and $z_s$ are given, and $\beta$, $l$, $\gamma$ are unknown, with $\beta$ desired to be sparse. (10) is the recurrent network learning problem for node $i$ when we set $y_s = \Delta x_{i,s}/\Delta t_s$, $w_s = \Delta t_s$, $\tilde{x}_s = [x_s^T, 1]^T$, $\beta = [\alpha_i^T, u_i]^T$, and $z_s = [x_{i,s}, 1]^T$, $\gamma = [-d_i, c_i]^T$ (note that the intercept $u_i$ is usually subject to no penalization, corresponding to $\lambda = 0$). For notational ease, define $y = [y_1,\cdots,y_T]^T$, $\tilde{X} = [\tilde{x}_1,\cdots,\tilde{x}_T]^T$, $Z = [z_1,\cdots,z_T]^T$, and $W = \mathrm{diag}\{w_1,\cdots,w_T\}$ (to be used in this subsection only, unless otherwise specified).

We propose a simple-to-implement and efficient algorithm to solve the general optimization problem in the form of (10). First define two useful auxiliary functions

$$\xi(\theta,y) = \pi(\theta)\big(1-\pi(\theta)\big)\big(\pi(\theta)-y\big), \tag{11}$$
$$k_0(y,w) = \max_{1\le s\le T}\ \frac{w_s}{16}\Big(1 + \frac{(1-2y_s)^2}{2}\Big). \tag{12}$$

The vector versions of $\pi$ and $\xi$ are defined componentwise, $\pi(\theta) = [\pi(\theta_s)]$ and $\xi(\theta,y) = [\xi(\theta_s,y_s)]$, and the matrix versions are defined similarly. A prototype algorithm is described as follows, starting with an initial estimate $\beta^{(0)}$, a thresholding rule $\Theta(\cdot;\lambda)$ (an odd, shrinking, and nondecreasing function, cf. [19]), and $j = 0$.

repeat
0) $j \leftarrow j + 1$.
1) Compute $\mu^{(j)} = \pi(\tilde{X}\beta^{(j-1)})$.
2) Fit a weighted least-squares model
$$\min_{l,\gamma}\ \big\|W^{1/2}\big(y - [\mu^{(j)}\ Z]\,[l\ \gamma^T]^T\big)\big\|_2^2, \tag{13}$$
with the corresponding solution denoted by $(l^{(j)}, \gamma^{(j)})$.
3) Construct $\tilde{y}^{(j)} = (y - Z\gamma^{(j)})/l^{(j)}$ and $\tilde{W}^{(j)} = (l^{(j)})^2 W$. Let $K^{(j)}$ be any constant no less than $k_0(\tilde{y}^{(j)}, \tilde{w}^{(j)})\,\|\tilde{X}\|_2^2$, where $\|\cdot\|_2$ is the spectral norm.
4) Update $\beta$ via thresholding:
$$\beta^{(j)} = \Theta\Big(\beta^{(j-1)} - \frac{1}{K^{(j)}}\tilde{X}^T\tilde{W}^{(j)}\,\xi\big(\tilde{X}\beta^{(j-1)}, \tilde{y}^{(j)}\big);\ \lambda^{(j)}\Big), \tag{14}$$
where $\lambda^{(j)}$ is a scaled version of $\lambda$ satisfying $P(t;\lambda^{(j)}) = P(t;\lambda)/K^{(j)}$ for any $t$.
until convergence
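To make the iteration concrete, here is a minimal sketch of the prototype algorithm specialized to the $\ell_1$ penalty/soft-thresholding pair. The working-response construction and the crude step-size constant `K` are our own reading of the scheme, not the paper's exact quantities:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def xi(theta, y):
    """Eq. (11), applied componentwise."""
    p = sigmoid(theta)
    return p * (1.0 - p) * (p - y)

def soft(t, lam):
    """Soft-thresholding rule for the l1 penalty."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def prototype_fit(y, Xt, Z, w, lam, n_iter=500):
    """Alternate step (13), weighted LS in (l, gamma), with step (14), a
    thresholded gradient step in beta; for the l1 penalty, lambda^(j) = lam/K."""
    beta = np.zeros(Xt.shape[1])
    sw = np.sqrt(w)
    nX2 = np.linalg.norm(Xt, 2) ** 2          # squared spectral norm
    for _ in range(n_iter):
        mu = sigmoid(Xt @ beta)
        M = np.column_stack([mu, Z])
        coef, *_ = np.linalg.lstsq(sw[:, None] * M, sw * y, rcond=None)
        l, gamma = coef[0], coef[1:]
        if abs(l) < 1e-8:                     # guard against a degenerate LS step
            l = 1e-8
        ytil = (y - Z @ gamma) / l            # working response
        wtil = w * l ** 2                     # working weights
        K = wtil.max() * nX2 / 2.0 + 1.0      # crude Lipschitz-type bound (assumption)
        grad = Xt.T @ (wtil * xi(Xt @ beta, ytil))
        beta = soft(beta - grad / K, lam / K)
    return beta, l, gamma
```

The thresholded gradient step keeps the per-iteration cost at a few matrix-vector products, which is what makes the scheme attractive for large $n$.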

Before proceeding, we give some examples of $\Theta$ and $P$. The specific form of the thresholding function $\Theta$ in (14) is related to the penalty $P$ through the following formula [16]:

$$P(t;\lambda) - P(0;\lambda) = \int_0^{|t|}\big(\sup\{s: \Theta(s;\lambda)\le u\} - u\big)\,du + q(t;\lambda), \tag{15}$$

with $q(t;\lambda)$ nonnegative and $q(\Theta(s;\lambda);\lambda) = 0$ for all $s$. The regularization parameter(s) are rescaled at each iteration according to the form of $\Theta$. Examples include: (i) the $\ell_1$ penalty (7) and its associated soft-thresholding $\Theta(t;\lambda) = \mathrm{sgn}(t)(|t|-\lambda)_+$, in which case the rescaling implies $\lambda^{(j)} = \lambda/K^{(j)}$; (ii) the $\ell_0$ penalty (8) and the hard-thresholding $\Theta(t;\lambda) = t\,1_{|t|>\lambda}$, which determines $\lambda^{(j)} = \lambda/\sqrt{K^{(j)}}$; and (iii) the penalty (9) and its associated hard-ridge thresholding $\Theta(t;\lambda,\eta) = \frac{t}{1+\eta}1_{|t|>\lambda}$, where both $\lambda$ and $\eta$ are rescaled accordingly. The $\ell_p$ penalties, the elastic net penalty, and others [16] are also instances of this framework.
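The three thresholding rules just mentioned can be written down directly. In this sketch the hard-ridge form is our reading of the '$\ell_0$+$\ell_2$' penalty (9) and should be treated as an assumption:

```python
import numpy as np

def soft_threshold(t, lam):
    """Theta for the l1 penalty (7): shrink every entry toward zero by lam."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def hard_threshold(t, lam):
    """Theta for the l0 penalty (8): keep entries with |t| > lam unchanged."""
    return np.where(np.abs(t) > lam, t, 0.0)

def hard_ridge_threshold(t, lam, eta):
    """Theta for the l0+l2 penalty (9): hard-threshold, then ridge-shrink the
    survivors by 1/(1 + eta)."""
    return np.where(np.abs(t) > lam, t / (1.0 + eta), 0.0)
```

Note that soft-thresholding shrinks all survivors by a constant, whereas hard-thresholding leaves them untouched; the hard-ridge rule combines exact selection with a multiplicative ridge shrinkage.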

Theorem 1.

Given the objective function $F$ in (10), suppose $\Theta$ and $P$ satisfy (15) for some nonnegative $q$ with $q(\Theta(s;\lambda);\lambda) = 0$ for all $s$ and $\lambda$. Then, given any initial point $\beta^{(0)}$, with probability 1 the sequence of iterates $(\beta^{(j)}, l^{(j)}, \gamma^{(j)})$ generated by the prototype algorithm satisfies

$$F(\beta^{(j-1)}, l^{(j-1)}, \gamma^{(j-1)}) \ge F(\beta^{(j)}, l^{(j)}, \gamma^{(j)}). \tag{16}$$

See Appendix A for the proof.

Normalization is usually necessary to put all predictors on a comparable scale before penalizing their coefficients with a single regularization parameter $\lambda$. We can center and scale all predictor columns but the intercept before calling the algorithm. Alternatively, it is sometimes more convenient to specify the regularization parameters componentwise—e.g., in the $\ell_1$ penalty, setting $\lambda_k$ in proportion to the scale of the $k$th predictor for non-intercept coefficients.

III-B Cardinality-constrained sigmoidal network learning

In the recurrent network setting, one can directly apply the prototype algorithm in Section III-A to solve (6), by updating the coefficients of one node at a time. On the other hand, a multivariate update form is usually more efficient and convenient in implementation. Moreover, it facilitates integrating stability into network learning (cf. Section IV).

To formulate the loss in a multivariate form, we introduce

$$Y = \big[\Delta x_{i,s}/\Delta t_s\big] \in \mathbb{R}^{T\times n},\quad X = [x_1,\cdots,x_T]^T \in \mathbb{R}^{T\times n},\quad B = A^T = [\alpha_1,\cdots,\alpha_n] \in \mathbb{R}^{n\times n},\quad W = \mathrm{diag}\{w\} = \mathrm{diag}\{\Delta t_1,\cdots,\Delta t_T\}.$$

Then $f$ in (5) can be rewritten as

$$f(B,u,l,d,c) = \frac{1}{2}\big\|W^{1/2}\big\{Y - \big[\pi(XB + \mathbf{1}u^T)L - XD + \mathbf{1}c^T\big]\big\}\big\|_F^2, \tag{17}$$

with $\|\cdot\|_F$ denoting the Frobenius norm; the objective function to minimize then becomes the penalized criterion (6) with $f$ given by (17).

One of the main issues is to choose a proper penalty form for sparse network learning. Popular sparsity-promoting penalties include the $\ell_1$, SCAD, and $\ell_0$ penalties, among others. The $\ell_0$ penalty (8) is ideal in pursuing a parsimonious solution. However, the matter of parameter tuning cannot be ignored. Most tuning strategies (such as $K$-fold cross-validation) require computing a solution path for a grid of values of $\lambda$, which is quite time-consuming in large network estimation. Rather than applying the $\ell_0$ penalty, we propose an $\ell_0$-constrained sparse network learning with

$$\|B\|_0 \le m, \tag{18}$$

where $\|\cdot\|_0$ denotes the number of nonzero components in a matrix. In contrast to the penalty parameter $\lambda$ in (8), $m$ is more meaningful and customizable. One can directly specify its value based on prior knowledge or availability of computational resources to have control of the network connection cardinality. To account for collinearity and large noise contamination, we add a further $\ell_2$ penalty on $B$, resulting in a new '$\ell_0$+$\ell_2$' regularized criterion

$$\min_{B,L,D,u,c}\ \frac{1}{2}\big\|W^{1/2}\big\{Y - \big[\pi(XB + \mathbf{1}u^T)L - XD + \mathbf{1}c^T\big]\big\}\big\|_F^2 + \frac{\eta}{2}\|B\|_F^2 =: F_0,\qquad \text{s.t. } \|B\|_0 \le m. \tag{19}$$

Not only is the cardinality bound $m$ convenient to set under the sparsity assumption, but the shrinkage parameter $\eta$ is also easy to tune. Indeed, $\eta$ is usually not a very sensitive parameter and does not require a large grid of candidate values; many researchers simply fix it at a small value, which can effectively reduce the prediction error (e.g., [18]). Similarly, to handle possible collinearity, we recommend adding mild ridge penalties on the remaining coefficients in (19).

The constrained optimization in (19) does not apparently fall into the penalized framework proposed in Section III-A. However, we can adapt the technique to handle it through a quantile thresholding operator (a variant of the hard-ridge thresholding associated with (9)). The detailed recurrent network screening (RNS) algorithm is described as follows, where, for notational simplicity, we denote by $M_{\mathcal{I},\mathcal{J}}$ the submatrix of a matrix $M$ consisting of the rows and columns indexed by $\mathcal{I}$ and $\mathcal{J}$, respectively.
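A quantile thresholding operator of the kind RNS relies on keeps only the $m$ largest-magnitude entries of $B$, giving direct control of the connection cardinality. The following sketch is our own implementation (with an optional ridge factor mirroring the hard-ridge rule), not the paper's exact operator:

```python
import numpy as np

def quantile_threshold(B, m, eta=0.0):
    """Keep the m largest-magnitude entries of B, zero out the rest, and
    shrink the survivors by 1/(1 + eta); ties are broken arbitrarily."""
    out = np.zeros_like(B, dtype=float)
    if m <= 0:
        return out
    flat = np.abs(B).ravel()
    m = min(m, flat.size)
    # indices of the m largest |B| entries via a partial sort
    keep = np.argpartition(flat, flat.size - m)[flat.size - m:]
    out.ravel()[keep] = B.ravel()[keep] / (1.0 + eta)
    return out
```

Unlike a fixed threshold level $\lambda$, the operator guarantees $\|B\|_0 \le m$ after every application, which is what makes the cardinality constraint in (19) directly enforceable inside an iterative scheme.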