Learning Topology and Dynamics of Large Recurrent Neural Networks

10/05/2014 ∙ by Yiyuan She, et al. ∙ 0

Large-scale recurrent networks have drawn increasing attention recently because of their capabilities in modeling a large variety of real-world phenomena and physical mechanisms. This paper studies how to identify all authentic connections and estimate system parameters of a recurrent network, given a sequence of node observations. This task becomes extremely challenging in modern network applications, because the available observations are usually very noisy and limited, and the associated dynamical system is strongly nonlinear. By formulating the problem as multivariate sparse sigmoidal regression, we develop simple-to-implement network learning algorithms, with rigorous convergence guarantee in theory, for a variety of sparsity-promoting penalty forms. A quantile variant of progressive recurrent network screening is proposed for efficient computation and allows for direct cardinality control of network topology in estimation. Moreover, we investigate recurrent network stability conditions in Lyapunov's sense, and integrate such stability constraints into sparse network learning. Experiments show excellent performance of the proposed algorithms in network topology identification and forecasting.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

There has been an increasing interest in identifying network dynamics and topologies in the emerging scientific discipline of network science. In a dynamical network, the evolution of a node is controlled not only by itself, but also by other nodes. For example, in gene regulatory networks [1], the expression levels of genes influence each other, following some dynamic rules, such that the genes are connected together to form a dynamical system. If the topology and evolution rules of the network are known, we can analyze the regulation between genes or detect unusual behaviors to help diagnose and cure genetic diseases. Similarly, the modeling and estimation of dynamical networks are of great importance in various domains including stock market, brain network and social network [2, 3, 4]. To accurately identify the topology and dynamics underlying those networks, scientists are devoted to developing appropriate mathematical models and corresponding estimation methods.

In the literature, linear dynamical models are commonly used. For example, the human brain connectivity network [5] can be characterized by a set of linear differential equations, where the rate of change of activation/observation of any node is a weighted sum of the activations/observations of its neighbors: Here provide the connection weights and is the decay rate. Nevertheless, a lot of complex dynamical networks clearly demonstrate nonlinear relationships between the nodes. For instance, the strength of influence is unbounded in the previous simple linear combination, but the so-called “saturation

” effect widely exists in physical/biological systems (neurons, genes, and stocks)—the external influence on a node, no matter how strong the total input activation is, cannot go beyond a certain threshold. To capture the mechanism, nonlinearity must be introduced into the network system:

, where

denotes a nonlinear activation function typically taken to be the sigmoidal function

. It has a proper shape to resemble many real-world mechanisms and behaviors.

The model description is associated with a continuous-time recurrent neural network

. The existing feedback loops allow the network to exhibit interesting dynamic temporal behaviors to capture many kinds of relationships. It is also biologically realistic in say modeling the effect of an input spike train. Recurrent networks have been successfully applied to a wide range of problems in bioinformatics, financial market forecast, electric circuits, computer vision, and robotics; see, e.g.,

[6, 7, 8, 9, 10] among many others.

In practical applications, it is often necessary to include noise contamination: where stands for an -dimensional Brownian motion. Among the very many unknown parameters, might be the most important: the zero-nonzero pattern of indicates if there exists a (direct) connection from node to node . Collecting all such connections results in a directed graph to describe the node interaction structure.

A fundamental question naturally arises: Given a sequence of node observations (possibly at very few time points), can one identify all existing connections and estimate all system parameters of a recurrent network?

This task becomes extremely challenging in modern big network applications, because the available observations are usually very noisy and only available at a relatively small number of time points (say ), due to budget or equipment limitations. One frequently faces applications with much larger than . In addition, in this continuous time setting, no analytical formula of the likelihood exists for the stochastic model, which increases the estimation difficulty even in large samples [11]. Instead of considering multi-step ad-hoc procedures, this paper aims at learning the network system as a whole. Multivariate statistical techniques will be developed for identifying complete topology and recovering all dynamical parameters. To the best of our knowledge, automatic topology and dynamics learning in large-scale recurrent networks has not been studied before.

In this work, we are interested in networks that are sparse in topology. First, many real-world complex dynamical networks indeed have sparse or approximately sparse structures. For example, in regulatory networks, a gene is only regulated by a handful of others  [12]. Second, when the number of nodes is large or very large compared with the number of observations, the sparsity assumption reduces the number of model parameters so that the system is estimable. Third, from a philosophical point of view, a sparse network modeling is consistent with the principle of Occam’s razor.

Not surprisingly, there is a surge of interest of using compressive sensing techniques for parsimonious network topology learning and dynamics prediction. However, relying on sparsity alone seems to have only limited power in addressing the difficulties of large-scale network learning from big data. To add more prior knowledge and to further reduce the number of effective unknowns, we propose to study how to incorporate structural properties of the network system into learning and estimation, in addition to sparsity. In fact, real-life networks of interest usually demonstrate asymptotic stability. This is one of the main reasons why practitioners only perform limited number of measurements of the system, which again provides a type of parsimony or shrinkage in network learning.

In this paper we develop sparse sigmoidal network learning algorithms, with rigorous convergence guarantee in theory, for a variety of sparsity-promoting penalty forms. A quantile variant, the progressive recurrent network screening, is proposed for efficient computation and allows for direct cardinality control of network topology in estimation. Moreover, we investigate recurrent network stability conditions in Lyapunov’s sense, and incorporate such stability constraints into sparse network learning. The remaining of this paper is organized as follows. Section II introduces the sigmoidal recurrent network model, and formulates a multivariate regularized problem based on the discrete-time approximate likelihood. Section III proposes a class of sparse network learning algorithms based on the study of sparse sigmoidal regressions. A novel and efficient recurrent network screening (RNS) with theoretical guarantee of convergence is advocated for topology identification in ultra-high dimensions. Section IV investigates asymptotic stability conditions in recurrent systems, resulting in a stable-sparse sigmoidal (S) network learning. In Section V, synthetic data experiments and real applications are given. All proof details are left to the Appendices.

Ii Model Formulation

To describe the evolving state of a continuous-time recurrent neural network, one usually defines an associated dynamical system. Ideally, without any randomness, the network behavior can be specified by a set of ordinary differential equations:

(1)

where , short for , denotes the dynamic process of node . Throughout the paper, is the sigmoidal activation function

(2)

which is most frequently used in recurrent networks. This function is smooth and strictly increasing. It has a proper shape to resemble many real-world mechanisms and behaviors. Due to noise contamination, a stochastic differential equation model is more realistic

(3)

where is a standard Brownian motion and reflects the stochastic nature of the system. Typically , and in some applications (no self-regulation) is required.

In the sigmoidal recurrent network model, the coefficients characterize the between-node interactions. In particular, indicates that node does not directly influence node ; otherwise, node regulates node , and is referred to as a regulator of node in gene regulatory networks. Such a regulation relationship can be either excitatory (if is positive) or inhibitory (if is negative). In this way, is associated with a directed graph that captures the Granger causal relationships between all nodes [13], and the topology of the recurrent network is revealed by the zero-nonzero pattern of the matrix. Therefore, to identify all significant regulation links, it is of great interest to estimate , given a sequence of node observations (snapshots of the network).

Denote the system state at time by or (or simply when there is no ambiguity.) Define , , , , , and . Then (3) can be represented in a multivariate form

(4)

where is an -dimensional standard Brownian motion. While this model is specified in continuous time, in practice, the observations are always collected at discrete time points. Estimating the parameters of an SDE model from few discrete observations is very challenging. There rarely exists an analytical expression of the exact likelihood. A common treatment is to discretize  (4) and use an approximate likelihood instead. We use the Euler discretization scheme (see, e.g., [14]):

Suppose the system (4) is observed at time points . Let be the observed values of all nodes at . Define and , . Because , the negative conditional log-likelihood for the discretized model is given by

is a function that depends on only. The fitting criterion is separable in . To see this, let and . Then, it is easy to verify that

(5)

Conventionally, the unknown parameters can then be estimated by minimizing . In modern applications, however, the number of available observations () is often much smaller than the number of variables to be estimated (), due to, for example, equipment/budget limitations. Classical MLE methods do not apply well in this high-dimensional setting.

Fortunately, the networks of interest in reality often possess topology sparsity. For example, a stock price may not be directly influenced by all the other stocks in the stock market. A parsimonious network with only significant regulation links is much more interpretable. Statistically speaking, the sparsity in suggests the necessity of shrinkage estimation [15]

which can be done by adding penalties and/or constraints to the loss function. The general penalized maximum likelihood problem is

(6)

where is a penalty promoting sparsity and are regularization parameters. Among the very many possible choices of , the penalty is perhaps the most popular to enforce sparsity:

(7)

It provides a convex relaxation of the penalty

(8)

Taking both topology identification and dynamics prediction into consideration, we are particularly interested in the penalty [16]

(9)

where the penalty or Tikhonov regularization can effectively deal with large noise and collinearity [17, 18] to enhance estimation accuracy.

The shrinkage estimation problem (6) is however nontrivial. The loss is nonconvex, is nonlinear, and the penalty may be nonconvex or even discrete, let alone the high-dimensionality challenge. Indeed, in many practical networks, the available observations are usually quite limited and noisy. Effective and efficient learning algorithms are in great need to meet the modern big data challenge.

Iii Sparse Sigmoidal Regression for Recurrent Network Learning

Iii-a Univariate-response sigmoidal regression

As analyzed previously, to solve (6), it is sufficient to study a univariate-response learning problem

(10)

where are given, and are unknown with desired to be sparse. (10) is the recurrent network learning problem for node when we set , , , and () (note that the intercept is usually subject to no penalization corresponding to ). For notational ease, define , , , , and (to be used in this subsection only, unless otherwise specified).

We propose a simple-to-implement and efficient algorithm to solve the general optimization problem in the form of (10). First define two useful auxiliary functions

(11)
(12)

The vector versions of

and are defined componentwise: and , and the matrix versions are defined similarly. A prototype algorithm is described as follows, starting with an initial estimate , a thresholding rule

(an odd, shrinking and nondecreasing function, cf.

[19]), and .

  repeat
     0)
     1)
     2) Fit a weighted least-squares model
(13)
with the corresponding solution denoted by .
     3) Construct , , . Let be any constant no less than , where is the spectral norm.
     4) Update via thresholding:
(14)
where is a scaled version of satisfying for any , .
  until convergence

Before proceeding, we give some examples of and . The specific form of the thresholding function in (14) is related to the penalty through the following formula [16]:

(15)

with nonnegative and for all . The regularization parameter(s) are rescaled at each iteration according to the form of . Examples include: (i) the penalty (7), and its associated soft-thresholding in which case implies , (ii) the penalty (8) and the hard-thresholding which determines , and (iii) the penalty (9) and its associated hard-ridge thresholding where . The () penalties, the elastic net penalty, and others [16] are also instances of this framework.

Theorem 1.

Given the objective function in (10), suppose and satisfy (15) for some nonnegative with for all and . Then given any initial point

, with probability 1 the sequence of iterates

generated by the prototype algorithm satisfies

(16)

See Appendix A for the proof.

Normalization is usually necessary to make all predictors equally span in the space before penalizing their coefficients using a single regularization parameter . We can center and scale all predictor columns but the intercept before calling the algorithm. Alternatively, it is sometimes more convenient to specify componentwise—e.g., in the penalty set for non-intercept coefficients.

Iii-B Cardinality constrained sigmoidal network learning

In the recurrent network setting, one can directly apply the prototype algorithm in Section III-A to solve (6), by updating the columns in one at a time. On the other hand, a multivariate update form is usually more efficient and convenient in implementation. Moreover, it facilitates integrating stability into network learning (cf. Section IV).

To formulate the loss in a multivariate form, we introduce

Then in (5) can be rewritten as

(17)

with denoting the Frobenius norm, and the objective function to minimize becomes

One of the main issues is to choose a proper penalty form for sparse network learning. Popular sparsity-promoting penalties include , SCAD, , among others. The penalty (8) is ideal in pursuing a parsimonious solution. However, the matter of parameter tuning cannot be ignored. Most tuning strategies (such as -fold cross-validation) require computing a solution path for a grid of values of , which is quite time-consuming in large network estimation. Rather than applying the penalty, we propose an constrained sparse network learning

(18)

where denotes the number of nonzero components in a matrix. In contrast to the penalty parameter in (8), is more meaningful and customizable. One can directly specify its value based on prior knowledge or availability of computational resources to have control of the network connection cardinality. To account for collinearity and large noise contamination, we add a further penalty in , resulting in a new ‘’ regularized criterion

(19)

Not only is the cardinality bound convenient to set, because of the sparsity assumption, but the shrinkage parameter is easy to tune. Indeed, is usually not a very sensitive parameter and does not require a large grid of candidate values. Many researchers simply fix it at a small value (say, ) which can effectively reduce the prediction error (e.g., [18]). Similarly, to handle the possible collinearity between and , we recommend adding mild ridge penalties in (19) (say, and ).

The constrained optimization of (19) does not apparently fall into the penalized framework proposed in Section III-A. However, we can adapt the technique to handle it through a quantile thresholding operator (as a variant of the hard-ridge thresholding (9)). The detailed recurrent network screening (RNS) algorithm is described as follows, where for notational simplicity , , and we denote by the submatrix of consisting of the rows and columns indexed by and , respectively.

  given  (initial estimate), (cardinality bound), ( shrinkage parameter).
  
  repeat
     0)
     1)
     2) Update , and by fitting a weighted vector least-squares model:
(20)
with the solution denoted by , , and . (This amounts to solving separate weighted least squares problems.)
     3) Construct , , where denotes the Moore-Penrose pseudoinverse and is the Hadamard product. Let be any constant no less than , and .
     4) Update and : 4.1) , , (the submatrix of without the first row), .4.2) Perform the hard-ridge thresholding (cf. Section III-A for the definition of ) with an adaptive threshold matrix . The entries of are all set to the medium of the th and the th largest components of . See (21) for other variants when certain links must be maintained or forbidden. 4.3)
  until the decrease in function value is small
  deliver , , , , .
Algorithm 1  Recurrent Network Screening (RNS)
Theorem 2.

Given any initial point , Algorithm 1 converges in the sense that with probability 1, the function value decreasing property holds:

and all satisfy , .

See Appendix B for the proof.

Step 2 can be implemented by solving weighted least squares. Or, one can re-formulate it as a single-response problem to obtain the solution in one step. The latter way is usually more efficient. When there are ridge penalties imposed on and , the computation is similar.

In certain applications self-regulations are not allowed. Then should be the medium of the th and the th largest elements of ( in absolute value after excluding its diagonal entries).111Throughout the paper is the absolute value of the elements of . That is, for , . Other prohibited links can be similarly treated in determining the dynamic threshold. In general, given the index set of the links that must be maintained and the index set of the links that are forbidden, the threshold is constructed as follows

(21)

where is the th largest element (in absolution value) in (all entries of except those indexed by ). It is easy to show that the convergence result still holds based on the argument in Appendix B.

In implementation, we advocate reducing the network cardinality in a progressive manner to lessen greediness: at the th step, is replaced by , where is a monotone sequence of integers decreasing from (assuming no self-regulation) to . Empirical study shows that the sigmoidal decay cooling schedule with , works well.

RNS involves no costly operations at each iteration, and is simple to implement. It runs efficiently for large networks. The RNS estimate can be directly used for analyzing the network topology. One can also use it for screening (in which case a relatively large value of is specified), followed by a fine network learning algorithm restricted on the screened connections. In either case RNS substantially reduces the search space of candidate links.

Iv Stable-Sparse Sigmoidal Network Learning

For a general dynamical system possibly nonlinear, stability is one of the most fundamental issues [20, 21]. If a system’s equilibrium point is asymptotically stable, then the perturbed system must approach the equilibrium point as increases. Moreover, one of the main reasons many real network applications only have limited number of observations measured after perturbation is that the associated dynamical systems stabilize fast (e.g., exponentially fast). This offers another important type of parsimony/shrinkage in network parameter estimation.

To design a new type of regularization, we first investigate stability conditions of sigmoidal recurrent networks in Lyapunov’s sense [22, 23]. Then, we develop a stable sparse sigmoidal (S) network learning approach.

Iv-a Conditions for asymptotic stability

Recall the multivariate representation of (1)

(22)

Because is sparse and typically singular and the degradation rates are not necessarily identical, in general (22) can not be treated as an instance of the Cohen-Grossberg neural networks [24]. We must first derive its own stability conditions to be considered in network estimation.

Hereinafter, and stand for the positive semi-definiteness and positive definiteness of

, respectively, and the set of eigenvalues of

is . Our conditions are stated below:

(A1)
(A2)
(A3a)
(A3b)
Theorem 3.

Suppose (A1) & (A2) hold. Then (A3a) guarantees the network defined by (22) has a unique equilibrium point and is globally exponentially stable in the sense that for any solution . The same conclusion holds if (A3a) is replaced by (A3b).

See Appendix C for the detailed proof.

Figure 1 shows an example of stochastic processes generated from a stable recurrent network and an unstable recurrent network respectively. In the upper panel, the recurrent network system parameters satisfy the stability condition (A3a), while those in the lower panel violate (A3a). (In both situations, the number of nodes is 10 and the diffusion parameter is fixed at .) The differences between the two models are obvious.

In reality, asymptotically stable systems are commonly observed. The stability conditions reflect structural characteristics. For example, when all are equal and , then the skew-symmetry of , i.e., , guarantees asymptotic stability. The information provided by the constrains can assist topology learning. This motivates the design of sparse recurrent network learning with stability.

Figure 1: Stochastic processes generated from a stable recurrent network model and an unstable recurrent network model.

Iv-B S network estimation

(A3a) is less restrictive than (A3b). In optimization, imposing (A3a) seems however difficult. We propose

(23)

referred to as the Stable-Sparse Sigmoidal (S) network learning. The stability constraints are now imposed on as the transpose of the raw coefficient matrix . Similar to the discussion of (19), in implementation, we add mild penalties on and to deal with possible collinearity, and it is common to replace by with extremely small.

In this part, we focus on the penalty where matrix usually has the form for any (other options are possible.) Introducing the componentwise regularization matrix is helpful when one wants to maintain or forbid certain links based on prior knowledge or preliminary screening. For example, if no self-regulation is allowed, then all diagonal entries of ought to be .

It turns out that one can modify Step 2 and Step 4 of the RNS algorithm to solve (23).

First, the non-negativity of and can be directly incorporated, if the weighted least-squares problem in Step 2 is replaced by the following programming problem with generalized non-negativity constraints:

Step 2)
s.t. (24)

A variety of algorithms and packages can be used. (The technique in Appendix D also applies.)

Integrating the spectral constraint on into network learning is much trickier. We modify Step 4 of Algorithm 1 as follows.

  1. Update and :

    1. , , , and .

    2. Perform the inner loop iterations, starting with , , , , , , and the operators defined in Lemmas 4-6:
      repeat

      1. , , .

      2. ,
        , .

      3. , , .

      until convergence

Algorithm 1 with such modifications in Step 2 and Step 4 is referred to as the S estimation algorithm.

Theorem 4.

Given any initial point , the S algorithm converges in the sense that the function value decreasing property holds, and furthermore, , , and satisfy for any .

The proof is given in Appendix D.

We observe that practically, the inner loop converges fast (usually within 100 steps). Moreover, to guarantee the functional value is decreasing, one does not have to run the inner loop till convergence. Although it is possible to apply the stable-sparse estimation directly, we recommend running the screening algorithm (RNS) first, followed by the fine S network learning.

V Experiments

V-a Simulation Studies

In this subsection, we conduct synthetic data experiments to demonstrate the performance of the proposed learning framework in recurrent network analysis. An Erdős-Rényi-like scheme of generating system parameters, including a sparse regulation matrix , is described as follows. Given any node

, the number of its regulators is drawn from a binomial distribution

. The regulator set is chosen randomly from the rest nodes (excluding node itself). If , . Otherwise, follows a mixture of two Gaussians ad with probability for each. Then draw random , ,

from Gaussian distributions (independently)

. Finally is generated so that the system satisfies the stability condition (A3a).

Topology identification. First, we test the performance of RNS in recurrent network topology identification. We compare it with TSNI [25] and QTIS [26]. TSNI is a popular network learning approach and applies principle component analysis for dimensionality reduction. QTIS is a network screening algorithm based on sparsity-inducing techniques. To avoid the ad-hoc issue of parameter tuning, ROC curves will be used to summarize link detection results in a comprehensive way, in terms of true positive rate (TPR) and false positive rate (FPR).

We simulate two networks according to the scheme introduced early. In the first example we set , in the second example , and in the third . Given all system parameters, we can call Matlab functions sde and sde.simulate to generate continuous-time stochastic processes according to (4). The discrete-time observations are sampled from the stochastic processes with sampling period .

In this experiment, the number of unknowns in either case is larger than the sample size, especially in Ex.2 which has about 41K variables but only 500 observations. Given any algorithm, we vary the target cardinality parameter from 1 to , collect all estimates, and compute their associated TPRs and FPRs. The experiment is repeated for times and the averaged rates are shown in the ROC curves in Figure 2. RNS beats TSNI and QTIS by a large margin. In fact, the ROC curve of RNS is everywhere above the TSNI curve and the QTIS curve.

(a) Example 1.
(b) Example 2.
(c) Example 3.
Figure 2: Comparison of ROC curves.

System stability. Next, we show the necessity of stable learning in network dynamics forecasting. For simplicity, the penalization is used. We compare the S network estimation with the approach based on sparse sigmoidal regression in Section III-A (with no stability guarantee), denoted by SigSpar.

We use two network examples (Ex.4 and Ex.5) with respectively. In each setting, we generate samples for training, and validation samples for parameter tuning. In this experiment, forecast error at a future time point is the major concern. Suppose , the network snapshot at , is given. With system parameter estimates obtained, one can simulate a stochastic process () starting with based on model (4). The forecast error at time point is defined as . Long-term forecasting corresponds to large values of . We repeat the experiment for 50 times and show the average FE in Table I. The error of SigSpar becomes extremely large as increases, because there is no stability guarantee of its network estimate, while S has excellent performance even in long-term forecasting.

1 5 10 15 20
Ex.4 SigSpar 0.08 3.01 41.95 450.54 4.410
S 0.04 0.48 1.01 1.39 1.76
Ex.5 SigSpar 0.35 23.76 41.95 7.310 1.010
S 0.06 0.70 1.75 2.80 4.04
Table I: Forecasting error comparison on the simulation data.

V-B Real data

Yeast gene regulatory network. We use RNS to study the transcriptional regulatory network in the yeast cell cycle. The dataset is publicly available and a detailed description of the microarray experiment is in [27]. Following [28], we focus on the 20 genes whose expression patterns fluctuate in a periodic manner. The dataset contains their expression levels recorded at 18 time points during a cell cycle. In this regulatory (sub)network, five genes have been identified as transcription factors, namely SWI4, HCM1, NDD1, SWI5, ASH1, and 19 regulatory connections from them to target genes have been reported with direct evidence in the literature (cf. the YEASTRACT database at http://yeastract.com/). [28] found 55 connections from the transcription factors to the target genes, of which 14 have biological evidence (and so the true positive rate is ). For a fair comparison, we also let RNS detect 55 connections from the transcription factors to the target genes, and achieved a higher true positive rate . A detailed comparison of the identified regulatory connections is shown in Table II.

SWI4 HCM1 NDD1 SWI5 ASH1
SWI4
HCM1
NDD1
SWI5
ASH1
CLN2
SVS1
SWE1
MNN1
CLB6
HTA1
HTB1
HHT1
CLB4
CLB2
CDC20
SPO12
SIC1
CLN3
CDC46
Table II: Regulatory connections of the Yeast cell cycle subnetwork. ‘’ stands for the regulatory connections identified by RNS, ‘’ stands for those by [28], and ‘’ stands for the confirmed regulatory connections with published evidence in the literature.

fMRI data. The resting state fMRI data provided by the ADHD-200 consortium [29] have been preprocessed by the NIAK interface [30]. The dataset we are using has 954 ROIs (regions of interest) and 257 time points. In the experiment, the first 200 observations are for training (and so and ), and the following 57 observations are for testing. We applied RNS to get a network pattern, followed by the S network estimation for stability correction. Table III shows the results averaged over 10 randomly chosen subjects. Our learning algorithm can achieve much lower error than TSNI, and its performance is pretty robust to the choice of .

0.5 0.6 0.7 0.8 0.9
TSNI 0.981 1.021 1.044 1.056 1.060
S 0.194 0.194 0.194 0.194 0.194
Table III: Forecast error on the fMRI data.

Appendix A Proof of Theorem 1

For notation simplicity, we introduce . Then , from which it follows that