There has been increasing interest in identifying network dynamics and topologies in the emerging scientific discipline of network science. In a dynamical network, the evolution of a node is controlled not only by itself, but also by other nodes. For example, in gene regulatory networks, the expression levels of genes influence each other, following certain dynamic rules, so that the genes together form a dynamical system. If the topology and evolution rules of the network are known, we can analyze the regulation between genes or detect unusual behaviors to help diagnose and treat genetic diseases. Similarly, the modeling and estimation of dynamical networks are of great importance in various domains including stock markets, brain networks and social networks [2, 3, 4]. To accurately identify the topology and dynamics underlying these networks, scientists have been devoted to developing appropriate mathematical models and corresponding estimation methods.
In the literature, linear dynamical models are commonly used. For example, the human brain connectivity network can be characterized by a set of linear differential equations, where the rate of change of the activation/observation of any node is a weighted sum of the activations/observations of its neighbors:
$$\frac{\mathrm{d}x_i(t)}{\mathrm{d}t} = -a_i x_i(t) + \sum_{j} w_{ij} x_j(t).$$
Here the $w_{ij}$ provide the connection weights and $a_i$ is the decay rate. Nevertheless, many complex dynamical networks clearly demonstrate nonlinear relationships between the nodes. For instance, the strength of influence is unbounded in the simple linear combination above, but the so-called "saturation" effect widely exists in physical/biological systems (neurons, genes, and stocks): the external influence on a node, no matter how strong the total input activation is, cannot go beyond a certain threshold. To capture this mechanism, nonlinearity must be introduced into the network system:
$$\frac{\mathrm{d}x_i(t)}{\mathrm{d}t} = -a_i x_i(t) + \sum_{j} w_{ij}\,\pi(x_j(t)) + b_i, \quad \text{where } \pi(u) = \frac{1}{1 + e^{-u}}.$$
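As a quick numerical illustration of the saturation effect (a sketch in Python, not part of the original text), the sigmoid maps arbitrarily large inputs into the bounded interval (0, 1):

```python
import numpy as np

def sigmoid(u):
    """Sigmoidal activation pi(u) = 1 / (1 + exp(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

# However strong the total input activation grows, the transformed
# influence saturates: it never exceeds 1 (nor drops below 0).
for u in [0.0, 2.0, 10.0, 100.0]:
    print(f"pi({u}) = {sigmoid(u):.6f}")
```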
The model description is associated with a continuous-time recurrent neural network. The feedback loops present in such a network allow it to exhibit rich dynamic temporal behaviors and capture many kinds of relationships. It is also biologically realistic in, say, modeling the effect of an input spike train. Recurrent networks have been successfully applied to a wide range of problems in bioinformatics, financial market forecasting, electric circuits, computer vision, and robotics; see, e.g., [6, 7, 8, 9, 10] among many others.
In practical applications, it is often necessary to include noise contamination:
$$\mathrm{d}x(t) = (-A x(t) + W \pi(x(t)) + b)\,\mathrm{d}t + \sigma\,\mathrm{d}B(t),$$
where $B(t)$ stands for a $p$-dimensional Brownian motion. Among the very many unknown parameters, the weight matrix $W = [w_{ij}]$ might be the most important: the zero-nonzero pattern of $w_{ij}$ indicates if there exists a (direct) connection from node $j$ to node $i$. Collecting all such connections results in a directed graph describing the node interaction structure.
A fundamental question naturally arises: Given a sequence of node observations (possibly at very few time points), can one identify all existing connections and estimate all system parameters of a recurrent network?
This task becomes extremely challenging in modern big network applications, because the available observations are usually very noisy and only available at a relatively small number of time points (say $n$), due to budget or equipment limitations. One frequently faces applications with the network dimension $p$ much larger than $n$. In addition, in this continuous-time setting, no analytical formula of the likelihood exists for the stochastic model, which increases the estimation difficulty even in large samples. Instead of considering multi-step ad hoc procedures, this paper aims at learning the network system as a whole. Multivariate statistical techniques will be developed for identifying the complete topology and recovering all dynamical parameters. To the best of our knowledge, automatic topology and dynamics learning in large-scale recurrent networks has not been studied before.
In this work, we are interested in networks that are sparse in topology. First, many real-world complex dynamical networks indeed have sparse or approximately sparse structures. For example, in regulatory networks, a gene is only regulated by a handful of others. Second, when the number of nodes is large or very large compared with the number of observations, the sparsity assumption reduces the number of model parameters so that the system is estimable. Third, from a philosophical point of view, sparse network modeling is consistent with the principle of Occam's razor.
Not surprisingly, there has been a surge of interest in using compressive sensing techniques for parsimonious network topology learning and dynamics prediction. However, relying on sparsity alone seems to have only limited power in addressing the difficulties of large-scale network learning from big data. To add more prior knowledge and to further reduce the number of effective unknowns, we propose to study how to incorporate structural properties of the network system into learning and estimation, in addition to sparsity. In fact, real-life networks of interest usually demonstrate asymptotic stability. This is one of the main reasons why practitioners only perform a limited number of measurements of the system, and it provides another type of parsimony or shrinkage in network learning.
In this paper we develop sparse sigmoidal network learning algorithms, with rigorous convergence guarantees in theory, for a variety of sparsity-promoting penalty forms. A quantile variant, the progressive recurrent network screening, is proposed for efficient computation and allows for direct cardinality control of the network topology in estimation. Moreover, we investigate recurrent network stability conditions in Lyapunov's sense, and incorporate such stability constraints into sparse network learning. The remainder of this paper is organized as follows. Section II introduces the sigmoidal recurrent network model and formulates a multivariate regularized problem based on the discrete-time approximate likelihood. Section III proposes a class of sparse network learning algorithms based on the study of sparse sigmoidal regressions. A novel and efficient recurrent network screening (RNS) with theoretical guarantee of convergence is advocated for topology identification in ultra-high dimensions. Section IV investigates asymptotic stability conditions in recurrent systems, resulting in stable-sparse sigmoidal (S$^3$) network learning. Section V presents synthetic-data experiments and real applications. All proof details are left to the Appendices.
II. Model Formulation
To describe the evolving state of a continuous-time recurrent neural network, one usually defines an associated dynamical system. Ideally, without any randomness, the network behavior can be specified by a set of ordinary differential equations:
$$\frac{\mathrm{d}x_i(t)}{\mathrm{d}t} = -a_i x_i(t) + \sum_{j=1}^{p} w_{ij}\,\pi(x_j(t)) + b_i, \quad 1 \le i \le p,$$
where $x_i$, short for $x_i(t)$, denotes the dynamic process of node $i$. Throughout the paper, $\pi$ is the sigmoidal activation function
$$\pi(u) = \frac{1}{1 + e^{-u}},$$
which is most frequently used in recurrent networks. This function is smooth and strictly increasing. It has a proper shape to resemble many real-world mechanisms and behaviors. Due to noise contamination, a stochastic differential equation model is more realistic:
$$\mathrm{d}x_i(t) = \Big(-a_i x_i(t) + \sum_{j=1}^{p} w_{ij}\,\pi(x_j(t)) + b_i\Big)\,\mathrm{d}t + \sigma\,\mathrm{d}B_i(t),$$
where $B_i(t)$ is a standard Brownian motion and $\sigma$ reflects the stochastic nature of the system. Typically $a_i > 0$, and in some applications $w_{ii} = 0$ (no self-regulation) is required.
In the sigmoidal recurrent network model, the coefficients $w_{ij}$ characterize the between-node interactions. In particular, $w_{ij} = 0$ indicates that node $j$ does not directly influence node $i$; otherwise, node $j$ regulates node $i$, and is referred to as a regulator of node $i$ in gene regulatory networks. Such a regulation relationship can be either excitatory (if $w_{ij}$ is positive) or inhibitory (if $w_{ij}$ is negative). In this way, $W = [w_{ij}]$ is associated with a directed graph that captures the Granger causal relationships between all nodes, and the topology of the recurrent network is revealed by the zero-nonzero pattern of the matrix. Therefore, to identify all significant regulation links, it is of great interest to estimate $W$, given a sequence of node observations (snapshots of the network).
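To make the zero-nonzero reading concrete, here is a minimal sketch (with a hypothetical 3-node weight matrix, not taken from the paper) that extracts the directed, signed regulation links from the matrix:

```python
import numpy as np

# Hypothetical 3-node weight matrix W = [w_ij]: w_ij != 0 means node j
# directly regulates node i, excitatory if w_ij > 0, inhibitory if w_ij < 0.
W = np.array([[0.0,  1.2, 0.0],
              [-0.8, 0.0, 0.0],
              [0.0,  0.5, 0.0]])

# Edges of the directed graph read off from the zero-nonzero pattern.
edges = [(j, i, 'excitatory' if W[i, j] > 0 else 'inhibitory')
         for i in range(W.shape[0])
         for j in range(W.shape[1])
         if W[i, j] != 0.0]
print(edges)
```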
Denote the system state at time $t$ by $x(t)$ (or simply $x$ when there is no ambiguity). Define $W = [w_{ij}]$, $A = \mathrm{diag}(a_1, \ldots, a_p)$, $b = [b_1, \ldots, b_p]^T$, $B(t) = [B_1(t), \ldots, B_p(t)]^T$, and let $\pi$ act componentwise on vectors. Then (3) can be represented in a multivariate form
$$\mathrm{d}x(t) = (-A x(t) + W \pi(x(t)) + b)\,\mathrm{d}t + \sigma\,\mathrm{d}B(t),$$
where $B(t)$ is a $p$-dimensional standard Brownian motion. While this model is specified in continuous time, in practice the observations are always collected at discrete time points. Estimating the parameters of an SDE model from few discrete observations is very challenging, and there rarely exists an analytical expression of the exact likelihood. A common treatment is to discretize (4) and use an approximate likelihood instead. We use the Euler discretization scheme (see, e.g., ):
$$x(t+\delta) - x(t) \approx (-A x(t) + W \pi(x(t)) + b)\,\delta + \sigma\,(B(t+\delta) - B(t)).$$
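The Euler scheme also gives a direct way to simulate trajectories from the stochastic model. The sketch below uses entirely hypothetical parameter values for a small three-node system and advances the state with the Euler-Maruyama update:

```python
import numpy as np

rng = np.random.default_rng(0)
p, dt, n_steps, noise = 3, 0.01, 500, 0.05

# Hypothetical small system: decay rates a_i, sparse weights w_ij, intercepts b_i.
a = np.array([1.0, 0.8, 1.2])
W = np.array([[0.0,  2.0, 0.0],
              [-1.5, 0.0, 0.0],
              [0.0,  1.0, 0.0]])
b = np.array([0.1, 0.0, -0.2])

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

x = np.zeros(p)
traj = [x.copy()]
for _ in range(n_steps):
    drift = -a * x + W @ sigmoid(x) + b
    # Euler-Maruyama step: drift * dt plus Brownian increment of variance dt.
    x = x + drift * dt + noise * np.sqrt(dt) * rng.standard_normal(p)
    traj.append(x.copy())
traj = np.array(traj)  # (n_steps + 1) x p array of simulated node states
```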
Suppose the system (4) is observed at time points $t_0 < t_1 < \cdots < t_n$. Let $x_k$ be the observed values of all nodes at $t_k$. Define $\delta_k = t_{k+1} - t_k$, $0 \le k \le n-1$. Because $B(t_{k+1}) - B(t_k) \sim N(0, \delta_k I)$, the negative conditional log-likelihood for the discretized model is given, up to an additive constant, by
$$l(W, A, b) = \frac{1}{2\sigma^2} \sum_{k=0}^{n-1} \delta_k \Big\| \frac{x_{k+1} - x_k}{\delta_k} + A x_k - W \pi(x_k) - b \Big\|_2^2.$$
$l$ is a function that depends on $(W, A, b)$ only. The fitting criterion is separable in the nodes. To see this, let $w_i^T$ denote the $i$-th row of $W$ and $x_{k,i}$ the $i$-th component of $x_k$. Then, it is easy to verify that
$$l = \sum_{i=1}^{p} l_i, \qquad l_i(w_i, a_i, b_i) = \frac{1}{2\sigma^2} \sum_{k=0}^{n-1} \delta_k \Big( \frac{x_{k+1,i} - x_{k,i}}{\delta_k} + a_i x_{k,i} - w_i^T \pi(x_k) - b_i \Big)^2.$$
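A minimal NumPy sketch for evaluating the per-node fitting criterion under the Euler discretization (the function name and interface are our own, not from the paper):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def node_nll(X, t, i, w_i, a_i, b_i, sigma=1.0):
    """Euler-discretized negative log-likelihood contribution of node i.

    X : (n+1) x p array of observed states x(t_0), ..., x(t_n)
    t : length n+1 array of observation times
    """
    delta = np.diff(t)                                   # step sizes delta_k
    drift = -a_i * X[:-1, i] + sigmoid(X[:-1]) @ w_i + b_i
    resid = (X[1:, i] - X[:-1, i]) / delta - drift       # discretized residual
    return np.sum(delta * resid**2) / (2.0 * sigma**2)
```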
Conventionally, the unknown parameters can then be estimated by minimizing $l$. In modern applications, however, the number of available observations ($n$) is often much smaller than the number of variables to be estimated (of order $p^2$), due to, for example, equipment/budget limitations. Classical MLE methods do not apply well in this high-dimensional setting.
Fortunately, the networks of interest in reality often possess topology sparsity. For example, a stock price may not be directly influenced by all the other stocks in the market. A parsimonious network with only significant regulation links is much more interpretable. Statistically speaking, the sparsity in $W$ suggests the necessity of shrinkage estimation, which can be done by adding penalties and/or constraints to the loss function. The general penalized maximum likelihood problem is
$$\min_{W, A, b} \; l(W, A, b) + P(W; \lambda),$$
where $P$ is a penalty promoting sparsity and $\lambda$ (and possibly further parameters) are regularization parameters. Among the very many possible choices of $P$, the $\ell_1$ penalty is perhaps the most popular to enforce sparsity:
$$P(W; \lambda) = \lambda \sum_{i,j} |w_{ij}| = \lambda \|W\|_1.$$
It provides a convex relaxation of the $\ell_0$ penalty
$$P(W; \lambda) = \lambda \sum_{i,j} 1_{w_{ij} \neq 0} = \lambda \|W\|_0.$$
Taking both topology identification and dynamics prediction into consideration, we are particularly interested in the $\ell_0 + \ell_2$ penalty
$$P(W; \lambda, \eta) = \lambda \|W\|_0 + \frac{\eta}{2} \|W\|_F^2.$$
The shrinkage estimation problem (6) is however nontrivial. The loss is nonconvex, $\pi$ is nonlinear, and the penalty may be nonconvex or even discrete, let alone the high-dimensionality challenge. Indeed, in many practical networks, the available observations are quite limited and noisy. Effective and efficient learning algorithms are in great need to meet the modern big data challenge.
III. Sparse Sigmoidal Regression for Recurrent Network Learning
III-A Univariate-response sigmoidal regression
As analyzed previously, to solve (6), it suffices to study a univariate-response learning problem
$$\min_{\alpha, \beta} \; l(\alpha, \beta) + P(\beta; \lambda),$$
where the response and predictors are given, $\alpha$ collects the unpenalized coefficients, and $\beta$ is desired to be sparse. (10) is the recurrent network learning problem for node $i$ when the response is formed from the scaled differences of node $i$'s observations and the predictors consist of $\pi(x_k)$, $x_{k,i}$, and a constant term (note that the intercept is usually subject to no penalization). For notational ease, we use generic symbols for the response, design matrix, and coefficients, to be used in this subsection only, unless otherwise specified.
We propose a simple-to-implement and efficient algorithm to solve the general optimization problem in the form of (10). First define two useful auxiliary functions
The vector versions of these auxiliary functions are defined componentwise, and the matrix versions are defined similarly. A prototype algorithm is described as follows, starting with an initial estimate and a thresholding rule $\Theta$ (an odd, shrinking and nondecreasing function).
with nonnegative step-size parameters at every iteration. The regularization parameter(s) are rescaled at each iteration according to the form of $P$. Examples include: (i) the $\ell_1$ penalty (7) and its associated soft-thresholding $\Theta^{S}(t; \lambda) = \mathrm{sgn}(t)(|t| - \lambda)_+$; (ii) the $\ell_0$ penalty (8) and the hard-thresholding $\Theta^{H}(t; \lambda) = t \cdot 1_{|t| > \lambda}$; and (iii) the $\ell_0 + \ell_2$ penalty (9) and its associated hard-ridge thresholding $\Theta^{HR}(t; \lambda, \eta) = \frac{t}{1+\eta} \cdot 1_{|t| > \lambda}$. The $\ell_p$ ($0 < p < 1$) penalties, the elastic net penalty, and others are also instances of this framework.
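The three penalty-thresholding pairs can be sketched in NumPy (standard textbook forms of these operators):

```python
import numpy as np

def soft_threshold(t, lam):
    """Soft-thresholding, associated with the l1 penalty."""
    return np.sign(t) * np.maximum(np.abs(t) - lam, 0.0)

def hard_threshold(t, lam):
    """Hard-thresholding, associated with the l0 penalty."""
    return np.where(np.abs(t) > lam, t, 0.0)

def hard_ridge_threshold(t, lam, eta):
    """Hard-ridge thresholding, associated with the l0 + l2 penalty:
    survivors are additionally shrunk by a ridge factor 1 / (1 + eta)."""
    return np.where(np.abs(t) > lam, t / (1.0 + eta), 0.0)

t = np.array([-3.0, -0.5, 0.2, 1.5])
print(soft_threshold(t, 1.0))
print(hard_threshold(t, 1.0))
print(hard_ridge_threshold(t, 1.0, 0.5))
```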
See Appendix A for the proof.
Normalization is usually necessary to make all predictors comparable in scale before penalizing their coefficients with a single regularization parameter $\lambda$. We can center and scale all predictor columns but the intercept before calling the algorithm. Alternatively, it is sometimes more convenient to specify the regularization componentwise: e.g., in the $\ell_1$ penalty, use a separate parameter for each non-intercept coefficient.
III-B Cardinality constrained sigmoidal network learning
In the recurrent network setting, one can directly apply the prototype algorithm in Section III-A to solve (6), one node at a time. On the other hand, a multivariate update form is usually more efficient and convenient to implement. Moreover, it facilitates integrating stability into network learning (cf. Section IV).
To formulate the loss in a multivariate form, we introduce the stacked matrices
$$Y = \Big[\frac{x_1 - x_0}{\delta_0}, \ldots, \frac{x_n - x_{n-1}}{\delta_{n-1}}\Big]^T, \quad X = [x_0, \ldots, x_{n-1}]^T, \quad \Pi = [\pi(x_0), \ldots, \pi(x_{n-1})]^T,$$
and the diagonal weighting matrix $D = \mathrm{diag}(\sqrt{\delta_0}, \ldots, \sqrt{\delta_{n-1}})$. Then $l$ in (5) can be rewritten as
$$l(W, A, b) = \frac{1}{2\sigma^2} \big\| D (Y + X A - \Pi W^T - \mathbf{1} b^T) \big\|_F^2,$$
with $\|\cdot\|_F$ denoting the Frobenius norm, and the objective function to minimize becomes
$$\frac{1}{2\sigma^2} \big\| D (Y + X A - \Pi W^T - \mathbf{1} b^T) \big\|_F^2 + P(W; \lambda).$$
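In code, the whole-network loss under the Euler discretization reduces to a single weighted Frobenius-norm evaluation (a sketch; the function name and interface are our own):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def network_loss(X, t, W, a, b, sigma=1.0):
    """Euler-discretized loss in multivariate (Frobenius-norm) form.

    X : (n+1) x p array of observed states; t : length n+1 observation times.
    W : p x p weight matrix; a : length-p decay rates; b : length-p intercepts.
    """
    delta = np.diff(t)                               # step sizes (n,)
    diff = (X[1:] - X[:-1]) / delta[:, None]         # scaled finite differences
    drift = -X[:-1] * a + sigmoid(X[:-1]) @ W.T + b  # row k: drift at x_k
    R = np.sqrt(delta)[:, None] * (diff - drift)     # row-weighted residual
    return np.linalg.norm(R, 'fro')**2 / (2.0 * sigma**2)
```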
One of the main issues is to choose a proper penalty form for sparse network learning. Popular sparsity-promoting penalties include the $\ell_1$, SCAD, and $\ell_0$ penalties, among others. The $\ell_0$ penalty (8) is ideal for pursuing a parsimonious solution. However, the matter of parameter tuning cannot be ignored. Most tuning strategies (such as $K$-fold cross-validation) require computing a solution path for a grid of values of $\lambda$, which is quite time-consuming in large network estimation. Rather than applying the $\ell_0$ penalty, we propose an $\ell_0$ constrained sparse network learning
$$\min_{W, A, b} \; l(W, A, b) \quad \text{s.t.} \quad \|W\|_0 \le q,$$
where $\|\cdot\|_0$ denotes the number of nonzero components in a matrix. In contrast to the penalty parameter $\lambda$ in (8), the cardinality bound $q$ is more meaningful and customizable. One can directly specify its value based on prior knowledge or the availability of computational resources, to have direct control of the network connection cardinality. To account for collinearity and large noise contamination, we add a further ridge penalty on $W$, resulting in a new '$\ell_0 + \ell_2$' regularized criterion
Not only is the cardinality bound $q$ convenient to set under the sparsity assumption, but the shrinkage parameter $\eta$ is also easy to tune. Indeed, $\eta$ is usually not a very sensitive parameter and does not require a large grid of candidate values. Many researchers simply fix it at a small value, which can effectively reduce the prediction error. Similarly, to handle possible collinearity between the predictors, we recommend adding mild ridge penalties in (19) on the remaining parameters as well.
The constrained optimization of (19) does not apparently fall into the penalized framework proposed in Section III-A. However, we can adapt the technique to handle it through a quantile thresholding operator (a variant of the hard-ridge thresholding (9)). The detailed recurrent network screening (RNS) algorithm is described as follows, where for notational simplicity we denote by $W[\mathcal{I}, \mathcal{J}]$ the submatrix of $W$ consisting of the rows and columns indexed by $\mathcal{I}$ and $\mathcal{J}$, respectively.
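The quantile thresholding operator can be sketched as keeping the $q$ largest-magnitude entries of a matrix and applying ridge shrinkage to the survivors (a simplified illustration with our own function name; ties at the threshold are broken arbitrarily, and the actual RNS algorithm embeds such an operator in an iterative procedure):

```python
import numpy as np

def quantile_hard_ridge(W, q, eta):
    """Keep the q largest entries of W in magnitude, zero out the rest,
    and apply ridge shrinkage t -> t / (1 + eta) to the survivors."""
    out = np.zeros_like(W)
    if q <= 0:
        return out
    # Indices of the q largest magnitudes in the flattened matrix.
    idx = np.argpartition(np.abs(W).ravel(), -q)[-q:]
    out.flat[idx] = W.flat[idx] / (1.0 + eta)
    return out

W = np.array([[0.1, -2.0, 0.3],
              [1.5,  0.0, -0.2],
              [0.05, 0.7, 0.0]])
print(quantile_hard_ridge(W, 3, 0.0))   # keeps -2.0, 1.5, 0.7
```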