The amount of data generated by interconnected networks such as sensor networks, financial time series, brain networks, etc., is increasing rapidly. Extracting meaningful information from such interconnected data, represented in the form of a graph, has many practical applications such as signal denoising , change point detection , and time series prediction , etc. Many of the functional relationships in such networks are causal, and the identification of this causal graph structure is termed topology identification. Many real-world causal systems can be well described by a vector autoregressive (VAR) model, since most of the dependencies are naturally time-lagged. Moreover, under causal sufficiency, VAR causality implies the well-known Granger causality.
Topology identification based on the linear VAR model has been well studied. In , an efficient way to estimate linear VAR coefficients from streaming data is proposed. However, such linear VAR models fail to capture real-world nonlinear dependencies. A novel nonlinear VAR topology identification method is proposed in , in which kernels are used to linearize the nonlinear dependencies by mapping them to a higher-dimensional Hilbert space. However, being a batch-based approach,  is computationally expensive and is not suitable for identifying time-varying topologies.
The above shortcomings are tackled by the kernel-based online algorithms , . In , sparse VAR coefficients are recursively estimated using a composite objective mirror descent (COMID) approach, whereas  uses functional stochastic gradient descent (FSGD) followed by soft-thresholding. However, kernel-based representations have a major drawback: an unaffordable growth of computational complexity and memory requirements, commonly known as the “curse of dimensionality”. Both  and  propose to circumvent this issue by restricting the computation to a limited number of time-series samples using a time window, which results in suboptimal performance.
A standard procedure to address the curse of dimensionality is to use kernel dictionaries . Often, the dictionary elements are selected based on a budget-maintaining strategy. In large-scale machine learning problems, however, the dictionary size required to maintain the budget can become prohibitively large. Recently, random feature (RF) approximation  techniques have gained popularity for approximating kernels, and they have been shown to yield promising results compared to budget-maintaining strategies [12, 6].
In this work, we use RF approximation to avoid the curse of dimensionality in learning nonlinear VAR models. We approximate shift-invariant Gaussian kernels using a fixed number of random Fourier features. The major contributions of this paper are i) the formulation of a kernel-based optimization framework in the function space, ii) the reformulation of i) into a parametric optimization using RF approximation, and iii) an online algorithm to estimate the sparse nonlinear VAR coefficients using COMID updates. We provide numerical results showing that the proposed method outperforms state-of-the-art topology identification algorithms.
2 Kernel Representation
Consider a multi-variate time series with  nodes. Let  be the value of the time series at time , observed at node . A -th-order nonlinear VAR model assuming additive functional dependencies can be formulated as
We assume that the functions in (1) belong to a reproducing kernel Hilbert space (RKHS):
For a particular node , the estimates of are obtained by solving the functional optimization problem:
It is to be noted that the functions in (2) belong to the RKHS defined above, which is an infinite-dimensional space. However, by resorting to the representer theorem , the solution of (2) can be written in terms of a finite number of data samples:
Notice that the number of coefficients required to express the function increases with the number of data samples. In the recent works , , this problem is addressed by using a time window to fix the number of data points, resulting in suboptimality. In this work, in line with  and , we instead use RF approximation to tackle the dimensionality growth.
3 Random Feature approximation
To invoke the RF approximation, we assume the kernel to be shift-invariant, i.e., . By Bochner’s theorem ,
where  is the probability density function, which depends on the type of the kernel, and
 is the random variable associated with it. If a sufficient number of i.i.d. samples  are collected from the distribution , the ensemble mean in (3) can be approximated by the sample mean:
irrespective of the distribution
. Note that this unbiased estimate of the kernel evaluation in (3) involves a summation of a fixed
number of terms. In general, computing the probability distribution corresponding to a kernel is a difficult task. In this work, the kernel under consideration is Gaussian; for a Gaussian kernel
with variance , it is well known that the Fourier transform is Gaussian with variance . Taking the real part of (3), which is also an unbiased estimator, (3) can be approximated as
where . For the sake of clarity, in the succeeding steps, we define the following notation:
For convenience, optimization parameters and are stacked in the lexicographic order of the indices , , and to obtain the vectors and , respectively, and (12) is rewritten as
Now, in order to avoid overfitting, we propose a regularized optimization framework:
where  is the regularization parameter and . The second term in (16) is a group-lasso regularizer, which promotes a group-sparse structure in , supported by the assumption that most real-world dependencies are sparse in nature.
However, notice that the batch formulation in (16) has some significant limitations: i) it requires a complete batch of data points before estimation, ii) it cannot track time-varying topologies, and iii) its computational complexity explodes when  is large, even if RF approximation is used. To mitigate these problems, we adopt an online optimization strategy, which is explained in the following section.
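Before moving on, the RF approximation of the Gaussian kernel described above can be sketched numerically. This is a minimal numpy sketch, not the paper's code; the bandwidth, input dimension, and feature count are our illustrative choices, and the final comparison is a Monte Carlo sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)

sigma = 1.0     # bandwidth of kappa(x, y) = exp(-||x - y||^2 / (2 * sigma^2))
d, D = 3, 2000  # input dimension and number of random features (illustrative)

# By Bochner's theorem, the spectral density of a Gaussian kernel with
# variance sigma^2 is Gaussian with variance 1 / sigma^2.
V = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(x):
    """Real-valued random Fourier feature map: kappa(x, y) ~ z(x) @ z(y)."""
    return np.sqrt(2.0 / D) * np.cos(V @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
true_k = np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma ** 2))
approx_k = z(x) @ z(y)
```

The approximation error decays as O(1/sqrt(D)), which is why a fixed, moderate D suffices in place of an ever-growing kernel expansion.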
4 Online topology estimation
For the online setting, we replace the batch loss function  in (16) with the stochastic (instantaneous) loss function :
Notice that the sparsity-promoting group-lasso regularizer is non-differentiable. The use of online subgradient descent (OSGD) is not advisable in this situation, as it linearizes the entire objective function and fails to provide sparse iterates. To avoid this limitation of OSGD, we use the composite objective mirror descent (COMID)  algorithm, which resembles proximal methods and hence improves convergence. The online COMID update can be written as
In (19),  denotes the estimate of  at time . The first term in (19) is the gradient of the loss function ; the second and third terms are the Bregman divergence and the sparsity-promoting regularizer, respectively. The Bregman divergence is included to improve the stability of the algorithm against adversarial perturbations by constraining  to be close to . The Bregman divergence is chosen in such a way that the COMID update has a closed-form solution , and  is the corresponding step size. The gradient in (19) is evaluated as
Expanding the objective function in (19) and omitting the constants leads to the following formulation:
where . The first term in (22) drives the stochastic gradient update of  to descend the instantaneous loss function, and the second term in (22) enforces group sparsity of . Note that the closed-form solution of (22) is separable in  and .
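Concretely, for each group the closed-form COMID step reduces to a gradient step followed by multidimensional soft-thresholding. The following is our own minimal sketch of that per-group update (the names `alpha_g`, `grad_g`, `eta`, `lam` are ours, not the paper's):

```python
import numpy as np

def comid_group_update(alpha_g, grad_g, eta, lam):
    """One COMID step for a single group of coefficients: a gradient step on
    the instantaneous loss followed by the multidimensional soft-thresholding
    (group-lasso proximal) map."""
    v = alpha_g - eta * grad_g          # gradient step
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    # Shrink the whole group toward zero; groups below eta * lam vanish.
    return max(0.0, 1.0 - eta * lam / norm) * v

# A group whose post-gradient norm falls below eta * lam is zeroed exactly,
# which is how absent edges produce sparse iterates.
zeroed = comid_group_update(np.array([0.01, -0.01]), np.zeros(2), eta=0.1, lam=1.0)
shrunk = comid_group_update(np.array([1.0, 0.0]), np.zeros(2), eta=0.1, lam=1.0)
```

This whole-group shrinkage is what distinguishes the update from elementwise soft-thresholding: an edge is either kept (all its coefficients shrunk jointly) or removed entirely.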
The proposed algorithm, termed Random Feature based Nonlinear Topology Identification via Sparse Online learning (RF-NLTISO), is summarized in Algorithm 1.
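To illustrate how the pieces fit together, one possible online iteration combining RF features with a COMID-style group-shrinkage step can be sketched as follows. All sizes, step sizes, and the randomly generated data are our illustrative choices, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P, D, T = 4, 2, 50, 200          # nodes, VAR order, random features, samples
eta, lam, sigma = 0.05, 0.01, 1.0   # step size, regularization, kernel bandwidth

V = rng.normal(scale=1.0 / sigma, size=D)   # spectral samples for scalar lags
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(x):
    """RF map of a scalar lagged sample."""
    return np.sqrt(2.0 / D) * np.cos(V * x + b)

y = rng.normal(size=(N, T))                 # placeholder data for illustration
alpha = np.zeros((N, N, P, D))              # RF weights per (node, node, lag)

for t in range(P, T):
    # Features of all lagged samples, shape (N, P, D).
    feats = np.array([[z(y[m, t - p]) for p in range(1, P + 1)] for m in range(N)])
    for n in range(N):
        e = np.sum(alpha[n] * feats) - y[n, t]   # instantaneous prediction error
        v = alpha[n] - eta * e * feats           # gradient step on 0.5 * e^2
        norms = np.linalg.norm(v, axis=-1, keepdims=True)
        shrink = np.maximum(0.0, 1.0 - eta * lam / np.maximum(norms, 1e-12))
        alpha[n] = shrink * v                    # per-(node, lag) group shrinkage
```

The cost per iteration depends on N, P, and D but not on t, which is the fixed-complexity property demonstrated experimentally below.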
5 Numerical Results
We compare the performance of the proposed algorithm, RF-NLTISO, with the state-of-the-art online topology estimation algorithms. The experiments in this section are conducted using 1) synthetic datasets with topologies having different transition patterns and 2) real datasets collected from Lundin’s offshore oil and gas platform. For the performance comparison, we choose the TIRSO  and NL-TISO  algorithms, which are, to the best of our knowledge, the state-of-the-art counterparts of RF-NLTISO. TIRSO is based on a linear VAR model assumption, whereas NL-TISO, a kernel-based topology estimation algorithm, is developed for nonlinear VAR models. Although a kernel-based functional stochastic gradient algorithm  is also available, its performance has been shown to be inferior to that of NL-TISO .
5.1 Experiments using Synthetic Data
5.1.1 Topology with switching edges
We generate a multi-variate time series using the nonlinear VAR model (1) with . An initial random graph with an edge probability of  is generated, and the graph adjacency coefficients
are drawn from a uniform distribution. After every  samples, one of the active (non-zero) edges disappears and another one appears at random, which brings an abrupt change in the graph topology. The nonlinearity in (1) is introduced using a Gaussian kernel with variance ,
and the kernel coefficients are chosen randomly from a zero-mean Gaussian distribution with variance . Note that the initial  data samples are generated randomly and the rest of the data is generated using the model (1).
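A data-generation process of this flavor can be sketched as follows. The node count, VAR order, edge probability, and the specific Gaussian-bump nonlinearity are our illustrative stand-ins (the paper's actual values are not reproduced here), but the structure — a sparse random graph driving an additive nonlinear VAR — matches the setup described above:

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, T = 5, 2, 500      # nodes, VAR order, samples (illustrative values)
sigma2 = 1.0             # variance of the Gaussian nonlinearity

# Sparse random adjacency: an edge exists with probability 0.2, and active
# edges get coefficients drawn from a uniform distribution.
A = (rng.random((N, N, P)) < 0.2) * rng.uniform(0.3, 1.0, (N, N, P))

y = np.zeros((N, T))
y[:, :P] = rng.normal(size=(N, P))   # random initial samples
for t in range(P, T):
    for n in range(N):
        acc = 0.0
        for p in range(1, P + 1):
            # Additive nonlinear dependency on each lagged sample.
            acc += A[n, :, p - 1] @ np.exp(-y[:, t - p] ** 2 / (2.0 * sigma2))
        y[n, t] = acc + 0.01 * rng.normal()   # small observation noise
```

Switching topologies, as in the experiment above, would amount to zeroing one active entry of `A` and activating a new one every fixed number of samples.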
The coefficients are estimated using the proposed RF-NLTISO algorithm with a Gaussian kernel of variance  and  random features. The hyper-parameters  and  are heuristically chosen as  and , respectively. To visualize the causal dependencies, we compute the norms  and arrange them in a matrix form similar to the graph adjacency matrix. A similar strategy is adopted for the NL-TISO and TIRSO algorithms. The normalized versions of the true and estimated dependencies at various time samples are shown in Fig. 0(a), where in each subplot the dependency matrices corresponding to  are concatenated, resulting in a  matrix. We normalize the coefficients by dividing each coefficient by the highest coefficient value in the pseudo-adjacency matrix. From Fig. 0(a), it is clear that RF-NLTISO performs on par with or better than NL-TISO and clearly outperforms TIRSO.
Next, we conduct the same experiments using RF-NLTISO with different numbers of random features (). These experiments are repeated 1000 times to find the probabilities of miss detection () and false alarm (), which we define as
From Figs. 0(c) and 0(b), it is observed that for the given choices of ,  is better for RF-NLTISO than for NL-TISO; however, NL-TISO performs better in terms of . Both figures show an overshoot at the topology-switching time instants. It is also observed that for the proposed algorithm,  decreases with , whereas  increases with , which suggests tuning  for an effective trade-off between  and .
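The paper's exact definitions of these error probabilities are not reproduced above, but a standard edge-detection formulation — declare an edge present when its normalized estimated coefficient norm exceeds a threshold, then count misses and false alarms against the true adjacency — can be sketched as (function name and threshold are ours):

```python
import numpy as np

def miss_and_false_alarm(A_true, A_est, thresh=0.1):
    """Edge-detection error rates: an edge is declared present when its
    (normalized) estimated coefficient norm exceeds `thresh`."""
    true_edges = A_true > 0
    est_edges = A_est > thresh
    # Miss detection: fraction of true edges not declared present.
    p_md = np.mean(~est_edges[true_edges]) if true_edges.any() else 0.0
    # False alarm: fraction of absent edges declared present.
    p_fa = np.mean(est_edges[~true_edges]) if (~true_edges).any() else 0.0
    return p_md, p_fa

A_true = np.array([[0.0, 1.0], [0.0, 0.0]])
A_est = np.array([[0.05, 0.9], [0.2, 0.0]])
pmd, pfa = miss_and_false_alarm(A_true, A_est)
```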
5.1.2 Slowly varying topology
We compare the performance of RF-NLTISO with the state-of-the-art algorithms using a slowly varying graph topology. The same experimental setup as in Section 5.1.1 is adopted, but with the following, more slowly varying topology:
The normalized value of one of the active edges is plotted in Fig. 0(d). The figure also shows the normalized values of the corresponding estimated coefficients ( for NL-TISO and RF-NLTISO and  for TIRSO). From the figure, it can be observed that the RF-NLTISO estimates are closer to the true value than the estimates of the other two algorithms. In this example, the quality of the TIRSO estimates lags considerably behind that of the kernel-based algorithms, due to the fact that the underlying VAR model is nonlinear.
5.2 Experiments using Real Data
This section is dedicated to experiments using real data collected from Lundin’s offshore oil and gas (O&G) platform Edvard Grieg (https://www.lundin-energy.com/). We have a multi-variate time series with  nodes, where the nodes correspond to various temperature (T), pressure (P), and oil-level (L) sensors. The sensors are placed in the separators of decantation tanks that separate oil, gas, and water. The time series are obtained by uniformly sampling the sensor readings with a sampling rate of . We assume that hidden logical dependencies are present in the network due to various physical connections and control actuators. Before applying the algorithm, the data obtained from the network is normalized to a zero-mean, unit-variance signal.
The causal dependencies are learned using RF-NLTISO with  and a Gaussian kernel with a variance of , using hyper-parameter values  and . The signal is reconstructed using the estimated dependencies. Figure 1(a) shows the mean squared error (), defined as  for a particular sensor , of the RF-NLTISO estimates in comparison with the other algorithms. We observe that the RF-NLTISO estimates with  random features show better performance compared to NL-TISO. The causality graph estimated by RF-NLTISO is shown in Fig. 1(b).
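The MSE definition itself is not reproduced above; a common choice for this kind of online comparison is the cumulative mean of squared one-step prediction errors, which could be sketched as (function name is ours):

```python
import numpy as np

def running_mse(y, y_hat):
    """Cumulative mean-squared prediction error up to each time t, useful for
    comparing reconstruction quality of online estimators over time."""
    err2 = (y - y_hat) ** 2
    return np.cumsum(err2) / np.arange(1, len(y) + 1)

mse = running_mse(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0, 3.0]))
```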
One of the main attractions of RF-NLTISO is that, even though it is a kernel-based algorithm, its computational complexity remains fixed throughout the online iterations. To demonstrate this, in Fig. 1(c) we plot the computation time required to estimate the coefficients at each time instant by NL-TISO and by RF-NLTISO with different values of . The experiment is conducted on a machine with a 2.4 GHz 8-core Intel Core i9 processor and 16 GB of 2667 MHz DDR4 RAM. Figure 1(c) shows that the computation time of NL-TISO increases considerably with time, whereas that of RF-NLTISO remains more or less constant for a particular value of .
6 Conclusion
We propose a kernel-based online topology identification method for interconnected networks of time series with additive nonlinear dependencies. The curse of dimensionality associated with the kernel representation is tackled using random feature approximation. Assuming that real-world dependencies are sparse, we use composite objective mirror descent updates to estimate a sparse causality graph online. The effectiveness of the proposed algorithm is illustrated through experiments on synthetic and real data, which show that the algorithm outperforms its state-of-the-art competitors. We defer the convergence and stability analysis of the proposed algorithm to future work.
-  (1959) Lectures on Fourier integrals. Princeton University Press, 42. Cited by: §3.
-  (2010) Composite objective mirror descent. In COLT’10, pp. 14–26. Cited by: §4.
-  (2011) Bregman divergence as general framework to estimate unnormalized statistical models. In Proceedings of UAI, UAI’11, Arlington, Virginia, USA, pp. 283–290. Cited by: §4.
-  (2017) Parsimonious online learning with kernels via sparse projections in function space. In 2017 IEEE ICASSP, pp. 4671–4675. Cited by: §1.
-  (2018) Dynamic network identification from non-stationary vector autoregressive time series. In 2018 IEEE GlobalSIP, pp. 773–777. Cited by: §1.
-  (2016) Large scale online kernel learning. Journal of Machine Learning Research 17 (47), pp. 1–43. Cited by: §1, §2.
-  (2005) New introduction to multiple time series analysis. Springer, Berlin. Cited by: §1.
-  (2021) Design of asymmetric shift operators for efficient decentralized subspace projection. IEEE Transactions on Signal Processing 69, pp. 2056–2069. Cited by: §1.
-  (2021) Online non-linear topology identification from graph-connected time series. arXiv:2104.00030. Cited by: §1, §2, §5.
-  (2000) A generalized representer theorem. Computational Learning Theory 42. Cited by: §2.
-  (2009) A multidimensional shrinkage-thresholding operator. Vol. 18, pp. 113–116. Cited by: §4.
-  (2007) Random features for large-scale kernel machines. In NeurIPS, NIPS’07, Red Hook, NY, USA, pp. 1177–1184. Cited by: §1.
-  (2019) Random feature-based online multi-kernel learning in environments with unknown dynamics. J. Mach. Learn. Res. 20 (1), pp. 773–808. Cited by: §2.
-  (2018) Online identification of directional graph topologies capturing dynamic and nonlinear dependencies. In 2018 IEEE Data Science Workshop (DSW), pp. 195–199. Cited by: §1, §2, §5.
-  (2019) Nonlinear structural vector autoregressive models with application to directed brain networks. IEEE Transactions on Signal Processing 67, pp. 5325–5339. Cited by: §1.
-  (1990) Spline models for observational data. SIAM Press, Society for Industrial and Applied Mathematics. Cited by: §2.
-  (2017) Online topology estimation for vector autoregressive processes in data networks. In 2017 IEEE CAMSAP, pp. 1–5. Cited by: §1.
-  (2021) Online topology identification from vector autoregressive time series. IEEE Transactions on Signal Processing 69, pp. 210–225. Cited by: §1, §5.