Random Feature Approximation for Online Nonlinear Graph Topology Identification

by   Rohan Money, et al.

Online topology estimation of graph-connected time series is challenging, especially since the causal dependencies in many real-world networks are nonlinear. In this paper, we propose a kernel-based algorithm for graph topology estimation. The algorithm uses a Fourier-based Random feature approximation to tackle the curse of dimensionality associated with the kernel representations. Exploiting the fact that the real-world networks often exhibit sparse topologies, we propose a group lasso based optimization framework, which is solve using an iterative composite objective mirror descent method, yielding an online algorithm with fixed computational complexity per iteration. The experiments conducted on real and synthetic data show that the proposed method outperforms its competitors.



There are no comments yet.


page 5


Leveraging Pre-Images to Discover Nonlinear Relationships in Multivariate Environments

Causal discovery, beyond the inference of a network as a collection of c...

Nonlinear least-squares spline fitting with variable knots

In this paper, we present a nonlinear least-squares fitting algorithm us...

Optimizing Convergence for Iterative Learning of ARIMA for Stationary Time Series

Forecasting of time series in continuous systems becomes an increasingly...

Kernel-Based Structural Equation Models for Topology Identification of Directed Networks

Structural equation models (SEMs) have been widely adopted for inference...

Online Graph Topology Learning from Matrix-valued Time Series

This paper is concerned with the statistical analysis of matrix-valued t...

Graph-Aided Online Multi-Kernel Learning

Multi-kernel learning (MKL) has been widely used in function approximati...

Inferring Network Structures via Signal Lasso

Inferring the connectivity structure of networked systems from data is a...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The amount of data generated from interconnected networks such as sensor networks, financial time-series, brain-networks, etc., are increasing rapidly. Extraction of meaningful information from such interconnected data, represented in the form of a graph can have many practical applications such as, signal denoising [8], change point detection [5], time series prediction [17]

, etc. Many of the functional relationships in such networks are causal and identification of this causal graph structure is termed topology identification. Many real world causal systems can be well described using vector autoregressive model (VAR) as naturally most of the dependencies are time-lagged in nature. Moreover under causal sufficiency, VAR causality implies well known Granger causality


Topology identification based on the linear VAR model has been well-studied. In [18], an efficient way to estimate linear VAR coefficients from streaming data is proposed. However, such linear VAR models fail to capture the real-world nonlinear dependencies.A novel nonlinear VAR topology identification is proposed in [15] in which, the kernels are used to linearize the nonlinear dependencies by mapping them to a higher-dimensional Hilbert space. However, being a batch-based approach, [15] is computationally expensive and is not suitable for identifying the time-varying topologies.

The above shortcomings are tackled by kernel-based online algorithms [14], [9]. In [9], sparse VAR coefficients are recursively estimated using a composite objective mirror decent (COMID) approach, whereas [14]

uses functional stochastic gradient descent (FSGD), followed by soft-thresholding. However, the kernel-based representations have a major drawback of unaffordable growth of computational complexity and memory requirement, which is commonly known as the “curse of dimensionality”. Both

[14] and [9] propose to circumvent this issue by restricting the numeric calculation to a limited number of time-series samples using a time window, which results in suboptimal performance.

A standard procedure to address the curse of dimensionality is to invoke the kernel dictionaries [4]. Often, the dictionary elements are selected based on a budget maintaining strategy. In large-scale machine learning problems, the dictionary size can go prohibitively high in order to maintain the budget. Recently, the random feature (RF) approximation [12] techniques are gaining popularity in approximating the kernels, which are shown to yield promising results compared to the budget maintaining strategies [12, 6].

In this work, we use RF approximation to avoid the curse of dimensionality in learning nonlinear VAR models. We approximate shift-invariant Gaussian kernels using a fixed number of random Fourier features. The major contributions of this paper are i) formulation of a kernel-based optimization framework in the function space, ii) reformulation of i) to a parametric optimization using RF approximation, and iii) an online algorithm to estimate the sparse nonlinear VAR coefficients using COMID updates. We provide numerical results showing the proposed method outperforms the state-of-the-art topology identification algorithms.

2 Kernel Representation

Consider a multi-variate time series with nodes. Let be the value of time series at time observed at node . A -th order nonlinear VAR model assuming additive functional dependencies can be formulated as

where is the function that encodes the nonlinear causal influence of the -lagged data at node on the node , is the corresponding entry of the graph adjacency matrix, and is the observation noise. Considering the model (1), topology identification can be defined as the estimation of the functional dependencies for from the observed time series .

We assume that the functions in (1) belong to a reproducing kernel Hilbert space (RKHS):

where is the kernel associated with the Hilbert space. The kernel measures the similarity between data points and . Referring to (2), evaluation of the functional at can be represented as the linear combination of the similarities between and the data points , with weights . The inner product, , is defined in the Hilbert space using kernels with reproducible property . Such a Hilbert space with the reproducing kernels is termed as RKHS and the inner product described above induces a norm, . We refer to [16] for further reading on RKHS.

For a particular node , the estimates of are obtained by solving the functional optimization problem:


It is to be noted that in (2), the functions belong to the RKHS defined in (2), which is an infinite dimensional space. However, by resorting to the Representer Theorem [10], the solution of (2) can be written using a finite number of data samples:


Notice that the number of coefficients required to express the function increases with the number of data samples. In the recent works [14], [9], this problem is solved by using a time window to fix the number of data points, resulting in suboptimality. However, in this work in align with [6] and [13], we use RF approximation to tackle the dimensionality growth.

3 Random Feature approximation

To invoke the RF approximation, we assume the kernel to be shift invariant, i.e., . Bochner’s theorem [1]

states that every shift-invariant kernel can be represented as an inverse Fourier transform of a probability distribution. Hence the kernel evaluation can be expressed as



is the probability density function which depends on type of the kernel, and

is the random variable associated with it. If sufficient amount of iid samples

are collected from the distribution , the real ensemble mean in (3) can be expressed as a sample mean:


irrespective of the distribution

. Note that the unbiased estimate of kernel evaluation in (

3) involves a summation of fixed

number of terms. In general, computing the probability distribution corresponding to a kernel is a difficult task. In this work the kernel under consideration is Gaussian; for a Gaussian kernel

with variance

, it is well known that the Fourier transform is a Gaussian with variance . Considering the real part of (3), which is also an unbiased estimator, (3) can be approximated as


Subsisting (7) in (2), we obtain a fixed dimension ( terms) approximation of the function :


where . For the sake of clarity, in the succeeding steps, we define the following notation:


where . The functional optimization (2) is reformulated as a parametric optimization problem using (3):




For convenience, optimization parameters and are stacked in the lexicographic order of the indices , , and to obtain the vectors and , respectively, and (12) is rewritten as


Now, in order to avoid overfitting, we propose a regularized optimization framework:


where is the regularization parameter and The second term in (16) is a group-lasso regularizer, which promote a group-sparse structure in , supported by the assumption that most of the real world dependencies are sparse in nature.

However, notice that the batch formulation in (16) has some significant limitations: i) requirement of complete batch of data points before estimation, ii) inability to track time varying topologies, and iii) explosive computational complexity when is large even if RF approximation is used. To mitigate these problems, we adopt an online optimization strategy, which is explained in the following section.

4 Online topology estimation

In this case, we replace the batch loss function

in (16) with the stochastic (instantaneous) loss function :


Notice that the sparsity promoting group lasso regularizer is non-differentiable. The use of online subgradient descent (OSGD) is not advisable in this situation as it linearizes the entire objective function and fails to provide sparse iterates. To avoid this limitation of OSGD, we use the composite objective mirror descent (COMID) [2] algorithm which resembles the nature of proximal methods, hence improving convergence. The online COMID update can be written as


In (19) denotes the estimate of at time . The first term in equation (19) is the gradient of the loss function , the second and third term are Bergman divergence and sparsity promoting regularizer respectively. The Bregman divergence is included to improve the stability of algorithm from adversaries by constraining to be close to . The Bregman divergence chosen in such a way that the COMID update has a closed form solution [3] and is the corresponding step size. The gradient in (19) is evaluated as


Expanding the objective function in (19) and omitting the constants leads to the following formulation:


A closed form solution for (18) using (21) is obtained via the multidimensional shrinkage-thresholding operator [11]:


where . The first term in (22) forces the stochastic gradient update of in a way to descend instantaneous loss function and the second term in (22) enforces group sparsity of . Note that the close form solution (22) is separable in and .

The proposed algorithm, termed as Random Feature based Nonlinear Topology Identification via Sparse Online learning (RF-NLTISO), is summarized in Algorithm 1.

Result: and 
Store ,
Initialize , ,

(heuristically chosen) and kernel parameters depending on the type of the kernel.

for  do
       Get data samples and compute
      for  do
             compute using (20)
             for  do
                   compute using (22)
             end for
       end for
end for
Algorithm 1 RF-NLTISO Algorithm

5 Experiments

We compare the performance of the proposed algorithm, RF-NLTISO, with the the state-of-the-art online topology estimation algorithms. Experiments shown in this section are conducted using 1) synthetic datasets with topologies having different transition patterns and 2) real datasets collected from Lundin’s offshore oil and Gas platform. For the performance comparison, we choose TIRSO [18] and NL-TISO [9] algorithms, which are the state-of-the-art counterparts of RF-NLTISO, to the best of our knowledge. TIRSO is developed based on a linear VAR model assumption, whereas NL-TISO, a kernel-based topology estimation algorithm, is developed for nonlinear VAR models. Although a kernel-based functional stochastic gradient based algorithm [14] is also available, its performance has been shown to be inferior compared to NL-TISO [9].

5.1 Experiments using Synthetic Data

5.1.1 Topology with switching edges

We generate a multi-variate time series using nonlinear VAR model (1) with . An initial random graph with edge probability of is generated and the graph adjacency coefficients

are drawn from a Uniform distribution

. After every samples, one of the active (non-zero) edge disappears and another one appears randomly, which brings an abrupt change in the graph topology. The nonlinearity in (1) is introduced using a Gaussian kernel with variance

and the kernel coefficients are chosen randomly from a zero mean Gaussian distribution with variance

. Note that the initial data samples are generated randomly and rest of the data is generated using the model (1).

(a) Causal dependencies estimated using different algorithms compared with the true dependency.

(b) Probability of False Alarm ().

(c) Probability of Miss Detection ().

(d) Slowly varying dependency.
Figure 1:

The coefficients are estimated using the proposed RF-NLTISO algorithm with a Gaussian kernel having variance and number of random features . The hyper-parameters and are heuristically chosen as and , respectively. We compute the norms and arrange them in a matrix form similar to the graph adjacency matrix to visualize the causal dependencies. A similar strategy is adopted for the NL-TISO and the TIRSO algorithms. The normalized version of true and the estimated dependencies at various time samples are shown in Fig. 0(a), where in each subplot, the dependency matrices corresponding to are concatenated, resulting in a size matrix. We normalized the coefficients by dividing each coefficients with highest value of coefficient in a pseudo adjacency matrix. From the Fig. 0(a), it is clear that RF-NLTISO is able to perform equal or better compared to NL-TISO algorithm and clearly outperforms TIRSO.

Next we conduct the same experiments using RF-NLTISO with different numbers of random feature (). These experiments are repeated 1000 times to find probability of miss detection () and false alarm (), which we define as


From Figs. 0(c) and 0(b), it is observed that for the given choices of , is better for RF-NLTISO compared to NL-TISO; however, NL-TISO is performing better in terms of . Both the figures show an overshoot at the topology-switching time instances. It is also observed that for the proposed algorithm, decreases with , whereas increases with , which in turn suggests a tuning for for an effective trade-off between and

5.1.2 Slowly varying topology

We compare the performance of RF-NLTISO with the state-of-the-art algorithms using a slowly varying graph topology. The same experiment setup as discussed in Section 5.1.1 is adopted with the following more slowly time varying topology:


The normalized values of one of the active edges is plotted in Fig. 0(d). The figure also shows the normalized values of the corresponding estimated coefficients ( for NL-TISO and RF-NLTISO and for TIRSO ). From the figure, it can be observed that the RF-NLTISO estimates are closer to the true value compared to the estimates from the other two algorithms. In this example, the quality of TIRSO estimates lags considerably behind the kernel-based algorithms due to the fact that the underlying VAR model is nonlinear.

5.2 Experiments using Real Data

This section is dedicated to experiments using real data collected from Lundin’s offshore oil and gas (O&G) platform Edvard-Grieg111https://www.lundin-energy.com/. We have a multi-variate time series with nodes; and the nodes corresponds to various temperature (T), pressure (P), or oil-level (L) sensors. The sensors are placed in the separators of decantation tanks that separate oil, gas, and water. The time series are obtained by uniformly sampling the sensor readings with a sampling rate of . We assume that hidden logic dependencies are present in the network due to various physical connections and various control actuators. The data obtained from the network is normalized by making it a zero mean unit variance signal, before applying the algorithm.

(a) MSE comparison of NL-TISO with RF-NLTISO.

(b) Causality graph in oil and gas plant estimated by RF-NLTISO. P, T, L represent pressure, temperature, and oil level sensors, respectively.

(c) Comparison of computation time of kernel-based algorithms
Figure 2:

The causal dependencies are learned using RF-NLTISO with and a Gaussian kernel having a variance of and with hyper parameter values and . The signal is reconstructed using the estimated dependencies. Figure 1(a) shows the mean squared error (), defined as for a particular sensor , of RF-NLTISO estimates in comparison with other algorithms. We observe that the RF-NLTISO estimates with random feature number show better performance compare to NL-TISO. The causality graph estimated by RF-NLTISO is shown in Fig. 1(b).

One of the main attractiveness of RF-NLTISO is that even though it is a kernel-based algorithm, it has a fixed computational complexity throughout the online iterations. To demonstrate this, in Fig. 1(c), we plot the computation time required to estimate the coefficients at each time instant by NL-TISO and RF-NLTISO with different values of . The experiment is conducted in a machine with processor 2.4 GHz 8-core Intel Core i9 and 16GB 2667 MHz DDR4 RAM. Figure 1(c) shows that the computation time of NL-TISO increases considerably with time but that of RF-NLTISO remains more or less constant for a particular value of .

6 Conclusion

We propose a kernel-based online topology identification method for interconnected networks of time-series with additive nonlinear dependencies. In this work, the curse of dimensionality associated with kernel representation is tackled using random feature approximation. Assuming that the real-world dependencies are sparse, we use composite objective mirror decent update to estimate the online sparse causality graph. The effectiveness of the proposed algorithm is illustrated through experiments conducted on synthetic and real data, which shows that the algorithm outperforms the state-of-the-art competitors. We devote the convergence and stability analysis of the proposed algorithm to our future work.


  • [1] S. Bochner (1959) Lectures on fourier integrals. Princeton University press 42, pp. . Cited by: §3.
  • [2] J. Duchi, S. Shwartz, and A. Tewari (2010) Composite objective mirror descent. In COLT’10, pp. 14–26. Cited by: §4.
  • [3] M. Gutmann and J. Hirayama (2011) Bregman divergence as general framework to estimate unnormalized statistical models. In Proceedings of UAI, UAI’11, Arlington, Virginia, USA, pp. 283–290. External Links: ISBN 9780974903972 Cited by: §4.
  • [4] A. Koppel, G. Warnell, E. Stump, and A. Ribeiro (2017-03) Parsimonious online learning with kernels via sparse projections in function space. In 2017 IEEE ICASSP, Vol. , pp. 4671–4675. External Links: ISSN 2379-190X Cited by: §1.
  • [5] L. Lopez-Ramos, D. Romero, B. Zaman, and B. Beferull-Lozano (2018-11) DYNAMIC network identification from non-stationary vector autoregressive time series. In 2018 IEEE GlobalSIP, Vol. , pp. 773–777. Cited by: §1.
  • [6] J. Lu, S. Hoi, J. Wang, P. Zhao, and Z. Liu (2016) Large scale online kernel learning. Journal of Machine Learning Research 17 (47), pp. 1–43. External Links: Link Cited by: §1, §2.
  • [7] H. Lütkepohl (2005) New introduction to multiple time series analysis. Springer, Berlin [u.a.]. External Links: ISBN 3540262393 Cited by: §1.
  • [8] S. Mollaebrahim and B. Beferull-Lozano (2021) Design of asymmetric shift operators for efficient decentralized subspace projection. IEEE Transactions on Signal Processing 69 (), pp. 2056–2069. External Links: Document Cited by: §1.
  • [9] R. Money, J. Krishnan, and B. Beferull-Lozano (2021) Online non-linear topology identification from graph-connected time series. arXiv:2104.00030. Cited by: §1, §2, §5.
  • [10] B. Olkopf, R. Herbrich, A. Smola, and R. Williamson (2000-06) A generalized representer theorem. Computational Learning Theory 42, pp. . Cited by: §2.
  • [11] A. Puig, A. Wiesel, and A. Hero (2009-10) A multidimensional shrinkage-thresholding operator. Vol. 18, pp. 113 – 116. External Links: Document Cited by: §4.
  • [12] A. Rahimi and B. Recht (2007) Random features for large-scale kernel machines. In NeurIPS, NIPS’07, Red Hook, NY, USA, pp. 1177–1184. External Links: ISBN 9781605603520 Cited by: §1.
  • [13] Y. Shen, T. Chen, and G. Giannakis (2019-01) Random feature-based online multi-kernel learning in environments with unknown dynamics. J. Mach. Learn. Res. 20 (1), pp. 773–808. External Links: ISSN 1532-4435 Cited by: §2.
  • [14] Y. Shen and G. B. Giannakis (2018) ONLINE identification of directional graph topologies capturing dynamic and nonlinear dependencies†. In

    2018 IEEE Data Science Workshop (DSW)

    Vol. , pp. 195–199. External Links: Document Cited by: §1, §2, §5.
  • [15] Y. Shen, G. Giannakis, and B. Baingana (2019) Nonlinear structural vector autoregressive models with application to directed brain networks. IEEE Transactions on Signal Processing 67, pp. 5325–5339. Cited by: §1.
  • [16] G. Wahba (1990) Spline models for observational data. SIAM Press, Society for Industrial and Applied Mathematics. Cited by: §2.
  • [17] B. Zaman, L. M. Lopez-Ramos, D. Romero, and B. Beferull-Lozano (2017) Online topology estimation for vector autoregressive processes in data networks. In 2017 IEEE CAMSAP, Vol. , pp. 1–5. External Links: Document Cited by: §1.
  • [18] B. Zaman, L. M. L. Ramos, D. Romero, and B. Beferull-Lozano (2021) Online topology identification from vector autoregressive time series. IEEE Transactions on Signal Processing 69 (), pp. 210–225. External Links: Document Cited by: §1, §5.