1 Introduction
The amount of data generated by interconnected networks, such as sensor networks, financial time series, and brain networks, is increasing rapidly. Extracting meaningful information from such interconnected data, represented in the form of a graph, has many practical applications, such as signal denoising [8], change-point detection [5], and time-series prediction [17]. Many of the functional relationships in such networks are causal, and the identification of this causal graph structure is termed topology identification. Many real-world causal systems can be well described using the vector autoregressive (VAR) model, since most of the dependencies are naturally time-lagged. Moreover, under causal sufficiency, VAR causality implies the well-known Granger causality [7]. Topology identification based on the linear VAR model has been well studied. In [18], an efficient way to estimate linear VAR coefficients from streaming data is proposed. However, such linear VAR models fail to capture real-world nonlinear dependencies. A novel nonlinear VAR topology identification method is proposed in [15], in which kernels are used to linearize the nonlinear dependencies by mapping them to a higher-dimensional Hilbert space. However, being a batch-based approach, [15] is computationally expensive and is not suitable for identifying time-varying topologies.
The above shortcomings are tackled by kernel-based online algorithms [14], [9]. In [9], sparse VAR coefficients are recursively estimated using a composite objective mirror descent (COMID) approach, whereas [14] uses functional stochastic gradient descent (FSGD) followed by soft-thresholding. However, kernel-based representations have a major drawback: their computational complexity and memory requirements grow unaffordably with the number of samples, which is commonly known as the "curse of dimensionality". Both [14] and [9] propose to circumvent this issue by restricting the numerical calculation to a limited number of time-series samples using a time window, which results in suboptimal performance. A standard procedure to address the curse of dimensionality is to invoke kernel dictionaries [4]. Often, the dictionary elements are selected based on a budget-maintaining strategy. In large-scale machine learning problems, the dictionary size can grow prohibitively large in order to maintain the budget. Recently, random feature (RF) approximation techniques [12] have gained popularity for approximating kernels and have been shown to yield promising results compared to budget-maintaining strategies [12, 6].
In this work, we use RF approximation to avoid the curse of dimensionality in learning nonlinear VAR models. We approximate shift-invariant Gaussian kernels using a fixed number of random Fourier features. The major contributions of this paper are: i) the formulation of a kernel-based optimization framework in the function space, ii) the reformulation of i) into a parametric optimization using RF approximation, and iii) an online algorithm to estimate the sparse nonlinear VAR coefficients using COMID updates. We provide numerical results showing that the proposed method outperforms state-of-the-art topology identification algorithms.
2 Kernel Representation
Consider a multivariate time series with $N$ nodes. Let $y_n[t]$ be the value of the time series observed at node $n$ at time $t$. A $P$-th order nonlinear VAR model assuming additive functional dependencies can be formulated as

$$y_n[t] = \sum_{n'=1}^{N} \sum_{p=1}^{P} f_{n,n'}^{(p)}\big(y_{n'}[t-p]\big) + u_n[t], \qquad (1)$$

where $f_{n,n'}^{(p)}$ encodes the causal influence of node $n'$ on node $n$ at lag $p$, and $u_n[t]$ is an innovation term.
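As an illustration, synthetic data can be generated from an additive nonlinear VAR model of this form. The sketch below is a minimal Python example; the network size, sparsity level, and the tanh nonlinearity are illustrative stand-ins, not the paper's actual experimental setup (which uses kernel-based nonlinearities).

```python
import numpy as np

rng = np.random.default_rng(0)

N, P, T = 5, 2, 200          # nodes, VAR order, number of samples (illustrative)
# Sparse random "adjacency" coefficients A[p][n, n'] (hypothetical setup)
A = rng.uniform(0.2, 0.8, size=(P, N, N)) * (rng.random((P, N, N)) < 0.2)

def f(x):
    # One example of an additive nonlinear dependency; a bounded
    # nonlinearity keeps the generated series stable.
    return np.tanh(x)

y = rng.standard_normal((N, P))              # random initial samples
for t in range(P, T):
    y_t = sum(A[p] @ f(y[:, t - 1 - p]) for p in range(P))
    y_t = y_t + 0.01 * rng.standard_normal(N)  # innovation noise u_n[t]
    y = np.column_stack([y, y_t])

print(y.shape)  # (5, 200)
```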
We assume that each function $f_{n,n'}^{(p)}$ in (1) belongs to a reproducing kernel Hilbert space (RKHS):

$$\mathcal{H}_{n'} := \Big\{ f_{n,n'}^{(p)} \;\Big|\; f_{n,n'}^{(p)}(y) = \sum_{t=0}^{\infty} \alpha_{n,n'}^{(p)}[t]\, \kappa_{n'}\big(y, y_{n'}[t]\big) \Big\}. \qquad (2)$$
For a particular node $n$, the estimates of $\{f_{n,n'}^{(p)}\}$ are obtained by solving the functional optimization problem:

$$\{\hat f_{n,n'}^{(p)}\} = \arg\min_{\{f_{n,n'}^{(p)} \in \mathcal{H}_{n'}\}} \frac{1}{2} \sum_{\tau=P}^{T} \Big[ y_n[\tau] - \sum_{n'=1}^{N} \sum_{p=1}^{P} f_{n,n'}^{(p)}\big(y_{n'}[\tau-p]\big) \Big]^2. \qquad (3)$$
It is to be noted that in (3), the functions belong to the RKHS defined in (2), which is an infinite-dimensional space. However, by resorting to the Representer Theorem [10], the solution of (3) can be written using a finite number of data samples:

$$\hat f_{n,n'}^{(p)}(y) = \sum_{\tau=P}^{T} \alpha_{n,n'}^{(p)}[\tau]\, \kappa_{n'}\big(y, y_{n'}[\tau-p]\big). \qquad (4)$$
Notice that the number of coefficients required to express the function increases with the number of data samples. In the recent works [14], [9], this problem is addressed by using a time window to fix the number of data points, resulting in suboptimality. In this work, in line with [6] and [13], we instead use RF approximation to tackle the dimensionality growth.
3 Random Feature approximation
To invoke the RF approximation, we assume the kernel to be shift-invariant, i.e., $\kappa(y, y') = \kappa(y - y')$. Bochner's theorem [1] states that every shift-invariant kernel can be represented as the inverse Fourier transform of a probability distribution. Hence the kernel evaluation can be expressed as

$$\kappa(y - y') = \int p_{\kappa}(v)\, e^{jv(y - y')}\, dv = \mathbb{E}_{v}\big[e^{jv(y - y')}\big], \qquad (5)$$

where $p_{\kappa}(v)$ is the probability density function, which depends on the type of the kernel, and $v$ is the random variable associated with it. If a sufficient number of i.i.d. samples $\{v_i\}_{i=1}^{D}$ are collected from the distribution $p_{\kappa}(v)$, the ensemble mean in (5) can be expressed as a sample mean:

$$\hat\kappa(y - y') = \frac{1}{D} \sum_{i=1}^{D} e^{jv_i(y - y')}, \qquad (6)$$

irrespective of the distribution $p_{\kappa}(v)$. Note that the unbiased estimate of the kernel evaluation in (6) involves a summation over a fixed number of terms. In general, computing the probability distribution corresponding to a kernel is a difficult task. In this work the kernel under consideration is Gaussian; for a Gaussian kernel with variance $\sigma^2$, it is well known that the Fourier transform is a Gaussian with variance $\sigma^{-2}$. Considering the real part of (6), which is also an unbiased estimator, (6) can be approximated as

$$\hat\kappa(y - y') = \mathbf{z}_{\mathbf{v}}^\top(y)\, \mathbf{z}_{\mathbf{v}}(y'), \qquad (7)$$

where

$$\mathbf{z}_{\mathbf{v}}(y) := \frac{1}{\sqrt{D}} \big[\sin(v_1 y), \ldots, \sin(v_D y), \cos(v_1 y), \ldots, \cos(v_D y)\big]^\top. \qquad (8)$$
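To make the random feature map concrete, the following Python sketch (with illustrative values of $\sigma$ and $D$) draws random frequencies for a one-dimensional Gaussian kernel and compares the inner product of the feature maps with the exact kernel evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, D = 1.0, 2000                  # kernel width and number of random features

# v_i ~ N(0, 1/sigma^2): the spectral density of the Gaussian kernel
v = rng.normal(0.0, 1.0 / sigma, D)

def z(y):
    # Random feature map of (8): a fixed 2D-dimensional representation
    return np.concatenate([np.sin(v * y), np.cos(v * y)]) / np.sqrt(D)

x, xp = 0.3, -0.5
approx = z(x) @ z(xp)                              # z_v(x)^T z_v(x')
exact = np.exp(-(x - xp) ** 2 / (2 * sigma ** 2))  # Gaussian kernel value
print(abs(approx - exact))  # small, shrinking as D grows
```

Note that $\mathbf{z}_{\mathbf{v}}^\top(x)\mathbf{z}_{\mathbf{v}}(x') = \frac{1}{D}\sum_i \cos\big(v_i(x - x')\big)$, i.e., exactly the real part of the sample mean in (6).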
Substituting (7) into (4), we obtain a fixed-dimensional ($2D$ terms) approximation of the function $f_{n,n'}^{(p)}$:

$$\hat f_{n,n'}^{(p)}(y) = {\beta_{n,n'}^{(p)}}^\top \mathbf{z}_{\mathbf{v}}(y), \qquad (9)$$

where $\beta_{n,n'}^{(p)} := \sum_{\tau} \alpha_{n,n'}^{(p)}[\tau]\, \mathbf{z}_{\mathbf{v}}\big(y_{n'}[\tau-p]\big) \in \mathbb{R}^{2D}$. For the sake of clarity, in the succeeding steps we define the following notation:

$$\mathbf{z}_{n',p}[t] := \mathbf{z}_{\mathbf{v}}\big(y_{n'}[t-p]\big), \qquad (10)$$

$$\hat y_n[t] := \sum_{n'=1}^{N} \sum_{p=1}^{P} {\beta_{n,n'}^{(p)}}^\top \mathbf{z}_{n',p}[t], \qquad (11)$$

where $\hat y_n[t]$ denotes the model prediction at node $n$. The functional optimization (3) is reformulated as a parametric optimization problem using (9):

$$\{\hat\beta_{n,n'}^{(p)}\} = \arg\min_{\{\beta_{n,n'}^{(p)}\}} \sum_{\tau=P}^{T} \ell_\tau\big(\{\beta_{n,n'}^{(p)}\}\big), \qquad (12)$$

where

$$\ell_\tau\big(\{\beta_{n,n'}^{(p)}\}\big) := \frac{1}{2}\big[y_n[\tau] - \hat y_n[\tau]\big]^2. \qquad (13)$$

For convenience, the optimization parameters $\beta_{n,n'}^{(p)}$ and the feature vectors $\mathbf{z}_{n',p}[t]$ are stacked in the lexicographic order of the indices $n'$ and $p$ to obtain the vectors $\boldsymbol{\beta}_n \in \mathbb{R}^{2DNP}$ and $\mathbf{z}[t] \in \mathbb{R}^{2DNP}$, respectively, and (12) is rewritten as

$$\hat{\boldsymbol{\beta}}_n = \arg\min_{\boldsymbol{\beta}_n} \sum_{\tau=P}^{T} \ell_\tau(\boldsymbol{\beta}_n), \qquad (14)$$

$$\ell_\tau(\boldsymbol{\beta}_n) := \frac{1}{2}\big[y_n[\tau] - \boldsymbol{\beta}_n^\top \mathbf{z}[\tau]\big]^2. \qquad (15)$$
Now, in order to avoid overfitting, we propose a regularized optimization framework:

$$\hat{\boldsymbol{\beta}}_n = \arg\min_{\boldsymbol{\beta}_n} \; \sum_{\tau=P}^{T} \ell_\tau(\boldsymbol{\beta}_n) + \lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \big\|\beta_{n,n'}^{(p)}\big\|_2, \qquad (16)$$

where $\lambda > 0$ is the regularization parameter. The second term in (16) is a group-lasso regularizer, which promotes a group-sparse structure in $\boldsymbol{\beta}_n$, supported by the assumption that most real-world dependencies are sparse in nature.
However, notice that the batch formulation in (16) has some significant limitations: i) it requires a complete batch of data points before estimation, ii) it cannot track time-varying topologies, and iii) its computational complexity explodes when the number of samples is large, even if the RF approximation is used. To mitigate these problems, we adopt an online optimization strategy, which is explained in the following section.
4 Online topology estimation
In this case, we replace the batch loss function in (16) with the stochastic (instantaneous) loss function $\ell_t$:

$$\hat{\boldsymbol{\beta}}_n[t] = \arg\min_{\boldsymbol{\beta}_n} \; \ell_t(\boldsymbol{\beta}_n) + \lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \big\|\beta_{n,n'}^{(p)}\big\|_2. \qquad (17)$$
Notice that the sparsity-promoting group-lasso regularizer is non-differentiable. The use of online subgradient descent (OSGD) is not advisable in this situation, as it linearizes the entire objective function and fails to provide sparse iterates. To avoid this limitation of OSGD, we use the composite objective mirror descent (COMID) [2] algorithm, which resembles proximal methods and hence improves convergence. The online COMID update can be written as
$$\boldsymbol{\beta}_n[t+1] = \arg\min_{\boldsymbol{\beta}} \; \eta_t\, \nabla \ell_t\big(\boldsymbol{\beta}_n[t]\big)^\top \boldsymbol{\beta} + B\big(\boldsymbol{\beta}, \boldsymbol{\beta}_n[t]\big) + \eta_t\, r(\boldsymbol{\beta}) \qquad (18)$$

$$= \arg\min_{\boldsymbol{\beta}} \; \eta_t\, \nabla \ell_t\big(\boldsymbol{\beta}_n[t]\big)^\top \boldsymbol{\beta} + B\big(\boldsymbol{\beta}, \boldsymbol{\beta}_n[t]\big) + \eta_t \lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \big\|\beta_{n,n'}^{(p)}\big\|_2. \qquad (19)$$

In (19), $\boldsymbol{\beta}_n[t]$ denotes the estimate of $\boldsymbol{\beta}_n$ at time $t$. The first term in (19) involves the gradient of the loss function $\ell_t$; the second and third terms are the Bregman divergence and the sparsity-promoting regularizer, respectively. The Bregman divergence is included to improve the stability of the algorithm against adversaries by constraining $\boldsymbol{\beta}_n[t+1]$ to be close to $\boldsymbol{\beta}_n[t]$. The Bregman divergence is chosen in such a way that the COMID update has a closed-form solution [3]; here we take $B(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2) = \frac{1}{2}\|\boldsymbol{\beta}_1 - \boldsymbol{\beta}_2\|_2^2$, and $\eta_t$ is the corresponding step size. The gradient in (19) is evaluated as

$$\nabla \ell_t\big(\boldsymbol{\beta}_n[t]\big) = -\big(y_n[t] - \boldsymbol{\beta}_n[t]^\top \mathbf{z}[t]\big)\, \mathbf{z}[t]. \qquad (20)$$
Expanding the objective function in (19) and omitting the constant terms leads to the following formulation:

$$\boldsymbol{\beta}_n[t+1] = \arg\min_{\boldsymbol{\beta}} \; \frac{1}{2}\big\|\boldsymbol{\beta} - \mathbf{u}[t]\big\|_2^2 + \eta_t \lambda \sum_{n'=1}^{N} \sum_{p=1}^{P} \big\|\beta_{n,n'}^{(p)}\big\|_2, \qquad (21)$$

where $\mathbf{u}[t] := \boldsymbol{\beta}_n[t] - \eta_t \nabla \ell_t\big(\boldsymbol{\beta}_n[t]\big)$. A closed-form solution for (18) using (21) is obtained via the multidimensional shrinkage-thresholding operator [11]:

$$\beta_{n,n'}^{(p)}[t+1] = \mathbf{u}_{n,n'}^{(p)}[t]\, \max\!\Big(0,\; 1 - \frac{\eta_t \lambda}{\big\|\mathbf{u}_{n,n'}^{(p)}[t]\big\|_2}\Big), \qquad (22)$$

where $\mathbf{u}_{n,n'}^{(p)}[t]$ is the sub-vector of $\mathbf{u}[t]$ corresponding to the index pair $(n', p)$. The first factor in (22) follows the stochastic gradient update of $\boldsymbol{\beta}_n$, descending the instantaneous loss function, and the shrinkage factor enforces group sparsity of $\boldsymbol{\beta}_n$. Note that the closed-form solution (22) is separable in $n'$ and $p$.
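The group-wise shrinkage step can be sketched in a few lines of Python; the grouping into sub-vectors and the toy numbers below are illustrative.

```python
import numpy as np

def group_shrink(u, lam_eta, groups):
    """Multidimensional shrinkage-thresholding: scale each group u_g by
    max(0, 1 - lam_eta / ||u_g||_2), which zeroes groups with small norm."""
    out = np.zeros_like(u)
    for g in groups:
        norm = np.linalg.norm(u[g])
        if norm > lam_eta:
            out[g] = (1.0 - lam_eta / norm) * u[g]
    return out

# Toy example: two groups of 2-dimensional coefficients (indices illustrative)
u = np.array([3.0, 4.0, 0.1, 0.1])      # gradient-updated vector u[t]
groups = [slice(0, 2), slice(2, 4)]
beta = group_shrink(u, 1.0, groups)     # lam_eta plays the role of eta_t * lambda
print(beta)  # first group shrunk toward 0, second group set exactly to 0
```

Because the operator is separable across groups, the loop can also be vectorized per $(n', p)$ block without changing the result.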
The proposed algorithm, termed Random Feature based NonLinear Topology Identification via Sparse Online learning (RF-NLTISO), is summarized in Algorithm 1.
5 Experiments
We compare the performance of the proposed algorithm, RF-NLTISO, with state-of-the-art online topology estimation algorithms. The experiments in this section are conducted using 1) synthetic datasets with topologies having different transition patterns and 2) real datasets collected from Lundin's offshore oil and gas platform. For the performance comparison, we choose the TIRSO [18] and NL-TISO [9] algorithms, which are, to the best of our knowledge, the state-of-the-art counterparts of RF-NLTISO. TIRSO is developed based on a linear VAR model assumption, whereas NL-TISO, a kernel-based topology estimation algorithm, is developed for nonlinear VAR models. Although a kernel-based functional stochastic gradient algorithm [14] is also available, its performance has been shown to be inferior to that of NL-TISO [9].
5.1 Experiments using Synthetic Data
5.1.1 Topology with switching edges
We generate a multivariate time series using the nonlinear VAR model (1). An initial random graph with a fixed edge probability is generated, and the graph adjacency coefficients are drawn from a uniform distribution. At regular intervals, one of the active (nonzero) edges disappears and another one appears randomly, which brings an abrupt change in the graph topology. The nonlinearity in (1) is introduced using a Gaussian kernel, and the kernel coefficients are chosen randomly from a zero-mean Gaussian distribution. Note that the initial data samples are generated randomly and the rest of the data is generated using the model (1). The coefficients are estimated using the proposed RF-NLTISO algorithm with a Gaussian kernel and a fixed number of random features $D$; the hyperparameters $\lambda$ and $\eta_t$ are chosen heuristically. We compute the norms $\|\beta_{n,n'}^{(p)}\|_2$ and arrange them in matrix form, similar to the graph adjacency matrix, to visualize the causal dependencies. A similar strategy is adopted for the NL-TISO and TIRSO algorithms. The normalized versions of the true and estimated dependencies at various time samples are shown in Fig. 1(a), where in each subplot the dependency matrices corresponding to the $P$ lags are concatenated. We normalize the coefficients by dividing each coefficient by the highest value in the pseudo-adjacency matrix. From Fig. 1(a), it is clear that RF-NLTISO performs on par with or better than NL-TISO and clearly outperforms TIRSO.
Next, we conduct the same experiments using RF-NLTISO with different numbers of random features $D$. These experiments are repeated 1000 times to find the probability of miss detection ($P_{MD}$) and the probability of false alarm ($P_{FA}$), which we define as

$$P_{MD}[t] = \Pr\big(\hat{A}_{n,n'}[t] = 0 \mid A_{n,n'}[t] \neq 0\big), \qquad P_{FA}[t] = \Pr\big(\hat{A}_{n,n'}[t] \neq 0 \mid A_{n,n'}[t] = 0\big), \qquad (23)$$

where $A$ and $\hat{A}$ denote the true and estimated (thresholded) pseudo-adjacency matrices. From Figs. 1(b) and 1(c), it is observed that, for the given choices of $D$, RF-NLTISO compares favorably with NL-TISO in one of the two error probabilities, whereas NL-TISO performs better in terms of the other. Both figures show an overshoot at the topology-switching time instants. It is also observed that, for the proposed algorithm, one of the error probabilities decreases with $D$ while the other increases, which suggests tuning $D$ for an effective tradeoff between $P_{MD}$ and $P_{FA}$.
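One plausible way to compute such miss-detection and false-alarm rates from a thresholded pseudo-adjacency matrix is sketched below; the threshold value and the averaging over the edges of a single run (rather than over Monte Carlo repetitions) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def md_fa(true_adj, est_adj, thr=1e-3):
    # Miss detection: true edges whose estimated weight falls below the threshold.
    # False alarm: absent edges whose estimated weight exceeds the threshold.
    true_edges = true_adj != 0
    est_edges = np.abs(est_adj) > thr
    p_md = float(np.mean(~est_edges[true_edges]))
    p_fa = float(np.mean(est_edges[~true_edges]))
    return p_md, p_fa

# Toy example with two true edges, one of which is missed
true_adj = np.array([[0.0, 1.0], [1.0, 0.0]])
est_adj = np.array([[0.0, 0.8], [0.0, 0.0]])
print(md_fa(true_adj, est_adj))  # (0.5, 0.0)
```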
5.1.2 Slowly varying topology
We compare the performance of RF-NLTISO with the state-of-the-art algorithms using a slowly varying graph topology. The same experimental setup as discussed in Section 5.1.1 is adopted, with the topology now varying slowly over time:
(24) 
The normalized value of one of the active edges is plotted in Fig. 1(d). The figure also shows the normalized values of the corresponding estimated coefficients ($\|\beta_{n,n'}^{(p)}\|_2$ for NL-TISO and RF-NLTISO, and the coefficient magnitude for TIRSO). From the figure, it can be observed that the RF-NLTISO estimates are closer to the true values than those of the other two algorithms. In this example, the quality of the TIRSO estimates lags considerably behind the kernel-based algorithms because the underlying VAR model is nonlinear.
5.2 Experiments using Real Data
This section is dedicated to experiments using real data collected from Lundin's offshore oil and gas (O&G) platform Edvard Grieg (https://www.lundinenergy.com/). We have a multivariate time series in which the nodes correspond to various temperature (T), pressure (P), and oil-level (L) sensors. The sensors are placed in the separators of decantation tanks that separate oil, gas, and water. The time series are obtained by uniformly sampling the sensor readings. We assume that hidden logical dependencies are present in the network due to various physical connections and control actuators. The data obtained from the network is normalized to a zero-mean, unit-variance signal before applying the algorithm.

The causal dependencies are learned using RF-NLTISO with a Gaussian kernel. The signal is reconstructed using the estimated dependencies. Figure 2(a) shows the mean squared error (MSE) of the RF-NLTISO estimates for a particular sensor, in comparison with the other algorithms. We observe that the RF-NLTISO estimates show better performance compared to NL-TISO. The causality graph estimated by RF-NLTISO is shown in Fig. 2(b).

One of the main attractions of RF-NLTISO is that, even though it is a kernel-based algorithm, it has a fixed computational complexity throughout the online iterations. To demonstrate this, in Fig. 2(c) we plot the computation time required to estimate the coefficients at each time instant by NL-TISO and by RF-NLTISO with different values of $D$. The experiment is conducted on a machine with a 2.4 GHz 8-core Intel Core i9 processor and 16 GB of 2667 MHz DDR4 RAM. Figure 2(c) shows that the computation time of NL-TISO increases considerably with time, whereas that of RF-NLTISO remains more or less constant for a given value of $D$.
6 Conclusion
We propose a kernel-based online topology identification method for interconnected networks of time series with additive nonlinear dependencies. The curse of dimensionality associated with the kernel representation is tackled using random feature approximation. Assuming that real-world dependencies are sparse, we use composite objective mirror descent updates to estimate the sparse causality graph online. The effectiveness of the proposed algorithm is illustrated through experiments on synthetic and real data, which show that the algorithm outperforms its state-of-the-art competitors. We defer the convergence and stability analysis of the proposed algorithm to future work.
References
 [1] (1959) Lectures on Fourier Integrals. Princeton University Press.
 [2] (2010) Composite objective mirror descent. In COLT'10, pp. 14–26.
 [3] (2011) Bregman divergence as general framework to estimate unnormalized statistical models. In Proceedings of UAI, UAI'11, Arlington, Virginia, USA, pp. 283–290.
 [4] (2017) Parsimonious online learning with kernels via sparse projections in function space. In 2017 IEEE ICASSP, pp. 4671–4675.
 [5] (2018) Dynamic network identification from non-stationary vector autoregressive time series. In 2018 IEEE GlobalSIP, pp. 773–777.
 [6] (2016) Large scale online kernel learning. Journal of Machine Learning Research 17 (47), pp. 1–43.
 [7] (2005) New Introduction to Multiple Time Series Analysis. Springer, Berlin.
 [8] (2021) Design of asymmetric shift operators for efficient decentralized subspace projection. IEEE Transactions on Signal Processing 69, pp. 2056–2069.
 [9] (2021) Online non-linear topology identification from graph-connected time series. arXiv:2104.00030.
 [10] (2000) A generalized representer theorem. In Computational Learning Theory.
 [11] (2009) A multidimensional shrinkage-thresholding operator. Vol. 18, pp. 113–116.
 [12] (2007) Random features for large-scale kernel machines. In NeurIPS (NIPS'07), Red Hook, NY, USA, pp. 1177–1184.
 [13] (2019) Random feature-based online multi-kernel learning in environments with unknown dynamics. J. Mach. Learn. Res. 20 (1), pp. 773–808.
 [14] (2018) Online identification of directional graph topologies capturing dynamic and nonlinear dependencies. In 2018 IEEE Data Science Workshop (DSW), pp. 195–199.
 [15] (2019) Nonlinear structural vector autoregressive models with application to directed brain networks. IEEE Transactions on Signal Processing 67, pp. 5325–5339.
 [16] (1990) Spline Models for Observational Data. SIAM, Society for Industrial and Applied Mathematics.
 [17] (2017) Online topology estimation for vector autoregressive processes in data networks. In 2017 IEEE CAMSAP, pp. 1–5.
 [18] (2021) Online topology identification from vector autoregressive time series. IEEE Transactions on Signal Processing 69, pp. 210–225.