Single Index Latent Variable Models for Network Topology Inference

06/28/2018
by Jonathan Mei, et al.

A semi-parametric, non-linear regression model in the presence of latent variables is applied towards learning network graph structure. These latent variables can correspond to unmodeled phenomena or unmeasured agents in a complex system of interacting entities. This formulation jointly estimates non-linearities in the underlying data generation, the direct interactions between measured entities, and the indirect effects of unmeasured processes on the observed data. The learning is posed as regularized empirical risk minimization. Details of the algorithm for learning the model are outlined. Experiments demonstrate the performance of the learned model on real data.

1 Introduction

Graphs are a central tool for representing interpretable networks of relationships between interacting entities that generate large amounts of data. Learning network structure while combatting the effects of noise can be achieved via sparse optimization methods, such as regression (Lasso) [1] and inverse covariance estimation [2]. In addition, the extension to time series via vector autoregression [3, 4] yields interpretations related to causality [5, 6]. In each of these settings, estimated nonzero values correspond to actual relations, while zeros correspond to absence of relations.

However, we are often unable to collect data to observe all relevant variables, and this leads to observing relationships that may be caused by common links with those unobserved variables. Hidden variables can be fairly general: they can be underlying trends in the data, or the effects of a larger network on an observed subnetwork. For example, one year of daily temperature measurements across a country could be related through a graph based on geographical and meteorological features, but all exhibit the same significant trend due to the changing seasons. We have no single sensor that directly measures this trend. In the literature, a standard pipeline is to de-trend the data as a preprocessing step, and then estimate or use a graph to describe the variations of the data on top of the underlying trends [7, 8, 6].

Alternatively, attempts have been made to capture the effects of hidden variables via sparse plus low-rank optimization [9]. This has been extended to time series [10], and even to a non-linear setting via Generalized Linear Models (GLMs) [11]. What if the form of the non-linearity (link function) is not known? Regression using a GLM with an unknown link function is also known as a Single Index Model (SIM). Recent results have shown good performance when using SIMs for sparse regression [12].

Current methods impose a fixed (non-)linearity, assume the absence of any underlying latent variables, perform separate pre-processing or partitioning in an attempt to remove or otherwise explicitly handle such latent variables, or take some combination of these steps. To address all of these issues, we present a model with a non-linear function applied to a linear argument that captures the effects of latent variables. Thus, we apply the Single Index Latent Variable (SILVar) model [13], which uses the SIM in a sparse plus low-rank optimization setting to enable general, interpretable multi-task regression in the presence of unknown non-linearities and unobserved variables. That is, we examine SILVar as a tool for uncovering hidden relationships buried in data.

First, we introduce the SILVar model in Section 2. Then, we outline the numerical procedure for learning the SILVar model in Section 3. Finally, we demonstrate the performance via experiments on real data in Section 4.

2 Single Index Latent Variable Models

In this section, we build the Single Index Latent Variable (SILVar) model from fundamental concepts. We extend the single index model (SIM) [14] to the multivariate case and account for effects from unmeasured latent variables in the linear parameter. Let , , for , and . The multivariate SIM model is parameterized by 1) a non-linear link function where is a closed, convex, differentiable, invertible function; and 2) a matrix . Consider the vectorization,

(1)

For the remainder of this paper, we make an assumption that all for notational simplicity, though the formulations readily extend to the case where are distinct.

We propose the SILVar model,

(2)

where we have explicitly split the linear parameter from before into and such that is a sparse matrix, including but not limited to a graph adjacency, and is a low-rank matrix (), capturing the indirect effects of a small number of unmeasured latent variables on the observed data, as introduced in [9]. However, in the presence of the non-linearity, it is not obvious that this low-rank representation should faithfully correspond to the latent variables as intended. Luckily, it does [13]. Letting

(3)

we learn the model using the optimization problem,

(4)

where and are regularizers on and respectively, and with the set of monotonic increasing -Lipschitz functions. We impose this functional constraint for uniqueness and conditioning of the solution. A natural choice for would be the nuclear norm since is approximately low rank due to the influence of a relatively small number of latent variables. We may choose different forms for depending on our assumptions about the structure of . For example, if is sparse, we may use , the norm applied element-wise to the vectorized matrix. This is a “sparse and low-rank” model, which has been shown under certain geometric incoherence conditions to be identifiable [9].
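For reference, both regularizers named above have well-known closed-form proximal operators, which is what makes them convenient inside proximal methods. The notation below is illustrative and not taken from the paper: M is a generic matrix argument, \tau > 0 is a threshold, and M = U \operatorname{diag}(\sigma) V^\top is a singular value decomposition.

```latex
% Element-wise l1 norm: soft-thresholding of every entry
\left[\operatorname{prox}_{\tau \|\cdot\|_1}(M)\right]_{ij}
  = \operatorname{sign}(M_{ij}) \, \max\left(|M_{ij}| - \tau,\, 0\right)

% Nuclear norm: soft-thresholding of the singular values
\operatorname{prox}_{\tau \|\cdot\|_*}(M)
  = U \operatorname{diag}\!\left(\max(\sigma_k - \tau,\, 0)\right) V^\top
```

The first operator drives small entries exactly to zero (sparsity), and the second drives small singular values exactly to zero (low rank).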

3 Efficiently Learning SILVar Models

In this section, we describe the algorithm for learning the SILVar model. Surprisingly, the pseudo-likelihood functional used for learning the SILVar model in (3) is jointly convex in , , and  [13]. This convexity is enough to ensure that the learning can converge and be computationally efficient.

3.1 Lipschitz Monotonic Regression

The estimation of with the objective function including terms and appears to be an intractable calculus of variations problem. However, there is a marginalization technique that avoids estimating functional gradients with respect to and  [15]. The technique utilizes Lipschitz monotonic regression (LMR) as a subproblem, for which fast algorithms exist [13].

Given ordered pairs

and additive noise , let denote the element of the sorted in ascending order. Then LMR is described by the problem,

(5)

which treats as noisy observations of a function indexed by , sampled at points .
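As a rough illustration of the LMR subproblem, the sketch below fits a nondecreasing, L-Lipschitz sequence to noisy samples by least squares. It assumes the usual form of the constraint on the sorted points, 0 <= z[j+1] - z[j] <= L (x[j+1] - x[j]), and solves it with plain projected gradient descent on the increments. This is a slow reference solver chosen for clarity, not the fast algorithm of [13]; the function name and parameters are hypothetical.

```python
import numpy as np

def lipschitz_monotone_regression(x, y, lipschitz=1.0, n_iter=5000):
    """Least-squares fit of a nondecreasing, L-Lipschitz sequence to (x, y).

    Assumed constraint (on points sorted by x):
        0 <= z[j+1] - z[j] <= lipschitz * (x[j+1] - x[j]).
    Returns the fitted values in the original order of x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(xs)

    # Reparametrize z = z0 + cumsum(d): z0 is free, each increment d[j]
    # is boxed in [0, upper[j]], so projection is a simple clip.
    upper = lipschitz * np.diff(xs)
    z0 = ys.mean()
    d = np.zeros(n - 1)

    # Safe step size: the gradient's Lipschitz constant is bounded by the
    # squared Frobenius norm of the cumulative-sum matrix, n (n + 1) / 2.
    step = 2.0 / (n * (n + 1))

    for _ in range(n_iter):
        z = z0 + np.concatenate(([0.0], np.cumsum(d)))
        r = z - ys
        tail = np.cumsum(r[::-1])[::-1]  # tail[j] = sum of r[j:]
        z0 = z0 - step * tail[0]
        d = np.clip(d - step * tail[1:], 0.0, upper)

    z = z0 + np.concatenate(([0.0], np.cumsum(d)))
    fitted = np.empty(n)
    fitted[order] = z
    return fitted
```

Because every iterate is clipped into the feasible box, the returned fit satisfies both the monotonicity and Lipschitz constraints even if the loop is stopped early; only its closeness to the optimum depends on `n_iter`.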

3.2 Learning SILVar Models

Algorithm 1 describes the basic learning procedure for the SILVar model and details the gradient computations while assuming a proximal operator is given.

Algorithm 1 Single Index Latent Variable (SILVar) Learning
1: Initialize ,
2: while not converged do (proximal methods)
3:     Computing gradients:
4: end while
5: return

Under certain assumptions [13], the solution to the optimization problem (4) can be shown to achieve good performance relative to problem parameters including sparsity/rank of linear parameters and and the magnitude of the effect of latent variables.
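To make the proximal structure of the learning loop concrete, here is a minimal sketch of proximal gradient descent for a sparse plus low-rank regression. It is a simplification, not Algorithm 1 itself: the link function is fixed to the identity (squared loss), so the link-estimation step is omitted, and the regularizers are assumed to be the element-wise l1 norm and the nuclear norm. All function and parameter names are hypothetical.

```python
import numpy as np

def soft_threshold(M, tau):
    """Proximal operator of tau * (element-wise l1 norm)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def singular_value_threshold(M, tau):
    """Proximal operator of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def sparse_plus_low_rank(X, Y, lam_sparse, lam_rank, n_iter=500):
    """Proximal gradient for the identity-link special case

        min_{A, B}  1/(2n) ||Y - (A + B) X||_F^2
                    + lam_sparse ||A||_1 + lam_rank ||B||_*

    X: (p, n) inputs and Y: (m, n) outputs, with samples as columns.
    """
    p, n = X.shape
    m = Y.shape[0]
    A = np.zeros((m, p))
    B = np.zeros((m, p))
    # The smooth term's gradient in (A, B) jointly is Lipschitz with
    # constant 2 ||X||_2^2 / n, so its inverse is a safe step size.
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        # A and B enter only through their sum, so they share a gradient.
        grad = ((A + B) @ X - Y) @ X.T / n
        A = soft_threshold(A - step * grad, step * lam_sparse)
        B = singular_value_threshold(B - step * grad, step * lam_rank)
    return A, B
```

The regularizer is separable in the two blocks, so each iteration applies one gradient step followed by an independent proximal step per block. Replacing the squared loss with a learned link would change only how `grad` is computed.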

4 Experiments

We study the performance of the algorithm via experiments on real data. In these experiments, we show two different regression settings under which the SILVar model can be applied.

4.1 Temperature Data

In this setting, we wish to learn the graph capturing relations between the weather patterns at different cities. The data is a real-world multivariate time series consisting of daily temperature measurements (in °F) for consecutive days during the year 2011 taken at each of different cities across the continental USA.

Previously, the analysis on this dataset has been performed by first fitting with a th order polynomial and then estimating a sparse graph from an autoregressive model using a known link function assuming Gaussian noise [6].

Here, we fit the time series using a nd order AR SILVar model with regularizers for group sparsity where is the entry of matrix , and nuclear norm .

Figure 1 compares two networks estimated using SILVar and using just sparse SIM without accounting for the low-rank trends, both with the same sparsity level of non-zeros for display purposes, and where . Figure 1(a) shows the network that is estimated using SILVar. The connections imply predictive dependencies between the temperatures in cities connected by the graph. It is intuitively pleasing that the patterns discovered match well previously established results based on first de-trending the data and then separately estimating a network [6]. That is, we see the effect of the Rocky Mountain chain around to longitude and the overall west-to-east direction of the weather patterns, matching the prevailing winds. In contrast, the graph estimated by the sparse SIM, shown in Figure 1(b), has many additional connections with no basis in actual weather patterns. Two particularly unsatisfying cities are: sunny Los Angeles, California at , with its multiple connections to snowy northern cities including Fargo, North Dakota at ; and Caribou, Maine at , with its multiple connections going far westward against prevailing winds including to Helena, Montana at . These connections do not appear in the graph estimated by SILVar and shown in Figure 1(a).

(a) Weather graph learned using SILVar
(b) Weather graph learned using Sp. SIM (without low-rank)
Figure 1: Learned weather stations graphs

4.2 Bike Traffic Data

The bike traffic data was obtained from HealthyRide Pittsburgh [16]. The dataset contains the timestamps and station locations of departure and arrival (among other information) for each of 127,559 trips taken between 50 stations within the city from May 31, 2015 to September 30, 2016, a total of 489 days.

We consider the task of using the total number of rides departing from and arriving in each location at 6:00AM-11:00AM to predict the number of rides departing from each location during the peak period of 11:00AM-2:00PM for each day. This corresponds to and , where is the set of non-negative integers, and . We estimate the SILVar model (2) and compare its performance against a sparse plus low-rank GLM model with an underlying Poisson distribution and fixed link function . We use training samples and compute errors on validation and test sets of size each, and learn the model on a grid of . We repeat this 10 times for each setting, using an independent set of training samples each time. We compute testing errors in these cases for the optimal with lowest validation errors for both SILVar and GLM models.

Figure 2: (a) Root mean squared errors (RMSEs) from SILVar and Oracle models; (b) Link function learned using SILVar model

Figure 2(a) shows the test Root Mean Squared Errors (RMSEs) for both SILVar and GLM models for varying training sample sizes, averaged across the 10 trials. We see that the SILVar model outperforms the GLM model by learning the link function in addition to the sparse and low-rank regression matrices. Figure 2(b) shows an example of the link function learned by the SILVar model with training samples, which performs non-negative clipping of the output. This is consistent with the count-valued nature of the data.

Figure 3: Receiver operating characteristics (ROCs) for classifying each day as a business day or non-business day, using the low-rank embedding provided by learned from the SILVar model and using the full data

We also demonstrate that the low-rank component of the estimated SILVar model indeed captures unmeasured patterns intrinsic to the data. Naturally, we expect people’s behavior and thus traffic to be different on business days and on non-business days. A standard pre-processing step would be to segment the data along this line and learn two different models. However, as we use the full dataset to learn one single model, we hypothesize that the learned captures some aspects of this underlying behavior. To test this hypothesis, we perform the singular value decomposition (SVD) on the optimally learned for and project the data onto the top singular components (SC) . We then use to train a linear support vector machine (SVM) to classify each day as either a business day or a non-business day, and compare the performance of this lower dimensional feature to that of using the full vector to train a linear SVM. If our hypothesis is true then the performance of the classifier trained on should be competitive with that of the classifier trained on . We use 50 training samples of and of and test on the remainder of the data. We repeat this 50 times by drawing a new batch of samples each time. We then vary the proportion of business to non-business days in the training sample to trace out a receiver operating characteristic (ROC).
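The projection step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: it assumes the learned low-rank matrix acts on the data from the left, so the data are projected onto its top right singular vectors, and the function name and argument layout are hypothetical.

```python
import numpy as np

def project_onto_top_components(B, X, k):
    """Project data onto the top-k right singular vectors of B.

    B: (m, p) learned low-rank matrix.
    X: (p, n) data matrix with one sample (day) per column.
    Returns a (k, n) matrix of low-dimensional features.
    """
    # Rows of Vt are right singular vectors, ordered by decreasing
    # singular value.
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:k] @ X
```

Each column of the result is a k-dimensional feature vector that could then be passed to any linear classifier, for example a linear SVM, in place of the full p-dimensional sample.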

In Figure 3, we see the results of training linear SVM on for and on the full data for classifying business and non-business days. We see that using only the first two SC, the performance is poor. However, by simply taking 3 or 4 SC, the classification performance almost matches that of the full data. Surprisingly, using the top 5 or 6 SC achieves performance greater than that of the full data. This suggests that the projection may even play the role of a de-noising filter in some sense. This classification performance strongly suggests that the low-rank indeed captures the latent behavioral factors in the data.

Figure 4: Intensities of the self-loop at each station

Finally, in Figure 4, we plot the diagonal entries of the optimal network at , as we find this visualization the most intriguing. This corresponds to locations for which incoming bike rides at 6:00AM-11:00AM are good predictors of outgoing bike rides at 11:00AM-2:00PM, beyond the effect of latent factors such as day of the week. We may expect this to correlate with locations that have restaurants open for lunch service, so that people would be likely to ride in for lunch or ride out after lunch. This is confirmed by observing that these stations are in Downtown (-80, 40.44), the Strip District (-79.975, 40.45), Lawrenceville (-79.96, 40.47), and Oakland (-79.96, 40.44), known locations of many restaurants in Pittsburgh. It is especially interesting to note that Oakland, sandwiched between the University of Pittsburgh and Carnegie Mellon University, is included. Even though the target demographic is largely within walking distance, there is a high density of restaurants open for lunch, which may explain its non-zero coefficient. The remainder of the locations with non-zero coefficients are also near high densities of lunch spots, while the other locations with coefficients of zero are largely either near residential areas or near neighborhoods known for dinner or nightlife rather than lunch, such as Shadyside () and Southside ().

5 Conclusion

Data exhibit complex dependencies, and it is often a challenge to deal with non-linearities and unmodeled effects when attempting to uncover meaningful relationships among various interacting entities that generate the data. We apply the SILVar model to estimating sparse graphs from data under the presence of non-linearities and latent factors or trends. The SILVar model estimates a non-linear link function as well as structured regression matrices and in a sparse and low-rank fashion. We outline computationally tractable algorithms for learning the model and demonstrate its performance against existing regression methods on real data sets, namely 2011 US weather sensor network data and 2015-2016 Pittsburgh bike traffic data. We show on the temperature data that the learned can account for the effects of underlying trends in time series while represents a graph consistent with US weather patterns; and we see that, in the bike data, SILVar outperforms a GLM with a fixed link function, the learned encodes latent behavioral aspects of the data, and discovers notable locations consistent with the restaurant landscape of Pittsburgh.

References

  • [1] Robert Tibshirani, “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [2] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, “Sparse inverse covariance estimation with the graphical Lasso,” Biostatistics, vol. 9, no. 3, pp. 432–41, July 2008.
  • [3] A. Bolstad, B.D. Van Veen, and R. Nowak, “Causal network inference via group sparse regularization,” IEEE Transactions on Signal Processing, vol. 59, no. 6, pp. 2628–2641, June 2011.
  • [4] Sumanta Basu and George Michailidis, “Regularized estimation in sparse high-dimensional time series models,” The Annals of Statistics, vol. 43, no. 4, pp. 1535–1567, Aug. 2015.
  • [5] C. W. J. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica, vol. 37, no. 3, pp. 424–438, Aug. 1969.
  • [6] J. Mei and J. M. F. Moura, “Signal processing on graphs: causal modeling of unstructured data,” IEEE Transactions on Signal Processing, vol. 65, no. 8, pp. 2077–2092, Apr. 2017.
  • [7] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1644–1656, Apr. 2013.
  • [8] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs: Frequency analysis,” IEEE Transactions on Signal Processing, vol. 62, no. 12, pp. 3042–3054, June 2014.
  • [9] Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky, “Latent variable graphical model selection via convex optimization,” The Annals of Statistics, vol. 40, no. 4, pp. 1935–1967, Aug. 2012.
  • [10] Ali Jalali and Sujay Sanghavi, “Learning the dependence graph of time series with latent factors,” arXiv:1106.1887 [cs], June 2011, arXiv: 1106.1887.
  • [11] Mohammad Taha Bahadori, Yan Liu, and Eric P. Xing, “Fast structure learning in generalized stochastic processes with latent factors,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2013, KDD ’13, pp. 284–292, ACM.
  • [12] Ravi Ganti, Nikhil Rao, Rebecca M. Willett, and Robert Nowak, “Learning single index models in high dimensions,” arXiv:1506.08910 [cs, stat], June 2015, arXiv: 1506.08910.
  • [13] J. Mei and J. M. F. Moura, “SILVar: Single index latent variable models,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2790–2803, June 2018.
  • [14] Hidehiko Ichimura, “Semiparametric Least Squares (SLS) and weighted SLS estimation of Single-Index Models,” Journal of Econometrics, vol. 58, no. 1, pp. 71–120, July 1993.
  • [15] Sreangsu Acharyya and Joydeep Ghosh, “Parameter estimation of Generalized Linear Models without assuming their link function,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015, pp. 10–18.
  • [16] “Healthy Ride Pittsburgh,” https://healthyridepgh.com/data/, Oct. 2016.