1 Introduction
Graphs are a central tool for representing interpretable networks of relationships between interacting entities that generate large amounts of data. Learning network structure while combating the effects of noise can be achieved via sparse optimization methods, such as regression (Lasso) [1] and inverse covariance estimation [2]. In addition, the extension to time series via vector autoregression [3, 4] yields interpretations related to causality [5, 6]. In each of these settings, estimated nonzero values correspond to actual relations, while zeros correspond to the absence of relations. However, we are often unable to collect data on all relevant variables, and this leads to observing relationships that may in fact be caused by common links with those unobserved variables. Hidden variables can be fairly general: they can be underlying trends in the data, or the effects of a larger network on an observed subnetwork. For example, one year of daily temperature measurements across a country could be related through a graph based on geographical and meteorological features, yet all exhibit the same significant trend due to the changing seasons, and no single sensor directly measures this trend. In the literature, a standard pipeline is to de-trend the data as a preprocessing step, and then estimate or use a graph to describe the variations of the data on top of the underlying trends [7, 8, 6].
Alternatively, attempts have been made to capture the effects of hidden variables via sparse plus low-rank optimization [9]. This has been extended to time series [10], and even to a non-linear setting via Generalized Linear Models (GLMs) [11]. What if the form of the non-linearity (link function) is not known? Regression using a GLM with an unknown link function is also known as a Single Index Model (SIM). Recent results have shown good performance when using SIMs for sparse regression [12].
Current methods impose a fixed (non-)linearity, assume the absence of any underlying latent variables, perform separate pre-processing or partitioning in an attempt to remove or otherwise explicitly handle such latent variables, or take some combination of these steps. To address all of these issues, we present a model with a non-linear function applied to a linear argument that captures the effects of latent variables. Thus, we apply the Single Index Latent Variable (SILVar) model [13], which uses the SIM in a sparse plus low-rank optimization setting to enable general, interpretable multi-task regression in the presence of unknown non-linearities and unobserved variables. That is, we examine SILVar as a tool for uncovering hidden relationships buried in data.
2 Single Index Latent Variable Models
In this section, we build the Single Index Latent Variable (SILVar) model from fundamental concepts. We extend the single index model (SIM) [14] to the multivariate case and account for effects from unmeasured latent variables in the linear parameter. Let $\mathbf{x}_i \in \mathbb{R}^{m}$ and $\mathbf{y}_i \in \mathbb{R}^{p}$ for $i \in \{1, \ldots, N\}$, and let $g = \nabla G$. The multivariate SIM model is parameterized by 1) a non-linear link function $g$, where $G$ is a closed, convex, differentiable, invertible function; and 2) a matrix $\mathbf{A} \in \mathbb{R}^{p \times m}$. Consider the vectorization,
$$\widehat{\mathbf{y}} = g(\mathbf{A}\mathbf{x}), \qquad (1)$$

where $g$ is applied element-wise to its vector argument, with component link functions $g_j$.
For the remainder of this paper, we assume that all component link functions are identical, $g_j = g$, for notational simplicity, though the formulations readily extend to the case where the $g_j$ are distinct.
We propose the SILVar model,
$$\widehat{\mathbf{y}} = g\big((\mathbf{A} + \mathbf{L})\,\mathbf{x}\big), \qquad (2)$$
where we have explicitly split the linear parameter from before into $\mathbf{A}$ and $\mathbf{L}$ such that $\mathbf{A}$ is a sparse matrix, including but not limited to a graph adjacency matrix, and $\mathbf{L}$ is a low-rank matrix ($\operatorname{rank}(\mathbf{L}) \ll \min(p, m)$), capturing the indirect effects of a small number of unmeasured latent variables on the observed data, as introduced in [9]. However, in the presence of the non-linearity, it is not obvious that this low-rank representation should faithfully correspond to the latent variables as intended. Luckily, it does [13]. Letting
$$F(g, \mathbf{M}) = \frac{1}{N} \sum_{i=1}^{N} \Big( \mathbf{1}^\top G(\mathbf{M}\mathbf{x}_i) - \mathbf{y}_i^\top \mathbf{M}\mathbf{x}_i \Big), \qquad (3)$$

with $G$ applied element-wise,
we learn the model using the optimization problem,
$$\big(\widehat{g}, \widehat{\mathbf{A}}, \widehat{\mathbf{L}}\big) = \underset{g \in \mathcal{G}_c,\ \mathbf{A},\ \mathbf{L}}{\arg\min}\ F(g, \mathbf{A} + \mathbf{L}) + h_1(\mathbf{A}) + h_2(\mathbf{L}), \qquad (4)$$
where $h_1$ and $h_2$ are regularizers on $\mathbf{A}$ and $\mathbf{L}$ respectively, and $\mathcal{G}_c$ is the set of monotonic increasing $c$-Lipschitz functions. We impose this functional constraint for uniqueness and conditioning of the solution. A natural choice for $h_2$ would be the nuclear norm $h_2(\mathbf{L}) = \lambda_2 \|\mathbf{L}\|_*$, since $\mathbf{L}$ is approximately low rank due to the influence of a relatively small number of latent variables. We may choose different forms for $h_1$ depending on our assumptions about the structure of $\mathbf{A}$. For example, if $\mathbf{A}$ is sparse, we may use $h_1(\mathbf{A}) = \lambda_1 \|\mathbf{A}\|_1$, the $\ell_1$ norm applied element-wise to the vectorized matrix. This is a “sparse plus low-rank” model, which has been shown under certain geometric incoherence conditions to be identifiable [9].
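To make (4) concrete, the following minimal sketch (with hypothetical shapes and, purely for illustration, the quadratic $G(t) = t^2/2$ corresponding to an identity link) evaluates the regularized objective for a fixed link function:

```python
import numpy as np

def silvar_objective(A, L, X, Y, G, lam1, lam2):
    """Evaluate F(g, A + L) + lam1*||A||_1 + lam2*||L||_* as in (3)-(4).

    X: (m, N) inputs, Y: (p, N) outputs, G: element-wise antiderivative of g.
    """
    Theta = (A + L) @ X                       # linear arguments, shape (p, N)
    # Pseudo-likelihood term: average of sum_j [G(theta_j) - y_j * theta_j]
    F = np.mean(np.sum(G(Theta) - Y * Theta, axis=0))
    h1 = lam1 * np.abs(A).sum()                   # element-wise l1 penalty
    h2 = lam2 * np.linalg.norm(L, ord='nuc')      # nuclear norm penalty
    return F + h1 + h2

# Toy example with G(t) = t^2/2, i.e., identity link g(t) = t:
rng = np.random.default_rng(0)
p, m, N = 5, 8, 100
A_true = rng.normal(size=(p, m)) * (rng.random((p, m)) < 0.2)  # sparse
X = rng.normal(size=(m, N))
Y = A_true @ X + 0.1 * rng.normal(size=(p, N))
print(silvar_objective(A_true, np.zeros((p, m)), X, Y,
                       lambda t: 0.5 * t**2, lam1=0.1, lam2=0.1))
```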
3 Efficiently Learning SILVar Models
In this section, we describe the algorithm for learning the SILVar model. Surprisingly, the pseudo-likelihood functional (3) used for learning the SILVar model is jointly convex in $g$, $\mathbf{A}$, and $\mathbf{L}$ [13]. This convexity ensures that the learning procedure can converge and be made computationally efficient.
3.1 Lipschitz Monotonic Regression
The estimation of $g$, with the objective function including the terms $G$ and $g = \nabla G$, appears to be an intractable calculus-of-variations problem. However, there is a marginalization technique that avoids estimating functional gradients with respect to $G$ and $g$ [15]. The technique utilizes Lipschitz monotonic regression (LMR) as a subproblem, for which fast algorithms exist [13].
Given ordered pairs $\{(z_i, y_i)\}_{i=1}^{n}$ with $y_i = g(z_i) + \varepsilon_i$ for additive noise $\varepsilon_i$, let $z_{(i)}$ denote the $i$-th element of the $\{z_i\}$ sorted in ascending order, with $g_{(i)}$ the corresponding fitted value. Then LMR is described by the problem,

$$\widehat{\mathbf{g}} = \underset{\mathbf{g}}{\arg\min}\ \|\mathbf{g} - \mathbf{y}\|_2^2 \quad \text{s.t.}\quad 0 \le g_{(i+1)} - g_{(i)} \le c\,\big(z_{(i+1)} - z_{(i)}\big),\ \ 1 \le i \le n-1, \qquad (5)$$

which treats the $y_i$ as noisy observations of a monotonic increasing $c$-Lipschitz function $g$, sampled at the points $z_i$.
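As an illustration of the subproblem, (5) can be posed as a small box-constrained least-squares problem by re-parameterizing in the increments of $\mathbf{g}$ between consecutive sorted sample points. The sketch below (not the fast solver referenced above, and assuming distinct $z_i$) uses SciPy's generic bounded least-squares routine:

```python
import numpy as np
from scipy.optimize import lsq_linear

def lmr(z, y, c):
    """Lipschitz monotonic regression: fit g_i to y_i at points z_i with
    0 <= g_(i+1) - g_(i) <= c * (z_(i+1) - z_(i)) on the sorted z (eq. (5))."""
    order = np.argsort(z)
    z_s, y_s = z[order], y[order]
    n = len(z)
    # Parameterize sorted fits as g_s = T @ d: d[0] is the base value,
    # d[1:] are the (bounded, non-negative) increments between neighbors.
    T = np.tril(np.ones((n, n)))
    lb = np.r_[-np.inf, np.zeros(n - 1)]          # monotonicity
    ub = np.r_[np.inf, c * np.diff(z_s)]          # Lipschitz bound
    d = lsq_linear(T, y_s, bounds=(lb, ub)).x
    g = np.empty(n)
    g[order] = np.cumsum(d)                       # undo the sorting
    return g

# Noisy observations of a saturating, 1-Lipschitz monotone function:
rng = np.random.default_rng(1)
z = rng.uniform(-3, 3, 200)
y = np.tanh(z) + 0.2 * rng.normal(size=200)
g_hat = lmr(z, y, c=1.0)
```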
3.2 Learning SILVar Models
Algorithm 1 describes the basic learning procedure for the SILVar model and details the gradient computations while assuming a proximal operator is given.
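Algorithm 1 is not reproduced here; as a sketch of the kind of step it relies on, the following code implements the gradient of (3) with respect to $\mathbf{M} = \mathbf{A} + \mathbf{L}$ together with the proximal operators for the $\ell_1$ and nuclear-norm penalties. The LMR update of $g$ from Section 3.1 is elided, and the parallel update of $\mathbf{A}$ and $\mathbf{L}$ is a simplification:

```python
import numpy as np

def soft_threshold(M, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage)."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svt(M, tau):
    """Proximal operator of tau * ||.||_* (singular value thresholding)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_grad_step(A, L, g, X, Y, step, lam1, lam2):
    """One proximal gradient step on (4) for a fixed link estimate g
    (a callable, e.g., interpolated from the LMR fit)."""
    # From (3), the gradient w.r.t. M = A + L is (g(M X) - Y) X^T / N.
    N = X.shape[1]
    grad = (g((A + L) @ X) - Y) @ X.T / N
    A_new = soft_threshold(A - step * grad, step * lam1)
    L_new = svt(L - step * grad, step * lam2)
    return A_new, L_new
```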
4 Experiments
We study the performance of the algorithm on real data. In these experiments, we show two different regression settings in which the SILVar model can be applied.
4.1 Temperature Data
In this setting, we wish to learn a graph capturing relations between the weather patterns at different cities. The data is a real-world multivariate time series consisting of daily temperature measurements (in °F) for 365 consecutive days during the year 2011, taken at each of 150 different cities across the continental USA.
Previously, analysis of this dataset was performed by first fitting the underlying seasonal trend with a 4th-order polynomial and then estimating a sparse graph from an autoregressive model using a known linear link function, assuming Gaussian noise [6]. Here, we instead fit the time series directly using a 2nd-order AR SILVar model with the group-sparsity regularizer $h_1(\mathbf{A}) = \lambda_1 \sum_{i,j} \big\| \big( a_{ij}^{(1)}, a_{ij}^{(2)} \big) \big\|_2$, where $a_{ij}^{(k)}$ is the $(i,j)$ entry of the lag-$k$ matrix $\mathbf{A}^{(k)}$, and the nuclear norm $h_2(\mathbf{L}) = \lambda_2 \sum_{k} \big\| \mathbf{L}^{(k)} \big\|_*$.
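The group-sparsity penalty above couples each $(i,j)$ entry across the two lag matrices; a minimal sketch of its proximal operator (a group soft-threshold, under the grouping just described) is:

```python
import numpy as np

def prox_group_sparse(A1, A2, tau):
    """Prox of tau * sum_{ij} ||(a_ij^(1), a_ij^(2))||_2: jointly shrink
    each cross-lag pair of entries toward zero (group soft-threshold)."""
    norms = np.sqrt(A1**2 + A2**2)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return A1 * scale, A2 * scale
```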
Figure 1 compares two networks, one estimated using SILVar and one using just a sparse SIM without accounting for the low-rank trends, both displayed at the same sparsity level, with each edge weight aggregating the lag matrices as $\big\|\big(\widehat{a}_{ij}^{(1)}, \widehat{a}_{ij}^{(2)}\big)\big\|_2$. Figure 1(a) shows the network estimated using SILVar. The connections imply predictive dependencies between the temperatures in the cities connected by the graph. It is intuitively pleasing that the patterns discovered match well previously established results based on first de-trending the data and then separately estimating a network [6]. That is, we see the effect of the Rocky Mountain chain and the overall west-to-east direction of the weather patterns, matching the prevailing winds. In contrast, the graph estimated by the sparse SIM, shown in Figure 1(b), has many additional connections with no basis in actual weather patterns. Two particularly unsatisfying cities are: sunny Los Angeles, California, with its multiple connections to snowy northern cities including Fargo, North Dakota; and Caribou, Maine, with its multiple connections going far westward against the prevailing winds, including to Helena, Montana. These connections do not appear in the graph estimated by SILVar shown in Figure 1(a).
*Figure 1: Networks estimated from the temperature data using (a) SILVar and (b) sparse SIM without accounting for the low-rank trends.*
4.2 Bike Traffic Data
The bike traffic data was obtained from HealthyRide Pittsburgh [16]. The dataset contains the timestamps and station locations of departure and arrival (among other information) for each of 127,559 trips taken between 50 stations within the city from May 31, 2015 to September 30, 2016, a total of 489 days.
We consider the task of using the total numbers of rides departing from and arriving at each location during 6:00AM-11:00AM to predict the number of rides departing from each location during the peak period of 11:00AM-2:00PM on each day. This corresponds to $\mathbf{x}_i \in \mathbb{Z}_+^{100}$ and $\mathbf{y}_i \in \mathbb{Z}_+^{50}$, where $\mathbb{Z}_+$ is the set of non-negative integers, and $N = 489$. We estimate the SILVar model (2) and compare its performance against a sparse plus low-rank GLM model with an underlying Poisson distribution and the fixed canonical link function $g(x) = e^x$. We use training sets of varying size, compute errors on validation and test sets of equal size, and learn the models on a grid of regularization parameters $(\lambda_1, \lambda_2)$. We repeat this 10 times for each setting, using an independent set of training samples each time. We then compute testing errors for the $(\lambda_1, \lambda_2)$ with the lowest validation errors, for both the SILVar and GLM models.
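This model selection is a plain grid search; a schematic sketch follows, where `fit_silvar` and `rmse` are hypothetical stand-ins for the learning routine of Section 3 and the error metric:

```python
import itertools

def select_model(fit_silvar, rmse, train, val, test, lam1_grid, lam2_grid):
    """Pick (lam1, lam2) with lowest validation RMSE; report test RMSE.

    train/val/test are (X, Y) tuples; fit_silvar returns a fitted model.
    """
    best = None
    for lam1, lam2 in itertools.product(lam1_grid, lam2_grid):
        model = fit_silvar(*train, lam1=lam1, lam2=lam2)
        err = rmse(model, *val)
        if best is None or err < best[0]:
            best = (err, lam1, lam2, model)
    _, lam1_opt, lam2_opt, model_opt = best
    return rmse(model_opt, *test), lam1_opt, lam2_opt
```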
*Figure 2: (a) Test RMSEs of the SILVar and GLM models as functions of training sample size; (b) the link function learned by SILVar.*
Figure 2(a) shows the test root mean squared errors (RMSEs) for both the SILVar and GLM models at varying training sample sizes, averaged across the 10 trials. We see that the SILVar model outperforms the GLM model by learning the link function in addition to the sparse and low-rank regression matrices. Figure 2(b) shows an example of the link function learned by the SILVar model; it performs a non-negative clipping of the output, consistent with the count-valued nature of the data.

Receiver operating characteristics (ROCs) for classifying each day as a business day or non-business day, using the low-rank embedding provided by
learned from the SILVar model and using the full dataWe also demonstrate that the low-rank component of the estimated SILVar model indeed captures unmeasured patterns intrinsic to the data. Naturally, we expect people’s behavior and thus traffic to be different on business days and on non-business days. A standard pre-processing step would be to segment the data along this line and learn two different models. However, as we use the full dataset to learn one single model, we hypothesize that the learned
captures some aspects of this underlying behavior. To test this hypothesis, we perform the singular value decomposition (SVD) on the optimally learned
for and project the data onto the top singular components (SC) . We then useto train a linear support vector machine (SVM) to classify each day as either a business day or a non-business day, and compare the performance of this lower dimensional feature to that of using the full vector
to train a linear SVM. If our hypothesis is true then the performance of the classifier trained on should be competitive with that of the classifier trained on . We use 50 training samples of and of and test on the remainder of the data. We repeat this 50 times by drawing a new batch of samples each time. We then vary the proportion of business to non-business days in the training sample to trace out a receiver operating characteristic (ROC).In Figure 3, we see the results of training linear SVM on for and on the full data for classifying business and non-business days. We see that using only the first two SC, the performance is poor. However, by simply taking 3 or 4 SC, the classification performance almost matches that of the full data. Surprisingly, using the top 5 or 6 SC achieves performance greater than that of the full data. This suggests that the projection may even play the role of a de-noising filter in some sense. This classification performance strongly suggests that the low-rank indeed captures the latent behavioral factors in the data.
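A minimal sketch of this embedding-and-classification procedure, assuming `L_hat` is the learned low-rank matrix and `X` holds one day's features per row, using scikit-learn's `LinearSVC`:

```python
import numpy as np
from sklearn.svm import LinearSVC

def top_sc_features(L_hat, X, k):
    """Project data onto the top-k right singular components of L_hat."""
    _, _, Vt = np.linalg.svd(L_hat, full_matrices=False)
    return X @ Vt[:k].T          # X: (n_days, n_features) -> (n_days, k)

def roc_point(L_hat, X, labels, k, train_idx, test_idx):
    """Train a linear SVM on the k-dim embedding; return test TPR/FPR.

    Sweeping the class proportions within train_idx traces out the ROC.
    """
    Z = top_sc_features(L_hat, X, k)
    clf = LinearSVC().fit(Z[train_idx], labels[train_idx])
    pred = clf.predict(Z[test_idx])
    true = labels[test_idx]
    tpr = np.mean(pred[true == 1] == 1)   # true positive rate
    fpr = np.mean(pred[true == 0] == 1)   # false positive rate
    return tpr, fpr
```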

Finally, in Figure 4, we plot the diagonal entries of the optimally estimated network $\widehat{\mathbf{A}}$ at their corresponding station locations, as we find this visualization the most intriguing. These entries correspond to locations for which incoming bike rides at 6:00AM-11:00AM are good predictors of outgoing bike rides at 11:00AM-2:00PM, beyond the effect of latent factors such as the day of the week. We may expect this to correlate with locations that have restaurants open for lunch service, so that people would be likely to ride in for lunch or ride out after lunch. This is confirmed by observing that these stations are in Downtown (-80, 40.44), the Strip District (-79.975, 40.45), Lawrenceville (-79.96, 40.47), and Oakland (-79.96, 40.44), known locations of many restaurants in Pittsburgh. It is especially interesting to note that Oakland, sandwiched between the University of Pittsburgh and Carnegie Mellon University, is included: even though the target demographic is largely within walking distance, there is a high density of restaurants open for lunch, which may explain its non-zero coefficient. The remaining locations with non-zero coefficients are also near high densities of lunch spots, while the locations with coefficients of zero are largely either near residential areas or near neighborhoods known for dinner or nightlife rather than lunch, such as Shadyside and Southside.
5 Conclusion
Data exhibit complex dependencies, and it is often a challenge to deal with non-linearities and unmodeled effects when attempting to uncover meaningful relationships among various interacting entities that generate the data. We apply the SILVar model to estimating sparse graphs from data in the presence of non-linearities and latent factors or trends. The SILVar model estimates a non-linear link function $g$ as well as structured regression matrices $\widehat{\mathbf{A}}$ and $\widehat{\mathbf{L}}$ in a sparse plus low-rank fashion. We outline computationally tractable algorithms for learning the model and demonstrate its performance against existing regression methods on real data sets, namely 2011 US weather sensor network data and 2015-2016 Pittsburgh bike traffic data. We show on the temperature data that the learned $\widehat{\mathbf{L}}$ can account for the effects of underlying trends in time series while $\widehat{\mathbf{A}}$ represents a graph consistent with US weather patterns; and we see that, on the bike data, SILVar outperforms a GLM with a fixed link function, the learned $\widehat{\mathbf{L}}$ encodes latent behavioral aspects of the data, and the learned $\widehat{\mathbf{A}}$ discovers notable locations consistent with the restaurant landscape of Pittsburgh.
References
- [1] Robert Tibshirani, “Regression Shrinkage and Selection via the Lasso,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
- [2] Jerome Friedman, Trevor Hastie, and Robert Tibshirani, “Sparse inverse covariance estimation with the graphical Lasso,” Biostatistics, vol. 9, no. 3, pp. 432–41, July 2008.
- [3] A. Bolstad, B.D. Van Veen, and R. Nowak, “Causal network inference via group sparse regularization,” IEEE Transactions on Signal Processing, vol. 59, no. 6, pp. 2628–2641, June 2011.
- [4] Sumanta Basu and George Michailidis, “Regularized estimation in sparse high-dimensional time series models,” The Annals of Statistics, vol. 43, no. 4, pp. 1535–1567, Aug. 2015.
- [5] C. W. J. Granger, “Investigating causal relations by econometric models and cross-spectral methods,” Econometrica, vol. 37, no. 3, pp. 424–438, Aug. 1969.
- [6] J. Mei and J. M. F. Moura, “Signal processing on graphs: causal modeling of unstructured data,” IEEE Transactions on Signal Processing, vol. 65, no. 8, pp. 2077–2092, Apr. 2017.
- [7] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs,” IEEE Transactions on Signal Processing, vol. 61, no. 7, pp. 1644–1656, Apr. 2013.
- [8] A. Sandryhaila and J. M. F. Moura, “Discrete signal processing on graphs: Frequency analysis,” IEEE Transactions on Signal Processing, vol. 62, no. 12, pp. 3042–3054, June 2014.
- [9] Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky, “Latent variable graphical model selection via convex optimization,” The Annals of Statistics, vol. 40, no. 4, pp. 1935–1967, Aug. 2012.
- [10] Ali Jalali and Sujay Sanghavi, “Learning the dependence graph of time series with latent factors,” arXiv:1106.1887 [cs], June 2011.
- [11] Mohammad Taha Bahadori, Yan Liu, and Eric P. Xing, “Fast structure learning in generalized stochastic processes with latent factors,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2013, KDD ’13, pp. 284–292, ACM.
- [12] Ravi Ganti, Nikhil Rao, Rebecca M. Willett, and Robert Nowak, “Learning single index models in high dimensions,” arXiv:1506.08910 [cs, stat], June 2015.
- [13] J. Mei and J. M. F. Moura, “SILVar: Single index latent variable models,” IEEE Transactions on Signal Processing, vol. 66, no. 11, pp. 2790–2803, June 2018.
- [14] Hidehiko Ichimura, “Semiparametric Least Squares (SLS) and weighted SLS estimation of Single-Index Models,” Journal of Econometrics, vol. 58, no. 1, pp. 71–120, July 1993.
- [15] Sreangsu Acharyya and Joydeep Ghosh, “Parameter estimation of Generalized Linear Models without assuming their link function,” in Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, 2015, pp. 10–18.
- [16] “Healthy Ride Pittsburgh,” https://healthyridepgh.com/data/, Oct. 2016.