## 1 Introduction

Network regularization is a fundamental approach to encode and incorporate general relationships among variables, which are represented as nodes and linked together by weighted edges that describe their local proximity. A significant amount of effort has been devoted to developing successful regularizations [tibshirani2005sparsity, sharpnack2012sparsistency, hallac2015network, wang2015trend] that take advantage of prior knowledge on network structures to enhance estimation performance for a variety of application settings, including image denoising [pang2017graph] and genomic data analysis [li2008network].

We consider the general setting of network regularization, now proposed as a convex optimization problem defined on an undirected graph with node set and edge set :

(1) |

where on each node , we intend to learn a local model , which in addition to minimizing a predefined convexloss function , carries certain relations to its neighbors, defined by the function .

In this paper, we focus on the particular setting where = , also known as a sum-of-norms (SON) regularizer. It assumes that the network is composed of multiple clusters, suggesting that all nodes within a cluster share the same consensus model (). As the observed data on each node may be sparse, the SON regularizer allows nodes to ”borrow” observations from their neighbors to improve their own models, as well as to determine the network cluster to which they belong. The SON objective was first introduced in [hocking2011clusterpath] for convex clustering problems and used off-the-shelf sequential convex solvers. Chi & Lange [chi2015splitting] adapt SON clustering to incorporate similarity weights with an arbitrary norm and solve the problem by both alternating direction method of multipliers (ADMM) or alternating minimization algorithm (AMA) in a parallel manner. More recently, Hallac et al. [hallac2015network] point out that weighted SON (named as Network Lasso) allows for simultaneous clustering and optimization on graphs, and is highly suitable for a broad class of large-scale network problems.

In network regularization, a static weight assigned to an edge determines how strictly the difference between models on the corresponding nodes is being penalized, relative to the other node pairs in the network. However, unlike few scenarios in which network information is explicitly given, edge weights are usually unavailable or even infeasible to obtain in most real-world networks (e.g., gene regulatory networks [liu2011controllability]). Moreover, due to possible measurement errors and inaccurate prior knowledge, the assigned edge weights don’t necessarily align with the underlying clustering structure of networks as shown in [anselin2002under, harris2011search]. As a result, heavily relying on the edge weights in determining how much penalty should be applied to neighboring models may contaminate the discovered solutions.

Similar intuition applies when we extend to the temporal setting. Models on nodes of temporal networks usually change at the level of groups over time. However, some groups exhibit different evolution patterns than others [kim2013nonparametric]. Moreover, the grouping structures themselves may evolve with time [ho2011evolving]. As static regularization cannot capture such temporal evolution, a direct solution is to induce a time-varying local consensus by employing the SON objective on both spatial and temporal directions, but we face an more drastic problem: how to assign weights between snapshots.

Consequently, a framework that is robust to missing or corrupted edge weights in network-based regularization and adapts to spatio-temporal settings is desired. Note that without the assigned network weights, the setting in Eq. (1) degrades to simply applying isotropic regularization for all the edges. In this work, we propose a generic formulation, called the discrepancy-aware network regularization (DANR), which deploys a suitable amount of anisotropic network regularization in both spatial and temporal aspects. DANR infers models at each node per timestamp and can learn evolution of models and transitions of network structures over time. We develop an ADMM-based algorithm that adopts an efficient and distributed iterative scheme to solve problems formulated by the DANR, and show that the proposed solution obtains guaranteed convergence towards global optimal solutions. By applying to both synthetic and real-world datasets, we demonstrate the effectiveness of the proposed approach on various network problems.

## 2 Problem Setting

### 2.1 Discrepancy-aware Network Regularization

Since existing network-based regularizers rely on known weights on edges, the corresponding solutions get misled when the edge weights are erroneous or unknown. Directly learning the unknown edge weights by using the same regularizer as Eq. (1) would lead to an optimization problem:

(2) |

which yields the trivial all-zero solution of and thus becomes unsatisfactory. Instead of imposing more sophisticated model-based or problem-dependent regularization as suggested in [yu2016temporal], we consider an alternative formulation that explicitly accounts for discrepancies between models on adjacent nodes:

(3) |

where we denote as the sum of -norms of all ’s. We set a penalty parameter to control the overall strength of network regularization, and a portion parameter to control the emphasis between the two terms in . We name this formulation in Eq. (3) the discrepancy-aware network regularization (DANR).

In the first term of , we define a discrepancy-buffering (DB) variable to denote the preserved discrepancy between local model parameters and . More specifically, when an edge weight is given but potentially imprecise or otherwise corrupted, would compensate for abnormally large differences between models and to reduce the magnitude of -weighted edge penalty term in . The DB variable provides additional flexibility to its associated models, allowing them to stay adequately close to their own local solutions w.r.t. minimizing local loss functions, and avoid over-penalized consensus. When all ’s are not given, ’s enable solving Eq. 3 under anisotropic regularizations even with homogeneous weights.

The second term of is the -regularizer (with ) [vogt2012complete]

that ensures sparsity at the group level. Note that we regard each variable vector

as a group ( groups in total). The first key intuition here is that the regularization on helps to exclude trivial solutions, that is . Second, it allows us to identify a succinct set of non-zero vectors ’s, compensating for possible intrinsic discrepancies between two adjacent nodes.To sum up, discrepancy-buffering variable ’s are designed to elaborately adjust network regularization strength on all edges, and thus reduce negative effects of the unsquared norm regularizer. We will show in later sections that this modified formulation remains convex and tractable via parallel optimization algorithms on large-scale networks.

### 2.2 Distributed ADMM-based Solution

In this section, we propose an ADMM-based algorithm for solving the ST-DANR problem in Eq. (12) and present the convergence and complexity of the proposed algorithm. The algorithm can be easily adapted to the spatial-only DANR problem in Eq. (3) by omitting temporal-related updates.

The ADMM method was originally derived in [gabay1976dual] and has been reformulated in many contexts including optimal control and image processing [peng2012rasl]. The method can be considered as combining augmented Lagrangian methods and the method of multipliers [boyd2011distributed]. It aims to solve optimization problems with two-block separable convex objectives in the following form of

(4) |

Observe that our proposed objective in Eq. (12) has a separable convex objective function: it can be reorganized into two-block separable convex objectives as in Eq. (4), for which ADMM methods guarantee convergence to global optimal solutions [eckstein1992douglas].

More precisely, to fit our problem into the ADMM framework, we first define = for model parameter vectors on the given undirected network , and define consensus variables = as copies of in the spatial penalty term . Then we rewrite Eq. (12) as follows:

(5) | ||||

s.t. |

in which the first term corresponds to the block in Eq. (4), and the remaining terms correspond to the block. The equality constraints on Eq. (12) are used to force consensus between variables and . Next, we derive the augmented Lagrangian of Eq. (E.1):

(6) | |||

where = are scaled dual variables for each equality constraint on the elements of . The parameters penalize the violation of equality constraints in the spatial and temporal domain [nishihara2015general] respectively. The iterative scheme of ADMM under the above setting can be written as follows, with denoting the iteration index:

(7) |

where = is composed of replicated elements in , and thus has a one-to-one correspondence with elements in .

Because the augmented Lagrangian in Eq. (6) has a separable structure as well, we can further split the optimization above over each univariate element in and . Next, we provide details for each ADMM update step.

-Update. In the first update step of our ADMM updating scheme, we can decompose the problem of minimizing into separately minimizing for each node :

As above equation suggests, on node first receives local information from the corresponding consensus variables belonging to all of its neighbors , then updates its value to minimize the loss function and remain close to neighbouring consensus variables. Since all remaining regularizations are quadratic, the -update problem can be efficiently solved whenever the loss function has certain properties, such as strong convexity.

()-Update. We can further decompose the ()-update step in Eq. (20) into subproblems on each edge (and its related variables ) as follows, which can be solved in parallel:

(8) |

Notice that the sum of convex functions which are defined on different sets of variables preserves the convexity. To minimize the objective with and simultaneously, the convexity of Eq. (2.2) motivates us to adopt an alternating descent algorithm, which minimizes each component iteratively with respect to and while holding the other fixed. In detail, we solve Eq. (9) and Eq. (10) iteratively until convergence is achieved:

(9) | |||

(10) | ||||

where denoting the iteration index of alternating descent. Further, we can utilize the analytic solution to speed up the calculation of and : Problem (9) has a closed-form solution:

where we denote , , , and for simplification. Details in Appendix B.

-Update. In the last step of our ADMM updating scheme, we have fully independent update rules for each scaled dual variable as follows:

(11) |

Finally, we present the pseudo-code of our DANR approach in Appendix D.

Stopping Criterion and Global Convergence. For our ADMM iterative scheme in Eq. (20), we use the norm of primal residual and dual residual as the termination measure. The optimality condition [boyd2011distributed] of ADMM shows that if both residuals are small then the objective suboptimality must be small, and thus suggests as a reasonable stopping criterion. Convex subproblems in -update and -update need iterative methods to solve as well. The stopping criterion for these subproblems is naturally to keep iteration differences below thresholds, i.e. in -update, we require and .

Our ADMM approach to solve the DANR problem is guaranteed to converge to the global optimum. Details in Appendix C.

Computational Complexity. Let denote the number of iterations that ADMM takes to achieve an approximate solution with an accuracy of . Based on the convergence analysis in [he2015non], the time complexity scales as , which in our problem mostly depends on the properties of cost functions ’s. Assume that all convex subproblems in -update and -update are solved by general first-order gradient descent methods that take and iterations to converge respectively, the overall complexity of the algorithm is therefore .

## 3 Extension to Spatio-temporal Setting

The aforementioned shortcomings with pre-defined edge weights accentuate when we consider temporal networks since acquiring explicit temporal edge weights is usually not feasible. Thus, discrepancy-buffering variables are also crucial for the temporal setting.

Consider a temporal undirected network consisting of sequential network snapshots , where each network snapshot contains a node set and an edge set . We denote as the weight of spatial edge between nodes and within snapshot , and for the weight of temporal edge that links node with across snapshots and (shown in Fig.1). On each node , a convex loss function is given to measure the fitness of a local model parameterized by . With sparse observations for all snapshots, a straightforward task is to find jointly optimal models for every node and every timestamp under the regularization for both network topology and temporal evolution. We can propose an extension of the DANR formulation to spatio-temporal networks, referred as ST-DANR:

(12) |

where we define the discrepancy-aware regularizers as:

(13) | ||||

(14) |

As we are interested in the heterogeneous evolution of nodal models across time, the chosen unsquared norms in spatial and temporal regularization terms and enforce piecewise consensus in both spatial and temporal aspects, which indicates abrupt changes of regional models or sudden transitions in network structures at particular timestamps, and also implies valid persistent models in the remaining segments of time [hallac2017network]. Beside the spatial DB variables in , we define another set of temporal DB variables in to compensate for inadequate regularizations caused by temporal fluctuations. For example, when modeling housing prices in a city, would tolerate anomalous short-term fluctuations in prices, while large values of might be found near boundaries of disparate neighborhoods.

Adaptation to Streaming Data. In many applications, observed network snapshots arrive in a streaming fashion. Instead of recomputing all models on past snapshots whenever a new snapshot is observed, incorporating existing models to facilitate the new incoming snapshot is more efficient. Assume that we have learned models for the -th snapshot, we then read observations of the next snapshots indexed by and attempt to learn models for them. To this intent, we deploy our ST-DANR formulation in Eq. (12) on a limited time interval , and then integrate established model at by fixing = and solving the remaining variables in the corresponding problem. Note that in our setting, temporal messages are only transferable through consecutive snapshots, meaning that fixing the -th snapshot is the same as fixing all past snapshots.

Lastly, we point out that distributed ADMM-based solution for DANR (§2.2) can be adapted to ST-DANR, for either batch formulation (Eq. (12)) or streaming-case formulation, which readers can refer to in Appendix E.

Temporal Evolutionary Patterns. In this work, we consider the unsquared edge penalty term in since it enforces temporal reconstruction, indicating sharp transitions or change points at particular timestamps. Similar to the spatial case, there could be multiple available alternatives to enforce different types of evolutionary patterns [hallac2017network]. For example, replacing the unsquared norm with squared norm form will yield smoothly varying models along consecutive snapshots, and has been well studied in [wang2014low, dang2017subnetwork].

## 4 Experiments

In this section, we present an experimental analysis of the proposed method (DANR) and evaluate its performance on classification and regression tasks. First, we employ a synthetic dataset (§4.1) and two housing price datasets (BAY and SAC in §4.2) to demonstrate the set of scenarios and applications that are particularly good fit for DANR. In addition to an improvement on the estimation task, DANR learns fine-grained neighborhood boundaries and reveals interesting insights into the network’s heterogeneous structure. Lastly, in order to validate the efficacy of ST-DANR in the temporal setting, we conduct experiments with the geospatial and temporal database of Northeastern US lakes (§4.3), from which we aim to estimate the water quality of lakes over 10 years.

Baseline Algorithms. In spatial network scenarios (§4.1 & §4.2), we compare DANR against three baselines: network lasso (NL) [hallac2015network], robust multi-task feature learning (rMTFL) [gong2012robust], factorized multi-task learning (FORMULA) [xu2015formula]. In spatial-temporal network scenarios (§4.3), we compare ST-DANR with two widely-applied temporal regularizers in the literature, which are detailed in §4.3.

Parameters Setting. For both NL and DANR, we tune parameters from to where . Concerning the DANR, for each value of , we further tune parameters from to , where . Lastly, we set . We follow the same strategy for both spatial (DANR) and spatio-temporal (ST-DANR) variants of the proposed method. For all penalty parameters in rMTFL and FORMULA, we tune them in the same way as in DANR. In addition, we vary the number of factorized base models in FORMULA from to

to achieve its best performance. For each dataset, we standardize all features and response variables to zero mean and unit variance.

### 4.1 Node Classification with Limited Data

Following previous work [hallac2015network]

, we first experiment with a synthetic network in which each node attempts to solve its own support vector machine (SVM) classifier, but only has a limited amount of data to learn from. We further split the nodes into clusters where each cluster has an underlying model, i.e., nodes in the same cluster share the same underlying model. Our aim is to learn accurate models for each node by leveraging their network connections and limited observations. Note that the neighbors with different underlying models provide misleading information to each other, which potentially harms the overall performance. Therefore, we later investigate the robustness of the DANR with a varying ratio of such malicious edges.

Synthetic Network Generation. We create a network of 100 nodes, split into 5 ground-truth communities , , , . Each of these communities consists of 20 nodes. We denote the community of a node with

. Next, we form the network connections by adding edges between nodes based on the following criterion: The probability of adding an edge between any node pair

is 0.5 if , i.e., the edge connects nodes from the same community (*intra-community edge*). Otherwise, we set the probability as 0.02 if , i.e., the edge connects nodes from different communities (

*inter-community edge*). The resulting synthetic network has 100 nodes and 574 edges, with 82% of them being intra-community edges. Each ground-truth community has an underlying classifier model

, drawn independently from a normal distribution of zero mean and unit variance, where

. Next, we generate 5 random training example pairs per node, where denotes an observation and denotes the ground-truth value to estimate. The ground-truth value for each observation is then computed using the underlying classifier of the community that a node belongs to:(15) |

where represents the corresponding weight of the classifier w.r.t. observations, while is the noise. All observations and noises are drawn independently from a normal distribution for each data point. Note that 5 observations are insufficient to accurately estimate a model . Hence for this setting, the network connections play an important role as nodes can “borrow” training examples from its neighbors to improve their own models.

Optimization Problem. The objective function on each node is formulated by the soft margin SVM objective, where node

estimates its model (i.e. separating hyperplane)

using its five training examples as follows:where ’s are slack variables that accounts for non-separable data. The constant controls the trade-off between minimizing the training error and maximizing the margin. We set to 0.75, which we empirically find to perform well.

Test Results. In following subsections, we first thoroughly inspect DANR and its prototype method NL on synthetic network , in order to show the benefits of introducing discrepancy-buffering variable . Later in §4.1.2, we evaluate the performance and robustness of DANR and all three baseline methods under more noisy scenarios. To assess the performance, prediction accuracies are computed over a separate set of 1000 test pairs (10 per node). The test pairs are again randomly generated using Eq. (15).

#### 4.1.1 Effect of Discrepancy-Buffering Variables.

We compare DANR with NL and report the prediction accuracy with varying values in Figure 2. Recall that the proposed method introduces the parameter which is coupled with the parameter. Therefore in order to have an adequate comparison between DANR and NL, we modify our parameter to be . Doing so ensures that both methods apply the same amount of penalty to their network regularization terms (See Eq. 2 and 3). For each value of , we vary from 0.3 to 1 and report the highest accuracy achieved.

We first briefly mention the desired behaviors of the NL method and later accentuate its drawbacks. As shown by Figure 2, for small values, the problem reduces to solving the local optimization problem where each node estimates its model solely based on its limited observations. This achieves 61.3% accuracy on the test set. Furthermore, as increases, the NL penalty enforces nodes to form clusters, where nodes in the same cluster share the same model. This behavior firmly improves the test performance with prediction accuracy rising to 75.3%. Right after the peak performance is reached however, a slight increase in causes a sudden drop in the performance and the problem reduces to solving the global optimization problem over the entire network. Therefore when , the problem converges to a common model for all the nodes in the network , i.e., for . This further drops the prediction accuracy to 56.4% on the test set.

The observed rapid drop in performance implies that the NL method is highly sensitive to the parameter. There is only a narrow window of values that result in a proper clustering of the network. As a result, one needs to excessively tune the parameter around to find its optimal value. In addition, such clustering performance is also highly affected by the volume of “malicious” (inter-cluster) edges in the network, as shown in §4.1.2.

From Figure 2, we can observe that the DANR approach exhibits improved performance thanks to its discrepancy-buffering variables, , which allow our regularization term to better exploit the good edges while providing additional tolerance towards bad edges. As a result, DANR achieves 79.1% accuracy on the test set, outperforming the best baseline method by 3.8%. Another important observation is that DANR provides a much wider window for selecting near-optimal than the baseline approach. That being said, we now analyze how the * parameter couples with the parameter, and further present its effect on DANR’s performance. Figure 3 displays the prediction accuracy vs. , with a fixed and . Intuitively, as increases, the optimal value of also increases. The reason behind is that as increases, the more non-zero discrepancy-buffering variables (parameterized by ()) are needed to persist the accurate clustering formation since high values of forces global consensus over the network.*

#### 4.1.2 Robustness to Malicious Edges.

Note that the ratio of inter-community edges in the earlier synthetic network is 18%. Here we use term *noise* for the ratio of such malicious edges in the network. Taking into account that the amount of noise is critical in learning accurate models for nodes in a network, we now investigate the efficacy of DANR w.r.t. varying noise. First, we remove all inter-community edges and later iteratively add malicious edges to the network. Then we solve the same problem with the noise varying from 0 to 0.6, and report prediction accuracies of all methods in Figure 4.
As shown, the performance of DANR, NL and FORMULA drops in the absence of noise. Meanwhile the pure multitask method rMTFL maintains the same low accuracy, since it doesn’t utilize the network information. However, as we increase the noise, DANR starts to outperform NL and FORMULA, where the gap between the models’ performances becomes larger at a higher ratio of malicious edges. This confirms that compared to the baseline methods, the DANR exhibits more robust performance in noisy settings thanks to its discrepancy-buffering variables.

### 4.2 Spatial: Housing Rental Price Estimation

In this section, our goal is to jointly (i) estimate the rental prices of houses by leveraging their geological location and the set of features; (ii) discover boundaries between neighborhoods. The intuition is that the houses in the same neighborhood often tend to have similar pricing models. However, such neighborhoods can have complex underlying structures (as described later in this section), which imposes additional challenges in learning accurate models.

Dataset. We experiment with two largely populated areas in Northern California: the Greater Sacramento Area (SAC) and the Bay Area (BAY). The anonymized data is provided by the property management software company Appfolio Inc, from which we further sample houses with at least one signed rental agreement during the year 2017. The resulting dataset covers 3849 houses in SAC and 1498 houses in BAY. Concerning the houses that have more than one rental agreement signed during 2017, we average the rental prices listed in all the agreements. Each house holds the information about its location (latitude/longitude), number of bedrooms, number of bathrooms, square footage, and the rental price. We regard these areas (SAC and BAY) as two separate datasets for the remainder of this section. We also randomly split 20 % of the houses in each dataset for testing.

Network Construction. After excluding test houses from both datasets, we construct two networks (one for each dataset) based on the houses’ locations. An undirected edge exists between node and , if at least one of them is in the set of 10 nearest neighbors of the other. (Note that node being one of the 10 nearest neighbors of doesn’t necessarily imply that node is also in the set of nearest neighbors of .) In the resulting networks, the average degree of a node is 12.16 for SAC and 12.04 for BAY. Additionally, we construct two versions of these networks, *weighted* and *unweighted*. While the weighted network has edge weights inversely proportional to the distances between houses, the unweighted network ignores the proximity between houses, and thus weights on all the edges are 1.

Optimization Problem.

The model at each house estimates its rental price by solving a linear regression problem. More specifically, at each node

we learn a 4-dimensional model . simply represents the coefficients of each feature, which later is used to estimate the rental price as follows:where is the bias term. The training objective is

where represents the proposed network regularization term (see Eq. 3), while is the regularization weight to prevent over-fitting.

Test Results. Once training converges, we predict the rental prices on the test set. To do that, we connect each house in the test set to its 10 nearest neighbors in the training set. We then infer the new model by taking the average of the models on its neighbors: . The model is further used to estimate the rental price of the corresponding house. Alternatively, one can also infer by solving , while keeping the models on neighbors fixed. However, we empirically find that simply averaging the neighbors’ models performs better for both methods in this particular setting. We compute Mean Squared Error (MSE) over test nodes to evaluate the performance.

Method | BAY | SAC |
---|---|---|

Local estimation () | 0.5984 | 0.6250 |

Global estimation () | 0.4951 | 0.5403 |

rMTFL | 0.4774 | 0.4115 |

FORMULA (unweighted) | 0.4446 | 0.3503 |

FORMULA (weighted) | 0.4181 | 0.3379 |

Network Lasso (unweighted) | 0.4392 | 0.3273 |

Network Lasso (weighted) | 0.4173 | 0.3022 |

DANR (unweighted) | 0.4106 | 0.2978 |

Table 3 displays the test results of eight different settings on both datasets. As shown, local & global estimations and rMTFL method produce high errors for both datasets. We further apply FORMULA and Network Lasso (NL) methods to both weighted and unweighted versions of the networks, while the proposed method is only applied to the unweighted version. Notice that the network weights are irrelevant for the local & global estimation settings and the rMTFL method. Intuitively, both FORMULA and NL performs better on the weighted setting compared to the unweighted setting for both datasets. As shown, the rental price estimation errors achieved by the NL are 0.4173 (weighted) vs. 0.4392 (unweighted) for BAY and 0.3022 (weighted) vs. 0.3273 (unweighted) for SAC. These results suggest that the pre-defined weights on these networks help to learn more accurate models on houses.

However, we further argue that although such pre-defined weights improve the overall clustering performance, they don’t account for more complex clustering scenarios, e.g., two nearby houses falling into different school districts or some houses having higher values compared to their neighbors due to geography, e.g., having a view of the city. That being said, DANR outperforms all the other baselines and achieves the smallest errors for both datasets; 0.4106 for BAY and 0.2978 for SAC. Especially, the DANR (unweighted) even outperforms the weighted version of the baseline approaches by a notable margin. This reveals that the DANR indeed accounts for such heterogeneities in data and provides enhanced clustering of the network.

Figure 5 shows examples of two complex scenarios that are captured by the DANR from the SAC network. We use a color code to represent the clusters in the network, where the same colored houses (, ) are in consensus on their models (). In the left subfigure, the house shown in yellow uses a different model than all of its neighbors. Interestingly, it resides at the border of three different area codes. The area code for this house is 95864, while the area code on its west is 95825 and 95826 on its south-east. Additionally, we observe similar heterogeneous behaviors in some houses that are near creeks, lakes, and rivers. As an example, Figure 5 (right) displays a creek side house (colored in blue), which again uses a different underlying model from its neighbors.

### 4.3 Spatio-Temporal: Water Quality Estimation of Northeastern US Lakes

We now evaluate our method in spatio-temporal setting, where the aim is to dynamically estimate the water quality of Northeastern US lakes over years. We follow the same procedure in §4.2, but have an additional temporal regularization term in our objective that allows nodes to obtain signals from past snapshots of the network, along with their neighbors in the current snapshot.

Dataset and Network. We experiment with the geospatial and temporal dataset of lakes in 17 states in the Northeastern United States, called LAGOS [soranno2017lagos]. The dataset holds extensive information about the physical characteristics, various nutrient concentration measurements (water quality), ecological context (surrounding land use/cover, climate etc.), and location of lakes; from which we select the following available set of features: *{max depth, surface area, elevation, annual mean temperature, % of agricultural land, % of urban land, % of forest land, % of wetland}*. The water quality metric to estimate is the summerly mean *chlorophyll concentration* of the lakes. We represent the feature vector with and the water quality score with .

During our experiments, we consider a 10-year period between 2000 and 2010. Due to sparsity in the data, we allow 2 years range between the two consecutive snapshots of the network. This results in total of 1039 lakes with all the aforementioned features and the water quality measurements available for each of the selected years. After randomly splitting 20% of the lakes for testing, we build our network by using the latitude/longitude information of lakes, where each lake (node) is connected to 10 nearest lakes.

Optimization Problem.
After constructing the network, we now aim to *dynamically* estimate the water quality () of lakes by using their feature vectors (). More formally, for each year , we learn a model per node by solving the following spatio-temporal problem in a streaming fashion: (4.16)

where is the linear regression objective and is the DANR term applied on the spatial edges. is the temporal regularization term, for which we consider two formulations as described next.

Test results. For each year, we first solve the spatial problem (Eq. (4.16) without the temporal term ) and report the Mean Squared Errors in Table 2. Note that the test nodes are again inferred by averaging the neighbors’ models in the current snapshot. Later, in order to successfully assess the improvements gained by the temporal discrepancy-buffering variables (), we solve the above spatio-temporal problem with three different temporal regularizers:

DANR+T-SON: DANR with temporal sum-of-norms regularizer where =

DANR+T-SOS: DANR with temporal sum-of-squares regularizer where =

ST-DANR: Spatio-temporal discrepancy-aware network regularizer where =

DANR+T-SON and DANR+T-SOS approaches simply apply two widely adopted temporal regularizers on the temporal edges (the sum-of-norms regularizer [lindsten2011clustering] and sum-of-squares reqularizer [dang2017subnetwork, wang2014low] respectively), while ST-DANR includes the discrepancy-buffering variables on both spatial and temporal edges. Recall that we solve the Eq. (4.16) in a streaming fashion for simplicity, i.e., each snapshot of the network solves the problem while holding the models learned on the previous snapshot (if available) fixed. Potentially, the performance can further be improved by allowing updates on the previous snapshots.

As we can see from the Table 2, leveraging the temporal connections between the two consecutive snapshots significantly improves the performance. The DANR+T-SON outperforms the DANR by 8%-14%, while DANR+TSOS outperforms the DANR by 10%-15%. Moreover, the ST-DANR consistently outperforms both baselines for all the years. This confirms that the proposed formulation accounts for the heterogeneous nature of the temporal networks where some group of nodes exhibits different evolution patterns than the others.

Method | 2000 | 2002 | 2004 | 2006 | 2008 | 2010 |
---|---|---|---|---|---|---|

DANR | 0.8479 | 0.9033 | 0.6384 | 0.6722 | 0.4556 | 0.4113 |

+ T-SON | N/A | 0.8308 | 0.5744 | 0.6061 | 0.4109 | 0.3517 |

+ T-SOS | N/A | 0.8045 | 0.5618 | 0.6045 | 0.3978 | 0.3476 |

ST-DANR | N/A | 0.8037 | 0.5604 | 0.5866 | 0.3844 | 0.3311 |

Figure 6 further displays the models (color-coded) learned on three consecutive snapshots, corresponding to years 2000, 2002 and 2004. In 2000, the formed clusters don’t go much beyond the boundaries of the states, resulting in coarse clustering of the network. Yet some states such as Missouri and Iowa are grouped into one cluster (shown in green). This suggests that accurately estimating the chlorophyll concentration of lakes based on the selected set of features is indeed challenging. Specifically, the estimation problem depends on several other external factors that potentially affect the volume of plants and algae in lakes; cultural eutrophication, nutrient inputs from human activities to name a few [soranno2017lagos]. However, as it can be seen from the models learned in 2002 and 2004, the clustering performance improves once we allow temporal regularization to synchronize models between two consecutive snapshots, which is analogous with the reduction in errors over years as reported in Table 2. Overall, we observe a split of clusters over time, e.g., in Wisconsin, Minnesota, Iowa and New England. Additionally, a dark brown cluster in south Missouri begins to appear in 2002 and further spreads north in 2004. This indicates that by leveraging the temporal connections, the ST-DANR learns improved models and clustering while allowing for heterogeneity in group level evolutions of nodes.

## 5 Conclusions

We propose discrepancy-aware network regularization (DANR) approach for spatio-temporal networks. By introducing discrepancy-buffering variables, the DANR automatically compensates for inaccurate prior knowledge encoded in edge weights, and enables modeling heterogeneous temporal patterns of network clusters. We develop a scalable algorithm based on alternating direction method of multipliers (ADMM) to solve the proposed problem, which enjoys analytic solutions for decomposed subproblems and employs a distributed update scheme on nodes and edges. Our experimental results show that DANR successfully captures structural changes over spatio-temporal networks and yields enhanced models by providing robustness towards missing or inaccurate edge weights.

## 6 Acknowledgments

This project has received funding from the National Science Foundation under grant IIS-1817046 and the Defense Threat Reduction Agency under award HDTRA1-19-1-0017.

## References

## Appendix A Related Works

Among existing network-based regularization approaches, there are two major types of edge objectives that are most commonly used: the square-norm objective and the unsquared-norm objective, encouraging different styles of expected solutions. The squared-norm edge objective (i.e., ), well-known as graph Laplacian regularization [belkin2006manifold], assumes underlying models on nodes are smoothly varying as one traverses edges in the network, and accordingly enforces similar but not identical models on linked nodes. Due to the merits of simple computations and good performance, graph Laplacian regularizer has gained popularity in solving problems, such as image deblurring and denoising [pang2017graph], genomic data analysis [li2008network], semi-supervised regression [belkin2004regularization], non-negative matrix factorization [guan2011manifold] and semi-definite programming [weinberger2007graph]. However, the square-norm objective usually induces very dense models. The unsquared-norm objective, known as sum-of-norms regularizer [hocking2011clusterpath, lindsten2011clustering], bears some similarity to scale-valued fused lasso signal approximator [tibshirani2005sparsity] that exists in signal processing applications, for example, total-variation (TV) regularizer for image denoising, and graph trend filtering (GTF) regularizer for nonparametric regression. TV regularizer [rudin1992nonlinear] is developed to penalize both the horizontal and vertical differences between neighboring pixels, which can be thought of as a special scalar-valued network lasso on a 2D grid network. GTF [wang2015trend] generalizes the successful idea of trend filtering to graphs, by directly penalizing higher order differences across nodes. All regularizers above require correct edge weights so that network structures can be properly used to improve joint estimation. In contrast, our proposed approach allows ambiguous input of network edge weights by introducing the discrepancy-buffering variables. Also, as a variation of SON regularization, our formulation enjoys small model complexity as local consensus is imposed across large-scale networks.

## Appendix B Proof of Lemma 2.1

We present the analytic solution for updating , introduced in Eq. (2.9). For simplicity, we omit iteration index , and denote , and .

(16) |

We discuss the problem in the following two cases:

Case (1): When the objective function in Eq. (B) is differentiable under the condition of , taking the sufficient condition of optimality as , we have:

and we can solve above linear system equations to obtain:

where we define which still depends on and . Thus we plug the expressions of and back into the definition of , and achieve the expression of which is fully comprised of fixed variables. Hence, the solution is:

(17) |

where .

Case (2): When the objective function in (B) is not differentiable, we need to solve the quadratic problem with the constraint . It yields , , which is in the same form of Eq. (B) if .

To determine which of above differentiable conditions is satisfied, we compare resulting objective values in two cases. Denote and as the optimal objective value in above two cases, we compute their difference as:

From the definition of and , we can derive . Combining above, we have:

which shows that unless . Therefore, we can have a unified solution as Eq. (B).

## Appendix C Proof of Lemma 2.2

Based on rigorous analysis in [boyd2011distributed] and [eckstein1992douglas], we know that the convergence to the global optimum is guaranteed for ADMM algorithms on problems of the following form:

(18) |

where both and need to be convex functions and constraints are linear. Note that our problem in Eq.(2.2) is equivalent to Eq.(2.5). To satisfy conditions of convergence guarantee, we need to fit Eq.(2.5) to the form in Eq.(18).

We let and let , and further define and . Next, is convex because each cmponent is required to be convex in our problem setting. is also convex in terms of the joint variable , by the definition of convexity and the triangle inequality. Lastly, the constraint is also linear in terms of entries in and . Summing up, we can see that Eq.(2.5) fits the form of Eq.(18). Then by theorems in [boyd2011distributed] and [eckstein1992douglas], Our ADMM approach to solve the problem Eq.(2.2) is guaranteed to converge to the global optimum.

## Appendix D Pseudo-code for Solving DANR Problem

## Appendix E Solution of ST-DANR

### e.1 Batch Formulation

Similar to the spatial only case in §2, we define = for model parameter vectors on temporal undirected network , and define consensus variables

as copies of in the corresponding spatial penalty term and temporal penalty term . Then we rewrite the problem as follows:

(19) | ||||

s.t. | ||||

The iterative scheme of ADMM under the above setting is modified as follows, with denoting the iteration index:

(20) |

where =,, is composed of replicated elements in .

-Update. Now we need to decompose the problem of minimizing into separately minimizing for each node :

(21) |

()-Update. The ()-update step in Eq. (20) can be decoupled into separate updates for a spatial ()-update and a temporal ()-update. For spatial ()-update, we have: