1 Introduction
The data generated in many important application domains have an intrinsic network structure. Such networked data arises in the study of social networks, text document collections and personalized medicine Chang and Blei [2009], Barabási et al. [2010], Zachary [1977]. Network science provides powerful tools for the analysis of such data based on its intrinsic network structure Newman [2010], Cui et al. [2016]. However, the network structure of datasets is complemented by the information contained in attributes (such as features or labels) of individual data points Chang and Blei [2009].
In this paper, we study a particular class of statistical models for networked data which are based on modeling the statistics of data attributes using exponential families Brown [1986], Wainwright and Jordan [2008], Jung et al. [2014]. The exponential families describing the individual data points are coupled via the network structure underlying the data. The resulting networked exponential families allow us to jointly capitalize on the network structure and the statistical properties of features and labels assigned to individual data points.
Our approach extends prior work on networked linear and logistic regression models
Jung and Tran [2019], M.Yamada et al. [2017], Ambos et al. [2018b]. Indeed, the proposed networked exponential family model contains linear and logistic regression as special cases. In contrast to Chang and Blei [2009], which formulates a probabilistic model for the network structure, we consider the network structure as fixed and known. Closest to our work is Li et al. [2019], which also considers networked models but uses a different smoothness measure for tying the models of neighbouring data points. In particular, while Li et al. [2019] uses the graph Laplacian quadratic form as a smoothness measure, our approach controls the nonsmooth total variation of the model parameters.
The main theme of this paper is the application of the network Lasso (nLasso) to learning networked exponential families (see Figure 1). The nLasso has been proposed recently as a natural extension of the least absolute shrinkage and selection operator (Lasso) to networked data Hallac et al. [2015], Hastie et al. [2015]. We show how the nLasso can be implemented efficiently using a primal-dual splitting method for convex optimization. The resulting scalable learning method amounts to a message passing protocol over the data network structure.
Outline. The rest of the paper is organized as follows. We start in Section 2 with modeling networked high-dimensional data points using empirical graphs whose nodes are equipped with individual probabilistic models, forming networked exponential families. In Section 3 we detail how some well-known networked models, such as linear and logistic regression, are obtained as special cases of networked exponential families. As discussed in Section 4, the nLasso provides a principled approach to learning networked exponential families via regularized empirical risk minimization. The resulting learning problem can be solved by applying an efficient primal-dual method for nonsmooth convex optimization (see Section 5). We also discuss how to cope with partially observed exponential families, which is relevant for many popular topic (latent variable) models such as latent Dirichlet allocation.
Contribution. Our main contributions are: (i) We introduce networked exponential families as a novel statistical model for networked high-dimensional data. (ii) We develop a scalable method for learning networked exponential families. In particular, we apply a primal-dual method for convex optimization and verify the convergence of the resulting method, which can be implemented as message passing over the data network.
Notation.
Boldface lowercase (uppercase) letters denote vectors (matrices). We denote the transpose of a vector x by x^T. The norm of a vector x is ‖x‖, and the spectral norm of a matrix M is ‖M‖₂. The convex conjugate of a function g is g*(v) := sup_u (⟨u, v⟩ − g(u)).

2 Networked Exponential Families
We consider networked data that is represented by an undirected weighted graph G = (V, E, A) (the “empirical graph”). A particular node i ∈ V of the graph represents an individual data point (such as a document, or a social network user profile).¹ Two different data points i, j ∈ V are connected by an undirected edge {i, j} if they are considered similar (such as documents authored by the same person or social network profiles of befriended users). For ease of notation, we denote the edge set by E.

¹With a slight abuse of notation, we refer by i to a node of the empirical graph as well as to the data point which is represented by that node.
Each edge {i, j} ∈ E is endowed with a positive weight A_{i,j} > 0 which quantifies the amount of similarity between the data points i and j. The neighborhood of a node i ∈ V is N(i) := {j ∈ V : {i, j} ∈ E}.
Besides the network structure, datasets convey additional information in the form of attributes z^{(i)} associated with each data point i ∈ V. For a document collection, with data points representing text documents, the attributes could be frequencies of particular words Chang and Blei [2009].
We model the attributes z^{(i)} of the data points i ∈ V as independent random variables distributed according to an exponential family Wainwright and Jordan [2008], Brown [1986]

p(z^{(i)}; w^{(i)}) = h(z^{(i)}) exp( ⟨w^{(i)}, t(z^{(i)})⟩ − Φ(w^{(i)}) ).   (2.1)
The exponential family (2.1) is defined by the map t(·), which is known as the sufficient statistic or potential function Wainwright and Jordan [2008]. The function Φ(·) is known as the log partition function or cumulant function Wainwright and Jordan [2008].
Note that the distribution (2.1) is parametrized by the (unknown) weight vectors w^{(i)}, for i ∈ V. The networked exponential family combines the node-wise models (2.1) by requiring the weight vectors to be similar for well-connected data points. In particular, we require the weight vectors to have small total variation (TV)

‖w‖_TV := ∑_{{i,j} ∈ E} A_{i,j} ‖w^{(i)} − w^{(j)}‖.   (2.2)
Requiring small TV of the weight vectors w^{(i)}, for i ∈ V, typically implies that the weight vectors are approximately constant over well-connected subsets (clusters) of nodes.
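For concreteness, the TV (2.2) can be evaluated directly from an edge list; the following sketch (with an illustrative two-cluster toy graph, all names chosen for illustration only) shows the quantity being penalized:

```python
import numpy as np

def total_variation(weights, edges):
    """TV (2.2): sum of A_ij * ||w_i - w_j|| over all edges."""
    return sum(a * np.linalg.norm(weights[i] - weights[j])
               for i, j, a in edges)

# toy graph: two tight clusters {0, 1} and {2, 3} joined by a weak edge
w = {0: np.array([1.0, 0.0]), 1: np.array([1.0, 0.0]),
     2: np.array([0.0, 1.0]), 3: np.array([0.0, 1.0])}
edges = [(0, 1, 1.0), (2, 3, 1.0), (1, 2, 0.1)]
tv = total_variation(w, edges)   # only the weak bridge edge contributes
```

Since the weight vectors are constant within each cluster, the TV reduces to the contribution of the single bridge edge, illustrating why small TV favors clustered weight vectors.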
3 Some Examples
Before we turn to the problem of learning networked exponential families (2.1) in Section 4, we now discuss some important special cases of the model (2.1).
3.1 Networked Linear Regression
Consider a networked dataset whose data points i ∈ V are characterized by features x^{(i)} ∈ R^d and numeric labels y^{(i)} ∈ R. Maybe the most basic (yet quite useful) model for the relation between features and labels is the linear model

y^{(i)} = ⟨x^{(i)}, w^{(i)}⟩ + ε^{(i)},   (3.1)

with Gaussian noise ε^{(i)} ∼ N(0, σ²) of known variance σ². The linear model (3.1), for each node i ∈ V, is parametrized by the weight vector w^{(i)}. The networked linear regression model requires the weight vectors in the individual linear models (3.1) to have a small TV (2.2) Jung and Tran [2019], M.Yamada et al. [2017].

3.2 Networked Logistic Regression
Consider a networked dataset whose data points i ∈ V are characterized by features x^{(i)} ∈ R^d and binary labels y^{(i)} ∈ {0, 1}. Maybe the most basic (yet quite useful) model for the relation between features and labels is the logistic model

p(y^{(i)} = 1; w^{(i)}) = 1 / (1 + exp(−⟨x^{(i)}, w^{(i)}⟩)).   (3.2)

The logistic regression model (3.2) is parametrized by the weight vector w^{(i)} for each node i ∈ V. The networked logistic regression model requires the weight vectors in the node-wise logistic regression models (3.2) to have a small TV (2.2) Ambos et al. [2018a].
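As an illustration, the probability assigned by the logistic model (3.2) at a single node can be computed as follows; labels are encoded here as y ∈ {0, 1}, and the feature values are illustrative:

```python
import numpy as np

def logistic_prob(x, w):
    """P(y = 1; w) under the logistic model (3.2) for feature vector x."""
    return 1.0 / (1.0 + np.exp(-(x @ w)))

x = np.array([1.0, 2.0])
w = np.zeros(2)
p = logistic_prob(x, w)   # zero weights give probability 1/2
```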
3.3 Networked Latent Dirichlet Allocation
Consider a networked dataset representing a collection of text documents (such as scientific articles). The latent Dirichlet allocation model (LDA) is a probabilistic model for the relative frequencies of words in a document Wainwright and Jordan [2008], Blei et al. [2003]. Within LDA, each document is considered a blend of different topics. Each topic has a characteristic distribution of the words in the vocabulary.
A simplified form of LDA represents each document i ∈ V, containing a fixed number of “words”, by two sequences of multinomial random variables: the observed word variables, each taking one of the elementary words of a vocabulary of fixed size, and the latent topic variables, each taking one of a fixed number of different topics. It can be shown that LDA is a special case of the exponential family (2.1) with particular choices for the sufficient statistic t(·) and the log partition function Φ(·) (see Wainwright and Jordan [2008], Blei et al. [2003]).
4 Network Lasso
Our goal is to develop a method for learning an accurate estimate ŵ^{(i)} for the weight vectors w^{(i)} at all nodes i ∈ V. The learning of the weight vectors is based on the availability of the node attributes z^{(i)} for a small sampling set M ⊆ V. A reasonable estimate for the weight vectors can be obtained by maximizing the likelihood of observing the attributes z^{(i)}, for i ∈ M:

ŵ ∈ argmax_w ∏_{i ∈ M} p(z^{(i)}; w^{(i)}).   (4.1)

It is easy to show that maximizing (4.1) is equivalent to minimizing the empirical risk

Ê(w) := ∑_{i ∈ M} ( Φ(w^{(i)}) − ⟨w^{(i)}, t(z^{(i)})⟩ ).   (4.2)
The criterion (4.2) by itself is not sufficient to learn the weights at all nodes i ∈ V, since it completely ignores the weights at the unobserved nodes i ∈ V ∖ M. Therefore, we need to impose some additional structure on the weight vectors. In particular, any reasonable estimate should conform with the cluster structure of the empirical graph Newman [2010].
The network structure of data arising in important applications is organized as clusters (or communities), which are well-connected subsets of nodes Fortunato [2010]. Many methods of (supervised) machine learning are motivated by a clustering assumption: nodes belonging to the same cluster represent similar data points. We implement this clustering assumption by requiring the parameter vectors w^{(i)} in (2.1) to have a small TV (2.2). We are thus led quite naturally to learning the weights for the networked exponential family via the regularized empirical risk minimization (ERM)

ŵ ∈ argmin_w Ê(w) + λ ‖w‖_TV.   (4.3)
The learning problem (4.3) is an instance of the generic nLasso problem Hallac et al. [2015]. The parameter λ ≥ 0 in (4.3) allows us to trade off small TV against a small empirical risk (cf. (4.2)). The choice of λ can be guided by cross-validation Hastie et al. [2001].
It will be convenient to reformulate (4.3) using vector notation. We represent a graph signal, i.e., the collection of weight vectors w^{(i)} for i ∈ V, as the stacked vector

w := ( (w^{(1)})^T, …, (w^{(|V|)})^T )^T.   (4.4)
Define a partitioned matrix D block-wise as follows: for each edge e = {i, j} ∈ E, the e-th block row of D contains the block A_{i,j} I at block position i, the block −A_{i,j} I at block position j, and zero blocks elsewhere,   (4.5)

where I is the identity matrix. The term A_{i,j} (w^{(i)} − w^{(j)}) appearing in (2.2) is then the e-th block of Dw. Using (4.4) and (4.5), we can reformulate the nLasso (4.3) as

ŵ ∈ argmin_w Ê(w) + g(Dw)   (4.6)

with

g(u) := λ ∑_{e ∈ E} ‖u^{(e)}‖   (4.7)

with the stacked vector u := ( (u^{(1)})^T, …, (u^{(|E|)})^T )^T.
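For small graphs, the partitioned matrix of (4.5) can be assembled explicitly; the following sketch (dense matrices, illustrative names) verifies that the block row of an edge maps the stacked weight vector to the weighted difference of the two incident weight vectors:

```python
import numpy as np

def block_incidence(num_nodes, edges, p):
    """Assemble the partitioned matrix D of (4.5) as a dense array.

    edges: list of (i, j, A_ij) tuples; p: dimension of the weight vectors.
    The block row of edge e = (i, j) holds A_ij * I at block i and -A_ij * I
    at block j, so the e-th block of D @ w equals A_ij * (w_i - w_j).
    """
    D = np.zeros((len(edges) * p, num_nodes * p))
    I = np.eye(p)
    for e, (i, j, a) in enumerate(edges):
        D[e * p:(e + 1) * p, i * p:(i + 1) * p] = a * I
        D[e * p:(e + 1) * p, j * p:(j + 1) * p] = -a * I
    return D

# single edge {0, 1} with weight 2 and 2-dimensional weight vectors
D = block_incidence(2, [(0, 1, 2.0)], 2)
w = np.array([1.0, 0.0, 0.0, 1.0])   # stacked w_0 = (1, 0), w_1 = (0, 1)
```

In practice D would be stored as a sparse matrix; the dense construction here only serves to make the block structure of (4.5) explicit.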
5 Efficient Implementation
The nLasso (4.6) is a convex optimization problem with a nonsmooth objective function, which rules out the use of gradient descent methods Jung [2017]. However, the objective function is highly structured since it is the sum of the smooth convex function Ê(w) and the nonsmooth convex function g(Dw), each of which can be handled efficiently when considered separately. This suggests using a proximal method Combettes and Pesquet [2011], Parikh and Boyd [2013] for solving (4.6).
One particular example of a proximal method is the alternating direction method of multipliers (ADMM), which has been considered in Hallac et al. [2015]. However, we will choose another proximal method which is based on a dual problem to (4.6). Based on this dual problem, efficient primal-dual methods have been proposed recently Pock and Chambolle [2011], Chambolle and Pock [2011]. These methods are attractive since their analysis provides natural choices for the algorithm parameters. In contrast, tuning the ADMM parameters is non-trivial Nishihara et al. [2015].
5.1 Primal-Dual Method
The preconditioned primal-dual method Pock and Chambolle [2011] starts from reformulating the problem (4.6) as a saddle-point problem

min_w max_u ⟨Dw, u⟩ + Ê(w) − g*(u),   (5.1)

with g* the convex conjugate of g Chambolle and Pock [2011].
Any solution (ŵ, û) of (5.1) is characterized by [Rockafellar, 1970, Thm 31.3]

−D^T û ∈ ∂Ê(ŵ), Dŵ ∈ ∂g*(û).   (5.2)
This condition is, in turn, equivalent to

ŵ = (I + T ∂Ê)^{−1}( ŵ − T D^T û ), û = (I + Σ ∂g*)^{−1}( û + Σ D ŵ ),   (5.3)

with positive definite matrices T and Σ. These matrices are design parameters whose choice will be detailed below. The condition (5.3) lends itself naturally to the following coupled fixed point iterations Pock and Chambolle [2011]
ŵ^{(k+1)} = (I + T ∂Ê)^{−1}( ŵ^{(k)} − T D^T û^{(k)} ),   (5.4)
û^{(k+1)} = (I + Σ ∂g*)^{−1}( û^{(k)} + Σ D (2ŵ^{(k+1)} − ŵ^{(k)}) ).   (5.5)
If the matrices T and Σ in (5.4), (5.5) satisfy

‖Σ^{1/2} D T^{1/2}‖₂ ≤ 1,   (5.6)

the sequences ŵ^{(k)}, û^{(k)} obtained from iterating (5.4) and (5.5) converge to a saddle point of the problem (5.1) [Pock and Chambolle, 2011, Thm. 1]. The condition (5.6) is satisfied for the choice T = diag(τ^{(i)} I)_{i ∈ V} and Σ = diag(σ^{(e)} I)_{e ∈ E} with τ^{(i)} = 1/d^{(i)}, σ^{(e)} = 1/(2 A_{i,j}) for e = {i, j}, and the node degree d^{(i)} := ∑_{j ∈ N(i)} A_{i,j} [Pock and Chambolle, 2011, Lem. 2].
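The iterations (5.4), (5.5) can be sketched for the special case of scalar weights and a squared-error node-wise loss (a networked linear Gaussian model with identity features), for which both resolvent operators are available in closed form. All names, the toy graph, and the step-size choices below are illustrative assumptions; the constant dual step presumes edge weights of at most one:

```python
import numpy as np

def nlasso_primal_dual(n, edges, y, sampled, lam, num_iter=5000):
    """Preconditioned primal-dual iterations (5.4), (5.5) for scalar weights
    and the node-wise loss (w_i - y_i)^2 / 2 at sampled nodes."""
    deg = np.zeros(n)
    for i, j, a in edges:
        deg[i] += a
        deg[j] += a
    tau = 0.9 / np.maximum(deg, 1e-12)   # node-wise primal steps
    sigma = 0.45                          # edge-wise dual step (weights <= 1)
    w = np.zeros(n)
    u = np.zeros(len(edges))
    w_bar = np.zeros(n)
    for _ in range(num_iter):
        # dual update (5.5): ascent step, then clipping to [-lam, lam]
        for e, (i, j, a) in enumerate(edges):
            u[e] = np.clip(u[e] + sigma * a * (w_bar[i] - w_bar[j]), -lam, lam)
        # primal update (5.4): descent step, then prox of the node-wise loss
        w_new = w.copy()
        for e, (i, j, a) in enumerate(edges):
            w_new[i] -= tau[i] * a * u[e]
            w_new[j] += tau[j] * a * u[e]
        for i in sampled:
            w_new[i] = (w_new[i] + tau[i] * y[i]) / (1.0 + tau[i])
        w_bar = 2.0 * w_new - w   # over-relaxation of the primal iterate
        w = w_new
    return w

# chain graph with two clusters joined by a weak edge; labels observed
# only at the two end nodes
edges = [(0, 1, 1.0), (1, 2, 0.1), (2, 3, 1.0)]
y = np.array([1.0, 0.0, 0.0, -1.0])
w_hat = nlasso_primal_dual(4, edges, y, sampled=[0, 3], lam=0.1)
```

The recovered weights are (nearly) constant within each cluster, which is the clustered behavior that the TV penalty in (4.3) is designed to produce.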
The update (5.5) involves the resolvent operator

(I + Σ ∂g*)^{−1}(v),   (5.7)

where v := û^{(k)} + Σ D (2ŵ^{(k+1)} − ŵ^{(k)}). The convex conjugate g* of g (see (4.7)) can be decomposed edge-wise as g*(u) = ∑_{e ∈ E} g*_e(u^{(e)}), with g*_e the convex conjugate of the scaled norm λ‖·‖, i.e., the indicator function of the ball {u : ‖u‖ ≤ λ}. Moreover, since Σ is a block diagonal matrix, the e-th block of the resolvent operator (5.7) can be obtained by the Moreau decomposition as the projection [Parikh and Boyd, 2013, Sec. 6.5]

( (I + Σ ∂g*)^{−1}(v) )^{(e)} = v^{(e)} / max{1, ‖v^{(e)}‖/λ} for each e ∈ E.
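For a single edge block, the resulting operation is simply the projection onto the ball of radius λ; a minimal sketch (illustrative function name):

```python
import numpy as np

def prox_g_conj_block(v, lam):
    """e-th block of the resolvent of g*: projection of v onto the
    ball of radius lam, obtained via the Moreau decomposition."""
    norm = np.linalg.norm(v)
    return v if norm <= lam else (lam / norm) * v
```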
The update (5.4) involves the resolvent operator of Ê (see (4.2)), which does not have a closed-form solution in general. Still, for the block diagonal choice T = diag(τ^{(i)} I)_{i ∈ V}, the update (5.4) decomposes into separate node-wise updates

ŵ^{(i),k+1} = argmin_w ℓ^{(i)}(w) + (1/(2τ^{(i)})) ‖w − v^{(i)}‖²,   (5.8)

with the node-wise loss ℓ^{(i)}(w) := Φ(w) − ⟨w, t(z^{(i)})⟩ for sampled nodes i ∈ M (and ℓ^{(i)} := 0 otherwise) and v^{(i)} := ŵ^{(i),k} − τ^{(i)} (D^T û^{(k)})^{(i)}.
In general, it is not possible to compute the update (5.8) exactly. A notable exception are networked linear Gaussian models, for which (5.8) amounts to matrix multiplications M.Yamada et al. [2017], Jung and Vesselinova [2019]. However, since (5.8) amounts to the unconstrained minimization of a smooth and convex objective function, we can apply efficient convex optimization methods Boyd and Vandenberghe [2004]. In fact, the update (5.8) is a regularized maximum likelihood problem for the exponential family (2.1). It can be solved efficiently using quasi-Newton methods such as L-BFGS Andrew and Gao [2007], Mokhtari and Ribeiro [2015]. We will detail a particular iterative method for approximately solving (5.8) in Section 5.2.
Let us denote the approximate solution to (5.8) by w̃^{(i),k+1} and assume that it is sufficiently accurate such that

‖w̃^{(i),k+1} − ŵ^{(i),k+1}‖ ≤ ε^{(k)}, with a summable error sequence ∑_k ε^{(k)} < ∞.   (5.9)

Thus, we require the approximation quality (for approximating the update (5.8)) to increase with the iteration number k. According to [Condat, 2013, Thm. 3.2], the error bound (5.9) ensures that the sequences obtained by (5.4) and (5.5), when replacing the exact update (5.8) with the approximation, still converge to a saddle point of (5.1) and, in turn, a solution of the nLasso problem (4.6).
5.2 Approximate Primal-Dual Steps
Let us detail here a simple iterative method for computing an approximate solution to the update (5.8). By standard convex analysis (see Boyd and Vandenberghe [2004]), any solution ŵ^{(i),k+1} of (5.8) is characterized by the zero gradient condition

∇h(w) = 0,   (5.10)

with the objective function of (5.8),

h(w) := ℓ^{(i)}(w) + (1/(2τ^{(i)})) ‖w − v^{(i)}‖²,   (5.11)

where ℓ^{(i)} denotes the node-wise loss term of (5.8) and v^{(i)} the vector around which the quadratic term of (5.8) is centered. Inserting (5.11) into (5.10), and using some basic calculus, yields

w = v^{(i)} − τ^{(i)} ∇ℓ^{(i)}(w).   (5.12)

The condition (5.12), which is necessary and sufficient for w to solve (5.8), is a fixed point equation w = F(w) with the map

F(w) := v^{(i)} − τ^{(i)} ∇ℓ^{(i)}(w).   (5.13)
We will use the Hessian ∇²Φ(w) of the log partition function Φ, with entries

(∇²Φ(w))_{l,l'} := ∂²Φ(w) / (∂w_l ∂w_{l'}).   (5.14)

According to the mean-value theorem [Rudin, 1976, Thm. 9.19.], the map F in (5.13) is Lipschitz continuous with Lipschitz constant κ := τ^{(i)} sup_w ‖∇²Φ(w)‖₂. Thus, if we choose the step size τ^{(i)} such that

κ = τ^{(i)} sup_w ‖∇²Φ(w)‖₂ < 1,   (5.15)

the map F is a contraction and the fixed-point iteration

w^{(r+1)} := F(w^{(r)})   (5.16)

will converge to the solution of (5.8).
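The fixed-point iteration (5.16) can be sketched for the logistic regression model of Section 3.2, where the gradient of the node-wise loss is available in closed form. The names below are illustrative, and the step size is chosen so that (5.15) holds (for the logistic model the Hessian norm is at most ‖x‖²/4):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def node_update(x, y, v, tau, num_iter=100):
    """Fixed-point iteration (5.16) for the node-wise update (5.8),
    instantiated for logistic regression with label y in {0, 1}.
    Requires tau * ||x||^2 / 4 < 1 so that (5.15) holds."""
    w = v.copy()
    for _ in range(num_iter):
        grad = (sigmoid(x @ w) - y) * x   # gradient of the node-wise loss
        w = v - tau * grad                # w <- F(w), cf. (5.13)
    return w

x = np.array([1.0, 1.0])
v = np.zeros(2)
w = node_update(x, y=1.0, v=v, tau=0.5)
```

Here tau * ||x||^2 / 4 = 0.25 < 1, so the iteration is a contraction and the returned vector satisfies the stationarity condition (5.12) up to machine precision.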
Moreover, if (5.15) is satisfied, we can bound the deviation between the iterate w^{(r)} and the (unique) solution w* of (5.12) as (see [Rudin, 1976, Proof of Thm. 9.23])

‖w^{(r)} − w*‖ ≤ ( κ^r / (1 − κ) ) ‖w^{(1)} − w^{(0)}‖.   (5.17)
Thus, if we use the approximation w^{(r)} for the update (5.8), we can ensure (5.9) by iterating (5.16) for at least

r ≥ log( ε^{(k)} (1 − κ) / ‖w^{(1)} − w^{(0)}‖ ) / log κ   (5.18)

iterations.
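The required number of iterations can be computed directly from the contraction factor; the following sketch implements a generic geometric-contraction bound of the form (5.17), with illustrative names:

```python
import math

def min_iterations(q, init_gap, tol):
    """Smallest r with q**r * init_gap / (1 - q) <= tol, for a
    contraction factor q in (0, 1), cf. the error bound (5.17)."""
    if tol * (1.0 - q) >= init_gap:
        return 0
    return math.ceil(math.log(tol * (1.0 - q) / init_gap) / math.log(q))
```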
Note that computing the iterates (5.16) requires the evaluation of the gradient ∇Φ of the log partition function Φ. According to [Wainwright and Jordan, 2008, Prop. 3.1.], this gradient is given by the expectation of the sufficient statistic under the distribution p(z; w):

∇Φ(w) = E_w[ t(z) ].   (5.19)
Moreover, the Hessian (5.14) is obtained as the covariance matrix of the sufficient statistic [Wainwright and Jordan, 2008, Prop. 3.1.]

∇²Φ(w) = Cov_w[ t(z) ] = E_w[ t(z) t(z)^T ] − E_w[ t(z) ] E_w[ t(z) ]^T.   (5.20)
In general, the expectations (5.19) and (5.20) cannot be computed exactly in closed form. A notable exception are exponential families obtained from a probabilistic graphical model defined on a triangulated graph, such as a tree. In this case it is possible to compute (5.19) in closed form (see [Wainwright and Jordan, 2008, Sec. 2.5.2]). Another special case of (2.1) for which (5.19) and (5.20) can be evaluated in closed form is linear and logistic regression (see Section 3).
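For the Bernoulli family (a special case of logistic regression without features, with sufficient statistic t(z) = z), the identities (5.19) and (5.20) can be verified numerically; the finite-difference check below is purely illustrative:

```python
import numpy as np

def phi(theta):
    """Log partition function of the Bernoulli family with t(z) = z."""
    return np.log1p(np.exp(theta))

theta = 0.7
p1 = np.exp(theta) / (1.0 + np.exp(theta))  # P(z = 1; theta)
mean_t = p1                                  # E[t(z)], cf. (5.19)
var_t = p1 * (1.0 - p1)                      # Cov[t(z)], cf. (5.20)

# finite-difference approximations of the gradient and Hessian of phi
eps = 1e-5
grad_phi = (phi(theta + eps) - phi(theta - eps)) / (2 * eps)
eps = 1e-4
hess_phi = (phi(theta + eps) - 2 * phi(theta) + phi(theta - eps)) / eps ** 2
```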
5.3 Partially Observed Models
The learning Algorithm 1 can be adapted easily to cope with partially observed exponential families Wainwright and Jordan [2008]. In particular, for the networked LDA described in Section 3, we typically have access only to the word variables of some documents i ∈ M. However, for (approximately) computing the update step (5.8), we would also need the values of the topic variables, but those are not observed since they are latent (hidden) variables. In this case we can approximate (5.8) by some “Expectation-Maximization” (EM) principle (see [Wainwright and Jordan, 2008, Sec. 6.2]). An alternative to EM methods, based on the method of moments, for learning (latent variable) topic models has been studied in a recent line of work Arora et al. [2016], Anandkumar et al. [2012].

6 Conclusion
We have introduced networked exponential families as a flexible statistical modeling paradigm for networked data. Individual data points are modeled by exponential families whose parameters are coupled across connected data points by requiring a small TV. An efficient method for learning networked exponential families is obtained by applying a primaldual method to solve the nonsmooth nLasso problem.
References
 Ambos et al. [2018a] H. Ambos, N. Tran, and A. Jung. Classifying big data over networks via the logistic network lasso. In Proc. 52nd Asilomar Conference on Signals, Systems, and Computers, 2018a. doi: 10.1109/ACSSC.2018.8645260.
 Ambos et al. [2018b] H. Ambos, N. Tran, and A. Jung. Classifying big data over networks via the logistic network lasso. In Proc. 52nd Asilomar Conf. Signals, Systems, Computers, Oct./Nov. 2018b.
 Anandkumar et al. [2012] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, and L. Yikai. A spectral algorithm for latent dirichlet allocation. In Advances in Neural Information Processing Systems 25, pages 917–925, 2012.
 Andrew and Gao [2007] G. Andrew and J. Gao. Scalable training of l1-regularized log-linear models. In Proc. of the 24th International Conference on Machine Learning (ICML), Corvallis, OR, 2007.
 Arora et al. [2016] S. Arora, R. Ge, F. Koehler, T. Ma, and A. Moitra. Provable algorithms for inference in topic models. In Proc. 33rd Int. Conf. Mach. Learn. (ICML), New York, NY, USA, 2016.
 Barabási et al. [2010] A. Barabási, N. Gulbahce, and J. Loscalzo. Network medicine: a networkbased approach to human disease. Nature Reviews Genetics, 12(56), 2010.
 Blei et al. [2003] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Jan. 2003.
 Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, Cambridge, UK, 2004.
 Brown [1986] L. D. Brown. Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, Hayward, CA, 1986.
 Chambolle and Pock [2011] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis., 40(1), 2011. ISSN 0924-9907. doi: 10.1007/s1085101002511. URL http://dx.doi.org/10.1007/s1085101002511.
 Chang and Blei [2009] J. Chang and D. M. Blei. Relational topic models for document networks. In Proc. of the 12th Int. Conf. on Art. Int. Stat. (AISTATS), Florida, USA, 2009.
 Combettes and Pesquet [2011] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In H. Bauschke, R. Burachik, P. Combettes, V. Elser, D. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, volume 49. Springer New York, 2011.
 Condat [2013] L. Condat. A primal–dual splitting method for convex optimization involving lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 158(2):460–479, Aug. 2013.
 Cui et al. [2016] S. Cui, A. Hero, Z.Q. Luo, and J. Moura, editors. Big Data over Networks. Cambridge Univ. Press, 2016.
 Fortunato [2010] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, Feb. 2010.
 Hallac et al. [2015] D. Hallac, J. Leskovec, and S. Boyd. Network lasso: Clustering and optimization in large graphs. In Proc. SIGKDD, pages 387–396, 2015.
 Hastie et al. [2001] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA, 2001.
 Hastie et al. [2015] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity. The Lasso and its Generalizations. CRC Press, 2015.
 Jung [2017] A. Jung. A fixedpoint of view on gradient methods for big data. Frontiers in Applied Mathematics and Statistics, 3, 2017. doi: 10.3389/fams.2017.00018. URL https://www.frontiersin.org/article/10.3389/fams.2017.00018.
 Jung and Tran [2019] A. Jung and N. Tran. Localized linear regression in networked data. IEEE Sig. Proc. Letters, 2019.
 Jung and Vesselinova [2019] A. Jung and N. Vesselinova. Analysis of network lasso for semisupervised regression. In The 22nd Int. Conf. Art. Int. Stat. (AISTATS), Okinawa, Japan, April 2019.
 Jung et al. [2014] A. Jung, S. Schmutzhard, and F. Hlawatsch. The RKHS approach to minimum variance estimation revisited: Variance bounds, sufficient statistics, and exponential families. IEEE Trans. Inf. Theory, 60(7):4050–4065, Jul. 2014.
 Li et al. [2019] T. Li, E. Levina, and J. Zhu. Prediction models for networklinked data. The Annals of Applied Statistics, 2019.
 Mokhtari and Ribeiro [2015] A. Mokhtari and A. Ribeiro. Global convergence of online limited memory BFGS. Jour. Mach. Learning Res., 16:3151–3181, Dec. 2015.

 M.Yamada et al. [2017] M. Yamada, T. Koh, T. Iwata, J. Shawe-Taylor, and S. Kaski. Localized lasso for high-dimensional regression. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pages 325–333, Fort Lauderdale, FL, USA, Apr. 2017. PMLR.
 Newman [2010] M. E. J. Newman. Networks: An Introduction. Oxford Univ. Press, 2010.
 Nishihara et al. [2015] R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. I. Jordan. A general analysis of the convergence of ADMM. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, Lille, France, 2015.
 Parikh and Boyd [2013] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.

 Pock and Chambolle [2011] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, Nov. 2011.
 Rockafellar [1970] R. T. Rockafellar. Convex Analysis. Princeton Univ. Press, Princeton, NJ, 1970.
 Rudin [1976] W. Rudin. Principles of Mathematical Analysis. McGrawHill, New York, 3 edition, 1976.
 Wainwright and Jordan [2008] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference, volume 1 of Foundations and Trends in Machine Learning. Now Publishers, Hanover, MA, 2008.
 Zachary [1977] W. W. Zachary. An information flow model for conflict and fission in small groups. J. Anthro. Research, 33(4):452–473, 1977.