Learning Networked Exponential Families with Network Lasso

05/22/2019, by Alexander Jung, et al.

The data arising in many important big-data applications, ranging from social networks to network medicine, consist of high-dimensional data points related by an intrinsic (complex) network structure. In order to jointly leverage the information conveyed in the network structure as well as the statistical power contained in high-dimensional data points, we propose networked exponential families. We apply the network Lasso to learn networked exponential families as a probabilistic model for heterogeneous datasets with intrinsic network structure. In order to allow for accurate learning from high-dimensional data we borrow statistical strength, via the intrinsic network structure, across the dataset. The resulting method aims at regularized empirical risk minimization using the total variation of the model parameters as regularizer. This minimization problem is a non-smooth convex optimization problem which we solve using a primal-dual splitting method. This method is appealing for big data applications as it can be implemented as a highly scalable message passing algorithm.


1 Introduction

The data generated in many important application domains have an intrinsic network structure. Such networked data arises in the study of social networks, text document collections and personalized medicine Chang and Blei [2009], Barabási et al. [2010], Zachary [1977]. Network science provides powerful tools for analyzing such data based on its intrinsic network structure Newman [2010], Cui et al. [2016]. This network structure is, however, complemented by the information contained in the attributes (such as features or labels) of individual data points Chang and Blei [2009].

In this paper, we study a particular class of statistical models for networked data which are based on modeling the statistics of data attributes using exponential families Brown [1986], Wainwright and Jordan [2008], Jung et al. [2014]. The exponential families describing the individual data points are coupled via the network structure underlying the data. The resulting networked exponential families allow us to jointly capitalize on the network structure and the statistical properties of the features and labels assigned to individual data points.

Our approach extends prior work on networked linear and logistic regression models Jung and Tran [2019], M.Yamada et al. [2017], Ambos et al. [2018b]. Indeed, the proposed networked exponential family model contains linear and logistic regression as special cases. In contrast to Chang and Blei [2009], which formulates a probabilistic model for the network structure, we consider the network structure as fixed and known. The closest to our work is Li et al. [2019], which considers networked models but uses a different smoothness measure for tying together the models of neighbouring data points. In particular, while Li et al. [2019] uses the graph Laplacian quadratic form as a smoothness measure, our approach controls the non-smooth total variation of the model parameters.

The main theme of this paper is the application of the network Lasso (nLasso) to learning networked exponential families (see Figure 1). The nLasso has been proposed recently as a natural extension of the least absolute shrinkage and selection operator (Lasso) to networked data Hallac et al. [2015], Hastie et al. [2015]. We show how the nLasso can be implemented efficiently using a primal-dual splitting method for convex optimization. The resulting scalable learning method amounts to a message passing protocol over the data network structure.

Outline. The rest of the paper is organized as follows. We start in Section 2 by modeling networked high-dimensional data points using an empirical graph whose nodes are equipped with individual probabilistic models, forming a networked exponential family. In Section 3 we detail how some well-known networked models, such as linear and logistic regression, are obtained as special cases of networked exponential families. As discussed in Section 4, the nLasso provides a principled approach to learning networked exponential families via regularized empirical risk minimization. The resulting learning problem can be solved by applying an efficient primal-dual method for non-smooth convex optimization (see Section 5). We also discuss how to cope with partially observed exponential families, which is relevant for many popular topic (latent variable) models such as latent Dirichlet allocation.

Contribution. Our main contributions are: (i) We introduce networked exponential families as a novel statistical model for networked high-dimensional data. (ii) We develop a scalable method for learning networked exponential families. In particular, we apply a primal-dual method for convex optimization and verify the convergence of the resulting method, which can be implemented as message passing over the data network.

Notation.

Boldface lowercase (uppercase) letters denote vectors (matrices). We denote by $\mathbf{x}^{T}$ the transpose of a vector $\mathbf{x}$. The Euclidean norm of a vector $\mathbf{x}$ is $\|\mathbf{x}\| := \sqrt{\mathbf{x}^{T}\mathbf{x}}$. The spectral norm of a matrix $\mathbf{M}$ is $\|\mathbf{M}\|_{2} := \sup_{\|\mathbf{x}\|=1} \|\mathbf{M}\mathbf{x}\|$. The convex conjugate of a function $f(\mathbf{x})$ is $f^{*}(\mathbf{y}) := \sup_{\mathbf{x}} \big( \mathbf{y}^{T}\mathbf{x} - f(\mathbf{x}) \big)$.

2 Networked Exponential Families

Figure 1: A networked exponential family is a probabilistic model for (high-dimensional) data points related by some intrinsic network structure. The data points might represent scientific articles within a collection of articles that are related via joint authors. Another application for such models is learning analytics where data points represent the digital traces left by students within learning management systems. Network medicine considers networks of proteins which are described in large databases such as https://www.uniprot.org.

We consider networked data that is represented by an undirected weighted graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ (the "empirical graph"). A particular node $i \in \mathcal{V}$ of the graph represents an individual data point (such as a document or a social network user profile). With a slight abuse of notation, we refer by $i \in \mathcal{V}$ both to a node of the empirical graph and to the data point represented by that node. Two different data points $i, j \in \mathcal{V}$ are connected by an undirected edge $\{i, j\} \in \mathcal{E}$ if they are considered similar (such as documents authored by the same person or social network profiles of befriended users).

Each edge $\{i, j\} \in \mathcal{E}$ is endowed with a positive weight $W_{ij} > 0$ which quantifies the amount of similarity between the data points $i$ and $j$. The neighborhood of a node $i \in \mathcal{V}$ is $\mathcal{N}(i) := \{ j \in \mathcal{V} : \{i, j\} \in \mathcal{E} \}$.

Beside the network structure, datasets convey additional information in the form of attributes $\mathbf{z}^{(i)}$ associated with each data point $i \in \mathcal{V}$. For a document collection, with data points representing text documents, the attributes could be the frequencies of particular words Chang and Blei [2009].

We model the attributes $\mathbf{z}^{(i)}$ of the data points $i \in \mathcal{V}$ as independent random variables distributed according to an exponential family Wainwright and Jordan [2008], Brown [1986]

$p\big(\mathbf{z}^{(i)}; \mathbf{w}^{(i)}\big) = \exp\big( \langle \mathbf{w}^{(i)}, \mathbf{t}(\mathbf{z}^{(i)}) \rangle - \Phi(\mathbf{w}^{(i)}) \big).$   (2.1)

The exponential family (2.1) is defined by the map $\mathbf{t}(\cdot)$, which is known as the sufficient statistic or potential function Wainwright and Jordan [2008]. The function $\Phi(\cdot)$ is known as the log partition function or cumulant function Wainwright and Jordan [2008].

Note that the distribution (2.1) is parametrized by the (unknown) weight vectors $\mathbf{w}^{(i)} \in \mathbb{R}^{d}$, for $i \in \mathcal{V}$. The networked exponential family combines the node-wise models (2.1) by requiring the weight vectors to be similar for well-connected data points. In particular, we require the weight vectors to have a small total variation (TV)

$\|\mathbf{w}\|_{\rm TV} := \sum_{\{i,j\} \in \mathcal{E}} W_{ij} \, \big\| \mathbf{w}^{(i)} - \mathbf{w}^{(j)} \big\|.$   (2.2)

Requiring a small TV of the weight vectors $\mathbf{w}^{(i)}$, for $i \in \mathcal{V}$, typically implies that the weight vectors are approximately constant over well-connected subsets (clusters) of nodes.
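To make the TV regularizer (2.2) concrete, here is a minimal Python sketch that evaluates it on a toy empirical graph. The variable names (`edges`, `edge_weights`, `W`) and the toy data are our own illustrative choices, not notation from the paper.

```python
import numpy as np

def total_variation(W, edges, edge_weights):
    """TV (2.2): weighted sum of Euclidean distances between the
    weight vectors of connected nodes."""
    return sum(a * np.linalg.norm(W[i] - W[j])
               for (i, j), a in zip(edges, edge_weights))

# Toy empirical graph: two tightly connected pairs, joined by a weak edge.
edges = [(0, 1), (2, 3), (1, 2)]
edge_weights = [1.0, 1.0, 0.1]
W = np.array([[1.0, 0.0],   # node-wise weight vectors w^(i), one row per node
              [1.0, 0.0],
              [0.0, 2.0],
              [0.0, 2.0]])
print(total_variation(W, edges, edge_weights))  # small TV: piecewise constant over the two clusters
```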

3 Some Examples

Before we turn to the problem of learning networked exponential families (2.1) in Section 4, we now discuss some important special cases of the model (2.1).

3.1 Networked Linear Regression

Consider a networked dataset whose data points $i \in \mathcal{V}$ are characterized by features $\mathbf{x}^{(i)} \in \mathbb{R}^{d}$ and numeric labels $y^{(i)} \in \mathbb{R}$. Perhaps the most basic (yet quite useful) model for the relation between features and labels is the linear model

$y^{(i)} = \big(\mathbf{w}^{(i)}\big)^{T} \mathbf{x}^{(i)} + \varepsilon^{(i)},$   (3.1)

with Gaussian noise $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^{2})$ of known variance $\sigma^{2}$. The linear model (3.1), for each node $i \in \mathcal{V}$, is parametrized by the weight vector $\mathbf{w}^{(i)} \in \mathbb{R}^{d}$. The networked linear regression model requires the weight vectors in the individual linear models (3.1) to have a small TV (2.2) Jung and Tran [2019], M.Yamada et al. [2017].

The model (3.1) is obtained as a special case of the exponential family (2.1) for the scalar attribute $z^{(i)} = y^{(i)}$ with sufficient statistic $\mathbf{t}(y^{(i)}) = (y^{(i)}/\sigma^{2})\,\mathbf{x}^{(i)}$ and log partition function $\Phi(\mathbf{w}^{(i)}) = \big(\mathbf{w}^{(i)T}\mathbf{x}^{(i)}\big)^{2}/(2\sigma^{2})$.
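The following sketch spells out this reduction for a single node: the sufficient statistic, log partition function and the resulting negative log-likelihood of the Gaussian linear model. The function names and toy numbers are illustrative assumptions, not part of the paper.

```python
import numpy as np

def t_linear(y, x, sigma2):
    """Sufficient statistic t(y) = (y / sigma^2) * x for the Gaussian linear model (3.1)."""
    return (y / sigma2) * x

def Phi_linear(w, x, sigma2):
    """Log partition function Phi(w) = (w^T x)^2 / (2 sigma^2)."""
    return (w @ x) ** 2 / (2.0 * sigma2)

def neg_log_lik_linear(w, y, x, sigma2):
    """Negative log-likelihood Phi(w) - <w, t(y)>, up to terms not depending
    on w; this is the summand of the empirical risk (4.2)."""
    return Phi_linear(w, x, sigma2) - w @ t_linear(y, x, sigma2)

w = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
print(neg_log_lik_linear(w, y=0.3, x=x, sigma2=1.0))
```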

3.2 Networked Logistic Regression

Consider a networked dataset whose data points $i \in \mathcal{V}$ are characterized by features $\mathbf{x}^{(i)} \in \mathbb{R}^{d}$ and binary labels $y^{(i)} \in \{0, 1\}$. Perhaps the most basic (yet quite useful) model for the relation between features and labels is the logistic model

$p\big(y^{(i)} = 1; \mathbf{w}^{(i)}\big) = 1 \big/ \big(1 + \exp\big(-\mathbf{w}^{(i)T}\mathbf{x}^{(i)}\big)\big).$   (3.2)

The logistic regression model (3.2) is parametrized by the weight vector $\mathbf{w}^{(i)} \in \mathbb{R}^{d}$ for each node $i \in \mathcal{V}$. The networked logistic regression model requires the weight vectors in the node-wise logistic regression models (3.2) to have a small TV (2.2) Ambos et al. [2018a].

We obtain the logistic regression model (3.2) as a special case of the exponential family (2.1) for the scalar node attribute $z^{(i)} = y^{(i)}$ with $\mathbf{t}(y^{(i)}) = y^{(i)}\mathbf{x}^{(i)}$ and $\Phi(\mathbf{w}^{(i)}) = \log\big(1 + \exp(\mathbf{w}^{(i)T}\mathbf{x}^{(i)})\big)$.
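Analogously, here is a minimal sketch of the logistic case (again with illustrative names); the node-wise negative log-likelihood is exactly the summand appearing later in the empirical risk (4.2).

```python
import numpy as np

def t_logistic(y, x):
    """Sufficient statistic t(y) = y * x for the logistic model (3.2), y in {0, 1}."""
    return y * x

def Phi_logistic(w, x):
    """Log partition function Phi(w) = log(1 + exp(w^T x))."""
    return np.logaddexp(0.0, w @ x)

def neg_log_lik_logistic(w, y, x):
    """Node-wise negative log-likelihood Phi(w) - <w, t(y)>, cf. (4.2)."""
    return Phi_logistic(w, x) - w @ t_logistic(y, x)

w = np.array([0.5, -1.0])
x = np.array([1.0, 2.0])
print(neg_log_lik_logistic(w, y=1, x=x))   # equals -log p(y=1; w)
```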

3.3 Networked Latent Dirichlet Allocation

Consider a networked dataset representing a collection of text documents (such as scientific articles). The latent Dirichlet allocation model (LDA) is a probabilistic model for the relative frequencies of words in a document Wainwright and Jordan [2008], Blei et al. [2003]. Within LDA, each document is considered a blend of different topics. Each topic has a characteristic distribution of the words in the vocabulary.

A simplified form of LDA represents each document, containing $n$ "words", by two sequences of multinomial random variables: $n$ word variables, each taking one of $V$ values, and $n$ topic variables, each taking one of $K$ values, with $V$ being the size of the vocabulary of elementary words and $K$ the number of different topics. It can be shown that LDA is a special case of the exponential family (2.1) with particular choices for $\mathbf{t}(\cdot)$ and $\Phi(\cdot)$ (see Wainwright and Jordan [2008], Blei et al. [2003]).

4 Network Lasso

Our goal is to develop a method for learning an accurate estimate $\widehat{\mathbf{w}}^{(i)}$ for the weight vectors $\mathbf{w}^{(i)}$ at all nodes $i \in \mathcal{V}$. The learning of the weight vectors is based on the availability of the node attributes $\mathbf{z}^{(i)}$ for a small sampling set $\mathcal{M} \subseteq \mathcal{V}$. A reasonable estimate for the weight vectors can be obtained by maximizing the likelihood of observing the attributes $\{\mathbf{z}^{(i)}\}_{i \in \mathcal{M}}$:

$\widehat{\mathbf{w}} \in \arg\max_{\mathbf{w}} \ \prod_{i \in \mathcal{M}} p\big(\mathbf{z}^{(i)}; \mathbf{w}^{(i)}\big).$   (4.1)

It is easy to show that maximizing (4.1) is equivalent to minimizing the empirical risk

$\widehat{E}(\mathbf{w}) := \sum_{i \in \mathcal{M}} \Big( \Phi\big(\mathbf{w}^{(i)}\big) - \big\langle \mathbf{w}^{(i)}, \mathbf{t}(\mathbf{z}^{(i)}) \big\rangle \Big).$   (4.2)

The criterion (4.2) by itself is not sufficient to learn the weights at all nodes $i \in \mathcal{V}$, since it completely ignores the weights at the unobserved nodes $i \in \mathcal{V} \setminus \mathcal{M}$. Therefore, we need to impose some additional structure on the weight vectors. In particular, any reasonable estimate should conform with the cluster structure of the empirical graph Newman [2010].

The network structure of data arising in important applications is organized as clusters (or communities), which are well-connected subsets of nodes Fortunato [2010]. Many methods of (supervised) machine learning are motivated by a clustering assumption: nodes belonging to the same cluster represent similar data points. We implement this clustering assumption by requiring the parameter vectors $\mathbf{w}^{(i)}$ in (2.1) to have a small TV (2.2).

We are led quite naturally to learning the weights for the networked exponential family via the regularized empirical risk minimization (ERM)

$\widehat{\mathbf{w}} \in \arg\min_{\mathbf{w}} \ \widehat{E}(\mathbf{w}) + \lambda \|\mathbf{w}\|_{\rm TV}.$   (4.3)

The learning problem (4.3) is an instance of the generic nLasso problem Hallac et al. [2015]. The parameter $\lambda > 0$ in (4.3) allows us to trade off a small TV against a small empirical risk (cf. (4.2)). The choice of $\lambda$ can be guided by cross-validation Hastie et al. [2001].
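As one simple way to choose $\lambda$, the sketch below uses a holdout split of the labeled nodes as a stand-in for full cross-validation. The callables `solve_nlasso` and `node_loss` are hypothetical placeholders for an nLasso solver (e.g., the method of Section 5) and a node-wise validation loss; they are not functions defined in the paper.

```python
import numpy as np

def select_lambda(lambdas, sampled_nodes, solve_nlasso, node_loss, seed=0):
    """Pick the TV regularization parameter lambda by a simple holdout split
    of the sampled (labeled) nodes; cf. the discussion after (4.3)."""
    rng = np.random.default_rng(seed)
    nodes = rng.permutation(sampled_nodes)
    train, val = nodes[: len(nodes) // 2], nodes[len(nodes) // 2:]

    val_errors = []
    for lam in lambdas:
        W_hat = solve_nlasso(train_nodes=train, lam=lam)      # fit on the training half
        err = np.mean([node_loss(W_hat[i], i) for i in val])  # validate on the held-out half
        val_errors.append(err)
    return lambdas[int(np.argmin(val_errors))]
```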

It will be convenient to reformulate (4.3) using vector notation. We represent a graph signal, i.e., the collection of node-wise weight vectors, as the stacked vector

$\mathbf{w} = \Big( \big(\mathbf{w}^{(1)}\big)^{T}, \ldots, \big(\mathbf{w}^{(N)}\big)^{T} \Big)^{T} \in \mathbb{R}^{Nd}, \quad \text{with } N := |\mathcal{V}|.$   (4.4)

Define a partitioned matrix $\mathbf{D} \in \mathbb{R}^{Ed \times Nd}$, with $E := |\mathcal{E}|$, block-wise as

$\mathbf{D}_{e,i} = W_{ij}\mathbf{I}, \quad \mathbf{D}_{e,j} = -W_{ij}\mathbf{I}, \quad \mathbf{D}_{e,i'} = \mathbf{0} \ \text{for } i' \notin \{i,j\}, \quad \text{for each edge } e = \{i,j\} \in \mathcal{E} \ (i < j),$   (4.5)

where $\mathbf{I} \in \mathbb{R}^{d \times d}$ is the identity matrix. The term $W_{ij}\big(\mathbf{w}^{(i)} - \mathbf{w}^{(j)}\big)$ in (2.2) is the $e$-th block of $\mathbf{D}\mathbf{w}$. Using (4.4) and (4.5), we can reformulate the nLasso (4.3) as

$\widehat{\mathbf{w}} \in \arg\min_{\mathbf{w} \in \mathbb{R}^{Nd}} \ f(\mathbf{w}) + g(\mathbf{D}\mathbf{w})$   (4.6)

with

$f(\mathbf{w}) := \widehat{E}(\mathbf{w}) \quad \text{and} \quad g(\mathbf{u}) := \lambda \sum_{e \in \mathcal{E}} \big\| \mathbf{u}^{(e)} \big\|$   (4.7)

with the stacked vector $\mathbf{u} = \big( (\mathbf{u}^{(1)})^{T}, \ldots, (\mathbf{u}^{(E)})^{T} \big)^{T} \in \mathbb{R}^{Ed}$.
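One way to realize the block-wise matrix $\mathbf{D}$ of (4.5) in code is as the Kronecker product of an ordinary weighted (signed) incidence matrix with the $d \times d$ identity. The sketch below follows that assumption and uses SciPy sparse matrices; the names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

def block_incidence(n_nodes, edges, edge_weights, d):
    """Sparse D as in (4.5): the e-th block of D @ w equals W_ij * (w_i - w_j)."""
    rows, cols, vals = [], [], []
    for e, ((i, j), a) in enumerate(zip(edges, edge_weights)):
        rows += [e, e]
        cols += [i, j]
        vals += [a, -a]
    B = sp.csr_matrix((vals, (rows, cols)), shape=(len(edges), n_nodes))
    return sp.kron(B, sp.eye(d), format="csr")   # blow each entry up to a d x d block

# Sanity check: the e-th block of D @ w is W_ij * (w_i - w_j).
edges, edge_weights, d = [(0, 1), (1, 2)], [2.0, 0.5], 2
W = np.array([[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]])
D = block_incidence(3, edges, edge_weights, d)
print((D @ W.ravel()).reshape(len(edges), d))
# [[ 2.  -2. ]   = 2.0 * (w_0 - w_1)
#  [-1.5 -1. ]]  = 0.5 * (w_1 - w_2)
```

The TV (2.2) is then the sum of the Euclidean norms of the edge blocks of $\mathbf{D}\mathbf{w}$.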

5 Efficient Implementation

The nLasso (4.6) is a convex optimization problem with a non-smooth objective function, which rules out the use of plain gradient descent methods Jung [2017]. However, the objective function is highly structured since it is the sum of a smooth convex function $f(\mathbf{w})$ and a non-smooth convex function $g(\mathbf{D}\mathbf{w})$, each of which can be handled efficiently when considered separately. This suggests using a proximal method Combettes and Pesquet [2011], Parikh and Boyd [2013] for solving (4.6).

One particular example of a proximal method is the alternating direction method of multipliers (ADMM) which has been considered in Hallac et al. [2015]. However, we will choose another proximal method which is based on a dual problem to (4.6). Based on this dual problem, efficient primal-dual methods have been proposed recently Pock and Chambolle [2011], Chambolle and Pock [2011]. These methods are attractive since their analysis provides natural choices for the algorithm parameters. In contrast, tuning the ADMM parameters is non-trivial Nishihara et al. [2015].

5.1 Primal-Dual Method

The preconditioned primal-dual method Pock and Chambolle [2011] launches from reformulating the problem (4.6) as a saddle-point problem

$\min_{\mathbf{w}} \max_{\mathbf{y}} \ \langle \mathbf{D}\mathbf{w}, \mathbf{y} \rangle + f(\mathbf{w}) - g^{*}(\mathbf{y}),$   (5.1)

with $g^{*}$ the convex conjugate of $g$ Chambolle and Pock [2011].

Any solution $(\widehat{\mathbf{w}}, \widehat{\mathbf{y}})$ of (5.1) is characterized by [Rockafellar, 1970, Thm 31.3]

$\mathbf{D}\widehat{\mathbf{w}} \in \partial g^{*}(\widehat{\mathbf{y}}), \qquad -\mathbf{D}^{T}\widehat{\mathbf{y}} \in \partial f(\widehat{\mathbf{w}}).$   (5.2)

This condition is, in turn, equivalent to

$\widehat{\mathbf{y}} = \big(\mathbf{I} + \boldsymbol{\Sigma}\partial g^{*}\big)^{-1}\big(\widehat{\mathbf{y}} + \boldsymbol{\Sigma}\mathbf{D}\widehat{\mathbf{w}}\big), \qquad \widehat{\mathbf{w}} = \big(\mathbf{I} + \mathbf{T}\partial f\big)^{-1}\big(\widehat{\mathbf{w}} - \mathbf{T}\mathbf{D}^{T}\widehat{\mathbf{y}}\big),$   (5.3)

with positive definite matrices $\mathbf{T}$ and $\boldsymbol{\Sigma}$. These matrices are design parameters whose choice will be detailed below. The condition (5.3) lends itself naturally to the following coupled fixed-point iterations Pock and Chambolle [2011]:

$\widehat{\mathbf{w}}^{(k+1)} := \big(\mathbf{I} + \mathbf{T}\partial f\big)^{-1}\big(\widehat{\mathbf{w}}^{(k)} - \mathbf{T}\mathbf{D}^{T}\widehat{\mathbf{y}}^{(k)}\big)$   (5.4)
$\widehat{\mathbf{y}}^{(k+1)} := \big(\mathbf{I} + \boldsymbol{\Sigma}\partial g^{*}\big)^{-1}\big(\widehat{\mathbf{y}}^{(k)} + \boldsymbol{\Sigma}\mathbf{D}\big(2\widehat{\mathbf{w}}^{(k+1)} - \widehat{\mathbf{w}}^{(k)}\big)\big)$   (5.5)

If the matrices $\mathbf{T}$ and $\boldsymbol{\Sigma}$ in (5.4), (5.5) satisfy

$\big\| \boldsymbol{\Sigma}^{1/2} \mathbf{D} \mathbf{T}^{1/2} \big\|_{2}^{2} \leq 1,$   (5.6)

the sequences obtained from iterating (5.4) and (5.5) converge to a saddle point of the problem (5.1) [Pock and Chambolle, 2011, Thm. 1]. The condition (5.6) is satisfied for the choice $\mathbf{T} = \mathrm{diag}\{\tau_{i}\mathbf{I}\}_{i \in \mathcal{V}}$ with $\tau_{i} = \gamma/d_{i}$ and $\boldsymbol{\Sigma} = \mathrm{diag}\{\sigma_{e}\mathbf{I}\}_{e \in \mathcal{E}}$ with $\sigma_{e} = 1/(2\gamma W_{ij})$ for $e = \{i,j\}$, with the weighted node degree $d_{i} := \sum_{j \in \mathcal{N}(i)} W_{ij}$ and some $\gamma > 0$ [Pock and Chambolle, 2011, Lem. 2].

The update (5.5) involves the resolvent operator

$\big(\mathbf{I} + \boldsymbol{\Sigma}\partial g^{*}\big)^{-1}(\mathbf{v}) := \arg\min_{\mathbf{y}} \ g^{*}(\mathbf{y}) + (1/2)\|\mathbf{y} - \mathbf{v}\|_{\boldsymbol{\Sigma}^{-1}}^{2},$   (5.7)

where $\|\mathbf{y}\|_{\boldsymbol{\Sigma}^{-1}}^{2} := \mathbf{y}^{T}\boldsymbol{\Sigma}^{-1}\mathbf{y}$. The convex conjugate $g^{*}$ of $g$ (see (4.7)) can be decomposed as $g^{*}(\mathbf{y}) = \sum_{e \in \mathcal{E}} g_{e}^{*}(\mathbf{y}^{(e)})$ with $g_{e}^{*}$ the convex conjugate of the scaled norm $\lambda\|\cdot\|$, i.e., the indicator function of the ball $\{\mathbf{y}^{(e)} : \|\mathbf{y}^{(e)}\| \leq \lambda\}$. Moreover, since $\boldsymbol{\Sigma}$ is a block diagonal matrix, the $e$-th block of the resolvent operator can be obtained by the Moreau decomposition as [Parikh and Boyd, 2013, Sec. 6.5]

$\Big( \big(\mathbf{I} + \boldsymbol{\Sigma}\partial g^{*}\big)^{-1}(\mathbf{v}) \Big)^{(e)} = \mathbf{v}^{(e)} \min\big\{ 1, \lambda / \|\mathbf{v}^{(e)}\| \big\},$

where $\mathbf{v}^{(e)}$ denotes the $e$-th block of $\mathbf{v}$, for $e \in \mathcal{E}$.
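In code, this block-wise resolvent reduces to clipping each edge block to the Euclidean ball of radius $\lambda$; a minimal sketch with our own variable names:

```python
import numpy as np

def prox_g_conj(V, lam):
    """Block-wise resolvent of g* from (4.7): project each edge block
    onto the Euclidean ball of radius lam (cf. the Moreau decomposition)."""
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.minimum(1.0, lam / np.maximum(norms, 1e-12))
    return V * scale

V = np.array([[3.0, 4.0], [0.1, 0.0]])   # stacked edge blocks, one row per edge
print(prox_g_conj(V, lam=1.0))
# [[0.6 0.8]   clipped to norm 1
#  [0.1 0. ]]  unchanged (norm already <= 1)
```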

The update (5.4) involves the resolvent operator of $f$ (see (4.2) and (4.7)), which does not have a closed-form solution in general. Still, for the block-diagonal choice $\mathbf{T} = \mathrm{diag}\{\tau_{i}\mathbf{I}\}_{i \in \mathcal{V}}$, the update (5.4) decomposes into separate node-wise updates

$\widehat{\mathbf{w}}^{(i,k+1)} := \arg\min_{\mathbf{w} \in \mathbb{R}^{d}} \ \Phi(\mathbf{w}) - \big\langle \mathbf{w}, \mathbf{t}(\mathbf{z}^{(i)}) \big\rangle + \frac{1}{2\tau_{i}} \big\| \mathbf{w} - \mathbf{c}^{(i)} \big\|^{2} \quad \text{for } i \in \mathcal{M},$   (5.8)

with $\mathbf{c}^{(i)}$ the $i$-th block of $\widehat{\mathbf{w}}^{(k)} - \mathbf{T}\mathbf{D}^{T}\widehat{\mathbf{y}}^{(k)}$ and $\widehat{\mathbf{w}}^{(i,k+1)} := \mathbf{c}^{(i)}$ for the unobserved nodes $i \in \mathcal{V} \setminus \mathcal{M}$.

In general, it is not possible to compute the update (5.8) exactly. A notable exception are networked linear Gaussian models, for which (5.8) amounts to matrix multiplications M.Yamada et al. [2017], Jung and Vesselinova [2019]. However, since (5.8) amounts to the unconstrained minimization of a smooth and convex objective function, we can apply efficient convex optimization methods Boyd and Vandenberghe [2004]. In fact, the update (5.8) is a regularized maximum likelihood problem for the exponential family (2.1). This can be solved efficiently using quasi-Newton methods such as L-BFGS Andrew and Gao [2007], Mokhtari and Ribeiro [2015]. We will detail a particular iterative method for approximately solving (5.8) in Section 5.2.
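As a concrete stand-in for the quasi-Newton approach mentioned above, the following sketch solves the node-wise update (5.8) with SciPy's L-BFGS-B routine for a generic exponential family specified by callables `Phi` and `grad_Phi` and the sufficient statistic value `t_z`. This interface is our own assumption, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

def node_update(t_z, Phi, grad_Phi, c_i, tau_i):
    """Approximately solve the node-wise update (5.8):
    minimize Phi(w) - <w, t(z_i)> + (1/(2*tau_i)) * ||w - c_i||^2."""
    def obj(w):
        return Phi(w) - w @ t_z + np.sum((w - c_i) ** 2) / (2.0 * tau_i)
    def grad(w):
        return grad_Phi(w) - t_z + (w - c_i) / tau_i
    res = minimize(obj, x0=c_i, jac=grad, method="L-BFGS-B")
    return res.x

# Example: a logistic node (see Section 3.2) with feature vector x and label y = 1.
x, y = np.array([1.0, 2.0]), 1
Phi = lambda w: np.logaddexp(0.0, w @ x)
grad_Phi = lambda w: x / (1.0 + np.exp(-(w @ x)))
print(node_update(t_z=y * x, Phi=Phi, grad_Phi=grad_Phi,
                  c_i=np.zeros(2), tau_i=0.5))
```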

Let us denote the approximate solution to (5.8) by $\widetilde{\mathbf{w}}^{(i,k+1)}$ and assume that it is sufficiently accurate such that

$\big\| \widetilde{\mathbf{w}}^{(i,k+1)} - \widehat{\mathbf{w}}^{(i,k+1)} \big\| \leq \varepsilon_{k} \quad \text{with a summable sequence} \quad \sum_{k} \varepsilon_{k} < \infty.$   (5.9)

Thus, we require the approximation quality (for approximating the update (5.8)) to increase with the iteration number $k$. According to [Condat, 2013, Thm. 3.2], the error bound (5.9) ensures that the sequences obtained by (5.4) and (5.5), when replacing the exact update (5.8) with the approximation $\widetilde{\mathbf{w}}^{(i,k+1)}$, still converge to a saddle-point of (5.1) and, in turn, a solution of the nLasso problem (4.6).

1: Input: empirical graph $\mathcal{G}$ with edge weights $W_{ij}$; observed attributes $\mathbf{z}^{(i)}$, $i \in \mathcal{M}$; nLasso parameter $\lambda$; step sizes $\mathbf{T}$, $\boldsymbol{\Sigma}$ (cf. (5.6))
2: Initialize: $k := 0$; $\widehat{\mathbf{w}}^{(0)} := \mathbf{0}$; $\widehat{\mathbf{y}}^{(0)} := \mathbf{0}$
3: repeat
4:     $\mathbf{c} := \widehat{\mathbf{w}}^{(k)} - \mathbf{T}\mathbf{D}^{T}\widehat{\mathbf{y}}^{(k)}$; set $\widehat{\mathbf{w}}^{(i,k+1)} := \mathbf{c}^{(i)}$ for all unobserved nodes $i \in \mathcal{V} \setminus \mathcal{M}$
5:     for each observed node $i \in \mathcal{M}$ do
6:         $\widehat{\mathbf{w}}^{(i,k+1)} :=$ approximate solution of the node-wise update (5.8)
7:     end for
8:     $\bar{\mathbf{w}} := 2\widehat{\mathbf{w}}^{(k+1)} - \widehat{\mathbf{w}}^{(k)}$
9:     $\widehat{\mathbf{y}}^{(e,k+1)} := e$-th block of the resolvent (5.7) applied to $\widehat{\mathbf{y}}^{(k)} + \boldsymbol{\Sigma}\mathbf{D}\bar{\mathbf{w}}$, for each edge $e \in \mathcal{E}$
10:     $k := k + 1$
11: until stopping criterion is satisfied
12: Output: $\widehat{\mathbf{w}}^{(k)}$
Algorithm 1 Learning Networked Exponential Family
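To tie the individual updates together, here is a compact Python sketch of the main loop of Algorithm 1 for the networked logistic regression model of Section 3.2. It uses dense per-edge loops instead of sparse matrix operations, SciPy's L-BFGS-B for the node-wise update (5.8), and one admissible choice of the step sizes; it is a sketch under these assumptions, not a reference implementation of the paper's algorithm.

```python
import numpy as np
from scipy.optimize import minimize

def nlasso_logistic(X, y, edges, edge_weights, sampled, lam, n_iter=200):
    """Primal-dual sketch of Algorithm 1 for networked logistic regression.
    X: (N, d) feature matrix, y: labels in {0, 1} (used only at sampled nodes)."""
    N, d = X.shape
    E = len(edges)
    # One admissible diagonal preconditioning (our choice, cf. (5.6)).
    deg = np.zeros(N)
    for (i, j), a in zip(edges, edge_weights):
        deg[i] += a
        deg[j] += a
    tau = 1.0 / np.maximum(deg, 1e-12)                 # per-node primal step sizes
    sigma = 1.0 / (2.0 * np.asarray(edge_weights))     # per-edge dual step sizes

    W = np.zeros((N, d))   # primal variables, one weight vector per node
    Y = np.zeros((E, d))   # dual variables, one block per edge
    for _ in range(n_iter):
        W_old = W.copy()
        # c = w - T D^T y, accumulated edge by edge (message passing).
        C = W.copy()
        for e, ((i, j), a) in enumerate(zip(edges, edge_weights)):
            C[i] -= tau[i] * a * Y[e]
            C[j] += tau[j] * a * Y[e]
        # Primal update (5.4): node-wise prox (5.8) at sampled nodes, identity elsewhere.
        W = C.copy()
        for i in sampled:
            t_z = y[i] * X[i]
            obj = lambda w, i=i, t_z=t_z: (np.logaddexp(0.0, w @ X[i]) - w @ t_z
                                           + np.sum((w - C[i]) ** 2) / (2 * tau[i]))
            W[i] = minimize(obj, x0=C[i], method="L-BFGS-B").x
        # Dual update (5.5) with extrapolated primal variable, then block-wise clipping.
        W_bar = 2 * W - W_old
        for e, ((i, j), a) in enumerate(zip(edges, edge_weights)):
            v = Y[e] + sigma[e] * a * (W_bar[i] - W_bar[j])
            Y[e] = v * min(1.0, lam / max(np.linalg.norm(v), 1e-12))
    return W

# Toy example: a chain of 6 nodes with labels observed only at the two end nodes.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.array([1, 1, 1, 0, 0, 0])
edges, weights = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)], [1.0] * 5
W_hat = nlasso_logistic(X, y, edges, weights, sampled=[0, 5], lam=0.1, n_iter=50)
print(W_hat.round(2))
```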

5.2 Approximate Primal-Dual Steps

Let us detail here a simple iterative method for computing an approximate solution to the update (5.8). By standard convex analysis (see Boyd and Vandenberghe [2004]), any solution $\widehat{\mathbf{w}}^{(i,k+1)}$ is characterized by the zero gradient condition

$\nabla h\big(\widehat{\mathbf{w}}^{(i,k+1)}\big) = \mathbf{0}$   (5.10)

with

$h(\mathbf{w}) := \Phi(\mathbf{w}) - \big\langle \mathbf{w}, \mathbf{t}(\mathbf{z}^{(i)}) \big\rangle + \frac{1}{2\tau_{i}} \big\| \mathbf{w} - \mathbf{c}^{(i)} \big\|^{2}.$   (5.11)

Inserting (5.11) into (5.10), and using some basic calculus, yields

$\widehat{\mathbf{w}}^{(i,k+1)} = \mathbf{c}^{(i)} + \tau_{i}\big( \mathbf{t}(\mathbf{z}^{(i)}) - \nabla\Phi(\widehat{\mathbf{w}}^{(i,k+1)}) \big).$   (5.12)

The condition (5.12), which is necessary and sufficient for $\widehat{\mathbf{w}}^{(i,k+1)}$ to solve (5.8), is a fixed point equation $\mathbf{w} = F(\mathbf{w})$ with

$F(\mathbf{w}) := \mathbf{c}^{(i)} + \tau_{i}\big( \mathbf{t}(\mathbf{z}^{(i)}) - \nabla\Phi(\mathbf{w}) \big).$   (5.13)

We will use the Hessian $\nabla^{2}\Phi(\mathbf{w})$ of $\Phi$ with entries

$\big(\nabla^{2}\Phi(\mathbf{w})\big)_{l,l'} = \partial^{2}\Phi(\mathbf{w}) / (\partial w_{l}\,\partial w_{l'}).$   (5.14)

According to the mean-value theorem [Rudin, 1976, Thm. 9.19.], the map $F(\cdot)$ in (5.13) is Lipschitz continuous with Lipschitz constant $q := \tau_{i} \sup_{\mathbf{w}} \|\nabla^{2}\Phi(\mathbf{w})\|_{2}$. Thus, if we choose $\tau_{i}$ such that

$q = \tau_{i} \sup_{\mathbf{w}} \big\| \nabla^{2}\Phi(\mathbf{w}) \big\|_{2} < 1,$   (5.15)

the map $F(\cdot)$ is a contraction and the fixed-point iteration

$\mathbf{w}_{r+1} := F\big(\mathbf{w}_{r}\big), \quad r = 0, 1, \ldots,$   (5.16)

will converge to the solution of (5.8).

Moreover, if (5.15) is satisfied, we can bound the deviation between the iterate $\mathbf{w}_{r}$ and the (unique) solution $\widehat{\mathbf{w}}^{(i,k+1)}$ of (5.8) as (see [Rudin, 1976, Proof of Thm. 9.23])

$\big\| \mathbf{w}_{r} - \widehat{\mathbf{w}}^{(i,k+1)} \big\| \leq \frac{q^{r}}{1-q} \big\| \mathbf{w}_{1} - \mathbf{w}_{0} \big\|.$   (5.17)

Thus, if we use the approximation $\widetilde{\mathbf{w}}^{(i,k+1)} := \mathbf{w}_{r}$ for the update (5.8), we can ensure (5.9) by iterating (5.16) for at least

$r \geq \log\big( (1-q)\,\varepsilon_{k} / \|\mathbf{w}_{1} - \mathbf{w}_{0}\| \big) \big/ \log q$   (5.18)

iterations.
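The fixed-point iteration (5.16) is straightforward to implement once the gradient of the log partition function is available. A minimal sketch follows, with a fixed number of iterations instead of the bound (5.18); the logistic example and the variable names are our own.

```python
import numpy as np

def node_update_fixed_point(t_z, grad_Phi, c_i, tau_i, n_iter=50):
    """Fixed-point iteration (5.16) for the node-wise update (5.8):
    w <- c_i + tau_i * (t(z_i) - grad_Phi(w)).
    Converges if tau_i * sup ||Hessian of Phi|| < 1, cf. (5.15)."""
    w = c_i.copy()
    for _ in range(n_iter):
        w = c_i + tau_i * (t_z - grad_Phi(w))
    return w

# Logistic example: grad_Phi(w) = sigmoid(w^T x) * x; Hessian norm <= ||x||^2 / 4.
x = np.array([1.0, 2.0])
grad_Phi = lambda w: x / (1.0 + np.exp(-(w @ x)))
tau_i = 0.5 / (x @ x)          # safely below 4 / ||x||^2, so (5.15) holds
print(node_update_fixed_point(t_z=1 * x, grad_Phi=grad_Phi,
                              c_i=np.zeros(2), tau_i=tau_i))
```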

Note that computing the iterates (5.16) requires the evaluation of the gradient $\nabla\Phi$ of the log partition function $\Phi$. According to [Wainwright and Jordan, 2008, Prop. 3.1.], this gradient is given by the expectation of the sufficient statistic $\mathbf{t}(\mathbf{z})$ under the distribution $p(\mathbf{z}; \mathbf{w})$:

$\nabla\Phi(\mathbf{w}) = \mathrm{E}_{p(\mathbf{z};\mathbf{w})}\big[ \mathbf{t}(\mathbf{z}) \big].$   (5.19)

Moreover, the Hessian $\nabla^{2}\Phi(\mathbf{w})$ in (5.14) is obtained as the covariance matrix of the sufficient statistic [Wainwright and Jordan, 2008, Prop. 3.1.]

$\nabla^{2}\Phi(\mathbf{w}) = \mathrm{Cov}_{p(\mathbf{z};\mathbf{w})}\big[ \mathbf{t}(\mathbf{z}) \big].$   (5.20)

In general, the expectations (5.19) and (5.20) cannot be computed exactly in closed form. A notable exception are exponential families obtained from a probabilistic graphical model defined on a triangulated graph such as a tree. In this case it is possible to compute (5.19) in closed form (see [Wainwright and Jordan, 2008, Sec. 2.5.2]). Another special case of (2.1) for which (5.19) and (5.20) can be evaluated in closed form is linear and logistic regression (see Section 3).
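For the logistic regression model of Section 3.2, the gradient (5.19) and the Hessian (5.20) are available in closed form. The following illustrative snippet (our own code, not from the paper) states them and checks the gradient against finite differences of $\Phi$.

```python
import numpy as np

def grad_Phi_logistic(w, x):
    """(5.19): E[t(y)] = P(y=1; w) * x = sigmoid(w^T x) * x."""
    return x / (1.0 + np.exp(-(w @ x)))

def hess_Phi_logistic(w, x):
    """(5.20): Cov[t(y)] = sigmoid(w^T x) * (1 - sigmoid(w^T x)) * x x^T."""
    p = 1.0 / (1.0 + np.exp(-(w @ x)))
    return p * (1.0 - p) * np.outer(x, x)

# Finite-difference check of (5.19) against Phi(w) = log(1 + exp(w^T x)).
w, x, eps = np.array([0.3, -0.7]), np.array([1.0, 2.0]), 1e-6
Phi = lambda w: np.logaddexp(0.0, w @ x)
fd = np.array([(Phi(w + eps * np.eye(2)[k]) - Phi(w - eps * np.eye(2)[k])) / (2 * eps)
               for k in range(2)])
print(np.allclose(fd, grad_Phi_logistic(w, x), atol=1e-5))  # True
```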

5.3 Partially Observed Models

The learning Algorithm 1 can be adapted easily to cope with partially observed exponential families Wainwright and Jordan [2008]. In particular, for the networked LDA described in Section 3, we typically have access only to the word variables of some documents $i \in \mathcal{M}$. However, for (approximately) computing the update step (5.8) we would also need the values of the topic variables, but those are not observed since they are latent (hidden) variables. In this case we can approximate (5.8) by some "Expectation-Maximization" (EM) principle (see [Wainwright and Jordan, 2008, Sec. 6.2]). An alternative to EM methods, based on the method of moments, for learning (latent variable) topic models has been studied in a recent line of work Arora et al. [2016], Anandkumar et al. [2012].

6 Conclusion

We have introduced networked exponential families as a flexible statistical modeling paradigm for networked data. Individual data points are modeled by exponential families whose parameters are coupled across connected data points by requiring a small TV. An efficient method for learning networked exponential families is obtained by applying a primal-dual method to solve the non-smooth nLasso problem.

References

  • Ambos et al. [2018a] H. Ambos, N. Tran, and A. Jung. Classifying big data over networks via the logistic network lasso. In Proc. 52nd Asilomar Conference on Signals, Systems, and Computers. 10.1109/ACSSC.2018.8645260, 2018a.
  • Ambos et al. [2018b] H. Ambos, N. Tran, and A. Jung. Classifying big data over networks via the logistic network lasso. In Proc. 52nd Asilomar Conf. Signals, Systems, Computers, Oct./Nov. 2018b.
  • Anandkumar et al. [2012] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, and L. Yi-kai. A spectral algorithm for latent dirichlet allocation. In Advances in Neural Information Processing Systems 25, pages 917–925, 2012.
  • Andrew and Gao [2007] G. Andrew and J. Gao. Scalable training of l1-regularized log-linear models. In Proc. of the 24th International Conference on Machine Learning (ICML), Corvallis, OR, 2007.
  • Arora et al. [2016] S. Arora, R. Ge, F. Koehler, T. Ma, and A. Moitra. Provable algorithms for inference in topic models. In Proc. 33rd Int. Conf. Mach. Learn. (ICML), New York, NY, USA, 2016.
  • Barabási et al. [2010] A. Barabási, N. Gulbahce, and J. Loscalzo. Network medicine: a network-based approach to human disease. Nature Reviews Genetics, 12(56), 2010.
  • Blei et al. [2003] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Jan. 2003.
  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge Univ. Press, Cambridge, UK, 2004.
  • Brown [1986] L. D. Brown. Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, Hayward, CA, 1986.
  • Chambolle and Pock [2011] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imag. Vis., 40(1), 2011. ISSN 0924-9907. doi: 10.1007/s10851-010-0251-1. URL http://dx.doi.org/10.1007/s10851-010-0251-1.
  • Chang and Blei [2009] J. Chang and D. M. Blei. Relational topic models for document networks. In Proc. of the 12th Int. Conf. on Art. Int. Stat. (AISTATS), Florida, USA, 2009.
  • Combettes and Pesquet [2011] P. L. Combettes and J.-C. Pesquet. Proximal splitting methods in signal processing. In H. Bauschke, R. Burachik, P. Combettes, V. Elser, D. Luke, and H. Wolkowicz, editors, Fixed-Point Algorithms for Inverse Problems in Science and Engineering, volume 49. Springer New York, 2011.
  • Condat [2013] L. Condat. A primal–dual splitting method for convex optimization involving lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 158(2):460–479, Aug. 2013.
  • Cui et al. [2016] S. Cui, A. Hero, Z.-Q. Luo, and J. Moura, editors. Big Data over Networks. Cambridge Univ. Press, 2016.
  • Fortunato [2010] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, Feb. 2010.
  • Hallac et al. [2015] D. Hallac, J. Leskovec, and S. Boyd. Network lasso: Clustering and optimization in large graphs. In Proc. SIGKDD, pages 387–396, 2015.
  • Hastie et al. [2001] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer, New York, NY, USA, 2001.
  • Hastie et al. [2015] T. Hastie, R. Tibshirani, and M. Wainwright. Statistical Learning with Sparsity. The Lasso and its Generalizations. CRC Press, 2015.
  • Jung [2017] A. Jung. A fixed-point of view on gradient methods for big data. Frontiers in Applied Mathematics and Statistics, 3, 2017. doi: 10.3389/fams.2017.00018. URL https://www.frontiersin.org/article/10.3389/fams.2017.00018.
  • Jung and Tran [2019] A. Jung and N. Tran. Localized linear regression in networked data. IEEE Sig. Proc. Letters, 2019.
  • Jung and Vesselinova [2019] A. Jung and N. Vesselinova. Analysis of network lasso for semi-supervised regression. In The 22nd Int. Conf. Art. Int. Stat. (AISTATS), Okinawa, Japan, April 2019.
  • Jung et al. [2014] A. Jung, S. Schmutzhard, and F. Hlawatsch. The RKHS approach to minimum variance estimation revisited: Variance bounds, sufficient statistics, and exponential families. IEEE Trans. Inf. Theory, 60(7):4050–4065, Jul. 2014.
  • Li et al. [2019] T. Li, E. Levina, and J. Zhu. Prediction models for network-linked data. The Annals of Applied Statistics, 2019.
  • Mokhtari and Ribeiro [2015] A. Mokhtari and A. Ribeiro. Global convergence of online limited memory bfgs. Jour. Mach. Learning Res., 16:3151 – 3181, Dec. 2015.
  • M.Yamada et al. [2017] M. Yamada, T. Koh, T. Iwata, J. Shawe-Taylor, and S. Kaski. Localized Lasso for high-dimensional regression. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pages 325–333, Fort Lauderdale, FL, USA, Apr. 2017. PMLR.
  • Newman [2010] M. E. J. Newman. Networks: An Introduction. Oxford Univ. Press, 2010.
  • Nishihara et al. [2015] R. Nishihara, L. Lessard, B. Recht, A. Packard, and M. I. Jordan. A general analysis of the convergence of admm. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, Lille, France, 2015.
  • Parikh and Boyd [2013] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.
  • Pock and Chambolle [2011] T. Pock and A. Chambolle. Diagonal preconditioning for first order primal-dual algorithms in convex optimization. In IEEE International Conference on Computer Vision (ICCV), Barcelona, Spain, Nov. 2011.
  • Rockafellar [1970] R. T. Rockafellar. Convex Analysis. Princeton Univ. Press, Princeton, NJ, 1970.
  • Rudin [1976] W. Rudin. Principles of Mathematical Analysis. McGraw-Hill, New York, 3 edition, 1976.
  • Wainwright and Jordan [2008] M. J. Wainwright and M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference, volume 1 of Foundations and Trends in Machine Learning. Now Publishers, Hanover, MA, 2008.
  • Zachary [1977] W. W. Zachary. An information flow model for conflict and fission in small groups. J. Anthro. Research, 33(4):452–473, 1977.