Over the past decade, research on network data has increased dramatically. Examples include scientific studies involving web data or hypertext documents connected via hyperlinks, social networks or user profiles connected via friend links, co-authorship and citation networks connected by collaboration or citation relationships, gene or protein networks connected by regulatory relationships, and much more. Such data appear frequently in modern application domains and have led to numerous high-impact applications. For instance, detecting anomalies in ad-hoc information networks is vital for corporate and government security; exploring hidden community structures helps us to better conduct online advertising and marketing; inferring large-scale gene regulatory networks is crucial for new drug design and disease control. Due to the increasing importance of network data, principled analytical and modeling tools are crucially needed.
Towards this goal, researchers from the network modeling community have proposed many models to explore and predict network data. These models roughly fall into two categories: static and dynamic models. A static model describes a single observed snapshot of the network. In contrast, dynamic models can be applied to analyze datasets that contain many snapshots of the network indexed by different time points. Examples of static network models include the Erdös-Rényi-Gilbert random graph model (ER:59; erdos1960erg), the $p_1$ model (Holland:81), the $p_2$ model (Duijn:2004) and the more general exponential random graph (or $p^*$) model (wasserman1996), the latent space model (Hoff01latentspace), the block model (Lorrain71), the stochastic blockmodel (Wasserman:87), and the mixed membership stochastic blockmodel (Edoardo08). Examples of dynamic network models include the preferential attachment model (Barabasi:99), the small-world model (watts1998cds), the duplication-attachment model (Kleinberg:99; Kumar:00), the continuous time Markov model (Snijders05modelsfor), and the dynamic latent space model (sarkar05b). A comprehensive review of these models is provided in Gldenberg:09.
Though many methods and models have been proposed, research on network data analysis is largely disconnected from the classical theory of statistical learning and signal processing. The main reason is that, unlike the usual scientific data for which independent measurements can be repeatedly collected, network data are in general collected in a single realization, and the nodes within the network are highly interrelated due to the many links among them. Such a disconnection prevents us from directly exploiting state-of-the-art statistical learning methods and theory to analyze network data. To bridge this gap, we present a novel framework to model network data. Our framework assumes that the observed network has a sparse representation with respect to some dictionary (or basis space). Once the dictionary is given, we formulate the network modeling problem as a compressed sensing problem. Compressed sensing, also known as compressive sensing and compressive sampling, is a technique for finding sparse solutions to underdetermined linear systems. In statistical machine learning, it is related to reconstructing a signal which has a sparse representation in a large dictionary. The field of compressed sensing has existed for decades, but recently it has exploded due to the important contributions of candes2005; candes2007; candes2008; Tsaig06compressedsensing. By viewing the observed network adjacency matrix as the output of an underlying function evaluated on a discrete domain of network nodes, we can formulate the network modeling problem as a compressed sensing problem.
Specifically, we consider the network clique detection problem within this novel framework. By considering a generative model in which the observed adjacency matrix is assumed to have a sparse representation in a large dictionary where each basis element corresponds to a clique, we connect our framework with a new algebraic tool, namely Radon basis pursuit in homogeneous spaces. Our problem can be regarded as an extension of the work in jaga2008, which studies sparse recovery of functions on permutation groups, whereas we reconstruct functions on $k$-sets (cliques), often called the homogeneous space associated with a permutation group in the literature (diaconis1988). It turns out that the discrete Radon basis becomes the natural choice instead of the Fourier basis considered in jaga2008. This leaves us with the new challenge of addressing noiseless exact recovery and stable recovery with noise. Unfortunately, the greedy algorithm for exact recovery in jaga2008 cannot be applied to noisy settings, and in general the Radon basis does not satisfy the Restricted Isometry Property (RIP) (candes2008), which is crucial for universal recovery. In this paper, we develop new theories and algorithms which guarantee exact, sparse, and stable recovery under the choice of the Radon basis. These theories have deep roots in Basis Pursuit (chen1999) and its extensions with uniformly bounded noise. Though this paper is mainly conceptual, showing the connection between network modeling and compressed sensing, we also provide rigorous theoretical analysis and practical algorithms for the clique recovery problem to illustrate the usefulness of our framework.
The main content of this paper can be summarized as follows. Section 2 presents the general framework of compressive network analysis. In Sections 3, 4, and 5, we consider the clique detection problem under the compressive network analysis framework. A polynomial time approximation algorithm for the clique detection problem is provided in Section 6. We also demonstrate successful application examples in Section 7. Section 8 concludes the paper.
2 Main Idea
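In this section we present the general framework of compressive network analysis from a nonparametric view. We start by introducing notation: let $v = (v_1, \ldots, v_d)^T \in \mathbb{R}^d$ be a vector and $I(\cdot)$ be the indicator function. We denote
$$\|v\|_0 = \sum_{j=1}^{d} I(v_j \neq 0), \quad \|v\|_1 = \sum_{j=1}^{d} |v_j|, \quad \|v\|_2 = \Big(\sum_{j=1}^{d} v_j^2\Big)^{1/2}, \quad \|v\|_\infty = \max_{1 \le j \le d} |v_j|.$$
We also denote by $\langle \cdot, \cdot \rangle$ the Euclidean inner product and $\mathrm{sign}(v) = (\mathrm{sign}(v_1), \ldots, \mathrm{sign}(v_d))^T$, where
$$\mathrm{sign}(v_j) = \begin{cases} +1 & \text{if } v_j > 0, \\ 0 & \text{if } v_j = 0, \\ -1 & \text{if } v_j < 0. \end{cases}$$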
We represent a network as a graph $G = (V, E)$, where $V = \{1, \ldots, n\}$ is the set of nodes and $E \subseteq V \times V$ is the set of edges. Let $A \in \mathbb{R}^{n \times n}$ be the adjacency matrix of the observed network, where $A_{ij}$ represents a quantity associated with nodes $i$ and $j$. With no loss of generality, we assume that $A$ is symmetric: $A = A^T$ and $A_{ii} = 0$. With these assumptions, to model $A$ we only need to model its upper triangle. For notational simplicity, we squeeze $A$ into a vector $b \in \mathbb{R}^M$, where $M = n(n-1)/2$ is the number of upper-triangle elements in $A$. Let $f$ be an unknown vector-valued function defined on $V \times V$. We assume a generative model of the observed adjacency matrix $A$ (or equivalently, $b$):
$$b = f + z, \tag{3}$$
where $z \in \mathbb{R}^M$ is a noise vector. We can view $f$ as evaluating a possibly infinite-dimensional function on the discrete set $V \times V$; thus the model (3) is intrinsically nonparametric and can model any static network.
Without further regularity conditions or constraints, there is no hope for us to reliably estimate $f$. In our framework, we assume that $f$ has a sparse representation with respect to an $M$ by $N$ dictionary $\Phi = [\phi_1, \ldots, \phi_N]$ where each $\phi_j \in \mathbb{R}^M$ is a basis function, i.e., there exists a subset $S \subset \{1, \ldots, N\}$ with cardinality $|S| \ll N$, such that
$$f = \sum_{j \in S} x_j \phi_j = \Phi x.$$
In the sequel, we denote by $\Phi_{ij}$ the element in the $i$-th row and $j$-th column of $\Phi$. Here $i$ indexes a pair of different nodes and $j$ indexes a basis function $\phi_j$. To estimate $f$, we only need to reconstruct $x$. Given the dictionary $\Phi$, we can estimate $x$ by solving the following program:
One thing to note is that the dictionary $\Phi$ can either be constructed based on domain knowledge or learned from empirical data. For simplicity, we always assume $\Phi$ is given in advance in this paper. In the following sections, we use the clique detection problem as a case study to illustrate the usefulness of this framework.
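The vectorization step described above can be sketched in a few lines. This is a minimal illustration only: the function name `vectorize_upper_triangle` and the 4-node weight matrix are hypothetical, and the row-by-row ordering of node pairs is an assumption (any fixed ordering works as long as the dictionary rows use the same one).

```python
from itertools import combinations


def vectorize_upper_triangle(adj):
    """Stack the strictly upper-triangular entries of a symmetric matrix."""
    n = len(adj)
    # Node pairs (i, j) with i < j, in lexicographic order (an assumed convention).
    pairs = list(combinations(range(n), 2))
    return pairs, [adj[i][j] for i, j in pairs]


# Hypothetical symmetric 4-node weighted network with zero diagonal.
adjacency = [
    [0, 2, 2, 0],
    [2, 0, 3, 1],
    [2, 3, 0, 1],
    [0, 1, 1, 0],
]
pairs, b = vectorize_upper_triangle(adjacency)
print(len(b), b)
```

The resulting vector has one entry per unordered node pair, so its length is n(n-1)/2.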
3 Clique Detection
In network data analysis, the problem of identifying communities or cliques (a clique is a complete subgraph of the network) based on partial information arises frequently in many applications, including identity management (guibas2008), statistical ranking (diaconis1988; jaga2008), and social networks (leskovec2010). In these applications we are typically given a network with its nodes representing players, items, or characters, and edge weights summarizing the observed pairwise interactions. The basic problem is to determine communities or cliques within the network by observing the frequencies of low order interactions, since in reality such low order interactions are often governed by a considerably smaller number of high order communities or cliques. Therefore the clique detection problem can be formulated as compressed sensing of cliques in large networks. To solve this problem, one has to answer two questions: (i) what is a suitable representation basis, and (ii) what is the reconstruction method? Before rigorously formulating the problem, we provide three motivating examples as a glimpse of typical situations which can be addressed within the framework of this paper.
Example 1 (Tracking Team Identities) We consider the scenario of multiple targets moving in an environment monitored by sensors. We assume every moving target has an identity and each belongs to some team or group. However, we can only obtain partial interaction information due to the measurement structure. For example, when watching a grey-scale video of a basketball game (where it may be hard to tell the two teams apart), sensors may observe ball passes or collaborative offensive/defensive interactions between teammates. The observations are partial because players mostly exhibit low order interactions to the sensors in basketball games. It is difficult to observe a single event which involves all team members. Our objective is to infer membership information (which team each player belongs to) from such partially observed interactions.
Example 2 (Inferring High Order Partial Rankings) The problem of clique identification also arises in ranking problems. Consider a collection of items which are to be ranked by a set of users. Each user can propose the set of his or her most favorite items (say top 3 items) but without specifying a relative preference within this set. We then wish to infer what are the top most favorite items (say top 5 items). This problem requires us to infer high order partial rankings from low order observations.
Example 3 (Detecting Communities in Social Networks) Detecting communities in social networks is of extraordinary importance. It can be used to understand the organization or collaboration structure of a social network. However, we do not have direct mechanisms to sense social communities. Instead, we have partial, low order interaction information. For example, we observe pairwise or triple-wise co-appearance among people who hang out together for some leisure activities. We hope to detect the social communities in the network from such partially observed data.
In these examples we are typically given a network with some nodes representing players, items, or characters, and edge weights summarizing the observed pairwise interactions. Triple-wise and other low order information can be further exploited if we consider complete sub-graphs or cliques in the networks. The basic problem is to determine common interest groups or cliques within the network by observing the frequency of low order interactions, since in reality such low order interactions are often governed by a considerably smaller number of high order communities. In this sense we shall formulate our problem as compressed sensing of cliques in networks.
The problem we are going to address has a close relationship with community detection in social networks. Community structures are ubiquitous in social networks. However, there is no consistent definition of a “community”. In the majority of research studies, community detection is based on partitions of the nodes in a network. Among these works, the most famous one is based on the modularity (Newman06) of a partition of the nodes into groups. A shortcoming of partition-based methods is that they do not allow overlapping communities, which occur frequently in practice. Recently there has been growing interest in studying overlapping community structures (Lanci09). The relevance of cliques to overlapping communities was probably first addressed in the clique percolation method (Palla05). In that work, communities were modeled as maximal connected components of cliques in a graph, where two $k$-cliques are said to be connected if they share $k-1$ nodes. In this paper, we pursue a compressive representation of signals or functions on networks based on clique information, which in turn sheds light on multiple aspects of community structure.
In this paper, we use the same definition as in Palla05 but are more interested in identifying cliques. We pursue an alternative approach to exploring networks based on clique information, which potentially sheds light on multiple aspects of community structure. Roughly speaking, we assume that there is a frequency function defined on complete low order subsets. For example, in some social networks edge weights are bivariate functions defined on pairs of nodes, reflecting the strength of pairwise interactions. We also assume that there is another latent frequency function defined on complete high order subsets, which we hope to infer. Intuitively, the interaction frequency of a particular low order subset should be the sum of the frequencies of the high order subsets to which it belongs. Hence we consider a generative mechanism in which there exists a linear mapping from frequencies on high order subsets (usually sparsely distributed) to low order subsets. One typically can collect data on low order subsets, while the task is to find those few dominant high order subsets. This problem naturally fits into the general compressive network analysis framework we introduced in the previous section. Below we demonstrate that the Radon basis is an appropriate representation for our purpose, which allows sparse recovery by a simple linear programming reconstruction approach.
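The generative mechanism described above, where each low order subset accumulates the frequencies of the high order subsets containing it, can be sketched as follows. The function name `low_order_frequencies` and the two example cliques with their weights are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def low_order_frequencies(clique_weights, order=2):
    """Add each clique's weight to every `order`-subset that the clique contains."""
    freq = defaultdict(float)
    for clique, weight in clique_weights.items():
        for subset in combinations(sorted(clique), order):
            freq[subset] += weight
    return dict(freq)


# Hypothetical latent cliques and their (sparse) high order frequencies.
cliques = {frozenset({0, 1, 2}): 2.0, frozenset({1, 2, 3}): 1.0}
pair_freq = low_order_frequencies(cliques)
for pair in sorted(pair_freq):
    print(pair, pair_freq[pair])
```

The pair shared by both cliques receives the sum of the two weights, while a pair contained in no clique never appears. Recovering the two weighted cliques from such pairwise totals is the inverse problem studied in the rest of the paper.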
4 Radon Basis Pursuit
4.1 Mathematical Formulation
Under the general framework in (3), we formulate the clique detection problem as a compressed sensing problem named Radon Basis Pursuit. To this end, we construct a dictionary $\Phi$ so that each column of $\Phi$ corresponds to one clique. The intuition behind this construction is that we assume there are several hidden cliques within the network, which are perhaps of different sizes and may overlap. Every clique carries a certain weight. The observed adjacency matrix $A$ (or equivalently, its vectorized version $b$) is a linear combination of clique basis vectors contaminated by a noise vector $z$.
For simplicity, we first restrict ourselves to the case that all the cliques are of the same size $k$. The case with mixed sizes will be discussed later. Let $C_1, \ldots, C_N$ be all the cliques of size $k$, where each $C_\ell \subset V$. We have $N = \binom{n}{k}$. For each row index $i$ and column index $\ell$, we construct the dictionary as follows:
$$\Phi_{i\ell} = \begin{cases} 1 & \text{if both nodes of the } i\text{-th pair belong to } C_\ell, \\ 0 & \text{otherwise.} \end{cases}$$
The matrix constructed above is related to discrete Radon transforms. In fact, up to a constant and column scaling, the transpose matrix is called the discrete Radon transform for two suitably defined homogeneous spaces (diaconis1988). Our usage here is to exploit the transpose matrix of the Radon transform to construct an over-complete dictionary, so that the observed output has a sparse representation with respect to it. A more technical discussion of Radon transforms is beyond the scope of this paper.
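The above formulation can be generalized to the case where $b$ is a vector of length $\binom{n}{j}$ ($j \ge 2$), with the $i$-th entry of $b$ characterizing a quantity associated with a $j$-set (a set with cardinality $j$). The dictionary $\Phi$ will then be a binary matrix with entries indicating whether a $j$-set is a subset of a $k$-clique (a clique with $k$ nodes), i.e.,
$$\Phi_{i\ell} = \begin{cases} 1 & \text{if the } i\text{-th } j\text{-set is a subset of the } \ell\text{-th } k\text{-clique}, \\ 0 & \text{otherwise.} \end{cases}$$
Therefore, the case where $b$ is the vector of length $M$ corresponds to the special case $j = 2$. Our algorithms and theory hold for general $j$ with $j < k$.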
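A minimal sketch of the binary dictionary described above, with rows indexed by $j$-sets and columns by $k$-sets. The helper name `clique_dictionary`, the lexicographic ordering of subsets, and the sizes $n = 5$, $j = 2$, $k = 3$ are assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def clique_dictionary(n, j, k):
    """Binary inclusion matrix: entry is 1 iff the row's j-set lies inside the column's k-set."""
    rows = list(combinations(range(n), j))
    cols = list(combinations(range(n), k))
    matrix = [[1 if set(r) <= set(c) else 0 for c in cols] for r in rows]
    return rows, cols, matrix


rows, cols, Phi = clique_dictionary(n=5, j=2, k=3)
# Every k-set contains the same number of j-sets, so all column sums agree.
column_sums = [sum(Phi[i][c] for i in range(len(rows))) for c in range(len(cols))]
print(len(rows), len(cols), set(column_sums))
```

The number of columns grows as $\binom{n}{k}$, so enumerating the full dictionary like this is only practical for small networks.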
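Now we provide two concrete reconstruction programs for the clique identification problem:
$$(\mathcal{P}_1) \quad \min_x \|x\|_1 \quad \text{subject to} \quad \Phi x = b,$$
$$(\mathcal{P}_{1,\delta}) \quad \min_x \|x\|_1 \quad \text{subject to} \quad \|\Phi x - b\|_\infty \le \delta.$$
$\mathcal{P}_1$ is known as Basis Pursuit (chen1999), where we consider the ideal case that the noise level is zero. For robust reconstruction against noise, we consider the relaxed program $\mathcal{P}_{1,\delta}$. The program $\mathcal{P}_{1,\delta}$ differs from the Dantzig selector (candes2007), which uses a constraint of the form $\|\Phi^T(\Phi x - b)\|_\infty \le \delta$. The reason for our choice of $\mathcal{P}_{1,\delta}$ lies in the fact that a more natural noise model for network data is bounded noise rather than Gaussian noise. Moreover, our linear programming formulation of $\mathcal{P}_{1,\delta}$ enables practical computation for large scale problems.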
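To see why the relaxed program is a linear program, here is one standard reformulation. It is a sketch under the assumption that the relaxed program reads $\min \|x\|_1$ subject to $\|\Phi x - b\|_\infty \le \delta$, with $\Phi$ the dictionary, $b$ the vectorized observations, and $\delta$ the noise bound; the auxiliary variables $u$, $v$ and the all-ones vector $\mathbf{1}$ are introduced here for illustration. Writing $x = u - v$ with $u, v \ge 0$:

```latex
\begin{aligned}
\min_{u,\, v \in \mathbb{R}^N} \quad & \mathbf{1}^T (u + v) \\
\text{subject to} \quad & \Phi (u - v) - b \le \delta \mathbf{1}, \\
& b - \Phi (u - v) \le \delta \mathbf{1}, \\
& u \ge 0, \quad v \ge 0 .
\end{aligned}
```

At an optimum, $u$ and $v$ can be taken to have disjoint supports (otherwise both could be decreased without changing $u - v$), so the objective equals $\|x\|_1$. Setting $\delta = 0$ gives the equality-constrained Basis Pursuit program.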
Let $G = (V, E)$ be the network we are trying to model. The set of vertices $V$ represents individual identities such as people in the social network. Each edge in $E$ is associated with a weight which represents interaction frequency information.
We assume that there are several common interest groups or communities within the network, represented by cliques (or complete sub-graphs) within the graph $G$, which are perhaps of different sizes and may overlap. Every community has a certain interaction frequency, which can be viewed as a function on cliques. However, we only receive partial measurements consisting of low order interaction frequencies on subsets of a clique. For example, in the simplest case we may only observe pairwise interactions represented by edge weights. Our problem is to reconstruct the function on cliques from such partially observed data. A graphical illustration of this idea is provided in Figure 1, in which we see that an observed network can be written as a linear combination of several overlapping cliques.
One application scenario is to identify two basketball teams from pairwise interactions among players. Suppose we have which is a signal on all -sets of a -player set. We assume it is sparsely concentrated on two -sets which correspond to the two teams with nonzero weights. Assume we have observations of pairwise interactions , where is uniform random noise defined on . We solve , with , which is a linear program over with parameters and .
4.3 Connection with Radon Basis
Let $V_j$ denote the set of all $j$-sets of $V = \{1, \ldots, n\}$ and $M^j$ be the set of real-valued functions on $V_j$. The observed interaction frequencies on all $j$-sets, $b$, can be viewed as a function in $M^j$. We build a matrix $R^{j,k}: M^k \to M^j$ ($j < k$) as a mapping from functions on all $k$-sets of $V$ to functions on all $j$-sets of $V$. In this setup, each row represents a $j$-set and each column represents a $k$-set. The entries of $R^{j,k}$ are either $0$ or $1$, indicating whether the $j$-set is a subset of the $k$-set. Note that every column of $R^{j,k}$ has $\binom{k}{j}$ ones. Lacking a priori information, we assume that every $j$-set of a particular $k$-set has equal interaction probability, and hence choose the same constant for each column. We further normalize $R^{j,k}$ to $\tilde{R}^{j,k}$ so that the $\ell_2$ norm of each column of $\tilde{R}^{j,k}$ is $1$. To summarize, we have
$$\tilde{R}^{j,k}(\tau, \sigma) = \begin{cases} 1 \big/ \sqrt{\binom{k}{j}} & \text{if } \tau \subset \sigma, \\ 0 & \text{otherwise,} \end{cases}$$
where $\tau$ is a $j$-set and $\sigma$ is a $k$-set. As we will see, this construction leads to a canonical basis associated with the discrete Radon transform. The size of the matrix $\tilde{R}^{j,k}$ clearly depends on the total number of items $n$. We omit $n$ as its meaning will be clear from the context.
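As a small worked instance of this construction (the sizes $n = 4$, $j = 2$, $k = 3$ are chosen here purely for illustration, and $\tilde{R}^{2,3}$ denotes the column-normalized inclusion matrix), order the rows by the pairs $\{1,2\}, \{1,3\}, \{1,4\}, \{2,3\}, \{2,4\}, \{3,4\}$ and the columns by the triples $\{1,2,3\}, \{1,2,4\}, \{1,3,4\}, \{2,3,4\}$. Then:

```latex
\tilde{R}^{2,3} \;=\; \frac{1}{\sqrt{3}}
\begin{pmatrix}
1 & 1 & 0 & 0 \\
1 & 0 & 1 & 0 \\
0 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 \\
0 & 1 & 0 & 1 \\
0 & 0 & 1 & 1
\end{pmatrix}
```

Each column has $\binom{3}{2} = 3$ nonzero entries, one per pair inside the corresponding triple, and therefore unit $\ell_2$ norm after scaling by $1/\sqrt{3}$.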
The matrix $R^{j,k}$ constructed above is related to discrete Radon transforms on the homogeneous space $M^k$. In fact, up to a constant, the adjoint operator $(R^{j,k})^*$ is called the discrete Radon transform from the homogeneous space $M^j$ to $M^k$ in diaconis1988. Here all the $k$-sets form a homogeneous space. The collection of all row vectors of $R^{j,k}$ is called the $j$-th Radon basis for $M^k$. Our usage here is to exploit the transpose matrix of the Radon transform to construct an over-complete dictionary for $M^j$, so that the observation $b$ can be represented by a possibly sparse function $x \in M^k$ ($k \ge j$).
The Radon basis was proposed as an efficient way to study partially ranked data in diaconis1988, where it was shown that by looking at low order Radon coefficients of a function on , we usually get useful and interpretable information. The approach here adds a reversal of this perspective, i.e. the reconstruction of sparse high order functions from low order Radon coefficients. We will discuss this in the sequel with a connection to the compressive sensing (chen1999; candes2005).
5 Mathematical Theory
One advantage of our new framework on compressive network analysis is that it enables rigorous theoretical analysis of the corresponding convex programs.
5.1 Failure of Universal Recovery
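Recently it was shown by candes2005 and candes2008 that $\mathcal{P}_1$ has a unique sparse solution $x_0$ if the matrix $\Phi$ satisfies the Restricted Isometry Property (RIP), i.e., for every subset of columns $T$ with $|T| \le s$, there exists a certain universal constant $\delta_s$ such that
$$(1 - \delta_s) \|x\|_2^2 \le \|\Phi_T x\|_2^2 \le (1 + \delta_s) \|x\|_2^2 \quad \text{for all } x \in \mathbb{R}^{|T|},$$
where $\Phi_T$ is the sub-matrix of $\Phi$ with columns indexed by $T$. Then exact recovery holds for all $s$-sparse signals $x_0$ (i.e., $x_0$ has at most $s$ non-zero components), which is why it is called universal recovery.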
Unfortunately, in our construction of the basis matrix , RIP is not satisfied unless for very small . The following theorem illustrates the failure of universal recovery in our case.
Let and with . Unless , there does not exist a such that the inequalities
hold universally for every with , where .
Note that does not depend on the network size , which will be problematic. We can only recover a constant number of cliques no matter how large the network is. The main problem for such a negative result is that the RIP tries to guarantee exact recovery for arbitrary signals with a sparse representation in . For many applications, such a condition is too strong to be realistic. Instead of studying such “universal” conditions, in this paper we seek conditions that secure exact recovery of a collection of sparse signals , whose sparsity pattern satisfies certain conditions more appropriate to our setting. Such conditions could be more natural in reality, which will be shown in the sequel as simply requiring bounded overlaps between cliques.
Recall that the matrix has altogether columns. Each column in fact corresponds to a -clique. Therefore, we could also use a -clique to index a column of . In this sense, let be a subset of size . An equivalent notation is to represent as a class of sets: where each and .
We can extract a set of columns ( is interpreted as a -set) and form a submatrix . Recall that has altogether rows. Combining the condition that with the fact that the number of nonzero rows of should be exactly , we know that there must exist rows in which contain only zeroes.
By discarding zero rows, it is easy to show that the rank of is at most , which is less than the number of columns. To see that the rank of is at most , we need to exploit the fact that , therefore
from which we see that the number of nonzero rows of is smaller than the number of columns.
Thus, the columns in must be linearly dependent. In other words, there exists a nonzero vector where such that . When , since , we cannot expect universal sparse recovery for all -sparse signals .
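The rank argument above can be checked numerically on a small instance. The sketch below is an illustration only: the sizes $j = 2$, $k = 3$ and the 6-node ground set are assumptions chosen so that there are fewer $j$-sets than $k$-sets, the helper names are hypothetical, and column normalization is omitted because it does not change the rank. Only the nonzero rows are built, which corresponds to discarding the zero rows in the proof.

```python
from fractions import Fraction
from itertools import combinations


def inclusion_matrix(ground, j, k):
    """Rows: j-sets of `ground`; columns: k-sets of `ground`; entry 1 iff j-set lies in k-set."""
    rows = list(combinations(ground, j))
    cols = list(combinations(ground, k))
    return [[Fraction(int(set(r) <= set(c))) for c in cols] for r in rows]


def rank(matrix):
    """Exact rank by Gaussian elimination over the rationals."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    found = 0
    for col in range(n_cols):
        pivot = next((r for r in range(found, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(found + 1, n_rows):
            factor = m[r][col] / m[found][col]
            if factor != 0:
                m[r] = [a - factor * b for a, b in zip(m[r], m[found])]
        found += 1
    return found


sub = inclusion_matrix(range(6), j=2, k=3)   # all 3-cliques on 6 assumed nodes
n_rows, n_cols = len(sub), len(sub[0])
r = rank(sub)
print(n_rows, n_cols, r)
```

Since the rank cannot exceed the number of rows, the columns are linearly dependent, so some nonzero vector supported on these cliques is mapped to zero.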
5.2 Exact Recovery Conditions
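Here we present our exact recovery conditions for $x_0$ from the observed data $b$ by solving the linear program $\mathcal{P}_1$. Suppose $\Phi$ is an $M$-by-$N$ matrix and $x_0$ is a sparse signal. Let $T = \mathrm{supp}(x_0)$, $T^c$ be the complement of $T$, and $\Phi_T$ (or $\Phi_{T^c}$) be the submatrix of $\Phi$ where we only extract the column set $T$ (or $T^c$, respectively). The following proposition from candes2005 characterizes the conditions under which $\mathcal{P}_1$ has a unique solution. To make this paper self-contained, we also include the proof in this section. Proposition 5.2 (candes2005) Let $T = \mathrm{supp}(x_0)$. We assume that $\Phi_T^T \Phi_T$ is invertible and there exists a vector $w \in \mathbb{R}^M$ such that
$$\langle \phi_\ell, w \rangle = \mathrm{sign}(x_0(\ell)) \ \text{ for all } \ell \in T, \qquad |\langle \phi_\ell, w \rangle| < 1 \ \text{ for all } \ell \in T^c.$$
Then $x_0$ is the unique solution for $\mathcal{P}_1$.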
The necessity of the two conditions comes from the KKT conditions of . If we consider an equivalent form of
whose Lagrangian is
Here , , , are the Lagrange multipliers.
Then the KKT condition gives
with and for all .
Clearly . Let , by the Strictly Complementary Theorem for linear programming in yebook1997, there exist and such that for all with , and for all with . Thus, the first equation leads to
the second equation leads to
Therefore, the two conditions are necessary for to be the unique solution of .
To prove that these two conditions are sufficient to guarantee is the unique minimizer to , we need to show any minimizer to the problem must be equal to . Since obeys the constraint , we must have
Now take a obeying the two conditions, we then compute
Thus, the inequalities in the above computation must in fact be equality. Since is strictly less than for all , this in particular forces for all . Thus
Since all columns in are independent, we must have for all . Thus . This concludes the proof of our theorem.
The above theorem gives a necessary and sufficient condition under which $\mathcal{P}_1$ exactly recovers the sparse signal $x_0$ in the noise-free setting. The necessity and sufficiency come from the KKT conditions in convex optimization theory (candes2005). However, this condition is difficult to check due to the presence of $w$. If we further assume that $w$ lies in the column span of $\Phi_T$, the condition in Proposition 5.2 reduces to the following condition.
Irrepresentable Condition (IRR) The matrix $\Phi$ satisfies the IRR condition with respect to $T = \mathrm{supp}(x_0)$ if $\Phi_T^T \Phi_T$ is invertible and
$$\big\| \Phi_{T^c}^T \Phi_T (\Phi_T^T \Phi_T)^{-1} \big\|_\infty < 1,$$
where $\|\cdot\|_\infty$ stands for the matrix sup-norm, i.e., $\|B\|_\infty = \max_i \sum_j |B_{ij}|$.
By restricting $w$ to lie in the image of $\Phi_T$, the conditions in Proposition 5.2 reduce to the IRR condition.
Since $w$ lies in the image of $\Phi_T$, we can write $w = \Phi_T v$. To make sure that the first condition in Proposition 5.2 holds, we must have $\Phi_T^T \Phi_T v = \mathrm{sign}(x_{0,T})$, where $x_{0,T}$ denotes the restriction of $x_0$ to $T$, so
$$v = (\Phi_T^T \Phi_T)^{-1} \mathrm{sign}(x_{0,T}).$$
Now the second condition in Proposition 5.2 can be equivalently written as
$$\big\| \Phi_{T^c}^T \Phi_T (\Phi_T^T \Phi_T)^{-1} \mathrm{sign}(x_{0,T}) \big\|_\infty < 1,$$
which, when required for every sign pattern of $x_{0,T}$, is exactly the IRR condition.
Intuitively, the IRR condition requires that, for the true sparse signal $x_0$, the relevant bases $\Phi_T$ are not highly correlated with the irrelevant bases $\Phi_{T^c}$. Note that this condition only depends on $\Phi$ and the support $T$, which makes it easier to check. The assumption that $w$ lies in the column span of $\Phi_T$ is mild; it is actually a necessary condition for $x_0$ to be reconstructed by the Lasso (tibshirani1996) or the Dantzig selector (candes2007), even under Gaussian-like noise assumptions (zhao2006; yuan2007).
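For small networks the IRR quantity can be evaluated exactly. The sketch below assumes the IRR quantity is the maximum absolute row sum of $\Phi_{T^c}^T \Phi_T (\Phi_T^T \Phi_T)^{-1}$ with $\Phi$ the column-normalized inclusion matrix of $j$-sets in $k$-sets; because every column has the same norm, the normalization cancels and unnormalized inner products (counts of shared $j$-sets) can be used. The function names and the two example supports are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def solve(gram, rhs):
    """Solve gram * y = rhs exactly by Gauss-Jordan elimination (gram must be invertible)."""
    size = len(gram)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(gram, rhs)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(size):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def shared(a, b, j):
    """Number of j-sets contained in both cliques a and b."""
    return comb(len(set(a) & set(b)), j)


def irr_value(n, j, k, support):
    """Max absolute row sum of Phi_Tc^T Phi_T (Phi_T^T Phi_T)^{-1} over k-sets outside the support."""
    support = [tuple(sorted(s)) for s in support]
    gram = [[shared(s, t, j) for t in support] for s in support]
    worst = Fraction(0)
    for sigma in combinations(range(n), k):
        if sigma in support:
            continue
        cross = [shared(sigma, t, j) for t in support]
        # The Gram matrix is symmetric, so the row for sigma equals (gram^{-1} cross)^T.
        worst = max(worst, sum(abs(v) for v in solve(gram, cross)))
    return worst


disjoint = irr_value(6, 2, 3, [(0, 1, 2), (3, 4, 5)])
touching = irr_value(6, 2, 3, [(0, 1, 3), (0, 2, 4), (1, 2, 5)])
print(disjoint, touching)
```

In the first support the two cliques are disjoint and the value stays strictly below one. In the second, each pair of cliques shares one node and the clique $(0, 1, 2)$ outside the support has all of its pairs covered, so the value reaches one and the strict inequality fails.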
5.3 Detecting Cliques of Equal Size
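In this subsection, we present sufficient conditions for IRR which can be easily verified. We consider the case that $\Phi = \tilde{R}^{j,k}$ with $j < k$. Given data about all $j$-sets, we want to infer important $k$-cliques. Suppose $x_0$ is a sparse signal on all $k$-cliques. We have the following theorem, which is a direct result of Lemma 5.3.
Theorem 5.3. Let $T = \mathrm{supp}(x_0)$. If we enforce the overlaps among $k$-cliques in $T$ to be no larger than $r$, then $r \le j - 2$ guarantees the IRR condition.
Lemma 5.3. Let $T = \mathrm{supp}(x_0)$ and $j \ge 2$. Suppose for any $\sigma_1, \sigma_2 \in T$, the two cliques corresponding to $\sigma_1$ and $\sigma_2$ have overlaps no larger than $r$. Then we have:
1. If $r \le j - 2$, then $\| \Phi_{T^c}^T \Phi_T (\Phi_T^T \Phi_T)^{-1} \|_\infty < 1$;
2. If $r = j - 1$, then $\| \Phi_{T^c}^T \Phi_T (\Phi_T^T \Phi_T)^{-1} \|_\infty \le 1$, where equality holds in certain examples;
3. If $r = j$, there are examples such that $\| \Phi_{T^c}^T \Phi_T (\Phi_T^T \Phi_T)^{-1} \|_\infty > 1$.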
One thing to note is that Theorem 5.3 is only an easy-to-verify condition based on a worst-case analysis, which is sufficient but not necessary. In fact, what really matters is the IRR condition. The theorem uses a simple characterization of allowed clique overlaps which guarantees the IRR condition. Specifically, clique overlaps no larger than $j - 2$ are sufficient to guarantee exact sparse recovery by $\mathcal{P}_1$, while larger overlaps may violate the IRR condition. Since this theorem is based on a worst-case analysis, in real applications one may encounter examples which have overlaps larger than $j - 2$ while $\mathcal{P}_1$ still works.
In summary, IRR is sufficient and almost necessary to guarantee exact recovery. Theorem 5.3 tells us that the intuition behind the IRR is that overlaps among cliques must be small enough, which is easier to check. In the next subsection, we show that IRR is also sufficient to guarantee stable recovery with noise.
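The overlap condition is cheap to verify directly from a candidate set of cliques. The sketch below is illustrative only: the function names are hypothetical, the overlap bound is passed as a parameter `r`, and the choice `r = j - 2` in the usage is our reading of the sufficient condition, to be treated as an assumption.

```python
from itertools import combinations


def max_overlap(cliques):
    """Largest number of nodes shared by any two distinct cliques."""
    return max((len(set(a) & set(b)) for a, b in combinations(cliques, 2)), default=0)


def within_overlap_bound(cliques, r):
    """True if every pair of cliques shares at most r nodes."""
    return max_overlap(cliques) <= r


j = 3  # order of the observed interactions (assumed for this example)
loose = [(0, 1, 2, 3), (3, 4, 5, 6), (7, 8, 9, 10)]
tight = [(0, 1, 2, 3), (2, 3, 4, 5)]
print(within_overlap_bound(loose, j - 2), within_overlap_bound(tight, j - 2))
```

A support that fails this check is not necessarily unrecoverable, since the overlap bound is a worst-case sufficient condition; the IRR quantity itself is the sharper test.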
To prove Lemma 5.3, given any , we define
the intuition of such a definition is that
As we will see in the following proofs, we essentially try to bound for .
Before we present the detailed technical proof, we first introduce the high-level idea: our main purpose is to bound . Since each entry of the matrix is indexed by two -sets, the value of this entry represents how many -sets are contained in the intersection of these two -sets. Under the condition that , it is straightforward to see that the matrix is the identity. Therefore, bounding is equivalent to bounding , which is exactly .
Proof of the case under Condition 1
Under Condition 1, any satisfy , and hence any two columns in are orthogonal. This implies
is an identity matrix.
Now given , we will prove under condition 1. If this is true, then
Let where () are -sets. We need to prove
for all .
Let , so is a collection of -sets of (Here if , then is simply an empty set). Obviously, we have . So
Now we note that for any , we have . This is true because otherwise, supposing , this means is a -set of and . Hence , which implies that
This contradicts the condition that ’s() have overlaps at most . So must be pairwise disjoint. Hence
For any , every is a -set of . Hence is of course a -set of . The set is of size . So if we let which is the collection of all -sets of , then we have . So .
So far, we have proved . The above proof of for any remains valid under Condition 2. Next, we prove that if any satisfy , then equality cannot hold.
Without loss of generality, we assume ; otherwise, if none of the ’s satisfies , then , which finishes the proof. To show that the equality does not hold, we only need to find one -set that does not belong to .
In this case, we can let , where ( because otherwise , which contradicts the fact that ). Now we show that is not a member of . Clearly is not a member of because . It remains to show that is not a member of any (). If this were not true, say , then , and hence , which contradicts the condition that .
Since it is clear that , this means is a proper subset of . So , which means .
Proof of the case under Condition 2
Under Condition 2, the argument is almost the same as the proof for Lemma 1. We have that is an identity matrix and . However, one cannot show in this case. We have the following example where, if is large enough, can be exactly equal to one.
Let . Denote all the -sets of by . When is large enough, we choose disjoint -sets of , denoted by .
Let , where . Hence and ’s satisfy . But
Proof of the case under Condition 3
Under condition 3, we can construct examples where
Let be all -sets of . For large enough , it is possible to choose disjoint -sets of , say . Let for and . Define which is of size .
In this case, for any and , for any . Then is a by matrix shown below with rows and columns corresponding to
Here . The inverse of the matrix is
Consider ; then the row corresponding to for is a vector of length with each entry being . So the row vector corresponding to in is a vector of length , . This vector has row sum
Hence in this example .
In the following, we construct explicit conditions which allow large overlaps while the IRR still holds, as long as such heavy overlaps do not occur too often among the cliques in . The existence of a partition of in the next theorem is a reasonable assumption in network settings where network hierarchies exist. In social networks, it has been observed by girvan2002 that communities themselves also join together to form meta-communities. The next theorem characterizes such a scenario: it allows relatively larger overlaps between communities from the same meta-community and relatively smaller overlaps between communities from different meta-communities.
Assume . Let . Suppose there exists a partition with each satisfying , such that
for any belonging to the same part of the partition, ;
for any belonging to different parts of the partition, .