Hierarchical network models for structured exchangeable interaction processes

01/28/2019, by Walter Dempsey et al.

Network data often arises via a series of structured interactions among a population of constituent elements. E-mail exchanges, for example, have a single sender followed by potentially multiple receivers. Scientific articles, on the other hand, may have multiple subject areas and multiple authors. We introduce hierarchical edge exchangeable models for the study of these structured interaction networks. In particular, we introduce the hierarchical vertex components model as a canonical example, which partially pools information via a latent, shared population-level distribution. Theoretical analysis and supporting simulations provide clear model interpretation, and establish global sparsity and power-law degree distribution. A computationally tractable Gibbs algorithm is derived. We demonstrate the model on both the Enron e-mail dataset and an ArXiv dataset, showing goodness of fit of the model via posterior predictive validation.


1. Introduction

Modern statistical network analysis focuses on the study of large, complex networks that can emerge in diverse fields, including social, biological, and physical systems [1, 2, 3, 4, 5]. The expanding scope of network analysis has led to an increase in the need for statistical models and inferential tools that can handle the increasing complexity of network data structures. In this paper, we focus on network data arising from sequences of interactions. Network data arising in this manner would benefit from a framework built upon the interaction as the statistical unit [6] rather than upon the constituent elements within each interaction as the statistical units. Edge-exchangeable models [7, 8] are built specifically to analyze datasets containing these complex interactions.

While Crane and Dempsey (2017) [7] provide a framework for statistical analysis of interaction data, the proposed Hollywood model only captures basic global features. Specifically, the Hollywood model's asymptotic behavior reflects the empirical properties of sparsity and power-law degree distributions observed in real-world network data, properties that are not as well reflected in classic statistical network models such as exponential random graph models (ERGMs) [9], graphon models [10], and stochastic blockmodels (SBMs) [11]. While edge exchangeability is attractive as a theoretical framework, the set of current edge exchangeable models is inadequate to handle the structural complexity of modern network data.

The edge exchangeable model proposed in this paper is motivated by an important fact: most common complex networks constructed from interaction data are structured. A phone-call interaction, for example, takes the form of a sender and receiver pair. E-mail correspondence generalizes this type of interaction to one sender but potentially multiple receivers with different attributes like "To," "Cc," and "Bcc". This paper makes a substantial push forward by constructing hierarchical models that reflect this common structure of interaction data, hereafter referred to as structured interaction data. The model overlays local behavior (e.g., per sender) with global information by partial pooling through a shared global, latent distribution. Simulation and theoretical analysis confirm that the proposed hierarchical model can simultaneously achieve sender-specific local power-law degree distributions and a global power-law degree distribution.

1.1. Relevant prior work on interaction data

Interaction data often arises in settings where communications amongst a set of constituent elements over a specific time period are recorded [12, 13]. Examples are numerous and include: authorship and co-sponsoring of legislation [14, 15], sending and receiving e-mails [12, 16], posting and responding on a community forum [17], and Internet traceroute measurements [18]. In each case, the interaction (edge) is the statistical unit to be modeled, as contrasted with the subjects (nodes) of the interactions considered in other work [19]. See [8, 7] for further discussion of the advantages of defining interactions as the statistical units.

The literature contains several papers focused on statistical modeling of interaction data. Perry and Wolfe (2013) [20] construct a Cox proportional intensity model [21]. Butts (2008) [22] considers likelihood-based inference using a variant of the proportional intensity model to capture interaction behavior in social settings. Crane and Dempsey (2017) [7] consider non-hierarchical models for interaction data; they introduce the notion of edge exchangeable network models and explore their basic statistical properties. In particular, they show that edge exchangeable models allow for sparse structure and power-law degree distributions, widely observed empirical behaviors that cannot be handled by conventional approaches.

An alternative approach emerges out of the recent work of Caron and Fox (2017) [23], who construct random graphs from point processes on $\mathbb{R}_+^2$. The random graph is characterized by an object called a graphex [24]. Random graph models generated by this procedure can incorporate sparse, power-law behavior into a well-defined population model. Finite random graphs can be obtained via a thresholding operation, termed p-sampling [25]. Such random graphs are vertex exchangeable in that they are built from exchangeable point processes. In this setting, exchangeability is a consequence of projectivity rather than the simple structured interaction data sampling scheme proposed in this paper. See the contributed discussion to the paper by Caron and Fox (2017) [23], in particular contributions by Bharath [26] and Crane [27], for further discussion.

1.2. Outline and main contributions

The main contributions of this paper are as follows:

  1. We start by formally defining structured interaction data in Definition 2.1. We then define exchangeable structured interaction processes in Definition 3.1.

  2. We prove a representation theorem for these exchangeable processes in Theorem 3.2; we then define, in section 3.1, hierarchical vertex components models (HVCMs) – a subfamily of exchangeable processes that capture important interaction dynamics.

  3. A particular computationally tractable HVCM is introduced in section 4 and an efficient Gibbs sampling inferential algorithm is derived in section 6.

  4. We establish basic statistical properties in section 5. In particular, we provide theoretical guarantees of sparsity and power law for the chosen HVCM – two important empirical properties of network data.

  5. We demonstrate this HVCM on both the Enron e-mail dataset and an arXiv dataset in section 7. In particular, we show how the HVCM can be used to perform goodness-of-fit checks for models of network data via posterior predictive checks, an often under-emphasized aspect of statistical network modeling.

Overall, this paper presents a statistically rigorous, principled hierarchical modeling framework for handling complex structured interaction data.

2. Structured interaction data

We start by defining structured interaction data, illustrating with a sequence of concrete examples of increasing complexity.

Definition 2.1 (Structured interaction data).

Let $\mathcal{P}$ denote a set of constituent elements. Then, for a set $A$, we write $\mathrm{fin}(A)$ to denote the set of all finite multisets of $A$. A structured interaction process for an ordered sequence of sets $\mathcal{P}_1, \ldots, \mathcal{P}_k$ is a correspondence $Y : \mathcal{I} \to \mathrm{fin}(\mathcal{P}_1) \times \cdots \times \mathrm{fin}(\mathcal{P}_k)$ between a set $\mathcal{I}$ indexing interactions and the ordered sequences of finite multisets of $\mathcal{P}_1, \ldots, \mathcal{P}_k$.
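To fix ideas, the following minimal Python sketch represents a structured interaction as an ordered tuple of finite multisets; all names (make_email, make_article, process) are illustrative, not notation from the paper.

```python
from collections import Counter

# Minimal sketch (all names illustrative): a structured interaction is an
# ordered tuple of finite multisets, one per constituent population.
# Counter acts as a multiset: keys are elements, values are multiplicities.

def make_email(sender, receivers):
    """An e-mail as in Example 2.4: ({sender}, multiset of receivers)."""
    return (Counter([sender]), Counter(receivers))

def make_article(subjects, authors):
    """A scientific article as in Example 2.5: (subjects, authors)."""
    return (Counter(subjects), Counter(authors))

# An interaction process is a correspondence from an index set to
# interactions; here, simply a list indexed by 0, 1, 2, ...
process = [
    make_email("p1", ["p2", "p3"]),
    make_email("p2", ["p1", "p3", "p3"]),  # multisets allow repeats
    make_article(["s1", "s2"], ["a1", "a2"]),
]
```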

Remark 2.2 (Difference from interaction data).

In [7], an interaction process is defined as a correspondence $Y : \mathcal{I} \to \mathrm{fin}(\mathcal{P})$, where $\mathcal{P}$ is a single population. Structured interaction data, instead, consists of a series of finite multisets, and does not require each set of constituent elements to be equivalent. That is, each population $\mathcal{P}_j$ may contain different types of elements. This flexibility will allow us to introduce hierarchical structure into the exchangeable model.

Finally, let $\mathrm{fin}_k(A)$ denote the multisets of $A$ of size $k$, so that $\mathrm{fin}(A)$ is the disjoint union $\sqcup_{k \geq 0} \mathrm{fin}_k(A)$.

Example 2.3 (Phone-calls).

Assume $\mathcal{P}_1$ and $\mathcal{P}_2$ are equivalent and let $\mathcal{P} = \{p_1, p_2, \ldots\}$ be a countably infinite population. A phone call can be represented as an ordered pair of "sender" and "receiver" drawn from $\mathcal{P}$. Therefore, a phone-call interaction process is a correspondence $Y : \mathcal{I} \to \mathcal{P} \times \mathcal{P}$. For instance, $Y_i = (p_1, p_2)$ is a phone call from sender $p_1$ to receiver $p_2$, both in population $\mathcal{P}$. This is distinct from $Y_i = (p_2, p_1)$, where the sender and receiver roles are reversed.

Example 2.4 (E-mails).

Assume $\mathcal{P}_1$ and $\mathcal{P}_2$ are equivalent and let $\mathcal{P}$ be a countably infinite population. An e-mail can be represented as an ordered sequence: a sender followed by a finite multiset of receivers. Then an e-mail interaction process is a correspondence $Y : \mathcal{I} \to \mathcal{P} \times \mathrm{fin}(\mathcal{P})$. For instance, $Y_i = (p_1, \{p_2, p_3\})$ is an e-mail from sender $p_1$ to receivers $p_2$ and $p_3$. This is distinct from $(p_2, \{p_1, p_3\})$ and $(p_3, \{p_1, p_2\})$. Figure 1 is a visualization of a similar structured interaction dataset formed from Facebook posts (i.e., a poster followed by a finite multiset of responders).

Figure 1. Example of network data from Facebook posts. The post process is a correspondence between each post and the pair (poster, finite multiset of reacting users); for example, one user posts to the forum and a second user reacts to that post, a second post draws its own reactions, and a third user reacts to both posts.
Example 2.5 (Scientific articles).

Consider summarizing a scientific article by its (1) list of subject areas and (2) list of authors. Then the scientific article process is a correspondence $Y : \mathcal{I} \to \mathrm{fin}(\mathcal{P}_1) \times \mathrm{fin}(\mathcal{P}_2)$. For instance, $Y_i = (\{s_1, s_2\}, \{a_1, a_2\})$ is an article with subject areas $s_1$ and $s_2$ and authors $a_1$ and $a_2$. Here, $\mathcal{P}_1$ and $\mathcal{P}_2$ are distinct populations.

Example 2.6 (Movies).

Consider summarizing a movie by its (1) genre, (2) list of producers, (3) director, and (4) list of actors. Of course, there is overlap in certain populations, as producers can be directors and directors can be actors, but none are a genre (unless Scorsese, Spielberg, or Tarantino are considered genres unto themselves). Then the movie process is a correspondence $Y : \mathcal{I} \to \mathcal{P}_1 \times \mathrm{fin}(\mathcal{P}_2) \times \mathcal{P}_3 \times \mathrm{fin}(\mathcal{P}_4)$. For instance, $Y_i = (g, \{q_1, q_2\}, d, \{a_1, a_2, d\})$ is a movie with genre $g$, producers $q_1$ and $q_2$, director $d$, and actors $a_1$, $a_2$, and $d$. Note, in this example, the director is also one of the actors.

The above examples show that Definition 2.1 covers a wide variety of settings from network science. Next, we construct interaction-labeled networks and define exchangeable structured interaction processes.

Remark 2.7 (Covariates).

In this paper, we focus on the study of structured interaction processes in Definition 2.1 with no additional information, such as covariates. Incorporating such covariate information is quite difficult; see [28, 29, 30, 31, 32, 33] for examples of incorporating covariates into network analysis. Covariate information can come in two forms: (1) covariate information on the interaction; and (2) covariate information on the constituent elements. Examples of (1) include the subject line or body text of an e-mail, or the genre and gross sales of a movie. Examples of (2) include the gender, age, job title, or university affiliation of authors of a scientific article. Certain interaction covariates can be incorporated into the models considered in this paper. For example, in the arXiv dataset, the article's subject can be viewed as covariate information on the interaction. We show how this can be incorporated as part of the structured interaction data structure, and therefore accounted for in the statistical models.

2.1. Interaction-labeled networks

For the remainder of this paper, we focus on structured interaction processes of the form $Y : \mathcal{I} \to \mathrm{fin}(\mathcal{P}_1) \times \mathrm{fin}(\mathcal{P}_2)$. This type of structured interaction process captures the phone-call, e-mail, and scientific article examples. The arguments presented naturally extend to more general structured interaction processes as given in Definition 2.1. When two populations of constituent elements are equivalent, we write $\mathcal{P}_1 \equiv \mathcal{P}_2$. The interaction-labeled network is an equivalence class constructed from the structured interaction process by quotienting out the labeling of the constituent elements. Let $\rho_j : \mathcal{P}_j \to \mathcal{P}_j$ be a bijection for $j = 1, 2$. We write $\rho = (\rho_1, \rho_2)$ for the composite bijection obtained by applying each $\rho_j$ componentwise. If $\mathcal{P}_1 \equiv \mathcal{P}_2$, then $\rho_1 = \rho_2$; that is, bijections among equivalent populations, e.g., the senders and receivers in an e-mail network, must agree. Then $\rho$ induces an action on the product space $\mathrm{fin}(\mathcal{P}_1) \times \mathrm{fin}(\mathcal{P}_2)$ by the composite map $\rho(A_1, A_2) = (\rho_1(A_1), \rho_2(A_2))$.

Therefore, the bijection $\rho$ acts on the structured interaction process via composition $\rho \circ Y$. The structured interaction-labeled network is then the equivalence class constructed from the structured interaction network by quotienting out over bijections $\rho$:

(1) $\mathbf{y}_N := \{\, \rho \circ Y : \rho_j \text{ a bijection of } \mathcal{P}_j, \ j = 1, 2 \,\},$

where $N = (|\mathcal{P}_1|, |\mathcal{P}_2|)$ records the cardinality of each population. Note we have only quotiented out labels for constituent elements, so the object $\mathbf{y}_N$ still has uniquely labeled interactions. For simplicity, we write $\mathbf{y}$ and leave the subscript $N$ implicit.

In the remainder of the paper, we assume the index set $\mathcal{I}$ is countably infinite and replace it by $\mathbb{N}$. For any $S \subseteq \mathbb{N}$, we define the restriction $Y|_S$ of $Y$ to the subset $S$. This restricted interaction process induces a restriction to $S$ of the interaction-labeled network. We write $\mathbf{y}|_S$ to denote the interaction-labeled network associated with the restricted process $Y|_S$. For $S = [n] = \{1, \ldots, n\}$, we simply write $Y|_n$ to denote the restricted structured interaction process and $\mathbf{y}|_n$ to denote the corresponding structured interaction network.

3. Structured interaction exchangeable models

Let $\mathbf{y}$ denote the interaction-labeled network constructed from the structured interaction process $Y$. Then for any finite permutation $\sigma$ of $\mathbb{N}$, let $Y^\sigma$ denote the relabeled structured interaction process defined by $Y^\sigma_i = Y_{\sigma(i)}$. Then $\mathbf{y}^\sigma$ denotes the corresponding interaction-labeled network constructed from $Y^\sigma$. Note that the choice of representative from the equivalence class does not matter. The above relabeling by the permutation $\sigma$ is not to be confused with the relabeling in the previous section by the bijection $\rho$. The bijection $\rho$ relabels the constituent elements, and is used to construct the equivalence class defining the interaction-labeled network. The permutation $\sigma$ reorders the interaction process, and therefore relabels the interaction-labeled network.

In the remainder of this paper, we write $\mathbf{Y}$ to denote a random interaction-labeled network. We assume the interactions are labeled in the countably infinite set $\mathbb{N}$. Interaction exchangeability is characterized by the property that the labeling of the interactions (not the constituent elements) is arbitrary. We now define exchangeable structured interaction networks.

Definition 3.1 (Exchangeable structured interaction network process).

The structured interaction-labeled network $\mathbf{Y}$ is exchangeable if $\mathbf{Y}^\sigma =_d \mathbf{Y}$ for all finite permutations $\sigma$ of $\mathbb{N}$, where $=_d$ denotes equality in distribution.

Next, we provide a representation theorem for structured interaction processes. We focus on the setting where each interaction is either never observed or observed infinitely often. This is commonly referred to as the "blip-free" setting [34], where blips refer to interactions that are observed once. We first define the simplex of distributions over structured interactions,

$\mathcal{S} = \Big\{ (w_y)_y : w_y \geq 0 \text{ for all } y \text{ and } \textstyle\sum_y w_y = 1 \Big\},$

where $y$ ranges over $\mathrm{fin}(\mathcal{P}_1) \times \mathrm{fin}(\mathcal{P}_2)$. Let $\phi$ be a probability measure on the simplex and define $W \sim \phi$ to be a random variable drawn from this measure. Then, given $W = (w_y)_y$, let the sequence of interactions $Y_1, Y_2, \ldots$ be generated according to

(2) $\Pr(Y_i = y \mid W) = w_y, \quad \text{independently for each } i \in \mathbb{N}.$

Then, given $Y = (Y_i)_{i \geq 1}$, set $\mathbf{Y} = \mathbf{y}(Y)$ as in (1). Theorem 3.2 states that all blip-free structured interaction exchangeable networks can be generated in this manner. The proof can be found in Section LABEL:section:rep of the supplementary materials.
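The two-stage scheme behind (2) is easy to simulate. The following sketch assumes, purely for illustration, a finite menu of candidate interactions and a Dirichlet choice of the measure $\phi$; neither assumption comes from the paper.

```python
import random

# Sketch of the generative scheme (2) under simplifying assumptions:
# a finite menu of candidate interactions and phi = Dirichlet (both
# assumptions are for illustration only).
candidates = [("p1", ("p2",)), ("p2", ("p1", "p3")), ("p3", ("p1",))]

def draw_W(concentration=1.0):
    """Draw a random distribution W over the candidate interactions."""
    gammas = [random.gammavariate(concentration, 1.0) for _ in candidates]
    total = sum(gammas)
    return [g / total for g in gammas]

def draw_interactions(W, n):
    """Given W, the interactions Y_1, ..., Y_n are i.i.d. draws from W."""
    return random.choices(candidates, weights=W, k=n)

W = draw_W()
Y = draw_interactions(W, n=10)  # exchangeable: any reordering has the same law
```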

Theorem 3.2 (Blip-free representation theorem).

Let $\mathbf{Y}$ be a structured interaction exchangeable network that is blip-free with probability 1. Then there exists a probability measure $\phi$ on $\mathcal{S}$ such that $\mathbf{Y} \sim \varepsilon_\phi$, where $\varepsilon_\phi = \int_{\mathcal{S}} \varepsilon_W \, \phi(dW)$ and $\varepsilon_W$ denotes the distribution of the interaction-labeled network induced by i.i.d. sampling from $W$ as in (2).

3.1. Hierarchical vertex components model

Via Theorem 3.2, we can construct a particular family of interaction exchangeable random networks as follows. First, choose a distribution over senders, $f = (f_A)_{A \in \mathrm{fin}(\mathcal{P}_1)}$, in the simplex

$\mathcal{S}_1 = \Big\{ (f_A)_A : f_A \geq 0 \text{ and } \textstyle\sum_A f_A = 1 \Big\}.$

Next, choose a distribution $\nu$ over the sizes of the second component. Finally, for each $A \in \mathrm{fin}(\mathcal{P}_1)$, construct a conditional distribution over the receivers, i.e., the second component of the interaction. That is, for every $A$, we choose $g_A = (g_{A,r})_{r \in \mathcal{P}_2}$ in the simplex

$\mathcal{S}_2 = \Big\{ (g_r)_r : g_r \geq 0 \text{ and } \textstyle\sum_r g_r = 1 \Big\}.$

We combine these distributions to form $W = (f, \nu, (g_A)_A)$, which determines a distribution on the space $\mathrm{fin}(\mathcal{P}_1) \times \mathrm{fin}(\mathcal{P}_2)$ by

(3) $\Pr\big(Y_i = (A, (r_1, \ldots, r_k))\big) = f_A \, \nu(k) \prod_{j=1}^{k} g_{A, r_j},$

where $A \in \mathrm{fin}(\mathcal{P}_1)$, $k \geq 1$, and $r_j \in \mathcal{P}_2$ for each $j = 1, \ldots, k$. This determines an interaction exchangeable network, which we call the hierarchical vertex components model (HVCM). Given $W$, the interactions $Y_1, Y_2, \ldots$ are independent, identically distributed (i.i.d.) random structured interactions drawn from (3). The associated random interaction exchangeable network $\mathbf{Y}$ is obtained through (1), whose distribution we denote by $\varepsilon_W$.

In non-hierarchical VCMs [35], each constituent element has a single frequency of occurrence. By contrast, HVCMs allow the frequency of occurrence for elements in the second component of (3) (i.e., the receivers) to depend on the first component (i.e., the senders) through $g_A$. This dependence is two-fold: (1) the first component controls which $g_A$ is chosen; and (2) the local distributions $g_A$ can vary, leading to the size-biased ordering of receivers varying as a function of the first component.
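The following sketch draws one structured interaction from (3) with hand-picked finite-dimensional $f$, $\nu$, and local distributions $g_s$ (all values illustrative), making the two-fold dependence on the sender explicit.

```python
import random

# Sketch of sampling one structured interaction from (3), with hand-picked
# finite-dimensional f, nu, and per-sender g_s (all values illustrative).
senders = ["s1", "s2"]
receivers = ["r1", "r2", "r3"]

f = {"s1": 0.7, "s2": 0.3}   # sender distribution
nu = {1: 0.5, 2: 0.5}         # distribution over the number of receivers
g = {                         # local (per-sender) receiver distributions
    "s1": {"r1": 0.8, "r2": 0.1, "r3": 0.1},
    "s2": {"r1": 0.1, "r2": 0.1, "r3": 0.8},
}

def sample_interaction():
    s = random.choices(senders, weights=[f[x] for x in senders])[0]
    k = random.choices(list(nu), weights=list(nu.values()))[0]
    # receivers are conditionally i.i.d. from the sender-specific g_s
    rs = random.choices(receivers, weights=[g[s][r] for r in receivers], k=k)
    return (s, tuple(rs))

print(sample_interaction())
```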

Remark 3.3 (Vertex exchangeability versus interaction exchangeability).

While HVCMs are expressed as a function of the vertices, they are interaction exchangeable and not vertex exchangeable. To see this, consider Theorem 3.2. A direct corollary is that vertices are sampled in size-biased order according to their relative frequency of occurrence. In hierarchical models, the size-biased sampling of the second component depends on the first component. Regardless, this implies the observed constituent elements are not exchangeable with the unobserved constituent elements. On the other hand, vertex exchangeability implicitly assumes the observed vertices and unobserved vertices are exchangeable.

4. Sequential description for particular subfamily of HVCMs

Here we provide a sequential description of a particular subfamily of HVCMs. For ease of comprehension, we start with the setting of a single sender, where the size of the first component is one (i.e., $|A| = 1$). In this setting, the sequential description is presented in the context of e-mails. Let $(\alpha, \theta)$ satisfy either (1) $0 \leq \alpha < 1$ and $\theta > -\alpha$, or (2) $\alpha < 0$ and $\theta = -k\alpha$ for some $k \in \mathbb{N}$. In setting (1), the population $\mathcal{P}_1$ is infinite, while in setting (2) the population is finite and equal to $k$. In Section 4.3, we show how to extend this model to the general setting of multiple senders. Throughout, we let $\mathcal{P}_1$ (senders) and $\mathcal{P}_2$ (receivers) denote the two sets of constituent elements.

We introduce some additional notation. For each $i$, the $i$th e-mail $Y_i$ is given by the structured interaction $Y_i = (s_i, (r_{i,1}, \ldots, r_{i,k_i}))$, where $s_i$ is the sender and $r_{i,j}$ is the $j$th receiver of the $i$th e-mail. Suppose $m$ e-mails have been observed and define $\mathcal{H}_m$ to be the observed history of the first $m$ e-mails. For the $(m+1)$st e-mail, choose the sender according to

(4) $\Pr(s_{m+1} = s \mid \mathcal{H}_m) = \begin{cases} \dfrac{d_m(s) - \alpha}{m + \theta}, & s \in \mathcal{V}(\mathcal{H}_m), \\[4pt] \dfrac{\theta + \alpha\, v(\mathcal{H}_m)}{m + \theta}, & s \text{ a new sender}, \end{cases}$

where $d_m(s)$ is the outdegree of sender $s$, $\mathcal{V}(\mathcal{H}_m)$ is the set of unique senders in $\mathcal{H}_m$, and $v(\mathcal{H}_m) = |\mathcal{V}(\mathcal{H}_m)|$ is the set's cardinality.
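A direct implementation of the urn in (4) might look as follows; this is a sketch under the infinite-population setting (1), and the handling of the new-sender label is an illustrative choice.

```python
import random

def sample_sender(history, alpha, theta):
    """One step of the urn in (4): an observed sender s is chosen with weight
    d_m(s) - alpha; a brand-new sender with weight theta + alpha * v, where v
    is the number of unique senders observed so far."""
    outdeg = {}
    for s, _ in history:  # history: list of (sender, receivers) pairs
        outdeg[s] = outdeg.get(s, 0) + 1
    weights = {s: d - alpha for s, d in outdeg.items()}
    new_weight = theta + alpha * len(outdeg)
    u = random.uniform(0, sum(weights.values()) + new_weight)
    for s, w in weights.items():
        u -= w
        if u <= 0:
            return s
    return "sender_%d" % (len(outdeg) + 1)  # a previously unseen sender
```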

Given $s_{m+1}$, we choose the number of recipients $k_{m+1}$ according to the discrete probability distribution function $\nu$. Next, let $d^{(s)}_{m,j}(r)$ denote the indegree of receiver $r$ when restricted to e-mails from sender $s$, after the first $m$ e-mails and the first $j$ recipients of the $(m+1)$st e-mail; that is, the number of times $r$ appears as a receiver of an e-mail from $s$ in that history.

We also define $n^{(s)}_{m,j}$ to be the number of receivers (accounting for multiplicity) of e-mails from sender $s$. Each of these statistics is a measurable function of the observed history. Note that these statistics are local (i.e., specific to the particular sender). Here, we describe a procedure for sharing information across senders. To do this, we define a partially observable global set of information. First, define the observable variable $\mathcal{R}_{m,j}$ to be the complete set of receivers observed so far.

Additionally, let $V_{m,j}$ be the cardinality of this set. For each sender $s$ and receiver $r$, we posit the existence of a latent degree, denoted by $t_{sr}$. We then define $t_{s\cdot} = \sum_r t_{sr}$ and $t_{\cdot\cdot} = \sum_{s,r} t_{sr}$. Next, define $\mathcal{R}^{(s)}_{m,j}$ to be the complete set of receivers when restricting to e-mails from sender $s$, and $\mathcal{H}_{m,j}$ to be the observable history $\mathcal{H}_m$ together with the first $j$ entries of the $(m+1)$st e-mail. That is, $\mathcal{H}_{m,j}$ is the observed history up to the $j$th receiver on the $(m+1)$st e-mail, where $j = 0$ implies only sender information for the $(m+1)$st e-mail. Finally, for each sender $s$, let $(\alpha_s, \theta_s)$ satisfy either (1) $0 \leq \alpha_s < 1$ and $\theta_s > -\alpha_s$, or (2) $\alpha_s < 0$ and $\theta_s = -k_s \alpha_s$ for some $k_s \in \mathbb{N}$. In setting (1), the receiver population $\mathcal{P}_2$ is infinite, while in setting (2) the population is finite and equal to $k_s$. For the remainder of this paper, we assume setting (1).

Given the indegree statistics $d^{(s)}_{m,j}$, the latent degrees $t$, the current sender $s = s_{m+1}$, along with the observable history $\mathcal{H}_{m,j}$, the probability of choosing receiver $r$ locally is proportional to

(5) $d^{(s)}_{m,j}(r) - \alpha_s\, t_{sr},$

and, via an escape to the global set of information, proportional to

(6) $(\theta_s + \alpha_s\, t_{s\cdot}) \times \frac{t_{\cdot r} - \alpha_0}{\theta_0 + t_{\cdot\cdot}} \quad \text{or} \quad (\theta_s + \alpha_s\, t_{s\cdot}) \times \frac{\theta_0 + \alpha_0\, V_{m,j}}{\theta_0 + t_{\cdot\cdot}}$

for a previously observed receiver or a brand-new receiver, respectively, where $(\alpha_0, \theta_0)$ are global parameters. Note the difference between the discount of the indegree in (5) and of the outdegree in (4). For the sender distribution (4), the outdegree discount is $\alpha$; on the other hand, for (5), the indegree discount is $\alpha_s\, t_{sr}$. This reflects that in (4), sender $s$ is chosen from a single distribution; however, in (5), receiver $r$ can be chosen either locally or globally.

The remaining question is how to update the degree distributions. In (5) and (6), we can either observe $r$ "locally," or we escape the local model and observe $r$ through the latent global information. Given the chosen receiver $r$, we update both local and global degrees. If $r$ is a brand-new receiver, then the global degree $t_{\cdot r}$ increases from zero to one. If $r$ was chosen locally, then the local degree $d^{(s)}(r)$ increases by one, and the latent degree $t_{sr}$ is increased by one with some probability. The exact procedure for incrementing is discussed in section 6.
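Putting (5) and (6) together, one step of the receiver draw can be sketched as below. The exact discounts follow the reconstruction above (a hierarchical Pitman-Yor-type predictive), so treat the formulas inside as an assumption rather than a verbatim transcription of the paper.

```python
import random

def sample_receiver(local_deg, latent_deg, global_deg, unique_receivers,
                    alpha_s, theta_s, alpha0, theta0):
    """Sketch of one receiver draw in the spirit of (5)-(6). local_deg[r] is
    the indegree of r from the current sender, latent_deg[r] the latent
    degree t_{sr}, global_deg[r] the global latent degree t_{.r}. Returns
    (receiver or None for a brand-new receiver, escaped flag)."""
    local_w = {r: local_deg[r] - alpha_s * latent_deg[r] for r in local_deg}
    escape_w = theta_s + alpha_s * sum(latent_deg.values())
    total = sum(local_w.values()) + escape_w
    if random.uniform(0, total) < escape_w:
        # escape: draw from the shared global urn
        gw = {r: global_deg[r] - alpha0 for r in global_deg}
        new_w = theta0 + alpha0 * unique_receivers
        u = random.uniform(0, sum(gw.values()) + new_w)
        for r, w in gw.items():
            u -= w
            if u <= 0:
                return r, True
        return None, True    # a brand-new receiver
    u = random.uniform(0, sum(local_w.values()))
    for r, w in local_w.items():
        u -= w
        if u <= 0:
            return r, False  # chosen locally
```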

4.1. Partial pooling

The importance of the latent global degree distribution is that it allows information to be shared across the conditional receiver distributions. The above model formalizes this partial pooling of information. The degree of pooling is controlled by the escape probability, which in general decreases as the number of e-mails from sender $s$ increases. Note that over time, as more e-mails by sender $s$ are seen, the escape probability tends to zero whenever $\alpha_s < 1$. Therefore, the local impact of the latent global degree information becomes negligible once we have sufficient local information. However, the first time a sender-receiver pair is observed, it must occur via the shared global set of information. The global latent degrees $t_{\cdot r}$ therefore contribute to the behavior of new and/or rarely seen senders.

4.2. Connection between sequential description and hierarchical vertex components models

The sequential description in section 4 is equivalent to a particular HVCM. In the single-sender setting, an analytic stick-breaking representation can be derived, which connects the sequential process directly to (3). To do so, we start by constructing the sender distribution. For $j = 1, 2, \ldots$, define independent random variables $V_j \sim \mathrm{Beta}(1 - \alpha, \theta + j\alpha)$. Then, conditional on $(V_j)_{j \geq 1}$, the probability of choosing sender $j$ is given by

$f_j = V_j \prod_{i < j} (1 - V_i),$

where the product is set equal to one for $j = 1$. In our current setting the first component has size one, so the weights over larger sender multisets can be ignored for now. See Section 4.3 for a description of how these can be constructed in a similar manner.

We now construct, for each sender $j$, the receiver probabilities $g_j$ via a hierarchical model. To do this, we first define global independent random variables $V^{(0)}_\ell \sim \mathrm{Beta}(1 - \alpha_0, \theta_0 + \ell\alpha_0)$ for $\ell = 1, 2, \ldots$. Then, conditional on $(V^{(0)}_\ell)_{\ell \geq 1}$, we define associated stick-breaking probabilities $g^{(0)}_\ell = V^{(0)}_\ell \prod_{i < \ell} (1 - V^{(0)}_i)$. These are probabilities of choosing receiver $\ell$ based on the global random variables. The local stick-breaking distributions are then defined via a perturbation of these global probabilities. That is, for each sender $j$, define independent random variables $V^{(j)}_\ell \sim \mathrm{Beta}(1 - \alpha_j, \theta_j + \ell\alpha_j)$ with local weights $w^{(j)}_\ell = V^{(j)}_\ell \prod_{i < \ell} (1 - V^{(j)}_i)$ attached to atoms drawn from the global distribution $g^{(0)}$,

where the product is defined equal to one when $\ell = 1$. This yields a stick-breaking representation for a particular hierarchical vertex components model. Partial pooling occurs via the shared global probabilities $g^{(0)}$. The local distributions satisfy $\mathbb{E}[\, g_j \mid g^{(0)} \,] = g^{(0)}$. Therefore, the distribution $g_j$ can be thought of as a perturbation of the global distribution $g^{(0)}$, where $\theta_j$ controls the amount of perturbation. In particular, $g_j \to g^{(0)}$ with probability one as $\theta_j \to \infty$.
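The stick-breaking construction can be simulated by truncation. The Beta parameters below are those of the standard Pitman-Yor stick-breaking representation, which we take to be the intended construction; the truncation level and parameter values are illustrative.

```python
import random

def py_sticks(alpha, theta, n_atoms):
    """Truncated Pitman-Yor stick-breaking: V_j ~ Beta(1 - alpha,
    theta + j*alpha) and weight_j = V_j * prod_{i<j} (1 - V_i)."""
    weights, remaining = [], 1.0
    for j in range(1, n_atoms + 1):
        v = random.betavariate(1 - alpha, theta + j * alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights  # sums to < 1; the remainder lies in truncated atoms

# Global receiver weights, then one local distribution as a perturbation:
# each local stick is attached to an atom drawn from the global weights,
# so that E[g_s | g0] = g0.
g0 = py_sticks(alpha=0.5, theta=1.0, n_atoms=200)
atoms = random.choices(range(len(g0)), weights=g0, k=200)
g_s = {}
for w, a in zip(py_sticks(alpha=0.3, theta=5.0, n_atoms=200), atoms):
    g_s[a] = g_s.get(a, 0.0) + w
```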

By construction, each $g_j$ is a random variable over the space of receiver distributions. Lemma 4.1 establishes the connection between these random variables and the canonical model. Although this model does not admit a known stick-breaking representation in full generality, Theorem 3.2 discussed in Section 3 ensures these asymptotic frequencies exist. Section LABEL:appendix:technical of the supplementary materials describes a specific probabilistic construction of these frequencies.

Lemma 4.1.

The sequential HVCM for the structured interaction process is equivalent in distribution to (3), where $W$ is distributed according to the stick-breaking construction described above.

The proof can be found in section LABEL:appendix:technical of the supplementary materials; see [36, 37] for further details on the stick-breaking representation.

Remark 4.2 (Connections to CRFP).

Note that the HVCM described above is closely related to the Chinese Restaurant Franchise Process (CRFP), a well-known process in the machine learning literature [36, 38] that is almost exclusively used to model latent clusters in data. Here, we use these ideas in the construction of the interaction process. Thus, the objectives are quite different; for instance, there is almost no focus on the inference of the model parameters in the machine learning community, whereas in our setting these parameters are crucial to understanding the overall interaction process behavior. This model is most similar to [39], where it is used for language modeling. Similar to the CRFP, the above construction is related to the Pitman-Yor process and the GEM distributions [36]. More details can be found in section 6 and the supplementary materials.

4.3. Accounting for multiple elements in first component

In the general setting, the first component, $A$, is a random element of $\mathrm{fin}(\mathcal{P}_1)$ (i.e., a random finite multiset of elements from $\mathcal{P}_1$). In the sequential description, we assumed the size of this multiset was one. We now consider multisets of general size. First, let $\mathcal{H}_{m,j}$ denote the history consisting of the first $m$ e-mails and the first $j$ senders of the $(m+1)$st e-mail. The extension of (4) to handle multiple senders is straightforward: replace $\mathcal{H}_m$ with $\mathcal{H}_{m,j}$ and define all other terms similarly.

In the sequential description, the sender is used to specify which local statistics (i.e., $d^{(s)}$, $t_{s\cdot}$, and $n^{(s)}$) to consider. However, when there are multiple senders, this choice is no longer straightforward. To address this, we introduce a random variable $z$ whose domain is the set of senders of the current e-mail. This variable indicates which local statistics will be used in the receiver distributions (i.e., equations (5) and (6)). Define $\mathcal{U}$ to be the unique elements in the current multiset of senders. Then

(7) $\Pr(z = s \mid \cdot) \propto d(s) - \alpha \quad \text{for } s \in \mathcal{U},$

where (1) $0 \leq \alpha < 1$ if the population $\mathcal{P}_1$ is considered infinite, and (2) $\alpha < 0$ if the population is finite. This is equivalent to restricting (4) to be non-zero only on the domain $\mathcal{U}$. Moreover, it is conditional on the full history rather than $\mathcal{H}_m$ alone. The local statistics associated with the selected element $z$ are then updated as in the single-sender setting.

5. Statistical properties

We now state several theoretical results for the proposed HVCM built from the sequential description in section 4. For ease of comprehension, we refer to this model as the "canonical HVCM".

Theorem 5.1.

The canonical HVCM determines a structured exchangeable interaction probability distribution for all parameters in the parameter space.

Theorem 5.1 is not immediate from the sequential construction in section 4, but is clear from the reparameterization of the model presented in section 6 and its connection to the model previously discussed (this is formalized in section LABEL:appendix:technical of the supplementary materials).

The remainder of this section focuses on the setting where the size of the first component is one (i.e., $|A| = 1$). Moreover, we will make certain additional assumptions concerning the sender distributions. These constraints allow sufficient complexity to be interesting, while assuming sufficient regularity to push through the theoretical analysis. First, we turn to the growth rates of the expected number of unique receivers. Unlike in the Hollywood model, this rate depends on the distribution over senders, the global parameters, and the local parameters. Before stating the theorem, we require a formal definition of sparsity. For clarity, we define quantities in terms of receivers to distinguish vertices observed as senders from those observed as receivers (i.e., in $\mathcal{P}_1$ and $\mathcal{P}_2$, respectively).

For a structured interaction-labeled network $\mathbf{y}$, let $v(\mathbf{y})$ denote the number of non-isolated receivers; $e(\mathbf{y})$ the number of interactions; $e_k(\mathbf{y})$ the number of interactions with $k$ receivers; $v_d(\mathbf{y})$ the number of receivers that appear exactly $d$ times; and $p_d(\mathbf{y}) = v_d(\mathbf{y}) / v(\mathbf{y})$ the indegree distribution. Note that these are global statistics that do not depend on the interaction labels. We define local versions by superscripting each statistic by $s$. For instance, $v^{(s)}(\mathbf{y})$ is the number of non-isolated receivers when restricting $\mathbf{y}$ to only those interactions involving sender $s$. The statistics $e^{(s)}_k(\mathbf{y})$ and $v^{(s)}_d(\mathbf{y})$ are defined similarly.

Definition 5.2 (Global and local sparsity).

Let $(\mathbf{y}_n)_{n \geq 1}$ be a sequence of interaction-labeled networks for which $e(\mathbf{y}_n) \to \infty$ as $n \to \infty$. The sequence is sparse if

$\limsup_{n \to \infty} \frac{\bar{a}(\mathbf{y}_n)\, e(\mathbf{y}_n)}{v(\mathbf{y}_n)^2} = 0,$

where $\bar{a}(\mathbf{y}_n)$ is the average arity (i.e., number of receivers) of the interactions in $\mathbf{y}_n$. A non-sparse network is dense. We say the sequence is $s$-locally sparse if

$\limsup_{n \to \infty} \frac{\bar{a}^{(s)}(\mathbf{y}_n)\, e^{(s)}(\mathbf{y}_n)}{v^{(s)}(\mathbf{y}_n)^2} = 0,$

where $\bar{a}^{(s)}(\mathbf{y}_n)$ is the average arity (i.e., number of receivers) of the interactions in $\mathbf{y}_n$ from sender $s$. A network that is not $s$-locally sparse is $s$-locally dense.
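The statistics entering Definition 5.2 are straightforward to compute from a list of interactions; a minimal sketch (function and variable names are ours):

```python
from collections import Counter

def receiver_stats(interactions):
    """Global statistics of Definition 5.2 from a list of (sender, receivers)
    pairs: v = #non-isolated receivers, e = #interactions, the average arity,
    and the indegree distribution p_d."""
    indeg = Counter()
    arity_total = 0
    for _, rs in interactions:
        arity_total += len(rs)
        indeg.update(rs)
    v, e = len(indeg), len(interactions)
    degree_counts = Counter(indeg.values())  # d -> #receivers of degree d
    p = {d: c / v for d, c in degree_counts.items()}
    return v, e, arity_total / e, p
```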

For $(X_n)_{n \geq 1}$ a sequence of positive random variables and $(a_n)_{n \geq 1}$ a sequence of positive non-random variables, let $X_n \asymp a_n$ indicate that $\lim_{n \to \infty} X_n / a_n$ exists almost surely and equals a finite and positive random variable. Theorem 5.3 shows the canonical model may be either globally sparse or dense. The theorem assumes a finite population of senders, with the number of e-mails per sender drawn from a multinomial distribution.

Theorem 5.3.

Suppose the sender population $\mathcal{P}_1$ is finite, consisting of $K$ senders, i.e., $|\mathcal{P}_1| = K$. Assume that, out of the $n$ e-mails, the number of e-mails per sender, denoted $(n_1, \ldots, n_K)$, is drawn from a multinomial distribution with probabilities $(q_1, \ldots, q_K)$ such that $q_s > 0$ for all $s$ and $\sum_s q_s = 1$. Let $\bar{a}_s$ be the average size of e-mails for sender $s$ and $\bar{a}$ the average size of e-mails across all senders. Then $v(\mathbf{y}_n) \asymp n^{a}$, where the exponent $a$ is determined by the local discount parameters $\alpha_s$ and the global discount parameter $\alpha_0$. In particular, if $a > 1/2$, then $\mathbf{y}_n$ is almost surely sparse.

Theorem 5.3 establishes that the canonical HVCM, for a special case of the sender distribution, can capture varying degrees of sparsity. If the sequence is dense, then it must be $s$-locally dense for all senders $s$; however, a sparse network can be $s$-locally dense for some, but not all, $s$. We turn now to considerations of power-law degree distributions for interaction-labeled networks. We start with a definition.

Definition 5.4 (Global power-law degree distributions).

A sequence $(\mathbf{y}_n)_{n \geq 1}$ exhibits a power-law degree distribution [7, 23, 24] if for some $\gamma > 1$ the degree distributions satisfy $p_d(\mathbf{y}_n) \sim \ell(d)\, d^{-\gamma}$ as $n \to \infty$ for all large $d$, for some slowly varying function $\ell$; that is, $\ell(cx)/\ell(x) \to 1$ as $x \to \infty$ for all $c > 0$. More precisely, $(\mathbf{y}_n)_{n \geq 1}$ has a power-law degree distribution with index $\gamma$ if

(8) $\lim_{n \to \infty} p_d(\mathbf{y}_n) = p_d \quad \text{a.s., with} \quad p_d \sim \ell(d)\, d^{-\gamma} \text{ as } d \to \infty.$

Theorem 5.5 establishes the power-law degree distribution for the canonical HVCM in the case where the size of the first component is one.

Theorem 5.5.

Let $\mathbf{y}_n$ obey the sequential description in section 4 with global parameters $(\alpha_0, \theta_0)$, and let $\alpha_s = \alpha$ for all senders $s$. For each $n$, let $p_d(\mathbf{y}_n) = v_d(\mathbf{y}_n)/v(\mathbf{y}_n)$ for $d \geq 1$ be the empirical receiver degree distribution, where $v_d(\mathbf{y}_n)$ is the number of receivers of degree $d$ and $v(\mathbf{y}_n)$ is the number of unique receivers in $\mathbf{y}_n$. Then, for every $d \geq 1$,

(9) $p_d(\mathbf{y}_n) \to \frac{a\, \Gamma(d - a)}{\Gamma(1 - a)\, \Gamma(d + 1)} \quad \text{almost surely as } n \to \infty,$

where $\Gamma(\cdot)$ is the gamma function and $a \in (0, 1)$ is determined by the discount parameters. That is, $\mathbf{y}_n$ has a power-law degree distribution with exponent $1 + a$.
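A quick empirical check of (9) is to compare the simulated degree distribution against a straight line on a log-log scale. The least-squares slope below is a heuristic diagnostic, not the estimator used in the paper.

```python
import math

def loglog_slope(p):
    """Least-squares slope of log p_d versus log d, where p maps degree d
    to the empirical frequency p_d. For an exact power law with exponent
    gamma the slope is approximately -gamma."""
    pts = [(math.log(d), math.log(pd)) for d, pd in p.items() if pd > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    num = sum((x - mx) * (y - my) for x, y in pts)
    den = sum((x - mx) ** 2 for x, _ in pts)
    return num / den
```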

6. Posterior inference

We now consider performing posterior inference for the canonical HVCM given an observed interaction network $\mathbf{y}$. As in Section 4, we start with the setting where the size of the first component is one (i.e., $|A| = 1$). The size-distribution parameters $\nu(k)$ are estimated non-parametrically and are not important for the remainder of the paper; therefore, the details are omitted for these parameters.

We start by reparameterizing the HVCM in a form more useful for inference, one that gives an explicit structure for updating the latent degrees; we call this the extended canonical HVCM, or extended model for short. In this representation, every "escape" from the local distribution and choice of receiver $r$ leads to an auxiliary vertex being introduced locally for the sender; auxiliary vertices are not shared between senders. The label of the auxiliary vertex is $r$; the auxiliary vertex accounts for the fact that the global distribution can select receiver $r$ multiple times. Finally, the observed receiver is assigned to the auxiliary vertex, and we record that assignment. The number of auxiliary vertices with label $r$ and sender $s$ is equal to the number of times the local distribution for sender $s$ escapes and chooses from the global set of information (i.e., $t_{sr}$). The sum of the degrees across auxiliary vertices with label $r$ and sender $s$ is equal to the indegree for receiver $r$ (i.e., $d^{(s)}(r)$). Finally, we write $c_{sa}$ to denote the degree of auxiliary vertex $a$ for sender $s$ that has label $r$. Note that $c_{sa} \geq 1$ for each auxiliary vertex $a$.
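The bookkeeping for the extended model can be organized as follows; the container layout and function names are ours, but the invariants in the comments mirror the definitions above.

```python
from collections import defaultdict

# Bookkeeping for the extended model (container layout is ours). For each
# sender s we keep a list of auxiliary vertices, each a [label, degree] pair.
aux = defaultdict(list)  # sender -> list of [label, degree]

def indegree(s, r):
    """d^(s)(r): total degree of auxiliary vertices with label r."""
    return sum(deg for lab, deg in aux[s] if lab == r)

def latent_degree(s, r):
    """t_{sr}: the number of auxiliary vertices with label r."""
    return sum(1 for lab, _ in aux[s] if lab == r)
```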

Given sender $s$ and receiver $r$, the probability that $r$ is assigned to an existing auxiliary vertex $a$ with label $r$ is proportional to

$c_{sa} - \alpha_s.$

Further, if a new auxiliary vertex is added instead (with probability proportional to $\theta_s + \alpha_s t_{s\cdot}$), then its label is chosen with probability proportional to $t_{\cdot r} - \alpha_0$ for a previously observed label $r$, and $\theta_0 + \alpha_0 V$ for a new label.

The likelihood of observing $\mathbf{y}$ given the parameters is given by

(10)
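Given the extended representation, a collapsed Gibbs sampler re-draws each observation's auxiliary-vertex assignment from its conditional. The sketch below follows the standard pattern for Chinese-restaurant-franchise-type models, which we assume matches the algorithm developed in this section; the function signature is ours.

```python
import random

def resample_assignment(deg_same_label, n_aux_total, alpha_s, theta_s, p_global):
    """Collapsed Gibbs step for one observation, with the observation already
    removed from the counts: it joins an existing auxiliary vertex carrying
    its label (weight c_sa - alpha_s) or opens a new one (weight
    (theta_s + alpha_s * n_aux_total) * p_global, where p_global is the
    global probability of the label). Returning len(deg_same_label) means
    'open a new auxiliary vertex'."""
    weights = [c - alpha_s for c in deg_same_label]
    weights.append((theta_s + alpha_s * n_aux_total) * p_global)
    u = random.uniform(0, sum(weights))
    for i, w in enumerate(weights):
        u -= w
        if u <= 0:
            return i
    return len(weights) - 1
```

A full sweep applies this update to every observation in random order, decrementing the relevant degree before the draw and incrementing it (or creating a vertex) afterwards.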