Modern statistical network analysis focuses on the study of large, complex networks that can emerge in diverse fields, including social, biological, and physical systems [1, 2, 3, 4, 5]. The expanding scope of network analysis has led to an increase in the need for statistical models and inferential tools that can handle the increasing complexity of network data structures. In this paper, we focus on network data arising from sequences of interactions. Network data arising in this manner would benefit from a framework built upon the interaction as the statistical unit  rather than upon the constituent elements within each interaction as the statistical units. Edge-exchangeable models [7, 8] are built specifically to analyze datasets containing these complex interactions.
While Crane and Dempsey (2017)  provide a framework for statistical analysis of interaction data, the proposed Hollywood model only captures basic global features. Specifically, the Hollywood model’s asymptotic behavior reflects the empirical properties of sparsity and power law degree distributions observed in real-world network data, which are not as well reflected in classic statistical network models such as the ERGMs , graphon models , and stochastic blockmodels (SBMs) . While edge exchangeability is attractive as a theoretical framework, the set of current edge exchangeable models is inadequate to handle the structural complexity of modern network data.
The edge exchangeable model proposed in this paper is motivated by an important fact: most common complex networks constructed from interaction data are structured. A phone-call interaction, for example, takes the form of a sender and receiver pair. E-mail correspondence generalizes this type of interaction to one sender but potentially multiple receivers with different attributes like “To,” “Cc,” and “Bcc”. This paper makes a substantial push forward by constructing hierarchical models that reflect this common structure of interaction data, hereafter referred to as structured interaction data. The model overlays local behavior (e.g., per sender) with global information by partial pooling through a shared global, latent distribution. Simulation and theoretical analysis confirm that the proposed hierarchical model can achieve simultaneously varying local power-law degree per sender and global power-law degree distribution.
1.1. Relevant prior work on interaction data
Interaction data often arises in settings where communications amongst a set of constituent elements over a specific time period are recorded [12, 13]. Examples are numerous and include: authorship and co-sponsoring of legislation [14, 15], sending and receiving e-mails [12, 16], posting and responding on a community forum , and traceroute . In each case, the interaction (edge) is the statistical unit to be modeled, as contrasted with the subjects (nodes) of the interactions considered in other work . See [8, 7] for further discussion of the advantages of defining interactions as the statistical units.
The literature contains several papers focused on statistical modeling of interaction data. Perry and Wolfe (2013)  construct a Cox proportional intensity model . Butts (2008)  considered likelihood-based inference using a variant of the proportional intensity model to capture interaction behavior in social settings. Crane and Dempsey (2017)  consider non-hierarchical models for interaction data. They introduce the notion of edge exchangeable network models and explore its basic statistical properties. In particular, they show that edge exchangeable models allow for sparse structure and power law degree distributions, widely observed empirical behaviors that cannot be handled by conventional approaches.
An alternative approach emerges out of the recent work of Caron and Fox (2017) , who construct random graphs from point processes on . The random graph is characterized by an object called a graphex . Random graph models generated by this procedure can incorporate sparse, power law behavior into a well-defined population model. Finite random graphs can be obtained via a thresholding operation, termed p-sampling . Such random graphs are vertex exchangeable in that they are built from exchangeable point processes. In this setting, exchangeability is a consequence of projectivity rather than the simple structured interaction data sampling scheme proposed in this paper. See the contributed discussion to the paper by Caron and Fox (2017) , in particular contributions by Bharath  and Crane , for further discussion.
1.2. Outline and main contributions
The main contributions of this paper are as follows:
We establish basic statistical properties in section 5. In particular, we provide theoretical guarantees of sparsity and power law for the chosen HVCM – two important empirical properties of network data.
We demonstrate this HVCM on both the Enron e-mail dataset and ArXiv dataset in section 7. In particular, we show how the HVCM can be used to perform goodness of fit checks for models of network data via posterior predictive checks, an often under-emphasized aspect of statistical network modeling.
Overall, this paper presents a statistically rigorous, principled hierarchical modeling framework for handling complex structured interaction data.
2. Structured interaction data
We start by defining structured interaction data, illustrating with a sequence of concrete examples of increasing complexity.
Definition 2.1 (Structured interaction data).
Let denote a set of constituent elements. Then for a set , we write to denote the set of all finite multisets of . A structured interaction process for an ordered sequence of sets is a correspondence between a set indexing interactions and the ordered sequence of finite multisets of .
Remark 2.2 (Difference from interaction data).
In , an interaction process is defined as a correspondence where is a single population. Structured interaction data, instead, consists of a series of finite multisets, and does not require each set of constituent elements to be equivalent. That is, each population may contain different types of elements. This flexibility will allow us to introduce hierarchical structure into the exchangeable model.
Finally, let denote the multisets of size , so that is the disjoint union .
Example 2.3 (Phone-calls).
Assume are all equivalent and let
be a countably infinite population. A phone-call can be represented as an ordered pair of “sender” and “receiver” drawn from. Therefore, a phone-call interaction process is a correspondence . For instance, is a phone-call from sender to receiver , both in population . This is distinct from where sender and receiver roles are reversed.
Example 2.4 (E-mails).
Assume are all equivalent and let be a countably infinite population. An e-mail can be represented as the ordered sequence of sets: sender, receivers. Then an e-mail interaction process is a correspondence . For instance, is an e-mail from sender to receivers and . This is distinct from and . Figure 1 is a visualization of a similar structured interaction dataset formed from Facebook posts (i.e., poster followed by finite multiset of responders).
Example 2.5 (Scientific articles).
Consider summarizing a scientific article by its (1) list of subject areas and (2) list of authors. Then the scientific article process is a correspondence . For instance, is an article with subject areas and and authors and . Here, and are distinct populations.
Example 2.6 (Movies).
Consider summarizing a movie by its (1) genre, (2) list of producers, (3) director, and (4) list of actors. Of course, there is overlap in certain populations, as producers can be directors, directors can be actors, but none are a genre (unless Scorsese, Spielberg, or Tarantino are considered genres unto themselves). Then the movie process is a correspondence . For instance, is a movie with genre , producers and , director , and actors , , and . Note, in this example, the director is also one of the actors.
The above shows Definition 2.1 covers a wide variety of examples from network science. Next, we construct interaction-labeled networks and define exchangeable structured interaction processes.
Remark 2.7 (Covariates).
In this paper, we focus on the study of structured interaction processes in Definition 2.1 with no additional information, such as covariates. Incorporating such covariate information is quite difficult; see [28, 29, 30, 31, 32, 33] for examples of incorpating covariates into network analysis. Covariate information can come in two forms: (1) covariate information on the interaction; and (2) covariate information on constituent elements. Examples of (1) include subject line or body text in an e-mail, or genre and gross movie sales for a movie. Examples of (2) include gender, age, job title, or university affiliation of authors of a scientific article. Certain interaction covariates can be incorporated into the models considered in this paper. For example, in the ArXiV dataset, the article’s subject can be viewed as covariate information on the interaction. We show how this can be incorporated as part of the structured interaction data structure, and therefore accounted for in the statistical models.
2.1. Interaction-labeled networks
For the remainder of this paper, we focus on structured interaction processes of the form . This type of a structured interaction process captures the phone-call, e-mail, and scientific article examples. The arguments presented naturally extend to more general structured interaction processes as given in Definition 2.1. When two populations of constituent elements are equivalent, we write . The interaction-labeled network is an equivalence clase constructed from the structured interaction process by quotienting out the labeling of the constituent elements. Let be a bijection for . We write to be the composite bijection obtained by applying componentwise. If , then ; that is, bijections among equivalent populations, e.g., the senders and receivers in an email network, denoted by and , respectively, must agree. Then induces an action on the product space by the composite map
Therefore, the bijection acts on the structured interaction process via composition . The structured interaction-labeled network is then the equivalence class constructed from the structured interaction network by quotienting out over bijections :
where is the cardinality of the population. Note we have only quotiented out labels for constituent elements, so the object still has uniquely labeled interactions. For simplicity, we write and leave the subscript implicit.
In the remainder of the paper, we assume the index set is countably infinite and replace it by . For any , we define the restriction of to the subset by . This restricted interaction process induces a restriction to of the interaction-labeled network. We write to denote the interaction-labeled network associated with the restricted process . For , we simply write to denote the restricted structured interaction process and to denote the corresponding structured interaction network.
3. Structured interaction exchangeable models
Let denote the interaction-labeled network constructed from the structured interaction process . Then for any finite permutation , let denote the relabeled structured interaction process defined by . Then denotes the corresponding interaction labeled network constructed from . Note that the choice of representative from the equivalence class does not matter. The above relabeling by permutation is not to be confused with the relabeling in the previous section by the bijection . The bijection relabels the constituent elements, and is used to construct the equivalence class defining the interaction-labeled network (i.e., the equivalence class). The permutation reorders the interaction process, and therefore relabels the interaction-labeled network.
In the remainder of this paper, we write to denote a random interaction-labeled network. We assume the interactions are labeled in the countably infinite set . Interaction exchangeability is characterized by the property that the labeling of the interactions (not the constituent elements) is arbitrary. We now define exchangeable structured interaction networks.
Definition 3.1 (Exchangeable structured interaction network process).
The structured interaction-labeled network is exchangeable if for all permutations , where denotes equality in distribution.
Next, we provide a representation theorem for structured interaction processes. We focus on the setting where each interaction is either never observed or observed infinitely often. This is commonly referred to as the “blip-free” setting , where blips refer to interactions that are observed once. We first define the -simplex
where for and . Let
be a probability measure on the simplex and define
to be a random variable drawn from this measure. Then, given, let the sequence of interactions be generated according to
Then, given , set . Theorem 3.2 states that all blip-free structured interaction exchangeable networks can be generated in this manner. The proof can be found in Section LABEL:section:rep of the supplementary materials.
Theorem 3.2 (Blip-free representation theorem).
Let be a structured interaction exchangeable network that is blip-free with probability 1. Then there exists a probability measure on such that , where
3.1. Hierarchical vertex components model
Via Theorem 3.2, we can construct a particular family of interaction exchangeable random networks as follows. First, choose a distribution of senders, , in the simplex
Next, choose a second element of , which we denote . Finally, for each , construct a conditional distribution over the receivers, i.e., the second component . That is, for every , we choose in the simplex
We combine these distributions to form , which determines a distribution on the space by
where , , , and for each . This determines an interaction exchangeable network, which we call the hierarchical vertex components model (HVCM). Given , are independent, identically distributed (i.i.d.) random structured interactions drawn from (3). The associated random interaction exchangeable network is obtained through (1), whose distribution we denote by .
In non-HVCMs , each constituent element had a single frequency of occurrence. By contrast, HVCMs allow the frequency of occurrence for elements in the second term of (3) (i.e., ) to depend on first component (i.e., ) through . This dependence is two-fold: (1) controls which is chosen across ; and (2) the local distributions can vary, leading to the size-biased ordering of varying as a function of .
Remark 3.3 (Vertex exchangeability versus interaction exchangeability).
While HVCMs are expressed as a function of the vertices, they are interaction exchangeable and not vertex exchangeable. To see this, consider Theorem 3.2. A direct corollary is that vertices are sampled in size-biased order according to their relative frequency of occurrence. In hierarchical models, the size-biased sampling of the second component depends on the first component. Regardless, this implies the observed constituent elements are not exchangeable with the unobserved constituent elements. On the other hand, vertex exchangeability implicitly assumes the observed vertices and unobserved vertices are exchangeable.
4. Sequential description for particular subfamily of HVCMs
Here we provide a sequential description of a particular subfamily of HVCMs. For ease of comprehension, we start with the setting of a single sender where the size of the first component is one (i.e., ). In this setting, the sequential description is presented in the context of e-mails. Let satisfy either (1) and , or (2) and for some . In setting (1), the population is infinite, while in setting (2) the population is finite and equal to . In Section 4.3, we show how to extend this model to the general setting of multiple senders. For ease of comprehension, we let (senders) and (receivers) denote the two sets of constituent elements.
We introduce some additional notation. For each , the th email is given by the structured interaction where is the sender, and is the th receiver of the th article. Suppose articles have been observed and define to be the observed history of the first e-mails. For the st e-mail, choose the sender according to
where is the outdegree of the subject , and are the set of unique senders in and is the set’s cardinality.
, we choose the number of recipients according to the discrete probability distribution function. Finally, let denote the indegree of receiver when restricted to e-mails from sender after the first e-mails and the recipients of the th e-mail; that is,
Finally, we define to be the number of receivers (accounting for multiplicity) of e-mails from sender . Each of these statistics is a measurable function of . Note, these statistics are local (i.e., specific to the particular subject). Here, we describe a procedure for sharing information across senders. To do this, we define a partially observable global set of information. First, define the observable variable to be the complete set of receivers; that is,
Additionally, let be the cardinality of this set. For each we posit existence of a latent degree per sender and receiver denoted by . We then define and . Next, define to be the complete set of receivers when restricting to e-mails from sender , and to be the observable history union . That is, is the observed history up to the th receiver on the th e-mail, where implies only sender information for the th e-mail. Finally, for each , let satisfy either (1) and , or (2) and for some . In setting (1), the receiver population is infinite, while in setting (2) the population is finite and equal to . For the remainder of this paper, we assume setting (1).
Given the indegree distribution , the latent degree distribution , the current sender , along with the observable history , the probability of choosing receiver is proportional to
Note the difference in the discount of indegree in (5) and outdegree in (4). For the sender distribution (4), the outdegree discount is ; on the other hand, for (5), the indegree discount is . This reflects that in (4), sender is chosen from a single distribution; however, in (5), receiver can be chosen either locally or globally.
The remaining question is how to update the degree distributions. In (5) and (6), we can either observe “locally”, or we escape the local model and observe due to the latent global information. Given we update both local and global degrees. If then the global degree increases from zero to one. If then the local degree increases by one and the latent degree is increased by one with probability . The exact procedure for incrementing is discussed in section 6.
4.1. Partial pooling
The importance of the latent global degree distribution is that it allows information to be shared across the conditional receiver distributions. The above model formalizes the partial pooling of information. The degree of pooling is controlled by the escape probability , which in general decreases as the number of e-mails from sender increases. Note that over time as more e-mails by sender are seen, the escape probability tends to zero whenever . Therefore, the local impact of the latent global degree information becomes negligible once we have sufficient local information. However, the first time a sender-receiver pair is observed, it must occur via the shared global set of information. The global latent degrees therefore contribute to the behavior of new and/or rarely seen senders.
4.2. Connection between sequential description and hierarchical vertex components models
The sequential description in section 6 is equivalent to a particular HVCM. When , an analytic stick-breaking representation can be derived. This connects the sequential process directly to (3). To do so, we start by constructing the sender distribution. Here, we assume . For , define independent random variables . Then, conditional on , the probability of choosing sender is given by
where the product is set equal to one for , and . In our current setting, so the weights can be ignored for now. See Section 4.3 for a description of how these can be constructed in a similar manner.
We now construct, for each the probabilities via a hierarchical model given and , and we set . To do this, we first define global independent random variables for . Then, conditional on , for , we define associated stick-breaking probabilities . These are probabilities of choosing receiver based on the global random variables . The local stick-breaking distributions are then defined via a perturbation of these global probabilities. That is, for , define independent random variable
where the product is defined equal to one when . This yields a stick breaking representation for for a particular hierarchical vertex components model. Partial pooling occurs via the shared global probabilities . The local distributions satisfy . Therefore, the distribution can be thought of as perturbation of the global distribution where controls the amount of perturbation. In particular, with probability one as .
By construction, is a random variable over the space . Lemma 4.1 establishes the connection between these random variables and the canonical model for . Although this model for does not admit a known stick breaking representation, Theorem 3.2 discussed in Section 3 ensures these asymptotic frequencies exist. Section LABEL:appendix:technical of the supplementary materials describes a specific probabilistic construction of these frequencies.
The sequential HVCM model for for is equivalent in distribution to (3), where is distributed according to the stick-breaking construction described above.
Remark 4.2 (Connections to CRFP).
Note that the HVCM described above is closely related to the Chinese Restaurant Franchise Process, a well-known process in the machine learning literature[36, 38] that is almost exclusively used to model latent clusters in data. Here, we use these ideas in the construction of the interaction process. Thus, the objectives are quite different; for instance, there is almost no focus on the inference of the model parameters in the ML community; in our setting, these parameters are crucial to understanding the overall interaction process behavior. This model is most similar to , where it is used for language modeling. Similar to the CFRP, the above construction is related to the Pitman-Yor process and the GEM distributions . More details can be found in Section6 and the supplementary materials.
4.3. Accounting for multiple elements in first component
In the general setting, the first component, , is a random element of (i.e., a random finite multiset of elements from ). In the sequential description, we assumed the size of this multiset was one. We now consider for general . First, let denote the history of the first e-mails and the first senders of the st e-mail. Extension of 4 to handle multiple senders is straightforward by replacing by and defining all other terms similarly.
In the sequential description, the sender is used to specify which local statistics (i.e., , and ) to consider. However, when there are multiple senders, this choice is no longer straightforward. To address this, we introduce a random variable with domain . This variable indicates which local statistics will be used in receiver distributions (i.e., equations (5) and (6)). Define to be the unique elements in . Then
where (1) and if the population is considered infinite, and (2) and if population is finite and . This is equivalent to restricting (4) to be non-zero only on the domain . Moreover, it is conditional on the history instead of . If for , then increase by one. If , then set .
5. Statistical properties
We now state several theoretical results for the proposed HVCM built from the sequential description in section 4. For ease of comprehension, we refer to this model as the “canonical HVCM model”.
The canonical HVCM with parameters determines a structured exchangeable interaction probability distribution for all in the parameter space.
Theorem 5.1 is not immediate from the sequential construction in section 4, but is clear from the reparameterization of the model presented in section 6, and its connection to the model previously discussed (this is formalized in section LABEL:appendix:technical) of the supplementary materials.
The remainder of this section focuses on the setting where the size of the first component is one (i.e., ). Moreover, we will make certain alternative assumptions concerning the sender distributions. These constraints allow sufficient complexity to be interesting, but assume sufficient regularity to push through the theoretical analysis. First, we turn to the growth rates in the expected number of unique receivers. Unlike the Hollywood model, this rate depends on both the distribution over senders, the global parameter , and the local parameters . Before stating the theorem, we require a formal definition of sparsity. For clarity, we define quantities in terms of receivers to distinguish vertices observed as senders and those observed as receivers (i.e., in and respectively).
For a structured interaction-labeled network , let denote the number of non-isolated receivers; is the number of interactions; is the number of interactions with receivers; is the number of receivers that appear exactly times; and is the indegree distribution, where . Note that these are global statistics that do not depend on the interaction labels. We define local versions by superscripting each statistic by . For instance, is the number of non-isolated receivers when restricting to only those interactions involving sender . The statistics and are defined similarly.
Definition 5.2 (Global and local sparsity).
Let be a sequence of interaction-labeled networks for which as . The sequence is sparse if
where is the average arity (i.e., number of receivers) of the interactions in . A non-sparse network is dense. We say the sequence is is -locally sparse if
where is the average arity (i.e., number of receivers) of the interactions in from sender . A network that is not -locally sparse is -locally dense.
For a sequence of positive random variables and a sequence of positive non-random variables, let indicate exists almost surely and equals a finite and positive random variable. Theorem 5.3 shows the canonical model may be either globally sparse and/or dense. The theorem assumes a finite population of senders with number of e-mails per sender drawn from a multinomial distribution.
Suppose the sender population is finite, consisting of senders, i.e., . Assume, out of the e-mails, the number of e-mails per sender , denoted , is drawn from a multinomial distribution with probabilities such that and for all . Let be the average size of emails for sender and the average size of emails across all senders. Then where , , and . In particular, if , then is almost surely sparse.
Theorem 5.3 establishes that the canonical HVCM for a special case of the sender distribution can capture degrees of sparsity. If for all and then it must be the case that for all . Therefore, a dense network must be -locally dense for all . However, a sparse network can be -locally dense for some, but not all, . We turn now to considerations of power-law degree distribution for interaction-labeled networks. We start with a definition.
Definition 5.4 (Global power-law degree distributions).
Theorem 5.5 establishes the power-law degree distribution for the canonical HVCM for the case of .
Let obey the sequential description in section 4 with parameters and let for all . For each , let for be the empirical receiver degree distribution where is the number of receivers of degree and is the number of unique receivers in . Then, for every ,
where is the gamma function. That is, has a power law degree distribution with exponent .
6. Posterior inference
We now consider performing posterior inference for the canonical HVCM given an observed interaction network . As in Section 4, we start with the setting where the size of the first component is one (i.e., ). The parameters for all
are estimated non-parametrically, and are not important for the remainder of the paper; therefore, the details are omitted for these parameters.
We start by reparameterizing the HVCM in a more useful form for inference, and which gives an explicit structure for updating the latent degree - we call this the extended canonical HVCM, or extended model for short. In this representation, every “escape” from the local distribution and choice of receiver leads to an auxiliary vertex being introduced locally for a sender - auxiliary vertices are not shared between senders. The label of the auxiliary vertex is ; the auxiliary vertex accounts for the fact that the global distribution can select receiver multiple times. Finally, the observed reciever is assigned to the auxiliary vertex, and we write that assignment The number of auxiliary vertices with label and sender is equal to the number of times the local distribution for sender escapes and choose the global set of information (i.e., ). The sum of the degrees across auxiliary vertices with label and sender is equal to the indegree for receiver (i.e., ). Finally, we write to denote the degree of auxiliary vertex in sender that also has label . Note that for .
Given and , the probability that is assigned to auxiliary vertex is:
Further, if , then we add an auxiliary vertex with its label chosen with probability:
The likelihood of observing given the parameters is given by