The Hyperedge Event Model

07/22/2018 ∙ by Bomin Kim, et al. ∙ University of Massachusetts Amherst Penn State University 0

We introduce the hyperedge event model (HEM)---a generative model for events that can be represented as directed edges with one sender and one or more receivers or one receiver and one or more senders. We integrate a dynamic version of the exponential random graph model (ERGM) of edge structure with a survival model for event timing to jointly understand who interacts with whom, and when. The HEM offers three innovations with respect to the literature---first, it extends a growing class of dynamic network models to model hyperedges. The current state-of-the-art approach to dealing with hyperedges is to inappropriately break them into separate edges/events. Second, our model involves a novel receiver selection distribution that is based on established edge formation models, but assures non-empty receiver lists. Third, the HEM integrates separate, but interacting, equations governing edge formation and event timing. We use the HEM to analyze emails sent among department managers in Montgomery County government in North Carolina. Our application demonstrates that the model is effective at predicting and explaining time-stamped network data involving edges with multiple receivers. We present an out-of-sample prediction experiment to illustrate how researchers can select between different specifications of the model.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 26

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Processes that arise as time-stamped directed interactions are common in the social, natural, and phyiscal sciences. The data produced by such processes can be represented as dynamic directed networks—an object that has given rise to the development of several statistical model families. For example, stochastic actor-oriented models (SAOMs) (Snijders, 1996; Snijders et al., 2007) characterize that network evolutions occur as the senders decide to create or remove an edge from the existing network, one edge at a time. Event-based network models (Butts, 2008; Vu et al., 2011; Hunter et al., 2011; Perry and Wolfe, 2013) provide a general framework for modeling the realization of edges that occur as instances in continuous time streams of events. This family of models is flexible enough to account for the many ways in which past network structures beget future ones—e.g., if node directed a tie to node recently, then node will direct one to node in the near future—and useful for understanding the traits and behaviours that are predictive of interactions.

A major limitation of existing dynamic network models is that they apply to edges with one sender and one receiver. Dynamic network data often naturally arise as “hyperedges” (Karypis et al., 1999; Ghoshal et al., 2009; Zlatić et al., 2009; Zhang and Liu, 2010) that include one sender and multiple receivers or one receiver and multiple senders. For example, in email networks (Newman et al., 2002)

, each email encodes a hyperedge from one sender to one or more receviers. Networks formed between neurons via axons and dendrites involve hyperedges with one sender and multiple receivers (axons) or one receiver and multiple senders (dendrites)

(Partzsch and Schüffny, 2012). Networks formed through the cosponsorhip of legislative bills (Fowler, 2006) involve hyperedges with multiple senders (cosponsors) and one receiver (sponsor). Economic sanctions between countries (Cranmer et al., 2014) induce networks with hyperedges between multiple sending countries and one target country. Existing models require researchers to alter hyperedge data to fit with the pairwise edge structure of the model. For instance, Perry and Wolfe (2013) treat multicast interactions—one type of directed hyperedge which involves one sender and one or more receivers—via duplication (i.e., obtain pairwise interactions from the original multicast), to construct an approximate likelihood function in their inferential framework for model parameters. Similarly, Fan and Shelton (2009) duplicate emails sent from one sender to one or more receivers and randomly jitter the sent times in order to avoid violating the assumption that two events cannot occur at the exact same time.

We develop a statistical dynamic network model, which we term the hyperedge event model (HEM), that integrates the two components that govern hyperedge event formation: (1) the formation of the vertices that are incident to the hyperedge, and (2) the timing of the hyperedge event. In what follows, we define the HEM’s generative process for hyperedge event data (Section 2

), derive the conditional posteriors for Bayesian inference, and present tests of our software implementation (Section

3). We then demonstrate our model’s applicability in a case study (Section 4

) where we analyze a corpus of internal county government emails and illustrate how to perform model selection, posterior predictive checks, and exploratory analysis using our model. We conclude in Section

5.

2 The hyperedge event model

The hyperedge event model (HEM) specifies a generative process for unique hyperedge events that occur between nodes. A single hyperedge event is indexed by —where denotes a categorical set with levels —and consists of three components: the sender

, an indicator vector of receivers

—where if is a receiver of hyperedge event and 0 otherwise—and the timestamp . For simplicity, we assume that events are ordered by time such that . While the model can be applied to two types of hyperedge events—events involving (1) one sender and one or more receivers, and (2) one or more senders and one receiver—here we only present the generative process for those involving one sender and one or more receivers (i.e., multicast). One notable feature of our generative process is that we draw auxiliary variables that serve as candidate data. Data is generated from the HEM through a sampling process applied to the auxiliary variables. The auxiliary variables drawn for event include, for each possible sender , a time increment from event at which sender would create event , and an length vector indicating which nodes would be the receivers of event if it were directed by sender . The data generated for event under the HEM corresponds to the sender that would create event the soonest—at the smallest time increment from event . The receivers of event generated under the HEM correspond to those receivers toward which the sender with the minimum time increment would have directed event . We explain these steps in more detail below. For hyperedge events that involve one receiver and one or more senders, we treat to be an indicator vector of senders and to be the single receiver, and then follow the alternative generative process provided in Appendix A, which we derive as a simple reversal of the process used for multicasts (i.e., one sender and multiple receivers).

2.1 Candidate receivers

For every possible sender–receiver pair where , we define the “receiver intensity” as a linear combination of statistics relevant to the receiver selection process:

(2.1)

where is a -dimensional vector of coefficients and

is a set of receiver selection features. As we show below, this intensity contributes to the probability that

directs event towards receiver . The features

can capture common network processes like popularity, reciprocity, and transitivity, as well as the effects of attributes of the sender and receivers (e.g., their gender), or attributes of sender–receiver pairs (e.g., whether the sender is a supervisor of the receiver’s). We also include a normally distributed intercept term to account for the average (or baseline) number of receivers:

.

The HEM assumes that the sender of each hyperedge event is the sender that would initiate their respective event with the greatest urgency (i.e., the earliest timestamp). Our model thus assumes that for every event , every possible sender generates a candidate receiver set that would be the receiver set of event if sender were the sender. For an event , we first define an matrix where the row denotes sender ’s receiver vector of zeros and 1’s—i.e., 1’s indicate the nodes to which intends to direct event . We then assume that each receiver vector comes from a modification of the multivariate Bernoulli (MB) distribution (Dai et al., 2013)

—a model that has been used to model graphs in which the state of each edge indicator is drawn independently from an edge-specific Bernoulli distribution. In order to avoid drawing hyperedge events with no receivers, we define a probability measure “MB

” motivated by the Gibbs measure (Fellows and Handcock, 2017). The probability measure we define amounts to a non-empty Gibbs measure, in which the all-zero vector is excluded from the support of the multivariate Bernoulli distribution. As a result, this measure helps us to (1) allow a sender to select multiple receivers for a single event, (2) force the sender to select at least one receiver, and (3) ensure a tractable normalizing constant for the receiver selection distribution. To be specific, we draw a binary vector

(2.2)

where . In particular, we define as

(2.3)

where is the normalizing constant, is the -norm, and the log-indicator term ensures that empty receiver sets are excluded from the distribution’s support. These modeling assumptions facilitate efficient posterior inference since we can derive a closed form expression for the normalizing constant—i.e., —and thus do not need to perform brute-force summation over the support of . We provide detailed derivation steps for the normalizing constant in Appendix B.

2.2 Candidate timestamps

To generate timing for each event, the HEM first draws a candidate timestamp at which the event would be created given the candidate sender and receiver combinations. The timing rate for sender and event is

(2.4)

where is a -dimensional vector of coefficients with a Normal prior , is a set of event timing features—covariates that could affect timestamps of events, and is the appropriate link function such as identity, log, or inverse.

In modeling “when,” we do not directly model the timestamp . Instead, we assume that each sender’s “time increment”—i.e., waiting time to next event since —is drawn from a specific exponential family distribution. We define the time increment from event to event as (i.e., ) and specify the distribution of candidate timestamps with sender-specfic mean . Following the generalized linear model (GLM) framework (Nelder and Baker, 1972)

, we assume the mean and variance of the

satistify

(2.5)

where

here is a positive real number. Possible choices of distribution include exponential, Weibull, gamma, and log-normal distributions, which are commonly used in time-to-event modeling

(Rao, 2000; Rizopoulos, 2012). Based on the specific distribution, we may need other latent variables to draw the time increment, to account for the variance of time increments, beyond the coefficients for the features used to model the rate. —e.g., the shape parameter for the Weibull, the shape parameter for the gamma, and the variance parameter for the log-normal. We use and

to denote the probability density function (p.d.f) and cumulative density function (c.d.f), respectively, with mean

and variance .

2.3 Senders, receivers, and timestamps

Finally, our model assumes that the observed sender, receivers, and timestamp of hyperedge event are generated by selecting the sender–receiver-set pair with the smallest time increment (Snijders, 1996):

(2.6)
  Input: number of events and nodes , covariates , and coefficients
  for  to  do
     for  to  do
        for  to (do
           set
        end for
        draw
        set
        draw
     end for
     if  tied events then
        set
        set
        set
        jump to
     else
        set
        set
        set
     end if
  end for
Algorithm 1 Generative process: one sender and one or more receivers

Therefore, the HEM assumes a sender-driven process—i.e., the receivers and timestamp of an event are jointly determined by the sender’s urgency to direct the event to those selected receivers. Note that our generative process allows for tied events. In the case of tied events—i.e., multiple senders draw exactly the same candidate timestamps—we assume that all events are generated and occur simultaneously. Algorithm 1 summarizes the entire generative process for hyperedge events with one sender and one or more receivers, and Figure 1 presents an illustrative example on how the event is generated, assuming and .

Figure 1: An illustrative example of the generative process of the HEM.

3 Posterior inference

In this section we describe how we invert the generative process to obtain the posterior distribution over the latent variables—candidate receivers , coefficients for receiver selection features , and coefficients for event timing features —conditioned on the observed data , covariates , and hyperparamters

. We draw the samples using Markov chain Monte Carlo (MCMC) methods, repeatedly resampling the value of each latent variable from its conditional posterior via a Metropolis-within-Gibbs sampling algorithm. In the next subsection, we provide each latent variable’s conditional posterior along with pseudocode of MCMC in Algorithm

2. We also evaluate the correctness of both our mathematical derivations and software implemenation using the prior–posterior simulator test of Geweke (2004).

3.1 Conditional posteriors

3.1.1 Candidate receivers

In our model, direct computation of the posterior densities for the latent variables and —i.e., and —is not possible. However, it is possible to augment the data by candidate receivers such that we can obtain their conditional posterior by conditioning on samples of

. This approach—i.e., “data augmentation”—is commonly used throughout Bayesian statistics

(Tanner and Wong, 1987; Neal and Kypraios, 2015). Since

is a binary random variable, it may be sampled from a Bernoulli distribution with probability

, since

(3.1)

where the subscript “” denotes a quantity excluding data from position and is the indicator function that prevents empty receiver sets.

3.1.2 Coefficients for receiver selection features

Unlike the candidate receivers above, the conditional posterior for does not have a closed form; however may instead be re-sampled using the Metropolis–Hastings (MH) algorithm. Assuming an uninformative prior (i.e., ), the conditional posterior for is proportional to

(3.2)

3.1.3 Coefficients for event timing features

Likewise, we use the MH algorithm to update the latent variable . Assuming an uninformative prior (i.e., ), the conditional posterior for an untied event case is proportional to

(3.3)

where is the probability that the observed time increment comes from the specified distribution with the observed sender’s mean , and is the probability that the rest of (unobserved) senders for event all draw time increments greater than . Moreover, under the existence of tied events, the conditional posterior of is written as proportional to

(3.4)

where are the unique timepoints across events (). If (i.e., no tied events), equation (3.4) reduces to equation (3.3). Note that when we have the latent variable to quantify the variance in time increments (based on the choice of timestamp distribution in Section 2.2), we also use equation (3.3) (or equation (3.4) in case there exist tied events) for the additional MH update—e.g., for Weibull, for gamma, and for log-normal.

  Input: number of outer and inner iterations and initial values of
  for  to  do
     for  to  do
        for  to  do
           for  to (do
              update using Gibbs update —equation (3.1)
           end for
        end for
     end for
     for  to  do
        update using MH algorithm—equation (3.2)
     end for
     for  to  do
        update using MH algorithm—equation (3.3) or (3.4)
     end for
     if extra parameter for  then
        update the variance parameter using MH algorithm—equation (3.3) or (3.4)
     end if
  end for
  summarize the results with the last chain of and
Algorithm 2 MCMC algorithm

3.2 Getting it Right (GiR) test

Software development is integral to the objective of applying our model to real world data. Code review is a valuable process in any research computing context, and the prevalence of software bugs in statistical software is well documented (e.g., Altman et al., 2004; McCullough, 2009). With highly complex models such as the HEM, there are many ways in which software bugs can be introduced and go unnoticed. As such, we present a joint analysis of the integrity of our generative model, sampling equations, and software implementation.

Geweke (2004)

introduced the “Getting it Right” (GiR) test—a joint distribution test of posterior simulators which can detect errors in sampling equations as well as software bugs—and it has been used to test the implementation of Bayesian inference algorithms

(Zhao et al., 2016). The test involves comparing the distributions of variables simulated from two joint distribution samplers, which we call “forward” and “backward” samplers. The “forward” sampler draws joint samples of the latent and observable variables from the prior. The “backward” sampler begins by first drawing a joint sample of the latent and observed variables from the prior. It then alternates between re-sampling the latent variables, conditioned on the observable variables, from the MCMC transition operator, and then re-sampling the observable variable, conditioned on the latent variables, from the model likelihood. If the MCMC transition operator is correctly derived and implemented, this process should asymptotically generate joint samples of the latent and observable variables from the prior, like the forward sampler.

In the forward sampler, both observable and unobservable variables are generated using Algorithm 1. In the backward samples, unobservable variables are generated using the sampling equations for inference, which we derived in Section 3.1. For each forward and backward sample that consists of number of events, we save these statistics:

  • Mean of observed receiver sizes across ,

  • Variance of observed receiver sizes across ,

  • Mean of time increments across ,

  • Variance of time increments across ,

  • value used to generate the samples ,

  • value used to generate the samples ,

  • value used to generate the samples in log-normal distribution

To keep the computational burden of re-running thousands of rounds of inference manageable, we run the GiR using a relatively small artificial sample, consisting of events, nodes, number of receiver selection features, and number of event timing features per each forward or backward sampler, using log-normal distibution for the time increments . We generated

sets of forward and backward samples, and then calculated 1,000 quantiles for each of the statistics. We also calculated t-test and Mann-Whitney test p-values in order to test for differences in the distributions generated in the forward and backward samples. Before we calculated these statistics, we thinned our samples by taking every 9th sample starting at the 10,000th sample for a resulting sample size of 10,000, in order to reduce the autocorrelation in the Markov chains. In each case, if we observe a large p-value, this gives us evidence that the distributions generated under forward and backward sampling have the same locations. We depict the GiR results using probability–probability (P–P) plots, in which the empirical CDF values of the forward and backward samples are plotted on the

and axes, respectively. If the two samples are from equivalent distributions, the empirical CDF values should line up on a line with zero -intercept, and unit slope (i.e., a 45-degree line). The GiR test results are depicted in Figure 2. These results indicate that our sampling equations and software implementation pass the test on every statistic.

(a) Mean of
(b) Variance of
(c) Mean of
(d) Variance of
(e) Value of
(f) Value of
(g) Value of
(h) Value of
(i) Value of
(j) Value of
(k) Value of
(l) Value of
Figure 2:

Probability–probability (P–P) plots for the GiR test statistics.

4 Application to email data

We now present a case study applying our method to Montgomery county government email data. Our data come from the North Carolina county government email dataset collected by ben Aaron et al. (2017) that includes internal email corpora covering the inboxes and outboxes of managerial-level employees of North Carolina county governments. Out of over twenty counties, we chose Montgomery County to (1) test our model using data with a large proportion of hyperedges (16.76%), all of which are emails sent from one sender to two or more receivers, and (2) limit the scope of this initial application. The Montgomery County email network contains 680 emails, sent and received by 18 department managers over a period of 3 months (March–May) in 2012. For this case study, we formulate our model specification through definitions of the receiver selection features and event timing features

. We then report a suite of experiments—out-of-sample prediction for model selection and posterior predictive checks—that illustrate how alternative formulations of the HEM can be compared, and evaluate how well our model recovers the distribution of the observed data. Finally, we demonstrate an exploratory analysis of Montgomery County email data using the model estimates to discover substantively meaningful patterns in organizational communication networks.

4.1 Covariates

4.1.1 Receiver selection features

A primary purpose of any network model is to use the posterior distributions to learn which features predict and/or explain edge formation (e.g., is edge formation reciprocal, are edges more likely to be formed among nodes with the same gender). This email application specifically gives rise to the following question: “To what extent are nodal, dyadic or triadic network effects relevant to predicting future emails?” As an illustrative example, we form the receiver selection features for Montgomery County email data using nodal, dyadic, and triadic covariates. First, as we want to test whether gender plays a role in receiver selection process, we include three nodal covariates—the gender information of sender and receiver, and their homophily indicator (i.e., an indicator of whether the sender and receiver are of the same gender). Additionally, we include four interval-based nodal network covariates—outdegree of sender (i.e., the number of emails sent), indegree of receiver (i.e., the number of emails received), hyperedge size of sender (i.e., the number of total receivers directed from the sender), and the interaction between (i.e., scalar product of) outdegree and hyperedge size—to study the effect of nodal behaviors on future interactions. For dyadic and triadic network effects, we employ the network statistics in Perry and Wolfe (2013) and summarize past interaction behaviors based on the time interval prior to and including . Specifically, our time interval tracks 7 days prior to the last email was sent . For , and , we define 14 covariates for :

  • intercept: ;

  • :

  • :

  • :

  • : ;

  • : ;

  • : ;

  • :

  • : ;

  • : ;

  • : ;

  • : ;

  • : ;

  • : ;

where is an indicator function. The network statistics (5–14) are designed so that their coefficients have a straightforward interpretation. The function “outdegree” and “indegree” measure the gregariousness and popularity effects of the node by counting the number of emails sent from and received by , respectively, within the last 7 days. The gregariousness effect refers to the tendency for nodes that created many events in the past to continue to do so in the future. The popularity effect refers to the tendency for nodes that were selected as receivers of many events in the past to continue to do so in the future. Moreover, in order to capture the individual tendency of senders to select two or more receivers, we include the statistic “”—the number of events directed from sender within last 7 days where events with number of receivers are counted as separate events—as a variant of outdegree statistic, accounting for hyperedges. We also include the interaction term, “”, between outdegree and hyperedge size. This interaction allows us to model a possible tradeoff between the hyperedge size and the total number of events created by . Dyadic statistics “send” and “receive” are defined as above such that these covariates measure the number of events directed from to and to , respectively, within the last 7 days. In the example of triadic statistics, the covariate “” counts the events involving some node distinct from and such that events from to and to are both observed within the last 7 days. This statistic captures the tendency for events to close transitive triads (i.e., triads in which directs to and , and directs to ). We include other triadic covariates that behave similarly and exhibit analogous interpretations, which are illustrated in Figure 3.

Figure 3: Visualization of triadic statistics: two_ send, two_ receive, sibling, and cosibling.

4.1.2 Event timing features

For the event timing features , introduced in Section 2.2, we identify a set of covariates which may affect the time until the next event. Similar to the receiver selection features, we include nodal statistics which are time-invariant (such as gender or manager status) or time-dependent (such as the network statistics used for ). In addition, we select some event-specific covariates based on the temporal aspect of the event—e.g., whether the previous email was sent (1) during the weekend and (2) before or past midday (AM/PM)—since we expect the email interactions within county government to be less active during the weekend and in the evening. To be specific, the timestamp statistics are defined as

  • intercept: ;

  • : ;

  • : ;

  • : ;

  • : ;

  • : ;

  • : .

Note that our generative process for timestamps in Section 2.2 is sender-oriented where the sender determines when to send the email; thus we incorporate network statistics that depend only on —specifically, the in and outdegrees of sender .

4.2 Model selection

The HEM defines a flexible family of models, each of which is defined by a set of features (i.e., the receiver selection features , the selection of event timing features , and the distribution of time increments

). Many of these components will be specified based on user expertise (e.g., regarding which features would drive receiver selection), but some decisions may require a data-driven approach to model specification. For example, though theoretical considerations may inform the specification of features, subject-matter expertise is unlikely to inform the decision regarding the family of event timing distribution. Furthermore, since different distribution families (and model specifications more generally) may involve different size parameter spaces, any data-driven approach to model comparison must guard against over-fitting the data. In this section we present a general-purpose approach to evaluating the HEM specification using out-of-sample prediction. We illustrate this approach by comparing alternative distributional families for the event timing component of the model. Here, we specifically compare the predictive performance from two distributions—log-normal and exponential. We particularly choose the log-normal distribution based on some exploratory analysis (e.g., histogram and simple regressions) on raw time increments data, and select the exponential distribution as a baseline alternative that is a commonly specified distribution for time-to-event data, and is also used in the stochastic actor-oriented models (SAOMs)

(Snijders, 1996) as well as their extensions (Snijders et al., 2007).

  Input: data , number of new data to generate , and initial values of Test splits:
  draw test senders (out of senders)
  draw test receivers (out of receiver indicators )
  draw test timestamps (out of timestamps)
  set the test data as “missing” (NA) Imputation and inference:
  for  to  do
     for  to  do
        if  NA then
           compute , where
           draw
        end if
        for  do
           if  NA then
              draw where
           end if
        end for
        if  NA then
           draw from its conditional distribution using importance sampling, where
        end if
        run inference and update given the imputed and observed data
     end for
     store the estimates for test data
  end for
Algorithm 3 Out-of-sample predictions

We evaluated the models’ ability to predict out-of-sample events and timestamps on Montgomery County email data. We generated a train–test split of the data by randomly selecting 10% of senders, receivers, and timestamp variables to be held out. Our model then imputed these missing variables during inference, sampling them from their conditional posterior along with the other latent variables. Algorithm 3 outlines this procedure in detail. We compare the predictive performance of two versions of our model, each with a different timing distribution over the time increments using . We summarize the results of prediction experiments for missing senders, receivers, and timestamps in Figure 4

. First, we compare the posterior probability of correct senders for each of the missing emails

, which corresponds to in Algorithm 3. We call this measure the “correct sender posterior probability.” In Figure 3(a), we display boxplots for the distribution of mean correct sender posterior probability—i.e., —across the missing emails. The results show that both log-normal and exponential distributions achieve better predictive performance for missing senders compared to what is expected under random guess (i.e., choose one out of possible senders ), with the log-normal model showing better performance than the exponential model. Secondly, since the receiver vector is binary, we compute scores for missing receiver indicators (i.e., all and with

=NA) by taking the harmonic mean of precision and recall:

where (4.1)

with TP denoting true positive (i.e., ), FN denoting false negative (i.e., but ), and FP denoting false positive (i.e., but ). Although the generative process for events (Section 2.1) is not directly affected by the choice of timestamp distribution, Figure 3(b) reveals slight difference between log-normal and exponential in their performance in predicting missing receiver indicators, where log-normal on average outperforms exponential. Finally, we define the prediction error for the missing timestamp bot be the median of the absolute relative errors, often referred to as median absolute percentage error (MdAPE), across predictions:

(a) Sender prediction
(b) Receiver prediction
(c) Timestamp prediction
Figure 4: Comparison of predictive performance between log-normal and exponential distributions: (a) correct sender posterior probability from sender predictions, (b) scores from receiver predictions, and (c) median absolute relative error from timestamp predictions. Blue line in (a) represents the correct sender probability expected by random guess—i.e., .
(4.2)

Figure 3(c) presents boxplots for the median absolute percentage errors on a log scale. These plots show that the log-normal distribution fits the time increments significantly better than the exponential distribution. We speculate that this difference can be simply explained by a lack of flexibility in the one-parameter exponential distribution. As illustrated above, we can use this out-of-sample prediction task for two uses—(1) to provide an effective answer to the question “how does the HEM perform at filling in the missing components of time-stamped network data?” and (2) to offer one standard way to determine the distribution of time increments in Section 2.2.

4.3 Posterior predictive checks

In this section, we perform posterior predictive checks (PPC) (Rubin et al., 1984) to evaluate the appropriateness of our model specification for Montgomery County email data. We formally generated entirely new data by simulating synthetic email datasets from the generative process in Section 2, conditional upon a set of inferred latent variables from inference in Section 4.4. For the test of goodness-of-fit in terms of network dynamics, we use multiple statistics that summarize meaningful aspects of the data: outdegree distribution—the number of emails sent by each node, indegree distribution—the number of emails received by each node, receiver size distribution—the number of receivers on each emails, and a probability–probability (P–P) plot for time increments.

(a) Outdegree distribution
(b) Indegree distribution
(c) Receiver size distribution
(d) P–P plot for time increments
Figure 5: PPC results from log-normal distribution. Blue lines denote the observed statistics in (a)–(c) and denotes the diagonal line in (d).

Figure 5 illustrates the results of posterior predictive checks using the log-normal model, which fit the timestamps better than the exponential model (Section 4.2). The upper two plots show node-specific posterior predictive degree distributions across synthetic samples, where the left one is for the outdegree statistic and the right plot is for the indegree statistic. For both plots, the x-axis represents the nodes (), and the y-axis represents the number of emails sent or received by the node. When compared with the observed outdegree and indegree statistics (red lines), our model appears to fit the overall distribution of sending and receiving activities across the nodes. For example, node 1 and 10 have a significantly higher level of both sending and receiving activities relative to the rest and this is captured in the model-simulated data. The outdegree distribution of some low-activity nodes are not precisely recovered; however, the indegree distribution looks much better. Since we use more information in the receiver selection process (i.e., network effects) while we rely solely on minimum time increments when choosing the observed sender, these results are expected. The lower left plot is the distribution of receiver sizes, where the x-axis spans over the size of receivers 1 to 14 (which is the maximum size of observed receivers) and the y-axis denotes the number of emails with x-number of receivers. The result shows that our model is underestimating emails with one receiver while overestimating emails with two, three, and four receivers. One explanation behind what we observe is that the model is trying to recover broadcast emails, which are the emails with number of receivers, so that the intercept estimate

is slightly moved toward right. It would be an interesting problem in future research to consider how the hyperedge size distribution can be further modified to capture this distribution more accurately. The plot on the lower right is the P–P plot for time increments, which depicts the two cumulative distribution functions—one for simulated time increments and another for observed time increments—against each other in order to assess how closely two data sets agree. Here, the closeness to the diagonal line connecting

and gives a measure of difference between the simulated and observed time increments, and our P–P plot shows that we have great performance in reproducing the observed timing distribution. Our findings from the predictive experiments in Section 4.2 are further revealed in the PPC from exponential distribution, where the PPC plots comparing log-normal and exponential distributions are presented in Appendix C.

4.4 Exploratory analysis

Based on the prediction experiments in Section 4.2, we interpret the results from the HEM using the log-normal distribution, with emphasis on understanding the effects of receiver selection and event timing features defined in Section 4.1. We assume weakly informative priors for latent variables such as , , and , and MCMC (Algorithm 2) with outer iterations and a burn-in of 15,000, where we thin by keeping every sample. While the inner iterations for is fixed as 1, we specify the inner iterations for and for to adjust for slower convergence rates. Convergence diagnostics including the traceplots and Geweke diagnostics (Geweke et al., 1991) are provided in Appendix D.

4.4.1 Coefficients for receiver selection features

Figure 6 shows the boxplots summarizing posterior samples of , where Figure 5(a) displays the coefficients for nodal covariates and 5(b)

displays the coefficients for dyadic and triadic covariates. Since we use the logit functional form

and can interpret the

estimates in terms of odds ratios

. First of all, we find the effects of nodal coavariates “gender_ sender” and “gender_ receiver” are both nearly always negative in the posterior samples. The log odds that any other node will be added as a receiver of an email is approximately two times less if the sender is a woman. The posterior distribution of the statistic “outdegree” is mostly negative, if sender sent number of emails to anyone last week, then sender is approximately times less likely to send an email to . However, this straightforward interpretation of the outdegree statistic only applies when the hyperedge size is low. The scenario in which a sender sends a lot of low-hyperedge-size emails may arise due to the use of email for a one-on-one conversation. The large positive estimates of the interaction between hyperedge size and outdegree indicate that those who have recently sent many emails with many receivers on each email are likely to continue sending broadcast emails. This scenario may arise from someone being responsible for distributing timely announcements. When we look at the effect of “indegree,” we see a clear popularity effect—those who have received a lot of emails a lot recently are likely to continue receiving a lot of emails. If the receiver received number of emails over the last week, sender is times more likely to send an email to .

(a) Nodal covariates
(b) Dyadic and triadic covariates
Figure 6: Posterior distribution of estimates.

When we look at the effects of dyadic and triadic covariates, one thing that stands out is the large and positive posterior distribution of the statistic “send” (i.e., number of times sent emails to over the last week) with the posterior mean , implying that if sent number of emails to last week, then sender is approximately times more likely to send an email to . The posterior distributions for the reciprocity effect (i.e., “receive”), and the four triadic effects, are all fairly evenly spread around zero, so our results do not justify conclusions regarding the nature of these effects in the Montgomery county email network.

4.4.2 Coefficients for event timing features

For event timing features, Figure 7 shows the boxplots summarizing posterior samples of . Note that interpretations of the estimated coefficients for should be based on the specified time unit of the datset; we specify time units to be hours for the Montgomery county email data. Moreover, since we assume the log-normal distribution for time increments, the coefficients are interpreted in terms of the change in the average log time.

The posterior estimates of two temporal effects—“weekend” and “PM”—indicate that if the email was sent during the weekend or after midday, then the time to the email is expected to take hours and hours longer, respectively, compared to their counterparts (i.e., weekdays and am). On the contrary, the covariates “manager”, “outdegree”, and “indegree” shorten the amount of time until the next email. For example, being a county manager (i.e., the lead county administrator) lowers the expected value of by . The posterior mean estimates for the “outdegree” and “indegree” statistics are and , respectively. These effects indicate that those who are involved in either sending or receiving a lot of emails recently are likely to send emails with greater speed. The posterior distribution for the effect of the gender of the manager is evenly spread around zero. In addition, the posterior mean estimates for the variance parameter in the log-normal distribution is approximately

with its 95% credible interval

, indicating that there exists large variability in the time increments of emails.

Figure 7: Posterior distribution of estimates.

5 Conclusion

Motivated by a growing class of dynamic network models which deal with events recorded in continuous time, the hyperedge event model (HEM) can effectively learn the underlying dynamics in events and their corresponding timestamp formations, providing novel insights to the literature. The HEM explicitly models hyperedges through a receiver selection distribution that forces the sender to select at least one receiver; this obviates the need to preprocess hyperedge data—e.g., by “duplicating” hyperedges—to match the assumptions of traditional network models. Our model treating them as pure duplicates. In modeling the timestamps (more precisely time increments) of events, our generalized linear model (GLM) based formulation offers new innovations by eliminating the need to stick with one parameter distribution (e.g., exponential distribution). To our knowledge, the HEM is the only existing model that can be used to generate the sender, receivers, and timestamp of interactions in real time. To make better use of the proposed model, we provide an algorithm for predictive experiments that help to learn which specification of HEM provides a better fit to the data.

We have demonstrated the effectiveness of our model by analyzing the Montgomery County government emails, where emails serve as a canonical example of directed hyperedge events with one sender and one or more receivers. The estimated effects for receiver selection features reveal that our model is able to understand the structural dynamics similar to those used in the exponential random graph model (ERGM). Our model also learns the effects of event timing features by integrating a survival model for event timing. Although we illustrate the entire framework and application in the context of one type of hyperedge, one sender and one or more receivers, our model can be easily extended to allow the opposite case, one or more sender and one receiver, by slight modification of the generative process (shown in Appendix A). This extension involves promising applications to socio-political networks such as international sanctions and co-sponsorship of bills, and biological networks such as those formed through neural dendrites.

This work was supported in part by the University of Massachusetts Amherst Center for Intelligent Information Retrieval and in part by National Science Foundation grants DGE-1144860, SES-1619644, and CISE-1320219. Any opinions, findings, and conclusions or recommendations are those of the authors and do not necessarily reflect those of the sponsors.

References

  • Altman et al. (2004) Altman, M., Gill, J., and McDonald, M. P. (2004). Numerical issues in statistical computing for the social scientist, volume 508. John Wiley & Sons.
  • ben Aaron et al. (2017) ben Aaron, J., Denny, M., Desmarais, B., and Wallach, H. (2017). “Transparency by Conformity: A Field Experiment Evaluating Openness in Local Governments.” Public Administration Review, 77(1): 68–77.
  • Butts (2008) Butts, C. T. (2008). “A RELATIONAL EVENT FRAMEWORK FOR SOCIAL ACTION.” Sociological Methodology, 38(1): 155–200.
  • Cranmer et al. (2014) Cranmer, S. J., Heinrich, T., and Desmarais, B. A. (2014). “Reciprocity and the structural determinants of the international sanctions network.” Social Networks, 36: 5–22.
  • Dai et al. (2013) Dai, B., Ding, S., Wahba, G., et al. (2013). “Multivariate bernoulli distribution.” Bernoulli, 19(4): 1465–1483.
  • Fan and Shelton (2009) Fan, Y. and Shelton, C. R. (2009). “Learning continuous-time social network dynamics.” In

    Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence

    , 161–168. AUAI Press.
  • Fellows and Handcock (2017) Fellows, I. and Handcock, M. (2017).

    “Removing Phase Transitions from Gibbs Measures.”

    In Artificial Intelligence and Statistics, 289–297.
  • Fowler (2006) Fowler, J. H. (2006). “Legislative cosponsorship networks in the US House and Senate.” Social Networks, 28(4): 454–465.
  • Geweke (2004) Geweke, J. (2004). “Getting it right: Joint distribution tests of posterior simulators.” Journal of the American Statistical Association, 99(467): 799–804.
  • Geweke et al. (1991) Geweke, J. et al. (1991).

    Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments

    , volume 196.
    Federal Reserve Bank of Minneapolis, Research Department Minneapolis, MN, USA.
  • Ghoshal et al. (2009) Ghoshal, G., Zlatić, V., Caldarelli, G., and Newman, M. (2009). “Random hypergraphs and their applications.” Physical Review E, 79(6): 066118.
  • Hunter et al. (2011) Hunter, D., Smyth, P., Vu, D. Q., and Asuncion, A. U. (2011). “Dynamic egocentric models for citation networks.” In

    Proceedings of the 28th International Conference on Machine Learning (ICML-11)

    , 857–864.
  • Karypis et al. (1999) Karypis, G., Aggarwal, R., Kumar, V., and Shekhar, S. (1999). “Multilevel hypergraph partitioning: applications in VLSI domain.” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 7(1): 69–79.
  • McCullough (2009) McCullough, B. D. (2009). “The accuracy of econometric software.” Handbook of computational econometrics, 55–79.
  • Neal and Kypraios (2015) Neal, P. and Kypraios, T. (2015). “Exact Bayesian inference via data augmentation.” Statistics and Computing, 25(2): 333–347.
  • Nelder and Baker (1972) Nelder, J. A. and Baker, R. J. (1972). Generalized linear models. Wiley Online Library.
  • Newman et al. (2002) Newman, M. E., Forrest, S., and Balthrop, J. (2002). “Email networks and the spread of computer viruses.” Physical Review E, 66(3): 035101.
  • Partzsch and Schüffny (2012) Partzsch, J. and Schüffny, R. (2012).

    “Developing structural constraints on connectivity for biologically embedded neural networks.”

    Biological cybernetics, 106(3): 191–200.
  • Perry and Wolfe (2013) Perry, P. O. and Wolfe, P. J. (2013). “Point process modelling for directed interaction networks.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(5): 821–849.
  • Rao (2000) Rao, P. (2000). “Applied survival analysis: regression modeling of time to event data.” Journal of the American Statistical Association, 95(450): 681–681.
  • Rizopoulos (2012) Rizopoulos, D. (2012). Joint models for longitudinal and time-to-event data: With applications in R. CRC Press.
  • Rubin et al. (1984) Rubin, D. B. et al. (1984). “Bayesianly justifiable and relevant frequency calculations for the applied statistician.” The Annals of Statistics, 12(4): 1151–1172.
  • Snijders et al. (2007) Snijders, T., Steglich, C., and Schweinberger, M. (2007). Modeling the coevolution of networks and behavior. na.
  • Snijders (1996) Snijders, T. A. (1996). “Stochastic actor-oriented models for network change.” Journal of mathematical sociology, 21(1-2): 149–172.
  • Tanner and Wong (1987) Tanner, M. A. and Wong, W. H. (1987). “The calculation of posterior distributions by data augmentation.” Journal of the American statistical Association, 82(398): 528–540.
  • Vu et al. (2011) Vu, D. Q., Hunter, D., Smyth, P., and Asuncion, A. U. (2011). “Continuous-Time Regression Models for Longitudinal Networks.” In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems 24, 2492–2500. Curran Associates, Inc.
  • Zhang and Liu (2010) Zhang, Z.-K. and Liu, C. (2010). “A hypergraph model of social tagging networks.” Journal of Statistical Mechanics: Theory and Experiment, 2010(10): P10005.
  • Zhao et al. (2016) Zhao, T., Wang, Z., Cumberworth, A., Gsponer, J., de Freitas, N., Bouchard-Côté, A., et al. (2016). “Bayesian analysis of continuous time Markov chains with application to phylogenetic modelling.” Bayesian Analysis, 11(4): 1203–1237.
  • Zlatić et al. (2009) Zlatić, V., Ghoshal, G., and Caldarelli, G. (2009). “Hypergraph topological quantities for tagged social networks.” Physical Review E, 80(3): 036118.

Appendix

Appendix A: Alternative generative process

  Input: number of events and nodes , covariates , and coefficients
  for  to  do
     for  to  do
        for  to (do
           set
        end for
        draw
        set
        draw
     end for
     if  tied events then
        set
        set
        set
        jump to
     else
        set
        set
        set
     end if
  end for
Algorithm 4 Generative process: one receiver and one or more senders

Appendix B: Normalizing constant of MB

Our probability measure “MB”—the multivariate Bernoulli distribution with non-empty Gibbs measure—defines the probability of sender selecting the binary receiver vector as

where the receiver intensity is a linear combination of receiver selection features—i.e., —as defined in Secton 2.1.

To use this distribution efficiently, we derive a closed-form expression for that does not require brute-force summation over the support of (i.e., ). We recognize that if were drawn via independent Bernoulli distributions in which was given by logit, then

This is straightforward to verify by looking at

where the subscript “” denotes a quantity excluding data from position . Now we denote the logistic-Bernoulli normalizing constant as , which is defined as

Now, since

except when , we note that

We can therefore derive a closed form expression for via a closed form expression for . This can be done by looking at the probability of the zero vector under the logistic-Bernoulli model:

Then, we have

Finally, the closed form expression for the normalizing constant is

Appendix C: Comparison of PPC results: log-normal vs. exponential

(a) Outdegree distribution
(b) Indegree distribution
(c) Receiver size distribution
(d) P–P plot for time increments
Figure 8: Comparison of PPC results between log-normal (red) and exponential (green) distributions. Blue lines denote the observed statistics in (a)–(c) and denotes the diagonal line in (d).

Appendix D: Convergence diagnostics

(a) Traceplots of
(b) Traceplot of
(c) Geweke diagnostics for
(d) Geweke diagnostics for and
Figure 9: Convergence diagnostics from log-normal distribution.