Model-based clustering for random hypergraphs

08/15/2018
by   Tin Lok James Ng, et al.

A probabilistic model for random hypergraphs is introduced to represent unary, binary and higher order interactions among objects in real-world problems. This model is an extension of the Latent Class Analysis model, which captures clustering structures among objects. An EM (expectation maximization) algorithm with MM (minorization maximization) steps is developed to perform parameter estimation while a cross validated likelihood approach is employed to perform model selection. The developed model is applied to three real-world data sets where interesting results are obtained.


1 Introduction

A large number of random graph models have been proposed (Nowicki and Snijders, 2001; Hoff et al., 2002; Handcock et al., 2007; Latouche et al., 2011) to describe complex interactions among objects of interest. Pairwise relationships among objects can be naturally represented as a graph, in which the objects are represented by the vertices, and two vertices are joined by an edge if a certain relationship exists between them. While graphs are capable of representing pairwise interactions between objects, they are inadequate for representing the higher-order and unary interactions that are observed in many real-world problems. Examples of higher-order and unary relationships include co-authorship on academic papers, co-appearance in movie scenes, and songs performed in a concert.

For example, the study of coauthorship networks of scientists has attracted significant research interest in both the natural and social sciences (Newman, 2001a, b, 2004; Moody, 2004; Azondekon et al., 2018). Such networks are typically constructed by connecting two scientists if they have coauthored one or more papers together. However, as we will illustrate below, such a representation inevitably results in a loss of information, while a hypergraph representation naturally preserves all information. A hypergraph is a generalization of a graph in which hyperedges are arbitrary sets of vertices and can therefore contain any number of vertices. As a result, hypergraphs are capable of representing relationships of arbitrary order.

We consider a simple example of a coauthorship network with 7 authors and 4 papers in order to illustrate the benefits of hypergraph modelling. A hypergraph representation of the network is given in Figure 1, where the vertices represent the authors and the hyperedges represent the papers. In this example, one paper is written by four authors, another by two authors, and a third by a single author.

On the other hand, a graph representation of this coauthorship network joins any two authors who have coauthored at least one paper. It is evident that much information is lost with this representation. In particular, it discards the number of authors who co-authored each paper. For example, from the edge set one can deduce that a given author has co-authored with two others, but not whether those co-authorships were on the same paper. Furthermore, the hyperedge containing a single author is left out of the graph representation altogether.
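The loss of information can be checked on a small example. The sketch below uses hypothetical author-paper memberships (the memberships of Figure 1 are not reproduced here) and a hypothetical helper `clique_expansion`; it shows two different hypergraphs that collapse to the same pairwise graph.

```python
from itertools import combinations

# Hypothetical memberships: 7 authors (1..7) and 4 papers.
papers = {
    "p1": {1, 2, 3, 4},
    "p2": {4, 5},
    "p3": {6},
    "p4": {5, 6, 7},
}

def clique_expansion(hyperedges):
    """Return the set of pairwise edges implied by a collection of hyperedges."""
    edges = set()
    for members in hyperedges.values():
        edges.update(combinations(sorted(members), 2))
    return edges

graph_edges = clique_expansion(papers)

# A different hypergraph with the same pairwise graph: the four-author paper
# is replaced by six two-author papers, and the single-author paper is dropped.
alternative = {f"q{i}": set(pair) for i, pair in enumerate(combinations([1, 2, 3, 4], 2))}
alternative["p2"] = {4, 5}
alternative["p4"] = {5, 6, 7}

print(sorted(graph_edges))
print(clique_expansion(alternative) == graph_edges)  # True
```

Both collections of papers yield the same ten edges, so the graph alone cannot distinguish one four-author paper from six two-author papers, and the single-author paper leaves no trace.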

A number of random hypergraph models have been studied in the probability and combinatorics literature, where theoretical properties such as phase transitions and chromatic numbers were investigated (Karoński and Łuczak, 2002; Goldschmidt, 2005; de Panafieu, 2015; Dyer et al., 2015; Poole, 2015). A novel parametrization of distributions on hypergraphs based on the geometry of points was proposed in Lunagómez et al. (2017) and used to infer Markov structure for multivariate distributions.

On the other hand, statistical modeling with random hypergraphs is less explored. Stasi et al. (2014) introduced the hypergraph beta model with three variants, which is a natural extension of the beta model for random graphs (Holland and Leinhardt, 1981). In their model, the probability of a hyperedge appearing in the hypergraph is parameterized by a vector whose entries represent the “attractiveness” of each vertex. However, their model does not capture clustering among objects, which is a typical real-world phenomenon. In addition, the assumption of an upper bound on the size of hyperedges is violated by many real-world data sets.

One may equivalently represent a hypergraph using a bipartite network (also called a two-mode network or an affiliation network). Two-mode networks consist of two different kinds of vertices, and edges can only be observed between vertices of different types, not between vertices of the same type. A hypergraph can be represented as a two-mode network by considering the hyperedges as a second type of vertex. For example, an equivalent bipartite representation of the hypergraph shown in Figure 1 is provided in Figure 2, where the hyperedges are now replaced by the four green vertices.

Two-mode networks have been studied in various disciplines including computer science (Perugini et al., 2004), social sciences (Faust et al., 2002; Robins and Alexander, 2004; Koskinen and Edling, 2012; Friel et al., 2016) and physics (Lind et al., 2005). A number of approaches have been proposed to analyze and model two-mode network data (Borgatti and Everett, 1997; Robins and Alexander, 2004; Doreian and Batagelj, 2004; Latapy et al., 2008; Wang et al., 2009; Snijders et al., 2013). In particular, models originally developed for binary networks were extended for two-mode networks.

Doreian and Batagelj (2004) developed a blockmodeling approach for two-mode network data which aims to simultaneously partition the two types of vertices into blocks. Skvoretz and Faust (1999) proposed exponential random graph models (ERGMs) for two-mode networks, which model the logit of the probability of an actor belonging to an event as a function of actor- and event-specific effects and other graph statistics. A clustering algorithm for two-mode networks was developed in Field et al. (2006) based on the modelling framework in Skvoretz and Faust (1999). Several extensions to the ERGMs for bipartite networks have been proposed in recent years (Wang et al., 2009, 2013). Snijders et al. (2013) proposed a methodology for studying the co-evolution of two-mode and one-mode networks. A network autocorrelation model for two-mode networks was introduced in Fujimoto et al. (2011).

Representing network observations using two-mode networks has the benefit of modelling vertices of both types jointly. However, in analyzing a two-mode network, one type of vertex may attract most of the interest. For example, in co-authorship networks, the main interest may lie in the collaborations among authors rather than in the co-authored papers themselves. In such scenarios, a hypergraph representation is the most natural one: it converts one type of vertex into hyperedges with no loss of information.

In this paper, we propose the Extended Latent Class Analysis (ELCA) model for random hypergraphs, which is a natural extension of the Latent Class Analysis (LCA) model (Lazarsfeld and Henry, 1968; Goodman, 1974; Celeux and Govaert, 1991) and includes the LCA model as a special case. The model is applied to two data sets: Star Wars movie scenes and the Lady Gaga concerts of 2014.

Figure 1: A hypergraph representation of a coauthorship network.

Figure 2: Bipartite graph representation of the hypergraph in Figure 1.

2 Model and Motivation

2.1 Hypergraph

A hypergraph is represented by a pair $H = (V, E)$, where $V = \{v_1, \ldots, v_N\}$ is the set of vertices and $E = \{e_1, \ldots, e_M\}$ is the set of hyperedges. A hyperedge $e_i$ is a subset of $V$, and we allow repetitions in the hyperedge set $E$. Thus, the hypergraph can alternatively be represented by an $M \times N$ matrix $x = (x_{ij})$, where $x_{ij} = 1$ if vertex $v_j$ appears in hyperedge $e_i$ and $x_{ij} = 0$ otherwise.
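As a small illustration of the matrix representation, the sketch below builds a binary incidence matrix from a list of hyperedges. The helper `incidence_matrix` and the example hyperedges are hypothetical; a repeated hyperedge is included on purpose, since repetitions are allowed.

```python
def incidence_matrix(hyperedges, n_vertices):
    """Binary matrix with one row per hyperedge; entry [i][j] is 1 iff
    vertex j + 1 belongs to hyperedge i."""
    return [[1 if j + 1 in edge else 0 for j in range(n_vertices)] for edge in hyperedges]

# Hypothetical hyperedges on 7 vertices; the last one repeats the second.
hyperedges = [{1, 2, 3, 4}, {4, 5}, {6}, {4, 5}]
x = incidence_matrix(hyperedges, n_vertices=7)

sizes = [sum(row) for row in x]            # hyperedge sizes (row sums)
degrees = [sum(col) for col in zip(*x)]    # vertex degrees (column sums)
print(sizes)    # [4, 2, 1, 2]
print(degrees)  # [1, 1, 1, 3, 2, 1, 0]
```

Row sums give the hyperedge sizes, which are the quantity whose variation motivates the extended model, and a vertex that appears in no hyperedge simply has an all-zero column.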

2.2 Latent Class Analysis Model for Random Hypergraphs

The binary latent class analysis (LCA) model (Lazarsfeld and Henry, 1968; Goodman, 1974) is a commonly used mixture model for high dimensional binary data. It assumes that each observation is a member of one and only one of the latent classes, and conditional on the latent class membership, the manifest variables are mutually independent of each other. The LCA model appears to be a natural candidate to model random hypergraphs where hyperedges are partitioned into latent classes, and the probability that a hyperedge contains a vertex depends only on its latent class assignment.

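Let $\pi = (\pi_1, \ldots, \pi_G)$ be the a priori latent class assignment probabilities, where $G$ is the number of latent classes, and define the $G \times N$ matrix $\phi = (\phi_{gj})$, where $\phi_{gj}$ is the probability that vertex $v_j$ is contained in a hyperedge with latent class label $g$. Writing $x_{ij} \in \{0, 1\}$ for the indicator that vertex $v_j$ appears in hyperedge $e_i$, with $M$ hyperedges and $N$ vertices, the likelihood function can be written as

$$L(\pi, \phi; x) = \prod_{i=1}^{M} \sum_{g=1}^{G} \pi_g \prod_{j=1}^{N} \phi_{gj}^{x_{ij}} (1 - \phi_{gj})^{1 - x_{ij}}.$$

By introducing the latent class membership matrix $z = (z_{ig})$, where $z_{ig} = 1$ if hyperedge $e_i$ has latent class label $g$ and $z_{ig} = 0$ otherwise, the complete data likelihood of $x$ and $z$ can be expressed as (1).

$$L_c(\pi, \phi; x, z) = \prod_{i=1}^{M} \prod_{g=1}^{G} \left[ \pi_g \prod_{j=1}^{N} \phi_{gj}^{x_{ij}} (1 - \phi_{gj})^{1 - x_{ij}} \right]^{z_{ig}} \tag{1}$$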

In comparison to the hypergraph beta models introduced in Stasi et al. (2014), the LCA model is capable of capturing the clustering and heterogeneity of hyperedges. For example, academic papers can be naturally labelled according to subject areas and, conditional on a paper being labelled mathematics, one would expect the probability that a mathematician co-authored the paper to be higher than the probability that a biologist did. The LCA model does not assume an upper bound on the size of hyperedges and can model hyperedges of any size. Furthermore, an efficient expectation maximization algorithm (Dempster et al., 1977) can be easily derived to perform parameter estimation.
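To make the LCA likelihood concrete, the following sketch evaluates the observed-data log-likelihood of a binary latent class model on a tiny incidence matrix. The function name `lca_log_likelihood` and the parameter values are illustrative assumptions, and the log-sum-exp step is a standard numerical device rather than part of the model.

```python
import math

def lca_log_likelihood(x, pi, phi):
    """Observed-data log-likelihood of a binary latent class model.

    x   : list of rows, each a list of zeros/ones (one row per hyperedge)
    pi  : list of class probabilities
    phi : phi[g][j] is the inclusion probability of vertex j in class g,
          assumed to lie strictly between 0 and 1
    """
    total = 0.0
    for row in x:
        # One log-weight per class: log pi_g + sum_j log Bernoulli(x_j; phi_gj).
        terms = [
            math.log(p) + sum(math.log(q) if xi else math.log(1.0 - q) for xi, q in zip(row, ph))
            for p, ph in zip(pi, phi)
        ]
        m = max(terms)  # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total

x = [[1, 1, 0], [0, 0, 1]]
pi = [0.5, 0.5]
phi = [[0.9, 0.8, 0.1], [0.2, 0.1, 0.7]]
print(lca_log_likelihood(x, pi, phi))
```

For this toy input the two hyperedges have marginal probabilities 0.327 and 0.253, so the printed value is the logarithm of their product.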

2.3 Extended Latent Class Analysis for Random Hypergraphs

While the LCA model captures the clustering and heterogeneity of hyperedges in real world data sets, it is quite restrictive in modeling the size of a hyperedge. Writing $\phi_{gj}$ for the probability that vertex $j$ is contained in a hyperedge with latent class label $g$, the size of a hyperedge with latent class label $g$ follows the Poisson Binomial distribution (Wang, 1993) with parameters $\phi_{g1}, \ldots, \phi_{gN}$, with expected value $\sum_{j=1}^{N} \phi_{gj}$ and variance $\sum_{j=1}^{N} \phi_{gj} (1 - \phi_{gj})$. As we will illustrate in a few real world data sets, the LCA model underestimates the variation in sizes of hyperedges. Thus, we extend the LCA model by including an additional clustering structure to address this shortcoming.
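The mean and variance of the Poisson Binomial distribution can be verified numerically. The sketch below computes its probability mass function by a simple dynamic program over the vertices; the function name and the inclusion probabilities are hypothetical.

```python
def poisson_binomial_pmf(probs):
    """PMF of the number of successes among independent Bernoulli(probs[j]) trials."""
    pmf = [1.0]  # with zero trials the count is 0 with probability 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            nxt[s] += mass * (1.0 - p)  # trial fails: count stays at s
            nxt[s + 1] += mass * p      # trial succeeds: count moves to s + 1
        pmf = nxt
    return pmf

phi_g = [0.9, 0.5, 0.5, 0.1]  # hypothetical inclusion probabilities for one class
pmf = poisson_binomial_pmf(phi_g)

mean = sum(s * m for s, m in enumerate(pmf))
var = sum((s - mean) ** 2 * m for s, m in enumerate(pmf))
print(mean, sum(phi_g))                      # both are the sum of the probabilities
print(var, sum(p * (1 - p) for p in phi_g))  # both are the sum of p(1 - p)
```

Because every term $p(1 - p)$ is at most $1/4$, the within-class variance of the size is bounded by $N/4$ and shrinks as probabilities approach 0 or 1, which is one way to see why a single clustering can understate the spread of hyperedge sizes.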

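We develop the Extended Latent Class Analysis (ELCA) model by introducing an additional clustering of the hyperedges. We assume that the two clusterings are independent. We let $\tau = (\tau_1, \ldots, \tau_K)$ be the a priori additional clustering assignment probabilities, where $K$ is the number of additional clusters. Thus, the probability that a hyperedge has cluster label $g$ and additional cluster label $k$ is given by $\pi_g \tau_k$, where $\pi_g$ is the a priori probability of cluster $g$. We define the $G \times N$ matrix $\phi = (\phi_{gj})$ and the $K$-dimensional vector $a = (a_1, \ldots, a_K)$ so that the probability that vertex $v_j$ is contained in a hyperedge with cluster label $g$ and additional cluster label $k$ is given by $a_k \phi_{gj}$.

Let $\theta = (\pi, \tau, \phi, a)$ denote the model parameters. With $x_{ij}$ the incidence indicator of Section 2.1, the likelihood function can be written as

$$L(\theta; x) = \prod_{i=1}^{M} \sum_{g=1}^{G} \sum_{k=1}^{K} \pi_g \tau_k \prod_{j=1}^{N} (a_k \phi_{gj})^{x_{ij}} (1 - a_k \phi_{gj})^{1 - x_{ij}}.$$

We define the cluster membership matrix $z = (z_{ig})$ and the additional cluster membership matrix $w = (w_{ik})$, where $z_{ig} = 1$ if hyperedge $e_i$ has cluster label $g$ and $w_{ik} = 1$ if hyperedge $e_i$ has additional cluster label $k$, and both are $0$ otherwise. The complete data likelihood function of $x$, $z$ and $w$ is given as:

$$L_c(\theta; x, z, w) = \prod_{i=1}^{M} \prod_{g=1}^{G} \prod_{k=1}^{K} \left[ \pi_g \tau_k \prod_{j=1}^{N} (a_k \phi_{gj})^{x_{ij}} (1 - a_k \phi_{gj})^{1 - x_{ij}} \right]^{z_{ig} w_{ik}} \tag{2}$$

We further impose a normalization constraint on $a$ to ensure that the model is identifiable. It is easy to see that the LCA model is the special case of the ELCA model in which the number of additional clusters is $K = 1$.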
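The reduction to the LCA model can be checked numerically. The sketch below assumes that the inclusion probability for a hyperedge with cluster label g and additional cluster label k is the product of an entry of a scaling vector `a` and an entry of the matrix `phi`; the function name and all parameter values are hypothetical.

```python
import math

def elca_log_likelihood(x, pi, tau, phi, a):
    """Observed-data log-likelihood when vertex j joins a hyperedge with labels
    (g, k) with probability a[k] * phi[g][j], assumed to lie strictly in (0, 1)."""
    total = 0.0
    for row in x:
        terms = []
        for p_g, phi_g in zip(pi, phi):
            for t_k, a_k in zip(tau, a):
                log_w = math.log(p_g) + math.log(t_k)
                for xi, q in zip(row, phi_g):
                    prob = a_k * q
                    log_w += math.log(prob) if xi else math.log(1.0 - prob)
                terms.append(log_w)
        m = max(terms)  # log-sum-exp over all (g, k) pairs
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total

x = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]
pi = [0.6, 0.4]
phi = [[0.9, 0.8, 0.1], [0.2, 0.1, 0.7]]

full = elca_log_likelihood(x, pi, tau=[0.3, 0.7], phi=phi, a=[0.4, 1.0])
# With a single additional cluster and a = [1], the model is a plain LCA model.
lca = elca_log_likelihood(x, pi, tau=[1.0], phi=phi, a=[1.0])
print(full, lca)
```

With one additional cluster and a scaling of 1, the double sum collapses to the sum over clusters, so `lca` equals the LCA log-likelihood of the same data.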

2.4 Theoretical Properties

We compare the theoretical properties of the LCA and ELCA models developed above. Proposition 2.1 below shows that the size of a hyperedge simulated from the ELCA model has a larger variance than that of a hyperedge simulated from the LCA model.

Proposition 2.1.

Suppose we are given the LCA model with parameters and the ELCA model with parameters and vertices. Suppose the condition holds for and .

Let denote the size of a random hyperedge generated under the LCA model. Similarly, let denote the size of a random hyperedge generated under the ELCA model. We have the following results.

Proof.

The proof is straightforward and is given in the Appendix. ∎

We now let be the probability mass functions of the size of a random hyperedge simulated from a cluster LCA model. Similarly, we let be the probability mass function of the size of a random hyperedge simulated from the ELCA model with clusters and additional clusters. The following result can be derived.

Proposition 2.2.
  1. Under the specification of an LCA model with parameters $\pi$ and $\phi$, suppose the following conditions hold for $g = 1, \ldots, G$,

    as $N \to \infty$. We have

    That is, the distribution of the size of a random hyperedge converges to a mixture of Poisson distributions with $G$ components.

  2. Under the specification of an ELCA model with parameters $\pi$, $\tau$, $\phi$ and $a$, further suppose the following conditions hold for $g = 1, \ldots, G$ and $k = 1, \ldots, K$,

    as $N \to \infty$. We have

    That is, the distribution of the size of a random hyperedge converges to a mixture of Poisson distributions with $GK$ components.

Proof.

Conditional on the event that a random hyperedge is generated from cluster , (Wang, 1993, Theorem 3) implies that

Part 1 result follows by marginalizing over the clusters. The second part of the proposition can be proved similarly. ∎
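The Poisson limit for a single cluster can be illustrated numerically. The sketch below is not the proposition's exact setting: it assumes equal, small inclusion probabilities summing to a fixed rate, and compares the resulting Poisson Binomial distribution with a Poisson distribution of the same mean in total variation distance. By Le Cam's inequality this distance is at most the sum of the squared probabilities.

```python
import math

def poisson_binomial_pmf(probs):
    """PMF of the number of successes among independent Bernoulli(probs[j]) trials."""
    pmf = [1.0]
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for s, mass in enumerate(pmf):
            nxt[s] += mass * (1.0 - p)
            nxt[s + 1] += mass * p
        pmf = nxt
    return pmf

def poisson_pmf(lam, s_max):
    """Poisson(lam) probabilities for counts 0..s_max, computed recursively
    to avoid large factorials."""
    out = [math.exp(-lam)]
    for s in range(1, s_max + 1):
        out.append(out[-1] * lam / s)
    return out

n, lam = 200, 3.0            # hypothetical number of vertices and limiting rate
probs = [lam / n] * n        # small inclusion probabilities summing to lam
pb = poisson_binomial_pmf(probs)
po = poisson_pmf(lam, n)

tv = 0.5 * sum(abs(u - v) for u, v in zip(pb, po))
bound = sum(p * p for p in probs)  # Le Cam bound on the total variation distance
print(tv, bound)
```

Marginalizing such a limit over the clusters, with one rate per cluster, gives a mixture of Poisson distributions.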

Proposition 2.2 implies that the size distribution of a random hyperedge generated under the ELCA model is more flexible than under the LCA model, since its limiting Poisson mixture has more components.

2.5 Co-Clustering

The concept of having two clustering structures is related to co-clustering or block clustering. In co-clustering, the objective is to simultaneously cluster the rows and columns of a data matrix. In particular, mixture models have been proposed, with EM algorithms developed, in the context of co-clustering (Govaert and Nadif, 2003, 2008). Co-clustering has also received significant attention in various applications such as text mining, bioinformatics and recommender systems (Dhillon et al., 2003; Cheng and Church, 2000; George and Merugu, 2005). In comparison, we aim to obtain two types of clustering structure for the rows of a data matrix.

In the work of Rau et al. (2015), a Poisson mixture model was proposed for clustering digital gene expression data to discover groups of co-expressed genes, where observations of biological entities under different conditions are collected. In order to model the variation in overall expression level among biological entities, a scaling parameter is introduced for each entity. In comparison, we explicitly model the size of a random hyperedge using clustering, which results in a more parsimonious model structure.

3 EM Algorithm

We estimate the parameters of the ELCA model using an EM algorithm (Dempster et al., 1977), which is a popular method for fitting mixture models. The E-step of the EM algorithm involves computing the expected value of the logarithm of the complete data likelihood (2) with respect to the distribution of the unobserved cluster memberships, given the observed data and the current estimates. The M-step involves maximizing the expected complete data log-likelihood.

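Taking the logarithm of the complete data likelihood in (2), we obtain the complete data log-likelihood function below, where $z_{ig}$ and $w_{ik}$ indicate the cluster and additional cluster of hyperedge $i$ and $a_k \phi_{gj}$ is the corresponding vertex inclusion probability.

$$\ell_c(\theta; x, z, w) = \sum_{i=1}^{M} \sum_{g=1}^{G} \sum_{k=1}^{K} z_{ig} w_{ik} \left[ \log \pi_g + \log \tau_k + \sum_{j=1}^{N} \left\{ x_{ij} \log(a_k \phi_{gj}) + (1 - x_{ij}) \log(1 - a_k \phi_{gj}) \right\} \right] \tag{3}$$

For the E-step, we need to evaluate the expectation of (3) conditional on the data $x$ and the current parameter estimates $\theta^{(t)}$.

That is, we need to evaluate the expectation $E[z_{ig} w_{ik} \mid x, \theta^{(t)}]$. We have that

$$E[z_{ig} w_{ik} \mid x, \theta^{(t)}] = \frac{\pi_g^{(t)} \tau_k^{(t)} \prod_{j=1}^{N} (a_k^{(t)} \phi_{gj}^{(t)})^{x_{ij}} (1 - a_k^{(t)} \phi_{gj}^{(t)})^{1 - x_{ij}}}{\sum_{g'=1}^{G} \sum_{k'=1}^{K} \pi_{g'}^{(t)} \tau_{k'}^{(t)} \prod_{j=1}^{N} (a_{k'}^{(t)} \phi_{g'j}^{(t)})^{x_{ij}} (1 - a_{k'}^{(t)} \phi_{g'j}^{(t)})^{1 - x_{ij}}} \tag{4}$$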

In particular, the E-step has a computational complexity of for each pair . While the E-step of the EM algorithm is straightforward, the M-step involves complicated maximization. Thus, we use the ECM algorithm (Meng and Rubin, 1993) which replaces the complex M-step by a series of simpler conditional maximizations. The conditional maximizations with respect to the parameters and do not have closed form solutions. We resort to the MM algorithm (Lange et al., 2000; Hunter and Lange, 2004) which works by lower bounding the objective function by a minorizing function and then maximizing the minorizing function. Details of the M-step are given in the appendix and the EM algorithm is summarized in Algorithm 1. In particular, we note that the computational complexity for maximizing and are given by and , respectively, where is the number of iterations required for the MM algorithm.
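The E-step amounts to computing, for every hyperedge, a joint posterior over pairs of cluster and additional cluster labels. A minimal sketch follows, assuming the inclusion probability is the product of a scaling `a[k]` and `phi[g][j]`; the function name `e_step` and the parameter values are hypothetical, and the M-step (which relies on the MM updates of the appendix) is not attempted here.

```python
import math

def e_step(x, pi, tau, phi, a):
    """Return resp[i][g][k], the posterior probability that hyperedge i has
    cluster label g and additional cluster label k under current parameters."""
    resp = []
    for row in x:
        log_w = [
            [
                math.log(p_g) + math.log(t_k) + sum(
                    math.log(a_k * q) if xi else math.log(1.0 - a_k * q)
                    for xi, q in zip(row, phi_g)
                )
                for t_k, a_k in zip(tau, a)
            ]
            for p_g, phi_g in zip(pi, phi)
        ]
        m = max(max(r) for r in log_w)  # subtract the maximum before exponentiating
        w = [[math.exp(v - m) for v in r] for r in log_w]
        norm = sum(sum(r) for r in w)
        resp.append([[v / norm for v in r] for r in w])
    return resp

x = [[1, 1, 0], [0, 0, 1]]
pi, tau = [0.6, 0.4], [0.3, 0.7]
phi = [[0.9, 0.8, 0.1], [0.2, 0.1, 0.7]]
a = [0.4, 1.0]

resp = e_step(x, pi, tau, phi, a)
for r in resp:
    print([[round(v, 3) for v in row] for row in r])
```

Each hyperedge requires one pass over the vertices for every label pair, and summing `resp[i][g][k]` over `k` or over `g` gives the marginal posterior for each of the two clusterings.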

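Algorithm 1 EM Algorithm
Input: incidence matrix $x$, number of clusters $G$, number of additional clusters $K$
Output: parameter estimates
Randomly initialize the parameters
while not converged do
    Do the E-step according to (4)
    for each cluster $g$ do
        for each vertex $j$ do
            Update $\phi_{gj}$ according to (7)
        end for
    end for
    for each additional cluster $k$ do
        Update $a_k$ according to (10)
    end for
    for each cluster $g$ do
        Update $\pi_g$ according to (11)
    end for
    for each additional cluster $k$ do
        Update $\tau_k$ according to (12)
    end for
    Evaluate the change in log-likelihood resulting from the parameter updates
    if the change is below the convergence tolerance then
        Declare convergence
    end if
end while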
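For the special case of a single additional cluster, the model is a plain latent class model and both updates of the M-step are available in closed form. The sketch below implements EM for that special case only; it is not the ECM scheme with MM updates used for the full model, and the function name, the clipping constant `eps` and the toy data are assumptions made for illustration.

```python
import math
import random

def fit_lca(x, n_classes, n_iter=200, tol=1e-8, seed=0):
    """EM for a binary latent class model. Returns (pi, phi, trace), where
    trace holds the log-likelihood at the start of each iteration."""
    rng = random.Random(seed)
    n_rows, n_cols = len(x), len(x[0])
    pi = [1.0 / n_classes] * n_classes
    phi = [[rng.uniform(0.25, 0.75) for _ in range(n_cols)] for _ in range(n_classes)]
    eps = 1e-6  # keeps phi away from 0 and 1 so that logarithms stay finite
    trace = []
    for _ in range(n_iter):
        # E-step: posterior class probabilities for every hyperedge.
        resp, loglik = [], 0.0
        for row in x:
            logs = [
                math.log(max(pi[g], 1e-300)) + sum(
                    math.log(phi[g][j]) if row[j] else math.log(1.0 - phi[g][j])
                    for j in range(n_cols)
                )
                for g in range(n_classes)
            ]
            m = max(logs)
            w = [math.exp(v - m) for v in logs]
            s = sum(w)
            loglik += m + math.log(s)
            resp.append([v / s for v in w])
        trace.append(loglik)
        if len(trace) > 1 and abs(trace[-1] - trace[-2]) < tol:
            break
        # M-step: weighted proportions, clipped to [eps, 1 - eps].
        for g in range(n_classes):
            n_g = sum(r[g] for r in resp)
            pi[g] = n_g / n_rows
            for j in range(n_cols):
                num = sum(r[g] * row[j] for r, row in zip(resp, x))
                raw = num / n_g if n_g > 0 else 0.5
                phi[g][j] = min(max(raw, eps), 1.0 - eps)
    return pi, phi, trace

# Toy data: two groups of hyperedges on four vertices, plus two noisy rows.
rows = [[1, 1, 0, 0]] * 6 + [[0, 0, 1, 1]] * 6 + [[1, 0, 0, 0], [0, 0, 0, 1]]
pi_hat, phi_hat, trace = fit_lca(rows, n_classes=2)
print([round(p, 2) for p in pi_hat])
print([[round(q, 2) for q in r] for r in phi_hat])
```

The log-likelihood trace is non-decreasing, as expected of EM, and can be used as the stopping criterion in the same way as the change in log-likelihood in Algorithm 1.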

4 Model Selection

4.1 Cross Validated Likelihood

Given a fixed model, the cross validated likelihood method (Smyth, 2000) works by repeatedly partitioning the observations into two disjoint sets: one is used to fit the model and obtain estimates of the model parameters by maximizing the log-likelihood, and the other is used to evaluate the model by computing its log-likelihood.

For each and , we define to be the ELCA model with clusters and additional clusters. To apply the cross validated likelihood method, we randomly partition the hyperedges into two sets and where each hyperedge in is included in with probability . In our applications we set . The EM algorithm developed in section 3 is then used to fit and obtain the parameter estimates . We then compute the log-likelihood of under the estimated parameters and obtain the test log-likelihood . The above procedure is then repeated times and the estimated cross validated log-likelihood is obtained by averaging over . The procedure above is summarized in Algorithm 2.

We perform a greedy search for the optimal combination of and which produces the largest estimated cross validated log-likelihood . Starting with one cluster and one additional cluster, an additional cluster is then successively added to the model until the estimated cross validated log-likelihood does not increase. At this stage, we then increment the number of clusters by 1 and the above procedure is repeated provided that . The greedy search algorithm is summarized in Algorithm 3. The greedy search can be computationally intensive when the search space for and the number of cross validation are large.

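Algorithm 2 Estimated Cross Validated Log-likelihood
Input: hyperedge set, candidate model, inclusion probability, number of repetitions
Output: estimated cross validated log-likelihood
for each repetition do
    Initialize the two sets of the partition as empty
    for each hyperedge do
        Draw a uniform random number on (0, 1)
        if the draw is below the inclusion probability then
            Add the hyperedge to the first set
        else
            Add the hyperedge to the second set
        end if
    end for
    Fit the model to the training set using the EM algorithm of Section 3
    Compute the log-likelihood of the test set under the estimated parameters
end for
Average the test log-likelihoods over the repetitions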
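The cross validation loop can be sketched independently of the model being fitted. In the code below, `fit` and `log_lik` are hypothetical callables standing in for the EM fit and the test log-likelihood; the default test probability of 0.5, the number of splits and the independent Bernoulli stand-in model are assumptions for illustration, not the settings used in the applications.

```python
import math
import random

def cv_log_likelihood(rows, fit, log_lik, test_prob=0.5, n_splits=20, seed=0):
    """Average held-out log-likelihood over random train/test partitions."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        train, test = [], []
        for row in rows:
            # Each hyperedge goes to the test set independently with test_prob.
            (test if rng.random() < test_prob else train).append(row)
        if not train or not test:
            continue  # skip degenerate partitions
        scores.append(log_lik(test, fit(train)))
    if not scores:
        raise ValueError("no usable partition was generated")
    return sum(scores) / len(scores)

# Stand-in model: independent Bernoulli vertices with smoothed frequencies.
def fit_bernoulli(train):
    n = len(train)
    return [(sum(col) + 1.0) / (n + 2.0) for col in zip(*train)]

def bernoulli_log_lik(test, probs):
    return sum(
        math.log(p) if xi else math.log(1.0 - p)
        for row in test
        for xi, p in zip(row, probs)
    )

rows = [[1, 1, 0, 0]] * 6 + [[0, 0, 1, 1]] * 6 + [[1, 0, 0, 0], [0, 0, 0, 1]]
score = cv_log_likelihood(rows, fit_bernoulli, bernoulli_log_lik)
print(score)
```

Fixing the seed makes the partitions reproducible, which matters when the scores of different candidate models are compared.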
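Algorithm 3 Greedy Search For Model Selection
Input: hyperedge set
Output: selected numbers of clusters and additional clusters
Set the number of clusters to 1
Set the number of additional clusters to 1
Obtain the estimated cross validated log-likelihood using Algorithm 2
while the search has not stopped do
    while adding an additional cluster keeps improving the estimate do
        Increase the number of additional clusters by 1
        Obtain the estimated cross validated log-likelihood using Algorithm 2
        if the estimate has increased then
            Keep the new number of additional clusters
        else
            Revert to the previous number of additional clusters
        end if
    end while
    Increase the number of clusters by 1
    Obtain the estimated cross validated log-likelihood using Algorithm 2
    if the estimate has increased then
        Keep the new number of clusters
    else
        Revert to the previous number of clusters and stop the search
    end if
end while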
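One possible reading of the greedy search is sketched below. The callable `score` stands in for the estimated cross validated log-likelihood of a model with a given number of clusters and additional clusters. The bounds `max_g` and `max_k`, and the choice to carry the number of additional clusters over when a cluster is added, are assumptions, since the description leaves both open.

```python
def greedy_search(score, max_g=10, max_k=10):
    """Greedy search over (clusters, additional clusters) that maximizes score(g, k)."""
    g, k = 1, 1
    best = score(g, k)
    while True:
        # Add additional clusters while the score keeps improving.
        while k < max_k:
            cand = score(g, k + 1)
            if cand <= best:
                break
            k, best = k + 1, cand
        # Then try one more cluster; stop as soon as that does not help.
        if g >= max_g:
            break
        cand = score(g + 1, k)
        if cand <= best:
            break
        g, best = g + 1, cand
    return g, k, best

# Toy score with a single peak at 3 clusters and 2 additional clusters.
def toy_score(g, k):
    return -((g - 3) ** 2) - (k - 2) ** 2

print(greedy_search(toy_score))  # (3, 2, 0)
```

On the toy score the search visits (1, 1), (1, 2), (2, 2) and (3, 2) and stops there. With a cross validated score, each call is a full round of model fitting, so caching the values of `score` is advisable.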

5 Applications

5.1 Star Wars Movie Scenes

Our first application is modeling the co-appearance of the main characters in the scenes of the movie “Star Wars: A New Hope”. We collected the script of the movie from The Internet Movie Script Database (movie script data are freely available at https://www.imsdb.com/) and constructed a hypergraph for the eight main characters. We define each scene in the movie as a hyperedge, giving a total of 178 hyperedges, and a character is contained in a scene if he or she speaks in the scene.

We first performed model selection using the greedy search algorithm and the cross validated likelihood method presented in Section 4.1 to select the optimal number of clusters and additional clusters for the ELCA model. The results of the greedy search are provided in Table A1 and the model with 3 clusters and 2 additional clusters is selected.

The results from fitting the ELCA model with $G = 3$ and $K = 2$ are provided in Table 1 and Table 2. The estimates of $\tau$ and $a$ show the variation in the size of hyperedges, with the majority of hyperedges having a size much smaller than the rest. Thus, one can deduce that a small proportion of the movie scenes have far more characters.

Table 1: Estimates of $\pi$, $\tau$ and $a$ from fitting the ELCA model with 3 clusters and 2 additional clusters for the Star Wars data set

Character    Cluster 1  Cluster 2  Cluster 3
Wedge        0.18       0.00       0.36
Han          0.00       1.00       0.00
Luke         1.00       1.00       0.00
C-3PO        0.75       0.30       0.00
Obi-Wan      0.00       0.00       1.00
Leia         0.12       0.48       0.07
Biggs        0.31       0.00       0.28
Darth Vader  0.19       0.35       0.06
Table 2: Estimates of $\phi$ from fitting the ELCA model with 3 clusters and 2 additional clusters for the Star Wars data set

The estimates in Table 2 reveal interesting clustering structure for the 8 main characters in the movie. For example, the lead character “Luke” has a strong tendency to appear in the two largest clusters. On the other hand, it is extremely unlikely for “Obi-Wan” and “Han” to appear in the same scene.
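The statement about “Obi-Wan” and “Han” can be read directly off Table 2. Given the cluster labels, appearances are conditionally independent, so the probability that two characters share a scene is proportional to the product of their entries in that cluster. The sketch below uses the rounded values from Table 2 and leaves out the scaling for the additional cluster, which only multiplies each product by a constant.

```python
# Rounded estimates taken from Table 2 (clusters 1 to 3).
phi = {
    "Han": [0.00, 1.00, 0.00],
    "Obi-Wan": [0.00, 0.00, 1.00],
    "Luke": [1.00, 1.00, 0.00],
}

def joint_by_cluster(first, second):
    """Product of the two characters' inclusion probabilities in each cluster."""
    return [p * q for p, q in zip(phi[first], phi[second])]

print(joint_by_cluster("Han", "Obi-Wan"))  # [0.0, 0.0, 0.0]
print(joint_by_cluster("Han", "Luke"))     # [0.0, 1.0, 0.0]
```

The product for “Han” and “Obi-Wan” is zero in every cluster, whereas “Han” and “Luke” co-appear with certainty (up to rounding and the scaling factor) in scenes from cluster 2.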

Figure 3: Probability of clusters for movie scenes in Star Wars data set

The estimated cluster assignment probabilities from the EM algorithm for each movie scene in the Star Wars movie are shown in chronological order in Figure 3. We can see from the plot that scenes in the early part of the movie are mainly associated with cluster 1, while cluster 2 contains most of the scenes from roughly scene 40 to scene 100. We can deduce from this, for example, that the character “Han” is very active in the middle part of the movie. On the other hand, there does not appear to be any obvious pattern for the third cluster. The clustering for many early and late movie scenes is relatively uncertain, as shown in the plot.

Figure 4: Ternary plot of the a posteriori group membership probabilities for the scenes in the Star Wars data set

The uncertainties in clustering are also illustrated in a ternary plot in Figure 4. Each dot in the plot represents a movie scene, and the three corners of the plot represent the three clusters. The closer a dot is to a corner, the higher the probability that the corresponding movie scene belongs to the corresponding cluster. The ternary plot in Figure 4 shows significant uncertainty in assigning a number of movie scenes to the first two clusters. This is reasonable since, for a number of characters including the lead character “Luke”, the probabilities of scene appearance are similar for the first two clusters.

5.2 Lady Gaga Concerts 2014

As a second application of the ELCA model, we collected the list of songs that Lady Gaga performed in all concerts in 2014 (the Lady Gaga setlist data are available at http://www.setlist.fm/). The data set contains 96 concerts with a total of 51 distinct songs performed. The hypergraph is constructed by defining each concert as a hyperedge and each song as a vertex. A vertex is contained in a hyperedge if the corresponding song is performed in the corresponding concert. The results of performing model selection using the approach of Section 4.1 are presented in Table A2. We can see from Table A2 that ELCA models with more than one additional cluster significantly outperform standard latent class analysis models.

The model with 5 clusters and 2 additional clusters was chosen and fitted to the data set. The estimates of $\pi$, $\tau$ and $a$ are given in Table 3. We can deduce from the estimates of $\tau$ and $a$ that there are a small number of very short concerts, whose length is approximately 14% of that of the remaining “full” concerts.

Table 3: Estimates of $\pi$, $\tau$ and $a$ from fitting the ELCA model with 5 clusters and 2 additional clusters for the Lady Gaga concerts 2014 data set

Table 4 shows the estimates of $\phi$, which describe the popularity of the 51 songs across the 5 clusters. One can see a small number of extremely popular songs which tend to be performed in most concerts, such as “Paparazzi”, “Bad Romance”, “Born This Way”, “G.U.Y.” and “Just Dance”. Among the least performed songs, “Fashion!” and “Cake Like Lady Gaga” tend to be performed in the same concert, while “Lush Life”, “It Don’t Mean a Thing (If It Ain’t Got That Swing)” and “But Beautiful” are more likely to be performed in the same concert.

Songs Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
Monster for Life 0.04 0.00 0.00 0.00 0.00
Fashion! 0.00 0.00 0.68 0.00 0.00
Paparazzi 1.00 1.00 1.00 1.00 1.00
Bad Romance 1.00 1.00 1.00 1.00 1.00
What’s Up 0.12 0.00 0.00 0.00 0.44
Sophisticated Lady 0.00 0.00 0.00 0.00 0.31
Dance in the Dark 0.00 0.00 0.00 0.00 0.44
Born This Way 1.00 1.00 1.00 1.00 1.00
Judas 1.00 0.87 0.00 0.00 0.00
Partynauseous 1.00 1.00 1.00 0.00 1.00
Yo and I 1.00 0.13 0.00 1.00 1.00
I Will Always Love You 0.00 0.03 0.00 0.00 0.00
Monster 0.00 0.00 0.00 1.00 0.00
Bang Bang (My Baby Shot Me Down) 0.72 0.00 0.00 0.00 1.00
The Queen 0.00 0.00 0.09 0.00 0.00
Dope 1.00 0.60 0.00 1.00 1.00
Jewels N’ Drugs 1.00 1.00 1.00 0.00 1.00
Hair 0.00 0.00 0.05 0.00 0.00
Mary Jane Holland 1.00 0.00 0.95 0.00 1.00
G.U.Y. 1.00 1.00 1.00 1.00 1.00
MANiCURE 1.00 1.00 1.00 0.00 1.00
Lush Life 0.00 0.00 0.00 0.10 0.51
It Don’t Mean a Thing (If It Ain’t Got That Swing) 0.00 0.00 0.00 0.00 0.49
You’ve Got a Friend 0.00 0.00 0.00 0.12 0.00
The Edge of Glory 1.00 1.00 0.04 0.00 1.00
Donatella 1.00 1.00 1.00 0.00 1.00
But Beautiful 0.00 0.00 0.00 0.00 0.31
Do What U Want 1.00 1.00 1.00 0.00 1.00
Gypsy 1.00 1.00 1.00 0.00 1.00
Applause 1.00 1.00 1.00 1.00 1.00
Marry the Night 0.04 0.00 0.05 0.00 0.44
Sexxx Dreams 1.00 0.97 1.00 1.00 1.00
Another One Bites the Dust 0.00 0.00 0.00 0.12 0.00
Just Dance 1.00 1.00 1.00 1.00 1.00
Cake Like Lady Gaga 0.00 0.00 0.68 0.00 0.00
I Can’t Give You Anything but Love, Baby 0.00 0.03 0.00 0.00 0.49
Black Jesus Amen Fashion 0.00 0.00 0.00 1.00 0.00
Ratchet 1.00 1.00 1.00 0.00 1.00
Aura 1.00 0.87 0.91 0.00 0.00
Poker Face 1.00 1.00 1.00 1.00 1.00
Venus 1.00 1.00 1.00 0.00 1.00
If I Ever Lose My Faith in You 0.00 0.00 0.00 0.12 0.00
Bell Bottom Blues 0.04 0.00 0.00 0.00 0.00
Whole Lotta Love 0.04 0.00 0.00 0.00 0.00
Telephone 1.00 1.00 1.00 0.00 1.00
Willkommen 0.16 0.00 0.00 0.00 0.00
Brooklyn Nights 0.00 0.03 0.00 0.00 0.00
Alejandro 1.00 1.00 1.00 0.00 1.00
ARTPOP 1.00 1.00 1.00 1.00 1.00
I’ve Got a Crush on You 0.00 0.00 0.05 0.00 0.00
Swine 1.00 0.97 1.00 0.00 1.00
Table 4: Estimates of $\phi$ from fitting the ELCA model with 5 clusters and 2 additional clusters to the Lady Gaga concerts 2014 data set

The estimated cluster assignment probabilities for each concert performed by Lady Gaga in 2014 are shown in chronological order in Figure 5. There is a strong association between clusters and time of the year. For example, the first 30 concerts performed in 2014 are mainly associated with cluster 1 where songs such as “Bad Romance”, “Judas” and “Aura” are among the most popular ones. On the other hand, the next 30 concerts are strongly associated with cluster 2 where songs such as “The Edge of Glory”, “Venus” and “Ratchet” are popular. The last 10 concerts of 2014 are mostly clustered into cluster 4 where songs such as “Yo and I”, “Monster” and “Black Jesus Amen Fashion” are frequently performed.

Figure 5: Probability of clusters for concerts in Lady Gaga Concerts 2014 data set

Figure 6 shows the distribution of hyperedge sizes (that is, the number of songs performed in each concert) along with the hyperedge size distributions estimated by the ELCA model with $G = 5$ and $K = 2$, and by the LCA model with 5 clusters. Adding an additional cluster to the model significantly improves the fit, especially in the tails of the distribution.

Figure 6: Size distribution of hyperedges of the Lady Gaga Concerts 2014 data set

6 Conclusion

In this paper, we have proposed the Extended Latent Class Analysis model as a generative model for random hypergraphs. The model introduces two clustering structures for hyperedges, which capture the variation in the sizes of hyperedges.

An EM algorithm has been developed for model fitting, where the M-step is implemented using an MM algorithm. Model selection is performed using the cross validated likelihood method to account for the small sample sizes relative to the number of vertices.

The model has been shown to give an improved fit relative to the Latent Class Analysis model for three illustrative examples. Furthermore, the fitted model reveals interesting and interpretable structure within the vertices and hyperedges.

References

  • Azondekon et al. (2018) Azondekon, R., Harper, Z. J., Agossa, F. R., Welzig, C. M., and McRoy, S. (2018), “Scientific authorship and collaboration network analysis on malaria research in Benin: Papers indexed in the Web of Science (1996–2016),” Global Health Research and Policy, 3, 11.
  • Borgatti and Everett (1997) Borgatti, S. P. and Everett, M. G. (1997), “Network analysis of 2-mode data,” Soc. Networks, 19, 243 – 269.
  • Celeux and Govaert (1991) Celeux, G. and Govaert, G. (1991), “Clustering criteria for discrete data and latent class models,” Journal of Classification, 8, 157–176.
  • Cheng and Church (2000) Cheng, Y. and Church, G. M. (2000), “Biclustering of expression data,” in Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, AAAI Press, pp. 93–103.
  • de Panafieu (2015) de Panafieu, É. (2015), “Phase transition of random non-uniform hypergraphs,” J. Discrete Algorithms, 31, 26–39.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977), “Maximum likelihood from incomplete data via the EM algorithm,” J. Roy. Statist. Soc. Ser. B, 39, 1–38, with discussion.
  • Dhillon et al. (2003) Dhillon, I. S., Mallela, S., and Modha, D. S. (2003), “Information-theoretic co-clustering,” in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 89–98.
  • Doreian and Batagelj (2004) Doreian, P. and Batagelj, V. (2004), “Generalized blockmodeling of two-mode network data,” Soc. Networks, 29–53.
  • Dyer et al. (2015) Dyer, M., Frieze, A., and Greenhill, C. (2015), “On the chromatic number of a random hypergraph,” J. Combin. Theory Ser. B, 113, 68–122.
  • Faust et al. (2002) Faust, K., Willert, K., Rowlee, D., and Skvoretz, J. (2002), “Scaling and statistical models for affiliation networks: patterns of participation among Soviet politicians during the Brezhnev era,” Soc. Networks, 24, 231–259.
  • Field et al. (2006) Field, S., Frank, K. A., Schiller, K., Riegle-Crumb, C., and Muller, C. (2006), “Identifying positions from affiliation networks: Preserving the duality of people and events,” Soc. Networks, 28, 97 – 123.
  • Friel et al. (2016) Friel, N., Rastelli, R., Wyse, J., and Raftery, A. E. (2016), “Interlocking directorates in Irish companies using a latent space model for bipartite networks,” Proc. Natl. Acad. Sci. U.S.A., 113, 6629–6634.
  • Fujimoto et al. (2011) Fujimoto, K., Chou, C.-P., and Valente, T. W. (2011), “The network autocorrelation model using two-mode data: Affiliation exposure and potential bias in the autocorrelation parameter,” Soc. Networks, 33, 231 – 243.
  • George and Merugu (2005) George, T. and Merugu, S. (2005), “A scalable collaborative filtering framework based on co-clustering,” in Proceedings of the Fifth IEEE International Conference on Data Mining, pp. 625–628.
  • Goldschmidt (2005) Goldschmidt, C. (2005), “Critical random hypergraphs: the emergence of a giant set of identifiable vertices,” Ann. Probab., 33, 1573–1600.
  • Goodman (1974) Goodman, L. A. (1974), “Exploratory latent structure analysis using both identifiable and unidentifiable models,” Biometrika, 61, 215–231.
  • Govaert and Nadif (2003) Govaert, G. and Nadif, M. (2003), “Clustering with block mixture models,” Pattern Recognition, 36, 463–473.
  • Govaert and Nadif (2008) — (2008), “Block clustering with Bernoulli mixture models: comparison of different approaches,” Comput. Statist. Data Anal., 52, 3233–3245.
  • Handcock et al. (2007) Handcock, M. S., Raftery, A. E., and Tantrum, J. M. (2007), “Model-based clustering for social networks,” J. Roy. Statist. Soc. Ser. A, 170, 301–354.
  • Hoff et al. (2002) Hoff, P. D., Raftery, A. E., and Handcock, M. S. (2002), “Latent space approaches to social network analysis,” J. Amer. Statist. Assoc., 97, 1090–1098.
  • Holland and Leinhardt (1981) Holland, P. W. and Leinhardt, S. (1981), “An exponential family of probability distributions for directed graphs,” J. Amer. Statist. Assoc., 76, 33–65.
  • Hunter and Lange (2004) Hunter, D. R. and Lange, K. (2004), “A tutorial on MM algorithms,” Amer. Statist., 58, 30–37.
  • Karoński and Łuczak (2002) Karoński, M. and Łuczak, T. (2002), “The phase transition in a random hypergraph,” J. Comput. Appl. Math., 142, 125–135.
  • Koskinen and Edling (2012) Koskinen, J. and Edling, C. (2012), “Modelling the evolution of a bipartite network - Peer referral in interlocking directorates,” Soc. Networks, 34, 309–322, Dynamics of Social Networks (2).
  • Lange et al. (2000) Lange, K., Hunter, D. R., and Yang, I. (2000), “Optimization transfer using surrogate objective functions,” J. Comput. Graph. Statist., 9, 1–59.
  • Latapy et al. (2008) Latapy, M., Magnien, C., and Vecchio, N. D. (2008), “Basic notions for the analysis of large two-mode networks,” Soc. Networks, 30, 31–48.
  • Latouche et al. (2011) Latouche, P., Birmelé, E., and Ambroise, C. (2011), “Overlapping stochastic block models with application to the French political blogosphere,” Ann. Appl. Stat., 5, 309–336.
  • Lazarsfeld and Henry (1968) Lazarsfeld, P. F. and Henry, N. W. (1968), Latent structure analysis, Houghton Mifflin, Boston, MA 02110, USA.
  • Lind et al. (2005) Lind, P. G., González, M. C., and Herrmann, H. J. (2005), “Cycles and clustering in bipartite networks,” Phys. Rev. E, 72, 056127.
  • Lunagómez et al. (2017) Lunagómez, S., Mukherjee, S., Wolpert, R. L., and Airoldi, E. M. (2017), “Geometric representations of random hypergraphs,” J. Amer. Statist. Assoc., 112, 363–383.
  • Meng and Rubin (1993) Meng, X.-L. and Rubin, D. B. (1993), “Maximum likelihood estimation via the ECM algorithm: a general framework,” Biometrika, 80, 267–278.
  • Moody (2004) Moody, J. (2004), “The structure of a social science collaboration network: disciplinary cohesion from 1963 to 1999,” Am. Sociol. Rev., 69, 213–238.
  • Newman (2004) Newman, M. E. (2004), Who Is the Best Connected Scientist? A Study of Scientific Coauthorship Networks, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 337–370.
  • Newman (2001a) Newman, M. E. J. (2001a), “Scientific collaboration networks. I. Network construction and fundamental results,” Phys. Rev. E, 64, 016131.
  • Newman (2001b) — (2001b), “Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality,” Phys. Rev. E, 64, 016132.
  • Nowicki and Snijders (2001) Nowicki, K. and Snijders, T. A. B. (2001), “Estimation and prediction for stochastic blockstructures,” J. Amer. Statist. Assoc., 96, 1077–1087.
  • Perugini et al. (2004) Perugini, S., Gonçalves, M. A., and Fox, E. A. (2004), “Recommender systems research: a connection-centric survey,” Journal of Intelligent Information Systems, 23, 107–143.
  • Poole (2015) Poole, D. (2015), “On the strength of connectedness of a random hypergraph,” Electron. J. Combin., 22, Paper 1.69, 16.
  • Rau et al. (2015) Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., and Celeux, G. (2015), “Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models,” Bioinformatics, 31, 1420.
  • Robins and Alexander (2004) Robins, G. and Alexander, M. (2004), “Small worlds among interlocking directors: network structure and distance in bipartite graphs,” Comput. Math. Organ. Theory, 10, 69–94.
  • Skvoretz and Faust (1999) Skvoretz, J. and Faust, K. (1999), “Logit models for affiliation networks,” Sociol. Methodol., 29, 253–280.
  • Smyth (2000) Smyth, P. (2000), “Model selection for probabilistic clustering using cross-validated likelihood,” Statistics and Computing, 10, 63–72.
  • Snijders et al. (2013) Snijders, T. A., Lomi, A., and Torló, V. J. (2013), “A model for the multiplex dynamics of two-mode and one-mode networks, with an application to employment preference, friendship, and advice,” Soc. Networks, 35, 265–276.
  • Stasi et al. (2014) Stasi, D., Sadeghi, K., Rinaldo, A., Petrovic, S., and Fienberg, S. (2014), “β models for random hypergraphs with a given degree sequence,” in Proceedings of COMPSTAT 2014—21st International Conference on Computational Statistics, pp. 593–600.
  • Wang et al. (2013) Wang, P., Pattison, P., and Robins, G. (2013), “Exponential random graph model specifications for bipartite networks - A dependence hierarchy,” Soc. Networks, 35, 211–222.
  • Wang et al. (2009) Wang, P., Sharpe, K., Robins, G., and Pattison, P. (2009), “Exponential random graph (p*) models for affiliation networks,” Soc. Networks, 31, 12–25.
  • Wang (1993) Wang, Y. H. (1993), “On the number of successes in independent trials,” Statist. Sinica, 3, 295–312.

Appendix A Proof of Proposition 2.1

Proof.

We can write where if node appears in the hyperedge and otherwise. Similarly, we write . Let be the latent cluster assignment of where if is generated from cluster . Let and be the latent cluster and additional clusters assignments of , where and if is generated from cluster and additional clusters . We have

For the variance of the LCA model, we have that

where

Hence, we have that

Now,

We have

Now,

To show the quantity above is non-negative, we have to show that

which follows from Jensen’s inequality. ∎
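The two ingredients of the argument, writing the hyperedge size as a sum of node-membership indicators and closing with Jensen's inequality, can be checked numerically on a toy example. The sketch below is illustrative only and does not reproduce the exact quantity bounded in the proof: it assumes a plain two-cluster mixture in which, given the cluster, each node joins the hyperedge independently. The names `pi`, `phi` and `size_pmf` and all numeric values are hypothetical and are not taken from the paper. The variance of the hyperedge size is computed once from the exact probability mass function and once from the law of total variance, whose between-cluster term is non-negative by Jensen's inequality.

```python
import numpy as np

# Hypothetical toy parameters (assumptions, not values from the paper):
# pi[g]     = probability that a hyperedge is generated from cluster g
# phi[i, g] = probability that node i appears in a hyperedge from cluster g
pi = np.array([0.3, 0.7])
phi = np.array([[0.9, 0.1],
                [0.8, 0.2],
                [0.1, 0.6],
                [0.2, 0.7]])


def size_pmf(p):
    """Exact pmf of a sum of independent Bernoulli(p[i]) indicators."""
    pmf = np.array([1.0])
    for p_i in p:
        # Adding one more indicator convolves the pmf with (1 - p_i, p_i).
        pmf = np.convolve(pmf, [1.0 - p_i, p_i])
    return pmf


n_nodes, n_clusters = phi.shape

# Mixture pmf of the hyperedge size, and its variance computed directly.
mix_pmf = sum(pi[g] * size_pmf(phi[:, g]) for g in range(n_clusters))
sizes = np.arange(n_nodes + 1)
mean_direct = np.sum(sizes * mix_pmf)
var_direct = np.sum(sizes**2 * mix_pmf) - mean_direct**2

# Law of total variance:
#   Var(size) = E[Var(size | cluster)] + Var(E[size | cluster]).
cond_mean = phi.sum(axis=0)                  # E[size | cluster g]
cond_var = (phi * (1.0 - phi)).sum(axis=0)   # Var(size | cluster g)
within = np.sum(pi * cond_var)
# Jensen's inequality gives E[m^2] >= (E[m])^2 for m = cond_mean,
# so the between-cluster term below cannot be negative.
between = np.sum(pi * cond_mean**2) - np.sum(pi * cond_mean)**2
var_total = within + between

print(var_direct, var_total, between)
```

The two variance computations agree up to floating-point error, and the between-cluster term is the part whose sign is controlled by Jensen's inequality; it is zero only when every cluster has the same expected hyperedge size.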

Appendix B M-step of EM Algorithm

For the M-step, we need to maximize with respect to the model parameters , , and .

B.1 Maximize w.r.t.

For fixed and , the objective function retaining terms involving can be written as

(5)

Since an analytic expression for does not exist due to the term , we apply the MM (Minorization Maximization) algorithm (Hunter and Lange, 2004). We first apply a quadratic lower bound on the concave function for . We let

We then have