I Introduction
Clustering, a fundamental task in data mining and machine learning, aims to divide a set of entities into several groups such that entities in the same group are more similar to each other than to those in other groups. In
graph clustering or partitioning, the entities are modeled as the vertices of a graph and their similarities are encoded in the edges. In this setting, the goal is to group the vertices into clusters such that there are more edges within each cluster than across clusters. While graphs serve as a popular tool to model pairwise relationships, in many real-world applications the entities engage in more complicated, higher-order relationships. For example, in co-authorship networks [17] more than two authors can interact in writing a manuscript. Hypergraphs can be used to represent such datasets, where the notion of an edge is extended to a hyperedge that can connect more than two vertices. Existing research on hypergraph partitioning mainly follows two directions. One is to project a hypergraph onto a proxy graph via hyperedge expansion, after which graph partitioning methods can be directly leveraged [2, 38, 1]. Another is to represent hypergraphs using tensors and adopt tensor decomposition algorithms [32, 16, 5, 20].

To better accommodate hypergraphs for the representation of real-world data, several extensions of the classical hypergraph model have recently been proposed [23, 3, 7, 18, 30]. These more elaborate models consider different types of vertices or hyperedges, or different levels of relations. In this paper, we consider edge-dependent vertex weights (EDVWs) [7], which can be used to reflect the different importance or contribution of the vertices in a hyperedge. This model is highly relevant in practice. For example, an e-commerce system can be modeled as a hypergraph with EDVWs where users and products are respectively modeled as vertices and hyperedges, and EDVWs represent the quantity of a product in a user's shopping basket [22]. EDVWs can also be used to model the relevance of a word to a document in text mining [18], the probability of an image pixel belonging to a segment in image segmentation [14], and the author positions in a co-authorship or citation network [7], to name a few.

A large portion of clustering algorithms focus on one-way clustering, i.e., clustering data entities based on their features and, in the hypergraph setting, clustering vertices based on hyperedges. Indeed, in [18], a hypergraph partitioning algorithm was proposed to cluster the vertices in a hypergraph with EDVWs. However, simultaneously clustering (or co-clustering) both vertices and hyperedges is more desirable in many applications, including text mining [12, 11], product recommendation [35], and bioinformatics [6, 8]. Moreover, co-clustering can exploit the duality between data entities and features to effectively deal with high-dimensional and sparse data [11, 26].
In this paper, we study the problem of co-clustering vertices and hyperedges in a hypergraph with EDVWs.
Our contributions can be summarized as follows:
(i) We define a Laplacian for hypergraphs with EDVWs through random walks on vertices and hyperedges and show its equivalence to the Laplacian of a specific digraph obtained via a modified star expansion of the hypergraph.
(ii) We propose a spectral hypergraph coclustering method based on the proposed hypergraph Laplacian.
(iii) We validate the effectiveness of the proposed method via numerical experiments on realworld datasets.
Notation: The (i, j)-th entry of a matrix X is denoted by X_ij or [X]_ij. The operations (.)^T and tr(.) represent transpose and trace, respectively. The symbols 1 and I refer to the all-ones vector and the identity matrix, where the sizes are clear from context. I_n and 0_{m x n} refer to the identity matrix of size n and the all-zero matrix of size m x n, respectively. Diag(x) denotes a diagonal matrix whose diagonal entries are given by the vector x. Finally, [X; Y] represents the matrix obtained by vertically concatenating the two matrices X and Y, while [X, Y] denotes their horizontal concatenation.

II Preliminaries
II-A Hypergraphs with edge-dependent vertex weights
Hypergraphs are generalizations of graphs where edges can connect more than two vertices. In this paper, we consider the hypergraph model with EDVWs [7] as defined next.
Definition 1.
A hypergraph with EDVWs H = (V, E, omega, gamma) consists of a set of vertices V, a set of hyperedges E where every hyperedge e in E is a subset of the vertex set, a weight omega(e) for every hyperedge e in E, and a weight gamma_e(v) for every hyperedge e in E and every vertex v in e.
The difference between the above hypergraph model and the typical model considered in most existing papers is the introduction of the EDVWs gamma_e(v). The motivation is to enable the model to describe cases in which the vertices in the same hyperedge contribute differently to that hyperedge. For example, in a co-authorship network, every author (vertex) in general has a different degree of contribution to a paper (hyperedge), usually represented by the order of the authors. This information is lost in traditional hypergraph models, but it can be easily encoded through EDVWs.
For convenience, let the matrix R of size |E| x |V| collect the edge-dependent vertex weights, with R_{ev} = gamma_e(v) if v in e and R_{ev} = 0 otherwise. Also, let W of size |V| x |E| collect the hyperedge weights, with W_{ve} = omega(e) if v in e and W_{ve} = 0 otherwise. Throughout the paper we assume that the hypergraph is connected.
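As a concrete illustration, the following sketch builds R and W for a toy hypergraph with two hyperedges (the data and variable names are ours; only the matrix conventions come from the definitions above):

```python
import numpy as np

# Toy hypergraph with EDVWs (illustrative data, not from the paper):
# vertices 0..3, hyperedges e0 = {0, 1, 2} and e1 = {2, 3}.
hyperedges = [{0, 1, 2}, {2, 3}]
omega = np.array([1.0, 2.0])                  # hyperedge weights omega(e)
gamma = {                                     # EDVWs gamma_e(v), one dict per hyperedge
    0: {0: 3.0, 1: 1.0, 2: 1.0},
    1: {2: 2.0, 3: 1.0},
}

n, m = 4, len(hyperedges)

# R of size |E| x |V|: R[e, v] = gamma_e(v) if v in e, else 0.
R = np.zeros((m, n))
for e, members in enumerate(hyperedges):
    for v in members:
        R[e, v] = gamma[e][v]

# W of size |V| x |E|: W[v, e] = omega(e) if v in e, else 0.
W = np.zeros((n, m))
for e, members in enumerate(hyperedges):
    for v in members:
        W[v, e] = omega[e]
```

Note that R and W share the same sparsity pattern (up to transposition), which encodes the vertex-hyperedge incidence.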
II-B Spectral graph partitioning
Given an undirected graph with n vertices, the goal of graph partitioning is to divide its vertex set into k disjoint subsets (clusters) S_1, ..., S_k such that there are more (heavily weighted) edges inside each cluster and fewer edges across clusters, while the clusters are also balanced in size.(1)

(1) Although there are different variations of the graph partitioning problem [4], this is the one that we adopt in this paper.
To formalize this problem, let A, D, and L = D - A denote the weighted adjacency matrix, the degree matrix, and the combinatorial graph Laplacian, respectively. Denote by S a subset of vertices and by S_bar its complement. Then, the cut between S and S_bar is defined as the sum of the weights of the edges across them, whereas the volume of S is defined as the sum of the weighted degrees of the vertices in S. More formally, we have

cut(S, S_bar) = sum_{i in S} sum_{j in S_bar} A_ij,    vol(S) = sum_{i in S} D_ii.

One well-known measure for evaluating a partition is the normalized cut (Ncut) [33], defined as

Ncut(S_1, ..., S_k) = sum_{l=1}^{k} cut(S_l, S_l_bar) / vol(S_l).
If we define an n x k matrix H whose entries are

H_il = 1 / sqrt(vol(S_l)) if vertex i in S_l, and H_il = 0 otherwise,    (1)

then it can be shown that Ncut(S_1, ..., S_k) = tr(H^T L H). Thus, we can write the problem of minimizing the Ncut as

min_{S_1, ..., S_k} tr(H^T L H)  subject to  H^T D H = I and H defined as in (1).    (2)
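The identity Ncut = tr(H^T L H) and the constraint H^T D H = I can be checked numerically on a small example (a minimal sketch; the graph and the partition are illustrative):

```python
import numpy as np

# Small undirected weighted graph: two triangles joined by one edge.
A = np.array([
    [0, 2, 2, 1, 0, 0],
    [2, 0, 2, 0, 0, 0],
    [2, 2, 0, 0, 0, 0],
    [1, 0, 0, 0, 2, 2],
    [0, 0, 0, 2, 0, 2],
    [0, 0, 0, 2, 2, 0],
], float)
D = np.diag(A.sum(axis=1))
L = D - A                                     # combinatorial Laplacian

# Partition into S_1 = {0,1,2} and S_2 = {3,4,5}; build H as in (1).
clusters = [np.array([0, 1, 2]), np.array([3, 4, 5])]
H = np.zeros((6, 2))
for l, S in enumerate(clusters):
    H[S, l] = 1.0 / np.sqrt(D[S, S].sum())    # 1 / sqrt(vol(S_l)) on S_l

# Direct Ncut: sum over clusters of cut(S, S_bar) / vol(S).
ncut = 0.0
for S in clusters:
    S_bar = np.setdiff1d(np.arange(6), S)
    ncut += A[np.ix_(S, S_bar)].sum() / D[S, S].sum()

assert np.isclose(ncut, np.trace(H.T @ L @ H))   # Ncut = tr(H^T L H)
assert np.allclose(H.T @ D @ H, np.eye(2))       # constraint H^T D H = I
```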
The spectral graph partitioning method [33] relaxes (2) into a continuous optimization problem by dropping its second constraint, i.e., by letting H be an arbitrary real matrix satisfying H^T D H = I. The solution to the relaxed problem is given by the generalized eigenvectors of (L, D) associated with the k smallest generalized eigenvalues, stacked as the columns of H. Then, k-means [25] can be applied to the rows of H to obtain the desired clusters S_1, ..., S_k.

III The Proposed Hypergraph Co-clustering
III-A Star expansion and hypergraph Laplacians
We project the hypergraph onto a directed graph G via the so-called star expansion, where we replace each hyperedge with a star graph. More precisely, we introduce a new vertex for every hyperedge e in E, so that the vertex set of G is V united with E. The graph G connects each new vertex representing a hyperedge e with each vertex v in e through two directed edges (one in each direction) that we weigh differently, as explained next.
We consider a random walk on the hypergraph (equivalently, on G) in which we walk from a vertex v to a hyperedge e containing v with probability proportional to omega(e), and then walk from e to a vertex w contained in e with probability proportional to gamma_e(w). We define two matrices P_VE (of size |V| x |E|) and P_EV (of size |E| x |V|) to collect the transition probabilities from vertices to hyperedges and from hyperedges to vertices, respectively. The corresponding entries are given by

P_VE(v, e) = omega(e) / sum_{e' containing v} omega(e'),    P_EV(e, v) = gamma_e(v) / sum_{v' in e} gamma_e(v').

Then, the transition probability matrix P associated with a random walk on G can be written as

P = [0, P_VE; P_EV, 0].
When the hypergraph is connected, the graph G is strongly connected, and thus the random walk defined by P is irreducible (every vertex can reach every other vertex). Moreover, it is periodic since G is bipartite: once we start at a vertex v in V, we can only return to v after an even number of steps.
It is well known that a random walk has a unique stationary distribution if it is irreducible and aperiodic [9]. To fix the above periodicity problem, we introduce self-loops in G and define a new transition probability matrix P_alpha = alpha * I + (1 - alpha) * P, where alpha is in (0, 1). The matrix P_alpha defines a random walk (the so-called lazy random walk) in which, at each discrete time point, we take a step of the original random walk with probability 1 - alpha and stay at the current vertex with probability alpha. The stationary distribution phi of the lazy random walk is the all-positive dominant left eigenvector of P_alpha, i.e., phi^T P_alpha = phi^T, scaled to satisfy phi^T 1 = 1. Notice that different choices of alpha lead to the same phi.
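The construction of P and the independence of phi from alpha can be sketched as follows (toy data; all variable names are ours):

```python
import numpy as np

# A toy instance of the vertex-hyperedge random walk (illustrative data).
n, m = 5, 3
mask = np.array([[1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0],
                 [1, 0, 0, 1, 1]], bool)       # which vertices belong to which hyperedge
rng = np.random.default_rng(0)
R = mask * (rng.random((m, n)) + 0.1)          # EDVWs gamma_e(v)
W = mask.T * (rng.random(m) + 0.1)             # hyperedge weights omega(e)

P_VE = W / W.sum(axis=1, keepdims=True)        # vertex -> hyperedge, proportional to omega(e)
P_EV = R / R.sum(axis=1, keepdims=True)        # hyperedge -> vertex, proportional to gamma_e(v)
P = np.block([[np.zeros((n, n)), P_VE],
              [P_EV, np.zeros((m, m))]])       # random walk on the star expansion

def stationary(alpha):
    """Stationary distribution of the lazy walk P_alpha = alpha*I + (1-alpha)*P."""
    P_lazy = alpha * np.eye(n + m) + (1 - alpha) * P
    vals, vecs = np.linalg.eig(P_lazy.T)
    phi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])  # left eigenvector for eigenvalue 1
    return phi / phi.sum()                                # scale so that phi^T 1 = 1

phi = stationary(0.5)
assert np.all(phi > 0) and np.isclose(phi.sum(), 1.0)
assert np.allclose(phi, stationary(0.1))       # different alpha, same phi
```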
TABLE I: Datasets used in the experiments.

Datasets      | Subsets   | # documents | # words | Classes
20 Newsgroups | Dataset 1 | 3,863       | 2,000   | comp.os.ms-windows.misc, rec.autos, sci.crypt, talk.politics.guns
20 Newsgroups | Dataset 2 | 5,663       | 2,000   | alt.atheism, comp.graphics, misc.forsale, rec.sport.hockey, sci.electronics, talk.politics.mideast
RCV1          | Dataset 3 | 4,000       | 2,000   | CCAT, ECAT, GCAT, MCAT
RCV1          | Dataset 4 | 8,000       | 2,000   | C15, C18, E31, E41, GCRIM, GDIS, M11, M14
Given P_alpha and phi, we generalize the directed combinatorial Laplacian and the normalized Laplacian [9] to hypergraphs as follows

L = Phi - (Phi P_alpha + P_alpha^T Phi) / 2,    (3)

𝓛 = I - (Phi^{1/2} P_alpha Phi^{-1/2} + Phi^{-1/2} P_alpha^T Phi^{1/2}) / 2,    (4)

where Phi = Diag(phi).
It can be readily verified that (3) and (4) are respectively equal to the combinatorial and normalized Laplacians of the undirected graph defined by the weighted adjacency matrix

A = (Phi P_alpha + P_alpha^T Phi) / 2,    (5)

where Phi is the corresponding degree matrix.
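The equivalence between (3)-(4) and the Laplacians of the undirected graph in (5) can be checked numerically (a minimal sketch with illustrative data; we fix alpha = 0.5):

```python
import numpy as np

# Toy hypergraph and its star-expansion random walk (illustrative data).
n, m = 5, 3
mask = np.array([[1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0],
                 [1, 0, 0, 1, 1]], bool)       # hyperedge membership (connected)
rng = np.random.default_rng(1)
R = mask * (rng.random((m, n)) + 0.1)          # EDVWs gamma_e(v)
W = mask.T * (rng.random(m) + 0.1)             # hyperedge weights omega(e)

P_VE = W / W.sum(axis=1, keepdims=True)
P_EV = R / R.sum(axis=1, keepdims=True)
P = np.block([[np.zeros((n, n)), P_VE], [P_EV, np.zeros((m, m))]])
alpha = 0.5
P_lazy = alpha * np.eye(n + m) + (1 - alpha) * P

# Stationary distribution phi of the lazy walk.
vals, vecs = np.linalg.eig(P_lazy.T)
phi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
phi = phi / phi.sum()

Phi = np.diag(phi)
Phi_sq, Phi_isq = np.diag(phi ** 0.5), np.diag(phi ** -0.5)
L = Phi - (Phi @ P_lazy + P_lazy.T @ Phi) / 2                      # eq. (3)
Ln = np.eye(n + m) - (Phi_sq @ P_lazy @ Phi_isq
                      + Phi_isq @ P_lazy.T @ Phi_sq) / 2           # eq. (4)
A = (Phi @ P_lazy + P_lazy.T @ Phi) / 2                            # eq. (5)
D = np.diag(A.sum(axis=1))

assert np.allclose(A, A.T)                     # A is a valid undirected adjacency matrix
assert np.allclose(np.diag(D), phi)            # its degree matrix equals Phi
assert np.allclose(L, D - A)                   # (3) is the combinatorial Laplacian of A
assert np.allclose(Ln, np.eye(n + m) - Phi_isq @ A @ Phi_isq)  # (4) is its normalized Laplacian
```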
III-B Spectral hypergraph partitioning
We can leverage the hypergraph Laplacians proposed in Section III-A to apply spectral graph partitioning methods (as introduced in Section II-B) to hypergraphs. More precisely, we compute the generalized eigenvectors of the generalized eigenproblem L x = lambda Phi x associated with the k smallest eigenvalues, collect them as the columns of X, and then cluster the rows of X using k-means. Note that L can be written as Phi^{1/2} 𝓛 Phi^{1/2}, implying that (lambda, Phi^{1/2} x) is an eigenpair of the normalized Laplacian 𝓛. Hence, if z is an eigenvector of 𝓛, then x = Phi^{-1/2} z solves the generalized eigenproblem.
Since obtaining eigenvectors can be computationally challenging, we show next how to compute the eigenvectors of 𝓛 from a matrix of smaller size. To do this, let us first rewrite P_alpha and Phi in block form as

P_alpha = alpha I + (1 - alpha) [0, P_VE; P_EV, 0],    Phi = [Phi_V, 0; 0, Phi_E],

where Phi_V and Phi_E contain the stationary probabilities of the vertices and the hyperedges, respectively.
Proposition 1.

Define the following matrix

C = (Phi_V^{1/2} P_VE Phi_E^{-1/2} + Phi_V^{-1/2} P_EV^T Phi_E^{1/2}) / 2,    (6)

and denote by u and v the left and right singular vectors of C associated with the singular value sigma, respectively. Then, the vector z = [u; v] is the eigenvector of 𝓛 associated with the eigenvalue lambda = (1 - alpha)(1 - sigma).

Proof.
Let us rewrite 𝓛 as

𝓛 = (1 - alpha) (I - [0, C; C^T, 0]).    (7)

Split its eigenvector z into two parts z = [z_V; z_E], where z_V and z_E respectively have length |V| and |E|. Then, from 𝓛 z = lambda z we have

[0, C; C^T, 0] [z_V; z_E] = (1 - lambda / (1 - alpha)) [z_V; z_E],

and it follows that

C z_E = sigma z_V,    C^T z_V = sigma z_E,

with sigma = 1 - lambda / (1 - alpha). When lambda is different from 1 - alpha, i.e., sigma is nonzero, u = z_V and v = z_E are respectively the left and right singular vectors of C and sigma is the corresponding singular value. ∎
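Proposition 1 can be sanity-checked numerically; in the sketch below, C is a random matrix standing in for (6) rather than one built from an actual hypergraph (purely illustrative):

```python
import numpy as np

# Numerical check of Proposition 1 with a random stand-in for the matrix C in (6).
rng = np.random.default_rng(2)
n, m, alpha = 5, 3, 0.5
C = rng.random((n, m))
Ln = (1 - alpha) * (np.eye(n + m) - np.block([[np.zeros((n, n)), C],
                                              [C.T, np.zeros((m, m))]]))  # the form in (7)

U, s, Vt = np.linalg.svd(C)
for i in range(min(n, m)):
    z = np.concatenate([U[:, i], Vt[i, :]])    # z = [u; v]
    lam = (1 - alpha) * (1 - s[i])             # eigenvalue predicted by Proposition 1
    assert np.allclose(Ln @ z, lam * z)        # (lam, z) is indeed an eigenpair
```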
Based on Proposition 1, our proposed spectral hypergraph co-clustering algorithm is given by the following steps:
1) Compute the left and right singular vectors of C associated with the k largest singular values, collected as the columns of U and V, respectively.
2) Leverage Proposition 1 to form X = Phi^{-1/2} [U; V].
3) (Optional) Normalize the rows of X to have unit norm.
4) Apply k-means to the rows of X (or its normalized version).
The optional normalization step above is inspired by the spectral partitioning algorithm proposed in [29]. In the next section, we denote the variant of our algorithm without normalization as sspec1, whereas the one that implements the third step above is denoted as sspec2.
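Putting the pieces together, the four steps above can be sketched end to end as follows (toy hypergraph; we use scikit-learn's KMeans for step 4, and all variable names are ours):

```python
import numpy as np
from sklearn.cluster import KMeans

# End-to-end sketch of the proposed co-clustering on a toy hypergraph.
n, m, k = 6, 4, 2
mask = np.array([[1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 0, 0],
                 [0, 0, 0, 1, 1, 1],
                 [0, 0, 1, 1, 1, 1]], bool)    # hyperedge membership (connected)
rng = np.random.default_rng(3)
R = mask * (rng.random((m, n)) + 0.1)          # EDVWs
W = mask.T * (rng.random(m) + 0.1)             # hyperedge weights

P_VE = W / W.sum(axis=1, keepdims=True)
P_EV = R / R.sum(axis=1, keepdims=True)
P = np.block([[np.zeros((n, n)), P_VE], [P_EV, np.zeros((m, m))]])

# Stationary distribution (independent of the lazy-walk parameter alpha).
vals, vecs = np.linalg.eig(P.T)
phi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
phi = phi / phi.sum()
phi_V, phi_E = phi[:n], phi[n:]

# The matrix C of (6).
C = 0.5 * (np.diag(phi_V ** 0.5) @ P_VE @ np.diag(phi_E ** -0.5)
           + np.diag(phi_V ** -0.5) @ P_EV.T @ np.diag(phi_E ** 0.5))

# Steps 1-2: top-k singular vectors, assembled into X = Phi^{-1/2} [U; V].
U, s, Vt = np.linalg.svd(C)
X = np.vstack([U[:, :k], Vt[:k, :].T]) / (phi[:, None] ** 0.5)

# Step 3 (sspec2): row normalization; Step 4: k-means on the rows.
X = X / np.linalg.norm(X, axis=1, keepdims=True)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
doc_labels, word_labels = labels[:n], labels[n:]   # vertices first, hyperedges last
```

Since the first n rows of X embed the vertices and the remaining m rows embed the hyperedges, a single k-means run co-clusters both.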
How to choose the parameter alpha? From Proposition 1 and (7) we can see that the choice of alpha affects the eigenvalues of 𝓛 but does not change its eigenvectors (or their order). Hence, the proposed spectral clustering method is independent of alpha.

IV Experiments
In this section, we evaluate the performance of the proposed methods via numerical experiments.(2) We consider two widely used real-world text datasets: 20 Newsgroups(3) and Reuters Corpus Volume 1 (RCV1) [21]. Both of them contain documents in different categories. We extract two subsets of documents from each of them to build datasets of different levels of difficulty (datasets 1 and 3 are easier than datasets 2 and 4; see Table I). We consider the 2,000 most frequent words in the corpus after removing stop words as well as words appearing in too many or too few of the documents.

(2) The code needed to replicate the numerical experiments presented in this paper can be found at https://github.com/yuzhu2019/hypergraph_cocluster.
(3) http://qwone.com/~jason/20Newsgroups/
To model text datasets using hypergraphs with EDVWs, we follow the procedure in [18]. More precisely, we consider documents as vertices and words as hyperedges. A document (vertex) belongs to a word (hyperedge) if the word appears in the document. The EDVWs (the entries in R) are taken as the corresponding tf-idf (term frequency-inverse document frequency) values, which reflect how relevant a word is to a document in a collection of documents. The weight associated with a hyperedge is computed as the standard deviation of the entries in the corresponding row of R.

We compare the proposed methods (sspec1 and sspec2) with the following three methods. (i) The naive method (naive): we run k-means on the columns and the rows of the tf-idf matrix to cluster documents and words, respectively. (ii) Bipartite spectral graph partitioning (bispec) [12]: the dataset is modeled as an (undirected) bipartite graph between documents and words, and then a spectral graph partitioning algorithm is applied; see Section II-B. (iii) Clique expansion (cspec, Algorithm 1 in [18]): this method projects the hypergraph with EDVWs onto a proxy graph via the so-called clique expansion, and then applies a spectral graph partitioning algorithm. We consider it as the state-of-the-art method. Since cspec can only cluster the vertices (and not the hyperedges), we build a hypergraph as mentioned above to cluster documents, and then construct another hypergraph in which we take words as vertices and documents as hyperedges to cluster words. Notice that, of the above-mentioned methods, only the proposed methods (sspec1 and sspec2) and bispec can co-cluster documents and words.
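The text-to-hypergraph construction described above can be sketched with scikit-learn's TfidfVectorizer (a minimal illustration on a toy corpus standing in for 20 Newsgroups / RCV1; the real experiments use the datasets of Table I):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only).
docs = [
    "windows file driver dos",
    "encryption key cipher security",
    "windows driver update file",
    "key cipher security algorithm",
]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs).toarray()      # |documents| x |words|

# Documents are vertices, words are hyperedges; tf-idf entries are the EDVWs.
R = tfidf.T                                    # R[e, v] = tf-idf of word e in document v
omega = R.std(axis=1)                          # hyperedge weight: std of the row of R
W = (tfidf > 0) * omega                        # W[v, e] = omega(e) if word e appears in doc v
```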
To evaluate the clustering performance, we consider four metrics, namely, clustering accuracy (ACC), normalized mutual information (NMI), weighted F1 score (F1), and adjusted Rand index (ARI) [15]. For all of them, a larger value indicates better performance. Notice that there are no ground-truth classes for words. Hence, following [13], we consider the class-conditional word distribution. More precisely, we compute the aggregate word distribution for each document class, and then assign every word to the class in which it has the highest probability in the aggregate distribution. We regard this assignment as the ground truth for performance evaluation.
The numerical results (averaged over repeated runs of k-means) are shown in Fig. 1. We first notice that, of the proposed methods, sspec2 usually performs better than sspec1. This is in line with [29], where it was observed that the lack of a normalization step (as in our sspec1) might lead to performance degradation when the connectivity within each cluster varies substantially across clusters. It can also be seen that the proposed methods and cspec tend to work better than the naive method and the classical bipartite spectral graph partitioning method. This underscores the value of the hypergraph model considered. Importantly, sspec2 achieves clustering accuracy similar to that of the state-of-the-art cspec for documents but tends to perform better in clustering words. Moreover, the proposed methods achieve small standard deviations, indicating their robustness to different centroid initializations in k-means.
Having shown the superior performance of sspec2, we now present visualizations of its application to Dataset 1 to further illustrate its effectiveness. In Fig. 2, we depict the embeddings of documents and words obtained by sspec2 by mapping them to a 2D space using t-SNE [27]. We can see that documents and words in the same class appear to form groups. In Fig. 3, we plot the word clouds(4) for the words predicted in the classes 'comp.os.ms-windows.misc' (Microsoft Windows operating system) and 'sci.crypt' (cryptography). The size of a word is determined by its frequency in the documents predicted in the same class and is thus able to reveal its importance in the class. We can see that the top words (such as windows, file, dos, ms in 'comp.os.ms-windows.misc') align well with our intuitive understanding of the class topics.

(4) https://github.com/amueller/word_cloud
V Conclusions
We developed valid Laplacian matrices for hypergraphs with EDVWs, based on which we proposed spectral partitioning algorithms for co-clustering vertices and hyperedges. Through real-world text mining applications, we showcased the value of considering hypergraph models and demonstrated the effectiveness of our proposed methods. Future research avenues include: (i) developing alternative co-clustering methods in which the spectral clustering step is replaced by non-negative matrix tri-factorization algorithms [13, 31, 36] applied to matrices related to the hypergraph Laplacians; (ii) generalizing additional existing digraph Laplacians [24, 10] to the hypergraph case; and (iii) studying the use of the hypergraph model with EDVWs in other network analysis tasks such as hypergraph alignment [37, 34, 28]. Related to this last point, the fact that our proposed methods embed vertices and hyperedges in the same vector space (as shown in Fig. 2) facilitates the development of embedding-based hypergraph alignment algorithms [19].
References
 [1] (2006) Higher order learning with graphs. In ICML, pp. 17-24.
 [2] (2005) Beyond pairwise clustering. In CVPR, Vol. 2, pp. 838-845.
 [3] (2018) Heterogeneous hypernetwork embedding. In ICDM, pp. 875-880.
 [4] (2016) Recent advances in graph partitioning. Algorithm Engineering, pp. 117-158.
 [5] (2017) The Fiedler vector of a Laplacian tensor for hypergraph partitioning. SIAM Journal on Scientific Computing 39(6), pp. A2508-A2537.
 [6] (2000) Biclustering of expression data. In ISMB, Vol. 8, pp. 93-103.
 [7] (2019) Random walks on hypergraphs with edge-dependent vertex weights. In ICML, pp. 1172-1181.
 [8] (2004) Minimum sum-squared residue co-clustering of gene expression data. In SDM, pp. 114-125.
 [9] (2005) Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics 9(1), pp. 1-19.
 [10] (2020) Hermitian matrices for clustering directed graphs: insights and applications. In AISTATS, pp. 983-992.
 [11] (2003) Information-theoretic co-clustering. In KDD, pp. 89-98.
 [12] (2001) Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, pp. 269-274.
 [13] (2006) Orthogonal nonnegative matrix t-factorizations for clustering. In KDD, pp. 126-135.
 [14] (2010) Interactive image segmentation using probabilistic hypergraphs. Pattern Recognition 43(5), pp. 1863-1873.
 [15] (2016) Analysis of network clustering algorithms and cluster quality metrics at scale. PLoS ONE 11(7).
 [16] (2015) A provable generalized tensor spectral method for uniform hypergraph partitioning. In ICML, pp. 400-409.
 [17] (2009) Understanding importance of collaborations in co-authorship networks: a supportiveness analysis approach. In SDM, pp. 1112-1123.
 [18] (2020) Hypergraph random walks, Laplacians, and clustering. In CIKM, pp. 495-504.
 [19] (2018) REGAL: representation learning-based graph alignment. In CIKM, pp. 117-126.
 [20] (2019) Community detection for hypergraph networks via regularized tensor power iteration. arXiv preprint arXiv:1909.06503.
 [21] (2004) RCV1: a new benchmark collection for text categorization research. JMLR 5, pp. 361-397.
 [22] (2018) E-tail product return prediction via hypergraph-based local graph cut. In KDD, pp. 519-527.
 [23] (2017) Inhomogeneous hypergraph clustering with applications. In NIPS, pp. 2308-2318.
 [24] (2010) Random walks on digraphs, the generalized digraph Laplacian and the degree of asymmetry. In International Workshop on Algorithms and Models for the Web-Graph, pp. 74-85.
 [25] (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), pp. 129-137.
 [26] (2005) Co-clustering by block value decomposition. In KDD, pp. 635-640.
 [27] (2008) Visualizing data using t-SNE. JMLR 9, pp. 2579-2605.
 [28] (2016) Triangular alignment (TAME): a tensor-based approach for higher-order network alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics 14(6), pp. 1446-1458.
 [29] (2002) On spectral clustering: analysis and an algorithm. In NIPS, pp. 849-856.
 [30] (2021) Signal processing on higher-order networks: livin' on the edge... and beyond. arXiv preprint arXiv:2101.05510.
 [31] (2012) Graph dual regularization non-negative matrix factorization for co-clustering. Pattern Recognition 45(6), pp. 2237-2250.
 [32] (2006) Multi-way clustering using super-symmetric non-negative tensor factorization. In ECCV, pp. 595-608.
 [33] (2000) Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), pp. 888-905.
 [34] (2014) Mapping users across networks by manifold alignment on hypergraph. In AAAI, Vol. 28.
 [35] (2014) Improving co-cluster quality with application to product recommendations. In CIKM, pp. 679-688.
 [36] (2011) Fast nonnegative matrix tri-factorization for large-scale data co-clustering. In IJCAI.
 [37] (2008) Probabilistic graph and hypergraph matching. In CVPR, pp. 1-8.
 [38] (2007) Learning with hypergraphs: clustering, classification, and embedding. In NIPS, pp. 1601-1608.