Structural representations of data are ubiquitous in domains such as biological networks, online social networks, information networks, and co-authorship networks. The problem of community detection, or graph clustering, aims to identify densely connected groups of nodes in the network [fortunato2010community] and is one of the central tasks in network analysis. Useful applications include finding clusters in protein-protein interaction networks [ravasz2002hierarchical] or groups of people with similar interests in social networks [watts2002identity]. Recently, it has become easier to collect content-centric networks in a time-sensitive way, enabling tightly integrated analysis across the different factors that affect network structure. Aggregating topological and content information can enable more informative community detection, in which cues from different sources are integrated into more powerful models.
Another important aspect of complex networks is that such networks evolve, meaning that nodes may move from one community to another, making some communities grow, and others shrink. For example, authors that usually publish in the data mining community could move to the machine learning community. Furthermore, the temporal aspects of changes in community structure could interact with the content in unusual ways. For example, it is possible for an author in a bibliographic network to change their topic of work, preceding a corresponding change in community structure. The converse is also possible, with a change in community structure affecting content-centric attributes.
Matrix factorization methods are traditional techniques that allow us to reduce the dimensional space of network adjacency representations. Such methods have broad applicability in various tasks such as clustering, dimensionality reduction, latent semantic analysis, and recommender systems. The main point of matrix factorization methods is that they embed matrices in a latent space where the clustering characteristics of the data are often amplified. A useful variant of matrix factorization methods is shared matrix factorization, which factors two or more different matrices simultaneously. Shared matrix factorization is not new, and is used in various settings where different matrices define different parts of the data (e.g., links and content). This method can be used to embed link and content in a shared feature space, which is convenient because it allows the use of traditional clustering techniques, such as k-means. However, incorporating a temporal aspect into the shared factorization process adds some challenges concerning the adjustment of the shared factorization as data evolves.
In Figure 1, we illustrate the broad principles of our approach. We factorize link and content over time at once. The matrix represents link and content in a shared dimension and each snapshot of time will have a different matrix . These different values of the matrix provide insights at various temporal snapshots, and therefore they can be used for temporal community detection as shown in Figure 2.
Related to the problem of community detection is that of community prediction, in which one attempts to predict future communities from previous snapshots of the network. It is notoriously difficult to predict future clustering structures from a complex combination of data types such as links and content. However, the matrix factorization methodology provides a useful abstraction, because one can now use the multidimensional representations created by sequences of matrices over different snapshots. The basic idea is that we can consider each of the entries in the latent space representation as a stream of evolving entries, which implicitly creates a time-series that can be used to predict future communities, as illustrated in Figure 3.
In this work, we present Chimera, a method that uses link and content from networks over time to detect and predict community structure. To the best of our knowledge, there is no work addressing these three aspects simultaneously for both detection and prediction of communities. The main contributions of this paper are:
An efficient algorithm based on shared matrix factorization that uses link and content over time; the uniform nature of the embedding allows the use of any traditional clustering algorithm on the corresponding representation.
A method for predicting future communities from embeddings over snapshots.
II Related Work
In this section, we review the existing work for community detection using link analysis, content analysis, temporal analysis and their combination. Since these methods are often proposed in different contexts (sometimes even by different communities), we will organize these methods into separate sections.
Topological Community Detection
These methods are based mainly on links among nodes. The idea is to minimize the number of edges between nodes belonging to different communities. Thus, the nodes inside a community should have a higher density of edges with other nodes inside the community than with nodes outside it. There are several ways of defining and quantifying communities based on their topology, including modularity [louvain], conductance [Leskovec2008], betweenness [girvan2002community], and spectral partitioning [Barnes1982]. More can be found in [fortunato2010community].
Content-Centric Community Detection
Topic modeling is a common approach for content analysis and is often used for clustering, in addition to dimensionality reduction. PLSA-PHITS [hofman2008bayesian] and LDA [Cohn:2000] are the most traditional methods for content analysis, but they are susceptible to words that appear very few times. Extended methods that are more reliable are Link-PLSA-LDA [Nallapati:2008] and Community-User-Topic model [Zhou:2006]. In most cases, the combination of link and content provides insights that are missing with the use of a single modality.
Link and Temporal Community Detection
A few authors address the problem of temporal community detection, which aims to identify how communities emerge, grow, combine, and decay over time [lin2008facetnet, kawadia2012sequential, chakrabarti2006evolutionary, kim2009particle]. tang2011dynamic use temporal Dirichlet processes to detect communities and track their evolution. chen2013detecting tackle the problem of overlapping temporal communities. bazzi2016community propose the detection of communities in temporal networks represented as multilayer networks. Pietilanen:2012 identify clusters of nodes that are frequently connected for long periods of time, and such sets of nodes are referred to as temporal communities. He201587 propose an algorithm for dynamic community detection in temporal networks, which takes advantage of community information at previous time steps. Yu2017WSDM present a model-based matrix factorization for link prediction and also for community prediction. However, their work uses only links for the prediction process.
Link and Content-Centric Community Detection
In recent years, some approaches were developed to use link and content information for community detection [Ruan:2013, Yang:2009, liu2015community, xu2014exploiting]. Among them, probabilistic models have been applied to fuse content analysis and link analysis in a unified framework. Examples include generative models that combine a generative linkage model with a generative content-centric model through some shared hidden variables [cohn2001missing, nallapati2008joint]. A discriminative model is proposed by Yang:2009, where a conditional model for link analysis and a discriminative model for content analysis are unified. In addition to probabilistic models, some approaches integrate the two aspects from other directions. For instance, a similarity-based method [zhou2009graph] adds virtual attribute nodes and edges to a network, and computes the similarity based on the augmented network. gupta2010nonnegative use matrix factorization to combine sources to improve tagging. None of the aforementioned works combines all three factors of link, content, and temporal information within a unified framework, in part because these modalities interact with one another in complex ways. The use of latent factors is therefore a particularly convenient way to achieve this goal.
There has been a growing interest in the dynamics of communities in evolving social networks, with recent studies addressing the problem of building a predictive model for community detection. Most of the community prediction techniques described in these works address community evolution prediction, which aims to predict events such as growth, survival, shrinkage, splits and merges [Ngonmang2010, Bringmann2010, Ngonmang2013, Saganowski2013, Takaffoli2014, Sharma2015, Saganowski20152]. In [Ilhan2013, Ilhan2015], the authors use ARIMA models to predict community events in a network without using any previous community detection method. ILHAN2016 propose to use a small number of features to predict community events. Pavlopoulou2017 employ several structural and temporal features to represent communities and improve community evolution prediction.
The community prediction addressed in our work covers not only community evolution but also node-level predictions: the community each node will belong to and whether its community will change. We do so by using topological characteristics and also content associated with nodes.
III Problem Definition
We assume we have graphs that form a time-series. The graphs are defined over a fixed set of nodes of cardinality . In each timestamp, a different set of edges may exist over time. For example, in the case of a co-authorship network, the node set may correspond to the authors in the network, and the graph might correspond to the co-author relations among them in the th year. These co-author relations are denoted by the adjacency matrix . Note that the entries in need not be binary, but might contain arbitrary weights. For example, in a co-authorship network, the entries might correspond to the number of publications between a pair of authors. For undirected graphs, the adjacency matrix is symmetric, while in directed graphs the adjacency matrix is asymmetric. Our approach can handle both settings. Hence, the graph is denoted by the pair .
We assume that for each timestamp, we have a content matrix with one row per node; each row contains the attribute values representing the content for that node at that timestamp. For example, in the case of the co-authorship network, the number of columns might correspond to the lexicon size, and each row might contain the word frequencies of various keywords in the titles. Therefore, one can fully represent the content and structural pair at each timestamp with a triplet consisting of the node set, the adjacency matrix, and the content matrix.
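The representation described above can be illustrated with a minimal sketch that builds one weighted adjacency matrix and one word-frequency content matrix from a list of papers. The input format (author-index lists plus title tokens), the helper name `build_snapshot`, and the toy vocabulary are assumptions made for illustration; they are not taken from the original work.

```python
import numpy as np

def build_snapshot(papers, n_nodes, vocab):
    """Build (adjacency, content) matrices for one timestamp.

    papers: list of (author_ids, title_tokens) pairs (hypothetical input format).
    vocab:  dict mapping keyword -> column index of the content matrix.
    """
    adjacency = np.zeros((n_nodes, n_nodes))
    content = np.zeros((n_nodes, len(vocab)))
    for author_ids, tokens in papers:
        # Each co-authored paper adds 1 to the weight of every author pair.
        for i in author_ids:
            for j in author_ids:
                if i != j:
                    adjacency[i, j] += 1
        # Each author of the paper accumulates the paper's keyword counts.
        for token in tokens:
            if token in vocab:
                for i in author_ids:
                    content[i, vocab[token]] += 1
    return adjacency, content

# Toy data for a single year (illustrative only).
vocab = {"graph": 0, "cluster": 1, "topic": 2}
papers_2013 = [([0, 1], ["graph", "cluster"]), ([0, 1, 2], ["graph", "topic"])]
A, C = build_snapshot(papers_2013, n_nodes=3, vocab=vocab)
```

Because co-authorship is undirected, the resulting adjacency matrix is symmetric, and its entries are publication counts rather than binary indicators.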
In this paper, we study the problem of content-centric community detection in networks. We study two problems: temporal community detection, and community prediction. While the problem of temporal community detection has been studied in the literature, as presented in Section II, the problem of community prediction, as defined in this work, has not been studied to any significant extent. We define these problems as follows.
Definition 1 (Temporal Community Detection).
Given a sequence of snapshots of graphs , with adjacency matrices , and content matrices , create a clustering of the nodes into partitions at each timestamp .
The clustering of the nodes at each timestamp may use only the graph snapshots up to and including time . Furthermore, the clusters in successive timestamps should be temporally related to one another. Such a clustering provides better insights about the evolution of the graph. In this sense, the clustering of the nodes for each timestamp will be somewhat different from what is obtained using an independent clustering of the nodes at each timestamp.
Definition 2 (Temporal Community Prediction).
Given a sequence of snapshots of graphs with adjacency matrices , and content matrices , predict the clustering of the nodes into partitions at future timestamp .
The community prediction problem attempts to predict the communities at a future timestamp, before the structure of the network is known. To the best of our knowledge, this problem is new, and it has not been investigated elsewhere in the literature. Note that the temporal community prediction problem is more challenging than temporal community detection, because it requires us to predict the community structure of the nodes without any knowledge of the adjacency matrix at that timestamp.
Temporal prediction is generally a much harder problem in the structural domain of networks as compared to the multidimensional setting. In the multidimensional domain, one can use numerous time-series models such as the auto-regressive (AR) model to predict future trends. However, in the structural domain, it is far more challenging to make such predictions.
IV Mathematical Model
In this section, we discuss the optimization model for converting the temporal sequences of graphs and content to a multidimensional time-series. To achieve this goal, we use a non-negative matrix factorization framework. Although the non-negativity is not essential, one advantage is that it leads to a more interpretable analysis. Consider a setting in which the rank of the factorization is denoted by . The basic idea is to use three sets of latent factor matrices in a shared factorization process, which is able to combine content and structure in a holistic way:
The matrix is an matrix, which is specific to each timestamp . Each row of the matrix describes the -dimensional latent factors of the corresponding node at time stamp , while taking into account both the structural and content information.
The matrix is an matrix, which is global to all timestamps. Each row of the matrix describes the -dimensional latent factors of the corresponding node over all time stamps, based on only the structural information.
The matrix is an matrix, which is global to all timestamps. Each row of the matrix describes the -dimensional latent factors of one of the keywords over all time stamps, based on only the content information.
The matrices are more informative than the other matrices, because they contain latent information specific to the content and structure, and they are also specific to each timestamp. However, the matrices and are global, and they contain only information corresponding to the structure and the content in the nodes, respectively. This is a setting that is particularly suitable to shared matrix factorization, where the matrices are shared between the factorization of the adjacency and content matrices.
Therefore, we would like to approximately factorize the adjacency matrices as , for all . Similarly, we would like to approximately factorize the content matrices as . With this setting, we propose the following optimization problem:
Where is a balancing parameter, is the regularization parameter, and is a regularization term to avoid overfitting. The notation denotes the squared Frobenius norm, which is the sum of the squares of the entries in the matrix. The regularization term is defined as
We would also like to ensure that the embeddings between successive timestamps do not change suddenly because of random variations. For example, an author might publish together with a pair of authors every year, but might not be publishing in a particular year because of random variations. To ensure that the predicted values do not change suddenly, we add a temporal regularization term:
This additional regularization term ensures the variables in any pair of successive years do not change suddenly. The additional regularization term is added to the objective function, after multiplying it with . The enhanced objective function is defined as
In order to ensure a more interpretable solution, we impose non-negativity constraints on the factor matrices
One challenge with this optimization model is that it can become very large. The main size of the optimization model is a result of the adjacency matrix. The content matrix is often manageable, because one can often reduce the keyword-lexicon in many real settings. However, the adjacency matrix scales with the square of the number of nodes, which can be onerous in real settings. An important observation here is that the adjacency matrix is sparse, and most of its values are zeros. Therefore, one can often use sampling on the zero entries of the adjacency matrix in order to reduce the complexity of the problem. This also has a beneficial effect of ensuring that the solution is not dominated by the zeros in the matrix.
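As a rough sketch of the kind of objective this section describes, the following Python evaluates a shared-factorization loss with a balancing weight, an L2 regularizer, and a temporal smoothness penalty. All names (`U_list`, `V`, `W`, `beta`, `lam`, `gamma`) and the exact form of both regularizers are assumptions chosen for illustration; the original formulation may scale or define these terms differently.

```python
import numpy as np

def shared_factorization_loss(A_list, C_list, U_list, V, W, beta, lam, gamma):
    """Evaluate a shared-factorization loss with temporal smoothing.

    U_list: per-timestamp node factors (shared between links and content).
    V, W:   global factors for structure and content, respectively.
    The symbol names and the L2 form of the regularizers are assumptions.
    """
    loss = 0.0
    for A, C, U in zip(A_list, C_list, U_list):
        loss += np.sum((A - U @ V.T) ** 2)         # link reconstruction error
        loss += beta * np.sum((C - U @ W.T) ** 2)  # content reconstruction error
        loss += lam * np.sum(U ** 2)               # penalty on per-timestamp factors
    loss += lam * (np.sum(V ** 2) + np.sum(W ** 2))  # penalty on global factors
    # Temporal smoothness: penalize change between successive embeddings.
    for U_prev, U_next in zip(U_list[:-1], U_list[1:]):
        loss += gamma * np.sum((U_next - U_prev) ** 2)
    return loss
```

When the factors reproduce every snapshot exactly and do not change over time, only the L2 penalty contributes to the value returned.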
IV-A Solving the Optimization Model
In this section, we discuss a gradient-descent approach for solving the optimization model. The basic idea is to compute the gradient of with respect to the various parameters. Note that can be seen as the “prediction” of the value of . Obviously, this predicted value may not be the same as the observed entries in the adjacency matrices. Similarly, while the product predicts , the predicted values may be different from the observed values. The gradient descent steps are dependent on the errors of the prediction. Therefore, we define the error for the structural and content-centric entries as and . Also let , with , since the difference is not defined at this boundary value.
Our goal is to compute the partial derivative of with respect to the various optimization variables, and then use it to construct the gradient-descent steps. By computing the partial derivatives of (4) with respect to each of the decision variables, we obtain
The gradient-descent steps use these partial derivatives for the updates. The gradient-descent steps may be written as
Here, is the step-size, which is a small value, such as 0.01. The matrices , , and are initialized to non-negative values in , and the updates (9–11) are performed until convergence or until a pre-specified number of iterations is performed. Non-negativity constraints are enforced by setting an entry in these matrices to zero whenever it becomes negative due to the updates.
is a sparse matrix, and should be stored using sparse data structures. As a practical matter, it makes sense to first compute those entries in that correspond to non-zero entries in , and then store those entries using a sparse matrix data structure. This is because a matrix may be too large to hold using a non-sparse representation.
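A minimal sketch of this sparse treatment is given below: the non-zero adjacency entries are kept, a random subset of zero entries is sampled, and prediction errors are computed only at those positions. The dictionary-based storage, the function names, and the uniform sampling scheme are assumptions for illustration only.

```python
import random
import numpy as np

def sample_entries(edges, n_nodes, n_zero_samples, seed=0):
    """Return the non-zero entries plus a random sample of zero entries.

    edges: dict mapping (i, j) -> weight for the non-zero adjacency entries.
    """
    rng = random.Random(seed)
    # Cap the request at the number of zero entries that actually exist.
    n_zero_samples = min(n_zero_samples, n_nodes * n_nodes - len(edges))
    sampled = dict(edges)
    while len(sampled) < len(edges) + n_zero_samples:
        i, j = rng.randrange(n_nodes), rng.randrange(n_nodes)
        if (i, j) not in sampled:  # skip edges and zeros that were already drawn
            sampled[(i, j)] = 0.0
    return sampled

def sparse_errors(entries, U, V):
    """Compute a_ij - u_i . v_j only at the listed (i, j) positions."""
    rows = np.array([i for i, _ in entries])
    cols = np.array([j for _, j in entries])
    vals = np.array(list(entries.values()), dtype=float)
    # Row-wise dot products avoid materializing the full n x n product.
    predicted = np.einsum("ik,ik->i", U[rows], V[cols])
    return rows, cols, vals - predicted
```

The returned triplet of row indices, column indices, and error values is a coordinate-style sparse representation, so memory grows with the number of stored entries rather than with the square of the number of nodes.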
The set of updates above is typically performed “simultaneously”, so that the entries in , and (on the right-hand side) are fixed to their values in the previous iteration during a particular block of updates. Only after the new values of , , and have been computed (using temporary variables) can they be used on the right-hand side in the next iteration.
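The procedure described above (error computation, simultaneous updates, and projection onto non-negative values) might be sketched as follows. This is an illustrative dense implementation under assumed names and an assumed L2-regularized, temporally smoothed objective; the constant factor of 2 in the gradients is absorbed into the step size, and the small initialization scale is a choice made here for stability, not a value from the original work.

```python
import numpy as np

def fit(A_list, C_list, k, beta=1.0, lam=0.01, gamma=0.1, alpha=0.01,
        n_iter=500, seed=0):
    """Projected gradient descent for a shared factorization (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    n, d, T = A_list[0].shape[0], C_list[0].shape[1], len(A_list)
    # Non-negative random initialization (the 0.1 scale is an assumption).
    U = [0.1 * rng.random((n, k)) for _ in range(T)]
    V = 0.1 * rng.random((n, k))
    W = 0.1 * rng.random((d, k))
    for _ in range(n_iter):
        E = [A - Ut @ V.T for A, Ut in zip(A_list, U)]  # structural errors
        F = [C - Ut @ W.T for C, Ut in zip(C_list, U)]  # content errors
        # All gradients below use only values from the previous iteration.
        grad_V = lam * V - sum(Et.T @ Ut for Et, Ut in zip(E, U))
        grad_W = lam * W - beta * sum(Ft.T @ Ut for Ft, Ut in zip(F, U))
        new_U = []
        for t in range(T):
            grad = lam * U[t] - E[t] @ V - beta * F[t] @ W
            if t > 0:          # smoothness with the previous snapshot
                grad += gamma * (U[t] - U[t - 1])
            if t < T - 1:      # smoothness with the next snapshot
                grad -= gamma * (U[t + 1] - U[t])
            # Negative entries are reset to zero after the step.
            new_U.append(np.maximum(U[t] - alpha * grad, 0.0))
        V = np.maximum(V - alpha * grad_V, 0.0)
        W = np.maximum(W - alpha * grad_W, 0.0)
        U = new_U  # temporary variables become current only now
    return U, V, W
```

Collecting the new per-timestamp factors in `new_U` before assigning them plays the role of the temporary variables mentioned above. A sparse variant would compute the error terms only at stored entries.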
IV-B Complexity Analysis
With the algorithm fully specified, we can now analyze its asymptotic complexity. Per gradient descent iteration, the computational cost of the algorithm is the sum of (i) the complexity of evaluating the objective function (4) and (ii) the complexity of the update step (12). Recall from section IV that , , , , and have dimensions , , , , and , respectively. Since matrix factorization reduces the dimensions of the data, we can safely assume and that .
Assuming the basic matrix multiplication algorithm is used, the complexity of multiplying matrices of dimensions and is . Therefore, the complexity of computing , since the norm can be computed by iterating over all elements of the matrix, squaring and summing them. Hence, the complexity of evaluating the objective function (4) is
To obtain the asymptotic complexity of the updates, note that , and . Hence, , , , and . Therefore, the asymptotic complexity of the gradient descent update is .
V Applications to Clustering
V-A Temporal Community Detection
The learned factor matrices can be used for temporal community detection. In this context, the matrix is very helpful in determining the communities at time , because it accounts for structure, content, and smoothness constraints. The overall approach is:
Extract the rows from , so that each of the rows is associated with a timestamp from . This timestamp will be used in step 3 of the algorithm.
Cluster the rows into clusters using a k-means clustering algorithm.
Partition each into its different timestamped clusters , depending on the timestamp of the corresponding rows.
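The three steps above can be sketched as follows. The small deterministic k-means routine is a stand-in written for this example (any standard clustering implementation could be substituted), and the names `temporal_communities` and `U_list` are hypothetical.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Plain Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Empty clusters keep their previous center.
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def temporal_communities(U_list, k):
    # Step 1: stack all rows and remember the timestamp of each row.
    X = np.vstack(U_list)
    timestamps = np.repeat(np.arange(len(U_list)), [U.shape[0] for U in U_list])
    # Step 2: cluster all timestamped rows together.
    labels = kmeans(X, k)
    # Step 3: split the cluster labels back by timestamp.
    return [labels[timestamps == t] for t in range(len(U_list))]
```

Because all snapshots are clustered jointly, a node keeps the same label across timestamps unless its embedding drifts toward another cluster.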
In most cases, the clusters will be such that the different avatars of the th row in will belong to the same cluster. However, in some cases, rows may drift from one cluster to the other. Furthermore, some clusters may shrink with time, whereas others may increase with time. All these aspects provide interesting insights about the community structure in the network. Even though the data is clustered into groups, it is often possible for one or more timestamps to contain clusters without any members. This is likely when the number of clusters expands or shrinks with time.
V-B Temporal Community Prediction
This approach can also be naturally used for community prediction. The basic idea here is to treat as a time-series of matrices, and predict how the weights evolve with time. The overall approach is as follows:
For each of the non-zero entries of matrix , represent the time series .
Use an autoregressive model on each time series to predict the value at the future timestamp for each of the non-zero entries. Set all other entries of the predicted matrix to 0.
Perform node clustering on the rows of to create the predicted node clusters at time . This provides the predicted communities at a future timestamp.
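The first two steps above might look like the sketch below, which fits a first-order autoregressive model with an intercept to each entry by least squares. The model order, the intercept, the rule that an entry counts as non-zero if it is non-zero in any snapshot, and the clipping to non-negative values are all assumptions made for this illustration. The final clustering step would then be applied to the rows of the returned matrix.

```python
import numpy as np

def predict_next_embedding(U_list):
    """Predict the next embedding matrix entry by entry with an AR(1) model.

    Assumption: an entry is treated as non-zero if it is non-zero in any snapshot.
    """
    series = np.stack(U_list)  # shape (T, n, k)
    T, n, k = series.shape
    predicted = np.zeros((n, k))
    for i in range(n):
        for j in range(k):
            x = series[:, i, j]
            if not np.any(x):
                continue  # all-zero entries stay at 0
            # Least-squares fit of x[t] = a * x[t-1] + b.
            design = np.column_stack([x[:-1], np.ones(T - 1)])
            (a, b), *_ = np.linalg.lstsq(design, x[1:], rcond=None)
            # Clip so the predicted embedding stays non-negative.
            predicted[i, j] = max(a * x[-1] + b, 0.0)
    return predicted
```

With only a handful of snapshots, a low-order model such as this one is about as much as each series can support; longer histories would allow higher orders.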
Thus, Chimera can provide not only the communities in the current timestamp, but also the communities in a future timestamp.
This section describes the experimental results of the approach. We describe the datasets, evaluation methodology, and the results obtained.
A key point in choosing a dataset to evaluate algorithms such as Chimera is that there must be co-evolving interactions between network and content. In order to check our model’s consistency, and to have a fair comparison with other state-of-the-art algorithms, we generated two synthetic datasets.
Each synthetic dataset was generated in the following way: first, we create the matrix with 5 groups. Then, we follow a randomized approach to rewire edges. According to some probability, we connect edges from one group to another. In this dataset, all link matrices () have 5,000 nodes and 20,000 edges. For the content matrices (), we generate five groups of five words. As in the link case, we have a probability of a word being in more than one group. Due to the nature of its construction, all content matrices have 25 words. For transitioning between timestamps, we have another probability that defines whether a node changes group or not. The transitions are constrained to be at most 10% of the nodes. We generated 3 timestamps for each synthetic dataset. The rewire probabilities used in each synthetic dataset were (Synthetic 1) and (Synthetic 2).
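A generator in the spirit of this description could be sketched as below, covering the link structure and the group transitions between timestamps (content generation is omitted). The function names, the way edges are drawn, and all probability values are hypothetical, since the exact procedure and parameters are not recoverable from the text.

```python
import random

def make_links(n_nodes, n_edges, groups, p_rewire, rng):
    """Sample undirected edges; with probability p_rewire an edge crosses groups."""
    members = {}
    for node, g in enumerate(groups):
        members.setdefault(g, []).append(node)
    edges = set()
    while len(edges) < n_edges:
        u = rng.randrange(n_nodes)
        if rng.random() < p_rewire:
            # Rewired edge: the other endpoint comes from a different group.
            other = rng.choice([g for g in members if g != groups[u]])
            v = rng.choice(members[other])
        else:
            # Default edge: the other endpoint comes from the same group.
            v = rng.choice(members[groups[u]])
        if u != v:
            edges.add((min(u, v), max(u, v)))
    return edges

def next_groups(groups, n_groups, p_move, rng, max_fraction=0.10):
    """Move nodes to another group with probability p_move, capped at max_fraction."""
    new_groups = list(groups)
    budget = int(max_fraction * len(groups))
    for node in rng.sample(range(len(groups)), len(groups)):
        if budget > 0 and rng.random() < p_move:
            new_groups[node] = rng.choice(
                [g for g in range(n_groups) if g != groups[node]])
            budget -= 1
    return new_groups
```

Calling `next_groups` once per transition and `make_links` once per timestamp yields a sequence of snapshots whose planted groups drift slowly, which is the property the evaluation relies on.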
We used the arXiv API (https://arxiv.org/help/api/index) to download information about preprints submitted to the arXiv system. We extracted information about 7107 authors during a period of five years (from 2013 to 2017). We used the papers’ titles and abstracts to build the author-content network with 10256 words, selecting words with more than 25 occurrences after removal of stop words and stemming. Since every preprint submitted to the arXiv has a category, we used the category information as a group label. We selected 10 classes: cs.IT, cs.LG, cs.DS, cs.CV, cs.SI, cs.AI, cs.NI, cs, math, and stat. Authors were added to the set of authors if they published for at least three years in the five-year period we consider. In years without publications, we assume authors belong to the temporally-closest category.
There are several metrics for evaluating cluster quality. We use two well-known supervised metrics: the Jaccard index and cluster purity. Cluster purity [Manning:2008] measures the quality of the communities by examining the dominant class in a given cluster. It ranges from 0 to 1, with higher purity values indicating better clustering performance.
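For concreteness, both metrics can be computed as in the sketch below. Purity follows the dominant-class definition given above. The text does not spell out which form of the Jaccard index is used, so the pair-counting variant shown here is an assumption.

```python
from collections import Counter
from itertools import combinations

def purity(true_labels, pred_labels):
    """Fraction of nodes that belong to the dominant class of their cluster."""
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(true_labels)

def pairwise_jaccard(true_labels, pred_labels):
    """Pair-counting Jaccard index (an assumed variant of the metric)."""
    both = either = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        both += same_true and same_pred    # pair grouped together in both
        either += same_true or same_pred   # pair grouped together in at least one
    return both / either if either else 1.0
```

Note that purity alone rewards over-segmentation (one node per cluster yields a purity of 1), which is one reason to report it alongside a second metric.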
We compared our approach with state-of-the-art algorithms in four categories: Content-only, Link-only, Temporal-Link-only and Link-Content-only. By following this approach, we are also able to isolate the specific effects of using data in different modalities.
Content-only method. We use GibbsLDA++ as a baseline for the content-only method. As input for this method, we considered that a document consists of the words used in the title and abstract of a paper.
Link-only method. For the link-only method, we use the Louvain method [louvain] for community detection.
Temporal-Link-only method. For the temporal link-only method, we used the work presented by He201587, which we refer to as DCTN.
Combination of Link and Content. For link and content combination, we used the work presented by liu2015community (code obtained from the authors at https://github.com/LiyuanLucasLiu/Content-Propagation), with algorithms CPRW-PI, CPIP-PI, CPRW-SI, and CPIP-SI. Since all of them perform very similarly and space is limited, we report only the results obtained with CPIP-PI.
VI-A Evaluation Results
In this section, we present the results of our experiments.
The Louvain and DCTN methods are based on link structure and do not allow fixed numbers of clusters. They use topological structure to find the number of communities. All methods in the baseline were used in their default configuration.
First, we present the results with synthetic data we generated (Synthetic 1 and Synthetic 2) in Table I. In synthetic datasets we use , , and with and 1000 steps.
The only methods that are able to find the clusters in all datasets are CPIP-PI and Chimera, both using content and link information. In the synthetic data the changes between timestamps were small. Thus, CPIP-PI and Chimera performed similarly. However, Chimera displayed almost perfect performance in all datasets and timestamps. Louvain and DCTN, which use only link information, were not able to find the clusters. Despite the purity of 1, they cluster all the data into only one cluster. DCTN finds clusters only for the first two timestamps of Synthetic 2, obtaining 3 and 4 clusters respectively. Louvain found 3 clusters in timestamps 1 and 3 of Synthetic 2.
Table II presents the Jaccard and Purity metrics over all methods for the real dataset arXiv. In arXiv, the Louvain method found 3636, 2679, 2006, 1800 and 2190 communities for the five years, respectively. DCTN, which is based on Louvain, has a very similar result, with 3636, 2656, 1829, 1500 and 1791 communities, respectively. Since both methods are based only on links, they treat disconnected nodes as isolated communities. Methods that combine link and content use content to aggregate such nodes into a community. Also, as Table II shows, our method learns over time and improves its results in later years. GibbsLDA++ performs well because the content was much more stable and of higher quality over the years than the link information. This is another reason to combine various sources to achieve better performance.
To tune the hyperparameters of Chimera, we used Bayesian Optimization [bergstra2013making, shahriari2016taking] to search the hyperparameter space. Bayesian Optimization is an appropriate technique in this setting, because minimizing the model loss (4) does not necessarily translate into better performance. We defined an objective function that maximizes the mean silhouette coefficient [rousseeuw1987silhouettes] of the labels assigned by Chimera, as described in Section V-A. We used Bayesian Optimization to determine the number of clusters as well. With this approach, the optimization process is completely unsupervised and, although we have access to the true labels, they were not used during optimization, a situation closer to reality. With Bayesian Optimization, our model was able to learn that the actual number of clusters was in the order of 10. The full set of hyperparameters and their ranges are shown in Table III, with best results shown in bold face.
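To make the unsupervised tuning criterion concrete, the sketch below computes the mean silhouette coefficient of a labeling directly from pairwise Euclidean distances. The function name and the convention of scoring singleton clusters as 0 are choices made here; the function assumes at least two clusters and distinct points. An optimizer that minimizes its objective would be given the negated score.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Mean distance to the other members of the point's own cluster.
        a = dist[i, same].sum() / (same.sum() - 1)
        # Mean distance to the members of the nearest other cluster.
        b = min(dist[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```

Since the score needs only the embeddings and the assigned labels, candidate hyperparameter settings (including the number of clusters) can be ranked without access to ground-truth classes.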
In Table IV we show our results for prediction. Here, we do not compare our results with methods that estimate or evaluate the size of each community; the idea is instead to predict which community an author will belong to in the future. One advantage of our method is that we can augment our time series with our predictions. Clearly, doing so will add noise to further predictions, but the results presented are very similar to the ones present in the original dataset. Chimera is the only method that allows this kind of analysis in an easy way, since the embeddings create multidimensional representations of the nodes in the graph.
VI-B Performance Evaluation
We generated 15 datasets of increasing sizes (with ranging from 250 to 14,000). Since in these datasets, we expect Chimera’s asymptotic complexity to be . To verify this, we measured the time it took to execute 1000 iterations of Chimera with , , , , , . Since and are small integers, the factor is expected to dominate the growth of the running time. To check whether that is the case, we also fit a degree-two polynomial to the data by least squares. The obtained data is summarized in Figure 4. As can be seen from the figure, there is a good fit between the measured data and the fitted polynomial, indicating that the order of growth is quadratic for datasets similar to the ones presented.
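The fitting step can be reproduced along the following lines with a least-squares polynomial fit. The sizes and timings below are placeholders generated from a known quadratic, not measurements from the experiments, and the helper name `fit_quadratic` is hypothetical.

```python
import numpy as np

def fit_quadratic(sizes, seconds):
    """Least-squares fit of seconds ~ c2 * n^2 + c1 * n + c0, with R^2."""
    coeffs = np.polyfit(sizes, seconds, deg=2)
    fitted = np.polyval(coeffs, sizes)
    ss_res = np.sum((np.asarray(seconds) - fitted) ** 2)
    ss_tot = np.sum((np.asarray(seconds) - np.mean(seconds)) ** 2)
    return coeffs, 1.0 - ss_res / ss_tot

# Placeholder timings generated from a known quadratic, not measured data.
sizes = np.array([250, 500, 1000, 2000, 4000])
seconds = 2e-6 * sizes ** 2 + 1e-3 * sizes + 0.5
coeffs, r2 = fit_quadratic(sizes, seconds)
```

A coefficient of determination close to 1 together with a clearly positive leading coefficient is consistent with quadratic growth, although a good fit over a limited size range does not by itself rule out other growth rates.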
In this work, we presented Chimera, a novel shared factorization model over time that simultaneously takes the link, content, and temporal information of networks into account, improving over state-of-the-art approaches for community detection. Our approach models and efficiently solves the problem of combining link, content and temporal analysis for community detection and prediction in network data. Our method extracts the latent semantic structure of the network in multidimensional form, but in a way that takes into account the temporal continuity of the embeddings. Such an approach greatly simplifies temporal analysis of the underlying network by using the embedding as a surrogate. A consequence of this simplification is that it is also possible to use this temporal sequence of embeddings to predict future communities with good results. The experimental results illustrate the effectiveness of Chimera, since it outperforms the baseline methods. Our experiments also show that the embeddings are effective for predicting near-future communities, which opens a vast array of new possibilities for exploration.
Charu C. Aggarwal’s research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-09-2-0053. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.