Complex networks appear in different categories such as social networks, citation networks, collaboration networks, and communication networks Newman; StatisticalMechanics; SurveyOfMeasurements; GraphConcepts. In recent years, complex networks have been studied extensively, and much evidence indicates that they exhibit non-trivial structural properties Newman; SurveyOfMeasurements; StructDynam; DynamicalSys; WaterDistribution; MusicRecom. For example, a power law degree distribution, high clustering and small path lengths are some of the properties that distinguish complex networks from completely random graphs.
An active field of research is dedicated to the development of algorithms for generating complex networks. These algorithms, called “generative models”, try to generate synthetic graphs that adhere to the structural properties of complex networks SurveyOfMeasurements; StatisticalMechanics.
Realistic generative models have many applications and benefits. Once a generative model is fitted to a given real network, we can replace the real network with artificial networks in tasks such as simulation, extrapolation (by generating similar graphs with larger sizes), sampling (the reverse of extrapolation), capturing the network structure and network comparison Kronecker; RTG.
Despite the advances in the field, there is no universal generative model suitable for all network types and features. Network generation therefore requires a preceding stage of generative model selection. When we generate synthetic networks, we hope to obtain graphs that are structurally similar to a target network. In the model selection stage, the properties of a given network (called the target network) are analyzed and the model best suited for generating similar networks is selected. A model selection method tries to answer this question: “Among candidate generative models, which one is the most suitable for generating complex network instances similar to the given network?” In this paper, we investigate this problem and, by means of machine learning algorithms, propose a new model selection method based on network structural properties. The proposed method is named “Generative Model Selection for Complex Networks” (GMSCN). The need for model selection is frequently indicated in the literature ModelSelection; NetSamplingClassification; Drosophila. More specifically, some works ModelSelection; Drosophila; Superfamilies are based on counting subgraphs of small sizes (called graphlets or motifs Motifs; Superfamilies; GraphConcepts; SubgraphCounting; NetMotifDiscovery; EngineeringView; EffectOfNetTopology; ModelSelection; Drosophila), some others concentrate on structural features of complex networks NetSamplingClassification, and some are based on manually selecting a model by inspecting a small set of network features RichClub. We will show that by using an appropriate combination of local and global network features, we can develop a more accurate model selection method. In our proposed method (GMSCN), we consider seven prominent generative models, with which we have generated datasets of network instances. The datasets are used as training data for learning a decision tree for model selection.
Our method also consists of a special technique for quantification of degree distribution. In comparison to existing methods ModelSelection; Drosophila; NetSamplingClassification, we have considered wider, newer and more significant generative models. Due to a better selection of network features, GMSCN is also more efficient and more scalable than similar methods ModelSelection; Drosophila.
The rest of this paper is organized as follows. Section II reviews the related work. Section III presents GMSCN. Section IV is dedicated to evaluation of GMSCN. Section V describes a case study on some real network samples. The results and evaluations of this paper are discussed in Section VI. Finally, Section VII concludes the paper.
II Related Work
II.1 Network Generation Models
In this subsection, we briefly introduce the leading methods of network generation:
Kronecker Graphs Model (KG) Kronecker. This model generates realistic synthetic networks by applying a matrix operation (the Kronecker product) on a small initiator matrix. This model is mathematically tractable and supports many network features such as small path lengths, heavy tail degree distribution, heavy tails for eigenvalues and eigenvectors, densification and shrinking diameters over time.
Forest Fire Model (FF) GraphsOverTime. In this model, edges are added in a process similar to a fire-spreading process. This model is inspired by Copying model WebAsGraph and Community Guided Attachment GraphsOverTime but supports the shrinking diameter property.
Random Typing Generator Model (RTG) RTG. RTG uses a process of “random typing” for generating node identifiers. This model mimics real world graphs and conforms to eleven important patterns (such as power law degree distribution, densification power law and small and shrinking diameter) observed in real networks RTG.
Preferential Attachment Model (PA) EmergenceOfScaling. The classical preferential attachment model generates scale-free networks with power law degree distribution. In this model, the nodes are added to the network incrementally and the probability of the attachments depends on the degree of existing nodes.
Small World Model (SW) SmallWorld. This is another classical network generation model that synthesizes networks with small path lengths and high clustering. It starts with a regular lattice and then rewires some edges of the network randomly.
Erdös-Rényi Model (ER) ER. This model generates a completely random graph. The number of nodes and edges are configurable.
Random Power Law Model (RP) RP. The RP model generates synthetic networks by following a variation of ER model that supports the power law degree distribution property.
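The generation processes of some of the classical models listed above can be sketched in a few lines. The following is a minimal illustration, not the implementation used in this paper: the function names, the initiator values and the simplified rewiring rule of the small-world sketch are assumptions made for the example, and the Kronecker sketch uses the stochastic variant in which each entry of the Kronecker power is treated as an edge probability.

```python
import random

def kronecker_power(initiator, k):
    """Return the k-th Kronecker power (k >= 1) of a square matrix given as nested lists."""
    m = len(initiator)
    result = initiator
    for _ in range(k - 1):
        size = len(result) * m
        result = [[result[i // m][j // m] * initiator[i % m][j % m]
                   for j in range(size)] for i in range(size)]
    return result

def kronecker_graph(initiator, k, rng):
    """Stochastic variant: directed edge (i, j) is kept with probability P[i][j]."""
    probs = kronecker_power(initiator, k)
    n = len(probs)
    return [(i, j) for i in range(n) for j in range(n) if rng.random() < probs[i][j]]

def preferential_attachment(n, m, rng):
    """Each new node links to m distinct existing nodes chosen proportionally to degree."""
    edges = [(i, j) for i in range(m + 1) for j in range(i)]   # seed clique on m + 1 nodes
    endpoints = [v for edge in edges for v in edge]            # node v appears deg(v) times
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((new, t))
            endpoints += [new, t]
    return edges

def small_world(n, k, p, rng):
    """Ring lattice with k neighbours per node (k even, n > k); edges rewired with probability p."""
    edges = {frozenset((v, (v + d) % n)) for v in range(n) for d in range(1, k // 2 + 1)}
    for edge in sorted(edges, key=sorted):
        if rng.random() < p:
            u = min(edge)                                      # simplified choice of kept endpoint
            free = [w for w in range(n) if w != u and frozenset((u, w)) not in edges]
            if free:
                edges.remove(edge)
                edges.add(frozenset((u, rng.choice(free))))
    return sorted(tuple(sorted(e)) for e in edges)

def erdos_renyi(n, m, rng):
    """Sample m distinct undirected edges uniformly at random among n nodes."""
    m = min(m, n * (n - 1) // 2)                               # cannot exceed the complete graph
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)
        edges.add((min(u, v), max(u, v)))
    return sorted(edges)

rng = random.Random(42)
kg = kronecker_graph([[0.9, 0.5], [0.5, 0.1]], 4, rng)   # illustrative initiator, 2**4 = 16 nodes
pa = preferential_attachment(100, 2, rng)
sw = small_world(100, 4, 0.1, rng)
er = erdos_renyi(100, 200, rng)
```

The Kronecker sketch also makes visible why the node count of that model is a power of the initiator size: a 2x2 initiator raised to the k-th Kronecker power always yields a matrix with 2^k rows.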
Other generative models are also available (we have not utilized them but they are used in related model selection methods), such as Copying Model (CM) WebAsGraph, Random Geometric Model (GEO) RandomGeoGraphs, Spatial Preferential Attachment (SPA) SpatialWebGraph, Random Growing (RDG) RandomlyGrown, Duplication-Mutation-Complementation (DMC) ProteomeEvolution, Duplication-Mutation using Random mutations (DMR) RandomlyGrown, Aging Vertex (AGV) HighlyClustered, Ring Lattice (RL) RandomGraphs, Core-periphery (CP) CorePeriphery, and Cellular model (CL) CellularNets.
II.2 Model Selection Methods
The aim of this paper and the model selection methods is to find the best generative model that fits a given network instance. Some model selection methods are based on graphlet counting ModelSelection; Drosophila; Superfamilies. Graphlets are subgraphs of bounded sizes (e.g., all possible subgraphs with three or four nodes) and the frequency of graphlets in a network is considered as a way of capturing the network structure ModelSelection. In some works, directed graphs and graphlets are considered Superfamilies; Motifs and some others consider the network as simple (undirected) graphs ModelSelection; Superfamilies.
Janssen et al. ModelSelection have tested both graphlet features and structural features (degree distribution, assortativity and average path length) in the model selection problem. They conclude that counting graphlets of three and four nodes is sufficient for capturing the structure of the network, i.e., appending structural features to the feature vector of graphlet counts does not improve the accuracy of the model selector. In this paper, we critique this claim and show that using a better set of local (such as transitivity) and global (such as effective diameter GraphEvolution; GraphsOverTime) network structural features, along with an appropriate degree distribution quantification algorithm, actually improves the accuracy of the model selection. In fact, graphlet counts are limited local features and are not able to reflect the structural properties of a network instance. Janssen et al. ModelSelection implemented six generative models and generated a dataset of synthetic networks as the training data for decision tree learning MulticlassADT. In this method, the candidate generative models are: PA EmergenceOfScaling, CM WebAsGraph, GEO RandomGeoGraphs (GEO2D and GEO3D) and SPA SpatialWebGraph (SPA2D and SPA3D).
A similar method is proposed by Middendorf et al. Drosophila. In this method, the feature vectors are the counts of graphlets with small sizes. Seven different generative models are considered, with which network instances are generated as the training data. The candidate generative models are: ER ER, PA EmergenceOfScaling, SW SmallWorld, RDG RandomlyGrown, DMC ProteomeEvolution, DMR RandomlyGrown and AGV HighlyClustered. The authors have used a generalized decision tree called alternating decision tree (ADT) as the learning algorithm.
Airoldi et al. NetSamplingClassification propose to form feature vectors according to structural network properties. They have considered some classical generative models and generated a dataset from which a naïve Bayes classifier is learned. The candidate generative models are: PA EmergenceOfScaling, ER ER, RL RandomGraphs, CP CorePeriphery and CL CellularNets. This method is dependent on the size and average connectivity of the target network and this dependency is one of its limitations.
Patro et al. MissingModels propose a framework for implementing network generation models. The user of this framework can specify the important network features and the weight of each feature. In other words, we consider each generative model as a class of networks. Rather than a specific method, this is a relatively open framework, and the user should determine the different parameters of the framework according to the target application.
III The Proposed Method
GMSCN is based on learning a classifier for model selection. The goal of a classifier is to accurately predict the target class for a given network instance and in our method, generative models play the role of network classes. In GMSCN, the classifier suggests the best model that generates networks similar to a given network. The inputs of the classifier are the structural properties of the target network and the output is the selected model among the candidate network generation models.
Fig. 1 shows the high-level methodology of GMSCN. The methodology is configurable by several parameters and decision points, such as the set of considered network features, the chosen supervised learning algorithm and the candidate generative models. The steps of constructing the network classifier, as illustrated in Fig. 1, are described in the following:
1. Many artificial network instances are synthesized using the candidate network generative models. These network instances will form the dataset (training and test data) for learning a network classifier. In this step, the parameters of the generative models are tuned in order to synthesize networks with densities similar to the density of the given target network.
2. After generating the network instances, the structural features (e.g., the degree distribution and the clustering coefficient) of each network instance are extracted. The result is a dataset of labeled structural features in which each record consists of topological features of a synthesized network along with the label of its generative model.
3. The labeled dataset forms the training and test data for the supervised learning algorithm. The learning algorithm will return a network classifier which is able to predict the class (the best generative model) of the given network instance.
4. The structural features of the target network are also extracted. The same “Feature Extraction” block which is used in the second step is applied here. The structural features of the target network are used as input for the learned classifier.
5. The learned network classifier is a customized “model selector” for finding the model that fits the target network. It gets the structural features of the target network as input and returns the most compatible generative model.
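The five steps above can be sketched end to end on a toy scale. This is only an illustration under stated assumptions: two toy candidate models (a random graph and a ring lattice) stand in for the seven generative models, two simple features stand in for the feature set of GMSCN, and a nearest-centroid rule stands in for the decision tree learner. All function names are hypothetical.

```python
import random
from statistics import mean, pstdev

def er_graph(n, density, rng):
    """Toy candidate model 1: independent edges with probability `density`."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < density:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def ring_graph(n, density, rng):
    """Toy candidate model 2: ring lattice; `rng` is unused but keeps one signature."""
    half = max(1, round(density * (n - 1) / 2))
    adj = {v: set() for v in range(n)}
    for v in range(n):
        for d in range(1, half + 1):
            adj[v].add((v + d) % n)
            adj[(v + d) % n].add(v)
    return adj

def density(adj):
    """Existing edges divided by potential edges of an undirected graph."""
    n = len(adj)
    return sum(len(nb) for nb in adj.values()) / (n * (n - 1))

def extract_features(adj):
    """Two toy features: transitivity and relative spread of the degrees."""
    degs = [len(nb) for nb in adj.values()]
    closed = sum(1 for v in adj for a in adj[v] for b in adj[v] if a < b and b in adj[a])
    triples = sum(d * (d - 1) / 2 for d in degs)
    transitivity = closed / triples if triples else 0.0
    spread = pstdev(degs) / mean(degs) if mean(degs) else 0.0
    return [transitivity, spread]

def build_dataset(models, target_density, n, per_model, rng):
    """Steps 1-2: generate density-matched networks and label their features."""
    return [(extract_features(gen(n, target_density, rng)), name)
            for name, gen in models.items() for _ in range(per_model)]

def train_nearest_centroid(dataset):
    """Step 3 (stand-in learner): one mean feature vector per generative model."""
    centroids = {}
    for label in {lab for _, lab in dataset}:
        rows = [f for f, lab in dataset if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def select_model(centroids, features):
    """Steps 4-5: return the model whose centroid is closest to the target features."""
    return min(centroids, key=lambda lab: sum((a - b) ** 2
                                              for a, b in zip(centroids[lab], features)))

rng = random.Random(1)
models = {"ER": er_graph, "RING": ring_graph}
target = ring_graph(120, 0.1, rng)                 # stand-in for a real target network
data = build_dataset(models, density(target), 60, 10, rng)
selector = train_nearest_centroid(data)
print(select_model(selector, extract_features(target)))
```

Note that the training networks here have 60 nodes while the target has 120: only the density of the target is passed to the generators, which mirrors the role density plays in the methodology.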
In this methodology, the density of the target network is considered as an important property of the target network. Network density is defined as the ratio of the existing edges to potential edges and is regarded as an indicator of the sparseness of the graph. In the proposed methodology, generative models are configured to synthesize networks with densities similar to the density of the target network. This decision is due to the fact that it is hard to compare networks of completely different densities for predicting their growth mechanism and generation process. On the other hand, even with similar network densities, various generative models create different network structures. So, we try to keep the density of the generated networks similar to the density of the target network. In this manner, the network classifier can learn the difference among the structure of various generative models with similar network densities.
It is also worth noting that it is not possible to generate networks with exactly equal densities with some of the existing generative models. This is because some generative models (such as Kronecker graphs and RTG) are not configurable for finely tuning the exact density of synthesized networks. So, we generate the networks of training data with similar, and not exactly equal, densities to the density of the given network.
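As a small worked example of the density definition, the sketch below computes the density of an undirected network and the number of edges a generator would need at a different size to match a target density. Treating the networks as undirected and without self-loops is an assumption of this example, and the helper names are hypothetical.

```python
def density(num_nodes, num_edges):
    """Ratio of existing edges to potential edges (undirected, no self-loops)."""
    return num_edges / (num_nodes * (num_nodes - 1) / 2)

def edges_for_density(num_nodes, target_density):
    """Edge count that gives approximately `target_density` at the requested size."""
    return round(target_density * num_nodes * (num_nodes - 1) / 2)

# A target density of 0.0024 translates into different edge counts at different sizes.
for n in (1024, 4096):
    m = edges_for_density(n, 0.0024)
    print(n, m, density(n, m))
```

Because the edge count must be an integer, the achieved density is only approximately equal to the requested one, even for a model whose number of edges is directly configurable.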
Our proposed methodology, unlike existing methods ModelSelection; NetSamplingClassification; Drosophila, is not dependent on the size (number of nodes) of the target network. Size-independence is an important feature of our method. It enables the classifier to learn from a dataset of generated networks whose sizes differ from (and are perhaps smaller than) the size of the target network, but which have a similar density. This facility decreases the time of network generation and feature extraction considerably. We will demonstrate the size-independence property of GMSCN in the evaluation section.
GMSCN is actually a realization of the described methodology. In the following subsections, we further illustrate the details of GMSCN by specifying the open parameters and decision points of the methodology.
III.2 Network Features
The process of model selection, as described in Fig. 1, utilizes structural network features in the second and fourth steps. There are plenty of different network features, so we clarify the considered features in GMSCN here.
To capture the properties of a network, we should analyse a wide and diverse feature set of network connectivity patterns. We propose the utilization of a combination of local and global network structural features. The utilization of a limited set of local features (graphlet counts) in similar methods ModelSelection; Drosophila has resulted in a lower precision for the model selector. As explained later, we have utilized ten network features from four feature categories. While trying to find the best and minimal set of network features, we considered features that are not only effective for the classification accuracy, but also efficiently computable and size-independent. One may consider a longer list of network features, even from different feature categories (e.g., eigenvalues). In such an approach, automatic methods for feature selection such as the methodology explained in Ref. Zanin may be helpful. But supporting the specified diverse criteria (effectiveness, efficiency and size-independence) for the selected features is quite difficult in such automatic methodologies.
The utilized features and measurements in GMSCN are:
Transitivity of relationships. In this category of network features, we consider two measurements of “average clustering coefficient” SmallWorld; Newman and “transitivity” TransitivityProp.
Degree correlation. The measure of assortativity Newman is selected from this category of network features.
Path lengths. There are different global features about the path lengths in a network, such as diameter DiameterProp, effective diameter GraphEvolution; GraphsOverTime and average path length SurveyOfMeasurements. We selected the “effective diameter” measurement since it is more robust Kronecker and also because of its lower computation cost and lower sensitivity to small network changes Sensitivity. Effective diameter indicates the minimum number of edges within which 90 percent of all connected pairs can reach each other Kronecker; GraphEvolution; InternetTopology. Effective diameter is well defined for both connected and disconnected networks GraphEvolution.
Degree distribution. It is a common approach to fit a power law on the degree distribution and extract the power law exponent as a representative quantity for the degree distribution. But a single number (the power law exponent) is too limited for representing the whole degree distribution. Moreover, some real networks do not conform to the power law degree distribution Slashdot; GooglePlus; WhatIsTwitter. We propose an alternative method for quantification of the degree distribution by computing its probability percentiles. The percentiles are calculated from some defined regions of the degree distribution according to its mean and standard deviation. We devise $K$ intervals in the degree distribution and then calculate the probability of degrees of each interval. $K$ is always an even number greater than or equal to four. The size of all intervals, except the first and the last one, is considered equal to $\alpha\sigma$, where $\sigma$ is the standard deviation of the distribution and $\alpha$ is a tunable parameter. In any application, we can configure the values of $K$ and $\alpha$ in a manner that the percentile values become more distinctive. In our experiments we let $K = 6$ (together with a fixed value of $\alpha$), so we extract six quantities (DegDistP1..DegDistP6 percentiles) from any degree distribution. If we increase the value of $K$, we should normally decrease the value of $\alpha$ so that most of the interval points stay in the range of existing node degrees. Smaller values for $K$ also necessitate larger values for $\alpha$. Values that are too large or too small will also decrease the distinction power of the extracted feature vector. The specified values for $K$ and $\alpha$ are found through trial and error. Equation 1 shows the interval points of the degree distribution and Equation 2 specifies the probability for a node degree to sit in the $i$th interval. The set of six percentiles (DegDistP1..DegDistP6) is used as the network features representing the degree distribution.
Let $IP(i)$ be the $i$th interval point and $D$ be the degree random variable.
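The four feature categories above can be sketched as one feature extractor that returns ten values. The sketch rests on several assumptions that are not taken from the paper: graphs are undirected adjacency dictionaries with comparable node labels, the effective diameter is computed without interpolation, the population standard deviation is used, and the interval points are placed symmetrically around the mean as $\mu + (i - K/2)\,\alpha\sigma$ for $i = 1, \dots, K-1$, with the first and last intervals covering everything below and above them. The default `alpha=0.5` is purely illustrative, and all function names are hypothetical.

```python
from collections import deque
from statistics import mean, pstdev

def clustering_features(adj):
    """Return (average clustering coefficient, transitivity)."""
    local, closed, triples = [], 0, 0
    for v, nbrs in adj.items():
        pairs = len(nbrs) * (len(nbrs) - 1) // 2
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        local.append(links / pairs if pairs else 0.0)
        closed += links            # closed triples centred at v
        triples += pairs           # all connected triples centred at v
    return mean(local), (closed / triples if triples else 0.0)

def assortativity(adj):
    """Pearson correlation of the degrees at the two ends of each edge."""
    pairs = [(len(adj[u]), len(adj[v])) for u in adj for v in adj[u]]  # both directions
    if not pairs:
        return 0.0
    mu = mean(x for x, _ in pairs)
    var = mean((x - mu) ** 2 for x, _ in pairs)
    cov = mean((x - mu) * (y - mu) for x, y in pairs)
    return cov / var if var else 0.0   # undefined for regular graphs; 0.0 by convention here

def effective_diameter(adj, q=0.9):
    """Smallest hop count within which a fraction q of connected pairs reach each other."""
    hist = {}
    for s in adj:                      # breadth-first search from every node
        dist, queue = {s: 0}, deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        for d in dist.values():
            if d:
                hist[d] = hist.get(d, 0) + 1
    total, reached = sum(hist.values()), 0
    for d in sorted(hist):
        reached += hist[d]
        if reached >= q * total:
            return d
    return 0

def degree_percentiles(degrees, k=6, alpha=0.5):
    """Probability mass of the degree distribution in k intervals around the mean."""
    if k < 4 or k % 2:
        raise ValueError("k must be an even number >= 4")
    mu, sigma = mean(degrees), pstdev(degrees)
    points = [mu + (i - k / 2) * alpha * sigma for i in range(1, k)]
    counts = [0] * k
    for d in degrees:
        counts[sum(1 for p in points if d >= p)] += 1   # index of the interval holding d
    return [c / len(degrees) for c in counts]

def extract_features(adj):
    """Ten features: two transitivity measures, assortativity, effective diameter, six percentiles."""
    avg_cc, trans = clustering_features(adj)
    degrees = [len(nbrs) for nbrs in adj.values()]
    return [avg_cc, trans, assortativity(adj), effective_diameter(adj)] + degree_percentiles(degrees)

triangle_with_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(extract_features(triangle_with_tail))
```

None of these quantities is a raw count that grows with the number of nodes, which is the property the feature selection above aims for. Computing the effective diameter exactly requires a breadth-first search from every node, so for very large networks an approximation would be needed in practice.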
III.3 Learning the Classifier
The third step of the proposed methodology is the utilization of a supervised machine learning algorithm. The learning algorithm constructs the network classifier based on the features of generated network instances as the training data. Each record of the training data consists of the structural features (as described in the previous subsection) of a generated network along with the label of its generative model. By means of supervised algorithms, we can learn from this training data a classifier which predicts the best generative model for a given network with the specified structural features.
We examined several supervised learning algorithms such as decision tree learning QC4.5; MulticlassADT, Bayesian networks BayesianRef, support vector machines SMORef (SVM) and neural networks NeuralNetRef, among which the LADTree method showed better results. A short description of the examined learning algorithms is presented in Appendix A. In our experiments, although some methods (such as Bayesian networks) resulted in a small improvement in the accuracy of the learned classifier, the decision tree learned by the LADTree algorithm was clearly more robust and less sensitive to noise than the other learning methods. The robustness-to-noise analysis is described in the evaluation section. To avoid over-fitting, we always used stratified 10-fold cross-validation.
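Stratified 10-fold cross-validation can be sketched as follows. This is a generic illustration rather than the tooling used in our experiments: `train` and `predict` are hypothetical callables standing for any supervised learner, and the fold assignment shown is one simple way of keeping the class proportions equal across folds.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split record indices into k folds that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)      # deal each class round-robin over the folds
    return folds

def cross_validate(features, labels, train, predict, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and return the overall accuracy."""
    correct = 0
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_x = [features[i] for i in range(len(labels)) if i not in held_out]
        train_y = [labels[i] for i in range(len(labels)) if i not in held_out]
        model = train(train_x, train_y)
        correct += sum(predict(model, features[i]) == labels[i] for i in test_idx)
    return correct / len(labels)
```

With 100 generated networks per generative model and ten folds, each fold holds ten networks from every model, and every network is used exactly once as test data.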
III.4 Network Models
Among several existing network generative models, we have selected seven important models: Kronecker Graphs Kronecker Model, Forest Fire GraphsOverTime Model, Random Typing Generator RTG Model, Preferential Attachment EmergenceOfScaling Model, Small World SmallWorld Model, Erdös-Rényi ER Model and Random Power Law RP Model. The selected models are state-of-the-art methods of network generation. Existing model selection methods such as Ref. ModelSelection and Ref. Drosophila have ignored some new and important generative models such as the Kronecker Graphs Kronecker, Forest Fire GraphsOverTime and RTG RTG models.
In this section, we evaluate our proposed method of model selection (GMSCN). We also compare GMSCN with the baseline method ModelSelection and show that it outperforms state of the art methods with respect to different criteria.
Unlike most of the existing methods, GMSCN has no dependency on the size of the given network. In other words, we ignore the number of nodes of the target network and we only consider its density in generating the training data. Because the baseline method is dependent on the size of the target network, we evaluate the methods in two stages. In the first stage, we fix the size of the generated networks to provide fair conditions for comparing GMSCN with the baseline method. Although size-dependence is a drawback of the baseline method, the evaluation shows that GMSCN outperforms the baseline method even under the fixed network size condition. In the second stage, we allow the generation models to synthesize networks of different sizes. In this stage, we show that the size diversity of the generated networks does not affect the accuracy of the learned decision tree. As described in Section III, GMSCN is based on learning a decision tree from a training set of generated networks. In each evaluation stage, we generated 100 networks from each network generative model, so with seven candidate models we gathered 700 generated networks. We used these network instances as the training and test data for learning the decision tree.
IV.1 The Baseline Method
We have selected the graphlet-based method proposed by Janssen et al. ModelSelection as the baseline method. The baseline method has some similarities to GMSCN: it is based on considering some network generative models and then learning a decision tree for network classification with the aid of a set of generated networks. In the baseline method, eight graphlet counts are considered as the network features. All subgraphs with three nodes (two graphlets) and four nodes (six graphlets) are considered in the baseline method (Fig. 2). A similar approach is also proposed by Middendorf et al. Drosophila, with distinctions on the learning algorithm and the set of candidate generative models.
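To make the eight graphlet counts concrete, the following sketch counts all connected induced subgraphs with three and four nodes by brute force. It is not the counting code used in our experiments: the graphlet names are hypothetical labels, and each induced subgraph is identified by its sorted degree sequence, which is sufficient to tell apart the connected graphs on three and four nodes.

```python
from itertools import combinations

# Sorted degree sequence inside the induced subgraph -> graphlet label.
SIGNATURES = {
    (1, 1, 2): "path3", (2, 2, 2): "triangle",
    (1, 1, 2, 2): "path4", (1, 1, 1, 3): "star4", (2, 2, 2, 2): "cycle4",
    (1, 2, 2, 3): "paw", (2, 2, 3, 3): "diamond", (3, 3, 3, 3): "clique4",
}

def graphlet_counts(adj):
    """Exact counts of the two 3-node and six 4-node connected induced subgraphs."""
    counts = {name: 0 for name in SIGNATURES.values()}
    for size in (3, 4):
        for nodes in combinations(adj, size):
            chosen = set(nodes)
            signature = tuple(sorted(len(adj[v] & chosen) for v in nodes))
            if signature in SIGNATURES:      # disconnected subgraphs are skipped
                counts[SIGNATURES[signature]] += 1
    return counts

triangle_with_tail = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(graphlet_counts(triangle_with_tail))
```

This enumeration visits every subset of three and four nodes, so its cost grows with the fourth power of the number of nodes. It is only practical for tiny graphs, and the raw counts it returns grow with network size.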
The graphlet-based method is selected as the baseline because it is a new method and its evaluations show a high accuracy, and it is proposed similarly in different research domains such as social networks ModelSelection and protein networks Drosophila.
Despite the similarities, there exist some important differences between GMSCN and the baseline method. First, the baseline method is based on counting graphlets in networks while GMSCN proposes a wider set of local and global features. Janssen et al. ModelSelection conclude that considering structural features does not improve the accuracy of the graphlet-based classifier, but we will show that choosing a better set of local and global network features and with the aid of our proposed degree distribution quantification method, structural features will play an undeniable role in model selection. Second, the baseline method is size-dependent, i.e., it considers both the size and the density of the target network, and it generates network instances according to these two properties. On the other hand, GMSCN is size-independent and we only consider the density of the target network in the network generation phase. Third, GMSCN employs newer and more important generative models such as the Kronecker Graphs Kronecker model, the Forest Fire GraphsOverTime model and the RTG RTG model. Fourth, we examined different learning algorithms and then selected LADTree as the best learning algorithm for this application. Our evaluation of GMSCN is more thorough, considering different evaluation criteria. We have also presented a new algorithm for quantifying the network degree distribution.
Graphlet counting is a very time consuming task and there is no efficient algorithm for computing the full counts of graphlets for large networks. To handle the algorithmic complexity, most of the graphlet-counting methods (e.g., Refs. ModelSelection; NetMotifDiscovery; ModelsInBiology; CellBiology) propose a sampling phase before counting the graphlets. But the sampling algorithm may affect the graphlet counts, and the resulting counts may be biased towards the features of the sampling algorithm. It is also possible to estimate the graphlet counts with approximate algorithms SubgraphCounting; Graft, but this approach may also introduce considerable errors in the graphlet counts. To provide fair conditions for the comparison, we have counted the exact number of graphlets in the original networks and have not employed any sampling or approximation algorithms. It is worth noting that the reported accuracy of the baseline method in this paper is different from that reported in the original paper ModelSelection, mainly because the sets of generative models are not the same in the two papers.
IV.2 Accuracy of the Model Classifier
We first set a fixed size for the generated networks of the dataset and generate networks with about 4096 nodes. Almost all the generated networks in our dataset contain 4096 nodes, but the networks generated by the RTG RTG model have small variations in their size. The number of nodes in these networks is in the range of 4000 to 4200, because the exact number of nodes is not configurable in the RTG model. Since the Kronecker Graphs model generates networks with $2^k$ nodes in its original form, we chose 4096 ($2^{12}$) as the size of the networks. The average density of networks in this dataset is equal to 0.0024.
In addition to overall accuracy, we evaluate the precision and recall of the learned decision tree for different network models. “Precision” shows the percentage of correctly classified instances calculated for each category (e.g., for ER, the number of instances correctly predicted as ER divided by the number of all instances predicted as ER), “Recall” illustrates the ability of the method in finding the instances of a category (e.g., for ER, the number of instances correctly predicted as ER divided by the number of true ER instances), and “Accuracy” is an indicator of overall effectiveness of the classifier across the entire dataset (i.e., the number of correctly classified instances divided by the total number of instances). The overall accuracy of GMSCN is 97.14% while the accuracy of the baseline method is 78.57%, an improvement of 18.57 percentage points. Fig. 3 and Fig. 4 show the precision and recall of GMSCN and the baseline method respectively for different network models. In addition to an apparent improvement in the precision and recall for most of the generative models, the figures show the stability (less undesired deviation) of GMSCN over the baseline method. The accuracy and precision of GMSCN show small deviation for different generative models, while these measures for the baseline method vary in a wide range. Table 1 shows the details of GMSCN results for different network models. For example, the first row of this table indicates that among 700 network instances, 104 networks are predicted to be generated by the ER model, but in fact 97 (out of 104) instances are ER, six instances are from the KG model and one is generated by the SW model. Because we have utilized cross-validation, all of the 700 network instances are included in the evaluation. Table 2 shows the corresponding results for the baseline method.
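The three measures can be computed directly from a confusion matrix laid out like Table 1, with one row per predicted class and one column per true class. The sketch below uses hypothetical helper names; the only figures taken from this paper are those of the ER row described above (97 ER, six KG and one SW instance among 104 predictions), and the second matrix is purely illustrative.

```python
def precision(confusion, label):
    """Correct predictions of `label` divided by all predictions of `label`."""
    row = confusion[label]
    total = sum(row.values())
    return row.get(label, 0) / total if total else 0.0

def recall(confusion, label):
    """Correct predictions of `label` divided by all true instances of `label`."""
    true_total = sum(row.get(label, 0) for row in confusion.values())
    return confusion[label].get(label, 0) / true_total if true_total else 0.0

def accuracy(confusion):
    """Correctly classified instances divided by all instances."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(row.get(label, 0) for label, row in confusion.items())
    return correct / total if total else 0.0

# confusion[predicted][true] = number of instances
er_row_only = {"ER": {"ER": 97, "KG": 6, "SW": 1}}
print(precision(er_row_only, "ER"))          # 97 / 104

toy = {"A": {"A": 8, "B": 1}, "B": {"A": 2, "B": 9}}   # illustrative values only
print(precision(toy, "A"), recall(toy, "A"), accuracy(toy))
```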
It is worth noting that considering both the graphlet counts and the structural features does not improve the accuracy of the classifier considerably. Since we want to prepare a size-independent and efficient method, we do not consider the graphlet counts in feature vectors.
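Table 1: Results of GMSCN for different network models.
| | True ER | True FF | True KG | True PA | True RP | True SW | True RTG | Class Precision |
| Class Recall | 97% | 100% | 93% | 98% | 94% | 99% | 99% | Accuracy: 97.14% |

Table 2: Results of the baseline method for different network models.
| | True ER | True FF | True KG | True PA | True RP | True SW | True RTG | Class Precision |
| Class Recall | 94% | 73% | 37% | 100% | 52% | 94% | 100% | Accuracy: 78.57% |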
IV.3 Size Independence
GMSCN for model selection is independent from the size of the target network. When we want to find the best model fitting a real network, we can discard the number of nodes in the network and generate the training data only according to its density. The size-independence is an important feature of GMSCN which is missing in the baseline method. This feature is especially important when we want to find the generative model for a relatively large network. In this condition, we can generate the training network instances with smaller sizes than the target network. This feature also increases the applicability, scalability and performance of GMSCN.
To evaluate the dependency of GMSCN on the size of the network, we generate a new dataset with networks of different sizes. Instead of fixing the number of nodes in each network instance (about 4096 nodes in the previous evaluation), we allow networks with different numbers of nodes in the dataset. In this test, with each of the generative models, we generated 100 networks of different sizes: 24 networks with 4,096 nodes, 24 networks with 32,768 nodes, 24 networks with 131,072 nodes, 24 networks with 524,288 nodes and four networks with 1,048,576 nodes. Again, the only exception is the RTG model, which generates networks with small variations from the specified sizes. The node counts are powers of two because the original version of the Kronecker graph model is able to generate networks with $2^k$ nodes. The average density of networks in this dataset is equal to 0.000885.
Table 3 shows the precision and recall of GMSCN for this dataset. In this evaluation, the overall accuracy of the classifier is 97.29%, which is very close to the accuracy of the system in the evaluation with fixed network sizes. This fact shows that GMSCN is not dependent on the size of the target network. The average density of networks in this dataset (0.000885) is different from the average density of networks in the fixed-size dataset (0.0024). So, the model selection also performs well for different densities of the given network. We also extended this experiment to ensure that there is no meaningful lower bound for GMSCN in terms of network size. The new experiment is configured similarly to the previous trial, but it examines a wider range of network sizes. Fig. 5 plots the result of this experiment at each number of nodes. It indicates that GMSCN shows good performance for the varying network sizes. The baseline method is size-dependent ModelSelection; Drosophila because the graphlet counts completely depend on the size of the network. So, it is not necessary to show the precision and recall of the baseline method for a dataset of networks of different sizes. We omitted this evaluation, also because the calculation of graphlet counts for large networks is very time consuming.
IV.4 Robustness to Noise
We also evaluate the robustness of GMSCN with respect to random changes in networks. For each test-case network, we randomly select a fraction of edges, rewire them to random nodes, and test the accuracy of the classifier for the resulting network. We start from the pure network samples and in each step, we change five percent of the edges until all the edges (100 percent change) are randomly rewired. In other words, in addition to the pure networks, we generated 20 test sets with five to 100 percent edge changes, each of which contains 700 network samples from seven generative models.
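The noise injection described above can be sketched as follows. The exact rewiring rule is not spelled out in the text, so this example assumes that each selected edge keeps one endpoint and receives a new, uniformly chosen second endpoint while avoiding self-loops and duplicate edges; the function name and this rule are assumptions.

```python
import random

def rewire_fraction(edges, num_nodes, fraction, seed=0):
    """Rewire round(fraction * |E|) randomly chosen undirected edges to random nodes."""
    rng = random.Random(seed)
    edge_set = {tuple(sorted(e)) for e in edges}
    selected = rng.sample(sorted(edge_set), round(fraction * len(edge_set)))
    for u, v in selected:
        edge_set.discard((u, v))
        while True:                      # terminates: the slot (u, v) was just freed
            w = rng.randrange(num_nodes)
            candidate = tuple(sorted((u, w)))
            if w != u and candidate not in edge_set:
                edge_set.add(candidate)
                break
    return sorted(edge_set)

ring = [(i, (i + 1) % 20) for i in range(20)]
# Noise levels from 5 to 100 percent in steps of five percent, as in the experiment.
noisy_versions = {pct: rewire_fraction(ring, 20, pct / 100, seed=pct)
                  for pct in range(5, 101, 5)}
```

The number of edges, and therefore the density, is unchanged by this procedure, so any drop in accuracy is caused by the loss of structure rather than by a change in density.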
As discussed before, we have chosen LADTree as the supervised learning algorithm in GMSCN. Fig. 6 shows the average accuracy of GMSCN for different random change fractions. This figure shows the effect of choosing different learning algorithms for GMSCN. As the figure shows, LADTree results in a more robust classifier for this application, since it is less sensitive to noise. The accuracy of GMSCN decreases smoothly and nearly linearly with random changes. There is no sudden drop in the chart of GMSCN (based on LADTree). With 100 percent random changes (the right end of the diagram), the accuracy of the classifier reaches the value of 14.43 percent, which is close to 1/7 (i.e., approximately 14.29 percent). This is due to the existence of seven network models and indicates that almost all the characteristics of the generative model are eliminated from a generated network with 100 percent edge rewiring.
iv.5 Scalability and Performance
The aim of GMSCN is to find the generative model that best fits a given real network. We define the scalability of such a method as its ability to handle large networks as input. Considering the methodology of the proposed method (Fig. 1), the most time-consuming part of the model classification is the feature extraction task. For this task, GMSCN is more scalable than the baseline method: there is no efficient algorithm for counting graphlets in large networks, whereas the network features selected in GMSCN (effective diameter, clustering coefficient, transitivity, assortativity and degree distribution percentiles) are efficiently computable by existing algorithms. We have also discarded features that are time-consuming to extract, such as the average path length, because their extraction requires computationally more expensive algorithms.
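To illustrate the kind of computation involved, the sketch below derives two of the listed features, the average clustering coefficient and the transitivity, from adjacency sets using only neighbourhood look-ups. It follows the standard textbook definitions and is not the SNAP or igraph routine used in this work; the function name `clustering_features` and the adjacency-set input format are assumptions.

```python
from itertools import combinations

def clustering_features(adj):
    """Return (average clustering coefficient, transitivity) of an undirected graph.

    `adj` maps each node to the set of its neighbours (no self-loops).
    """
    local_sum = 0.0
    closed = 0    # connected neighbour pairs summed over nodes (3 per triangle)
    triples = 0   # paths of length two, counted at their centre node
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # local coefficient is taken as 0 for degree < 2
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        pairs = k * (k - 1) // 2
        local_sum += links / pairs
        closed += links
        triples += pairs
    avg_clustering = local_sum / len(adj) if adj else 0.0
    transitivity = closed / triples if triples else 0.0
    return avg_clustering, transitivity
```

For a triangle with one pendant node attached, this returns an average clustering coefficient of 7/12 and a transitivity of 3/5. Both quantities need only one pass over each node's neighbour pairs, in contrast to enumerating all small subgraphs as graphlet counting does.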
Most of the graphlet-based methods, such as Ref. ModelSelection and Ref. Drosophila, try to increase their scalability by adding a preliminary stage of network sampling with very small rates, such as 0.01% (one out of 10,000) in Ref. ModelSelection. But such sampling rates decrease the accuracy of the graphlet counts, and the chosen sampling algorithm also biases the counts. On the other hand, if sampling or approximation algorithms are accepted for the baseline method, these techniques will improve the performance of GMSCN too. In other words, sampling and approximation algorithms increase the scalability of both the baseline method and GMSCN similarly. Some notes about the implementation and evaluation of GMSCN are presented in Appendix B.
iv.6 Effectiveness of the Degree Distribution Quantification Method
As described in Section III, we have proposed a new method for the quantification of the degree distribution based on its mean and standard deviation. In this subsection, we test the effectiveness of this quantification method. We show that without the proposed degree distribution features, the accuracy of the network classifier diminishes. Table 4 shows the results of GMSCN after eliminating the six features related to the degree distribution (the DegDistP1..DegDistP6 percentiles). With this change, the overall accuracy of the method decreases by about eight percentage points (from 97.14% to 89.29%). This can be seen by comparing the values in Table 4 with those of Table 1, which reflects the results of GMSCN when employing all the features. Precision and recall improve for almost all the models when the features related to the degree distribution are incorporated. This shows the effectiveness of our proposed quantification method for the degree distribution.
V Case study
We applied GMSCN to several real networks. The real network instances and the results of applying GMSCN to them are listed here:
“dblp_cite” (http://dblp.uni-trier.de/xml; 475,886 nodes and 2,284,694 edges) is a network extracted from the DBLP service. It is the citation network among scientific papers. GMSCN proposes Forest Fire as the best fitting generative model for this network. Leskovec et al. GraphsOverTime also propose the Forest Fire model for two similar graphs, the arXiv and patent citation networks.
“dblp_collab” (http://dblp.uni-trier.de/xml; 975,044 nodes and 3,489,572 edges) is a co-authorship network of the papers indexed in the DBLP service. A node in this network represents an author, and an edge indicates at least one co-authored paper between the two authors. GMSCN suggests Forest Fire for this network instance too.
“p2p-Gnutella08” (http://snap.stanford.edu; 6,301 nodes and 20,777 edges) is a relatively small P2P network. The best fitting model suggested by GMSCN for this network instance is Kronecker Graphs.
Slashdot, a technology-related news website, introduced the Slashdot Zoo, which allowed users to tag each other as friends. “Slashdot0902” (http://snap.stanford.edu; 82,168 nodes and 543,381 edges) is a network of friendship links between the users of Slashdot, obtained in February 2009. The output of GMSCN for this social network is the Random Power Law model.
In the “web-Google” network (http://snap.stanford.edu; 875,713 nodes and 4,322,051 edges), the nodes represent web pages and the directed edges represent hyperlinks among them. We ignored the direction of the links and considered the network as a simple undirected graph. GMSCN also proposes the Random Power Law model for this network.
“Email-EuAll” (http://konect.uni-koblenz.de; 265,214 nodes and 365,025 edges) is a communication network of email contacts, for which GMSCN predicts the RTG model.
Finally, for the small network “Email-URV” (http://deim.urv.cat/~aarenas; 1,133 nodes and 5,451 edges), which is another communication network of emails, GMSCN suggests the Small World model.
As explained above, the GMSCN classifier assigns various real networks, selected from a wide range of sizes, densities, and domains, to different network models. This indicates that no generative model dominates the output of GMSCN for real networks and that it suggests different models for different network structures. The case study also confirms that no single generative model is sufficient for synthesizing networks similar to real networks, and that we should find the model that best fits the target network in each application. Generative model selection is therefore an important stage before generating network instances.
We evaluated GMSCN from different perspectives. GMSCN proposes a size-independent methodology for building the network classifier based on a wide range of local and global network features as the inputs of a decision tree. It shows a high accuracy in predicting the generative model for a given network, and it is tolerant of small network changes. In addition to being size-independent, GMSCN outperforms the baseline method, which considers only the local features of graphlet counts, with respect to accuracy and efficiency. GMSCN also proposes a new structural feature which quantifies the network degree distribution.
One may argue that the size of the training set (700 network instances) is relatively small for a machine learning task. But we have actually utilized many more network instances in the process of evaluating GMSCN. Our dataset for evaluating GMSCN includes 15,400 different network instances: 700 instances in the fixed-size evaluation, 700 instances in the size-independence test and 14,000 (20 × 700) instances in the robustness test. The dataset size seems to be sufficient for evaluating the learned classifier because the network instances are generated with different parameters (e.g., different sizes) and the results for various evaluation steps are stable.
It can also be argued that the definition of “Accuracy” in the evaluations is not fair. When we compute the accuracy of the classifier, each test network is generated precisely according to one of the seven models, and the classifier attempts to determine that generative model. One may argue that real networks are unlikely to be produced exactly by one of these models, so accuracy in predicting the origin of artificially generated networks does not necessarily imply accuracy for real networks. But we have shown that GMSCN is able to classify synthesized network instances even with random noise (in Subsection IV.4). In other words, networks that are not completely compatible with one of the generative models are also well categorized by GMSCN. We should note that no accepted benchmark exists for suggesting the best generative model for real networks, so the actual accuracy of a model selection algorithm on real networks cannot be computed.
Considering the existing model selection methods, we summarize the main distinctions and contributions of GMSCN here. First, we have proposed new structural network features based on the quantification of the degree distribution, and we have shown the effectiveness of these features in improving the accuracy of the model selection method. Second, we proposed a set of local and global network features for the problem of model selection. The baseline method relies on a set of graphlet counts, which are limited local features, and the evaluations show that such features are not sufficient for this application. It is not possible to capture important characteristics of real networks such as heavy-tailed degree distribution, small path lengths, and degree correlation (assortativity) only by counting graphlets, while such characteristics are among the main distinctions of generative models. For example, the Small World model generates networks with high clustering and small path lengths, and artificial networks generated by most of the models demonstrate heavy-tailed degree distributions. Third, GMSCN is a size-independent method and the learned classifier is applicable to networks of different sizes. This is an important feature, especially when suggesting a generative model for a large network, because in this case we can generate the training set of artificial networks with a relatively smaller number of nodes. Fourth, although our proposed methodology does not depend on the particular generative models, we have chosen seven important and prominent network generative models as the candidate models of the classifier. Important models such as Kronecker graphs, Forest Fire and RTG are not considered in similar existing methods. Fifth, we have investigated different learning algorithms and identified LADTree as the most robust learning algorithm for this application. Sixth, we have presented a diverse set of evaluations for GMSCN with different criteria such as precision, recall, accuracy, robustness to noise, size-independence, scalability and effectiveness of the features.
In this paper, we proposed a new method (GMSCN) for network model selection. This method, which is based on learning a decision tree, finds the best model for generating complex networks similar to a specified network instance. The structural features of the given network instance are used as the input of the decision tree, and the result is the best fitting model. GMSCN outperforms the existing methods with respect to different criteria. The accuracy of GMSCN shows a considerable improvement over the baseline method ModelSelection. In addition, the set of generative models supported in GMSCN is wider and includes newer and more important generative models such as Kronecker graphs, Forest Fire and RTG. Unlike most of the existing methods, GMSCN is independent of the size of the input network. GMSCN is robust and insensitive to small network changes and noise. It is also a scalable method, and its feature extraction is considerably less expensive than that of the baseline method. GMSCN also includes a new and effective algorithm for the quantification of the network degree distribution. We have examined different learning algorithms and, as a result, decision tree learning by the LADTree method was the most accurate and robust. We showed that local structural features, such as graphlet counts, are insufficient for inferring the network mechanisms, and that it is necessary to consider a wider range of local and global structural features in order to predict the network growth mechanisms.
In future work, we will investigate the effect of network structural features and growth mechanisms on the dynamics and behavior of the network when it is subjected to different processes. For example, we will evaluate the similarity of the information diffusion process in a network and in its counterparts synthesized by the selected network generation model.
Acknowledgements. We wish to thank Masoud Asadpour, Mehdi Jalili and Abbas Heydarnoori for their great comments.
Appendix A Brief Introduction to Classification Methods
Machine Learning is a subfield of Artificial Intelligence in which the main goal is to learn knowledge through experience. Classification is a learning task of inferring a classification function from labeled training data. Here, we explain some classifiers that are used in this paper.
Support Vector Machines (SVM) SMORef. SVM performs a classification by mapping the inputs into a high-dimensional feature space and constructing hyperplanes to categorize the data instances. The best hyperplanes are those that cause the largest margin among the classes. The parameters of such a maximum-margin hyperplane are derived by solving an optimization problem. Sequential Minimal Optimization (SMO) SMORef is a common method for solving the optimization problem.
Bayesian Networks Learning BayesianRef. A Bayesian network model is a probabilistic graphical model that represents a set of random variables and their conditional dependencies by a directed acyclic graph. The nodes in this graph represent the random variables and an edge shows a conditional dependency between two variables. Bayesian network learning aims to create a network that “best describes” the probability distribution over the training data. To find the best network among the set of possible Bayesian networks, heuristic search techniques have been frequently used in the literature.
Artificial Neural Networks NeuralNetRef. An ANN is inspired by the neural networks of the human brain. It consists of neuron units, arranged in layers and connected with weighted links, which convert an input vector into some outputs. Usually, the networks are defined to be feed-forward, with no feedback to the previous layer. In the training phase, the weights of the links are tuned to adapt the ANN to the training data. The back-propagation algorithm is a common method for the training phase.
C4.5 Decision Tree Learning QC4.5. A decision tree is a tree structure of decision rules which can be used as a classification function (leaf nodes show the returned classes). C4.5 constructs a decision tree from labeled training data. C4.5 uses “information entropy” to evaluate the goodness of branches in the tree.
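As a small illustration of the entropy criterion, the sketch below computes the Shannon entropy of a set of class labels and the entropy reduction (information gain) achieved by a candidate split. It is not the C4.5 implementation used in this work: C4.5 additionally normalises the gain by the split information (the gain ratio), which is omitted here for brevity, and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(labels, partitions):
    """Entropy reduction when `labels` is split into the given `partitions`."""
    total = len(labels)
    # Weighted average entropy of the branches produced by the split.
    remainder = sum(len(p) / total * entropy(p) for p in partitions)
    return entropy(labels) - remainder
```

A split that separates two equally frequent classes perfectly yields a gain of one bit, while a split that leaves the class proportions unchanged in every branch yields a gain of zero.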
LADTree MulticlassADT. This classifier generates a multi-class alternating decision tree and it uses the “boosting” strategy. Boosting is a well-established classification technique that combines some weak classifiers to form a single powerful classifier. A prediction node in a LADTree includes a score for each of the candidate classes. LADTree calculates confidences for the different classes according to their scores in the visited prediction nodes, and it returns the best class according to the confidences.
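The scoring step can be illustrated with a toy sketch. It covers only the prediction phase of an alternating decision tree, not the boosting procedure that learns the tree, and it is not the Weka implementation. The rule-list representation, the function name `ladtree_predict`, and the feature name, threshold and scores in the usage note are hypothetical.

```python
def ladtree_predict(root_scores, rules, instance):
    """Sum per-class scores over all visited prediction nodes; return the best class.

    `root_scores` maps class -> score at the root prediction node.
    Each rule is (precondition, condition, scores_if_true, scores_if_false),
    where precondition and condition are predicates over the instance.
    """
    totals = dict(root_scores)
    for precondition, condition, if_true, if_false in rules:
        if not precondition(instance):
            continue  # this decision node is not reached by the instance
        visited = if_true if condition(instance) else if_false
        for cls, score in visited.items():
            totals[cls] = totals.get(cls, 0.0) + score
    return max(totals, key=totals.get), totals
```

For example, with a single hypothetical rule that tests `instance["transitivity"] > 0.1` and adds a score of +1.0 to "Small World" and -1.0 to "ER" when the test holds, an instance with transitivity 0.3 is assigned to "Small World". Unlike in an ordinary decision tree, an instance may visit several prediction nodes, and their scores are accumulated per class.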
Appendix B Implementation Notes
To implement the Kronecker Graphs, Forest Fire, Preferential Attachment, Small World, and Random Power Law models, we utilized the SNAP library (http://snap.stanford.edu/snap/). The implementation of the RTG model is available in a MATLAB library (http://www.cs.cmu.edu/~lakoglu/). We also developed our own implementation of the ER model. The features are extracted with the aid of different network analysis tools. The igraph package (http://igraph.sourceforge.net/) of the R project helped us calculate the assortativity and transitivity measures. We used the SNAP library for measuring the effective diameter, average clustering coefficient and density, and also the graphlet counts. Since we proposed a new method for quantifying the network degree distribution, we have implemented this method ourselves. We utilized RapidMiner as an open source tool for machine learning. The implementations of LADTree, Bayesian network learning and SVM are actually part of the Weka tool, which is embedded in RapidMiner. The amount of computation needed for this research, especially for counting the exact number of graphlets, was enormous. We utilized three virtual machines on a super-computer for this task, each of which simulated a computer with 16 processing cores of 2.8 GHz and 24 GB of memory. Most of the computation time was spent on counting the graphlets of the generated network instances.