1 Introduction
One major limitation of neural network-based learning systems is that they require a large amount of data for training. This is one of the biggest differences between human intelligence and artificial non-general intelligence such as an artificial neural network. Unlike a deep learning (i.e. deep neural network) model, a human can learn from a very limited number of examples, whereas a deep learning model needs to see a substantially larger number of samples to learn from. Thus, it is essential to have access to a large number of training data instances to unlock and evaluate the full potential of a neural network-based model. A straightforward technique to address the insufficiency of real-world datasets for neural network-based learning systems is to simulate high-quality, realistic synthetic data and use it to train the model. Additionally, if not for training, simulated datasets are particularly useful for evaluating a model's performance, i.e. during the testing phase. In many cases, it is far more convenient to simulate test cases representing exceptional situations than to collect data for those situations in the real world. In fact, for some real-world scenarios, it might not even be possible to obtain a dataset describing an exceptional scenario due to the rarity of the event or ethical constraints.
It is however crucial to test the trained model in those exceptional scenarios, because the cost of failure in those unlikely situations can be significantly higher than in a regular situation. One area where high-quality simulated and augmented data is extensively used is in neural network-based learning systems for self-driving cars. Almost all advanced autonomous vehicle technologies use simulated datasets. For example, Nvidia has developed the Nvidia Drive Constellation, a virtual reality autonomous vehicle simulator [1], and billions of miles have been driven in a simulated environment by Google's Waymo [2]. Similarly to self-driving cars, in many other applications of deep learning, high-quality simulated datasets are now in high demand. One such important application of deep learning, which is the focus of this study, is in the area of social networks, where graph-specific deep learning models are increasingly being developed and evaluated [3]. With the advancement of graph-specific neural network-based models, the demand for such datasets is growing rapidly. Furthermore, it is becoming more and more difficult to get access to complete (i.e. inclusive of node attributes) datasets representing social networks, mainly due to the user privacy concerns that we discuss later in this section.
Social network datasets are complex in nature and thus difficult to simulate, and there is a lack of comprehensive guidelines on how to simulate social network datasets with both features and ground-truth labels.
As mentioned earlier, graph data mining has become a very important research area due to the recent advancement and popularity of social networks [4, 5], especially online ones. Advancements in graph-based predictive modelling or graph community detection algorithms require datasets with ground-truth labels for evaluation purposes [6]. However, the majority of the available social network datasets do not contain labels. Moreover, real-world social network datasets contain high-dimensional features (node attributes and features are used interchangeably in this paper) [7] that represent information about both nodes and relationships. For example, a Facebook user generates a variety of information such as posts he/she likes, photos, status updates, etc. Even in citation networks, there are features such as domain, authors' affiliations, documents with thousands of words, etc. [8]. In publicly available datasets, such features are rarely included. For a small number of datasets, these node attributes may be included, but then usually the complete structure of the network is not; instead, only a subset (mainly ego networks) is available [9, 10, 11]. This is due to the fact that during the anonymisation process of networked data, in most cases, the majority of features must be removed, as these could be used to identify individuals [12], potentially raising ethical concerns. De-identification of network datasets is particularly difficult because of the unique topological structure a network may have. In a 2011 Kaggle link prediction competition, the most successful team won by de-anonymising most of the network data [13]. On top of that, nowadays even such graph datasets are becoming very difficult to obtain in the aftermath of the notorious usage of real-world social network datasets for the purpose of political influence [14, 15].
To ensure users' personal data is only used with explicit consent, governments and political unions are increasingly putting pressure on technology companies [16]. Additionally, new regulations on the usage of personal data, such as the European General Data Protection Regulation (GDPR), have already come into force in many countries such as the UK [17]. Unquestionably, such regulations are essential to guarantee user privacy. However, because of them, getting hold of datasets from social media is becoming increasingly challenging. Maintaining the advancement of research on social networks requires good-quality real-world datasets. One solution is to supplement the real-world social network datasets with synthetic, good-quality, realistic data.
The demand for graph datasets is further on the rise due to the advancement of graph-based machine learning, as traditional learning and data mining algorithms are being adapted for graph mining. Machine learning tasks on non-relational datasets only consider features and labels. However, graph datasets also contain edges between instances. These relationships can provide additional predictive power for a machine learning model. As a result, including these relations along with features in a predictive model is vital for prediction on graph-structured datasets. To include relationships, one may capture the relational links between instances through graph embedding and then train any traditional machine learning model for the task of classification or regression
[18, 19]. However, besides this indirect consideration of relations or links, there are developments in the area of graph mining which directly encode the relational component of a graph dataset into a deep artificial neural network, termed Graph Convolutional Networks (GCNs) [20]. In this approach, the topology of a graph is directly translated into the layers of a deep learning model: the features of the graph are multiplied with the filters of a neural network in the spectral domain (i.e. the graph Laplacian) of the graph, resulting in a direct convolutional operation. Apart from node classification, which is one of the most researched problems in machine learning, an important research area in graph mining is link prediction. A difficulty encountered when analysing any link prediction technique is the inability to obtain enough open, dynamic (time-dependent snapshots) social networks with features and labels. Typically, a link prediction algorithm is tested based on its predictive power on a future snapshot of the network. A supervised link prediction algorithm should ideally utilise both the topology and the available node attributes [21, 7]. For example, Scellato et al. [22] found that including features such as places and other related user activity improves the accuracy of link prediction considerably. Most developments in link prediction have been based on a single snapshot of the network, although incorporating the evolution of the graph may result in better link prediction performance, as shown by Tylenda et al. [23] and Xu et al. [24].

GCN is a semi-supervised classification model shown to outperform other state-of-the-art graph classification approaches using as little as 0.07% of labelled nodes per class [20]. In the paper introducing GCN, the datasets considered in the experiments were citation networks and knowledge graphs with explicitly defined class labels [20].
However, defining class labels for Facebook-, Twitter-, or LinkedIn-like social networks is not trivial. As discussed earlier, the difficulty is mainly associated with obtaining real-world graph datasets with labels and node attributes. One approach to evaluating such graph mining algorithms is to simulate graphs containing features. In this work, we propose that the preference of each node in a social network is the strongest, most useful and meaningful candidate for a label in a social graph.
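To make the GCN operation referenced above concrete, the sketch below implements the layer-wise propagation rule of Kipf and Welling [20] in plain NumPy. This is an illustrative toy with random weights, not the authors' implementation (which uses sparse operations and trained parameters):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    deg = A_hat.sum(axis=1)                   # degrees of the augmented graph
    D_inv_sqrt = np.diag(deg ** -0.5)         # symmetric normalisation
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(A_norm @ H @ W, 0.0)    # ReLU activation

# toy graph: 3 nodes in a path, 4 input features, 2 hidden units
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
H = np.random.rand(3, 4)                      # node feature matrix
W = np.random.rand(4, 2)                      # layer weights (random here)
H_next = gcn_layer(A, H, W)
print(H_next.shape)  # (3, 2)
```

Stacking two or three such layers mixes each node's features with those of its neighbourhood, which is how the topology enters the model.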
2 Related work
A straightforward way to simulate graphs is to generate them using well-established network models [25]: (1) the Barabási–Albert model for scale-free networks [26], (2) the Watts–Strogatz model for small-world networks [27], (3) the Erdős–Rényi model for random graph networks [28, 29, 30, 31], and (4) the forest-fire model [32, 33].
Random-graph model: In the random graph model, one creates a network with some properties of interest (e.g. a specific degree distribution) that is otherwise random. Although the random graph was first studied by Solomonoff and Rapoport [28], this model is mainly associated with Paul Erdős and Alfréd Rényi [29, 31].
Scale-free model: The scale-free model produces the power-law node degree distribution p_k ∝ k^-α (where k is the node degree and typically 2 < α < 3) observed in social networks. This kind of distribution was first discussed by Price [34]. Price, in turn, was inspired by Herbert Simon, who discussed power laws in a variety of non-network economic data [35].
Small-world model: Transitivity, measured by the network clustering coefficient, is, despite being extensively studied, still one of the least understood properties in network analysis according to Newman [36]. Another important property we observe in real networks is the small-world effect: all nodes are connected with each other by relatively short paths. To model these two properties, Watts and Strogatz introduced the small-world network model [27].
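The first three of these classical models are available off the shelf, for example in NetworkX (function names per the NetworkX API):

```python
import networkx as nx

n = 200                                              # number of nodes
er = nx.erdos_renyi_graph(n, p=0.05, seed=1)         # random graph G(n, p)
ba = nx.barabasi_albert_graph(n, m=3, seed=1)        # scale-free / preferential attachment
ws = nx.watts_strogatz_graph(n, k=6, p=0.1, seed=1)  # small-world rewiring

print(er.number_of_nodes(), ba.number_of_nodes(), ws.number_of_nodes())  # 200 200 200
```

Note that these generators produce topology only: no node features and no labels, which is exactly the limitation discussed below.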
Forest-fire model: In this model, a new node first connects to an existing node, and then makes connections with the neighbours of that newly connected node, carrying on recursively with some probability based on adjacent nodes [32, 33]. For example, in citation networks, an author finds a paper and cites it; he or she then cites more papers found through that paper, recursively [32]. In a social network, a friend may introduce someone to his/her mutual friend, and then the friend circle grows for that person [32]. The model is named after forest fires because it imitates the self-organising behaviour of a forest fire [33].

These quintessential network models are among the most important contributions towards understanding and modelling complex networks. However, these mathematical models are driven solely by the topology of a network. For example, the scale-free model considers the degree of a node and the small-world model considers mutual friends. Neither features nor labels of nodes and/or connections are mimicked by those models. One way to generate synthetic social networks with features is to find similarities/correlations between a randomly assigned number of features and let those similarities define connections [37, 38, 39]. For obvious reasons, this naïve approach is not ideal due to several limitations. Firstly, correlations between feature vectors do not consider the network topology. Secondly, a common correlation metric would assume that every person in a social network views and prefers a potential friend's features equally, in a linear fashion. Finally, it is often not obvious what the node labels are, an issue we discuss in detail in Section 3. However, there are some recent developments in agent- and event-based social network modelling, which are discussed below.
Agent-based modelling: Bruch and Atwell [40] provide a guideline on the agent-based modelling of social networks. In their paper [40], it is argued that the interplay between micro- and macro-level characteristics is complex, and that macro-level characteristics do not emerge solely from the simple aggregation of micro-level characteristics or low-level entities such as social network users [41]. Instead, micro- and macro-level behaviours or characteristics form a feedback loop, resulting in a non-linear interaction. From a social network point of view, the graph-level and node-level characteristics can be thought of as the macro- and micro-level characteristics of the network, respectively.
However, to simulate a modelling approach specific to social networks, one should consider the well-studied graph properties such as preferential attachment, mutual-friend preference, etc., and provide instructions on how to account for these properties in the simulation. In the research article of Granovetter [41], these social network properties are considered, but only implicitly, in terms of the macro- and micro-level characteristics. In summary, the study by Granovetter [41] provides a generic guideline for modelling social networks, but detailed and specific mathematical modelling instructions and an analysis of the social network properties are not provided.
In another work, Kavak et al. [42] argue that modelling should be performed by explicitly using available real-world datasets. In their experiment, they simulated a human mobility model based on 826,021,868 Twitter messages. Furthermore, they uncovered the geolocation of 92,296 users for the purpose of modelling. However, the purpose of our simulation is to produce good-quality synthetic graph-structured datasets when real-world data is not available, which is increasingly the case, as discussed in Section 1.
Event-based modelling: One recent interesting development in modelling dynamic event-based graphs is the Cognition-driven Social Network (CogSNet) model [43]. CogSNet models a social network based on a model of human memory. The authors argue that, similarly to human memory, a social relation is strengthened by repeated exposure to a similar event and weakened by deprivation of such events. Although CogSNet proposes a new paradigm in social network modelling, it does not provide an explicit explanation of how to model features within a dynamic event-based graph. Providing open-source social network datasets with labels, features, and graph or topological characteristics is the primary goal of this study.
To address the issues discussed above, i.e. 1) the lack of guidelines on implementing both the well-studied network properties and node features in simulated social networks, 2) insufficient research on simulating dynamic social networks with node features, and 3) the lack of rigorous studies providing directions on defining node labels in social networks, we propose a framework for social graph simulation. In our model, the simulated networks have the following characteristics, based on our understanding of Facebook-type social networks, along with well-studied social network properties such as preferential attachment.

Node features are evaluated by other nodes before connecting. If two nodes form a connection, the decision is taken by both nodes, thus both parties should evaluate each other's features.

The decision to form a connection is based on the preferences of nodes, which consist of a set of latent variables. These preferences are not directly linked with users' features. For example, two people could live in any state or county, but their preference towards a particular political party could be the same, thus resulting in different features but common preferences.

People have common preferences. For example, a group of people in a social network may prefer a common ideology or political view.

The node- and graph-level characteristics should both be taken into account while modelling a network. Node-level characteristics consist of features (i.e. node attributes such as age, gender, etc.), individual preferences (latent variables such as a preference towards a particular type of people, discussed in more detail in Section 3), and node degree (i.e. preferential attachment). Graph-level characteristics include, for example, a preference for shorter path lengths, i.e. connecting with friends who are nearer in terms of the graph topology.
3 Proposed approach
The proposed simulation is based on the preferences (i.e. a set of latent variables) of nodes, which can be interpreted as social rules. A node preference represents the preference of a person in a social network and, at the same time, the network-based projection of personality and behaviour. This in turn translates into the network topology. For example, one such network-based behaviour might be to only connect with people with many mutual friends, which shapes the topology of the ego network.
We identify the types of preference a node can have based on their topological and non-topological characteristics. The preferences/behaviours emerge from the following phenomena:

Feature-based (non-topological): From a node's/user's point of view, a combination of variable preferences towards the features of other nodes/users acts as the deciding factor for whom they wish to connect with. For example, someone may prefer to connect with people who live nearby in terms of geographic location (e.g. city, town), so a similar location feature is preferred; whereas for some other features, such as gender, being opposite or the same could be preferred. Thus, for this particular node, the preference towards geographic location in combination with gender is considered while connecting.

Topology-based: Besides node features, local topological characteristics may also play an important role. For example, someone may prefer to connect with people with whom he/she has mutual friends, whereas others may be more open. Some people may also prefer to connect with someone who has many friends, i.e. popular nodes. Both of these preferences are based solely on the graph topology and could mostly be identified via the topological properties of the social graph.

Hybrid feature- and topology-based (a combination of topological and non-topological features): People in social networks may prefer someone who is nearer to them in terms of geographic location, has a similar age and education level, and has many mutual friends. In this scenario, we have a combination of both the feature-based and topology-based preferences. Someone may also only connect with a politician who has many friends if he or she has similar political views.
All three types of human preferences are reflected in both the non-topological features of a node and the topology of the graph. Although the first, feature-based preference solely emerges from the non-topological component of a node, once the connections are made, these preferences are also reflected, or encoded, within the topology of the graph. Without considering the graph, or the relations between nodes, a predictive model will not be able to capture these complex patterns, which may negatively influence model performance. As a result, by including node attributes in graphs we can achieve higher predictive power.
We propose that the labels of social networks in supervised or semi-supervised classification should capture patterns resulting from the preferences discussed above. We name these preferences or behaviours for a particular node their socialDNA (sDNA). Although most people in a social graph have different features, many people have a similar sDNA (i.e. they share preferences). As a result, the sDNA is the most valuable and meaningful candidate for class labels for grouping nodes in a graph. However, these labels may not be explicitly defined for a given node classification problem. For example, a classification task may require identifying, for marketing purposes, a group of nodes who may prefer to buy a certain type of product. The label for the class of those who have bought the product should capture a certain group of people with similar preferences in the social network. In semi-supervised classification, if we have data for only a few people who have bought the product, the classification model would associate a certain type of sDNA in the social network with the group most likely to buy that product. In addition, even if we do not have any historical information telling us who has bought the product, it may be that a group of people with similar sDNA prefer the product more often than others. However, a person, or node, in a social network may not entirely know his or her own preferences or sDNA. As a result, finding these preferences (i.e. labels) in terms of sDNA is a non-trivial task. One solution to this problem could be to assign different labels to a few randomly selected nodes. These labels could also be selected based on a strategy focusing on the features of nodes, for example, selecting a few nodes with very different features from each other. After labelling, a semi-supervised classification algorithm such as GCN will infer the other nodes with similar preferences/sDNA.
Even if no prior knowledge is available, randomly selected nodes with different class labels could be used, with only one label per class; GCN has been shown to be powerful enough to accurately classify nodes with only one label per class [20].
4 Graph Formation
Let us assume that we have N nodes with F features each. Each element, or feature, of a node's feature vector could be bounded or unbounded. Every node subscribes to an sDNA, and there is a total of S different sDNAs. Nodes which subscribe to the same sDNA have the same preferences, and thus the same label. Each sDNA consists of two vectors of length F (i.e. the same as the number of node features): 1) w, which defines the strength or weight of the preference for a particular feature and ranges between 0 and 1, and 2) d, a binary attribute which defines whether similar or dissimilar features are preferred. Although d could be incorporated into w as its sign, the vector d is kept separate to make the preference stand out for user readability and because of its contribution to the sDNA mutation process discussed in Section 7 for dynamic graphs. This also allows a separate label for the preferences within the sDNA, which could be learned from the graph using machine learning algorithms, enabling a more in-depth predictive analysis. The feature-based score between two nodes i and j is calculated as follows (where ⊙ denotes the Hadamard product):
(1) 
Equation 1 gives the feature-based score, which indicates whether node j is a potential friend for node i. In this case, node i evaluates whether it wants to connect with node j, as we consider only i's sDNA. In many social networks, mainly undirected ones, the final connection or friendship decision is made by both nodes. We can introduce this two-way evaluation simply by adding node j's sDNA-based score to Equation 1.
(2) 
If both i and j subscribe to the same sDNA, both terms of Equation 2 use the same preference vectors. However, Equation 2 does not prefer similar sDNA over different sDNA, or vice versa. In a social network, the preference or sDNA is a set of latent variables, and it may well be that two people have a similar preference and this results in a lower score. For example, if two social network users both prefer to connect with the opposite gender, then, having the same gender, they are less likely to connect.
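Since the exact functional form of Equations 1 and 2 is compressed in this text, the sketch below shows one plausible instantiation: the weight vector w scales each per-feature comparison and d flips its sign (+1 prefers similar features, -1 dissimilar ones). The function names and the similarity measure are our assumptions, not the paper's definitions.

```python
import numpy as np

def feature_score(x_i, x_j, w, d):
    """One-way feature-based score (sketch of Eq. 1).

    w in [0, 1]^F weights each feature; d in {-1, +1}^F says whether
    similarity (+1) or dissimilarity (-1) of that feature is preferred.
    Per-feature similarity is taken as 1 - |x_i - x_j| for features
    scaled to [0, 1] (an assumption; the text only specifies the
    Hadamard product w ⊙ d).
    """
    sim = 1.0 - np.abs(x_i - x_j)        # per-feature similarity in [0, 1]
    return float(np.sum((w * d) * sim))  # Hadamard-weighted sum

def two_way_feature_score(x_i, x_j, sdna_i, sdna_j):
    """Two-way score (sketch of Eq. 2): both nodes evaluate each other."""
    w_i, d_i = sdna_i
    w_j, d_j = sdna_j
    return feature_score(x_i, x_j, w_i, d_i) + feature_score(x_j, x_i, w_j, d_j)

x_i = np.array([0.2, 0.8])
x_j = np.array([0.3, 0.1])
sdna = (np.array([1.0, 0.5]), np.array([+1, -1]))  # both nodes share one sDNA
s = two_way_feature_score(x_i, x_j, sdna, sdna)
print(round(s, 2))  # 1.5
```

With a shared sDNA and a symmetric similarity measure, the two one-way terms coincide, illustrating the symmetry noted above.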
Equation 2 does not consider the topological (i.e. graph-based, geometric) features of the nodes while calculating the score. In social networks, popularity, i.e. the degree of a node, is a common topological feature with a significant effect on the growth of the network. Typically, people tend to prefer other people who have a large number of connections; this is why famous people tend to get more connections. This phenomenon is well studied and known as preferential attachment [26]. To add the preferential attachment effect, we can simply add the degree of the connected nodes to the score. If node j has a certain number of connections, i.e. degree, then from i's perspective, the popularity-based score can be calculated as follows:
(3) 
However, the preference for nodes with higher degrees varies from person to person. We can incorporate this variability by including the sDNA's preferential-attachment parameter while calculating the score, resulting in:
(4) 
Equation 4 only considers the score of i and j from i's perspective (i.e. i's sDNA). For an undirected graph, we can use the following Equation 5 to calculate the preferential-attachment-based score from both i's and j's perspectives by adding their scores from each other's perspective:
(5) 
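A minimal sketch of the preferential-attachment scores of Equations 3 to 5, under our assumptions: the candidate's degree is scaled by the evaluator's sDNA preferential-attachment weight, and a logarithmic scaling keeps the score comparable in magnitude to the feature score (the log is our choice, not the paper's):

```python
import math

def pa_score(deg_j, alpha_i):
    """Popularity score of node j from i's perspective (sketch of Eq. 4):
    the candidate's degree scaled by i's sDNA preferential-attachment
    weight alpha_i. log1p damping is our assumption."""
    return alpha_i * math.log1p(deg_j)

def pa_score_undirected(deg_i, deg_j, alpha_i, alpha_j):
    """Both perspectives added together (sketch of Eq. 5)."""
    return pa_score(deg_j, alpha_i) + pa_score(deg_i, alpha_j)

# a popular candidate (degree 100) scores higher than an unpopular one (degree 5)
print(pa_score_undirected(10, 100, 0.5, 0.5) > pa_score_undirected(10, 5, 0.5, 0.5))  # True
```

Setting the sDNA weight to zero switches the preferential-attachment effect off entirely for that node, which is how person-to-person variability enters.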
A social network user tends to prefer people who are nearer to them in terms of graph topological distance [44]. Creating a connection with somebody who is a friend-of-our-friend is usually more likely than starting a relationship with someone who is further away from us in the structure. However, this preference is also subjective and varies among social network users. As a result, we add this user-subjective variability in path-length preference using the sDNA's path-preference variables, one for each path length from two up to the longest path length in the graph that the model is considering; the sDNA's path-preference vector therefore has one element per considered path length.
(6) 
In Equation 6, the generalised Kronecker delta function, written in Iverson bracket notation over the adjacency matrix of the graph, takes the value one if a path of the given length between i and j exists, and zero otherwise. This function of path length introduces non-linearity into the score. Equation 6 gives the score when i is evaluating j's potential to connect, or become friends, with i, using i's sDNA parameters to calculate j's score. This imitates the behaviour whereby, firstly, i may find j interesting and send a friend request on Facebook, and secondly, the final connection is made only if j also finds i interesting. Equation 6 accounts for the first case, and the following Equation 7, similarly to the feature-based score in Equation 2, accounts for the score from j's point of view for i. The final score function based on path topology is:
(7) 
Equation 7, which combines both perspectives, is required if we were to simulate an undirected graph, where both nodes make a mutual decision; for a directed graph, we only consider Equation 6.
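One plausible implementation of the path-length score of Equation 6, under our assumptions: the Iverson-bracket indicator of path existence is computed via powers of the adjacency matrix, and each considered path length is weighted by the evaluator's per-length preference (function names and the dictionary layout are ours):

```python
import numpy as np

def path_exists(A, l):
    """Indicator matrix: entry (i, j) is 1 if a walk/path of length l
    exists between i and j, computed from the l-th power of A."""
    return (np.linalg.matrix_power(A, l) > 0).astype(int)

def path_score(A, i, j, prefs):
    """Path-length score of j from i's perspective (sketch of Eq. 6).
    prefs[l] is i's sDNA preference weight for path length l, for
    l = 2 .. L (length-1 paths are existing edges, so they are
    excluded -- our reading of the text)."""
    return float(sum(w * path_exists(A, l)[i, j] for l, w in prefs.items()))

# path graph 0-1-2-3: node 2 is two hops from node 0, node 3 is three hops away
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
prefs = {2: 0.6, 3: 0.3}  # closer nodes are preferred more strongly
print(path_score(A, 0, 2, prefs), path_score(A, 0, 3, prefs))  # 0.6 0.3
```

The indicator is non-linear in the topology (it saturates at one however many such paths exist), matching the non-linearity remark above.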
Finally, we add all three scores, i) the feature-based score from Equation 2, ii) the popularity-based score from Equation 5, and iii) the path-length-based score from Equation 7 (undirected) or Equation 6 (directed), to calculate the final score. We consider an undirected graph, where both i and j make a mutual decision to connect with each other. In the case of a directed graph, for i connecting with j, we can simply consider Equations 1, 4, and 6. For simplicity we do not include the node subscripts in the final score function.
(8) 
Equation 8 gives the score between any two nodes. The scores are weighted or modified according to the sDNA a node belongs to. To further enforce some global, graph-level control over the effects of the feature-based, popularity-based, and path-length-based scores, we introduce two hyperparameters. This global control is useful in many situations; for example, one may wish to generate networks where a strong preferential-attachment phenomenon exists. To enable this global weighting, we introduce global weighting factors into Equation 8, one of which is a vector whose length equals the number of path lengths considered, starting from length two.
(9) 
Equation 9 contains variables from w, d, and the topological preference parameters. These do not come from the nodes directly but from their sDNAs, which in turn express their behaviour in the network. Equation 9 is one possible linear combination of the feature-based, popularity-based, and path-length-based scores; other, non-linear combination functions may be used depending on the target domain.
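As we read the description, the linear combination of Equation 9 has the following shape; the symbols η_pa and η_l are our placeholder names for the two global weighting factors, not the paper's notation:

```latex
% One plausible reading of Eq. 9: a global linear mix of the three scores.
% \eta_{pa} is the global preferential-attachment weight and
% \eta_{l} (l = 2, \dots, L) are the global per-path-length weights.
\mathrm{Score}_{ij} \;=\;
      \mathrm{Score}^{\mathrm{feat}}_{ij}
  \;+\; \eta_{pa}\,\mathrm{Score}^{\mathrm{pa}}_{ij}
  \;+\; \sum_{l=2}^{L} \eta_{l}\,\mathrm{Score}^{\mathrm{path}}_{ij}(l)
```

Setting η_pa high, for instance, produces networks dominated by preferential attachment, which is the kind of global control described above.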
5 Simulation Process
The link formation process for a graph with N nodes is given in Algorithm 1. Each node subscribes to exactly one of S different types of sDNA (Figure 1) and contains F features. In Line 2 the algorithm generates all pairs of nodes; in the case of an undirected graph, pairwise combinations (without repetition) are considered. Furthermore, if self-connection is desired, then self-pairing is also included. Social network users do not necessarily explore all potential friends whom they might connect with; for example, a Facebook user does not explore all existing Facebook users. As a result, the simulation process selects a pair of nodes for score calculation with an exploration probability p (Line 6), much like how connections are made in a random graph. An exploration probability of one results in the calculation of scores for all possible pairs, while with a probability of zero no score is calculated between any pair of nodes. The exploration probability thus incorporates controlled stochasticity. In order to determine the minimum score a pair of nodes should have to connect, we define a cut-off point q.
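The selection-and-connection procedure just outlined can be sketched as follows: enumerate node pairs, keep each with the exploration probability, score the kept pairs, sort, and connect the top fraction. The names (socialise, score_fn, p, q) are ours; score_fn stands in for Equation 9:

```python
import itertools
import random

def socialise(nodes, score_fn, p=0.3, q=0.2, seed=42):
    """Sketch of Algorithm 1 (Socialise). score_fn(i, j) stands in for
    Equation 9; p is the exploration probability and q the fraction of
    the scored pairs that actually connect."""
    rng = random.Random(seed)
    pairs = itertools.combinations(nodes, 2)          # all pairs, undirected
    scored = [(score_fn(i, j), i, j)
              for i, j in pairs if rng.random() < p]  # explore with prob. p
    scored.sort(reverse=True)                         # highest scores first
    stop = int(q * len(scored))                       # cut-off point
    return [(i, j) for _, i, j in scored[:stop]]      # connect the top q fraction

# toy run: the score favours pairs of nodes with close indices
edges = socialise(range(50), score_fn=lambda i, j: -abs(i - j))
print(len(edges) > 0)  # True
```

With q = 1 every explored pair connects, which reduces the procedure to an Erdős–Rényi random graph with edge probability p, as noted below.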
To sum up, we first calculate scores between pairs selected according to the exploration probability p and then sort these scores in descending order. After that, we connect the top q fraction of the scored pairs of nodes in the entire graph. Smaller values of q result in a social network where the users are very particular about whom they connect with. On the other hand, very high values of q result in a network where users do not care about features or topological properties while connecting; thus the latter case approaches a random graph model, and q = 1 results in a pure random graph model with edge-formation probability p. In Lines 4 and 5, from the set of all pairs we select a pair for score calculation with probability p; the random number generator in Line 5 returns a number drawn uniformly from 0 to 1. In Line 6, we use Equation 9 to calculate a score between the selected pair of nodes. Afterwards, the stopping length based on the suggested fraction q of node pairs to be connected is calculated (Line 11). In Line 12 we sort the selected pairs' scores in descending order. Finally, in Line 14, we connect the first q fraction of the pairs for which scores were calculated, so that pairs with higher scores are more likely to form connections.
6 Curse of Dimensionality in Networks
Real-world social networks contain high-dimensional features. If we consider a Facebook user's posts, likes, photos, comments, etc. as features, then we have thousands of features for each node. One problem with high-dimensional node features is the linear increase in the computational complexity of the simulation process discussed in Section 5. To overcome this problem, GPU computation can be used to calculate the scores of Equation 9. In our simulation library, we have enabled GPU computation, and Figure 2 shows the computation time for 300 nodes with an increasing number of features.
7 Dynamic graph
Real social networks evolve over time and are dynamic in nature. However, dynamic graph datasets are very rare, especially with ground-truth labels and node attributes. Such datasets are crucial in the field of dynamic graph research, but also essential for the evaluation of link prediction, which usually deals only with static graphs or a snapshot of a graph at time t. The link prediction problem is to identify new links that will be present in the network at a future time t + n [45, 46]. Assuming the network has a set of nodes V and a set of edges E at time t, and that a link between a pair of vertices u and v is denoted e_uv, the goal of link prediction is to predict whether e_uv will belong to E at time t + n when it does not belong to E at time t. The prediction is performed using topological and/or non-topological information about node characteristics and their relationships. Thus, to evaluate or test the performance of a link prediction method, a future snapshot at time t + n is required. Additionally, machine-learning-based link prediction algorithms require a future snapshot of the network as ground truth for training purposes. Interestingly, by using multiple runs of Algorithm 1 we can already obtain dynamic graphs, i.e. future snapshots of the network: every time we socialise the graph using Algorithm 1 on pairs of nodes which are not yet connected, new connections occur within the graph. However, this is perhaps not the best simulation of the dynamic nature of real social networks, because by running Algorithm 1 multiple times we are forcing each of the social network users to consider and connect with people whom they did not find interesting enough in the previous run(s). (Here, we assume no arrival of new nodes, i.e. a constant number of nodes. In the case of new nodes, we can easily run Algorithm 1 with the new nodes and include them in the graph.)
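As a minimal illustration (in our notation, with edges stored as frozensets for undirectedness), the ground truth for evaluating link prediction between two simulated snapshots is simply the set difference of their edge sets:

```python
def new_links(edges_t, edges_t_plus_n):
    """Ground truth for link prediction: edges present at time t + n
    but absent at time t."""
    return edges_t_plus_n - edges_t

E_t  = {frozenset(e) for e in [(1, 2), (2, 3)]}          # snapshot at time t
E_tn = {frozenset(e) for e in [(1, 2), (2, 3), (1, 3)]}  # snapshot at t + n
print(new_links(E_t, E_tn) == {frozenset((1, 3))})  # True
```

A predictor would be scored on how well it recovers exactly this set from the earlier snapshot.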
What we really want is not to force the users in the graph to make new connections but to allow the users' interests and preferences to change, and then run the Socialise Algorithm 1. This results in a concept drift in the user preferences, which can be achieved by changing the values of the variables in the sDNAs of the nodes. These changes in the sDNA reflect the phenomenon that the rules governing social networks can and do change over time. This change of preference can be achieved via the sDNA mutation given in Algorithm 2. The intensity of mutation is controlled by the mutation intensity parameter, which determines how many of the variables in the sDNA change their values. A low intensity changes only a few elements of the preference vector, so the user's preference towards a potential friend's features changes only slightly. If the mutation intensity parameter is set high, the entire preference vector may change. In Algorithm 2, in Line 1, the procedure takes all existing sDNAs from the graph, together with a boolean parameter that determines whether the preferential attachment strength should also be changed. In Line 2 we iterate through each of the sDNAs, one at a time. For the given sDNA, we then iterate through each element of the preference vector in Line 3, and reassign its value with a probability given by the mutation intensity (Line 5). In Line 7 we check whether the boolean parameter is set to True. If so, we also reassign one of two further sDNA variables with the same probability (Line 9); which of the two is reassigned is selected from a uniform random distribution. In Line 14 we reassign the sDNA parameter for the preferential attachment strength. Afterwards, in Line 16, random numbers are generated, one for each path length preference in Equation 6, with the intervals selected such that the constraint on the path length preferences is satisfied. Finally, in Line 20, we again iterate through each element of the path length preferences and reassign them from the random numbers already generated in Line 16.
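A compact sketch of the mutation step, assuming an sDNA stored as a dict with a preference vector, a preferential-attachment strength, and path length preferences (the field names and value ranges are our assumptions, not the library's):

```python
import random

def mutate_sdna(sdna, intensity, mutate_pa=False):
    """Sketch of Algorithm 2: each preference is re-drawn with
    probability `intensity`; optionally the preferential-attachment
    strength is re-drawn too; the path length preferences are
    re-drawn so that they still satisfy their sum constraint."""
    new = dict(sdna)
    new["preferences"] = [random.uniform(-1, 1) if random.random() < intensity else p
                          for p in sdna["preferences"]]
    if mutate_pa and random.random() < intensity:
        new["pa_strength"] = random.random()
    raw = [random.random() for _ in sdna["path_prefs"]]
    new["path_prefs"] = [r / sum(raw) for r in raw]  # sums to one
    return new
```

With `intensity=0` the preference vector is untouched; with `intensity=1` every entry is re-drawn, mirroring the low/high intensity behaviour described above.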
An interesting observation is that, generally, people's behaviour or preference changes are correlated with time. This change in the behaviour of social network users contributes to changes in the topology of the social network. In our simulation strategy, the snapshot-to-snapshot time difference should therefore also be correlated with the change in the users' behaviour or preferences, i.e. their sDNAs. The mutation intensity parameter in Algorithm 2 defines this intensity of mutation in the sDNA, or the intensity of the change in the social network users' behaviour. As a result, the value of this parameter is proportional to the time between two snapshots of the network. For example, running the Socialise algorithm (Algorithm 1) produces a social network with the first snapshot. Then running the Mutation algorithm (Algorithm 2) with a particular value of the mutation intensity parameter changes the preferences, and rerunning the Socialise algorithm results in another snapshot of the network further along the time dimension. A high mutation intensity corresponds to a larger time difference between these two snapshots.
One may wish to generate event-based dynamic networks, i.e. timestamped link formation. This can also be achieved by setting the 'fraction of nodes to be connected' parameter of the Socialise Algorithm 1 such that only one link is formed per run. Repeated runs of Algorithm 1 then produce an edge stream with a timestamp for each edge. As discussed earlier in this section, the time between the appearance of successive edges can also be manipulated by changing the value of the mutation intensity parameter.
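The repeated single-link runs can be sketched as follows, with a stand-in for Algorithm 1 (a uniformly random missing pair; the real algorithm would pick the link from the score computation):

```python
import itertools
import random

def socialise_one_link(edges, nodes):
    """Stand-in for Algorithm 1 tuned so that exactly one new link
    forms per run (here: a uniformly random missing pair)."""
    missing = [p for p in itertools.combinations(nodes, 2) if p not in edges]
    return random.choice(missing) if missing else None

def edge_stream(nodes, n_events):
    """Repeated single-link runs yield a timestamped edge stream."""
    edges, stream = set(), []
    for t in range(n_events):
        link = socialise_one_link(edges, nodes)
        if link is None:
            break
        edges.add(link)
        stream.append((t,) + link)  # (timestamp, u, v)
    return stream
```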
8 How to validate simulation
In order to assess whether the desired integration of features, labels, and topology is achieved, we measure and compare different trained models' predictability of the labels of the nodes. This comparison is done by designing different setups of the models such that the models perform predictions with the entire set of information (features, labels, and topology) as well as with partial information.
Here we discuss the validation setup. The predictability of the label of a node, i.e. sDNA, can be performed via the following configurations of an ideal machine learning model:

Predictability of nodes’ sDNAs with features combined with the graph topology

Predictability of nodes’ sDNAs using features only

Predictability of nodes' sDNAs using topology only
We can expect an ideal machine learning model to fully capture and learn patterns from both the topological and the feature-based information of the network, without overfitting or being susceptible to the noise or stochasticity in the network. Needless to say, such an ideal model does not currently exist. However, we should at least use a machine learning model that can directly utilise both the topological and the non-topological information, i.e. the features.
In our case we use the GCN [20], which can be regarded as one of the best models for directly combining the topological and non-topological information of a graph [47], to analyse the sDNA predictability of the simulated networks.
8.1 Graph Convolutional Networks (GCNs)
GCN is a multi-layer graph-based neural network. In each layer, the features are multiplied with the topology of the graph in the spectral domain (i.e. the symmetric normalised Laplacian matrix [20]). The weights of the connections (the edges/links over which the features of a node are passed, considered, or summed) are learned using backpropagation. However, as most real-world social networks are not regular graphs, one single weight is learned for all links of a particular node.
The layer-wise propagation rule for the $l$'th layer is:

$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right)$ (10)

In Equation 10, $W^{(l)}$ are the trainable weight matrices for each layer, $H^{(0)} = X$ (the feature matrix), and $\hat{A}$ is a graph representative matrix that we discuss in more detail in Section 8.2. $\hat{A}$ is fed into every layer of the model up to the output layer. Finally, $\sigma$ denotes a non-linear activation function.
For this model, the receptive field grows with the depth of the network [20]. In the first layer, only friends' features are considered, and in the second layer the features of friends of friends are also considered, i.e. summed before passing through a non-linearity. This is because the summarised friends' information is already gathered in the first layer.
The direct translation from a graph to the structure of the neural network³ is achieved via the graph representative matrix $\hat{A}$. The symmetric normalised Laplacian of the adjacency matrix has been used in the original formulation of the GCN [20]: ³In this paper, when we talk about a graph (i.e. a social network) we refer to it as a 'graph' or a 'network', but when we talk about a neural network it is written in its full form.
$\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ (11)

$\tilde{A} = A + I_N$ (12)

In Equation 12, $I_N$ adds the self-connections for each of the nodes in $A$, $\tilde{D}$ is the degree matrix of $\tilde{A}$, and $\tilde{A}$ is the adjacency matrix with added self-connections. The addition of self-connections facilitates the incorporation of the nodes' own features for better predictability. For example, a social network user's friends may give away his or her preference or class label (i.e. predictability based on the labels of the connected nodes), but additionally, his or her own features (i.e. self-connections in the graph) are also important for predicting his or her preference.
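The transformation in Equations 11 and 12 amounts to a few lines of linear algebra; a minimal sketch:

```python
import numpy as np

def graph_representative(A):
    """Symmetric normalisation with self-connections:
    A_hat = D_tilde^{-1/2} (A + I) D_tilde^{-1/2} (Eqs. 11-12)."""
    A_tilde = A + np.eye(A.shape[0])        # A_tilde = A + I_N (Eq. 12)
    d = A_tilde.sum(axis=1)                 # degrees of A_tilde
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_tilde @ D_inv_sqrt  # Eq. 11

# On a triangle graph every entry of the representative becomes 1/3.
A = np.ones((3, 3)) - np.eye(3)
A_hat = graph_representative(A)
```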
In Equation 10, the main transformation from a graph to the neural network is performed through $\hat{A}$. If the adjacency matrix $A$ in Equation 11 is replaced with a different representative function of the graph, the structure of the neural network itself changes. However, this does not change the input feature matrix $X$. As a result, this is not exactly a data preprocessing technique but rather a change in the architecture of the neural network. We discuss the usage of different graph representatives in Section 8.2.
Using the GCN, we evaluate the three setups for node label prediction mentioned in Section 8 by changing the propagation rule in Equation 10 as follows:

Prediction of nodes' sDNAs with both the features and the graph topology, using the propagation rule of Equation 10 for the first layer, where $\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ and $H^{(0)} = X$, where $X$ is the feature matrix:

$H^{(1)} = \sigma\left(\hat{A} X W^{(0)}\right)$ (13)

This is the straightforward GCN model proposed by Kipf and Welling [20]. Here, the graph representative $\hat{A}$ is fed into every layer of the model, but the feature matrix $X$ is fed only into the first layer.

Prediction of nodes' sDNAs with features, excluding the graph topology, with the following propagation rule:

$H^{(1)} = \sigma\left(I_N X W^{(0)}\right)$ (14)

In this first-layer propagation rule in Equation 14, $I_N$ is the identity matrix with the dimensions of the adjacency matrix $A$. $I_N$ is fed into the model up to the output layer. Thus, only the features of each node are considered, and the graph topology does not play any role in label or sDNA prediction.
Prediction of nodes' sDNAs excluding the features, i.e. solely with the graph topology:

$H^{(1)} = \sigma\left(\hat{A} I W^{(0)}\right)$ (15)

In this propagation rule in Equation 15, $I$ is an identity matrix in place of the feature matrix $X$. As a result, only the graph topology is considered, and the features do not play any role in the model. Here, the graph representative $\hat{A}$ is fed into each layer of the model up to the output layer, whereas $I$ is fed only into the first layer.
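The three setups differ only in what is fed into the first layer, which can be sketched as follows (shapes and the ReLU choice are illustrative):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def first_layer(A_hat, X, W, mode="FT"):
    """First-layer propagation for the three validation setups:
    'FT' uses both A_hat and X (Eq. 13), 'F' replaces A_hat with an
    identity (Eq. 14), 'T' replaces X with an identity (Eq. 15)."""
    n = A_hat.shape[0]
    if mode == "F":
        A_hat = np.eye(n)   # topology ignored
    elif mode == "T":
        X = np.eye(n)       # features ignored
    return relu(A_hat @ X @ W)
```

Note that in 'T' mode the weight multiplies an n-by-n identity, so its first dimension must equal the number of nodes rather than the number of features; this is exactly the parameter-count issue addressed by the low-rank variant in Section 9.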
We assume that during the simulations, the first setup will produce more accurate results than the remaining two. This hypothesis is represented through the following inequality:

$Acc_{FT} > Acc_{F}, \qquad Acc_{FT} > Acc_{T}, \qquad Acc_{FT} > Acc_{T_{LR}}$ (16)

where $Acc$ is the test accuracy of the trained neural network model for the four different setups, based on the four different propagation rules in Equations 13, 14, 15, and 26. However, the assumption of the GCN is that for the $l$'th layer of the model, only the $l$'th-order neighbourhood nodes are influential [20, 47]. To work around this problem, we develop a strategy of replacing the adjacency matrix $A$ in the Laplacian transformation of Equation 11 with three different existing node-similarity measures. In social networks, not all connected nodes have the same influence; in fact, some not directly connected nodes in the graph may have greater influence over a node in question than the directly connected ones. As a result, the usage of the adjacency matrix as a graph representative may not always entail the best performance of the neural network.
8.2 Node-similarities as Graph Representatives for GCN
In social networks, the adjacency matrix represents direct links between nodes. In the GCN, the features propagate through those links. Thus, a node's label is predicted by utilising patterns in the features and labels of the surrounding connected nodes. However, in social networks, not all the connections of a given node have the same, or even a similar, effect on this node. It can be assumed, for example, that the influence one node has on its neighbour increases with the number of their mutual friends. Similarly, a friend of a friend of node $u$ may influence node $u$ more than a directly connected but uninfluential node, e.g. one that does not have any common friends with node $u$. One can incorporate these relationship characteristics, which effectively change the representation of the network, in the form of a social node-similarity-based matrix for the GCN. One way to extract and represent the strengths of these (not necessarily direct) social relationships between nodes is to use a matrix which describes the similarity between nodes instead of the adjacency matrix. For example, the Katz similarity measure considers the number of all paths from node $u$ to node $v$ [48]. Thus, more mutual friends result in a higher number of paths, and consequently a higher Katz score. In this study, we replace the adjacency matrix with three different types of node-similarity matrices, as they encompass richer information about the underlying structure than the traditional adjacency matrix. The following are the three node-similarity measures we have considered:

Katz, which considers the number of all the paths from node $u$ to node $v$ [48]. Shorter paths have bigger weights (i.e. are more important); the weight of a path of length $l$ is damped exponentially via the parameter $\beta$ ($A$ is the adjacency matrix):

$S_{Katz} = \sum_{l=1}^{\infty} \beta^{l} A^{l} = (I - \beta A)^{-1} - I$ (17)

The above similarity in Equation 17 results in the following graph representative $\hat{A}_{Katz}$:

$\tilde{A}_{Katz} = S_{Katz} + I_N$ (18)

$\hat{A}_{Katz} = \tilde{D}^{-\frac{1}{2}} \tilde{A}_{Katz} \tilde{D}^{-\frac{1}{2}}$ (19)
Rooted PageRank (RPR) is used by search engines to rank websites. In graph analysis, it ranks nodes by the probability of each node being reached via a random walk on the graph [49]. $S_{RPR}$ is calculated from the stationary probability distribution of a random walk that uses the degree matrix $D$: the walk returns to the root node with probability $\alpha$ at each step, moving to a random neighbour with probability $1 - \alpha$. This results in the following graph representative $\hat{A}_{RPR}$:

$S_{RPR} = \alpha \left(I - (1 - \alpha) D^{-1} A\right)^{-1}$ (20)

$\hat{A}_{RPR} = \tilde{D}^{-\frac{1}{2}} \left(S_{RPR} + I_N\right) \tilde{D}^{-\frac{1}{2}}$ (21)
Graph Gravity (GG): inspired by Newton's law of universal gravitation, this node-similarity measure uses degree centrality as the mass of the nodes, while the lengths of the shortest paths between them act as distances [44, 50]. The above analogy leads to the following formula for calculating the score between two nodes $u$ and $v$:

$S_{GG}(u, v) = \frac{k_u \, k_v}{d_{u,v}^{2}}$ (22)

where $k$ denotes the degree centrality and $d_{u,v}$ is the length of the shortest path between $u$ and $v$. The node-similarity in Equation 22 results in the following graph representative $\hat{A}_{GG}$:

$\tilde{A}_{GG} = S_{GG} + I_N$ (23)

$\hat{A}_{GG} = \tilde{D}^{-\frac{1}{2}} \tilde{A}_{GG} \tilde{D}^{-\frac{1}{2}}$ (24)
For all three node-similarity measures above, the node-similarity matrix (calculated for all possible links) is preprocessed and reconfigured further, as discussed in Section 9.1.
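The three similarity matrices can be computed as below. This is a sketch under our reading of Equations 17, 20, and 22; the closed forms and the default parameter values are our assumptions, and the shortest paths for GG are computed with a plain Floyd-Warshall pass.

```python
import numpy as np
from itertools import product

def katz(A, beta=0.1):
    """Closed form of Eq. 17: S = (I - beta*A)^{-1} - I."""
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

def rooted_pagerank(A, alpha=0.15):
    """Walk restarts at the root with prob. alpha, otherwise moves to
    a uniformly random neighbour (Eq. 20)."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transitions
    return alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * P)

def graph_gravity(A):
    """Degree product over squared shortest-path distance (Eq. 22)."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    dist = np.where(A > 0, 1.0, np.inf)    # Floyd-Warshall init
    np.fill_diagonal(dist, 0.0)
    for k, i, j in product(range(n), repeat=3):
        dist[i, j] = min(dist[i, j], dist[i, k] + dist[k, j])
    with np.errstate(divide="ignore"):
        S = np.outer(deg, deg) / dist**2
    np.fill_diagonal(S, 0.0)
    return S
```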
8.3 Weighted Feature Matrix
The GCN is a powerful model for node classification, and it has been shown to perform well even with the graph topology alone, i.e. without the feature matrix [20]. Such good predictability without the features could be due to two reasons. Firstly, when the focus is on node classification for graph-structured datasets, the preferred features of the nodes should be reflected in the topology of the graph, as discussed in our simulation process in Section 4. The fact that these features are encoded in the topology may result in good predictability even when the features are not directly considered by the model. Additionally, whether features only or topology only gives better predictability may vary from node to node. For some nodes, the topology alone may give better predictability than the node's features. This could be because the topological position of a node overshadows the importance of its features.
Secondly, the good performance based solely on topology could be because, similar to real social network users, we have defined the sDNA of the nodes such that some features of other nodes are preferred and some are not (Section 4). In other words, not all the features play similar roles when it comes to the predictability of the sDNA. As a result, across the entire graph, some features may be disliked or not preferred by the majority of the nodes when forming connections. This is why an additional learnable weight, common to all nodes, for each feature may result in better predictability. In our analysis, we have found that adding this additional weight, which defines the importance of each feature for all the nodes, performs best, and this is what we present in Section 10. To introduce this relative importance of features, we use one additional weight vector in the model, with a common weight for a particular feature across all the nodes. If we have a network with $N$ nodes and $F$ features each, then for each feature a common (i.e. shared across all the nodes) weight is used to learn the strength of that feature. This additional feature weight vector $W_f$ has the size of the number of features and is used only in the first layer of the model. Hence all the input features are weighted before being passed to the hidden layers.
This additional weight vector results in the following first-layer propagation rule, based on Equation 10:

$H^{(1)} = \sigma\left(\hat{A} \left(X \circ (J \cdot W_f)\right) W^{(0)}\right)$ (25)

where $W_f$ is the vector containing the unbounded learnable parameters that define the strength of each feature in the feature matrix $X$, $J$ is an all-ones matrix, and $\circ$ denotes the Hadamard product between the feature matrix and the dot product of $J$ and $W_f$.
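A sketch of this first-layer rule, holding $W_f$ as a 1-by-F row vector that the all-ones column broadcasts to every node (shapes are our assumptions):

```python
import numpy as np

def weighted_first_layer(A_hat, X, W_f, W0):
    """Eq. 25 sketch: each feature column of X is rescaled by its
    learnable weight in W_f (shared by all nodes) before the usual
    GCN propagation. J is the all-ones column, so J @ W_f repeats
    W_f for every node and the Hadamard product applies it."""
    J = np.ones((X.shape[0], 1))
    X_weighted = X * (J @ W_f)   # Hadamard product X o (J . W_f)
    return np.maximum(A_hat @ X_weighted @ W0, 0.0)
```

Setting a weight in `W_f` to zero lets the model ignore that feature entirely, which is the behaviour motivating this variant.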
9 Experimental Setup
In the experiments, we simulate social networks, each with a fixed number of nodes. Each network has four different types of sDNAs, with each node subscribing to a single type. We take three different snapshots of each network, going from ten initial networks to a total of 30 networks (i.e. three snapshots of each network). Each node has a fixed number of features, generated from a uniform distribution. All the variables for the sDNA (described in Figure 1 and Section 4) are also generated from uniform distributions. All models have four graph convolutional layers. All layers except the output layer use rectified linear units (ReLU) as non-linear activation functions. The output graph convolutional layer uses a softmax activation, and the categorical cross-entropy loss is calculated for the four types of sDNAs or node labels. Each layer except the output layer contains the same number of units, and the output layer has four units, the same as the number of sDNA classes to be predicted. Finally, we used the Adam optimiser, a first-order gradient-based algorithm, to learn the weights of our differentiable neural network model (i.e. to optimise the loss function). Every model is evaluated with the same setup. On every network, 10-fold cross-validation is performed and the average accuracy is reported. Additionally, the standard deviation of the accuracy is reported for the best model and the original GCN in Table 2. All the hyperparameters are kept fixed for all the models. We used a fixed learning rate, L2 kernel regularisation (i.e. weight decay) with a fixed decay rate for all the hidden layers, and a dropout layer after each hidden layer with a rate of 0.5, i.e. 50% of randomly selected neurons are dropped in each training iteration.

Model  Description  Graph representative  Eq.

FTVanilla  Original GCN, features + topology  $\hat{A}$  10
T  Original GCN, topology only  $\hat{A}$  15
TLR  Original GCN, topology only (low-rank)  $\hat{A}$  26
F  Original GCN, features only  N/A  14
FTKatz  Features + Katz-based topology  $\hat{A}_{Katz}$  18
FTRPR  Features + RPR-based topology  $\hat{A}_{RPR}$  20
FTGG  Features + GG-based topology  $\hat{A}_{GG}$  22
In Table 1, for the topology-only model T (Equation 15), the weight matrix contains more trainable parameters than in the FTVanilla model (Equation 13). For the model with both topology and features, FTVanilla, the first-layer weight matrix has the dimension of the number of features times the number of units, where the number of units is a hyperparameter shared by all the models. For the topology-only model T in Equation 15, where the feature matrix is replaced by an identity matrix, the weight matrix is directly multiplied with the graph representative, i.e. the graph topology $\hat{A}$. As a result, its dimension is the number of nodes times the number of units, which is far larger, since the number of nodes greatly exceeds the number of features. Hence there are many more trainable parameters in the T model than in FTVanilla. If we want to compare the performance of T and FTVanilla to test whether Inequality 16 holds, as a validation of the feature and topology integration process, the total numbers of trainable parameters of the two models should be as close as possible. To make the models comparable, we introduce another setup for the topology-only model that keeps the number of parameters at a level similar to the model using both features and topology.
$H^{(1)} = \sigma\left(\hat{A} W_a^{(0)} W_b^{(0)}\right)$ (26)

In Equation 26, the first-layer weight matrix is split into two matrices, $W_a^{(0)}$ and $W_b^{(0)}$, to keep the number of trainable parameters roughly in line with the FTVanilla model.
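A sketch of the factorised topology-only layer and its parameter saving (the rank and size values are illustrative, not the paper's settings):

```python
import numpy as np

def topology_lowrank_layer(A_hat, W_a, W_b):
    """Eq. 26 sketch: the N x U topology-only weight is replaced by
    the product of W_a (N x r) and W_b (r x U), so the first layer
    trains N*r + r*U parameters instead of N*U."""
    return np.maximum(A_hat @ W_a @ W_b, 0.0)

N, U, r = 300, 64, 10
full_params = N * U             # unfactorised topology-only weight
lowrank_params = N * r + r * U  # factorised replacement
```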
9.1 Augmented node-similarity matrix
For GG (Equation 24), the similarity scores for all the non-existing links are calculated and then normalised between zero and one. Afterwards, the adjacency matrix is summed with the calculated scores for all the non-existing links. As a result, all the existing links have a value of one, and for the non-existing links the value ranges from zero to one. For Katz and RPR, the path-based similarity scores are calculated for all possible links. For all the networks, the Katz score is calculated with the highest exponent of five for the adjacency matrix in Equation 17, with a fixed damping parameter $\beta$. For RPR, the return probability parameter $\alpha$ is likewise kept fixed. For each of the calculated similarity matrices (Katz, RPR, and GG), each row is normalised over its non-zero elements using the L2 norm. Moreover, several thresholds are applied to the row-normalised similarity-based adjacency matrices. The thresholds are set such that if a value in the similarity-based matrix is less than or equal to the first threshold, it is set to zero, whereas if the value is greater than the second threshold, it is set to one. If the thresholds are set to zero and one respectively, then no value in the matrix is changed. Also, when the two thresholds differ, the elements of the matrix which lie between them are left unaltered. The sets of thresholds are selected based on empirical analysis, i.e. the cross-validation accuracy of the model. Additionally, we also select a threshold based on the mean value of the elements of the matrix: if a non-zero element of the matrix is less than or equal to the mean value, it is set to zero, and to one otherwise.
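The row normalisation and thresholding steps can be sketched as follows (parameter names are ours; the `auto` flag reproduces the mean-value threshold):

```python
import numpy as np

def augment_similarity(S, low=0.0, high=1.0, auto=False):
    """L2-normalise each row of a similarity matrix, then zero out
    entries <= `low` and set entries > `high` to one, leaving values
    in between untouched. With auto=True the mean of the non-zero
    normalised entries is used as both thresholds, binarising the
    matrix as described above."""
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid dividing empty rows by zero
    R = S / norms
    if auto:
        low = high = R[R > 0].mean()
    out = R.copy()
    out[out <= low] = 0.0
    out[out > high] = 1.0
    return out
```

With the default thresholds of zero and one the matrix passes through unchanged apart from the normalisation, matching the no-op case described above.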
In the GCN, for the $l$'th layer, only nodes within path length $l$ are considered [20]. This limits the scope of the receptive field of a node in each layer, and the maximum receptive field is limited by the number of layers used in the model. This limitation was also pointed out in the paper where the GCN was first introduced [20]. However, using node-similarity measures along with the augmentation process described here allows the model to consider a node three paths away from node $u$ even in the first layer (i.e. as a direct connection) for the classification of node $u$, assuming that the two nodes have a high node-similarity score. As a result, the augmented node-similarity measure removes the layer-wise node-neighbourhood dependency of the GCN.
10 Results and Discussion
In Figure 3, we show the accuracy of all the models tested on the 30 simulated networks. All results are 10-fold cross-validated and the average accuracy is reported. In Figure 3 and Table 2, we observe that, in line with the hypothesis of Inequality 16, the accuracy of the model which uses node features only, i.e. F, is very low. In fact, its predictability is no better than random chance (the accuracy is around 0.25, and we have four equally represented labels or sDNA types to predict). Additionally, from Figure 3 and Table 2 we can see that for the majority of the datasets (all except three networks), models which utilise both the topology and the features of the graph perform better than the two other setups where topology and features are considered independently. When only topology is used, the model T performs best on three networks. Two of them are third snapshots (i.e. the third run of Algorithm 1) of a network (networks 12 and 62), and the third one is a second snapshot (network 21). This can be explained by the fact that, as we run Algorithm 1 multiple times, the patterns of preferences become encoded within the network topology, so the topology-only model performs better. This is something we also expect in real-world networks, i.e. as a person makes more connections, their connection-making pattern becomes evident.
Among the methods using node-similarity matrices instead of the adjacency matrix, we see that the choice of threshold has a significant effect on model performance. However, we can also see that using the mean value of the L2-normalised node-similarity matrix as a threshold (described in Section 9.1) performs quite well. In fact, on seven networks, the setup using the mean value as a threshold on the node-similarity matrix outperforms all the other models (Table 2). The models with the mean value as threshold are marked 'auto' in Table 2.
If we do not consider the differences between thresholds and the usage of the weighted feature matrix on the node-similarity measures, the Katz-based models significantly outperform the original GCN (i.e. FTVanilla) and the two other node-similarity measures (RPR and GG), based on the results in Table 2. RPR performs best on five networks, and GG on one network. Ranking the models by how often they perform best across all the datasets, the Katz-based models come first, followed by the RPR-based and the GG-based ones. Thus, on average, the Katz-based model performs best across all the datasets.
The results also show that using the trainable feature weights of Equation 25 gives a better model for many datasets than not using them. In 15 out of 30 datasets, using the weighted feature matrix in the first layer of the model outperforms the other models (Table 2). Furthermore, the best-performing models with the weighted feature matrix are mainly not the original GCN but the node-similarity-based models, except for one dataset. However, this does not necessarily imply that the additional weights in the first layer based on Equation 25 only perform well with node-similarity-based models, because node-similarity may in general give better predictability than the adjacency matrix.
From Figure 3 and Table 1 we can see that the performance of a node-similarity-based model varies depending on the network the model is trained on. This is because all the networks are simulated with different rules, and no two networks are exactly the same. We can expect the same in real-world networks. Thus, the choice of a node-similarity method could be based on empirical analysis. Alternatively, one may use the mean value of the normalised node-similarity matrix as the threshold, especially with GG, as discussed earlier in this section.
The results in Figure 3 show that we can achieve high accuracy on node classification (in fact, higher than using only the adjacency matrix) when a node-similarity-based graph topology is used. This is particularly useful for very dense networks, for which the training time of the GCN is extremely high. Many real-world datasets, such as face-to-face interaction networks, tend to be very dense. In such scenarios, the similarity-based matrices can be used (with a suitable threshold to reduce the number of connections, as per Section 9.1) to reduce the training time.
Networks  FTvanilla([20]) Acc (SD)  Max Acc (SD)  Max Acc Model 

00  0.721 (0.011)  0.732 (0.007)  FTRPRauto 
01  0.71 (0.022)  0.762 (0.011)  SFTkatzAuto 
02  0.699 (0.032)  0.739 (0.012)  SFTkatzAuto 
10  0.741 (0.013)  0.753 (0.012)  SFTkatz0.11.0 
11  0.736 (0.034)  0.767 (0.007)  SFTvanilla 
12  0.754 (0.027)  0.767 (0.033)  T 
20  0.507 (0.045)  0.559 (0.009)  SFTkatz0.00.5 
21  0.599 (0.044)  0.618 (0.074)  T 
22  0.671 (0.017)  0.675 (0.014)  SFTRPRauto 
30  0.554 (0.018)  0.576 (0.009)  SFTkatz0.00.5 
31  0.517 (0.027)  0.553 (0.013)  SFTGG0.01.0 
32  0.515 (0.045)  0.555 (0.016)  SFTRPRauto 
40  0.49 (0.011)  0.508 (0.006)  FTRPR0.00.5 
41  0.51 (0.010)  0.532 (0.013)  SFTkatz0.00.5 
42  0.506 (0.010)  0.545 (0.009)  FTkatz0.01.0 
50  0.701 (0.028)  0.741 (0.005)  FTkatz0.11.0 
51  0.758 (0.010)  0.783 (0.005)  SFTkatz0.11.0 
52  0.756 (0.044)  0.807 (0.004)  FTRPRauto 
60  0.686 (0.009)  0.709 (0.010)  FTkatz0.00.5 
61  0.701 (0.030)  0.748 (0.018)  SFTkatzAuto 
62  0.726 (0.035)  0.759 (0.015)  T 
70  0.757 (0.010)  0.762 (0.008)  SFTkatz0.00.5 
71  0.738 (0.013)  0.764 (0.010)  FTkatz0.00.5 
72  0.731 (0.015)  0.761 (0.013)  FTkatzAuto 
80  0.677 (0.016)  0.724 (0.007)  FTkatz0.00.5 
81  0.717 (0.028)  0.752 (0.007)  SFTkatz0.11.0 
82  0.731 (0.016)  0.742 (0.013)  SFTkatz0.00.5 
90  0.699 (0.014)  0.747 (0.011)  FTkatz0.01.0 
91  0.68 (0.012)  0.726 (0.019)  FTkatz0.00.5 
92  0.627 (0.012)  0.726 (0.006)  FTkatz0.11.0 
11 Conclusions and Future Work
In this work, we have evaluated the performance of the GCN on simulated friendship-based social network datasets. One limitation of the GCN is that the order of the neighbourhood it can consider is limited by the number of layers used in the model. We argued that using the node-similarity matrix as a graph representative solves this dependency between the layers and the order of the neighbourhood nodes. Additionally, thanks to this reduced dependency between the number of layers and the highest order of node neighbourhood considered, our approach with the node-similarity measures may perform well with only a few layers compared with the original GCN. The GCN, like any deep learning model, is prone to overfitting when a large number of layers is used [20], and our approach may get around this problem and achieve higher accuracy with only a few layers. We have also empirically shown that most of the models with the augmented node-similarity measures outperform the original GCN.
In total, we have proposed four new variations of the GCN model. Three of them are primarily based on the Katz, RPR, and GG scores as forms of graph topology encoding. In the fourth model, we add learnable parameters for each of the features, independent of the nodes, for the entire graph, allowing the model to ignore input features if it so chooses. This variation can be used with the adjacency matrix as well as with the Katz, RPR, or GG scores, and its primary motivation was the observation that for some datasets, using the topology only gives superior results. The results show that these new variations outperform the original GCN model in terms of accuracy.
For the node-similarity-based matrices, we have proposed a reconfiguration technique. This reconfiguration results in an augmentation of the graph represented by the node-similarity matrix. This is particularly important because, for a node classification task with GCN-like models, we only have one graph sample to train the model on. The augmentation technique can be used to better train the model on the same graph with several different augmented node-similarity matrices (with different thresholds and similarity measures). Several representations of the same graph topology can also act as a regularisation technique to prevent overfitting of the model.
We argued that a node in Facebook-type social networks can be defined in terms of a set of preferences of that node (which we coined the sDNA). Based on the sDNA, our simulation strategy provides a comprehensive guideline on how to generate dynamic networks with features and ground-truth labels, which are particularly useful to train and test neural network-based learning systems. We have validated the integration of features and topology of the simulated graphs based on the predictability of the GCN. If the integration process is good enough, the GCN should not perform better with topology only than with both topology and features. However, we found that a large number of models performed better with only the topology of the network. We concluded that this is because not all the features play a similar role in the graph. To include such variation in the importance of each feature across all the nodes, we introduced the weighted feature matrix for the GCN. This new variant of the GCN has shown great potential: with the weighted feature matrix, the majority of the models (all except three datasets) perform better with both features and topology than with topology only. This not only produces a new variant of the GCN model but also shows that our integration process for topology and features is successful.
The three cases where topology only performs better could be due to the significantly larger number of learnable parameters that model has compared with the feature-and-topology model, as discussed. To address this problem of an unequal number of learnable parameters, we introduced another variation of the topology-only model, in which the number of learnable parameters is reduced by using a low-rank approximation of the weight matrix. This reduced-parameter topology-only model also performs well compared with the model with more parameters. A further inspection of those datasets may reveal the underlying reason why the topology-only models perform well on them. It is possible that for those three datasets, the features are reflected within the topology so well that the topology-only model becomes more powerful, and adding features simply results in redundancy.
We used a few empirically selected thresholds for the augmented node-similarity matrix. A more principled way to select optimal thresholds is another future direction to explore.
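One simple way to explore such thresholds empirically is to sweep candidate values and inspect the edge density each one induces. The helper below is a hypothetical illustration, not part of the paper's pipeline; it assumes the augmented node-similarity matrix is a dense symmetric array S.

```python
import numpy as np

def edges_from_similarity(S, threshold):
    """Binarize a node-similarity matrix into an adjacency matrix."""
    A = (S >= threshold).astype(int)
    np.fill_diagonal(A, 0)  # no self-loops
    return A

def sweep_thresholds(S, candidates):
    """Report the edge density induced by each candidate threshold,
    a simple aid for choosing the cut-off empirically."""
    n = S.shape[0]
    return {t: edges_from_similarity(S, t).sum() / (n * (n - 1))
            for t in candidates}

S = np.array([[1.0, 0.9, 0.2],
              [0.9, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
print(sweep_thresholds(S, [0.1, 0.6, 0.95]))
```

Plotting density (or a downstream metric such as GCN accuracy) against the candidate thresholds gives a concrete starting point for replacing the empirical choice with an optimized one.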
We also provide an open-source library for social network simulation, written in Python with GPU computation support for high-dimensional features. We aim to incorporate more features into the simulation library in the future.
12 References
 [1] Virtual-based safety testing for self-driving cars from NVIDIA DRIVE Constellation. URL https://www.nvidia.com/en-gb/self-driving-cars/drive-constellation/.
 [2] Waymo safety report on the road to fully self-driving. https://waymo.com/safety/.
 [3] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
 [4] Stanley Wasserman and Katherine Faust. Social network analysis: Methods and applications, volume 8. Cambridge University Press, 1994.
 [5] Mark Newman. Networks. Oxford University Press, 2018.
 [6] Androniki Sapountzi and Kostas E Psannis. Social networking data analysis tools & challenges. Future Generation Computer Systems, 86:893–913, 2018.

 [7] Antonio Pecli, Maria Claudia Cavalcanti, and Ronaldo Goldschmidt. Automatic feature selection for supervised learning in link prediction applications: a comparative study. Knowledge and Information Systems, 56(1):85–121, 2018.
 [8] Alexandrin Popescul and Lyle H Ungar. Statistical relational learning for link prediction. In IJCAI workshop on learning statistical models from relational data, volume 2003. Citeseer, 2003.
 [9] Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P Gummadi. On the evolution of user interaction in Facebook. In Proceedings of the 2nd ACM workshop on Online social networks, pages 37–42. ACM, 2009.
 [10] Facebook friendships network dataset – KONECT. http://konect.uni-koblenz.de/networks/facebook-wosn-links. Accessed: 2019-02-11.
 [11] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
 [12] Leanne Townsend and Claire Wallace. Social media research: A guide to ethics. University of Aberdeen, pages 1–16, 2016.
 [13] Arvind Narayanan, Elaine Shi, and Benjamin IP Rubinstein. Link prediction by de-anonymization: How we won the Kaggle social network challenge. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1825–1834. IEEE, 2011.
 [14] David J Hand. Aspects of data ethics in a changing world: Where are we now? Big Data, 6(3):176–190, 2018.
 [15] Carole Cadwalladr and Emma Graham-Harrison. Revealed: 50 million Facebook profiles harvested for Cambridge Analytica in major data breach. The Guardian, 17, 2018.
 [16] Michelle. Social media's year of falling from grace. Voice of America (VOA), Dec 2018. URL https://www.voanews.com/a/social-media-s-year-of-falling-from-grace/4720871.html.
 [17] Colin J Bennett. The European General Data Protection Regulation: An instrument for the globalization of privacy standards? Information Polity, 23(2):239–246, 2018.
 [18] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
 [19] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
 [20] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 [21] Ryan N Lichtenwalter, Jake T Lussier, and Nitesh V Chawla. New perspectives and methods in link prediction. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 243–252. ACM, 2010.
 [22] Salvatore Scellato, Anastasios Noulas, and Cecilia Mascolo. Exploiting place features in link prediction on location-based social networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1046–1054. ACM, 2011.
 [23] Tomasz Tylenda, Ralitsa Angelova, and Srikanta Bedathur. Towards time-aware link prediction in evolving social networks. In Proceedings of the 3rd workshop on social network mining and analysis, page 9. ACM, 2009.
 [24] Shuai Xu, Kai Han, and Naiting Xu. A supervised learning approach to link prediction in dynamic networks. In International Conference on Wireless Algorithms, Systems, and Applications, pages 799–805. Springer, 2018.
 [25] Akanda Wahid-Ul-Ashraf, Marcin Budka, and Katarzyna Musial. NetSim – the framework for complex network generator. Procedia Computer Science, 126:547–556, 2018.
 [26] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.
 [27] Duncan J Watts and Steven H Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998.
 [28] Ray Solomonoff and Anatol Rapoport. Connectivity of random nets. The Bulletin of Mathematical Biophysics, 13(2):107–117, 1951.
 [29] Paul Erdős and Alfréd Rényi. On random graphs I. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
 [30] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
 [31] Paul Erdős and Alfréd Rényi. On the strength of connectedness of a random graph. Acta Mathematica Hungarica, 12(1–2):261–267, 1961.
 [32] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187. ACM, 2005.
 [33] Barbara Drossel and Franz Schwabl. Self-organized critical forest-fire model. Physical Review Letters, 69(11):1629, 1992.
 [34] Derek de Solla Price. A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science, 27(5):292–306, 1976.

 [35] Herbert A Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425–440, 1955.
 [36] Mark Newman. Networks: an introduction. United States: Oxford University Press Inc., New York, pages 1–2, 2010.
 [37] A Papadimitriou, P Symeonidis, and Y Manolopoulos. Predicting links in social networks of trust via bounded local path traversal. In Proceedings of the 3rd Conference on Computational Aspects of Social Networks (CASON'2011), Salamanca, Spain, 2011.
 [38] Panagiotis Symeonidis, Eleftherios Tiakas, and Yannis Manolopoulos. Transitive node similarity for link prediction in social networks with positive and negative links. In Proceedings of the fourth ACM conference on Recommender systems, pages 183–190. ACM, 2010.
 [39] Alexis Papadimitriou, Panagiotis Symeonidis, and Yannis Manolopoulos. Fast and accurate link prediction in social networking systems. Journal of Systems and Software, 85(9):2119–2132, 2012.
 [40] Elizabeth Bruch and Jon Atwell. Agent-based models in empirical social research. Sociological Methods & Research, 44(2):186–221, 2015.
 [41] Mark Granovetter. Threshold models of collective behavior. American Journal of Sociology, 83(6):1420–1443, 1978.
 [42] Hamdi Kavak, Jose J Padilla, Christopher J Lynch, and Saikou Y Diallo. Big data, agents, and machine learning: towards a data-driven agent-based modeling approach. In Proceedings of the Annual Simulation Symposium, page 12. Society for Computer Simulation International, 2018.
 [43] Radosław Michalski, Bolesław K Szymański, Przemysław Kazienko, Christian Lebiere, Omar Lizardo, and Marcin Kulisiewicz. Social networks through the prism of cognition. arXiv preprint arXiv:1806.04658, 2018.
 [44] Akanda Wahid-Ul-Ashraf, Marcin Budka, and Katarzyna Musial-Gabrys. Newton's gravitational law for link prediction in social networks. In International Workshop on Complex Networks and their Applications, pages 93–104. Springer, 2017.

 [45] Catherine A Bliss, Morgan R Frank, Christopher M Danforth, and Peter Sheridan Dodds. An evolutionary algorithm approach to link prediction in dynamic social networks. Journal of Computational Science, 5(5):750–764, 2014.
 [46] Desislava Hristova, Anastasios Noulas, Chloë Brown, Mirco Musolesi, and Cecilia Mascolo. A multilayer approach to multiplexity and link prediction in online geo-social networks. EPJ Data Science, 5(1):24, 2016.
 [47] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 [48] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.
 [49] Sergey Brin and Lawrence Page. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer Networks, 56(18):3825–3833, 2012.
 [50] Akanda Wahid-Ul-Ashraf, Marcin Budka, and Katarzyna Musial. How to predict social relationships – physics-inspired approach to link prediction. Physica A: Statistical Mechanics and its Applications, 2019.