Simulation and Augmentation of Social Networks for Building Deep Learning Models

05/22/2019 ∙ by Akanda Wahid -Ul- Ashraf, et al. ∙ University of Technology Sydney ∙ Bournemouth University

A limitation of the Graph Convolutional Network (GCN) is that it assumes that, at a particular l^th layer of the neural network model, only the l^th order neighbourhood nodes of a social network are influential. Furthermore, the GCN has been evaluated on citation and knowledge graphs, but not extensively on friendship-based social graphs. The drawback associated with the dependency between layers and the order of node neighbourhood can be more prevalent for friendship-based graphs. Evaluating the full potential of the GCN on friendship-based social networks requires openly available datasets in larger quantities. However, most available social network datasets are not complete, and the majority do not contain both features and ground truth labels. In this work, firstly, we provide a guideline on simulating dynamic social networks with ground truth labels and features, both coupled with the topology. Secondly, we introduce an open-source Python-based simulation library. We argue that the topology of the network is driven by a set of latent variables, termed the social DNA (sDNA). We consider the sDNA as labels for the nodes. Finally, by evaluating on our simulated datasets, we propose four new variants of the GCN, mainly to overcome the limitation of dependency between the order of node neighbourhood and a particular layer of the model. We then evaluate the performance of all the models, and our results show that on 27 out of the 30 simulated datasets our proposed GCN variants outperform the original model.

1 Introduction

One major limitation of neural network-based learning systems is that they require a large amount of data for training. This is one of the biggest differences between human intelligence and artificial non-general intelligence such as an artificial neural network. Unlike a deep learning (i.e. deep neural network) model, a human can learn from a very limited number of examples, whereas a deep learning model needs to see a substantially larger number of samples. Thus, it is essential to have access to a large number of training data instances to unlock and evaluate the full potential of a neural network-based model. A straightforward way to address the insufficiency of real-world datasets for neural network-based learning systems is to simulate high-quality, realistic synthetic data and use it to train the model. Additionally, if not for training, simulated datasets are particularly useful for evaluating a model’s performance, i.e. during the testing phase. In many cases, it is far more convenient to simulate test cases representing exceptional situations than to collect data for those situations in the real world. In fact, for some real-world scenarios, it might not even be possible to obtain a dataset describing an exceptional scenario due to the rarity of the event or ethical constraints.

It is, however, crucial to test the trained model in those exceptional scenarios because the cost of failure in those unlikely situations can be significantly higher than in a regular situation. One area where high-quality simulated and augmented data is being used extensively is neural network-based learning systems for self-driving cars. Almost all the advanced autonomous vehicle technologies use simulated datasets. For example, Nvidia has developed the Nvidia Drive Constellation, a Virtual Reality Autonomous Vehicle Simulator [1], and Google’s Waymo has driven billions of miles in a simulated environment [2]. As with self-driving cars, high-quality simulated datasets are now in high demand in many other applications of deep learning. One such important application, which is the focus of this study, is in the area of social networks, where graph-specific deep learning models are increasingly being developed and evaluated [3]. With the advancement of graph-specific neural network-based models, the demand for such datasets is growing rapidly. Furthermore, it is becoming more and more difficult to access complete (i.e. inclusive of node attributes) datasets representing social networks, mainly due to user privacy concerns that we discuss later in this section.

Social network datasets are very complex in nature, thus, they can be difficult to simulate and there is a lack of comprehensive guidelines on how to simulate social network datasets with both the features and ground truth labels.

As mentioned earlier, graph data mining has become a very important research area due to the recent advancement and popularity of social networks [4, 5], especially the online ones. Advancements in graph-based predictive modelling or graph community detection algorithms require datasets with ground truth labels for evaluation purposes [6]. However, the majority of the available social network datasets do not contain labels. Moreover, real-world social network datasets contain high dimensional features (node attributes and features are used interchangeably in this paper) [7] that represent information about both nodes and relationships. For example, a Facebook user generates a variety of information such as liked posts, photos, status updates, etc. Even in citation networks, there are features such as domain, authors’ affiliations, documents with thousands of words, etc. [8]. In publicly available datasets, such features are rarely included. For a small number of datasets, these node attributes are included, but then usually the complete structure of the network is not; instead, only a subset of it (mainly ego networks) is available [9, 10, 11]. This is because, during the anonymisation of networked data, in most cases the majority of features have to be removed, as these could be used to identify individuals [12], potentially raising ethical concerns. De-identification of network datasets is particularly difficult because of the unique topological structure a network may have. In a 2011 Kaggle link prediction competition, the most successful team won by de-anonymising most of the network data [13]. On top of that, nowadays even such graph datasets are becoming very difficult to obtain in the aftermath of the notorious use of real-world social network datasets for the purpose of political influence [14, 15].

To ensure that users’ personal data is only used with explicit consent, governments and political unions are increasingly putting pressure on technology companies [16]. Additionally, new regulations on the usage of personal data, such as the European General Data Protection Regulation (GDPR), have already come into force in many countries such as the UK [17]. Unquestionably, such regulations are essential to guarantee user privacy. However, because of them, getting hold of datasets from social media is becoming increasingly challenging. Maintaining the advancement of research in social networks requires good quality real-world datasets. One solution is to supplement the real-world social network datasets with good quality, realistic synthetic data.

The demand for graph datasets is further on the rise due to the advancement of graph-based machine learning, as traditional learning and data mining algorithms are being adopted for graph mining. Machine learning tasks for non-relational datasets only consider features and labels. However, graph datasets also contain edges between instances. These relationships can provide additional predictive power for a machine learning model. As a result, including these relations along with features in a predictive model is vital for prediction based on graph-structured datasets. To include relationships, one may capture these relational links between instances through graph embedding and then train any traditional machine learning model for the task of classification or regression [18, 19].

However, besides this indirect consideration of relations or links, there are developments in the area of graph mining which directly encode the relational component of a graph dataset into a deep artificial neural network, termed Graph Convolutional Networks (GCNs) [20]. In this approach, the topology of a graph is directly translated into the layers of a deep learning model. In GCN, the features of the graph are multiplied with filters of a neural network in the spectral domain (i.e. graph Laplacian) of the graph, thus resulting in a direct convolutional operation. Apart from node classification, which is one of the most researched problems in machine learning, an important research area in graph mining is link prediction. A difficulty encountered when analysing any link prediction technique is the lack of sufficient open, dynamic (time-dependent snapshots) social network datasets with features and labels. Typically, a link prediction algorithm is tested based on its predictive power on a future snapshot of the network. A supervised link prediction algorithm should ideally utilise both the topology and the available node attributes [21, 7]. For example, Scellato et al. [22] found that including features such as places and other related user activity improves the accuracy of link prediction considerably. Most of the developments in link prediction have been based on a single snapshot of the network, although incorporating the evolution of the graph may result in better performance, as shown by Tylenda et al. [23] and Xu et al. [24].

GCN is a semi-supervised classification model shown to outperform other state-of-the-art graph classification approaches based on as little as 0.07% of labelled nodes per class [20]. In the paper where the GCN was introduced, the datasets considered in the experiments were citation networks and knowledge graphs with explicitly defined class labels [20]. However, defining class labels for social networks such as Facebook, Twitter and LinkedIn is not trivial. As discussed earlier, the difficulty is mainly associated with obtaining real-world graph datasets with labels and node attributes. One approach to evaluating such graph mining algorithms is to simulate graphs containing features. In this work, we propose that the preference of each node in a social network is the strongest, most useful and most meaningful candidate for a label in a social graph.

2 Related work

A straightforward way to simulate graphs is to generate them using well-established network models [25]: (1) the Barabási-Albert model for scale-free networks [26], (2) the Watts-Strogatz model for small-world networks [27], (3) the Erdős-Rényi model for random graphs [28, 29, 30, 31], and (4) the forest-fire model [32, 33].

Random-graph Model: In the random graph network model, one creates a network with some properties of interest (e.g. a specific degree distribution) that is otherwise random. Although the random graph model was first studied by Solomonoff and Rapoport [28], it is mainly associated with Paul Erdős and Alfréd Rényi [29, 31].

Scale-free Model: The scale-free model exhibits a power law node degree distribution ($P(k) \sim k^{-\gamma}$, where $k$ is the node degree and typically $2 < \gamma < 3$) for a social network. This kind of distribution was first discussed by Price [34]. Price, in turn, was inspired by Herbert Simon, who discussed power laws in a variety of non-network economic data [35].
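As an illustration of how degree-driven growth produces a heavy-tailed degree distribution, the following is a minimal sketch of preferential attachment in plain Python. It is not the generator used in this work; the function name `preferential_attachment_graph`, the fully connected starting core and the parameter `m_links` are assumptions made for the example.

```python
import random

def preferential_attachment_graph(n_nodes, m_links, seed=0):
    """Grow an undirected graph in which each new node links to m_links
    existing nodes chosen with probability proportional to their degree."""
    rng = random.Random(seed)
    # Assumed starting point: a fully connected core of m_links + 1 nodes.
    edges = {(i, j) for i in range(m_links + 1) for j in range(i + 1, m_links + 1)}
    # Each node appears here once per incident edge, so sampling uniformly
    # from this list is the same as sampling proportionally to degree.
    endpoints = [v for e in edges for v in e]
    for new in range(m_links + 1, n_nodes):
        targets = set()
        while len(targets) < m_links:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.add((t, new))
            endpoints.extend((t, new))
    return edges

edges = preferential_attachment_graph(n_nodes=200, m_links=2)
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
```

Plotting a histogram of `degree.values()` on log-log axes for larger graphs is one way to inspect the tail of the distribution.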

Small-world Model: Transitivity, measured by the network clustering coefficient, is still one of the least understood properties in network analysis despite being extensively studied, according to Newman [36]. Another important property we observe in real networks is the small-world effect: all nodes are connected with each other by relatively short paths. To model these two properties, Watts and Strogatz introduced a small-world network model [27].

Forest-fire Model: In this model, a new node connects to an existing node and then also connects to a node adjacent to that newly connected node. The new node then carries on making connections, with a given probability, based on adjacent nodes [32, 33]. For example, in citation networks, an author finds a paper and cites it. He or she then cites more papers through that paper recursively [32]. In a social network, a friend may introduce someone to his/her mutual friend, and the friend circle then grows for that person [32]. The model is named forest fire because it imitates the self-organising behaviour of a forest fire [33].
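The recursive growth described for the forest-fire model can be sketched as follows. This is a simplified, undirected variant written for illustration only: the uniformly chosen starting node, the single spreading probability `p_forward` and the function name are assumptions, and the published model [32, 33] has additional parameters that are not reproduced here.

```python
import random

def forest_fire_graph(n_nodes, p_forward, seed=0):
    """Simplified undirected forest-fire growth (illustrative only)."""
    rng = random.Random(seed)
    neighbours = {0: set()}
    for new in range(1, n_nodes):
        ambassador = rng.randrange(new)   # uniformly chosen existing node
        burned = {ambassador}
        frontier = [ambassador]
        while frontier:
            current = frontier.pop()
            for nb in sorted(neighbours[current]):
                # The "fire" spreads to each unburned neighbour with
                # probability p_forward.
                if nb not in burned and rng.random() < p_forward:
                    burned.add(nb)
                    frontier.append(nb)
        # The new node connects to every node reached by the fire.
        neighbours[new] = set(burned)
        for b in burned:
            neighbours[b].add(new)
    return neighbours

graph = forest_fire_graph(n_nodes=100, p_forward=0.3)
```

With `p_forward=0.0` the fire never spreads, so every new node makes exactly one link and the result is a tree.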

These quintessential network models are among the most important contributions towards understanding and modelling complex networks. However, these mathematical models are solely driven by the topology of a network. For example, the scale-free model considers the degree of a node and the small-world model considers mutual friends. Neither features nor labels of nodes and/or connections are mimicked by those models. One way to generate synthetic social networks with features is to find similarities/correlations between randomly assigned features and let those similarities define connections [37, 38, 39]. This naïve approach is not ideal due to several limitations. Firstly, correlations between feature vectors do not consider the network topology. Secondly, a common correlation metric would assume that every person in a social network views and prefers a potential friend’s features equally, in a linear fashion. Finally, it is often not obvious what the node labels are, which is an issue we discuss in detail in Section 3.
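To make the naïve baseline concrete, the sketch below connects every pair of nodes whose randomly assigned feature vectors are sufficiently similar. Cosine similarity, the threshold value and the function names are assumptions chosen for the example; the cited works may use other similarity measures.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_graph(features, threshold):
    """Connect every pair of nodes whose similarity reaches the threshold."""
    n = len(features)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if cosine(features[i], features[j]) >= threshold}

rng = random.Random(0)
features = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(30)]
naive_edges = similarity_graph(features, threshold=0.5)
```

Note that every node applies the same metric to every feature and that the existing topology plays no role, which are exactly the limitations listed above.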

However, there are some recent developments in agent and event-based social network modelling which are discussed below.

Agent-based modelling: Bruch and Atwell [40] provide a guideline on the agent-based modelling of social networks. In the paper by Bruch and Atwell [40], it is argued that the interplay between the micro and macro level characteristics is complex, and the macro level characteristics are not emergent solely from the simple aggregation of micro level characteristics or low level entities such as social network users [41]. Instead, micro and macro level behaviour or characteristics form a feedback loop, resulting in a nonlinear interaction. From a social network point of view, the graph level and node level characteristics could be thought of as macro and micro level characteristics of the network respectively.

However, to simulate the modelling approach specific to social networks, one should consider the well-studied graph properties such as preferential attachment, mutual friend preferences etc., and provide instructions on how to account for these properties in the simulation. In the research article of Granovetter [41], these social network properties are considered, but implicitly included in terms of the macro and micro level characteristics. In summary, this study by Granovetter [41] provides a generic guideline to model social networks but a detailed and specific mathematical modelling instruction and analysis of the social network properties are not discussed.

In another work, Kavak et al. [42] argue that modelling should be performed by explicitly using available real-world datasets. In their experiment, they simulated a human mobility model based on 826,021,868 Twitter messages. Furthermore, they uncovered the geolocation of 92,296 users for the purpose of modelling. However, the purpose of our simulation is to produce good quality synthetic graph-structured datasets when real-world data is not available, which is increasingly the case, as discussed in Section 1.

Event based modelling: One recent interesting development in modelling dynamic event-based graphs is the Cognition-driven Social Network (CogSNet) model [43]. CogSNet models social networks based on a model of human memory. The authors argue that, similar to human memory, a social event is strengthened by repeated exposure to a similar event and weakened by deprivation of that event. Although CogSNet proposes a new paradigm in social network modelling, it does not explicitly explain how to model features within the dynamic event-based graph. Providing open source social network datasets with labels, features, and graph or topological characteristics is the primary goal of this study.

To address the issues discussed above, i.e., 1) lack of guidelines on implementing both the well-studied network properties in social networks and features, 2) insufficient research on simulating dynamic social networks with node features, 3) lack of rigorous study providing directions on defining node labels in social networks, we propose a framework for social graph simulation. In our model, the simulated networks have the following characteristics based on understanding of Facebook-type social networks, along with well-studied social network properties such as preferential attachment.

  • Node features are evaluated by other nodes before connecting. If two nodes are forming a connection, the decision of forming a link is taken by both of the nodes, thus both parties should evaluate each other’s features.

  • The decision of forming a connection is based on the preferences of nodes, which consist of a set of latent variables. These preferences are not directly linked with users’ features. For example, two people could live in any state or county, but their preference towards a particular political party could be the same, thus resulting in different features but common preferences.

  • People have common preferences. For example, a group of people in social network may prefer a common ideology or political view.

  • The node and graph level characteristics should both be taken into account while modelling a network. Node level characteristics consist of features (i.e. node attributes such as age, gender, etc.), individual preferences (latent variables such as preference towards a particular type of people, discussed in more detail in Section 3) and node degree (i.e. preferential attachment). Graph level characteristics include, for example, a preference for shorter path lengths, i.e. connecting with friends who are nearer in terms of the graph topology.

3 Proposed approach

The proposed simulation is based on preferences (i.e. a set of latent variables) of nodes, which can be interpreted as social rules. Node preference represents the preference of a person in a social network, and at the same time the network-based projection of personality and behaviour. This in turn translates into the network topology. For example, one of the network-based behaviours might be to only connect with people with many mutual friends, shaping the topology of the ego network.

We identify the types of preference a node can have based on their topological and non-topological characteristics. The preferences/behaviours emerge from the following phenomena:

  • Feature-based (non-topological): From a node/user point of view, a combination of variable preferences towards the features of other nodes/users acts as a deciding factor for who they wish to connect with. For example, someone may prefer to connect with people who live nearby in terms of geographic location (e.g. city, town), so a similar location feature is preferred. For other features such as gender, either the opposite or the same value could be preferred. Thus, for this particular node, the preference towards geographic location in combination with gender is considered while connecting.

  • Topology-based: Besides node features, the local topological characteristics may also play an important role, e.g. someone may prefer to connect with other people with whom he/she has mutual friends, whereas others may be more open. Secondly, some people may prefer to connect with someone who has many friends, i.e. popular nodes. Both of these preferences are solely based on the graph topology and could mostly be identified via the topological properties of the social graph.

  • Hybrid feature and topology based (combination of topological and non-topological features): People in social networks may prefer someone who is nearer to them in terms of geographic location and has similar age, education level and has many mutual friends. In this scenario, we have a combination of both the feature-based and topology-based preferences. Someone may also only connect with a politician who has many friends, only if, he or she has similar political views.

All three types of human preferences are reflected in both non-topological features of a node and the topology of a graph. Although the first, feature-based preference, solely emerges from the non-topological component of a node, once the connections are made, these preferences are also reflected or encoded within the topology of the graph. Without the consideration of the graph or relations between nodes, the predictive model will not be able to capture these complex patterns, which may negatively influence model performance. As a result, by including node attributes in graphs we can achieve higher predictive power.

We propose that the labels of social networks in supervised or semi-supervised classification will capture patterns resulting from the preferences discussed above. We name these preferences or behaviours for a particular node their social-DNA (sDNA). Although most people in a social graph have different features, many people have a similar sDNA (i.e. they share preferences). As a result, the sDNA is the most valuable and meaningful candidate for class labels for grouping nodes in a graph. However, these labels may not be explicitly defined for a given node classification problem. For example, a classification task may require identifying a group of nodes who may prefer to buy a certain type of product, for marketing purposes. The label for the class who have bought the product should capture a certain group of people with similar preferences in the social network. In semi-supervised classification, if we have a dataset for only a few people who have bought the product, the classification model would associate a certain type of sDNA in the social network with the group most likely to buy that product. In addition, if we do not have any historical information telling us who has bought the product, it may be possible that a group of people with similar sDNA prefer the product more often than others. However, a person or node in a social network may not entirely know his or her own preferences or sDNA. As a result, finding these preferences (i.e. labels) in terms of sDNA is a nontrivial task. One solution to this problem could be to assign different labels to a few randomly selected nodes. These nodes could also be selected based on a strategy that focuses on their features, for example, selecting a few nodes with very different features from each other. After labelling, a semi-supervised classification algorithm such as GCN will infer other nodes with similar preferences/sDNA. Even if no prior knowledge is available, randomly selected nodes with different class labels could be used with only one label per class. GCN is shown to be powerful enough to accurately classify nodes with only one label per class [20].

4 Graph Formation

Let’s assume that we have number of nodes with features each. Each element or feature in the could be unbounded or bounded. Each and every node subscribes to an sDNA. There is a total of different sDNAs, such that . Nodes which subscribe to the same sDNA have the same preference, thus the same label. Each sDNA consists of two vectors of length (i.e. the same as the number of node features). These two vectors are, 1) , which defines the strength or weight of a particular feature’s preference and ranges between and , and 2) , which defines whether similar or dissimilar features are preferred with a binary attribute or . Although could be incorporated into as its sign, to make the preference standout separately for user readability and its contribution in the sDNA mutation process discussed in Section 7 for dynamic graphs, the vector is used. This also allows to have a separate label for the preferences within the sDNA, which could be learned using machine learning algorithms from the graph enabling a more in-depth predictive analysis. The feature-based scores between two nodes and are calculated as (where is the Hadamard product):

(1)

Equation 1 gives feature-based score which entails if node is a potential friend for node . In this case, node evaluates if it wants to connect with node or not, as we consider only ’s sDNA. In many social networks, mainly for the undirected ones, the final connection or friendship decision is made by both of the nodes. We can introduce two-way evaluation simply by adding node ’s sDNA based score in Equation 1.

(2)

If both the and subscribe to the same sDNA then . However, Equation 2 does not prefer similar sDNA over different sDNA or vice versa. In a social network, the preference or sDNA is a set of latent variables. It may well be that two people have a similar preference and this results in a lower score. For example, if two social network users prefer to connect with the opposite gender more often, then if they have the same gender then they are less likely to connect.
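The feature-based scoring described above can be illustrated with a small sketch. The exact form of Equations 1 and 2 is not reproduced here; everything below is an assumption made for illustration: the names `weights` and `prefs` for the two sDNA vectors, a weight range of 0 to 1, a +1/-1 encoding for "similar preferred" versus "dissimilar preferred", and the use of the absolute feature difference as the (dis)similarity measure.

```python
def one_way_feature_score(weights, prefs, f_self, f_other):
    """Score of `f_other` as seen through the evaluating node's sDNA.

    weights: per-feature preference strength (assumed to lie in [0, 1])
    prefs:   +1 if similar values are preferred, -1 if dissimilar (assumed)
    """
    # Element-wise (Hadamard-style) product of weight, preference sign and
    # feature distance: distance is penalised when similarity is preferred
    # and rewarded when dissimilarity is preferred.
    return sum(-w * s * abs(a - b)
               for w, s, a, b in zip(weights, prefs, f_self, f_other))

def two_way_feature_score(sdna_i, f_i, sdna_j, f_j):
    """Sum of both nodes' one-way scores, for a mutual connection decision."""
    return (one_way_feature_score(*sdna_i, f_i, f_j)
            + one_way_feature_score(*sdna_j, f_j, f_i))

sdna_a = ([1.0, 0.5], [+1, -1])   # hypothetical sDNA: (weights, prefs)
sdna_b = ([0.2, 0.2], [+1, +1])
alice, bob = [0.2, 0.0], [0.3, 1.0]
score_ab = two_way_feature_score(sdna_a, alice, sdna_b, bob)
```

Under these assumptions, two nodes that subscribe to the same sDNA give each other the same one-way score, and two nodes that both prefer dissimilar values of a feature but share the same value gain nothing from that feature, which mirrors the gender example in the text.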

Figure 1: Two types of sDNA subscribed by 5 nodes (The lines do not represent edges in the graph and sDNAs are not nodes. The arrows define subscription or common preferences of different nodes as sDNAs)

Equation 2 does not consider the topological (i.e. graph-based geometric) features of the nodes while calculating the score. In social networks, popularity, i.e. the degree of the nodes, is a common topological feature with a significant effect on the growth of the network. Typically, people tend to prefer other people who have a large number of connections. This is why famous people tend to get more connections. This phenomenon is well studied and known as preferential attachment [26]. To add the preferential attachment effect, we can simply add the degree of the connected nodes to the score. If node has degree or connections, then from ’s perspective, the popularity-based score can be calculated as follows:

(3)

However, the preference for nodes with higher degrees varies from person to person. We can incorporate this variability by including sDNA’s preferential attachment parameter while calculating the score, resulting in:

(4)

Equation 4 only considers score of and from ’s perspective (i.e. ’s sDNA). For an undirected graph, we can use the following Equation 5 to calculate preferential-attachment-based score from both and ’s perspective by adding both of their scores from each other’s perspective:

(5)
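A minimal sketch of a two-way popularity score in the spirit of Equations 3 to 5 is given below. The parameter names `pa_i` and `pa_j` (each node's sDNA preferential attachment parameter) and the normalisation by the maximum degree are assumptions added for the example, not part of the original formulation.

```python
def popularity_score(pa_i, pa_j, degree_i, degree_j, max_degree):
    """Two-way popularity score for an undirected graph.

    Each node weights the other's degree by its own sDNA preferential
    attachment parameter; dividing by max_degree is an assumed
    normalisation that keeps the score on a comparable scale.
    """
    scale = max(max_degree, 1)          # avoid division by zero in an empty graph
    from_i = pa_i * degree_j / scale    # how attractive j's popularity is to i
    from_j = pa_j * degree_i / scale    # how attractive i's popularity is to j
    return from_i + from_j

example = popularity_score(pa_i=1.0, pa_j=0.5, degree_i=2, degree_j=8, max_degree=10)
```

A node whose sDNA sets the parameter to zero ignores popularity entirely, which captures the person-to-person variability described above.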

A social network user tends to prefer people who are nearer to them in terms of the graph topological distance [44]. Creating a connection with somebody who is friend-of-our-friend is usually more likely than starting a relationship with someone who is further away from us in the structure. However, this preference is also subjective and varies among social network users. As a result, we add this variability of path length preferences subjective to a user by using sDNA’s variables, where is the longest path-length in the graph that the model is considering, and . The sDNA’s vector has a length of .

(6)

In Equation 6, is a generalised Kronecker delta function in the Iverson bracket where is the adjacency matrix of the graph. The value of is one, if a path of length between and exists, and zero otherwise. This function of path length introduces non-linearity in the score. Equation 6 gives the score of when is evaluating ’s potential to be able to connect or become friends with . This is done by using ’s sDNA parameter to calculate the score of . This imitates the behaviour, firstly, may find someone interesting to send him a friend request on Facebook. Secondly, the final connection will be made if also finds interesting. Equation 6 accounts for the first case and the following Equation 7, similar to score based on features in Equation 2 accounts for the score from ’s point of view for . The final score function based on path topology is:

(7)

Equation 7 is required if we were to simulate a directed graph. For an undirected graph, we only consider Equation 6.
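The path-length score can be sketched with powers of the adjacency matrix. In the sketch below, `path_prefs` is a hypothetical name for the sDNA vector of path-length preferences, assumed to start at length two. Note that an entry of the l-th power of the adjacency matrix counts walks of length l, which may revisit nodes, so this is only an approximation of "a path of length l exists".

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def path_score(adjacency, path_prefs, i, j):
    """Sum path_prefs[l - 2] over lengths l = 2, 3, ... for which the
    (i, j) entry of the l-th adjacency power is non-zero."""
    power = adjacency                      # A^1
    score = 0.0
    for pref in path_prefs:                # path_prefs[0] belongs to length 2
        power = mat_mul(power, adjacency)  # now A^l
        if power[i][j] > 0:                # indicator (Iverson bracket)
            score += pref
    return score

# Path graph 0 - 1 - 2 - 3
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
friend_of_friend = path_score(A, [0.6, 0.3], 0, 2)
three_hops_away = path_score(A, [0.6, 0.3], 0, 3)
```

Because the indicator switches a whole preference term on or off, the score is a non-linear function of the topology, as noted in the text.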

Finally, we add all three scores, i) feature based from Equation 2, ii) popularity based from Equation 5, and iii) shortest path length based from Equation 7 (directed) or 6 (undirected) to calculate the final score . We consider an undirected graph where both and make a mutual decision to connect with each other. In case of an undirected graph, for connecting with , we can simply consider Equations 1, 4, and 6. For simplicity we are not including the subscript in the final score function .

(8)

Equation 8 gives the score between any two nodes. The scores are weighted or modified according to the sDNA a node belongs to. To further enforce some global graph level control over the effects of feature-based, popularity-based, and shortest-path scores, we introduce two hyperparameters. This global control is useful in many situations; for example, one may wish to generate networks where a strong preferential attachment phenomenon exists. To control this global weighting, we introduce two global weighting factors in Equation 8. One of them is a vector whose length is determined by the number of shortest path lengths considered, starting from length two.

(9)

Equation 9 contains variables from , , and . These do not come from the nodes directly but from their sDNAs, which in turn expresses their behaviour in the network. Equation 9 is one possible linear combination of , , and , however, other possible nonlinear combination functions may be used depending on the target domain.
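One possible linear combination of the three score components with global weighting factors is sketched below. The names `g_pop` and `g_path`, and the choice to leave the feature term unweighted, are assumptions for illustration; they are not taken from Equation 9.

```python
def final_score(feature, popularity, path_terms, g_pop=1.0, g_path=None):
    """Linear combination of the three score components.

    path_terms: one already sDNA-weighted term per path length (2, 3, ...)
    g_pop:      assumed global weight for the popularity term
    g_path:     assumed global weights, one per path length
    """
    if g_path is None:
        g_path = [1.0] * len(path_terms)
    return (feature
            + g_pop * popularity
            + sum(g * t for g, t in zip(g_path, path_terms)))

# Doubling g_pop generates networks with stronger preferential attachment.
total = final_score(0.4, 0.9, [0.6, 0.0], g_pop=2.0, g_path=[0.5, 1.0])
```

A non-linear combination could be substituted by replacing the body of `final_score` while keeping the same inputs.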

5 Simulation Process

The link formation process for a graph is given in Algorithm 1. Each node subscribes to exactly one of the different types of sDNA (Figure 1) and has a set of features. In Line 2 the algorithm generates all pairs of nodes. In case of an undirected graph, the pairwise permutation (without repetition) is considered. Furthermore, if self-connection is desired then self pairwise combination is also included. Social network users do not necessarily explore all potential friends whom they might connect with. For example, a Facebook user does not explore all existing Facebook users to connect with. As a result, the simulation process selects a pair of nodes for score calculation with the exploration probability (Line 6), much like how connections are made in a random graph. An exploration probability of 1 will result in calculation of scores for all possible pairs, while for an exploration probability of 0, no score will be calculated between any pair of nodes. The exploration probability incorporates controlled stochasticity. In order to determine the minimum score a pair of nodes should have to connect, we define a cut-off point.

To sum up, first, we calculate scores between pairs selected based on the exploration probability and then we sort these scores in descending order. After that, we connect the fraction of pairs of nodes in the entire graph given by the cut-off point. Smaller values of the cut-off point will result in a social network where the users are very particular about with whom they connect. On the other hand, very high values of the cut-off point will result in a network where users do not care about features or topological properties while connecting. Thus the latter will be close to a random graph model with the probability of an edge occurring being equal to the exploration probability, i.e. a cut-off point of 1 will result in a pure random graph model in which edges form with the exploration probability. In Lines 4 and 5, from all sets of pairs we select a pair for score calculation with the exploration probability. In Line 5, a random number generator function returns a random number between 0 and 1 drawn from a uniform distribution. In Line 6, we use Equation 9 to calculate a score between the selected pairs of nodes. Afterwards, the stopping length based on the suggested fraction of node pairs to be connected is calculated (Line 11). In Line 12 we sort the selected node pairs’ scores in descending order. Afterwards, in Line 14, we connect the first fraction of node pairs for which we have calculated scores, as sorted in Line 12; thus, pairs with higher scores will have a higher likelihood of forming connections.

1:procedure Socialise()
2:     
3:     
4:     for all pair in  do
5:         if  then
6:              
7:              
8:         end if
9:     end for
10:     
11:     
12:     
13:     for all score in  do
14:         
15:         if  then break
16:         end if
17:         
18:     end for
19:end procedure
Algorithm 1 Socialise algorithm
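The steps described for Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the simulation library: the function name `socialise`, its parameter names, and the `score_fn` callback (standing in for Equation 9) are hypothetical, pairs are taken as unordered pairs without self-connections, and the stopping length is assumed to be the connecting fraction times the total number of pairs.

```python
import random
from itertools import combinations


def socialise(nodes, score_fn, explore_p, connect_fraction, seed=None):
    """Sketch of the link formation loop described for Algorithm 1.

    score_fn(u, v) is a hypothetical stand-in for Equation 9.
    """
    rng = random.Random(seed)
    pairs = list(combinations(nodes, 2))  # unordered pairs, no self-connections

    # Score only the pairs that are explored, each with probability explore_p.
    scored = []
    for u, v in pairs:
        if rng.random() < explore_p:
            scored.append((score_fn(u, v), u, v))

    # Assumed stopping length: a fraction of all pairs in the graph.
    stop = int(connect_fraction * len(pairs))

    # Highest scores first; connect until the stopping length is reached.
    scored.sort(key=lambda item: item[0], reverse=True)
    edges = []
    for count, (_, u, v) in enumerate(scored):
        if count >= stop:
            break
        edges.append((u, v))
    return edges
```

With `explore_p=1.0` every pair is scored, with `explore_p=0.0` none is, and with `connect_fraction=1.0` every explored pair is connected regardless of its score, which is the random-graph limit discussed in the text.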

6 Curse of Dimensionality in Networks

Real-world social networks contain high-dimensional features. If we consider a Facebook user’s posts, likes, photos, comments, etc. as features, then we have thousands of features for each node. One problem with high-dimensional node features is the linear increase in the computational complexity of the simulation process discussed in Section 5. To overcome this problem, GPU computation can be used to calculate the scores in Equation 9. We have enabled GPU computation in our simulation library, and Figure 2 shows the computation time for 300 nodes with an increasing number of features.

Figure 2: CPU vs GPU computation time with a varying number of features. (CPU: Intel(R) Xeon(R) W3680 @ 3.33GHz, 6 cores and 12 threads; system memory: DIMM DDR3, 20 GB; GPU: NVIDIA GeForce GTX 1080Ti)

7 Dynamic graph

Real social networks evolve over time and are dynamic in nature. However, dynamic graph datasets are very rare, especially ones with ground-truth labels and node attributes. Such datasets are crucial for dynamic graph research and also essential for the evaluation of link prediction, which usually deals only with static graphs or a single snapshot of a graph. The link prediction problem is to identify new links that will be present in the network at a future time [45, 46]. Given the set of nodes and the set of edges of the network at the current time, the goal of link prediction is to predict whether a link between a pair of vertices will belong to the edge set at a later time. The prediction is performed by using topological and/or non-topological information about node characteristics and their relationships. Thus, to evaluate or test the performance of a link prediction method, a future snapshot of the network is required. Additionally, machine-learning-based link prediction algorithms require a future snapshot of the network as ground truth for training purposes.

Interestingly, multiple runs of Algorithm 1 already yield dynamic graphs, i.e. future snapshots of the network. Every time we socialise a graph that contains pairs of nodes which are not yet connected, new connections occur within the graph. However, this is perhaps not the best simulation of the dynamic nature of real social networks, because by running Algorithm 1 multiple times we force each social network user to reconsider and connect with people whom they did not find interesting enough in the previous run(s). (Footnote 2: Here, we are assuming no arrival of new nodes, i.e. a constant number of nodes. In case of new nodes, we can easily run Algorithm 1 with the new nodes and include them in the graph.)
What we really want is not to force the users in the graph to make new connections, but to allow the users’ interests and preferences to change and then run the Socialise algorithm (Algorithm 1). This results in a concept drift in the user preferences, which can be achieved by changing the values of the variables in the sDNAs of the nodes. These changes in the sDNAs reflect the phenomenon that the rules that govern social networks can and do change over time. The change of preference is implemented by the sDNA mutation given in Algorithm 2. The intensity of the mutation is controlled by a mutation intensity parameter, which determines how many of the sDNA variables change value. A lower value changes only a few of the preference values; as a result, the user’s preference towards some feature of a potential friend changes. A larger value results in changes to the entire preference vector.

In Line 1 of Algorithm 2, the procedure takes all existing sDNAs from the graph, the mutation intensity, and a boolean parameter that determines whether further sDNA variables should also be changed. In Line 2 we iterate through the sDNAs, one at a time. For the given sDNA, we then iterate through each of the elements of its vectors in Line 3, and reassign the value of an element with a probability given by the mutation intensity (Line 5). In Line 7 we check whether the boolean parameter is set to True. If so, we also reassign the value of one of two further variables with a given probability (Line 9); which of the two is reassigned is selected uniformly at random. In Line 14 we reassign the sDNA parameter for preferential attachment strength. Afterwards, in Line 16, a set of random numbers is generated, one for each path length preference in Equation 6. The intervals are selected such that the constraint on these preferences is satisfied. Finally, in Line 20, we iterate through the path length preferences again and reassign them from the random numbers generated in Line 16.

An interesting observation is that changes in people’s behaviour or preferences are generally correlated with time. For social network users, such changes contribute to changes in the topology of the social network. In our simulation strategy, the time difference between two snapshots should therefore also be correlated with the change in the users’ behaviour or preferences, i.e. their sDNAs. The mutation intensity parameter in Algorithm 2 defines the intensity of mutation in the sDNAs, that is, the intensity of the change in the users’ behaviour. As a result, its value is proportional to the time between two snapshots of the network. For example, running the Socialise algorithm (Algorithm 1) produces the first snapshot of a social network. Running the Mutation algorithm (Algorithm 2) then changes the preferences according to a particular value of the mutation intensity, and rerunning the Socialise algorithm yields another snapshot of the network further forward in time. A higher mutation intensity corresponds to a larger time difference between these two snapshots.

One may wish to generate event-based dynamic networks, i.e. networks with time-stamped link formation. This can be achieved by setting the ‘fraction of nodes to be connected’ parameter of the Socialise algorithm (Algorithm 1) such that only one link is formed per run. Repeated runs of Algorithm 1 then result in an edge stream with a timestamp for each edge. As discussed earlier in this section, the time between the appearance of successive edges can also be manipulated by changing the value of the mutation intensity parameter.

1:procedure Mutate()
2:     for all  in  do
3:         for all , in , do
4:              if  then
5:                  
6:              end if
7:              if  then
8:                  if  then
9:                       
10:                  end if
11:              end if
12:         end for
13:         if  then
14:              
15:         end if
16:         
17:         
18:         for all  in  do
19:              if  then
20:                  
21:                  i++
22:              end if
23:         end for
24:     end for
25:end procedure
Algorithm 2 Mutation algorithm
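The core of the mutation step can be sketched as follows. The sketch is deliberately partial and rests on assumptions: an sDNA is represented as a dictionary with a hypothetical `feature_pref` list, new values are drawn uniformly from an assumed range of [-1, 1], and the reassignment of the preferential attachment strength and of the path length preferences is left out.

```python
import random


def mutate(sdnas, intensity, seed=None):
    """Reassign each feature preference with probability `intensity`.

    sdnas: list of dicts, each with a (hypothetical) "feature_pref" list.
    """
    rng = random.Random(seed)
    for sdna in sdnas:
        prefs = sdna["feature_pref"]
        for k in range(len(prefs)):
            if rng.random() < intensity:
                # Assumed value range; the text only states that sDNA
                # variables are drawn from uniform distributions.
                prefs[k] = rng.uniform(-1.0, 1.0)
    return sdnas


# Intensity 0 leaves every preference untouched, intensity 1 redraws all.
kept = mutate([{"feature_pref": [5.0, 5.0, 5.0]} for _ in range(4)], 0.0, seed=1)
redrawn = mutate([{"feature_pref": [5.0, 5.0, 5.0]} for _ in range(4)], 1.0, seed=1)
```

A dynamic sequence would then alternate a link formation run with a call to `mutate`, using a larger `intensity` to represent a longer time between snapshots.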

8 How to validate simulation

In order to assess whether the desired integration of features, labels, and topology is achieved, we measure and compare how well differently configured trained models predict the labels of the nodes. The setups are designed such that the models perform predictions with the entire set of information (features, labels, and topology) as well as with partial information.

Here we discuss the validation setup. The predictability of the label of a node, i.e. its sDNA, can be assessed via the following configurations of an ideal machine learning model:

  1. Predictability of nodes’ sDNAs with features combined with the graph topology

  2. Predictability of nodes’ sDNAs using features only

  3. Predictability of nodes’ sDNAs using topology only

We can expect an ideal machine learning model to fully capture and learn patterns from both the topological and the feature-based information in the network, without over-fitting or being susceptible to the noise or stochasticity in the network. Such an ideal model is not currently available. However, we should at least use a machine learning model which can directly utilise both the topological and the non-topological information, i.e. features.

In our case we use the GCN [20] to analyse sDNA predictability of the simulated networks, which can be regarded as one of the best models to directly combine both the topological and non-topological information of the graph [47].

8.1 Graph Convolutional Networks (GCNs)

GCN is a multi-layer graph-based neural network. In each layer, the features are multiplied with the topology of a graph in the spectral domain (i.e. symmetric normalised Laplacian matrix [20]). Weights of connections (edges/links by which the features of a node are passed, considered or summed) are learned using backpropagation. However, as most of the real-world social networks are not regular graphs, one single weight is learned for all links of a particular node.

The layer-wise propagation rule for a given layer is:

(10)

Equation 10 involves the trainable weight matrices of each layer, the feature matrix, and a graph representative matrix that we discuss in more detail in Section 8.2. The graph representative is fed into every layer of the model up to the output layer. Finally, the rule applies a nonlinear activation function.
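For orientation, the layer-wise rule of Kipf and Welling [20] is commonly written as below. The symbols are assumptions borrowed from that formulation and may differ from the notation of Equation 10: $\hat{A}$ stands for the graph representative, $H^{(l)}$ for the input to layer $l$ (with $H^{(0)} = X$ the feature matrix), $W^{(l)}$ for the trainable weights of that layer, and $\sigma$ for the nonlinear activation.

```latex
H^{(l+1)} = \sigma\!\left(\hat{A}\, H^{(l)}\, W^{(l)}\right), \qquad H^{(0)} = X
```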

For this model, the receptive field grows with the depth of the network [20]. In the first layer, only friends’ features are considered, and in the second layer friend of friends’ features are also considered, i.e. summed before passing through a non-linearity. This is because the summarised friends’ information is already gathered in the first layer.

The direct translation from a graph to the structure of the neural network is achieved via the graph representative matrix. The symmetric normalised Laplacian matrix of the adjacency matrix was used in the original formulation of GCN [20]. (Footnote 3: In this paper, a social network is referred to as a ‘graph’ or a ‘network’, whereas a neural network is always written in its full form.)

(11)
(12)

In Equation 12, an identity matrix adds the self-connections for each of the nodes, the degree matrix is that of the adjacency matrix, and the remaining term is the adjacency matrix with added self-connections. The addition of self-connections lets the model incorporate the nodes’ own features for better predictability. For example, a social network user’s friends may give away his or her preference or class label (i.e. predictability based on the labels of the connected nodes), but his or her own features (i.e. self-connections in the graph) are also important for predicting that preference.
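A minimal NumPy sketch of such a graph representative is given below. It assumes the renormalised form used by Kipf and Welling [20], in which self-connections are added first and the result is then normalised symmetrically by the node degrees; the function name `graph_representative` is hypothetical.

```python
import numpy as np


def graph_representative(adj):
    """Return D^-1/2 (A + I) D^-1/2, with D the degree matrix of A + I."""
    a_tilde = adj + np.eye(adj.shape[0])  # adjacency with self-connections
    deg = a_tilde.sum(axis=1)             # degrees, always >= 1 here
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Scale rows and columns, equivalent to multiplying by D^-1/2 on both sides.
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


# Path graph 0 - 1 - 2.
path = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
a_hat = graph_representative(path)
```

Because every node receives a self-connection, no degree is zero and the normalisation is always defined.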

In Equation 10, the main transformation from a graph to the neural network is performed through the graph representative. If the adjacency matrix in Equation 11 is replaced with a different representative function of the graph, the structure of the neural network itself changes, while the input feature matrix stays the same. As a result, this is not exactly a data preprocessing technique but rather a change in the architecture of the neural network. We discuss the use of different graph representatives in Section 8.2.

Using the GCN, we implement the three setups for node label prediction mentioned in Section 8 by changing the propagation rule in Equation 10 as follows:

  1. Prediction of nodes’ sDNAs with both the features and the graph topology, using the propagation rule of Equation 10 for the first layer, whose input is the feature matrix:

    (13)

    This is the straightforward GCN model proposed by Kipf and Welling [20]. Here, the graph representative is fed in every layer of the model, but the feature matrix is fed only in the first layer.

  2. Prediction of nodes’ sDNAs with the features but excluding the graph topology, with the following propagation rule:

    (14)

    In the first-layer propagation rule of Equation 14, an identity matrix of the same size as the adjacency matrix takes the place of the graph representative, and it is fed into the model until the output layer. Thus, only the features of each node are considered, and the graph topology does not play any role in label (sDNA) prediction.

  3. Prediction of nodes’ sDNAs excluding the features, solely with the graph topology:

    (15)

    In the propagation rule of Equation 15, an identity matrix takes the place of the feature matrix. As a result, only the graph topology is considered, and the features do not play any role in the model. Here, the graph representative is fed in each layer of the model until the output layer, whereas the identity matrix is fed only in the first layer.
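The three setups can be illustrated with a single first-layer function, sketched below in NumPy. The sketch assumes the Kipf and Welling form of the layer with a ReLU activation; the placeholder graph representative, the sizes and all names are hypothetical and chosen only to show which matrix is swapped for an identity in each setup.

```python
import numpy as np


def first_layer(graph_rep, features, weights):
    """One graph-convolution layer: relu(graph_rep @ features @ weights)."""
    return np.maximum(graph_rep @ features @ weights, 0.0)


rng = np.random.default_rng(0)
n_nodes, n_feats, n_units = 5, 3, 4

# Placeholder graph representative and random features (illustration only).
a_hat = np.full((n_nodes, n_nodes), 1.0 / n_nodes)
x = rng.random((n_nodes, n_feats))

# Setup 1: features and topology.
w_ft = rng.standard_normal((n_feats, n_units))
h_ft = first_layer(a_hat, x, w_ft)

# Setup 2: features only, an identity replaces the graph representative.
w_f = rng.standard_normal((n_feats, n_units))
h_f = first_layer(np.eye(n_nodes), x, w_f)

# Setup 3: topology only, an identity replaces the feature matrix,
# so the weight matrix needs one row per node.
w_t = rng.standard_normal((n_nodes, n_units))
h_t = first_layer(a_hat, np.eye(n_nodes), w_t)
```

All three outputs have one row per node and one column per unit; only the number of rows of the first weight matrix differs between the setups.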

We assume that during the simulations, the first setup will produce more accurate results than the remaining two. This hypothesis is represented through the following inequality:

(16)

where the compared quantity is the test accuracy of the trained neural network model for four different setups, based on the four propagation rules in Equations 13, 14, 15, and 26. However, the GCN assumes that at the l^th layer of the model only the l^th order neighbourhood nodes are influential [20, 47]. To work around this problem, we replace the adjacency matrix in the Laplacian transformation of Equation 11, i.e. in the graph representative function, with three different existing node-similarity measures. In social networks, not all connected nodes have the same influence; in fact, some nodes that are not directly connected may have greater influence over a given node than the directly connected ones. As a result, using the adjacency matrix as the graph representative may not always yield the best performance of the neural network.

8.2 Node-similarities as Graph Representatives for GCN

In social networks, the adjacency matrix represents direct links between nodes, and in GCN the features propagate through those links. Thus, a node’s label is predicted by utilising patterns in the features and labels of the surrounding connected nodes. However, in social networks, not all the connections of a given node have the same or even a similar effect on this node. It can be assumed, for example, that the influence one node has on its neighbour increases with the number of their mutual friends. In a similar way, a friend of a friend may influence a node more than a directly connected but non-influential node (e.g. one that has no friend in common with the node). One can therefore change the representation of the network and incorporate these relationship characteristics in the form of a social node-similarity-based matrix for the GCN. One way to extract and represent the strengths of such social relationships (not necessarily direct ones), and other information between nodes, is to use a matrix which describes the similarity between nodes instead of an adjacency matrix. For example, the Katz similarity measure considers the number of all paths between a pair of nodes [48]. Thus, more mutual friends result in a higher number of paths and hence a higher Katz score. In this study, we replace the adjacency matrix with three different types of node-similarity matrices, as they encompass richer information about the underlying structure than the traditional adjacency matrix. The three node-similarity measures we have considered are the following:

  • Katz, which considers the number of all the paths between a pair of nodes [48]. Shorter paths have a bigger weight (i.e. are more important); the weight is damped exponentially with increasing path length by a damping parameter applied to powers of the adjacency matrix:

    (17)

    The above similarity in Equation 17 will result in the following graph representative

    (18)
    (19)
  • Rooted PageRank (RPR) is used by search engines to rank websites. In graph analysis it ranks nodes by the probability of each node being reached via a random walk on the graph [49]. The RPR score is calculated from the stationary probability distribution of a random walk, using the degree matrix of the graph. At each step the random walk returns to its root node with a given probability and moves to a random neighbour otherwise. This results in the following graph representative:

    (20)
    (21)
  • Graph Gravity (GG). Inspired by Newton’s law of universal gravitation, this node-similarity measure uses degree centrality as the mass of the nodes, while the lengths of the shortest paths between them act as distances [44, 50]. This analogy leads to the following formula for calculating the score between two nodes:

    (22)

    where the masses are the degree centralities of the two nodes and the distance is the length of the shortest path between them. The node-similarity in Equation 22 results in the following graph representative:

    (23)
    (24)

For all three node-similarity measures above, the graph representative is built from the node-similarity matrix (computed for all possible links), which is further preprocessed and reconfigured as discussed in Section 9.1.
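Two of the measures above can be sketched in a few lines of NumPy. Both functions are illustrations under assumptions rather than the exact Equations 17 and 22: the Katz score is taken as a truncated sum of damped powers of the adjacency matrix, with `beta` a hypothetical name for the damping parameter, and the Graph Gravity score is assumed to be the product of raw node degrees divided by the squared shortest-path length, by analogy with Newton's law.

```python
import numpy as np


def katz_similarity(adj, beta, max_power=5):
    """Truncated Katz score: sum over l = 1..max_power of beta^l * A^l."""
    sim = np.zeros(adj.shape, dtype=float)
    term = np.eye(adj.shape[0])
    for _ in range(max_power):
        term = beta * (term @ adj)  # beta^l * A^l for the current l
        sim += term
    return sim


def graph_gravity(adj):
    """Assumed gravity score: deg_i * deg_j / shortest_path(i, j)^2."""
    n = adj.shape[0]
    dist = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(dist, 0.0)
    for k in range(n):  # Floyd-Warshall all-pairs shortest paths
        dist = np.minimum(dist, dist[:, [k]] + dist[[k], :])
    deg = adj.sum(axis=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        score = np.outer(deg, deg) / dist ** 2
    np.fill_diagonal(score, 0.0)  # no self-similarity
    return score


# Path graph 0 - 1 - 2.
path = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0]])
katz = katz_similarity(path, beta=0.5, max_power=2)
gravity = graph_gravity(path)
```

In both cases the two end nodes of the path, which are not directly linked, receive a non-zero similarity, which is what allows a similarity matrix to stand in for the adjacency matrix. Unreachable pairs get a gravity score of zero because their distance is infinite.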

8.3 Weighted Feature Matrix

GCN is a powerful model for node classification, and it has been shown to perform well even with the graph topology only, i.e. without the feature matrix (Kipf and Welling [20]). There are two possible reasons for such good predictability without the features. Firstly, when our focus is on node classification for graph-structured datasets, the preferred features of the nodes should be reflected in the topology of the graph, as discussed for our simulation process in Section 4. The fact that these features are encoded in the topology may result in good predictability even when the features are not directly considered in the model. Additionally, whether the features alone or the topology alone give the better predictability may vary from node to node. For some nodes, the topology alone may predict better than the node’s features, because the topological position of the node overshadows the importance of its features.

Secondly, the good performance based solely on topology could arise because, similar to real social network users, we have defined the sDNAs of the nodes such that some features of other nodes are preferred and others are not (Section 4). In other words, not all the features play similar roles when it comes to the predictability of the sDNA. As a result, some of the features may be disliked or not preferred by the majority of the nodes in the graph when forming connections. This is why an additional learnable weight per feature, shared by all the nodes, may result in better predictability. In our analysis, we have found that adding this additional weight seems to perform best, and this is what we present in Section 10. To introduce this relative importance of features, we use one additional weight vector in the model: for each feature, a common weight (i.e. shared across all the nodes) is used to learn the strength of that feature. This additional feature weight vector has as many entries as there are features and is used only in the first layer of the model. Hence all the input features are weighted before being passed to the hidden layers.

This additional weight vector results in the following first layer propagation rule based on the Equation 10:

(25)

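where the additional factor is the matrix containing the unbounded learnable parameters that define the strength of each feature in the feature matrix, and it is combined with an all-ones matrix. The feature matrix enters Equation 25 through its Hadamard product with the dot product of the all-ones matrix and the learnable parameters.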
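The feature weighting described here can be sketched in NumPy as follows. The shapes are an assumption: the all-ones matrix is taken as a column with one entry per node and the learnable parameters as a row with one entry per feature, so that their product repeats the same per-feature weight for every node. The function and variable names are hypothetical.

```python
import numpy as np


def weight_features(features, feature_weights):
    """Hadamard product of the features with per-feature weights shared by all nodes."""
    n_nodes = features.shape[0]
    ones = np.ones((n_nodes, 1))               # all-ones column, one entry per node
    tiled = ones @ feature_weights[None, :]    # same weight row repeated for each node
    return features * tiled                    # element-wise (Hadamard) product


x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
s = np.array([10.0, 0.5])
weighted = weight_features(x, s)
```

The weighted matrix would then take the place of the feature matrix in the first layer, while all later layers stay unchanged.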

9 Experimental Setup

In the experiments, we simulate ten social networks. Each of the networks has four different types of sDNAs, with every node subscribing to a single type. We take three different snapshots of each network, so that the ten initial networks yield a total of 30 networks. Each node has a set of features generated from a uniform distribution. All the variables of the sDNA (described in Figure 1 and Section 4) are also generated from uniform distributions. All the models have four graph convolutional layers. All layers except the output layer use rectified linear units (ReLU) as nonlinear activation functions. The output graph convolutional layer uses a softmax activation, and the categorical cross-entropy loss is calculated for the four types of sDNAs, i.e. node labels. All layers except the output layer contain the same number of units, and the output layer has as many units as there are classes to be predicted. Finally, we use the Adam optimiser, a first-order gradient-based algorithm, to learn the weights of our differentiable neural network model (i.e. to optimise the loss function). Every model is evaluated with the same setup. On every network, 10-fold cross-validation is performed and the average accuracy is reported. Additionally, the standard deviation of the accuracy is reported for the best model and the original GCN in Table 2. All the hyperparameters are kept fixed for all the models: a fixed learning rate, L2 kernel regularisation (i.e. weight decay) with a fixed decay rate for all the hidden layers, and a dropout layer after each hidden layer, i.e. 50% of the randomly selected neurons are trained in each training iteration.

Model      Description                       Graph representative  Eq.
FTVanilla  Original GCN, feature + topology                        10
T          Original GCN, topology only                             15
TLR        Original GCN, topology only                             26
F          Original GCN, feature only        N/A                   14
FTKatz     Feature + Katz based topology                           18
FTRPR      Feature + RPR based topology                            20
FTGG       Feature + GG based topology                             22
Table 1: Models used along with the original GCN. All of the models with features are trained twice, once with the weighted feature matrix in Equation 25 and once without

In Table 1, the weight matrix of the topology-only model T (Equation 15) contains more trainable parameters than that of the FTVanilla model (Equation 13). This is because each network has more nodes than features per node. For FTVanilla, which uses both the topology and the feature matrix, the first-layer weight matrix needs one row per feature and one column per hidden unit, where the number of units is a hyperparameter shared by all the models; the output of the first layer then has one row per node and one column per unit. For the topology-only model T in Equation 15, where the feature matrix is only an identity matrix, the weight matrix is multiplied directly with the graph representative, i.e. the graph topology. As a result, the weight matrix is a lot larger, with one row per node and one column per unit, while the output of the first layer has the same dimension as before. The T model therefore has more trainable parameters than FTVanilla. If we are to compare the performance of the two models, T vs FTVanilla, to test whether Inequality 16 holds as a validation of the feature and topology integration process, the total number of trainable parameters of both models should be as close as possible. To make the models comparable, we introduce another setup for the topology-only model that keeps the number of parameters at a level similar to the model using both the features and the topology.

(26)

In Equation 26, weight matrix is split into two matrices to keep the number of trainable parameters roughly in line with the FTVanilla model.
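The parameter counts behind this comparison can be worked through with a short calculation. The sizes used below (300 nodes, 50 features, 32 units) are hypothetical and serve only as an illustration, and the split of the weight matrix is assumed to be a low-rank factorisation into a nodes-by-rank and a rank-by-units matrix.

```python
def first_layer_params(n_nodes, n_feats, n_units):
    """Count first-layer weights for the FT, T and split-T (TLR) setups."""
    ft = n_feats * n_units   # features + topology: one row per feature
    t = n_nodes * n_units    # topology only: one row per node
    # Assumed low-rank split: (n_nodes x rank) followed by (rank x n_units),
    # with the rank chosen so that the total is close to the FT count.
    rank = max(1, round(ft / (n_nodes + n_units)))
    tlr = rank * (n_nodes + n_units)
    return {"FT": ft, "T": t, "TLR": tlr, "rank": rank}


counts = first_layer_params(n_nodes=300, n_feats=50, n_units=32)
```

With these illustrative sizes the topology-only model has six times as many first-layer weights as the model with features, while the split version lands within a few percent of it.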

9.1 Augmented node-similarity matrix

For the GG-based graph representative (Equation 24), the similarity scores for all the non-existing links are calculated and then normalised between zero and one. Afterwards, the adjacency matrix is summed with the calculated scores for all the non-existing links. As a result, all the existing links have a value of one, and for the non-existing links the value ranges from zero to one. For Katz and RPR, the path-based similarity scores are calculated for all possible links. For all the networks, the Katz score is calculated with a highest exponent of five for the adjacency matrix in Equation 17 and a fixed damping parameter; the RPR parameter is likewise fixed. For each of the calculated similarity matrices (Katz, RPR and GG), each row is normalised over its non-zero elements using the L2 norm.

Moreover, several thresholds are used on the similarity-based adjacency matrices. The thresholds are applied to the row-normalised matrices and are set such that a value less than or equal to the first threshold is set to zero, whereas a value greater than the second threshold is set to one. If the thresholds are set to zero and one respectively, then none of the values in the matrix is changed. If the two thresholds are not the same, the elements that lie between them are left unaltered. The sets of thresholds are selected based on empirical analysis, i.e. the cross-validation accuracy of the model. In addition, we also use a threshold based on the mean value of the elements of the matrix: a non-zero element that is less than or equal to the mean value is set to zero, and to one otherwise.
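The normalisation and thresholding steps can be sketched as below. The function names are hypothetical, the similarity matrix is assumed to be non-negative, and the mean used for the automatic threshold is assumed to be taken over all elements of the matrix.

```python
import numpy as np


def row_l2_normalise(sim):
    """Divide each row by its L2 norm; all-zero rows are left as zeros."""
    norms = np.linalg.norm(sim, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return sim / norms


def apply_thresholds(sim, low, high):
    """Values <= low become 0, values > high become 1, the rest are kept."""
    out = sim.copy()
    out[sim <= low] = 0.0
    out[sim > high] = 1.0
    return out


def apply_mean_threshold(sim):
    """Values above the mean become 1, all others 0 (zeros stay zero)."""
    return np.where(sim > sim.mean(), 1.0, 0.0)


normed = row_l2_normalise(np.array([[3.0, 4.0],
                                    [0.0, 2.0]]))
clipped = apply_thresholds(normed, low=0.7, high=0.9)
binary = apply_mean_threshold(np.array([[0.1, 0.9],
                                        [0.0, 0.5]]))
```

Calling `apply_thresholds` with thresholds of zero and one leaves a row-normalised, non-negative matrix unchanged, matching the behaviour described in the text.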

In GCN, for the l^th layer, only the neighbouring nodes at path length l are considered [20]. This limits the scope of the receptive field of a node in each layer, and the maximum receptive field is limited by the number of layers used in the model. This limitation was also pointed out in the paper where GCN was first introduced [20]. However, using node-similarity measures along with the augmentation process described here allows the model to consider a node three hops away even in the first layer (i.e. as a direct connection) for the classification of a given node, assuming that the two have a high node-similarity score. As a result, the augmented node-similarity measure overcomes the limitation of layer-wise node-neighbourhood dependencies of the GCN.

10 Results and Discussion

In Figure 3, we show the accuracy of all the models that we have tested on the 30 simulated networks. All the results are 10-fold cross-validated and the average accuracy is reported. In Figure 3 and Table 2, we observe that, in line with the hypothesis of Equation 16, the accuracy of the model which uses node features only, i.e. F, is very low. In fact, the predictability is no better than random chance, given that we have four equally represented labels (sDNA types) to predict. Additionally, Figure 3 and Table 2 show that for the majority of the datasets (all except three networks) models which utilise both the topology and the features of the graph perform better than the two other setups, where topology and features are considered independently. The topology-only model T performs the best on three networks. Two of them are third snapshots (i.e. the third run of Algorithm 1) of a network (networks 1-2 and 6-2), and the third one is a second snapshot (network 2-1). This may be because, as we run Algorithm 1 multiple times, the patterns of preferences get encoded within the network topology, so that the topology-only model performs better. This is something we also expect in real-world networks: as a person makes more connections, their connection-making pattern becomes evident.

Amongst the methods that use node-similarity matrices instead of the node adjacency matrix, the choice of threshold seems to have a significant effect on model performance. However, using the mean value of the L2-normalised node-similarity matrix as a threshold (described in Section 9.1) also performs quite well. In fact, on seven networks the setup that uses the mean value as a threshold on the node-similarity matrix outperforms all the other models (Table 2). The models with the mean value as threshold are written as ‘auto’ in Table 2.

If we do not consider differences between thresholds and the usage of the additional feature weight vector, the Katz-based models significantly outperform the original GCN (i.e. FTVanilla) and the two other node-similarity measures (RPR and GG) based on the results in Table 2. RPR performs the best on five networks and GG on one network. However, on the basis of average best-performing models on all the datasets from all the different models, the following models perform best: 1) , 2) , and 3) . Thus, on average, performs best across all the datasets.

The results also show that using the trainable feature weights of Equation 25 gives a better model for many datasets than not using them. In 15 out of 30 datasets, a model with the additional weights in the first layer outperforms the other models (Table 2). Furthermore, the best-performing models with the additional weights are mainly not the original GCN but the node-similarity-based models, except for one dataset. However, this does not necessarily imply that the additional weights in the first layer (Equation 25) only perform well on node-similarity-based models, because node-similarity may simply have better predictability in general than the adjacency matrix.

From Figure 3 and Table 2 we can see that the performance of a node-similarity-based model varies depending on the network the model is trained on. This is because all the networks are simulated with different rules, and no two networks are exactly the same. We can expect to see the same in real-world networks as well. Thus, the choice of a node-similarity method could be based on empirical analysis. However, one may also use the mean value of the normalised node-similarity matrix as the threshold, especially with GG, as discussed earlier in this section.

The results in Figure 3 show that we can achieve high accuracy (in fact higher than when using only the adjacency matrix) on node classification when a node-similarity-based graph topology is used. This is particularly useful for very dense networks, for which the training time required by GCN is extremely high. Many real-world datasets, such as face-to-face interaction networks, tend to be very dense. Thus, the similarity-based matrices can be used in such scenarios (with a suitable threshold to reduce the number of connections, as per Section 9.1) to reduce training time.

Figure 3: Accuracy of the different models (average from 10-fold cross-validation). Models are written as: F - features only, T - topology only, FT - both features and topology, vanilla - the original GCN. The prefix S in the right box indicates an additional feature weight matrix in the first layer (Equation 25). The left box shows results for models that use the adjacency-based graph representative of Equation 10. The middle box shows models that use a node-similarity measure (Katz, RPR or GG) with different thresholds as graph representative (Equations 18, 20 and 22). The similarity-based matrices are preprocessed as per Section 9.1. The preprocessing threshold ‘auto’ implies automatic selection of a threshold based on the mean value of the normalised node-similarity matrix (Section 9.1). All the networks are represented in terms of snapshots: for example, 0-0 is the first network’s first snapshot and 0-1 is the first network’s second snapshot.
| Network | FTvanilla [20] Acc (SD) | Max Acc (SD) | Max Acc Model |
|---|---|---|---|
| 0-0 | 0.721 (0.011) | 0.732 (0.007) | FTRPRauto |
| 0-1 | 0.71 (0.022) | 0.762 (0.011) | SFTkatzAuto |
| 0-2 | 0.699 (0.032) | 0.739 (0.012) | SFTkatzAuto |
| 1-0 | 0.741 (0.013) | 0.753 (0.012) | SFTkatz0.1-1.0 |
| 1-1 | 0.736 (0.034) | 0.767 (0.007) | SFTvanilla |
| 1-2 | 0.754 (0.027) | 0.767 (0.033) | T |
| 2-0 | 0.507 (0.045) | 0.559 (0.009) | SFTkatz0.0-0.5 |
| 2-1 | 0.599 (0.044) | 0.618 (0.074) | T |
| 2-2 | 0.671 (0.017) | 0.675 (0.014) | SFTRPRauto |
| 3-0 | 0.554 (0.018) | 0.576 (0.009) | SFTkatz0.0-0.5 |
| 3-1 | 0.517 (0.027) | 0.553 (0.013) | SFTGG0.0-1.0 |
| 3-2 | 0.515 (0.045) | 0.555 (0.016) | SFTRPRauto |
| 4-0 | 0.49 (0.011) | 0.508 (0.006) | FTRPR0.0-0.5 |
| 4-1 | 0.51 (0.010) | 0.532 (0.013) | SFTkatz0.0-0.5 |
| 4-2 | 0.506 (0.010) | 0.545 (0.009) | FTkatz0.0-1.0 |
| 5-0 | 0.701 (0.028) | 0.741 (0.005) | FTkatz0.1-1.0 |
| 5-1 | 0.758 (0.010) | 0.783 (0.005) | SFTkatz0.1-1.0 |
| 5-2 | 0.756 (0.044) | 0.807 (0.004) | FTRPRauto |
| 6-0 | 0.686 (0.009) | 0.709 (0.010) | FTkatz0.0-0.5 |
| 6-1 | 0.701 (0.030) | 0.748 (0.018) | SFTkatzAuto |
| 6-2 | 0.726 (0.035) | 0.759 (0.015) | T |
| 7-0 | 0.757 (0.010) | 0.762 (0.008) | SFTkatz0.0-0.5 |
| 7-1 | 0.738 (0.013) | 0.764 (0.010) | FTkatz0.0-0.5 |
| 7-2 | 0.731 (0.015) | 0.761 (0.013) | FTkatzAuto |
| 8-0 | 0.677 (0.016) | 0.724 (0.007) | FTkatz0.0-0.5 |
| 8-1 | 0.717 (0.028) | 0.752 (0.007) | SFTkatz0.1-1.0 |
| 8-2 | 0.731 (0.016) | 0.742 (0.013) | SFTkatz0.0-0.5 |
| 9-0 | 0.699 (0.014) | 0.747 (0.011) | FTkatz0.0-1.0 |
| 9-1 | 0.68 (0.012) | 0.726 (0.019) | FTkatz0.0-0.5 |
| 9-2 | 0.627 (0.012) | 0.726 (0.006) | FTkatz0.1-1.0 |
Table 2: Accuracy (Acc) and standard deviation (SD) of the best model versus the original GCN model. Models are written as F (features only), T (topology only), FT (both features and topology), and FTvanilla (the original GCN). The prefix S in the last column denotes the use of an additional feature weight matrix in the first layer (Equation 25). Models whose graph representative is a node-similarity measure (Katz, RPR, or GG) with different thresholds (Equations 18, 20, and 22) are named in the last column with the corresponding node-similarity measure (e.g. katz for the model FTkatz0.0-0.5). All the similarity-based matrices are preprocessed and reconfigured as described in Section 9.1. The preprocessing threshold auto denotes automatic selection of a threshold based on the mean value of the normalised node-similarity matrix (as per Section 9.1). Networks are represented in terms of snapshots, e.g. 0-0 is the first network’s first snapshot, 0-1 is the first network’s second snapshot, etc.

11 Conclusions and Future Work

In this work, we have evaluated the performance of the GCN on simulated friendship-based social network datasets. One limitation of the GCN is that the highest order of neighbourhood it can take into account is bounded by the number of layers used in the model. We argued that using the node-similarity matrix as the graph representative removes this dependency between the layers and the order of the neighbourhood nodes. Consequently, our approach with the node-similarity measures may perform well with only a few layers, where the original GCN would need more. The GCN, like any deep learning model, is prone to overfitting when a large number of layers is used [20], so our approach may avoid this problem and achieve higher accuracy with only a few layers. We have also shown empirically that most of the models with the augmented node-similarity measures outperform the original GCN.
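
The dependency between the number of layers and the neighbourhood order can be seen in a small sketch on a hypothetical six-node path graph. It compares the reach of two propagation steps with the normalised adjacency matrix of [20] against the standard Katz index, (I − βA)⁻¹ − I, which sums walks of every length. The value of β and the graph are assumptions for illustration, and the paper's own Katz formulation may use a different normalisation.

```python
import numpy as np

# Hypothetical path graph 0-1-2-3-4-5.
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0

# Symmetrically normalised adjacency with self-loops, as used by the GCN in [20].
A_tilde = A + np.eye(n)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

# Two propagation steps (two layers) mix information from at most 2 hops away.
two_layer_reach = A_hat @ A_hat

# Standard Katz index; beta must stay below 1 / (largest eigenvalue of A).
beta = 0.3
assert beta < 1.0 / np.max(np.abs(np.linalg.eigvalsh(A)))
katz = np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

print("two-layer weight between nodes 0 and 5:", two_layer_reach[0, 5])
print("Katz score between nodes 0 and 5:", katz[0, 5])
```

Nodes 0 and 5 are five hops apart, so two layers assign them no weight, whereas the Katz score is positive because it accounts for walks of length five and longer.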

In total, we have proposed four new variations of the GCN model. Three of them use the Katz, RPR, and GG scores, respectively, to encode the graph topology. The fourth adds learnable parameters for each feature, shared across all nodes of the graph, which allows the model to ignore input features if it so chooses. This variation can be used with the adjacency matrix as well as with the Katz, RPR, or GG scores; its primary motivation was the observation that, for some datasets, using the topology only gives superior results. The results show that these new variations outperform the original GCN model in terms of accuracy on most of the simulated datasets.
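
One way to read the fourth variant is sketched below: a vector with one weight per feature, shared by every node, rescales the feature columns before the usual graph convolution. This is an illustrative interpretation only, since Equation 25 is not reproduced here. The names `first_layer` and `feature_weights`, the random graph, and the fixed (rather than trained) weights are all assumptions.

```python
import numpy as np

def normalise_adjacency(A):
    """Symmetric normalisation with self-loops, as in the original GCN [20]."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def first_layer(G, X, feature_weights, W):
    """ReLU(G (X * s) W), where s rescales each feature column for all nodes alike."""
    return np.maximum(G @ (X * feature_weights) @ W, 0.0)

rng = np.random.default_rng(1)
n_nodes, n_features, n_hidden = 5, 4, 3

# Hypothetical undirected graph, node features, and layer weights.
A = np.triu((rng.random((n_nodes, n_nodes)) < 0.5).astype(float), 1)
A = A + A.T
G = normalise_adjacency(A)
X = rng.random((n_nodes, n_features))
W = rng.normal(size=(n_features, n_hidden))

# In training, the per-feature weights would be learned; here they are set by hand.
s_all = np.ones(n_features)                  # every feature kept unchanged
s_drop_last = np.array([1.0, 1.0, 1.0, 0.0])  # last feature switched off

H_plain = first_layer(G, X, s_all, W)
H_masked = first_layer(G, X, s_drop_last, W)
```

Setting a weight to zero removes that feature's contribution for every node at once, which is the sense in which the model can ignore an input feature.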

For the node-similarity-based matrices, we have proposed a reconfiguration technique. This reconfiguration augments the graph represented by the node-similarity matrix. This is particularly important because, for the node classification task with GCN-like models, we have only one graph sample on which to train the model. The augmentation technique can be used to better train the model on the same graph with several different augmented node-similarity matrices (with different thresholds and similarity measures). Several representations of the same graph topology can also act as a regularisation technique that prevents overfitting of the model.
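
The idea of deriving several training views from one graph can be sketched as follows. The sketch assumes that a threshold label such as 0.0-0.5 denotes a lower and an upper bound on the min-max normalised similarity values, which is an interpretation of the model names in Table 2 rather than the documented procedure of Section 9.1. The helper names `minmax` and `augmented_views` are hypothetical.

```python
import numpy as np

def minmax(S):
    """Rescale a matrix to the range [0, 1]."""
    S = np.asarray(S, dtype=float)
    span = S.max() - S.min()
    return (S - S.min()) / span if span > 0 else np.zeros_like(S)

def augmented_views(S, ranges):
    """Return one thresholded copy of the normalised matrix per (low, high) range."""
    normalised = minmax(S)
    views = {}
    for low, high in ranges:
        keep = (normalised >= low) & (normalised <= high)
        views[f"{low}-{high}"] = np.where(keep, normalised, 0.0)
    return views

# Hypothetical symmetric similarity scores for a 50-node graph.
rng = np.random.default_rng(2)
scores = rng.random((50, 50))
scores = (scores + scores.T) / 2

views = augmented_views(scores, [(0.0, 0.5), (0.1, 1.0), (0.0, 1.0)])
for name, view in views.items():
    print(name, "non-zero entries:", np.count_nonzero(view))
```

Each view keeps the same nodes and labels but a different set of weighted connections, so a model could be trained on all of them in turn.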

We argued that a node in Facebook-type social networks can be defined in terms of a set of preferences, which we termed the sDNA of the node. Based on the sDNA, our simulation strategy provides a comprehensive guideline on how to generate dynamic networks with features and ground truth labels, which is particularly useful for training and testing neural network-based learning systems. We have validated the integration of features and topology in the simulated graphs based on the predictive performance of the GCN. If the integration process is good enough, a model with topology only should not perform better than a model with both topology and features. However, we found that a large number of models performed better with only the topology of the network. We concluded that this is because not all the features play a similar role in the graph. To capture this variation in the importance of each feature across all the nodes, we introduced the weighted feature matrix for the GCN. The new variant of the GCN with the weighted feature matrix has shown great potential. With the weighted feature matrix, the majority of the models (all except those for three datasets) perform better with both features and topology than with topology only. This not only produces a new variant of the GCN model but also shows that our integration of topology and features is successful.

The three cases where the topology-only model performs better could be due to the significantly larger number of learnable parameters that this model has compared with the feature-and-topology model, as discussed earlier. To address this unequal number of learnable parameters, we have introduced another variation of the topology-only model, in which the number of learnable parameters is reduced by using a low-rank approximation of the weight matrix. The reduced-parameter topology-only model also performs well compared with the model with more parameters. Further inspection of those datasets may reveal the underlying reason why the topology-only models perform well on them. It is possible, however, that for those three datasets the features are reflected in the topology so well that the topology-only model becomes more powerful and adding features simply introduces redundancy.
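
The parameter saving from a low-rank weight matrix can be made concrete with a short sketch. It assumes that the topology-only model feeds an identity matrix as node features, so that the first-layer weight matrix has one row per node; the sizes and the rank below are hypothetical and are not taken from the experiments.

```python
import numpy as np

# Hypothetical sizes: 1000 nodes, 16 hidden units, rank-4 factorisation.
n_nodes, n_hidden, rank = 1000, 16, 4

# Full first-layer weight matrix W (n_nodes x n_hidden) versus the factorisation
# W ~ U V with U (n_nodes x rank) and V (rank x n_hidden).
full_params = n_nodes * n_hidden
low_rank_params = n_nodes * rank + rank * n_hidden

rng = np.random.default_rng(3)
U = rng.normal(size=(n_nodes, rank))
V = rng.normal(size=(rank, n_hidden))
W_low_rank = U @ V  # same shape as the full W, but rank is at most `rank`

print("full parameters:", full_params)          # 16000
print("low-rank parameters:", low_rank_params)  # 4064
```

The factorised matrix has the same shape as the full one, so the rest of the model is unchanged while the number of learnable values in the first layer drops to roughly a quarter in this example.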

We have used a few empirically selected thresholds for the augmented node-similarity matrix. A more effective way to select optimal thresholds is another future direction to explore.

We also provide an open-source library for social network simulation, written in Python, with GPU computation support for high-dimensional features. We aim to extend the simulation library with more functionality in the future.

12 References

  • [1] Virtual-based safety testing for self-driving cars from nvidia drive constellation. URL https://www.nvidia.com/en-gb/self-driving-cars/drive-constellation/.
  • [2] Waymo safety report on the road to fully self-driving. https://waymo.com/safety/.
  • Goyal and Ferrara [2018] Palash Goyal and Emilio Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
  • Wasserman and Faust [1994] Stanley Wasserman and Katherine Faust. Social network analysis: Methods and applications, volume 8. Cambridge university press, 1994.
  • Newman [2018] Mark Newman. Networks. Oxford university press, 2018.
  • Sapountzi and Psannis [2018] Androniki Sapountzi and Kostas E Psannis. Social networking data analysis tools & challenges. Future Generation Computer Systems, 86:893–913, 2018.
  • Pecli et al. [2018] Antonio Pecli, Maria Claudia Cavalcanti, and Ronaldo Goldschmidt. Automatic feature selection for supervised learning in link prediction applications: a comparative study. Knowledge and Information Systems, 56(1):85–121, 2018.
  • Popescul and Ungar [2003] Alexandrin Popescul and Lyle H Ungar. Statistical relational learning for link prediction. In IJCAI workshop on learning statistical models from relational data, volume 2003. Citeseer, 2003.
  • Viswanath et al. [2009] Bimal Viswanath, Alan Mislove, Meeyoung Cha, and Krishna P Gummadi. On the evolution of user interaction in facebook. In Proceedings of the 2nd ACM workshop on Online social networks, pages 37–42. ACM, 2009.
  • [10] Facebook friendships network dataset – konect. http://konect.uni-koblenz.de/networks/facebook-wosn-links. Accessed: 2019-02-11.
  • Leskovec and Krevl [2014] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
  • Townsend and Wallace [2016] Leanne Townsend and Claire Wallace. Social media research: A guide to ethics. University of Aberdeen, pages 1–16, 2016.
  • Narayanan et al. [2011] Arvind Narayanan, Elaine Shi, and Benjamin IP Rubinstein. Link prediction by de-anonymization: How we won the kaggle social network challenge. In Neural Networks (IJCNN), The 2011 International Joint Conference on, pages 1825–1834. IEEE, 2011.
  • Hand [2018] David J Hand. Aspects of data ethics in a changing world: Where are we now? Big data, 6(3):176–190, 2018.
  • Cadwalladr and Graham-Harrison [2018] Carole Cadwalladr and Emma Graham-Harrison. Revealed: 50 million facebook profiles harvested for cambridge analytica in major data breach. The Guardian, 17, 2018.
  • Michelle [2018] Michelle. Social media’s year of falling from grace. Voice of America (VOA), Dec 2018. URL https://www.voanews.com/a/social-medias-year-of-falling-from-grace/4720871.html.
  • Bennett [2018] Colin J Bennett. The european general data protection regulation: An instrument for the globalization of privacy standards? Information Polity, 23(2):239–246, 2018.
  • Mikolov et al. [2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • Grover and Leskovec [2016] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864. ACM, 2016.
  • Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • Lichtenwalter et al. [2010] Ryan N Lichtenwalter, Jake T Lussier, and Nitesh V Chawla. New perspectives and methods in link prediction. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 243–252. ACM, 2010.
  • Scellato et al. [2011] Salvatore Scellato, Anastasios Noulas, and Cecilia Mascolo. Exploiting place features in link prediction on location-based social networks. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1046–1054. ACM, 2011.
  • Tylenda et al. [2009] Tomasz Tylenda, Ralitsa Angelova, and Srikanta Bedathur. Towards time-aware link prediction in evolving social networks. In Proceedings of the 3rd workshop on social network mining and analysis, page 9. ACM, 2009.
  • Xu et al. [2018] Shuai Xu, Kai Han, and Naiting Xu. A supervised learning approach to link prediction in dynamic networks. In International Conference on Wireless Algorithms, Systems, and Applications, pages 799–805. Springer, 2018.
  • Wahid-Ul-Ashraf et al. [2018] Akanda Wahid-Ul-Ashraf, Marcin Budka, and Katarzyna Musial. Netsim–the framework for complex network generator. Procedia Computer Science, 126:547–556, 2018.
  • Barabási and Albert [1999] Albert-László Barabási and Réka Albert. Emergence of scaling in random networks. science, 286(5439):509–512, 1999.
  • Watts and Strogatz [1998] Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’ networks. nature, 393(6684):440–442, 1998.
  • Solomonoff and Rapoport [1951] Ray Solomonoff and Anatol Rapoport. Connectivity of random nets. The bulletin of mathematical biophysics, 13(2):107–117, 1951.
  • Erdős and Rényi [1959] Paul Erdős and Alfréd Rényi. On random graphs i. Publicationes Mathematicae (Debrecen), 6:290–297, 1959.
  • Erdős and Rényi [1960] Paul Erdős and Alfréd Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(17-61):43, 1960.
  • Erdős and Rényi [1961] Paul Erdős and Alfréd Rényi. On the strength of connectedness of a random graph. Acta Mathematica Hungarica, 12(1-2):261–267, 1961.
  • Leskovec et al. [2005] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 177–187. ACM, 2005.
  • Drossel and Schwabl [1992] Barbara Drossel and Franz Schwabl. Self-organized critical forest-fire model. Physical review letters, 69(11):1629, 1992.
  • Price [1976] Derek de Solla Price. A general theory of bibliometric and other cumulative advantage processes. Journal of the American society for Information science, 27(5):292–306, 1976.
  • Simon [1955] Herbert A Simon. On a class of skew distribution functions. Biometrika, 42(3/4):425–440, 1955.
  • Newman [2010] Mark Newman. Networks: an introduction. United States: Oxford University Press Inc., New York, pages 1–2, 2010.
  • Papadimitriou et al. [2011] A Papadimitriou, P Symeonidis, and Y Manolopoulos. Predicting links in social networks of trust via bounded local path traversal. In Proceedings 3rd Conference on Computational Aspects of Social Networks (CASON’2011), Salamanca, Spain, 2011.
  • Symeonidis et al. [2010] Panagiotis Symeonidis, Eleftherios Tiakas, and Yannis Manolopoulos. Transitive node similarity for link prediction in social networks with positive and negative links. In Proceedings of the fourth ACM conference on Recommender systems, pages 183–190. ACM, 2010.
  • Papadimitriou et al. [2012] Alexis Papadimitriou, Panagiotis Symeonidis, and Yannis Manolopoulos. Fast and accurate link prediction in social networking systems. Journal of Systems and Software, 85(9):2119–2132, 2012.
  • Bruch and Atwell [2015] Elizabeth Bruch and Jon Atwell. Agent-based models in empirical social research. Sociological methods & research, 44(2):186–221, 2015.
  • Granovetter [1978] Mark Granovetter. Threshold models of collective behavior. American journal of sociology, 83(6):1420–1443, 1978.
  • Kavak et al. [2018] Hamdi Kavak, Jose J Padilla, Christopher J Lynch, and Saikou Y Diallo. Big data, agents, and machine learning: towards a data-driven agent-based modeling approach. In Proceedings of the Annual Simulation Symposium, page 12. Society for Computer Simulation International, 2018.
  • Michalski et al. [2018] Radosław Michalski, Bolesław K Szymański, Przemysław Kazienko, Christian Lebiere, Omar Lizardo, and Marcin Kulisiewicz. Social networks through the prism of cognition. arXiv preprint arXiv:1806.04658, 2018.
  • Wahid-Ul-Ashraf et al. [2017] Akanda Wahid-Ul-Ashraf, Marcin Budka, and Katarzyna Musial-Gabrys. Newton’s gravitational law for link prediction in social networks. In International Workshop on Complex Networks and their Applications, pages 93–104. Springer, 2017.
  • Bliss et al. [2014] Catherine A Bliss, Morgan R Frank, Christopher M Danforth, and Peter Sheridan Dodds. An evolutionary algorithm approach to link prediction in dynamic social networks. Journal of Computational Science, 5(5):750–764, 2014.
  • Hristova et al. [2016] Desislava Hristova, Anastasios Noulas, Chloë Brown, Mirco Musolesi, and Cecilia Mascolo. A multilayer approach to multiplexity and link prediction in online geo-social networks. EPJ Data Science, 5(1):24, 2016.
  • Li et al. [2018] Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Katz [1953] Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39–43, 1953.
  • Brin and Page [2012] Sergey Brin and Lawrence Page. Reprint of: The anatomy of a large-scale hypertextual web search engine. Computer networks, 56(18):3825–3833, 2012.
  • Wahid-Ul-Ashraf et al. [2019] Akanda Wahid-Ul-Ashraf, Marcin Budka, and Katarzyna Musial. How to predict social relationships—physics–inspired approach to link prediction. Physica A: Statistical Mechanics and its Applications, 2019.