Related work
Link prediction has recently become very popular for predicting future relationships between individuals in social networks. Consequently, a great variety of approaches has been proposed. In the past decade, psychologists, computer scientists, physicists and economists have made many efforts to solve the link prediction problem in social networks. According to Wang et al. (WangSurvey2014), there are two ways to predict links: similarity-based approaches and learning-based approaches. Similarity-based approaches calculate a similarity score for every pair of nodes, where a higher score means a higher probability that the corresponding nodes will be connected in the future.
Learning-based approaches treat the link prediction problem as a binary classification task (Hasan2006). Therefore, typical machine learning models can be employed to solve the problem. These include classifiers such as random forest (RandomForest), multilayer perceptron or support vector machine (SVM) (SVM), as well as probabilistic models. The learning-based approaches use non-connected pairs of nodes as instances, each with features describing the nodes and a class label. Pairs of nodes which have the potential to become connected are labeled as positive and the others as negative. Their feature set consists of similarity features from the similarity-based approaches and features derived from domain knowledge (e.g. textual information about members of social networks). Using a combination of both can remarkably improve link prediction performance. Scellato et al. (Scellato2011) considered social features, place features and global features in location-based social networks for link prediction based on a supervised learning framework.
Both types of approaches rely on various metrics, which use node information, network topology and social theory to calculate the similarity between a pair of nodes. The metrics fall into three categories: node-based, topology-based and social-theory-based metrics.
Node-based metrics use the attributes and actions of individuals to assess the similarity of node pairs. They are very useful in link prediction; however, the data is usually hard to obtain because of privacy issues.
Most metrics are based on topological information and are called topology-based metrics. They are the most commonly used for prediction because they are generic and domain independent. Topology-based metrics are further divided into the following subcategories: neighbour-based, path-based and random-walk-based metrics. Neighbour-based metrics assume that people tend to form new relationships with people who are closer to them. The best known are Common Neighbors (Newman (Newman2001)), Jaccard Coefficient (Salton & McGill (Salton1986)), Adamic Adar Coefficient (Adamic & Adar (Adamic2003)) and Preferential Attachment Index (Barabási et al. (Barabasi2002)). The first three share the idea that two nodes are more likely to be connected if they have many common neighbours. Preferential Attachment Index, on the other hand, assumes that nodes with higher degree have a higher probability of forming new edges.
Neighbour-based metrics capture the local neighbourhood but do not consider how nodes are reachable from one another. Path-based metrics incorporate this information by considering paths between nodes. They are better suited to small networks and do not scale to large ones. Examples of path-based metrics are Local Path (Lü et al. (Lu2009)) and the Katz metric (Katz (Katz1953)). The Local Path metric uses information about local paths of length two and three, giving more importance to paths of length two. The Katz metric calculates similarity by summing over all paths connecting the two nodes, giving higher weight to shorter paths.
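The two path-based scores can be illustrated on a small adjacency matrix. The following is a minimal sketch, not the cited authors' implementation: the damping values `epsilon` and `beta`, the truncation length `max_len` and all function names are assumptions made for illustration, and powers of the adjacency matrix count walks (which may revisit nodes) rather than simple paths.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def local_path(adj, epsilon=0.01):
    """Local Path score matrix A^2 + epsilon * A^3 (epsilon is an assumed value)."""
    a2 = mat_mul(adj, adj)
    a3 = mat_mul(a2, adj)
    n = len(adj)
    return [[a2[i][j] + epsilon * a3[i][j] for j in range(n)] for i in range(n)]


def katz_truncated(adj, beta=0.05, max_len=5):
    """Katz score truncated at max_len: sum over l = 1..max_len of beta^l * A^l."""
    n = len(adj)
    score = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # current power of A, starting with A^1
    for length in range(1, max_len + 1):
        weight = beta ** length  # shorter walks receive a larger weight
        for i in range(n):
            for j in range(n):
                score[i][j] += weight * power[i][j]
        power = mat_mul(power, adj)
    return score


# Toy path graph 0 - 1 - 2 - 3
ADJ = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
LP = local_path(ADJ)
KATZ = katz_truncated(ADJ)
```

On this toy graph, nodes 0 and 2 (two steps apart) receive a higher score than nodes 0 and 3 (three steps apart) under both measures. The repeated dense matrix products also make the scalability limitation concrete.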
Social interactions between members of social networks can also be modeled by a random walk, which uses transition probabilities from a node to its neighbours to determine the next destination of a random walker. Examples of random-walk-based metrics are Hitting Time and SimRank (Jeh and Widom (Jeh2002)). The Hitting Time metric calculates similarity based on the expected number of steps required for a random walk starting at one node to reach the other node. The SimRank metric computes similarity under the assumption that two nodes are alike if they are connected to structurally similar nodes.
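The recursive idea behind SimRank can be sketched as a fixed-point iteration. This is a simplified illustration for undirected graphs, where a node's neighbours play the role of its in-neighbours. The decay factor `decay=0.8`, the iteration count and the function name are assumptions, not values taken from the cited work.

```python
def simrank(neighbors, decay=0.8, iterations=10):
    """Iterate s(a, b) = decay / (|N(a)| |N(b)|) * sum of s(i, j) over neighbour pairs."""
    nodes = list(neighbors)
    # Start from the identity: every node is only similar to itself.
    sim = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        new = {a: {} for a in nodes}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new[a][b] = 1.0
                elif not neighbors[a] or not neighbors[b]:
                    new[a][b] = 0.0  # isolated nodes are similar to nothing
                else:
                    total = sum(sim[i][j]
                                for i in neighbors[a] for j in neighbors[b])
                    new[a][b] = decay * total / (len(neighbors[a]) * len(neighbors[b]))
        sim = new
    return sim


# Toy graph: "a" and "b" are both connected only to "c".
GRAPH = {"a": ["c"], "b": ["c"], "c": ["a", "b"]}
SIM = simrank(GRAPH)
```

Here `a` and `b` are not linked to each other, yet they obtain a high similarity because their only neighbour is the same node.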
Social-theory-based metrics take advantage of classical social theories, such as community, triadic closure, strong and weak ties and homophily, and improve performance by capturing additional social interaction information. Liu et al. (Liu2013) proposed a link prediction model based on weak ties and on the degree, closeness and betweenness centralities of nodes.
When designing a feature set, the choice of features tremendously influences the performance of link prediction. Sometimes it is hard to find appropriate features, hence it is desirable that an algorithm learns important features on its own. Network embedding methods aim at learning a low-dimensional latent representation of the nodes in a network. Embeddings should follow the principle that similar nodes in the network have similar embedding representations. The main advantage of node embedding as a technique is that it does not require feature engineering by domain experts. Network embedding methods can be broadly categorized into four classes: methods based on random walks, matrix factorization, neural networks, and probabilistic approaches. For the purpose of this paper, methods based on random walks are the most relevant.
Methods based on random walks determine similarities using random walks on the original network. The Skip-Gram model, described in Mikolov et al. (Mikolov2013), is then usually used to generate node embeddings from the random walks. Examples of such methods are DeepWalk (Perozzi et al. (Perozzi2014)) and node2vec (Grover & Leskovec (Grover2016)). DeepWalk was the first network embedding technique inspired by deep learning. It uses random walks with fixed transition probabilities to measure node similarity, while embeddings are derived using the Skip-Gram model. Node2vec is a generalization of DeepWalk which uses biased random walks for node neighbourhood exploration. The random walk is controlled by a return parameter p and an in-out parameter q. The Skip-Gram model is then used in a similar way for embedding generation, but this time approximated via negative sampling.

Evaluation of the methods plays a crucial role in machine learning tasks in general. Several evaluation criteria exist for estimating the performance of link prediction approaches. While some papers utilize for a range of values (Zhang et al. (Zhang2018)), others use AUROC (Grover & Leskovec (Grover2016)).
Data
We study the Facebook social network of friendships at one hundred American colleges and universities at a single moment in time in September 2005 (facebook100, facebook5). Besides the information about friendships, the network also contains limited demographic information. The following information is available for each user: student/faculty status flag, gender, major, second major/minor (if applicable), dorm/house, year and high school. The network is unweighted and undirected. The whole network consists of million nodes with million links between them (facebook100statistics). Maximum degree of a single node is approximately and minimum degree is only , with an average of . According to the statistics, the network appears to be disassortative, but this is only a consequence of its size. It also has high average clustering coefficient , which is characteristic of social networks.

Dataset  n  m  d  C 

Train  4943.5  206247.6  77.26787  0.2808 
Unseen  3517.8  140793.2  80.3072  0.2689 
Table 1 contains structural measurements for the network sets used. All presented measurements are averages over all networks in the corresponding set. The presented values are: average clustering coefficient (C), average degree (d), average number of nodes (n) and average number of edges (m). The numbers of nodes and edges show that the train set is larger than the unseen set. The clustering coefficient and average degree are high, which is an expected characteristic of social networks.
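For reference, the per-network quantities averaged in Table 1 can be computed from an adjacency structure as sketched below. This is an illustrative sketch under the assumption that a network is stored as a dictionary mapping each node to the set of its neighbours; the function name is hypothetical, and nodes with fewer than two neighbours are given a clustering coefficient of zero.

```python
def structural_stats(neighbors):
    """Return (n, m, average degree, average clustering coefficient)."""
    n = len(neighbors)
    # Every undirected edge appears in two neighbour sets.
    m = sum(len(adj) for adj in neighbors.values()) // 2
    avg_degree = 2 * m / n
    clustering = []
    for adj in neighbors.values():
        k = len(adj)
        if k < 2:
            clustering.append(0.0)  # no neighbour pair can form a triangle
            continue
        adj_list = list(adj)
        # Count edges among the neighbours of the current node.
        links = sum(1 for i in range(k) for j in range(i + 1, k)
                    if adj_list[j] in neighbors[adj_list[i]])
        clustering.append(2 * links / (k * (k - 1)))
    return n, m, avg_degree, sum(clustering) / n


# Toy network: a triangle 0-1-2 with a pendant node 3 attached to node 2.
TOY = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
N, M, AVG_DEGREE, AVG_CLUSTERING = structural_stats(TOY)
```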
The degree distribution of the networks used is presented in figures 1 and 2. It is shown on a log-log scale for all networks in each of the two sets. For a cleaner overview, we used interpolation (univariate spline) to show the distribution of all networks. Both figures show that all networks follow a power law, which is expected for social networks. Given that these are real-life social networks, the degree distributions suggest that they are scale-free. However, it is interesting to point out the existence of big hubs, nodes with very high degree. They are visible on the right side of the distribution plot. This is one of the reasons why the interpolation has an unexpected minimum at the end of the plot.
Feature set
Feature engineering probably plays the most important role when tackling a machine learning problem. Informative features crucially affect model accuracy, hence the process of feature engineering is usually very time consuming. In learning-based link prediction each pair of nodes is described using a combination of node-based, topology-based and node embedding features, depending on the approach. In this paper we use three different datasets.
Node-based features
Node-based features use domain-specific information about individuals. The Facebook100 dataset already has a basic set of features; however, not all of them are useful for the link prediction task. Almost all features had to be transformed in order to describe node pairs instead of individuals. For some features, for example dormitory information, new features had to be created, because otherwise the model would not be transferable between networks. The problem arises from the fact that different universities number their dormitories differently. Considering the above constraints, we derived the following features:

is_dorm: binary value, indicating whether the nodes live in the same dormitory

is_year: binary value, indicating if the nodes started college in the same year

year_diff: numerical value, stating the absolute difference between the years when the nodes started college

from_high_school, to_high_school: numerical values, stating indices of nodes’ high schools

from_major, to_major: numerical values, stating indices of nodes’ majors

is_faculty: binary value, indicating whether the nodes have the same faculty status

is_gender: binary value, indicating if the nodes have the same gender
Since the networks are undirected, each pair of nodes must be uniquely represented using the above features. The representation should not depend on the order of the pair, thus from_major and to_major are ordered in such a way that the value of from_major is not greater than the value of to_major. The same holds for from_high_school and to_high_school.
Like the majority of datasets, Facebook100 does not contain all information about all individuals. Therefore, missing values had to be handled. We decided that imputation is reasonable only for the year attribute, where missing values were substituted with the mean. Values of other attributes were left intact, but as soon as one of the nodes in the pair had a missing value, the corresponding binary value was set to zero.
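The transformation from individual attributes to order-independent pair features described above can be sketched as follows. This is a minimal illustration, not the authors' code: the attribute keys (`dorm`, `year`, `major`, `high_school`, `status`, `gender`), the use of `0` as the code for a missing value and the function names are assumptions.

```python
MISSING = 0  # assumption: a missing attribute value is encoded as 0


def impute_years(users):
    """Replace missing years with the mean of the known years."""
    known = [u["year"] for u in users if u["year"] != MISSING]
    mean_year = sum(known) / len(known)
    return [dict(u, year=u["year"] if u["year"] != MISSING else mean_year)
            for u in users]


def pair_features(u, v):
    """Describe an unordered node pair; the result is identical for (u, v) and (v, u)."""
    def same(key):
        a, b = u[key], v[key]
        # A missing value on either side forces the binary feature to zero.
        return int(a != MISSING and b != MISSING and a == b)

    lo_major, hi_major = sorted((u["major"], v["major"]))
    lo_hs, hi_hs = sorted((u["high_school"], v["high_school"]))
    return {
        "is_dorm": same("dorm"),
        "is_year": same("year"),
        "year_diff": abs(u["year"] - v["year"]),
        "from_high_school": lo_hs,
        "to_high_school": hi_hs,
        "from_major": lo_major,
        "to_major": hi_major,
        "is_faculty": same("status"),
        "is_gender": same("gender"),
    }
```

Sorting the two major indices (and the two high school indices) is what guarantees that from_major never exceeds to_major, so each undirected pair has exactly one representation.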
Topology-based features
The most commonly used features for link prediction are topology-based features. They are particularly useful when no problem-specific information is available, because they are generic and domain independent. Although the Facebook100 dataset has additional domain-specific data, topology-based features still have a great impact on model accuracy. In this paper we use the following topology-based features:

Jaccard Coefficient (Salton1986). Jaccard Coefficient normalizes the number of common neighbours. A pair of nodes is assigned a higher value when the nodes share a higher proportion of common neighbours relative to the total number of their neighbours:
$J(x, y) = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$,
where $\Gamma(x)$ is the set of neighbours of node $x$.

Adamic Adar Coefficient (Adamic2003). The Adamic Adar Coefficient is closely related to the Jaccard Coefficient. It is calculated as a weighted sum over common neighbours, where common neighbours with fewer neighbours have a greater impact. The rationale behind it is that high-degree nodes are more likely to occur in a common neighbourhood, thus they should contribute less than low-degree nodes:
$AA(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log |\Gamma(z)|}$.

Preferential Attachment Index (Barabasi2002). The measure is based on the concept that nodes with higher degree have a higher probability of forming new edges:
$PA(x, y) = |\Gamma(x)| \cdot |\Gamma(y)|$.

Resource Allocation Index (Zhou2009). The Resource Allocation Index is very similar to the Adamic Adar Coefficient. The only difference is that it penalizes high-degree common neighbours more heavily, dividing by their degree instead of the logarithm of their degree:
$RA(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{|\Gamma(z)|}$.
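The four indices above are simple functions of the neighbour sets of the two nodes. The following sketch shows one way to compute them, assuming the graph is stored as a dictionary that maps each node to the set of its neighbours; the function names are chosen for illustration only.

```python
import math


def common(neighbors, x, y):
    """Common neighbours of x and y."""
    return neighbors[x] & neighbors[y]


def jaccard(neighbors, x, y):
    """Share of common neighbours among all neighbours of x and y."""
    union = neighbors[x] | neighbors[y]
    return len(common(neighbors, x, y)) / len(union) if union else 0.0


def adamic_adar(neighbors, x, y):
    """Common neighbours weighted by 1 / log(degree).

    A common neighbour of two distinct nodes has degree >= 2, so log > 0.
    """
    return sum(1 / math.log(len(neighbors[z])) for z in common(neighbors, x, y))


def preferential_attachment(neighbors, x, y):
    """Product of the two degrees."""
    return len(neighbors[x]) * len(neighbors[y])


def resource_allocation(neighbors, x, y):
    """Common neighbours weighted by 1 / degree."""
    return sum(1 / len(neighbors[z]) for z in common(neighbors, x, y))


# Toy graph with edges a-c, a-d, b-c, b-d, b-e, d-e.
G = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
SCORES = (jaccard(G, "a", "b"), adamic_adar(G, "a", "b"),
          preferential_attachment(G, "a", "b"), resource_allocation(G, "a", "b"))
```

For the pair (a, b) the common neighbours are c (degree 2) and d (degree 3), so the higher-degree node d contributes less to both the Adamic Adar and the Resource Allocation score.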
Node embedding features
Network embedding methods aim to learn a low-dimensional latent representation of the nodes in a network. By generating such a representation for every node in a network, we are able to use these representations as features. They can be used for a wide variety of tasks such as classification, clustering, link prediction and visualization. We generated our embedding dataset using node2vec (Grover2016).
The key point is that node2vec is based on random walks performed in a biased manner across the network. With this generic approach we are able to sample any network in search of a vector representation of its structural properties. The search bias lets us steer the walk towards BFS-like or DFS-like behaviour. If we choose an in-out parameter q < 1, walks are more biased towards visiting nodes further from the start node, which encourages exploration. Setting the return parameter p to a high value makes it less likely that the walk immediately revisits a node, which in turn adopts a strategy of moderate exploration (it avoids 2-hop redundancy in sampling).
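The effect of the two parameters can be seen in a small sketch of the biased walk on an unweighted graph. This is an illustrative simplification rather than the reference node2vec implementation: the function names are hypothetical, and the values `p=4` and `q=0.5` in the example are assumptions chosen only to show a high return parameter with q < 1.

```python
import random


def step_weights(neighbors, prev, cur, p, q):
    """Unnormalized weights for moving from cur, given the walk came from prev."""
    weights = {}
    for nxt in neighbors[cur]:
        if nxt == prev:
            weights[nxt] = 1.0 / p  # returning to the previous node
        elif nxt in neighbors[prev]:
            weights[nxt] = 1.0      # staying at distance 1 from prev (BFS-like)
        else:
            weights[nxt] = 1.0 / q  # moving further away from prev (DFS-like)
    return weights


def biased_walk(neighbors, start, length, p, q, rng):
    """Sample one walk with `length` nodes, starting at `start`."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        candidates = sorted(neighbors[cur])
        if not candidates:
            break  # dead end
        if len(walk) == 1:
            walk.append(rng.choice(candidates))  # first step is uniform
            continue
        w = step_weights(neighbors, walk[-2], cur, p, q)
        walk.append(rng.choices(candidates, weights=[w[c] for c in candidates])[0])
    return walk


TOY = {"t": {"v", "a"}, "v": {"t", "a", "x"}, "a": {"t", "v"}, "x": {"v"}}
WALK = biased_walk(TOY, "t", 20, p=4, q=0.5, rng=random.Random(0))
```

With these example values, a walker that arrived at v from t is eight times more likely to continue to the distant node x (weight 2) than to step back to t (weight 0.25).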
As stated in the case study by Grover & Leskovec (Grover2016), for social structures it is beneficial to tune node2vec hyperparameters to discover communities of nodes which interact with each other. Capturing this type of behaviour in the embedding representation is highly beneficial for the link prediction task. The number of dimensions was determined heuristically. Since we are sampling different networks, the vector dimension should be as small as possible while still carrying all relevant information. Following this reasoning, initial hyperparameters for exploration with biased random walks were chosen by hand and served as the starting point for a grid search. The final parameters were: 64 dimensions, 50 walks per node, and 20 nodes in each walk. Since the base node2vec approach yields embeddings for nodes, we used the Hadamard product to express vector representations for edges.
Datasets
Because the Facebook100 dataset is enormous, limited computing power prevented us from analysing the complete dataset. Therefore, we decided to perform the analysis only on a subset of networks. We selected ten networks as normal data and five networks as unseen data. Normal data was used for training and testing, whereas unseen data was used to evaluate whether our models are transferable to new data.
Firstly, we had to preprocess the graphs and obtain train and test node pairs. For every graph we used the standard approach of generating an incomplete train graph from the original graph. The connected node pairs which are present in the original graph but not included in the train graph are used as positive instances for the link prediction task. Positive instances were randomly sampled from the original graph's edge set. We decided to sample 2% of the edges in the original graph. Since the dataset should contain positive as well as negative instances, we also had to obtain negative instances, that is, pairs of nodes that are not connected by an edge. These were obtained by randomly selecting node pairs which are not in the original graph's edge set. To get a balanced dataset, the number of negative instances is the same as the number of positive ones.
In our experiments, all unseen data instances were used for testing the models' ability to adapt to new graphs. Normal data, however, was further split into train and test data. We used a standard split: 80% was used as the train data and the remaining 20% as the test data, both containing approximately the same number of positive and negative instances.
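The sampling procedure described above can be sketched as follows. The function names and the toy ring graph are assumptions made for illustration; the 2% sampling fraction and the 80/20 split follow the text. The split below is a plain shuffle, so the class balance in each part is only approximate.

```python
import random


def sample_instances(nodes, edges, fraction, rng):
    """Hide a fraction of edges as positives and draw as many non-edges as negatives."""
    edge_set = {tuple(sorted(e)) for e in edges}
    ordered = sorted(edge_set)
    n_pos = max(1, round(fraction * len(ordered)))
    positives = rng.sample(ordered, n_pos)
    train_edges = edge_set - set(positives)  # incomplete train graph
    node_list = sorted(nodes)
    negatives = set()
    while len(negatives) < n_pos:
        u, v = rng.sample(node_list, 2)
        pair = tuple(sorted((u, v)))
        if pair not in edge_set:  # must not be an edge of the original graph
            negatives.add(pair)
    return train_edges, positives, sorted(negatives)


def split(instances, labels, train_share, rng):
    """Shuffle labelled instances and split them into train and test parts."""
    idx = list(range(len(instances)))
    rng.shuffle(idx)
    cut = int(train_share * len(idx))
    train = [(instances[i], labels[i]) for i in idx[:cut]]
    test = [(instances[i], labels[i]) for i in idx[cut:]]
    return train, test


# Toy ring graph with 500 nodes and 500 edges.
RNG = random.Random(42)
NODES = list(range(500))
EDGES = [(i, (i + 1) % 500) for i in NODES]
TRAIN_EDGES, POS, NEG = sample_instances(NODES, EDGES, 0.02, RNG)
TRAIN, TEST = split(POS + NEG, [1] * len(POS) + [0] * len(NEG), 0.8, RNG)
```

Topology-based features for the sampled pairs would then be computed on the train graph only, so that the hidden positive edges do not leak into the features.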
Using this data, three datasets were created: the baseline, topological and embedding datasets. Each dataset represents node pairs using a different combination of features. The baseline dataset is the simplest one and contains only topology-based features. The topological dataset is slightly more complex: in addition to the topology-based features it also makes use of node-based features. Node pairs in the embedding dataset are described using node-based features and the Hadamard product of the corresponding nodes' embeddings.
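The edge representation used in the embedding dataset can be illustrated in a few lines. This is a sketch with hypothetical function names and made-up three-dimensional vectors; in the paper the embeddings have 64 dimensions.

```python
def hadamard(u_vec, v_vec):
    """Element-wise product of two node embeddings; symmetric in its arguments."""
    return [a * b for a, b in zip(u_vec, v_vec)]


def embedding_instance(embeddings, u, v, node_features):
    """Concatenate node-based pair features with the Hadamard edge embedding."""
    return list(node_features) + hadamard(embeddings[u], embeddings[v])


# Made-up 3-dimensional embeddings for two nodes.
EMB = {"u": [1.0, 2.0, -3.0], "v": [0.5, 4.0, 2.0]}
ROW = embedding_instance(EMB, "u", "v", node_features=[1, 0])
```

Because the element-wise product is commutative, the resulting vector does not depend on the order of the two nodes, which suits undirected networks.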
Feature selection
Contemporary datasets usually contain an abundance of data, which is not always relevant to the problem. Hence, datasets should be preprocessed before models are applied to them. Preprocessing mainly serves to reduce the size of the dataset and make the analysis more efficient, as well as to remove redundant features, which have a negative impact on the performance of the model. The aim of feature selection is to maximize the relevance and minimize the redundancy of the features.
Our feature sets are not enormous, thus feature selection was done solely to improve the performance of the models. We use recursive feature elimination with cross-validated selection (RFECV) in combination with a linear-kernel support vector machine (SVM) to obtain reduced feature sets. This method recursively considers smaller and smaller sets of features, pruning the least important features after every iteration according to the chosen model. It belongs to the wrapper methods for feature selection, since it appraises subsets of features based on the performance of the modelling algorithm. According to Jović et al. (Jovic2015), wrapper methods have been empirically shown to yield better results than other methods because subsets are evaluated using real models.
Baseline dataset
The above feature selection method recognized Adamic Adar Coefficient, Jaccard Coefficient and Resource Allocation Index as the most informative features. The most relevant feature is Adamic Adar Coefficient and the least relevant one is Preferential Attachment Index. This is fully consistent with the random forest feature importance shown in figure 4. Adamic Adar Coefficient is the most relevant feature, although Jaccard Coefficient has a higher correlation with the labels. All selected features are highly correlated with the label, whereas Preferential Attachment Index is not. This is probably the reason why Preferential Attachment Index is the only feature which was not selected. The correlation matrix also shows that Adamic Adar Coefficient and Resource Allocation Index are almost perfectly correlated, which is expected because of the similarity of their definitions. Nonetheless, keeping both results in slightly better performance, so the algorithm retains Resource Allocation Index.
Topological dataset
The advantages of feature selection are more evident on the topological dataset, because it has more features. This time the algorithm selected the following features: all four topology-based features, is_year, is_faculty, is_dorm, from_major and to_major. Figure 5 clearly shows that topology-based features are far more important than the node-based features. This is also consistent with the correlation matrix, since topology-based features have the highest correlations with the labels. They are so informative because of a phenomenon called triadic closure. Triadic closure states that in social networks connections tend to form between people who share common friends, which is precisely what these topology-based features describe. Among the node-based features, is_year, is_faculty and is_dorm were selected, all having a relatively high correlation with the label. The correlation is particularly high for is_year, which is expected, as college students often form friendships with their classmates. For the same reason from_major and to_major are also relevant. The feature is_faculty exploits the fact that the social circles of students and professors rarely overlap.
Embedding dataset
Feature selection on the embedding dataset was especially hard because of the artificial features from node2vec. Because the hyperparameters of node2vec were carefully tuned, we assumed that the node embeddings are optimal, hence we filtered only the node-based features (Attributed). As a result, we use only a few crucial ones: is_year and is_dorm. We did not select is_faculty, although it is more important than is_dorm when considered on its own. We decided so because is_faculty is highly correlated with is_year, and correlated features usually have a negative impact on the performance of a model.
Results
Evaluation of our datasets was conducted using a set of classification models. We used simpler models like logistic regression and random forest, as well as more complex ones: support vector machines (SVM) and neural networks (NN). The latter are capable of modeling more complex nonlinear functions, whilst logistic regression can model only linear ones. The link prediction task was tested on all three datasets, on test and unseen data, with all aforementioned models. Performance of the models was evaluated using the Area Under the Receiver Operating Characteristic curve (AUROC), which is one of the most common evaluation metrics for link prediction.
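AUROC can be read as the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair. The sketch below computes it directly from this definition. The function name is hypothetical, and the pairwise loop is quadratic in the number of instances, so it is meant for illustration rather than for large datasets.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ranked correctly.
EXAMPLE = auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of linked and non-linked pairs.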
Dataset  Logistic regression  Random forest  SVM  NN 

Baseline  0.9401  0.9227  0.9628  0.9618 
Topological  0.9570  0.9173  0.9639  0.9623 
Embedding  0.9365  0.9145  0.9414  0.9389 
Dataset  Logistic regression  Random forest  SVM  NN 

Baseline  0.9263  0.9031  0.9570  0.9560 
Topological  0.9478  0.8901  0.9563  0.9538 
Embedding  0.9229  0.9047  0.9217  0.9218 
Table 2 contains AUROC scores for all combinations of datasets and models on the test data. Similarly, table 3 states the same values on unseen data. These tables reveal that support vector machine (SVM) and neural network (NN) are the best models for the link prediction task. Their performance is almost exactly the same, although they are based on completely different concepts. This indicates that all relevant information from the datasets is used for prediction. Logistic regression performed only slightly worse, which is very surprising, since it is much simpler than SVM and NN. Even more unexpected is that it outperformed random forest, which is a nonlinear model. This is a consequence of the linear separability of the data, which is examined further in the discussion section. All models appear to be stable, since there is only a slight decrease in performance when they are applied to unseen data. The difference is negligible for the baseline and topological datasets, whereas it is noteworthy on the embedding dataset.
The models were able to extract more useful information from the topological dataset than from the baseline and embedding ones. The baseline dataset has only slightly worse results, showing that the additional node-based features have minimal influence on performance. The difference is visible for logistic regression, whereas SVM and NN have nearly the same score on both datasets. Surprisingly, the embedding dataset gets the worst results. However, this might be a consequence of the chosen evaluation metric. The embedding dataset gets worse AUROC scores than the other two datasets, but better scores. For example, logistic regression on the embedding dataset gets score, while the baseline and topological datasets get only and .
Model analysis
Here we used the previously obtained data and results to optimize our models. Prior to that, the data was standardized to have unit variance and zero mean. With this approach, we carried out a model analysis to interpret the best combinations of hyperparameters, which helps to understand and discover patterns in the data.
Logistic regression
Grid search with cross-validation on logistic regression showed that the different approaches favour different configurations. In the case of the baseline approach, the features have an equal impact on the decision process, which is reflected in the features' coefficients. For this dataset we used ridge (L2) regularization with the default regularization strength.
In the case of the topological dataset, regularization is a significant part of the process. We used lasso (L1) regularization with a very large regularization strength. In this case regularization is crucial for preventing overfitting. Because our features are measurements that are not confined to a fixed interval, we benefit from the property of sparsity. The model coefficients are imbalanced: the major features (from_major and to_major) are given low coefficient values, which shows that most of the information is contained in the remaining features.
In the embedding dataset we noticed that adding node-based data does not bring any benefit. Since the embedding vectors form a feature set, we can argue that correlations inside these vectors carry structural information which is used in model learning. In this case lasso regularization is used as well, and it improves model performance on unseen networks. This is achieved by generalizing the knowledge obtained from the training networks to new, unseen networks. As expected, the coefficients show that all components of the embedding vectors are equally important.
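For reference, the two penalties discussed above can be written in a common form. The notation is an assumption introduced here for illustration: $w$ and $b$ are the model coefficients and intercept, $x_i$ the feature vector of node pair $i$, $y_i \in \{-1, +1\}$ its label, $N$ the number of instances and $\lambda \ge 0$ the regularization strength.

```latex
\min_{w, b} \; \frac{1}{N} \sum_{i=1}^{N} \log\bigl(1 + e^{-y_i (w^\top x_i + b)}\bigr) + \lambda \, R(w),
\qquad
R_{\text{ridge}}(w) = \lVert w \rVert_2^2 = \sum_j w_j^2,
\qquad
R_{\text{lasso}}(w) = \lVert w \rVert_1 = \sum_j \lvert w_j \rvert
```

The ridge penalty shrinks all coefficients smoothly, whereas the lasso penalty can drive individual coefficients to exactly zero, which is in line with the low coefficient values observed for from_major and to_major under strong lasso regularization.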
Random forest
Random forest did not perform well on our datasets. We attribute this to the fact that the approach lacks a mechanism for regularization. A high number of dimensions relative to the number of samples (unbalanced training and unseen data) causes the decision trees to overfit. Grid search did not yield a clearly preferable configuration, and parameter tuning failed to uncover feature-dependent information. This behaviour is reflected in our model comparison, where linear models such as logistic regression and SVM achieve better scores. On unseen networks, the AUROC scores of random forest are the lowest of all models on every dataset, so we conclude that the random forest model did not respond well to our problem.
Support vector machine (SVM)
For the support vector machine (SVM) only the kernel was carefully tuned. The best fit for each dataset was chosen using grid search. The grid consisted of linear, polynomial and Gaussian kernels, so that the model could work with data of arbitrary dimensionality. It turned out that for the baseline and topological datasets the linear kernel was the best option, while the embedding dataset required the Gaussian kernel. The reason is that the baseline and topological datasets are linearly separable, as stated in the discussion section, in contrast to the embedding dataset, which uses node2vec and therefore nonlinear node embeddings.
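The three kernel families in the grid differ only in how they measure the similarity of two feature vectors. The sketch below states the standard definitions; the function names and the default values of `degree`, `gamma` and `coef0` are assumptions for illustration and are not the values used in the experiments.

```python
import math


def linear_kernel(x, y):
    """Plain dot product: the decision boundary is a hyperplane in feature space."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * <x, y> + coef0) ** degree."""
    return (gamma * linear_kernel(x, y) + coef0) ** degree


def gaussian_kernel(x, y, gamma=0.5):
    """exp(-gamma * ||x - y||^2); equals 1 for identical points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

A linear kernel winning the grid search is consistent with data that is close to linearly separable, while the Gaussian kernel can bend the boundary around nonlinear structure.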
Neural network (NN)
Choosing the right hyperparameters for neural networks (NN) was a very complicated and tedious task, since neural networks have a lot of different parameters. Nevertheless, setting them correctly can yield much better performance in comparison to other models. For some of the parameters, such as the loss and optimization functions, standard settings were selected. Because learning-based link prediction is a binary classification task, the binary cross-entropy loss function and the Adam optimizer were used. Hidden and output activation functions were selected by random experimentation. The best results were obtained with ReLU as the hidden activation function and sigmoid as the output activation function. Lastly, the architecture of the neural network had to be defined, which was done using grid search. We tried a great variety of depths and numbers of nodes per layer, but in the end architectures with only two hidden layers and a small number of nodes performed best. Deeper architectures did not work well on the topological and baseline datasets, because the data is linearly separable, as explained in the discussion. However, it is interesting that even on the embedding dataset, which is nonlinear, architectures with a smaller number of layers worked better. This indicates that there must be some kind of linearity even in the embedding dataset.
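The selected architecture (two ReLU hidden layers, a sigmoid output and binary cross-entropy loss) can be sketched as a plain forward pass. The layer sizes, weights and function names below are made up for illustration; they are not the trained parameters, and training with the Adam optimizer is omitted.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one weight row per output unit."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(x, params):
    """Two ReLU hidden layers followed by a single sigmoid output unit."""
    h1 = dense(x, params["W1"], params["b1"], relu)
    h2 = dense(h1, params["W2"], params["b2"], relu)
    return dense(h2, params["W3"], params["b3"], sigmoid)[0]


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Loss for one instance; probabilities are clipped to avoid log(0)."""
    y_prob = min(max(y_prob, eps), 1 - eps)
    return -(y_true * math.log(y_prob) + (1 - y_true) * math.log(1 - y_prob))


# Made-up parameters for a 2-2-2-1 network.
PARAMS = {
    "W1": [[1.0, -1.0], [0.5, 0.5]], "b1": [0.0, 0.0],
    "W2": [[1.0, 1.0], [-1.0, 1.0]], "b2": [0.0, 0.0],
    "W3": [[1.0, -1.0]], "b3": [0.0],
}
PROB = forward([2.0, 1.0], PARAMS)
LOSS = binary_cross_entropy(1, PROB)
```

The output is the predicted probability that the node pair is linked, and the loss penalizes confident wrong predictions most strongly.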
Discussion
The embedding dataset yielded worse results than the topological and baseline ones because the latter are linearly separable. It is much easier to train models on linearly separable data than on complex data with a lot of nonlinearity, such as node2vec embeddings. However, the embedding dataset is more stable according to score as mentioned in the results.
Random forest is the worst on baseline and topological data, because it is harder for it to adjust to linear data. The other models are capable of that, and logistic regression is a linear model by definition. In support of this, SVM works best with the linear kernel. The neural network works better if it has only two hidden layers, which indicates that the data must contain some sort of linearity. To verify this we can use linear discriminant analysis (LDA) to visualize our data. If any linearity exists in the dataset, it should be visible in the reduced-dimension space. With this representation we can see that the topological data is more linear (figure 6). In comparison, the embedding data shows higher nonlinearity (figure 7). With that in mind, slightly expressed nonlinearity can be suppressed with regularization, a tradeoff that results in increased stability of the prediction model.
Conclusion
We conclude that the results are unexpectedly good for link prediction tasks of this nature. Given that the predictions are learned from separate social networks, we can say that our models succeeded in their task. The models successfully generalized to unseen data, based only on a training set that is 50% larger.
For optimizing the AUROC score, the baseline and topological approaches are the best. It turns out that simplicity has benefits in terms of high classification scores. In these two cases node-based features did not really affect performance, except for logistic regression, where the binary data was used in a way that increased linearity. Although SVM and NN got better results, we recommend logistic regression in combination with the topological approach because the model is easier to train and interpret. When very high AUROC scores are important (e.g. link prediction on medical data), we suggest SVM with a linear kernel and the baseline approach. It gets almost the same results on unseen data, even though it is a simpler model than NN.
We have shown that collecting data from multiple social networks yields promising datasets, which can be used for modelling various predictors in similar social structures. Besides that, we have shown that regularization can be a solution for social networks when training data is scarce. Using this approach we can obtain data insights globally.