I Introduction
Networks are often used to describe complex systems in various areas, such as social science [1, 2], biology [3], electric systems [4] and economics [5]. The vast majority of real-world systems evolve with time and can be modeled as dynamic networks [6, 7], in which nodes may come and go and links may vanish and recover as time goes by. Links, representing the interactions between different entities, are of particular significance in the analysis of dynamic networks.
Link prediction in a dynamic network [8, 9] tries to predict the future structure of the network based on historical data, which helps us better understand network evolution and, further, the relationships between topologies and functions. For instance, in online social networks [10, 11, 12], we can predict which links are going to be established in the near future. This means that we can infer with what kind of people, or even with which particular person, the target user will probably make friends, based on their historical behaviors. It can also be applied to studies of disease contagion [13], protein-protein interactions [14] and many other fields where evolution matters.
Similarity indices, like Common Neighbor (CN) [15] and Resource Allocation Index (RA) [16], are widely used in link prediction of static networks [17], but they can hardly deal with changes of the network structure directly. To learn temporal dependencies, Yao et al. [18] assigned time-varied weights to previous graphs and then executed the link prediction task using a refined CN that considers the neighbors within two hops. Similarly, Zhang et al. [19] proposed an improved RA-based dynamic network link prediction algorithm, which updates the similarity between pairwise nodes when the network structure changes. These methods, however, mostly depend on simple statistics of networks and thus cannot effectively deal with high nonlinearity.
To tackle this problem, a number of network embedding techniques were proposed to learn representations of networks that preserve high-order proximity. Random-walk-based methods, such as DeepWalk [20] and node2vec [21], sample sequences of nodes and obtain node vectors by applying skip-gram. Furthermore, with the development of deep learning [22, 23, 24], methods like Structural Deep Network Embedding (SDNE) [25] and Graph Convolutional Network (GCN) [26] can automatically learn node representations end to end. The embedding vectors ensure that nodes with similar structural properties stay close in the embedding space.
These embedding methods are powerful but still lack the ability to analyze the evolution of networks. To learn such temporal dependencies, some recent works take network evolution into consideration. Ahmed et al. [27] assigned damping weights to the snapshots, ensuring that more recent snapshots are more important, and combined them into a weighted graph on which to perform local random walk. As an extension of [27], Ahmed and Chen [28] proposed Time Series Random Walk (TSRW) to integrate temporal and global information. There are also some methods based on the Restricted Boltzmann Machine (RBM), which regard the evolution of a network as a special case of a Markov random field with two-layer variables. Conditional temporal RBM [29], namely ctRBM, considers not only neighboring connections but also temporal connections, and thus has the ability to predict future links. Zhou et al. [30] modeled network evolution as a triadic closure process, which however is limited to undirected networks. Following the idea of SDNE, Li et al. [31] used a Gated Recurrent Unit (GRU) [32] as the encoder to learn both spatial and temporal information. Most of these methods, however, are limited to predicting added links, which reflects only part of network evolution. Moreover, they have to obtain a representation of links and then train a binary classification model, which is less unified.
In this paper, we address the problem of predicting the global structure of networks in the near future, focusing on the links that are going to appear or disappear. We propose a novel end-to-end Encoder-LSTM-Decoder (E-LSTM-D) deep learning model for link prediction in dynamic networks, which takes advantage of the encoder-decoder architecture and a stacked Long Short-Term Memory (LSTM). The model can thus effectively handle the problems of high dimension, nonlinearity and sparsity. Thanks to the encoder-decoder architecture, the model can automatically learn representations of networks, as well as reconstruct a graph from the extracted information. Relatively low-dimensional representations of the sequences of graphs can be learned by the stacked LSTM module placed right behind the encoder. Considering that network sparsity may seriously affect the performance of the model, we amplify the effect of existing links in the training process, forcing the model to account for existing links more than missing/nonexistent ones. We conduct comprehensive experiments on five real-world datasets. The results show that our model significantly outperforms the current state-of-the-art methods. In particular, we make the following main contributions.

We propose a general end-to-end deep learning framework, namely E-LSTM-D, for link prediction in dynamic networks, where the encoder-decoder architecture automatically learns representations of networks and the stacked LSTM module enhances the ability to learn temporal features.

Our newly proposed E-LSTM-D model can handle long-term prediction tasks with only a slight drop in performance. It suits networks of different scales by fine-tuning the model structure, i.e., changing the number of units in different layers. Besides, it can predict the links that are going to appear or disappear, while most existing methods focus only on the former.

We define a new metric, Error Rate, to measure the performance of dynamic network link prediction. It is a good addition to the Area Under the ROC Curve (AUC), so that the evaluation is more comprehensive.

We conduct extensive experiments, comparing our E-LSTM-D model with five baseline methods on various metrics. The results show that our model outperforms the others and obtains state-of-the-art results.
The rest of the paper is organized as follows. In Section II, we provide a rigorous definition of dynamic network link prediction and a detailed description of our E-LSTM-D model. Comprehensive experiments are presented in Section III, with the results carefully discussed. Finally, we conclude the paper and outline some future work in Section IV.
II Methodology
In this section, we introduce our E-LSTM-D model, which is used to predict the evolution of dynamic networks.
II-A Problem Definition
A dynamic network is modeled as a sequence of snapshot graphs taken at a fixed interval.
Definition 1 (Dynamic Networks)
Given a sequence of graphs, $\{G_1, \ldots, G_T\}$, where $G_k = (V, E_k)$ denotes the $k$-th snapshot of a dynamic network, let $V$ be the set of all vertices and $E_k$ the set of temporal links within the fixed timespan $[t_{k-1}, t_k]$. The adjacency matrix of $G_k$ is denoted by $A_k$, with the element $a_{k;i,j} = 1$ if there is a directed link from $v_i$ to $v_j$ and $a_{k;i,j} = 0$ otherwise.
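Definition 1 can be illustrated with a short Python sketch that turns a timestamped edge list into a sequence of binary adjacency matrices. This is only an illustration under stated assumptions: the function name `build_snapshots`, its parameters, and the convention that node ids are integers starting at 0 are hypothetical and do not come from the paper.

```python
def build_snapshots(edges, num_nodes, t_start, interval, num_snapshots):
    """Return one binary adjacency matrix (list of lists) per fixed-length timespan.

    edges: iterable of (source, target, timestamp) with integer node ids.
    All names and conventions here are illustrative assumptions.
    """
    snapshots = [[[0] * num_nodes for _ in range(num_nodes)]
                 for _ in range(num_snapshots)]
    for src, dst, ts in edges:
        k = int((ts - t_start) // interval)  # index of the timespan containing ts
        if 0 <= k < num_snapshots:
            snapshots[k][src][dst] = 1       # directed link; repeated contacts collapse to 1
    return snapshots


# Toy example: 3 nodes, two timespans of length 10.
toy_edges = [(0, 1, 0), (1, 2, 3), (0, 1, 4), (2, 0, 12)]
A = build_snapshots(toy_edges, num_nodes=3, t_start=0, interval=10, num_snapshots=2)
```

Repeated contacts inside one timespan are collapsed to a single entry here, which matches a binary adjacency matrix but discards contact counts.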
In a static network, link prediction aims to find edges that actually exist according to the distribution of observed edges. Similarly, link prediction in a dynamic network makes full use of the information extracted from previous graphs to reveal the underlying patterns of network evolution, so as to predict the future status of the network. Since the adjacency matrix precisely describes the structure of a network, it is ideal to use it as the input and output of the prediction model. We could infer $A_t$ based on $A_{t-1}$ alone, due to the strong relationship between successive snapshots of the dynamic network. However, the information contained in $A_{t-1}$ may be too little for precise inference. In fact, not only the structure itself but also the change of structure over time matters in network evolution. Thus, we prefer to use a sequence of length $N$, i.e., $\{A_{t-N}, \ldots, A_{t-1}\}$, to predict $A_t$.
Definition 2 (Dynamic Network Link Prediction)
Given a sequence of graphs with length $N$, $S = \{G_{t-N}, \ldots, G_{t-1}\}$, Dynamic Network Link Prediction (DNLP) aims to learn a function that maps the input sequence $S$ to $G_t$.
The structure of a dynamic network evolves with time. As shown in Fig. 1, some links may emerge while others may vanish, which is reflected by the changes of the adjacency matrix over time. The goal is to find the links of the network that are most likely to appear or disappear in the next timespan. Mathematically, this can also be interpreted as an optimization problem of finding a matrix, whose elements are either 0 or 1, that best fits the ground truth.
II-B E-LSTM-D Framework
Here, we propose a novel deep learning model, namely E-LSTM-D, combining the encoder-decoder architecture and a stacked LSTM, with the overall framework shown in Fig. 2. Specifically, the encoder is placed at the entrance of the model to learn the highly nonlinear network structures, and the decoder converts the extracted features back to the original space. Such an encoder-decoder architecture is capable of dealing with spatial nonlinearity and sparsity, while the stacked LSTM between the encoder and decoder can learn temporal dependencies. The end-to-end model can thus learn both structural and temporal features and perform link prediction in a unified way.
We first introduce the terms and notations that will be used frequently later, all of which are listed in TABLE I. Other notations will be explained along with the corresponding equations. Notice that a single LSTM cell can be regarded as a layer, in which the terms with subscript f are the parameters of the forget gate, the terms with subscripts i and C are the parameters of the input gate, and those with subscript o are the parameters of the output gate.
II-B1 Encoder-decoder architecture
An autoencoder can efficiently learn representations of data in an unsupervised way. Inspired by this, we place an encoder at the entrance of the model to capture the highly nonlinear network structure and a graph reconstructor at the end to transform the latent features back into a matrix of fixed shape. Here, however, the whole process is supervised, unlike in an autoencoder, since we have labeled data ($A_t$) to guide the decoder to build matrices that better fit the target distributions. In particular, the encoder, composed of multiple nonlinear perceptrons, projects the high-dimensional graph data into a relatively lower-dimensional vector space. The obtained vectors can therefore characterize the local structure of vertices in the network. This process can be characterized as
$$y_{e,i}^{(1)} = \mathrm{ReLU}\big(W_e^{(1)} s_i + b_e^{(1)}\big), \qquad y_{e,i}^{(k)} = \mathrm{ReLU}\big(W_e^{(k)} y_{e,i}^{(k-1)} + b_e^{(k)}\big), \qquad Y_e^{(k)} = \big[y_{e,0}^{(k)}, \ldots, y_{e,N-1}^{(k)}\big] \tag{1}$$
where $s_i$ represents the $i$-th graph in the input sequence $S$. For an input sequence, each encoder layer processes every term separately and then concatenates all the activations in order of time. Here, we use ReLU as the activation function for each encoder/decoder layer to accelerate convergence.
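TABLE I: Terms and notations.

| Symbol | Definition |
| --- | --- |
| $K$ | number of encoder/decoder layers |
| $L$ | number of LSTM cells |
| $\hat{A}_t$ | output of the decoder |
| $H$ | output of the stacked LSTM |
| $Y_e^{(k)}$, $Y_d^{(k)}$ | output of the $k$-th encoder/decoder layer |
| $W_e^{(k)}$, $W_d^{(k)}$ | weight of the $k$-th encoder/decoder layer |
| $b_e^{(k)}$, $b_d^{(k)}$ | bias of the $k$-th encoder/decoder layer |
| $W_{f,i,C,o}^{(l)}$ | weight of the $l$-th LSTM layer |
| $b_{f,i,C,o}^{(l)}$ | bias of the $l$-th LSTM layer |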
The decoder, which mirrors the structure of the encoder, receives the latent features and maps them into the reconstruction space under the supervision of $A_t$, represented by
$$Y_d^{(1)} = \mathrm{ReLU}\big(W_d^{(1)} H + b_d^{(1)}\big), \qquad Y_d^{(k)} = \mathrm{ReLU}\big(W_d^{(k)} Y_d^{(k-1)} + b_d^{(k)}\big) \tag{2}$$
where $H$ is generated by the stacked LSTM and represents the features of the target snapshot rather than a sequence of features of all previous snapshots as used in the encoder. Another difference is that the last layer of the decoder, i.e., the output layer, uses sigmoid as the activation function rather than ReLU. The number of units of the output layer always equals the number of nodes.
II-B2 Stacked LSTM
Although the encoder-decoder architecture can deal with high nonlinearity, it is not able to capture time-varying characteristics. LSTM [33], as a special kind of recurrent neural network (RNN) [34, 35], can learn long-term dependencies and is introduced here to solve this problem. An LSTM consists of three gates, i.e., a forget gate, an input gate and an output gate. The first step is to decide what information is going to be thrown away from the previous cell state. This operation is performed by the forget gate, which is defined as
$$f_t = \sigma\big(W_f \cdot [h_{t-1}, x_t] + b_f\big) \tag{3}$$
where $h_{t-1}$ represents the output at time $t-1$ and $x_t$ the encoder features fed to the cell at time $t$. Then the input gate decides what new information should be added to the cell state. First, a sigmoid layer decides which part of the input, $i_t$, should be updated. Second, a tanh layer generates a vector of candidate state values, $\tilde{C}_t$, which could be added to the cell state. The combination of $i_t$ and $\tilde{C}_t$ represents the current memory that can be used for updating $C_t$. The operation is defined as
$$i_t = \sigma\big(W_i \cdot [h_{t-1}, x_t] + b_i\big), \qquad \tilde{C}_t = \tanh\big(W_C \cdot [h_{t-1}, x_t] + b_C\big), \qquad C_t = f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \tag{4}$$
Benefiting from the forget gate and the input gate, an LSTM cell can not only store long-term memory but also filter out useless information. The output of the LSTM cell is based on $C_t$ and is controlled by the output gate, which decides what information, $o_t$, should be exported. The process is described as
$$o_t = \sigma\big(W_o \cdot [h_{t-1}, x_t] + b_o\big), \qquad h_t = o_t \circ \tanh(C_t) \tag{5}$$
A single LSTM cell is capable of learning time dependencies, but a chain-like LSTM module, namely a stacked LSTM, is more suitable for processing time-sequence data. A stacked LSTM consists of multiple LSTM cells that take signals as input in order of time. We place the stacked LSTM between the encoder and the decoder to learn the patterns under which the network evolves. After receiving the features extracted at time $t$, the LSTM module turns them into $h_t$ and then feeds $h_t$ back to the model at the next training step. This helps the model make use of the remaining information of previous training data. Note that the numbers of units in the encoder, LSTM cells and decoder vary with the network size: the larger the network, the more units we need in the model.
The encoder at the entrance reduces the dimension of each graph and thus keeps the computation of the stacked LSTM at a reasonable cost. In turn, the stacked LSTM, which is good at dealing with temporal and sequential data, complements the encoder.
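The gate updates described in this subsection can be sketched in plain Python. The block below implements one step of a standard LSTM cell and is meant only as an illustration, not as the authors' implementation: the helper names `sigmoid`, `affine` and `lstm_step`, the `params` dictionary layout, and the use of nested lists instead of a tensor library are all assumptions made for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Return W v + b for a list-of-rows matrix W and a vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x, h_prev, c_prev, params):
    """One standard LSTM update; params maps a gate name to (W, b) acting on [h_prev, x]."""
    z = h_prev + x                                              # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]           # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], z)]           # input gate
    c_tilde = [math.tanh(a) for a in affine(*params["C"], z)]   # candidate state
    o = [sigmoid(a) for a in affine(*params["o"], z)]           # output gate
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, c_tilde)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


# Usage with all-zero parameters: every gate equals 0.5 and the candidate state is 0,
# so the new cell state is simply half of the previous one.
zero_W = [[0.0] * 5 for _ in range(2)]          # hidden size 2, input size 3
params = {gate: (zero_W, [0.0, 0.0]) for gate in "fiCo"}
h, c = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
```

A stacked module would call `lstm_step` once per snapshot in temporal order, carrying `h` and `c` forward, and feed the hidden states of one cell as inputs to the next.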
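II-C Balanced Training Process
The $L_2$ distance, often applied in regression, can measure the similarity between two samples. But if we simply use it as the loss function in the proposed model, the cost may fail to converge to an expected range, or the model may overfit, due to the sparsity of the network. There are far more zero elements than nonzero elements in $A_t$, making the decoder tend to reconstruct zero elements. To address this sparsity problem, we should focus more on existing links than on nonexistent links in back propagation. We define a new loss function as
$$\mathcal{L} = \big\|(\hat{A}_t - A_t) \circ P\big\|_F^2 = \sum_{i,j} p_{i,j}^2 \big(\hat{a}_{t;i,j} - a_{t;i,j}\big)^2 \tag{6}$$
where $\circ$ denotes the Hadamard product. For each training process, $p_{i,j} = 1$ if $a_{t;i,j} = 0$ and $p_{i,j} = \beta > 1$ otherwise. Such a penalty matrix exerts more penalty on nonzero elements, so that the model can avoid overfitting to a certain extent. Finally, we use the mixed loss function
$$\mathcal{L}_m = \mathcal{L} + \alpha \mathcal{L}_{reg} \tag{7}$$
where $\mathcal{L}_{reg}$, defined in Eq. (8), is a regularizer to prevent the model from overfitting and $\alpha$ is a trade-off parameter.
$$\mathcal{L}_{reg} = \frac{1}{2} \sum_{k=1}^{K} \Big(\big\|W_e^{(k)}\big\|_F^2 + \big\|W_d^{(k)}\big\|_F^2\Big) + \frac{1}{2} \sum_{l=1}^{L} \Big(\big\|W_f^{(l)}\big\|_F^2 + \big\|W_i^{(l)}\big\|_F^2 + \big\|W_C^{(l)}\big\|_F^2 + \big\|W_o^{(l)}\big\|_F^2\Big) \tag{8}$$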
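As a rough illustration of the balanced loss, the following Python sketch computes a penalised squared error in which entries corresponding to existing links are scaled by a factor larger than 1. It assumes the loss is the squared Frobenius norm of the element-wise product of the prediction error and the penalty matrix; the function name `weighted_loss` and the parameter name `beta` are hypothetical.

```python
def weighted_loss(A_true, A_pred, beta):
    """Squared error with existing links (a == 1) scaled by beta > 1, others by 1."""
    total = 0.0
    for row_true, row_pred in zip(A_true, A_pred):
        for a, a_hat in zip(row_true, row_pred):
            p = beta if a == 1 else 1.0        # entry of the penalty matrix
            total += ((a_hat - a) * p) ** 2    # squared entry of the penalised error
    return total


A_true = [[0, 1], [0, 0]]
A_pred = [[0.5, 0.5], [0.0, 0.0]]
loss = weighted_loss(A_true, A_pred, beta=2.0)
```

In this toy case the two errors have the same magnitude (0.5), but the missed link contributes four times as much to the loss as the spurious one, which is the intended bias towards existing links.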
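The value of each element in $A_t$ is either 0 or 1. The output data, however, are not one-hot encoded. They are decimals and could theoretically go to infinity in either direction. In order to get a valid adjacency matrix, we impose a sigmoid function at the output layer and then set the values to 0 or 1 with 0.5 as the demarcation point. That is, there exists a link between $v_i$ and $v_j$ if $\hat{a}_{t;i,j} \geq 0.5$, and there is no link otherwise. To optimize the proposed model, we first make a forward propagation to obtain the loss and then perform back propagation to update all the parameters. In particular, the key operation is to calculate the partial derivatives of $\mathcal{L}_m$ with respect to the model parameters.
We take the calculation for the encoder weight $W_e^{(k)}$ as an example. Taking the partial derivative of Eq. (7) with respect to $W_e^{(k)}$, we have
$$\frac{\partial \mathcal{L}_m}{\partial W_e^{(k)}} = \frac{\partial \mathcal{L}}{\partial W_e^{(k)}} + \alpha \frac{\partial \mathcal{L}_{reg}}{\partial W_e^{(k)}} \tag{9}$$
According to Eq. (6), we can easily obtain
$$\frac{\partial \mathcal{L}}{\partial W_e^{(k)}} = \sum_{i,j} 2\, p_{i,j}^2 \big(\hat{a}_{t;i,j} - a_{t;i,j}\big) \frac{\partial \hat{a}_{t;i,j}}{\partial W_e^{(k)}} \tag{10}$$
To calculate $\partial \hat{A}_t / \partial W_e^{(k)}$, we iteratively take the partial derivative with respect to $W_e^{(k)}$ on both sides of Eq. (1). After getting $\partial \mathcal{L}_m / \partial W_e^{(k)}$, we update the weight by
$$W_e^{(k)} \leftarrow W_e^{(k)} - \lambda \frac{\partial \mathcal{L}_m}{\partial W_e^{(k)}} \tag{11}$$
where $\lambda$ is the learning rate, which is set to $10^{-3}$ in the following experiments.
For the decoder and LSTM weights, the calculation of the partial derivatives follows almost the same procedure, though it is a little more complicated for the weights in LSTM cells. This is because the recurrent network makes use of cell states at every forward propagation cycle.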
III Experiments
The proposed E-LSTM-D is evaluated on five benchmark datasets and compared with five baseline methods.
III-A Datasets
We perform the experiments on five real-world dynamic networks, all of which are human contact networks, where nodes denote humans and links stand for their contacts. The contacts could be face-to-face proximity, emailing and so on. Detailed descriptions of these datasets are given below.

contact [36]: A human contact dynamic network of face-to-face proximity. The data were collected through wireless devices carried by people. A link between person $i$ (source) and person $j$ (target) emerges along with a timestamp if $i$ gets in touch with $j$. The data are recorded every 20 seconds, and multiple edges may be shown at the same time if multiple contacts are observed in a given interval.

fbforum [39]: The data were obtained from a Facebook-like online forum of students at the University of California, Irvine, in 2004. It is an online social network where nodes are users and links represent interactions (e.g., messages) between them. The records span more than 5 months.

lkml [40]: The data were collected from the Linux kernel mailing list. The nodes represent users, identified by their email addresses, and each link denotes a reply from one user to another. We focus only on the 2210 users that were recorded from 2007-01-01 to 2007-04-01 and then construct a dynamic network based on the links between these users that appeared from 2007-04-01 to 2013-12-01.
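All the experiments are implemented on both long-term and short-term networks. The basic statistics of the five datasets are summarized in TABLE II.

TABLE II: Basic statistics of the five datasets.

| Dataset | $\lvert V \rvert$ | $\lvert E_T \rvert$ | Average degree | Maximum degree | Timespan (days) |
| --- | --- | --- | --- | --- | --- |
| contact | 274 | 28.2K | 206.2 | 2,092 | 4.0 |
| enron | 151 | 50.5K | 669.8 | 1,841 | 164.5 |
| radoslaw | 167 | 82.9K | 993.1 | 9,053 | 271.2 |
| fbforum | 899 | 50.5K | 669.8 | 5,177 | 164.5 |
| lkml | 2210 | 422.4K | 34.6 | 47,995 | 2,436.3 |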
Before training, we take snapshots of each dataset at a fixed interval and then sort them in ascending order of time. Considering that the connections between people are probably temporary, we remove the links that do not show up again in the following 8 intervals; the length of an interval may differ across datasets, depending on their timespans. To obtain enough samples, we split each dataset into 320 snapshots with different intervals and set $N = 10$. In this case, $\{G_{t-10}, \ldots, G_{t-1}, G_t\}$ is treated as a sample, with the first ten snapshots as the input and the last one as the output. As a result, we get 310 samples in total. We then group the first 230 samples, with $t$ varying from 11 to 240, as the training set, and the remaining 80 samples, with $t$ varying from 241 to 320, as the test set.
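The sample construction described above amounts to a sliding window over the snapshot sequence. The sketch below reproduces the counts stated in the text (320 snapshots, window of 10, 310 samples, 230/80 split); the function name `make_samples` is hypothetical, and integers stand in for the adjacency matrices so that the indices are easy to check.

```python
def make_samples(snapshots, window=10):
    """Pair each run of `window` consecutive snapshots with the snapshot that follows it."""
    samples = []
    for t in range(window, len(snapshots)):
        samples.append((snapshots[t - window:t], snapshots[t]))
    return samples


snapshots = list(range(1, 321))      # stand-ins for snapshots 1..320
samples = make_samples(snapshots, window=10)
train, test = samples[:230], samples[230:]   # targets 11..240 and 241..320
```

The split is chronological rather than random, so every test target lies strictly after all training targets.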
III-B Baseline Methods
To validate the effectiveness of our E-LSTM-D model, we compare it with node2vec, a widely used baseline network embedding method, as well as four state-of-the-art DNLP methods that can handle time dependencies: Temporal Network Embedding (TNE) [41], conditional temporal RBM (ctRBM) [29], Gradient boosting decision tree based Temporal RBM (GTRBM) [42] and Deep Dynamic Network Embedding (DDNE) [31]. The five baselines are introduced as follows.
node2vec [21]: As a network embedding method, it maps the nodes of a network from a high-dimensional space to a lower-dimensional vector space. A pair of nodes is more similar, i.e., more likely to be connected, if the corresponding vectors are closer to each other.

TNE [41]: It models network evolution as a Markov process and then uses matrix factorization to get the embedding vector of each node.

ctRBM [29]: It is a generative model based on temporal RBM. It first generates a vector for each node based on temporal connections and then predicts future linkages by integrating neighbor information.

GTRBM [42]: It takes advantage of both tRBM and GBDT to effectively learn the hidden dynamic patterns.

DDNE [31]: Similar to an autoencoder, it uses a GRU as an encoder to read historical information and decodes the concatenated embeddings of previous snapshots into the future network structure.
When implementing node2vec, we set the dimension of the embedding vector to 80 for contact, enron and radoslaw, which have fewer than 500 nodes. For fbforum and lkml, which are larger, we set the dimension to 256. We grid search over {0.5, 1, 1.5, 2} to find the optimal values of the hyperparameters $p$ and $q$, and then use Weighted-L2 [21] to obtain the vector $e^{(u,v)}$ for each pair of nodes $u$ and $v$, with each element defined as
$$e_i^{(u,v)} = \big|e_i^{u} - e_i^{v}\big|^2 \tag{12}$$
where $e_i^{u}$ and $e_i^{v}$ are the $i$-th elements of the embedding vectors of nodes $u$ and $v$, respectively. For TNE, we set the dimension to 80 for contact, enron and radoslaw and to 200 for fbforum and lkml. The parameters of ctRBM and GTRBM mainly concern the numbers of visible units and hidden units in tRBM. The number of visible units always equals the number of nodes of the corresponding network, and we set the dimension of the hidden layers to 128 for the smaller datasets, i.e., contact, enron and radoslaw, and to 256 for the rest. For DDNE, we set the dimension to 128 for the first three smaller datasets and to 512 for the rest. When implementing our proposed model, E-LSTM-D, we choose the parameters accordingly: for the first three smaller datasets, we set $K = 1$ and $L = 2$, and we add an additional layer to both the encoder and decoder ($K = 2$) for the remaining two larger datasets. The details of the parameters are given in TABLE III. Note that these parameters are chosen to get the best performance for each method, so as to make a fair comparison.
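The Weighted-L2 operator of Eq. (12) turns two node embeddings into one link feature vector by taking the element-wise squared difference. A minimal Python sketch follows; the function name `weighted_l2` and the toy vectors are illustrative only.

```python
def weighted_l2(emb_u, emb_v):
    """Element-wise squared difference of two node embeddings (Weighted-L2 operator)."""
    return [(a - b) ** 2 for a, b in zip(emb_u, emb_v)]


edge_feature = weighted_l2([1.0, 2.0, -1.0], [0.0, 2.0, 1.0])
```

The resulting vector has the same dimension as the node embeddings and is symmetric in the two nodes, so it carries no information about link direction.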
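TABLE III: The parameters of E-LSTM-D for the five datasets.

| Dataset | No. units in encoder | No. units in stacked LSTM | No. units in decoder |
| --- | --- | --- | --- |
| contact | 128 | 256, 256 | 274 |
| enron | 128 | 256, 256 | 151 |
| radoslaw | 128 | 256, 256 | 167 |
| fbforum | 512, 256 | 384, 384 | 256, 899 |
| lkml | 1024, 512 | 384, 384 | 512, 2210 |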

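TABLE IV: DNLP performance on AUC, GMAUC and Error Rate, averaged over the first 20 test samples and over all 80 test samples.

| Metric | Method | contact (20) | contact (80) | enron (20) | enron (80) | radoslaw (20) | radoslaw (80) | fbforum (20) | fbforum (80) | lkml (20) | lkml (80) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| AUC | node2vec | 0.5212 | 0.5126 | 0.7659 | 0.6806 | 0.6103 | 0.7676 | 0.5142 | 0.5095 | 0.6348 | 0.5892 |
| AUC | TNE | 0.9443 | 0.9297 | 0.8096 | 0.8314 | 0.8841 | 0.8801 | 0.9810 | 0.9749 | 0.9861 | 0.9867 |
| AUC | ctRBM | 0.9385 | 0.9109 | 0.8468 | 0.8295 | 0.8834 | 0.8590 | 0.8728 | 0.8349 | 0.8091 | 0.7729 |
| AUC | GTRBM | 0.9451 | 0.9327 | 0.8527 | 0.8491 | 0.9237 | 0.9104 | 0.9023 | 0.8749 | 0.8547 | 0.8329 |
| AUC | DDNE | 0.9347 | 0.9433 | 0.7985 | 0.7638 | 0.9027 | 0.8974 | 0.9238 | 0.8729 | 0.9328 | 0.9115 |
| AUC | E-LSTM-D | 0.9908 | 0.9893 | 0.8931 | 0.8734 | 0.9814 | 0.9782 | 0.9670 | 0.9650 | 0.9572 | 0.9553 |
| GMAUC | node2vec | 0.1805 | 0.1398 | 0.4069 | 0.5417 | 0.7241 | 0.7203 | 0.2744 | 0.2886 | 0.2309 | 0.2193 |
| GMAUC | TNE | 0.9083 | 0.8958 | 0.8233 | 0.7974 | 0.8282 | 0.8251 | 0.9689 | 0.9629 | 0.9839 | 0.9778 |
| GMAUC | ctRBM | 0.9126 | 0.8893 | 0.7207 | 0.6921 | 0.8004 | 0.7998 | 0.8926 | 0.8632 | 0.7723 | 0.7206 |
| GMAUC | GTRBM | 0.9240 | 0.9136 | 0.9148 | 0.8675 | 0.9157 | 0.8849 | 0.9329 | 0.9117 | 0.6529 | 0.6038 |
| GMAUC | DDNE | 0.8925 | 0.8684 | 0.8724 | 0.8476 | 0.8938 | 0.8724 | 0.9126 | 0.9023 | 0.7894 | 0.7809 |
| GMAUC | E-LSTM-D | 0.9940 | 0.9902 | 0.9077 | 0.8763 | 0.9956 | 0.9938 | 0.9926 | 0.9865 | 0.8657 | 0.8511 |
| Error Rate | node2vec | 44.7753 | 25.2278 | 23.9053 | 24.8060 | 20.7240 | 21.2489 | 40.5109 | 48.5376 | 53.2895 | 61.0274 |
| Error Rate | TNE | 13.1410 | 7.1556 | 23.1276 | 19.9167 | 16.7078 | 16.7175 | 19.1058 | 24.4350 | 18.5702 | 18.2091 |
| Error Rate | ctRBM | 1.8976 | 1.9046 | 2.4890 | 2.7328 | 1.8920 | 2.0937 | 3.4509 | 3.6782 | 2.9903 | 3.3089 |
| Error Rate | GTRBM | 1.5843 | 1.6953 | 1.5947 | 1.8836 | 1.9079 | 2.0031 | 2.2347 | 2.4396 | 2.5351 | 2.7942 |
| Error Rate | DDNE | 1.1780 | 1.6036 | 1.7664 | 1.9014 | 1.6316 | 1.5941 | 1.9014 | 1.8266 | 2.0134 | 2.2258 |
| Error Rate | E-LSTM-D | 0.4011 | 0.5735 | 0.9038 | 0.9880 | 0.3392 | 0.3938 | 0.5583 | 0.5777 | 0.9840 | 1.0093 |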
III-C Evaluation Metrics
There are few metrics specifically designed for the evaluation of DNLP. Usually, the evaluation metrics used in static link prediction are also employed for DNLP. The Area Under the ROC Curve (AUC) is commonly used to measure the performance of a dynamic link predictor. AUC equals the probability that the predictor gives a higher score to a randomly chosen existing link than to a randomly chosen nonexistent one. The predictor is considered more informative if its AUC value is closer to 1. Other measurements, such as precision, Mean Average Precision (MAP), F1-score and accuracy, evaluate link prediction methods from the perspective of binary classification. All of them suffer from the sparsity problem and cannot measure dynamic performance. The Area Under the Precision-Recall Curve (PR-AUC) [43], developed from AUC, is designed to deal with the sparsity of networks. However, the links removed in the near future, a significant aspect of DNLP, are not characterized by the PR curve, and thus PR-AUC may lose its effectiveness in this case. Junuthula et al. [44] restricted the measurements to only part of the node pairs and proposed the Geometric Mean of AUC and PR-AUC (GMAUC) for the added and removed links, which can better reflect dynamic performance. Li et al. [29] used SumD, which counts the differences between the predicted network and the true one, evaluating link prediction methods in a stricter way. But the absolute difference could be misleading. For example, suppose two dynamic link predictors both achieve a SumD of 5, but one mispredicts 5 links in 10 while the other mispredicts 5 in 100. The latter obviously performs better than the former, but SumD cannot tell them apart.
In our experiments, we choose AUC and GMAUC, and also define a new metric, Error Rate, to evaluate our E-LSTM-D model and the baseline methods.

AUC: If among $n$ independent comparisons, there are $n'$ times that the existing link gets a higher score than the nonexistent link and $n''$ times that they get the same score, then we have
$$AUC = \frac{n' + 0.5\, n''}{n} \tag{13}$$
Before calculation, we randomly sample nonexistent links, as many as there are existing links, to ease the impact of sparsity.

GMAUC: It is a metric specifically designed for measuring the performance of DNLP. It combines PR-AUC (the area under the Precision-Recall curve) and AUC by taking the geometric mean of the two quantities, which is defined as
$$GMAUC = \sqrt{\frac{PRAUC_{new} - \frac{L_A}{L_A + L_R}}{1 - \frac{L_A}{L_A + L_R}} \cdot 2\big(AUC_{prev} - 0.5\big)} \tag{14}$$
where $L_A$ and $L_R$ refer to the numbers of added and removed edges, respectively. $PRAUC_{new}$ is the PR-AUC value calculated among the new links and $AUC_{prev}$ represents the AUC for the observed links.

Error Rate: It is defined as the ratio of the number of mispredicted links, denoted by $N_{false}$, to the total number of truly existing links, denoted by $N_{true}$, which is represented by
$$\text{Error Rate} = \frac{N_{false}}{N_{true}} \tag{15}$$
Different from SumD, which only counts the absolute number of differing links in two graphs, Error Rate takes the number of truly existing links into consideration and thus avoids misleading comparisons.
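The AUC and Error Rate definitions above can be sketched in Python as follows. This is an illustration under assumptions rather than the evaluation code used in the experiments: the function names `sampled_auc` and `error_rate` are hypothetical, the AUC is estimated by drawing random (existing, nonexistent) pairs with replacement, and every matrix entry, including the diagonal, is treated as a candidate link.

```python
import random


def sampled_auc(scores, A_true, n=1000, seed=0):
    """Estimate AUC from n random comparisons between an existing and a nonexistent link."""
    rng = random.Random(seed)
    size = len(A_true)
    pos = [(i, j) for i in range(size) for j in range(size) if A_true[i][j] == 1]
    neg = [(i, j) for i in range(size) for j in range(size) if A_true[i][j] == 0]
    n_higher = n_equal = 0
    for _ in range(n):
        pi, pj = rng.choice(pos)
        ni, nj = rng.choice(neg)
        if scores[pi][pj] > scores[ni][nj]:
            n_higher += 1
        elif scores[pi][pj] == scores[ni][nj]:
            n_equal += 1
    return (n_higher + 0.5 * n_equal) / n


def error_rate(A_true, A_pred):
    """Number of mispredicted entries divided by the number of truly existing links."""
    n_false = sum(a != b for rt, rp in zip(A_true, A_pred) for a, b in zip(rt, rp))
    n_true = sum(a for row in A_true for a in row)
    return n_false / n_true


A_true = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
perfect_auc = sampled_auc(A_true, A_true)                 # scores equal to the truth
constant_auc = sampled_auc([[0.3] * 3] * 3, A_true)       # uninformative scores

# The SumD example from the text: 5 mispredictions out of 10 versus out of 100 links.
small = error_rate([[1] * 10], [[0] * 5 + [1] * 5])
large = error_rate([[1] * 100], [[0] * 5 + [1] * 95])
```

Both toy predictors in the last two lines have the same absolute number of errors, yet their Error Rates differ by a factor of ten, which is the distinction SumD cannot express.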
III-D Experimental Results
For each epoch, we feed 10 historical snapshots, $\{A_{t-10}, \ldots, A_{t-1}\}$, to E-LSTM-D and infer $A_t$. The same is done for the other four DNLP approaches. For methods that are not able to deal with time dependencies, i.e., node2vec, there are two typical treatments: 1) only using $G_{t-1}$ to infer $G_t$ [18]; or 2) aggregating the previous 10 snapshots into a single network and then performing link prediction [45, 31]. We choose the former when implementing node2vec, because the relatively long sequence of historical snapshots here may carry some disturbing information that node2vec cannot handle, leading to even poorer performance.
We compare our E-LSTM-D model with the five baseline methods on the performance metrics AUC, GMAUC and Error Rate. Since the patterns of network evolution may change with time, a model trained on historical data may not capture the pattern in the remote future. To investigate both short-term and long-term prediction performance, we report the average values of the three performance metrics for both the first 20 test samples and all 80 samples. The results are presented in TABLE IV, where we can see that, generally, the E-LSTM-D model outperforms all the baseline methods in almost all cases, whether the network is large or small, dense or sparse, for both short-term and long-term prediction. In particular, for the metrics AUC and GMAUC, the poor performance of node2vec indicates that methods designed for static networks are indeed not suitable for DNLP. By contrast, E-LSTM-D and the other DNLP baselines perform much better, owing to their dynamic nature.

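TABLE V: Error Rate on the top 10% important links, ranked by degree centrality (DC) and edge betweenness centrality (EBC), averaged over the first 20 test samples and over all 80 test samples.

| Importance | Method | contact (20) | contact (80) | enron (20) | enron (80) | radoslaw (20) | radoslaw (80) | fbforum (20) | fbforum (80) | lkml (20) | lkml (80) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DC | node2vec | 0.6279 | 0.6297 | 0.4900 | 0.4524 | 0.4735 | 0.5203 | 0.3873 | 0.3454 | 0.5034 | 0.5289 |
| DC | TNE | 0.9622 | 0.9551 | 0.3446 | 0.3315 | 0.5068 | 0.4413 | 0.0595 | 0.0558 | 0.6390 | 0.6288 |
| DC | ctRBM | 0.2739 | 0.3307 | 0.4193 | 0.4410 | 0.3028 | 0.3097 | 0.1095 | 0.1137 | 0.3291 | 0.3341 |
| DC | GTRBM | 0.2209 | 0.2390 | 0.4098 | 0.4322 | 0.2109 | 0.2198 | 0.1127 | 0.1239 | 0.2973 | 0.3030 |
| DC | DDNE | 0.1293 | 0.1359 | 0.2270 | 0.2133 | 0.0803 | 0.1249 | 0.1190 | 0.1088 | 0.1653 | 0.1821 |
| DC | E-LSTM-D | 0.0484 | 0.1109 | 0.2182 | 0.2096 | 0.0516 | 0.0761 | 0.0160 | 0.0222 | 0.1863 | 0.1992 |
| EBC | node2vec | 0.6747 | 0.6509 | 0.4607 | 0.5953 | 0.4657 | 0.4397 | 0.6517 | 0.6799 | 0.8729 | 0.8698 |
| EBC | TNE | 0.9998 | 0.9987 | 0.9598 | 0.9590 | 1.0000 | 1.0000 | 1.0000 | 0.9986 | 1.0000 | 0.9992 |
| EBC | ctRBM | 0.5396 | 0.5619 | 0.6512 | 0.7381 | 0.2165 | 0.2291 | 0.4432 | 0.4508 | 0.7279 | 0.7503 |
| EBC | GTRBM | 0.4418 | 0.4573 | 0.6906 | 0.7420 | 0.2399 | 0.2511 | 0.4507 | 0.4529 | 0.6370 | 0.6524 |
| EBC | DDNE | 0.2713 | 0.2849 | 0.4988 | 0.5471 | 0.2083 | 0.2508 | 0.2697 | 0.3014 | 0.6435 | 0.6614 |
| EBC | E-LSTM-D | 0.2004 | 0.2547 | 0.5067 | 0.6157 | 0.1617 | 0.2159 | 0.2643 | 0.2825 | 0.5820 | 0.6126 |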
Moreover, for each predicted snapshot, we also compare the predicted links with the truly existing ones to obtain the Error Rate. We find that node2vec tends to predict many more links than truly exist, leading to relatively large Error Rates. We argue that this may be due to the classification process, in that the pretrained linear regression model is not suitable for the classification of embedding vectors. As presented in TABLE IV, the results again demonstrate the best performance of our E-LSTM-D model on DNLP. TNE performs poorly on Error Rate, because it does not specifically fit the distribution of the network as the deep learning based methods do. The dramatic difference in Error Rate between E-LSTM-D and TNE indicates that this metric is a good addition to AUC for comprehensively measuring the performance of DNLP. Other deep learning based methods, like ctRBM and DDNE, perform similarly to each other but cannot compete with E-LSTM-D in most cases. It is worth noting that TNE outperforms the others on lkml in terms of the traditional AUC and GMAUC, which shows its robustness to the scale of networks on these metrics; however, it has a much larger Error Rate than the other DNLP methods.
For the 80 test samples with $G_{240+\Delta}$ as the output, where $\Delta$ varies from 1 to 80, we plot the DNLP performance of E-LSTM-D on the three metrics as functions of $\Delta$ for the five datasets, to see how far ahead it can predict network evolution with satisfactory performance. The results are shown in Fig. 3, where we can see that, generally, AUC and GMAUC decrease, while Error Rate increases, as $\Delta$ increases, indicating that long-term prediction of structure is indeed relatively difficult for most dynamic networks. Interestingly, for radoslaw, fbforum and lkml, the prediction performance is relatively stable, which might be because their network structures evolve periodically, making the collection of snapshots easy to predict, especially when LSTM is integrated in our deep learning framework. To further illustrate this, we investigate the changing trends of the most common structural properties, i.e., average degree and average clustering coefficient, of the five networks as $\Delta$ increases. The results are shown in Fig. 4, where we can see that these two properties change dramatically for contact and enron, while they are relatively stable for radoslaw, fbforum and lkml.
These results explain why we can make better long-term predictions on the last two dynamic networks.
As described above, although some methods perform excellently on AUC, they may still mispredict many links. In most real-world scenarios, however, we may care only about the most important links. Therefore, we further evaluate our model on the subset of links that are of particular significance in the network. Here, we use two metrics, degree centrality (DC) and edge betweenness centrality, to measure the importance of each link. DC is originally used to measure the importance of a node according to its number of neighbors. To measure the importance of a link, we use the sum of the degree centralities of its two terminal nodes (source and target). We then calculate the Error Rate when predicting the top 10% most important links. The results, presented in TABLE V, again demonstrate the outstanding performance of our ELSTMD model in predicting important links. They also show that the ELSTMD model is better at learning network features, i.e., degree distribution and edge betweenness, which may partly account for its effectiveness. Moreover, comparing TABLE IV and TABLE V, we find that, for every method, the Error Rates on the top 10% important links are much smaller than those on all links in the five networks. This indicates that the more important links are also easier to predict.
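The degree-based link ranking can be sketched as follows. This is an illustration under stated assumptions rather than our evaluation code: links are given as a list of unique undirected pairs, degree centrality is taken as degree divided by (n - 1), and `missed_rate` is a simplified stand-in that only counts important links absent from the prediction; it is not the Error Rate metric used in the tables. All names are hypothetical.

```python
import math


def link_importance_by_degree(edges):
    """Score each undirected link by the summed degree centrality of its endpoints."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    norm = max(len(degree) - 1, 1)  # degree centrality = degree / (n - 1)
    return {(u, v): (degree[u] + degree[v]) / norm for u, v in edges}


def top_fraction(scores, fraction=0.1):
    """Return the highest-scoring links; at least one link is kept if any exist."""
    k = max(1, math.ceil(fraction * len(scores)))
    # Ties are broken by the link tuple so that the ranking is deterministic.
    ranked = sorted(scores, key=lambda e: (-scores[e], e))
    return ranked[:k]


def missed_rate(important_links, predicted_links):
    """Share of important links that the prediction failed to recover."""
    if not important_links:
        return 0.0
    predicted = {frozenset(e) for e in predicted_links}
    missed = sum(1 for e in important_links if frozenset(e) not in predicted)
    return missed / len(important_links)


# Toy network (hypothetical data): node 0 is a hub.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (0, 5),
         (1, 2), (2, 3), (3, 4), (5, 6), (6, 7)]
scores = link_importance_by_degree(edges)
important = top_fraction(scores, fraction=0.1)
rate = missed_rate(important, predicted_links=[(2, 0), (3, 4)])
```

Because the normalization constant is shared by all nodes, ranking links by summed degree centrality is equivalent to ranking them by the summed degrees of their endpoints.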
III-E Beyond Link Prediction
Our ELSTMD model learns a low-dimensional representation for each node in the process of link prediction. These vectors, like those generated by other network embedding methods, contain local or global structural information that can be used in other tasks such as node classification. To illustrate this, we conduct an experiment on the karate club dataset, with the network structure shown in Fig. 5 (a). We first obtain by randomly removing 10 links from the original network and then use it to predict the original network . After training, we use the output of the stacked LSTM as the input to the visualization method t-SNE [46]. Besides obtaining excellent performance on link prediction, we also visualize the embedding vectors, as shown in Fig. 5 (b), where we can see that nodes of the same class are close to each other while those of different classes are relatively far apart. This indicates that the embedding vectors obtained by our ELSTMD model on link prediction can also be used to effectively solve the node classification problem, validating the outstanding transferability of the model.
III-F Parameter Sensitivity
The performance of our ELSTMD model is mainly determined by three factors: the structure of the model, the length of historical snapshots, and the penalty coefficient. In the following, we investigate their influence on the model performance.
III-F1 Influence of the model’s structure
The results shown in TABLE IV are obtained by the models with the selected structures. The number of units in each layer and the number of layers are set with both computational complexity and model performance in mind. We test the model with different numbers of units and encoder layers to verify the validity of these structures. Fig. 6 shows that the performance drops slightly as the number of units in the first encoder layer is reduced, while further increasing the complexity contributes little to the performance and may even lead to worse results. TABLE VI reports the performance difference between the original model and a model with an additional encoder layer that shares the structure of the previous layer. The results show no significant improvement in AUC and GMAUC with the additional layer, although the increased model complexity could actually lower the Error Rates. Overall, the general structure of ELSTMD can achieve state-of-the-art performance in most cases.
Dataset   AUC     GMAUC   Error Rate
contact   0.0038  0.0024  0.1037
enron     0.0119  0.0206  0.0397
radoslaw  0.0035  0.0054  0.0920
fbforum   0.0029  0.0011  0.1033
lkml      0.0079  0.0108  0.1375
III-F2 Influence of historical snapshot length
Usually, a longer sequence of historical snapshots contains more information and thus may improve link prediction performance. On the other hand, snapshots from long ago might have little influence on the current snapshot, while more historical snapshots increase the computational complexity. Therefore, it is necessary to find a proper length to balance efficiency and performance. We thus vary the length of historical snapshots from 5 to 25 at a regular interval of 5. The results are shown in Fig. 7 (a), which indicates that more historical snapshots can indeed improve the performance of our model, i.e., lead to larger AUC and GMAUC and smaller Error Rate. Moreover, it seems that AUC and GMAUC increase most when the length changes from 1 to 10, while Error Rate decreases most when it changes from 1 to 20. Thereafter, for most dynamic networks, these metrics remain almost the same as the length further increases. This phenomenon suggests us to choose in the previous experiments.
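To make the role of the history length concrete, the sketch below builds training samples with a sliding window, pairing a run of consecutive snapshots with the snapshot that follows it. It assumes this windowing scheme and uses a hypothetical helper `make_samples` with placeholder snapshot labels; it is an illustration, not our data pipeline.

```python
def make_samples(snapshots, history_length):
    """Pair each run of `history_length` consecutive snapshots with the next snapshot."""
    if history_length < 1:
        raise ValueError("history_length must be at least 1")
    samples = []
    for end in range(history_length, len(snapshots)):
        history = snapshots[end - history_length:end]  # model input
        target = snapshots[end]                        # snapshot to predict
        samples.append((history, target))
    return samples


# Placeholder snapshots (hypothetical): in practice each item is an adjacency matrix.
snapshots = [f"G{t}" for t in range(1, 31)]
counts = {n: len(make_samples(snapshots, n)) for n in (5, 10, 15, 20, 25)}
```

With a fixed number of snapshots, a longer history leaves fewer samples and makes each sample larger, which is the efficiency side of the trade-off described above.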
III-F3 Influence of the penalty coefficient
The penalty coefficient is applied in the objective to avoid overfitting and accelerate convergence. When , the objective simply equals to distance. In practice, the coefficient is usually larger than 1 to help the model focus more on existing links in the training process. As shown in Fig. 7 (b), the performance is relatively stable as the coefficient varies. However, for some datasets, increasing the penalty coefficient can lead to slightly larger GMAUC and smaller Error Rate, while it has little effect on AUC. As the coefficient increases further, both GMAUC and Error Rate remain relatively stable. These suggest us to choose a relatively small , i.e., in the experiments, varying for different datasets to obtain the optimal results.
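The effect of the penalty coefficient can be illustrated with a small weighted reconstruction error. The form below is an assumption made for illustration: each entry's error is multiplied by `beta` when the true entry is an existing link and by 1 otherwise, and the weighted errors are squared and summed. The function name, the parameter name `beta`, and the toy matrices are hypothetical.

```python
def penalized_squared_error(predicted, target, beta=2.0):
    """Sum of squared errors, with errors on existing links scaled by `beta`."""
    total = 0.0
    for row_p, row_t in zip(predicted, target):
        for p, a in zip(row_p, row_t):
            weight = beta if a == 1 else 1.0  # existing links get the larger weight
            total += ((p - a) * weight) ** 2
    return total


# Toy 2-node example (hypothetical data): one undirected link between the nodes.
target = [[0, 1],
          [1, 0]]
predicted = [[0.2, 0.6],
             [0.9, 0.1]]
plain = penalized_squared_error(predicted, target, beta=1.0)
weighted = penalized_squared_error(predicted, target, beta=2.0)
```

With `beta` equal to 1 the weights vanish and the value is the plain sum of squared differences. With `beta` larger than 1, the poorly predicted existing link dominates the total, which is how the objective pushes the model to focus on existing links in sparse networks.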
IV Conclusion
In this paper, we propose a new deep learning model, namely ELSTMD, for DNLP. Specifically, to predict future links, we design an end-to-end model that integrates a stacked LSTM into the encoder-decoder architecture and can make full use of historical information. The proposed model learns not only the low-dimensional representations and nonlinearity but also the time dependencies between successive network snapshots; as a result, it can better capture the patterns of network evolution. To cope with the problem of sparsity, we impose a larger penalty on existing links in the objective, which also helps to preserve local structure and accelerate convergence. Empirically, we conduct extensive experiments to compare our model with traditional link prediction methods on a variety of datasets. The results demonstrate that our model outperforms the others and achieves state-of-the-art performance. Moreover, we show that the latent features generated by our model in link prediction can characterize well the global and local structure of the nodes in a network and thus may also benefit other tasks, such as node classification.
Our future research will focus on predicting the evolution of layered dynamic networks. In addition, we will work to reduce the computational complexity of our ELSTMD model to make it suitable for large-scale networks. We will also study the transferability of our model across various tasks by conducting more comprehensive experiments.
References
 [1] D. Ediger, K. Jiang, J. Riedy, D. A. Bader, and C. Corley, “Massive social network analysis: Mining twitter for social good,” in Parallel Processing (ICPP), 2010 39th International Conference on. IEEE, 2010, pp. 583–593.
 [2] C. Fu, J. Wang, Y. Xiang, Z. Wu, L. Yu, and Q. Xuan, “Pinning control of clustered complex networks with different size,” Physica A: Statistical Mechanics and its Applications, vol. 479, pp. 184–192, 2017.
 [3] L. Wang and J. Orchard, “Investigating the evolution of a neuroplasticity network for learning,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018, doi:10.1109/TSMC.2017.2755066.
 [4] J. Gao, Y. Xiao, J. Liu, W. Liang, and C. P. Chen, “A survey of communication/networking in smart grids,” Future Generation Computer Systems, vol. 28, no. 2, pp. 391 – 404, 2012.
 [5] M. Kazemilari and M. A. Djauhari, “Correlation network analysis for multidimensional data in stocks market,” Physica A: Statistical Mechanics and its Applications, vol. 429, pp. 62–75, 2015.
 [6] J. Sun, Y. Yang, N. N. Xiong, L. Dai, X. Peng, and J. Luo, “Complex network construction of multivariate time series using information geometry,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 49, no. 1, pp. 107–122, Jan 2019.
 [7] H. Liu, X. Xu, J.-A. Lu, G. Chen, and Z. Zeng, “Optimizing pinning control of complex dynamical networks based on spectral properties of grounded laplacian matrices,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, 2018, doi:10.1109/TSMC.2018.2882620.
 [8] N. M. A. Ibrahim and L. Chen, “Link prediction in dynamic social networks by integrating different types of information,” Applied Intelligence, vol. 42, no. 4, pp. 738–750, 2015.
 [9] Q. Xuan, H. Fang, C. Fu, and V. Filkov, “Temporal motifs reveal collaboration patterns in online task-oriented networks,” Physical Review E, vol. 91, no. 5, p. 052813, 2015.
 [10] Q. Xuan, Z.-Y. Zhang, C. Fu, H.-X. Hu, and V. Filkov, “Social synchrony on complex networks,” IEEE transactions on cybernetics, vol. 48, no. 5, pp. 1420–1431, 2018.
 [11] Q. Xuan, M. Zhou, Z.-Y. Zhang, C. Fu, Y. Xiang, Z. Wu, and V. Filkov, “Modern food foraging patterns: Geography and cuisine choices of restaurant patrons on yelp,” IEEE Transactions on Computational Social Systems, vol. 5, no. 2, pp. 508–517, 2018.
 [12] C. Fu, M. Zhao, L. Fan, X. Chen, J. Chen, Z. Wu, Y. Xia, and Q. Xuan, “Link weight prediction using supervised learning methods and its application to yelp layered network,” IEEE Transactions on Knowledge and Data Engineering, 2018.
 [13] H. H. Lentz, A. Koher, P. Hövel, J. Gethmann, C. Sauter-Louis, T. Selhorst, and F. J. Conraths, “Disease spread through animal movements: a static and temporal network analysis of pig trade in germany,” PloS one, vol. 11, no. 5, p. e0155196, 2016.
 [14] A. Theocharidis, S. Van Dongen, A. J. Enright, and T. C. Freeman, “Network visualization and analysis of gene expression data using biolayout express 3d,” Nature protocols, vol. 4, no. 10, p. 1535, 2009.
 [15] M. E. Newman, “Clustering and preferential attachment in growing networks,” Physical review E, vol. 64, no. 2, p. 025102, 2001.
 [16] T. Zhou, L. Lü, and Y.-C. Zhang, “Predicting missing links via local information,” The European Physical Journal B, vol. 71, no. 4, pp. 623–630, 2009.
 [17] L. Lü and T. Zhou, “Link prediction in complex networks: A survey,” Physica A: statistical mechanics and its applications, vol. 390, no. 6, pp. 1150–1170, 2011.
 [18] L. Yao, L. Wang, L. Pan, and K. Yao, “Link prediction based on common-neighbors for dynamic social network,” Procedia Computer Science, vol. 83, pp. 82–89, 2016.
 [19] Z. Zhang, J. Wen, L. Sun, Q. Deng, S. Su, and P. Yao, “Efficient incremental dynamic link prediction algorithms in social network,” Knowledge-Based Systems, vol. 132, pp. 226–235, 2017.
 [20] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 701–710.
 [21] A. Grover and J. Leskovec, “node2vec: Scalable feature learning for networks,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016, pp. 855–864.
 [22] Z. Han, Z. Liu, C.-M. Vong, Y.-S. Liu, S. Bu, J. Han, and C. P. Chen, “Deep spatiality: Unsupervised learning of spatially-enhanced global and local 3d features by deep neural network with coupled softmax,” IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 3049–3063, 2018.
 [23] Q. Xuan, B. Fang, Y. Liu, J. Wang, J. Zhang, Y. Zheng, and G. Bao, “Automatic pearl classification machine based on a multistream convolutional neural network,” IEEE Transactions on Industrial Electronics, vol. 65, no. 8, pp. 6538–6547, 2018.
 [24] Q. Xuan, H. Xiao, C. Fu, and Y. Liu, “Evolving convolutional neural network and its application in fine-grained visual categorization,” IEEE Access, 2018.
 [25] D. Wang, P. Cui, and W. Zhu, “Structural deep network embedding,” in Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2016, pp. 1225–1234.
 [26] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint arXiv:1609.02907, 2016.
 [27] N. M. Ahmed, L. Chen, Y. Wang, B. Li, Y. Li, and W. Liu, “Sampling-based algorithm for link prediction in temporal networks,” Information Sciences, vol. 374, pp. 1–14, 2016.
 [28] N. M. Ahmed and L. Chen, “An efficient algorithm for link prediction in temporal uncertain social networks,” Information Sciences, vol. 331, pp. 120–136, 2016.
 [29] X. Li, N. Du, H. Li, K. Li, J. Gao, and A. Zhang, “A deep learning approach to link prediction in dynamic networks,” in Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM, 2014, pp. 289–297.
 [30] L. Zhou, Y. Yang, X. Ren, F. Wu, and Y. Zhuang, “Dynamic Network Embedding by Modelling Triadic Closure Process,” in AAAI, 2018.
 [31] T. Li, J. Zhang, P. S. Yu, Y. Zhang, and Y. Yan, “Deep dynamic network embedding for link prediction,” IEEE Access, 2018.
 [32] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
 [33] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual prediction with lstm,” 1999.
 [34] Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural networks for sequence learning,” arXiv preprint arXiv:1506.00019, 2015.
 [35] Z. Han, M. Shang, Z. Liu, C.-M. Vong, Y.-S. Liu, M. Zwicker, J. Han, and C. P. Chen, “Seqviews2seqlabels: Learning 3d global features via aggregating sequential views by rnn with attention,” IEEE Transactions on Image Processing, vol. 28, no. 2, pp. 658–672, 2019.
 [36] “Haggle network dataset – KONECT,” Apr. 2017. [Online]. Available: http://konect.unikoblenz.de/networks/contact
 [37] R. A. Rossi and N. K. Ahmed, “The network data repository with interactive graph analytics and visualization,” in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015. [Online]. Available: http://networkrepository.com
 [38] R. Michalski, S. Palus, and P. Kazienko, “Matching organizational structure and social network extracted from email communication,” in Lecture Notes in Business Information Processing, vol. 87. Springer Berlin Heidelberg, 2011, pp. 197–206.
 [39] “Facebook wall posts network dataset – KONECT,” Apr. 2017. [Online]. Available: http://konect.unikoblenz.de/networks/facebookwosnwall
 [40] “Linux kernel mailing list replies network dataset – KONECT,” Apr. 2017. [Online]. Available: http://konect.unikoblenz.de/networks/lkmlreply
 [41] L. Zhu, D. Guo, J. Yin, G. Ver Steeg, and A. Galstyan, “Scalable temporal latent space inference for link prediction in dynamic social networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 10, pp. 2765–2777, 2016.
 [42] T. Li, B. Wang, Y. Jiang, Y. Zhang, and Y. Yan, “Restricted boltzmann machine-based approaches for link prediction in dynamic networks,” IEEE Access, 2018.
 [43] Y. Yang, R. N. Lichtenwalter, and N. V. Chawla, “Evaluating link prediction methods,” Knowledge and Information Systems, vol. 45, no. 3, pp. 751–782, 2015.
 [44] R. R. Junuthula, K. S. Xu, and V. K. Devabhaktuni, “Evaluating link prediction accuracy in dynamic networks with added and removed edges,” in Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom)(BDCloud-SocialCom-SustainCom), 2016 IEEE International Conferences on. IEEE, 2016, pp. 377–384.
 [45] G. H. Nguyen, J. B. Lee, R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim, “Continuous-time dynamic network embeddings,” in 3rd International Workshop on Learning Representations for Big Networks (WWW BigNet), 2018.
 [46] L. Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.